10Gbit/s full TX wirespeed smallest packet size on a single CPU core (netoptimizer.blogspot.com)
107 points by eloycoto on Feb 24, 2015 | 13 comments


How typical is this benchmark for a single CPU? If this is (approximately) cutting edge on the software side, then it's a pretty big deal, because cutting-edge hardware can handle 100Gbit/s Ethernet with 33Gbit/s out of each transceiver [1], which would mean the CPU is the bottleneck!

[1] http://www.xilinx.com/products/silicon-devices/fpga/virtex-u...


In the test case in the article they are sending the smallest possible Ethernet frame, which occupies 84 bytes on the wire. Further, the caveat here is that they can do that on a single CPU only in the kernel layer. Once you involve userspace at the same packet size you need 11 CPUs to drive it.

If you're using a more realistic payload (e.g. larger packets) you should be able to scale it further, e.g. to 40Gbit.
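
To put rough numbers on that (mine, not from the article, using standard Ethernet framing): a core that keeps up with minimum-size frames at 10Gbit/s is handling ~14.88 Mpps, while 40Gbit/s of full-size frames works out to only ~3.25 Mpps, so the per-packet load is actually far lower.

    # Quick packet-rate math: a minimum-size frame occupies 84 bytes on the
    # wire once preamble and inter-frame gap are counted; a full
    # 1500-byte-payload frame occupies 1538 bytes on the wire.

    def pps(link_bps, wire_bytes):
        return link_bps / (wire_bytes * 8)

    print(f"10G, min frames:  {pps(10e9, 84) / 1e6:.2f} Mpps")    # ~14.88 Mpps
    print(f"40G, full frames: {pps(40e9, 1538) / 1e6:.2f} Mpps")  # ~3.25 Mpps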

The bottleneck in this ongoing story has been the CPU... and that's because of how the software was written. So the issues have been packet scheduling, synchronization (locking overhead), and high-level software (the TCP layer and packet filtering).

Here are more details about the original problem and the test case: https://lwn.net/Articles/629155


Thanks for the link, that's exactly the context that I needed.

Some tl;dr highlights:

* the 84B packets are designed to simulate the latency characteristics of 100Gbit/s ethernet on 10Gbit/s hardware

* 2 cache misses are enough to blow the time budget for processing a packet (quick math below)

* The overhead of a single syscall under SELinux is itself enough to blow the time budget

* Big picture strategy: batch packets, allocs, etc
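
To make the time budget concrete (my arithmetic from the figures above; the cache-miss cost is an order-of-magnitude assumption, not a number from the article):

    # Per-packet time budget at 10GbE wirespeed with minimum-size frames
    # (84 bytes on the wire), compared against a rough main-memory miss cost.

    link_bps = 10e9
    wire_bytes = 84
    cache_miss_ns = 32  # assumed order of magnitude; varies by machine

    pps = link_bps / (wire_bytes * 8)   # ~14.88 Mpps
    budget_ns = 1e9 / pps               # ~67.2 ns per packet

    print(f"{pps / 1e6:.2f} Mpps -> {budget_ns:.1f} ns per packet")
    print(f"two cache misses ~= {2 * cache_miss_ns} ns, i.e. the budget is gone")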


> they are sending the smallest possible Ethernet frame, which occupies 84 bytes on the wire

It's impressive, but I can't see its application. Does that mean that it will scale well to many different types of traffic? Are there extremely low-latency applications that need these tiny frames and yet want to use 10GbE?

> If you're using a more realistic payload (e.g. larger packets) you should be able to scale it further, e.g. to 40Gbit

Yes, I would think so. I was able to saturate four 10GbE links simultaneously using 9000-byte frames, across four separate CPU cores spanning two memory nodes. That was on ~2010 hardware running Linux 2.6.27.


Most of the processing overhead is per-packet rather than per-byte so they're testing the worst-case scenario by minimizing the size of the packets and frames, thereby maximizing the number of packets and frames. It doesn't have much direct applicability, but it does mean that the optimizations will be sufficient to achieve wire speed for any traffic pattern.


> It's impressive, but I can't see its application.

DDoS.

Packets-per-second attacks that you cannot filter at the edge in hardware (e.g. on routers) can be quite difficult to handle in some exceptional cases. Being able to process line-rate pps is a pretty nice thing to have.

Thankfully you can usually filter these out upstream, but it's nice to have the spare capacity to absorb traffic while you figure out exactly what you need to filter. Plus you (generally) will never be able to get a filter 100% accurate, and will usually be letting some percentage of an attack through to hit your webservers. For some sites, these attacks can be in the hundreds of gigabits per second range - even 10% of that getting through can be significant.
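
To give a rough sense of scale (my numbers, assuming a worst-case flood of minimum-size frames and an illustrative 300Gbit/s attack):

    # Even the fraction that leaks past a filter is a huge packet rate if the
    # attack uses minimum-size frames (84 bytes on the wire).

    attack_bps = 300e9      # illustrative attack size
    leak_fraction = 0.10    # "10% of that getting through"
    wire_bytes = 84

    leaked_pps = attack_bps * leak_fraction / (wire_bytes * 8)
    print(f"~{leaked_pps / 1e6:.0f} Mpps still reaching the servers")  # ~45 Mpps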


>The bottleneck in this ongoing story has been the CPU

How is CPU the bottleneck if what can be achieved with 11 CPUs can also be achieved with 1 CPU just by changing the software? If what you say is accurate, then kernel/userspace software is the bottleneck.


TLB flushes and most other context switch costs are still CPU performance issues even if they're not an instructions-per-second issue.


But that's only because the software is performing unnecessary actions/computations, it's not an inherent limitation of the CPU device.


Modern x86 (Intel) boxes can handle ~500Gbit/s of 128-byte packets with DPDK. That FPGA can do ~4.2Tbps, in theory.
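
In packet terms (my conversion; the exact figure depends on whether you count the ~20 bytes of preamble and inter-frame gap each packet adds on the wire):

    # Convert ~500 Gbit/s of 128-byte packets into a packet rate, two ways.

    link_bps = 500e9
    pkt_bytes = 128
    wire_overhead = 20  # preamble/SFD + inter-frame gap

    print(f"{link_bps / (pkt_bytes * 8) / 1e6:.0f} Mpps (packet bytes only)")          # ~488 Mpps
    print(f"{link_bps / ((pkt_bytes + wire_overhead) * 8) / 1e6:.0f} Mpps (on-wire)")  # ~422 Mpps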


Impressive. Could be useful out of the box for cheap 10GbE load-testing.

I'm also curious how you guys tested that you've actually achieved line speed. In my book, line speed means that you don't have millisecond-level gaps between packets from time to time, and that is fairly difficult both to test and to achieve on a stock kernel.
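
For what it's worth, this is the kind of check I mean (a minimal sketch of my own; it assumes you already have per-packet capture timestamps in nanoseconds and just scans for gaps that line rate could never produce):

    # At 10GbE wirespeed with minimum-size frames a packet leaves roughly every
    # 67.2 ns, so millisecond-level gaps mean the sender fell off line rate.

    def find_stalls(timestamps_ns, threshold_ns=1_000_000):  # flag gaps > 1 ms
        stalls = []
        for i in range(1, len(timestamps_ns)):
            gap = timestamps_ns[i] - timestamps_ns[i - 1]
            if gap > threshold_ns:
                stalls.append((i, gap))
        return stalls

    # Synthetic example: steady 67.2 ns spacing with one 2 ms stall injected.
    ts = [i * 67.2 for i in range(10_000)]
    ts[5_000:] = [t + 2_000_000 for t in ts[5_000:]]
    for idx, gap in find_stalls(ts):
        print(f"gap of {gap / 1e6:.2f} ms before packet {idx}")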


> Could be useful out of the box for cheap 10Gbe load-testing.

Honest question: why not use DPDK / netmap for this?


So, according to this, with 9000-byte MTUs, I could drive 4x40Gbit (not a full 160Gbit, since the oversubscribed cards top out around 128Gbit of PCIe bandwidth) using about 2.9% of my CPU on a quad core (assuming the same CPU speed; mine is probably about a third faster than theirs, too).

That's pretty sexy.
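
Roughly how that estimate works out (my arithmetic; it assumes the per-packet cost from the article's single-core kernel result scales linearly, and uses the PCIe-limited 128Gbit figure):

    # Back-of-the-envelope: how much of a quad core it takes to push 128 Gbit/s
    # of 9000-byte frames, given a single core manages ~14.88 Mpps of
    # minimum-size frames in the article's kernel-level test.

    single_core_pps = 14.88e6   # article's single-core TX rate
    link_bps = 128e9            # PCIe-limited aggregate of the 4x40Gbit cards
    frame_bytes = 9000          # jumbo frames

    pps_needed = link_bps / (frame_bytes * 8)
    cores_needed = pps_needed / single_core_pps

    print(f"{pps_needed / 1e6:.2f} Mpps needed")            # ~1.78 Mpps
    print(f"{cores_needed * 100:.1f}% of one core")         # ~12%
    print(f"{cores_needed / 4 * 100:.1f}% of a quad core")  # ~3%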



