10Gbit/s full TX wirespeed smallest packet size on a single CPU core (netoptimizer.blogspot.com)
107 points by eloycoto on Feb 24, 2015 | 13 comments


How typical is this benchmark for a single CPU? If this is (approximately) cutting edge on the software side, then it's a pretty big deal, because cutting-edge hardware can handle 100Gbit/s Ethernet with 33Gbit/s out of each transceiver [1], which would mean the CPU is the bottleneck!

[1] http://www.xilinx.com/products/silicon-devices/fpga/virtex-u...


In the test case in the article they are sending the smallest possible Ethernet frame, which occupies 84 bytes on the wire. Further, the caveat here is that they can do that on a single CPU only in the kernel layer. Once you involve userspace at the same packet size you need 11 CPUs to drive it.

If you're using a more realistic payload (e.g. larger packets) you should be able to scale it further, e.g. to 40Gbit.
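
To put rough numbers on that (mine, not from the article, using standard Ethernet framing): a core that keeps up with minimum-size frames at 10Gbit/s is handling ~14.88 Mpps, while 40Gbit/s of full-size frames works out to only ~3.25 Mpps, so the per-packet load is actually far lower.

    # Quick packet-rate math: a minimum-size frame occupies 84 bytes on the
    # wire once preamble and inter-frame gap are counted; a full
    # 1500-byte-payload frame occupies 1538 bytes on the wire.

    def pps(link_bps, wire_bytes):
        return link_bps / (wire_bytes * 8)

    print(f"10G, min frames:  {pps(10e9, 84) / 1e6:.2f} Mpps")    # ~14.88 Mpps
    print(f"40G, full frames: {pps(40e9, 1538) / 1e6:.2f} Mpps")  # ~3.25 Mpps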

The bottleneck in this ongoing story has been the CPU... and that's because of how the software was written. So the issues have been packet scheduling, synchronization (locking overhead), and high-level software (the TCP layer and packet filtering).

Here are more details about the original problem and the test case: https://lwn.net/Articles/629155


Thanks for the link, that's exactly the context that I needed.

Some tl;dr highlights:

* the 84B packets are designed to simulate the latency characteristics of 100Gbit/s ethernet on 10Gbit/s hardware

* 2 cache misses are enough to blow the time budget for processing a packet (quick math below)

* The overhead of a single syscall under SELinux is itself enough to blow the time budget

* Big picture strategy: batch packets, allocs, etc
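
To make the time budget concrete (my arithmetic from the figures above; the cache-miss cost is an order-of-magnitude assumption, not a number from the article):

    # Per-packet time budget at 10GbE wirespeed with minimum-size frames
    # (84 bytes on the wire), compared against a rough main-memory miss cost.

    link_bps = 10e9
    wire_bytes = 84
    cache_miss_ns = 32  # assumed order of magnitude; varies by machine

    pps = link_bps / (wire_bytes * 8)   # ~14.88 Mpps
    budget_ns = 1e9 / pps               # ~67.2 ns per packet

    print(f"{pps / 1e6:.2f} Mpps -> {budget_ns:.1f} ns per packet")
    print(f"two cache misses ~= {2 * cache_miss_ns} ns, i.e. the budget is gone")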


> they are sending the smallest possible Ethernet frame, which occupies 84 bytes on the wire

It's impressive, but I can't see its application. Does that mean that it will scale well to many different types of traffic? Are there extremely low-latency applications that need these tiny frames and yet want to use 10GbE?

> If you're using a more realistic payload (e.g. larger packets) you should be able to scale it further, e.g. to 40Gbit

Yes, I would think so. I was able to saturate four 10GbE links simultaneously using 9000-byte frames, across four separate CPU cores spanning two memory nodes. That was on ~2010 hardware running Linux 2.6.27.


Most of the processing overhead is per-packet rather than per-byte so they're testing the worst-case scenario by minimizing the size of the packets and frames, thereby maximizing the number of packets and frames. It doesn't have much direct applicability, but it does mean that the optimizations will be sufficient to achieve wire speed for any traffic pattern.


> It's impressive, but I can't see its application.

DDoS.

Packets-per-second attacks that you cannot filter at the edge in hardware (e.g. on routers) can be quite difficult to handle in some exceptional cases. Being able to process line-rate pps is a pretty nice thing to have.

Thankfully you can usually filter these out upstream, but it's nice to have the spare capacity to absorb traffic while you figure out exactly what you need to filter. Plus you (generally) will never be able to get a filter 100% accurate, and will usually be letting some percentage of an attack through to hit your webservers. For some sites, these attacks can be in the hundreds of gigabits per second range - even 10% of that getting through can be significant.
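
To give a rough sense of scale (my numbers, assuming a worst-case flood of minimum-size frames and an illustrative 300Gbit/s attack):

    # Even the fraction that leaks past a filter is a huge packet rate if the
    # attack uses minimum-size frames (84 bytes on the wire).

    attack_bps = 300e9      # illustrative attack size
    leak_fraction = 0.10    # "10% of that getting through"
    wire_bytes = 84

    leaked_pps = attack_bps * leak_fraction / (wire_bytes * 8)
    print(f"~{leaked_pps / 1e6:.0f} Mpps still reaching the servers")  # ~45 Mpps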


>The bottleneck in this ongoing story has been the CPU

How is CPU the bottleneck if what can be achieved with 11 CPUs can also be achieved with 1 CPU just by changing the software? If what you say is accurate, then kernel/userspace software is the bottleneck.


TLB flushes and most other context switch costs are still CPU performance issues even if they're not an instructions-per-second issue.


But that's only because the software is performing unnecessary actions/computations, it's not an inherent limitation of the CPU device.


Modern x86 (Intel) boxes can handle ~500Gbit/s of 128-byte packets with DPDK. That FPGA can do ~4.2Tbps, in theory.
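
In packet terms (my conversion; the exact figure depends on whether you count the ~20 bytes of preamble and inter-frame gap each packet adds on the wire):

    # Convert ~500 Gbit/s of 128-byte packets into a packet rate, two ways.

    link_bps = 500e9
    pkt_bytes = 128
    wire_overhead = 20  # preamble/SFD + inter-frame gap

    print(f"{link_bps / (pkt_bytes * 8) / 1e6:.0f} Mpps (packet bytes only)")          # ~488 Mpps
    print(f"{link_bps / ((pkt_bytes + wire_overhead) * 8) / 1e6:.0f} Mpps (on-wire)")  # ~422 Mpps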


Impressive. Could be useful out of the box for cheap 10GbE load-testing.

I'm also curious how you guys tested that you've actually achieved line speed. In my book, line speed means that you don't have millisecond-level gaps between packets from time to time, and that is fairly difficult both to test and to achieve on a stock kernel.
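
For what it's worth, this is the kind of check I mean (a minimal sketch of my own; it assumes you already have per-packet capture timestamps in nanoseconds and just scans for gaps that line rate could never produce):

    # At 10GbE wirespeed with minimum-size frames a packet leaves roughly every
    # 67.2 ns, so millisecond-level gaps mean the sender fell off line rate.

    def find_stalls(timestamps_ns, threshold_ns=1_000_000):  # flag gaps > 1 ms
        stalls = []
        for i in range(1, len(timestamps_ns)):
            gap = timestamps_ns[i] - timestamps_ns[i - 1]
            if gap > threshold_ns:
                stalls.append((i, gap))
        return stalls

    # Synthetic example: steady 67.2 ns spacing with one 2 ms stall injected.
    ts = [i * 67.2 for i in range(10_000)]
    ts[5_000:] = [t + 2_000_000 for t in ts[5_000:]]
    for idx, gap in find_stalls(ts):
        print(f"gap of {gap / 1e6:.2f} ms before packet {idx}")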


> Could be useful out of the box for cheap 10Gbe load-testing.

Honest question: why not use DPDK / netmap for this?


So, according to this, with 9000-byte MTUs, I could drive 4x40Gbit (not a full 160Gbit, since the oversubscribed cards top out around 128Gbit of PCIe bandwidth) using about 2.9% of my CPU on a quad core (assuming the same CPU speed; mine is probably about a third faster than theirs, too).

That's pretty sexy.
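
Roughly how that estimate works out (my arithmetic; it assumes the per-packet cost from the article's single-core kernel result scales linearly, and uses the PCIe-limited 128Gbit figure):

    # Back-of-the-envelope: how much of a quad core it takes to push 128 Gbit/s
    # of 9000-byte frames, given a single core manages ~14.88 Mpps of
    # minimum-size frames in the article's kernel-level test.

    single_core_pps = 14.88e6   # article's single-core TX rate
    link_bps = 128e9            # PCIe-limited aggregate of the 4x40Gbit cards
    frame_bytes = 9000          # jumbo frames

    pps_needed = link_bps / (frame_bytes * 8)
    cores_needed = pps_needed / single_core_pps

    print(f"{pps_needed / 1e6:.2f} Mpps needed")            # ~1.78 Mpps
    print(f"{cores_needed * 100:.1f}% of one core")         # ~12%
    print(f"{cores_needed / 4 * 100:.1f}% of a quad core")  # ~3%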



