Using the FreeBSD Rack TCP Stack (klarasystems.com)
123 points by rodrigo975 on Sept 16, 2021 | 23 comments


This stack was developed by my colleagues at Netflix (primarily Randall Stewart, known for SCTP). It serves the vast majority of our video and other CDN traffic.


Since Netflix serves (relatively) few but extremely large files, would you recommend using this stack for a typical web server (serving lots of small files)?


Yes. We also use it for non-video small files on our CDN.


I was playing with it last night and noticed that I can achieve a throughput of 17.8 Gbit/s on iperf3 with the standard FreeBSD 13.0 stack (in a VM with virtio-net) but only 13-14 Gbit/s with the Rack stack. Is there something I'm missing?


Rack is not tuned for single-stream bandwidth. There are things that it does which may lead to lower single-stream bandwidth. For example, if you have 1MB ready to send and enough receiver window, the default stack will loop around in tcp_output() and send all 1MB immediately as 16 64K TSOs. I believe Rack will back off in between.
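For anyone reproducing that comparison, the stack in use can be switched at runtime rather than rebuilding anything. A rough C sketch, assuming the tcp_rack module is loaded and the usual sysctl names (net.inet.tcp.functions_available and net.inet.tcp.functions_default); changing the default needs root, and only new connections pick it up:

    /* Sketch (FreeBSD): list the available TCP stacks, then make "rack"
     * the default so new connections (e.g. an iperf3 run) use it. */
    #include <sys/types.h>
    #include <sys/sysctl.h>
    #include <stdio.h>
    #include <string.h>

    int main(void) {
        char avail[1024];
        size_t len = sizeof(avail);

        /* Read-only sysctl listing the registered stacks. */
        if (sysctlbyname("net.inet.tcp.functions_available", avail, &len, NULL, 0) == 0)
            printf("available stacks:\n%.*s\n", (int)len, avail);

        /* Make "rack" the default for new connections. */
        const char *want = "rack";
        if (sysctlbyname("net.inet.tcp.functions_default", NULL, NULL,
                         want, strlen(want) + 1) != 0)
            perror("set net.inet.tcp.functions_default");

        return 0;
    }

Setting it back to the base stack's name (I believe "freebsd") restores the original behaviour for new connections.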


Interesting, didn’t know that FreeBSD has this feature.

I know of one other operating system which has a somewhat similar feature, but not quite the same. z/OS supports running multiple TCP/IP stacks concurrently on the same OS instance [0]

Whereas this is multiple TCP stacks, but still only one IP stack (or maybe one for v4 and one for v6)

[0] https://www.ibm.com/docs/en/zos/2.2.0?topic=overview-conside...


But on VNET enabled kernels (enabled in GENERIC since 13.0) you can have multiple instances of the IP stack to provide jails with their own IP stacks including a loopback, firewall(s) and IPsec.


So, with FreeBSD's multiple TCP stack feature, a single process can talk to multiple TCP stacks simultaneously, by calling setsockopt(TCP_FUNCTION_BLK) on each socket to select a different stack.
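A minimal sketch of that per-socket selection, assuming the tcp_rack module is loaded and the struct tcp_function_set interface from <netinet/tcp.h> (error handling kept short):

    /* Sketch (FreeBSD): ask for the "rack" stack on one socket only.
     * Other sockets in the same process keep whatever stack they have. */
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void) {
        int s = socket(AF_INET, SOCK_STREAM, 0);
        if (s < 0) { perror("socket"); return 1; }

        struct tcp_function_set fs;
        memset(&fs, 0, sizeof(fs));
        strlcpy(fs.function_set_name, "rack", sizeof(fs.function_set_name));

        if (setsockopt(s, IPPROTO_TCP, TCP_FUNCTION_BLK, &fs, sizeof(fs)) < 0)
            perror("setsockopt(TCP_FUNCTION_BLK)");

        close(s);
        return 0;
    }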

Similarly, in z/OS, a single process can have sockets belonging to multiple TCP/IP stacks. There is a system call (setibmopt) which can be used to choose a default stack, and thereafter all sockets created by that process will be bound to that stack only (the "stack affinity" is inherited over fork/exec; can also be set with _BPXK_SETIBMOPT_TRANSPORT environment variable). Alternatively, you can call ioctl(SIOCSETRTTD) on a socket to pick which TCP/IP stack to use for that particular socket. There is also a feature, CINET, where the OS chooses which TCP/IP stack to use for each socket automatically, based on the address the process binds it to. CINET asks each TCP/IP stack to provide a copy of its routing tables, and then uses those routing tables to "preroute" sockets to the appropriate stack.

But I get the impression VNET doesn't allow a single process to use multiple IP stack instances simultaneously? If VNET is bound to jails, a single process can belong to only one jail.

One reason why z/OS has this multiple TCP/IP stack support is that historically the TCP/IP stack was a third-party product, not a core part of the OS. So instead of IBM's stack, some people used third-party ones, such as CA TCPaccess (at one point resold by Cisco as IOS for S/390). One can even use both products on the same OS instance, primarily to help with piecemeal migrations from one to the other. Other operating systems with a history of supporting TCP/IP stacks from multiple vendors include OpenVMS and older versions of Windows (especially 3.x).


VNET has been around for quite a while though…


And has been a quick way to panic() for most of them.


No, not "most", but some of them. And it's declared stable in FreeBSD 13.


I'm having trouble parsing the following passage.

>"However, when the loss is at the end of a transmission, near the end of the connection or after a chunk of video has been sent, then the receiver won’t receive more segments that would generate ACKs. When this sort of Tail loss occurs, a lengthy retransmission time out (RTO) must fire before the final segments of data can be sent."

I believe this whole passage is just describing TCP fast retransmit vs. a retransmit timeout expiring. However, if the final TCP segment from the sender is lost, wouldn't the receiver also start sending duplicate ACKs? This sentence seems to indicate duplicate ACKs would not be sent if the last segment was the TCP segment that was lost. In other words, no duplicate ACK comes from the receiver and so the RTO expires.


I think the implication is: TCP kind of assumed you will either keep transmitting, or close the connection.

In many video-streaming type workloads, the connection will go idle for seconds or even minutes at a time. If the loss is at the tail end of some activity, before a period of idle, the recovery takes a lot longer than it would if there were further activity on the connection.


Ah OK, thanks, that makes sense now. They did actually use the word "furthest" in the sentence previous to the one I quoted, which also makes sense in the context. Cheers.


TIL that Linux has pluggable congestion control algorithms.

Anyone know if there's one that can deal with severe buffer bloat? I have a connection where I control both ends and I have seen ping times exceed 20s under load. Throughput is highly variable so I can't just throttle the connection.


You sure that's not just badly defined priority-based queues or something else going on in between? 20s is a hell of a lot of buffer unless this connection is via a dialup modem.

That being said it sounds like this connection is a good candidate for using BBR anyways so I'd give that a shot and see if anything changes.
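If you want to try BBR per connection rather than system-wide, Linux exposes the congestion control choice as a socket option. A small sketch, assuming the tcp_bbr module is loaded (modprobe tcp_bbr) and, for unprivileged processes, that "bbr" appears in net.ipv4.tcp_allowed_congestion_control:

    /* Sketch (Linux): request BBR on one socket via TCP_CONGESTION,
     * then read back what the kernel actually picked. */
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void) {
        int s = socket(AF_INET, SOCK_STREAM, 0);
        if (s < 0) { perror("socket"); return 1; }

        if (setsockopt(s, IPPROTO_TCP, TCP_CONGESTION, "bbr", strlen("bbr")) < 0)
            perror("setsockopt(TCP_CONGESTION)");

        char cc[16];
        socklen_t len = sizeof(cc);
        if (getsockopt(s, IPPROTO_TCP, TCP_CONGESTION, cc, &len) == 0)
            printf("congestion control in use: %.*s\n", (int)len, cc);

        close(s);
        return 0;
    }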


This is a G.hn link that usually sustains 50Mbps but occasionally drops to sub 1Mbps (so more like 1st gen DSL speeds than dialup speeds); it only drops that low for a fairly short period of time, but it takes a long time to recover if the buffers are full.


That's not a buffering issue (as others mentioned before, 20s would be quite a buffer). G.hn made the same mistake as dial-up modems did a long time ago: implementing forward error correction. That's a great feature on a space probe leaving the solar system, but not for peers on a LAN talking TCP, as TCP performs its own error correction (as is well known). If some link-layer hiccup occurs, the forward error correction of the link layer causes delays, potentially causing retransmissions on the TCP layer (check with `netstat -s`).


Does it really take 20s to perform FEC on a packet? I assumed it was retransmitting at the link-level.

Either way, the packet is buffered in the sense that it is stored in a buffer on the switch; otherwise the packets would be dropped rather than eventually making it through.


Sorry, brainfart, no excuses, I had my coffee already. FEC ought to add only minimal latency, even if correction is necessary. I too believe the long latencies are due to retransmissions at the link level (which modems suffered from too), which, if ongoing for long enough, cause retransmissions by TCP.


Hmm maybe make the buffers much smaller and try BBR instead of Cubic?

https://blog.apnic.net/2020/01/10/when-to-use-and-not-use-bb...
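If the buffers meant here include the sender's own socket buffer (as opposed to queues inside the G.hn modems or switches), one per-socket way to bound how much data can pile up is SO_SNDBUF. A rough sketch, with an arbitrary 64 KB cap:

    /* Sketch: shrink the send buffer so less unsent data can queue up
     * behind a slow link. This does not touch queues in the modems or
     * switches along the path, and 64 KB is just an example value. */
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        int s = socket(AF_INET, SOCK_STREAM, 0);
        if (s < 0) { perror("socket"); return 1; }

        int sndbuf = 64 * 1024;
        if (setsockopt(s, SOL_SOCKET, SO_SNDBUF, &sndbuf, sizeof(sndbuf)) < 0)
            perror("setsockopt(SO_SNDBUF)");

        close(s);
        return 0;
    }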


Thank you for this link, I wasn't aware of that... building a kernel just as I type :)


Great write-up, I wasn't aware that FreeBSD could do this, and it does make me want to give FreeBSD another shot.



