Some of the buffer-bloat problem is due to SO_SNDBUF being statically set to some value, which also hides the actual bytes-in-flight from the application.
I think it would be much better to allow the auto-detected TCP window to be exposed to the application level as the "send-buf" size (with perhaps some 10% buffer bloat to allow filling in the gaps when the window grows or acks return prematurely).
Also, it would be good for high-bandwidth-high-latency situations, where the default Linux 0.5MB send-buf size is not enough. Allow the send-buf to grow if the TCP window needs to grow beyond 0.5MB.
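For reference, this is what a statically pinned send buffer looks like in practice. A minimal Python sketch, assuming Linux: setting SO_SNDBUF disables the kernel's send-buffer auto-tuning for that socket, and Linux reports back double the requested value (the extra half is reserved for internal bookkeeping) up to the net.core.wmem_max limit:

```python
import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# The send-buffer size the kernel picked by default (auto-tuning active).
default = s.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)

# Statically pin the send buffer to 64 KiB. From this point on the
# kernel stops auto-tuning it, which is the behaviour complained
# about above.
s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 64 * 1024)

# On Linux this reads back as roughly double the requested value.
tuned = s.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
print(default, tuned)
```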
Well, exposing it per application somewhat defeats the purpose. The point is that you get rid of the buffer so you get packet loss when the link to the other peer(s) is saturated, which lets the TCP congestion-control algorithms work properly.
Even in a high-latency, high-bandwidth situation, having huge buffers is self-defeating. Anybody who has used BitTorrent and assumed their ISP was doing some sort of throttling, where you get weird latency spikes and reduced performance... in the majority of cases this is a direct result of buffer bloat.
Large buffers only help in the case of a single TCP connection using as much of the bandwidth as possible, which helps network routers and similar gear look good in benchmarks.
What you are talking about is some sort of Quality-of-Service mechanism, probably most usefully applied at the edge of networks for prioritizing traffic, and at the ISP level so they can route traffic over different internet links based on requirements.
ISPs have to deal with choosing different links to other networks and what each costs. They can choose to do things like use main backbones versus secondary links, with different trade-offs between cost, latency, and things of that nature. So ideally there should be some sort of flag you can set in the TCP packet to indicate latency importance, or internet routing equipment could be made application-protocol-aware.
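Something close to that flag already exists at the IP layer: the TOS/DSCP byte, which an application can set per socket. Whether routers along the path honour it is another matter. A minimal sketch; the 0x10 value is the classic IPTOS_LOWDELAY bit from RFC 1349, and what (if anything) your network does with it is an assumption:

```python
import socket

IPTOS_LOWDELAY = 0x10  # classic "minimize delay" TOS bit (RFC 1349)

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Mark outgoing packets on this socket as latency-sensitive. Routers
# may use this for queueing/link decisions -- or ignore it entirely.
s.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, IPTOS_LOWDELAY)
tos = s.getsockopt(socket.IPPROTO_IP, socket.IP_TOS)
print(hex(tos))
```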
By "expose it to the application", I mean in the flow-control sense: that send() (or select/epoll/etc.) will block until there is room in the TCP window, rather than in a pre-determined buffer that will always be too small or too big.
The way it works now, the kernel basically forces applications to either accept buffer bloat (fill a 0.5MB socket buffer), auto-detect the RTT themselves, or manually select a proper buffer size. All are bad options.
Also, in high-latency, high-bandwidth situations, the default 0.5MB buffer will simply fail to make use of the available bandwidth, so increasing the buffer size does not defeat the purpose. Latency spikes are a different situation; there are also cases of constant high latency (e.g. inter-continental 1Gbps links).
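For what it's worth, Linux later grew a knob in roughly this direction: TCP_NOTSENT_LOWAT (since 3.12) caps how much *not-yet-sent* data may sit in the socket buffer, so the socket stops polling as writable once the application is that far ahead of the window. A minimal sketch, assuming Linux; the constant is defined by hand because the Python socket module may not export it:

```python
import socket

TCP_NOTSENT_LOWAT = 25  # from linux/tcp.h; may be missing from the socket module

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Allow at most 128 KiB of unsent data to queue in the kernel; beyond
# that, send() blocks and epoll/select stop reporting the socket
# writable, so the application effectively tracks the edge of the TCP
# window plus a small cushion.
s.setsockopt(socket.IPPROTO_TCP, TCP_NOTSENT_LOWAT, 128 * 1024)
lowat = s.getsockopt(socket.IPPROTO_TCP, TCP_NOTSENT_LOWAT)
print(lowat)
```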
Regarding BitTorrent, I'm wondering if that's not just as much the fault of crappy NAT modems. Around here everyone gets a Zyxel, and BitTorrent totally chokes it periodically. The hash table used for NAT grows, the CPU can't handle it, to the point where you can't even log in to the modem, and new TCP sessions get dropped until the NAT clears out the old TCP connections. (With a proper network device this shouldn't be an issue, but it's one I see all the time with the cheap modems we get from the ISPs.)
The socket send buffer from applications is not very relevant to buffer bloat: it does not cause retransmissions, nor does it contribute much visible latency.
True. But if you have 0.5MB of data to send right now, it will take the same amount of time regardless of whether it is queued in the socket buffer or in your application. Applications that need to care about this are typically not sitting on top of TCP, and would usually need to control the send-buffer size anyway.
If you have 0.5MB of data to send right now, you could queue it all in your application layer; then you could still decide to cancel it, or schedule your sends based on your own priorities.
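One way to sketch that on Linux today: keep the queue in the application, and use the SIOCOUTQ ioctl to see how many bytes are still pending in the kernel before handing over the next chunk. The drain function and the backlog threshold here are illustrative assumptions, not an established API:

```python
import fcntl
import socket
import struct
import termios
from collections import deque

SIOCOUTQ = termios.TIOCOUTQ  # on Linux, SIOCOUTQ == TIOCOUTQ (0x5411)

def unsent_bytes(sock):
    """Bytes queued in the kernel for this socket, not yet acked."""
    buf = fcntl.ioctl(sock.fileno(), SIOCOUTQ, struct.pack('i', 0))
    return struct.unpack('i', buf)[0]

def drain(sock, queue, max_kernel_backlog=64 * 1024):
    """Hand chunks to the kernel only while its backlog is small, so
    the bulk of the data stays in `queue`, where it can still be
    reordered or cancelled."""
    while queue and unsent_bytes(sock) < max_kernel_backlog:
        sock.sendall(queue.popleft())

# Demo over loopback: a blocking connect to a listening socket
# completes via the accept backlog, so no accept() call is needed.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(('127.0.0.1', 0))
server.listen(1)

client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.connect(server.getsockname())

queue = deque(b'x' * 1024 for _ in range(8))
drain(client, queue)
print(len(queue), unsent_bytes(client))
```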
I agree that applications that care about this don't use TCP, but one of the primary reasons for this is exactly this problem: that you don't get to send at the edge of the TCP window. There are other reasons, of course, each of which is fixable (and should be fixed!)