USB4 moved from a lane-switched architecture to a packet-switched architecture.
In USB3, for example, if you plugged in a DisplayPort display via USB-C, there was an alt-mode negotiation: either 2 or 4 of the 4 high-speed lanes would switch over to become DisplayPort lanes (leaving you with either 2 or 0 lanes of USB3). Even if you only had a 1080p60 (~4.3Gbps) monitor plugged in, you were still losing 20 of the 40Gbps, because you were losing half the lanes.
In USB4, everything (except USB1 & USB2, lol) is tunneled over USB4. If you plug a hub into a computer, and a display (or two or three) into the hub, the displays only take as much bandwidth as they need from the total link budget. Each 1080p60 monitor takes ~4.3 of the 40Gbps (or less, if the monitor has Display Stream Compression). Same goes for devices tunneled in via PCIe/Thunderbolt.
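If you want to sanity-check the ~4.3Gbps number, here's the back-of-envelope I'm working from (assuming standard CEA-861 1080p60 timing at 24bpp and DisplayPort's 8b/10b coding; different timing and overhead assumptions move it a few tenths either way):

```python
# Rough 1080p60 bandwidth estimate (assumptions: CEA-861 timing with blanking,
# 24 bits per pixel, DisplayPort's 8b/10b line coding). Lands a touch above
# the ~4.3Gbps figure quoted above.
h_total, v_total, refresh_hz = 2200, 1125, 60   # 1920x1080 active plus blanking
bits_per_pixel = 24

pixel_clock_hz = h_total * v_total * refresh_hz          # ~148.5 MHz
raw_gbps = pixel_clock_hz * bits_per_pixel / 1e9         # ~3.56 Gbps of pixel data
wire_gbps = raw_gbps * 10 / 8                            # ~4.46 Gbps after 8b/10b

link_budget_gbps = 40
print(f"one 1080p60 tunnel: ~{wire_gbps:.1f} Gbps "
      f"({wire_gbps / link_budget_gbps:.0%} of a {link_budget_gbps}Gbps USB4 link)")
```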
There were a lot of weird architectural limitations because of USB3's lane-switched design. It's more or less the reason we almost never saw USB hubs with multiple USB-C ports: people would plug in one USB-C monitor, and maybe the hub could be smart enough to handle that, but plug in a second USB-C monitor and now the DisplayPort lanes somehow need to serve two monitors. (This is actually achievable with DisplayPort Multi-Stream Transport hubs, but lack of support from Apple caused constant angst here.) With USB4, though, a hub can just route packets, so you can imagine multiple USB-C hubs fanning out and serving multiple displays; USB-C makes more sense now and has far fewer caveats under USB4.
Do you know what the max PCIe bandwidth in practice is? I suppose I'm asking about right now, the next two years, and the further future of USB4 v2.
In the past there have been some significant limits, like Intel's Thunderbolt chips only getting 22Gbps of PCIe data on a 40Gbps link.
My understanding is that TB3 reserved some of the throughput (for video, even if unused?), and the remaining throughput then went through 8b/10b encoding, which left ~26Gbps as the maximum theoretical PCIe bandwidth (about 18% less than the PCIe 3.0 x4 connection's 32Gbps).
I don't know if there are still restrictions that prevent PCIe from making more use of the link in USB4/TB4. It seemed like a weird restriction, and it'd be especially weird for such a big gap to stick around now that USB4 80Gbps exists. Anyhow, the encoding at least has gone from 8b/10b (+25% overhead) to 64b/66b or 128b/132b (~+3%), which I believe should boost TB4 numbers a little.
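Back-of-envelope on that, taking the ~26Gbps figure and the encodings above as given (tunneling overhead and whatever bandwidth gets reserved aren't modeled, so treat these as ceilings):

```python
# What the encoding change alone buys, taking the ~26Gbps TB3 ceiling above as
# given and only swapping the line coding (tunneling overhead and bandwidth
# reservation aren't modeled here).
eff_8b10b = 8 / 10        # +25% overhead
eff_64b66b = 64 / 66      # ~+3% overhead
eff_128b132b = 128 / 132  # ~+3% overhead (same efficiency as 64b/66b)

tb3_ceiling_gbps = 26
pcie3_x4_gbps = 4 * 8     # PCIe 3.0 x4: 4 lanes at 8 GT/s raw

gap = (tb3_ceiling_gbps - pcie3_x4_gbps) / pcie3_x4_gbps
print(f"26 vs 32 Gbps: {gap:.1%}")  # -18.8%, roughly the gap mentioned above

for name, eff in [("64b/66b", eff_64b66b), ("128b/132b", eff_128b132b)]:
    print(f"same link with {name}: ~{tb3_ceiling_gbps * eff / eff_8b10b:.1f} Gbps")  # ~31.5
```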
In practice, rather than the ~26Gbps of PCIe it seems like we might be able to get, I've found multiple similar reports that a bit over 22Gbps was all TB3 could typically manage in a drive test... I don't know, but I'd be very curious to get more details on where the missing 4Gbps go. How much do we lose to NVMe overhead, how much is lost to tunneling overhead, and what other factors are there? I also haven't seen any real data on what, if any, additional latency there is. One test I'd like to see is what happens to drive speed as we go from direct-attached, to 1 hub, to 2, to 3 hubs (rough sketch below).
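Here's the rough shape of the test I have in mind. It's a minimal sketch, not a rigorous benchmark: the path is hypothetical, the file needs to be much bigger than RAM or you're mostly benchmarking the page cache, and a serious run would use something like fio with direct I/O instead.

```python
# Minimal sequential-read sketch. TEST_FILE is a hypothetical path to a
# multi-GB file on the USB4/Thunderbolt-attached drive; run it with the
# enclosure direct-attached, then behind 1, 2, 3 hubs, and compare.
import time

TEST_FILE = "/path/to/big_test_file.bin"
CHUNK = 8 * 1024 * 1024  # 8 MiB reads

def sequential_read_gbps(path: str) -> float:
    total_bytes = 0
    start = time.perf_counter()
    with open(path, "rb", buffering=0) as f:   # unbuffered binary reads
        while chunk := f.read(CHUNK):
            total_bytes += len(chunk)
    elapsed = time.perf_counter() - start
    return total_bytes * 8 / elapsed / 1e9     # bits per second -> Gbps

if __name__ == "__main__":
    print(f"sequential read: ~{sequential_read_gbps(TEST_FILE):.1f} Gbps")
```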
As for the future, lots of good questions there; I haven't stumbled onto any juicy tidbits yet. A simple bump to PCIe 4.0 is all too likely, but it'd be great to specify something more general: tunnels with either more fan-in or more fan-out than the underlying connection. Picture 8 different PCIe 6.0 x1 devices plugged in; yes, they're oversubscribed, but they all work fine. That'd be a lovely thing to see, though it's probably pretty pie-in-the-sky right now.
The announcement said they're aligning with PCIe 4.0. If we assume that's 4 lanes, then hopefully it should support 40-50Gbps of tunneled PCIe traffic.
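Quick back-of-envelope on that hope (assuming 4 lanes at PCIe 4.0's 16 GT/s with 128b/130b coding; whatever the tunneling and controller overheads are would come off the top of this):

```python
# PCIe 4.0 x4 ceiling before any tunneling/controller overhead (assumptions:
# 16 GT/s per lane, 128b/130b line coding).
lanes = 4
gt_per_lane = 16.0                   # PCIe 4.0 transfer rate per lane
raw_gbps = lanes * gt_per_lane       # 64 Gbps on the wire
payload_gbps = raw_gbps * 128 / 130  # ~63 Gbps after encoding
print(f"PCIe 4.0 x{lanes}: ~{payload_gbps:.0f} Gbps payload ceiling")
```

So if the tunneled number lands in the 40-50Gbps range, it'd be the tunneling/controller side leaving bandwidth on the table, not the PCIe 4.0 x4 link itself.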