Hacker News

Intel also had quite a few issues (as usual) getting their own PCI standard working.

Almost all of the early Intel PCI chipsets (e.g. Mercury, Neptune) had serious bugs. Intel used DEC PCI chipsets, or licensed their designs, to get a more robust PCI implementation (IIRC the Triton). Until then, VLB was still a better choice.

PCIe retains a lot of the "weirdness" that PCI introduced, including the lack of a unified, "properly architected", centralized DMA agent or broker for device-to-host and host-to-device transfers.

Even today it's still a challenge to extract more than 50-60% of the theoretical bandwidth PCIe offers without a lot of tuning and mental churn.



> it's still a challenge to extract more than 50-60% of the theoretical bandwidth PCIe offers without a lot of tuning and mental churn.

Are you referring to using all the bandwidth across all of a CPU's PCIe lanes? Because it's clearly not a challenge to do way better than 50-60% utilization for transfers in one direction at a time to one device. All kinds of cheap crappy products have passed that threshold on their first try.


The issue is not a lack of raw bandwidth; it's getting the hardware, software, drivers and OS to "do the right thing".

PCIe gives you the building blocks of posted and non-posted transactions but doesn't help you use them effectively. There is no coordinated or designated DMA subsystem to help move data between the root-complex ("host") and the end-point ("device").

So, if you have to design a new PCIe end-point (a "target" in original PCI terms) using an FPGA or ASIC, actually sustaining PCIe throughput in either direction isn't trivial.

Posted transactions ("writes") are fire-and-forget; non-posted transactions ("reads") involve a request/completion handshake, flow control, etc.

If you can get your "system" to use ONLY posted writes (fire and forget) with a large enough MPS (Max Payload Size), usually >128 bytes, then you can get to 80-95% of theoretical throughput (1).

The real difficulty comes when you need to do a PCIe 'read': it breaks down into a Memory Read request (MRd) and one or more Completions with Data (CplD). The read results in a lot of back-and-forth traffic, and tracking the outstanding MRd/CplD pairs becomes a challenge (2).

Often an end-point can use posted writes to blast data to the PCIe root-complex (usually the CPU/host), maximizing throughput, since a host usually has hundreds of megabytes of RAM to use for buffers. Unfortunately, to transfer data from the root-complex (host) to the end-point (device), the host usually has the device's DMA controller initiate a read from the host's memory, which results in these split transactions, since end-points rarely carry hundreds of MB of RAM. This also means bespoke drivers, tying into the OS PCIe subsystems, and hopefully not losing any MSI-X interrupts.
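The device-initiated read pattern described above usually takes the form of a descriptor ring in host memory. This is a hypothetical layout (struct and field names are invented for illustration, not from any real device):

```c
#include <stdint.h>

/* Hypothetical host-to-device DMA descriptor ring: the driver fills
 * descriptors in host RAM, and the device's DMA engine issues MRd TLPs
 * against host_addr to pull each payload down. */
struct dma_desc {
    uint64_t host_addr;   /* bus/IOVA address of the buffer in host RAM */
    uint32_t length;      /* bytes to fetch (split into MRds <= MRRS)   */
    uint32_t flags;       /* e.g. an interrupt-on-completion bit        */
};

struct dma_ring {
    struct dma_desc desc[256];
    uint32_t head;        /* driver writes new descriptors here ...     */
    uint32_t tail;        /* ... device advances this as reads complete */
};
```

The driver advances head after filling descriptors; the device walks from tail, issuing reads for each host_addr and (if flagged) raising an MSI-X interrupt on completion. Every vendor reinvents some variant of this, which is the per-device DMA fragmentation the next paragraph complains about.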

To reiterate: in the modern "Intel way", the CPU houses the PCIe root complex but does not house ANY DMA controller. So getting "DMA" working means each PCIe end-point implements its own DMA "controller", different from the DMA controller of every other end-point, rather than Intel having spec'd an "optional" centralized DMA controller in the root complex.

1: https://cdrdv2-public.intel.com/666650/an456-683541-666650.p...

2: https://www.intel.com/content/www/us/en/docs/programmable/68...



