> and just computed the prior work loads on DSPs and FPGAs sitting behind APIs already wired into LLVM
Ha ha, good one. Upstream LLVM supports PTX just fine (and so does GCC, by the way).
FPGAs have a _much more_ locked-down toolchain; they're really not a good example to pick. Compute toolchains for FPGAs are brittle and don't perform that well: they're _not_ competitive with GPUs performance-wise, _and_ the hardware is much more expensive.
More seriously, CUDA maps well to the hardware. ROCm is essentially a CUDA API clone (albeit a botched one).
> the market of uncanny cat image generators.
GPUs are used for far more than that. Incidentally, Intel's Habana AI accelerators also have a driver stack that is closed down in practice.