Not necessarily! The setup I'm describing is explicitly non-GPU, and it's not necessarily a TPU either. Any accelerator card with NoC (network-on-chip) capability will do: requests are queued and batched straight off the network, trickle through the adjacent compute/network nodes, and are written back to the network. That's what "compute-in-network" means; the CPU is never involved, and main memory is never involved. You read from the network, you write to the network, that's it. On-chip memory on these accelerators is orders of magnitude larger than an L1 cache (FPGAs are known for low-latency systolic designs), and the on-package memory is a large HBM stack similar to what you'd find on a GPU.
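To make the data path concrete, here's a minimal toy sketch of that network-in/network-out flow. Everything here is illustrative (the names `nic_rx`, `nic_tx`, the stage functions, the batch size are all made up, not a real API); it just models requests being batched from the network, flowing through compute stages, and going straight back out, with no CPU or main-memory hop in the loop:

```python
from collections import deque

def run_pipeline(nic_rx, stages, nic_tx, batch_size=4):
    """Hypothetical model of the compute-in-network path: drain requests
    from the network (nic_rx), push batches through adjacent compute
    stages, and emit results straight back to the network (nic_tx)."""
    queue = deque(nic_rx)  # requests queued/batched from the network
    while queue:
        batch = [queue.popleft() for _ in range(min(batch_size, len(queue)))]
        for stage in stages:       # trickle through the compute nodes
            batch = [stage(x) for x in batch]
        nic_tx.extend(batch)       # written back to the network

# Toy usage: eight requests through two compute stages.
out = []
run_pipeline(range(8), [lambda x: x * 2, lambda x: x + 1], out)
# out == [1, 3, 5, 7, 9, 11, 13, 15]
```

On real hardware the "stages" would be fixed-function or systolic logic on the accelerator and the queues would live in on-chip/HBM memory, but the shape of the loop is the point: network in, compute, network out.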
If a client wishes to use your GPU-based RDBMS engine, it needs to make a trip through the CPU first, does it not?