Something like OpenAI's Triton as a CUDA bypass? Or PyTorch/TensorFlow targeting ROCm without user intervention.
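To make "without user intervention" concrete, here's a rough sketch of vendor-agnostic device selection in PyTorch. As far as I know, ROCm builds of PyTorch reuse the `torch.cuda` namespace and set `torch.version.hip`, so the same script runs on either vendor. `pick_backend` is a name I made up for illustration, and the fallback when torch isn't installed is only there so the snippet runs anywhere.

```python
def pick_backend():
    """Return (device_name, backend_label) without the caller naming a vendor.

    pick_backend is a hypothetical helper, not part of PyTorch.
    """
    try:
        import torch
    except ImportError:
        # Assumption for the sketch: degrade gracefully if torch is missing.
        return "cpu", "no torch installed"

    if not torch.cuda.is_available():
        return "cpu", "cpu only"

    # ROCm builds expose the GPU through torch.cuda as well; torch.version.hip
    # is a version string on those builds and None on CUDA builds.
    label = "rocm/hip" if getattr(torch.version, "hip", None) else "cuda"
    return "cuda", label


device, backend = pick_backend()
print(device, backend)
```

The user code only ever says "cuda"; which vendor stack sits underneath is decided by the installed build, not by the script.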

Or there's OpenMP offload or HIP. In extremis, OpenCL.

I think the language stack is fine at this point. The moat isn't CUDA the technology. It's that code runs reliably on NVIDIA's stack, without things like a stray pointer forcing a machine reboot. Hard to know how far off a robust ROCm is.