Something like Triton from Microsoft/OpenAI as a cuda bypass? Or pytorch/tensorflow targeting ROCm without user intervention.
Or there's openmp or hip. In extremis opencl.
I think the language stack is fine at this point. The moat isn't in cuda the tech. It's in code running reliably on nvidia's stack, without things like stray pointers needing a machine reboot. Hard to know how far off robust rocm is at this point.
Or there's openmp or hip. In extremis opencl.
I think the language stack is fine at this point. The moat isn't in cuda the tech. It's in code running reliably on nvidia's stack, without things like stray pointers needing a machine reboot. Hard to know how far off robust rocm is at this point.