My prediction is that eventually there will be anti-trust litigation, NVIDIA will be required to open the CUDA standard, and after that AMD will become a competitor.
NVIDIA could voluntarily open the standard to avoid this litigation if they wanted to, and IMO that would be the smart thing to do, but almost every corporation in history has chosen the litigation instead.
> My prediction is that eventually there will be anti-trust litigation, NVIDIA will be required to open the CUDA standard, and after that AMD will become a competitor.
If AMD isn't a competitor before government intervention, I don't think the government forcing nvidia to open up CUDA changes much. CUDA's moat isn't due to some secret sauce - nvidia put in the developer hours; and if AMD's CUDA implementation is still broken, people will continue to buy nvidia.
There has been a lot of effort put into getting AMD to work - Hotz has been trying for a while now[1] and has been uncovering a ton of bugs in AMD's drivers. To AMD's credit, those bugs have been fixed, but it does give you a sense of how far behind they are with their own software. Now imagine them trying to implement a competitor's spec.
You know what happens to companies that panic and throw all their resources into knee-jerk software projects? I don't, but I'd predict it is ugly. Adding more people to a bad project generally makes it worse.
The issue AMD has is that they had a long period where they clearly had no idea what they were doing. You could tell just from looking at the websites: the CUDA pages pretty much immediately get to "here is a library for FFT", "here is a library for sparse matrices". AMD would explain that ROCM is an abbreviation of the ROCm Software platform or something unspeakably stupid. And that your graphics card wasn't supported.
That changed a few months ago, so it looks like they have put some competent PMs in the chair now or something. But it'll take months for the flow-on effects to reach the market. They have to figure out what the problems are, which takes months to do properly; then fix the software (1-3 months more, minimum); then get it out in the open and have foundational libraries like PyTorch pick it up (might take another year). You can speed that up, but more cooks in the kitchen is not the way; bandwidth use needs to be optimised.
It isn't that ROCm lacks key features; it can technically do inference and training. My card crashes regularly though (might be a VRAM issue), so it is useless in practice. AMD can check boxes, but the software doesn't really work, and grappling with that organisationally is hard. Unless you have the right people in the right places, which AMD didn't have up to at least mid-2023.
> AMD would explain that ROCM is an abbreviation of the ROCm Software platform or something unspeakably stupid. And that your graphics card wasn't supported.
If even that. A few years ago they managed to break basic machine learning code on the few commonly-used consumer GPUs that were officially supported at the time, and it was only after several months of more or less radio silence on the bug report and several releases that they declared those GPUs were no longer officially supported and they'd be closing the bug report: https://github.com/ROCm/ROCm/issues/1265
Look at AMD vs Intel. AMD has now surpassed Intel in terms of CPUs sold and market cap. That was unthinkable even six or seven years ago.
It makes perfect sense that, organisationally, they were focused on that battle. If you remember the Athlon days, AMD beat Intel before, but briefly. It didn't last. This time it looks like they beat Intel and have had the focus to stay. Intel will come back and beat them some cycles, but there is no collapse on the horizon.
So it makes sense that they started looking at nVidia in the last year or so. Of course nVidia has amassed an obscene war chest in the meantime...
It's a political problem. Good software engineers are paid more than good hardware engineers, but AMD management is unwilling to pay up to bring on good software engineers, because then they'd also need to pay their hardware engineers more or the hardware engineers would be dissatisfied. If you check NVidia salaries online you'll see NVidia pays significantly more than AMD for both hardware and software engineers; it's a classic case of AMD management being penny-wise, pound-foolish.
Because it is a different type of engineering. If you manage software development like you manage hardware development your software is going to be bad. That has always been AMD's problem and it is not likely to get fixed.
What unis would that include? Isn't ATI Canadian? So I'd expect lots of UToronto and Waterloo people there. Aren't they some of the best in this field?
You have to remember that this only applies to cheap consumer GPUs; they tend to support their datacenter GPUs better. When you consider that Ryzen AI already eats the AI inference lunch, having better consumer GPUs with better software only threatens to cannibalize their datacenter GPU offering. Given enough time, nobody will care about using AMD GPUs for AI.
Getting this working might be worth a trillion $ to AMD - they should be doing more than just waiting for a bootstrapped startup to debug their drivers for them.
It changes a lot. It is not legal to make a 'CUDA' driver for an AMD GPU, as Nvidia owns CUDA. There was an open implementation of this that AMD sponsored until Nvidia threatened them with a lawsuit.
The problem currently, as people like Hotz and many others are discovering, is not the lack of CUDA. Most people use PyTorch and don't care what the underlying software is. In fact most CUDA code is hand-tuned to nvidia hardware anyway, optimized to get the most out of nvidia. The problem is AMD's drivers - the piece that actually gets code running on the GPU - which tend to be broken. AMD cannot "sponsor" an outsider to fix this. A legal, but broken, AMDCUDA will not be any better than the current situation; so no, having CUDA on AMD wouldn't change anything.
The problem is not "CUDA is not on AMD"; the problem is that AMD has not, does not, and for some reason will not invest adequately in GPU compute. CUDA is a mirage; if AMD had a similar platform, someone would have done the work already to ensure PyTorch works on it. PyTorch already supports ROCm; people don't use it because the performance is bad and it's buggy. When nvidia had this problem, nvidia hired engineers to work on open source projects and debug issues in open source libraries (not even limited to AI - you will find nvidia engineers debugging issues in a wide range of CUDA projects). When AMD has this issue, they barely acknowledge it.
ZLUDA bit the dust not because it implemented CUDA but because it was misusing compiled NVIDIA libraries.
If it were a clean-room implementation of the API, NVIDIA wouldn't care. Heck, that's exactly what AMD did with HIP.
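To make the "HIP mirrors CUDA" point concrete, here is a minimal sketch (not taken from any particular project) of a vector add written against the HIP runtime. Every call is just the CUDA runtime name with the prefix swapped, which is roughly what a clean-room reimplementation of the API surface looks like:

    // vadd_hip.cpp - builds with hipcc; swap the include and the hip*
    // prefixes for cuda* and the same file builds with nvcc instead.
    #include <hip/hip_runtime.h>   // CUDA equivalent: <cuda_runtime.h>
    #include <cstdio>

    __global__ void vadd(const float *a, const float *b, float *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // identical kernel language
        if (i < n) c[i] = a[i] + b[i];
    }

    int main() {
        const int n = 1 << 20;
        float *a, *b, *c;
        hipMalloc((void **)&a, n * sizeof(float));   // CUDA: cudaMalloc
        hipMalloc((void **)&b, n * sizeof(float));
        hipMalloc((void **)&c, n * sizeof(float));
        hipMemset(a, 0, n * sizeof(float));          // CUDA: cudaMemset
        hipMemset(b, 0, n * sizeof(float));
        vadd<<<(n + 255) / 256, 256>>>(a, b, c, n);  // same triple-chevron launch
        hipDeviceSynchronize();                      // CUDA: cudaDeviceSynchronize
        hipFree(a); hipFree(b); hipFree(c);          // CUDA: cudaFree
        std::printf("done\n");
        return 0;
    }

AMD's hipify tooling automates exactly this rename for existing CUDA codebases.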
But what you cannot do is essentially intercept calls to and reverse engineer NVIDIA binaries in real time because you can’t be arsed to build your own.
> But what you cannot do is essentially intercept calls to and reverse engineer NVIDIA binaries in real time because you can’t be arsed to build your own.
And this is precisely what anti-trust litigation would allow them to do.
Preventing someone from reverse-engineering a product with the sole intention of maintaining monopoly status may be seen as anti-competitive.
AMD makes really, really good CPUs now, but only after litigation against Intel allowed them to keep up with evolving x86 standards.
It's not about being "arsed" to build your own; the problem is that NVIDIA controls the ecosystem-wide standard. NVIDIA can add to CUDA at any point and launch a GPU at the same time; the ecosystem would be forced to buy it if they want to stay on the cutting edge, and AMD would never be able to compete with or reverse engineer these new standards in time.
They didn't RE the CUDA API; they reused libraries like cuDNN, which is a completely different case than, say, Oracle vs Google.
What ZLUDA did wasn't to maintain compatibility with the CUDA API and provide an open implementation of it, but rather to reuse all the CUDA-based libraries that NVIDIA provides on top of it.
The equivalent would be if Google had not only implemented their own Java-compatible API but had also used the now Oracle-owned JVM to do so and redistributed it.
AMD already implemented CUDA essentially one-to-one in the form of HIP in ROCm. The issue they face is that they don't have all the equivalent middleware to make copy-pasting code actually work, and that is what ZLUDA provided - but instead of building a HIPdnn, ZLUDA just reused NVIDIA binaries.
It would be kind of genius for Nvidia to "open" the CUDA APIs (which have already been unofficially reverse engineered anyway) but not the code. Maybe they'd also officially support HIP and SYCL. Maybe they could open SXM after all competitors have already committed to OAM. They'd create the appearance of opening up while giving up very little.
By "Opening Up" they cement their leadership position. AI frameworks are already targeting CL, SPIR-V, etc. The low level details will fade and so will Nvidias api dominance.
Just because they are a target doesn’t mean things just work. Historically, AMD hardware for GPGPU becomes obsolete well before the software landscape catches up. I am not going to risk my time and money finding out whether history repeats itself, just for a few potential FLOPS per dollar.
Even if they finished it yesterday, it would take years to convince everyone that this time it will be worth investing in AMD, at which point the whole datacenter AI hype may already be over.
The AI hype will not end. AMD just has to support transformers. OSS is already building that support, and at that point you can finetune an open-weights model on an MI300. Nvidia is in the same place that Cisco and Sun were during the dot-com boom: people are just buying the mainstream thing that works. It doesn't take years, it takes an accountant.
The CUDA API is essentially open... HIP is basically a copy.
CUDA is such a misnomer. AMD doesn't have TensorRT, cuDNN, CUTLASS, etc. Forcing Nvidia to make these work on AMD is like forcing Microsoft to make Windows work on Apple hardware... Not going to happen.
IMHO there's reason to believe that what was discussed here plays a role in that decision: https://news.ycombinator.com/item?id=39592689 - namely NVidia trying to forbid such APIs.
That has nothing to do with the API. The restriction there is you cannot use nvcc to generate nvidia bytecode, take that bytecode, decompile it, and translate it to another platform. This means that, if you use cuDNN, you cannot intercept the already-compiled neural network kernels and then translate those to AMD.
You can absolutely use the names of the functions and the programming model. Like I said, HIP is literally a copy. Llama.cpp changes to HIP with a #define, because llama.cpp has its own set of custom kernels.
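The trick is roughly the following - a hypothetical sketch only, since llama.cpp's actual shim is longer and lives in its GGML CUDA backend, and the USE_HIP flag name here is made up: the custom kernels are written once against CUDA names, and a small preprocessor header remaps them to HIP when building for AMD.

    // cuda_to_hip_shim.h - hypothetical sketch of the aliasing approach.
    // Kernels are written against CUDA names; defining USE_HIP (an assumed
    // build flag) remaps them to the HIP runtime for AMD GPUs.
    #if defined(USE_HIP)
      #include <hip/hip_runtime.h>
      #define cudaError_t             hipError_t
      #define cudaSuccess             hipSuccess
      #define cudaGetErrorString      hipGetErrorString
      #define cudaMalloc              hipMalloc
      #define cudaFree                hipFree
      #define cudaMemcpy              hipMemcpy
      #define cudaMemcpyHostToDevice  hipMemcpyHostToDevice
      #define cudaMemcpyDeviceToHost  hipMemcpyDeviceToHost
      #define cudaStream_t            hipStream_t
      #define cudaStreamSynchronize   hipStreamSynchronize
      #define cudaDeviceSynchronize   hipDeviceSynchronize
    #else
      #include <cuda_runtime.h>
    #endif
    // __global__ kernels, blockIdx/threadIdx, and <<<...>>> launches compile
    // unchanged under both nvcc and hipcc, so no per-vendor kernel code is needed.

This only works because the project ships its own kernels; anything that calls into closed libraries like cuDNN has no HIP symbol to remap to, which is the gap ZLUDA papered over by reusing NVIDIA's binaries.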
And this is what I've said before: CUDA itself is hardly a moat. The API is well-known and already implemented by AMD. It's all the surrounding work: the thousands of custom (really fast!) kernels, the ease of use of the SDKs, the 'pre-built libraries for every use case'. You can claim that CUDA should be made open source for competition, but all those libraries and supporting SDKs represent real work done by real engineers - not just designing a platform, but making the platform work. I don't see why NVIDIA should be compelled to give those away any more than Microsoft should be compelled to support device driver development on Linux.