My prediction is that eventually there will be anti-trust litigation, NVIDIA will be required to open the CUDA standard, and after that AMD will become a competitor.
NVIDIA could voluntarily open the standard to avoid this litigation if they wanted to, and IMO that would be the smart thing to do, but almost every corporation in history has chosen the litigation instead.
> My prediction is that eventually there will be anti-trust litigation, NVIDIA will be required to open the CUDA standard, and after that AMD will become a competitor.
If AMD isn't a competitor before government intervention, I don't think the government forcing nvidia to open up CUDA changes much. CUDA's moat isn't due to some secret sauce - nvidia put in the developer hours; and if AMD's CUDA implementation is still broken, people will continue to buy nvidia.
There has been a lot of effort put into getting AMD to work - Hotz has been trying for a while now[1] and has been uncovering a ton of bugs in AMD's drivers. To AMD's credit, those bugs have been fixed, but it does give you a sense of how far behind they are with their own software. Now imagine them trying to implement a competitor's spec.
You know what happens to companies that panic and throw all their resources into knee-jerk software projects? I don't, but I'd predict it is ugly. Adding more people to a bad project generally makes it worse.
The issue AMD has is that they had a long period where they clearly had no idea what they were doing. You could tell just from looking at the websites: the CUDA pages pretty much immediately get to "here is a library for FFT", "here is a library for sparse matrices". AMD would explain that ROCM is an abbreviation of the ROCm Software platform or something unspeakably stupid. And that your graphics card wasn't supported.
That changed a few months ago, so it looks like they have put some competent PMs in the chair now or something. But it'll take months for the flow-on effects to reach the market. They have to figure out what the problems are, which takes months to do properly; then fix the software (1-3 months more, minimum); then get it out in the open and have foundational libraries like PyTorch pick it up (might take another year). You can speed that up, but more cooks in the kitchen is not the way; bandwidth use needs to be optimised.
It isn't that ROCm lacks key features; it can technically do inference and training. My card crashes regularly though (might be a VRAM issue), so it is useless in practice. AMD can check boxes, but the software doesn't really work, and grappling with that organisationally is hard. Unless you have the right people in the right places, which AMD didn't have up to at least mid-2023.
> AMD would explain that ROCM is an abbreviation of the ROCm Software platform or something unspeakably stupid. And that your graphics card wasn't supported.
If even that. A few years ago they managed to break basic machine learning code on the few commonly-used consumer GPUs that were officially supported at the time, and it was only after several months of more or less radio silence on the bug report and several releases that they declared those GPUs were no longer officially supported and they'd be closing the bug report: https://github.com/ROCm/ROCm/issues/1265
Look at AMD vs Intel. AMD has now surpassed Intel in terms of CPUs sold and market cap. That was unthinkable even six or seven years ago.
It makes perfect sense that, organisationally, they were focused on that battle. If you remember the Athlon days, AMD beat Intel before, but briefly. It didn't last. This time it looks like they beat Intel and have had the focus to stay. Intel will come back and beat them some cycles, but there is no collapse on the horizon.
So it makes sense that they started looking at nVidia in the last year or so. Of course nVidia has amassed an obscene war chest in the meantime...
It's a political problem. Good software engineers are paid more than good hardware engineers, but AMD management is unwilling to pay up to bring on good software engineers, because then they'd also need to pay their hardware engineers more or the hardware engineers would be dissatisfied. If you check NVidia salaries online you'll see NVidia pays significantly more than AMD for both hardware and software engineers; it's a classic case of AMD management being penny-wise, pound-foolish.
Because it is a different type of engineering. If you manage software development like you manage hardware development your software is going to be bad. That has always been AMD's problem and it is not likely to get fixed.
What unis would that include? Isn't ATI Canadian? So I'd expect lots of UToronto and Waterloo people there. Aren't they some of the best in this field?
You have to remember that this only applies to cheap consumer GPUs; they tend to support their datacenter GPUs better. When you consider that Ryzen AI already eats the AI inference lunch, having better consumer GPUs with better software only threatens to cannibalize their datacenter GPU offering. Given enough time, nobody will care about using AMD GPUs for AI.
Getting this working might be worth a trillion $ to AMD - they should be doing more than just waiting for a bootstrapped startup to debug their drivers for them.
It changes a lot. It is not legal to make a 'CUDA' driver for an AMD GPU, as Nvidia owns CUDA. There was an open implementation of this that AMD sponsored until Nvidia threatened them with a lawsuit.
The problem currently, as people like Hotz and many others are discovering, is not the lack of CUDA. Most people use PyTorch and don't care what the underlying software is. In fact most CUDA code is hand-tuned to nvidia hardware anyway, optimized to get the most out of nvidia. The problem is AMD's drivers - the piece that actually gets code running on the GPU - which tend to be broken. AMD cannot "sponsor" an outsider to fix this. A legal, but broken, AMDCUDA will not be any better than the current situation; so no, having CUDA on AMD wouldn't change anything.
The problem is not "CUDA is not on AMD"; the problem is that AMD has not, does not, and for some reason will not invest adequately in GPU compute. CUDA is a mirage; if AMD had a similar platform, someone would have done the work already to ensure PyTorch works on it. PyTorch already supports ROCm; people don't use it because the performance is bad and it's buggy. When nvidia had this problem, nvidia hired engineers to work on open source projects and debug issues in open source libraries (not even limited to AI - you will find nvidia engineers debugging issues in a wide range of CUDA projects). When AMD has this issue, they barely acknowledge it.
ZLUDA bit the dust not because it implemented CUDA but because it was misusing compiled NVIDIA libraries.
If it were a clean-room implementation of the API, NVIDIA wouldn't care. Heck, that's exactly what AMD did with HIP.
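To make the "HIP mirrors CUDA" point concrete, here is a minimal sketch (not taken from any particular project) of a vector add written against the HIP runtime. Every call is just the CUDA runtime name with the prefix swapped, which is roughly what a clean-room reimplementation of the API surface looks like:

    // vadd_hip.cpp - builds with hipcc; swap the include and the hip*
    // prefixes for cuda* and the same file builds with nvcc instead.
    #include <hip/hip_runtime.h>   // CUDA equivalent: <cuda_runtime.h>
    #include <cstdio>

    __global__ void vadd(const float *a, const float *b, float *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // identical kernel language
        if (i < n) c[i] = a[i] + b[i];
    }

    int main() {
        const int n = 1 << 20;
        float *a, *b, *c;
        hipMalloc((void **)&a, n * sizeof(float));   // CUDA: cudaMalloc
        hipMalloc((void **)&b, n * sizeof(float));
        hipMalloc((void **)&c, n * sizeof(float));
        hipMemset(a, 0, n * sizeof(float));          // CUDA: cudaMemset
        hipMemset(b, 0, n * sizeof(float));
        vadd<<<(n + 255) / 256, 256>>>(a, b, c, n);  // same triple-chevron launch
        hipDeviceSynchronize();                      // CUDA: cudaDeviceSynchronize
        hipFree(a); hipFree(b); hipFree(c);          // CUDA: cudaFree
        std::printf("done\n");
        return 0;
    }

AMD's hipify tooling automates exactly this rename for existing CUDA codebases.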
But what you cannot do is essentially intercept calls to and reverse engineer NVIDIA binaries in real time because you can’t be arsed to build your own.
> But what you cannot do is essentially intercept calls to and reverse engineer NVIDIA binaries in real time because you can’t be arsed to build your own.
And this is precisely what anti-trust litigation would allow them to do.
Preventing someone from reverse-engineering a product with the sole intention of maintaining monopoly status may be seen as anti-competitive.
AMD makes really, really good CPUs now, but only after litigation against Intel allowed them to keep up with evolving x86 standards.
It's not about being "arsed" to build your own; the problem is that NVIDIA controls the ecosystem-wide standard. NVIDIA can add to CUDA at any point and launch a GPU at the same time; the ecosystem would be forced to buy it if they want to stay on the cutting edge, and AMD would never be able to compete with or reverse engineer these new standards in time.
They didn't RE the CUDA API; they reused libraries like cuDNN, which is a completely different case than, say, Oracle vs Google.
What ZLUDA did wasn't to maintain compatibility with the CUDA API and provide an open implementation of it, but rather to reuse all the CUDA-based libraries that NVIDIA provides on top of it.
The equivalent would be if Google had not only implemented their own Java-compatible API but had also used the now Oracle-owned JVM to do so and redistributed it.
AMD already implemented CUDA essentially one-to-one in the form of HIP in ROCm. The issue they face is that they don't have all the equivalent middleware to make copy-pasting code actually work, and that is what ZLUDA provided - but instead of building a HIPdnn, ZLUDA just reused NVIDIA binaries.
It would be kind of genius for Nvidia to "open" the CUDA APIs (which have already been unofficially reverse engineered anyway) but not the code. Maybe they'd also officially support HIP and SYCL. Maybe they could open SXM after all competitors have already committed to OAM. They'd create the appearance of opening up while giving up very little.
By "Opening Up" they cement their leadership position. AI frameworks are already targeting CL, SPIR-V, etc. The low level details will fade and so will Nvidias api dominance.
Just because they are a target doesn’t mean things just work. Historically, AMD hardware for GPGPU becomes obsolete well before the software landscape catches up. I am not going to risk my time and money finding out whether history repeats itself, just for a few potential FLOPS per dollar.
Even if they finished it yesterday, it would take years to convince everyone that this time it will be worth investing in AMD, at which point the whole datacenter AI hype may already be over.
The AI hype will not end. AMD just has to support transformers. OSS is already building that support, and at that point you can finetune an open-weights model on an MI300. Nvidia is in the same place that Cisco and Sun were during the dot-com boom: people are just buying the mainstream thing that works. It doesn't take years, it takes an accountant.
The CUDA API is essentially open... HIP is basically a copy.
CUDA is such a misnomer. AMD doesn't have TensorRT, cuDNN, CUTLASS, etc. Forcing Nvidia to make these work on AMD is like forcing Microsoft to make Windows work on Apple hardware... Not going to happen.
IMHO there's reason to believe that what was discussed here plays a role in that decision: https://news.ycombinator.com/item?id=39592689 - namely NVidia trying to forbid such APIs.
That has nothing to do with the API. The restriction there is you cannot use nvcc to generate nvidia bytecode, take that bytecode, decompile it, and translate it to another platform. This means that, if you use cuDNN, you cannot intercept the already-compiled neural network kernels and then translate those to AMD.
You can absolutely use the names of the functions and the programming model. Like I said, HIP is literally a copy. Llama.cpp changes to HIP with a #define, because llama.cpp has its own set of custom kernels.
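The trick is roughly the following - a hypothetical sketch only, since llama.cpp's actual shim is longer and lives in its GGML CUDA backend, and the USE_HIP flag name here is made up: the custom kernels are written once against CUDA names, and a small preprocessor header remaps them to HIP when building for AMD.

    // cuda_to_hip_shim.h - hypothetical sketch of the aliasing approach.
    // Kernels are written against CUDA names; defining USE_HIP (an assumed
    // build flag) remaps them to the HIP runtime for AMD GPUs.
    #if defined(USE_HIP)
      #include <hip/hip_runtime.h>
      #define cudaError_t             hipError_t
      #define cudaSuccess             hipSuccess
      #define cudaGetErrorString      hipGetErrorString
      #define cudaMalloc              hipMalloc
      #define cudaFree                hipFree
      #define cudaMemcpy              hipMemcpy
      #define cudaMemcpyHostToDevice  hipMemcpyHostToDevice
      #define cudaMemcpyDeviceToHost  hipMemcpyDeviceToHost
      #define cudaStream_t            hipStream_t
      #define cudaStreamSynchronize   hipStreamSynchronize
      #define cudaDeviceSynchronize   hipDeviceSynchronize
    #else
      #include <cuda_runtime.h>
    #endif
    // __global__ kernels, blockIdx/threadIdx, and <<<...>>> launches compile
    // unchanged under both nvcc and hipcc, so no per-vendor kernel code is needed.

This only works because the project ships its own kernels; anything that calls into closed libraries like cuDNN has no HIP symbol to remap to, which is the gap ZLUDA papered over by reusing NVIDIA's binaries.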
And this is what I've said before: CUDA itself is hardly a moat. The API is well-known and already implemented by AMD. It's all the surrounding work: the thousands of custom (really fast!) kernels, the ease of use of the SDKs, the 'pre-built libraries for every use case'. You can claim that CUDA should be made open source for competition, but all those libraries and supporting SDKs represent real work done by real engineers - not just designing a platform, but making the platform work. I don't see why NVIDIA should be compelled to give those away any more than Microsoft should be compelled to support device driver development on Linux.