
Why wouldn't something similar to RISC-V happen in the GPU space?


In a GPU the ISA isn't decoupled from the architecture the way it is in a post-Pentium Pro CPU. A fixed ISA that you couldn't change later, when you wanted to make architectural changes, would be a millstone for a GPU to carry around.


I’m curious, why is this the case for GPUs and not CPUs?


It's much more advantageous to be able to respin/redesign parts of the GPU for a new architecture, since the user-facing interface sits at a much, much higher level than it does for a CPU. Vendors basically only have to certify that the new part is API-compatible at the CUDA/OpenCL/Vulkan/OpenGL/DirectX level and no more. All of those APIs make the driver responsible for translating programs into the hardware's own language, so every program is already recompiled for any new hardware. This does lead to tiny rendering differences in the end (it shouldn't, but it frequently does, due to bug fixes and rounding changes).

So, because they aren't required to keep any architectural compatibility, vendors are free to change things as they need new features or come up with better designs (frequently to allow more SIMD/MIMD-style execution and better memory bandwidth utilization). I doubt they really change all that much between two consecutive generations, but they change enough that exact compatibility isn't worth maintaining.
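A rough sketch of what that recompile-at-run-time model looks like with OpenCL (illustrative only: single device, no error handling): the kernel ships as source text, and the driver lowers it to whatever ISA the installed GPU happens to use.

    #include <CL/cl.h>
    #include <stdio.h>

    /* The kernel ships as source; the vendor driver compiles it at run
       time for whatever GPU is actually installed. */
    static const char *src =
        "__kernel void scale(__global float *x, float k) {"
        "    size_t i = get_global_id(0);"
        "    x[i] *= k;"
        "}";

    int main(void) {
        cl_platform_id plat;
        cl_device_id dev;
        clGetPlatformIDs(1, &plat, NULL);
        clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, NULL);

        cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
        cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);

        /* This is where "recompiled for any new hardware" happens: the
           driver lowers the source to its own ISA, whatever that is. */
        clBuildProgram(prog, 1, &dev, "", NULL, NULL);

        printf("kernel built for whichever device the driver found\n");
        return 0;
    }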

If you want a historical example where this wasn't quite the case, look at the old 3dfx Voodoo series. They did add features, but they kept compatibility to the point where even a Voodoo 5 would work with software that only supported the Voodoo 1. (n.b. this is based on my memory of the era; I could be wrong.) They had other business problems, but it meant that adding completely new features and changes to Glide (their API) was more difficult.


RISC-V is an ISA rather than silicon; GPUs are generally black boxes that you throw code at. There's not much to standardize around.


AMD document their ISA: https://llvm.org/docs/AMDGPUUsage.html#additional-documentat...

That's why third party open shader compilers like ACO could be made:

https://gitlab.freedesktop.org/mesa/mesa/-/tree/master/src/a...



AMD document their ISAs, but each one maps pretty much one-to-one onto a particular implementation. Compatibility and standardization are not goals.


As long as they make an open source compiler, there is at least a reference implementation to compare to.


GPUs do have ISAs. It's just that they're typically hidden behind drivers that provide a more standardized API.


Of course they have ISAs. My point is that the economics of standardizing around a single ISA a la RISC-V isn't as good, by virtue of the way we use GPUs on today's computers. You could make a GPU-V, but why would a manufacturer use it?


> GPUs are generally black boxes that you throw code at.

umm... what? what does that even mean? lol

I can kind of see your argument from the graphics side, since users mostly interact with it at an API level, but keep in mind that shader languages work the same way "CPU languages" do. It's all still compiled down to assembly, and there's no reason you couldn't make an open instruction set for a GPU the same as for a CPU. This is especially obvious when it comes to compute workloads, where you're probably just writing "regular code".
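To make that concrete, OpenCL will even hand back the device-specific binary the driver produced; as I understand it, on AMD hardware that blob contains the same documented ISA mentioned upthread. A minimal sketch, assuming the program object has already been built for a single device and skipping error handling:

    #include <CL/cl.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Dump the driver-compiled, device-specific binary of an already
       built program (single-device case). What's inside is vendor
       machine code, not the portable source you wrote. */
    void dump_device_binary(cl_program prog, const char *path) {
        size_t size = 0;
        clGetProgramInfo(prog, CL_PROGRAM_BINARY_SIZES,
                         sizeof(size), &size, NULL);

        unsigned char *binary = malloc(size);
        unsigned char *binaries[] = { binary };  /* one slot per device */
        clGetProgramInfo(prog, CL_PROGRAM_BINARIES,
                         sizeof(binaries), binaries, NULL);

        FILE *f = fopen(path, "wb");
        fwrite(binary, 1, size, f);
        fclose(f);
        free(binary);
    }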

Now, that said, would it be a good idea? I don't really see the benefit. A bare-bones GPU ISA would be too stripped back to do anything at all, and one with the specific acceleration features that make it useful is exactly what vendors will always want to keep under wraps.


Just 'cause Nvidia might want to keep architectural access under wraps doesn't necessarily mean that everyone else is going to, or that they have to in order to maintain a competitive advantage. CPU architectures are public knowledge, because people need to write compilers for them, and there are still all sorts of other barriers to entry and patent protections that would allow maintaining competitive advantage through new architectural innovations. This smells less of a competitive risk and more of a cultural problem.

I'm reminded of the argument over low-level graphics APIs almost a decade ago. AMD had worked with DICE to write a new API for their graphics cards called Mantle, while Nvidia was pushing "AZDO" (approaching zero driver overhead) techniques for getting the best performance out of existing OpenGL 4. Low-level APIs were supposedly too complicated for graphics programmers, for too little benefit. Nvidia's idea was that we just needed to get developers onto the OpenGL happy path and then all the CPU overhead of the API would melt away.

Of course, AMD's idea won, and pretty much every modern graphics API (DX12, Metal, WebGPU) provides low-level abstractions that map closely to how the hardware actually works. Hell, SPIR-V is already halfway to being a GPU ISA. The reason OpenGL became such a high-overhead API was specifically this idea of "oh no, we can't tell you how the magic works". Actually getting all the performance out of the hardware became harder and harder because you were programming against a device model that was obsolete 10 years ago. Things like explicit multi-GPU were just flat-out impossible. "Here's the tools to be high performance on our hardware" will always beat "stay on our magic compiler's happy path", any day of the week.


You could make a standardized GPU instruction set but why would anyone use it? We don't currently access GPUs at that level, like we do with the CPU.

It's technically possible, but the economics isn't there (that was my point). The cost of making a new GPU generally includes writing drivers and shader compilers anyway, so there's not much motivation to bother complying with a standard. It would be different if we did expose GPUs at a lower level (i.e., if CPUs were programmed via a JITted bytecode, we wouldn't see as much focus on the ISA either, as long as the higher-level semantics were preserved).


SPIR-V looks like a promising standardization. It can't be translated directly into silicon, but it doesn't have to be. Intel also essentially emulates x86 and runs RISC internally.
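And the way SPIR-V is consumed today already looks a lot like handing the driver an ISA-ish blob. A minimal sketch with Vulkan (no error handling; spirv_words/spirv_size_bytes are assumed to hold an already-compiled module):

    #include <vulkan/vulkan.h>

    /* SPIR-V is the portable hand-off point: the application ships these
       32-bit words, and the driver's back end finishes the job of
       lowering them to the actual hardware ISA. */
    VkShaderModule make_module(VkDevice device,
                               const uint32_t *spirv_words,
                               size_t spirv_size_bytes) {
        VkShaderModuleCreateInfo info = {
            .sType    = VK_STRUCTURE_TYPE_SHADER_MODULE_CREATE_INFO,
            .codeSize = spirv_size_bytes,
            .pCode    = spirv_words,
        };
        VkShaderModule module = VK_NULL_HANDLE;
        vkCreateShaderModule(device, &info, NULL, &module);
        return module;
    }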


> Intel also essentially emulates x86 and runs RISC internally.

By that logic anything emulates its ISA, because that's what an ISA is: the public interface of a processor. You're also wrong about what x86 processors run internally. Several micro-ops can be fused into a single complex one, which isn't something you can describe with a term from the '60s. Come on, let the RISC corpse rot in peace; it's long overdue.


GPUs are also much simpler chips in comparison to CPUs.

90%+ of the core logic area (the stuff that is not I/O, power, memory, or clock distribution) on a GPU is very basic matrix multipliers.

They are in essence linear algebra accelerators. Not much space for sophistication there.

All the best possible arithmetic circuits, multipliers, dividers, etc. are public knowledge.


I've been studying and blogging about GPU compute for a while, and can confidently assert that GPUs are in fact astonishingly complicated. As evidence, I cite Volume 7 of the Intel Kaby Lake GPU programmers manual:

https://01.org/sites/default/files/documentation/intel-gfx-p...

That's almost 1000 pages, and it's one of 16 volumes; it just happens to be the one most relevant to programmers. If this is your idea of "simple," I'd really like to see your idea of a complex chip.


The most complex circuit on the GPU would be the thing that chops up the incoming command stream and turns it into something the matrix multipliers can work on.


I get the feeling you're only really thinking about machine-learning-style workloads. Your statement doesn't seem to take into account the scatter/gather logic for memory traffic (including combine logic for uniforms), resolution of bank conflicts, the sorting logic that gives blend operations in-order semantics, the fine rasterizer (called the "crown jewels of the hardware graphics pipeline" in an Nvidia paper), etc. More to the point, these are all things that CPUs don't have to deal with.

Conversely, there is a lot of logic on a modern CPU to extract parallelism from a single thread, stuff like register renaming, scoreboards for out of order execution, and highly sophisticated branch prediction units. I get the feeling this is the main stuff you're talking about. But this source of complexity does not dramatically outweigh the GPU-specific complexity I cited above.


Isn't the rasteriser "simply" a piece of code running on the GPU?


No, there is dedicated hardware for it, and it makes a big difference: ballpark 2x, but it can be more or less depending on the details of the workload (i.e. shader complexity).

One way to get an empirical handle on this question is to write a rasterization pipeline entirely in software and run it in GPU compute. The classic Laine and Karras paper does exactly that:

https://research.nvidia.com/publication/high-performance-sof...
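For a flavour of what the fixed-function unit computes, here's the textbook edge-function coverage test at the heart of any rasterizer, hardware or software. This is a bare sketch in C: no fill rules, clipping, tiling, or subpixel precision, which is exactly where the real hardware earns its keep.

    typedef struct { float x, y; } Vec2;

    /* Signed area test: >= 0 means p lies on the inside of edge a->b
       for a counter-clockwise triangle. */
    static float edge(Vec2 a, Vec2 b, Vec2 p) {
        return (b.x - a.x) * (p.y - a.y) - (b.y - a.y) * (p.x - a.x);
    }

    /* Mark every pixel whose center falls inside the triangle. A real
       rasterizer does this hierarchically, in parallel, with exact
       fixed-point math and proper fill rules. */
    void rasterize(Vec2 v0, Vec2 v1, Vec2 v2,
                   int w, int h, unsigned char *mask) {
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                Vec2 p = { x + 0.5f, y + 0.5f };
                if (edge(v0, v1, p) >= 0 &&
                    edge(v1, v2, p) >= 0 &&
                    edge(v2, v0, p) >= 0)
                    mask[y * w + x] = 1;
            }
        }
    }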

An intriguing thought experiment is to imagine a stripped-down, highly simplified GPU that is much more of a highly parallel CPU than a traditional graphics architecture. This is, to some extent, what Tim Sweeney was talking about (11 years ago now!) in his provocative talk "The end of the GPU roadmap". My personal sense is that such a thing would indeed be possible but would be a performance regression on the order of 2x, which would not fly in today's competitive world. But if one were trying to spin up a GPU effort from scratch (say, motivated by national independence more than cost/performance competitiveness), it would be an interesting place to start.


Intel will ship Larrabee any day now... :)


The host interface has to be one of the simplest parts of the system, and I mean no disrespect to the fine engineers who work on that. Even the various internal task schedulers look more complex to me.

If you don't have insider's knowledge of how these things are made, I suggest using less certain language.


> The most complex circuit on the GPU would be the thing chops the incoming command stream

That's not true at all. I'd recommend reading up on current architectures, or avoid making such wild assumptions.


I completely disagree with this comment.

Just because a big part of the chip is shading units doesn't mean it's simple or that there's no space for sophistication. Have you even been following the recent advances in GPUs?

There is a lot of space for absolutely everything to improve, especially now that ray tracing is a possibility and it uses the GPU in a very different way from old-style rasterization. Expect to see a whole lot of new instructions in the coming years.


> 90%+ of the core logic area (the stuff that is not I/O, power, memory, or clock distribution) on a GPU is very basic matrix multipliers.

> All the best possible arithmetic circuits, multipliers, dividers, etc. are public knowledge.

Combine those two statements and most GPUs should have roughly identical performance characteristics (performance/watt, performance/mm2, etc.).

And yet, you see that both AMD and Nvidia GPUs (but especially the latter) have seen massive changes in architecture and performance.

As for the 90% number itself: look at any modern GPU die shot and you'll see that 40% is dedicated just to moving data in and out of the chip. Memory controllers, L2 caches, raster functions, geometry handling, crossbars, ...

And within the remaining 60%, there are large amounts of caches, texture units, instruction decoders etc.

The pure math portions, the ALUs, are but a small part of the whole thing.

I don't know enough about the very low-level details of CPUs and GPUs to judge which is more complex, but when it comes to the claim that there's no space for sophistication, I can at least confidently say that I know much more than you.


> GPUs are also much simpler chips in comparison to CPUs

Funny you should say that. I've never heard of a CPU architect coming to the GPU world and saying "Gosh, how simple this is!".

I invite you to look at a GPU ISA and see for yourself, and that's only the visible programming interface.


Judging by your nickname, I think I have reason to listen. Are you the David who writes GPU drivers?

So, what do you think is the most complex thing on an Nvidia GPU?


Matrix multipliers? As in those tensor cores that are only used by convolutional neural networks? Aren't you forgetting something? Like the entire rest of the GPU? You're looking at this from an extremely narrow machine learning focused point of view.


Is that last sentence provable? If so, that's an impressively strong statement (that the provably most efficient circuit designs for these mathematical operations are known).


Well, at least for reasonably big integers, it is.

There might be faster algorithms for very long integers, or minute implementation differences that add or subtract a few kilogates.


> Well, at least for reasonably big integers, it is.

Big integer calculations are the bread and butter of GPUs now?



