
GPUs are also much simpler chips in comparison to CPUs.

90%+ of the core logic area (stuff that is not I/O, power, memory, or clock distribution) on the GPU is very basic matrix multipliers.

They are in essence linear algebra accelerators. Not much space for sophistication there.
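
For concreteness, the kind of workload I mean is just a big matrix multiply; a deliberately naive CUDA sketch (nothing like a tuned library kernel) would be:

    // Naive C = A * B for square NxN matrices, one thread per output element.
    // Real GPUs run tiled, tensor-core-backed versions of this, but the math
    // being accelerated is the same inner product.
    __global__ void matmul(const float *A, const float *B, float *C, int N) {
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        if (row >= N || col >= N) return;

        float acc = 0.0f;
        for (int k = 0; k < N; ++k)
            acc += A[row * N + k] * B[k * N + col];
        C[row * N + col] = acc;
    }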

All the best possible arithmetic circuits (multipliers, dividers, etc.) are public knowledge.



I've been studying and blogging about GPU compute for a while, and can confidently assert that GPUs are in fact astonishingly complicated. As evidence, I cite Volume 7 of the Intel Kaby Lake GPU programmer's manual:

https://01.org/sites/default/files/documentation/intel-gfx-p...

That's almost 1000 pages, and it's just one of 16 volumes; it happens to be the one most relevant to programmers. If this is your idea of "simple," I'd really like to see your idea of a complex chip.


The most complex circuit on the GPU would be the thing that chops up the incoming command stream and turns it into something the matrix multipliers can work on.
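
As a toy model (purely hypothetical packet format, not any real GPU's), that front end is conceptually a loop that walks the stream and hands work to the execution units:

    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // Hypothetical command packets; real front ends also handle state changes,
    // synchronization, preemption, indirect buffers, and much more.
    enum Opcode : uint32_t { SET_STATE = 0, DISPATCH = 1, END = 2 };

    struct Packet { Opcode op; uint32_t arg0, arg1; };

    void command_processor(const std::vector<Packet> &stream) {
        uint32_t bound_state = 0;
        for (const Packet &p : stream) {
            switch (p.op) {
                case SET_STATE:
                    bound_state = p.arg0;                     // latch pipeline state
                    break;
                case DISPATCH:
                    std::printf("launch %u x %u workgroups with state %u\n",
                                p.arg0, p.arg1, bound_state); // hand off to the shader cores
                    break;
                case END:
                    return;
            }
        }
    }

    int main() {
        command_processor({{SET_STATE, 7, 0}, {DISPATCH, 64, 64}, {END, 0, 0}});
    }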


I get the feeling you're only really thinking about machine learning style workloads. Your statement doesn't seem to take into account scatter/gather logic for memory traffic (including combine logic for uniforms), resolution of bank conflicts, sorting logic for making blend operations have in-order semantics, the fine rasterizer (which is called the "crown jewels of the hardware graphics pipeline" in an Nvidia paper), etc. More to the point, these are all things that CPUs don't have to deal with.
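
To make two of those concrete, here is a contrived CUDA sketch (purely illustrative, assuming a single 32-thread warp) of a gather from data-dependent addresses plus a shared-memory access pattern that forces bank conflicts; the coalescing and bank-arbitration hardware is what keeps patterns like these correct and reasonably fast:

    // One warp gathers from arbitrary indices, then writes to shared memory
    // with a stride that puts every lane in the same bank. The memory system
    // splits the gather into however many transactions it needs, and the
    // banked shared memory serializes the conflicting accesses.
    __global__ void gather_and_conflict(const float *src, const int *idx, float *dst) {
        __shared__ float tile[32 * 32];
        int lane = threadIdx.x;              // assume blockDim.x == 32: one warp

        float v = src[idx[lane]];            // gather: data-dependent addresses

        tile[lane * 32] = v;                 // stride-32: 32-way bank conflict
        __syncthreads();

        dst[lane] = tile[lane * 32];
    }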

Conversely, there is a lot of logic on a modern CPU to extract parallelism from a single thread, stuff like register renaming, scoreboards for out of order execution, and highly sophisticated branch prediction units. I get the feeling this is the main stuff you're talking about. But this source of complexity does not dramatically outweigh the GPU-specific complexity I cited above.
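
A classic way to see why that CPU-side machinery exists is to time the same data-dependent branch over unsorted and then sorted input (a well-known demonstration; whether the gap shows up depends on whether the compiler keeps the branch or lowers it to a conditional move):

    #include <algorithm>
    #include <chrono>
    #include <cstdio>
    #include <random>
    #include <vector>

    int main() {
        std::vector<int> data(1 << 24);
        std::mt19937 rng(42);
        for (int &x : data) x = rng() % 256;

        auto timed_sum = [&](const char *label) {
            auto t0 = std::chrono::steady_clock::now();
            long long sum = 0;
            for (int x : data)
                if (x >= 128) sum += x;      // data-dependent branch
            auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                std::chrono::steady_clock::now() - t0).count();
            std::printf("%s: sum=%lld in %lld ms\n", label, sum, (long long)ms);
        };

        timed_sum("unsorted");               // branch is ~random: frequent mispredicts
        std::sort(data.begin(), data.end());
        timed_sum("sorted");                 // branch is almost perfectly predictable
    }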


Isn't the rasteriser "simply" a piece of code running on the GPU?


No, there is hardware for it, and it makes a big difference. Ballpark 2x, but it can be more or less depending on the details of the workload (i.e. shader complexity).

One way to get an empirical handle on this question is to write a rasterization pipeline entirely in software and run it in GPU compute. The classic Laine and Karras paper does exactly that:

https://research.nvidia.com/publication/high-performance-sof...
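
To give a flavor of what "entirely in software" means there, the innermost piece is just an edge-function coverage test per pixel. A stripped-down CUDA sketch (no binning, no interpolation, no blending, unlike the paper's full pipeline) looks roughly like this:

    struct Tri { float x0, y0, x1, y1, x2, y2; };    // screen-space vertices

    // Positive when p is to the left of the directed edge a->b.
    __device__ float edge(float ax, float ay, float bx, float by, float px, float py) {
        return (bx - ax) * (py - ay) - (by - ay) * (px - ax);
    }

    // One thread per pixel, one triangle: the bare coverage test. The paper's
    // pipeline adds hierarchical binning, tile queues, and ordered blending.
    __global__ void rasterize(Tri t, float *fb, int width, int height) {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= width || y >= height) return;

        float px = x + 0.5f, py = y + 0.5f;          // sample at pixel center
        float w0 = edge(t.x1, t.y1, t.x2, t.y2, px, py);
        float w1 = edge(t.x2, t.y2, t.x0, t.y0, px, py);
        float w2 = edge(t.x0, t.y0, t.x1, t.y1, px, py);

        if (w0 >= 0.0f && w1 >= 0.0f && w2 >= 0.0f)  // inside a CCW triangle
            fb[y * width + x] = 1.0f;                // write coverage
    }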

An intriguing thought experiment is to imagine a stripped-down, highly simplified GPU that is much more of a highly parallel CPU than a traditional graphics architecture. This is, to some extent, what Tim Sweeney was talking about (11 years ago now!) in his provocative talk "The end of the GPU roadmap". My personal sense is that such a thing would indeed be possible but would be a performance regression on the order of 2x, which would not fly in today's competitive world. But if one were trying to spin up a GPU effort from scratch (say, motivated by national independence more than cost/performance competitiveness), it would be an interesting place to start.


Intel will ship Larrabee any day now... :)


The host interface has to be one of the simplest parts of the system, and I mean no disrespect to the fine engineers who work on that. Even the various internal task schedulers look more complex to me.

If you don't have insider's knowledge of how these things are made, I suggest using less certain language.


> The most complex circuit on the GPU would be the thing that chops up the incoming command stream

That's not true at all. I'd recommend reading up on current architectures, or avoid making such wild assumptions.


I completely disagree with this comment.

Just because a big part of the chip is shading units doesn't mean it's simple or that there's no space for sophistication. Have you even been following the advancements in recent GPUs?

There is a lot of space for absolutely everything to improve, especially now that ray tracing is a possibility and it uses the GPU in a very different way than old-style rasterization does. Expect to see a whole lot of new instructions in the next few years.


> 90%+ of the core logic area (stuff that is not I/O, power, memory, or clock distribution) on the GPU is very basic matrix multipliers.

> All the best possible arithmetic circuits (multipliers, dividers, etc.) are public knowledge.

Combine these two statements and most GPUs would have roughly identical performance characteristics (performance/watt, performance/mm2, etc.).

And yet, you see that both AMD and Nvidia GPUs (but especially the latter) have seen massive changes in architecture and performance.

As for the 90% number itself: look at any modern GPU die shot and you'll see that 40% is dedicated just to moving data in and out of the chip. Memory controllers, L2 caches, raster functions, geometry handling, crossbars, ...

And within the remaining 60%, there are large amounts of caches, texture units, instruction decoders etc.

The pure math portions, the ALUs, are but a small part of the whole thing.

I don't know enough about the very low-level details of CPUs and GPUs to judge which ones are more complex, but when it comes to the claim that there's no space for sophistication, I can at least confidently say that I know much more about this than you do.


> GPUs are also much simpler chips in comparison to CPUs

Funny you say that. I've never heard a CPU architect come to the GPU world and say, "Gosh, how simple this is!"

I invite you to look at a GPU ISA and see for yourself, and that is only the visible programming interface.


Judging by your nickname, I think I have reason to listen. Are you that David who writes GPU drivers?

So, what do you think is the most complex thing on an Nvidia GPU?


Matrix multipliers? As in those tensor cores that are only used by convolutional neural networks? Aren't you forgetting something? Like the entire rest of the GPU? You're looking at this from an extremely narrow machine learning focused point of view.
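
For reference, the tensor cores are exposed as warp-wide 16x16 matrix-multiply-accumulate operations; a rough sketch of the CUDA WMMA API (fp16 inputs, fp32 accumulation, needs a Volta-or-newer part) is below. Everything else on the die exists so that units like this one have data to chew on.

    #include <cuda_fp16.h>
    #include <mma.h>
    using namespace nvcuda;

    // One warp computes a single 16x16 tile of C += A * B on the tensor cores.
    // A real GEMM tiles this over the whole matrix and stages data through
    // shared memory.
    __global__ void wmma_tile(const half *A, const half *B, float *C) {
        wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
        wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

        wmma::fill_fragment(c_frag, 0.0f);
        wmma::load_matrix_sync(a_frag, A, 16);       // leading dimension 16
        wmma::load_matrix_sync(b_frag, B, 16);
        wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
        wmma::store_matrix_sync(C, c_frag, 16, wmma::mem_row_major);
    }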


Is that last sentence provable? If so, that's an impressively strong statement (to state that the provably most efficient arithmetic circuit designs are known).


Well, at least for reasonably big integers, it is.

There might be faster algorithms for super long integers, or minute implementation differences that add or subtract a few kilogates.
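
"Faster algorithms" here means things like Karatsuba. A toy version for a single 32x32-bit multiply (split each operand into 16-bit halves, three small multiplies instead of four) shows the idea; bignum libraries apply the same recursion to huge operands, while below a crossover size the schoolbook method that hardware multiplier arrays implement wins:

    #include <cstdint>
    #include <cstdio>

    // Karatsuba for one 32x32 -> 64-bit multiply: three 16x16 products
    // instead of four. All intermediates fit comfortably in 64 bits.
    uint64_t karatsuba32(uint32_t a, uint32_t b) {
        uint64_t a0 = a & 0xffff, a1 = a >> 16;
        uint64_t b0 = b & 0xffff, b1 = b >> 16;

        uint64_t z0 = a0 * b0;                            // low  x low
        uint64_t z2 = a1 * b1;                            // high x high
        uint64_t z1 = (a0 + a1) * (b0 + b1) - z0 - z2;    // both cross terms via one multiply

        return (z2 << 32) + (z1 << 16) + z0;
    }

    int main() {
        uint32_t a = 123456789u, b = 987654321u;
        std::printf("%llu vs %llu\n",
                    (unsigned long long)karatsuba32(a, b),
                    (unsigned long long)((uint64_t)a * b));
    }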


> Well, at least for reasonably big integers, it is.

Big integer calculations are the bread and butter of GPUs now?



