
oddity · on July 12, 2022

The difference is much more nuanced than this. A modern GPU can (and probably does) do most of what you've listed for a CPU. Speculative execution and branch prediction are a bit less likely to be invested in (because they don't need it as much due to oversubscription), but that's increasingly true for CPUs as well for high-efficiency cores. The difference (at a category vs category level and not specific microarch) is mostly a matter of tuning for particular workloads. I'm increasingly souring on SIMD/SIMT being a useful distinction now that bleeding-edge CPUs are widening in the microarch and bleeding-edge GPUs are getting better at handling thread divergence in the microarch. There is a difference, certainly, but it's difficult to describe in a few bullet points.

GPUs are more likely to have more exotic features than you'll see on a CPU to deal with things like thread coordination and cache coherence, but there's nothing fundamentally stopping CPUs from adding that (or wanting that) as well.

Lichtso · on July 12, 2022

> GPUs are getting better at handling thread divergence in the microarch

That is an interesting point. How does that work (especially with the dynamics of ray tracing)? Do they recombine underutilized wavefronts or something?

dragontamer · on July 12, 2022

I'm not aware of anything that improves thread-divergence. NVidia's most recent GPUs have superscalar operations, which is a trick from CPU-land (multiple pipelines operating 2 or more instructions per clock tick). NVidia has an integer-pipeline and a floating-point pipeline, and both can operate simultaneously (ex: in for(int i=0; i<100; i++) x *= blah; the "i++" is integer, while the "x *= blah" is floating point, so both execute at the same time).

CPUs have extremely flexible pipelines: Intel's pipeline 0 and 1 basically can do anything, pipeline 5 can do most stuff but is missing division IIRC (and a few other things). Load/store are done on some other pipelines, etc. etc.

Apple's and AMD's CPU pipelines are more symmetrical and uniform.

NVidia GPUs are the only superscalar ones I can think of, aside from AMD GPU's scalar vs vector split (which isn't really the "superscalar" operation I'm trying to describe).

TomVDB · on July 12, 2022

Starting with Volta, Nvidia GPUs have forward progress guarantee, preventing lockups when there’s thread divergence.

That doesn’t improve the performance of a well behaved and well written compute shader. But avoiding hard hangs IMO deserves the label “improved thread divergence.”

jjoonathan · on July 12, 2022

Aren't warps still 32 threads, even though number of threads is skyrocketing, effectively making them proportionately finer granularity? Are things different in AMD land?

JonChesterfield · on July 12, 2022

Slightly, the older tech is 64 threads/lanes per warp/wavefront. Newer ones are 32 by default but 64 if desired.

Bigger differences are the instruction counter per thread since volta on nvidia (which I think is a terrible feature) and that forward progress guarantees are stronger on nvidia (those are _really_ helpful but expensive).

TomVDB · on July 12, 2022

Nvidia GPUs were 32 threads per warp right from the start of CUDA with the 8800 GTX.

> which I think is a terrible feature [...] those are _really_ helpful but expensive

Guaranteed forward progress is a direct consequence of having an instruction counter per thread???

Or so I thought. How else would an SM be able to know the PC of a group of threads that wasn’t stuck?

dragontamer · on July 12, 2022

> Slightly, the older tech is 64 threads/lanes per warp/wavefront. Newer ones are 32 by default but 64 if desired.

AMD GCN was 64 threads/wavefront. NVidia always was 32 threads/warp.

AMD's newest consumer cards RDNA and RDNA2 are 32 threads/wavefront. However, GCN lives on with CDNA (MI200 supercomputer chips), with 64 threads/wavefront architecture.

djmips · on July 13, 2022

There is tech in late-model GPUs to keep threads that take the same divergent path together in the same warp/wavefront.

oddity · on July 12, 2022

Most people just want an allocator that works reasonably well, and maintaining that expectation means not exposing too many details that you might be held accountable to. If you care deeply, there are usually alternatives.

There's nothing really stopping systems from exposing a hint for controlling this, but usually the people who might care about it don't just want hints, but actual guarantees, and then you have to consider all the users who hint incorrectly (or correctly, but for a different system/version). So, the benefit/cost ratio is low.

Integration of GC with thread scheduling was once an active area of research, but the world has mostly moved on (perhaps prematurely, but so goes).

oddity · on July 12, 2022

Depends on how you define innovation.

The hard truth is that there is no free lunch. We like to pretend we're using a Turing machine, but the moment you start caring about performance or memory limits, the abstraction breaks down and you realize that physics dictates that we have a finite state machine instead. This was "solved" about as well as it could be many years ago with however many billions were poured into GC R&D for all the people who don't care deeply about the limits, but nothing is going to magic away that fundamental trade-off unless we discover new physics.

Every innovation since then has been about developing workarounds to deal with that trade-off in more or less sophisticated ways.

oddity · on July 2, 2022

If you're depending on the performance of malloc, you're either using the language incorrectly or using the wrong language. There is no such thing as a general purpose anything when you care about performance, there's only good enough. If you are 1) determined to stick with malloc and 2) want something predictable and better, then you are necessarily on the market for one of the alternatives to the system malloc anyway.

mwcampbell · on July 2, 2022

The whole point of the article, though, was that the system malloc was good enough on Linux and Darwin.

oddity · on July 2, 2022

This misses the point of my comment. When you put faith in malloc, you're putting hope in a lot of heuristics that may or may not degenerate for your particular workload. Windows is an outlier with how bad it is, but that should largely be irrelevant because the code should have already been insulated from the system allocator anyway.

An over-dependence on malloc is one of the first places I look when optimizing old C++ codebases, even on Linux and Darwin. Degradation on Linux + macOS is still there, but more insidious because the default is so good that simple apps don't see it.
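One common way to insulate code from the system allocator, as described above, is a bump ("arena") allocator that calls malloc once and then hands out memory by advancing an offset. The sketch below is only an illustration under assumptions: the `Arena` type and the `arena_*` function names are hypothetical, it requires C11 for `max_align_t` and `_Alignof`, and it is not thread-safe.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical bump ("arena") allocator: one malloc up front,
   offset bumps afterwards. Names are illustrative only. */
typedef struct {
    unsigned char *base;  /* start of the block obtained from malloc */
    size_t capacity;      /* total bytes in the block */
    size_t used;          /* bytes handed out so far */
} Arena;

/* Returns 1 on success, 0 if the single upfront malloc fails. */
int arena_init(Arena *a, size_t capacity)
{
    a->base = malloc(capacity);
    a->capacity = a->base ? capacity : 0;
    a->used = 0;
    return a->base != NULL;
}

/* Hands out `size` bytes aligned for any object type,
   or NULL when the arena does not have enough room left. */
void *arena_alloc(Arena *a, size_t size)
{
    const size_t align = _Alignof(max_align_t);
    size_t start = (a->used + align - 1) & ~(align - 1); /* round up */
    if (start > a->capacity || size > a->capacity - start)
        return NULL;
    a->used = start + size;
    return a->base + start;
}

/* Releases every allocation at once without touching the system allocator. */
void arena_reset(Arena *a)
{
    a->used = 0;
}

/* Returns the backing block to the system allocator. */
void arena_destroy(Arena *a)
{
    free(a->base);
    a->base = NULL;
    a->capacity = 0;
    a->used = 0;
}
```

With this shape, the hot path never enters malloc's heuristics: a per-frame or per-request workload would call `arena_alloc` many times and `arena_reset` once, so the cost of the platform's allocator is paid only at `arena_init` and `arena_destroy`.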

dzaima · on July 2, 2022

Except that I'd guess that there is no "good" case for MSVCRT's malloc. You shouldn't assume malloc is free, but you should also be able to assume it won't be horrifyingly slow. Just as much as you should be able to rely on "x*y" not compiling to an addition loop over 0..y (which might indeed be very fast when y is 0).

Yes, this unfortunately isn't the reality MSVCRT is in, but it is quite a reasonable expectation.

oddity · on July 2, 2022

It's unreasonable to assume that an stdlib must be designed around performance to any capacity. For most software, the priorities for the stdlib are 1) existing, 2) being bug/vulnerability free, and likely, in the Windows case given Microsoft's tradition, 3) being functionally identical to the version they shipped originally. Linux and macOS have much more flexibility to choose a different set of priorities (the former, through ecosystem competition and the latter through a willingness to break applications and a dependence on malloc for objc), so it's not at all a fair comparison. The fact that malloc doesn't return null all the time is a miracle enough for many embedded platforms, for example, so it's not exclusively a Windows concern. Environments emphasizing security in particular might be even slower.

Multiplication is not a great argument... There's a long history of hardware that doesn't have multipliers. Would I complain about that hardware being bad? No, because I'd take a step back and ask what their priorities were and accept that different hardware has different priorities so I should be prepared to not depend on them. Same thing with standard libraries. You can't always assume the default allocator smiles kindly on your application.

dzaima · on July 2, 2022

I don't see a reason for the stdlib to be considered in a different way from the base language is all I'm saying. For most C programmers, the distinction between the stdlib and the base language isn't even a consideration. Thinking most software doesn't heavily rely on malloc (and the rest of the stdlib) being fast is stupid.

Even on hardware without a multiplier you'd do a shift-based version, with log_2(max_value) iterations. What's unreasonable is "for (int i = 0; i < y; i++) res+= x;". If there truly were no way to do a shift, then, sure, I'd accept the loop; but I definitely would be pretty mad at a compiler if it generated a loop for multiplication on x86_64. And I think it's reasonable to be mad at the stdlib being purposefully outdated too (even if there is a (bad) reason for it).
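For reference, the shift-based version mentioned above could look something like the following. This is a minimal sketch for unsigned 32-bit operands; the function name `mul_shift_add` is made up for illustration, and the result wraps modulo 2^32 like ordinary unsigned multiplication.

```c
#include <stdint.h>

/* Multiply using only shifts, adds, and bit tests.
   The loop runs once per bit of y up to its highest set bit,
   so at most 32 times, instead of y times as in the
   repeated-addition loop. */
uint32_t mul_shift_add(uint32_t x, uint32_t y)
{
    uint32_t res = 0;
    while (y != 0) {
        if (y & 1u)   /* lowest remaining bit of y is set */
            res += x; /* add x scaled by that bit's power of two */
        x <<= 1;      /* scale x for the next bit */
        y >>= 1;      /* move on to the next bit of y */
    }
    return res;
}
```

For y = 1000000 the repeated-addition loop takes a million iterations, while this one takes 20, one per significant bit of y.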

oddity · on July 3, 2022

C and C++ are some of the few languages where the spec goes out of its way to not depend on an allocator, for good reason, and this is well after you've accounted for the majority of the code that, hopefully, doesn't need to do memory allocation at all. The fact that many programmers don't care is an indication that most code in most C or C++ software is not written with performance in mind. And that's (sometimes) fine. LLVM has a good ecosystem reason to use C++, for example, and it's well known in the compiler space that LLVM is not fast. Less recently, for a long time C and C++ were considered high level languages, meaning lots of software was written in them without consideration of performance. But criticizing the default implementation's performance absent a discussion of its priorities when you have all the power to not be bottlenecked in it anyway is just silly.

dzaima · on July 3, 2022

The fact that you should avoid allocation when possible has absolutely nothing to do with how fast allocation should be when you need it. And code not written with performance in mind should still be as fast as reasonably possible by default.

I would assume that quite a few people actually trying to write fast code would just assume that malloc, being provided to you by your OS, would be in the best position to know how to be fast. Certainly Microsoft has the resources to optimize the two most frequently invoked functions in most C/C++ codebases, at least more than you yourself would.

MSVCRT being stuck with the current extremely slow thing, even if there are truly good reasons for it, is still a horrible situation to be in.

Dylan16807 · on July 3, 2022

> If there truly were no way to do a shift, then, sure, I'd accept the loop

Not even then. You can just use an addition instead of a shift.
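A sketch of what that could look like: doubling by addition stands in for the left shift, and a mask that is itself doubled by addition stands in for shifting y right. It assumes the hardware still has a bitwise AND and a compare, and the name `mul_add_only` is made up for illustration.

```c
#include <stdint.h>

/* Logarithmic-time multiply with no shift instructions.
   Assumes bitwise AND and unsigned compare are available. */
uint32_t mul_add_only(uint32_t x, uint32_t y)
{
    uint32_t res = 0;
    uint32_t mask = 1;     /* walks through 1, 2, 4, ... by doubling */
    while (mask != 0 && mask <= y) {
        if (y & mask)      /* this bit of y is set */
            res += x;
        x += x;            /* doubling by addition replaces x <<= 1 */
        mask += mask;      /* selects the next bit; wraps to 0 after bit 31 */
    }
    return res;
}
```

The iteration count is still bounded by the number of bits in y, so the loop over 0..y is avoidable even without a shifter.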

Dylan16807 · on July 3, 2022

> It's unreasonable to assume that an stdlib must be designed around performance to any capacity.

To any capacity? That's insane.

It's not a reference implementation to show you what the correct results should be. It's the standard. The default.

jeffbee · on July 2, 2022

There isn't really a "system malloc on Linux". Many distributions come with the GNU allocator based on ptmalloc2, but there is no particular reason that a distro could not come out of the box with any other allocator. The world's most widespread Linux distribution uses LLVM's Scudo allocator. Alpine Linux comes with musl's (unbelievably slow) allocator, although it is possible to rebuild it with mimalloc.

oddity · on June 6, 2022

The semiconductor world is very small by tech standards, and so there are correspondingly many fewer people at the management or upper IC levels that can coordinate large, consequential decisions and even fewer that become well known. Even ignoring attribution biases, it should be unsurprising that there are a few folks that seem to have outsized influence.

But I will say that there are many more people who are not well known outside of the semiconductor world, but are minor legends within it. Unfortunately, the development processes for hardware tend to go through enough hands to strip attribution and most hardware people tend not to talk in public about their accomplishments. Doesn't help that some of the main companies have been stagnant and absorbed in inner and outer turf wars for nontechnical reasons. It's not a great environment for stars to shine.

On top of that, most software people I've seen seem uninterested in understanding and dissecting the hardware enough to appreciate what they've done. If that's the case for software people, I have no hope for anyone else.

oddity · on June 6, 2022

The mobile phone market wouldn't have saved them. Qualcomm killed off everyone who wasn't Apple with their licensing schemes. The CPU tech was largely irrelevant because what really mattered was the wireless patents. Being acquired by Apple was probably the best outcome, and we have PA Semi to partially thank for the rise of ARM processors as we know them today.

oddity · on June 6, 2022

Considering this is explicitly part 8 of a series, I would recommend starting with part 1, where this is explained.

https://medium.com/@mario.arias.c/comparing-kotlin-and-golan...

The language website is here: https://monkeylang.org

oddity · on June 5, 2022

The PR is undeniably silly, but it's totally plausible that some people just have bad ideas and not bad intent.

The situation doesn't need any more escalation beyond fixing the underlying vector for spam.

oddity · on June 5, 2022

Every large org I've ever been at has had someone reply-all to a company-wide mailing list and then promptly spawn a flood of more company-wide reply-alls. Slack's @channel/@here has made this even worse. On a good day, people have a laugh, tell them not to do it again, and we all move on. The only difference here is that it's in public on github.

I suspect there's some psychological phenomenon that convinces people that norms expected of them have somehow been broken because of the chaos.

oddity · on May 30, 2022

There's an endless stream of monad tutorials and I think they (almost) all misunderstand the point of confusion. No one gives a f*** about monads, they care about state.

If you've internalized reasoning about a program's execution symbolically and in a time-independent way, then monads solve the problem of how you enforce a correct sequencing of otherwise time-independent operations at a library level without any fancy modifications to the type system. And oh by the way, this structure shows up everywhere and isn't that pretty cool. People who hang around in the formal PL world tend to assume this already because it emerges naturally from how we talk about the semantics of languages via symbolic manipulation.

But if you haven't grasped that yet, then monads solve a problem that you probably don't even realize exists and no amount of rephrasing the monad laws will help you.

I'd almost always recommend anyone new to Haskell ignore monads as much as possible. Fiddle around with evaluating toy functions with pen and paper symbolically to reason about the semantics, and then try to imagine how you could represent the state of a program changing across the page by creating a new object representing the state of the program from the old state. Then, dig in to the RealWorld type and how the IO Monad actually works (not at a type level).
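The "new state from the old state" exercise can be tried in any language. Below is a loose analogy in C, not a description of how GHC implements IO: the `State` struct, the `Step` type, and the function names are all hypothetical. Each step takes the old state by value and returns a new one, so the order of effects is fixed purely by which value feeds into which call.

```c
#include <stddef.h>

/* Hypothetical "program state": everything the steps may observe or change. */
typedef struct {
    int counter;
    int steps_run;
} State;

/* A step receives the old state by value and returns a new one.
   The caller's copy is never modified. */
typedef State (*Step)(State);

State increment(State s)
{
    s.counter += 1;    /* s is a private copy, not the caller's object */
    s.steps_run += 1;
    return s;
}

State double_it(State s)
{
    s.counter *= 2;
    s.steps_run += 1;
    return s;
}

/* Ordering is enforced by data dependence: step i+1 cannot run
   before step i has produced the state it consumes. */
State run_steps(State initial, const Step *steps, size_t n)
{
    State s = initial;
    for (size_t i = 0; i < n; i++)
        s = steps[i](s);
    return s;
}
```

Starting from a counter of 1, running `increment` then `double_it` gives 4, while the reverse order gives 3, and the initial state value is untouched in both cases. Threading a state value through every call by hand like this is the bookkeeping that the monadic interface packages up at the library level.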

huqedato · on May 30, 2022

Thanks for the explanation. The point is that I do not really intend to learn Haskell (I don't have the motivation for it); I've just been (very) curious to understand what this concept called a monad is about: - Why should they exist? Why can't they simply be replaced with control structures (if, case, switch etc.)? - Why is there so much fuss on the web, in the FP world, about monads?

hansvm · on May 30, 2022

They can't be explicitly replaced with control structures because implicit in that suggestion is the idea that when writing code you write what the code _does_. If you call the same function with the same inputs you can expect exactly the same machine level code to execute (perhaps taking different branches based on external state it reaches out for).

Contrast that with what happens in Haskell and friends. When writing code you define what the _inputs and outputs_ for particular functions should be. The actual code generating those inputs and outputs is free to be replaced or removed as the compiler sees fit. That buys you a lot of things (trivial parallelization, excellent type checking, ...), but it self-inflicts an extra problem we didn't have in the previous paradigm:

In the real world, we don't in fact just want to run a program and get an output. The way we interact with, e.g., a GUI is an important part of the program's behavior. If your mental model of code is that we're sequentially doing a series of things then this isn't ever a problem you would even have because you would just write your code to do the right things at the right time, but if you've adopted a model where you're defining outputs for your inputs you need some way to shove state and order of effects into that system for it to be useful.

Enter stage-left: monads! Yes they're pretty and ubiquitous and whatever, but the problem they solve for us is specifying an order of events (by virtue of taking the entire monad in as an input and spitting it out as an output) in a way wholly compatible with the type system we've already developed.