
RISC vs CISC isn't really about instructions doing "one simple thing period."

It's about increased orthogonality between ALU and memory operations, which makes it simpler and more predictable in an out-of-order superscalar design to decode instructions, properly track data dependencies, issue them to independent execution units, and stitch the results back into something that complies with the memory model before committing to memory.

Having a few crazy-ass instructions which either offload to a specialized co-processor or get implemented as specialized microcode for compatibility once you realize that the co-processor is more trouble than it's worth doesn't affect this very much.

What ARM lacks is the huge variety of instruction formats and addressing modes that Intel has, which substantially affects the size and complexity of the instruction decoder, and I'm willing to bet that creates a significant bottleneck on how large of a dispatch and reorder system they can have.

For a long time, Intel was able to make up this difference with process dominance, clever speculative execution tricks, and throwing a lot of silicon and energy at it which you can do on the server side where power and space are abundant.

But Intel is clearly losing the process dominance edge. Intel ceded the mobile race a long time ago. Power is becoming more important in data centers, which are struggling to deliver reliable power and cooling to increasingly power-hungry machines. And Intel's speculative execution smarts came back to bite them in the cloud, the big market they were winning, when it turned out those tricks could leak information between tenants, forcing Intel to disable many of them and give up some of their architectural performance edge.

And meanwhile, software has been catching up with the newer multi-threaded world. 10-15 years ago, dominance on single-threaded workloads still paid off considerably, because workloads that could take advantage of multiple cores with fine-grained parallelism were fairly rare. But systems and applications have been catching up; the C11/C++11 memory models make it significantly more feasible to write portable lock-free concurrent code. Go, Rust, and Swift bring safer and easier parallelism to application authors, and I'm sure the .NET and Java runtimes have seen improvements as well.
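As a concrete (if minimal) sketch of what the C11/C++11 model buys you: the standard acquire/release orderings let you write the classic publish/consume handoff portably, with no platform-specific fences. The names here are just illustrative:

```cpp
#include <atomic>
#include <thread>

// Producer writes the payload, then publishes it with a release store;
// the consumer spins on an acquire load. The C++11 memory model
// guarantees that once the consumer observes ready == true, it also
// observes data == 42; the same source works on x86's strong ordering
// and ARM's weaker ordering alike.
int data = 0;
std::atomic<bool> ready{false};

void producer() {
    data = 42;                                    // plain (non-atomic) store
    ready.store(true, std::memory_order_release); // publish
}

void consumer(int* out) {
    while (!ready.load(std::memory_order_acquire)) {
        // spin until the producer publishes
    }
    *out = data; // safe: the acquire load synchronizes-with the release store
}
```

On x86, the release store and acquire load compile to plain moves, since the hardware's ordering is already that strong; on ARM they become instructions like stlr/ldar. Same portable source, different machine-level cost.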

These increasingly parallel workloads are likely another reason that the more complex front-ends needed for Intel's instruction set, as well as their stricter memory ordering, are becoming increasingly problematic; it's becoming increasingly hard to fit more cores and threads into the same area, thermal, and power envelopes. Sure, they can do it on big power hungry server processors, but they've been missing out on all of the growth in mobile and embedded processors, which are now starting to scale up into laptops, desktops, and server workloads.

I should also say that I don't think this is the end of the road for Intel and x86. They have clearly had a number of setbacks over the last few years, but they've survived and thrived through plenty of trouble before, and they have a lot of capital and market share. They have squeezed more life out of the x86 instruction set than I thought possible, and I wouldn't be shocked if they kept doing that; they realized their Itanium investment was a bust and were able to pivot to x86-64 and dominate there. They're facing a lot of challenges right now, and there's more opportunity than ever for other entrants to upset them, but they also have enough resources and talent that, if they focus, they can probably come back and dominate for another few decades. It may be rough for a few years as they try to turn a very large boat, but I think it's possible.



> I'm willing to bet that creates a significant bottleneck on how large of a dispatch and reorder system they can have

My understanding is that the reorder buffer of the M1 is particularly large:

"A +-630 deep ROB is an immensely huge out-of-order window for Apple’s new core, as it vastly outclasses any other design in the industry. Intel’s Sunny Cove and Willow Cove cores are the second-most “deep” OOO designs out there with a 352 ROB structure, while AMD’s newest Zen3 core makes due with 256 entries, and recent Arm designs such as the Cortex-X1 feature a 224 structure."

https://www.anandtech.com/show/16226/apple-silicon-m1-a14-de...


> These increasingly parallel workloads are likely another reason that the more complex front-ends needed for Intel's instruction set, as well as their stricter memory ordering, are becoming increasingly problematic; it's becoming increasingly hard to fit more cores and threads into the same area, thermal, and power envelopes. Sure, they can do it on big power hungry server processors, but they've been missing out on all of the growth in mobile and embedded processors, which are now starting to scale up into laptops, desktops, and server workloads.

Except ARM CPUs aren't any more parallel than x86 CPUs in comparable power envelopes, and x86 doesn't seem to have any trouble hitting large core counts either. Most consumer software doesn't scale worth a damn, though; particularly ~every web app, which can't scale past 2 cores if it even scales past 1.


Parallelism isn't a good idea when scaling down, and often neither is concurrency. Going faster is still a good idea on phones (running the CPU at a higher speed can use less battery, because it gets to turn off sooner), but once you count background services there is typically less than one core free; threading and asyncing carry overhead, and your program will often go faster if you take most of it out.
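To make that overhead point concrete, here's a toy C++ sketch (the function names and chunking scheme are just illustrative): both versions compute the same sum, but for a small input the thread spawn/join cost in the "parallel" version typically dwarfs the actual work.

```cpp
#include <numeric>
#include <thread>
#include <vector>

// Sequential sum: no setup cost at all.
long sum_sequential(const std::vector<int>& v) {
    return std::accumulate(v.begin(), v.end(), 0L);
}

// "Parallel" sum: one thread per chunk. Each std::thread costs a
// syscall, a stack allocation, and a join; for a few thousand
// elements that overhead usually exceeds the summing work itself.
long sum_threaded(const std::vector<int>& v, int nthreads) {
    std::vector<std::thread> workers;
    std::vector<long> partial(nthreads, 0);
    const std::size_t chunk = v.size() / nthreads;
    for (int i = 0; i < nthreads; ++i) {
        const std::size_t lo = i * chunk;
        const std::size_t hi = (i == nthreads - 1) ? v.size() : lo + chunk;
        workers.emplace_back([&v, &partial, lo, hi, i] {
            partial[i] = std::accumulate(v.begin() + lo, v.begin() + hi, 0L);
        });
    }
    for (auto& t : workers) t.join();
    return std::accumulate(partial.begin(), partial.end(), 0L);
}
```

Timing either version is left out deliberately; the point is structural: the threaded path does strictly more bookkeeping for the same answer, which only pays off once the per-chunk work is large enough.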



