Probably biased given that it's on the DuckDB site, but well-reasoned and referenced, and my gut agrees with the overall philosophy
This feels like the kicker:
> In the cloud, you don’t need to pay extra for a “big iron” machine because you’re already running on one. You just need a bigger slice. Cloud vendors don’t charge proportionally more for a larger slice, so your cost per unit of compute doesn’t change if you’re working on a tiny instance or a giant one.
It's obvious once you think about it: you aren't choosing between a bunch of small machines and one big machine; you may very well be choosing between a bunch of small slices of a big machine and one big slice of a big machine. The only difference is in how your software sees it: as a complex distributed system, or as a single system (one that can, e.g., share memory with itself instead of serializing and deserializing data over network sockets).
The reason this feels non-obvious is that people like to think that they're choosing a variable number of small slices of a big datacenter, scaling up and down hour-by-hour or minute-by-minute to get maximum efficiency.
Really, though, you're generating enormous overhead while turning on and off small slices of a 128-core monster with a terabyte of RAM.
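To make the shared-memory point above concrete, here's a minimal sketch (Python, illustrative only, not a rigorous benchmark) of the overhead being described: handing a chunk of data to another worker in the same process is just a reference copy, while a distributed design pays to serialize it, push it through a socket, and deserialize it on the other side.

```python
import pickle
import socket
import threading
import time

payload = list(range(1_000_000))  # stand-in for a chunk of query results


def recv_exact(sock: socket.socket, n: int) -> bytes:
    """Read exactly n bytes from a stream socket."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("socket closed early")
        buf.extend(chunk)
    return bytes(buf)


# One big slice: another thread in the same process uses the object directly.
t0 = time.perf_counter()
shared_ref = payload  # no serialization, no copy of the elements
t1 = time.perf_counter()

# Many small slices: serialize, ship through a (local) socket, deserialize.
a, b = socket.socketpair()
received = {}


def reader() -> None:
    size = int.from_bytes(recv_exact(b, 8), "big")
    received["data"] = pickle.loads(recv_exact(b, size))


t2 = time.perf_counter()
t = threading.Thread(target=reader)
t.start()
blob = pickle.dumps(payload)
a.sendall(len(blob).to_bytes(8, "big") + blob)
t.join()
t3 = time.perf_counter()

print(f"in-process hand-off: {t1 - t0:.6f}s, socket round trip: {t3 - t2:.6f}s")
```

And that's with both ends on the same box; add a real network between the slices and the gap only widens.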
That's not the only difference: reliability guarantees have many more facets than the brief hand-waving the author gives them in the article.
Over my career I've seen countless cases where the complexity of a distributed system introduces as many (or more) reliability concerns as it solves. When everything works well, it can be beautiful. But when things go wrong, they can quickly become a cascading mess of hard-to-debug issues.
FYI, this is from the MotherDuck blog. MotherDuck and DuckDB have a partnership, but they are separate entities.
> DuckDB Labs from its inception has received many investment offers from venture capitalists, but eventually decided not to take venture capital, and rather opted for a business model where it works with a few important partners. “We look forward to the continued growth and adoption that partnering with MotherDuck will bring. Thanks to the MotherDuck team, we are able to focus our efforts on independent innovation of the DuckDB core platform and growth of the open-source community.” says Mühleisen, who currently combines a tenured research position at CWI with the role of CEO at DuckDB Labs.
Along those lines, the graph didn't show Cerebras' 2.6 TRILLION transistor wafer-scale WSE-2 deep learning accelerator.
It'll be interesting to see if wafer scale becomes more common. I would think wafer-scale CPU core farms would help cloud providers deliver even greater one-big-server goodness.
850,000 cores
40GB of RAM distributed among the cores
220Pb/s of fabric (interconnect) bandwidth
Result: 7,500 trillion FP16 operations per second
It's cache or scratchpad memory. Keep in mind this is "on chip"; most systems, including this one, have additional memory off-chip in the form of DRAM sticks.
edit: For comparison, the NVIDIA A100 has up to 164kB of shared memory per SM while the wafer beast has about 47kB per core if you divide it out.
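For anyone who wants to check the division (a back-of-the-envelope sketch using the 40GB / 850,000-core figures quoted above):

```python
# Back-of-the-envelope check of the per-core figure, using the specs above.
wse2_sram_bytes = 40 * 10**9   # 40GB of on-chip memory
wse2_cores = 850_000
print(f"{wse2_sram_bytes / wse2_cores / 1000:.0f} kB per core")  # ≈ 47 kB
```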
This is absolutely correct (and gets more correct every year).
A m5.large (2 vCPU, 8GB RAM) is $0.096/hr. m5.24xlarge (96 vCPU, 384GB RAM) is $4.608/hr.
Exactly a 48x scale-up, in both capacity and cost.
The largest AWS instance is x2iedn.32xlarge (128 vCPU, 4096GB RAM) for $26.676/hr. Compared to m5.large, a 64x increase in compute and 512x increase in memory for 277x the cost.
Long story short: you can scale up roughly linearly for a long time in the cloud.
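Normalizing those quoted prices (actual on-demand prices vary by region and over time) makes the point obvious:

```python
# Cost per unit of compute/memory for the instance types quoted above.
instances = {
    # name: (vCPUs, RAM in GB, on-demand $/hr as quoted)
    "m5.large":        (2,   8,    0.096),
    "m5.24xlarge":     (96,  384,  4.608),
    "x2iedn.32xlarge": (128, 4096, 26.676),
}
for name, (vcpu, ram_gb, usd_hr) in instances.items():
    print(f"{name:16s} ${usd_hr / vcpu:.3f}/vCPU-hr   ${usd_hr / ram_gb:.4f}/GB-hr")
```

The two m5 sizes both land at $0.048/vCPU-hr and $0.012/GB-hr; the memory-heavy x2iedn costs more per vCPU ($0.208) but far less per GB of RAM ($0.0065), which is exactly the "pay for a bigger slice, not a premium" point.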
Yeah, but the argument being made is that there's overhead in that: primarily the extra dev time needed to build such a system, and the extra failure points introduced, which require more maintenance effort.
Those two items can be substantial (5x headcount), or they can be merely "standard" (2x headcount).
I think for most use cases, overprovisioning a single instance to meet max demand will still be cheaper than dynamic provisioning plus the synchronization overhead of a distributed system.
I'm in two minds about this article: while the author makes some excellent points, they're leaving out some advantages of scale-out, such as DR, blue-green deployment, and multi-region.
But one other thing I think they fail to cover is this: most people don't need to build distributed systems anywhere near as soon as they think.
Scale-up makes the most sense for startups and small projects that haven't gained traction or scale yet. It's perfectly possible to architect your applications to be ready for the day when you _do_ need to scale out, without spending time and money on the complexities until you actually _need_ to.
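One cheap way to stay "ready for that day" is to keep the seams in place without building anything distributed yet. A minimal sketch of what I mean (names are illustrative, not from the article):

```python
# Keep storage behind a narrow interface; ship the single-node version today.
from typing import Protocol


class EventStore(Protocol):
    def append(self, stream: str, event: bytes) -> None: ...
    def read(self, stream: str) -> list[bytes]: ...


class SingleNodeEventStore:
    """Today's implementation: one process, one dict, one big server."""

    def __init__(self) -> None:
        self._streams: dict[str, list[bytes]] = {}

    def append(self, stream: str, event: bytes) -> None:
        self._streams.setdefault(stream, []).append(event)

    def read(self, stream: str) -> list[bytes]:
        return list(self._streams.get(stream, []))
```

Application code depends only on the interface, so a partitioned or replicated implementation can be dropped in later without touching the callers.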
I agree. While data can always outgrow any single machine's ability to manage and process it in an optimal manner, the right software can allow larger data sets to be performant without resorting to distributed systems. If the software takes full advantage of all the CPU cores and has an intelligent caching strategy, then some amazing things can happen on affordable hardware.
I am building a new kind of data management system. It can manage unstructured data like a file system as an object store. It can build very large relational tables and query them quickly. Its flexible schema allows the management of semi-structured data like NoSql systems. (https://didgets.com/)
It has distributed features built into the architecture, but I have only implemented the single-node version so far. On a $2,000 desktop computer it can easily manage 200 million files, and tables with 100M+ rows or thousands of columns.
The issue I see now is that there is no good way to know what files will match well when reading from remote (decoupled) storage.
While it does support hive partitioning (thank god) and S3 list calls, if you're doing frequent inserts you need some way to merge the resulting parquet files.
The MergeTree engine is my favorite thing about ClickHouse, and why it's still my go-to. I think if there were a serverless way to merge parquet files (which was the aim of IceDB), that would make DuckDB massively more powerful as a primary OLAP db.
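For what it's worth, a poor man's version of that merge step is just a periodic job that rewrites a partition's small files into one bigger one with DuckDB itself. A rough sketch (bucket and paths are hypothetical; it assumes the httpfs extension and S3 credentials are already configured):

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs;")
con.execute("LOAD httpfs;")

# Rewrite all the small files in one hive partition as a single larger file.
con.execute("""
    COPY (
        SELECT *
        FROM read_parquet('s3://my-bucket/events/dt=2023-01-01/*.parquet')
    )
    TO 's3://my-bucket/events_compacted/dt=2023-01-01/part-0.parquet'
    (FORMAT PARQUET);
""")
```

The hard part, and the part MergeTree (and IceDB) actually solve, is doing this continuously, switching readers over atomically, and garbage-collecting the old files.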
Yeah, DuckDB is a slam dunk when you have a relatively static dataset (object storage is your durable primary SSOT, and ephemeral VMs running DuckDB pointed at the parquet files in object storage are your scalable stateless replicas), but the story gets trickier in the face of frequent ongoing writes/inserts. ClickHouse handles that scenario well, but I suspect the MotherDuck folks have answers for that in mind :)
IceDB sounds like an interesting project. What are your thoughts on Apache Iceberg? Seems like a similar idea, and gaining a lot of traction in the data-eng world.
Reminiscent of Adrian Cockcroft's "Gigalith" architecture.
Not sure about the economy of scale, though: if you need redundancy on a scale-up architecture, you essentially need parity compute sitting idle (likely in another location/region/data centre).
If you have an HA requirement, your infra spend has just doubled, and the ROI on your DR tends to nothing (unless you can do something clever with on-demand DR). You're also replicating everything you write to DR (so network and storage costs there).
Also, more practically, you're just less flexible. Waiting 30 minutes for a machine to checksum 1TB of memory before loading the OS is painful in an outage scenario (every reboot hurts, replenishing cache hurts, losing a massive instance hurts). I do like the sentiment around simplicity, though!
> If you have an HA requirement, your infra spend has just doubled, and the ROI on your DR tends to nothing (unless you can do something clever with on-demand DR). You're also replicating everything you write to DR (so network and storage costs there).
You're quite correct, but it's not such a simple trade-off.
1. What's a "disaster"? A) "Datacenter burns down" is very different to B) "Instance got overloaded with unexpected spike, need more capacity". For A), your distributed system will still have outages in a particular area, so you'll spin up new instances in the nearest live DC, at a time when everyone else is trying to do the same! And B) is not a disaster.
2. Disaster recovery is, by definition, an exceptional event. If you're reaching for DR protocols more than once a year, something is seriously wrong. Being able to scale out means you only need a few minutes before you're back up to capacity, but whether shaving off (say) 30 minutes in a genuine black swan event is worth it depends on the exact business. I've seen fintechs, FAANGs, Fortune 100 companies, banks, etc. have hours of downtime with no apparent negative effects. It's a black swan event that's not an extinction-level event, nor even close.
A bigger problem for the business is that if a black swan event that results in:
> Waiting 30 minutes for a machine to checksum 1TB of memory before loading the OS is painful in an outage scenario
turns into an ELE, then it's the business that has problems, not the systems. A business that fragile to downtime is not going to be in business much longer anyway.
In my toy barebones SQL database, I distribute rows across different replicas based on a consistent hash. I also have a "create join" statement, which keeps join keys colocated.
Then when a join query is issued, I can always perform the join, because the join keys are available locally: the query can be executed on each replica and the results returned to the client to be aggregated.
I want building distributed high throughput systems to be easier and less error prone. I wonder if a mixture of scale up and scale out could be useful architecture.
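Roughly this idea, as a tiny sketch (illustrative names, not the actual code described above): hash the join key onto a consistent-hash ring so both sides of a join land on the same replica and the join never needs a cross-node shuffle.

```python
import bisect
import hashlib


def _h(key: str) -> int:
    # Stable 64-bit hash of a key.
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class ConsistentHashRing:
    """Minimal consistent-hash ring; each replica gets several virtual nodes."""

    def __init__(self, replicas: list[str], vnodes: int = 64) -> None:
        self._ring = sorted((_h(f"{r}#{i}"), r) for r in replicas for i in range(vnodes))
        self._points = [h for h, _ in self._ring]

    def replica_for(self, join_key: str) -> str:
        idx = bisect.bisect(self._points, _h(join_key)) % len(self._ring)
        return self._ring[idx][1]


ring = ConsistentHashRing(["replica-0", "replica-1", "replica-2"])
# Route both tables by the same join key so matching rows are colocated:
print(ring.replica_for("customer:42"))  # where the customers row for id 42 lives
print(ring.replica_for("customer:42"))  # an orders row keyed the same way -> same replica
```

The client fan-out then becomes: send the query to every replica, let each one join its local slice, and merge the partial results.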
You want as few network round trips and crossovers between threads (synchronization costs) as you can get.
All those Enterprise Architects being swayed by fancy Internet-scale architecture presentations from the FAANGs led to Rococo scale-out astrotectures for systems operating at scales 5 to 10 orders of magnitude below where that would make sense.
It's the cloud companies raking in the profits of these hardware improvements. "Widely-available machines now have 128 cores and a terabyte of RAM," and I'm still paying five bucks for a couple of cores.
There's a whole world of disconnect between the enterprise-focused offerings (AWS, GCP, Azure) and what most people would personally use: stuff like DigitalOcean, Vultr, Scaleway, Contabo, Hetzner and so on.
It's gotten to the point where I could personally never justify going for the larger platforms outside of an enterprise setting - even if I'll run my own business, it'll probably be on one of the latter platforms, because they still have most of the features that I actually need, while being more affordable.
I mean, you probably want two for resilience, but a DRBD active/passive setup is simple enough and doesn't get into the usual fuckery of running a clustered DB.