What about hardware failure? On AWS you just commission a new instance and your ...

gtuhl · on April 21, 2011

There are some nice middle options out there. I'll use Softlayer as an example as I have provisioned a lot of machinery over there.

I can order machines online and SSH in 3-4 hours later. They turn around even exotic stuff just as fast - we saw that speed on a quad octo-core box with a RAID 10 array of Intel SSDs.

That's real metal too, with real IO (most of my work is IO bound, so VMs and the cloud are not options). You get to pick the exact CPUs, disks, etc., and they slot them into solid Supermicro boards and use good Adaptec disk controllers. You pay monthly and can spin down the box at any time (though you must pay for full months; there is no per-minute pricing like AWS).

That is the dedicated hardware side; you can also spin up compute instances, and those can be cloned and fired up in bulk. But they also have the IO problems that all other VMs have.

In any case, I just wanted to mention they are a decent middle ground. They're not as automated and polished as Amazon on the VM side, but you can spin up mixtures of metal and VMs to get combinations that make sense - pushing compute or RAM-only stuff to VMs and keeping DBs and persistence layers on real metal. They have a few different datacenters too, so you can spread gear around physical locations.

ericd · on April 22, 2011

I'm fairly sure that my downtime due to a hardware failure at Softlayer would be less than the downtime AWS has had for huge numbers of people this year. And hardware failures on a given server happen less than once a year on average.

Problems are just not as common if you're running on a handful of dedicated machines, and a single dedicated machine at a good host can handle a LOT without all the crazy reliability engineering that running on AWS requires. You need backups, but you don't have to assume that you must be able to fail over instantly or else face guaranteed downtime sometime soon. I don't think that difference can be overstated, since it lets you focus on more important things.

dangrossman · on April 21, 2011

Speaking of Softlayer specifically, they've diagnosed and then replaced failed hardware for me (hard disks and power supplies so far) within 15-30 minutes of my opening a support ticket. One of the incidents was around 2 AM local time at the server's location, and their response time was the same.