Working around the EC2 outage

ulope · on April 21, 2011

Having read that (and assuming it's accurate) I really wonder how anyone in their right mind can see EC2 as a viable hosting solution.

pan69 · on April 21, 2011

EC2 is built for scalability and distributed systems. The article seems to put a negative spin on something which is the author's own fault.

In most traditional hosting environments you have e.g. one web server. If this one server goes down your website stops working. A solution would be to have two web servers behind a load balancer. If one web server goes down, the other takes over and your site continues to work.

A lot of people who are hosting on EC2 place all their application components in the Virginia data-centers (because it's the cheapest data center for reasons the article points out). If the Virginia data center is down, nothing works any more. However, EC2 gives you the option to distribute your website over multiple data centers in case of an event like this.

If you choose not to take advantage of this architecture you're no different from someone running their website on a single web server, i.e. a single point of failure. With EC2 you have the ability to set up a website that never goes down.

Of course, distributing your web site over multiple data centers can be costly. But I guess it's pick and choose, not bitch and moan.

samstokes · on April 21, 2011

You're confusing AWS regions with AWS availability zones. You're not technically wrong - it is possible to spread your application across regions, and doing so would have protected against this outage - but it is slow and expensive, and the actual failover to a different region is difficult. Amazon's explicit guidance is that if you distribute across multiple availability zones within the same region, you should be robust to the majority of outages, which should only take out one AZ. The current AWS outage is affecting all AZs in the US-East region, which Amazon claims should never happen.
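To put rough numbers on why the multi-AZ advice normally holds and why it broke down here, a minimal sketch follows. The probabilities are made up for illustration and are not Amazon figures, and the function names are hypothetical. It compares the chance of losing every zone when zones fail independently with the chance when some region-wide cause can take them all down together.

```python
def p_all_down_independent(p_az: float, n_az: int) -> float:
    """Probability that all n_az zones are down at once, if zone failures are independent."""
    return p_az ** n_az


def p_all_down_with_shared_cause(p_az: float, n_az: int, p_shared: float) -> float:
    """Same question, but with a region-wide failure mode of probability p_shared
    that takes every zone down together (assumed independent of per-zone failures)."""
    return p_shared + (1 - p_shared) * p_az ** n_az


# Hypothetical inputs: each zone is down 1% of the time, two zones in use,
# and a shared region-wide failure occurs 0.1% of the time.
independent = p_all_down_independent(0.01, 2)
correlated = p_all_down_with_shared_cause(0.01, 2, 0.001)
print(f"independent zones: {independent:.6f}")
print(f"with shared cause: {correlated:.6f}")
```

Under these assumptions the shared failure mode dominates the result: adding more zones shrinks the independent term but does nothing to the region-wide one.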

This earlier post [1] (HN discussion [2]) discusses this in more detail.

[1] http://justinsb.posterous.com/aws-down-why-the-sky-is-fallin...

[2] http://news.ycombinator.com/item?id=2471899

pan69 · on April 21, 2011

No, I'm not confusing regions with availability zones. What I'm saying is that AWS gives you all the components to set up a website that never goes down. If you don't take advantage of this then there is no one else to blame but yourself. Of course doing this can be very expensive and it's a choice you have to make.

Some people keep saying things like: "Well, Amazon promised us that zones don't have a single point of failure". Well, sucked in I guess. Apparently they do.

jpetazzo · on April 21, 2011

Amazon promised that regions wouldn't see two zones going down at the same time. That's totally different, and that's what happened today.

pan69 · on April 21, 2011

"Amazon promised"

Well, that just sounds incredibly naive.

markerdmann · on April 21, 2011

Ok, come on. Now you're just trolling. Amazon also promises not to share your credit card information. It should be obvious that trust is necessary in business.

e40 · on April 21, 2011

Would you just please tell everyone here how you think it should be done rather than talking around the issue?? Otherwise, you're just another troll.

shykes · on April 21, 2011

Why don't you explain how you would go about designing this invincible website? We are all very eager to learn from you.

jpetazzo · on April 21, 2011

Well, US East (North Virginia) is actually 4 datacenters. As pointed out in the article, multiple availability zones (=multiple datacenters) are affected by the outage, meaning that even if you deploy on multiple datacenters, you can go down.

Now, if you're talking about deploying to e.g. both US East and US West, I totally agree: it would be a good thing to do. But EC2 does not give you that option - not easily, at least, because there is no convenient way to move volumes or snapshots between regions.

Setting up HA between close datacenters (e.g. 50 miles from each other) is easy, because the latency remains low. Setting up HA between datacenters coast to coast is a whole different story, and the only help brought by EC2 is the fact that you can use the same API to deploy your machines here and there.

shykes · on April 21, 2011

"With EC2 you have the ability to set up a website that never goes down."

I find it hard to believe that you have any practical experience to back your claim. Today's incident affected sites that scrupulously respected all HA best practices.

sbov · on April 22, 2011

At least according to the blog post, it seems like part of the issue may be a stampede of requests due to the complete outage of a single AZ. If every single service hosted in the Amazon cloud implemented some sort of failover to another region in the case of multiple AZ outages, how can we be sure that that stampede of requests wouldn't take out that other region? Could Amazon handle the entirety of the US East workload being dumped onto Europe/US West?
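The capacity question can be sketched as simple arithmetic. This is only an illustration: the region names, loads, and capacities are hypothetical (real AWS capacity figures are not public), and it assumes the failed region's load is split evenly across the survivors.

```python
from typing import Dict


def utilization_after_failover(
    loads: Dict[str, float], capacities: Dict[str, float], failed: str
) -> Dict[str, float]:
    """Spread the failed region's load evenly over the surviving regions and
    return each survivor's utilization (load / capacity). Above 1.0 means overloaded."""
    survivors = [region for region in loads if region != failed]
    if not survivors:
        raise ValueError("no surviving regions")
    extra = loads[failed] / len(survivors)
    return {r: (loads[r] + extra) / capacities[r] for r in survivors}


# Hypothetical units of load and capacity per region.
loads = {"us-east": 100.0, "us-west": 40.0, "eu-west": 40.0}
capacities = {"us-east": 150.0, "us-west": 60.0, "eu-west": 60.0}

for region, utilization in utilization_after_failover(loads, capacities, "us-east").items():
    status = "OVERLOADED" if utilization > 1.0 else "ok"
    print(f"{region}: {utilization:.2f} ({status})")
```

With these made-up numbers, every region runs comfortably at two-thirds utilization in normal operation, yet both survivors end up well over capacity once the largest region fails over onto them.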

ulope · on April 21, 2011

Well ok that's understood. But still - instances randomly crashing is not acceptable under any circumstances in my view.

shykes · on April 21, 2011

Just to be clear. DotCloud is in fact designed to withstand instances randomly crashing.

So far, however, it has not been designed for instances randomly crashing across multiple datacenters. I will add that neither is the canonical high-availability design recommended by Amazon.

pan69 · on April 21, 2011

Why not? That's the whole point of cloud computing, that you can cater, or have the ability to cater, for situations like this. If you don't take advantage of those abilities, well, then that's something you have to sort out with yourself.

snehalpatel · on April 22, 2011

Choosing to be on AWS or any other cloud provider means you accept some risk of things going down. Build to fail and when that fails, it all comes down to whether or not you can do it better and how much cost you're willing to bear to get to your goal of HA. For me, I know AWS can do a better job at hosting than I can and can accept multiple AZs going down.

mbailey · on April 21, 2011

Quick answer: have presence in both the VA and CA locations.

shykes · on April 21, 2011

The corollary is: ignore the AZ feature entirely. You may be right, but that's a big hit to the attractiveness of AWS.