Interesting article but the author needs to learn the difference between systemic and systematic.
This is also wrong:
> That’s why the shipbuilders bragged that the Titanic was unsinkable. Instead, systematic design error let water spill from one compartment to the next. The moment an iceberg compromised one section of the hull, the whole ship was doomed.
The Titanic was built to withstand as many as four flooded sections and still remain afloat. No one imagined a scenario in which more than four sections would flood, but then the iceberg ripped open six sections and down it went. Was this a systemic failure? I think it's more of a failure to anticipate a possible worst case scenario. An argument could be (and has been) made that if the watertight sections had been taller, the flooding would have been slower and there would have been more time for aid to arrive. That was arguably a flaw in the system of bulkheads and thus a systemic failure.
This took way too many words to say that as AWS's internal service dependency graph gets deeper and more complex, the likelihood increases that any one service failure cascades into a systemic failure.
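To make the cascade point concrete, here is a rough sketch of how you might compute the "blast radius" of one failing service in a dependency graph. The service names and edges are made up for illustration and are not AWS's actual internal graph; the function name `blast_radius` is likewise hypothetical.

```python
from collections import deque

def blast_radius(depends_on, failed):
    """Return every service that transitively depends on `failed`."""
    # Invert the edges: for each service, which services depend on it directly?
    dependents = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(service)

    # Breadth-first walk outward from the failed service.
    affected = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for svc in dependents.get(current, ()):
            if svc not in affected:
                affected.add(svc)
                queue.append(svc)
    return affected

# Hypothetical graph: each service maps to the services it depends on.
DEPENDS_ON = {
    "dns": [],
    "database": ["dns"],
    "auth": ["database"],
    "queue": ["dns"],
    "api": ["auth", "queue"],
    "web": ["api"],
}

print(sorted(blast_radius(DEPENDS_ON, "dns")))
print(sorted(blast_radius(DEPENDS_ON, "web")))
```

In this toy graph a failure at the bottom ("dns") reaches everything above it, while a failure at the top ("web") reaches nothing else. The deeper the graph, the more services sit above any given node.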
- people using us-east-1 when it has been shown to be the most problematic AWS region
- companies building shitty applications and devices that require the cloud to function and not considering what happens when the cloud/internet is down
- people assuming the cloud can't/won't go down or have issues between the end user and the server
These types of events will continue to happen, so for the love of god, stop using us-east-1.
I live in Michigan and I’ve used us-east-2 since it became available. Not only is Ohio right next door, but it’s also closer to the mean center of the US population (which is currently in Missouri). I don’t know if it’s actually measurably faster for West Coast folks, but my application is used mostly in the Midwest and only a little on the coasts, so it makes sense for me. It might be a good choice for coast-to-coast companies too.
Have nothing to add on the writeup. But I had to drop a note about how awesome the comic / cartoon (about the "evolution of DevOps") at the end of the writeup is in case anyone misses it by not reading till the end.
I'm a little disappointed by how light this article is on what should be done. It seems like the conclusion is "this will happen again." Even the public dependency audit the author is calling for wouldn't do much. Why would third parties be better at planning for systemic failure than Amazon's internal teams?
This probably isn't the best use of chaos monkey, though I'm not sure. There's probably room in here somewhere for better tools for modeling failure.
The pattern recommended by Amazon themselves is to run stuff in multiple regions. Stuff never really breaks in an AZ; it's almost always a region that goes down.
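For the stateless part of a workload, the multi-region idea can be as simple as trying regions in order. This is only a sketch: `call_with_failover` and `fake_request` are hypothetical names, the outage is simulated, and nothing here touches a real AWS API. Replicating data across regions is the hard part and is not covered.

```python
def call_with_failover(regions, request):
    """Try `request(region)` for each region in order; return the first success."""
    errors = {}
    for region in regions:
        try:
            return region, request(region)
        except Exception as exc:
            # Remember why this region failed, then move on to the next one.
            errors[region] = exc
    raise RuntimeError(f"all regions failed: {errors}")

def fake_request(region):
    # Stand-in for a real service call; simulates a us-east-1 outage.
    if region == "us-east-1":
        raise ConnectionError("region unavailable")
    return f"ok from {region}"

print(call_with_failover(["us-east-1", "us-east-2"], fake_request))
```

A real setup would more likely do this at the DNS or load balancer layer with health checks rather than in client code, but the ordering logic is the same.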
Some workloads can't be split, but they are pretty small, especially when you start getting to medium load levels (anything involving sharding). Mostly then it's down to cost.
What bit us specifically is that we ran our Cognito instance in us-east-1, which, fair enough, sounds like an unstable region, and we’re going to move. But as far as we can tell from the docs, Cognito does not have multi-az support. Has anyone else figured out a multi-az auth strategy?