Lessons from the Amazon EC2 Crash: A Survivor’s Tale

April 25, 2011

As the Amazon EC2 service outage tsunami rolled over the sea walls of Quora, Foursquare and others, the phones started ringing at our office. But these calls weren’t from panicked clients, because none of our customers had been affected, thanks in part to redundancy planning and a little bit of luck.
Instead, the calls were coming in from companies that were looking for recovery advice. This gave us a front-row seat to the crisis, from the perspective of several startups that were affected.

Too Big To Fail. Again.

In the aftermath of these troubleshooting conversations, we came up with some theories on why seasoned CTOs who “knew better” had placed their faith in a hosting solution that was considered “too big to fail”. We were surprised to hear from companies who didn’t think they needed Systems Oversight because “doing backups on the cloud is easy” and “our systems team doesn’t need the help”.

It was as if traditional best practices from the good old days of data center management didn’t apply in the magical new world of cloud hosting. As the crisis reminded everyone, virtual machines still need to live in a networked data center on a physical piece of hardware. Virtualization doesn’t change the fact that a shared network environment entails shared risks that increase in proportion to the accelerating adoption rate of new cloud hosting customers.

We share the view of Michael Kirven, co-founder of cloud services provider Bluewolf, who said:

“Amazon’s products are only as good as the people putting the architecture up. If you put all of your eggs in one basket, you put yourself at risk.” (CNN Money)

Somehow, the ease and inexpensive nature of cloud computing lulled people who certainly knew better into a reactive, assumptive and even apathetic posture. In 2008 we all learned from the banking meltdown that there is no such thing as “too big to fail,” so why are we surprised to learn that AWS, a rapidly growing, relatively new, shared infrastructure, wasn’t perfect? We should all be planning for technical and process failures as an inevitability, just like in the old days when we had a server cage in the closet next to the coffee maker.

Capacity Does Not Equal Redundancy

For all the industry-transforming efficiencies offered by cloud computing, the fact remains that “capacity-on-demand” does not also buy performance or redundancy. We’ve seen many of our startup customers seduced by the temptation to “light up a few more AMIs” as the solution to every performance bottleneck. This practice only serves to mask underlying structural issues with the deployment, creating a ticking time bomb that can bring systems down at a future time when massive usage makes it much harder to stage a recovery.

Cloud computing does not change the need for traditional best practices. We still need to design a thoughtful architecture, conduct performance and redundancy planning, and implement proper security practices. The good news is that with cloud computing it no longer takes a large IT budget to implement these best practices.

On Amazon’s EC2, “proper architecture” entails live replication across multiple availability zones, coupled with snapshot backups to S3. This strategy should be paired with generous provisioning to provide a capacity buffer for usage spikes. For example, setting auto-scaling rules to trigger at 60% utilization leaves capacity available to handle the fail-over from a zone that goes down.
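The headroom behind a utilization threshold like 60% can be checked with simple arithmetic. The following is a minimal sketch, not a description of how Amazon's auto-scaling works: it assumes load is spread evenly across equally sized zones and that a failed zone's traffic is redistributed evenly over the survivors. The function names are hypothetical and exist only for illustration.

```python
def utilization_after_zone_loss(utilization: float, zones: int) -> float:
    """Per-zone utilization after one zone fails.

    Assumes (for illustration only) that load was evenly spread over
    `zones` equally sized zones and is redistributed evenly afterwards.
    """
    if zones < 2:
        raise ValueError("need at least two zones to fail over")
    return utilization * zones / (zones - 1)


def max_safe_utilization(zones: int) -> float:
    """Highest steady-state utilization that still fits on the
    surviving zones without adding any new capacity."""
    if zones < 2:
        raise ValueError("need at least two zones to fail over")
    return (zones - 1) / zones


if __name__ == "__main__":
    # Compare a 60% steady-state target across different zone counts.
    for zones in (2, 3, 4):
        after = utilization_after_zone_loss(0.60, zones)
        ceiling = max_safe_utilization(zones)
        print(f"{zones} zones at 60%: {after:.0%} after losing one "
              f"(ceiling without new capacity: {ceiling:.0%})")
```

Under these assumptions, three zones running at 60% end up at roughly 90% after losing one, which still fits. With only two zones, the survivor would be asked for about 120% of its capacity until replacement instances come up, so the right threshold depends on how many zones the deployment spans.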

These defensive measures, which could have reduced downtime, were relatively inexpensive and easy to implement using Amazon’s automated tools.

Culture and Workflow

Aside from the technical details of the Amazon network failure, this crisis points to several structural flaws in the current VC / startup model that Corsis is hoping to correct with our Systems Oversight offering.

The first issue is cultural, and stems from an environment in which small teams are generally focused on rapid turnaround of high-priority tasks using agile approaches. Because the cloud redundancy tools were cheap, easy to implement, and well-known, it was easy to put the task off until later, and many teams simply never got around to implementing them.

This points to a systemic flaw in the culture of startups that affects their oversight workflows. Given the fast-paced, small-team environment, in many startups no one is focused on oversight as part of their day-to-day job. Instead, the startups that called us for help when it was too late had spent the past six months focused only on delivering the “top 3 urgent things” on their weekly list. As one week flows into the next, startups face a new set of “top 3 priorities” and, as a result, IT architecture issues are often bumped. Consequently, the redundancy precautions never made it to the top of the to-do list at many startups.

An important area that is overlooked in today’s small-team startup culture is the need for skill-set redundancy in addition to server redundancy. Since few startups have the time to do proper “run-book” style documentation, we recommend implementing a skill-set escrow service for “human redundancy.”

A second major issue that was exposed by the crisis is related to the current software ecosystem. With open source “mashup” development and mature web services, it’s possible to go very far with small teams.

VCs and executives have put their faith in small internal IT teams without an insurance policy against key-man risk. During the Amazon crash, expertise was in short supply, and some firms scrambled to figure out who held the access credentials and undocumented knowledge needed for disaster recovery.

Looking Ahead by Looking Back

In hindsight, Corsis clients survived the crisis, not because we invented brilliant new safeguards, but because we stuck to traditional practices and viewed cloud computing as a “capacity solution,” not as an “architecture solution.” For their part, our clients were pragmatic enough to realize that the current startup environment is not conducive to safe deployment practices, so they hired us to focus on “minding the store” in terms of redundancy, backups and documentation.

Expecting the current startup culture to change is probably unrealistic, and it is also unrealistic to expect cloud hosting providers to provide 100% uptime. Instead, we recommend that startups view cloud computing through a traditional lens, and consider outsourcing the “systems oversight” function so that they can focus on rapid development while avoiding the next crash.
