Working around the EC2 outage

As you probably already know, EC2 is experiencing a massive, unprecedented outage.

Introduction for those of you who aren’t super familiar with Amazon EC2

EC2 is made of multiple regions, each technically independent from the others.
There are five different regions:

  • US East (Northern Virginia),
  • US West (Northern California),
  • Europe (Ireland),
  • Asia Pacific (Tokyo),
  • Asia Pacific (Singapore).

US East is the oldest and probably the most heavily used region, for multiple reasons:

  • it’s a bit cheaper than others;
  • if you want to target both American and European audiences, it’s a not-too-bad choice;
  • many third-party providers are hosted there (for the reasons listed above), so if your service relies on them, you might as well get your servers as close as possible to optimize latency, bandwidth, and costs;
  • it’s the default one for a lot of tools, so you might use it without even actually choosing it.

Each region can be split into multiple Availability Zones. Those zones are supposed to map to different datacenters: close enough to each other to provide low-latency connections (suitable for replication and high availability), but far enough apart to rely on different network and power providers.
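
To make "spread over multiple zones" concrete, here is a minimal sketch of how one might launch instances across every zone of a region, using the boto3 Python SDK; the AMI ID and instance type are placeholders, not values from our actual setup.

```python
# Sketch: spread instances over the Availability Zones of one region.
# The AMI ID and instance type below are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Discover which zones are currently available in the region.
zones = [z["ZoneName"]
         for z in ec2.describe_availability_zones()["AvailabilityZones"]
         if z["State"] == "available"]

# Launch one instance per zone, so the loss of a single zone
# (or datacenter) does not take the whole service down.
for zone in zones:
    ec2.run_instances(
        ImageId="ami-12345678",      # placeholder AMI
        InstanceType="m1.small",     # placeholder instance type
        MinCount=1,
        MaxCount=1,
        Placement={"AvailabilityZone": zone},
    )
```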

DotCloud uses US East, and our servers are spread over multiple availability zones – that’s the “best practice” advised by Amazon for deploying highly available services on EC2.

We also use a well-known feature of EC2: Elastic Block Storage (or EBS). An EBS volume is a virtual hard drive that you can plug into your EC2 instance. EBS volumes are supposed to be highly available, and come with a lot of interesting features: you can move them from one instance to another, you can make snapshots (point-in-time images of their content), and later use those snapshots for fun and profit (to start new instances, or to restart a crashed instance from a previous backup). Imagine a virtual USB stick with magical cloning abilities: awesome, isn’t it? Well, not as much as it sounds, as we will see.
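
Here is a hedged sketch of that EBS lifecycle (attach, snapshot, clone) with the boto3 Python SDK; the instance IDs, volume ID, device name, and zone are hypothetical placeholders.

```python
# Sketch of the EBS lifecycle: attach a volume, snapshot it, clone the
# snapshot into a new volume. All IDs and the device name are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Plug the "virtual USB stick" into a running instance.
ec2.attach_volume(VolumeId="vol-11111111",
                  InstanceId="i-22222222",
                  Device="/dev/sdf")

# Take a point-in-time snapshot of its content.
snap = ec2.create_snapshot(VolumeId="vol-11111111",
                           Description="nightly backup")
ec2.get_waiter("snapshot_completed").wait(SnapshotIds=[snap["SnapshotId"]])

# Later, clone the snapshot into a fresh volume (possibly in another zone)
# and attach the clone to a replacement instance.
clone = ec2.create_volume(SnapshotId=snap["SnapshotId"],
                          AvailabilityZone="us-east-1b")
ec2.get_waiter("volume_available").wait(VolumeIds=[clone["VolumeId"]])
ec2.attach_volume(VolumeId=clone["VolumeId"],
                  InstanceId="i-33333333",
                  Device="/dev/sdf")
```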

So you wanna move to EC2…

AWS is not double rainbow all the way. When you deploy on EC2, your app must be designed for failure, because EC2 instances might fail (i.e., crash unexpectedly) – and they will. Your EBS volumes will have inconsistent performance (sometimes fast, sometimes slow as hell, even if your virtual machine isn’t doing anything special). The EBS volumes might also become totally unresponsive – just as if you had unplugged the virtual USB stick. Those issues are bound to occur, and you should design your whole architecture to work around them; otherwise your uptime will be awful.
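
“Designed for failure” is mostly an architectural matter, but it shows up in small details too: every call to the EC2 API or to an EBS-backed service should be written with the assumption that it can fail or hang. Below is a minimal, hypothetical sketch of the kind of retry-with-backoff wrapper we have in mind; the function name, retry count, and delays are illustrative choices, not part of any AWS SDK.

```python
# Sketch: retry an unreliable call with exponential backoff, instead of
# failing on the first transient error or hammering the API in a tight loop.
# The retry count and delays are illustrative, not AWS recommendations.
import time


def with_backoff(call, max_attempts=5, base_delay=1.0):
    """Run call(), retrying with exponential backoff on any exception."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))


# Hypothetical usage: wrap an EC2 API call that may fail transiently.
# snapshot = with_backoff(lambda: ec2.create_snapshot(VolumeId="vol-11111111"))
```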

You might be lucky (some have managed to keep instances running for more than one year), but if you have a lot of instances (a lot starts around 10), you will statistically experience outages. On our cluster, we have multiple instances failing each week. Sometimes, you can get yourself out of trouble with little tricks. Imagine you forgot to switch some service to its highly available counterpart, and its hosting virtual machine has crashed; here is the usual escalation path (a rough code sketch of it follows the list):

  1. You can try to reboot it from the EC2 console. Sometimes, that won’t work.
  2. You can then try to shut it down and restart it. That has a better success rate, but sometimes it won’t work either.
  3. You can try to create a snapshot of its volumes, to start a new instance from them. This will usually work (though it can take a few hours if you have a large amount of data) – but not always.
  4. You can try to create new volumes from an older snapshot. This will almost always work, but in some rare cases, you’ll be out of luck.
  5. You can try to migrate your snapshots to some other place. Oh, wait, you actually can’t: although Amazon claims that your snapshots are on S3, you can’t handle them like you would handle your other S3 assets.
  6. You can restore from your regular backups (rsync, sql dumps…); well, you had backups, right? Right?
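
For reference, here is a hedged sketch of steps 1 through 4 of that checklist using the boto3 Python SDK; the instance, volume, and snapshot IDs and the zone are placeholders, and in a real outage any of these calls can itself fail or hang.

```python
# Sketch of the escalation path above: reboot, stop/start, snapshot,
# restore from an older snapshot. All IDs and the zone are placeholders,
# and any of these calls can itself fail during an outage.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
instance_id, volume_id = "i-22222222", "vol-11111111"

# 1. Try a plain reboot first.
ec2.reboot_instances(InstanceIds=[instance_id])

# 2. If that fails, stop the instance completely and start it again.
ec2.stop_instances(InstanceIds=[instance_id])
ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])
ec2.start_instances(InstanceIds=[instance_id])

# 3. Still dead? Snapshot its volumes so a new instance can be built from them.
snap = ec2.create_snapshot(VolumeId=volume_id, Description="emergency snapshot")
ec2.get_waiter("snapshot_completed").wait(SnapshotIds=[snap["SnapshotId"]])

# 4. Or fall back to a known-good older snapshot (placeholder ID) and
#    create a fresh volume from it.
restored = ec2.create_volume(SnapshotId="snap-44444444",
                             AvailabilityZone="us-east-1b")
ec2.get_waiter("volume_available").wait(VolumeIds=[restored["VolumeId"]])
```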

Amazon gives (or rather sells) you a lot of tools and features to work around that: ELB (reliable load balancer), RDS (reliable SQL database), S3 (reliable static storage), SQS (reliable queuing system), and a whole herd of other acronyms. If you are a disgruntled sysadmin, you might think that Amazon is just chopping your legs off (by giving you crippled virtual machines), and then handing out expensive crutches as a makeshift solution.

If you want to move to EC2, you should realize that your application will require major overhauls to adapt to this new environment in a safe and reliable way – just like moving to Google AppEngine means a totally different architecture. You might be fooled by the fact that you are running on a “regular” virtual machine, with SSH, root access, and everything. But your virtual machine might crash at any time, and there won’t be much you can do about it. You can’t even hop in your car (or on a plane) and drive to the datacenter to fix the servers yourself. So you’d better get everything right in the first place if you don’t want your move to the Cloud to end in a nasty freefall.

What the hell happened yesterday?

Yesterday, around 1 AM (Pacific time), we started to receive alerts from instances going down. Some of them were totally crashed. Others were still reachable through SSH, but services would be unresponsive. We use different EBS volumes for the system and for our users’ applications: the system volume was still working, but the users’ volumes were dead. This seemed to affect multiple Availability Zones (although, to be fair, some Availability Zones seemed more affected than others; so maybe it only affected one, and the others were just “regular” crashes).

Since our primary gateway was affected, we switched to a backup gateway. Then we tried all the steps listed in the checklist above – all of them failed, as you might guess.

After one hour, Amazon’s Health Dashboard was still pretending that everything was fine. Twitter showed otherwise, since many services were being disrupted by the outage. After a couple of hours, the Health Dashboard was stating that EBS was experiencing “increased latency and error rate”, but things were obviously out of control, with Quora, Reddit, and other websites affected.

We’re now more than 10 hours after the initial outage. Amazon says that there was a “networking event” (translation: “failure of an important network link or piece of equipment”) which triggered re-mirroring of a large number of EBS volumes in US East. They also mention that a “control plane was inundated”, causing more errors. An educated guess might be that since the beginning of the outage, people (and scripts) have been trying to recover from the first spike of errors, creating snapshots, volumes, and instances, totally flooding the API endpoints. Right now, in at least two availability zones, calls to create new snapshots or new volumes on some instances are still failing consistently with “internal errors” on the API side.

About 25% of our users were affected. Our API was still running, so people could deploy new services (which would automatically run on the available instances of our cluster); however, we are still seeing more and more instances failing. We decided to take the API down, since we cannot accurately assess the situation. Of course, we plan to bring it back up as soon as possible.

We’ll keep you posted as the situation evolves, and will also explain how we plan to shield against that kind of failure in the future.