Sooner or later, all successful online businesses need to set up some kind of 24×7 on-call rotation, to make sure that when “something wrong” happens, there is someone to look after it, even when it happens Saturday night at 3 AM, even if it happens between Christmas and New Year’s Eve.
Since being on-call isn’t always fun and games, you will generally want to spread the burden across your team. Here’s how we did it at dotCloud.
Step 1: what events should wake up your on-call team?
There are actually two questions that you need to answer:
- What issues are important enough to justify waking up one of your engineers?
- How do you detect those issues?
It might sound obvious to some of you, but it’s very important to make sure that you don’t trigger a full-scale alert for issues that can safely wait a few hours (i.e. until the next business day).
It’s certainly useless to page someone when available disk space falls below 10%. It’s certainly useful to page someone when available disk space on a mission-critical MySQL server falls to 0% (since MySQL is well known to do funky things when it can’t write anymore). But between those two easy extremes, there is a full spectrum of situations that need to be thought about carefully.
We use Nagios (I like to describe it as “the worst monitoring system ever, with the exception of all others”), and we decided that CRITICAL notifications had to page someone, while WARNING notifications could just be sent by e-mail and wait a little bit.
It means that we had to adapt the semantics a little bit. Memory exhaustion, or a crashed machine, usually translates into a CRITICAL state. Now, if the crashed machine is part of a fully redundant cluster, do you really want to wake someone up at once? If memory runs low at 4 AM on some test environment used only by your QA team, do you really need to page someone immediately?
Again, this might sound obvious, but if you wake up your best engineers for pointless little things, they are more likely to disregard alerts in the long run, and be slower to respond to legit alerts. So make sure that unimportant stuff doesn’t make it to their pagers!
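To make that concrete, here is a toy sketch of the kind of triage logic we’re describing. The check names, environments, and thresholds are made up for the example; they are not our actual rules.

```python
# A toy sketch of "should this alert page someone?" triage.
# All names and thresholds below are illustrative placeholders.

def should_page(check, environment, redundant=False, disk_free_pct=100):
    """Decide whether an alert deserves a page, or can wait until morning."""
    if environment != "production":
        return False  # QA and test clusters can wait for business hours
    if check == "host_down" and redundant:
        return False  # a dead node in a fully redundant cluster can wait
    if check == "disk_space" and disk_free_pct > 0:
        return False  # low-but-not-zero disk space is a WARNING, not a page
    return True       # everything else wakes someone up


assert should_page("disk_space", "production", disk_free_pct=0)
assert not should_page("disk_space", "production", disk_free_pct=8)
assert not should_page("host_down", "production", redundant=True)
assert not should_page("memory_low", "qa")
```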
Now, how do you detect problems? I already mentioned that we used Nagios. You can probably use any monitoring system—as long as it is feasible to integrate it with PagerDuty (since this post is all about how PagerDuty will strengthen your uptime while preserving your sanity as much as possible).
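For the curious, “integrating with PagerDuty” mostly boils down to pushing an event to their API whenever your monitoring system fires. Here is a rough sketch in Python; the routing key is a placeholder, and the exact endpoint and payload shape depend on which version of their Events API you are using.

```python
import requests

# A rough sketch of triggering a PagerDuty incident from code.
# The routing key below is a placeholder taken from your PagerDuty service.

PAGERDUTY_URL = "https://events.pagerduty.com/v2/enqueue"
ROUTING_KEY = "YOUR_INTEGRATION_KEY"  # placeholder


def trigger_alert(summary, source, severity="critical"):
    """Open (or re-trigger) an incident in PagerDuty."""
    event = {
        "routing_key": ROUTING_KEY,
        "event_action": "trigger",
        "payload": {"summary": summary, "source": source, "severity": severity},
    }
    response = requests.post(PAGERDUTY_URL, json=event)
    response.raise_for_status()
    return response.json()


# trigger_alert("Disk full on mysql-master", "mysql-master.internal")
```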
We also have a Zendesk connector. It allows our Enterprise customers to page our on-call team directly: if they notice that something is wrong, and wasn’t detected by our array of Nagios checks, they can use a special-purpose e-mail address, which gets routed through Zendesk (to keep a trail of all requests and answers from our staff), and ultimately pings our on-call team.
Once you’ve settled on what can page your team, let’s move on to the who.
Step 2: who should be in your on-call team, and how should they be trained?
This again might sound obvious, but we realized that it wasn’t. The members of your on-call team must be willing and able. Not everyone likes the idea of waking up in the middle of the night to fix servers. Not everyone can fix servers in the middle of the night.
First thing, discuss the on-call duty with potential hires. Make sure that they’re comfortable with it, and that the generous compensation, stock options, and perks that you’re offering come with the expectation that they will quickly be able to join the on-call team. Make sure that they understand how it will work, and make it clear that you will NOT ask them to give up their social life, their sleep, and their household peace by paging them twice a night while they’re on call.
Next, organize a bootcamp. I’m not claiming that a bootcamp is the silver bullet method to bring your on-call candidates up to speed; but for us, it was great.
Our bootcamp isn’t even a real bootcamp, with an instructor and a bunch of trainees. It’s a self-taught course, embodied by a beefy wiki page, taking the on-call candidate through an itemized list of hoops to jump through.
It starts with a quick toolbelt check: do you have proper access to the code repository, the internal documentation wiki, the development ticket tracker? Is your VPN access set up properly?
Then you proceed to a quick ZeroRPC 101 (since that’s what powers almost all communication between the components of the dotCloud platform). It explains how to discover, enumerate, and operate our basic internal services.
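If you have never seen ZeroRPC, a minimal service looks roughly like this (the service and the bind address are made up for the example; they are not actual dotCloud internals):

```python
import datetime

import zerorpc

# A minimal ZeroRPC service, as a sketch. Service name and address are placeholders.


class TimeService(object):
    def now(self):
        """Return the current UTC time as an ISO 8601 string."""
        return datetime.datetime.utcnow().isoformat()


server = zerorpc.Server(TimeService())
server.bind("tcp://0.0.0.0:4242")
server.run()
```

A client then connects with `zerorpc.Client()` and calls `now()` as if it were a local method; the `zerorpc` command-line tool can do the same from a shell, which is handy for the kind of discovery and poking around that the bootcamp covers.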
Next thing, you provision your own private dotCloud cluster, plumb it into the main platform, and learn how to use the automated deployment tools to update it. The tour goes on with the metrics service, the supervision platform, a bunch of migration procedures, and so on.
Our bootcamp can take up to a few days to complete, but after that, the new recruit is well prepared for on-call duty.
Step 3: OK, how do I set up this rotation?
We already hinted at PagerDuty. PagerDuty receives alerts sent by Nagios, by e-mail, through an API, or by whatever you want. Then, it will in turn send e-mails, SMS, or even phone calls to alert your team, following a configured calendar.
The following assumes that you already have a very basic PagerDuty setup, even if it’s a single rotation with a single person in it. The point is, this post isn’t meant to be a PagerDuty tutorial :-)
You don’t want to page all your engineers whenever something happens. We found that the most efficient method was to page only one person, and have another one on backup, just in case.
So in PagerDuty, we create two distinct rotations, named “First Line” and “Second Line”. Both rotations have exactly the same people in them, in exactly the same order—but with an offset. That guarantees that you always have two different people in First Line and Second Line.
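The offset trick is easier to see in code. Here is a toy sketch of a daily rotation; the engineer names are placeholders, and the real scheduling of course lives in PagerDuty, not in a script.

```python
from datetime import date, timedelta

# A toy sketch of the "same people, same order, shifted by one" idea.
# Engineer names are placeholders.

ENGINEERS = ["alice", "bob", "carol", "dave"]


def on_call(day, offset=0):
    """Return who is on call on a given day, for a rotation shifted by `offset`."""
    return ENGINEERS[(day.toordinal() + offset) % len(ENGINEERS)]


for i in range(5):
    day = date.today() + timedelta(days=i)
    first = on_call(day)             # "First Line" rotation
    second = on_call(day, offset=1)  # "Second Line": same list, same order, offset by one
    print(day, "first:", first, "second:", second)
```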
We decided to use daily rotations, with a “hand over” point somewhere in the middle of the day. We think it is better to switch to the next person in the middle of the day rather than in the middle of the night: if a string of issues happens near the “hand over” point, both engineers will probably get involved, and a nighttime hand-over means waking up two engineers instead of one.
Also, we decided to use two separate layers for weekdays and weekends. This is useful if some of your engineers can’t or don’t want to be on call during the weekends, or, conversely, if you have someone part-time on your team who cannot be bound to a rotation that could put them on call on arbitrary days.
Final touch: every now and then, we have engineers traveling to Europe, Asia, or other places in very different timezones, and working from there. When that happens, we adjust the on-call rotation: we remove them from the “main” rotation, but schedule them to be on call during the day in their local timezone (which generally happens to be the night in the timezone of the rest of the crew). We turn an inconvenience (having someone in a different timezone) into an advantage (reducing the exposure of the team to impromptu duty calls in the middle of the night).
Confused? Look at the following screenshot (our actual rotation) and be enlightened!
In the above example, Andrea Luzzardi is in Europe, in the CEST timezone, and handles alerts during the day in his local timezone. Charles Hooper is not on call during the week, but compensates by being on call twice as often during the weekends. The hand over point is at 11 AM Pacific Time (or 2 PM Eastern Time).
Then, you need to set up proper profiles and escalations. All profiles are set up using the following model.
- The e-mail address is not firstname.lastname@example.org, but something like email@example.com (and everyone receives firstname.lastname@example.org); this is an easy way to send all alerts to the on-call mailing list, and a really simple way to remember who is supposed to handle a given alert: when you see an e-mail going to email@example.com, you know that Jack got paged.
- When an alert is triggered, it immediately sends an e-mail and an SMS.
- After 5 minutes, if the alert hasn’t been acknowledged or resolved, it escalates into a phone call.
The escalation policy is quite simple: the “First Line” gets paged first. After 10 minutes, if the alert is still active, the “Second Line” is alerted. Ten minutes later, start again with the “First Line”, and so on.
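Put together, the notification rules and the escalation policy give a timeline like the one sketched below (a toy model that mirrors the prose above, not an actual PagerDuty configuration).

```python
from itertools import cycle

# A toy model of the notification rules and escalation policy described above.

NOTIFICATION_RULES = [(0, "e-mail + SMS"), (5, "phone call")]  # minutes after a level is paged
ESCALATION_DELAY = 10  # minutes before an unacknowledged alert moves to the next level
LEVELS = cycle(["First Line", "Second Line"])


def timeline(until=40):
    """List who gets contacted, and how, while an alert stays unacknowledged."""
    events = []
    for level_start in range(0, until, ESCALATION_DELAY):
        level = next(LEVELS)
        for delay, channel in NOTIFICATION_RULES:
            events.append((level_start + delay, level, channel))
    return events


for minute, level, channel in timeline():
    print("t+%02d min: %s gets %s" % (minute, level, channel))
```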
What happens when someone wants to take a day off (or longer holidays)? We use PagerDuty overrides, and trade on-call days. “Hey, I’m taking a day off next Wednesday, but I’m supposed to be on call; who wants to swap his next on-call day with me?” Each engineer is responsible for setting up overrides for when he can’t be on duty. Entering an override in PagerDuty takes just one minute.
We found out that it is important to be reminded when you are on call. It prompts you to keep your cell phone at hand, and charged.
The easiest way to get reminders is to import your personal PagerDuty calendar into Google Calendar (as a separate calendar), and set up a calendar-wide alert on the imported calendar (i.e. a reminder that triggers for all the events of the calendar; in other words, all the occurrences of you being on call). You can use your favorite kind of reminder (e-mail, SMS…) with your preferred advance notice.
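If Google Calendar reminders aren’t your thing, you can also pull the same iCal feed yourself and build whatever reminder you like on top of it. Here is a small sketch using the requests and icalendar Python libraries; the feed URL is a placeholder, yours lives in your PagerDuty profile.

```python
from datetime import datetime, timezone

import requests
from icalendar import Calendar

# A do-it-yourself reminder sketch: list your upcoming on-call shifts from the
# personal PagerDuty iCal feed. The feed URL below is a placeholder.

FEED_URL = "https://example.pagerduty.com/private/your-token/feed"  # placeholder


def upcoming_shifts(feed_url):
    """Return (start, summary) pairs for future events in the iCal feed."""
    response = requests.get(feed_url)
    response.raise_for_status()
    calendar = Calendar.from_ical(response.text)
    now = datetime.now(timezone.utc)
    shifts = []
    for event in calendar.walk("VEVENT"):
        start = event.get("dtstart").dt
        # Assume timezone-aware timestamps; skip all-day (date-only) entries.
        if isinstance(start, datetime) and start > now:
            shifts.append((start, str(event.get("summary"))))
    return sorted(shifts)


# for start, summary in upcoming_shifts(FEED_URL):
#     print(start, summary)
```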
Also, the bulk of our team is located on the West Coast, but we have some folks on the East Coast. We look forward to further enhancing our on-call rotation to take into account the timezone offset between the two, to reduce the time during which someone is on call during the (local) night.
It took some time to fine-tune this rotation; while it’s not perfect, it helps us react very quickly to critical events, 24/7, without sacrificing the morale or the health of our engineering team. We hope this can be useful for you as well!