
How to say you’re sorry: writing a postmortem after a major outage

There comes a time in every startup’s life when things go very very wrong.  It’s a moment of helplessness when everything you’ve been working towards for months and even years is starting to crumble before you.  There’s usually cussing.  And sometimes crying.  The bigger the hiccup, the more helpless you feel.  And the more customers you have yelling at you. This is actually a good thing.  It means you matter deeply to a lot of people.  That’s a big accomplishment and most startups don’t ever make it that far.  But it also makes it that much more important that you properly handle the crisis, and the aftermath.  This post is about a few things that I’ve learned dealing with bad situations at software-as-a-service companies. Not all of this might be relevant for consumer apps, but a lot of it probably still is given how important even free apps are becoming to all of us.

First, take care of business

This post is mainly about how to deal with the fallout from a crisis, but a big part of that depends on how you handle things in the heat of the moment.

Here are some rules of thumb that will set you up for success dealing with the outage afterwards.

  • Clearly communicate what you know, as soon as you know it

In the crisis, it’s all about communicating that there is an issue, when you’ve identified the cause, when you’ve identified a solution and when it’s fixed.  Generally speaking, the more info you can provide the better.  Still, while it is best to be as transparent as possible, you need to use your judgement here.  If, for example, you are experiencing an outage caused by a hitherto unknown security bug, it would be irresponsible to disclose the exact nature of the bug.  That will just open you up to attacks by opportunists and make things worse for your customers.

  • Update your customers regularly, and not just on Twitter

Many customers “know” to go to Twitter or www.downforeveryoneorjustme.com when there’s the slightest hint that something’s not right.  But many more don’t (plus you shouldn’t make your customers go looking for information anyway when things are going wrong).

This is a lesson that I learned the hard way recently at Mailgun, a company with over 10,000 customers sending around half a billion emails a month.  We didn’t have a great process in place (more on that below), but we did take advantage of Intercom.io (which we had coincidentally installed recently to push notifications to customers about new features) to update them about the status of our outage and the steps we were taking to get things rolling again.  It wasn’t perfect, but it helped.

  • Fix the problem fast, but don’t be sloppy

This is a really hard one and I’m not going to pretend that I know how to do this, let alone teach you.  But that doesn’t make it any less important.  There is a tendency when under pressure to act without fully thinking things through.  It’s why they call it fight or flight, not fight or rational solution.  The thing about the fight or flight instinct is that it massively reduces our perception of the complexity in a situation, but not the complexity itself.  That stays, and fast solutions can often leave tattered edges that cause problems down the line.  It’s probably impossible to avoid this altogether, but being aware of the trade-offs you make to solve an immediate problem can help you deal with the unintended consequences that may arise afterwards.

The postmortem: what happened and what are you going to do about it

Any substantial outage should have a postmortem.  If your customers trust you with their business, they deserve to know two things after a crisis: what happened and what you are going to do about it.

What happened

I mentioned above that during the crisis itself, it’s important to keep customers informed of what happened.  But because you are also trying to fix the problem at this point, it’s usually impossible to give all the details that customers deserve to know.  That’s where the postmortem comes in.  The postmortem should provide enough detail for the customer to know exactly what happened and why.  The more dependent the customer is on your service, the more important it is to be transparent.

Besides being honest with your customers (which is an end in itself), providing details can also humanize you, making the average customer more likely to give you the benefit of the doubt.  Here’s an example.  In 2007, Rackspace had a major datacenter outage.  Being in the datacenter business means you need to have plans to make sure that outages don’t happen, so there’s not much hope that customers would be willing to listen to “excuses” about why the outage occurred, especially when it turned out that the data center coolers malfunctioned.  But it did.   Here’s why.

The outage happened after a delivery truck jumped a curb, flew through the air and landed directly ON TOP OF a transformer (flying right over the reinforced concrete pillars surrounding the transformer). That’s pretty insane, but it gets worse.  Rackspace’s backup generators kicked on as planned, cycling up the huge coolers in place at the datacenter. But because there was a dude, in a truck, on top of a transformer, the utility company had to cut off power to the entire datacenter to safely rescue the driver.  This caused the coolers to cycle off and on repeatedly, lessening their effectiveness and overheating the datacenter.  That’s when the outage happened.

There were a lot of pissed off customers that day. But I guarantee hearing that story made them a lot more willing to take Rackspace’s side in the situation.  And in a crisis and afterwards, the benefit of the doubt is practically everything.

What are you going to do about it

Once you’ve clearly explained what happened, customers will always ask “what are you doing to make sure that this never happens again?”  You need to tell them. The details will vary, and in some cases you won’t be able to offer assurances that the problem is definitively solved, but you have to offer up concrete steps that you are taking to remedy the situation.  Concreteness is key.  People can smell spin a mile away.

I mentioned a recent outage at Mailgun above. During the outage, emails were significantly delayed, but what made matters worse is that customers were not notified until roughly an hour into the incident.  It became crystal clear that our status page was a terrible indication of our actual status.  To fix this, we rolled out a new status page immediately after the incident that let our customers subscribe to incidents by email, SMS, webhook, and RSS. Previously, none of these options existed. Along with better internal processes for identifying incidents early, this concrete step gave customers something to do immediately upon reading the postmortem.  We hope it gave them comfort that the next outage (yes, unfortunately there will be a next one, because that is the nature of SaaS) would be handled better than the last.
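To make the idea of subscribing to incidents a little more concrete, here is a minimal Python sketch of fanning one incident update out to subscribers on several channels. It is not how Mailgun's status page works; the `StatusPage` and `Subscriber` names and the sender callables are all hypothetical, and real senders would wrap whatever email, SMS, or webhook service you already use. RSS is left out on purpose, since a feed is something subscribers poll rather than something you push to.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# A sender takes (address, message) and delivers it. Real senders would call
# an email API, an SMS gateway, or POST to a webhook URL; all hypothetical here.
Sender = Callable[[str, str], None]


@dataclass
class Subscriber:
    channel: str  # e.g. "email", "sms", "webhook"
    address: str  # email address, phone number, or URL


@dataclass
class StatusPage:
    senders: Dict[str, Sender]
    subscribers: List[Subscriber] = field(default_factory=list)

    def subscribe(self, channel: str, address: str) -> None:
        # Reject channels we have no sender for, so failures surface early.
        if channel not in self.senders:
            raise ValueError(f"unsupported channel: {channel}")
        self.subscribers.append(Subscriber(channel, address))

    def publish(self, message: str) -> int:
        """Send an incident update to every subscriber; return how many succeeded."""
        delivered = 0
        for sub in self.subscribers:
            try:
                self.senders[sub.channel](sub.address, message)
                delivered += 1
            except Exception:
                # A broken channel must not stop updates on the others.
                continue
        return delivered


if __name__ == "__main__":
    outbox = []  # stand-in for real delivery, so the sketch runs offline
    page = StatusPage(senders={
        "email": lambda addr, msg: outbox.append(("email", addr, msg)),
        "sms": lambda addr, msg: outbox.append(("sms", addr, msg)),
    })
    page.subscribe("email", "ops@example.com")
    page.subscribe("sms", "+15550100")
    print(page.publish("Investigating delayed email delivery"))
```

The one design choice worth noting is that a failure on one channel is swallowed so the remaining subscribers still hear from you; in practice you would probably also log or retry those failures.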

Don’t forget to listen

Up to this point, I’ve mentioned a bunch of things to say to customers.  But it is just as important to listen.  Even under the best of circumstances, your customers will still be pissed at you.  You ruined their day and probably cost them money. Some customers will want to get you on the phone personally to tell you all the terrible things that happened because of your screw-up. Let them.  But at the end of the day, if they are reasonable, they’ll just want to know what everyone else wants to know: what happened and what you’re doing about it.  If you’ve already thought that through as part of your postmortem, these conversations won’t be fun, but you’ll survive.

And then move on

Sometimes the mess will take a while to clean up, but eventually, your customers will move on, and you will too. As long as you learn a lesson from the incident, make sure it doesn’t happen again anytime soon, and handle your communications openly and with a human touch, these types of incidents can actually strengthen your relationship with customers. Customers are people (shocker!) and people tend to come together during times of hardship if they feel like they can trust each other.  True character comes out during hard times, not good ones, so if you can get your customers to trust you when things are bad, you’re in a really good spot.

I’ll end this post with a comment from one of Mailgun’s long-time customers, the very type of person we were terrified to lose when the crisis was full-blown and no solution was yet in sight. It’s a small reward for the effort we put into handling the outage. And a huge reminder that we have people relying on us every day who we never want to let down again.

P.S.  You can read the full postmortem we wrote at Mailgun after our recent outage here.