Outage management is a core skill for a software engineer and is critical to achieving high availability for an online service.
Ideally, an outage starts with an alert that notifies you that the health of the service has degraded. The first thing to do is to acknowledge the alert, but only if you are able to follow up on it.
If your company has an outage or team channel, drop a quick note about the outage there in case it helps someone else facing related issues.
Next, absolutely avoid forward fixing. Usually, you or an engineer on your team will have an idea of what caused the outage and might be tempted to just deploy a fix. The problem is that, most of the time, emergency deploys end up exacerbating the problem.
Your first priority should be to return the service to a state of normalcy.
You should first try to roll back to a previous deploy or configuration if the start of the outage coincides with a deployment or a configuration change.
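As a rough illustration of that check, here is a minimal Python sketch that lists the changes applied shortly before the outage began. The `Change` record, the `rollback_candidates` helper, and the 30-minute window are all assumptions made for this example; in practice the list of changes would come from whatever deploy and configuration history your team keeps.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Change:
    kind: str             # hypothetical labels, e.g. "deploy" or "config"
    name: str
    applied_at: datetime


def rollback_candidates(outage_start, changes, window=timedelta(minutes=30)):
    """Return changes applied within `window` before the outage began, newest first."""
    suspects = [
        c for c in changes
        if timedelta(0) <= outage_start - c.applied_at <= window
    ]
    return sorted(suspects, key=lambda c: c.applied_at, reverse=True)
```

The newest change comes first because it is usually the cheapest one to roll back and test. An empty result is a hint that a rollback is unlikely to help.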
In addition, you should check if the outage is happening in all data centers/regions. If not, you should fail over to the healthy one.
Anything you do while failing over or rolling back should be done gradually, as an abrupt change might cause further deterioration or introduce new problems.
If this solves the issue, you can breathe a sigh of relief and catch your breath. Otherwise, it is time to start digging.
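One way to picture a gradual failover is a staged traffic shift that stops and reverts as soon as health degrades. The sketch below is only an illustration: `set_weight` and `is_healthy` are hypothetical callbacks standing in for your load balancer and your health checks, and the step sizes and soak time are arbitrary.

```python
import time


def shift_traffic_gradually(set_weight, is_healthy,
                            steps=(5, 25, 50, 100), soak_seconds=60.0):
    """Move traffic to the target in stages; revert to 0% if health degrades.

    `set_weight(percent)` and `is_healthy()` are hypothetical callbacks
    supplied by the caller.
    """
    for percent in steps:
        set_weight(percent)
        time.sleep(soak_seconds)   # let metrics settle before judging health
        if not is_healthy():
            set_weight(0)          # back out instead of pushing forward
            return False
    return True
```

The function returns `True` only if every stage stayed healthy, so the caller knows whether the shift completed or was reverted.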
If you have a secondary on-call, it is a good idea to get in touch with them. Your team culture should encourage this. Handling outages alone is unhealthy and everyone benefits from an extra set of eyes.
If the outage is a high-severity one, you should start an incident page, join a Zoom outage call, and start actively paging folks and on-call engineers to join the meeting.
Next, you should check the health of your downstreams and upstreams and see if anything there is unhealthy. After this, you should see what changed in your systems: check deployments, configuration changes, running experiments, logs, etc.
If this still does not help, you should try to get a repro, either directly or by reaching out to impacted users, and see what is special about them that might be causing the outage.
You should try to identify whether the impact is affecting all users or only some users, regions, features, etc. This will help you close in on the root cause.
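A quick way to scope the impact is to break the error rate down by one dimension at a time. The Python sketch below assumes request logs are available as dictionaries with a `status` field and fields such as `region`; those field names, and treating status codes of 500 and above as errors, are assumptions for illustration only.

```python
from collections import defaultdict


def error_rate_by(requests, dimension):
    """Return the fraction of failed requests for each value of `dimension`."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for request in requests:
        key = request[dimension]
        totals[key] += 1
        if request["status"] >= 500:   # assumption: 5xx means a failed request
            errors[key] += 1
    return {key: errors[key] / totals[key] for key in totals}
```

If one region or feature stands out while the others sit near zero, the outage is likely localized, which narrows down where to look next.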
In parallel, you should also break your system down into its components and see which ones are healthy and which ones are not.
Communicating during the outage is important. If you are the incident manager (we used to call them commanders at Uber), you should keep restating the knowns and unknowns, update the incident page when possible, and delegate tasks. Staying calm and level-headed is crucial.
In reality, though, the best way to prepare for outages is through the work you do before they happen: instrumentation, dashboards, rollout strategy, redundancy, etc. Engineering excellence in your organization is highly correlated with these.
Finally, no discussion of outage management can end without talking about post-mortems. It is such an important topic that I previously dedicated a post to it by itself.