Outage Management

Feb 13, 2023

Outage management is a core skill for a software engineer and is critical to achieving high availability for an online service.

Ideally, an outage starts with an alert: a notification that the health of the service has degraded. The first thing to do is to acknowledge the alert, but only if you are able to follow up on it.

If your company has an outage or team channel, drop a quick note there about the outage in case it helps someone else facing related issues.

The next step is to resist forward fixing at all costs. Usually, you or an engineer on your team will have an idea of what caused the outage and will be tempted to just deploy a fix. The problem is that, most of the time, emergency deploys end up exacerbating the outage.

Your first priority should be to return the service to a state of normalcy.

Software Engineering Tidbits

First things first

Returning a service to a green status is the priority in an outage. This means looking first into rolling back recent changes or falling back to unaffected regions. Avoid the temptation to forward fix or to spend time identifying the root cause before you do this…

3 years ago · 6 likes · Georges El Khoury

You should first try to roll back to a previous deploy or configuration if the start of the outage coincides with a deployment or a configuration change.
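As a rough illustration of matching an outage against recent changes, the sketch below lists the changes that landed shortly before the outage started, most recent first. The `Change` record, the `suspect_changes` helper, the sample data and the 60-minute lookback window are all hypothetical; in practice this information would come from your own deploy and configuration tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Change:
    """Hypothetical record of a deployment or configuration change."""
    kind: str  # e.g. "deploy" or "config"
    description: str
    applied_at: datetime


def suspect_changes(changes, outage_start, lookback=timedelta(minutes=60)):
    """Return changes applied within `lookback` before the outage, newest first."""
    window_start = outage_start - lookback
    recent = [c for c in changes if window_start <= c.applied_at <= outage_start]
    return sorted(recent, key=lambda c: c.applied_at, reverse=True)


# Made-up example data: one old deploy and two changes close to the outage.
outage_start = datetime(2023, 2, 13, 14, 30)
changes = [
    Change("deploy", "checkout-service build 41", datetime(2023, 2, 13, 9, 0)),
    Change("config", "raise cache TTL", datetime(2023, 2, 13, 14, 5)),
    Change("deploy", "checkout-service build 42", datetime(2023, 2, 13, 14, 22)),
]

suspects = suspect_changes(changes, outage_start)
for change in suspects:
    print(change.applied_at.isoformat(), change.kind, change.description)
```

The newest change in the window is usually the first rollback candidate, but a change that merely coincides in time is only a suspect, not a confirmed cause.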

In addition, you should check whether the outage is happening in all data centers or regions. If not, you should fail over to the healthy ones.

Anything you do while failing over or rolling back should be done gradually, as an abrupt change might cause further deterioration or introduce new problems.

If this solves the issue, you can breathe a sigh of relief and catch your breath. Otherwise, it is time to start digging.
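One way to picture a gradual failover is a loop that moves traffic in small steps and backs off as soon as health degrades. This is only a sketch: `shift_traffic_gradually`, the step percentages, and the `set_weight` and `is_healthy` callbacks are hypothetical stand-ins for whatever your load balancer and monitoring actually expose, and a real version would wait for metrics to settle between steps.

```python
from typing import Callable, Iterable


def shift_traffic_gradually(
    set_weight: Callable[[int], None],
    is_healthy: Callable[[], bool],
    steps: Iterable[int] = (5, 25, 50, 100),
) -> int:
    """Move traffic to the fallback target step by step.

    Returns the last percentage that stayed healthy. If a step looks
    unhealthy, traffic is put back to that last good percentage and the
    shift stops.
    """
    last_good = 0
    for percent in steps:
        set_weight(percent)
        # A real implementation would pause here to let metrics settle.
        if not is_healthy():
            set_weight(last_good)  # back off instead of pushing on
            return last_good
        last_good = percent
    return last_good


# Simulated run: the fallback copes with up to 50% of traffic, not more.
applied = []
current = {"percent": 0}


def set_weight(percent: int) -> None:
    current["percent"] = percent
    applied.append(percent)


def is_healthy() -> bool:
    return current["percent"] <= 50


reached = shift_traffic_gradually(set_weight, is_healthy)
print(reached, applied)
```

In this simulated run the shift stops at 50% rather than overloading the fallback, which is exactly the kind of new problem a single all-at-once switch could have caused.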

If you have a secondary on-call, it is a good idea to get in touch with them. Your team culture should encourage this. Handling outages alone is unhealthy and everyone benefits from an extra set of eyes.

If the outage is high severity, you should start an incident page, open a Zoom outage call, and actively page the relevant people and on-call engineers to join it.

Next, you should check the health of your upstream and downstream dependencies and see if anything there is unhealthy. After this, you should look at what changed in your own systems: deployments, configuration changes, running experiments, logs, and so on.

If this still does not help, you should try to get a repro, either by reproducing the problem directly or by reaching out to impacted users to see what is special about them that might be causing the outage.

You should try to identify whether the impact is affecting all users or only some users, regions, features, and so on. This will help you close in on the root cause.

In parallel, you should also deconstruct your systems and try to see which ones are healthy and which ones are not.
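A quick way to see where the impact is concentrated is to break the error rate down along one dimension at a time. The sketch below assumes request logs are available as dictionaries with `region`, `feature` and `status` fields; those field names, the `error_rate_by` helper and the sample data are invented for illustration.

```python
from collections import Counter


def error_rate_by(requests, dimension):
    """Return {value: error rate} for one dimension such as "region"."""
    totals = Counter()
    errors = Counter()
    for request in requests:
        value = request[dimension]
        totals[value] += 1
        if request["status"] >= 500:  # treat 5xx responses as failures
            errors[value] += 1
    return {value: errors[value] / totals[value] for value in totals}


# Hypothetical sample of request logs.
requests = [
    {"region": "us-east", "feature": "search", "status": 200},
    {"region": "us-east", "feature": "checkout", "status": 500},
    {"region": "us-west", "feature": "search", "status": 200},
    {"region": "us-west", "feature": "checkout", "status": 503},
    {"region": "eu-west", "feature": "search", "status": 200},
    {"region": "eu-west", "feature": "checkout", "status": 500},
]

for dimension in ("region", "feature"):
    print(dimension, error_rate_by(requests, dimension))
```

In this made-up sample the error rate is identical across regions but sits entirely in one feature, which would point the investigation at that feature rather than at a regional problem.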

Software Engineering Tidbits

A good way to debug

One of the best software engineering tips I received is a good way to debug. Once you are investigating something broken, you keep peeling/removing functionalities until things work again and then you add them back one by one until things re-break again…

3 years ago · 8 likes · Georges El Khoury
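The peel-and-add-back idea quoted above can be sketched in a few lines. Everything here is hypothetical: `find_breaking_feature`, the feature names and the `works` check stand in for whatever toggles and health checks your system really has, and the sketch assumes a single feature is responsible.

```python
def find_breaking_feature(features, works):
    """Peel features off until `works` passes, then add them back one by one.

    Returns the feature whose re-addition breaks things again, or None if
    no single feature could be blamed.
    """
    enabled = list(features)
    removed = []
    # Peel features off (last one first) until the system works again.
    while enabled and not works(enabled):
        removed.append(enabled.pop())
    if not works(enabled):
        return None  # still broken with everything removed
    # Add features back, most recently removed first, until it re-breaks.
    for feature in reversed(removed):
        enabled.append(feature)
        if not works(enabled):
            return feature
    return None


# Simulated system that breaks whenever the "cache" feature is enabled.
features = ["auth", "cache", "recommendations", "new_ranker"]


def works(enabled):
    return "cache" not in enabled


culprit = find_breaking_feature(features, works)
print(culprit)
```

The same loop applies whether the "features" are code paths, flags, dependencies or whole services; what matters is changing one thing at a time so that the step that re-breaks the system is unambiguous.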

Communicating during the outage is important. If you are the incident manager (we used to call them commanders at Uber), you should keep restating the knowns and unknowns, update the incident page when possible, and delegate tasks. Staying calm and level-headed is crucial.

In reality, though, the best way to prepare for outages is the work you do before they happen: instrumentation, dashboards, rollout strategy, redundancy, and so on. Engineering excellence in your organization is highly correlated with these investments.

Finally, no discussion of outage management can end without talking about postmortems. It is such an important topic that I previously dedicated an entire post to it.

Software Engineering Tidbits

Blameless Postmortem

Putting together a blameless postmortem culture is one of the most effective ways to improve quality in an engineering organization. One of the best managers I had once told us that the only sure way not to have an outage is not to deploy to production and I surely expect you to ship and deploy a lot of features. The only thing I care about once we have an…

3 years ago · 1 like · Georges El Khoury

Software Engineering from the Frontlines Course on Maven

If you liked this article, I will be teaching a “Software Engineering from the Frontlines” course on Maven where I will teach hard-learned lessons I acquired developing large-scale products at companies such as Uber, Airbnb, and Microsoft.
