Software Engineering Tidbits

Small (or sometimes big) tidbits about software engineering. This is where I share tips and learnings I acquired building, maintaining and supporting software in production at Airbnb, Uber and Microsoft.

Outage Management

Georges El Khoury
Feb 13, 2023

Outage management is a core skill for a software engineer to acquire and is critical to achieving high availability for an online service.

Ideally, an outage starts with an alert notifying you that the health of the service has degraded. The first thing to do is to acknowledge the alert, but only if you are able to follow up on it.

If your company has an outage or team channel, drop a quick note about the outage there in case it helps someone else facing related issues.

Next, absolutely avoid forward fixing. Usually, you or an engineer on your team will have an idea of what caused the outage and might be tempted to just deploy a fix. The problem is that, most of the time, emergency deploys end up exacerbating the problem.

Your first priority should be to return the service to a state of normalcy.

Related post on Software Engineering Tidbits: First things first
Returning a service to a green status is the priority in an outage. This means looking first into rolling back recent changes or falling back to unaffected regions. Avoid the temptation to forward fix or to spend time identifying the root cause before you do this…

You should first try to roll back to a previous deploy or configuration if the start of the outage coincides with a deployment or a configuration change.
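As a rough sketch of that check, the snippet below lists the changes that landed shortly before the outage began. The `Change` record, the `rollback_candidates` helper, the 30-minute window and the sample change log are all hypothetical, chosen only for illustration; your deploy tooling will have its own way of exposing this history.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Change:
    kind: str             # e.g. "deploy" or "config" (hypothetical labels)
    name: str
    applied_at: datetime


def rollback_candidates(changes, outage_start, window=timedelta(minutes=30)):
    """Return changes applied in the `window` before the outage, newest first."""
    suspects = [
        c for c in changes
        if timedelta(0) <= outage_start - c.applied_at <= window
    ]
    return sorted(suspects, key=lambda c: c.applied_at, reverse=True)


# Hypothetical change log and outage start time.
outage_start = datetime(2023, 2, 13, 10, 30)
changes = [
    Change("deploy", "api-v42", datetime(2023, 2, 13, 10, 20)),
    Change("config", "cache-ttl", datetime(2023, 2, 13, 7, 0)),
    Change("deploy", "api-v43", datetime(2023, 2, 13, 10, 45)),
]
for change in rollback_candidates(changes, outage_start):
    print(f"rollback candidate: {change.kind} {change.name}")
```

Only the change applied ten minutes before the outage is flagged here. Widening the window is a judgment call, since some regressions take a while to show up.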

In addition, you should check whether the outage is happening in all data centers/regions. If not, you should fail over to the healthy ones.
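One way to frame that decision, assuming you can get a per-region error rate from your monitoring, is sketched below. The function names, the region names and the 5% threshold are assumptions for the example, not a recommendation.

```python
def split_regions(error_rates, threshold=0.05):
    """Split regions into (healthy, unhealthy) lists by error rate."""
    healthy = sorted(r for r, rate in error_rates.items() if rate < threshold)
    unhealthy = sorted(r for r, rate in error_rates.items() if rate >= threshold)
    return healthy, unhealthy


def failover_targets(error_rates, threshold=0.05):
    """Return the healthy regions to fail over to, or [] if failover cannot help."""
    healthy, unhealthy = split_regions(error_rates, threshold)
    if not unhealthy or not healthy:
        # Either nothing is broken, or every region is affected.
        return []
    return healthy


# Hypothetical error rates per region (fraction of failed requests).
rates = {"us-east": 0.40, "us-west": 0.01, "eu-west": 0.02}
print(failover_targets(rates))
```

If every region is unhealthy, the function returns an empty list, which is a hint that the problem is global and failing over will not help.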

Anything you do while failing over or rolling back should be done gradually, as these actions might themselves cause further deterioration or introduce new problems.
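A minimal sketch of a gradual shift could look like the following. The `set_weight` and `is_healthy` callables are placeholders for whatever your traffic router and monitoring expose, and the step percentages are arbitrary.

```python
import time


def shift_gradually(set_weight, is_healthy, steps=(5, 25, 50, 100), soak_seconds=0.0):
    """Move traffic to the target in increasing steps.

    After each step, wait `soak_seconds` and check health. If the check
    fails, return to the last known-good percentage and stop.
    Returns the percentage of traffic left on the target.
    """
    current = 0
    for pct in steps:
        set_weight(pct)
        time.sleep(soak_seconds)
        if not is_healthy():
            set_weight(current)  # back off to the last good step
            return current
        current = pct
    return current


# Hypothetical router: record the weights that were applied.
applied = []
final = shift_gradually(applied.append, lambda: applied[-1] < 50)
print(applied, final)
```

In this example the simulated health check fails once half of the traffic has moved, so the shift stops and returns to the previous step instead of pushing on to 100%.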

If this solves the issue, you can breathe a sigh of relief and catch your breath. Otherwise, it is time to start digging.

If you have a secondary on-call, it is a good idea to get in touch with them. Your team culture should encourage this. Handling outages alone is unhealthy and everyone benefits from an extra set of eyes.

If the outage is a high-severity one, you should start an incident page, join a Zoom outage call and start actively paging folks and on-call engineers to join the meeting.

Next, you should check the health of your downstream and upstream dependencies and see if anything there is unhealthy. After this, you should look at what changed in your systems: deployments, configuration changes, running experiments, logs, etc.
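The dependency sweep can be as simple as running one health check per dependency and listing the ones that fail. In the sketch below, the dependency names and the check functions are invented for the example; a real check would typically call a health endpoint or query a dashboard.

```python
def unhealthy_dependencies(checks):
    """Run each health check and return {name: reason} for the failing ones."""
    failing = {}
    for name, check in checks.items():
        try:
            if not check():
                failing[name] = "reported unhealthy"
        except Exception as exc:
            # A check that cannot even run is treated as unhealthy.
            failing[name] = f"check failed: {type(exc).__name__}"
    return failing


def _payments_db_check():
    # Hypothetical check that times out.
    raise TimeoutError("no response")


checks = {
    "auth-service": lambda: True,
    "payments-db": _payments_db_check,
    "search": lambda: False,
}
print(unhealthy_dependencies(checks))
```

Treating a check that raises as unhealthy is a deliberate choice here: during an outage, a dependency you cannot reach is at least as suspicious as one that reports a failure.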

If this still does not help, you should try to get a repro, either directly or by reaching out to impacted users to see what is special about them that might be causing the outage.

You should try to identify whether the impact is affecting all users, some users, specific regions, specific features, etc. This will help you close in on the root cause.
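To illustrate, here is a small sketch that computes the error rate per value of a chosen dimension. The record layout (`region`, `feature`, `error`) is an assumption made for the example; in practice you would run the equivalent query in your logging or metrics system.

```python
from collections import defaultdict


def error_rate_by(records, dimension):
    """Return {value: error rate} for each value of `dimension` in the records."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for record in records:
        key = record[dimension]
        totals[key] += 1
        if record["error"]:
            errors[key] += 1
    return {key: errors[key] / totals[key] for key in totals}


# Hypothetical request log.
records = [
    {"region": "us", "feature": "search", "error": False},
    {"region": "us", "feature": "checkout", "error": False},
    {"region": "eu", "feature": "search", "error": True},
    {"region": "eu", "feature": "checkout", "error": True},
]
print(error_rate_by(records, "region"))
print(error_rate_by(records, "feature"))
```

In this made-up data the errors are concentrated in one region and spread evenly across features, which points the investigation toward something region-specific.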

In parallel, you should also deconstruct your systems and try to see which ones are healthy and which ones are not.

Related post on Software Engineering Tidbits: A good way to debug
One of the best software engineering tips I received is a good way to debug. Once you are investigating something broken, you keep peeling/removing functionalities until things work again and then you add them back one by one until things break again…
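The peel-and-add-back idea can be sketched as a small loop. Here `works` stands in for whatever test tells you the system is healthy with a given set of components enabled; both it and the component names are hypothetical, and the sketch assumes a single culprit.

```python
def find_culprit(components, works):
    """Peel components off until `works` passes, then add them back one by one.

    Returns the component whose re-addition breaks things again, or None if
    nothing is broken or the system fails even with everything removed.
    """
    enabled = list(components)
    removed = []
    # Peel: remove components from the end until things work again.
    while enabled and not works(enabled):
        removed.append(enabled.pop())
    if not works(enabled):
        return None  # still broken with everything peeled off
    # Add back in the original order until things break again.
    for component in reversed(removed):
        enabled.append(component)
        if not works(enabled):
            return component
    return None


# Hypothetical system where the "cache" component is the broken one.
components = ["auth", "cache", "search", "recommendations"]
print(find_culprit(components, lambda enabled: "cache" not in enabled))
```

With several interacting faults this simple loop only finds the first component that re-breaks the system, so treat the result as a lead rather than a verdict.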

Communicating during the outage is important. If you are the incident manager (we used to call them commanders at Uber), you should keep reiterating the knowns and unknowns, update the incident page when possible and delegate tasks. Staying calm and level-headed is crucial.

In reality though, the best way to prepare for outages is through the work you do before they happen: things like instrumentation, dashboards, rollout strategy, redundancy, etc. Engineering excellence in your organization is highly correlated with them.

Finally, no discussion of outage management can end without talking about postmortems. It is such an important topic that I previously dedicated an entire post to it.

Related post on Software Engineering Tidbits: Blameless Postmortem
Putting together a blameless postmortem culture is one of the most effective ways to improve quality in an engineering organization. One of the best managers I had once told us that the only sure way not to have an outage is not to deploy to production and I surely expect you to ship and deploy a lot of features. The only thing I care about once we have an…


Software Engineering from the Frontlines Course on Maven

If you liked this article, I will be teaching a “Software Engineering from the Frontlines” course on Maven where I will teach hard-learned lessons I acquired developing large-scale products at companies such as Uber, Airbnb, and Microsoft.

