Outage management is a core skill for a software engineer and is critical to achieving high availability for an online service. Ideally, an outage begins with an alert: a notification showing that the health of the service has degraded. The first thing to do is to acknowledge the alert, but only if you are able to follow up on it.
Great article, folks 👌 ... Maybe a typo: "iterating the know**ns**/unknowns..." Sorry if it is not one.