Great software is released software!
Today’s post is about how to release software to production.
First, we will cover the required steps before landing a pull request to the main branch.
Build is green
Linter is green
All unit tests are passing
Pull request code reviewed and approved
Storybooks are passing (frontend only; optional but recommended). These are unit tests that catch unwanted changes in the UI.
Integration tests are passing (optional but recommended). These are high-level tests that invoke the top-level APIs in environments similar to production, usually with minimal or no mocking. Ensure they run quickly and are not flaky; otherwise they will degrade the developer experience. No engineer likes to resubmit a pull request.
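The checklist above can be expressed as a small gate. This is a minimal sketch, not how any particular CI system works: the Check type and the blocking_checks function are hypothetical names, and the rule that an optional check blocks only when it ran and failed is my assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Check:
    """One pre-landing check (hypothetical type, for illustration only)."""
    name: str
    passed: Optional[bool]  # None means the check did not run
    required: bool = True


def blocking_checks(checks: List[Check]) -> List[str]:
    """Return the names of the checks that should block landing.

    Assumption: a failed check always blocks, while a check that did not
    run blocks only if it is required.
    """
    blocking = []
    for check in checks:
        if check.passed is False:
            blocking.append(check.name)
        elif check.passed is None and check.required:
            blocking.append(check.name)
    return blocking
```

A pull request would be allowed to land only when blocking_checks returns an empty list. Optional checks such as Storybooks or integration tests would be created with required=False.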
Once a pull request has landed on the main branch, it is time to deploy it to production. The main recommendation is to deploy to the smallest number of machines first and then gradually increase the number of machines running the new version.
The deployment usually follows the pattern below:
Canary environment (a few machines that get the deployed version first)
Data centers/regions with 1%, then 2%, then 5% of the machines at a time. For example, you first deploy to us-west, then us-east, then Europe, etc. It is good to be able to speed up the deployment through a configuration setting if necessary. This should only be used during outage mitigation.
Between every step, you either require a manual approval to move to the next stage or automate the approval based on metrics/instrumentation that mimic what the engineers would look at before approving deployment to the next stage.
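The staged rollout above can be sketched as a plan plus a loop that stops at the first stage that is not approved. This is only an illustration under stated assumptions: Stage, build_plan, run_rollout, and the expedite flag are hypothetical names, the percentages are treated as cumulative shares of a region's machines ending at 100, and the deploy and approve callbacks stand in for whatever your deployment tooling and metrics checks actually are.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence


@dataclass(frozen=True)
class Stage:
    region: str
    percent: int  # assumed: cumulative share of the region's machines on the new version


def build_plan(regions: Sequence[str],
               steps: Sequence[int] = (1, 2, 5, 100),
               expedite: bool = False) -> List[Stage]:
    """Canary first, then each region in order, step by step.

    expedite=True collapses every region to a single 100% step. It models
    the "speed up through configuration" option and is meant for outage
    mitigation only.
    """
    plan = [Stage("canary", 100)]
    region_steps = (100,) if expedite else tuple(steps)
    for region in regions:
        for percent in region_steps:
            plan.append(Stage(region, percent))
    return plan


def run_rollout(plan: Sequence[Stage],
                deploy: Callable[[Stage], None],
                approve: Callable[[Stage], bool]) -> Optional[Stage]:
    """Deploy stage by stage; return the stage where the rollout halted, or None.

    approve can wrap a manual approval or an automated metrics check.
    """
    for stage in plan:
        deploy(stage)
        if not approve(stage):
            return stage
    return None
```

For example, build_plan(["us-west", "us-east"]) yields the canary stage followed by four steps per region, and run_rollout returns the first stage whose approval failed so an engineer can investigate before anything else is deployed.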
Quick note for mobile apps: you want to release internally to employees first and then also do a gradual rollout on the App Store and Google Play.
The next question is who does the deploy and at what cadence?
At one extreme is CI/CD (continuous integration/continuous deployment). This means that the second a pull request lands, it gets deployed in the way described above.
There is a lot to like about CI/CD.
It ensures you have one pull request/change per deployment. This makes it easy to detect the root cause during an outage.
It puts the deployment on autopilot, freeing development time.
It incentivizes engineering excellence, especially in terms of observability.
The main disadvantage is that it increases the chance of a deployment happening unattended, and it couples landing with deployment, which is not always what you want. Sometimes you just want to land the pull request without deploying it.
At the other extreme are scheduled deploys (e.g. daily or weekly), usually done by the on-call engineer. These are undesirable for many reasons. Primary among them is that they pile multiple pull requests into each deployment, increasing the odds of an outage and making the root cause harder to find. In addition, they create a bottleneck on the on-call engineer, whom everyone starts asking when they are going to deploy and whether they have deployed yet.
My preference for small startups is to have CI/CD and to move to on-demand deployments done by engineers once the customer base is bigger. I think on-demand deployment is the best of both worlds. It ensures all deployments are attended and happen when desired. Because deployments happen more frequently, each one carries fewer pull requests, although a deployment can still contain more than one. Finally, it does not create a bottleneck on the on-call engineer, since every engineer is able to deploy on demand whenever they want.
The last topic to discuss is the different environments. I recommend the following:
Development on a developer machine. Big bonus points if this environment can be started with a single command (e.g. npm run dev boots up everything needed on the local machine).
Staging. This environment is similar to production but with a non-production database, auth, configuration, etc.
“Next”. This environment is part of production but is usually not exposed to the outside world.
Production. This is the public environment used by users/customers.
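One way to make the differences between these environments explicit is a small table in code. The sketch below is only illustrative: the EnvironmentProfile fields, the promotion order, and the idea that "Next" uses production data because it is part of production are my assumptions, not requirements.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Environment(Enum):
    DEVELOPMENT = "development"
    STAGING = "staging"
    NEXT = "next"
    PRODUCTION = "production"


@dataclass(frozen=True)
class EnvironmentProfile:
    uses_production_data: bool  # assumed property, for illustration
    publicly_reachable: bool    # exposed to the outside world


PROFILES = {
    Environment.DEVELOPMENT: EnvironmentProfile(uses_production_data=False, publicly_reachable=False),
    Environment.STAGING: EnvironmentProfile(uses_production_data=False, publicly_reachable=False),
    Environment.NEXT: EnvironmentProfile(uses_production_data=True, publicly_reachable=False),
    Environment.PRODUCTION: EnvironmentProfile(uses_production_data=True, publicly_reachable=True),
}

# Assumed promotion order: a change moves through the environments as listed.
PROMOTION_ORDER = [
    Environment.DEVELOPMENT,
    Environment.STAGING,
    Environment.NEXT,
    Environment.PRODUCTION,
]


def next_environment(current: Environment) -> Optional[Environment]:
    """Return the environment a change is promoted to next, or None after production."""
    index = PROMOTION_ORDER.index(current)
    if index + 1 < len(PROMOTION_ORDER):
        return PROMOTION_ORDER[index + 1]
    return None
```

Keeping these properties in one place makes it easy to assert, for example, that production is the only publicly reachable environment.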
Software Engineering from the Frontlines Course on Maven
If you liked this article, I will be teaching a “Software Engineering from the Frontlines” course on Maven where I will teach hard-learned lessons I acquired developing large-scale products at companies such as Uber, Airbnb, and Microsoft.
I want to emphasize the importance of having good observability and monitoring for engineer-driven deployments. I've seen it happen again and again (and I've made this mistake myself) that people deploy things without keeping an eye on the dashboards, or without even having proper metrics in place to begin with. In our deployment guidelines, step #1 is now to bring up all the dashboards that need watching, and only then proceed with the deployment.
I remember at one company (~7 years ago) we had a person with the official title "Release Manager" just to manage the steps mentioned in your post, distribute release notes, etc. I wonder if any company still does that now that it is much more automated.