I was recently instructed not to release things on production for a while, unless absolutely necessary.
This is coming from a senior executive, recently introduced to the team1:
Please do not modify the code to fix issues that are not critical. We have a go-live next week, and modifying the code now is risky, since it may break other things.
Having had continuous deployment on our app since the get-go, I was quite skeptical upon receiving this. But holding off deployments because there’s a demo coming up, or because we’ve got a marketing launch soon, or because it’s Friday, seems to be conventional wisdom. I think it comes straight out of the orthodox way of doing things, and we can do better.
Risks of deploying
Deploying has risks. You might indeed introduce a bug while fixing another. There may be breaking changes you didn’t notice. You can break something in the process itself: some files might be missing, some may be corrupt. There may be data incompatibilities. This could result in an interruption of service, degraded functionality, or data corruption.
Like any risky process, the analysis factors in impact (consequences, probability, detectability) and an action plan to reduce it: mitigation (things you do to reduce the probability or impact) and resolution (things you do to fix the impact if the risk becomes a problem). The riskiest thing is something with high consequences that is very likely to happen but can’t be detected.
We already covered the consequences. Conventional wisdom assumes a non-negligible probability that something will go wrong. The orthodox solution is to not deploy on Friday: having people in the office increases the probability of detecting potential issues and fixing them quickly, while also lowering the risk of having to work on a Saturday. This corresponds to a mitigation action plan (detect things faster, reduce the probability of working on Saturday) and a resolution action plan (fix the problems faster).
By all accounts, not releasing because you have an upcoming go-live is similar. It comes from the same assumptions and consequences: modifying the code might break something in an unrecoverable way, or the breakage will not be detected in a timely manner. By fixing only the critical bugs, you effectively limit the variations in the code, reducing the probability that a new bug is introduced (mitigation). You also reduce the chance that a deployment goes wrong… by not deploying (mitigation as well).
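To make the interplay of consequences, probability, and detectability concrete, here is a sketch of an FMEA-style risk priority number. This is my own illustration, not part of the original risk-analysis vocabulary, and the ratings below are entirely made up:

```python
# Hypothetical FMEA-style risk scoring. Severity, probability and
# undetectability are each rated 1 (low) to 10 (high); undetectability
# is inverted on purpose: a 10 means the problem is nearly invisible.
def risk_priority(severity: int, probability: int, undetectability: int) -> int:
    """Return a risk priority number: higher means address it first."""
    return severity * probability * undetectability

# A bad deployment that interrupts service, is fairly likely,
# but is caught immediately by monitoring:
caught = risk_priority(severity=8, probability=6, undetectability=1)

# Silent data corruption: less likely, but very hard to detect:
silent = risk_priority(severity=9, probability=3, undetectability=9)

print(caught, silent)  # → 48 243
```

With numbers like these, the silent failure mode outranks the loud one, which matches the intuition above: high consequences plus low detectability is the worst combination.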
Issues with delaying deployment
The problem with holding off deployments is that it doesn’t fix the root causes. It merely tries to limit the impact of the risk.
The consequences are: deployment goes bad → interruption of service. Regression introduced → degraded service. Undetected bug → some horrible stuff.
Conventional wisdom reduces the consequences of a bad deployment by removing the deployment altogether. Kind of like treating an infection by amputating the limb rather than using antibiotics. Worse: it actually increases the risk. Larger releases and larger code changes are prone to more errors, so by delaying a release, this solution effectively increases the probability that something will go wrong when releasing. It’s also not free: since the point of releasing something is to make money out of it2, delaying the release means delaying the return on investment, effectively lowering it.
As much as this strategy makes sense in certain contexts, there are better solutions to deal with those risks.
The other way
Issues occur with deployment when you are afraid and uncertain. The solution is to become comfortable and certain. There are several ways, which work together as a whole to be effective and create a virtuous cycle. To arrive at this point:
- Fix your process: minimize scope, lower WIP, work in small batches, and get the team to iterate a lot. Incremental changes become smaller and less prone to introduce errors, effectively mitigating the chance that bugs are introduced. For this process to work, the rest also has to work:
- Fix your engineering: pair program, or get code reviewed. If everything that hits production has been seen by two pairs of eyes, the probability that it contains a bug is drastically lower. Introduce high code standards and decoupling, and give time for integrating review comments, reducing technical debt, etc. This will help lower the probability of side effects when introducing incremental changes, effectively mitigating the chance of a new bug being introduced.
- Automate testing: code changes not covered by unit tests are the single biggest source of issues. Without tests, you can’t guarantee that a change works. It’s also disrespectful to other maintainers: nobody else can guarantee that their change is not going to break everything else. Anyone touching the codebase becomes anxious, because they have to dig deep into the layers to understand whether their modification is going to create side effects3. There are very few occasions where testing is not possible; most of them are smells indicating that the code is not structured appropriately. Automating tests, with a very high coverage target, drastically mitigates the probability that a new bug will be created. It also helps validate a deployment. Context always determines your targets, but it is entirely possible to reach a point where automated tests are the only barrier preventing something from reaching production. That is, assuming that you…
- Automate the deployment process: a fully automated CI pipeline removes the human factor from the equation. With a deployment process you can rely on, it’s very easy to rule out deployment issues. The best way to learn to rely on a pipeline is to have it run often, and to see it turn red whenever something wrong was prevented from reaching prod. This is a very effective way to mitigate the probability of a deployment issue. And the best way to have a deployment pipeline you trust is to:
- Release very often (several times a day, even on Fridays)4: this builds confidence that the deployment process itself works, by increasing the number of data points and the amount of practice. Getting there also forces the process to be automated. It also shrinks each increment, making it less likely to break something that was already there, and it reduces the probability of merge conflicts, a major source of bugs. By itself, releasing often mitigates both the probability that an issue is introduced and the probability that a deployment issue appears. And if one does appear, you’re covered by…
- Monitoring and alerts: they should tell you very quickly when something is wrong, effectively mitigating the probability that an issue goes undetected. This includes alerts on events and errors, monitoring of averages, status pages, etc.
- Rollback: if something goes wrong, it should be really quick to roll back to a version that is suspected not to be faulty. And if it’s still wrong, go back one more. And again. This is only possible with a clean repository of self-contained packages (i.e., each can be deployed by itself). This helps resolve potential bugs. But what if the issue is not coming from the last incremental change?
- Keep atomicity: be able to re-introduce your changes in a different order. If you’re using trunk-based development, this may come at a hard cost (cherry-picking things here and there). Atomic change flow may help.
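The monitoring point above can be made concrete with a small sketch: an alarm on the average error rate over a sliding window of recent requests. This is a minimal illustration, not a real monitoring stack; the window size and threshold are made-up defaults:

```python
from collections import deque


def error_rate_alarm(window: int = 100, threshold: float = 0.05):
    """Build a recorder that tracks request outcomes and reports whether
    the error rate over the last `window` requests exceeds `threshold`.

    Hypothetical defaults: alert when more than 5% of the last
    100 requests failed.
    """
    outcomes = deque(maxlen=window)  # oldest outcomes fall off automatically

    def record(is_error: bool) -> bool:
        outcomes.append(is_error)
        return sum(outcomes) / len(outcomes) > threshold

    return record
```

In a real system this logic would live in your monitoring tooling rather than in application code, but the principle is the same: a cheap, always-on signal that turns a silent failure into a loud one.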
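The rollback strategy above — step back one version at a time until monitoring reports healthy — can be sketched roughly like this. The `deploy` and `healthy` callables are hypothetical placeholders for your own pipeline and monitoring, not a real API:

```python
from typing import Callable


def roll_back_until_healthy(
    versions: list[str],            # known versions, newest first
    deploy: Callable[[str], None],  # placeholder: deploys one self-contained package
    healthy: Callable[[], bool],    # placeholder: queries your monitoring/status page
) -> str:
    """Step back one release at a time until the service is healthy again.

    This only works if every version in the list is a self-contained
    package that can be deployed on its own.
    """
    for version in versions:
        deploy(version)
        if healthy():
            return version
    raise RuntimeError("no known-good version found; time for a hotfix")
```

For example, `roll_back_until_healthy(["v1.4", "v1.3", "v1.2"], deploy, healthy)` would try each version in turn and stop at the first one your monitoring accepts.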
Preventing deployment is a smell
Those solutions fix the root cause: they bring the process to a point where deploying is normal, and bugs making their way to production are the exception. It takes some time to get there, especially when starting from “the good olde way” of doing things.
Preventing deployments because it’s Friday, because there’s a demo, because there’s a go-live, is a sign that confidence in the process hasn’t been built. Each increment has a non-negligible chance of breaking things, and the proposed solution is to do less of the risky activity. It is a smell that things are done in an uncontrolled, uncertain way. “No deploy Friday” is a reasonable answer to problems caused by engineering practices and infrastructure that are not mature. You don’t recover from that by starting to deploy on Fridays; rather, use it as a target, fixing your process and tools until someone, one day, asks you: “why aren’t we releasing on Fridays again?”.
The best way of getting good at something is repetition. More practice, not less.