You’re hiking in the mountains. The weather is nice, and you are enjoying some
of the best mountain views around.
What could go wrong?
Turns out, a lot could go wrong, especially if you deployed something the day before.
And then you discover that the network coverage is very poor, and you try to figure out what could be wrong; while your friends are enjoying the hike, you are fiddling with your phone, trying to do something.
This is something that obviously should never happen.
Not only because the quality of your days off matters, but also because the experience of your users is affected, and you can't do anything about it.
This happened to me; fortunately, a colleague was available to fix the problem, but it should not have happened in the first place.
So, what went wrong?
The technical issue is still unknown. There are tons of logs, but at the same time key logs are missing, which makes the investigation harder.
However, the real problem was a different one: why did a problem happen at all? Why did it happen when I could do nothing to fix it? Why did the rollback require a strong technical person, instead of, say, instructions given over the telephone?
I think the problems are three: lack of tests, lack of a rollback procedure, and lack of documentation.
The feature should have been tested more thoroughly. If thorough testing is not
possible, then it must be deployed in a way that affects only a limited number
of users, to be increased over time.
The feature should have been enabled by configuration switches: something that anyone could change, ideally without the need for a redeploy, although requiring one might be an acceptable compromise in some organizations.
The feature should have been documented better: its risks, what to monitor, and what to do in case of problems.
Simply put, I did none of that.
Because of that, I was able to deploy "faster", but in reality I had just removed my safety nets. And when something went wrong, it went wrong big. Here is what I should have done, and will do from now on:
- More tests. In this case, more stress tests, and more tests that run automatically over a certain period of time. Tests should also exercise exceptional cases and crashes.
- Ensure that everything is visible. Some logs were missing, which is not acceptable. Metrics should be highly visible and should trigger alerts.
- Deploy in a gradual way. An initial set of users, then more and more, but only once the impact is well understood.
- Make rollback a normal, fully documented operation: something that anyone can perform, on any occasion.
- Never deploy before a hike, or a day when I am offline. It just does not make sense, especially in a small organization.
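The gradual-deployment point above can be sketched in a few lines. This is one common technique (a deterministic, hash-based percentage gate), not necessarily what any particular platform does; the function and feature names are made up for illustration.

```python
import hashlib

def in_rollout(user_id: str, feature: str, percentage: int) -> bool:
    """Deterministically assign each user to a bucket from 0 to 99 by
    hashing (feature, user_id). The same user always lands in the same
    bucket for a given feature, so raising the percentage only adds
    users to the rollout; it never swaps one user for another."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < percentage

# Start at, say, 5%, watch the metrics, then raise the number:
if in_rollout(user_id="user-42", feature="new_checkout_flow", percentage=5):
    ...  # serve the new behavior to this user
```

Combined with a configuration switch, the percentage itself becomes something anyone can raise, or drop to zero, without a redeploy.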
I am still ashamed of what happened; hopefully, it will at least serve as a good learning moment.