
Improving Operational Readiness

Jean Barmash

CTO at Apprentice.io


Problem

I had two teams (about 12 engineers total) supporting multiple products, some of which we had inherited after a reorg. The products used a microservice architecture, and we were responsible for about a dozen services written in three languages, half of which had been written in the previous six months. My organization was growing rapidly: three of the six backend engineers were new to the company, so when a key engineer left unexpectedly (for his dream job), we found ourselves with a gap. This engineer had taken on many of the team's operational duties during outages. He did a good job transferring knowledge, but my team's average tenure with the company was still under four months. For this reason, we needed the team to become more operationally aware.

Actions taken

I got together with the backend engineers from the two teams and explained my view of the situation and the potential risks; they agreed. We brainstormed, committed to a team improvement plan, and formed a working group. To improve, we needed both to build capability and to put processes in place that would notify us when problems came up. We also needed to start catching new problems as they surfaced in logs, exception handlers, and monitoring tools. And over time, we wanted to move from being reactive to proactive.
To get the ball rolling, we studied how other teams handled this issue and adopted some of their dashboards and metrics. I got a couple of copies of "Release It!", an excellent book with some great resilience patterns, and we asked other teams for help; for example, DevOps ran workshops to educate the engineers on our monitoring tools.

We already had a weekly rotation for our daily deploy process, so we expanded the deploy engineer's responsibility to cover operations and called the role the ops master. We reasoned that the engineer responsible for deploys knows when the system is changing and is already watching the monitoring tools, so he or she is in the best position to detect problems. The rotation included six backend engineers split across the two teams, so the load was spread out. We met weekly to maintain momentum, then moved to bi-weekly meetings. Some engineers became very passionate about the effort and emerged as its de facto owners.

We set up information radiators: always-on monitors showing the key dashboards so they were available at a glance. These helped us get familiar with the regular cadence of our systems. We kept a list of top bugs we wanted to investigate and made sure we allocated time for them in sprints. Better monitoring let us notice new bugs as soon as they were introduced, so we could respond while that area of the code was still fresh in our minds.

Over time we improved the process. For example, the outgoing ops master now meets with the incoming one to hand over what they have been seeing. We also started defining monitoring standards for our services, covering metrics such as memory and throughput. Eventually, some of the processes we created were adopted by other teams. A few months after implementation, we had to execute an engineering-wide initiative to improve the resilience of the whole platform, and my team was much better prepared to help drive it.
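The article doesn't spell out what those monitoring standards contained. As a rough illustration only, here is a minimal Python sketch of a service exposing the kinds of standard metrics mentioned above (memory and throughput), assuming the open-source prometheus_client library; the metric names, port, and handler are hypothetical, not the team's actual conventions.

```python
import resource
import time

from prometheus_client import Counter, Gauge, Histogram, start_http_server

# Standard metrics every service might expose: throughput, latency, memory.
REQUESTS = Counter("app_requests_total", "Requests handled (throughput)")
LATENCY = Histogram("app_request_latency_seconds", "Per-request latency")
MEMORY = Gauge("app_max_rss_bytes", "Peak resident set size of the process")

def handle_request():
    """Stand-in for real request handling; updates the standard metrics."""
    start = time.monotonic()
    # ... actual work would go here ...
    REQUESTS.inc()
    LATENCY.observe(time.monotonic() - start)

if __name__ == "__main__":
    start_http_server(9100)  # serve /metrics for dashboards to scrape
    while True:
        handle_request()
        # On Linux, ru_maxrss is reported in kilobytes; convert to bytes.
        MEMORY.set(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss * 1024)
        time.sleep(1)
```

A shared dashboard template built over agreed-upon metric names like these is one plausible way a per-service standard could be applied across a dozen services written in different languages, since most monitoring systems have client libraries for each.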

Lessons learned

  • Getting engineers to buy into the problem helped create ownership among the team.
  • Instituting this process allowed us to chip away at some of the bugs that showed up in the logs, as we were able to trace some of them to rare user errors.
  • We were able to notice new issues that we would likely have missed before.
  • One downside of increased monitoring was false positives: we would notice something and start paying more attention to it, only for it to resolve on its own.
  • The whole process enabled engineers, especially the new engineers, to learn more about the systems and their potential failure modes. This ultimately helped us to create better designs.
