Instilling a Culture of Reliability
Engineering Manager at Airbnb
When I joined my current company I became the leader of a team that was transitioning away from product and towards infrastructure. The team had previously focused on improving product metrics and making product changes while spending some time on systems-related work. Yet it was unclear to the team how these tasks contributed to the overall goal. Moreover, I observed two other concerning matters. One, the team didn’t have a clear idea of what success looked like. And two, there wasn’t a good handle on the system’s alerting and on-call procedures. Neither was explicitly discussed or clearly documented, which meant people weren’t thinking about them. It was apparent that the operational health and reliability of the system, and of the team, needed new direction. Therefore, I took it upon myself to fill in these gaps.
I began by painting the big picture of what success should look like for the team. I brought this notion into a staff meeting and reiterated the idea during one-on-ones. I emphasized that our goals were to make sure we had good uptime on our system and that we improved capabilities of the platform. I framed out a mission and vision statement, working with the team to tweak and refine different areas. This ensured that everyone felt included in generating the statements we would be working with. In the end, the overarching goal revolved around reliable infrastructure.
On the operations side, I encouraged a senior team member to write an on-call responsibilities document. It outlined what it meant to be on-call, why on-call was necessary, and what to expect while on-call. Later, we held a session where we shared the doc more widely with the team, gathering input on how to kick off the new process, what it would look like, and what everyone could expect from it.
In addition to the on-call doc, I also implemented an ongoing on-call summary journal. Those on-call would record the events that occurred during their shift. On Mondays, we would use the first 15 minutes of our staff meeting to look at the operational health of the past week and debate what needed improvement. Knowing the journal would be reviewed made people accountable for their on-call shift. The designated time and space every Monday ensured everyone in the room was paying attention and contributing to the conversation, and it gave people the chance to suggest improvements. Additionally, discussing our operational health each week triggered a secondary effect: better overall communication amongst the team, especially when systemic problems occurred.
The last action I took to increase reliability was introducing post-mortems on the team. They didn’t exist before I joined. Whenever something affected our operational health and needed on-call intervention, we would write a one-page post-mortem describing the situation. Initially people were hesitant because they weren’t sure when to write one or how much detail to include. To address that, I set up a template for these one-page documents, and over time the practice became a habit and people started writing them automatically.
Reliability is now an ingrained habit on our team. We take reliability work and operational improvements very seriously. Our alerting health has become significantly better, our signal-to-noise ratio is in a much better state, and we have improved our system uptime as well.
I instilled reliability and implemented these changes gradually. I recommend breaking down these actions into manageable steps. The first step, though, is always sharing the wider narrative. Paint the big picture: what you want to achieve and what you’re trying to do. Then break that idea down into smaller initiatives. An underlying vision or mission will energize and drive your team to be successful. It will also make the smaller initiatives more powerful because they are tied to that overarching objective.
Leading by example is extremely important when managing a team. Recognize important behaviors and then encourage those behaviors. It goes a long way. As a leader, you should set the tone and culture of the team so that others may align themselves with you. This is how people will feel successful and integrated into the team. They will feel excited about what they’re doing because they are aligned with leadership and aligned with what is right for the organization.
Be notified about next articles from Paritosh Aggarwal