
Instilling a Culture of Reliability

Sharing The Vision
Motivation
Team Processes

26 February, 2020

Paritosh Aggarwal

Engineering Manager at Airbnb

Paritosh Aggarwal, Engineering Manager at Airbnb, describes how he shifted a team’s culture to treat operational health and reliability as explicit objectives.

Problem

When I joined my current company, I became the leader of a team that was transitioning away from product and towards infrastructure. The team had previously focused on improving product metrics and making product changes while spending some time on systems-related work. Yet it was unclear to the team how these tasks contributed to the overall goal. I also observed two other concerns. First, the team didn’t have a clear idea of what success looked like. Second, there wasn’t a good handle on the system’s alerting and on-call procedures. These things were neither explicitly discussed nor clearly documented, which meant people weren’t thinking about them. It was apparent that the operational health and reliability of the system, and of the team, needed new direction, so I took it upon myself to fill in these gaps.

Actions taken

I began by painting the big picture of what success should look like for the team. I introduced the idea in a staff meeting and reiterated it during one-on-ones. I emphasized that our goals were to maintain good uptime on our system and to improve the capabilities of the platform. I drafted a mission and vision statement, then worked with the team to tweak and refine it. This ensured that everyone felt included in shaping the statements we would be working with. In the end, the overarching goal revolved around reliable infrastructure.

On the operations side, I encouraged a senior team member to write an on-call responsibilities document. It outlined what it meant to be on-call, why on-call was necessary, and what to expect while on-call. Later, we held a session where we shared the doc more widely with the team and gathered input on how to kick off the new process, what it would look like, and what everyone could expect from it.

In addition to the on-call doc, I also introduced an ongoing on-call summary journal. Those on-call would record the events that occurred during their shift. On Mondays, we would use the first 15 minutes of our staff meeting to review the operational health of the past week and debate what needed improvement. This practice held people accountable for their on-call shift. The designated time and space every Monday ensured everyone in the room was paying attention and contributing to the conversation, and it gave people the chance to suggest improvements. Discussing our operational health each week also had a secondary effect: it improved overall communication within the team, especially when systemic problems occurred.

The last action I took to increase reliability was introducing post-mortems on the team, which had not existed before I joined. Whenever something affected our operational health and needed on-call intervention, we would write a one-page post-mortem describing the situation. Initially people were hesitant because they weren’t sure when to write one or how much detail to include. To address this, I set up a template for these one-page documents. Over time, the practice became a habit and people started writing them automatically.

Lessons learned

  • Reliability is now an ingrained habit on our team. We take reliability work and operational improvements very seriously. Our alerting health has become significantly better, our signal-to-noise ratio is in a much better state, and we have improved our system uptime as well.

  • I instilled reliability and implemented these changes gradually. I recommend breaking down these actions into manageable steps. The first step, though, is always sharing the wider narrative. Paint the big picture of what you want to achieve and what you’re trying to do, and then break that idea down into smaller initiatives. An underlying vision or mission will energize and drive your team to be successful. It will also make the smaller initiatives more powerful because they are tied to that overarching objective.

  • Leading by example is extremely important when managing a team. Recognize important behaviors and then encourage those behaviors. It goes a long way. As a leader, you should set the tone and culture of the team so that others may align themselves with you. This is how people will feel successful and integrated into the team. They will feel excited about what they’re doing because they are aligned with leadership and aligned with what is right for the organization.
