Plato Elevate Winter Summit has been announced (Dec 7th-8th)

🔥

Back to resources

Blameless Incident Management

Team Processes

5 May, 2021

Karthik Gandhi
Karthik Gandhi

Engineering Manager at Stripe

Karthik Gandhi, Engineering Manager at Stripe, shares how he made incident management blameless by focusing on a problem and not on blaming a person.

Problem

Some years ago, I took over a team that had no sense of incident management. If something would go wrong, the conversation was never about the conditions that led to the incident but about who did it.

While many companies claim to have blameless incident postmortems, people are, in fact, frequently penalized for what they did or didn’t do. I was set on creating a culture that encouraged acknowledging mistakes without focusing on a person but on the conditions that caused an incident.

Actions taken

I had to start by transforming an on-call process because that was where the crux of the problem was. When I joined the team, there was no structured on-call process. Everyone would respond to alerts almost on a whim, and a lack of processes entailed a lack of team cohesion. I wanted to solve the problem on the team level before moving further to address it on the company level. Also, I noticed that people were failing to understand what was causing incidents in the first place, which allowed them to play the blame game. They were unaware that it was a chain of cascading events causing an incident, not a single, last change that was made.

To start with, I tasked every person on the team to compile the list of all the incidents that would come their way, along with actions they took to address them. The list had to be thoroughly documented, and we would be collectively discussing their submissions. I established a weekly on-call hand-off process where they would talk to each other about actions they took or didn’t and would explain why they would choose a specific course of action.

I introduced that process with a twofold intention. I wanted people on my team to take seriously every alert that would come their way and drive them to closure. If they got an alert or a request to address an incident during their on-call rotation, they would be responsible for getting it to closure, even if that would make dealing with it after their on-call rotation was over. I was determined not to tolerate handing over alerts to the next person.

Also, I wanted to prevent recurring cross-functional failures. If there was an independent service that was failing, they would need to file a ticket and notify them of the failure so that it could be fixed. That would help establish ownership, prevent recurring failures, and enhance cross-functional collaboration.

Finally, I put an incident review for the whole company, which I made optional. Every Thursday for two hours, we would talk about incidents that occurred in the past week without pointing the finger at anyone. I was moderating the sessions and was quick to intervene if I would think that someone was blaming another person or pointing at a single change. This approach proved to be much more effective than writing down a document and explaining why blaming incident management would be detrimental to the whole team. Instead, I would have them in one place, and by using real-life examples, I would be able to instill in them a different approach that would make them efficient, accountable, and supportive to fellow team members.

Lessons learned

  • Prevention is second to none. Have your team understand the benefits of proactively fixing alerts. If there is an alert, there should be an action associated with it. Noisy alerts lead to complacency. Alerts should always have an action associated with them, and they should be fixed immediately. That is why you are on-call. Surprisingly, I had to explain to my manager the meaning of on-call. If one is on-call, they are not doing regular work; they should be fixing incidents and bugs that, if unaddressed, could escalate.
  • Before I joined the team, conversations about incidents were taking place in ten different channels. Now, a person who is on-call starts a Zoom discussion and invites everyone to join. They have clear instructions to follow -- mitigate first, act on it later. It took an effort to streamline the focus on mitigation and actions since some would still drift into “who did it” mode. Also, by directing focus to mitigation, I was able to divert their attention to team efforts to solve a problem instead of allowing them to be stuck in the root cause analysis which could be done as a follow-up.
  • Not everyone is made for on-call rotation. Not everyone can handle the pressure when the team is on fire, and the product is down. It is good to run some simulation sessions where a bug is intentionally introduced and have a team sit together and work on solving it. That kind of training would prepare them to handle stressful situations in real-time better.

Discover Plato

Scale your coaching effort for your engineering and product teams
Develop yourself to become a stronger engineering / product leader


Related stories

Building a New Team in a Foreign Country

23 November

Adam Hawkins, Site Reliability Engineer at Skillshare, went all the way across the world to build a brand new team who worked very differently than he was used to.

Team Processes
Adam Hawkins

Adam Hawkins

Site Reliability Engineer at Skillshare

What It Takes to Understand Other’s Perspective

23 November

Nicholas Cheever, Divisional Vice President, Global Supply Chain Technology at Trimble Transportation, shares how to really understand someone else’s point of view.

Team Processes
Nicholas Cheever

Nicholas Cheever

Divisional Vice President, Global Supply Chain Technology at Trimble Transportation

How to Handle Team Collaboration After a Merger?

23 November

Nicholas Cheever, Divisional Vice President, Global Supply Chain Technology at Trimble Transportation, shares how he helped the acquired company’s team members understand the business mission and give them focus.

Acquisition / Integration
Team Processes
Nicholas Cheever

Nicholas Cheever

Divisional Vice President, Global Supply Chain Technology at Trimble Transportation

Surefire Ways to Boost Team Morale

11 November

Rajesh Agarwal, VP & Head of Engineering at Syncro, talks about effective ways to boost team morale when stepping in as a new manager in the team.

Motivation
Team Processes
Rajesh Agarwal

Rajesh Agarwal

VP and Head of Engineering at Syncro

How an Empathetic Approach Can Solve Problems

10 November

Han Wang, Director of Engineering at Sonder Inc., shares how he changed a manager’s viewpoint for achieving better results and improved team coordination.

Goal Setting
Personal Growth
Leadership
Coaching / Training / Mentorship
Team Processes
Han Wang

Han Wang

Director of Engineering at Sonder Inc

You're a great engineer.
Become a great engineering leader.

Plato (platohq.com) is the world's biggest mentorship platform for engineering managers & product managers. We've curated a community of mentors who are the tech industry's best engineering & product leaders from companies like Facebook, Lyft, Slack, Airbnb, Gusto, and more.