Blameless Incident Management

Team Processes

5 May, 2021

Karthik Gandhi

Engineering Manager at Stripe

Karthik Gandhi, Engineering Manager at Stripe, shares how he made incident management blameless by focusing on the problem rather than on blaming a person.

Problem

Some years ago, I took over a team that had no sense of incident management. When something went wrong, the conversation was never about the conditions that led to the incident but about who had caused it.

While many companies claim to have blameless incident postmortems, people are, in fact, frequently penalized for what they did or didn’t do. I was set on creating a culture that encouraged people to acknowledge mistakes and that focused on the conditions that caused an incident rather than on a person.

Actions taken

I had to start by transforming the on-call process because that was the crux of the problem. When I joined the team, there was no structured on-call process. Everyone responded to alerts almost on a whim, and the lack of process meant a lack of team cohesion. I wanted to solve the problem at the team level before addressing it at the company level. I also noticed that people did not understand what was causing incidents in the first place, which made it easy to play the blame game. They were unaware that an incident is usually caused by a chain of cascading events, not by the single, most recent change.

To start with, I asked every person on the team to compile a list of all the incidents that came their way, along with the actions they took to address them. The list had to be thoroughly documented, and we discussed the submissions as a group. I established a weekly on-call hand-off meeting where team members talked through the actions they took or didn’t take and explained why they chose a specific course of action.
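
The incident list each person brought to the hand-off could be kept in any format. As a minimal sketch, assuming a simple in-memory record (the class names, fields, and sample entries below are hypothetical and not taken from the team's actual tooling), it might look like this:

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class IncidentEntry:
    """One row in an on-call engineer's incident list (hypothetical fields)."""
    summary: str
    actions_taken: list[str]
    rationale: str           # why this course of action was chosen
    closed: bool = False


@dataclass
class HandoffLog:
    """Everything one engineer brings to the weekly hand-off discussion."""
    engineer: str
    week_start: date
    entries: list[IncidentEntry] = field(default_factory=list)

    def open_items(self) -> list[IncidentEntry]:
        # Items the outgoing engineer still owns after the rotation ends.
        return [e for e in self.entries if not e.closed]


# Illustrative data only.
log = HandoffLog(engineer="alex", week_start=date(2021, 5, 3))
log.entries.append(IncidentEntry(
    summary="Checkout latency alert",
    actions_taken=["Rolled back config change", "Filed ticket with owning team"],
    rationale="Rollback was the fastest way to mitigate",
    closed=True,
))
log.entries.append(IncidentEntry(
    summary="Disk usage warning on batch host",
    actions_taken=["Cleared temp files"],
    rationale="Bought time; underlying growth not yet understood",
))

for item in log.open_items():
    print(f"{log.engineer} still owns: {item.summary}")
```

Recording the rationale next to each action keeps the hand-off discussion on why a decision was made rather than on who made it.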

I introduced that process with a twofold intention. First, I wanted people on my team to take every alert that came their way seriously and drive it to closure. If they got an alert or a request to address an incident during their on-call rotation, they were responsible for getting it to closure, even if that meant dealing with it after their rotation was over. I was determined not to tolerate handing unresolved alerts over to the next person.

Second, I wanted to prevent recurring cross-functional failures. If a service owned by another team was failing, the on-call engineer had to file a ticket and notify that team so that the failure could be fixed. That helped establish ownership, prevent recurring failures, and improve cross-functional collaboration.

Finally, I set up an incident review for the whole company, which I made optional. Every Thursday for two hours, we talked about incidents that had occurred in the past week without pointing the finger at anyone. I moderated the sessions and was quick to intervene if I thought someone was blaming another person or pointing at a single change. This approach proved much more effective than writing a document explaining why blame-driven incident management is detrimental to the whole team. Instead, I had everyone in one place, and by using real-life examples, I was able to instill a different approach that made them efficient, accountable, and supportive of fellow team members.

Lessons learned

  • Prevention is second to none. Have your team understand the benefits of proactively fixing alerts. Every alert should have an action associated with it and should be acted on immediately; noisy alerts lead to complacency. That is why you are on-call. Surprisingly, I had to explain to my manager the meaning of on-call. If one is on-call, they are not doing regular work; they should be fixing incidents and bugs that, if unaddressed, could escalate.
  • Before I joined the team, conversations about incidents were taking place in ten different channels. Now, the person who is on-call starts a Zoom discussion and invites everyone to join. They have clear instructions to follow: mitigate first, act on it later. It took effort to keep the focus on mitigation and actions since some would still drift into “who did it” mode. By directing focus to mitigation, I was also able to steer attention toward the team effort of solving the problem instead of letting people get stuck in root cause analysis, which can be done as a follow-up.
  • Not everyone is made for on-call rotation. Not everyone can handle the pressure when the team is on fire and the product is down. It is good to run simulation sessions where a bug is intentionally introduced and the team sits together and works on solving it. That kind of training prepares them to better handle stressful situations in real time.
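
The rule that every alert needs an associated action can be checked mechanically. The following is only a sketch under assumed inputs: the `AlertRule` fields, the alert names, and the counts are hypothetical, and a real audit would read them from whatever alerting system the team uses.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertRule:
    name: str
    action: Optional[str]    # what the on-call engineer should do; None if undefined
    fired: int               # times the alert fired in the review window
    acted_on: int            # times someone actually took action


def audit(rules: list[AlertRule]) -> tuple[list[str], list[str]]:
    """Return (alerts with no action defined, alerts that fired but were never acted on)."""
    no_action = [r.name for r in rules if not r.action]
    ignored = [r.name for r in rules if r.fired > 0 and r.acted_on == 0]
    return no_action, ignored


# Illustrative data only.
rules = [
    AlertRule("api-5xx-rate", "Roll back last deploy, then page service owner", 3, 3),
    AlertRule("disk-usage-80pct", None, 12, 0),
    AlertRule("queue-depth", "Scale workers", 7, 0),
]
no_action, ignored = audit(rules)
print("No action defined:", no_action)
print("Fired but ignored:", ignored)
```

Alerts that land in either list are candidates for a defined action or for removal, since alerts that fire without anyone acting on them are the noisy ones that lead to complacency.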
