Back to resources

Blameless Incident Management

Team Processes

5 May, 2021

Karthik Gandhi
Karthik Gandhi

Engineering Manager at Stripe

Karthik Gandhi, Engineering Manager at Stripe, shares how he made incident management blameless by focusing on a problem and not on blaming a person.

Problem

Some years ago, I took over a team that had no sense of incident management. If something would go wrong, the conversation was never about the conditions that led to the incident but about who did it.

While many companies claim to have blameless incident postmortems, people are, in fact, frequently penalized for what they did or didn’t do. I was set on creating a culture that encouraged acknowledging mistakes without focusing on a person but on the conditions that caused an incident.

Actions taken

I had to start by transforming an on-call process because that was where the crux of the problem was. When I joined the team, there was no structured on-call process. Everyone would respond to alerts almost on a whim, and a lack of processes entailed a lack of team cohesion. I wanted to solve the problem on the team level before moving further to address it on the company level. Also, I noticed that people were failing to understand what was causing incidents in the first place, which allowed them to play the blame game. They were unaware that it was a chain of cascading events causing an incident, not a single, last change that was made.

To start with, I tasked every person on the team to compile the list of all the incidents that would come their way, along with actions they took to address them. The list had to be thoroughly documented, and we would be collectively discussing their submissions. I established a weekly on-call hand-off process where they would talk to each other about actions they took or didn’t and would explain why they would choose a specific course of action.

I introduced that process with a twofold intention. I wanted people on my team to take seriously every alert that would come their way and drive them to closure. If they got an alert or a request to address an incident during their on-call rotation, they would be responsible for getting it to closure, even if that would make dealing with it after their on-call rotation was over. I was determined not to tolerate handing over alerts to the next person.

Also, I wanted to prevent recurring cross-functional failures. If there was an independent service that was failing, they would need to file a ticket and notify them of the failure so that it could be fixed. That would help establish ownership, prevent recurring failures, and enhance cross-functional collaboration.

Finally, I put an incident review for the whole company, which I made optional. Every Thursday for two hours, we would talk about incidents that occurred in the past week without pointing the finger at anyone. I was moderating the sessions and was quick to intervene if I would think that someone was blaming another person or pointing at a single change. This approach proved to be much more effective than writing down a document and explaining why blaming incident management would be detrimental to the whole team. Instead, I would have them in one place, and by using real-life examples, I would be able to instill in them a different approach that would make them efficient, accountable, and supportive to fellow team members.

Lessons learned

  • Prevention is second to none. Have your team understand the benefits of proactively fixing alerts. If there is an alert, there should be an action associated with it. Noisy alerts lead to complacency. Alerts should always have an action associated with them, and they should be fixed immediately. That is why you are on-call. Surprisingly, I had to explain to my manager the meaning of on-call. If one is on-call, they are not doing regular work; they should be fixing incidents and bugs that, if unaddressed, could escalate.
  • Before I joined the team, conversations about incidents were taking place in ten different channels. Now, a person who is on-call starts a Zoom discussion and invites everyone to join. They have clear instructions to follow -- mitigate first, act on it later. It took an effort to streamline the focus on mitigation and actions since some would still drift into “who did it” mode. Also, by directing focus to mitigation, I was able to divert their attention to team efforts to solve a problem instead of allowing them to be stuck in the root cause analysis which could be done as a follow-up.
  • Not everyone is made for on-call rotation. Not everyone can handle the pressure when the team is on fire, and the product is down. It is good to run some simulation sessions where a bug is intentionally introduced and have a team sit together and work on solving it. That kind of training would prepare them to handle stressful situations in real-time better.

Discover Plato

Scale your coaching effort for your engineering and product teams
Develop yourself to become a stronger engineering / product leader


Related stories

Team Development Framework for new managers

26 June

Individual Contributors are familiar with a technical development framework that helps them with building products. Managers, especially new managers can leverage a parallel framework to help them build their teams while drawing analogies from an already familiar framework.

Building A Team
Team Processes
New Manager
Viswa Mani Kiran Peddinti

Viswa Mani Kiran Peddinti

Sr Engineering Manager at Instacart

Promoting Interdepartmental Teamwork for More Efficient Problem-Solving

13 June

Roland Fiala, Senior Vice President of Engineering at Productsup, raises an interesting issue about autonomy in teams: does it hinder collaboration opportunities that lead to better problem-solving? He shares his system for promoting teamwork in engineering departments.

Internal Communication
Collaboration
Roadmap
Team Processes
Cross-Functional Collaboration
Roland Fiala

Roland Fiala

Senior Vice President of Engineering at Usergems

How to Motivate Your Engineers to Grow in Their Careers

13 June

Roland Fiala, Senior Vice President of Engineering at Productsup, highlights the importance of soft skills and shares how he motivates his engineers to further their careers by focusing on personal growth.

Goal Setting
Different Skillsets
Handling Promotion
Personal Growth
Coaching / Training / Mentorship
Motivation
Team Processes
Career Path
Performance
Roland Fiala

Roland Fiala

Senior Vice President of Engineering at Usergems

Streamlining Product Processes After a Reorganization

16 May

Snehal Shaha, Lead Technical Program Manager at Momentive (fka SurveyMonkey), details her short-term technical strategy to unify processes among teams following an acquisition.

Acquisition / Integration
Product Team
Product
Building A Team
Leadership
Internal Communication
Collaboration
Reorganization
Strategy
Team Processes
Cross-Functional Collaboration
Snehal Shaha

Snehal Shaha

Technical Program Management at Apple Inc.

The Optimization and Organization of Large Scale Demand

4 May

Kamal Qadri, Senior Manager at FICO, drives the importance of setting expectations when optimizing large-scale requirements.

Managing Expectations
Delegate
Team Processes
Prioritization
Kamal Qadri

Kamal Qadri

Head of Software Quality Assurance at FICO