Blameless Incident Management

Senior Engineering Manager at Stripe

Problem

Some years ago, I took over a team that had no real incident management practice. When something went wrong, the conversation was never about the conditions that led to the incident but about who caused it.

While many companies claim to have blameless incident postmortems, people are, in fact, frequently penalized for what they did or didn’t do. I was set on creating a culture that encouraged acknowledging mistakes and focused on the conditions that caused an incident rather than on a person.

Actions taken

I had to start by transforming the on-call process because that was where the crux of the problem lay. When I joined the team, there was no structured on-call process. Everyone responded to alerts almost on a whim, and the lack of process brought with it a lack of team cohesion. I wanted to solve the problem at the team level before addressing it at the company level. I also noticed that people did not understand what was causing incidents in the first place, which made it easy for them to play the blame game. They were unaware that an incident is caused by a chain of cascading events, not by the single, most recent change.

To start with, I asked every person on the team to compile a list of all the incidents that came their way, along with the actions they took to address them. The list had to be thoroughly documented, and we discussed the submissions as a team. I established a weekly on-call hand-off process in which people talked to each other about the actions they took or didn’t take and explained why they chose a specific course of action.

I introduced that process with a twofold intention. First, I wanted people on my team to take every alert that came their way seriously and drive it to closure. If they got an alert or a request to address an incident during their on-call rotation, they were responsible for getting it to closure, even if that meant dealing with it after their on-call rotation was over. I was determined not to tolerate handing over alerts to the next person.
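The ownership rule described above can be made concrete with a small sketch. This is only an illustration under stated assumptions, not the tooling the team used: the `Shift` and `Alert` types, the `owner_of` and `carried_over` helpers, and the weekly rotation data are all hypothetical. The idea is that the owner of an alert is whoever was on call when it fired, and that ownership does not move when the shift ends.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Shift:
    """One on-call rotation slot (hypothetical structure)."""
    engineer: str
    start: datetime  # inclusive
    end: datetime    # exclusive


@dataclass
class Alert:
    """An alert or incident request (hypothetical structure)."""
    name: str
    fired_at: datetime
    closed_at: Optional[datetime] = None  # None while the alert is still open


def owner_of(alert: Alert, rotation: list[Shift]) -> Optional[str]:
    """Return whoever was on call when the alert fired.

    Ownership is fixed at firing time; it is not handed to the next shift.
    """
    for shift in rotation:
        if shift.start <= alert.fired_at < shift.end:
            return shift.engineer
    return None  # the alert fired during a gap in the rotation


def carried_over(
    alerts: list[Alert], rotation: list[Shift], now: datetime
) -> list[tuple[str, str]]:
    """List (owner, alert name) for open alerts whose owner's shift has ended."""
    result = []
    for alert in alerts:
        if alert.closed_at is not None:
            continue  # already driven to closure
        for shift in rotation:
            fired_in_shift = shift.start <= alert.fired_at < shift.end
            if fired_in_shift and shift.end <= now:
                result.append((shift.engineer, alert.name))
    return result


# Example data: two one-week shifts and an alert that fires late in the first.
rotation = [
    Shift("alice", datetime(2024, 1, 1), datetime(2024, 1, 8)),
    Shift("bob", datetime(2024, 1, 8), datetime(2024, 1, 15)),
]
disk_full = Alert("disk-full", fired_at=datetime(2024, 1, 7, 23, 0))
```

A list like the one `carried_over` returns could serve as an agenda item in the weekly hand-off: each entry is an alert that its original owner is still expected to close, even though their rotation has ended.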

Second, I wanted to prevent recurring cross-functional failures. If a service owned by another team was failing, the on-call engineer needed to file a ticket and notify the owning team of the failure so that it could be fixed. That would help establish ownership, prevent recurring failures, and enhance cross-functional collaboration.

Finally, I set up an incident review for the whole company, which I made optional. Every Thursday for two hours, we talked about incidents that had occurred in the past week without pointing the finger at anyone. I moderated the sessions and was quick to intervene if I thought someone was blaming another person or pointing at a single change. This approach proved to be much more effective than writing a document explaining why blame-driven incident management is detrimental to the whole team. Instead, I had everyone in one place, and by using real-life examples, I was able to instill a different approach that made people efficient, accountable, and supportive of fellow team members.

Lessons learned

Prevention is second to none. Have your team understand the benefits of proactively fixing alerts. Every alert should have an action associated with it, and alerts should be addressed immediately; noisy alerts that nobody acts on lead to complacency. That is why you are on-call. Surprisingly, I had to explain to my manager what on-call means. Someone who is on-call is not doing regular work; they should be fixing incidents and bugs that, if unaddressed, could escalate.

Before I joined the team, conversations about incidents were taking place in ten different channels. Now, the person who is on-call starts a Zoom discussion and invites everyone to join. They have clear instructions to follow: mitigate first, act on it later. It took effort to keep the focus on mitigation and actions, since some would still drift into “who did it” mode. Also, by directing focus to mitigation, I was able to turn their attention to solving the problem as a team instead of letting them get stuck in root cause analysis, which could be done as a follow-up.

Not everyone is made for on-call rotation. Not everyone can handle the pressure when the team is on fire and the product is down. It is good to run simulation sessions in which a bug is intentionally introduced and the team sits together and works on solving it. That kind of training prepares them to handle stressful situations better in real time.
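The first lesson above, that every alert needs an associated action and that noisy alerts breed complacency, lends itself to a periodic check. What follows is a minimal sketch under stated assumptions: the `AlertRule` fields, the `audit` function, and the 50% threshold are hypothetical choices made for illustration, not part of any real alerting tool.

```python
from dataclasses import dataclass


@dataclass
class AlertRule:
    """One alert definition plus review-window counters (hypothetical)."""
    name: str
    action: str = ""   # what the on-call engineer should do; empty means none
    fired: int = 0     # how many times the alert fired in the review window
    acted_on: int = 0  # how many of those firings led to someone taking action


def audit(rules: list[AlertRule], min_acted_ratio: float = 0.5) -> dict[str, str]:
    """Flag alerts with no action and alerts that are mostly ignored.

    min_acted_ratio is an assumed cut-off: below it, the alert is treated
    as noise that should be fixed or removed.
    """
    findings = {}
    for rule in rules:
        if not rule.action.strip():
            findings[rule.name] = "no action defined"
        elif rule.fired > 0 and rule.acted_on / rule.fired < min_acted_ratio:
            findings[rule.name] = "noisy: mostly ignored"
    return findings


# Example data, invented for illustration.
rules = [
    AlertRule("api-5xx-rate", action="Roll back the last deploy", fired=4, acted_on=4),
    AlertRule("cpu-above-80", action="Check the dashboard", fired=40, acted_on=2),
    AlertRule("queue-depth", action="", fired=3, acted_on=0),
]
findings = audit(rules)
```

With the example data, `findings` flags `cpu-above-80` as noisy and `queue-depth` as having no action, while `api-5xx-rate` passes. Each flagged rule would then be either given a concrete action or deleted.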

Karthik Gandhi