Blameless Incident Management
5 May, 2021
Some years ago, I took over a team that had no sense of incident management. When something went wrong, the conversation was never about the conditions that led to the incident but about who caused it.
While many companies claim to have blameless incident postmortems, people are, in fact, frequently penalized for what they did or didn’t do. I was set on creating a culture that encouraged acknowledging mistakes by focusing not on a person but on the conditions that caused an incident.
I had to start by transforming the on-call process because that was the crux of the problem. When I joined the team, there was no structured on-call process. Everyone would respond to alerts almost on a whim, and the lack of process meant a lack of team cohesion. I wanted to solve the problem at the team level before addressing it at the company level. I also noticed that people did not understand what was causing incidents in the first place, which made it easy for them to play the blame game. They were unaware that an incident is caused by a chain of cascading events, not by the single, most recent change.
To start with, I asked every person on the team to compile a list of all the incidents that came their way, along with the actions they took to address them. The list had to be thoroughly documented, and we would discuss the entries as a team. I established a weekly on-call hand-off where people would talk to each other about the actions they took or didn’t take and explain why they chose a specific course of action.
I introduced that process with a twofold intention. I wanted people on my team to take every alert that came their way seriously and drive it to closure. If they got an alert or a request to address an incident during their on-call rotation, they were responsible for getting it to closure, even if that meant dealing with it after their rotation was over. I was determined not to tolerate handing alerts over to the next person.
Also, I wanted to prevent recurring cross-functional failures. If a service owned by another team was failing, the on-call person would need to file a ticket and notify the owning team so that the failure could be fixed. That would help establish ownership, prevent recurring failures, and enhance cross-functional collaboration.
Finally, I set up an optional incident review for the whole company. Every Thursday for two hours, we would talk about the incidents of the past week without pointing the finger at anyone. I moderated the sessions and was quick to intervene if I thought someone was blaming another person or pointing at a single change. This approach proved much more effective than writing a document explaining why blame-driven incident management is detrimental to the whole team. Instead, I had everyone in one place and, using real-life examples, could instill a different approach that made them efficient, accountable, and supportive of fellow team members.
- Prevention is second to none. Have your team understand the benefits of proactively fixing alerts. Every alert should have an action associated with it and should be acted on immediately; noisy alerts lead to complacency. That is why you are on-call. Surprisingly, I had to explain to my manager the meaning of on-call. A person who is on-call is not doing regular work; they should be fixing incidents and bugs that, if unaddressed, could escalate.
- Before I joined the team, conversations about incidents were taking place in ten different channels. Now, the person who is on-call starts a Zoom discussion and invites everyone to join. They have clear instructions to follow: mitigate first, investigate the root cause later. It took an effort to keep the focus on mitigation and actions, since some people would still drift into “who did it” mode. By directing the focus to mitigation, I was also able to turn their attention to solving the problem as a team instead of letting them get stuck in root cause analysis, which could be done as a follow-up.
- Not everyone is made for on-call rotation. Not everyone can handle the pressure when the team is on fire and the product is down. It is good to run simulation sessions where a bug is intentionally introduced and the team sits together to work on solving it. That kind of training prepares them to handle stressful situations better in real time.
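The first lesson above, that every alert should have an action associated with it, lends itself to a simple automated check. The following is a minimal Python sketch, not something described in the original story: the `Alert` class, its `action` field, and the sample alert names are all hypothetical and stand in for whatever your alerting tool actually exports.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    """Hypothetical alert definition; real tools will use different fields."""
    name: str
    action: str = ""  # assumed field: what the on-call person should do


def alerts_without_action(alerts):
    """Return the names of alerts that have no associated action."""
    return [a.name for a in alerts if not a.action.strip()]


# Illustrative data only; these alert names are made up.
alerts = [
    Alert("disk_usage_high", "Free space or expand the volume"),
    Alert("queue_depth_warning"),  # no action: a candidate for noise
]

print(alerts_without_action(alerts))
```

Running a check like this before each weekly hand-off would give the team a short list of alerts to either document with an action or remove.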