Bouncing Back From Failure
19 August, 2021
Only a couple of weeks after I joined my previous company, we went through a big ramp-up, developing new features and racing toward a deadline. Our operational cadence was rather seasonal -- in the fall, we would get a lot of traffic and onboard a lot of new customers, but we knew that was coming.
Suddenly, we started to see some alarming signals. Traffic was going up, which was expected, but critical aspects of website performance were deteriorating. Pages would slow down, timeouts were frequent, and intermittent failures would occur out of the blue.
We rushed to understand what was happening because we were expecting additional growth and had to get stability under control first. We started to assess how much headroom we had by comparing against past data. I was still rather new at the company and was leaning on the team to make recommendations.
Senior engineers did the analysis, scrutinized the data, and concluded that we should be fine based on the headroom on our database. Nevertheless, we still needed to act. We had to push back on other priorities and narrow our focus to infrastructure. We came up with a mitigation plan, which we presented to the CEO. We felt confident: “Give us a week, and we will handle this.”
The next day, the database crashed. The website went down, and we spent the entire day firefighting. Furthermore, Murphy’s law proved itself once again; I was at the dentist and had to work remotely. Coordinating people on different teams remotely, from Sales to Customer Success, was tremendously difficult. Eventually, we got through that day.
We didn’t have much experience with post-mortems, but we had one of those uncomfortable conversations afterward. Some people were pointing fingers, others were taking the blame, and the overall tone of the conversation was rather stressful. We did the upgrade overnight, and the rest of the quarter went well. We accelerated some DevOps work, prioritized ruthlessly, and went ahead with hiring.
But it was one of those moments. The failure was an inevitable consequence of taking things lightly, relying on other less competent people to make decisions, and being confident without grasping the potential severity of the situation. But it’s easy to be smart in hindsight. At that moment, “we didn’t know what we didn’t know.” We looked at the data and misread some of it. For me personally, it was a massive, visible, and critical failure. But I had to move past it and forgive myself. That was a prerequisite to understanding what went wrong and how I could ensure that it would never happen again.
- This experience taught me a great deal about the importance of failure and resilience in leadership. From a product angle, I learned not to over-index on building features without thinking about infrastructure and ensuring stability. You should always be able to strike a healthy balance between developing features and scaling infrastructure.
- Some of the assumptions on which we calculated our headroom were false. We didn’t estimate data growth; when we did load testing, we based it on the present rather than projecting six months into the future, which meant that our performance testing was not accurate.
- This failure helped us introduce processes to ensure that this kind of problem wouldn’t happen again. You have to get past the finger-pointing phase, and past feeling sorry for yourself, before you can build processes that prevent a similar situation from happening.
- At that time, we didn’t have a solid post-mortem process, so we had to introduce one. We had to detail how it should be run and what it should look like -- actionable and blameless, with no one feeling uncomfortable about their past actions.
- Our communication was not always clear. There was a lot of confusion about who was responsible for what. Understanding that helped us mature as a team. We had to ask ourselves: How should we handle ourselves in those situations? How would we react in a real emergency? We acknowledged the real possibility of such situations and created guardrails that would keep our reactions efficient, calm, and composed.
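The headroom lesson above (testing against the present instead of projecting six months ahead) can be illustrated with a small calculation. This is only a minimal sketch, not the analysis our team actually ran: the function names, the fixed monthly growth rate, and every number in it are hypothetical and chosen purely for illustration.

```python
def headroom(load: float, capacity: float) -> float:
    """Fraction of capacity still unused (negative means over capacity)."""
    return 1 - load / capacity


def projected_load(current_load: float, monthly_growth: float, months: int) -> float:
    """Compound the current load forward at a fixed monthly growth rate.

    A constant growth rate is an assumption; seasonal traffic may be spikier.
    """
    return current_load * (1 + monthly_growth) ** months


def months_until_exhausted(current_load: float, capacity: float,
                           monthly_growth: float, horizon: int = 60):
    """First month (0 = today) in which projected load exceeds capacity.

    Returns None if capacity is not exceeded within the horizon.
    """
    for month in range(horizon + 1):
        if projected_load(current_load, monthly_growth, month) > capacity:
            return month
    return None


# Hypothetical numbers: 400 units of load today, 1000 units of capacity,
# and 20% month-over-month growth.
today = headroom(400, 1000)
in_six_months = headroom(projected_load(400, 0.20, 6), 1000)
print(f"headroom today: {today:.0%}")
print(f"headroom in six months: {in_six_months:.0%}")
print("months until exhausted:", months_until_exhausted(400, 1000, 0.20))
```

With these made-up inputs, a system that looks comfortable today runs out of capacity inside the six-month window, which is the kind of gap a present-day load test cannot reveal.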