## Problem
Only a couple of weeks after I joined my previous company, we went through a big ramp, developing new features and racing toward a deadline. Our operational cadence was seasonal -- every fall we would get a lot of traffic and onboard a lot of new customers, so we knew the surge was coming.


Suddenly, we started to see some alarming signals. Traffic was going up, which was expected, but some critical aspects of website performance were disturbing. Pages slowed down, timeouts became frequent, and intermittent failures occurred out of the blue.

## Actions taken
We rushed to understand what was happening because we were expecting additional growth and had to stabilize the system first. We started to assess how much headroom we had by comparing current load against past data. I was still rather new in the company and was leaning on the team to make recommendations.


Senior engineers did all the analysis, scrutinized the data, and concluded that we should be fine based on the headroom on our database. Nevertheless, we still needed to act. We had to push back on other priorities and narrow our focus to infrastructure. We came up with a mitigation plan, which we presented to the CEO. We felt confident: “Give us a week, and we will handle this.”


The next day, the database crashed. The website went down, and we spent the entire day firefighting. To make matters worse, Murphy’s law proved itself once again: I was at the dentist that day and had to work remotely. Coordinating people across different teams, from Sales to Customer Success, was tremendously difficult from a distance. Eventually, we got through that day.


We didn’t have much experience with post-mortems, but we held one afterward, and it was one of those uncomfortable conversations. Some people pointed fingers, others took the blame, and the overall tone was stressful. We did the upgrade overnight, and the rest of the quarter went well. We accelerated some DevOps work, prioritized ruthlessly, and went ahead with hiring.


But it was one of those moments. The failure was an inevitable consequence of taking things lightly, relying on other, less competent people to make decisions, and being confident without grasping the potential severity of the situation. It’s easy to be smart in hindsight, though. At that moment, “we didn’t know what we didn’t know.” We looked at the data and misread some of it. For me personally, it was a massive, visible, and critical failure. But I had to move past it and forgive myself. That was a prerequisite to understanding what went wrong and how I could ensure that it would never happen again.

## Lessons learned
- This experience taught me a great deal about the importance of failure and resilience in leadership. From a product angle, I learned not to over-index on building features without thinking about infrastructure and ensuring stability. You should always be able to strike a healthy balance between developing features and scaling infrastructure.
- Some of the assumptions behind our headroom calculation were false. We didn’t estimate data growth; when we did load testing, we based it on present volumes rather than projecting six months into the future, which meant that our performance testing was not accurate.
- This failure helped us introduce processes to ensure that this kind of problem wouldn’t happen again. You have to get past the finger-pointing phase, and past feeling sorry for yourself, to be able to build processes that prevent a similar situation from happening.
- At that time, we didn’t have a solid post-mortem process, so we had to introduce one. We had to detail how it should be run and what it should look like -- actionable and blameless, with no one feeling uncomfortable about their past actions.
- Our communication was not always clear. There was much confusion about who was responsible for what. Understanding that helped us mature as a team. We had to ask ourselves: How should we handle ourselves in those situations? How would we react in a real emergency? We acknowledged the real possibility of such situations and created guardrails that would make our reactions efficient but calm and composed.
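
To make the headroom lesson concrete, here is a minimal sketch of the kind of projection we skipped. It assumes compound month-over-month data growth, which is a simplification, and every number and function name in it is hypothetical rather than taken from our actual system.

```python
def project_size(current_gb: float, monthly_growth: float, months: int) -> float:
    """Compound the current data size forward by the given number of months."""
    return current_gb * (1 + monthly_growth) ** months


def headroom(capacity_gb: float, used_gb: float) -> float:
    """Fraction of capacity still free; a negative value means over capacity."""
    return (capacity_gb - used_gb) / capacity_gb


# Hypothetical figures, chosen only to illustrate the effect.
capacity_gb = 1000.0
current_gb = 500.0
monthly_growth = 0.15  # assumed 15% month-over-month growth

today = headroom(capacity_gb, current_gb)
in_six_months = headroom(
    capacity_gb, project_size(current_gb, monthly_growth, 6)
)

print(f"headroom today: {today:.0%}")
print(f"headroom in six months: {in_six_months:.0%}")
```

With these made-up inputs, the database looks half empty today yet ends up over capacity within six months, which is why load tests sized on present data can be misleading.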


