Navigating a Crisis Situation
15 October, 2019
We had a custom client implementation that involved migrating a popular existing website to our platform. We had performed such migrations before, but we didn't realize until late in the game that their site was a couple of orders of magnitude larger than anything we had handled before.
The migration took far longer than expected, so our QA time period was squeezed. With a massive effort, we loaded the data and validated functionality just in time for a highly publicized launch date. There was a lot of pressure to make that date. Load testing was nowhere on our radar. It had never been necessary before, and no one on our team--including me--had the experience to recognize what was about to come. We launched, and the software failed spectacularly. First, it slowed, and then it died completely within the hour. We had to shut it down and revert to the old site. It was a major PR problem for the customer, and I had a huge, very public failure on my hands.
My immediate reaction was "I'm going to be fired." That might have happened at another company, but what happened instead, though I didn't realize it at the time, was that I was given an opportunity to lead us out of this mess. A crisis is an emotional thing; everybody from the internal team to the customer is upset and fearful. I wish I could say I completely rose to the challenge, or that I knew what to do. Several people involved, including the customer, prompted us to a series of actions that became my blueprint for handling crisis situations over my career. This blueprint has helped me both in preventing and gracefully handling situations that go sideways:
Deliver the News
Immediately after the incident, we owed the customer and their management team a full and transparent accounting of what happened. This had to make sense at the high level, with enough supporting detail to satisfy them they were dealing with a competent partner who had nonetheless made a mistake. If we had tried to hide or minimize anything, we never would have regained their trust.
Accept Responsibility
There also needed to be a moment of "falling on your sword" and accepting responsibility, both personally and as a company for what happened. The customer's attitude towards us shifted perceptibly once this happened.
Increase Communications
The lynchpin to navigating this crisis situation was radically increasing the transparency and frequency of communications with the customer. Initially, the customer required meetings at the start and end of each day, including weekends. It felt like micromanagement, and it was, but it was necessary in order to regain their trust. As we worked through the coming weeks it lessened to once a day, and eventually a weekly cadence.
Align Internally
Beyond communications, what we needed to do was actually solve the problem. For that, we had to get buy-in with our own internal management that this was a top priority. The company understood the situation and committed the necessary resources for us to solve the problem.
Form a Solid Plan
After that, we merely had a series of challenging technical issues to solve! We had to establish new ways of measuring performance and systematically work on multiple parts of the system to achieve our performance goals. These issues ended up taking a couple of months to solve. But we re-established trust over time by building a credible plan and showing daily progress toward it.
Work the Plan and Keep Communicating
The customer was fairly technical and wanted to see detailed load testing and DB metrics of our progress. We worked with engineering to produce the baseline and track improvements. Sometimes we improved, and sometimes there were setbacks, but the key to it all was the ongoing communication. Eventually, we re-launched. You may have even used the software at one point!
- Rely on mentors
  - I didn't have mentors at the time I could turn to. Everyone here is doing well on that score - someone experienced, sympathetic, and removed from the situation can be a great support and provide perspective.
- Accept your mistakes. Resist perfectionism.
  - Everyone makes mistakes. Publicly accepting responsibility when things go wrong is part of leadership, but internally, it's a different story. It's important to accept responsibility for your part, but recognize that many events and people played a role in the outcome.
  - Blame may be flying around, but try not to get caught up in blaming others or the situation, so you can approach it with a clear mind. This is much easier if you've cut yourself some slack.
  - If you're a perfectionist, you may be intolerant of mistakes, and you'll have a tendency to overdo it with self-recrimination; recognize that no one benefits from that, you least of all.
- Avoid surprises.
  - Foresight comes with experience, but it's good practice to try to anticipate what will go wrong. I conduct pre-mortem meetings before major launches, where a large team gets together and thinks of everything that could go wrong. Then we make sure we've got a plan to deal with it.
  - Identify & communicate risks. When risks arise, it's important to make stakeholders aware of them. This gives leaders and stakeholders the full information they need to make a decision to move forward or not.
- Trust your gut.
  - Pay attention to the warning signs even when there is external pressure to ignore them. Very public deadlines, for example.
  - For us, the fact that the data import job was taking a lot longer than we expected and experiencing failures was a risk I could have recognized and raised.
- Be willing to deliver the bad news.
  - In that situation, it felt like the prospect of having to delay launch was the worst possible news you could deliver, but in reality, it was much worse: you could actually launch and fail! I learned that the hard way.
  - It takes some courage, and may not be well received at the time, but in the end, people will appreciate and value your honesty & transparency, and efforts to warn them. If not, you're better off somewhere else!