
Developing Better Incident Response Policies

Dev Processes
Team Processes

25 May, 2021

Harrison Hunter

CTO at MaestroQA

Harrison Hunter, CTO at MaestroQA, speaks of his current efforts to develop better incident response policies by introducing automated testing and improving on-call scheduling.

Problem

As we became a more enterprise-focused company, we realized that we had to support higher levels of uptime and offer stronger performance guarantees to our customers. At the same time, the number of people contributing to our codebase was growing, which meant a growing number of changes. We therefore had to make sure that this added volume of changes would not result in instability and more issues for our customers.

Actions taken

The two areas where improvements had the biggest impact on our incident response policies were automated testing and on-call scheduling.

Automated testing

For starters, we added automation to the process in several places. We already had some unit, UI, and click-through testing, but we put far more emphasis on testing and hired a team to expand our automated and end-to-end test coverage to keep up with the increased number of changes. In particular, we expanded testing to include load tests: rather than testing only under a small load matching the current size of our customers, we tested under a large load projected from the future size of our customer base. That way, we could catch issues before they reached production and prevent incidents from happening.
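As an illustration of testing at a projected rather than current load, here is a minimal Python sketch. It is not the tooling MaestroQA used; `handle_request`, the growth factor, and the latency budget are all hypothetical stand-ins, and a real test would call the actual service instead of sleeping.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import quantiles


def handle_request(payload: int) -> int:
    """Hypothetical stand-in for the system under test; swap in a real client call."""
    time.sleep(0.001)  # simulate a small amount of work per request
    return payload * 2


def run_load(num_requests: int, concurrency: int) -> dict:
    """Fire num_requests calls at the given concurrency and summarize latency."""
    def timed_call(i: int) -> float:
        start = time.perf_counter()
        handle_request(i)
        return time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(timed_call, range(num_requests)))

    # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
    p95 = quantiles(latencies, n=100)[94]
    return {"requests": len(latencies), "p95_seconds": p95}


def check_projected_load(current_requests: int, growth_factor: int,
                         p95_budget_seconds: float, concurrency: int = 20) -> bool:
    """Return True if p95 latency stays within budget at the projected load."""
    projected = current_requests * growth_factor  # assumed future traffic
    result = run_load(projected, concurrency)
    return result["p95_seconds"] <= p95_budget_seconds
```

A check such as `check_projected_load(current_requests=200, growth_factor=10, p95_budget_seconds=0.5)` could run in a pre-release pipeline, so that a regression at the projected scale blocks the release instead of surfacing in production.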

We also added automated status check-ins and tied our tools together. By linking the tools, we ensured that our health checks, database metrics, and server metrics were alerting us in the right places. All information was piped into Slack or email, which allowed people to respond quickly and start diagnosing right away, with monitors going into alarm when things went wrong. When an alert was triggered, we got an indication of where something was not working, rather than having to check each place ourselves.
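The idea of funneling health checks, database metrics, and server metrics into one place can be sketched roughly as follows. The metric names, thresholds, and the `notify` callback are assumptions made for illustration; in practice `notify` would wrap whatever Slack or email integration a team already has.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Alert:
    source: str      # e.g. "health_check", "database", "server"
    metric: str
    value: float
    threshold: float


# Hypothetical thresholds; real values depend on the system being monitored.
THRESHOLDS: Dict[str, Dict[str, float]] = {
    "health_check": {"failed_checks": 0},
    "database": {"query_latency_ms": 500},
    "server": {"cpu_percent": 90},
}


def evaluate(readings: Dict[str, Dict[str, float]]) -> List[Alert]:
    """Compare each reading with its threshold and collect the ones in alarm."""
    alerts = []
    for source, metrics in readings.items():
        for metric, value in metrics.items():
            threshold = THRESHOLDS.get(source, {}).get(metric)
            if threshold is not None and value > threshold:
                alerts.append(Alert(source, metric, value, threshold))
    return alerts


def dispatch(alerts: List[Alert], notify: Callable[[str], None]) -> None:
    """Send every alert through one notifier so responders watch a single place."""
    for a in alerts:
        notify(f"[{a.source}] {a.metric}={a.value} exceeds {a.threshold}")
```

Because every source goes through the same `dispatch` call, the message itself says which subsystem is in alarm, and nobody has to visit each dashboard in turn.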

Scheduling

To ensure that our policies could be efficiently implemented, we established clear points of contact and communication mechanisms by adding on-call rotation planning, scheduling, and alerting. That required a change that was both cultural and process-related.

We didn’t merely create an on-call schedule and inform people when they would be on call. We created a checklist for people to go through and ensured that everyone had access to all the tools and systems. We also set clear expectations for testing and incident response, which helped with the cultural aspect of the change.
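A rotation paired with an access checklist might look something like the sketch below. The weekly shift length, the engineer names, and the items in `REQUIRED_ACCESS` are all hypothetical; the point is only that the schedule and the readiness check are generated together.

```python
from datetime import date, timedelta
from typing import Dict, List, Set, Tuple

# Hypothetical checklist of tools every on-call engineer needs access to.
REQUIRED_ACCESS: Set[str] = {"alerting", "logs", "database_console", "deploy_tool"}


def build_rotation(engineers: List[str], start: date, weeks: int) -> List[Tuple[date, str]]:
    """Assign one engineer per week, cycling through the list in order."""
    return [(start + timedelta(weeks=i), engineers[i % len(engineers)])
            for i in range(weeks)]


def missing_access(access: Dict[str, Set[str]], engineer: str) -> Set[str]:
    """Return the checklist items the engineer still lacks."""
    return REQUIRED_ACCESS - access.get(engineer, set())


def readiness_report(rotation: List[Tuple[date, str]],
                     access: Dict[str, Set[str]]) -> Dict[date, Tuple[str, Set[str]]]:
    """Map each shift start date to the gaps that must be closed before that shift."""
    return {shift_start: (engineer, missing_access(access, engineer))
            for shift_start, engineer in rotation
            if missing_access(access, engineer)}
```

Running the report whenever the schedule is regenerated is one way to keep the checklist fresh as tools and processes change.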

While this is still a work in progress, we have noticed significant improvements. For example, we redefined incidents and created dedicated places for communication. We also organized communication around the incident response playbook that we compiled. As a result, clearer communication led to faster resolution of issues.

Lessons learned

  • Run end-to-end tests not only at your current scale but also at a projected scale that generates a significant load.
  • Pipe all alerts into one place and keep all incident communication there. A unified view of alerts across the system lets you respond quickly.
  • When you create an on-call schedule, also create a checklist so that everyone going on call is comfortable and fully set up with tools and access. Refresh it regularly, since processes change over time and people should have access at all times. Also, develop a detailed playbook so that people can apply the right investigation and mitigation techniques in the heat of the moment.
