Practice Makes Perfect: How to Run a Game Day

Dev Processes
Impact

26 July, 2020

Pierre-Alexandre Lacerte, Senior Principal Engineer at Upgrade, Inc., explains why it is important to simulate failures and how game days benefited his organization.

Problem

Throughout my professional experience at different companies, I was involved in and led multiple initiatives aimed at giving engineers more ownership over controlling and deploying their services to production. Here is the flow: once a feature was merged, a deployment pipeline would kick in, and the engineer could trigger, monitor, or roll back the deployment to production using an internal tool. Each team owned a set of services in the production environment and took part in our alerting and paging rotation. But what happens when a problem occurs during the deployment, or to the given service a few hours later? How do you investigate and fix it? And how can you execute this loop faster? How could we know whether our tools were giving our engineers enough visibility, and what features were they missing?
 

When moving the engineering teams onto this new “system”, we realized they loved the added autonomy, but they didn’t have much experience investigating production issues. When a problem occurred, a number of blockers would suddenly resurface -- people wouldn’t have access to some of our monitoring tools and weren’t aware of what was available to them (e.g., “Oh, we have a tracing tool here? Really!”). Some of them couldn’t investigate on their own; they would do a quick check in our logging tool and escalate the alert to the next level until it reached someone with enough expertise and access to fix it.
 

Instead of embracing failures and learning how to deal with them, we saw what I would call a “Fear of Production”. How could we share this knowledge with our engineers and, as an organization, become more resilient?
 

Actions taken

Our first action seemed far too obvious but was nevertheless missing -- we had to ensure that everyone had access to ALL our monitoring tools, and automate this part of new-engineer onboarding. Then we organized onboarding sessions and advanced training on how to dig into problems. But training and wikis can only get you so far...
 

Finally, we decided we were ready to take the plunge into practice. During (or close to) business hours, we would run a game day: a simulated failure or event that tests a system, its processes, and the team’s response, in order to improve reliability practices.
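To make the idea concrete, here is a minimal sketch of what “triggering a failure” during a game day can look like in code. This is not any tool we used at Upgrade; the `flaky_dependency` function and failure rates are hypothetical, standing in for a real downstream service whose injected errors responders would then chase through their monitoring tools.

```python
import random


def flaky_dependency(failure_rate: float) -> str:
    """Hypothetical downstream call that fails at the injected rate."""
    if random.random() < failure_rate:
        raise RuntimeError("injected failure")
    return "ok"


def run_scenario(failure_rate: float, calls: int = 100, seed: int = 42) -> dict:
    """Drive traffic through the flaky dependency and collect error stats --
    the kind of signal a game-day facilitator injects and a team investigates."""
    random.seed(seed)  # deterministic, so the scenario is repeatable
    errors = 0
    for _ in range(calls):
        try:
            flaky_dependency(failure_rate)
        except RuntimeError:
            errors += 1
    return {"calls": calls, "errors": errors, "error_rate": errors / calls}
```

In a real game day the injection happens in a staging or production-like environment and the team observes it through dashboards and alerts; the point of the sketch is only that a scenario should be scripted and repeatable, not improvised.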
 

We ran two types of game days: either we would notify our engineers in advance that a game day would happen on a specific date, or we would stage one without any warning.
 

In the first case, we would tell them to block a certain amount of their time, and then we would gather all of the engineers in one room while someone led the simulation -- as in fantasy role-play -- triggering failures and encouraging participation from the others. It was vital that we were in the same room, where we could observe each other’s responses and rely on one another for quick help.
 

The other type, which included an element of surprise, was mostly run on non-critical systems. We triggered the scenarios during business hours and gave teams a heads-up that “something” might happen in a given week.
 

Even in the first scenario, when engineers were notified in advance, there were plenty of surprises and unanticipated situations that resulted in some great learnings. What mattered was that engineers knew to block time in their calendars, so they didn’t have to rush a feature at the same time.
 

Sometimes when failures were triggered, the problems would ramify rapidly, beyond our ability to predict all the consequences. But that would -- more than anything else -- help us identify pain points in our service architecture or system.
 

As a result, overall confidence among engineers increased. Game days also helped us create a much-needed feedback loop. On the platform team, we were developing a lot of tools but not getting enough feedback on them -- how helpful or easy to use they were, or how long it took from an alert to its resolution. Thanks to that feedback loop, we further improved our tools and upgraded our fairly rudimentary alerting tools and procedures.
 

Lessons learned

  • There are two important aspects that game days highlight -- the technical and the human. While the technical aspect is often examined far and wide, the human aspect -- how people will react -- is often neglected.
  • A game day is not only about figuring out how to deal with failures; it is about dealing with them in the briefest amount of time. Sometimes you will hit a problem that could be solved but would take days, and a game day’s limited time will force engineers to look for solutions from different perspectives.
  • Don’t underestimate the importance of the feedback loop established through these events. This particularly applies to platform work, since the usefulness of many tools -- unlike customer-facing services that generate x amount of dollars -- is hardly measurable without it.
  • These events are fun! While non-engineering leadership was perplexed that we were deliberately causing failures in our system, engineers not only saw the importance and value of it but found it fun and challenging.
  • There are some great tools out there to help you get started with chaos engineering while providing a safety net to end a “scenario”.
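The “safety net” mentioned in the last point is worth spelling out. The sketch below shows the pattern in its simplest form -- it is not the API of any particular chaos-engineering product, and the guard condition and time budget are illustrative assumptions: an experiment runs only within a time budget, is always rolled back, and aborts automatically if an observed error rate crosses a threshold.

```python
import time


class ChaosScenario:
    """Minimal safety-net pattern for a chaos experiment (a sketch, not a
    specific tool's API): run an injected failure, but abort automatically
    if a guard condition trips, and always roll the injection back."""

    def __init__(self, max_error_rate: float, budget_seconds: float):
        self.max_error_rate = max_error_rate
        self.budget_seconds = budget_seconds
        self.aborted = False

    def run(self, inject, observe_error_rate) -> bool:
        """`inject(bool)` turns the failure on/off; `observe_error_rate()`
        returns the currently measured error rate. Returns True if aborted."""
        start = time.monotonic()
        inject(True)  # start the failure injection
        try:
            while time.monotonic() - start < self.budget_seconds:
                if observe_error_rate() > self.max_error_rate:
                    self.aborted = True  # guard tripped: limit the blast radius
                    break
                time.sleep(0.01)
        finally:
            inject(False)  # the rollback runs no matter how the run ends
        return self.aborted
```

The `try/finally` is the essential part: whether the experiment completes, aborts, or crashes, the injected failure is switched off, which is exactly the guarantee that made leadership comfortable with running these scenarios near production.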

