Practice Makes Perfect: How to Run a Game Day
Pierre-Alexandre Lacerte, Director of Software Development at Upgrade Inc.
Throughout my career at different companies, I have been involved in, and led, multiple initiatives aimed at giving engineers more ownership over controlling and deploying their services to production. Here is the flow: once a feature was merged, a deployment pipeline would kick in, and the engineer could trigger, monitor, or roll back the deployment to production using an internal tool. Each team owned a set of services in the production environment and was part of our alerting and paging rotation. But what happens when a problem occurs during the deployment, or hits the service a few hours later? How do you investigate it and fix it? And how can you “execute” this loop faster? How do we know whether our tools are giving our engineers enough visibility, and what features are they missing?
When we moved the engineering teams to this new “system”, we realized they loved getting more autonomy, but they didn’t have much experience investigating production issues. When a problem occurred, a number of blockers would suddenly surface: people didn’t have access to some of our monitoring tools and were not aware of what was available to them (for example: “Oh, we have a tracing tool here? Really!”). Some of them couldn’t investigate on their own; they would do a quick check in our logging tool and then escalate the alert to the next level, until it reached someone with enough expertise and access to fix it.
Instead of engineers embracing failures and learning how to deal with them, we saw what I would call a “Fear of Production”. How could we “share” this knowledge with our engineers and, as an organization, become more resilient?
Our first action seemed far too obvious but was nevertheless missing -- we had to ensure that everyone had access to ALL our monitoring tools, and automate this part of the new engineer onboarding. Then, we organized onboarding and advanced training on how to dig into problems. But training and wikis can only get you so far...
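To make the access check concrete, here is a minimal sketch of the kind of audit that could run as part of onboarding. It is only an illustration: the tool names, the `Engineer` record, and the `missing_access` helper are all hypothetical, and a real version would read from whatever identity or permission system your organization uses.

```python
from dataclasses import dataclass, field

# Hypothetical tool names; substitute whatever your organization actually runs.
REQUIRED_TOOLS = {"logging", "metrics", "tracing", "alerting"}


@dataclass
class Engineer:
    """Illustrative record of one engineer and the tools they can log into."""
    name: str
    tools: set = field(default_factory=set)


def missing_access(engineers, required=REQUIRED_TOOLS):
    """Return {engineer name: sorted list of missing tools} for anyone with gaps."""
    gaps = {}
    for eng in engineers:
        missing = set(required) - eng.tools
        if missing:
            gaps[eng.name] = sorted(missing)
    return gaps
```

Running a check like this on a schedule, and not only on the first day, would also catch access that is lost when tools are added or replaced.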
Finally, we decided we were ready to take the plunge into practice. During (or close to) business hours, we would run a game day. A game day is a simulated failure or event, staged to test systems, processes, and team responses, with the goal of improving reliability practices.
We ran two types of game days: either we notified our engineers in advance that a game day would happen on a specific date, or we staged it without announcing the exact date.
In the first case, we would tell them to block a certain amount of their time, and then we would gather all of the engineers in one room while someone led the simulation, much like a game master in a fantasy role-playing game, triggering failures and encouraging everyone to participate. It was vital that we were all in the same room, where we could observe each other's responses and quickly help one another.
The other type, which included an element of surprise, was mostly run on non-critical systems. We triggered the scenarios during business hours and only gave teams a heads-up that “something” might happen in a given week.
Even in the first case, when engineers were notified in advance, there were a lot of surprises and unanticipated situations that resulted in some great learnings. What mattered was that engineers knew to block time in their calendars, whatever the stated reason, so that they didn’t have to rush a feature at the same time.
Sometimes, when failures were triggered, the problems would spread rapidly, beyond our ability to predict all the consequences. But that, more than anything else, helped us identify pain points in our service architecture or system.
As a result, the overall confidence among engineers increased. Game days also helped us create a much-needed feedback loop. On the platform team, we were developing a lot of tools but not getting enough feedback on them: how helpful they were, how easy they were to use, or how much time passed between an alert and its resolution. Thanks to the feedback loop, we further improved our tools and upgraded our fairly rudimentary alerting tools and procedures.
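The time between an alert and its resolution is one of the few numbers a game day produces directly, so it is worth recording. A minimal sketch of how it could be summarized follows; the function names and the `(alerted_at, resolved_at)` pair format are assumptions made for illustration, not a description of the tooling we actually used.

```python
from datetime import datetime, timedelta
from statistics import median


def time_to_resolution(alerted_at: datetime, resolved_at: datetime) -> timedelta:
    """Elapsed time between the alert firing and the issue being resolved."""
    if resolved_at < alerted_at:
        raise ValueError("resolved_at must not be earlier than alerted_at")
    return resolved_at - alerted_at


def summarize(incidents):
    """Summarize (alerted_at, resolved_at) pairs from one or more game days.

    Returns the number of incidents plus the median and worst resolution
    time, both expressed in minutes.
    """
    if not incidents:
        raise ValueError("need at least one incident to summarize")
    minutes = [
        time_to_resolution(alerted, resolved).total_seconds() / 60
        for alerted, resolved in incidents
    ]
    return {
        "count": len(minutes),
        "median_minutes": median(minutes),
        "worst_minutes": max(minutes),
    }
```

Comparing these numbers across successive game days gives the platform team a rough signal of whether tool changes are actually helping.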
- There are two important aspects that game days highlight: the technology and the human aspects. While the tech aspect is often examined thoroughly, the human aspect (how people will react) is often neglected.
- A game day is not only about figuring out how to deal with failures; it is about how to deal with them in the shortest possible time. Sometimes you will hit a problem that could be solved but would take days, and the limited time available in a game day will force engineers to look for solutions from different perspectives.
- Don’t underestimate the importance of the feedback loop established through those events. This particularly applies to platform work since the usefulness of many tools, unlike specific customer services that generate x amount of dollars, is hardly measurable without it.
- These events are fun! While non-engineering leadership was perplexed that we were deliberately causing failures in our system, engineers not only saw the importance and value of it but also found it fun and challenging.
- There are some great tools out there to help you get started with chaos engineering, while providing a safety net to end a “scenario”.
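To illustrate what a “safety net” means in practice, here is a minimal sketch of a scenario runner that always rolls back the injected failure and aborts early if an agreed limit is crossed. It does not reproduce the API of any particular chaos engineering tool; `scenario`, `run_with_safety_net`, and the `inject`, `rollback`, and `within_limits` callables are hypothetical names chosen for this example.

```python
import time
from contextlib import contextmanager


@contextmanager
def scenario(inject, rollback):
    """Inject a failure on entry and guarantee the rollback on exit."""
    inject()
    try:
        yield
    finally:
        # Runs on normal exit, early return, or exception.
        rollback()


def run_with_safety_net(inject, rollback, within_limits, max_seconds,
                        poll_seconds=1.0, clock=time.monotonic, sleep=time.sleep):
    """Run a failure scenario for at most max_seconds.

    within_limits() is polled while the failure is active. If it returns
    False, the scenario ends immediately. Returns "aborted" in that case,
    or "completed" if the time limit was reached without tripping it.
    """
    deadline = clock() + max_seconds
    with scenario(inject, rollback):
        while clock() < deadline:
            if not within_limits():
                return "aborted"
            sleep(poll_seconds)
    return "completed"
```

The design choice worth keeping, whatever tool you adopt, is that ending the scenario does not depend on the person running it remembering to clean up.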
Be notified about next articles from Pierre-Alexandre Lacerte