Infrastructuring On-Call Rotation
Anthony Broad-Crawford
CTO & CPO at SpotHero
Problem
On-call is really about infrastructure. The organizational structures you put in place determine whether being on-call is a chore or a learning opportunity. I work in an environment where after-hours support is inclusive and fairly low impact. Here is the on-call and after-hours infrastructure I have set up at my last couple of companies, along with the philosophy and strategy behind it.
Actions taken
- We run an on-call rotation with three separate pools of contacts: a primary on-call engineer, a secondary on-call engineer, and a member of the product team. For every engineer who is on-call, there is always a product team member on-call as well. Both are alerted at the same time and work the incident together. That puts two pairs of eyes on the incident, and if a decision needs to be made, the engineer has business support right alongside.
- Everybody participates in the rotation. Everyone works their way into the on-call schedule at some point, including junior engineers. We work in one-week rotations, so each person is primary on-call about once every three months.
- Once the week is over, the primary is released from duty, the secondary moves up to primary, and a new secondary comes into play. This transition takes place at a weekly hand-off meeting on Monday mornings at 9 a.m. A passing of the baton, if you will. The meeting includes the outgoing and incoming on-call contacts, plus the head of DevOps, the head of engineering, and myself. During the meeting we pull up the dashboards, go through the stats, look at the logs, and see if anything of note is going on. We also look at what is going to be released in the current week and whether anything else major will be happening. If the secondary is new, we walk through the top 10 things that could go wrong, how they would handle each one, and the next steps to take after that. We also make sure they can log into everything, so that they feel comfortable with the role. The hand-off meeting ensures that everyone knows what to expect in the coming week and that they are prepared and on the same page as the rest of the teams.
- Each primary on-call keeps a journal, which is reviewed in the weekly hand-off meeting. We encourage the primary to take care of any action items during their week, such as filling gaps or updating anything that was out of date. But we also know that life happens, and an item may have to become an official action item instead. In that case the outgoing primary follows through on it and sets their own hard completion date within the next week.
- If something breaks, we have pretty bulletproof documentation on who to communicate with. We make sure it is crystal clear who the point of contact is, what they need to know, and how to work together to resolve the problem. And no matter what is going on, there is an update at least every 15 minutes. Our on-calls practice posting regular updates in a Slack channel, even if the update is the same as the last one. This keeps them from being bombarded with individual messages and keeps everyone informed of the situation.
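The weekly promotion described above (the primary steps down, the secondary becomes primary, a new secondary joins, and a product team member is paired with them) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the tooling actually used: the names `OnCallWeek` and `build_schedule` are hypothetical, and so is the choice to rotate the product pool on its own independent cycle.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OnCallWeek:
    """One week of coverage: two engineers plus a product team member."""
    primary: str
    secondary: str
    product: str


def build_schedule(engineers, product_members, weeks):
    """Return a list with one OnCallWeek per week.

    Assumption: engineers rotate in list order, so this week's secondary
    is always next week's primary. Product members cycle through their
    own pool independently (an assumption made for this sketch).
    """
    if len(engineers) < 2:
        raise ValueError("need at least two engineers for primary and secondary")
    if not product_members:
        raise ValueError("need at least one product team member")

    schedule = []
    for week in range(weeks):
        primary = engineers[week % len(engineers)]
        # The secondary is the next engineer in line; they move up next week.
        secondary = engineers[(week + 1) % len(engineers)]
        product = product_members[week % len(product_members)]
        schedule.append(OnCallWeek(primary, secondary, product))
    return schedule


# Hypothetical pools, purely for illustration.
for entry in build_schedule(["ana", "ben", "chi", "dev"], ["pam", "quinn"], 3):
    print(entry)
```

With one-week rotations, each engineer is primary once every `len(engineers)` weeks, so a pool of about 12 engineers would give roughly the three-month cadence mentioned above. The pool size is an assumption, not a figure from the text.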
Lessons learned
- The big change I have made over the last several years, and the one that has put everyone at ease about the on-call rotation, is using a release schedule and pulling in the product team. Tech leads and product leads walk in prepared with plans and talk through what is being released that week. It is nothing fancy, just a written one-page document that people can access, but it is concrete and locked in.
- I am concerned that some people, especially junior engineers, don't get exposed to on-call frequently enough. I stress that as they climb the career ladder there comes a point where they have to level up. If they really want to move up, they need to set goals and work on educating themselves, with the help of their manager of course.