
Focusing Engineers on KPIs Versus Short-Term Fixes

Chris Rude

Senior Engineering Manager at Facebook


Problem

"When we started tracking how our team was spending its time, we realized 30% of it went to solving live site issues. This was insane! There were a few reasons this was so high."


First, good engineers really like making a system that works. When there’s a defect out there somewhere, an engineer will want to fix it. But it can be hard to know whether one defect is worse than another until you’ve gone through the trouble of root-causing it, and root-causing is most of the work of fixing it. So, in effect, issues were very hard to triage by importance because we lacked clear metrics.

Actions Taken

An easy way to fix this was to move from an on-call system based on system errors to one driven entirely by metrics. If customers hit errors that didn’t break our metrics, we didn’t spend any time looking at them.
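As a rough illustration of what "driven by metrics" can mean in practice, here is a minimal sketch. The metric names, thresholds, and function names are all hypothetical and are not taken from the team's actual system; the only point is that the raw error count never enters the paging decision.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricTarget:
    """A KPI the on-call engineer is paged for (hypothetical shape)."""
    name: str
    threshold: float
    higher_is_better: bool = True


def is_violated(target: MetricTarget, observed: float) -> bool:
    """True when the observed value is on the wrong side of the threshold."""
    if target.higher_is_better:
        return observed < target.threshold
    return observed > target.threshold


def violated_metrics(targets, observations):
    """Names of the targets whose observed value breaches the threshold."""
    return [
        t.name
        for t in targets
        if t.name in observations and is_violated(t, observations[t.name])
    ]


def should_page(targets, observations, error_count):
    """Page only on a metric violation.

    error_count is accepted but deliberately ignored: errors that do not
    move a metric do not interrupt anyone.
    """
    return bool(violated_metrics(targets, observations))


# Hypothetical targets, for illustration only.
TARGETS = [
    MetricTarget("checkout_success_rate", 0.99),
    MetricTarget("p95_latency_ms", 800, higher_is_better=False),
]
```

With these assumed targets, a burst of 1,200 logged errors alongside healthy metrics pages no one, while a drop in `checkout_success_rate` to 0.97 does.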


As a caveat, we did care about customers who were broken. So we invested in a general-purpose “magic fix-it button”. This would give the customer a service credit while also putting their account into a known-good state, without any data loss for them. It let us work around a large number of small edge-case bugs at very little engineering cost.
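The article does not describe how the button was built. Purely as a sketch of the three properties it names (service credit, known-good state, no data loss), one could imagine something like the following, where the `Account` fields, the `"active"` state, and the credit amount are all assumptions made for illustration:

```python
import copy
from dataclasses import dataclass

KNOWN_GOOD_STATE = "active"  # assumed name for the known-good state


@dataclass
class Account:
    """Hypothetical customer account record."""
    account_id: str
    state: str
    data: dict
    credits: int = 0


def fix_it(account: Account, credit: int = 10) -> Account:
    """Return a repaired copy of the account.

    The customer gets a service credit, the state is reset to the
    known-good value, and the customer's data is carried over unchanged.
    No root-causing is involved.
    """
    return Account(
        account_id=account.account_id,
        state=KNOWN_GOOD_STATE,
        data=copy.deepcopy(account.data),
        credits=account.credits + credit,
    )
```

The design point is that the fix is generic: it does not need to know which edge-case bug put the account into a bad state.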

We also established a rule that if a metric was violated and an alert triggered, you would fix the issue completely, then and there. It was a shift in mindset, and it very quickly led to long-term fixes for the most common issues, since those were the ones that triggered alerts most often.

The engineering team got on board with this KPI plan because they ultimately didn’t want to be interrupted either, even if they enjoyed the feeling of being helpful in the moment. Although there were times they felt valuable, the work became burdensome and toilsome. They liked maintaining the system, but they weren’t really learning, and nothing was helping them grow as engineers. A lot of people want to be architecting and building, not spending a year of their life fixing broken stuff someone else wrote. If you frame it over a longer time frame, or in terms of their next promotion or interview, these system projects are what they want to focus on.

Lessons Learned

During rollout, the engineering team was skeptical that we really meant it, and overcoming that skepticism was the hardest part. In the past, unplanned work had caused them to get jerked around, so some trust needed to be rebuilt between engineers and management. It took the KPI priorities and the guarded time with the team to really rebuild this trust. If we hadn’t done this, the team’s ability to take on work in the future would have decreased.

