Tech Debt Bankruptcy
SVP of Engineering at Replicant
After a months-long series of customer-impacting events caused by known issues we hadn’t prioritized, we realized that our fixed 20 percent tech debt budget was not working. We could have anticipated most of the problems; these were things we knew were brittle or a bit broken, but we kept putting them off.
We all knew that teams were regularly being sidetracked by surprises that came from production incidents or customers, but not how much. To quantify this, we started tagging every item that appeared and had work begin on it within the same week. It turned out that 40 percent of our time was going to this “unplanned” work. The number was so large that it was easy to get the Product and Executive teams on board with making significant changes to the way we worked.
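As a rough illustration of that measurement, here is a minimal Python sketch. It assumes each work item records when it appeared, when work began, and how many hours it took; the `WorkItem` type, the field names, the sample data, and the reading of "same week" as the same ISO calendar week are all assumptions made for this example, not a description of the tooling we actually used.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class WorkItem:
    """Hypothetical record of one piece of work."""
    title: str
    created: date   # when the item first appeared
    started: date   # when work on it began
    hours: float    # time spent on it


def is_unplanned(item: WorkItem) -> bool:
    # Assumption: "unplanned" means work began in the same ISO calendar
    # week (year, week number) in which the item appeared.
    return item.created.isocalendar()[:2] == item.started.isocalendar()[:2]


def unplanned_share(items: list[WorkItem]) -> float:
    """Fraction of total hours that went to unplanned items."""
    total = sum(item.hours for item in items)
    if total == 0:
        return 0.0
    unplanned = sum(item.hours for item in items if is_unplanned(item))
    return unplanned / total


# Made-up sample data, purely for illustration.
SAMPLE_ITEMS = [
    WorkItem("Hotfix: login outage", date(2024, 3, 4), date(2024, 3, 5), 16),
    WorkItem("Customer escalation", date(2024, 3, 6), date(2024, 3, 6), 8),
    WorkItem("Planned feature", date(2024, 2, 12), date(2024, 3, 4), 36),
]

if __name__ == "__main__":
    print(f"Unplanned share: {unplanned_share(SAMPLE_ITEMS):.0%}")
```

Weighting by hours rather than counting items is a design choice here: a single large incident can consume far more time than several small interruptions.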
Setting Clear Priorities
We created a three-level priority system for work that put working on product features LAST:
- Production issues (Hotfixes, RCA items, security vulnerabilities, visibility issues);
- Blockers (deploy pipeline, phantom tests, etc.);
- Planned work (both tech work AND product work).
This really isn’t anything shocking; all teams prioritize production incidents over planned work. What was different was what we classified as “production.” Critically, we included many things that were not directly customer-impacting, like fixing issues identified in Root Cause Analysis (RCA) from previous incidents and issues with our monitoring, alerting, and logging systems. Also different was our second priority, Blockers: anything that prevents or slows down a team’s work. By putting this above the planned work, we ensured that all teams were unblocked and able to work efficiently.
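The three levels above can be expressed as a simple sort order. The following is a minimal sketch, not our actual tooling: the `Priority` and `Task` names and the sample backlog are hypothetical and exist only to show how planned work sorts last.

```python
from dataclasses import dataclass
from enum import IntEnum


class Priority(IntEnum):
    # Lower value = worked on first.
    PRODUCTION = 1  # hotfixes, RCA items, security vulnerabilities, visibility issues
    BLOCKER = 2     # deploy pipeline, phantom tests, etc.
    PLANNED = 3     # both tech work and product work


@dataclass
class Task:
    """Hypothetical backlog entry."""
    title: str
    priority: Priority


def order_backlog(tasks: list[Task]) -> list[Task]:
    # sorted() is stable, so tasks within the same level keep
    # whatever relative order the team already gave them.
    return sorted(tasks, key=lambda task: task.priority)


# Made-up backlog, purely for illustration.
SAMPLE_BACKLOG = [
    Task("New onboarding flow", Priority.PLANNED),
    Task("Flaky integration test", Priority.BLOCKER),
    Task("RCA follow-up: add alert on queue depth", Priority.PRODUCTION),
    Task("Refactor billing module", Priority.PLANNED),
]

if __name__ == "__main__":
    for task in order_backlog(SAMPLE_BACKLOG):
        print(task.priority.name, "-", task.title)
```

Note that the RCA follow-up sorts ahead of everything else even though it is not customer-facing, which is the point of the broader definition of “production.”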
We were all aware of the service ownership model (aka the “Spotify model”) of team organization but did not think we were big enough to warrant such a structure. Work was being prioritized top-down and sent out to teams based on the judgement of engineering leaders like myself. However, after we established the development priorities, it became obvious that a lack of clear ownership was causing two significant problems:
- A “tragedy of the commons”: If everyone owns it, then no one does.
- Those doing the top-down prioritization lacked awareness of the issues with the services.
Together, these problems meant that many critical issues never got worked on. To resolve them, we did two things:
- Enumerated every service and significant feature in the platform and assigned a team to own it.
- Completely rebuilt our prioritization process to be bottom-up with the teams responsible for establishing their own roadmaps.
The first, enumerating the services, was relatively easy. Switching to bottom-up prioritization required significant effort with the product team and executive leadership. Despite copious research showing that this methodology worked, they remained skeptical and would only commit to a six-month test. That six-month test was a resounding success. Unplanned work dropped by half (and continued dropping after that), production incidents virtually disappeared, and developer happiness (as measured via a survey) doubled.
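The service enumeration described above can be kept as a small ownership registry that is easy to check for gaps. This is a minimal sketch under stated assumptions: the service names, team names, and helper functions are all hypothetical and only illustrate the idea that every service maps to exactly one owning team.

```python
from collections import defaultdict

# Hypothetical inventory of services and significant features.
SERVICES = {"billing-api", "auth-service", "deploy-pipeline", "reporting-ui"}

# Hypothetical ownership map: each service has exactly one owning team.
OWNERS = {
    "billing-api": "team-payments",
    "auth-service": "team-platform",
    "deploy-pipeline": "team-platform",
}


def unowned_services(services: set[str], owners: dict[str, str]) -> list[str]:
    """Services nobody owns yet, i.e. candidates for the tragedy of the commons."""
    return sorted(services - set(owners))


def services_by_team(owners: dict[str, str]) -> dict[str, list[str]]:
    """Group services by owning team, as a starting point for each team's roadmap."""
    by_team: dict[str, list[str]] = defaultdict(list)
    for service, team in sorted(owners.items()):
        by_team[team].append(service)
    return dict(by_team)


if __name__ == "__main__":
    print("Unowned:", unowned_services(SERVICES, OWNERS))
    for team, owned in services_by_team(OWNERS).items():
        print(team, "owns", owned)
```

Running a check like `unowned_services` whenever a service is added is one way to keep ownership gaps from reappearing.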
- Service ownership works at small scales. After seeing this organizational structure in place with only four teams, I am convinced that it would work well with any number of teams larger than one.
- There is more to “production” than your customer-facing components. Our commitment to prioritize everything that was impacting production, customer-visible or not, had a massive impact on the long-term stability of our platform. This especially applied to RCA items. If you can only prioritize one thing above planned work, prioritize those.