Tech Debt Bankruptcy
17 February, 2021
After a months-long series of customer-impacting events caused by known issues we hadn’t prioritized, we realized that our fixed 20 percent tech debt budget was not working. We could have anticipated most of the problems; these were the things we knew were brittle or a bit broken, but we kept putting them off.
We all knew that teams were regularly being sidetracked by surprises that came from production incidents or customers, but we did not know how much. To quantify this, we started tagging every item that appeared and had work begin on it within the same week. It turned out that 40 percent of our time was going to this “unplanned” work. This was such a huge number that it was easy to get the Product and Executive teams on board with making significant changes to the way we worked.
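The tagging rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `WorkItem` fields, the use of ISO calendar weeks to decide "the same week," and the sample items are all hypothetical, not our actual tracker data.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class WorkItem:
    title: str
    created: date   # day the item first appeared (assumed field)
    started: date   # day work on it began (assumed field)
    hours: float    # time spent on it (assumed field)


def is_unplanned(item: WorkItem) -> bool:
    """Tag an item as unplanned if it appeared and was started in the same ISO week."""
    return tuple(item.created.isocalendar())[:2] == tuple(item.started.isocalendar())[:2]


def unplanned_share(items: list[WorkItem]) -> float:
    """Fraction of total hours spent on unplanned items."""
    total = sum(i.hours for i in items)
    if total == 0:
        return 0.0
    return sum(i.hours for i in items if is_unplanned(i)) / total


# Illustrative data only.
items = [
    WorkItem("Customer escalation", date(2021, 1, 4), date(2021, 1, 6), 8.0),
    WorkItem("Roadmap feature", date(2020, 12, 1), date(2021, 1, 5), 12.0),
]
share = unplanned_share(items)
print(f"{share:.0%} of time went to unplanned work")
```

Weighting by hours rather than by item count is a design choice here; counting items would be simpler but would treat a five-minute fix the same as a week-long incident.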
Setting Clear Priorities
We created a three-level priority system for work that put working on product features LAST:
- Production issues (Hotfixes, RCA items, security vulnerabilities, visibility issues);
- Blockers (deploy pipeline, phantom tests, etc.);
- Planned work (both tech work AND product work).
This really isn’t anything shocking; all teams prioritize production incidents over planned work. What was different was what we classified as “production.” Critically, we included many things that were not directly customer-impacting, like fixing issues identified in Root Cause Analysis (RCA) from previous incidents and issues with our monitoring, alerting, and logging systems. Also different was our second priority, Blockers: anything that prevents or slows down a team’s work. By putting this above the planned work, we ensured that all teams were unblocked and able to work efficiently.
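As a rough sketch of how the three levels order a backlog, the snippet below sorts items by priority class. The `Priority` names mirror the list above; the backlog entries and the `order_backlog` helper are hypothetical examples, not our real tooling.

```python
from enum import IntEnum


class Priority(IntEnum):
    PRODUCTION = 1  # hotfixes, RCA items, security vulnerabilities, visibility issues
    BLOCKER = 2     # deploy pipeline, phantom tests, anything slowing a team down
    PLANNED = 3     # tech work and product work alike


def order_backlog(items):
    """Sort (title, priority) pairs by priority; sorted() is stable, so ties keep their order."""
    return sorted(items, key=lambda item: item[1])


# Illustrative backlog only.
backlog = [
    ("Ship new product feature", Priority.PLANNED),
    ("Fix flaky deploy pipeline", Priority.BLOCKER),
    ("RCA follow-up: add missing alert", Priority.PRODUCTION),
    ("Planned refactor", Priority.PLANNED),
]
ordered = order_backlog(backlog)
for title, priority in ordered:
    print(priority.name, title)
```

Note that tech work and product work share the `PLANNED` level, so neither automatically outranks the other within that level.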
We were all aware of the service ownership model of team organization (aka the “Spotify model”) but did not think we were big enough to warrant such a structure. Work was being prioritized top-down and sent out to teams based on the judgement of engineering leaders like myself. However, after establishing the development priorities, it became obvious that a lack of clear ownership was causing two significant problems:
- A “tragedy of the commons”: If everyone owns it, then no one does.
- Those doing the top-down prioritization lacked awareness of the issues with the services.
Together, these problems meant that many critical issues never got worked on. To resolve them, we did two things:
- Enumerated every service and significant feature in the platform and assigned a team to own it.
- Completely rebuilt our prioritization process to be bottom-up with the teams responsible for establishing their own roadmaps.
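The enumeration step lends itself to a simple check: list everything in the platform, map each entry to a team, and flag anything left without an owner. The sketch below assumes a plain dictionary as the ownership registry; the service and team names are made up for illustration.

```python
def find_unowned(services, owners):
    """Return the services that have no owning team, in sorted order."""
    return sorted(s for s in services if not owners.get(s))


# Hypothetical inventory and ownership map.
services = ["billing-api", "deploy-pipeline", "alerting", "login"]
owners = {"billing-api": "Team A", "login": "Team B", "alerting": ""}

missing = find_unowned(services, owners)
print(missing)
```

An empty or missing entry counts as unowned, which catches the "everyone owns it" case as well as services nobody listed.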
The first, enumerating the services, was relatively easy. Switching to bottom-up prioritization required significant effort with the product team and executive leadership. Despite copious research showing that this methodology worked, they remained skeptical and would only commit to a six-month test. That six-month test was a resounding success. Unplanned work dropped by half (and continued dropping after that), production incidents virtually disappeared, and developer happiness (as measured via a survey) doubled.
- Service Ownership Works at Small Scales. After seeing this organizational structure in place with only four teams, I am convinced that it would work well with any number of teams larger than one.
- There is more to “production” than your customer-facing components. Our commitment to prioritize everything that was impacting production, customer-visible or not, made a massive difference to the long-term stability of our platform. This especially applied to RCA items. If you can only prioritize one thing above planned work, prioritize those.