Lessons From Leading Total Rewrite Projects
Problem
In 2014, Twitter wanted to rewrite its internal A/B testing system. A team had built the first system, but there was a perception within the company that it wasn't working very well. When the system said an experiment was underperforming, people tended to blame the testing framework rather than believe the results -- a trust problem. A lot of teams also felt it was difficult to measure the things they wanted to measure. Some of this was based on perception, and some was based in reality.
A few senior engineers from other groups formed an ad-hoc working group, with the VP Eng's blessing, to develop a new tool from the ground up, and started implementing it. This created an "us" versus "them" situation. On one side was the old team that owned the product, which was too small to create the new tool given all of the challenges it would face. On the other was a group that thought it could do better and decided to go around the old team, but lacked the context and lessons learned from the old system. The result was tension and a lack of clarity across the board.
Actions taken
The team that had originally built the tool felt completely sidelined and unappreciated; the new team was perceived as dismissive of their efforts -- yet the burden of keeping the older production system up still landed on the old team. The other side felt that whenever the original team improved the old tool, it was attacking the working group's work, because the working group then had to keep catching up with what the old tool could do while still building the new one. I realized that:
- A/B testing was really important for the company;
- the new project wasn't going to succeed the way it was set up; and
- I was one of the few people that all of the camps trusted.
I decided to help fix the problem. I argued that a new team with responsibility for A/B testing should be created, staffed with both data scientists and engineers. I merged the two teams, and pulled in people from a few other groups who were responsible for critical components that A/B testing relied on. The single new team was responsible for both maintaining the old system and developing a new one. This really helped to align people behind one goal, and we could more easily agree on how we were going to verify that the new system was correct and how we were going to migrate people off the old system.
I also had to build trust and make it clear that everyone's contribution was valued, regardless of "old" vs "new" affiliation. For example, the lead data scientist from the old team was given decision power over when we would switch to the new system in production, and was asked to do the analysis to support that decision. That empowered her and kept her from feeling sidelined. It also put her in a position to discover issues in the old system, which converted her from a sceptic into an advocate for change.
Lessons learned
I have led teams that wound up completely replacing a system a number of times. One thing I've seen consistently is that while it is tempting to separate people into a legacy-support team and a fun-new-system team, this approach doesn't work well. Don't create parallel efforts: they breed resentment and bad blood, and leave people with missing context and misaligned incentives.
The team building something new needs to understand the practical considerations that went into the old system. There's a reason it has grown awkward warts and convoluted business logic -- often the business logic itself really is convoluted! Learning all the special cases and gotchas can take a long time, and what seems like a straightforward, clean solution may be flawed because it doesn't account for real-world subtleties.
It's also good to be responsible for the old system because you become intimately familiar with what exactly it does well and what it doesn't; you can also discover good incremental ways to migrate off it. The reasons an old system fails are likely to be similar to the reasons a new system will fail, so it helps for people to get to know the old system and avoid repeating its mistakes.
Nothing quite incentivises delivering a new piece of software quickly -- and having it be production-ready and stable -- like having to fix bugs and put out fires in the legacy software until the new stuff is ready. On the other hand, there are few things as demotivating as being told that your team has to support something that's seen as outdated and waiting to be turned off, and that you shouldn't even be improving it since that would be wasted effort and scope creep for the new system.
Be notified about next articles from Dmitriy Ryaboy