Implementing a major platform technology change
20 April, 2018
Our platform's full-text search was showing its age and its limits. Our fundamental business model of providing subscription access to e-books via a handful of large collections was being challenged, as customers now wanted to buy individual titles and wanted to see those purchases available instantly. However, the existing mechanics of our platform meant that indexing individual titles took hours or even days, as the entire index of all documents had to be regenerated. We needed to improve our search subsystems so that they would allow near-instantaneous addition and search of new documents. And of course, we needed to do this without breaking the system, taking it down, or alienating any customers or end-users. This was a bet-the-company change.
Our objectives were fairly clear, but detailed plans about how to actually achieve our goal were not. It wasn't as easy as just building a completely separate stack, as the content and historical user data had to be migrated in almost real-time, and many of the technologies we have today weren't available then.
I started out by working with our Chief Architect to sponsor a sequence of internal experiments on every aspect of what was going to change for customers and users, and I formed a dedicated Core team to pursue them. This was the "managing down" part of the process. However, I also did some "managing up". It became very clear to me that we needed to go beyond our normal "feature release" mode of communication and broadcast to the whole company the big changes that would be occurring, and that we needed to give upper management choices about tradeoffs in budget and resources. We also felt we needed to communicate the technical risks we were facing and tell the story of how we would manage them.
Once we had decided exactly what we were going to do technologically, we also had to decide how we were going to implement the changes. We couldn't just build another stack and permanently switch both customers and users to it, since so much historical data needed to be migrated and transformed to support the features of the new platform. Instead, we engaged in what I refer to as "wing-walking". In the 1920s, there were lots of experiments and stunts with airplanes, as aviation was still a new technology. One popular stunt was to walk across the airplane's wings while it was flying, sometimes to demonstrate the stability of the aircraft, sometimes just as a daredevil act. The first rule of wing-walking is, "Don't let go of what you have a firm grip on until you have a firm grip on the next thing".
For our project, we did things in an extremely methodical "wing-walking" way, moving customers, technology and components step by step, and we always gave them a way to go back. We had determined that we could run two complete systems in parallel. That meant building, maintaining and feeding all content into both platforms at the same time. However, while this allowed incremental migration of a customer's holdings to the new platform, it effectively meant throwing a one-way switch in code with respect to recording new purchases and user interaction with their new bookshelves. Because of this, I fostered discussion and gained acceptance of a set of mechanisms to migrate customers and users back to the old system, even with the new data, in the event of a truly catastrophic failure. I personally did the research and experiments (SQL database work) on syncing user purchases and on bi-directional user bookshelf migration.
Finally, before releasing the new system, we practiced, practiced, practiced our procedures, failback and monitoring, and we tested, tested, tested performance, accuracy, content ingest volume, and the new document search capability (instantaneous availability).
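The parallel-run and failback idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the system we actually built: the names `Platform`, `ParallelRouter`, `migrate` and `fail_back` are hypothetical, and in-memory sets stand in for the SQL databases where the real syncing was done.

```python
from dataclasses import dataclass, field


@dataclass
class Platform:
    """Hypothetical stand-in for one complete stack (old or new)."""
    name: str
    documents: set = field(default_factory=set)
    purchases: dict = field(default_factory=dict)  # customer -> set of titles


class ParallelRouter:
    """Feeds content to both platforms and routes each customer to one of them."""

    def __init__(self, old: Platform, new: Platform):
        self.old = old
        self.new = new
        self.migrated = set()  # customers currently served by the new platform

    def ingest(self, title: str) -> None:
        # Content always goes into both platforms, so either one can serve it.
        self.old.documents.add(title)
        self.new.documents.add(title)

    def active(self, customer: str) -> Platform:
        return self.new if customer in self.migrated else self.old

    def record_purchase(self, customer: str, title: str) -> None:
        # New purchases are recorded only on the customer's active platform.
        self.active(customer).purchases.setdefault(customer, set()).add(title)

    def migrate(self, customer: str) -> None:
        # Get a firm grip on the next thing first: copy data, then switch.
        self._sync(customer, self.old, self.new)
        self.migrated.add(customer)

    def fail_back(self, customer: str) -> None:
        # Carry purchases made on the new platform back before switching.
        self._sync(customer, self.new, self.old)
        self.migrated.discard(customer)

    @staticmethod
    def _sync(customer: str, source: Platform, target: Platform) -> None:
        titles = source.purchases.get(customer, set())
        target.purchases.setdefault(customer, set()).update(titles)


router = ParallelRouter(Platform("old"), Platform("new"))
router.ingest("Title A")
router.ingest("Title B")
router.record_purchase("acme", "Title A")  # recorded on the old platform
router.migrate("acme")
router.record_purchase("acme", "Title B")  # recorded on the new platform
router.fail_back("acme")                   # old platform now holds both titles
```

The point of the sketch is the ordering inside `migrate` and `fail_back`: data is copied to the destination before the routing flag changes, so a customer is never pointed at a platform that is missing their purchases.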
I continually communicated all of this to other leaders in the company, presented the story visually at all-hands meetings, and invited everyone to help test. Then we took action. There was no major rollback, and while there were a few problems with admin and prep, we corrected these in each phase before moving to the next. After the initial experimental wave, we "migrated" customers first by sending only their purchases and searches to the new subsystem, to confirm the resilience of the search stack. Then we followed with business logic and user bookshelf migration. Within days we had a verified success. Most of the failbacks were never used, but we were glad to have them at hand.
We were scared to death that we were going to risk the business, but inaction would have put it at just as much risk. By insisting on a very deliberate approach, rather than just rushing it through and "hopefully taking a couple of weeks" as the company initially believed it would, we were able to successfully introduce our new technology. When you, as an engineering manager, are being asked to make a thing happen "for the company", you have to do a lot of legwork yourself to ensure that stakeholders really understand the consequences of what they're asking for. Often, they won't really want to know the details, but presenting a plan in terms of risk will immediately get their attention. However, it is not sufficient to discuss risk without a ready plan to address it. What you are seeking is approval to pursue a plan your stakeholders understand. You can then move to complete the process with budget, staffing and timing adjustments to any previously set expectations. Engage with your stakeholders, early and often, and don't forget the first rule of wing-walking!