
Zero Downtime Migration

Dev Processes

19 May, 2021


Shawn Hartsell, Lead Software Engineer at InVision, explains how he helped successfully migrate a mission-critical database with zero downtime.

Problem

We had a large MySQL database in AWS RDS that was mission-critical to the business. As the database grew over time, we ran into the following scaling limits:

  • For very large tables, schema migrations became non-trivial. We would often need to set up special infrastructure to ensure enough CPU and memory were available for a migration to complete.
  • The service connecting to the database experiences unpredictable bursts in throughput. We often found ourselves needing to increase IOPS on our RDS instance to ensure that queries were not queued for execution.
  • RDS does not have adaptive capacity features, so vertically scaling the RDS instance requires manual intervention by our platform team.

To solve these issues, our team decided to move the database from MySQL to AWS DynamoDB. DynamoDB provided the following benefits to us over MySQL:

  • The service’s data model and throughput needs are better suited to a NoSQL database. For example, the service does not require transactions, the data model is not hierarchical, and most queries are executed via the primary key.
  • Data that resides in very large MySQL tables can be partitioned across multiple DynamoDB nodes. This not only gives us higher availability guarantees but also allows us to maintain consistent latency as our data set grows.
  • The auto-scaling and adaptive capacity features of DynamoDB remove the operational burden for our team and ensure that we only pay for the capacity we use.

One of the key challenges that drove our migration architecture was designing a solution with zero downtime. As this service powers the foundation of many of our products, we couldn’t declare a long maintenance window for the migration as that could lead to a significant loss of revenue. Our migration needed to be continuous and completed in incremental stages to provide the least disruption for our customers.

Actions taken

Moving the service to the new database

Before we could perform the actual data migration, we needed to move all read/write paths in the service to work with the new database. This proved challenging, as the business logic and API endpoints had a high degree of coupling to RDBMS semantics such as atomic batch processing. To remove the coupling, we introduced new layers into the API that would allow us to switch from one database to the other without breaking existing API contracts. Since we could not switch all of the read/write paths over at once, we also used feature flags as part of these layers. Once the data migration pipeline was set up, we would use these flags to incrementally cut over reads and writes to the new database. They would also provide us with a way to quickly revert back to the old database.
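As a rough illustration of that layering, the idea is that API handlers depend on a narrow storage interface, with one implementation per database. This is a minimal sketch in Python; the class and method names (ProjectStore and friends) are hypothetical, not InVision's actual code:

    # Illustrative sketch only: a narrow storage interface that hides RDBMS
    # semantics (joins, atomic batches) from the API layer. All names here
    # are hypothetical.
    from abc import ABC, abstractmethod

    class ProjectStore(ABC):
        @abstractmethod
        def get(self, project_id: str) -> dict:
            ...

        @abstractmethod
        def save(self, project: dict) -> None:
            ...

    class MySQLProjectStore(ProjectStore):
        def get(self, project_id: str) -> dict:
            raise NotImplementedError  # SELECT against the existing RDS tables

        def save(self, project: dict) -> None:
            raise NotImplementedError  # INSERT/UPDATE within the existing schema

    class DynamoProjectStore(ProjectStore):
        def get(self, project_id: str) -> dict:
            raise NotImplementedError  # GetItem by primary key on the new table

        def save(self, project: dict) -> None:
            raise NotImplementedError  # PutItem against the new data model

Because the API only sees the interface, either backend can serve a request without changing the API contract, and the feature flags decide which one actually does.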

Designing data migration for zero downtime

To have as little downtime as possible, we chose to leverage stream processing to continuously replicate data from MySQL to DynamoDB. Luckily, MySQL exposes its binary log, which records each individual INSERT, UPDATE, and DELETE statement made to a table; these records are known as Change Data Capture (CDC) events. To set up the stream, we used AWS Database Migration Service (DMS) to read the log and send each event to a highly partitioned Kafka topic. Once in Kafka, an ETL service consumes these events and maps them to the new model in DynamoDB. Using Kafka also gave us the power to reprocess the event stream from any point in case we found bugs in our ETL logic and needed to rebuild our DynamoDB table.
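A minimal sketch of what that ETL step might look like, assuming DMS publishes each CDC event to Kafka as a JSON document containing the row image and the operation type. The topic, table, and field names here are hypothetical:

    # Hypothetical ETL consumer: reads CDC events from Kafka and applies them
    # to the DynamoDB table. Topic, table, and field names are illustrative.
    import json

    import boto3
    from kafka import KafkaConsumer  # kafka-python

    table = boto3.resource("dynamodb").Table("projects")  # hypothetical target table

    consumer = KafkaConsumer(
        "mysql.projects.cdc",                      # hypothetical CDC topic
        bootstrap_servers=["kafka:9092"],
        group_id="projects-etl",
        value_deserializer=lambda raw: json.loads(raw),
    )

    def to_item(row: dict) -> dict:
        # Map the MySQL row image to the new DynamoDB model.
        return {
            "projectId": str(row["id"]),
            "name": row["name"],
            "ownerId": str(row["owner_id"]),
        }

    for message in consumer:
        event = message.value
        operation = event.get("operation")         # insert / update / delete
        row = event.get("data", {})
        if operation in ("insert", "update"):
            table.put_item(Item=to_item(row))
        elif operation == "delete":
            table.delete_item(Key={"projectId": str(row["id"])})

Because the consumer is just an ordinary Kafka client, replaying the topic from an earlier offset rebuilds the DynamoDB table, which is what made recovering from ETL bugs straightforward.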

How we performed the actual migration

Once we integrated our service with DynamoDB and set up our continuous migration pipeline, the question was how we could switch over from MySQL. To do this, we developed a dual-read/write algorithm that leveraged feature flags. While still writing to MySQL, we incrementally toggled reads to go against DynamoDB. If we encountered issues in our DynamoDB queries, we would simply toggle the flag to read against MySQL without incurring data loss.

Writes were treated somewhat differently. To migrate write paths in the service, we would toggle flags to write to both MySQL and DynamoDB. At first, all writes would go to MySQL, while a write to DynamoDB was done in the background. This allowed us to verify that our writes to DynamoDB were functionally correct and tuned for performance. Once we gained enough confidence, we would toggle flags to write to DynamoDB first. Eventually, we would reach a state where the service was solely reading and writing to the new database. Finally, once there were no more CDC events being processed by our ETL service, we shut down the AWS Database Migration Service task.
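Pulling the two paragraphs above together, a minimal sketch of the dual-read/write routing could look like the following. The flag names and the use of a background thread are assumptions for illustration, not InVision's actual implementation:

    # Hypothetical dual-read/write router built on the storage interface
    # sketched earlier. Flag names and the threading choice are illustrative.
    import threading

    class MigratingProjectStore:
        def __init__(self, flags, mysql_store, dynamo_store):
            self.flags = flags
            self.mysql = mysql_store
            self.dynamo = dynamo_store

        def get(self, project_id: str) -> dict:
            # Reads follow a flag; turning it off reverts to MySQL instantly.
            if self.flags.enabled("projects-read-from-dynamo"):
                return self.dynamo.get(project_id)
            return self.mysql.get(project_id)

        def save(self, project: dict) -> None:
            # Dual writes: one store is written synchronously (the source of
            # truth), the other is kept in sync with a best-effort shadow write.
            if self.flags.enabled("projects-write-dynamo-first"):
                self.dynamo.save(project)
                self._shadow(self.mysql.save, project)
            else:
                self.mysql.save(project)
                self._shadow(self.dynamo.save, project)

        def _shadow(self, write, project):
            # Background write so the secondary store never blocks the request.
            threading.Thread(target=write, args=(project,), daemon=True).start()

Starting with MySQL as the synchronous write and DynamoDB as the shadow made it possible to compare the two stores and tune the DynamoDB writes before flipping the flag.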

Lessons learned

  • The migration project was a good lesson in the value of recognizing the quality attributes your architecture must satisfy. By focusing on service availability, data consistency, and testability we were able to design a solution that worked best for our customers.
  • The technical complexity of the process taught us the value of prototypes and incremental change. As we designed and spec'd out each part of the migration, we would spike a proof of concept in code to validate our assumptions and iterate towards our final design. More often than not, these prototypes ended up serving as a reference implementation that less experienced engineers could use to quickly ramp up on DynamoDB.
  • Prior to this project, our team had some experience building event-driven systems, but not at this scale nor in a database migration scenario. By leveraging the power of stream processing, we were able to easily test our migration logic with unit tests, which would have been much harder if we had written the ETL logic directly into AWS DMS. It also allowed us to scale different parts of the pipeline independently (i.e., the AWS DMS infrastructure vs. the ETL Kafka consumer).
