Zero Downtime Migration
19 May, 2021
null at InVision
We had a large MySQL database in AWS RDS that was mission-critical to the business. As the database grew over time, we ran into the following scaling limits:
- For very large tables, schema migrations became non-trivial. We often needed to set up special infrastructure to ensure enough CPU and memory were available for a migration to complete.
- The service connecting to the database often experienced unpredictable bursts in throughput. We frequently had to increase IOPS on our RDS instance to keep queries from being queued for execution.
- RDS does not have adaptive capacity features, so vertically scaling the RDS instance requires manual intervention by our platform team.
To solve these issues, our team decided to move the database from MySQL to AWS DynamoDB. DynamoDB provided the following benefits to us over MySQL:
- The service’s data model and throughput needs are better suited to a NoSQL database. For example, the service does not require transactions, the data model is not hierarchical, and most queries are executed via the primary key.
- Data that resides in very large MySQL tables can be partitioned across multiple DynamoDB nodes. This not only gives us higher availability guarantees but also keeps latency consistent as our data set grows.
- The auto-scaling and adaptive capacity features of DynamoDB remove the operational burden for our team and ensure that we only pay for the capacity we use.
One of the key challenges that drove our migration architecture was designing a solution with zero downtime. As this service powers the foundation of many of our products, we couldn’t declare a long maintenance window for the migration as that could lead to a significant loss of revenue. Our migration needed to be continuous and completed in incremental stages to provide the least disruption for our customers.
Moving the service to the new database
Before we could perform the actual data migration, we needed to move all read/write paths in the service to work with the new database. This proved challenging because the business logic and API endpoints were tightly coupled to RDBMS semantics such as atomic batch processing. To remove the coupling, we introduced new layers into the API that would allow us to switch from one database to the other without breaking existing API contracts. Since we could not switch all of the read/write paths over at once, we also built feature flags into these layers. Once the data migration pipeline was set up, we would use these flags to incrementally cut over reads and writes to the new database. The flags also gave us a way to quickly revert to the old database.
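As a rough illustration of such a layer, here is a minimal Python sketch. It is not the service's actual code: the `RecordStore` contract, the `FlaggedStore` router, the flag names, and the in-memory stand-ins for MySQL and DynamoDB are all hypothetical, chosen only to show how callers can stay unaware of which database serves a given path.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Protocol


class RecordStore(Protocol):
    """Hypothetical contract the API layer codes against, independent of the database."""

    def get(self, key: str) -> Optional[dict]: ...

    def put(self, key: str, record: dict) -> None: ...


@dataclass
class InMemoryStore:
    """Stand-in for a MySQL- or DynamoDB-backed implementation (assumption)."""

    name: str
    rows: Dict[str, dict] = field(default_factory=dict)

    def get(self, key: str) -> Optional[dict]:
        return self.rows.get(key)

    def put(self, key: str, record: dict) -> None:
        self.rows[key] = dict(record)


@dataclass
class FlaggedStore:
    """Routes each read/write path to one backend based on a feature flag."""

    legacy: RecordStore
    target: RecordStore
    flags: Dict[str, bool] = field(default_factory=dict)

    def _backend(self, flag: str) -> RecordStore:
        # A missing flag falls back to the legacy database, so the safe
        # behaviour is also the default one.
        return self.target if self.flags.get(flag, False) else self.legacy

    def get(self, key: str) -> Optional[dict]:
        return self._backend("reads_use_new_db").get(key)

    def put(self, key: str, record: dict) -> None:
        self._backend("writes_use_new_db").put(key, record)


mysql = InMemoryStore("mysql")
dynamo = InMemoryStore("dynamodb")
store = FlaggedStore(legacy=mysql, target=dynamo)

store.put("user#1", {"name": "Ada"})      # flag off: lands in the legacy store
store.flags["writes_use_new_db"] = True   # cut the write path over
store.put("user#2", {"name": "Grace"})    # now lands in the new store
store.flags["writes_use_new_db"] = False  # reverting is a single flag flip
```

Because the API endpoints only see `get` and `put`, the existing contracts stay intact while each path is moved independently.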
Designing data migration for zero downtime
To have as little downtime as possible, we chose to leverage stream processing to continuously replicate data from MySQL to DynamoDB. Luckily, MySQL exposes its binary log, which records each individual INSERT, UPDATE, and DELETE made to a table; these records are known as Change Data Capture (CDC) events. To set up the stream, we used the AWS Database Migration Service (DMS) to read the log and send each event to a highly partitioned Kafka topic. Once in Kafka, an ETL service consumed these events and mapped them to the new model in DynamoDB. Using Kafka also let us reprocess the event stream from any point if we found bugs in our ETL logic and needed to rebuild our DynamoDB table.
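The core of an ETL consumer like this can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the simplified event shape (`op`, `table`, `row`), the `table#id` key scheme, and the use of a list and a dict in place of the Kafka topic and the DynamoDB table. The point it shows is that when every event is applied as an idempotent upsert or delete, replaying the stream ends in the same state.

```python
from typing import Any, Dict, Iterable, List, Tuple

# Hypothetical, simplified CDC event: the operation, the source table,
# and the row's column values. Real DMS messages carry more metadata.
CdcEvent = Dict[str, Any]


def to_item(event: CdcEvent) -> Tuple[str, dict]:
    """Map a relational row onto a key/attributes item for the new model."""
    row = event["row"]
    key = f'{event["table"]}#{row["id"]}'
    attributes = {k: v for k, v in row.items() if k != "id"}
    return key, attributes


def apply_events(events: Iterable[CdcEvent], table: Dict[str, dict]) -> None:
    """Apply events in log order using idempotent upserts and deletes."""
    for event in events:
        key, attributes = to_item(event)
        if event["op"] in ("insert", "update"):
            table[key] = attributes   # upsert: the latest event wins
        elif event["op"] == "delete":
            table.pop(key, None)      # deleting a missing key is a no-op
        else:
            raise ValueError(f"unknown operation: {event['op']!r}")


# A list stands in for one partition of the Kafka topic.
topic: List[CdcEvent] = [
    {"op": "insert", "table": "users", "row": {"id": 1, "name": "Ada"}},
    {"op": "insert", "table": "users", "row": {"id": 2, "name": "Grace"}},
    {"op": "update", "table": "users", "row": {"id": 1, "name": "Ada L."}},
    {"op": "delete", "table": "users", "row": {"id": 2}},
]

table: Dict[str, dict] = {}
apply_events(topic, table)

# Rebuilding after an ETL bug fix: start from an empty table and replay
# the retained events from the first offset.
rebuilt: Dict[str, dict] = {}
apply_events(topic, rebuilt)
```

The sketch assumes that events for a given row arrive in log order. Kafka only guarantees ordering within a partition, so a design along these lines would need all events for one row to land on the same partition, for example by keying messages on the primary key.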
How-to of the actual migration
Once we integrated our service with DynamoDB and set up our continuous migration pipeline, the question was how we could switch over from MySQL. To do this, we developed a dual-read/write algorithm that leveraged feature flags. While still writing to MySQL, we incrementally toggled reads to go against DynamoDB. If we encountered issues in our DynamoDB queries, we would simply toggle the flag to read against MySQL without incurring data loss.
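The text does not say how reads were toggled incrementally. One common way to do it, shown here purely as an assumption, is a percentage rollout: each key is hashed into a stable bucket, and only keys below the current percentage are read from DynamoDB. The `bucket` and `read` helpers and the dictionaries standing in for the two databases are hypothetical.

```python
import hashlib
from typing import Dict, Optional, Set


def bucket(key: str) -> int:
    """Map a key to a stable bucket in the range [0, 100)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def read(
    key: str,
    mysql: Dict[str, dict],
    dynamo: Dict[str, dict],
    dynamo_read_percent: int,
) -> Optional[dict]:
    """Serve the read from DynamoDB for keys inside the rollout percentage."""
    if bucket(key) < dynamo_read_percent:
        return dynamo.get(key)
    return mysql.get(key)


# Both stores hold every record, because MySQL still takes all writes and
# the replication pipeline keeps DynamoDB in sync.
mysql = {f"user#{i}": {"source": "mysql"} for i in range(1000)}
dynamo = {f"user#{i}": {"source": "dynamodb"} for i in range(1000)}


def served_by_dynamo(percent: int) -> Set[str]:
    """Return the keys whose reads are answered by DynamoDB at this percentage."""
    return {
        key
        for key in mysql
        if read(key, mysql, dynamo, percent)["source"] == "dynamodb"
    }
```

Setting the percentage back to 0 sends every read to MySQL again, which is the quick revert described above. Because the bucket of a key never changes, raising the percentage only adds keys to the DynamoDB side and never moves one back.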
Writes were treated somewhat differently. To migrate write paths in the service, we would toggle flags to write to both MySQL and DynamoDB. At first, all writes would go to MySQL, while a write to DynamoDB was done in the background. This allowed us to verify that our writes to DynamoDB were functionally correct and tuned for performance. Once we gained enough confidence, we would toggle flags to write to DynamoDB first. Eventually, we would reach a state where the service was solely reading from and writing to the new database. Finally, once there were no more CDC events being processed by our ETL service, we shut down the AWS Database Migration Service task.
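The staged write cutover can be sketched as a small state machine. This is an illustrative Python sketch rather than the service's implementation: the `WriteStage` names, the `DualWriter` class, and the dictionaries standing in for the two databases are hypothetical, and running the secondary write in the background during the DynamoDB-first stage is an assumption.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from enum import Enum
from typing import Dict, Optional


class WriteStage(Enum):
    MYSQL_ONLY = 1         # before the cutover starts
    MYSQL_THEN_DYNAMO = 2  # MySQL is primary, DynamoDB is written in the background
    DYNAMO_THEN_MYSQL = 3  # DynamoDB is primary, MySQL is kept as a fallback
    DYNAMO_ONLY = 4        # migration complete


class DualWriter:
    def __init__(self, mysql: Dict[str, dict], dynamo: Dict[str, dict]) -> None:
        self.mysql = mysql
        self.dynamo = dynamo
        self.stage = WriteStage.MYSQL_ONLY  # plays the role of the feature flag
        self._pool = ThreadPoolExecutor(max_workers=1)

    def write(self, key: str, record: dict) -> Optional[Future]:
        """Write to the primary synchronously; return a Future for the
        background write to the secondary, or None if there is none."""
        if self.stage in (WriteStage.MYSQL_ONLY, WriteStage.MYSQL_THEN_DYNAMO):
            primary, secondary = self.mysql, self.dynamo
        else:
            primary, secondary = self.dynamo, self.mysql

        primary[key] = dict(record)

        if self.stage in (WriteStage.MYSQL_THEN_DYNAMO, WriteStage.DYNAMO_THEN_MYSQL):
            # A failure here is captured in the Future instead of failing
            # the request that triggered the write.
            return self._pool.submit(secondary.__setitem__, key, dict(record))
        return None

    def close(self) -> None:
        self._pool.shutdown(wait=True)


mysql: Dict[str, dict] = {}
dynamo: Dict[str, dict] = {}
writer = DualWriter(mysql, dynamo)

writer.write("order#1", {"total": 10})

writer.stage = WriteStage.MYSQL_THEN_DYNAMO
writer.write("order#2", {"total": 20}).result()  # wait only for the demo

writer.stage = WriteStage.DYNAMO_THEN_MYSQL
writer.write("order#3", {"total": 30}).result()

writer.stage = WriteStage.DYNAMO_ONLY
writer.write("order#4", {"total": 40})
writer.close()
```

Moving between stages is a single assignment, so stepping back one stage after a problem is as cheap as stepping forward.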
Lessons learned
- The migration project was a good lesson in the value of recognizing the quality attributes your architecture must satisfy. By focusing on service availability, data consistency, and testability, we were able to design a solution that worked best for our customers.
- The technical complexity of the process taught us the value of prototypes and incremental change. As we designed and specced out each part of the migration, we would spike a proof of concept in code to validate our assumptions and iterate towards our final design. More often than not, these prototypes ended up serving as reference implementations that less experienced engineers could use to ramp up quickly on DynamoDB.
- Prior to this project, our team had some experience building event-driven systems, but not at this scale or in a database migration scenario. By leveraging stream processing, we were able to test our migration logic with unit tests, which would have been harder had we written the ETL logic directly into AWS DMS. It also allowed us to scale different parts of the pipeline independently (e.g., the AWS DMS infrastructure vs. the ETL Kafka consumer).