Zero Downtime Migration
19 May, 2021
Problem
We had a large MySQL database in AWS RDS that was mission-critical to the business. As the database grew over time, we ran into the following scaling limits:
- For very large tables, schema migrations became non-trivial. We would often need to set up special infrastructure to ensure enough CPU and memory were available for a migration to complete.
- The service connecting to the database often experiences unpredictable bursts in throughput. We frequently found ourselves needing to increase IOPS on our RDS instance to ensure that queries were not queued for execution.
- RDS does not have adaptive capacity features, so vertically scaling the RDS instance requires manual intervention by our platform team.
To solve these issues, our team decided to move the database from MySQL to AWS DynamoDB. DynamoDB provided the following benefits to us over MySQL:
- The service’s data model and throughput needs are better suited to a NoSQL database. For example, the service does not require transactions, the data model is not hierarchical, and most queries are executed via the primary key.
- Data that resides in very large MySQL tables can be spread across multiple DynamoDB partitions. This not only gives us higher availability guarantees but also lets us maintain consistent latency as our data set grows.
- The auto-scaling and adaptive capacity features of DynamoDB remove the operational burden from our team and ensure that we only pay for the capacity we use.
One of the key challenges that drove our migration architecture was designing a solution with zero downtime. As this service powers the foundation of many of our products, we couldn’t declare a long maintenance window for the migration, as that could lead to a significant loss of revenue. Our migration needed to be continuous and completed in incremental stages to minimize disruption for our customers.
Actions taken
Moving the service to the new database
Before we could perform the actual data migration, we needed to move all read/write paths in the service to work with the new database. This proved to be challenging, as the business logic and API endpoints were highly coupled to RDBMS semantics such as atomic batch processing. To remove the coupling, we introduced new layers into the API that would allow us to switch from one database to the other without breaking existing API contracts. Since we could not switch all of the read/write paths over at once, we also built feature flags into these layers. Once the data migration pipeline was set up, we would use these flags to incrementally cut reads and writes over to the new database. They also gave us a way to quickly revert back to the old database.
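To make this concrete, here is a minimal sketch in Go of what such a flag-driven layer could look like. The `ProjectStore` interface, `FlagClient`, record type, and flag names are hypothetical stand-ins; the article does not describe the service’s actual contracts or flag provider.

```go
// A minimal sketch of the abstraction layer described above. All names here
// are illustrative assumptions, not the service's real types.
package storage

import "context"

// ProjectRecord stands in for whatever entity the service stores.
type ProjectRecord struct {
	ID   string
	Name string
}

// ProjectStore hides which database backs a given read or write path.
type ProjectStore interface {
	Get(ctx context.Context, id string) (*ProjectRecord, error)
	Put(ctx context.Context, rec *ProjectRecord) error
}

// FlagClient abstracts the feature-flag provider.
type FlagClient interface {
	BoolFlag(ctx context.Context, key string) bool
}

// SwitchingStore routes each call to MySQL or DynamoDB based on a flag,
// letting individual read/write paths be cut over (and reverted) independently.
type SwitchingStore struct {
	MySQL    ProjectStore
	DynamoDB ProjectStore
	Flags    FlagClient
}

func (s *SwitchingStore) Get(ctx context.Context, id string) (*ProjectRecord, error) {
	if s.Flags.BoolFlag(ctx, "read-projects-from-dynamodb") {
		return s.DynamoDB.Get(ctx, id)
	}
	return s.MySQL.Get(ctx, id)
}

func (s *SwitchingStore) Put(ctx context.Context, rec *ProjectRecord) error {
	if s.Flags.BoolFlag(ctx, "write-projects-to-dynamodb") {
		return s.DynamoDB.Put(ctx, rec)
	}
	return s.MySQL.Put(ctx, rec)
}
```

Because the API only ever talks to the interface, flipping a flag changes the backing database without touching any endpoint code.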
Designing data migration for zero downtime
To have as little downtime as possible, we chose to leverage stream processing to continuously replicate data from MySQL to DynamoDB. Luckily, MySQL exposes its binary log, which records each individual INSERT, UPDATE, and DELETE made to a table; these row-level changes are known as Change Data Capture (CDC) events. To set up the stream, we used AWS Database Migration Service (DMS) to read the log and send each event to a highly partitioned Kafka topic. Once in Kafka, an ETL service consumes these events and maps them to the new data model in DynamoDB. Using Kafka also gave us the power to reprocess the event stream from any point in the case where we found bugs in our ETL logic and needed to rebuild our DynamoDB table.
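Below is a simplified sketch, in Go, of what such an ETL consumer could look like: it reads CDC events from the Kafka topic and applies them to DynamoDB. The event shape, topic and table names, and string-only attribute mapping are illustrative assumptions rather than the actual schema used in the migration.

```go
// A simplified ETL consumer sketch: Kafka CDC events in, DynamoDB writes out.
// The cdcEvent shape, topic, and table names are assumptions for illustration.
package main

import (
	"context"
	"encoding/json"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/dynamodb"
	"github.com/aws/aws-sdk-go-v2/service/dynamodb/types"
	"github.com/segmentio/kafka-go"
)

// cdcEvent is an assumed JSON shape for a row-level change event.
type cdcEvent struct {
	Operation string            `json:"operation"` // "insert", "update", or "delete"
	Data      map[string]string `json:"data"`
}

func main() {
	ctx := context.Background()

	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatal(err)
	}
	ddb := dynamodb.NewFromConfig(cfg)

	reader := kafka.NewReader(kafka.ReaderConfig{
		Brokers: []string{"kafka:9092"},
		Topic:   "mysql.cdc.projects", // hypothetical topic name
		GroupID: "dynamodb-etl",
	})
	defer reader.Close()

	for {
		msg, err := reader.ReadMessage(ctx)
		if err != nil {
			log.Fatal(err)
		}

		var ev cdcEvent
		if err := json.Unmarshal(msg.Value, &ev); err != nil {
			log.Printf("skipping malformed event: %v", err)
			continue
		}

		key := map[string]types.AttributeValue{
			"pk": &types.AttributeValueMemberS{Value: ev.Data["id"]},
		}

		switch ev.Operation {
		case "insert", "update":
			// Map every column onto a DynamoDB attribute and upsert the item.
			item := map[string]types.AttributeValue{"pk": key["pk"]}
			for k, v := range ev.Data {
				item[k] = &types.AttributeValueMemberS{Value: v}
			}
			_, err = ddb.PutItem(ctx, &dynamodb.PutItemInput{
				TableName: aws.String("projects"),
				Item:      item,
			})
		case "delete":
			_, err = ddb.DeleteItem(ctx, &dynamodb.DeleteItemInput{
				TableName: aws.String("projects"),
				Key:       key,
			})
		}
		if err != nil {
			log.Fatalf("failed to apply %s event: %v", ev.Operation, err)
		}
	}
}
```

Because the mapping lives in our own consumer rather than inside DMS, a bad deploy can be fixed and the topic replayed from an earlier offset to rebuild the table.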
How-to of the actual migration
Once we integrated our service with DynamoDB and set up our continuous migration pipeline, the question was how we could switch over from MySQL. To do this, we developed a dual-read/write algorithm that leveraged feature flags. While still writing to MySQL, we incrementally toggled reads to go against DynamoDB. If we encountered issues in our DynamoDB queries, we could simply toggle the flag back to read from MySQL without incurring data loss.
Writes were treated somewhat differently. To migrate the write paths in the service, we would toggle flags to write to both MySQL and DynamoDB. At first, all writes would go to MySQL, while a write to DynamoDB was done in the background. This allowed us to verify that our writes to DynamoDB were functionally correct and tuned for performance. Once we gained enough confidence, we would toggle flags to write to DynamoDB first. Eventually, we reached a state where the service was solely reading from and writing to the new database. Finally, once there were no more CDC events being processed by our ETL service, we shut down the AWS Database Migration Service task.
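A minimal sketch of this dual-write behaviour, reusing the hypothetical `ProjectStore`, `ProjectRecord`, and `FlagClient` types from the earlier sketch, might look like the following; the flag names and error handling are again assumptions.

```go
// Dual-write sketch: one flag enables the shadow write, another controls
// which database is the primary (source of truth) for the write path.
package storage

import (
	"context"
	"log"
)

type DualWriteStore struct {
	MySQL    ProjectStore
	DynamoDB ProjectStore
	Flags    FlagClient
}

func (s *DualWriteStore) Put(ctx context.Context, rec *ProjectRecord) error {
	primary, secondary := s.MySQL, s.DynamoDB
	if s.Flags.BoolFlag(ctx, "dynamodb-is-primary") {
		primary, secondary = s.DynamoDB, s.MySQL
	}

	// The primary write must succeed; its result is what the caller sees.
	if err := primary.Put(ctx, rec); err != nil {
		return err
	}

	// The secondary write runs in the background so it cannot add latency or
	// surface errors to callers while the new write path is still being verified.
	if s.Flags.BoolFlag(ctx, "dual-write-enabled") {
		go func() {
			if err := secondary.Put(context.Background(), rec); err != nil {
				log.Printf("shadow write to secondary store failed: %v", err)
			}
		}()
	}
	return nil
}
```

Flipping "dynamodb-is-primary" swaps which database takes the authoritative write, which is the step described above once confidence in the new path was established.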
Lessons learned
- The migration project was a good lesson in the value of recognizing the quality attributes your architecture must satisfy. By focusing on service availability, data consistency, and testability, we were able to design a solution that worked best for our customers.
- The technical complexity of the process taught us the value of prototypes and incremental change. As we designed and spec'd out each part of the migration, we would spike a proof of concept in code to validate our assumptions and iterate toward our final design. More often than not, these prototypes ended up serving as reference implementations that less experienced engineers could use to ramp up quickly on DynamoDB.
- Prior to this project, our team had some experience building event-driven systems, but not at this scale nor in a database migration scenario. By leveraging the power of stream processing, we were able to test our migration logic with plain unit tests (a sketch of such a test follows this list), which would not have been possible had we written the ETL logic directly into AWS DMS. It also allowed us to scale different parts of the pipeline independently (i.e., the AWS DMS infrastructure vs. the ETL Kafka consumer).
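As an illustration of the kind of test this design enables: because the ETL mapping is just a function from a CDC event to a DynamoDB item, it can be exercised without any Kafka or AWS infrastructure. The `mapProjectEvent` function and event shape below are assumptions for illustration, not the actual migration code.

```go
// A hypothetical unit test for a pure CDC-to-DynamoDB mapping function.
package etl

import (
	"testing"

	"github.com/aws/aws-sdk-go-v2/service/dynamodb/types"
)

type cdcEvent struct {
	Operation string
	Data      map[string]string
}

// mapProjectEvent converts a row-level CDC event into a DynamoDB item.
func mapProjectEvent(ev cdcEvent) map[string]types.AttributeValue {
	item := make(map[string]types.AttributeValue, len(ev.Data))
	for k, v := range ev.Data {
		item[k] = &types.AttributeValueMemberS{Value: v}
	}
	return item
}

func TestMapProjectEvent(t *testing.T) {
	ev := cdcEvent{
		Operation: "insert",
		Data:      map[string]string{"id": "42", "name": "Design review"},
	}

	item := mapProjectEvent(ev)

	got, ok := item["id"].(*types.AttributeValueMemberS)
	if !ok || got.Value != "42" {
		t.Fatalf("expected id attribute %q, got %#v", "42", item["id"])
	}
}
```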