Zero Downtime Migration

Dev Processes

19 May, 2021

Shawn Hartsell

Lead Software Engineer at InVision

Shawn Hartsell, Lead Software Engineer at InVision, explains how he helped migrate a mission-critical database with zero downtime.

Problem

We had a large MySQL database in AWS RDS that was mission-critical to the business. As the database grew over time, we ran into the following scaling limits:

  • For very large tables, schema migrations became non-trivial. We would often need to set up special infrastructure to ensure that enough CPU and memory were available for a migration to complete.
  • The service connecting to the database often experienced unpredictable bursts in throughput. We often found ourselves needing to increase IOPS on our RDS instance to ensure that queries were not queued for execution.
  • RDS does not have adaptive capacity features, so vertically scaling the RDS instance requires manual intervention by our platform team.

To solve these issues, our team decided to move the database from MySQL to AWS DynamoDB. DynamoDB provided the following benefits to us over MySQL:

  • The service’s data model and throughput needs are better suited to a NoSQL database. For example, the service does not require transactions, the data model is not hierarchical, and most queries are executed via the primary key.
  • Data that resides in very large MySQL tables can be partitioned across multiple DynamoDB nodes. This not only gives us higher availability guarantees but allows us to have consistent latency as our data set grows.
  • The auto-scaling and adaptive capacity features of DynamoDB remove the operational burden from our team and ensure that we only pay for the capacity we use.

One of the key challenges that drove our migration architecture was designing a solution with zero downtime. As this service powers the foundation of many of our products, we couldn’t declare a long maintenance window for the migration as that could lead to a significant loss of revenue. Our migration needed to be continuous and completed in incremental stages to provide the least disruption for our customers.

Actions taken

Moving the service to the new database

Before we could perform the actual data migration, we needed to move all read/write paths in the service to work with the new database. This proved to be challenging because business logic and API endpoints were tightly coupled to RDBMS semantics such as atomic batch processing. To remove the coupling, we introduced new layers into the API that would allow us to switch from one database to the other without breaking existing API contracts. Since we could not switch all of the read/write paths over at once, we also used feature flags as a part of these layers. Once the data migration pipeline was set up, we would use these flags to incrementally cut over reads and writes to the new database. The flags also gave us a way to quickly revert to the old database.
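
As a rough illustration of such a layer, here is a minimal Python sketch. It is not the team's actual code: the names DocumentStore, InMemoryStore, FlaggedStore, and the flag names read_from_new and write_to_new are all hypothetical, and in-memory dictionaries stand in for the MySQL and DynamoDB clients. The point is only that callers depend on one storage contract while flags decide which backend serves each call.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Protocol


class DocumentStore(Protocol):
    """Storage contract the API layer depends on (hypothetical)."""

    def get(self, key: str) -> Optional[dict]: ...

    def put(self, key: str, item: dict) -> None: ...


@dataclass
class InMemoryStore:
    """Stand-in for a MySQL-backed or DynamoDB-backed implementation."""

    name: str
    rows: Dict[str, dict] = field(default_factory=dict)

    def get(self, key: str) -> Optional[dict]:
        return self.rows.get(key)

    def put(self, key: str, item: dict) -> None:
        self.rows[key] = item


@dataclass
class FlaggedStore:
    """Routes each call to the old or the new store based on feature flags."""

    old: DocumentStore
    new: DocumentStore
    flags: Dict[str, bool]  # assumed to be refreshed by a flag service

    def _pick(self, flag: str) -> DocumentStore:
        # A missing flag means "stay on the old database".
        return self.new if self.flags.get(flag, False) else self.old

    def get(self, key: str) -> Optional[dict]:
        return self._pick("read_from_new").get(key)

    def put(self, key: str, item: dict) -> None:
        self._pick("write_to_new").put(key, item)
```

Because FlaggedStore satisfies the same contract as the stores it wraps, endpoints do not change when a flag flips, and flipping it back is the revert path.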

Designing data migration for zero downtime

To have as little downtime as possible, we chose to leverage stream processing to continuously replicate data from MySQL to DynamoDB. Luckily, MySQL exposes a binary log (binlog) that records each individual INSERT, UPDATE, and DELETE made to a table; these records are the source of Change Data Capture (CDC) events. To set up the stream, we used the AWS Database Migration Service (DMS) to read the log and send each event to a highly partitioned Kafka topic. Once in Kafka, an ETL service consumed these events and mapped them to the new model in DynamoDB. Using Kafka also gave us the power to reprocess the event stream from any point if we found bugs in our ETL logic and needed to rebuild our DynamoDB table.
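
The ETL step can be pictured as a pure function from a CDC event to a change on the target table. The sketch below is an assumption-heavy illustration, not the real pipeline: the event shape (operation and data keys), the column names id, title, and owner_id, and the DOC# key format are all hypothetical, and a plain dictionary stands in for the DynamoDB table. It shows why upsert-style handling makes reprocessing the stream safe.

```python
from typing import Dict, Iterable, Optional

Table = Dict[str, dict]  # stand-in for the DynamoDB table, keyed by "pk"


def item_key(row: dict) -> str:
    """Derive the item key from the relational primary key (hypothetical format)."""
    return f"DOC#{row['id']}"


def to_item(row: dict) -> dict:
    """Map a relational row to the new item model (hypothetical columns)."""
    return {
        "pk": item_key(row),
        "title": row["title"],
        "owner_id": row["owner_id"],
    }


def apply_event(table: Table, event: dict) -> None:
    """Apply one CDC event; inserts and updates are both treated as upserts."""
    operation = event["operation"]
    row = event["data"]
    if operation in ("insert", "update"):
        table[item_key(row)] = to_item(row)
    elif operation == "delete":
        # Deleting a missing item is a no-op, so replays do not fail.
        table.pop(item_key(row), None)
    else:
        raise ValueError(f"unknown CDC operation: {operation!r}")


def replay(events: Iterable[dict], table: Optional[Table] = None) -> Table:
    """Apply events in order, optionally on top of an existing table."""
    table = {} if table is None else table
    for event in events:
        apply_event(table, event)
    return table
```

Since every event overwrites or removes a whole item, replaying the same ordered stream onto an already populated table converges to the same final state, which is what makes a rebuild from an earlier offset practical.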

How-to of the actual migration

Once we integrated our service with DynamoDB and set up our continuous migration pipeline, the question was how we could switch over from MySQL. To do this, we developed a dual-read/write algorithm that leveraged feature flags. While still writing to MySQL, we incrementally toggled reads to go against DynamoDB. If we encountered issues in our DynamoDB queries, we would simply toggle the flag to read against MySQL without incurring data loss.
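
The article does not say how reads were toggled incrementally. One common way to do it, shown here purely as an assumption, is a percentage rollout that assigns each key to a stable bucket. The function names bucket, reads_from_new, and read, and the rollout_percent parameter, are hypothetical.

```python
import hashlib
from typing import Callable, Optional

Getter = Callable[[str], Optional[dict]]


def bucket(key: str) -> int:
    """Map a key to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def reads_from_new(key: str, rollout_percent: int) -> bool:
    """True if this key's reads should go to the new database."""
    return bucket(key) < rollout_percent


def read(key: str, rollout_percent: int, old_get: Getter, new_get: Getter) -> Optional[dict]:
    """Serve the read from whichever database the rollout selects."""
    source = new_get if reads_from_new(key, rollout_percent) else old_get
    return source(key)
```

Setting rollout_percent to 0 sends every read back to MySQL. Because writes still land in MySQL during this phase, that rollback loses no data.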

Writes were treated somewhat differently. To migrate write paths in the service, we would toggle flags to write to both MySQL and DynamoDB. At first, all writes would go to MySQL, while a write to DynamoDB was done in the background. This allowed us to verify that our writes to DynamoDB were functionally correct and tuned for performance. Once we gained enough confidence, we would toggle flags to write to DynamoDB first. Eventually, we would reach a state where the service was solely reading from and writing to the new database. Finally, once there were no more CDC events being processed by our ETL service, we shut down the AWS Database Migration Service task.
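
The write phases described above can be sketched as a small state machine. This is an illustrative guess at the shape of the logic, not the production algorithm: WriteMode, dual_write, and shadow_errors are hypothetical names, and the secondary write runs inline here for simplicity, whereas the article describes it as happening in the background.

```python
from enum import Enum
from typing import Callable, List, Tuple

Putter = Callable[[str, dict], None]


class WriteMode(Enum):
    OLD_ONLY = "old_only"                # before the cutover starts
    OLD_PRIMARY_SHADOW_NEW = "old_new"   # MySQL first, DynamoDB shadowed
    NEW_PRIMARY_SHADOW_OLD = "new_old"   # DynamoDB first, MySQL shadowed
    NEW_ONLY = "new_only"                # migration complete


def dual_write(
    mode: WriteMode,
    key: str,
    item: dict,
    old_put: Putter,
    new_put: Putter,
    shadow_errors: List[Tuple[str, Exception]],
) -> None:
    """Write to the primary store; shadow-write to the other without failing the request."""
    if mode is WriteMode.OLD_ONLY:
        old_put(key, item)
        return
    if mode is WriteMode.NEW_ONLY:
        new_put(key, item)
        return
    if mode is WriteMode.OLD_PRIMARY_SHADOW_NEW:
        primary, shadow = old_put, new_put
    else:
        primary, shadow = new_put, old_put
    primary(key, item)  # a primary failure propagates to the caller
    try:
        shadow(key, item)
    except Exception as exc:
        # Shadow failures are recorded for later inspection, never surfaced.
        shadow_errors.append((key, exc))
```

Recording shadow failures instead of raising them is what lets the secondary write path be verified and tuned without affecting customers.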

Lessons learned

  • The migration project was a good lesson in the value of recognizing the quality attributes your architecture must satisfy. By focusing on service availability, data consistency, and testability, we were able to design a solution that worked best for our customers.
  • The technical complexity of the process taught us the value of prototypes and incremental change. As we designed and specced each part of the migration, we would spike a proof of concept in code to validate our assumptions and to iterate towards our final design. More often than not, these prototypes ended up serving as a reference implementation that less experienced engineers could use to ramp themselves up quickly on DynamoDB.
  • Prior to this project, our team had some experience building event-driven systems but not at this scale nor in a database migration scenario. By leveraging the power of stream processing, we were able to easily test our migration logic with unit tests, which would have been harder had we written the ETL logic directly into AWS DMS. It also allowed us to scale different parts of the pipeline independently (e.g., the AWS DMS infrastructure vs. the ETL Kafka consumer).
