Canva’s Live Database Migration: From MySQL to DynamoDB
Online design service Canva has an obscene amount of media. In 2017, Canva’s media store held more than 25 billion user-uploaded media items, with another 50 million uploaded daily by its 100 million monthly active users. This is the story of how Canva migrated that dataset from MySQL to Amazon Web Services‘ DynamoDB with zero downtime.
MySQL was doing just fine as Canva’s media store until around the one billionth media entry. After that, a few creative solutions implemented by the engineering team held the OG system together just long enough for the live migration to DynamoDB to complete.
A huge amount of preparation and testing went into the process. Five years later, the company published the migration details and basic system architecture diagrams in a blog post written by Canva Engineering Lead Robert Sharp and Canva software developer Jacky Chen.
Here are a few of the lessons they learned:
- Be lazy. Understand access patterns and migrate commonly accessed patterns first.
- Do it live. Gather as much information as possible by migrating live, identifying bugs and learning to use and run the technology.
- Test in production. Introduce checks to do it safely.
The Media Service
Canva’s microservices architecture is primarily comprised of resource-oriented services, meaning operations are managed around different resources (i.e., documents, folders, users and media). Each service exposes an API, has isolated persistence and is owned by a team of engineers. The media service, for example, manages all things media within Canva.
The media service manages operations on and encapsulates the state of media resources. Some of the attributes the service stores are the ID, the owning user, external source information and a reference to where each media file is stored. Detailed metadata is also stored in the media service.
Reads are more common than writes, and in the case of user-generated content, most reads are of recently created media.
The Starting Point
The media service started as a thin layer around MySQL hosted on Amazon Web Services Relational Database Service (AWS RDS), as did many other Canva services. It worked to a point, but scaling was needed for busier services. Vertical scaling came first, by moving to larger instances. Then horizontal scaling was tackled by introducing eventually consistent reads powered by MySQL read replicas.
But all good things come to an end, and the Canva developers say the cracks began to show when some schema change operations took days. Days! Then MySQL’s online DDL operations caused performance degradation so much so that Canva couldn’t perform them while also serving user traffic.
GitHub’s online schema migration tool for MySQL, gh-ost, temporarily plugged one hole and allowed them to perform online schema migrations without user impact, but the problems were stacking up:
- Hard limits on MySQL 5.6’s replication rate imposed a ceiling on the rate of writes to the read replicas.
- Feature releases were blocked because schema migrations took upwards of six weeks even with the use of gh-ost.
- Latencies were higher because of the increase in their EBS volume size.
- Downtime was now required for instance restarts and version upgrades, which could no longer be performed while serving normal production traffic.
Keeping MySQL Afloat
Something had to change with Canva now approaching its one billionth media, a number that was increasing exponentially with no signs of slowing down. But there were two challenges — making the current system work while a new system was found and migrated to, and finding a new system that would work for Canva’s ever-expanding needs.
With a little work, MySQL kept on going. Some of the engineering changes made were:
- Metadata, the most commonly modified part of the schema, was migrated into a JSON column with its schema managed by the media service itself.
- Media was imported differently to reduce the number of metadata updates.
- Lock contention and joins were reduced by denormalizing tables.
- Repeated content (i.e., S3 bucket names) was removed or encoded into a shorter representation.
The data was sharded. The new sharding solution was optimized for lookups by ID, the most common request, but required inefficient scatter-gather queries for less common requests such as listing all media owned by a user. The purpose of this was to avoid the 2TB ext3 table file size limit, avoid the replication throughput ceiling and improve performance for users.
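The trade-off described above can be sketched in a few lines. This is a hypothetical illustration, not Canva's actual routing code: the shard count, hash function and `query_by_user` method are all assumptions.

```python
import hashlib

NUM_SHARDS = 16  # hypothetical shard count; Canva's real topology isn't published


def shard_for(media_id: str) -> int:
    """Lookups by ID, the common case, hash straight to a single shard."""
    digest = hashlib.sha256(media_id.encode()).digest()
    return digest[0] % NUM_SHARDS


def media_by_user(user_id: str, shards) -> list:
    """Less common queries (e.g. all media owned by a user) can't be routed
    by the shard key, so they must scatter to every shard and gather results."""
    results = []
    for shard in shards:
        results.extend(shard.query_by_user(user_id))  # assumed shard interface
    return results
```

The by-ID path touches one shard; the by-user path touches all of them, which is exactly why those queries became inefficient.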
The team was also investigating and prototyping different long-term solutions. The image below shows the contenders and how they stacked up. There was a short runway for the new system, and the team had a strong preference for managed solutions.
DynamoDB was selected because of Canva’s experience working with the technology on less complex workloads. Before fully integrating though, it needed testing with the media service, a much more complex workload.
Canva decided on a live migration, opting for zero impact to users, zero cutover time and zero downtime. The migration covered the full database — all existing media plus newly created and updated media — moving to DynamoDB. The team wanted to shed load from the MySQL cluster as soon as possible.
AWS SQS message queues were used to orchestrate the migration. Replication of writes was prioritized over reads, since the intention was to eventually serve consistent reads from DynamoDB. Messages for creates and updates were placed in a high-priority queue, and messages for reads in a low-priority queue.
A message identifying whether a particular media item was created, updated or read was enqueued. A worker instance then processed the message, reading the current state from the MySQL primary and updating DynamoDB if necessary. This allowed message processing to be retried, paused or slowed down. Worker instances read from the high-priority queue until all messages were consumed before moving to the low-priority queue.
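The worker behavior above can be sketched with in-memory queues. This is a simplified model under stated assumptions — the real system used SQS and separate worker instances, and the callable names here are invented for illustration:

```python
from collections import deque


class MigrationWorker:
    """Sketch of the priority scheme: drain the high-priority queue
    (creates/updates) completely before touching low-priority reads."""

    def __init__(self, mysql_read, dynamo_write):
        self.high = deque()  # create/update replication messages
        self.low = deque()   # read-triggered replication messages
        self.mysql_read = mysql_read      # reads current state from the MySQL primary
        self.dynamo_write = dynamo_write  # upserts the record into DynamoDB

    def enqueue(self, media_id, event):
        target = self.high if event in ("create", "update") else self.low
        target.append(media_id)

    def step(self):
        """Process one message; a failed message could simply be re-enqueued."""
        queue = self.high if self.high else self.low
        if not queue:
            return False
        media_id = queue.popleft()
        state = self.mysql_read(media_id)  # MySQL is the source of truth mid-migration
        if state is not None:
            self.dynamo_write(media_id, state)
        return True
```

Because each worker re-reads the current state from MySQL rather than trusting the message payload, replaying or reordering messages is safe.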
Canva also implemented a scanning process to shed load from the MySQL cluster. It scanned through the cluster starting with the most recently created media, placing a message in the low-priority queue to replicate each media item. Backpressure ensured the synchronization process only advanced while the low-priority queue was approximately empty.
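The backpressure rule is simple to express: only advance the scan while the low-priority queue is nearly drained. A minimal sketch, assuming a newest-first iterator and a queue-depth probe (both names are hypothetical):

```python
import time


def backfill_scan(scan_newest_first, enqueue_low, queue_depth,
                  max_depth=10, poll_seconds=0.0):
    """Walk media newest-first, enqueueing one replication message per item,
    but pause whenever the low-priority queue is not approximately empty.
    max_depth and the poll interval are assumptions for illustration."""
    for media_id in scan_newest_first():
        while queue_depth() > max_depth:  # backpressure: let workers catch up
            time.sleep(poll_seconds)
        enqueue_low(media_id)
```

Starting with the most recent media front-loads the benefit, since most reads target recently created content.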
It wasn’t as straightforward as just adding to DynamoDB. Results from MySQL were compared to the new DynamoDB media service implementation via a dual-read comparison process with the purpose of testing the replication process. Some bugs were found and subsequently fixed.
The team met their overall goal of eventually serving consistent reads from DynamoDB, with MySQL only serving media that hadn’t been replicated yet. Because media was replicated individually, read requests that didn’t identify media by ID, such as a query to find all media owned by a user, still had to be served by MySQL.
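The resulting read path can be summarized as a DynamoDB-first lookup with a MySQL fallback for not-yet-replicated items. A minimal sketch, with hypothetical reader callables:

```python
def get_media(media_id, read_dynamo, read_mysql):
    """Serve consistent reads from DynamoDB, falling back to MySQL
    only for media that hasn't been replicated yet."""
    item = read_dynamo(media_id)
    if item is not None:
        return item
    return read_mysql(media_id)  # not yet replicated; MySQL still has it
```

As the backfill scan catches up, the fallback fires less and less often, steadily shedding load from the MySQL cluster.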
Zero Downtime Thanks to Testing
The biggest risk during the migration was switching the writes to DynamoDB. The new service code required to handle creation and updates included transactional and conditional writes to guarantee the same contracts as the previous implementation. Testing and methodical preparation made it possible.
The team migrated the existing integration tests for media update requests to test against both migrated media and media created directly on DynamoDB. They also migrated the remainder of the integration tests to run against the DynamoDB service implementation, running these tests alongside the MySQL implementation tests.
The new implementation was tested in their local development environment and against their end-to-end test suite.
The developers wrote a run book for the cutover using their flag system to allow them to switch reads back to MySQL within seconds if required. They rehearsed the run book as they rolled out the change through their development and staging environments.
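A runtime flag that flips the read path can be sketched as follows. The flag name and config-store shape are assumptions — Canva's post only says their flag system allowed reverting reads to MySQL within seconds:

```python
class ReadPathFlag:
    """Sketch of a runtime flag guarding the read path, so operators can
    flip reads back to MySQL within seconds without a deploy."""

    def __init__(self, flags):
        self.flags = flags  # e.g. a dynamic config store polled at runtime

    def read(self, media_id, read_dynamo, read_mysql):
        # Hypothetical flag name for illustration.
        if self.flags.get("media.reads.dynamodb", False):
            return read_dynamo(media_id)
        return read_mysql(media_id)
```

Because the flag is consulted on every request, rollback is just a config change — which is what makes rehearsing the run book in staging meaningful.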
Their cutover in production was seamless, with no downtime, no errors and a significant improvement to media service latency, as shown in the following diagram.
This migration took place in 2017, so there is now a much wider lane of production experience to draw upon when looking back on it. So was DynamoDB the right choice? Well, this writer’s interpretation was: short answer, yes; long answer, yes(?).
Canva’s user base has more than tripled since the original migration. The pros: DynamoDB excelled at autoscaling and costs less than the AWS RDS clusters did. The challenges: schema changes and backfills are “far more complicated” to perform, and the ability to run ad hoc SQL queries on a MySQL replica was lost (Canva is performing them via an alternate solution).
Also, composite global secondary indexes are needed to support existing access patterns, and those require manually concatenating attributes together. This is likely neither a pro nor a challenge, since new access patterns aren’t frequent: at this stage of Canva’s growth, the structure of core media metadata is relatively stable.
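"Manually concatenating attributes" means building a single synthetic key attribute because a DynamoDB index keys on one partition attribute (plus an optional sort attribute). A hypothetical example — the attribute names, separator and access pattern are illustrative, not Canva's schema:

```python
def composite_key(*attrs, sep="#"):
    """Concatenate several attributes into one GSI key attribute,
    since a DynamoDB index keys on a single attribute per role."""
    return sep.join(str(a) for a in attrs)


# e.g. an item indexed for "all media owned by a user, newest first":
item = {
    "id": "MEDIA123",
    "gsi_pk": composite_key("USER", "42"),                       # partition key
    "gsi_sk": composite_key("2017-06-01T00:00:00Z", "MEDIA123"), # sort key
}
```

The cost of this pattern is exactly the one the article notes: adding a new access pattern later means backfilling a new synthetic attribute across every existing item.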
Would Canva choose DynamoDB again if they were facing the same problem today? The team would consider mature hosted “NewSQL” products like Spanner or CockroachDB. Very interesting.