How DoorDash Migrated from Aurora Postgres to CockroachDB
NEW YORK — If your customer-facing application fails, the only silver lining might be if your direct competitors suffer the same calamity at the same time.
That day arrived for DoorDash on Friday, April 17, 2020 — roughly a month into the global pandemic lockdown that caused demand for online food ordering services like DoorDash to explode.
The company’s main database cluster peaked at 1.636 million QPS. Buckling under the load of unprecedented demand, it “crashed and burned,” resulting in hours of downtime, Alessandro Salvatori, a principal engineer at DoorDash, told an audience at RoachFest23, Cockroach Labs’ user conference, held here in October.
The only bright side? “Our two main competitors also went down,” Salvatori said, to booming laughter from his audience.
But it wasn’t funny, he noted, to the establishments then relying on DoorDash and similar online services to keep them afloat during the pandemic: “Put yourselves in the shoes of the poor restaurant owners … They didn’t have any bookings, any on-premises business — and they didn’t have online orders, either.”
DoorDash took the lessons it learned from the incident as it transitioned from Amazon Aurora Postgres to CockroachDB. At RoachFest, Salvatori — and his colleague, Michael Czabator, a DoorDash staff engineer — told how their company made the move, and how it helped make the business’s flagship application more resilient and consistent, improving availability for its customers.
A ‘Poor Man’s Solution,’ Until a Crisis Hits
When DoorDash began a decade ago, it was built around a Django monolith that talked to a single Aurora database cluster, which offered a single writer. By the time Salvatori joined the company five years ago, it had broken some parts of the monolith into microservices and added a few more database clusters.
“But their destiny was already written,” Salvatori said. “Due to our exponential growth, they soon would be in the same spot as our main databases.”
Until the monolith was broken up, it offered a single view of the toll that demand for the application was taking on the databases. But once that monolith was broken into microservices, that visibility would disappear.
“Our biggest enemy was the single primary architecture of our database,” Salvatori said. “And our North Star would be to move to a solution that offered multiple writers.”
In the meantime, the DoorDash team adopted a “poor man’s solution” approach to dealing with its overmatched database architecture, Salvatori told the RoachFest audience: building vertical federation of tables while not blocking microservice extractions.
In this game of “whack-a-mole,” he said, “Different tables would be able to get their own single writer and therefore scale a little bit and allow us to keep the lights on for a little bit longer. But we needed to take steps toward limitless horizontal scalability.”
CockroachDB, a distributed SQL database management system, seemed like the right answer. The April 17 disaster brought new urgency to the transition plans.
Building a Tool for Extracting Data
To solve the immediate crisis, DoorDash’s engineers made use of “escape hatches” in the core object-relational mapper (ORM) and hacks “deep down in the guts of the Django ORM,” as Salvatori put it, to gain some short-term headroom. Then, the team quickly built a prototype tool, ready by the end of that month, to perform table extractions.
As a guinea pig, he said, DoorDash used its critical identity table: “If that didn’t work, no consumer, no Dasher, no merchant would be able to log in to our website or our mobile apps. The name of the game was to be able to revert back to the previous source of truth if everything didn’t go as planned.”
Four unsuccessful attempts followed; on the fifth try, the cutover to a new cluster was successful. Over the next 33 days, the team extracted seven databases and 34 tables, moving them to seven new database clusters. The transition brought the total QPS on the main database cluster down to 100,000.
The motivation for investing in the purpose-built extraction tool — which flowered into a collection of tools — was clear from a business standpoint. Without it, “we would have had to throw several bodies at the problem,” Salvatori said, citing a previous example when it took six engineers six months to extract a single table from the legacy database.
DoorDash’s bespoke extraction tool, he said, was designed not only to minimize labor but also to be constantly improved and enhanced. It was also given a UI that gave users visibility into, and the ability to edit, the SQL that would be generated for an extraction.
It was also designed to be operated safely across production databases, with an emergency revert function that allowed the data to switch back to the legacy database in as few clicks as possible.
The extraction tool, he said, also had other use cases, such as local migrations, repacking a table in a suspendable way, performing “hitless” upgrades (without degrading service or performance) — and extracting tables to CockroachDB.
A Strategy for Database Migration
As DoorDash began to undertake the migration to CockroachDB, Salvatori said, it prepared a strategy to avoid the common pitfalls of a traditional database migration. Among the pillars of that strategy:
- Tail changes directly from information in database tables, using the custom tool. “We had a lot of clusters on Aurora 9.6 that had no logical replication,” Salvatori said. Even though logical replication was available on Aurora 10, “that comes with a 40% overhead on our single primary, which was our biggest pain point.” The team also wanted the ability to suspend the replication if it needed to deal with a problem. Without tailing changes in this manner, he said, it would “run the risk of the OOM killer kicking in, preventing the vacuum from making progress and eventually running into transaction wraparound issues.”
- Use pulls instead of pushes. The team wanted to be able to suspend a migration and resume it later.
- Maintain a feedback loop. “We wanted to be able to slow down if the health of either the source or the destination database is showing some cracks,” Salvatori said.
- Spread out batches. Use an auto-tunable batch size, spreading activity over ranges of the tables.
- Keep it simple. The team maintained a single source of truth: before extraction, the source of truth would be the “old” database; after extraction, it would be the new database.
Also, it didn’t replay every single change, but rather captured only what had been inserted, updated or deleted, fast-forwarding to the latest value.
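The pillars above (pull-based tailing, suspendability, a health feedback loop, auto-tuned batches, and fast-forwarding each row to its latest value) can be sketched in a few lines. This is a minimal illustration under assumed interfaces, not DoorDash’s actual tool; every class and function name here is hypothetical.

```python
# Hypothetical sketch of a pull-based, suspendable change-tailing loop.
from dataclasses import dataclass

@dataclass
class Change:
    key: int        # primary key of the changed row
    value: str      # latest column values ("DELETE" marks a deletion)

class Extractor:
    def __init__(self, source_log, batch_size=2, max_batch=64):
        self.source_log = source_log   # ordered change log we pull from
        self.cursor = 0                # resume point: makes the job suspendable
        self.batch_size = batch_size   # auto-tuned between pulls
        self.max_batch = max_batch
        self.applied = {}              # destination state (key -> latest value)

    def pull_batch(self):
        """Pull (not push) the next slice of changes from the source."""
        batch = self.source_log[self.cursor:self.cursor + self.batch_size]
        self.cursor += len(batch)
        return batch

    def apply(self, batch):
        """Fast-forward each key to its latest value instead of
        replaying every intermediate change."""
        latest = {}
        for change in batch:
            latest[change.key] = change.value   # later changes win
        for key, value in latest.items():
            if value == "DELETE":
                self.applied.pop(key, None)
            else:
                self.applied[key] = value

    def tune(self, destination_healthy):
        """Feedback loop: back off if the destination shows cracks."""
        if destination_healthy:
            self.batch_size = min(self.batch_size * 2, self.max_batch)
        else:
            self.batch_size = max(self.batch_size // 2, 1)

    def run(self, health_probe):
        while self.cursor < len(self.source_log):
            self.apply(self.pull_batch())
            self.tune(health_probe())

# Simulated run: two updates to key 1, an insert-then-delete of key 2.
log = [Change(1, "a"), Change(2, "b"), Change(1, "c"),
       Change(2, "DELETE"), Change(3, "d")]
ex = Extractor(log)
ex.run(lambda: True)
print(ex.applied)  # {1: 'c', 3: 'd'}
```

Because the cursor is ordinary state rather than a server-side push subscription, the job can be paused at any point and resumed later from the same position.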
When the cutover to CockroachDB occurred, it took three attempts to succeed. Salvatori showed the RoachFest audience a video screen capture of the moments leading up to the successful migration — and the triumphant aftermath.
“This is literally a lossless cutover, as far as I can tell,” one engineer said in the video, after the migration worked.
Handling 1.9 Petabytes of Data
So how is it working? Czabator, who joined DoorDash in February (from Cockroach Labs) and works on its storage infrastructure team, gave the RoachFest audience an update.
DoorDash currently runs CockroachDB in a single AWS region, spread across three availability zones (AZs) in that region. Since the fall of 2022, it has seen the number of nodes it runs increase by 55% (from about 1,500 to more than 2,300). However, the amount of data it handles has more than doubled, from just under one petabyte to 1.9 petabytes.
The online food ordering company’s teams use a self-service console to manage CockroachDB users and to handle schema management, index operations and changefeeds.
DoorDash has also developed a control plane for automation and operational tasks within CockroachDB and other technologies, Czabator said. The control plane leverages Argo Workflows for things like repaving (replacing all the nodes in a cluster with new ones), in-place upgrades, scaling up or down, operations monitoring, managing settings and users, and running health checks.
“This has been a really big win for us,” he said. “Historically some of these operations have been vectors for incidents. When you have a human involved, things can get messed up. Sometimes there’s too much data, too much to interpret.”
The time saved has been significant, he said. For instance, manually repaving all nodes in production would take 100 days of work. With the automation, it takes only 10.
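At its core, a repave of this sort is a loop that replaces one node at a time and gates each step on a health check, so no human has to babysit the rollout. The sketch below is a simplified illustration of that pattern, with hypothetical provision, decommission, and health hooks standing in for the real workflow steps.

```python
# Hypothetical sketch of an automated repave loop. Not DoorDash's
# actual control plane; all hooks here are illustrative stand-ins.

def repave(nodes, provision, decommission, healthy):
    """Replace each old node with a fresh one; abort on a failed health check."""
    for old in list(nodes):
        new = provision()          # bring up the replacement node first
        if not healthy(new):
            decommission(new)      # roll back the half-finished step
            raise RuntimeError(f"repave aborted: {new} failed health check")
        nodes.remove(old)
        decommission(old)          # retire the old node only once the new one serves
        nodes.append(new)
    return nodes

# Simulated run over a three-node cluster.
counter = iter(range(100))
cluster = ["node-a", "node-b", "node-c"]
result = repave(
    cluster,
    provision=lambda: f"node-new-{next(counter)}",
    decommission=lambda n: None,   # no-op in this simulation
    healthy=lambda n: True,
)
print(result)  # ['node-new-0', 'node-new-1', 'node-new-2']
```

The health-check gate is what turns a risky bulk operation into a sequence of small, individually revertible steps.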
Among other results:
- Scaling, using Amazon Web Services Auto Scaling groups, has also become easier. When CPU utilization crosses a certain threshold, three new nodes — one for each AZ — are automatically provisioned within two minutes, with traffic flowing to the new nodes.
- Over the last year, despite the addition of about 70 clusters, the volume of alerts the teams received about incidents declined.
- DoorDash has piloted using AWS EC2 m7g instances, powered by AWS Graviton processors, in production alongside its CockroachDB ARM build, seeing better query throughput and significant declines in latency: about 35% in average p999 latency and 45% in average p99 latency. Using EC2 m7g instances also costs DoorDash about 15% less than AWS EC2 m6i instances, which are powered by Intel processors.
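The CPU-triggered scale-out described in the first bullet above boils down to a simple decision rule: when average CPU crosses a threshold, add one node in each AZ. Here is a minimal sketch of that rule; the threshold value is an assumption, as the talk did not state one.

```python
# Hypothetical sketch of a per-AZ scale-out decision. The 70% trigger
# level is assumed for illustration, not taken from the talk.

CPU_THRESHOLD = 0.70

def scale_out(nodes_per_az, avg_cpu, threshold=CPU_THRESHOLD):
    """Return the new per-AZ node counts after one autoscaling decision."""
    if avg_cpu < threshold:
        return dict(nodes_per_az)                          # below threshold: no change
    return {az: n + 1 for az, n in nodes_per_az.items()}   # one new node per AZ

print(scale_out({"us-east-1a": 5, "us-east-1b": 5, "us-east-1c": 5}, 0.85))
# {'us-east-1a': 6, 'us-east-1b': 6, 'us-east-1c': 6}
```

Growing every AZ in lockstep keeps the cluster balanced across zones, which matters for a database that places replicas per AZ.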
One of DoorDash’s core values, Czabator said, is “1% better every day.”
That means constant iteration, he said: “One of the ways that manifests in our organization is by continually refining the alarms, alerts, the metrics we have around Cockroach to improve the service to our customers.”