Dolt, a Relational Database with Git-Like Cloning Features
Tim Sehn had wondered for years why there wasn’t a place on the internet to collaborate on data.
He had been vice president of engineering at Snapchat and was trying to figure out what to do next. It wasn’t until Brian Hendriks and Aaron Son, also from Snapchat, agreed to join him that the three set to work on the problem, producing Dolt, a relational database with Git-like features.
“I was like, you could rent a computer, you can rent a database, but you can’t rent access to the data in the database. There’s no easy way to do that,” Sehn said of the problem they set out to solve.
“We kind of had the idea to build an API marketplace, kind of like mashape or rapidapi. Now with unique kind of branch-merge capabilities like you find in source code, like in Git — branch-merge on top of APIs.
“We kind of realized APIs weren’t what people wanted in the new data world. So we went one layer deeper into the database itself and wanted to offer branch-merge on top of databases.”
Built on Noms
Of late, there has been an ongoing quest for a Git-like solution for data, and developing one has not been easy. The obstacles on this quest, … have had to do with the fact that data is nothing like code. With the advent of infrastructure-as-code technologies like Chef, Puppet, Terraform and OpenStack HEAT, infrastructure finally gained the ability to be versioned and managed on branches as code. This was possible because these technologies allowed infrastructure and its configurations to be represented as, well, code. This has not been the case with data.
Sehn, Hendriks and Son launched their Los Angeles-based company, Liquidata, in 2018. They released Dolt as open source a year ago. It takes its name from British slang for “idiot.” DoltHub, a cloud-based storage site for hosting Dolt databases, followed in September 2019. It contains public data sets such as Covid-19, census and income tax data by ZIP code.
“If you go to DoltHub, you will find a bunch of different data that’s open, free for use, you can clone it and get a functioning SQL database in three commands in less than five minutes, whereas doing that with other datasets takes at least an hour with other [technologies], like if you have a CSV and want to get it in a database,” Tim Sehn, the company’s CEO, said, calling it a better way for distributing CSVs, JSON documents and APIs on the internet. This fall, it’s adding election data.
Dolt is based on the Noms open source database, the brainchild of former Google engineers Aaron Boodman and Rafael Weinstein. Their company, Attic Labs, was bought out by Salesforce in early 2018. Noms, written in Go, lets users replicate data and edit it offline on multiple machines, then sync up the edits later.
Noms wasn’t designed to replace the popular databases that enterprises use, but to make it easier to write software that consumes and understands data and how it has changed, as I pointed out in an article about Noms a few years ago.
Where Git versions files, Dolt versions the tables of a SQL database. Data and schema are versioned together. The Liquidata team wanted to rely on users’ familiarity with SQL and GitHub to make Dolt easy to use.
Using Merkle Trees
Dolt uses a Merkle tree architecture that lets successive versions of a table share storage for the data they have in common.
“If you have a 50 million row table and you add a row, that we only incrementally store the additional row in order to provide versioning to it,” Sehn said. “In previous technologies, you’d have to make a copy, a whole other 50 million row copy, which would be really slow. And so this content addressing Merkle tree technology allows us to do that efficiently and provides basically full versioning into databases in a way that Git allows you to quickly and easily provide versioning of files.”
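The storage saving Sehn describes can be illustrated with a toy content-addressed store. This is a minimal sketch, not Dolt's or Noms' actual implementation: the `ChunkStore` class, the `commit_table` function and the fixed key-range buckets are all assumptions made for illustration, and the table is scaled down to 50,000 rows. Unlike a real engine, the sketch rehashes the whole table on every write; the point is only that chunks with unchanged content hash to the same address and are stored once.

```python
import hashlib
import json


class ChunkStore:
    """Toy content-addressed store: each chunk is saved under the hash of its bytes."""

    def __init__(self):
        self.chunks = {}

    def put(self, obj):
        data = json.dumps(obj, sort_keys=True).encode()
        address = hashlib.sha256(data).hexdigest()
        # Identical content always hashes to the same address, so it is stored once.
        self.chunks.setdefault(address, data)
        return address


BUCKET = 1000  # hypothetical leaf size: rows are grouped by primary-key range


def commit_table(store, rows):
    """Write {primary_key: row} as leaf chunks plus one root chunk; return the root address."""
    buckets = {}
    for key, row in rows.items():
        buckets.setdefault(key // BUCKET, {})[str(key)] = row
    # The root chunk lists the address of every leaf, so it changes when any leaf changes.
    leaf_addresses = {str(b): store.put(leaf) for b, leaf in buckets.items()}
    return store.put(leaf_addresses)


store = ChunkStore()
table = {i: {"name": f"row-{i}"} for i in range(50_000)}

root_v1 = commit_table(store, table)
chunks_after_v1 = len(store.chunks)

table[50_000] = {"name": "new row"}  # add a single row
root_v2 = commit_table(store, table)
new_chunks = len(store.chunks) - chunks_after_v1

print(chunks_after_v1, new_chunks, root_v1 != root_v2)
```

Both root addresses stay valid, so both versions of the table remain readable, yet the second version adds only the one leaf that changed and a new root rather than a second full copy.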
Dolt users create a local repository containing tables that can be read and updated using SQL. As in Git, writes are staged until the user issues a commit, at which point they are written to permanent storage. All changes to data and schema are stored in the commit log.
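The stage-then-commit workflow can be sketched in a few lines. The `Repo` class and its `write`, `add` and `commit` methods below are hypothetical names chosen to mirror Git's vocabulary; they are not Dolt's API, and the sketch keeps everything in memory rather than on disk.

```python
import copy
import hashlib
import json


class Repo:
    """Toy repository with a working set, a staging area and a commit log."""

    def __init__(self):
        self.working = {}  # table name -> {primary key -> row}; edits land here first
        self.staged = {}   # table snapshots selected for the next commit
        self.log = []      # commit log, oldest first

    def write(self, table, key, row):
        self.working.setdefault(table, {})[key] = row

    def add(self, table):
        # Stage a snapshot so that later edits do not leak into the commit.
        self.staged[table] = copy.deepcopy(self.working[table])

    def commit(self, message):
        parent = self.log[-1] if self.log else None
        tables = copy.deepcopy(parent["tables"]) if parent else {}
        tables.update(self.staged)
        parent_id = parent["id"] if parent else None
        payload = json.dumps([parent_id, message, tables], sort_keys=True)
        commit_id = hashlib.sha256(payload.encode()).hexdigest()[:12]
        self.log.append({"id": commit_id, "message": message, "tables": tables})
        self.staged = {}
        return commit_id


repo = Repo()
repo.write("people", "1", {"name": "Ada"})
repo.add("people")
first = repo.commit("add Ada")

# This write stays in the working set until it is staged and committed.
repo.write("people", "2", {"name": "Grace"})
```

After the last line, the commit log still holds a single commit whose `people` table has one row, while the working set has two.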
Branch-merge semantics allow tables to grow at multiple users’ pace, providing loose collaboration on data as well as multiple views on the same core data. Dolt provides table-specific diffs and conflict detection across data and schema. Data conflicts are detected per cell rather than per line, and diffs are computed efficiently. Remote repositories allow for cooperation among repository instances. Clone, push, and pull semantics are available.
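To see why cell-based conflict detection matters, consider a three-way merge of a single row. The function below is an illustrative sketch, not Dolt's merge algorithm: `merge_rows` is a hypothetical name, rows are plain dictionaries, and a column missing from one side is simply treated as `None`.

```python
def merge_rows(base, ours, theirs):
    """Three-way merge of one row (column -> value).

    Returns (merged_row, conflicting_columns).
    """
    merged, conflicts = {}, []
    for col in sorted(set(base) | set(ours) | set(theirs)):
        b, o, t = base.get(col), ours.get(col), theirs.get(col)
        if o == t:
            merged[col] = o        # both sides agree
        elif o == b:
            merged[col] = t        # only their branch changed this cell
        elif t == b:
            merged[col] = o        # only our branch changed this cell
        else:
            merged[col] = b        # both changed it differently: keep base, report it
            conflicts.append(col)
    return merged, conflicts


base = {"id": 1, "name": "Ada", "city": "London"}
ours = {"id": 1, "name": "Ada L.", "city": "London"}   # we edited the name
theirs = {"id": 1, "name": "Ada", "city": "Paris"}     # they edited the city

clean_row, clean_conflicts = merge_rows(base, ours, theirs)

theirs_too = {"id": 1, "name": "A. Lovelace", "city": "Paris"}  # they also edited the name
_, real_conflicts = merge_rows(base, ours, theirs_too)
```

In the first merge the two branches touched different cells of the same row, so the result combines both edits with no conflict; a line-based tool comparing the row as one line of a CSV file would have flagged it. Only the second merge, where both branches changed `name`, reports a conflict.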
Dolt is a database with its own storage layer, query engine, and query parser, according to Sehn. At this point, Dolt can’t be distributed easily — data must fit on one hard drive. With enough traction, the team will build “big Dolt” for big data, he said.
Dolt uses the Snappy open-source compression library, which prioritizes speed over size, to store data to disk. That means data chunks must be decompressed to process queries.
Public datasets are hosted for free on DoltHub; private repositories are available at $50/month.
Discussion on Hacker News has noted the “flavor-of-the-month” nature of “Git for data” projects.
Sehn counts GitHub itself as Dolt’s main competitor, though in a blog post he outlines projects such as Quilt, qri and Kaggle, as well as data pipeline versioning options including Pachyderm and DVC.
Going forward, the company has four areas of focus, he said:
- Improving Dolt and DoltHub, such as its recently added data queries on the web, which allow users to see an Excel audit log of queries on Dolt. (“Being able to see the history of any cell in your database up to hundreds of gigs is pretty interesting to people,” he said.)
- Adding functionality such as social features to make it easier to collaborate.
- Adding features to become fully MySQL compliant.
- Adding more Git functionality, such as rebasing, which enables users to see the history of a database without creating a branch.