Everyone knows the devil’s in the details, but who could have guessed there were quite so many details in building a highly consistent, always available, scalable SQL database?
That’s the lesson Cockroach Labs took from the beta of its CockroachDB project, according to CEO Spencer Kimball. At the same time, the company has been “blown away” by interest in the project, he said.
The company announced that the database, three years in the making, is now production-ready.
CockroachDB is a distributed SQL database built on a transactional and strongly consistent key-value store. It scales horizontally and survives disk, machine, rack, and even datacenter failures (hence the reference to the insect’s ability to survive) with minimal latency disruption and no manual intervention. It also supports strongly consistent ACID transactions and provides a familiar SQL API for structuring, manipulating, and querying data.
Cockroach Labs’ Jessica Edwards explained the CockroachDB architecture in a previous post for The New Stack.
Development over the past year has focused on three areas, Kimball explained in a blog post: distributed SQL to support and scale seamlessly for both small and large use cases; multi-active availability for always-consistent high availability; and flexible deployment in virtually any environment.
CockroachDB has had more than 100,000 downloads since its beta release in March 2016. Customers include Heroic Labs, which provides an open-source scalable server for building social and real-time games, and the Chinese Internet company Baidu, where a recently demonstrated use case involved processing 2 billion inserts a day while testing against “chaos events” to check for resiliency.
Version 1.0 offers fully distributed ACID transactions, zero-downtime schema changes, and support for secondary indexes and foreign keys. It also introduces a distributed query execution engine, enabling distributed JOINs to support analytics queries that speed up linearly as nodes are added to the cluster.
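To make those feature names concrete, here is a minimal sketch of what an atomic transaction, a foreign key, and a secondary index look like from the application side. It uses Python’s built-in sqlite3 module purely as a single-node stand-in; it is not CockroachDB, and the table names, column names, and the `transfer` helper are all hypothetical.

```python
import sqlite3

# Single-node stand-in (assumption): SQLite in memory, not CockroachDB.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves foreign keys off by default
conn.executescript("""
CREATE TABLE accounts (
    id      INTEGER PRIMARY KEY,
    owner   TEXT NOT NULL,
    balance INTEGER NOT NULL CHECK (balance >= 0)
);
CREATE TABLE transfers (
    id     INTEGER PRIMARY KEY,
    src    INTEGER NOT NULL REFERENCES accounts(id),  -- foreign key
    dst    INTEGER NOT NULL REFERENCES accounts(id),  -- foreign key
    amount INTEGER NOT NULL
);
CREATE INDEX transfers_by_src ON transfers (src);     -- secondary index
INSERT INTO accounts VALUES (1, 'alice', 100), (2, 'bob', 20);
""")


def transfer(conn, src, dst, amount):
    """Move money in one transaction: every statement applies, or none does."""
    try:
        with conn:  # commits on success, rolls back if any statement raises
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, src))
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, dst))
            conn.execute("INSERT INTO transfers (src, dst, amount) VALUES (?, ?, ?)", (src, dst, amount))
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    """Return {owner: balance} for every account."""
    return dict(conn.execute("SELECT owner, balance FROM accounts ORDER BY id"))


print(transfer(conn, 1, 2, 30))   # valid transfer
print(transfer(conn, 1, 99, 10))  # account 99 does not exist, so the foreign key rejects it
print(balances(conn))
```

The second call debits the source account before the foreign key check fails, and the rollback undoes that debit. A distributed database that claims ACID transactions has to give the same all-or-nothing result even when the rows involved live on different nodes.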
The database uses the concept of “multi-active availability,” which originated at Google, the Cockroach founders’ former employer. It involves keeping three or more replicas and quickly reaching consensus among them to guarantee consistency, rather than settling for eventual consistency. Amazon’s Aurora provides similar cross-site replication, Kimball said.
Not only does the data live in three or more places, but processing takes place in those multiple locales as well, he said. Any of the nodes, at any of the replication sites, can serve the application.
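The core idea of consensus across three or more replicas can be sketched as a majority rule: a write goes through only while more than half of the replicas are reachable. The toy model below is an illustration under stated assumptions, not CockroachDB’s Raft implementation; it leaves out leader election, logs, and catch-up of recovered replicas, and the `Replica` class and `quorum_write` function are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """One copy of the data at one site (hypothetical model)."""
    name: str
    up: bool = True
    data: dict = field(default_factory=dict)


def quorum_write(replicas, key, value):
    """Apply a write only if a strict majority of replicas is reachable."""
    reachable = [r for r in replicas if r.up]
    if len(reachable) <= len(replicas) // 2:
        return False  # no majority: refuse the write instead of diverging
    for r in reachable:
        r.data[key] = value
    return True


sites = [Replica("us-east"), Replica("us-west"), Replica("eu")]
sites[2].up = False                    # lose one site
print(quorum_write(sites, "k", "v1"))  # 2 of 3 reachable
sites[1].up = False                    # lose a second site
print(quorum_write(sites, "k", "v2"))  # 1 of 3 reachable
```

With three replicas the model keeps accepting writes after losing one site and refuses them after losing two. That is the trade being made: the minority side stops rather than serving data that might later conflict.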
It took about a week to write the initial database implementation, Kimball said, and more than two years to work out the kinks, especially around the Raft consensus algorithm.
CockroachDB can be run locally on a laptop or in a public or private cloud. It still does not have its own cloud service, though that’s probably coming within the next 12 months. Its advantage over AWS or Google Cloud Service is that it will not be tied to a particular cloud, Kimball said.
In discussing Google Spanner previously, Kimball argued that since it is proprietary, there’s no real off-ramp should you decide later that you want to move your data elsewhere. Many large companies run their own data centers but are looking to move into public cloud services down the road. That’s simple with Cockroach, he said — just a configuration change.
Cockroach often is compared with Google Spanner, which has grabbed most of the recent database headlines. Yet a number of other projects have joined the quest to add scale to distributed SQL, including TiDB, which also aims to build an open-source version of Spanner; Crate, focused on supporting Docker containers and microservices; and Timescale, built for time-series data. There’s also FaunaDB, which describes itself as relational, but not SQL.
Cockroach Labs also announced $27 million in Series B funding, led by Satish Dharmaraj of Redpoint Ventures, joined by Benchmark, FirstMark Capital, GV (formerly Google Ventures), Index Ventures, and Work-Bench.
The new funding will further enable the company to work toward filling the gap between what companies need and what databases offer, Kimball said.
The company has been touting geo-partitioning, a feature it says its competitors don’t have. Cockroach plans to implement a geo-partitioning beta by the end of the year.
It’s just the tip of the iceberg as far as what companies need to run global services, especially as more regulations crop up covering data sovereignty, he said.
Kimball previously explained geo-partitioning as an efficient way to move data to different regions, because, after all, customers do move:
“Geo-partitioning allows a company to have a single, logical database. But each region is run kind of transparently and independently. Australian customers’ data is domiciled in Australia. The EU data only in the EU. And in Germany, it might be even more restricted than the EU. But there’s nothing in the system that stops you from doing operations across these systems if you need to. So if you add some feature across regions, it can be done as a single operation. It makes it atomic, makes it consistent,” he said.
From the developer’s perspective, it’s just a column in the database schema. Say the data is domiciled in the EU: if you change that field, all the data about that customer moves behind the scenes. It’s simpler for the ops team because you have an economy of scale. Instead of a different set of monitoring, different machines, different release cycles for each of your machines, you have one global service.
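The “just a column” idea can be illustrated with a toy model in which each row’s `region` field decides which regional store holds it, and changing that field moves the row. This is a sketch of the concept only: the article describes geo-partitioning as a planned feature, so no CockroachDB syntax is shown, and the `GeoPartitionedTable` class and its methods are hypothetical.

```python
class GeoPartitionedTable:
    """One logical table whose rows are stored per region (hypothetical model)."""

    def __init__(self, regions):
        self.stores = {region: {} for region in regions}  # region -> {customer_id: row}
        self.location = {}                                # customer_id -> current region

    def upsert(self, customer_id, row):
        """Insert or update a row; a changed 'region' value relocates the row."""
        region = row["region"]
        if region not in self.stores:
            raise ValueError(f"unknown region: {region}")
        old_region = self.location.get(customer_id)
        if old_region is not None and old_region != region:
            del self.stores[old_region][customer_id]      # remove from the old region
        self.stores[region][customer_id] = dict(row)
        self.location[customer_id] = region

    def get(self, customer_id):
        """Look up a row by id, wherever it currently lives."""
        return self.stores[self.location[customer_id]][customer_id]


table = GeoPartitionedTable(["eu", "au"])
table.upsert(7, {"name": "Ada", "region": "eu"})
table.upsert(7, {"name": "Ada", "region": "au"})  # the customer moves to Australia
print(table.stores)
```

Callers read and write through one table and never pick a store themselves. The placement follows from the data, which is what keeps it a single logical database for the developer.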
CockroachDB originated as a GitHub project, but the company also announced a paid enterprise tier. One enterprise tier feature is distributed, incremental backup and restore. The free version includes non-distributed backup and restore.
Kimball said the company needs a workable business model, but it wants to keep most truly useful features in the free version. Only those applicable solely to large companies with huge data sets will go into the enterprise version, he said.