Cloud Native / Data / Microservices

Stateful Data: Couchbase and Its Multiple Consistency Models

20 Jan 2020 11:23am, by

Building distributed applications aren’t easy. We need to throw away much of what we have learned working in data centers and on desktops, where code can run in controlled clusters and where latency is the least of our worries. Take that to the cloud and the model changes: we’re no longer worried about whether a microservice is running or not, only that its state is readily accessible and can be picked up by another service instance.

State is all-important in modern cloud native applications, and that makes the databases we use even more important. There’s a catch, though. Those databases need to be designed for the cloud-native world, where scale is key, and where we need to design our applications and services around the physics of a global scale network that’s shifting ever-increasing amounts of data.

NoSQL Gets Structured

It’s interesting to see how databases, especially NoSQL-based stores, are approaching this problem. NoSQL is particularly attractive as a state store: it’s a schema-less approach to data gives NoSQL platforms speed and flexibility that’s not possible with traditional relational platforms. It doesn’t matter what you’re storing, whether it’s simple key/value pairs or more complex JSON documents, NoSQL has evolved to offer what you need, supporting semistructured data in a scalable manner.

Couchbase, one of the original NoSQL databases, has been through a major re-architecture with a focus on supporting cloud native applications. Like Microsoft’s Cosmos DB, it’s now a multimodel database, dynamically generating a flexible schema from the semistructured data it stores. That allows it to offer the speed of NoSQL for writes, with the option of querying that data via a dialect of SQL, N1QL. As Ravi Mayuram, Cochbase’s Senior Vice President Engineering and Chief Technology Officer notes, “SQL is basically a functional programming language”, one that’s based on set theory.

Designing databases is a complex, mathematical task. At the heart is the concept of an index, managing access to your data.

Much of modern database design goes back to those mathematical roots. You only have to look at the influence of Microsoft Research scientists like Phil Bernstein and the late Jim Gray on the field, as well as Turing Prize winners like Leslie Lamport. Their theoretical work has been at the heart of much recent database innovation, even if much of it dates back decades. It’s not that the work they did was forgotten, it’s that only now with cloud-hosted distributed programming models that we need to implement that fundamental research.

Designing databases is a complex, mathematical task. At the heart is the concept of an index, managing access to your data. Under the hood, Couchbase offers several indexes, with a key/value store acting as a primary index with the dynamic schema a secondary index. It’s that secondary index that gives Couchbase multiple personalities, allowing SQL queries across the entire store, even if the data isn’t inherently relational. By building on an in-memory store, it’s fast.

As Mayuram pointed out to me, the speed doesn’t only come from being in memory. “So there are two aspects that are always going to get better. One is memory, which is good. But that doesn’t solve the whole problem. The second part of it is the network getting faster and faster. It’s actually faster to go fetch the data across the network, than caching it from disk.”

Using Consistency Models to Reduce Query Latency

Latency is always going to be a problem for distributed databases, even if they’re in-memory and built around modern networking technologies. Databases intended for use in the cloud at scale need to take advantage of alternate ways of ensuring consistency between shards and instances. When data is replicated between instances is it necessary for the instances to be completely in sync, adding to application latency, or can we take advantage of alternate models of consistency that are more closely aligned to how we build and operate applications.

At heart Couchbase is strongly consistent for writes, with secondary indexes scoped to a single index node and then replicated across a cluster. However, things are different where it comes to querying your data, offering similar consistency models to Cosmos DB, offering different approaches to querying data with different levels of latency and reliability. All you need to do is pick and choose the model closest to your application needs.

Perhaps the most useful option is Read Your Own Writes, a session-based consistency model. While the database is replicating data across clusters and replicas, your queries are scoped to the instance you’re using for writes. It’s a useful technique to use if you’re using Couchbase to manage a shopping cart or similar where you’re only interested in the records associated with the current transaction. Changes made in other instances won’t be reflected in the results, so you need to be sure that your queries are only on data you have written and that that data is immutable.

Consistency in the Couchbase SDK

There are three other consistency models you can use, addressed by SDK calls. The default is intended for speed, when you’re not concerned with the consistency of the underlying data, only the current state of the index. That does imply results could change between reads, or that data you have written to the database isn’t immediately visible. You’re likely to want to use the faster of Couchbase’s two indexers if you’re using this option as it updates the in-memory secondary indexes every 20ms.

Request plus queries are a time-bounded query, defined by the time you make the query. That makes this a very strict consistency model, as it needs to ensure that all indexes are up to date to that timestamp. There will be latency for any query that uses this approach, as the database must process all relevant writes and changes, and then pass them between instances. It can be expensive in terms of compute with all the necessary checks and processing.

The final option offers “high-performance consistency,” internally referred to as AT Plus. Here the query is linked to updates that are tied to specific changes to the database. So, if you write a record and process it, reading it after it’s been processed, the changes to the record are returned as part of the results. That means your data is consistent with a specific change, which can be extremely fast if your query is tied to a single node in a cluster. In practice, an AT Plus query can be 10 times faster than an equivalent request plus query.

Putting this all in a NoSQL database that works at cloud scale makes a lot of sense, especially now that Couchbase offers tools to run inside Kubernetes and to automatically scale on demand. Distributed databases like this are the future of cloud storage, and offering more than one consistency model is key to delivering both the flexibility and the performance developers need. This is a road many other databases are going to take over the next few years.

A newsletter digest of the week’s most important stories & analyses.