An Apache Cassandra Breakthrough: ACID Transactions at Scale
Let’s just get the best part out first. ACID transactions are coming to Apache Cassandra. Globally available, general-purpose transactions that work the way Cassandra works. This isn’t some trick with fine print or application of some old technique.
It’s due to an extraordinary computer science breakthrough called Accord (pdf) from a team at Apple and the University of Michigan. It will help Cassandra change how we think of data by opening new use cases.
Here’s what this means for those who aren’t steeped in the ins and outs of the Cassandra project and its features. Nothing is more important than how fast you can put an application into production. But developers who wanted to put Cassandra’s scale, resilience and legendary multi-data center support to work on something like financial transactions had to code a bunch of complicated workarounds into their apps. The trade-offs versus using, say, Oracle were significant.
With Accord? No trade-offs. Cassandra will keep everything that has made it amazing while moving transaction support into the database, which will greatly reduce application code complexity.
The Eye of the Observer
A database system has essential functions, such as storing data reliably and being available for query. Managing changes in data isn’t always a database function. In many NoSQL systems, the burden of change management is deferred to the user. It is the observer of a data change who decides how much exclusivity matters.
Suppose the point is to accumulate data as given. In that case, the observer must know the data was received and stored durably — for example, stock ticker data where every data point is unique and cumulative. There is no need for exclusivity.
In more sensitive operations, the observer of a data change needs to feel like they are the only process using the database. It’s a concept in computer science called “isolation”; it’s the “I” in ACID (atomicity, consistency, isolation, durability).
A classic example is a bank transfer where money is subtracted from one bank account and then added to another — in exactly that order. The observing process needs exclusivity to avoid other processes making changes that can lead to inconsistencies or surprises. Surprises include inadvertently allowing a transfer from an account that has gone below $0. Isolation guarantees only one process can make a change at a time, and if two are competing for the same data, one of them will have to wait for the other to complete.
Embrace Your Laziness
Developers need to move quickly with a system they can trust. ACID transactions have been the gold standard of trust within database systems for almost 50 years. Developers work through trade-offs based on requirements, sometimes leading them to work with systems that don’t support ACID transactions.
With NoSQL systems, trade-off biases historically tended to tip toward scale and uptime while sacrificing transactions.
Bringing ACID transactions to Cassandra has been about reducing trade-offs. Cassandra already has a solid reputation for linear scaling while maintaining uptime in the worst scenarios.
Cassandra has been the choice when you need a database that will hold up to what the internet can deliver. It’s no surprise then that the need for transactions has been a source of trade-off conflict for developers.
Can We Get a Consensus?
In distributed systems, each member node in the larger cluster can act independently or may need to coordinate with other nodes. When the nodes in a transaction all need to agree on something, computer scientists call that consensus, and developing consensus protocols is a continuous area of improvement.
Paxos has been a long-established consensus protocol and was adopted by Cassandra in 2013 for what was called “lightweight transactions.” Lightweight because it ensures that a data change within a single partition is isolated in a transaction; spanning more than one table or partition is not an option. In addition, Paxos requires multiple round trips to reach consensus, which adds a lot of extra latency and fine print about when to use lightweight transactions in your application.
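As a reminder of what lightweight transactions look like today, here is a sketch using the existing conditional-update syntax in CQL (the accounts table and its columns are illustrative):

```sql
-- Create an account only if it doesn't already exist.
-- The IF NOT EXISTS clause makes this a lightweight transaction.
INSERT INTO accounts (name, balance) VALUES ('alice', 100)
IF NOT EXISTS;

-- Conditionally update a balance. The IF clause triggers the
-- multi-round-trip Paxos protocol behind the scenes.
UPDATE accounts SET balance = 80
WHERE name = 'alice'
IF balance = 100;
```

Each of these statements is confined to a single partition, and the Paxos rounds behind the IF clauses are where the extra round trips and latency come from.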
The Raft protocol was developed as the next generation to replace Paxos, and several systems such as etcd, CockroachDB and DynamoDB adopted it. It reduced round trips by creating an elected leader.
The downside for Cassandra in this approach is that leaders won’t span data centers, so multiple leaders are required (see Spanner). Having an elected leader also violates the “shared-nothing” principles of Cassandra and would layer new requirements on handling failure. If a node goes down, a new leader has to be elected.
Other databases — FaunaDB and FoundationDB, for example — have gone down the route of trying to solve the multileader problem by reducing down to a single, global leader, as described in the Calvin paper. Because these approaches were built for databases with different requirements, they failed to meet the criteria Cassandra expects for failure modes.
Cassandra assumes failures as a part of running a large distributed system. One or more nodes going offline should not cause rapid performance degradation or availability issues. We needed a different approach.
Have We Reached an Accord?
We can get very opinionated on what is acceptable for the Cassandra project. Our criteria are about holding true to the core beliefs on how distributed systems should run. Performance and scaling should always be preserved while operating multiple nodes across one or more data centers. We can be pretty demanding, but this is what makes Cassandra the choice for so many organizations.
The previous iterations of consensus protocols solved different parts of the problem, but each presented a trade-off that would violate some of Cassandra’s values. It’s been said the next big breakthrough is two papers away from the last. In this case, the paper was Accord, and it took a big swipe at eliminating trade-offs.
Accord addresses two problems that previous consensus protocols left unsolved: how to make consensus globally available, and how to achieve it in a single round trip. The first novel mechanism is the reorder buffer.
Assuming commodity hardware is in use, differences in clocks between nodes are inevitable. The reorder buffer measures the difference between nodes in addition to the latency between them. Each replica can use this information to correctly order data from each node and account for the differences, guaranteeing one round-trip consensus with a timestamp protocol.
The other mechanism is fast-path electorates. Failure modes can create latency when electing a new leader before resuming. Fast-path electorates use pre-existing features in Cassandra with some novel implementations to maintain a leaderless fast path to quorum under the same level of failure tolerated by Cassandra. More details can be read in the proposal.
How Does It Work?
The biggest impact will be in developer productivity, so let’s see what that looks like in practice. Consider the following bank account transfer example we mentioned earlier:
First is the new syntax you’ll see in the Cassandra Query Language (CQL). Transactions are contained in a BEGIN TRANSACTION and COMMIT TRANSACTION declaration. Everything inside the transaction markers will happen atomically, in isolation from other processes. In this example, we will transfer $20 from Alice’s account to Bob’s. It doesn’t get any more classic than that!
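Here is a sketch of what that transaction could look like. The exact syntax is still being finalized as part of the proposal, and the table, column names and A/B section labels below are illustrative:

```sql
BEGIN TRANSACTION
  -- Section A: read Alice's current balance into a named result
  LET alice = (SELECT balance FROM accounts WHERE name = 'alice');

  -- Section B: test the condition, then apply both changes atomically
  IF alice.balance >= 20 THEN
    UPDATE accounts SET balance -= 20 WHERE name = 'alice';
    UPDATE accounts SET balance += 20 WHERE name = 'bob';
  END IF
COMMIT TRANSACTION;
```

Note that unlike today’s lightweight transactions, the reads and writes here span multiple partitions, and the whole block either commits or has no effect.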
In section A, we can select data from an existing record and assign the result to a tuple (multiple items stored in a single variable). Depending on how many columns are in the SELECT clause, you can store one or more values in the tuple. These values will be used in section B to test conditions before making data changes.
In this case, we will test whether Alice has $20 in her account before transferring to Bob. If so, an UPDATE decrements Alice’s account balance by $20 and then increments Bob’s by $20. If Alice had less than $20, the changes wouldn’t happen.
Behind the scenes is a serialized set of database commands that execute exclusively, as seen from the observing process. Across one or more data centers, the transaction only required one round trip to gain consensus, and if any nodes were offline, the action would still occur if at least a quorum of replicas were available.
This is exactly how Cassandra likes to work, but we just upped our game with a globally available transaction.
Accord and all of the work that goes with it is still in progress and slated to be included in the next Cassandra release. Since this is all open source, those of you who can’t wait can clone the cep-15-accord branch from the Cassandra repository and build your own copy. For the rest of you, as we get closer to release time, we’ll have builds available for you to use and test. It will be a game changer for Cassandra, and I’m sure you’ll want to see it for yourself.
I’m most interested in hearing from the community about what use cases you’ll find with a globally available transaction running at the speed and resilience you expect from Cassandra. Is it finally time to let go of those last relational database workloads?
We’re also eager to hear your feedback on all our channels, including the Apache Software Foundation Slack or project mailing list. Features in an open source project are constantly evolving to meet users’ needs. That’s why you’ve got a critical role to play in shaping Apache Cassandra for the future.
And stay tuned for more use cases and information as we evolve this exciting new feature. You can expect that there will be several talks about this at the upcoming Cassandra Summit 2023. You won’t want to miss those.