Microsoft Orleans Brings Distributed Transactions to Cloud
Distributed actor frameworks like Orleans, Akka Cluster and Distributed Erlang aren’t databases, but Orleans 2.1 adds the familiar database abstraction of transactions, complete with ACID guarantees — the atomicity, consistency, isolation and durability guarantees that ensure transactions will be valid no matter what happens.
Distributed applications used to be the domain of specialists building data warehouses; now any large company dealing with lots of customers online needs to have a cloud-scale distributed application. Early web development patterns were layered architectures that were still dependent on monolithic databases, complete with transactions and guarantees, to provide durable state but scaling those proved difficult and we needed ways to record and replay transactions to handle errors.
Moving to distributed cloud systems and NoSQL databases backing the stateless front end and stateless middle tier gave us scale, but those non-transactional systems caused problems of their own. Developers using cloud services have had to learn to work around the idiosyncrasies of eventually consistent data stores. Applications based on microservices are inherently distributed and often end up including ad hoc approaches that try to approximate transactions to get consistency, without any of the formal specifications to make that efficient or effective.
The usual view of transactions is that they’re too slow for the cloud and can’t scale to millions of users and thousands of transactions per second. “The industry gave up on transactions 15 years ago, but we believe we can make transactions work at scale with good performance,” Microsoft Principal Software Engineering Lead Sergey Bykov from the Orleans team told the New Stack.
Transactions give you consistency and transactions work in the same sequential way that people think. Atomicity means that an update succeeds completely or fails completely, so you’re not left with half a change; if you’re transferring money or credits from one account to another, the value will leave the first account and show up in the second, or the transaction won’t be applied. Isolation means you don’t see partial results, so it never looks like the value has reached the second account without leaving the first account.
Event sourcing and approaches like Command Query Responsibility Segregation (CQRS) are durable but they only offer eventual consistency and they don’t’ have atomicity or isolation, Bykov points out.
“Eventual consistency and event sourcing are good patterns but sometimes you need to have guarantees. In gaming, say you want to trade a gold coin for a cannonball and do it at the scale of millions of users. If you run out of coins [because the trade went wrong], it’s a bad experience,” Bykov said. “The in-game economy is really financial transactions and people want guarantees with their commerce. Atomicity and isolation are what developers want. What we want as developers is for the system to just work and take care of it, and not have to build complex machinery to half-achieve the same result.”
“If you’re running an application on a cluster with ten or a hundred nodes and you have grains distributes across all the nodes that represent users or their entities like orders or game sessions, this functionality means that for the first time, you can have two or three or n items that can get updated together,” Bykov said.
The programming model for transactions in Orleans is straightforward; you have to configure silos to use transactions and then put the transactional attribute on methods. When a method that’s marked as being transactional gets called, the local state that’s stored in memory is written into storage and locked, with details of any dependencies.
The Orleans runtime sends the details of the state and any dependencies to the transaction manager, which allocates a transaction ID, waits for any dependencies to complete and then tries to commit the changes. If there’s any failure, like the transactions this transaction is dependent on failing, “everything is unwound as if it didn’t happen.” That means that Orleans can give ACID guarantees when it’s using stores like blob, table or key-value storage that don’t support transactions.
A beta release of transactions was included in the 2.0 release of Orleans using a single transaction manager that coordinates all transactions. That required a separate server, which is a single point of failure and also an extra operational cost because you have to deploy and maintain it. Even read-only operations like getting a balance mean a round trip to the transaction manager to validate the transaction; “that adds to latency but it doesn’t impact throughout,” Bykov notes.
Holding locks until after the forced write of two-phase commit — which Google Spanner users for read-write transactions — is slow in cloud storage, which limits the number of transactions that can be completed in a given time. To get around that, Orleans accumulates transaction requests and sends them as a batch, so it gets back a batch of results; that means all the locks can be released during the first phase of the two-phase commit, which avoids blocking.
The chain of dependencies could cause cascading aborts where one transaction aborts and causes the whole chain of transactions to abort in turn, but that can only happen with physical failure like a server crashing, Bykov says, which should be rare.
With the single transaction manager, that scales to 50.000 or 100,000 transactions per second on a cluster; a traditional two-phase commit would limit scale to more like 25 transactions per second with cloud storage. Orleans 2.1 removes the scalability limits by switching to distributed transaction managers. “The grain that originates the transaction becomes the transaction manager, so every grain can be a transaction manager and this model shouldn’t have any limits on scaling”. That reduces raw performance but there’s also no single bottleneck in the system and no need to deploy anything beyond the normal Orleans runtime.
In 2.1, distributed transactions are “release candidate quality”; there will be another point release with the final version of transactions. That will have the same API but will have gone through a higher level of testing and analysis; “we have internal teams [at Microsoft] validating and testing and building applications on top so we’ll be comfortable to declare that it’s ready,” Bykov explains. One of those customers is a virtual commerce service for games that would have needed very complex code and now has a much simpler programming model because of transactions.
Distributed transactions at cloud scale bring together concepts that simply haven’t worked together before; Bykov notes that it’s “a daring proposition to say we can scale distributed transactions because people gave up on this a long time ago” but transactions might not be the only database abstraction to come to Orleans.
The idea came from Microsoft Distinguished Scientist Phil Bernstein — a pioneer of transaction processing and one of the designers of the SQL Azure database engine. As well as transactions, his research into what he calls ‘actor-oriented database systems’ covers indexing (which could be the next feature to come to Orleans) as well as query processing, streams, replication and geo-distribution.
And as Bykov notes, this is open source, not proprietary. “This is not tied to any database; it’s a middle-level tier for distributed transactions,” he points out and the Orleans community has already come up with a transactional storage provider for AWS DynamoDB. If you think of the actors in Orleans as distributed virtual objects, the concept is very like containerized microservices, and Bernstein’s paper covers implementing the concepts in a JVM as well as in .NET, which Orleans uses. If the idea takes off, we might see distributed transactions as an option for a range of services and frameworks.