In our previous post, we gave a basic overview of Project Orleans, designed by Microsoft to simplify building cloud services. In this post, we have a more detailed look at Orleans and how it compares to other similar technologies in the market.
Project Orleans is a .NET framework built by the eXtreme Computing Group at Microsoft Research that has given the actor model a somewhat different role to play compared to Erlang, the best known example of the actor model. Erlang is designed for distributed systems, originally created by Ericsson in 1986 to control circuits in telecoms equipment, as noted by Dr. Natalia Chechina of the University of Glasgow in a talk last year about scalable Erlang.
Erlang, for example, is used in Riak Core, as discussed in an excellent post by Yan Cuti titled: “A look at Microsoft Orleans through Erlang-tinted glasses.” He writes:
In the Erlang space, Riak Core provides a set of tools to help you build distributed systems, and its approach gives you more control over the behavior of your system. You do, however, have to implement a few more things yourself, such as how to move data around when cluster topology changes (though the vnode behavior gives you the basic template for doing this) and how to deal with collisions, etc.
The Orleans Approach
The Orleans team made some very different decisions when considering the trade-offs between high availability and consistency, and especially between ease of use and a more academic approach.
The idea was to democratize the actor model and put it in the hands of a broad audience of developers who might appreciate a language like Erlang but would never turn to it for daily development. “Actors represent real life better than other paradigms,” maintains Sergey Bykov from the Orleans team. “Object-oriented is close, but you’re not tightly coupled in real life.” That makes it too useful a model to restrict to the small community of developers who were prepared to start from scratch and learn a new language in a small ecosystem.
“We often hear from people saying ‘this is so easy to pick up — I just downloaded the SDK and it just works. I didn’t need to learn anything, I deployed the worker role on a cluster in Azure and it just works,'” Bykov says. “They’re so happy, and that’s the customer we’re after.”
“When we were thinking about how to frame the key APIs and what audience to target, we deliberately chose to go to a more democratized version of message passing. The trade-off we made there is for ease of use and flexibility and acceptance for the majority of developers.”
That means it’s not just the .NET the Orleans framework is written in that will look familiar to developers. “The method names, the argument names; everything will be familiar to them,” says Bykov. Anyone who’s worked with the Windows Communication Foundation will also feel at home.
“You can always squeeze a few extra percentage points of performance out by hand-crafting things, but in the modern world, five, ten even twenty percent performance is nothing compared to developer productivity and time to market and being able to hire people quickly to build a product,” Bykov maintains. “Expensive developers may create a system that’s faster five years later, but you need a system in a year or six months. It’s a complex question of optimization that’s not just about raw performance but the economics of software development and hiring people.”
Message passing is a key feature of the actor model, and how efficient Orleans can be at message passing depends on whether you’re running it on a single system or distributing it over a network. “If you’re sending messages between two actors on different machines you’ve got no choice, you have to send it over the network and the efficiency comes down to how much data you’re sending,” Bykov points out. “Orleans’ messages come with a bunch of headers — who sent it, who it’s to, some tracking information. Some fields look extraneous, but in reality when you’re debug giving a system because something doesn’t work, you appreciate having the call stack and the debug context so you can see what happened. It saves time when it’s critical when you’re in production.”
If you’re running on a single system, Orleans can save the overhead of serializing messages. “You can just isolate messages and do a mem copy instead of real serialization, and we do that if a message is between two grains on the same machine.”
The team also made what might seem an unusual choice about serialization and built its own serialization layer. “We could have picked up Bond or another framework. We looked at the five serialization frameworks that are already used inside Microsoft,” explains Bykov, but they all had drawbacks, often being expensive in terms of computation and message size. “.NET is expensive, binary XML even more. Others require you to define message formats up front and compile them; WCF does that with contracts. Bond is very efficient but is subject to the limitation of having to define message types up front. We wanted something more flexible than that. We wanted it to look like almost any call to function. So we ended up building our own serialization layer. It may sound expensive, but it just works. You can send a dictionary with ten references to the same object, and on the receiving end and you’ll get exactly that. It will preserve types so you don’t have to know up front what type will be sent. We made a trade-off to invest more [in development] to get flexibility and efficiency.”
One decision developers often ask the Orleans team about is the trade-off between availability and consistency. Bykov disagrees with the suggestion that the actor model demands consistency to the point that you can only ever have a single instance of an actor. “That’s not a feature of the actor model, it’s an implementation detail. We deliberately chose to optimize for availability. If you have partitioning failures, if your cluster is in some funky transition state, it will resolve itself overtime but have uncertainty and that allows for the potential creation of duplicate actors with the same identity. You could end up with more than one. It’s very unlikely even in the event of a failure, but it’s possible.”
That’s a trade-off that is common in cloud systems, he says.
“For cloud systems, people choose availability all the time. You don’t want to be consistent but not available, because that means you’ll be consistent five minutes later. We chose this instead of the actor being unavailable until the uncertainty is resolved.”
What Orleans gives you is eventual consistency. “It’s eventually a single instance of an actor. It’s relatively rare that you end up with two instances; you almost never end up with three, and eventually we guarantee a single instance of actor, so there’s an eventual consistency of state.”
If both instances of the actor have been updated, the updates need to be merged. “It might seem like a big problem but in reality it’s much less so, because in reality your system is usually backed by persistent storage. If it’s an Azure table or a blob, they support detecting conflicts.”
Your system may be able to get the state for the actor without untangling the updates. For the Halo presence service that keeps track of everyone in the game, “the truth is on the console, so even if you become inconsistent in the cloud, you don’t need to resolve anything. Eventually the console will send a new version of the truth to the cloud.”
Even if you need to append to an existing value instead of just updating it, Bykov maintains eventual consistency is enough because of the nature of distributed systems. “In a distributed system you can never guarantee delivery exactly once without two phase commit; in reality it’s at least once. You may lose a message coming in or you may lose the acknowledgement, you know Byzantine failures are possible, so your operations need to be commutative. That means you’ll already be building the storage layer to be eventually consistent and handle duplicates. These are inherent problems in distributed systems.”
Choosing availability is even more important at cloud scale. “It’s easy to say I will take a lock on this thing in storage and nobody can update it concurrently, but this leads to things being unavailable. Suppose you lock it for a minute and during that time the machine dies. Now no one can update it until the failure’s detected and the lock is removed or the lock just expires. It’s expensive and slow, and you need a lot of resources to keep track of locks. If you have a million objects, now you’re keeping track of a million locks.”
For efficiency, the Orleans team made a similar trade-off globally. “We decided that everything would be asynchronous; you cannot define a blocking operation. If you use promises and you block on a promise, instead of using async/await, you’re blocking threads. If you have ten concurrent requests, you’re blocking ten threads; if you have a thousand threads you’re blocking, you’re in trouble. And if you’re trying to process 10,000 requests a second, you just cannot get the throughput. Threads are expensive.”
To deal with that, Orleans has a tightly-controlled thread pool on the server side — the silo the grains live in — with one thread per physical call. “If there’s nothing to process on the thread, if the code makes an IO call that’s going to be 100ms, there’s no point waiting 100ms,” explains Bykov. “So we essentially relinquish the thread to the system. While you’re waiting for that thread to have something to process, we can process hundreds thousands of messages that require CPU. That’s a design decision that means you can have a huge number of requests in flight without exhausting resources.”
That was the Orleans model even before the async/await pattern was available. Previously it was implemented with a promises library, but Bykov is enthusiastic in his praise for the new pattern, even though he admits it has some extra overhead (because the compiler has to create a state machine to handle it).
“This async/await pattern is the best thing since sliced bread for parallel programing. It’s made it so much easier to write what looks like sequential code without any call-backs, and synchronization works very efficiently. It’s a paradigm shift.”
Using async/await inside Orleans meant the team could remove nearly half of their application code. “Instead of continuations and curly braces and ‘in case of success,’ we just do try catch, await call and catch exceptions.” And again, that helps with making Orleans easier for the broader developer audience to understand, and easier to develop with.
“We’re all human. The longer the code is, the more likely it is that we’ll miss something and make a mistake.”
Orleans: An Unusual Way of Getting Scale Without Complexity
Cloud scale applications aren’t suited to the common MVC, MVVM and other n-tier design patterns, especially when you’re developing microservices for scale-out deployments. That’s where the actor model, at the heart of Orleans, comes in to play. Messages are passed between blocks of code — the actors that process the content of a message. It’s a model that allows you to quickly create new instances of an actor as there’s no concurrency — all you need is a new address to send messages. As each actor is a separate functional element, they’re easy to use as the basis of a parallel computing framework at massive scale. All an actor needs are the addresses of the actors that are the intended recipients for its messages, and there’s no need for an individual actor to do more than process message contents as they arrive and send the results on to the next actor. That next actor might be anything from an API endpoint to a marshaling engine to a piece of business logic.
There’s a lot of similarity between Orleans’ actor constructs and Erlang. Erlang’s started as a language for developing telephony switch applications, and can best be thought of as a functional programming environment for actor models with message passing. Erlang is a powerful tool, and is a common tool for building large-scale actor services, especially in the financial services industry, which takes advantage of its functional basis to handle complex tasks. It’s also found a role at the heart of distributed NoSQL databases and in configuration and source control management systems. But it’s a complex language that not all developers will be interested in learning and it doesn’t have the advantages of Orleans’ virtual actors.
Similarly, there’s a lot of support for Scala, which is again popular in financial services and in online gaming. As RedMonk analyst James Governor points out, “It likely won’t be for everyone, and while Scala brings scale and really powerful pattern matching, it has a steep learning curve.” That gives Orleans an opportunity, as Governor notes, “If Microsoft can provide actor-based concurrency with a simpler programming syntax, Orleans could be a useful tool, certainly for Microsoft shops.”
Perhaps the best known actor framework is the open source Akka from Typesafe. It’s a Scala-based framework running on standard JVMs. Like Orleans, it’s built around asynchronous messages with additional tooling to handle clustering and to work with message queuing systems. Akka has developed a large ecosystem, with more than 250 GitHub projects, as well as a port to .NET. As Akka is part of the Typesafe platform, is provides event-driven middleware functions for the Play Java framework. But not only do you have the steep learning curve of Scala to cope with, Akka is intended as a far more low level solution than Orleans.
Other actor frameworks include Pulsar, which adds actors to Clojure. While it gives you asynchronous IO, it works through a synchronous API, reducing the complexity of your code. Underneath Pulsar is a Java queuing framework written using light-weight threads — Quasar — which Pulsar gives an Erlang-like API. It’s still very much in development, and currently isn’t designed to handle distributed actors, making it harder to write scale-out microservices because many operations end up being blocking operations. The intent is to deliver a framework where Java handles the heavy lifting, while the Clojure-wrapper manages concurrency as part of Parallel Universe’s Galaxy in-memory data store, which handles distributed data structures. Using actors, data can be consistently shared across processing nodes using point-to-point messaging.
Named after a pioneering Sanskrit grammarian, Panini describes itself as a “capsule-oriented” language delivered on top of a JVM as PaniniJ. The aim is to deliver an actor-like programming environment for concurrent programming using asynchronous messages while sequential code capsules handle messages, avoiding common concurrency errors. Panini is perhaps best thought of as a way of delivering parallel programming on Java, with development using familiar techniques that are abstracted away from the underlying actor-message model — in much the same way garbage collectors manage memory — with code running inside modules, called capsules. Capsules are created by taking a program and breaking it up into simple actors, which are wrapped with definitions of the capsules that need to work — defining messages and APIs at the same time. Panini is a research language at the moment, but it shows promise as a set of techniques that can be ported to other languages and run times, not just the JVM.
None of these really work at the same high level of abstraction as Orleans. That, and the successful Xbox services built using Orleans-inspired Electronic Arts’ BioWare division to develop its own virtual actor platform for its cloud gaming properties, Orbit. Electronic Arts credits Orleans as its inspiration for creating a Java version of the virtual actor model, and the intention is to solve “many of the problems that make working with actor frameworks difficult” by being “lightweight and simple to use, providing location transparency and state as a first-class citizen.” It’s a clear validation of the Orleans approach, using virtual actors and favoring simplicity over the supposedly ‘purer’ architectural ideas.
Orbit has now been open-sourced as a JVM actor framework, so it’ll work with any language that can run on a JVM, including Java and Scala. Like Orleans, Orbit will manage your actors for you, simplifying application development. A container can be used to wrap your applications, and handle wiring objects together, as well as starting and stopping your applications. The framework also includes web service interfaces, so you can hook an Orbit application to other services and tools.
The future for the actor model is a promising one, with many different implementations in use and in development. It’s also at the heart of popular construction games, like Minecraft and Project Spark, where programmable objects can easily be thought of as interacting actors, where those interactions are handled by asynchronous messages. That means the next generation of developers will be familiar with event-driven actor frameworks without knowing it, just from playing games. That’s going to make an actor framework that’s easy to work with — like Orleans — particularly appealing.
Open Source Unknowns
Thanks to the open source model and simplicity of Orleans, there are plenty of Orleans developers the team knows nothing about, or only hears of from the Azure support teams or through comments on GitHub or Codeplex. The Halo systems are well known, but there are plenty of other projects using Orleans. “It’s so easy to use that some people go into production without asking us a single question,” Bykov points out.
He didn’t know about the German company that had had an Orleans system in production for six months until they asked a question on Codeplex, or about another European business developing large IoT solutions; “they manage a major energy storage facility for renewable energy.” Their Orleans system is used in pest control: “They manage up to two million mousetraps.” He’s come across a wide range of projects. “They’re controlling devices or processing data coming from devices, and organizing them in hierarchical manner for building control, or vehicle telemetry.”
That’s possible because of the flexibility of the virtual actor model. “It’s a range from large-scale device deployments to handling a small number of high throughput devices. It works for high throughput and for the infrequent message that arrives once a day or once an hour. The resources are managed automatically: I just set the time window and say I want this to be garbage-collected after two hours or five minutes of inactivity. I don’t have to worry about how many of them come and go, how many are activated or deactivated. I can program as if they’re all always there. I don’t have to write code to resource manage them, and this really expands the range of applications.” And with that broad of a community, being an open source project makes sense for Orleans.
Riak is a a decentralized datastore from Basho, a sponsor of The New Stack.
Feature image via Flickr Creative Commons.