Project Orleans, the .NET Framework from Microsoft Research Used for Halo 4
Late in 2014 Microsoft announced that it was planning to open source Project Orleans — a .NET framework built by the eXtreme Computing Group at Microsoft Research (MSR) to simplify building cloud services that can scale to thousands of servers and millions of users, without needing to be an expert in distributed systems.
The promise of cloud is that you can scale out whenever you need to (and scale back when you don’t), and get resiliency and performance without getting bogged down in maintenance and management. But getting those benefits for the service you build, rather than just the cloud platform you run on, means building the architectural principles of cloud right into your service as well, so you can handle hundreds of thousands of requests per second, across thousands of servers, in real time, and meet high demand with low latency and high throughput.
Project Orleans is designed to help with that. First announced as a research project back in 2010, it’s been used for several internal Microsoft systems including two very large player services in Halo 4: one that tracks all players and game sessions, another that handles detailed statistics about players and all their actions in every Halo game ever played, which show up in the Halo Waypoint interface.
If you try building that kind of system at scale and in real time with the familiar n-tier development model, you quickly run into storage bottlenecks, even though cloud lets you scale up when you need to. To avoid information bouncing back and forth between the front end and the mid-tier storage, you add a cache, but that makes for a far more complex development model, because your cache has different semantics from your main storage, and you have to handle moving information from storage that has concurrency control to storage that has none.
The n-tier model doesn’t handle common cloud scenarios well either. If you want to chat with someone on a social network, switch from playing an online match with one group of friends to playing with another, or deal with information streaming in from multiple devices, you need to pass information back and forth efficiently. Even map/reduce, which is ideal for off-line processing of very large data sets, isn’t really efficient for handling interactive requests where you need to work with a small set of related data items.
It’s easier to handle those issues on a smaller scale, but if you’re building a cloud service you want it to scale up when it gets popular, without having to change the architecture. And concurrency and distributed computing are just hard to get right. Reliability, distributed resource management, complicated concurrency logic and scaling patterns are problems almost every developer building a cloud service needs to deal with, and the idea behind Orleans is that these things shouldn’t be difficult to write.
“The goal of Project Orleans,” explains Sergey Bykov, from the eXtreme Computing Group at MSR, “was to make building cloud services much easier than it has been and to make it so you write relatively straightforward code, but it can scale through several orders of magnitude. If it’s successful, you shouldn’t be required to throw away the code you wrote.”
Scaling Cloud Like a Farm, With Grains and Silos
To handle cloud workloads that need high throughput, low latency and efficient crosstalk, the team took their inspiration from the actor model of the 1970s, although you’re more likely to know it because Erlang used it in the 1990s.
Each of the objects you need to work with is treated as an actor — an agent that has its own isolated state. Actors don’t share any memory and they only exchange messages, so they’re easy to distribute. It doesn’t matter if another actor is on the same server or in a cluster on the other side of the country, because you have a way to pass messages built in to the system. That’s a much higher level of abstraction than the distributed objects in COM and CORBA and Enterprise Java Beans. And unlike some cloud app development systems (say, Google App Engine), Orleans is both asynchronous and single-threaded.
That makes it far easier to write. Developers don’t have to handle concurrency, locks, race conditions or the usual problems of distributed programming. They just work with actors, which Orleans calls grains. Grains are .NET objects and they live in runtime execution containers called silos; there’s one on each node. The Orleans framework will automatically create more silos when they’re needed, clean them up when they’re not and restart any that fail. It also handles creating the grains, recreating them in another silo if a node fails, and again, cleaning them up when they’re not needed. Messaging is done through .NET interfaces and Orleans uses the async/await pattern in C# to be asynchronous.
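To make the single-threaded actor idea concrete, here is a minimal sketch. It is written in Python with asyncio purely for brevity; Orleans itself is .NET, and the names used here (`CounterGrain`, `add`, `demo`) are hypothetical illustrations, not Orleans APIs. The point it tries to show is that when an actor handles one message at a time, its state can be updated without any locks, even if many callers talk to it concurrently.

```python
import asyncio


class CounterGrain:
    """A toy actor: private state, reachable only through messages."""

    def __init__(self):
        self._count = 0                   # isolated state, never shared
        self._mailbox = asyncio.Queue()   # incoming messages wait here
        self._worker = None

    def start(self):
        # A single worker per actor means one message is handled at a time.
        self._worker = asyncio.create_task(self._run())

    async def _run(self):
        while True:
            amount, reply = await self._mailbox.get()
            current = self._count
            # Yield to the event loop in the middle of the update. No other
            # message for this actor can run in the gap, so no lock is needed.
            await asyncio.sleep(0)
            self._count = current + amount
            reply.set_result(self._count)

    async def add(self, amount):
        # Callers send a message and await the reply; they never touch _count.
        reply = asyncio.get_running_loop().create_future()
        await self._mailbox.put((amount, reply))
        return await reply

    async def stop(self):
        self._worker.cancel()
        try:
            await self._worker
        except asyncio.CancelledError:
            pass


async def demo(callers=100):
    grain = CounterGrain()
    grain.start()
    # Many concurrent callers, each sending one "add 1" message.
    await asyncio.gather(*(grain.add(1) for _ in range(callers)))
    total = await grain.add(0)   # read the final count via another message
    await grain.stop()
    return total


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In this sketch every caller's increment is counted because the mailbox serializes the updates. A real runtime adds what the toy leaves out: placing actors on different servers, routing messages between them and recovering from failures.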
The Twist in Orleans is Making High Level Abstraction Efficient by Virtualizing the Actor
For grains, you have both the object class that defines them and the instance of the grain that does the work, but there’s also the physical activation of the actor that’s in memory. A grain that’s not being used still exists and you can still program against it, but until you call it, the grain is virtual. When you need it, the Orleans runtime creates an in-memory copy to handle your request; if that copy has been idle for a while, the runtime garbage collects it.
If you need it back, the framework can reactivate the grain, because the logical entity is always there. The same grain might get activated and deactivated in memory many times on many different machines in a cluster. It’s there when you need it, it’s not taking up memory when you don’t, and the state is handled by the framework without you having to remember to create or tear down anything. If a grain is stateless, the runtime can create multiple activations of it to improve performance.
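The activation lifecycle described above can be sketched in a few lines. This is a toy model under stated assumptions, not how the Orleans runtime is implemented: `ToyRuntime`, `PlayerGrain`, `call` and `collect_idle` are hypothetical names, a dictionary stands in for durable storage, and everything runs in one process. It only illustrates the idea that the logical grain is always addressable while its in-memory activation comes and goes.

```python
import time


class PlayerGrain:
    """Hypothetical grain holding one player's score."""

    def __init__(self, key, state):
        self.key = key
        self.score = state.get("score", 0)   # restore saved state, if any

    def add_points(self, points):
        self.score += points
        return self.score

    def snapshot(self):
        return {"score": self.score}


class ToyRuntime:
    """Keeps a grain addressable whether or not it is currently in memory."""

    def __init__(self, idle_seconds, clock=time.monotonic):
        self.idle_seconds = idle_seconds
        self.clock = clock
        self.storage = {}       # stand-in for durable storage
        self.activations = {}   # key -> (grain, time of last use)

    def call(self, key, method, *args):
        entry = self.activations.get(key)
        if entry is None:
            # The grain is "virtual": activate it on demand from saved state.
            grain = PlayerGrain(key, self.storage.get(key, {}))
        else:
            grain = entry[0]
        result = getattr(grain, method)(*args)
        self.activations[key] = (grain, self.clock())
        return result

    def collect_idle(self):
        # Deactivate grains that have been idle too long, saving their state.
        now = self.clock()
        for key, (grain, last_used) in list(self.activations.items()):
            if now - last_used >= self.idle_seconds:
                self.storage[key] = grain.snapshot()
                del self.activations[key]


if __name__ == "__main__":
    runtime = ToyRuntime(idle_seconds=0)
    print(runtime.call("alice", "add_points", 5))
    runtime.collect_idle()                          # activation is dropped
    print(runtime.call("alice", "add_points", 2))   # reactivated from storage
```

The caller only ever names the grain by its key and never creates or tears down anything, which is the part of the design that the sketch tries to capture.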
Orleans uses familiar design patterns, like dispatchers, to send a batch of messages in a single call and distribute them to the correct grains; hubs that give clients a single publishing endpoint to channel events from multiple sources; observers that monitor a grain and get notifications when it changes state; hierarchical reduction to efficiently aggregate values that are stored in many different grains; and timers that mean interactions between grains don’t have to be tied to the cadence of inputs from external systems.
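Of the patterns listed above, hierarchical reduction is perhaps the easiest to show in a few lines. The sketch below is an illustration only, in Python with asyncio, and the names (`LeafGrain`, `AggregatorGrain`, `build_tree`, `total`, `fan_in`) are hypothetical rather than anything from Orleans. Instead of one caller asking every grain for its value, intermediate aggregators each collect from a small group, so no single node has to talk to all of them.

```python
import asyncio


class LeafGrain:
    """Holds one value, such as a single player's score."""

    def __init__(self, value):
        self._value = value

    async def get_value(self):
        return self._value


class AggregatorGrain:
    """Sums its children, which may be leaves or other aggregators."""

    def __init__(self, children):
        self._children = children

    async def get_value(self):
        # Ask all children concurrently, then combine their partial results.
        parts = await asyncio.gather(*(c.get_value() for c in self._children))
        return sum(parts)


def build_tree(values, fan_in=4):
    """Group nodes fan_in at a time until a single root remains."""
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    if not values:
        raise ValueError("need at least one value")
    level = [LeafGrain(v) for v in values]
    while len(level) > 1:
        level = [
            AggregatorGrain(level[i:i + fan_in])
            for i in range(0, len(level), fan_in)
        ]
    return level[0]


async def total(values, fan_in=4):
    return await build_tree(values, fan_in).get_value()


if __name__ == "__main__":
    print(asyncio.run(total(list(range(1, 101)))))
```

With a fan-in of 4 and 100 leaves, each aggregator in this sketch talks to at most four children, and the tree is only a few levels deep.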
Many failures are handled automatically, because the runtime restarts silos and grains as necessary.
Orleans is ideal for building many typical cloud services. One of the proof of concept (POC) systems MSR built was a Twitter-style social network (called Chirper). Another was a linear algebra library to do the kind of vector matrix multiplications used in algorithms like PageRank and machine learning clustering, feature extraction and partitioning.
It’s particularly well suited to handling large graphs, like a social network graph that you need to query efficiently even though it’s distributed over many machines. But it’s also good for near-real-time analytics, streaming data from IoT sensors, as a mobile backend and as a smart distributed cache that sits in front of your more traditional storage system.
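As a rough illustration of the last use case, a grain acting as a cache in front of slower storage might look something like the sketch below. It is Python for brevity, and `SlowStore` and `CacheGrain` are hypothetical names, not Orleans types. The assumption is one grain per key, with a write-through policy so the backing store always has the latest value.

```python
class SlowStore:
    """Stand-in for a traditional storage system; counts its accesses."""

    def __init__(self, rows):
        self.rows = dict(rows)
        self.reads = 0
        self.writes = 0

    def read(self, key):
        self.reads += 1
        return self.rows.get(key)

    def write(self, key, value):
        self.writes += 1
        self.rows[key] = value


class CacheGrain:
    """One grain per key: the first read loads from storage, later reads hit memory."""

    def __init__(self, key, store):
        self._key = key
        self._store = store
        self._loaded = False
        self._value = None

    def get(self):
        if not self._loaded:
            self._value = self._store.read(self._key)
            self._loaded = True
        return self._value

    def set(self, value):
        # Write-through: update storage first, then the in-memory copy.
        self._store.write(self._key, value)
        self._value = value
        self._loaded = True


if __name__ == "__main__":
    store = SlowStore({"player-1": 10})
    grain = CacheGrain("player-1", store)
    grain.get()
    grain.get()
    print(store.reads)   # storage was read once; the second get came from memory
```

Because a single grain owns each key and, in the actor model, handles one request at a time, reads and writes for that key pass through one place rather than racing between a separate cache and the store.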
There are plenty of other tools for building these kinds of systems, but very few that let you work at a high level without having to implement your own logic for scaling and concurrency and still get really high performance, which is why Orleans has been getting a lot of interest.
Open Source for Confidence and Community
Orleans is a perfect example of how Microsoft is using open source strategically, to improve key technologies and give developers the confidence to build on them. This isn’t a side project or any kind of abandonware; Orleans is a powerful system that’s actively being worked on and is being used for major projects inside Microsoft.
Take the new Microsoft ‘trillion events per day’ in-memory streaming analytics engine for handling big data, Trill. Trill is a .NET library used for analytics on Bing, is currently speeding up Bing Ads, and powers the query processing in Azure Stream Analytics. MSR vice president Jeannette Wing has called it out as one of MSR’s success stories and it’s being used in a range of Microsoft services. For a number of those, to get scale, Trill is being hosted inside Project Orleans, for example, to handle large numbers of independent, long-standing queries against real-time streaming data and generate alerts.
There aren’t many details about Trill, and it’s still a research project that you can’t get your hands on unless you work at Microsoft. However, you can see Trill hosted inside Orleans in a session from Microsoft’s 2014 Build conference, where Paolo Salvatori from the Azure team uses them together to build a real-time analytics service to handle events from IoT sensors using the Azure Service Bus.
Each instance of Trill is inside an Orleans grain, so the system can scale up and down to handle input from many devices. (The term grain might make you think of small objects, but there aren’t any restrictions on the size of a grain, so they can hold a lot of state.)
The first public preview of Project Orleans was released at the Build conference in 2014, to lots of enthusiasm. Tapping that enthusiasm is why the team decided to take it open source. Developers who were working with Orleans liked the fact that many of their bugs and suggestions made it into the September refresh, but they weren’t as keen on waiting months for fixes. In late January, the Orleans source code was released on GitHub under the MIT license, and it was getting pull requests the same day.
Open sourcing Orleans is part of the pattern we’re seeing from Microsoft on development frameworks for building services. Just as we’ve heard from the .NET team, open source is what developers expect — even from Microsoft. “Services are very different from boxed software,” points out Sergey Bykov.
Open source is the norm for the services stack. Everyone needs the assurance that they can fix an issue fast if necessary, and not have to wait for an official patch. Most people will never make a fix to the stack, but they still need the assurance that they can if necessary. Open source is that assurance.
Plus, Bykov is excited about getting the community to help improve Orleans. “Scaling out development is an equally important reason [to go open source]. There are way more talented and experienced people out there than you can ever hire. If you manage to make them part of the ‘extended family’ team, you can do so much more, and faster.”
As usual, the community contributions range from simple code cleanup to tracking down some complex bugs in the code. That’s exactly what Bykov was hoping for and it’s why you can see the Orleans team on GitHub, proposing API changes, debating project goals and being very clear about the work they still have to do, for example to allow developers to write runtime extensions to Orleans as well as use it to develop their own applications.
“You have to treat these people truly as your peers. You have to listen to them, discuss issues and trade-offs, disagree with them. You can’t be above them and give them orders. They are just like you, but collectively smarter and more experienced than you can ever be.”