Twitter’s Aurora and How it Relates to Google’s Borg (Part 1)

Here is part one of our series about Twitter’s Aurora project. In part two, Scott explores how Aurora is useful for any company, no matter its size.
The information jobs for the remainder of the 21st century will not be managed by operating systems. Today, we perceive Twitter as one of a very few examples of services that run at “Internet scale” — at a scale so large that the size of its domain is meaningless. Yet Twitter is actually an example of what one day, within most of our lifetimes, will be considered an everyday job, the sort of thing you expect networks of clustered servers numbering in the tens of thousands to do.
Twitter is building a service automation platform called Aurora. It isn’t done; its current version number is 0.7. It is a job control system of sorts, but rather than controlling the server that runs the job — as operating systems used to do back when they were in command of the data center — it controls the job that indentures the servers.
But unlike Google, the premier “Internet-scale” service, Twitter is building Aurora in the open.
The intent is to model a world where the Twitters, Googles, and Facebooks are no longer superpowers… just powers.
One Big Operating System
“Platforms like Twitter operate across tens of thousands of machines, with hundreds of engineers deploying software daily,” reads a post last week to Twitter’s Engineering Blog. “Aurora is software that keeps services running in the face of many types of failure, and provides engineers a convenient, automated way to create and update these services.”
The inspiration for Aurora was an internal project at Google, entitled (as is Google’s wont) Borg. It was one of that company’s first efforts at automating services across its entire global infrastructure.
Bill Farner, Aurora’s lead engineer, previously worked at Google and personally observed Borg’s metastasis. In an e-mail exchange with The New Stack, Farner notes that Google was all-in on investment in Borg from the very beginning — that essentially, its goal was to assimilate Google’s existing infrastructure as rapidly as possible into a new and more automated method of operation.
“But, in my estimation, the software of Borg itself became very complex to manage,” says Farner. “In that respect, when developing Aurora, there were certain learnings that we [Twitter] were able to borrow from Borg about some things we should do, but more importantly, some things we should not implement or aim to support.”
Twitter’s approach to Aurora is somewhat different, first for its more cautious attitude in the early going, but also for its transparency. Aurora is being built as a framework for Apache Mesos, the project whose aim is to build an operating system for Internet-scale jobs, one layer of abstraction removed from the underlying hardware.
In a YouTube video produced last September, Two Sigma systems architect David Greenberg explained Mesos by analogy with Linux. In a Linux system, he reminded attendees of the StrangeLoop conference in St. Louis, process isolation is accomplished when the operating system gives a process its own memory address space, an exclusive set of file descriptors, and its own standard output (stdout) stream for its default data.
Mesos, which Greenberg described as an operating system in its own right, achieves process isolation at Internet scale by running jobs within Linux containers. Such containers, made far more prominent in recent months by the astonishing success of Docker, grant exclusive CPU shares, memory, and namespaces to the processes contained within them.
“So when we run our applications inside of these Linux containers on Mesos,” stated Greenberg, “then you can run two or three or four different systems on the same physical machine, and the Linux containers will ensure that each one gets its allocated quota, and that each one is not able to starve the others of network, or of disk, or memory or CPU.”
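To make the quota idea concrete, here is a rough sketch of the same mechanism using the Docker SDK for Python (Docker being the container technology name-checked above, not necessarily what Mesos uses under the hood); the image and the limit figures are purely illustrative.

```python
# A hedged illustration of container resource quotas, assuming the Docker
# SDK for Python ("pip install docker") and a local Docker daemon.
import docker

client = docker.from_env()

# Run a throwaway container with hard caps, so it cannot starve its
# neighbours on the same physical machine of memory or CPU.
output = client.containers.run(
    "ubuntu:14.04",      # illustrative image
    "echo hello from an isolated container",
    mem_limit="256m",    # hard memory ceiling for this container
    cpu_period=100000,   # CFS scheduling period, in microseconds
    cpu_quota=50000,     # at most half a core per period
    remove=True)         # discard the container once the command exits

print(output.decode())
```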
Think of Mesos as reverse virtualization. A virtual machine builds an abstraction layer inside a physical machine, within which is encapsulated a separate operating system with its own set of applications. Often a VM has its own desktop, and drivers that think they’re running hardware directly when they’re really not.
Mesos is an abstraction layer outside physical machines, subsuming them so they act as pools of indistinct servers. This way, tasks can be distributed across these servers with fault tolerance and controlled redundancy. Rather than a desktop, Mesos exposes an open API that enables it to be monitored remotely through any standard Web client.
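For a sense of what that API looks like from a developer’s chair, here is a heavily abridged sketch of a Mesos “framework,” assuming the classic Mesos Python bindings (mesos.interface and mesos.native). The scheduler below simply accepts whatever resources the master offers and launches a single one-line task; the names, figures and master address are illustrative, and error handling is omitted.

```python
# A hedged sketch of a minimal Mesos framework scheduler, assuming the
# classic Python bindings shipped with Mesos at the time.
from mesos.interface import Scheduler, mesos_pb2
from mesos.native import MesosSchedulerDriver


class EchoScheduler(Scheduler):
    def __init__(self):
        self.launched = False

    def resourceOffers(self, driver, offers):
        # The master periodically offers spare CPU and memory from the pool
        # of machines; a framework accepts some and declines the rest.
        for offer in offers:
            if self.launched:
                driver.declineOffer(offer.id)
                continue

            task = mesos_pb2.TaskInfo()
            task.task_id.value = "echo-1"
            task.slave_id.value = offer.slave_id.value
            task.name = "echo"
            task.command.value = "echo 'hello from a Mesos task'"

            cpus = task.resources.add()
            cpus.name = "cpus"
            cpus.type = mesos_pb2.Value.SCALAR
            cpus.scalar.value = 0.1          # a sliver of one core

            mem = task.resources.add()
            mem.name = "mem"
            mem.type = mesos_pb2.Value.SCALAR
            mem.scalar.value = 32            # MB

            driver.launchTasks(offer.id, [task])
            self.launched = True


framework = mesos_pb2.FrameworkInfo(user="", name="echo-framework")
driver = MesosSchedulerDriver(EchoScheduler(), framework,
                              "zk://localhost:2181/mesos")  # illustrative master
driver.run()
```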
Aurora becomes the task master in this environment. Just as an operating system assigns process threads to individual cores in a single computer, Aurora deploys jobs (indeed, “job” is the technical term) to multiple clusters in a network. But unlike the conventional operating system model, for Twitter’s purposes, Aurora can deploy jobs in stages. Rather than forcing Mesos to orchestrate the same version of a job across its domain, Aurora can trickle out newer versions to designated failure domains, with the ability to roll back a deployment if it’s found to be causing trouble.
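Aurora’s own job definitions are written in a Python-based configuration language, and the staged-rollout behavior described above is declared right alongside the job. The sketch below is assembled from Aurora’s documented stanzas rather than from anything Twitter has published about its production jobs; the command, resource figures and batch settings are illustrative.

```python
# hello_world.aurora -- a hedged sketch of an Aurora job definition.
hello = Process(
    name = 'hello_world',
    cmdline = 'python hello_world.py')   # hypothetical service entry point

hello_task = Task(
    processes = [hello],
    resources = Resources(cpu = 1.0, ram = 128*MB, disk = 64*MB))

jobs = [Service(
    cluster = 'example-cluster',
    role = 'www-data',
    environment = 'prod',
    name = 'hello_world',
    instances = 20,
    task = hello_task,
    update_config = UpdateConfig(
        batch_size = 2,               # roll the new version out two instances at a time
        watch_secs = 45,              # watch each batch before moving on
        max_per_shard_failures = 1,   # retries allowed per instance
        max_total_failures = 3))]     # give up (and allow rollback) past this point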
The Human Factor
How does Aurora handle situations where two or more versions of a job are running concurrently within the same domain? Isn’t it possible that a newer version of a task, leaked out into the same system as its own predecessors, could wreck the broader job? Imagine if Microsoft Windows had to manage multiple versions of the .NET Framework continually. (Oh, wait a minute…)
I put these questions to Twitter’s Farner, who responds that some of these factors are still managed by human beings.
“There are a handful of factors here, not the least of which is diligent API and data model management,” says Farner.
Sanity can be injected into the process, he remarks, through the strict use of interface description languages.
Twitter’s IDL of choice is Apache Thrift, which spells out the types of data that functions can expect to receive as input, the explicit types of function output, and the domains to which these expectations may be restricted. An IDL allows functions to change and evolve naturally, while maintaining fixed expectations about the data that will be exchanged.
Put another way, an IDL is a common language for the contracts to which modern services are bound. Alternatively, adds Farner, jobs can utilize more complex data serialization systems, which define richer schemas or structures that sets of exchanged data should follow. As examples, Farner offers Google Protocol Buffers, which uses a form of source code to define more structured data representations, and Apache Avro, which uses JSON to define rich data schemas with dynamic rather than static data typing, and without generating code.
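The practical payoff of these contracts is schema evolution: a newer producer can add an optional field without breaking an older consumer that is still running elsewhere in the fleet. The sketch below is plain Python rather than actual Thrift, Protocol Buffers or Avro code, and the field names are invented; it only illustrates the rule those systems enforce.

```python
# A plain-Python illustration of the contract idea behind Thrift, protobuf
# and Avro: consumers read only the fields they know, and new fields are
# optional with defaults, so old and new versions can coexist mid-rollout.

V1_FIELDS = {"user_id", "text"}       # the original contract
V2_DEFAULTS = {"lang": "und"}         # a new optional field with a default

def encode_v2(user_id, text, lang=None):
    """A newer producer: adds 'lang' but never removes or retypes old fields."""
    message = {"user_id": user_id, "text": text}
    if lang is not None:
        message["lang"] = lang
    return message

def decode_v1(message):
    """An older consumer: reads only the fields it knows, ignores the rest."""
    return {field: message[field] for field in V1_FIELDS}

def decode_v2(message):
    """A newer consumer: fills in defaults when reading from an older producer."""
    decoded = dict(V2_DEFAULTS)
    decoded.update(message)
    return decoded

# Old and new code can safely exchange messages during a staged update.
new_msg = encode_v2(42, "hello", lang="en")
old_msg = {"user_id": 7, "text": "hi"}          # produced by version-1 code
assert decode_v1(new_msg) == {"user_id": 42, "text": "hello"}
assert decode_v2(old_msg)["lang"] == "und"
```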
The broader motivator for developers tasked with helping ensure that staged upgrades progress smoothly, adds Farner, “is acceptance that distributed serving systems cannot safely be updated atomically. It would be very risky to flip a switch and begin sending traffic to new code. Instead, graceful updates allow one to observe failures early, and in a way that only affects a fraction of traffic.”
The key here is observation. While Aurora presents a degree of automation that distributed systems may have never had before, it also introduces a new mandate for human intervention, monitoring, and the implementation of best practices — functions that a fully “cybernetic” automation system, implied by the original inspiration for the “Borg” name, may have lacked.
Since different server clusters present varying performance profiles (latencies, access times, availability), how can Aurora account for these differences in real time without negatively impacting the services themselves? This is one of the questions I’ll be asking Twitter’s Bill Farner in Part 2 of our discussion.
Image via Flickr Creative Commons.