It was called “Google’s Secret Weapon” by a March 2013 Wired article, which was ostensibly about the open source project at Twitter inspired by it. It is (or just perhaps, “we are”) Borg, the data center process scheduling and orchestration system quite obviously named for the first name in the title of the Indianapolis 500 trophy.
Until now, what was known about Borg outside of Google had to be gleaned from an examination of Apache Aurora, an examination The New Stack conducted back in February. At that time, Google would not go so far as to acknowledge Borg’s existence, declining to respond to requests for comment.
Now, in a paper for next week’s EuroSys 2015 conference in Bordeaux, a team of current and former Google engineers has published the first public acknowledgment that the company has been using a home-grown task orchestration system, as opposed to sorcery, for handling millions of simultaneous processes. And in a surprisingly candid assessment, the team compares Borg’s architecture to that of Kubernetes, the open source container orchestration system currently stewarded by Google, and draws some valuable conclusions about design choices they would not repeat if given a second chance.
What We Now Know
The Google team’s paper confirms an observation Aurora lead engineer Bill Farner made in our story last February: Borg was a brilliant idea that was becoming too complex to manage. Here is a summary of the paper’s main points:
- A Borg cell is a group of about 10,000 machines, managed as a single unit. A large cell is typically the major inhabitant, sometimes the sole one, of a cluster of machines that resides within a single building.
- Programs that share dependencies and runtime environments are packaged together, along with their shared databases.
- A job is the basic unit of functionality in Borg. It can be triggered from the command line, so it has a handle. It’s defined by a set of attributes called constraints, which act like enforceable policies governing, for example, which classes of hardware it’s allowed to run on and which classes are disallowed. Each job is assigned to a single cell.
- The subdivision of labor within a job is the task, each of which runs inside a Linux control group (“cgroup”) container. Tasks are scheduled, not jobs. Tasks are tagged for their latency sensitivity, which serves as a kind of prioritization scheme: The tasks most easily compromised by latency are considered high priority. In a rather Darwinian environment where the whiniest creatures survive, highly sensitive tasks quite frequently beat out the lower-class, or batch, tasks for execution. (Life in Borg is, as they say, a batch.)
- The central controller of a cell is the designated Borgmaster, which governs agent processes, called Borglets, on every machine in the cell. Determining the Borgmaster of a cell is accomplished by way of a fair and democratic election, using an algorithm first developed at Digital Equipment Corp. called Paxos, named for an Aegean island that operated with only a part-time parliament (PDF). A Borglet’s duties include starting and stopping tasks, maintaining the state of local resources, and periodically reporting the machine’s active state to the Borgmaster. (Google would never confirm this, but you just know that there’s a Borglet machine somewhere with a Jeri Ryan poster taped to it.)
- Submitted jobs are kept on a kind of spike, with high-priority (latency-sensitive) tasks on top. A scheduler takes tasks from this queue and searches for machines that fit the bill. Depending upon the new task’s constraints, the scheduler can pre-empt a lower-priority task that is already running. Machines determined to be feasible for the task are then scored based on their relative availability of resources.
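The Paxos-based election among Borgmaster replicas can be illustrated with a single-decree sketch. This is a bare-bones, in-process toy, not Google’s implementation; the class names and the five-acceptor setup are illustrative assumptions.

```python
class Acceptor:
    """One voter in the election. Remembers the highest proposal number it
    has promised, and the last (number, value) pair it accepted."""
    def __init__(self):
        self.promised = -1
        self.accepted = None  # (proposal_number, value) or None

    def prepare(self, n):
        # Phase 1: promise to ignore proposals numbered below n
        if n > self.promised:
            self.promised = n
            return True, self.accepted
        return False, None

    def accept(self, n, value):
        # Phase 2: accept unless a higher-numbered promise was made since
        if n >= self.promised:
            self.promised = n
            self.accepted = (n, value)
            return True
        return False


def elect(proposal_n, candidate, acceptors):
    """Try to get `candidate` chosen as leader; returns the chosen value
    (which may be an earlier winner) or None if no majority was reached."""
    promises = [a.prepare(proposal_n) for a in acceptors]
    granted = [prior for ok, prior in promises if ok]
    if len(granted) <= len(acceptors) // 2:
        return None  # no majority of promises
    # Safety rule: if any acceptor already accepted a value, propose that
    # one instead of our own candidate
    prior = [p for p in granted if p is not None]
    value = max(prior)[1] if prior else candidate
    votes = sum(a.accept(proposal_n, value) for a in acceptors)
    return value if votes > len(acceptors) // 2 else None
```

The safety rule in the middle is what makes the election “fair”: once a majority has accepted one Borgmaster, a later proposer with a higher number converges on the same winner rather than overturning it.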
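The scheduling pass described above — pop the highest-priority task, filter machines for feasibility against its constraints, then score the feasible ones — can be sketched as follows. The dictionary fields (`cpu`, `ram`, `hw`) and the worst-fit scoring rule are simplifying assumptions, not details from the paper.

```python
import heapq

def schedule(tasks, machines):
    """Simplified Borg-style pass: highest-priority task first, feasibility
    check against constraints, then score feasible machines."""
    # heapq is a min-heap, so negate priority; the index breaks ties
    queue = [(-t["priority"], i, t) for i, t in enumerate(tasks)]
    heapq.heapify(queue)
    placements = {}
    while queue:
        _, _, task = heapq.heappop(queue)
        # Feasibility: machine satisfies the task's hardware constraint
        # (if any) and has enough free resources
        feasible = [m for m in machines
                    if m["free_cpu"] >= task["cpu"]
                    and m["free_ram"] >= task["ram"]
                    and task.get("hw", m["hw"]) == m["hw"]]
        if not feasible:
            continue  # a real scheduler could pre-empt lower-priority tasks here
        # Scoring: prefer the machine with the most free resources
        best = max(feasible, key=lambda m: m["free_cpu"] + m["free_ram"])
        best["free_cpu"] -= task["cpu"]
        best["free_ram"] -= task["ram"]
        placements[task["name"]] = best["name"]
    return placements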
“The vast majority of the Borg workload does not run inside virtual machines,” the Borg team writes, “because we don’t want to pay the cost of virtualization.”
The team does not go into detail as to what that cost consisted of, but it’s a very safe bet they were referring to 1) an extra hypervisor layer, which leads to 2) a complex division of labor along two separate scales, limiting overall system scalability, not to mention 3) the monetary cost in procuring, supporting, and maintaining the necessary hardware.
(VMware makes an effort to debunk what it calls the “exaggerated” idea of the virtualization tax. But imagine these performance costs increasing exponentially as they scale from ordinary enterprise workloads to Google-size workloads.)
The team did provide essential detail about one critical design decision that influenced Kubernetes, the Docker-oriented scheduling system conceived after Borg and its successor, Omega. Providing no way to group related jobs together, they write, was a mistake that forced users to resort to hacks built atop homegrown management tools; Kubernetes avoids this pitfall by letting users attach arbitrary labels to any object, including its own version of the scheduling unit, the “pod.” (That extra abstraction, it turns out, aids Kubernetes with load balancing.) Likewise, assigning a single IP address to each machine, rather than to each pod as Kubernetes does, forced each Borglet to manage the duty of port isolation on top of everything else.
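The label mechanism that fixed Borg’s grouping mistake is simple to picture: every object carries arbitrary key/value labels, and a selector picks out whatever matches. A minimal sketch, with hypothetical label keys (`app`, `tier`) not taken from the article:

```python
# Toy pod records; in Kubernetes, labels live in each object's metadata
pods = [
    {"name": "web-1", "labels": {"app": "shop", "tier": "frontend"}},
    {"name": "web-2", "labels": {"app": "shop", "tier": "frontend"}},
    {"name": "db-1",  "labels": {"app": "shop", "tier": "storage"}},
]

def select(pods, **selector):
    """Return names of pods whose labels match every key/value pair in the
    selector, the way a Kubernetes service picks its backing pods."""
    return [p["name"] for p in pods
            if all(p["labels"].get(k) == v for k, v in selector.items())]

print(select(pods, app="shop", tier="frontend"))  # ['web-1', 'web-2']
```

Because any selector defines a group after the fact, users never have to bake grouping into names or maintain side-channel tooling, which is exactly the hack Borg’s design forced on them.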
Even the good lessons had their dark side. Borg’s deliberately copious debugging information makes it feasible, the team writes, to introspect deeply into the cause of a fault.
“Although Borg almost always ‘just works,’ when something goes wrong, finding the root cause can be challenging,” the team writes.
Results of stress tests the team conducted on Borg cells reveal robustness and resilience. For instance, each task added to a machine appeared to increase the clock cycles consumed per instruction (CPI) by only about 0.3 percent. But that fact revealed, by process of deduction, that the variances in CPI recorded during everyday processing must be caused by other factors, such as the design of certain applications. Put another way, the design of one app’s tasks could have as much as 16 times greater impact on the performance of an entire Borg cell than the average performance cost of simply adding tasks to a machine’s schedule.
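To put the first figure in perspective, the per-task cost compounds, yet stays modest even on a heavily packed machine. A back-of-the-envelope check, where the task count of 50 is a hypothetical packing level, not a number from the paper:

```python
# Compound the article's ~0.3% per-task CPI increase over many tasks
per_task = 1.003        # each added task multiplies CPI by about 1.003
tasks_added = 50        # hypothetical number of tasks packed on one machine
cpi_growth = per_task ** tasks_added - 1
print(f"{cpi_growth:.1%}")  # about 16% cumulative CPI growth
```

Even fifty co-scheduled tasks cost the machine only a modest fraction of its throughput, which is why application design, not packing density, dominates a cell’s performance.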
Borg is still used within Google, evidently, as are Omega and Kubernetes; all three are referred to in the present tense. It’s part of Google’s evolutionary design principle to let newer versions of system processes co-exist with the older ones, phasing the older ones out in batches. Still, what’s amazing to realize is that Borg was not originally designed to be a distributed system orchestrator, but eventually became one.
“For example, we split off the scheduler and the primary UI (Sigma) into separate processes, and added services for admission control, vertical and horizontal autoscaling, re-packing tasks, periodic job submission (cron), workflow management, and archiving system actions for off-line querying,” the team writes. “Together, these have allowed us to scale up the workload and feature set without sacrificing performance or maintainability.”
Kubernetes’ role as a container orchestrator came about as an evolved form of a function Borg was not supposed to have.
Google’s secrecy surrounding Borg’s design was partly out of a need to protect its intellectual property against fierce competition, and partly just Google mimicking Apple’s way of generating interest in a topic by not talking about it. It almost worked, but ironically, general interest in Borg began to germinate after Twitter’s Aurora, inspired by Borg, was submitted as an Apache project.
Now a sizable chunk of the history of modern hyperscale systems design has taken its rightful place in the public record.
Feature image via Flickr Creative Commons.