LinkedIn Stands Up a Private Cloud to Speed Development

Hoping to reduce development and hardware costs, and to set up a pipeline that speeds the delivery of new features, LinkedIn has launched its own private cloud, called LPS (LinkedIn Platform as a Service).
“LPS presents an entire data center as a single resource pool to application developers. It allows them to deploy their own applications in minutes, with zero tickets,” wrote Steven Ihde, director of engineering at LinkedIn, in a blog post announcing the initiative. “This allows developers to focus on building, not wasting their time finding machine resources or waiting to deploy.”
“We’re excited to be taking automation to the next level,” Ihde said in a follow-up interview. “We are now automating the whole process of managing applications.”
Ihde noted LinkedIn already had good automation for low-level processes like updating software builds. Just two months ago, the company announced the automation of a rapid release pipeline. But more complex operations like setting up new apps required a ticket-based system with people making manual interventions.
“By the time all of the systems that make up LPS are complete, our entire hosting environment will exist as a single holistic system where our internal users can bring a new service online with only API calls, not multiple JIRA tickets,” Ihde wrote in the blog post.
That certainly is next-level.
LinkedIn has almost a thousand services and hundreds of thousands of service instances that make the website run. LPS replaces many of the manual or semi-automated processes for allocating resources, deciding what to run where, and making the most efficient use of the hardware.
“We’ve been exploring ways to get more resources out of a smaller hardware footprint while increasing productivity by making our software stack more application-oriented,” wrote Ihde. “LPS has also reduced the hardware footprint for some workloads by 50 percent or more. In short, this new internal platform will allow our engineers to be more productive, flexible, and innovative while saving the company money.”
The Beginning
Engineers set forth the criteria for what their new ideal hosting environment would need to do in practical terms:
- Enable service owners and engineers to manage the lifecycle of their services.
- Relentlessly optimize our infrastructure for maximum performance while maintaining robustness and adaptability.
- Automatically compensate for human error and unexpected events by bringing more applications or resources online to maintain high availability.
- No “hacks” or extra technical debt incurred to achieve the above points.
LPS is a custom cloud solution. Engineers started by pulling in existing LinkedIn services, including Nuage, inGraphs, and AutoAlerts, to “provide the functionality for automatically provisioning data stores, providing operational and performance metrics, and monitoring applications to ensure that new application instances are spun up when they are needed.”
Even though LPS is custom, it leverages open source solutions where possible. After reviewing several options, including Docker and LXC, the company settled on runC, which fit best with the custom applications LinkedIn already had in use.
LPS is multi-faceted, but there are four pieces that LinkedIn detailed in the launch: Rain, RACE, Orca and Maestro.
Rain
Rain “combines containerization resource limitation with host allocation in order to safely run a job on a host. Rain figures out where to run it and how to run it,” Ihde said.
On his blog, he went into more detail, calling Rain “LinkedIn’s answer for resource allocation and containerization, which uses Linux cgroups and namespaces directly, and also takes advantage of libcontainer via runC.”
“It is designed not only to provide a resource guarantee and security isolation for applications but also to integrate seamlessly with our existing infrastructure,” Ihde wrote. “Before deploying Rain, it sometimes took more than two days to deploy a service. After Rain, that time was reduced to 10 minutes, a time savings of 95 percent.”
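LinkedIn has not published Rain’s internals, but the building blocks Ihde names, cgroups, namespaces, and runC via libcontainer, are all open. As a rough illustration only, a launcher in this style might generate an OCI bundle config with resource limits and hand it to the runc CLI; the helper names, field values, and bundle layout below are assumptions, not Rain’s actual design:

```python
import json
import subprocess
from pathlib import Path

def make_bundle(bundle_dir: str, command: list, mem_bytes: int, cpu_shares: int) -> None:
    """Write a minimal OCI runtime config.json with cgroup limits.

    Illustrative only: the field names follow the public OCI runtime spec
    that runC consumes, not Rain's internal format.
    """
    config = {
        "ociVersion": "1.0.0",
        "process": {"args": command, "cwd": "/", "user": {"uid": 0, "gid": 0}},
        "root": {"path": "rootfs", "readonly": True},
        "linux": {
            # cgroups supply the resource guarantee Ihde mentions
            "resources": {
                "memory": {"limit": mem_bytes},
                "cpu": {"shares": cpu_shares},
            },
            # namespaces give the job an isolated view of the host
            "namespaces": [{"type": "pid"}, {"type": "mount"}, {"type": "network"}],
        },
    }
    Path(bundle_dir, "config.json").write_text(json.dumps(config, indent=2))

def run_job(bundle_dir: str, container_id: str) -> int:
    """Hand the bundle to runC; `runc run` blocks until the job exits."""
    return subprocess.run(
        ["runc", "run", "--bundle", bundle_dir, container_id]
    ).returncode
```

The cgroup limits under `linux.resources` are what deliver the resource guarantee Ihde describes, while the namespace list supplies the security isolation.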
RACE
RACE (Resource Allocation and Control Engine) is a “suite management system,” Ihde said. RACE takes Rain to the next level: a service may need a large number of instances, and RACE manages them automatically, adding or subtracting as necessary to make sure the right number is running at any given time. It also redirects traffic in response to failures or demand surges.
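RACE’s API is not public, but the behavior described, keeping the right number of instances healthy and reacting to failures or surges, maps onto a standard reconciliation loop. A minimal sketch of that pattern, in which every helper (`count_healthy`, `start_instance`, `stop_instance`) is a hypothetical stand-in rather than anything RACE exposes:

```python
import time

def reconcile(service: str, desired: int, count_healthy, start_instance, stop_instance) -> None:
    """One pass of a RACE-style control loop (illustrative, not LinkedIn's code)."""
    healthy = count_healthy(service)
    if healthy < desired:
        for _ in range(desired - healthy):
            start_instance(service)   # scale up after failures or demand surges
    elif healthy > desired:
        for _ in range(healthy - desired):
            stop_instance(service)    # scale down when demand falls

def control_loop(desired_counts: dict, count_healthy, start_instance,
                 stop_instance, poll_seconds: float = 10.0) -> None:
    """Continuously drive every managed service toward its desired instance count."""
    while True:
        for service, target in desired_counts.items():
            reconcile(service, target, count_healthy, start_instance, stop_instance)
        time.sleep(poll_seconds)
```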
Orca
The name Orca is a play on the term orchestration, and the app is focused on orchestrating one-off jobs. “One capability that engineers have often asked for is the ability to spin up a large number of hosts for short run experiments and other jobs,” Ihde wrote.
In using Rain and RACE, engineers realized LPS already had automated pieces for creating jobs, resource allocation, running jobs, and recording results, but these systems were not tied together.
Orca allows engineers to take advantage of temporarily available resources to run experiments and other short-term projects, and it is expected to handle testing and to replace the current process for provisioning short-run jobs.
Currently, Orca runs 2,000 jobs per day, and Ihde estimates that number will increase to 50,000 jobs by year end.
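Ihde describes Orca as tying together steps that already existed separately: creating a job, allocating resources, running it, and recording results. A hypothetical sketch of that flow, where all the injected callables are assumptions for illustration, not Orca’s interface:

```python
from dataclasses import dataclass

@dataclass
class Job:
    """A short-run experiment: what to run and how many hosts it wants."""
    job_id: str
    command: list
    hosts_wanted: int

def run_one_off(job: Job, find_spare_hosts, launch, collect_result, record) -> None:
    """Tie together the four steps Ihde says existed separately (illustrative).

    Allocation borrows temporarily idle capacity; execution could go
    through Rain; results are persisted for the experimenter.
    """
    hosts = find_spare_hosts(job.hosts_wanted)                # resource allocation
    if not hosts:
        record(job.job_id, status="deferred")                 # nothing free right now
        return
    handles = [launch(host, job.command) for host in hosts]   # run the job
    results = [collect_result(handle) for handle in handles]  # wait for completion
    record(job.job_id, status="done", results=results)        # record results
```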
Maestro
Maestro is the “conductor” of the LPS symphony, providing a “global view of the LPS system” that enables it to “manage every aspect of an application’s configuration on our platform,” Ihde wrote.
Maestro is “bringing together a bunch of resources for a long-term process, for example, allocate a database, or register a new service; it connects all the pieces together,” Ihde said. Intended as a “one-stop shop,” Maestro maintains a persistent store of settings and configurations for a platform-enabled application.
The persistent store “provides the data, the plan, and the execution model for deploying applications to LPS. The blueprint defines the aspirational state for an application and Maestro ‘conducts’ by taking the necessary actions to make reality in the data center match the aspiration … By building control plane APIs for every one of these systems, we can automate the process of responding to events like sudden spikes in demand or network interruption.”
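The blueprint language describes declarative, desired-state management: the store declares what an application should look like, and Maestro issues control-plane calls until reality matches. A minimal sketch of that general pattern under assumed names (`observe` and `control_plane` are hypothetical, not Maestro’s API):

```python
def converge(blueprint: dict, observe, control_plane) -> None:
    """Drive data-center reality toward the blueprint's aspirational state.

    Illustrative sketch: `blueprint` maps resource names to desired specs
    (a database to allocate, a service to register), `observe` reports what
    actually exists, and `control_plane` exposes create/update/remove calls.
    """
    actual = observe()
    for resource, spec in blueprint.items():
        if resource not in actual:
            control_plane.create(resource, spec)   # e.g. allocate a database
        elif actual[resource] != spec:
            control_plane.update(resource, spec)   # settings drifted; reconcile
    for resource in actual:
        if resource not in blueprint:
            control_plane.remove(resource)         # no longer declared; tear down
```

Running a loop like this against control-plane APIs for every subsystem is what would let the platform react automatically to demand spikes or network interruptions, as Ihde describes.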
Not everything will be moved to LPS. For example, there are no plans to move the company’s Oracle databases. But Ihde wrote that most in-house applications and “associated infrastructure services currently used like Kafka and Pinot will eventually all find a home on this new platform.”
Open Source?
Ihde has plans to review parts of LPS that could be released to the open source community. There’s no timetable for this yet, but he expects they will be lower-level pieces, starting with Rain or pieces of Rain.
“The lower down on the stack, the fewer dependencies, so the more useful the pieces are to other people,” he explained. RACE, Orca and Maestro all use Rain to run jobs, so it makes the most sense to start with Rain. Additionally, he suggested releasing “stuff around creating Rain in Python”; the team has “built some things in Python around Docker and runC, which are examples of what might be useful to the open source community.”
What excites the team at LinkedIn is that they have only begun to explore the possibilities of LPS.
One interesting possibility involves Nuage: LPS could bring cooperation between the online and offline worlds. The automation LPS provides could mean sharing hardware resources between data-processing jobs and online serving jobs, with LPS reallocating resources from online serving to development or data-processing jobs during periods of lower traffic. The potential savings could be tremendous.
Docker is a sponsor of The New Stack.
Feature Image: the Grand Canyon. Photo by T.C. Currie.