Salesforce, Comcast Test Kubernetes for Massively Scalable Workloads
While Kubernetes is currently riding high on top of the hype cycle, the real proof of its utility will be enterprise adoption. At the Cloud Native Computing Foundation’s recent Kubecon Europe event, engineers from both Salesforce and Comcast revealed how their respective companies are testing Kubernetes, with an eye towards using the open source container orchestration engine software to help them scale applications and even the workforce used to manage these applications.
For several years now, Salesforce, which has been running a Kubernetes pilot since January, has been driving towards a software-driven “NoOps” approach, an autonomous delivery model “where the human’s job is to simply set the goal, and the software does the rest,” explained Salesforce Principal Architect Steve Sandke, discussing the company’s deployment in a presentation.
A growing SaaS provider, Salesforce has been steadily adding to its infrastructure. Between 2014 and 2016, it doubled the number of data centers it runs, from 10 to 20. In the same period, the number of yearly transactions it handles on behalf of its customers more than doubled, to 1.1 trillion from 490 billion, while transaction latency actually fell, from 230ms to 210ms.
To manage this growth, the company is moving away from monolithic software stacks, where large numbers of developers write into single humongous JAR files. Instead, developers are writing microservices to cover individual functionalities. Here, each team owns the software it creates, using containers as the deployment artifact. On the hardware side, the company has a scale-out architecture where capacity can be easily added.
The company set a number of goals for this scalable infrastructure, according to Sandke:
- Easy onboarding for new services.
- A “High Fidelity” between production and non-production environments, for developer sanity.
- IT assets would be described in a declarative rather than imperative manner, so they could be orchestrated.
- Assets would be “secure by default.”
- Full automation to cut human involvement and error.
- The infrastructure must be able to span across public cloud and private data centers.
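The declarative goal in the list above can be pictured with a small sketch. It is an illustration only, not Salesforce's tooling: the `reconcile` function, the service names, and the replica counts are all hypothetical. The idea is that people state the desired end state, and software works out the steps needed to get there.

```python
# Hypothetical illustration of a declarative model: the operator describes
# the desired state, and software derives the actions. Names are assumptions.

def reconcile(desired: dict, actual: dict) -> list:
    """Return the actions needed to move `actual` replica counts to `desired`."""
    actions = []
    # Visit every service mentioned in either state, in a stable order.
    for service in sorted(desired.keys() | actual.keys()):
        want = desired.get(service, 0)
        have = actual.get(service, 0)
        if want > have:
            actions.append(f"start {want - have} x {service}")
        elif want < have:
            actions.append(f"stop {have - want} x {service}")
    return actions


desired_state = {"billing": 3, "search": 2}
actual_state = {"billing": 1, "search": 2, "legacy": 1}
plan = reconcile(desired_state, actual_state)
print(plan)
```

An imperative approach would have someone write out the "start" and "stop" steps by hand. Here they fall out of comparing two descriptions, which is what makes the assets possible to orchestrate automatically.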
The engineering team understood they needed a container orchestration tool of some sort to help accomplish these goals. Many of these engineers had long histories with managing at-scale architectures at other companies, and they easily vibed with the designers behind Kubernetes.
“We were frankly blown away. The development velocity was incredible, even back then,” Sandke said. “These people clearly knew what they were doing.”
While the engineers saw that Kubernetes clearly could help in their goals of automating IT infrastructure, getting the rest of the company on board with Kubernetes was another challenge altogether. Management was concerned that Kubernetes was a new project, still with many loose ends. A number of demos helped build the case, however. Another challenge: Kubernetes couldn’t be run in a vacuum. An entire Docker-based continuous integration pipeline had to be laid down. And the security operations team needed to be included.
On top of Kubernetes, the company built a developer “abstraction,” an interface that would offer developers a smooth set of Kubernetes-based deployment steps for their applications, Sandke said.
“Guardrails are important. Kubernetes is incredibly powerful. You can do so many different things. However, we don’t want to foist that [power] off on a several-thousand-developer organization,” Sandke said. Plus, in terms of forward planning, “it is far easier to expose new features than to take one away,” he said.
The deployment manifests are kept in a single git repository. A deployment request is issued through a git pull request. A typical deployment scenario would work like this: A developer would check some code into git. Once that pull request has been approved, a controller uses the manifest, which points to approved signed Docker images of the resources needed to build that service, to create an updated Kubernetes artifact.
“Thirty minutes after approving the manifest change, your systems are running live in production, with no humans involved whatsoever,” Sandke said. “The nice thing is that the service owners just do it. We don’t even know half the time.” If there is an error in the build process, the controller halts the deployment.
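The flow Sandke describes can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Salesforce's controller: the `Manifest` class, the `deploy` function, the `DeploymentHalted` exception, and the `registry.example` image names are all hypothetical, and the specific checks (approval, then image signatures) are a guess at what "halting on error" might cover.

```python
from dataclasses import dataclass

# Hypothetical sketch of a pull-request-driven deployment controller.
# Every name below is an assumption made for illustration.


@dataclass(frozen=True)
class Manifest:
    service: str
    images: tuple   # image references the service needs
    approved: bool  # True once the pull request has been approved


class DeploymentHalted(Exception):
    """Raised when the controller refuses to continue a deployment."""


def deploy(manifest, signed_images):
    """Return a description of the rollout, or raise DeploymentHalted."""
    if not manifest.approved:
        raise DeploymentHalted(f"{manifest.service}: pull request not approved")
    # Only images on the approved, signed list may be rolled out.
    unsigned = [img for img in manifest.images if img not in signed_images]
    if unsigned:
        raise DeploymentHalted(f"{manifest.service}: unsigned images {unsigned}")
    return {
        "service": manifest.service,
        "images": list(manifest.images),
        "status": "rolled out",
    }


signed = {"registry.example/billing:1.4.2"}
ok = Manifest("billing", ("registry.example/billing:1.4.2",), approved=True)
result = deploy(ok, signed)
print(result["status"])
```

A manifest that points at an unsigned image, or one whose pull request has not been approved, raises `DeploymentHalted` instead of producing a rollout. That mirrors the controller stopping on its own, without anyone stepping in.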
Salesforce now has three Kubernetes-driven services in production, running on about 10 clusters across two data centers. By the end of 2017, however, the company wants to have more than 20 services in production running on 20 Kubernetes clusters composed of more than 1,000 nodes, spread across all of the data centers.
“We are intentionally running slow, because running Kubernetes clusters is an art, and you have to learn how to do it,” Sandke said.
On the To-Do list: The company still wants to make onboarding for developers even easier. The engineering team would like more visibility into the service states. They would also like to run native cluster applications, such as Redis. And a big focus this year will be on supporting “stateful” applications that require a data store backend of some sort. This goal might involve configuring Kubernetes to know to plant workloads close to the disks that hold their required data.
Scaling software is hard, but scaling organizations to support all this scalable software is even more difficult, noted Richard Fliam, a Comcast lead for engineering effectiveness, who also spoke at Kubecon EU 2017.
Comcast has found that Kubernetes acts as an effective “data center operating system,” one that saves developers and administrators time by simply keeping their concerns cleanly separated, Fliam said.
Although Fliam did not provide many specific details of the Kubernetes deployment, he did share the lessons learned.
The use of the technology is about “decoupling the operations of these teams at an organization, and giving them a platform to manage their own large, complicated distributed systems as easily as I can manage my Linux box,” Fliam said. Not only does this approach cut deployment time for apps, it also cuts the amount of time staff must spend on the documentation and communication required across groups.
Fliam leads Comcast’s VIPER (Video IP Engineering and Research) Engineering Efficiency group, which focuses on the scalability issues of video-based customer services. As the size of the team’s projects grew from its origins of five years ago, so did the number of engineers needed to support them. With just a few engineers, the group could easily package everything within virtual machines. As the deployments grew in complexity and scope, more IT automation tools, such as Puppet, were needed.
They soon outgrew this mode as well. A sophisticated digital video recording (DVR) service, which customers use to record and later stream petabytes of data each day, brought about an entirely new set of engineering needs. The DVR service, running across dozens of data centers, is actually a composite of multiple services: advertising systems, configuration systems, and video back-end systems, all managed by different groups.
“Each system has a flat networking name space and has a methodology to deploy it, like Geronimo,” Fliam said. Intertwining all these services together can’t be done through Puppet alone.
“It’s hard to make 200 people as effective per person as 20 when they are all working on the same technology stack,” Fliam said. The more people who are involved in a project, the higher the bandwidth requirements are for everyone involved in the project. Risk goes up; the ability to make changes slows down.
Kubernetes is valuable in this scenario in that it relieves developers of the need to understand the particulars of the multiple environments they are deploying to. “No dev team could interact with all the machines in an effective manner,” Fliam said.
A containerized environment allows Comcast to capture as code all changes, deployments, and versions of the system, paving the way to automating deployment processes. It cleanly separates the apps from the underlying infrastructure, allowing ops to make changes to the infrastructure without requiring alterations to the apps themselves.
The Cloud Native Computing Foundation is a sponsor of The New Stack.