The SRE team was increasingly pulled off projects aimed at directly improving the GitHub.com user experience in order to support all kinds of other, internal initiatives. As the number of GitHub services increased, so did the number of teams independently supporting each of them — meaning the site reliability team spent more and more of its time on server maintenance, provisioning, and other tasks unrelated to its main mission.
GitHub’s other engineers were limited, and even slowed, by the SRE team’s availability. “New services could take days, weeks, or even months to deploy, depending on their complexity and our team’s availability,” said Jesse Newland, GitHub’s principal site reliability engineer. “We very much needed to provide these other engineers with a self-service platform they could use to experiment with while building new services, and even deploy and scale them.”
Thus began, one year ago in August 2016, a joint project between Newland’s team and the Platform and Developer Experience teams to evaluate platform-as-a-service solutions. The open source Kubernetes container orchestration engine, developed by Google and maintained by the Cloud Native Computing Foundation, quickly emerged as the leading contender.
“The Kubernetes project is one of the best maintained and curated open source projects I’ve ever seen, and the communities around it are fantastic — there are really great people in every corner of that project,” said Newland. “I’m convinced this technology is going to help others run projects in a more sane way, not just at GitHub but everywhere.”
Coming to Containers
It wasn’t GitHub’s first venture into containers. “We already used them for a number of things, for CI and testing and isolation, but had not served a lot of production workloads out of them,” he explained. “So we had some experience, though not deep. But we did have familiarity with both the positives and negatives — Dockerfile format for example — and that certainly helped.”
At the earliest stages of this project, the team decided to target the migration of one of the organization’s most critical workloads: github.com and the GitHub API. It was a deliberate decision, influenced by many factors, said Newland. These included a need for self-service capacity expansion tooling, a desire to increase the portability of this specific workload over time, and a desire for the new platform to persist for years to come.
“Given the critical nature of the workload we chose to migrate, we knew that we’d need to build a high level of operational confidence before serving any production traffic,” said Newland.
Hack Week, Kubernetes-Style
A small trial project was launched to build a Kubernetes cluster and deployment tooling in support of an upcoming hack week, the aim being to gain some practical experience with the platform. The event was also an opportune scenario for having numerous different teams simultaneously trying out clusters and deployment — a small-scale version of GitHub’s full-scale organizational requirements for the platform.
This “hack week,” as it was called, was the perfect testing ground, said Newland. It was “a low-risk experimentation environment to play around with how GitHub worked in Kubernetes, and also a good standard of quality for our first investigation — since hack week is all about software that mostly works!” More important, however, was that the company’s developers enjoyed having a repeatable environment to test their changes to all necessary subsystems.
In short, GitHub’s first deep foray into Kubernetes was a genuine success. “Our experience with this project as well as the feedback from engineers that used it in the scope of hack week was overwhelmingly positive and in support of expanding this experiment,” Newland concluded. “With this positive experience under our belt, we began planning a larger rollout.”
(Additionally, Newland said, hack week became the pattern for “playing with” other GitHub applications that would benefit from having a standalone environment spun up, created on demand when necessary rather than on a fixed set of staging servers. “In my experience, non-production or staging environments tend to bit rot over time, because they’re not really maintained or paid much attention to — at least right up until the moment you need them, which can cause all kinds of trouble,” said Newland. “So we are excited to apply this model, across all of GitHub, to any applications with a similar need for a pre-production testing environment.”)
The Road Is Made by Building
Rather than establish a set timeline for the entire Kubernetes migration, the team took an achievement-oriented approach. The goal was to migrate while staying within existing targets for performance and error rates, rather than committing to hard deadlines for finishing.
“We knew more or less in general where we wanted to go, picked some proximal goals along the way, and put guesstimate dates just to provide some urgency to the team,” said Newland. It was an incremental approach, he explained: “Tiny experiments directing small amounts of traffic to this new platform, and seeing if we hit our targets before trying a larger step.”
The next key accomplishment: the building of a “review lab.”
“We knew that we’d need to design, prototype, and validate a replacement for the service currently provided by our front end servers using Kubernetes primitives like Pods, Deployments, and Services,” said Newland. “Some validation could be performed by running existing test suites in a container rather than on a server configured to mimic front-end servers, but we also needed to build confidence in how this container behaved as a part of a larger set of Kubernetes resources.”
The decision was made to build “review lab” — a Kubernetes-powered deployment environment that supported exploratory testing of the combination of Kubernetes and the services that would run on it, similar to an existing GitHub environment known as “branch lab.”
The team spent the remainder of 2016 building review lab and in the process shipped multiple sub-projects. These included a K8s cluster running in an AWS Virtual Private Cloud, managed using a combination of HashiCorp’s Terraform and kops; a Dockerfile for github.com and api.github.com; YAML representations of more than 50 Kubernetes resources; and a set of Bash integration tests that exercise ephemeral Kubernetes clusters, which Newland described as “used heavily in the beginning of the project to gain confidence in Kubernetes.”
The end result was a chat-based interface for creating what amounted to an isolated deployment of GitHub for any pull request. “We’re extremely pleased with the way that this environment empowers engineers to experiment and solve problems in a self-service manner,” said Newland. He explained that the review lab environment, upon internal release, exposed a large number of engineers to a new style of deployment and helped the site reliability team build confidence via feedback from interested engineers — as well as the lack of feedback from engineers who had not noticed any change during continued use.
Kubernetes on Metal
With review lab successfully launched, Newland and his team came back from the end-of-2016 holiday break ready to focus on Kubernetes clusters that worked in GitHub’s physical environment, and to start migrating over traffic.
“One thing I think is particularly interesting is that we are running Kubernetes clusters on metal — actual physical machines. The Kubernetes ecosystem is very strongly focused on running on cloud providers, and most documentation is based on that,” said Newland. “We faced challenges to get a design that worked in our physical data centers. There are some priors scattered around, but most blog posts and discussions we found were by people running Kubernetes clusters at their houses. So several team members actually started implementing Kubernetes at home, as an experiment and a learning opportunity.”
“In fact, several of my own home automations are now powered by Kubernetes,” Newland added.
Needless to say, some very interesting adjustments were required. To meet the performance and reliability requirements of GitHub’s flagship service — parts of which depend upon low-latency access to other data services — GitHub’s Kubernetes infrastructure would need to support the “metal cloud” in the company’s physical data centers and POPs.
Newland’s team proceeded with caution: “Following no less than a dozen reads of Kelsey Hightower’s indispensable ‘Kubernetes The Hard Way,’ we assembled a handful of manually provisioned servers into a temporary Kubernetes cluster that passed the same set of integration tests we used to exercise our AWS clusters,” said Newland.
A few more small stops along the track to migration station included:
- Using the Calico software-defined network provider for out-of-the-box functionality to ship a cluster quickly in IP-in-IP mode, while later allowing exploration of peering with GitHub’s network infrastructure.
- Building a small tool to generate the certificate authority and configuration necessary for each cluster in a format that could be consumed by GitHub’s internal Puppet and secret systems.
- Puppetizing the configuration of two instance roles — Kubernetes nodes and Kubernetes API servers — in a fashion that allows a user to provide the name of an already-configured cluster to join at provision time.
- Building a small Go-based service to consume container logs, append metadata in key/value format to each line, and send them to the hosts’ local Syslog endpoint.
- Enhancing GitHub’s internal load balancing service to support Kubernetes NodePort Services.
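The log-forwarding service in the list above can be sketched in a few lines of Go. This is only an illustration, not GitHub's actual code: the metadata values here are invented, and the sketch prints to stdout where the real service shipped each annotated line to the hosts' local syslog endpoint.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"sort"
	"strings"
)

// annotate appends key/value metadata to one log line in a stable order,
// e.g. "request failed" becomes "request failed ns=production pod=web-1".
func annotate(line string, meta map[string]string) string {
	keys := make([]string, 0, len(meta))
	for k := range meta {
		keys = append(keys, k)
	}
	sort.Strings(keys) // stable output regardless of map iteration order

	var b strings.Builder
	b.WriteString(line)
	for _, k := range keys {
		fmt.Fprintf(&b, " %s=%s", k, meta[k])
	}
	return b.String()
}

func main() {
	// Hypothetical metadata; a real service would derive this from the
	// container runtime, and would write each line to the host's local
	// syslog endpoint (for example via Go's log/syslog package) rather
	// than to stdout.
	meta := map[string]string{"pod": "web-1", "ns": "production"}

	sc := bufio.NewScanner(os.Stdin)
	for sc.Scan() {
		fmt.Println(annotate(sc.Text(), meta))
	}
}
```

Piping a container's log stream through a program like this yields one annotated line per input line, ready for downstream search and aggregation by the key/value fields.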
“The combination of all of this hard work resulted in a cluster that passed our internal acceptance tests,” said Newland. Given that, he said, the team was reasonably confident that the same set of inputs (the Kubernetes resources in use by review lab), the same set of data (the network services review lab connected to over a VPN), and the same tools would create a similar result.
“In less than a week’s time — much of which was spent on internal communication and sequencing in the event the migration had significant impact — we were able to migrate this entire workload from a Kubernetes cluster running on AWS to one running inside one of our data centers,” he concluded.
The Final Countdown
Having created a successful — and stable — pattern for assembling Kubernetes clusters on the GitHub metal cloud, it was time to start easing away segments of the load from GitHub’s front-end servers.
“At GitHub, it is common practice for engineers to validate new functionality they’re building by creating a Flipper feature that exposes their new functionality and then opting into it as soon as it is viable to do so,” said Newland. After enhancing the deployment system to deploy a new set of Kubernetes resources to a github-production namespace in parallel with existing front-end servers — and enhancing the GitHub Load Balancer to support routing staff requests to a different back end based on a Flipper-influenced cookie — the team allowed GitHub staff to opt into an experimental Kubernetes back end.
“The load from internal users helped us find problems, fix bugs, and begin to build confidence,” said Newland. “We also routed small amounts of traffic to this cluster to confirm our assumptions about performance and reliability under load, starting with 100 requests-per-second and expanding later to 10 percent of the requests to github.com and api.github.com.”
A few problems presented themselves, including integrating clusters with GitHub’s existing load balancing infrastructure. “The documentation for running highly available clusters glosses over the necessary characteristics of a load balancer in this implementation — not a deficiency, just not a frequently used thing,” he said. “We are charting new waters here, and in the process of more concretely understanding the issue. When we do, we want to share it with the community.”
Poised on the verge of going live with 100 percent of traffic routed to Kubernetes, the team opted to run GitHub’s front-end workload on multiple clusters in each site, automating the process of diverting requests away from an unhealthy cluster to the other healthy ones. “So instead of putting all our eggs in one Kubernetes cluster, we ran with several clusters — that way, if anything went wrong we would only lose a portion of the servers that were supposed to be serving a request at the time,” explained Newland. “We shifted our design to be able to provide reasonable services in the areas we saw failure in our testing.”
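The multi-cluster failover idea reduces to a filter over cluster health: traffic is only ever routed to clusters currently passing their checks, so one failing cluster costs a fraction of capacity rather than the whole site. The cluster names and the hard-coded health flags below are invented for this sketch; in production the signal would come from continuous health checks.

```go
package main

import "fmt"

// cluster represents one of several Kubernetes clusters serving a site.
type cluster struct {
	name    string
	healthy bool
}

// healthyClusters returns the subset of clusters that should receive
// traffic. Requests are diverted away from any cluster marked unhealthy
// and spread across the remaining ones.
func healthyClusters(all []cluster) []cluster {
	var ok []cluster
	for _, c := range all {
		if c.healthy {
			ok = append(ok, c)
		}
	}
	return ok
}

func main() {
	// Hypothetical site with three clusters, one of which is failing
	// its health checks and should be drained.
	clusters := []cluster{
		{"k8s-a", true},
		{"k8s-b", false},
		{"k8s-c", true},
	}
	for _, c := range healthyClusters(clusters) {
		fmt.Println("routing to", c.name)
	}
}
```

A real load balancer would re-run this filter continuously and rebalance as clusters recover, but even this simple form captures the "don't put all your eggs in one cluster" design Newland describes.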
Ultimately, the transition from internal to external Kubernetes took place over the course of a month, all the while keeping performance and error rates carefully within targeted zones. “I’m here because I like solving problems,” said Newland. “But I’m also interested in not creating any more new ones than absolutely necessary.”
The Human Angle
“As a site reliability engineer, my desire is to build tools to enable software developers to be more creative than me — to build creative and innovative solutions that then help other people down the road use them on GitHub,” Newland said in summary. “I’ve been encouraged by this project — that review lab environment we set up enables our engineers to try new things and experiment on their own when previously they would have needed to wait for the SRE team. They’re no longer limited by the number of SREs on staff. We have already seen engineers experiment with different approaches for replacing large chunks of our software stack — and the review lab and Kubernetes combo has enabled them to not only do that themselves, but go beyond to things the SRE team might not even think of…”
“I can already see how the move to Kubernetes is creating an environment at GitHub for more rapid innovation — innovation that will benefit users and, ultimately, the software industry as a whole.”
The Cloud Native Computing Foundation is a sponsor of The New Stack.
Feature image via GitHub.