Containers in Production: Case Studies, Part 1
What do music streaming service Spotify, video streaming company DramaFever, backend-as-a-service provider Built.io, and the education think tank “Instituto de Investigacion, Innovacion y Estudios de Posgrado para la Educacion” (IIIEPE) have in common? They all depend on containers to run critical web infrastructure.
Their stories illustrate some of the benefits and challenges of utilizing microservice architectures, creating web-scale applications, and running containers in production.
Running a Global Application at Scale
Spotify has a whole host of features, including logging in with social media credentials, seeing recommendations based on previously listened to tracks, and sharing music. But it’s playing the music itself that is most important. To keep Spotify’s 60 million users happy, the music must keep streaming even if, say, a developer introduces a bug into the album cover art feature, or one of its more than 5,000 production servers in its four different data centers bites the dust.
To accomplish this, the company packages each discrete feature of Spotify as a microservice. Containers are then used to allow Spotify to bundle some of these microservices together so that a single request call can take all the relevant information and display it in the interface that an end user — and possible subscriber — sees, without making too many repeat calls to match specific information. This also allows other microservices to run independently at the same time.
At DockerCon 2014 in San Francisco, Spotify software engineer Rohan Singh mapped out Spotify’s game plan for packaging microservices into a stream of containers that can be spun up and down, depending on user needs, around the globe. Singh’s talk at DockerCon coincided with the first week that Spotify was planning to deploy containers in production. Nine months later — in March 2015 — Spotify’s Evgeny Goldin took the stage at DevCon in Tel Aviv to provide on-the-ground insights into how containers are central to the streaming service’s continuous delivery culture.
Goldin said Spotify’s problem was what he calls NxM: “You have ‘N’ services and you want to deploy them on ‘M’ hosts,” he told the DevCon audience.
The NxM problem refers to the problems at scale that occur when a global audience of users are all using Spotify in their own ways, turning on and off features, playing music, or shutting down their app usage for the night. So, as a Spotify user’s app requires a particular microservice, it makes a request to Spotify’s servers and gets the data it needs. As new features are added, or more users are simultaneously asking for their own app services to start, Spotify needs to spin up more containers to meet the additional demand, and then drop them when there isn’t the need to maintain processing power on the server.
To address this, Spotify created the open source project Helios.
“We call it a Docker orchestration framework. It solves one problem, and it solves it really well,” Goldin said. “Given a list of hosts and given a list of services to deploy, it just deploys them there.”
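The NxM idea can be made concrete with a short sketch. The Python below is only an illustration of "given a list of hosts and a list of services, deploy them there"; it is not Helios code and does not use the Helios API. The function name, service names and host names are all hypothetical.

```python
from itertools import product


def plan_deployments(services, hosts):
    """Return one (service, host) pair per combination: the N x M plan."""
    return list(product(services, hosts))


# Hypothetical inputs: N = 3 services, M = 2 hosts.
services = ["login", "recommendations", "album-art"]
hosts = ["host-a", "host-b"]

plan = plan_deployments(services, hosts)
for service, host in plan:
    # A real orchestrator would start a container here; this sketch only prints.
    print(f"deploy {service} on {host}")
```

With three services and two hosts the plan has six entries, which is why the amount of work grows quickly as either list gets longer.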
Containers and the Helios model have been used in production at Spotify since July 2014 and are a key pillar of the company’s scalability strategy. “We’re already at hundreds of machines in production,” the team announced in their Helios README notes on GitHub. “But we’re nowhere near the limit before the existing architecture would need to be revisited.”
DramaFever’s Video Demands
DramaFever runs white-label sites such as AMC’s SundanceNow Doc Club and horror streaming site Shudder, and has contracts to deliver its content to other streaming services. Bridget Kromhout, who worked in site reliability at DramaFever before her recent move to Pivotal as a principal technologist for Cloud Foundry, said of her time there: “We are not Netflix, we are not Hulu, but if you watch a Korean drama on Hulu, they are streaming content they licensed from DramaFever.”
During her work at DramaFever, Kromhout took a “cattle, not pets” mindset to server architecture, with everything in the request path containerized to enable autoscaling up and down. DramaFever started using Docker in production in October 2013, at version 0.6. Using Docker enforces consistent development and repeatable deployment; containers help keep the code and configuration environments used by developers the same, even as they move code from their laptops through to the cloud-based QA, staging and production environments. The code in the containers can be trusted every time the architecture autoscales a new instance of the application or microservices components, as it is going to have the same operational context as any testing or development arrangement.
“When I started in July 2014, using Docker in production was a sign that it would be an exciting place to work,” Kromhout said.
One of the initial obstacles DramaFever had to address in autoscaling at production scale was building a robust container registry. Kromhout said that in a distributed architecture, you often have a situation where multiple containers are launching at the same time. This is done dynamically, so the architecture only fires up instances and launches containers as needed, based on usage. Each launch sends a ‘docker pull’ request to the private Docker registry, and the registry tended to fall over when around twenty containers were being started at the same time.
To solve this problem, DramaFever runs the registry container everywhere, on each host locally, so that the image pull is done locally, with AWS S3 serving as the backing storage.
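As a rough sketch of what a host-local pull looks like, the Python below builds an image reference that points at a registry on the same machine. It is not DramaFever’s code: the helper names, the repository name and the port (5000, the port commonly used for a self-hosted registry) are assumptions for illustration, and the S3 backing storage would be configured on the registry itself rather than here.

```python
def local_image_ref(repository, tag, registry_port=5000):
    """Build an image reference that targets a registry on this host.

    The port is an assumption; use whatever port the local registry listens on.
    """
    return f"localhost:{registry_port}/{repository}:{tag}"


def pull_command(repository, tag):
    """Return the argument list for pulling the image from the local registry."""
    return ["docker", "pull", local_image_ref(repository, tag)]


# Hypothetical repository and tag.
command = pull_command("web", "1.4")
print(" ".join(command))
```

Because every host talks only to its own registry process, twenty simultaneous launches on twenty hosts become twenty independent local pulls instead of twenty requests against one shared server.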
“DramaFever could have gone in the direction of scaling up,” Kromhout said. But she said that Tim Gross, director of operations at DramaFever, realized that “if he had to scale the registry, why not do so in a way that you would never have to think ‘can our registry server handle this many image pulls?’”
She admits this is a unique situation, but it’s exactly the sort of problem that you’ll face running containers at scale with a self-hosted registry. Kromhout spoke about DramaFever’s containers story at OSCON 2015, “Docker in Production: Reality, not Hype”.
Managing Containers Via API in Production
Nishant Patel, CTO of integration and backend-as-a-service provider Built.io, said the company has been using containers in production for almost two years now, and believes that doing so helps differentiate its service offering from those of other providers in its market.
Patel said that the majority of IaaS and BaaS customers are really looking for a database-as-a-service solution with out-of-the-box authentication and similar features, which enable the database to be provided to end customers as an application. But 20 percent of Built.io’s customers require something a bit more rigorous, with additional business logic wrapped around the database. These customers need to be able to write custom logic and have it run on Built.io’s servers.
“We didn’t want it to get too complicated for our customers to have to set up their own server, etc. So we wanted an easy way to let our customers upload their custom code and run it alongside their database. So, essentially, we wanted a container in the cloud,” said Patel.
That led Built.io’s Mumbai-based engineering team to look at the available options. Two years ago, one of their engineers came across Docker when it was still at about version 0.9. “We did a quick proof of concept and set up these containers, uploaded code and you know what, we were pretty impressed with what it provided. It was a secure container that didn’t mess with anyone else’s code. So then we went a step further and looked at writing our own management layer.”
Patel estimates that Built.io’s customers are running thousands of containers at any given time, as each customer’s account comes with a Docker container by default.
“It gave us a competitive advantage to go with containers. Our competitors had the same requirements as us: they also have to cater to these 20 percent of customers with specific needs, but because we use Docker we were able to create a platform as a service which is something our competitors couldn’t do. It made our story much more powerful. Using our MBaaS with Docker in production, you can upload full Node.js applications, for example.”
Like Spotify and DramaFever, Built.io found existing tools lacking, given the early adoption nature of their containers-in-production environments. This led to Built.io choosing to build their own management layer in the same way that Spotify built their own orchestration service and how DramaFever built a horizontally scalable host-local Docker registry architecture.
“What we wanted was the ability through APIs is to make an API call to set the containers up. Then we built all sorts of stuff, for example, we wanted higher paid customers to be able to set up bigger containers. We wanted customers to be able to have a load balancer container, we wanted to add additional security provisions, and enable through the API to manage starting and stopping containers, and using an API to get container logs and put it into our customer management UI.”
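The capabilities Patel lists (create a container sized by plan, start and stop it, fetch its logs) can be pictured as a small management interface. The Python below is a hypothetical in-memory sketch, not Built.io’s management layer: the class names, the tier names and the memory sizes are invented for illustration, and no real containers are started.

```python
from dataclasses import dataclass, field

# Hypothetical plan sizes; the article only says higher paid plans get bigger containers.
TIER_MEMORY_MB = {"basic": 256, "premium": 1024}


@dataclass
class Container:
    customer: str
    memory_mb: int
    running: bool = False
    log_lines: list = field(default_factory=list)


class ContainerManager:
    """In-memory stand-in for an API that manages one container per customer."""

    def __init__(self):
        self._containers = {}

    def create(self, customer, tier="basic"):
        container = Container(customer, TIER_MEMORY_MB[tier])
        container.log_lines.append("created")
        self._containers[customer] = container
        return container

    def start(self, customer):
        container = self._containers[customer]
        container.running = True
        container.log_lines.append("started")

    def stop(self, customer):
        container = self._containers[customer]
        container.running = False
        container.log_lines.append("stopped")

    def logs(self, customer):
        # Return a copy so callers cannot edit the stored log.
        return list(self._containers[customer].log_lines)


manager = ContainerManager()
acme = manager.create("acme", tier="premium")
manager.start("acme")
manager.stop("acme")
print(manager.logs("acme"))
```

In a real service each method would sit behind an authenticated API endpoint and call the container runtime; the sketch only shows the shape of the operations.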
Patel said Built.io continues to keep an eye on the container ecosystem for other tools that could take over the management layer for them, and while Deis comes close, for now, they have decided to continue doing this themselves.
Leveraging Containers to Introduce Continuous Delivery
There is no shortage of those wanting to experiment with using containers in production, as Luis Elizondo found after he published a series of blog posts on his Docker production workflow. Elizondo works for the IIIEPE — a government-funded agency dedicated to education research — where he manages everything from small static HTML sites to full-blown learning management systems for teachers.
“We have 25 web applications, and most of them are run on Drupal, but we also use Node.js and Python. The biggest problem is that we have a lot of issues when we are moving a web app from dev to production,” Elizondo said.
Elizondo was looking for a solution that would run on IIIEPE’s private cloud infrastructure and scale horizontally. The team expected their web applications to grow in both number and usage, so Docker was an appealing option that could scale with that growth.
At the time, IIIEPE wasn’t using continuous delivery methodologies in their application architecture, so moving to a Docker container infrastructure also gave them the chance to implement new continuous delivery workflows in conjunction with their orchestration flow.
Elizondo’s blog series documented his approach to storage, orchestration, service discovery and load balancing in production as a way of sharing the workarounds and challenges he had to solve in making his workflow production-ready.
The team has two servers. One runs MaestroNG and is responsible for starting stock Docker containers. The other runs a virtual machine running Jenkins. “We use it as a kind of robot that performs repetitive work,” he said. “We programmed Jenkins so that when we push a new change to any of the projects, Jenkins detects the new change, rebuilds the whole image, adds the code to the new image, then orders MaestroNG to pull the image from Docker Hub, and that’s basically our workflow.”
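The sequence Elizondo describes (detect a push, rebuild the image, publish it, then tell the orchestrator to deploy it) can be outlined in a few lines. The Python below is a minimal sketch of that ordering only. It does not call Jenkins, MaestroNG or Docker Hub; the function name, the tagging scheme, and the project and commit values are hypothetical, and the three stages are passed in as plain callables.

```python
def on_push(project, commit, build_image, push_image, deploy):
    """Run build, push and deploy in order for one commit; return the image tag."""
    tag = f"{project}:{commit[:7]}"  # hypothetical scheme: project plus short commit id
    build_image(project, tag)  # rebuild the image with the new code
    push_image(tag)            # publish the image to the registry
    deploy(tag)                # ask the orchestrator to pull and run it
    return tag


# Stand-in stages that only record what would happen.
events = []
tag = on_push(
    "drupal-site",
    "3f9c2ab71d",
    build_image=lambda project, tag: events.append(("build", tag)),
    push_image=lambda tag: events.append(("push", tag)),
    deploy=lambda tag: events.append(("deploy", tag)),
)
print(tag, events)
```

Keeping the stages as separate callables mirrors the split in the workflow: the build server owns the first two steps and the orchestrator owns the last.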
Elizondo used the DigitalOcean platform to test the new architecture and container workflow towards the end of 2014 and, amazingly, it only took about a month and a half to test it and work out all the kinks.
“The biggest discovery in our experiment was that in our containers and in our Git repos, this workflow doesn’t just work for Drupal,” he said. “It also works for Node.js applications. I also tested it with our Python application. So it is a standard workflow. Anyone can use it.”
Managing Container Complexity with Kubernetes in Production
For Elizondo, the solution to containers in production was Docker. Elizondo said he had experimented with Kubernetes, for example, but found it too complex for his needs. But for others, such as the core engineering team at e-commerce giant Zulily, the reverse has been true. Zulily first began experimenting with containers in production in May 2014, according to Rachael King at the Wall Street Journal. King reports that Zulily’s software lead Steve Reed said at OSCON in July 2015 that “the hardest part is operating the container, especially in a production environment.”
When first assessing the benefits of containers in production, Zulily ended up shelving their plans for a Docker-based production use case due to the complexity inherent in orchestration. With the maturing Kubernetes platform now able to manage orchestration of Docker containers, however, Zulily has been able to return to their production goals. “Kubernetes is production ready, even for a deployment like ours at Zulily where we’re not talking about hundreds of nodes,” Reed now says.
At the start of 2015, using containers in production was seen as either an experiment or a bold choice for enterprise. Now, only twelve months later, containers in production are no longer limited to pilots or specific projects but are being woven into the architectural fabric of an enterprise.