Did you binge-watch “Stranger Things” or “Sense8”? Was it ever interrupted due to a software glitch, something we have experienced with almost every other streaming service out there? Probably never. That’s incredible given that Netflix has one of the most dynamic infrastructures out there, one in which developers are continuously releasing updates and new software. How do they manage it?
“We don’t want your ‘Stranger Things’ streaming to be interrupted by the fact that an engineer, maybe, pushed out an errant deployment or made a change to a property that they shouldn’t have and the whole system goes down,” said Dianne Marsh, Netflix’s director of engineering tools.
Crucial to maintaining constant uptime for a dynamic infrastructure has been Spinnaker, a fully open source, multi-cloud, continuous delivery platform that Netflix developed internally to help development teams release software changes with confidence that nothing is going to break. Spinnaker offers two core features: cluster management and deployment management.
Spinnaker succeeds another open source tool by Netflix called Asgard, a cloud delivery platform that was built to simplify the delivery of Netflix services to Amazon Web Services (AWS).
Asgard was widely adopted by the community, but it also ran into challenges. It was designed to support only AWS, while users wanted to run it in their own environments. In addition, because Asgard was an internal Netflix tool, developers did not need permission to deploy, whereas external users needed permission controls. People had to fork Asgard to adapt it to their own needs, and as the forks multiplied, Netflix saw a problem.
“It led to a fragmentation in the Asgard source code because it wasn’t really easy for them to bring that technology back. We lost the innovation that those other companies did on those works,” Marsh said. “So it’s really important to us to think about how we might make this an extensible platform that didn’t need to be forked, and that we could work with the community rather than throwing something out there that the community might adopt but then not figure out how to be able to contribute back.”
Netflix started looking at the successor to Asgard not just as a deployment tool but as a continuous delivery platform. Thus began the work on the project that later became known as Spinnaker.
While the central team at Netflix was working on Spinnaker, there was another deployment management project within Netflix led by Sangeeta Narayanan, who was the director of the edge developer experience at Netflix. They were building features and capabilities for their own needs as they could not wait for the centralized team to finish the product they were working on.
Marsh explained that the Edge Center API team had very immediate needs; they needed to conduct fast deployments in a safe, reliable and repeatable way. When other teams saw the work done by the Edge Center API team, they wanted to use the project too. That put unnecessary pressure on the Edge team, which was not part of the centralized team and whose tool was not designed to support Netflix as a whole.
The centralized team was chasing the features that were being developed by the Edge Center API team, and trying to keep up. “If we wanted to centralize a tool, it was going to serve all of Netflix. We needed to have those features represented in the tool that we were building,” said Marsh. The Edge Center API team was solving a problem for a very distinct purpose, which allowed it to run really fast; the centralized team was trying to solve the problems of the entire organization, so it had to walk slowly and carefully.
Over time it became quite clear that the centralized team was never going to catch up with the Edge Center API team. At that pace, the centralized product was never going to be ready for the Edge team to use in production. Andrew Glover, manager of delivery engineering at Netflix, was responsible for Spinnaker. He met with Narayanan, admitted that his team could not possibly keep up with hers, and asked if there was a way to join forces so that the Spinnaker team could build the needs of the Edge team into the centralized tools.
Narayanan gave him two developers who were working on the project so that they could bring the features and the context from the Edge Center team to the centralized team. The Edge Center team was working on a continuous delivery tool, whereas Asgard, in its infancy, was a delivery tool and an infrastructure management tool. The two teams worked together, and the resulting product was a continuous delivery and infrastructure management platform. “Spinnaker brought those two features together. It was designed to be a pluggable architecture,” said Marsh.
Netflix started using Spinnaker internally in 2014 and open sourced it the following year. Based on the company’s experience with Asgard, Netflix ensured that Spinnaker solved most of the problems faced by the Asgard community. Netflix worked with some partners to develop the multi-cloud strategy and build the pluggable architecture.
The development teams worked with Google, Microsoft and many other companies. As a result, Google, Microsoft, Amazon, Pivotal and even Oracle were able to use Spinnaker in their environment. No fork was needed. The circle was complete. These companies were benefiting from the innovation of Netflix and in return, Netflix was benefiting from the innovation the communities were adding to it. It was the pure open source win-win model.
Tools and Culture
“We want to provide guardrails, not gates. I don’t want to stop you from doing something. I want to give you context about why we think it might be a bad idea. I want to make sure that you have all the context to make that decision. But I don’t want to prevent you from doing it,” Marsh said.
“Out-of-the-box, Spinnaker supports sophisticated deployment strategies like release canaries, multiple staging environments, red/black (a.k.a. blue/green) deployments, traffic splitting and easy rollbacks,” wrote Google Product Manager Christopher Sanson, in a blog post. “This is enabled in part by Spinnaker’s use of immutable infrastructure in the cloud, where changes to your application trigger a redeployment of your entire server fleet. Compare this to the traditional approach of configuring updates to running machines, which results in slower, riskier rollouts and hard-to-debug configuration-drift issues.”
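To make the red/black idea concrete, here is a minimal in-memory sketch in Python. It is not Spinnaker’s API: the `ServerGroup` and `Cluster` classes and their method names are hypothetical, invented only for illustration. It assumes that each deployment creates a brand-new server group rather than modifying running machines, and that the previous group is disabled but kept so that a rollback is just a traffic switch.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ServerGroup:
    """Hypothetical model of an immutable set of servers built from one version."""
    version: str
    enabled: bool = False  # True means the group receives live traffic


@dataclass
class Cluster:
    """Hypothetical cluster holding every server group deployed so far."""
    groups: list[ServerGroup] = field(default_factory=list)

    def active(self) -> ServerGroup | None:
        """Return the group currently taking traffic, if any."""
        return next((g for g in self.groups if g.enabled), None)

    def red_black_deploy(self, version: str) -> ServerGroup:
        """Stand up a new group, send traffic to it, and disable the old one."""
        previous = self.active()
        new_group = ServerGroup(version=version, enabled=True)
        self.groups.append(new_group)
        if previous is not None:
            previous.enabled = False  # kept around, not destroyed, for rollback
        return new_group

    def rollback(self) -> ServerGroup:
        """Switch traffic back to the most recently added non-active group."""
        current = self.active()
        candidates = [g for g in self.groups if g is not current]
        if current is None or not candidates:
            raise RuntimeError("nothing to roll back to")
        previous = candidates[-1]
        current.enabled = False
        previous.enabled = True
        return previous
```

Because no running server is ever modified in this model, rolling back never has to undo a configuration change; it only flips which group is enabled.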
Netflix has a multi-region environment, and in a deployment system like that, you really don’t want an engineer to push changes globally to all regions at the same time.
Netflix gives developers the freedom to make that choice, but with great power comes great responsibility. All of that is reflected in Spinnaker.
“Engineers decide their own deployment strategy. There are many different strategies built into Spinnaker. In the continuous delivery pipeline, we give them the ability to add a manual judgment,” said Marsh. Manual judgment might sound counter to continuous deployment, but Netflix gives its developers that option. “I think it gives them the opportunity to build confidence in our tooling,” said Marsh. It’s like that last window before you check out of a retail store where you can still see everything one more time.
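A staged rollout with a manual judgment step can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how Spinnaker pipelines are defined: `run_pipeline`, the `deploy` and `approve` callbacks, and the region names are all hypothetical. The sketch assumes the first region acts as the trial and that a person is asked before each later region.

```python
from __future__ import annotations

from typing import Callable

REGIONS = ["region-a", "region-b", "region-c"]  # hypothetical region names


def run_pipeline(
    version: str,
    regions: list[str],
    deploy: Callable[[str, str], bool],
    approve: Callable[[str, str], bool],
) -> list[str]:
    """Deploy one region at a time; stop at the first failure or rejection.

    Returns the regions that ended up running the new version.
    """
    deployed: list[str] = []
    for region in regions:
        # Manual judgment: once at least one region is live, a person
        # decides whether the rollout continues to the next region.
        if deployed and not approve(version, region):
            break
        if not deploy(version, region):
            break
        deployed.append(region)
    return deployed
```

For example, calling `run_pipeline("v2", REGIONS, deploy, approve)` with an `approve` callback that rejects `region-c` leaves the new version running only in the first two regions.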
However, to keep developers from accidentally bringing down the streaming service, Netflix wants to offer them smart defaults based on the criticality of a service. Streaming a movie or a show is critical compared with a service that tells users what they are watching on other devices or recommends shows. If those services go down, it doesn’t matter much, but streaming is critical. “We want to treat those core services, those critical services a little bit differently than we would treat other services,” said Marsh.
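One way to picture smart defaults that act as guardrails rather than gates is the Python sketch below. Everything in it is an assumption made for illustration: the `DeployPolicy` fields, the two criticality tiers, and the default values are hypothetical and do not describe Netflix’s actual settings. The point is only that a developer’s override is accepted, but comes back with context about how it differs from the default.

```python
from __future__ import annotations

from dataclasses import dataclass, replace


@dataclass(frozen=True)
class DeployPolicy:
    """Hypothetical deployment settings for one service."""
    strategy: str
    canary: bool
    manual_judgment: bool
    regions_at_once: int


# Hypothetical defaults: critical services get the more cautious settings.
DEFAULTS = {
    "critical": DeployPolicy("red/black", canary=True, manual_judgment=True, regions_at_once=1),
    "standard": DeployPolicy("red/black", canary=False, manual_judgment=False, regions_at_once=3),
}


def policy_for(criticality: str, **overrides) -> tuple[DeployPolicy, list[str]]:
    """Return the chosen policy plus warnings for every default that was changed.

    Overrides are never rejected (guardrails, not gates); they only add context.
    """
    base = DEFAULTS[criticality]
    chosen = replace(base, **overrides)
    warnings = [
        f"{name}: default for {criticality} services is {getattr(base, name)!r}, got {value!r}"
        for name, value in overrides.items()
        if getattr(base, name) != value
    ]
    return chosen, warnings
```

Calling `policy_for("critical", manual_judgment=False)` returns a policy with the manual judgment step turned off, along with one warning explaining what the default for a critical service would have been.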
Looking at the future, she said that declarative continuous delivery provides the ability to be a bit more abstract about what is being deployed and where. It frees developers to focus on the domain and the problem that they’re trying to solve. “We’re taking on some of that responsibility by them giving us some configuration information rather than what specific instance type do they want to deploy to or what specific parameters they want to use,” she said. “People really do want to focus more on the problems that they’re coming to solve rather than every single person understanding the details of building deployment.”
Spinnaker, Sense8 of Developers
Netflix culture has a lot of influence over Spinnaker. “One of the ways that are one of the guiding principles at Netflix is that we need to be loosely coupled but highly aligned. This means that the teams themselves own the responsibility of communicating with other teams about what they’re doing and what impact they might have on someone else,” said Marsh. “Spinnaker gives those teams the ability to be able to go through various stages to ensure confidence along the way that they’re going to be deploying something that’s an improvement rather than a regression to the service.”
Marsh strongly believes in the relationship between the culture of an organization and the tools that it develops or uses. “Technology can impact culture and culture can impact tools,” said Marsh. She said that it’s really important to understand that when we build tools, they reflect our culture. “If we’re not reflecting the culture in those tools, we might be challenging the culture with our tools. Those are two very different approaches and they need to be dealt with very differently,” said Marsh.
Feature Image: Netflix’s Dianne Marsh, speaking at KubeCon 2017.