Cloud Native: Service-driven Operations that Save Money, Increase IT Flexibility

I obsess about operations. I think it started when I was a department IT manager at a financial services institution. It was appallingly difficult to get changes deployed into production, and the cost of change was spectacularly high. It felt like there had to be a better way, and nearly every decision I have made professionally since 2008 has led me to work on technology that makes that guy or gal’s life easier.
As I think about Heptio, the new company that Joe Beda and I are creating to bring Kubernetes to enterprises of all sizes, I am optimistic about our ability to impact not only the way engineers design systems, but also the way they think about operations and structure the teams that deliver technology to the business.
Let’s gain a little perspective by looking at three distinct operations models: two you have probably heard about, and a new one that is emerging and offers a lot of promise.
SysAdmin: The Reign of the Ticket
The Systems Administrator stands between the business and chaos. Though we have seen some changes in the philosophy of automation and configuration, this remains true in many places today.
With the SysAdmin model, the ticket is the atom of work, and a human being is the operator.
You need a new VM instance? File a ticket. You need to update a system? File a ticket. Change configuration on a system? File a ticket. A human takes care of the ticket on the back end and work happens. Interestingly, some organizations have built complicated automation flows on top of tickets. This often evolves into “ticket as API,” which can be very painful because you are using the wrong tool for the job.
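To make the “ticket as API” failure mode concrete, here is a minimal sketch (the TicketSystem class is a toy stand-in invented for illustration): automation built on a ticket queue can only file a request and poll, while a human does the actual work out of band.

```python
import time

# Toy, in-memory stand-in for a ticketing system; purely illustrative.
class TicketSystem:
    def __init__(self):
        self._tickets = {}
        self._next_id = 0

    def file_ticket(self, request: str) -> int:
        self._next_id += 1
        self._tickets[self._next_id] = {"request": request, "status": "open"}
        return self._next_id

    def status(self, ticket_id: int) -> str:
        return self._tickets[ticket_id]["status"]

def provision_vm_via_ticket(tickets: TicketSystem) -> None:
    # "Ticket as API": file a request, then wait while a human does the
    # actual work out of band. Latency is measured in hours or days, and
    # the automation has no way to recover if the ticket stalls.
    ticket_id = tickets.file_ticket("Please create one VM, 4 GB RAM")
    while tickets.status(ticket_id) != "done":
        time.sleep(60)  # poll once a minute, indefinitely
```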
The introduction of the system administrator addresses an obvious problem: it creates a control point for production changes. It puts a professional in the workflow of deployment and configuration, and it is certainly a lot better than the chaotic world of developers having direct access and full authority over production systems. It is unfortunately tortuously slow, and even the best system administrator is human: errors happen. Worse than this is the fact that it scales sub-linearly. The more complex the system being deployed, and the higher the scale, the more intense the toil.
The SysAdmin model also adds tension between development teams, who want to go fast, and the centralized organizations trying to define and manage risk systematically. Software is legitimately eating the world, and the modern business needs to iterate faster, deploy faster, and scale more effortlessly than human operators can practically manage.
DevOps: Enter the Heroic Engineer
Developers code. Given the right coding tools, they are able to solve intricate business problems. So why not have them code their way out of the system deployment and configuration problem, either with a domain-specific language (DSL) or an imperative framework that allows them to describe precisely how a system should be configured? Each time you need to deploy something, you can pave over what came before and create a perfectly configured system based on the code or DSL ‘recipe.’
This is the world of IT automation tools: Ansible, Chef, Puppet, and Salt. Here, integration is the atom of work: developers code up the business solution in one language and deploy it using a different one.
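To ground the critique that follows, here is a minimal sketch of the imperative style written in plain Python rather than a DSL (the package and service names are illustrative); each step mutates the host in sequence, so a failure partway through leaves the node half-configured:

```python
import subprocess

def configure_node() -> None:
    # An imperative 'recipe': each step shells out and mutates the host.
    # If any step fails (a flaky mirror, a held package), the node is
    # left in whatever partial state the earlier steps produced.
    steps = [
        ["apt-get", "update"],
        ["apt-get", "install", "-y", "nginx"],      # illustrative package
        ["systemctl", "enable", "--now", "nginx"],  # illustrative service
    ]
    for step in steps:
        subprocess.run(step, check=True)  # raises on the first failure

if __name__ == "__main__":
    configure_node()
```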
DevOps definitely creates rigor around deployment, but I see a few issues over and over again:
- Writing your Chef scripts can be as onerous as getting the code done in the first place, and it puts pressure on managers to build out organizations with a wider array of skills than is perhaps strictly necessary, and…
- You are (often) running imperative code in production scenarios with relatively primitive tooling. If you rely on these capabilities to deal with a scaling event, there is a non-trivial chance that some ‘apt-get’ step will fail for esoteric reasons, leaving the new node you just added in an odd state. Heaven help you.
- It is also really hard to know precisely what is running where, it is challenging to enforce policy uniformly, and there is a raft of other nuanced issues that I will dig into in future posts.
- It requires a new skill set and a new set of responsibilities for all developers. Having operations knowledge spread through an organization is a double-edged sword. And finally…
- DevOps struggles to scale to larger teams. The tools generally aren’t built to provide the level of centralized control that larger organizations need. This often leads to a dedicated DevOps org, something that is suspiciously similar to the SysAdmin model.
Cloud Native Operations: API Driven Services
There are two distinct ways to think about cloud native operations. The first is somewhat philosophical and, as a result, tends to play out as geeks writing ‘letters from the future’ to the CTO or IT developer, telling them about this pristine, glowing new world they could live in If They Just Ran Like Google. Or Facebook. Or Twitter. Or Netflix. They just need to believe, buy into containers, orchestrators and microservices, and everything will be swell.
There is undoubtedly truth there. I have seen many small shops more than halve their AWS spend by getting this right, and witnessed 10x improvements in time to production from organizations that have made the jump. Perhaps even more notable are the remarkable efficiencies Google is able to achieve with Borg, and its jaw-dropping improvements over what customers can accomplish with traditional IaaS offerings. But it is hard to internalize if you are approaching it from the other side of the fence, wondering how you are going to get there. What is the first step on this road?
I prefer to come at this Cloud Native story from the other side. If you think about one of the most distinct attributes of the cloud, it is that technology is delivered as a service behind an API. Amazon, Microsoft and Google have effectively put hardware, and a hardware operations team, behind an API. Developers can exercise that API (programmatically, through a tool, or via a web console) to get a piece of virtual infrastructure delivered to them ‘as a service.’
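As a concrete example of “hardware behind an API,” here is a minimal sketch using AWS’s boto3 SDK (the AMI ID is a placeholder, and it assumes credentials are already configured in the environment): one call returns a running machine, no ticket required.

```python
import boto3

# One API call replaces a ticket to a hardware operations team.
ec2 = boto3.client("ec2", region_name="us-east-1")
response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",  # placeholder; use a real AMI ID
    InstanceType="t2.micro",
    MinCount=1,
    MaxCount=1,
)
print(response["Instances"][0]["InstanceId"])
```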
The question we need to ask is: what else can be put behind an API and delivered as a service to the broader organization by a dedicated operations team? And what are the fundamental building blocks that let us push that envelope sustainably?
The Cloud Native operations model embraces the pairing of operations automation with specialized, operations-focused engineers. It offers organizations the ability to specialize around core functions and deliver them efficiently as services at different levels of consumption. The use of SaaS, whether fully packaged applications or discrete infrastructure-related services (like Amazon Redshift or Google’s BigQuery), is already leading organizations down this path. You can get a very useful piece of technology behind an API with an ops team on retainer to help you deal with issues. But you can also find yourself very locked in, dealing with a highly fragmented provider base, and lacking common tooling. And it doesn’t necessarily help you manage your own application. PaaS offerings provide a simpler, API-driven interface to deploy and operate sub-systems, but too often they are tied to a very specific way of operating (12-factor apps only, for example), and they aren’t designed to help you deliver general services to your own organization.
Getting there is going to require some things that organizations aren’t already doing.
Some Ingredients for Cloud-Native Operations
To embrace a cloud-native operations model, there are several ingredients that move you down the path. Technically you could deliver this approach using the ingredients you have on hand (just like you could write object-oriented code in C if you really wanted to), but some technologies just naturally lead you in this direction.
- Containers (Docker, rkt, OCID/CRI-O, Garden, etc.): Containers solve an essential part of the operations problem. They create hermetically sealed, repeatable and reliable units of code deployment. If you want to put an automatically provisioned service behind an API, a great starting point is a repeatable unit of deployment and a sealed application environment (see the sketch after this list).
- Orchestrators (Kubernetes, Mesos, Cloud Foundry Diego, etc.): Clustering technologies, when done right, create programmable ‘logical infrastructure.’ The service provider’s application is well decoupled from the infrastructure, and typical service lifecycle operations can be handled programmatically. The ‘right’ set of abstractions is essential to achieving high levels of automation.
- Microservice Frameworks: Microservice frameworks allow the engineer to tie together discrete sub-systems, potentially operated and managed by different organizations and published behind stable interfaces, into a coherent service. They can optionally replicate that service and publish it to others.
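As a sketch of what that repeatable unit looks like in practice, here is a minimal example using the Docker SDK for Python (the image and port mapping are illustrative, and it assumes a local Docker daemon): the same sealed image runs identically on a laptop or a production node.

```python
import docker

# Connect to the local Docker daemon (assumes Docker is running).
client = docker.from_env()

# Run a sealed, repeatable unit of deployment: the same image produces
# the same application environment wherever it is started.
container = client.containers.run(
    "nginx:1.25",            # illustrative image
    detach=True,
    ports={"80/tcp": 8080},  # map container port 80 to host port 8080
)
print(container.id)
```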
The union of Kubernetes and a container technology like Docker or CRI-O yields a powerful starting point on this road. Docker simplifies development and deployment of individual components, and Kubernetes offers a programmatic ‘logical infrastructure’ platform that makes automation easy. The Kubernetes services framework makes tying together professionally managed services easy.
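To make “programmable logical infrastructure” concrete, here is a minimal sketch using the official Kubernetes Python client (the names, image, and namespace are illustrative, and it assumes a kubeconfig pointing at a cluster): a few API calls deploy replicated containers and publish them behind a stable Service.

```python
from kubernetes import client, config

config.load_kube_config()  # assumes a kubeconfig pointing at a cluster

labels = {"app": "hello"}  # illustrative name

# Deploy two replicas of a container image via the Deployment API.
deployment = client.V1Deployment(
    metadata=client.V1ObjectMeta(name="hello", labels=labels),
    spec=client.V1DeploymentSpec(
        replicas=2,
        selector=client.V1LabelSelector(match_labels=labels),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels=labels),
            spec=client.V1PodSpec(containers=[client.V1Container(
                name="hello",
                image="nginx:1.25",  # illustrative image
                ports=[client.V1ContainerPort(container_port=80)],
            )]),
        ),
    ),
)
client.AppsV1Api().create_namespaced_deployment(
    namespace="default", body=deployment)

# Publish the replicas behind a stable, named Service endpoint.
service = client.V1Service(
    metadata=client.V1ObjectMeta(name="hello"),
    spec=client.V1ServiceSpec(
        selector=labels,
        ports=[client.V1ServicePort(port=80, target_port=80)],
    ),
)
client.CoreV1Api().create_namespaced_service(
    namespace="default", body=service)
```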
The Promise
It is difficult to fully quantify the potential impact of the transition to cloud native operations on today’s enterprise, but the opportunity is compelling. The benefits around consolidation are obvious: I see users reduce their AWS bills when they transition to Kubernetes, and I have personally seen users report between 50 and 80 percent reductions in infrastructure usage.
I don’t think this is where the primary value lies, though. For most IT organizations, the cost of physical infrastructure is lost in the noise versus the opportunity cost of not being able to engage their engineers more effectively to solve problems. The promise is being able to build extremely lightweight, decentralized engineering teams that build on technology provided to them by a wide array of sources. They need to be able to safely consume a wide array of business-specific services, delivered and operated by centralized teams, through simple and consistent platform technologies.
To hear more about Heptio and cloud-native computing, check out our podcast recorded with McLuckie and Beda at KubeCon 2016:
Craig McLuckie and Joe Beda at KubeCon
Docker is a sponsor of The New Stack.