The Great DevOps Train Wreck is About to Crash into Your Data Center
If you knew a disaster was about to happen, you’d say something, right?
Well, I see one coming, and I’m raising the alarm. Based on the conversations I’m hearing, it’s becoming dangerously fashionable to say: “All the interesting technology problems in DevOps are solved. We have Docker. We have Kubernetes. We have OpenStack. They’re stable, and they’re widely deployed. All that’s left to do is process transformation, teaching dullard enterprise IT teams how to consume the new-new IT.”
This is B.S., and it’s dangerous. We in the DevOps community need to get focused — and fast — on solving a basket of problems that threaten the progress and adoption of agile methodologies.
The DevOps revolution, driven largely by containers, is placing new and mighty demands on the enterprise data center, demands that will break infrastructure and policies designed for a dying era of monolithic apps. If we don’t work together as a community to build a new generation of tooling for the DevOps era, performance won’t meet expectations, and then we’ll all get to read articles about how DevOps failed.
Good news is, we can fix this. There are three areas that need attention. If we get busy, we can prepare for the fundamentally new infrastructure and process demands of DevOps. Here’s what needs fixing:
- Automating the Developer / Operator Workspace
- Infrastructure Stressors Arising from Short Life Cycles
- Managing From the Metal, Not the Host OS
Let’s look at each in turn.
Automating the Developer / Operator Workspace
Waterfall affords ample time for software developers and infrastructure operators to get on the same page concerning service levels, demand loading, and the unique configuration requirements of each app. In DevOps, that’s out the window for the most part.
Iteration cycles are so much faster — AMEX once said DevOps led to a 40x increase in deployment iterations over comparable time periods. Developers and operators working in agile settings can no longer rely on tribal knowledge and cultural norms to manage the relationship between the two. We need tools that automate shared workspaces that devs and ops can use together, in real time, to see how software is being deployed, consumed and managed.
You cannot just crank out an application, throw it over the fence like we all did in waterfall and expect ops to make it work in production across all loading and operational circumstances. Collaboration is key. Both teams need to be on the same page without getting into each other’s turf. Without these tools to monitor in production and share that info with dev and ops, we’re running blind.
Infrastructure Stressors Arising from Short Life Cycles
Let’s talk about short application lifecycles. In agile environments — especially where containers are the compute scheme — applications can effectively be created and destroyed in minutes. Our current monitoring tools rely on logging and other OS-level data. It takes several minutes to collect and analyze this information. To this, we add some amount of time for operator intervention based on what the analysis is telling us.
When container life cycles can last just a few minutes, these monitoring approaches fail. Stated more bluntly, a tool that takes longer to use than the life of the thing it seeks to fix is useless. We need tools that work in real time, or nearly so. And we need to automate application performance tuning without human intervention, using intelligent rules that adjust how applications are managed and that themselves change dynamically as conditions change.
All this means we need rules-based automation of application orchestration. Humans need to be removed from the loop altogether.
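To make "rules-based automation" concrete, here is a minimal sketch of a rule loop that decides on actions with no operator involved. Everything in it is an assumption for illustration: the metric names (`cpu_util`, `p99_latency_ms`), the action labels (`scale_out`, `reschedule`), and the helper `latency_rule` are hypothetical and do not come from any particular orchestrator.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Metrics = Dict[str, float]  # one sample of (hypothetical) metric names to values


@dataclass
class Rule:
    name: str
    condition: Callable[[Metrics], bool]
    action: str  # hypothetical action label handed to an orchestrator


def evaluate(rules: List[Rule], metrics: Metrics) -> List[str]:
    """Return the actions of every rule whose condition holds for this sample."""
    return [rule.action for rule in rules if rule.condition(metrics)]


def latency_rule(baseline_ms: float, factor: float = 2.0) -> Rule:
    """Build a rule whose limit is derived from a measured baseline, not a fixed number."""
    return Rule(
        name="latency_vs_baseline",
        condition=lambda m: m.get("p99_latency_ms", 0.0) > factor * baseline_ms,
        action="reschedule",
    )


rules = [
    Rule("cpu_saturated", lambda m: m.get("cpu_util", 0.0) > 0.9, "scale_out"),
    latency_rule(baseline_ms=40.0),
]
sample = {"cpu_util": 0.95, "p99_latency_ms": 70.0}

if __name__ == "__main__":
    print(evaluate(rules, sample))
    # The rules themselves change dynamically: rebuild one from a fresh baseline.
    rules[1] = latency_rule(baseline_ms=30.0)
    print(evaluate(rules, sample))
```

The point of the design is the last two lines: the same sample triggers a different set of actions once the latency rule is rebuilt from a newer baseline, which is roughly what it means for the rules to change along with the environment.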
Let’s complicate this further: You likely will have no clue where in your infrastructure a specific set of containers is running, nor will you know what else is running on that node. Even if it’s private infrastructure, it’s probably a shared environment to one degree or another.
Smart people are working on tools to address this, including our own engineering team. Popular open source tools like Zabbix and Nagios rely on static configuration that humans write in advance. A newer generation of tools like DataDog is designed with flexible schemas that let applications emit bits of OS-level performance information for monitoring, offering a first step towards automated outlier detection. For a description of how outlier and anomaly detection works, check out Homin Lee’s presentation from OSCON this year.
This is all a good start, to be sure. But, as we’ll see next, we need to get below the OS. We need to see and manage what’s happening at the processor level. In a DevOps environment, systems are dynamic. Workloads are dynamic. The only way to manage this two-variable environment is with software that can automate dynamic allocation based on policies and do it in real time. All of the old tools we relied on work on static thresholding.
The underlying problem is that no human can continuously supply the context of a changing infrastructure and application environment. We need software that monitors and controls this dynamically and in real time.
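The gap between static thresholds and outlier detection can be shown with a small sketch. It compares a fixed CPU limit with a check based on the median absolute deviation (MAD) across a group of peers. MAD is one common choice for this kind of check, used here as an assumption; it is not a claim about how any tool named above works. The host names, readings, and the `tolerance` value are made up for illustration.

```python
from statistics import median
from typing import Dict, List


def static_alerts(cpu: Dict[str, float], limit: float = 0.9) -> List[str]:
    """Old-style check: flag any member above a fixed, hand-configured limit."""
    return sorted(name for name, value in cpu.items() if value > limit)


def mad_outliers(cpu: Dict[str, float], tolerance: float = 3.0) -> List[str]:
    """Flag members that sit far from the group's median, measured in MADs."""
    values = list(cpu.values())
    center = median(values)
    mad = median([abs(v - center) for v in values])
    if mad == 0:
        # At least half the group sits exactly on the median; any deviation stands out.
        return sorted(name for name, v in cpu.items() if v != center)
    return sorted(
        name for name, v in cpu.items() if abs(v - center) / mad > tolerance
    )


# Hypothetical CPU utilization for five replicas of the same service.
fleet = {"web-1": 0.30, "web-2": 0.32, "web-3": 0.31, "web-4": 0.29, "web-5": 0.70}

if __name__ == "__main__":
    print(static_alerts(fleet))  # [] : nothing crosses the fixed 0.9 limit
    print(mad_outliers(fleet))   # ['web-5'] : far from what its peers are doing
```

The static check says nothing because no replica crosses the preset line, while the peer comparison flags the one replica that behaves differently from the rest. No human had to decide in advance what "normal" looks like for this workload.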
Management in Metal
The first two points illustrate why this one is so important. Current monitoring technology relies on data that comes from the host operating system. Unfortunately, the host OS has a dangerously limited view of what’s happening at the processor level.
Here’s why this matters. Users deploy containerized applications as pods. Those pods are deployed to servers, and the containers from those pods execute on individual threads on each node. That leaves a lot of bad things that can happen below the OS. Today’s monitoring tools work “top down.” That is, they look at application intent, or what resources the developer has described in policies as necessary for the application to meet its SLAs.
But, to be successful in a DevOps environment, operators need to know what is happening both top down and “bottom up.” Bottom up means monitoring and control, starting at the thread level and looking up at the containers and pods running there. It’s not an either/or situation. We need both.
Intel is working to address the bottom-up challenge. Its Resource Director Technology (RDT) exposes counters that allow software to intelligently monitor and manage resource consumption by containers at the thread level. Our team has integrated RDT, and others will too.
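As a rough illustration of what "bottom up" data looks like, the sketch below reads RDT monitoring counters through the Linux `resctrl` filesystem, which is one way the kernel exposes them. This is an assumption about the deployment, not a description of our integration: it presumes `resctrl` is mounted at `/sys/fs/resctrl`, that the hardware supports monitoring, and that a monitoring group (the hypothetical `pod-a`) has already been created with the relevant thread IDs written to its `tasks` file.

```python
from pathlib import Path
from typing import Dict, Optional

# Conventional resctrl mount point; an assumption about the host.
RESCTRL_ROOT = Path("/sys/fs/resctrl")


def read_counter(path: Path) -> Optional[int]:
    """Read one counter file; return None if it is missing or not a number."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


def group_usage(
    group: str, root: Path = RESCTRL_ROOT
) -> Dict[str, Dict[str, Optional[int]]]:
    """Collect cache occupancy and memory bandwidth counters per L3 domain."""
    mon_data = root / "mon_groups" / group / "mon_data"
    if not mon_data.is_dir():
        return {}
    usage: Dict[str, Dict[str, Optional[int]]] = {}
    for domain in sorted(mon_data.glob("mon_L3_*")):
        usage[domain.name] = {
            "llc_occupancy": read_counter(domain / "llc_occupancy"),
            "mbm_total_bytes": read_counter(domain / "mbm_total_bytes"),
        }
    return usage


if __name__ == "__main__":
    # "pod-a" is a hypothetical monitoring group name.
    print(group_usage("pod-a"))
```

On a host without RDT support or without the group, this simply returns an empty result. A scheduler could sample these values per group and compare them with what the pod's policy asked for, which pairs the bottom-up view with the top-down one.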
It’s All About Context
The coming DevOps data center train wreck is all about context. Both applications and infrastructure are constantly changing in these dynamic environments. Monitoring tools built on static thresholds are useless in this world. Developers and operators need tools that understand the context of what’s happening in the applications and the shared infrastructure, from both bottom-up and top-down views, and those tools need to automate scheduling of containers and pods dynamically in real time. Without this full-view context, the train is going to run off the tracks.
We’re staking our future on solving this problem, and others are as well. If more smart people get busy on it, we will all benefit by getting our industry into shape for the new-new way of writing and running software.
Docker is a sponsor of The New Stack.
Feature image: “Versailles Rail Accident” by A. Provost, Public Domain.