Cloud Native DevOps: Four Horsemen of the Operations Apocalypse
This post is part of a series, sponsored by CloudBees, exploring the emerging concept of “cloud native DevOps.” Check back each Monday for future installments, and keep an eye out for the eBook in early 2019.
RunDeck co-founder Damon Edwards kicked off his London DevOpsDays talk with a familiar and painful story about an unnamed organization that took on the latest technologies (the cloud, Docker, microservices, Kubernetes) only to scramble amid the complexity when something breaks.
First, a bridge call happens that just generates a lot of questions. The lead dev escalates it to the Scrum Master. It can’t be figured out. The questions get bigger and bigger and now everyone is on the phone and on a growing ticket: network engineers, business managers, app managers, lead developers, a site reliability engineer (SRE), system administrators, middleware managers, SVP, Chief of Staff, two technical VPs, more middleware folks… the list goes on.
Eventually, the whole network is down, so the operations team is focusing on that. Somebody else realizes the original outage was all because of a changed firewall.
“Can you change it back?”
“OK fill out a form and we’ll do it next week.”
“But it’s an emergency!”
Finally, the company adds a customer engagement manager to the case “who has to test it all instead of trusting SRE,” Edwards said.
When all is fixed, the next day the SVP wants to know:
- What happened?
- Whose fault is this?
- What processes and approvals can we add to keep this from happening again?
“They had digital, agile, DevOps transformation, and site reliability engineering. On the tech side, they had the cloud, Docker, Kubernetes, and microservices. Then the exec team asks why everything takes too long and costs too much,” Edwards said.
But in all of these shiny tools and cultural transformations, operations was just kind of ignored.
According to Edwards, conventional SVP wisdom holds that:
- We need better tools.
- We need more people.
- We need more discipline and attention to detail.
- We need more change reviews and approvals.
He argues we need to forget about these four pillars of this so-called conventionalism and “really challenge the wisdom of operations work” by going after what he calls the Four Horsemen of the Operations Apocalypse:
- Silos
- Ticket Queues
- Toil
- Low Trust
Today we break down how to keep those demons at bay.
1. Cross-Company Silos
Edwards calls this just a different way of working where silos, or traditional departmental divisions, are torn down and everyone in IT shares:
- a common backlog
- common tooling
- a common context
He went on to say ideally everyone uses tools the same way and everyone shares a common set of priorities.
“Nothing lives in isolation, especially in an enterprise, and you always need something from somebody else. That’s when these disconnects start to happen. Our tooling doesn’t really line up and our capacity doesn’t line up, nor does context or process,” Edwards said.
“By optimizing for that you are just creating the disconnect. Silos get in the way of feedback loops, get in the way of learning and quality,” he said.
Edwards added that this isn’t just happening between development and operations. Environment, network, and customer teams are each isolated from one another too, and that isolation has a negative business impact.
2. Ticket Queues
“We create a ticket queue to solve silos,” Edwards said. Referring back to the story above, he said, “Then I wait for something. I don’t know what I’m asking. I’m not a firewall engineer but I’m typing into a blank box trying to explain what I need.”
He says ticket queues not only slow down business processes and DevOps work, they are also expensive.
He cited Donald G. Reinertsen’s work on the principles of product development flow, which argues that queues create:
- Longer cycle times
- Increased risk
- More variability
- More overhead
- Lower quality
- Less motivation
After all, the longer people have to wait for something, the more detached they become.
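To make the "longer cycle times" point concrete, here is a minimal sketch based on Little's Law, a standard queueing result that the talk itself is not quoted as citing. It assumes a stable queue and long-run averages; the function name and the sample numbers are hypothetical.

```python
def average_wait_days(tickets_in_queue: int, tickets_closed_per_day: float) -> float:
    """Estimate the average wait using Little's Law.

    Average time in the system = items in the system / throughput.
    This holds only for long-run averages of a stable queue.
    """
    if tickets_closed_per_day <= 0:
        raise ValueError("throughput must be positive")
    return tickets_in_queue / tickets_closed_per_day


if __name__ == "__main__":
    # Hypothetical team that closes 8 tickets a day: the wait grows
    # in direct proportion to the backlog sitting in the queue.
    for backlog in (8, 40, 120):
        print(f"{backlog} open tickets -> ~{average_wait_days(backlog, 8):.1f} days")
```

With throughput fixed, the only way to shorten the wait is to shrink the queue, which is why limiting queues matters more than asking people to work faster.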
“We talk a lot about value streams and end goals, but we split the goals and we’re distributing and obfuscating the goals,” Edwards said.
He went on to explain that each goal becomes like a snowflake: unique, brittle, technically acceptable, but not reproducible. This makes it much harder to automate things.
“The only thing worse than automating something that’s broken is automating something that’s just a bit off,” Edwards said, pointing to ticket queues as a huge contributor to bottlenecks.
He says the problems with ticket queues compound, and become more aggravating, when they push management to focus primarily on protecting team capacity and when operations repeatedly says no. The latter is often interpreted as Ops being afraid of change, but much of the time Ops is just trying to protect its capacity.
3. Toil
“Toil” is a more common term now because of the growing popularity of site reliability engineering (SRE), especially in the DevOps world. First, let’s distinguish between toil and overhead. Overhead is important work that doesn’t directly affect production services. It may be anything from setting goals to human resources activities to team meetings: important, but not work that necessarily affects the code.
On the other hand, toil typically includes things that are:
- Able to be automated
- Not strategy or value driven
- Repeatedly waking on-call devs up
- Not very scalable
“It may be necessary but it should be viewed as something a little bit icky,” Edwards said.
“Excessive toil prevents us from improving today.” — Damon Edwards
At Google and many other companies, managers try to keep toil to less than 50 percent of the SRE team’s work. Toil doesn’t move the company forward and it can frankly be demotivating to your engineers. Google in particular worries that too much toil will push SREs into a strictly Ops or strictly Dev role, when they should be working with both.
“You want to keep that toil at a manageable capacity because engineering is important for two things — to add business value and to reduce toil so you have more time to improve the business,” Edwards said.
4. Low Levels of Trust
Where are decisions made? How do we escalate stuff up to make decisions?
Edwards says all work that is done is contextual and answers always depend on something. This is particularly true when you’re working with complex distributed systems. Yet the people who create the code still aren’t often the ones making major decisions.
“If these are the people who have the context, why are we escalating it to the people who have little” insight into the decision-making, Edwards asked.
In DevOps, you need to revisit how much decision-making actually needs to be escalated versus how much power should be entrusted to the people actually adding business value and working with the tools.
How Do We Scare Off the Four Horsemen of ‘Opsocalypse’?
The first step for Edwards is creating cross-functional teams, breaking down as many silos as possible.
“It’s about a shared end-to-end responsibility for a service. Not everybody does everything,” Edwards explained. “Netflix talks a lot about ‘There is no DevOps’ — it’s all about teams who have cradle-to-grave” responsibility.
On the other hand, he says the Google model is a Dev and an Ops hybrid with clean hand-off requirements from development to SRE and then error budget consequences where SRE can push back to development. Both sides have the same emphasis on quality.
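The error budget idea mentioned above can be sketched in a few lines. This is an illustration under stated assumptions, not Google's implementation: the `ErrorBudget` class, its field names, and the rule that releases pause once the budget is spent are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Request-based error budget for one service over one window (hypothetical)."""

    slo_target: float      # e.g. 0.999 means 99.9% of requests should succeed
    total_requests: int    # requests served in the window
    failed_requests: int   # requests that violated the SLO

    @property
    def allowed_failures(self) -> float:
        # The budget is whatever the SLO leaves over: (1 - target) * traffic.
        return (1 - self.slo_target) * self.total_requests

    @property
    def remaining(self) -> float:
        return self.allowed_failures - self.failed_requests

    def can_release(self) -> bool:
        # Assumed policy: feature releases continue only while budget remains.
        return self.remaining > 0


if __name__ == "__main__":
    healthy = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=400)
    exhausted = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=1_500)
    print("healthy service may release:", healthy.can_release())
    print("exhausted service may release:", exhausted.can_release())
```

The point of the design is that the push-back from SRE to development rests on a number both sides agreed to in advance, not on a judgment call made during an incident.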
Now, Edwards says you can’t get rid of ticket queues completely. Organizations just have to be aware when they are being used as a general purpose work management system.
“Tickets are really good at documenting true problems, issues, exceptions, and routing for necessary approvals,” Edwards said. “The idea is that you cut down on all the interruptions.”
Overall, Edwards says successful DevOps is about shifting the ability to take action leftward, giving everyone the same tooling and enablement for a safer pathway to do things.
Another part of transparency is tracking toil levels and sharing them with the team. Edwards echoes Google’s practice of capping toil at 50 percent but says that orgs need to fund efforts that actively reduce that toil. He recommends reading David Blank-Edelman’s book “Seeking SRE” to learn how to scale toil-reducing activities.
Finally, Edwards warns that you should challenge that aforementioned conventional wisdom and bring Operations into your digital transformation strategy (it is half of DevOps, after all) and take time to understand how the four horsemen are undermining Ops work. Focus on taking down silos and limiting queues for them as well. Encourage them to focus on self-service Operations as a Service (OaaS) as much as possible.
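Tracking toil can start with something as simple as tagging logged hours. The sketch below assumes a team records hours per work category and decides for itself which categories count as toil; the category names, function names, and the sample week are hypothetical, and only the 50 percent cap comes from the article.

```python
TOIL_CAP = 0.50  # the 50 percent ceiling mentioned above


def toil_fraction(hours_by_category: dict[str, float], toil_categories: set[str]) -> float:
    """Return the share of logged hours spent on categories tagged as toil."""
    total = sum(hours_by_category.values())
    if total == 0:
        return 0.0
    toil = sum(hours for name, hours in hours_by_category.items() if name in toil_categories)
    return toil / total


def toil_report(hours_by_category: dict[str, float], toil_categories: set[str]) -> str:
    """Build a one-line summary suitable for sharing with the team."""
    fraction = toil_fraction(hours_by_category, toil_categories)
    status = "over cap" if fraction > TOIL_CAP else "within cap"
    return f"toil {fraction:.0%} of logged hours ({status}, cap {TOIL_CAP:.0%})"


if __name__ == "__main__":
    # Hypothetical week for one engineer.
    week = {
        "ticket handling": 14,
        "manual deploys": 8,
        "automation project": 12,
        "meetings": 6,
    }
    print(toil_report(week, toil_categories={"ticket handling", "manual deploys"}))
```

Meetings are left out of the toil set here because the article classes them as overhead rather than toil.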