Intention-as-Code: Making Self-Healing Infrastructure Work

21 Sep 2021 4:00am, by

Deepak Giridharagopal
Deepak is CTO at Puppet.

As we’ve become more sophisticated in the way we build, we’ve also become faster at building. Infrastructure is programmable, dynamic, and can be quickly tweaked to meet our needs. Atop that foundation we’ve built elaborate applications that are distributed, containerized and broken into microservices. The rate of change in our apps and infrastructure has increased dramatically.

Basic automation has let us sustain this level of complexity without everything collapsing. We’re now in an era where the plumbing is so advanced that the state and scale of our applications can change in the blink of an eye, faster than the speed of a Git commit. Our tools must adapt and become much more event-oriented and responsive if we hope to keep things under control, which is why self-healing infrastructure is inevitable. It’s our best hope for keeping pace.

Reliability is Non-Negotiable

Every company, including the big players vaunted for their operations aptitude, runs into reliability issues. Outages hurt: They hurt revenue, they hurt the user experience, they hurt your reputation, and they can hurt your teams. Unreliable systems are a competitive disadvantage at best and catastrophic at worst. Companies are increasingly looking toward automation to help them better manage the outages of tomorrow.

Self-healing systems are just what they sound like: automated systems that can both detect and repair errors without much, if any, human intervention. Self-healing infrastructure is the application of this idea to all the things that operations teams manage. It encompasses a wide variety of approaches, including low-level infrastructure-as-code tools, policy enforcement engines, container orchestrators and beyond.

Like most things in the world of operations, some companies are further along this journey than others. As our applications and infrastructure have become more complex, it is now harder than ever to automate all the moving pieces.

One way to break down the problem is by thinking about your automation surface area.

Every application, every team’s infrastructure, every estate — they all have an automation surface area, which includes the set of components involved in all your development, operations and security workflows. This surface area represents all the processes that tie everything together and keep things running. It represents, in theory, all the things you could automate.

For many teams, the operating system (OS) is a significant part of their automation surface area. Operations teams automate workflows that involve direct, OS-level interactions across their entire infrastructure: manipulating file content, installing packages, configuring user accounts, setting up firewall rules and more. In this world, configuration management tools are great; they can easily wrangle the complexity of an OS, and do it safely and securely at scale.

But, as we know all too well, technology is changing. Infrastructure evolves. Application architectures evolve. Platforms evolve. With the rising popularity of microservices, infrastructure-as-a-service and cloud native tooling, the operating system represents a smaller percentage of the overall automation surface area relative to the components and services with which teams interact via APIs.

These APIs operate at a higher level than the OS, which means they present us with some really great abstractions for controlling aspects of our infrastructure and the applications that run on top. This makes it possible to tackle self-healing infrastructure in a way that wasn’t feasible for most of us until now. And yet, very few people are doing it. Honestly ask yourself just how self-healing all your production applications, infrastructure and services truly are.

We have the capability to make self-healing work; we just have to do it.

Digital Duct Tape

We’ve heard over and over from our users how even their most straightforward-sounding operations tasks are deceptively tricky and involve sequencing lots of actions across all manner of different services. Going through these tasks manually introduces room for error, even when the procedures are properly documented. Between responding to service-down incidents, rolling back failed deployments and securing cloud resources, the struggle is real.

For many, solving these problems involves gluing together a patchwork of existing scripts, bespoke in-house tools and third-party services. I call this “digital duct tape.” It’s not pretty, it’s not sustainable and it’s not a permanent fix, but it’s the best many people can do.

I think about the era before continuous integration/continuous delivery (CI/CD), in which lots of bespoke scripts tied together random things in a brittle and unsustainable way. We’ve improved a ton with continuous delivery, but continuous operability remains elusive to most of us.

So what would it take to make self-healing infrastructure more achievable for the masses?


I think that we need to move up a level of abstraction from infrastructure-as-code and start thinking about capturing intention-as-code: “When this thing happens, here is what must happen in response.”

That’s the “core loop” of a self-healing system, and it’s a fundamental part of operations as a field. We capture the set of triggers that indicate a problem, error or situation needing attention, and we capture what actions we need to take to remediate the problem. What actions can run in parallel? Which must wait until a preceding step has completed? When, if ever, do we need human-in-the-loop approval? Actions don’t have to be limited to infrastructure alone; they can involve filing tickets, pinging colleagues on Slack, hitting an API to manipulate cloud resources and more. If it helps fix the problem, then why not automate that part of the process?
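To make that core loop concrete, here is a minimal sketch in Python. It is not the notation of any particular workflow engine; the names Step, Workflow, needs and requires_approval, and the example step names, are all hypothetical. It shows a trigger that decides whether an event needs attention, steps that declare which other steps they must wait for, and an optional human approval gate. Steps grouped into the same "wave" do not depend on each other, so a real engine could run them in parallel; this sketch simply runs them one after another.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set


@dataclass
class Step:
    name: str
    action: Callable[[dict], None]                  # does the remediation work
    needs: List[str] = field(default_factory=list)  # steps that must finish first
    requires_approval: bool = False                 # human-in-the-loop gate


@dataclass
class Workflow:
    trigger: Callable[[dict], bool]  # "when this thing happens..."
    steps: List[Step]                # "...here is what must happen in response"

    def run(self, event: dict, approve: Callable[[Step], bool]) -> List[List[str]]:
        """Run the steps for a matching event and return their names grouped
        into waves. Steps in one wave have no dependency on each other."""
        if not self.trigger(event):
            return []
        done: Set[str] = set()
        waves: List[List[str]] = []
        pending = list(self.steps)
        while pending:
            ready = [s for s in pending if all(n in done for n in s.needs)]
            if not ready:
                break  # whatever is left depends on a step that never ran
            wave = []
            for step in ready:
                pending.remove(step)
                if step.requires_approval and not approve(step):
                    continue  # rejected: skipped, and its dependents never run
                step.action(event)
                wave.append(step.name)
            done.update(wave)
            if wave:
                waves.append(wave)
        return waves


# Hypothetical example: react to a "service_down" event.
log: List[str] = []


def record(name: str) -> Callable[[dict], None]:
    """Stand-in for a real action (ticket, Slack message, API call)."""
    return lambda event: log.append(f"{name}:{event['service']}")


workflow = Workflow(
    trigger=lambda e: e.get("type") == "service_down",
    steps=[
        Step("file_ticket", record("file_ticket")),
        Step("notify_slack", record("notify_slack")),
        Step("restart_service", record("restart_service"), needs=["file_ticket"]),
        Step("rollback_deploy", record("rollback_deploy"),
             needs=["restart_service"], requires_approval=True),
    ],
)

if __name__ == "__main__":
    event = {"type": "service_down", "service": "checkout"}
    print(workflow.run(event, approve=lambda step: False))
```

In this sketch, filing the ticket and pinging Slack land in the first wave, the restart waits for the ticket, and the rollback only runs if the approve callback says yes. In practice the approve callback would be backed by a person, for example a chat prompt or a ticket sign-off.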

Operations-focused workflow engines let users express these triggers and actions as code, in a simplified notation that broad operations audiences can understand and customize to suit their needs. Combining triggers and actions into repeatable workflows leads to truly responsive automation that can cover the full continuum of scenarios that Ops folks regularly face, at the velocity they need.

Most companies understand the value in fully automating their CI/CD pipelines. Manual steps hold up the assembly line and introduce unnecessary risks. Yet software isn’t “done” when it’s delivered; the moment of deployment is only the beginning of the rest of the application’s life, a life that operations teams have to continuously oversee and manage.

When it comes to managing applications through their entire life cycle, CD covers the beginning and self-healing infrastructure can cover the end. We’re going to need both. Whatever tool you use to do this, the important thing is that we work together to get to a place where our systems end up more reliable and where we can spend less time getting paged in the middle of the night to fix stuff and more time relaxing. I think we’ve all earned some relaxation.

Join me at Puppetize Digital 2021 online on Sept. 29-30 to settle in, relax and learn more.

Lead image via Pexels.
