Infrastructure APIs: The Good, the Bad and the Ugly
This year at SCaLE 20x, I kept overhearing people talking about their struggles scaling HashiCorp’s open source Infrastructure as Code (IaC) software, Terraform.
True, we can wrap it inside a runner so that we at least get control, shared state and visibility into runs, but that doesn’t address the fundamental reuse and collaboration problems with plans. When I asked myself why, I realized this is not specifically a Terraform problem.
I’m convinced that there’s a deeper challenge with the ops contract (aka API) between the users and the infrastructure.
I’ve been thinking deeply about scalable infrastructure APIs for nearly a decade in my role at RackN, which offers an Infrastructure as Code platform called Digital Rebar. We’ve collaborated with operations teams at banks, service providers, telcos, gaming and media companies to manage globally distributed infrastructure. These operational API challenges are universal.
So what makes an API scalable? It’s not just the number of requests or machines that it can service. It is much more about enabling reliability, consistency and uniformity of the service supporting the API. Even more so if it empowers the teams supporting the API to collaborate openly while invisibly maintaining the underlying infrastructure and systems.
| | Good APIs | Bad APIs | Ugly APIs |
| --- | --- | --- | --- |
| Cognitive Impact | Low load | High load | Creates anxiety |
| Reliable | People forget the service actually does complex work; its function is assumed. | People resist; requires lots of support on the backend. | In practice, hard to troubleshoot and figure out what happened. |
| Consistent | People trust the results provided by the API, both on success and on failure. | People have to build lots of defenses when using it; requires specialists. | Users cannot predict which inputs and behaviors are needed. |
| Uniform | People can use the API’s abstraction in many scenarios without having to understand the underlying system. | People have to create a wrapper layer above it; it is overly restrictive and cannot innovate. | Information means different things depending on the context in which it’s used. |
The Terraform client tool is an incredibly powerful API abstraction with brilliant single-pass orchestration, but now everyone is wrapping it with a service API to improve scale. So we’d better think carefully about what it means to have a good API around the individual Terraform run.
The first thing everyone needs to really understand is what they are using Terraform to try to fix. Each plan is not abstract, because it must be specific to the cloud, application and cluster. It doesn’t have any way to provide real feedback about what’s been done, how it’s been done, or what is being updated. It’s not even an addressable API unless you wrap it in something else. And APIs that just wrap Terraform are stuck with the plan’s design contract: They have to maintain the expectation that your unit of work is defined by a plan, not by the target state of the broader system, workflow or IaC process.
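To make the contract problem concrete, here is a minimal sketch of the wrapper pattern described above. All names (`TerraformRunner`, `Run`) are hypothetical, and the actual Terraform invocation is only shown in a comment; the point is that even with run IDs, status and logs added, the caller’s unit of work is still a plan directory:

```python
# Hypothetical sketch of a service API wrapped around Terraform runs.
# These class and field names are illustrative, not a real library.
import uuid
from dataclasses import dataclass, field

@dataclass
class Run:
    run_id: str
    plan_dir: str          # the unit of work is still a plan directory
    status: str = "pending"
    log: list = field(default_factory=list)

class TerraformRunner:
    """Adds run IDs, shared status and logs on top of Terraform.
    Note the inherited contract: callers must still hand over a plan;
    the wrapper cannot reshape the unit of work into a broader workflow."""
    def __init__(self):
        self.runs = {}

    def submit(self, plan_dir: str) -> str:
        run = Run(run_id=uuid.uuid4().hex, plan_dir=plan_dir)
        self.runs[run.run_id] = run
        return run.run_id

    def execute(self, run_id: str) -> Run:
        run = self.runs[run_id]
        run.status = "running"
        # A real runner would shell out here, e.g.:
        #   subprocess.run(["terraform", "apply", "-auto-approve"], cwd=run.plan_dir)
        run.log.append(f"applied plan in {run.plan_dir}")
        run.status = "done"
        return run
```

The wrapper improves visibility, but it cannot answer questions about the target state of the broader system, because the plan itself never exposes that.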
To have a good API here, operators need a control plane that serves operational interests behind that API.
We need an abstraction that creates improved transparency for the infrastructure. That’s important to provide clear insight into the workflow and the actions it is taking. Unlike with a Terraform plan, the requestor should influence, but not be able to redefine, the process. That is what got me thinking about how we’ve been building operational APIs at RackN.
In our product Digital Rebar, we specifically differentiate between intent, workflow and state, with clear APIs for each. For an operator, these differences are important. Intent is your objective and can be described as inputs to a process. These intents are an abstraction that shapes what the system will do, like configuration, but they cannot (and should not) fully describe the system, because many decisions cannot be made until the workflow has started. Because an intent does not have to spell out specific configuration details, operators can work at a higher level of abstraction. The automation fills in details via code or makes assumptions based on defaults.
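The intent idea above can be sketched in a few lines. This is not the Digital Rebar API; the function and default values here are assumptions made purely for illustration. The operator supplies only the objective, and the automation fills in the rest from defaults:

```python
# Hedged sketch: an "intent" as partial inputs, with automation filling in
# the remaining details. All names and defaults are illustrative only.
DEFAULTS = {"os": "ubuntu-22.04", "disk_layout": "single", "network": "dhcp"}

def resolve_intent(intent: dict) -> dict:
    """Turn a high-level intent into a concrete configuration.
    The intent shapes the outcome but does not fully describe it;
    the automation supplies unstated details from defaults (or,
    in a real system, from code that runs during the workflow)."""
    config = dict(DEFAULTS)
    config.update(intent)
    return config

# The operator states only what matters to them:
resolve_intent({"os": "rhel-9"})
```

Because the intent is an abstraction rather than a full configuration, operators can reuse the same process across environments without rewriting it.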
Once the operator starts a workflow, Digital Rebar collects state information and guides the transformation of the system toward the intent. The state information is observable, subscribable and addressable via the API throughout the workflow. This means that operators have the transparency to manage systems throughout the process. As automation inevitably bumps into unexpected situations, it is possible to understand how the system arrived at its current state and even to recover. In addition, the workflow artifacts themselves are defined and managed via the API.
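A minimal sketch of what “observable and subscribable state” might look like, assuming a simple event model (again, illustrative names, not the actual Digital Rebar API). Every transition is recorded for later inspection and pushed to subscribers as it happens:

```python
# Hedged sketch of observable workflow state. Each transition is kept in
# an addressable history and also pushed to live subscribers, so an
# operator can see how the system reached its current state.
class Workflow:
    def __init__(self, intent):
        self.intent = intent
        self.history = []          # full transition log, queryable later
        self.subscribers = []

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def transition(self, state, detail=""):
        event = {"state": state, "detail": detail}
        self.history.append(event)     # observable after the fact
        for cb in self.subscribers:    # subscribable as it happens
            cb(event)

wf = Workflow(intent={"os": "ubuntu-22.04"})
wf.subscribe(lambda e: print(e["state"]))  # prints each state change live
wf.transition("provisioning")
wf.transition("configuring", detail="applying os intent")
wf.transition("complete")
```

When something unexpected happens mid-run, the history gives the operator the trail needed to understand and recover, rather than a single opaque pass/fail.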
That design provides clear, persistent and strong controls over the infrastructure behind the API. Even more importantly, it means automation can be secured, repeated and tested using true Infrastructure as Code techniques and Infrastructure Pipelines.
There is significant power in being able to clearly explain what makes APIs effective! It allows us to emphasize the productive behaviors of the platforms we are using. It helps us define criteria to select new systems. And it enables us to ask for targeted improvements to the systems that we are using.
That’s good news, because it is possible to have amazing and productive APIs for infrastructure. We just have to be willing to elevate system and operational needs. When we do, we’ve proven that everyone in the value chain benefits.