What KPMG Learned About Infrastructure as Code: Tools, People, and Process
Software development has seen great changes that help companies deliver faster, but that has been less true on the operations side, and it needs to happen there too, argued Jeff Ardilio, director of software engineering at the KPMG consultancy.
Infrastructure as code (IaC) offers a way to make operations more agile and efficient, he said at a DevOps World|Jenkins World talk in San Francisco last week. He was recounting lessons learned from the professional services firm’s work with clients.
In the talk, he discussed the considerations around infrastructure as code in an organizational setting, including the potential changes in culture, in daily workflows and in the tools used.
Single Source of Truth
IaC is a way to consolidate infrastructure and all its bits and pieces — from networks to VMs to load balancers — into a single source of truth about an environment.
Infrastructure as code is a declarative model for defining what your infrastructure is going to look like. In this approach, I can look up all my connections. I can share bits and pieces of the configuration with other projects, other teams. And I can have other people review those changes, he explained.
For one financial services client, a team of three was able to manage and create 7,000 different infrastructure resources over seven different environments using IaC. These were production and non-production environments covering multiple regions.
Managing 7,000 different pieces takes a lot of time to do manually and is prone to mistakes. Mistakes lead to drift. Infrastructure as code can help standardize that process and ensure every environment is identical and propagate changes to each of those environments, he said.
For highly regulated industries like financial services, auditing is critical: you need a way to easily review the infrastructure and how it will be deployed. The declarative model is human-readable, so with IaC the work of tech risk, security and internal audit teams can evolve. That builds trust, which speeds the approval process and gets changes out the door faster.
Debugging also takes less time, he said: with the model, or plan, you can simply compare the existing state with the expected state to find the problem, then reapply the plan to restore it.
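That compare-and-reapply idea can be sketched in a few lines. The following is an illustrative Python sketch with hypothetical resource names, not any particular IaC tool: it diffs a declared (expected) state map against an observed (actual) state map to surface drift.

```python
# Illustrative drift detection: compare the declared (expected) state of
# resources against the observed (actual) state of an environment.

def find_drift(expected: dict, actual: dict) -> dict:
    """Return resources that are missing, unexpected, or modified."""
    drift = {"missing": [], "unexpected": [], "modified": []}
    for name, spec in expected.items():
        if name not in actual:
            drift["missing"].append(name)      # declared but not deployed
        elif actual[name] != spec:
            drift["modified"].append(name)     # deployed but changed by hand
    for name in actual:
        if name not in expected:
            drift["unexpected"].append(name)   # deployed but never declared
    return drift

expected = {"vm-1": {"size": "m5.large"}, "lb-1": {"port": 443}}
actual   = {"vm-1": {"size": "m5.xlarge"}, "db-1": {"engine": "postgres"}}

print(find_drift(expected, actual))
# {'missing': ['lb-1'], 'unexpected': ['db-1'], 'modified': ['vm-1']}
```

Anything the diff reports as modified or missing can be restored by reapplying the plan; anything unexpected was created outside the single source of truth.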
SaaS vs. Agnostic
It’s a matter of tools, people and process, Ardilio said.
“There are a lot of tools out there, but we have to have our people and processes aligned to execute on these tools,” he said, explaining he didn’t want to go deep into the technology.
All the major cloud providers offer some level of infrastructure as code, and though it is repeatable, that doesn’t mean it’s push-button start in all these new environments. It does take some preparation to go into a new environment, he said. And if other teams are going to use these environments, you need to understand that and prepare for it.
The downside of IaC-as-a-service is lock-in with that provider, which matters more to some organizations than others. The other consideration is resiliency, as cloud providers are not immune to outages.
Taking a more vendor-agnostic approach eliminates the problem of lock-in, but comes with other concerns.
“If a new product or service becomes available on one of the cloud providers, and I’m using Terraform, for instance, an abstraction layer built on plugins and modules, that new plugin or module might not exist right away with that cloud provider. I need to think about what that means. I might need to look to the community to build support for that module,” he said.
Culture of Sharing
Then there’s the issue of people. “This is probably one of the most difficult hurdles for the organizations I’ve worked with to overcome, the cultural aspect,” he said.
We need to start breaking things down into sharable modules or components that different teams can use. We need to take principles learned on the software side — best practices, iterative processes, feedback loops — and apply them on the infrastructure side, he said.
Software developers working with infrastructure as code might not have infrastructure knowledge, but they do have an engineering mindset about how to break large things down into smaller, usable components.
“How do I find repeating patterns and apply engineering skills to infrastructure? It takes that kind of a mindset,” he said.
“We need to create a culture that is open to experimenting and trying new things. We need to fail fast. We need to learn from our mistakes. We need to share, we need to collaborate, we need to set best practices,” he said.
Building that culture enables an organization to bring in the tech risk folks and security folks earlier in the process.
“They’re part of our team. As part of our team, they’re filling in … the requirement of what we need to implement, this is the acceptance criteria that we expect.
“We push code through and make changes, and they review it based on acceptance criteria. Because of IaC and this data model, it’s actually very human-readable. The tech risk people can actually understand the implementation we’re doing and approve that process. As a tech risk person, I’m no longer signing off on something that’s already running. I’m signing off on what’s about to be deployed,” he said.
Shorter iterations are key to agility.
“We want to get changes out the door. We don’t want them to pile up. I had a client with a quarterly release cycle and that release cycle had 40 approvers listed on the change request. Everyone was afraid to trust this system because everything was manual,” he said.
Once changes pile up, they become riskier. IaC provides a means to build trust: the more the tech risk folks are involved in the process, the more comfortable and trusting they become, and the more willing they are to approve a standard change, which helps get these things out the door quicker, he said.
Version control is as essential in infrastructure code as in application code, he said.
“I’ve seen a lot of companies that have migrated to Git. They’re just using it as another tool. They’re doing the same things they did before, but just in Git and not necessarily leveraging best practices,” he said. With all these changes coming in, how can you best manage releases? He recommends adopting a branching model such as GitFlow.
And if you’re managing seven different environments, source control can be a problem, whether you use one repo with a fork for each environment, a branch for each, or some other scheme. It becomes a mess to manage, and people miss something when trying to propagate changes across environments. KPMG uses a single source code repo — with a file holding the few things unique to each environment, such as its name and the peers it connects to — to guarantee that a change in one propagates the requirements to them all.
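The single-repo pattern can be sketched as a shared base definition merged with a small per-environment override file. This is a hedged illustration in Python with hypothetical field names, not KPMG’s actual implementation:

```python
# Single-repo pattern: one shared definition, plus a small per-environment
# override holding only what differs (name, peers, sizing, ...).

BASE = {
    "vm_size": "m5.large",
    "encrypted_volumes": True,
    "peers": [],
}

ENV_OVERRIDES = {
    "dev":  {"name": "dev",  "vm_size": "t3.medium"},
    "prod": {"name": "prod", "peers": ["audit-vpc"]},
}

def render(env: str) -> dict:
    """Merge the shared base with the environment's unique values."""
    config = dict(BASE)            # a change to BASE reaches every environment
    config.update(ENV_OVERRIDES[env])
    return config

print(render("prod"))
```

Because every environment is rendered from the same base, editing the base is the only way to change shared settings, and the change cannot be forgotten in one of the seven environments.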
We need to support a modular, component-driven design, he said. We need to look for opportunities to create reusable components.
“If I solve that problem one way and another team solves it a different way, if we want to make a change, if we make a change in one place, it needs to change in all the other places. So we need to think about that. Identify those repeating patterns,” he said.
Involve Myriad Teams
Code review is as essential with infrastructure code as with application code.
“As companies start adopting IaC principles, that framework, code review is a great way to share that information,” he said. “Everybody’s going to approach things a little differently, especially when everyone’s still learning. Having multiple people involved in the code review process who approach things in different ways is extremely helpful.”
In these code reviews, teams review not just the code but the plans. With a pull request, the first thing the pipeline does is generate a plan for each of the seven environments attached to it: this is what the system says it will do in each environment, whether that is creating a resource, deleting one or modifying one. You can see how the change will play out in each environment, and that becomes the criteria for accepting the pull request.
In the pipeline, I can start introducing things like policy checks, which create an automated pass/fail based on certain conditions. Those conditions become tollgates for enforcing different policies. If I put out a new storage volume attached to one of my VMs, it has to be encrypted because our policy requires that, and the policy check can enforce it.
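A policy tollgate like the encryption rule can be sketched as a scan over the planned resources. This is a hypothetical illustration in Python, not a real policy engine’s schema:

```python
# Automated policy tollgate: fail the pipeline if any planned storage
# volume is unencrypted. Resource shapes here are hypothetical.

def check_encryption(planned_resources: list) -> list:
    """Return policy violations; an empty list means the tollgate passes."""
    violations = []
    for res in planned_resources:
        if res["type"] == "storage_volume" and not res.get("encrypted", False):
            violations.append(f"{res['name']}: storage volumes must be encrypted")
    return violations

planned = [
    {"type": "vm", "name": "app-1"},
    {"type": "storage_volume", "name": "data-1", "encrypted": True},
    {"type": "storage_volume", "name": "scratch-1", "encrypted": False},
]

violations = check_encryption(planned)
print(violations or "policy check passed")
```

In a real pipeline, a nonempty violation list would fail the build before anything is deployed.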
“These are ways that automation can help us in building confidence in what our changes might produce before we deploy,” he said.
It also provides a means to start testing not just these policies, but the creation and destruction of these environments.
This is a common scenario, he said:
You made a change, deployed it, and it worked. But you didn’t realize your change had an indirect dependency on something else in the plan. The 10-step plan had already run; there’s a change to Step 3, so it becomes Step 3.1. If you apply 3.1, it might run fine, because Step 10 was already applied sometime in the past and didn’t need to be reapplied. But when you try to set up a new environment, maybe months later, the plan starts to fail because of that unknown dependency.
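The scenario can be sketched to show why the failure only appears in a fresh environment. In this hedged Python illustration, a step has a hidden runtime requirement that nothing in the plan declares, so incremental applies succeed while a from-scratch build fails:

```python
# Why undeclared dependencies surface only in fresh environments:
# each step needs certain things to exist at runtime, but the plan
# never declares them, so steps run in listed order regardless.

def apply_plan(plan: list, existing: set) -> list:
    """Apply steps in plan order; a step fails if its runtime needs
    do not exist yet."""
    created = set(existing)
    log = []
    for name, needs in plan:
        if all(n in created for n in needs):
            created.add(name)
            log.append((name, "ok"))
        else:
            log.append((name, "failed"))
    return log

# step-3.1 silently depends on step-10's output
plan = [("step-3.1", ["step-10"]), ("step-10", [])]

# incremental run in a long-lived environment: step-10 applied months ago
print(apply_plan(plan, existing={"step-10"}))

# fresh environment: the hidden dependency bites
print(apply_plan(plan, existing=set()))
```

Regularly building and tearing down a whole environment from the pull request, as he suggests, is exactly what exposes the second case early.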
“Some frameworks have a built-in means to help with that dependency tracking, but that’s not always the case. … This is a great way to discover these things early. If I can allocate states on my pull requests and bring up a whole new environment, then tear the whole thing down, then I’ll be able to identify these things quicker,” he said.
All these teams are working on different bits and pieces, different platforms that are part of a larger whole. Everybody might have their own project they’re managing for each of these pieces. Inevitably there will be shared resources and teams will collide when working on different parts of them. The changes of one group will blow away the changes made by another. It’s essential to figure out how teams can work this out, he said.
CloudBees sponsored this story, written independently by The New Stack.