Why Use Infrastructure as Code?
I was a quality engineer on a team at a large e-commerce company that moved from .NET deployed to on-premises data centers to all open source languages on AWS. There was a team of three called the “Cloud” team serving more than 300 software developers, and they quickly put out the message: “We don’t deploy things, we make VPCs (virtual private clouds) and security rules and advise other teams. You deploy things, so someone on your team needs to learn how to deploy your app into AWS.” I raised my hand.
The rule was that we could make things by hand in the dev Amazon Web Services (AWS) account, but anything we created in staging or production had to be built with CloudFormation, and the Cloud team recommended using Troposphere. I spent entire days repeatedly waiting 45 minutes for a CloudFormation stack to build 90% of the way and fail on a missing comma, then waiting another 45 minutes for it to tear down before I could try again. The experience mostly left me with a deep, lasting dislike of CloudFormation, Troposphere and even Python, by association. (Don’t @ me; I’ve written my fair share of Python, without complaint, since then.) But I saw the promise, understood the rationale and reaped the benefits of Infrastructure as Code (IaC) in my future roles.
Building a Platform
After the e-commerce company, I was hired at a field data collection startup with the mandate to move off of Heroku, which it had outgrown. It was early 2018, and Kubernetes was clearly winning as the next base from which to build infrastructure, but it was also still early days. The terms “platform engineering” and “platform team” were nascent. The ecosystem was still young and tooling in many areas was not yet mature. We knew we’d be iterating a lot, and knew we’d need to do it in code. We picked Terraform as our tool, as it was (and in some ways still is) the standard, and we soon got the first benefit of using IaC.
Infrastructure as Code Is a Record
When you’re iterating quickly, especially on infrastructure, it’s easy to think that working in a console would be better, because you can experiment, try things and see what works. But by making all changes in your code and committing them to git as you try them, you have a built-in record of what you’ve tried, both what worked and what failed. There’s a lot of talk these days of building a “personal knowledge base” out of your notes. IaC is the Ops version of this. Why do we turn on the automatic logger on our clusters? Check the git history; you probably left yourself a note explaining why!
Having that record let us work fast on a large and completely new system that nobody, least of all us, had much experience with. We were able to try Amazon Elastic Kubernetes Service (EKS) and decided we didn’t like it much. But we knew what we liked about the internal configuration of our cluster, and migrated those pieces in an attempt at provisioning Kubernetes ourselves, from which we took lessons and bits of code to Google Kubernetes Engine. We tried raw load balancer services in Kubernetes and learned that IP addresses get expensive quickly, then applied what we learned when we adopted Ambassador Edge Stack as our API gateway. We were able to iterate through multiple platforms and land on production-ready solutions in just a few months.
IaC is not only a record of history — it’s also a searchable catalog of what exists in your system and how it’s configured. This is critical for cost management and helping to reduce redundant work. Someone who may not be very familiar with a cloud console can just search through the code to see if a bucket, cluster or queue exists that would serve their needs before they go creating another one. And knowing how your current systems are configured leads to the next reason to use IaC.
Infrastructure as Code Multiplies Your Scale
IaC is revolutionary, but it’s also evolutionary. Operators of infrastructure have used code for decades, from bash scripts up through highly complex configuration management setups using Puppet, Chef and Ansible. SysAdmins, who then became DevOps engineers (and who now might be called platform engineers), have always been lazy in a good way: They don’t want to do anything twice. Once you’re done iterating and know what a good setup looks like, configuration management lets you turn “pets” into “cattle” and stamp out copies of a good server.
What created the need for evolving IaC out of configuration management was the advent of the cloud. Configuration management tools are great for configuring a server that exists, but you can’t put the sync daemon on a server that will only exist after you call an API. IaC provides the same power as configuration management for the provisioning of hardware and services themselves.
While we never stopped iterating on our platform, as it was a time of very fast change and deprecation in Kubernetes and cloud APIs, we did quickly get to a point where our clusters were ready for production. But on our first try, in an attempt to save money, we created everything in a region that added a lot of latency to our app. Whoops. It wasn’t a huge problem: We were able to copy/paste our code, change the region, bring up an entirely new setup and swap over the DNS and content delivery network in about 20 minutes.
Later, as we had R&D teams working on greenfield apps in new regions, we could quickly launch environments for them. Only about four years have passed since then, and now we have major retailers putting Kubernetes clusters, or multiple clusters, in every store, communicating with hybrid data centers around the world. None of this would be possible without IaC.
Building a Legacy
After we successfully migrated our entire system off Heroku to a mixture of Kubernetes for applications and virtual servers for databases (all managed by IaC), I took a CTO role at a small nonprofit that provides free online tutoring for low-income high school students. In February 2020, the organization facilitated 30 tutoring sessions with its web and mobile apps. In September 2020, it facilitated 1,500 sessions and the app was falling over. I started in January 2021, and my responsibilities were:
- Build out an engineering team.
- But first, stop the app from falling over.
Infrastructure as Code Teaches and Preserves
My entire infrastructure, operations and tooling budget for the first year was $10,000. Because our budget was so low, most of the Platform as a Service (PaaS) offerings I would otherwise have started us on were out. We could get much more bang for our buck by going lower down the abstraction ladder, so I started building out servers and configuration (and yes, we even used Kubernetes), all using IaC.
My staff budget was proportionate to my infrastructure budget, so I was mostly hiring junior devs into their first or second job. None of them had experience with infrastructure or Ops. By doing everything in IaC, I was able to ask them to do code reviews with me, where they could ask me everything they didn’t know about. That was a great way to force me to justify why I was doing things a certain way. I could’ve sent them off to do a Cloud 101 course on Udemy, but instead, they got to learn in a familiar way: by reviewing code!
Not only that, but with the evolutions IaC had gone through, I was able to write in their language. All of our app code was written in Node.js and TypeScript, and by this time I was using Pulumi, which supports TypeScript, so our app code and infrastructure code were in the same language. The other huge benefit of that switch was much more powerful, flexible abstractions and strong typing. We could build “environment factory” functions that built everything from just a few parameters. And no more scrolling through endless documentation trying to figure out if this field should be a string or a number. Now there’s IntelliSense for your infrastructure!
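To give a rough feel for the “environment factory” idea, here is a minimal sketch in plain TypeScript. It is not the code we actually ran and it deliberately calls no Pulumi or cloud SDK APIs: every name in it (`EnvironmentArgs`, `ResourceSpec`, `createEnvironment`, the `region-a`/`region-b` placeholders and the node counts) is hypothetical and exists only for illustration. In a real project the factory would construct actual resources instead of returning plain objects.

```typescript
// Hypothetical, dependency-free sketch of an "environment factory".
// None of these names come from Pulumi or any cloud provider SDK.

type Size = "small" | "medium" | "large";

interface EnvironmentArgs {
  name: string;   // e.g. "staging"
  region: string; // placeholder region identifier
  size: Size;     // restricted to the union above; "huge" would not compile
}

interface ResourceSpec {
  kind: "cluster" | "database" | "bucket";
  name: string;
  region: string;
  settings: Record<string, string | number | boolean>;
}

// Illustrative sizing table; the numbers are assumptions.
const NODE_COUNTS: Record<Size, number> = { small: 1, medium: 3, large: 6 };

// Expands a few parameters into a full, consistently named set of resources.
function createEnvironment(args: EnvironmentArgs): ResourceSpec[] {
  const { name, region, size } = args;
  return [
    {
      kind: "cluster",
      name: `${name}-cluster`,
      region,
      settings: { nodeCount: NODE_COUNTS[size] },
    },
    {
      kind: "database",
      name: `${name}-db`,
      region,
      settings: { highAvailability: size !== "small" },
    },
    {
      kind: "bucket",
      name: `${name}-assets`,
      region,
      settings: { versioned: true },
    },
  ];
}

// Two environments from the same definition, differing only in parameters.
const staging = createEnvironment({ name: "staging", region: "region-a", size: "small" });
const production = createEnvironment({ name: "production", region: "region-b", size: "large" });

for (const env of [staging, production]) {
  console.log(env.map((r) => `${r.kind}:${r.name}`).join(", "));
}
```

The point of the types is the feedback loop: a misspelled size or a string passed where a number belongs is flagged in the editor before anything is deployed, and a reviewer who already reads TypeScript can follow the whole environment from one function.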
In the same way that IaC is a record for yourself as you’re building something, it’s also a legacy you leave to others. Legacy is a loaded word in tech, and often used derisively, but when you move on from your current role, it’s critical that you leave comprehensible things to those who come after you. I left the role at the nonprofit over a year ago, and I occasionally get asked what something is, but for the most part, my code speaks for itself about what was built and why. In fact, you can read the GitLab repo and send me critiques. Whenever there’s turnover, you lose institutional knowledge. You lose a lot less when you use IaC.
Infrastructure as Code Is the New Standard
The bottom line is that if you’re not using IaC, you should be. You don’t have an excuse just because you have a lot of existing infrastructure that was created manually. You can import your existing resources. IaC is no longer cutting-edge technology; it’s the new standard. All of us obsessed with new, shiny things have moved on to platform engineering tools like Crossplane and Backstage, which are only possible due to the advancements IaC brought to the field. Whether you start with Terraform, Pulumi or the product specific to your cloud platform, it’s worth starting today.