How to Scale Your Terraform Infrastructure
My personal journey with Terraform began many years ago, and resource creation is now almost second nature, but I got to this point with the help of documentation, a thriving community and a fantastic partner network. Creating infrastructure is really only one part of the DevOps equation, and while Terraform itself is simple to pick up, operating at scale has uncovered some challenges, including the complexity of deployments, internal and external dependencies, building security in and maintainability.
This article will take you through that journey and identify some of the best practices and solutions for using Terraform at scale.
Using only your Terraform binary and access to your target provider, you can start working with your resources locally, on the same computer where your codebase resides. This approach makes it easy to run day-to-day Terraform operations such as terraform state, terraform import and terraform state mv. Remote state and state locking are also available when working with teammates. If you use version control but no continuous integration, you can use pre-commit-terraform, which runs a set of linting tools before each commit.
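As a sketch, a minimal .pre-commit-config.yaml using pre-commit-terraform might look like the following (hook IDs come from that project; the rev tag here is illustrative, so pin it to a release you have verified):

```yaml
repos:
  - repo: https://github.com/antonbabenko/pre-commit-terraform
    rev: v1.86.0  # illustrative tag; pin to an actual release
    hooks:
      - id: terraform_fmt       # enforce canonical formatting
      - id: terraform_validate  # catch syntax and consistency errors
      - id: terraform_tflint    # lint for common mistakes
```

With this in place, every git commit runs formatting, validation and linting before the code ever leaves your machine.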
As cool as working locally is, relying on multiple manual processes can lead to errors and wasted time. Imagine a scenario where multiple developers are working on a single process. I personally lived through this a few years ago when the team I was on was building out infrastructure and resources on AWS. We either had to wait and rebase for each pull request to apply changes or forgo the ability to work on the code as a team. The tooling at the time led us to make a single person responsible for infrastructure creation and deployment. That person turned out to be me! Writing unit tests for your Terraform modules is a best practice, but running the tests for every change made to your code wastes time when it has to be done manually. Trust me, I skipped this step.
Another concern with local execution is security. Even when using an external solution like Vault, the state file can be accessed, and a simple terraform state pull can be catastrophic. There was also the time a coworker created some IaC locally but was unable to upload it to our GitHub organization. I did the nice thing and said I would create the repository and upload the code. Unfortunately, I didn’t know there were hard-coded AWS access and secret keys buried in it. I took the blame for uploading my coworker’s sensitive data to a public repository.
Creating Your Own CI/CD Pipeline
The next step in the journey was simply adding the Terraform code to your existing (and sometimes legacy) Continuous Integration and Continuous Deployment tool. With CI/CD, the privileged-access issue discussed above can be avoided, as access can be granted only at the execution layer.
At the same time, developers have only read-level permissions. A CI/CD pipeline produces system logs that can be used to track changes on each run, and pull request status checks can be configured to run linting, compliance checks and automated unit tests.
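As one possible shape for such a status check, here is a sketch assuming GitHub Actions (the workflow name and file layout are illustrative, not prescriptive):

```yaml
name: terraform-checks
on: pull_request
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - run: terraform fmt -check -recursive  # fail the check on unformatted code
      - run: terraform init -backend=false    # no state access needed for validation
      - run: terraform validate
```

Because the job never touches state or provider credentials, it can safely run on every pull request from developers who hold only read permissions.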
Concurrently running CI/CD pipelines can lead to race conditions in which some runs fail. When one pull request is merged seconds ahead of another, both trigger pipelines, and some of those runs fail. This is common when the codebase uses state locking (a critical best practice): the first pull request locks the state, causing the others, triggered almost simultaneously, to fail.
This can be solved by configuring queued runs, but that is challenging to set up in most CI/CD tools. To build a CI/CD pipeline capable of successful integration and deployment, we also have to think about unit tests, module sharing and periodic drift detection, among other concerns.
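For reference, the state locking mentioned above is configured in the backend itself; with the S3 backend it uses a DynamoDB table. A minimal sketch, where the bucket and table names are placeholders:

```hcl
terraform {
  backend "s3" {
    bucket         = "example-tf-state"       # placeholder bucket name
    key            = "prod/terraform.tfstate" # example state path
    region         = "us-east-1"
    dynamodb_table = "example-tf-locks"       # DynamoDB table used for state locking
    encrypt        = true
  }
}
```

With this backend, the first run to acquire the lock proceeds and any simultaneous run fails fast instead of corrupting state, which is exactly the race condition described above.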
The same team I described earlier lived through this exact situation. You can even find some videos on YouTube where we describe the setup in detail, from infrastructure to application, all with a single pipeline. This worked great for demo purposes, but operationalizing it would have been extremely difficult. As tempting as it might seem to customize your very own Frankenstein CI/CD, please consider carefully before heading down this route.
Using Open Source
There are many open source tools that can help with Terraform automation. An entire category of tools, called TACOS (Terraform Automation and Collaboration Software), has emerged. When scale becomes a concern, these tools can add capabilities on top of your Terraform infrastructure.
For example, take a look at Atlantis, the most popular open source tool for enhancing Terraform’s automation and collaboration layer. It lets developers run plan and apply operations directly from pull request comments, making day-to-day tasks and operations easier to manage.
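As a sketch, a repo-level atlantis.yaml might look like this (the project name and directory are placeholders for your own layout):

```yaml
version: 3
projects:
  - name: network
    dir: infra/network  # placeholder path to one Terraform root module
    autoplan:
      when_modified: ["*.tf", "*.tfvars"]
      enabled: true
```

With this configuration, Atlantis automatically runs a plan when matching files change in a pull request, and reviewers can apply from a comment once the plan looks right.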
Other popular options include Terratest, Terraformer and Terragrunt. There are many open source Terraform projects available, depending on your requirements.
While I love open source and continue to advocate for it, these tools often fill only a few gaps in the journey, and managed solutions quickly came into existence to fill the rest.
Using Managed Solutions
The first entry into the market was built and released by HashiCorp, the creators and maintainers of Terraform. By nature, newly developed Terraform features can and will be implemented in Terraform Cloud. With policy-as-code mechanisms like Sentinel, it allows you to enforce organization standards before code is deployed.
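As a sketch of what such a standard can look like, here is a Sentinel policy that restricts EC2 instance types; the allowed list is illustrative, and the exact plan-data fields should be checked against HashiCorp’s Sentinel documentation:

```sentinel
# Illustrative Sentinel policy: only allow approved EC2 instance types
import "tfplan/v2" as tfplan

allowed_types = ["t3.micro", "t3.small"]

ec2_instances = filter tfplan.resource_changes as _, rc {
	rc.type is "aws_instance" and rc.mode is "managed"
}

main = rule {
	all ec2_instances as _, rc {
		rc.change.after.instance_type in allowed_types
	}
}
```

A policy like this runs between plan and apply, so a non-compliant change is rejected before any infrastructure is touched.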
The remote execution backend that Terraform Cloud offers is very convenient: you keep working with Terraform on your own computer, your local codebase is uploaded, and the plan runs remotely, giving you an easy way to verify that your configuration works.
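In recent Terraform versions (1.1+), this is configured with a cloud block; the organization and workspace names below are placeholders:

```hcl
terraform {
  cloud {
    organization = "example-org"      # placeholder organization

    workspaces {
      name = "networking-prod"        # placeholder workspace
    }
  }
}
```

After terraform init, running terraform plan locally uploads your configuration and executes the plan in the remote workspace, where credentials and state live.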
Spacelift offers practically the same functionality as Terraform Cloud in terms of Terraform support and deployment execution. You can compare Spacelift with Terraform Cloud directly.
Compliance is an area where Spacelift has an advantage, in part because of its reliance on Open Policy Agent (OPA), the unified open source toolset and framework for policy across the cloud native stack. OPA has become the foundation for many of the security solutions on the market today. The choice of OPA allows Spacelift to provide policy within the solution and to integrate with the many vendors that specialize in security and compliance best practices.
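As a sketch of an OPA policy in the style Spacelift uses for plan policies (the package name follows Spacelift’s convention, but treat the exact input fields as assumptions to verify against its documentation):

```rego
package spacelift

# Illustrative rule: deny any plan that deletes a resource
deny[msg] {
	rc := input.terraform.resource_changes[_]
	rc.change.actions[_] == "delete"
	msg := sprintf("deleting %s is not allowed", [rc.address])
}
```

Because this is plain Rego, the same rule can be unit-tested with the opa CLI and reused outside Spacelift, which is part of the appeal of standardizing on OPA.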
Where do you even begin? The first step is to understand your organization’s best practices and technical requirements. Each organization is different, starts at a different spot and functions with a different set of principles and practices to ensure DevOps success. After being on this journey for several years and talking with peers in the industry, I’ve learned that it is more cost-effective, and better from a human perspective, to leverage a platform like Terraform Cloud or Spacelift to provide the necessary features and scale Infrastructure as Code. That leaves us all with more time to spend with family, enjoy our hobbies and volunteer with our favorite nonprofits instead of worrying about our cloud resources.