6 Fundamentals of a Secure Terraform Workflow
Why is it so difficult to ensure your infrastructure is secure? There might be a few reasons:
- Slow, error-prone, manual workflows.
- A lack of built-in security controls or secure templates for infrastructure code.
- Inconsistent or nonexistent policy enforcement processes.
- No system to detect noncompliant or drifted infrastructure.
- Insufficient auditing and observability.
This post offers six fundamental practices for your HashiCorp Terraform workflow — along with stories from high-performing enterprise IT teams that use Terraform — that can help address these issues and ensure secure infrastructure from the first provision to months and years in the future.
But before we get to the specifics, it’s important to understand the dilemma facing organizations that want to secure their infrastructure at scale without slowing down development.
Control vs. Freedom
The two most common problematic states that IT teams experience sit at opposite ends of the developer autonomy spectrum:
State 1: Development teams have to submit infrastructure changes for manual review by security, Ops and/or compliance teams, which can cause bottlenecks and slowdowns.
State 2: Management gives developers freedom to provision whatever they want, with little to no tracking or oversight.
Longstanding enterprises often start in State 1. They carry over their legacy data-center strategies into their cloud-provisioning processes. This keeps provisioning orderly and regulated, but it’s slow and often involves siloed workflows that create other problems. A primary reason companies adopt cloud computing is to accelerate developer experimentation and iteration. But the manual checks in State 1 can cause organizations to fall behind competitors and leave more room for human error.
In an attempt to get out of State 1, many enterprises end up in State 2 as they try to emulate cloud infrastructure success stories like Netflix, where developers take on significantly more autonomy and responsibility. World-class teams with exceptional coordination can make State 2 work, but it’s way too easy to end up with a mountain of technical debt, security holes, infrastructure sprawl and out-of-control costs. Even world-class teams need some rules and standardization.
But can you easily find the best talent? Not only is there an ongoing skills shortage in the job market that makes it difficult to find unicorn developers with the all-around skill sets required for State 2, but frankly, many developers just want to write applications. They don’t want to learn about networking, cloud deployment and security just to get started.
The emerging discipline of platform engineering is about finding the right equilibrium between States 1 and 2: balancing the productivity blockers for developers against the best practices and requirements set by operations, site reliability engineering (SRE), security, finance and compliance teams. Platform teams forge this middle path by providing a standardized shared service with curated self-service workflows, tools and templates for developers that propagate best practices to every deployment while automating secure practices and guardrails.
These six fundamental practices for secure infrastructure provisioning can help platform teams establish a secure balance of control and freedom:
1. Bridge the Skills Gap with a Standardized Workflow
Platform teams should build workflows with the junior developer in mind. Organizations are having trouble hiring engineers with the level of expertise they need, so this approach not only makes developers productive sooner, it also protects the organization’s systems from security flaws or outages that inexperienced developers might cause.
The fewer things a developer needs to learn in order to deploy applications, the better. And fewer tools for operations teams to manage reduces complexity and toil. That’s why teams often adopt the popular platform engineering concept of a standardized workflow, sometimes called a golden or paved path. With a golden path, any developer, senior or junior, from any team, can find documentation on a single portal, log in to Terraform and have a unified workflow for using all the infrastructure they’ll need.
The Golden Path
Creating a golden path is where the platform team starts the process of baking security fundamentals into the workflow and templates (Terraform modules) in an automated, self-service way. Petco and Morgan Stanley used this general approach to help out their developers:
“Maybe they don’t want to think about infosec. They’re worried about their software architecture. They don’t want to think about CMDB [a configuration management database]. Patching compliance, forget it. They just want to go fast.”
— Chad Prey, “Terraform for the Rest of Us: A Petco Ops Case Study”
“We have many developers who want to deploy apps on the cloud, but they aren’t familiar with all the different cloud service providers, or they might want to deploy their application on multiple clouds. By using modules, we can deploy standardized solutions across multiple clouds using a common syntax: HashiCorp Terraform [which uses HashiCorp configuration language (HCL)]. We’re able to bake in our security controls so our developers don’t have to go look at a long list of controls before they’re able to do anything in the cloud.”
— Itay Cohai, “Compliance at Scale: Hardened Terraform Modules at Morgan Stanley”
The gentle learning curve for HCL is a big reason for Terraform’s popularity. It’s easy to read and edit values even if you don’t yet completely understand the language. Platform teams can also set up Cloud Development Kit (CDK) for Terraform as an option for development teams that want to use Python, TypeScript, Java, C# or Go to code their infrastructure. With more than 3,000 Terraform providers, plus the ability to build your own, it’s straightforward to integrate security tools into the Terraform workflow.
Terraform and VCS
A key benefit of Infrastructure as Code is the ability to version and collaborate on configurations in a version control system (VCS). Getting the VCS-to-Terraform workflow right is another key to making the process as frictionless as possible for developers. In order to give platform teams the flexibility to offer a secure and consistent workflow for all the ways their downstream teams prefer to work, it’s ideal to have many ways to initiate this provisioning pipeline, including webhooks from your VCS provider, UI controls within Terraform, API calls or the Terraform CLI. While there are DIY methods for this, simple often works best, so whichever path you choose, don’t build a complex solution that’s hard to maintain and doesn’t scale.
Your goals for building a provisioning workflow should be:
- Automatically initiate runs when changes are committed to the specified branch.
- Use automated checks to predict how pull requests will affect infrastructure.
- Have a central internal module registry.
- Have an integrated secrets management workflow with as few touches as possible, with auto-generation and rotation for credentials.
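The first two goals can be expressed in Terraform itself using the hashicorp/tfe provider. The following is a minimal sketch under stated assumptions, not a definitive setup: the organization, workspace and repository names are hypothetical, and it assumes a VCS OAuth connection already exists in the organization.

```hcl
terraform {
  required_providers {
    tfe = {
      source = "hashicorp/tfe"
    }
  }
}

# Hypothetical input: ID of an existing VCS OAuth connection.
variable "vcs_oauth_token_id" {
  type = string
}

resource "tfe_workspace" "app_network" {
  name         = "app-network-prod" # hypothetical workspace name
  organization = "example-org"      # hypothetical organization name

  # Runs are queued automatically when commits land on the tracked branch.
  vcs_repo {
    identifier     = "example-org/app-network" # hypothetical repository
    branch         = "main"
    oauth_token_id = var.vcs_oauth_token_id
  }

  # Pull requests against the tracked branch get plan-only (speculative) runs.
  speculative_enabled = true

  # Keep a manual confirmation step between plan and apply.
  auto_apply = false
}
```

Leaving `auto_apply` off is one possible choice here; teams with strong policy checks sometimes enable it for lower-risk environments.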
One main focus for secure Terraform provisioning is cloud service and tooling credential management through the workflow. Using a proven secrets management solution integrated with automated secrets generation and rotation is a good start. HashiCorp Vault is a popular choice that integrates well with Terraform.
The next step for platform teams is implementing just-in-time access workflows for each service that Terraform accesses through a provider (using OpenID Connect (OIDC) as a preferred protocol). Eventually, you want to eliminate static cloud credentials from your provisioning workflows.
Dynamically generated credentials with short time-to-live (TTL) are the ideal option. Ephemeral credentials significantly limit the impact of credential exposure and reuse. Terraform dynamic provider credentials offer granular permissions control over your Terraform operations by scoping privileges down to the run phase, workspace, project and organization, which helps you uphold the least-privilege principle. The free version of Terraform Cloud can automatically plug in single-use, autogenerated credentials pulled directly and securely from Vault, Amazon Web Services (AWS), Google Cloud or Microsoft Azure.
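As a rough illustration, dynamic provider credentials for AWS are switched on through workspace environment variables, which can themselves be managed with the hashicorp/tfe provider. This sketch assumes the AWS side (an IAM OIDC identity provider and a role whose trust policy accepts the workspace) has already been set up; the two input variables are hypothetical.

```hcl
# Hypothetical inputs: the target workspace and the IAM role that runs assume.
variable "workspace_id" {
  type = string
}

variable "run_role_arn" {
  type = string
}

# Tells the workspace to authenticate to AWS with short-lived OIDC-based
# credentials instead of static access keys.
resource "tfe_variable" "enable_aws_dynamic_credentials" {
  workspace_id = var.workspace_id
  category     = "env"
  key          = "TFC_AWS_PROVIDER_AUTH"
  value        = "true"
}

# The role assumed for each run; its IAM policy defines the privileges granted.
resource "tfe_variable" "aws_run_role" {
  workspace_id = var.workspace_id
  category     = "env"
  key          = "TFC_AWS_RUN_ROLE_ARN"
  value        = var.run_role_arn
}
```

With this in place, no long-lived AWS keys need to be stored in the workspace. How tightly the role is scoped is decided in its trust and permission policies, which are outside this sketch.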
2. Build Secure Modules
Default settings and templates that ensure secure infrastructure are the first line of defense against breaches. For platform teams that use Terraform, that means coding your organization’s security requirements and best practices into your module collection. There are a variety of ways to bake in security, but Morgan Stanley has a strict example of module security that illustrates how much you can harden this layer:
“At Morgan Stanley, there are required fields in our hardened modules, and if they’re left empty, you can’t build the module in the first place. For example, a KMS [AWS Key Management Service] encryption key field forces the use of encryption in order to build the module. … If someone passes in an invalid KMS, they get an error message right off the bat, and this helps shorten their cycle. … Other ways we make these modules secure is by relying heavily on the environment and execution context. We make use of cloud service provider data sources to fetch sensitive resources that might lead to misconfiguration if they were mishandled by the end user. Things like an AWS region: What happens if someone passes it in as the wrong string? It could get created in the wrong region. They could copy and paste a string that breaks the environment. By pulling this information automatically from the environment and execution context, we can be certain that it’s being deployed correctly.”
— Brett Tagart, “Compliance at Scale: Hardened Terraform Modules at Morgan Stanley”
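The pattern described in the quote might look something like the following. This is an illustrative sketch, not Morgan Stanley’s actual module: the variable name, the regular expression and the choice of an EBS volume are assumptions made for the example.

```hcl
# Required input: there is no default, so the module can't be used without it.
variable "kms_key_arn" {
  type        = string
  description = "ARN of the KMS key used to encrypt the volume."

  validation {
    condition     = can(regex("^arn:aws[a-z-]*:kms:[a-z0-9-]+:[0-9]{12}:key/.+$", var.kms_key_arn))
    error_message = "The kms_key_arn value must be a full KMS key ARN."
  }
}

# Looking the key up makes the plan fail early if the key doesn't exist
# or isn't accessible from the current account.
data "aws_kms_key" "selected" {
  key_id = var.kms_key_arn
}

# Placement comes from the provider's execution context, not from a
# free-form string typed by the module user.
data "aws_availability_zones" "available" {
  state = "available"
}

resource "aws_ebs_volume" "data" {
  availability_zone = data.aws_availability_zones.available.names[0]
  size              = 20
  encrypted         = true
  kms_key_id        = data.aws_kms_key.selected.arn
}
```

The module user supplies only the key; encryption and placement aren’t configurable, which is what makes the module hardened.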
But before organizations can reach this ideal state, they need to centralize their modules.
Early in cloud migrations or Terraform adoption, operations teams often deploy Terraform in separate silos around the organization. Several anti-patterns emerge from this fragmented approach:
- Each team may create its own “snowflake” modules that do the same things as other similar modules in another team’s repository. Instead of sharing modules, they’re performing duplicative labor.
- If an organizationwide standard for compliance and security needs to be followed in these configurations, it can be a nightmare for security teams to review all of these separate module repositories.
- There’s no golden path for developers to follow across the organization.
For many organizations, the solution is to set up a central internal Terraform module registry. This is a good starting point for anyone using a Terraform-based internal developer platform. It provides one place a developer can go to find their platform team’s validated and approved modules, which can be standardized and reused throughout the organization.
Version-controlling, improving and managing these modules is a lot easier when you have them stored and codified in one place. It’s one place for security and compliance teams to review infrastructure code, set requirements and provide feedback. And it’s one place where the platform engineers and multidisciplinary experts of the organization can rally around to implement their field-tested best practices and propagate them out to every deployment.
Platform teams can take this even further by standardizing reusable workspace setups and entire accounts with premade landing zones. The ultimate goal of centralizing modules and other infrastructure templates is to have a single place where developers across the organization can go to find and provision modules that will ensure every deployment is compliant and secure by default.
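From the developer’s side, consuming an approved module from a private registry can be as short as the sketch below. The organization name, module name and input names are hypothetical; the source address follows the hostname/organization/name/provider form used by private registries.

```hcl
# Hypothetical input passed through to the module.
variable "kms_key_arn" {
  type = string
}

module "app_bucket" {
  # Hypothetical module published by the platform team.
  source = "app.terraform.io/example-org/secure-bucket/aws"

  # Pin to a reviewed release line so updates are deliberate.
  version = "~> 2.1"

  # Hypothetical module inputs; everything else is preset by the module.
  bucket_name = "team-a-app-assets"
  kms_key_arn = var.kms_key_arn
}
```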
Streamlining Module Usage
Platform teams are charged with making the lives of developers easier and their infrastructure more secure. The ideal workflow of well-built, easy-to-use modules usually requires only a few fields to be filled in by the developer, with the rest handled by Terraform. Most module repository setups have more friction than this ideal. Developers typically still have to select a module based on its contents, add it to a version control repo, create a workspace in Terraform and provision the module from that workspace.
With the skills gap challenges mentioned above, platform teams should go even further to set up workflows that don’t require developers to be trained on Terraform to provision infrastructure modules. HCL is easy to learn, but no-code workflows allow junior developers to be productive on Day 1 and save time for developers who don’t really need to learn the ins and outs of Terraform. Plus, platform teams can spend less time servicing repetitive internal requests for simple provisioning tasks.
Securing Golden Images
While image builds aren’t strictly in the purview of Terraform, virtual machine and container image management and security is a key component of platform engineering for some organizations. Just like with modules, platform teams want to centralize image management and create a golden image pipeline. The goals are similar to the approach to modules, in which you want a central repository where you can version and track images, as well as update them or revoke their usage in provisioning workflows en masse.
3. Policy as Code Guardrails and Gates
After you’ve set up a central module registry, how do you ensure developers are using the modules they’re required to use? And how do you do that in an automated way? This is where a policy engine with its own policy configuration code adds another layer of depth to your defense. This means using Terraform’s Policy as Code framework, Sentinel, and/or a general policy engine such as Open Policy Agent (OPA). Here’s an example of what Asian Development Bank (ADB) does:
“The way that we enforce our security policies is to place it on the modules. But you can never be 100% sure because maybe the workspace did not use the modules — for some reason, it bypassed the modules. …
Sentinel still enforces policies, regardless of whether the module enforces it or not. Sentinel is going to be that bouncer in a club that allows you to go in or out. For us, that gives us 100% confidence that anything provisioned by Terraform is following our security postures.”
— Krista Camille-Lozada, “Scaling Innovation: ADB’s Cloud Journey with Terraform”
With a policy engine attached, Terraform can automate the enforcement of custom rules within the provisioning pipeline. That means you can write and enforce a policy that requires the use of predetermined secure module sets. Policies can enforce identity and access management (IAM) controls, CIS benchmarks, proper infrastructure tagging, the storage location of data (for GDPR compliance) and many other requirements. One common mistake developers make is deploying storage buckets that are publicly accessible. Policy as Code can stop that too.
Mature Policy-as-Code Practices
The next level of Policy as Code maturity is creating reusable policy sets in the same way you create modules and keeping them in a central repository. As with modules, you can take inspiration for policy sets from the policy libraries section of the Terraform Registry.
It’s important to get policy enforcement levels right so you don’t block progress unnecessarily or leave gaps to bypass non-negotiable policies. Be familiar with levels of enforcement for your policy engines. Sentinel has three levels:
- Advisory: Warns you when a policy is violated but doesn’t stop provisioning.
- Soft mandatory: Requires a manual action to override a policy violation.
- Hard mandatory: Blocks provisioning until policy requirements are met.
Platform teams should use hard mandatory policies only when absolutely necessary, instead giving senior engineers the flexibility to bypass violations when they can manage risks or use a sandbox environment. Junior developers may need firmer restrictions.
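To make the three levels concrete, here is a sketch of a Sentinel policy set configuration file (sentinel.hcl). The policy names and file paths are hypothetical, and the policies themselves would be written separately in the Sentinel language.

```hcl
# Advisory: flag missing tags without stopping the run.
policy "require-cost-center-tag" {
  source            = "./policies/require-cost-center-tag.sentinel"
  enforcement_level = "advisory"
}

# Soft mandatory: stop the run unless an authorized user overrides.
policy "restrict-instance-size" {
  source            = "./policies/restrict-instance-size.sentinel"
  enforcement_level = "soft-mandatory"
}

# Hard mandatory: reserved for non-negotiable rules.
policy "deny-public-buckets" {
  source            = "./policies/deny-public-buckets.sentinel"
  enforcement_level = "hard-mandatory"
}
```

Keeping the hard-mandatory list short, as in this sketch, follows the advice above about not blocking progress unnecessarily.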
Finally, platform teams should look for ways to integrate tools for policy and security checks into one workflow. There are myriad other code analysis tool types that aren’t covered here, and most organizations use tools from many cloud providers and third-party sources, so integrating them all as steps in Terraform’s provisioning pipeline is key to building your golden path. Thankfully, many tool vendors have built run task integrations that plug into the Terraform plan and apply stages.
4. Custom Condition Checks
Another way to streamline module usage and write more maintainable, better-timed security checks is to use input variable validation conditions, pre- and postconditions and checks. Pre- and postconditions are custom rules that can be added for resources, data sources and outputs in a Terraform configuration. Preconditions are checked before object evaluation; postconditions are checked after, during the terraform apply phase. Terraform’s check blocks feature is a more holistic version of postconditions used for the overall functional validation of the infrastructure.
These features allow platform engineers to provide more descriptive error messages when a module is used incorrectly or in an insecure way. This helps developers by:
- Raising errors in context.
- Showing users which inputs they entered incorrectly.
- Providing faster feedback loops.
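A short sketch of input validation and a precondition follows. The approved instance sizes, the variable names and the assumption that approved images live in the caller’s own account are all hypothetical.

```hcl
variable "instance_type" {
  type = string

  # Fails at plan time with an error that names the offending input.
  validation {
    condition     = contains(["t3.micro", "t3.small"], var.instance_type)
    error_message = "The instance_type value must be one of the approved sizes: t3.micro or t3.small."
  }
}

variable "ami_id" {
  type = string
}

data "aws_ami" "selected" {
  owners = ["self"] # assumption: approved images are built in this account

  filter {
    name   = "image-id"
    values = [var.ami_id]
  }
}

resource "aws_instance" "app" {
  ami           = data.aws_ami.selected.id
  instance_type = var.instance_type

  lifecycle {
    # Checked before the instance is created or changed.
    precondition {
      condition     = data.aws_ami.selected.architecture == "x86_64"
      error_message = "The selected AMI must be for the x86_64 architecture."
    }
  }
}
```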
Postconditions are especially beneficial because they can check infrastructure after real values have been added, when security flaws can slip through. This is important because some values inside the configuration aren’t received until the terraform apply phase. One example could be an Amazon EC2 instance, which won’t get its root volume ID from AWS until Terraform requests it, so Terraform won’t know that value until the terraform apply phase.
Platform engineers can add security checks at all phases of the Terraform plan-apply cycle. One postcondition example could be guaranteeing that a configuration for an EC2 instance will be running in a network that assigns it a private DNS record.
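Both postcondition examples can be sketched roughly as follows. The variable names are hypothetical, and the encryption rule on the root volume is an added assumption chosen to show a check on a value that is only known once the instance exists.

```hcl
variable "ami_id" {
  type = string
}

variable "subnet_id" {
  type = string
}

resource "aws_instance" "app" {
  ami           = var.ami_id
  instance_type = "t3.micro"
  subnet_id     = var.subnet_id

  lifecycle {
    # Evaluated after the instance's real attributes are known.
    postcondition {
      condition     = self.private_dns != ""
      error_message = "The EC2 instance must be in a network that assigns it a private DNS record."
    }
  }
}

# The root volume ID only exists once AWS has created the instance.
data "aws_ebs_volume" "root" {
  filter {
    name   = "volume-id"
    values = [aws_instance.app.root_block_device[0].volume_id]
  }

  lifecycle {
    postcondition {
      condition     = self.encrypted
      error_message = "The instance's root volume must be encrypted."
    }
  }
}
```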
Securely designed and validated modules, along with policy guardrails and custom condition checks, ensure a high level of security at the outset of provisioning. The remaining fundamentals focus on maintaining secure infrastructure after provisioning, on Day 2 and beyond.
5. Drift Detection and Continuous Validation
Even with a secure initial provisioning process, secure settings on infrastructure can still be undone or circumvented. Some organizations try to adhere to immutable infrastructure principles, where infrastructure is not modified in place but erased and rebuilt, but others find it hard to follow that pattern 100% of the time. Some teams update certain pieces of infrastructure in place by design or because they have no other choice.
It even happens at LinkedIn:
“There could be a break glass scenario where something’s on fire, ‘I don’t have time for this. Let me run this one command, and it’ll be fixed.’ Or just a force of habit. We found that within a team itself, some embraced IaC culture while going through the review process, but others still wanted to use a CLI.”
— Vaibhav Tandon, “Enabling Infrastructure as Code at LinkedIn”
This opens infrastructure up to the possibility of configuration drift. Teams should always have some system in place to detect this drift. Otherwise, you’re leaving your company open to outages, unnecessary costs and emergent security holes. You can build this, or you can use Terraform-native drift detection and health assessments.
Ideally, platform teams want to provide drift detection that has:
- Customizable alerts that can notify via email, Slack/chat or webhook.
- A dashboard to highlight resources that have drifted.
- Metadata that includes information such as the last time drift was checked.
- Visualizations showing which attributes have changed.
Last, you’ll want fast drift remediation options in your Terraform interface that can accept changes and refresh your state file or overwrite changes to bring infrastructure back in line with the intended configuration.
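If you use Terraform-native health assessments, enabling them and routing alerts might look like the sketch below, written with the hashicorp/tfe provider. The workspace, organization and webhook variable are hypothetical, and whether health assessments are available depends on your plan.

```hcl
# Hypothetical input: incoming webhook URL for the team's alert channel.
variable "slack_webhook_url" {
  type      = string
  sensitive = true
}

resource "tfe_workspace" "app" {
  name         = "app-prod"    # hypothetical workspace name
  organization = "example-org" # hypothetical organization name

  # Turns on health assessments, which include drift detection.
  assessments_enabled = true
}

resource "tfe_notification_configuration" "drift_alerts" {
  name             = "drift-and-check-alerts"
  enabled          = true
  destination_type = "slack"
  url              = var.slack_webhook_url
  workspace_id     = tfe_workspace.app.id

  # Notify when drift is found, a check fails or an assessment can't complete.
  triggers = [
    "assessment:drifted",
    "assessment:check_failure",
    "assessment:failed",
  ]
}
```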
Continuous Validation of Custom Conditions
Drift isn’t the only thing you need to check for throughout your infrastructure’s life cycle. Components such as identity and access management, service configuration and anything used by an application might break post-deployment even if the end result of a terraform apply was successful. Custom conditions like pre- and postconditions are checked only during the initial plan-apply workflow, so platform teams also need a way of continuously validating those conditions on Day 2 and beyond.
Similar to the detection and visibility features for configuration drift, your continuous validation system should provide alerts to various channels, a dashboard showing a list of all ongoing checks along with their status (passed/failed) and the ability to drill down into failed checks to see what error messages were triggered.
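Check blocks are a natural fit for conditions that should keep holding after deployment. Here is a minimal sketch, assuming Terraform 1.5 or later and the hashicorp/http provider; the URL variable is hypothetical.

```hcl
# Hypothetical input: a health endpoint exposed by the deployed application.
variable "app_url" {
  type = string
}

check "app_health" {
  # Scoped data source: it is only read in order to evaluate this check.
  data "http" "app" {
    url = var.app_url
  }

  assert {
    condition     = data.http.app.status_code == 200
    error_message = "${var.app_url} returned an unhealthy status code."
  }
}
```

Unlike a failed pre- or postcondition, a failed check is reported as a warning and doesn’t block the run, which makes it suitable for ongoing validation.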
Together, drift detection and continuous condition validation are important automated systems that close many Day 2 loopholes in your infrastructure’s security posture.
6. Observability and Audit Trails
The final step to building secure Day 2 practices is gaining general observability into the overall security of your infrastructure. In a Terraform environment, this means visibility into your workspaces, with a clear audit trail for all changes. That audit trail should cover:
- The Terraform state file via backups of previous state files
- Secrets and credential usage
- All run activity, including:
  - User comments
  - Any references to the changes that caused the run
  - Policy violations
  - Who overrode a policy and when
In addition to an audit trail, the ability to quickly find and drill into workspaces on your Terraform dashboard/UI is key to speedy debugging and health checking. If you have a lot of workspaces, the ability to tag and filter them is essential. The ability to manage the tags themselves in bulk is also nice to have.
Having central visibility over Terraform versions, along with module and provider versions used in every workspace, is another key observability and reporting component for your infrastructure. You also want to be able to check on workspace access and answer questions about workspace usage:
- Which users are accessing what workspaces?
- What configurations are they changing, at what time, from where?
- Who is accessing, modifying and removing sensitive variables?
- Which users are changing or attempting to change your policy sets?
Overall, platform teams need to provide an organizationwide Terraform audit trail to security and compliance when needed.
Secure Fundamentals Improve Everything
Platform teams typically focus on four things with Terraform: increasing speed, automating secure practices and checks, reducing errors, and improving cost efficiency. While the six fundamentals outlined in this post focus on security, they also contain foundational practices that will help platform teams achieve better speed, reliability, cost optimization and efficiency.