How DoorDash Governs Its Infrastructure with Open Policy Agent
Online delivery service DoorDash uses Infrastructure as Code (IaC) to automate its DevOps process. To do that safely, the company put safeguards in place to guard against breaking changes and security issues.
DoorDash uses Terraform as its infrastructure provisioning tool. A big selling point of Terraform is that it uses HashiCorp’s HCL (HashiCorp Configuration Language), which is similar to JSON but with additional data structures and capabilities built in, according to a case study on the DoorDash engineering blog written by DoorDash Software Engineer Lin Du and DoorDash Security Engineer Juvenal Santos.
There’s a lot to be said for why IaC is growing in popularity. Maybe it’s because applications and features continue to grow. Or maybe it’s because the more human involvement there is, the more potential there is for human error. Even if the error rate is small, 2% of 2,000 is still 40. That’s not a huge number until you consider the potential severity. Either way, I think that at our core, we are engineers. We communicate in code. We automate. And now that bleeds into DevOps.
Welcome to Infrastructure as Code
In short, IaC means config files, rather than human hands, hold the infrastructure specs. It is the managing and provisioning of infrastructure through code rather than manual processes.
DoorDash uses GitHub for version control and IaC lifecycle management, and a GitOps workflow for CI/CD.
There is no IaC without policy-as-code, but no, we don’t call it PaC (though we might as well). The policies (aka rules, conditions, instructions) are what hold the shape of the automated infrastructure.
What instructions (ahem, policies) might be included in the policy files? DoorDash suggests areas such as cloud native infrastructure, application authorization, or Kubernetes admission control. The rules defining the conditions that infrastructure code must meet to pass a security control and be deployed are also decided when the policy is written.
Policy defines not only what can be done but also what shouldn’t be done. It’s one thing to say “make sure it looks like this,” but what happens if, say, a load balancer is being adjusted? That has the potential to be a breaking change, but it doesn’t have to lead to critical downtime.
DoorDash has a policy guardrail in place that requires extra code review from different teams when needed (e.g., in the case of the load balancer, a traffic engineer would need to review before approval).
Some of the other guardrails are:
- The supported Terraform modules allowed for infrastructure changes. So long as engineers take the recommended approach to deploying a cloud resource, the approval is automated.
- The changes that require security team review.
- The cost parameters around allowable changes to infrastructure.
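To make the guardrails above concrete, here is a minimal Python sketch of how a change could be routed to reviewers. It is an illustration only, not DoorDash’s implementation: the team names, the module name, the cost cap, and the simplified shape of each change record are all hypothetical.

```python
# Hypothetical guardrail tables; none of these values come from DoorDash.
REVIEWERS_BY_TYPE = {
    "aws_lb": "traffic-engineering",
    "aws_security_group": "security",
}
SUPPORTED_MODULES = {"supported/service"}
MONTHLY_COST_CAP_USD = 500.0


def required_reviews(changes):
    """Return the sorted list of teams that must review; empty means auto-approve."""
    teams = set()
    for change in changes:
        # Guardrail: some resource types always need a specialist review.
        team = REVIEWERS_BY_TYPE.get(change["type"])
        if team:
            teams.add(team)
        # Guardrail: changes outside the supported modules need a human look.
        if change.get("module") not in SUPPORTED_MODULES:
            teams.add("core-infra")
        # Guardrail: changes above the cost cap need a budget review.
        if change.get("monthly_cost_usd", 0.0) > MONTHLY_COST_CAP_USD:
            teams.add("finops")
    return sorted(teams)
```

With these assumed tables, a load balancer change made through a supported module would still be routed to the traffic engineering team, while a cheap change to an unlisted resource type in a supported module would come back with no required reviewers.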
The Open Policy Agent Here to Decouple Logic from Logic
The goal is to decouple policy logic from business logic, because the application, aka the business logic, aka policy enforcement, shouldn’t also be the one making the policy decisions. So enter stage left: the Open Policy Agent (OPA). Now folks, if we look slightly to our left we’ll see this lovely open source, general-purpose policy engine decouple policy decisions from the other responsibilities of an application. Policies are written in Rego, a high-level declarative language, and simple APIs let software hand off its policy decisions.
When software needs a policy decision, it queries the OPA by sending structured data, likely JSON, as the input. The OPA then generates policy decisions by evaluating the query input against policies and data.
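As a rough illustration of that query, the sketch below builds a request for OPA’s REST Data API, which accepts a JSON body with an `input` key at `/v1/data/<policy path>` and answers with a `result` key. The policy path `terraform/deny` and the local server address are assumptions made for the example, not details from the DoorDash case study.

```python
import json
from urllib import request


def build_opa_query(input_doc, policy_path="terraform/deny",
                    base_url="http://localhost:8181"):
    """Build a POST request asking OPA to evaluate a policy against input_doc.

    policy_path and base_url are hypothetical defaults for this sketch.
    """
    body = json.dumps({"input": input_doc}).encode("utf-8")
    return request.Request(
        url=f"{base_url}/v1/data/{policy_path}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask_opa(input_doc):
    """Send the query to a running OPA server and return its decision."""
    with request.urlopen(build_opa_query(input_doc)) as resp:
        return json.load(resp).get("result")
```

Only `ask_opa` needs a running OPA server; `build_opa_query` can be inspected on its own to see exactly what would be sent.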
All right, folks! The OPA in Action
Recipe 1. Require core-infra admin group review when critical resources are deleted
Attempts to delete critical cloud resources from infrastructure code will generate an “OPA check failed” message.
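DoorDash’s actual policies are written in Rego, but the logic of a check like this can be sketched in Python against Terraform’s JSON plan output, where each entry in `resource_changes` carries an `address`, a `type`, and a `change.actions` list. The set of “critical” resource types below is a hypothetical stand-in; the case study does not list them.

```python
# Hypothetical list of resource types treated as critical.
CRITICAL_TYPES = {"aws_db_instance", "aws_s3_bucket"}


def critical_deletions(plan):
    """Return addresses of critical resources that the plan would delete.

    A replacement shows up as ["delete", "create"] (or the reverse), so it
    is flagged as well, since the existing resource is destroyed.
    """
    flagged = []
    for rc in plan.get("resource_changes", []):
        actions = rc.get("change", {}).get("actions", [])
        if "delete" in actions and rc.get("type") in CRITICAL_TYPES:
            flagged.append(rc["address"])
    return flagged
```

A non-empty result would be the cue to fail the check and require the core-infra admin group’s review.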
Attempts to create or update a security group that opens port 22 to CIDR 0.0.0.0/0 generate an “OPA check failed” message.
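The same idea can be sketched for the security group rule. This Python version assumes the plan’s `change.after` block for an `aws_security_group` holds inline `ingress` rules with `from_port`, `to_port`, and `cidr_blocks`. It is a simplified illustration rather than DoorDash’s Rego policy: it ignores standalone rule resources and “all traffic” rules that do not name port 22 explicitly.

```python
def open_ssh_rules(plan):
    """Return addresses of security groups that expose port 22 to 0.0.0.0/0."""
    flagged = []
    for rc in plan.get("resource_changes", []):
        if rc.get("type") != "aws_security_group":
            continue
        change = rc.get("change", {})
        # Only creations and updates are of interest here.
        if not {"create", "update"} & set(change.get("actions", [])):
            continue
        after = change.get("after") or {}
        for rule in after.get("ingress") or []:
            low, high = rule.get("from_port"), rule.get("to_port")
            covers_ssh = low is not None and high is not None and low <= 22 <= high
            open_to_world = "0.0.0.0/0" in (rule.get("cidr_blocks") or [])
            if covers_ssh and open_to_world:
                flagged.append(rc["address"])
                break  # one offending rule is enough to flag the group
    return flagged
```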
After a GitHub pull request is created, Atlantis runs a Terraform plan and passes the plan file to conftest. Conftest pulls the custom Rego policies from an Amazon Web Services S3 bucket, evaluates the OPA policy based on the Terraform plan, then comments the output to the PR.
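The last step of that flow, turning policy results into a PR comment, might look something like the Python sketch below. It assumes the results arrive as a list of per-file records, each with a `filename` and a `failures` list of `{"msg": ...}` entries, roughly the shape of conftest’s JSON output. The exact comment wording is hypothetical, apart from the “OPA check failed” phrase used in the case study.

```python
def format_pr_comment(results):
    """Render policy results as a plain-text comment for the pull request."""
    lines = [
        f"- {record.get('filename', '?')}: {failure['msg']}"
        for record in results
        for failure in record.get("failures") or []
    ]
    if not lines:
        # Hypothetical wording for the passing case.
        return "OPA check passed"
    return "OPA check failed\n" + "\n".join(lines)
```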
All of this takes place in a single action which is pretty wild, if you consider this sort of thing wild… and you just might since you made it this far in the article.
The owner knows rather quickly whether the change meets all policy requirements or needs revisions before it’s submitted for further review. This probably doesn’t hurt relations between the DevOps and other engineering teams either.
There is fertile ground for more exploration of how this can be used on a larger scale. DoorDash mentioned that it is exploring cloud cost policies. As the DoorDash duo writes about continuing to push in this direction, “The goal is to provide a unified experience for all infrastructure operations, which we believe is the future of engineering workflow changes.”