One SRE’s Struggle and Success to Improve Infrastructure as Code
Puppet sponsored this post, written for Puppetize PDX in Portland, Oregon, Oct. 9-10. This two-day, multitrack conference will feature user-focused DevOps and infrastructure delivery talks and hands-on workshops.
Last year, Oscar Health, a health insurance company that develops seamless technology and provides personalized support for our members, decided to set aside time to improve our infrastructure review process, prompted by rapid growth in employees and in member interactions with our products. Oscar’s goal is to engage our members and make their health care easier to navigate, more transparent, and more affordable. Notably, we are a full-stack insurer: we’ve built the full technology platform to support our members’ health care needs.
Because we are growing rapidly and hiring large groups of developers, we wanted to make sure these groups would do more than just support the ongoing evolution of our infrastructure. This, however, wasn’t happening. During onboarding, we heard that tech teams were spending hours building systems by hand, or waiting weeks for someone else to build the systems for them. This made infrastructure slow to deliver, created inconsistencies in quality and left our infrastructure work non-reproducible.
The end result was that all of the applications and services we delivered on top of our infrastructure suffered.
After some research, we discovered that unexpected and unreviewed changes were one of the leading causes of downtime. It was imperative for us to start to build out a better practice of infrastructure as code and the culture around this. What we found (at least at my company) was that infrastructure as code is more of a culture change than a technical change.
In this post, we describe the key things we did to improve code review and, through it, our systems. The post highlights some of the challenges we faced, not only around tooling but also with team dynamics: we had to shift workflows and systems to cut infrastructure downtime and make all our apps and services perform well, which translates into a better member experience.
We also made the shift, in part, by applying what we learned in Puppet’s guide for transitioning to infrastructure as code. The part of the guide we used to help solve application and services-delivery issues was on code reviewing infrastructure.
Finding Your Desired State
Convergence is our goal: we expect our infrastructure to reach, over time, the desired state expressed in the code. Idempotence means the software can run as many times as needed without making unintended changes. With both in mind, we built an in-house service that runs as specified to apply the configurations held in source control. Traditionally, we’ve aimed for a masterless configuration design, so our configuration agent looks for information on the host itself.
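To make convergence and idempotence concrete, here is a minimal sketch in Python. It is not Oscar’s in-house service: the `Host` class, the `converge` function and the package and file names are all hypothetical, and the host is modelled in memory rather than touched for real. The point it illustrates is that the first run reports changes and a second run against the same desired state reports none.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    """In-memory stand-in for a real host (hypothetical, for illustration)."""
    packages: set = field(default_factory=set)
    files: dict = field(default_factory=dict)


def converge(host, desired_packages, desired_files):
    """Move the host toward the desired state and return the changes made."""
    changes = []
    # Only install what is missing, so repeat runs do nothing.
    for pkg in sorted(desired_packages - host.packages):
        host.packages.add(pkg)
        changes.append(f"install {pkg}")
    # Only rewrite files whose content differs from the desired content.
    for path, content in sorted(desired_files.items()):
        if host.files.get(path) != content:
            host.files[path] = content
            changes.append(f"write {path}")
    return changes


host = Host(packages={"openssh"})
desired_packages = {"openssh", "chrony"}
desired_files = {"/etc/chrony.conf": "pool pool.ntp.org iburst\n"}

first_run = converge(host, desired_packages, desired_files)
second_run = converge(host, desired_packages, desired_files)
print(first_run)   # ['install chrony', 'write /etc/chrony.conf']
print(second_run)  # []
```

Returning the list of changes, rather than just applying them, also gives a cheap signal for drift: a non-empty list on a host that should already be converged means something changed outside of source control.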
A particular host, for example, may run any number of roles. For consistency, every host gets a common role made up of configuration tasks, such as making sure networking, services and packages are set up. Primary roles are then broken down into secondary and other tag rules.
Only a few exceptions exist, for credentials that need to be pushed out to a specific host from time to time. We sought to refactor these exceptions so they no longer use SSH, since we view SSH-ing itself as an anti-pattern that makes systems harder to operate. (It’s a good thing only 5% of our working infrastructure relies on SSH.)
Put a Development Workflow into Practice
Experience suggests trunk-based development is a key enabler of continuous delivery and continuous deployment. Trunk-based development encourages developers to commit to a single branch to avoid conflicts. At Oscar, we call it “the world”: it includes every dependency needed to build a particular piece of software at our company, along with passing tests.
Trunk-based development is good for ensuring configuration drift doesn’t happen. It also keeps machines as identical as possible. On the cultural side, Oscar focuses on a shared responsibility model where developers are free to contribute to infrastructure repositories. We expect differentials to contain as much context as possible, including but not limited to design documents and JIRA/Phabricator tickets. Differentials and commit messages serve as the historical record for changes and are important when refactoring, investigating issues, etc., especially as original authors come and go or change teams.
An infrastructure upgrade should come with a document covering the proposed maintenance time, a sample email to send to stakeholders, everything that may go wrong or right and, if possible, a checklist of upgrade and rollback steps. Our checklist outlines a pre-, in- and post-flight plan. Documenting ideas invites expertise from others, and you can learn a lot. It also gives other teams a voice, especially when making changes that affect them. This helps us catch problems early, and it’s cheap and easy to do.
Automation in Code Review Process
Don’t worry if you only have 30% automation coverage in the beginning. We currently do not have everything automated. What is especially important is presenting build output to developers at the commit level, whether it comes from formatters, static analyzers or any other tool that evaluates code against your development standards. This may include checks on variable declarations, naming schemes and other style rules. These checks make code easier for everyone to read and catch problems early. Tightening the feedback loop keeps developers motivated to follow standards.
Another tip: automating linters and configuration error checks at commit time is a good step toward continuous delivery.
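As an illustration of a commit-level check, the sketch below lints a small configuration snippet for two style rules: trailing whitespace and variable names that are not snake_case. The rules, the `lint` function and the sample input are assumptions made for this example, not Oscar’s actual standards or tooling. In practice a hook or CI job would run something like this on each changed file and show the findings next to the commit.

```python
import re

# Hypothetical style rules for this sketch.
SNAKE_CASE = re.compile(r"^[a-z][a-z0-9_]*$")
ASSIGNMENT = re.compile(r"^\s*([A-Za-z_][A-Za-z0-9_]*)\s*=")


def lint(text):
    """Return one human-readable finding per style violation."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if line != line.rstrip():
            findings.append(f"line {lineno}: trailing whitespace")
        match = ASSIGNMENT.match(line)
        if match and not SNAKE_CASE.match(match.group(1)):
            findings.append(
                f"line {lineno}: variable '{match.group(1)}' is not snake_case"
            )
    return findings


sample = "max_conns = 100\nMaxRetries = 3 \n"
findings = lint(sample)
for finding in findings:
    print(finding)
# line 2: trailing whitespace
# line 2: variable 'MaxRetries' is not snake_case
```

Reporting the line number with each finding is what makes the output useful at review time: the developer sees exactly where the standard was missed without rerunning anything locally.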
Testing Your Infrastructure
Deployments get complex when, for example, introducing systems built on a consensus algorithm such as Raft. Since it’s hard to get away from introducing Raft with services such as etcd, effective canary strategies must be applied. It’s not a good idea, for example, to apply configuration that involves restarting service units on every node in your cluster at once. It’s hard for everyone to know the impact of these changes for a specific application, even for the primary engineer on call. One way our team prevented these failures is by sending a notification that configuration changes were detected in production and that manual intervention is needed for the changes to take effect.
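One possible canary strategy for a consensus-based cluster is to restart a single node, confirm it is healthy, and only then continue. The sketch below is a hypothetical illustration, not the process described above: the node names and the `restart` and `is_healthy` callbacks are stand-ins for whatever your tooling provides. The `tolerable_failures` helper encodes the standard majority-quorum rule that a cluster of n voting members keeps quorum with up to (n - 1) // 2 members down.

```python
def tolerable_failures(cluster_size):
    """Members that can be down while a majority quorum survives."""
    return (cluster_size - 1) // 2


def rolling_restart(nodes, restart, is_healthy):
    """Restart nodes one at a time, halting at the first unhealthy node.

    Returns (nodes restarted successfully, node that failed or None).
    """
    done = []
    for node in nodes:
        restart(node)
        if not is_healthy(node):
            # Stop here so the remaining nodes keep serving with quorum.
            return done, node
        done.append(node)
    return done, None


# Demo with fake callbacks: the second node fails its health check.
restarted = []
nodes = ["etcd-1", "etcd-2", "etcd-3"]
done, failed = rolling_restart(
    nodes,
    restart=restarted.append,
    is_healthy=lambda node: node != "etcd-2",
)
print(tolerable_failures(len(nodes)))  # 1
print(done, failed)                    # ['etcd-1'] etcd-2
print(restarted)                       # ['etcd-1', 'etcd-2']
```

In the demo the third node is never touched, so a three-node cluster is left with one unhealthy member, which is within what it can tolerate. Restarting all three at once would not be.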
Always test in production. Chances are there are roles where you only have production instances, or the environments are so different you have to “do it live.” An example is testing character set changes on sensitive data that you cannot bring to a non-production environment. There aren’t as many examples of testing configuration changes as we’d like, yet configuration changes shouldn’t cause your cluster to go down. For example, Prometheus provides a test flag to test the configuration. Add this step to your configuration run to ensure the configuration builds and runs, and fail hard if the service won’t be able to start.
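The “validate first, fail hard” step can be sketched as a small gate in Python. This is an assumption about how such a step might look, not a description of Oscar’s tooling: `validate_config` and `apply_config` are hypothetical names, and the validator is any command that exits non-zero on a bad configuration. For Prometheus that command could be `promtool check config <file>`; the demo uses stand-in commands so it runs without Prometheus installed.

```python
import subprocess
import sys


def validate_config(validator_cmd):
    """Run the validator and raise if it rejects the configuration."""
    result = subprocess.run(validator_cmd, capture_output=True, text=True)
    if result.returncode != 0:
        detail = result.stderr.strip() or result.stdout.strip()
        raise RuntimeError(f"config validation failed: {detail}")


def apply_config(validator_cmd, restart):
    """Restart the service only after the new configuration validates."""
    validate_config(validator_cmd)
    restart()


# Stand-in validators: one accepts, one rejects with a message on stderr.
passing = [sys.executable, "-c", "import sys; sys.exit(0)"]
failing = [
    sys.executable,
    "-c",
    "import sys; sys.stderr.write('bad rule'); sys.exit(1)",
]

restarts = []
error = ""
apply_config(passing, lambda: restarts.append("restarted"))
try:
    apply_config(failing, lambda: restarts.append("restarted"))
except RuntimeError as exc:
    error = str(exc)

print(restarts)  # ['restarted']
print(error)     # config validation failed: bad rule
```

Because the restart callback runs only after validation succeeds, a rejected configuration leaves the running service untouched instead of taking it down.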
Instantiating Your Infrastructure from a Defined Model
Treat your infrastructure as a service too. Even if it’s a third-party service, operate it as if it is something you’ve built. Give services the same level of discovery, monitoring, alerting and tracing wherever you can, even if it’s a one-off script that runs. Register it as a service in a way that operators or users don’t have to use SSH to figure out if it’s running.
A lot of our mean time to repair went toward SSH-ing into a box to see if a config was running. We started adding metrics that generate alerts when a configuration has not run as many times as intended. Register services, including subprocesses spawned from the main config process, somewhere with a UI so operators and users can see whether configs are currently running.
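A simple form of the “has not run as intended” alert compares the time of the last successful run with the expected run interval. The sketch below assumes a metric holding the last-success timestamp; the function name, the two-missed-runs threshold and the numbers in the demo are hypothetical choices for illustration.

```python
import time
from typing import Optional


def config_run_is_stale(
    last_success_ts: float,
    interval_s: float,
    missed_runs: int = 2,
    now: Optional[float] = None,
) -> bool:
    """True when more than `missed_runs` intervals passed without a success."""
    if now is None:
        now = time.time()
    return (now - last_success_ts) > interval_s * missed_runs


# Demo with fixed timestamps: the config should run every 10 minutes.
interval = 600.0
last_success = 1_000_000.0

on_time = config_run_is_stale(last_success, interval, now=last_success + 700)
stale = config_run_is_stale(last_success, interval, now=last_success + 1500)
print(on_time)  # False
print(stale)    # True
```

Allowing a couple of missed intervals before alerting avoids paging on a single slow run, while still surfacing a stuck agent without anyone needing to SSH into the host.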
Deep Dive into SLO Culture
As mentioned above, we used a Puppet guide for transitioning to infrastructure as code. The guide also touches on other processes that we have used, such as how to set service level objectives (SLOs) for your infrastructure code.
Users of your infrastructure don’t care about any other team’s service performance. Owners want to know whether their service can perform better than it does today. Deep diving into SLO culture prepares the team to converse around metrics, usability and operability.
Host labs for infrastructure to gain support. We run weekly SRE labs. The objective is for teams that interact with operational systems to come and ask questions, get feedback and learn about the infrastructure team’s world. Teach others how to configure services, how to write client code and what to do when they are experiencing increased error rates. When a lab focuses on services that developers interact with daily, we see a 200% increase in attendance. Developers have taken the time to share feedback about interruptions in their service and how usability looks from their perspective.
Where We Go Next
We are now on a journey toward more automated testing, as we still perform the majority of our testing manually. However, these small changes to the review and update process have enabled us to make informed decisions and ensured that our infrastructure performs in a much more straightforward and controlled way.
To learn more about how to transform infrastructure into code and other DevOps processes, we invite you to attend Puppetize PDX in Portland, Oregon, Oct. 9-10. This is a two-day, multitrack conference that focuses on the broader community of Puppet users, featuring user-focused DevOps and infrastructure delivery talks and hands-on workshops.
Feature image by Zsóka Vehofsics from Pixabay.