
How SREs Can Avoid Configuration Drift

18 Jun 2019 2:21pm, by Mikker Gimenez-Peterson

Puppet sponsored this post.

Mikker Gimenez-Peterson
Mikker, principal site reliability engineer for Puppet, has been working in the operations space since 2000, aside from a few years beginning in 2005 spent studying counseling. His interest in the field has been kept alive by the increased focus on automation and by working closely with the software development org during the evolution from traditional operations to DevOps and later SRE.

Infrastructure automation can save engineers tons of time by empowering them to transform rote tasks into repeatable, consistent and orchestrated actions. Typically, engineers focus on what’s within their known infrastructure, using automation to enforce their infrastructure configuration at scale.

An enforced-state model brings consistency and security to your infrastructure management. If you run an enforced-state model exclusively, though, any change to an unmanaged system needs an ad-hoc method (one-off tasks and the like). For simple, infrequent changes, this is probably okay. But the time it takes to make ad-hoc changes and run manual tasks outside automated configuration management can add up quickly at scale.

On the other hand, if you’re only running ad-hoc changes, you trade low overhead for time wasted making repetitive changes to infrastructure, because you have no means to enforce state.

The sustainable, agile answer is to blend the best of both worlds. By adopting both an enforced-state method and simple ways to automate ad-hoc things “outside of the stack,” we SREs can get even more time back in our days to tackle other things.

In other words: let’s combine two commonly accepted ways to automate configuration management to make our teams more efficient.

Before We Jump Heavily into Infrastructure Management Speak…

Let’s iron out a few terms that I’ll be using throughout this article before we dive in further.

Typical agent-server configuration management focuses on the known infrastructure, whether on-prem, hybrid cloud or cloud-native. Things outside of this stack could be remote hosts, networking devices and their respective nodes and endpoints, ephemeral cloud resources, development environments and other assets outside of what’s deemed “your infrastructure.”

Enforcing the State of an Infrastructure

Infrastructure management tools like Puppet check on your various nodes (servers, devices, etc.) and ensure they comply with a software-defined configuration. If there’s any drift (change) on a node that doesn’t map to the configuration, your infrastructure management tool will correct that. Additionally, any new hosts that connect to the server will be automatically configured. It’s enforcing a state of infrastructure defined by your DevOps team over and over so you don’t have to.

Ad-Hoc Tasks

Resources that are outside of the infrastructure, such as remote devices, need to be controlled with manual, one-off tasks. Sometimes you might need to reset one server and ONLY that server. It’s way easier to send a remote command to reboot it. You can orchestrate multiple ad-hoc tasks into larger plans as well.

What’s at Stake When We Don’t Automate Infrastructure Management with a Blended Approach?

For folks only running an enforced-state methodology, any remediation on remote nodes requires manual work. When you’re a team or organization that deals with thousands of nodes at scale, this manual work seriously adds up.

Teams can manage infrastructure with a bevy of ad-hoc tasks. However, this method can also be wasteful when trying to emulate the benefit of enforced-state configuration management with a boatload of manual tasks.

We unlock the most agility when we automate both how we enforce state and how we orchestrate ad-hoc tasks. An enforced-state model on its own doesn’t meet the realities of your day-to-day business, such as the need to patch systems, gather reports or run maintenance tasks.

Conversely, relying only on ad-hoc methods can lead to manual, repetitive changes to emulate what would typically be an enforced configuration across your infrastructure. Weren’t you automating to do less work?

The Solution at Work: an SRE Day-in-the-Life at Puppet

Let’s use a real working example — a normal day in my life at Puppet as a Principal Site Reliability Engineer. We save tons of time and reduce the number of manual tasks we need to run by blending an enforced-configuration agent-server approach, as guided by Puppet Enterprise, and an ad-hoc method such as Bolt. This combo gives SREs more agility to tackle configuration drift and to reduce the amount of work it takes to wrangle remote hosts.

I work on the team that runs Puppet Enterprise (PE) at Puppet. PE enforces the configuration of our infrastructure and maintains the standards we’ve decided on as a team. However, configuring servers that don’t exist within the confines of our production continuous integration (CI) infrastructure requires an ad-hoc touch. This is when we turn to an open source solution, Bolt, to ensure that those servers are configured in a repeatable fashion.

Let’s take a closer look at this scalable workflow that uses Bolt, Apply and Plans to supplement our existing configuration management.

How We Combine Ad-Hoc Tasks and Plans with Our Infrastructure Management

PE manages the physical servers that run OpenStack, but the virtual machines (VMs) themselves aren’t centrally managed because this is a sandboxed environment. I wanted to solve a user problem within this environment: Our users would report issues with their VMs in the OpenStack environment, but I didn’t have SSH access (a way to log in to a remote machine securely over the network) to the VMs that my internal customers were setting up. Even with SSH access to the users’ VMs, I wouldn’t have historical data about the experience of being on a VM in the environment.

The solution: I built two servers in the environment that communicated with each other to get these metrics using telegraf to send performance data to InfluxDB. There was shared code between these two servers, and I didn’t want to configure them manually, particularly in an environment that doesn’t have the SLA of production.

To take this solution even further, I use Bolt, an agentless open source task runner, to create Plans (orchestrated tasks in Bolt) to help me manage these servers quickly.

How to Use Bolt Apply

To cover the ad-hoc task and plan orchestration in this example, we’ll focus on using Bolt to manage agentless infrastructure with the same code and modules that you use for the infrastructure managed by PE. One notable thing to consider is that Bolt is opinionated about how you set up your environment. This means that you can take advantage of code you write, as well as modules on the Puppet Forge, to get the most out of your ad-hoc task orchestration.

Here are some code examples to give more context to this approach. These examples assume a Unix-based system. After installing Bolt, change to your ~/.puppetlabs/bolt directory and create a Puppetfile.

A few things to note:

  • Add any modules that you will be using. In this case, I knew I wanted to use the telegraf module, which depends on the concat and stdlib modules;
  • Ensure you add the module you are about to create here with the flag local: true. Bolt will manage your modules from here on out, and if you do not list your module as local: true, Bolt will delete the code that you wrote, attempting to overwrite it with a module from the Forge.
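
A minimal sketch of what such a Puppetfile could contain, written here from the shell. The Forge module names (puppet-telegraf, puppetlabs-concat, puppetlabs-stdlib) and the local module name profiles are assumptions based on the modules and paths mentioned in this article, and no versions are pinned.

```shell
set -e

# Default Bolt directory named in the text above.
BOLT_DIR="$HOME/.puppetlabs/bolt"
mkdir -p "$BOLT_DIR"

# Note: this overwrites any existing Puppetfile in that directory.
cat > "$BOLT_DIR/Puppetfile" <<'EOF'
# Forge modules (names are assumptions): telegraf and the modules it depends on.
mod 'puppetlabs-stdlib'
mod 'puppetlabs-concat'
mod 'puppet-telegraf'

# The module you are about to write; local: true tells Bolt to leave it alone.
mod 'profiles', local: true
EOF
```

Pinning a version after each Forge module name is usually the safer choice for repeatable installs; the versions are omitted here only because the article does not state them.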

Use the command bolt puppetfile install to install all of the modules you listed in your Puppetfile. Then create a modules directory if one doesn’t exist, and create the directory for your module inside it. Bolt supports tasks, which are simpler and can be written in any language you desire, and plans, which are more extensible and use the Puppet language. In this case, I wanted to use the Bolt apply feature, which requires Puppet plans. I created the Puppet plan telegraf.pp at ~/.puppetlabs/bolt/modules/profiles/plans/telegraf.pp.
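
Those steps could be scripted roughly as follows. This is a sketch: the module name profiles is taken from the plan path above, and the guard around bolt is only there so the snippet does nothing harmful on a machine where Bolt or the Puppetfile is missing.

```shell
set -e

# Default Bolt directory named in the text above.
BOLT_DIR="$HOME/.puppetlabs/bolt"
mkdir -p "$BOLT_DIR"

# Install the modules listed in the Puppetfile.
# Skipped when Bolt is not installed or no Puppetfile exists yet.
if command -v bolt >/dev/null 2>&1 && [ -f "$BOLT_DIR/Puppetfile" ]; then
  bolt puppetfile install
else
  echo "bolt or Puppetfile not found; skipping module install" >&2
fi

# Create the local module and the directory that will hold its plans.
mkdir -p "$BOLT_DIR/modules/profiles/plans"
```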

This plan installs telegraf on the nodes. It takes three parameters, one of which is the list of target nodes, $nodes.
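
A minimal sketch of what such a plan could look like, written to the plan path with a shell heredoc. Only $nodes, apply_prep and apply come from this article; the parameter names $influxdb_url and $influxdb_database are hypothetical, and the outputs hash passed to the telegraf class is an assumption about that module's interface that may differ between module versions.

```shell
set -e

PLAN_DIR="$HOME/.puppetlabs/bolt/modules/profiles/plans"
mkdir -p "$PLAN_DIR"

# The quoted 'EOF' keeps the shell from expanding the Puppet variables below.
cat > "$PLAN_DIR/telegraf.pp" <<'EOF'
# Hypothetical sketch: installs telegraf and points it at an InfluxDB server.
plan profiles::telegraf (
  TargetSpec $nodes,
  String     $influxdb_url,       # hypothetical parameter name
  String     $influxdb_database,  # hypothetical parameter name
) {
  # Prepare the targets so Puppet code can be applied to them.
  $nodes.apply_prep

  # Everything inside this block is Puppet code applied on each target.
  apply($nodes) {
    class { 'telegraf':
      # Assumed shape of the module's outputs parameter.
      outputs => {
        'influxdb' => [
          {
            'urls'     => [$influxdb_url],
            'database' => $influxdb_database,
          },
        ],
      },
    }
  }
}
EOF
```

Because the file lives at modules/profiles/plans/telegraf.pp, Bolt refers to the plan as profiles::telegraf.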

The $nodes variable is populated from the Bolt flag --nodes when you run the plan with bolt plan run.
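
Running the plan might look like this. The host names and the two key=value parameters are hypothetical and would need to match whatever parameters your own plan declares.

```shell
# Hypothetical targets, for illustration only.
NODES="vm1.example.com,vm2.example.com"

if command -v bolt >/dev/null 2>&1; then
  # --nodes fills the plan's $nodes parameter; other parameters are key=value.
  bolt plan run profiles::telegraf \
    --nodes "$NODES" \
    influxdb_url=http://influxdb.example.com:8086 \
    influxdb_database=telegraf
else
  echo "bolt not found; install Bolt to run the plan" >&2
fi
```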

$nodes.apply_prep is what gets the nodes ready to have Puppet run on them. Though this is an agentless experience, we need to copy over the code that runs puppet apply onto the hosts.

apply($nodes) { is the line that tells Bolt you are about to pass it Puppet language code to apply on the servers. The code runs on the nodes you passed in as $nodes, which means you can generate this variable in other ways if you want to build a node list programmatically.
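
One way to build the node list programmatically is to generate a file and hand it to Bolt, which reads targets from a file when the --nodes value starts with @. A sketch, assuming a made-up host naming scheme, a hypothetical file path and the same hypothetical plan parameters:

```shell
# Hypothetical location for the generated target list.
NODE_FILE="${TMPDIR:-/tmp}/telegraf-nodes.txt"

# Write one target per line; the naming scheme is invented for this example.
for i in 1 2 3; do
  echo "metrics-vm${i}.example.com"
done > "$NODE_FILE"

if command -v bolt >/dev/null 2>&1; then
  # The @ prefix tells Bolt to read the targets from the file.
  bolt plan run profiles::telegraf \
    --nodes "@$NODE_FILE" \
    influxdb_url=http://influxdb.example.com:8086 \
    influxdb_database=telegraf
else
  echo "bolt not found; generated $NODE_FILE only" >&2
fi
```

In practice the loop would be replaced by whatever produces your inventory, such as a query against a cloud API.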

The Big Picture

Bolt is a great example of how to automate the infrastructure in your organization that is detached from your primary environment. This gives operations engineers and SREs the benefit of reusing code between teams managing infrastructure, while also keeping our codebase smaller and more purpose-built. The ad-hoc capability allows teams to respond on the fly to changes to infrastructure as well.

Enforced configuration management delivers many efficiencies for what’s currently in your stack. But as your business needs change and more remote hosts are needed, SREs need an ad-hoc method to automate these tasks. Combining both enforced and ad-hoc automation approaches can unlock some serious agility and give you time back to focus on other priorities.

Feature image via Pixabay.
