How Static Config Management Kills Developer Productivity
Did you know that a static config management setup is the main reason why things go wrong for developers? And that taking a static approach to config management doesn’t work at scale? While many have a vague understanding of why and how, we’re here to help break things down.
Why Config Management Today Sucks
Assuming an organization has CI/CD figured out and more or less under control, every individual component should follow a structured pattern. The "80% case," the daily interaction between developer and toolchain, is usually sufficiently optimized.
And as an industry, we've spent the past 10 years trying to get the "git-push update, and the change is rolled out" motion into place. What we neglected is what happens when you go beyond the simple update of a component. To be more specific, when you change the relationship between services (and infrastructure) and across environments, which is something we manage in app and infrastructure configurations.
Our approach to configuration management significantly determines how easy it is for every developer to perform these "actions that go beyond the simple update of an image." This includes everything from onboarding to a service and changing an environment variable, to promoting between environments, adding services or infrastructure components, handling dependencies or rolling back.
There are two “competing” methodologies to configuration management: static configuration management and dynamic configuration management. In this article, we aim to discuss why “remaining static” might be the No. 1 productivity killer in your organization and why “going dynamic” can help drive productivity, lower change failure rate and reduce security incidents.
Methodologies of Configuration Management
Let's start by dissecting the difference between the two methodologies. With the static method, we blend environment-specific and environment-agnostic elements of the configuration. The individual developer needs to understand, operate and maintain the dependencies across all workloads and environments. The answer to the question of which resource the workload binds to in each target environment has to be provided before deployment time.
If you follow the approach of dynamic configuration management (DCM), developers create workload specifications describing everything their workloads need to run successfully. The specification is used to dynamically create the configuration to deploy the workload in a specific environment. With DCM, developers do not need to define or maintain any environment-specific configuration for their workloads.
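The split between an environment-agnostic workload spec and environment-owned context can be sketched in a few lines of Python. This is an illustrative toy, not the actual Score schema or a platform orchestrator: all field names, the `render` function and the environments are assumptions made for the example.

```python
# One environment-agnostic workload spec per service. The developer only
# declares what the workload needs ("a Postgres"), not where it lives.
workload_spec = {
    "name": "checkout",
    "container": {"image": "checkout:1.4.2"},
    "resources": {"db": {"type": "postgres"}},
}

# Environment-specific context is owned by the platform, not the developer.
environment_context = {
    "staging": {"postgres": "postgres://staging-db:5432/checkout"},
    "production": {"postgres": "postgres://prod-db:5432/checkout"},
}

def render(spec, env):
    """Dynamically resolve abstract resource requests into concrete config
    for one target environment, at deployment time."""
    context = environment_context[env]
    config = {"name": spec["name"], "image": spec["container"]["image"]}
    config["env"] = {
        res.upper() + "_URL": context[req["type"]]
        for res, req in spec["resources"].items()
    }
    return config

print(render(workload_spec, "staging"))
# {'name': 'checkout', 'image': 'checkout:1.4.2',
#  'env': {'DB_URL': 'postgres://staging-db:5432/checkout'}}
```

The point of the sketch: the developer's file never mentions staging or production, so adding a fifth environment changes only the platform-owned context, not any workload spec.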
The configuration needs to contain all relationships between the workload and its dependent resources. This means workloads configured using the static approach require teams to manage more configuration files: the more workloads, the more files. This renders static configuration management unable to scale beyond roughly 50 services without heavily damaging productivity. It not only directly affects the productivity of developers and Ops teams, but can also drive up change failure rates and security incidents.
One Config File Isn’t Complicated, but Try Managing Hundreds
"My single Helm chart isn't complicated" is an argument you often hear from less-experienced architects who think "our static setup is fine and will scale." And while a single config file (irrespective of the format) isn't complicated, scaling changes everything. Things get so complex that the need to understand, operate and maintain these files creates significant overhead and cost.
As we highlighted, static setups have to maintain a large number of individual configuration files for each workload, in each environment. This is necessary because the exact relationship between workloads and their dependencies has to be provided prior to deployment time.
Increasing static setup complexity caused by the number of services can be demonstrated with a quick calculation: Let's assume our app has 10 services and dependencies, and is deployed across four environments. In our standard cloud native setup (assuming we deploy three times a day across 21 working days), this would result in 300 to 600 configuration files with nearly 38,000 versions a month (600 × 21 × 3 = 37,800). Complexity isn't driven only by the sheer number of files, but also by the variance from one file to the next. Formats like static Helm charts allow the user to express the same desired state of an environment in hundreds of different ways. So static configuration management tends to produce a larger number of files, which vary widely in style and structure. If you map the complexity of your configuration against the application lifetime, you can observe exponential growth.
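The back-of-the-envelope math above can be spelled out explicitly. All inputs come from the example in the text (10 services and dependencies, four environments, three deploys a day, 21 working days a month); the only interpretation added here is that every deploy produces a new version of each config file.

```python
# Inputs from the example in the text.
services_and_dependencies = 10
environments = 4
deploys_per_day = 3
working_days_per_month = 21

# 10 services x 4 environments = 40 workload/environment combinations;
# with several config files per combination, the article estimates
# 300 to 600 static config files in total.
combinations = services_and_dependencies * environments  # 40
config_files_low, config_files_high = 300, 600

# If every deploy versions every file, the upper bound per month is:
versions_per_month = config_files_high * working_days_per_month * deploys_per_day

print(versions_per_month)  # 37800
```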
Low-performing teams following this approach will most likely end up in a disaster situation, as their number of applications drives further complexity.
Static Config Management Is Likely Your No. 1 Reason for Low Productivity
There are several reasons why hundreds of config files damage the productivity of engineering organizations.
Large numbers of config files require somebody to operate and maintain them. For the average Ops team, this means a huge amount of time spent fixing stuff in static config files: spinning up new environments, services or resources, debugging deployments and trying to make sense of how the parts of the app fit together. Most organizations have internalized the cost of static configuration management. The faulty setup is compensated for simply by hiring the next team of Ops colleagues, who then suffer through the same repetitive tickets in ServiceNow or Jira.
Cognitive Load for Developers
Static setups are also incredibly draining to work with as a developer. You have to understand and operate the configurations and workloads, including their dependencies across all environments and services. If you're swapping teams, you spend the first chunk of time trying to make sense of which component relates to which. If you need to tie a new service into your architecture, add a resource or even a new environment, you may have to spend time digging through dozens of files and formats. Or do what most of your colleagues do: ask Ops and file a ticket. This is especially likely in security-heavy environments, where any major change to resources or architecture has to go through security processes anyway. If you're overwhelmed, or rather focused on your React stack, you'll probably hear over and over that it isn't complicated and you should just "build and run it."
Hundreds of files also constitute a large attack surface, both for errors that drive change failure rate (think of a production workload connecting to a test database) and for security incidents. In some setups, we've observed 90% of change failures, and 80% of security incidents, attributed to errors in the configuration.
The ‘Service Template’ Hack Is Just Pushing Your Problem into the Future
While the correct answer to mitigating the issues outlined above would be dynamic configuration management, teams often shy away from fixing the root cause. Perhaps they don't take the time to understand DCM, underestimate the problem at scale or believe they can apply quick fixes that solve most of the problems.
The most commonly applied "quick fix" is trying to standardize at the creation of a new service. Teams assume this will have a long-lasting effect on standardization, reliability and maintainability. For example, they use GitHub templates or fancier options like service catalogs and developer portals to make it easy for developers to spin up a well-configured "Spring Boot service." But because of the lack of continuous standardization between files, they are really just pushing the problem out into the future. In fact, if you compare config setups two years in, you can barely tell which ones started from templates.
Continuous Standardization with Dynamic Config Management
To really tame the complexity beast, we need to reduce the number of files, and we need to continuously standardize, not just on Day 1. And this is where DCM comes into play. The idea is that rather than having developers maneuver hundreds of files, they use exactly one file per service, and the final config files are created depending on the context of the deployment. This file is a workload specification, such as Score (7k+ stars on GitHub), which has seen a tremendous uptake in interest in a short period of time.
The impact is stunning. Think back to the example with our 10 services and dependent resources that led to 300 to 600 config files in the static setup. With DCM, the same application can be configured with 10 files (plus the necessary base configs and templates). That's up to 95% fewer files than the static setup.
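The reduction is easy to check with the numbers from the example. The count of 20 shared base configs and templates used below is an illustrative assumption (the article only says "plus the necessary base configs and templates"), chosen to show how the figure lands in the 90-95% range.

```python
# Static setup: 300 to 600 config files for 10 services across 4 environments.
static_files_low, static_files_high = 300, 600

# DCM: one workload spec per service, plus shared base configs/templates.
dcm_workload_specs = 10
dcm_base_configs = 20  # assumed, for illustration; not a number from the article
dcm_files = dcm_workload_specs + dcm_base_configs

reduction_low = 1 - dcm_files / static_files_low    # vs. the 300-file case
reduction_high = 1 - dcm_files / static_files_high  # vs. the 600-file case

print(f"{reduction_low:.0%} to {reduction_high:.0%} fewer files")
# 90% to 95% fewer files
```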
If we plot complexity against application lifetime now, we can see that we standardize not only on Day 1, but also on Day n. Complexity grows linearly rather than exponentially.
Get Going with Dynamic Configuration Management
Dynamic configuration management is a paradigm shift for modern IT organizations. It’s the solution to the No. 1 productivity killer: the amount of complexity produced by static configuration management.
DCM originated from the growing trend of platform engineering. It was developed when engineering teams tried to build platforms at a larger scale that drive standardization by design and introduce separation of concerns, without abstracting context from developers. Because of this, DCM has now spread beyond the platform engineering community and is experiencing rapid adoption.
Want to learn more about DCM? Check out this article and head to Score for more information on how to enable DCM using this open source workload specification in combination with a platform orchestrator.