3 Key Configuration Challenges for Kubernetes Monitoring with Prometheus
Observability is essential to running large workloads in Kubernetes clusters. Prometheus is a monitoring system and time-series database that has proven adept at managing large-scale, dynamic Kubernetes environments. In fact, Prometheus is considered a foundational building block for running applications on Kubernetes and has become the de facto open source standard for visibility and monitoring in Kubernetes environments.
Although Prometheus is open source, it is not free in terms of the configuration required to monitor Kubernetes workloads properly. In this article, part one of a two-part piece on Prometheus, I highlight the most common challenges that platform operators and site reliability engineers (SREs) face when onboarding new workloads to Prometheus and configuring the tool ecosystem needed to manage it, along with potential solutions for each challenge.
Disclaimer: In this article, I don’t discuss the challenge of high-availability setups of Prometheus and multicluster setups. Instead, I focus on how to scale Prometheus to onboard more applications and to create dashboards for each application, so that more people can use it. If you are interested in the high-availability setups, you can refer to projects such as Thanos or VictoriaMetrics.
To start getting Prometheus ready in your organization, you can configure scraping to pull metrics from your services, build dashboards on top of your data using Grafana, and define alerts for important metrics breaching thresholds in your production environment (see figure below).
As soon as you are comfortable with Prometheus as your tool of choice, your next challenge will be scaling and managing Prometheus for your whole fleet of applications and environments. Naturally, automation is needed so that new applications can be onboarded quickly and safely.
Challenge 1: Onboarding and Configuring Applications
Modern workloads often consist of hundreds or thousands of microservices, either as multiple instances of the same application or different smaller applications talking to each other, all orchestrated by Kubernetes. These workloads are not running on a single cluster or in a single environment, but are spread over multiple clusters and environments (or “stages” such as development, hardening, and production).
For example, Uber’s workloads had grown to over 4,000 microservices as of late 2019. To manage and operate complex applications like these, you need advanced observability, which demands dedicated configurations for scraping, dashboarding, and alerting for each application. Not only do you have to create these configurations, but you also have to apply them to each environment, which is often done manually and in an ad hoc manner every time something changes.
The problem: This all represents a huge manual effort for managing configurations in your ecosystem for both Prometheus and Grafana.
Solution: Leverage GitOps to Stay in Control
Instead of applying configurations ad hoc, you can take a “GitOps” approach in which a Git repository holds all configurations, as well as documentation and code, and an operator component applies them automatically to the systems being managed, such as Prometheus, Grafana, or even a Kubernetes cluster. Rather than changing the Prometheus configuration or Grafana dashboards directly, you first commit every change to the Git repository, and the change is then synchronized to Prometheus, Grafana, or other tools. The centralized Git repository remains the single source of truth.
Among the many benefits of the GitOps approach is that all configurations are versioned and every change is logged, so you can identify when and why each change happened. If a change turns out to be problematic, you can roll it back easily. Because Git is the central repository, the workflow is aligned with developers who already base their work on Git. With this approach, you can also use pull requests, which have already proven successful in development processes, to review a configuration and promote it from one stage to the next.
The figure below shows a Git repository and an operator added as an intermediate layer to manage all configuration files. The operator must hold the logic and permissions to apply the configuration to the underlying systems.
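The operator's core logic can be sketched as a reconcile step that compares what the Git repository says should exist with what was last applied. This is only a minimal illustration, not how any particular GitOps operator works: the names plan and reconcile, the path-to-content mapping, and the apply_fn and delete_fn callbacks (which would wrap calls to Prometheus, Grafana, or Kubernetes) are all hypothetical.

```python
from __future__ import annotations

import hashlib
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Change:
    action: str  # "apply" or "delete"
    path: str


def digest(content: str) -> str:
    """Fingerprint of a configuration file's content."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def plan(desired: dict[str, str], applied: dict[str, str]) -> list[Change]:
    """Compute the changes needed to match Git.

    desired maps path -> file content as committed in Git.
    applied maps path -> digest of the content that was last applied.
    """
    changes = [
        Change("apply", path)
        for path in sorted(desired)
        if applied.get(path) != digest(desired[path])
    ]
    changes += [
        Change("delete", path) for path in sorted(applied) if path not in desired
    ]
    return changes


def reconcile(
    desired: dict[str, str],
    applied: dict[str, str],
    apply_fn: Callable[[str, str], None],
    delete_fn: Callable[[str], None],
) -> None:
    """Apply the plan and record what is now in effect."""
    for change in plan(desired, applied):
        if change.action == "apply":
            apply_fn(change.path, desired[change.path])
            applied[change.path] = digest(desired[change.path])
        else:
            delete_fn(change.path)
            del applied[change.path]
```

Because the plan is derived only from the repository content and the recorded digests, running reconcile a second time without new commits produces no changes, which is the property that makes Git the single source of truth.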
Challenge 2: Manual Creation of Configurations and Dashboards
Setting up a GitOps single source of truth that is version controlled and holds all configurations as code is a first step. But there are still a lot of manual configurations to deal with.
Learning and writing Prometheus PromQL queries is not a trivial task, and this is only one piece of the bigger picture. Besides PromQL, you need Grafana dashboard configurations (written in JSON) to have a comprehensive overview of your applications. You also need alerting rules (written in YAML) in Prometheus to set up alerting for production issues. You may also need an engineer or two for writing PromQL or creating alerting rules, which requires different skills than configuring dashboards in Grafana.
The problem: You need a team of engineers knowledgeable in different configuration languages to write and maintain all the manual configurations.
Solution: Code Generation Empowers Scaling
Code generation to the rescue! Instead of manually writing queries and rules for Prometheus and its Alertmanager, as well as dashboard configurations for Grafana, you can use code generators to reduce the manual work.
One great example is generating Prometheus alerting and recording rules based on SRE concepts such as the Golden Signals, the RED method, or the USE method, which are widely considered to capture the most useful and critical metrics. Another example is generating Grafana dashboards (for example, see uber / grafana-dash-gen, metalmatze / slo-libsonnet, and prometheus-operator / kube-prometheus on the GitHub website, and Scripted Dashboards on the Grafana Labs website).
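To make the idea concrete, here is a minimal sketch of a generator that emits RED-method (rate, errors, duration) rules for one service. It is not taken from any of the projects named above. The metric names http_requests_total and http_request_duration_seconds_bucket, the job and code labels, the recording-rule names, and the alert name are assumptions about how a service might be instrumented. The result follows the groups/rules layout of a Prometheus rule file and is printed as JSON for simplicity; a real setup would more likely write YAML with a YAML library.

```python
import json


def red_rules(job: str, error_ratio_threshold: float = 0.01, window: str = "5m") -> dict:
    """Build recording and alerting rules for one service (hypothetical metric names)."""
    selector = f'job="{job}"'
    requests = f"job:http_requests:rate{window}"
    errors = f"job:http_request_errors:rate{window}"
    rules = [
        {  # Rate: requests per second
            "record": requests,
            "expr": f"sum by (job) (rate(http_requests_total{{{selector}}}[{window}]))",
        },
        {  # Errors: failed requests per second (assumes a "code" label)
            "record": errors,
            "expr": f'sum by (job) (rate(http_requests_total{{{selector},code=~"5.."}}[{window}]))',
        },
        {  # Duration: 99th percentile latency from a histogram
            "record": f"job:http_request_duration_seconds:p99_{window}",
            "expr": "histogram_quantile(0.99, sum by (job, le) "
            f"(rate(http_request_duration_seconds_bucket{{{selector}}}[{window}])))",
        },
        {  # Alert when the error ratio stays above the threshold
            "alert": "HighErrorRatio",
            "expr": f"{errors}{{{selector}}} / {requests}{{{selector}}} > {error_ratio_threshold}",
            "for": "10m",
            "labels": {"severity": "page"},
            "annotations": {
                "summary": f"Error ratio for {job} is above {error_ratio_threshold:.2%}"
            },
        },
    ]
    return {"groups": [{"name": f"{job}-red", "rules": rules}]}


if __name__ == "__main__":
    print(json.dumps(red_rules("checkout"), indent=2))
```

Onboarding another service then means calling red_rules with a different job name instead of hand-writing four more PromQL expressions.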
Bottom line: Using code generators speeds up configuration efforts. The generated files are stored in the Git repository to reap all the benefits I discussed earlier. The image below compares manual configuration with code-generated configuration and shows how the latter approach does the heavy lifting and reduces the chance of user error.
Challenge 3: Configurations Drift Out of Sync
Once you start using code generators, you end up with lots of auto-generated configuration files. Those configurations, stored in the Git repository, are independent of each other. There is no control mechanism to base them on the same input files; in fact that might not even be possible since code generators might rely on different kinds of inputs.
For example, changing the input for code generator 1 produces output that is now out of sync with the output of code generator 2 or 3, because there is no synchronization mechanism between the generated files. To mitigate this, a change to one input could trigger the execution of all generators, but the actual problem is that the input for each generated file is in a different format, since the code generators are independent solutions. Only a few solutions tackle this, such as prometheus-operator / kube-prometheus.
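A small sketch can show what this drift looks like and how a check might surface it. Both input formats below are invented for illustration: one generator is assumed to read "key = value" lines, the other a JSON document with a thresholds object. Each parser normalizes its format into a plain dictionary so the values can be compared.

```python
from __future__ import annotations

import json


def parse_alert_input(text: str) -> dict[str, float]:
    """Hypothetical input of an alert-rule generator: 'key = value' lines."""
    values: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, _, raw = line.partition("=")
        values[key.strip()] = float(raw)
    return values


def parse_dashboard_input(text: str) -> dict[str, float]:
    """Hypothetical input of a dashboard generator: JSON with a 'thresholds' object."""
    return {key: float(value) for key, value in json.loads(text)["thresholds"].items()}


def drift(a: dict[str, float], b: dict[str, float]) -> dict[str, tuple]:
    """Return every key whose value differs between the two normalized inputs."""
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(set(a) | set(b))
        if a.get(key) != b.get(key)
    }


if __name__ == "__main__":
    alert_input = "latency_ms = 300\nerror_ratio = 0.01\n"
    dashboard_input = '{"thresholds": {"latency_ms": 200, "error_ratio": 0.01}}'
    # The latency threshold was raised for the alert generator only.
    print(drift(parse_alert_input(alert_input), parse_dashboard_input(dashboard_input)))
```

A check like this only reports the mismatch. Someone still has to carry the change into the other input format by hand, and every additional generator needs its own parser.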
The problem: Manual work is required to bring a desired change into each input format and to eventually create a new generation of configuration files.
Solution: Use Abstraction to Foster Reuse and Keep Generated Files in Sync
Abstraction in software engineering fosters reuse, and this same concept can help overcome the challenge of configuration files drifting out of sync. Introducing an intermediate language to cover common SRE concepts can help provide a mutual understanding and technical foundation to build upon.
The image below shows how introducing an intermediate language, such as Jsonnet or a language you define yourself, allows you to define common concepts once and generate specific configuration files for different platforms like Prometheus and Grafana. Using this higher-level language enables you to abstract away implementation details. The language you use must provide all concepts that are prevalent in the Prometheus monitoring domain.
In recent years, there has been a consensus to focus on terminology and concepts that stem from the SRE community. A mature concept is to build upon the notion of service-level objectives (SLOs), which allow you to define objectives for each microservice. Putting this into machine- and human-readable code (using YAML files) allows you to generate the configuration for multiple tools and make all configurations conform to the defined service-level objective. This reduces complexity and makes it easier to cope with operating and scaling your Prometheus environments.
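As a rough sketch of this idea, the example below defines one SLO in code and derives both a Prometheus alerting rule and a Grafana panel from it. Everything here is illustrative: the AvailabilitySLO class, the http_requests_total metric with job and code labels, and the alert name are assumptions, and the panel is a pared-down dictionary rather than a complete Grafana dashboard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AvailabilitySLO:
    """Single definition that every generated configuration is derived from."""

    service: str
    objective: float  # e.g. 0.999 means 99.9% of requests succeed
    window: str = "5m"

    def error_ratio_expr(self) -> str:
        # Hypothetical instrumentation: http_requests_total with job and code labels.
        sel = f'job="{self.service}"'
        return (
            f'sum(rate(http_requests_total{{{sel},code=~"5.."}}[{self.window}]))'
            f" / sum(rate(http_requests_total{{{sel}}}[{self.window}]))"
        )

    def error_budget(self) -> float:
        # Rounded to hide floating-point noise such as 0.0010000000000000009.
        return round(1 - self.objective, 10)


def to_alert_rule(slo: AvailabilitySLO) -> dict:
    """Prometheus alerting rule that fires when the error budget is exceeded."""
    return {
        "alert": "AvailabilitySLOBreach",
        "expr": f"({slo.error_ratio_expr()}) > {slo.error_budget()}",
        "for": "10m",
        "labels": {"service": slo.service, "severity": "page"},
    }


def to_grafana_panel(slo: AvailabilitySLO) -> dict:
    """Pared-down Grafana panel that plots the same expression."""
    return {
        "type": "timeseries",
        "title": f"{slo.service} error ratio (SLO {slo.objective:.1%})",
        "targets": [{"refId": "A", "expr": slo.error_ratio_expr()}],
    }
```

Because both functions call the same error_ratio_expr method, changing the objective or the query in one place changes the alert and the dashboard together, so the two outputs cannot drift apart.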
But this is all just half of the story! In part two, I will detail how Prometheus, when coupled with another open-source solution called Keptn, can deliver automated, advanced observability for your K8s environment more quickly.