Implement Global View and High Availability for Prometheus

Ensuring that systems run reliably is a critical function of a site reliability engineer. A big part of that job is collecting metrics, creating alerts and graphing data. Gathering system metrics from several locations and services, and correlating them, is essential both for understanding system behavior and for troubleshooting.
Prometheus, a Cloud Native Computing Foundation (CNCF) project, has become one of the most popular open source solutions for application and system monitoring. A single instance can handle millions of time series, but when systems grow, Prometheus needs to be able to scale and handle the increased load. Because vertical scaling will eventually hit a limit, you need another solution.
This article will guide you through transforming a simple Prometheus setup into a Thanos deployment: one that lets you run reliable queries against multiple Prometheus instances from a single endpoint and seamlessly handles a highly available Prometheus setup.
Implement Global View and High Availability
Thanos provides a set of components that can deliver a highly available metric system, with virtually unlimited storage capacity. It can be added on top of existing Prometheus deployments and provide capabilities like global query view, data backup and historical data access. Moreover, these features run independently of each other, which allows you to onboard Thanos features only when needed.
Initial Cluster Setup
You’ll be deploying Prometheus in a Kubernetes cluster, where you’ll simulate the desired scenario. The kind tool is a good solution to launch a Kubernetes cluster locally. You’ll use the following configuration.
```yaml
# config.yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
name: thanos-demo
nodes:
  - role: control-plane
    image: kindest/node:v1.23.0@sha256:2f93d3c7b12a3e93e6c1f34f331415e105979961fcddbe69a4e3ab5a93ccbb35
  - role: worker
    image: kindest/node:v1.23.0@sha256:2f93d3c7b12a3e93e6c1f34f331415e105979961fcddbe69a4e3ab5a93ccbb35
  - role: worker
    image: kindest/node:v1.23.0@sha256:2f93d3c7b12a3e93e6c1f34f331415e105979961fcddbe69a4e3ab5a93ccbb35
```
With this configuration, you’re ready to launch the cluster.
```
~ kind create cluster --config config.yaml
Creating cluster "thanos-demo" ...
 ✓ Ensuring node image (kindest/node:v1.23.0) 🖼
 ✓ Preparing nodes 📦 📦 📦
 ✓ Writing configuration 📜
 ✓ Starting control-plane 🕹️
 ✓ Installing CNI 🔌
 ✓ Installing StorageClass 💾
 ✓ Joining worker nodes 🚜
Set kubectl context to "kind-thanos-demo"
You can now use your cluster with:

kubectl cluster-info --context kind-thanos-demo

Have a nice day! 👋
```
With the cluster up and running, you’ll check the installation to be sure you’re ready to launch Prometheus. You’ll need kubectl to interact with the Kubernetes cluster.
```
~ kind get clusters
thanos-demo
~ kubectl get nodes
NAME                        STATUS   ROLES                  AGE    VERSION
thanos-demo-control-plane   Ready    control-plane,master   119s   v1.23.0
thanos-demo-worker          Ready    <none>                 88s    v1.23.0
thanos-demo-worker2         Ready    <none>                 88s    v1.23.0
~ kubectl get pods -o name -A
pod/coredns-64897985d-mz8bv
pod/coredns-64897985d-pxzkq
pod/etcd-thanos-demo-control-plane
pod/kindnet-27cdw
pod/kindnet-42kcv
pod/kindnet-5rlcj
pod/kube-apiserver-thanos-demo-control-plane
pod/kube-controller-manager-thanos-demo-control-plane
pod/kube-proxy-49mgg
pod/kube-proxy-nhvkm
pod/kube-proxy-z4fpn
pod/kube-scheduler-thanos-demo-control-plane
pod/local-path-provisioner-5bb5788f44-hj5c4
```
Initial Prometheus Setup
Your goal is to deploy Thanos on top of an existing Prometheus installation and extend its functionality. With that in mind, you'll start by launching three Prometheus servers. There are several reasons to run multiple Prometheus instances, such as sharding, high availability or aggregating queries from multiple locations.
For this scenario, let’s imagine the following setup: you have one Prometheus server in a cluster in the United States and two replicas of Prometheus server in Europe that scrape the same targets.
To deploy Prometheus, you’ll use the kube-prometheus-stack chart, and you’ll need Helm. After installing Helm, you’ll need to add the kube-prometheus-stack repository.
```
~ helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
"prometheus-community" has been added to your repositories
~ helm repo update
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "prometheus-community" chart repository
Update Complete. ⎈Happy Helming!⎈
```
Because in practice you only have one Kubernetes cluster, you'll simulate multiple regions by deploying Prometheus in different namespaces: one named europe and another named united-states.
```
~ kubectl create namespace europe
namespace/europe created
~ kubectl create namespace united-states
namespace/united-states created
```
Now that you have your regions, you’re ready to deploy Prometheus.
```yaml
# prometheus-europe.yaml
nameOverride: "eu"
namespaceOverride: "europe"
nodeExporter:
  enabled: false
grafana:
  enabled: false
alertmanager:
  enabled: false
kubeStateMetrics:
  enabled: false
prometheus:
  prometheusSpec:
    replicas: 2
    replicaExternalLabelName: "replica"
    prometheusExternalLabelName: "cluster"
```

```yaml
# prometheus-united-states.yaml
nameOverride: "us"
namespaceOverride: "united-states"
nodeExporter:
  enabled: false
grafana:
  enabled: false
alertmanager:
  enabled: false
kubeStateMetrics:
  enabled: false
prometheus:
  prometheusSpec:
    replicaExternalLabelName: "replica"
    prometheusExternalLabelName: "cluster"
```
Using the configurations above, you'll deploy the Prometheus instances in each region. Note the external labels: every series these instances scrape will carry a cluster label, and each replica will be identified by a replica label, which Thanos relies on later for deduplication.
```
~ helm -n europe upgrade -i prometheus-europe prometheus-community/kube-prometheus-stack -f prometheus-europe.yaml
Release "prometheus-europe" does not exist. Installing it now.
NAME: prometheus-europe
LAST DEPLOYED: Sat Jan 22 18:26:22 2022
NAMESPACE: europe
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
kube-prometheus-stack has been installed. Check its status by running:
  kubectl --namespace europe get pods -l "release=prometheus-europe"

~ helm -n united-states upgrade -i prometheus-united-states prometheus-community/kube-prometheus-stack -f prometheus-united-states.yaml
Release "prometheus-united-states" does not exist. Installing it now.
NAME: prometheus-united-states
LAST DEPLOYED: Sat Jan 22 18:26:48 2022
NAMESPACE: united-states
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
kube-prometheus-stack has been installed. Check its status by running:
  kubectl --namespace united-states get pods -l "release=prometheus-united-states"

Visit https://github.com/prometheus-operator/kube-prometheus for instructions on how to create & configure Alertmanager and Prometheus instances using the Operator.
```
You can now ensure your Prometheus is working as expected.
```
~ kubectl -n europe get pods -l app.kubernetes.io/name=prometheus
NAME                                        READY   STATUS    RESTARTS   AGE
prometheus-prometheus-europe-prometheus-0   2/2     Running   0          18s
prometheus-prometheus-europe-prometheus-1   2/2     Running   0          18s
~ kubectl -n united-states get pods -l app.kubernetes.io/name=prometheus
NAME                                               READY   STATUS    RESTARTS   AGE
prometheus-prometheus-united-states-prometheus-0   2/2     Running   0          39s
```
You’re now able to query any metrics on each individual instance, but unable to perform multicluster queries.
Deploy Thanos Sidecars
kube-prometheus-stack supports deploying Thanos as a sidecar, meaning it will be deployed alongside Prometheus itself. Thanos sidecar exposes Prometheus through the StoreAPI, a generic gRPC API that allows Thanos components to fetch metrics from various systems.
```yaml
# prometheus-europe.yaml
nameOverride: "eu"
namespaceOverride: "europe"
nodeExporter:
  enabled: false
grafana:
  enabled: false
alertmanager:
  enabled: false
kubeStateMetrics:
  enabled: false
prometheus:
  prometheusSpec:
    replicas: 2
    replicaExternalLabelName: "replica"
    prometheusExternalLabelName: "cluster"
    thanos:
      baseImage: quay.io/thanos/thanos
      version: v0.24.0
```

```yaml
# prometheus-united-states.yaml
nameOverride: "us"
namespaceOverride: "united-states"
nodeExporter:
  enabled: false
grafana:
  enabled: false
alertmanager:
  enabled: false
kubeStateMetrics:
  enabled: false
prometheus:
  prometheusSpec:
    replicaExternalLabelName: "replica"
    prometheusExternalLabelName: "cluster"
    thanos:
      baseImage: quay.io/thanos/thanos
      version: v0.24.0
```
With the updated configuration, you’re ready to upgrade Prometheus.
```
~ helm -n europe upgrade -i prometheus-europe prometheus-community/kube-prometheus-stack -f 2/prometheus-europe.yaml
Release "prometheus-europe" has been upgraded. Happy Helming!
NAME: prometheus-europe
LAST DEPLOYED: Sat Jan 22 18:42:24 2022
NAMESPACE: europe
STATUS: deployed
REVISION: 2
TEST SUITE: None
NOTES:
kube-prometheus-stack has been installed. Check its status by running:
  kubectl --namespace europe get pods -l "release=prometheus-europe"

~ helm -n united-states upgrade -i prometheus-united-states prometheus-community/kube-prometheus-stack -f 2/prometheus-united-states.yaml
Release "prometheus-united-states" has been upgraded. Happy Helming!
NAME: prometheus-united-states
LAST DEPLOYED: Sat Jan 22 18:43:06 2022
NAMESPACE: united-states
STATUS: deployed
REVISION: 2
TEST SUITE: None
NOTES:
kube-prometheus-stack has been installed. Check its status by running:
  kubectl --namespace united-states get pods -l "release=prometheus-united-states"

Visit https://github.com/prometheus-operator/kube-prometheus for instructions on how to create & configure Alertmanager and Prometheus instances using the Operator.
```
You can check that the Prometheus pods have an extra container running alongside them.
```
~ kubectl -n europe get pods -l app.kubernetes.io/name=prometheus
NAME                                        READY   STATUS    RESTARTS   AGE
prometheus-prometheus-europe-prometheus-0   3/3     Running   0          48s
prometheus-prometheus-europe-prometheus-1   3/3     Running   0          65s
~ kubectl -n united-states get pods -l app.kubernetes.io/name=prometheus
NAME                                               READY   STATUS    RESTARTS   AGE
prometheus-prometheus-united-states-prometheus-0   3/3     Running   0          44s
```
Deploy Thanos Querier to Achieve Global View
Querier implements the Prometheus HTTP v1 API to query data in a Thanos cluster via PromQL. It will allow you to fetch metrics from a single endpoint. It starts by gathering the data needed to evaluate a query from underlying StoreAPIs, evaluates the query and then returns the result.
You leveraged kube-prometheus-stack to deploy the Thanos sidecar. Unfortunately, that chart does not support the other Thanos components. For those, you'll take advantage of the Banzai Cloud Helm charts repository, adding it the same way you did before.
```
~ helm repo add banzaicloud https://kubernetes-charts.banzaicloud.com
"banzaicloud" has been added to your repositories
~ helm repo update
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "prometheus-community" chart repository
...Successfully got an update from the "banzaicloud" chart repository
Update Complete. ⎈Happy Helming!⎈
```
To simulate a central monitoring solution, you'll create a monitoring namespace.
```
~ kubectl create namespace monitoring
namespace/monitoring created
```
The following configuration enables Thanos Querier and points it at the Prometheus instances. The dnssrv+ prefix tells Querier to discover every replica behind each prometheus-operated headless service by resolving DNS SRV records.
```yaml
# query.yaml
store: # https://thanos.io/tip/components/store/
  enabled: false
compact: # https://thanos.io/tip/components/compact.md/
  enabled: false
bucket: # https://thanos.io/v0.8/components/bucket/
  enabled: false
rule: # https://thanos.io/tip/components/rule/
  enabled: false
sidecar: # https://thanos.io/tip/components/sidecar/
  enabled: false
queryFrontend: # https://thanos.io/tip/components/query-frontend.md/
  enabled: false
query: # https://thanos.io/tip/components/query/
  enabled: true
  replicaLabels:
    - replica
  stores:
    - "dnssrv+_grpc._tcp.prometheus-operated.europe.svc.cluster.local"
    - "dnssrv+_grpc._tcp.prometheus-operated.united-states.svc.cluster.local"
```
With the above configuration, you’re ready to deploy Querier.
```
~ helm -n monitoring upgrade -i thanos banzaicloud/thanos -f query.yaml
Release "thanos" does not exist. Installing it now.
NAME: thanos
LAST DEPLOYED: Sat Jan 22 18:48:03 2022
NAMESPACE: monitoring
STATUS: deployed
REVISION: 1
TEST SUITE: None
~ kubectl -n monitoring port-forward svc/thanos-query-http 10902:10902
Forwarding from 127.0.0.1:10902 -> 10902
Forwarding from [::1]:10902 -> 10902
```
Using port-forward, you can connect to your cluster and verify that multicluster queries now work. When you deployed Prometheus, you set replicaExternalLabelName: "replica" and prometheusExternalLabelName: "cluster", and the deduplication functionality takes advantage of those labels. With deduplication enabled, metrics from the europe cluster are merged into a single logical series: because the two replicas expose identical label sets except for the replica label, Thanos treats them as members of the same high-availability group.
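The effect of deduplication can be sketched with a toy example. This is ordinary shell, not a Thanos command, and the series and label values below are invented for illustration:

```shell
# A stand-in for the series the replicas would expose: the two European
# replicas report the same series, differing only in the replica label.
cat <<'EOF' > series.txt
up{cluster="europe",replica="prometheus-eu-0",job="kubelet"}
up{cluster="europe",replica="prometheus-eu-1",job="kubelet"}
up{cluster="united-states",replica="prometheus-us-0",job="kubelet"}
EOF

# Conceptually, deduplication drops the replica label and collapses the
# now-identical series, leaving one logical series per HA group:
sed -E 's/replica="[^"]*",//' series.txt | sort -u
# up{cluster="europe",job="kubelet"}
# up{cluster="united-states",job="kubelet"}
```

This is exactly why the replica label must be the only label that differs between replicas: any other divergence would make Thanos treat them as distinct series.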
Deploy Thanos Query Frontend to Improve Read Performance
The last piece of the puzzle is Query Frontend, a service that can be put in front of queriers to improve read performance. It is based on the Cortex query frontend component and supports features like query splitting, retries, caching and a slow-query log.
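For reference, these features map to command-line flags on the query-frontend component. The flags below exist in Thanos v0.24, but the values are purely illustrative and the Helm chart may expose them under different keys:

```
# Split long range queries into 24h chunks: --query-range.split-interval
# Retry failed sub-queries:                 --query-range.max-retries-per-request
# Log queries slower than a threshold:      --query-frontend.log-queries-longer-than
thanos query-frontend \
  --query-range.split-interval=24h \
  --query-range.max-retries-per-request=5 \
  --query-frontend.log-queries-longer-than=10s \
  --query-frontend.downstream-url=http://thanos-query-http:10902
```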
```yaml
# query.yaml
store:
  enabled: false
compact:
  enabled: false
bucket:
  enabled: false
rule:
  enabled: false
sidecar:
  enabled: false
queryFrontend:
  enabled: true
query:
  enabled: true
  replicaLabels:
    - replica
  stores:
    - "dnssrv+_grpc._tcp.prometheus-operated.europe.svc.cluster.local"
    - "dnssrv+_grpc._tcp.prometheus-operated.united-states.svc.cluster.local"
```
After updating the previous configuration to enable Query Frontend, you can upgrade your setup.
```
~ helm -n monitoring upgrade -i thanos banzaicloud/thanos -f query.yaml
Release "thanos" has been upgraded. Happy Helming!
NAME: thanos
LAST DEPLOYED: Sat Jan 22 18:56:29 2022
NAMESPACE: monitoring
STATUS: deployed
REVISION: 2
TEST SUITE: None
~ kubectl -n monitoring port-forward svc/thanos-query-frontend-http 10902:10902
Forwarding from 127.0.0.1:10902 -> 10902
Forwarding from [::1]:10902 -> 10902
```
Using port-forward again, you’ll be able to access Query Frontend.
Query Frontend is the entry point to send queries to multiple Prometheus instances. Services that perform these types of queries, such as Grafana, should make them through Query Frontend.
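For example, Grafana could be pointed at Query Frontend with a datasource provisioning file like the following sketch. The in-cluster service name and port match the port-forward above; the rest is an assumption about your Grafana setup:

```yaml
# grafana-datasource.yaml -- a sketch of a Grafana datasource
# provisioning file; Query Frontend speaks the Prometheus HTTP API,
# so it is registered as a regular "prometheus" datasource.
apiVersion: 1
datasources:
  - name: Thanos
    type: prometheus
    access: proxy
    url: http://thanos-query-frontend-http.monitoring.svc.cluster.local:10902
    isDefault: true
```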
Conclusion
In this article, you’ve gone through the steps required to go from a simple metrics-gathering solution to a global, highly available setup. In this setup, you leveraged Prometheus and Thanos in a Kubernetes cluster.
You started by deploying Prometheus instances separately, simulating a multiregion setup, and then added functionality incrementally: first injecting Thanos as a sidecar, which implements the StoreAPI on top of Prometheus and paves the way for Querier. Querier gathers data from the underlying StoreAPIs, evaluates queries and returns results. Lastly, you deployed Query Frontend, a component aimed at improving read performance that supports features like splitting, retries, caching and a slow-query log.
This setup allows you to run multi-replica Prometheus servers, in a highly available setup, and paves the way for more complex scenarios.