Thanos Takes Scalable, Highly-Available Prometheus Monitoring to CNCF Incubation

19 Aug 2020 8:26am

The open source Thanos project, which centralizes, scales, and offers long-term storage and high availability for Prometheus-based monitoring, has moved to the incubation level of the Cloud Native Computing Foundation (CNCF).

Prometheus is one of the core open source projects for monitoring Kubernetes applications, as well as a graduated CNCF project, and it was built long before the container orchestration tool became the cloud native household name it is today. While Prometheus does what it does well, it wasn’t designed with the scalability and data storage needs of cloud native applications in mind, and so tools like Thanos and Cortex, another CNCF project, were created.

With the move to incubation, Thanos moves slightly ahead of Cortex, which joined the CNCF sandbox a short time before Thanos. But before you get too invested in the horse race narrative, Thanos co-author Bartek Plotka explains that this is less a real competition and more a commingling of maintainers, contributors, and approaches. In fact, Plotka is not only a maintainer of both Thanos and Prometheus but also a contributor to Cortex, and the same can be said for many of those involved.

The real difference between the projects, said Plotka, lies in some initial design approaches: Cortex uses a push-based system, while Thanos is pull-based. Thanos also relies on object storage, which Plotka points out is very affordable, while Cortex uses NoSQL as well as object storage. The final initial difference is that Thanos used the Prometheus time-series database format, whereas Cortex used a custom indexed time-series database.

Plotka says that he co-authored Thanos while working at multiplayer gaming startup Improbable, after trying Cortex and running into some issues.

“We got started using Cortex 3 years ago, however, the deployment model and complexities were not what we were looking for at the very beginning of Cortex. We were looking for something that is more native to Prometheus that would maybe use the same storage format, a cheaper storage, and maybe an easier deployment model,” said Plotka. “Instead of pushing metrics and streaming metrics immediately to some long term storage cluster, we wanted to have a global federation of queries, so you could do queries on top of many clusters, many Prometheus services, the same way you would on a single instance of Prometheus.”

Plotka and Fabian Reinartz, who was at the time both a CoreOS employee and a Prometheus maintainer, created Thanos to provide a Cortex alternative for scaling out Prometheus. The core differentiator, said Plotka, was that Thanos could be installed very easily, with just a single Prometheus instance alongside a sidecar and a single microservice, providing a “fully working Thanos with capabilities of high availability of Prometheus and a global federation view.”

Started in late 2017, Thanos had a working version available by early 2018 and quickly gained popularity, said Plotka, because of how easy it was to install and use, and because object storage is cheap, “fast enough,” available on all cloud providers, and can even be spun up on bare metal using open source projects. By the next year, Thanos had adopters and contributors from a variety of big names, including Alibaba, Red Hat, Tencent, Adobe, eBay, and more.

Returning to present-day Thanos, Plotka again emphasized the commingling of the various projects’ maintainers, noting that the Thanos team has learned from Cortex how to be more performant and stable, while Cortex has learned from Thanos how to be easier to use and cheaper.

“At some point, we started using each other’s code as well, so it’s going in a very interesting kind of direction where those projects might be much more together at some point even more,” said Plotka.

In its current incarnation, Thanos provides a variety of features, including:

  • Global querying view across all connected Prometheus servers
  • Deduplication and merging of metrics collected from Prometheus HA pairs
  • Seamless integration with existing Prometheus setups
  • Any object storage as its only, optional dependency
  • Downsampling historical data for massive query speedup
  • Cross-cluster federation
  • Fault-tolerant query routing
  • Simple gRPC “Store API” for unified data access across all metric data
  • Easy integration points for custom metric providers
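To make the first two features concrete, here is a minimal conceptual sketch of a global view that merges results from several Prometheus servers and collapses the two members of an HA pair into one series. It is not Thanos code: the function names, the `replica` label name, and the "first value wins, later replicas fill gaps" rule are all assumptions chosen for illustration, and Thanos's actual deduplication logic is more involved.

```python
from collections import defaultdict

# Hypothetical label that tells the members of an HA pair apart.
REPLICA_LABEL = "replica"


def series_key(labels, replica_label=REPLICA_LABEL):
    """Identity of a series once the replica label is ignored."""
    return tuple(sorted((k, v) for k, v in labels.items() if k != replica_label))


def global_view(sources, replica_label=REPLICA_LABEL):
    """Merge series from several servers and collapse HA replicas.

    `sources` holds one result set per Prometheus server. Each result set
    is a list of (labels, samples) pairs, where samples is a list of
    (timestamp, value) tuples.
    """
    merged = defaultdict(dict)
    for result in sources:
        for labels, samples in result:
            bucket = merged[series_key(labels, replica_label)]
            for ts, value in samples:
                # Keep the first value seen for a timestamp; samples from
                # later replicas only fill gaps (a simplifying assumption).
                bucket.setdefault(ts, value)
    return {key: sorted(points.items()) for key, points in merged.items()}


# Two replicas scraping the same "eu" target, plus an unrelated "us" server.
replica_a = [({"__name__": "up", "cluster": "eu", "replica": "a"}, [(10, 1.0), (20, 1.0)])]
replica_b = [({"__name__": "up", "cluster": "eu", "replica": "b"}, [(20, 1.0), (30, 1.0)])]
other = [({"__name__": "up", "cluster": "us", "replica": "a"}, [(10, 0.0)])]

view = global_view([replica_a, replica_b, other])
for key, points in view.items():
    print(dict(key), points)
```

In this toy data set, replica `a` is missing the sample at timestamp 30 and replica `b` is missing the one at timestamp 10, so the merged "eu" series covers all three timestamps while the "us" series passes through unchanged.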

Looking ahead on the Thanos roadmap, some key additions to look forward to include the deletion of data (something made necessary by things like GDPR) and the addition of a new microservice responsible for query planning and query caching. On this second addition, Plotka notes that the microservice is one that Thanos is developing together with, and sharing with, Cortex.

The Cloud Native Computing Foundation, and KubeCon + CloudNativeCon are sponsors of The New Stack.