Cortex 1.0 Offers ‘Enterprise-Ready’ Distributed Prometheus Monitoring

The team at Grafana Labs has been using Cortex to power the Prometheus monitoring backend of its hosted metrics and logging service, Grafana Cloud, for the better part of three years, and the company has now released Cortex 1.0 for general use, saying it arrives with a number of new features and guarantees that make it production-ready for enterprise use.
“Cortex has actually been production-ready for a really long time. Our message with 1.0 is very much that you don’t have to be an expert in Cortex now to run it in production,” said Cortex author Tom Wilkie, vice president of product at Grafana Labs. “Up until this point, we’ve been criticized for being hard to operate, for having a lot of moving parts, and generally for being a bit of a moving target, because the development of Cortex has been moving at quite some pace.”
With 1.0, the company introduced stability guarantees around configuring and managing the software, and added documentation as well as pre-packaged dashboards. “Generally, we think there’s now enough of a community, enough documentation, enough help and support, for other people to run Cortex,” Wilkie said.
Cortex is an open source project and sandbox-level member of the Cloud Native Computing Foundation that offers the ability to query metrics from many Prometheus servers without gaps in the graphs caused by server failure. A blog post details the features that arrive with the release: production documentation laying out the steps needed to build a production-ready Cortex deployment, the aforementioned dashboards along with ready-made Prometheus alerts, new stability and backward-compatibility guarantees, and a single-process “airplane” mode that offers a single binary executable for getting started with Cortex.
On this last point, Wilkie pointed to a much-simplified process for getting Cortex up and running, as well as new use cases that weren’t as readily available with a microservice architecture.
“Cortex is a microservice architecture, so up until about a year ago, you’d have to run 15 different applications and orchestrate them together to get a working Cortex cluster,” said Wilkie. “Recently we introduced this as a single process model. You can run a single command, a single binary in a single process, and get a fully working Cortex cluster. This dramatically simplifies the operational complexity of Cortex. We’ve only ever run Cortex on Kubernetes, but now we have users running Cortex on bare metal.”
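As a sketch of what this single-process mode looks like in practice, a minimal deployment boils down to one binary pointed at one YAML file. The configuration keys and flags below are illustrative, based on Cortex’s documented single-process example, and may vary between versions:

```yaml
# single-process-config.yaml — hypothetical minimal configuration for
# running every Cortex component in one process; treat the key names
# as a sketch rather than a recipe, since they can differ by version.

# Disable multi-tenant auth for a quick local trial.
auth_enabled: false

server:
  http_listen_port: 9009

ingester:
  lifecycler:
    ring:
      kv_store:
        store: inmemory       # no external Consul/etcd required
      replication_factor: 1   # a single ingester, so no replication

# Launched with something like:
#   cortex -config.file=single-process-config.yaml -target=all
# where -target=all asks the one binary to run all components in-process.
```

The in-memory ring store is what removes the orchestration burden Wilkie describes: with no external key-value store or separate services to coordinate, the same binary can run on bare metal just as easily as on Kubernetes.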
Last month, Sysdig offered its own solution for scalable Prometheus support, saying that Cortex was too complex and didn’t actually offer “the level of scale that could support a whole cloud-scale implementation.” Wilkie dismissed the assessment, saying that “Cortex is incredibly scalable and incredibly mature” and again pointing to Grafana Labs’ own use of Cortex “at massive scale” over the past several years.
Cortex maintainer Goutham Veeramachaneni added that “we’ve put a lot of effort into query caching and query parallelization, and the Sysdig backend now uses our query frontend, and one of the engineers contributes to us. So, they use pieces of our stack to make their platform better.”
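For a sense of what that query caching and parallelization look like from an operator’s perspective, the query frontend is typically tuned through a handful of settings like the following. This is a hedged sketch modeled on Cortex’s documented `query_range` configuration block; exact key names and defaults may differ across versions:

```yaml
# Hypothetical query-frontend tuning; key names follow Cortex's
# query_range configuration block but may vary by version.
query_range:
  # Split a long range query into one sub-query per day so the
  # pieces can execute in parallel across queriers.
  split_queries_by_interval: 24h
  align_queries_with_step: true

  # Cache completed sub-query results so a repeated dashboard
  # refresh only has to compute the newest slice of the range.
  cache_results: true
  results_cache:
    cache:
      memcached_client:
        host: memcached.cortex.svc.cluster.local  # assumed service address
        service: memcached
```

Splitting by interval is what makes a month-long dashboard query tractable: each day’s slice is computed independently, and the cached slices are reused on the next refresh.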
As evidence of its ability to handle massive scale, Grafana Labs points not only to its own use, but also to that of Gojek, the Indonesian on-demand services platform, which it says has more than 40 tenants and handles about 1.2 million samples per second, as detailed in a case study.
The Cloud Native Computing Foundation is a sponsor of The New Stack.
Feature image by James Lee on Unsplash.