With distributed system coordination software such as Kubernetes and Mesos, monitored environments have become increasingly dynamic, Reinartz pointed out in a blog post. The monitoring software needed its own dedicated storage to ensure responsiveness in these dynamic environments.
Though Prometheus 1.6 introduced auto-tuning capabilities, the team has been working on a more performant time-series database. “It’s just way more reliable and faster. Ideally, you don’t want to have to reconfigure all the time, so Prometheus just responds to change in demands, so there are way fewer knobs to turn for the people running it,” Reinartz said in an interview.
Prometheus collects data from millions of points to monitor applications and infrastructure, and stores it on the same node where it is collected.
“We’re talking about millions of data points coming in every minute. The storage engine improves this by orders of magnitude. Memory usage improves by a factor of five, CPU improves even more,” he said.
The most dramatic improvement in Prometheus 2.0, Reinartz explained, is that it writes up to two orders of magnitude less data to disk. This increases the lifetime of SSDs, which lowers costs. Under high series churn, significant disk space savings can be observed, even though the same time-series compression algorithm is being used.
Query latency is more consistent and it scales better in the face of high series churn.
At PromCon 2017, the Prometheus conference held in Munich last August, Reinartz described how the new storage layer was designed from the ground up and took a deep dive into how he created a time-series database for the monitoring software, which is now managed by the Cloud Native Computing Foundation.
The new design provides efficient indexing techniques to handle high turnover rates of monitoring targets and provides consistent query performance. It also reduces resource requirements and paves the way for advanced features like hot backups and dynamic retention policies.
The new storage engine “uses an inverted index, inspired by full-text search, to provide fast lookups across the arbitrary dimensions that time series in Prometheus may have. A new disk format ensures good collocation of related time series data, while a write-ahead log makes Prometheus 2.0 resilient to crashes,” Reinartz explained in the blog post.
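To make the inverted-index idea concrete, here is a minimal sketch in Python. It is not Prometheus' actual implementation: the `LabelIndex` class, its method names and the sample label values are hypothetical. It only illustrates the general technique of mapping each label pair to the set of series that carry it, then intersecting those sets to answer a multi-label lookup.

```python
from collections import defaultdict


class LabelIndex:
    """Toy inverted index from (label name, label value) pairs to series IDs."""

    def __init__(self):
        # Each posting list is the set of series IDs carrying that label pair.
        self._postings = defaultdict(set)
        self._series = {}

    def add(self, series_id, labels):
        """Register a series and index every one of its label pairs."""
        self._series[series_id] = dict(labels)
        for pair in labels.items():
            self._postings[pair].add(series_id)

    def lookup(self, **matchers):
        """Return the sorted IDs of series matching all equality matchers."""
        if not matchers:
            return sorted(self._series)
        lists = [self._postings.get(pair, set()) for pair in matchers.items()]
        # Start from the smallest posting list to keep the intersection cheap.
        lists.sort(key=len)
        result = set(lists[0])
        for postings in lists[1:]:
            result &= postings
        return sorted(result)


# Hypothetical series, for illustration only.
index = LabelIndex()
index.add(1, {"job": "api", "instance": "a:9090"})
index.add(2, {"job": "api", "instance": "b:9090"})
index.add(3, {"job": "web", "instance": "c:9090"})

api_series = index.lookup(job="api")
one_instance = index.lookup(job="api", instance="b:9090")
print(api_series, one_instance)
```

Because a lookup only touches the posting lists for the labels named in the query, its cost depends on how many series match rather than on how many series exist, which is the property that matters when targets come and go quickly.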
Version 2.0 also adds staleness handling, one of the most-requested roadmap items. Vanishing monitoring targets can now be tracked precisely, which reduces querying artifacts and increases alerting responsiveness.
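The article does not spell out the mechanism, so the following Python sketch shows just one simplified way staleness tracking can work: an explicit marker is written when a series disappears, so a query stops returning the last known value right away instead of waiting for a lookback window to expire. The `STALE` sentinel, the `value_at` function and the 300-second window are assumptions made for this illustration.

```python
STALE = object()  # hypothetical sentinel standing in for a staleness marker
LOOKBACK = 300    # seconds; assumed lookback window for this sketch


def value_at(samples, t, lookback=LOOKBACK):
    """Return the newest sample value at or before t, or None if absent or stale.

    samples is a list of (timestamp, value) pairs sorted by timestamp.
    """
    latest = None
    for ts, value in samples:
        if ts > t:
            break
        latest = (ts, value)
    # No sample yet, or the newest one is older than the lookback window.
    if latest is None or t - latest[0] > lookback:
        return None
    # An explicit marker means the series ended, so report nothing.
    return None if latest[1] is STALE else latest[1]


# A target scraped at t=0 and t=15 that then vanished.
without_marker = [(0, 1.0), (15, 1.0)]
with_marker = [(0, 1.0), (15, 1.0), (30, STALE)]

print(value_at(without_marker, 60))  # old value lingers inside the window
print(value_at(with_marker, 60))     # marker ends the series immediately
```

Without the marker, the vanished target keeps reporting its last value until the window runs out, which is the kind of querying artifact that delays alerts.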
Prometheus 2.0 also provides:
- built-in support for snapshot backups of the entire database.
- recording and alerting rules in the YAML format, making it easier to integrate with configuration management and templating.
The simple and open storage format and library also allow users to easily build custom extensions like dynamic retention policies. This enables the storage layer to meet a wide array of requirements without drawing complexity into Prometheus itself, allowing it to focus on its core goals, Reinartz said in the blog post. He added that the remote APIs will continue to evolve to satisfy requirements for long-term storage without sacrificing Prometheus’ model of reliability through simplicity.
A native read/write integration for any long-term storage environment behind Prometheus is still in the future, Reinartz said.
Originally developed by SoundCloud for internal use, Prometheus is a distributed monitoring tool based on ideas from Google’s Borgmon. It uses time-series data and metrics to give administrators insight into how their operations are performing.
By identifying streams of data as key-value pairs, Prometheus aggregates and filters specified metrics while allowing fine-grained querying.
Its mature, extensible data model allows users to attach arbitrary key-value dimensions to each time series, and the associated query language lets them aggregate, slice and dice the data.
Its support for multi-dimensional data collection and querying is billed as a particular strength, though Prometheus is not the best choice for uses such as per-request billing.
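The multi-dimensional model described above can be sketched in a few lines of Python. This is an illustration rather than Prometheus code: the sample data and the `select` and `sum_by` helpers are hypothetical, and they only mimic the effect of a label matcher and of a PromQL aggregation such as `sum by (job) (http_requests_total)`.

```python
from collections import defaultdict

# Hypothetical samples: each series is a set of label pairs plus one value.
samples = [
    ({"__name__": "http_requests_total", "job": "api", "status": "200"}, 120.0),
    ({"__name__": "http_requests_total", "job": "api", "status": "500"}, 3.0),
    ({"__name__": "http_requests_total", "job": "web", "status": "200"}, 80.0),
]


def select(samples, **matchers):
    """Keep series whose labels equal every given matcher (slicing)."""
    return [
        (labels, value)
        for labels, value in samples
        if all(labels.get(k) == v for k, v in matchers.items())
    ]


def sum_by(samples, *by):
    """Sum values grouped by the chosen label names (aggregation)."""
    totals = defaultdict(float)
    for labels, value in samples:
        key = tuple(labels.get(name, "") for name in by)
        totals[key] += value
    return dict(totals)


per_job = sum_by(samples, "job")
ok_per_job = sum_by(select(samples, status="200"), "job")
print(per_job)
print(ok_per_job)
```

Because every dimension is just another label, the same two operations cover grouping by job, by status code or by any other label a user chooses to attach.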
Prometheus became the second project adopted by the Cloud Native Computing Foundation (CNCF), after Kubernetes. CNCF adopted Prometheus in part due to its fit with cloud-native technologies such as containers and microservices.