Prometheus at 10: What’s Been Its Impact on Observability?
Happy 10th birthday to the open source software and community of Prometheus. Named after the Titan of Greek mythology who gave humanity fire, this second Cloud Native Computing Foundation (CNCF) project became the first major tool driving observability of modern systems, because you can’t fix what you can’t measure before, during and after experimentation.
Hatched by ex-Googlers at SoundCloud, Prometheus has grown over the last 10 years to more than 700 open source contributors (although only about 25 are regulars) and more than a million users. It was also the first open source tool to integrate alerting with service monitoring, storing its metrics as time series identified by key-value pairs.
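To make that data model concrete, here is a minimal Python sketch of time series identified by key-value labels, plus a threshold check standing in for alerting. It is not Prometheus code: the `Series`, `Store` and `firing_alerts` names, the metric name and the threshold are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Series:
    """One time series: a metric name plus key-value labels (hypothetical model)."""
    name: str
    labels: tuple  # sorted (key, value) pairs, hashable so a Series can key a dict


def make_series(name, **labels):
    return Series(name, tuple(sorted(labels.items())))


class Store:
    """In-memory stand-in for a time-series database."""

    def __init__(self):
        self.samples = {}  # Series -> list of (timestamp, value), oldest first

    def append(self, series, timestamp, value):
        self.samples.setdefault(series, []).append((timestamp, value))

    def latest(self, name, **matchers):
        """Yield (series, newest value) for every series matching name and labels."""
        for series, points in self.samples.items():
            if series.name != name:
                continue
            labels = dict(series.labels)
            if all(labels.get(k) == v for k, v in matchers.items()):
                yield series, points[-1][1]


def firing_alerts(store, name, threshold, **matchers):
    """Return the series whose newest value is above the threshold."""
    return [s for s, v in store.latest(name, **matchers) if v > threshold]


store = Store()
api = make_series("http_request_latency_seconds", service="api")
web = make_series("http_request_latency_seconds", service="web")
store.append(api, 1, 0.2)
store.append(api, 2, 0.9)
store.append(web, 1, 0.1)

# Only the "api" series is currently above the assumed 0.5-second threshold.
slow = firing_alerts(store, "http_request_latency_seconds", 0.5)
```

The point of the sketch is that the labels, not a fixed host or dashboard, identify each series, so the same alert condition covers every matching service.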
But what, besides a rad name, makes Prometheus so popular after all these years? Some of its most prominent creators and advocates weighed in to uncover the impact 10 years of better observability has had on the software industry and the widespread adoption of open source across our distributed world.
A Brief History of Prometheus
“More and more ‘normal’ companies went into a stage where they had fairly complex, distributed systems and were severely lacking troubleshooting skills.” That line opens “Prometheus: The Documentary,” and it captures the impetus for the software.
There was a do-it-yourself culture of tooling that tried to fill this void, while, at the same time, system complexity was rapidly increasing. Existing internal tooling was mostly manual and couldn’t scale to keep up.
Back in 2012, Julius Volz and Matt T. Proud had recently moved from Google to SoundCloud. They soon found that SoundCloud, with hundreds of microservices and thousands of processes running, had a lot of reliability and latency issues. “And we were having trouble even figuring out where these problems were coming from,” or often even pinpointing what was broken, said Volz in the documentary.
SoundCloud had built its own cluster scaler system — a precursor to Docker and Kubernetes — facilitating a dynamic and ever-changing computing environment. However, there were no traditional tools able to peek under the hood of a cluster.
“We were really disappointed in what existed in terms of monitoring outside of Google,” Volz told The New Stack.
At the time, no open source tooling existed to bridge the chasm between time-series metrics and alerting, let alone with the desired flexibility and scalability.
So he and Proud started building what would become Prometheus in their spare time, open sourcing it on GitHub from the start. Then they started building more and more during their day jobs.
“We were a lot of the time taking a lot of liberties,” Volz said. “But also we wanted to make SoundCloud more reliable. But before we do that, we need to know what’s going on.” The pair built a lot of infrastructure reliability diagrams for their colleagues, in which all roads pointed to an urgent need for visibility.
Volz pointed to a tipping point where people at SoundCloud started to recognize the value of Prometheus. For each microservice process, they were finally able to understand things like their resource usage over time — “a huge insight into things that were so opaque before,” he said. About 18 months into the project, it became mandatory at SoundCloud that each microservice be released with Prometheus inside.
While the project was on GitHub for more than two years, its creators kept it in stealth mode until January 2015. With the acquiescence of SoundCloud’s open source program lead, Prometheus became a more public open source project with a website, documentation and some press.
The seeds of Prometheus soon landed on fertile soil, alongside the July 2015 release of Kubernetes container orchestration. Prometheus became the first open source monitoring tool that allowed for service discovery, something that was desperately needed with the highly distributed, highly complex Kubernetes.
“Both inspired by analogous products at Google,” Volz noted, the two projects quickly achieved tight integrations, “with Kubernetes then offering native Prometheus metrics and Prometheus offering Kubernetes service discovery, that meshed really well.”
Soon after the CNCF recruited Kubernetes as its first project, Prometheus became its second.
Since then, the ecosystem and codebase have grown, but the project has notably not pivoted its focus over the last decade.
The Technical Impact of Prometheus
Prometheus for the first time opened up observability into increasingly complex systems — which certainly weren’t made less complex with Kubernetes — democratizing data so far more members of an organization could gain that level of visibility and understanding.
The impact on DevOps and site reliability engineers (SREs) was significant, Richard Hartmann, director of community at Grafana Labs and member of the Prometheus team, told The New Stack.
“For the first time, people outside of hyper-scalers actually had the tools to observe the complexity they unleashed with cloud native and similar scaling approaches,” said Hartmann. “Prometheus was the first tool to allow you to dynamically detect and monitor workloads of arbitrary complexity and deployment — and do math with the data. Previously, people were forced to more or less pre-create all dashboards. It was a very static way of working with and understanding data.”
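As a rough illustration of what doing math with the data can mean, the Python sketch below turns raw counter samples into a per-second rate. It is a simplified stand-in, not the query language's actual rate calculation, which handles window boundaries more carefully; the function name and sample values are assumptions for this example.

```python
def per_second_rate(samples):
    """Average per-second increase of a counter across the given samples.

    samples: list of (timestamp_seconds, counter_value), oldest first.
    A drop in value is treated as a counter reset (for example, a restart),
    so counting resumes from zero at that point.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        increase += (cur - prev) if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else None


# A request counter scraped every 15 seconds (made-up numbers).
requests_total = [(0, 100), (15, 130), (30, 160)]
rate = per_second_rate(requests_total)
```

Because the calculation runs at query time over whatever series exist, nothing about the workload has to be known when the dashboard is built.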
What makes Prometheus technologically special? “It has a unique combination of how easy it is to operate, how consistent it is — there’s always one way to do something, and it’s also the most natural way,” Frederic Branczyk, a Prometheus maintainer and founder of Polar Signals, told The New Stack.
Where’s this special tech heading next? There are two main themes driving the Prometheus roadmap, says Hartmann:
- Tighter integrations.
- Expansion beyond cloud native software, networks and power grids.
Prometheus could have potential in solving the ongoing global supply chain challenge, Hartmann said, suggesting an example of how a coal company could use it: “There’s a port measuring coal intake on their conveyor belts. They weigh the coal on the belts to determine moisture levels, and reject anything too moist. They could pay five to six figures for a proprietary software solution, or use Prometheus and Grafana for free.”
The Prometheus Open Source Ecosystem
While Prometheus will always remain open source, Volz assured, there are plenty of companies built on top of and around it. Volz himself leads PromLabs, which open sourced the PromLens query builder and offers Prometheus consulting to businesses. Brian Brazil, another Prometheus contributor, runs the Prometheus training company Robust Perception.
Grafana Labs is one of the entities built as part of the Prometheus open source ecosystem that are both commercial and open source. The organization has open sourced projects like Loki, Mimir, Tempo and, most notably, Grafana, which visualizes the Prometheus metrics stored in the backend.
This unique ecosystem around Prometheus developed, in part, Volz said, because “we were trying to position ourselves as a neutral-ish thing, independent from one company.”
The open source tool has also never had the budget of other projects. “We are never in this mode where startups [can] say, ‘add all the features,’ because our investors will give us more money if they do,” he said. Resource constraints have allowed the small Prometheus team to stay laser focused.
Branczyk started working with Prometheus at CoreOS back in 2016. CoreOS’s mission was to secure the internet through automatic software updates. (“The overwhelming majority of security problems have been fixed, they just don’t update it,” he said.)
CoreOS was performing automatic upgrades on servers, and then Kubernetes, and then any software. But the company soon realized, Branczyk said, that its efforts were meaningless without being able to tell if the upgrades made things better or worse.
Branczyk continued to be a Prometheus maintainer as part of his time heading observability at Red Hat, which bought CoreOS in 2018, until he left in 2020 to found his own business.
He wanted to do what Prometheus does, but in automating high-resolution snapshots in time, which is why he founded and recently open sourced Parca, a continuous profiling tool that he said is “working in that intersection of Kubernetes and Prometheus.”
This newer concept of automated profiling, Branczyk said, has a tool “look at a process for 10 seconds at [a] time, and you record at very high resolution what your program is doing.” This offers organizations continuous understanding of how resources like CPU are spent, which, of course, affects their cloud bills.
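The idea of sampling a process and attributing time to code can be sketched in a few lines of Python. This is a generic sampling-profile aggregation, not Parca's implementation; the function name, the stack format and the sample rate are assumptions for illustration.

```python
from collections import Counter


def aggregate_profile(stack_samples, sample_hz):
    """Estimate CPU seconds per function from sampled call stacks.

    stack_samples: list of call stacks, each a list of function names with
    the outermost caller first. Each sample stands for 1 / sample_hz seconds.
    Returns (self_time, total_time) dictionaries keyed by function name.
    """
    seconds_per_sample = 1.0 / sample_hz
    self_time = Counter()   # time in the function itself (top of the stack)
    total_time = Counter()  # time in the function or anything it called
    for stack in stack_samples:
        if not stack:
            continue
        self_time[stack[-1]] += seconds_per_sample
        for func in set(stack):  # count a recursive function once per sample
            total_time[func] += seconds_per_sample
    return dict(self_time), dict(total_time)


# Four made-up samples taken at an assumed 4 samples per second.
samples = [["main", "handle", "parse"]] * 3 + [["main", "handle"]]
self_time, total_time = aggregate_profile(samples, sample_hz=4)
```

Summing these estimates over successive windows is what lets a team see which functions account for CPU spend over time.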
“If you’re spending less CPU time, your software is actually faster if you’re still doing the same thing,” which Branczyk said is especially valuable for e-commerce, where every 10 milliseconds you shave off a page doubles the customer conversion rate.
Polar Signals Cloud is a hosted solution that allows you to deploy a Parca agent onto, for example, your Kubernetes cluster, and it will automatically start profiling your infrastructure. It’s one of the several product and service companies that have spun out directly or indirectly within the Prometheus ecosystem, which now includes Amazon, Google, and Microsoft.
“Everything we do is highly inspired by Prometheus,” Branczyk said. He continues to maintain the Prometheus Operator, which spun out of the now-defunct CoreOS and manages Kubernetes integrations with Prometheus. The operator and Prometheus are separate GitHub projects, he noted, but “it’s so ubiquitous that people don’t realize it’s separate.”
A ‘Career-Defining’ Project … for Some
Branczyk called his experience with Prometheus career-defining. “Basically I wouldn’t be in the position that I’m in now having been able to start this company, raise money for this company, if it hadn’t been for the Prometheus project.”
Another Prometheus lifer, Julien Pivotto, an open source observability consultant, has been a Prometheus user, and then a contributor, since 2017. By 2020, he began making more significant contributions, and in 2021 he was promoted to maintainer of the Prometheus server.
Unlike other open source projects, Prometheus has been relatively successful at recognizing regular contributors earlier on and nurturing them up the contribution ladder.
“I contributed to Prometheus because it was useful and I improved it to better suit our use case,” Pivotto said. In turn, “It improved the way I did my SRE job, by providing better insights and enabling us to better react [to] and understand incidents.”
Now, he and his team offer support to Prometheus users. “It also means that I am paid to work on the project,” he said. “Prometheus is my only area of work, whether it is support, or upstream work.”
While there are only about two dozen maintainers and regular contributors, Prometheus stands out from other open source projects in that they all seem to have found a way to get paid for their open source contributions.
Of course, it reflects another common open source concern, as its co-founder Volz acknowledged: “The worst part about the diversity is we are all men.”
He pointed out that other projects, including Kubernetes, have funds directly dedicated to improving diversity and inclusion, while the still mostly bootstrapped and understaffed Prometheus does not.
An Unusual Open Source Community
The CNCF’s first two graduates, Kubernetes container orchestration and Prometheus monitoring, have grown so symbiotic that one is rarely used without the other. But while they are the peas and carrots of the cloud native world, they do have a few significant differences.
For example, while KubeCon + CloudNativeCon hosts thousands of attendees twice a year, the annual PromCon keeps it under 250. This closeness is part of why all four interviewees for this piece said PromCon was their favorite part of the community.
Of course, while Kubernetes is a project born and bred at giant Google, Prometheus has achieved a similar popularity without any big corporate backer. “I guess that’s what makes it more challenging,” Volz said. “There’s not one company that was able to dedicate resources. We were always kind of a ragtag team from a bunch of companies,” he continued, although today almost half of the governing board is from Grafana Labs.
That Prometheus is a significantly smaller project that still carries a big name should make it attractive to potential contributors. “It’s still a fact that this is a relatively easy way, compared to [like] Kubernetes, to get recognition fast” in the open source world, Hartmann said.
However, he conceded, “The negative interpretation is that one of the largest projects on Earth has a group of one to two dozen people plugging all the holes to keep the ship afloat, and we were only a dozen until recently. More hands would be very welcome indeed.”
It’s been a collaborative opportunity from the start — never, like several open source projects, having to teeter precariously on the shoulders of a single maintainer. “My pattern is, I started a lot of things and then other people come in and code properly,” Volz half-joked.
Unsurprisingly, everyone we talked to had one wish for Prometheus — more contributors. “The impact of contributing to open source in general is that you’re getting exposure. In particular, in the current market, having a public track record of your work is not going to hurt. Plus, you’re doing good for society — in most cases,” Hartmann said. “The impact of contributing to Prometheus has a certain level of rockstardom.”