The Prometheus Microservices Monitor Is Ready for Production

By now, it’s no secret that traditional application monitoring tools are ill-suited for watching over microservices. Only recently have we started to see a new generation of tools limber enough to collect operational metrics on the containerized services that may pop in and out of existence within a minute or two in these dynamic systems.
One open source microservices monitoring tool, Prometheus, has just graduated from beta status. The software has been production-ready for the past year, asserted Julius Volz, one of its creators; the 1.0 designation simply guarantees that backward compatibility will be maintained going forward, paving the way for programmatic use in mission-critical systems.
Over the past few years, the project has polished various aspects of the software, particularly getting the query language and storage engine right. The project, which started at SoundCloud in 2012, now attracts about 20 new outside contributors a month.
About 300 to 400 organizations already use the software, including 12 Fortune 500 companies, Volz estimated. It can monitor very small workloads: one user deploys Prometheus to capture operational information from a Raspberry Pi. But it also scales easily; DigitalOcean, for instance, uses the software to monitor its fleet of machines.
Perhaps most notably, Prometheus is tightly integrated with the Kubernetes container orchestration tool. This coupling is no doubt one of the reasons the Cloud Native Computing Foundation took Prometheus under its wing as its second project, after Kubernetes itself.
The ties between the two run deep: the basic ideas for both Prometheus and Kubernetes were hatched at Google.
Monitoring for the Microservices Age
The project was started in 2012 at SoundCloud by two ex-Google engineers, Volz and Matt Proud. “We both felt that the open source world was lacking good monitoring tools for these kinds of dynamic cluster scheduling environments,” Volz said.
SoundCloud already had an in-house container scheduling system, so instances were constantly landing on different hosts and ports as they were scheduled dynamically. Ops needed a tool to “make sense of all this,” Volz said. Most monitoring tools were better equipped for static jobs.
They modeled the software after BorgMon, Google’s in-house monitoring tool for all its container operations. “Prometheus was to BorgMon what Kubernetes was to Borg,” Volz said.
What makes Prometheus unusual is that it can measure the performance not only of machines but of services, Volz explained. “In the cloud, you may have a slew of machines, but you don’t care about them. You care about the service running on the machines,” said Brian Brazil, a core Prometheus developer.
Another big advantage that Prometheus offers is a mature, extensible data model. The data model allows you to attach arbitrary key-value dimensions to each time series and then use those dimensions in powerful ways: the associated query language lets you aggregate, slice, and dice along them. Instead of just determining whether a single machine is slow, you can determine whether the overall service is slow across multiple machines.
Of course, other monitoring tools offer similar capabilities, though few have such an expressive data model, Volz asserted. Before Prometheus, SoundCloud used Graphite, though “the data model was pretty flat,” Volz said. He noted that the Graphite dot notation was “pretty implicit,” meaning administrators already had to know what a string like “apiservice.get.200” means.
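As a rough sketch of what the label-based model looks like in practice, the following uses the official Prometheus Go client library (github.com/prometheus/client_golang) to define a counter with explicit label dimensions in place of a dotted Graphite name. The metric name, labels, and query are illustrative examples, not taken from any particular deployment.

```go
package main

import (
	"github.com/prometheus/client_golang/prometheus"
)

// A hypothetical request counter with two label dimensions. Each unique
// (handler, status) combination becomes its own time series, replacing a
// flat Graphite-style name such as "apiservice.get.200".
var httpRequests = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "api_http_requests_total",
		Help: "Total HTTP requests handled by the API service.",
	},
	[]string{"handler", "status"},
)

func main() {
	prometheus.MustRegister(httpRequests)

	// Record a GET request that returned HTTP 200.
	httpRequests.WithLabelValues("get", "200").Inc()

	// On the query side, an aggregation such as
	//   sum by (status) (rate(api_http_requests_total[5m]))
	// reports the request rate per status code across every instance of
	// the service, rather than per individual machine.
}
```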
Like any collection agent for time series data, Prometheus can amass a lot of data quickly. The good news is that Prometheus works efficiently: the development team demonstrated a single machine capturing 800,000 samples per second.
Unlike other pull-based monitors, such as Nagios, Prometheus doesn’t overwhelm its host machine by churning out complicated pull requests. Instead, it sticks to issuing simple HTTP requests to service agents. As a result, a single Prometheus server could theoretically monitor more than 10,000 machines at a 10-second scrape interval.
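Because scraping is just HTTP, a target only has to expose a metrics endpoint. Here is a minimal sketch using the same Go client library; the port and path are conventional examples, not required values.

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// Serve the default metric registry at /metrics. Prometheus collects
	// the data with an ordinary HTTP GET on each scrape interval, so the
	// target does no monitoring work between scrapes.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```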
The initial setup can be done in a few minutes: install the Prometheus server binary on a single machine, then install the machine agents, which can run inside a container or directly on the server OS.
While the setup is easy, customizing it to your specific workload may take some time. There is a learning curve involved because “It does so many things so differently than traditional monitoring tools,” Volz said.
Kubernetes and Prometheus
The software was also built to scale to very large environments, and it works particularly well with Kubernetes. Kubernetes already exports metrics in the Prometheus format from many of its own endpoints, for instance, and Prometheus natively supports Kubernetes’ service discovery.
“The way Prometheus finds the services it wants to poll metric data from is usually from some kind of service discovery, and it has native support for Kubernetes so it can pull data from instances running in Kubernetes containers,” Volz said.
The glue between them is the data model shared by the two programs: a user can map Kubernetes service discovery labels to Prometheus time series labels.
Prometheus doesn’t only work on container-based services, noted Brazil, who founded Robust Perception, a Prometheus consultancy.
Prometheus has been used to monitor medical equipment, for instance. Other deployments monitor networks, large applications, even virtual machines. As long as you can fit what you want to monitor into a time series model, Prometheus is a fit, Brazil said.
Feature image via Pixabay.