Prometheus and the Debate Over ‘Push’ Versus ‘Pull’ Monitoring

Prometheus has immense scaling capabilities that make it well suited for keeping an eye on high-volume distributed applications, though many misperceptions still linger around the open source monitoring tool.
“There are a few common reactions when people first see Prometheus. People wonder, ‘Why isn’t it a log processing system?’ Another common one is, ‘It isn’t push, how can it possibly work?’” said Prometheus creator and infrastructure engineer Brian Brazil, in today’s episode of The New Stack Makers podcast.
Exploring Prometheus Use Cases with Brian Brazil
“Monitoring systems generally fall into two categories,” explained a page on the Boxever tech website. “Those where services push metrics to the monitoring system and those where the monitoring system pulls metrics from services. This can be a surprisingly contentious issue.” (emphasis ours)
Prometheus, now a Cloud Native Computing Foundation project, falls primarily into the “pull” category, an unpopular choice, evidently, among some system architects. Both Brazil and his colleague Julius Volz have written at length about the topic, with Volz debunking some of the common myths circulating around the debate.
“It’s ‘pull can’t scale,’ ‘push can’t scale,’ or ‘they both have security problems,’ which they do, depending on the context,” Brazil noted. “From an engineering standpoint, in reality, the question of push versus pull largely doesn’t matter. In either case, there are advantages and disadvantages, and with engineering effort, you can work around both cases. It is my belief that pull is very slightly better.”
Then there is the task of writing the Prometheus agents themselves, work that is, in Brazil’s view, worth the effort. Writing agents, he notes, requires attention to detail that ultimately makes the entire monitoring process smoother for administrators and on-call operations team members who are working with many agents across a system.
“The system you’re writing an agent for might be one of 40-50 things to monitor inside that system, and the person who gets a page or alert may only have a vague notion that your system exists. Ask yourself, ‘How should I name metrics so they can better understand what they need?’ You come across metrics everywhere that are some time or latency, and they don’t mention the unit. It’s kind of annoying, especially in an emergency.”
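Putting the unit directly in the metric name is the kind of detail Brazil is describing. Below is a minimal sketch using the Go client library; the service name `myapp` and the handler are hypothetical, and the point is simply that a reader paged at 3 a.m. can see from the name alone that the value is in seconds.

```go
package main

import (
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// requestDuration carries its base unit ("_seconds") in the metric name,
// so whoever reads it during an incident doesn't have to guess ms vs. s.
var requestDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name: "myapp_http_request_duration_seconds", // hypothetical metric name
	Help: "Time taken to serve HTTP requests, in seconds.",
})

func main() {
	prometheus.MustRegister(requestDuration)

	http.Handle("/metrics", promhttp.Handler())
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		w.Write([]byte("ok"))
		requestDuration.Observe(time.Since(start).Seconds())
	})
	http.ListenAndServe(":8080", nil)
}
```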
Scalability and ease of use are on the minds of many teams looking to implement Prometheus as part of their monitoring solution. Brazil noted that in some use cases, engineers struggled to label their metrics correctly, which caused problems further down the pipeline for operations and system administrators. It is Prometheus’s ability to scale while allowing developers to define and label their alert thresholds that makes working with data at scale simpler.
“An interesting question in this area is: when you start off, everyone has the same alert threshold. How do you customize that? One approach is to have some other server that Prometheus scrapes exposing all the thresholds, and then you can join those together, which helps operationally,” Brazil said.
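One way to read that idea is a small exporter whose only job is to publish per-service thresholds as ordinary gauge metrics. The sketch below is an assumption-laden illustration, not Brazil’s own code: the metric name `alert_latency_threshold_seconds`, the `service` label, and the hard-coded values are all hypothetical. Once scraped, such a series could be joined against measured latency in an alerting expression instead of hard-coding a single number into every rule.

```go
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// latencyThreshold exposes per-service alert thresholds as a gauge.
// Prometheus scrapes this exporter like any other target, and alert
// rules can then join these series against the measured values.
var latencyThreshold = prometheus.NewGaugeVec(prometheus.GaugeOpts{
	Name: "alert_latency_threshold_seconds", // hypothetical metric name
	Help: "Per-service latency threshold, in seconds.",
}, []string{"service"})

func main() {
	prometheus.MustRegister(latencyThreshold)

	// In practice the values might come from a config file or database;
	// they are hard-coded here only to keep the sketch self-contained.
	latencyThreshold.WithLabelValues("checkout").Set(0.5)
	latencyThreshold.WithLabelValues("search").Set(0.2)

	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":9100", nil)
}
```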
You can learn more about this technology at the Prometheus Day, held during this year’s KubeCon/CloudNativeCon in Seattle on November 8-9, 2016.