The Art of Monitoring: Introducing Riemann

If only I had the theorems! Then I should find the proofs easily enough. - Bernhard Riemann
Riemann is a monitoring tool that aggregates events from hosts and applications and can feed them into a stream processing language to be manipulated, summarized, or actioned. The idea behind Riemann is to make monitoring and measuring events an easy default.
Riemann can also track the state of incoming events and allows us to build checks that take advantage of sequences or combinations of events. It provides notifications, the ability to send events to other services and into storage, and a variety of other integrations.
Overall, Riemann has functionality that addresses all of our objectives. It is fast and highly configurable. Throughput depends on what you do with each event, but stock Riemann on commodity x86 hardware can handle millions of events per second at sub-millisecond latencies.
Riemann is open source and licensed under the Eclipse Public License. It is primarily authored by Kyle Kingsbury aka Aphyr. Riemann is written in Clojure and runs on top of the JVM.
Events, Streams, and the Index
Riemann is an event processing engine. There are three concepts we need to understand if we’re going to make use of Riemann: events, streams, and the index.
Let’s start by looking at events.
Events
The event is the base construct of Riemann. Events flow into Riemann and can be processed, counted, collected, manipulated, or exported to other systems. A Riemann event is a struct that Riemann treats as an immutable map.
Here’s an example of a Riemann event.
{:host riemanna, :service riemann streams rate, :state ok, :description nil, :metric 0.0, :tags [riemann], :time 355740372471/250, :ttl 20}
Each event generally contains the following fields.
| Field | Description |
| --- | --- |
| host | A hostname, e.g. riemanna. |
| service | The service, e.g. riemann streams rate. |
| state | A string describing state, e.g. ok, warning, critical. |
| time | The time of the event in Unix epoch seconds. |
| description | Freeform description of the event. |
| tags | Freeform list of tags. |
| metric | A number associated with this event, e.g. the number of reqs/sec. |
| ttl | A floating-point time, in seconds, for which this event is valid. |
Inside our Riemann configuration, we’ll generally refer to an event field using keywords. Remember that keywords are often used to identify the key in a key/value pair in a map and that our event is an immutable map. Keywords are identified by their leading colon, so the host field would be referenced as :host. A Riemann event can also be supplemented with optional custom fields. You can configure additional fields when you create the event, or you can add them as the event is being processed. For example, you could add a field containing a summary or derived metrics to an event.
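As a rough illustration of keyword access, the sketch below treats an event as an ordinary Clojure map at a REPL. The `event` binding, the quoted string values, and the custom `:environment` field are assumptions made up for this example; Riemann's own event type may carry more than a plain map does.

```clojure
;; A hypothetical event written as a plain Clojure map (illustration only).
(def event
  {:host        "riemanna"
   :service     "riemann streams rate"
   :state       "ok"
   :description nil
   :metric      0.0
   :tags        ["riemann"]
   :ttl         20})

;; A keyword can be called like a function to look up its field.
(:host event)     ; the value stored under :host
(:metric event)   ; the value stored under :metric

;; Maps are immutable: assoc returns a new map with the extra field
;; and leaves the original event untouched.
(def enriched (assoc event :environment "production"))

(:environment enriched)          ; the new custom field
(contains? event :environment)   ; false, the original has no such field
```

Because `assoc` returns a new map, any stream that adds a field hands a modified copy downstream rather than changing the event other streams see.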
The next layer above events is streams.
Streams
Each arriving event is added to one or more streams. You define streams in the (streams ...) section of your Riemann configuration. Streams are functions you can pass events to for aggregation, modification, or escalation. Streams can also have child streams that they can pass events to. This allows for filtering or partitioning of the event stream, such as by only selecting events from specific hosts or services.
Child streams example:
(streams
  (childstream
    (childstream)))
You can think of streams like plumbing in the real world. Events enter the plumbing system, flow through pipes and tunnels, collect in tanks and dams, and are filtered by grates and drains.
You can have as many streams as you like, and Riemann provides a powerful stream processing language that allows you to select the events relevant to a specific stream. For example, you could select events from a specific host or service that meet some other criteria.
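A minimal sketch of this kind of selection in a Riemann configuration file, assuming the `where` filter stream and the `info` logging function that a stock configuration makes available. The host and service names reuse examples from this chapter, and the threshold of 50 is an arbitrary value chosen for illustration.

```clojure
(streams
  ;; Pass along only events from host "www" for the
  ;; "apache connections" service.
  (where (and (host "www")
              (service "apache connections"))

    ;; Child stream: log every selected event.
    #(info "selected" %)

    ;; Child stream with a further filter: only events that carry a
    ;; metric above the example threshold.
    (where (and metric (> metric 50))
      #(info "busy" %))))
```

Each `where` acts as a stream with its own children, so filters can be nested to narrow the flow of events step by step.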
Like your plumbing, though, streams are designed for events to flow through them and for limited or no state to be retained. For many purposes, however, we do need to retain some state. To manage this state Riemann has the index.
The Riemann Index
The index is a table of the current state of all services being tracked by Riemann. You tell Riemann to specifically index events that you wish to track. Riemann creates a new service for each indexed event by mapping its :host and :service fields. The index then retains the most recent event for that service. You can think about the index as Riemann’s worldview and source of truth for state. You can query the index from streams or even from external services.
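One way to index only selected events, sketched under the assumption that the configuration uses Riemann's `index` function and `where` stream. The service name is an example value, not a requirement.

```clojure
(let [index (index)]
  (streams
    ;; Index only the events we want to track; here, everything from
    ;; the example service. Other events pass through unindexed.
    (where (service "apache connections")
      index)))
```

The index is itself used as a stream here, so it can be placed anywhere a child stream is expected.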
We saw in our event definition above that each event can contain a TTL or Time-to-Live field. This field measures the amount of time for which an event is valid.
Events in the index longer than their TTL are expired and deleted. For each expiration, a new event is created for the indexed service with its :state field set to expired. The new event is then injected back into the stream.
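The expiry rule can be modeled in a few lines of plain Clojure. This is a simplified sketch, not Riemann's implementation: `past-ttl?` and `expiry-event` are hypothetical helpers, and the timestamps are small made-up numbers in seconds.

```clojure
;; Hypothetical helper: an event is past its TTL once the current time
;; exceeds its timestamp plus its TTL.
(defn past-ttl?
  [event now]
  (> now (+ (:time event) (:ttl event))))

;; Hypothetical helper: build the replacement event for an expired
;; service by marking the state and stamping the current time.
(defn expiry-event
  [event now]
  (assoc event :state "expired" :time now))

(def event {:host "www" :service "apache connections"
            :metric 100.0 :time 1000 :ttl 20})

(past-ttl? event 1010)      ; false, only 10 seconds have passed
(past-ttl? event 1021)      ; true, more than 20 seconds have passed
(expiry-event event 1021)   ; same service, :state "expired", :time 1021
```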
Let’s take a closer look at this. Here’s an example event:
{:host www, :service apache connections, :state nil, :description nil, :metric 100.0, :tags [www], :time 466741572492, :ttl 20}
It’s from a host called www and is for a service called apache connections. It has a TTL of 20 seconds. If we index this event, then Riemann will create a service by mapping www and apache connections. If events keep coming into Riemann, then the index will track the latest event from this service. If the events stop flowing, then sometime after 20 seconds have passed the event will be expired in the index.
A new event will be generated for this service with a :state of expired, like so:
{:host www, :service apache connections, :state expired, :description nil, :metric 100.0, :time 466741573456, :ttl 20}
This event will then be injected back into streams where we can make use of it. This behavior is going to be pretty useful to us as we use Riemann for monitoring our applications and services. Instead of polling or checking for failed services, we’ll monitor for services whose events have expired.
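A sketch of how a configuration might react to expired events, assuming Riemann's `periodically-expire`, `index`, `default`, and `expired` functions and the `info` logging function. The 5-second sweep interval and 60-second default TTL are example values.

```clojure
;; Sweep the index for expired events every 5 seconds (example interval).
(periodically-expire 5)

(let [index (index)]
  (streams
    ;; Give events without a TTL a default of 60 seconds, then index them.
    (default :ttl 60
      index)

    ;; Expired events are reinjected into the streams; log them here.
    ;; A real configuration might send a notification instead.
    (expired
      #(info "expired" %))))
```

With this in place, a service that goes quiet shows up as a logged expired event, with no polling check required.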