Collecting Metrics Using StatsD, a Standard for Real-Time Monitoring

“To measure is to know,” said the illustrious Lord Kelvin. As software engineers writing web services, we absolutely need measurements and metrics to understand the performance of our applications and infrastructure in real-time.
You’ve likely asked yourself questions such as: How slow is this query? How many times is this page being accessed? Are my servers running out of memory? To help you answer those questions, and many more, this post introduces StatsD, a standard for real-time monitoring with minimal overhead.
The article begins with a brief introduction to StatsD and an example that demonstrates the situations in which StatsD thrives. We will then dive into the StatsD datagram format, which reveals important details about the inner workings of the system. Finally, we will review decisions that you’ll have to make when setting up your initial StatsD configuration, namely UDP versus TCP, and your choice of backend.
What is StatsD?
StatsD is a standard and, by extension, a set of tools that can be used to send, collect, and aggregate custom metrics from any application. Originally, StatsD referred to a daemon written by Etsy in Node.js. Today, the term StatsD refers to both the protocol used in the original daemon, as well as a collection of software and services that implement this protocol.
A StatsD system requires three components: a client, a server and a backend. The client is a library that is invoked within your application code to send metrics. These metrics are collected by the StatsD server (sometimes also called the daemon). The server aggregates these metrics, then sends the aggregates to one or more backends at regular intervals. Backends perform various tasks with your data — for example, Graphite is a commonly used backend that allows you to view real-time graphs of your metrics. StatsD components are modular, so different implementations can be added, removed or replaced without affecting the rest of the system.
Why does StatsD adopt this client-server-backend model? For two reasons: (a) language independence and (b) reliability. By relying on a simple, text-oriented protocol, StatsD quickly developed an ecosystem of clients for most languages and frameworks in use today. It also ensured strict isolation between the application (and the StatsD client) and the rest of the instrumentation. Should the StatsD server crash, it would have no effect on the performance of the application beyond the loss of instrumentation.
StatsD by Example
Before we dive deeper into StatsD, let us look at an example: how to instrument a function as critical as authentication in a web service.
You dig into the application code, and find this function (shown here in python):
```python
def login(username, password):
    if password_valid(username, password):
        render_welcome_page()
    else:
        render_error(403)
```
The first thing you want to know is how frequently this login is being accessed. This is useful for a number of reasons — for example:
- Authentication is critical to any application, so its performance needs to be well understood.
- A sudden change in the number of logins could be an early warning of serious issues (e.g., incorrect DNS change, expired TLS certificate).
Using a StatsD client library, you modify the function:
```python
import statsd                                          # new
statsd_client = statsd.StatsClient('localhost', 8125)  # new

def login(username, password):
    statsd_client.incr('login.invocations')            # new
    if password_valid(username, password):
        render_welcome_page()
    else:
        render_error(403)
```
The code you added declares a StatsD client which will know where to find your StatsD server, then increments a counter called login.invocations every time the function is executed. After deploying your change, you check the corresponding graph:
It looks like we’re getting more logins than expected, so the system is under a higher load than anticipated. You’d now like to know if this is affecting the performance of your application; in testing, you determined that 20 milliseconds was a reasonable amount of time to execute the function. Modifying the code again, you add the timer decorator shown below:
```python
import statsd
statsd_client = statsd.StatsClient('localhost', 8125)

@statsd_client.timer('login.time')  # new
def login(username, password):
    statsd_client.incr('login.invocations')
    if password_valid(username, password):
        render_welcome_page()
    else:
        render_error(403)
```
This code measures the execution time of the login function, then sends the time to the StatsD server. After deploying the code, you check the chart again:
Login time is still around 20 ms, which means that the system can handle the unanticipated load it’s receiving.
Using StatsD, you were able to collect two custom metrics and gain immediate visibility into the state of your system, simply by adding four lines of code to your application.
StatsD and Profiling
You can take the same instrumentation from your development environment to production without changing a single line of code. StatsD does not require that you run the application in a certain environment the way a profiler or a debugger does. And as we will see, its default reliance on connectionless protocols is paramount to its operational safety.
The Role of the StatsD Server
The example above shows the StatsD client sending metrics to a StatsD server. So what is the role of the server?
The server aggregates the metrics it receives from clients and forwards the results regularly to a backend for storage and visualization. How often that happens is defined by the flush interval, which is set to 10 seconds by default.
In the example above, the StatsD server accumulates login.invocations for the duration of the flush interval and forwards the actual count, as well as the rate (count per unit of time), to the backends it is connected to.
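As an illustrative sketch (not Etsy's actual implementation — the class and constant names here are hypothetical), the server-side aggregation of a counter over one flush interval might look like this:

```python
# Hypothetical sketch of server-side counter aggregation.
FLUSH_INTERVAL_SECONDS = 10  # the default flush interval

class CounterAggregator:
    def __init__(self):
        self.count = 0

    def record(self, value):
        # Called once per counter datagram received from a client.
        self.count += value

    def flush(self):
        # Report both the raw count and the per-second rate,
        # then reset the counter for the next interval.
        rate = self.count / FLUSH_INTERVAL_SECONDS
        count, self.count = self.count, 0
        return count, rate

agg = CounterAggregator()
for _ in range(250):  # 250 login.invocations datagrams in one interval
    agg.record(1)
count, rate = agg.flush()
print(count, rate)  # 250 25.0
```

The key point is that the client's many small increments become a single count-and-rate pair per flush interval, so the backend only has to store one data point every ten seconds.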
Separating the task of collecting raw measurements from the task of aggregating them is another way to limit the impact of instrumentation on the running application.
StatsD Datagram
StatsD clients encode metrics into simple, text-based UDP datagrams. Though your client library takes care of forming these datagrams, exploring the format reveals important details about the features the StatsD protocol supports.
A StatsD datagram, which contains a single metric, has the following format:
```
<bucket>:<value>|<type>|@<sample rate>
```
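To make the format concrete, here is a sketch of how a client might assemble such a datagram (the helper function name is hypothetical, not part of any StatsD client API):

```python
def format_datagram(bucket, value, metric_type, sample_rate=1.0):
    """Encode a single metric in the StatsD datagram format.
    (Hypothetical helper, for illustration only.)"""
    datagram = f"{bucket}:{value}|{metric_type}"
    if sample_rate != 1.0:
        # The sample rate section is only included when sampling is used.
        datagram += f"|@{sample_rate}"
    return datagram

print(format_datagram("login.invocations", 1, "c"))  # login.invocations:1|c
print(format_datagram("login.time", 22, "ms", 0.1))  # login.time:22|ms|@0.1
```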
Bucket
The bucket is an identifier for the metric. Metric datagrams with the same bucket and the same type are considered occurrences of the same event by the server. In the example above, we used “login.invocations” and “login.time” as our buckets. Note that periods can be used in buckets to group related metrics. Buckets are not predefined; a client can send a metric with any bucket at any time, and the server will handle it appropriately.
Value
The value is a number that is associated with the metric. Values have different meanings depending on the metric’s type.
Sample Rate
The sample rate indicates to the server that the metric was down-sampled. Sampling is intended to reduce the number of metric datagrams sent to the StatsD server, since the server’s aggregations can get expensive. The sample rate determines what percentage of the metric points a client should send to the server. The server accounts for this sampling by dividing the values it receives by the sample rate. For example, if a metric has a sample rate of 0.1, only 10 percent of the metric datagrams will be sent by the client to the server. The server will then divide the values for these metrics by 0.1 (or multiply by 10) to get an estimate of the true value in the case of additive metrics, such as the login invocation count we used in the example above.
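Both sides of this arrangement can be sketched in a few lines (the helper names here are hypothetical, for illustration only):

```python
import random

def maybe_send(send, bucket, value, sample_rate):
    """Client side: probabilistically drop datagrams so that, on
    average, only `sample_rate` of them are actually sent.
    (Hypothetical helper.)"""
    if random.random() < sample_rate:
        send(f"{bucket}:{value}|c|@{sample_rate}")

def adjusted_value(value, sample_rate):
    """Server side: scale a sampled counter value back up to
    estimate the true total. (Hypothetical helper.)"""
    return value / sample_rate

# A counter received with a sample rate of 0.1 stands in for ~10 events.
print(adjusted_value(1, 0.1))  # 10.0
```

Note that sampling trades accuracy for throughput: the server's scaled-up count is only a statistical estimate of the true total.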
Type
The type determines what sort of event the metric represents. There are several metric types:
Counters
Counters count occurrences of an event. Counters are often used to determine the frequency at which an event is happening, as was done in the login example above. Counter metrics have “c” as their type in the datagram format. The value of a counter metric is the number of occurrences of the event that you wish to count, which may be a positive or negative whole number. Many clients implement “increment” and “decrement” functions, which are shorthand for counters with values of +1 or -1, respectively.
```
login.invocations:1|c  # increment login.invocations by 1
other_key:-100|c       # decrement other_key by 100
```
Timers
Timers measure the amount of time an action took to complete, in milliseconds. Timers have “ms” as their metric type. The StatsD server will compute the mean, standard deviation, sum, and upper and lower bounds for a timer over one flush interval. The StatsD server can also be configured to compute histograms for these metrics.
```
login.time:22|ms  # record a login.time event that took 22 ms
```
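The per-interval statistics listed above can be sketched over a handful of timer values (an illustrative model of what the server computes, not its actual code):

```python
import statistics

def summarize_timer(values_ms):
    """Sketch of the aggregates a StatsD server reports for a timer
    over one flush interval. (Illustrative only.)"""
    return {
        "count": len(values_ms),
        "sum": sum(values_ms),
        "mean": statistics.mean(values_ms),
        "stdev": statistics.pstdev(values_ms),  # population std deviation
        "lower": min(values_ms),
        "upper": max(values_ms),
    }

# Five login.time measurements received during one flush interval:
stats = summarize_timer([20, 22, 21, 25, 22])
print(stats)
```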
Gauges
Gauges are arbitrary, persistent values. Once a gauge is set, the StatsD server will report the same value each flush period until it is changed. After a gauge has been set, you can prefix a gauge’s value with a sign to indicate a relative change. Gauges have “g” as their type.
```
gas_tank:0.50|g   # set the gas tank metric to 50%
gas_tank:+0.50|g  # add 50% to the gas tank; it now reads 100%
gas_tank:-0.75|g  # subtract 75% from the gas tank; it now reads 25%
```
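The set-versus-adjust rule for gauge values can be sketched as follows (a hypothetical helper, not server code):

```python
def apply_gauge(current, raw_value):
    """Apply one gauge datagram value to the current gauge reading.
    A leading '+' or '-' means a relative change; any other value
    replaces the gauge outright. (Hypothetical helper.)"""
    if raw_value.startswith(("+", "-")):
        return current + float(raw_value)  # float("-0.75") == -0.75
    return float(raw_value)

gas = apply_gauge(0.0, "0.50")   # set to 0.50
gas = apply_gauge(gas, "+0.50")  # now 1.00
gas = apply_gauge(gas, "-0.75")  # now 0.25
print(gas)  # 0.25
```

One consequence of this rule is that a gauge cannot be set directly to a negative number in a single datagram; it must first be set to 0 and then decremented.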
Sets
Sets report the number of unique elements that are received in a flush period. The value of a set is a unique identifier for an element you wish to count. Sets have “s” as their type.
Assume the following metrics occur within one flush period:
```
# unique_users = 0
unique_users:foo|s  # count an occurrence of user `foo`; unique_users = 1
unique_users:foo|s  # `foo` has already been seen, so unique_users is still 1
unique_users:bar|s  # unique_users = 2
```
After a flush, unique_users will reset to 0 until another metric is received.
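Set aggregation maps naturally onto a set data structure, as this sketch shows (an illustrative model with hypothetical names, not the server's actual code):

```python
class SetAggregator:
    """Sketch of set aggregation: count unique identifiers
    seen during one flush period. (Illustrative only.)"""
    def __init__(self):
        self.seen = set()

    def record(self, identifier):
        # Duplicate identifiers are absorbed by the set.
        self.seen.add(identifier)

    def flush(self):
        count = len(self.seen)
        self.seen.clear()  # reset for the next flush period
        return count

agg = SetAggregator()
for user in ("foo", "foo", "bar"):
    agg.record(user)
print(agg.flush())  # 3 datagrams, but only 2 unique users -> 2
```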
UDP and TCP
StatsD was designed to send metrics with as little overhead as possible. By default, a client sends metrics to a server over UDP, which is a “fire-and-forget” protocol. This means that the client will make no attempt to ensure that the server received the metric; if the packet is lost in the network or the server is down, the client will not attempt to resend the packet.
In general, sending metrics over UDP makes sense if any of these three criteria are met:
- Metrics are sent frequently. UDP’s fire-and-forget nature means that as little time as possible is spent executing the StatsD client’s code. This tradeoff between reliability and execution cost makes sense if metrics are sent frequently: if you’re sending a metric to log an event that happens a hundred times per second, and a packet is dropped, the impact on accuracy will be minimal. However, if you’re counting occurrences of an event that happens once per day, a dropped or erroneous packet will have a significant impact on the accuracy of your instrumentation.
- Collecting metrics is peripheral to the purpose of code. Recall our login example above. If an error occurs while sending either of the login metrics, we wouldn’t want our StatsD client to stall or your application to crash, since the primary purpose of the code is to authenticate a user. Sending a metric is not essential to the purpose of this code, so failure to do so should be tolerated without impacting the regular operation of the code.
- The metrics are sent over a “reliable network.” Usually, the client sends metrics to a server running on the same machine, or somewhere in an internal network. If you’re sending metrics to a server on the other side of the world, sending metrics over UDP may not be the best choice, since there is a greater probability that the packet will be dropped or contain errors.
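Under the hood, fire-and-forget amounts to a single UDP send with no reply expected. This minimal sketch assumes a StatsD server listening on 127.0.0.1:8125 (the conventional port); notably, the send succeeds even if nothing is listening there — the datagram is simply dropped:

```python
import socket

# Fire-and-forget: hand one datagram to the OS and move on.
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sent = sock.sendto(b"login.invocations:1|c", ("127.0.0.1", 8125))
sock.close()
print(sent)  # number of bytes handed to the OS
```

Because no connection is established and no acknowledgment is awaited, the cost to the application is a single non-blocking system call.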
Recently, the ability to send metrics over TCP was added to Etsy’s StatsD server. TCP differs from UDP in that the client will attempt to retransmit a metric if it is dropped or contains errors. If it is imperative that your server receives all measurements without fail, TCP is a better choice than UDP. However, sending metrics with TCP will incur more overhead.
There are a few caveats to using TCP to send metrics to your StatsD server:
- At the time of writing, TCP support is a relatively recent addition to the StatsD protocol. So although Etsy’s StatsD server supports the protocol, many other clients and servers do not.
- TCP tries to ensure that the metric is delivered to the server but does not ensure that the metric was valid. For example, if a client sends a metric that does not fit the StatsD datagram format, the server will be unable to process it. However, since the client is only concerned with the delivery of the metric, it will still report that the metric was successfully delivered.
Choosing your Backend
StatsD supports several pluggable backends which receive metrics from the daemon and perform different functions depending on your needs. For example, Graphite allows you to visualize data, whereas node-bell detects anomalies in your data. A daemon can forward metrics to several backends at once. Backends turn your metrics into useful information, so you should take care to pick those that best suit your needs.
The big decision that you’ll need to make is whether you want to host your StatsD backend yourself, or use a hosted backend service such as Datadog. Hosting your own StatsD backend will give you more flexibility with the services that the backend offers, but you’ll need some expertise to set it up correctly. Also, when problems arise with your systems, it’s possible that your StatsD backend — a tool that you would use to identify and troubleshoot the problem — will also go down. If you’re comfortable managing this yourself, Graphite is a good backend to start with, and several other open-source backends are available as well.
For Good Measure
In this post we’ve briefly covered what StatsD is, how it works, and how to use it. Along the way, we hope that we have also conveyed why to use it. StatsD has many favorable attributes: it is lightweight and developer-friendly, its clients and servers have been implemented in many popular languages, and it works with several different graphing and monitoring backends. But the ultimate benefit of StatsD, of course, is what it provides you — metrics, real-time visibility into your applications and infrastructure, and, as Lord Kelvin put it, knowledge.