3 Challenges to Monitoring StatsD and How to Tackle Them
StatsD is a unifying protocol and set of tools for collecting application metrics and gaining visibility into application performance. Etsy created StatsD in 2011 as a protocol for emitting application metrics. Soon after, the StatsD server was developed as a tool for receiving StatsD line protocol metrics and aggregating them. While the StatsD ecosystem has no official backends, Graphite became the most commonly used. StatsD quickly grew in popularity and today is a critical component of many monitoring infrastructures.
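For context, the line protocol itself is deliberately minimal: plain-text datagrams, typically sent over UDP to port 8125. A rough sketch in Python (the metric names and local address are illustrative, not from any particular application):

```python
import socket

# StatsD line protocol: <metric.name>:<value>|<type>[|@<sample rate>]
# "c" = counter, "ms" = timer, "g" = gauge
lines = [
    b"checkout.requests:1|c",       # increment a counter
    b"checkout.latency:320|ms",     # record a timer value in milliseconds
    b"checkout.queue_depth:42|g",   # set a gauge
    b"checkout.requests:1|c|@0.1",  # counter sampled at a 10% rate
]

# Fire-and-forget UDP: the application never blocks on the monitoring path.
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
for line in lines:
    sock.sendto(line, ("127.0.0.1", 8125))  # default StatsD UDP port
sock.close()
```

The fire-and-forget UDP transport is a big part of why StatsD spread so widely: instrumenting an application costs almost nothing at the call site.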
Despite their age, legacy StatsD pipelines remain well suited for application monitoring, so long as you can keep up with the volume and submission frequency and have a good place to store the data long term. This is most realistic for smaller enterprises just beginning to analyze telemetry from their applications. However, as organizations expand their applications and bring in new application teams, their StatsD metric load quickly increases. As this happens, their StatsD monitoring solutions inevitably become too fragile to handle the breadth of StatsD metrics their applications now emit, presenting challenges that result in inaccuracies, performance issues and rising costs.
Many organizations are exploring alternatives to their legacy StatsD pipelines to address some of these challenges. There are multiple solutions from which to choose, from open source to managed offerings. What’s right for you depends on which of these challenges is most affecting your organization and the particular monitoring objectives your organization is working toward. The following is a list of these challenges, the pitfalls and the type of solutions you might want to consider for each. And even if you’re not experiencing significant challenges yet, these will provide insights into how to improve the scale and effectiveness of your StatsD monitoring.
1. Pre-Aggregations Hinder Flexibility when Calculating SLOs
Challenge: StatsD has built-in aggregation functions for timers that are performed by the StatsD daemon, including count, min, max, median, standard deviation, sum, sum of squares, percentiles (e.g., p90) and more. But most StatsD servers only offer static aggregations, which you have to configure upfront. So, for example, if you want the 97th percentile of metric values, you have to know in advance that you'll need the 97th percentile and configure it from the start; otherwise, the data simply won't exist when someone asks for it.
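To make the constraint concrete, here is a minimal Python sketch of static pre-aggregation. The percentile list plays the role of the statsd server's percentThreshold setting; the function names are illustrative. Once a flush window closes, only the configured aggregates survive:

```python
# Sketch of static pre-aggregation: percentiles must be configured up front.
CONFIGURED_PERCENTILES = [90]  # analogous to statsd's percentThreshold

def flush(timer_values, percentiles=CONFIGURED_PERCENTILES):
    """Aggregate one flush window; the raw values are then discarded."""
    values = sorted(timer_values)
    aggregates = {"count": len(values), "min": values[0], "max": values[-1]}
    for p in percentiles:
        # nearest-rank percentile over the sorted window
        idx = max(0, int(round(p / 100 * len(values))) - 1)
        aggregates[f"p{p}"] = values[idx]
    return aggregates

window = flush(range(1, 101))  # one window of 100 timer samples
window["p90"]      # available, because it was configured
window.get("p97")  # None: p97 was never configured, and the raw data is gone
```

The last line is the whole problem: once the window flushes, a percentile that wasn't configured can never be reconstructed.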
Pitfalls: These needs are hard to predict in advance, which ultimately prevents teams from dynamically analyzing latencies or calculating service-level objectives (SLOs) on demand. A manager may want to see a p85 or p80, but the closest thing available may be a p90. In addition, all teams must use the same SLOs, because they're forced to share the same pre-calculated aggregations.
Solution: If your organization is looking to implement SRE/DevOps principles like SLOs and “measure everything,” then use log linear histograms for StatsD aggregation. Histograms allow you to efficiently and cost-effectively store all raw data, so you can perform StatsD aggregations and build percentiles on the fly after ingestion. Because the histogram contains all the data, no pre-configuration is required. This flexibility empowers your site reliability engineering (SRE) teams to dynamically set and measure their own SLOs for existing and future use cases.
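As a rough illustration of the idea (a simplified sketch, not any vendor's actual implementation), a log linear histogram bins values by their first two significant decimal digits times a power of ten. Every sample is retained to within one bin's width, so any percentile can be derived after ingestion:

```python
import math
from collections import Counter

def bucket(value):
    """Lower bound of the log linear bin for a positive value: two
    significant decimal digits times a power of ten,
    e.g. 321 -> 320, 95 -> 95, 1050 -> 1000."""
    exp = math.floor(math.log10(value))
    mant = math.floor(value * 10 / 10 ** exp)  # two significant digits, 10..99
    return mant * 10 ** (exp - 1)

def percentile(hist, p):
    """Derive any percentile after ingestion; no pre-configured list needed."""
    threshold = p / 100 * sum(hist.values())
    running = 0
    for b in sorted(hist):
        running += hist[b]
        if running >= threshold:
            return b

# 1,000 latency samples collapse into ~91 bins, yet p50, p85, p97, p99.9 ...
# are all answerable on demand from the same histogram.
hist = Counter(bucket(ms) for ms in range(100, 1100))
percentile(hist, 97)  # accurate to within one bin's width
```

The key property is that the bin layout is fixed by the scheme itself, not by configuration, which is why no upfront percentile list is needed.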
2. Increase in Scale Results in Performance Issues and High Operational Overhead
Challenge: A lot has changed since 2011. Organizations are embracing Kubernetes, microservices and stateless applications, which means they're emitting significantly more StatsD metrics. StatsD server aggregations also introduce challenges at scale, including the precalculation of a large number of aggregates, potentially in the millions, that are never used. In some cases, more than 20 aggregated metrics are produced for a single application timer, so what could have been one raw metric collected at face value becomes 10 or 20 individual metrics for every one you want to collect. This consumes considerable compute power, and all of this data must be flushed to a backend.
On top of this, you have to manage multiple independent instances of the StatsD server, because the same metric must keep going to the same server for its aggregations to be correct. You also have to manage relays that duplicate traffic to multiple servers and backends for redundancy. As the cardinality of metrics increases, many backends simply can't scale as required, and the operational burden of managing these StatsD pipelines becomes significant.
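The reason for that per-metric pinning is deterministic routing: a relay hashes each metric name and always forwards it to the same shard, so one server sees all of a metric's samples within a flush window. A naive Python sketch (the shard addresses are hypothetical) shows both the mechanism and why it is operationally brittle:

```python
import hashlib

SHARDS = ["statsd-1:8125", "statsd-2:8125", "statsd-3:8125"]  # hypothetical pool

def route(metric_name, shards=SHARDS):
    """Pin a metric name to one StatsD server so its window aggregations
    (counts, percentiles) are computed over all of its samples."""
    digest = hashlib.md5(metric_name.encode()).digest()
    return shards[int.from_bytes(digest, "big") % len(shards)]

# The same name always lands on the same shard...
route("checkout.latency") == route("checkout.latency")  # True
# ...but naive modulo routing reshuffles most metrics whenever the pool
# grows or shrinks, which is why production relays use consistent hashing,
# and why resizing a sharded StatsD tier is so painful in practice.
```

Every resize of the pool therefore risks splitting a metric's samples across servers mid-window, corrupting its aggregates, which is exactly the kind of operational fragility described above.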
Pitfalls: Inability to scale as needed inevitably leads to performance issues, lack of visibility and increased time to troubleshoot. Increasingly complex architectures result in more network congestion, more resources and higher costs.
Solution: If your company is investing in Kubernetes and microservices, growing rapidly or emitting a significant volume of StatsD metrics, then you need to invest in a more modern backend database, one that can easily handle the volume of metrics emitted by today's applications. It should also automate redundancy to remove that burden from your team. You should be able to scale as needed without sacrificing performance and confidently deliver great user experiences.
While not required to ensure scale and performance, log linear histograms can also benefit you here. Because histograms can compress and store all source data for years at low cost, they eliminate the need for multiple StatsD servers performing aggregations. All data is compressed into a single histogram and sent to your backend in one transaction rather than many. Overall, you significantly reduce the number of metrics you're ingesting and storing compared to pre-aggregations (as much as 10 to 20 times less), as well as the network bandwidth and associated costs.
3. No Data Correlation = Longer MTTR
Challenge: The original StatsD line protocol has no support for tags, and many server/backend combinations don't support tagging extensions. Modern IT environments are dynamic and ephemeral, making tagging essential to monitoring services and infrastructure.
Pitfalls: Without tagging, monitoring today’s complex IT infrastructures becomes ineffective. You lack the ability to slice and dice metrics for visualization and alerting, identify and resolve issues quickly or correlate insights across business units.
Solution: If you’re looking to advance the sophistication of your monitoring by gaining deeper insights and correlating them in a way that empowers monitoring to drive more business value, then you need a monitoring solution that enables Metrics 2.0 tagging of StatsD telemetry. Metrics 2.0 requires that metrics be tagged with associated metadata, or context about the metric being collected, such as the application version. This additional context makes it easier to analyze metrics across various dimensions and drastically improves the insight discovery process among millions of unique metrics. You can search on these tags and identify specific services for deeper analysis. Tagging allows you to correlate and alert on your data, so you can more quickly identify the cause of issues and glean more overall intelligence about your operations and performance.
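Tag encodings vary by implementation: one widely adopted extension (the DogStatsD style) appends tags after a "|#" delimiter, while others embed them in the metric name itself. A small illustrative helper, assuming the "|#" style:

```python
def tagged_line(name, value, mtype, **tags):
    """Render a StatsD metric line with DogStatsD-style tags appended.
    Tags are sorted so the same metadata always yields the same line."""
    line = f"{name}:{value}|{mtype}"
    if tags:
        line += "|#" + ",".join(f"{k}:{v}" for k, v in sorted(tags.items()))
    return line

tagged_line("checkout.latency", 320, "ms", service="payments", version="2.1.0")
# -> "checkout.latency:320|ms|#service:payments,version:2.1.0"
```

With the service and version carried on every line, a backend that understands tags can slice the same latency metric by either dimension without any naming gymnastics.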
Improve the Ease, Scale and Flexibility of Your StatsD Monitoring
Many StatsD pipelines are not equipped to handle the volume of data emitted by today’s applications, causing inaccuracies and limitations when monitoring StatsD metrics. Depending on your business, your monitoring goals and the StatsD challenges affecting your organization most, it may be time to evaluate other solutions so you can improve the ease and flexibility of your StatsD monitoring and get more value out of all the insightful data you’re generating.