Optimize Your Observability Spending in 5 Steps
Most observability tools rely on storage as the key metric for pricing, so if you want to reduce the cost of your observability data, you can start by focusing on the volume of data that you’re sending to storage. Let’s look at how you can optimize your total observability spending with Mezmo Telemetry Pipelines by following five steps that can be applied to any log or metric data.
The Five Steps of Observability Data Optimization
- Filter out duplicate and extraneous events that don’t contribute value to your observability results.
- Route a full-fidelity copy of the remaining telemetry data to a long-term retention solution for future auditing or investigation, instead of to your observability tools.
- Trim and transform events by removing empty values, dropping unnecessary labels and transforming inefficient data formats into a format specific to your observability destinations.
- Merge events by grouping messages and combining their fields to retain unique data while removing repetitive data.
- Condense events into metrics to reduce the number of hours and resources dedicated to supporting backend tools, and convert unstructured data to structured data before indexing to make searches faster and more efficient.
From Steps to Practice
Some of these steps may seem obvious, but they are not easy to put into practice.
You can’t use an observability agent on its own to put these steps into practice. Agents are neutral forwarders: they simply ship information downstream to be processed by observability analysis tools.
You could implement some of these steps using open source tools and in-house development, but this comes with increased operational cost and complexity, requiring your team to build expertise that is not core to your business.
Overall, the main challenge with putting these steps into practice is that the available tools are either agents, which simply send information, or observability tools, which simply receive it. You need to process telemetry data in stream, transforming and routing it as it passes from agent to tool, so you can optimize and shape it for your downstream requirements.
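The idea of in-stream processing can be illustrated with a small generator pipeline. This is a minimal sketch, not Mezmo's API: the function names, field names and destination labels are all hypothetical, and a real pipeline would operate on a network stream rather than an in-memory list.

```python
# Minimal sketch of in-stream processing between agent and tool.
# All names here are illustrative, not Mezmo's actual API.

def transform(events):
    """Drop empty fields from each event as it passes through."""
    for e in events:
        yield {k: v for k, v in e.items() if v not in (None, "")}

def route(events, is_noise):
    """Tag noisy events for cold storage, the rest for observability tools."""
    for e in events:
        destination = "cold-storage" if is_noise(e) else "observability"
        yield destination, e

incoming = [{"status": 200, "path": "/health", "user": ""},
            {"status": 500, "path": "/checkout", "user": "u42"}]

routed = list(route(transform(incoming), lambda e: e["status"] == 200))
```

Because each stage is a generator, events flow through one at a time; nothing is buffered or indexed before it reaches its destination, which is the property that distinguishes a pipeline from an agent or an analysis backend.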
Our Mezmo Telemetry Pipelines were conceived with the goal of helping organizations get better control of their data in stream. This approach enables you to control the flow between your data sources and your observability tools, and manage in detail the optimization of your data before it arrives downstream.
Putting the Five Steps into Practice
1: Filter
Noisy logs tend to make up a large percentage of overall data volume. Noise includes events such as positive confirmation signals, recurring process notifications and repeated status values. For example, web logs often contain an abundance of status=200 messages that deduplication processing can remove.
The key to filtering is being able to compare fields so you can determine whether a log is unique. If a log is identical to another outside of its timestamp, you should consider dropping it from the stream. If you need to preserve the count of non-unique logs, you can include that count in a representative log message.
You can use our Route and Dedupe processors with a configuration that ignores the timestamp as a way to handle noisy logs. Alternatively, you can choose which fields to match on explicitly if you need more criteria for duplication testing.
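The deduplication logic described above can be sketched as follows. This is an illustration of the technique, assuming events are dicts; Mezmo's Dedupe processor exposes this as configuration rather than code.

```python
# Sketch of deduplication that ignores the timestamp and records the
# number of suppressed duplicates on a representative event.

def dedupe(events, ignore=("timestamp",)):
    """Keep one representative per unique field combination."""
    representatives = {}
    for e in events:
        # Build a comparison key from every field except the ignored ones.
        key = tuple(sorted((k, v) for k, v in e.items() if k not in ignore))
        if key in representatives:
            representatives[key]["duplicate_count"] += 1
        else:
            representatives[key] = dict(e, duplicate_count=1)
    return list(representatives.values())

logs = [
    {"timestamp": 1, "status": 200, "path": "/health"},
    {"timestamp": 2, "status": 200, "path": "/health"},
    {"timestamp": 3, "status": 500, "path": "/checkout"},
]
deduped = dedupe(logs)
```

Here three web log events reduce to two, with the repeated status=200 event carrying a count of how many duplicates it represents.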
2: Route
You may need to keep a portion of your logs in long-term retention with full-fidelity instances of telemetry data for compliance auditing or site reliability engineering (SRE) troubleshooting. These logs can be routed to cold storage in the proper structure and rehydrated as necessary without needing to be sent to your observability tools for operational monitoring.
Consider these criteria for routing logs to long-term storage:
- Audit logs or logs with user actions.
- Logs that confirm a process starting or ending.
- Nonessential operational metrics that may be needed for longer-term performance analysis but not for short-term operational monitoring.
Mezmo’s Route Processor intelligently routes data to cost-effective storage solutions, such as AWS S3. You can also configure the Route Processor to apply filtering options. For an example, check out our Pipeline Recipe Drop, Encrypt and Route Data to Storage at docs.mezmo.com.
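The routing criteria above can be expressed as a simple predicate. This is a sketch under assumed field names ("action", "event", "category" are hypothetical schema choices, not a standard); in Mezmo the equivalent logic lives in the Route Processor's configuration.

```python
# Sketch of a routing predicate for the cold-storage criteria above.
# Field names and values are assumptions about a generic log schema.

AUDIT_ACTIONS = {"login", "logout", "permission_change"}

def goes_to_cold_storage(event):
    """True if the event should be archived rather than analyzed live."""
    if event.get("action") in AUDIT_ACTIONS:               # audit / user actions
        return True
    if event.get("event") in ("process_start", "process_end"):
        return True
    return event.get("category") == "long_term_metric"     # nonessential metrics

events = [
    {"action": "login", "user": "u1"},
    {"event": "process_start", "pid": 42},
    {"status": 500, "path": "/checkout"},
]
cold = [e for e in events if goes_to_cold_storage(e)]
hot = [e for e in events if not goes_to_cold_storage(e)]
```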
3: Trim and Transform
A common situation is to have events where individual log lines have been packed with information because a lot of data has been dumped into one line within the application code. These events can include stack traces and detailed data objects added for the purpose of debugging.
Large events typically have excess data in the message itself. Often this is unparsed data, though it is likely semi-structured. With Mezmo’s Parse Processor, you can use parsing techniques like regex or grok to extract the important elements from the larger body and then remove the excess data. Stack traces, for example, can have the majority of the trace itself stripped out to retain only the originating source location.
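As one concrete illustration of the stack trace example, the sketch below trims a trace down to its first line plus the originating frame. The regex here matches Python-style frames and is an assumption; the pattern would need to change per language and log source, and Mezmo's Parse Processor handles this via configured regex or grok patterns rather than custom code.

```python
# Sketch of trimming a stack trace to its originating frame with a regex.
# The frame format assumed here is Python-style ('File "...", line N').
import re

FRAME = re.compile(r'File "(?P<file>[^"]+)", line (?P<line>\d+)')

def trim_stack_trace(message):
    """Keep the first line plus the last (originating) frame location."""
    frames = FRAME.findall(message)
    first_line = message.splitlines()[0]
    if not frames:
        return first_line
    file, line = frames[-1]
    return f"{first_line} (origin: {file}:{line})"

trace = (
    "ValueError: bad input\n"
    '  File "app/api.py", line 10, in handler\n'
    '  File "app/parse.py", line 88, in parse\n'
)
trimmed = trim_stack_trace(trace)
```

The trimmed event retains what an engineer needs to locate the failure while dropping the bulk of the trace body.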
4: Merge
Firewall logs from systems like Palo Alto and AWS Firewalls generate a high volume of log events. These logs often share a number of non-unique fields, but you would not want to drop them outright, given the security importance of the information they contain.
With Mezmo’s Reduce Processor, you can merge multiple log input events into a single log event based on specified criteria. For example, threat and traffic logs from the firewall share 70% of the same fields and are tied to the same events by a common identifier, so they can be merged while each log’s unique fields are retained.
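A minimal sketch of this merge step, assuming firewall events are dicts grouped by source and destination IP (the field names are assumptions about a generic firewall schema, not Palo Alto's or AWS's actual format): shared values are kept once, while differing values are collected into lists so nothing unique is lost.

```python
# Sketch of merging firewall events that share source and destination IPs.
# Fields with one unique value are kept as-is; others become lists.

def merge_events(events, group_fields=("src_ip", "dst_ip")):
    groups = {}
    for e in events:
        key = tuple(e.get(f) for f in group_fields)
        groups.setdefault(key, []).append(e)
    merged = []
    for group in groups.values():
        out = {}
        for field in {k for e in group for k in e}:
            values = [e[field] for e in group if field in e]
            unique = sorted(set(map(str, values)))
            out[field] = values[0] if len(unique) == 1 else values
        out["merged_count"] = len(group)
        merged.append(out)
    return merged

events = [
    {"src_ip": "10.0.0.1", "dst_ip": "10.0.0.9", "type": "traffic", "bytes": 100},
    {"src_ip": "10.0.0.1", "dst_ip": "10.0.0.9", "type": "threat", "bytes": 100},
]
merged = merge_events(events)
```

Two events collapse into one: the repeated IP and byte fields appear once, while the differing "type" values are preserved as a list.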
5: Condense Events to Metrics
Beyond reducing the volume of logs based on storage requirements, you should also consider how you can optimize your operational insights to improve performance. This requires careful evaluation of the key performance indicators (KPIs) your teams are using to manage your infrastructure.
For example, you can use the Mezmo Event to Metric Processor to convert logs to metrics and visualize them on an operational dashboard, providing valuable business insights while also reducing the friction SRE teams and others face when accessing the information they need.
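The conversion can be sketched as a simple aggregation: many log events over a window condense into a handful of metric points. The metric name and tag scheme below are hypothetical, and as noted later in this paper, the tags you choose deserve care, since high-cardinality tags can undo the volume savings.

```python
# Sketch of condensing events into metrics: count log events per status
# over a window and emit one metric point per status instead of N logs.
from collections import Counter

def events_to_metric(events, label="status"):
    """Return metric points shaped as {"name", "value", "tags"}."""
    counts = Counter(e.get(label) for e in events)
    return [{"name": "http_requests_total",   # hypothetical metric name
             "value": n,
             "tags": {label: str(v)}}
            for v, n in sorted(counts.items())]

window = [{"status": 200}, {"status": 200}, {"status": 500}]
metrics = events_to_metric(window)
```

Three log events become two metric points; at production volumes, millions of informational logs can condense to a few counters per window.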
Research and Findings
To test these techniques and substantiate our data reduction claims, we undertook a research project with our customer engineering and product teams.
Data was collected from internal Mezmo sources where available to make it as close to representative of real-world data as possible. Where internal data was unavailable, samples were sourced from Kaggle.com and other open locations, such as GitHub.
Data was then groomed via scripting, as needed, and flattened for loading into Snowflake. Each log schema was parsed and given its own table for storage and comparison.
In parallel, Telemetry Pipelines were created in a production environment with a standard account tied to the individual source types. Data was injected into each pipeline for each sample through an HTTP source.
Each pipeline attempted to follow the Snowflake queries, though variations in the technologies required some alterations.
Data samples sent into the pipeline were forwarded to HTTP destinations so byte counts could be compared from input to output.
Due to the way pipelines and network layer traffic work, this naturally introduces variation versus the Snowflake analysis, so the results were not expected to match perfectly. However, these results more closely resemble real-world cases, because network layer translation would always be a part of any functioning log/metric system.
The net findings are that following these steps can reduce the volume of telemetry data by 50% or more without degrading your observability results, and that this held across the many data sources we tested. You can see the full results for yourself in our Telemetry Data Reduction Score Card.
- Using the filter technique and dropping redundant events with deduplication criteria resulted in a 62% reduction from standard web logs, such as Apache and nginx, by matching based on the IP, URL and request type.
- Using the route technique, we were able to separate more than 61% of Kubernetes logs by routing them to cold storage based on the sync and IP table processes of the logs.
- Using the trim and transform technique, we were able to reduce Kafka logs by 50% by extracting common message data, including process status updates, topic creation and messages from the controller. Note that we still kept information fidelity in case it was needed for troubleshooting.
- Using the merge technique, we were able to reduce firewall log volume by 94% by removing unnecessary fields and grouping events based on source and destination IPs.
- Converting logs to metrics can result in over 90% reduction in total volume for all informational logs, but the process must be carefully tuned to avoid the risk of losing potentially valuable data while avoiding an explosion of tag cardinality. Our sales engineering team can provide more information based on your data sources and observability needs.
By following the five steps described in this paper in the design of your telemetry pipeline, you can realize significant data optimization to reduce the cost of your observability data.