A Tactical Field Guide to Optimizing APM Bills
In early 2022, Reddit user “engineerL” described an incident in which a team in his company managed to accidentally spend $100,000 on Azure Log Analytics over the course of only a few days.
This specific incident is most likely an unfortunate anomaly in that organization’s logging bill. However, we at Lightrun have seen firsthand, among customers, design partners, and prospects, that it is not unusual for organizations that process large volumes of data or create systems with a high volume of transactions to spend over $1 million a year on logging ingestion, storage and analysis alone.
Logging is quickly becoming a very expensive practice in modern software organizations.
Before we dive into effective ways to reduce overall logging volumes and, by proxy, the monthly application performance monitoring (APM) bill from vendors such as Datadog, New Relic and Elastic, it’s best to briefly look at how we, as an industry, ended up in a world of ultra-expensive observability.
The Cost of Logging
Why is logging getting so expensive?
The bottom line is that log volumes are spiraling due to a combination of rising complexity and a lack of alternative options.
Organizations that create highly trafficked, data-intensive and/or high-transaction applications are generating vast volumes of logging data each day to gain sufficient visibility into their complex systems.
Regardless of the actual technical issue encountered, the answer is almost always the same: add more logs, explore, rinse and repeat.
Observability vendors, who are more than aware of that fact, have figured out smart, conniving ways to add a charge to every single activity involved with the discipline of log aggregation and analysis. Today, all the following (each of which is required to reliably get value out of logging information at scale) cost money:
- Log transmission/egress
- Log ingestion
- Log processing/indexing
- Log storage
- Log querying/analysis/scanning
- Vendor subscription/licensing costs (for managed solutions)
- Infrastructure costs (for self-hosted solutions)
- Personnel and training
Some of these are explicit and obvious (software licensing, for example), but others are less well-known and sometimes flatly ignored until the bill arrives (such as cloud egress costs for telemetry data). Taken together, two main themes emerge behind the growing cost:
1. Hidden Costs
As mentioned above, companies often don’t realize everything that they are paying for when they purchase observability software. One major underlying cost for self-hosted observability systems, for example, is the infrastructure required to sustain such a system at high logging volumes: massive compute and storage resources and a significant bandwidth allowance, to mention a few items on the shopping list.
In addition to the cost of observability systems, we should take into account the cost of logging itself.
Note that by adding a significant number of logs to our applications over time, we can increase I/O and CPU usage, decrease throughput and incur additional charges in the form of compute and network usage, all for normal, run-of-the-mill application operation.
This over-reliance on traditional, static logging taken together with increasing log volumes can, in practice, significantly affect the number of instances required for handling the same workload, which, naturally, translates to a higher overall cloud bill.
These implicit, hidden costs can creep up on organizations that don’t monitor their logging situation closely.
2. Redundant Logging
In practice, 99% of logs aren’t actually used. They are only created because developers tend to take a “better safe than sorry” approach and over-log applications to account for any possible eventuality.
And these logs, which aren’t even being used, are paid for at several points, including transmission, ingestion, processing and storage.
The result of the two points above is that businesses are paying many times over for logs they don’t need and aren’t using.
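The compounding effect described above can be made concrete with a small estimate. The sketch below is only illustrative: every rate in `StageRates` is a made-up placeholder rather than a real vendor price, and the function names are our own. Substitute figures from your own invoices before drawing conclusions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageRates:
    """Hypothetical per-GB charges in dollars; replace with your own figures."""
    egress: float = 0.10      # shipping logs out of the cloud network
    ingestion: float = 0.50   # vendor charge to accept the data
    indexing: float = 1.00    # processing so that logs become searchable
    storage: float = 0.03     # per GB per month of retention


def monthly_log_cost(gb_per_day: float, rates: StageRates,
                     retention_months: float = 1.0, days: int = 30) -> float:
    """Estimate the monthly bill for a given daily log volume."""
    gb_per_month = gb_per_day * days
    # The same gigabyte is charged once at every stage it passes through.
    per_gb = (rates.egress + rates.ingestion + rates.indexing
              + rates.storage * retention_months)
    return gb_per_month * per_gb


def savings_from_reduction(gb_per_day: float, unused_fraction: float,
                           rates: StageRates) -> float:
    """Dollars saved per month if the unused share of logs is never emitted."""
    before = monthly_log_cost(gb_per_day, rates)
    after = monthly_log_cost(gb_per_day * (1.0 - unused_fraction), rates)
    return before - after
```

Because every stage is billed per gigabyte, a log that is never emitted saves money at each stage at once, which is why reducing volume at the source tends to pay off more than tuning any single stage.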
The next section deals with how these organizations can reduce their log volumes and, with them, their overall APM bill.
As a side note, we’ve created a detailed report around optimizing logging costs, including exact figures, the potential return on investment and more practical suggestions for log volume reduction. You can check it out here.
Optimizing APM Bills — The Guide
There are three major routes for reducing a logging bill:
- Replacing reproducible logs with dynamic logs
- Performing a log audit
- Optimizing logging
1. Replacing Reproducible Logs with Dynamic Logs
Let’s first define two key terms that will be used in the following paragraphs:
- Reproducible logs are any logs that can be easily reproduced by developers on local, QA or staging environments.
- Dynamic logs are any logs that can be added to an application during its execution without having to modify source code, run through a whole development cycle or restart/stop the running application. They are ephemeral, granular, conditional, can be instrumented in real time and can be consumed right from the integrated development environment (IDE). We at Lightrun have a platform that enables developers to add such logs to their applications in real time and on demand.
The best way to start reducing overall log volume is to replace so-called reproducible logs with dynamic logs.
This means that only critical logs (those required for compliance reasons or for digital forensics) and non-reproducible logs (those that would prove too difficult to reproduce on dev or staging environments) need to be added to the codebase in development.
All reproducible logs can be removed from the code. If and when they are required again, add them as dynamic logs in real time, saving both precious developer time and observability costs.
Below we’ve collected a few examples of logs that should — and should not — be replaced with dynamic logging:
Examples of Reproducible Logs That Can Be Replaced with Dynamic Logging
Duplicate log messages — Logging instructions that are the result of adding more and more log messages to the same piece of code without noticing what is already logged. They often occur in the form of:
- Methods with multiple logging instructions, which are logging values of the same variables in different forms
- Methods with multiple logging instructions in consecutive lines
State-dependent logs — Logging instructions that are only interesting when the value is not what one would expect it to be, for example when it equals null or zero. They often occur in the form of:
- Logs printing the return value of a method call
- Logs printing the values of input parameters
- Logs that are printing the return value of an API call, database query or information read from a file
- Logs that are often emitted as zero or null values
“Marker Logs” — Logging instructions in the form of static messages that only indicate that the execution has reached a certain point of the code. They often occur as:
- Logs with a static message
- Logs placed in the beginning or end of a method
- Logs placed in the beginning or end of a scope (if blocks, for example)
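To make the three categories above concrete, here is a small, entirely hypothetical function (the `orders` logger, `load_discount` and its arguments are invented for illustration) in which every logging instruction falls into one of them.

```python
import logging

logger = logging.getLogger("orders")  # hypothetical module logger


def load_discount(order_id: str, discounts: dict) -> float:
    logger.info("entering load_discount")        # marker log
    logger.info("order_id=%s", order_id)         # state-dependent: input parameter
    discount = discounts.get(order_id, 0.0)
    logger.info("discount=%s", discount)         # state-dependent: often 0.0
    logger.info("discount for %s is %s", order_id, discount)  # duplicate of the two above
    logger.info("leaving load_discount")         # marker log
    return discount
```

With the logger set to INFO, each call emits five records, none of which is needed while the function behaves normally. All five can be reproduced on a development machine, so they are candidates for removal and for on-demand instrumentation if a problem ever appears.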
Examples of Non-Reproducible Logs That Should Not Be Replaced with Dynamic Logging
- Log messages that indicate unexpected situations: they should rarely be emitted and, when they are, they indicate an out-of-the-ordinary state.
- Logs related to compliance, security, product events and business intelligence.
2. Performing a Log Audit
A log audit is a great way to figure out how to massively reduce the amount of emitted logs in any application.
This type of audit includes a review of the entire system’s logging output, picking apart the specific portions that produce the most costly logs, then performing actions to either remove redundant logs or reduce the cost of each log.
The first step is identifying the “big offenders.” Most log ingestion platforms provide some consoles to identify the logs that take the most space or occur most frequently. However, these consoles process the logs after the logger has modified them, adding information that can make them hard to identify and associate with a specific line of code.
Depending on your preferred method of logging, it’s possible to create internal tools as a workaround for this problem. In our case, we built a simple tool — you can find it here — that works as an additional log handler for the JUL (Java Util Logging) framework and counts every log published.
The tool periodically prints out the number of log invocations. While it doesn’t estimate the size per log, it should give a good sense of which logs are printed in high volume and are therefore likely to take up the most space, which is a good place to start. Ideally, such tools also take into account the number of bytes in every log entry, including objects and MDC (Mapped Diagnostic Context) data, and other metrics that shed light on the size of the logs the application emits.
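The same idea can be sketched with Python’s standard `logging` module. This is not the JUL tool mentioned above, only a minimal analogue under our own assumptions: the class name `CallSiteCounter` and its `report` method are hypothetical, and the byte count covers the rendered message only.

```python
import logging
from collections import Counter


class CallSiteCounter(logging.Handler):
    """Counts records and message bytes per call site (file:line)."""

    def __init__(self) -> None:
        super().__init__()
        self.counts: Counter = Counter()
        self.sizes: Counter = Counter()

    def emit(self, record: logging.LogRecord) -> None:
        # The call site is known here, before any shipper rewrites the record.
        site = f"{record.pathname}:{record.lineno}"
        self.counts[site] += 1
        # Size of the rendered message only; what a vendor stores also
        # includes timestamps, metadata and any context fields.
        self.sizes[site] += len(record.getMessage().encode("utf-8"))

    def report(self, top: int = 10) -> list:
        """Return the `top` call sites by record count as (site, count, bytes)."""
        return [(site, n, self.sizes[site])
                for site, n in self.counts.most_common(top)]
```

Attached with `logging.getLogger().addHandler(CallSiteCounter())`, the handler sees every record that reaches the root logger, and calling `report()` on a timer gives a ranked list of the lines of code that log the most.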
We would also recommend spending time with your SRE/support engineers going over the useful/important logs. These individuals usually have a unique perspective on the logs actually used in practice and would have opinions on which logs must be preserved and which ones can be easily removed.
3. Optimizing Logging
Despite the comments above, it’s not always easy to tell which logs are the most important and which aren’t. As a developer, it’s often difficult to put a value on a given log until we’ve run the code and seen how many times it was emitted with real traffic and real transactions.
One way to approach this problem is to simply follow a “top-down” approach and focus on general best practices to optimize each individual log message to its leanest form (while still maintaining its usefulness).
These optimizations can be roughly divided into two types: “global” optimizations and “big offender” optimizations.
Global optimizations are those that have a wide-reaching effect. Even a small change here can significantly affect overall logging volumes and, by proxy, logging costs:
- Remove objects from MDC:
  - Use MDC for the bare minimum, since it can affect every log coming afterward.
- Reduce the size of MDC objects:
  - Don’t log an object, log an ID.
- Make object IDs shorter:
  - If you use universally unique identifiers, try using a shortened version such as
- Review branching logic:
  - Review log instructions placed in parts of the code that contain many conditional branches and decide whether all of those logs should exist.
  - Remove log instructions placed in parts of the code that depend on feature flags or various configuration options.
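The effect of the “log an ID, not an object” and “shorter IDs” advice can be seen by comparing message lengths. The sketch below is illustrative only: the `User` class is hypothetical, MDC itself is a Java logging concept that is not modeled here, and the base64 encoding in `short_id` is one possible shortening chosen by us, not necessarily the one the list above had in mind.

```python
import base64
import uuid
from dataclasses import dataclass, field


@dataclass
class User:
    """Hypothetical domain object with more fields than a log line needs."""
    id: uuid.UUID
    name: str
    email: str
    roles: list = field(default_factory=list)


def short_id(value: uuid.UUID) -> str:
    """Encode the 16 UUID bytes as 22 URL-safe characters (lossless)."""
    return base64.urlsafe_b64encode(value.bytes).rstrip(b"=").decode("ascii")


def from_short_id(text: str) -> uuid.UUID:
    """Reverse short_id so the full UUID can be recovered when searching."""
    return uuid.UUID(bytes=base64.urlsafe_b64decode(text + "=="))


user = User(uuid.UUID("12345678-1234-5678-1234-567812345678"),
            "Ada Lovelace", "ada@example.com", ["admin", "billing"])

verbose = f"updated user {user}"               # logs the whole object
by_id = f"updated user {user.id}"              # logs the 36-character UUID
compact = f"updated user {short_id(user.id)}"  # logs 22 characters instead
```

The saving per message is small, but a value placed in a shared logging context is repeated on every subsequent line, so the difference is multiplied by the total number of records.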
Big Offender Optimizations
Big offender logs are the logs that are most often the reason for “log explosion.” In our experience, these make up the majority of logs emitted and can be easily removed. The list below details some ways to deal with these types of logs:
- Remove logs that aren’t absolutely necessary to understand the state of the application — logs that are easily deducible from other logs.
- Review the log file and see if the information within the log was already printed higher up in the stack or even in the same method.
- Convert info logs to debug logs when possible, resulting in fewer logs emitted on a regular basis.
- Shorten logs to the absolute minimum. Don’t waste characters on redundant prose.
- Don’t log every object in scope, just those that are absolutely necessary.
- Focus on specific modules/services that are serving more traffic and reduce the logging volumes there.
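Two of the items above, converting info logs to debug logs and shortening messages, can be sketched as follows. The `checkout` logger and the `summarize` and `process` functions are hypothetical; the example assumes production runs with the logger threshold set to INFO.

```python
import logging

logger = logging.getLogger("checkout")  # hypothetical service logger
logger.setLevel(logging.INFO)           # assumed production threshold


def summarize(cart: list) -> str:
    """Stand-in for an expensive rendering of internal state."""
    return ", ".join(f"{item}x{qty}" for item, qty in cart)


def process(cart: list) -> int:
    # Formerly an info log: at DEBUG level this record is dropped in production.
    if logger.isEnabledFor(logging.DEBUG):
        # The guard also skips the cost of building the message arguments.
        logger.debug("cart contents: %s", summarize(cart))
    total = sum(qty for _, qty in cart)
    logger.info("processed %d items", total)  # short, still emitted at INFO
    return total
```

Lowering the threshold to DEBUG on a single instance brings the detailed record back when it is needed, without paying for it on every request in normal operation.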
By auditing your logs, applying the Pareto principle to decide which logs are most suitable for removal and then replacing the biggest offenders with dynamic logs instead, you can significantly reduce the volume of your logs and, thus, the associated APM bills.
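Applying the Pareto principle to audit results can be as simple as ranking call sites by volume and cutting the list off once a target share is covered. The function below is a minimal sketch; its name, the shape of the `counts` mapping and the default 80% threshold are our assumptions.

```python
def biggest_offenders(counts: dict, share: float = 0.8) -> list:
    """Return the smallest set of call sites covering `share` of all records."""
    total = sum(counts.values())
    selected, covered = [], 0
    # Walk call sites from most to least frequent until the target is reached.
    for site, n in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
        if total == 0 or covered >= share * total:
            break
        selected.append(site)
        covered += n
    return selected
```

In practice the resulting list is usually short, which makes it a natural starting point for deciding which logs to remove, demote or replace.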
If you’d like to learn more about how to produce the same results in practice, read our report on using dynamic logging to reduce log volumes by an average of 35%!