
How Dynamic Logging Saves Strain on Developers and Your Wallet

Dynamic instrumentation in production systems can reduce complexity, increase productivity and reduce the spiraling costs of static logging.
Nov 11th, 2022 10:29am by

In a perfect world, developers would spend all their time writing creative, valuable code that contributes to meaningful software.

In practice, however, developers spend a lot of time on non-functional requirements. Chief among these is application instrumentation — those aspects of an application that monitor or measure its performance (most commonly logs, but also metrics and traces).

Application instrumentation is a “just-in-case” measure. It’s a safety precaution that engineering organizations create mandates around to ensure the underlying system is reliable in production. It is, by definition, not a customer-oriented value-add.

The current approach to instrumentation, which we will refer to as “static instrumentation,” inserts analysis code into the application only at compile time (in development), which has significant limitations:

  • Time-consuming — Developers can spend huge chunks of time writing instrumentation, waiting for the application to redeploy and sifting through the results for the information they need.
  • Reduces productivity — Each time a developer wants to add new instrumentation, they need to context switch out of their code and into their instrumentation tools, breaking their flow and distracting from their core task.
  • Costly — The cost of the tools required to ingest, manage and store all the instrumentation can quickly add up.

A new approach is emerging: dynamic instrumentation in production systems.

Dynamic instrumentation allows developers to add instrumentation “dynamically” at runtime — when the application is running — which has substantial potential to remedy many of the limits of static instrumentation.

This article will explore the limits of static instrumentation through the example of static logging, before exploring how dynamic instrumentation in production systems can reduce complexity, increase productivity and reduce the spiraling costs of static instrumentation.

The Limits of Static Instrumentation: A Look at Logging

Logging is a necessary evil.

If the code being logged is good, then writing those logs and redeploying the application will have proven a significant waste of time. Yet, if a developer neglects to log properly, any bugs that arise during the application’s execution will be significantly harder to fix.

As a result, many developers tend to over-log their applications to account for any possible occurrence. This increases the complexity of sifting through these logs to find the right needle in the haystack and carries with it a significant price tag in the form of application performance management (APM) licenses and other observability-related costs.

Why do developers log so much?

At first glance, the reason for logging so much is obvious: If an application does not contain enough logs, we will lack granularity when troubleshooting tough issues.

But that’s only part of the larger problem at play here: The logging workflow as a whole is broken. It’s important to note that traditional, static logging carries with it a number of key limitations:

  1. Logs can only be added during development — Since logs can only be added to the application as the application is being written, it’s up to the developer to decide upfront what data will be needed later on (namely, in test and production environments). Should the need arise to add more logs later on, a developer would need to go through a whole release cycle. This creates a tendency to go log-heavy in the first instance to ensure that there is logging in place for every possible contingency.
  2. Logs cannot be added, in advance, for every scenario — A wise system architect knows that one must balance system performance with log coverage. If the developer doesn’t log enough information, it would be impossible to reliably know what’s happening inside the system; it would be a black box. Adding a log to every line of code is also impractical since it can seriously affect system performance and result in telemetry costing more CPU cycles than the functionality itself.
  3. There is no way to know, in advance, the required granularity — Even with massive log coverage, it’s impossible to reliably account for all possible unknowns. When a problem inevitably occurs, the upfront “guesstimation” of the data required is seldom absolutely on point.

The three points above paint a grim picture for the state of logging in most modern applications: If you don’t log, you won’t know. This translates to developers consistently writing more logs than are probably needed, then adding more logs when the original logs are either not placed correctly or do not provide the depth required to mitigate the issue.

When the only weapon is to add more logs, log volumes can grow quickly and without the organization paying attention to them. Many companies — especially data-centric, transaction-heavy companies like e-commerce sites or financial institutions — are generating hundreds of terabytes of logging data each day in order to have sufficient observability of their systems.

The Limits of Static Instrumentation Are Only Getting Worse

As cloud native technologies increasingly become the default for new applications, the situation seems to only be getting worse. The components underlying the apps are getting more and more abstracted and complex, making it harder than ever to troubleshoot those apps when issues arise.

The solution, as the reader might have suspected by now, is to add more logs.

Out of these logs, 99.9% will never be looked at or analyzed. They exist only to fend off the gnawing sense of dread at having to endure a production incident without enough telemetry.

And, as the volume of logs spirals to cope with the growing complexity of applications, costs spiral with it.

The Cost of Logging

The cost of logging comes, primarily, from the software we use to collect and analyze the application’s logs.

These pieces of software, often called APM and centralized logging solutions, charge by volume. Specifically, they charge for the ingestion, storage and processing of these logs. As the application’s emitted log volume keeps growing, so does the cost of using these platforms.

In addition, these are not only upfront, known costs. As the application grows in size, logging volumes grow in lockstep. If these costs were not accounted for initially (and they rarely are), organizations end up having to use pay-as-you-go billing structures to purchase coverage for the additional emitted logs, which inflate the eventual bill substantially.
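To make the volume-based pricing concrete, here is a rough sketch of how such a bill scales. The function name, the pricing model and every number in it are hypothetical; they are not taken from any vendor's price list, and real contracts usually add license fees and tiering on top.

```python
def monthly_logging_cost(gb_per_day: float,
                         ingest_price_per_gb: float,
                         storage_price_per_gb_month: float,
                         retention_days: int,
                         days_in_month: int = 30) -> float:
    """Estimate a monthly bill under a simple, hypothetical volume-based model."""
    ingested_gb = gb_per_day * days_in_month
    # In steady state, roughly `retention_days` worth of logs is stored at any time.
    stored_gb = gb_per_day * retention_days
    return (ingested_gb * ingest_price_per_gb
            + stored_gb * storage_price_per_gb_month)


# Hypothetical prices: 0.10 per GB ingested, 0.03 per GB stored per month.
baseline = monthly_logging_cost(1000, 0.10, 0.03, retention_days=30)
after_growth = monthly_logging_cost(2000, 0.10, 0.03, retention_days=30)
print(f"baseline: {baseline:.2f}, after volume doubles: {after_growth:.2f}")
```

Because every term is proportional to the emitted volume, doubling the logs doubles the bill in this model, which is why unplanned growth in log volume shows up so directly in cost.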

At the same time, the license fees (seats, hosts, etc.) for these software packages aren’t cheap either. These companies know that migrating between logging and observability vendors is tricky, and that an engineering organization invested in one platform is hard-pressed to move to another provider without good reason. This vendor lock-in also makes it genuinely hard to mitigate the rising cost.

(As an aside, if you’d like to get a sense of the actual economic cost associated with static logging as well as the cost reductions that come from dynamic instrumentation, you can review our study on the topic).

Enter: Dynamic Logging

In complex applications, it might be difficult to figure out which part of the application is generating the most logging, and therefore costing the most. In practice, it is challenging to optimize the application logs in any meaningful way, while still maintaining sufficient log coverage.

Developers find themselves stuck between a rock and a hard place: the unfavorable pricing model of the APMs and their need to log as much as needed to get a clear view during incidents.

Should the organization minimize logging — and, by proxy, the observability of the system — to keep costs down and velocity up? Or, in contrast, should developers attempt to cover every possible eventuality with more logs, incurring immense costs along the way?

It’s a tough call.

However, I’d like to suggest shifting from static instrumentation to dynamic instrumentation in production systems.

This means using dynamic logging.

Dynamic logs are those you can add to your application during the application’s execution, without having to modify your source code, run through a whole development cycle or even restart the application.

Using dynamic logging, any developer can write only the logs that they need, when they need them. But, more critically, it offers a way out of the uncomfortable dilemma that many developers find themselves in: having to choose between cost/speed and observability.

This allows developers to add new log statements into a running application at any point, without redeploying, restarting or stopping it.
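As a rough illustration of the idea, the sketch below attaches a log to a named location while the program keeps running. `LogPointRegistry`, `add` and `hit` are hypothetical names invented for this example, not Lightrun's API, and the explicit `registry.hit(...)` call is a simplification: production tools typically attach an agent to the running process so that no hook has to exist in the source code.

```python
import threading
from typing import Callable, Dict, List


class LogPointRegistry:
    """Hypothetical registry of log points that can change while the app runs."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._points: Dict[str, List[Callable[[dict], str]]] = {}
        self.emitted: List[str] = []  # stand-in for a real log sink

    def add(self, location: str, render: Callable[[dict], str]) -> None:
        """Attach a log to a named code location at runtime."""
        with self._lock:
            self._points.setdefault(location, []).append(render)

    def hit(self, location: str, local_vars: dict) -> None:
        """Called when execution reaches `location`; emits any attached logs."""
        with self._lock:
            renders = list(self._points.get(location, []))
        for render in renders:
            self.emitted.append(render(local_vars))


registry = LogPointRegistry()


def checkout(cart_total: float) -> float:
    tax = round(cart_total * 0.2, 2)
    registry.hit("checkout:tax", {"cart_total": cart_total, "tax": tax})
    return cart_total + tax


checkout(10.0)  # no log attached yet, so nothing is emitted
registry.add("checkout:tax",
             lambda v: f"tax={v['tax']} for total={v['cart_total']}")
checkout(50.0)  # same running process, the newly attached log now fires
print(registry.emitted)
```

The point of the sketch is the ordering: the first call runs silently, the log is added afterwards, and the second call emits it without the function being edited or the process being restarted.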

Why Use Dynamic Logs?

Dynamic logs offer a few key benefits that make them much more efficient than traditional, static logs in modern workflows.

(If you’d like to skip straight to the numbers, you should take a look at Lightrun’s “Impact on Enterprise Logging Costs” report, which goes through the exact spend before and after implementing Lightrun’s dynamic logging in an application with a high transaction volume, developed and maintained by an enterprise team.)

Dynamic Logs Can Be Added Retroactively 

Usually, APMs and other observability vendors enable developers to analyze existing information: namely, the logs and metrics that were added to an application during development and then emitted at runtime.

Dynamic logs can be added to a live application, even once it’s already running. If there’s a new piece of code that ends up not having good log coverage or a specifically tricky part of the codebase where visibility is lacking, developers can just add dynamic logs on the fly, without having to alter the state of the application.

This means telemetry can be scaled to meet the organization’s needs on the fly, without developers having to spend time upfront writing logs for every possible eventuality.

Dynamic Logs Are Ephemeral

Dynamic logs do not persist in the codebase. Instead, they have a “lifetime” and expire as soon as that lifetime is over, meaning that if the same code path is invoked after the expiry time has passed, the log will not be emitted again.
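One way to picture a log with a lifetime is the sketch below. `EphemeralLog` and its fields are hypothetical names for illustration only, and the injectable `clock` is an assumption added so the expiry can be demonstrated without actually waiting.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class EphemeralLog:
    """Hypothetical log point that stops emitting once its lifetime is over."""
    message: str
    lifetime_seconds: float
    clock: Callable[[], float] = time.monotonic
    created_at: float = field(init=False)

    def __post_init__(self) -> None:
        self.created_at = self.clock()

    def expired(self) -> bool:
        return self.clock() - self.created_at >= self.lifetime_seconds

    def emit(self, sink: List[str]) -> bool:
        """Append the message to `sink` unless the log has expired."""
        if self.expired():
            return False
        sink.append(self.message)
        return True


# A fake clock makes the example deterministic.
fake_now = [0.0]
cart_log = EphemeralLog("debug: cart state", lifetime_seconds=60,
                        clock=lambda: fake_now[0])
captured: List[str] = []

cart_log.emit(captured)   # within the lifetime: emitted
fake_now[0] = 61.0
cart_log.emit(captured)   # same code path after expiry: nothing is emitted
print(captured)
```

Nothing has to be deleted from the source afterwards; the log simply stops firing once its lifetime has passed.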

Dynamic logging makes sense in this context because most logs are simply not meant for long-term storage. In most cases, a developer just wants to get a degree of context into the runtime of the application, come to a conclusion about the application’s state and carry on.

While dynamic logs do not replace the logs organizations are mandated to keep for legal or regulatory reasons, they allow developers to drastically reduce reliance on debug logs and remove the operational burden of “cleaning up” the codebase after each troubleshooting session.

Dynamic Logs Are Conditional

Dynamic logs are written with very precise conditionality that is only relevant for the logs themselves, not for the entire application. That means that instead of adding branching functionality (like if/else or switch statements) in order to emit more granular logs, logs can be emitted conditionally at runtime based on any code-level logic the developer needs.

For example, a developer can choose to only emit logs for:

  1. One machine, one cloud region, one Kubernetes namespace or the entire production fleet.
  2. A specific user or class of users, based on properties only available at runtime (like their respective user-agent or other HTTP headers).
  3. A specific event, such as a customer purchasing a product.

In practice, that means that instead of emitting a log on every code path invocation, it’s possible to emit logs only when they are required. If a user passes through a particular checkpoint and no longer needs to be monitored, a dynamic log can be added such that it only emits before the checkpoint and not after, saving the vast majority of log emissions in the process.
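The conditions described above can be sketched as a predicate that travels with the log rather than with the application code. `ConditionalLog`, `maybe_emit` and the request fields are hypothetical names chosen for this example.

```python
from typing import Any, Callable, Dict, List

Context = Dict[str, Any]


class ConditionalLog:
    """Hypothetical log point that only emits when `condition(context)` is true."""

    def __init__(self, template: str,
                 condition: Callable[[Context], bool]) -> None:
        self.template = template
        self.condition = condition

    def maybe_emit(self, context: Context, sink: List[str]) -> None:
        if self.condition(context):
            sink.append(self.template.format(**context))


# Only log purchases made from one region by mobile clients.
purchase_log = ConditionalLog(
    "purchase by {user_id} in {region}",
    lambda c: c["region"] == "eu-west-1" and "Mobile" in c["user_agent"],
)

emitted: List[str] = []
requests = [
    {"user_id": "u1", "region": "eu-west-1", "user_agent": "Mobile Safari"},
    {"user_id": "u2", "region": "us-east-1", "user_agent": "Mobile Safari"},
    {"user_id": "u3", "region": "eu-west-1", "user_agent": "curl/8.0"},
]
for request in requests:
    purchase_log.maybe_emit(request, emitted)
print(emitted)
```

The business logic that handles the requests contains no extra if/else branches; the filter lives entirely in the log definition and disappears with it.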

Dynamic Logs Are Granular

Dynamic logs can contain exactly the same information normal, static logs contain.

Each dynamic log contains the possibility to add one or more code-level expressions, using the same syntax and patterns a developer is already used to. That means that a developer could, in essence, ask any code-level question and get the information they need in real-time and on-demand from the running application.
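A minimal sketch of evaluating code-level expressions against the state at a log point might look like this. `render_log` is a hypothetical helper, and plain Python callables stand in for whatever expression syntax a real tool accepts.

```python
from typing import Any, Callable, Dict

Expression = Callable[[Dict[str, Any]], Any]


def render_log(expressions: Dict[str, Expression],
               local_vars: Dict[str, Any]) -> str:
    """Evaluate each named expression against a snapshot of local variables."""
    snapshot = dict(local_vars)  # shallow copy: expressions cannot rebind locals
    parts = []
    for name, expression in expressions.items():
        try:
            value = expression(snapshot)
        except Exception as exc:  # a bad expression must not break the app
            value = f"<error: {type(exc).__name__}>"
        parts.append(f"{name}={value!r}")
    return " ".join(parts)


line = render_log(
    {
        "item_count": lambda v: len(v["cart"]),
        "is_vip": lambda v: v["user"]["tier"] == "vip",
        "coupon": lambda v: v["coupon"],  # not present in this scope
    },
    {"cart": ["book", "pen"], "user": {"tier": "vip"}},
)
print(line)
```

Each expression answers one specific question about the running code, and a failing expression is reported in the log line instead of raising inside the application.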

When logging everything and analyzing later, each review of the application’s log output contains a massive amount of information to wade through to actually find the problem.

The granularity of dynamic logs means developers can skip sifting through endless logs and get only the information they set out to get in the first place.

Dynamic Logs Are Performant

Static logs always weigh on performance and memory, even if logging is disabled. Dynamic logs have only a small performance impact while active, and that impact can be capped and throttled during execution, ensuring the integrity of the application.
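The residual cost of a disabled static log is easy to demonstrate with Python's standard `logging` module. In the sketch below, `serialize_cart` and `checkout` are hypothetical stand-ins for application code; the logger calls are standard-library APIs.

```python
import logging

logger = logging.getLogger("static_logging_sketch")
logger.addHandler(logging.NullHandler())
logger.propagate = False
logger.setLevel(logging.WARNING)  # DEBUG is disabled, as it often is in production

calls = {"serialize": 0}


def serialize_cart(cart: list) -> str:
    """Stand-in for an expensive formatting step."""
    calls["serialize"] += 1
    return ",".join(cart)


def checkout(cart: list) -> int:
    # The f-string is built before logger.debug() checks the level,
    # so serialize_cart() runs even though nothing is emitted.
    logger.debug(f"cart contents: {serialize_cart(cart)}")
    return len(cart)


checkout(["book", "pen"])
print(calls["serialize"])
```

Guarding the call with `logger.isEnabledFor(logging.DEBUG)` avoids the formatting work, but the level check itself still runs on every invocation, and the statement still has to be written, reviewed and shipped.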

In addition, there are simply far fewer dynamic logs than static logs, ensuring that the overall throughput of the application is greatly improved due to the reduced amount of logging.

Dynamic logs do not weigh heavily on the application’s throughput, and they are continuously monitored and throttled so that they never exceed a custom set of caps, including CPU usage, memory consumption and I/O.
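As a simplified sketch of such a cap, the class below limits a log point to a fixed number of emissions per time window. `ThrottledLog` is a hypothetical name, and it caps only the emission rate; limiting CPU, memory and I/O, as described above, would need additional measurement that this example does not attempt.

```python
import time
from typing import Callable, List


class ThrottledLog:
    """Hypothetical log point capped at `max_per_window` emissions per window."""

    def __init__(self, max_per_window: int, window_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_per_window = max_per_window
        self.window_seconds = window_seconds
        self.clock = clock
        self.window_start = clock()
        self.count = 0
        self.dropped = 0

    def emit(self, message: str, sink: List[str]) -> bool:
        now = self.clock()
        if now - self.window_start >= self.window_seconds:
            self.window_start = now  # start a new window and reset the budget
            self.count = 0
        if self.count >= self.max_per_window:
            self.dropped += 1  # over budget: skip the log, never block the app
            return False
        self.count += 1
        sink.append(message)
        return True


# A fake clock keeps the example deterministic.
fake_now = [0.0]
hot_path_log = ThrottledLog(max_per_window=3, window_seconds=1.0,
                            clock=lambda: fake_now[0])
lines: List[str] = []
for i in range(10):
    hot_path_log.emit(f"iteration {i}", lines)

fake_now[0] = 1.5  # the next window begins, so the budget is available again
hot_path_log.emit("next window", lines)
print(lines, hot_path_log.dropped)
```

When the budget is exhausted the log is dropped rather than queued, so a log placed on a hot path degrades its own output instead of the application's throughput.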

The Benefits of Dynamic Logging

Cost-Effective

The ability to create logs retroactively provides a high level of precision: Rather than relying on an APM to aggregate, filter and analyze all the logs in the system to get a sense of what’s actually going on, an organization can instead choose to only log what is needed, when it is needed.

As a result, the logging and observability bills drop dramatically because of drastically lower logging volumes, resulting in less costly consumption fees: less log ingestion, less log storage and less log analysis.

Increase Developer Productivity

Static logging requires redeploying the entire application every time a developer wants to add new logs, which is a massive cause of context switching, delay and friction within the development life cycle.

By introducing dynamic logging, developers can add telemetry in real time and have the logs piped straight into the integrated development environment (IDE) or into an existing observability system, without having to redeploy the application.

This reduces context switching, eliminates deployment delays and provides a massive boost to developer productivity. You can read here about how using dynamic logging helped WhiteSource to identify issues without having to go through cycles of redeployment.

Reduce MTTR

By using dynamic logs, developers can follow a real-time investigation process that does not depend on release cycles. Instead, they can ask concrete questions and receive answers on demand.

Speeding up this process results in a more ergonomic, developer-friendly workflow that lets developers debug faster without having to wait for CI/CD pipelines to finish or rely on deep, manual investigation routines using proprietary query languages. You can read more here about how Taboola saves over 260 hours of debugging using dynamic logging and other real-time, developer observability techniques.

Stay Safe, Secure and Private

Lastly, it’s worth noting again that all the benefits of dynamic logging can be had without needing to compromise on safety or security:

  1. Dynamic logs are read-only — No instrumentation ever changes application state.
  2. No source code is accessed or transferred — Dynamic logs do not access source code, just metadata.
  3. No inbound ports need to be opened — All networking is done in an outbound, long-poll fashion.

Cut Costs, Reduce Complexity and Maintain Effectiveness With Dynamic Logging

The costs and complexity of static logging are starting to get overwhelming. The current “log everything, analyze later” approach is deeply ingrained in the way development teams have operated for years. It can also feel like the only practical solution to the problem of understanding what’s happening under the hood in complex production applications.

However, the practice of dynamic logging in production systems overcomes many of the limits of static logging at a fraction of the cost, both financially and in terms of overall system performance.

By using a dedicated developer observability platform like Lightrun, your developers can enjoy safe, real-time and ergonomic dynamic logging without ever leaving the IDE, resulting in improved developer productivity and reduced MTTR.

By the Numbers

For a concrete, data-driven exploration of how dynamic instrumentation with Lightrun can cut spiraling logging costs, improve developer productivity and reduce mean-time-to-resolve, our recent “Impact on Enterprise Logging Costs” report is worth a read.

TNS owner Insight Partners is an investor in: Lightrun.