Lightrun sponsored this post.
In traditional software companies, there’s a strict divide between the people who write the code and the people who run the code: You’re either in dev, or you’re in ops.
This divide is drawn in a clear-cut way: Developers design and write code for the application, which, in turn, runs on infrastructure that operators (DevOps engineers, production engineers, site reliability engineers, IT personnel and so forth) build, monitor and maintain.
It should come as no surprise, then, that observability — the practice of examining the internal state of a system simply by looking at its outputs — is reserved for operator-land. System output is an ops, not a dev, problem. Developers own the inputs but not the outputs — they’re too far removed from the actual application output emitted by the code they originally wrote.
Observability, generally speaking, is a post-deployment concern for operators. Once the system is up and running, it is expected to emit all the information an operator might need to understand what’s going on inside of it. The operator is not expected to push new code or otherwise significantly alter the state of the system to get more information — it should already be there.
The Cost of ‘Log Everything and Make Sense Later’
This overreliance on static instrumentation produces a telling pattern: Teams simply log as much as they can, then use complicated, pricey (and sometimes cumbersome) log analysis tools in production to parse through all that data and make sense of it. The problem is multiplied by the number of tools used to consume and analyze those logs. And with 90% of companies using multiple observability tools, the complexity adds up fast.
Internally, we dub this approach “log everything and make sense later,” and we believe it’s broken from the core due to two main factors:
1. The Breadth of Static Observability — You can never log everything in your application. If every line of functionality is accompanied by a line of telemetry, any decently sized code base becomes extremely difficult to maintain over time. Application throughput will tumble under the weight of never-ending logging, and sifting through that data effectively will be an arduous task.
As sane developers, then, we make a compromise: We log only what we think is needed during development and hope that these pieces of information will serve as guiding lights when we need them in production.
The problem, of course, is that blind spots are inevitable. Developers cannot predict all the known unknowns and unknown unknowns during development, as software systems are infinitely complex beings with endless interdependencies from which new issues can spring at any time.
And when you’re at your wits’ end due to the obvious visibility gaps in front of you, you’re stuck with two bad options:
- Reproducing — Spinning up a machine and replicating the exact state of the production system to try and “redo” the action that caused the bug. This is not always possible, and it is almost always time-consuming and awkward.
- Hotfixing — Adding more telemetry to fill the gaps. This sends you into lengthy redeployment cycles, costing each developer both context switches and time spent waiting to get the right answer to their question.
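To make the maintenance burden of factor 1 concrete, here is a hypothetical sketch of what line-by-line telemetry looks like in practice: a made-up cart-total function (the names and fields are illustrative, not from any real code base) where every functional line drags a log line along with it.

```python
import logging

logger = logging.getLogger("cart")

def cart_total(items):
    """Sum up a shopping cart — half the lines here are telemetry, not logic."""
    logger.debug("cart_total called with %d items", len(items))
    total = 0.0
    for item in items:
        # One line of functionality, one line of telemetry:
        logger.debug("processing item %s priced %.2f", item["sku"], item["price"])
        total += item["price"] * item["qty"]
        logger.debug("running total: %.2f", total)
    logger.debug("final total: %.2f", total)
    return total
```

Multiply this ratio across a whole code base and the twin costs described above — maintainability and throughput — follow naturally.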
2. The Cost of Observability — When developers over-log their applications, log costs creep up quickly. Take a concrete example: a run-of-the-mill e-commerce website. In a sufficiently trafficked store, the same piece of code might execute millions of times each day — think of logic that checks the state of the shopping cart on each page load, or a piece of analytics code. Adding a single log line there produces hundreds of millions of log entries every month. Take a decently trafficked endpoint with 100 hits a second and three logs emitted on each hit: At an average of 200 bytes per log, that’s over ~1.9TB of new logs every year, which costs around $190 a year just to ingest in an average log management system (with 30-day retention).
A decent code base in a high-traffic environment contains many, many thousands of such log sites, which can easily run upwards of $2 million a year in ingestion alone — a big log-cost problem on your hands. And that’s before you account for stray logs left behind by debugging developers.
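The arithmetic above can be reproduced in a few lines. This is a back-of-the-envelope sketch using the article’s figures; the $0.10/GB ingestion rate and the 10,000-log-site scale-up are assumptions consistent with the numbers quoted above, not vendor quotes.

```python
# Back-of-the-envelope log ingestion cost for one hot endpoint.
HITS_PER_SECOND = 100
LOGS_PER_HIT = 3
BYTES_PER_LOG = 200                 # average log line size
SECONDS_PER_YEAR = 60 * 60 * 24 * 365
INGEST_COST_PER_GB = 0.10           # assumed average rate; real pricing varies

logs_per_year = HITS_PER_SECOND * LOGS_PER_HIT * SECONDS_PER_YEAR
tb_per_year = logs_per_year * BYTES_PER_LOG / 1e12
yearly_cost = logs_per_year * BYTES_PER_LOG / 1e9 * INGEST_COST_PER_GB

print(f"{logs_per_year:,} logs/year ≈ {tb_per_year:.2f} TB")
print(f"≈ ${yearly_cost:,.0f}/year to ingest this one endpoint")
# Assumed scale-up: ~10,000 comparably hot log sites across the code base
print(f"≈ ${yearly_cost * 10_000:,.0f}/year across the code base")
```

Run it and you land on roughly 1.9TB and $190 per endpoint per year, and around $2 million at code-base scale — matching the figures in the text.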
In the next article, we will define what “shift left observability” actually means in practice, what developer observability is and why we should use it as a solution to the problems mentioned above.
The New Stack is a wholly owned subsidiary of Insight Partners, an investor in the following companies mentioned in this article: Lightrun.
Feature image via Pixabay.