
How Much Observability Is Enough?

25 May 2022 10:00am, by Jujhar Singh and Dotan Horovits
Jujhar Singh
Jujhar Singh is a lead infrastructure developer at Thoughtworks. Previously, he was global DevSecOps practice lead at The Economist. He’s got over 11 years working with AWS and five years in GCP. He loves the areas around DevSecOps practices and how to organize teams and software to maximize the value that cloud offers.

In a popular episode of OpenObservability Talks podcast, host Dotan Horovits, Logz.io’s principal technology evangelist, was joined by guest Jujhar Singh, at the time global DevSecOps practice lead at The Economist and currently a lead DevOps and infrastructure consultant at Thoughtworks. Their conversation was focused on understanding how much observability is enough, including investment and stakeholder adoption.

In this article, Dotan and Jujhar summarize the main learnings from the episode and provide some fresh perspectives. Before jumping into observability implementation, it’s important to keep a few factors in mind. 

Why Is Observability Important? 

Dotan Horovits
Dotan lives at the intersection of technology, product and innovation. With over 20 years in the high-tech industry as a software developer, a solutions architect and a product manager, he brings a wealth of knowledge in cloud computing, big data solutions, DevOps practices and more. Dotan is an avid advocate of open source software, open standards and communities. Currently working as a developer advocate at Logz.io, Dotan evangelizes on observability in IT systems using popular open source projects and standards such as the ELK Stack, Grafana, Jaeger and OpenTelemetry.

Observability is about designing your environment to provide actionable insight through meaningful data.

First, before we jump into the why, let's understand what observability is. Observability is the process of watching what your systems do at every layer so that you can build a comprehensive picture of how they do what they do. When you first start out, it's just about collecting telemetry, but when you get good at it and reach high maturity, it's about shedding the noise and having targeted visibility into your systems' behavior in different scenarios.

It is about generating and capturing quality telemetry data from your application and the underlying infrastructure. It is about analyzing the information provided, turning that data into system improvements and honing your visual aids into telling you what is really happening rather than just guessing answers. Observability is about designing your environment to provide actionable insight through meaningful data.

Once you know the behavior and "feel" of your systems, you can make more timely decisions to debug and enrich them. Simply put, without observability capabilities, you can't debug and optimize. With these efficiencies at your fingertips, you can improve many business functions, including customer experience (CX), by gaining insight into specific trends and ultimately answering questions about how users (and potential customers) engage with your system. This is a very data-driven approach: with masses of data flowing into your systems, you need the proper solution to filter and analyze that data and help you make more informed decisions. This is observability.

What Is the Minimum Observability Needed? 

The general rule of thumb is this: The higher the SLO/SLA of the service, the more observability it will require.

The minimum observability needed depends on two things: your business needs and application architecture. For example, your typical monolith, back-of-house accounting application that only needs to work 9 a.m. to 5 p.m., five days a week probably won’t need as much observability as your 24/7, 99.99%, high volume, microservices-based e-commerce platform. Observability is hard, so making the right decisions on where to invest your effort is extremely important. You don’t want to overinvest in your simple, medium importance systems to the detriment of your complex, core business applications.  

The general rule of thumb here is that the higher the service-level objective (SLO) or service-level agreement (SLA) of the service, the more observability it will require.

Much like any tool, cost is a major factor in deciding how much is needed. Observability isn’t cheap, and for some it isn’t easy to implement. It involves investment not only in tooling but also in skillset and engineering capacity and headcount, and the associated change to organizational culture and engineering practices will be difficult. Many times, we’ve purchased an observability tool thinking that it will sort out our problem, but we never accounted for the engineering investment required, and the wonderful tool goes unused and sits on the shelf burning a hole in our budget.

Going into more technical detail, you can break the question down into infrastructure and application observability. For infrastructure services, a good place to start is the U.S.E. metrics: Utilization, Saturation and Errors. For application services, the common practice is to cover the R.E.D. metrics: Request rate, Error rate and Duration (or latency). Google's "SRE Book" recommends latency, traffic, errors and saturation, referred to as the "four golden signals," which largely overlap with R.E.D. plus saturation. To learn more about SRE at Google in the microservices era, check out this article, which includes insights from a Google staff SRE.
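As a rough illustration of the R.E.D. approach, here is a minimal sketch in plain Python. The `RedMetrics` class and its endpoint names are hypothetical helpers invented for this example, not anything from the episode; a real service would typically use a metrics library such as a Prometheus client instead.

```python
from collections import defaultdict


class RedMetrics:
    """Toy tracker for R.E.D. signals per endpoint:
    Request rate, Error rate and Duration (latency)."""

    def __init__(self):
        self.requests = defaultdict(int)    # total requests per endpoint
        self.errors = defaultdict(int)      # failed requests per endpoint
        self.durations = defaultdict(list)  # observed latencies in seconds

    def observe(self, endpoint, duration_s, ok=True):
        self.requests[endpoint] += 1
        if not ok:
            self.errors[endpoint] += 1
        self.durations[endpoint].append(duration_s)

    def snapshot(self, endpoint):
        n = self.requests[endpoint]
        lat = sorted(self.durations[endpoint])
        return {
            "requests": n,
            "error_rate": self.errors[endpoint] / n if n else 0.0,
            # p95 latency is a common "duration" summary on RED dashboards
            "p95_s": lat[int(0.95 * (len(lat) - 1))] if lat else 0.0,
        }


# Simulate 100 requests to a hypothetical /checkout endpoint,
# with every 25th request failing.
metrics = RedMetrics()
for i in range(100):
    metrics.observe("/checkout", duration_s=0.05 + i * 0.001, ok=(i % 25 != 0))
print(metrics.snapshot("/checkout"))
```

Dashboards built on these three numbers per endpoint already answer the first questions an on-call engineer asks: is traffic normal, are errors up, is it slow?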

Many organizations try to follow observability practices published by engineering teams of leading tech companies, such as Google or Netflix, just to end up with an overcomplicated, expensive system for their scale and needs. Or even worse, they have one expert in all the tooling but when that expert leaves, nobody else maintains the dashboards or knows how to use them.

Outsource your tools wherever possible. Observability tools can be beasts to run: they have to process and store high-volume telemetry, which is a nontrivial challenge. The Economist had an on-premises Splunk instance, and keeping it highly available and patched was a nightmare that consumed almost half an engineer's workload, and that's ignoring the licensing and ever-increasing hosting costs. Rather than spending ages spinning up, running and securing your own homegrown Prometheus stack, find a SaaS version. By outsourcing your monitoring and observability tooling, you can focus on using the tools rather than running them.

The Human Factor of Implementing Observability

Just because you heard or were told how impressive the solution is doesn’t mean it’s right for you at this time. 

Above all, observability has to be easy for you. Avoid committing to a complicated solution if your organization doesn't have the capabilities or capacity to implement and use it. A more complex tool isn't necessarily more effective for your monitoring efforts; often it's just the opposite. Avoid buying a tool for the sake of it, and make sure the tools you're buying are being put to good use. Just because you heard or were told how impressive the solution is doesn't mean it's right for you at this time.

When you decide to implement observability and monitoring, ensure the right engineering capacity is in place to integrate the tool; factoring in these foundational elements before making a purchase will help you avoid headaches down the road. Typically, an organization needs someone in engineering to spike the tool or run a proof of concept (PoC) to get a realistic picture of how long implementation will take and what it will require. The PoC stage is a critical opportunity to set expectations upfront by asking the right questions and conducting meaningful tests. From there, having a site reliability engineering (SRE) function to drive product conversations around service-level objectives can be important to how you manage your systems.

Set Clear Objectives, Consolidate Tooling 

Standardize around a core subset of tools on your “golden path.” Even if you can’t — do not concede on the principles you’ve set. 

You must first understand how much observability is enough for your needs and what role different observability tools will play within your organization. Data is everywhere, and managing that data and telemetry well is crucial to building a foundation strong enough to avoid long-term or undetected disasters and to improve your services.

Once you have this understanding, set your strategic objectives and principles and stick to them. Some examples of good observability objectives and principles are:

  • All services must expose a health check and metrics endpoint.
  • All services must generate or forward a traceId to enable request tracing.
  • All services must expose application telemetry in plain text.

Ideally, you should aim to standardize around a core subset of tools on your “golden path” that are well used by your more mature teams. Newer teams will be easier to onboard and influence as they can “copy and paste” what your more mature teams have done. In the real world, however, you might end up with several tools that do the same thing. This is unavoidable at a certain scale, but even if you have to concede around your tooling choices — do not concede on the principles. Stick with open source where you can as it makes migration between tools and integration with other tools much easier.

To better understand the role observability should have within your organization, check out the full OpenObservability Talks episode.

Feature image via Pixabay.