Cloud Native Observability for DevOps Teams: an Introduction

3 Aug 2021 4:00am

Editor’s note: This is the introduction to The New Stack’s latest ebook, Cloud Native Observability for DevOps Teams. You can download the entire ebook here.

If you’re reading this, you likely already work with cloud native applications and architecture, or your organization is embarking on a journey to the cloud. If so, you are already familiar with the overwhelming choices the cloud native landscape offers. So many tools and so many opportunities to make the wrong decisions.

In a word, so much complexity.

Even if your team is overseeing just one microservices cluster, or just a few, those clusters may be deployed across more than one public cloud or a combination of cloud and on-premises servers. More complexity.

When an anomaly pops up — latency, a spike in application programming interface (API) calls, a sudden outage of an essential service — how do you know what’s causing it? How do you know whether it’s an isolated incident or a glitch that’s going to crash everything?

The thing is, you can’t know unless your whole team has full observability.

Nothing is more crucial to an organization’s ability not simply to function but to serve its customers than observability. And nothing, perhaps, is more widely misunderstood.

Observability means inferring the internal state of a system from its external outputs. But it isn’t just the ability to see what’s going on in your systems. It’s the ability to make sense of it all, to gather and analyze the information you need to prevent incidents from happening, and to trace their path when they do happen, despite every safeguard, to make sure they don’t happen again.

No Longer Just Operations’ Job

Traditionally, observability has been the responsibility of operations engineers. But with the advent of DevOps teams, and more responsibility “shifting left” to developers, it’s become every team member’s job. If you’re building an application, observability cannot be relegated to “add-on” status later in the application’s life cycle. To think otherwise would be like building a car and leaving the speedometer, odometer and instrument lights for the dealership to install.

In a survey taken in January 2021 of more than 300 IT professionals by VMware Tanzu, 84% of participants said their cloud applications would have better availability and performance if more stakeholders, including developers, had visibility into their systems’ overall infrastructure and performance metrics.

The current conversation about observability began before the introduction of game-changing cloud native tech like Kubernetes. In a much-cited 2013 blog post by Cory Watson, the tech world learned how engineers at Twitter sought ways to keep track of their systems as the company moved from a monolithic to a distributed architecture.

At this time, as Watson described, Twitter focused its observability efforts on collecting and monitoring metrics, and on the visualizations generated from the data points it collected:

Charts are often created ad hoc in order to quickly share information within a team during a deploy or an incident, but they can also be created and saved in dashboards. A command-line tool for dashboard creation, libraries of reusable components for common metrics, and an API for automation are available to engineers.

Logging and tracing were addressed in a single paragraph, under the heading “Related Systems.”

Twitter, Watson wrote, created a command-line tool to help its engineers create their own dashboards to keep track of their metrics-generated charts:

The average dashboard at Twitter contains 47 charts. It’s common to see these dashboards on big screens or on engineer’s monitors if you stroll through the offices. Engineers at Twitter live in these dashboards!

Beyond the ‘3 Pillars of Observability’

As the decade of cloud computing rolled on, more engineers began to “live in their dashboards.” It’s not enough to merely monitor data points, they learned. And so the notion spread that observability didn’t mean mere monitoring, but was based on three pillars, which became known as:

  1. Metrics: numeric measurements of activity in a system.
  2. Tracing: recording the path a request takes as it moves through a distributed system.
  3. Logs: timestamped records of events within the system.
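The three pillars can be made concrete with a minimal, purely illustrative sketch, here in plain Python rather than a real observability library; the `Span` class and `handle_checkout` function are invented for the example:

```python
import logging
import time
import uuid
from dataclasses import dataclass, field

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("checkout")

# Pillar 1 -- Metrics: numeric measurements, e.g. a simple request counter.
request_count = {"checkout": 0}

# Pillar 2 -- Tracing: a span records one hop of a request's path;
# every span in the same request shares a trace_id.
@dataclass
class Span:
    trace_id: str
    name: str
    start: float = field(default_factory=time.monotonic)
    duration_ms: float = 0.0

def handle_checkout() -> Span:
    span = Span(trace_id=uuid.uuid4().hex, name="checkout")
    request_count["checkout"] += 1                            # metric
    log.info("checkout started trace_id=%s", span.trace_id)   # Pillar 3 -- Logs
    span.duration_ms = (time.monotonic() - span.start) * 1000.0
    return span

span = handle_checkout()
```

Real systems emit these signals through dedicated tooling, of course, but the shape of the data is the same: counters, spans keyed by a shared trace ID, and event records.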

Increasingly, the conversation around observability is moving beyond the three pillars, taking a more nuanced view. There’s greater awareness of how those three pillars fit together and a greater emphasis on analysis. DevOps teams are becoming more cognizant of the importance of measuring what truly matters to meet service-level objectives (SLOs).

And managers, struggling with high turnover and a relatively small pool of talent from which to hire, are trying to figure out how to alleviate the human cost of “pager fatigue” — the demand for operations engineers to respond to alerts, at all hours of the day or night, that may or may not signal a business-critical incident.

SLOs, a concept documented by Google’s site reliability engineering team in its SRE book, vary widely with each organization’s, or even each team’s, purpose: achieving a particular latency at a certain volume of requests, say, or supporting a given number of simultaneous purchases in an online shopping cart application. Service-level indicators (SLIs) are the measured signals that a robust observability practice surfaces; they show whether a team is on track to meet its SLOs or whether a problem is brewing.
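The relationship between SLIs and SLOs can be sketched with a few lines of Python. The request data and the thresholds below are invented for illustration, not recommendations from the SRE book:

```python
# Hypothetical request outcomes for one service: (latency_ms, http_status).
requests = [(120, 200), (95, 200), (310, 200), (88, 500), (102, 200)]

# SLOs: the targets a team commits to (example values only).
SLO_AVAILABILITY = 0.999   # 99.9% of requests succeed
SLO_LATENCY_MS = 300       # requests complete within 300 ms...
SLO_LATENCY_RATIO = 0.95   # ...for at least 95% of them

# SLIs: the measured signals compared against those targets.
total = len(requests)
availability_sli = sum(1 for _, status in requests if status < 500) / total
latency_sli = sum(1 for ms, _ in requests if ms <= SLO_LATENCY_MS) / total

print(f"availability SLI: {availability_sli:.3f} (SLO {SLO_AVAILABILITY})")
print(f"latency SLI:      {latency_sli:.2f} (SLO {SLO_LATENCY_RATIO})")
```

Here both SLIs come out to 0.8, well below the targets, which is exactly the kind of brewing problem a team wants its observability tooling to surface before customers notice.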

And, as stated previously, distributed systems and cloud native technologies add further layers of complexity to observability. After all, Kubernetes runs everywhere, and “everywhere” can be tough to track.

In the VMware Tanzu survey, 90% of the IT professionals who participated said that distributed applications create monitoring challenges an order of magnitude bigger than those of other applications.

More than 80% of participants in the survey said that legacy monitoring tools aren’t sufficient to track modern cloud applications. And only 8% of respondents said they are “very satisfied” with their organization’s current monitoring tools and processes.

Increasing Business Value

Cloud technologies themselves do not always lend themselves easily to observability. Plain-vanilla Kubernetes, for instance, offers only very basic functions, through kubectl, for checking the status of objects in a cluster, and no full-fledged native logging solution, as Franciss Espenido, LogDNA’s senior technical partnerships program manager, writes in his chapter of our new ebook.
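Those basic functions amount to on-demand status and log queries like the following (pod and namespace names are placeholders). Note that `kubectl logs` only reads a pod’s current output; without add-on tooling there is no retention or aggregation once the pod is gone:

```shell
# List pods and their current status in a namespace
kubectl get pods -n my-namespace

# Describe a pod to see its recent events and container state
kubectl describe pod my-pod -n my-namespace

# Tail the last 100 log lines from a pod and follow new output
kubectl logs my-pod -n my-namespace --tail=100 -f
```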

But overcoming these challenges can pay off for businesses.

In the VMware Tanzu survey, 92% of respondents said observability drives better business decisions. One example of how observability is becoming embedded into the way businesses run involves Adidas, the sportswear retailer.

Adidas found that as it scaled up, it needed to make observability a lot easier, according to Rastko Vukasinovic, the company’s director of solution architecture. So it built its own holistic monitoring system that allowed it to not only collect and watch technical metrics, but also business data.

Its worldwide DevOps teams now compile code more than 10,000 times a day. And Adidas’ overall digital transformation has helped its e-commerce revenue soar from $47 million in 2012 to $4.7 billion in 2020.

For developers, having a greater knowledge of observability — and building secure, observable applications that easily lend themselves to meeting SLOs — means contributing more to overall business goals. Fifty-five percent of developers’ time is spent maintaining and managing custom applications that serve current business needs, according to a 2019 survey by 451 Research; only 45% is spent building new applications that help the business differentiate itself from its competition.

According to 451 Research’s 2020 report on observability, commissioned by Sumo Logic, greater focus on SLOs can help developers spend more time on applications that fuel new business:

With visibility into key objectives that describe the performance that’s important for end users, developers can prioritize the work they do on existing applications for the most important performance problems rather than, for instance, on infrastructure or application anomalies that have no negative impact on users.

The story of cloud native observability is still adding new chapters. The tech world is watching OpenTelemetry — an open source project aimed at creating a standardized set of tools, APIs and software development kits (SDKs) — which is now in incubation at the Cloud Native Computing Foundation.

In its new ebook, The New Stack has gathered some of its best articles on the current state of observability, with contributions from experts at LogDNA, Buoyant, Dynatrace and Honeycomb. It’s aimed at every member of a DevOps team, making the case for full-stack involvement in making sure cloud native applications and systems run smoothly and keep customers satisfied. The days of throwing application code “over the wall” and letting operations engineers deal with the consequences are over.

Featured image by Gabriel Dinh.

The New Stack is a wholly owned subsidiary of Insight Partners. TNS owner Insight Partners is an investor in the following companies: Spike, Honeycomb.io.

This post is part of a larger story we're telling about Observability.

Get the full story in the ebook