The Open Source Container-Native Observability Toolkit
This article will explore concepts related to observability and monitoring along with a number of popular open source monitoring tools for container-native environments. It is a good place to start if you are looking to make your Kubernetes environment more observable and need advice about where to begin.
The adoption of container-native and cloud native development practices presents new operational challenges. Today’s microservice environments, deployed on container orchestration platforms such as Kubernetes, are polyglot, distributed, container-based, highly scalable and ephemeral. To understand interactions among components of your system, you must be able to follow the lifecycle of a request across distributed environments. Without the proper tools, it can feel impossible to identify symptoms and determine the root cause of an issue. This context requires us to treat observability as a first-class concern in our operations planning.
When I refer to observability, I mean designing and operating a more visible system. Systems, especially complex distributed systems, will inevitably experience failures, and it is important to recognize and prepare for those failures. This philosophy differs from systems monitoring in that observability takes a holistic approach that includes proactive system and process design. That covers not only the ability to test your environment in a realistic manner and to report useful, actionable data in production, but also business-impact considerations.
Under the umbrella of observability, monitoring provides users the ability to determine the internal state of a system through its external outputs. These external outputs typically consist of logs, metrics and traces. Combined, these are used to accurately and quickly triage adverse events in order to restore normal service. They can also enable you to conduct a useful post-mortem to help you avoid similar events in the future. I will test these concepts firsthand using a sample Java microservice application.
Site reliability engineering (SRE) is a helpful approach to observability and monitoring. This perspective came out of Google in the early 2000s and is focused on best practices for reliably operating systems and infrastructure at scale. Fundamentally, the goal is to maintain system availability and efficiency. SRE concepts such as SLIs, SLOs and SLAs (service level indicators, objectives and agreements, respectively) can be helpful to keep in mind when designing a modern system. These quantitative measures define the metrics that matter most to the business, the ideal values for those metrics and the planned reaction if the expected level of service is not met.
Similarly, concepts like mean time to failure (MTTF), the amount of time a system runs before it fails, and mean time to repair (MTTR), the time it takes to bring a system back to a healthy operating state, are helpful for evaluating the effectiveness of incident response. All of these concepts remind us that observability is more than a purely operational concern, because a service outage or degradation impacts everyone in a business.
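To make the relationship between these two measures concrete, steady-state availability is commonly estimated as MTTF / (MTTF + MTTR). The following is a minimal Java sketch of that arithmetic. The class name and the sample figures are hypothetical and are not taken from the sample application.

```java
import java.time.Duration;

// Hypothetical helper for illustration; not part of the sample application.
final class AvailabilityEstimate {

    // Estimates steady-state availability as MTTF / (MTTF + MTTR).
    static double availability(Duration mttf, Duration mttr) {
        double up = mttf.toMillis();
        double down = mttr.toMillis();
        return up / (up + down);
    }

    public static void main(String[] args) {
        // Assumed figures: 99 hours of healthy operation between failures,
        // 1 hour to restore service after each failure.
        double a = availability(Duration.ofHours(99), Duration.ofHours(1));
        System.out.printf("Estimated availability: %.2f%%%n", a * 100);
    }
}
```

Shortening the repair time raises the estimate just as lengthening the time between failures does, which is one reason fast diagnosis matters.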
Logs are the most fundamental pillar of observability. They are a record of an event that took place at a given time, consisting of granular information for a specific context that can be used for diagnosing problems.
Logging is supported by most libraries. For example, our sample app makes use of Java logging classes and a logging.properties configuration file that writes to stdout. Stdout and stderr are the most common choices for Kubernetes environments. It is up to the developer to have the discipline to put meaningful logs into their code.
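As a rough illustration of logging to stdout from Java, the sketch below uses only java.util.logging from the JDK. It is not the sample application's actual configuration: the class name, logger name and messages are assumptions, and the handler is wired up in code rather than through a logging.properties file. A StreamHandler is used because the JDK's default ConsoleHandler writes to stderr.

```java
import java.io.PrintStream;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;
import java.util.logging.SimpleFormatter;
import java.util.logging.StreamHandler;

// Hypothetical helper for illustration; not the sample application's setup.
final class StdoutLogging {

    // Returns a logger whose records are written to the given stream.
    static Logger loggerTo(String name, PrintStream out) {
        Logger logger = Logger.getLogger(name);
        // Stop records from also reaching the root logger's stderr handler.
        logger.setUseParentHandlers(false);
        StreamHandler handler = new StreamHandler(out, new SimpleFormatter()) {
            @Override
            public synchronized void publish(LogRecord record) {
                super.publish(record);
                flush(); // StreamHandler buffers, so flush after every record
            }
        };
        logger.addHandler(handler);
        return logger;
    }

    public static void main(String[] args) {
        Logger log = loggerTo("greet-service", System.out);
        // Include the context a reader will need when diagnosing a problem.
        log.log(Level.INFO, "Greeting updated to {0}", "Hola");
        log.warning("Slow greeting requested; injecting an artificial delay");
    }
}
```

Because the output goes to stdout, a node-level log forwarder can collect it from the container runtime without the application knowing where the logs end up.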
It is important to implement a tool to aggregate logs or else they can become lost. Open source log forwarders, such as FluentD, are used to scrape logs from all of the nodes in your environment and then process and ship them to a persistent data store, such as Elasticsearch.
Elasticsearch, an open source distributed analytics engine, can be queried directly or interacted with by means of Kibana, a customizable visualization dashboard. The so-called EFK stack (Elasticsearch, FluentD and Kibana) provides centralized, cluster-level logging and the tools needed to analyze those logs. There are other great tools available for logging, including Logstash, Graylog and Timber.
Whichever tool you choose, make sure it captures your logs and lets you accurately sift through them for the information needed to diagnose your issue.
Metrics are numeric aggregations of data describing the behavior of a component or service, measured over regular intervals of time. Because metrics are numerical, rather than text-based, they are easy to store and model. Metrics are useful for understanding typical system behavior. In the Kubernetes space, there are metrics available on many levels, from the underlying nodes to your application pods, application performance monitoring (requests per second, error rates, etc.) and more. Most application frameworks include libraries to instrument metrics. Prometheus, the open source systems monitoring toolkit, includes libraries to create in-process samples, tools to scrape that data and store it in the Prometheus time-series database, and a query language to analyze the data. Our sample app uses metrics classes that expose data at a /metrics endpoint, where it is scraped by Prometheus. You also may want to explore other tools like Graphite, InfluxDB and StatsD.
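To show what Prometheus reads when it scrapes a /metrics endpoint, here is a minimal Java sketch that counts requests and renders the count in the Prometheus text exposition format. It is not the metrics library used by the sample app: the class name and the metric name greet_requests_total are assumptions, and a real service would normally rely on its framework's metrics support and serve this text over HTTP.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical in-process counter for illustration only.
final class GreetMetrics {

    private final AtomicLong requests = new AtomicLong();

    // Call once per handled request.
    void recordRequest() {
        requests.incrementAndGet();
    }

    // Renders the counter in the Prometheus text exposition format:
    // a HELP line, a TYPE line, then "<metric name> <value>".
    String render() {
        return "# HELP greet_requests_total Total greeting requests handled.\n"
             + "# TYPE greet_requests_total counter\n"
             + "greet_requests_total " + requests.get() + "\n";
    }

    public static void main(String[] args) {
        GreetMetrics metrics = new GreetMetrics();
        metrics.recordRequest();
        metrics.recordRequest();
        System.out.print(metrics.render());
    }
}
```

The application only keeps the current value in memory. Prometheus turns it into a time series by reading the endpoint at regular intervals.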
Grafana, an open source data-visualization tool for monitoring, can be used to aggregate metric data from numerous sources into dashboards that provide a summary view of key metrics. Together, Prometheus and Grafana form a systems monitoring, alerting and visualization toolkit recommended by the Cloud Native Computing Foundation (CNCF) for container-based infrastructure.
The Oracle Cloud monitoring service offers out of the box aggregated metrics for Oracle Cloud Infrastructure resources. These metrics are available both on the Oracle Cloud Console and via API. We worked with Grafana to expose the monitoring service as a Grafana data source, which means you can visualize Oracle Cloud Infrastructure data alongside your other data sources in Grafana and use it to create beautiful and useful dashboards.
Metrics are also well-suited to trigger alerts. Alerts are notifications indicating that a human needs to take action immediately in response to something that is either happening or about to happen in order to improve the situation. Grafana can be used to create a rule that will trigger an alert when particular conditions are met. I chose to create a rule based on request duration that will send out an alert when a specific request threshold is surpassed. In this case, I configured an alert to be sent through a notification channel connected to Slack via an incoming webhook.
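The behavior of a threshold rule like this can be pictured as a small state machine that notifies only when the state changes. The Java sketch below is a toy model under that assumption. It does not reflect how Grafana is implemented, and the class name and message texts are hypothetical.

```java
import java.time.Duration;
import java.util.Optional;

// Toy model of a threshold alert rule; not how Grafana is implemented.
final class DurationAlertRule {

    private final Duration threshold;
    private boolean firing = false;

    DurationAlertRule(Duration threshold) {
        this.threshold = threshold;
    }

    // Evaluates one observation and returns a notification only when the
    // rule changes state (OK to ALERTING, or ALERTING back to OK).
    Optional<String> evaluate(Duration observed) {
        boolean breached = observed.compareTo(threshold) > 0;
        if (breached && !firing) {
            firing = true;
            return Optional.of("ALERTING: request duration above "
                    + threshold.toMillis() + " ms");
        }
        if (!breached && firing) {
            firing = false;
            return Optional.of("OK: request duration within threshold");
        }
        return Optional.empty(); // no state change, so no notification
    }
}
```

Notifying on transitions rather than on every breach keeps a sustained problem from flooding the notification channel.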
Traces represent causally related events that are part of a request flow in a distributed environment. They provide visibility into the structure of a request and the path it took. Traces are uniquely suited to understanding the entire lifecycle of a request. As a result, they are useful for pinpointing issues for debugging purposes, for instance, increased latency or resource utilization.
Tracers live inside your application code. They assign each request a global ID and insert metadata at each step in the flow, referred to as a span, before passing along the ID. One challenge of tracing is that it can be hard to retrofit existing applications to support tracing. Each component of an application needs to be instrumented to propagate tracing info, which is especially challenging in polyglot architectures. The sample application takes advantage of the OpenTracing API for Java. OpenTracing is a language-neutral approach to distributed tracing. Traces can be visualized and inspected with tools, such as the open source distributed tracing system Jaeger. Zipkin is another option for tracing and is also compliant with OpenTracing.
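The relationship between a global ID and the spans created at each step can be sketched with a toy data model. The Java class below is for illustration only and is not the OpenTracing API: the class, field and operation names are assumptions, and a real tracer would also record timestamps and tags and report spans to a backend.

```java
import java.util.UUID;

// Toy span model for illustration; this is not the OpenTracing API.
final class ToySpan {

    final String traceId;   // shared by every span in one request
    final String spanId;    // unique to this step of the request
    final String parentId;  // null for the root span
    final String operation;

    private ToySpan(String traceId, String parentId, String operation) {
        this.traceId = traceId;
        this.spanId = UUID.randomUUID().toString();
        this.parentId = parentId;
        this.operation = operation;
    }

    // Starts a new trace at the point where the request enters the system.
    static ToySpan root(String operation) {
        return new ToySpan(UUID.randomUUID().toString(), null, operation);
    }

    // Starts the next step of the same request, keeping the trace ID and
    // recording which span caused it.
    ToySpan child(String operation) {
        return new ToySpan(this.traceId, this.spanId, operation);
    }

    public static void main(String[] args) {
        ToySpan request = ToySpan.root("GET /greet");
        ToySpan lookup = request.child("load greeting");
        System.out.println(lookup.traceId.equals(request.traceId)); // same trace
    }
}
```

The shared trace ID is what lets a tracing backend stitch spans from different services into one request, and the parent ID is what gives the trace its tree structure.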
Service meshes provide a configurable infrastructure layer for microservice applications. Service meshes monitor and control the flow of traffic through your cluster. In contrast to API Gateways, which work with the north-south traffic into your cluster, service meshes work with east-west traffic between your services. Many service meshes use the sidecar pattern, the practice of provisioning each pod with a proxy container, such as Envoy, which controls and mediates network traffic between services within the mesh without code changes.
This pattern provides observability and awareness of what is running. While there are many mesh options to choose from, including Linkerd and Consul, I chose to implement Istio because it provides out-of-the-box integrations with a number of open source observability tools.
While implementing a service mesh does not remove the need to instrument your applications, in the case of tracing, it can make the process simpler. The mesh will handle tracing and metric collection at the proxy level.
To get more detailed trace information, applications will still need to forward headers to the next hop, but otherwise, the amount of code change is minimal as meshes can capture latency, retry and failure information for each hop in a request. It also makes it simple to deploy the various services ingesting data from our instrumented applications.
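Header forwarding can be as small as copying a fixed set of headers from the incoming request onto each outgoing call. The Java sketch below assumes the B3-style header names listed in Istio's distributed tracing documentation. Verify the list against the mesh version you run. The class name is hypothetical, and headers are modeled as a plain map rather than a specific HTTP library's types.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper; header names are an assumption based on Istio's docs.
final class TraceHeaders {

    static final Set<String> PROPAGATED = new LinkedHashSet<>(Arrays.asList(
            "x-request-id",
            "x-b3-traceid",
            "x-b3-spanid",
            "x-b3-parentspanid",
            "x-b3-sampled",
            "x-b3-flags",
            "x-ot-span-context"));

    // Returns only the tracing headers from the incoming request so they can
    // be attached to the outgoing call. Header names are matched
    // case-insensitively and emitted in lower case.
    static Map<String, String> forward(Map<String, String> incoming) {
        Map<String, String> outgoing = new LinkedHashMap<>();
        for (Map.Entry<String, String> header : incoming.entrySet()) {
            String name = header.getKey().toLowerCase(Locale.ROOT);
            if (PROPAGATED.contains(name)) {
                outgoing.put(name, header.getValue());
            }
        }
        return outgoing;
    }
}
```

Copying an explicit allowlist, rather than every incoming header, avoids leaking cookies or credentials to downstream services.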
The same Helm chart used to deploy Istio can be used to deploy Grafana, Prometheus, Jaeger and Kiali and also prepopulate them with helpful dashboards. Kiali is an observability tool for Istio that helps you visualize the relationships between services running in the mesh. Kiali can also be linked directly to Grafana and distributed tracing tools to easily switch over to their respective dashboards.
To demonstrate these tools, we will use a sample application written with the Helidon framework, a set of Java libraries designed for developing microservices. The application consists of a Main.java class and a RESTful web service, GreetService.java, along with the manifest files used to deploy the application to a Kubernetes cluster. The function of the application is simple: update a greeting and the recipient of a greeting with PUT and GET requests. The application also includes the option to perform a “slow greeting,” a type of request that injects a delay, which can be used to simulate latency. Simulating error codes, latency and failure is also something you can do with Istio without code changes.
I deployed the application to an Oracle Container Engine for Kubernetes Cluster and enabled Istio sidecar injection. The tools I chose, Grafana, Prometheus, Elasticsearch, FluentD, Kibana, Jaeger, Istio and Kiali, are just some of the options available. I picked these tools because all of them are open source and each one is known to work well with Kubernetes.
Imagine choosing an SLI related to request time, such as the duration of requests made to your application. In order to ensure positive user experience, you decide on an SLO for your requests to take less than two seconds. To meet the SLA tied to your objective, you use the monitoring tools discussed above to make sure you are equipped to quickly and competently identify and address the root cause of slow requests.
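One way to turn this into something measurable is to compute the fraction of requests that finish within the limit and compare it with a target. The Java sketch below does that under stated assumptions: the class name, the sample durations and the target fractions are hypothetical, since only the two-second limit comes from the scenario above.

```java
import java.time.Duration;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; the target fraction is an assumption.
final class LatencySli {

    // SLI: fraction of requests that completed in less than the limit.
    static double fractionWithin(List<Duration> durations, Duration limit) {
        if (durations.isEmpty()) {
            return 1.0; // no traffic means nothing violated the limit
        }
        long good = durations.stream()
                .filter(d -> d.compareTo(limit) < 0)
                .count();
        return (double) good / durations.size();
    }

    // SLO check: does the measured fraction reach the target?
    static boolean meetsSlo(List<Duration> durations, Duration limit, double target) {
        return fractionWithin(durations, limit) >= target;
    }

    public static void main(String[] args) {
        List<Duration> observed = Arrays.asList(
                Duration.ofMillis(300), Duration.ofMillis(800),
                Duration.ofMillis(2500), Duration.ofMillis(400));
        Duration limit = Duration.ofSeconds(2);
        System.out.println("SLI: " + fractionWithin(observed, limit));
        System.out.println("Meets 99% target: " + meetsSlo(observed, limit, 0.99));
    }
}
```

In practice the durations would come from your metrics backend rather than an in-memory list, but the comparison between the indicator and the objective stays the same.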
In order to test out an example troubleshooting workflow, I have instrumented my application for logging, metrics and tracing. I have also configured the aforementioned tools in my hosted Kubernetes environment. I created an alert rule in Grafana tied to the duration of requests made to my application: any request that takes longer than one second will trigger an alert. I will start by making a request to my application using the artificially slow “slow greeting” handler written into the application. This slower-than-typical request will push us beyond the one-second limit, triggering an alert from Grafana, which has been configured to send a notification to Slack with the message “Please take a look at request duration times” whenever the threshold is surpassed.
Next, I can head over to my Grafana dashboard to see when the abnormally slow requests began and how they compare to the average request time.
Given that I am not sure where this latency is taking place, the next thing I do is look to Jaeger to inspect the request flow through various systems. In Jaeger, I can find the tracing span that shares a timestamp (4:16:47 PM) with one of the slow requests. This will show me which service is experiencing the delay and its Kubernetes namespace and container name.
Once I find the culprit, in this case, the Greet Service, I will head over to Kibana, which will allow me to search for the log of the service in question using the same timestamp, namespace and container name from the trace span we previously reviewed. When I check my logs, I can see the note associated with my “slow greeting” in my application code: “You made this request slow on purpose!”
As soon as I correct the issue, in this case by no longer sending the slow greeting, the alert will stop firing and I will get a notification that everything is okay.
Of course, this is a simple example from a simple application, but the concepts can be applied to complex architectures. If this scenario involved more than a single user (me) inspecting the issue, a ticket queue would be helpful to determine who needed to take action.
Adopting observability practices that provide insight into logging, metrics and tracing gives you maximum visibility into the behavior of a modern distributed system. In the event that an issue occurs in your environment, these tools will allow you to discover the issue, pinpoint its location and determine how to fix it. These same tools can be used to proactively test your environment and also improve the performance and efficiency of your system. Beyond monitoring, observability includes a philosophical approach that recognizes the impact issues can have on the greater business and factors in how to address those issues when they inevitably arise.
A helpful resource for learning more about observability is O’Reilly’s “Distributed Systems Observability.” For more information on SRE, take a look at “Site Reliability Engineering” and “The Site Reliability Workbook.”