
Effective Monitoring in a Cloud Native World

Nov 19th, 2018 1:30pm by Rob Scott

CloudBees sponsored this story, as part of an ongoing series on “Cloud Native DevOps.” Check back through the month for further editions.

Rob Scott
Rob Scott works out of his home in Chattanooga as a Site Reliability Engineer for ReactiveOps. He helps build and maintain highly scalable, Kubernetes-based infrastructure for multiple clients. He's been working with Kubernetes since 2016, contributing to the official documentation along the way. When he's not building world-class infrastructure, Rob likes spending time with his family, exploring the outdoors, and giving talks on all things Kubernetes.

The transition to new cloud native technologies like Kubernetes has dramatically altered how applications are architected and deployed. For quite some time, monolithic applications were deployed to a set of long-lived servers and received infrequent, incremental updates over their lifespan. The rise of cloud native technology has brought a rise in microservice architectures that run in ephemeral containers on quickly evolving infrastructure. Adding to the complexity, these applications are regularly deployed to multiple availability zones, regions, or even multiple clouds.

This approach to application architecture has come with many advantages, but it has also required significant shifts in supporting technologies like monitoring. Cloud native systems are generally more stable and highly available, often including some kind of automatic failover. Unfortunately, as these architectures get more complex, so do the potential failure modes. The more components that are involved, the more ways things can break. With that in mind, it’s more important than ever to have an effective monitoring strategy.

In a cloud native world, the traditional monitoring we’ve become accustomed to simply can’t keep up with these more modern architectures. It can’t provide the level of insight into our applications that we need to fully understand what’s happening.

The most effective monitoring strategies take a multifaceted approach, covering four different areas: external polling, centralized logging, custom metric collection, and request tracing. Although not every architecture requires each of these components, each of them can provide a unique and complementary level of insight. The best monitoring strategies rely on a combination of approaches to provide a comprehensive system overview.

External Polling Provides High-Level Visibility

There’s a broad category of monitoring often referred to as “black box” monitoring: polling a system from the outside in to measure its health. An example would be polling a web endpoint every minute to confirm that your application is up. This is one of the most traditional forms of monitoring, and it likely still has a place in a monitoring strategy, because it is highly effective at detecting problems that are already visible to your users.
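
As a rough illustration of black box polling, the following Python sketch checks a web endpoint once from the outside and reports whether it answered with a healthy status. The function name `check_endpoint`, the five-second timeout, and the rule that any 2xx or 3xx response counts as "up" are assumptions made for this example, not part of any particular monitoring product.

```python
import time
import urllib.error
import urllib.request


def check_endpoint(url, timeout=5.0):
    """Poll a URL once from the outside; return (is_up, status, seconds)."""
    started = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 500.
        status = err.code
    except (urllib.error.URLError, OSError):
        # No answer at all: DNS failure, refused connection, or timeout.
        status = None
    elapsed = time.monotonic() - started
    # Assumption for this sketch: 2xx and 3xx responses mean "healthy".
    is_up = status is not None and 200 <= status < 400
    return is_up, status, elapsed
```

A scheduler could call this once a minute and alert after several consecutive failures. The elapsed time is returned as well, since a slow response is often the first externally visible symptom.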

In contrast to black box monitoring, many of the newer approaches are referred to as “white box” monitoring. This involves monitoring from the inside out and can provide a level of insight that black box monitoring generally can’t. This approach can often detect problems before they become externally visible and can provide valuable information for in-depth debugging. Each of the following approaches is a form of white box monitoring.

Centralized Logging Provides Valuable Debugging Data

Logging is nothing new. Just like polling, it’s been around for quite some time. Cloud native architectures don’t just require local logging, though; they need some kind of centralized logging system. With traditional monolithic architecture running on long-lived servers, logs sometimes never left the server they originated from. Centralized logging was not always seen as a necessity. Debugging sometimes meant logging into a specific server and sifting through logs to find a problem.

Of course, with cloud native infrastructure, containers and servers are ephemeral, and it becomes more important than ever to ship logs to some kind of centralized logging system. Elasticsearch, often deployed with Logstash and Kibana to make up an “ELK” stack, has become one of the most popular open source solutions for centralized logging. Together, the components of an ELK stack provide a compelling set of open source tools: Elasticsearch stores and indexes logs, Logstash collects and processes them, and Kibana visualizes them.

Having all system and application logs in a single place can be an incredibly powerful component of your monitoring system. When things go wrong, centralized logging allows you to quickly see everything happening in your system at that point in time, and filter through logs for specific applications, labels, or messages.
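
Filtering by application, label, or message is much easier when every service emits logs in a consistent, machine-readable shape. As a minimal sketch, assuming a log shipper collects each container's standard output, the following Python code writes one JSON object per log line. The field names (`app`, `labels`, `message`) and the helper `build_logger` are illustrative choices, not a schema required by any specific logging stack.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line."""

    def __init__(self, app, labels=None):
        super().__init__()
        self.app = app
        self.labels = dict(labels or {})

    def format(self, record):
        entry = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "app": self.app,        # hypothetical field: which service logged this
            "labels": self.labels,  # hypothetical field: e.g. environment or region
            "message": record.getMessage(),
        }
        return json.dumps(entry)


def build_logger(app, labels=None, stream=sys.stdout):
    """Return a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger(app)
    logger.setLevel(logging.INFO)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(app, labels))
    logger.handlers.clear()
    logger.addHandler(handler)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger("checkout", {"env": "prod"})
    log.error("payment failed for %s", "order-42")
```

Because every line carries the same fields, a centralized system can index them and answer questions such as "show all errors from the checkout service in production" without parsing free-form text.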

Additionally, these centralized logging systems can be configured to alert for anomalous behavior. This could be as simple as significantly increased log volume, or potentially an unexpected influx of error messages coming through.
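
To make the two alert conditions above concrete, here is a small Python sketch that compares the log entries from the current time window against a baseline window. The thresholds (three times the baseline volume, a 5% error rate floor, or double the baseline error rate) and the function names are assumptions chosen for illustration. Real logging systems express such rules in their own alerting configuration.

```python
def error_rate(entries):
    """Fraction of entries whose level is ERROR or CRITICAL."""
    if not entries:
        return 0.0
    errors = sum(1 for e in entries if e.get("level") in ("ERROR", "CRITICAL"))
    return errors / len(entries)


def find_anomalies(baseline, current, volume_factor=3.0, error_rate_limit=0.05):
    """Compare the current window of log entries with a baseline window."""
    alerts = []
    # Significantly increased log volume compared with the baseline window.
    if baseline and len(current) > volume_factor * len(baseline):
        alerts.append("log volume spike")
    # Unexpected influx of errors: above a fixed floor and above
    # twice the baseline error rate, whichever is higher.
    if error_rate(current) > max(error_rate_limit, 2 * error_rate(baseline)):
        alerts.append("error rate spike")
    return alerts
```

Each entry is assumed to be a dictionary with a `level` key, such as the parsed form of a structured log line.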

Custom Metrics Enable Fine-Grained Reporting

To fully understand your systems, the base set of traditional metrics often doesn’t cut it. With custom metric collection, applications can expose metrics that are a better measure of application health. These kinds of metrics can provide much more precise information than metrics derived from polling the system from the outside.

Open source tools like Prometheus have transformed this space. At its core, Prometheus is a monitoring and alerting toolkit that stores metrics in a multi-dimensional time series database. Each time series is identified by a metric name and a set of key-value pairs called labels, and it tracks the value of that metric over time. The simplicity of this model enables the efficient collection of a wide variety of metrics.
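
The multi-dimensional model is easier to see in code. In Prometheus's documented data model, a series is identified by its metric name together with its set of key-value labels. The toy in-memory store below mimics only that identity rule. The class `TimeSeriesStore` and its methods are hypothetical and say nothing about how Prometheus is implemented internally.

```python
import time
from collections import defaultdict


class TimeSeriesStore:
    """Toy store: one list of (timestamp, value) samples per series."""

    def __init__(self):
        self._series = defaultdict(list)

    @staticmethod
    def _key(name, labels):
        # Series identity is the metric name plus its sorted label pairs,
        # so the order in which labels are supplied does not matter.
        return (name, tuple(sorted(labels.items())))

    def record(self, name, labels, value, timestamp=None):
        ts = time.time() if timestamp is None else timestamp
        self._series[self._key(name, labels)].append((ts, value))

    def samples(self, name, labels):
        return list(self._series.get(self._key(name, labels), []))

    def select(self, name, **matchers):
        """Return every series of `name` whose labels include `matchers`."""
        found = {}
        for (series_name, label_items), points in self._series.items():
            labels = dict(label_items)
            if series_name == name and all(
                labels.get(k) == v for k, v in matchers.items()
            ):
                found[label_items] = list(points)
        return found
```

Recording `http_requests_total` with the labels `method="GET", code="200"` and again with `code="500"` produces two separate series, which is what allows slicing one metric along any label dimension.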

Prometheus has become especially popular in the cloud native ecosystem, with great Kubernetes integration. The ease of tracking new metrics with Prometheus has resulted in many applications exposing a wide variety of custom metrics for collection. These usually go well beyond the standard resource utilization metrics we’d traditionally think of when it comes to monitoring. As an example, the popular Kubernetes nginx-ingress project exposes metrics such as upstream latency, process connections, request duration, and request size. When Prometheus is running in the same cluster, it can easily collect the metrics exposed by the many applications like nginx-ingress that support Prometheus out of the box.

In addition to all the tools that have Prometheus support built in, it’s rather straightforward to export custom metrics for your own application. Having these kinds of custom metrics monitored for your application can provide a great deal of insight into how your application is running, along with exposing any potential problems before they become more outwardly visible.
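
To give a sense of what exporting a custom metric involves, the sketch below hand-rolls a labeled counter and renders it in the Prometheus text exposition format, which an application would typically serve over HTTP for Prometheus to scrape. In practice an official Prometheus client library would handle this. The class `LabeledCounter` and the metric `orders_total` are hypothetical, and the sketch assumes label values contain no quotes, backslashes, or newlines, which would otherwise need escaping.

```python
class LabeledCounter:
    """A monotonically increasing counter with optional labels."""

    def __init__(self, name, help_text):
        self.name = name
        self.help_text = help_text
        self._values = {}

    def inc(self, amount=1, **labels):
        key = tuple(sorted(labels.items()))
        self._values[key] = self._values.get(key, 0) + amount

    def render(self):
        """Render the counter in the Prometheus text exposition format."""
        lines = [
            f"# HELP {self.name} {self.help_text}",
            f"# TYPE {self.name} counter",
        ]
        for key, value in sorted(self._values.items()):
            if key:
                label_text = ",".join(f'{k}="{v}"' for k, v in key)
                lines.append(f"{self.name}{{{label_text}}} {value}")
            else:
                lines.append(f"{self.name} {value}")
        return "\n".join(lines) + "\n"


if __name__ == "__main__":
    orders = LabeledCounter("orders_total", "Orders processed by outcome.")
    orders.inc(status="ok")
    orders.inc(status="failed")
    print(orders.render(), end="")
```

A business-level counter like this is the kind of signal that resource utilization metrics cannot provide: a rising share of failed orders can be spotted even while CPU and memory look normal.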

Request Tracing Provides End-to-End Visibility

With cloud native architectures, requests often end up triggering a series of additional requests to supporting microservices. When looking at an individual request, it is helpful to see all the related requests to other microservices. Traditional monitoring solutions didn’t have a great way to find this information. This led to a new form of monitoring, request tracing, a means of connecting all related requests together for better system visibility.

There are some great open source tools focused on request tracing, including Jaeger and Zipkin. These tools allow you to see detailed information about all requests that were spawned by an initial request, providing end-to-end visibility across your microservices. This kind of insight can be invaluable when trying to diagnose bottlenecks in your systems.
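
The core idea behind request tracing can be sketched in a few lines: every unit of work becomes a span, all spans caused by one initial request share a trace ID, and each child span records its parent. The `Tracer` class and the service names below are hypothetical and run in a single process. In a real deployment the trace and parent IDs would travel between services in request headers, and a tool such as Jaeger or Zipkin would collect the finished spans.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    name: str
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    started: float
    duration: float = 0.0


class Tracer:
    """Collects finished spans in memory for inspection."""

    def __init__(self):
        self.finished = []

    @contextmanager
    def span(self, name, parent=None):
        # A child reuses its parent's trace ID; a root span starts a new trace.
        trace_id = parent.trace_id if parent else uuid.uuid4().hex
        parent_id = parent.span_id if parent else None
        current = Span(name, trace_id, uuid.uuid4().hex[:16], parent_id,
                       time.monotonic())
        try:
            yield current
        finally:
            current.duration = time.monotonic() - current.started
            self.finished.append(current)


def handle_checkout(tracer):
    """Hypothetical request that fans out to two supporting services."""
    with tracer.span("checkout") as root:
        with tracer.span("inventory-lookup", parent=root):
            pass  # a call to the inventory service would go here
        with tracer.span("payment-charge", parent=root):
            pass  # a call to the payment service would go here
    return root.trace_id
```

Grouping the finished spans by trace ID and ordering them by parent reconstructs the full request tree, and the per-span durations show which downstream call is the bottleneck.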

There have been some incredible advances in monitoring technology that help us better understand our systems in a cloud native world. As system architecture evolves, so must monitoring strategies. Open source tools like Jaeger and Prometheus can be a great addition to traditional monitoring solutions, with all components working together to provide a cohesive approach to monitoring. With great monitoring come better and more reliable systems. It’s an investment worth making.

Feature image via Pixabay.
