Observability: From Service Discovery to Service Mesh
Microservices offer a unique opportunity for organizations, with the promise of faster deployment times and lower overhead costs. When paired with a service mesh, the networking between microservices and other services can be automated.
Every service mesh story offers a grand vision of how application deployments in a cloud native environment should work, but the journey is often much more complicated.
One of the biggest challenges that we hear from users is how difficult it is to capture metrics and data from their microservices. It’s a lot harder to fix a bad connection point if you don’t know exactly which service, of the hundreds of services running, is causing that failure. This is why most organizations in the “evaluation” stages of adopting microservices are unsure when they will be able to deploy production applications with this type of architecture.
Service mesh solutions attempt to solve that challenge by providing built-in observability capabilities, but that doesn’t help if you need to include services that reside outside of the mesh, or if you’re unsure of the proper configuration to capture this data. What if, instead of trying to add observability after deploying a service mesh solution, you addressed those capabilities beforehand?
Before diving into service mesh observability, let’s set a baseline for what we mean when we talk about observability at HashiCorp. The term “observability” comes from control theory, where it describes a measure of how well the internal states of a system can be inferred from knowledge of its external outputs. Put simply, can you measure the health and performance of your solution based on the outputs that it produces?
If you look at this definition alone, it can broadly be applied to any workflow. As written in our Tao, we focus on workflows not technologies. When implementing any new tool, we believe you should ask yourself if you can measure its effectiveness. If not, is there a way to enable that? As we unpack observability as it applies to service discovery and service mesh in this post, keep this assumption in mind.
Setting Up an Observable Foundation
Regardless of which service mesh or service discovery tool you utilize, that tool is likely capturing some sort of telemetry metrics. Examples of these metrics are changes to your environment, the status of the tool itself, or general health metrics for the services currently running.
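To make this concrete, the sketch below reads a point-in-time snapshot from a Consul agent's `/v1/agent/metrics` HTTP endpoint and flattens its gauges into a simple mapping. The agent address, the helper names, and the sample payload are assumptions for illustration; the payload is hand-written to mimic the general shape of a response and was not captured from a real cluster.

```python
import json
import urllib.request

# Assumed address of a local Consul agent; adjust for your environment.
CONSUL_ADDR = "http://127.0.0.1:8500"


def fetch_agent_metrics(base_url=CONSUL_ADDR):
    """Fetch the agent's current in-memory metrics snapshot as a dict."""
    with urllib.request.urlopen(f"{base_url}/v1/agent/metrics", timeout=5) as resp:
        return json.load(resp)


def gauges_by_name(snapshot):
    """Map each gauge name to its value; a missing or null section yields {}."""
    return {g["Name"]: g["Value"] for g in snapshot.get("Gauges") or []}


# Hand-written sample in the shape of a metrics response (illustrative values).
SAMPLE_SNAPSHOT = {
    "Timestamp": "2024-01-01 00:00:00 +0000 UTC",
    "Gauges": [
        {"Name": "consul.runtime.alloc_bytes", "Value": 4704344, "Labels": {}},
        {"Name": "consul.runtime.num_goroutines", "Value": 123, "Labels": {}},
    ],
    "Counters": [],
    "Samples": [],
}

if __name__ == "__main__":
    # Swap SAMPLE_SNAPSHOT for fetch_agent_metrics() to query a live agent.
    for name, value in sorted(gauges_by_name(SAMPLE_SNAPSHOT).items()):
        print(f"{name} = {value}")
```

Polling an endpoint like this only ever shows the current state of the agent, which is why the retention question matters.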
To prevent memory or storage exhaustion, a lot of this data is purged after a short time. For instance, our service mesh tool, Consul, retains telemetry data in memory for only one minute by default. This is good for users who want a real-time snapshot of cluster performance, but what if the issue occurs during non-working hours?
Fortunately, there is a solution to this. Many Application Performance Monitoring (APM) solutions (like Datadog, SignalFx or AppDynamics), along with monitoring and charting tools like Prometheus or Grafana, are able to read telemetry data from metrics aggregation servers like “statsite” or “statsd.” By adding only a couple of lines of configuration, you can instruct your service mesh tool to send its telemetry data to these servers and actively monitor its health and performance through dashboards. Many of these dashboards include alerting capabilities that let you know when there has been an interruption in service, which can be critical for reducing outages.
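As a rough illustration of how little configuration is involved, the sketch below generates a Consul agent configuration fragment whose `telemetry` stanza points at a StatsD-compatible collector. The collector address and the `telemetry_config` helper are assumptions for this example; `statsd_address` and `statsite_address` are the Consul telemetry options it relies on.

```python
import json


def telemetry_config(collector_addr, protocol="statsd"):
    """Build a Consul agent config fragment that forwards telemetry.

    protocol selects which telemetry option is set: "statsd" or "statsite".
    """
    keys = {"statsd": "statsd_address", "statsite": "statsite_address"}
    if protocol not in keys:
        raise ValueError(f"unsupported protocol: {protocol!r}")
    return {"telemetry": {keys[protocol]: collector_addr}}


if __name__ == "__main__":
    # Hypothetical collector listening on the conventional StatsD port.
    print(json.dumps(telemetry_config("127.0.0.1:8125"), indent=2))
```

The printed JSON could be saved as a file in the agent's configuration directory; where that directory lives, and how the agent is reloaded, depends on your deployment.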
At this point, you might be asking yourself, “what does this have to do with a service mesh?” To answer that, let’s remember what the benefit of using a tool for service discovery is.
Service Discovery to Service Mesh
Automating how you discover and keep track of new and existing services lets you manage a much larger number of services. Having solved for one bottleneck though, new issues start to arise. First, I can discover and locate services quickly, but I still have to manage the connections between them manually. If I’m using a ticket-based system to open up port connections and whitelist IP addresses, I’ve only increased the volume of requests for the team receiving them; I haven’t improved their workflow.
On top of that, microservices introduce a new problem: services may not have static IP addresses. This means I’ve increased the volume of network requests and introduced a new challenge that traditional methods don’t solve for. This is where a service mesh fits in. Rather than relying on someone to connect them manually, services should be able to discover each other (either by IP address or service name) and establish a secure connection automatically. This is how automated service discovery evolved into service mesh, bringing with it new challenges and capabilities for observability.
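Discovery by service name can be sketched as follows. The code takes a response shaped like the output of Consul's `/v1/health/service/<name>` API and returns the address and port of each instance whose health checks are all passing. The `web` service, the addresses, and the `healthy_endpoints` helper are hypothetical, and the sample payload is hand-written. A mesh would additionally secure the connection, which this sketch does not attempt.

```python
def healthy_endpoints(entries):
    """Return (address, port) for each instance whose checks all pass."""
    endpoints = []
    for entry in entries:
        checks = entry.get("Checks", [])
        if any(c.get("Status") != "passing" for c in checks):
            continue  # skip instances with a warning or critical check
        service = entry["Service"]
        # An empty service address means "use the node's address".
        address = service.get("Address") or entry["Node"]["Address"]
        endpoints.append((address, service["Port"]))
    return endpoints


# Hand-written sample: two instances of a hypothetical "web" service.
SAMPLE_HEALTH = [
    {"Node": {"Address": "10.0.0.11"},
     "Service": {"Service": "web", "Address": "", "Port": 8080},
     "Checks": [{"Status": "passing"}]},
    {"Node": {"Address": "10.0.0.12"},
     "Service": {"Service": "web", "Address": "10.0.1.12", "Port": 8080},
     "Checks": [{"Status": "passing"}, {"Status": "critical"}]},
]

if __name__ == "__main__":
    print(healthy_endpoints(SAMPLE_HEALTH))
```

The caller never needs to know an instance's IP address in advance; it asks for `web` and receives whatever healthy endpoints currently exist.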
Observability in the Service Mesh World
If you use the same control plane for service discovery and as your service mesh provider, that control plane is already configured on Day 0 to send observability data to your metrics collector and dashboard platform. This foundation gives users insight into cluster health and also prepares the environment for more advanced observability concepts like distributed tracing.
Using the same telemetry server address and enabling your service mesh tool to send tracing spans (over gRPC, for example) to your APM dashboards, you can collect even more granular service-level data and pinpoint failures at the request level.
Remember earlier when we discussed the challenge of identifying connection failures in a microservice environment? Distributed tracing helps alleviate some of those difficulties by letting you follow a request from its origin to its eventual failure.
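The idea of following a request to its point of failure can be illustrated with a toy trace. The `Span` type, its fields, and the service names are assumptions made for this sketch and do not correspond to any particular tracing backend's data model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    span_id: str
    parent_id: Optional[str]  # None marks the root span of the trace
    service: str
    error: bool = False


def path_to_failure(spans):
    """Return service names from the root span down to the deepest failing span.

    Returns an empty list when no span in the trace is marked as failed.
    """
    by_id = {s.span_id: s for s in spans}

    def depth(span):
        d = 0
        while span.parent_id is not None:
            span = by_id[span.parent_id]
            d += 1
        return d

    failed = [s for s in spans if s.error]
    if not failed:
        return []
    current = max(failed, key=depth)  # deepest failure is the likely origin
    path = []
    while current is not None:
        path.append(current.service)
        current = by_id.get(current.parent_id)
    return list(reversed(path))


# Hypothetical trace: the error surfaces in "frontend" but starts in "payments".
TRACE = [
    Span("a", None, "frontend", error=True),
    Span("b", "a", "checkout", error=True),
    Span("c", "b", "payments", error=True),
    Span("d", "a", "catalog"),
]

if __name__ == "__main__":
    print(" -> ".join(path_to_failure(TRACE)))
```

Without the parent links that tracing provides, the same failure would show up only as an error in `frontend`, with no indication of which downstream service caused it.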
Getting Started with Your Own Observable Mesh
This post glosses over the complexities of these solutions, but the main point I’m trying to illustrate is that preparing your organization for the leap to service mesh starts with implementing observability at the service discovery level. Chances are your environment is already leveraging a tool like Consul for service discovery. Enabling its telemetry collection features may help reduce the risk of an outage, and it sets you up nicely for making the jump to a service mesh.
There are many steps in between. If you want to learn more about how to monitor Consul telemetry data, follow this learn guide. If you are interested in learning more about distributed tracing with Consul specifically, check out this post on our blog. It links to a couple of demos for setting up tracing with Consul, Datadog, and Jaeger.