HashiCorp sponsored this post.
The service mesh is a hot topic of conversation. Most of the time the focus is on mutual TLS connections, health checks, or observability. These are all worthy of discussion, but there is one foundational assumption that almost no one is talking about: service discovery.
If you’re looking into service mesh, your first question should be: “how are my services finding each other today?” If you do not have an answer to that question, that’s the first problem you will need to tackle.
In this article, I’m going to briefly explain the premise behind service discovery and show why you need it. If you don’t nail service discovery, there’s no point in building a service mesh.
Managing the Network Using Identity, Not IP
Networking was fairly simple in the days when many applications followed the client/server model and you could just route traffic to a monolithic service.
Scalability wasn’t a major factor and the applications were running on bare metal, so hard-coding IP addresses and using them to manage network traffic wasn’t a problem.
Then things got messy once we started to add more services, virtualization, physical and virtual switching, physical and virtual firewalls, and other components in order to scale up the size and scope of our applications. IP addresses became a tight coupling between applications and the underlying physical infrastructure. This slowed our ability not only to innovate internally, but also to move to newer infrastructures and workloads such as cloud and containers.
What if There Was No IP?
IPs will always exist underneath network naming abstractions, but the idea is to remove IPs from the human operator’s viewpoint when identifying services. This question spawns a few other very important questions that get to the heart of networking in a hybrid cloud model. How can services find each other? How is load spread efficiently, and how is failover managed?
The answer is to add abstractions that allow us to manage the network through named identities, rather than IP addresses. With that approach, IP address coding would become invisible to the everyday operator and then scalable, speedy service networking could take place.
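As a minimal sketch of that idea (not any particular product’s API), the snippet below keeps a name-to-address table so that callers ask for `web-app` rather than an IP. The `ServiceRegistry` class, the service names, and the addresses are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ServiceRegistry:
    """Hypothetical in-memory registry: service name -> list of (ip, port)."""
    _services: dict = field(default_factory=dict)

    def register(self, name, ip, port):
        # Instances announce themselves under a stable name.
        self._services.setdefault(name, []).append((ip, port))

    def resolve(self, name):
        # Callers look up by name; the IPs behind it can change freely.
        return list(self._services.get(name, []))


registry = ServiceRegistry()
registry.register("web-app", "10.0.0.11", 8080)
registry.register("web-app", "10.0.0.12", 8080)
registry.register("order-processing", "10.0.1.5", 9090)

print(registry.resolve("web-app"))
```

The operator only ever types the name; which addresses sit behind it is the registry’s concern.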
You might say, “We’ve solved this already. We’ve solved this with DNS.” DNS is just part of the solution. In order for DNS to work, you need several hardware and software components, such as load balancers or other networking components.
These components have complex configurations that map things like a virtual IP address (VIP) to a DNS entry, which provides the human-readable identity we want.
In the example below, I have services named Web-App and Order Processing, both with load balancers in front of them.
This load balancer pattern is common for service networking. These load balancers ensure that traffic is balanced across services, there is a uniform way of linking a service name to a consistent IP, and some failovers can be handled gracefully. This solves the service discovery problem. I can front a dynamic IP with a virtual IP in the load balancer — the entry point — and then underneath it I’ve allowed those individual components to have their own IP addresses of any nature and still find them.
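A rough sketch of this pattern, under some assumptions: the DNS name, the VIP, and the backend addresses are made up, and round-robin is just one balancing strategy a load balancer might use.

```python
import itertools


class VirtualIPLoadBalancer:
    """Hypothetical load balancer: one stable VIP in front of changing backends."""

    def __init__(self, vip, backends):
        self.vip = vip
        self._backends = list(backends)
        self._cycle = itertools.cycle(self._backends)

    def route(self):
        # Round-robin (an assumed strategy): each call returns the next backend.
        return next(self._cycle)


# DNS maps the human-readable name to the VIP, never to an individual backend.
dns = {"web-app.example.internal": "10.0.0.100"}
lb = VirtualIPLoadBalancer("10.0.0.100", ["10.0.0.11", "10.0.0.12"])

picks = [lb.route() for _ in range(4)]
print(picks)
```

Clients resolve the name to the VIP, and only the load balancer needs to know the backend IPs.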
Complicating East-West Traffic
As service-based applications and systems become larger, this load balancer-on-every-service pattern becomes very complex. Load balancers still have an important place in today’s service networks, but many organizations become overwhelmed by the number of devices to manage if they bolt one onto every service and use them to manage all of the traffic within service-based applications (east-west traffic). At scale, the pattern in the previous diagram has several issues:
- Greater latency: Going from one service, to the load balancer, then to the service underneath, and then back, means more network hops than a direct connection.
- High cost: If the load balancer goes down, every instance of the services connected will be unavailable. This requires redundancy of physical devices. This means twice the cost of procuring, implementing and maintaining these devices, regardless of whether they are physical or virtual.
- Complex to manage: The previous diagram shows how this approach requires deep coordination between two teams, an App team and a Network team, to ensure reliability of service and still handle upgrades, code deployments and outages. The processes around managing this system can take days or weeks just to bring new services online.
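The first two points can be put into back-of-the-envelope numbers. This sketch assumes the response retraces the request path through the load balancer and that every service gets its own load balancer plus one standby; both are simplifications, and the function names are hypothetical.

```python
def round_trip_hops(via_load_balancer: bool) -> int:
    # Request path: caller -> (load balancer ->) callee; the response retraces it.
    one_way = 2 if via_load_balancer else 1
    return 2 * one_way


def load_balancers_needed(services: int, redundant: bool = True) -> int:
    # One load balancer per service, doubled when each one needs a standby.
    return services * (2 if redundant else 1)


print(round_trip_hops(True), round_trip_hops(False))
print(load_balancers_needed(50))
```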
This doesn’t mean you should get rid of your load balancers. Ultimately, you should always be architecting to your environment’s requirements. Do you require advanced features like “sticky sessions” in the path of your east-west traffic? Then load balancers absolutely make sense.
The key takeaway here is that many organizations can benefit from experimenting with more direct service discovery mechanisms that work with their existing networking infrastructure and prevent the unnecessary proliferation or manual configuration of load balancers.
Choosing the Right Service Discovery Path
I’ll repeat what I said earlier: If you don’t nail service discovery, there’s no point in implementing a service mesh.
This is where popular service discovery solutions should come up in your team’s conversations. There are large, solve-all-the-challenges solutions like Kubernetes — which gives you a lot more than just service discovery. Consul is another popular open source choice, since it can start out by providing just service discovery and, once you’ve nailed that practice, you can flip on its full-fledged service mesh features. There are plenty of other paths you could take toward service mesh, but let’s explore these two examples as a starting point and even explore how they can be used together.
An Example Architecture with Consul and Kubernetes
Kubernetes has built-in service discovery, load balancers, and internal networking to solve a lot of the issues from the previous patterns discussed. It works best when your complete applications are working inside its clusters. However, that means if you’re not a startup with greenfield applications that are all being built on Kubernetes, you have to migrate a few brownfield applications to start or use it on your greenfield ones. In any post-startup scenario, you’re going to add some heterogeneity to your environment by running new Kubernetes services alongside other non-Kubernetes ones.
In the diagram below, I’ve added a mobile access service on Kubernetes.
A few remaining problems here are:
- The Kubernetes primitives are not available in non-Kubernetes environments.
- Kubernetes service discovery and configuration doesn’t always translate well outside of the cluster in heterogeneous environments, as it typically requires external DNS for resolution. However, it can be done. There are tools that can make integrating with a heterogeneous environment feel more Kubernetes-native, but you are adding more complexity to the solution.
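To make the resolution problem concrete: inside a cluster, a Kubernetes Service is reachable at a DNS name of the form `<service>.<namespace>.svc.<cluster-domain>`, with `cluster.local` as the default cluster domain. The sketch below only builds that name; the `mobile-access` service and `storefront` namespace are hypothetical.

```python
def cluster_dns_name(service, namespace="default", cluster_domain="cluster.local"):
    """Return the in-cluster DNS name Kubernetes gives a Service."""
    return f"{service}.{namespace}.svc.{cluster_domain}"


# Hypothetical service and namespace, for illustration only.
name = cluster_dns_name("mobile-access", namespace="storefront")
print(name)
```

A VM outside the cluster generally cannot resolve such a name without additional DNS setup, which is the gap described above.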
This is where a service discovery-focused tool like Consul — either by itself or working with Kubernetes — can really make a difference.
Consul doesn’t require migration to containerized workloads — it’s a drop-in, standalone service that connects components on VMs, containers, bare metal, and even mainframes. It works within Kubernetes using native primitives, and it allows users to work in environments shared with non-Kubernetes workloads in a Kubernetes-native way.
Service Discovery via Consul
Consul creates a registry to track services and an optional key-value store for scalable, automated configuration changes. Its service registry operates outside of individual services and Kubernetes clusters, so it can manage service networking anywhere.
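As a hedged sketch of what registering with that registry can look like, the snippet below builds a service definition for the Consul agent’s HTTP registration endpoint (`/v1/agent/service/register`, sent with PUT). The service name, addresses, ID scheme, and `/health` path are assumptions for illustration, and the request itself is not sent so that the example runs offline.

```python
import json

CONSUL_HTTP_ADDR = "http://127.0.0.1:8500"  # assumption: local agent on the default port


def build_registration(name, address, port, health_path="/health"):
    """Build a service definition with an HTTP health check."""
    return {
        "Name": name,
        "ID": f"{name}-{address}-{port}",  # hypothetical ID scheme
        "Address": address,
        "Port": port,
        "Check": {
            "HTTP": f"http://{address}:{port}{health_path}",
            "Interval": "10s",
        },
    }


payload = build_registration("order-history", "10.0.2.7", 8080)
url = f"{CONSUL_HTTP_ADDR}/v1/agent/service/register"
body = json.dumps(payload)

# Registering would be an HTTP PUT of `body` to `url`; omitted here on purpose.
print(url)
print(body)
```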
In the diagram above, you see a new service called Order-History boot and automatically get registered in Consul. Automated discovery and failure detection are key features of a good service discovery foundation. Service discovery solutions like Consul need to provide:
- Naming abstractions that mask IP management so that some load balancers, like the ones in the previous diagram, are not necessary on every service. Or, in many existing environments with load balancers, they can be automated using Consul.
- Direct connections that require fewer network hops, resulting in reduced latency.
- Health checks, so that if any instances die or have health issues, the registry will pick that up and avoid returning that address to other services.
- Load leveling that randomly sends traffic to different instances so that no service gets flooded with too many requests.
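The last two requirements can be sketched together: filter out anything that is not passing its health check, then pick randomly among what is left. The registry snapshot and the function names here are hypothetical; a real registry would do this filtering on the server side.

```python
import random

# Hypothetical registry snapshot: instance address -> latest health status.
instances = {
    "10.0.2.7:8080": "passing",
    "10.0.2.8:8080": "critical",
    "10.0.2.9:8080": "passing",
}


def healthy(instances):
    # Only instances whose check is passing are ever returned to callers.
    return sorted(addr for addr, status in instances.items() if status == "passing")


def pick(instances, rng=random):
    # A random choice spreads requests across the healthy instances.
    candidates = healthy(instances)
    if not candidates:
        raise LookupError("no healthy instances")
    return rng.choice(candidates)


print(healthy(instances))
print(pick(instances))
```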
The biggest advantage this kind of service discovery provides is manageability. It removes manual labor — which reduces costs — by automatically adding new services and removing unhealthy ones from the registry. It also removes or dramatically shortens the ticket-based deployment process and instead gives your teams the ability to self-service.
As you may have noticed in the last diagram, in some situations the Network team is no longer needed for application deployments. The App team can interact directly with Consul, and Consul will automate the deployment tasks that the Network team previously had to take care of.
While it’s out of scope for this article, another component that takes networking self-service even further, especially in a compliance-heavy organization, is a policy-as-code engine, which automatically denies or warns developers when they perform an action in Consul that breaks company protocol.
The Criteria For a Solid Service Discovery Foundation
Similar to the benefits of service-based architectures, the benefits of service discovery derive largely from gains in organizational agility. Whether you choose Consul, Kubernetes, both, or another service discovery approach, your deciding criteria should revolve around agility, operational burdens, portability, security, and resilience.
Agility and Automation
You want to be able to update immediately — not just every day, but at will. Discovery and connection of services to the rest of the service network needs to be fast and automatic. Ticket wait times need to be shorter or eliminated with a self-service system.
Operational Burdens
Remove the heavyweight processes of ticketing systems. Also, think about the cognitive burden: you don’t want to have to think a lot about how you’re delivering and managing the intercommunication of all these services while you’re trying to deliver them faster.
Portability
Whether services are running on-prem, in the cloud, or in another datacenter, your service discovery system should work smoothly across all environments.
Security
Services should be able to self-authenticate against a single service registry. This allows the machine to validate its authorization to publish a service, while still giving operators oversight and control.
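One way to picture self-authentication is a signed registration request that the registry verifies before accepting it. This is a hypothetical HMAC-based scheme for illustration only, not a description of how Consul or any other registry implements authorization; the secret and service names are made up.

```python
import hashlib
import hmac

# Hypothetical: secrets issued to each service by operators ahead of time.
ISSUED_SECRETS = {"order-history": b"s3cret-issued-by-operators"}


def sign(service, address, secret):
    # The service signs the claim "I am <service> at <address>".
    message = f"{service}|{address}".encode()
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def authorize_registration(service, address, signature):
    # The registry recomputes the signature; unknown services are rejected.
    secret = ISSUED_SECRETS.get(service)
    if secret is None:
        return False
    expected = sign(service, address, secret)
    return hmac.compare_digest(expected, signature)


good = sign("order-history", "10.0.2.7:8080", ISSUED_SECRETS["order-history"])
print(authorize_registration("order-history", "10.0.2.7:8080", good))
```

Operators keep control by deciding which services are issued a secret, while the registration step itself needs no ticket.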
Resilience
Health checks, automatic rerouting of traffic, and progressive delivery features need to be automatic.
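A minimal sketch of automatic failure detection, assuming a heartbeat-with-timeout (TTL) model: an instance counts as healthy only while it keeps reporting in, so traffic moves away from a dead instance without anyone filing a ticket. The `TTLHealthTracker` class and the 30-second TTL are hypothetical, and time is passed in explicitly to keep the example deterministic.

```python
class TTLHealthTracker:
    """Hypothetical TTL check: healthy only while heartbeats keep arriving."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self._last_seen = {}

    def heartbeat(self, instance, now):
        self._last_seen[instance] = now

    def healthy(self, now):
        # Instances whose last heartbeat is older than the TTL drop out.
        return sorted(i for i, seen in self._last_seen.items() if now - seen <= self.ttl)


tracker = TTLHealthTracker(ttl_seconds=30)
tracker.heartbeat("10.0.2.7:8080", now=0)
tracker.heartbeat("10.0.2.9:8080", now=0)
tracker.heartbeat("10.0.2.7:8080", now=25)  # only one instance keeps reporting

print(tracker.healthy(now=20))
print(tracker.healthy(now=40))
```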
Conclusion: A Service Identity Plane for Your Data Center
The need for service discovery and service mesh starts when you have dynamic workloads (e.g. microservices) at a scale where you can no longer configure those connections manually. The processes and systems put in place to manage this explosive growth across multiple clouds, technologies, and organizations simply cannot scale. As a result, time to delivery suffers.
Because of this inability to move at speed, before you can build a service mesh you need a service that shifts the service identity burden to the application. It needs health checking, dynamic configuration, and secure broadcast authorization. With this service discovery launchpad in place, you can start to think about adding the remaining service mesh features, such as secure communications, observability, and progressive delivery networking controls for patterns like traffic splitting, A/B testing, feature toggles/flags, and more.