Aspen Mesh sponsored this post.
As the idea for Aspen Mesh was formulating in my mind, I had the opportunity to meet with a cable provider’s engineering and operations teams to discuss the challenges they had operating their microservice architecture. When we all gathered in the large, very corporate conference room and exchanged the normal introductions, I could see that something just wasn’t right with the folks in the room. They looked like they had been hit by a truck. The reason for that is what turned this meeting into one of the most influential meetings of my life.
It turned out that the entire team had been up all night working on an outage in some of the services that were part of their guide application. We talked about the issue, how it manifested itself and what impact it had on their customers. But there was one statement that has stuck with me since: “The worst part of this 13-hour outage was that it took us 12 hours to get the right person on the phone; and only one hour to get it fixed…”
That is when I knew that a service mesh could solve this problem and increase the engineering efficiency for teams of all sizes. First, by ensuring that in day-to-day engineering and operations, experts were focused on what they were experts of. And second, when things went sideways, it was the strategic point in the stack that would have all the information needed to root-cause a problem — but also be the place that you could rapidly restore your system.
Day-to-Day Engineering and Operations
A service mesh can play a critical role in day-to-day engineering and operations activities, by streamlining processes, reducing test environments and allowing experts to perform their duties independent of application code cycles. This allows DevOps teams to work more efficiently, by allowing developers to focus on providing value to the company’s customers through applications and operators to provide value to their customers through improved customer experience, stability and security.
The properties of a service mesh can enable your organization to run more efficiently and reduce operating costs. Here are some ways a service mesh allows you to do this:
- Canary testing of applications in production can eliminate expensive staging environments
- Autoscaling of applications can ensure efficient use of resources.
- Traffic management can eliminate duplicated coding efforts to implement retry-logic, load-balancing and service discovery.
- Encryption and certificate management can be centralized to reduce overhead and the need to make application changes and redeployment for changing security policies.
- Metrics and tracing gives teams access to the information they need for performance and capacity planning, and can help reduce rework and over-provisioning of resources.
As organizations continue to shift-left and embrace DevOps principles, it is important to have the right tools to enable teams to move as quickly and efficiently as possible. A service mesh helps teams achieve this by moving infrastructure-like features out of the individual services and into the platform. This allows teams to leverage them in a consistent and compliant manner; it allows Devs to be Devs and Ops to be Ops, so together they can truly realize the velocity of DevOps.
Like it or not, outages happen. And when they do, you need to be able to root-cause the problem, develop a fix and deploy it as quickly as possible to avoid violating your customer-facing SLAs and your internal SLOs. A service mesh is a critical piece of infrastructure when it comes to reducing your MTTR and ensuring the best possible user experience for your customers. Due to its unique position in the platform, sitting between the container orchestration and application, it has the unique ability to not only gather telemetry data and metrics, but also transparently implement policy and traffic management changes at run time. Here are some ways how:
- Metrics can be collected by the proxy in a service mesh and used to understand where problems are in the application, show which services are underperforming or using too many resources, and help inform decisions on scaling and resource optimization.
- Layer 7 traces can be collected throughout the application and correlated together, allowing teams to see exactly where in the call-flow failed.
- Policy can allow platform teams to direct traffic — and in the case of outages, redirect traffic to other, healthier services.
All of this functionality can be collected and implemented consistently across services — and even clusters — without impacting the application or placing additional burden or requirements on application developers.
It has been said that a minute of downtime can cost an enterprise company up to $5600 per minute. In an extreme example, let’s think back to my meeting with the cable provider. If a service mesh could have enabled their team to get the right expert on the phone in half the time, they would have saved $2,016,000.00. That’s a big number, and more importantly, all of those engineers could have been home with their families that night, instead of in front of their monitors.
Feature image via Pixabay.