Near Real-Time Kubernetes at Scale: Increasing App Throughput with Linkerd

19 May 2021 10:25am, by Stephen Reardon and Steve Gray

Stephen Reardon
The one-man band that keeps the show running, Stephen Reardon is the DevOps engineer in the Entain Trading Solutions team, operating hundreds of Kubernetes nodes in the cloud using IaC tooling, chaos engineering tools, and end-to-end monitoring. His main responsibility is operational reliability: keeping the platform resilient, available, and, above all, developer-proof.

Entain is a global sports betting and gaming operator. You may be familiar with our brands. Names such as Ladbrokes, Coral, BetMGM, bwin, Sportingbet, Eurobet, partypoker, partycasino, Gala, and Foxy Bingo are all part of the Entain family.

When it comes to sports betting, to say that speed is essential would be an understatement. When the Dallas Cowboys score, that data needs to be processed in near real-time or the company loses money.

To tackle that challenge, the Trading Solutions / Feeds Team here at Entain Australia built a cutting-edge trading platform on cloud native technologies, including multiple Cloud Native Computing Foundation projects such as Kubernetes, gRPC, and Linkerd.

In this article, we’ll share our journey and lessons learned.

One take-away? Service meshes have a reputation for being incredibly complex and hard to run, but that doesn’t have to be true, especially if you use Linkerd.

Entain’s Trading Platform

Steve Gray
Head of the Trading Solutions team at Entain Australia, Steve Gray manages the platform and team that prices billions of sports and racing outcomes a year. Steve has twenty years of experience across the insurance, health, travel, slot machine, and wagering industries, designing solutions, building teams, and leading agile transformation.

The Australian Trading Solutions Team is responsible for handling the huge amounts of data that enter the business’ price management system day and night. And it’s critical that this data is available to the platform as quickly as possible.

The trading platform is based on Kubernetes and consists of approximately 300 microservices with over 3,000 pods per cluster distributed across multiple geographic regions.

To generate revenue, Entain relies on sports prices, which are based on the probability of an outcome. As users bet on live events, the platform handles thousands of requests per second and the experience must be as close to real time as possible. Any outage affects not only the user experience but also the accuracy of the prices offered and, ultimately, revenue. Even a small latency hit can have huge implications.

Challenges with Performance, Scale, and Reliability

We run a 24×7 environment and leverage Amazon Web Services’ spot instances to keep costs low. To ensure a resilient platform and applications, we also use chaos engineering tools and practices. Our environment is constantly changing, but despite this, it’s our job to ensure reliable application performance.

While microservices and gRPC helped us achieve greater efficiency and reliability gains, the default load balancing in Kubernetes compromised performance. Sometimes the requests from a single pod in a service would end up going to a single instance, or very few instances, of another service. While this worked, it had negative implications for the platform. For one, we needed very large servers to accommodate high traffic volumes. Additionally, we weren’t able to use horizontal scaling to process a larger number of requests as much as we’d expected. This, in turn, affected our ability to take advantage of spot instances or process a high volume of requests in a timely manner.

We also struggled with a lack of intelligent routing. To hit our availability targets, we span individual clusters across multiple AWS availability zones (AZs) within one region (Australia). That way, no one AZ becomes a single point of failure. While this is likely not an issue for smaller Kubernetes deployments, at Entain’s scale and request volume, cross-AZ traffic became a tangible source of latency and cost. Transactions that needlessly crossed an AZ boundary slowed the platform down and incurred additional AWS charges.

Quick and Easy Gains with a Service Mesh

To tackle those issues, we chose Linkerd, the CNCF open source service mesh.

As soon as we rolled it out to all our pods, Linkerd’s micro-proxy took over routing requests between instances. Our platform was immediately able to route traffic away from failing pods or those being spun down.

Improvements in load balancing came instantly. Because Linkerd has gRPC-aware load balancing, it immediately fixed the gRPC load balancing issue on Kubernetes and started balancing requests properly across all destination pods.
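To make the difference concrete, here is a minimal simulation sketch, not taken from our platform. It assumes that each client pod holds one long-lived gRPC connection, so balancing at connection time pins all of that client's requests to one backend, while a gRPC-aware proxy picks a backend per request. The function names, pod counts, and request counts are all hypothetical.

```python
from collections import Counter
import random


def connection_level(num_clients, backends, requests_per_client, rng):
    """Each client opens ONE long-lived connection; the backend is chosen
    only once, at connect time, and every later request reuses it."""
    hits = Counter({b: 0 for b in backends})
    for _ in range(num_clients):
        pinned = rng.choice(backends)        # decided when connecting
        hits[pinned] += requests_per_client  # all requests follow it
    return hits


def request_level(num_clients, backends, requests_per_client, rng):
    """A request-aware proxy chooses a backend for every single request."""
    hits = Counter({b: 0 for b in backends})
    for _ in range(num_clients * requests_per_client):
        hits[rng.choice(backends)] += 1
    return hits


def spread(hits):
    """Busiest backend's load divided by the mean load (1.0 means even)."""
    mean = sum(hits.values()) / len(hits)
    return max(hits.values()) / mean


if __name__ == "__main__":
    rng = random.Random(42)
    backends = [f"pod-{i}" for i in range(10)]  # hypothetical pod names
    conn = connection_level(5, backends, 10_000, rng)
    req = request_level(5, backends, 10_000, rng)
    print("connection-level busiest/mean:", round(spread(conn), 2))
    print("request-level busiest/mean:   ", round(spread(req), 2))
```

With five clients and ten backends, connection-level balancing leaves at least half of the backends idle, which is why adding more small pods did little for us before the mesh took over request routing.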

With Linkerd, we experienced two major business gains: 1) The platform could handle a 10x higher request volume, and 2) we were able to use horizontal scaling to add smaller pods to a service. This allowed us to access a broader range of AWS spot instances and further drive down our compute costs while delivering better performance.

There was also an unexpected side benefit. Kubernetes’ default load balancing selects endpoints naively, in a round-robin or random fashion, basically working through an endpoint list and distributing the load across it. As a result, requests can be routed to a pod on any node in the cluster without considering latency, saturation, or proximity to the calling service.

Linkerd, on the other hand, looks at all potential endpoints and selects the optimal target based on an exponentially weighted moving average (EWMA) of latency. As soon as we added Linkerd to our clusters, we saw faster response times and lower cross-AZ traffic costs. Linkerd’s built-in EWMA routing algorithm automatically keeps more traffic inside an availability zone, leading to a significant reduction in bandwidth costs and savings of thousands of dollars a day. All without any configuration on our part!
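The idea behind latency-based selection can be sketched in a few lines. This is a simplified illustration and not Linkerd's actual implementation: it keeps one EWMA of latency per endpoint and, as an assumption made for this example, compares two randomly sampled endpoints per request and keeps the faster one. The zone names, latencies, and smoothing factor are hypothetical.

```python
import random


class EwmaEndpoint:
    """Tracks an exponentially weighted moving average of observed latency."""

    def __init__(self, name, zone, alpha=0.3, initial_ms=1.0):
        self.name = name
        self.zone = zone
        self.alpha = alpha          # weight given to the newest sample
        self.ewma_ms = initial_ms

    def observe(self, latency_ms):
        # new_ewma = alpha * sample + (1 - alpha) * old_ewma
        self.ewma_ms = self.alpha * latency_ms + (1 - self.alpha) * self.ewma_ms


def pick(endpoints, rng):
    """Sample two endpoints at random and keep the one with the lower EWMA."""
    a, b = rng.sample(endpoints, 2)
    return a if a.ewma_ms <= b.ewma_ms else b


def simulate(num_requests=10_000, seed=7):
    """Count how many requests land in each zone for a caller in az-a."""
    rng = random.Random(seed)
    # Hypothetical latencies: same-AZ pods answer in ~1 ms, cross-AZ in ~3 ms.
    base_ms = {"az-a": 1.0, "az-b": 3.0}
    endpoints = [EwmaEndpoint(f"pod-{i}", "az-a" if i < 3 else "az-b")
                 for i in range(6)]
    per_zone = {"az-a": 0, "az-b": 0}
    for _ in range(num_requests):
        ep = pick(endpoints, rng)
        latency = base_ms[ep.zone] * rng.uniform(0.8, 1.2)  # add jitter
        ep.observe(latency)
        per_zone[ep.zone] += 1
    return per_zone


if __name__ == "__main__":
    print(simulate())
```

In this toy setup, a selector that ignores latency would send about half of the requests across the AZ boundary. The EWMA-based picker sends most of them to the same-AZ pods, simply because those pods answer faster. No zone information is used in the decision.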

The difference was remarkable. It was instantly visible in our monitoring and led to improvements across the board. Our bandwidth usage spikes disappeared and CPU usage leveled out across the platform. We went from a fraction of our servers going flat out at their line speed (20 Gbit/s) to an even balance, with no server sustaining more than 9 Gbit/s. Linkerd made a huge difference.

Turnkey Service Mesh, the Easy Way

When selecting a service mesh, you can go the easy or hard route. To quote the Buoyant team, the engineers behind the service mesh, we needed “all of the service mesh but none of the service mess.”

We considered Istio but, after some research, we concluded that, given our platform’s complexity, we would need a dedicated team just to run it. It’s too complicated, requiring ongoing, active attention. Yes, it has a ton of great features, but we didn’t need them all. When you work on an app day-to-day, you don’t have time to tweak and fine-tune the mesh underneath it.

Since we didn’t have the bandwidth to learn and run a complicated new tool, Linkerd’s promise of simplicity was really appealing. Being Kubernetes-native, it would also work with our current architecture. There was no need to introduce a large number of new custom resource definitions or force an application or environment restructuring.

It took us five to six hours to install, configure, and migrate 300 services to the service mesh. It’s just the go-get-command and then the install process, and job done! It’s that simple, and it just works.

We’re All Happy Campers

Within a week, we took Linkerd into large-scale and highly performant Kubernetes clusters. With 3,000 pods in a single Kubernetes namespace, we have a massive deployment. It may be hard to believe, but only one DevOps engineer manages our entire infrastructure — and he’s the happiest camper of all.

No one had to become a service mesh or proxy expert. Linkerd just sits in the background and does its job. It solved our gRPC load balancing issue and augmented standard Kubernetes constructs, allowing us to move forward without reconfiguring our app. We increased the request volume ceiling over tenfold, reduced operating costs, and hit our availability targets. Not only did it solve the problems we sought to address, it even solved those we didn’t realize we had.

Whenever a service or app fails and Entain Australia stops taking bets, it has a direct financial impact. We have to ensure we’re available, reliable, and rock-solid, and Linkerd is a part of our strategy for achieving that.

The New Stack is a wholly owned subsidiary of Insight Partners. TNS owner Insight Partners is an investor in the following companies: MADE, Real.

