Trivago Achieves ‘Regional Independence’ with Istio, Kubernetes
Online travel service Trivago performed a “regional failover” test to see if its sites could sustain traffic when a Google Cloud region was removed from production. And thanks to the backend rewrite done in 2020, the test was a success.
The goal was to reach “regional independence,” an idea first introduced in 2015 at Trivago’s yearly tech conference: each locale (i.e. trivago.de, trivago.com, trivago.jp, etc.) could be served by any region in the system, ideally the “geographically closest” of Trivago’s data centers, explained Arne Claus, a site reliability engineer at Trivago, in a recent blog post.
Initially, such a feat was not possible. At that time, locales were sharded across Trivago’s on-premises data centers, meaning a user going to trivago.de would always be served from Germany, regardless of their geographical location. “Regional independence” wasn’t yet feasible because too many systems relied on that regional separation, especially when it came to the data.
Fast forward to 2020: the complete application rewrite was underway, and everything seemed possible. The backend, once two monoliths, is now composed of multiple microservices running on Kubernetes on Google Cloud (GKE), with Kafka taking over replication duties from MySQL replication.
The reduced traffic resulting from the COVID-19 pandemic, paired with the team closing in on completing the backend API rewrite, signaled it was time to attempt “regional independence.”
Trivago’s Approach to ‘Regional Independence’
“It was very important to make sure no services were relying on locale-based sharding anymore,” Claus wrote.
Most services had already shed that dependency as a result of the backend rewrite, and most of the remaining ones were adjusted.
The next step was to make sure all regions could, in theory, handle the same amount of traffic. This was done by increasing the maximum capacity of several components. Trivago had no issues doing this on Google Cloud.
The diagram below illustrates their ingress setup at that time.
In the updated setup, incoming requests enter through Akamai. Akamai receives each request and sends it to the appropriate load balancer based on the user’s location. The load balancer then forwards the request to the Istio ingress gateway in its assigned region.
Trivago also created a parallel global load balancer and attached its existing regional backends to it without having to modify the current setup. A rule was added in Akamai to split traffic between the regional and global load balancers, allowing for proper testing.
And here’s the cool part, according to Claus: Trivago hid the routing decision behind a feature flag header that was passed all the way down to each microservice. This was possible because Trivago manages its Google Cloud ingress separately from its Kubernetes ingress, which allowed it to transparently route traffic even inside the Istio service mesh based on whether the traffic was regional or global. That was useful for services that were still in the process of having their locale-based sharding removed, or whose teams were still verifying that their implementation worked as intended.
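Routing on a feature flag header of this kind can be expressed in Istio with a `VirtualService` that matches on the header. The service name, header name, and subset names below are hypothetical; this is a minimal sketch using Istio’s standard traffic-routing API, not Trivago’s actual configuration.

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: example-service            # hypothetical service name
spec:
  hosts:
    - example-service
  http:
    - match:
        - headers:
            x-routing-mode:        # hypothetical feature flag header
              exact: global
      route:
        - destination:
            host: example-service
            subset: global         # workloads that no longer shard by locale
    - route:                       # default: keep locale-sharded behavior
        - destination:
            host: example-service
            subset: regional
```

The `global` and `regional` subsets would be defined in a matching `DestinationRule`; a service’s default route can then be flipped once its locale-based sharding is fully removed.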
Running the ‘Regional Failover Test’
The move to global routing was completed just in time for the first “regional failover” test in 2021. To simulate a regional outage, Trivago scaled the Istio ingress gateway of one of its regions to zero.
Overall, the test was a “huge success,” Claus said. The global load balancer started distributing traffic to the nearest working region within seconds. Aside from one known issue, a few old services in Trivago’s on-premises data centers that still relied heavily on regional sharding, everything ran smoothly.
The testing, however, was not without surprises. The first came from the Istio ingress gateway: when a failover test is run with a decent number of active users on the platform and then reverted, all of that traffic returns to the original region at the same time, very quickly. This in effect forced a “load test by accident.” Trivago was able to handle all the traffic, so the impact was minimal.
The second surprise came when Trivago tested the Europe region. EU traffic went to the U.S., where it was the middle of the night and there was normally little to no traffic. This was fine for Trivago’s servers, but the servers of U.S. advertisers like Expedia or booking.com weren’t prepared for such a sizable increase in traffic and struggled to handle the load.
For future failover tests, Trivago will choose regions with less traffic and add a feature it calls “increase-based rate limiting.” This feature gives the infrastructure enough time to scale up and protects against other traffic spikes by using a flexible threshold that automatically adjusts to traffic using a time-window approach. Traffic is limited if the rate increases too quickly; after some time, Trivago increases the threshold based on demand and repeats the whole process.
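The mechanism described above can be sketched as a sliding-window limiter whose ceiling ratchets up toward observed demand. The class and parameter names here are hypothetical, a minimal illustration of the idea rather than Trivago’s actual implementation.

```python
import time
from collections import deque


class IncreaseBasedRateLimiter:
    """Hypothetical sketch of "increase-based rate limiting": a sliding
    time window tracks the request rate, and requests are shed only when
    the rate outgrows a flexible threshold that adapts over time."""

    def __init__(self, window_seconds=10.0, max_increase_factor=2.0,
                 initial_threshold=100.0):
        self.window = window_seconds          # size of the sliding window
        self.max_increase = max_increase_factor
        self.threshold = initial_threshold    # allowed requests per window
        self.events = deque()                 # timestamps of admitted requests

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        # Evict requests that fell out of the sliding window.
        while self.events and now - self.events[0] > self.window:
            self.events.popleft()
        if len(self.events) >= self.threshold:
            return False                      # rate grew too fast: shed request
        self.events.append(now)
        return True

    def adapt(self):
        # Called periodically: raise the ceiling toward observed demand,
        # so sustained legitimate growth is eventually admitted while the
        # backends get time to scale up in between.
        self.threshold = max(self.threshold,
                             len(self.events) * self.max_increase)
```

During a failover, the sudden return of traffic hits the threshold and is shed; periodic calls to `adapt()` then raise the ceiling step by step, giving the target region time to scale.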
Claus explained: “Another benefit of having the whole process behind a feature flag was seeing the effects of global routing in our business metrics.” Trivago found the results quite interesting.
- For most locales, not much changed; 99% of users were routed to the same region. Trivago’s sharding was spot on.
- In some locales, especially those at the geographic “edge” between two Google Cloud regions, 47% of users were distributed across two or three regions.
- In about 17% of locales, more than 5% of users were affected.
As for the “splitting” of users, this wasn’t a huge surprise for Trivago. To put it into context: a country like India sits at the edge between Asia and Europe from a latency perspective, so Trivago predicted a “north-south” split, and that’s exactly what happened.
“All in all, we’re very happy with the results,” Claus said. The application rewrite laid the groundwork that made this possible; microservices, Google Cloud, Kubernetes, Istio, and Kafka are some of the technologies that helped.