
Sailing Faster with Istio, Part 2

20 May 2021 9:00am, by Neeraj Poddar

While our intrepid team tackled the problem of scaling for high-throughput workloads by adding sidecar proxies, we encountered a bottleneck not entirely unlike what the Ever Given experienced not long ago in the Suez Canal. (In part 1 of this series, you can read all about how we found this low CPU utilization along with unbalanced proxy worker thread connections while running a Fortio client. During this scaling experiment, queries per second (QPS) remained stubbornly low in spite of adding sidecar proxies, an unacceptable result.)

Luckily, we had a few more things to try, and we were ready to take a closer look at the listener metrics.

Let There Be Equality Among Threads!

Neeraj Poddar
Neeraj is a technical leader with a proven record of delivering outcomes in fast-paced and early-stage organizations. His passion for startups and solving complex distributed system problems led him to co-found Aspen Mesh where he leads the engineering efforts as chief architect. He’s an open source enthusiast and an Istio Technical Oversight Committee (TOC) and Steering Committee member.

As we talked about in part 1, we quickly stumbled upon this issue, which nicely sums up our CPU utilization and worker thread balance problem and the various efforts made to fix it. After parsing through the conversations in the issue, we found the pull request that enabled a configuration option to turn on a feature to achieve better balancing across worker threads. At this point, trying this out seemed worthwhile, so we looked at how to enable this in Istio. (Note that as part of this PR, the per-worker thread metrics were added, which was useful in diagnosing this problem.)
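For reference, those per-worker stats follow Envoy's listener stat naming, with one set of counters per worker thread. For a listener bound to “0.0.0.0:8080,” they look like this (the address and worker count depend on your deployment):

```
listener.0.0.0.0_8080.worker_0.downstream_cx_total
listener.0.0.0.0_8080.worker_0.downstream_cx_active
listener.0.0.0.0_8080.worker_1.downstream_cx_total
listener.0.0.0.0_8080.worker_1.downstream_cx_active
```

Comparing the `downstream_cx_total` counters across workers is what reveals whether connections are being spread evenly or piling up on one thread.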

For all the ignoble things EnvoyFilter can do in Istio, it’s useful in situations like these to quickly try out new Envoy configuration knobs without making code changes in “istiod” or the control plane. To turn the “exact balance” feature on, we created an EnvoyFilter resource like this:
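A sketch of such an EnvoyFilter, assuming a Fortio server pod labeled `app: fortio-server` listening on port 8080 (the name, labels and port here are illustrative; `connection_balance_config: exact_balance` is the Envoy listener option the PR exposed):

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: server-listener-exact-balance   # illustrative name
spec:
  workloadSelector:
    labels:
      app: fortio-server                # assumes the server pods carry this label
  configPatches:
  - applyTo: LISTENER
    match:
      context: SIDECAR_INBOUND
      listener:
        portNumber: 8080                # the server's inbound service port
    patch:
      operation: MERGE
      value:
        connection_balance_config:
          exact_balance: {}
```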

With this configuration applied and with bated breath, we ran the experiment again and looked at the per-worker thread metrics. Voila! Look at the perfectly balanced connections in the image below:

Measuring the throughput with this configuration set, we could achieve around 80,000 QPS, a significant improvement over the earlier results. Looking at CPU utilization, we saw that all the CPUs were fully pegged at or near 100%, which meant we were finally seeing CPU throttling. At this point, by adding more CPUs and a bigger machine, we could achieve much higher numbers, as expected. So far so good.

As you may recall, this experiment was purely to test the effects of the server sidecar proxy, so we had removed the client sidecar proxy for these tests. It was now time to measure performance with both sidecars added.

Measuring the Impacts of a Client Sidecar Proxy

With this exact balancing configuration enabled on the inbound port (server side only), we ran the experiment with sidecars on both ends. We were hoping to achieve high throughput limited only by the number of CPUs dedicated to Envoy worker threads. If only things were that simple.

We found that the maximum throughput was once again capped at around 20K QPS.

A bit disappointing, but since we then knew about the issue of connection imbalance on the server side, we reasoned that the same could happen on the client side between the application and the sidecar proxy container on localhost. First, we enabled the following metrics on the client-side proxy:
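One way to enable those extra stats is Istio's `sidecar.istio.io/statsInclusionPrefixes` pod annotation, which widens the set of Envoy stats the sidecar emits beyond Istio's default set (the exact prefixes below are illustrative):

```yaml
# Pod template snippet for the client deployment.
metadata:
  annotations:
    sidecar.istio.io/statsInclusionPrefixes: "listener,cluster.outbound"
```

With this in place, the listener and outbound cluster counters show up on the sidecar's Prometheus/stats endpoint alongside the default Istio metrics.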

In addition to the listener metrics, we also enabled cluster-level metrics, which emit total and active connections for any upstream cluster. We wanted to verify that the client sidecar proxy was sending a sufficient number of connections to the upstream Fortio server cluster to keep the server worker threads occupied. We found that the number of active connections mirrored the number of connections used by the Fortio client in our command. This was a good sign. Note that Envoy doesn’t report cluster-level metrics at the per-worker level, but these are all aggregated, so there’s no way for us to know how the connections were distributed on the outbound side.
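The cluster-level counters in question follow Istio's cluster naming convention. For a Fortio server service on port 8080, they would look something like this (service name and namespace are assumptions for this setup):

```
cluster.outbound|8080||fortio-server.default.svc.cluster.local.upstream_cx_total
cluster.outbound|8080||fortio-server.default.svc.cluster.local.upstream_cx_active
```

Note that, unlike the listener stats, there is no `worker_<id>` segment in these names, which is why the outbound distribution across threads can't be read from them.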

Next, we inspected the listener connection statistics on the client side, just as we had on the server side, to ensure we weren't hitting the same connection imbalance. The outbound listeners, that is, the listeners set up to handle traffic originating from the application in the same pod as the sidecar proxy, are set up a bit differently in Istio than the inbound side. For outbound traffic, a virtual listener is created on “0.0.0.0:15001,” similar to the inbound listener on “0.0.0.0:15006”; both are targets of the iptables redirect rules. Unlike the inbound side, the virtual listener hands off the connection to a more specific listener, like “0.0.0.0:8080,” based on the original destination address. If there is no specific match, the listener configuration in the virtual outbound takes effect, which can block or allow all traffic depending on your configured outbound traffic policy. In the traffic flow from the Fortio client to the server, we expected the listener at “0.0.0.0:8080” to handle connections on the client-side proxy, so we inspected the connection metrics at this listener. The listener metrics looked like this:

The above image shows a connection imbalance between worker threads similar to what we saw on the server side. In fact, the connections on the outbound client-side proxy were being handled by only one worker thread, which explains the poor throughput QPS numbers. Having fixed this on the server side, we applied a similar EnvoyFilter configuration, with minor tweaks for context and port, to address this imbalance:
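A sketch of that outbound variant, assuming a client pod labeled `app: fortio-client` and the same service port (names and labels are illustrative):

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: client-listener-exact-balance   # illustrative name
spec:
  workloadSelector:
    labels:
      app: fortio-client                 # assumes the client pods carry this label
  configPatches:
  - applyTo: LISTENER
    match:
      context: SIDECAR_OUTBOUND          # the "minor tweak": outbound context
      listener:
        portNumber: 8080                 # the outbound listener for the server port
    patch:
      operation: MERGE
      value:
        connection_balance_config:
          exact_balance: {}
```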

Surely, applying this resource would fix our issue and we would be able to achieve high QPS with both client and server sidecar proxies with sufficient CPUs allocated to them. Well, we ran the experiment again and saw no difference in the throughput numbers. Checking the listener metrics again, we saw that even with this EnvoyFilter resource applied, only one worker thread was handling all the connections. We also tried applying the exact balance config on both virtual outbound port 15001 and outbound port 8080, but the throughput was still limited to 20K QPS.

This warranted the next round of investigations.

Original Destination Listeners, Exact Balance Issues

We dug into the Envoy code and opened GitHub issues to understand why the client-side exact balance configuration was not taking effect while the server side was working wonders. The key difference between the two listeners, other than directionality, was that the virtual outbound listener “0.0.0.0:15001” is an original destination listener, which hands connections over to other listeners matched on the original destination address. With help from the Istio community (thanks, Yuchen Dai from Google), we found this open issue, which explains this behavior in a rather cryptic way.

Basically, the current exact balance implementation relies on per-worker-thread connection counters to fix the imbalance. When original destination is enabled on the virtual outbound listener, the counter on a worker thread is incremented when a connection is received, but because the connection is immediately handed off to the more specific listener like “0.0.0.0:8080,” it is decremented again right away. This rapid increase and decrease in the internal count fools the exact balancer into thinking the balance is perfect, as all the counters are always at zero. It also appears that applying exact balance to the listener that handles the connection (“0.0.0.0:8080” in this case) but doesn’t accept it from the kernel has no effect, due to current implementation limitations.
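The failure mode can be illustrated with a toy model (this is not Envoy's actual code, just a sketch of the counter-spoofing effect): a least-loaded picker that consults a per-worker counter, where an immediate hand-off resets that counter before the next pick.

```python
# Toy model of "exact balance" defeated by an original-destination
# listener: the balancer routes each new connection to the worker with
# the fewest tracked connections, but an immediate hand-off to a more
# specific listener drops the counter back to zero, so every pick looks
# like a tie and the first worker wins every time.

class Worker:
    def __init__(self, name: str):
        self.name = name
        self.tracked = 0  # counter the exact balancer consults
        self.handled = 0  # connections this worker actually processed

def accept(workers: list, hand_off_immediately: bool) -> None:
    # Exact balance: send the new connection to the least-loaded worker.
    w = min(workers, key=lambda w: w.tracked)
    w.tracked += 1
    w.handled += 1
    if hand_off_immediately:
        # Original-destination case: the connection is handed to the
        # matching specific listener, decrementing the counter at once.
        w.tracked -= 1

def distribute(n: int, hand_off_immediately: bool) -> list:
    workers = [Worker(f"worker_{i}") for i in range(4)]
    for _ in range(n):
        accept(workers, hand_off_immediately)
    return [w.handled for w in workers]

print(distribute(1000, hand_off_immediately=True))   # [1000, 0, 0, 0]
print(distribute(1000, hand_off_immediately=False))  # [250, 250, 250, 250]
```

With the hand-off in place, all 1,000 connections land on one worker, which mirrors the single-threaded behavior seen in the client-side listener metrics; without it, the same picker balances perfectly.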

Fortunately, the fix for this issue is in progress, and we’ll be working with the community to get this addressed as quickly as possible. In the meantime, if you’re getting hit by these performance issues on the client side, scaling out with a lower concurrency setting is a better approach to reach higher throughput QPS numbers than scaling up with higher concurrency and worker threads. We are also working with the Istio community to provide configuration knobs for enabling exact balance in Envoy to optionally switch default settings so that everyone can benefit from our findings.
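As a sketch of that scale-out workaround, Istio's `proxy.istio.io/config` pod annotation can cap the sidecar's worker-thread concurrency per pod while the deployment's replica count is increased (the value here is illustrative):

```yaml
# Pod template snippet: fewer Envoy worker threads per sidecar,
# compensated by running more replicas of the workload.
metadata:
  annotations:
    proxy.istio.io/config: |
      concurrency: 2
```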

Working on this performance analysis was interesting and a challenge in its own way, like the small tractor next to the giant ship trying to make it move.

Well, maybe not exactly, but it was a learning experience for me and my team, and I’m glad we are able to share our learnings with the rest of the community, as this aspect of Istio is often overlooked by the broader vendor ecosystem. We will run and publish performance numbers related to the impact of turning on various features such as mTLS, access logging and tracing in high-throughput scenarios in future blogs, so if you’re interested in this topic, subscribe to our blog to get updates or reach out to us with any questions.

I would like to give a special mention to my team members Pawel and Bart, who patiently and diligently ran various test scenarios, collected data and were uncompromising in their pursuit to get the last bit of performance out of Istio and Aspen Mesh. It’s not surprising; after all, as part of F5, we consider taking performance seriously to be part of our DNA.

Featured image via Pixabay.
