Aspen Mesh sponsored this post.
While the extraordinarily large shipping container, Ever Given, ran aground in the Suez Canal, halting a major trade route that has caused losses in the billions, our solution engineers at Aspen Mesh have been stuck diagnosing a tricky Istio and Envoy performance bottleneck on their own island for the past few weeks. Though the scale and global impacts of these two problems is quite different, it has presented an interesting way to correlate a global shipping event with the metaphorical nautical themes used by Istio. To elaborate on this theme, let’s switch from containers carrying dairy, and apparently everything else under the sun, to containers shuttling network packets.
To unlock the most from containers and microservices architecture, Istio (and Aspen Mesh) uses a sidecar proxy model. Adding sidecar proxies into your mesh provides a host of benefits, from uniform identity to security to metrics and advanced traffic routing. As Aspen Mesh customers range from large enterprises all the way to service providers, the performance impacts of adding these sidecars is as important to us as the benefits outlined above. The performance experiment that I’m going to cover in this blog is geared toward evaluating the impact of adding sidecar proxies in high throughput scenarios on the server or client, or both sides.
We have encountered workloads, especially in the service provider space, where there are high requests or transactions-per-second requirements for a particular service. Also, scaling up — i.e., adding more CPU/memory — is preferable to scaling out. We wanted to test the limits of sidecar proxies with regards to the maximum achievable throughput so that we can tune and optimize our model to meet the performance requirements of the wide variety of workloads used by our customers.
Throughput Test Setup
The test setup we used for this experiment was rather simple: a Fortio client and server running on Kubernetes on large AWS node instance types like burstable t3.2xlarge with 8 vCPUs and 32 GB of memory or dedicated m5.8xlarge instance types which have 32 vCPUs and 128 GB of memory. The test was running a single instance of the Fortio client and server pod with no resource constraints on their own dedicated nodes. The Fortio client was run in a mode to maximize throughput like this:
for (( k = 2; k < 9000; k=k*2 )); do
/usr/bin/fortio load -jitter=False -c $k -qps 0 -t 60s -a http://fortio-server:8080/echo\?size\=1024’
The above command runs the test for 60 seconds with queries per second (QPS) 0 (i.e. maximum throughput with a varying number of simultaneous parallel connections). With this setup on a t3.2xlarge machine, we were able to achieve around 100,000 QPS. Further increasing the number of parallel connections didn’t result in throughput beyond ~100K QPS, signaling a possible CPU bottleneck. Running the same experiment on an m5.8xlarge instance, we could achieve much higher throughput around 300,000 QPS or higher depending upon the parallel connection settings.
This was sufficient proof of CPU throttling. As adding more CPUs increased the QPS, we felt that we had a reasonable baseline to start evaluating the effects of adding sidecar proxies in this setup.
Adding Sidecar Proxies on Both Ends
Next, with the same setup on t3.2xlarge instances, we added Istio sidecar proxies on both Fortio client and server pods with Aspen Mesh default settings; mTLS STRICT setting, access logging enabled and the default concurrency (worker threads) of 2. With these parameters, and running the same command as before, we could only get a maximum throughput of around ~10,000 QPS.
This is a factor of 10 reduction in throughput. This was expected as we had only configured two worker threads, which were hopefully running at their maximum capacity but could not keep up with client load.
So, the logical next step for us was to increase the concurrency setting to run more worker threads to accept more connections and achieve higher throughput. In Istio and Aspen Mesh, you can set the proxy concurrency globally via the concurrency setting in proxy config under mesh config or override them via pod annotations like this:
Note that using the value “0” for concurrency configures it to use all the available cores on the machine. We increased the concurrency setting from two to four to six and saw a steady increase in maximum throughput from 10K QPS to ~15K QPS to ~20K QPS as expected. However, these numbers were still quite low (by a factor of five) as compared to the results with no sidecar proxies.
To eliminate the CPU throttling factor, we ran the same experiment on m5.8xlarge instances with even higher concurrency settings but the maximum throughput we could achieve was still around ~20,000 QPS.
This degradation was far from acceptable, so we dug into why the throughput was low even with sufficient worker threads configured on the sidecar proxies.
Peeling the Onion
To investigate this issue, we looked at the CPU utilization metrics in the server pod and noticed that the CPU utilization as a percentage of total requested CPUs was not very high. This seemed odd as we expected the proxy worker threads to be spinning as fast as possible to achieve the maximum throughput, so we needed to investigate further to understand the root cause.
To get a better understanding of low CPU utilization, we inspected the connections received by the server sidecar proxy. Envoy’s concurrency model relies on the kernel to distribute connections between the different worker threads listening on the same socket. This means that if the number of connections received at the server sidecar proxy is less than the number of worker threads, you can never fully use all CPUs.
As this investigation was purely on the server-side, we ran the above experiment again with the Fortio client pod, but this time without the sidecar proxy injected and only the Fortio server pod with the proxy injected. We found that the maximum throughput was still limited to around ~20K QPS as before, thereby hinting at issues on the server sidecar proxy.
To investigate further, we had to look at connection level metrics reported by Envoy proxy. (In Sailing Faster with Istio, Part 2, we’ll see what happens to this experiment with Envoy metrics exposed. By default, Istio and Aspen Mesh don’t expose the connection-level metrics from Envoy.)
These metrics can be enabled in Istio version 1.8 and above by following this guide and adding the appropriate pod annotations corresponding to the metrics you want to be exposed. Envoy has many low-level metrics emitted at high resolution that can easily overwhelm your metrics backend for a moderately sized cluster, so you should enable this cautiously in production environments.
Additionally, it can be quite a journey to find the right Envoy metrics to enable, so here’s what you will need to get connection-level metrics. On the server-side pod, add the following annotation:
This will enable reporting for all listeners configured by Istio, which can be a lot depending upon the number of services in your cluster, but only enable the downstream connections total counter and downstream connections active gauge metrics.
To look at these metrics, you can use your Prometheus dashboard, if it’s enabled, or port-forward to the server pod under test to port 15000 and navigate to http://localhost:15000/stats/prometheus. As there are many listeners configured by Istio, it can be tricky to find the correct one. Here’s a quick primer on how Istio sets up Envoy configuration. (You can find the complete list of Envoy listener metrics here.)
For any inbound connections to a pod from clients outside of the pod, Istio configures a virtual inbound listener at 0.0.0.0:15006, which receives all the traffic from iptables’ redirect rules. This is the only listener that’s actually configured to receive connections from the kernel, and after the connection is received, it is matched against filter chain attributes to proxy the traffic to the correct application port on localhost. This means that even though the Fortio client above is targeting port 8080, we need to look at the total and active connections for the virtual inbound listener at 0.0.0.0:15006 instead of 0.0.0.0:8080. Looking at this metric, we found that the number of active connections were close to the configured number of simultaneous connections on the Fortio client side. This invalidated our theory about the number of connections being less than worker threads.
The next step in our debugging journey was to look at the number of connections received on each worker thread. As I had alluded to earlier, Envoy relies on the kernel to distribute the accepted connections to different worker threads, and for all the worker threads to be fully utilizing the allotted CPUs, the connections also need to be fairly balanced. Luckily, Envoy has per-worker metrics for listeners that can be enabled to understand the distribution. Since these metrics are rooted at listener.<address>.<handler>.<metric name>, the regex provided in the annotation above should also expose these metrics. The per-worker metrics looked like this:
As you can see from the above image, the connections were far from being evenly distributed among the worker threads. One thread, worker 10, had 11.5K active connections as compared to some threads which had around ~1-1.5K active connections, and others were even lower. This explains the low CPU utilization numbers as most of the worker threads just didn’t have enough connections to do useful work.
In our Envoy research, we quickly stumbled upon this issue, which very nicely sums up the problem and the various efforts that have been made to fix it.
So, next, we went looking for a solution to fix this problem. It seemed like, for the moment, our own Ever Given was stuck as some diligent worker threads struggled to find balance. We needed an excavator to start digging.
For the rest of the story, check out part 2 next week.
Featured image via Pixabay.