Surviving Observability Floods on Kubernetes with Tremor
Sometimes you can’t avoid it: Production systems eventually break in one way or another. They get flooded with traffic and cannot scale fast enough, or hardware breaks, or misconfigurations happen. There is always a gap between a well-tested application as a software artifact and a smoothly running production environment. And when it all breaks down, and it will break down, you want to see it as it’s happening and get enough information out of your environment so you can repair the breakage. And you want to see it quickly, at lightning speed.
Having a broken or overloaded production system that is also breaking the observability pipeline, so that in effect you have huge delays in your signals coming from said production system, was the specific experience that led to the creation of Tremor. And it can be generalized to the scenario where a downstream system is not able to scale at the same speed as upstream — think of a stateless web application as upstream vs. a time-series database engine as downstream, for example. Usually, incoming traffic correlates linearly with the amount of log messages produced. In error scenarios the situation is unclear, but repeated restarts and failures with high frequency might produce a ton of log messages. Failing requests usually produce more logs than successful ones — just think about your average exception traceback.
So, how did Tremor help us see what was needed?
Tremor is an event-processing engine that focuses on ease of use, performance and efficient resource utilization. With its generic and flexible architecture, it is well suited for simple ETL workloads, data distribution and, most importantly, traffic shaping. In our opinion, faced with our systems being flooded, no logs were worse than late logs. But what we needed most was some logs in time, since we couldn’t handle them all. So we decided to build our own tool that allows you to throw away logs. Not all, but some, and in a controlled fashion. In that respect, Tremor is a kind of sophisticated /dev/null device with a few fancy knobs attached. And that architecture? You plumb together your own logic, using connectors to the outside world and pipelines in which you express your application logic using our own happy little DSL, called trickle.
Let me introduce you to some techniques and patterns around how we successfully applied Tremor to our Kubernetes infrastructure to mitigate flooding our observability pipeline.
In all of these patterns, Tremor acts as a monster-in-the-middle that sneakily throws away events if it detects a configured rate limit being exceeded or an overloaded downstream system. Yes, you’ve read that correctly. Our Tremor installation will throw away some of your precious little log messages or metrics data points. Keep in mind that we value some logs arriving on time more than all logs coming in with a huge delay. You get to decide on the specifics here: Should it be an increasing percentage of events being thrown away once Tremor detects downstream timeouts or errors, or should it throw away all incoming events for a configured amount of time upon timeouts or errors? If you are willing to dive deeper, you can implement custom backpressure algorithms in tremor-script yourself, supported by a helpful language server (LSP) and plenty of tools for testing your implementation before it goes into production.
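To make the two choices above concrete, here is a minimal Python sketch of both strategies. It is not Tremor's implementation and not tremor-script; the class names, parameters and default values are assumptions made up for illustration.

```python
import random


class PercentageBackoff:
    """Drop a growing share of events while downstream keeps failing."""

    def __init__(self, step=0.1, max_drop=0.9, rng=None):
        self.step = step            # how far one report moves the drop rate
        self.max_drop = max_drop    # upper bound, so some events always pass
        self.drop_rate = 0.0
        self.rng = rng or random.Random()

    def report(self, ok):
        # Raise the drop rate on a downstream timeout or error,
        # lower it again on every success.
        if ok:
            self.drop_rate = max(0.0, self.drop_rate - self.step)
        else:
            self.drop_rate = min(self.max_drop, self.drop_rate + self.step)

    def admit(self):
        # Forward the event with probability (1 - drop_rate).
        return self.rng.random() >= self.drop_rate


class TimedBackoff:
    """Drop every event for a fixed pause after a downstream failure."""

    def __init__(self, pause_s=1.0):
        self.pause_s = pause_s
        self.blocked_until = 0.0

    def report(self, ok, now):
        # A failure (re)starts the pause; successes change nothing.
        if not ok:
            self.blocked_until = now + self.pause_s

    def admit(self, now):
        return now >= self.blocked_until
```

The first variant degrades gradually and recovers gradually; the second is simpler to reason about but goes completely dark for the duration of the pause.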
Where is the best place to insert that Tremor-in-the-middle, then? Well, you might have guessed it: It depends. Mostly on what you want to consume and forward.
Tremor as a DaemonSet
It makes good sense to deploy Tremor as a DaemonSet in your cluster, so it will sit on each worker node collecting logs and metrics from the local node and all running pods and containers. The number of pods and containers running on a worker node is within reasonable bounds for one Tremor instance to handle their logs gracefully. And if the node sizing supports a chonk of pods, then you can easily chonk up your Tremor DaemonSet.
Application logs can be sent to the Tremor DaemonSet directly, for instance via UDP/TCP, or they can be collected via tools like Filebeat, Fluentd or Vector, which usually tail the log files produced by the container engine or kubelet.
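As a rough illustration of the direct route, an application could emit each log event as one UDP datagram. The sketch below is an assumption-laden example, not a prescribed setup: the function name, the JSON encoding and the default host and port are all hypothetical, and the Tremor side would need a UDP listener configured to match.

```python
import json
import socket


def send_log(message, host="127.0.0.1", port=12201, **fields):
    """Send one JSON-encoded log event as a single UDP datagram.

    host and port are placeholders for wherever the node-local
    Tremor instance listens; they are not Tremor defaults.
    """
    event = {"message": message, **fields}
    payload = json.dumps(event).encode("utf-8")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        # UDP is fire-and-forget: no connection and no delivery guarantee.
        sock.sendto(payload, (host, port))
    return payload
```

Because UDP never blocks on a slow receiver, the application keeps running even when the logging pipeline is struggling, at the price of silently lost datagrams.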
At Wayfair, Tremor DaemonSets exist on our ingress and worker nodes. They forward logs to our centralized logging pipeline. They consume system logs via Rsyslog and container logs, which are read via Vector and forwarded via UDP to Tremor. On top of that, they also collect node-level and container-level metrics.
If you want to get up and running with Tremor as a DaemonSet, check out our Helm chart.
Tremor as a Sidecar
Tremor can also be deployed as a sidecar to your application pod and directly collect logs at the source. We currently make use of this pattern to give users better control over the systemwide rate limiting, which randomly throws away logs when they exceed a certain rate. With the Tremor sidecar being so close to their application, developers can, for example, configure rate limiting based on a session-id or transaction-id and throw away logs for some sessions/transactions if they exceed the mandated rates, while keeping all logs for other sessions/transactions, so that they view full transactions/sessions, not partial ones. This also moves the rate limiting to the very source of observability signals (logs, in this case) and thus reduces pressure on the downstream logging pipeline, in case single instances flood the whole system. This is not uncommon, in hindsight.
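The idea of keeping or dropping whole sessions can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not how the sidecar is actually configured: the class, its methods and the fixed-window budget are hypothetical, and a real setup would express the logic in a Tremor pipeline.

```python
class SessionLimiter:
    """Keep or drop whole sessions so that surviving sessions stay complete."""

    def __init__(self, max_events_per_window):
        self.max_events = max_events_per_window
        self.count = 0          # events forwarded in the current window
        self.decisions = {}     # session_id -> True (keep) or False (drop)

    def new_window(self):
        # Reset the budget. Per-session decisions are kept on purpose,
        # so a session is never cut off halfway through.
        self.count = 0

    def admit(self, session_id):
        keep = self.decisions.get(session_id)
        if keep is None:
            # First event of a session: accept it only while budget remains.
            keep = self.count < self.max_events
            self.decisions[session_id] = keep
        if keep:
            self.count += 1
        return keep
```

The trade-off in this sketch is that already admitted sessions may push the forwarded count past the budget; the limit only decides whether new sessions get in. A production version would also need to forget finished sessions so the decision table does not grow without bound.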
While the sidecar deployment has the most overhead in terms of resources, it can be customized per pod, which we make good use of. If this level of fine-grained control is necessary, go for it and don’t sweat it. We pulled some shady tricks to make Tremor as fast and resource-efficient as possible.
Tremor as a Regular Deployment
It is exceptionally easy to deploy Tremor as a regular deployment too. We do this as we want to compartmentalize our Tremor deployments per application domain in our production environment to avoid the noisy neighbor problem. To this end, each domain gets its own Tremor cluster assigned to send logs to. We have control over each application’s logging configuration, so the routing isn’t actually a biggie. Every application domain gets one Tremor cluster assigned, and we size it according to the domain demands, which are pretty much known up front and grow in bounds we can control.
With this setup, heavy-duty domains don’t dominate the whole Tremor cluster or observability pipeline, and they don’t degrade latency for all the other citizens in our system.
If you want to get started with Tremor as a deployment, check out our Helm Chart.
Tremor for Kubernetes Events
Everything meaningful that happens to the Kubernetes cluster is exposed via the master node as Kubernetes events. For the purpose of being able to see what happened on a cluster, and for making those events available (for example, as audit logs), we forward them via kubernetes-event-exporter to a Tremor instance running on the master node as well.
We use Tremor at every point where either lightweight payload validation and normalization is required, or where extreme bursts in traffic need to be flattened to expectable amounts. This simplifies and relieves the downstream observability pipeline and in general makes our ops life that much easier, should the levee actually break.
And it is actually quite easy to do:
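# Define a backpressure operator that backs off when downstream
# takes longer than the configured timeout.
define qos::backpressure operator bp
with
  timeout = 100
end;
create operator bp;
# Route all incoming events through the operator.
select event from in into bp;
select event from bp into out;
select event from bp/err into err;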
This is the definition of a Tremor pipeline. You sprinkle some YAML on top that configures which UDP port to listen on and how to reach the next Kafka cluster. Now you’re all set for some happy little traffic shaping with a simple call to:
tremor server run -f main.trickle -f config.yaml
Thanks for Tremoring with Us
I hope I could get across the benefits of using Tremor to gather all of the possible observability chonks in a Kubernetes cluster. We have been bitten before by our infrastructure failing and not being able to see where exactly it failed in time. We are proud to say that we’ve been able to keep our feet dry at Wayfair since we put Tremor in the middle. Maybe you’ll be able to benefit too! (Note to self: Alternative title: Tremor as a Sandbag pattern).