For developers, debugging a complex application is a difficult enough task, but tracking a latency issue that runs across a set of microservices that make up a distributed application is even more daunting. Traditional tracing tools, such as DTrace, are built for following single processes on a single CPU.
To aid in microservices debugging, a number of organizations are turning to Zipkin, an open source tracer for microservices, first developed by Twitter to track Web requests as they bounced around different servers.
“The community is starting to standardize around Zipkin,” said Mike Gehard, senior software engineer at Pivotal Labs. We spoke with Gehard at the Cloud Foundry Europe Summit 2016, where he gave a Spring Boot tutorial that incorporated Zipkin.
Airbnb and Uber are also both using Zipkin. On the commercial side, at least one company, LightStep, offers enterprise support for the technology.
And this week, the Cloud Native Computing Foundation is mulling over the next project to take under its wing, and Zipkin is under consideration. Zipkin makes sense for the CNCF portfolio, which is a growing stack of open source tools for running cloud native workloads, starting with the Kubernetes orchestrator.
“Zipkin has helped us find a whole slew of untapped performance optimizations, such as removing memcached requests, rewriting slow MySQL SELECTs, and fixing incorrect service timeouts,” wrote then-Twitter engineer Chris Aniszczyk (now with the CNCF) in a 2012 blog post announcing the open sourcing of the technology.
Twitter developed the technology using a Google paper that described Google’s internally-built distributed app debugger, Dapper.
Debugging is, in theory, an easy thing to do. The general approach is to find the problem, then fix the problem. In many cases, this can be done by attaching a debugger to a process and watching the stream of things that happen, what calls are being made, what data is being moved, and so on.
Debugging distributed systems, however, is not so easy, Gehard explained. When a user reports that the service is slow, or not working at all, the new challenge is to find out where the problem happened. One user request will kick off multiple microservices, and may even backtrack through a microservice more than once. Each microservice — and some organizations can have hundreds of microservices — will generate its own log.
Of course, a developer could search through these multiple logs using a tool like Splunk or Elasticsearch, tracking the user request across multiple services. If you know a request came in at 4:52 a.m., you can search all the logs from 4:52 a.m. to, say, 4:55 a.m. But there are a lot of other events going on at the same time, so it is extremely difficult to track a request through the dynamic maze of microservices.
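To see why a time-window search is so noisy, consider the minimal sketch below. The log records, service names and the `request_id` field are all hypothetical and do not come from any particular logging tool. The point is only that a time window matches every request in flight, while an identifier carried along with the request isolates a single one.

```python
from datetime import datetime

# Hypothetical log lines gathered from several services; every field is illustrative.
LOGS = [
    {"service": "gateway", "time": "04:52:01", "request_id": "a1", "msg": "GET /cart"},
    {"service": "gateway", "time": "04:52:02", "request_id": "b2", "msg": "GET /home"},
    {"service": "cart",    "time": "04:52:03", "request_id": "a1", "msg": "load cart"},
    {"service": "catalog", "time": "04:53:10", "request_id": "b2", "msg": "list items"},
    {"service": "pricing", "time": "04:54:30", "request_id": "a1", "msg": "price cart"},
    {"service": "gateway", "time": "04:56:00", "request_id": "c3", "msg": "GET /cart"},
]


def parse(clock):
    """Turn an HH:MM:SS string into a comparable time object."""
    return datetime.strptime(clock, "%H:%M:%S").time()


def in_window(logs, start, end):
    """Return every record whose timestamp falls within [start, end]."""
    lo, hi = parse(start), parse(end)
    return [r for r in logs if lo <= parse(r["time"]) <= hi]


def by_request(logs, request_id):
    """Return only the records that carry the given request identifier."""
    return [r for r in logs if r["request_id"] == request_id]


window_hits = in_window(LOGS, "04:52:00", "04:55:00")
request_hits = by_request(LOGS, "a1")
print(len(window_hits), "records in the time window")
print(len(request_hits), "records for request a1")
```

In this toy data set the window search also returns records from an unrelated request, and the gap only widens as traffic grows.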
Enter Zipkin. This software aggregates timing data that can be used to track down latency issues.
That data is transmitted back to a Zipkin server, where it is captured by Node.js and stored in Cassandra. It is left to the user to establish the communication protocol between the emitter and the collector; for his class, Gehard uses RabbitMQ. Scribe, HTTP, and Kafka are also recommended as transport mechanisms.
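The idea of emitting timing data to a collector can be sketched in a few lines of Python. This is not Zipkin's client library or wire format: the `span` helper, the field names and the in-memory `COLLECTED` list (standing in for a transport such as RabbitMQ or Kafka) are assumptions made for illustration, loosely following the Dapper-style model of a trace made up of parent and child spans.

```python
import time
import uuid
from contextlib import contextmanager

# In-memory stand-in for the collector; a real setup would send spans over a transport.
COLLECTED = []


def new_id():
    """Generate a short random identifier (hypothetical format)."""
    return uuid.uuid4().hex[:16]


@contextmanager
def span(name, service, trace_id=None, parent_id=None):
    """Time a block of work and report it as one span when the block exits."""
    record = {
        "trace_id": trace_id or new_id(),  # shared by every span in one request
        "span_id": new_id(),               # unique to this unit of work
        "parent_id": parent_id,            # links the span to its caller
        "name": name,
        "service": service,
    }
    start = time.perf_counter()
    try:
        yield record
    finally:
        record["duration_ms"] = (time.perf_counter() - start) * 1000.0
        COLLECTED.append(record)


# One request that fans out from a gateway to two downstream calls.
with span("GET /cart", "gateway") as root:
    with span("load cart", "cart", root["trace_id"], root["span_id"]):
        time.sleep(0.01)
    with span("price cart", "pricing", root["trace_id"], root["span_id"]):
        time.sleep(0.02)

# Slowest spans first, which is where a latency hunt would usually start.
for s in sorted(COLLECTED, key=lambda s: s["duration_ms"], reverse=True):
    print(f'{s["service"]:8} {s["name"]:12} {s["duration_ms"]:.1f} ms')
```

Because every span carries the same trace identifier and a pointer to its parent, the collector can rebuild the call tree for one request and show where the time went.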
Zipkin comes with a Web interface that shows the amount of traffic each microservice instance is getting. The log data can be filtered by application, length of trace, annotation, or timestamp.
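The kind of filtering described here can be approximated with a small query function. The sketch below is an assumption-laden illustration, not Zipkin's query API: the span records, the annotation strings and the `find_spans` helper are all hypothetical.

```python
# Hypothetical span records; durations in milliseconds, timestamps in epoch seconds.
SPANS = [
    {"trace_id": "t1", "service": "gateway", "duration_ms": 120.0, "timestamp": 1000, "annotations": ["cache.miss"]},
    {"trace_id": "t1", "service": "cart",    "duration_ms": 95.0,  "timestamp": 1001, "annotations": ["cache.miss"]},
    {"trace_id": "t2", "service": "gateway", "duration_ms": 15.0,  "timestamp": 1050, "annotations": []},
    {"trace_id": "t2", "service": "catalog", "duration_ms": 8.0,   "timestamp": 1051, "annotations": []},
    {"trace_id": "t3", "service": "pricing", "duration_ms": 240.0, "timestamp": 1100, "annotations": ["timeout"]},
]


def find_spans(spans, service=None, min_duration_ms=None, annotation=None,
               since=None, until=None):
    """Return spans matching every filter that was supplied; None means 'any'."""
    return [
        s for s in spans
        if (service is None or s["service"] == service)
        and (min_duration_ms is None or s["duration_ms"] >= min_duration_ms)
        and (annotation is None or annotation in s["annotations"])
        and (since is None or s["timestamp"] >= since)
        and (until is None or s["timestamp"] <= until)
    ]


slow = find_spans(SPANS, min_duration_ms=100)
gateway = find_spans(SPANS, service="gateway")
late_misses = find_spans(SPANS, annotation="cache.miss", since=1001)
print([s["service"] for s in slow])
print([s["trace_id"] for s in gateway])
print([s["service"] for s in late_misses])
```

Combining filters narrows the search quickly, for example to slow spans from one application inside a given time range.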
This approach does add some latency to a microservices architecture, and it increases the size of each microservice. “Microservices are a set of trade-offs,” Gehard said. “If I can scale the service, I’m willing to add some costs.”
Zipkin is not the only microservices monitoring tool on the market. Interested parties should also take a look at OpenTracing (also based on Google Dapper) and RisingStack’s Trace, a full-stack monitoring service that includes distributed tracing capabilities.
CNCF is a sponsor of The New Stack.
Feature image: Public art, Frankfurt, Germany.