
Fossor Fuel: LinkedIn Open Sources New Tools for Automated Investigation of Application Issues

14 Dec 2017 11:08am

Many of us fear being replaced in our jobs by a robot someday. Steven Callister, on the other hand, is actively working to make that dreaded eventuality a reality.

As a step toward that goal, Callister, a senior site reliability engineer at LinkedIn, created two new open source tools designed to automate identification and troubleshooting when a server or application glitches. Fossor is a plugin-oriented Python tool and library for automating the investigation of broken hosts and services. Ascii Etch is a Python library that Fossor plugins can optionally use to display output visually, by rendering number streams into graphs using ASCII characters.

Callister’s job as an SRE is, indeed, to automate work as much as possible. And, robot overlords aside, automation has a lot going for it: by taking over repetitive or monotonous workflow tasks, it can save a lot of time, which allows a busy engineer to focus on building the Next Cool Thing, or to get more sleep. As Callister explained:

“One particular experience really led me to begin writing the code for Fossor. I was on call and received an escalation one night at 3:00am that led me to trace a problem down through three different services and I found myself typing the same commands at each service as I got closer to the root issue. I figured there had to be some way to automate these steps to help speed up investigation, but also to help me get more sleep next time.”

Interested in exploiting automation’s power to not only automate the necessary investigative steps, but to perform them in parallel, Callister began to play around with building his own tool. Beyond performing checks specifically aimed at his current set of services, he realized, he could build in the flexibility to add new checks in the future, as needed.

Can You Dig It?

And so Fossor was born. From the Latin for “one who digs” (an alternate translation is “grave digger” but let’s just go with the first one) Fossor was named for the purpose of helping users to dig into server or application issues. To avoid the introduction of performance or application-breaking bugs into the tool itself, Fossor was built in two parts: the engine itself, and then a library of plugins.

The standalone engine is responsible for collecting the necessary plugins and then running each one in its own process. Isolating each plugin this way protects the main engine: a single failing plugin cannot crash the whole application. This resiliency, a key tenet of site reliability engineering, allows Fossor to safely manage plugins from many contributors.
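To make the isolation idea concrete, here is a minimal sketch of an engine loop that starts every check in its own process and survives a check that dies. It is not Fossor's actual engine code: `run_isolated`, the three example checks, and the use of the POSIX-only "fork" start method are all assumptions made for illustration.

```python
import multiprocessing as mp
import os


def _run_in_child(plugin, variables, conn):
    """Child-process entry point: run one plugin and send its output back."""
    conn.send(plugin(variables))
    conn.close()


def run_isolated(plugins, variables, timeout=5.0):
    """Run every plugin in its own process; return (outputs, failed_names)."""
    # Assumption: the "fork" start method, available on Linux and macOS only.
    ctx = mp.get_context("fork")
    jobs = []
    for name, plugin in plugins.items():
        receiver, sender = ctx.Pipe(duplex=False)
        proc = ctx.Process(target=_run_in_child, args=(plugin, variables, sender))
        proc.start()
        sender.close()  # the parent keeps only the receiving end
        jobs.append((name, proc, receiver))

    outputs, failed = {}, []
    for name, proc, receiver in jobs:
        try:
            if receiver.poll(timeout):
                result = receiver.recv()
                if result:  # keep only "interesting" output
                    outputs[name] = result
            else:
                proc.terminate()  # plugin hung past the timeout
                failed.append(name)
        except EOFError:
            failed.append(name)  # plugin died before reporting anything
        finally:
            proc.join()
            receiver.close()
    return outputs, failed


# Hypothetical checks, used only to exercise the sketch.
def high_load(variables):
    return "load average is above threshold"


def quiet_check(variables):
    return None  # nothing interesting to report


def crashing_check(variables):
    os._exit(1)  # simulate a plugin that takes its whole process down


outputs, failed = run_isolated(
    {"high_load": high_load, "quiet_check": quiet_check, "crashing_check": crashing_check},
    variables={},
)
```

Because all processes are started before any result is collected, the checks run in parallel, and the crashing check only shows up in `failed` rather than stopping the others.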

The first plugin Callister wrote addressed a vexatious issue his team had recently faced. The problem, which turned out to be memory fragmentation, was hard to identify at first: the team had never encountered it before, and tracking it down took a long time. “Once we figured out what the problem was, we didn’t want anyone else to have to start from the beginning either,” said Callister. “It felt like a waste of valuable lessons learned to not alert others of this possible problem. And so the memory fragmentation plugin became the first plugin I wrote for Fossor.”

Plugging in

The plugin system is actually his favorite part of Fossor, Callister continued. At this point, Fossor’s plugin library can check for hundreds of different site issues, and users get to build their own menu of checks.

The collaborative aspect of Fossor — that anyone can contribute a plugin, especially now that the tool has been open sourced — is also exciting, Callister said. “Once a plugin is contributed, it benefits every other user. In all, this tool is a coming together of knowledge from people with differing areas of expertise sharing their best and most useful checks.”

The Fossor Workflow.

“And, because Fossor takes full advantage of a computer’s ability to run checks in parallel, there’s no reason for a person to have to pick and choose what to check for first. You can simultaneously run checks for multiple issues — including those you might not have thought of on your own,” he added.

Fossor supports three types of plugins: variable gathering, check, and report, which the engine executes in that order. The plugins themselves are small classes that all share the same basic structure and must implement a single method, the run method. If the run method returns output, the engine treats that output as “interesting” and reports it back to the user. The run method accepts a single argument, a Python dict named “variables,” used to optionally provide external information to the plugin.
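A check plugin following that shape might look roughly like the sketch below. The class is standalone and does not import Fossor, so the class name `DiskUsage` and the variable keys `path` and `disk_threshold_pct` are hypothetical; only the single `run` method taking a `variables` dict and returning output when something is interesting comes from the description above.

```python
import shutil


class DiskUsage:
    """Hypothetical check: report only when a filesystem is nearly full."""

    def run(self, variables):
        # Assumed variable names; a real plugin library may use different keys.
        path = variables.get("path", "/")
        threshold = float(variables.get("disk_threshold_pct", 90))

        usage = shutil.disk_usage(path)
        used_pct = 100.0 * usage.used / usage.total

        if used_pct < threshold:
            return None  # nothing interesting, so nothing is reported
        return f"{path} is {used_pct:.1f}% full (threshold {threshold:.0f}%)"


# Usage: an artificially low threshold forces a report, a high one stays quiet.
report = DiskUsage().run({"disk_threshold_pct": 0})
silent = DiskUsage().run({"disk_threshold_pct": 101})
```

Returning `None` for the uninteresting case is what keeps the combined report short when many checks run at once.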

Some generic plugins currently in the Fossor library investigate high memory or disk usage, network errors, error patterns in the logs, high load averages, and recent kernel messages. For much greater detail and specific code examples, check out Callister’s blog post.

Getting Graphic

Downstream Latency plugin shown polling LinkedIn’s service metrics to check each downstream service for latency. If the latency appears abnormal, the plugin prints an ASCII graph back to the user using the Ascii Etch library.

Callister anticipated a problem that could arise from running so many checks in parallel: so much output that key data would be difficult to pull out of the stream. Fossor’s reporting is therefore specific to each plugin, and a plugin only reports information when it deems it significant. This curated output makes data of interest easier to find, but Callister realized he could take it a step further by creating a utility that translates the data into graphical output.

Thus Callister wrote a companion for Fossor called Ascii Etch, which creates a graphical output of data that make reports easier to read. Ascii Etch’s original task was displaying latency graphs back to users on the command line in Fossor. “The original downstream latency plugin for Fossor displayed latency average, minimum, and maximum. While these are useful stats, a quick graph is a much clearer representation of whether or not there is actually latency downstream,” he explained.
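The general idea of turning a stream of numbers into a terminal graph can be sketched in a few lines. This is not the Ascii Etch API: `render_graph` and its `height` parameter are hypothetical names, and the scaling is deliberately simple (one column per value, one `*` per column).

```python
def render_graph(values, height=5):
    """Plot a sequence of numbers as rows of text, highest values on top."""
    if height < 1:
        raise ValueError("height must be at least 1")
    if not values:
        return ""

    lo, hi = min(values), max(values)
    span = (hi - lo) or 1  # avoid dividing by zero for a flat series

    # Scale each value to a row index: 0 is the bottom row, height - 1 the top.
    rows = [round((v - lo) / span * (height - 1)) for v in values]

    lines = []
    for level in range(height - 1, -1, -1):
        line = "".join("*" if row == level else " " for row in rows)
        lines.append(line.rstrip())
    return "\n".join(lines)


# Usage: a latency series (made-up numbers) with a spike in the middle.
print(render_graph([12, 13, 12, 48, 51, 14, 12], height=6))
```

Even in this crude form, a spike stands out in the picture far faster than it does in a line of average, minimum, and maximum values.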

Open Source Future

Since being introduced, Fossor and Ascii Etch have already helped improve incident response times at LinkedIn. “We are able to more quickly identify the cause of application issues by performing our investigative checks in parallel and then reporting back only the useful information, streamlining the debugging process through a single command,” Callister said.

The advantage of Fossor’s plugin-based approach is that each plugin can be highly specific, while the library of contributed plugins as a whole can cover a vast range of issues. Now that the tools are being open sourced, the LinkedIn SRE team is looking forward to seeing what outside contributors add.

“Since Fossor becomes more useful with each additional plugin, we hope the open source community finds value in using this automation tool and contributes to its budding library of investigative checks,” Callister concluded.
