Sensu: Workflow Automation for Monitoring

22 Aug 2018 9:11am, by Caleb Hailey
Caleb Hailey is the co-founder and CEO of Sensu, where he helps businesses solve their stickiest monitoring challenges. Before founding Sensu, Caleb was President at Heavy Water Operations as well as Director of Product Development at Novitas Data, where he rebuilt their software development division, shortened development cycles from years to weeks, and cemented deals with several Fortune 500 companies. An Oregon native, Caleb loves contributing to open source projects, wine tasting, and cheering the Portland Timbers to victory.

Almost exactly seven years ago, Marc Andreessen explained why software is eating the world. We’re now living the reality he so accurately predicted: Every company is becoming (or has become) a software company. Software is not only ubiquitous, it’s powerful, enabling us to solve a wide range of problems. The issue is — as Sensu co-founder Sean Porter so aptly put it — it also prompts us to create new problems. In this software-dependent world, availability is critical, and downtime is not only expensive but damaging to business reputation. As a result, monitoring systems and applications has become a core competency, crucial to business operations. “Improved operational visibility through monitoring” is often cited as a top priority among CIOs and senior operations leadership.

The collection of monitoring data is essentially a solved problem: there is a plethora of utilities (Nagios plugins, CollectD plugins, Telegraf plugins, StatsD libraries, and Prometheus exporters, to name a few) capable of producing or collecting data about how your systems and applications are performing. Where it gets interesting, and where the challenge lies, is connecting that data with the systems and tools you rely upon. Modern infrastructures are also increasing in velocity, which further exacerbates the problem of connecting disparate data from the bounty of tools at our disposal. The life of an operator is all about becoming a proficient systems integrator. My favorite analogy for this is “trying to fit a square peg in a round hole”: getting data from modern systems like Kubernetes into legacy tools (e.g., Nagios), or getting data from legacy systems (e.g., SNMP traps, or metrics collected in outdated formats) into modern tools like InfluxDB.

“Ok people, listen up. The people upstairs handed us this one, and we gotta come through. We gotta find a way to make this [square object] fit into the hole for this [round object], using nothing but that [random and unrelated parts].”

I love this scene from Apollo 13 for how well it depicts our day-to-day life as operators. It has everything: a top-down mandate from “the people upstairs”; a time-sensitivity that only an operator would truly appreciate; a need to solve a problem introduced by decisions you had no influence over (the square peg); a requirement to solve the problem using some existing tool (the round hole); and limited resources that sometimes feel like a pile of incompatible parts strewn across a table. (In the scene, of course, they are dealing with a potential life-and-death situation. Technical operations and actual life-and-death situations are not the same thing. I want to be clear on that, and to give credit and thanks to first responders and real-life NASA engineers for the work they do.) As operators, we might not have had a say in the design of the service that is failing in production, or the company’s investment in ServiceNow, but we do have the job of solving the problem and reporting on our progress in the tool that the rest of the organization uses to track work.

The solution to all of this is surprisingly simple: getting data from one tool or system to another is ultimately just a workflow. And when you begin to view these “square peg in a round hole” challenges as workflows that can be automated, they become much easier to reason about and solve. At Sensu, we’re completely changing how we help customers reason about these challenges by enabling them to apply workflow automation principles to monitoring. We’ve been working towards making this possible for over seven years by building the world’s first monitoring event pipeline.

Here’s how it works:

We consume monitoring events (e.g., availability and performance data) and provide a simple set of building blocks (or core “primitives”) including event filters, event payload mutators, event handlers, and more. These simple building blocks enable users to model workflows and automate them using Sensu.
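
To make the building blocks concrete, here is a minimal sketch of the filter, mutator, and handler idea in plain Python. It is an illustration of the pattern only, not Sensu’s API: the function names (run_pipeline, is_incident, to_ticket) and the event fields (entity, check, status) are all assumptions invented for this example.

```python
from typing import Callable, Iterable, List

# Hypothetical type aliases; an event is modeled as a plain dictionary.
Event = dict
Filter = Callable[[Event], bool]
Mutator = Callable[[Event], Event]
Handler = Callable[[Event], None]


def run_pipeline(
    events: Iterable[Event],
    filters: List[Filter],
    mutators: List[Mutator],
    handlers: List[Handler],
) -> int:
    """Pass each event through filters, then mutators, then handlers.

    Returns the number of events that reached the handlers.
    """
    handled = 0
    for event in events:
        # Filters decide whether an event continues down the pipeline.
        if not all(keep(event) for keep in filters):
            continue
        # Mutators reshape the payload for the destination tool.
        for mutate in mutators:
            event = mutate(event)
        # Handlers deliver the (possibly reshaped) event somewhere.
        for handle in handlers:
            handle(event)
        handled += 1
    return handled


def is_incident(event: Event) -> bool:
    """Filter: keep only events with a non-zero (non-OK) status."""
    return event.get("status", 0) != 0


def to_ticket(event: Event) -> Event:
    """Mutator: reshape an event into a hypothetical ticket payload."""
    return {
        "summary": f"{event['entity']}: {event['check']} is failing",
        "severity": event["status"],
    }


# Handler: a list's append method stands in for a call to an ITSM tool.
tickets: List[Event] = []

events = [
    {"entity": "web-01", "check": "http", "status": 0},
    {"entity": "web-02", "check": "http", "status": 2},
]
count = run_pipeline(events, [is_incident], [to_ticket], [tickets.append])
print(count, tickets)
```

In this sketch, swapping the destination tool means swapping the mutator and the handler, while the filter logic stays untouched.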

With this approach, you can consolidate data and integrate disparate and otherwise incompatible monitoring tools, connecting monitoring outputs (e.g., from modern and legacy systems) with existing operational tools (e.g., ITSM systems). Here are just a few very simple examples:

Taking an event-based approach to monitoring is key to the pipeline approach. Events provide a simple and extensible abstraction for discovery (e.g., a new device, compute instance, or container), availability (device and service health information; e.g., “is my service still responding to requests?”), telemetry (metrics and other performance data; e.g., “is my service responding within a defined SLO or SLA?”), and plain old alerts. Sensu events are just JSON data, which makes them both developer and operator friendly; it’s dead simple to push events into the Sensu pipeline from any programming language in the world. (You can even add monitoring to your “nightly backups as a cron job” with a single line of bash!)

(This is a pseudo-event, but all Sensu events are just JSON data and very easy to generate)
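
As a rough illustration of how little it takes to generate such an event, the sketch below builds one as JSON for a nightly backup job. The field names (entity, check, status, output, timestamp) and the names db-01 and nightly-backup are assumptions chosen for readability, not the Sensu event schema, and the step of actually delivering the payload to Sensu is deliberately left out because it depends on your installation.

```python
import json
import time


def build_event(entity: str, check: str, status: int, output: str) -> dict:
    """Assemble a pseudo-event; the field names are illustrative only."""
    return {
        "entity": entity,
        "check": check,
        "status": status,
        "output": output,
        "timestamp": int(time.time()),
    }


def backup_event(exit_code: int) -> dict:
    """Turn a backup job's exit code into a pseudo-event.

    Assumes exit code 0 means success; anything else is treated as critical (2).
    """
    if exit_code == 0:
        return build_event("db-01", "nightly-backup", 0, "backup completed")
    return build_event(
        "db-01", "nightly-backup", 2, f"backup failed with exit code {exit_code}"
    )


# Serialize the event; this JSON string is what a job would hand to the pipeline.
payload = json.dumps(backup_event(1), sort_keys=True)
print(payload)
```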

The last piece of this puzzle, and the one that makes Sensu so broadly adaptable across multi-generational datacenters and hybrid cloud infrastructures, is the Sensu agent. Sensu’s agent can consume monitoring data from a variety of popular and standards-based utilities (including Nagios plugins, StatsD libraries, Prometheus exporters, SNMP traps, CollectD metrics, and more), and wrap it in the Sensu event context for processing in the pipeline.
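
The wrapping idea can be sketched with the Nagios plugin convention, in which a plugin reports state through its exit code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) and may append performance data after a “|” in its output. The code below is a simplified stand-in, not the Sensu agent: wrap_check, parse_plugin_output, and the event fields are hypothetical, the perfdata parsing ignores quoted labels, and the “plugin” is faked with a one-line Python command so the example runs anywhere.

```python
import string
import subprocess
import sys
import time

# Exit-code convention used by Nagios-style plugins.
STATUS_NAMES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def parse_plugin_output(raw: str):
    """Split 'text | perfdata' into the text and a dict of numeric metrics."""
    lines = raw.strip().splitlines()
    first_line = lines[0] if lines else ""
    text, _, perfdata = first_line.partition("|")
    metrics = {}
    for item in perfdata.split():
        label, sep, value = item.partition("=")
        if not sep:
            continue
        # Keep the value before the first ';' and drop a trailing unit (s, B, %).
        number = value.split(";")[0].rstrip(string.ascii_letters + "%")
        try:
            metrics[label] = float(number)
        except ValueError:
            continue  # Skip values this simplified parser cannot read.
    return text.strip(), metrics


def wrap_check(entity: str, check: str, command: list) -> dict:
    """Run a plugin command and wrap its result in a pseudo-event."""
    result = subprocess.run(command, capture_output=True, text=True)
    # Any exit code outside the convention is reported as UNKNOWN.
    code = result.returncode if result.returncode in STATUS_NAMES else 3
    text, metrics = parse_plugin_output(result.stdout)
    return {
        "entity": entity,
        "check": check,
        "status": code,
        "state": STATUS_NAMES[code],
        "output": text,
        "metrics": metrics,
        "timestamp": int(time.time()),
    }


# A fake plugin: prints Nagios-style output and exits with WARNING (1).
fake_plugin = [
    sys.executable,
    "-c",
    "import sys; print('HTTP WARNING: slow response | time=1.5s;1;2 size=512B'); sys.exit(1)",
]
event = wrap_check("web-01", "check-http", fake_plugin)
print(event)
```

The same shape applies to the other sources listed above: each needs its own small parser, but the surrounding event context stays the same.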

The most fulfilling thing about what we’re building is seeing how our customers put it to use in real-world contexts, and there are plenty of them among the growing global Sensu community! Our customers are doing things like replacing their Nagios setup (if you’re in Portland, come by Sensu Summit this week to hear Box.com’s migration story); monitoring ephemeral infrastructure (see this post we recently discovered from the team at WePay); automating remediation, which Demonware’s Kale Stedman described at Monitorama 2018; and automating compliance monitoring (see this moment from Paul Czarkowski’s talk about monitoring at IBM Blue Box from last year’s Sensu Summit). We’re even seeing very advanced workflows like “enterprise monitoring governance,” a sophisticated “monitor of monitors” that integrates with large enterprise “data lake” repositories and business intelligence workflows. (I’m personally very excited about this particular workflow, so if you’re interested in learning more, please drop us a line!)

As you can see, hybrid cloud monitoring becomes a lot easier to reason about when everything is a “workflow.” It’s not only easier, but the possibilities for what you can monitor (and how) are virtually endless. We’re excited to continue learning from our customers and community about how they’re automating their monitoring workflows.

I’ll be the closing speaker on Day 1 (August 22nd) of the 2018 Sensu Summit, where I’ll go into greater detail about how this “workflow automation” approach empowers users. If you’re in town, come on by — I’d love to hear your feedback!

Feature image via Pixabay.

