FireHydrant: Managing Incidents Without the Chaos
Like many startups, incident management tool FireHydrant came about when a developer started building the tool he wished he had.
Bobby Ross was working as a site reliability engineer at human resources software startup Namely. Namely was an early adopter of Kubernetes and used other fairly new open source tools like Spinnaker, which led to frequent incidents.
“We didn’t have an audit log of the changes going out to our infrastructure. Because Kubernetes was so new, we were basically hand-rolling our deployments to Kubernetes before switching to Spinnaker,” explained Ross, who also goes by Bobby Tables.
“That was a problem because we didn’t know when something was actually shipped into production. So an application team would ship something into production and the SRE would get a page that something just broke. We didn’t know that something just went to production, all we knew was that something just went down.”
FireHydrant taps into these newer technologies and lets people track everything going on in their system, such as a configuration change. It then uses that information to help people resolve issues faster, because they don’t need to do a lot of research to find out what just changed, he said.
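The idea of using a change log to narrow down the cause of a page can be sketched in a few lines of Python. This is only an illustration, not FireHydrant's code or API: the `ChangeEvent` class, the `recent_changes` function, the 30-minute window and the sample entries are all made up for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class ChangeEvent:
    """One entry in a hypothetical audit log of infrastructure changes."""
    timestamp: datetime
    service: str
    summary: str


def recent_changes(log, incident_start, window=timedelta(minutes=30)):
    """Return changes made within `window` before the incident, newest first."""
    earliest = incident_start - window
    candidates = [c for c in log if earliest <= c.timestamp <= incident_start]
    return sorted(candidates, key=lambda c: c.timestamp, reverse=True)


# Made-up sample data: three changes, then a page at noon.
log = [
    ChangeEvent(datetime(2019, 1, 1, 9, 0), "billing", "config change"),
    ChangeEvent(datetime(2019, 1, 1, 11, 50), "payroll", "deploy new build"),
    ChangeEvent(datetime(2019, 1, 1, 11, 58), "auth", "rotate credentials"),
]
page_time = datetime(2019, 1, 1, 12, 0)
suspects = recent_changes(log, page_time)
for change in suspects:
    print(change.timestamp.isoformat(), change.service, change.summary)
```

With a log like this, the engineer who gets paged starts from a short list of recent changes rather than from a blank slate.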
The New York-based company, whose co-founders also include Daniel Condomitti and Dylan Nielsen, recently raised a $1.5 million seed round from Work-Bench. Its team members formerly worked as site reliability engineers at companies including DigitalOcean, CoreOS and Paperless Post.
“While the current crop of monitoring and alerting tools acts as a smoke alarm, there’s a gap in automating post-incident workflows using site reliability engineering best practices to stop the blaze and provide actionable intelligence to prevent it from happening in the first place,” Vipin Chamakkala, principal at Work-Bench, wrote in a blog post.
FireHydrant currently sits in the middle of a Venn diagram of other companies, overlapping a little with each one, including PagerDuty and OpsGenie, which was acquired by Atlassian. Blameless is its closest competitor, Ross said.
FireHydrant helps you figure out what’s happening during an incident and where to look for the problem. Afterward, it helps you perform a post-mortem to figure out exactly what happened, in an effort to prevent it from happening again.
“I had been involved in several firefighting scenarios — from production databases being dropped to Kubernetes upgrades gone wrong — and every incident had a common theme: absolute chaos,” Ross wrote in a blog post at the product launch.
Chamakkala described FireHydrant this way:
“Key to FireHydrant’s approach is how it tracks and traces changes by monitoring deployments, which then point you to areas where problems started. From there, the tool automatically assigns roles and tasks based on FEMA’s Incident Commander framework, used and proven to tackle real-life emergencies. Last and most importantly, the platform allows you to learn from your outages with analytics and an easy post-mortem process. This process identifies root causes, thereby allowing teams to make the necessary fixes and helping managers understand the overall reliability of their systems.”
FireHydrant allows users to create teams and link those teams to the components they are responsible for. Teams can also be assigned to specific incidents.
It’s fully integrated with Slack, GitHub and Kubernetes. The Slack integration lets users quickly create channels for incident communication, which is logged for later review.
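As a rough sketch of what linking teams to components could look like as a data structure, consider the following Python. It is a hypothetical illustration rather than FireHydrant's data model: the `TeamRegistry` class, its methods, and the team and component names are all assumptions made for the example.

```python
from collections import defaultdict


class TeamRegistry:
    """Hypothetical mapping from components to the teams that own them."""

    def __init__(self):
        self._owners = defaultdict(set)

    def link(self, team, component):
        """Record that `team` is responsible for `component`."""
        self._owners[component].add(team)

    def responders(self, affected_components):
        """Return the sorted, de-duplicated teams owning any affected component."""
        teams = set()
        for component in affected_components:
            teams |= self._owners.get(component, set())
        return sorted(teams)


# Made-up team and component names.
registry = TeamRegistry()
registry.link("platform", "kubernetes-cluster")
registry.link("payments", "billing-api")
registry.link("platform", "billing-api")
on_call = registry.responders(["billing-api"])
print(on_call)
```

Keeping ownership keyed by component means that once an incident names the affected components, the list of teams to pull in follows directly.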
The team has taken an API-first approach to development, Ross said.
“We’re trying to build our system to be very integration-ready. All our code is built with the notion that we’re going to have other integrations at some point. We’re building all our integrations in-house, but all our APIs are public [on GitHub]. Basically, if you want to build our UI, you have that option,” he said.
The open source projects are the pieces that will phone home and send information to FireHydrant about your infrastructure.
“It gets exponentially more powerful if you use the open source tools to send us information, but if you wanted to use it for just incident management — as a command center — you can,” he said.
“The moment you have an incident, you can open up a command center. Say your site goes down. Our software will guide you through the process that you set up in the product. If you say, ‘I want these types of people to respond to an incident,’ we’ll guide you through this process. This service, this team. Automatically notify those people in Slack and email. Here’s the details you should know, and here’s the command center you should join to help resolve this issue.”
It doesn’t rely on a monitoring tool, but it is fully integrated with PagerDuty and can kick off incidents directly from PagerDuty alerts.
It provides a complete audit log of all the changes in your system, enables users to see the affected environments and services, and helps make sure each team knows what it needs to do.
For post-mortems, it provides stats, such as how long incidents are typically open, to help users understand which services tend to be problematic and which teams most effectively put out fires.
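A statistic like typical time open per service is simple to compute once incidents carry an open time and a resolve time. The Python below is a minimal sketch under that assumption. The `mean_minutes_open` function, the tuple layout and the sample incidents are invented for illustration and do not reflect how FireHydrant calculates its analytics.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def mean_minutes_open(incidents):
    """Average time from open to resolve, in minutes, grouped by service."""
    durations = defaultdict(list)
    for service, opened, resolved in incidents:
        durations[service].append((resolved - opened).total_seconds() / 60)
    return {service: mean(values) for service, values in durations.items()}


# Made-up incidents as (service, opened, resolved) tuples.
incidents = [
    ("auth", datetime(2019, 1, 1, 12, 0), datetime(2019, 1, 1, 12, 30)),
    ("auth", datetime(2019, 1, 2, 8, 0), datetime(2019, 1, 2, 9, 30)),
    ("billing", datetime(2019, 1, 3, 15, 0), datetime(2019, 1, 3, 15, 20)),
]
stats = mean_minutes_open(incidents)
for service, minutes in sorted(stats.items()):
    print(f"{service}: {minutes:.0f} minutes open on average")
```

Grouping by responding team instead of by service would give the other view mentioned above: which teams resolve incidents fastest.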
Feature image via Pixabay.