Internet-scale data centers have changed the meaning, as well as the significance, of "fault tolerance." For years, the phrase was used, in effect, as a synonym for fault intolerance, as if the odds of perfect execution did not shrink as data centers scaled up.
Wednesday, during the Cassandra Summit in Santa Clara, a company called StackStorm lifted the veil on version 1.0 of its workflow automation system. Inspired by Facebook — which revealed in 2011 that its FBAR system had reduced the number of admins in one department from 200 down to two — StackStorm is already repositioning its key product as something more novel, and certainly more necessary, than workflow diagramming: auto-remediation.
“Get the Dumb Stuff Done”
“FBAR has shown folks that auto-remediation is not just something that can make it less likely you get woken up at 2 a.m. for something stupid,” says StackStorm CEO Evan Powell in an interview with The New Stack. “Job one of auto-remediation is, get the dumb stuff done so you don’t have to be annoyed by it.
“But more importantly, over time,” Powell continues, “you have systems like Facebook that would not exist. If FBAR goes down, Facebook goes down. The control plane ends up being an integral component.”
It’s a comment that reveals a bit about the role Powell envisions for StackStorm 1.0 in the enterprise data center. As he describes it, its purpose is to let administrators very quickly model the mundane, remedial actions a data center team typically takes in response to everyday events, especially service degradation. His goal is for admins to take their data centers from zero to basic auto-remediation in half an hour, preferably, and certainly in less than one hour.
From there, StackStorm 1.0 listens for the telltale events that signal a potential problem, and in response, runs workflows as scripts. Those workflows may include scripts that were developed for the previous generation of so-called runbook automation tools.
“The adoption pattern we’ve seen here is that, step one, day one, literally the first hour — we try to make it the first half-hour — you take StackStorm, you ingest what automation you already have, whatever scripts you’ve got, whatever you have in Chef or Puppet. StackStorm is your automation library. But then, those are Lego blocks and you start snapping them together, into more interesting automations.”
A StackStorm blog post published Wednesday offers an example use case for auto-remediation involving a Cassandra cluster. With a typical Cassandra monitoring system, when a node in the cluster dies, an alerting system often automatically “wakes” (because, after all, it can happen at any time) one of the DevOps engineers. That person may be charged with undertaking a six-step process for spinning up a new node and adding it to the cluster — a process that may be more complex than it sounds, but which nonetheless may not be worth the lost sleep.
StackStorm shows how those six steps actually translate to a truckload of YAML code. It’s easy enough for a person to read and decipher, but there’s a lot of it and it’s not all that intuitive.
When represented as a simple flowchart, however, with green flow arrows representing on_success and red ones representing on_error, the workflow makes more immediate sense. It becomes much more obvious how someone can drag-and-drop an event into the workflow and draw conditions that connect to it, without consulting a dictionary first.
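To give a feel for what that truckload of YAML looks like, here is a rough sketch of a few of those six steps as a workflow definition, with the green-arrow and red-arrow transitions spelled out as on_success and on_error. The action names and exact keys are illustrative assumptions for this article, not StackStorm's actual schema.

```yaml
# Hypothetical workflow sketch -- action names and keys are assumptions,
# not the exact StackStorm schema
replace_dead_cassandra_node:
  input:
    - dead_node_ip
  tasks:
    launch_replacement:
      action: aws.launch_node            # spin up a fresh instance (hypothetical action)
      on_success: install_cassandra      # green arrow in the flowchart
      on_error: page_oncall_engineer     # red arrow in the flowchart
    install_cassandra:
      action: cassandra.install          # install and configure Cassandra (hypothetical)
      on_success: bootstrap_node
      on_error: page_oncall_engineer
    bootstrap_node:
      action: cassandra.bootstrap_node   # join the new node to the ring (hypothetical)
      input:
        replace_address: "{{ dead_node_ip }}"
      on_error: page_oncall_engineer
    page_oncall_engineer:
      action: pagerduty.trigger_incident # fall back to a human only on failure (hypothetical)
```

Read as text, each task is a flowchart box and each on_success or on_error key is an arrow out of it, which is why the diagram view decodes so much faster than the raw YAML.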
A flow can be elevated to a rule by endowing it with conditions that must be met in real time.
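That trigger-plus-conditions-plus-workflow pairing can be pictured as a rule definition. A minimal sketch, in which the trigger type, criteria fields, and action reference are all hypothetical names invented for illustration:

```yaml
# Hypothetical rule sketch -- trigger, criteria, and action names are assumptions
name: cassandra_node_down_remediation
description: Replace a dead Cassandra node without waking anyone up
trigger:
  type: monitoring.node_alert        # hypothetical trigger emitted by a monitoring sensor
criteria:
  trigger.service:                   # conditions that must hold in real time
    type: equals
    pattern: cassandra
  trigger.status:
    type: equals
    pattern: critical
action:
  ref: cassandra.replace_dead_node   # hypothetical remediation workflow to run
  parameters:
    dead_node_ip: "{{ trigger.host_ip }}"
```

Only when every criterion matches the incoming event does the remediation fire, which is what separates a rule from a workflow that merely exists in the library.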
With the previous generation of workflow automation tools, many businesses considered their workflows as critical intellectual property, and only shared them with trusted individuals on a need-to-know basis. They represented the way their businesses worked, and were considered business processes rather than infrastructural matters.
As workflow moves into deeper layers of infrastructure, however, Powell tells us, it becomes less about businesses’ competitive advantage and proprietary processes, and more about keeping the data center alive. A lot of that process is open source anyway now. That said, StackStorm does include role-based access control (RBAC) to ensure that people sharing these processes are, at the very least, authorized by management.
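The RBAC side can be pictured as a role that grants only the permissions a shared remediation needs. A rough sketch, with hypothetical role, action, and permission names; the actual grant format depends on StackStorm's RBAC schema:

```yaml
# Hypothetical role sketch -- resource and permission names are assumptions
name: cassandra_operator
description: May execute shared Cassandra remediations, and nothing else
permission_grants:
  - resource_uid: action:cassandra:replace_dead_node
    permission_types:
      - action_execute               # may run the remediation
      - action_view                  # may inspect, but not edit, its definition
```

A scoped role like this lets a team share the runbook logic freely while management still controls who is authorized to pull the trigger.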
“In the case of some of the Cassandra remediations, many of them came from a runbook in the traditional sense that Datastax and the community had authored. They said, ‘When you see these things, go try those things to get these results.’ My point is, there is a lot of IT [like this] that is shared freely. You want people to know how to run it, without you being called every time there’s an underlying physical or virtual failure.”
This may not be the case, he concedes, with a major financial institution — for example, for a workflow that details how a bank’s data centers respond to a phishing attack. “There are some types of remediations that you want to keep completely locked down,” says Powell.
The CEO points to other examples from major service providers like Cisco Spark (a communications platform, not to be confused with Apache Spark) as having set new standards and ideals for auto-remediation that StackStorm wants to emulate for everyone. Might that mean StackStorm will follow in the wake of communications providers such as Alcatel-Lucent, which acquired a consumer analytics tool and co-opted it for use in service remediation?
“We love to be told the right thing to go do,” responds StackStorm’s Powell. “But we’re the scalable, easy-to-use, infrastructure-as-code system for going and doing those things.”
That sounds like a no, at least for now.
“Our vision is that, over time, you will have more and more truly self-driving data centers,” he continues. “Part of that is analytics on the monitoring side. But we also do think that there’s a real opportunity to have analytics that determines which remediation you should run. But that shoe has not dropped with us.”
Cisco is a sponsor of The New Stack.