PagerDuty Automates Incident Response with Rundeck
Digital operations management company PagerDuty has released a new version of its PagerDuty Operations Cloud that introduces a new level of automation to its platform, in part due to the company’s acquisition of enterprise DevOps automation provider Rundeck.
“It’s about injecting automation in as many places as possible for a modern operations enterprise, some at real-time without a human being, some empowering a human being and then combining it with machine learning,” said Michael Cucchi, vice president of product at PagerDuty. “It’s machine learning to understand what’s going on, machine learning and automation to solve it before you have to involve a human being so that you never have to interrupt people, and then if you’re going to interrupt people, try to give the, excuse the term, but the least skilled employee the most power: Make your first responder your most powerful responder with automation.”
Deliver as Much Information as Possible
Lastly, said Cucchi, if you do have to escalate an incident to a developer, make sure to deliver as much information to them as possible. PagerDuty Operations Cloud does all of this in three distinct ways, according to PagerDuty’s blog post on the launch, two of which Cucchi said are powered by its Rundeck acquisition.
First, PagerDuty has added several features to connect everything, whether that means linking people to services or mapping dependencies within an organization. The Dynamic Service Graph, for example, gives a snapshot of service health and application dependencies and shows the teams or individuals responsible for those services. When an incident occurs, the Dynamic Service Graph lets users immediately see what is happening and whom to call for help. Similarly, Global Search lets users search for attributes of incidents, alerts, services, and schedules from a single location.
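To make the idea of a service graph concrete, here is a minimal sketch of how a dependency map can answer "what is affected and whom do we call?" It is not PagerDuty's implementation or API. The service names, team names, and the functions `impacted_services` and `teams_to_call` are all hypothetical, chosen only for illustration.

```python
from collections import deque

# Hypothetical service map: each service lists its owning team and the
# services it depends on. None of these names come from PagerDuty.
SERVICES = {
    "checkout": {"owner": "payments-team", "depends_on": ["cart", "billing"]},
    "cart": {"owner": "storefront-team", "depends_on": ["database"]},
    "billing": {"owner": "payments-team", "depends_on": ["database"]},
    "database": {"owner": "platform-team", "depends_on": []},
}


def impacted_services(failing, services=SERVICES):
    """Return the failing service plus every service that depends on it,
    directly or transitively, as a sorted list."""
    impacted = {failing}
    queue = deque([failing])
    while queue:
        current = queue.popleft()
        # Any service that lists `current` as a dependency is also impacted.
        for name, info in services.items():
            if current in info["depends_on"] and name not in impacted:
                impacted.add(name)
                queue.append(name)
    return sorted(impacted)


def teams_to_call(failing, services=SERVICES):
    """Return the sorted set of teams that own any impacted service."""
    return sorted({services[name]["owner"]
                   for name in impacted_services(failing, services)})


if __name__ == "__main__":
    print(impacted_services("database"))
    print(teams_to_call("cart"))
```

In this toy map, a failure in `database` reaches every other service, while a failure in `cart` touches only `cart` and `checkout`, so only two teams would need to be contacted.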
Next, PagerDuty Operations Cloud adds several features that stem from the Rundeck acquisition, which the company groups into the latter two categories: automating everywhere and delivering flexibility. The automation comes in the form of Rundeck Actions, which brings automated diagnostics and response to the frontlines of incident response, and Rundeck Cloud, which allows engineers to author self-service automated processes without having to deploy or administer a Rundeck cluster. Those processes run from the cloud, even behind firewalls and within virtual private clouds (VPCs).
“Now, we try to inject automation wherever possible,” explained Cucchi. “When we do need a human being, we want to give them a whole bunch of tools to work with. So instead of interrupting the developer, give the customer support person or give the first tier help desk support person access to a whole bunch of standardized but high-powered automation routines, so that they can potentially run out and do things like reboot a cluster or roll over a cluster or flip a load balancer — things that used to take a team of people to decide to do, you can build into these low-risk sequences that humans can do.”
Reducing the Noise
The final area of PagerDuty’s updates focuses on giving responders as much information as possible while reducing noise. Cucchi gave the example of a home security camera that is triggered by movement but pointed at a bush that moves in the wind. If it alerts every time the wind blows, it will generate a lot of false alerts, but machine learning could filter out that noise so that alerts fire only when they are necessary. Probable Origin, for example, gives responders an auto-generated list of likely origin points to help them resolve the issue faster, while Auto-Pause Incidents removes noise by using machine learning to detect events that have historically resolved on their own. Similarly, Event Orchestration reduces noise by letting teams build custom logic based on event conditions at scale, cutting down on manual event processing.
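The two noise-reduction ideas described here, condition-based event logic and pausing events that tend to clear on their own, can be sketched in a few lines. This is only an illustration under stated assumptions: it replaces machine learning with a plain historical frequency threshold, and the `Event` type, the `HISTORY` data, the thresholds, and the `route` function are all hypothetical rather than part of any PagerDuty or Rundeck API.

```python
from dataclasses import dataclass


@dataclass
class Event:
    source: str
    severity: str  # "info", "warning", or "critical"
    summary: str


# Hypothetical history: summary -> (times seen, times cleared with no human action).
HISTORY = {
    "disk latency spike": (40, 38),
    "database unreachable": (12, 1),
}


def usually_self_resolves(event, history=HISTORY, threshold=0.9, min_seen=10):
    """Simple stand-in for a learned model: treat an event as transient if it
    has been seen often enough and almost always cleared by itself."""
    seen, cleared = history.get(event.summary, (0, 0))
    return seen >= min_seen and cleared / seen >= threshold


def route(event, history=HISTORY):
    """Apply condition-based rules in order and return the action to take."""
    if event.severity == "info":
        return "suppress"  # never interrupt anyone for informational events
    if usually_self_resolves(event, history):
        return "pause"     # hold briefly and see whether it clears
    if event.severity == "critical":
        return "page"      # interrupt the on-call responder
    return "notify"        # low-urgency notification only


if __name__ == "__main__":
    print(route(Event("disk-monitor", "warning", "disk latency spike")))
    print(route(Event("db-monitor", "critical", "database unreachable")))
```

The rule order matters in this sketch: an event that historically clears on its own is paused before the severity check, so a responder is paged only for events with no record of resolving themselves.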
Moving forward, Cucchi said that PagerDuty’s focus will be on flexibility with workflows, further expanding what the company has done with automation for remediation.
“Now it’s about being able to assemble workflows across all of that technology for all different people,” said Cucchi. “We want to be able to build really dynamic workflows so that people can assemble these technologies we’ve been talking about really easily and quickly for different use cases.”