Flooded by Event Data? Here’s How to Keep Working
Today’s digital-first organizations need to create superb experiences for their customers — or risk irrelevance. Ideally, this means resolving any operational issues before the end user even realizes something is wrong. However, for most organizations, it’s not that easy. Digital operations teams are drowning in a tsunami of events. Existing tooling is unable to cope; manual processes and multiple point solutions translate into interruptions and escalations for overburdened responders.
Not only does this lengthen mean time to resolution (MTTR) and erode customer loyalty, but it can also hit team morale and drive up burnout rates. Organizations need their best and brightest to work on high-value innovation projects. They don’t want engineers interrupting their work every five minutes for tasks that could be automated. This problem is exacerbated by the sheer volume of events from dozens, if not hundreds, of sources. Filtering and contextualizing these alerts to know where to take action is a daunting task.
What’s the Problem?
PagerDuty is seeing a 70% increase in event volumes year-over-year across its customer base. Why? The increase can be partly attributed to the surge in remote work and a corresponding proliferation of systems. Teams are also getting more sophisticated in their modeling: to get ahead of incidents, they examine their data more often. Observability systems watching multiple signals generate a deluge of metrics, which in turn generate more events. However, there are inevitable problems that stem from this surge in data:
- Productivity gets hammered with the noise that event data creates. In some cases, organizations need to bring in a second responder just to acknowledge notifications because the first one is too busy fixing problems.
- Because it’s hard to separate signal from noise, organizations struggle to pinpoint what’s going on and where. And if 100 people end up on an incident response call, it can be difficult to discern the root cause. Over half (59%) of organizations say finding the root cause of issues is getting harder due to the increasing complexity of application and infrastructure monitoring.
- Too much time is wasted on manual tasks. Often, incident responders come in and perform the same five steps. If they know when to perform a task, why, and what the output should be, why should they have to log in each time to press those buttons? These tasks should be automated. Nearly half (47%) of organizations say manual processes and responses to Priority 1 incidents harm productivity.
Enter Event Orchestration
Solving the issues above is where event orchestration can help. Event orchestration enables users to route events toward the most appropriate set of actions. PagerDuty’s event orchestration functionality, for example, analyzes, enriches, applies logic to and automatically acts on events as they occur, in real time and within microseconds. This enables our customers to take all the events coming in from 650+ integrations and apply logic and automation to figure out what should be done with each one — what the next best action is — at machine speed.
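To make the routing idea concrete, here is a minimal sketch of condition-based event routing. All names here (`route`, the rule tuples, the action strings) are illustrative assumptions, not PagerDuty’s actual API: each rule pairs a condition over the event payload with a set of actions.

```python
# Minimal sketch: route an event to the first rule whose condition matches.
# Names and payload fields are hypothetical, not PagerDuty's real API.

def route(event, rules):
    """Return the actions for the first matching rule, else escalate."""
    for condition, actions in rules:
        if condition(event):
            return actions
    return ["create_incident"]  # default: hand the event to a responder

rules = [
    # Informational chatter gets suppressed rather than paging anyone.
    (lambda e: e["severity"] == "info", ["suppress"]),
    # Known disk alerts trigger automated remediation plus enrichment.
    (lambda e: "disk" in e["summary"], ["run_disk_cleanup", "annotate"]),
]

print(route({"severity": "info", "summary": "heartbeat"}, rules))
```

The key point the sketch illustrates is that routing is decided per event, at ingestion time, so noisy events never reach a human unless no automated handling applies.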
Because automation can be nested, users can have one automated action kick off a diagnostic process, learn more about the event and then use that information to decide what to do next. This allows organizations to take human processes and automate them. It also lets us enrich events — adding context, removing machine jargon and making them human-actionable, as required.
There are two big wins from this. Either there’s a high-priority incident, for which the responder knows exactly what to do and where to start, or, ideally, there’s no incident at all. The event has been automatically resolved and developers can get on with their job without any interruptions. Perhaps they can even enjoy some well-earned rest.
How Did We Enable Event Orchestration?
There are three key innovations behind this engine:
- Advanced conditions. Previously, users needed to create too many rules to accomplish simple tasks. The volume of rules could result in browsers crashing and responders left unsure about what was happening to their events. Event orchestration solves this by adopting a new conditional language that allows for complex condition definition. This new language, built by PagerDuty, has led to users seeing a 90% reduction in rules.
- Contextual rules. We often hear from customers that they want to know more about the state of their systems and how events are being processed in order to solve problems faster. But to write automation that can parse complex processes as humans do instinctively, it’s important to understand the context of a service, the rates of events going in and what those rates say about issues. It’s now possible to do this via contextual rules.
- Rule nesting. We allow customers to write a rule that can perform any number of actions. Users can then write a rule that is nested beneath another rule, inheriting those actions and performing further actions based on what occurred when the previous rule was triggered. This allows users to create precise logic in a simplified manner, which can mimic common resolutions that are manually performed today.
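The nesting behavior described above can be sketched in a few lines. This is an illustrative model only — the `Rule` class, its fields and the action names are assumptions, not PagerDuty’s implementation — showing how a child rule runs only if its parent matched, and inherits the parent’s actions before adding its own.

```python
# Sketch of rule nesting: child rules inherit a matching parent's
# actions and append their own. Names are hypothetical, not PagerDuty's.

class Rule:
    def __init__(self, condition, actions, children=()):
        self.condition = condition
        self.actions = actions
        self.children = list(children)

    def evaluate(self, event, inherited=()):
        """Return the accumulated actions for the deepest matching branch,
        or None if this rule's condition does not match the event."""
        if not self.condition(event):
            return None
        actions = list(inherited) + list(self.actions)
        for child in self.children:
            result = child.evaluate(event, actions)
            if result is not None:
                return result  # a matching child refines the outcome
        return actions

# Any database event runs diagnostics; a critical one also pages on-call.
db_rule = Rule(
    lambda e: e["service"] == "db",
    ["run_diagnostics"],
    children=[Rule(lambda e: e["severity"] == "critical", ["page_oncall"])],
)

print(db_rule.evaluate({"service": "db", "severity": "critical"}))
```

A critical database event here inherits `run_diagnostics` from the parent and then adds `page_oncall` — mimicking the manual sequence a responder would otherwise perform by hand.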
When organizations start nesting rules like this, it results in some interesting mathematical outcomes. On the face of it, PagerDuty has built a more powerful rule engine that gives users the ability to nest rules and leverage advanced conditions. Behind the scenes, we’re allowing users to build “directed acyclic graphs” within their event ingestion pipeline.
What this really means is that customers are now able to build a finite state machine. These machines have some very useful properties: they take an incoming event and all the data associated with it, pick that data apart and turn it into a particular set of actions in a highly deterministic fashion. Thus, users can push an event into this high-tech “vending machine” and it will follow all the rules, processes and logic they input, and, with 100% certainty, users will know what they’ll get on the other side.
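The “vending machine” property — identical input always produces identical output — can be sketched as a pure function over the event. The pipeline below and its field names are hypothetical, but they illustrate why determinism makes automation safe to deploy.

```python
# Sketch of a deterministic event pipeline: the outcome is a pure
# function of the event data, with no hidden state or randomness.
# Field and action names are illustrative, not PagerDuty's.

def pipeline(event):
    actions = []
    if event.get("known_issue"):
        actions.append("run_remediation")
        if event.get("remediation_succeeded"):
            actions.append("auto_resolve")      # no human ever interrupted
        else:
            actions.append("create_incident")   # escalate with context
    else:
        actions.append("create_incident")
    return actions

e = {"known_issue": True, "remediation_succeeded": True}
# Running the same event twice always yields the same actions.
assert pipeline(e) == pipeline(e)
```

Because every branch is explicit, users can reason about exactly which events will be auto-resolved, suppressed or escalated before turning the automation on.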
This could include outcomes like automated remediation or suppressing/enriching an event. The point is that determinism gives users precise control over what happens to events as they’re being ingested. It allows low-risk, precise deployment of automation. And this precision gives users the confidence to try more automation use cases, knowing exactly what will happen if they take a particular course of action.
Making Life Easier
So what’s the bottom line for event orchestration? It provides a set of tools that can be leveraged in a variety of use cases in highly effective ways. Think: automated diagnostics that flag alerts that didn’t auto-resolve, helping to speed up resolution; suppression of non-urgent notifications that arrive outside responders’ working hours; automatically identifying noisy parts of the infrastructure; or identifying known root causes and automatically informing teammates. Event orchestration can do all of the above.
It’s all about eliminating manual work, handling known issues automatically and enabling responders to get to work faster. In today’s uncompromising digital ops environments, nothing less will do.