Incident Management: How Organizational Context Can Help
It’s no secret that today’s IT environments are more dynamic and complex than ever — which means effective incident management matters more than ever.
Too often, however, help desk teams, IT operations engineers, and other technologists are forced to search for the proverbial needle in a haystack.
That haystack has become more like a hayfield. Homogeneous, monolithic systems have given way to far more dynamic and distributed applications and infrastructure. Think: containerization and orchestration, multicloud and hybrid cloud, microservices architecture, CI/CD, and a software supply chain that spans a virtually limitless number of sources.
“Organizations are incredibly complicated these days, whether it is from technology — where you’ve got myriads of different software systems and infrastructure and teams that own them — but also the interconnections between teams themselves and extensions out to customers,” Evans told The New Stack.
“Organizations can be thought of as this big web or graph of things that can be connected and mapped to other things in the organization.”
The sheer quantity and scale of those things — and how they explicitly and implicitly connect — is vast today. While incident management is a longstanding pillar in enterprise IT, both as a practice and in terms of the various tools and platforms that support that effort, it hasn’t necessarily kept pace with the scope and scale of that organizational web (or graph) that Evans described.
That’s why incident.io recently launched Catalog, a new feature on its platform intended to arm teams with the dynamic contextual awareness needed to effectively and efficiently respond to incidents when they occur — and without burning the house down in the process.
What Is Catalog?
Catalog is essentially a modern take on an older approach, the configuration management database (CMDB). CMDBs have typically been used to organize and track various IT assets, from employee laptops to databases.
“It was a static list back when organizations were a bit more static, certainly from a technology standpoint,” Evans said.
Service catalogs are a bit more dynamic, but Evans notes they have typically been restricted to a fairly fixed topology along the lines of: We have teams, those teams own services, and those services depend on the underlying infrastructure.
Catalog essentially pairs the two approaches and aims to push them further to better mirror and adapt to modern organizations and technology systems.
“Catalog is a very flexible data structure that lets you model all of the things that exist in your organization and all of the connections between them,” Evans said.
That flexibility means you can model virtually anything — not just, say, the straight line between an application, the internal team that owns it, and the infra it runs on, but connections to different business functions, or to customers and the account managers that take care of them, or to virtually any other facet of a specific organization.
The table stakes here, according to Evans, are the ability to map a particular incident to how it might impact a particular customer or particular business process. But then you can layer on additional data that creates a richer context that essentially magnetizes those needles out of the haystack instead of sending valuable team members on a never-ending chase.
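To make the "web or graph" idea concrete, here is a minimal Python sketch of a catalog modeled as typed entries with named links, plus a walk that finds which customers and account managers sit downstream of an affected application. It is an illustration only, not incident.io's actual schema or API: the entry keys, type names, link names, and the `related` helper are all hypothetical.

```python
from collections import deque

# Hypothetical catalog: every entry has a type and named links to other entries.
# None of these names come from incident.io; they only illustrate the graph idea.
CATALOG = {
    "app/mobile": {
        "type": "Application",
        "links": {
            "owned_by": ["team/mobile"],
            "runs_on": ["infra/cluster-a"],
            "used_by": ["customer/acme", "customer/globex"],
        },
    },
    "team/mobile": {"type": "Team", "links": {}},
    "infra/cluster-a": {"type": "Infrastructure", "links": {}},
    "customer/acme": {"type": "Customer", "links": {"managed_by": ["person/dana"]}},
    "customer/globex": {"type": "Customer", "links": {"managed_by": ["person/lee"]}},
    "person/dana": {"type": "AccountManager", "links": {}},
    "person/lee": {"type": "AccountManager", "links": {}},
}


def related(start, entry_type):
    """Breadth-first walk over outgoing links; return reachable entries of one type."""
    seen = {start}
    queue = deque([start])
    found = []
    while queue:
        key = queue.popleft()
        for targets in CATALOG[key]["links"].values():
            for target in targets:
                if target in seen:
                    continue
                seen.add(target)
                queue.append(target)
                if CATALOG[target]["type"] == entry_type:
                    found.append(target)
    return found


# Which customers, and which account managers, does a mobile app incident touch?
customers = related("app/mobile", "Customer")
account_managers = related("app/mobile", "AccountManager")
```

Because links are just named edges between entries, adding a new kind of thing (a business function, an executive sponsor) means adding entries and links rather than changing a fixed team-service-infrastructure hierarchy.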
Moreover, the feature can use inference to kickstart automated workflows from the moment an incident is created.
For example, a customer-support representative (CSR) might get bombarded with inbound calls about problems with a customer-facing mobile application. Odds are that CSR doesn’t know who’s responsible for the app or how to troubleshoot.
But once the CSR creates an incident, Catalog can automatically notify the people or teams that need to act because of their connection to a particular system or business functionality (the mobile app).
That workflow can get very granular: What is that team’s Slack channel? Who is the team lead? Where is the PagerDuty Escalation Policy I should use? And so forth.
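As a rough sketch of how such a lookup could drive a workflow, the Python below resolves an affected feature to its owning team and derives the actions to take. Everything here is assumed for illustration: the `Team` fields, the `OWNERS` and `TEAMS` tables, the sample values, and the `plan_notifications` function are hypothetical and do not reflect how incident.io or PagerDuty actually expose this data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Team:
    """Hypothetical team record holding the context a responder would need."""
    name: str
    slack_channel: str
    lead: str
    escalation_policy: str


# Hypothetical lookup tables: team key -> team record, feature -> owning team key.
TEAMS = {
    "mobile": Team(
        name="Mobile",
        slack_channel="#team-mobile",
        lead="Sam",
        escalation_policy="mobile-on-call",
    ),
}
OWNERS = {"Mobile app": "mobile"}


def plan_notifications(affected_feature):
    """Return the actions a workflow could take for an incident on a feature."""
    team_key = OWNERS.get(affected_feature)
    if team_key is None:
        # Unknown feature: nothing can be inferred, so a human has to triage.
        return ["route to manual triage"]
    team = TEAMS[team_key]
    return [
        f"post in {team.slack_channel}",
        f"notify team lead {team.lead}",
        f"page using escalation policy {team.escalation_policy}",
    ]


# The CSR only needs to say what is broken; the rest is looked up.
actions = plan_notifications("Mobile app")
```

The point of the design is that the person raising the incident supplies one fact (what is affected), and the encoded relationships answer the follow-up questions about channel, lead, and escalation.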
Reducing Cognitive Load, Enabling Faster Response
Essentially, Evans said, Catalog helps reduce team members’ cognitive load, while encoding organization rules that can considerably cut down on lead time and manual effort in terms of response and mitigation.
This wasn’t especially necessary 10 or 20 years ago in the conventional IT environment: A database, a monolithic application or two, a few (or even a bunch of) servers that you could walk down the hall to see running in your data center. But that’s not the reality for most modern enterprises of any kind of substantial size — say, 100 employees and up.
“Modern organizations just aren’t building things the same way,” Evans said.
He shared an illustration from his previous company, an online bank in the U.K. When Evans started there in 2017, the bank ran on roughly 250 microservices running on hundreds of Amazon Web Services servers, all managed by five or so engineering teams.
That is complex enough on its own. But when Evans left the firm in 2021, the bank ran about 2,500 microservices, with multiple copies of each running for redundancy, which meant north of 10,000 different workflows managing them. The microservices were deployed on nearly 1,000 cloud-based servers, and the company had grown from roughly 70 employees to about 2,500.
If an incident occurred that was strictly an engineering issue, the firm’s service catalog typically sufficed. But Evans said it lacked the context needed to quickly span out to the complete organization whenever that might have been needed — say, identifying the particular executive who might need to be looped in, or the ability to rapidly determine which customers might be most directly impacted and act accordingly.
In most organizations, that kind of context is pushed onto the plates of the front-line employees actually responding to incidents, which almost invariably adds time, headaches and costs.
Said Evans, “It’s exactly that [challenge] that we’re trying to solve with Catalog: Give everyone the shared context of the organization and navigate that live during an incident, when things are already super high-pressure and you don’t have the time to go talk to a million different people. Give that context to everyone in one place.”