
Incident Response: Three Ts to Rule Them All

The best operations platforms can quickly resolve high-impact incidents and elevate continuous learning, in which teams are ahead of issues before they start.
Aug 24th, 2023 8:56am by
Image from Chaosamran_Studio on Shutterstock

The growing momentum in adopting generative AI is one of the most exciting trends of recent history. But as developers begin producing more code with AI-assisted programming, are your operational processes keeping up?

Incidents will still happen, and the ability to orchestrate real-time incident response is more critical than ever, as digital infrastructures get increasingly complex and customer expectations rise.

Operational excellence is key to managing these macroenvironmental changes, and that starts with a pulse check on your organization’s own operational maturity. The Three Ts (teams, techniques and technology) can guide you toward balancing growth with operational efficiency.

Teams

Effective incident response teams are typically structured in three hierarchical levels: command, liaison and operations.

  • Command: At this level, the goal is to coordinate response efforts, to prompt, review and delegate external communications during the incident, and to run postmortem exercises afterward. The incident commander leads this team and can be assisted by a deputy and a scribe, depending on the incident’s scale and complexity. The deputy takes on critical supporting tasks to help the commander stay focused on the incident. The scribe documents the timeline of an incident and ensures that important decisions and data are captured for review.
  • Liaison: During an incident, it’s vital to keep customers updated, either directly or via public channels, and to keep internal stakeholders updated and mobilize them if needed. These are the responsibilities of the customer liaison and the internal liaison, respectively.
  • Operations: Subject-matter experts (SMEs) are domain experts or designated owners of a component/service within an organization’s technical ecosystem. These SMEs are the boots-on-the-ground folks working to bring the incident to a close.

Note: This is only a proposed team structure. Different incidents have different needs. For example, during smaller incidents a single person can take on multiple roles. Determine ahead of time which people each severity of incident requires so that incident response teams are right-sized for the scope of an issue.
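One way to make that right-sizing concrete is to write the roster down as data before an incident starts. The sketch below is illustrative only: the three severity levels, the `REQUIRED_ROLES` mapping and the `staff_incident` helper are assumptions made for this example, not a prescribed structure.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Hypothetical severity scale; lower number means more severe."""
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3


# Assumed mapping from severity to the roles that must be filled.
# Adjust it to your own organization's definitions.
REQUIRED_ROLES = {
    Severity.SEV1: [
        "incident commander", "deputy", "scribe",
        "customer liaison", "internal liaison", "subject-matter expert",
    ],
    Severity.SEV2: [
        "incident commander", "scribe",
        "internal liaison", "subject-matter expert",
    ],
    Severity.SEV3: ["incident commander", "subject-matter expert"],
}


def staff_incident(severity, available):
    """Assign every required role to a responder.

    When there are fewer responders than roles, people are reused in
    round-robin order, so one person can hold several roles.
    """
    if not available:
        raise ValueError("at least one responder is required")
    roles = REQUIRED_ROLES[severity]
    return {role: available[i % len(available)] for i, role in enumerate(roles)}
```

For a small incident, `staff_incident(Severity.SEV3, ["ana"])` gives both roles to the same person, which matches the note above about one person taking on multiple roles.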

Techniques

Preparation, clearly defined roles and actions, communication, documentation and learning are key to setting up incident response teams for success. Here are techniques to standardize your incident response process while ensuring continual learning:

  • Define what constitutes an incident and a major incident. Use simple, unambiguous language. Define the severity levels of an incident to outline what kind of response should be taken. Tip: If an incident appears to fall between two levels, treat it as if it is higher in severity.
  • Define how and when to mobilize responders. As a best practice, incidents should be created automatically, and ideally, that same automation should be able to resolve them. If processes are still manual, set up a dedicated phone bridge and chat room in advance, with the relevant numbers and connection information documented.
  • Create a postmortem process. Detail the cause of the incident, how it played out and what steps could prevent something similar from happening again. This is a vital part of continuous learning, which can help organizations to iteratively improve incident response.
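The severity definitions and the "round up" tip from the first technique can be captured in a few lines. This is a minimal sketch under stated assumptions: the three levels, the response text in `RESPONSE` and the function names are hypothetical and should be replaced with your own definitions.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Hypothetical severity scale; lower number means more severe."""
    SEV1 = 1  # treated as a major incident in this sketch
    SEV2 = 2
    SEV3 = 3


# Assumed response per level, written in simple, unambiguous language.
RESPONSE = {
    Severity.SEV1: "page the incident commander and open the incident bridge",
    Severity.SEV2: "page the on-call owner of the affected service",
    Severity.SEV3: "file a ticket for the next business day",
}


def settle_severity(*plausible):
    """Return the most severe of the levels an incident might fall under.

    This encodes the tip above: when an incident sits between two
    levels, treat it as the higher severity.
    """
    if not plausible:
        raise ValueError("at least one candidate severity is required")
    return min(plausible)


def is_major(severity):
    """In this sketch only SEV1 counts as a major incident."""
    return severity == Severity.SEV1
```

A responder who cannot decide between SEV2 and SEV3 would call `settle_severity(Severity.SEV2, Severity.SEV3)`, get SEV2 back, and look up the matching entry in `RESPONSE`.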

An important footnote is to practice. The mental shift required between “peacetime” and “wartime” can be challenging for responders. That’s why running fake incidents during “game days” is a good idea. Our long-running “Failure Friday” initiative not only helps uncover issues that could affect resilience but also builds a stronger team culture by bringing everyone together to share knowledge.

Technology

People and processes are a vital part of any incident response strategy. But so is technology. Organizations should be looking for software designed to manage the entire life cycle of an incident, from alerting to diagnostics and remediation. This way, it’s possible to overcome limits on responder resources, speed up resolution by routing operational issues and incidents to the right person or team in real time, give responders the right context about an incident, and resolve incidents without human intervention.

The right tools will:

  1. Keep stakeholders informed while managing higher incident volumes and continuously improving response processes.
  2. Equip the right people in your organization with self-service access to IT operations tasks, resolving requests and incidents while reducing escalations and interruptions.
  3. Leverage machine learning and event-driven automation to group alerts, orchestrate events and speed up triage.
  4. Break down barriers between customer service and development teams to keep teams and customers in the loop at all times.
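To give a feel for what alert grouping means in practice, here is a deliberately simple, rule-based sketch. It is not machine learning and not any vendor's algorithm: it just folds alerts for the same service into one group when they arrive close together. The `Alert` fields, the `group_alerts` name and the five-minute window are assumptions chosen for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Alert:
    service: str   # hypothetical field: which service raised the alert
    summary: str
    at: datetime   # when the alert fired


def group_alerts(alerts, window=timedelta(minutes=5)):
    """Group alerts by service and time proximity.

    An alert joins the most recent group for its service if it fired
    within `window` of that group's last alert; otherwise it starts a
    new group. Returns a list of groups, each a list of Alert objects.
    """
    groups = []
    latest = {}  # service name -> index of its most recent group
    for alert in sorted(alerts, key=lambda a: a.at):
        idx = latest.get(alert.service)
        if idx is not None and alert.at - groups[idx][-1].at <= window:
            groups[idx].append(alert)
        else:
            latest[alert.service] = len(groups)
            groups.append([alert])
    return groups
```

With this rule, a burst of database alerts becomes one item for a responder to triage instead of many, which is the noise reduction the list above refers to. A production system would likely consider more signals than service name and timing.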

The strongest operations platform includes all of the above and acts as a single source of truth for urgent, unplanned work. It ingests data from monitoring and observability, DevOps and DataOps tools to detect and diagnose urgent disruptions, mobilize a response and automate workflows, improving mean time to resolution (MTTR). Combining automation with machine learning also enables intelligent alert grouping and event orchestration to reduce noise and further enhance responder productivity.
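Since MTTR is the yardstick here, it helps to be explicit about how it is computed. The sketch below assumes the simplest definition: the average of resolved time minus opened time across a set of incidents. The function name and the input shape are assumptions for illustration, and teams often measure from different start points (detection, acknowledgment), so treat this as one possible convention.

```python
from datetime import datetime, timedelta


def mean_time_to_resolution(incidents):
    """Average duration of a set of resolved incidents.

    `incidents` is an iterable of (opened, resolved) datetime pairs.
    Returns a timedelta.
    """
    durations = [resolved - opened for opened, resolved in incidents]
    if not durations:
        raise ValueError("no resolved incidents to average")
    # sum() needs a timedelta start value to add timedeltas together.
    return sum(durations, timedelta()) / len(durations)
```

Tracking this value over time, per severity level, shows whether changes to teams, techniques or technology are actually shortening incidents.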

Minimizing Disruption, Maximizing Brand Value

As digital infrastructures come under increasing strain, a fresh look at incident response helps you enhance your operational maturity. Ultimately, the best operations platforms can quickly resolve high-impact incidents and elevate digital operations to a preventative state of continuous learning, in which teams are ahead of issues before they start. It’s the only way to minimize disruption to customers, employees and brand reputation.

Read the PagerDuty incident response Ops guide for more helpful information to improve your operational processes.
