How to Enable Critical Work Management Outside of DevOps
We all know the power of real-time operations when it comes to DevOps. In an always-on world, seconds matter, and slow responses put revenue, reputation and retention at risk. Adopting real-time operations enables development and IT teams to take a proactive and preventative approach to managing time-critical incidents. This means when high-severity issues occur, action is taken in real time, engaging the right people and accelerating resolution.
What about outside of DevOps? Critical work isn’t exclusive to IT and development. There are cases of critical work management, when time is of the essence, that extend across every organization and every industry.
Organizations are often stuck in ticket time for critical work due to reliance on legacy systems and legacy processes, but there is a better way. Teams outside of DevOps and IT should ask themselves the following questions to identify their use cases and workflows that could benefit from real-time operations:
- Do your critical work processes cross people or organizational boundaries?
- Are you working in ticket time? For instance, is the duration of critical work largely determined by how quickly people respond to a given ticket?
- Do you lose a lot of time on critical work items getting the right person in your response workflow to take action?
- Is there a large volume of inbound work with little prioritization?
If the answer is yes to any of the above, then teams are well-positioned to benefit from real-time technology and react quickly when time is of the essence.
From industrial control systems to security and crisis management, critical work can strike anywhere and at any time. Here are three use cases where real-time operations help teams address critical work.
Industrial Control Systems
One use case that can benefit from real-time operations is building management, particularly in areas that host mission-critical processes and data. When managing these spaces, organizations need to treat digital services, the data they rely on and the building they’re in as one single entity.
This is an approach adopted by one PagerDuty customer, Fox Corp. When Fox made a significant investment to open a new, purpose-built technology and media center in Tempe, Arizona, the company made PagerDuty the central nervous system to build resilience, integrating operations for physical, digital and network environments into one place.
The center is Fox’s hub for digital data, broadcast and streaming operations, with its own data center hosted on-premises. The facility services more than 200 affiliate and owned TV stations, as well as partners such as Amazon Prime and Apple TV.
The facility went live while construction was still taking place, making critical operations subject to the impact of final facility work. During this period, a worker inadvertently struck a water pipe while drilling through a wall in the data center, causing water to flood into the facility and setting off flood-detection systems.
However, because Fox treats all of its infrastructure as one entity, it received alerts about the flood via PagerDuty within 30 seconds. It identified which systems could be affected by the flood – in this case, equipment for a satellite uplink facility 4 miles away. By rapidly alerting teams, PagerDuty helped Fox maintain uptime and avert a catastrophic failure.
It certainly wasn’t the primary use case Fox imagined, but it did reinforce the importance of tying together the digital and physical realms.
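To make the idea of treating physical and digital infrastructure as one entity more concrete, here is a minimal Python sketch. It assumes a simple lookup from a physical sensor zone to the digital services that depend on it. The zone names, service names and the `build_alert` helper are all hypothetical, and nothing here reflects how Fox or PagerDuty actually model this.

```python
# Hypothetical mapping from physical sensor zones to the digital services
# that depend on equipment in each zone. All names are illustrative.
ZONE_DEPENDENCIES = {
    "data-hall-a": ["streaming-origin", "affiliate-feed"],
    "uplink-room": ["satellite-uplink"],
}


def affected_services(zone):
    """Return the services at risk when a physical alert fires in a zone."""
    return ZONE_DEPENDENCIES.get(zone, [])


def build_alert(sensor_type, zone):
    """Combine a physical sensor event with its digital impact in one alert."""
    services = affected_services(zone)
    return {
        "summary": f"{sensor_type} detected in {zone}",
        # Treat the event as critical only if something depends on the zone.
        "severity": "critical" if services else "warning",
        "affected_services": services,
    }


alert = build_alert("flood", "data-hall-a")
print(alert["severity"], alert["affected_services"])
```

The point of the sketch is that the responder sees the physical event and its likely digital impact in one alert, rather than having to correlate two separate systems by hand.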
Security Incident Response
Another area where real-time action is vital is responding to threats against the business. As organizations continue to adopt agile work practices, security teams are facing more pressure than ever to understand security posture in real time. Faced with a deluge of security alerts that take time to resolve, and with cloud adoption increasing the attack surface, security teams and developers alike are struggling to respond at pace.
Whether it’s data validation, ensuring the safety of personally identifiable information or responding to a possible phishing attack, critical security work cannot be slowed down by manual processes and alert fatigue, especially when business and customer data is at stake.
Real-time operations have a vital role to play. Platforms like PagerDuty can be used to gather context from digital signals, then escalate threats to the right people across security, development and operations to drive real-time action on the signals that matter the most.
One customer that has had success in this area is Signal Sciences, which adopted PagerDuty to overcome manual processes that were encumbering security incident management. With PagerDuty, Signal Sciences can orchestrate a real-time response to security incidents and reduce their impact by notifying the right people at the right time. This helps to shorten mean-time-to-acknowledge (MTTA) and mean-time-to-resolve (MTTR).
By integrating PagerDuty with security monitoring and log management tools, Signal Sciences has a complete view of its security operations, with a path of escalation so nothing slips through the cracks. This helps Signal Sciences stay on top of its security posture and resolve incidents faster and more consistently.
Crisis Management
Amid a growing climate crisis, the global COVID-19 pandemic, and political and social unrest, it’s never been more important for organizations to establish crisis management processes and react in real time. If there’s an active shooter, an earthquake or a positive COVID-19 test among the workforce, what should happen next and how?
In a crisis, teams need to establish two things quickly. First, are our employees in the affected area and, most importantly, are they safe? Second, are there any major systems in the area that might be affected or taken offline?
This is where the power of real time to handle crisis scenarios is key. At PagerDuty, we use real-time operations within our crisis management processes, something we got the whole business involved with as part of National Preparedness Month.
Teams at PagerDuty use our platform to support our response to a crisis, such as an earthquake. The platform has been set up to detect earthquakes measuring 7.0 or higher on the Richter scale near large employee-concentration areas. If we receive an alert, teams can quickly establish whether incident commanders and responders are available to be on call to help with the response. This allows us to trigger a real-time response and bring in responders.
Using PagerDuty’s integrations with Slack, Zoom and email clients, we’re also able to quickly establish if any employees have been affected. If they have, we can get help to them ASAP.
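The earthquake trigger described above can be sketched as a simple filter on magnitude and distance. This is only an illustration under stated assumptions. The 7.0 threshold comes from the text, but the 200 km radius standing in for "near", the office names and coordinates, and the `offices_to_alert` helper are all hypothetical and do not describe PagerDuty's actual configuration.

```python
import math

# Hypothetical office locations as (latitude, longitude); illustrative only.
OFFICES = {
    "office-a": (37.7749, -122.4194),
    "office-b": (43.6532, -79.3832),
}

MIN_MAGNITUDE = 7.0   # threshold mentioned in the text
RADIUS_KM = 200.0     # assumed meaning of "near"; not from the source


def haversine_km(a, b):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def offices_to_alert(magnitude, epicenter):
    """Return offices within RADIUS_KM of a quake at or above MIN_MAGNITUDE."""
    if magnitude < MIN_MAGNITUDE:
        return []
    return [name for name, location in OFFICES.items()
            if haversine_km(epicenter, location) <= RADIUS_KM]
```

For example, `offices_to_alert(7.2, (37.8, -122.3))` would flag only the hypothetical `office-a`, while a magnitude 6.9 event at the same epicenter would flag nothing.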
How Can Teams Get Started?
There are a host of use cases that real-time operations can influence outside the arena of DevOps and IT. Other teams can benefit – even sales teams that need legal review of a contract before the quarter ends or immediate approval from the deal desk to close the deal.
Often critical work is delayed because it’s sent to one specific person, who could be on holiday, unavailable or may even have left the business. Delays can also be caused by alert noise flooding inboxes, messaging platforms like SMS and Slack, and even purpose-built alerting tools.
This is where real-time platforms like PagerDuty can excel. Managing and responding to critical work is our specialty. We start by directing critical incidents to the right people at the right time through automatic escalation policies, so incidents don’t sit around waiting for someone to respond. We then provide the tools to not only orchestrate the work, but also identify possible culprits to dramatically reduce time to resolution (MTTR).
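As a rough illustration of how an automatic escalation policy keeps work from waiting on one person, here is a minimal Python sketch. It is a generic model, not PagerDuty's API. The `EscalationLevel` class, the `current_responder` function, the responder names and the timeouts are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class EscalationLevel:
    responder: str         # who gets notified at this level
    timeout_minutes: int   # how long to wait for an acknowledgement


# Hypothetical policy for a contract review; names and timeouts are made up.
POLICY = [
    EscalationLevel("primary-legal-reviewer", 15),
    EscalationLevel("backup-legal-reviewer", 15),
    EscalationLevel("head-of-legal", 30),
]


def current_responder(policy, minutes_unacknowledged):
    """Return who should be notified after the given unacknowledged time."""
    elapsed = 0
    for level in policy:
        elapsed += level.timeout_minutes
        if minutes_unacknowledged < elapsed:
            return level.responder
    # Past every timeout: stay with the last level rather than dropping the work.
    return policy[-1].responder
```

With this policy, an item that has gone unacknowledged for 20 minutes has already moved from the primary reviewer to the backup, so a reviewer on holiday does not stall the work.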
Using PagerDuty for critical work means teams don’t need to constantly scan their inbox and Slack for work – it comes to them directly, rather than blending in with everything else. This helps teams to effectively prioritize critical work, instead of forcing them to always prioritize dynamically.
Critical work affects everyone. Failure to act fast could have a range of consequences – from not billing on time to missing a security threat. Adopting real-time operations to address this can help ensure teams react swiftly when time is of the essence.