In the movie about the nearly disastrous re-entry of the Apollo 13 spacecraft, the flight director at Mission Control, played by Ed Harris, sternly declares that “Failure is not an option.” While such a spirited proclamation may be exciting for moviegoers, in the actual moments when NASA engineers were determining the best course for returning the astronauts, they remained calm and collected.
For PagerDuty’s Rich Adams, the main thing to keep in mind when an incident strikes is to remain calm. “Don’t Panic!” was Douglas Adams’ first bit of advice about space travel in “The Hitchhiker’s Guide to the Galaxy,” and it applies to IT incident response as well. Freaking out helps no one and spooks the other parties involved, Adams noted during an educational session held the day before the company’s PagerDuty Summit Thursday.
PagerDuty offers a service for automating incident management at the IT level — and increasingly — for other business functions. The service aligns nicely with the DevOps practices of speeding up the development cycle, allowing administrator teams to take advantage of automated workflows to speed recovery of damaged services — without burning out staff with unnecessary 3 am wake-up calls.
An “incident response” is any coordinated effort that an organization undertakes to deal with some issue that has come up, Adams explained. The idea is not merely to eliminate the problem but also to limit the damage and reduce recovery time and costs.
“You need an automated way to trigger this incident process,” Adams explained.
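One common way to automate that trigger is PagerDuty’s Events API v2, which accepts a JSON event and opens an incident on the matching service. The sketch below builds and sends such an event; the routing key and the alert details are placeholders, not values from the talk:

```python
import json
import urllib.request

# Placeholder -- replace with a real Events API v2 routing key
# from a PagerDuty service integration.
ROUTING_KEY = "YOUR_32_CHAR_INTEGRATION_KEY"

def build_trigger_event(summary, source, severity="critical"):
    """Build a PagerDuty Events API v2 'trigger' payload."""
    return {
        "routing_key": ROUTING_KEY,
        "event_action": "trigger",
        "payload": {
            "summary": summary,    # short human-readable description
            "source": source,      # host or service that saw the problem
            "severity": severity,  # critical, error, warning, or info
        },
    }

def send_event(event):
    """POST the event to the Events API v2 endpoint."""
    req = urllib.request.Request(
        "https://events.pagerduty.com/v2/enqueue",
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

event = build_trigger_event("Checkout latency above 2s", "web-01")
print(event["event_action"])
# send_event(event)  # uncomment once ROUTING_KEY is set
```

A monitoring system would normally make this call for you; the point is that any signal, human or automated, can funnel into the same incident process.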
PagerDuty’s own service handles the routine tasks of the incident management process. It orders the information coming in from the monitoring systems, organizes the people who will be involved in the remediation process, alerts appropriate personnel through multiple communications channels, allows stakeholders to set and change priorities on incidents, and provides a space and relevant data to hold post-mortem analysis.
As cutting-edge as PagerDuty’s cloud-based technology may be, the company adheres to a set of well-defined procedures for managing crises, drawing from the U.S. government’s National Incident Management System (NIMS), a general framework whose incident command practices were first developed in the 1970s to cover all sorts of disasters.
PagerDuty has a dedicated approach to handling incidents. Anyone in the company can trigger an incident. “Humans are good at picking out impending failures,” Adams noted. Key to this process is the assignment of an “incident commander” (IC) to each incident.
The role of the IC cannot be overstated. Once selected, the IC is in charge of the situation. He or she leads all the calls or chat sessions related to the incident — even the CEO can’t usurp the agenda that the IC sets.
It is important to note that the IC does not actually solve the issue. That person is only in charge of coordinating the efforts of others, while making all the final decisions. Because the role requires coordination rather than technical expertise, anyone at the company can be an incident commander. Currently, the effort is volunteer-driven; you don’t need to be an engineer to be an IC, even if the problem is technical in nature.
Other roles are assigned as well. Each incident also has a deputy, or backup IC. A scribe is assigned to record all the interactions that take place, whether by Slack chat or by a telephone conference call. An external liaison takes care of all customer-facing, or external-facing, communications, relaying to the outside world that the service is down. An internal liaison does the same for communications within the company itself (these roles can be combined in smaller organizations).
Finally, there are the subject matter experts (SMEs). These are the engineers and operations folks who know how the system actually works and can diagnose the problem and fix it. The IC will work closely with the SMEs to characterize the problem and then figure out ways of fixing it.
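The role structure above can be sketched as a simple data model. This is purely illustrative — the names are made up, and PagerDuty’s actual tooling is not shown here:

```python
from dataclasses import dataclass, field

@dataclass
class IncidentRoster:
    """One person per coordination role, plus any number of SMEs."""
    commander: str         # runs the response; does not fix the problem
    deputy: str            # backup IC
    scribe: str            # records all chat/call interactions
    external_liaison: str  # customer-facing communications
    internal_liaison: str  # company-internal communications
    smes: list = field(default_factory=list)  # subject matter experts

    def add_sme(self, name):
        """SMEs join and leave as the diagnosis narrows."""
        self.smes.append(name)

# Hypothetical roster; in a small organization the two
# liaison roles (or deputy and scribe) could be one person.
roster = IncidentRoster(
    commander="Rich", deputy="Dana", scribe="Lee",
    external_liaison="Sam", internal_liaison="Pat",
)
roster.add_sme("database engineer")
print(roster.commander, roster.smes)
```

The fixed fields make the point of the model explicit: every coordination role has exactly one owner, while the pool of experts is open-ended.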
Adams stressed that the IC shouldn’t be the one to fix the problem. More often than not, what appears to be a simple issue can turn out to be more complex, and suddenly, the IC would have two roles — fixing the problem, and reporting statuses back to others.
“I’m Rich and I’m the Incident Commander”
Good communication is essential for ICs. ICs should speak clearly and get their point across, Adams explained. Being clear is better than being concise — try to avoid acronyms. The idea is to ask questions, size up the situation, and then figure out the next step, quickly assessing what risks are involved. Do you take the service offline to reboot the servers and potentially solve the problem more quickly? Or do you do a rolling restart, which may take longer but will not interrupt service? Risk assessment is a huge part of the task.
Making the wrong decision can be better than making no decision at all, Adams noted, given that making the wrong decision can bring in more data. “Don’t get reckless, but don’t get hung up,” he said. Every task should be assigned to a particular individual, not a team (though individuals can assign sub-tasks to others in a team). Before each decision, ask if anyone has any “strong objections,” emphasizing the word “strong,” so as to instill the idea that these are unusual circumstances and any objections should be serious ones. Once a task is agreed upon, the IC should give the task owner a specific deadline, say 20 minutes or an hour.
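That assignment discipline — one named owner and a hard deadline per task — can be sketched as a small tracker. This is an illustration of the practice, not any actual PagerDuty feature:

```python
from datetime import datetime, timedelta

class TaskLog:
    """Track incident tasks: one named owner and a deadline each."""

    def __init__(self):
        self.tasks = []

    def assign(self, description, owner, minutes):
        """Assign a task to an individual (never a team) with a deadline."""
        deadline = datetime.now() + timedelta(minutes=minutes)
        self.tasks.append(
            {"task": description, "owner": owner, "deadline": deadline}
        )
        return deadline

    def overdue(self, now=None):
        """Tasks past their deadline -- the IC's cue to follow up."""
        now = now or datetime.now()
        return [t for t in self.tasks if t["deadline"] < now]

# Hypothetical usage during an incident call.
log = TaskLog()
log.assign("Roll back the 14:02 deploy", "alice", minutes=20)
print(len(log.tasks))
```

Keeping an owner and a deadline on every task is what lets the IC chase progress at the agreed time instead of interrupting responders mid-work.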
The key thing is that the IC stays in charge of the incident. In a sense, the IC’s role is to remain calm, which conveys a sense of purpose and order to other participants and observers. And part of the job is to maintain the IC’s authority. Oftentimes, incident response teams will suffer from what Adams called “executive swoop”: a high-level boss joins the call to give out orders, tries to reassess the severity of the incident, or asks for additional documentation (“an Excel spreadsheet with all the affected customers”), all of which can disrupt the overall process. In these cases, the IC should ask the exec, or any other disruptor, if they would like to take over as IC. They always decline, Adams noted.
Adams also offered some rules based on what the company found doesn’t work in this IC setup:
- Don’t call everyone: Each time an incident at PagerDuty occurred, “We used to call every engineer,” Adams said. This is fine when the company has five engineers, but not when it has 60. Having unnecessary workers involved disrupts both work and home life, and can make employees grouchy.
- Limit the IC’s “span of control”: No more than about eight people should report to the IC, though each of those individuals can have sub-teams.
- Limit the frequency of status updates: Spending too much time sending out status updates takes away resources from actual problem solving, especially when there are no significant updates to convey. Try to work on the status updates when there is a lull in the conversation.
- Don’t assume silence means no progress is taking place: In many cases, there is significant work being done when there is silence on the call or in the chat room.
- Don’t force everyone to stay on the call: If you determine that a problem is, say, within an application, then there is no reason for the site reliability engineer (SRE) to stick around. Dismiss others when it is clear they are no longer needed.
Finally, don’t forget the post-mortem, Adams noted. The entire incident management process can be recorded and played back for later analysis.
PagerDuty is a sponsor of The New Stack.