PagerDuty Open Sources Its Incident Response Best Practices
This post is part of a sponsored series from PagerDuty on disseminating incident response knowledge.
If a team of paramedics turned up at your house and didn’t know how their equipment worked or had to look up how to set up an IV drip, you’d probably be pretty upset. And if you’re trying to figure out how to effectively respond to an IT incident during the incident, your customers will likewise probably be upset.
Thankfully, since they undergo extensive training, emergency responders are usually well prepared, can maintain calm in various situations, and work together like a well-oiled machine. We IT responders can use the same skills demonstrated by emergency responders to help improve IT incident response — namely, by ensuring our training processes are just as rigorous so we can approach incidents calmly and with a clear head.
Many tech organizations already have such processes in place, but very few publicly share them. At PagerDuty, we decided to take a different approach: We open-sourced our entire incident response documentation and best practices guide, as well as our training materials and security response process.
Anyone can use the information in our guide, whether they’re a PagerDuty customer or not. The guide is about how organizations can respond to incidents — regardless of the products they use — so we focus on the principles and techniques of incident response as opposed to how one can perform specific actions within a tool.
Sharing Is Caring: Learning and Growing Together
Every organization, from small startups to large enterprises, experiences incidents. We can’t escape the fact that as systems grow more complex, they will also inevitably fail in increasingly complex and incomprehensible ways.
One of the benefits of using our open-source guide is that you can skip some of the awkward growing pains that we at PagerDuty went through. For example, we learned that it’s important to gain the consensus of everyone involved in an incident before a decision is finalized — this way, no one comes back later to say, “I knew that wouldn’t work.”
To head off that kind of second-guessing, we ask the group, “Are there any strong objections?” We purposely phrase the question to ask for disagreement. It lets us make progress quickly during an incident and counters hindsight, because the absence of objections implicitly means everyone agrees with the decision. Lessons like this are normally hard-won, but with our open-source guide, others can learn from our mistakes instead of repeating them.
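As a rough illustration of the "any strong objections?" rule, here is a minimal Python sketch. It is not taken from the PagerDuty guide; the names `Decision`, `raise_objection`, and `poll_for_objections` are hypothetical and exist only for this example. It assumes each responder either stays silent (recorded as `None`) or gives a reason for objecting, and it treats silence as agreement.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Decision:
    """A proposed action put to the responders (hypothetical model)."""

    proposal: str
    objections: list[str] = field(default_factory=list)

    def raise_objection(self, responder: str, reason: str) -> None:
        # Record who objected and why, so the record exists before acting.
        self.objections.append(f"{responder}: {reason}")

    @property
    def agreed(self) -> bool:
        # Silence counts as agreement: no strong objections means proceed.
        return not self.objections


def poll_for_objections(proposal: str, responses: dict[str, str | None]) -> Decision:
    """Ask every responder for strong objections to a proposal.

    `responses` maps a responder's name to an objection reason,
    or to None if that responder has no strong objection.
    """
    decision = Decision(proposal)
    for responder, reason in responses.items():
        if reason:
            decision.raise_objection(responder, reason)
    return decision
```

For instance, `poll_for_objections("Roll back the deploy", {"database": None, "web": None}).agreed` evaluates to `True`, while a single stated reason from any responder makes it `False` and keeps the reason on record.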
Incident Response Roles and Training Guide
At PagerDuty, our incident response process is based on the Incident Command System (ICS). Developed in the 1970s, ICS is the U.S. national model that local, state and federal emergency responders use during major incidents, from responding to bomb threats to mobilizing teams during natural disasters.
Note, however, that while our process is based on ICS, we heavily modified it for our needs, since some things that make sense for emergency responders don’t make sense for us. Mainly, we removed roles that didn’t apply and added new ones that do.
The primary focus of our process is on the role of Incident Commander. The leader in any incident response process, the Incident Commander coordinates all communication between responding parties, ensuring things keep moving toward a resolution. While every incident is different (hopefully you’re not having the same issues over and over again), the process for responding is the same each time. The Incident Commander isn’t a technical expert, but rather an expert in how to respond effectively, and they rely on the technical experts to provide information on their relevant systems.
Our training guide covers all of the main techniques and best practices for someone who wants to become an Incident Commander. We talk about a number of different topics, from how to ensure people communicate effectively by using clear language as opposed to concise language littered with acronyms, to how to handle difficult situations, such as when an executive joins the response and starts trying to call the shots (what we call “executive swoop”).
Incident Response Process: Practice Makes Perfect
We can provide as many guides as you can read, but the key to any successful incident response process is practice. You want it to be routine. Much like how emergency responders train for hours to optimize their response during calls or disasters, the more your team practices the response process, the more relaxed and calm everyone will be when an incident strikes.
Our guide includes various ways you can practice, whether it’s purposefully breaking your systems every week as part of a Failure Friday exercise or playing a game of Keep Talking and Nobody Explodes. The video game is my personal favorite: although it might not seem like a way to practice incident response, it’s actually a great way to practice Incident Commander skills in a more relaxed setting. Bonus: If you play with a group, it can be lots of fun, and you can all build a natural rapport that will be useful during a real incident.
These skills are also useful in a whole host of other settings. For example, I’ve used them as a new parent, where there’s a surprising amount of overlap.
All organizations have different ways of operating. And all can learn from one another. With this in mind, we at PagerDuty decided it was better to share our incident response process with the world so we can all learn and grow. By sharing our knowledge, we give others the opportunity to not only improve their processes, but also provide direct feedback to us so we can improve our own as well.
Curious to learn more? Check out all the documentation.