Tips to Make Your On-Call Process Less Stressful
Every organization wants an efficient incident response process: modern on-call scheduling software and an on-call practice that is less stressful and actually improves incident response, uptime and reliability.
If you belong to a team doing DevOps, site reliability engineering, operations or development, the odds of getting paged are high even with modern incident response platforms and well-defined processes in place. You are invariably required to be on at least one on-call rotation. That can mean hours, days or weeks of simply being available for when the dreaded pager starts ringing, sometimes including weekends and holidays. The lucky ones get few to no pages while on call, but let's face it, that's a rarity.
Even when you don’t get paged, just the anticipation of the pager going off at random hours of the night can be draining. It’s unfortunate that an alarming number of people consider being on call the single most dreadful part of their job.
Being on call shouldn't suck, but there is no getting rid of it. No system is perfect: systems will fail, and humans going on call will always be part of keeping them reliable. What helps is giving on call the kind of attention it deserves. Your on-call function is a direct reflection of your engineering practices and company culture, so if your team dreads it, you have some serious issues to fix.
And fixing them is an organizational goal, not just the engineering team's.
A lot can be done to make your on-call function better, and this post outlines some of the steps you can take.
Setting the Right Expectations
There are a lot of misconceptions around being on call, and it is important to establish the real objective for it. A few common ones are:
- “I need to know everything before I go on call”: This is not true. You don’t need to know everything. Like everything else, it is a learning process.
- “I need to find a long-term fix”: On call is focused on quick fixes and immediate mitigation. Long-term resolutions can and should be arrived at collaboratively.
- “I need to do this alone”: It is daunting even to think that you may need to do all of this alone, and you’re most certainly not expected to. You can ask for help; it is important for this to be an integral part of the process. Collaboration is everything.
Some people assume they'd be a "hero" if they found the long-term fix for an issue, and make it an implicit goal no matter how long it takes. This makes things harder for the rest of the team, for whom on call may already be very stressful. Recognizing employee effort and outcomes is important, but it shouldn't come at the cost of making the process harder for everyone else.
It is important to set expectations straight for your engineers who are going on call and ensure that you have effective on-call schedules and rotations across your team to avoid pitfalls like the above “hero” example.
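The core of most scheduling tools is a simple round-robin rotation. Here is a minimal sketch of that logic; the team names and weekly shift length are illustrative, not drawn from any particular scheduling product.

```python
from datetime import date, timedelta

def on_call_engineer(engineers, rotation_start, today, shift_days=7):
    """Return who is on call today in a simple round-robin rotation.

    engineers: list of names in rotation order (hypothetical example data).
    rotation_start: date the rotation began.
    shift_days: length of each shift (weekly by default).
    """
    days_elapsed = (today - rotation_start).days
    shift_index = days_elapsed // shift_days
    return engineers[shift_index % len(engineers)]

# Example: a three-person weekly rotation starting on a Monday.
team = ["asha", "bo", "carmen"]
start = date(2023, 1, 2)
print(on_call_engineer(team, start, date(2023, 1, 10)))  # second week, so "bo"
```

Real schedules also need overrides, swaps and follow-the-sun handoffs, which is exactly why teams reach for dedicated tooling rather than a calendar script.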
Getting Your Onboarding Plans Right
It really helps to feel like you know enough to get started with on call rather than being thrown into it blindly. The well-known 40/70 rule for informed decision-making forms the basis of this.
The idea is that you need between 40% and 70% of the available information to make a good decision. Below 40%, you're playing blind man's bluff; waiting for more than 70%, and you've probably lost a lot of time in the resolution process.
An onboarding program that helps you gather that 40% to 70% of the information you need can go a long way. Every good onboarding plan should ultimately cover the areas below.
Understanding Your Systems’ Anatomy:
- Get an overview of all systems, their owners, architecture and components. When an incident hits, this helps you understand the impact it may have, prepare for it and know who to call for more help.
- Understand the dependencies between components. This helps during the investigation/triage phase and gets you to the root cause analysis (RCA) faster.
- Go through past incidents and how they were resolved; this is the fastest way to understand system behavior from an on-call perspective. Your knowledge base and runbooks also show the typical resolution processes in place.
Maintaining the Right Set of Tools:
Understanding the tools of the trade is half the trade itself. This also opens up opportunities to improve your on-call process. More often than not, tooling is a crucial area where broader inefficiencies can be addressed.
It becomes important to build and maintain open documentation that outlines:
- The observability tools used in the organization as well as the metrics and events being tracked by them.
- Any logging sources and visualizations to understand log data.
- Tools typically used for tracing and storing traces.
- Incident management tool(s).
- Other related resources and solutions for on-call management, status pages, CI/CD, ChatOps, ticket management, customer communication, etc.
Keep Your On-Call Teams Incident Ready:
Ensure that your teams know what's coming when going on call and feel fully prepared for it. Training can not only be informative but also help them understand the cultural nuances of how the on-call process works.
A few ways to do this would be to:
- Make shadowing a part of the process. This way, new on-call engineers see what goes into resolving an incident and the kind of resources needed. The pager can ring anytime, and they should be able to handle it without stress.
- Assign someone the role of scribe, responsible for documenting the incident in preparation for RCAs, reviews and status updates.
- Start with low-severity incidents. This provides hands-on experience with all the tools available, so you don't panic and feel lost when you're actually on call.
- Create high-severity incidents in a simulated environment and get the team to resolve them together, to prepare for the worst.
- Use scenario-based games like Wheel of Misfortune, customized for your organization. This makes the learning process more fun.
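A Wheel of Misfortune drill needs little more than a list of failure scenarios and a way to pick one at random. A minimal sketch, with made-up scenarios you would replace with incidents drawn from your own history:

```python
import random

# Hypothetical failure scenarios for a Wheel of Misfortune-style drill;
# replace these with (sanitized) incidents from your own history.
scenarios = [
    "Primary database CPU pegged at 100%",
    "TLS certificate expired on the public load balancer",
    "Message queue consumer lag growing without bound",
    "A deploy shipped a config that 500s 10% of requests",
]

def spin_the_wheel(seed=None):
    """Pick a scenario; pass a seed to make a drill reproducible."""
    rng = random.Random(seed)
    return rng.choice(scenarios)

print(spin_the_wheel())
```

The person running the drill spins, the team talks (or works) through triage and mitigation, and the scribe captures gaps in runbooks or tooling that surface along the way.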
Build and Maintain Your Knowledge Base:
Having a knowledge base is probably the most important part of the whole on-call journey; this cannot be stressed enough. Much of the anxiety of being on call fades when team members know they can resolve whatever comes their way, and that confidence can be instilled simply by pointing at past incidents and showing how they were resolved.
To ensure that your team understands this, you can start with:
- Going through existing runbooks to understand the resolution process.
- Going over how to look at historical incidents and what was done to resolve them.
- Explaining the importance of post-incident reviews or postmortems and the most common ways incident reviews are done in the organization.
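One way to make historical incidents useful during a live page is to look them up by shared tags. Here is a small sketch of that idea; the incident records and tag names are hypothetical, and in practice the data would come from your incident management tool.

```python
# Hypothetical past-incident records; real entries would come from an
# incident management tool's export or API.
past_incidents = [
    {"id": "INC-101", "tags": {"database", "latency"}, "resolution": "Failed over to replica"},
    {"id": "INC-102", "tags": {"payment", "timeout"}, "resolution": "Restarted payment worker"},
    {"id": "INC-103", "tags": {"database", "disk"}, "resolution": "Expanded volume, pruned logs"},
]

def similar_incidents(tags, history):
    """Rank past incidents by how many tags they share with the current one."""
    scored = [(len(tags & inc["tags"]), inc) for inc in history]
    return [inc for score, inc in sorted(scored, key=lambda s: -s[0]) if score > 0]

matches = similar_incidents({"database", "latency"}, past_incidents)
print([m["id"] for m in matches])  # INC-101 first (two shared tags), then INC-103
```

Even this naive overlap score is enough to surface "we've seen something like this before" and point the responder at a past resolution instead of a blank page.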
How Squadcast Makes On Call Less Stressful
At Squadcast, we are always brainstorming ways to improve on-call practices, and we try to build those practices into our product so they are accessible and simple for anyone using it. Here are a few practices we follow to keep on call a smooth experience.
- We ensure that every incident with high severity or moderate to high impact gets a postmortem report outlining the incident and its resolution.
- We use our knowledge base of postmortems and historical incidents to get our new recruits started with on call.
- We use incident deduplication to ensure that we do not get inundated with alerts.
- We use autogenerated incident tags to identify, classify and enrich incident context.
- We use our virtual war rooms to call in subject-matter experts and other members on the on-call team for help when needed. We understand that on call is daunting and your tool should have the ability to collaborate instantly when needed.
- We refer to our automated incident timelines when writing our postmortem reports and to understand similar incidents, if there are any.
- We use our private status page, where all components (internal and external facing) and their dependencies are clearly mapped out. It also shows our incident history, the resolutions implemented and the RCAs published afterward.
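The core idea behind incident deduplication in the list above is simple: alerts with the same fingerprint inside a time window collapse into one incident. A minimal sketch of that mechanism, not Squadcast's actual implementation; real tools fingerprint on richer fields (service, check, severity) and persist state across restarts:

```python
import time

class Deduplicator:
    """Suppress repeated alerts with the same fingerprint inside a window."""

    def __init__(self, window_seconds=300):
        self.window = window_seconds
        self.last_seen = {}  # fingerprint -> timestamp of last accepted alert

    def accept(self, fingerprint, now=None):
        """Return True if this alert should open/page an incident."""
        now = time.time() if now is None else now
        last = self.last_seen.get(fingerprint)
        if last is not None and now - last < self.window:
            return False  # duplicate inside the window: suppress it
        self.last_seen[fingerprint] = now
        return True  # new alert, or the window has expired

dedup = Deduplicator(window_seconds=300)
print(dedup.accept("db-latency-high", now=0))    # True: page
print(dedup.accept("db-latency-high", now=120))  # False: suppressed
print(dedup.accept("db-latency-high", now=400))  # True: window expired
```

The window length is the knob to tune: too short and responders still get paged repeatedly for one ongoing problem; too long and a genuinely new recurrence can get silently swallowed.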
Enjoyed this? We would love to hear from you! What do you struggle with as a DevOps/site reliability engineer? Do you have ideas on how on call could be done better in your organization?
Leave us a comment or reach out over a DM via Twitter and let us know your thoughts.