PagerDuty sponsored this post.
Being on call can be challenging. If you’ve been woken up at 2 a.m. by an alert, or had to interrupt dinner with a friend to jump on an incident, you know the emotional toll that being on call takes. It’s impossible to eliminate every possibility of failure from a system. Even with code freezes, load testing, game days and more, failure can still happen. You might be holding the proverbial pager when it does.
Improving your on-call processes and iterating on them regularly makes on-call life better. In this blog post, we’ll share how you can help yourself and your team weather the storm.
Building Your On-Call Confidence
Anyone who’s been on call knows that there’s a learning curve before you get used to the way things work. If you’re newer to it or haven’t been on call before, you’re in good company: It’s intimidating! As with anything new, it takes practice and familiarity to ease into it. This is super-important for people new to going on call as well as seasoned on-call engineers who have changed teams or companies. Here are some things you can do to build on-call confidence.
Shadowing and Reverse-Shadowing
Shadowing is a common technique for training new team members. During a shadowing session, the new teammate will follow an experienced teammate during their on-call shift, usually during business hours. It’s a similar practice to pair programming. After shadowing, the new teammate may or may not feel ready to be in the driver’s seat, so it’s important to allow them to test drive. This is where reverse shadowing comes in. During this time period, the roles are reversed, and the shadow now responds to alerts. If the new teammate finds themself in need of some help, their mentor can jump in.
At PagerDuty, this period of shadowing and reverse shadowing is very common during months two and three of a new engineer’s tenure. This process isn’t just valuable for new hires. It’s also great for those who might be nervous about being on call. You can set up shadowing opportunities within your teams at any time.
Don’t Be Afraid to Escalate
Unsure of how to resolve an incident? We’ve all been there. In a recent webinar, engineering manager Dileshni Jayasinghe spoke about how the past 18 months changed the way her team approached triage. Prior to the pandemic, when an alert was triggered, her teammates could simply turn in their chairs and ask one another questions before deciding whether to kick off an incident.
Now, Jayasinghe says it’s important to err on the side of caution. Trigger incidents first, escalate without shame as soon as you need to and know that your team is there to support you. Time spent debating with yourself whether an incident is bad enough to loop in teammates is time you can’t afford to lose. When downtime can cost thousands of dollars per minute, especially during the holidays, acting fast can protect both revenue and customer satisfaction.
Safety Is Key
Psychological safety, according to Harvard professor Amy Edmondson, is a belief that the workplace is safe for speaking up. She notes that to be successful, teams need both a commitment to excellence and psychological safety to enter the learning zone, which is the best and most productive way to work.
To help build psychological safety, try to focus on empathy and blamelessness. Empathy might look like recognizing how a teammate, or yourself, could have made a mistake and affirming that this is normal. This also requires blamelessness. Rather than naming and shaming someone or calling yourself out for failure, focus on the systemic problems that contributed to this failure. Was there a lack of tooling or documentation? Was the responder tired after a night of being woken up by alerts?
By having empathy and focusing on blamelessness, you can foster psychological safety within your team. This will increase everyone’s confidence in their ability to be on call and remind them that even if they do fail, they can learn and bounce back.
Improving Process and Documentation
Even the most confident on-call engineer can be shaken when they find that the processes and documentation they’re supposed to use don’t reflect the current state. At regular intervals throughout the year, you should review your processes and documentation to ensure everything is up to date. Here are some of the most important things to check:
- On-call rotations: If you haven’t established an on-call rotation yet, work with your team to figure out who is responsible for which days. If you have an on-call rotation, double-check that every day you need covered has someone assigned to it.
- Map of services and dependencies: Each service likely depends on others and will have services depending on it. If you follow a full-service ownership model, you should have these connections mapped out. If this isn’t documented, there’s no time like now to ensure that you know how your service affects the wider system.
- Incident response documentation: You need to understand how you categorize and resolve incidents. Make sure that you have documentation that outlines what qualifies an incident for each severity level. Record the roles and responsibilities of responders to streamline response.
- Runbooks: Runbooks come in many shapes and sizes. Some are elaborate auto-remediation sequences that can resolve incidents before a human needs to be involved. Others are lighter automation sequences that can provide responders with additional context or eliminate toil from the process. Even if there’s no automation involved, runbooks are helpful step-by-step guides. These should detail the process for resolving common issues so that any responder can act on them.
- Tooling: Make sure you’re familiar with the on-call tools your team uses. Review things like operational analytics, dashboards, incident response tools and more. As engineering manager Leeor Engel noted, “You want as little novelty as possible when you get paged. That way, you have the things you need at your fingertips. If you can mentally rehearse that by getting those tools down, that’s a huge help.”
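Two of the checks above, rotation coverage and the service dependency map, lend themselves to a quick script. The following is only a minimal sketch under stated assumptions: the schedule and dependency data are hypothetical, and the function names `find_coverage_gaps` and `affected_by` are illustrative, not part of any PagerDuty tool or API. In practice you would load this data from wherever your team keeps its rotation and service catalog.

```python
from collections import defaultdict, deque
from datetime import date, timedelta


def find_coverage_gaps(schedule, start, end):
    """Return every date from start to end (inclusive) with no responder assigned."""
    gaps = []
    day = start
    while day <= end:
        if not schedule.get(day):
            gaps.append(day)
        day += timedelta(days=1)
    return gaps


def affected_by(depends_on, service):
    """Return every service that directly or transitively depends on `service`."""
    # Invert the map: for each service, which services depend on it?
    dependents = defaultdict(set)
    for svc, deps in depends_on.items():
        for dep in deps:
            dependents[dep].add(svc)

    # Breadth-first walk over the inverted map.
    seen = set()
    queue = deque([service])
    while queue:
        current = queue.popleft()
        for dependent in dependents[current]:
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return seen


# Hypothetical holiday rotation: nobody is assigned on Dec. 26.
schedule = {
    date(2021, 12, 24): "alice",
    date(2021, 12, 25): "bob",
    date(2021, 12, 27): "alice",
}
print(find_coverage_gaps(schedule, date(2021, 12, 24), date(2021, 12, 27)))
# [datetime.date(2021, 12, 26)]

# Hypothetical dependency map: each service lists the services it calls.
depends_on = {
    "checkout": ["payments", "inventory"],
    "payments": ["auth"],
    "inventory": [],
    "auth": [],
}
print(sorted(affected_by(depends_on, "auth")))
# ['checkout', 'payments']
```

Running a check like this before a busy period gives you a concrete list of uncovered days to fill, and a quick answer to "who else feels it if my service goes down?"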
With the right processes and documentation, you’ll be better prepared to handle any problems during your on-call shift. This preparation can help you save time when every second means major dollars for your organization. Beyond the bottom line, however, there’s an even more important consideration.
Evaluating Responder Health and Communicating
Keeping in touch with how you’re faring is the most important thing you can do. You’ll need to know how to describe, both qualitatively and quantitatively, how your on-call rotation went and what support you need.
According to a report created from PagerDuty platform data comparing 2019 to 2020, burnout has become an even bigger issue over the past 18 months. We compared the number of off-hour interruptions users experienced and broke them down into three categories:
The good category had only two interruptions per month. The bad category had seven. The ugly category had 19. For the ugly category, these after-hours interruptions accounted for an extra two hours of work per day, or an extra 12 weeks of work per year. According to our data, this cohort is also the most likely to leave the PagerDuty platform (our proxy for attrition). It’s everyone’s responsibility to make sure that burnout does not reach this level.
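To see how two extra hours per day adds up to roughly 12 extra weeks per year, here is a quick back-of-the-envelope check. The working-calendar values (five workdays per week, 48 working weeks per year, a 40-hour work week) are our own assumptions for illustration and do not come from the report.

```python
EXTRA_HOURS_PER_DAY = 2      # figure quoted from the report
WORKDAYS_PER_WEEK = 5        # assumption
WORKING_WEEKS_PER_YEAR = 48  # assumption: 52 weeks minus vacation and holidays
HOURS_PER_WORK_WEEK = 40     # assumption

extra_hours_per_year = EXTRA_HOURS_PER_DAY * WORKDAYS_PER_WEEK * WORKING_WEEKS_PER_YEAR
extra_weeks_per_year = extra_hours_per_year / HOURS_PER_WORK_WEEK

print(extra_hours_per_year, extra_weeks_per_year)  # 480 12.0
```

Different calendar assumptions shift the result slightly, but the order of magnitude stays the same: about a quarter of a year of additional work.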
One way to do this is to keep track of how many times you’re interrupted and at what time these interruptions occur. After all, an interruption at 2 p.m. is much less disruptive than one at 2 a.m. Analytics tools can help you understand what your on-call shift looked like so you can bring data to your team and manager. If you’re a manager, you want to look at these metrics and check in to gauge how your team is doing.
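If your tooling doesn’t surface this for you, a small script can do the tallying. This is a minimal sketch, assuming you can export the timestamps of your interruptions in your own local time. The 9-to-5 weekday window and the function names are illustrative assumptions, so adjust them to your team’s actual working hours.

```python
from collections import Counter
from datetime import datetime, time

# Assumed business hours; anything outside them, or on a weekend, is off-hours.
BUSINESS_START = time(9, 0)
BUSINESS_END = time(17, 0)


def is_off_hours(ts):
    """True if ts falls on a weekend or outside the assumed business hours."""
    if ts.weekday() >= 5:  # Saturday is 5, Sunday is 6
        return True
    return not (BUSINESS_START <= ts.time() < BUSINESS_END)


def off_hours_per_month(interruptions):
    """Count off-hours interruptions, keyed by (year, month)."""
    counts = Counter()
    for ts in interruptions:
        if is_off_hours(ts):
            counts[(ts.year, ts.month)] += 1
    return dict(counts)


# Hypothetical interruption log, in the responder's local time.
interruptions = [
    datetime(2021, 11, 3, 2, 5),    # Wednesday, 2:05 a.m.: off-hours
    datetime(2021, 11, 3, 14, 0),   # Wednesday, 2 p.m.: business hours
    datetime(2021, 11, 6, 11, 30),  # Saturday: off-hours
    datetime(2021, 12, 1, 23, 45),  # Wednesday, 11:45 p.m.: off-hours
]

print(off_hours_per_month(interruptions))
# {(2021, 11): 2, (2021, 12): 1}
```

Monthly counts like these are easy to set beside the two, seven and 19 interruptions per month cited above when you talk with your team and manager.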
Additionally, you can ask your manager about override and day-off policies, which should be detailed in your on-call documentation. Overrides and recovery days aren’t “work perks” for engineers. They’re table stakes for health.
To learn more about excelling at on-call and staying well while you do it, you can check out these resources:
- On-call Ops Guide
- Webinar: The Volume and Human Impact of On-Call and Real-Time Work
- Best Practices for On-Call Teams
- The State of Digital Operations Report
- Full Service Ownership Ops Guide
Or see how PagerDuty can help you adopt best practices with a 14-day free trial.
The New Stack is a wholly owned subsidiary of Insight Partners, an investor in the following companies mentioned in this article: Real.
Featured image via Pexels.