
How Can You Tell If Your On-Call System Is Broken?

20 Sep 2021 7:00am

As the old adage goes, if it ain’t broke, don’t fix it. But when you’re dealing with people and processes, you don’t know things are working until you ask — sometimes, repeatedly.

And even on the technology side of things, you really can’t tell what’s broken until you interrogate your systems with monitoring and observability. That’s why we go on call — to respond to both increasingly unpredictable tech and users.

On-call rotations are one of the most dreaded tasks software engineers face, but they are also one of the best opportunities to learn about and improve your code. They can be a soul-destroying sleep interrupter or a positive experience that increases customer value. That’s why on-call is one of the top things developers ask about during job interviews.

So, how do you identify the flaws in your on-call process? How do you continuously improve that pager duty experience? How do you stop adding to existing burnout? At this year’s LeadDev Live, a panel of site reliability and engineering leads shared lessons on how to craft an efficient on-call process that devs actually want to participate in.

The Need for a Good Feedback Loop

It’s always best to start with a baseline definition of what “good” is. Stevi Deter is a principal software engineer at platform-as-a-service DexCare. After years of coordinating all-hours support for complex systems integrations, she defines a successful on-call process as one with a positive feedback loop. A process fails, she told the conference audience, when the on-call process is siloed, when pages aren’t being captured and brought into the overall engineering process and prioritization.

“That’s going to completely break your ability to capture where you need to improve your product, where your product’s actually being used, and also the importance of understanding of how you’re affecting your engineers in their day-to-day lives in being able to support your product,” Deter said.

Nothing is more discouraging than repeat pages that those on call can’t even do anything about.

“On-call shouldn’t be seen as some sort of punishment,” she said. “It’s actually, sort of, one of the advantages of success in that you actually need to support your product. And it’s really one of the richest sources of finding out what’s going wrong with your system.

“So, if you’re not capturing that, if you’re not having a system that makes engineers who are on call feel empowered that they can actually improve the overall system — for themselves, for their other on-call engineers and for the overall organization — then you’re missing out.”

Phil Calçado, senior director of engineering at SeatGeek, agreed with Deter: “The perception of things that they’re the same or getting worse is kind of what kills a process like that.”

Tailor the Schedule to the Team

To understand the quality of your on-call feedback loop, you’ve first got to understand the engineers’ workload, argued Ricardo Aravena, site reliability engineering (SRE) manager at Rakuten. He recommends planning a hand-off after every shift, asking:

  • What were the major alerts?
  • How many times have you seen this specific alert or incident before?
  • Is there something you could fix?
  • Anything you couldn’t fix?

It’s important to balance the shared workload, Aravena said. Of course, at a larger organization, you may have six or more people in rotation, but at a startup, you may have only two. Those two may communicate more easily, but as you expand internationally, you are better able to craft the on-call process around time zones, even allowing for 24-hour coverage in reasonable eight-hour shifts, so that no one has to be paged at 2 a.m.
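The time-zone idea can be made concrete with a small sketch. Everything named here is an assumption for illustration: three hypothetical regions with fixed UTC offsets (daylight saving ignored) and a local shift of 9:00 to 17:00. It is not a description of any panelist's actual rotation.

```python
# Hypothetical regions and fixed UTC offsets (assumption; daylight saving ignored).
REGIONS = {"Americas": -8, "EMEA": 0, "APAC": 8}
SHIFT_START_LOCAL = 9  # assumed local start hour of each shift
SHIFT_HOURS = 8        # eight-hour shifts, as described above


def region_on_call(utc_hour):
    """Return the region whose local working shift covers this UTC hour."""
    for region, offset in REGIONS.items():
        local_hour = (utc_hour + offset) % 24
        if SHIFT_START_LOCAL <= local_hour < SHIFT_START_LOCAL + SHIFT_HOURS:
            return region
    return None  # nobody is in their daytime shift: a coverage gap


# Map every UTC hour to the region that would carry the pager.
coverage = {hour: region_on_call(hour) for hour in range(24)}
gaps = [hour for hour, region in coverage.items() if region is None]
```

With these assumed offsets, the three shifts tile the full 24 hours and `gaps` is empty, so every page lands during someone's local working day. Offsets that are not eight hours apart would leave gaps or overlaps, which is where a team would have to negotiate.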

“When you have incidents … just make sure that something actually happens as a result of that. So if you do get woken up in the middle of the night and it turns out it was a major incident that you have this sense of, ‘OK, that sucked, but now we’re actually going to figure out how to not have that happen to us again.’”

—Stevi Deter, principal software engineer, DexCare @smd

On-call schedules will change as company size changes. During the pandemic, Calçado said, his employer, a live-event ticket retailer, had to reduce its team size while still providing on-call support for often unpredictable traffic.

Deter’s team also had to find ways to prioritize what gets done on call. Earlier this year, DexCare spun out as an incubator from the much larger Providence Digital Health. It went from three teams of six engineers on call to just one team of six — supporting the same amount of software, with a growing user base.

“Be realistic about what is expected, what they can achieve during a certain amount of time. And think also about how it affects your overall processes,” she said.

Deter urged the audience to ask, “What can you expect out of a person who is on call? Can you expect them to also be doing sprint work?”

One way DexCare ameliorated the tripled workload was to have a team just focus on paying down technical debt.

They also realized there was a shared dread of Mondays when they were ending shifts that ran through the weekend. They experimented with switching off on Fridays. This had a dramatic, positive effect when teammates finished midday Fridays and got the weekend to recuperate after a week on call.

Tailor the Process to the Organization

On-call must be a constantly evolving process, noted Jaime Woo, site reliability educator, mindfulness instructor, and co-editor of “97 Things Every SRE Should Know.”

“Keep changing it. Location matters. Team size matters. Needs matter. And I don’t think there’s always that intention of who is going to watch out as it evolves,” he added, to make sure “you have that flexibility and cohesion.”

Calçado has only worked at small to medium-sized businesses, where the on-call process was always in flux. The incident management side of the on-call process, he told the audience, can be pretty straightforward, but the “on-call component is a bit more complicated because it relates so much to people’s personal lives and expectations and just happiness around the company.”

He warned against attempting a company-wide on-call strategy, recommending instead that teams decide their own. After all, they are the ones who should know everyone’s optimal working schedule. Many of his SeatGeek colleagues are based in Israel, where the weekend is Friday to Saturday, which meant the teams shifted on-call rotations a day earlier.

“You need to be flexible and work out even on a day-to-day basis what’s best for each team,” Calçado advised.

Of course, it’s always a good reminder to never page new parents. As Honeycomb founder and CTO Charity Majors has put it: “You should not have more than one thing waking you up in the middle of the night.”

Sync Up with Incident Response and Postmortems

Nothing is more demoralizing than feeling ineffective. You need a plan. For Aravena, the first step toward on-call success is opening a communication channel with your team. This can be regular retrospectives and handovers mixed with anonymous feedback tools like Slack polling — all backed, of course, by incident response managers.

You want to pay attention to key DevOps metrics like mean time to repair (MTTR) or even mean time to detect an issue. Leadership will always want to know if these are getting better or worse over time.
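As a rough illustration of how those two metrics might be computed, here is a minimal sketch. The incident records and field names (`started`, `detected`, `resolved`) are hypothetical, and it assumes MTTR is measured from detection to resolution; some teams measure from the start of the fault instead.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident log: when the fault began, when the first page
# fired, and when the incident was resolved.
incidents = [
    {
        "started": datetime(2021, 9, 1, 2, 0),
        "detected": datetime(2021, 9, 1, 2, 10),
        "resolved": datetime(2021, 9, 1, 3, 0),
    },
    {
        "started": datetime(2021, 9, 8, 14, 0),
        "detected": datetime(2021, 9, 8, 14, 4),
        "resolved": datetime(2021, 9, 8, 14, 34),
    },
]


def mean_minutes(records, start_key, end_key):
    """Average elapsed minutes between two timestamps across incidents."""
    return mean(
        (r[end_key] - r[start_key]).total_seconds() / 60 for r in records
    )


mttd = mean_minutes(incidents, "started", "detected")   # mean time to detect
mttr = mean_minutes(incidents, "detected", "resolved")  # mean time to repair
```

Computing the same numbers per week or per rotation would show whether they are trending better or worse over time.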

But you should also be on the lookout for patterns, like whether the same type of alerts keep occurring, or if engineers can’t do anything to fix recurring incidents. Tracking and reducing false positives, alerts that fire when nothing is actually wrong, is equally important to avoid burnout.
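One possible way to surface such patterns is to count pages by alert name and note how many of them the on-call engineer could not act on. This is only a sketch: the page records, the `actionable` flag and the alert names are all made up for illustration.

```python
from collections import Counter

# Hypothetical page log for one rotation. "actionable" records whether the
# engineer on call was able to do anything about the page.
pages = [
    {"alert": "upstream-timeout", "actionable": False},
    {"alert": "disk-full", "actionable": True},
    {"alert": "upstream-timeout", "actionable": False},
    {"alert": "cert-expiry", "actionable": True},
    {"alert": "disk-full", "actionable": True},
    {"alert": "upstream-timeout", "actionable": False},
]


def recurring_alerts(pages, min_count=2):
    """List alerts that fired at least min_count times, most frequent first."""
    counts = Counter(p["alert"] for p in pages)
    unactionable = Counter(p["alert"] for p in pages if not p["actionable"])
    return [
        {"alert": name, "pages": n, "unactionable": unactionable[name]}
        for name, n in counts.most_common()
        if n >= min_count
    ]


report = recurring_alerts(pages)
```

An alert that recurs and is mostly unactionable, like the made-up `upstream-timeout` here, is a natural candidate to bring to the next hand-off or retrospective.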

It is essential, Calçado said, to make a habit of debriefings and postmortems. Regularly ask teammates what happened on call, even if nothing big did. You may need to tune your alert system to be sure you’re not overlooking incidents.

He recommended you regularly create a timeline of what typical incidents look like and try to observe themes. This can help de-personalize discussion of problems, he noted, “so you’re more free to have your criticism instead of thinking that you’re talking about a colleague or a friend.”

Always ask: What would you change in the process? This is especially powerful in a one-on-one setting, where everyone gets a chance to speak.

Always remember, an on-call process has the potential to empower a team. For Woo, it all comes down to how you and your team choose to frame your experience:

“You can hate it — you still have to do it anyway. Or you can learn to not necessarily love it, but learn something from it. Why are we getting these alerts or why are people feeling this way? I think through that curiosity, through that humility, something great will happen.”
