Keeping the Lights On: The On-Call Process that Works
The on-call process is a touchy subject for a SaaS company. On the one hand, you must have one, because your prod server always seems to go down at 2 a.m. on a Saturday. On the other hand, it places a heavy burden on those who must be on call, especially at a small company like Tinybird, where I currently head the engineering team.
I have actively participated in creating the on-call process at three different companies. Two of those processes worked very well, while the other didn’t. Here, I’m sharing what I’ve learned about making on call successful.
Before an On-Call Process: Stress and Chaos
When I joined Tinybird, we didn’t have an on-call system. We had automated alerts and a good monitoring system, but nobody was responsible for an on-call process or a rotation schedule between employees.
Many young companies like ours don’t want to create a formal on-call process. Many employees justifiably shy away from the pressure and individualized responsibility of being on call. It seems better to just handle issues as a hive.
But in reality, this just creates more stress. If nobody is responsible, everybody is responsible.
In the absence of a formal process, Tinybird relied on proactive employees and mobile notifications for some of our alert channels. In other words, it was disorganized, unstructured and stressful. We had multiple alert channels, constant noise and many alerts that weren’t actionable. If that sounds familiar, it’s because this is typical in most companies.
Obviously, this approach to handling production outages doesn’t scale, and it’s a recipe for poor service and disgruntled customers. We knew we needed a formal on-call structure and rotation, but we wanted to avoid overwhelming our relatively small team (at the time, we had fewer than 10 engineers).
How It Started: Implementing an On-Call Process
People don’t want an on-call process. They’re afraid that this on-call experience will look like their last on-call experience, which inevitably sucked. Underneath that fear is insecurity about trying to solve a problem you know little about when nobody is around (or awake) to help. And that burden of responsibility weighs heavily. Sometimes you have to make a decision that can have a big impact. Downtime can be caused by the difference between a 0 and a 1.
Our goal at Tinybird was to assuage those fears and insecurities so that people felt empowered and respected as on-call engineers.
Before we even discussed a process, we outlined some core principles for the on-call system that would provide boundaries and guidance for our implementation.
Core Principles for an On-Call Process
- On call is not mandatory. Some people, for various reasons, do not want or are not able to be on call. We respect that choice.
- On call is financially compensated. If you are on call, you get paid for your time and energy.
- On call is 24/7. We provide a 24/7 service, and we must have somebody actively on call to maintain our SLAs.
- Minimize noise. Noise makes stress. If alerts aren’t actionable, this stress will cause burnout. Alerts must always be actionable.
- On call isn’t just for SREs (site reliability engineers). Every engineer should be able to participate. This promotes ownership among all team members and increases everyone’s awareness and understanding of our systems.
- Every alert should have a runbook. Since anybody from any function could be on call, we wanted to make sure everyone knew what to do even if the issue wasn’t with their code or system.
- Minimize the amount of time spent on call. Our goal was to only have people be on call once every six weeks. Of course, depending on how many people participate, this may not be achievable, but we set it as a target regardless.
- Have a backup. Our service-level agreements (SLAs) matter, so we always wanted to have a backup in case our primary on-call personnel were unreachable for whatever reason.
- Paging someone should be the last resort. Don’t disrupt somebody outside of working hours unless it is absolutely necessary to maintain our SLAs. Additionally, every time an incident occurs, measures should be taken to prevent its recurrence as much as possible.
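The principles about shift frequency and backups can be made concrete with a small scheduling sketch. The Python below is illustrative only and is not Tinybird’s actual tooling: the function name build_rotation, the engineer names and the rule that each week’s backup becomes the next week’s primary are all assumptions. It shows that with a simple weekly round-robin, a team of N volunteers puts each person on primary once every N weeks, so a six-week target needs at least six participants.

```python
from datetime import date, timedelta


def build_rotation(engineers, start, weeks):
    """Return a list of (week_start, primary, backup) tuples.

    Assumed policy (hypothetical): weekly shifts, round-robin order,
    and the backup for a week is the next person in the list.
    """
    if len(engineers) < 2:
        raise ValueError("need at least two engineers for a primary and a backup")
    schedule = []
    for week in range(weeks):
        primary = engineers[week % len(engineers)]
        backup = engineers[(week + 1) % len(engineers)]
        schedule.append((start + timedelta(weeks=week), primary, backup))
    return schedule


# Hypothetical team of six volunteers, starting on a Monday.
engineers = ["ana", "ben", "chloe", "dev", "eli", "fay"]
schedule = build_rotation(engineers, date(2024, 1, 1), weeks=12)

for week_start, primary, backup in schedule:
    print(week_start.isoformat(), "primary:", primary, "backup:", backup)
```

With six names in the list, the same person is primary again exactly six weeks later. Note that under this assumed policy each person is also backup once per cycle, so they carry some on-call duty in two of every six weeks.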
How We Implemented an On-Call Process
So, here’s how we approached our on-call implementation.
First, we made a list of all our existing alerts. We asked two questions:
- Are they understandable? Any of our engineers should be able to see the alert and understand the nature and severity of it very quickly.
- Are they actionable? Alerts that aren’t actionable are just noise. Every alert should demand action. That way, there’s no doubt about whether action should be taken when an on-call alert pops into your inbox.
Second, as much as possible, we made alerts measurable, and each one pointed to the corresponding graph in Grafana that described the anomaly.
In addition, we migrated all of our on-call alerts to a single channel. No more hunting down alerts in different places. We used PagerDuty for raising alerts.
Critically, we created a runbook for each alert that describes the steps to follow to assess and (hopefully) fix the underlying issue. With the runbook, engineers feel empowered to solve the problem without having to dig for more context.
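The checks described above (understandable, actionable, linked to a graph, routed to one channel, backed by a runbook) can be expressed as a simple audit over alert definitions. This is a minimal sketch under stated assumptions: the Alert fields, the audit function, the "pagerduty" channel label and the example.com URLs are all hypothetical and do not reflect any real PagerDuty or Grafana API.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    """Hypothetical alert definition; field names are illustrative."""
    name: str
    summary: str = ""        # what is wrong and how severe it is
    action: str = ""         # what the on-call engineer should do
    dashboard_url: str = ""  # graph that shows the anomaly
    runbook_url: str = ""    # steps to assess and fix the issue
    channel: str = ""        # where the alert is delivered


def audit(alert, paging_channel="pagerduty"):
    """Return a list of problems; an empty list means the alert passes."""
    problems = []
    if not alert.summary:
        problems.append("not understandable: missing summary")
    if not alert.action:
        problems.append("not actionable: missing action")
    if not alert.dashboard_url:
        problems.append("missing dashboard link")
    if not alert.runbook_url:
        problems.append("missing runbook")
    if alert.channel != paging_channel:
        problems.append("not routed to the single on-call channel")
    return problems


good = Alert(
    name="ingestion_lag_high",
    summary="Ingestion lag above threshold for 10 minutes",
    action="Follow the runbook to check the ingestion queue",
    dashboard_url="https://grafana.example.com/d/ingestion",
    runbook_url="https://wiki.example.com/runbooks/ingestion-lag",
    channel="pagerduty",
)
bad = Alert(name="disk_warning", summary="Disk usage is elevated", channel="chat")

for alert in (good, bad):
    print(alert.name, audit(alert) or "ok")
```

Running a check like this over the full alert list gives a concrete to-do list for the cleanup: anything with a non-empty result is either fixed or removed.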
For about two months, every Monday, Tinybird’s CTO and I would meet with the platform team to review each alert with the following objectives:
- If the alert was not actionable or was a false positive, correct or eliminate it.
- If the alert was genuine, analyze it to find a long-term solution and give it the necessary priority.
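The two review outcomes above amount to a small triage rule. The sketch below is only an illustration of that rule; the ReviewedAlert type, its fields and the sample alert names are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class ReviewedAlert:
    """Hypothetical record of one alert discussed in the weekly review."""
    name: str
    actionable: bool
    false_positive: bool


def triage(alerts):
    """Split reviewed alerts into the two outcomes of the weekly review."""
    fix_or_remove = []        # noise: correct the alert or delete it
    needs_long_term_fix = []  # genuine: find a lasting fix and prioritize it
    for alert in alerts:
        if alert.false_positive or not alert.actionable:
            fix_or_remove.append(alert.name)
        else:
            needs_long_term_fix.append(alert.name)
    return fix_or_remove, needs_long_term_fix


last_week = [
    ReviewedAlert("cpu_spike_short", actionable=False, false_positive=False),
    ReviewedAlert("replica_down", actionable=True, false_positive=False),
    ReviewedAlert("latency_probe", actionable=True, false_positive=True),
]
noise, genuine = triage(last_week)
print("fix or remove:", noise)
print("needs long-term fix:", genuine)
```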
We also started reviewing each incident report collaboratively with the entire engineering team. Before we implemented this process, we would create incident reports (IRs) and share them internally, but we decided to take it a step further.
Now, each IR is presented to the entire engineering team in an open meeting. We want everyone to understand what happened, how it was resolved and what was affected. We use the meeting to identify action points that can prevent future occurrences, such as improving alerts, changing systems or architecture, and removing single points of failure. This process not only helped us mitigate future issues but also helped increase ownership and overall knowledge of our code and systems across the entire team. The more people know about the codebase, the more they feel empowered to fix something when they are on call.
Initially, we had just three people on call (two engineers and the CTO). We knew this would be challenging for these three, but it was also a temporary way to assess our new process before we rolled it out to the entire team.
Note that while on call outside working hours is optional, we still made on call mandatory during working hours. Each engineer is expected to take an on-call rotation during a normal shift. This has several benefits:
- Increased ownership: Being on call makes you realize the importance of shipping code that is monitored and easily operable. If you know you’re going to be on call to fix something you shipped, you’ll spend more time making sure you know how to operate your code, how to monitor it and how to parse the alerts that get generated.
- Knowledge sharing and reduced friction: Being on call can feel scary when you’re alone. But if you’re on call during working hours, you are not alone. For newcomers, this helps them ease into the on-call process without anxiety. They learn how to respond to common alerts, and they also learn that being on call isn’t as noisy or scary as they think.
Every week, when the on-call shift changes, we review the last shift. We use this time to share knowledge and tricks, identify cross-team initiatives necessary to improve the system as a whole and so on.
Finally, anytime a person is the primary person on call overnight, we give them the next day off.
How It’s Going: Where Are We Now?
About a year after implementing this new on-call process, we have nine people rotating on the primary (24/7) on-call system and six people simultaneously on call during working hours.
It has worked exceptionally well. While I won’t go so far as to say that our engineers enjoy being on call, I think it is fair to say that they feel empowered to handle issues that do arise, and they know they have a forum where they can share difficulties with the on-call system and suggestions for how to improve it.
If you’re interested in hearing more about Tinybird and our on-call system, I’d love to hear from you. Also, if you’re inspired by Tinybird’s on-call culture and think you’d like to work with us, check out our open roles here.