Being on-call can be a daunting and disruptive experience. Many people with on-call duties complain that having to be ready to handle incidents affects their work-life balance and even their health: on-call employees may be woken up frequently in the middle of the night, or may need to plan evenings and weekends around on-call duties. As organizations roll out changes to scale their on-call teams, they need to consider how to match that growth with a sustainable and humane approach. Below is some advice based on our experience so far with our customers at OpsGenie.
Transparency is the key to successful communication. Clarifying expectations around employee availability is a must when rolling out a new on-call system or changing an existing one. People with new on-call duties may have questions like the following, and the answers should be clear before the changes are rolled out:
- Are engineers supposed to be on-call during nights?
- If on-call during nights, is there flexibility to work from home the next day, or to start later than usual to make up for lost sleep?
- Are engineers supposed to do development work during on-call time?
- What is the maximum number of times per month an engineer would be on-call?
Providing proper training to new on-call engineers is a must for scaling on-call systems successfully. Having an established training plan for on-call processes and tooling is the first step in this direction.
Providing engineers with detailed and up-to-date runbooks is also very important. That helps engineers see what the protocol is for certain situations, or how similar issues were resolved effectively in the past.
Shadowing experienced on-call engineers is valuable as well: new on-call engineers get a direct feel for the on-call atmosphere while still receiving guidance and support.
Having multiple escalation channels in on-call schedules is a good idea, especially for junior on-call engineers. Not being completely alone when help is needed comes as a relief. Related to this, another best practice is to put junior engineers in the primary on-call rotation and a more senior engineer in a back-up or secondary rotation. This arrangement helps junior engineers develop the required on-call skills while avoiding panic when an issue is beyond their expertise.
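As a minimal sketch of the primary/secondary arrangement described above, an escalation policy can be modeled as an ordered list of steps, each with a delay. The names used here (`EscalationStep`, `who_to_notify`, the responder labels, and the 10 and 30 minute delays) are illustrative assumptions, not part of any particular tool's API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EscalationStep:
    responder: str      # hypothetical responder or rotation name
    wait_minutes: int   # minutes an alert stays unacknowledged before this step fires


def who_to_notify(steps: List[EscalationStep], minutes_unacknowledged: int) -> List[str]:
    """Return every responder whose escalation delay has already elapsed."""
    return [s.responder for s in steps if s.wait_minutes <= minutes_unacknowledged]


# Assumed policy: junior engineer first, senior back-up after 10 minutes,
# team lead after 30 minutes.
policy = [
    EscalationStep("junior_primary", 0),
    EscalationStep("senior_secondary", 10),
    EscalationStep("team_lead", 30),
]

print(who_to_notify(policy, 0))
print(who_to_notify(policy, 12))
```

The delays are the main tuning knob: a short delay before the senior back-up is notified keeps the junior engineer from being stuck alone, while a longer one gives more room to practice.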
Development Duties During On-Call
Having development duties during on-call usually means lots of context switching and interruptions, especially when the on-call load is heavy. It also has a negative effect on development sprints, since it is difficult to estimate how much on-call people can contribute to the sprint. The usual result is lower development efficiency and more stress for the on-call engineers. As a best practice, we recommend not assigning development duties to developers while they are on-call. When there is free time, ask them to work on improving on-call documentation and automation, which eventually improves the sustainability of the systems and services.
Well-Defined and Fine-Tuned Processes and Systems
A healthy on-call system can only exist if improved constantly by fine-tuning processes and systems. A few recommendations towards this goal:
- Evaluate alert priority and urgency and set systems based on that. Low urgency alerts can wait until the morning; on-call employees do not need to wake up in the middle of the night because of these alerts.
- Reduce false positives by classifying alerts based on factors such as root cause, originating system, message, and thresholds. This way, actionable alerts can be differentiated from the rest.
- Deduplicate related alerts to avoid alert fatigue. On-call employees do not need a flood of notifications within an hour for the same issue; give them the chance to focus on the resolution instead of notifying them with the same alert over and over.
- Design rich alerts that empower the on-call engineers to make active decisions and apply the knowledge recorded in runbooks.
- Provide alert reports and metrics to on-call teams so that they can see the weak areas in the systems and gain insight into root causes and resolutions. Do not let on-call teams get bogged down by the same problem repeatedly.
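Two of the recommendations above, holding low-urgency alerts until morning and deduplicating repeats, can be sketched in a few lines. This is an illustration under stated assumptions: the `Alert` fields, the `AlertRouter` class, the one-hour deduplication window, and the 22:00 to 07:00 night window are all hypothetical choices, not the behavior of any specific alerting product.

```python
from dataclasses import dataclass
from datetime import datetime, time, timedelta


@dataclass(frozen=True)
class Alert:
    source: str          # originating system
    message: str
    priority: str        # "high" or "low" (assumed two-level scheme)
    created_at: datetime


class AlertRouter:
    def __init__(self, dedup_window=timedelta(hours=1),
                 night_start=time(22, 0), night_end=time(7, 0)):
        self.dedup_window = dedup_window
        self.night_start = night_start
        self.night_end = night_end
        self._last_paged = {}  # (source, message) -> time of last page

    def _is_night(self, t: time) -> bool:
        # The night window wraps around midnight.
        return t >= self.night_start or t < self.night_end

    def route(self, alert: Alert) -> str:
        """Return "page", "defer" (wait until morning), or "suppress" (duplicate)."""
        key = (alert.source, alert.message)
        last = self._last_paged.get(key)
        if last is not None and alert.created_at - last < self.dedup_window:
            return "suppress"
        if alert.priority == "low" and self._is_night(alert.created_at.time()):
            return "defer"
        self._last_paged[key] = alert.created_at
        return "page"


router = AlertRouter()
night = datetime(2024, 1, 10, 2, 0)
print(router.route(Alert("db", "replica lag", "high", night)))
print(router.route(Alert("db", "replica lag", "high", night + timedelta(minutes=10))))
print(router.route(Alert("web", "disk 70% full", "low", night)))
```

Keying deduplication on source and message is the simplest choice; a real setup would likely group on a richer fingerprint such as root cause or affected service.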
Reviewing On-Call Reports
For fairness, managers are advised to review on-call reports and check how often each team member was paged or woken up, then adjust the on-call system accordingly to help avoid employee burnout. Below is an on-call report for one of our teams internally. The report shows information such as how long each team member has had on-call duty, distribution of on-call hours for each on-call schedule, and the hourly and daily distributions of on-call duty for each person.
Friendly On-Call Culture
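Independently of any particular reporting tool, the kind of per-person tally a manager might review can be sketched as follows. The function name `page_summary`, the input format (pairs of engineer name and page time), the sample data, and the 22:00 to 07:00 definition of "night" are assumptions made for illustration.

```python
from collections import Counter
from datetime import datetime


def page_summary(pages, night_start_hour=22, night_end_hour=7):
    """Count total pages and night-time pages per engineer.

    `pages` is an iterable of (engineer, timestamp) pairs.
    """
    total = Counter()
    night = Counter()
    for person, ts in pages:
        total[person] += 1
        if ts.hour >= night_start_hour or ts.hour < night_end_hour:
            night[person] += 1
    return {p: {"pages": total[p], "night_pages": night[p]} for p in total}


# Hypothetical page log for two engineers.
log = [
    ("alice", datetime(2024, 1, 8, 3, 15)),
    ("alice", datetime(2024, 1, 8, 14, 0)),
    ("alice", datetime(2024, 1, 9, 23, 40)),
    ("bob", datetime(2024, 1, 10, 11, 5)),
]

for person, stats in page_summary(log).items():
    print(person, stats)
```

A large gap between engineers in the night-page count is the kind of signal that may justify rebalancing a rotation.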
On-call engineers carry huge responsibility for the success of companies. The job requires extra caution when rolling out changes, and responding knowledgeably as fast as possible when there is a problem. Such responsibility may naturally mean stress and tension, especially when there is a big issue and there are unknowns. The on-call culture set by the senior on-call engineers and management teams defines how people deal with that stress and tension and whether the on-call experience is transformed for the good of the company or not.
We also recommend that management teams make sure on-call employees are being listened to. Management should organize all-hands meetings with the on-call engineers to discuss problems, complaints, and areas of weakness, and then take actions that help resolve them. Constantly reevaluate the on-call system, tools, processes, people, documentation, and training plans in use. Finally, make sure that management is not the sole decision maker when it comes to on-call organization and protocol; instead, make each decision a team decision.
For both the on-call engineers’ sake and the on-call culture of the company, management teams should pay attention to developing a friendly on-call culture and make it clear that the goal should always be to find the problems, risks, and weaknesses in systems and solve them. The goal should never be to find whom to blame.