Who ya gonna call?
Social networking service LinkedIn has open sourced two tools developed internally that it uses to answer that question when the service experiences problems.
“And when things do break, we need to escalate as quickly as possible to make sure the problem gets fixed. This usually takes the form of calling up an on-call engineer, but what if this person doesn’t answer the phone?”
The company previously used a manual process to determine who to contact next but found that didn’t work well in a rapidly growing company.
In 2015, the company began working on the automated system Iris, named after the Greek goddess of messages. Iris allows users to define an escalation plan that it will automatically follow should an incident occur.
Plans are laid out in steps. If Iris doesn’t hear back from the first contact, it proceeds to the next step and continues until someone responds or there are no more defined steps. Iris defines priorities, from low to urgent, and allows users to map contact modes to these priorities. It doesn’t dictate how each person is to be contacted but allows for personal preferences: One person might choose Slack messages rather than email, while another might opt for a text.
In an example from the blog post, Iris simultaneously sends a medium-priority message to the whole monitoring infrastructure team, along with a high-priority message to the primary on-call. The number and frequency of repeats can be configured.
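To make the step-by-step behavior concrete, here is a minimal Python sketch of how such a plan could be walked. It is not Iris's actual code or data model: the names `Notification`, `Step`, `run_plan`, and `DEFAULT_MODES`, the sample users, and the default mode for each priority are all assumptions made for illustration, and the wait between repeats is left out.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Notification:
    role: str        # who to notify, e.g. "team" or "oncall-primary"
    priority: str    # "low", "medium", "high", or "urgent"
    repeat: int = 0  # extra sends after the first one


@dataclass
class Step:
    notifications: List[Notification]


# Assumed fallback mapping of priority -> contact mode (hypothetical values).
DEFAULT_MODES = {"low": "email", "medium": "email", "high": "sms", "urgent": "call"}


def run_plan(
    steps: List[Step],
    resolve: Callable[[str], List[str]],
    preferences: Dict[str, Dict[str, str]],
    acknowledged: Callable[[int], bool],
) -> List[Tuple[int, str, str]]:
    """Walk the plan in order and return (step number, user, mode) per message."""
    sent = []
    for number, step in enumerate(steps, start=1):
        for note in step.notifications:
            for user in resolve(note.role):
                # A user's own preference wins over the default mode.
                mode = preferences.get(user, {}).get(
                    note.priority, DEFAULT_MODES[note.priority]
                )
                for _ in range(1 + note.repeat):
                    sent.append((number, user, mode))
        # Stop escalating as soon as someone has responded to this step.
        if acknowledged(number):
            break
    return sent


# Step 1 mirrors the example above; step 2 is an invented follow-up.
plan = [
    Step([Notification("team", "medium"),
          Notification("oncall-primary", "high", repeat=1)]),
    Step([Notification("manager", "urgent")]),
]
roster = {"team": ["ana", "bo"], "oncall-primary": ["ana"], "manager": ["cy"]}
prefs = {"bo": {"medium": "slack"}}  # bo prefers Slack for medium priority

messages = run_plan(plan, lambda role: roster[role], prefs,
                    acknowledged=lambda number: number == 2)
```

In this run nobody answers step 1, so the plan moves on to the manager in step 2 and stops once that step is acknowledged.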
Rather than specifically defining users to escalate to, Iris supports custom roles, with pluggable methods of role lookups. It also works with multiple messaging services, including Slack, Twilio, and SMS messaging.
LinkedIn heavily uses the “team,” “on-call,” and “manager” roles, each of which is determined dynamically from a separate tool, called Oncall. Delegating those lookups to Oncall frees Iris from having to track these details, allowing it to focus solely on message delivery.
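One way to picture pluggable role lookups is a small registry that maps a role name to a function returning the people who currently fill it. The Python sketch below is an illustration under that assumption, not Iris's plugin interface: `RoleRegistry` is a hypothetical name, and the in-memory dictionaries stand in for data that would really come from a tool such as Oncall.

```python
from typing import Callable, Dict, List

# A lookup takes a target (here, a team name) and returns usernames.
RoleLookup = Callable[[str], List[str]]


class RoleRegistry:
    """Maps role names to interchangeable lookup functions."""

    def __init__(self) -> None:
        self._lookups: Dict[str, RoleLookup] = {}

    def register(self, role: str, lookup: RoleLookup) -> None:
        self._lookups[role] = lookup

    def resolve(self, role: str, target: str) -> List[str]:
        if role not in self._lookups:
            raise KeyError(f"no lookup registered for role {role!r}")
        return self._lookups[role](target)


# Hypothetical stand-ins for an external source of truth.
TEAMS = {"monitoring": ["ana", "bo", "cy"]}
ONCALL_NOW = {"monitoring": "bo"}
MANAGERS = {"monitoring": "dee"}

registry = RoleRegistry()
registry.register("team", lambda team: list(TEAMS[team]))
registry.register("oncall", lambda team: [ONCALL_NOW[team]])
registry.register("manager", lambda team: [MANAGERS[team]])

recipients = registry.resolve("oncall", "monitoring")
```

Because the escalation logic only ever calls `resolve`, a lookup can be swapped for another data source without touching the plans that reference the role.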
Oncall provides a calendar to track which team members are on call for a particular shift. It supports follow-the-sun schedules, and its UI makes it easy for managers to make changes as necessary. Other teams that don’t manage critical applications within LinkedIn, such as sales, use Oncall as a specialized calendar.
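At its simplest, a shift calendar of this kind answers one question: who is on call at a given moment? The Python sketch below shows that lookup for a follow-the-sun rotation. The `Shift` type, the `on_call_at` function, the users, and the dates are all invented for illustration and say nothing about how Oncall stores its schedules.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional


@dataclass(frozen=True)
class Shift:
    user: str
    start: datetime  # inclusive, timezone-aware
    end: datetime    # exclusive, timezone-aware


def on_call_at(shifts: List[Shift], moment: datetime) -> Optional[str]:
    """Return the user whose shift covers `moment`, or None if there is a gap."""
    for shift in shifts:
        if shift.start <= moment < shift.end:
            return shift.user
    return None


def utc(day: int, hour: int) -> datetime:
    # Arbitrary example month; times are kept in UTC to avoid ambiguity.
    return datetime(2017, 7, day, hour, tzinfo=timezone.utc)


# Follow-the-sun: three eight-hour shifts handed from region to region.
shifts = [
    Shift("ana", utc(1, 0), utc(1, 8)),
    Shift("bo", utc(1, 8), utc(1, 16)),
    Shift("cy", utc(1, 16), utc(2, 0)),
]

current = on_call_at(shifts, utc(1, 9))
```

Treating the end of each shift as exclusive means the handover instant belongs to exactly one person, so a lookup at 16:00 returns the incoming engineer rather than both.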
Iris’s only task is ensuring that incidents are acknowledged. Twilio and other messaging vendors handle message delivery and other systems handle alerting.
Wang explains Iris’s architecture:
“… an application triggers an incident by sending a POST request to Iris’s REST API, which tracks the incident in its database. Then, the Iris sender uses this incident data to generate messages according to the incident’s escalation plan, forwarding the notifications to external messaging vendors such as Twilio or Slack for delivery. A user then receives the message and responds to it to claim the incident, either by using the Iris frontend or by sending a reply to the vendor.”
If a claim is processed through the vendor, a relay provides access back to Iris’ internals through the company’s firewalls.
“Finally, the API receives the user’s request to claim the new incident, marking the incident as acknowledged. After the incident has been claimed, Iris’ job is done; it has guaranteed successful message delivery and confirmed that someone is responding to messages, so it ceases to escalate further,” Wang explained.
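The lifecycle Wang describes (create an incident, send messages, stop once someone claims it) can be modeled in a few lines. The Python sketch below is an in-memory stand-in, not Iris's API or schema: `IncidentTracker`, its methods, and the `outbox` list that takes the place of a messaging vendor are assumptions for illustration.

```python
import itertools
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Incident:
    id: int
    plan: List[List[str]]        # each step is a list of recipients
    next_step: int = 0           # index of the step to send next
    owner: Optional[str] = None  # set once someone claims the incident


class IncidentTracker:
    def __init__(self) -> None:
        self._ids = itertools.count(1)
        self.incidents: Dict[int, Incident] = {}
        self.outbox: List[Tuple[int, str]] = []  # stand-in for a vendor

    def create(self, plan: List[List[str]]) -> int:
        """Record a new incident, as the POST request would."""
        incident = Incident(next(self._ids), plan)
        self.incidents[incident.id] = incident
        return incident.id

    def send_pending(self) -> None:
        """One sender pass: message the next step of every unclaimed incident."""
        for incident in self.incidents.values():
            claimed = incident.owner is not None
            exhausted = incident.next_step >= len(incident.plan)
            if claimed or exhausted:
                continue
            for user in incident.plan[incident.next_step]:
                self.outbox.append((incident.id, user))
            incident.next_step += 1

    def claim(self, incident_id: int, user: str) -> None:
        """Mark the incident as acknowledged so escalation stops."""
        self.incidents[incident_id].owner = user


tracker = IncidentTracker()
incident_id = tracker.create([["ana"], ["bo", "cy"]])
tracker.send_pending()             # step 1 goes to ana
tracker.claim(incident_id, "ana")  # ana responds
tracker.send_pending()             # nothing more is sent
```

After the claim, the second sender pass skips the incident, so `bo` and `cy` are never contacted.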
Iris is designed to be modular and independent of external applications.
LinkedIn has released around 100 projects as open source, including the stream processing framework Samza and the distributed streaming platform Kafka, both now under the Apache Software Foundation. In March, it released Flashback, a tool for mocking internet traffic for developer tests, under a BSD two-clause license.
Richard Waid, senior manager of site reliability at LinkedIn, said there are no plans to donate Iris and Oncall to the ASF, but “we’re interested in working with and integrating with other open source projects and getting feedback on Iris and Oncall from the open source community.”