“Double, double toil and trouble; fire burn, and cauldron bubble,” chant three witches in Shakespeare’s “Macbeth” as they warn of getting more than you originally wished for. The message is apt for site reliability engineers (SREs).
Although the SRE job role is often defined as being about automation, the reality is that 59 percent of SREs agree there is too much toil (defined as manual, repetitive, tactical work that scales linearly) in their organization. Based on 188 survey responses from people holding SRE job roles, Catchpoint’s second annual SRE Report surprisingly found that almost half (49 percent) of the SREs believe their organization has not used automation to reduce toil.
SREs, often inspired by DevOps, have high expectations for automation. Yet there are key differences between the two, and SRE responsibilities are much closer to those associated with systems administrators. SREs have the capability to automate and innovate but are often burdened by IT operations’ historical focus on incident management and reliability.
Although automation is the top technical skill needed by SREs according to last year’s report, the reality is that the day-to-day responsibilities of IT operations cannot always be eliminated by writing a new script or creating an improved infrastructure configuration. It turns out that automating the CI/CD process is just one of many SRE responsibilities.
Another responsibility is responding to “incidents,” which are usually defined as a service being down. Fifty-two percent of respondents deal with more than one incident a week, which can generate a lot of stress because incidents affect customer satisfaction and because availability is how SRE success is measured.
Availability is the key metric used to define the “reliability” part of the SRE role. Three-quarters of SREs say their organization has service level objectives (SLOs), and of this group, almost everyone said availability is tracked. Latency and the response time experienced by end users are also utilized but not as often.
Monitoring service providers and fine-tuning an application’s performance can reduce the number of incidents and move the organization closer to five nines (99.999 percent availability), which means experiencing only about five minutes of downtime a year. Yet, despite the promise of AIOps, or artificial intelligence for operations, most incidents cannot be automated away.
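The five-nines figure is simple arithmetic, and a rough sketch makes it easy to compare targets. The function name `allowed_downtime_minutes` and the use of a 365.25-day average year are assumptions made for illustration; they do not come from the Catchpoint report.

```python
# Assumption: an average year of 365.25 days (accounts for leap years).
MINUTES_PER_YEAR = 365.25 * 24 * 60


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for an availability target.

    Hypothetical helper for illustration only.
    """
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


# Compare three common targets: three, four and five nines.
for target in (99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% availability -> {minutes:.2f} minutes of downtime per year")
```

Under these assumptions, five nines works out to roughly 5.26 minutes a year, while four nines allows about 52.6 minutes. Each additional nine cuts the downtime budget by a factor of ten.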
While incidents get the most attention, a bigger concern for SREs may be the volume of non-emergency alerts they get. Twenty-seven percent said non-urgent messages are the top source of their “toil,” while only 15 percent cited on-call notifications.
SREs are more than glorified IT operations professionals, but a focus on availability means they are often not empowered to work on the engineering challenges they would rather be working on.
Context from Other Reports
- Incidents Create Friction Between Developers and IT Operations: Three-quarters of developers would prefer that the application development team be responsible for handling major incidents. According to a 2018 Atlassian survey, software developers’ rationale is that they are more knowledgeable about the errors and that communicating back and forth with the IT team takes too much time. However, members of central IT operations teams feel almost as strongly that they should take the lead, and a majority of C-level executives agree. The C-level executives are probably right, as two-thirds of respondents believe that a software development team’s involvement is needed in fewer than half of all major incidents.
- AIOps Only Part of the Solution: A survey conducted by OpsRamp found that three-quarters of executives familiar with AIOps believe the primary purpose of this type of tool is to eliminate tedious, manual tasks. However, 80 percent of respondents said that less than half of their incidents are repeat occurrences. In other words, they cannot be directly addressed with automation.
Feature image by John Downman.