
Where Site Reliability Engineering Overlaps with DevOps

Jun 25th, 2020 8:18am
Feature image via Pixabay.

Site reliability engineers (SREs) are constantly balancing priorities. The role continues to evolve, but it is very much real. Catchpoint’s “2020 SRE Report” surveyed over 600 people who do SRE-type work, of whom 46% said their organization has a dedicated SRE team that is distinct from teams that handle IT operations and administration. Still, the role is often conflated with DevOps, with 19% of the respondents saying the DevOps team handles SRE responsibilities. In fact, there is reason to believe the two functions can be managed as a whole: 41% said SREs and DevOps are part of the same team, while only 26% consider them to be complementary.

While SREs juggle both development and operations responsibilities, more than half spend less than 25% of their time doing development. Almost half (48%) spend a moderate or large amount of time writing software to help with operations, with much of that code automating previously manual tasks. Although it will take many years to make most infrastructure programmable, SREs can be expected to lead the way: 71% said site reliability engineers use infrastructure-as-code.

Source: Catchpoint’s “2020 SRE Report”.

Overall, monitoring and incident management continue to be the most common activities performed by SREs, but 55% said that SREs use application release and deployment management tools. As long as DevOps is the primary owner of application release management, the distinction between the two teams will likely continue. However, this just means that there will be a new area of conflict. Is it self-evident to you whether SREs should focus on monitoring infrastructure or applications?

Q. What tool categories are used by SREs? Source: Catchpoint’s “2020 SRE Report”.

Catchpoint has historically focused on end-user monitoring, where the end-user can either be a customer or an infrastructure service supporting other applications. The generic nature of the “monitoring and alerting” category means that 93% said SREs use these tools while only 55% use observability tools. Defining exactly what an observability tool is can be difficult. The report confirms findings from TNS sponsor Honeycomb’s recent survey on the subject. That study found that the adoption of individual components of an observability stack is common, even if observability as a practice is relatively immature.

An underlying theme in the report is that observability needs to be a holistic practice. From the vendor’s point of view, this means that all services should have an API that plugs into an observability framework to detect outages and performance issues.
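To make the idea concrete, here is a minimal sketch of how probe results from a service's health API might be classified into outages and performance issues. It is not taken from the report or from any vendor's product: the `ProbeResult` type, the `classify` and `summarize` helpers, the 500 ms latency threshold, and the service names are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ProbeResult:
    """One health-check observation for a service (hypothetical shape)."""
    service: str
    status_code: Optional[int]  # None means the probe got no response at all
    latency_ms: float


def classify(result: ProbeResult, slow_threshold_ms: float = 500.0) -> str:
    """Label a probe as 'outage', 'degraded', or 'healthy'.

    Assumption: no response or a 5xx status counts as an outage, and any
    other response slower than the threshold counts as degraded.
    """
    if result.status_code is None or result.status_code >= 500:
        return "outage"
    if result.latency_ms > slow_threshold_ms:
        return "degraded"
    return "healthy"


def summarize(results: List[ProbeResult]) -> Dict[str, List[str]]:
    """Group service names by their classified state."""
    report: Dict[str, List[str]] = {"healthy": [], "degraded": [], "outage": []}
    for result in results:
        report[classify(result)].append(result.service)
    return report


if __name__ == "__main__":
    # Made-up sample data, not survey results.
    probes = [
        ProbeResult("checkout", 200, 120.0),
        ProbeResult("search", 200, 900.0),
        ProbeResult("payments", 503, 40.0),
        ProbeResult("inventory", None, 0.0),
    ]
    print(summarize(probes))
```

In practice the thresholds and the definition of an outage would come from each service's own objectives; the point is only that a uniform API per service lets one framework apply the same checks everywhere.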

  • Chaos engineering showed up several times in the report. Resiliency checks via practices like chaos engineering take up at least a moderate amount of time for 19% of those who perform SRE tasks. The prevalence of these tools among SREs is actually higher, but the tools are often not at the core of day-to-day activities.
  • SREs are effective working remotely. Of the 356 people who answered the survey after the imposition of COVID-19 stay-at-home orders, only 14% had to be onsite so far. Two-thirds of respondents said SREs are part of the organization’s on-call rotation, which implies that most of the manual work related to this can be performed remotely. Looking at the data collected after the pandemic’s onset, 80% believe their effectiveness at handling incident management was not hurt by the move to at-home work. Only 9% said their sites or apps experienced more incidents during the forced at-home period of work, but just as many saw a decline in incidents. While traffic and capacity issues may have occurred due to remote work, they did not have a serious impact on customer-facing operations.
  • But don’t get too excited about remote work. While half said nothing about incident management is more challenging when they are working from home, 28% said escalating to the right teams is more difficult. Without face-to-face communication, some aspects of team communication may need to be readjusted. Although SREs may get more flexibility to work at home in the future, many jobs will still have an onsite or in-office element.