The New DevOps: Site Reliability Engineering Comes of Age

App and platform design and development have changed dramatically in the last five years, and with them the monitoring tools, technology and IT operations roles that support those applications. The advent of site reliability engineers (SREs) and of open source monitoring tools are two examples of these shifts. An SRE straddles the line between the “Dev” and “Ops” sides of DevOps teams, both writing code and supporting existing IT systems.
The SRE is the ultimate adjudicator when a performance issue is identified, determining conclusively which factor (code or IT systems) is the root cause and seeing it through to resolution. SREs need to carefully weigh a “do it yourself” (DIY) open source approach to monitoring against a new generation of commercial full-stack monitoring tools that incorporate open source technologies.
Traditionally, IT operations teams have taken an inside-out view of the world. Specialists myopically monitor their respective app, infrastructure or network components in the hope that their discrete efforts combine to improve user experience. This “find and fix faults” approach means waiting for an alert to fire and then dispatching experts to troubleshoot the issue.
Site reliability engineering (SRE) stands that approach on its head by viewing reliability through an outside-in lens. LinkedIn, Target and Netflix are examples of companies that take this approach, gauging success first by the quality of the experience delivered to the end user. When the digital experience becomes the bellwether measurement, their DevOps-savvy SRE teams spend less time chasing misleading alerts and instead focus on delivering the best possible experience across every touchpoint of customer engagement.
“SRE seeks to balance the risk of unavailability with the goals of rapid innovation and efficient service operations, so that users’ overall happiness — with features, service, and performance — is optimized,” Marc Alvidrez, SRE for Google, says.
While the job title SRE may not exist in every company, most are adapting to manage this balance. Indeed, IT teams now fuse software engineering into their operations practices. This blending signals a shift from staffing specialized domain experts to hiring generalists who understand the “full stack”: not just front-end and back-end code, but also the underlying infrastructure that executes and delivers the app experience to users.
The shift from experts to generalists is driving the need for new ways to collect and correlate monitoring insights across the digital delivery chain, and to automate scaling and problem remediation. Specialized application, infrastructure and network monitoring tools collect and store data in different formats (structured, unstructured, time series, topological, etc.), which makes correlation across tools difficult and increases the risk of monitoring blind spots. Complicating the challenge is the sheer increase in the volume, variety and velocity of data that must be collected.
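To make the correlation problem concrete, here is a minimal Python sketch, not a description of any particular product. It assumes three hypothetical record formats (an unstructured app log line, a time-series metric sample and a structured network probe row), normalizes each into one common event shape, and then groups events by service and time window so that signals from different tools line up. All names, field layouts and the 60-second window are illustrative assumptions.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Event:
    """Common shape that every tool's records are normalized into."""
    timestamp: float  # seconds since the Unix epoch, UTC
    service: str
    source: str       # which kind of tool produced the record
    message: str


def from_app_log(line: str) -> Event:
    # Hypothetical unstructured format: "<ISO-8601 UTC time> <service> <free text>"
    ts, service, message = line.split(" ", 2)
    when = datetime.fromisoformat(ts).replace(tzinfo=timezone.utc)
    return Event(when.timestamp(), service, "app", message)


def from_infra_metric(sample: dict) -> Event:
    # Hypothetical time-series sample keyed by "ts" (epoch seconds),
    # "service", "metric" and "value".
    text = f'{sample["metric"]}={sample["value"]}'
    return Event(float(sample["ts"]), sample["service"], "infra", text)


def from_network_probe(row: tuple) -> Event:
    # Hypothetical structured row: (epoch_millis, service, latency_ms)
    millis, service, latency_ms = row
    return Event(millis / 1000.0, service, "network", f"latency_ms={latency_ms}")


def correlate(events, window_seconds=60):
    """Group events by (service, time bucket).

    Only buckets in which more than one kind of tool reported are kept,
    since those are the ones where cross-tool correlation adds information.
    """
    buckets = defaultdict(list)
    for event in events:
        key = (event.service, int(event.timestamp // window_seconds))
        buckets[key].append(event)
    return {
        key: sorted(group, key=lambda e: e.timestamp)
        for key, group in buckets.items()
        if len({e.source for e in group}) > 1
    }
```

Fixed time buckets are the simplest possible choice here; two related events that fall on either side of a bucket boundary would be missed, so a real pipeline would more likely use sliding windows or topology-aware joins.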
Coders at heart, many SREs turn to open-source solutions to address these challenges. Open-source technologies such as Elasticsearch, Logstash and Kibana (the ELK stack), Kafka, Apache Spark and Mineral represent some of the building blocks that SREs use to code their own solutions to collect, store and analyze app performance data.
Designing and developing homegrown solutions for app and user monitoring requires considerable time and effort. What’s more, the challenge only grows as machine learning and artificial intelligence for IT operations (AIOps) become core components of automated problem remediation. Many performance problems follow recognizable patterns, so machine-learning-based pattern recognition can be used to detect and remediate them automatically. For that to work, however, tools must have a library of these performance problems and their remedies.
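As a rough illustration of what such a library might look like, the Python sketch below pairs each known problem with a signature and a remedy. It deliberately uses hand-written threshold rules as a stand-in for learned pattern recognition; the problem names, metric keys, thresholds and remedies are all hypothetical and chosen only to show the lookup.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Metrics = Dict[str, float]


@dataclass(frozen=True)
class KnownProblem:
    name: str
    matches: Callable[[Metrics], bool]  # signature: does this snapshot fit?
    remedy: str                         # action to suggest or automate


# Hypothetical library; entries are checked in order, most specific first.
LIBRARY: List[KnownProblem] = [
    KnownProblem(
        "memory_leak",
        lambda m: m.get("heap_used_pct", 0) > 90
        and m.get("heap_growth_pct_per_hour", 0) > 5,
        "restart the service and capture a heap dump",
    ),
    KnownProblem(
        "connection_pool_exhaustion",
        lambda m: m.get("db_pool_in_use_pct", 0) >= 100
        and m.get("p95_latency_ms", 0) > 1000,
        "raise the pool size or recycle idle connections",
    ),
    KnownProblem(
        "cpu_saturation",
        lambda m: m.get("cpu_pct", 0) > 95,
        "scale out by adding an instance",
    ),
]


def diagnose(metrics: Metrics,
             library: List[KnownProblem] = LIBRARY) -> Optional[KnownProblem]:
    """Return the first known problem whose signature matches, or None."""
    for problem in library:
        if problem.matches(metrics):
            return problem
    return None
```

Calling `diagnose({"cpu_pct": 97})` returns the `cpu_saturation` entry, whose `remedy` string could then be shown to an operator or handed to an automation hook. The hard part in practice is not the lookup but building and maintaining the library, which is the effort the paragraph above refers to.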
Commercial monitoring solutions benefit from decades of learning and evolution but have historically lacked the ability to correlate across silos. That has changed through the adoption of the latest big data and open source technologies that can normalize and correlate analytics to eliminate the traditional silos of monitoring that previously limited insight and control of modern apps.
IT operations management (ITOM) teams are rapidly evolving to manage modern applications and can learn a great deal from how SREs pursue performance and availability. When it comes to selecting a unified approach to monitoring, these teams face a critical decision: build it themselves with open-source software, or select a commercial monitoring solution. The first option is to adopt and implement open-source software in a homegrown approach, although the “free” software comes at a price in time and effort. The second option is to use commercial monitoring solutions, which now incorporate many of these same open-source technologies, saving teams the trouble.
As organizations transition to an SRE model and recruit more generalists with developer backgrounds, the way applications are monitored will also evolve. Point-monitoring tools will give way to full-stack tools that correlate data across application, infrastructure and network. Yet SREs will continue to face the age-old dilemma of build vs. buy, and will need to weigh both options carefully before deploying the next generation of monitoring.