
5 Ways to Drive Mature SRE Practices

6 May 2022 7:20am, by Saif Gunja
Saif is director of product marketing for Dynatrace’s cloud automation and DevOps solutions, bringing over 10 years of IT and marketing experience from his previous roles at VMware, Apple and Deloitte.

Developing reliable applications is more important than ever as organizations become increasingly dependent on digital services. Performance degradation and downtime can reduce revenue, increase customer churn and cause reputational damage. Regular headlines about banking application failures and social media outages highlight the impact of organizations failing to get it right.

Site reliability engineering (SRE) can put organizations on firmer footing to avoid these issues. It helps them maintain service availability and performance to meet the needs of users and customers. When done correctly, SRE enables organizations to define and drive development best practices at scale — not only improving performance and reliability but also accelerating digital transformation. Most organizations, however, remain immature in their SRE adoption and need to become more strategic to maximize the effect of the discipline.

Building an Advanced SRE Practice

Site reliability engineering is a vital source of education, enablement and strategic direction for DevOps practitioners. It can arm development teams with data-driven answers, solutions and best practices to tackle new challenges and drive innovation without affecting service performance.

But site reliability engineers are also responsible for an array of other tasks, such as automating development processes, configuring service-level objectives (SLOs) and creating workarounds to avoid overrunning error budgets. Additionally, they’re increasingly responsible for analyzing vulnerabilities and building observability and self-healing capabilities into applications and infrastructure. The problem is that these tasks are time-consuming. If they consume too much of an SRE’s day, the function becomes hard to distinguish from regular IT security and operations.

Here are five ways IT leaders can overcome that risk and accelerate their journeys to mature SRE practices.

1. Make automation more efficient.

Automating DevOps workflows is critical to reducing mean time to repair (MTTR) — a key objective for SRE. However, in many cases, the process of scripting automation is highly manual, which erodes the time SREs can spend on higher-value tasks. According to Dynatrace’s recent “State of SRE Report: 2022 Edition,” 60% of SREs invest significant time in building and maintaining automation code. Additionally, manually adding automation into workflows on an ad-hoc basis makes it difficult to scale, limiting the overall effect.

Therefore, SREs need to work with other stakeholders, such as DevOps teams and architects, to ensure the software they build is automatable by default. This is best enabled through platforms that offer advanced automation and everything-as-code capabilities across the entire development lifecycle, from configuration and testing to observability and remediation.
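Because MTTR is the yardstick named above, it helps to be precise about how it is computed. The following is a minimal sketch, assuming each incident is recorded as a pair of detection and resolution timestamps; the function name and record shape are illustrative and do not come from any particular platform.

```python
from datetime import datetime, timedelta


def mean_time_to_repair(incidents: list[tuple[datetime, datetime]]) -> timedelta:
    """Average time from detection to resolution across incidents.

    Each incident is a (detected_at, resolved_at) pair. This record
    shape is a hypothetical one chosen for illustration.
    """
    if not incidents:
        raise ValueError("at least one incident is required")
    total = timedelta()
    for detected_at, resolved_at in incidents:
        if resolved_at < detected_at:
            raise ValueError("incident resolved before it was detected")
        total += resolved_at - detected_at
    return total / len(incidents)


if __name__ == "__main__":
    sample = [
        (datetime(2022, 5, 1, 10, 0), datetime(2022, 5, 1, 10, 30)),  # 30 minutes
        (datetime(2022, 5, 2, 9, 0), datetime(2022, 5, 2, 10, 30)),   # 90 minutes
    ]
    print(mean_time_to_repair(sample))  # prints 1:00:00
```

Tracking this figure before and after an automation change is one simple way to check whether the automation is paying off.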

2. Involve the right stakeholders.

The most mature SRE functions work to influence architectural design decisions from the start of any innovation project to improve software and system reliability, resilience and security. Adopting SRE-driven practices enables enterprises to harness the experience of developers who know what works and what doesn’t. These developers can help architects build reliable services that can scale from a single user to 1,000, or from 1 million users to 10 million, without unforeseen problems.

3. Drive a culture of learning.

Project failure — and the way it’s regarded within the organization — is often as important as success. To create maximum value, SREs must be free to experiment and work on strategic projects that push the boundaries, understanding they will fail as often as they succeed. However, according to the “State of SRE Report,” only a quarter of organizations accept the “fail fast, fail often” mantra.

To mature their practice, enterprises must free SREs from the traditional cost constraints placed upon IT and encourage them to challenge accepted norms. They should be setting new benchmarks for innovative design and engineering practices, not be bogged down in the minutiae of development cycles. Running hackathons and bonus schemes focused on reliability improvements is a great way to upskill SREs and encourage an organizational culture of learning and experimentation, where failure is valued as much as success.

4. Access the right level of observability to measure success.

Measurement is critical to developing any IT program, and SRE is no exception. To truly understand where performance gaps are and optimize critical user journeys, SREs need to go beyond performance monitoring data. They need detailed visibility into what’s draining an error budget and its effect on reliability. With true full-stack, in-depth observability capabilities, enterprises gain detailed insight into all key metrics and dependencies, from backend infrastructure performance to real user experiences.
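To make the idea of a draining error budget concrete, here is a minimal sketch, assuming a request-based availability SLO. The class, the function names and the sample figures are all hypothetical and chosen only for illustration; real observability platforms expose this data in their own ways.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloWindow:
    """Request counts for one SLO evaluation window (hypothetical structure)."""
    target: float         # e.g. 0.999 means 99.9% of requests must succeed
    total_requests: int
    failed_requests: int


def error_budget(window: SloWindow) -> float:
    """Number of failed requests the SLO tolerates in this window."""
    return (1.0 - window.target) * window.total_requests


def budget_consumed(window: SloWindow) -> float:
    """Fraction of the error budget already used (1.0 means fully spent)."""
    budget = error_budget(window)
    if budget == 0:
        return float("inf") if window.failed_requests else 0.0
    return window.failed_requests / budget


def top_budget_drains(failures_by_source: dict[str, int], limit: int = 3) -> list[tuple[str, float]]:
    """Rank failure sources by their share of all failures, largest first."""
    total = sum(failures_by_source.values())
    if total == 0:
        return []
    ranked = sorted(failures_by_source.items(), key=lambda item: item[1], reverse=True)
    return [(name, count / total) for name, count in ranked[:limit]]


if __name__ == "__main__":
    window = SloWindow(target=0.999, total_requests=1_000_000, failed_requests=250)
    print(f"{budget_consumed(window):.0%}")  # prints 25%
    print(top_budget_drains({"checkout": 150, "search": 75, "login": 25}))
```

Splitting failures by source, as the last function does, is what turns a single budget number into a view of where reliability work should go first.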

5. Enhance SLO monitoring and management.

While SLOs provide an important mechanism for tracking success, too often, they’re left to SREs to define, establish and manage. To have a significant effect on reliability, SLOs need to be treated as a team game. Stakeholders from business, development and operations teams should come together to identify initiatives that will create the greatest business impact and then establish SLOs to support them. SREs need to lead these discussions but also encourage the participation of all stakeholders. To foster closer collaboration, SREs should look for an end-to-end observability platform that unifies teams around a single source of truth for service levels.

Taking a More Mature Approach to SRE

Ultimately, while the role of the SRE has become table stakes, most organizations have only begun their journey, and there is still a long way to go. The demand for skilled engineers far outstrips supply, so it’s urgent for organizations to find ways to increase the maturity of their practices and further their SREs’ efforts.

This can be achieved only by eliminating as much manual toil as possible, allowing SREs to focus on more value-adding tasks that advance their services’ reliability, resiliency, security and performance. That’s what will help development teams optimize critical user journeys and drive better business outcomes.
