How the SRE Experience Is Changing with Cloud Native
In the cloud native space, the “shift left” toward developer ownership of the full software development life cycle (SDLC) has changed the way everyone in the ecosystem works and how they work together. For site reliability engineering (SRE) teams, the new cloud native normal means moving away from pure firefighting when something goes wrong to becoming fire-safety officers. This makes the SRE role more about enablement than fixing. Moreover, their critical support enables developers to take increased ownership to build, ship and run their own applications.
This article, part of a series drawn from real-world discussions and interviews with developers, SREs and platform architects, explores how SREs can help empower and centralize the new developer experience, streamline their own work and contribute to shipping software faster and more safely.
From Firefighting to Prevention for SREs
The cloud native, Kubernetes-based SDLC has fundamentally redefined the traditional roles of developers, ops and platform teams, and SREs. As these roles shift, collaboration among the groups is key to breaking down existing silos.
How is the SRE experience changing as roles converge, and how can developers and ops teams successfully work with SREs as trusted partners? In our recent conversations with industry experts, a consistent theme was that SREs are moving from “let me fix this for you” to “let me show you what you need to fix or prevent this in the future.”
Empower Developers with Self-Service
Traditionally, the SRE “rides to the rescue” at the first sign of trouble, but the changing SRE support system is moving toward providing appropriate platforms and tools, along with an interactive self-service experience. In this less linear, fast-moving and highly distributed model that typifies cloud native development, it’s unrealistic to assume that an SRE could troubleshoot problems any faster than the developer whose services are on fire — assuming the developer has both a sense of ownership and the right tools to work with.
If SREs collaborate with developers and ops to support the self-service culture and ownership mindset, and have organizational buy-in, this casts SREs in the role of educators and enablers rather than firefighters.
For developers, there can be a significant learning curve with this new paradigm. SREs, in collaboration with platform teams, can best support developers, the developer experience and overall developer productivity, by creating the right abstraction layer. That is, how much does the developer need to know and do to ship and run software — and at what level? Proper tooling, clear expectations and a centralized control plane as a baseline can give a developer what they need to work most effectively on their applications, with other teams and toward full ownership.
Facilitate Developer Autonomy
A number of SREs working with Kubernetes have expressed an emerging sentiment that developers should own the full life cycle of their services, but in most cases don’t. Mario Loria, a senior SRE at CartaX, put it pointedly: “It should not be up to me as an SRE to define how your application gets deployed or at what point it needs to be rolled back, or at what point it needs to be changed, or when its health check should be modified.” Developers should be capable, autonomous and empowered enough to make these determinations.
In this new normal for SREs as educators, one of the best things SREs can do is focus on giving developers the tools and support they need to ship software safely at speed. And that means creating the infrastructure and core services to support this primary goal. At the same time, the developer doesn’t necessarily need to care about what platforms and tools are used, but does need to be able to use them to, for example, canary a service or get service metrics. With this accomplished, SREs buy some breathing room to focus on strategic activities that support shipping software and enable less time spent on post-incident firefighting, cleanup and ad-hoc requests.
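As one illustration of what "canary a service" could look like from the developer's side, the sketch below shows a promote-or-rollback rule driven by service metrics. It is a minimal sketch, not a description of any particular platform: the names `ServiceMetrics` and `canary_decision`, and the threshold values, are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceMetrics:
    """Request counts for one version of a service over a fixed window.

    Hypothetical type for illustration; a real platform would supply
    these numbers from its own metrics backend.
    """
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        # Treat a window with no traffic as having no observed errors.
        return self.errors / self.requests if self.requests else 0.0


def canary_decision(baseline: ServiceMetrics,
                    canary: ServiceMetrics,
                    max_extra_error_rate: float = 0.01,
                    min_requests: int = 100) -> str:
    """Return "wait", "rollback" or "promote" for a canary release.

    The thresholds are assumptions for this sketch; the owning team
    would pick values that fit its own service.
    """
    # Too little canary traffic to judge: keep observing.
    if canary.requests < min_requests:
        return "wait"
    # Canary is noticeably worse than the stable version: roll back.
    if canary.error_rate - baseline.error_rate > max_extra_error_rate:
        return "rollback"
    return "promote"
```

The point of a rule like this is that the developer who owns the service sets the thresholds and makes the call, while the platform only needs to expose the metrics.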
Adopting a New Code-Ship-Run Paradigm
With roles and responsibilities shifting, an SRE can support and advocate for the new developer-owned code-ship-run paradigm by focusing on the platform as a service, or “paved path,” that helps reduce cognitive load for a developer taking on full ownership. Ultimately, the SRE is making both their own job and the developer’s job easier by providing more clarity about how to ship and run code without breaking anything.
|Site Reliability Engineer (SRE) team|Developers|Operations team|
|---|---|---|
|Provide and teach effective use of platform tooling to empower developers to be self-sufficient|Treat SREs as application operation partners, not only as first responders to incidents|Provide self-service platform deployment and observability, and enable visibility into ramifications of actions|
|Document clear escalation paths for developers struggling in production|Turn to ops teams for the “paved path” or centralized developer control plane|Provide opinionated “paved path” platform or developer control plane (DCP), but allow developers to swap platform components if they also want to be accountable|
Summary: Charting a Course
As cloud native development continues to usurp the position of the traditional monolith, deploying, releasing and operating applications continues to be an evolving beast. Developers know how to code, but building in the necessary understanding and ownership of the “ship” and “run” aspects of the life cycle introduces a steep learning curve, one that SREs are uniquely qualified to help developers climb. Charting this developer-ownership course depends on transparency and visibility at both an organizational and a technical level.
Organizationally, the success of this shift depends on two things:
- Organizational leadership has to support the end-to-end “developer-as-service-owner” mindset as part of their high-level business strategy. Leadership needs to make this clear from the top down.
- Developers also have to buy in. Developers learning this new culture come to it from their own history and experience, which can vary considerably. Changing the developer mindset starts with empathy (understanding their goals, practices and skills) and continues with steady, consistent communication.
Technically, the self-service approach championed in many cloud-first organizations is possible for two reasons:
- To liberate developers and give them the freedom and responsibility to take ownership, they need transparency and visibility into what is going on with their services. SREs are a big part of making this happen and help set developers up for success when they take on potentially unfamiliar activities, such as triaging and debugging, and gathering intelligence. Becoming less reliant on SRE intervention ensures that developers are in the driver’s seat, while the SRE teams help developers gain “mechanical sympathy” and best drive high-performance machines (applications).
- This training is eased by putting in place a centralized developer control plane as a single source of truth and an integrated tooling approach as a baseline to the “paved path.”