SRE vs Platform Engineer: Can’t We All Just Get Along?
So far, 2023 has been all about doing more with less. Thankfully, tech layoffs — a reaction to sustained, uncontrolled growth and economic downturn — seem to have slowed. Still, many teams are left with fewer engineers working on increasingly complex and distributed systems. Something’s got to give.
It’s no wonder that this year has seen the rise of platform engineering. After all, this sociotechnical practice looks to use toolchains and processes to streamline the developer experience (DevEx), reducing friction on the path to release, so those that are short-staffed can focus on their end game — delivering value to customers faster.
What might be surprising, however, is the rolling back of the site reliability engineering or SRE movement. Both platform teams and SREs tend to work cross-organizationally on the operations side of things. But, while platform engineers focus on that DevEx, SREs focus on reliability and scalability of systems — usually involving monitoring and observability, incident response, and maybe even security. Platform teams are all about increasing developer productivity and speed, while SRE teams are all about increasing uptime in production.
Lately, a lot of organizations are also in the habit of simply waving a fairy wand and — bibbidi-bobbidi-boo!— changing job titles, like from site reliability engineer, sysadmin or DevOps engineer to platform engineer. Is this just because the latter makes for cheaper employees? Or can a change in role really make a difference? How many organizations are changing to adopt a platform as a product mindset versus just finding a new way to add to the ops backlog?
What do these trends actually mean in reality? Is it really SRE versus platform engineering? Are companies actually skipping site reliability engineering and jumping right into a platform-first approach? Or, as Justin Warren, founder and principal analyst at PivotNine, wrote in Forbes, is platform engineering already at risk of “collapsing under the weight of its own popularity, hugged to death by over-eager marketing folk?”
In 2023, we have more important things to worry about than two teams with similar objectives feeling pitted against each other. Let’s talk about where this conflict ends and where collaboration and corporate cohabitation begins.
SREs Should Be More Platform-focused
There’s opportunity in bringing platform teams and SREs together, but a history of friction and frustration can slow that collaboration. Often, SREs can be seen as gatekeepers, while platform engineers are just setting up the guardrails. That could be the shine effect for more nascent platform teams or it can be the truth at some orgs.
“Outside of Google, SREs in most organizations lack the capacity to constantly think about ways to enable better developer self-service or improve architecture and infrastructure tooling while also establishing an observability and tracing setup. Most SRE teams are just trying to survive,” wrote Luca Galante, from Humanitec’s product and growth team. He argues that too many companies are trying to follow suit of these “elite engineering organizations,” and the result is still developers tossing code over the wall, leaving too much burden on SREs to try to catch up.
Instead, Galante argues, a platform as a product approach allows organizations to focus on the developer experience, which, in turn, should lighten the load of operations. After all, when deployed well, platform engineering can actually help support the site reliability engineering team by reducing incidents and tickets via guardrails and systemization.
In fact, Dynatrace’s 2022 State of SRE Report emphasizes that the way forward for SRE teams is a “platform-based solution with state-of-the-art automation and everything-as-code capabilities that support the full lifecycle from configuration and testing to observability and remediation.” The report continues that SREs are still essential in creating a “single version of the truth” in an organization.
A platform is certainly part of the solution, it’s just, as we know from this year’s Puppet State of Platform Engineering Report, most companies have three to six different internal developer platforms running at once. That could leave platform and SRE teams working in isolation.
Xenonstack technical strategy consultancy actually places platform engineering and SRE at different layers of the technical stack, not in opposition to each other. It looks at SRE as a lower level or foundational process, while platform engineering is a higher level process that abstracts out ops work, including that which the SRE team puts in place.
Both SRE and platform teams are deemed necessary functions in the cloud native world. The next step is to figure out how they can not just collaborate but integrate their work together. After all, a focus on standardization, as is inherent to platform engineering, only supports security and uptime goals.
Another opportunity is in how SREs use service level objectives (SLOs) and error budgets to set expectations for reliability. Platform engineers should consider applying the same practices but for their internal customers.
The same Dynatrace State of SRE Report also found that, in 2022, more than a third of respondents already had the platform team managing the external SLOs.
In the end, it is OK if these two job buckets become grayer — even to the developer audience — so long as your engineers can work through one single viewpoint and, when things deviate from that singularity, they know who to ask.
How SREs Built Electrolux’s Platform
Whether a platform enables your site reliability team or your SREs can help drive your platform-as-a-product approach, collaboration yields better results than conflict. How it’s implemented is as varied as an organization’s technical stack and company culture.
Back in 2017, the second largest home appliance maker in the world, Electrolux, shifted toward its future in the Internet of Things. It opened a digital products division to connect eventually hundreds of home goods. This product team kicked off with ten developers and two SREs. Now, in 2023, the company has grown to about 200 developers helping to build over 350 connected products — supported by only seven SREs.
Electrolux teammates, Kristina Kondrashevich, SRE product owner, and Gang Luo, SRE manager, spoke at this year’s PlatformCon about how building their own platform allowed them to scale their development and product coverage without proportionally scaling their SRE team.
Initially, the SREs and developers sat on the same product team. Eventually, they split up but still worked on the same products. As the company scaled with more product teams, the support tickets started to pile up. This is when the virtual event’s screen filled with screenshots of Slack notifications around developer pain points, including service requests, meetings and logs for any new cluster, pipeline or database migration.
Electrolux engineering realized that it needed to scale the automation and knowledge sharing, too.
“[Developers] would like to write code and push it into production immediately, but we want them to be focused on how it’s delivered, how they provision the infrastructure for their services. How do they achieve their SLO? How much does it cost for them?” Kondrashevich said, realizing that the developers don’t usually care about this information.“They want it to be done. And we want our consumers to be happy.”
She said they realized that “We needed to create for them a golden path where they can click one button and get a new AWS environment.”
As the company continued to scale to include several product teams serving hundreds of connected appliances, the SRE team pivoted to becoming its own product team, as Electrolux set out to build an internal developer platform in order to offer a self-service model to all product teams.
Electrolux’s platform was built to hold all the existing automation, as well as well-defined policies, patterns and best practices.
“If developers need any infrastructure today — for example, if they need a Kubernetes cluster or database — they can simply go to the platform and click a few buttons and make some selections, and they will get their infrastructure up and running in a few minutes,” Luo said. He emphasized that “They don’t need to fire any tickets to the SRE team and we ensure that all the infrastructure that gets created has the same kind of policies, [and] they follow the same patterns as well.”
“For developers, they don’t need to navigate different tools, they can use the single platform to access most of the resources,” he continued, across infrastructure, services and APIs. “Each feature contains multiple pre-defined templates, which has our policies embedded, so, if someone creates a new infrastructure or creates a new service, we can ensure that it already has what we need for security, for observability. This provided the golden path for our developers,” who no longer need to worry about things like setting up CI/CD or monitoring.
Electrolux’s SRE team actually evolved into a platform-as-a-product team, as a way to cover the whole developer journey. As part of this, Kondrashevich explained, they created a platform plug-in to track cloud costs as well as service requests per month.
“The first intention was to show that it costs money to do manual work. Rather the SRE team can spend time and provide the automation — then it will be for free,” she said. Also, by observing costs via the platform, they’ve enabled cross-organization visibility and FinOps. “Before our SRE team was responsible for cost and infrastructure. Today, we see how our product teams are owners of not only their products but…their expenses for where they run their services, pipelines, etcetera.”
They also measure platform success with continuous surveying and office hours.
In the end, whether it’s the SRE or the product team running the show, “Consumer experience is everything,” Kondrashevich said. “When you have visibility of what other teams are doing now, you can understand more, and you can speak more, and you can share this experience with others.”
To achieve any and all of this, she argues, you really need to understand what site reliability engineering means for your individual company.
The colleagues ended their PlatformCon presentation with an important disclaimer: “You shouldn’t simply follow the same steps as we have done because you might not have the same result.”