
SRE vs. DevOps — ‘Ops vs. Dev 2.0’ or Something New?

Site reliability engineers need opinionated solutions — as distinct from "toolkits" — that directly and completely address platform-level pain points.
Oct 10th, 2022 7:35am by

The divide between IT Ops and software development has been debated for half a century. More-seasoned technologists will remember several (many?) generations of this debate, played out against the backdrop of disruptive changes in the technology landscape.

While none of these disruptions changed everything, each changed a lot about how enterprise tech worked, and thus about how Ops and Dev did their business and managed their (often uneasy) collaboration. But consistently, most attention has been paid to Dev, whose outputs, to be fair, are highly visible, business-differentiating and value-generating.

What smart businesses and technology teams have recently (past 5-10 years?) discovered, however, is that to help Dev deliver more value, someone needs to pay more attention to Ops. And Ops — as a role and list of best practices — has become a thing that everyone needs to do well.

At this point, depending on what you do and where you sit and how far along the clouds+containers+automation curve your organization is, the lines of battle may seem to have shifted. At tech-forward, cloud native organizations, the point of the spear of IT Ops has now, in some cases, morphed into an elite class: site reliability engineers (SREs), who do the software/systems part of platform engineering.

Dev, meanwhile, has morphed into DevOps. And in the sense we mean, here — as a job description — DevOps folks are (I think, anyway) best defined as application-facing consumers of the systems SREs create, tasked with delivering cloud native applications that are available, resilient, performant and efficient.

Is This Just IT Ops vs. Dev 2.0?

No, it’s really different. And this is getting proved out as technology makers adapt their products better and better to serve practitioners of these new disciplines.

A lot of the current action is around generalized tools for deployment automation, and point solutions or SaaS offerings for things like build automation, CI/CD, test-driven development workflows and so on.

But low-level tools aren’t enough. And DIY ambitions can kill forward progress.

Most tech organizations aren’t yet staffed with platform engineers and SREs (many technologists-in-place are still learning), and those skills are hard to find and expensive, particularly in today’s employment market. Even rarer and more expensive are the highly specialized skills in resilient storage, ingress, identity and access management, container networking and security, and other disciplines required to build clusters that are performant and reliable.

As cloud estates multiply and hybridize and needs emerge to deliver, for example, consistent Kubernetes on many platforms (dev/test/production, blue/green, edge clusters, hosted bare metal, specialized hardware such as GPU hosts), challenges keep mounting. It’s easy to end up building the wrong kind of team, with headcount skewed toward building and life cycle-managing clouds and underlying infrastructures, rather than focused on applications and on delivering business value.

In a DIY environment, Dev is underserved. SRE isn’t providing the systems Dev needs (organization-standard application design patterns, deployment automation, Kubernetes operators, observability tools) because SREs don’t have time, and also because platforms themselves are in constant flux, being built and refactored “while the plane is in the air.”

As long as SRE keeps struggling with building and maintaining automation to put platforms in place, observe them, and keep them on their feet, SRE vs. DevOps will keep looking like Ops vs. Dev 2.0.

Opinionated Solutions to Backstop Platform Engineers and Empower SREs

SRE needs to up-level their work. To do this, they need opinionated solutions — as distinct from “toolkits” — that directly and completely address platform-level pain points. Tools that are higher-order, simpler to operate and more complete, but still usefully flexible. Solutions, for example, enabling production-grade, consistent Kubernetes cluster creation and life cycle management on multiple platforms and infrastructures.

Plenty of toolkits claim to do this. Few do even a barely adequate job. Some produce low-performance clusters with indifferent availability. Some, in the name of flexibility, require users to make critical decisions about Kubernetes components (such as ingress) with insufficient guidance, then configure the tools to deploy their solutions of choice, and then seek support from multiple entities and/or open source communities in the event of trouble. Few toolkits deliver clusters with observability already in place, which means another set of decisions, more configuration and more experimentation before usable clusters exist, can be evaluated and can begin actual operations.

In short, these toolkits aren’t doing the actual job, which is to limit the need for custom platform engineering and specialized skills at the start of a cloud transformation or buildout project. A functional solution does this, reducing uncertainty, strictly curtailing the need for experimentation and tuning, and enabling proof of concept and basic deployments to proceed more or less immediately — getting the organization where it needs to be before any benefits accrue.

To do this credibly (to satisfy sophisticated end users), the solution must embody a lot of the specialized knowledge platform engineers, distributed storage specialists and other kinds of subject-matter experts bring to the table. And it needs to embrace a vast laundry list of best practices, plus numerous affordances that productize the solution and its outputs (clusters):

  • It needs to provide a complete, flexible, layered, fundamentally simple, well-documented, and extensible system of tooling for managing bare metal, cloud and other underlying infrastructure, dealing with Linux and potentially Windows host operating systems, and both deploying and life cycle managing all aspects of a complete production Kubernetes cluster architecture, including container runtime, worker and manager components, high-performance container networking, distributed storage, ingress, container registry and observability for everything.
  • It needs to be equipped to do its work with best-in-class Kubernetes and subsidiary components, purpose-matched for consistency and interoperability.
  • It thus needs to express opinions — well-reasoned and tested in the real world at scale — about all these things, none of which are trivial (see note on Ceph, below).
  • It needs, beyond this, to have a clear vision (more well-founded opinions) of how clusters should be observed and life cycle managed, provide all necessary affordances “out of box,” plus be flexible enough to enable SREs to add their own unique value, over time.

Committing to making such decisions responsibly requires more specific experience than any typical organization, almost regardless of size, will have in-house or be able to buy. In Mirantis’ case, for example, our Mirantis Container Cloud (MCC) product, a complete deployment and operations framework for Kubernetes, Swarm and, indeed, containerized OpenStack clusters, deploys Mirantis’ own open source-derived distributions of Kubernetes and OpenStack components and subsystems, all of which evolve independently but are regression-tested together. In effect, therefore, when MCC is used to deploy, for example, Kubernetes clusters in a particular, tested, highly available configuration on multiple infrastructures, those clusters are actually consistent: aside from resource requirements and networking details determined at runtime, nothing should stop a workload from running on any of them.

MCC also uses Ceph distributed storage by preference for file, block and object stores. Are there other distributed storage systems? Certainly. But Mirantis has contributed to the Ceph project for almost 10 years, supports Ceph at scale under some of the world’s largest private clouds, and has a large in-house bench of storage engineers (effectively, professional Ceph engineers) to ensure that Ceph works well for most customers. “Just works well” is more valuable than “flavor of the month,” particularly when the point of cloud is to support development teams that want to focus on applications.

The years (decades, really) of experience and opinion represented by a solution like Mirantis Container Cloud (and Mirantis Kubernetes Engine, and Mirantis Container Runtime, and Mirantis Secure Registry, and Ceph, and so on) help ensure that MCC fulfills a real need: backstopping platform engineers and SREs, quickly converging on the cluster architecture required to serve the customer’s use case, and fielding reliable production clusters, so that the real work of application building can proceed.

SRE + DevOps 1.0

Today, as opinionated, complete solutions become available to help meet real cloud platform challenges, SRE and DevOps are playing on the same team — practitioners of a single discipline. Yes, there can be friction because it’s easier to write apps fast than write them well (“worked fine in dev, ops’ problem now”). But given the right solutions for getting platforms online and keeping them healthy, it’s getting easier for SRE to present DevOps with systems that help apps thrive, and for DevOps to write good apps as a result (and thus, safer for all of us to push to prod on Friday and also have fun weekend plans).

Given the right solutions, SRE and DevOps can be what they’re supposed to be: a new and actually better thing. And as cloud native tech and best practices continue working their way into everything and becoming “how things are done,” we may finally be able to say goodbye to this old, toxic, finger-pointing business of “Ops vs. Dev.” (Though I will never not love the evil meme-child, the burning house, and “Ops’ problem now.”)

TNS owner Insight Partners is an investor in: Pragma, Mirantis.