
Site Reliability Engineering for Cloud-Native Operations

26 Jun 2017 6:00am, by

Developers want to change things as soon as they can, while operations teams remain apprehensive that changes will break stuff. To reconcile these two drives, Google forged the path of site reliability engineering (SRE), an emerging practice for maintaining complex computing systems that need to run with high reliability. As the founder of Google’s SRE team, Ben Treynor, put it: SRE is “what happens when a software engineer is tasked with what used to be called operations.”

SRE dates back to 2003, when Treynor joined Google to manage a team of engineers running a production environment. The practice proved to be a success, and the company now has 1,500 engineers working in SRE. Apple, Oracle, Microsoft, Twitter, Dropbox, IBM, and Amazon have all implemented their own SRE teams as well.

So what exactly is SRE? Some would call it a subset of DevOps. Treynor describes it as an alternative to the traditional sysadmin approach to service management, in which development and operations are two distinct teams. He says that the mainstream sysadmin approach is certainly easy to implement, and that many tools and examples are already in place to help you do it. However, Treynor says that the approach carries direct and indirect costs, including a team that has to grow and shrink with service and traffic needs.

“At their core, the development teams want to launch new features and see them adopted by users. At their core, the ops teams want to make sure the service doesn’t break while they are holding the pager. Because most outages are caused by some kind of change — a new configuration, a new feature launch, or a new type of user traffic — the two teams’ goals are fundamentally in tension,” Treynor wrote.

The priority of the SRE team is to make sure the systems stay strong and stable by spending at least half their time on development. The Google SRE team is made of Google software engineers, many with specialized skills such as Unix or networking administration. All are focused on software for complex problem-solving. They carry a “code it or drown it” mentality, so no one is working on something for long, keeping the operation’s debt to a minimum.

Co-editor of the O’Reilly book “Site Reliability Engineering,” Google researcher Chris Jones told the audience at this year’s CoreOS Fest that even progressively minded agile software engineering is still missing the maintenance aspect of software engineering, focusing more on building the software than operating it.

What site reliability engineers do is “think about the whole lifecycle of software objects from their inception to their deployment to operation, refinement, and eventual, peaceful decommissioning,” Jones said.

Site Reliability Engineering in a Cloud-native World

Everything done on Google’s SRE team involves not only creating and automating but also measuring how things are doing. These metrics drive further development of the software so it can be made to run more reliably, faster and cheaper.

But how does this work within our cloud-native world?

“The cloud is a computing environment where it is difficult for you to point to the exact thing that’s running your software,” Jones said.

A vital step in cloud deployment is paring down applications to only their own dependencies. For Google, containers are the underpinning technology of the cloud that accomplishes this. “Internally, we’re almost open source,” Jones said of Google, where engineers can propose changes to any code, allowing them to build software to their needs and to understand what it’s doing.

Jones conceded that this approach may not work for companies that don’t have full transparency and ownership of what they are building and running, but he suspects standards will begin emerging to cover this. As more software is built for the cloud, the cloud will become inherently interoperable for these goals.

In the end, Jones argues that as we move more and more into the cloud, we will need SRE as an alternative to Waterfall or DevOps because it was built in the cloud.

CoreOS is a sponsor of The New Stack.

