Google’s Formula for Elite DevOps Performance
Every organization wants to be successful, but who decides which ones are? For Google, there’s a clear definition of how to measure the success of a DevOps team. At CloudNative London last year, Jez Humble, Google Cloud Platform advocate and co-author of “Continuous Delivery,” explained Google’s four key metrics, commonly referred to as the DevOps Research and Assessment (DORA) metrics, and how to become one of these few, proud elite teams.
Let’s start by clarifying how Google defines DevOps, namely as “an organizational and cultural movement that aims to increase software delivery velocity, improve service reliability, and build shared ownership among software stakeholders.” This definition further goes into how DevOps teams should work to “improve the speed, stability, availability, and security of your software delivery capability.”
Humble says a lot of teams think they can just install Kubernetes and start deploying apps, but few organizations have the technical and managerial capabilities that drive real success. So what are those few top performers doing right?
The 4 DORA Metrics of a Successful DevOps Team
For Google, it comes down to three teams (software development, software deployment, and service operations) who care about four metrics, plus a fifth, availability, that can’t be measured directly but nonetheless can’t be compromised during this process. The four key DevOps DORA metrics are:
- Lead time for changes
- Deployment frequency
- Time to restore service
- Change failure rate
Successful DevOps teams understand the shared ownership of these objectives.
Humble further defines DevOps high performers as those that do better at throughput, stability and availability. These elite performers:
- Release many times per day.
- The lead time for changes and moving into production is less than a day.
- The time to restore service is less than an hour (low performers take between a week and a month).
- The change failure rate is zero to 15%.
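As a rough illustration, the four metrics and the elite thresholds listed above can be computed from a log of deployments. This is a minimal sketch, not Google’s or DORA’s tooling: the `Deployment` record, its field names, and the sample data are hypothetical, and reading “many times per day” as “more than one deployment per day” is an assumption made only so the check can be expressed in code.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    """Hypothetical record of one production deployment."""
    committed_at: datetime                  # when the change was committed
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # did the change degrade service?
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def dora_metrics(deployments: list[Deployment], days: int) -> dict:
    """Summarize the four key metrics over an observation window of `days` days."""
    if not deployments or days <= 0:
        raise ValueError("need at least one deployment and a positive window")
    failures = [d for d in deployments if d.failed]
    restore_times = [d.restored_at - d.deployed_at for d in failures if d.restored_at]
    return {
        "deploys_per_day": len(deployments) / days,
        "median_lead_time": median(d.deployed_at - d.committed_at for d in deployments),
        "median_time_to_restore": median(restore_times) if restore_times else None,
        "change_failure_rate": len(failures) / len(deployments),
    }


def is_elite(m: dict) -> bool:
    """Apply the elite thresholds quoted in the article."""
    restore = m["median_time_to_restore"]
    return (
        m["deploys_per_day"] > 1                          # assumption: "many times per day"
        and m["median_lead_time"] < timedelta(days=1)     # lead time under a day
        and (restore is None or restore < timedelta(hours=1))  # restore under an hour
        and m["change_failure_rate"] <= 0.15              # zero to 15%
    )


# Made-up sample: ten deployments in one day, one of which failed
# and was restored 30 minutes later.
start = datetime(2024, 1, 1, 8, 0)
history = []
for i in range(10):
    committed = start + timedelta(hours=i)
    deployed = committed + timedelta(hours=2)
    broke = i == 4
    history.append(Deployment(
        committed, deployed, failed=broke,
        restored_at=deployed + timedelta(minutes=30) if broke else None,
    ))

metrics = dora_metrics(history, days=1)
print(metrics)
print("elite:", is_elite(metrics))
```

Medians are used here instead of means so that a single slow change or long outage does not dominate the summary; that is a design choice of this sketch, not something the talk prescribes.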
These elite performers reach corporate goals because they do well at the following metrics:
- Market share
- Number of customers
- Quality of products or services
- Operating efficiency
- Customer satisfaction
- Quantity of products or services provided
- Achieving organizational and mission goals
All of these feed into the belief that software delivery and operations (SDO) performance predicts overall organizational performance. Humble also says there is a further correlation between SDO performance and cultural performance. “They predict culture. The extent to where your culture is mission-oriented, not pathological, controlling-oriented,” Humble said.
All these elite performers share in fostering a climate for learning, highly participatory retrospectives, and the encouragement of trust, voice, and autonomy.
But it’s also not just about the people. Tech is just as important as culture in DevOps.
Can You Have DevOps Without Cloud Computing?
The Google State of DevOps 2019 report found that 80% of its respondents were primarily hosting on some sort of cloud platform. Humble explained that Google applied the National Institute of Standards and Technology’s (NIST) definition of cloud computing to SDO performance. That definition outlines the five essential characteristics of cloud computing:
- On-demand self-service: provisioning computing resources without human interaction from a cloud provider.
- Broad network access: heterogeneous access through phones, laptops, and tablets, not just workstations.
- Resource pooling: multi-tenant, can be abstracted through country, state or data center.
- Rapid elasticity: capabilities can scale up and down easily.
- Measured service: cloud systems automatically control, optimize and report key resource usage.
Only 29% of respondents to Google’s survey met all five requirements. Unsurprisingly, these lined up with DevOps performance. In fact, elite performers were 24 times more likely to have met all these essential cloud characteristics than the low performers.
Last year’s report found, and this year’s validated, that it didn’t matter whether teams were working on a public, private or hybrid cloud: a team focusing on cloud-based execution should see success in terms of speed, stability, and availability.
Humble said enterprises are typically running hundreds of thousands of services, made up of heterogeneous tech, but there are many other companies where more than 70% of the IT budget is “keeping lights on and adding capacity.” Then when they have to support CI/CD, they need to buy unsupported hardware on eBay or they are “running something mission-critical that no one has the code to anymore.”
Are You Fostering an Elite Culture?
Elite teams have a clear understanding of who does what and automate as much as possible.
Humble said that the hardest of the four key DevOps metrics to measure is lead time. It looks to answer questions like: How long would it take your organization to deploy a change that involves just one single line of code? Can you do this on a repeatable, reliable basis?
He went on to highlight several areas that dramatically affect lead time, and that the strongest software development and operations teams all seem to have answers to:
- Garbage collection: Who should I be billing for this virtual load balancer or database instance? What would happen if I deleted this service? Are people still using it? The platform should ensure that every virtual resource is assigned to either an app or the platform itself.
- Making changes: If an app has a vulnerability, how do I fix and deploy it? If I need to update this dependent service, where is the source code? Humble says it should be possible to redeploy any app at the click of a button.
- Multitenancy: How do we enable developers with self-service deployments or configuration? Humble admitted that making AWS and Kubernetes multitenant is difficult but it’s an essential requirement of any enterprise platform-as-a-service (PaaS).
- Managing complexity: Making sure the stack is up-to-date. How will we hire people in 15 years that know how to work this? Humble says to limit options. For example, all apps must be built on predefined approved runtime stacks that PaaS operators can patch and redeploy on demand.
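The garbage collection rule above, that every virtual resource is assigned to either an app or the platform itself, lends itself to a simple automated check. The following is a minimal sketch over a made-up inventory: the `Resource` type, the `PLATFORM` owner label, and the resource names are all hypothetical, and a real platform would pull this data from its own provisioning records.

```python
from dataclasses import dataclass
from typing import Optional

PLATFORM = "platform"  # hypothetical owner label for resources the platform itself uses


@dataclass(frozen=True)
class Resource:
    name: str
    kind: str             # e.g. "load-balancer", "database"
    owner: Optional[str]  # an app name, PLATFORM, or None if nobody claimed it


def find_orphans(resources, live_apps):
    """Return resources owned by neither a live app nor the platform."""
    valid_owners = set(live_apps) | {PLATFORM}
    return [r for r in resources if r.owner not in valid_owners]


def billing_by_owner(resources):
    """Group resource names by owner so each one has somebody to bill."""
    bill = {}
    for r in resources:
        bill.setdefault(r.owner or "UNASSIGNED", []).append(r.name)
    return bill


inventory = [
    Resource("lb-1", "load-balancer", "checkout"),
    Resource("db-7", "database", None),                # never assigned
    Resource("dns-0", "dns", PLATFORM),
    Resource("lb-9", "load-balancer", "retired-app"),  # owner no longer exists
]

for orphan in find_orphans(inventory, live_apps={"checkout"}):
    print(f"candidate for cleanup: {orphan.kind} {orphan.name} (owner={orphan.owner})")
print(billing_by_owner(inventory))
```

Running a check like this on a schedule answers “who should I be billing?” and “are people still using it?” before anyone has to guess.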
And probably most importantly, when there’s a vulnerability in your stack, how long will it take you to patch, build and redeploy all of your impacted applications? Humble underscored the need for rapid patching by pointing to Equifax, which suffered a headline-grabbing breach stemming from a flaw in the Apache Struts framework.
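If every app is built on a predefined approved runtime stack, working out what needs patching becomes a lookup. Here is a minimal sketch of that idea, assuming a hypothetical catalog that maps each approved stack to its currently patched version; the stack names, version strings, and `App` type are invented for illustration and do not describe any particular PaaS.

```python
from dataclasses import dataclass

# Hypothetical catalog: approved runtime stacks and their currently patched versions.
APPROVED_STACKS = {"jvm-web": "2.0.1", "python-web": "1.3.0"}


@dataclass(frozen=True)
class App:
    name: str
    stack: str          # runtime stack the app was built on
    stack_version: str  # version of that stack currently deployed


def unapproved(apps):
    """Apps built on a stack the platform operators cannot patch for them."""
    return [a.name for a in apps if a.stack not in APPROVED_STACKS]


def needs_redeploy(apps, stack):
    """Apps on `stack` whose deployed version differs from the patched one."""
    patched = APPROVED_STACKS[stack]
    return [a.name for a in apps if a.stack == stack and a.stack_version != patched]


apps = [
    App("checkout", "jvm-web", "2.0.0"),   # still on the unpatched version
    App("search", "jvm-web", "2.0.1"),
    App("reports", "python-web", "1.3.0"),
    App("legacy", "cobol-batch", "0.9"),   # not on an approved stack
]

print("outside the approved stacks:", unapproved(apps))
print("redeploy after jvm-web patch:", needs_redeploy(apps, "jvm-web"))
```

The list returned by `needs_redeploy` is what a platform team would feed into its self-service deployment API; the time from publishing the patched stack to that list being empty is the patch lead time Humble is asking about.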
The Architecture Behind Elite DevOps Teams
During his keynote, Humble outlined the architectural outcomes that allow teams the flexibility needed for high-performing DevOps without security risks. It’s not surprising that Conway’s Law rears its head here.
Ask yourselves these questions:
- Can my team make large-scale changes to the design of its systems without the permission of somebody outside the team or depending on other teams?
- Can my team complete its work without needing fine-grained communication and coordination with people outside the team?
- Can my team deploy and release its product or service on demand, independently of other services the product or service depends on?
- Can my team do most of its testing on demand, without requiring an integrated test environment?
- Can my team perform deployments during normal business hours with negligible downtime?
Humble suggests the need for a Platform as a Service. He says a self-service, multi-tenant PaaS minimizes the attack surface area.
For Humble, the key principle of a PaaS is a separation of responsibilities:
- The platform team is responsible for the PaaS.
- Make the application part as small as possible.
- Leverage a self-service API for deployment.
He calls this a “Function-as-a-Service model,” where ideally app stacks are part of the platform too so you can patch them easily. This all helps maintain compliance.
As you transition to a PaaS, Humble suggests you start by thinking about the outcomes and about the contracts that exist across the entire software development lifecycle. A PaaS will help you achieve short lead times and short times to restore.
Humble said the advantages of a PaaS include:
- Extremely cheap and highly scalable
- Minimizes attack surface area
- A separation of concerns between content and functionality
- Super easy to configure and deploy
- Decouples presentation and services
- All around easier to develop/test
The Psychological Safety Behind High-Performing Teams
Finally, Humble talked about how teams, not individuals, achieve outcomes, which is why team and organizational culture are what make or break these elite DevOps performers.
So, what’s the secret to building high-performing teams and enabling them to deliver with speed and stability? Psychological safety.
Organizations that foster psychological safety — where team members feel safe to take risks and be vulnerable around each other — have a greater capacity for dependability, structure, clarity, meaning and impact on the organization.
All of the above combines to strongly affect both culture and SDO performance, which in turn help organizational performance. It also helps reduce burnout, deployment pain, and rework.
Author’s Note, November 22, 2023: For years, I have been driving part of the focus on defining DORA metrics as solely those famous four (or sometimes five, if you include reliability) measurements. Yes, deployment frequency, lead time for changes, change failure rate, and failed delivery recovery time (previously called mean time to recovery, or MTTR) are important, especially when benchmarking yourself against your own team’s or company’s previous measurements, but they aren’t nearly the whole perspective of the sociotechnical power of DORA. In a more recent conversation with some of the creators of the State of DevOps Report 2023, I highlight several of the other 45 or so DORA questions in my piece Google Says You Might Be Doing DORA Metrics Wrong.