Charity Majors’ Recipe for High-Performing Teams
Do whatever it takes. Sacrifice the health and happiness of your employees to make the business successful. Please let 2022 be the death knell of this Churchillian blood, toil, tears and sweat work ethic. Because it doesn’t work, especially in the tech industry.
So, what does? First, leadership that understands the sociotechnical demands of teams and systems. The DevOps elite and high-performing teams not only stand out in their ability to deploy frequently and respond to negative impact quickly, but they continually look for ways to scale services without increasing team workload.
Charity Majors gave one of the keynotes at the WTF is SRE conference on fulfilling this promise of continuous delivery. The Chief Technology Officer and co-founder of observability platform Honeycomb.io and co-author of the new “Observability Engineering,” offered a compelling case for doing everything you can to empower engineering team satisfaction — focusing on the systems and processes to support your people.
What Makes a High-Performing Team?
“I’ve never seen the happiness of your engineering team and the happiness of your users significantly diverge. I’ve never seen a team of absolutely miserable engineers creating a product that delights their users. And I’ve never seen a company with absolutely miserable users, where their engineers enjoyed their work.” — Charity Majors, Honeycomb
Four years ago, Google’s DevOps Research and Assessment team set the standard in the famous book “Accelerate”. Elite and high-performing teams are able to continuously meet their respective DORA metrics:
- Deployment frequency – on demand (elite); daily (high-performing)
- Lead time for changes – less than a day (elite); less than a week (high-performing)
- Mean time to recovery – less than an hour (elite); less than a day (high-performing)
- Change failure rate – zero to 15% of the time (for both)
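To make the four metrics concrete, here is a minimal Python sketch of how a team might compute them from its own deploy log. The `Deploy` record shape, its field names and the `dora_summary` helper are all hypothetical, invented for illustration; they are not taken from DORA's tooling or from Majors' talk.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Deploy:
    """One production deploy (hypothetical record shape)."""
    committed_at: datetime                  # when the change was merged
    deployed_at: datetime                   # when it went live
    failed: bool = False                    # did it cause a production failure?
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def dora_summary(deploys: list[Deploy], window_days: int) -> dict:
    """Summarize the four DORA metrics over a window of `window_days` days."""
    if not deploys:
        raise ValueError("need at least one deploy")
    lead_times = [d.deployed_at - d.committed_at for d in deploys]
    failures = [d for d in deploys if d.failed]
    recoveries = [d.restored_at - d.deployed_at for d in failures if d.restored_at]
    return {
        "deploys_per_day": len(deploys) / window_days,
        "mean_lead_time": sum(lead_times, timedelta()) / len(lead_times),
        "mean_time_to_recovery": (
            sum(recoveries, timedelta()) / len(recoveries) if recoveries else None
        ),
        "change_failure_rate": len(failures) / len(deploys),
    }


# Made-up sample data: four deploys in one day, one of which failed.
t0 = datetime(2022, 1, 3, 9, 0)
deploys = [
    Deploy(t0, t0 + timedelta(minutes=15)),
    Deploy(t0 + timedelta(hours=2), t0 + timedelta(hours=2, minutes=15)),
    Deploy(t0 + timedelta(hours=4), t0 + timedelta(hours=4, minutes=30),
           failed=True, restored_at=t0 + timedelta(hours=5)),
    Deploy(t0 + timedelta(hours=6), t0 + timedelta(hours=6, minutes=20)),
]
summary = dora_summary(deploys, window_days=1)
print(summary)
```

A real pipeline would likely pull these records from CI and incident tooling rather than build them by hand. The sketch only shows that each metric reduces to simple arithmetic over timestamps.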
On top of those key metrics, Majors offers her own definition: A high-performing team is one that spends most of their time solving interesting and new problems that move the business appreciably forward. Lower-performing teams, on the other hand, are wasting time on everything else.
She also adds a fifth metric which asks: How often are you alerted outside of working hours? “Because those are the alerts that correlate tightly to: It’s an emergency. It’s on fire. And it’s going to burn your engineers out really quickly,” she said, in her keynote.
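One way to track that fifth metric is to count alerts that fire outside the working day. The sketch below is an assumption-laden illustration: the 9-to-6 weekday window, the function names and the idea that timestamps are already in the team's local time are all hypothetical choices, not anything Majors specified.

```python
from datetime import datetime, time

WORK_START = time(9, 0)   # assumed start of the working day
WORK_END = time(18, 0)    # assumed end of the working day


def is_off_hours(ts: datetime) -> bool:
    """True if the alert fired on a weekend or outside the assumed working day."""
    if ts.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        return True
    return not (WORK_START <= ts.time() < WORK_END)


def off_hours_alert_count(alert_times: list[datetime]) -> int:
    """Count the alerts that would have interrupted someone's time off."""
    return sum(is_off_hours(ts) for ts in alert_times)


# Made-up alert timestamps, assumed to be in the on-call engineer's local time.
alerts = [
    datetime(2022, 1, 3, 10, 0),   # Monday morning, working hours
    datetime(2022, 1, 3, 23, 30),  # Monday night
    datetime(2022, 1, 8, 12, 0),   # Saturday
    datetime(2022, 1, 4, 3, 0),    # Tuesday, small hours
    datetime(2022, 1, 4, 17, 59),  # Tuesday, just before the day ends
]
print(off_hours_alert_count(alerts))
```

A distributed team would need per-engineer time zones and on-call schedules. The fixed window here is only meant to show the shape of the measurement.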
The most recent State of DevOps report found that there are more and more elite performers that continue to raise the bar (and lower their DORA metrics), while the lower-performing teams continue to lose even more ground, as teams, their code, commitments and datasets continue to scale without any change to their build pipelines.
“You can’t actually hire your way out of this problem. You can’t build a high-performing team by just hiring more and more and more people.” In fact, Majors continued, “if you have a low- or mediocre-performing team and you add more people to it, you’re much more likely to make it worse. You’re much more likely to add a lot of contention and waiting around a system that’s already slow and clogged.”
She says you have to focus first on becoming a high-performing team before expanding that team. And even a team of excellent individual engineers, she argues, doesn’t necessarily make for a high-performing team.
“In order to become a high-performing team, you have to nail the art of doing less,” she said, which includes continuous learning, best practices and tooling to accomplish more with the same amount of effort. “No engineer ever got burned out from shipping too much. We get burned out from shipping too little relative to our efforts.”
Majors says there is just plain misery to be found in that gulf between medium-performing and high-performing teams. The latter:
- Performs 108 times more frequent code deployments
- Has 2,604 times faster time to recover from incidents
- Has 106 times faster lead time from commit to deploy
- Has a seven times lower change failure rate
“Look at how much faster these people are learning. Look how much less time they’re spending on bullshit,” she said.
“You should all want to be high-performing teams and the business team should be highly invested in making sure that teams are high performing.” — Charity Majors, Honeycomb
How Do You Build a High-Performing Team?
Especially in this last year of the “Great Resignation,” companies are spending an impressive amount of time and money on recruiting those who, on paper, are exceptional engineers in hopes of becoming a high-performing organization. Majors argues it doesn’t work that way: The smallest unit of software delivery and ownership is a team.
“You have a problem if the smallest unit of software delivery is a person. That’s a bus factor,” she said, referring to what happens if that single point of release or failure is hit by a bus, finds another job or is out sick with the next variant.
If given the choice, always choose a great teammate over a great engineer. “I would always choose the good engineer who’s a great teammate, who’s a good communicator, who is humble and curious, over somebody who has more skills in data structures and algorithms,” Majors explained.
Anyway, a manager’s job is to build teams — yes there’s hiring, but the focus should be on developing talent through feedback loops. She says it’s really important for engineering managers to “invest in teams that are low in toil and high in the things that make work worth doing — autonomy, mastery and meaning.”
Like all things in tech, that success hinges on a mix of people, processes, and technology.
“The job of engineering leaders is to try to figure out how to flip the script and do more with less. How can you take a team of 25 people who run a service for 50,000 people, and constantly improve so that three years later you have 25 people running a service for 5 million people?” Majors told The New Stack in a follow-up interview.
If a CTO has one job, she believes, it is to ensure every engineer can spend as much of their time on the highest value work possible. This is where your deployment pipeline comes in.
So you pilfered someone from a FAANG company, who of course has ample experience in shipping several times a day. Does that speed up your release time? Nope. If it’s two months, it’ll still be two months. But, Majors argues, if you are already shipping multiple times a day, you can hire any decent engineer to that team, and they will all ship at that rate.
She referenced in her talk the social-psychology phenomenon of the fundamental attribution error: the cognitive bias toward treating outcomes as an individual’s fault when they are usually the result of the social and environmental factors around us.
“You need to have a working knowledge of algorithms and data structures, but your ability to ship code is primarily a function of the socio-technical system that you exist in and participate in, which is why it’s so important for technical leaders to focus on feedback loops that are at the heart of this system,” Majors explained. “The system is really what matters most. Create the right system and the people will excel.”
System performance comes back to that first DORA metric: deployment frequency. And yet, Majors observes, fear around deploys becomes the largest source of technical debt in most organizations, which leaves most only meeting the first half of their continuous integration and continuous deployment (CI/CD) goals.
Intercom echoes this with one of its core values: Shipping is your company’s heartbeat. That doesn’t just mean it’s ride or die, but rather shipping should be as steady as heartbeats. Because, as the customer communications platform wrote, “Software only becomes valuable when you ship it to customers. Before then it’s just a costly accumulation of hard work and assumptions.” The feedback loop that challenges your assumptions doesn’t even kick in until you ship.
In her keynote, Majors described that loop:
- Following peer review, an engineer merges a single change to main, with any user-visible changes hidden behind feature flags to decouple deploys from scheduled releases.
- The merge triggers CI to run tests and build an artifact.
- This artifact is then deployed or canary released to production with no human gates. (There may be other automated gates or scheduling for release.)
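The feature-flag step in that loop is what lets code reach production before users see it. A minimal Python sketch of the idea follows. The in-memory `FLAGS` dictionary, the `new_checkout` flag and the discount logic are all hypothetical. Real teams typically use a flag service or configuration system, and nothing here reflects Honeycomb's own setup.

```python
# Hypothetical in-memory flag store, standing in for a real flag service.
FLAGS = {"new_checkout": False}


def is_enabled(flag: str) -> bool:
    """Unknown flags default to off, so unreleased code stays dark."""
    return FLAGS.get(flag, False)


def checkout_total(prices: list[float]) -> float:
    if is_enabled("new_checkout"):
        # New code path: already deployed, but invisible until the flag is on.
        return round(sum(prices) * 0.9, 2)  # hypothetical 10% launch discount
    # Existing behavior, still served to every user while the flag is off.
    return round(sum(prices), 2)


# Deploy happens with the flag off; release is a separate, later decision.
before_release = checkout_total([10.0, 5.0])
FLAGS["new_checkout"] = True
after_release = checkout_total([10.0, 5.0])
print(before_release, after_release)
```

Because both paths ship in the same artifact, the deploy can run through CI automatically while the release itself stays a product decision, made by flipping the flag.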
This continuous deployment feedback loop should take 15 minutes or less to automatically deploy code live to real users. Anything longer, and the developer loses the context or risks cognitive overload trying to remember the original intent, tradeoffs, implementation details and more.
In just 15 minutes, with proper telemetry and observability tooling and instrumentation, Majors says engineers can find about 80% of bugs. “They close the loop by going in and looking at it, and asking yourself: Is it doing what I want it to do? Does anything else look weird?” This predictable interval becomes as reflexive as a heartbeat, with the rush of dopamine feeding back to the engineer who understands they are making a difference every time.
This all unravels, though. The further you get from that 15-minute marker, the more complex your code release gets and the more your teammates head into what Majors calls the “engineer death spiral.”
The Cost of Fear of Deployment
So if the technology is there and you don’t need a team of incredibly senior engineers, what’s holding back so many organizations from joining the DevOps elite? The costly fear of deployment.
If it takes X engineers to build, maintain and run your software systems when code ships in under 15 minutes, then Majors reckons it will take twice as many engineers when it takes an hour. Quadruple the staff if it takes you days to ship. Octuple it if it takes you weeks. And she thinks this is a rather conservative estimate.
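Majors' rule of thumb can be written down as a small lookup. This Python sketch is only an illustration: the function names are hypothetical, and the exact bucket boundaries are assumptions, since she names only "under 15 minutes," "an hour," "days" and "weeks."

```python
from datetime import timedelta


def staffing_multiplier(ship_time: timedelta) -> int:
    """Rough headcount multiplier for a given time to ship.

    The boundaries between buckets are assumptions made for this sketch.
    """
    if ship_time <= timedelta(minutes=15):
        return 1   # the 15-minute baseline
    if ship_time < timedelta(days=1):
        return 2   # on the order of hours
    if ship_time < timedelta(weeks=1):
        return 4   # on the order of days
    return 8       # on the order of weeks


def engineers_needed(baseline: int, ship_time: timedelta) -> int:
    """Scale a baseline team size by the multiplier for its time to ship."""
    return baseline * staffing_multiplier(ship_time)


# A hypothetical team that needs 10 engineers at the 15-minute baseline.
for label, ship_time in [
    ("10 minutes", timedelta(minutes=10)),
    ("1 hour", timedelta(hours=1)),
    ("3 days", timedelta(days=3)),
    ("2 weeks", timedelta(weeks=2)),
]:
    print(label, engineers_needed(10, ship_time))
```

The doubling at each step is her estimate, not a measured law, and she considers it conservative.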
“Most leaders will spend freely on headcount, millions of dollars a month, but get all penny-pinchy at spending tens of thousands of dollars on a tool,” Majors told The New Stack in the same follow-up interview.
So many orgs are still trying to respond to this growth with more people and more code, but in order to take that leap to elite productivity, Majors says it comes down to smart investment in tooling. “Leveraging other people’s code — the code you can rent or lease or pay for, but someone else has to maintain and own.”
She estimates that teams should be spending 15 to 20% of their infrastructure bill on observability tooling alone in order to understand that infrastructure. Without that investment, your teams are just slowing down because, if your company is scaling, so are the surface area, code, entropy and responsibilities needed to maintain it.
If you’re spending substantially less than that, she argues, you either have inferior tooling or you’re wasting sprints on code that your users don’t care about — probably both.
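As a back-of-the-envelope check, that 15 to 20% guideline is easy to turn into numbers. In this sketch the function names and the sample bill are hypothetical. Since "substantially less" is not quantified, the check simply flags any spend below the lower bound.

```python
def observability_budget(
    infra_bill: float, low: float = 0.15, high: float = 0.20
) -> tuple[float, float]:
    """Return the (low, high) spend range implied by the 15-20% guideline."""
    return (infra_bill * low, infra_bill * high)


def is_underinvested(infra_bill: float, observability_spend: float) -> bool:
    """True if spend falls below the lower bound (an assumed cutoff)."""
    return observability_spend < observability_budget(infra_bill)[0]


# A made-up monthly infrastructure bill of $100,000.
low, high = observability_budget(100_000)
print(f"Suggested range: ${low:,.0f} to ${high:,.0f} per month")
print(is_underinvested(100_000, 5_000))
```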
And that’s just one tool. Often organizations of all sizes are wasting money building tools that aren’t supporting their unique value proposition.
“The smarter choice is to spend time on the engineering problems directly related to your core business model,” she said. “Make it someone else’s problem. Keep your cognitive load low. Let them do what they’re the best in the world at, and you do what you’re the best in the world at.”
Once you have a high-performing team in place that’s able to deploy in 15 minutes or less, backed by best-in-breed tooling, only then can you expand. And you aren’t going to be recruiting so-called high-performing engineers before then anyway. As Majors warned: “Engineers who have worked on teams with a short delivery cycle are unwilling to ever work anywhere else again.”
Disclosure: The author of this article was the host of the WTF is SRE conference.