Why People Should Be at the Heart of Operational Resilience
At a chaos engineering conference in 2019, Google engineer David Rensin argued that “our software and hardware systems are not the most complex distributed systems we deal with in our day-to-day lives. Our companies are.”
The implication of this, Rensin continued, is that “most of the complexity in … large distributed systems is not in the software or the hardware, it is in the humans.”
He went on to discuss the intriguing idea of bringing chaos engineering principles to “people systems” — having, say, one team member absent on the day of a particularly difficult system issue.
It’s a shame Rensin’s idea hasn’t caught on. Although resilience, technical or otherwise, appears to be a word more widely used in our post-pandemic world, there’s still a sense that it’s ultimately a technical issue.
Yes, operational resilience might be business critical, maybe even a matter of life and death. But it can be all too easy for those outside or on the edges of the software teams to believe issues of resilience and reliability are simply in the process of being solved by the right people.
Of course, this couldn’t be less true. If resilience is about an ability to adapt and respond to change, then it is inevitably going to be a human issue.
“Humans are the mechanism of adaptation in software, because, by design, our programs and machines can’t adapt on their own,” Kelly Shortridge, senior principal at Fastly and one of the authors of “Security Chaos Engineering,” pointed out in a conversation with The New Stack.
In other words, it simply doesn’t make sense to try and separate the human and technical components of resilience — they necessarily go together.
For Will Gallego, a software engineer at PagerDuty, resilience needs to be understood as distinct from robustness. Where robustness is a quality of software, resilience is about how the interaction of people and software artifacts can respond to change.
“Robustness is how the technical system is built such that it can withstand the expected,” he told The New Stack over email. “Resilience would be engineers responding in the moment using their expertise to diagnose the failure mode and look at patterns that could potentially relieve the stress.”
Why Does Human/Tech Resilience Matter Today?
There’s admittedly nothing remarkable about noting that humans and technology are closely intertwined. The term “sociotechnical” was coined in the middle of the 20th century by social scientists Eric Trist, Fred Emery and Ken Bamforth specifically to describe the way technological artifacts, organizational structures and humans interact and shape one another.
However, it has particular urgency today for a number of reasons.
The first is that software systems are increasingly complex. Cloud-based systems and services, particularly modular ones composed of microservices and built with the ever-expanding ecosystem of tools and platforms that accompanies such an approach, mean that practitioners often find themselves dealing with interconnecting puzzles of services and applications, frequently with only a partial view of the whole.
The second is that the march of digital transformation, which has accelerated in recent years as a consequence of the Covid-19 pandemic, means that these systems are more embedded and enmeshed in people’s everyday lives. To put it another way, the world demands more from these products and services and expects them to be always available.
Alongside this, of course, is the reality that software is built differently now. DevOps, in particular, has changed expectations about who is responsible for what.
Because we’ve seen a “shift to service ownership by the developers who build the service,” according to Jim Gochee, CEO of Blameless, we now have “a new breed of on-call responders” taking on the type of work that would have been done in the past by support engineers or, more recently, by site reliability engineers (SREs).
Embracing the ethos of “you build it, you run it” isn’t necessarily a bad thing, but turning it into a fetish can easily lead us into a place where failures and faults become the responsibility of individuals. That’s not good for anyone, humans or technology.
“If the resilience of a system depends on humans never making mistakes, then the system is really brittle,” Shortridge said. “Humanity’s success is because of our creativity and ability to adapt; it isn’t because we’re great at doing the same thing the same way every time, or can memorize 50 things on a checklist that we never forget.”
Although DevOps is well-intentioned in attempting to break down barriers, it has arguably contributed to a broader organizational discomfort with failure — a desire to control and minimize risk. “Many organizations struggle with the existential angst of wanting to prevent anything bad from ever happening,” Shortridge claimed.
This, she added, is ultimately “an impossible goal … It’s a downward spiral where the fear of things going wrong results in a slower, heavier approach, which actually increases the likelihood of things going wrong – as well as hindering the ability to swiftly recover from failure.”
Gallego stated the problem in even more blunt terms: “Folks say ‘safety is our number one priority.’ But the only way to be completely safe is to shut down your system and go home. Safety is challenged by environmental constraints and business needs.”
Given the pressures of the current global economic situation, it’s not surprising that organizations are anxious about failure. Indeed, the recent debate around developer productivity is connected to operational resilience: the idea that organizations succeed when developers are doing the right thing (writing lines of code), and that obstacles must be kept out of their way, ultimately emphasizes the usefulness of the individual rather than the health of the overall system.
This might be challenging for many organizations, Shortridge said: “Accepting that failure will happen, that we need the capacity to make changes quickly, that human mistakes represent opportunities for better design — that’s often a philosophical leap for organizations.”
Thinking Beyond Engineering Teams
Given the challenge of changing the cultural mindset around the role of humans in resilience, expanding its scope might seem odd. However, thinking about how engineering teams (dev or ops) are embedded within even wider organizational structures could help to broaden the way we think about resilience.
Indeed, it might even help place more emphasis on the sociotechnical nature of operational resilience.
Gallego believes that operational resilience “extends throughout the entire organization. You might not have your sales team hop into an incident to bring up servers, but knowing that you’re going to run a marketing campaign affects the ability of the site to withstand heavier loads. So extending communications ahead of time allows folks to gauge parts of the system to stress test, share expertise as to potential failure modes with traffic patterns.”
This, he told The New Stack, is partly a key lesson of DevOps that the industry should consider more deeply. “Once you start removing barriers between those teams, you ask yourself: ‘Why aren’t we chatting with support more? With our data science teams or our comms teams?’”
Similarly, Gochee noted how resiliency requires participation from other teams, from support to sales. “Let’s say it’s the end of the quarter and there’s a number of companies in a trial, and the site goes down during the trial. And now these companies are like, ‘Well, I don’t know, if you have a reliable service, I may pull out and decide not to buy.’”
The implications may be significant, he said: “You’ve just impacted the sales team, which impacts bonuses, commissions, paychecks, livelihoods.” Eventually, this finds its way through the hierarchy. “It escalates to the CEO, who is usually like, ‘OK, we have a huge issue, like, what’s going on? Why isn’t it getting fixed?’”
Rewarding ‘Glue Work’
Both Gochee and Gallego highlight the importance of communication and transparency between these various functions to resilience.
Imagine, for instance, “you’re in a role and you hear the site’s down,” Gochee said. “It’s nice to know it’s being addressed in a timely fashion and then it’s nice to know afterwards that we did a deep dive, we understood why it happened.”
What’s particularly important is that we shouldn’t treat these things as trivial. Yes, communication is invariably the obvious answer to the challenges we face, but Gallego pointed out that in an organizational setting, this requires actual energy and work.
He cited the concept of “glue work.” Popularized by Tanya Reilly, a blogger and senior principal engineer at Squarespace, the term refers to the work of onboarding new team members, talking to users and other teams, and generally keeping things moving forward; Reilly notes that such labor is often “invisible.”
More importantly, it often goes uncompensated. “Folks don’t always get promotions on avoiding failure,” Gallego noted. “Tech too often has a hero culture around folks working late nights and weekends, jumping into fires to put them out. They’re readily obvious and immediately felt, but doing the work to avoid problems when you don’t know how bad it could have been doesn’t evoke the same emotions.”
An important step in operational resilience, then, is to ensure that this “glue work” is acknowledged and recognized.
The Importance of Leadership
The first step is perhaps to create a role that is fully committed to resilience.
“If there’s not a leader who’s doing it, it typically won’t get done,” Gochee said. “The VP of engineering just doesn’t usually have the bandwidth to do a VP of operations, or VP of infrastructure or VP of DevOps. They’re so busy with 1,000 other things that they can just never really get to it.”
This might seem counterintuitive, given the importance of a huge range of roles to resilience. But the point is that it requires someone to lead on the processes, tooling and mindset needed to actually do operational resilience effectively.
Preparation, Training and Psychological Safety
But what’s actually needed to help organizations tackle the challenges of operational resilience? Part of it is about preparation.
“You train for it, you prepare for it,” Gochee said. “You make it easy on them by having a playbook, so you don’t have to figure out what to do when you first get into an incident.”
To some extent, this echoes Rensin’s point about injecting a little bit of chaos into people systems: making sure people still know what to do even if one person, say, is unavailable.
“It’s like any kind of emergency disaster planning in any parts of life,” Gochee said. He explained that having playbooks is critical: they should outline what should happen and who needs to be involved, reducing uncertainty and lack of clarity in the process.
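Rensin’s idea of running chaos experiments on people systems can be made concrete with something as simple as a drill picker. The sketch below is purely illustrative (the team names and the drill itself are hypothetical, not from the article): it randomly designates one responder as “unavailable” for a practice incident, so the remaining team exercises the playbook without that person’s expertise.

```python
"""Illustrative sketch of a 'people game day' picker, in the spirit of
Rensin's chaos-engineering-for-people-systems idea. Names are made up."""
import random


def plan_game_day(team, seed=None):
    """Pick one responder to sit out a drill, so the rest of the team
    practices handling an incident without that person's knowledge."""
    rng = random.Random(seed)  # seeded for a reproducible drill schedule
    absent = rng.choice(team)
    responders = [member for member in team if member != absent]
    return absent, responders


team = ["ana", "bo", "chen", "dee"]
absent, responders = plan_game_day(team, seed=7)
print(f"Game day: {absent} is 'unavailable'; responders: {responders}")
```

The point of seeding is that the drill can be planned and reviewed afterward; in a real exercise you would rotate the “absent” role so every single point of knowledge eventually gets tested.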
Indeed, knowing exactly who should be responsible for developing your playbooks may be tough. This underlines the value of having someone in a leadership position who can advocate for resilience initiatives and assets.
The real value of training and preparation for operational resilience, Gochee noted, is what it affords individuals and teams in psychological terms. In the event of something unexpected, people can easily panic: “Your amygdala gets hijacked,” he said, which triggers a fight or flight response.
Both Gallego and Shortridge also highlight the psychological dimension, emphasizing in particular psychological safety.
“The best cultures doing this work develop psychological safety as a core function, full stop,” Gallego said. “Are you able to bring forth challenging ideas? Are senior members of a team willing to pass the reins in order to allow others to grow? When incidents do happen … are they swept under the rug?”
When The New Stack asked Shortridge what organizations that do operational resilience well look like, she offered a very similar response to Gallego: They have cultures that reward curiosity and transparency where “teams are continually learning about their systems with humility.”
These, she said, are organizations where “making a mistake isn’t inherently shameful; refusing to learn from the mistake and sticking with the status quo anyway is the problem.”
Organizations that don’t acknowledge this are not only hurting their employees, they’re also undermining their resilience. These points are echoed elsewhere. In a piece published earlier this year in The New Stack, Fred Hebert, a staff SRE at Honeycomb.io, wrote that when new engineers are onboarded to the team, it is made explicit that “we are expected to not know what is going on, that this is OK, and that responsibility is shared, and we are fully aware of the situation they’re being put in.”
A Better Platform for Everyone
To be clear: while the human dimension of resilience is vital, we shouldn’t ignore the technical part of the sociotechnical concept. Shortridge is particularly clear on this point and sees platform engineering as critical to enabling operational resilience in the future.
“If we want to sustain software resilience, we need to ensure application developers don’t have to think about resilience as much as we can,” Shortridge said. “This means we need platform engineering teams that think about resilience, just like they already think about reliability or productivity, so they can build platforms, tools and patterns that make the resilient way the fast and easy way for developers.”
This doesn’t mean building a new silo between developers and platform engineers. What matters is ensuring that everyone in an organization — inside and outside of engineering functions — has the tools and knowledge they need, and feels comfortable with ambiguity and uncertainty, so they can respond when the unexpected happens. Because the unexpected certainly will happen.
Correction: Will Gallego’s job title and employer have been corrected and updated from a previous version of this article.