The Need to Decouple Human Error from Incident Response
VALENCIA, Spain — Science fiction may be about humans versus the machines. However, a lot of software management is still quick to blame incidents on human error over the complexity of those machines we’re interacting with.
At a time when we understand the impact of both burnout and psychological safety on teams, finding “human error” included in a root cause analysis is just bad business. The blame game must end because it disrupts both team and organizational resiliency.
In her lightning talk “Whyhappn instead of Whodunnit,” independent software engineer Silvia Pina begged the KubeCon + CloudNativeCon Europe 2022 audience to remove the term human error from their vocabulary, because when we are talking about consistently complex systems with unknown unknowns and increasingly sophisticated attack vectors, it can’t come down to just one person.
As Charity Majors contends, the smallest unit of software delivery and ownership is a team. It’s time to shift our focus from blaming the individual to applying Pina’s perspective of systems thinking and organizational psychology to increasing resiliency.
Even Aviation Doesn’t Talk Human Error Anymore
The concept of human error in technology is adapted from the aviation industry. “Because the system or the machine is considered really reliable and all safety issues come from the fact that humans are operating it, so humans are the weak link,” Pina explained. Or at least we were perceived to be.
Over time, human error in aviation changed from the cause of failure to a symptom of failure. Safety is no longer perceived as inherent to the system, so progress has been redefined as a better understanding of the ways in which tools, tasks, and the environment interact.
Alas, human error is still cited as the reason behind software incidents.
“It’s like an Agatha Christie story trying to figure out who has committed the crime, or, in this case, the incident,” Pina said. “This ties to an old view of human error that comes from aviation where high reliability is a requirement.” Reliability is of course a requirement in software engineering, but not at the 100% uptime an airplane full of people demands.
Like aviation, distributed software systems have high levels of complexity. But these systems also have a huge amount of variability. “This level of variability requires some level of adjustment,” she said. “This is one of the reasons we are successful, but this is also one of the reasons why there are failures.” Teams must accept that failures will occur, no matter what they do to plan against them.
Software engineering, unlike aviation, also embraces failure as an opportunity to experiment and learn. This is even a critical part of site reliability engineering practice: allowing for an error budget, and applying observability and chaos engineering to learn by pushing systems to their limits and, sometimes, to failure.
The Psychological Safety of High-Performing Organizations
Success and failure are better perceived as two sides of the same coin. Pina describes this new view of human error as more of a “no view.” We no longer need to have human error as a category in postmortems.
“We should take away the focus from the individual and try to look at what organizations can do,” she said.
At this level, she recommends considering the five characteristics that are common to high-reliability organizations, which are:
- Preoccupied with failure — try to identify warning signs for all possible failures at technical, process or human levels
- Reluctant to simplify — embrace complexity, don’t look for simple answers, understand need for specialization, upskilling and training, as well as automation
- Sensitive to operations — maintain a global view and look to understand work-as-done, embracing candid employee feedback
- Committed to resilience — failure becomes a learning opportunity, teams constantly looking for ways to recover more quickly
- Defer to expertise — anyone can ask questions or provide answers, expertise is valued more than authority
“Failure has a role in how these [elite] organizations work,” Pina explained. “We build resilience to failure by focusing on helping people to cope with complexity under pressure.”
This means, she says, keeping awareness at an organizational level, and spreading the lessons throughout. With this in mind, the blameless postmortem is essential to learn the root causes of an incident. A postmortem is an important mechanism for continuous learning and improvement in incident response, but only if the finger-pointing is left out.
“We move from this very human tendency to judge to a point where we can then understand why a failure happens,” Pina said. “And this is why we need to no longer talk about human error.”
This is also why zero trust culture centers on moving away from the assumption that humans are the weakest link in any security chain and toward making security everyone’s job, and then, at a technical level, enforcing collaborative governance. Yes, human error is a leading cause of Kubernetes security incidents, but that’s because the orchestration system is very weak on security-minded defaults.
Red Hat even found that these Kubernetes incidents were caused by misconfigurations. But if that is the repeated error, that’s a systemic and procedural issue, along with a technical one, and not down to the error of one teammate. High-performing organizations understand that they must improve processes and tech in response, not play the blame game.
Psychological safety is essential to building organizational resilience to failure. Pina says, therefore, it’s leadership’s job to help people cope with complexity under pressure.
Decoupling human error from incident response gives you perspective, she explained, and you see things anew, just like in a René Magritte painting.