Why Observability Is a Labor Issue
Observability is a brilliant term for anyone who writes about technology. The debate and disagreement about what it is, what it isn’t, and why it matters make it ripe for commentary and analysis.
But although there’s plenty of interest in it from a technical perspective (is it, for example, just a fancy word for monitoring?), one aspect often gets overlooked: the fact that as much as it is a deeply technical issue, it’s also very much a human one.
Wherever businesses are concerned about downtime, and wherever software developers are having to firefight while on-call in the small hours of the morning, observability is something we need to see as fundamental to the way we think about and practice software engineering. In short, observability is a human issue and, more often than not, a labor one as well.
Doing it properly will not only ensure software engineers are given the respect and empathy they deserve in organizations, but will also help them become more collaborative, curious and engaged in the workplace.
Bridging the Gap: Engineers and Everyone Else
Nora Jones, CEO of incident-analysis platform Jeli, recounted a time when a company she used to work for had managed to land a Super Bowl commercial spot. When the commercial aired, the company’s systems crashed — a high-profile (and expensive) failure.
At the incident review meeting in the week that followed, she said, there was no one from marketing or public relations in attendance — only site reliability engineers.
This experience, Jones said, was a big inspiration in the development of Jeli. “What I wanted to do was create those bridges so that other functions could understand each other and how they participated.”
Jeli is a particularly interesting platform in that it highlights the way in which observability needs to be viewed as much more than a technical challenge. It has to be understood as a people problem, one that is symptomatic of the inevitable puzzle that comes from building and relying on systems that involve different domains and different types of expertise.
It works by making it easy for representatives from different teams, whether they’re engineers, marketers or support staff, to not only collaborate on recovering from a given incident, but also to document the necessary context that makes responding to an incident — and taking the steps needed to prevent it from happening in the future — so much simpler.
One of the most curious aspects of the tool is that its UI invites users to construct a “narrative.” “You’ll notice you don’t see the word ‘incident’ a lot in here,” Jones said. “Because we’re trying to have it be this collaborative opportunity.”
This is ultimately all about creating a psychologically safe workplace, she said: “People are more likely to participate in an open and honest way if they don’t feel like they’re in trouble.”
This is important on its own terms, of course, but we shouldn’t overlook the fact that if people don’t feel comfortable participating, then issues are going to persist. It will become practically impossible to address the inevitable consequences of complexity, creating an even worse working environment and leading workplaces into a spiral of misery and stress. Ultimately this will only further exacerbate burnout, which is already incredibly prevalent in the industry.
For software developers working with complex software systems, the technical difficulty of identifying the causes of given issues and incidents is felt at a human and interpersonal level. Ironically, rather than encouraging a blame-free culture, that difficulty creates the conditions for blame to thrive.
In the absence of data — whether that’s application logs or contextual notes and documentation — it’s tempting to seek out alternative explanations. To use Jeli’s terminology, if there’s no explicit narrative, others will emerge in communication backchannels and whisper networks. That’s a recipe for toxicity.
The Intersection of Humans and Technology
To understand the importance of observability, then, we need to pay attention to the way in which the human and the technical interact. It’s not enough to think about metrics in terms of the specific activity taking place within a system or application: these can be useful, sure, but in the context of incident response and reliability, such an approach inevitably limits what you’re able to see and, by extension, the sorts of questions you can even ask.
“Observability is centered around an exploratory and interactive workflow. Asking new questions. Making sense of ‘unknown-unknowns.’ Figuring out what matters, in the context of your business,” said Liz Fong-Jones (no relation to Nora Jones), developer advocate at observability platform Honeycomb.io and co-author of O’Reilly’s “Observability Engineering.”
“An investigative workflow can begin at a canned dashboard, but must always, always enable the user to step through and customize the questions being asked of their systems.”
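One common pattern behind that kind of investigative workflow is emitting wide, structured events: one record per unit of work, carrying both technical and business context, so responders can slice the data by any field later rather than being limited to pre-aggregated dashboard metrics. The sketch below is illustrative only, not Honeycomb’s actual API; the `emit_event` helper and every field name in it are hypothetical.

```python
import json
import time


def emit_event(fields):
    """Emit one wide, structured event as a JSON line.

    Capturing rich per-event context is what lets responders ask new
    questions after the fact, instead of only the questions a canned
    dashboard anticipated.
    """
    event = {"timestamp": time.time(), **fields}
    print(json.dumps(event, sort_keys=True))
    return event


# Hypothetical checkout request: business context (customer tier, marketing
# campaign) sits alongside technical fields, so a later question like
# "were users arriving from the Super Bowl campaign disproportionately
# affected?" is answerable from the same data.
event = emit_event({
    "service": "checkout",
    "duration_ms": 412,
    "status_code": 503,
    "customer_tier": "free",
    "campaign": "superbowl-spot",
})
```

The design choice worth noting is that nothing here is aggregated at write time; keeping the raw, high-cardinality fields is what preserves the ability to “customize the questions being asked.”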
To follow this line of thought, observability is something that can change the way that individuals and collectives relate to and think about systems. Without wanting to sound too idealistic, it returns some level of autonomy and agency to the people responsible for developing and maintaining those systems.
Perhaps it’s in this sense that Nora Jones describes reliability as both “an art and a science” — it’s more than mere passive empiricism, but is instead about being able to think carefully about what it means for your organization.
As important as it is to bridge the gap between business functions and engineering ones, we shouldn’t overlook the fact that its biggest impact is on the way engineers work with one another.
Indeed, if it does bridge the gap between technical and non-technical teams, it should lead to a situation where technologists gain greater empathy from parts of the business that might previously have had little knowledge of, or sensitivity to, the actual work they do.
This is particularly important at a time of economic downturn and the so-called Great Resignation, Jones suggested. While working at Slack as head of chaos engineering and human factors, she recalled the palpable sense of panic from the company’s leadership as a cohort of experienced engineers left around the time of the company’s initial public offering: “What do they know that no one else knows?”
Observability, when done well, will not only ensure that pockets of expertise are transparent, but will also drive engagement and, yes, maybe even fun and enjoyment.
Noting that the Jeli team uses Honeycomb, Jones mentioned just how excited engineers were when the team brought it in. “It’s helping me invest in my engineers as a CEO because I’m giving them time and space and tools to learn and to understand,” she said. “It empowers them.”
Observability encourages curiosity; it opens up ways for engineers to ask new questions and explore things in a way that a standard monitoring dashboard does not.
Observability and Labor
As important as curiosity and fun are, we also need to acknowledge the usefulness of observability in the context of workplace pressure in engineering teams.
With on-call rotations now normal in many organizations thanks to the increasing importance of digital infrastructure (and the increasing cost of downtime), observability is critical in ensuring that knowledge can be effectively shared within and across teams.
It can also help guarantee that people have precisely what they need when they’re trying to fix a bug that they may not have ever encountered before — and doing so in the middle of the night.
The industry sometimes suffers from what Jones called a “hero syndrome,” the tendency for some individuals to make themselves more valuable by blackboxing themselves (keeping their essential work hidden) and winding up as the only possible person to solve every problem.
Without a doubt, observability can go a long way in tackling this by opening up knowledge and giving teams the clarity and context needed to be able to debug and fix systems effectively.
“Without access to observability, developers certainly will be worse off in terms of their working conditions,” Fong-Jones said. However, she stressed that observability is only one small element in the context of labor rights and working conditions.
“The key issue is around how power and control is allocated,” she said. “Having the data is a great start, but it’s necessary to talk about how on-call happens in organizations and whether it’s a punishment or something developers dread, or something that they have agency and control over.”
In other words, insofar as observability sits at the intersection of the technical and the social, for it to make a real impact on the lives of software engineers, teams need to be having frank and open conversations about how they work, how they collaborate, and even the value of their work.
A technique or platform can’t effect change on its own. However, as industries and organizations begin to feel the tensions and strains that come with complexity, it seems that observability will offer one way for engineers — and those around them — to assert the importance of humans in building and maintaining software.
For all the talk of the coming age of automation, observability is a reminder that how we build software and work together will remain questions that will need to be answered over and over again, for years to come.