Reducing the Cognitive Load Associated with Observability
Can you imagine developing or operating a distributed system without modern observability tools? We know observability is a critical practice that lets us improve our system’s reliability, reduce service downtime, visualize usage patterns, provide performance insights and facilitate issue resolution.
The roles of engineers — from devs and Ops to DevOps, site reliability engineering and platform engineering — changed dramatically with the widespread adoption of microservice architectures and an industry-wide "shift left" push over the past decade. Many were given more responsibilities and saw an increase in workload.
As a software engineering organization, our job is to build high-quality systems that cater to a specific business need. To achieve that, we’ve instrumented our applications, set up distributed tracing along with centralized log collection and continually monitored latency, error rates and throughput with alerting on top of that. Now what? We can rely on one heroic expert in our organization to handle the alerts, diagnose system failures and prevent outages. Or we can spread that knowledge to all engineers and share the workload.
Asking everyone to be proficient with the tooling in place and to understand the large quantity of data generated will inevitably lead to anxiety, frustration and fatigue. Could we somehow reduce the cognitive load associated with observability?
Making Sense of Observability Data
There are hard skills associated with observability. Engineers need to be trained to decipher the basic data types. Hopefully, tools can assist humans in this task. No wonder we saw a proliferation of vendor tools aiming to provide the best experience to interpret and visualize distributed traces, metrics and logs. It is a complicated task! A distributed trace is just a large blob of linked timestamps and metadata; metrics can be gauges, counters or histograms; a log statement can be structured or unstructured depending on the audience and consumer. Even the most common log statement can look foreign to the untrained eye. Just ask a Java developer to unravel a Python stack trace!
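To make the distinction between those basic data types concrete, here is a minimal sketch of the three common metric types, modeled as plain Python classes. The class and method names are illustrative assumptions, not any particular client library's API, though the semantics (monotonic counters, settable gauges, cumulative histogram buckets) mirror common conventions.

```python
class Counter:
    """Monotonically increasing value, e.g. total requests served."""
    def __init__(self):
        self.value = 0

    def inc(self, amount=1):
        if amount < 0:
            raise ValueError("counters only go up")
        self.value += amount


class Gauge:
    """Point-in-time value that can go up or down, e.g. queue depth."""
    def __init__(self):
        self.value = 0

    def set(self, value):
        self.value = value


class Histogram:
    """Distribution of observations, with cumulative bucket counts."""
    def __init__(self, buckets):
        self.buckets = sorted(buckets)
        self.counts = {b: 0 for b in self.buckets}
        self.total = 0

    def observe(self, value):
        self.total += 1
        # Cumulative buckets: an observation lands in every bucket
        # whose upper bound it does not exceed.
        for bound in self.buckets:
            if value <= bound:
                self.counts[bound] += 1


requests = Counter()
requests.inc()

in_flight = Gauge()
in_flight.set(3)

latency = Histogram(buckets=[0.1, 0.5, 1.0])
for sample in (0.05, 0.3, 0.8):
    latency.observe(sample)
```

Knowing which shape a signal takes determines how to interpret it: a counter's absolute value is rarely interesting on its own (its rate is), while a gauge is meaningful at a glance.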
And then we are faced with the problem of "too much data." We rely on tools to find the needle in the haystack and filter out the noise. Good data hygiene helps, too: at any point in time, signals that are collected but not exposed in any visualization or used by any alert are candidates for removal.
Signals: Finding the Incident-Triggering Needle in the Haystack
Data points need to be filtered and transformed in order to generate the proper signals. Nobody wants to be staring at a dashboard or tailing logs 24/7, so we rely on alerting systems. When an alert goes off, it is intended for human intervention, which means transforming the raw signal into an actionable event with contextual data: criticality of the alert, environments, descriptions, notes, links, etc. It must be enough information to direct the attention to the problem, but not too much to drown in noise.
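That transformation from raw signal to actionable event can be sketched as a simple enrichment step. The field names, URLs and severity scheme below are hypothetical, included only to illustrate the kind of context a responder needs at a glance:

```python
from dataclasses import dataclass


@dataclass
class Alert:
    """An actionable event, enriched with context for a human responder."""
    title: str
    severity: str        # e.g. "page" (wake someone up) vs. "ticket"
    environment: str
    description: str
    dashboard_url: str   # where to look first
    runbook_url: str     # what to do about it


def enrich(raw_signal: dict) -> Alert:
    """Turn a bare threshold breach into an event a human can act on."""
    return Alert(
        title=f"{raw_signal['metric']} above threshold",
        # Only production breaches justify an interruption in this sketch.
        severity="page" if raw_signal["environment"] == "production" else "ticket",
        environment=raw_signal["environment"],
        description=(
            f"{raw_signal['metric']} = {raw_signal['value']} "
            f"(threshold {raw_signal['threshold']})"
        ),
        dashboard_url="https://dashboards.example.com/checkout",   # hypothetical
        runbook_url="https://runbooks.example.com/checkout-errors",  # hypothetical
    )


alert = enrich({
    "metric": "error_rate",
    "value": 0.12,
    "threshold": 0.05,
    "environment": "production",
})
```

The exact schema matters less than the principle: every alert should carry just enough context to direct attention, and a link to where the investigation continues.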
Above all else, a page alert should require a human response. Nothing justifies pulling an engineer out of their flow if the alert is not actionable.
When an alert triggers, analysis begins. While we eagerly wait for anomaly detection and automated analysis (with the advent of artificial intelligence) to fully remove the human factor from this equation, we can use a few tricks to help our brains quickly identify what’s wrong.
Visualization: Don’t Underestimate the Value of Platform-Human Interaction
Thresholds are required for alert signals to trigger. When it comes to visualization, the people who investigate and detect anomalies need to see these thresholds too. Is this value too low, or unexpectedly high?
In this all-too-common graph, the chart title, axis labels and description were deliberately removed. We lack context, yet our brains can instantly spot the anomaly. Graphs linked from alerts should always include visual indicators, such as threshold lines; they are essential to highlighting trends and unusual patterns, even to the untrained eye.
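The same intuition our eyes apply to a graph can be expressed programmatically. The following is a rough sketch of deviation-based anomaly spotting, flagging points far from the series mean. It is an illustration of the idea, not a production anomaly detector, and the sample data is invented:

```python
from statistics import mean, stdev


def find_anomalies(series, sigma=2.0):
    """Return indices of points more than `sigma` standard deviations
    from the series mean. A crude heuristic for illustration only."""
    mu = mean(series)
    sd = stdev(series)
    return [
        i for i, value in enumerate(series)
        if sd > 0 and abs(value - mu) > sigma * sd
    ]


# Hypothetical latency samples: one obvious spike among steady values.
latencies_ms = [120, 118, 121, 119, 980, 122, 117]
```

A single outlier this large also inflates the mean and standard deviation it is measured against, which is one reason real systems prefer rolling baselines and more robust statistics.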
Active Learning: Avoid Hero Culture; Train Your Team
Who on your team is the de facto first responder and observability subject matter expert who rises to the challenge when things go south? Perhaps it’s you. Ask that person to hold back, despite the growing urge to restore a service’s uptime and save the day. Ask yourself these questions:
- What’s the worst that could happen?
- Would anybody else rise to the occasion?
- Is this a learning opportunity for someone else on the team?
- Is this a teaching opportunity? Could shadowing an experienced team member work in this context?
Let someone else get good at it. It certainly isn’t easy to let go. Adjusting your expectations and giving yourself and your team room for investigation is key to reducing the perceived stress and urgency of a situation. Actively learning by responding to real incidents in real production systems using real data, but in a controlled, stress-free environment, is the ultimate training. While this may seem a little too “trial by fire,” this is why we have Game Days.
Game Days are fire drills. We need to accept that failures and outages will happen. The objective of a Game Day is to reduce stress during an actual incident by practicing our ability to respond in advance. We want to be able to act quickly and confidently during a crisis while building some intuition and reflexes that will come in handy at 4 a.m. Practice makes perfect!
Start by choosing a Game Master and accomplices as necessary. Usually, these are subject matter experts of a domain or system. They’ll need to carefully select which system and scenario will be under test during the Game Day activity. The following scenarios are pretty common:
- Replay previous incident scenarios. This tests whether the incident response process has improved, whether people know which observability signals to pay attention to and understand how to correlate data points. This is also a good opportunity to test whether the systems are more resilient following post-mortem learnings and corrective actions.
- Ensure a new system or service has all the right monitoring, alerts and metrics in place before going live in production. This tests whether you are ready to operate the system and whether people know how to discover observability data and know how to respond to alerts.
- Calibrate overconfidence bias when it comes to security, graceful degradation, highly available systems, etc. This tests whether you actually know the failure modes of the system and whether engineers would have the capability to diagnose unknown problems.
Then ask the Game Master to come up with a set of hypotheses and anticipate the expected takeaways from the exercise. Assess the impact of the exercise on the business (the blast radius) and identify steps that will be taken, if needed, to minimize it (such as limiting the exercise to a time box or aborting it if unexpected things happen).
And let the game begin! Break things deliberately and introduce a bit of chaos. We want people to rely on rational, focused and deliberate cognitive functions when dealing with an incident. Stress and fear will otherwise impair cognitive functions and decision-making.
Observe how human interactions play out in this problem-solving exercise. Is the exercise fostering a collaborative culture? Do team members support each other?
Collaborative Culture: No More Data Hoarding
Fostering a collaborative culture is essential to everyone’s well-being. Sharing data, insights and problems will yield much more engagement, curiosity and trust from team members. Who keeps their observability dashboards hidden from developers? Information should be shared and secrecy should be avoided. These are simple principles, yet few organizations live by this standard when learning from incidents. We should celebrate failure! We need to be transparent in our post-mortems to drive meaningful change. A culture of blame and finger-pointing will only accelerate the vicious cycle of anxiety and mishaps.
Every incident response process should include a post-mortem. In post-mortems, the gathering of information, thoughts, feedback and perceptions is yet again a team activity. Effectively conducting blameless post-mortems will ensure team members have the latitude to propose changes to the process, tools or systems. This activity empowers people to make changes through corrective actions and quality-of-life improvements. Post-mortems should also benefit other members of the organization who had no direct involvement in the triggering incident, as the written record should be shared broadly and serve as learning material.
Being On Call
Engineers have the capability to make sense of the observability data. As everybody on the team actively learns how to respond to incidents in Game Days, it's important to share on-call duty across the entire engineering organization instead of among a select few. This will also help reduce the burden and stress associated with the ever-possible impending doom. No engineer should be left alone when on call. Roles and escalation paths need to be clearly defined and understood. From the first responder (think of a 911 dispatch operator) to the incident commander (a subject-matter expert) and the escalation manager (usually an engineering manager responsible for communications), nobody should be asked to be heroic. They should be asked to coordinate and assemble the team best suited to resolve the situation.
While on call, checklists — call them “runbooks” or whatever else — can also serve as a cognitive aid to offload the thinking process when completing complex instructional tasks. Game Days are the perfect settings to test those checklists.
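A runbook can be as simple as an ordered checklist the responder walks through. This hypothetical sketch shows the shape of such a checklist; the steps themselves are invented examples, not taken from a real runbook:

```python
# Hypothetical runbook for an illustrative "checkout errors" alert.
RUNBOOK = [
    "Confirm the alert on the dashboard (rule out a false positive)",
    "Check recent deployments for the affected service",
    "Inspect error logs for the alert's time window",
    "If a recent deploy correlates, roll it back",
    "Escalate to the incident commander if unresolved after 15 minutes",
]


def next_step(completed: int) -> str:
    """Return the next unchecked step, so the responder never has to
    hold the whole procedure in their head mid-incident."""
    if completed >= len(RUNBOOK):
        return "Runbook complete; write up findings for the post-mortem."
    return RUNBOOK[completed]
```

The value is less in the code than in the discipline: each step is small, verifiable and written down before the incident, which is exactly what a Game Day can validate.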
Because we’ve already made sure to reduce false alarms by eliminating signal noise, and because everybody understands their role in the on-call rotation, alert fatigue should be a thing of the past.
People Are Still at the Center of Distributed Systems
By implementing these strategies, software engineering teams can help ensure they are equipped with the knowledge and skills to use and understand observability signals effectively. Making the most out of the collected data is critical to improving distributed systems’ overall performance and reliability. Teaching and learning will scale the human factor beyond a single individual. While we must still rely on human brains to diagnose and resolve issues, let’s ensure we can do it sustainably.