DevOps as a Graph for Real-Time Troubleshooting
Every DevOps engineer naturally keeps a knowledge graph of all the infrastructure and interconnected services, but only in their heads. This graph is fragmented, takes time to learn and document, and demands mental effort to retain. Making these connections manually requires familiarity with the system and substantial operational experience.
But today’s complex and dynamic microservice architectures make it impossible to keep up with all the software and infrastructure changes around us. This cognitive load is compounded by siloed monitoring and observability tools that increase context switching, which negatively impacts our ability to fix issues quickly and avoid downtime.
There are hidden relationships in all the data that can only be expressed through an engineer’s knowledge graph. When we pull out the relationships from the data to form a real-time, dynamic graph, the effects and related causes of production issues are more obvious. We want to move from observing individual data points and then slowly connecting them in our heads to observing all the data points and connections in the same context. In doing so, our process of identifying and resolving issues is faster and more accurate.
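To make this concrete, here is a minimal sketch of what such a dynamic graph could look like in practice. The service names and call edges are hypothetical, standing in for dependencies discovered from telemetry; the idea is that once the graph exists as data, a question like "what is affected if this component fails?" becomes a simple traversal instead of a mental exercise.

```python
from collections import defaultdict, deque

# Hypothetical caller -> callee edges, as might be discovered from traces.
observed_calls = [
    ("frontend", "checkout"),
    ("checkout", "payments"),
    ("checkout", "inventory"),
    ("payments", "postgres"),
    ("inventory", "postgres"),
]

# Build the graph and its reverse so we can walk in both directions.
deps = defaultdict(set)        # service -> services it depends on
dependents = defaultdict(set)  # service -> services that depend on it
for caller, callee in observed_calls:
    deps[caller].add(callee)
    dependents[callee].add(caller)

def blast_radius(failing_service):
    """All services that transitively depend on the failing one."""
    seen, queue = set(), deque([failing_service])
    while queue:
        svc = queue.popleft()
        for upstream in dependents[svc]:
            if upstream not in seen:
                seen.add(upstream)
                queue.append(upstream)
    return seen

# A database incident implicates every service above it in the graph.
print(sorted(blast_radius("postgres")))
# ['checkout', 'frontend', 'inventory', 'payments']
```

The same structure answers the reverse question (what does this service depend on?) by walking `deps` instead of `dependents`.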
How will it impact the way we understand production issues?
In modern IT environments, we’re constantly making changes to improve application performance and harden systems. These changes often introduce pathways to failures. Having a DevOps graph that links all the infrastructure and microservices together allows teams to see hidden relationships. When we bring this graph to life through visualizations in our observability tool, operations folks and SREs can find the cause of production issues quickly.
A common challenge for SREs during incident response is translating customer issues or performance issues into specific services and teams. With powerful visualizations of the DevOps graph, this first-level triage becomes easier, allowing SREs to:
- Associate operational data using a dynamic topology that shows relationships between the components in your IT environment.
- See the unintended consequences of new code or configuration changes with a real-time event timeline that is searchable and filterable by user or script.
- Provide developers with critical event timeline information and topology to actively troubleshoot and debug their code.
- Hold teams accountable within incidents and focus on information sharing with a system of record for operational activity and change management.
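As a rough illustration of the searchable, filterable event timeline from the list above, consider change events as structured records. The event schema and sample data here are assumptions for the sketch, not a specific product's format:

```python
from dataclasses import dataclass
from datetime import datetime

# Hypothetical change-event record for a searchable timeline.
@dataclass
class ChangeEvent:
    at: datetime
    kind: str      # e.g. "deploy", "config", "script"
    service: str
    actor: str     # user or automation that made the change

events = [
    ChangeEvent(datetime(2024, 5, 1, 9, 0), "deploy", "checkout", "alice"),
    ChangeEvent(datetime(2024, 5, 1, 9, 5), "config", "payments", "terraform-ci"),
    ChangeEvent(datetime(2024, 5, 1, 9, 20), "script", "inventory", "bob"),
]

def filter_events(events, actor=None, kind=None):
    """Narrow the timeline by user or change type."""
    return [e for e in events
            if (actor is None or e.actor == actor)
            and (kind is None or e.kind == kind)]

print([e.service for e in filter_events(events, actor="alice")])
# ['checkout']
```

During triage, a query like "all config changes in the last hour" scopes the investigation before anyone opens a log file.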
How will it impact the way we troubleshoot issues?
Start asking the right first question: What caused the change? Rather than asking what happened (metrics), why (logs) or where (distributed traces), asking what caused the change helps us troubleshoot complex failures holistically. For example, a typical incident workflow looks something like this:
- The on-call engineer gets an alert and declares an incident.
- They check a few metric dashboards to understand unusual behavior.
- They might escalate the incident to the infrastructure, cloud networking and security teams to eliminate those areas as root causes.
- These teams dive into various logs and traces to get more detail.
- Charts and data snippets are sent back and forth between team members over Slack.
- Teams go back to metric charts and logs a few times before getting to the cause(s) of the issue.
What’s wrong with this flow? Several things:
- Mitigating and resolving an issue through this workflow can take hours or days.
- Teams digging into logs right away without context waste time and resources.
- Teams are working with data snippets over Slack, which isn’t efficient or sustainable.
Before pulling members from other teams into an incident response, we need to provide context. What circumstances (changes) formed the setting for the incident? What components are involved, and what are the dependencies? What are the useful logs or traces that service owners can use to troubleshoot effectively? By capturing the context of an incident in a dashboard with actionable data at the moment it occurs, knowledge sharing gets easier. That leads to a better troubleshooting experience across the organization.
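A minimal sketch of what "capturing the context at the moment it occurs" might mean in code. The inputs (`change_log`, `topology`) and the 30-minute lookback window are illustrative assumptions; a real system would pull these from a change-management record and a service-topology store:

```python
from datetime import datetime, timedelta

def capture_context(alert, change_log, topology, now):
    """Snapshot the recent changes and involved components around an alert."""
    window_start = now - timedelta(minutes=30)  # assumed lookback window
    recent = [c["summary"] for c in change_log if c["at"] >= window_start]
    involved = sorted({alert["service"], *topology.get(alert["service"], ())})
    return {
        "alert": alert["name"],
        "declared_at": now.isoformat(),
        "changes_last_30m": recent,
        "involved_components": involved,
    }

now = datetime(2024, 5, 1, 9, 30)
context = capture_context(
    alert={"name": "checkout-latency-high", "service": "checkout"},
    change_log=[
        {"at": datetime(2024, 5, 1, 9, 10), "summary": "payments config change"},
        {"at": datetime(2024, 5, 1, 7, 0), "summary": "inventory deploy"},
    ],
    topology={"checkout": {"payments", "inventory"}},
    now=now,
)
print(context["changes_last_30m"])
# ['payments config change']
```

A record like this, attached to the incident from the start, answers the "what changed, where, and who is involved?" questions before other teams are paged.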
How will it impact the way we practice observability?
Delivering reliable services and maintaining uptime in containerized environments is a tall order. When an incident or major outage occurs, what matters most is how fast we can recover from it. The growth of observability data can either improve or hinder observability. A major challenge for organizations is the ability to use their data to improve mean time to recovery (MTTR). It's time to rethink our approach to troubleshooting and finally connect cause and effect. When we approach DevOps as a graph, we move beyond the traditional pillars of observability and start tackling incidents with a new mindset.
Data Connection: A lot of observability data is interrelated, but our current tools don't allow us to view metrics, logs and distributed traces as connected sources of information. These data types are often collected in silos, and correlation of the data is done manually. For example, to know whether a spike in a metric on one service has something to do with a spike on another service, we often search through the metric charts by hand to find the correlation.
You need a solution that connects operational data from the start. A visualized graph of the application and the infrastructure it runs on, one that models cause and effect through incidents in real time, removes the mental burden of tracing hidden connections in our heads. This solution also needs to address another key gap in our current tools: missing change data.
Change Impact: Digital Enterprise Journal’s recent State of IT Performance Report found that change is the largest source of production issues. 75% of all performance problems can eventually be traced back to changes in the environment. When simple configuration errors can cause a domino effect, there’s a broader lesson to be learned. If you’re not capturing code or configuration changes as part of your observability strategy, it’s time to close that gap.
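Once changes are captured as events alongside metrics, the first triage question ("what caused the change?") becomes a lookup rather than an investigation. A minimal sketch, with illustrative timestamps and change descriptions:

```python
from datetime import datetime

# Hypothetical captured change events (timestamp, description).
changes = [
    (datetime(2024, 5, 1, 8, 55), "feature-flag flip on checkout"),
    (datetime(2024, 5, 1, 9, 12), "payments config rollout"),
]

def last_change_before(ts, changes):
    """Most recent captured change preceding a metric anomaly."""
    prior = [c for c in changes if c[0] <= ts]
    return max(prior, default=None, key=lambda c: c[0])

# A latency spike at 9:15 points straight at the 9:12 config rollout.
spike_at = datetime(2024, 5, 1, 9, 15)
print(last_change_before(spike_at, changes)[1])
# payments config rollout
```

This is exactly the kind of cause-and-effect link that stays invisible when change data isn't part of the observability strategy.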
Knowledge sharing: Documenting and sharing knowledge consistently across teams is a painful process for many organizations. The knowledge gap between senior engineers and less experienced engineers grows over time and makes real-time troubleshooting more challenging. Less experienced engineers often don't know what data is relevant to a specific incident and struggle to see the links between cause and effect.
This is an organizational problem, as it affects development, operations and management. Most companies that deploy daily report that their engineers spend at least half of their time on troubleshooting and debugging. This "troubleshooting tax" will only get higher if we don't provide engineers with tools that learn connections, model cause and effect, and allow sharing of "graphs" across teams.
In a constantly changing DevOps environment, no amount of testing or automation will prevent bugs from getting into production environments. And no amount of chaos engineering can anticipate every possible failure. Finally, we must realize that outages rarely happen because of a single failure. It’s often a combination of “pre-incidents” (e.g., config change, code commit, script execution) happening at the same time. No matter how much you shift left, there will always be something that goes wrong when you shift right — it’s how you manage the changes and connect the cause and effect that will drive better outcomes.