It’s OK to Ignore Some Tech Debt — Here’s Why
Everyone knows they should pay attention to technical debt. Yet, for some inexplicable reason, we all tend to ignore it until it’s too late. This is the core of why reliability engineering is hard.
Some of the magic of Google’s approach to software reliability was recently revealed. As a former product manager at Google — for Container Engine, Container Registry, Cloud Launcher and the Kubernetes Project — I still marvel at how the company balances speed with scale and maintainability.
Many engineering teams today are lured into trying to understand Google’s approach. If the world’s most highly trafficked cloud services operate so smoothly in the midst of continuous new feature introduction, what can we extrapolate from their methods? But for many, that line of inquiry leads to resignation (“Well, we’re nothing like Google”) and a sense of being overwhelmed by the question of where to actually start.
Before you get into the how of SRE and software reliability more broadly, I think it’s important to understand the why. It makes the whole movement a lot easier to understand.
Why? A Mountain of Technical Debt
While few companies share the characteristics of Google’s billions of users and vast engineering resources, one thing we all have in common is technical debt.
The gap between perfect software and what you ship is technical debt. You know, the issues that get jammed into Jira backlogs as engineering cycles continuously favor feature development. Technical debt isn’t just bugs. It includes performance issues that need caching strategies, upgrades to the latest versions of programming languages or frameworks already in use, security patches, documentation and SSL certs that need to be updated. You name it. Everything that has to do with software reliability.
Every company has to balance new feature delivery against this backlog of technical debt. But let’s be honest: At most companies it isn’t a balance at all. Until there is a disaster, the scale tilts toward features, and tech debt items are just things that get jammed into a milestone with the intent to revisit them “someday when things slow down.”
Don’t get me wrong, engineering teams regularly raise technical debt issues to the business side. But I’ve also seen many teams exaggerate the risk of tech debt, or complain about risks that never materialize. These claims hurt credibility with business stakeholders, who push for faster delivery of features. They could be convinced, they tell you, to focus on reliability, but the service seems to be working just fine. Conversations about risks and priorities are hard because reliability risk is obvious only in hindsight.
So the business side keeps cracking the whip on feature creation while also expecting perfection on uptime and software reliability. Engineers lose precious weekend and evening family time to on-call cycles. And often when systems go down, the suffering can be traced back to perfectly avoidable issues that were in the tech debt backlog all along, but never prioritized!
Tech debt is the money pit of software, and no company will ever be able to stop digging.
A Simple Analogy: Managing Your Email Inbox
The email inbox provides an excellent analogy for status quo approaches to managing tech debt. Think of unread emails as debt. Some of it matters; some is just junk.
The ‘Zero Inbox’ Approach
On one side, you have the “Zero Inbox” approach. You deal with every email that comes in, in real time. This user has wholly surrendered to the noise and given up any sense of prioritization. It’s tough to be a productive human being if you are constantly focused on your inbox and treat every notification as equally important.
In terms of technical debt, this is the team that flies into a frenzy on every blip and beep from the APM and logging tools. This team is so defensive against outages that it builds what is jokingly referred to as “gold-plated reliability”: it fights so hard for 100 percent uptime (an impossible goal) that there are no cycles left for creating features.
The ‘Thousands Unread’ Approach
The extreme opposite of Zero Inbox is the email user who has thousands of unread emails. Back when we all worked in an office, you might have visited their cube to collaborate on something and, as they showed you something on their computer, noticed they had 40,000 unread emails. Whoa! These email users have risen so high above the noise that they also miss all the signals.
A team managing technical debt like this has a giant backlog of issues that has exceeded its cognitive limit. The backlog is so large that the team will most likely discover what matters only during outages.
What’s Missing: Context for Technical Debt
The context that separates “good” technical debt (which lets us ship) from “bad” technical debt (which needs attention) comes down to understanding the actual impact on customers and the team.
Teams that manage technical debt well set “budgets” to establish triggers for refocusing the team on improving reliability and putting features on hold. Since tech debt is an inevitable byproduct of shipping imperfect code (note: all code is flawed), you need a coping mechanism that gives tech debt better visibility, context and urgency.
Technical debt budgets based on customer and on-call engineering impact give clear context for why the business should care. Shared context separates disconnected teams that merely negotiate these tradeoffs from aligned teams that use data and goals to drive behavior.
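To make the idea of a budget concrete, here is a minimal sketch in Python of one way such a trigger could work. It is an illustration under stated assumptions, not a method prescribed in this article: the 99.9 percent success target, the limit of five pages per week, and every name in the code (`DebtBudget`, `budget_remaining`, `should_pause_features`) are hypothetical. The customer-impact side is modeled as a share of failed requests, and the on-call side as a count of pages.

```python
from dataclasses import dataclass


@dataclass
class DebtBudget:
    """Hypothetical budget. Both thresholds are illustrative assumptions."""
    slo_target: float = 0.999    # assumed share of requests that must succeed
    max_pages_per_week: int = 5  # assumed ceiling on on-call pages per week


def budget_remaining(total_requests: int, failed_requests: int,
                     slo_target: float) -> float:
    """Return the unspent fraction of the budget (negative when overspent)."""
    allowed_failures = total_requests * (1 - slo_target)
    if allowed_failures <= 0:
        # No traffic, or a 100 percent target: any failure overspends.
        return 1.0 if failed_requests == 0 else float("-inf")
    return 1 - failed_requests / allowed_failures


def should_pause_features(total_requests: int, failed_requests: int,
                          pages_this_week: int, budget: DebtBudget) -> bool:
    """Signal a reliability focus when either kind of impact exceeds budget."""
    remaining = budget_remaining(total_requests, failed_requests,
                                 budget.slo_target)
    customer_budget_spent = remaining <= 0
    on_call_budget_spent = pages_this_week > budget.max_pages_per_week
    return customer_budget_spent or on_call_budget_spent


if __name__ == "__main__":
    budget = DebtBudget()
    # 1,000,000 requests at a 99.9 percent target allow about 1,000 failures.
    print(should_pause_features(1_000_000, 400, 2, budget))
    print(should_pause_features(1_000_000, 1_200, 2, budget))
```

The point of a sketch like this is that the decision to pause feature work comes from numbers both sides agreed on in advance, rather than from a negotiation in the middle of an incident. Real thresholds would have to come from your own customers and on-call load.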
As email users, most of us fall somewhere in between the Zero Inbox and Thousands Unread approaches. We’re not going to surrender all our brain power to responding to every little thing that rolls in, in real time, nor are we going to neglect the firehose completely. We use filters, we use folders, we use prioritization tags, we set rules, we create automation. And we click unsubscribe! Because our cognitive abilities only allow us to reason about so much, we use a combination of tools and a general strategy for focusing on what matters. We try not to fritter away our attention beyond that point.
Ignoring some technical debt isn’t so bad after all, so long as you have shared context to guide your decisions.