Tech Debt, Incidents and On Call
I was recently chatting with a cloud ops and platform team leader who was navigating how to manage incident response. Like many organizations, they were trying to adopt a “build and run” approach. This is sometimes called “full-service ownership.” Whatever the term, this approach refers to software development teams taking responsibility to make sure the code they write also runs well in production.
Naturally, I asked if the software development teams were taking on-call rotations to support their code in production. After a deep sigh and an “it’s complicated” type of response, he said something really insightful: Yes, software development teams are often in the escalation path for an incident, but they had moved toward building a site reliability engineering team to take primary on-call duties.
Why? Why not live the values of “build and run” to their fullest? His answer reflected a divide I hadn’t heard articulated before, but it made perfect sense. Although developers know the code best, they aren’t as helpful in an incident. While ops teams want a service restored as quickly as possible, development teams want to root out the underlying issue.
On the surface, these sound similar. But imagine you’re at the checkout line at a grocery store and your credit card is declined. You’ve been living paycheck to paycheck so tightly that your last credit card payment bounced. All you need to restore that credit card and check out is to pay the minimum balance. Call it $25. But the real root issue is that you’re buried in debt, carrying a balance and falling behind. Paying off the balance costs $25,000. It not only frees up your credit card, but eliminates a source of high-interest debt on your personal balance sheet.
Coming up with the $25 is relatively easy. For the ops team, it may mean something like restarting a system. A service is quickly brought back online, and the incident is “resolved.” The underlying source, however, still lingers. It looms as a future incident waiting to happen again. But coming up with $25,000 is a longer, more complicated endeavor. And there’s food to get on the table once you finish checking out from the grocery store.
Both perspectives have a point. But the right thing to do is sort out the $25 problem first, then quickly figure out the $25,000 problem. After all, wouldn’t you find the quicker fix to get through the grocery line and get dinner on the table first? But the challenge is that we rarely find the time to come back to the $25,000 problem. So we face the same problem a week later on our grocery run. It’s exhausting and demoralizing.
How Do We Make Time to Unwind Tech Debt?
First, to define tech debt, I prefer to consider all code as technical debt. Why? Because all code will require servicing or maintenance at some point. At a minimum, security updates for libraries with vulnerabilities are bound to come up. Just look at the Log4j vulnerability of late 2021. It required widespread, urgent maintenance work across many organizations.
Rather than debate whether code is debt or not, the better question is how easily you can service that code. If it’s easier to change and update, you’re in a much better position to unwind that $25,000 problem shortly after an incident. But that still doesn’t answer the question of when you do that work.
The answer to that might, ironically, come back to involving developers in that “build and run” responsibility. In her talk at PagerDuty Summit 2022, Charity Majors suggested using on-call rotation time for tech debt work. It’s already not a good time for developers to be working on feature work. They could be interrupted at any moment by an incident that needs their attention. But it’s a great time to dig into what’s been causing incidents.
Using on-call time to “pay down” tech debt serves a couple of purposes. First, it should reduce recurring incidents that have been resolved without addressing the root cause. Second, as Majors’s talk title, “On Call Doesn’t Have to Suck,” implies, it makes on-call rotations more attractive. As that cloud and platform ops leader observed, developers want to address the root cause. Ultimately, no one wants to be woken up by an issue, especially one they already know exists. That seems like pain that could have been avoided.
Check out the talk by Charity Majors, co-founder and CTO of Honeycomb, for more insights on using on-call rotations to focus on tech debt and other tips to improve the on-call experience.