Why Traditional Logging and Observability Waste Developer Time
This is part of a series of contributed articles leading up to KubeCon + CloudNativeCon later this month.
The last few years have seen their share of changes in DevOps. The trends that stand out are the rise of containers and microservices, security responsibility spreading to more teams, and the push to automate as much as possible.
You could argue that the common denominator is making everything cloud native: containers epitomize the emphasis on architecture, more things are offered “as a service,” and scale is seemingly automated by moving everything to off-premises (on-demand) servers. But the big “philosophical” shift is to “shift left.”
We saw a recent non-DevOps example of this situation: COVID-19 vaccines. Consider the emergency situation the coronavirus put us in back in March 2020. We needed a vaccine, but the vaccine process usually took years of testing at multiple stages. To produce this vaccine, academic research and preclinical trials were either skipped or reduced. Trial stages overlapped instead of waiting for previous stages to finish. Eventually, that strategy helped to produce a viable vaccine in less than a year.
The idea is the same with software. Planning stages are followed by coding, and then testing, release, deployment and operations. Here, the first three stages overlap and reiterate each other: planning, coding and testing. Once a minimum viable product is reached, the release process kicks in, but the initial process of shifting left continues for updates and UX.
Really though, how “philosophical” is this concept? It’s not abstract, it’s concrete. This is much more about getting straight to the point than it is an overarching philosophy: Make things viable, and do it faster. But if it’s so practical, why haven’t we done it before?
There’s a reason it’s taken so long for “shift left” to become a thing. It wasn’t needed until now.
Cloud computing has been around for a while, but only in recent years has a full shift away from on-premises servers picked up the pace. Because more data and processes could run in the cloud, it could host more complicated applications. As those applications have grown more complicated, their architectures have changed too.
To handle the many mechanisms and services that newer applications used or offered, developers broke those applications down into their own microlevel apps: microservices. Pulling all the components out of a monolith so each one could run more efficiently on its own obviously required a complex architecture to make them work together.
Cloud native DevOps truncated the development cycle rather organically. Past monolith environments made replicating things in testing pretty simple. But with the cloud, there are too many moving parts.
Each cog and gear — an instance, a container, the second deployment of some app — has its own configuration. Add in the exact conditions affecting some individual user experience or availability of some cloud resource, and you have rather irreplicable sets of conditions.
Hence, devs need to anticipate more and more issues before full deployment, especially if they’re spinning out the process to another “as a service” provider (serverless in particular).
If they don’t do this, late-stage troubleshooting will become overwhelming.
Advancing on the Leftward Front
There are different subtypes of shifting left pertinent to testing, security and DevOps overall. In testing, we push testing to earlier stages of development to get quicker feedback. In security, we start to secure a new app way before release — and reiterate that security as the app is built.
You get the picture — shifting left means moving stages of development closer to the start of a project. That has ramifications for whoever leads each stage of the dev process.
Ops used to own the entire monitoring stack. As dev and Ops move closer, dev is not just closer to Ops tools but assuming responsibility for them. This natural progression needs to continue.
What developers actually need is to continue pushing forward on the leftward front: Find new ways to minimize troubleshooting time by absorbing it into the default workflow of every stage of the development process. A lower mean time to recovery (MTTR) is no longer just a talking point for your product. It’s a necessity with so many microservices in play.
This means giving devs necessary access to real-time production data. That makes your entire operation more mobile and dynamic. Your dev teams gain the independence to move through production-level code without having to wait for Ops to grant them that access on a case-by-case basis. That’s why we elevate live debugging at Rookout to the same level of importance as remote debugging or the three pillars of observability.
Example of Developer-First Observability in Action
Production debugging tools can be used separately from traditional monitoring and application performance monitoring (APM) tools, but it is easier to show why observability needs to shift left when you think about these tools working together. The truth is that while traditional APM and monitoring are critical, they provide data that is often more interesting to Ops than to developers.
Whatever production debugging solution you choose should integrate directly with monitoring and APM platforms like Datadog. This will dramatically increase enterprise agility and velocity when it comes to diagnosing and pinpointing the root cause of performance issues. The ability to jump directly from a Datadog alert or anomaly to a specific line of code that caused an error, without restarting, redeploying, or adding more code, is where the magic happens in shift-left observability.
The ultimate goal is to make it easier for businesses to understand their own software and to narrow the gaps between three steps: detecting a code-related problem that affects performance, pinpointing the exact line of code responsible and deploying a fix quickly. Done well, this gives customers a seamless experience without the team having to write more code or redeploy the application just to investigate. In modern production debuggers, this is made possible by setting non-breaking breakpoints that can extract data in real time without stopping the application.
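To make the idea of a non-breaking breakpoint more concrete, here is a minimal Python sketch. It is not how Rookout or any other production debugger is implemented; it only illustrates the concept with the standard library's sys.settrace hook. The function apply_discount, the helper collect_snapshots and the line-offset convention are all hypothetical names invented for this illustration.

```python
import sys


def apply_discount(price, percent):
    discount = price * percent / 100
    total = price - discount
    return total


def collect_snapshots(func, line_offset, call):
    """Run call() and copy func's local variables each time the chosen line
    is about to execute. The program is never paused (hypothetical helper)."""
    code = func.__code__
    # line_offset counts physical lines below the "def" line of func.
    target_line = code.co_firstlineno + line_offset
    snapshots = []

    def local_tracer(frame, event, arg):
        # "line" events fire just before a line runs; copy locals and move on.
        if event == "line" and frame.f_lineno == target_line:
            snapshots.append(dict(frame.f_locals))
        return local_tracer

    def global_tracer(frame, event, arg):
        # Only trace frames that belong to the function of interest.
        if event == "call" and frame.f_code is code:
            return local_tracer
        return None

    previous = sys.gettrace()
    sys.settrace(global_tracer)
    try:
        result = call()
    finally:
        # Always restore whatever tracer was active before.
        sys.settrace(previous)
    return result, snapshots


# Capture the locals right before "return total" (three lines below "def").
result, snapshots = collect_snapshots(
    apply_discount, 3, lambda: apply_discount(200, 15)
)
print(result)
print(snapshots)
```

The function still returns its normal result, and the snapshot list holds the variable values seen at the chosen line. Tracing every line this way adds real overhead, so treat this as a teaching aid for the concept rather than something to run in production.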
Cost-Effective with Money and Time
Productivity can be a product of three different resources: time, money and energy.
It’s hard to quantify how much money you will save by detecting bugs earlier. An oft-cited figure, that a bug costs as much as 100 times more to fix once it is in production, might not be true. That being said, there are plenty of numbers out there about the costs of predeployment troubleshooting and testing versus the costs associated with downtime in production.
To put a number on predeployment troubleshooting, consider the cost of a four-person QA team in the United States: about $300,000 per year, based on ZipRecruiter’s estimate of an average $75,000 salary, plus additional expenses for the various tools needed to do the job.
On the flip side, Gartner has cited the cost of downtime at $5,600 per minute, or more than $300,000 per hour ($336,000 at that rate); the Ponemon Institute cites it at $9,000 per minute (or $540,000/hour).
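The arithmetic behind that comparison can be sketched in a few lines of Python. The dollar figures are the ones quoted above; the break-even framing (how many minutes of downtime cost as much as a year of QA salaries) is an illustrative assumption rather than something the cited sources calculate, and it leaves out tooling expenses.

```python
# Figures quoted in the article.
QA_TEAM_SIZE = 4
QA_AVERAGE_SALARY = 75_000        # USD per year (ZipRecruiter estimate)
GARTNER_COST_PER_MINUTE = 5_600   # USD per minute of downtime
PONEMON_COST_PER_MINUTE = 9_000   # USD per minute of downtime


def hourly_cost(cost_per_minute):
    """Scale a per-minute downtime cost to a per-hour cost."""
    return cost_per_minute * 60


def break_even_minutes(annual_cost, cost_per_minute):
    """Minutes of downtime that cost as much as annual_cost (illustrative)."""
    return annual_cost / cost_per_minute


qa_annual_cost = QA_TEAM_SIZE * QA_AVERAGE_SALARY
gartner_hourly = hourly_cost(GARTNER_COST_PER_MINUTE)
ponemon_hourly = hourly_cost(PONEMON_COST_PER_MINUTE)

print(f"QA team, salaries only: ${qa_annual_cost:,} per year")
print(f"Downtime (Gartner figure): ${gartner_hourly:,} per hour")
print(f"Downtime (Ponemon figure): ${ponemon_hourly:,} per hour")
print(
    "Break-even downtime: "
    f"{break_even_minutes(qa_annual_cost, GARTNER_COST_PER_MINUTE):.1f} min (Gartner), "
    f"{break_even_minutes(qa_annual_cost, PONEMON_COST_PER_MINUTE):.1f} min (Ponemon)"
)
```

On these numbers, less than an hour of downtime at either rate costs about as much as the team's salaries for a full year.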
Consider the cost of shutting down a fully running operation. It means refocusing teams’ energies elsewhere. Many more people will get directly involved in identifying and then remedying the problem. Even non-devs lose productivity while they wait for engineers to save the day. Sales reps can’t engage customers; business-to-business sales reps might have to refocus from offense (upselling clients) to defense and re-earning trust (avoiding churn) after service disruptions.
Conclusion: Climbing the Next Hill
Advancing over that next ridge in your leftward push means getting the right supplies to your troops more quickly. It’s a myth to say that there aren’t any tools for left-shifted observability, but there should be more of them. It’s a myth to say teams like QA are dying out; the nature of their work is changing, and they need the right tools to get the job done.
We see our job as making a tool that can fill in those gaps. We have to look at it from two perspectives, both as a service provider and as DevOps practitioners. We expect this approach to become more popular among customers, so we are preparing accordingly. The same should be said of other service providers. We are pushing our own internal debugging processes to the left; it’s hard to see it not being the same for other tech organizations.
To hear more about cloud native topics, join the Cloud Native Computing Foundation and the cloud native community at KubeCon + CloudNativeCon North America 2022 in Detroit (and virtual) from October 24-28.