
Debugging for Reduced DevOps Disruptions

12 Mar 2021 7:39am, by Sarjeel Yusuf
Sarjeel is a Product Manager at Atlassian, responsible for orienting Atlassian tools to facilitate DevOps capabilities in their feature sets.

With the rise of DevOps as a way to maintain stability while increasing velocity, folding incident management into the DevOps pipeline became critical. This is because the availability of your services depends on how quickly disruptions experienced by the customer are resolved.

However, centering everything on high availability risks disturbing the flow of the DevOps process. As unplanned incidents occur, more resources are spent on fixing them and fewer on planned development, which has a ripple effect on the entire DevOps pipeline. This is exacerbated by the fact that the process of going from ideation to release is now consolidated, as silos are broken down.

This article aims to show that effective incident management alone is not enough when practicing the DevOps mindset, and that debugging is the next area of improvement we should consider.

Incidents Are Disruptive

Werner Vogels, Amazon vice president and chief technology officer, famously once said, “Everything fails all the time.” Accepting this reality, it is easy to see why incident management becomes an integral component of a team’s DevOps practice. However, no matter how well the incident management strategy is set up, it still has consequential effects on the overall development pipeline.

Maintaining high availability is a prime goal. The goal is to reduce the time customers encounter disruptions. However, here is where an intrinsic problem lies. By considering availability from the customer’s perspective, it is easy to miss the ripple effect of incidents on the entire flow of going from ideation to production — the velocity aspect of DevOps.

In the rush to restore availability, shortcuts may be taken to resume normal functioning by whatever means possible. The actual repairs can therefore continue well beyond the point at which the system's normal functionality is restored. That means responders, who per DevOps principles are members of development teams, are caught up in repairing the disruptions. As a result, tickets are created, the backlog increases, deployment queues may be held back, feature flags may be toggled, and important releases may be held up.

If development teams are spending all their time putting out fires, when will they have time to implement new features? Consequently, entire roadmaps may be delayed as resources are held up in repairing disrupted services and system components.

The more incidents we incur after releasing to production, the higher the risk to the team's velocity. The entire development cycle is susceptible.

As Vogels implied, incidents will happen. But he did not state how many. This is where some hope of salvaging the development pipeline remains. The answer lies in the debugging strategies of the team, because the more bugs, errors and faults that are caught in the development stage, the fewer incidents we can expect afterward. The need for a “shift-left” therefore becomes apparent. Improved debugging makes for exceptional DevOps.

By detecting faults and potential disruptions earlier in the development stages of the DevOps pipeline, we can mitigate incidents in the later parts of the DevOps pipeline — more specifically when in production. This necessitates the strengthening of debugging while developing.

However, improving debugging practices is easier said than done, because the environment and ecosystem in which software is produced keep changing. Development is increasingly moving to the cloud, and with this movement, various cloud initiatives associated with DevOps practices are being taken up.

This entails rethinking traditional debugging practices to better fit cloud development. These debugging strategies, tailored to cloud development, are as follows:

Leverage Observability

One of the major issues is that developers are sometimes dealing with black-box environments, where the opacity of the black box depends on the paradigm they opt for. As more of the underlying infrastructure responsibility is abstracted away to cloud vendors, it becomes difficult to know what is actually going on under the hood. As a result, it becomes challenging to identify the root cause of disruptions. This is where the term observability comes into play.

In its current form, observability refers to three core “pillars” which, when orchestrated successfully, provide the necessary insights into the running of cloud applications. These three pillars are:

  • Logs: a record of discrete events.
  • Metrics: statistical numerical data collected and processed within time intervals.
  • Traces: a series of events mapping the path of logic taken.

These three forms of insight provide an understanding of the actual state compared to the intended state after deployment. This covers all facets of the system, including the intended UI, intended configurations, intended architecture, intended behavior, intended resources, and more. Therefore, it is crucial that these three pillars be referred to constantly when developing applications, and that these insights be monitored in the development environment before releasing to production.
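To make the three pillars concrete, here is a minimal sketch that emits all of them for a single request using only the Python standard library. The service name `orders`, the function `handle_order`, and the step names are hypothetical, chosen purely for illustration; a real setup would ship this data to an observability backend rather than hold it in memory.

```python
import logging
import time
import uuid
from collections import defaultdict
from contextlib import contextmanager

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
log = logging.getLogger("orders")      # hypothetical service name

metrics = defaultdict(list)            # metric name -> list of samples
spans = []                             # finished trace spans, in completion order


@contextmanager
def span(name, trace_id, parent=None):
    """Record one step of a request as a trace span plus a latency metric."""
    span_id = uuid.uuid4().hex[:8]
    start = time.perf_counter()
    try:
        yield span_id
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000
        # Trace: which step ran, under which parent, within which request.
        spans.append({"trace": trace_id, "span": span_id,
                      "parent": parent, "name": name, "ms": elapsed_ms})
        # Metric: a numerical sample that can be aggregated over time.
        metrics[f"{name}.latency_ms"].append(elapsed_ms)


def handle_order(order_id):
    """Hypothetical request handler with two downstream steps."""
    trace_id = uuid.uuid4().hex
    with span("handle_order", trace_id) as root:
        # Log: a record of one discrete event.
        log.info("order received id=%s trace=%s", order_id, trace_id)
        with span("reserve_stock", trace_id, parent=root):
            time.sleep(0.01)           # stand-in for a call to another service
        with span("charge_card", trace_id, parent=root):
            time.sleep(0.01)           # stand-in for a call to another service
    return trace_id


trace_id = handle_order(42)
path = [s["name"] for s in spans if s["trace"] == trace_id]
```

After the call, `path` lists the steps in the order they finished, `metrics` holds one latency sample per step, and the log line carries the trace ID so the three signals can be correlated. Running the same instrumentation in the development environment is what lets a gap between actual and intended behavior surface before release.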

Fall Back on Traditional TDD

If the objective of shift-left is to capture all possible disruptions in development, then traditional Test-Driven Development (TDD) can prove beneficial by ensuring that test cases are thought through during the development process. However, developing for the cloud is very different from developing on-prem or local applications, where all dependencies, such as dependent services and data stores, are available locally.

The contrast becomes clearer when thinking of microservices and distributed systems, where local development environments have no access to the services or resources the code depends on. As a result, writing tests for cases that depend on these unavailable resources becomes difficult. We must therefore consider various techniques that mitigate this pain point. Some of these techniques are listed below:

  • Mocking resources: This involves replicating layers of services that would otherwise be available in production. This can be done in a JUnit test, typically with a mocking library such as Mockito, where the expected results from lower-level interfaces are defined as part of the mock. This, however, can lead to biased tests, as it is simply presumed that the lower-level resources will return the result that has been defined in the mock.
  • Relying on inbuilt libraries: Some tools in your cloud stack ship their own testing libraries. Hadoop, for example, offers the MRUnit library, which can be used to test Hadoop MapReduce jobs. This extends the mocking approach above, but the test support comes from the tool itself, which mitigates the first method's risk of tests that merely reaffirm the results their author defined.
  • Leveraging local server plugins: Some libraries can start an embedded server of a tool inside your testing environment. For example, many such libraries and plugins are available through Maven, like the one for Redis published under the Ozimov group.
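The first technique, mocking, can be sketched as follows. The list above names Java tooling; this sketch swaps in Python's standard `unittest.mock` to show the same idea. `InventoryClient`, `can_fulfil`, and the SKU value are hypothetical names invented for the example, and the mocked stock level of 5 is exactly the kind of presumed response the first bullet warns about.

```python
from unittest.mock import Mock


class InventoryClient:
    """Hypothetical client for a remote inventory service (not reachable locally)."""

    def stock_level(self, sku):
        raise ConnectionError("inventory service is not available in local dev")


def can_fulfil(client, sku, quantity):
    """Business logic under test: it only needs the client's answer."""
    return client.stock_level(sku) >= quantity


# spec= restricts the mock to attributes that really exist on InventoryClient,
# so a typo such as client.stock_levels(...) raises instead of silently passing.
client = Mock(spec=InventoryClient)
client.stock_level.return_value = 5    # presumed response, chosen by the test author

in_stock = can_fulfil(client, "sku-123", 3)
out_of_stock = can_fulfil(client, "sku-123", 8)
client.stock_level.assert_called_with("sku-123")

# Rehearse a failure path as well, so the test covers more than the happy case.
client.stock_level.side_effect = TimeoutError("inventory timed out")
try:
    can_fulfil(client, "sku-123", 1)
    timed_out = False
except TimeoutError:
    timed_out = True
```

The mock proves only that `can_fulfil` behaves correctly given the responses the test author assumed. Whether the real service ever returns those responses is untested, which is why the tool-provided libraries and embedded servers in the other two bullets are worth considering alongside mocks.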

Rely on Third-Party Cloud Support Tools

As cloud development increases in popularity, so does the ecosystem around supporting cloud developers. Cloud vendors can’t provide the full solution, and that is where third-party SaaS products come in to fill the gaps.

The industry is growing weary of the burden of debugging cloud applications. Tools such as Thundra are responding to these market indicators by providing cloud debugging and observability solutions in a bid to improve and accelerate the cloud developer's experience.

For example, Thundra recently released a debugging capability, Thundra Sidekick, which leverages the concept of observability. It does so using non-intrusive debugging strategies such as non-breaking breakpoints and IDE integration support. Overall, the feature lets you test your cloud apps in both pre-production and production environments by surfacing the required insights without posing a risk to your actual codebase. It is an effective tool because it promotes the much-needed shift-left culture.
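The general idea behind a non-breaking breakpoint is to capture a snapshot of program state at a chosen point without pausing execution. The sketch below illustrates that idea with Python's standard `sys.settrace` hook. It is not Thundra's implementation or API, only an assumption-laden illustration of the concept; `apply_discount` and `snapshot_on_return` are hypothetical names.

```python
import sys

snapshots = []   # captured variable states, one entry per hit


def snapshot_on_return(func_name):
    """Build a trace function that copies a function's locals when it returns."""
    def tracer(frame, event, arg):
        if frame.f_code.co_name != func_name:
            return None                # do not trace inside other functions
        if event == "return":
            # Copy the state instead of pausing: the program keeps running.
            snapshots.append({"locals": dict(frame.f_locals), "returned": arg})
        return tracer                  # keep receiving events for this frame
    return tracer


def apply_discount(price, percent):   # hypothetical function under investigation
    discount = price * percent / 100
    total = price - discount
    return total


sys.settrace(snapshot_on_return("apply_discount"))
try:
    result = apply_discount(200, 15)   # runs to completion; nothing pauses
finally:
    sys.settrace(None)                 # always remove the hook afterward
```

Once the call finishes, `snapshots[0]` holds the arguments and intermediate values of `apply_discount` together with its return value, while the caller received its result as usual. Tracing hooks like this add overhead, so a production-grade tool would need a far more selective mechanism than this sketch.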


Much of the DevOps era has been focused on automating CI/CD practices to achieve greater velocity, while improving incident management capabilities to increase stability. However, the limitations of that focus are now becoming apparent, and we must look at other areas of our software development pipeline. We therefore need to rethink our debugging strategy to finally break through the glass ceiling and reap the benefits promised by DevOps in the cloud.

Feature image via Pixabay.
