AIOps Done Right: Make SRE More Proactive by Shifting Left
Many organizations are turning to AIOps in hopes of creating better, more secure software faster. But the ability to create robust and fast software delivery pipelines is constantly hampered by the need to troubleshoot and remediate issues in production environments manually. According to both the Puppet State of DevOps Report and the Dynatrace Autonomous Cloud Survey, that is still the approach 90% of organizations are taking.
At the same time, these surveys also show that organizations expect to grow the frequency of production deployments tenfold over the next 12 months. That goal is almost certainly doomed to fail if 90% of these organizations continue to rely on manual troubleshooting, remediation and root-cause analysis.
Organizations have begun to tap into the potential of AIOps to reduce this manual work and to get faster, more precise, automated insights into the performance and security of their applications, microservices and infrastructure. Not all AIOps solutions, however, are equal. Older “Gen 1” solutions, which try to find patterns across independent, disconnected data sources, are not as efficient or effective as they could, or should, be at helping teams create better software faster.
In this article, and an accompanying article I’ll post later this month, I will describe what it looks like to deploy AIOps “the right way,” so that you derive maximum value from your AIOps solutions, and I will identify where older iterations may have gone awry. To start, I’ll break down why Gen 1 AIOps solutions did not deliver this value and then outline a few examples of how AIOps is done best, beginning with shifting AIOps left to create more “test-driven operations.”
Why Gen 1 AIOps Solutions Fall Short
The first wave of AIOps solutions provided observability by ingesting data, including logs, metrics and traces, and analyzing this data for possible correlations to explain the root cause of technical problems or changed user behavior. At the time, IT teams could count the deployment and configuration challenges affecting production workloads each year, and this correlation-based approach worked well enough for that relatively small number. Because changes were infrequent and predictable, it was easier for ITOps teams to manage maintenance windows and keep downtime and mean time to repair (MTTR) to a minimum.
But that is not the environment digital teams are living in today. Now, production deployments are counted in days, not years. Multicloud environments have grown increasingly dynamic and containerized. Most new application architectures leverage microservices that are deployed as containers in multicluster, multicloud environments, making it even harder to keep track of changes and find root causes.
Teams are moving toward progressive delivery models for deployments (blue/green, canary, feature flags), where instead of replacing entire systems, individual services are upgraded and replaced with new iterations on a piecemeal basis. Environments change too quickly for correlation-based machine learning algorithms to establish a baseline of what’s normal. Also, with potentially millions or billions of dependencies between applications, infrastructure, containers and microservices, it’s harder to correlate logs, metrics and traces and draw reliable conclusions from them. There are simply too many services involved.
As dynamic multicloud environments drive new changes in delivery and operations, AIOps must adapt accordingly for DevOps teams and site reliability engineers (SREs) to maximize the value they, and their organization, can get out of it. In other words, teams need to ensure they’re doing AIOps the right way.
Tighter Integration Between Processes and Platforms
A more dynamic, comprehensive approach to AIOps goes beyond simply updating your AIOps tools. It means integrating AIOps solutions into everything (development processes, testing, DevOps and SRE practices) and embedding them within your internal platforms. Closing the gap between your AIOps solutions and your internal platforms and processes is what enables AIOps to precisely, and automatically, absorb and learn about both intentional and unintentional behavior changes occurring in your CI/CD pipelines.
The more that ITOps teams can leverage AIOps as part of chaos engineering, the more battle-tested and validated those solutions become at anomaly detection. That validation then gives teams confidence in their AIOps solution’s ability to auto-remediate issues in production environments. If it can handle itself in chaotic scenarios, its automated anomaly detection can deliver fast, precise answers — along with the remediation to back them up — in any situation.
Creating More Proactive ‘Test-Driven Operations’
SREs use service-level objectives (SLOs) to validate and track how systems behave in production, under different workloads or conditions, and write auto-remediation scripts to make whatever adjustments are needed to maintain availability and a consistent digital experience. But this is a reactive position, so engineers are often only deploying the auto-remediation code after a user has had a problem and their digital experience has been compromised.
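To make the reactive pattern described above concrete, here is a minimal sketch in Python. It is an illustration only: the `Slo` class, the `evaluate` function and the `remediate` callback are hypothetical names invented for this example, not part of any AIOps product or SRE library. It assumes an SLO expressed as a minimum ratio of good requests and calls a remediation hook only once the objective has already been missed.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Slo:
    """Hypothetical SLO: a minimum fraction of requests that must succeed."""
    name: str
    target: float  # e.g. 0.99 means 99% of requests must be good


def success_ratio(good: int, total: int) -> float:
    """Fraction of good requests; an idle window counts as fully healthy."""
    return 1.0 if total == 0 else good / total


def evaluate(slo: Slo, good: int, total: int,
             remediate: Callable[[Slo], None]) -> bool:
    """Return True if the SLO is met; otherwise call the remediation hook.

    The hook fires only after the objective is breached, which is the
    reactive position: users have already been affected by then.
    """
    met = success_ratio(good, total) >= slo.target
    if not met:
        remediate(slo)
    return met


if __name__ == "__main__":
    availability = Slo(name="availability", target=0.99)
    # 980 good requests out of 1000 is below the 99% target,
    # so the remediation hook is invoked.
    evaluate(availability, good=980, total=1000,
             remediate=lambda s: print(f"remediating breach of {s.name}"))
```

In a real system the counts would come from monitoring data and the hook would run an actual remediation script; both are stubbed out here.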
Shifting AIOps left enables a more proactive approach, where resiliency and auto-remediation scripts are tested before they enter production. One way to do this: engineers can use Keptn, an open-source CNCF project, to orchestrate a pre-production environment, monitored by the AIOps solution, for running load tests, injecting chaos and validating auto-remediation scripts. This is the “shift left” part: By integrating the AIOps solution into this “test-driven operations” environment, you validate the ability of AIOps to trigger auto-remediation scripts in the event of an issue. Rather than engineers having to script and deploy auto-remediation code after a user has experienced an issue, the AIOps tool can trigger the fix immediately, because that fix has already been battle-tested for those scenarios ahead of time.
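The inject, detect, remediate and verify loop behind “test-driven operations” can be sketched as a small Python harness. This is a toy model under stated assumptions, not Keptn’s API or any vendor’s implementation: `StagingService`, `inject_fault`, `restart_service`, `detect_anomaly` and `run_experiment` are all hypothetical names, and the “service” is just an in-memory object whose error rate the test manipulates.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class StagingService:
    """Toy stand-in for a service in a pre-production environment."""
    error_rate: float = 0.0
    events: List[str] = field(default_factory=list)


def inject_fault(svc: StagingService) -> None:
    """Chaos step: push the error rate far above any sensible threshold."""
    svc.error_rate = 0.5
    svc.events.append("fault_injected")


def restart_service(svc: StagingService) -> None:
    """Remediation script under test (hypothetical): restore health."""
    svc.error_rate = 0.0
    svc.events.append("restarted")


def detect_anomaly(svc: StagingService, max_error_rate: float = 0.01) -> bool:
    """Stand-in for the AIOps detector: flag error rates above a threshold."""
    return svc.error_rate > max_error_rate


def run_experiment(svc: StagingService,
                   inject: Callable[[StagingService], None],
                   remediate: Callable[[StagingService], None]) -> bool:
    """Return True only if the fault is detected and the remediation fixes it."""
    inject(svc)
    if not detect_anomaly(svc):
        return False  # the detector missed the injected fault
    remediate(svc)
    return not detect_anomaly(svc)  # healthy again means the script is validated


if __name__ == "__main__":
    svc = StagingService()
    validated = run_experiment(svc, inject_fault, restart_service)
    print("remediation validated:", validated)
    print("events:", svc.events)
```

The point of the design is that `run_experiment` fails in two distinct ways: when the detector misses the fault, and when the remediation does not restore health. Both are things you would want to discover in pre-production rather than during a real incident.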
In my next article, I’ll delve into a couple more examples of how engineers can leverage AIOps the right way, but this use case should begin to show how AIOps, when done right, helps ensure healthy systems in production. Just as test-driven development processes help developers create higher-quality code, test-driven operations will help engineers maintain more stable production systems and more consistent digital experiences for users, in turn driving more value for the organization overall.