
Site Reliability Engineering and AI

AI could recommend standards and priorities more aligned with business objectives than any team of humans could, using a fraction of the time and effort.
Nov 1st, 2023 10:00am by
Image by Steve Buissinne from Pixabay.

When thinking about Site Reliability Engineering (SRE) and general concepts of keeping software reliable, it’s easy to see how AI can have a major role. Large language models like ChatGPT can already be leveraged in many incident management steps. Looking further into the future, more bespoke AI solutions could revolutionize how monitoring tools and service level objectives are used and interpreted. Ultimately, the resilience of your practitioners is the heart of reliability, not any algorithm, but we’ll explore how future AIs could enhance that core strength too.

Let’s explore these three future use cases while diving into the risks and challenges that you’ll face in implementing them.

Incident Response Enhanced by Large Language Models

Among the most powerful AI tools available today are large language models (LLMs), such as ChatGPT. LLMs are trained on huge volumes of human-written text and use what they learn to generate plausible responses to a given prompt. In the fast-paced, high-stress environment of responding to an incident, LLMs can be used to remove sources of toil and confusion.

For example, let’s say a service that’s broken has generated an error log. Sounds great, but maybe the log is thousands of lines long, and you aren’t really sure what you’re looking for. You really don’t have the time to manually scan through the document while there are other fires to put out. By submitting the data to an LLM and merely asking it to highlight and summarize “anything abnormal”, you’ll get results in seconds. It might not give you the whole story, but it’ll give you a jumping-off point.
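As a rough sketch of that workflow, the Python below splits a long log into chunks small enough to fit in a prompt and asks a model to flag anything abnormal in each one. It is illustrative only: `ask_llm` is a hypothetical callable standing in for whichever LLM client you use, and the prompt wording and the 4,000-character budget are assumptions, not recommendations.

```python
from typing import Callable, List

# Assumed prompt wording; adjust to your own incident process.
PROMPT = (
    "You are helping an incident responder. Highlight and summarize "
    "anything abnormal in this log excerpt:\n\n{chunk}"
)


def chunk_lines(log_text: str, max_chars: int = 4000) -> List[str]:
    """Group whole log lines into chunks of roughly max_chars characters.

    A single line longer than max_chars still becomes its own chunk.
    """
    chunks: List[str] = []
    current: List[str] = []
    size = 0
    for line in log_text.splitlines():
        # Start a new chunk when adding this line would exceed the budget.
        if current and size + len(line) + 1 > max_chars:
            chunks.append("\n".join(current))
            current, size = [], 0
        current.append(line)
        size += len(line) + 1
    if current:
        chunks.append("\n".join(current))
    return chunks


def summarize_log(
    log_text: str,
    ask_llm: Callable[[str], str],  # hypothetical: prompt in, model text out
    max_chars: int = 4000,
) -> List[str]:
    """Return one model response per chunk of the log."""
    return [
        ask_llm(PROMPT.format(chunk=chunk))
        for chunk in chunk_lines(log_text, max_chars)
    ]
```

Treat the returned summaries as leads to verify against the raw log, not as findings in their own right.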

Here are a few other quick examples of what LLMs can do to reduce cognitive load on responders during an incident:

● Summarize a long Slack conversation to show what was attempted and what results were achieved to help new responders get up to speed.
● Parse your codebase to answer natural language requests like “point me to all the lines of code that deal with the user logging in.”
● Quickly create ad hoc scripts to assist with testing error causes, like “make a script that submits to Form A every permutation of these options.”
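To make the last item concrete, here is a minimal sketch of the kind of throwaway script an LLM might produce for that request. It reads “every permutation” as every combination of the option values (a Cartesian product). The option names and the `submit` callable are hypothetical; in practice `submit` would wrap whatever HTTP call actually posts to your form.

```python
from itertools import product
from typing import Callable, Dict, Iterator, List


def option_combinations(options: Dict[str, List[str]]) -> Iterator[Dict[str, str]]:
    """Yield one payload per combination of option values."""
    names = list(options)
    for values in product(*(options[name] for name in names)):
        yield dict(zip(names, values))


def find_failing_payloads(
    options: Dict[str, List[str]],
    submit: Callable[[Dict[str, str]], bool],  # hypothetical: True means success
) -> List[Dict[str, str]]:
    """Submit every combination and return the payloads that failed."""
    return [p for p in option_combinations(options) if not submit(p)]


if __name__ == "__main__":
    # Made-up options and a fake submit function, for illustration only.
    form_options = {"plan": ["free", "pro"], "region": ["us", "eu", "ap"]}

    def fake_submit(payload: Dict[str, str]) -> bool:
        return not (payload["plan"] == "pro" and payload["region"] == "ap")

    print(find_failing_payloads(form_options, fake_submit))
```

The number of combinations grows multiplicatively with each option, so it is worth checking the count before pointing a script like this at a production form.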

Risks of LLMs and Incident Response

Using LLMs to accelerate sensitive processes like incident response will always come with some risks. LLMs are prone to hallucination, generating data that looks sensible but isn’t founded in fact. The more you depend on them, the more insidious this false data can become. But if you take the time to thoroughly parse everything the AI does, you lose out on the benefits of speed and ease.

The key is to balance your investment in AI acceleration with an equal investment in your incident management process. Having a robust process layered on top of everything you do will minimize potential damage. Incident-management platforms provide a great foundation for experimentation with AI tools.

Monitoring and Service Level Objectives Enhanced by an AI Perspective

Current AI models leverage huge human-generated databases like written text, images and audio to generate novel examples of each. However, future AI research may open the potential for AI to generalize the knowledge, not just the content, of these databases to make judgments on novel situations. The possibility of this “general AI” is a matter of much debate among AI practitioners. No matter where you land on it, it isn’t hard to imagine new types of specialized AI making use of a more limited “perspective.”

For example, let’s look at system monitoring and service level objectives (SLOs), two common challenges in the world of SRE. Both of these are conceptually simple. System monitoring just means observing the outputs of your system to ensure things are still functioning as expected. SLOs are just metrics that track a component of your service important to customer happiness, like “can users search our database fast enough, frequently enough and accurately enough to meet their expectations?”
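To illustrate how simple the concept is on paper, the sketch below expresses a search SLO as a ratio: the fraction of requests that both succeeded and finished within a latency threshold, compared against a target. The `Request` type, the 300 ms threshold and the 99% target are made-up assumptions for illustration; choosing real values is exactly the hard, subjective part.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Request:
    latency_ms: float
    ok: bool  # True if the search returned a successful response


def good_fraction(requests: Iterable[Request], threshold_ms: float) -> float:
    """Service level indicator: share of requests that were fast and successful."""
    reqs = list(requests)
    if not reqs:
        raise ValueError("no requests in the measurement window")
    good = sum(1 for r in reqs if r.ok and r.latency_ms <= threshold_ms)
    return good / len(reqs)


def meets_slo(
    requests: Iterable[Request],
    threshold_ms: float = 300.0,  # assumed definition of "fast enough"
    target: float = 0.99,         # assumed definition of "frequently enough"
) -> bool:
    """Return True if the indicator is at or above the target."""
    return good_fraction(requests, threshold_ms) >= target
```

Accuracy, the third part of the example question, is left out here because it usually needs a separate signal, such as user feedback or sampled relevance checks.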

In practice, however, answering these simple questions is massively complex. Existing tools can already remove a lot of toil from this process by automatically gathering and tracking relevant data. However, there’s still a major subjective element that requires meaningful human involvement. What is “fast enough?” What is “healthy enough?” Coming up with these answers requires a broad holistic perspective on your system and your users, one that understands your business needs but isn’t biased by any particular fixation.

AI could help reach this holistic, unbiased perspective. Unlike humans, who have to overcome natural biases towards the system areas they’re most familiar with, AIs can look at the entire system objectively. They can parse large amounts of user data without preconceptions of what users “ought” to be doing. With this perspective, they could recommend standards and priorities more aligned with business objectives than any team of humans could, using a fraction of the time and effort.

Risks of Relying on AI Perspectives

At the end of the day, the vision is of an AI model that could parse tons of business data and objectives and come up with recommendations. To have an AI that performs this as well as an experienced human, minus the human’s biases and preconceptions, would be a massive achievement. But even an experienced human can make a mistake, and so can an AI that matches one.

The way to mitigate this risk is the same as mitigating the existing human risk — with continuous review and learning. No AI can be psychic; there will always be unexpected factors that change what the “right” answer is. Having clear objectives for these choices, in terms of customer retention or system uptime, then reviewing if the standards and priorities are meeting these objectives, will always be a necessary component of business success. Learn something new each time, and teach it to your AI advisor, too.

Empowering Human Resilience with AI

The reactions to AI among engineers are understandably mixed. Sure, in the short term, we’ve already seen it reduce toil, stressors and cognitive load for many tasks. But with that reduction, we’ve seen many organizations shortsightedly try to lay off engineers, hoping the remaining ones will use AI to be productive enough to make up for the losses. Looking at the prospect of future AI doing more strategic and meaningful work, fears of job insecurity and feelings of pointlessness are totally valid.

In the worst case, employees of an AI-embracing organization could leave en masse. This is obviously bad in a general sense, but it could be devastating for the reliability of your service. Losing the resilience of your practitioners is a far greater loss than any benefits that the AI provided. The adaptability, flexibility and, for lack of a better word, grit of humans working in reliability can’t be matched by any algorithm.

Leaders need to make it clear that AI is an ally in the reliability solution, not a replacement. The focus should always be on empowering humans to do work that’s more impactful and interesting, with AI sweeping up the toilsome tasks behind it. Even when AI could potentially tackle the currently impactful and interesting work, reassure engineers that they’ll move on to an even more strategic and directive position that makes use of their unique strengths as humans. Engineers should be brought up to their full potential as individuals contributing to the organization, rather than specific job functions that could be optimized away.

Once engineers feel truly valued as humans, rather than just a collection of MTTx metrics or lines of code written, they become truly resilient. AI learning tools can help foster cross-team appreciation and a more holistic perspective on the whole system. Imagine requesting “a PowerPoint summary of each service area, also breaking down its impact on usage data” and getting something in minutes. Engineers would be able to deal with adversity with a greater strategic perspective, boosting retention and satisfaction.

In conclusion, we’ve only been scratching the surface of what AI can accomplish in the world of reliability. In a few years, even these predictions may seem humble and simplistic. The important thing is to invest in humans and processes to mitigate the risks that the AI revolution will bring. Just like leaping to the cloud, splitting your monolith into microservices or even putting yourself online decades ago, adopting AI will come with game-changing challenges and opportunities. We hope you’re thinking positively about facing them.

