How Generative AI Can Support DevOps and SRE Workflows

As the buzz surrounding Large Language Models (LLMs) and generative AI grew from loud to deafening, forward-thinking software teams put on their noise-canceling headphones to focus on an important question: How can we make this tech work for us?
It seems like a natural fit, of course: Tech pros like new tech. (Duh.) So while it may be a longer — and potentially thornier — process for, say, a human resources professional to figure out how to use generative AI in their job, developers, site reliability engineers (SREs), and other technologists are ideally suited to experiment with and apply gen AI tools in their work. Indeed, 70% of developers already use or plan to use AI tools in their jobs, according to a Stack Overflow survey.
Still, the question remains: How can we make generative AI work for us?
New use cases will keep emerging for the foreseeable future, but for modern software teams, the answer to that question largely boils down to communication, according to Dev Nag, CEO and founder of PromptOps, a gen AI-based Slack assistant geared toward DevOps and SecOps teams.
“My belief is that all work in the enterprise, and especially for DevOps engineers, is about communication,” Nag told The New Stack. “Not just communication between people, but communication between people and machines, too.”
Nag points to tools like Slack, Jira, or Monday as examples, but he considers common DevOps tasks like querying an application for logs, metrics and other kinds of data — and then acting on the responses — to be a form of communication, too.
And, he noted, much of this communication, while necessary, can be repetitive and time-consuming, which is where generative AI can have a great impact.
“LLMs and generative AI are essentially communication hyper-accelerators,” Nag said. “They’re able to take patterns from past data and figure out what you want to do before you even finish your thought.”
PromptOps recently launched a generative AI tool for automating and streamlining various DevOps workflows between people and machines, simply by running ChatGPT-like queries or prompts — whether directly in Slack or in a web client.
Nag sees virtually limitless potential for generative AI applications among DevOps engineers, SREs, and other members of modern software teams. In an interview with The New Stack, he shared six examples of how generative AI can be applied to DevOps workflows today.
6 Use Cases for Generative AI Tools
1. Querying Different Tools.
DevOps and SRE pros often work with a dizzying array of tools. The tech “stack” is more like a tech skyscraper in some organizations.
Manually querying a slew of different tools for logs, observability data, and other outputs requires a lot of time and knowledge, which isn’t necessarily efficient. Where is that metric? Which dashboard is it in? What’s the machine name? How do other people typically refer to it? What kind of time window do people typically look at here? And so forth.
“All that context has been done before by other people,” Nag said. Generative AI can enable engineers to use natural language prompts to find exactly what they need — and often kick off subsequent actions or workflows automatically, without ever leaving Slack (or whatever other client they entered the prompt into).
“[It] is a huge time saver, because I don’t have to master 40 tools anymore,” Nag said. “I can just give my prompt in English and it’s done for me, down to all these different tools.”
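To make the idea concrete, here is a minimal sketch of prompt-to-query translation, in the spirit of what Nag describes. It is not PromptOps' implementation: `llm_complete` is a placeholder for any chat-completion API, and the Prometheus metric catalog and team conventions are invented examples of the context such an assistant would encode.

```python
# A minimal sketch of prompt-to-query translation (hypothetical, not
# PromptOps' code). `llm_complete` stands in for any chat-completion API;
# the metric names and team conventions below are invented examples.

TOOL_CONTEXT = """
Known Prometheus metrics:
- http_requests_total{service, status}
- http_request_duration_seconds_bucket{service, le}
Team conventions: default time window is 1h; "checkout" means service="checkout-api".
"""

def llm_complete(prompt: str) -> str:
    """Placeholder for a real LLM call; swap in your provider's client."""
    raise NotImplementedError

def natural_language_to_promql(question: str) -> str:
    """Turn an English question into a single PromQL query string."""
    prompt = (
        "Translate the question into one PromQL query.\n"
        f"{TOOL_CONTEXT}\n"
        f"Question: {question}\nPromQL:"
    )
    return llm_complete(prompt).strip()

# natural_language_to_promql("error rate for checkout over the last hour")
# might return something like:
#   sum(rate(http_requests_total{service="checkout-api", status=~"5.."}[1h]))
#     / sum(rate(http_requests_total{service="checkout-api"}[1h]))
```

The point is that the metric names, label conventions, and default windows your teammates already worked out live in the prompt context, so nobody has to rediscover them.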
2. Discovering Additional Context.
From there, another beneficial use case is automatically and immediately linking out to additional context as needed. So while querying different tools can produce the desired response or data right in Slack (or wherever you input the prompt), generative AI can also add a navigation layer that sends you directly to the full source or context for that response when appropriate.
That’s especially useful whenever speed is of the essence — think of downtime incidents, a common scenario. Every minute spent chasing the information needed to resolve the incident is, in plain business English, expensive.
“You can do any task if you put enough time into it,” Nag said. “But that’s the problem: we don’t have time or people, and especially in a downtime scenario you really don’t have time. You want to get back to your customers being happy and up as fast as possible.”
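As a toy illustration of that navigation layer, an assistant's answer can carry a deep link back to the exact dashboard panel it summarized. The Grafana URL parameters below are real, but the base URL, panel ID, and the answer itself are made up for the example.

```python
# A toy illustration of the "navigation layer" idea: every short answer
# carries a deep link back to its full source. The Grafana query parameters
# (viewPanel, from, to) are real; the dashboard URL is hypothetical.

from urllib.parse import urlencode

GRAFANA_BASE = "https://grafana.example.com/d/abc123/checkout"  # hypothetical

def answer_with_context(summary: str, panel_id: int, window: str = "now-1h") -> str:
    link = f"{GRAFANA_BASE}?{urlencode({'viewPanel': panel_id, 'from': window, 'to': 'now'})}"
    # The short answer lands in Slack; the link jumps to the full dashboard.
    return f"{summary}\nFull context: {link}"

print(answer_with_context("checkout error rate: 2.3% (above 1% SLO)", panel_id=4))
```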
3. Automating and Fast-Tracking Necessary System Actions.
Similar to how orchestration tools like Kubernetes exploded because of their ability to automate — in a declarative fashion — system operations according to your desired state, generative AI can add another layer to streamlining and speeding up necessary actions in a workflow.
Cloud native tools and platforms have brought their own form of complexity, according to Nag. Performing a wide variety of common operational tasks — provisioning a service, managing configurations, setting up failover — typically requires interacting with many different systems.
“These things might be touching many APIs. It might be touching the AWS API, which has 200-plus services,” he said.
They all have different syntaxes, consoles, command lines, subcommands and so forth — and probably no one actually knows all of them. Kubernetes alone has a daunting learning curve in this regard.
The cloud native ecosystem is vast (and continually growing) — keeping up with the intricacies of everything is almost impossible. With generative AI, Nag said, no one actually needs to know the ins and outs of dozens of different systems and tools.
A user can simply say, “scale up this pod by two replicas or configure this Lambda [function] this way. And I can turn that into an actual code snippet that actually runs that for me in the language of the target system,” Nag said. “LLMs are almost like these wormholes between unstructured data and structured data.”
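A rough sketch of that wormhole, under the obvious caveat that generated commands should never run unreviewed: the LLM proposes a command in the target system's language, and a human confirms it. `llm_complete` is again a placeholder for any LLM API; the kubectl syntax in the comment is real.

```python
# A sketch of natural language to executable command, with a human
# confirmation gate. `llm_complete` is a placeholder for a real LLM call;
# never pipe generated commands straight into production.

import subprocess

def llm_complete(prompt: str) -> str:
    raise NotImplementedError  # swap in your provider's chat-completion client

def run_ops_prompt(request: str) -> None:
    command = llm_complete(
        "Translate this request into a single kubectl command, no prose.\n"
        f"Request: {request}\nCommand:"
    ).strip()
    # "scale up the checkout deployment by two replicas" might come back as:
    #   kubectl scale deployment/checkout --replicas=5   (if 3 are running)
    print(f"Proposed: {command}")
    if input("Run it? [y/N] ").lower() == "y":
        subprocess.run(command.split(), check=True)
```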
4. Writing Up Incidents.
Much has been made of generative AI’s potential in terms of content creation. That’s applicable for IT pros as well, especially for the operations engineer who was on call when a system went down and had to write up the postmortem report.
Such reports often require combing through hundreds or thousands of Slack messages, piles of metrics, charts and log lines, and other data, Nag noted.
An LLM “can actually look at the Slack conversations — 500,000 lines, it doesn’t matter — and summarize and pull out the key findings,” he said. It focuses on the data that matters and filters out the noise — a massive time-saver for the human(s) charged with summarizing what went wrong.
It’s another example of that “wormhole” effect — turning potentially massive amounts of unstructured information into structured data that people can act on.
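The shape of that summarization step is easy to sketch. The map-reduce pattern below (summarize chunks, then merge the partial summaries) is one common way to fit an enormous conversation into a model's context window; `llm_complete` is again a stand-in for any LLM API, not a specific product's.

```python
# A sketch of the postmortem use case: condense a long Slack incident
# channel into a draft write-up. Chunking keeps each call under the model's
# context limit; `llm_complete` is a placeholder for any chat-completion API.

from typing import Iterable

def llm_complete(prompt: str) -> str:
    raise NotImplementedError  # placeholder for a real LLM call

def chunks(lines: list[str], size: int = 500) -> Iterable[list[str]]:
    for i in range(0, len(lines), size):
        yield lines[i:i + size]

def summarize_incident(messages: list[str]) -> str:
    # Map: summarize each chunk of the conversation independently.
    partials = [
        llm_complete("Summarize the key events, decisions, and open questions "
                     "in this incident conversation:\n" + "\n".join(batch))
        for batch in chunks(messages)
    ]
    # Reduce: merge the partial summaries into one postmortem draft.
    return llm_complete(
        "Merge these partial summaries into a draft postmortem with a "
        "timeline, root cause, and action items:\n" + "\n\n".join(partials)
    )
```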
5. Ticket Creation.
Also in the content creation category, Nag noted a use case that overlaps with incident management: ticket creation. LLMs can be trained to automatically create tickets in systems like Jira or Monday and initiate the next actions in a workflow spurred by an incident or other IT event.
Sometimes those actions are the direct outcomes of the incident reporting in use case No. 4: That incident report, enhanced by generative AI, can identify better ways to respond if the event recurs.
“Next time I have to create this alert and maybe have this extra infrastructure here, whatever it happens to be,” Nag said. “We can actually scan the conversation and turn them into tickets.”
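Here is what that scan-and-file step could look like in outline. The Jira REST endpoint and payload shape below are real, while the project key, credentials, and the `extract_action_items` LLM helper are assumptions for illustration.

```python
# A sketch of scanning an incident conversation for follow-ups and filing
# them as tickets. The Jira Cloud REST endpoint and payload fields are real;
# the site URL, project key, credentials, and LLM helper are hypothetical.

import json
import requests

def extract_action_items(conversation: str) -> list[str]:
    """Placeholder: an LLM call that returns one action item per list entry."""
    raise NotImplementedError

def file_tickets(conversation: str) -> None:
    for item in extract_action_items(conversation):
        resp = requests.post(
            "https://yourcompany.atlassian.net/rest/api/2/issue",
            auth=("bot@example.com", "API_TOKEN"),  # use a real token in practice
            headers={"Content-Type": "application/json"},
            data=json.dumps({
                "fields": {
                    "project": {"key": "OPS"},
                    "summary": item,
                    "issuetype": {"name": "Task"},
                }
            }),
        )
        resp.raise_for_status()
```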
6. Searching for Non-Technical Documentation.
Last but not least, Nag said his company found a growing need for streamlining the process of discovering and pulling up non-technical documents without wasting a lot of time. This could include things like runbooks or corporate policies — not just IT policies but even common business policies such as HR benefits.
“You don’t always know where to look for these things,” he said. And as organizations get larger and larger, the process of looking can become a time-suck. A ChatGPT-style prompt can produce what you seek in seconds, sparing you the hunt through various wikis and tools.
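The retrieval behind that kind of search can be sketched in a few lines: embed each document once, embed the question, and rank by similarity. `embed` is a placeholder for any embedding API, and the documents dictionary stands in for whatever wiki or policy corpus you point it at.

```python
# A minimal retrieval sketch for the documentation use case: embed the
# corpus, embed the question, return the closest matches by cosine
# similarity. `embed` is a placeholder for any embedding API.

import math

def embed(text: str) -> list[float]:
    raise NotImplementedError  # placeholder for a real embedding API

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def search(question: str, docs: dict[str, str], top_k: int = 3) -> list[str]:
    q = embed(question)
    index = {title: embed(body) for title, body in docs.items()}  # build once in practice
    ranked = sorted(index, key=lambda t: cosine(q, index[t]), reverse=True)
    return ranked[:top_k]

# search("how do I enroll in dental coverage?", {"HR Benefits FAQ": "...", ...})
```

In practice, an assistant built on this pattern would keep the index warm and cite the source document alongside its answer, the same navigation-layer trick from use case No. 2.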