What news from AWS re:Invent last week will have the most impact on you?
Amazon Q, an AI chatbot for explaining how AWS works.
Super-fast S3 Express storage.
New Graviton 4 processor instances.
Emily Freeman leaving AWS.
I don't use AWS, so none of this will affect me.
AI / DevOps / Operations

Operationalizing AI: Accelerating Automation, DataOps, AIOps

While technology is the part many eyes will gravitate toward in operationalizing AI, the people and process elements are actually the most challenging.
Oct 5th, 2023 8:28am by
Featued image for: Operationalizing AI: Accelerating Automation, DataOps, AIOps
Image from Gorodenkoff on Shutterstock.

AI is everywhere today, thanks largely to the impact of ChatGPT. But within IT and technologist roles, most discussion has focused on the productivity benefits that developer teams are seeing from it. There’s been less detail on what should be done once a product is built and delivered. This leaves several open questions: What about other technical roles? What happens after code generated by AI ships as a feature in a service that customers rely on? What happens after you put an LLM-based feature into production? Operationalizing AI means a few different things.

To get started, it helps to frame the discussion around those three classic concepts of people, process and technology. And while technology is the part many eyes will gravitate toward, the people and process elements are actually the most challenging.

The Potential of AI

Even before ChatGPT showed the transformative potential of large language models (LLMs), AI has been embedded in products and business processes at a fair pace. At the end of 2022, McKinsey estimated that the average number of AI capabilities in use per organization had doubled over the previous three years to 3.8. Robotic process automation (RPA) led the way, followed by computer vision, natural language text understanding, conversational interfaces, deep learning and many more. These varied AI capabilities help to optimize service operations, product/service development, sales, marketing and risk management.

We can’t overstate the impact generative AI is having on the wider market. Forecasts estimate the technology could add anywhere from $2.6 to $4.4 trillion annually, increasing the economic impact of all AI by 15 to 40%. Even that figure could be doubled if generative AI were embedded into software used for other tasks. So how do we operationalize it?

Leveling the AI Beneficiary Playing Field

For all the hand wringing about AI taking away jobs, economic analysis and history suggest worker productivity is far more likely than job destruction. But which workers? Let’s start with developers. Some case studies have shown developer productivity improvements of 25-50%, which is huge. But where will they spend that extra time? Chances are they’re not going to get to make time to work on that technical debt they know they’ve been accruing. Instead, the business is going to demand more features, and that can have an impact on other teams. Think of it like a balloon. When you squeeze one end of the balloon hard, you need to think about what happens on the other side. You don’t want it to burst.

The key is to consider the impact of this productivity increase on other teams. What happens to operations and infrastructure teams? What happens to platform teams, site reliability engineers (SREs) and network operations center (NOC) staff? If developers are delivering more code to production and accruing more tech debt faster, that could overwhelm teams supporting that code in production.

Part of the solution is process (which we’ll get to next), but part of it is addressing inequity in the generative AI benefit. So the question becomes how to ensure those non-developer teams get to share in that 25-50% productivity uplift. Generative AI can certainly be part of the answer, by helping to automate operational tasks like standardizing scripts, translating them more easily from bash scripts to Python and so on.

Here’s an example of how platform teams and site reliability engineers can get a boost in productivity by automating how runbook jobs are generated.

DataOps: Supporting Modern Data Architectures

Next comes process. It’s easy for engineering teams to get preoccupied with their own features and not consider the broader experience. And, from prompt engineering to pricing, there is a lot to consider to put an LLM into production. But to deliver high-quality output efficiently, organizations must also look at the bigger picture: the whole product end to end. That means, when building generative AI or any other AI capability, the LLM output is just one part of the overall experience.

In June 2023, Andreesen Horowitz published a useful outline of the emerging LLM architecture (pictured above). It’s complex. Even before adding in the complexity of the LLM stack, data pipelines have already become more complicated. Data engineering teams are dealing with different cloud services and often on-premises systems as well. According to Manu Raj, senior director of analytics and data engineering at PagerDuty, the ServiceOps platform provider gets data from 20 to 25 different sources. Added to the modern DataOps stack are data integration tools, data warehouses, business intelligence tools and now large language model components.

A lot can break in a modern data architecture. More complexity means more interdependencies between different components. At the same time, the stakes are higher than the days when data pipelines fed relatively static reports accessed patiently by a few internal people. Today, data applications are fed by streaming data and are woven into the ecosystem of customer-facing experiences.

“Failures are critical,” explained Raj. Yet, teams building with LLMs don’t have to reinvent the wheel to support these new architectures in production. Some of the problems they encounter will be old and rather familiar, while some might be newer variations on well-worn themes specific to the LLM universe.

For example, teams must make sure that the data is prepared properly before it gets fed into a model, and that governance, security and observability are in place. Database availability is a familiar practice, even if vector databases may be newer for many teams. Latency has always been a challenge in data-intensive applications, but now teams will need to also consider the implications of data freshness on the output from LLMs. From a security perspective, we’ve long dealt with attack vectors like SQL injections, and now we need to safeguard against prompt injections.

In short, for the nonfunctional aspects of operationalizing LLMs, there are a lot of valuable learnings and practices from DevOps, database and site reliability engineering, and security that can and should be applied. Follow best practices in testing, monitoring, vulnerability management, establishing service-level objectives (SLOs) and managing error budgets to enable real-time changes. Work through those practices with an eye on the bigger picture, and it’s far likelier that your LLM-based features will drive the business impact they promise with high performance and availability.

Using AI to Operationalize AI

Finally comes the technology part. The good news is that we can actually use AI to operationalize AI. In fact, it’s a necessity, given the complexity of the LLM app stack. There are some things computers are better at than people, and we need to acknowledge this and tap into the power of machines to drive efficiency.

Where can machine learning help most? Think about the volume of event data being generated by these complex, instrumented systems. A failure somewhere can be felt across multiple systems, triggering a potential flood of alerts. But machine learning can help us with compression and correlation to reduce the alert noise and identify the source of the issue. And it also contextualizes by enriching event data so that responders can get to the root cause and solve issues more quickly and efficiently.

Not only is machine learning able to help with the three Cs of contextualization, correlation and compression, it’s better at it than humans. But why stop there? We can take the output of that identification job and automate next steps. Steps like state capture, restarts, resets and myriad operational tasks that practitioners run to gather more data and restore a service. By connecting the event processing to conditional logic to apply predefined tasks, we can accelerate the resolution of an incident in complex systems. Even if the service can’t be fully restored in a self-healing way, the teams that have to step in can have better context and starting points to troubleshoot.

The Future’s Wide Open

When it comes to generative AI, we’ve only just scratched the surface of what’s possible. But as fun projects find their way into important customer-facing services, the hard work begins. To ensure everyone can share in the benefits of AI, we need to operationalize it effectively. We’re not starting from scratch. Accelerating our use of automation, DataOps and AIOps can help.

Group Created with Sketch.
TNS owner Insight Partners is an investor in: Pragma.
THE NEW STACK UPDATE A newsletter digest of the week’s most important stories & analyses.