Embrace AI Acceleration by Investing in Reliability
We’re heading into a blazingly hot AI summer. I’m sure I’m not the first person to tell you this, and I won’t be the last. The hyped-up discourse around AI’s potential has stretched into every industry, and software development is a major focal point.
From AI copilots suggesting the next line of code, to AI sidekicks swiftly running through test batteries during tense incidents, to AI developers building whole apps from natural language requests, AI can empower and accelerate every step of the software life cycle. But this boost in speed only exacerbates a problem staring down even the biggest tech giants: unreliability.
The faster you move, the faster things can break. And with the acceleration of AI, things can break in strange, unpredictable ways. At the same time, keeping reliability up to users’ standards has never been more important. As competitors rush forward with exciting new features, the service that allows the most consistent use of those features will be the winner.
So how do you take advantage of the acceleration and innovation of AI without compromising reliability and losing users? Fortunately, there is an answer: investing in the resilience of people.
AI Should Empower Developers, Not Replace Them
Far too many managers are seeing AI as an opportunity to replace expensive engineers with cheap language models. Although this could save money in the short term, it’s devastating in the long run. AI is a tool, and like any good tool, it empowers the user to do things more efficiently or in different ways than they could before. But the user is essential.
Instead of looking at the financial advantage of AI in terms of cutting costs, frame it as growing revenue. Think about the competitive advantage you’ll realize from shipping more features, faster. Or the stability you’ll enjoy when your teams are less stressed from overwork. Or the strategic flexibility you can pursue with the ability to quickly experiment and iterate. This is the value you unlock by combining AI with a full-scale team instead of downsizing.
AI Pushes You to the Forefront of the Reliability Crisis
Users have high expectations for services these days: always available, always fast, always accurate. And if a service can’t meet these demands? There are dozens of competitors ready to win those users over. It’s no wonder that major outages can wipe out millions, if not billions, in value, and dominate tech headlines. The incredible costs of unreliability, especially for enterprise organizations, are what we’ve dubbed the reliability crisis.
AI exacerbates this crisis in a number of ways. Unavoidably, there’s the simple fact that the faster you go, the faster things will break. If you push up your release schedule from two major releases a quarter to four, be prepared to face twice as many new incidents as well.
AI may also create new types of incidents. Letting AI write code can produce efficient solutions in the blink of an eye, often without the requesting engineer understanding how the code works. This is great… until something breaks. If AI-written code malfunctions, you’ll end up trying to parse and fix this new black box in the middle of an incident. Even if you just use AI for consultation and ideation, not letting it write code directly, issues like hallucinations can have you overconfidently barking up the wrong tree. Of course, all of these sources of unreliability get far worse if fewer engineers have to deal with them.
Make AI Less Risky by Investing in Human Resilience
There’s no way to completely eliminate the unreliability risks of AI without also eliminating all of its benefits. Manually reworking every line of code the AI writes to be “robustly human-compatible,” for example, makes it not much faster than writing code yourself. Instead, let AI accelerate you where it can, and empower the people steering it to mitigate the risk.
A major advantage of engineers over current AI models is perspective. Your AI copilot is lightning-fast at producing and testing code, but it doesn’t understand why you’re asking for these tasks. Unfortunately, human engineers can also end up stuck regurgitating code from requests, not knowing the big picture or having any impact on it. When they become “managers” of AI, it’s more important than ever to empower your engineers with this perspective.
Including all engineers in the strategic conversation will make development more intentional. This pays dividends in operations when things break. Even without needing to understand the details of the AI-written code, each engineer can tackle things on a higher level, mitigating the effect of the problem on the intended outcome of the service.
They’ll know what your users care about and how to leverage AI to quickly bring back functionality. SRE/DevOps best practices, like SLOs and feature flags that delineate sections of code, are also essential to making these higher-level fixes.
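To make that concrete, here is a minimal Python sketch of how an SLO and a feature flag can work together as a higher-level fix. Everything in it is an assumption for illustration: the `Slo` and `FeatureFlags` classes, the `guard_feature` helper, and the flag name are hypothetical, not the API of any particular flagging or monitoring product. The idea is simply that when a flagged section of code has spent its error budget, an engineer (or an automated guard) can switch it off without first understanding every line inside it.

```python
from dataclasses import dataclass, field


@dataclass
class Slo:
    """A hypothetical availability objective, e.g. 0.999 = 99.9% of requests succeed."""
    target: float

    def error_budget_remaining(self, total: int, failed: int) -> float:
        """Return the unspent fraction of the error budget (negative if overspent)."""
        if total == 0:
            return 1.0  # no traffic yet, so nothing has been spent
        allowed_failures = (1.0 - self.target) * total
        if allowed_failures == 0:
            # A 100% target leaves no budget at all.
            return 1.0 if failed == 0 else float("-inf")
        return 1.0 - failed / allowed_failures


@dataclass
class FeatureFlags:
    """A hypothetical in-memory flag store; unknown flags count as off."""
    enabled: dict = field(default_factory=dict)

    def is_on(self, name: str) -> bool:
        return self.enabled.get(name, False)

    def turn_off(self, name: str) -> None:
        self.enabled[name] = False


def guard_feature(flags: FeatureFlags, flag: str, slo: Slo,
                  total: int, failed: int, min_budget: float = 0.0) -> bool:
    """Turn the flag off once the remaining budget drops to min_budget or below.

    Returns True if the feature is still on after the check.
    """
    if slo.error_budget_remaining(total, failed) <= min_budget:
        flags.turn_off(flag)
    return flags.is_on(flag)


# Example: an AI-written recommendations path sits behind its own flag.
flags = FeatureFlags({"ai_recommendations": True})
slo = Slo(target=0.999)
still_on = guard_feature(flags, "ai_recommendations", slo, total=100_000, failed=150)
```

In this example the 99.9% target allows roughly 100 failures per 100,000 requests, so 150 failures overspends the budget and the guard turns the flag off. The point of the design is that the decision is made in terms users care about (failed requests against a target) rather than in terms of the code's internals.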
Investing more in the incident response process is also an effective way to mitigate AI risk. If you’re dealing with incidents more often, sometimes with brand-new causes and challenges, you won’t want to be bogged down with any toil in fixing them.
You need to have universal processes in place for identifying and triaging new incidents, bringing together the right team and distributing work, and, most importantly, learning after the incident. This will keep your response smooth and consistent, even when AI throws you a curveball. AI can help in your response too, making ad-hoc testing scripts, summarizing long log files, and more.
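As an illustration of the kind of ad-hoc script that can take toil out of a response, here is a small Python sketch that condenses a long log into counts per level and the most common error messages. The log format (`timestamp LEVEL message`) and the `summarize` function are assumptions made for this example; real logs will need their own pattern.

```python
import re
from collections import Counter

# Assumed line shape: "<timestamp> <LEVEL> <message>"
LINE = re.compile(r"^(?P<ts>\S+)\s+(?P<level>[A-Z]+)\s+(?P<msg>.*)$")


def summarize(lines, top=3):
    """Count lines per level and group similar ERROR/CRITICAL messages."""
    levels = Counter()
    errors = Counter()
    for line in lines:
        match = LINE.match(line.strip())
        if not match:
            continue  # skip lines that do not fit the assumed format
        levels[match["level"]] += 1
        if match["level"] in ("ERROR", "CRITICAL"):
            # Replace digit runs with "N" so messages that differ only
            # by a number (durations, ids) are grouped together.
            errors[re.sub(r"\d+", "N", match["msg"])] += 1
    return {"levels": dict(levels), "top_errors": errors.most_common(top)}


sample = [
    "2024-05-01T10:00:00Z INFO request ok",
    "2024-05-01T10:00:01Z ERROR timeout after 30s calling billing",
    "2024-05-01T10:00:02Z ERROR timeout after 45s calling billing",
    "2024-05-01T10:00:03Z WARN retrying",
    "unparseable-line",
]
summary = summarize(sample)
```

A summary like this does not replace reading the logs, but it gives whoever is triaging a quick picture of where to look first.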
Join the AI Revolution, Reliably
AI provides an unprecedented opportunity to accelerate development velocity and explore new ideas. However, huge outage costs and public backlash have proven that nothing is worth sacrificing reliability for. Getting the best of both worlds means investing in people just as much as AI.