Interrogate Your Software with AI — The Future for SREs
Site reliability engineering is a cornerstone of most businesses. Without site reliability engineers (SREs), application and infrastructure management issues would go without remediation, customers would suffer from poor user experiences and the business would lose money in turn. It’s actually quite a simple cause and effect between a lack of a strong SRE group and lost revenue, but that’s where the simplicity ends.
The work of an SRE can be tedious and complex, requiring hours of investigation before the source of an issue is discovered. Moreover, all that effort is often spent only to discover that the issue has happened before, but the fix was poorly documented and communicated. So what should have taken a fraction of the time took hours instead, annoying the SRE and losing money for the organization in the process.
What’s worse is when the issue itself is not the crux of the overall problem. It’s not uncommon for a sea of alerts to be flagged, all potentially connected to a critical app failure. Without the right experience or documentation, SREs can be left lost, with no idea which issue to tackle first. The lack of context can be so bad that perceived issues sometimes turn out to be just a configuration change, a move to different machines or a simple update.
The problem is exacerbated when subjectivity comes into play. For example, what qualifies a graph as bad? Many make their own inferences based on their experiences, essentially working off a hunch. Something seems off, so they get SREs on the job. But hunches can waste SRE time; something may seem off when it is really one of the nonissues I mentioned above. Moreover, different engineers may rank actual issues differently. The lack of consensus means lost time and wasted resources.
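One way to picture a less subjective answer to "what qualifies a graph as bad?" is to agree on a statistical rule in advance. The sketch below is only an illustration, not a method described in this article: it flags a data point when it sits more than a chosen number of standard deviations from recent history. The function name `is_anomalous`, the threshold of 3.0 and the sample latency values are all hypothetical.

```python
from statistics import mean, stdev


def is_anomalous(history, latest, z_threshold=3.0):
    """Return True if `latest` is more than `z_threshold` sample standard
    deviations away from the mean of `history`.

    Hypothetical helper: the threshold is an assumption a team would
    have to agree on, not a recommended value.
    """
    if len(history) < 2:
        # Not enough data to estimate spread, so do not raise an alarm.
        return False
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # Perfectly flat history: any change at all is a deviation.
        return latest != mu
    return abs(latest - mu) / sigma > z_threshold


# Made-up latency samples in milliseconds, for illustration only.
latency_ms = [120, 118, 125, 122, 119, 121, 123, 120]

print(is_anomalous(latency_ms, 124))  # small wobble within normal spread
print(is_anomalous(latency_ms, 450))  # large spike far outside normal spread
```

A rule like this does not replace judgment, but it gives every engineer the same starting definition of "off" instead of a personal hunch.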
The AI Effect
Manual data analysis is time-consuming and can lead to oversight of critical patterns. With AI-driven incident analysis, we gain the capability to process data rapidly and recognize correlations that otherwise might have been overlooked. This empowers us to take proactive measures and predict potential incidents using historical data, breaking free from the limitations of reactive maintenance.
Moreover, AI-powered analysis can play a vital role in assisting SREs in determining the severity of incidents. By defining criteria for incident severity classification and relying on AI insights, we can make more informed decisions and prioritize response efforts efficiently. Resource allocation, a crucial aspect of SRE, can be guided by AI-generated statistics that paint a clear picture of an incident’s impact and resource requirements, enabling us to scale responses based on severity and complexity.
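To make "defining criteria for incident severity classification" concrete, here is a minimal sketch of what written-down criteria might look like in code. Everything in it is an assumption for illustration: the `Incident` fields, the `SEV1` to `SEV4` labels and the thresholds are hypothetical, and a real team would set its own. An AI tool could supply the inputs, while the rules themselves stay explicit and reviewable.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    # All fields are hypothetical examples of signals a team might use.
    error_rate: float      # fraction of failed requests, 0.0 to 1.0
    users_affected: int    # estimated number of impacted users
    revenue_path: bool     # True if a revenue-critical flow is impacted


def classify_severity(incident):
    """Map an incident to a severity label using fixed, agreed criteria.

    The thresholds below are illustrative assumptions, not recommendations.
    """
    if incident.revenue_path and incident.error_rate >= 0.05:
        return "SEV1"
    if incident.error_rate >= 0.05 or incident.users_affected >= 1000:
        return "SEV2"
    if incident.error_rate >= 0.01:
        return "SEV3"
    return "SEV4"


# Made-up incidents, for illustration only.
checkout_outage = Incident(error_rate=0.10, users_affected=5000, revenue_path=True)
minor_blip = Incident(error_rate=0.02, users_affected=50, revenue_path=False)

print(classify_severity(checkout_outage))
print(classify_severity(minor_blip))
```

Because the criteria are explicit, two engineers looking at the same incident get the same ranking, which addresses the lack of consensus described earlier.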
Finally, we can’t forget about incident reports, documentation and runbooks. We all know how bad those can be. Depending on who triaged the incident, what’s reported and documented can range from a simple paragraph to pages of in-depth research and analysis. And even if they are good, they can get lost, stored on a drive somewhere, never to be seen again.
Generative AI in particular can handle this documentation systematically, capturing, storing and sharing the right root-cause analysis and post-mortems so the necessary context is always available. Guesswork will become a thing of the past, and these workflows can foster speed and trust.
The Future SRE
One thing is certain: SRE teams will continue to be critical for organizations moving forward. Their importance will not change, but how they do their job will. Over time, we can expect SREs to not just investigate but also validate the work AI tools do in the background.
There will always be the need for a human in the loop, even with rampant automation. This will require learning how to use prompts effectively with smart generative AI assistants, as well as helping to interpret the results for other teams and the organization at large.
In their day-to-day work, SREs can expect to be able to interrogate their systems and get answers directly via a generative AI chat tool like PromptOps. It will also free them to do higher-level work that they haven’t been able to get to, from optimizations to new processes and features that ensure the reliable functionality of their applications.
One of my favorite parts: They won’t have to act as translators for their business-focused counterparts. Sure, they need to be able to pull out the right information, but if you need a report that explains the issues in a business context, let a tool like PromptOps generate it, and just worry about validating it.