How Service Ownership Can Make Digital Ops Cloud-Ready
Cloud migration is accelerating across the globe — and with good reason. Organizations want to become more agile, innovative and efficient in the race for customers’ hearts and minds. But there’s a problem. Cloud also means greater complexity across people, technology and processes — especially for the ITOps and DevOps teams tasked with making it work. It requires changes to critical workflows, maintaining visibility and control across a constantly shifting hybrid environment, and mitigating an exponential increase in application-related incidents that may impact your customers. In this context, how can the right people get the right information at the right time to drive effective, real-time incident response?
Fortunately, with service ownership as their watchword, digital operations can move away from siloed and centralized approaches and instead ensure teams take responsibility for the software they deliver, bringing them closer to their customers and speeding up innovation. By clearly defining services and ownership, organizations can harness the power of AI automation to eliminate tasks that were previously manual, allowing teams to spend less time on resolving incidents and more time on business-critical tasks, such as innovation.
Out with the Old
Over 70% of companies have now migrated at least some of their workloads into the public cloud, according to Gartner. The momentum will only continue as more organizations look to optimize scale, flexibility and performance. Gartner predicts that spending on public cloud services will rise by nearly 20% year-on-year in 2022 to exceed $397 billion globally, after an increase of 23% in 2021.
However, the dynamic, heterogeneous infrastructure of the hybrid era means more change, more complexity and an inevitable rise in incidents. More moving parts and reduced visibility increase the risk of failure and of increasingly demanding customers jumping ship. Amid these challenges, the role of IT operations also changes with migration to the cloud. Gone are the days when most incidents were infrastructure-related and managed by a central team. In the cloud, more incidents are application-based, and change happens more frequently, driving up incident volumes. Where once there was a limited range of on-premises tools and home-grown solutions, now there may be services from dozens of vendors to monitor.
In the cloud, IT operations therefore needs to support distributed teams, learn how to use new technologies and adapt to new processes and service-level agreements.
Traditional, centralized IT is ill-equipped to deal with this new world. Issues are identified without context, triage is delayed by siloed, step-by-step processes, and business stakeholders and developers aren’t informed quickly enough. Handled this way, a single incident can sometimes take several days to be fully remediated. Research from 2020 shows a 47% increase in incidents since the start of the pandemic, with nearly two-thirds (62%) of DevOps and IT responders being forced to work more than 10 additional hours per week resolving these incidents.
Owning the Problem
Shifting to the cloud will benefit mature organizations in multiple ways. It will completely change how they work, enabling them to scale faster, innovate at pace and become more agile. But the shift to cloud infrastructure multiplies complexity and can lead to an exponential increase in incidents being tackled by distributed teams. Tackling these challenges and advancing your firm’s operational maturity will be critical to ensuring that responders aren’t spending even longer on incident response than before.
A critical step to dealing with complexity and rising incidents is shifting to a service-ownership approach. How does this work? It’s summarized neatly by the idea, “You build it, you own it.” Once they’ve completed a project, developers no longer hand it over to a specialized operations or site-reliability engineering team. Instead, they’re expected to take ownership of software in production, stepping in when something goes wrong.
The idea is that the right contextual information is routed to these experts and service owners within minutes, rather than hours or days. Although key stakeholders, including customer service teams, are kept in the loop, the service owner is best placed to handle problems, as they should know the code inside out. That means fewer hand-offs, less uncertainty and a shorter mean time to remediate (MTTR), so issues are fixed with minimal impact on end customers.
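As a rough illustration of that routing idea, the sketch below looks up the owner of an affected service in a small catalog, pages that owner and copies in stakeholders. Everything in it is hypothetical: the service names, team names, `SERVICE_OWNERS` catalog and `route` function are invented for illustration and do not correspond to any particular product.

```python
from dataclasses import dataclass, field


@dataclass
class Incident:
    service: str                      # the service the incident affects
    summary: str                      # short human-readable description
    context: dict = field(default_factory=dict)  # e.g. recent deploys, dependencies


# Hypothetical service catalog: each service has exactly one owning team.
SERVICE_OWNERS = {
    "checkout-api": "payments-team",
    "search": "discovery-team",
}
FALLBACK_OWNER = "central-ops"        # assumed catch-all for unowned services
STAKEHOLDERS = ["customer-service"]   # kept in the loop, but not paged


def route(incident: Incident) -> dict:
    """Decide who is paged and who is informed for an incident."""
    owner = SERVICE_OWNERS.get(incident.service, FALLBACK_OWNER)
    return {
        "page": owner,
        "inform": list(STAKEHOLDERS),
        "summary": incident.summary,
        "context": incident.context,
    }
```

Calling `route(Incident("checkout-api", "Error rate above threshold"))` would page the hypothetical `payments-team` directly, while an incident on a service missing from the catalog falls back to `central-ops`, which is exactly the hand-off that clearly defined ownership is meant to avoid.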
This doesn’t just result in optimized incident response. Service ownership also brings developers and engineers closer to their customers, the business and the value being delivered. By understanding the business drivers for developing a particular service, and its impact once released into production, teams get better at delivering value going forward. And spending less time on incident response means more time for innovation.
The Right Tools
Yet service ownership is only one piece of the puzzle. Developers and ITOps teams also need a single pane of glass that bridges the communication gap between central IT and DevOps teams. This will help ensure that tickets and incidents are automatically routed to the relevant service owner, alongside the information they need to resolve them rapidly. Incidents that do not need a human responder should be resolved automatically in order to accelerate MTTR and free up as much developer time as possible for high-value creative work.
This is where AIOps products built for the cloud can help. These cloud-native platforms harvest contextual information on incidents from across the enterprise through integrations with hundreds of third-party systems. Doing so shows teams the relevant technical and business-service dependencies across their cloud environment. This visibility is critical because it gets the right information to the right people to drive real-time incident response.
Machine-learning algorithms can also be trained to reduce alert noise, so responders are notified only about the incidents that matter. Multiple alerts can be grouped together into one incident, using time-based methods, to further enhance responder productivity. And event-driven webhooks can use event rules to trigger external processes and workflows at the push of a button, accelerating incident resolution.
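Time-based grouping can be pictured with a minimal sketch like the one below. It assumes a simple rule that is not taken from any specific AIOps product: alerts for the same service are merged into one incident as long as each arrives within a fixed window of the previous one. The `group_alerts` function, the tuple layout and the five-minute default are all illustrative assumptions.

```python
from datetime import datetime, timedelta


def group_alerts(alerts, window=timedelta(minutes=5)):
    """Group (timestamp, service, message) alerts into incidents.

    An alert joins the latest open incident for its service if it arrives
    within `window` of that incident's most recent alert; otherwise it
    starts a new incident.
    """
    incidents = []       # each incident is a list of alerts
    latest = {}          # service -> index of its most recent incident
    for alert in sorted(alerts, key=lambda a: a[0]):
        ts, service, _message = alert
        idx = latest.get(service)
        if idx is not None and ts - incidents[idx][-1][0] <= window:
            incidents[idx].append(alert)
        else:
            incidents.append([alert])
            latest[service] = len(incidents) - 1
    return incidents
```

Under this rule, two database alerts two minutes apart would notify a responder once rather than twice, while a further alert twenty minutes later would open a fresh incident. Real platforms typically combine such time windows with learned similarity between alerts, which this sketch does not attempt.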
Up to 90% of CIOs have experienced failed or disrupted cloud migration projects due, in part, to overwhelming complexity. In this new era of digital transformation, there’s plenty up for grabs. But organizations must understand that cloud migration is an ongoing journey. So, too, is advancing the maturity of your digital operations. The first step is often shifting from manual to reactive — but as your cloud maturity grows, so must your operational maturity. At a certain point, you must shift to a preventative stance if you want to scale and become more agile.
To tackle these challenges, organizations should ensure they follow industry best practices for cloud migration and identify service owners to give their team clear responsibility across services to drive continuous improvement and innovation.