How Automation Can Create More Hours in Your Day
Today’s IT operations teams sit in the engine room of the modern enterprise, managing and monitoring the digital infrastructure that’s critical to business success. But today’s corporate IT environment is a complex blend of legacy and modern cloud-based systems from multiple providers. Incidents inevitably happen, and when they do, it’s a race against time to remediate.
This is a hard enough job on its own, especially when teams are under relentless pressure to fix problems. But the work of ITOps is made even more challenging by rising levels of toil and by ticket queues that add overhead, delays and extra risk. Together, they waste engineers’ valuable time, which is already in short supply.
DigitalOps under Pressure
With economic uncertainty and business headwinds likely to last throughout 2023, organizations will need to enable their teams to work more efficiently when handling incidents. That matters first because the way incidents are handled affects brand and bottom line, but also because companies can’t afford to waste their engineers’ time. They need to keep teams happy and productive. Yet research reveals that 42% of practitioners worked more hours in 2021 than in 2020, and over half (54%) were interrupted outside of normal working hours.
To stop wasting engineers’ time, we need to think about how we can automate manual tasks, reduce alert “noise,” enable self-service and minimize errors through standardized processes.
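As one illustration of reducing alert noise, the Python sketch below drops repeat alerts for the same service and check when they arrive within a short quiet period. The tuple layout, the function name `deduplicate` and the ten-minute default are assumptions made for this example; they are not taken from any particular monitoring tool.

```python
from datetime import datetime, timedelta


def deduplicate(alerts, quiet_period=timedelta(minutes=10)):
    """Keep an alert only if the same (service, check) pair was not kept within quiet_period.

    `alerts` is an iterable of (timestamp, service, check) tuples sorted by timestamp.
    This tuple layout is hypothetical, chosen only for the example.
    """
    last_kept = {}  # (service, check) -> timestamp of the last alert let through
    kept = []
    for ts, service, check in alerts:
        key = (service, check)
        previous = last_kept.get(key)
        if previous is None or ts - previous >= quiet_period:
            kept.append((ts, service, check))
            last_kept[key] = ts
    return kept


t0 = datetime(2023, 1, 10, 9, 0)
raw = [
    (t0, "web", "cpu_high"),
    (t0 + timedelta(minutes=2), "web", "cpu_high"),   # repeat inside quiet period, dropped
    (t0 + timedelta(minutes=3), "db", "disk_full"),   # different key, kept
    (t0 + timedelta(minutes=15), "web", "cpu_high"),  # quiet period elapsed, kept
]
print(len(deduplicate(raw)))  # prints 3
```

A real deployment would more likely rely on the grouping and suppression features of its alerting platform; the sketch only shows the underlying idea.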
Toil: What Is It and Why Does It Matter?
Toil eats up engineers’ time. It prevents teams from working on high-value tasks and keeps the business stuck in a firefighting Groundhog Day. One area that toil is often associated with is site reliability engineering (SRE), the application of software engineering practices to drive reliability and scalability. Toil is a block on these efforts. It’s manual, reactive and adds no strategic value. In short, it’s the wrong kind of work. Instead of repetitive, tactical toil, which lacks enduring value and increases with scale, organizations need to empower teams with engineering work that’s strategic, creative and builds lasting value.
Yet toil can’t be completely eliminated (sorry to disappoint). Change is the only constant in modern ITOps, and it will always create some degree of toil. The best that organizations can strive for is to keep it at a manageable level. That means keeping an eye on key sources of toil that haven’t been automated yet, such as schema updates and rollbacks; changes to storage quotas, networks and DNS configuration; user adds; and service failover. Teams can also streamline incident response by automating manual interventions like restarts, diagnostics, performance checks and changes to configuration settings.
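The manual interventions mentioned above, such as diagnostics followed by a restart, can be captured as a small standardized routine. The Python sketch below is one possible shape, not a definitive implementation: the function name `restart_if_unhealthy` and the `is_healthy` and `restart` callables are hypothetical stand-ins for whatever health check and restart mechanism a team already uses.

```python
from typing import Callable


def restart_if_unhealthy(
    service: str,
    is_healthy: Callable[[str], bool],   # hypothetical health probe
    restart: Callable[[str], None],      # hypothetical restart action
    max_attempts: int = 2,
) -> str:
    """Check a service, restart it only when needed, and report the outcome."""
    if is_healthy(service):
        return "healthy"
    for attempt in range(1, max_attempts + 1):
        restart(service)
        if is_healthy(service):
            return f"recovered after {attempt} restart(s)"
    # Automation gave up; a human should take over from here.
    return "escalate"


# Simulated service that recovers on the first restart.
state = {"cache": False}
result = restart_if_unhealthy(
    "cache",
    is_healthy=lambda name: state[name],
    restart=lambda name: state.update({name: True}),
)
print(result)  # prints: recovered after 1 restart(s)
```

Passing the probe and the restart action in as parameters keeps the routine testable without touching real infrastructure, and the explicit "escalate" result leaves room for a person to step in when automation is not enough.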
If left unchecked, toil can quickly become toxic. It could lead to burnout, more human error, disillusionment and a lack of career development. At an enterprise level, it can reduce team capacity, drive up operational costs and employee churn, and hurt strategic initiatives.
Most dangerous of all is when toil surges to the point where there’s simply not enough engineering capacity to address it. Minimizing toil then requires strategic work: creating external or internal automation, or enhancing services so they no longer require manual maintenance.
Tickets: What’s the Harm?
Another source of frustration and wasted time for ITOps is the excessive use of tickets to manage operational tasks. Although tickets have become the de facto work management tool for most teams, their ubiquity hides an unpalatable truth. Tickets, or rather the queues they’re put into, could in fact be doing more harm than good. Here are a few reasons why:
- Communication problems: Requests are frequently misunderstood, often because context is missing from the ticket. The requester might make the wrong request or not understand the ramifications of what they are asking. And sometimes as the ticket waits in the queue, the request parameters change, unbeknownst to either party.
- Bottlenecks: Ticket queues are often used where specialist teams fielding requests are outnumbered by those making them. Requesters stuff the queue with requests, which adds to response times. And because of delayed feedback, they don’t know the negative impact this has. Also, as queue lengths grow, the teams responsible for them will instinctively look inward to protect their capacity.
- Siloed ways of working: Ticket queues become a buffer that allows teams to continue working in a disconnected manner. Protecting team capacity becomes of primary importance, rather than serving the needs of the organization. And the more that these silos are reinforced, the worse things get.
- Snowflakes: This describes something that may be technically correct but is a one-off that can’t be reproduced, such as a manually updated server. Tickets encourage this inefficient way of working because teams jump from one seemingly isolated request to another.
- Obscuring value: Context matters to any type of knowledge work — that is, understanding where each piece of work fits into the bigger picture. But breaking it down into individual tickets obscures both context and the overall value stream that teams can deliver.
- Management overheads: Setting up queues, defining rules and maintaining the ticket system itself all require extra time and effort, which could be better spent on adding strategic value to the business. In effect, ticket queues are a form of toil.
Tickets aren’t inherently bad; they’re just overused and used for the wrong reasons. They’re still appropriate for one-off requests and documentation of human-to-human communication when approvals are unavoidable. But tickets can add up and waste engineering resources.
How to Stop Wasting Engineers’ Time
Organizations should adopt a service ownership mindset, where engineers and developers handle as much of the product life cycle as possible. That can eliminate handoffs and the need to manage incidents solely through tickets. This shift doesn’t mean the end of tickets, since we still need some way to catalog and maintain visibility of IT requests.
Self-service automation improves incident response by helping to prioritize the tasks that matter. This helps to reduce waiting time, minimize breaks in context and shorten feedback loops.
The right self-service interfaces will help ensure that the remaining ticket queues are reserved for true exceptions. The same technology can effectively prevent teams from creating toil for others by empowering them to do the work themselves. Other toil-reduction strategies could include automating tasks that previously required a great deal of manual effort, such as runbook procedures, or diagnosis and remediation of issues in production infrastructure. Alert suppression and automated maintenance windows can help to take the pressure off teams and free them up for high-priority work.
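To make alert suppression during maintenance windows concrete, here is a minimal Python sketch. The `MaintenanceWindow` and `Alert` classes, the `should_page` function and the service names are all hypothetical, invented for illustration; real alerting platforms expose their own ways of declaring maintenance windows.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class MaintenanceWindow:
    """A planned period during which alerts for one service are muted (hypothetical model)."""
    service: str
    start: datetime
    end: datetime

    def covers(self, service: str, at: datetime) -> bool:
        # Start is inclusive, end is exclusive.
        return service == self.service and self.start <= at < self.end


@dataclass(frozen=True)
class Alert:
    service: str
    message: str
    fired_at: datetime


def should_page(alert: Alert, windows: list[MaintenanceWindow]) -> bool:
    """Return False when the alert falls inside a maintenance window for its service."""
    return not any(w.covers(alert.service, alert.fired_at) for w in windows)


start = datetime(2023, 1, 10, 2, 0, tzinfo=timezone.utc)
windows = [MaintenanceWindow("billing-db", start, start + timedelta(hours=2))]

during = Alert("billing-db", "replica lag high", start + timedelta(minutes=30))
after = Alert("billing-db", "replica lag high", start + timedelta(hours=3))
other = Alert("checkout-api", "5xx rate high", start + timedelta(minutes=30))

print([should_page(a, windows) for a in (during, after, other)])  # prints [False, True, True]
```

Only the alert that fires for the service under maintenance, inside the window, is muted. Alerts for other services, or for the same service after the window closes, still reach the on-call engineer.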
Engineers are a precious resource. It’s time we stop wasting their time on toil and tickets.