Smart Systems Make Ops Life More Human
The human toll that complex, modern infrastructure can take on those who manage it is quite familiar to the tens of thousands of people in IT who keep their phones and pagers on by their bedsides. These are the people who support our always-on digital lifestyle. They understand that for every mobile app installed on our phones, there’s an IT professional on the other side who is tasked with supporting it, and whose sleep is disrupted due to an automated alert from a data center or cloud service in the darkest hours of the night.
Now more than ever before we have ways to use machine learning to help combat the issues that come with our connected culture. How to use these technologies effectively for enabling HumanOps — the practice of considering the well-being of IT teams, as much as the infrastructure — is just beginning to emerge. But the solutions show promise for preventing the burnout and bad health that often afflict people who have to be on call for work.
The health impacts of disrupted sleep are well documented. A recent review of scientific research on sleep disruption, for example, points to short- and long-term health consequences that range from stress and mood disorders to hypertension, cardiovascular disease and type 2 diabetes. But very little attention, if any, is paid to the human impact of managing systems that are always on. Nor do we discuss how that impact might be prevented so operations people can get better rest, improve their health and in turn be more focused on the complex work of managing increasingly sophisticated infrastructure.
The IT industry puts an emphasis on machines and how they are running. But people are the ones who ensure that applications meet service-level requirements. This machine-first approach has had cultural ramifications. Operations teams are conditioned to believe that disruption and fatigue are just part of the job. It is assumed that people have to manage the machines no matter the time of day.
The business value of digital technologies and the impact they have on end users are the primary focus of any enterprise. How these technologies are developed, deployed and managed dominates conversations as software is increasingly at the center of any modern business. In turn, operations teams are closer than ever to end users, as applications are also increasingly at the center of most people’s lives. It’s a fascinating time, considering how immediately connected networks have become now that mobile devices are so prevalent.
But there is very little dialogue about the operations people who choreograph these experiences in the background. Our dependence on these people is beyond doubt. But their health and welfare do not get the attention they deserve.
A discussion is needed about the human toll of operations in this new age of immediacy and application-oriented architectures, which require deeper forms of observability. Meanwhile, there is an endless need to support legacy infrastructure, along with the endless alerts that come with every new and existing technology under the management of operations. It is time we, as an industry, start thinking about these invisible faces, these core operations people who in many ways have made the digital era a reality. They need a helping hand: a solution to avoid the burnout that affects so many of them.
The Human Toll
Organizations invest heavily in ensuring that their employees are well taken care of so that they can perform their work more efficiently. Yet in many organizations, the focus on employee well-being misses the human impact of operations, and HumanOps is seldom addressed.
The biggest human impact on IT operations teams comes not from handling systems at scale but from the alerts they receive after work hours. This is especially acute for an on-call engineer or a manager to whom issues get escalated. In traditional IT, on-call operations staff keep receiving notifications after they go home, whether they are spending time with family, eating dinner or sleeping.
The person on call is expected to respond immediately to every notification. However, notification systems don’t always differentiate between critical issues, transient issues and false alarms. Organizations do apply rules to manage notifications, but those rules do little to reduce the human impact on on-call employees.
On a typical night, the employee can be woken up several times. The first few notifications will have limited impact on the employee even though, in many cases, they will disturb their family. After a certain point, the employee gives up on sleep or ignores the alerts altogether.
While lack of sleep harms the employee’s health, growing numb to alerts can cause business disruption. Add the effect of these alerts on the morale of the employee’s family, and the toll compounds. Alerts during sleep have the biggest impact, but any disturbance during family or dinner time also takes its toll.
If this scenario repeats every day, the impact spills over to workplace relationships. In other words, the human impact of incessant alerting beyond work hours is far greater than many organizations realize. It goes beyond employee and team morale and can even affect the organization’s retention rates.
Making Operations More Humane
The impact on organizations is manifold. It not only affects an individual employee’s morale but can spill over to the entire team. While missed alerts might hurt the business, difficulty in retaining employees costs money and slows down the delivery of business value. And if operations teams are consumed by firefighting, there is very little value they can add towards innovation.
Indifference towards the human impact of operations puts severe strain on traditional businesses, and it has the potential to disrupt even organizations that have gone digital. Clearly, it is important for organizations to tackle this issue and make operations more humane.
In order to reduce the human impact of on-call support for always-on infrastructure, the alert system needs to consider the following factors:
- Protect employees’ personal time in priority order: reduce alerts during sleeping hours first, then during the evening (family time or dinner time), and only then during working hours. This limits the impact on employee morale and is the first step in tackling the problem.
- Eliminate or drastically reduce false alarms and transient alarms (those in which the issue is automatically closed before the employee can get back to their computer).
- Learn which employees have been most impacted and prioritize sending alerts to those who have been less impacted.
- Over time, the system can learn to dynamically alert the employees, taking into account employee preference and company policies, so that the toll on employees’ morale is limited.
This was difficult to implement in traditional systems. But moving monitoring, logging and alerting systems to the cloud, combined with modern data systems that can be tapped to understand responder health and with machine learning/AI, can help tame the alerts and reduce the human impact through smart routing. It is time to put the human back into operations and help reduce the employee churn and business disruption caused by large volumes of alerts. It is time for organizations to focus on the human beings behind the operations.
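To make the factors listed above more concrete, here is a minimal Python sketch of what rule-based "smart routing" might look like. It is only an illustration under stated assumptions, not a description of any particular product: the names (`Responder`, `Alert`, `should_page`, `pick_responder`), the hour boundaries, the five-minute hold-off and the `recent_pages` counter are all hypothetical, and everyone on call is assumed to share one time zone. A learning system of the kind described above would tune these values from data rather than hard-code them.

```python
from dataclasses import dataclass
from datetime import datetime, time, timedelta
from typing import List, Optional

# Every name, hour and threshold below is an illustrative assumption.
SEVERITY = {"info": 0, "warning": 1, "critical": 2}

# Minimum severity allowed to page someone in each period of the day:
# sleep is protected most, then evenings, then working hours.
MIN_SEVERITY = {"work": "info", "evening": "warning", "sleep": "critical"}

# A non-critical alert must stay open this long before anyone is paged,
# so that transient issues which clear on their own never reach a person.
HOLD_OFF = timedelta(minutes=5)


@dataclass
class Responder:
    name: str
    recent_pages: int  # pages received recently, e.g. over the past week


@dataclass
class Alert:
    severity: str
    opened_at: datetime
    resolved: bool = False


def period_of_day(now: datetime) -> str:
    """Classify the time as work, evening or sleep (assumed boundaries)."""
    t = now.time()
    if time(9, 0) <= t < time(18, 0):
        return "work"
    if time(18, 0) <= t < time(22, 0):
        return "evening"
    return "sleep"


def should_page(alert: Alert, now: datetime) -> bool:
    """Decide whether an alert is worth interrupting a person for."""
    if alert.resolved:
        return False  # transient: it closed before anyone was paged
    if alert.severity != "critical" and now - alert.opened_at < HOLD_OFF:
        return False  # still inside the hold-off window
    needed = MIN_SEVERITY[period_of_day(now)]
    return SEVERITY[alert.severity] >= SEVERITY[needed]


def pick_responder(
    alert: Alert, now: datetime, on_call: List[Responder]
) -> Optional[Responder]:
    """Route a page-worthy alert to the least-impacted responder."""
    if not on_call or not should_page(alert, now):
        return None
    return min(on_call, key=lambda r: r.recent_pages)


if __name__ == "__main__":
    team = [Responder("ana", recent_pages=7), Responder("ben", recent_pages=2)]
    night = datetime(2024, 1, 1, 2, 0)
    print(pick_responder(Alert("critical", opened_at=night), night, team))
```

In this sketch, a critical outage at 2 a.m. goes to whichever responder has been paged least in recent days, while a warning raised at the same hour pages nobody and can wait for the morning. Employee preferences and company policies could be layered on by making the period boundaries and severity thresholds per-person settings.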