Centralized vs. Decentralized Operations
The concept of centralized versus decentralized operations predates both site reliability engineering (SRE) and DevOps, as does organizing operations around on-call. And there have always been tradeoffs associated with dedicating staff to on-call, whether in the small, specialized groups core to the SRE model or the distributed teams with some on-call responsibility core to the DevOps experience.
As with most things relating to operations and engineering, a lot depends on the size and maturity of a business, as well as what infrastructure is in play (on-premises, public cloud, hybrid, etc.). Further, access to shared tools is crucial to success for both centralized and decentralized ops teams. Tools empower teams and individuals to reduce risk and move fast in response to issues that pop up, whether expected or, more commonly, unexpected.
Enter: Public Cloud
The rise of public cloud over the past decade has accelerated development, go-to-market and deployment. It’s also created a host of new on-call issues tied to spinning up, and down, scores of accounts and services across stacks and teams within companies. There are also significant cost and security considerations at play here.
Public cloud has become a playground for decentralization. Anyone can spin up VMs, Amazon Web Services accounts and other resources in an instant and on a whim. At first, the pure joy of moving fast and creating overshadows the potential costs of launching each account. But those costs can become more than a small line item, for individuals and for companies at scale, especially when a decentralized team is made up of multiple people launching scores of accounts without consulting each other.
The rise of public cloud has serious and potentially harmful security implications. For instance, if an S3 bucket is breached or the data within it is exposed to the public, an entire company suffers the consequences. In a decentralized model, security becomes more of a mixed bag dependent on individuals and teams often working without full knowledge of what others are doing.
In a centralized model, it’s often possible to mitigate security concerns due to ownership of all or part of the stack, as well as direct access to security teams that review work and ensure code is written with security best practices in mind. But the tradeoff is that it takes longer to support the launch of apps and respond to on-call issues affecting customers and company reputations.
The Human Element of Operations
In comparing the human element of centralized vs. decentralized ops, one should consider the software architecture supporting each case.
For example, in a microservices-based architecture, individual teams owning the operations of a specific service could have advantages, especially if they developed the service, because they have the best understanding of its nuances and internal dynamics. In general, the team that coded the service is likely to be faster at debugging operational issues caused by software bugs and slower at debugging problems caused by the environment, including underlying dependencies or infrastructure.
From a headcount standpoint, the on-call team should have both developers and operators. As a service matures, though, it might make sense to transition ops over to central SRE teams to reduce cost and burnout, and to give engineers time back to code and develop new features, which is generally their preferred activity.
Culture around SRE or DevOps is also important. On-call ops, in particular, is a 24/7 vocation with around-the-clock pressure, stress and expectations. Not everyone is cut out for it. But it’s commonly at least a part of an operator’s job, especially at smaller, newer or decentralized organizations.
In many cases, on-call is painful and people are working under immense pressure and perpetual stress. They might also be dealing with outdated wikis or lack information about what needs to happen to troubleshoot or address an on-call issue. It might even be unclear who holds the responsibility for certain actions needed to completely take care of things. Even in the best-case scenarios, there are a lot of moving parts, and companies rely on people to take action.
- In a centralized model: When one team owns all the responsibility and accountability for ops, you don’t have nearly as much finger-pointing across teams and between operators. Clear ownership is fundamental to the centralized approach to ops, but things can still fall through the cracks between team members internally and when operators deal with outside service providers. It can also take longer to respond to issues because there are established and rigid processes in place.
- In a decentralized model: When there’s split responsibility between departments and it’s unclear who holds accountability and/or responsibility for elements of operations, chaos quickly ensues, especially in the on-call context. Taking this approach often involves operators with varying levels of experience learning on-call on the job and collecting knowledge through failure. This model also allows organizations to spread the culture of reliability to the entire software development team so that it becomes at least a part of everyone’s job.
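One way to make unclear accountability visible, in either model, is to keep a small ownership registry and flag the gaps. The following is only a minimal sketch: the `ServiceOwnership` record, its fields, and the service and team names are hypothetical and are not drawn from any particular tool.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ServiceOwnership:
    """Hypothetical record of who answers for a service."""
    service: str
    accountable_team: Optional[str] = None  # team that answers for the service
    on_call_contact: Optional[str] = None   # rotation or alias that gets paged


def find_ownership_gaps(registry: List[ServiceOwnership]) -> Dict[str, List[str]]:
    """Return a mapping of service name to the ownership fields left empty."""
    gaps: Dict[str, List[str]] = {}
    for entry in registry:
        missing = []
        if not entry.accountable_team:
            missing.append("accountable_team")
        if not entry.on_call_contact:
            missing.append("on_call_contact")
        if missing:
            gaps[entry.service] = missing
    return gaps


if __name__ == "__main__":
    # Illustrative entries only; the names are made up.
    registry = [
        ServiceOwnership("checkout", "payments", "payments-oncall"),
        ServiceOwnership("search", "discovery"),
        ServiceOwnership("billing"),
    ]
    for service, missing in find_ownership_gaps(registry).items():
        print(f"{service}: missing {', '.join(missing)}")
```

Running a check like this on a schedule is one possible way to surface services that nobody has claimed before an incident forces the question.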
Whether a company is centralized or decentralized, it’s important to create a culture of shared responsibility and accountability that underpins on-call responses, remediations and regular ops work. While a lot of us spend days, and nights, responding to issues, automating tasks, writing code and building apps, we can’t forget to take care of ourselves and keep the human elements of both SRE and DevOps front of mind.
Real Talk from Operators
In our experience working under centralized and decentralized regimes at companies of varying sizes and maturity levels, it’s common for simple issues, like licenses expiring, disks filling up or credits depleting, to cause outages. It’s also very common for operator error to lead to urgent remediation scenarios. And operators are often embarrassed to admit fault when a license expires during a beta launch or a disk fills up before the end of a holiday weekend.
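Simple failures like these lend themselves to simple, scheduled checks. The sketch below is illustrative only: the thresholds (85% disk usage, 14 days of license runway), the function names, and the example expiry date are assumptions, and a real setup would read the expiry date from wherever the license is actually tracked.

```python
import shutil
from datetime import date
from typing import List, Optional


def disk_usage_fraction(path: str = ".") -> float:
    """Fraction of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def days_until(expiry: date, today: Optional[date] = None) -> int:
    """Days from `today` to `expiry`; negative if already expired."""
    today = today or date.today()
    return (expiry - today).days


def preflight_warnings(
    disk_fraction: float,
    license_expiry: date,
    today: Optional[date] = None,
    disk_threshold: float = 0.85,   # assumed threshold
    min_license_days: int = 14,     # assumed runway
) -> List[str]:
    """Collect human-readable warnings for the two checks."""
    warnings = []
    if disk_fraction >= disk_threshold:
        warnings.append(
            f"disk is {disk_fraction:.0%} full (threshold {disk_threshold:.0%})"
        )
    remaining = days_until(license_expiry, today)
    if remaining < 0:
        warnings.append(f"license expired {-remaining} days ago")
    elif remaining <= min_license_days:
        warnings.append(f"license expires in {remaining} days")
    return warnings


if __name__ == "__main__":
    # Hypothetical expiry date; replace with the real source of truth.
    for warning in preflight_warnings(disk_usage_fraction("."), date(2030, 1, 1)):
        print("WARNING:", warning)
```

Run ahead of a launch or a long weekend, a check along these lines turns an embarrassing outage into a routine ticket.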
It’s time to accept that the simplest things can take down the biggest apps. As long as some or all of SRE and DevOps relies on members of a team, human error will affect overall success and cause operators pain and stress.