Can SRE Bring Governance and Compliance into the Future?
We are about 15 years into the DevOps revolution, and yet silos still stand tall, especially around the important departments that protect your organization from getting into real trouble.
Mario Platt, both chief information security officer at Broadlight Global and a security consultant, wants to break down another of those barriers, this time between governance, risk and compliance (GRC) and engineering. His talk at WTF is SRE last month, in the DevSecOps track, was a plea to site reliability engineers (SREs) to take up the role of mediators and bridge this divide.
The SRE, he argues, is perfectly suited to better align the view of work imagined by governance, risk management and compliance policymakers with the operational reality of engineering teams. And SREs are in the right sociotechnical position to filter, propagate and automate many of these GRC and security best practices.
Uniting these two powerful but very different departments is the next step toward becoming a high-performing DevOps organization.
Our Infosec industry really needs to start being less about “work of security” and more about “security of work”
They’re not the same, and the difference matters.
— Mario Platt #StandWithUkraine (@madplatt) May 12, 2022
The State of GRC Today
There’s no doubt that, from an engineering perspective, GRC feels like a relic from the past. And yet GRC teams are often the ones that make or break an organization, from preventing fines and bad press all the way to protecting stakeholders. It’s just that, as with many more traditional departments, the communication seems one-way. So what does the governance, risk management and compliance department have to do with engineering?
Governance could be called the effort toward business goal alignment across an organization, but, Platt argues, this office is mostly stuck in a command-and-control mindset and overly focused on specialization rather than integration. He says governance teams are still framing security around “awareness” instead of goal conflicts and trade-offs.
In a follow-up interview, Platt gave the example of the UK government’s plans to introduce “professionalization” to cybersecurity, which would add specialized training and requirements for narrowly defined security roles. “It would have a hugely narrow focus on specialization, which is exactly the opposite of what industry is telling us works,” he said. “In my opinion, if they go ahead with it, it’s a massive step backward.”
Most importantly for engineering, he describes governance teams as historically detached from operational realities. That includes lacking a feel for which issues should be escalated and which shouldn’t, which makes it hard to create a common definition of what is actually an emergency.
Risk management involves the identification and addressing of risks to an organization.
“It’s neat, it’s clean, it has arrows. It’s really cool,” is how Platt characterized a risk manager following their stagnant five-step process. This role typically establishes context, and assesses risk with identification and evaluation. Then they try to enforce a stoplight system.
Finally, he counts as compliance any activity aimed at meeting legal, contractual, regulatory and framework requirements.
But the most important distinctions he made were:
- Work as Imagined – Policies and procedures are created by non-experts who don’t understand the impact of what they’re writing.
- Work as Done – The reality of what’s actually happening.
- The Schism – The documents are unrelated to the operational reality.
Platt based this on “The Varieties of Human Work” by Steven Shorrock, which divides work into four varieties: imagined, prescribed, disclosed and done.
“But in the real world it’s a mix of industry best practices and training,” Platt said, and not everyone assesses risk the same way. He went on to describe ingrained patterns he observes in the GRC industry, where everyone is reading from the same books, taking the same certifications and not doing much technical work.
“What happens is we see what we expect to see and we do not see what we do not expect to see so, because risk and compliance managers often tend to have their view of work is one of work as imagined, not work as actually done, then this risk analysis process, it’s basically a control catalog,” he said.
When you have expectations, you then prescribe a control for that expectation. Except software doesn’t usually work that way. “And then we hold everyone accountable when they don’t keep those controls up to date and working as we expect,” he said.
What Can SRE Do for GRC?
Well, first, the discipline of site reliability engineering is dedicated to the high reliability and scalability of systems, which governance, risk management, compliance, and security teams undeniably can impact. More specifically, Platt pointed to the structures that GRC teams have put in place to support:
- Governance of technology
- Management of operational risk (of which reliability is an extension)
- Enforcement of operational standards
These processes can easily be applied to the meeting of security objectives and other SRE goals.
On the flip side, Platt says this opens up a conversation between governance and development professionals. It’s logical to apply some SRE practices, like error budgets, to security.
“Applying the mechanics of error budgeting into how we approach security would bring a lot of benefits into governance as a whole because you then have an organizational dynamic and the information and the data to actually have a cadence into how we approach and ask ourselves the question: How much security should I be doing?” Platt explained, at the same event, from his perspective as a security professional.
SRE is the best place within an organization, he argued, to bring the perspective of trade-offs and constraints for decision making.
“Obviously, it would require us [GRC professionals] to actually have these conversations and for the SRE functions to feel like it’s a conversation they can have in terms of exposing what are those [trade-offs] and constraints,” he continued.
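To make the error-budget idea concrete, here is a minimal sketch of how the mechanics might carry over to a security objective. It is not from Platt's talk: the `SecurityBudget` class, the example objective (critical patches applied within an agreed deadline) and the decision rule are all hypothetical, chosen only to show the arithmetic.

```python
from dataclasses import dataclass


@dataclass
class SecurityBudget:
    """Error budget for a hypothetical security objective (illustrative only)."""

    slo_target: float   # e.g. 0.95 means 95% of events must meet the objective
    total_events: int   # e.g. critical patches that came due this quarter
    bad_events: int     # e.g. patches that missed the agreed deadline

    def allowed_misses(self) -> float:
        # The budget is the share of events permitted to miss the objective.
        # Rounding hides floating-point noise such as 10.000000000000009.
        return round((1.0 - self.slo_target) * self.total_events, 6)

    def remaining(self) -> float:
        # A positive value means budget is left; a negative one means overspent.
        return self.allowed_misses() - self.bad_events

    def decision(self) -> str:
        # Assumed rule: an overspent budget shifts priority to security work.
        if self.remaining() < 0:
            return "prioritize security work"
        return "continue feature work"
```

With a 95% target over 200 patches, 10 misses are allowed. Seven misses leave a budget of 3 and the sketch returns "continue feature work", while 12 misses overspend it and return "prioritize security work". The value lies less in the numbers than in giving both teams a shared, regularly reviewed answer to "How much security should I be doing?"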
SRE Brings a Different Perspective to GRC Issues
When Platt consults with different organizations, he sees roadmaps, reliability backlogs and even bugs, but he says security isn’t typically included. That’s often because security tooling is separate from the engineering teams. This is risky as cyberattacks are only getting worse and the engineers tend to be the only ones on call to notice when things go off.
Traditional risk management, Platt says, follows more of a linear causality model, a one-way chain of responsibility, so controls and stops are put in to block that interaction. But things aren’t usually that cut and dried.
“For SREs, you’ve all seen the classic example that you make a code change today and, only two months down the line, a special set of conditions start happening and then we start having a problem from something that we did two months ago.” Platt says this is the norm for SREs and anyone in software development. However, he says, if you have the same conversation with GRC professionals, “their view of the problem is likely [that] you didn’t do a good enough job thinking about the conditions that will lead to failure so you need to try harder.”
This is far from a productive conversation. He cited Jens Rasmussen’s model of risk management in a dynamic society, which offers a more in-flux operating point and embraces variability even in business as usual. It comes with:
- Boundary of acceptable workload – skills, processes
- Boundary of economic failure – things aren’t on time or on budget
- Boundary of incidents – reliability, security and data incidents
Within these, there are error margins or marginal boundaries that drive decision-making, which varies dramatically from GRC to SRE.
“A GRC person will typically think that the farther away from that barrier they have their operating control, the better they are. But we know from high-reliability organizations — since the 80s and 90s — that that’s not actually the case. These high-reliability organizations tend to operate closer to that point, and they can do that because that also helps establish a common mental model for how the systems work across the teams,” Platt explained.
This results, he continued, in less erosion of controls over time, and these organizations can move faster without adding the “fat” of excessive processes. It also enables continuous learning, while preventing alert-driven burnout.
Risk Management and GRC Miscommunication Anti-Patterns
Platt shared some anti-patterns that he’s discovered among information security teams:
- Governance, risk management and compliance policies are typically inconsiderate of and inconsistent with operational constraints.
- Engineers still lack awareness of GRC skills, just as GRC lacks an understanding of the needs of engineers.
- GRC feedback loops alert on impractical, unactionable performance boundaries, creating alert fatigue until the alerts are eventually ignored.
How do you overcome these anti-patterns? Start by identifying the strengths and weaknesses of both sides of this partnership. On the one hand, Platt says risk management teams are more than decent at an empirical strategy which is enforced with a lot of automation.
However, an SRE team could be very valuable in teaching GRC about the sociotechnical interactions of distributed systems, including learning from systems, which supports a better evolutionary strategy. This is in clear opposition to infosec and GRC teams that just search for a single point of human error.
“someone who’s not me would probably go straight to ‘root cause was human error’ but I asked more questions and we identified 2 work system issues which enabled this. So as far as I’m concerned, this was a bug waiting to materialise and engineer X just got the short straw”
— Mario Platt #StandWithUkraine (@madplatt) October 14, 2021
Finally, he explained in the follow-up interview, the two teams need to get together to analyze system entanglement and blast radius. This involves learning the language of complexity and non-linearity, accepting that a small change often has a seemingly disproportionate effect, and understanding that there’s not a single version of the truth. After all, everyone has a different experience, and what they did probably made sense at the time.
On the compliance side, SRE can help translate its perspective on:
- Readiness reviews
- Standards enforcement
- Dealing with toil
- Policy as code applied to security
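The last item, policy as code applied to security, can be sketched in a few lines. This is an illustration under stated assumptions, not a tool or rule set named in the talk: the `POLICIES` table, the config keys (`encryption_at_rest`, `public`, `owner`) and the `evaluate` function are hypothetical. The point is that a written control becomes a check a pipeline can run against the configuration engineers actually deploy.

```python
from typing import Callable

# A policy is a named check that takes a config and returns True when it passes.
Policy = Callable[[dict], bool]

# Hypothetical security policies expressed as code rather than as a document.
POLICIES: dict[str, Policy] = {
    "storage must be encrypted at rest": lambda cfg: cfg.get("encryption_at_rest") is True,
    "no public network exposure": lambda cfg: not cfg.get("public", False),
    "service must have an owner": lambda cfg: bool(cfg.get("owner")),
}


def evaluate(config: dict) -> list[str]:
    """Return the name of every policy that the given config violates."""
    return [name for name, check in POLICIES.items() if not check(config)]


if __name__ == "__main__":
    service = {"encryption_at_rest": True, "public": True, "owner": "payments"}
    for violation in evaluate(service):
        print(f"violation: {violation}")
```

Because the checks are versioned and reviewed like any other code, GRC and engineering can debate a rule in a pull request instead of discovering the mismatch during an audit.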
Where to Start Putting the GRC in Your SRE
Platt advocates for an integrated approach to tackle these anti-patterns. This could mean GRC up-skilling your SRE team to tackle some security concerns or, as at the job platform Indeed, even creating an SRE security team within the existing SRE function. Or it could be a connected strategy where GRC still heads security but trains the SREs and hands over to them some of the systems automation responsibility and the resolution of goal conflicts.
He suggests SRE kick things off by approaching the respective teams with the following topics:
- Governance – teach them about error budgets and metrics, and review who should be (and shouldn’t be) in which rooms when these issues are discussed
- Risk Management – show how SRE manages reliability and where to spot risk
- Compliance – SRE is in charge of technical standards, so it’s logical to discuss security enforcement and automation, too
SRE is uniquely positioned to help GRC distinguish “work-as-imagined” from “work-as-done.” SRE teams adopt best practices with a strong sociotechnical focus, which Platt says they can teach to GRC teams. SRE can also teach GRC colleagues more modern technical solutions, like how to leverage CI/CD pipelines as systems of record. SRE will then benefit from understanding GRC’s constraints and applying that understanding to create better assurance and governance best practices and, of course, automation.
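One way to picture a CI/CD pipeline acting as a system of record is an append-only log in which each entry is hashed together with the one before it, so later tampering is detectable. The sketch below is an assumption about how that could look, not something described in the talk. The event fields and the `append_record` and `verify` helpers are hypothetical, and a real pipeline would also need durable, access-controlled storage.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first record


def _digest(event: dict, prev_hash: str) -> str:
    # Serialize deterministically so the same content always hashes the same.
    payload = json.dumps({"event": event, "prev": prev_hash}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_record(log: list[dict], event: dict) -> dict:
    """Append a pipeline event, chaining it to the previous record's hash."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    record = {"event": event, "prev": prev_hash, "hash": _digest(event, prev_hash)}
    log.append(record)
    return record


def verify(log: list[dict]) -> bool:
    """Recompute every hash; any edited or reordered record breaks the chain."""
    prev_hash = GENESIS
    for record in log:
        if record["prev"] != prev_hash:
            return False
        if record["hash"] != _digest(record["event"], prev_hash):
            return False
        prev_hash = record["hash"]
    return True
```

Each deploy, approval or scan result appended this way doubles as audit evidence of work as done, which is the kind of record a compliance team would otherwise request by hand.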
No matter what the path to understanding, it will start with the two teams sitting down and talking.
Disclosure: The author of this article was the host of the WTF is SRE conference.