KubeCon: Lessons in Disaster Recovery from COVID-19 and Site Reliability Engineering

Honeycomb is sponsoring The New Stack’s coverage of KubeCon + CloudNativeCon North America 2020.
Catastrophe. Whether it’s a global pandemic or a production outage, it’s not if but when. Security expert Kris Nóva brought Seattle-based MD Dr. Rachel Beda to KubeCon + CloudNativeCon this week (virtually, of course) to break down all things disaster recovery.
From COVID-19 to site reliability engineering, they compared notes on outbreaks, being on-call, emergency response and eventual prevention. It turns out, there are loads of lessons to apply from this pandemic to software security in distributed systems.
“You gotta wash your hands. You gotta turn on RBAC.” — Kris Nóva, Security Expert
When an Unexpected Outbreak Strikes
Whether it’s a malicious attack or a catastrophic production outage, no matter how prepared we are, Nóva says an unexpected outbreak is going to happen.
Beda, living in the first U.S. COVID-19 hotspot, knows this all too well. By the start of this fateful year, it had become clear that this mystery virus was incredibly contagious and could spread even before someone had symptoms. Seeing the warning signs, Beda’s team got ahead of the curve, even asking an employee to self-quarantine after traveling in Asia in the new year. Her team accurately predicted that they would be flooded with questions and started to plan their responses.
Seattle also saw the first shutdowns. By the start of March, schools in her city closed and a Washington state-wide rapid intervention was enacted.
Nóva said this felt much like a sudden outbreak cropping up in a cloud native system that’s running Kubernetes with a handful of applications on top of it. It looks and feels similar when you start to notice something going wrong in your system.
Just like medical professionals reading news and journals, Nóva says you simply must read your distributed systems via observability.
She defines observability as “Your ability to understand at the system layer and, more importantly, how to connect all of the different components of the system together.”
This includes observing the behavior of the physical compute, storage and network, as well as the Kubernetes abstraction layer.
Then Nóva points to kernel tracing as an effective way not only to debug but to learn about system behavior, alongside operating system logs and virtual machine introspection. She also mentioned application instrumentation, in which engineers add lines of code whose output can be scooped up by Prometheus monitoring and time-series logging, so your team knows not only that something went wrong but when, where and hopefully why.
However, Nóva admits that when you first start practicing observability, it looks just like tiny blips on huge graphs. She recommends implementing a tool like the open source runtime security checker Falco (which Nóva helps maintain). You’ve got to have an observability system in place to be able to respond sooner.
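As a rough illustration of what application instrumentation can look like, the sketch below counts events in code and renders them in the plain-text format that Prometheus scrapes. The Counter class, the metric name and the service labels are all hypothetical and written for this example; a real service would normally use an official Prometheus client library instead.

```python
class Counter:
    """A tiny stand-in for a metrics counter (hypothetical, not a Prometheus client API)."""

    def __init__(self, name, help_text):
        self.name = name
        self.help_text = help_text
        self.values = {}  # maps a sorted tuple of (label, value) pairs to a count

    def inc(self, **labels):
        # One line of instrumentation in application code calls this.
        key = tuple(sorted(labels.items()))
        self.values[key] = self.values.get(key, 0) + 1

    def render(self):
        # Produce the text exposition format: HELP and TYPE lines, then samples.
        lines = [
            f"# HELP {self.name} {self.help_text}",
            f"# TYPE {self.name} counter",
        ]
        for key, count in sorted(self.values.items()):
            label_str = ",".join(f'{k}="{v}"' for k, v in key)
            suffix = "{" + label_str + "}" if label_str else ""
            lines.append(f"{self.name}{suffix} {count}")
        return "\n".join(lines) + "\n"


# Hypothetical metric and labels, for illustration only.
failed_logins = Counter("app_failed_logins_total", "Failed login attempts.")
failed_logins.inc(service="checkout")
failed_logins.inc(service="checkout")
failed_logins.inc(service="search")
print(failed_logins.render())
```

The labels are what let a team answer "where" as well as "when": each service gets its own time series once the output is scraped.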
Then, observability allows for forensics after the incident.
“First you want to be able to retell the story and bring your systems back online. The beauty of the cloud is to take snapshots and to replicate and to go back in time to make sense of it and build it back together,” Nóva said.
She added that Falco can trigger a runtime alarm when something happens in your Kubernetes cluster.
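To make the idea of a runtime alarm concrete, here is a minimal triage sketch. It assumes alerts arrive as one JSON object per line carrying a priority and a rule field, which is how Falco's JSON output is commonly consumed; the sample events and the urgent_alerts helper are invented for illustration and are not part of Falco.

```python
import json

# Falco priority levels, most severe first.
PRIORITIES = ["Emergency", "Alert", "Critical", "Error",
              "Warning", "Notice", "Informational", "Debug"]


def urgent_alerts(lines, threshold="Warning"):
    """Return the parsed alerts at or above the given priority (hypothetical helper)."""
    cutoff = PRIORITIES.index(threshold)
    urgent = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        priority = event.get("priority")
        # A lower index means a more severe alert.
        if priority in PRIORITIES and PRIORITIES.index(priority) <= cutoff:
            urgent.append(event)
    return urgent


# Illustrative sample events, not real Falco output.
sample = [
    '{"priority": "Notice", "rule": "Example notice rule", "output": "something minor"}',
    '{"priority": "Critical", "rule": "Example critical rule", "output": "something serious"}',
]
for event in urgent_alerts(sample):
    print(event["priority"], event["rule"])
```

Filtering by severity like this is one way to decide which alarms should page a human and which can wait for the morning.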
Somebody’s Got to Be On-Call
Once you’ve finally detected the vulnerability or attack, what’s next? First, alert the right people, the ones who can disseminate the correct next steps. Both doctors and SREs spend a lot of nights answering pages. Then the goal is to lessen the number of people affected through quarantine and isolation.
For COVID-19, Beda said “The goal is to get case number down. This is a disease that primarily passes person to person. The biggest way that Covid is spread is person-to-person particles,” like coughing and sneezing.
For this particular coronavirus, it’s about getting the reproduction number below one. Depending on where you live, there’s still definitely work to do.
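A quick way to see why a value of one is the tipping point for the reproduction number (the average number of new infections each case causes) is to project cases over a few generations of spread. The sketch below assumes a constant reproduction number and uses made-up starting figures; real epidemics are far messier.

```python
def cases_by_generation(initial_cases, r, generations):
    """Expected new cases in each generation under a constant reproduction number r."""
    return [initial_cases * r ** n for n in range(generations + 1)]


# Illustrative numbers only.
growing = cases_by_generation(100, 2.0, 3)    # each case infects two others
shrinking = cases_by_generation(100, 0.5, 3)  # each case infects half a person on average
print(growing)
print(shrinking)
```

Above one, every generation is larger than the last; below one, the outbreak shrinks on its own, which is what quarantine and isolation are trying to achieve.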
With any emergency response, it’s essential that information is shared correctly. One of the most important moments in 2020 was when the Chinese government had this coronavirus genome sequenced and open sourced by mid-January. This allowed everyone to work on potential vaccines and treatments much faster.
There was certainly some miscommunication about the benefit of masks and other personal protective equipment (PPE) protocols, alongside the uncertainty of how it gets transmitted.
And then there was blatant misinformation or fake news, saying that it’ll just disappear over the summer.
It turns out there are, again, a lot of similarities between human and traditional computer viruses. The very nature of Kubernetes can also make things easier for attackers if you’re not prepared.
“Right away the whole point of Kubernetes is that it gives you a set of abstractions that make it easy and convenient for you to access other pieces of your infrastructure in the same cluster, which from an attacker’s perspective is fascinating because theoretically once you compromise the [authentication] material, you not only have access to other nodes in the cluster, but you have a wonderful set of tools that people put a lot of time and effort making it convenient for you to access other nodes in the cluster, as well,” Nóva said.
She explained that an illegitimately obtained kubeconfig works with the entire ecosystem of tools that operators and administrators typically use on a day-to-day basis, and all of those tools can now be used by an attacker.
After Falco sounds an alarm, it then proceeds to quarantine your nodes in Kubernetes and isolate them so that whatever is affecting one of your nodes can’t spread to others. It’s a key component in stopping horizontal attacks before they start, just as patients are isolated to keep them from infecting others.
Nóva also recommended looking at failure rates and at how you can respond with the Kubernetes Cluster API, which allows you to take snapshots of infected nodes and then migrate, mutate and drain those nodes to get the infection offline, so you can bring up new immutable infrastructure again.
Network isolation is another big part of quarantining, mitigating the infection or malicious actor at the network layer to dramatically slow them in their tracks.
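One way network-layer quarantine might look in practice is a Kubernetes NetworkPolicy that cuts off all traffic for pods carrying a quarantine label. The sketch below only builds the manifest; the namespace, policy name and label are hypothetical, and enforcement assumes the cluster's network plugin supports NetworkPolicy.

```python
import json


def quarantine_policy(namespace, label_key="quarantine", label_value="true"):
    """Build a NetworkPolicy manifest that blocks all traffic to and from labeled pods."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "quarantine-deny-all", "namespace": namespace},
        "spec": {
            # Applies only to pods that carry the (hypothetical) quarantine label.
            "podSelector": {"matchLabels": {label_key: label_value}},
            # Listing both policy types with no ingress or egress rules
            # means no traffic is allowed in either direction.
            "policyTypes": ["Ingress", "Egress"],
        },
    }


policy = quarantine_policy("payments")
print(json.dumps(policy, indent=2))
```

With a policy like this in place, labeling a suspicious pod is enough to isolate it while leaving it running for forensics.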
Nóva emphasized that “We see that quarantine is important here but more important than that is the ability to take action and to understand that once you have one potential compromise you can see that spread to other parts of your infrastructure.”
She said that when something happens, sometimes teams default to misinformation, speculation and assigning blame. In the open source world, Nóva says, if you happen to discover a vulnerability there is a responsible disclosure time to allow the maintainers to fix it, but then a patch should be released to the broader community. Open sourcing is no good unless there’s open communication.
Most security teams also have a post-mortem evaluation, which is the ceremony of getting together to understand what happened. Nóva emphasized this shouldn’t be a blameful exercise.
She said you can also have false positives and red herrings where it may look like you’re detecting the problem or understanding what’s going on — “but if you debug a live system, it’s usually more sinister and involves DNS.”
Nóva says you have to balance correlation with causation, and that speculation can be equally productive and destructive.
She continued that it can be easy to get fooled which will cause more problems downstream. Her strongest advice is to “be skeptical” by not proving things right or wrong, but by proving your assumptions “not wrong.”
Virus Prevention and Detection: Developing a Security Policy for Humans
Both human and computer viruses can be detected, and need to be detected earlier to prevent spread. With both types of alarms, it’s about balancing under-sensitivity that gives a false negative (like with rapid testing) against over-sensitivity that gives a false positive (like with the more accurate PCR-based lab-confirmed testing). Beda says you really need both, but the less sensitive rapid tests are preferred, so “you can test more people cheaply, quickly, repeatedly.”
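The trade-off can be put in numbers. The sketch below uses the standard definitions of sensitivity and specificity and shows why repeating a less sensitive test can still catch most cases. The counts are invented, and the repeated-test calculation assumes each test is independent, which real tests only approximate.

```python
def sensitivity(true_pos, false_neg):
    """Share of real cases the test catches (a low value means many false negatives)."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg, false_pos):
    """Share of healthy cases the test clears (a low value means many false positives)."""
    return true_neg / (true_neg + false_pos)


def detection_probability(single_test_sensitivity, rounds):
    """Chance that at least one of several independent tests comes back positive."""
    return 1 - (1 - single_test_sensitivity) ** rounds


# Illustrative counts only.
rapid = sensitivity(true_pos=80, false_neg=20)
print(rapid)
print(specificity(true_neg=95, false_pos=5))
print(round(detection_probability(rapid, rounds=3), 3))
```

The same reasoning applies to a noisy security alarm: a cheap check that runs constantly can outperform an expensive one that runs rarely.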
“The stronger our prevention technology is, the stronger our prevention policy is, the healthier our systems are going to be.” — Kris Nóva, Security Expert
Nóva says so much of this comes down to your role-based access control. It’s not just about turning it on, but about asking: How many parts of the system are defined? How many parts of the system are lumped into this access control?
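To illustrate the difference between merely turning RBAC on and scoping it tightly, the sketch below builds a narrowly scoped Kubernetes Role and flags rules that rely on wildcards. The role name, namespace and the overly_broad_rules helper are hypothetical and exist only for this example.

```python
def read_only_pod_role(namespace):
    """Build a Role manifest that can only read pods in one namespace."""
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "Role",
        "metadata": {"name": "pod-reader", "namespace": namespace},
        "rules": [
            {"apiGroups": [""], "resources": ["pods"], "verbs": ["get", "list", "watch"]},
        ],
    }


def overly_broad_rules(role):
    """Return the rules that use a wildcard in any field (hypothetical audit helper)."""
    broad = []
    for rule in role.get("rules", []):
        fields = (rule.get("apiGroups", [])
                  + rule.get("resources", [])
                  + rule.get("verbs", []))
        if "*" in fields:
            broad.append(rule)
    return broad


narrow = read_only_pod_role("payments")
lumped = {"kind": "Role", "rules": [{"apiGroups": ["*"], "resources": ["*"], "verbs": ["*"]}]}
print(len(overly_broad_rules(narrow)), len(overly_broad_rules(lumped)))
```

A role that lumps everything behind wildcards is technically RBAC, but it gives an attacker with stolen credentials the run of the cluster.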
For the cloud native community, these security tools include:
- Admission controllers: allow you to control what is and is not admitted into your Kubernetes cluster
- Open Policy Agent (OPA) Gatekeeper: defines what users and clients can and cannot do inside your Kubernetes cluster
- Regression testing: how you prove that something you didn’t want to happen won’t happen again, not just at the software level, but at the network, kernel and storage levels too
- Kernel controls: govern what processes users can run and what people can do within your ecosystem
The more rigorous the preventative techniques, the less likely the cluster is going to get sick.
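As a sketch of the first tool on that list, the function below shows the kind of decision a validating admission webhook might make: it reads an AdmissionReview request for a pod and refuses privileged containers. Only the decision logic is shown; the policy itself, the sample request and the review_pod name are assumptions made for this example, and a real webhook would also need an HTTPS server and cluster registration.

```python
def review_pod(admission_review):
    """Return an AdmissionReview response that rejects pods with privileged containers."""
    request = admission_review["request"]
    pod_spec = request["object"]["spec"]
    privileged = [
        container["name"]
        for container in pod_spec.get("containers", [])
        if (container.get("securityContext") or {}).get("privileged") is True
    ]
    response = {"uid": request["uid"], "allowed": not privileged}
    if privileged:
        response["status"] = {
            "message": "privileged containers are not admitted: " + ", ".join(privileged)
        }
    return {
        "apiVersion": "admission.k8s.io/v1",
        "kind": "AdmissionReview",
        "response": response,
    }


# Hypothetical request, trimmed to the fields this sketch reads.
sample_review = {
    "request": {
        "uid": "example-uid",
        "object": {"spec": {"containers": [
            {"name": "web"},
            {"name": "debug", "securityContext": {"privileged": True}},
        ]}},
    }
}
print(review_pod(sample_review)["response"])
```

The response must echo the request's uid, which is how the API server matches a verdict to the object it is deciding on.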
Then, on to detection: Nóva says that in Kubernetes you want to be able to check for false positives, but you also just need to do the best you can. She says a site reliability engineer has to start by accepting a certain amount of uncertainty and then build on that; knowing that 90 percent of your clones are up, running and healthy is better than knowing nothing.
Like with Covid tests, Nóva advocates for quantity over quality with patches, just getting things out there to try to get as healthy as possible as quickly as possible.
She spoke about how you can use Falco again to see what’s happening at the kernel to detect runtime security anomalies.
Her advice is simply to dive into SRE to learn what’s going on; then you can start making decisions based on what you learn.
Is Herd Immunity Even Possible?
Herd immunity does not come from the majority of the population catching the virus. Based on the average infection rate of COVID-19, it arrives when 70% of people have immunity. Beda says this can be achieved either by distributing a 100% effective vaccine to 70% of the population or by giving 100% of the population a 70% effective one. With whooping cough, some people will never have an immune response, so the goal is for 94% of the population to get the vaccine several times throughout their lives.
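The arithmetic behind those two routes to 70% is simply coverage multiplied by effectiveness. The sketch below works through both, and adds the standard textbook approximation for the herd immunity threshold, 1 - 1/R0, which was not part of the talk and is included only to show where a figure like 70% can come from.

```python
import math


def immune_fraction(coverage, efficacy):
    """Share of the population left immune by a vaccine campaign."""
    return coverage * efficacy


def herd_immunity_threshold(r0):
    """Textbook approximation: the immune share needed to push spread below one."""
    return 1 - 1 / r0


perfect_vaccine = immune_fraction(coverage=0.70, efficacy=1.00)
full_coverage = immune_fraction(coverage=1.00, efficacy=0.70)
print(perfect_vaccine, full_coverage)

# An R0 of about 3.3 is an assumption chosen to match the 70% figure.
print(round(herd_immunity_threshold(10 / 3), 2))
```

Any mix in between works the same way: an 85% effective vaccine, for example, would need to reach a bit over 82% of people to hit the same target.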
Of course part of this is to make sure each entity or government has a pandemic response plan for future outbreaks, including strategic stockpiles of PPE, antibiotics, airway and breathing medication, and eventually vaccines. This plan also needs to include the chain of command and strategies for rapid deployment.
“You can’t just wait until something bad happens.” — Dr. Rachel Beda, Wise Patient
Nóva says you have to have a plan for how to patch in production and how to get rid of the malicious user, while also finding out how they got in in the first place. Just make sure, beyond a shadow of a doubt, that what you are deploying isn’t going to be bad. Use a form of progressive delivery like partial patching or A/B testing to release slowly, allowing for fast rollback to the last known state.
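A progressive rollout with fast rollback can be sketched in a few lines. The version below widens exposure in steps and retreats to zero at the first failed health check. The step sizes and the is_healthy callback are assumptions for illustration; a real pipeline would plug in its own traffic shifting and monitoring.

```python
def progressive_rollout(is_healthy, steps=(5, 25, 50, 100)):
    """Widen the rollout step by step; roll back to 0 percent on the first failure.

    is_healthy is a caller-supplied check (hypothetical) that takes the current
    rollout percentage and returns True if the system still looks healthy.
    """
    history = []
    for percent in steps:
        history.append(percent)
        if not is_healthy(percent):
            history.append(0)  # roll back to the last known good state
            return 0, history
    return steps[-1], history


# A patch that stays healthy at every step reaches everyone.
print(progressive_rollout(lambda percent: True))

# A patch that breaks once half the fleet has it is rolled back.
print(progressive_rollout(lambda percent: percent < 50))
```

The point of the small first step is that a bad patch only ever reaches a fraction of the fleet before it is pulled.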
Then you apply that knowledge to updating your security policy and maybe your regression testing to prevent it from happening again.
And just like a medical response, software response has to maintain clear documentation and a clear chain of command — so you aren’t waking up the wrong people unnecessarily.
Also be sure to listen to our conversation with Kris Nóva, recorded at the start of the pandemic:
Sysdig’s Kris Nóva – How We Can Never Be Prepared But Open Source Can Help
KubeCon+CloudNativeCon and VMware are sponsors of The New Stack.