Protecting Cloud Native Data Well Before Production
Spend any time at a security show or engrossed in the nonstop stream of bad news about costly ransomware attacks and data loss, and you will start to see such looming events as inevitable disasters.
After all, if you manage any production-grade cloud application that’s worth something to customers or the business, it’s also worth something to bad actors who can profit by threatening to bring it down.
While cloud vendors and organizations can prepare defenses with perimeter security, threat detection and SOC tooling to scan for known attack vectors, hackers are motivated to come up with novel approaches that systems haven’t yet dealt with. New “day-zero” attacks can be worth a fortune to the inventor on the black market.
In such an endless conflict, the number and style of attack attempts are effectively unlimited, so we can expect at least one of these day-zero disasters to find a way into our critical applications and their data, as well as the production infrastructure that supports them.
Enter a new paradigm for building applications. Cloud native computing abstracts away some of the challenges of protecting networks and data. Kubernetes introduced truly distributed and scalable container orchestration that could separate compute workloads from data storage as seemingly stateless microservices.
Cloud Native Applications Aren’t Really Stateless
The cloud native computing project landscape has been envisioned, built and battle-tested by a community of thousands of open source contributors and vendor practitioners, and it does bring some security advantages by design.
In traditional application environments, whether in a data center or in the cloud, there is a network perimeter and an application-delivery controller providing access to application and data resources through IP addresses. Read/write operations happen continuously between services and storage to maintain the state and persistent results of all user sessions, while backups happen only at rare intervals so that capacity limits and capital expenses can be kept in check.
By contrast, cloud native developers can use Kubernetes to launch namespaces containing ephemeral, container-based workloads that can materialize and disappear instantly, with more fine-grained compute resources that scale to meet demand. State can be maintained somewhat independently of resources through the concept of in-memory secrets. The lack of physical hardware and known addresses makes it harder for attackers to latch on to systems using many conventional exploits.
Here’s the problem: Even a cloud native application needs to maintain session state and record events to persistent volumes on behalf of its users, somehow, or it won’t be very useful. Securing every vendor tool and open source element that contributes to a widely distributed app becomes of paramount importance, as supply chain attacks are on the rise.
A recent State of Kubernetes Security survey noted that 94% of respondents reported experiencing a security incident in their Kubernetes and container environments in the past 12 months, with more than half having to delay production deployments due to configuration concerns. Organizations need to shift data protection and data restoration concerns to the left side of the cloud native application lifecycle, to “Day Minus One,” before thinking about dynamically automating delivery.
Preparing Four Secure Day-Minus-One Approaches
What are some of the safeguards that cloud native computing teams can put in place well before the next Day-Zero malware disaster threatens data in production? Here are four:
1. Scaling dynamic storage and backup policies to avoid cost surprises — Engineers can run an open source tool such as Kubestr to identify dozens of potential storage volumes available to Kubernetes clusters, many of which have unique protocols and permission settings.
Writing scripts and configuring data workloads to store and back up correctly for each volume can be a time-consuming and expensive process in itself. Worse yet, storage resources that seem reasonably priced at first can balloon in cost within months, especially if traffic increases and multiple teams are calling for different storage resources. Maintaining fail-safe backups frequently enough to be safe could become prohibitively expensive.
Setting common backup and restore service-level objectives across application teams can take the manual labor and guesswork out of budgeting against failures and cost overruns.
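To make the budgeting point concrete, here is a minimal sketch of how a shared backup tier could be turned into a rough storage estimate. Everything in it is an assumption for illustration: the BackupTier fields, the "one full copy plus incrementals" cost model, the change rate and the per-gigabyte price are hypothetical and do not come from Kubestr or any vendor's pricing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupTier:
    """A shared backup SLO tier (hypothetical fields, illustrative only)."""
    name: str
    interval_hours: int   # how often a backup is taken
    retention_days: int   # how long each backup is kept


def retained_copies(tier: BackupTier) -> int:
    """Number of backups held at steady state under this tier."""
    return (tier.retention_days * 24) // tier.interval_hours


def monthly_storage_cost(tier: BackupTier, full_gb: float,
                         change_rate: float, price_per_gb_month: float) -> float:
    """Rough monthly cost, assuming one full copy plus incremental copies.

    change_rate is the assumed fraction of data that changes between backups.
    """
    copies = retained_copies(tier)
    incremental_gb = full_gb * change_rate * max(copies - 1, 0)
    return (full_gb + incremental_gb) * price_per_gb_month


# Two example tiers that application teams might share (made-up values).
gold = BackupTier("gold", interval_hours=1, retention_days=7)
silver = BackupTier("silver", interval_hours=24, retention_days=30)

for tier in (gold, silver):
    cost = monthly_storage_cost(tier, full_gb=500, change_rate=0.01,
                                price_per_gb_month=0.02)
    print(f"{tier.name}: {retained_copies(tier)} copies, ~${cost:.2f}/month")
```

Even a crude model like this lets teams compare tiers before they commit to one, rather than discovering the cost of hourly backups after the bill arrives.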
2. Designing for recovery with Policy-as-Code — Assume up front that some kind of attack will eventually find a vulnerability somewhere. Then limit the blast radius by defining protection, backup and recovery policies along with the architecture.
Protection Policy-as-Code assets can be stored in repos as shared project assets along with the rest of the Infrastructure-as-Code definitions and delivered as part of the continuous delivery pipeline.
Using an interface such as Kasten’s K10, developers and ops teams can manage post-deployment policy contingencies transparently for storing active user data, setting backup intervals and executing complex sequences of recovery and reset workflows across multiple hybrid IT storage volumes, including immutable or air-gapped fail-safe backups.
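As a rough illustration of the Policy-as-Code idea, the sketch below shows a protection policy expressed as data and checked automatically, the way a pipeline step might do before deployment. It is not Kasten K10's policy format or API; the ProtectionPolicy fields and the three validation rules are hypothetical and chosen only to show the pattern.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ProtectionPolicy:
    """Hypothetical policy record kept in a repo next to the IaC definitions."""
    app: str
    backup_interval_minutes: int
    rpo_target_minutes: int
    immutable_copy: bool      # is at least one backup copy immutable?
    export_location: str      # off-cluster target; empty string means none


def validate(policy: ProtectionPolicy) -> List[str]:
    """Return a list of human-readable problems; an empty list means it passes."""
    problems = []
    if policy.backup_interval_minutes > policy.rpo_target_minutes:
        problems.append(
            f"{policy.app}: backup interval exceeds the RPO target")
    if not policy.immutable_copy:
        problems.append(f"{policy.app}: no immutable backup copy configured")
    if not policy.export_location:
        problems.append(f"{policy.app}: no off-cluster export location")
    return problems


# Example policies (made-up applications and values).
checkout = ProtectionPolicy("checkout", 15, 30, True, "s3://example-backups")
reports = ProtectionPolicy("reports", 240, 60, False, "")

for policy in (checkout, reports):
    for problem in validate(policy):
        print("POLICY VIOLATION:", problem)
```

A pipeline could fail the build whenever validate returns a non-empty list, which keeps recovery requirements reviewed and versioned alongside the architecture they protect.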
3. Run and secure application data anywhere — Kubernetes delivers on the promise of run-anywhere portability and openness because every growing application estate will eventually need to be extended to cover multiple acquired vendor platforms and customer domains.
Open source-based data protection, disaster recovery and restore capabilities should follow ephemeral Kubernetes workloads wherever they go without creating proprietary lock-in for only one type of base cloud infrastructure or delivery pipeline.
4. Focus on time-to-restore before you need it — Getting up and running quickly after a system failure or ransomware attack is really what matters most so that revenue and customers aren’t lost in the gap.
Businesses want a recovery point objective (RPO) as close to zero seconds as possible. RPO measures the window of transactions that can be lost, counted from the last recoverable copy of the data to the service interruption. Even more importantly, they need to meet the recovery time objective (RTO), which measures the time required to restore the Kubernetes production environment and its accompanying data at scale so that it can resume operations.
Remember that recovery times aren’t independent variables. Reducing human error and lag time in spotting and resolving issues, and employing automation policies such as cross-cluster exports and imports, can drive faster results.
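One way to see why recovery times are not independent variables is to break a recovery drill into its parts. The sketch below is a minimal example under stated assumptions: it takes four hypothetical timestamps from a drill, treats the data-loss window as the time from the last restorable backup to the interruption (a common reading of RPO), and treats time to restore as detection lag plus restore duration. The function names and the sample timeline are invented for illustration.

```python
from datetime import datetime, timedelta


def recovery_metrics(last_backup: datetime, incident: datetime,
                     detected: datetime, restored: datetime) -> dict:
    """Split a recovery timeline into its component durations."""
    if not (last_backup <= incident <= detected <= restored):
        raise ValueError("timestamps must be in chronological order")
    return {
        "data_loss_window": incident - last_backup,  # compare against RPO
        "detection_lag": detected - incident,        # human/monitoring delay
        "restore_duration": restored - detected,     # restore work itself
        "time_to_restore": restored - incident,      # compare against RTO
    }


def meets_objectives(metrics: dict, rpo: timedelta, rto: timedelta) -> bool:
    """True only if both the RPO and RTO targets are satisfied."""
    return (metrics["data_loss_window"] <= rpo
            and metrics["time_to_restore"] <= rto)


# Hypothetical drill timeline.
drill = recovery_metrics(
    last_backup=datetime(2024, 1, 1, 2, 0),
    incident=datetime(2024, 1, 1, 2, 40),
    detected=datetime(2024, 1, 1, 3, 10),
    restored=datetime(2024, 1, 1, 3, 55),
)

for name, value in drill.items():
    print(f"{name}: {value}")
print("meets 60 min RPO / 60 min RTO:",
      meets_objectives(drill, timedelta(minutes=60), timedelta(minutes=60)))
```

In this made-up timeline the detection lag accounts for a large share of the total time to restore, which is why faster detection and automated restore steps shorten recovery together rather than separately.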
The Intellyx Take
Fail to prepare to fail, or prepare to fail.
A Day-Minus-One mindset changes the way we think about cloud native architecture, state management, data persistence and the resiliency of our applications in general. It is built upon the pragmatic understanding that no system is infallibly designed and that humans will inevitably make some mistakes in configuration.
To scale and survive an inevitable storm of attacks and potential failure conditions, enterprises need to be proactive about backup and recovery, rather than waiting for a new Day-Zero ransomware variant to arrive.