Immutable Hardware: Ops Hygiene For Security and Efficiency

There’s a clear pattern: using automation to reduce cycle time and eliminate manual intervention benefits every operational component. That is as true for hardware as it is for software.
RackN CTO Greg Althaus and I were discussing Jérôme Petazzoni’s ContainerCon 2015 security presentation, in which he argued for limiting the lifespans of containers as a security precaution. We realized that the arguments for such short-lived containers also apply more generally to virtual and physical infrastructure.
I’ve written about forced turnover (“mayflies”) before with Josh McKenty. Imagine a cloud in which servers were automatically decommissioned after a week of use. This would force a constant churn of resources within the infrastructure. The cloud would be able to very gracefully rebalance load and handle disruptive management operations because the workloads are designed for the churn.
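The week-long lifespan rule can be sketched in a few lines. This is only an illustration of the policy, not a real scheduler: the `Server` record, the `due_for_turnover` function and the sample fleet are all hypothetical names invented here, and the seven-day limit is simply the "week of use" from the thought experiment.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Assumption: the "week of use" from the text, expressed as a hard limit.
MAX_LIFESPAN = timedelta(days=7)


@dataclass
class Server:
    """Hypothetical inventory record; a real system would track far more."""
    name: str
    provisioned_at: datetime


def due_for_turnover(servers, now, max_lifespan=MAX_LIFESPAN):
    """Return servers whose age has reached the limit, oldest first."""
    expired = [s for s in servers if now - s.provisioned_at >= max_lifespan]
    return sorted(expired, key=lambda s: s.provisioned_at)


# Illustrative fleet with made-up ages of 9, 2 and 7 days.
now = datetime(2015, 9, 15)
fleet = [
    Server("node-a", now - timedelta(days=9)),
    Server("node-b", now - timedelta(days=2)),
    Server("node-c", now - timedelta(days=7)),
]
expired_names = [s.name for s in due_for_turnover(fleet, now)]
print(expired_names)
```

Sorting oldest first means the longest-lived machines are recycled first, which keeps the churn steady rather than bursty.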
Greg and I agreed that frequently turning over systems would enhance security (and ops hygiene).
The basic idea is that if your environment is being constantly destroyed and reconstructed from “immutable infrastructure” sources, then it’s a more difficult target for intrusions. Containers or virtual machine images aren’t the only way to recreate pristine environments: it is possible to fully automate metal life-cycles as well, to clean, flash and reimage servers at a reasonable pace. In fact, this methodology is common for hyperscale providers.
If the goal of immutable infrastructure is to eliminate indeterminate configuration by creating disposable base configurations, then it is possible to recreate this environment on metal. The key attribute of this approach is the ability to automatically cycle the environment and reset it to a controlled state. Driving that cycle faster and more reliably translates into more robust infrastructure.
In addition to improved hygiene and repeatability, immutable infrastructure can have implications for security.
With fast cycles, there are no stable landing places from which to launch an attack and no lingering human backdoors or configuration holes, and updates and patches are applied frequently. Most attacks exploit older vulnerabilities that may linger in physical environments because operators perceive change as risk.
In practice, it may not be feasible to truly destroy and reimage servers the way we do containers or virtual machines. Greg defined three levels of physical immutability to guide implementation discussions:
- REALLOCATE: Scrub and redeploy. In this cloud-like model, machines are given back to the pool for random redistribution. Not only are the previous configuration and data lost, but infrastructure-specific relationships between the node and others must be updated. This practice enforces good ops behavior by making sure that existing services get cleaned up. Without that cleanup, ops teams can lose time chasing missing or removed systems, since failures at scale must be expected. This model is especially attractive if you have a lot of general-purpose systems or plan to rebalance workload allocations.
- REBUILD: Scrub and rebuild. In this (re)provisioning model, machines are fully reset but the target workload is reapplied to the same system. This allows many configuration settings (such as IP addresses) to be preserved and reapplied to the system. An additional value of rebuild cycles is “chaos monkey” validation for high availability. Overall, this is good for hygiene but not as rugged and flexible as REALLOCATE.
- REFRESH: Scrub the OS while leaving data. This less extreme version of REBUILD focuses on the decaying parts of the system (the OS and app) while leaving data intact. For storage and big data nodes, restoring the data can represent significant overhead, performance impact and risk. If we assume that data drives are secure or not managed by the OS, then there is less security risk in leaving the data in place.
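One way to make the three levels concrete is to write down what each one wipes and what it keeps. The sketch below is our own illustration, not RackN code: `Level`, `ResetPlan`, `PLANS` and `choose_level` are hypothetical names, the selection rule is a simplification of the guidance above, and we assume REFRESH keeps node settings because it is described as a lighter form of REBUILD.

```python
from dataclasses import dataclass
from enum import Enum


class Level(Enum):
    REALLOCATE = "reallocate"
    REBUILD = "rebuild"
    REFRESH = "refresh"


@dataclass(frozen=True)
class ResetPlan:
    """What a reset cycle does to a machine (hypothetical model)."""
    wipe_os: bool
    wipe_data: bool
    keep_node_settings: bool  # e.g. IP addresses reapplied after reset
    return_to_pool: bool      # machine is handed back for redistribution


PLANS = {
    # Scrub and redeploy: nothing survives, node goes back to the pool.
    Level.REALLOCATE: ResetPlan(wipe_os=True, wipe_data=True,
                                keep_node_settings=False, return_to_pool=True),
    # Scrub and rebuild: full reset, same workload and settings reapplied.
    Level.REBUILD: ResetPlan(wipe_os=True, wipe_data=True,
                             keep_node_settings=True, return_to_pool=False),
    # Scrub the OS only: data drives are left in place (settings assumed kept).
    Level.REFRESH: ResetPlan(wipe_os=True, wipe_data=False,
                             keep_node_settings=True, return_to_pool=False),
}


def choose_level(general_purpose: bool, data_heavy: bool) -> Level:
    """Simplified rule of thumb drawn from the descriptions above."""
    if data_heavy:
        return Level.REFRESH      # restoring data is costly and risky
    if general_purpose:
        return Level.REALLOCATE   # interchangeable nodes can be reshuffled
    return Level.REBUILD          # keep node identity, reset everything else


storage_plan = PLANS[choose_level(general_purpose=False, data_heavy=True)]
print(storage_plan)
```

Every level wipes the OS; the levels differ only in how much node identity and data they carry across the cycle.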
These immutable infrastructure concepts can be implemented in physical operations with significant benefits for security, utilization and overall hygiene. We’re interested in hearing your opinions and concerns about this type of cloud-like ops infrastructure.