Making Systems Engineering Behave as Predictably as Software

Infrastructure automation is rapidly increasing in complexity. It is no longer possible to make even simple changes without risking unintended consequences. If infrastructure is to be treated as code, then it is a software product, and as a software product it must adhere to the same reliability standards as any other piece of software. To that end, the same principles used to make software products behave predictably should be applied to systems engineering.
Borrowing from the Database World
For infrastructure to be considered reliable, it must behave consistently. It is not enough for infrastructure to be self-consistent, however; it must also perform predictably with respect to the applications running on it, for infrastructure plays a fundamental role in an application's architecture. To accomplish this, we should borrow from the world of relational database management systems. Transactional infrastructure, modeled after transaction logic (e.g., ACID), is necessary to build the level of trust needed for infrastructure products to be taken seriously. Several requirements must be met in order to approach this ideal:
- It must be possible to guarantee isolation in the transactional approach to infrastructure. This requires not only that the current and all previous states of the infrastructure be known, but also that those states be reliable, meaning no manual, out-of-band changes to the state of systems. That requirement alone is nearly intractable.
- A representation of the infrastructure must be used to model its current state as a whole. Given that state, a transition function (or functions) must be available to represent the change from the current state to the next. These functions must be self-correcting when errors occur and must provide reversibility guarantees. Even so, it may be impractical to implement all of the pre-, intermediate and post-condition assertions necessary to guarantee a transactional approach.
- Changes that cannot be completed should not modify the current state (changes should be atomic and isolated), and once a change has completed, the updated view must be visible to all future observations of the state (changes should be consistent and durable). A malformed change request should never leave the state of our infrastructure fractured or trapped in a transitory state, yet traditional configuration management systems often require human intervention to resolve corrupted state. A minimal sketch of this commit discipline follows the list.
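To make the atomicity and isolation requirement concrete, here is a minimal sketch in Python, assuming infrastructure state can be modeled as a plain dictionary; `apply_change`, `TransactionError` and the validation hook are hypothetical names, not any existing tool's API.

```python
import copy

class TransactionError(Exception):
    pass

def apply_change(state: dict, change, validate) -> dict:
    """Apply `change` to a copy of `state`; return the new state only if
    `validate` accepts it. The caller's `state` is never mutated, so a
    failed change cannot leave the infrastructure in a transitory state."""
    candidate = copy.deepcopy(state)  # isolation: work on a private copy
    change(candidate)                 # may raise; original state untouched
    if not validate(candidate):       # post-condition assertion
        raise TransactionError("change rejected by validation")
    return candidate                  # atomic "commit": adopt the new view

# Usage: a rejected or failed change leaves `current` exactly as it was.
current = {"web-1": {"nginx": "installed"}}
try:
    current = apply_change(
        current,
        change=lambda s: s["web-1"].update(nginx="1.9.0"),
        validate=lambda s: s["web-1"]["nginx"].startswith("1."),
    )
except TransactionError:
    pass  # `current` still reflects the last consistent state
```

The new state only replaces the old one on successful return, so an observer never sees a half-applied change.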
Managing State Change
Taking the analogy a step further, it would be ideal to be able to propose a change to state, analyze its applicability and persist that change only if it is “accepted” by the system. If multiple concurrent systems manage the state of infrastructure, this is where some form of multiversion concurrency control (MVCC) would prove useful. An earlier proposed state change may, once persisted, invalidate a newer proposal. At the same time, separating the proposal and validation of a change preserves a consistent view of the state while proposed changes are serialized and persisted. The primary issue with this approach is that possibly non-deterministic logic is applied in the transition from the initial to the final state, because this is a complex distributed system (a collection of hosts running a Linux distribution intended for people to use directly), not a relational database. Therein lies the problem: nearly all of the operating systems used to run our infrastructure were designed with interactive users in mind. They are tailor-made for people to operate directly.
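As a rough illustration of how this MVCC-style serialization might look for infrastructure state, the sketch below keeps a version counter alongside the state; a proposal made against a stale version is rejected and must be re-derived from the current state. `StateStore` and its methods are hypothetical.

```python
import threading

class StateStore:
    def __init__(self, initial: dict):
        self._state = dict(initial)
        self._version = 0
        self._lock = threading.Lock()

    def snapshot(self):
        """Readers get a consistent (version, state) view at any time,
        even while proposals are being validated and persisted."""
        with self._lock:
            return self._version, dict(self._state)

    def commit(self, base_version: int, new_state: dict) -> bool:
        """Persist a proposal only if no other change has landed since the
        proposer read `base_version`; otherwise the proposal is stale."""
        with self._lock:
            if base_version != self._version:
                return False  # an earlier-persisted change invalidated us
            self._state = dict(new_state)
            self._version += 1
            return True

store = StateStore({"pkg": "absent"})
v, s = store.snapshot()           # propose against version v
s["pkg"] = "installed"
assert store.commit(v, s)         # accepted: no concurrent writer
assert not store.commit(v, s)     # stale: the version has moved on
```

Readers always see a consistent snapshot while writers serialize through the version check, mirroring optimistic concurrency control in databases.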
Consider the problem of package management: a many-times “solved” problem in the Linux ecosystem. Installing a package is represented as a transition function. First, the state of the system must be representable: the package is installed, the installation was attempted and failed, the installation succeeded. All possible states must be known. A glance at the state transitions involved in package management (modeled in the sketch after the list) shows how non-trivial the problem is:
- Uninstalled -> Installing
- Installing -> Install Failed or Installed
- Install Failed -> Cleanup Install Failed -> Uninstalled
- Installed -> Uninstalling
- Uninstalling -> Uninstall Failed or Uninstalled
- Uninstall Failed -> Cleanup Uninstall Failed -> Installed or Uninstalled
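The transitions above are simple enough to model explicitly. A sketch, with hypothetical names, that makes every state known and rejects any transition not on the list:

```python
from enum import Enum, auto

class Pkg(Enum):
    UNINSTALLED = auto()
    INSTALLING = auto()
    INSTALL_FAILED = auto()
    CLEANUP_INSTALL_FAILED = auto()
    INSTALLED = auto()
    UNINSTALLING = auto()
    UNINSTALL_FAILED = auto()
    CLEANUP_UNINSTALL_FAILED = auto()

# All legal transitions from the list above; anything else is an error.
TRANSITIONS = {
    Pkg.UNINSTALLED: {Pkg.INSTALLING},
    Pkg.INSTALLING: {Pkg.INSTALL_FAILED, Pkg.INSTALLED},
    Pkg.INSTALL_FAILED: {Pkg.CLEANUP_INSTALL_FAILED},
    Pkg.CLEANUP_INSTALL_FAILED: {Pkg.UNINSTALLED},
    Pkg.INSTALLED: {Pkg.UNINSTALLING},
    Pkg.UNINSTALLING: {Pkg.UNINSTALL_FAILED, Pkg.UNINSTALLED},
    Pkg.UNINSTALL_FAILED: {Pkg.CLEANUP_UNINSTALL_FAILED},
    Pkg.CLEANUP_UNINSTALL_FAILED: {Pkg.INSTALLED, Pkg.UNINSTALLED},
}

def transition(current: Pkg, target: Pkg) -> Pkg:
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current} -> {target}")
    return target

state = transition(Pkg.UNINSTALLED, Pkg.INSTALLING)
state = transition(state, Pkg.INSTALLED)
```

Even this small model has eight states, and it says nothing yet about recovering when a cleanup step itself fails.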
Principal Actors
There are two principal actors involved in this particular transition function: configuration management (CM) and package management. Many package management systems adhere to ACID principles, which allows configuration management to concern itself only with the terminal states (installed, uninstalled). This is one of the most well-understood pieces of infrastructure management, and it has still proven to be cumbersome and fraught with errors. Furthermore, this must take place on many hosts in a distributed system, and CM systems lack coordination across multiple hosts. “Golden image” approaches to infrastructure management can provide a guarantee of consistent state across all systems that CM otherwise cannot, but are sometimes impractical or dismissed altogether.
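To illustrate why terminal states make CM's job tractable, here is a sketch of a convergence loop that leans on a transactional package manager; `query_state`, `install` and `uninstall` are hypothetical stand-ins for a package manager's interface, not any real CM system's API.

```python
def converge(package: str, desired: str, query_state, install, uninstall,
             max_attempts: int = 3) -> bool:
    """Drive `package` toward `desired` ("installed" or "uninstalled"),
    reasoning only about terminal states and retrying on failure."""
    for _ in range(max_attempts):
        current = query_state(package)   # only terminal states are visible
        if current == desired:
            return True                  # already converged; idempotent
        if desired == "installed":
            install(package)             # transactional: ends in a terminal state
        else:
            uninstall(package)
    return query_state(package) == desired

# Example with an in-memory stand-in for a real package manager:
db = {}
assert converge(
    "nginx", "installed",
    query_state=lambda p: db.get(p, "uninstalled"),
    install=lambda p: db.__setitem__(p, "installed"),
    uninstall=lambda p: db.pop(p, None),
)
```

Note that nothing here coordinates the loop across hosts; each node converges alone, which is exactly the gap the article describes.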
If everything is so far from the ideal, then how is it possible to reach this goal of transactional infrastructure?
The operating systems upon which infrastructure is built must stop thinking of themselves as desktop operating systems. For all intents and purposes, nobody will ever run Red Hat on their laptop, so why are any of the facilities for using that particular Linux distribution as a desktop OS necessary at all? Remove them, and streamline the operating system for deployment as independent units of compute that are part of a collective. This is precisely what Linux distributions like CoreOS are attempting to do.
Isolating Deployable Artifacts
CoreOS is a Linux container hypervisor. Logging into a CoreOS system directly is essentially an anti-pattern. The operating system is updated automatically using the Omaha update protocol, and an update takes effect only when the compute node reboots. Users of CoreOS do not run applications on it in the manner to which users of traditional Linux distributions are accustomed. Instead, they ship and run applications in containers, isolating all deployable artifacts from any other actors within the system.
CoreOS provides an ideal platform upon which to run distributed systems such as Kubernetes and Mesosphere. These platforms afford developers the ability to transactionally upgrade their applications, but their feature sets are still somewhat lacking in this regard. If it is assumed that an application will behave identically in testing and production environments, then both Mesosphere and Kubernetes are sufficiently capable. Both allow a “rolling upgrade” of services that advances each unit only upon the successful completion of a full or partial upgrade. Both allow simple “success criteria” that indicate a successful transition from the current state to the desired state. However, it would be ideal for these platforms to allow more complex pre-, intermediate and post-condition assertions to indicate the success of a state transition.
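A sketch of the rolling-upgrade pattern described above, assuming hypothetical `upgrade_unit`, `rollback_unit` and `healthy` hooks rather than any real platform API:

```python
def rolling_upgrade(units, upgrade_unit, rollback_unit, healthy) -> bool:
    """Advance one unit at a time; proceed only while the success
    criteria hold, and restore every touched unit on failure."""
    upgraded = []
    for unit in units:
        upgrade_unit(unit)
        if not healthy(unit):            # the "success criteria" gate
            # The transition failed: roll back every unit we touched so
            # the service returns to its last known-good state.
            for done in reversed(upgraded + [unit]):
                rollback_unit(done)
            return False
        upgraded.append(unit)
    return True                          # all units transitioned
```

The `healthy` check is where both platforms currently stop at fairly simple criteria; richer pre-, intermediate and post-condition hooks would slot in at the same point.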
Going Beyond
In order to do so, mechanisms must be put in place that allow conditional logic (effectively a fitness function supplied to the platform) to be applied at different stages of the state transition. Monitoring the state of your applications could be enough for this, but it would need to go beyond simple availability monitoring: are the applications behaving consistently and as expected? An applied change should not be considered successful unless the supplied conditions are met, guaranteeing that errant changes are never persisted.
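One possible shape for such a fitness function, sketched with hypothetical names: assertions are evaluated before the transition starts, after each intermediate step, and at the end, and the change fails loudly the moment any of them is violated.

```python
class ConditionFailed(Exception):
    pass

def guarded_transition(pre, steps, post):
    """Run `steps` (a list of (action, intermediate_check) pairs) between
    a pre-condition and a post-condition; abort the moment any fitness
    check is violated so a failed change is never considered applied."""
    if not pre():
        raise ConditionFailed("pre-condition not met; refusing to start")
    for action, check in steps:
        action()
        if not check():
            raise ConditionFailed(f"intermediate check failed after {action.__name__}")
    if not post():
        raise ConditionFailed("post-condition failed; change must not persist")
```

An availability probe would be just one of the checks; the point is that the platform, not a human, decides whether the transition counts as a success.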
These platforms have considerably less distance to travel before they reach an ideal transaction-based system. Can CM approach transactional infrastructure, and if so, how? It will mean a number of trade-offs. Users of CM systems will have to live with reduced flexibility, but will be rewarded with increased assurance that the changes they make will be persisted and that there will be consistency across the entire infrastructure. It will also mean a paradigm shift for CM users: manual intervention and ad hoc system administration must come to an end. It is easy to convince system administrators to disallow user logins, but it is a different argument altogether to have them stop logging in and making changes themselves.
CM systems must supply a coherent interface for resources that allows for the specification of pre-, intermediate and post-conditions. Furthermore, all resources must be able to recover from a failure during any of these assertions. A good deal of this effort rests on the shoulders of those writing resources and providers, but CM must facilitate this style of resource management. CM must also provide a holistic view of infrastructure state and allow coordination of updates across systems. It currently offers eventual consistency at best, which in practice amounts to nothing but guaranteed eventual inconsistency.
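What such a resource contract might look like, as a hypothetical sketch rather than any existing CM system's interface:

```python
from abc import ABC, abstractmethod

class Resource(ABC):
    """Every resource declares its conditions and, crucially, knows how
    to recover when an assertion fails mid-transition."""
    @abstractmethod
    def precondition(self) -> bool: ...
    @abstractmethod
    def apply(self) -> None: ...          # perform the state transition
    @abstractmethod
    def postcondition(self) -> bool: ...
    @abstractmethod
    def recover(self) -> None: ...        # restore a known-good state

def converge_resource(resource: Resource) -> bool:
    if not resource.precondition():
        return False                      # refuse to start from a bad state
    try:
        resource.apply()
        if resource.postcondition():
            return True
        resource.recover()                # assertion failed: roll back
    except Exception:
        resource.recover()                # failure mid-transition: roll back
    return False
```

The burden of writing a correct `recover` falls on resource authors, but the CM framework is what must demand and invoke it.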
Transactional infrastructure would bring the sophistication of systems engineering in line with the services provided to customers. There is no reason that infrastructure, if treated as code, should be relegated to second-class citizenship among software. If infrastructure is a service provided to developers, and developers build products on top of that service, then the infrastructure is part of the product and should be treated as such.
CoreOS and Red Hat are sponsors of The New Stack.
Feature image: “you guys scared?! — humanity’s transaction with nature and banks and the imaginary construct of commerce : liquid painting, scott richard, san francisco (2015)” by torbakhopper is licensed under CC BY-ND 2.0.