How Containers, Microservices and AI Will Lead to the Operatorless Data Center
Containerized solutions and machine learning may soon be more than tangentially related. Containerized solutions will usher in an era of operations that don’t require human intervention. Once humans are taken out of operations, we will be free to apply machine learning techniques to what is left. If we succeed in displacing the data center operator at the same time that truck drivers, radiologists and factory workers are being displaced, we will need a new way to organize our economy.
In this post, we will take a look at some of the trends that are emerging in our industry. In a later post, we will look more closely at emerging, societal trends and their relationship to technology and digitization.
What Comes Now
Containers allow us to easily isolate, transport and run software. One of the killer applications for containers is the microservice architectural pattern. Microservices do one thing well; each service is “elastic, resilient, composable, minimal, and complete.”
Once we started to use containers to create systems of microservices, we needed a way to organize them. Enter, stage left: schedulers and orchestrators.
Schedulers are about selecting a node to run a job on. Scheduling includes matching the needs of the job to the capabilities of the machine. This leads to increased hardware utilization because schedulers will run more than one job on each node, provided the node fulfills the CPU and memory requirements.
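The matching step can be sketched in a few lines. This is a toy first-fit placer, not any real scheduler's algorithm; the `Node` and `Job` shapes here are invented for illustration:

```python
# A minimal, illustrative scheduler: pick the first node whose free
# CPU and memory satisfy the job's requirements (first-fit placement).

from dataclasses import dataclass

@dataclass
class Node:
    name: str
    free_cpu: float   # cores
    free_mem: int     # MiB

@dataclass
class Job:
    name: str
    cpu: float
    mem: int

def schedule(job, nodes):
    """Return the first node that can run the job, or None."""
    for node in nodes:
        if node.free_cpu >= job.cpu and node.free_mem >= job.mem:
            node.free_cpu -= job.cpu   # reserve resources so the next
            node.free_mem -= job.mem   # job sees the updated capacity
            return node
    return None

nodes = [Node("node-1", 2.0, 4096), Node("node-2", 8.0, 16384)]
placement = schedule(Job("web", 1.5, 2048), nodes)
print(placement.name)  # "node-1" — and it still has room for more jobs
```

Because placed jobs only subtract what they actually request, several jobs can share one node, which is where the utilization gains come from.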
Orchestration is about how everything works together. Orchestrators are therefore about networking, scaling, and responding to failure. For example, if the traffic to your system crosses a certain threshold, an orchestrator should notice this, add extra capacity and then reconfigure the load balancer.
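That notice-and-react behavior is a reconciliation loop: observe the load, compute the desired capacity, and reconfigure. A minimal sketch, with invented names and thresholds:

```python
# A toy orchestrator reconciliation step: compare observed traffic
# against a per-replica capacity target, clamp to sane bounds, and
# "reconfigure" a (simulated) load balancer.

def reconcile(replicas, requests_per_sec, target_per_replica=100,
              min_replicas=1, max_replicas=10):
    """Return the desired replica count for the observed load."""
    desired = -(-requests_per_sec // target_per_replica)  # ceiling division
    return max(min_replicas, min(max_replicas, desired))

def update_load_balancer(replicas):
    # In a real system this would rewrite the load balancer's backend pool.
    print(f"load balancer now routing across {replicas} replicas")

current = 2
observed = 450  # requests per second, crossing our capacity threshold
desired = reconcile(current, observed)
if desired != current:
    update_load_balancer(desired)  # scales from 2 to 5 replicas
```

Real orchestrators add damping, health checks and failure handling on top, but the shape of the loop is the same.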
Many people in the community don’t agree on what schedulers and orchestrators should and shouldn’t do. There is, however, general consensus that the next thing we’ll get right is automated operations with some combination of software schedulers and orchestrators. This work is now well and truly underway and can be seen, for example, with the progress Mesosphere has made with its DC/OS (Data Center Operating System).
What might come after orchestrators are intelligent agents. An intelligent agent has some basic rules built into it. These rules dictate how the agent will recover from error, increase or decrease capacity, and what role it should take in the overall system. For example, is the agent a master or a slave in the system? That will depend on whether a master already exists. Which agent ends up as master and which as slave is undetermined at build time. This is one characteristic of systems of intelligent agents.
The schematic above illustrates Joyent’s ContainerPilot. With ContainerPilot, an agent is composed of a microservice process, a container and an adaptation layer. It is through this layer that agents communicate, via a global store, with the other agents in the system. The promise of ContainerPilot is that applications deployed using it will manage themselves.
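The "who is master?" decision can be made at runtime by having identical agents race to claim a key in the shared store. This is a toy sketch of the general idea, not ContainerPilot's actual implementation; the in-memory store stands in for something like Consul, and all names are invented:

```python
# Identical agents elect a leader by racing for a key in a global store.

import threading

class GlobalStore:
    """Toy key-value store with an atomic 'set if absent' operation."""
    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def acquire(self, key, value):
        with self._lock:
            if key not in self._data:
                self._data[key] = value
                return True
            return False

def start_agent(name, store, roles):
    # Every agent runs exactly the same code; the first to claim the
    # "leader" key becomes master, the rest become slaves.
    if store.acquire("leader", name):
        roles[name] = "master"
    else:
        roles[name] = "slave"

store, roles = GlobalStore(), {}
threads = [threading.Thread(target=start_agent, args=(f"agent-{i}", store, roles))
           for i in range(3)]
for t in threads: t.start()
for t in threads: t.join()
print(roles)  # exactly one agent ends up as master
```

Which agent wins depends on timing, which is precisely why the role cannot be known at build time.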
Intelligent agents are not smart in and of themselves. However, when they interact they can create the illusion of intelligence. A good example here is Conway’s Game of Life. The agents in that game, called ‘cells,’ ‘live’ on a two-dimensional plane. The cells follow simple rules about living and dying:
- Any live cell with fewer than two live neighbors dies, as if by underpopulation.
- Any live cell with two or three live neighbors lives on to the next generation.
- Any live cell with more than three live neighbors dies, as if by overpopulation.
- Any dead cell with exactly three live neighbors becomes a live cell, as if by reproduction.
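The four rules translate almost directly into code. This sketch stores live cells as a set of coordinates on an unbounded plane:

```python
# Conway's Game of Life: one generation per call to step().

from collections import Counter

def step(live):
    """Apply the four rules above to a set of live (x, y) cells."""
    neighbour_counts = Counter(
        (x + dx, y + dy)
        for x, y in live
        for dx in (-1, 0, 1) for dy in (-1, 0, 1)
        if (dx, dy) != (0, 0)
    )
    return {
        cell
        for cell, n in neighbour_counts.items()
        if n == 3 or (n == 2 and cell in live)  # birth, or survival
    }

# A 'glider' travels diagonally: after four generations its shape
# repeats, shifted one cell right and one cell down.
glider = {(1, 0), (2, 1), (0, 2), (1, 2), (2, 2)}
state = glider
for _ in range(4):
    state = step(state)
print(state == {(x + 1, y + 1) for x, y in glider})  # True
```

Nothing in `step` mentions gliders; the traveling shape emerges from the rules alone.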
In Conway’s Game, system level intelligence emerges from the individual behavior of the cells. This system level intelligence creates ‘organisms’ like the ‘lightweight spaceship’ and ‘glider’, both of which seem to ‘travel.’ The ‘glider gun’ is a configuration of Conway’s game that creates an infinite number of gliders.
Conway’s Life is an example of a Complex Adaptive System (CAS). A complex adaptive system is composed of autonomous agents each following rules. The agents interact and intelligence emerges from the system. For example, a city like London runs with a large number of agents (people), many of whom have never met but who nevertheless conspire to make sure that bread is delivered, garbage is collected and trains run on time.
Other examples of complex adaptive systems are ant colonies and the traffic system. Applications built with intelligent containers, à la ContainerPilot, are also complex adaptive systems. So, just as Conway could not predict the emerging intelligence of his game, we cannot predict exactly what sort of system-level intelligence will arise from applications built with intelligent agents.
Can this really happen? Intelligent agents may sound like science fiction, but we are in fact all the beneficiaries of a highly distributed, fault-tolerant network of agents. The Transmission Control Protocol (TCP), which is a key protocol of the internet, has rules for congestion control, retransmission and error detection. All of these rules combine and compound to make sure that packets of data get from their source to their destination. We don’t ‘run’ or ‘manage’ packets of data and soon we won’t run or manage our applications, either. They’ll do that all on their own.
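One of TCP's congestion-control rules gives a feel for how simple per-agent behavior yields robust global behavior. A toy sketch of AIMD (additive increase, multiplicative decrease), the core idea behind TCP's congestion window; real TCP operates on a window of bytes with more machinery than this:

```python
# AIMD: grow the congestion window steadily on success, cut it
# sharply on loss. Every sender following this rule independently
# is what keeps shared links from collapsing.

def aimd(events, cwnd=1.0):
    """Return the window size after each 'ack' or 'loss' event."""
    history = []
    for event in events:
        if event == "ack":
            cwnd += 1.0                 # additive increase
        elif event == "loss":
            cwnd = max(1.0, cwnd / 2)   # multiplicative decrease
        history.append(cwnd)
    return history

print(aimd(["ack", "ack", "ack", "loss", "ack"]))
# [2.0, 3.0, 4.0, 2.0, 3.0]
```

No sender coordinates with any other, yet the network as a whole shares bandwidth fairly.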
Machine Learning for Automated Operations
We can learn a lot from complex adaptive systems. But they are also limited. They will never evolve into something that is beyond their own rules. Conway’s Life won’t spontaneously become three dimensional; ants won’t start solving crosswords; cars won’t fly.
Machine learning is different. There are no rules to begin with. For example, Google’s DeepMind is a neural network that is designed to (amongst other things) play computer games. DeepMind learns the rules for winning as it goes along. When told to optimize on the score for Atari Breakout, DeepMind, quite to the surprise of its creators, started devising strategies on its own. One of these strategies was to break a channel in the wall. The system then bounced the ball up behind the wall, where the maximum amount of points could be scored with the minimal amount of effort (and risk).
Thus, machine learning surprises us like children can and complex adaptive systems cannot.
The three things we need for machine learning are:
- Lots of computing power.
- Lots of data.
- Decent ‘questions.’
Once the question is posed, the machine runs lots of simulations. The output of one simulation can be fed into the next one. This is how machines learn. This technique can be used to automate operations and to optimize around things like cost. To do this, you need two things. Firstly, you need a way to automatically iterate through millions of permutations of server configurations. Tools like HashiCorp’s Terraform already let you easily provision servers. Secondly, you need to be able to iterate through different configurations of an application. Containers already let you easily re-configure your application. Once a configuration is found, an application can be deployed. If the application is built with intelligent agents, they will run, recover from error and scale without any human assistance.
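That simulate-and-feed-forward loop can be sketched as a simple local search. Everything here is invented for illustration — the cost model, the configuration dimensions, the numbers; a real system would provision each configuration (e.g. with Terraform), deploy the containers, and measure actual cost and performance:

```python
# Local search over (cpus, mem_gb, replicas) configurations:
# each round perturbs the best configuration found so far, so the
# output of one "simulation" feeds into the next.

import random

CPUS = [1, 2, 4, 8]
MEM_GB = [2, 4, 8, 16]

def measure_cost(config):
    # Stand-in for a real benchmark run against a deployed config:
    # hourly price plus a penalty for under-provisioning.
    cpus, mem_gb, replicas = config
    hourly_price = (cpus * 0.04 + mem_gb * 0.005) * replicas
    latency_penalty = 100.0 / (cpus * replicas)
    return hourly_price + latency_penalty

def neighbour(config, rng):
    """Perturb one dimension of the previous configuration."""
    cpus, mem_gb, replicas = config
    dim = rng.randrange(3)
    if dim == 0:
        cpus = rng.choice(CPUS)
    elif dim == 1:
        mem_gb = rng.choice(MEM_GB)
    else:
        replicas = max(1, min(10, replicas + rng.choice([-1, 1])))
    return (cpus, mem_gb, replicas)

rng = random.Random(0)
best = (1, 2, 1)
best_cost = measure_cost(best)
for _ in range(500):
    candidate = neighbour(best, rng)
    cost = measure_cost(candidate)
    if cost < best_cost:            # feed the better configuration forward
        best, best_cost = candidate, cost
print(best, round(best_cost, 2))
```

Swap the toy cost function for a real deployment-and-benchmark step and you have the optimization loop described above, minus the machine-learning refinements that make the search itself smarter.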
Again, on the surface, this sounds like science fiction. However, Skipjaq is a product that uses machine learning techniques to optimize applications. Their product is already commercially viable. At the same time, Google’s DeepMind is being applied to energy management within the data center and reducing energy costs by 15 percent. It is through Skipjaq, and tools like Terraform and ContainerPilot, that we can start to glimpse the future.
The next wave of automation in the data center will be driven by containers, tools for provisioning hardware, advances in machine learning and the raw CPU power that Moore’s Law delivers every year. Each innovation and each recombination nudges us one step closer to the operatorless data center. We don’t have to look very hard to find evidence of this, with ContainerPilot, DC/OS and DeepMind pointing the way.
That being said, as we get closer to an operatorless data center, we will have to come to terms with the fact that what we are creating will displace thousands of workers. We must also face the fact that we are contributing to a wider trend that will displace millions of workers. This is why we cannot (and should not) talk about our work without considering the wider, societal effects.