What Do You Think About Designing Hardware for Containers on Metal?

At my company, RackN, we've been having a lot of container-on-metal discussions. Since we're hardware ops specialists, our partners look to us to create reference architectures (RAs). Our experience building OpenStack RAs, going back to when the team founded Crowbar at Dell, taught us to start with dialogue around a straw man instead of working with a chisel and block.
Containerized Metal Can Be Very Different From Virtualized Metal
Lately, we’ve been arguing about the merits of a few big servers versus lots of small servers. For this post, I’m taking the lots of small servers position.
In both cases, we’re assuming that:
- The containers run directly on metal using a basic Linux operating system (CoreOS versus CentOS? Let’s save that for a future post.)
- There’s no virtualization or shared storage.
- You are planning to run a container scheduling system such as Kubernetes, Mesosphere, Cloudsoft, StackEngine or Docker Swarm.
First, we have to figure out the likely constraints: network (net), memory (RAM), storage (disk) or compute (CPU). We'll start by assuming that our container workload is mainly handling user requests or brokering middle-tier service requests. For comparison, I created a table of other at-scale workloads (see the table at the bottom of this post).
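To make that sizing exercise concrete, here is a minimal Python sketch of the kind of arithmetic involved: given an assumed per-container resource profile and an assumed server spec, find which resource fills up first. Every number in it is illustrative, not a measurement or a recommendation.

```python
# Illustrative sketch only: find which resource fills up first for a given
# container profile on a candidate server. All numbers are assumptions for
# discussion, not benchmarks or vendor specs.

# Per-container demand for a request-handling, middle-tier style workload.
container = {"cpu_cores": 0.5, "ram_gb": 1.0, "disk_gb": 5.0, "net_mbps": 50.0}

# A candidate "small" server SKU (assumed).
server = {"cpu_cores": 8, "ram_gb": 32, "disk_gb": 200, "net_mbps": 2000}

def binding_resource(container, server):
    """Return (resource, containers_supported) for the tightest resource."""
    fit = {k: int(server[k] / container[k]) for k in container}
    tightest = min(fit, key=fit.get)
    return tightest, fit[tightest]

resource, count = binding_resource(container, server)
print(f"{resource} binds first; roughly {count} containers per server")
```

The point of the exercise is simply to identify the binding constraint before arguing about server size; change the assumed profile and a different resource binds.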
1. Containers Are Not Taking Over Every Workload (Like We Assumed for Virtualization)
This is our most critical assumption. It means we don't have to design container infrastructure that handles every workload, and we can focus on the strengths of container orchestration: short life cycles, intelligent placement, fault recovery and microservice architectures. We should also expect that large tasks are more likely to be split across many containers than to rely on multiple cores or threads within a single container (as a database would). The result is that container hosts can have less RAM, CPU and disk, while the network remains important.
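As a toy illustration of that split (all numbers assumed): a job made of single-core containers can be spread across whatever small hosts have room, while a scale-up process needs all of its cores on one box.

```python
# Toy contrast (assumed numbers): a scale-out job of 1-core containers versus
# a scale-up process that needs all of its cores on a single host.
import math

workers_needed = 16      # parallel 1-core containers the job wants (assumed)
small_host_cores = 8     # cores per small container host (assumed)

hosts_needed = math.ceil(workers_needed / small_host_cores)
print(f"scale-out: fits on any {hosts_needed}+ small hosts with spare cores")
print(f"scale-up: a single {workers_needed}-core process cannot land on an "
      f"{small_host_cores}-core host at all")
```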
2. Anticipating a Container Orchestration System Is Critical
This is because it allows us to design a more distributed system that can respond quickly to system faults. In that sense, we're finally able to bake the pets-versus-cattle analogy right into our design. If the scheduler reacts to system failures, then our workloads are automatically cattle; consequently, we can use a more distributed architecture in which individual nodes need less built-in fault tolerance.
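Below is a toy Python sketch of the scheduler behavior we are assuming (not any real scheduler's API; Kubernetes and its peers do far more): when a node dies, its containers simply get re-placed on the surviving nodes.

```python
# Toy sketch of the behavior we are assuming from a container scheduler
# (not any real scheduler's API): when a node fails, its containers are
# simply re-placed onto the surviving, least-loaded nodes.

nodes = {
    "node-1": ["web-a", "web-b"],
    "node-2": ["api-a"],
    "node-3": [],
}

def handle_node_failure(nodes, failed):
    """Move every container from the failed node to the emptiest survivor."""
    for container in nodes.pop(failed):
        target = min(nodes, key=lambda n: len(nodes[n]))
        nodes[target].append(container)
        print(f"rescheduled {container} from {failed} to {target}")

handle_node_failure(nodes, "node-1")
# With the scheduler absorbing failures for the fleet, no single host has to
# be engineered to survive on its own.
```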
3. Containers Handle Oversubscription and Dense Packing Better
Containers run in a shared environment and can rely on the operating system to allocate resources smoothly right up to the full capacity of the server. A VM, in contrast, can run out of RAM or CPU based on how it was allocated, even when the underlying system is not fully utilized. Designing for full loading means we do not need to leave a lot of headroom on individual systems; spare capacity shows up at the cluster level as inactive nodes rather than as average per-node utilization.
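Here is a back-of-the-envelope view, with assumed numbers, of what "spare capacity as inactive nodes" means compared to per-node headroom:

```python
# Back-of-the-envelope comparison (all numbers assumed for illustration):
# the same 25% of cluster headroom expressed two different ways.

total_ram_gb = 1024          # total RAM the cluster should offer (assumed)
headroom = 0.25              # fraction held back for spikes/failures (assumed)

# VM-style: every host carries its own headroom, so each box runs at ~75%.
big_hosts = 4
per_big_host = total_ram_gb / big_hosts
print(f"{big_hosts} big hosts x {per_big_host:.0f} GB, "
      f"each kept ~{1 - headroom:.0%} loaded")

# Container-style: pack active hosts near 100% and keep whole nodes idle.
small_hosts = 16
per_small_host = total_ram_gb / small_hosts
idle_hosts = round(small_hosts * headroom)
print(f"{small_hosts} small hosts x {per_small_host:.0f} GB, "
      f"{small_hosts - idle_hosts} packed full, {idle_hosts} held idle")
```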
Overall, we are looking at a larger number of less powerful servers. In aggregate, we may end up buying the same total RAM and compute; however, there can be substantial cost, reliability and performance benefits from buying it as a lot of smaller units.
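One assumed illustration of the reliability side: with the same aggregate capacity, the share of the cluster lost when a single server fails shrinks as the servers get smaller.

```python
# Illustrative failure-domain arithmetic (server counts are assumed):
# same aggregate capacity, different blast radius when one server dies.

for server_count in (4, 16, 64):
    lost = 1 / server_count
    print(f"{server_count:>3} servers: one failure removes {lost:.1%} of capacity")
```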
We'd like to hear from you about these assumptions. Our next step, getting into the "speeds and feeds" for the system, will be built on these assumptions, so it's critical that we get your input.
Support Material: Workload Table
| Workload   | Compute | Network | Memory | Storage |
|------------|---------|---------|--------|---------|
| Cloud/VMs  | High    | Mid     | High   | Low     |
| Big Data   | Mid     | Mid     | Mid    | High    |
| Database   | Mid     | High    | High   | High    |
| Containers | Mid     | Mid     | Low    | Low     |
By design, workloads are limited to at most two “high” categories.