Container Orchestration and Scheduling: Herding Computational Cattle
Forgive the analogy, but we are increasingly being asked to think of our infrastructure and applications as cattle. We care about cattle, but perhaps a little less than we care about our pets. We try not to be reckless with our cattle: we want them to graze on the best land, we want to move them periodically away from crowded or dangerous pasture, and if they do get ill, we want someone to tend them back to full health.
In the world of cloud-native container platforms, the orchestration and scheduler frameworks perform all of these roles for our application “cattle.” The best application “farmers” maximize resource utilization while balancing the constantly changing demands on their systems with the need for fault-tolerance.
There is a perfect storm forming within the IT industry that comprises three important trends: the rise of truly programmable infrastructure, which includes cloud, configuration management tooling and containers; the development of adaptable application architectures, including the use of microservices, large-scale distributed message/log processing, and event-driven applications; and the emergence of new processes/methodologies, such as Lean and DevOps.
With all this change, why should we care about a topic like container orchestration and scheduling? We should care for the same reason that successful farmers after the agricultural revolution cared about where they let their cattle graze.
What Type of Container Ranch are You Running?
There is currently a wide range of choice for hosting and deploying containerized applications. If we only consider modern cloud platforms, we can divide the landscape broadly into three categories: Infrastructure as a Service (IaaS), Containers as a Service (CaaS), and Platform as a Service (PaaS).
Each category has inherent strengths and weaknesses depending on the types of applications being deployed. Although some organizations will be tempted to choose simply based on intuition, analyst reports or existing vendor relationships, this is a fundamental decision that requires serious evaluation.
The choice of platform used heavily influences the type of orchestration and scheduling that can be implemented. The table below attempts to provide insight into the key abstractions, control planes and appropriate use cases for each of the three types. It’s important to note that Containers as a Service is still a hotly disputed category: many argue that what have traditionally been called container services are also a form of CaaS, whereas Docker proposes a different set of criteria, central to which is being cloud infrastructure agnostic. We think the wider market's interpretation of this product category still needs to be clarified.
Following our farming analogy, we can think of compute resources (CPU, memory, disk, networking, etc.) as land on which our application cattle graze. The role of orchestration and scheduling within a container platform is to match applications to resources (cattle to land) in the most effective and efficient way possible. At first glance, matching applications to resources efficiently may not appear particularly challenging, but optimizing resource usage is a form of the bin-packing problem, which is NP-hard. When we combine this fact with the volatile and ephemeral nature of the underlying cloud fabric, we definitely have a challenge on our hands.
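To make the bin-packing framing concrete, here is a minimal sketch of first-fit decreasing, a classic approximation heuristic for bin packing. It is not how any particular scheduler works; the application names, the single CPU dimension (in millicores) and the uniform node capacity are all assumptions chosen for illustration. Real schedulers weigh several resource dimensions plus constraints such as affinity and fault domains.

```python
def first_fit_decreasing(demands, capacity):
    """Place applications onto as few equal-sized nodes as the heuristic finds.

    demands:  dict of hypothetical app name -> CPU demand in millicores
    capacity: CPU capacity of every node in millicores (assumed uniform)
    Returns a list of nodes, each a list of the app names placed on it.
    """
    nodes = []  # each node tracks its remaining capacity and its apps
    # Consider the largest demands first; ties keep their original order.
    ordered = sorted(demands.items(), key=lambda item: item[1], reverse=True)
    for name, demand in ordered:
        if demand > capacity:
            raise ValueError(f"{name} does not fit on any node")
        for node in nodes:
            if node["free"] >= demand:
                # First node with enough room wins.
                node["free"] -= demand
                node["apps"].append(name)
                break
        else:
            # No existing node had room, so bring a new one into use.
            nodes.append({"free": capacity - demand, "apps": [name]})
    return [node["apps"] for node in nodes]


placement = first_fit_decreasing(
    {"api": 500, "web": 700, "batch": 300, "cache": 200, "worker": 300},
    capacity=1000,
)
print(placement)
```

The heuristic runs quickly and usually gets close to the optimum, but it offers no guarantee of finding the minimum number of nodes, which is the trade-off that makes approximate placement practical when the exact problem is NP-hard.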
Corralling Containerized Cattle
Creating a containerized application platform from an IaaS is akin to buying a bare plot of land and building your ranch from scratch: fairly arduous, but you have maximum control. Utilizing CaaS is much like buying grazing land and employing a professional herdsperson to build fences and shepherd the cattle around. This allows you to have visibility of the land, declare your intentions, and have someone else make the moment-to-moment decisions. Deploying applications to a PaaS is effectively trusting your cattle to another farmer. Arguably, here you get to focus on the stuff that really matters (buying and breeding the best stock, taking your cattle to market quickly), but you may not have visibility of the grazing land, and you may not always agree with how the other farmer runs their farm.
At OpenCredo, we have worked with multiple clients to design, build and operate containerized applications and platforms. The following sections share details from two case studies we produced. These examples are meant to demonstrate the decisions that go into creating these orchestration stacks.
Running Spring Boot Services on Mesos and Google Cloud Platform
With our first client, Apache Mesos was the orchestrator being implemented. The client wanted to mix long-running containerized application services and Spark-based batch data processing. Mesos itself is effectively a “data center kernel,” in that it abstracts compute resources from the cluster of infrastructure, and offers these resources to frameworks, such as Mesosphere’s Marathon for long-running services, Chronos for batch jobs, and Spark for data processing workloads. Deploying and running Mesos can be operationally complex; however, it has been proven to work at scale by the likes of Twitter, Airbnb, and eBay, and the ability to mix workloads and schedule applications across the entire data center is an attractive proposition.
Mesos was deployed and managed with Ansible, and containerized applications were built and deployed into the Marathon framework via Jenkins. Service discovery was provided by a combination of Consul, Registrator, and srv_router, and service restarts or re-allocations for the purpose of fault-tolerance were provided by Marathon. The QA team were able to spin up personal instances of the Mesos platform within Google Cloud Platform (GCP), which facilitated testing by reducing contention on resources (previously an issue). However, several hard lessons were learned, as the resource-constrained personal environments scheduled workloads differently to the full platform. For example, running multiple Mesos frameworks within a cluster of one machine often led to resource starvation for one of the frameworks.
Deploying Go-Based Microservices to Kubernetes on AWS
In this project, Kubernetes was deployed onto Amazon Web Services (AWS) via Terraform and Ansible, with the primary goal of providing a level of abstraction for developers above IaaS, but without investing wholesale into a PaaS. Google Container Engine (GKE) could not be used due to client/vendor restrictions. It is easily argued that the vast majority of organizations ultimately create some form of PaaS, because a typical development team requires many of the features a PaaS offers, e.g., testing, deployment, service discovery, storage, database integration and developer community facilitation. Accordingly, this case study did implement many of these features.
Kubernetes provided namespace isolation, which was useful for running multiple test and staging deployments on the same cluster; separate clusters were used where true isolation was required. Deployment was managed through the Kubernetes API or the kubectl CLI tooling, integrated with Jenkins, the continuous integration framework that was also responsible for building and testing our application services. Service discovery was provided out-of-the-box, as was a good level of fault tolerance, with the provision of node health checks and application/container-level liveness probe health checks. Storage and database integration was provided by the underlying IaaS layer, and development patterns were shared and re-used with the combination of a version control system (VCS) and an internal, well-maintained wiki.
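As an illustration of the application-level liveness checks mentioned above, the following is a minimal sketch of an HTTP health endpoint that a liveness probe could poll. It is not the code from the case study (those services were written in Go); the `/healthz` path, port 8080 and the single `self_test` check are assumptions made for the example. A Kubernetes HTTP probe treats any status code from 200 to 399 as success, so the sketch returns 503 when a check fails.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def health_response(checks):
    """Run each named check and return (HTTP status, JSON body).

    checks: dict of name -> zero-argument callable returning truthy when healthy.
    A check that raises is treated as failed rather than crashing the handler.
    """
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    status = 200 if all(results.values()) else 503
    return status, json.dumps(results, sort_keys=True)


# Hypothetical check; a real service would test something meaningful
# to the process itself, such as whether its worker loop is still running.
CHECKS = {"self_test": lambda: True}


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/healthz":  # assumed probe path
            self.send_error(404)
            return
        status, body = health_response(CHECKS)
        payload = body.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port=8080):
    """Block forever serving the health endpoint on the assumed port."""
    HTTPServer(("0.0.0.0", port), HealthHandler).serve_forever()

# To try it locally, call serve() and request http://localhost:8080/healthz
```

Keeping the check logic in a plain function, separate from the HTTP handler, makes it straightforward to unit test without starting a server.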
Lessons Learned From our Time on the Ranch
The aim of this article is to provide input on choosing a container platform and, based on your requirements, the associated orchestration and scheduling mechanisms. However, this is based primarily on two example client case studies. There are numerous other product combinations and resulting stacks, ranging from Docker Swarm combined with Mesos to PaaS-focused cloud orchestration builds with IBM.
Here are some of our key learning experiences so far at OpenCredo.
- Automate as much as possible, with the goal of time saving, repeatability at scale, and the reduction in deviation of infrastructure/configuration.
- Discourage, but allow, manual intervention, especially when introducing this new container technology into an organization. We all know that you shouldn’t SSH into production hosts, but leaving this method of ingress open as a last resort measure can be a business saver.
- Semantic errors — read “business impact” errors — are much more important in the cloud and container world than underlying infrastructure failures. Ultimately, it doesn’t matter if you lose half of your production environment, as long as the system gracefully degrades and customers can still get value from the applications. Too many times we have seen operations teams inundated with infrastructure failure alerts, but they do not know the actual user-facing impact.
- Learn about Brendan Gregg’s Utilization, Saturation and Errors (USE) and Tom Wilkie’s Rate, Errors and Duration (RED) methodologies. We found these approaches to be incredibly useful for addressing performance issues.
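The RED method mentioned in the list above can be sketched in a few lines. This is only an illustration under stated assumptions: the `Request` record, the choice of HTTP 5xx as the error definition, and the nearest-rank 95th percentile are all choices made for the example, and a production system would normally compute these from a metrics pipeline rather than from an in-memory list.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical record of one handled request."""
    duration_ms: float
    status: int


def red_summary(requests, window_seconds):
    """Summarize Rate, Errors and Duration for requests seen in one window."""
    count = len(requests)
    if count == 0 or window_seconds <= 0:
        return {"rate_per_s": 0.0, "error_ratio": 0.0, "p95_ms": None}
    # Errors: here assumed to mean server-side failures (HTTP 5xx).
    errors = sum(1 for r in requests if r.status >= 500)
    durations = sorted(r.duration_ms for r in requests)
    # Nearest-rank 95th percentile, using integer maths to avoid float rounding.
    rank = (95 * count + 99) // 100
    return {
        "rate_per_s": count / window_seconds,
        "error_ratio": errors / count,
        "p95_ms": durations[rank - 1],
    }


sample = [Request(duration_ms=10 * i, status=200) for i in range(1, 19)]
sample += [Request(duration_ms=190, status=500), Request(duration_ms=200, status=503)]
print(red_summary(sample, window_seconds=10))
```

Reporting a high percentile rather than a mean keeps slow outliers visible, which tends to matter more for user-facing impact than average latency.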
Monitoring and Resource Contention
- There must be visibility at all levels. Container monitoring and management is a major concern for companies implementing containers in production.
- Instrument applications appropriately, and log only essential information.
- Provide business-level (semantic) metrics. This is especially useful for driving adoption with key stakeholders and also helps greatly when diagnosing failures.
- Network: make sure you appropriately size your compute instance if creating your platform. For example, your application may only require the CPU and memory provided by a “small” instance, but have you confirmed the associated network bandwidth?
- Extensive monitoring is essential at the disk and CPU levels. With the CPU, watch for steal time if you’re deploying onto a public cloud.
- Lack of entropy can be a tricky issue to debug, and is surprisingly common in a containerized environment where multiple containers are contending over the host /dev/random. Containerized applications will often hang for no apparent reason, and upon debugging it is observed that a security operation (e.g., session or cryptographic token generation) is blocking.
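The steal-time check suggested in the list above can be sketched as follows. This is a minimal example assuming a Linux host, where the aggregate `cpu` line of `/proc/stat` lists cumulative counters in the order user, nice, system, idle, iowait, irq, softirq, steal. The function names are hypothetical, and what counts as a worrying percentage is for you to decide.

```python
def parse_cpu_counters(stat_text):
    """Return the first eight counters from the aggregate 'cpu' line.

    Order: user, nice, system, idle, iowait, irq, softirq, steal.
    Guest time is already included in user time, so it is left out.
    """
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            values = [int(v) for v in fields[1:9]]
            # Very old kernels report fewer fields; pad so steal reads as 0.
            return values + [0] * (8 - len(values))
    raise ValueError("no aggregate 'cpu' line found")


def steal_percent(before_text, after_text):
    """Percentage of CPU time stolen by the hypervisor between two samples."""
    before = parse_cpu_counters(before_text)
    after = parse_cpu_counters(after_text)
    deltas = [b - a for a, b in zip(before, after)]
    total = sum(deltas)
    if total <= 0:
        return 0.0
    return 100.0 * deltas[7] / total


# Usage on a Linux host: read /proc/stat twice, a few seconds apart.
#   with open("/proc/stat") as f: first = f.read()
#   time.sleep(5)
#   with open("/proc/stat") as f: second = f.read()
#   print(steal_percent(first, second))
```

Because the counters are cumulative since boot, the difference between two samples is what reveals current contention; a single reading says little.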