How to Run Containers in Production Environments
The benefits of adopting a container-oriented development and deployment workflow are not fully realized if the adoption is only retained within the boundaries of development and test environments. The reluctance to run containers in production stems from concerns surrounding security and isolation, and a general lack of operational expertise in managing containers in a production environment.
In organizations that are at some stage of adopting containers, the decision to move them into production environments is a major consideration. It is easier to take the plunge when adopting containers for a completely new service or application, one that is ideally container native.
What does it mean to be container native? A container-native application is one that is designed and built around the lifecycle of containers and treats containers as first-class citizens. For applications that have been retrofitted to work with containers, the decision to move to production is usually harder. These are typically legacy applications, which often need extensive refactoring before they fit container-oriented development and deployment.
Understanding the impact on existing workflows and processes in the organization’s production environments is a significant aspect of operating containers in production. Here are some workflows and processes around production that may be impacted by containers:
- Moving a change or feature set from development to production.
- Allowing end users to access the change or feature set deployed in production.
- Debugging an issue reported in production.
- Monitoring applications in production.
- Updating application versions in production.
- Taking backup of data generated.
- Disaster recovery and business continuity.
- Capacity planning for production.
- Setting up a host machine, especially for networking and security configuration.
Operating a production environment without containers is the norm at most organizations in the midst of container adoption. In such organizations, virtual machines may take precedence as the deployment unit, serving the need for isolation and management of an application’s components. Most likely, these individual components are distributed across a set of VMs that run across multiple hosts in redundant configurations, which allows for high availability. If the production environment is shared with other applications, then the VM becomes an important tool for this isolation.
The operations team controls the VM management and lifecycle, with application code usually copied over to VMs via some form of automation and workflow. The VMs are rarely destroyed for new deployments and are instead reused, as revisions of code keep coming in for new deployments. Occasionally, these VMs also go through changes, ideally by reconstructing the VM itself using a new version of the golden image.
Alternatively, some organizations practice reprovisioning new sets of VMs by discarding the old ones when releasing changes to application or infrastructure in production. Netflix’s Aminator, and the practices around it, became a trendsetter on how VM-based management of production environments can still use the principles of immutability and disposability.
Proponents of containers increasingly advocate for running containerized workloads on bare metal. This helps avoid the performance and context switch penalties that arise from placing a hypervisor between the bare metal and the operating system. The argument is stronger for single-tenant systems, which do not share resources with untrusted tenants.
Organizations in the early stages of container adoption can try a hybrid strategy to deal with this situation: run some containerized applications inside VMs and some on bare metal. This combination can be fine-tuned by analyzing management and performance metrics over time. When a certain level of maturity is reached with operating containers, some adopters choose to use only bare metal for running their containers.
When running multiple applications on a shared production infrastructure, the decision to run containers on bare metal should be based on prior experience. A safer option is to have these multiple tenants wrapped around VMs, and let containers be the runtime deployment of your services inside these VMs. Each VM may host containers that belong to the same tenant, providing isolation against other tenants.
Figure 1: At low maturity, the tenants do not share the production infrastructure. This isolation can be a VM or a physical host. At high maturity, components of each tenant application are spread across the shared infrastructure. When considering containers, high maturity represents a shared VM or bare metal infrastructure, where components of each tenant application are spread.
Releasing Applications in Production
The ephemeral nature of containers allows for updates by launching code changes in each new container instance, rather than updating an existing instance. When a particular change is marked to go into production, a new set of containers is created using the new version of the container image, identified by a new tag. The new tag is ideally generated during the continuous integration (CI) process. The generated image and tag are stored in the container image registry, which is accessible to the production environment.
While it is possible to share the container registry among all environments, including development, test and production, in some cases a separate container image registry is used exclusively for the production environment. In this case, there needs to be a promotion process to take the “candidate” container image from the registry marked for development and test to the registry marked for production. An intermediary system could be used to pull the candidate image from the dev/test registry, re-tag it, and push it to the registry marked for production. Using this method, there is clear isolation of the images stored and available in the image registries between non-production and production environments.
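The promotion step described above can be sketched as a small helper that an intermediary system might use. This is a minimal illustration, not a prescribed tool: the function name and the registry hosts are hypothetical, and the helper only builds the standard `docker pull`, `docker tag` and `docker push` invocations without running them.

```python
from typing import List


def promotion_commands(
    image: str, tag: str, source_registry: str, prod_registry: str
) -> List[List[str]]:
    """Return the docker CLI calls that copy one candidate image to production.

    The registry hosts are supplied by the caller; nothing is executed here.
    """
    source_ref = f"{source_registry}/{image}:{tag}"
    prod_ref = f"{prod_registry}/{image}:{tag}"
    return [
        ["docker", "pull", source_ref],          # fetch the candidate from dev/test
        ["docker", "tag", source_ref, prod_ref],  # re-tag it for the production registry
        ["docker", "push", prod_ref],            # publish it to production
    ]
```

Each returned list could be passed to `subprocess.run(command, check=True)` by a CI job that holds push credentials for the production registry only.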
The transient nature of containers constrains the pre-existing practice of relying on static port bindings and the IP addresses of the host on which the application is deployed, a practice that simplified the configuration of network firewalls and switches. The best practice for releasing changes in production is to adopt rolling upgrades. This requires additional infrastructure in the production environment to modify existing load balancer and proxy configurations while a new set of container instances is launched alongside the existing version.
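The idea of a rolling upgrade can be sketched as a simulation, assuming instances are represented as "name:version" strings and that a health check is supplied by the caller. The function and parameter names are hypothetical, and a real rollout would also update the load balancer or proxy between batches.

```python
from typing import Callable, List


def rolling_upgrade(
    instances: List[str],
    new_version: str,
    batch_size: int,
    is_healthy: Callable[[str], bool],
) -> List[str]:
    """Replace instances batch by batch; stop at the first unhealthy batch.

    Each instance is a "name:version" string. Returns the resulting fleet
    and leaves the input list untouched.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    fleet = list(instances)
    for start in range(0, len(fleet), batch_size):
        batch = range(start, min(start + batch_size, len(fleet)))
        replacements = [f"{fleet[i].split(':')[0]}:{new_version}" for i in batch]
        if not all(is_healthy(candidate) for candidate in replacements):
            # Keep the remaining old instances in service and halt the rollout.
            break
        for i, candidate in zip(batch, replacements):
            fleet[i] = candidate
    return fleet
```

Halting on the first unhealthy batch keeps most of the old version serving traffic, which is the main advantage over a big bang release.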
To maintain some level of resilience when deploying containers, it would be wise to leverage what the container runtime and the platform around the orchestration tool offer. If you’re only using the container runtime without any orchestration tool, taking advantage of a runtime configuration like the “--restart” policy is crucial. The recommended approach is to choose between the “on-failure” and “unless-stopped” policies, picking the one that makes sense for your environment.
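As a rough sketch of how a deployment script might make the restart policy explicit, the helper below builds a `docker run` invocation with the `--restart` flag. The helper name is hypothetical; the policy values are the ones Docker accepts (`no`, `on-failure` with an optional retry count, `always`, `unless-stopped`).

```python
from typing import List, Optional

ALLOWED_POLICIES = {"no", "on-failure", "always", "unless-stopped"}


def run_command(
    image: str, restart: str = "unless-stopped", max_retries: Optional[int] = None
) -> List[str]:
    """Build a `docker run` call with an explicit restart policy."""
    if restart not in ALLOWED_POLICIES:
        raise ValueError(f"unknown restart policy: {restart}")
    if max_retries is not None:
        if restart != "on-failure":
            raise ValueError("max_retries only applies to on-failure")
        # Docker expresses the retry limit as a suffix, e.g. on-failure:5.
        restart = f"on-failure:{max_retries}"
    return ["docker", "run", "--detach", "--restart", restart, image]
```

For example, `run_command("product-api:25", "on-failure", 5)` yields a command that restarts the container after a non-zero exit, giving up after five attempts.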
Setting Up the Host Environment for Containers in Production
One important aspect in preparing your production infrastructure for running containers is to acknowledge the “sealed” nature of the container image. A container image demands a standardized infrastructure that remains similar in configuration across development, test and production environments. This standardization is crucial to getting a predictable experience when running containers in production. Here are some factors to consider:
- Choice of kernel.
- Container runtime version and choice.
- Network access and firewall configuration.
- Choice of security hardening solution.
- System access permissions.
It may be tempting to stray from standardizing all of these factors across environments; however, any deviation can lead to unpredictable results that may require investigation. The simplest way to address this is by ensuring that each environment, from development to test to production, is built with the same topology. It may be difficult to achieve at the beginning, but consider it the objective for container adoption. All environments use the same Linux kernel; they use the same host configuration, file system topology, network configuration and user roles. Standardizing all factors helps ensure that the container image which gets built and tested in integration remains the same right through to production.
Certain aspects of the runtime behavior of the container instance will be different, like URLs for accessing other dependent services, or log-levels. These runtime configurations are passed to the container using various options. One such option is to inject the runtime configurations via environment variables to the container instance. The environment variable could point to a service registry, which then points to the right dependent service. Tools like ZooKeeper and Consul are helpful in implementing this paradigm.
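A minimal sketch of this pattern from the application's side: the same image reads its environment-specific settings at startup. The variable names `SERVICE_REGISTRY_URL` and `LOG_LEVEL` are assumptions made for illustration, not a convention of any particular tool.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class RuntimeConfig:
    service_registry_url: str
    log_level: str


def load_config(env: Mapping[str, str] = os.environ) -> RuntimeConfig:
    """Read runtime settings from the environment instead of the image."""
    # Hypothetical variable names; the registry URL has no safe default.
    registry = env.get("SERVICE_REGISTRY_URL")
    if not registry:
        raise RuntimeError("SERVICE_REGISTRY_URL must be set")
    return RuntimeConfig(
        service_registry_url=registry,
        log_level=env.get("LOG_LEVEL", "INFO").upper(),
    )
```

Failing fast on a missing registry URL surfaces a misconfigured environment at container start rather than at the first request.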
If the host is being provisioned using existing configuration management and provisioning tools, like Chef, Puppet and Ansible, then it makes sense to keep a single configuration for all environments, as much as possible. For a scale-out infrastructure, the only difference remaining across environments should be the number of instances to provision, based on availability and performance requirements.
The host will also have to be occasionally updated with the latest kernel patches, container runtime version, etc., that cannot be sealed inside the container image. This change of host configuration could follow the rolling upgrade, which allows updates in stages, instead of taking a big bang approach.
Finally, like the host, there needs to be similar configuration and topology for supporting infrastructures, such as log aggregation, monitoring, metrics, service routing and discovery for all environments. This allows a level of consistency for the container images, not just with the host, but with all participating services in the respective environment.
Service Discovery of Containers in Production
Standardizing the service discovery mechanism is an important consideration for containerized applications. Containerized applications are rarely run the way applications are run on VMs: there is usually more than one container running on the same host. In this situation, if you’re using a bridged network mode, you may require a static port assignment for each container. This means that you pre-select the port exposed by the container on the host, and use that to configure the load balancer and proxy.
Reconfiguration of the load balancer and proxy may not be needed if there is only a limited need for adding more containers to the host. In most cases, however, containers are added or removed based on the load through the use of immutable deployments or scaling practices. In this case, it would be difficult to keep a static set of pre-assigned port numbers, and that is precisely where service discovery systems excel.
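To make the contrast with static port assignment concrete, here is a toy in-memory registry, written as a sketch only. The class and method names are hypothetical; production systems such as the service registries mentioned earlier add health checks, replication and expiry on top of this basic register and lookup cycle.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Endpoint = Tuple[str, int]  # (host address, dynamically assigned host port)


class ServiceRegistry:
    """Toy registry: containers announce themselves, clients look them up."""

    def __init__(self) -> None:
        self._endpoints: Dict[str, Set[Endpoint]] = defaultdict(set)

    def register(self, service: str, host: str, port: int) -> None:
        # Called when a container starts and learns its host port.
        self._endpoints[service].add((host, port))

    def deregister(self, service: str, host: str, port: int) -> None:
        # Called when a container stops or is replaced.
        self._endpoints[service].discard((host, port))

    def lookup(self, service: str) -> List[Endpoint]:
        # Sorted so that callers see a stable ordering.
        return sorted(self._endpoints.get(service, set()))
```

A load balancer or proxy would call `lookup` (or subscribe to changes) instead of holding a fixed list of ports.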
Production Support: Log Aggregation and Monitoring
The supporting services needed for containers in production remain the same as with non-containerized environments. This includes a way to capture logs from container instances and ship them to a centralized log management system. Built-in log backend support already exists in the Docker daemon, and custom solutions are in place to take care of this.
There are a number of options for managing log data produced by Docker containers. Docker’s default settings write logs uncompressed to disk with optional retention settings to limit storage usage. In a production environment, there are a number of logging drivers that ship with the Docker engine and provide flexible options for managing log data transparently to the applications producing it. If you already have a centralized logging solution in place, there is likely a Docker logging driver which can be configured to feed container log data to it. There are logging drivers for various established protocols such as Syslog, which can be used against a number of ingestion systems as well as cloud or SaaS-specific options.
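As an illustration of the two options above, the sketch below builds the `--log-driver` and `--log-opt` flags for `docker run`. The helper name and the collector address are hypothetical; `max-size` and `max-file` are options of the default `json-file` driver, and `syslog-address` is an option of the `syslog` driver.

```python
from typing import Dict, List


def logging_flags(driver: str, options: Dict[str, str]) -> List[str]:
    """Translate a logging choice into `docker run` flags."""
    flags = ["--log-driver", driver]
    for key in sorted(options):  # sorted for a reproducible command line
        flags += ["--log-opt", f"{key}={options[key]}"]
    return flags


# Default json-file driver, with retention limits to cap disk usage.
local_retention = logging_flags("json-file", {"max-size": "10m", "max-file": "3"})

# Syslog driver pointed at a hypothetical central collector.
central_syslog = logging_flags(
    "syslog", {"syslog-address": "udp://logs.example.com:514"}
)
```

Because the driver is configured on the runtime, the application keeps writing to standard output and needs no change.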
The ephemeral state of containers is a key behavior to note in container logging and monitoring. New containers replace old ones when deployment happens in production. This is a break from the traditional convention of assuming stateful and long-running compute units. Rotating the container fleet creates new problems for the traditional logging and monitoring mindset. Rather than having compute instances lying around for days to months, containers could be rotated within a window of an hour.
If you practice continuous delivery (CD), this window could be even shorter. A host-centric logging and monitoring solution cannot scale to the complexity of containers. Instead, because containers are short-lived, describe what to monitor declaratively, and monitor containers as a group rather than tracking each container in the fleet individually.
One way to associate containers within a group is by using the appropriate metadata, such as tags. A tag refers to the “image-tag” that would be running in the production environment. It is recommended to avoid using the misunderstood “latest” tag for Docker containers when implementing this.
For example, suppose you deploy the image of your application “product-api” with the image tag “25” and the environment variable set to “production.” This means this is the 25th image identified by the build system, and the container running is in the production environment. You could have multiple instances of this running across your container infrastructure at any given point in time. The tag will be changed with each new deployment, but the environment variable configuration will remain set to “production.”
Having a monitoring system watch out for container images that are currently running with the environment variable set to “production” creates the impression of a long-running production service, insulated from the continuous changes that are being made as a new version is deployed. If you are using orchestration tools, then you have access to a richer vocabulary of tags for use in grouping container instances.
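Group-level monitoring of this kind can be sketched as an aggregation over container metadata. The record type and field names below are assumptions for illustration; a real monitoring tool would obtain this metadata from the container runtime or the orchestration platform.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class ContainerInfo:
    image: str        # e.g. "product-api"
    tag: str          # build number assigned by the build system
    environment: str  # value of the environment variable
    healthy: bool


def group_health(
    containers: Iterable[ContainerInfo], environment: str
) -> Dict[str, float]:
    """Return the healthy fraction per image for one environment.

    Tags are deliberately ignored, so a new deployment does not reset the view.
    """
    total: Counter = Counter()
    healthy: Counter = Counter()
    for c in containers:
        if c.environment != environment:
            continue
        total[c.image] += 1
        healthy[c.image] += int(c.healthy)
    return {image: healthy[image] / total[image] for image in total}
```

Grouping by image and environment is what makes the service look long-running even while individual containers tagged “24” are being replaced by ones tagged “25.”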
The appropriate monitoring and logging strategy for containers in production is a non-intrusive solution that can stay out of the running container, ideally running as a “side-car” container service. Container monitoring tools need to be container aware and even container-platform aware. This awareness will help monitoring tools avoid confusion when reporting a failure, such as when a container is stopped on one host and moved over to another by the container platform. The container footprint will lead to an excess of data gathered by the monitoring and logging systems, thereby causing a need for additional data management.
Container monitoring is a space that is growing by leaps and bounds, and we are covering it in our fifth ebook, to be published later this year. There are various Software as a Service (SaaS) and on-premises tools that could help you make a decision around this in the meantime.
Approaches for Managing Container Data
The generally accepted advice for managing container data is to have stateless containers running in the production environment that store no data on their own and are purely transactional. Stateless containers store processed data on the outside, beyond the realm of their container space, preferably to a dedicated storage service that is backed by a reliable and available persistence backend. The concern about where state lives is even more pronounced with container instances that host storage services, like databases, queues and caches. For these stateful containers, the agreed-upon pattern is to use data containers. The runtime engines of these stateful services get linked at runtime with the data containers. In practical terms, this would mean having a database engine that would run on a container, but using a “data container” that is mounted as a volume to store the state.
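The data container pattern described above can be sketched with two Docker CLI invocations: one creates a container that only holds the volume, the other runs the engine with `--volumes-from`. The helper name, container names and mount path are hypothetical, and the commands are built rather than executed.

```python
from typing import List


def data_container_commands(
    engine_image: str, data_name: str, engine_name: str, mount_path: str
) -> List[List[str]]:
    """Build the CLI calls for a data container and the engine that uses it."""
    return [
        # The data container is created but never started; it only owns the volume.
        ["docker", "create", "--volume", mount_path, "--name", data_name, engine_image],
        # The engine container mounts every volume of the data container.
        ["docker", "run", "--detach", "--volumes-from", data_name,
         "--name", engine_name, engine_image],
    ]
```

With this split, the engine container can be replaced by a new image version while the volume owned by the data container stays in place.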
If you are running a clustered hosting environment using an orchestration platform, it is important to have a distributed storage solution, like Gluster and Ceph, to provide shared mount points. This is useful if the container instances move around the cluster based on availability.
Container Security and Key Management
Security is often a major concern, especially when running multiple container instances, each for a different tenant on a shared machine. This concern stems from a lack of trust and confidence that container technology will provide the right kind of isolation, as is expected from a VM-based implementation. However, treating containers as a replacement for VMs will not help with this concern. Container implementations, like Docker, provide a security blanket for applications. The container runtime abstracts away the complication of configuring fine-grained permissions on different namespaces, such as user, network and process.
If you are considering a multitenant deployment of containers on a shared infrastructure, use VMs to separate tenants, and use containers as the isolation medium between a tenant’s application components. As container deployments proliferate across the community, it will become necessary to avoid the VM wall and have all tenants share the same infrastructure.
The other aspect of running an immutable container instance is to avoid baking in any keys or credentials that need to be kept secret. There are multiple ways to solve this problem, from passing the secrets via environment variables, to protecting them with encrypted data containers that are mounted as volumes. However, these techniques have drawn criticism, and there is no available standard from container runtime providers.
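Both techniques can be sketched from the application's side as a runtime lookup that never touches the image contents. The function name, the file naming scheme and the lookup order are assumptions for illustration, and decryption of a mounted volume is left out of the sketch.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def read_secret(
    name: str,
    env: Mapping[str, str] = os.environ,
    secrets_dir: Optional[Path] = None,
) -> str:
    """Resolve a secret at runtime: mounted file first, then environment variable."""
    if secrets_dir is not None:
        # Hypothetical convention: one file per secret, named in lower case.
        candidate = secrets_dir / name.lower()
        if candidate.is_file():
            return candidate.read_text().strip()
    value = env.get(name)
    if value:
        return value
    raise KeyError(f"secret {name} was not provided at runtime")
```

Preferring the mounted file keeps the secret out of the process environment, which is one of the criticisms raised against the environment variable approach.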
The Road Ahead with Containers in Production
An important element in making container adoption successful in production is having an informed and educated community. Tools and practices will evolve to a state where running containers becomes the norm for most organizations. Until then, the unfinished battle to make this adoption happen lies within the organization. Helping every stakeholder, especially operations and security teams, deeply understand their requirements around containers is a task that requires a lot of work. The path to container adoption in production will be different from that of VMs, and will prioritize developer experience and operational simplification in addition to the benefits of resource utilization.