While building containerized applications has become an established practice, some organizations still struggle to manage the containers themselves. Managing containers in production requires a deep understanding of the technology, along with unlearning habits and practices that no longer work. These practices span the life cycle of container management: logging, monitoring, notifications, remediation and metrics.
Containerized applications have to be designed with consideration for the various nonfunctional and container-native patterns that have emerged over time. For example, the twelve-factor application style has been widely advocated and adopted by microservices enthusiasts and developers. The twelve-factor principles require the application to be operation-ready, including design for fast startup and graceful shutdown, but they do not address the software entropy that such container-based systems acquire over time.
Containers Add Complexity to the Overall System
Adopting containers, especially in a microservices-based architecture, leads to increased entropy in the system. By entropy, we mean that the system becomes far more complex, and that complexity needs to be managed. The complexity comes from the increase in moving parts in a container-based environment. Moreover, the disposable and immutable nature of containers can encourage some teams to create more rapid delivery pipelines. This propensity towards change further fuels the overall system complexity. If not managed, it can lead to burnout, downtime, and disappointment in the adopted technologies.
Another important lesson is to have applications that are designed to handle the complexity from containers, a lesson that might remind many of the antifragile movement.
Understanding New Design Goals
To address these challenges, containerized applications can be built to be operation-ready from the start. Changes to the development process enable better monitoring and management of containers in production environments. These lessons translate into some high-level design goals for containerized application development.
- Applications running as containers should acknowledge that services could be interrupted at any time.
Interruption could be triggered by a variety of situations. In the case of Docker, a stop command sends the SIGTERM signal to the application, indicating the request to shut down, and follows it with SIGKILL if the process has not exited within a grace period (10 seconds by default). The application may therefore need to clean up before the container dies. Taking a hint from the twelve-factor principles, the application must shut down gracefully. For example, if the container is manipulating state, that state should be checkpointed and persisted to external storage, where other container workers can access it.
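The shutdown behavior described above can be sketched in Python. This is only an illustration under stated assumptions: the CHECKPOINT_PATH variable, the run_worker function and the shape of the state are hypothetical, and a local JSON file stands in for the external storage a real deployment would use.

```python
import json
import os
import signal
import tempfile
import threading

# Set by the signal handler; the work loop checks it between units of work.
shutdown_requested = threading.Event()


def handle_sigterm(signum, frame):
    """Record the shutdown request; the handler itself stays minimal."""
    shutdown_requested.set()


def save_checkpoint(state, path):
    """Write state via a temp file and rename, so readers never see a partial file."""
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
    os.replace(tmp_path, path)


def run_worker(items, checkpoint_path):
    """Process items until done or until shutdown is requested, then checkpoint."""
    state = {"processed": 0}
    for _ in items:
        if shutdown_requested.is_set():
            break
        state["processed"] += 1  # stand-in for real work
    save_checkpoint(state, checkpoint_path)
    return state


if __name__ == "__main__":
    signal.signal(signal.SIGTERM, handle_sigterm)
    # CHECKPOINT_PATH is a hypothetical setting, not something Docker provides.
    default_path = os.path.join(tempfile.gettempdir(), "checkpoint.json")
    print(run_worker(range(1000), os.environ.get("CHECKPOINT_PATH", default_path)))
```

The handler only sets a flag, and the loop decides when it is safe to stop. The cleanup has to finish inside the platform's grace period, so the checkpoint should stay small.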
- An application should expose an interface to allow for health checks that can be used by the container platform.
The accuracy of the health check implementation is critical to the functioning of this setup. Usually, an application emits health check information that indicates whether it can connect to the third-party or external interfaces it depends on. A failed health check is a good signal for the platform to remove the instance from the active configuration. Platforms also use this to decide which remediation activities to perform after an error in the container. This remediation could involve launching the same container image on a different host.
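As a rough sketch of such an interface, the Python below exposes a health endpoint using only the standard library. The /healthz path, port 8080 and the names of the checks are assumptions made for illustration; each container platform has its own conventions for how a health check is declared and probed.

```python
import json
import socket
from http.server import BaseHTTPRequestHandler, HTTPServer


def dependency_reachable(host, port, timeout=1.0):
    """Return True if a TCP connection to the dependency succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def health_report(checks):
    """Run the named check callables; return (HTTP status, JSON-ready body)."""
    results = {name: bool(check()) for name, check in checks.items()}
    healthy = all(results.values())
    return (200 if healthy else 503), {"healthy": healthy, "checks": results}


def make_handler(checks):
    """Build a request handler class bound to the given checks."""
    class HealthHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/healthz":  # hypothetical path
                self.send_error(404)
                return
            status, body = health_report(checks)
            payload = json.dumps(body).encode("utf-8")
            self.send_response(status)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(payload)))
            self.end_headers()
            self.wfile.write(payload)
    return HealthHandler


def serve(checks, port=8080):
    """Serve the health endpoint until the process is stopped."""
    HTTPServer(("0.0.0.0", port), make_handler(checks)).serve_forever()
```

An application might call serve({"database": lambda: dependency_reachable("db.internal", 5432)}) from a background thread, where db.internal is a placeholder host name. Returning 503 when any check fails gives the platform an unambiguous signal to act on.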
- There should be a mechanism to reliably identify a faulty instance and its root cause.
Diagnosing application issues in a containerized environment is an open invitation to chaos. The difficulty lies in identifying the exact container instance that caused the issue. Using a common identifier that spans the notification, monitoring and logging infrastructure is one way to solve this problem. A common approach is to use a business-specific identity, emitted with the notification and captured by the logging system. The log should then provide a mapping between the identifier and the name of the container that caused the issue.
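One possible way to carry such an identifier through the logs is sketched below with Python's logging module. The correlation_id field, the "orders" logger name and the container name passed in are hypothetical; in practice the container name would come from whatever the platform exposes.

```python
import logging


class ContainerContextFilter(logging.Filter):
    """Attach the container name to every record that passes through."""

    def __init__(self, container_name):
        super().__init__()
        self.container_name = container_name

    def filter(self, record):
        record.container = self.container_name
        # Records logged without an explicit identifier get a placeholder.
        if not hasattr(record, "correlation_id"):
            record.correlation_id = "-"
        return True


def build_logger(stream, container_name):
    """Create a logger whose lines pair the identifier with the container."""
    logger = logging.getLogger("orders")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter(
        "%(levelname)s correlation_id=%(correlation_id)s "
        "container=%(container)s %(message)s"))
    handler.addFilter(ContainerContextFilter(container_name))
    logger.addHandler(handler)
    return logger


def containers_for(correlation_id, log_lines):
    """Return the set of containers that logged the given identifier."""
    needle = f"correlation_id={correlation_id} "
    found = set()
    for line in log_lines:
        if needle in line:
            for token in line.split():
                if token.startswith("container="):
                    found.add(token[len("container="):])
    return found
```

A call such as logger.error("payment failed", extra={"correlation_id": "order-1042"}) then produces a line that an operator, or a log query, can map back to a single container.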
- The log format must provide full context to an operator.
Logging remains an important element in the arsenal for debugging issues in the production environment. With container environments, the log format must include context such as container ID, container image and container host. Some container platforms inject an environment variable that identifies the host, such as HOST, which the application can include in its logs; Docker itself sets HOSTNAME inside the container, which by default is the short container ID. Container platforms also attach metadata to their log streams, which helps identification and remediation activities.
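A log format along these lines could be sketched as a JSON formatter in Python. The example assumes that HOSTNAME holds the container ID (Docker's default behavior) and that HOST holds the host name as described above; CONTAINER_IMAGE is a hypothetical variable that the deployment would have to inject itself.

```python
import json
import logging
import os


def container_context(environ=os.environ):
    """Collect the identifying fields once, at startup."""
    return {
        # Docker sets HOSTNAME to the container's hostname, which defaults
        # to the short container ID.
        "container_id": environ.get("HOSTNAME", "unknown"),
        # Hypothetical variable; nothing sets it unless the deployment does.
        "container_image": environ.get("CONTAINER_IMAGE", "unknown"),
        # Assumes the platform injects HOST; not every platform does.
        "container_host": environ.get("HOST", "unknown"),
    }


class JsonContextFormatter(logging.Formatter):
    """Format each record as one JSON object that includes the context."""

    def __init__(self, context):
        super().__init__()
        self.context = dict(context)

    def format(self, record):
        entry = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        entry.update(self.context)
        return json.dumps(entry, sort_keys=True)
```

Attaching the formatter to a handler that writes to standard output, with handler.setFormatter(JsonContextFormatter(container_context())), keeps every line self-describing once it reaches a central log store.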
Making Agile Decisions with Containers
A fast and agile environment provides the ability to make quick decisions and iterate as needed. One of the cornerstones of the DevOps movement is the adoption of ideas around the observe, orient, decide and act (OODA) loop, a widely used principle for taking action in a constantly changing environment. Mapping the OODA practice to a containerized production environment would lead to the following inferences:
- Observe: This pertains to alerts and notifications that filter out the useful signals from the noise. This is possible through tools that receive events from the monitoring system when something goes wrong. Having a good signal-to-noise ratio at this stage is critical to the overall success of the process.
- Orient: Once access to the information is sorted out, it is used to identify the symptoms causing an issue. Getting information from the logging and monitoring system is the basis of orientation. You must be able to identify the exact source of information with minimal noise at this stage.
- Decide: Based on the symptoms identified during the orientation phase, you must decide what action to take to resolve the situation. An example action would be changing the group configuration or relocating to a new set of hosts. If the issue identified is related to the application logic, then rolling back to the previous configuration could be a possible fix.
- Act: The container platform and tools must allow actions to be taken quickly once they’re decided upon. Having access and permission to the container management tools is useful.
Container implementations in the enterprise must allow the OODA loop to be implemented and have fast transitions. The merit of any container management and monitoring system is measured by the accuracy of the information it provides and the speed with which actions can be taken.
Tools for Taming Complexity
The art of managing the chaos in a container-based environment has led to the creation of new tools and services that embrace the dynamic and autonomous nature of container management. Tools like StackStorm and Netflix Winston have inspired implementations to trigger automated workflows in the case of events, especially events that involve an issue with the environment. Tying this to your container platform can allow operation runbooks to execute in case of a fault with a container. This reduces manual intervention and engineering team burnout, which increases productivity.
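The idea of tying events to runbooks can be illustrated with a small dispatcher in Python. This is not the API of StackStorm or Winston; the RunbookRegistry class, the event type names and the event fields are all hypothetical, and the runbook only describes the action it would take.

```python
from collections import defaultdict


class RunbookRegistry:
    """Map event types to the remediation functions that should run for them."""

    def __init__(self):
        self._runbooks = defaultdict(list)

    def on(self, event_type):
        """Decorator that registers a function as a runbook for event_type."""
        def register(func):
            self._runbooks[event_type].append(func)
            return func
        return register

    def dispatch(self, event):
        """Run each matching runbook and return what each one reports."""
        return [runbook(event) for runbook in self._runbooks.get(event["type"], [])]


registry = RunbookRegistry()


@registry.on("container.unhealthy")
def restart_elsewhere(event):
    # A real runbook would call the container platform here; this stand-in
    # only describes the action.
    return f"reschedule {event['container']} away from {event['host']}"
```

Events with no registered runbook fall through with an empty result, which is the point where a human operator would still need to be notified.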
One of the concepts we discussed earlier was to monitor groups of containers instead of focusing on individual instances. Container labels and environment variables can be used to implement this practice. A tool like cAdvisor can capture the labels provided to any container on a host. If environment variables are used, cAdvisor also allows them to be captured using the --docker_env_metadata_whitelist runtime parameter.
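Once labels are attached to the metrics, grouping is a simple aggregation. The Python sketch below assumes samples have already been collected into dictionaries with "labels" and "cpu_percent" keys; that shape is invented for the example and is not cAdvisor's output format.

```python
from collections import defaultdict
from statistics import mean


def group_by_label(samples, label):
    """Aggregate per-container samples into one summary per label value."""
    groups = defaultdict(list)
    for sample in samples:
        key = sample["labels"].get(label, "unlabeled")
        groups[key].append(sample["cpu_percent"])
    return {
        key: {
            "containers": len(values),
            "mean_cpu": mean(values),
            "max_cpu": max(values),
        }
        for key, values in groups.items()
    }
```

Grouping by a label such as "service" turns a list of short-lived containers into a handful of stable series, which is what alerts and dashboards can reasonably track.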
Tracing calls between self-contained services in an architecture is difficult with traditional practices. Improving practices around tracing is an important part of continued success with microservices. Tracing platforms like OpenTracing will become commonplace in all container-based environments going forward. The Cloud Native Computing Foundation has adopted OpenTracing as a hosted project. There are also tools like Zipkin, an open source tracer for microservices, first developed by Twitter to track web requests as they bounced around different servers. There are also Sysdig tracers, which allow for open source tracing of everything from microservices down to system resource access time.
Taking actions iteratively in an OODA loop is an important part of container implementation. Platforms like Vamp allow workflows to be implemented for making canary release decisions based on containerized application metrics. Tools like this could act as a method of implementing the OODA loop and applying it to release and testing practices.
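The "decide" step of a canary release can be reduced to a small rule over application metrics. The Python below is a sketch of one such rule, not Vamp's workflow API; the thresholds (max_ratio, floor, min_requests) are illustrative defaults that a team would tune for its own traffic.

```python
def canary_decision(baseline_errors, baseline_requests,
                    canary_errors, canary_requests,
                    max_ratio=1.5, floor=0.01, min_requests=100):
    """Return "wait", "promote" or "rollback" for a canary release."""
    if canary_requests < min_requests:
        return "wait"  # not enough canary traffic to judge yet
    baseline_rate = baseline_errors / baseline_requests if baseline_requests else 0.0
    canary_rate = canary_errors / canary_requests
    # The floor keeps a single error from forcing a rollback when the
    # baseline error rate is zero or close to it.
    allowed = max(baseline_rate * max_ratio, floor)
    return "promote" if canary_rate <= allowed else "rollback"
```

Running the rule on each monitoring interval gives one pass through the loop: observe the metrics, orient against the baseline, decide, and let the platform act on the result.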
If you are running containers with process isolation, then finding a flagged process running among a set of containers across hosts is a challenging feat. Identifying the host that runs the flagged container is one part of the problem. Usually, this is solved through a host monitoring tool like the Cloud Native Computing Foundation’s Prometheus. Once you identify the host, you can perform process ID mapping between the host and the container. This requires identifying the root process ID and correlating it with the running container. Tools like Sysdig solve this problem and much more with little or no overhead on the container performance.
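On a Linux host, one way to correlate a process with its container is to read the process's cgroup membership. The Python sketch below assumes Docker-style cgroup paths that embed the full 64-character container ID; other runtimes and cgroup layouts name things differently, so treat the pattern as an assumption rather than a general rule.

```python
import re
from pathlib import Path

# Docker container IDs are 64 lowercase hexadecimal characters.
CONTAINER_ID = re.compile(r"[0-9a-f]{64}")


def container_id_from_cgroup(cgroup_text):
    """Return the first container ID found in cgroup text, or None."""
    match = CONTAINER_ID.search(cgroup_text)
    return match.group(0) if match else None


def container_id_for_pid(pid, proc_root="/proc"):
    """Look up the container ID for a host PID.

    Returns None when the process is gone, unreadable, or not in a container.
    """
    try:
        text = Path(proc_root, str(pid), "cgroup").read_text()
    except OSError:
        return None
    return container_id_from_cgroup(text)
```

Given the ID, the container's name and image can be looked up through the container runtime, which completes the mapping from a flagged host process to the workload that owns it.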
Cloud Foundry has a unique approach to solving container management and monitoring difficulties. It provides an abstraction of the containerized infrastructure in the form of well-designed APIs, a solid development environment, and logs and metrics for each application. These features make it easy for developers to adopt agile practices and gain visibility into their containerized production applications.
Organizations working in a hybrid setup, involving both containerized and traditional workloads, will have a hard time embracing this shift. The challenge is maintaining systems that revolve around different schools of thought. Legacy systems are usually long-running and non-disposable; they demand per-instance, granular approaches to monitoring and management.
Some organizations will want to experiment with a team that is independent and has few traditional system monitoring needs. However, the OODA loop is still a valuable approach to containerized applications and establishes common ground rules for both traditional and container-based environments.
Developers need to be more aware of new practices in monitoring, logging and service management in order to create containerized applications that will thrive. The changes needed to successfully adopt containers will result in cloud-native systems that are able to accomplish the goals of agile IT.
The Cloud Foundry Foundation, the Cloud Native Computing Foundation and Sysdig are sponsors of The New Stack.