Help! My Microservice Crashed: A Guide for First Responders
I’m not going to lie to you. Having a microservice crash in the cloud and then trying to figure out exactly what happened is incredibly frustrating. It’s easy enough to get another instance up and running, but without knowing why the previous one failed, you have no guarantee that the new instance will be any more stable than the one it replaced.
In this article, we’re going to look at strategies you can employ to better understand the circumstances that caused your microservice to crash. Knowing why it failed gives you the opportunity to make changes that increase the stability of your microservice and, hopefully, prevent the same problems from recurring.
When It’s Too Late
Unfortunately, I don’t have a great deal of useful advice for you if you’ve found this article because your microservice just crashed and you’re desperately trying to figure out why. If you prepare your support infrastructure and monitoring ahead of time, you’ll be better able to identify which microservice crashed and why. Proper preparation is the main topic of this article, but here are a few tips so that I don’t leave you hanging.
Identify the Microservice at the Root of the Problem
In a microservice environment, each service may depend on several other services. Because of these dependencies, a failure in one service often surfaces as an apparent failure in another service that depends on it. Identifying which service is actually failing is the first step. If you suspect a service of being the cause of the failure, first validate that all of its dependencies are operational and responding as expected.
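As a rough illustration of that check, here is a minimal Python sketch. It assumes you can describe your services as a dependency map and can probe each one for health; the function name, the map and the `is_healthy` callback are all hypothetical and not part of any particular tool. It reports the unhealthy services whose own dependencies are all healthy, which makes them the most likely starting points for an investigation.

```python
from typing import Callable, Dict, List


def find_root_failures(
    dependencies: Dict[str, List[str]],
    is_healthy: Callable[[str], bool],
) -> List[str]:
    """Return unhealthy services whose own dependencies are all healthy.

    `dependencies` maps a service name to the services it calls.
    `is_healthy` is a hypothetical probe, for example a wrapper around
    whatever health endpoint your services expose.
    """
    # Collect every service mentioned as a caller or as a dependency.
    services = set(dependencies)
    for deps in dependencies.values():
        services.update(deps)

    # Probe each service once so slow checks are not repeated.
    health = {name: is_healthy(name) for name in services}

    roots = []
    for name in sorted(services):
        if health[name]:
            continue
        # A failing service is a likely root cause only if nothing it
        # depends on is also failing.
        if all(health[dep] for dep in dependencies.get(name, [])):
            roots.append(name)
    return roots
```

For example, if a hypothetical checkout service calls payments, payments calls ledger, and all three look unhealthy, the function returns only ledger, because the other two failures can be explained by it.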
Preserve the Evidence
In a production environment, you’ll want to get the service back up and running as soon as possible. Often this involves instantiating a new virtual machine or a new cluster of machines. When that happens, automated processes may destroy the previous instances or cluster. Take steps to preserve those instances if possible. Log files, memory dumps and the output of diagnostic tools run on the offending machine may provide important clues about what caused the failure.
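If you can’t keep the failed instance itself, you can at least copy its evidence somewhere safe before it is torn down. The following is a minimal sketch using only the Python standard library; the function name, the label and the paths you pass in are assumptions for illustration, and in practice the destination should be storage that outlives the instance.

```python
import tarfile
import time
from pathlib import Path
from typing import Iterable


def archive_evidence(paths: Iterable[str], dest_dir: str, label: str) -> Path:
    """Bundle log files and dumps into a timestamped tar.gz under dest_dir.

    Paths that do not exist are skipped, so the same list can be reused
    across machines that do not all produce the same files.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)

    # A UTC timestamp keeps archives from repeated crashes apart.
    stamp = time.strftime("%Y%m%dT%H%M%SZ", time.gmtime())
    archive = dest / f"{label}-{stamp}.tar.gz"

    with tarfile.open(archive, "w:gz") as tar:
        for raw in paths:
            path = Path(raw)
            if path.exists():
                # Directories are added recursively by tarfile.
                tar.add(path, arcname=path.name)
    return archive
```

A step like this could run as part of the teardown hook for an instance, so the archive is written before automation reclaims the machine.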
Preparation is Key
Hopefully, you’re reading this article in preparation for moving to a microservice architecture and are planning a comprehensive support strategy. A well-designed microservice architecture allows individual teams to implement independent functionality and create an environment which is highly scalable, loosely coupled and can support the rapid introduction of new features.
These benefits to the development process come at a cost: the resulting environment can be difficult to monitor with traditional production monitoring tools. You’ll want to investigate and implement an application performance monitoring (APM) tool which allows developers to understand the current state of their applications, automates many of the processes involved in supporting a production system, and generates alerts when key indicators fall outside of defined performance limits.
Selecting a Good APM
An APM typically consists of an agent which is installed on microservice machines or containers and communicates with a central APM server or cluster of servers. In some cases, the APM agent may itself run as a standalone container, especially for services that require deeper instrumentation. The advantage of such an approach is that it decouples the monitoring of the service from the actual service itself. It also allows for analytic data to be persistent and actively monitored, even if the target machine fails or shuts down unexpectedly.
Ideally, the APM solution which you select and implement should include trending or machine learning to support fully automated monitoring. The APM tool should also include dynamic baselining and rules, and allow users to customize their own rules which can then be used to generate alerts before systems degrade to the point of being unusable.
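To make the idea of dynamic baselining a little more concrete, here is a minimal Python sketch. It is not how any particular APM product works; it simply assumes a rolling window of recent samples and flags a new value that sits more than a chosen number of standard deviations from the window’s mean. The class name and the default window, threshold and minimum sample count are all illustrative choices.

```python
from collections import deque
from statistics import mean, pstdev


class RollingBaseline:
    """Flag samples that deviate sharply from a rolling baseline."""

    def __init__(self, window: int = 60, threshold: float = 3.0, min_samples: int = 10):
        self.samples = deque(maxlen=window)  # oldest samples fall off automatically
        self.threshold = threshold           # allowed deviation, in standard deviations
        self.min_samples = min_samples       # do not alert until a baseline exists

    def observe(self, value: float) -> bool:
        """Record a sample and return True if it looks anomalous."""
        anomalous = False
        if len(self.samples) >= self.min_samples:
            baseline = mean(self.samples)
            spread = pstdev(self.samples)
            if spread > 0:
                anomalous = abs(value - baseline) > self.threshold * spread
            else:
                # Perfectly flat history: any change at all is a deviation.
                anomalous = value != baseline
        # The sample joins the window either way, so the baseline adapts.
        self.samples.append(value)
        return anomalous
```

Because the baseline moves with the data, a rule like this can catch a response time that is unusual for this service at this moment, which a single fixed limit would miss.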
Finally, the APM solution should support visualizations of services in real time and historically to enable developers to understand the state of the machine concerning metrics such as memory, processing and network communication. The collection, aggregation and indexing of log data is also a vital component of a comprehensive monitoring solution.
Tracking, Fault Tolerance and Defensive Coding
In addition to deploying a comprehensive APM solution, there are development best practices that you should follow when building your microservices to make support and troubleshooting easier for whoever is tasked with operational support when things fail.
Always assume that aspects of your system will fail. Even if your code is perfect, it’s still dependent on infrastructure, dependent services and unpredictable user input. Code your application in such a way as to handle failures of any dependencies gracefully, and in a manner which makes identifying the source easy. Add comprehensive exception handling, descriptive log messages and validations where you can.
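One way to approach this, sketched in Python under a few assumptions: each call to a dependency is wrapped in a helper that retries a small number of times, logs which dependency failed and why, and returns a fallback value instead of crashing. The helper name, the logger name and the retry defaults are hypothetical, and catching every exception is a simplification that real code would usually narrow.

```python
import logging
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")
log = logging.getLogger("orders")  # hypothetical service name


def call_dependency(
    name: str,
    operation: Callable[[], T],
    fallback: Optional[T] = None,
    attempts: int = 3,
    backoff_seconds: float = 0.5,
) -> Optional[T]:
    """Call a dependency, retrying with backoff and logging each failure."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception as exc:  # broad on purpose for this sketch
            # Name the dependency in every message so the source of the
            # failure is obvious when reading aggregated logs.
            log.warning(
                "dependency=%s attempt=%d/%d failed: %r",
                name, attempt, attempts, exc,
            )
            if attempt < attempts:
                # Exponential backoff: 0.5s, 1s, 2s, ... by default.
                time.sleep(backoff_seconds * 2 ** (attempt - 1))
    log.error("dependency=%s unavailable after %d attempts; using fallback", name, attempts)
    return fallback
```

The design choice here is to degrade rather than fail: the caller gets a fallback it can act on, and the log lines record exactly which dependency was at fault.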
When passing messages or requests between services, add a correlation token or trace ID to all requests originating from the same client. These tokens will prove invaluable when you’re trying to track a client’s request through multiple services, especially if you are aggregating all logs into an indexed log aggregation system.
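Here is a minimal Python sketch of that idea using only the standard library. The header name `X-Correlation-ID` is an assumption (use whatever your services agree on), as are the function and class names. An incoming request either carries a token or is given a new one, the token is attached to every outbound request, and a logging filter stamps it on every log record.

```python
import contextvars
import logging
import uuid
from typing import Dict, Mapping, Optional

TRACE_HEADER = "X-Correlation-ID"  # assumed header name, not a standard
_trace_id = contextvars.ContextVar("trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Attach the current trace ID to every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = _trace_id.get()
        return True


def begin_request(headers: Mapping[str, str]) -> str:
    """Reuse the caller's trace ID, or mint one at the edge of the system.

    Note: this lookup is case-sensitive; a real web framework usually
    provides a case-insensitive header mapping.
    """
    trace_id = headers.get(TRACE_HEADER) or uuid.uuid4().hex
    _trace_id.set(trace_id)
    return trace_id


def outbound_headers(extra: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Build headers for a downstream call, carrying the trace ID along."""
    headers = dict(extra or {})
    headers[TRACE_HEADER] = _trace_id.get()
    return headers
```

With the filter attached to a handler and `%(trace_id)s` included in the log format, searching your log aggregation system for one token should return that request’s entries from every service it touched.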
Finally, deploy your microservices in such a way as to enable fault tolerance. The use of software-based load balancers and distribution of a single microservice across multiple virtual machines (and if possible, in geographically distinct places) will help minimize downtime for your users if one of the instances fails.
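A production load balancer does far more than this, but the core behavior can be sketched in a few lines of Python. The class below is a hypothetical client-side illustration, not a replacement for a real load balancer: it rotates through a list of instance addresses and, when one raises a `ConnectionError`, moves on to the next so that a single failed instance does not fail the request.

```python
from typing import Callable, List, Optional, Sequence, TypeVar

T = TypeVar("T")


class RoundRobinClient:
    """Spread calls across instances and skip any that are unreachable."""

    def __init__(self, instances: Sequence[str]):
        if not instances:
            raise ValueError("at least one instance is required")
        self._instances: List[str] = list(instances)
        self._next = 0  # index of the instance to try first on the next call

    def call(self, request: Callable[[str], T]) -> T:
        """Run `request` against the first instance that responds."""
        last_error: Optional[Exception] = None
        count = len(self._instances)
        for offset in range(count):
            index = (self._next + offset) % count
            try:
                result = request(self._instances[index])
            except ConnectionError as exc:
                # This instance is down or unreachable; try the next one.
                last_error = exc
                continue
            # Start the next call after the instance that just succeeded.
            self._next = (index + 1) % count
            return result
        raise RuntimeError("all instances failed") from last_error
```

Here `request` is any function that takes an instance address and performs the call, so the same rotation logic works regardless of the transport you use.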
To learn more about monitoring microservices, get the free eBook: Container Monitoring & Management.
CA Technologies is a sponsor of The New Stack.