Why Monitoring Can Save Microservices

Raygun sponsored this post.

All too often, monitoring is missing from the development phase of microservices applications. Developers need to design and build their code with monitoring in mind; doing so saves a lot of pain down the road once the services are deployed.
Systems such as Raygun’s Error Monitoring and Application Performance Monitoring (APM) can integrate monitoring into a microservices application in a few easy steps.
However, developers can and should also make monitoring more robust and effective throughout the process of creating microservices.
Here are a few ways to do that and why they matter.
The Big Picture
In a microservice architecture, business processes are distributed over more than one service. Consider an online system that sells books, for example. Each sale involves adding a book to the cart, accepting shipping information, processing a payment and adjusting inventory. In most microservice designs, a separate service handles each step. Monitoring a sale means following it across several services.
So, monitoring a microservice-based application requires processing events from several services. This is only possible if, first, the services provide the necessary information and, second, they report it in a shared grammar so you can collate the events.
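For example, the teams might agree on a minimal shared event shape that every service emits. Here is a sketch in Go; the type and field names are hypothetical, not a prescribed schema:

```go
package main

import (
	"encoding/json"
	"os"
	"time"
)

// SaleEvent is a hypothetical shared event shape. If every service emits
// events in this form, a central monitor can collate one sale end to end.
type SaleEvent struct {
	OrderID   string    `json:"order_id"`  // correlates all events for one sale
	Service   string    `json:"service"`   // which microservice emitted the event
	Event     string    `json:"event"`     // e.g. "payment_processed"
	Timestamp time.Time `json:"timestamp"` // when the event occurred
}

func main() {
	// The payments service reporting its step in a sale.
	e := SaleEvent{
		OrderID:   "order-1234", // illustrative ID
		Service:   "payments",
		Event:     "payment_processed",
		Timestamp: time.Now().UTC(),
	}
	json.NewEncoder(os.Stdout).Encode(e) // one JSON line per event
}
```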
Martin Fowler discusses this problem in his writing about microservices. He cites decentralized governance as one of the primary benefits of microservices: teams can own their designs and work on independent schedules. But this independence may come at a cost. When teams select different ways of representing data, centralized monitoring suffers.
Here is how Martin Fowler describes the problem:
“At the most abstract level, it means that the conceptual model of the world will differ between systems. This is a common issue when integrating across a large enterprise, the sales view of a customer will differ from the support view. Some things that are called customers in the sales view may not appear at all in the support view.”
However, Fowler doesn’t offer a specific solution for avoiding this problem without creating an extra layer of bureaucracy. All I can offer is that there is a difference between governance and cooperation, and eschewing one should not mean discarding the other.
Your organization may foster cooperation or mandate centralized control. Either way, you can take steps to make your application monitoring better. Let’s take a look at three of them.
Logging
Logs are the most fundamental monitoring mechanism you can add to your application. While many developers think of logs as a way to report errors, they are also a mechanism for reporting events and application state.
Managing Logs
In The Twelve-Factor App, Adam Wiggins describes logs as an event stream. It’s a useful abstraction for designing how your application generates log messages. Wiggins recommends delegating log file management to something outside of your application. Your code should hand log messages off to an external agent that delivers them to where they belong, and the infrastructure collects log information in a central system for indexing and search.
This frees developers from worrying about where logs go, but the onus is still on them to make the contents useful. What information a log message contains, and why the application creates it, are essential design decisions. Developers should design log messages like any other application artifact.
Wiggins’ description of logs as event streams is a good starting point for deciding what to put in them. Any application event that’s relevant to monitoring and analysis belongs in the stream.
Log Contents
So, if we return to the book sales application, we can identify six log messages right away:
- Item added to cart;
- Check-out started;
- Shipping information captured;
- Shipping cost calculated;
- Payment processed;
- Inventory adjusted.
Each message advances us through the sales process. If the messages include timestamps, they can also give us a picture of how well the system is performing. A critical point here is that logging is for more than just failures. By logging events, you can create a picture of the application’s internal operations and state. So, your code should log any event that you want to measure or track. There’s no reason to skimp on messages to save space or preserve the signal-to-noise ratio. A log management system like Elasticsearch makes it easy to find what you’re looking for. Add one to your infrastructure, and use an error reporting tool like Raygun’s Crash Reporting to process error logs and exceptions in a specialized system for analysis.
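As a minimal sketch of what that event stream might look like, here are the six sale events emitted as structured JSON on standard out using Go’s standard log/slog package (Go 1.21+); the event names, fields, and values are illustrative:

```go
package main

import (
	"log/slog"
	"os"
)

func main() {
	// JSON logs on standard out: the handler adds the timestamp and level,
	// and, per the Twelve-Factor approach, an external agent (FluentD,
	// LogStash, etc.) routes the stream to central storage.
	logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))

	orderID := "order-1234" // illustrative ID tying the events together
	logger.Info("item added to cart", "order_id", orderID, "sku", "book-42")
	logger.Info("checkout started", "order_id", orderID)
	logger.Info("shipping information captured", "order_id", orderID)
	logger.Info("shipping cost calculated", "order_id", orderID, "cost_usd", 4.99)
	logger.Info("payment processed", "order_id", orderID)
	logger.Info("inventory adjusted", "order_id", orderID, "sku", "book-42", "delta", -1)
}
```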
Just as important is identifying what belongs in a log message. Wiggins seems to recommend not using any logging library at all. But logging messages to standard output with no context is not an effective strategy. Most logging APIs add a time, a severity level, and a context such as a class or source file name. They can also be directed to standard output or to a file that can be processed by an external application like LogStash or FluentD. That makes them worth considering.
It’s also important to ensure that log messages are useful to everyone, not only developers. So, working with operations and other development teams on message contents is a critical step.
Health Checks
Logs are good for telling you when things happen, when they go right, and when they go wrong. Monitoring systems can read them and trigger alarms. But logging is reactive, not proactive. A complete monitoring solution needs active mechanisms as well.
Developers can add proactive monitoring to applications with health checks. These checks are a mechanism for the monitoring system to probe services at regular intervals. Depending on the result, the monitoring system raises alarms and displays the status on dashboards.
The checks can vary from returning a fixed value, like “up” or “down,” to the results of an operation. For example, a check can simulate a live transaction. In our bookselling example, each service might support a way to simulate its step in a transaction, such as adding a book to a mock shopping cart or manipulating the inventory level of a book that doesn’t exist in the catalog. This kind of health check verifies the health of both the service and the infrastructure it relies on, so it can provide a window into problems before a customer discovers them.
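As a minimal sketch, a service could expose an HTTP endpoint for the monitoring system to poll; the path, port, and payload here are assumptions for illustration:

```go
package main

import (
	"log"
	"net/http"
)

// healthHandler is a minimal sketch. A fixed "up" answer only proves the
// process is reachable; a deeper check could run a mock transaction, such
// as adding a book to a throwaway cart, before responding.
func healthHandler(w http.ResponseWriter, r *http.Request) {
	// In a real service, probe dependencies (database, payment gateway)
	// here and return 503 Service Unavailable if any probe fails.
	w.Header().Set("Content-Type", "application/json")
	w.Write([]byte(`{"status":"up"}`))
}

func main() {
	http.HandleFunc("/health", healthHandler)
	log.Fatal(http.ListenAndServe(":8080", nil)) // the monitor polls this endpoint
}
```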
Identifying these health checks should be part of the design and build phase of application development.
APM and Instrumentation
Developers can add application performance monitoring by building and shipping their code with a library that provides instrumentation. Some of these monitoring systems use their own dashboards, so integrating the service means working with operations. They may also need extra software on servers to collect data. Despite requiring extra coordination and installation effort, these tools provide valuable information for proactively discovering issues and improving system performance.
For example, Raygun’s APM and Crash Reporting supply libraries that link with the target application. The libraries report relevant statistics, logs, and other data to Raygun’s platform for reporting and graphs.
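Purely to illustrate the kind of measurement such a library wires in automatically (this is not Raygun’s API), here is a hand-rolled request-timing sketch in Go:

```go
package main

import (
	"log"
	"net/http"
	"time"
)

// timed wraps a handler and records how long each request takes. APM
// libraries add this kind of instrumentation automatically and ship the
// measurements to their backend; here we only log them locally.
func timed(name string, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		next.ServeHTTP(w, r)
		log.Printf("handler=%s duration=%s", name, time.Since(start))
	})
}

func main() {
	checkout := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("checkout complete\n")) // stand-in for real work
	})
	http.Handle("/checkout", timed("checkout", checkout))
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```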
Application Design Includes Monitoring
Developers need to play an active role in system monitoring. Retrofitting applications for monitoring after QA releases them doesn’t cut it in a microservice system. Monitoring requirements should be factored into the application design and treated with the same weight as business requirements. A microservice architecture gives development teams a great deal of independence and autonomy, but with that great power comes great responsibility.
Feature image via Pixabay.