Q&A James Turnbull: The Art of Monitoring in the Age of Microservices

29 Jul 2016 9:28am

By now, forward-thinking system architects are starting to realize that moving to a microservices architecture requires an entirely new set of monitoring tools. Built for long-running physical servers, the traditional application monitoring tools are just not suited for keeping tabs on the ephemeral containers that may pop in and out of existence in less than a minute.

James Turnbull has been keeping tabs on the emerging practice of microservices monitoring, and recently released a book on the subject, “The Art of Monitoring,” which offers both technical specifics and overall strategies for monitoring dynamic systems.

By day, Turnbull is the chief technology officer for Kickstarter, before which he held technical leadership roles at Docker, Venmo and Puppet Labs. A busy man, he has also written books on Docker, LogStash, Puppet, Nagios and Linux.

We caught up with Turnbull to cadge a bit of free advice on microservices monitoring, as well as to learn more about what his new book could offer the project manager, system architect, system administrator or the chief technical officer.

What makes traditional monitoring tools not so well suited for the microservices/container architecture?

Traditional monitoring assumes that a host or a service is long lived. We’re now living in a world with the cloud and virtualization and containers; infrastructure is fast moving, and it changes really quickly. Containers appear and disappear far faster than you can set up a Nagios check for them. You have a whole landscape out there where dynamic fast-moving short-lived infrastructure exists. There’s very little in that sort of traditional monitoring space that addresses it.

Back in the day, you probably had a server, and it ran Apache, and it ran your website or a web service of some kind or application. The server probably lived for some time, first as a physical server and then as a virtual machine. It probably had a lifespan as long as the application. You knew it ran Apache, and you probably monitored its disk and its CPU and its memory. It rarely changed.

And then all of a sudden, the upstarts in your application or ops team say, “We’re going to run this Docker thing. Instead of a server, we’re just going to run our web app on a cluster of containers. And if we want more containers, we just add more containers.”

So the person doing the launching says, “Well, what are containers called?” An ops person says, “Well, it’s just a random string of numbers.” The monitoring person says, “Well, how do I monitor that?” “I don’t know. Go figure it out.”

Because what are you going to do? A container appears and disappears; its uptime might be measured in minutes or even seconds. Let’s say you’ve built a service that triggers on a transaction. That service only exists for a matter of seconds, but you want to track its performance in some way. You want to know that it’s running.

The Art of Monitoring: Table of Contents

  • Chapter 1: An Introduction to Monitoring
  • Chapter 2: Monitoring, Metrics and Measurement
  • Chapter 3: Events and metrics with Riemann
  • Chapter 4: Storing and graphing metrics, including Graphite and Grafana
  • Chapter 5: Host-based monitoring with collectd
  • Chapter 6: Monitoring hosts and services
  • Chapter 7: Containers - another kind of host
  • Chapter 8: Logs and Logging, covering structured logging and the ELK stack
  • Chapter 9: Building monitored applications
  • Chapter 10: Alerting and Alert Management
  • Chapters 11-13: Monitoring an application and stack
  • Appendix A: An Introduction to Clojure

So you have to figure out what the right abstraction is to be monitoring. Is it the container? Is it the service? Is it the number of containers inside a service? And you’ve got to do the traditional monitoring processes, by pinging a machine, saying, “Tell me what the state of this thing is. Can I connect to Apache on port 80? Do an HTTP GET and return some content.”

If you don’t know the name of this thing running your service, and you can’t guarantee it’s going to be there when you ping it, how do you monitor it? So that sort of presents a whole bunch of really interesting challenges.

Regarding possible software out there that companies could use, is it mostly just open source or are the commercial solutions coming along?

I think at the moment, particularly in the monitoring services and the container space, there’s a limited tool set. A lot of the commercial companies are obviously updating their services, things like Datadog and New Relic, to be container aware. But until recently, there weren’t a lot of hooks into the container. There weren’t a lot of APIs available for monitoring. You’d do the very basics, like querying the Docker daemon. If you had a service, you really had to write a health check of some kind.

So it’s a fairly immature market still. A lot of open source tools have started to become container aware as well, tools like Sensu and Nagios. But I think the fundamental problem is an architectural one and that’s that the current tools are not well-suited to a container or services architecture.

So what I’ve laid out in the book is that there are traditionally two types of monitoring. One is poll-based monitoring, which is what Nagios does: it polls a service, saying, “Are you there? Tell me what your status is.” The other is push-based monitoring.

So what happens in this case is that a service or a container might wake up and start running, and inside that container is a service discovery tool. It wakes up and tells the monitoring system, “I’m alive! Here’s the metrics from me.”

In that case, the monitoring system pings either some sort of configuration management database (CMDB) or a tool like etcd or Consul.

At some point, if that container goes away, it stops spitting metrics out, and your monitoring system goes, “Huh, I’m not getting any metrics anymore. Do I need to worry about that?” It can poll a discovery system to ask, “Is that container still a thing? Do I still need to care about it? Oh, it isn’t. You’ve turned it off. Okay. I don’t need to worry about it anymore.” Or it goes, “That container is supposed to be still there, and I’m not getting any more data from it. I should do something now. I should take some action. I should tell some human that one of these containers that I’m expecting to see, or this service I’m expecting to see, is gone.”
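
As a rough illustration of the push model described above, the following Python sketch tracks when each container last reported and, once one goes silent, consults a discovery lookup before deciding whether to alert. The `PushMonitor` class, its method names and the `is_registered` callback (a stand-in for a query against etcd, Consul or a CMDB) are hypothetical names chosen for this example, not part of any real tool.

```python
import time
from typing import Callable, Dict, List, Optional


class PushMonitor:
    """Records the last time each container pushed metrics (hypothetical sketch)."""

    def __init__(self, is_registered: Callable[[str], bool], timeout_s: float = 30.0):
        # is_registered is an assumed stand-in for a discovery lookup
        # (etcd, Consul, a CMDB); timeout_s is an arbitrary silence threshold.
        self.is_registered = is_registered
        self.timeout_s = timeout_s
        self.last_seen: Dict[str, float] = {}

    def receive(self, container_id: str, now: Optional[float] = None) -> None:
        """Called whenever a container pushes metrics ("I'm alive!")."""
        self.last_seen[container_id] = time.time() if now is None else now

    def check(self, now: Optional[float] = None) -> List[str]:
        """Return the IDs of containers that went silent but should still exist."""
        now = time.time() if now is None else now
        missing = []
        for container_id, seen in list(self.last_seen.items()):
            if now - seen <= self.timeout_s:
                continue  # still reporting
            if self.is_registered(container_id):
                missing.append(container_id)  # expected but silent: tell a human
            else:
                del self.last_seen[container_id]  # turned off on purpose: forget it
        return missing
```

A container that was deliberately removed from the registry is quietly dropped, while one that is still registered but silent is returned for alerting.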

Do you have any thoughts about the other side of the equation, the presentation of all these metrics to the admin?

So I think the really interesting thing about monitoring until recently is that traditional alerts tended to be email based. They tended to have an error message like, “Disk on host A is 88 percent. Warning.” Or, “Disk on host A is 95 percent. Critical.” If I look at that alert, it’s not very contextual, right? I don’t really understand whether I’ve got a problem or not.

So is 95 percent okay? Is it 95 percent of 15 terabytes? Has it grown from 88 to 95 in the last 30 seconds, and is it about to hit 100 percent and knock the machine over? Or is it going to grow incrementally? It’s just hit 95. It might get to 96 in another month.

So I get no context around that sort of alert.

If you are dealing with data that should be consumed by computers, for example, like a metric or a log entry, turn it into a structured form. If you’re dealing with data that should be consumed by a human, for example, like I’m going to be woken up at three o’clock in the morning with this alert from PagerDuty, then give me some context.

It means that if I open my phone or my laptop, and I read that it’s 95 percent disk on this machine, then my alert should contain the graphs showing me the disk growth over the last, say, 24 hours, as well as maybe some graphs with some additional state for that machine, like that machine’s CPU and memory are maxed out. Something is definitely happening on that machine. I can take some action.

Rather than spending the first minutes looking at an alert, trying to puzzle out if it is a problem or not, I can immediately get some context around that.
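
To make the idea of a contextual alert concrete, here is a small Python sketch that attaches a trend to the raw disk percentage by estimating how long until the disk fills. The function names, the straight-line estimate and the 24-hour severity cutoff are all assumptions made for illustration; they do not come from the book or from any alerting product.

```python
from typing import List, Optional, Tuple


def hours_until_full(samples: List[Tuple[float, float]]) -> Optional[float]:
    """Estimate hours until 100 percent from (hour, percent_used) samples.

    Uses the straight line between the first and last sample; returns None
    when usage is flat or shrinking.
    """
    (t0, p0), (t1, p1) = samples[0], samples[-1]
    if t1 <= t0 or p1 <= p0:
        return None
    rate = (p1 - p0) / (t1 - t0)  # percent per hour
    return (100.0 - p1) / rate


def build_alert(host: str, samples: List[Tuple[float, float]]) -> dict:
    """Bundle the current value with its trend so the reader gets context."""
    eta = hours_until_full(samples)
    return {
        "host": host,
        "disk_percent": samples[-1][1],
        "hours_until_full": eta,
        # Assumed rule: page as critical only if the disk fills within a day.
        "severity": "critical" if eta is not None and eta < 24 else "warning",
    }
```

With this rule, a disk that climbed from 88 to 95 percent in half an hour is flagged as critical, while one that took a month to move from 94 to 95 percent stays a warning, even though both alerts report the same 95 percent.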

If people are going to be on-call, if their sleep is going to be disturbed, if their family lives are going to be disrupted, you want to minimize the pain of that experience by not having them repeatedly work out things that are not important. But also if they are being woken up by things that are important, you want to reduce the time to understand a problem and fix it.

If you want to compare metrics across systems, you need a set of agreed-upon definitions of what is being measured and what the tools are that should actually do the measuring. Do you see issues with regard to either?

When you think about monitoring, you need to standardize what you are monitoring across your environment. So if I measure CPU or disk, I’m using the same metric, the same data, so I’m making an apples-to-apples comparison.

Say I have an application I’m monitoring, and I’m sampling transaction rates every ten seconds. But on the machine, I’m sampling the disk I/O at one-second intervals. If I try to correlate those two metrics, one of them is every ten seconds, and one of them is every second. With different resolutions, I could quite easily miss behavior in that ten-second window that might be impacting the metric at the one-second interval.

So in the book, I encourage people to choose a resolution and be consistent about it everywhere, so you’re always comparing things at the same granularity.
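
One way to compare metrics that were collected at different intervals is to bring the finer-grained series down to the coarser resolution first. The Python sketch below shows that idea under simple assumptions: `downsample` is a hypothetical helper, timestamps are whole seconds, and each bucket keeps both the mean and the maximum so a short spike is not averaged away.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def downsample(samples: List[Tuple[int, float]], bucket_s: int = 10) -> Dict[int, dict]:
    """Group (timestamp_seconds, value) samples into fixed-width buckets.

    Each bucket is keyed by its start time and keeps the mean and the max,
    so a brief spike inside the window stays visible at the lower resolution.
    """
    buckets: Dict[int, List[float]] = defaultdict(list)
    for ts, value in samples:
        buckets[ts - ts % bucket_s].append(value)
    return {
        start: {"mean": sum(vals) / len(vals), "max": max(vals)}
        for start, vals in sorted(buckets.items())
    }
```

Applied to one-second disk I/O samples, this yields one entry per ten-second window that can be lined up against a transaction rate sampled every ten seconds.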

And definitions too. Like if you create metrics like useast1.applicationgroup.applicationname.apache.httpcommand.get, that’s a schema, and you should treat it like one, so you know that every one of your applications that is counting get commands uses the same path, the same schema. So that you know that anyone in the organization understands what’s happening with this application and [could] compare the behavior of the application in useast1 with the behavior of the application on Heroku or in a Docker container or whatever.
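
Treating a metric path as a schema can be as simple as refusing names that do not fit the agreed shape. The Python sketch below parses the six-segment example path from the answer above; the field names (`location`, `group`, and so on) and the allowed character set are illustrative guesses, since the interview gives only the example path itself.

```python
import re
from typing import NamedTuple

# Assumed rule: lowercase letters, digits, underscores and hyphens only.
_SEGMENT = re.compile(r"[a-z0-9_-]+")


class MetricPath(NamedTuple):
    """Field names are illustrative labels for the six segments in the example."""
    location: str
    group: str
    application: str
    service: str
    category: str
    name: str

    def __str__(self) -> str:
        return ".".join(self)


def parse_metric(path: str) -> MetricPath:
    """Parse a dotted metric path, rejecting anything that breaks the schema."""
    parts = path.split(".")
    if len(parts) != len(MetricPath._fields):
        raise ValueError(
            f"expected {len(MetricPath._fields)} segments, got {len(parts)}"
        )
    for part in parts:
        if not _SEGMENT.fullmatch(part):
            raise ValueError(f"invalid segment: {part!r}")
    return MetricPath(*parts)
```

A check like this could run wherever metrics are registered, so that a path such as `apache.get` is rejected before it ever reaches a graph.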

Is there one metric or maybe a couple of metrics that are standard go-to metrics you look at first for improving the performance?

I don’t know that there is a single metric. I think it depends on what your application does and what the customer expectation of that application is. I first look at business metrics. I first look at that thing that my customer and the business cares about, and that may be, for example, response time.

Amazon cares deeply about how easy it is for someone to do a one-click order. One-click ordering must make Amazon an enormous amount of money. So if one-click ordering is not responsive, that is definitely going to cause people not to use that.

So I think about that, and I start measuring that, and then I drill down to the application and look for all the metrics related to that business metric. What are the things that roll up to produce that result? The speed of my API, the speed of my database transactions, things like that. Then underneath, I’ll look at the infrastructure metrics, like how much memory and disk and all that stuff. All of these things roll up into one thing, like the response time for one-click ordering.

Then if the response time for the one-click order suddenly changes, like it previously was three milliseconds and now it’s five milliseconds, I can drill down inside and ask which of the things have changed. I may see one machine is really weird. What’s going on there? The database system is really churning along. Aha! We should add an index for that particular table.

Then, as soon as we get the index, the response time drops back down to three milliseconds. I’m back on track, and my customer is happy.

What was the goal in publishing this book?

I wrote the book as a framework, as a potential approach. I chose some technologies that I like, but I also recommend a whole bunch of other technologies and so I provide the pros and cons of those alternatives.

Hopefully, it will provoke ideas and make you think about how you monitor your systems. At the very least, it should get people to think about how they could be doing monitoring differently. My intention was to present a roadmap towards leveling up your monitoring rather than a technology guide on how to implement some tools.

TNS research analyst Lawrence Hecht contributed to this article.
