For this edition of The New Stack Analysts podcast, Alex welcomes back SignalFx founder and CTO Phillip Liu, who is joined this time by Karthik Rau, SignalFx founder and CEO. Also participating is James Turnbull, VP of engineering at Kickstarter, and co-host Donnie Berkholz of 451 Research.
For more episodes, check out the podcast section of The New Stack.
This podcast is also available on YouTube.
Karthik explains that SignalFx was designed from the ground up to provide better monitoring and visibility for modern distributed applications: “We’re seeing new applications being designed as distributed applications, and more microservices architectures, and it’s quite common to find applications with dozens, hundreds, and potentially even thousands of components.”
“In such environments,” he says, “you really have to re-think how you do monitoring. If you’re simply doing health checks on individual nodes, you can just create a lot of noise. What we’re finding is that progressive organizations are looking at more of an analytics-based approach to monitoring, where you collect data up and down the stack, and across different services, centralize it, and run analytics to try to identify patterns.”
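The analytics-based approach Karthik describes — collect metrics, centralize them, and run analytics to find patterns rather than alerting on individual node health — can be sketched with a simple rolling mean/standard-deviation check. This is a hypothetical illustration of the general idea, not SignalFx’s actual pipeline; the class and metric names are invented:

```python
from collections import defaultdict, deque
from statistics import mean, stdev

class MetricAnalyzer:
    """Centralize metric samples per service and flag readings that
    deviate sharply from recent history (a rolling-statistics sketch)."""

    def __init__(self, window=30, threshold=3.0):
        self.window = window        # number of samples of history to keep
        self.threshold = threshold  # stddevs from the mean => anomaly
        self.history = defaultdict(lambda: deque(maxlen=window))

    def record(self, service, value):
        """Ingest one sample; return True if it looks anomalous."""
        hist = self.history[service]
        anomalous = False
        if len(hist) >= 10:  # wait for a baseline before judging
            mu, sigma = mean(hist), stdev(hist)
            if sigma > 0 and abs(value - mu) > self.threshold * sigma:
                anomalous = True
        hist.append(value)
        return anomalous

analyzer = MetricAnalyzer()
for v in [100, 102, 99, 101, 100, 103, 98, 100, 101, 99, 100]:
    analyzer.record("checkout.latency_ms", v)  # steady baseline
print(analyzer.record("checkout.latency_ms", 400))  # prints True: the spike stands out
```

The point of the sketch is that no single sample triggers an alert on its own — each reading is judged against the pattern across the stream, which is what distinguishes this approach from a per-node health check.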
When Alex asks about this radical shift in the approach to monitoring, James asserts that “traditional monitoring was already broken.”
“With things like Nagios, where you actually care about things like fault detection, you would wire up all of your services and all of your hosts, and ping them, and ask them about themselves, and if Nginx or Apache stopped working it would send you an alert,” says James.
“People ended up with mailboxes or pager queues full of alerts saying that something had stopped, and that’s the most insight you would ever get out of that sort of monitoring.”
“That doesn’t scale,” observes James. “It’s not very useful. If you get woken up at 4 o’clock in the morning with a message saying, ‘Apache has stopped,’ how much use is that to you? As a result, operations people started ignoring those alerts,” James says, adding that this approach was long overdue for an update.
“People want to work with monitoring in terms of logical applications, whether it’s microservices or not,” Donnie adds.
“A lot of companies are scaling up their IT footprint without scaling up the team behind their IT footprint,” says Donnie. “Cloud-driven companies today are trying to maintain something like 10,000 systems with one SRE.”
Karthik says that many progressive organizations “instrument as many metrics as might be relevant down the road.”
“Where you have a performance issue, or you have error conditions,” he says, “having those metrics really accessible, especially if you still have those metrics streaming in, means you can get a much better sense, in a very interactive way, of what’s happening in your environment.”
“These modern applications are not monolithic applications,” Karthik continues. “You’ve got different services, and you’ve got different owners of different services. If all of your data is in logs, it becomes very difficult to communicate across different teams, to be able to share context, and to understand that if you have a degradation in Service ‘B’, with a downstream implication for Service ‘C’, you may have different teams of developers that all need to share information and figure out how to solve an issue.”
“If you have metrics instrumented,” and dashboards in a service like SignalFx, says Karthik, “you can detect issues as they slowly start to rise between different components, and when you have a problem you can actually very quickly start to debug it across different teams.”
As James puts it, “I’m not looking for a monitoring tool to provide me with a single pane of glass. I’m looking for a monitoring tool that will help me identify problem areas that I can then drill down into, and work out what’s going on, and troubleshoot. I’m not expecting a huge amount of troubleshooting or diagnostics to emerge out of my monitoring system. I think that’s a human problem, and humans are much better at solving those problems, but they need to be pointed at the right place.”
“We’re looking for guidance rather than some magic anomaly detection,” says James.
Alex asks why a microservices architecture needs analytics as a core function of a monitoring platform.
“It’s a much more dynamic solution,” James replies. “Microservices tend to be small clusters of hosts that change frequently. Live Docker containers are largely expected to be ephemeral; they’re only expected to exist for a short period of time. As a result, it’s not like the old days of a monolithic server, and a monolithic application running on top of that server, where ‘application-1 dot example dot com’ is your host name forever, and has the same IP address, and spits out the same JVM metrics all the time.”
“You can no longer build monitoring systems around these monolithic artifacts,” says James. “You may spin out 1,000 containers in an hour, all of whose names and IP addresses you don’t care about. All you care about is the service that is running through that container, or through that application, or through that microservice, and those things have metrics. They have transaction throughputs, or response times, or latencies, and those are the numbers you now care about.”
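James’s point — that container names and IP addresses stop mattering and service-level numbers take over — can be illustrated with a rollup that groups samples by a service tag and discards container identity. This is a hypothetical sketch; the field names and values are invented:

```python
from collections import defaultdict

# Samples as they might arrive from short-lived containers: the container
# ID is incidental; the "service" tag is what gets aggregated on.
samples = [
    {"container": "a1f3", "service": "checkout", "latency_ms": 120},
    {"container": "b7c9", "service": "checkout", "latency_ms": 95},
    {"container": "c2d8", "service": "search",   "latency_ms": 40},
    {"container": "d4e1", "service": "checkout", "latency_ms": 130},
    {"container": "e9f0", "service": "search",   "latency_ms": 55},
]

def rollup(samples, metric="latency_ms"):
    """Group samples by service tag and report count/avg/max per service.
    Container identity never appears in the output."""
    by_service = defaultdict(list)
    for s in samples:
        by_service[s["service"]].append(s[metric])
    return {
        svc: {"count": len(vals),
              "avg": sum(vals) / len(vals),
              "max": max(vals)}
        for svc, vals in by_service.items()
    }

print(rollup(samples))
# e.g. checkout: 3 samples, avg 115.0, max 130 — no container IDs in sight
```

Whether a thousand containers or three served those requests in the past hour, the service-level throughput, response-time, and latency numbers are the ones that survive.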
In the move towards metrics-based monitoring, says Karthik, people care about trends across the service. One might still care about what’s happening on an individual component, but not with the same level of urgency as what is happening across the entire service. “We see more momentum toward instrumenting metrics, and building analytics around metrics.”
Alex suggests that microservices architectures must be stirring up questions and contextual issues.
“You have more elastic environments where you can potentially see an individual service scale up or scale down very frequently,” says Karthik. “You have a lot more dependencies; different services are sometimes updated on different cadences, and have upstream and downstream dependencies.”
Many questions that people have are unique to their own applications, Karthik observes: “You have a performance issue arising. You need to understand where that originated. Was it an upstream or downstream service? Getting a better picture of that becomes very important. Error conditions may have all sorts of other dependencies across dependent services, so there’s a lot more collaboration required, and the questions are unique to every organization.”
“We’ve certainly seen a need to build tools to provide better communication in these environments,” he says.
“In a microservices environment,” says Phillip, “one application comprises many different services, but disparate teams own different parts of the application,” and deployment changes in one service may have unexpected side effects.
In the past, Phillip notes, a bad change to one big application would simply break the whole thing, and the failure was obvious. But, “In this microservices environment, you’re pushing one component. Downstream, it may be affecting other components.”
“Unless you have a holistic view, or a system that allows you to look at multiple components at once, you won’t have insight into what effect you’re causing.”
SignalFx is a sponsor of The New Stack.
Featured image: Paul Klee, “At The Core” (1935), Oil on board, via WikiArt.org.