At the time-series-focused Influx Days conference in San Francisco, presenters offered many distinct views of log data. From talks on better analyzing log streams to pointed warnings against trying to define what’s “normal,” the one-day event featured a range of ways enterprises can apply new techniques. The goal: get their arms around the near-infinite supply of logging and monitoring data their systems generate.
Troubleshooting Common Errors
For example, Nakashima pointed out that it is not uncommon for third parties to copy and paste your entire website’s front-end code in order to steal some small piece of functionality. This can wreak havoc if tools like New Relic are embedded in the page, sending spurious data about an unknown server into your logging flow.
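One way to keep such copied pages out of your metrics, on whatever server ingests error reports, is to drop any event whose page URL doesn’t belong to a domain you actually serve. This is a minimal sketch of that idea; the field names and domains are illustrative assumptions, not from the talk.

```python
from urllib.parse import urlparse

# Hostnames we actually serve; anything else is likely a copied page
# reporting errors from someone else's server. (Illustrative list.)
ALLOWED_HOSTS = {"example.com", "www.example.com"}

def is_own_traffic(event: dict) -> bool:
    """Keep only error events whose page URL is one of our domains."""
    host = urlparse(event.get("page_url", "")).hostname or ""
    return host in ALLOWED_HOSTS

events = [
    {"page_url": "https://www.example.com/checkout", "message": "TypeError"},
    {"page_url": "https://copycat-site.net/checkout", "message": "TypeError"},
]
kept = [e for e in events if is_own_traffic(e)]
```

The second event, reported from the copycat domain, is silently discarded before it can pollute the error stream.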
Other pitfalls come from errors thrown on the client side for reasons that are outside the engineers’ control. One example, said Nakashima, is a user with a broken browser plug-in installed. Many commercial monitoring products can filter out this type of noise automatically, she said, but for home-rolled monitoring systems this can be a major pain: filtering out browser plug-in issues requires knowledge of hundreds of plug-ins.
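A common heuristic for this class of noise is to inspect the stack frames of a reported error: frames loaded from browser-extension URL schemes (such as `chrome-extension://` or `moz-extension://`) came from a plug-in, not your code. A minimal sketch of that filter, with assumed event field names:

```python
# Stack-frame URL schemes that indicate the error came from a browser
# extension rather than our own bundle (a common real-world heuristic).
EXTENSION_SCHEMES = ("chrome-extension://", "moz-extension://", "safari-extension://")

def is_extension_error(event: dict) -> bool:
    """True if any stack frame originates from a browser extension."""
    return any(
        frame.get("filename", "").startswith(EXTENSION_SCHEMES)
        for frame in event.get("stack", [])
    )

events = [
    {"stack": [{"filename": "https://www.example.com/app.js"}]},
    {"stack": [{"filename": "chrome-extension://abcdef/content.js"}]},
]
kept = [e for e in events if not is_extension_error(e)]
```

This catches whole families of plug-ins by scheme rather than enumerating hundreds of individual extensions, though it still misses plug-ins that inject code inline into the page.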
Another area of danger is the use of ad blockers. Nakashima said most ad-blocking software will filter out requests to third-party sites and limit all information on a page to the host domain. This can block analytics and monitoring software, and thus cause ad-blocker users to be semi-invisible to error tracking systems. The solution, said Nakashima, is to proxy all of those third-party systems through your host domain.
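The core of such a first-party proxy is a path-rewriting rule: requests to a path on your own domain are forwarded to the third-party origin they front. A minimal sketch of that mapping, with made-up vendor hostnames and path prefixes (the real upstream URLs and routes are whatever your vendors require):

```python
from typing import Optional

# Hypothetical mapping from first-party proxy paths to the third-party
# origins they front; these are illustrative names, not real vendor URLs.
PROXY_ROUTES = {
    "/t/analytics/": "https://analytics.vendor-a.example/",
    "/t/errors/": "https://errors.vendor-b.example/",
}

def rewrite_to_upstream(path: str) -> Optional[str]:
    """Map a request like /t/analytics/collect to its third-party upstream."""
    for prefix, upstream in PROXY_ROUTES.items():
        if path.startswith(prefix):
            return upstream + path[len(prefix):]
    return None  # not a proxied route; serve the page normally
```

Because the browser only ever talks to your host domain, an ad blocker has no third-party request to filter; your server relays the traffic to the vendor.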
Then you’ll have to filter out the errors thrown by your third-party vendors when they push code changes. “This seems like a rare, one-time problem, but once you look at it you realize you see this kind of third-party code problem all the time,” said Nakashima.
Having this extended visibility into the front-end should also provide your team with deeper statistics than it may be capturing now. You should be tracking browser versions, installed fonts, color schemes, visibility, geolocation, and support for new browser APIs. Tracking this information will give you a better view of the technologies your customers are using, said Nakashima.
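Once clients report these capabilities alongside their error events, aggregating them is straightforward. A minimal sketch, assuming each page load sends a small capabilities payload (the field names and values below are illustrative):

```python
from collections import Counter

# Suppose each page load reports a capabilities payload like these
# (field names are assumptions for illustration).
payloads = [
    {"browser": "Chrome 118", "webgpu": True, "dark_mode": True},
    {"browser": "Firefox 119", "webgpu": False, "dark_mode": True},
    {"browser": "Chrome 118", "webgpu": True, "dark_mode": False},
]

# Distribution of browser versions across your real traffic.
browser_share = Counter(p["browser"] for p in payloads)

# Fraction of sessions supporting a newer browser API.
webgpu_support = sum(p["webgpu"] for p in payloads) / len(payloads)
```

Summaries like these tell you which browser versions and APIs you can safely rely on, grounded in your actual audience rather than global usage statistics.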
Beware Automated Anomaly Detection
Elsewhere at Influx Days, Baron Schwartz, CEO of VividCortex, told a cautionary tale about anomaly detection. Many vendors have cropped up over the past four years to offer automated anomaly detection, sometimes under different names, he said. He argued that fully automated anomaly detection is impossible, and always will be.
Schwartz founded VividCortex in 2012 to help companies better understand the queries they run on their databases. The ultimate goal is to show which queries are jamming the system, enabling better use of the database overall.
Schwartz has spent a lot of time experimenting with anomaly detection since founding VividCortex, and he said every solution he has tried falls down over the long haul. The root cause is that it is incredibly difficult to determine what exactly a “normal” state looks like in a complex system.
“A monitoring tool isn’t supposed to give answers, it’s supposed to be an extension of your team. You should choose your monitoring tool the way you’d hire an engineer,” said Schwartz.
“Anomaly detection gets called a lot of different names: machine learning, big data, dynamic baselining, automatic thresholds. A lot of these things are simply anomaly detection, and anomaly detection is predicting and forecasting. Really, when enterprises are talking about anomaly detection, they want to find something that’s not normal. Their assumption is that these systems that have not-normal things going on are interesting to look at,” said Schwartz.
Unfortunately, while this sounds like a good idea, it does not hold up in practice, said Schwartz. Often, most of the activity going on in a system at any given time is abnormal and unpredictable. With so many moving pieces in most systems, it’s tough to distill normality into a single algorithm.
“Ultimately it gets boiled down to some equation somewhere that ends up being a proxy for what’s assumed to be normal, and you use that model to predict. You train the model on past data and say in the case of monitoring, you’re going to look at data as it comes in and say ‘is this data point anomalous?'” said Schwartz.
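The approach Schwartz describes can be reduced to a few lines: fit a summary of past data, then flag incoming points that stray too far from it. This is a deliberately naive sketch (a plain mean/standard-deviation model with a 3-sigma threshold, my choice of example, not anything Schwartz built) to show what “train on the past, score the present” looks like at its simplest:

```python
import statistics

def train(history):
    """'Train' the simplest possible model: mean and stdev of past data."""
    return statistics.mean(history), statistics.stdev(history)

def is_anomalous(value, model, n_sigmas=3.0):
    """Flag a point that strays more than n_sigmas from the historical mean."""
    mean, stdev = model
    return abs(value - mean) > n_sigmas * stdev

# Past data looks calm, so the model learns a tight notion of "normal".
history = [100, 102, 98, 101, 99, 103, 97, 100]
model = train(history)
```

The catch Schwartz describes follows directly: everything this model knows about “normal” is baked into two numbers fit on a quiet stretch of history, so the first legitimate deploy, traffic spike, or seasonal shift reads as an anomaly.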
Not only is it incredibly hard to figure out what normal means in a system, it is also hazardous to guess wrong. If the pagers in your administrators’ pockets go off every time something abnormal happens on your network, they will likely be going off every few minutes. That kind of pager spam leads most administrators to ignore alerts entirely, which becomes a major problem when something bad really does happen.
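A back-of-envelope calculation shows how fast this scales. Even a well-calibrated 3-sigma threshold on Gaussian data fires on roughly 0.27% of points; multiply that across a fleet of metrics sampled every minute and the “rare” alert becomes a constant stream. The fleet size and sample rate below are illustrative assumptions, not figures from the talk:

```python
# Two-sided 3-sigma exceedance probability for Gaussian data (~0.27%).
false_positive_rate = 0.0027

# Assumed monitoring footprint: 1,000 metrics, one sample per minute.
metrics = 1000
points_per_day = 24 * 60

# Expected spurious alerts per day if every exceedance pages someone.
alerts_per_day = false_positive_rate * metrics * points_per_day
print(round(alerts_per_day))  # ~3888 spurious pages per day
```

Nearly four thousand false pages a day from a statistically “correct” threshold is exactly the pager burnout Schwartz warns about.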
Schwartz even tried to come at the anomaly problem from another direction: he attempted to measure the time between changes in host services. He hoped to find anomalies by detecting when the time between changes changed. As a result, however, Schwartz said he built the most useless spam generation machine he’d ever seen.
“The truth is, there’s all these wacky things happening in our systems all the time. They’re not actionable, they’re not diagnosable, and there’s nothing for you to do about it. On the other hand, if you build these models, even if you work hard you get lots of indications something abnormal happened, and the cost-benefit is exactly the reverse of what we as engineers are wired to think,” said Schwartz. Having these systems in place can create more work, essentially.
“Alerts that come in that are non-actionable immediately turn alerting systems into a Gmail filter to the trash bin. They create pager burnout. These results come out of a black box that’s not interpretable. The data is already highly digested. It is surprising how quickly you end up six or eight degrees away from the original input,” said Schwartz.
When the chips are down and a system is broken, added Schwartz, the last thing you want to do is try to figure out what some black box means when it tells you there’s a problem. This is why many administrators simply end up opening an SSH connection to the problem machine anyway: they need to see the root of the problem, not some blinking light that vaguely indicates there’s an issue.
InfluxData is a sponsor of The New Stack.