How Argyle Data Uses Facebook’s PrestoDB and Apache Accumulo to Detect Fraud
Argyle Data is in the business of examining huge pools of transactions in search of fraudulent behavior. It’s a new frontier. And as we know, frontiers do not always have a defined set of rules.
What we do have are new ways to explore this frontier, where fraud comes in forms that are often difficult to detect. In particular, machine learning technologies allow Argyle to query large volumes of data held in key/value stores.
“A rule, by definition, is about an experience of old, previous fraud,” explains Ian Howells, Argyle Data’s CMO, in an interview with The New Stack. “Rules don’t discover new fraud. We’re moving from rules to sophisticated machine learning. The whole thrust is, instead of discovering what you know, discovering what you don’t know, because criminals are continually evolving.”
According to Tom Ryan, Argyle Data’s CEO, the typical telco customer is already losing two percent of its revenue to fraud, though some are shedding as much as five percent. The typical fraudulent act in this instance is triggering a call to a premium number, usually without the victim being able to stop it.
The revenue leakage doesn’t stop there. Call centers, Ryan tells The New Stack, spend hours in (legitimate) conversations with customers complaining about unexpected charges. “But the biggest, ironically, is the churn due to damage to reputation and brand,” he says. “That’s what our mobile customers are most concerned with. In fact, for our customers in the financial services industry, it’s their top priority — it’s all about protecting their brand.”
Argyle had its own technology for detecting fraud behavior patterns in data warehouses. Meanwhile, telcos were moving from the last remnants of their circuit-switched infrastructures to very-high-speed, IP-based networks. Functions that had historically been performed by discrete components you could hold in your hand were being moved to software. And so were the fraud patterns.
Those patterns were evolving faster than the software. As Howells tells us, Argyle’s architects found themselves in a quandary, at the very pit of which they began asking themselves this question:
How would Facebook handle fraud?
At last, an easy question to answer. Facebook develops its own technology in-house for managing data at high speed. Contrary to the pattern of most successful American companies, Facebook shares that technology with the rest of the world as it’s being developed. We’ve talked about it here: PrestoDB.
“Within a data set, you could have lots of different patterns of attacks that potentially even change over time,” says Arshak Navruzyan, Argyle Data’s vice president for product management. “That’s where the algorithmic approach seems to work really well. We’re using Presto to synthesize the data set, to write in parallel very large-scale queries to essentially produce these [patterns]. Then algorithmically, we interpret them using a probabilistic approach to isolate the part of the traffic that could potentially be fraudulent.”
Argyle’s operation does involve machine learning, but Navruzyan cautions us against getting all hyperbolic about it. Deep learning and neural network algorithms are good for learning speech patterns, he says, which are not subject to change. In fact, he doubts that deep learning systems would even be helpful for fraud detection.
“The stuff we do is fairly sophisticated, in the realm of graphical models — hidden Markov models, conditional random fields, Markov random fields — but they’re very well-known methods,” he remarks. “I think the deep learning stuff seems to just be more wishful thinking, in my estimation.”
Argyle uses Presto in conjunction with Apache Accumulo, a distributed key/value store which, according to the project’s site, is “based on Google’s BigTable design and built on top of Apache Hadoop, Zookeeper, and Thrift.” Argyle uses these technologies to extract data from telcos and assemble it into key/value stores. This keeps query response times uniform, even as dataset sizes vary widely. “Even if I have a terabyte of data, everything is broken up into one-gigabyte tablets, and distributed over hundreds, or potentially thousands, maybe tens of thousands, of servers.”
While Hadoop’s HBase has a similar design, he says, it’s not quite as proven as Accumulo at analyzing huge datasets.
Suppose a telco’s system samples one subscriber calling another. Because Accumulo’s response times are uniform, Navruzyan says, it could retrieve the entire call history of both subscribers. Argyle can then run a probabilistic model on that data to determine the likelihood of questionable activity, given both parties’ call history.
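To make the lookup concrete, here is a minimal Python sketch of the idea, not Argyle’s code and not the Accumulo API. It assumes a hypothetical key layout of `subscriber|timestamp` in a sorted key/value store, so that one subscriber’s whole call history sits in a contiguous key range. The `SortedKV` class, the record fields, and the smoothed `premium_rate` score are all illustrative stand-ins; the article does not describe Argyle’s actual schema or model.

```python
import bisect


class SortedKV:
    """Toy in-memory stand-in for a sorted key/value store (hypothetical, not Accumulo)."""

    def __init__(self):
        self._keys = []   # keys kept in sorted order
        self._vals = {}   # key -> value

    def put(self, key, value):
        if key not in self._vals:
            bisect.insort(self._keys, key)
        self._vals[key] = value

    def scan_prefix(self, prefix):
        """Return (key, value) pairs whose key starts with prefix, in key order."""
        i = bisect.bisect_left(self._keys, prefix)
        out = []
        while i < len(self._keys) and self._keys[i].startswith(prefix):
            out.append((self._keys[i], self._vals[self._keys[i]]))
            i += 1
        return out


def call_history(store, subscriber):
    """All call records for one subscriber, assuming 'subscriber|timestamp' keys."""
    return [value for _, value in store.scan_prefix(subscriber + "|")]


def premium_rate(history):
    """Laplace-smoothed share of calls to premium numbers (an illustrative score only)."""
    premium = sum(1 for call in history if call["premium"])
    return (premium + 1) / (len(history) + 2)


store = SortedKV()
store.put("alice|2015-01-01T10:00", {"to": "bob", "premium": False})
store.put("alice|2015-01-02T03:00", {"to": "900-555-0100", "premium": True})
store.put("bob|2015-01-01T11:00", {"to": "alice", "premium": False})
store.put("carol|2015-01-01T12:00", {"to": "bob", "premium": False})

# One subscriber calls another: pull both histories, then score each.
alice_history = call_history(store, "alice")
bob_history = call_history(store, "bob")
print(premium_rate(alice_history), premium_rate(bob_history))
```

Because the keys are sorted, each history is a single contiguous scan rather than a search of the whole data set, which is one plausible reason lookups of this kind stay predictable as the data grows.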
A Sudden Detour into Hyperspace
From there, Navruzyan continues, Argyle can employ probability distribution and density estimation functions in hyperspace.
I could really annoy you and end this article right here. But I won’t.
As this Virginia University research paper on the topic shows [PDF], spatial analysis has been used for several years as a research tool by law enforcement. Their theory has been this: take the various properties of a criminal event and render each one quantitatively, on a linear scale. If you treat each of these linear scales as an axis, then you can presume a kind of configuration space, or “hyperspace,” in which each axis is a dimension. Thus, any change in quantity along one or more axes becomes a movement in that space.
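A small Python sketch may help picture this. It treats three made-up properties of a calling pattern as axes, turns two observations into points, and measures the movement between them. The axis names and values are assumptions chosen for illustration; neither the paper nor Argyle is quoted as using these particular features.

```python
import math

# Hypothetical axes of the configuration space: one dimension per measured property.
AXES = ("calls_per_hour", "avg_duration_min", "premium_share")


def to_point(event):
    """Render an event's properties as coordinates, one per axis."""
    return tuple(float(event[axis]) for axis in AXES)


def displacement(start, end):
    """The movement from one point to another, axis by axis."""
    return tuple(b - a for a, b in zip(start, end))


def distance(start, end):
    """Euclidean length of that movement."""
    return math.sqrt(sum(d * d for d in displacement(start, end)))


baseline = to_point({"calls_per_hour": 2, "avg_duration_min": 4, "premium_share": 0.0})
observed = to_point({"calls_per_hour": 5, "avg_duration_min": 8, "premium_share": 0.0})

move = displacement(baseline, observed)
dist = distance(baseline, observed)
print(move, dist)
```

In practice the axes would need to be rescaled to comparable units before distances mean much; that step is left out here to keep the geometry visible.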
This is one of those methods that Arshak Navruzyan characterizes as unsophisticated.
“Within Presto, there’s the ability to do approximate queries,” he notes. “When you have very large data sets, you don’t necessarily need to scan the entire data set to arrive at an approximate answer, provided that your data is normally distributed. You can sample the data, and then get something like 95 percent accuracy, without having to look at every element of the data.”
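The sampling idea can be sketched in a few lines of Python. This is not how Presto implements approximate queries; it is a generic illustration that reads “95 percent” as a 95 percent confidence interval around a sample mean, which is one assumption about what the quote means. The data are synthetic, normally distributed call durations, and the function name `approximate_mean` is hypothetical.

```python
import math
import random
import statistics

random.seed(7)

# Synthetic "full data set": 200,000 normally distributed call durations, in seconds.
durations = [random.gauss(180.0, 15.0) for _ in range(200_000)]


def approximate_mean(values, sample_size, z=1.96):
    """Estimate the mean from a random sample, with a z-based confidence interval.

    z=1.96 corresponds to a 95 percent interval under a normal approximation.
    """
    sample = random.sample(values, sample_size)
    mean = statistics.fmean(sample)
    stderr = statistics.stdev(sample) / math.sqrt(sample_size)
    return mean, (mean - z * stderr, mean + z * stderr)


# Look at 1 percent of the rows instead of all of them.
estimate, (low, high) = approximate_mean(durations, 2_000)
exact = statistics.fmean(durations)
print(f"estimate={estimate:.2f} interval=({low:.2f}, {high:.2f}) exact={exact:.2f}")
```

The interval narrows with the square root of the sample size, so a modest sample already pins the answer down closely when the data are well behaved. With heavily skewed data the same shortcut would be much less trustworthy, which matches the caveat in the quote.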
When you compound the various properties being analyzed with 95 percent accuracy, to determine the general direction to which they all point in this configuration space, you don’t need degrees/minutes/seconds accuracy to come to the conclusion that something’s very probably going wrong. Telcos can use this information to stop attacks on their network from spreading, reducing both the immediate costs incurred by maintenance, and the subsequent costs incurred in customer support.
After Facebook’s developers put their heads together to resolve the efficiency issues with which data warehouse users have been plagued for over two decades, the rest of the world is reaping the fruits of their labor … in a mildly unsophisticated, warp-speed kind of way.