Elastic, the company behind Elasticsearch, has released a plugin for its distributed search engine that allows users to ask questions about the relationships within their data.
The plugin, called Graph, is actually two extensions: one for the Elasticsearch search engine that lets you ask questions at the API level, and another for the Kibana visualization platform that adds a UI for exploring the data visually, all without creating any new index formats. It’s one commercial product under the same license; it’s just a matter of adding the extensions in the appropriate place.
“We have this great experience in relevance ranking because we’ve been in the space for many years,” said Steve Kearns, senior director of product management at Elastic.
“As a technology platform, we’re already capturing these really valuable statistics about your data. Now we’re able to show that back to you and let you ask a new type of question, and it opens up a broad range of new use cases,” Kearns said.
Traditionally, organizations set up multiple different systems — they might set up a Hadoop cluster and do batch jobs, or they might set up a graph database and use that as a secondary or third or fourth data store, Kearns said.
“One of the benefits is that this new type of query is done on your existing Elasticsearch data, on your existing Elasticsearch indexes, so it’s basically a new way to query your existing data,” he said.
The idea is to look for relevance, a process, he explained, much like that used for a full-text search.
“We’re trying to say the really common things everywhere are really less important. We want to down-weight the really common things. We do that by comparing the relative frequency of properties within the document. We’ve taken the last 30 or 40 years of information retrieval research and turned it on its side … We’re applying the same sort of relevance to the relationship process as we do to keyword search,” he said.
He offered three examples of how Graph might be used:
For a recommendation engine: Music site Last.FM published a data set on 3 million users, including up to 50 bands that each person likes. If you ask, “What bands are most like Mozart?” you’ll probably find the most commonly liked bands would be perhaps the Beatles or Radiohead because they’re among the most popular with the general population.
But if you find 75,000 users who like Mozart, “We really want to know what’s different about the Mozart lovers vs. the global population. The statistics will show that people who like Mozart also like Bach more than the global population,” he explained. “We’re trying to use the frequency data we’ve already indexed, that we’ve already used super high-scoring algorithms for.”
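The contrast Kearns describes, raw popularity versus what is distinctive about one group, can be illustrated with a small Python sketch. The toy listening data, the function name `distinctive_items`, and the simple rate-ratio score are all assumptions made for illustration; the article does not say how Graph actually computes relevance.

```python
from collections import Counter

# Hypothetical toy data: user id -> artists that user likes.
users = {
    "u1": ["Mozart", "Bach", "The Beatles"],
    "u2": ["Mozart", "The Beatles", "Radiohead"],
    "u3": ["Mozart", "Bach", "The Beatles"],
    "u4": ["The Beatles", "Radiohead"],
    "u5": ["The Beatles", "Radiohead"],
    "u6": ["The Beatles"],
    "u7": ["Radiohead", "The Beatles"],
    "u8": ["Radiohead"],
}


def distinctive_items(users, seed):
    """Rank artists by how over-represented they are among fans of `seed`.

    Score = (share of seed fans who like the artist)
            / (share of all users who like the artist).
    This ratio is an illustrative choice, not Graph's documented formula.
    """
    total_users = len(users)
    # Background: how many users overall like each artist.
    background = Counter(a for likes in users.values() for a in set(likes))
    # Foreground: only the users who like the seed artist.
    fans = [set(likes) for likes in users.values() if seed in likes]
    if not fans:
        return []
    foreground = Counter(a for likes in fans for a in likes if a != seed)

    scores = {}
    for artist, fg_count in foreground.items():
        fg_rate = fg_count / len(fans)
        bg_rate = background[artist] / total_users
        scores[artist] = fg_rate / bg_rate
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Naive answer: the artist most often liked by Mozart fans.
raw_counts = Counter(
    a for likes in users.values() if "Mozart" in likes
    for a in likes if a != "Mozart"
)
raw_top = raw_counts.most_common(1)[0][0]

# Relevance-style answer: the artist most over-represented among Mozart fans.
ranked = distinctive_items(users, "Mozart")

print("most common among Mozart fans:", raw_top)
print("most distinctive for Mozart fans:", ranked[0][0])
```

In this toy data the naive count surfaces the artist nearly everyone likes, while the ratio surfaces the one that Mozart fans like far more often than the population as a whole.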
Fraud detection: If 50 people report a fraudulent purchase on their credit cards, you’re going to look at where they shopped to find a common point of compromise. Maybe they all shopped at Walmart, Amazon.com or Starbucks.
“If you just look at the places these people shopped, you’re just going to find the most popular places that people shop. That might not be an indicator,” he said.
“What we’re trying to do is say, ‘Where did these people uniquely go?’ You’re looking for different behavior than the rest of the population. You might find that 35 of these people stopped at a particular gas station between 6 and 9 p.m. last Thursday. Even though 49 of these people shopped at Amazon, let’s start with this gas station because it seems more relevant.”
Even though more people shopped at Amazon, it might not be the problem if only 50 people had fraudulent purchases, he explained, and investigating all of Amazon is a huge amount of work.
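To make the arithmetic behind that intuition concrete, here is a rough Python sketch using the article's figures (50 victims, 49 of whom shopped at Amazon and 35 of whom stopped at the gas station). The background numbers (100,000 cardholders, 90,000 Amazon shoppers, 400 gas station visitors in that window) are invented for illustration, and the score, which multiplies the absolute change in rate by the relative change, is one plausible choice rather than a description of what Graph computes.

```python
def distinctiveness(fg_count, fg_total, bg_count, bg_total):
    """Score how unusual a merchant is for the victims versus everyone.

    Multiplies the absolute change in rate by the relative change, so a
    merchant must be both much more common among victims and not simply
    popular with everybody. Illustrative scoring only.
    """
    fg_rate = fg_count / fg_total
    bg_rate = bg_count / bg_total
    return (fg_rate - bg_rate) * (fg_rate / bg_rate)


VICTIMS = 50            # from the article
CARDHOLDERS = 100_000   # assumed size of the overall population

# merchant -> (victims who shopped there, all cardholders who shopped there)
# Victim counts come from the article; population counts are assumptions.
merchants = {
    "Amazon": (49, 90_000),
    "gas station, Thu 6-9 p.m.": (35, 400),
}

scores = {
    name: distinctiveness(victims, VICTIMS, everyone, CARDHOLDERS)
    for name, (victims, everyone) in merchants.items()
}
ranked = sorted(scores, key=scores.get, reverse=True)

for name in ranked:
    print(f"{name}: {scores[name]:.3f}")
```

Under these assumed background rates, Amazon scores close to zero because nearly everyone shops there, while the gas station scores far higher and becomes the natural place to start investigating.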
Logging: From your website’s logging data, you can look for a common point of attack. On websites it’s often a request for “/admin,” an attempt to reach the administrative console. You might not even have one, which makes such a request that much more suspicious. You could start by pulling all the IP addresses that made a “/admin” request, then look at what else that group was doing. From this known attack vector you can learn new ones, and then expand the search to find other people using those new vectors. You can also learn who is attacking your website and what types of attacks it is facing. All of this can happen in real time, so if attackers start doing something new, you find out as it happens.
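The expansion described here, from a known attack vector to suspect IPs, to new vectors, to more suspects, might look roughly like the following Python sketch. The log entries, the function name `expand_attack_vectors`, and the `min_lift` threshold are hypothetical; this walks an in-memory list rather than an Elasticsearch index and is not the Graph API.

```python
from collections import defaultdict

# Hypothetical access log as (client IP, requested path) pairs.
log = [
    ("203.0.113.5", "/admin"), ("203.0.113.5", "/wp-login.php"),
    ("203.0.113.5", "/"),
    ("203.0.113.9", "/admin"), ("203.0.113.9", "/wp-login.php"),
    ("198.51.100.7", "/wp-login.php"), ("198.51.100.7", "/"),
    ("192.0.2.1", "/"), ("192.0.2.1", "/products"),
    ("192.0.2.2", "/"), ("192.0.2.2", "/products"),
    ("192.0.2.3", "/"),
]


def expand_attack_vectors(log, seed_path, min_lift=1.5):
    """Start from one suspicious path and widen the investigation.

    Returns (suspects, new_vectors, more_suspects):
      suspects      - IPs that requested `seed_path`
      new_vectors   - other paths those IPs hit at least `min_lift` times
                      more often than the whole population of IPs
      more_suspects - other IPs that requested any of the new vectors
    """
    paths_by_ip = defaultdict(set)
    for ip, path in log:
        paths_by_ip[ip].add(path)

    suspects = {ip for ip, paths in paths_by_ip.items() if seed_path in paths}
    if not suspects:
        return set(), set(), set()

    # Every other path requested by at least one suspect.
    candidates = set().union(*(paths_by_ip[ip] for ip in suspects)) - {seed_path}

    new_vectors = set()
    for path in candidates:
        fg_rate = sum(path in paths_by_ip[ip] for ip in suspects) / len(suspects)
        bg_rate = sum(path in p for p in paths_by_ip.values()) / len(paths_by_ip)
        if fg_rate / bg_rate >= min_lift:
            new_vectors.add(path)

    more_suspects = {
        ip for ip, paths in paths_by_ip.items() if paths & new_vectors
    } - suspects
    return suspects, new_vectors, more_suspects


suspects, new_vectors, more_suspects = expand_attack_vectors(log, "/admin")
print("requested /admin:", sorted(suspects))
print("new vectors:", sorted(new_vectors))
print("additional suspects:", sorted(more_suspects))
```

The home page is requested by almost every visitor, so it is filtered out as uninformative, while a path the "/admin" group requests disproportionately is promoted to a new vector and used to find IPs that never touched "/admin" at all.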
Kearns said he expects that with Graph, users will discover an array of new uses for the Elastic stack.
Elastic, which was renamed from Elasticsearch, was founded in 2012 by the people behind the Elasticsearch, Kibana, Logstash, and Beats open source projects. It boasts more than 50 million cumulative downloads and has raised $104M in total funding from Benchmark, NEA and Index Ventures.
Elasticsearch already is being used in myriad interesting ways. Among them:
The U.S. Geological Survey streams Twitter’s Public API into the Elasticsearch distributed search engine, then uses Kibana, its real-time analytics engine, to sift through the tweets for relevant terms such as “earthquake” and “tremor” in several different languages to detect quakes.
Elasticsearch is part of an application the Mayo Clinic is using that allows physicians to find similar patients and explore various possible scenarios using outcome and intervention parameters.
Giant Oak, a spin-off from DARPA (the Defense Advanced Research Projects Agency), is using Elasticsearch in its fight against human trafficking. This effort has led to the arrest of more than 100 human traffickers in the past year.