Data / Kubernetes / Machine Learning

Illuminating the Anonymous with Neo4j’s Graph Database

29 Jun 2020 6:00am, by Richard MacManus
Richard is senior editor at The New Stack and writes a weekly column about what's next on the cloud native internet. Previously he founded ReadWriteWeb in 2003 and built it into one of the world’s most influential technology news and analysis sites.

“Companies can no longer ignore the power of connected data for improving the accuracy of data science models and predictions,” Jim Webber, chief scientist of the graph database company Neo4j, told me last week. While that sounds like the usual grand, sweeping statement we get from tech companies, this time there’s a solid case study to back it up.

In April, Neo4j announced its latest product: Neo4j for Graph Data Science, a predictions platform for enterprises. The media conglomerate Meredith used this product to turn data about its largely anonymous website visitors into customer profiles, by graphing the data into billions of nodes and then applying machine learning to it.

Meredith called this “illuminating the anonymous,” which is a somewhat creepy phrase (and a reminder that privacy is not a given, even when you think you’re browsing anonymously). But if you look past the privacy issues for a minute, what Meredith did illustrates the sheer power of machine learning when combined with cloud and graph technologies.

As the name suggests, graph database systems represent data in graph structures — a map of relationships between nodes. That's one way AI systems "learn": by running algorithms to find patterns in those data relationships.
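As a rough illustration of that node-and-relationship model, here is a minimal sketch in plain Python. The dicts stand in for a real graph store, and the node names, labels, and properties are made up; this is not how Neo4j actually stores data internally.

```python
# Property-graph model in miniature: nodes carry labels and properties,
# and typed relationships connect them.
nodes = {
    "u1": {"label": "User", "city": "Seattle"},
    "u2": {"label": "User", "city": "Seattle"},
    "a1": {"label": "Article", "topic": "recipes"},
}
relationships = [
    ("u1", "VIEWED", "a1"),
    ("u2", "VIEWED", "a1"),
]

def neighbors(node, rel_type):
    """Nodes reachable from `node` via relationships of type `rel_type`."""
    return {dst for src, rel, dst in relationships
            if src == node and rel == rel_type}

# A simple pattern over the relationships: two users connected
# through an article they both viewed.
shared = neighbors("u1", "VIEWED") & neighbors("u2", "VIEWED")
print(shared)  # {'a1'}
```

Finding patterns like this — paths and shared neighbors rather than joins over rows — is the kind of structure graph algorithms exploit.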

Google is the prime example of how graphs have become central to AI software. Its original product, Google Search, was essentially a graph database. And it has built on that ever since. Nowadays, graphs are at the center of Google’s machine learning efforts.

As for Neo4j, it is the world’s most popular graph database system according to DB-Engines.

Neo4j’s founder and CEO, Emil Eifrem, told The New Stack a couple of years ago that “we’re in the first inning of a nine-inning game” when it comes to AI – and machine learning (ML) in particular. So I asked Jim Webber, what inning are we in now?

Being English, Webber decided to switch out the baseball metaphor for a cricket one. Instead of nine innings, a one-day cricket game has two innings. And according to Webber, we’re at the end of the first innings in AI.

“ML has arrived and is making huge impacts across our industry and the verticals we support,” he said. “The latest conversation — the next innings if you will — that I’ve been having revolves around graphs and the integration of machine learning. For example, companies like Amazon and Google have one foot in the detail of this technology, and another foot entrenched in the verticals. [They] are working towards better ML techniques underpinned by graphs, not just vast volumes of data. There’s a lot still to play for here, but right now graph AI is the most promising candidate for the future, and I think it’s going to win out in the long term.”

What’s most intriguing about combining graphs with ML is that a lot can be inferred from a piece of data just by connecting it to other data. Which brings us back to the Meredith case study. At Neo4j’s virtual event at the end of April, Connections, Meredith Senior Data Scientist Ben Squire went into detail about the project.

“The majority of our consumers are anonymous and not logged in,” he began. Nevertheless, each of those millions of anonymous users leaves identifying trails — primarily via cookies. Meredith was collecting “hundreds of millions of cookies,” but analyzing this data was extremely difficult. Squire cited “cookie loss, device diversity and ITP 2.3 browsers” (Safari’s Intelligent Tracking Prevention) as the key challenges, making measurement of unique users “fuzzy at best.”

This is where Neo4j’s graph database came in. Each anonymous user could be identified first and foremost by a first-party cookie (that is, a cookie created by Meredith itself), but it would require some heavy-duty computation to connect that to other data. Also, some of the other data was problematic — IP addresses change constantly, for example, making it unreliable data (at least in terms of individual identity).
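The core idea behind that kind of identity resolution can be sketched as a connected-components problem: treat each first-party cookie as a node, add an edge whenever two cookies share a stable identifier, and merge each connected component into one profile. The sketch below uses union-find; the identifiers and linking signals are invented for illustration, not taken from Meredith’s actual pipeline.

```python
# Hypothetical observations linking cookies to stable identifiers
# (a hashed email, a device ID). All values here are made up.
from collections import defaultdict

observations = [
    ("cookie_a", "email_hash_1"),
    ("cookie_b", "email_hash_1"),  # same hashed email as cookie_a
    ("cookie_b", "device_42"),
    ("cookie_c", "device_42"),     # same device as cookie_b
    ("cookie_d", "email_hash_2"),  # unrelated visitor
]

parent = {}

def find(x):
    """Union-find root lookup with path compression."""
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def union(a, b):
    parent[find(a)] = find(b)

for cookie, identifier in observations:
    union(cookie, identifier)

# Each connected component of cookies becomes one user profile.
profiles = defaultdict(set)
for cookie, _ in observations:
    profiles[find(cookie)].add(cookie)

print(sorted(map(sorted, profiles.values())))
# [['cookie_a', 'cookie_b', 'cookie_c'], ['cookie_d']]
```

Four cookies collapse into two profiles — the same shape of reduction, at toy scale, as Meredith’s 346 million cookies collapsing into 163 million profiles.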

Source: Meredith

After 20 months, Meredith’s graph database had 14.4 billion nodes, 67.6 billion properties and 20.6 billion relationships.

From all that data, Meredith was able to create 163 million user profiles from 346 million first-party cookies “that previously would’ve been considered unique individuals,” said Squire. They then used these profiles to deliver personalized content and advertising.

You can see from this case study how a graph database can be used to de-anonymize data on a large scale (in the hundreds of millions of users). In Meredith’s case, the key was finding relationships between the main data type — the first-party cookie — and the other, more problematic, streams of data.

But what about other types of database systems, in particular relational databases? Can’t they be used to achieve something similar to what Meredith did?

“The thing is,” said Webber, “connected data is both more valuable and more numerous than relational (row-oriented) data.” He also pointed out that the analytical tradition for graphs is strong and is a good match with machine learning — “feeding graph features into ML makes for incredible improvements.”
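To make “feeding graph features into ML” concrete, here is a hedged sketch: derive per-node features from the relationship structure (degree and triangle count, two common choices) and treat them as extra columns alongside whatever row-oriented features a model already has. The graph and the feature choices are illustrative, not Webber’s or Neo4j’s specific method.

```python
# A toy undirected graph.
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]

adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

def degree(node):
    """Number of direct relationships the node participates in."""
    return len(adj[node])

def triangles(node):
    """Number of edges among the node's neighbors (closed triangles)."""
    nbrs = adj[node]
    return sum(1 for u in nbrs for v in nbrs if u < v and v in adj[u])

# Graph-derived features, ready to be concatenated with tabular
# features before training a downstream classifier.
features = {n: [degree(n), triangles(n)] for n in adj}
print(features["c"])  # [3, 1]
```

The point of Webber’s claim is that columns like these, which only exist because the data is connected, often carry signal that no amount of extra row-oriented data provides.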

I was curious to know how Webber thinks Neo4j fits into the cloud native ecosystem. I mentioned Sam Ramji of DataStax’s observation that Kubernetes and the NoSQL database Cassandra are “like peanut butter and chocolate […] kind of a perfect pairing of data and compute for a cloud native world.”

“At Neo4j, we’re also big believers in Kubernetes,” Webber said. “We’ve invested heavily in it, making it work sympathetically with a stateful system like the Neo4j graph database.”

He noted that Kubernetes underpins its graph database as a service, Neo4j Aura. Webber says the control plane of Aura “is standardized, commodified and portable across cloud platforms,” which he thinks is “a technical leg-up versus those databases that went to the cloud with bash scripts and duct tape.”

One of the things I’m learning, as I do this series of columns about database systems in the age of AI, is that the 2020s will most likely be a bountiful decade for AI and ML apps. So where does Neo4j fit into this emerging landscape of cloud native, data-intensive apps?

“With cloud native solutions,” he replied, “developers can write, conceive, deploy, and upgrade projects to scale with lower friction, which leads to faster iterations and agility. But we’re seeing already that there is no single cloud — true, there may be [a] few popular clouds, but no outright victor in this contest.”

What this means, says Webber, is that the tools developers need — like graph ML — will have to be available on every cloud platform going forward, so the underlying architecture must be cloud-agnostic.

“Our choice of Kubernetes for Neo4j’s cloud architecture reflects this reality,” he continued. “It means we can relatively easily move to the cloud platforms where developers and enterprises want us to be, now and in the future. Our intention is to ensure that Neo4j sits alongside the ML processing services being rolled out by the cloud players. That’s a sweet convergence of infrastructure and application technologies which is very compelling.”

A sweet spot, indeed. Especially when you see what a large enterprise like Meredith can do with a combination of cloud, ML and Neo4j’s graph database. Now imagine what hundreds of startups will build in the application space with those same technologies.

DataStax is a sponsor of The New Stack.


