Case Study: Graph Databases Help Track Ill-Gotten Assets

If you want to find oligarchs’ dirty money — or reveal connections hidden in any data — you will need a graph, not a map.
Jun 12th, 2023 4:00am by
Image by Conny Schneider from Unsplash. 

In the modern world, information — and money — is digital. Its flow around the world leaves a trail as it moves through onshore and offshore intermediaries, such as lawyers, accountants and banks, and is transformed into other assets, such as property, private jets and yachts.

A series of massive data leaks over the last decade has gifted the International Consortium of Investigative Journalists (ICIJ) vast tranches of digitized data, potentially allowing it to track asset flows worldwide. This data has been pulled together in the ICIJ’s Offshore Leaks database.

Much of the activity covered in these leaks is perfectly legitimate — but some of it will be tied to unlawful or corrupt behavior.

A robust document database could manage these vast amounts of data. But to really understand what is going on within the world of offshore finance, the ICIJ needed to also surface the connections between various players and their assets.

The transactions and relationships contained in the Offshore Leaks data have, in many cases, been deliberately designed to obscure the truth and make it challenging to track asset movements and ascertain who owns what.

A decade ago, tracking these relationships would have meant modeling the data using multiple joined spreadsheets. But this was painstakingly difficult, and investigators still faced the question of how to visualize these relationships for a non-technical audience.

Ultimately, the job calls for a graph database, one that can map connections and relationships, including those that are designed to remain hidden.

How Neo4j Unearths the Secrets in the ICIJ’s Data

The Offshore Leaks data is a potential treasure trove of information about how wealth and assets are diverted into the offshore financial system. But secrecy and obscurity are part of the process. The relationships between individuals, companies, assets and enablers are buried in the data.

But uncovering such hidden relationships and presenting them visually, in a way investigators, journalists and citizens can quickly grasp, is where a native graph database really comes into its own. And this is why the ICIJ uses Neo4j’s platform to analyze its vast amounts of data and reveal the links between various entities.

While traditional relational databases are all about rows and columns, graph databases are all about connections. In Neo4j’s model, data elements are stored as “nodes,” which may be connected by any number of “relationships.” Both the nodes and relationships can have “properties.”

As well as its core database, Neo4j offers a suite of tools to allow developers and data scientists to model, store and query data as a graph. It has its own query language, Cypher. There is a Python wrapper for the Neo4j graph data science library to ease integration into data science workflows. At the other end of the spectrum, there are API integrations to allow, for example, web developers to build web applications that are backed by Neo4j.

This model lends itself to a variety of data access patterns depending on the use case, according to William Lyon, a developer relations engineer at Neo4j. If a journalist simply wants to know a list of offshore companies linked to a sanctioned individual, this will involve a local graph traversal from a well-defined starting point.
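The local traversal Lyon describes can be illustrated without a database. The following is a minimal sketch in plain Python, not the ICIJ's or Neo4j's implementation: the adjacency list, the node names and the `local_traversal` helper are all hypothetical, invented to show a breadth-first walk outward from one known starting point.

```python
from collections import deque

# Hypothetical adjacency list: each key is a node, each value its direct neighbors.
graph = {
    "Person A": ["Company X", "Company Y"],
    "Company X": ["Address 1"],
    "Company Y": ["Company Z"],
    "Company Z": [],
    "Address 1": [],
    "Person B": ["Company Q"],
    "Company Q": [],
}


def local_traversal(adjacency, start, max_hops):
    """Breadth-first search outward from `start`, at most `max_hops` links away."""
    seen = {start}
    queue = deque([(start, 0)])
    reached = []
    while queue:
        node, hops = queue.popleft()
        if hops == max_hops:
            # Do not expand nodes that already sit at the hop limit.
            continue
        for neighbor in adjacency.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                reached.append(neighbor)
                queue.append((neighbor, hops + 1))
    return reached


linked = local_traversal(graph, "Person A", max_hops=2)
# Keep only the company nodes, as a journalist's query might.
companies = [name for name in linked if name.startswith("Company")]
```

Because the walk starts from a single node and stops after a fixed number of hops, it touches only the neighborhood of "Person A" and never visits the unrelated "Person B" part of the graph.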

At the other end of the scale, a data scientist — ICIJ has both journalists and data scientists on its team — might look to analyze the entire network or run graph algorithms such as PageRank to establish the most important nodes in the network.
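To give a feel for what a whole-graph algorithm such as PageRank does, here is a minimal pure-Python sketch using power iteration. It is an illustration only, not Neo4j's graph data science library: the `pagerank` function, the damping factor of 0.85, the iteration count and the tiny example graph are all assumptions made for this example.

```python
def pagerank(adjacency, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict of node -> list of outgoing neighbors.

    Every neighbor must also appear as a key in `adjacency`.
    """
    nodes = list(adjacency)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node starts each round with the same small baseline score.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node, targets in adjacency.items():
            if targets:
                # Split this node's rank evenly among the nodes it points to.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A node with no outgoing links spreads its rank over all nodes.
                share = damping * rank[node] / n
                for other in nodes:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical network: officers point to the companies they are linked to.
network = {
    "Officer 1": ["Shell Co"],
    "Officer 2": ["Shell Co"],
    "Officer 3": ["Shell Co", "Holding Co"],
    "Shell Co": ["Holding Co"],
    "Holding Co": [],
}

ranks = pagerank(network)
most_important = max(ranks, key=ranks.get)
```

In this toy network the node that collects the most incoming weight, directly and through intermediaries, ends up with the highest score, which is the sense in which PageRank surfaces "important" nodes.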

The platform is particularly useful for analyzing nested data, and for combining datasets and running queries across them.

“By extracting the entities and the relationships out of all these documents, and then adding them into Neo4j,” Lyon said, “you get this huge graph of how all of these people and offshore companies and assets are connected.”

A “node” would represent an individual, an offshore company or, he said, an address connected to the person or the company.

“And then on those nodes, you can store key-value pair attributes that are called properties, like the name or the passport number that are associated with the node,” Lyon said. “And then we would also add another component called a label; that is a way to group the node.”

The result, said Lyon, is “you’re able to encode those relationship types that are shown in these documents or extracted through this natural language processing in the property graph data model used to model this Offshore Leaks data.”
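The property graph model Lyon describes (nodes, labels, properties and typed relationships) can be sketched with a few plain Python data classes. This is a rough illustration rather than Neo4j's storage format or API: the class names, the labels, the relationship types and every record below are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A graph node: a label groups it, properties hold key-value attributes."""
    node_id: int
    label: str
    properties: dict = field(default_factory=dict)


@dataclass
class Relationship:
    """A typed, directed connection between two nodes, with its own properties."""
    start: int
    end: int
    rel_type: str
    properties: dict = field(default_factory=dict)


# Hypothetical records, invented purely for illustration.
nodes = [
    Node(1, "Officer", {"name": "Jane Example"}),
    Node(2, "Entity", {"name": "Example Holdings Ltd"}),
    Node(3, "Address", {"address": "1 Sample Street"}),
]
relationships = [
    Relationship(1, 2, "OFFICER_OF", {"role": "director"}),
    Relationship(2, 3, "REGISTERED_ADDRESS"),
]


def nodes_with_label(all_nodes, label):
    """Return every node grouped under the given label."""
    return [node for node in all_nodes if node.label == label]


officers = nodes_with_label(nodes, "Officer")
```

Keeping the relationship type on the connection itself, rather than in a separate join table, is what lets a query follow "officer of" or "registered address" links directly from one node to the next.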

The ICIJ then uses a visualization tool called Linkurious, which integrates with Neo4j, to enable less technical users to interrogate the graph. Most journalists are not going to be writing SQL or Cypher queries.

Tracking Connections in Messy Data

One of the big problems for the ICIJ is not just the scale of the data involved and the hidden nature of the connections between various players, but the format it arrives in.

The Swiss Leaks investigation in 2015 centered on 3.3GB of leaked data. The Paradise Papers leak in 2017 involved 13.4 million documents amounting to 1.4TB of data, spanning 19 corporate registries.

The Pandora Papers investigation, which hit in 2021, included 11.9 million files from 14 different “offshore service providers”, spanning PDFs, images, emails, spreadsheets, and audio and video files, amounting to 2.94TB.

More recently, the ICIJ has pulled together previous Russia-related investigations in its Russia Archive, which has helped spark action by authorities and regulators in the wake of Russia’s invasion of Ukraine.

But the data from which investigations spring is never handed to reporters on a tidy plate. “The data that we get is usually very problematic,” Emilia Diaz-Struck, ICIJ’s data and research editor, told The New Stack. “It’s very messy, it’s not structured.” For instance, just 4% of the data in the Pandora Papers was in structured formats, such as spreadsheets.

So the ICIJ uses a variety of tools for “entity extraction,” including optical character recognition (OCR) and machine learning. “For some of the information that was in documents, we use Python scripts that our team wrote for extracting,” she said.
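The ICIJ's own extraction scripts are not shown here, so the following is only a guess at the general shape such a script might take. It is a minimal sketch that assumes the OCR output is a form with predictable field labels; the sample text, the field names and the `extract_entities` helper are all hypothetical.

```python
import re

# Hypothetical OCR output; the names and layout are invented for illustration.
ocr_text = """
Company Name: Example Holdings Ltd
Director: Jane Example
Registered Address: 1 Sample Street, Sampletown
"""

# Each pattern captures the text after a known form label, up to the end of the line.
FIELD_PATTERNS = {
    "company": re.compile(r"^Company Name:\s*(.+)$", re.MULTILINE),
    "director": re.compile(r"^Director:\s*(.+)$", re.MULTILINE),
    "address": re.compile(r"^Registered Address:\s*(.+)$", re.MULTILINE),
}


def extract_entities(text):
    """Return a dict of field name -> extracted value for the labels found in text."""
    entities = {}
    for field_name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        if match:
            entities[field_name] = match.group(1).strip()
    return entities


entities = extract_entities(ocr_text)
```

Real leaked documents are far less regular than this, which is why extracted entities still need the fact-checking and validation step before they are loaded into the graph.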

The team also uses Scikit-learn, a Python machine learning toolkit, “to separate forms from longer documents and then we used OCR to extract the information.” Some investigations have included handwritten documents and this means data must be transcribed manually.

Once the entities have been extracted, the ICIJ must still fact-check and validate the information.

The organization has also developed its own platform, Datashare, which is an open source tool for securely sharing massive amounts of records with everyone involved in a project. Clearly, with as many as 600 journalists on an investigation, it’s not feasible for individual reporters to have to visit a single secure location.

But even when the ICIJ and its partners have extracted this vast amount of information from a mass of unstructured data, it still must use Neo4j’s graph database to connect the dots between individuals, entities and assets, and visualize the results. Or, to put it another way, build a story.

No Need for Complex Queries

The ICIJ’s ultimate aim is to further public interest journalism and democratize the data it obtains. That means putting the data in the hands of people who are not necessarily technical experts but rather seasoned field reporters, great storytellers or simply highly motivated citizens.

By combining Neo4j’s ability to uncover links and relationships that researchers may not even have dreamed of with Linkurious’s data visualization and analysis technology, said Diaz-Struck, the ICIJ has been able to both construct the graph and provide an interface for people to query it, without the need to code or construct complex queries.

“That’s the powerful thing, the magic,” she said. “They can start typing a name of anyone, or an address, and then they will get suggestions.”

From there, she said, journalists and other researchers can expand their search and realize that a person they thought had one company has connections to multiple companies or entities. They can then return to Datashare and explore the documents themselves.

“This is a great way to find connections and find key information that will help advance their reporting process,” Diaz-Struck said. “It helps a lot with discovering stories and interpreting and finding connections.”

Anyone can get a feel for the power of graph databases by checking the Offshore Leaks database. Because, as Diaz-Struck said, ICIJ’s work is about transparency, and answering the question, “How do we democratize access to data and make it available and usable for everyone?”
