Connections Problem: Finding the Right Path through a Graph
In complex data systems, many of which have grown into gigantic stores of billions of facts about the smallest links in the biggest chains, almost any fact anyone might care about can be specified as a path through a graph. By “graph,” I mean a structure you can literally draw: the route through a system to where a fact may be obtained.
Today, we have what I call a “Connections Problem.” In the systems we have, there are often too many unrelated facts. As a result, the path through the graph is usually implicit and undiscoverable. Even when the path does exist, it’s often too long or too hard to navigate. You end up with a graph too bewildering to use.
Search for the Value-Add
Don’t get me wrong, massive tables of facts are great. Yet sometimes you wish you didn’t have 25 or 38 systems for them! Sure, machine learning has grown by leaps and bounds, but ML and AI practitioners will tell you the data engineering challenge there is staggering, and connecting the dots is more and more painful. These folks would prefer to spend their time on predictive modeling. What they actually end up doing is data engineering: painfully connecting dots before they can do anything that adds value.
You see this yourself every day, with every data lookup you perform. JSON document stores and relational databases alike induce you to create an identifier scheme, so you can join two tables, or connect two JSON documents via a lookup operation. Each time you do this, you pay the same performance cost again.
Alternatively, you could choose to denormalize your entire data structure. For instance, you could store one big table instead of two smaller ones. Instead of joining them, by having one table you’ve simply baked that relationship into the table schema.
When you denormalize your data, you pay for connections with greater data redundancy and reduced cohesion. So you can normalize data and pay in compute time, or denormalize data and pay another way.
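The trade-off above can be sketched in a few lines. This is a minimal Python illustration with made-up tables (the customer and order data are hypothetical): the normalized version pays a lookup on every query, while the denormalized version pays by copying the customer name into every row.

```python
# Normalized: orders reference customers by id; every query pays a lookup.
customers = {1: {"name": "Ada"}, 2: {"name": "Lin"}}
orders = [
    {"customer_id": 1, "item": "laptop"},
    {"customer_id": 2, "item": "phone"},
    {"customer_id": 1, "item": "monitor"},
]

def orders_with_names(orders, customers):
    # Each row costs a dictionary lookup -- the "join" we pay for every time.
    return [
        {"name": customers[o["customer_id"]]["name"], "item": o["item"]}
        for o in orders
    ]

# Denormalized: the relationship is baked into the table schema.
# No join needed, but "Ada" is now stored twice (redundancy).
orders_flat = [
    {"name": "Ada", "item": "laptop"},
    {"name": "Lin", "item": "phone"},
    {"name": "Ada", "item": "monitor"},
]
```

Both shapes answer the same question; the choice only moves the cost between compute time and redundancy.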
With both options, you lose flexibility somewhere. When the need arises to extract a new, deeper relationship from the data, you have to at least consider re-engineering everything. In the end, you get physical data models that reflect how your many systems need to function for performance and connectivity reasons, instead of reflecting how subject matter experts actually think about their domains.
Storing and using data in a connected way, using graphs, is a very old idea that offers a fresh perspective. Working with graph databases keeps redundancy low. A node can be stored once in a graph, and then referred to by as many relationships as you like.
Graphs keep data cohesion high, leading to reusability and maintainability of data. And graph data models don’t penalize you when you need to adapt them for new functions or tasks.
Subject matter experts tend to think in the context of networks. Their “boxes and lines” drawings on whiteboards show how all their concepts come together and interrelate. Graph data models are exactly like these networks. When it’s time for you to get insights out of your data, the structure of your data network becomes the solution, not the limitation.
There’s a need here — perhaps an urgent one — for data representation and storage to be approachable in a fundamentally different, more powerful, way.
For you and your data environment to level up, if you will, you can’t just layer a few more features on top of the typical SQL database or JSON document collection that’s already there. You need to focus on the spaces between the facts: the Connections Problem.
Let’s start simple: What’s the best way to move a package from Anchorage, Alaska, to Richmond, Virginia? Logistics companies find routes that are, obviously, paths through the graphs of their shipping networks. What’s the shortest path between cities for an airline that’s had to cancel some routes?
Airline companies rebooking customers will need at least one path through the graph of their cities and flights. But let’s take maps and geography out of the mix for a moment. For any organization, you might ask how many people a given manager manages, directly or indirectly. Each person in that company has a reporting structure, which is also a path through a graph.
Usually, a relational database would force you into concocting, and then maintaining, intermediate roll-up structures for every kind of path query. With a graph database, you can traverse and summarize all reporting structures directly. There’s no magic here; this uses classic approaches such as breadth-first and depth-first search.
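The reporting-structure question can be answered with exactly that classic approach. Here is a minimal Python sketch, using a hypothetical org chart stored as an adjacency map, that counts everyone a manager manages, directly or indirectly, via breadth-first search:

```python
from collections import deque

# Hypothetical reporting structure: manager -> direct reports.
reports = {
    "dana": ["sam", "lee"],
    "sam": ["ira", "kim"],
    "lee": ["pat"],
}

def all_reports(manager, reports):
    """Breadth-first traversal: everyone reporting to `manager`,
    directly or indirectly."""
    seen = set()
    queue = deque(reports.get(manager, []))
    while queue:
        person = queue.popleft()
        if person not in seen:
            seen.add(person)
            # Follow the edge from this person to their own reports.
            queue.extend(reports.get(person, []))
    return seen

# dana manages sam and lee directly, and ira, kim and pat indirectly.
```

No roll-up tables to maintain: the traversal follows the relationships that are already there.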
With graph visualization, you can literally zoom out and see the structure with your own eyes. You don’t lose the forest when you focus on the trees. The graph’s superstructure reveals actual system dynamics, rather than the same list of granular facts you’re tired of managing.
We’ve discussed the structures of businesses and business transactions. Let’s keep scaling up for a minute.
Life in the 21st century is interconnected to an unprecedented degree. The 1990s and 2000s ushered in a wave of globalization that saw the creation of complex trade ties between nations. Individual industries have leveled up their practices, leading to longer supply lines and more sophisticated logistics.
The pandemic of 2020 shook up global supply chains, stranding container ships in ports, mixing up markets and creating chip shortages and other phenomena that persist into 2022.
In 2016, the International Consortium of Investigative Journalists (ICIJ) exposed a network of offshore bank accounts that enabled corruption and crime on a global scale. One company owning another, or transferring money to another, may not appear suspicious in itself. Yet when ICIJ zoomed out to illuminate the superstructure of the network, the team exposed flows of resources among previously hidden webs of related accounts, companies and individuals. There was no signal to be discerned from any single bank account. What they managed to expose — using a graph database as their tool — were patterns of interdependencies and connections that were more informative than the data points themselves.
This same principle translates directly to a task such as generating recommendations for media users or retail shoppers. You may enjoy products or services that are highly rated and used by people with profiles similar to yours.
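That recommendation pattern is itself a short graph traversal: two hops in a user-item graph. This is a minimal Python sketch with entirely hypothetical users and items:

```python
# Hypothetical user-item graph: who likes what.
likes = {
    "you":   {"book_a", "album_b"},
    "user2": {"book_a", "album_b", "film_c"},
    "user3": {"album_b", "film_d"},
}

def recommend(user, likes):
    mine = likes[user]
    # Hop 1: users whose profile overlaps with `user`'s.
    similar = [u for u, items in likes.items() if u != user and items & mine]
    # Hop 2: items those similar users like that `user` hasn't seen yet.
    suggestions = set()
    for u in similar:
        suggestions |= likes[u] - mine
    return suggestions
```

A production recommender would weight the overlap rather than treat it as all-or-nothing, but the shape of the computation — follow relationships out and back — is the same.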
In the field of fraud detection, patterns of financial transfers throw red flags. In logistics networks, supplier challenges one week may lead to predictable production problems the following week. Connections matter. The key insights you’re looking for may be extracted from the forest, not the trees.
Leveling Up Means Letting Go
As data practitioners, how can you level up? Expanding the breadth of questions you can ask and adopting multiple frameworks for examining a problem together make a great start. As you’ve seen, the tools you use today have a tricky ability to constrain how you think through problems. When all you have is a hammer, the saying goes, everything looks like a nail.
Leveling up means transitioning from perusing lists of facts, to thinking about complex system dynamics. It means finding a path through the forest, not just avoiding running into the next tree. To accomplish this, you need to approach the Connections Problem with the right tools, and with a fresh viewpoint.
When I was learning software development, I started with Java, and got all the standard object-oriented programming vernacular booted up into my head. And it was great. You know the feeling — that “click” that happens in your brain when you finally “get it” and start to fly.
But Java isn’t everything; I later learned about different paradigms — procedural programming, functional programming, logic programming. It was as though the world had been stood on its head. All the problems I knew how to solve with Java suddenly looked different with a different mental toolset. Click, click, click.
If you’re coming from the realm of JSON documents and relational tables, that’s what the feeling of graph methodology will be for you: one big “click.” It’s still your data! That didn’t change. But you gain a fresh perspective, enabling patterns that just don’t come up naturally or aren’t idiomatic with other database systems.
As technologists, in an industry where a complete technology generation extends no longer than five years, many of us have grown accustomed to being professional learners. Maybe graph databases won’t change everything you do in application development. Yet they could offer you a new mental toolset that makes you better prepared for whatever the next generation brings.
We technologists should be strategically impatient with the limitations of outmoded approaches. This is how you level up your practice. This is how garbage collection became so prevalent in modern software development. This is how workload deployment became more automated. This is why we have continually abstracted forward, from servers to virtual machines to serverless.
Too often, your success as a developer is defined by how many old things you can sustain and maintain. Some of the freshest, most powerful ideas aren’t complicated or magic; they’re simply a perspective shift: looking at old problems in new ways to make new answers attainable.
Where to Begin
- Get started free with Neo4j AuraDB native graph database
- Free, complete, self-paced courses from Neo4j GraphAcademy
- Register now for Neo4j NODES 2022 Online Developer Education Summit Nov. 16.