
Using Graphs and Machine Learning to Find Needles in a Haystack

19 Jul 2018 1:59pm, by

Gaurav Deshpande, Vice President of Marketing, TigerGraph
Gaurav Deshpande is the Vice President of Marketing at TigerGraph. He spent 15 years overseeing marketing for IBM's Artificial Intelligence, Blockchain and Cloud portfolios for the Banking and Financial, Telecommunications and also Retail markets. Additionally, he built out and positioned IBM’s Big Data and Analytics portfolio.

Fraud detection, in many ways, resembles finding needles in a haystack. You must sort and make sense of massive amounts of data in order to find your “needles” or in this case, your fraudsters.

Let’s use the example of a phone company with billions of calls occurring in its network every week. How can it identify signs of fraudulent activity from its mountain (or haystack) of call logs? This is where machine learning provides value, offering a magnet: the ability to identify behaviors and patterns of likely fraudsters. Using a graph model, a machine becomes more adept at recognizing suspicious phone call patterns and is able to separate them from the billions of calls made by regular people that make up our haystack of data.

Indeed, more and more organizations are leveraging machine learning, along with graphs, to prevent various types of fraud, including phone scam, credit card chargeback, advertising, money laundering and more. Before we further discuss the value of the powerful combination of machine learning and graphs, let’s take a look at how current approaches for identifying fraudsters based on machine learning are missing the mark.

A Machine Learning Algorithm Is Only as Good as Its Training Data

In order to detect a specific condition, such as a phone engaged in a scam or a payment transaction involved in money laundering, a machine learning system requires a sufficient volume of training examples: fraudulent calls, or payment transactions that are likely to be related to money laundering. Let’s drill down further using phone-based fraud as an example.

In addition to the volume of calls that are likely to be fraudulent, a machine learning algorithm also requires features or attributes that have a high correlation with the phone fraud behavior.

Because fraud, much like money laundering, accounts for less than 0.01 percent (1 in 10,000) of the total volume of transactions, the quantity of training data with confirmed fraud activity is very small. Such a limited quantity of training data, in turn, results in poor accuracy for the machine learning algorithms.
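The arithmetic behind that imbalance can be sketched in a few lines. This is illustrative only: the transaction counts are made-up numbers chosen to match the 1-in-10,000 rate, and `evaluate` is a hypothetical helper, not part of any product described here. It shows that a model that never flags anything still scores 99.99 percent accuracy while catching no fraud at all, which is why so few confirmed cases give an algorithm very little to learn from.

```python
def evaluate(labels, predictions):
    """Return (accuracy, recall) for boolean fraud labels and predictions."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    true_pos = sum(1 for y, p in zip(labels, predictions) if y and p)
    actual_pos = sum(1 for y in labels if y)
    accuracy = correct / len(labels)
    recall = true_pos / actual_pos if actual_pos else 0.0
    return accuracy, recall


# Assumed toy data: 100,000 transactions, 10 of them fraudulent (1 in 10,000).
labels = [True] * 10 + [False] * 99_990
never_flag = [False] * len(labels)  # a "model" that flags nothing

accuracy, recall = evaluate(labels, never_flag)
print(f"accuracy={accuracy:.4f} recall={recall:.1f}")
```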

Features or attributes for finding a fraudster are based on a simple analysis. In the case of phone-based fraud, they include the calling history of particular phones to other phones that may be in or out of the network, the age of a prepaid SIM card, the percentage of one-directional calls made (cases where the call recipient did not return a phone call) and the percentage of rejected calls. Similarly, to find payment transactions involved in money laundering, features such as the size and frequency of the payment transactions are fed into the machine learning system.
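As a rough sketch, per-phone features of this kind can be computed from a call log in one pass. The record layout `(caller, callee, status)` and the function name `per_phone_features` are assumptions made for illustration. "One-directional" is interpreted here as the share of a phone's distinct contacts that never called back, which is one of several reasonable readings.

```python
from collections import defaultdict

# Hypothetical call-log records: (caller, callee, status).
CALLS = [
    ("A", "B", "answered"),
    ("B", "A", "answered"),
    ("A", "C", "rejected"),
    ("A", "D", "answered"),
]


def per_phone_features(calls):
    """Return {phone: {"pct_one_directional": x, "pct_rejected": y}} per caller."""
    called = defaultdict(set)    # caller -> distinct callees
    outgoing = defaultdict(int)  # caller -> number of outgoing calls
    rejected = defaultdict(int)  # caller -> number of rejected outgoing calls
    for caller, callee, status in calls:
        called[caller].add(callee)
        outgoing[caller] += 1
        if status == "rejected":
            rejected[caller] += 1

    features = {}
    for phone, callees in called.items():
        # A contact is one-directional if it never placed a call back to `phone`.
        one_way = [c for c in callees if phone not in called.get(c, set())]
        features[phone] = {
            "pct_one_directional": len(one_way) / len(callees),
            "pct_rejected": rejected[phone] / outgoing[phone],
        }
    return features


features = per_phone_features(CALLS)
```

Note that every value here describes a single phone in isolation, which is exactly the limitation discussed next.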

However, relying on features focused on individual nodes alone results in a high rate of false positives. For example, a phone involved in frequent one-directional calls may belong to a sales representative who is calling prospects to find leads or sell goods and services. It may also be involved in harassment, where one user is calling another as a prank. A high volume of false positives results in wasted effort investigating non-fraudulent phones, leading to low confidence in the machine learning solution for fraud detection.

Building a Better Magnet for Phone-Based Fraud

Real-life examples are proving the value of graphs and machine learning to combat fraud. Currently, a large mobile operator uses a next-generation graph database with real-time deep link analytics to address the deficiencies of current approaches for training machine learning algorithms. The solution analyzes over 10 billion calls for 460 million mobile phones and generates 118 features for each mobile phone. These features are based on deeper analysis of calling history and go beyond the immediate recipients of calls.

The diagram below illustrates how the graph database identifies a phone as a “good” or a “bad” phone. A bad phone requires further investigation to determine whether it belongs to a fraudster.

Figure 1 – Detecting phone-based fraud by analyzing network or graph relationship features

A customer with a good phone calls other subscribers, and the majority of their calls are returned. This helps to indicate familiarity or trusted relationships between the users. A good phone also regularly calls a set of other phones (say, every day or every month), and this group of phones is fairly stable over a period of time (“Stable Group”).

Another feature indicating good phone behavior is when a phone calls another that has been in the network for many months or years and receives calls back. We also see a high number of calls between the good phone, the long-term phone contact and other phones within a network calling both these numbers frequently. This indicates many in-group connections for our good phone.

Lastly, a “good phone” is often involved in a three-step friend connection: our good phone calls another phone, phone two, which calls phone three, and the good phone also calls phone three directly. This closed loop indicates a circle of trust and interconnectedness.

By analyzing such call patterns between phones, our graph solution can easily identify bad phones, which are phones likely involved in a scam. These phones have short calls with multiple good phones, but receive no calls back. They also do not have a stable group of phones called on a regular basis (representing an “empty stable group”). When a bad phone calls a long-term customer in the network, the call is not returned. The bad phone also has many of its calls rejected and lacks three-step friend relationships.
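Two of the relationship features described above can be sketched directly on a list of calls. The definitions below are assumptions made for illustration, not the operator's actual feature logic: a stable group is taken to be the set of phones called in every observed period, and a three-step friend connection is the pattern "A calls B, B calls C, and A also calls C." The function names and toy data are hypothetical.

```python
def stable_group(calls_by_period, phone):
    """Phones that `phone` called in every period; an empty set means no stable group."""
    per_period = [
        {callee for caller, callee in calls if caller == phone}
        for calls in calls_by_period
    ]
    return set.intersection(*per_period) if per_period else set()


def has_three_step_friend(calls, phone):
    """True if phone -> second -> third exists and phone also calls third directly."""
    out = {}
    for caller, callee in calls:
        out.setdefault(caller, set()).add(callee)
    direct = out.get(phone, set())
    return any(
        third in direct
        for second in direct
        for third in out.get(second, set())
        if third not in (phone, second)
    )


# Hypothetical (caller, callee) pairs for two consecutive weeks.
week1 = [("good", "x"), ("x", "y"), ("good", "y"), ("bad", "x"), ("bad", "p")]
week2 = [("good", "x"), ("good", "y"), ("y", "good"), ("bad", "q"), ("bad", "r")]

good_group = stable_group([week1, week2], "good")
bad_group = stable_group([week1, week2], "bad")
```

In this toy data the good phone keeps calling the same two contacts and closes a loop with them, while the bad phone calls a different set of numbers each week and closes no loops.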

The graph database platform generates 118 new features that highly correlate with good and bad phone behavior for each of the 460 million mobile phones in our use case. In turn, that yields roughly 54 billion new training data features (118 × 460 million) to feed machine learning algorithms. The result has been improved accuracy of machine learning for fraud detection, with fewer false positives (non-fraudulent phones marked as potential fraudster phones) as well as fewer false negatives (phones involved in fraud that weren’t marked as such).

To see how graph-based features improve accuracy for machine learning, let’s consider an example (Figure 2) using profiles for four mobile users: Tim, Sarah, Fred and John.

Figure 2 – Improving accuracy for machine learning with graph features

Traditional calling history features, such as the age of the SIM card used, the percentage of one-directional calls and the percentage of total calls rejected by their recipients, result in flagging three out of four of our customers (Tim, Fred and John) as likely or potential fraudsters, because they look very similar based on these features. Graph-based features, with analysis of deep link or multi-hop relationships across phones and subscribers, help machine learning classify Tim as a prankster and John as a salesperson, while Fred is flagged as a likely fraudster. Let’s consider how.

In the case of Tim, he has a stable group, which means he is unlikely to be a sales guy since salespeople call different numbers each week. Tim doesn’t have many in-group connections, which means he is likely calling strangers. He also doesn’t have any three-step friend connections, which suggests that the strangers he is calling aren’t related to one another. It is very likely that Tim is a prankster based on these features.

Let’s consider John who doesn’t have a stable group, which means he is calling new potential leads every day. He calls people with many in-group connections. As John presents his product or service, some of the call recipients are most likely introducing him to other contacts if they think the product or service would be interesting or relevant to them. John is also connected via three-step friend relations, indicating that he is closing the loop as an effective sales guy, navigating the friends or colleagues of his first contact within a group, as he reaches the final buyer for his product or service. The combination of these features classifies John as a salesperson.

In the case of Fred, he doesn’t have a stable group, nor does he interact with a group that has many in-group connections. Plus, he does not have three-step friend relations among the people he calls. This makes him a very likely candidate for investigation as a phone scam artist or fraudster.
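The three profiles above amount to a small decision table over the graph features. The sketch below hand-codes that table purely to make the reasoning explicit; in the approach the article describes, a machine learning model would learn such distinctions from the features rather than follow fixed rules. The `GraphFeatures` type and `classify` function are hypothetical names, and the "good phone" rule is taken from the earlier description of good-phone behavior.

```python
from typing import NamedTuple


class GraphFeatures(NamedTuple):
    stable_group: bool
    many_in_group_connections: bool
    three_step_friend: bool


def classify(f):
    """Map the three graph features to the labels used in the example."""
    if f.stable_group and f.many_in_group_connections and f.three_step_friend:
        return "good phone"
    if f.stable_group and not f.many_in_group_connections and not f.three_step_friend:
        return "likely prankster"
    if not f.stable_group and f.many_in_group_connections and f.three_step_friend:
        return "likely salesperson"
    if not (f.stable_group or f.many_in_group_connections or f.three_step_friend):
        return "likely fraudster"
    return "needs review"  # combinations the article does not discuss


# Feature values as described in the text for Tim, John and Fred.
profiles = {
    "Tim": GraphFeatures(True, False, False),
    "John": GraphFeatures(False, True, True),
    "Fred": GraphFeatures(False, False, False),
}
labels = {name: classify(f) for name, f in profiles.items()}
```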

Going back to our original analogy, we are able to find our needle in the haystack (in our case, Fred the potential fraudster) by leveraging graph analysis to improve the accuracy of machine learning. This is achieved by using the graph database framework to model data in a way that allows more features to be identified and considered when analyzing our haystack of data. The machine, in turn, is trained with more, and more accurate, data, making it smarter and more successful in recognizing potential scam artists and fraudsters.

Building a Better Magnet for Anti-Money Laundering

Graphs and machine learning are used for a host of fraud detection use cases beyond identifying a phone-based scam. Machine learning algorithms are being trained to detect various other types of anomalous behavior, such as identifying potential money laundering.

Global money laundering transactions comprise an estimated two to five percent of global GDP, or roughly $1 trillion to $2 trillion annually, according to a PwC report. The risk of money laundering spans the entire financial services ecosystem, including banks, payment providers and newer cryptocurrencies, such as Bitcoin and Ripple. Given how much financial activity occurs every second of every day, how is it possible to find the needles (our fraudsters) in a haystack of data?

Current approaches focus on attributes or features of the individual node, such as the payment or user in question, but this often leads to high volumes of false positives. The same data is fed into the machine learning algorithm, resulting in poor accuracy of future fraud prediction by the machine learning system: poor data in, poor insights out!

As you might expect, fraudsters disguise their activity with circuitous connections between themselves and known bad activity or bad actors. Any individual connecting path can appear innocent, but if multiple paths from one point to another can be found, the likelihood of fraud increases. Finding data connections two or more transactions away requires more traversal hops, and this is where graphs offer value: by identifying features over data connections and relationships that can be used to better inform and train machine learning. These features may include the size and frequency of payment transactions, or they may be based more abstractly on relationships between the data.

For example, a graph-based approach can uncover semantically meaningful connecting paths between nodes. Let’s consider an incoming credit card transaction to show how its relation to other entities can be identified:

New Transaction → Credit Card → Cardholder → (other)Credit Cards → (other)Bad Transactions

This query uses four hops to find connections only one card away from the incoming transaction. Any individual connecting path can appear innocent, but if multiple paths from A to B can be found, the likelihood of fraud increases. Finding connections that are two or more transactions away therefore requires even more hops. In this way, Real-Time Deep Link Analytics offers value in uncovering multiple, hidden connections to minimize fraud.
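The four-hop pattern can be sketched with plain dictionaries standing in for a graph database. All identifiers and data below are hypothetical, and a production system would express this as a query in the graph database's own language rather than in application code.

```python
# Hypothetical toy data: which card each transaction used, and who holds each card.
CARD_OF_TXN = {"t_new": "card1", "t1": "card2", "t2": "card2", "t3": "card3", "t4": "card4"}
HOLDER_OF_CARD = {"card1": "alice", "card2": "alice", "card3": "bob", "card4": "alice"}
BAD_TXNS = {"t1", "t3", "t4"}  # transactions already confirmed as fraudulent


def linked_bad_transactions(txn):
    """Follow Transaction -> Card -> Cardholder -> other Cards -> bad Transactions."""
    card = CARD_OF_TXN[txn]        # hop 1: the card used by the new transaction
    holder = HOLDER_OF_CARD[card]  # hop 2: the holder of that card
    other_cards = {                # hop 3: the same holder's other cards
        c for c, h in HOLDER_OF_CARD.items() if h == holder and c != card
    }
    return sorted(                 # hop 4: bad transactions on those cards
        t for t, c in CARD_OF_TXN.items() if c in other_cards and t in BAD_TXNS
    )


paths_found = linked_bad_transactions("t_new")
```

Here the new transaction reaches two known bad transactions through two different cards held by the same person, the kind of multiple-path signal that raises the likelihood of fraud.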

Second, a graph-powered approach enables the use of graph-based statistics to measure the global relevance of nodes, links, and paths. For example, betweenness gives the number of times an entity falls on the shortest path between other entities. This metric shows which entity acts as a bridge between other entities. Betweenness can be the starting point to detect money laundering or other suspicious activity: a high score could indicate that someone or something is a go-between in a fraud ring or in the layering stage of money laundering. Community detection finds the natural groupings in a network by comparing the relative density of in-group connections vs. between-group connections.
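To make betweenness concrete, here is a minimal sketch using Brandes' algorithm on a small, unweighted, undirected graph. It computes the standard fractional form of the metric (when several shortest paths tie, each counts proportionally). The account names and edges are invented for illustration, and a graph database would normally provide this computation as a built-in.

```python
from collections import deque


def betweenness(adj):
    """Betweenness for an unweighted, undirected graph {node: set(neighbors)}."""
    score = {v: 0.0 for v in adj}
    for s in adj:
        order = []                    # nodes in order of increasing distance from s
        preds = {v: [] for v in adj}  # predecessors on shortest paths from s
        sigma = {v: 0 for v in adj}   # number of shortest paths from s
        dist = {v: -1 for v in adj}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):     # accumulate dependencies, farthest first
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each unordered pair was counted once from each endpoint.
    return {v: c / 2 for v, c in score.items()}


# Hypothetical accounts: "m" sits between two otherwise separate groups.
edges = [("a", "m"), ("b", "m"), ("m", "c"), ("m", "d"), ("a", "b")]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

scores = betweenness(adj)
bridge = max(scores, key=scores.get)
```

Account `m` lies on the shortest path for five of the six other pairs (every pair except `a` and `b`, which are linked directly), so it stands out as the bridge worth investigating.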

Similarly, other graph-based analytics, such as degree centrality and shortest path, can add necessary coloring to otherwise unremarkable data points. Degree centrality provides the number of links going in or out of each entity, offering a count of how many direct connections each entity has to other entities within the network. This is particularly helpful for finding the most connected accounts or entities which are likely acting as a hub, and connecting to a wider network.
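Degree centrality is the simplest of these statistics to compute. The sketch below counts distinct incoming and outgoing links per account from a list of transfers. The `(sender, receiver)` layout and the account names are assumptions for illustration.

```python
from collections import Counter


def degree_counts(transfers):
    """Return {account: {"in": n, "out": n, "total": n}} over distinct links."""
    links = set(transfers)  # repeated transfers between the same pair count once
    out_deg = Counter(src for src, _ in links)
    in_deg = Counter(dst for _, dst in links)
    return {
        node: {
            "in": in_deg[node],
            "out": out_deg[node],
            "total": in_deg[node] + out_deg[node],
        }
        for node in set(out_deg) | set(in_deg)
    }


# Hypothetical (sender, receiver) transfers; "hub" both collects and redistributes.
transfers = [("a", "hub"), ("b", "hub"), ("hub", "c"), ("hub", "d"), ("a", "b"), ("a", "hub")]
degrees = degree_counts(transfers)
most_connected = max(degrees, key=lambda n: degrees[n]["total"])
```

An account with both high in-degree and high out-degree, like `hub` here, is a candidate for the kind of hub that connects to a wider network.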

By linking data together, graph analytics can support rules-based machine learning methods in real time to automate anti-money laundering (AML) processes and reduce false positives. Using a graph engine to incorporate sophisticated data science techniques such as automated data flow analysis, social network analysis, and machine learning in their AML process, enterprises can improve money laundering detection rates with better data, faster. They can also move away from cumbersome transactional processes and towards a more strategic and efficient AML approach.

Example: E-Payment Company

For one example of graphs and machine learning powering AML, we can look towards the #1 e-payment company in the world. Currently, this organization has more than 100 million daily active users, and uses graph analytics to modernize its investigation methods.

Previously, the company’s AML practice was a very manual effort, as investigators were involved with everything from examining data to identifying suspicious money movement behavior. Operating expenses were high and the process was highly error-prone.

Implementing graph analytics, the company was able to automate development of intelligent AML queries, using a real-time response feed leveraging machine learning. Results included a high economic return using a more effective AML process, reducing false positives and translating into higher detection rates.

Example: Credit Card Company

Similarly, a top five payment provider sought to improve its AML capabilities. Key pain points included high costs and an inability to comply with federal AML regulations, resulting in penalties. The organization relied on a manual investigative process performed by a machine learning team composed of hundreds of investigators, resulting in a slow, costly and inefficient process with more than 90 percent false positives.

The company leverages a graph engine to modernize its investigative process. It has moved from having its machine learning team cobble processes together towards combining the power of graph analytics with ML to provide insight into connections between individuals, accounts, companies and locations.

By uniting more dimensions of its data, and integrating additional points — such as external information about customers — it is able to automatically monitor for potential money laundering in real time, freeing up investigators to make more strategic use of their now-richer data. The result is a holistic and insightful look at its colossal amounts of data, producing fewer false positive alerts.


In today’s era of data explosion, it’s more and more important for organizations to make the most of analyzing their colossal amounts of data in real time for fraud detection. The powerful combination of graphs with machine learning offers immense value in ensuring that machine learning algorithms are being fed quality data. As machine training becomes more effective, the result is more fraudulent activity being identified as it happens. Graphs are a powerful asset in helping to ensure that higher quality, more complex features can be identified to support accurate machine learning designed to find the needles in the haystacks.

