AI Makes New Scientific Discoveries by Analyzing Old Research Papers

Artificial intelligence could potentially be used to automate new scientific discoveries, as researchers from the U.S. Department of Energy’s Lawrence Berkeley National Laboratory recently found out when they let an unsupervised AI loose to analyze millions of old scientific papers.
Amazingly, the algorithm — which had no previous training in materials science — was able to predict now-known thermoelectric materials in advance, suggesting that it could be used to review past scientific papers in order to uncover new knowledge that might have been missed by human experts.
Published in the recent edition of Nature, the team’s paper outlines the problem with today’s scientific literature: much of it is text-based, which makes it hard to analyze, whether by conventional statistical analysis or by existing machine learning methods.
That is unfortunate, because the authors of these papers often make insightful connections between bits of data and draw valuable conclusions about the issue being studied. As a result, there’s a lot of knowledge embedded in these papers that isn’t easily interpreted by machines, and that would be time-consuming for humans to digest as well. While there have been previous attempts to use natural language processing to retrieve information from scientific papers, the disadvantage of those approaches is that they require a lot of human supervision, in the sense that datasets have to be manually labeled for training the AI model.
In contrast, the Lawrence Berkeley team’s solution was to use a machine learning algorithm called Word2vec, which doesn’t require any manually labeled data. Instead, it works by learning word embeddings, where words and phrases from a body of text are mapped to numerical vectors that preserve and represent their syntactic and semantic relationships. The idea here is that words with similar meanings tend to appear in similar contexts, and therefore end up with similar word embeddings. For instance, when the algorithm is trained on enough scientific text, it will generate a vector for the word “iron” that sits closer to the vector for “steel” than to the vector for “biological”.
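To make the “closer” idea concrete, here is a minimal sketch of how closeness between word vectors is commonly measured, using cosine similarity. The three-dimensional vectors below are invented toy values for illustration only; they are not taken from the team’s model, which used much longer vectors learned from text.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy embeddings (assumed values, not real Word2vec output).
toy_embeddings = {
    "iron": [0.9, 0.8, 0.1],
    "steel": [0.8, 0.9, 0.2],
    "biological": [0.1, 0.2, 0.9],
}

sim_steel = cosine_similarity(toy_embeddings["iron"], toy_embeddings["steel"])
sim_bio = cosine_similarity(toy_embeddings["iron"], toy_embeddings["biological"])

print(f"iron vs steel:      {sim_steel:.3f}")
print(f"iron vs biological: {sim_bio:.3f}")
```

With these toy numbers, “iron” scores much higher against “steel” than against “biological”, which is the kind of relationship a trained model is expected to capture.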
Uncovering Novel Connections
In their tests, the team gathered 3.3 million abstracts from scientific papers published in over 1,000 journals between 1922 and 2018. The algorithm then processed the roughly 500,000 unique words found in the abstracts and transformed each one into a 200-dimensional vector. Though the AI had no prior training in materials science, after this process it was nevertheless able to ‘learn’ scientific concepts and infer relationships between data points, simply by analyzing the placement of words in the abstracts and how they co-occur with one another.
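As a rough illustration of what “analyzing the placement of words” can mean in practice, the sketch below generates the (center word, context word) pairs that a skip-gram style model would typically be trained on. This is only the pair-generation step, not Word2vec training itself, and the sample sentence, the function name `skipgram_pairs` and the window size are assumptions made for this example.

```python
def skipgram_pairs(tokens, window=2):
    """Pair every word with the words that appear within `window` positions of it."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

# Hypothetical, heavily simplified "abstract" used only for illustration.
sample_tokens = "this compound converts heat to electricity".split()

for center, context in skipgram_pairs(sample_tokens, window=1):
    print(center, "->", context)
```

A model trained on millions of such pairs adjusts each word’s vector so that words sharing many context words drift toward each other, without anyone labeling the data.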
“Without telling it anything about materials science, it learned concepts like the periodic table and the crystal structure of metals,” said team leader and paper co-author Anubhav Jain in a statement. “That hinted at the potential of the technique. But probably the most interesting thing we figured out is, you can use this algorithm to address gaps in materials research, things that people should study but haven’t studied so far.”
In particular, the algorithm showed that it could predict novel thermoelectric materials, which convert heat to electricity efficiently. During the team’s tests, the algorithm came up with a variety of predictions for possible thermoelectric materials, with the top ten predictions demonstrating higher-than-average thermoelectric properties.
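One plausible way to turn word embeddings into a list of predictions is to rank candidate materials by how similar their vectors are to the vector for a property word such as “thermoelectric”. The sketch below assumes that approach for illustration; the article does not spell out the team’s exact ranking procedure, and the material names and vector values here are hypothetical placeholders.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_candidates(target_vector, candidate_vectors, top_k=10):
    """Sort candidates by similarity to the target vector, highest first."""
    scored = [
        (name, cosine_similarity(target_vector, vector))
        for name, vector in candidate_vectors.items()
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]

# Toy values (assumed): a vector for the word "thermoelectric" and
# three placeholder materials that stand in for real compounds.
thermoelectric_vector = [0.7, 0.1, 0.6]
candidates = {
    "material_A": [0.6, 0.2, 0.7],
    "material_B": [0.1, 0.9, 0.1],
    "material_C": [0.9, 0.0, 0.3],
}

ranking = rank_candidates(thermoelectric_vector, candidates, top_k=2)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

In a real setting, the highest-ranked materials that have not yet been studied as thermoelectrics would be the ones worth a closer look.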
In addition, the team tested the ability of the algorithm to “discover” new materials by feeding it only abstracts published up to a certain year in the past, say 2009, to see if it would come up with materials that were found after that date. They then compared those predictions to actual findings made after that year and found that a significant number of the predicted materials, four times more than if the predictions had been made at random, appeared in studies dated after that cut-off point.
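The logic of this kind of look-back test can be sketched in a few lines: compare the share of predictions that were later reported with the share you would expect from picking candidates at random. The numbers and names below are made up for illustration and do not reproduce the team’s data or their exact scoring method.

```python
def hit_rate(predictions, later_reported):
    """Fraction of predicted materials that show up in later studies."""
    hits = sum(1 for material in predictions if material in later_reported)
    return hits / len(predictions)

def enrichment_over_random(predictions, candidate_pool, later_reported):
    """How many times better the predictions do than a random pick from the pool."""
    baseline = hit_rate(candidate_pool, later_reported)
    return hit_rate(predictions, later_reported) / baseline

# Hypothetical pool of 20 candidates, 5 of which were reported after the cutoff.
candidate_pool = [f"m{i}" for i in range(20)]
later_reported = {"m0", "m1", "m2", "m3", "m4"}

# Hypothetical predictions made from abstracts published before the cutoff.
predictions = ["m0", "m1", "m2", "m7"]

enrichment = enrichment_over_random(predictions, candidate_pool, later_reported)
print(f"hit rate:   {hit_rate(predictions, later_reported):.2f}")
print(f"enrichment: {enrichment:.1f}x over random")
```

An enrichment well above 1 suggests that the predictions carry real signal rather than luck, which is the kind of result the team reported.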
“I honestly didn’t expect the algorithm to be so predictive of future results,” said Jain. “I had thought maybe the algorithm could be descriptive of what people had done before but not come up with these different connections. I was pretty surprised when I saw not only the predictions but also the reasoning behind the predictions, things like the half-Heusler structure, which is a really hot crystal structure for thermoelectrics these days.”
But the algorithm wouldn’t be restricted to just discovering new materials. Because it isn’t trained on a specific dataset to start with, the algorithm could readily be generalized and used in other fields, such as discovering new drugs or even finding new cross-disciplinary links, as it works in an unsupervised capacity to find novel connections that might have been overlooked — perhaps even years ahead of time. Moreover, according to the team, this method could be used to automatically extract knowledge that is still hidden in older scientific papers, and which might not be apparent to human eyes.
Artificial intelligence is increasingly being used to assist humans in all sorts of tasks, from automating game design, to figuring out the complexities of protein folding, and even “reading” our minds to reconstruct our memories. As these results have shown, AI could also be used to make new scientific discoveries altogether.
Images: Lawrence Berkeley National Laboratory