
MIT’s New AI Data Extraction System Teaches Itself by Surfing the Web

11 Jan 2017 9:40am

We live in an age where there is a vast overabundance of data available on the web. The problem is that sifting through all of it to find and make sense of what is relevant is incredibly time-consuming. But it may soon become easier: Massachusetts Institute of Technology researchers recently published a paper introducing a new artificial intelligence system capable of learning, on its own, to extract useful information from online sources.

Recently presented at the Association for Computational Linguistics’ Conference on Empirical Methods in Natural Language Processing in Austin, the researchers’ paper describes a new information extraction system that’s able to automatically extract structured information from unstructured machine-readable documents. Put simply, the program can do what humans are good at: When faced with a gap in information or something we don’t understand, we go and search for another document that will add to our understanding or further our knowledge.

“In information extraction, traditionally, in natural-language processing, you are given an article and you need to do whatever it takes to extract correctly from this article,” said professor Regina Barzilay, senior author of the new paper. “That’s very different from what you or I would do. When you’re reading an article that you can’t understand, you’re going to go on the web and find one that you can understand.”

AI Fills Information Gaps by Itself

That’s what distinguishes this new AI from its predecessors. Typically, machine learning models work within narrowly defined parameters and must be ‘taught’ with many training examples before they can tackle a problem with some measure of success. This new model, however, was trained on very little data and then set loose to fill the gaps on its own.

Similar to other models, the process involves the AI assigning a “confidence score” to its data classifications, which indicates the statistical likelihood that the classification is correct, given the patterns learned from the training data. In contrast to previous systems, this new model will automatically perform a web search for more relevant information if the confidence score doesn’t meet a certain threshold. It will then extract pertinent data from the new texts and integrate it with its previous extractions. If the confidence score is still too low, the cycle will begin again.
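To make that cycle concrete, here is a minimal Python sketch of a search-until-confident loop. It is an illustration only, not the researchers’ code: the function names, the 0.8 threshold, the cap on search rounds and the toy data are all assumptions, and the fixed “keep the higher-confidence value” merge rule is a simple stand-in for the merging behavior that the real system learns.

```python
from typing import Callable, Dict, List, Tuple

# field name -> (extracted value, confidence in [0, 1])
Extraction = Dict[str, Tuple[str, float]]


def lowest_confidence(result: Extraction) -> float:
    """Confidence of the weakest field; 0.0 if nothing was extracted."""
    return min((conf for _, conf in result.values()), default=0.0)


def merge(current: Extraction, new: Extraction) -> Extraction:
    """Assumed merge rule: per field, keep the higher-confidence value."""
    merged = dict(current)
    for field, (value, conf) in new.items():
        if field not in merged or conf > merged[field][1]:
            merged[field] = (value, conf)
    return merged


def extract_with_search(
    article: str,
    extract: Callable[[str], Extraction],
    search: Callable[[str, Extraction], List[str]],
    threshold: float = 0.8,   # hypothetical cut-off
    max_rounds: int = 5,      # hypothetical safety cap on search rounds
) -> Tuple[Extraction, int]:
    """Extract, then keep searching and merging while confidence is too low."""
    result = extract(article)
    rounds = 0
    while lowest_confidence(result) < threshold and rounds < max_rounds:
        for doc in search(article, result):
            result = merge(result, extract(doc))
        rounds += 1
    return result, rounds


# Toy stand-ins for a trained extractor and a web search (placeholder data).
TOY_EXTRACTIONS: Dict[str, Extraction] = {
    "original": {"shooter": ("PERSON_B", 0.3), "fatalities": ("4", 0.9)},
    "followup": {"shooter": ("PERSON_A", 0.95)},
}

result, rounds = extract_with_search(
    "original",
    extract=TOY_EXTRACTIONS.__getitem__,
    search=lambda article, partial: ["followup"],
)
```

In this toy run the first pass is unsure about the shooter’s name, so one round of “searching” pulls in a second document whose higher-confidence value replaces the original one.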

“We used a technique called reinforcement learning, whereby a system learns through the notion of reward,” graduate student Karthik Narasimhan, one of the paper’s co-authors, explained to Digital Trends. “Because there is a lot of uncertainty in the data being merged, particularly where there is contrasting information, we give it rewards based on the accuracy of the data extraction. By performing this action on the training data we provided, the system learns to be able to merge different predictions in an optimal manner, so we can get the accurate answers we seek.”
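One way to picture “rewards based on the accuracy of the data extraction” is to score each merge decision by how much it changes accuracy against hand-labeled training data. The Python sketch below does that under stated assumptions: the action names (`accept_all`, `reject_all`, or a single field name), the helper functions and the placeholder values are hypothetical and are not taken from the paper.

```python
from typing import Dict, Tuple

Record = Dict[str, str]  # field name -> extracted value


def accuracy(extracted: Record, gold: Record) -> float:
    """Fraction of gold-standard fields that were extracted correctly."""
    if not gold:
        return 0.0
    correct = sum(1 for field, value in gold.items() if extracted.get(field) == value)
    return correct / len(gold)


def merge(current: Record, candidate: Record, action: str) -> Record:
    """Apply a merge decision (hypothetical action set)."""
    merged = dict(current)
    if action == "accept_all":
        merged.update(candidate)
    elif action in candidate:
        # Accept only the single field named by the action.
        merged[action] = candidate[action]
    # "reject_all" (or an unknown field) leaves the record unchanged.
    return merged


def merge_reward(
    current: Record, candidate: Record, action: str, gold: Record
) -> Tuple[float, Record]:
    """Reward is the accuracy after the merge minus the accuracy before it."""
    after = merge(current, candidate, action)
    return accuracy(after, gold) - accuracy(current, gold), after


# Placeholder training example: two sources disagree on both fields.
gold = {"shooter": "PERSON_A", "fatalities": "4"}
current = {"shooter": "PERSON_B", "fatalities": "4"}
candidate = {"shooter": "PERSON_A", "fatalities": "3"}

for action in ("accept_all", "reject_all", "shooter", "fatalities"):
    reward, _ = merge_reward(current, candidate, action, gold)
    print(f"{action:>10}: reward {reward:+.1f}")
```

Here only the decision to take the shooter’s name from the new article is rewarded, because it is the only merge that moves the record closer to the labeled answer. A learner that sees many such examples can pick up which merges tend to pay off.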

Analyzing Shootings and Contaminated Food

The researchers employed what is called a deep Q-network (DQN), which is “trained to optimize a reward function that reflects extraction accuracy while penalizing extra effort.”
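As a rough illustration of that trade-off, the Python sketch below pairs a reward that subtracts a small cost for every extra step with the standard Q-learning update. A deep Q-network estimates the same action values with a neural network; the lookup table here is a deliberately simplified stand-in, and the penalty, learning rate, discount factor, state labels and action names are all assumed values rather than the paper’s.

```python
from collections import defaultdict
from typing import DefaultDict, Hashable, Sequence, Tuple

QTable = DefaultDict[Tuple[Hashable, str], float]


def reward(acc_before: float, acc_after: float, step_penalty: float = 0.01) -> float:
    """Accuracy gained by a step, minus a fixed cost for the extra effort."""
    return (acc_after - acc_before) - step_penalty


def q_update(
    q: QTable,
    state: Hashable,
    action: str,
    r: float,
    next_state: Hashable,
    actions: Sequence[str],
    alpha: float = 0.5,   # learning rate (assumed)
    gamma: float = 0.9,   # discount factor (assumed)
    done: bool = False,
) -> float:
    """One tabular Q-learning step toward r + gamma * max_a Q(next_state, a)."""
    best_next = 0.0 if done else max(q[(next_state, a)] for a in actions)
    target = r + gamma * best_next
    q[(state, action)] += alpha * (target - q[(state, action)])
    return q[(state, action)]


ACTIONS = ["accept", "reject", "stop"]   # hypothetical action set
q: QTable = defaultdict(float)

# A step that lifts accuracy from 0.5 to 1.0 earns a positive reward...
useful = reward(0.5, 1.0)
# ...while a step that changes nothing costs the penalty.
wasted = reward(0.5, 0.5)

q_update(q, "low_confidence", "accept", useful, "finished", ACTIONS, done=True)
q_update(q, "low_confidence", "reject", wasted, "low_confidence", ACTIONS)
```

After these two updates the table values “accept” above “reject” in the `low_confidence` state, which is the behavior the penalty is meant to encourage: keep searching only while it improves the answer.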

They tested the information extraction system separately on two tasks. The first was analyzing a collection of data on mass shootings in the United States (macabre, we know, but useful if one is studying the effects of gun control laws), where the system had to extract the name of the shooter, location, the number of wounded and the number of fatalities. The second task involved going through a set of data on food contamination events to extract information on food type, contaminant type and location. In both cases, the team found that the new system outperformed conventionally trained information extractors by about 10 percent.


Sample news article on one shooting case. It contains both the shooter’s name and the number of fatalities, but extracting either piece of information would require complex tools.


Two other articles on the same shooting case, retrieved by the information extraction system. The first article gives the number of people killed, while the second article identifies the shooter in an easily extractable form.

The new system could be a boon for research tasks that previously required tedious manual effort. Not only would a system like this save time, it could also save lives: the researchers foresee healthcare providers using it to aggregate patient histories into a more unified structure, which would improve the quality of care that a patient receives.

In the greater scheme of things, the system is one step toward building what’s called artificial general intelligence: an AI capable of mastering any number of tasks in the way a human might, rather than being an expert in only one domain.

Featured image: Esther’s Follies, Austin, Texas. Other images: MIT.
