Machine Learning

How Good Is Machine Learning at Understanding Text?

3 Apr 2018 3:00am

How good are computers at reading and understanding documents?

When Microsoft announced recently that it had matched human performance on the Stanford Question Answering Dataset (SQuAD), it was an important milestone in machine reading comprehension (MRC). Alibaba achieved a similar score shortly afterward, and both were beaten a few days later by a system from the Harbin Institute of Technology and Chinese AI company iFLYTEK.

These are the same kind of regular improvements we’ve seen in speech and image recognition, and as usual, developers can expect to see some of these techniques packaged up in open source tools and commercial APIs for them to use.

Microsoft hasn’t published the details of the R-NET+ system, but an earlier version called R-NET previously achieved a record score on SQuAD. This was a gated attention-based recurrent network with a self-matching stage. The recurrent neural network uses word embeddings to represent the questions and the documents that answer them, picking out the phrases the network should pay attention to, with a gating mechanism that scores different sections of the document to emphasize the passages relevant to the answer and ignore those that aren’t.

Recurrent networks use directed connections between nodes as memory inside the network, but in practice, they only store a limited amount of context. That means one possible answer being processed by the network might not include all the sections of the document that contain clues to the answer. The self-matching layer uses the neural network to match each word in the source document against the rest of the document as well as against the question, to make sure all the available information gets used to predict which words in the document are the best answer.
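The gating idea can be sketched in a few lines. This toy example (plain Python with invented vectors, not Microsoft’s actual R-NET code) scores each passage word against a question vector, turns the scores into attention weights, and uses a sigmoid gate to scale down words that look irrelevant:

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Toy vectors: one pooled question vector, one vector per passage word.
question = [0.5, -0.2, 0.8]
passage = [[0.6, -0.1, 0.7],   # relevant word: similar to the question
           [-0.4, 0.9, -0.5],  # irrelevant word
           [0.3, 0.0, 0.2]]

# Attention scores each passage word against the question...
scores = [dot(w, question) for w in passage]
weights = softmax(scores)

# ...and the gate scales each word's representation by its relevance,
# suppressing passages that don't help answer the question.
gated = [[sigmoid(s) * x for x in w] for s, w in zip(scores, passage)]

print([round(w, 3) for w in weights])
```

The relevant word ends up with the largest attention weight, which is all the gate needs in order to pass it through mostly intact while shrinking the rest.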

Reading or Understanding?

But the announcement also generated something of a backlash because the term “reading comprehension” sounds a lot like “reading a document and understanding it as well as a human.” That might be the long-term aim of the field, but it’s not what the SQuAD dataset measures. Reading comprehension has very broad uses in search and beyond, but right now it’s much more like the tests you may remember from school, where you had to read a passage of text and then answer questions using just the information it contains.

SQuAD has over 100,000 questions, all of which are answered by a section of text in one of 536 Wikipedia articles; the questions were created by crowd workers on Mechanical Turk who were asked to come up with as many questions as they could, based on a short snippet of an article. It’s a big improvement on previous datasets which were much smaller, or took a simpler approach of multiple choice or fill in the blank, and it does require logical reasoning to infer answers from the source passages (like connecting a name in one sentence to a pronoun in the next sentence), but it’s still not reflective of the kind of open-ended questions users ask search engines and intelligent agents.
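The extractive format is easy to see in miniature. This sketch (with an invented passage and question, not real SQuAD data) mimics the dataset’s convention that each answer is a literal span of the context, located by character offset:

```python
# A minimal SQuAD-style record: each answer is an exact span of the
# context, located by character offset (answer_start). The context and
# question strings here are invented for illustration.
record = {
    "context": "Rain forms when water vapor condenses into droplets "
               "that become heavy enough to fall.",
    "qas": [
        {
            "question": "What causes rain?",
            "answers": [
                {"text": "water vapor condenses into droplets",
                 "answer_start": 16}
            ],
        }
    ],
}

def check_span(context, answer):
    """Verify the answer text really appears at the stated offset."""
    start = answer["answer_start"]
    return context[start:start + len(answer["text"])] == answer["text"]

ans = record["qas"][0]["answers"][0]
print(check_span(record["context"], ans))  # True
```

Because every answer must be locatable this way, a model never has to compose an answer in its own words; it only has to point at the right span.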

SQuAD questions don’t have abbreviations or spelling mistakes in them, either, and the MRC system doesn’t have to look at multiple documents or even multiple passages in a document to come up with an answer, so it isn’t synthesizing information or rewriting it to match the question. The source documents are also clean, high-quality Wikipedia articles; not all the documents these systems will need to read will be that clear. And there’s always an answer in the SQuAD text, so models can use tricks like picking the only date in a passage if the question starts with “when.” That means SQuAD doesn’t test whether a model can determine whether there is enough information to answer a particular question at all.

SQuAD is a dataset of question-and-answer pairs together with the snippets of Wikipedia text they come from; results like these are why Alibaba’s chief scientist for natural-language processing, Luo Si, claimed that “Objective questions such as ‘what causes rain’ can now be answered with high accuracy by machines.”

As Adam Trischler, senior research scientist in the Maluuba team, explained to us, “SQuAD covers a range of topics including science, politics and medical information, which is good coverage, but the language is very distinct from what you’d find in a novel or a news article, so models trained on SQuAD have some limitations in generalizing to, say, reading the news.”

News search is a particularly interesting area, he said. “One of the places we see question answering and generation applying really nicely is in more transient high-volume information sources like news. There’s a ton of knowledge in Wikipedia, but it’s more static. News is changing every day so a machine comprehension reading question and answer approach can have a lot of impact because it can sift through the new information rapidly.”

The Maluuba team also released the Frames dataset based on conversations between users and travel experts with details of hotels, flights and destinations, as a way of testing intelligent agents that can answer more complex questions as part of on-going conversations where the goal might change based on the information they supply. Effective MRC systems will need to work with a wide range of data sources and handle questions that arrive in different forms, so multiple datasets are key to improving comprehension models.

Learning to Ask Questions

As usual, having large datasets to train and test with is helping researchers improve the performance of reading comprehension models — but creating those data sets is slow and expensive, making it harder to use machine reading for topics where there isn’t a good test data set to learn from.

“Objective questions such as ‘what causes rain’ can now be answered with high accuracy by machines.” – Alibaba’s Luo Si

Another team at Microsoft Research is using transfer learning to build MRC systems for domains like a new disease, where there are plenty of source documents but no existing, manually labeled datasets of questions and answers to train on. SynNet is a two-stage synthesis network. It first learns what interesting information in a document looks like (key facts, named entities and frequently used semantic concepts) and synthesizes those into answers; it then learns to generate natural-language questions that the “interesting” information could answer, using bidirectional Long Short-Term Memory networks. Once trained, SynNet can be applied to a new domain, where it generates both questions and answers that can be used to train an MRC system. Training SynNet on SQuAD and then using it with NewsQA gives almost as good results as a system trained specifically on NewsQA.
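The two-stage shape can be illustrated with a deliberately crude stand-in: here simple heuristics (a regex for years and capitalized names, plus a question template) play the roles of SynNet’s learned answer-synthesis and question-generation networks. The passage and the heuristics are invented for illustration only:

```python
import re

def pick_answers(passage):
    """Stage 1 stand-in: treat years and multi-word capitalized names
    as the 'interesting' spans a learned answer module would find."""
    years = re.findall(r"\b\d{4}\b", passage)
    names = re.findall(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)*\b", passage)
    return years + [n for n in names if " " in n]

def make_question(answer):
    """Stage 2 stand-in: a template in place of the learned
    LSTM question generator."""
    if answer.isdigit():
        return "In what year did this happen?"
    return f"Who or what is {answer}?"

passage = "Harbin Institute of Technology set a new SQuAD record in 2018."
pairs = [(make_question(a), a) for a in pick_answers(passage)]
for q, a in pairs:
    print(q, "->", a)
```

The real system learns both stages from labeled data in a source domain; the point of the sketch is only the pipeline shape: answers are extracted first, then questions are generated to fit them.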

The Maluuba team that Microsoft acquired last year created the NewsQA dataset of 120,000 Q&A pairs using CNN articles from the DeepMind Q&A dataset, with questions and answers written by different people and validated by still others. Not all the questions have answers, so the system has to recognize when it doesn’t have enough information, and those that do have answers require reasoning: synthesizing information from multiple sentences, recognizing answers that paraphrase the question using synonyms, and using general knowledge to infer answers from incomplete information or related concepts.

Interesting facts and synthesized questions and answers from SynNet.

The Maluuba team is also working on teaching systems to ask better, more informative questions, using reinforcement learning with rewards for the qualities that make a question better, such as fluent phrasing. A system that can ask good questions needs deeper reasoning ability, because it has to both understand the source document and generate natural-language questions. It also has to know when a document can’t fully answer a question and it needs to ask more questions or look for more sources, Trischler told us.

“Sometimes the information you’re looking for isn’t in the document you have in front of you; your knowledge base is incomplete, and information is missing from it. We want to address the question of when the available knowledge is insufficient: how do you improve or add to it? You explore the limits of what you know and what you don’t know, what you understand and what needs further clarification. As people, if there’s something we don’t know one of the first things we do is ask a question: asking questions is this fundamental behavior we undertake to fill in the gaps and add to knowledge. We envision the same thing happening with literate machines,” Trischler said.

That will also mean MRC systems working across multiple documents. “If I’m an agent and I get a question from a user and maybe I find a source of information, but as I read it — as a machine — maybe something is still unclear, so I want to ask additional questions and maybe that directs me to another document I can reference against to answer more complicated questions.”


Literate Machines

Machine reading is a harder problem than recognition tasks like image recognition because of the ambiguity and complexity in language. MRC systems have to understand synonyms and context (tea and coffee are both hot drinks, Java can be a coffee growing region or a programming language, a cappuccino and a cup of coffee mentioned in the same document might be the same thing).

That means using background information that may not be in the source document, as well as synthesizing information from multiple phrases (which means parsing sentence structure and connecting nouns, pronouns and verbs across sentences) and summarizing it into an answer that matches the question.
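A toy illustration of why embeddings help here: words used in similar contexts end up with nearby vectors, so a system can tell that “cappuccino” belongs with “coffee” rather than “programming.” The three-dimensional vectors below are invented; real models learn hundreds of dimensions from large corpora:

```python
import math

def cosine(a, b):
    """Cosine similarity: 1.0 for identical directions, ~0 for unrelated."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

# Invented toy "embeddings" for three words.
emb = {
    "coffee":      [0.9, 0.1, 0.0],
    "cappuccino":  [0.8, 0.2, 0.1],
    "programming": [0.0, 0.1, 0.9],
}

# The drink words sit much closer together than either does to
# "programming", which is how a model picks the right sense of "Java"
# from its surrounding words.
print(cosine(emb["coffee"], emb["cappuccino"]) >
      cosine(emb["coffee"], emb["programming"]))  # True
```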

But if the answer to a question like “why does it rain” includes clouds and gravity, the system doesn’t have to be able to explain the theory of gravity unless that’s also mentioned in the document. Reading comprehension is a “term of art” and doesn’t imply understanding in the human sense, any more than sentiment detection algorithms “understand” happiness and sadness.

The most obvious use of MRC is in search; instead of a link to a document or web page that has the answer to a query somewhere on the page, it can give an immediate answer, synthesized from the source. It also improves search for narrow and specific domains where there may not be a lot of the data that search algorithms usually depend on to give useful results.

But it could also help with fact checking (are the statistics or political commitments quoted in an article backed up by the documents cited as evidence), summarizing scientific papers (which of two methods covered gives the best results for a given task, which drug has the best results for patients with these characteristics), product recommendations (which product gets only positive reviews), diagnosis (which diseases have these symptoms, what treatments are recommended in this situation), extracting troubleshooting information from a manual, or answering questions about how tax codes apply to a specific situation.

The question and answer format is a good match for an intelligent assistant like Cortana, and it’s one of the reasons that Microsoft acquired Maluuba. Announcing the news, Microsoft executive vice president for artificial intelligence Harry Shum talked about an AI agent that would use machine comprehension to answer complex questions about, say, how tax law applies to a company, working like “a secretary with an intuitive knowledge of your company’s inner workings and deals. The agent would be able to answer your question in a company security-compliant manner by having a deeper understanding of the contents of your organization’s documents and emails, instead of simply retrieving a document by keyword matching, which happens today.”

To deliver that, Maluuba is trying to create what product manager Rahul Mehrotra described to us as “literate machines”: machines that can think, reason and communicate like humans; machines that can read text, understand text and then learn how to communicate, whether in writing or orally.

“Our goal was to solve the problem of finding information. Today we use keywords and you have to read all this information to really find what you’re looking for. Our original idea was to build a literate machine that can answer any question a user has. [The next step of reading comprehension is] how can we move away from multiple choice where the answers are given, to something more extractive where the system can read a news article and extract the exact answer where multiple options aren’t available for it to look at.”

Conversational exchanges with an agent could help to narrow down search results, Trischler suggested. “A search for an image of trees will have a huge pool of returned results. You may be looking for something more specific. The algorithm has no idea what that is but if it has reasoning capacity, it can look at the images and divide them into classes and ask you questions to disambiguate. ‘Are you looking for trees in summer with green leaves or in autumn with nice red leaves?’ That could help people refine searches or just learn more about you to help personalize your tools.”

Microsoft is a sponsor of The New Stack

Feature image via Pixabay.

