IBM Releases Open Source Counterparts for Deep Search

IBM Research’s latest effort to speed up scientific discovery comes in the form of the Deep Search for Scientific Discovery (DS4SD) toolkit. Powered by natural language processing and AI, the technology analyzes and extracts large amounts of structured and unstructured data from medical journals, technical articles, financial records, or any other document. From there, the private, paid version of the toolkit collects and curates that data into a base of searchable knowledge graphs.
IBM’s cloud-based software-as-a-service Deep Search product has already proved its value, with successes in 2020 COVID-19 research and in the company’s own molecule-synthesis work. Claiming 1,000x the document-consumption power of a human and 100x the screening capability, Deep Search processed over 6,000 documents of various types during Project Photoresist in a quest to find a new photoacid generator molecule.
Now, IBM is publicly releasing two parts of Deep Search: the DS4SD toolkit and the Deep Search Experience. The latter is an automatic document-conversion service that lets users take a closer look at the conversion quality of a document. The main goal, according to IBM Research, is that users should be able to “robustly explore information extracted from tens of thousands of documents without having to read a single paper.”
The Next Step in Accelerated Discovery
If you remember, IBM released another toolkit earlier this year, one focused on quickening the pace of hypothesis generation. I talked to lead researcher Matteo Manica about GT4SD, the Generative Toolkit for Scientific Discovery, and the information technology company’s plans for an Open Science Hub. This time, I interviewed Peter Staar, principal researcher and manager of the group developing Deep Search. He offered some insight into the project and what’s ahead. “Think of accelerated discovery as a platform with multiple cloud-based technologies,” Staar says. “Deep Search is one of the starting points of the AD platform and pipeline, where we try to manage all of the known knowledge that can be shared.”
Fellow IBM researcher Michele Dolfi added, “With the Deep Search tool, we’re extracting this knowledge into [different] data sets that we can then give to our colleagues in [accelerated] discovery to do other fancy stuff with. Or we can give it to the people doing simulations, or to the people working with the generative toolkit, and so on.” The toolkit itself is a Python package, easily installable with the usual package managers. To upload and convert bulk PDFs into easily read JSON files, users point the tool at the folder of documents they want analyzed and let DS4SD do its work.
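The bulk-conversion workflow described above can be sketched in plain Python. The snippet below is only an illustration of the pattern, not the DS4SD API: `convert_document` is a hypothetical stand-in for the toolkit’s actual conversion call, and the JSON field names are invented for the example.

```python
import json
from pathlib import Path


def convert_document(pdf_path: Path) -> dict:
    """Hypothetical stand-in for the toolkit's conversion call.

    The real DS4SD service parses the PDF with AI models in the cloud;
    here we only mock the shape of a JSON result to show the workflow.
    """
    return {"filename": pdf_path.name, "main-text": [], "tables": [], "figures": []}


def convert_folder(folder: str, out_dir: str) -> list:
    """Convert every PDF in `folder`, writing one JSON file per document."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for pdf in sorted(Path(folder).glob("*.pdf")):
        result = convert_document(pdf)
        target = out / (pdf.stem + ".json")
        target.write_text(json.dumps(result, indent=2))
        written.append(target.name)
    return written
```

Pointing `convert_folder` at a directory of PDFs mirrors the described usage: drop the documents in a folder, run the tool, and collect the JSON output.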
The beginning stages of the project weren’t as smooth. Building the toolkit came with its own challenges and setbacks, most of which Staar attributes to perfecting the AI.
“Although AI has evolved massively over the last five to 10 years, it was still a huge challenge to actually extract accurate knowledge from these documents,” Staar comments. “The main reason for that is because it’s extremely complicated to get good data to train on.”
He continues, “You have all these different layouts, you have text that is grouped together, you have tables, you have figures, and you have forms with key-value pairs that you need to be able to capture. And you want to do it in a way that when you upload the documents, you can easily get this data out. That is where it becomes quite tricky.”
Luckily for them, as Staar noted, IBM has “state of the art” resources at its disposal that made it possible to overcome such roadblocks. Now that the toolkit is complete and released as open source, Dolfi says the team is eager to continue the work, this time with the community: “We’re very research-oriented. We love to have collaborations, whether it’s people in academia or someone working on a cool pet project — we’d be super happy to talk to you.”
Take Part in the Deep Search Experience
The other side of this coin is the service the toolkit connects to when uploading documents, which is available as “open access,” according to Staar, but not open source.
“The reason why we’re not making that open source is simply because you cannot actually run it on your laptop,” Staar explains. “It is many different AI models, many different things that have to happen in the background from parsing PDFs, to applying models, etc. So for convenience, we said, ‘Okay, you get one API, you send a document, and you get the document back.’”
The Deep Search Experience works through four steps: parse, interpret, index, and integrate. After parsing documents and enriching the extracted data by cleaning up the findings, the service stores a user’s own collected information alongside millions of documents from other verified sources. At this point, users can leverage Deep Search to create a knowledge graph that finds and connects relevant entities across their document collection. The software is also capable of answering complex queries, as opposed to basic keyword searches.
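The entity-linking idea behind such a knowledge graph can be shown with a toy example. This is not IBM’s implementation — the document IDs, entity names, and the simple dictionary-based graph below are all invented for the sketch — but it illustrates how mapping entities to the documents that mention them supports relational queries rather than basic keyword lookups.

```python
from collections import defaultdict


def build_graph(docs):
    """Map each extracted entity to the set of documents mentioning it."""
    graph = defaultdict(set)
    for doc_id, entities in docs.items():
        for entity in entities:
            graph[entity].add(doc_id)
    return graph


def co_mentions(graph, a, b):
    """Relational query: which documents connect entity `a` with entity `b`?"""
    return graph.get(a, set()) & graph.get(b, set())


# Hypothetical entity lists, as might come out of the "interpret" step
docs = {
    "paper-1": ["photoacid generator", "photoresist"],
    "paper-2": ["photoacid generator", "solvent"],
    "paper-3": ["photoresist", "etching"],
}
graph = build_graph(docs)
print(co_mentions(graph, "photoacid generator", "photoresist"))  # {'paper-1'}
```

A keyword search returns documents matching one term; the graph query above instead answers a question about relationships, which is the kind of capability the article attributes to Deep Search.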
The open source community continues to grow as more tools and software built for collaboration enter the space. With more companies lifting the veil on their internal projects and inviting like-minded contributors to help drive progress, we can only hope to see more record-breaking discoveries in the future. In the meantime, IBM Research promises more additions to its planned Open Science Hub and further accelerated-discovery milestones soon.