
IBM’s Open Source GT4SD Generates Ideas for Scientists

30 Mar 2022 8:00am

Global information technology company IBM has released an open source library, the Generative Toolkit for Scientific Discovery (GT4SD), with hopes of accelerating scientific discovery through machine learning.

Built to make advanced generative models both easier to use and more efficient to apply in discovery workflows, the GT4SD hosts generative algorithms developed at IBM Research for designing novel materials with distinct applications, along with language models for scientific documents.

In a blog post explaining the GT4SD, IBM described the project as “an open-source library to accelerate hypothesis generation in the scientific discovery process that eases the adoption of state-of-the-art generative models.” IBM Research scientist and lead creator of GT4SD Matteo Manica gave his insights on the toolkit in an interview with The New Stack.

“It was an open exchange between different researchers inside IBM Research. We noticed that there was this urge in the community, and in other fields, to simplify the access to these AI technologies. We also noticed a gap that needed to be filled in generative models,” he said. “Last year we started to garner opinions from various researchers working on the topic. There were a lot of different technologies that we were building, so it was a big effort to homogenize all these ideas coming from different labs into one research project.”

“Compared to the usual time it takes to publish research, we moved very fast. From the initial ideation of an algorithm, it can take around a year to get it published. By that time, it’s already old and you have plenty of other ideas,” Manica jokes. “Instead, what was beautiful about this initiative is that it was very focused on developing a technology quickly. We started with algorithms we already created at IBM Research and then looked for certain things we wanted to make available with our library. It was an effort that lasted ten or eleven months at most.”

Real World Use

Merging AI with hypothesis generation can bring substantial benefits to several fields of study. In drug discovery, where countless drug-like molecules are already known, it is next to impossible to find the ideal candidate through the usual workflow of trial and error. Using the GT4SD, this process can be sped up considerably.

“Generative models are really good at looking at what you already know and listing examples, like properties, and then extrapolate on new examples. You can see this process as connecting dots,” Manica comments. “Imagine you have a Pollock-like canvas with many dots (properties), and you can draw lines between these dots. Along these lines, you find many dots that didn’t exist because they haven’t been discovered yet. Generative models give you a way to simulate the discovery of these dots. If the properties discovered match the criteria used for the search, they can be optimal candidates for discovery processes.”
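
The dot-connecting picture can be made concrete with a toy sketch. The Python below only illustrates the analogy and is not GT4SD's algorithm or API: the two-number property vectors, the `interpolate` and `matches` helpers, and the search criterion are all hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # hypothetical (property_a, property_b) pair


def interpolate(a: Point, b: Point, steps: int) -> List[Point]:
    """Return `steps` evenly spaced points strictly between a and b."""
    return [
        (
            a[0] + (b[0] - a[0]) * i / (steps + 1),
            a[1] + (b[1] - a[1]) * i / (steps + 1),
        )
        for i in range(1, steps + 1)
    ]


def matches(p: Point, low: float, high: float) -> bool:
    """Toy search criterion: the first property must fall inside [low, high]."""
    return low <= p[0] <= high


# Two "dots" that are already known (values are made up for illustration).
known_a: Point = (0.0, 1.0)
known_b: Point = (4.0, 3.0)

# Draw a line between the known dots and propose the points along it.
new_dots = interpolate(known_a, known_b, steps=3)

# Keep only the proposals that satisfy the search criterion.
candidates = [p for p in new_dots if matches(p, low=1.5, high=3.5)]
```

Real generative models work in a learned representation rather than on raw property values, so this sketch mirrors only the shape of the idea: propose points between known examples, then keep the ones that meet the criteria.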

Though drug discovery is an immediately recognizable use case, Manica states that the GT4SD can be used for “any molecular science application.”

“For instance, you can optimize enzymes to catalyze a specific reaction. It’s pretty cool because enzyme engineering and design are super important for greener chemical processes. You won’t need extreme temperatures, toxic solvents, or to consume a lot of energy,” Manica said. “This is a perfect example where generative modeling can help scientists to make a process more sustainable and efficient.”

In its post unveiling the GT4SD, IBM detailed numerous scenarios where the toolkit would be highly useful. Here are just a few:

  • Materials discovery and drug discovery scientists can use the library to provide models that can generate new molecule designs based on specific properties like target proteins, target omics profiles, scaffold distances, binding energies, HOMO and LUMO energies, and many more.
  • Scientists and students using generative models are offered a centralized environment to access and try out different models, with model use simplified via consistent commands for inference or retraining with default parameter settings.
  • AI/ML practitioners building generative models can benefit from the GT4SD’s familiar framework, which makes models easily accessible to a broader community.
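
The idea of "consistent commands" across models can be sketched in a few lines. This is a hypothetical illustration of the design principle, not GT4SD's actual interface: the `Generator` protocol, both toy model classes, and the `run` helper are invented names.

```python
import random
from typing import List, Protocol


class Generator(Protocol):
    """Hypothetical shared interface: every model exposes the same call."""

    def sample(self, n: int) -> List[str]: ...


class RandomChainGenerator:
    """Toy stand-in model: emits random 5-character strings over 'C' and 'O'."""

    def __init__(self, seed: int = 0) -> None:
        self._rng = random.Random(seed)

    def sample(self, n: int) -> List[str]:
        return ["".join(self._rng.choice("CO") for _ in range(5)) for _ in range(n)]


class FixedLibraryGenerator:
    """Toy stand-in model: cycles through a fixed list of SMILES strings."""

    def __init__(self) -> None:
        # Ethanol, acetic acid, benzene.
        self._library = ["CCO", "CC(=O)O", "c1ccccc1"]

    def sample(self, n: int) -> List[str]:
        return [self._library[i % len(self._library)] for i in range(n)]


def run(model: Generator, n: int) -> List[str]:
    """Calling code stays identical regardless of which model is plugged in."""
    return model.sample(n)


outputs = {
    type(model).__name__: run(model, 4)
    for model in (RandomChainGenerator(), FixedLibraryGenerator())
}
```

Because both toy models answer to the same `sample` call, swapping one for the other requires no change to the code that uses them, which is the kind of simplification the list above describes.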

“The replacement of manual processes and human bias in the discovery process has important effects on applications that rely on generative models, leading to an acceleration of expert knowledge,” IBM writes.

“Our focus is on material design because that is where we had most of the research. However, the toolkit is designed to be as generic as possible so generative models can be used in various applications,” Manica said.

Open Source for Research

The GT4SD was developed to be open source from the start, according to Manica. “I’m a strong believer in open source projects, and I think it’s the best way to reach big goals within the scientific community.”

He went on to say, “Our goal behind making the toolkit open source was for research to progress faster in the domain of generative modeling. For any company, revenue is obviously important, and it can be difficult to see how you can make revenue from an open source project. But the main return we’re looking for here is to create a community of users and contributors that helps us to build better models and who we can help empower to build better models.”

Open source technology has been making waves in engineering and coding communities, but Manica is confident that scientists can benefit just as much. The more people who use the toolkit, the more capable it will become.

A Look Ahead

The GT4SD has been released for public use, but it isn’t what Manica would call finished. “The nice thing is that this is not a final product. Of course, it is ready to use, but the library is sort of an open factory for generative modeling. We intend to keep developing it not only for our own research at IBM but for the broader research community to build together. This is only the beginning — we hope for contributors and users to improve the library and also use it in their own projects.”

He continues, “In five years, I’d like to see GT4SD truly grow and foster into a true community. Whether it be in the biochemical domain, polymer physics, or any other industry. At the end of the day, our main concern is achieving the goal of accelerating discovery substantially in the next ten years.”

The toolkit is available for use as of last week. Manica finished by urging potential users to “Try the GT4SD. Use it in your research, break it, and report any issues. It was developed for the scientific community, and we want everyone involved.”
