How generative models are driving accelerated discovery

Science is built on questions. They are the very basis of the discipline, with researchers asking anything from why a certain process is occurring (cancer, for example), to how it can be resolved (posit: more targeted drugs).

But science takes time and money, said Matteo Manica, research staff member with IBM Research. Any hypothesis can have numerous possible answers, and it simply isn’t feasible to test all of those out.

This is making generative models critical, particularly in this era of “accelerated discovery” fueled by the convergence of artificial intelligence (AI), the cloud, and quantum computing. As Manica explained, these generative models can be trained from known molecules to help propose new candidates and new sets of criteria to answer a myriad of scientific questions.

“Generative models are probably our most powerful tool right now to leverage the vast troves of data in science,” Manica said, “and use it to come up with starting points to design and discover new materials, drugs and more.”

IBM takes to open source

To fuel and accelerate this process, IBM Research has released the open source library Generative Toolkit for Scientific Discovery (GT4SD). The toolkit includes various generative models developed by IBM researchers that can be used to accelerate the generation of hypotheses and the discovery process as a whole, said Manica, its chief architect. It also helps to ease adoption of generative AI.

GT4SD is compatible with popular deep learning frameworks and libraries, including PyTorch, PyTorch Lightning, HuggingFace Transformers, GuacaMol and MOSES.

“We want to foster an open community around scientific discovery,” Manica said. “Technologies like AI should be a tool that scientists and researchers use to carry out their research quicker and more effectively, rather than something that requires very specific domain knowledge to utilize.”

More science, more users

The goal is to promote engagement, collaboration, and “more science by more users,” agreed John R. Smith, an IBM Fellow at IBM Research. Researchers and other professionals across academia, government, industry, and business can collaborate and develop, adapt, benchmark, and compare open source technologies, contributions, and projects.

For example, GT4SD includes models that can generate new molecule designs based on properties including target proteins, binding energies, or other targets relevant to materials and drug discovery. Users can also work on discoveries around macro-molecules, enzymes, tissues, and polymers in such areas as preventative treatment and antimicrobial applications. Manica anticipates future uses around not only healthcare and life sciences, but agriculture and even sustainability.

One group of IBM researchers built a generative model that can propose new antimicrobial peptides along with their properties. These are novel, or innovative, drug candidates that have not previously been identified. This was a critical discovery, Manica pointed out, because this class of peptides is considered a “drug of last resort” against antimicrobial resistance – one of the world’s biggest threats to global health and food security.

Candidate molecules were first generated, then filtered by a second AI system that predicted toxicity and broad-spectrum activity. Within the span of a few weeks, researchers identified several dozen novel candidate molecules. This process can normally take years.
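To make the generate-then-filter idea concrete, here is a minimal Python sketch of such a two-stage screen. It is not IBM's system and does not use the GT4SD API. Every name in it is hypothetical: the random sampler stands in for a trained generative model, and the two scoring functions are crude placeholders for real toxicity and activity predictors. The thresholds are arbitrary.

```python
import random
from dataclasses import dataclass

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


@dataclass(frozen=True)
class Candidate:
    sequence: str
    toxicity: float  # hypothetical score in [0, 1]; lower is better
    activity: float  # hypothetical score in [0, 1]; higher is better


def generate_sequences(n, length, rng):
    """Stand-in for a trained generative model: samples random peptides."""
    return ["".join(rng.choice(AMINO_ACIDS) for _ in range(length))
            for _ in range(n)]


def predict_toxicity(seq):
    """Placeholder predictor (assumption): fraction of F, I, L, V, W residues."""
    return sum(seq.count(a) for a in "FILVW") / len(seq)


def predict_activity(seq):
    """Placeholder predictor (assumption): fraction of K and R residues."""
    return sum(seq.count(a) for a in "KR") / len(seq)


def screen(sequences, known, max_toxicity=0.3, min_activity=0.2):
    """Keep novel, unique sequences that pass both predicted-property filters."""
    seen = set(known)
    kept = []
    for seq in sequences:
        if seq in seen:  # skip already-known or duplicate sequences
            continue
        seen.add(seq)
        cand = Candidate(seq, predict_toxicity(seq), predict_activity(seq))
        if cand.toxicity <= max_toxicity and cand.activity >= min_activity:
            kept.append(cand)
    # Rank the survivors by predicted activity, best first.
    return sorted(kept, key=lambda c: c.activity, reverse=True)


if __name__ == "__main__":
    rng = random.Random(0)
    proposals = generate_sequences(n=1000, length=12, rng=rng)
    shortlist = screen(proposals, known=set())
    print(f"{len(shortlist)} of {len(proposals)} proposals passed the filters")
```

In a real pipeline, the sampler and both predictors would be replaced by trained models, and the shortlist would go on to laboratory validation.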

In another example, a team used generative models in combination with AI and high-performance tools to come up with a new photoacid generator (PAG), a type of compound that is key to manufacturing semiconductors. What would normally take years was completed in weeks.

“These models learn how to be novel,” Manica said. “They can learn how to create diverse new inputs that can be valuable.”

The well-honed circular process of the scientific method has researchers working out hypotheses, performing studies, testing, assessing, then reporting back to their original question. Typically, this can cost anywhere from $10 million to $100 million and take 10 years to complete.

“Generative models can greatly shorten the time that it takes and reduce the cost,” Smith said. “This has applications in so many different areas. It can help accelerate discovery for problems related to climate, sustainability, and for therapeutics.”

Manica noted that scientific breakthroughs have often come about through a combination of curiosity and creativity, trial and error. While this approach can be methodical, it can also be slow, and it is infeasible when problems must be solved urgently (such as during COVID-19).

How AI drives accelerated discovery

The future demands accelerated discovery. “This is an area where AI can greatly help us,” Manica said. Generative models can be creative aids that can help researchers break through bottlenecks and think in new ways they might not have in the past, he said, thus prompting more idea generation and so-called “eureka!” moments.

He sees broader implications as well: generative models could shift the scientific thought process toward which questions should be asked before researchers even set out to find the answers.

“Given everything we know about a field, what is the next question we should ask?” Manica said. “We can potentially create generative models to help us answer questions we don’t know where to start with either – such as how to find a new antiviral for an unknown protein, or whether we could make a catalyst for CO2 in the atmosphere.”

Testing models can then be established to help determine the exact conditions needed to derive accurate results and refine future tests, he posited.

Smith agreed on the widespread implications of open source generative modeling. “It’s not the whole universe on its own; there’s a lot more for us to do here around tools,” he emphasized. “We want to do this in a way that brings to life this notion of open source science.”