Google and ZebiAI launch Chemome Initiative to identify 'chemical probes' with AI models

In a study published this week in the Journal of Medicinal Chemistry, researchers at Google, in collaboration with X-Chem Pharmaceuticals, demonstrated an AI approach for identifying biologically active molecules using a combination of physical and virtual screening processes. That work led to the creation of the Chemome Initiative, which launches today -- a collaboration between Google's Accelerated Science team and startup ZebiAI that aims to enable the discovery of many more small molecule chemical probes for biological research.

As part of the Chemome Initiative, Google says, ZebiAI will work with researchers to identify proteins of interest and source the screening data that the Accelerated Science team will use to train AI models. These models will make predictions on commercially available libraries of small molecules, and the predicted molecules will be provided to researchers for activity testing to advance some programs through discovery. The goal is to find chemical probes: small molecules that aren't useful as drugs, but that selectively inhibit or promote the function of specific proteins.

Making sense of the biological networks that support life and produce disease is a complex task. One approach is using small molecules; in a biological system (e.g., cancer cells growing in a dish), they can be added at a specific time to observe how the system responds when a protein has increased or decreased activity.

Despite how useful chemical probes are for this kind of biomedical research, only 4% of human proteins have a known chemical probe available. In an effort to isolate new ones, Google and X-Chem Pharmaceuticals turned to the field of AI and machine learning.

As the coauthors of the study explain, chemical probes are identified by screening the space of small molecules against a target protein to distinguish "hit" molecules that can be further tested. The physical part of the process uses DNA-encoded small molecule libraries (DELs) that contain many distinct small molecules in one pool, each of which is attached to a fragment of DNA serving as a "barcode" for that molecule. To build a DEL, one first generates many chemical fragments that share a common chemical handle. The results are pooled and split into separate reactions, where a set of distinct fragments with another chemical handle is added.

The chemical fragments from the two steps react and fuse together at the common chemical handles, and their DNA fragments are connected to build one continuous barcode for each molecule. Once a library has been generated, it can be used to find the small molecules that bind to the protein of interest by mixing the DEL with the protein and washing away the small molecules that don't attach. Sequencing the remaining DNA barcodes produces millions of individual reads of DNA fragments that can then be processed to estimate which of the billions of molecules in the original DEL interact with the protein.
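The two-step library construction and the read-counting step described above can be sketched in a few lines of Python. This is a toy illustration, not the study's pipeline: the fragment names, the two-letter DNA tags, the no-protein baseline sample, and the pseudocount-smoothed enrichment ratio are all assumptions made for the example.

```python
from collections import Counter
from itertools import product


def build_library(step1, step2):
    """Combine every step-1 fragment with every step-2 fragment.

    Each argument maps a (hypothetical) fragment name to its DNA tag.
    Joining the two tags gives one continuous barcode per molecule.
    """
    return {
        tag1 + tag2: (frag1, frag2)
        for (frag1, tag1), (frag2, tag2) in product(step1.items(), step2.items())
    }


def enrichment(selected_reads, baseline_reads, library, pseudocount=1.0):
    """Score each molecule by its read frequency after selection,
    relative to its frequency in a baseline sample (an assumed control).

    Reads whose barcode is not in the library are ignored.
    """
    sel = Counter(r for r in selected_reads if r in library)
    base = Counter(r for r in baseline_reads if r in library)
    sel_total = sum(sel.values()) + pseudocount * len(library)
    base_total = sum(base.values()) + pseudocount * len(library)
    scores = {}
    for barcode, molecule in library.items():
        sel_freq = (sel[barcode] + pseudocount) / sel_total
        base_freq = (base[barcode] + pseudocount) / base_total
        scores[molecule] = sel_freq / base_freq
    return scores


# Toy data: two fragments per step gives four molecules.
library = build_library({"A1": "AC", "A2": "GT"}, {"B1": "TT", "B2": "CA"})
selected = ["ACTT"] * 8 + ["GTCA"] + ["NNNN"]      # reads after washing
baseline = ["ACTT", "ACCA", "GTTT", "GTCA"] * 3    # reads without selection
scores = enrichment(selected, baseline, library)
best = max(scores, key=scores.get)                 # most enriched molecule
```

A real DEL holds millions to billions of molecules, so the counting would be done over compressed sequencing files rather than in-memory lists, but the idea of mapping barcodes back to molecules and comparing read frequencies is the same.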

To predict whether an arbitrarily chosen small molecule will bind to a target protein, the researchers built a machine learning model -- specifically a graph convolutional neural network, a type of model designed for graph-like inputs such as small molecules. The physical screening with the DEL provides positive and negative examples for a classifier: the small molecules remaining at the end of the screening process are positive examples, and all the others are negative examples.
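To make "graph-like inputs" concrete, here is a minimal sketch of a single graph-convolution step over a molecule represented as atoms (nodes) and bonds (edges). It is a generic neighbor-sum update with hand-set weights, not the architecture used in the study; the one-hot atom features and the identity weight matrices are assumptions chosen to keep the arithmetic easy to follow.

```python
def graph_conv_layer(features, neighbors, weight_self, weight_neigh):
    """One simplified graph-convolution step followed by ReLU.

    features:  list of per-atom feature vectors
    neighbors: dict mapping atom index -> list of bonded atom indices
    weight_*:  square matrices (lists of rows); learned in a real model
    """
    def matvec(matrix, vec):
        return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

    updated = []
    for atom, feat in enumerate(features):
        # Sum the feature vectors of all bonded neighbors.
        agg = [0.0] * len(feat)
        for nb in neighbors[atom]:
            agg = [a + b for a, b in zip(agg, features[nb])]
        own = matvec(weight_self, feat)
        nbr = matvec(weight_neigh, agg)
        updated.append([max(0.0, o + n) for o, n in zip(own, nbr)])
    return updated


def readout(features):
    """Sum atom vectors into one molecule-level vector."""
    return [sum(column) for column in zip(*features)]


# Ethanol heavy atoms C-C-O, with one-hot features [is_carbon, is_oxygen].
atom_features = [[1, 0], [1, 0], [0, 1]]
bonds = {0: [1], 1: [0, 2], 2: [1]}
identity = [[1, 0], [0, 1]]

hidden = graph_conv_layer(atom_features, bonds, identity, identity)
molecule_vector = readout(hidden)
```

In a trained classifier, several such layers would be stacked, the weights would be learned from the positive and negative DEL examples, and the molecule-level vector would feed a final layer that outputs a binding probability.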

The team physically screened three diverse proteins using DELs: sEH (a hydrolase), ERα (a nuclear receptor), and c-KIT (a kinase). Using the DEL-trained models, they then virtually screened large make-on-demand libraries from drug discovery platform Mcule and an internal molecule library at X-Chem to identify a set of molecules predicted to show affinity for each protein target. Lastly, they compared the results of their classifier to a random forest (RF) model, a common method for virtual screening that uses standard chemical fingerprints. They report that the classifier significantly outperformed the RF model in discovering potent candidates.
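Virtual screening of this kind amounts to scoring every molecule in a library with a trained model and keeping the highest-ranked candidates for physical testing. The sketch below shows only that ranking step. The `toy_score` function is a deliberately meaningless placeholder standing in for a trained model, and the molecule IDs and SMILES strings are made up for the example.

```python
import heapq


def virtual_screen(library, score_fn, top_k):
    """Return the top_k (score, molecule_id) pairs, highest score first.

    library:  dict mapping molecule ID -> structure (here a SMILES string)
    score_fn: callable returning a predicted binding score for a structure
    """
    scored = ((score_fn(structure), mol_id) for mol_id, structure in library.items())
    return heapq.nlargest(top_k, scored)


def toy_score(smiles):
    """Placeholder scorer (assumption): fraction of 'O' characters.

    A real screen would call the DEL-trained classifier here.
    """
    return smiles.count("O") / len(smiles)


catalog = {
    "mol-1": "CCO",
    "mol-2": "CCCC",
    "mol-3": "OCCO",
    "mol-4": "CCN",
}
picks = virtual_screen(catalog, toy_score, top_k=2)
picked_ids = [mol_id for _, mol_id in picks]
```

Because `heapq.nlargest` consumes a generator, the full list of scores never has to be held in memory, which matters when the library runs to many millions of make-on-demand molecules.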

The team tested almost 2,000 molecules across the three targets, which it claims is the largest published prospective study of virtual screening to date.

"We're excited to be a part of the Chemome Initiative enabled by the effective ML techniques described here and look forward to its discovery of many new chemical probes. We expect the Chemome will spur significant new biological discoveries and ultimately accelerate new therapeutic discovery for the world," Google wrote in a blog post. "While more validation must be done to make the hit molecules useful as chemical probes, especially for specifically targeting the protein of interest and the ability to function correctly in common assays, having potent hits is a big step forward in the process."
