Salesforce's ProGen trained on 280 million amino acid sequences to learn to generate proteins

This week, a team of scientists at Salesforce published a study detailing an AI system -- ProGen -- that they say is capable of generating proteins in a "controllable fashion," such that it could unlock new approaches to protein engineering. If their claims pan out, it could lay the groundwork for meaningful advances in synthetic biology and materials science -- a highly desirable outcome in the midst of the devastating coronavirus outbreak.

As Salesforce research scientist Ali Madani explained in a blog post, proteins are simply chains of molecules -- amino acids -- bonded together. There are around 20 standard amino acids, which interact with one another and locally form shapes that constitute the secondary structure. Those shapes continue to fold into a fully three-dimensional structure called a tertiary structure. From there, proteins interact with other proteins or molecules and carry out a wide variety of functions, from ferrying oxygen to cells around the body to regulating blood glucose levels.

ProGen, then -- an AI model with 1.2 billion parameters (i.e., values defining skills on a problem) -- was trained to learn the language of proteins. Given the desired properties of a protein, like a molecular function or a cellular component, it can generate a viable sequence.

It's a technique unlike that of DeepMind's AlphaFold, which estimates the distances between amino acid pairs and their angles and uses the estimations to generate protein fragments, or MIT CSAIL's system, which learns to predict how similar protein structures are likely to be from pairs of proteins and embeddings (i.e., mathematical representations) of their sequences. By contrast, ProGen approaches protein generation from a natural language perspective: It treats amino acids as words in a paragraph (in this case, a protein).
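To make the "amino acids as words" framing concrete, here is a minimal sketch of how a protein written in the standard one-letter amino acid codes could be turned into the integer tokens a language model consumes. The vocabulary ordering and the names `encode` and `decode` are illustrative assumptions, not details taken from ProGen itself.

```python
# The 20 standard amino acids in one-letter code.
# Alphabetical ordering is an assumption made for this sketch.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Vocabulary: each amino acid "word" gets an integer id.
TOKEN_TO_ID = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
ID_TO_TOKEN = {i: aa for aa, i in TOKEN_TO_ID.items()}


def encode(sequence):
    """Map a protein string to a list of integer token ids."""
    return [TOKEN_TO_ID[aa] for aa in sequence]


def decode(ids):
    """Map a list of token ids back to a protein string."""
    return "".join(ID_TO_TOKEN[i] for i in ids)


# "MKV" is an arbitrary three-residue placeholder, not a real protein.
ids = encode("MKV")
roundtrip = decode(ids)
```

Unlike English text, which needs tens of thousands of word or subword tokens, a vocabulary of this kind stays tiny; the difficulty lies in the long-range dependencies between positions rather than in the size of the alphabet.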

Madani and the rest of the team behind ProGen trained the model on a data set of over 280 million protein sequences and associated metadata -- the largest publicly available. They paired the samples with over 100,000 conditioning tags so that ProGen could learn the distribution of natural proteins selected through evolution. Basically, the model took each training sample and formulated a guessing game per amino acid; for multiple rounds of training, given the start of a protein sequence, it attempted to predict the next amino acid from the ones preceding it.
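The guessing game described above is standard next-token prediction. The sketch below shows, under stated assumptions, how one tagged protein could be unrolled into (context, next amino acid) pairs and how a model's guesses could be scored with the average negative log-likelihood. The tag string, the `uniform_model` placeholder, and the choice to score only amino acids (not the tags) are assumptions made for illustration; the article does not specify ProGen's exact setup.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def build_examples(tags, sequence):
    """Unroll one tagged protein into (context, next_amino_acid) pairs.

    Tags are prepended as conditioning context. Assumption: only amino
    acids are prediction targets, never the tags themselves.
    """
    tokens = list(tags) + list(sequence)
    examples = []
    for i in range(len(tags), len(tokens)):
        examples.append((tuple(tokens[:i]), tokens[i]))
    return examples


def uniform_model(context):
    """Placeholder for a trained network: ignores the context and
    assigns equal probability to every amino acid."""
    p = 1.0 / len(AMINO_ACIDS)
    return {aa: p for aa in AMINO_ACIDS}


def average_nll(model, examples):
    """Mean negative log-likelihood of the correct next amino acid.
    Training would adjust the model's parameters to push this down."""
    total = sum(math.log(model(ctx)[target]) for ctx, target in examples)
    return -total / len(examples)


# Hypothetical tag and an arbitrary three-residue placeholder sequence.
examples = build_examples(["<hypothetical_tag>"], "MKV")
loss = average_nll(uniform_model, examples)
```

A model that has learned nothing scores ln(20) per amino acid, which is the value the uniform placeholder yields here; any model that has picked up real regularities in protein sequences should score lower.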

ProGen completed this "game" over 1 trillion times, after which it became capable of generating proteins with sequences it hadn't seen before.

In one experiment, the researchers tasked ProGen with replicating the protein VEGFR2, which is responsible for biological processes like cell proliferation, survival, migration, and differentiation. At test time, they provided the model with the beginning portion of VEGFR2 along with relevant conditioning tags and asked it to generate the remaining sequence. Impressively, the ProGen-generated portion maintained the structure of the protein, implying that it produced a functional protein.
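The completion setup in this experiment (conditioning tags plus the start of a sequence go in, the rest comes out) amounts to autoregressive sampling. The following is a rough sketch of that loop, not Salesforce's code: `toy_model` is a uniform stand-in for a trained network, and the tag and the `MKV` prefix are arbitrary placeholders rather than anything from VEGFR2.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def toy_model(context):
    """Stand-in for a trained model: uniform over the 20 amino acids.
    A real model would return a distribution that depends on context."""
    p = 1.0 / len(AMINO_ACIDS)
    return {aa: p for aa in AMINO_ACIDS}


def complete(model, tags, prefix, total_length, seed=0):
    """Extend `prefix` one amino acid at a time until `total_length`.

    Each step feeds the tags plus everything generated so far back into
    the model and samples the next amino acid from its distribution.
    """
    rng = random.Random(seed)  # seeded so runs are repeatable
    tokens = list(tags) + list(prefix)
    sequence = list(prefix)
    while len(sequence) < total_length:
        probs = model(tuple(tokens))
        residues = list(probs)
        weights = [probs[r] for r in residues]
        next_aa = rng.choices(residues, weights=weights, k=1)[0]
        tokens.append(next_aa)
        sequence.append(next_aa)
    return "".join(sequence)


# Hypothetical tag and placeholder prefix, for illustration only.
generated = complete(toy_model, ["<hypothetical_tag>"], "MKV", total_length=10)
```

With the uniform stand-in the continuation is random noise; the point of the experiment is that a trained model's continuation of a real prefix preserved the protein's structure.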

In a second test, the team sought to demonstrate ProGen's abilities with experimentally verified labeled data. Fed a corpus containing over 150,000 variants of protein G domain B1 -- a protein important for the purification, immobilization, and detection of virus- and bacteria-neutralizing antibodies -- ProGen managed to identify proteins with a spread of high fitness values, which corresponded to the properties that make a functional protein.

Importantly, the team demonstrated in both experiments that ProGen's sequences were in a relaxed low-energy state. This correlates with stability -- a high-energy state corresponds to the protein wanting to "explode," indicating that the sequence is incorrect.

"The ProGen sample exhibits lower energy overall, and energy is highest for amino acids that do not have secondary structure. This suggests that ProGen learned to prioritize the most structurally important segments of the protein," Madani wrote in the blog post. "The intuition behind this is that ProGen has learned to become fluent in the language of functional proteins, as it has been trained on proteins selected through evolution. If given an unknown sequence, ProGen can recognize whether the sequence is coherent in terms of being a functional protein."

In the future, the researchers intend to refine ProGen's ability to generate novel proteins, whether undiscovered or nonexistent in nature, by homing in on specific protein properties. "Our dream is to enable protein engineering to reach new heights through the use of AI," Madani continued. "If we had a tool that spoke the protein language for us and could controllably generate new functional proteins, it would have a transformative impact on advancing science, curing disease, and cleaning our planet."
