MIT CSAIL's AI predicts a protein's function from chains of amino acids

AI's been tapped to classify seizures and predict whether breast cancer is likely to metastasize, but that's far from its only medical application. In an academic paper scheduled to be presented at the International Conference on Learning Representations in May, MIT CSAIL scientists describe a system that "computationally" breaks down how segments of chained amino acids determine a protein's function.

They believe it could be used to improve protein engineering -- that is, the design of new enzymes or proteins with certain functions.

"I want to marginalize structure," Tristan Bepler, a graduate student in the computation and biology group at CSAIL and a coauthor of the paper, said in a statement. "We want to know what proteins do, and knowing structure is important for that. But can we predict the function of a protein given only its amino acid sequence? The motivation is to move away from specifically predicting structures, and move toward [finding] how amino acid sequences relate to function."

As Bepler and colleagues explain, the behavior of proteins -- which comprise the aforementioned amino acid chains, each tightly connected by peptide bonds -- is difficult to predict with machine learning. (That said, Google's DeepMind made impressive gains in December with AlphaFold.) Only tens of thousands of the millions of three-dimensional folded protein shapes have been documented, and different amino acid sequences often fold into similar structures, making it tough to distinguish between novel and duplicate results.

So the paper's authors took a different approach: encoding predicted protein structures directly into representations. Specifically, they trained an AI system on roughly 22,000 labeled proteins from the open source Structural Classification of Proteins (SCOP) database, and for each pair of proteins calculated a score indicating how close the two were in structure. Then, they supplied the model random pairs of proteins and embeddings (i.e., mathematical representations) of their amino acid sequences, from which it learned to predict how similar their 3D structures were likely to be. Lastly, they had the model compare its predicted similarity score with the structure-based one to identify which paired embeddings shared protein structures, and architected it to concurrently forecast a "contact map" indicating how far each amino acid was from the others in a protein's structure.
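To make the two training signals described here concrete, the following is a minimal Python sketch, not the researchers' code. It assumes a map that marks residue pairs lying within 8 angstroms of each other (a common convention, not a figure from the paper) and a toy similarity score based on averaged per-position embeddings. The function names, the threshold, and the sample coordinates are all hypothetical.

```python
import math


def contact_map(coords, threshold=8.0):
    """Return an n x n matrix of booleans: True where two residues
    lie within `threshold` angstroms (assumed cutoff) of each other."""
    n = len(coords)
    return [
        [math.dist(coords[i], coords[j]) <= threshold for j in range(n)]
        for i in range(n)
    ]


def mean_pool(embeddings):
    """Average per-position vectors into one fixed-length vector."""
    dim = len(embeddings[0])
    return [sum(vec[k] for vec in embeddings) / len(embeddings) for k in range(dim)]


def embedding_similarity(emb_a, emb_b):
    """Toy score: negative L1 distance between mean-pooled embeddings.
    Higher means more alike. A stand-in, not the paper's scoring rule."""
    a, b = mean_pool(emb_a), mean_pool(emb_b)
    return -sum(abs(x - y) for x, y in zip(a, b))


# Hypothetical 3D positions for four residues, in angstroms.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (20.0, 0.0, 0.0)]
cmap = contact_map(coords)

# Hypothetical per-position embeddings for two short sequences.
emb_a = [[1.0, 0.0], [0.0, 1.0]]
emb_b = [[1.0, 0.0], [1.0, 0.0]]
score = embedding_similarity(emb_a, emb_b)
```

In a real system the embeddings would come from a trained encoder and the score would be fitted against structure-based labels such as those derived from SCOP; here both are hard-coded purely for illustration.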

The result of all that work? An end-to-end system that, given an amino acid chain as input, generates an embedding for each amino acid position in a protein -- embeddings that other models can use to predict the function of the amino acid at that position. In one experiment, a model the researchers trained predicted transmembrane and non-transmembrane segments more accurately than previous approaches.
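As a rough illustration of how per-position embeddings might feed a downstream model, here is a small Python sketch of a nearest-centroid classifier that labels each position as transmembrane ("TM") or not. This is an assumption-laden toy rather than the approach used in the paper: the two-dimensional embeddings, the labels, and the helper names are all made up.

```python
def centroid(vectors):
    """Average a list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]


def fit_centroids(embeddings, labels):
    """Group per-position embeddings by label and average each group."""
    groups = {}
    for vec, label in zip(embeddings, labels):
        groups.setdefault(label, []).append(vec)
    return {label: centroid(vecs) for label, vecs in groups.items()}


def predict(embeddings, centroids):
    """Give each position the label of its nearest centroid
    (squared Euclidean distance)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(centroids, key=lambda label: sq_dist(vec, centroids[label]))
        for vec in embeddings
    ]


# Hypothetical training data: one embedding and one label per position.
train_emb = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
train_labels = ["TM", "TM", "non-TM", "non-TM"]
centroids = fit_centroids(train_emb, train_labels)

# Label two unseen positions.
predicted = predict([[0.85, 0.15], [0.0, 1.0]], centroids)
```

The point of the design is that the embedding model is trained once, and lightweight models like this one can then reuse its output as features for different prediction tasks.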

"Our model allows us to transfer information from known protein structures to sequences with unknown structure. Using our embeddings as features, we can better predict function and enable more efficient data-driven protein design," Bepler said. "At a high level, that type of protein engineering is the goal. Our machine learning models thus enable us to learn the 'language' of protein folding -- one of the original 'Holy Grail' problems -- from a relatively small number of known structures."
