Harvard Medical School's AI estimates protein structures up to a million times faster than previous methods

The recipe for proteins -- the fundamental building blocks of tissues, muscles, enzymes, and antibodies -- is encoded in DNA. These genetic instructions define proteins' three-dimensional structures and determine their functions, but predicting how their amino acid components will interact is notoriously difficult. DNA only contains information about chains of amino acid residues, not those chains' final form. In fact, scientists estimate it would take more than 13.8 billion years, longer than the age of the universe, to work through all the possible configurations of a typical protein's hundreds of amino acids in order to identify the right structure.

Encouragingly, scientists at Harvard Medical School have made progress toward an AI system that is capable of predicting the structure of effectively any protein and can spit out predictions upwards of a million times faster than current state-of-the-art systems without sacrificing accuracy. The work is detailed in a report published this week in the journal Cell Systems, and both the software and the results are freely available via GitHub.

"Protein-folding has been one of the most important problems for biochemists over the last half-century, and this approach represents a fundamentally new way of tackling that challenge. What's compelling about the problem is that it's fairly easy to state: Take [an amino acid] sequence and figure out the shape," said Dr. Mohammed AlQuraishi, research lead and instructor in systems biology in the Blavatnik Institute at HMS, in a statement. "A protein starts off as an unstructured string that has to take on a 3D shape, and the possible sets of shapes that a string can fold into is huge ... [but] we now have a whole new vista from which to explore protein-folding, and I think we've just begun to scratch the surface."

Proteins, AlQuraishi explains, are constructed from a library of 20 different amino acids. Chains of these fold into loops, spirals, sheets, twists, and other substructures that sit in close physical proximity in 3D space, and the arrangements are far from random. Amino acid chains obey the laws of physics, settling into "energetically favorable" states, which makes their shapes predictable.

Previous methods have mapped new amino acid sequences onto predefined templates or sifted through genomic data to identify sequences that might have evolved together. Alphabet subsidiary DeepMind's AlphaFold, for instance, which beat 98 competitors in the Critical Assessment of Structure Prediction (CASP) protein-folding competition last year, used the latter technique to suss out the structure of 25 out of 43 proteins.

But, as AlQuraishi notes, these systems can't determine structures for which we lack prior knowledge, because they don't predict protein structures solely from sequences. He and colleagues instead employed differentiable learning -- a machine learning method in which a model tunes and adjusts itself by feeding data samples forward and backward through its components -- to discover the relationships between a protein sequence and its structure. Their recurrent geometric network, which is made up of only a few thousand lines of computer code, can predict both the most likely angle of the chemical bonds connecting amino acids and the angle of rotation around these bonds.
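To see why bond angles and rotation angles are enough to pin down a 3D shape, here is a minimal sketch of a standard geometric construction: given three atoms already placed, a bond length, a bond angle, and a rotation (torsion) angle fix the position of the next atom. This is not the researchers' code; the function names `place_atom` and `dihedral` and the numeric values are assumptions chosen for illustration.

```python
import math

def sub(u, v):
    return tuple(x - y for x, y in zip(u, v))

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])

def unit(u):
    norm = math.sqrt(dot(u, u))
    return tuple(x / norm for x in u)

def place_atom(a, b, c, bond_length, bond_angle, torsion):
    """Place atom d so that |c-d| = bond_length, the angle b-c-d equals
    bond_angle, and the rotation a-b-c-d equals torsion (radians)."""
    bc = unit(sub(c, b))                 # axis along the b->c bond
    n = unit(cross(sub(b, a), bc))       # normal to the a-b-c plane
    m = cross(n, bc)                     # completes the local frame
    x = -bond_length * math.cos(bond_angle)
    y = bond_length * math.sin(bond_angle) * math.cos(torsion)
    z = bond_length * math.sin(bond_angle) * math.sin(torsion)
    return tuple(c[i] + x * bc[i] + y * m[i] + z * n[i] for i in range(3))

def dihedral(a, b, c, d):
    """Signed rotation angle (radians) around the b->c bond."""
    u1, u2, u3 = sub(b, a), sub(c, b), sub(d, c)
    n1, n2 = cross(u1, u2), cross(u2, u3)
    return math.atan2(math.sqrt(dot(u2, u2)) * dot(u1, n2), dot(n1, n2))

# Illustrative values only: three starting atoms, then one placed from angles.
a, b, c = (0.0, 1.0, 0.3), (0.0, 0.0, 0.0), (1.5, 0.2, 0.0)
d = place_atom(a, b, c, 1.5, math.radians(110.0), math.radians(-60.0))
print(d, round(math.degrees(dihedral(a, b, c, d)), 3))
```

Repeating the placement step along a chain turns a list of predicted angles into a full backbone trace, and measuring the rotation angle on the result recovers the angle that was fed in.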

Trained over the course of months on thousands of proteins, the AI system outperformed all other methods from several recent years of CASP at predicting protein structures for which there are no preexisting templates, and it leapfrogged all but the best models that made use of preexisting templates. Moreover, it made its predictions -- which were checked against ground-truth protein structures to measure accuracy -- in milliseconds, or around six to seven orders of magnitude faster than prior art, which can take hours.
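The "six to seven orders of magnitude" figure follows from comparing milliseconds with hours. A quick sketch of that arithmetic, assuming illustrative runtimes of one millisecond and one hour (neither is a measurement from the report, and `orders_of_magnitude` is a made-up helper name):

```python
import math

def orders_of_magnitude(slow_seconds, fast_seconds):
    """Base-10 logarithm of the speedup between two runtimes."""
    return math.log10(slow_seconds / fast_seconds)

one_hour = 3600.0        # assumed runtime of a slower method, in seconds
one_millisecond = 0.001  # assumed runtime of the new model, in seconds

print(round(orders_of_magnitude(one_hour, one_millisecond), 2))
```

A one-hour job against a one-millisecond job is a speedup of 3.6 million, which lands between six and seven orders of magnitude; longer baseline runs push the figure higher.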

The model isn't yet accurate enough for commercial applications; its predictions are currently off by around six angstroms. (One angstrom is equal to 0.1 nanometer, and about one to two angstroms are needed to resolve the full atomic structure of a protein.) But AlQuraishi says there are plenty of opportunities to optimize the approach, like further integrating chemical and physical rules. And he says the system can complement existing computational and physical methods to determine a wider range of protein structures than previously possible.
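For a sense of what an error measured in angstroms means, here is a small sketch that computes a root-mean-square deviation (RMSD) between predicted and reference atom positions and converts it to nanometers. The article does not say which error measure the six-angstrom figure uses, so RMSD is an assumption here, as are the toy coordinates and the helper names `rmsd` and `angstrom_to_nm`. The sketch also assumes the two structures are already superimposed.

```python
import math

def rmsd(predicted, reference):
    """Root-mean-square deviation between matched atom positions.
    Assumes both structures are already aligned to each other."""
    squared = [
        sum((p - r) ** 2 for p, r in zip(atom_p, atom_r))
        for atom_p, atom_r in zip(predicted, reference)
    ]
    return math.sqrt(sum(squared) / len(squared))

def angstrom_to_nm(value):
    """One angstrom is 0.1 nanometer."""
    return value * 0.1

# Toy coordinates in angstroms, invented for illustration.
reference = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0)]
predicted = [(0.0, 0.0, 6.0), (3.8, 6.0, 0.0)]

error = rmsd(predicted, reference)
print(error, angstrom_to_nm(error))
```

In this toy case every atom is displaced by six angstroms, so the overall error is six angstroms, or 0.6 nanometer, several times coarser than the one to two angstroms cited for full atomic resolution.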

"Accurately and efficiently predicting protein-folding has been a holy grail for the field, and it is my hope and expectation that this approach, combined with all the other remarkable methods that have been developed, will be able to do so in the near future," AlQuraishi added. "We might solve this soon, and I think no one would have said that five years ago. It's very exciting and also kind of shocking at the same time."