
Harvard’s New Open Source AI Algorithm Simplifies Protein Folding Puzzle

25 Jul 2019 12:00pm

Proteins may be small and unassuming, but these molecules are essential for a variety of biological functions in all living organisms, including digestion, immune response and even intracellular communication. Proteins consist of long chains of smaller organic compounds called amino acids, and their different functions are determined by the way those chains fold up in three-dimensional space. Not surprisingly, the folded structures of these protein chains can get immensely complex, and scientists have yet to fully figure out how and why certain proteins fold the way they do, and how diseases like Alzheimer’s might be caused when they misfold.

Modern technologies like cryo-electron microscopy, nuclear magnetic resonance and X-ray crystallography can help us understand protein folding a little better, but determining structures this way is unfortunately time-consuming and costly. Accurately predicting the folded structures of proteins could be the key to unlocking many medical mysteries, and thanks to recent developments in integrating artificial intelligence into computational biology, that slow process may very well be accelerated, allowing us to discover or even design new and useful proteins.

That possibility was made evident recently with Google DeepMind’s AlphaFold, an AI system that demonstrated it could accurately predict the three-dimensional structure of a new protein when given only a list of the amino acids that make it up. Now a systems biologist and researcher from Harvard University has developed yet another novel computational technique that may be up to a million times faster than AlphaFold in predicting the structures of folded proteins.

Simplifying Complexities of Protein Folding

In a paper titled “End-to-end differentiable learning of protein structure” and published in Cell Systems, Harvard researcher Mohammed AlQuraishi details how his method uses a kind of AI known as deep learning, via an artificial neural network that emulates the way a human brain processes information.

What distinguishes this new model is that it uses something known as end-to-end differentiable learning. Techniques such as the one utilized by AlphaFold are able to predict the ways a protein chain might fold, based on its constituent amino acids and the distances and angles between them, but these methods often rely heavily on so-called co-evolutionary data from previously known proteins, or on predefined structural templates for protein folding. Thus, they aren’t all that effective in determining the 3D structure of an unknown protein sequence, much less predicting what might happen if there are changes or mutations in a given protein structure. Moreover, these current methods often use off-the-shelf neural network components that aren’t necessarily tailored to solving the protein folding problem, and they don’t explicitly map out how protein sequences are related to their structures mathematically.

AlQuraishi’s approach, however, streamlines the process somewhat by using a mathematical function. Dubbed a recurrent geometric network (RGN), the method employs a more contextual technique to determine the various ways a protein chain might fold. Rather than being restricted to prior data on protein folding configurations or templates, RGNs are more flexible in the sense that they use “differentiable primitives,” which are analogous to placing words contextually within a sentence: one follows certain rules so that the result makes sense. While the model does take many weeks to train using available data on protein folding, afterward the model is capable of fine-tuning itself over and over as it analyzes and “learns” how a particular protein sequence relates to its structure mathematically, using a “ground truth” example of a folded protein to check its predictions for accuracy. This approach then allows it to apply that knowledge to unknown sequences and generate folding predictions in mere milliseconds, rather than the hours or days that are required by other models.
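
To make the idea of checking a prediction against a “ground truth” structure more concrete, here is a minimal sketch in Python. It assumes a distance-based root-mean-square deviation (dRMSD) as the error measure, which this article does not name, and it uses made-up coordinates rather than real protein data. The function names are hypothetical and are not taken from AlQuraishi’s published code.

```python
import math
from itertools import combinations


def pairwise_distances(coords):
    """Distances between every pair of points (i < j) in a 3D chain."""
    return [math.dist(a, b) for a, b in combinations(coords, 2)]


def drmsd(predicted, truth):
    """Distance-based RMSD between two structures of equal length.

    Only internal distances are compared, so the two structures do not
    need to be rotated or translated onto each other first.
    """
    if len(predicted) != len(truth):
        raise ValueError("structures must have the same number of points")
    if len(truth) < 2:
        raise ValueError("need at least two points to compare distances")
    d_pred = pairwise_distances(predicted)
    d_true = pairwise_distances(truth)
    squared = [(p - t) ** 2 for p, t in zip(d_pred, d_true)]
    return math.sqrt(sum(squared) / len(squared))


# Made-up coordinates for illustration only, not real protein data.
ground_truth = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0), (1.0, 1.0, 1.0)]
# The same shape moved elsewhere in space: internal distances are unchanged.
shifted = [(x + 5.0, y - 2.0, z + 3.0) for x, y, z in ground_truth]
# A hypothetical prediction whose last point is misplaced.
poor_guess = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0), (2.0, 2.0, 0.0)]

print(drmsd(shifted, ground_truth))     # error for the shifted but identical shape
print(drmsd(poor_guess, ground_truth))  # error for the misplaced point
```

The shifted copy should score essentially zero, while the misplaced point produces a clearly larger error. A learning system of the kind described above would adjust itself to push such an error down across many known structures; this sketch shows only the comparison step, not the network or its training.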

“Through their recurrent architecture, RGNs are able to model long protein sequence fragments and discover higher-order relationships between these fragments,” explained AlQuraishi. “As additional structural and sequence data become available, and as new recurrent architectures emerge that are able to capture even longer-range interactions… RGNs can automatically learn to improve their performance, while implicitly capturing sequence-structure relationships that may be uncovered using neural network probing techniques.”

During his tests, AlQuraishi found that this new algorithm either outperformed or was competitive with similar protein folding prediction models. He’s since made the code available on GitHub, with the hope that other researchers and experts will build and improve upon it, saying: “Deep-learning approaches, not just mine, will continue to grow in their predictive power and in popularity, because they represent a minimal, simple paradigm that can integrate new ideas more easily than current complex models.”


Images: Ousa Chea via Unsplash; Harvard University
