
Machine Learning for Proteins by Mohammed AlQuraishi

The lecture was a brief but intense dive into the mechanics of protein structure prediction. The speaker works on protein structure and interactions, and his lab maintains an open-source project called OpenFold, which is a "Trainable, memory-efficient, and GPU-friendly PyTorch reproduction of AlphaFold 2".

10,000ft overview of how AlphaFold and its successor AlphaFold2 work: both how they predict structure and how they learn to predict structure. A graph showed CASP free-modeling accuracy (RMSD, in Ångströms) against year on the X-axis, with a precipitous drop after 2018! The speaker noted that you need to be below 3-4Å if you want to make biological inferences. AlphaFold (v1) was the first to dip below 5Å.

Speaker noted that AlphaFold2 has shown indications it has:

  • Learned some measure of Physics.
  • Learned to generalize to unseen regions of structure space.
  • Not learned to generalize on structurally disruptive mutations.

The rest of the lecture discussed each of these in turn.

Part I: Measure of Physics

Observation 1: The Multiple Sequence Alignment (MSA) 'sculpts' the "energy landscape" of the protein. This looks a bit like gradient descent, but on an energy function. What if you don't give the model the MSA? It fails terribly! But what if you instead give it a set of templates that vary in how dis/similar they are to the target structure (the "decoy TM-score", ranging from 0 to 1)? Absent an MSA, when you provide the model with a very good hypothesis/template it is certain about its prediction, and vice versa. That is to say, it 'understands' something about protein structure even without an MSA: maybe not the physics, but some folding patterns.
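A minimal sketch of what this decoy-template experiment might look like in code, assuming hypothetical `predict_without_msa`, `tm_score`, and pLDDT accessors (none of these are actual OpenFold/AlphaFold APIs):

```python
# Sketch of the template-sensitivity experiment (Observation 1).
# `predict_without_msa` and `tm_score` are hypothetical placeholders,
# not actual OpenFold/AlphaFold functions.

def template_sweep(target_seq, decoy_templates, native, predict_without_msa, tm_score):
    """For each decoy template, predict without an MSA and record how
    confident the model is versus how good the template actually was."""
    results = []
    for template in decoy_templates:
        prediction = predict_without_msa(target_seq, template=template)
        results.append({
            "decoy_tm": tm_score(template, native),     # template quality (0..1)
            "mean_plddt": prediction["plddt"].mean(),   # model confidence
        })
    # The lecture's claim: decoy_tm and mean_plddt track each other,
    # i.e. the model "knows" when it has been handed a good hypothesis.
    return results
```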

Observation 2: In structural probe experiments, where the MSA passes through 192 layers before a structure is produced, they 'froze' the structure as it evolved through the layers and didn't notice much of a difference! Why? Because the MSA is deep. If you make the MSA sparse, these shenanigans no longer work!
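One generic way to run this kind of layer-by-layer probe in PyTorch is with forward hooks; the sketch below is illustrative, and the block/module names are not OpenFold's actual module paths:

```python
import torch

# Sketch: probing a deep stack of blocks with forward hooks to see how the
# intermediate representation evolves layer by layer. Assumes each block
# returns a single tensor; OpenFold's real modules are more involved.

def collect_layer_activations(model, blocks, inputs):
    """Run one forward pass and capture each block's output."""
    captured = {}
    handles = []
    for i, block in enumerate(blocks):
        def hook(module, args, output, idx=i):
            captured[idx] = output.detach()
        handles.append(block.register_forward_hook(hook))
    with torch.no_grad():
        model(inputs)
    for h in handles:
        h.remove()
    return captured
```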

Observation 3: For the prediction of multimeric complexes, the standard approach is to pair the proteins by species, because residues coevolve not just within a chain but across chains, and then run the prediction. But the best results didn't come from pairing by species! They came from just smushing the two MSAs together, handing that to the model, and it Just Worked™. It understands something about how to build proteins and link them together.
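A rough sketch of the two MSA constructions being compared: species-paired rows versus simply 'smushing' the two alignments into a block-diagonal layout. This is a conceptual illustration, not the exact AlphaFold-Multimer featurization:

```python
# Sketch of the two ways to feed two chains' MSAs to the model (Observation 3).
# Each MSA is a list of (species, sequence) tuples; '-' is the gap character.

def paired_msa(msa_a, msa_b):
    """Pair rows by species: concatenate sequences from the same organism,
    so inter-chain coevolution lines up row by row."""
    by_species_b = {species: seq for species, seq in msa_b}
    rows = []
    for species, seq_a in msa_a:
        if species in by_species_b:
            rows.append(seq_a + by_species_b[species])
    return rows

def unpaired_msa(msa_a, msa_b):
    """'Smush' the two MSAs together: each row covers one chain and is
    padded with gaps over the other chain (a block-diagonal layout)."""
    len_a = len(msa_a[0][1])
    len_b = len(msa_b[0][1])
    rows = [seq + "-" * len_b for _, seq in msa_a]
    rows += ["-" * len_a + seq for _, seq in msa_b]
    return rows
```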

Part II: Generalize to Unseen Regions of Structure Space

Interlude: the speaker's lab develops OpenFold, a trainable implementation of AlphaFold2 with full parity against the original AlphaFold2 weights. There was a discussion of how it is far more efficient to train and run than AlphaFold2: a fraction of the training data, much lower compute. The whole point of this interlude was to set up a question: if the model never sees helices in my training set (which contains only sheets), what does it do? The speaker stressed that this is not a sequence → structure model! It's an MSA → structure model, and the MSA implicitly encodes a LOT of structure.
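A hedged sketch of the kind of training-set filtering such an experiment needs, using Biopython's DSSP wrapper (which requires the external `mkdssp` binary); this is my illustration of the setup, not the lab's actual pipeline:

```python
from Bio.PDB import PDBParser
from Bio.PDB.DSSP import DSSP

# Keep only chains whose DSSP assignment contains (almost) no helix, to build
# a "sheets only" training set. Illustrative only; not the OpenFold pipeline.

HELIX_CODES = {"H", "G", "I"}  # alpha-, 3_10-, and pi-helix

def helix_fraction(pdb_path):
    structure = PDBParser(QUIET=True).get_structure("s", pdb_path)
    dssp = DSSP(structure[0], pdb_path)
    codes = [dssp[key][2] for key in dssp.keys()]
    return sum(c in HELIX_CODES for c in codes) / max(len(codes), 1)

def sheet_only_training_set(pdb_paths, max_helix=0.02):
    """Keep structures that are essentially helix-free."""
    return [p for p in pdb_paths if helix_fraction(p) <= max_helix]
```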

Part III: Not Learned to Generalize on Structurally Disruptive Mutations

Note that mutations have minimal impact on predictions. Once again, the MSA comes to the rescue: the model is almost adversarially trained to ignore minor sequence variations.
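A tiny sketch of the robustness check implied here, comparing a wild-type prediction to a point-mutant prediction; `predict` and `rmsd` are hypothetical placeholders, not actual AlphaFold/OpenFold calls:

```python
# Sketch of the mutation-robustness observation (Part III).

def mutate(seq, pos, new_aa):
    """Apply a single point mutation at a 0-indexed position."""
    return seq[:pos] + new_aa + seq[pos + 1:]

def mutation_effect(seq, pos, new_aa, predict, rmsd):
    wild_type = predict(seq)
    mutant = predict(mutate(seq, pos, new_aa))
    # The lecture's point: this RMSD stays small even for mutations known to
    # disrupt the structure, because the MSA dominates the prediction.
    return rmsd(wild_type, mutant)
```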

A short discussion on how the model trains. The model works in 1D, 2D, and 3D representation spaces, as if it is learning space itself. Notably, partially trained models converge very fast, reaching > 90% of final accuracy early in training!
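For concreteness, here are rough tensor shapes for those 1D/2D/3D spaces; the channel sizes follow AlphaFold2's published defaults as I recall them and should be treated as illustrative:

```python
import torch

# Rough shapes of the representations the model reasons over.
n_seq, n_res = 128, 300  # MSA depth and target length (illustrative)

msa_repr  = torch.zeros(n_seq, n_res, 256)  # "1D"-ish: per-sequence, per-residue features
pair_repr = torch.zeros(n_res, n_res, 128)  # 2D: pairwise residue-residue features
coords    = torch.zeros(n_res, 37, 3)       # 3D: atom coordinates (atom37 layout)
```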