Single Cell Foundation Models by Xi Fu

What makes a Foundation Model 'foundational'? It is (a) multi-modal and (b) pretrained with self-supervision so that it can be adapted to a wide variety of downstream tasks. Every frontier model is a Foundation Model.

The lecture was about Foundation Models and how they are applied to Biology: more of the whys and hows. It started with an examination of the history, noting that the last 7 years have seen a revival/explosion (starting with the Transformer, which remains the basis of several foundation models).

Explanation of the motivations of Self-Supervised Learning (SSL). Language is a special kind of information where most of what we want to convey is captured in the relationships between words and not the words by themselves.

Step 1 of SSL is Data Representation: is the representation self-informed, universal, information-dense, noisy? These are the Big-4 questions. What are you capturing in this representation? Discussion of language being less noisy than images.

Step 2 is the Pretext Task definition ("training" that is useful for downstream applications). BERT does masked token prediction[^1] and GPT does next token prediction. Masked Autoencoders (MAE) and Bootstrap Your Own Latent (BYOL) are used in vision learning.
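
To make the two pretext tasks concrete, here is a tiny Python sketch (my illustration, not from the lecture) of how the same sentence becomes training pairs under a BERT-style masked objective versus a GPT-style next-token objective.

```python
import random

# One toy sentence, already split into tokens.
tokens = ["the", "cat", "sat", "on", "the", "mat"]

# BERT-style masked token prediction: hide a few tokens, learn to recover them.
random.seed(0)
mask_idx = set(random.sample(range(len(tokens)), k=2))
masked_input = ["[MASK]" if i in mask_idx else t for i, t in enumerate(tokens)]
mask_targets = {i: tokens[i] for i in mask_idx}

# GPT-style next token prediction: predict token i+1 from the prefix 0..i.
next_token_pairs = [(tokens[: i + 1], tokens[i + 1]) for i in range(len(tokens) - 1)]

print(masked_input, mask_targets)   # e.g. ['the', 'cat', '[MASK]', ...]
print(next_token_pairs[0])          # (['the'], 'cat')
```

Either way, the labels come from the data itself, which is what makes the task self-supervised.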

Step 3 is to ask if this will be useful to downstream applications.

The Scaling Law states that throwing more compute and data at your model to increase its parameter count will just keep making it better. This has some caveats, e.g. there is a limit to human knowledge and the rate at which we produce it! Note that reading a book 1,000 times hits a point of diminishing returns very early on; the same holds for the Scaling Law.
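
For reference (my addition, paraphrasing the commonly cited form from Kaplan et al., 2020, not something stated in the lecture), the empirical scaling laws model test loss as a power law in parameter count $N$ (and similarly in data and compute):

$$L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}$$

Each doubling of $N$ buys a smaller absolute drop in loss, which is exactly the diminishing-returns point above.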

"Physics of Language Models"

Discussion on data bias, pollution, and lack of diversity[^2]; curation is very important.

"Pretraining will unquestionably end".


How these are applied to Single Cell Technology: create a foundation model for the cell. Not for all cells, but for some class of cells (a circumscribed domain). You want to study heterogeneity in cells, and you can do this via imaging. All the problems of data sparsity, quality[^3], and diversity apply here. Not to mention: it's really expensive!

In the speaker's project, each gene is a feature and each cell is a measurement.
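
A minimal sketch (hypothetical, not the speaker's actual pipeline) of what "each gene is a feature and each cell is a measurement" looks like as data: a cells x genes count matrix.

```python
import numpy as np

# Toy single-cell expression matrix: rows are cells (measurements),
# columns are genes (features). Gene symbols chosen arbitrarily.
genes = ["GATA1", "CD3E", "MS4A1"]
cells = ["cell_1", "cell_2", "cell_3", "cell_4"]

rng = np.random.default_rng(0)
counts = rng.poisson(lam=2.0, size=(len(cells), len(genes)))

print(counts.shape)                  # (4, 3): 4 cells, 3 gene features
print(dict(zip(genes, counts[0])))   # expression profile of the first cell
```

In practice these matrices are much larger and very sparse, which feeds directly into the sparsity and quality issues above.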


DNA Language Models! Why do this? "Because it's there." You tokenize with k-mers or BPE (byte-pair encoding), or just say screw it and use single nucleotides!
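
Here is a hedged sketch (function names are mine, not from any specific DNA language model) of the tokenization options just mentioned:

```python
# Three ways to tokenize a DNA sequence.
seq = "ACGTACGGT"

def kmer_tokens(s, k=3):
    """Overlapping k-mer tokenization."""
    return [s[i:i + k] for i in range(len(s) - k + 1)]

def nucleotide_tokens(s):
    """Single-nucleotide (character-level) tokenization."""
    return list(s)

print(kmer_tokens(seq))        # ['ACG', 'CGT', 'GTA', 'TAC', 'ACG', 'CGG', 'GGT']
print(nucleotide_tokens(seq))  # ['A', 'C', 'G', 'T', 'A', 'C', 'G', 'G', 'T']
# BPE would instead learn a vocabulary of frequent subsequences from a corpus
# and greedily merge characters into those learned tokens.
```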

Step 1

  • Self-informed? Not really.
  • Universal? Yes. Life is made of ACGTs!
  • Information dense? Not really.
  • Noisy? No. Not too many 'typos'.

Step 2: The task here is masked and next token prediction. What is it learning? Interactions between nucleotides, high-frequency patterns (motifs). If repetition is the basis of learning, you can see evolution as the 'repeater' mechanism: copy + selection pressure!
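
As a toy illustration of "high-frequency patterns" (mine, not the lecture's): simply counting k-mers already surfaces a repeated motif, and a masked model has to internalize such regularities to fill in hidden positions.

```python
from collections import Counter

# Made-up sequence containing a repeated TATAAT box.
seq = "TATAATGCGCTATAATCCGGTATAAT"
k = 6
counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
print(counts.most_common(3))   # TATAAT is the most frequent 6-mer (3 occurrences)
```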

Step 3: Is this useful for downstream applications? Yes: gene therapy for example.

[^1]: This is akin to when you're learning a new language or doing "fill in the blanks" problems at school!

[^2]: E.g. too many cat pictures.

[^3]: Think about the Batch Effect: two cameras might have varying quality. In language, French and English embed and convey information differently.


TODO: What is a token in an image foundation model?