Foundation Models

Discussion of T0pp (ChatGPT was released later). The key insight: you can reduce a lot of downstream tasks to prompting. This was a big deal around 2022.

Foundation models are an increasingly common area of research in healthcare.

But what are they? There is a lot of ambiguity around this term. What follows is the professor’s view of it, not the ‘global’ one.

The term was introduced by researchers at Stanford (Bommasani et al., 2021).

There’s a spectrum:

Classical   ----------------------------------> Foundation
Single-Task ----------------------------------> Zero-Shot
Single-Task ----> Multi-Task ----> Few-Shot ----> Zero-Shot

Few-shot learning specializes a pre-trained model to a new task (AKA “transfer learning”).

FMs sit around the few- and zero-shot part of the spectrum. LLMs are the dominant Foundation Models. Note that using scale plus minimal inductive bias, and letting training ‘figure it all out’, has repeatedly proven more powerful than researchers hand-programming the system or imparting ‘structure’.
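
A minimal sketch of that few-shot specialization, assuming a frozen pretrained encoder (faked here with a fixed random projection standing in for FM embeddings) and sklearn for the small task head:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical stand-in for a frozen pretrained backbone: a fixed
# random projection playing the role of foundation-model embeddings.
def pretrained_features(x):
    W = np.random.default_rng(42).normal(size=(x.shape[1], 16))
    return np.tanh(x @ W)

# Only a handful of labeled examples for the downstream task.
rng = np.random.default_rng(0)
X_few = np.vstack([rng.normal(-1.0, 1.0, size=(4, 4)),
                   rng.normal(+1.0, 1.0, size=(4, 4))])
y_few = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Specialize: fit only a small head on top of the frozen features.
head = LogisticRegression().fit(pretrained_features(X_few), y_few)
print(head.predict(pretrained_features(rng.normal(size=(3, 4)))))
```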

Why not just use the frontier models? Well, they’re built on what human beings have produced (text, images). How do you map that to things like protein folding?

Some properties, things that define them in practice:

  • Generative AI (they can generate samples: text, images)
  • Large Dataset
  • Uses a Transformer NN
  • Uses an LLM
  • Data-efficient Task Adaptation (this is prof’s chief criterion)
  • Self-supervised Learning (uses some structure inherent in the data instead of requiring supervision; see the sketch after this list).
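
A minimal sketch of the self-supervision idea: the “labels” are manufactured from structure already in the raw data (next-token prediction here), so no human annotation is required:

```python
# The training signal comes from the data itself: predict each token
# from the tokens before it.
corpus = "foundation models adapt to many tasks with little labeled data"
tokens = corpus.split()

# Each (context, target) training pair is manufactured from raw text.
pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

for context, target in pairs[:3]:
    print(f"predict {target!r} from {context}")
```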

Now a Classical model is trained on a labeled dataset (X, y) and maps an input to an output:

        X, y
          |
x ----> f(x) ----> y
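
In code, a minimal sketch of that setup (toy data, sklearn as a stand-in learner):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# One dataset (X, y), one model f, one fixed task baked in at training.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # the single task

f = LogisticRegression().fit(X, y)        # f: x -> y, nothing else
print(f.predict(X[:3]))
```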

But there’s something missing. Your data has a distribution; that’s fine. But the tasks might also have a distribution! The idea is to consider the distribution of allowed tasks; in introductory (“baby ML”) world, these are latent and only emerge implicitly during training.

Now if you incorporate tasks into Classical models, nothing really changes:

        X, y_a
          |
x ----> f_a(x) ----> y_a

where a ∈ Set of Tasks: a separate model f_a per task.
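
A minimal sketch of this per-task setup, with hypothetical task names: one independent model per task, nothing shared:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))

# Hypothetical task set: each task a comes with its own labels y_a.
y_by_task = {
    "sepsis_risk": (X[:, 0] > 0).astype(int),
    "readmission": (X[:, 1] > 0).astype(int),
}

# One independent model f_a for every a in the set of tasks.
f = {a: LogisticRegression().fit(X, y_a) for a, y_a in y_by_task.items()}
print(f["sepsis_risk"].predict(X[:3]))
```

In FMs, by contrast, you pass the task (and the input) to a single model: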

        X, Tasks
           |
x, a ----> f(x, a) ----> y_a
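
A minimal sketch of that interface; `toy_model` is a hypothetical stand-in for a real pretrained FM (e.g., an LLM queried with the task folded into the prompt):

```python
# Hypothetical stand-in so the sketch runs; a real pretrained FM
# would replace this.
def toy_model(prompt: str) -> str:
    return f"<output for: {prompt.splitlines()[0]}>"

# The FM interface: ONE model f that takes both the input x and the
# task a, and returns y_a. The task is an argument, not a new model.
def f(x: str, a: str) -> str:
    prompt = f"Task: {a}\nInput: {x}\nOutput:"
    return toy_model(prompt)

# Same model, different tasks.
print(f("patient note text", "summarize"))
print(f("patient note text", "extract medications"))
```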

So you must ask “over what tasks” when considering a foundation model!

Single-task models are a special case of foundation models (a task set of size one); the reverse is not true!

On a given task, a single-task model will match or outperform a foundation model; the FM’s advantage is efficiency, not peak performance!

Implicitly, when you’re training an FM, you’re training over many tasks.

FMs push the SOTA on labeled-data efficiency. The key point: a single-task model consumes its labeling budget on one task. So with a ‘billion’ labels, you can get maybe 8 single-task models. Or, by adapting an FM instead, the same budget yields many more task-specific models.
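
Back-of-the-envelope version of that budget argument, with made-up numbers for the per-task label costs:

```python
# All numbers below are made up for illustration.
total_labels = 1_000_000_000                 # the 'billion' labels

labels_per_single_task_model = 125_000_000   # assumed cost per dedicated model
labels_per_fm_adaptation = 1_000             # assumed few-shot adaptation cost

print(total_labels // labels_per_single_task_model)  # -> 8 single-task models
print(total_labels // labels_per_fm_adaptation)      # -> 1,000,000 adapted tasks
```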

So consider a plot of performance per task (relative to the Bayes error) versus labeled data per task. Health data lives in the top left: little labeled data per task, with demanding performance requirements.

FMs push the SOTA on accessibility: under even a ‘mild task shift’, you can argue that a single-task model will perform worse than an FM.

Ask:

  • What does FM mean for this model?
  • What is the distribution of tasks?
  • Do you need data-efficient adaptability? Are you limited on total data or on task-specific data?
  • Is an FM good enough for your case? How are you evaluating?