
Representation Learning with Health Data by Matthew McDermott

The lecture was about Representation Learning, which means a few different things depending on whom you ask: there are Algorithmic, Informatic, and Geometric perspectives. And now a "ChatGPT-ic" one.

Note that there is a true conditional distribution P(Y|X), where X is the input and Y the output. That distribution is the 'truth', and it is the ceiling of what we can achieve with our cleverness (any function f(X) we learn that maps to Y).
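
As a quick aside (a standard result, not something from the lecture): under 0-1 loss, that ceiling is the Bayes-optimal predictor built directly from P(Y|X):

```latex
% Bayes-optimal predictor under 0-1 loss; any learned f can only do worse.
f^*(x) = \arg\max_{y} P(Y = y \mid X = x),
\qquad
\mathbb{E}\left[\mathbf{1}\{f(X) \neq Y\}\right]
\;\geq\;
\mathbb{E}\left[\mathbf{1}\{f^*(X) \neq Y\}\right].
```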

In general, we project data into some Latent Space that helps us discover new perspectives on the data.
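
A minimal sketch of what projecting into a latent space can look like in practice, assuming scikit-learn and made-up data (PCA here is just a stand-in for any encoder):

```python
# Minimal sketch: project raw data into a low-dimensional latent space.
# PCA stands in for any encoder; the data and dimensions are made up.
import numpy as np
from sklearn.decomposition import PCA

X = np.random.randn(500, 100)   # 500 samples, 100 raw features
pca = PCA(n_components=2)       # a 2-D latent space
Z = pca.fit_transform(X)        # Z is the new "perspective" on X
print(Z.shape)                  # (500, 2)
```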

In the Algorithmic perspective, we ask "How can we efficiently work with data?" Here, RL is about efficiency more than capability. Consider a "Stochastic Descent" model (no Gradient!): you keep sampling random models until one satisfies your loss function. Theoretically you'll get a model that beats anything else, but you might have to wait a while.
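
A hedged sketch of that gradient-free idea, on an invented toy regression problem: propose random models and keep going until one meets a loss threshold (or you give up waiting).

```python
# Toy "stochastic descent without gradients": sample random linear models
# until one satisfies the loss threshold. Everything here is illustrative.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
w_true = rng.normal(size=2)
y = X @ w_true + 0.1 * rng.normal(size=200)   # noisy linear data

def mse(w):
    return float(np.mean((X @ w - y) ** 2))

best_w, best_loss = None, float("inf")
for step in range(100_000):        # capped so the sketch always terminates
    w = rng.normal(size=2)         # propose a random model; no gradients used
    loss = mse(w)
    if loss < best_loss:
        best_w, best_loss = w, loss
    if best_loss < 0.02:           # "satisfies the loss function"
        break
print(step, best_loss)             # you may wait a while for a good draw
```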

In the Informatic perspective, we ask "How can we denoise and disentangle data distributions?" Here, it's about discarding the right information, e.g. by projecting the data into some low-dimensional space and throwing away what looks like noise.
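
A toy sketch of discarding the right information, assuming PCA-style denoising (the rank, noise level, and sizes are all made up): reconstruct from the top components and drop the low-variance directions that are mostly noise.

```python
# Denoise by projecting onto the top principal components and reconstructing;
# the discarded low-variance directions are mostly noise. Sizes are made up.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
signal = rng.normal(size=(300, 3)) @ rng.normal(size=(3, 50))  # rank-3 signal
noisy = signal + 0.5 * rng.normal(size=signal.shape)

pca = PCA(n_components=3)
denoised = pca.inverse_transform(pca.fit_transform(noisy))

print(np.mean((noisy - signal) ** 2))     # error before denoising
print(np.mean((denoised - signal) ** 2))  # smaller error after denoising
```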

In the Geometric perspective, we ask "What am I adding to, or changing about, the structure of the data to establish similarity?" Consider two pictures, one of a Husky and one of a Shiba Inu. They are 'similar' in conceptual space; the pixels still encode this, but it's difficult to surface. There are all kinds of structures: implicit structure constrains neighborhoods but not relationships, while explicit structure has you draw relationships between data points yourself (supervised pretraining, for example). Being explicit lets you say what you want to capture, e.g. emotions: "This dog and car look happy/sad". Finally, there's Universal Structure, which attempts to capture arbitrary structure. A single geometry cannot capture Universal Structure: it has to be encoded with a dynamic, algorithmic geometry! This is why LLMs do not give a single embedding.
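
A toy sketch of explicit structure, using a generic hinge-style contrastive update (not the lecture's method; the pairs, margin, and embeddings are invented): you declare which items are similar, and the updates reshape the latent space to match.

```python
# Generic contrastive-style updates: pull declared-similar pairs together,
# push declared-dissimilar pairs apart up to a margin. All values invented.
import numpy as np

rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 2))   # embeddings for 4 items (e.g. images)
similar = [(0, 1)]              # relationships YOU draw: "these two are alike"
dissimilar = [(0, 2), (1, 3)]
margin, lr = 1.0, 0.1

for _ in range(100):
    for i, j in similar:        # pull similar pairs together
        emb[i] -= lr * (emb[i] - emb[j])
        emb[j] -= lr * (emb[j] - emb[i])
    for i, j in dissimilar:     # push dissimilar pairs apart, up to the margin
        d = np.linalg.norm(emb[i] - emb[j])
        if d < margin:
            g = (emb[i] - emb[j]) / (d + 1e-9)
            emb[i] += lr * g
            emb[j] -= lr * g

print(np.linalg.norm(emb[0] - emb[1]))  # small: the declared-similar pair is close
```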

Structure-Inducing Pre-training (SIPT) lets you induce a graph YOU specify in the latent space, and it allows you to reason about downstream task performance. It's the speaker's own work [1].
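
To be clear, the sketch below is NOT the actual SIPT method, only a generic illustration of the underlying idea: a penalty term that is small when samples linked in a graph you specify embed close together. The graph, embeddings, and sizes are made up.

```python
# NOT the SIPT implementation: a generic Laplacian-style penalty that rewards
# embeddings for respecting a user-specified graph. Everything here is made up.
import numpy as np

rng = np.random.default_rng(0)
Z = rng.normal(size=(5, 8))        # latent embeddings for 5 samples
edges = [(0, 1), (1, 2), (3, 4)]   # the graph YOU specify over the samples

def graph_penalty(Z, edges):
    # Small when graph-linked samples sit close together in latent space.
    return sum(np.sum((Z[i] - Z[j]) ** 2) for i, j in edges)

print(graph_penalty(Z, edges))     # add this term to a pretraining loss
```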

Representation Learning on Health Data is complicated: we don't fully understand the data, the structures are complicated, and the data is inherently multi-modal. Lots of challenges. #JobSecurity