Health Data Modalities II

How do the data differences manifest in modeling choices?

Predictions must align with decisions.

  • For whom? - Which patients are eligible? Think of clinicians too.
  • From when? - What is the prediction time / index event?
  • Using what data? - What history is available?
  • Over what time horizon?
  • To support what clinical action?

Think in terms of “clinically meaningful windows”. A prediction made outside the window in which someone can still act on it is useless, no matter how nice and shiny the model's AUROC is.
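One way to keep the five questions honest is to write them down as an explicit task definition before touching a model. The sketch below is only illustrative: the `PredictionTask` class, its field names, and the example values are all assumptions made for this note, not a standard API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Tuple


@dataclass(frozen=True)
class PredictionTask:
    """Hypothetical container for the five questions; every name here is illustrative."""
    cohort: str           # for whom?
    index_event: str      # from when?
    lookback: timedelta   # using what data?
    horizon: timedelta    # over what time horizon?
    action: str           # to support what clinical action?

    def feature_window(self, index_time: datetime) -> Tuple[datetime, datetime]:
        """History the model may see: from index_time - lookback up to index_time."""
        return index_time - self.lookback, index_time

    def label_window(self, index_time: datetime) -> Tuple[datetime, datetime]:
        """Period in which the outcome counts: from index_time to index_time + horizon."""
        return index_time, index_time + self.horizon


# Example values are made up for illustration.
task = PredictionTask(
    cohort="adult inpatients",
    index_event="hospital discharge",
    lookback=timedelta(days=365),
    horizon=timedelta(days=30),
    action="schedule a follow-up call",
)
```

If the label window ends after the point where the clinical action is still possible, the task is misaligned regardless of the metric.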

Representation Choices

  • Tables - Fixed length feature vector
  • Time sequence of events in regular chunks, e.g. days (“discretized temporal sequence”)
  • Full-resolution event stream - just raw timestamps

Each is defensible. Each throws away something. Each changes what “missing” and “unobserved” mean.

Data, Data, Data: Most modeling differences are really representation differences.

Tables

A ---- 1 ---- 2 ---- 3 ---- ... ---- |> Prediction time
B ---- 1 ----------- 3 ---- ... ---- |> Prediction time

At each prediction time, summarize the event stream up to that prediction time into a fixed-length vector.
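A minimal sketch of that summarization step, assuming events are `(time, code, value)` tuples. The codes, values, and feature names are made up for illustration; a real pipeline would have many more features.

```python
from datetime import datetime

# Toy event stream for one patient: (time, code, value). All values are invented.
events = [
    (datetime(2019, 1, 5), "lab:glucose", 7.1),
    (datetime(2019, 3, 2), "lab:glucose", 9.4),
    (datetime(2019, 3, 2), "dx:E11", None),
    (datetime(2019, 10, 1), "lab:glucose", 12.0),  # after the prediction time
]


def summarize(events, prediction_time):
    """Collapse all events strictly before prediction_time into a fixed set of features."""
    history = sorted(
        (e for e in events if e[0] < prediction_time), key=lambda e: e[0]
    )
    glucose = [v for _, code, v in history if code == "lab:glucose"]
    return {
        "n_events": len(history),
        # None means "never measured", which is not the same as "measured and normal".
        "glucose_last": glucose[-1] if glucose else None,
        "glucose_mean": sum(glucose) / len(glucose) if glucose else None,
        "has_dx_E11": int(any(code == "dx:E11" for _, code, _ in history)),
        "days_since_last_event": (
            (prediction_time - history[-1][0]).days if history else None
        ),
    }


row = summarize(events, datetime(2019, 9, 1))
```

The filter on `prediction_time` is what keeps the October lab out of the row. The timing of events within the history survives only through whatever summary features you choose to keep.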

TODO: Label leakage?

Think of how awesome your model was in September 2019 and what would happen to its predictive prowess in a year.

Regular chunks

You ‘bin’ and bucket, turning an irregular event stream into a regular sequence. The binning is an inductive bias, but sequence models (RNNs/LSTMs, CNNs) really like this kind of data. Open questions: how wide should a bin be? And are events ordered within a bin? (No: within-bin order is thrown away.)
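A small sketch of binning, assuming events are `(time, code)` pairs and bins are a whole number of days wide. The function name `bin_events` and the example codes are hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def bin_events(events, start, n_bins, width_days=1):
    """Return n_bins lists of codes; codes in a bin are deduplicated and sorted
    alphabetically, so their original order within the bin is lost."""
    bins = defaultdict(set)
    for time, code in events:
        idx = (time - start).days // width_days
        if 0 <= idx < n_bins:  # drop events outside the covered range
            bins[idx].add(code)
    return [sorted(bins[i]) for i in range(n_bins)]


events = [
    (datetime(2019, 9, 1, 8, 0), "lab:glucose"),
    (datetime(2019, 9, 1, 9, 30), "rx:insulin"),
    (datetime(2019, 9, 3, 14, 0), "lab:glucose"),
]
start = datetime(2019, 9, 1)

daily = bin_events(events, start, n_bins=4)
two_day = bin_events(events, start, n_bins=2, width_days=2)
```

With daily bins the second and fourth bins come out empty, and an empty bin cannot tell you whether nothing happened or nothing was recorded. Changing `width_days` changes the sequence the model sees, which is the inductive bias in action.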

Full-Resolution Stream

Each event is a tuple <Time, Code, Value>. This is the highest fidelity representation. But it’s also very complex:

  • Six months of silence.
  • Lab 18.9 and diagnosis E115.8. How to represent these?
  • A single visit/encounter can make 15 codes!
  • Long sequences: a patient can have tens of thousands of events (here the model can cheat: sicker patients generate more events, so sheer event count becomes a shortcut)
  • Huge vocabularies
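To make the tuple representation concrete, here is a minimal sketch. The `Event` type, the codes, and the values are assumptions for illustration only. It shows two of the complications above: several codes sharing one timestamp, and a long silence that exists only as the gap between timestamps.

```python
from datetime import datetime
from typing import List, NamedTuple, Optional


class Event(NamedTuple):
    time: datetime
    code: str
    value: Optional[float]  # None for events without a numeric value


# Made-up stream: one encounter emitting two codes, then a long silence.
stream = [
    Event(datetime(2019, 1, 5, 10, 0), "lab:hba1c", 7.9),
    Event(datetime(2019, 1, 5, 10, 0), "dx:E11", None),
    Event(datetime(2019, 7, 10, 10, 0), "lab:hba1c", 8.4),
]


def gaps_in_days(events: List[Event]) -> List[float]:
    """Time elapsed between consecutive events, in days (events assumed sorted)."""
    return [
        (b.time - a.time).total_seconds() / 86400
        for a, b in zip(events, events[1:])
    ]


gaps = gaps_in_days(stream)  # the second gap is the roughly six months of silence
```

Nothing in the stream itself marks the silence. If the model never sees the gaps (as a feature, a time token, or a time encoding), that information is gone.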

Other Notes

The statistical machinery still works (AUROC, train/validation/test splits). What changes is how the input is parameterized and what assumptions you make, especially about time.

In NLP, tokens are mostly arbitrary (TODO: “mostly”) symbols. Medical codes are structured objects with hierarchies, human-readable names, and cross-vocabulary mappings (SNOMED <--> ICD).
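As a small illustration of the hierarchy point: in ICD-10, the first three characters of a code name its category, and each extra character refines it. The helper names below are hypothetical, and the sketch covers only this prefix structure, not names or cross-vocabulary mappings.

```python
def icd10_ancestors(code):
    """Return progressively coarser prefixes of an ICD-10 code, most specific first."""
    compact = code.replace(".", "").upper()
    out = []
    for n in range(len(compact), 2, -1):
        prefix = compact[:n]
        # Re-insert the dot after the 3-character category when there is more detail.
        out.append(prefix if n <= 3 else prefix[:3] + "." + prefix[3:])
    return out


def same_category(a, b):
    """True if two codes roll up to the same 3-character category."""
    return icd10_ancestors(a)[-1] == icd10_ancestors(b)[-1]


levels = icd10_ancestors("E11.65")
related = same_category("E11.65", "E11.9")   # both are type 2 diabetes codes
unrelated = same_category("E11.9", "I10")    # diabetes vs. hypertension
```

A plain token vocabulary would give `E11.65` and `E11.9` unrelated IDs. Rolling codes up, or embedding the ancestors alongside the leaf, is one way to hand that structure to the model.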

Autoregressive EHR models eat an event stream and spit out many possible future trajectories. You then aggregate across them, e.g. “in what fraction of sampled futures did event X happen within 30 days?” Patient → Model → Multiple Sampled Futures (different events and different time gaps!)
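The aggregation step can be sketched as a Monte Carlo estimate. In the sketch below, `sample_future` is a toy stand-in for one rollout of a trained model: the event names, probabilities, and exponential time gaps are invented purely so the example runs.

```python
import random


def sample_future(rng, max_days=90):
    """Toy stand-in for one autoregressive rollout: [(days_from_now, code), ...]."""
    t, trajectory = 0.0, []
    while True:
        t += rng.expovariate(1 / 10)  # random time gap, mean 10 days (assumption)
        if t > max_days:
            return trajectory
        code = rng.choices(
            ["visit", "lab", "admission"], weights=[0.60, 0.38, 0.02]
        )[0]
        trajectory.append((t, code))


def event_probability(code, horizon_days, n_samples=2000, seed=0):
    """Fraction of sampled futures in which `code` occurs within horizon_days."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        trajectory = sample_future(rng)
        hits += any(c == code and t <= horizon_days for t, c in trajectory)
    return hits / n_samples


p_admission = event_probability("admission", horizon_days=30)
p_visit = event_probability("visit", horizon_days=30)
```

The estimate is only as good as the number of samples: rare events need many rollouts before the fraction stabilizes.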

Note that an LLM decoded greedily generates the most likely continuation. Now think of a rare cancer. What is the most likely prediction from an LLM?
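A toy illustration of the question, with invented next-event probabilities: picking the single most likely event never surfaces the rare one, while sampling recovers roughly its true frequency.

```python
import random

# Hypothetical next-event distribution for one patient; the numbers are made up.
next_event_probs = {"routine visit": 0.70, "lab test": 0.28, "rare cancer dx": 0.02}

# "Most likely continuation": always the same common event.
greedy = max(next_event_probs, key=next_event_probs.get)

# Sampling instead: the rare event shows up in about 2% of draws.
rng = random.Random(0)
samples = rng.choices(
    list(next_event_probs), weights=list(next_event_probs.values()), k=10_000
)
rare_fraction = samples.count("rare cancer dx") / len(samples)
```

Here `greedy` is the routine visit every time, which says nothing about the 2% risk that a sampled estimate like `rare_fraction` can still capture.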