Health Data Modalities II
How do the data differences manifest in modeling choices?
Predictions must align with decisions.
- For whom? - Which patients are eligible? Think of clinicians too.
- From when? - What is the prediction time / index event?
- Using what data? - What history is available?
- Over what time horizon?
- To support what clinical action?
Think of “clinically meaningful windows”. A prediction made outside this window is useless, however nice and shiny the model's AUROC.
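One way to make the five questions above concrete is to write the task down as a small spec before touching any model. This is only a sketch: `PredictionTask`, its field names, and the readmission-style example values are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class PredictionTask:
    """Hypothetical container; one field per question in the list above."""
    cohort: str            # for whom?
    index_event: str       # from when? (what triggers a prediction)
    lookback: timedelta    # using what data? (how much history)
    horizon: timedelta     # over what time horizon?
    action: str            # to support what clinical action?

    def history_window(self, prediction_time):
        # Data the model may see: [prediction_time - lookback, prediction_time)
        return prediction_time - self.lookback, prediction_time

    def outcome_window(self, prediction_time):
        # Where the label is read from: (prediction_time, prediction_time + horizon]
        return prediction_time, prediction_time + self.horizon


# Illustrative values only.
task = PredictionTask(
    cohort="adults discharged from inpatient care",
    index_event="discharge",
    lookback=timedelta(days=365),
    horizon=timedelta(days=30),
    action="schedule a follow-up call",
)
t0 = datetime(2024, 3, 1)
hist = task.history_window(t0)
out = task.outcome_window(t0)
```

Writing the windows down explicitly makes it harder to evaluate a model on a horizon that no clinician could act on.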
Representation Choices
- Tables - Fixed length feature vector
- Time sequence of events (“discretized temporal sequence”) with regular chunks (days)
- Full-resolution event stream - just raw timestamps
Each is defensible. Each throws away something. Each changes what “missing” and “unobserved” mean.
Data, Data, Data: Most modeling differences are really representation differences.
Tables
A ---- 1 ---- 2 ---- 3 ---- ... ---- |> Prediction time
B ---- 1 ----------- 3 ---- ... ---- |> Prediction time
At each prediction time, summarize the event stream up to that prediction time into a fixed-length vector.
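A minimal sketch of that summarization step, assuming events are `(time, code, value)` tuples and that the features are simple per-code counts plus a recency feature. The function name `featurize`, the codes, and the feature choices are all hypothetical. The strict `<` on the timestamp is what keeps post-prediction information out of the vector.

```python
from datetime import datetime


def featurize(events, prediction_time, vocab):
    """Collapse an event stream into a fixed-length vector.

    Layout: [count per code in vocab..., total events, days since last event].
    """
    # Only events strictly before the prediction time are visible.
    past = [e for e in events if e[0] < prediction_time]

    counts = {code: 0 for code in vocab}
    for _, code, _ in past:
        if code in counts:
            counts[code] += 1

    if past:
        days_since_last = (prediction_time - max(e[0] for e in past)).days
    else:
        days_since_last = -1  # sentinel: no history at all

    return [counts[c] for c in vocab] + [len(past), days_since_last]


vocab = ["LAB_A1C", "DX_DIABETES", "DX_CKD"]
events = [
    (datetime(2024, 1, 1), "LAB_A1C", 7.2),
    (datetime(2024, 1, 1), "DX_DIABETES", None),
    (datetime(2024, 2, 10), "LAB_A1C", 6.9),
    (datetime(2024, 4, 1), "DX_CKD", None),  # after prediction time: must be ignored
]
vec = featurize(events, datetime(2024, 3, 1), vocab)
# vec == [2, 1, 0, 3, 20]
```

Note what the vector throws away: the order of events, the lab values themselves, and the difference between "never measured" and "count is zero".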
TODO: Label leakage?
Think of how awesome your model was in September 2019 and what would happen to its predictive prowess in a year.
Regular chunks
You ‘bin’ and bucket, turning an irregular event stream into a regular sequence. This binning is an inductive bias, though LSTMs, CNNs, and RNNs really like this data. But what is the bin width? And are events ordered within a bin (no)?
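A small sketch of the binning, assuming `(time, code, value)` events and a set of codes per bin. `bin_events` and the codes are hypothetical names. Using a set per bin makes the two costs visible: order inside a bin is gone, and empty bins become explicit entries.

```python
from datetime import datetime


def bin_events(events, start, n_bins, bin_days=1):
    """Turn an irregular event stream into n_bins regular chunks of bin_days each."""
    bins = [set() for _ in range(n_bins)]
    for time, code, _ in events:
        idx = (time - start).days // bin_days
        if 0 <= idx < n_bins:       # drop events outside the covered window
            bins[idx].add(code)     # a set: within-bin order and repeats are lost
    return [sorted(b) for b in bins]


start = datetime(2024, 1, 1)
events = [
    (datetime(2024, 1, 1, 9), "LAB_B", 1.0),
    (datetime(2024, 1, 1, 15), "DX_A", None),
    (datetime(2024, 1, 3), "LAB_B", 1.2),
]
daily = bin_events(events, start, n_bins=4, bin_days=1)
# daily == [["DX_A", "LAB_B"], [], ["LAB_B"], []]
weekly = bin_events(events, start, n_bins=1, bin_days=7)
# weekly == [["DX_A", "LAB_B"]]
```

The same three events look quite different at daily and weekly width, which is the sense in which the bin width is a modeling assumption rather than a preprocessing detail.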
Full-Resolution Stream
Each event is a tuple <Time, Code, Value>. This is the highest fidelity representation. But it’s also very complex:
- Six months of silence.
- Lab 18.9 and diagnosis E115.8. How to represent these?
- A single visit/encounter can make 15 codes!
- Long sequences: a patient with tens of thousands of events (here, a model can cheat by counting events, since sicker patients have more of them)
- Yuge vocabularies
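One possible way to handle two of the problems above (long silences and numeric values) is to turn gaps into explicit tokens and to bucket values. This is an illustrative scheme, not a standard one: the gap buckets, the `value_bins` edges, and the codes `DX_EXAMPLE` and `LAB_EXAMPLE` are all made up for the sketch.

```python
from bisect import bisect_right
from datetime import datetime

# Hypothetical gap buckets: (upper bound in days, token name).
GAP_BUCKETS = [(1, "GAP_<1d"), (7, "GAP_<1w"), (30, "GAP_<1m"), (180, "GAP_<6m")]


def gap_token(days):
    for upper, name in GAP_BUCKETS:
        if days < upper:
            return name
    return "GAP_6m+"


def tokenize(events, value_bins):
    """Flatten <Time, Code, Value> tuples into a token list with explicit gaps."""
    tokens = []
    prev_time = None
    for time, code, value in sorted(events, key=lambda e: e[0]):
        if prev_time is not None and time > prev_time:
            tokens.append(gap_token((time - prev_time).days))  # silence becomes a token
        if value is not None and code in value_bins:
            # Numeric value -> index of the bucket it falls into.
            tokens.append(f"{code}|bin{bisect_right(value_bins[code], value)}")
        else:
            tokens.append(code)
        prev_time = time
    return tokens


events = [
    (datetime(2024, 1, 1), "DX_EXAMPLE", None),
    (datetime(2024, 1, 1), "LAB_EXAMPLE", 18.9),
    (datetime(2024, 7, 15), "DX_EXAMPLE", None),
]
tokens = tokenize(events, value_bins={"LAB_EXAMPLE": [10.0, 20.0]})
# tokens == ["DX_EXAMPLE", "LAB_EXAMPLE|bin1", "GAP_6m+", "DX_EXAMPLE"]
```

Every code-value bucket pair becomes its own vocabulary entry, so this kind of scheme makes the vocabulary problem worse while solving the value problem.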
Other Notes
The statistical machinery still works (AUROC, train/validation/test). But we’re changing how the input is parameterized and what assumptions we make (especially temporal ones).
In NLP, tokens are mostly arbitrary (TODO: “mostly”) symbols. Medical codes are structured objects with hierarchies, human-readable names, and cross-vocab mappings (SNOMED <--> ICD).
Autoregressive EHR models eat an event stream and spit out many possible future trajectories. You then aggregate across them: “in what fraction of sampled futures did event X happen within 30 days?” Patient → Model → Multiple Sampled Futures (different events and time gaps!)
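The aggregation step can be sketched as a Monte Carlo estimate. Here `toy_sampler` is only a random stand-in for a real autoregressive model (it ignores the history entirely), and the codes, the 7-day mean gap, and the function names are assumptions made for the example.

```python
import random


def toy_sampler(history, rng):
    """Stand-in for a model: returns one sampled future as (days_from_now, code) pairs."""
    t = 0.0
    future = []
    for _ in range(20):
        t += rng.expovariate(1 / 7.0)  # random time gap, mean 7 days
        future.append((t, rng.choice(["VISIT", "LAB", "ADMISSION"])))
    return future


def estimate_risk(sampler, history, target_code, horizon_days, n_samples=1000, seed=0):
    """Fraction of sampled futures in which target_code occurs within the horizon."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        future = sampler(history, rng)
        if any(code == target_code and t <= horizon_days for t, code in future):
            hits += 1
    return hits / n_samples


risk = estimate_risk(toy_sampler, history=[], target_code="ADMISSION", horizon_days=30)
```

The same set of sampled futures can answer many different questions (other events, other horizons) without retraining, which is one appeal of this setup.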
Note that LLMs generate the most likely continuation. Now think of a rare cancer. What is the most likely prediction from an LLM?
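A toy illustration of the rare-event point, reading "most likely continuation" as greedy decoding (always taking the top next event). The next-event probabilities below are invented, and they are held fixed at every step purely to keep the arithmetic simple.

```python
import random

# Hypothetical next-event distribution, identical at every step.
NEXT_EVENT_PROBS = {"ROUTINE_VISIT": 0.90, "LAB": 0.09, "RARE_CANCER_DX": 0.01}


def greedy_rollout(n_steps):
    # Always pick the single most likely next event.
    best = max(NEXT_EVENT_PROBS, key=NEXT_EVENT_PROBS.get)
    return [best] * n_steps


def sampled_rollout(n_steps, rng):
    # Draw each next event in proportion to its probability.
    codes = list(NEXT_EVENT_PROBS)
    weights = list(NEXT_EVENT_PROBS.values())
    return rng.choices(codes, weights=weights, k=n_steps)


rng = random.Random(0)
n_steps, n_samples = 50, 2000

greedy_has_rare = "RARE_CANCER_DX" in greedy_rollout(n_steps)  # never happens
sampled_fraction = sum(
    "RARE_CANCER_DX" in sampled_rollout(n_steps, rng) for _ in range(n_samples)
) / n_samples
analytic = 1 - (1 - NEXT_EVENT_PROBS["RARE_CANCER_DX"]) ** n_steps  # about 0.39
```

Under these assumptions the greedy trajectory never contains the rare diagnosis, while roughly four in ten sampled trajectories do, so only the sampled futures carry usable risk information.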