
Phenotyping by George Hripcsak

What is phenotyping?

Identification or estimation of clinically useful properties from raw clinical data. For example: based on electronic health record (EHR) data, does the patient have diabetes?

Example shorthand from clinical notes: PERRLA = “Pupils Equal, Round, Reactive to Light and Accommodation.”

Why EHR data is hard to work with

Missing data

Data are mostly missing:

  • Sampled only when the patient is sick
  • Pertinent negatives recorded by the attending vs. CC3

Noisy data

Accuracy can be as low as 50%.

The chain from truth to model

Truth → Concept → Record → Concept → Model
  • Truth: the actual health status of the patient
  • Concept (1): the clinician’s or patient’s conception of that state
  • Record: the EHR/PHR entry
  • Concept (2): a second clinician’s conception of the patient based on the record
  • Model: the computable representation

Temporal complexity

For any given test result, “the right time” could mean any of:

  • When the specimen was drawn
  • When the specimen was received
  • When the test was performed
  • When the result was updated
  • When the result was received by the patient
  • When the patient told the clinician
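One way to keep these candidate times apart is to store each one as its own field instead of collapsing them into a single timestamp. The sketch below is illustrative only: the class name, field names, and the fallback order in `physiologic_time` are assumptions, not part of any particular EHR schema.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class LabResultTimes:
    """Hypothetical record that keeps every candidate 'time' for one lab result."""
    specimen_drawn: Optional[datetime] = None
    specimen_received: Optional[datetime] = None
    test_performed: Optional[datetime] = None
    result_updated: Optional[datetime] = None
    patient_received: Optional[datetime] = None
    patient_told_clinician: Optional[datetime] = None

    def physiologic_time(self) -> Optional[datetime]:
        """Return the first available timestamp in workflow order.

        Assumption: the draw time is closest to the patient's actual state,
        so later events are used only when earlier ones are missing.
        """
        for t in (self.specimen_drawn, self.specimen_received,
                  self.test_performed, self.result_updated):
            if t is not None:
                return t
        return None


# A result whose draw time was never recorded falls back to the receipt time.
result = LabResultTimes(
    specimen_received=datetime(2024, 1, 2, 9, 0),
    result_updated=datetime(2024, 1, 2, 15, 0),
)
print(result.physiologic_time())
```

Which field counts as "the right time" depends on the question being asked, so the fallback order here is a choice to revisit per study, not a rule.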

Narrative complexity

Much of the useful information lives in unstructured text. Example:

“Slight increase of pulmonary vascular congestion with new left pleural effusion, question mild congestive changes.”
“s/p LURT 1998 c/b 1A rejection 7/07 back on HD”

NLP attempts to decompose this into structured concepts, e.g.:

  • pulmonary vascular congestion → change: increase, degree: low
  • pleural effusion → ?
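As a rough picture of what the NLP output might look like as data, the sketch below stores each finding as a concept plus a dictionary of modifiers. The `Finding` class is hypothetical and does not represent the output format of any specific NLP system; the modifiers for pleural effusion are left empty because the notes leave them open.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Finding:
    """Hypothetical structured concept extracted from narrative text."""
    concept: str
    modifiers: Dict[str, str] = field(default_factory=dict)


findings = [
    Finding("pulmonary vascular congestion", {"change": "increase", "degree": "low"}),
    # The notes do not say how this finding decomposes, so no modifiers are filled in.
    Finding("pleural effusion"),
]

for f in findings:
    print(f.concept, f.modifiers)
```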

Health care process bias

Environment → Patient state → Care team → EHR
                   ↑              │
                   └── Therapy ←──┘

The recording process itself injects signals that don’t reflect physiology.

The good news

Doctors successfully infer patient state from records. The goal is to mimic the doctor’s reasoning — to deconvolve the truth from the record.

EHR-derived phenotypes

A phenotype is a clinically relevant feature derived from the EHR, such as:

  • Patient has a diagnosis of type II diabetes
  • Recent rash and fever
  • Drug-induced liver injury

Phenotypes feed downstream correlation studies (e.g., which treatments are associated with the best outcomes):

Raw EHR data ──(query)──→ Phenotype ──(data mining)──→ Correlations
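The query step in this pipeline can be pictured as a function from a patient's raw records to a yes/no phenotype. The following is a minimal sketch, assuming a made-up record layout (`diagnosis_codes`, `hba1c_percent`) and an illustrative rule: an ICD-10 code starting with E11 (type 2 diabetes), or at least two HbA1c results at or above 6.5%. It is not a validated phenotype definition.

```python
from typing import Dict, List


def has_type2_diabetes(patient: Dict[str, List]) -> bool:
    """Illustrative rule only: a type 2 diabetes code, or two or more HbA1c values >= 6.5%."""
    dx_hit = any(code.startswith("E11") for code in patient.get("diagnosis_codes", []))
    high_a1c = [v for v in patient.get("hba1c_percent", []) if v >= 6.5]
    return dx_hit or len(high_a1c) >= 2


# Hypothetical raw records keyed by patient ID.
patients = {
    "p1": {"diagnosis_codes": ["E11.9"], "hba1c_percent": [7.2]},
    "p2": {"diagnosis_codes": ["I10"], "hba1c_percent": [5.4, 5.6]},
    "p3": {"diagnosis_codes": [], "hba1c_percent": [6.8, 7.1]},
}

# Raw EHR data -> phenotype; the resulting flags would feed the data-mining step.
phenotype = {pid: has_type2_diabetes(rec) for pid, rec in patients.items()}
print(phenotype)
```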

“Physics” of the medical record

  1. Study the EHR as if it were a natural object — use the EHR to learn about the EHR. You’re not studying the patient, you’re studying the recording of the patient.
  2. Aggregate across units and models.
  3. [third point not captured in original notes]

Related work:

  • Correlating lab tests and concepts
  • Timing of cause in disease vs. treatment
  • Interpreting time: “truth” = when the patient came in; “narrative” = when the doctor said it happened
  • Controlling for health care process effects (processes inject signals on top of physiology)
  • Granger causality to decipher associations

Approaches to phenotyping

Manual chart review

  • Why is manual chart review trusted as a gold standard?
  • We need better metrics.

Knowledge engineering

  • Manual authoring of rules (sometimes with tools, increasingly automated)
  • Manual review for a test set to evaluate the phenotype
  • Slow and inaccurate

Machine learning

  • Supervised learning
  • Manual review for training and test sets
  • Problem: many degrees of freedom create edge-case failures

Phenotype discovery

  • Unsupervised learning, clustering
  • Different goal: summarize and understand the dataset
  • Still needs evaluation

“Next generation” phenotyping

  1. Manual chart review
  2. Manually written rules
  3. Automated, semi-automated, or assisted rule writing
  4. Improved performance by better understanding the EHR
  5. Use of language models to incorporate human knowledge
  6. Evaluate

Hard cases

DILI (Overby 2013)

Drug-induced liver injury — defined by negation (liver disease not due to anything else). Took six months to generate the definition.
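To make "defined by negation" concrete, here is a small sketch in which a patient qualifies only when no alternative cause is recorded. The exclusion list and the function signature are hypothetical and far shorter than any real definition; this is not the definition from Overby 2013.

```python
from typing import Set

# Hypothetical exclusion list; a real definition would be much longer.
ALTERNATIVE_CAUSES = {"viral hepatitis", "alcoholic liver disease", "biliary obstruction"}


def possible_dili(has_liver_injury: bool, exposed_to_drug: bool, diagnoses: Set[str]) -> bool:
    """Definition by negation: liver injury after drug exposure with no other recorded cause."""
    if not (has_liver_injury and exposed_to_drug):
        return False
    # Any overlap with the exclusion list rules the patient out.
    return not (diagnoses & ALTERNATIVE_CAUSES)


print(possible_dili(True, True, {"hypertension"}))
print(possible_dili(True, True, {"viral hepatitis"}))
```

The weakness of this shape is visible in the code: an alternative cause that was never recorded looks identical to one that is truly absent, so missing data pushes patients into the phenotype.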

TRALI

Transfusion-related acute lung injury — a tough FDA contract example and an extreme case of definition-by-negation.

Transportability

Phenotypes need to work correctly across institutions.

Evaluation metrics

  • Sensitivity (recall) = TP / (TP + FN)
  • Specificity = TN / (TN + FP)
  • Positive predictive value (PPV, precision) = TP / (TP + FP)
  • AUROC (area under the receiver operating characteristic curve)
    • Area under the plot of sensitivity vs. (1 − specificity)
    • Probability of correctly picking the positive case when given one positive and one negative
    • Range: 0.5 (chance) to 1.0 (perfect); a value below 0.5 means the ranking is worse than chance
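The metrics above can be computed directly from their definitions. The sketch below uses toy labels and scores invented for illustration, and computes AUROC by its pairwise interpretation (the fraction of positive/negative pairs in which the positive case scores higher, with ties counted as half) instead of by integrating the curve.

```python
from typing import Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (TP, FP, TN, FN) for binary labels coded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def auroc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data: chart-review labels and phenotype scores (made up for illustration).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.3, 0.7, 0.2, 0.2, 0.1, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # threshold chosen arbitrarily

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
ppv = tp / (tp + fp)
auc = auroc(y_true, scores)
print(sensitivity, specificity, ppv, auc)
```

Sensitivity, specificity, and PPV depend on the chosen threshold, while AUROC summarizes the ranking across all thresholds.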

Tools

PHOEBE

Concept set recommender.

ATLAS — cohort building

  • Optimized for observational research
  • Time series: who and when (vs. classification)
  • Assumes a complex definition — linearized into AND-OR groups
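The idea of a definition expressed as AND-OR groups can be sketched as a small nested structure evaluated recursively. The dictionary layout and the `matches` function below are hypothetical and are not ATLAS's actual cohort format or API; the sketch also ignores timing, which a real cohort definition would need for the "when" part.

```python
from typing import Dict, List, Set, Union

Criteria = Union[str, Dict[str, Union[str, List["Criteria"]]]]


def matches(criteria: Criteria, facts: Set[str]) -> bool:
    """Evaluate a nested AND/OR criteria tree against the facts observed for one patient."""
    if isinstance(criteria, str):
        return criteria in facts  # leaf: a single required fact
    results = (matches(item, facts) for item in criteria["items"])
    if criteria["op"] == "AND":
        return all(results)
    if criteria["op"] == "OR":
        return any(results)
    raise ValueError(f"unknown operator: {criteria['op']}")


# Hypothetical definition: a diabetes diagnosis AND (metformin OR insulin).
definition = {
    "op": "AND",
    "items": ["diabetes_dx", {"op": "OR", "items": ["metformin", "insulin"]}],
}

print(matches(definition, {"diabetes_dx", "insulin"}))
print(matches(definition, {"diabetes_dx"}))
```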

KEEPER

Chart-review alternative.

Key papers and methods

  • Swerdel 2019, PheValuator — Estimate sensitivity and specificity without chart review. Choose very sensitive and very specific cohorts, train an ML model, and judge a new algorithm’s performance using the ML predictions in place of ground truth.
  • Cai 2017, PheNorm — Normalize and denoise features.
  • Cai 2019, MAP
  • Cai 2020, SureLDA
  • Sontag 2014, Anchors — Domain expert picks imperfect but relevant variables; the rest are learned from the dataset. Produces a “silver standard” — not as accurate as gold but cheaper.
  • Shah 2016 — Silver standard.
  • Bhave 2023 — LSTM to handle time.
  • Albers 2018, PopKLD — Handling continuous data.
  • Pivovarov 2015, UPhenome — Unsupervised; improves on LDA for heterogeneous cases; reduces mixture of phenotypes.
  • CEHR-GPT
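The PheValuator entry above describes judging an algorithm against ML predictions in place of ground truth. One simplified way to picture this is to treat each predicted probability as a fractional case and sum expected counts. The function below is a sketch of that idea under that assumption; its name and inputs are hypothetical, and it is not the PheValuator package's implementation.

```python
from typing import Dict, Sequence


def estimated_performance(probs: Sequence[float], flags: Sequence[int]) -> Dict[str, float]:
    """Estimate metrics using model probabilities as a probabilistic reference standard.

    probs: model-predicted probability that each patient truly has the condition.
    flags: 1 if the phenotype algorithm under evaluation selected the patient, else 0.
    """
    tp = sum(p for p, f in zip(probs, flags) if f)       # expected true positives
    fp = sum(1 - p for p, f in zip(probs, flags) if f)   # expected false positives
    fn = sum(p for p, f in zip(probs, flags) if not f)   # expected false negatives
    tn = sum(1 - p for p, f in zip(probs, flags) if not f)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
    }


# Toy inputs invented for illustration.
perf = estimated_performance(probs=[0.9, 0.8, 0.2, 0.1], flags=[1, 1, 0, 0])
print(perf)
```

The estimates are only as good as the model's calibration, which is the same evaluation problem moved one level down.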

Open question

Should we leave the phenotype implicit inside a deep learning model?

  • Hard to predict its errors
  • Poor performance on hard tasks, and prone to fabrication — but still better than poor human performance
  • Bottom line: it’s all in evaluating the phenotype — our methods now exceed our ability to evaluate them.