Health Data Modalities III — Clinical and Biomedical Text
Text is the richest but also the messiest and most rapidly changing part of the EHR. It is a byproduct of clinical care and billing. For modeling, see it as a sequence of nominal tokens.
Discussion of Jevons Paradox in the context of clinical notes.
By “text” we mean four categories:
- Clinical Documentation: Clinician → Record
- Clinical Communication: Clinician → Clinician/Patient (also think Clinician → Insurance)
- Patient-Generated: Patient → Clinician/World
- Scientific Literature: Researcher → Researcher
Clinical Documentation
Notes that a clinician writes during a patient visit/encounter. Motivations:
- Billing (“How thorough was the visit?”)
- Legal protection (“If it was not documented, it didn’t happen.”)
- Clinical Care continuity (What the next clinician needs to know about the patient story.)
KEY IDEA: There is extreme time pressure to write notes that satisfy all three of these concerns. So physicians spend about 2x as much time documenting as caring for patients (“pyjama time”: they complete notes after work).
A note is always part of a longitudinal sequence. Never look at it in isolation.
NOTE: With the limited time they have, the most important thing they update is the Plan (the A/P, Assessment and Plan, section!).
Who writes them?
All manner of people: physicians, nurses, NPs/PAs, dieticians. In certain datasets you don’t know who wrote a note, or what their training and expertise are.
What do they look like?
It’s a template/free-text hybrid, so most of the “text” derives from machine-generated boilerplate. The clinician’s own reasoning is 1-2 ‘new’ sentences in the A/P. The templates may be proprietary; you may not get access to them.
KEY IDEA: An LLM trained on clinical notes is partly learning to emulate template engines and copypasta. Assume you don’t know the template.
Why can’t you infer the template with some model? Well, EHR companies might not like this. And there’s really no incentive to do so: what do you gain by boiling away the template? Still, prof thinks there may be some utility in the future. Jaccard similarity is used to measure how similar two notes are.
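A minimal sketch of Jaccard similarity between two notes, assuming each note is reduced to a set of lowercased word tokens (real pipelines may use shingles or n-grams instead). The function names and the two example notes are made up for illustration.

```python
import re

def tokens(note):
    """Lowercased alphanumeric tokens of a note, as a set."""
    return set(re.findall(r"[a-z0-9]+", note.lower()))

def jaccard(a, b):
    """|A ∩ B| / |A ∪ B| over the two notes' token sets."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0  # convention chosen here: two empty notes count as identical
    return len(ta & tb) / len(ta | tb)

# Two made-up progress notes that share copied-forward text.
day1 = "Patient stable. Continue metoprolol. Follow up in 2 weeks."
day2 = "Patient stable. Continue metoprolol. Start lisinopril."
print(jaccard(day1, day2))
```

A score near 1 between consecutive notes of the same patient suggests copy-forward or shared template text; a score near 0 suggests mostly new content.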
Note length varies by specialty and note type (discharge summaries and surgical notes can be 10x longer than radiology notes!).
An estimated 60-80% of note text is duplicated from a previous note or from templates (Tsou, 2017; Cohen, 2013). So information content is not the same as volume of text. You will also see train-test leakage if you split by note: text copied between notes of the same patient can land in both the training set and the test set.
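One common way to avoid this kind of leakage is to split by patient instead of by note, so that copied-forward text from one patient never sits on both sides. A minimal sketch, assuming each note is a dict with hypothetical keys `patient_id` and `text` (not the schema of any particular dataset):

```python
import random
from collections import defaultdict

def split_by_patient(notes, test_fraction=0.2, seed=0):
    """Split notes so that all notes of a given patient land on one side only."""
    by_patient = defaultdict(list)
    for note in notes:
        by_patient[note["patient_id"]].append(note)

    # Shuffle patients (not notes), then carve off a test set of patients.
    patient_ids = sorted(by_patient)
    random.Random(seed).shuffle(patient_ids)
    n_test = max(1, round(len(patient_ids) * test_fraction))
    test_ids = set(patient_ids[:n_test])

    train = [n for pid in patient_ids if pid not in test_ids for n in by_patient[pid]]
    test = [n for pid in patient_ids if pid in test_ids for n in by_patient[pid]]
    return train, test

# Made-up data: 5 patients with 3 notes each.
notes = [{"patient_id": p, "text": f"note {i} for patient {p}"}
         for p in range(5) for i in range(3)]
train, test = split_by_patient(notes)
```

The split is made over patient IDs, so the realized test fraction in notes is only approximate when patients have very different numbers of notes.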
Note that abbreviations are ambiguous: MS = Multiple Sclerosis, Mental Status, etc.
Negation
If you use a bag-of-words model, it will tell you there is malignancy when it sees “no evidence of malignancy.” You can have a word and its negation in the same note! Note that clinicians document by exclusion: 30-50% of medical concepts are negated (Chapman, 2001; Harkema, 2009). There are NLP negation-detection tools for this purpose, and there are also NLP tools that do sentiment and legal analysis. Even worse: some conditions are negated more often than others (e.g. rashes).
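A rough sketch of rule-based negation detection, in the spirit of trigger-phrase approaches: a concept counts as negated if it appears within a few tokens after a negation cue. The cue list, the scope-breaking words, and the window size are all illustrative assumptions, far smaller than what real tools use.

```python
import re

# A handful of negation cues that precede the concept (illustrative only).
NEGATION_CUES = ["no evidence of", "no", "denies", "without", "negative for"]
# Words assumed to end the scope of a negation cue.
SCOPE_BREAKERS = {"but", "however", "except"}
WINDOW = 5  # assumed maximum number of tokens a cue can reach forward

def is_negated(sentence, concept):
    """True if `concept` occurs within WINDOW tokens after a negation cue."""
    words = re.findall(r"[a-z]+", sentence.lower())
    target = concept.lower().split()
    for cue in NEGATION_CUES:
        cue_words = cue.split()
        for i in range(len(words) - len(cue_words) + 1):
            if words[i:i + len(cue_words)] != cue_words:
                continue
            # Collect the tokens in scope, stopping at a scope breaker.
            start = i + len(cue_words)
            scope = []
            for w in words[start:start + WINDOW]:
                if w in SCOPE_BREAKERS:
                    break
                scope.append(w)
            # Look for the concept as a contiguous run inside the scope.
            for j in range(len(scope) - len(target) + 1):
                if scope[j:j + len(target)] == target:
                    return True
    return False

print(is_negated("No evidence of malignancy.", "malignancy"))
print(is_negated("Denies chest pain but reports rash.", "rash"))
```

A bag-of-words model sees the token "malignancy" in both "no evidence of malignancy" and "biopsy confirms malignancy"; a check like this is what separates the two.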
Clinical Communication
Clinician → Clinician. Compared to Documentation, less copypasta, more targeted.
- Some are more structured than others (e.g. Radiology: Indication, Technique, Findings, Impression).
- Need not be an evolving narrative like progress notes. Many x-rays → many reports/notes.
- Almost all labels are derived from the “Impression” section - “Silver-Standard Labels” (e.g. “No acute cardiopulmonary abnormality.”)
Note that reports can encode temporal dependence (“since 09 Jan 2018”) or dependence on other artifacts (like the x-ray).
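A minimal sketch of how a silver-standard label might be pulled out of a structured radiology report: extract the Impression section with a regular expression, then apply a rule. The section header list follows the structure named above; the labeling rule itself is a toy assumption, not how any real labeler works.

```python
import re

SECTION_HEADERS = ["INDICATION", "TECHNIQUE", "FINDINGS", "IMPRESSION"]

def extract_impression(report):
    """Return the text of the Impression section, or None if it is absent."""
    others = "|".join(h for h in SECTION_HEADERS if h != "IMPRESSION")
    # Capture everything after "IMPRESSION:" up to the next header or end of text.
    pattern = rf"IMPRESSION\s*:\s*(.*?)(?=\n\s*(?:{others})\s*:|\Z)"
    match = re.search(pattern, report, flags=re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else None

def silver_label(impression):
    """Toy rule (assumption): 0 if the impression says 'no acute', else 1."""
    if impression is None:
        return None
    return 0 if re.search(r"\bno acute\b", impression, re.IGNORECASE) else 1

# Made-up report for illustration.
report = (
    "INDICATION: Cough.\n"
    "TECHNIQUE: PA and lateral.\n"
    "FINDINGS: Lungs are clear.\n"
    "IMPRESSION: No acute cardiopulmonary abnormality.\n"
)
impression = extract_impression(report)
print(impression, silver_label(impression))
```

Labels produced this way are only as good as the rule and the radiologist's wording, which is why they are called silver-standard and not gold-standard.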
Note that this includes AVS (After-Visit Summaries).
Patient-Generated Text
- Patient Portal Messages
- Social Media (e.g. Reddit /r/AskDocs, Twitter, PatientsLikeMe)
- Online Health Communities (WebMD, HealthUnlocked) TODO: how is this different than Social Media?
Biomedical Literature
The intent is different, the audience is different. PubMed has 36M+ abstracts, and PubMed Central has full text. You can use this for RAG, for example. Biomedical information density is high (think of how well the BioBERTs perform in this domain).
Datasets to know:
- MIMIC-III/IV: by far the best for clinical notes
- MTSamples: not really useful for AI research
- MIMIC-CXR: for radiology reports and x-rays
- PubMed/PMC
The literature focuses on strange cases more than on normal ones: journals won’t publish things that are not of general use/interest.
You’ll see a lot of use of ontologies: MeSH for metadata, and the UMLS terminologies (SNOMED, ICD-10).
Currently
DAX Copilot, Suki, and Abridge are ambient AI scribes that listen to patient-provider conversations and have been well received. ~1.5% hallucination rate (Asgari, 2025). Physicians report saving 1-3 hours/day. Burnout rates have dropped.
But how do they affect notes? Note bloat! Notes get longer and more detailed. If you are an insurance company, you would downcode to account for the extra billing claims. Notes written by scribes are also more fluent, more uniform, and use fewer abbreviations.
Bottom Line
Look at the abbreviation density, intended readership, communication intent, and assumed knowledge for each of these classes of text. Which ones you use largely comes down to the problem you’re trying to solve and data accessibility. You need to know your data source’s selection bias before you pick it (e.g. PubMed reports overrepresent rare findings). “When you hear hoofbeats think horses not zebras.”
With AI scribes: in the past, notes were written by humans under time pressure. Scribes shift this to revenue pressure. Jevons Paradox: making things easier has led to longer notes with decreased information density.