Health Data Modalities — Imaging
Images are the first ‘grid’ data type (EHR is irregular sequences) and are quite amenable to the kinds of representations ML needs. Lots of great work here. Lots of modalities.
- Physics
- Acquisition (Expensive = lots of bias)
- Metadata (non-pixel info)
- Labels (Where do these come from? What granularity?)
- Pitfalls (What will break your model?)
Chiefly: X-Ray, CT, MRI, Histopathology Slides (H&E). There are several more.
Think of how an image enters clinical care. PACS = Picture Archiving and Communication System (proprietary, vendor lock-in).
You don’t take a single image: before and after, angles, different times. Radiologists are the most common label source.
X-Rays
Dark = all the rays have gotten through. An x-ray sums everything along the beam path (a lung nodule behind the heart may not be seen). Think superposition. All you’re measuring is intensity (i.e. this is why it’s B&W).
Bone blocks the rays. Air lets them through.
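A minimal sketch of the superposition idea, assuming a toy attenuation volume and the Beer-Lambert law. The function name `project_xray`, the voxel size, and the attenuation values are all made up for illustration. The point is that the detector only sees the sum along the beam, so depth is lost.

```python
import numpy as np

def project_xray(mu, voxel_size_cm=0.1, i0=1.0, axis=0):
    """Beer-Lambert projection: I = I0 * exp(-sum(mu * dx)) along the beam axis."""
    path_integral = mu.sum(axis=axis) * voxel_size_cm
    return i0 * np.exp(-path_integral)

# Toy volumes shaped (depth, height, width); values are attenuation in 1/cm (illustrative).
mu_front = np.zeros((50, 4, 4))
mu_front[:, 1, 1] = 0.5        # dense column (think bone): blocks most of the beam
mu_front[10:15, 2, 2] = 0.2    # small object near the front

mu_back = np.zeros((50, 4, 4))
mu_back[30:35, 2, 2] = 0.2     # the same object, further back

image_front = project_xray(mu_front)
image_back = project_xray(mu_back)

# Air pixel: nothing attenuates, so all the rays get through (shown dark on a film).
air_pixel = image_front[0, 0]
# The object gives the same pixel value at either depth: the projection cannot tell them apart.
same_pixel = np.isclose(image_front[2, 2], image_back[2, 2])
```

This is also why a faint object in line with something dense barely changes the pixel: its contribution is a small term in a large sum.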
Metadata example: what is implicitly in the image? Which side are you seeing? Dexter and Sinister (right and left). These things are typically burned into the image.
You have to worry about the radiation you are exposing people to. ALARA: As Low As Reasonably Achievable.
In ML this is the most common image modality! Cheap, tons of datasets (some of them have radiology reports alongside them)! There are “portable” and “upright” systems, and which one was used can be a confounder. NIH ChestX-ray14 was the first one released by the NIH. You can have severe class imbalance (50-60% may not have a diagnosis).
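One common way to cope with class imbalance is to up-weight rare positives in the loss. A rough sketch, assuming a multi-label 0/1 matrix; `positive_class_weights` is a hypothetical helper and the toy labels are invented, not taken from any real dataset.

```python
import numpy as np

def positive_class_weights(labels):
    """Per-finding weight = (#negatives / #positives), for use in a weighted loss."""
    labels = np.asarray(labels, dtype=float)
    n_pos = labels.sum(axis=0)
    n_neg = labels.shape[0] - n_pos
    # Guard against findings with zero positives.
    return n_neg / np.maximum(n_pos, 1.0)

# Toy data: 10 studies (rows), 2 findings (columns).
toy_labels = np.zeros((10, 2))
toy_labels[0, 0] = 1      # finding 0: 1 positive out of 10
toy_labels[:5, 1] = 1     # finding 1: 5 positives out of 10
weights = positive_class_weights(toy_labels)
```

The rare finding ends up with a much larger weight than the balanced one, which is the intended effect.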
In datasets, they run NLP over the radiology reports to get a kind of “silver standard” label. Radiologists can be wrong/uncertain. There can be report ambiguity as well.
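To make “silver standard” concrete, here is a toy rule-based labeler. It is only a sketch: the function `silver_label`, the cue-word lists, and the label codes are assumptions for illustration, not how any particular dataset was actually labeled.

```python
import re

def silver_label(report, term):
    """Return 1 (mentioned), 0 (negated), -1 (uncertain), or None (not mentioned)."""
    text = report.lower()
    term = re.escape(term.lower())
    # Only the first sentence that mentions the term is considered.
    for sentence in re.split(r"[.\n]", text):
        if not re.search(rf"\b{term}\b", sentence):
            continue
        # Negation cue somewhere before the term in the same sentence.
        if re.search(rf"\b(no|without|negative for)\b.*\b{term}\b", sentence):
            return 0
        # Hedging language anywhere in the sentence.
        if re.search(r"\b(possible|probable|may|cannot exclude|suspicious)\b", sentence):
            return -1
        return 1
    return None

report = "No pneumothorax. Possible small effusion."
labels = {t: silver_label(report, t) for t in ("pneumothorax", "effusion", "edema")}
```

Even this tiny example produces three different outcomes (negated, uncertain, not mentioned), which is where report ambiguity leaks into the labels.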
Now label granularity is not consistent and varies a lot. Think of “Normal exam” versus some kind of tumor mask versus 5-year risk of mortality. The finer-grained your labels get, the more expensive things get (TODO: how?)
Volumetric Imaging — CT and MRI
CT
X-Ray but with slices! Sort of. The data is a stack of slices so you’re looking at voxels: e.g. 512x512x1000 (each slice is a square 512x512 image).
The Hounsfield Unit (HU) is used to capture intensity. Air = -1000, Water = 0, Bone ~ +1000. TODO: What is “Windowing”? You use it to highlight anatomical features by scaling the contrast; you define which HU range is visible.
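A small sketch of windowing as described above: pick a center and width in HU, clip everything outside that range, and rescale what is left to [0, 1]. The function name `apply_window` and the center/width values are illustrative assumptions; real presets vary by site and by the anatomy of interest.

```python
import numpy as np

def apply_window(hu, center, width):
    """Clip HU values to [center - width/2, center + width/2] and rescale to [0, 1]."""
    lo = center - width / 2.0
    hi = center + width / 2.0
    return (np.clip(hu, lo, hi) - lo) / (hi - lo)

# Air, water, soft tissue, bone (HU values from the notes, plus an assumed soft-tissue value).
hu_values = np.array([-1000.0, 0.0, 40.0, 1000.0])

# Illustrative soft-tissue-style window: visible range is -160 to 240 HU.
windowed = apply_window(hu_values, center=40, width=400)
```

Air and bone saturate at 0 and 1, so all of the display contrast is spent on the narrow range in between.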
These are way more expensive and there’s a lot more radiation than x-rays (and this dose depends on the tissue you’re looking at; higher density → higher dose needed to see things). You also have to worry about cumulative risk (esp in babies).
Also have to think about the fact that poorer people won’t be represented well thanks to the health system.
MRI
Much better at visualizing soft tissue (e.g. the brain, spine, joints, cardiac) — great “soft tissue contrast”.
No ionizing radiation. You can get multi-channel images (multiple distinct signals). Compare to CT where you’re twiddling the Hounsfield scale (and that’s all you get). There’s no standardization of intensities at all so this depends a lot on the machine and vendor.
It’s not as quick compared to CT (30-90 mins). Low throughput. Claustrophobic. But really freaking awesome in terms of getting information from the body.
ML Challenges
Anisotropy in Voxels: You want cubes but you get cuboids because of unpredictable slice thickness. This can cause problems in modeling.
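One usual fix is to resample to isotropic voxels before modeling. A numpy-only sketch using nearest-neighbour lookup; `resample_isotropic`, the spacing values, and the 1 mm target are assumptions for illustration. Real pipelines usually interpolate instead (for example with `scipy.ndimage.zoom`).

```python
import numpy as np

def resample_isotropic(volume, spacing_mm, target_mm=1.0):
    """Nearest-neighbour resample so every voxel is target_mm on each side."""
    out = volume
    for axis, spacing in enumerate(spacing_mm):
        n_in = volume.shape[axis]
        n_out = max(1, int(round(n_in * spacing / target_mm)))
        # Map the center of each output voxel back to an input index.
        src = (np.arange(n_out) + 0.5) * target_mm / spacing
        src = np.minimum(src, n_in - 1).astype(int)
        out = np.take(out, src, axis=axis)
    return out

# Toy CT: 10 thick slices (5 mm) of 64x64 pixels (1 mm each), so voxels are cuboids.
toy = np.arange(10, dtype=float)[:, None, None] * np.ones((10, 64, 64))
iso = resample_isotropic(toy, spacing_mm=(5.0, 1.0, 1.0))
```

Each 5 mm slice is simply repeated five times here, which makes the shape isotropic but adds no information along that axis.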
Histopathology — H&E
Haematoxylin — Nucleus (blue).
Eosin — Cytoplasm (pink)
Take a sample from the patient and send it away. Put that on some glass and stain it and take really high-res pics of the tissue (like at 20-40x magnification).
So you have a “Gigapixel Problem”: the whole slide is too big to look at in one go. So what you’re looking at is portions of the whole image.
Within these portions you can see stuff at nuclear → cellular → structural → tissue architectural level.
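A sketch of how the “portions” are usually enumerated: walk a grid over the slide and yield the top-left corner of each tile. `tile_coordinates`, the 512-pixel tile size, and the slide dimensions are made-up values for illustration.

```python
def tile_coordinates(width, height, tile=512, stride=None):
    """Yield (x, y) top-left corners of tiles that fit fully inside the slide.

    Assumes the slide is at least one tile in size; partial edge tiles are dropped.
    """
    stride = stride or tile
    for y in range(0, max(height - tile, 0) + 1, stride):
        for x in range(0, max(width - tile, 0) + 1, stride):
            yield x, y

# Hypothetical slide of 100,000 x 80,000 pixels (8 gigapixels).
n_tiles = sum(1 for _ in tile_coordinates(100_000, 80_000))

# Tiny example that is easy to check by hand.
small = list(tile_coordinates(1024, 512))
```

A smaller `stride` gives overlapping tiles; a larger `tile` trades cellular detail per pixel budget for more tissue architecture in view.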
Biopsies can be invasive. Costs a lot to store. Not everyone’s on-board globally with digitization. Faster and cheaper than MRIs tho (no radiation risk like in CTs). You can’t tell who someone is from their cells lol.
Here’s a giantass dataset.
Other Stuff
- Retinal Imaging (fundus photography, an actual ‘normal’ picture)
- Dermoscopy (pictures of skin)
- Ultrasound (not really imaging? sound? understudied, safe, but problems with operator dependence.)
- PET
Radiation Dose Comparison
Think of how nice it would be if you can do a lot of great prediction with lower doses.
DICOM
DICOM = Digital Imaging and Communications in Medicine. A universal standard for medical images. Has both the pixels and metadata: orientation (LR, AP), pixel spacing (mm/pixel), slice thickness, window center/width, manufacturer/model.
Use all this metadata to normalize in pre-processing (frequent source of bugs!)
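A sketch of the kind of sanity check that catches these bugs early. Plain dicts stand in for DICOM headers here (in practice you would read them with a library such as pydicom); the keys follow DICOM keyword names, but `voxel_spacing_mm`, `check_series`, and the 1.5 anisotropy threshold are assumptions for illustration. Slice thickness is used as a stand-in for slice spacing, which is a simplification.

```python
def voxel_spacing_mm(meta):
    """Return (slice, row, column) spacing in mm from a DICOM-style header dict."""
    row_mm, col_mm = (float(v) for v in meta["PixelSpacing"])
    return (float(meta["SliceThickness"]), row_mm, col_mm)

def check_series(headers, tol=1e-3, max_ratio=1.5):
    """Return a list of problems found across the slices of one series."""
    problems = []
    spacings = [voxel_spacing_mm(h) for h in headers]
    first = spacings[0]
    if any(max(abs(a - b) for a, b in zip(s, first)) > tol for s in spacings):
        problems.append("inconsistent spacing across slices")
    if max(first) / min(first) > max_ratio:
        problems.append("anisotropic voxels")
    if len({h.get("Manufacturer") for h in headers}) > 1:
        problems.append("mixed manufacturers")
    return problems

# Hypothetical headers: values arrive as strings, a classic source of bugs.
good = {"PixelSpacing": ["0.7", "0.7"], "SliceThickness": "5.0", "Manufacturer": "A"}
odd = {"PixelSpacing": ["0.7", "0.7"], "SliceThickness": "2.5", "Manufacturer": "B"}

clean_report = check_series([good, good])
messy_report = check_series([good, odd])
```

Running a check like this before any resampling or windowing makes the failure loud instead of silently distorting the volume.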
The Shortcut Learning Problem
Think about how “Had an x-ray” can lead a model to ‘simply’ conclude that they are sick (i.e. without taking into account their pathology).
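A quick way to probe for this kind of shortcut is to ask how well a non-pixel flag alone predicts the label. A toy sketch: `shortcut_accuracy` is a hypothetical helper, and the “portable scanner” flag and labels below are invented data.

```python
import numpy as np

def shortcut_accuracy(flag, label):
    """Accuracy of the best rule that predicts a binary label from a binary flag alone."""
    flag = np.asarray(flag, dtype=bool)
    label = np.asarray(label, dtype=bool)
    correct = 0
    for value in (False, True):
        group = label[flag == value]
        if group.size:
            # Within each flag group, predict the majority label.
            correct += max(group.sum(), group.size - group.sum())
    return correct / label.size

# Invented data: sicker patients are more often imaged with the portable system.
portable = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
sick     = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

flag_only = shortcut_accuracy(portable, sick)
majority_baseline = max(np.mean(sick), 1 - np.mean(sick))
```

If the flag alone beats the majority baseline by a wide margin, a model can score well by reading the acquisition setup rather than the pathology.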
TODO: Berkson’s Paradox.