Modality → Representation Map

Data, data, data. “No bricks without clay.” Thinking carefully about the data is $>80\%$ of what you will be doing.

Clinical record data

Modality	Structure	Naive tabularization loses	Model family	Concern
EHR event stream	Irregular timing, observation density informative, multi-table (vitals, labs, meds, dx, encounters)	Order, time gaps, recency, trends; the diagnostic signal of how often something gets measured	Event-stream transformer, RNN, or carefully engineered windows with explicit time gaps	Label leakage (post-event features); prediction-time discipline; ACES framework for cohort definition
EHR snapshot (tabularized)	One row per patient at a fixed prediction time	Repeated measurements, longitudinal trends, recency of last measurement	Logistic regression, gradient boosting, random forest	Snapshot timing must be defined precisely; sicker patients get measured more, so missingness is informative
Chunked EHR	One row per fixed time window per patient	Sub-window dynamics; variable event timing within the window	Tabular models per chunk; sequence model over chunks	Window choice is a modeling assumption; events near boundaries are arbitrary
Claims sequence	Longitudinal billing records, irregular, coverage-gated	Care trajectories, recurrence, coverage gaps, timing of re-entries	Temporal coded-event sequence; engineered windows over claims	Billing ≠ clinical truth; insurance discontinuity affects features and labels
Clinical notes (free text)	Sequential tokens, semantic meaning context-dependent	Negation, assertion, temporality: the difference between “denies chest pain” and “reports chest pain”	Clinical BERT / BioBERT / GatorTron; LLM with RAG for QA	Templates and copy-paste create predictive shortcuts; abbreviations are ambiguous (MS can mean multiple sclerosis, mitral stenosis, or morphine sulfate, among others); stigmatizing language reflects bias
Biomedical literature / reports	Long-form scientific text; structured sections	Section structure; citation network; figure-text alignment	Domain-pretrained LMs (PubMedBERT); RAG	Hallucination in generative models; out-of-date knowledge after cutoff
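The "prediction-time discipline" and "informative observation" entries in the table above can be made concrete with a small sketch. It assumes a hypothetical toy event stream of `(patient_id, timestamp, code, value)` tuples; the function name `snapshot_features` and the feature names are illustrative, not part of any framework. Only events strictly before the prediction time are used, and the measurement count and the time gap since the last measurement are kept as features in their own right.

```python
from datetime import datetime, timedelta

# Hypothetical toy event stream: (patient_id, timestamp, code, value).
EVENTS = [
    ("p1", datetime(2024, 1, 1, 8), "lactate", 1.1),
    ("p1", datetime(2024, 1, 2, 9), "lactate", 2.4),
    ("p1", datetime(2024, 1, 3, 7), "lactate", 4.0),     # after the prediction time used below
    ("p1", datetime(2024, 1, 3, 12), "sepsis_dx", 1.0),  # the outcome itself
]


def snapshot_features(events, patient_id, code, prediction_time):
    """Summarize one code using only events strictly before prediction_time."""
    history = sorted(
        (ts, value)
        for pid, ts, c, value in events
        if pid == patient_id and c == code and ts < prediction_time
    )
    if not history:
        # Never measured: this absence is itself a feature (n_obs == 0).
        return {"n_obs": 0, "last_value": None, "hours_since_last": None, "delta": None}
    last_ts, last_value = history[-1]
    first_value = history[0][1]
    return {
        "n_obs": len(history),                      # observation density
        "last_value": last_value,                   # recency
        "hours_since_last": (prediction_time - last_ts) / timedelta(hours=1),
        "delta": last_value - first_value,          # crude trend
    }


features = snapshot_features(EVENTS, "p1", "lactate", datetime(2024, 1, 2, 12))
print(features)
```

Moving the prediction time to after the third lactate value would silently pull a post-event measurement into the features, which is the leakage pattern the table warns about.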

Imaging

Modality	Structure	Naive tabularization loses	Model family	Concern
Chest X-ray (2D projection)	2D image, anatomy superposed (3D collapsed to 2D)	Spatial locality, edges, texture, anatomic context	CNN (ResNet-style), pretrained on ImageNet then fine-tuned; vision transformer	Augmentation validity: a horizontal flip mirrors left and right (the heart is normally on the left), and a vertical flip yields an anatomically implausible image; DICOM headers are shortcut hazards; multi-site brittleness
CT / MRI (volumetric)	3D voxel grid; for MRI, multiple contrast modes per study	Volumetric continuity across slices; cross-modal information	3D CNN, U-Net for segmentation, transformer for long-range structure	Slice thickness varies by protocol; scanner/site artifacts; registration across modes
Whole-slide pathology	Gigapixel spatial; slide-level label, tile-level unknown	Spatial locality, tissue morphology, large-scale architecture	Tile-level CNN/ViT + multiple-instance learning (max / mean / attention pooling); transfer learning	Scanner artifacts; weak label-region alignment; patient-level split (multiple slides per patient)
Dermatology / fundus / ultrasound	2D natural-image-like medical photo	Local morphology, color, lesion-vs-background	CNN with augmentation; transfer learning from ImageNet or domain FM	Camera/device variation; skin-tone distribution shift; lesion prevalence varies dramatically by clinic
Echocardiography / video imaging	Spatiotemporal; 2D + time	Motion, periodicity, function-from-motion	3D CNN, video transformer, temporal pooling over 2D-CNN frames	Operator variability; view selection; gating to cardiac cycle
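The patient-level split mentioned for whole-slide pathology applies to any imaging dataset with several images per patient. A minimal sketch, assuming a hypothetical list of `(image_id, patient_id)` pairs: the split is decided by hashing the patient ID, so every image from one patient lands on the same side. The helper name `assign_split` and the salt string are assumptions made for illustration.

```python
import hashlib


def assign_split(patient_id, test_fraction=0.2, salt="split-v1"):
    """Deterministically map a patient ID to 'train' or 'test'."""
    digest = hashlib.sha256(f"{salt}:{patient_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # roughly uniform in [0, 1)
    return "test" if bucket < test_fraction else "train"


# Hypothetical (image_id, patient_id) pairs; pt_A and pt_C have several images.
IMAGES = [
    ("slide_001", "pt_A"), ("slide_002", "pt_A"),
    ("slide_003", "pt_B"),
    ("slide_004", "pt_C"), ("slide_005", "pt_C"), ("slide_006", "pt_C"),
    ("slide_007", "pt_D"),
]

splits = {"train": [], "test": []}
for image_id, patient_id in IMAGES:
    splits[assign_split(patient_id)].append((image_id, patient_id))

train_patients = {pid for _, pid in splits["train"]}
test_patients = {pid for _, pid in splits["test"]}
print("overlap:", train_patients & test_patients)
```

Splitting on the image ID instead would let tiles or slides from one patient appear in both sets, which tends to inflate test performance. With so few patients the realized test fraction will be far from 20%; the hashing only guarantees consistency, not balance.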

Time series and signals

Modality	Structure	Naive tabularization loses	Model family	Concern
ECG / waveform	Continuous, regular, multi-channel (12-lead)	Channel meaning (different leads = different spatial vantages), waveform morphology, rhythm	1D CNN, transformer on raw signal, hybrid frequency-domain features	Lead placement variability; informative gaps from device disconnection
Wearable / home monitoring	Longitudinal, irregular, partly patient-generated	Adherence patterns, time-of-day effects, missingness as signal	Sequence model over engineered daily features; mixed-effects for individual baselines	Patients measure more when they feel unwell; missing data may mean stable OR worsening OR device failure OR hospitalization
ICU bedside monitor streams	High-frequency multi-channel, near-continuous	Trends over minutes/hours, alarm context, cross-channel coupling	RNN/transformer on summarized windows; engineered SOFA-like scores	Alarms create artifacts; sensor dropout; informative observation (sicker = more sensors)
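For monitor and wearable streams, "missingness as signal" can be kept explicit when summarizing into windows. The sketch below assumes a hypothetical stream of `(seconds, value)` samples and a nominal sampling rate; `summarize_windows` and its output fields are illustrative names. Each window reports its mean together with the fraction of expected samples that actually arrived, and empty windows are emitted instead of dropped.

```python
def summarize_windows(samples, window_s, expected_per_window):
    """Summarize (t_seconds, value) samples into fixed windows.

    Empty windows are kept so a downstream model can see the gap.
    """
    if not samples:
        return []
    windows = {}
    for t, value in samples:
        windows.setdefault(int(t // window_s), []).append(value)
    rows = []
    for idx in range(max(windows) + 1):
        values = windows.get(idx, [])
        rows.append({
            "window": idx,
            "mean": sum(values) / len(values) if values else None,
            "observed_fraction": min(1.0, len(values) / expected_per_window),
            "missing": not values,
        })
    return rows


# Hypothetical heart-rate stream sampled every 10 s, with a dropout in minute 1
# and only half the expected samples in minute 2.
stream = [(t, 80.0) for t in range(0, 60, 10)] + [(t, 110.0) for t in (120, 130, 140)]
rows = summarize_windows(stream, window_s=60, expected_per_window=6)
for row in rows:
    print(row)
```

Whether a gap means sensor dropout, a disconnected device, or a transfer cannot be read from the stream alone; the sketch only preserves the gap so that question can still be asked.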

Network and graph data

Modality	Structure	Naive tabularization loses	Model family	Concern
Protein interaction network	Variable size, no canonical ordering, neighborhood-defined	Connectivity, neighborhood, pathway-level structure	Graph neural network (GCN, GIN); message passing	Noisy/incomplete edges; oversmoothing with too many message-passing layers
Molecular graph	Atoms as nodes, bonds as edges, variable size	Atom adjacency, ring structure, functional groups, stereochemistry	GNN (message-passing); also: ECFP fingerprints + RF as a notoriously strong baseline	Scaffold split, NOT random split; activity cliffs (small structural change → large activity change); 3D conformer choice
Knowledge graphs (UMLS, drug-disease)	Heterogeneous nodes and edges, relation types	Path semantics, edge types, multi-hop reasoning	Heterogeneous GNN, knowledge-graph embeddings	Curation bias; missing edges treated as negatives ≠ true negatives
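The oversmoothing concern in the protein-network row can be seen with a deliberately simplified form of message passing. This is a sketch, not a GCN or GIN: there are no learned weights, just repeated mean aggregation over each node and its neighbors on a hypothetical four-node path graph. The spread of node features shrinks toward zero as rounds accumulate, so nodes become indistinguishable.

```python
def mean_aggregate(features, neighbors):
    """One round: replace each node's feature by the mean over itself and its neighbors."""
    return [
        sum(features[j] for j in [i, *neighbors[i]]) / (1 + len(neighbors[i]))
        for i in range(len(features))
    ]


def spread(features):
    """Range of node features; zero means all nodes look identical."""
    return max(features) - min(features)


# Hypothetical path graph 0 - 1 - 2 - 3 with a single scalar feature per node.
NEIGHBORS = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
x = [1.0, 0.0, 0.0, 0.0]

history = [spread(x)]
for _ in range(50):
    x = mean_aggregate(x, NEIGHBORS)
    history.append(spread(x))

print("spread after 0, 1, 5, 50 rounds:", history[0], history[1], history[5], history[50])
```

Real GNN layers add learned transforms and nonlinearities, but the averaging step is the part that drives features together on a connected graph, which is one reason shallow message-passing stacks are a common default.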

Molecular and biological data

Modality	Structure	Naive tabularization loses	Model family	Concern
Protein sequence	Linear amino-acid string, evolutionary signal in alignments	Conservation, coevolution, long-range residue contacts	MSA-based methods (PSSM, AlphaFold-style); protein LM (ESM-2) for embeddings	Structure prediction confidence (pLDDT) is about fold, not function/binding
Predicted protein structure	3D atomic coordinates	Conformational dynamics, induced fit, water/ion mediation	3D / equivariant networks for binding tasks	High pLDDT ≠ validated docking hit; pocket flexibility often poorly modeled
DNA sequence / variants (GWAS)	Linear sequence, ~10⁶ SNPs per individual, population-stratified	Linkage disequilibrium, haplotype block structure, ancestry signal	GWAS (per-SNP regression with PC adjustment); polygenic risk scores	Population structure as the dominant confounder; PRS transferability across ancestries is poor (Martin et al. 2019) — deployment can widen disparities
Regulatory / epigenomic sequence	Long-context DNA + cell-type-specific signals	Cell-type context, chromatin accessibility, 3D genome organization	Long-context sequence models (Enformer, Evo); paired with ATAC/ChIP-seq	Sequence alone is insufficient — same variant has different effects in different tissues
Bulk RNA-seq	Sample × gene expression matrix, ~20K genes	Gene-gene correlation structure, pathway groupings	PCA/dim reduction + classical ML (your TCGA HW3 pattern); pathway scoring	Batch effects across sites; RSEM normalization assumptions; p ≫ n
scRNA-seq	Sparse high-dim count matrix, depth-confounded, dropout	(Less about loss, more about dominance:) batch effects and depth dominate without normalization	Normalize → PCA → UMAP/clustering; latent-variable models (scVI)	Clusters need biological validation, not trusted automatically; doublets and ambient RNA contamination
Mass spec / metabolomics	Spectra, peak intensities, retention times	Peak co-elution, isotope patterns, fragmentation structure	Domain-specific peak detection + downstream ML on identified features	Identification ambiguity; instrument-specific drift; reference library coverage
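The scRNA-seq row notes that sequencing depth dominates without normalization. A minimal sketch of one common convention, scaling each cell to a fixed total and applying log1p, is shown below. The target of 10,000 counts and the function name `normalize_cell` are assumptions for illustration; real pipelines typically add gene filtering, doublet handling, and batch correction.

```python
import math


def normalize_cell(counts, target_sum=10_000):
    """Scale one cell's counts to target_sum, then apply log(1 + x)."""
    total = sum(counts)
    if total == 0:
        return [0.0] * len(counts)  # empty droplet: nothing to scale
    return [math.log1p(c * target_sum / total) for c in counts]


# Two hypothetical cells with identical composition but 10x different depth.
shallow = [1, 3, 0, 6]
deep = [10, 30, 0, 60]

raw_distance = sum(abs(a - b) for a, b in zip(shallow, deep))
norm_distance = sum(
    abs(a - b) for a, b in zip(normalize_cell(shallow), normalize_cell(deep))
)
print("raw L1 distance:", raw_distance)
print("normalized L1 distance:", norm_distance)
```

On raw counts the two cells look far apart purely because of depth; after normalization they coincide. Note that zeros stay zero, so dropout is not repaired by this step.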

Population-level / designed data

Modality	Structure	Naive tabularization loses	Model family	Concern
Designed surveys (NHANES)	Probability sample, weighted, cross-sectional	Survey weights and design strata (treating as iid biases population estimates)	Survey-design-aware regression; classical epidemiology	Small samples for rare conditions; self-report errors; non-response bias
Cancer registries (SEER)	Population-based by geography, comprehensive ascertainment within catchment	Geographic and temporal trends; differential ascertainment across registries	Time-to-event models (Cox, KM); standardized rates	Only covers cancer; coding changes over time (e.g., ICD revisions)
Biobanks (UK Biobank, All of Us)	Deeply phenotyped volunteers; genetic + imaging + EHR linkage	Volunteer self-selection effects	Multi-modal models combining genetics, imaging, EHR	Healthy-volunteer bias — volunteers are wealthier and healthier than the population
Randomized trials (RCT)	Randomized treatment assignment, structured outcome ascertainment	Real-world heterogeneity outside eligibility criteria	Causal estimands (ATE, HR); intent-to-treat	Strong internal validity, weak external validity; eligibility excludes most real patients
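The survey row says that treating a probability sample as iid biases population estimates. A small sketch with made-up respondents illustrates the point estimate only; the data and the helper name `weighted_mean` are hypothetical, and proper variance estimation would also need the design strata and sampling units, which are not modeled here.

```python
def weighted_mean(values, weights):
    """Weighted mean, where each weight is the number of people a respondent represents."""
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


# Hypothetical respondents: (has_condition, survey_weight).
# The group with the condition was oversampled, so each such person carries a small weight.
RESPONDENTS = [(1, 500.0), (1, 500.0), (1, 500.0), (0, 4000.0), (0, 4500.0)]

values = [v for v, _ in RESPONDENTS]
weights = [w for _, w in RESPONDENTS]

naive_prevalence = sum(values) / len(values)
weighted_prevalence = weighted_mean(values, weights)
print("naive:", naive_prevalence, "weighted:", weighted_prevalence)
```

The naive estimate reflects who was sampled; the weighted estimate reflects the population the sample was designed to represent.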

Multi-modal patient-linked data

Modality	Structure	Naive tabularization loses	Model family	Concern
Multi-modal cancer (sequencing + expression + pathology + EHR)	Patient-linked across modalities, partial availability per patient	Inter-modal complementarity; the fact that absence of a modality is informative	Modality-specific encoders + late fusion (safer with limited data) or end-to-end multimodal (needs more data)	Confounding by indication (treated patients differ systematically); leakage across tiles/slides; missing modalities are non-random
Image + report pairs	Paired modalities sharing patient/study	Cross-modal alignment between text findings and image regions	CLIP-style contrastive learning (CheXzero, BiomedCLIP, CONCH)	Reports describe findings the radiologist saw, not ground truth; report templates create shortcuts
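Late fusion with partially available modalities can be sketched as follows. The sketch assumes that each modality already has its own trained encoder producing a logit, and that equal-weight averaging of the available logits is acceptable; both are assumptions, as are the names `MODALITIES` and `late_fusion`. The availability mask is returned alongside the prediction because, as noted in the table above, which modalities are missing is itself informative.

```python
import math

# Hypothetical modality names, following the multi-modal cancer row.
MODALITIES = ("sequencing", "expression", "pathology", "ehr")


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def late_fusion(logits):
    """Average the logits of available modalities.

    logits: dict mapping modality name to a float, or None when missing.
    Returns (probability or None, availability mask in MODALITIES order).
    """
    available = [m for m in MODALITIES if logits.get(m) is not None]
    mask = [1 if m in available else 0 for m in MODALITIES]
    if not available:
        return None, mask
    mean_logit = sum(logits[m] for m in available) / len(available)
    return sigmoid(mean_logit), mask


patient = {"sequencing": 2.0, "expression": None, "pathology": 0.0, "ehr": None}
probability, mask = late_fusion(patient)
print(probability, mask)
```

The mask can be fed to a downstream model or used to stratify evaluation, so that performance on patients with sparse workups is reported separately.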

Quick Mnemonic

Name the structure. Sequence? Image? Graph? Spatial? Event stream? Tabular?
Name what tabularization loses. Order? Spatial locality? Connectivity? Time gaps?
Match the inductive bias. CNN for spatial, GNN for graphs, sequence model for ordered data, MIL for weak slide-level labels, fingerprint+RF as the surprisingly-strong molecular baseline.
Name one concern from the data-generating process. Scanner artifacts, billing distortions, ancestry confounding, informative observation, batch effects, shortcut hazards, missingness-as-signal.
Which naive default would fail here and why? (split / threshold / augmentation / metric / feature representation)
What contextual shift threatens generalization? (site / time / population / device / prevalence)

Another view: a compact summary by data type

Type	Core thing preserved	What tabularization destroys
Sequence	order + timing	temporal structure
Image/Spatial	locality + geometry	spatial relationships
Graph/Network	connectivity	neighborhood structure
Language/Text	context + semantics	meaning
High-dimensional biology	correlation/manifold structure	latent biological organization
Population/causal	sampling/assignment process	study design assumptions
Multimodal	relationships across modalities	cross-modal information
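The first row of this summary (sequence: order and timing) can be checked directly. In this sketch two hypothetical patient timelines contain the same codes in a different order with different gaps; a bag-of-codes tabularization cannot tell them apart, while even a crude ordered representation can. The function names and the toy codes are illustrative.

```python
from collections import Counter


def bag_of_codes(events):
    """Tabularize by counting codes; order and timing are discarded."""
    return Counter(code for _, code in events)


def ordered_bigrams(events):
    """Keep adjacent code pairs in time order as a minimal order-aware view."""
    codes = [code for _, code in sorted(events)]
    return list(zip(codes, codes[1:]))


# Hypothetical (day, code) timelines.
patient_a = [(0, "chest_pain"), (1, "troponin"), (2, "discharge")]
patient_b = [(0, "discharge"), (30, "chest_pain"), (31, "troponin")]

same_bag = bag_of_codes(patient_a) == bag_of_codes(patient_b)
same_order = ordered_bigrams(patient_a) == ordered_bigrams(patient_b)
print("same bag of codes:", same_bag)
print("same ordered bigrams:", same_order)
```

Patient A is worked up and then discharged; patient B returns with chest pain a month after discharge. The count table treats them as identical.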
