Modality → Representation Map
Clinical record data
| Modality | Structure | Naive tabularization loses | Model family | Concern |
|---|---|---|---|---|
| EHR event stream | Irregular timing, observation density informative, multi-table (vitals, labs, meds, dx, encounters) | Order, time gaps, recency, trends; the diagnostic signal of how often something gets measured | Event-stream transformer, RNN, or carefully engineered windows with explicit time gaps | Label leakage (post-event features); prediction-time discipline; ACES framework for cohort definition |
| EHR snapshot (tabularized) | One row per patient at a fixed prediction time | Repeated measurements, longitudinal trends, recency of last measurement | Logistic regression, gradient boosting, random forest | Snapshot timing must be defined precisely; sicker patients get measured more, so missingness is informative |
| Chunked EHR | One row per fixed time window per patient | Sub-window dynamics; variable event timing within the window | Tabular models per chunk; sequence model over chunks | Window choice is a modeling assumption; events near boundaries are arbitrary |
| Claims sequence | Longitudinal billing records, irregular, coverage-gated | Care trajectories, recurrence, coverage gaps, timing of re-entries | Temporal coded-event sequence; engineered windows over claims | Billing ≠ clinical truth; insurance discontinuity affects features and labels |
| Clinical notes (free text) | Sequential tokens, semantic meaning context-dependent | Negation, assertion, temporality, the difference between “denies chest pain” and “reports chest pain” | Clinical BERT / BioBERT / GatorTron; LLM with RAG for QA | Templates and copy-paste create predictive shortcuts; abbreviations are ambiguous (“MS” can mean multiple sclerosis, mitral stenosis, or morphine sulfate, among others); stigmatizing language reflects bias |
| Biomedical literature / reports | Long-form scientific text; structured sections | Section structure; citation network; figure-text alignment | Domain-pretrained LMs (PubMedBERT); RAG | Hallucination in generative models; out-of-date knowledge after cutoff |
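
The "prediction-time discipline" concern in the EHR event-stream row can be made concrete with a small sketch. It builds features only from events visible at the prediction time and keeps the time gap and measurement count as explicit features. The event tuples, codes, and the `features_at` helper are hypothetical and for illustration only.

```python
from datetime import datetime, timedelta

# Hypothetical event stream for one patient: (timestamp, code, value).
events = [
    (datetime(2024, 1, 1, 8, 0), "heart_rate", 88.0),
    (datetime(2024, 1, 1, 20, 0), "lactate", 1.9),
    (datetime(2024, 1, 2, 6, 0), "heart_rate", 104.0),
    (datetime(2024, 1, 2, 9, 0), "lactate", 4.2),  # recorded AFTER prediction time
]

def features_at(events, prediction_time, lookback=timedelta(hours=48)):
    """Build features from events inside [prediction_time - lookback, prediction_time]."""
    window_start = prediction_time - lookback
    visible = [e for e in events if window_start <= e[0] <= prediction_time]
    feats = {}
    for ts, code, value in sorted(visible):  # chronological order
        feats[f"{code}_last"] = value
        # Explicit time gap: recency is a feature, not something to discard.
        feats[f"{code}_hours_since_last"] = (prediction_time - ts).total_seconds() / 3600
        # Observation density: how often something was measured is itself a signal.
        feats[f"{code}_count"] = feats.get(f"{code}_count", 0) + 1
    return feats

prediction_time = datetime(2024, 1, 2, 7, 0)
feats = features_at(events, prediction_time)
```

The 09:00 lactate never enters `feats`, so a label defined after 07:00 cannot leak through it.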
Imaging
| Modality | Structure | Naive tabularization loses | Model family | Concern |
|---|---|---|---|---|
| Chest X-ray (2D projection) | 2D image, anatomy superposed (3D collapsed to 2D) | Spatial locality, edges, texture, anatomic context | CNN (ResNet-style), pretrained on ImageNet then fine-tuned; vision transformer | Augmentation validity: a horizontal flip mirrors left and right (the heart is on the left), and a vertical flip gives an anatomically implausible image; DICOM headers are shortcut hazards; multi-site brittleness |
| CT / MRI (volumetric) | 3D voxel grid; for MRI, multiple contrast modes per study | Volumetric continuity across slices; cross-modal information | 3D CNN, U-Net for segmentation, transformer for long-range structure | Slice thickness varies by protocol; scanner/site artifacts; registration across modes |
| Whole-slide pathology | Gigapixel spatial; slide-level label, tile-level unknown | Spatial locality, tissue morphology, large-scale architecture | Tile-level CNN/ViT + multiple-instance learning (max / mean / attention pooling); transfer learning | Scanner artifacts; weak label-region alignment; patient-level split (multiple slides per patient) |
| Dermatology / fundus / ultrasound | 2D natural-image-like medical photo | Local morphology, color, lesion-vs-background | CNN with augmentation; transfer learning from ImageNet or domain FM | Camera/device variation; skin-tone distribution shift; lesion prevalence varies dramatically by clinic |
| Echocardiography / video imaging | Spatiotemporal; 2D + time | Motion, periodicity, function-from-motion | 3D CNN, video transformer, temporal pooling over 2D-CNN frames | Operator variability; view selection; gating to cardiac cycle |
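
The three pooling options named in the whole-slide pathology row (max / mean / attention) can be sketched in a few lines. This is a minimal illustration, assuming tile-level scores are already available. In a real attention-MIL model the attention logits are learned from tile embeddings; here they are hand-set, hypothetical numbers.

```python
import math

def mil_pool(tile_scores, mode="attention", attention_logits=None):
    """Aggregate tile-level scores into one slide-level score."""
    if mode == "max":
        return max(tile_scores)
    if mode == "mean":
        return sum(tile_scores) / len(tile_scores)
    if mode == "attention":
        # Softmax over per-tile logits gives weights that sum to 1.
        m = max(attention_logits)
        exps = [math.exp(a - m) for a in attention_logits]
        total = sum(exps)
        return sum((w / total) * s for w, s in zip(exps, tile_scores))
    raise ValueError(f"unknown pooling mode: {mode}")

# One suspicious tile among mostly benign-looking tiles (hypothetical values).
tile_scores = [0.05, 0.10, 0.90, 0.08]
attention_logits = [0.0, 0.0, 3.0, 0.0]

slide_max = mil_pool(tile_scores, "max")
slide_mean = mil_pool(tile_scores, "mean")
slide_attn = mil_pool(tile_scores, "attention", attention_logits)
```

Mean pooling dilutes the single positive tile, max pooling trusts it completely, and attention pooling lands in between.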
Time series and signals
| Modality | Structure | Naive tabularization loses | Model family | Concern |
|---|---|---|---|---|
| ECG / waveform | Continuous, regular, multi-channel (12-lead) | Channel meaning (different leads = different spatial vantages), waveform morphology, rhythm | 1D CNN, transformer on raw signal, hybrid frequency-domain features | Lead placement variability; informative gaps from device disconnection |
| Wearable / home monitoring | Longitudinal, irregular, partly patient-generated | Adherence patterns, time-of-day effects, missingness as signal | Sequence model over engineered daily features; mixed-effects for individual baselines | Patients measure more when they feel unwell; missing data may mean stable OR worsening OR device failure OR hospitalization |
| ICU bedside monitor streams | High-frequency multi-channel, near-continuous | Trends over minutes/hours, alarm context, cross-channel coupling | RNN/transformer on summarized windows; engineered SOFA-like scores | Alarms create artifacts; sensor dropout; informative observation (sicker = more sensors) |
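
Several rows above treat missingness as signal. One common way to keep that signal is to carry an observation mask and a time-since-last-observation counter alongside the forward-filled value. The sketch below assumes an hourly grid with `None` for hours without a reading; the function name and the SpO2 values are hypothetical.

```python
def add_observation_features(values):
    """values: hourly readings, with None where nothing was recorded."""
    rows = []
    last_value, hours_since = None, None
    for v in values:
        observed = v is not None
        if observed:
            last_value, hours_since = v, 0
        elif hours_since is not None:
            hours_since += 1  # gap grows until the next reading
        rows.append({
            "value_ffill": last_value,      # forward-filled value
            "observed": int(observed),      # mask: was this hour measured?
            "hours_since_obs": hours_since  # staleness of the filled value
        })
    return rows

spo2 = [97, None, None, 91, 90, None]
rows = add_observation_features(spo2)
```

The mask and the gap counter let a downstream model tell a fresh reading from a stale carried-forward one. They do not say why the gap occurred (stable patient, device failure, or hospitalization).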
Network and graph data
| Modality | Structure | Naive tabularization loses | Model family | Concern |
|---|---|---|---|---|
| Protein interaction network | Variable size, no canonical ordering, neighborhood-defined | Connectivity, neighborhood, pathway-level structure | Graph neural network (GCN, GIN); message passing | Noisy/incomplete edges; oversmoothing with too many message-passing layers |
| Molecular graph | Atoms as nodes, bonds as edges, variable size | Atom adjacency, ring structure, functional groups, stereochemistry | GNN (message-passing); also: ECFP fingerprints + RF as a notoriously strong baseline | Scaffold split, NOT random split; activity cliffs (small structural change → large activity change); 3D conformer choice |
| Knowledge graphs (UMLS, drug-disease) | Heterogeneous nodes and edges, relation types | Path semantics, edge types, multi-hop reasoning | Heterogeneous GNN, knowledge-graph embeddings | Curation bias; missing edges treated as negatives ≠ true negatives |
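
The "scaffold split, NOT random split" note in the molecular graph row amounts to a group-aware split: every molecule sharing a scaffold goes to the same side. The sketch below assumes a scaffold key has already been computed for each molecule (for example a Bemis-Murcko scaffold from a cheminformatics toolkit). The `scaffold_split` helper and the toy records are hypothetical.

```python
from collections import defaultdict

def scaffold_split(records, test_fraction=0.2):
    """records: list of (molecule_id, scaffold_key). Whole scaffold groups stay together."""
    groups = defaultdict(list)
    for mol_id, scaffold in records:
        groups[scaffold].append(mol_id)
    # Largest groups first, so common scaffolds fill train and rare ones land in test.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g))
    n_train_target = len(records) - round(test_fraction * len(records))
    train, test = [], []
    for group in ordered:
        if len(train) + len(group) <= n_train_target:
            train.extend(group)
        else:
            test.extend(group)
    return train, test

# Toy data: scaffold A has 5 molecules, B has 3, C and D have 1 each.
records = [(f"mol{i}", "A") for i in range(5)] + [
    ("mol5", "B"), ("mol6", "B"), ("mol7", "B"), ("mol8", "C"), ("mol9", "D"),
]
scaffold_of = dict(records)
train, test = scaffold_split(records)
```

Because test scaffolds never appear in training, the test score measures generalization to new chemotypes, not recognition of near-duplicates.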
Molecular and biological data
| Modality | Structure | Naive tabularization loses | Model family | Concern |
|---|---|---|---|---|
| Protein sequence | Linear amino-acid string, evolutionary signal in alignments | Conservation, coevolution, long-range residue contacts | MSA-based methods (PSSM, AlphaFold-style); protein LM (ESM-2) for embeddings | Structure prediction confidence (pLDDT) is about fold, not function/binding |
| Predicted protein structure | 3D atomic coordinates | Conformational dynamics, induced fit, water/ion mediation | 3D / equivariant networks for binding tasks | High pLDDT ≠ validated docking hit; pocket flexibility often poorly modeled |
| DNA sequence / variants (GWAS) | Linear sequence, ~10⁶ SNPs per individual, population-stratified | Linkage disequilibrium, haplotype block structure, ancestry signal | GWAS (per-SNP regression with PC adjustment); polygenic risk scores | Population structure as the dominant confounder; PRS transferability across ancestries is poor (Martin et al. 2019) — deployment can widen disparities |
| Regulatory / epigenomic sequence | Long-context DNA + cell-type-specific signals | Cell-type context, chromatin accessibility, 3D genome organization | Long-context sequence models (Enformer, Evo); paired with ATAC/ChIP-seq | Sequence alone is insufficient — same variant has different effects in different tissues |
| Bulk RNA-seq | Sample × gene expression matrix, ~20K genes | Gene-gene correlation structure, pathway groupings | PCA/dim reduction + classical ML (your TCGA HW3 pattern); pathway scoring | Batch effects across sites; RSEM normalization assumptions; p ≫ n |
| scRNA-seq | Sparse high-dim count matrix, depth-confounded, dropout | Less a loss than a dominance problem: batch effects and sequencing depth dominate the signal without normalization | Normalize → PCA → UMAP/clustering; latent-variable models (scVI) | Clusters need biological validation, not trusted automatically; doublets and ambient RNA contamination |
| Mass spec / metabolomics | Spectra, peak intensities, retention times | Peak co-elution, isotope patterns, fragmentation structure | Domain-specific peak detection + downstream ML on identified features | Identification ambiguity; instrument-specific drift; reference library coverage |
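
The scRNA-seq row says sequencing depth dominates without normalization. The sketch below shows the usual first step, library-size normalization followed by `log1p`, on two hypothetical cells with identical composition but a 10x difference in depth. The target sum of 10,000 is a common convention assumed here, and every cell is assumed to have at least one count.

```python
import math

def normalize_counts(counts, target_sum=10_000):
    """Scale each cell's counts to target_sum, then apply log1p."""
    normalized = []
    for cell in counts:
        depth = sum(cell)  # total counts in this cell (assumed > 0)
        normalized.append([math.log1p(c / depth * target_sum) for c in cell])
    return normalized

# Rows are cells, columns are genes. Same proportions, 10x different depth.
counts = [
    [10, 30, 60],
    [100, 300, 600],
]
raw_distance = sum(abs(a - b) for a, b in zip(counts[0], counts[1]))
normalized = normalize_counts(counts)
norm_distance = sum(abs(a - b) for a, b in zip(normalized[0], normalized[1]))
```

On raw counts the two cells look far apart purely because of depth; after normalization they coincide. Batch effects, doublets, and ambient RNA need separate handling.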
Population-level / designed data
| Modality | Structure | Naive tabularization loses | Model family | Concern |
|---|---|---|---|---|
| Designed surveys (NHANES) | Probability sample, weighted, cross-sectional | Survey weights and design strata (treating as iid biases population estimates) | Survey-design-aware regression; classical epidemiology | Small samples for rare conditions; self-report errors; non-response bias |
| Cancer registries (SEER) | Population-based by geography, comprehensive ascertainment within catchment | Geographic and temporal trends; differential ascertainment across registries | Time-to-event models (Cox, KM); standardized rates | Only covers cancer; coding changes over time (e.g., ICD revisions) |
| Biobanks (UK Biobank, All of Us) | Deeply phenotyped volunteers; genetic + imaging + EHR linkage | Volunteer self-selection effects | Multi-modal models combining genetics, imaging, EHR | Healthy-volunteer bias — volunteers are wealthier and healthier than the population |
| Randomized trials (RCT) | Randomized treatment assignment, structured outcome ascertainment | Real-world heterogeneity outside eligibility criteria | Causal estimands (ATE, HR); intent-to-treat | Strong internal validity, weak external validity; eligibility excludes most real patients |
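
The designed-survey row notes that treating a weighted probability sample as iid biases population estimates. A small sketch with made-up numbers: an oversampled subgroup (small weights) has higher prevalence, so the unweighted mean overstates population prevalence. This covers only the point estimate; valid standard errors also need the design strata and sampling units, which are not modeled here.

```python
def weighted_prevalence(responses):
    """responses: list of (has_condition, survey_weight) pairs."""
    total_weight = sum(w for _, w in responses)
    return sum(w for y, w in responses if y) / total_weight

# Hypothetical sample of 100 respondents.
# Oversampled subgroup: 60 people, weight 100 each, 50% prevalence.
# Rest of population: 40 people, weight 1000 each, 10% prevalence.
responses = (
    [(1, 100)] * 30 + [(0, 100)] * 30 +
    [(1, 1000)] * 4 + [(0, 1000)] * 36
)

naive = sum(y for y, _ in responses) / len(responses)
weighted = weighted_prevalence(responses)
```

Here `naive` is 0.34 while the weighted estimate is about 0.15, because each respondent outside the oversampled subgroup stands in for ten times as many people.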
Multi-modal patient-linked data
| Modality | Structure | Naive tabularization loses | Model family | Concern |
|---|---|---|---|---|
| Multi-modal cancer (sequencing + expression + pathology + EHR) | Patient-linked across modalities, partial availability per patient | Inter-modal complementarity; the fact that absence of a modality is informative | Modality-specific encoders + late fusion (safer with limited data) or end-to-end multimodal (needs more data) | Confounding by indication (treated patients differ systematically); leakage across tiles/slides; missing modalities are non-random |
| Image + report pairs | Paired modalities sharing patient/study | Cross-modal alignment between text findings and image regions | CLIP-style contrastive learning (CheXzero, BiomedCLIP, CONCH) | Reports describe findings the radiologist saw, not ground truth; report templates create shortcuts |
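
Late fusion with partial modality availability, as in the multi-modal cancer row, can be sketched as a weighted average over whichever modality-specific predictions exist, plus availability flags so that non-random missingness stays visible downstream. The modality names, weights, and probabilities are hypothetical; in practice each probability would come from its own trained encoder.

```python
def late_fusion(modality_probs, weights):
    """Fuse per-modality probabilities, skipping modalities that are missing (None)."""
    available = {m: p for m, p in modality_probs.items() if p is not None}
    if not available:
        raise ValueError("no modality available for this patient")
    total = sum(weights[m] for m in available)  # renormalize over what exists
    fused = sum(weights[m] * p for m, p in available.items()) / total
    # Keep availability as explicit features: absence of a modality is informative.
    flags = {f"has_{m}": int(m in available) for m in modality_probs}
    return fused, flags

weights = {"pathology": 2.0, "expression": 1.0, "ehr": 1.0}
patient = {"pathology": 0.8, "expression": None, "ehr": 0.4}
fused, flags = late_fusion(patient, weights)
```

Each encoder can be trained and validated separately, which is why the table calls late fusion the safer choice with limited data.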
Quick Mnemonic
- Name the structure. Sequence? Image? Graph? Spatial? Event stream? Tabular?
- Name what tabularization loses. Order? Spatial locality? Connectivity? Time gaps?
- Match the inductive bias. CNN for spatial, GNN for graphs, sequence model for ordered data, MIL for weak slide-level labels, fingerprint+RF as the surprisingly strong molecular baseline.
- Name one concern from the data-generating process. Scanner artifacts, billing distortions, ancestry confounding, informative observation, batch effects, shortcut hazards, missingness-as-signal.
- Which naive default would fail here and why? (split / threshold / augmentation / metric / feature representation)
- What contextual shift threatens generalization? (site / time / population / device / prevalence)
Compact summary by data type
| Type | Structure to preserve | What tabularization destroys |
|---|---|---|
| Sequence | order + timing | temporal structure |
| Image/Spatial | locality + geometry | spatial relationships |
| Graph/Network | connectivity | neighborhood structure |
| Language/Text | context + semantics | meaning |
| High-dimensional biology | correlation/manifold structure | latent biological organization |
| Population/causal | sampling/assignment process | study design assumptions |
| Multimodal | relationships across modalities | cross-modal information |
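
The "Sequence" row can be checked directly. In the sketch below, two hypothetical patients have the same events in a different order: a per-code count table cannot tell them apart, while even a minimal order-aware representation (counts of consecutive pairs) can. Event names are invented for illustration.

```python
from collections import Counter

# Same events, different order (hypothetical event codes).
patient_a = ["chest_pain", "troponin_high", "cath_lab", "discharge"]
patient_b = ["discharge", "chest_pain", "troponin_high", "cath_lab"]

def tabularize(events):
    """Naive tabularization: one count column per event code."""
    return Counter(events)

def bigrams(events):
    """Order-aware alternative: counts of consecutive event pairs."""
    return Counter(zip(events, events[1:]))

same_table = tabularize(patient_a) == tabularize(patient_b)
same_bigrams = bigrams(patient_a) == bigrams(patient_b)
```

`same_table` is true and `same_bigrams` is false: the count table erases the difference between a discharge after the workup and a readmission after a discharge.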