Modality → Representation Map


Clinical record data

| Modality | Structure | Naive tabularization loses | Model family | Concern |
| --- | --- | --- | --- | --- |
| EHR event stream | Irregular timing, observation density informative, multi-table (vitals, labs, meds, dx, encounters) | Order, time gaps, recency, trends; the diagnostic signal of how often something gets measured | Event-stream transformer, RNN, or carefully engineered windows with explicit time gaps | Label leakage (post-event features); prediction-time discipline; ACES framework for cohort definition |
| EHR snapshot (tabularized) | One row per patient at a fixed prediction time | Repeated measurements, longitudinal trends, recency of last measurement | Logistic regression, gradient boosting, random forest | Snapshot timing must be defined precisely; sicker patients get measured more, so missingness is informative |
| Chunked EHR | One row per fixed time window per patient | Sub-window dynamics; variable event timing within the window | Tabular models per chunk; sequence model over chunks | Window choice is a modeling assumption; events near window boundaries are assigned to a chunk arbitrarily |
| Claims sequence | Longitudinal billing records, irregular, coverage-gated | Care trajectories, recurrence, coverage gaps, timing of re-entries | Temporal coded-event sequence; engineered windows over claims | Billing ≠ clinical truth; insurance discontinuity affects features and labels |
| Clinical notes (free text) | Sequential tokens, semantic meaning context-dependent | Negation, assertion, temporality, the difference between “denies chest pain” and “reports chest pain” | Clinical BERT / BioBERT / GatorTron; LLM with RAG for QA | Templates and copy-paste create predictive shortcuts; abbreviations are ambiguous (MS can mean multiple sclerosis, mitral stenosis, or morphine sulfate, among others); stigmatizing language reflects bias |
| Biomedical literature / reports | Long-form scientific text; structured sections | Section structure; citation network; figure-text alignment | Domain-pretrained LMs (PubMedBERT); RAG | Hallucination in generative models; out-of-date knowledge after cutoff |
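
The "explicit time gaps" and "prediction-time discipline" entries for EHR data can be illustrated with a minimal sketch. It assumes a single lab stream stored as `(timestamp, value)` tuples; the function name `snapshot_features` and the feature names are hypothetical, not taken from any particular framework.

```python
from datetime import datetime

def snapshot_features(events, prediction_time):
    """Summarize one patient's lab events as of prediction_time.

    events: list of (timestamp, value) tuples, in any order.
    Only events strictly before prediction_time are used, so nothing
    recorded after the prediction moment can leak into the features.
    """
    past = sorted((t, v) for t, v in events if t < prediction_time)
    if not past:
        # No measurement at all is itself informative; report it explicitly.
        return {"n_obs": 0, "last_value": None,
                "hours_since_last": None, "mean_gap_hours": None}
    times = [t for t, _ in past]
    # Gaps between consecutive measurements, in hours.
    gaps = [(b - a).total_seconds() / 3600 for a, b in zip(times, times[1:])]
    return {
        "n_obs": len(past),  # observation density
        "last_value": past[-1][1],
        "hours_since_last": (prediction_time - times[-1]).total_seconds() / 3600,
        "mean_gap_hours": sum(gaps) / len(gaps) if gaps else None,
    }

# Hypothetical events for one patient.
events = [
    (datetime(2024, 1, 1, 8), 1.0),
    (datetime(2024, 1, 1, 20), 1.4),
    (datetime(2024, 1, 2, 2), 2.1),
    (datetime(2024, 1, 3, 9), 3.0),  # after prediction time: must be ignored
]
feats = snapshot_features(events, prediction_time=datetime(2024, 1, 2, 8))
print(feats)
```

Here the fourth event falls after the prediction time and is dropped, so the snapshot reports three observations, a 6-hour recency, and a 9-hour mean gap. A plain "latest value" column would keep only `last_value` and lose the rest.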

Imaging

| Modality | Structure | Naive tabularization loses | Model family | Concern |
| --- | --- | --- | --- | --- |
| Chest X-ray (2D projection) | 2D image, anatomy superposed (3D collapsed to 2D) | Spatial locality, edges, texture, anatomic context | CNN (ResNet-style), pretrained on ImageNet then fine-tuned; vision transformer | Augmentation validity: a left-right flip is wrong (heart is on the left); DICOM headers are shortcut hazards; multi-site brittleness |
| CT / MRI (volumetric) | 3D voxel grid; for MRI, multiple contrast modes per study | Volumetric continuity across slices; cross-modal information | 3D CNN, U-Net for segmentation, transformer for long-range structure | Slice thickness varies by protocol; scanner/site artifacts; registration across modes |
| Whole-slide pathology | Gigapixel spatial; slide-level label, tile-level unknown | Spatial locality, tissue morphology, large-scale architecture | Tile-level CNN/ViT + multiple-instance learning (max / mean / attention pooling); transfer learning | Scanner artifacts; weak label-region alignment; patient-level split (multiple slides per patient) |
| Dermatology / fundus / ultrasound | 2D natural-image-like medical photo | Local morphology, color, lesion-vs-background | CNN with augmentation; transfer learning from ImageNet or domain FM | Camera/device variation; skin-tone distribution shift; lesion prevalence varies dramatically by clinic |
| Echocardiography / video imaging | Spatiotemporal; 2D + time | Motion, periodicity, function-from-motion | 3D CNN, video transformer, temporal pooling over 2D-CNN frames | Operator variability; view selection; gating to cardiac cycle |
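
The patient-level split noted for whole-slide pathology can be sketched as follows. This is one possible approach, assuming each image carries a patient identifier; `assign_split`, the IDs, and the 20% test fraction are all hypothetical.

```python
import hashlib

def assign_split(patient_id, test_fraction=0.2):
    """Deterministically assign a patient to 'train' or 'test'.

    Hashing the patient ID (not the slide or image ID) keeps every
    image from one patient on the same side of the split.
    """
    digest = hashlib.sha256(patient_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # value in [0, 1)
    return "test" if bucket < test_fraction else "train"

# Hypothetical (slide_id, patient_id) pairs; several slides share a patient.
slides = [("s1", "p01"), ("s2", "p01"), ("s3", "p02"),
          ("s4", "p03"), ("s5", "p03"), ("s6", "p03")]

split = {"train": [], "test": []}
for slide_id, patient_id in slides:
    split[assign_split(patient_id)].append((slide_id, patient_id))

train_patients = {p for _, p in split["train"]}
test_patients = {p for _, p in split["test"]}
print(train_patients & test_patients)  # empty set: no patient on both sides
```

Because the assignment depends only on the patient ID, adding new slides for an existing patient never moves that patient across the split. A random split over slides would offer no such guarantee.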

Time series and signals

| Modality | Structure | Naive tabularization loses | Model family | Concern |
| --- | --- | --- | --- | --- |
| ECG / waveform | Continuous, regular, multi-channel (12-lead) | Channel meaning (different leads = different spatial vantages), waveform morphology, rhythm | 1D CNN, transformer on raw signal, hybrid frequency-domain features | Lead placement variability; informative gaps from device disconnection |
| Wearable / home monitoring | Longitudinal, irregular, partly patient-generated | Adherence patterns, time-of-day effects, missingness as signal | Sequence model over engineered daily features; mixed-effects for individual baselines | Patients measure more when they feel unwell; missing data may mean stable OR worsening OR device failure OR hospitalization |
| ICU bedside monitor streams | High-frequency multi-channel, near-continuous | Trends over minutes/hours, alarm context, cross-channel coupling | RNN/transformer on summarized windows; engineered SOFA-like scores | Alarms create artifacts; sensor dropout; informative observation (sicker = more sensors) |
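
To make "summarized windows" concrete, here is a small sketch of a per-window summary that keeps the trend and the observation count alongside the mean. The function name `summarize_window` and the sample values are hypothetical; a real pipeline would also need artifact and alarm handling, which is omitted here.

```python
def summarize_window(samples):
    """Summarize (minute, value) samples from one window of one channel.

    Returns the mean, the least-squares slope per minute, and the number
    of observations, so trend and observation density survive the summary.
    """
    n = len(samples)
    if n == 0:
        return {"n_obs": 0, "mean": None, "slope": None}
    ts = [t for t, _ in samples]
    vs = [v for _, v in samples]
    t_mean = sum(ts) / n
    v_mean = sum(vs) / n
    denom = sum((t - t_mean) ** 2 for t in ts)
    slope = None  # undefined when all samples share one timestamp
    if denom > 0:
        slope = sum((t - t_mean) * (v - v_mean) for t, v in samples) / denom
    return {"n_obs": n, "mean": v_mean, "slope": slope}

# Two hypothetical heart-rate windows with the same mean but opposite trends.
rising = [(0, 80.0), (10, 90.0), (20, 100.0)]
falling = [(0, 100.0), (10, 90.0), (20, 80.0)]
print(summarize_window(rising))
print(summarize_window(falling))
```

Both windows have a mean of 90, so a mean-only feature cannot tell them apart; the slope (+1 versus -1 per minute) carries the trend that tabularization would otherwise drop.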

Network and graph data

| Modality | Structure | Naive tabularization loses | Model family | Concern |
| --- | --- | --- | --- | --- |
| Protein interaction network | Variable size, no canonical ordering, neighborhood-defined | Connectivity, neighborhood, pathway-level structure | Graph neural network (GCN, GIN); message passing | Noisy/incomplete edges; oversmoothing with too many message-passing layers |
| Molecular graph | Atoms as nodes, bonds as edges, variable size | Atom adjacency, ring structure, functional groups, stereochemistry | GNN (message-passing); also: ECFP fingerprints + RF as a notoriously strong baseline | Scaffold split, NOT random split; activity cliffs (small structural change → large activity change); 3D conformer choice |
| Knowledge graphs (UMLS, drug-disease) | Heterogeneous nodes and edges, relation types | Path semantics, edge types, multi-hop reasoning | Heterogeneous GNN, knowledge-graph embeddings | Curation bias; missing edges treated as negatives ≠ true negatives |
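
The oversmoothing concern can be seen in a toy example. The sketch below uses plain mean aggregation with self-loops and no learned weights or nonlinearity, which is a deliberate simplification of a GCN-style layer; the graph and feature values are hypothetical.

```python
def mean_aggregate(features, neighbors):
    """One round of mean-aggregation message passing with self-loops."""
    updated = {}
    for node, nbrs in neighbors.items():
        group = [node] + nbrs
        updated[node] = sum(features[n] for n in group) / len(group)
    return updated

def spread(features):
    """Range of node features; it shrinks as nodes become indistinguishable."""
    return max(features.values()) - min(features.values())

# Hypothetical 5-node path graph a-b-c-d-e with one scalar feature per node.
neighbors = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"],
             "d": ["c", "e"], "e": ["d"]}
features = {"a": 0.0, "b": 0.0, "c": 0.0, "d": 0.0, "e": 1.0}

spreads = [spread(features)]
for _ in range(10):
    features = mean_aggregate(features, neighbors)
    spreads.append(spread(features))
print([round(s, 3) for s in spreads])
```

Each round replaces a node's value with an average over its neighborhood, so the spread can never grow. After enough rounds every node carries nearly the same value, which is the failure mode the table warns about when too many layers are stacked.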

Molecular and biological data

| Modality | Structure | Naive tabularization loses | Model family | Concern |
| --- | --- | --- | --- | --- |
| Protein sequence | Linear amino-acid string, evolutionary signal in alignments | Conservation, coevolution, long-range residue contacts | MSA-based methods (PSSM, AlphaFold-style); protein LM (ESM-2) for embeddings | Structure prediction confidence (pLDDT) is about fold, not function/binding |
| Predicted protein structure | 3D atomic coordinates | Conformational dynamics, induced fit, water/ion mediation | 3D / equivariant networks for binding tasks | High pLDDT ≠ validated docking hit; pocket flexibility often poorly modeled |
| DNA sequence / variants (GWAS) | Linear sequence, ~10⁶ SNPs per individual, population-stratified | Linkage disequilibrium, haplotype block structure, ancestry signal | GWAS (per-SNP regression with PC adjustment); polygenic risk scores | Population structure as the dominant confounder; PRS transferability across ancestries is poor (Martin et al. 2019), so deployment can widen disparities |
| Regulatory / epigenomic sequence | Long-context DNA + cell-type-specific signals | Cell-type context, chromatin accessibility, 3D genome organization | Long-context sequence models (Enformer, Evo); paired with ATAC/ChIP-seq | Sequence alone is insufficient: the same variant has different effects in different tissues |
| Bulk RNA-seq | Sample × gene expression matrix, ~20K genes | Gene-gene correlation structure, pathway groupings | PCA/dim reduction + classical ML (your TCGA HW3 pattern); pathway scoring | Batch effects across sites; RSEM normalization assumptions; p ≫ n |
| scRNA-seq | Sparse high-dim count matrix, depth-confounded, dropout | (Less about loss, more about dominance:) batch effects and depth dominate without normalization | Normalize → PCA → UMAP/clustering; latent-variable models (scVI) | Clusters need biological validation, not trusted automatically; doublets and ambient RNA contamination |
| Mass spec / metabolomics | Spectra, peak intensities, retention times | Peak co-elution, isotope patterns, fragmentation structure | Domain-specific peak detection + downstream ML on identified features | Identification ambiguity; instrument-specific drift; reference library coverage |
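
The "Normalize" step in the scRNA-seq row addresses depth confounding. A minimal sketch of library-size normalization followed by `log1p` is shown below; the function name `normalize_cell`, the toy counts, and the target sum of 10,000 are assumptions chosen for illustration, and the sketch does nothing about batch effects.

```python
import math

def normalize_cell(counts, target_sum=10_000):
    """Scale one cell's gene counts to target_sum, then apply log1p."""
    total = sum(counts)
    if total == 0:
        return [0.0 for _ in counts]
    return [math.log1p(c * target_sum / total) for c in counts]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Two hypothetical cells with identical composition but 10x different depth.
shallow = [2, 6, 0, 12]
deep = [20, 60, 0, 120]

raw_distance = euclidean(shallow, deep)
norm_distance = euclidean(normalize_cell(shallow), normalize_cell(deep))
print(raw_distance, norm_distance)
```

On raw counts the two cells look far apart purely because one was sequenced more deeply. After normalization their profiles coincide, so downstream PCA or clustering is driven by composition instead of depth.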

Population-level / designed data

| Modality | Structure | Naive tabularization loses | Model family | Concern |
| --- | --- | --- | --- | --- |
| Designed surveys (NHANES) | Probability sample, weighted, cross-sectional | Survey weights and design strata (treating as iid biases population estimates) | Survey-design-aware regression; classical epidemiology | Small samples for rare conditions; self-report errors; non-response bias |
| Cancer registries (SEER) | Population-based by geography, comprehensive ascertainment within catchment | Geographic and temporal trends; differential ascertainment across registries | Time-to-event models (Cox, KM); standardized rates | Only covers cancer; coding changes over time (e.g., ICD revisions) |
| Biobanks (UK Biobank, All of Us) | Deeply phenotyped volunteers; genetic + imaging + EHR linkage | Volunteer self-selection effects | Multi-modal models combining genetics, imaging, EHR | Healthy-volunteer bias: volunteers tend to be wealthier and healthier than the general population |
| Randomized trials (RCT) | Randomized treatment assignment, structured outcome ascertainment | Real-world heterogeneity outside eligibility criteria | Causal estimands (ATE, HR); intent-to-treat | Strong internal validity, weak external validity; eligibility criteria can exclude many real-world patients |
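
The note that treating a weighted survey as iid biases population estimates can be shown with a small worked example. The records and weights below are hypothetical, and the sketch covers only the point estimate; design-based variance (strata, sampling units) is out of scope.

```python
def weighted_prevalence(records):
    """Design-weighted prevalence: sum(w * y) / sum(w)."""
    total_weight = sum(w for _, w in records)
    return sum(y * w for y, w in records) / total_weight

def naive_prevalence(records):
    """Unweighted prevalence, as if every respondent were an iid draw."""
    return sum(y for y, _ in records) / len(records)

# Hypothetical sample of (has_condition, survey_weight) pairs.
# Group A is oversampled: 6 respondents, each representing 100 people.
# Group B: 4 respondents, each representing 600 people.
records = [(1, 100)] * 3 + [(0, 100)] * 3 + [(1, 600)] * 1 + [(0, 600)] * 3

naive = naive_prevalence(records)
weighted = weighted_prevalence(records)
print(naive, weighted)
```

The naive estimate is 0.4 because the oversampled high-prevalence group dominates the sample. Applying the weights gives 0.3, the estimate for the population the survey was designed to represent.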

Multi-modal patient-linked data

| Modality | Structure | Naive tabularization loses | Model family | Concern |
| --- | --- | --- | --- | --- |
| Multi-modal cancer (sequencing + expression + pathology + EHR) | Patient-linked across modalities, partial availability per patient | Inter-modal complementarity; the fact that absence of a modality is informative | Modality-specific encoders + late fusion (safer with limited data) or end-to-end multimodal (needs more data) | Confounding by indication (treated patients differ systematically); leakage across tiles/slides; missing modalities are non-random |
| Image + report pairs | Paired modalities sharing patient/study | Cross-modal alignment between text findings and image regions | CLIP-style contrastive learning (CheXzero, BiomedCLIP, CONCH) | Reports describe findings the radiologist saw, not ground truth; report templates create shortcuts |
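
Late fusion with partially available modalities might look like the sketch below. It assumes each modality-specific encoder has already produced a logit for the patient; the unweighted average, the function name `late_fusion`, and the example values are illustrative assumptions, and a real system might learn the fusion weights instead.

```python
import math

def late_fusion(modality_logits):
    """Average the logits of the modalities that are present.

    modality_logits: dict of modality name -> logit, or None if missing.
    Returns the fused probability and a 0/1 presence mask that a downstream
    model could use, since which modalities are missing may be informative.
    """
    present = {m: z for m, z in modality_logits.items() if z is not None}
    mask = {m: int(z is not None) for m, z in modality_logits.items()}
    if not present:
        return None, mask  # nothing to fuse for this patient
    fused_logit = sum(present.values()) / len(present)
    prob = 1.0 / (1.0 + math.exp(-fused_logit))
    return prob, mask

# Hypothetical patient with no expression data available.
patient = {"pathology": 1.2, "expression": None, "ehr": -0.4}
prob, mask = late_fusion(patient)
print(round(prob, 3), mask)
```

Returning the presence mask alongside the probability keeps the missingness pattern visible, which matters because missing modalities are non-random.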

Quick Mnemonic

  1. Name the structure. Sequence? Image? Graph? Spatial? Event stream? Tabular?
  2. Name what tabularization loses. Order? Spatial locality? Connectivity? Time gaps?
  3. Match the inductive bias. CNN for spatial, GNN for graphs, sequence model for ordered data, MIL for weak slide-level labels, fingerprint+RF as the surprisingly-strong molecular baseline.
  4. Name one concern from the data-generating process. Scanner artifacts, billing distortions, ancestry confounding, informative observation, batch effects, shortcut hazards, missingness-as-signal.
  5. Which naive default would fail here and why? (split / threshold / augmentation / metric / feature representation)
  6. What contextual shift threatens generalization? (site / time / population / device / prevalence)

Here’s another, more compact view by data type:

| Type | Core thing preserved | What tabularization destroys |
| --- | --- | --- |
| Sequence | order + timing | temporal structure |
| Image/Spatial | locality + geometry | spatial relationships |
| Graph/Network | connectivity | neighborhood structure |
| Language/Text | context + semantics | meaning |
| High-dimensional biology | correlation/manifold structure | latent biological organization |
| Population/causal | sampling/assignment process | study design assumptions |
| Multimodal | relationships across modalities | cross-modal information |