
Random Notes

Final Cheatsheet FOR REALSIES

$$
\begin{aligned}
\frac{d}{dx} e^x &= e^x & \frac{d}{dx} e^{ax} &= a e^{ax} \\
\frac{d}{dx} e^u &= e^u \frac{du}{dx} & \frac{d}{dx} e^{au} &= a e^{au} \frac{du}{dx} \\
\frac{d}{dx} a^u &= a^u (\ln a)\frac{du}{dx} & \frac{d}{dx} \ln u &= \frac{1}{u}\frac{du}{dx} \\
\frac{d}{dx} \log_a u &= \frac{1}{u \ln a}\frac{du}{dx}
\end{aligned}
$$

Q3

• Threshold choice depends on workflow, not only prevalence or AUROC.
• False positives and false negatives produce different real-world consequences.
• False positives create alarms, interruptions, clinician burden, and possible alarm fatigue.
• False negatives may delay recognition of imminent death or deterioration.
• A ranking metric does not determine the operating point.
• New workflows may require recalibration and new thresholds.

what happens after an alert: how quickly nurses and physicians can respond, whether the alert triggers specific interventions, how many alerts per shift are tolerable, whether alerts interrupt other safety-critical tasks, and whether responding to the alert actually improves outcomes

performance at candidate thresholds, calibration, subgroup performance, and prospective or workflow-simulated estimates of alert burden

FP:

  • interrupt the bedside nurse, page the attending physician, disturb the patient, and contribute to alarm fatigue
  • clinicians may start ignoring alerts

FN:

  • delay bedside assessment, escalation, or intervention. The cost is not just a missed label; it is a missed opportunity to act in a narrow time window

There is no universally best threshold: the relevant tradeoff depends on the clinical action, staffing, alarm capacity, and the expected benefit of earlier intervention

AUROC:

  • Measures ranking quality only; it does not tell you where to place the threshold.

Q4

• A binary label can collapse true negatives, censored cases, deaths, and other competing events into one category.

  • Binary label is only valid if follow-up is complete!
    • Censoring means the event status is unknown, not negative.
    • Death can be a competing risk because it changes whether the target event can occur.
    • Survival or time-to-event methods use partial follow-up more appropriately.
    • Evaluation should consider time, censoring, calibration, and ranking over time.

separate the question “has the event happened yet?” from “how long was the patient actually observed?”

SA (survival analysis) eval: time-dependent AUROC, IPCW Brier score, survival curves by predicted-risk group
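For the survival-curves-by-risk-group part, a minimal sketch using lifelines on synthetic data; the column names (`time`, `event`, `pred_risk`) are illustrative placeholders, not from the notes:

```python
# Kaplan-Meier curves stratified by predicted-risk tertile (synthetic data).
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from lifelines import KaplanMeierFitter

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "time": rng.exponential(365, 300),       # follow-up time in days
    "event": rng.integers(0, 2, 300),        # 1 = event observed, 0 = censored
    "pred_risk": rng.uniform(0, 1, 300),     # model's predicted risk
})
df["risk_group"] = pd.qcut(df["pred_risk"], q=3, labels=["low", "mid", "high"])

ax = plt.gca()
for name, grp in df.groupby("risk_group", observed=True):
    kmf = KaplanMeierFitter()
    kmf.fit(durations=grp["time"], event_observed=grp["event"], label=str(name))
    kmf.plot_survival_function(ax=ax)        # censoring handled properly
ax.set_xlabel("Days since prediction")
ax.set_ylabel("Event-free probability")
plt.show()
```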

Q5

| Modality | Structure | Naive tabularization loses | Repr. or model family | Concern |
| --- | --- | --- | --- | --- |
| EHR event stream | Irregular event stream of labs, meds, diags, vitals, encounters; observation density informative | Order, time gaps, recency, trends, the fact that sicker = measured more | Event-stream transformer, RNN, or engineered windows | Prediction-time discipline to prevent label leakage (post-event features) |
| Whole-slide pathology | Ultra high-res spatial; slide-level label, tile-level unknown; local tissue morphology, larger-scale arch. | Spatial locality, tissue morphology, large-scale architecture | Tile-level CNN/vision transformer + multiple-instance learning; transfer learning | Scanner artifacts; weak label-region alignment; patient-level split |
| Claims sequence | Longitudinal billing records (diagnosis, procedure, refills), irregular timing | Care trajectories, recurrence, coverage gaps, timing | Temporal coded-event sequence; engineered features over clinically meaningful time windows | Billing ≠ clinical truth; insurance discontinuity affects features and labels |
| Protein interaction network | Graph-structured, variable size, no canonical ordering, neighborhood-defined | Connectivity, neighborhood, pathway-level structure | Graph neural network; message passing | Noisy or incomplete edges; oversmoothing with too many message-passing layers |
| scRNA-seq | Sparse high-dim gene expr count matrix, depth-confounded, dropout | Batch effects and seq depth dominate without normalization | Normalize → PCA dim red → UMAP/clustering; latent-variable models | Clusters need biological validation (marker genes, ext annots), not trusted automatically |
| Type | Core thing preserved | What tabularization destroys |
| --- | --- | --- |
| Sequence | order + timing | temporal structure |
| Image/Spatial | locality + geometry | spatial relationships |
| Graph/Network | connectivity | neighborhood structure |
| Language/Text | context + semantics | meaning |
| High-dimensional biology | correlation/manifold structure | latent biological organization |
| Population/causal | sampling/assignment process | study design assumptions |
| Multimodal | relationships across modalities | cross-modal information |

• Clinical notes (text): templates and copy-paste are the load-bearing concern; clinical BERT or LLMs for representation; abbreviation/negation issues; site-specific note dialect.
• X-ray / CT / MRI (volumetric/projection imaging): CNN or U-Net; augmentation validity matters (vertical flip is wrong for X-rays — anatomy isn’t symmetric); DICOM headers can be shortcut hazards.
• ECG / waveform time series: 1D CNN or transformer on the raw signal; channel meaning (different leads = different spatial vantages); informative gaps.
• Molecules: SMILES strings, atom-bond graphs, ECFP fingerprints + RF as a notoriously strong baseline; scaffold split, not random split.
• DNA sequence / variants: GWAS, polygenic risk scores, population structure as the dominant confounder, ancestry transferability problem.

Q6

• Lookup tables fail because long contexts are combinatorially sparse and do not generalize.
• Next-token prediction models a sequence distribution and can cast many tasks as completion.
• Clinical foundation models aim for reusable representations or task adaptation, not one fixed label.
• A tabular snapshot can lose temporal order, trends, recency, time gaps, and observation process.
• Broad usefulness should be shown by transfer across tasks/settings; shortcut learning is suggested by brittle site- or workflow-specific performance.

Lookup tables

  • Exact long contexts never repeat
  • Cannot generalize to a new context, even one that is perfectly understandable

Next-token prediction

  • Language tasks written as completion problems. Chain rule of Probability.
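The factorization behind next-token prediction, via the chain rule of probability:

$$p(x_1, \dots, x_T) = \prod_{t=1}^{T} p(x_t \mid x_1, \dots, x_{t-1})$$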

FMs

  • try to learn the reusable structure of the Input Domain to support many downstream tasks (input → task distribution)
  • Eval: bad sign if performance collapses under site/time splits or under removal of workflow artifacts (e.g. note templates); good sign if the pretrained representation improves performance across many clinically different downstream tasks.

Q7

Bio

SNP:

  • Genome-wide signif == statistical evidence of assoc != strong individual-level pred
  • Can have a tiny effect size yet a very small p-value in a large study
  • Ask about effect size, allele freq., ancestry, replication, LD structure

Protein:

  • Docking Score != Affinity in Real Life™
  • You need to validate experimentally in the lab!!
  • AlphaFold is about structure that’s all: not ligand binding, conformational dynamics, docking score accuracy.
  • Ask if the binding pocket is confidently modeled, whether alternative conformations matter, and whether wet-lab assays confirm hits.

DNA:

  • Expression/Regulation depend on the cell environment/context! Cell type, chromatin accessibility, transcription factors, dev stage, etc.
  • Ask if validated in tissue or cell state. Do we need more data on expression, chromatin, or eQTL?

General framework

  • Is the prediction task defined? What is the prediction horizon? Over what/whom?
  • What is the data modality? Inductive biases?
  • Which model?
  • How are we splitting? Patient/Time/Site? Leakage?
  • How are we evaluating?
    • Discrimination/calibration/threshold - clinical utility/subgroup perf.
  • Can we tabularize and start with baseline models?
  • Can we conclude any causality (almost never)?
  • How are we interpreting the model? Validating biologically/clinically?
  • How are we connecting model to workflow/goal? Harm?

Stuff on Cheatsheet

  • Trig identities
  • Variances
  • EVs
  • Common derivatives/integrals
  • RELU and other curves (see lectures!)
  • Common curves $e^x$, $\log(x)$, etc.
  • Var(X) = E[X²] − (E[X])², Cov(X, Y), sample estimates
  • Distributions - PDF, CDF, E[X], Var[X]
  • PPV, TPR, FPR, Accuracy, Precision, Recall, F1
  • Outcome, Event Space mappings — IMPORTANT!
  • Entropy, KL divergence, information content, mutual information, chain rule, bayes rule, joint, marginal, conditional Probabilities
  • Minimizing Cross Entropy <—> Maximizing Likelihood
  • $\hat{w}$ expression, regularization equations
  • Stride, padding, etc equation.
  • Activation Functions, graphs, derivatives

Recalibration: Platt scaling, isotonic regression. Plot empirical vs. predicted probability (calibration curve).
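A minimal sketch of both recalibration methods and the calibration plot with scikit-learn; the data and random-forest base model are just illustrative:

```python
# Platt scaling ("sigmoid") or isotonic recalibration, plus a reliability curve.
import matplotlib.pyplot as plt
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=4000, n_features=20, weights=[0.9], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# method="sigmoid" = Platt scaling; method="isotonic" = isotonic regression.
model = CalibratedClassifierCV(
    RandomForestClassifier(n_estimators=200, random_state=0),
    method="sigmoid", cv=5,
).fit(X_train, y_train)

p = model.predict_proba(X_test)[:, 1]
frac_pos, mean_pred = calibration_curve(y_test, p, n_bins=10)

plt.plot(mean_pred, frac_pos, marker="o", label="recalibrated model")
plt.plot([0, 1], [0, 1], "--", label="perfect calibration")
plt.xlabel("Mean predicted probability")
plt.ylabel("Empirical frequency")
plt.legend()
plt.show()
```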

Start with baseline, tabularization — go to complex models only if necessary
Examine calibration, esp. in subgroups (non-optional)
Does the threshold match clinical need?
Is further calibration required? Is the model stable over time? Demographic shifts? at the same site?

Bayes optimal classifier formula: $r = \frac{C_{FN}}{C_{FP}}$ and $\tau^* = \frac{1}{1 + r}$. How much worse is a FN than a FP? In medicine, FNs are usually worse.
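A tiny worked example (the cost numbers are assumptions for illustration only):

```python
# Turn an assumed FN:FP cost ratio into an operating threshold and apply it.
import numpy as np

c_fn, c_fp = 9.0, 1.0            # assume a missed case is 9x worse than a false alarm
r = c_fn / c_fp                  # r = C_FN / C_FP
tau = 1.0 / (1.0 + r)            # tau* = 1 / (1 + r) = 0.1 here

p = np.array([0.03, 0.12, 0.40, 0.08])   # calibrated predicted risks
print(tau, p >= tau)                      # 0.1 [False  True  True False]
```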

Net Benefit: “Using this model is equivalent to getting 10 true-positive interventions per 100 patients with no unnecessary interventions.” You plot it across a range of thresholds.
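A sketch of how that curve is computed, using the standard decision-curve net benefit formula NB(t) = TP/N − FP/N · t/(1−t); the predictions and outcomes are synthetic, not from the notes:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 1000)                              # true outcomes
p = np.clip(0.3 * y + rng.uniform(0, 0.7, 1000), 0, 1)    # toy risk scores

n = len(y)
for t in np.arange(0.05, 0.55, 0.05):
    pred = p >= t
    tp = np.sum(pred & (y == 1))
    fp = np.sum(pred & (y == 0))
    nb = tp / n - fp / n * (t / (1 - t))
    print(f"threshold {t:.2f}: net benefit {nb:+.3f}")    # compare to treat-all / treat-none
```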

It's not about simple vs. complex, and not about which functions your model class can represent; it's about which functions gradient descent tends to discover early and easily during optimization.

TODO

  • BIAS AND VARIANCE
    • Practical consequences (real talk)
      • Linear regression → higher bias, lower variance — How?
      • Deep neural nets → low bias, high variance (unless regularized) — How?
      • Random Forest → reduces variance via averaging — Review HW2
      • XGBoost → balances bias and variance through boosting — how?
  • k-fold CV
  • Probability from CI textbook.
  • What happens when the Probability doesn’t add up to one?
  • Interpretation of Precision/Recall curves — Do weird shit with this. What do they mean?
  • Why is orthogonality a nice thing in ML?

Training Binary Classifiers

Loss → Model → Control Bias (Regularization, etc)

There are some rare cases where you are fitting to the data provided and not the generative process. Non-parametric modeling.

Target Use-Case determines the Target Generalizability Unit.

Loss: You can have all sorts of loss. E.g. MSE loss for Regression, log-likelihood, etc.

So do you model $y, x$ jointly or $y \mid x$?

Note that just guessing is the simplest possible ‘prediction’ and not always a stupid choice!

You are saying something about the generative process when you pick your model. Think of data that clearly calls for clustering, but you use a linear decision boundary: is the generative process really linear?

In KNN

  • You can have a kernel function that weights nearer neighbors more.
  • Really you’re not really ‘learning’ anything about the distribution.
  • The geometry matters! Think of what would happen if you smushed the y-axis.
  • Gradient descent doesn’t work since you don’t really get anything for small changes in radius $r$: $\mathbb{L}(r)$ is almost everywhere constant (it only jumps when a point enters or leaves the neighborhood).

The “right” thing you want to learn in this binary case is $p_{y \mid X = x}$.

kNNs are Awesome Initially

As a sanity check baseline. It shows

  • Bias/Variance tradeoff
  • The Curse of Dimensionality
  • Effect of Domain Shift (your neighbors from Training aren’t the same anymore)
  • Converges to the irreducible Bayes error in theory (can be proved); 1-NN achieves at most twice the Bayes error (Cover & Hart, 1967)
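A minimal baseline sketch on synthetic data with scikit-learn; scaling first matters because kNN is pure geometry:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=2000, n_features=15, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Standardize first: "smushing" an axis changes who your neighbors are.
baseline = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=15))
print(baseline.fit(X_tr, y_tr).score(X_te, y_te))
```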

Complexity and Bias and Variance and Error

  • Simple models are “consistently wrong.” → High Bias, Low Variance on new data.
  • Complex models are “inconsistently right.” → Low Bias, High Variance on new data.

Scaffold Splits

Scaffold Split: Molecules are grouped based on their Bemis-Murcko scaffolds. This strategy ensures that all molecules within the same cluster are assigned to either the training set or the test set, never split across both.


Quick example: You want your model to predict an animal’s shape from a skeleton. You have 1,000 samples representing 100 animals. You don’t want a crow skeleton to show up in both train and test. So you put “mammal” and “avian” groups in train and evaluate on a test set containing “reptile”. The goal is for your model to predict well on skeleton types it hasn’t seen. Fin.
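A minimal sketch of grouping molecules by scaffold, assuming RDKit's MurckoScaffold helpers; the SMILES strings are just illustrative:

```python
from collections import defaultdict
from rdkit.Chem.Scaffolds import MurckoScaffold

smiles = ["c1ccccc1CCN", "c1ccccc1CCO", "C1CCNCC1C(=O)O", "CCOC(=O)C1CCNCC1"]

groups = defaultdict(list)
for s in smiles:
    scaffold = MurckoScaffold.MurckoScaffoldSmiles(smiles=s)   # Bemis-Murcko scaffold
    groups[scaffold].append(s)

# Whole scaffold groups go to train or test, never both.
scaffolds = sorted(groups, key=lambda k: len(groups[k]), reverse=True)
train = [m for k in scaffolds[:-1] for m in groups[k]]
test = list(groups[scaffolds[-1]])
print(dict(groups), len(train), len(test))
```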

Bayes Errors et al

The Bayes Error is the irreducible, unavoidable error in modeling. Exists because you cannot capture everything about the Real World™ which is messy (#JobSecurity). So in ML,

$$\text{Error} = \text{Variance} + \text{Bias}^2 + \text{Irreducible/Bayes Error}$$

Bayes-optimal classifier is something that achieves this minimum level of error. Bayes consistency means that the modeling eventually reaches this minimum.

It’s about learning Structure

Every time you “represent” something in ML you are picking a basis.

You want to learn the underlying structure despite the coordinate system, the representation used to describe it. This is the most important takeaway. Think of the various map projections for Earth (Mercator, Peters, etc.) Do you want the model to learn “Greenland is huge” or the actual distances, topology, geography? Most of the “it doesn’t work at Hospital B” problems come from the model not being basis-invariant. Real World is the abstract vector space. Your data is merely one representation/coordinate system that describes it!

Modeling Steps (Supervised)

  • Specify the data. What is x, what is y, what are they jointly distributed over.
  • Specify a parametric probabilistic model. A family of conditional distributions p(y | x; θ) (supervised) or joint p(x; θ) (unsupervised), parametrized by θ.
  • Define an objective. Almost always a likelihood-based one — most commonly negative log-likelihood (NLL), 𝔼[-log p(y | x; θ)].
  • Optimize θ. With gradient descent (smooth differentiable objectives) or Expectation-Maximization (when latent variables make the gradient awkward).
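A minimal sketch of steps 2–4 for a logistic model: specify $p(y \mid x; \theta)$, take the mean NLL as the objective, and run gradient descent (synthetic data, illustrative names):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
w_true = np.array([1.5, -2.0, 0.5])
y = (rng.uniform(size=500) < 1 / (1 + np.exp(-X @ w_true))).astype(float)

w = np.zeros(3)                               # theta
for _ in range(2000):
    p = 1 / (1 + np.exp(-X @ w))              # model's p(y=1 | x; theta)
    w -= 0.1 * X.T @ (p - y) / len(y)         # gradient step on the mean NLL

p = 1 / (1 + np.exp(-X @ w))
nll = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
print(w.round(2), round(nll, 3))              # recovered weights, final objective
```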

Be Humble and Ask the Right Questions

Maintain a healthy epistemic humility/uncertainty: neither overconfident nor paralyzed by uncertainty. “AUROC against what label, on which patients, generalizing where, evaluated by whom, deployed how?”

Math is rarely the problem: the limiting factor is the question framing. The same goes for any SOTA library/framework or modeling technique. What problem are you trying to solve? For whom? With what data? What is the data actually measuring, what is the loss actually rewarding, and what is the deployment context actually doing to the data-generating process?

Why is ML for Healthcare different?

This is not just “ML applied to health-shaped tables”: The data, the labels, the deployment, and the consequences (are you inadvertently affecting certain groups, marginalized or otherwise?) are all different in ways that affect modeling choices.

When you deploy an AI Healthcare Model, it may be (a) technically sound/sophisticated, (b) look fine in training and validation but when deployed (1) degrade (2) harm a subgroup (3) get gamed or (4) solve a wrong problem or something nobody asked for. This happens a lot.


Really Random

We do not use the Hessian since it is computationally intractable in high dimensions. People do use a diagonal approximation sometimes.


Practice Questions

For a readmission prediction model at a hospital network, what should the generalization unit be? Defend your choice. “Generalization Unit” means what you’re going to split your data on. In this case, split by patient. You want your model to generalize to new patients. Don’t leak patients into the test set; the model has already seen them in training, which will give you misleading metrics.
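A minimal sketch of a patient-level split with scikit-learn's GroupShuffleSplit; the DataFrame and column names are illustrative:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "patient_id": np.repeat(np.arange(100), 3),    # 100 patients, 3 admissions each
    "feature": rng.normal(size=300),
    "readmitted": rng.integers(0, 2, 300),
})

splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(df, groups=df["patient_id"]))

# No patient appears on both sides of the split.
assert set(df.loc[train_idx, "patient_id"]).isdisjoint(df.loc[test_idx, "patient_id"])
```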

A k-NN model achieves 99% training accuracy and 60% test accuracy. What is the likely cause? What hyperparameter change might help? Here, $k=1$ or some small number. You need to ‘loosen’ things so that you are averaging across more neighbors and the decision boundaries are ‘smoother’ and not as jagged. So a higher $k$ will help. Note that kNN is a bit opposite intuitively when it comes to complexity! And typically, $k = \sqrt{n}$. Why? Remember that kNN is Bayes optimal when $k \rightarrow \infty$ and $\frac{k}{n} \rightarrow 0$ as $n \rightarrow \infty$. What we are saying with that is “$k$ must grow with $n$ but not too fast.” Locality is key.

More on Locality

Why not use all points/neighbors for an average? Imagine you have a giant space with a few points and go from there — what happens to kNN when you keep filling the space?

If you have a Scalar Space of temperature and pick an arbitrary point for a temperature do you really want all temperature values to contribute to a temperature prediction near your AC? This is why locality matters if you want a good predictive model/decision boundary.

“kNN with enough data can approximate arbitrary decision boundaries.”

Why is it incorrect to tune hyperparameters on the test set? What should you do instead if you have limited data? You want to use the test set to see how well your model generalizes to unseen data. That’s the entire goal. So if you use the test set for hyperparameter tuning, you have leaked data from the Evaluation Phase into the Training Phase! Use a Validation Set for this (e.g. different values of $k$ in kNN.) If you don’t have enough data, use n-fold Cross Validation.
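A minimal sketch: choose $k$ with cross-validation on the training data, then touch the test set exactly once (synthetic data, illustrative values of $k$):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 5-fold CV score for each candidate k, using only the training data.
scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k),
                             X_train, y_train, cv=5).mean()
          for k in [1, 5, 15, 31, 63]}
best_k = max(scores, key=scores.get)

final = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
print(best_k, final.score(X_test, y_test))   # test set used exactly once
```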

A model class cannot represent non-linear decision boundaries. Is the resulting error bias or variance? Is there a data-driven fix? This is a highly biased model since it is simple (lines, planes as decision boundaries). The fix here is to move the data to some higher dimension (“Kernel Method”) and solve the problem there. You can also engineer a feature using a non-linear function ($x^2$, $\sin(x)$) and use that instead. You can also just pick another model yo…
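A minimal sketch of the feature-engineering fix on data with a circular boundary (synthetic, illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0).astype(int)      # circular boundary
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

linear = LogisticRegression().fit(X_tr, y_tr)             # high bias: stuck near the base rate

def aug(A):
    # Engineered non-linear feature x1^2 + x2^2; the boundary is linear in it.
    return np.column_stack([A, (A ** 2).sum(axis=1)])

augmented = LogisticRegression().fit(aug(X_tr), y_tr)
print(linear.score(X_te, y_te), augmented.score(aug(X_te), y_te))
```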