
Random Notes

Final Cheatsheet FOR REALSIES

$$
\begin{aligned}
\frac{d}{dx} e^x &= e^x & \frac{d}{dx} e^{ax} &= a e^{ax} \\
\frac{d}{dx} e^u &= e^u \frac{du}{dx} & \frac{d}{dx} e^{au} &= a e^{au} \frac{du}{dx} \\
\frac{d}{dx} a^u &= a^u (\ln a)\frac{du}{dx} & \frac{d}{dx} \ln u &= \frac{1}{u}\frac{du}{dx} \\
\frac{d}{dx} \log_a u &= \frac{1}{u \ln a}\frac{du}{dx}
\end{aligned}
$$

Q3

• Threshold choice depends on workflow, not only prevalence or AUROC.
• False positives and false negatives produce different real-world consequences.
• False positives create alarms, interruptions, clinician burden, and possible alarm fatigue.
• False negatives may delay recognition of imminent death or deterioration.
• A ranking metric does not determine the operating point.
• New workflows may require recalibration and new thresholds.

what happens after an alert: how quickly nurses and physicians can respond, whether the alert triggers specific interventions, how many alerts per shift are tolerable, whether alerts interrupt other safety-critical tasks, and whether responding to the alert actually improves outcomes

performance at candidate thresholds, calibration, subgroup performance, and prospective or workflow-simulated estimates of alert burden

FP:

  • interrupt the bedside nurse, page the attending physician, disturb the patient, and contribute to alarm fatigue
  • clinicians may start ignoring alerts

FN:

  • delay bedside assessment, escalation, or intervention. The cost is not just a missed label; it is a missed opportunity to act in a narrow time window

There is no universally best threshold: the relevant tradeoff depends on the clinical action, staffing, alarm capacity, and the expected benefit of earlier intervention

AUROC:

  • Measures ranking quality only; it does not tell you where to place the threshold.

Q4

• A binary label can collapse true negatives, censored cases, deaths, and other competing events into one category.

  • Binary label is only valid if follow-up is complete!
    • Censoring means the event status is unknown, not negative.
    • Death can be a competing risk because it changes whether the target event can occur.
    • Survival or time-to-event methods use partial follow-up more appropriately.
    • Evaluation should consider time, censoring, calibration, and ranking over time.

separate the question “has the event happened yet?” from “how long was the patient actually observed?”

SA (survival analysis) eval: time-dependent AUROC, IPCW Brier score, survival curves by predicted-risk group
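For the survival-curves-by-risk-group part, a minimal sketch using lifelines on synthetic data; the column names (`time`, `event`, `pred_risk`) are illustrative placeholders, not from the notes:

```python
# Kaplan-Meier curves stratified by predicted-risk tertile (synthetic data).
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from lifelines import KaplanMeierFitter

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "time": rng.exponential(365, 300),       # follow-up time in days
    "event": rng.integers(0, 2, 300),        # 1 = event observed, 0 = censored
    "pred_risk": rng.uniform(0, 1, 300),     # model's predicted risk
})
df["risk_group"] = pd.qcut(df["pred_risk"], q=3, labels=["low", "mid", "high"])

ax = plt.gca()
for name, grp in df.groupby("risk_group", observed=True):
    kmf = KaplanMeierFitter()
    kmf.fit(durations=grp["time"], event_observed=grp["event"], label=str(name))
    kmf.plot_survival_function(ax=ax)        # censoring handled properly
ax.set_xlabel("Days since prediction")
ax.set_ylabel("Event-free probability")
plt.show()
```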

Q5

| Modality | Structure | Naive tabularization loses | Repr. or model family | Concern |
| --- | --- | --- | --- | --- |
| EHR event stream | Irregular event stream of labs, meds, diags, vitals, encounters; observation density informative | Order, time gaps, recency, trends, the fact that sicker = measured more | Event-stream transformer, RNN, or engineered windows | Prediction-time discipline to prevent label leakage (post-event features) |
| Whole-slide pathology | Ultra high-res spatial; slide-level label, tile-level unknown; local tissue morphology, larger-scale arch. | Spatial locality, tissue morphology, large-scale architecture | Tile-level CNN/vision transformer + multiple-instance learning; transfer learning | Scanner artifacts; weak label-region alignment; patient-level split |
| Claims sequence | Longitudinal billing records (diagnosis, procedure, refills), irregular timing | Care trajectories, recurrence, coverage gaps, timing | Temporal coded-event sequence; engineered features over clinically meaningful time windows | Billing ≠ clinical truth; insurance discontinuity affects features and labels |
| Protein interaction network | Graph-structured, variable size, no canonical ordering, neighborhood-defined | Connectivity, neighborhood, pathway-level structure | Graph neural network; message passing | Noisy or incomplete edges; oversmoothing with too many message-passing layers |
| scRNA-seq | Sparse high-dim gene expr count matrix, depth-confounded, dropout | Batch effects and seq depth dominate without normalization | Normalize → PCA dim red → UMAP/clustering; latent-variable models | Clusters need biological validation (marker genes, ext annots), not trusted automatically |
| Type | Core thing preserved | What tabularization destroys |
| --- | --- | --- |
| Sequence | order + timing | temporal structure |
| Image/Spatial | locality + geometry | spatial relationships |
| Graph/Network | connectivity | neighborhood structure |
| Language/Text | context + semantics | meaning |
| High-dimensional biology | correlation/manifold structure | latent biological organization |
| Population/causal | sampling/assignment process | study design assumptions |
| Multimodal | relationships across modalities | cross-modal information |

• Clinical notes (text): templates and copy-paste are the load-bearing concern; clinical BERT or LLMs for representation; abbreviation/negation issues; site-specific note dialect.
• X-ray / CT / MRI (volumetric/projection imaging): CNN or U-Net; augmentation validity matters (vertical flip is wrong for X-rays — anatomy isn’t symmetric); DICOM headers can be shortcut hazards.
• ECG / waveform time series: 1D CNN or transformer on the raw signal; channel meaning (different leads = different spatial vantages); informative gaps.
• Molecules: SMILES strings, atom-bond graphs, ECFP fingerprints + RF as a notoriously strong baseline; scaffold split, not random split.
• DNA sequence / variants: GWAS, polygenic risk scores, population structure as the dominant confounder, ancestry transferability problem.

Q6

• Lookup tables fail because long contexts are combinatorially sparse and do not generalize.
• Next-token prediction models a sequence distribution and can cast many tasks as completion.
• Clinical foundation models aim for reusable representations or task adaptation, not one fixed label.
• A tabular snapshot can lose temporal order, trends, recency, time gaps, and observation process.
• Broad usefulness should be shown by transfer across tasks/settings; shortcut learning is suggested by brittle site- or workflow-specific performance.

Lookup tables

  • Exact long contexts never repeat
  • Cannot generalize to a new context, even one that is perfectly understandable

Next-token prediction

  • Language tasks written as completion problems. Chain rule of Probability.
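The factorization behind next-token prediction, via the chain rule of probability:

$$p(x_1, \dots, x_T) = \prod_{t=1}^{T} p(x_t \mid x_1, \dots, x_{t-1})$$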

FMs

  • try to learn the reusable structure of the Input Domain to support many downstream tasks (input → task distribution)
  • Eval: bad sign if performance collapses under site/time splits or under removal of workflow artifacts (e.g. note templates); good sign if the pretrained representation improves performance across many clinically different downstream tasks.

Q7

Bio

SNP:

  • Genome-wide signif == statistical evidence of assoc != strong individual-level pred
  • Can have a tiny effect size yet a very small p-value in a large study
  • Ask about effect size, allele freq., ancestry, replication, LD structure

Protein:

  • Docking Score != Affinity in Real Life™
  • You need to validate experimentally in the lab!!
  • AlphaFold is about structure that’s all: not ligand binding, conformational dynamics, docking score accuracy.
  • Ask if the binding pocket is confidently modeled, whether alternative conformations matter, and whether wet-lab assays confirm hits.

DNA:

  • Expression/Regulation depend on the cell environment/context! Cell type, chromatin accessibility, transcription factors, dev stage, etc.
  • Ask if validated in tissue or cell state. Do we need more data on expression, chromatin, or eQTL?

General framework

  • Is the prediction task defined? What is the prediction horizon? Over what/whom?
  • What is the data modality? Inductive biases?
  • Which model?
  • How are we splitting? Patient/Time/Site? Leakage?
  • How are we evaluating?
    • Discrimination/calibration/threshold - clinical utility/subgroup perf.
  • Can we tabularize and start with baseline models?
  • Can we conclude any causality (almost never)?
  • How are we interpreting the model? Validating biologically/clinically?
  • How are we connecting model to workflow/goal? Harm?

Stuff on Cheatsheet

  • Trig identities
  • Variances
  • EVs
  • Common derivatives/integrals
  • RELU and other curves (see lectures!)
  • Common curves $e^x$, $\log(x)$, etc.
  • Var(X) = E[X²] − (E[X])², Cov(X, Y), sample estimates
  • Distributions - PDF, CDF, E[X], Var[X]
  • PPV, TPR, FPR, Accuracy, Precision, Recall, F1
  • Outcome, Event Space mappings — IMPORTANT!
  • Entropy, KL divergence, information content, mutual information, chain rule, bayes rule, joint, marginal, conditional Probabilities
  • Minimizing Cross Entropy <—> Maximizing Likelihood
  • $\hat{w}$ expression, regularization equations
  • Stride, padding, etc equation.
  • Activation Functions, graphs, derivatives

Recalibration: Platt scaling, isotonic regression. Plot empirical vs. predicted probability (calibration curve).
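A minimal sketch of both recalibration methods and the calibration plot with scikit-learn; the data and random-forest base model are just illustrative:

```python
# Platt scaling ("sigmoid") or isotonic recalibration, plus a reliability curve.
import matplotlib.pyplot as plt
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=4000, n_features=20, weights=[0.9], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# method="sigmoid" = Platt scaling; method="isotonic" = isotonic regression.
model = CalibratedClassifierCV(
    RandomForestClassifier(n_estimators=200, random_state=0),
    method="sigmoid", cv=5,
).fit(X_train, y_train)

p = model.predict_proba(X_test)[:, 1]
frac_pos, mean_pred = calibration_curve(y_test, p, n_bins=10)

plt.plot(mean_pred, frac_pos, marker="o", label="recalibrated model")
plt.plot([0, 1], [0, 1], "--", label="perfect calibration")
plt.xlabel("Mean predicted probability")
plt.ylabel("Empirical frequency")
plt.legend()
plt.show()
```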

Start with baseline, tabularization — go to complex models only if necessary
Examine calibration, esp. in subgroups (non-optional)
Does the threshold match clinical need?
Is further calibration required? Is the model stable over time? Demographic shifts? at the same site?

Bayes optimal classifier formula: $r = \frac{C_{FN}}{C_{FP}}$ and $\tau^* = \frac{1}{1 + r}$. How much worse is a FN than a FP? In medicine, FNs are usually worse.
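A tiny worked example (the cost numbers are assumptions for illustration only):

```python
# Turn an assumed FN:FP cost ratio into an operating threshold and apply it.
import numpy as np

c_fn, c_fp = 9.0, 1.0            # assume a missed case is 9x worse than a false alarm
r = c_fn / c_fp                  # r = C_FN / C_FP
tau = 1.0 / (1.0 + r)            # tau* = 1 / (1 + r) = 0.1 here

p = np.array([0.03, 0.12, 0.40, 0.08])   # calibrated predicted risks
print(tau, p >= tau)                      # 0.1 [False  True  True False]
```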

Net Benefit: “Using this model is equivalent to getting 10 true-positive interventions per 100 patients with no unnecessary interventions.” You plot it across a range of thresholds.
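A sketch of how that curve is computed, using the standard decision-curve net benefit formula NB(t) = TP/N − FP/N · t/(1−t); the predictions and outcomes are synthetic, not from the notes:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 1000)                              # true outcomes
p = np.clip(0.3 * y + rng.uniform(0, 0.7, 1000), 0, 1)    # toy risk scores

n = len(y)
for t in np.arange(0.05, 0.55, 0.05):
    pred = p >= t
    tp = np.sum(pred & (y == 1))
    fp = np.sum(pred & (y == 0))
    nb = tp / n - fp / n * (t / (1 - t))
    print(f"threshold {t:.2f}: net benefit {nb:+.3f}")    # compare to treat-all / treat-none
```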

It's not about simple vs. complex, and not about which functions your model class can represent; it's about which functions gradient descent tends to discover early and easily during optimization.

TODO

  • BIAS AND VARIANCE
    • Practical consequences (real talk)
      • Linear regression → higher bias, lower variance — How?
      • Deep neural nets → low bias, high variance (unless regularized) — How?
      • Random Forest → reduces variance via averaging — Review HW2
      • XGBoost → balances bias and variance through boosting — how?
  • k-fold CV
  • Probability from CI textbook.
  • What happens when the Probability doesn’t add up to one?
  • Interpretation of Precision/Recall curves — Do weird shit with this. What do they mean?
  • Why is orthogonality a nice thing in ML?

Training Binary Classifiers

Loss → Model → Control Bias (Regularization, etc)

There are some rare cases where you are fitting to the data provided and not the generative process. Non-parametric modeling.

Target Use-Case determines the Target Generalizability Unit.

Loss: You can have all sorts of loss. E.g. MSE loss for Regression, log-likelihood, etc.

So do you model $y, x$ jointly or $y \mid x$?

Note that just guessing is the simplest possible ‘prediction’ and not always a stupid choice!

You are saying something about the generative process when you pick your model. Think of data that clearly calls for clustering, but you use a linear decision boundary: is the generative process really linear?

In KNN

  • You can have a kernel function that weights nearer neighbors more.
  • Really you’re not really ‘learning’ anything about the distribution.
  • The geometry matters! Think of what would happen if you smushed the y-axis.
  • Gradient descent doesn’t work since you don’t really get anything for small changes in radius $r$: $\mathbb{L}(r)$ is almost everywhere constant (it only jumps when a point enters or leaves the neighborhood).

The “right” thing you want to learn in this binary case is $p_{y \mid X = x}$.

kNNs are Awesome Initially

As a sanity check baseline. It shows

  • Bias/Variance tradeoff
  • The Curse of Dimensionality
  • Effect of Domain Shift (your neighbors from Training aren’t the same anymore)
  • Converges to the irreducible Bayes error in theory (can be proved); 1-NN achieves at most twice the Bayes error (Cover & Hart, 1967)
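A minimal baseline sketch on synthetic data with scikit-learn; scaling first matters because kNN is pure geometry:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=2000, n_features=15, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Standardize first: "smushing" an axis changes who your neighbors are.
baseline = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=15))
print(baseline.fit(X_tr, y_tr).score(X_te, y_te))
```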

Complexity and Bias and Variance and Error

  • Simple models are “consistently wrong.” → High Bias, Low Variance on new data.
  • Complex models are “inconsistently right.” → Low Bias, High Variance on new data.

Scaffold Splits

Scaffold Split: Molecules are grouped based on their Bemis-Murcko scaffolds. This strategy ensures that all molecules within the same cluster are assigned to either the training set or the test set, never split across both.


Quick example: You want your model to predict an animal’s shape from a skeleton. You have 1,000 samples representing 100 animals. You don’t want a crow skeleton to show up in both train and test. So you put “mammal” and “avian” groups in train and evaluate on a test set containing “reptile”. The goal is for your model to predict well on skeleton types it hasn’t seen. Fin.
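A minimal sketch of grouping molecules by scaffold, assuming RDKit's MurckoScaffold helpers; the SMILES strings are just illustrative:

```python
from collections import defaultdict
from rdkit.Chem.Scaffolds import MurckoScaffold

smiles = ["c1ccccc1CCN", "c1ccccc1CCO", "C1CCNCC1C(=O)O", "CCOC(=O)C1CCNCC1"]

groups = defaultdict(list)
for s in smiles:
    scaffold = MurckoScaffold.MurckoScaffoldSmiles(smiles=s)   # Bemis-Murcko scaffold
    groups[scaffold].append(s)

# Whole scaffold groups go to train or test, never both.
scaffolds = sorted(groups, key=lambda k: len(groups[k]), reverse=True)
train = [m for k in scaffolds[:-1] for m in groups[k]]
test = list(groups[scaffolds[-1]])
print(dict(groups), len(train), len(test))
```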

Bayes Errors et al

The Bayes Error is the irreducible, unavoidable error in modeling. Exists because you cannot capture everything about the Real World™ which is messy (#JobSecurity). So in ML,

$$\text{Error} = \text{Variance} + \text{Bias}^2 + \text{Irreducible/Bayes Error}$$

Bayes-optimal classifier is something that achieves this minimum level of error. Bayes consistency means that the modeling eventually reaches this minimum.

It’s about learning Structure

Every time you “represent” something in ML you are picking a basis.

You want to learn the underlying structure despite the coordinate system, the representation used to describe it. This is the most important takeaway. Think of the various map projections for Earth (Mercator, Peters, etc.) Do you want the model to learn “Greenland is huge” or the actual distances, topology, geography? Most of the “it doesn’t work at Hospital B” problems come from the model not being basis-invariant. Real World is the abstract vector space. Your data is merely one representation/coordinate system that describes it!

Modeling Steps (Supervised)

  • Specify the data. What is x, what is y, what are they jointly distributed over.
  • Specify a parametric probabilistic model. A family of conditional distributions p(y | x; θ) (supervised) or joint p(x; θ) (unsupervised), parametrized by θ.
  • Define an objective. Almost always a likelihood-based one — most commonly negative log-likelihood (NLL), 𝔼[-log p(y | x; θ)].
  • Optimize θ. With gradient descent (smooth differentiable objectives) or Expectation-Maximization (when latent variables make the gradient awkward).
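A minimal sketch of steps 2–4 for a logistic model: specify $p(y \mid x; \theta)$, take the mean NLL as the objective, and run gradient descent (synthetic data, illustrative names):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
w_true = np.array([1.5, -2.0, 0.5])
y = (rng.uniform(size=500) < 1 / (1 + np.exp(-X @ w_true))).astype(float)

w = np.zeros(3)                               # theta
for _ in range(2000):
    p = 1 / (1 + np.exp(-X @ w))              # model's p(y=1 | x; theta)
    w -= 0.1 * X.T @ (p - y) / len(y)         # gradient step on the mean NLL

p = 1 / (1 + np.exp(-X @ w))
nll = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
print(w.round(2), round(nll, 3))              # recovered weights, final objective
```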

Be Humble and Ask the Right Questions

Maintain a healthy epistemic humility/uncertainty: neither overconfident nor paralyzed by uncertainty. “AUROC against what label, on which patients, generalizing where, evaluated by whom, deployed how?”

Math is rarely the problem: the limiting factor is the question framing. The same goes for any SOTA library/framework or modeling technique. What problem are you trying to solve? For whom? With what data? What is the data actually measuring, what is the loss actually rewarding, and what is the deployment context actually doing to the data-generating process?

Why is ML for Healthcare different?

This is not just “ML applied to health-shaped tables”: The data, the labels, the deployment, and the consequences (are you inadvertently affecting certain groups, marginalized or otherwise?) are all different in ways that affect modeling choices.

When you deploy an AI Healthcare Model, it may be (a) technically sound/sophisticated, (b) look fine in training and validation but when deployed (1) degrade (2) harm a subgroup (3) get gamed or (4) solve a wrong problem or something nobody asked for. This happens a lot.


Really Random

We do not use the Hessian since it is computationally intractable in high dimensions. People do use a diagonal approximation sometimes.


Practice Questions

For a readmission prediction model at a hospital network, what should the generalization unit be? Defend your choice. “Generalization Unit” means what you’re going to split your data on. In this case, split by patient. You want your model to generalize to new patients. Don’t leak patients into the test set; the model has already seen them in training, which will give you misleading metrics.
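A minimal sketch of a patient-level split with scikit-learn's GroupShuffleSplit; the DataFrame and column names are illustrative:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "patient_id": np.repeat(np.arange(100), 3),    # 100 patients, 3 admissions each
    "feature": rng.normal(size=300),
    "readmitted": rng.integers(0, 2, 300),
})

splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(df, groups=df["patient_id"]))

# No patient appears on both sides of the split.
assert set(df.loc[train_idx, "patient_id"]).isdisjoint(df.loc[test_idx, "patient_id"])
```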

A k-NN model achieves 99% training accuracy and 60% test accuracy. What is the likely cause? What hyperparameter change might help? Here, $k=1$ or some small number. You need to ‘loosen’ things so that you are averaging across more neighbors and the decision boundaries are ‘smoother’ and not as jagged. So a higher $k$ will help. Note that kNN is a bit opposite intuitively when it comes to complexity! And typically, $k = \sqrt{n}$. Why? Remember that kNN is Bayes optimal when $k \rightarrow \infty$ and $\frac{k}{n} \rightarrow 0$ as $n \rightarrow \infty$. What we are saying with that is “$k$ must grow with $n$ but not too fast.” Locality is key.

More on Locality

Why not use all points/neighbors for an average? Imagine you have a giant space with a few points and go from there — what happens to kNN when you keep filling the space?

If you have a Scalar Space of temperature and pick an arbitrary point for a temperature do you really want all temperature values to contribute to a temperature prediction near your AC? This is why locality matters if you want a good predictive model/decision boundary.

“kNN with enough data can approximate arbitrary decision boundaries.”

Why is it incorrect to tune hyperparameters on the test set? What should you do instead if you have limited data? You want to use the test set to see how well your model generalizes to unseen data. That’s the entire goal. So if you use the test set for hyperparameter tuning, you have leaked data from the Evaluation Phase into the Training Phase! Use a Validation Set for this (e.g. different values of $k$ in kNN.) If you don’t have enough data, use n-fold Cross Validation.
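A minimal sketch: choose $k$ with cross-validation on the training data, then touch the test set exactly once (synthetic data, illustrative values of $k$):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 5-fold CV score for each candidate k, using only the training data.
scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k),
                             X_train, y_train, cv=5).mean()
          for k in [1, 5, 15, 31, 63]}
best_k = max(scores, key=scores.get)

final = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
print(best_k, final.score(X_test, y_test))   # test set used exactly once
```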

A model class cannot represent non-linear decision boundaries. Is the resulting error bias or variance? Is there a data-driven fix? This is a highly biased model since it is simple (lines, planes as decision boundaries). The fix here is to move the data to some higher dimension (“Kernel Method”) and solve the problem there. You can also engineer a feature using a non-linear function ($x^2$, $\sin(x)$) and use that instead. You can also just pick another model yo…
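A minimal sketch of the feature-engineering fix on data with a circular boundary (synthetic, illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0).astype(int)      # circular boundary
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

linear = LogisticRegression().fit(X_tr, y_tr)             # high bias: stuck near the base rate

def aug(A):
    # Engineered non-linear feature x1^2 + x2^2; the boundary is linear in it.
    return np.column_stack([A, (A ** 2).sum(axis=1)])

augmented = LogisticRegression().fit(aug(X_tr), y_tr)
print(linear.score(X_te, y_te), augmented.score(aug(X_te), y_te))
```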