Random Notes on Important Things™

Common Throughlines?

Was the model calibrated at the site?
- Was it evaluated prospectively?
Resourcing!
- What interventions are triggered?
- Are other safety-critical tasks interrupted?
- Does responding to alerts improve outcomes?
Fairness
- Simpson’s Paradox — Are alerts aligned with subgroups?
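A quick sketch of how Simpson's Paradox can show up in alert evaluation. The counts, ward names, and model names below are made up purely for illustration: Model A has the better alert precision inside each subgroup, yet Model B looks better when the subgroups are pooled, because the two models fire in very different subgroup mixes.

```python
# Illustrative counts only (invented for this note): (true alerts, total alerts).
alerts = {
    "model_A": {"ward_low_risk": (81, 87), "ward_high_risk": (192, 263)},
    "model_B": {"ward_low_risk": (234, 270), "ward_high_risk": (55, 80)},
}


def precision(true_alerts, total_alerts):
    """Fraction of fired alerts that were true alerts."""
    return true_alerts / total_alerts


def summarize(alerts):
    """Per-subgroup and pooled alert precision for each model."""
    out = {}
    for model, groups in alerts.items():
        per_group = {g: precision(tp, n) for g, (tp, n) in groups.items()}
        tp_all = sum(tp for tp, _ in groups.values())
        n_all = sum(n for _, n in groups.values())
        out[model] = {"per_group": per_group, "pooled": precision(tp_all, n_all)}
    return out


summary = summarize(alerts)
for model, s in summary.items():
    rounded = {g: round(p, 3) for g, p in s["per_group"].items()}
    print(model, rounded, "pooled:", round(s["pooled"], 3))
```

Takeaway for the notes: always look at the per-subgroup numbers next to the pooled number before declaring a winner.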

These often overlap. The costs of FPs and FNs might be “meh”/equivalent.

Alarm/Time-Sensitive Workflows → FP are burdensome
- Alarm fatigue
- Interruption of safety-critical tasks
- Clinicians stop paying attention
- Patient anxiety
Diagnosis-Sensitive Workflows → FN are disastrous
- Level of invasiveness.
- Patients will be hurt or will die.
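One way to make the FP-versus-FN trade-off concrete. Assumptions (mine, not from the notes): the model outputs a calibrated probability $p$ of the event, correct decisions cost nothing, and $C_{FP}$ and $C_{FN}$ are hypothetical symbols for the cost of one false positive and one false negative. Alert when the expected cost of staying silent exceeds the expected cost of alerting:

```latex
\begin{align*}
\text{alert} &\iff p \, C_{FN} > (1 - p) \, C_{FP} \\
             &\iff p > \frac{C_{FP}}{C_{FP} + C_{FN}}
\end{align*}
```

With equal costs the cut lands at $p > 0.5$. If an FN is nine times as costly as an FP (diagnosis-sensitive), the cut drops to $p > 0.1$. If an FP is nine times as costly as an FN (alarm-sensitive), it rises to $p > 0.9$.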

ASK: What signal does the data you have capture? What is it omitting?
“No Observed Event” is not the same as “No Event” in binary classification. “No observed event in 180 days” mixes together (a) genuinely event-free patients with full follow-up; (b) patients who left the system at day 10; (c) patients who died of something else; (d) patients with delayed claims.
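A minimal sketch of separating those cases before collapsing them into a binary label. The `Patient` fields and label names are hypothetical, and the sketch does not handle delayed claims (case d), which would need a run-out buffer on top of this.

```python
from dataclasses import dataclass
from typing import Optional

HORIZON_DAYS = 180  # prediction window from the notes


@dataclass
class Patient:
    followup_days: int                          # days observed after the index date
    event_day: Optional[int] = None             # day the target event was observed, if any
    died_other_cause_day: Optional[int] = None  # day of death from another cause, if any


def label(p: Patient, horizon: int = HORIZON_DAYS) -> str:
    """Return a label that keeps 'no observed event' apart from 'no event'."""
    if p.event_day is not None and p.event_day <= horizon:
        return "event"
    if p.died_other_cause_day is not None and p.died_other_cause_day <= horizon:
        return "competing_event"   # (c) died of something else
    if p.followup_days < horizon:
        return "censored"          # (b) left the system early
    return "event_free"            # (a) full follow-up, nothing observed


cohort = [
    Patient(followup_days=365),
    Patient(followup_days=10),
    Patient(followup_days=90, died_other_cause_day=90),
    Patient(followup_days=365, event_day=40),
]
labels = [label(p) for p in cohort]
print(labels)
```

Only the `"event_free"` rows are safe to treat as negatives; the others call for exclusion, censoring-aware methods, or at least a sensitivity check.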
SICKER PATIENTS ARE MEASURED MORE OFTEN.
A tabular snapshot of the EHR loses temporal order, time gaps, recency, trend direction, repeated measurement, and the diagnostic signal.
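A small sketch of the information loss, using two invented creatinine-like series (values and day offsets are made up). Order-free snapshot features are identical for both patients, while a simple least-squares trend points in opposite directions.

```python
from statistics import mean


def snapshot(values):
    """Typical order-free tabular features."""
    return {"mean": round(mean(values), 3), "min": min(values),
            "max": max(values), "n": len(values)}


def slope(days, values):
    """Least-squares slope of value against time (units per day)."""
    d_bar, v_bar = mean(days), mean(values)
    num = sum((d - d_bar) * (v - v_bar) for d, v in zip(days, values))
    den = sum((d - d_bar) ** 2 for d in days)
    return num / den


days = [0, 1, 2, 3]
rising = [0.9, 1.1, 1.4, 1.8]     # getting worse
falling = list(reversed(rising))  # recovering

print(snapshot(rising) == snapshot(falling))       # same snapshot features
print(slope(days, rising), slope(days, falling))   # opposite trends
```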
Foundation models map an Input Domain → Multiple Tasks (Prediction, Q&A, Information Extraction). The important thing here is that they attempt to learn an input representation that can be adapted to multiple tasks. So if it faceplants at Weill Cornell, it has not learned the input representation well and is “exploiting narrow shortcuts”.
- ASK: “Foundation Model over what?” All sites? Which inputs? What tasks?
- ASK: Does it work with new patients? Caregivers? Sites? Times? Another hospital? If some artifacts (e.g. documentation) are unavailable?
SNPs: With a very large sample size, an association easily becomes statistically significant (low p-value) even if the SNP has a small effect size.
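A rough sketch of that effect with a two-proportion z-test (normal approximation). The allele frequencies and sample sizes are invented; the same tiny frequency difference is nowhere near significant at a small n and highly significant at a huge n.

```python
import math


def two_proportion_p_value(p1, p2, n1, n2):
    """Two-sided p-value for a difference in proportions (pooled z-test)."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    return math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))


# Hypothetical allele frequency in cases vs controls: same small gap both times.
freq_cases, freq_controls = 0.302, 0.300

p_small = two_proportion_p_value(freq_cases, freq_controls, 2_000, 2_000)
p_large = two_proportion_p_value(freq_cases, freq_controls, 2_000_000, 2_000_000)
print(f"n=2,000 per group:     p = {p_small:.3g}")
print(f"n=2,000,000 per group: p = {p_large:.3g}")
```

The effect size (0.2 percentage points) never changed; only n did. Report effect sizes next to p-values.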
Affect subgroups? Adherence?

Gene expression is completely contextual to the environment of the cell.
Protein structure is linked to Function.
Models that use the evolutionary signal outperform those that don’t.
- E.g. LMs that use Sequence + MSA
Protein docking scores $\ne$ true affinity

pLDDT is local!
pLDDT $\ne$ Accuracy!
High pLDDT does not mean that downstream stuff (binding, dynamics) is OK/validated!

AUROC is about ranking
Calibration is about magnitude. Two models can rank patients identically (same AUROC) yet report very different probabilities: for the same patient, Model A says “95% sure” and Model B says “17% sure”. At most one of them can be well calibrated!
Thresholds are about action, where we ‘cut’ the scores/rankings to perform an intervention.
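A small sketch of ranking versus magnitude. The labels and scores are toy numbers, and Model B is just Model A's scores cubed (a strictly monotone transform, so the ranking is untouched). The Brier score is used here only as a rough stand-in for how good the probabilities are.

```python
def auroc(labels, scores):
    """Probability that a random positive outranks a random negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def brier(labels, scores):
    """Mean squared error between predicted probability and outcome (lower is better)."""
    return sum((s - y) ** 2 for y, s in zip(labels, scores)) / len(labels)


labels = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]
model_a = [0.05, 0.10, 0.20, 0.30, 0.40, 0.60, 0.65, 0.80, 0.90, 0.95]
model_b = [s ** 3 for s in model_a]  # same ordering, very different magnitudes

print("AUROC A, B:", auroc(labels, model_a), auroc(labels, model_b))
print("Brier A, B:", round(brier(labels, model_a), 4), round(brier(labels, model_b), 4))
```

Same AUROC, different Brier scores: a ranking metric cannot tell you whether “17%” means 17%.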

THRESHOLDS $\ne$ TRAINING OBJECTIVE!

This is where Precision/Recall curves are important! FP versus FN. You decide.
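A sketch of “you decide”: the same scores, two different cost settings, two different cut points. The labels, scores, candidate thresholds, and cost ratios are all invented for illustration.

```python
def confusion(labels, scores, threshold):
    """Count TP, FP, FN when alerting on scores >= threshold."""
    tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
    fn = sum(1 for y, s in zip(labels, scores) if s < threshold and y == 1)
    return tp, fp, fn


def precision_recall(labels, scores, threshold):
    tp, fp, fn = confusion(labels, scores, threshold)
    precision = tp / (tp + fp) if (tp + fp) else 1.0  # convention when nothing is flagged
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def best_threshold(labels, scores, cost_fp, cost_fn, candidates):
    """Pick the candidate threshold with the lowest total misclassification cost."""
    def total_cost(t):
        _, fp, fn = confusion(labels, scores, t)
        return cost_fp * fp + cost_fn * fn
    return min(candidates, key=total_cost)


labels = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]
scores = [0.05, 0.10, 0.20, 0.30, 0.40, 0.60, 0.65, 0.80, 0.90, 0.95]
candidates = [0.1, 0.25, 0.5, 0.75]

for t in candidates:
    print(t, precision_recall(labels, scores, t))

alarm_cut = best_threshold(labels, scores, cost_fp=5, cost_fn=1, candidates=candidates)
diagnosis_cut = best_threshold(labels, scores, cost_fp=1, cost_fn=10, candidates=candidates)
print("alarm-sensitive cut:", alarm_cut, "| diagnosis-sensitive cut:", diagnosis_cut)
```

The model never changed. Only the costs did, and the cut moved from high (few alerts, FPs are burdensome) to low (catch everything, FNs are disastrous).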