Fairness in ML

This is a huge, complicated, and very important topic. It ties into Simpson’s Paradox and why it’s so important to evaluate your model’s performance across subgroups: one big “AUC 0.98 BRO!” ain’t gonna cut it (it almost never does, regardless). Compute AUC, PPV, TPR/FNR, etc. across each subgroup. This is not optional. Be thorough, be excellent to everyone.
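To make that concrete, here’s a minimal sketch of a per-subgroup evaluation. The column names (`y_true` as the 0/1 outcome, `y_score` as the model probability, `group` as the subgroup attribute) and the 0.5 threshold are illustrative assumptions, not anything prescribed:

```python
# A minimal sketch of "compute everything per subgroup".
# Assumed (hypothetical) columns: y_true (0/1), y_score (probability), group.
import pandas as pd
from sklearn.metrics import precision_score, recall_score, roc_auc_score

THRESHOLD = 0.5  # illustrative decision threshold, not a recommendation

def subgroup_report(df: pd.DataFrame) -> pd.DataFrame:
    rows = []
    for name, g in df.groupby("group"):
        y_true = g["y_true"]
        y_pred = (g["y_score"] >= THRESHOLD).astype(int)
        tpr = recall_score(y_true, y_pred)  # sensitivity / recall
        rows.append({
            "group": name,
            "n": len(g),
            "prevalence": y_true.mean(),                # the subgroup base rate
            "auc": roc_auc_score(y_true, g["y_score"]),
            "ppv": precision_score(y_true, y_pred),
            "tpr": tpr,
            "fnr": 1 - tpr,
        })
    return pd.DataFrame(rows)
```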

Fairness here comes in two flavors: Sufficiency and Separation. The root of the problem is that “base rates” (i.e., the prevalence of the outcome you’re looking for) are unequal across subgroups in the Real World™.

Sufficiency / Predictive Parity

This condition requires that when the model predicts something, the prediction should mean the same thing for all subgroups. “Can we trust the score equally?” Uses PPV/Precision. Formally, Y \perp X \mid \hat{Y}. Note that we’re conditioning on the prediction (the score), not the true outcome.

Suppose you have Groups A and B, and the model assigns a risk score of 0.7 for Sepsis to some patients in each. You observe (in the Real World™) that, indeed, 70% of those patients in both groups have Sepsis. Cool. Sufficiency satisfied.

Now,

  • If only 50% of Group A patients with that score actually have Sepsis, the score overestimates their risk and they will be overtreated.
  • If 80% of Group B patients with that score actually have Sepsis, the score underestimates their risk and they will be undertreated.

This is not good for the patient or the hospital’s resources. Sufficiency violated.
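A quick way to check this numerically is to bin the predicted scores and compare the observed outcome rate per bin across groups. A minimal sketch, using the same assumed columns as above:

```python
# A sketch of a sufficiency (calibration-within-groups) check, following the
# sepsis example: within each predicted-risk bin, does the observed outcome
# rate match across groups? Column names are assumptions, not a fixed API.
import numpy as np
import pandas as pd

def calibration_by_group(df: pd.DataFrame, n_bins: int = 10) -> pd.DataFrame:
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    out = df.copy()
    out["score_bin"] = pd.cut(out["y_score"], bins, include_lowest=True)
    # Sufficiency roughly holds if, within each score bin, the observed
    # outcome rate is about the same for every group.
    return (
        out.groupby(["score_bin", "group"], observed=True)["y_true"]
           .agg(["mean", "count"])
           .rename(columns={"mean": "observed_rate", "count": "n"})
           .reset_index()
    )

# e.g. patients scored ~0.7 should show ~70% observed sepsis in *every* group
```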

Separation / Classification Parity

This asks whether we are making errors at the same rate across subgroups. “Are we making mistakes equally across subgroups?” Formally, \hat{Y} \perp X \mid Y. Note that we’re conditioning on Reality (the true outcome).

For groups A and B, both of these should be true to satisfy Separation:

FPR_A \approx FPR_B \quad \text{and} \quad FNR_A \approx FNR_B

Both are necessary! Let’s see how. Let’s start with the FPR. We are asking about people who are actually healthy.

  • If Group A has an FPR of 20%, healthy patients in Group A are more likely to be falsely flagged and overtreated.
  • If Group B has an FPR of 5%, healthy patients in Group B are much less likely to be overtreated.

Now with FNR, we are asking about people who are actually sick.

  • If Group A has an FNR of 20%, sick patients in Group A are more likely to be missed and undertreated.
  • If Group B has an FNR of 5%, sick patients in Group B are much less likely to be undertreated.

| Metric   | What happens   | Why                            |
|----------|----------------|--------------------------------|
| High FPR | Overtreatment  | Healthy people falsely flagged |
| High FNR | Undertreatment | Sick people falsely missed     |

Lots of nuance here.
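A minimal sketch of the corresponding check, comparing FPR and FNR across groups at a fixed threshold (same assumed columns and threshold as before):

```python
# A sketch of a separation check: error rates per group at one cutoff.
import pandas as pd
from sklearn.metrics import confusion_matrix

def error_rates_by_group(df: pd.DataFrame, threshold: float = 0.5) -> pd.DataFrame:
    rows = []
    for name, g in df.groupby("group"):
        y_pred = (g["y_score"] >= threshold).astype(int)
        tn, fp, fn, tp = confusion_matrix(g["y_true"], y_pred, labels=[0, 1]).ravel()
        rows.append({
            "group": name,
            "fpr": fp / (fp + tn),   # healthy patients falsely flagged
            "fnr": fn / (fn + tp),   # sick patients falsely missed
        })
    return pd.DataFrame(rows)

# Separation roughly holds if both columns are ~equal across the rows.
```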

Impossible!

The Chouldechova/Kleinberg impossibility results (considered foundational in ML fairness) state that you can never satisfy both Sufficiency and Separation when the “base rates” differ across groups. That is, to have both at once you need at least one of:

  1. The same base rate across all groups (impossible in Healthcare… and human beings really)
  2. A Most Perfect Predictor (lol)
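Why does unequal prevalence force this? Chouldechova’s identity ties FPR, FNR, and PPV together through a group’s base rate p:

FPR = \frac{p}{1-p} \cdot \frac{1 - PPV}{PPV} \cdot (1 - FNR)

If two groups share the same PPV (Sufficiency) and the same FNR but have different base rates p, their FPRs are forced apart, unless PPV = 1 and FNR = 0, i.e. a perfect predictor.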

What’s Possible

Since we live in the Real World™, you can try to build a fairer model by

  1. Dealing with the data (fix the labels or the sampling method)
  2. Using subgroup-specific thresholds instead of one big global one (see the sketch below)
  3. Training your model to satisfy some fairness criterion

Each of these, of course, has tradeoffs and inductive biases.
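As a sketch of option 2: pick a per-group cutoff that hits roughly the same TPR in every group on a validation set. The target-TPR criterion, the column names, and `val_df` are illustrative assumptions, not a deployment recommendation:

```python
# A sketch of per-group thresholds chosen to roughly equalize TPR.
# Assumed (hypothetical) columns: y_true (0/1), y_score (probability), group.
import numpy as np
import pandas as pd

def thresholds_for_target_tpr(df: pd.DataFrame, target_tpr: float = 0.8) -> dict:
    thresholds = {}
    for name, g in df.groupby("group"):
        pos_scores = g.loc[g["y_true"] == 1, "y_score"].to_numpy()
        # The (1 - target_tpr) quantile of positive-class scores gives a cutoff
        # at which ~target_tpr of this group's true positives are flagged.
        thresholds[name] = float(np.quantile(pos_scores, 1 - target_tpr))
    return thresholds

# Usage: apply each group's own threshold instead of one global cutoff.
# cutoffs = thresholds_for_target_tpr(val_df, target_tpr=0.85)
# df["y_pred"] = df.apply(lambda r: int(r["y_score"] >= cutoffs[r["group"]]), axis=1)
```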

Obermeyer et al

Classic case. They wanted to predict future Healthcare Cost. That’s the goal. The model had great Sufficiency and Separation, and good calibration across subgroups. It did not explicitly use race. But it was still an unfair model. Why? Because cost was used as a proxy for need. If the model predicted $20,000 in estimated cost for a Black patient, that patient’s actual healthcare need was much higher than that of a White patient with the same prediction, due to historical and systemic inequities.

The authors estimated that this racial bias reduces the number of Black patients identified for extra care by more than half. Bias occurs because the algorithm uses health costs as a proxy for health needs. Less money is spent on Black patients who have the same level of need, and the algorithm thus falsely concludes that Black patients are healthier than equally sick White patients.

More than half!! Insane. And this was not because of biased people or bad models/statistics. Just a failure to ask “Am I really measuring what I want to measure with this label?”