
Analysis of Cohort Studies II

Logistic Regression

This is specifically for binary outcomes. You use the logit function, which is the natural log of the odds $\frac{p}{1 - p}$.

Probability and odds are similar when the probability is low, but odds grow very fast as the probability approaches 1. There’s another reason too: probability $\in [0, 1]$, odds $\in [0, +\infty)$, and once you apply a natural log, logit $\in (-\infty, +\infty)$. So now you’re kinda back to linear regression: an unbounded quantity that you can model as a linear function of the predictors. It’s a nice mathematical trick for when you want to model probabilities.
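To make the three scales concrete, here is a minimal sketch that converts a few probabilities to odds and then to log-odds. The helper names `odds` and `logit` are just illustrative, not from any particular library.

```python
import math

def odds(p):
    """Odds p / (1 - p) for a probability 0 <= p < 1."""
    return p / (1.0 - p)

def logit(p):
    """Natural log of the odds; defined for 0 < p < 1."""
    return math.log(odds(p))

# Low probabilities have odds close to p; high ones blow up.
for p in (0.01, 0.1, 0.5, 0.9, 0.99):
    print(f"p={p:<5} odds={odds(p):8.3f} logit={logit(p):7.3f}")
```

Note how the logit is symmetric around p = 0.5: a probability of 0.9 and a probability of 0.1 give log-odds of equal size and opposite sign.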

Now $p$ is the probability of seeing $Y = 1$ given $X$, and the model is linear on the logit scale: $\text{logit}(p) = \ln\frac{p}{1 - p} = \beta_0 + \beta_1X$.

$$p = \frac{e^{\beta_0 + \beta_1X}}{1 + e^{\beta_0 + \beta_1X}}$$
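Going from the linear predictor back to a probability is just the inverse of the logit, $p = \frac{e^{z}}{1 + e^{z}}$ with $z = \beta_0 + \beta_1X$. A minimal sketch follows; the function names and the coefficient values are made up for illustration and are not fitted to any data.

```python
import math

def inv_logit(z):
    """Map a log-odds value z to a probability, e^z / (1 + e^z)."""
    if z >= 0:
        # Algebraically the same expression, written to avoid overflow.
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_prob(x, beta0, beta1):
    """P(Y = 1 | X = x) for a single-predictor logistic model."""
    return inv_logit(beta0 + beta1 * x)

# Hypothetical coefficients, chosen only for illustration.
beta0, beta1 = -2.0, 0.5
for x in (0, 2, 4, 8):
    print(f"x={x}  p={predict_prob(x, beta0, beta1):.3f}")
```

Whatever value the linear predictor takes, the result stays between 0 and 1, which is the whole point of the transformation.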

So here $\beta_0$ is the log-odds of $Y = 1$ when $X = 0$ (and $e^{\beta_0}$ is the odds of $Y = 1$ when $X = 0$).

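Odds ratios are centered around 1, and $e^{\beta_1}$ is the odds ratio for a one-unit increase in $X$. If it’s 1.5, the odds are 50% higher; if it’s 0.7, they’re 30% lower. So a negative $\beta_1$ (an odds ratio below 1) marks a ‘protective’ factor when it comes to interventions.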
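The reading of a coefficient as an odds ratio can be sketched in a few lines. The helper names are illustrative, and the coefficients are picked so that they reproduce the 1.5 and 0.7 examples above.

```python
import math

def odds_ratio(beta1):
    """Odds ratio for a one-unit increase in X."""
    return math.exp(beta1)

def percent_change_in_odds(beta1):
    """Percent change in the odds of Y = 1 per one-unit increase in X."""
    return (math.exp(beta1) - 1.0) * 100.0

# Illustrative coefficients: log(1.5) > 0, log(0.7) < 0, and no effect.
for beta1 in (math.log(1.5), math.log(0.7), 0.0):
    print(f"beta1={beta1:+.3f}  OR={odds_ratio(beta1):.2f}  "
          f"change={percent_change_in_odds(beta1):+.0f}%")
```

A coefficient of zero gives an odds ratio of exactly 1, i.e. no association.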

Assumptions

  • Outcome is binary.
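  • The relationship between logit(p) and the predictors is linear.
  • Observations are independent and identically distributed (IID).
  • There is no severe multicollinearity among the predictors in a multivariable setting.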
  • There should be a ‘sufficient’ sample size.

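A rule of thumb (the ‘rule of 10’): you need about 10 events, i.e. observations in the less frequent outcome class, per covariate. So if you have 5 covariates, you will need at least 50 events.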
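The rule of 10 is usually phrased as events per variable (EPV), where "events" means the count of the less frequent outcome. A quick sketch of the arithmetic, with hypothetical helper names and made-up cohort counts:

```python
def min_events_needed(n_covariates, epv=10):
    """Minimum number of events under an events-per-variable rule."""
    return epv * n_covariates

def events_per_variable(n_events, n_nonevents, n_covariates):
    """EPV, counting the less frequent outcome class as the events."""
    return min(n_events, n_nonevents) / n_covariates

print(min_events_needed(5))                # events needed for 5 covariates
print(events_per_variable(30, 970, 5))     # made-up cohort: 30 cases in 1000
```

In the second call the cohort has 1000 people but only 30 events, so it falls short of the rule despite the large total sample size.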

Class Imbalance

What if you have a severe class imbalance?

  • Downsample the majority class
  • Upsample the minority class
  • Both?
  • Assign class weights
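The options above can be sketched with the standard library alone. This is a minimal illustration for a binary label list, not a recommended pipeline: the function names are hypothetical, resampling is plain random sampling, and the weight formula `n / (n_classes * n_c)` is one common "balanced" heuristic (the one scikit-learn uses for `class_weight="balanced"`).

```python
import random
from collections import Counter

def _split_indices(labels):
    """Return (minority_idx, majority_idx) for a binary label list."""
    counts = Counter(labels)
    minority = min(counts, key=counts.get)
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    majority_idx = [i for i, y in enumerate(labels) if y != minority]
    return minority_idx, majority_idx

def downsample_majority(labels, seed=0):
    """Keep every minority row plus an equal-sized random majority sample."""
    rng = random.Random(seed)
    minority_idx, majority_idx = _split_indices(labels)
    kept = rng.sample(majority_idx, len(minority_idx))
    return sorted(minority_idx + kept)

def upsample_minority(labels, seed=0):
    """Keep every row and re-draw minority rows (with replacement) to balance."""
    rng = random.Random(seed)
    minority_idx, majority_idx = _split_indices(labels)
    extra = rng.choices(minority_idx, k=len(majority_idx) - len(minority_idx))
    return sorted(list(range(len(labels))) + extra)

def balanced_class_weights(labels):
    """Weight each class by n / (n_classes * n_c), so rarer classes count more."""
    counts = Counter(labels)
    n = len(labels)
    return {c: n / (len(counts) * k) for c, k in counts.items()}

# Made-up example: 10 positives among 100 observations.
labels = [1] * 10 + [0] * 90
print(Counter(labels[i] for i in downsample_majority(labels)))
print(Counter(labels[i] for i in upsample_minority(labels)))
print(balanced_class_weights(labels))
```

The functions return row indices rather than data so they can be applied to the covariates and the outcome together. Keep in mind that resampling changes the outcome prevalence in the training data, so the fitted intercept (and therefore the predicted probabilities) no longer reflects the original cohort.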