Analysis of Cohort Studies II

Logistic Regression

This is specifically for binary outcomes. You model the logit of the outcome probability $p$, which is the natural log of the odds: $\text{logit}(p) = \ln\frac{p}{1 - p}$.

Probability and odds are similar when the probability is low, but odds grow very fast as the probability approaches 1. There’s another reason too: probability $\in [0, 1]$. Odds $\in [0, \infty)$. Apply a natural log and now the logit $\in (-\infty, +\infty)$. So the quantity you model is unbounded, and you’re kinda back to linear regression. It’s a nice mathematical trick for when you want to model probabilities.
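To make those ranges concrete, here is a small sketch in plain Python that maps a few probabilities to odds and log-odds. The helper names `odds` and `logit` are my own, chosen for illustration.

```python
import math

def odds(p):
    """Odds p / (1 - p) for a probability 0 <= p < 1."""
    return p / (1 - p)

def logit(p):
    """Natural log of the odds; defined for 0 < p < 1."""
    return math.log(odds(p))

# Low probabilities give odds close to p; high probabilities give large odds.
for p in (0.01, 0.1, 0.5, 0.9, 0.99):
    print(f"p={p:<5} odds={odds(p):8.3f} logit={logit(p):7.3f}")
```

At $p = 0.01$ the odds are about 0.0101, nearly the same number, while at $p = 0.9$ the odds are already 9. The logit is 0 at $p = 0.5$ and symmetric around it.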

Now $p$ is the probability of seeing $Y = 1$ given $X$, and the model is linear on the logit scale: $\text{logit}(p) = \beta_0 + \beta_1X$. Solving for $p$:

$p = \frac{e^{\beta_0 + \beta_1X}}{1 + e^{\beta_0 + \beta_1X}}$
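A minimal sketch of going from the linear predictor back to a probability, using the inverse logit $p = e^{z} / (1 + e^{z})$ with $z = \beta_0 + \beta_1X$. The function names and the coefficient values are made up for illustration and are not estimates from any data set.

```python
import math

def inv_logit(z):
    """Map a log-odds value z to a probability, e^z / (1 + e^z)."""
    # Algebraically the same as e^z / (1 + e^z), written with exp(-z).
    return 1.0 / (1.0 + math.exp(-z))

def predict_prob(x, beta0, beta1):
    """P(Y = 1 | X = x) under a one-predictor logistic model."""
    return inv_logit(beta0 + beta1 * x)

# Illustrative (made-up) coefficients.
beta0, beta1 = -2.0, 0.8
for x in (0, 1, 2, 5):
    print(x, round(predict_prob(x, beta0, beta1), 3))
```

Whatever value the linear predictor takes, the result stays strictly between 0 and 1, which is the point of the transform.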

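So here $\beta_0$ is the log-odds of $Y = 1$ when $X = 0$ (and $e^{\beta_0}$ is the odds of $Y = 1$ at $X = 0$). Likewise, $\beta_1$ is the change in log-odds per one-unit increase in $X$, so $e^{\beta_1}$ is the odds ratio.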

Odds ratios are centered around 1. If it’s 1.5, the odds are 50% higher; if it’s 0.7, the odds are 30% lower. So a negative $\beta_1$ (odds ratio below 1) marks a ‘protective’ factor when it comes to interventions.
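The reading of a coefficient as a percent change in odds can be sketched as follows. The helper names are hypothetical, and the coefficients are chosen only to reproduce the 1.5 and 0.7 examples above.

```python
import math

def odds_ratio(beta1):
    """Odds ratio for a one-unit increase in X."""
    return math.exp(beta1)

def percent_change_in_odds(beta1):
    """Percent change in the odds of Y = 1 per one-unit increase in X."""
    return (odds_ratio(beta1) - 1.0) * 100.0

# Coefficients picked so the odds ratios are 1.5, 0.7 and 1.0.
for beta1 in (math.log(1.5), math.log(0.7), 0.0):
    effect = "protective" if beta1 < 0 else "risk" if beta1 > 0 else "none"
    print(f"beta1={beta1:+.3f} OR={odds_ratio(beta1):.2f} "
          f"change={percent_change_in_odds(beta1):+.0f}% ({effect})")
```

Note that the percent change applies to the odds, not to the probability itself.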

Assumptions

Outcome is binary.
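The relationship between logit(p) and the predictors is linear.
Observations are independent and identically distributed (IID).
There is no strong multicollinearity among the predictors in a multivariable setting.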
There should be a ‘sufficient’ sample size.

A rule of thumb (the rule of 10): you want about 10 events, i.e. cases of the rarer outcome, per covariate. So if you have 5 covariates, you will need at least 50 events.
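The arithmetic behind the rule of thumb, as a small sketch. It assumes the count that matters is the size of the rarer outcome class, which is the usual reading of the rule. The function names and the example counts are hypothetical.

```python
def min_events_needed(n_covariates, events_per_variable=10):
    """Minimum number of events suggested by the rule of thumb."""
    return n_covariates * events_per_variable

def passes_rule_of_ten(n_events, n_non_events, n_covariates,
                       events_per_variable=10):
    """Check the rule against the rarer of the two outcome classes."""
    limiting = min(n_events, n_non_events)
    return limiting >= min_events_needed(n_covariates, events_per_variable)

print(min_events_needed(5))            # 5 covariates * 10 = 50
print(passes_rule_of_ten(60, 940, 5))  # 60 events >= 50
print(passes_rule_of_ten(40, 960, 5))  # 40 events < 50
```

A cohort of 1,000 people can still be too small for 5 covariates if only 40 of them have the outcome.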

Class Imbalance

What if you have a severe class imbalance?

Downsample the majority class
Upsample the minority class
Both
Assign class weights
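The options above can be sketched in plain Python on toy data. This is an illustration only: the function names are my own, the 95:5 data set is made up, and the weighting formula $n / (k \cdot n_c)$ is one common 'balanced' heuristic rather than the only choice.

```python
import random
from collections import Counter

def split_by_class(data):
    """Return (majority_rows, minority_rows) for binary-labelled (x, y) pairs."""
    counts = Counter(y for _, y in data)
    major_label = counts.most_common(1)[0][0]
    major = [row for row in data if row[1] == major_label]
    minor = [row for row in data if row[1] != major_label]
    return major, minor

def downsample(data, rng):
    """Randomly drop majority rows until both classes are the same size."""
    major, minor = split_by_class(data)
    return rng.sample(major, len(minor)) + minor

def upsample(data, rng):
    """Resample minority rows with replacement up to the majority size."""
    major, minor = split_by_class(data)
    return major + rng.choices(minor, k=len(major))

def balanced_weights(data):
    """Weight n / (n_classes * n_class), so each class carries equal total weight."""
    counts = Counter(y for _, y in data)
    n = len(data)
    return {label: n / (len(counts) * c) for label, c in counts.items()}

rng = random.Random(0)  # fixed seed so the sketch is reproducible
data = [(i, 0) for i in range(95)] + [(i, 1) for i in range(5)]  # toy 95:5 split

print(Counter(y for _, y in downsample(data, rng)))
print(Counter(y for _, y in upsample(data, rng)))
print(balanced_weights(data))
```

Downsampling throws data away, and upsampling repeats the same few minority rows. Weighting keeps every row once and changes how much each one counts in the loss. Doing both amounts to a partial downsample followed by a partial upsample.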
