Case-Control Studies

What was discussed last class (that you missed)

Missing Data: MCAR, MAR, MNAR. When can you use imputation?

Single imputation: mean, median, nearest neighbour (NN), zero (not desirable, but used because it’s easy)
Multiple imputation

Feature scaling:

Normalize (min-max) to $[0,1]$: $\frac{x - x_{\min}}{x_{\max} - x_{\min}}$, or z-scaling: $\frac{x - \mu}{\sigma}$
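
A minimal sketch of both scalings in plain Python, to make the formulas concrete. The function names `min_max_scale` and `z_scale` and the `ages` data are made up for illustration; the z-scaling here uses the population standard deviation, which is an assumption (the notes don't say which $\sigma$ was meant).

```python
from statistics import mean, pstdev

def min_max_scale(xs):
    """Rescale values linearly onto [0, 1]."""
    lo, hi = min(xs), max(xs)
    if hi == lo:
        # A constant feature has no spread; map everything to 0.
        return [0.0 for _ in xs]
    return [(x - lo) / (hi - lo) for x in xs]

def z_scale(xs):
    """Centre on the mean and divide by the (population) standard deviation."""
    mu = mean(xs)
    sigma = pstdev(xs)
    if sigma == 0:
        return [0.0 for _ in xs]
    return [(x - mu) / sigma for x in xs]

ages = [20, 30, 40, 50, 60]  # hypothetical feature values
print(min_max_scale(ages))   # [0.0, 0.25, 0.5, 0.75, 1.0]
print(z_scale(ages))         # mean 0, standard deviation 1
```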

Discretization:

Integer encoding: ordinal level. Doesn’t increase your feature space.
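One-hot encoding: nominal level. Adds $n-1$ columns for $n$ categories if you drop one as the reference level (dummy coding); $n$ columns if you keep them all.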
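
A small sketch of the two encodings, assuming a made-up three-level variable. The helper names `integer_encode` and `dummy_encode` are hypothetical, and treating the first level as the reference is just one possible choice.

```python
def integer_encode(values, order):
    """Map each ordinal level to its rank; still a single column."""
    index = {level: i for i, level in enumerate(order)}
    return [index[v] for v in values]

def dummy_encode(values, levels):
    """n-1 indicator columns; levels[0] is the dropped reference level."""
    kept = levels[1:]
    return [[1 if v == level else 0 for level in kept] for v in values]

levels = ["low", "medium", "high"]   # hypothetical ordered categories
stages = ["low", "high", "medium"]

print(integer_encode(stages, levels))  # [0, 2, 1]
print(dummy_encode(stages, levels))    # [[0, 0], [0, 1], [1, 0]]
```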

Outliers: they mess with the loss function.

Cohorts are classified based on exposure and followed around for outcomes. Prospective and retrospective. TEMPORALITY IS THE GREATEST THING ABOUT THIS OMG.

Now consider a glioblastoma. Complicated. Very low prevalence! Cohort studies might not be appropriate. You’d have to follow participants for decades.

So what if you selected on outcomes and mapped back to exposure? That’s what Case-Control Studies are. We are not observing incidence! You are fixing the outcome distribution!

Now you have the same ‘pool’ of people for both your cases and controls. The same population. Cases: they have the outcome. Controls: they don’t. They have to be from the same population since you don’t want to break links. So the Study Design consists of the Population, Cases, and Controls.

👉 Controls are meant to approximate what the cases would have looked like if they had not developed the disease. It’s a bit counterfactual-y, like in Causal Inference. You want to know what the exposure is! What causes glioblastoma?

Metrics

Risk Ratios cannot be computed! You don’t know the true exposed and unexposed totals $(a + b)$ and $(c + d)$ - think about it. You are setting the prevalence in your sample by fixing the number of cases $(a+c)$ and controls $(b+d)$. (There are many ways to get at these sums! Only the data will reveal how they split across exposure.)

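You certainly can compute the Odds Ratio, $\frac{ad}{bc}$. Now for very rare diseases, RR $\approx$ OR. Think about it: when the disease is rare, $a \ll b$ and $c \ll d$, so $\frac{a}{a+b} \approx \frac{a}{b}$ and $\frac{c}{c+d} \approx \frac{c}{d}$.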
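
A rough numeric sketch of both points, assuming the usual 2×2 layout ($a$ = exposed cases, $b$ = exposed non-cases, $c$ = unexposed cases, $d$ = unexposed non-cases). All counts and the control sampling fraction are invented for illustration.

```python
def odds_ratio(a, b, c, d):
    """Cross-product ratio of the 2x2 table."""
    return (a * d) / (b * c)

def risk_ratio(a, b, c, d):
    """Risk in the exposed divided by risk in the unexposed."""
    return (a / (a + b)) / (c / (c + d))

# Hypothetical full cohort with a rare outcome.
a, b, c, d = 30, 99_970, 10, 99_990
rr_cohort = risk_ratio(a, b, c, d)
or_cohort = odds_ratio(a, b, c, d)

# Case-control version: keep every case, sample non-cases at a
# made-up fraction f that does not depend on exposure.
f = 0.001
b_cc, d_cc = f * b, f * d
or_cc = odds_ratio(a, b_cc, c, d_cc)
rr_naive = risk_ratio(a, b_cc, c, d_cc)  # not a valid risk ratio

print(rr_cohort, or_cohort)  # nearly equal because the disease is rare
print(or_cc)                 # same as the cohort OR: f cancels out
print(rr_naive)              # distorted, since a+b and c+d are no longer real totals
```

The sampling fraction cancels in $\frac{ad}{bc}$ but not in the row totals, which is why the odds ratio survives the design and the risk ratio does not.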

Sampling Strategies

This is the hardest part. Remember that the control group is a snapshot of the baseline population.

Base or Case-Base or Case-Cohort Sampling

→ Sample at start

Time period: 10 years. Cases: glioblastoma now. Controls: random people from the population as it was 10 years ago.

Question: What happens when controls develop the outcome? Good question. Use this for really, really rare stuff.

Cumulative Density/Survivor Sampling

→ Sample at end

Controls here are people at the end of the study period who did not develop the outcome/disease. Select randomly.

The problem is Survivor Bias. Other things may have killed people off before the end of the study period, so they never make it into the control pool.

Incidence Density/Risk Set Sampling

→ Sample anytime a case happens!

So for every incident case, you make a risk set and randomly choose a person from this set to compose the control group. It preserves the underlying risk process over time. Good for short-term or fluctuating exposures.

Basically a random sampling from a population at risk.
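
A toy sketch of risk set sampling, under simplifying assumptions: everyone enters follow-up at time 0, each person has a single exit time (event or censoring), and one control is drawn per case. The function name `risk_set_sample`, the cohort and the times are hypothetical.

```python
import random

def risk_set_sample(cohort, rng):
    """cohort maps person_id -> (exit_time, is_case).

    For each case, in order of event time, draw one control at random
    from the people still under follow-up after that time.
    """
    cases = sorted((t, pid) for pid, (t, is_case) in cohort.items() if is_case)
    pairs = []
    for t_case, case_id in cases:
        # The risk set: everyone who has not yet exited at t_case.
        # A future case can appear here and be picked as a control.
        risk_set = [pid for pid, (t_exit, _) in cohort.items()
                    if t_exit > t_case and pid != case_id]
        if risk_set:
            pairs.append((case_id, rng.choice(risk_set), t_case))
    return pairs

cohort = {
    "p1": (2.0, True),   # becomes a case at t = 2
    "p2": (5.0, False),
    "p3": (4.0, True),   # becomes a case at t = 4
    "p4": (1.0, False),  # lost to follow-up before any case occurs
    "p5": (5.0, False),
}
pairs = risk_set_sample(cohort, random.Random(0))
for case_id, control_id, t in pairs:
    print(f"t={t}: case {case_id}, control {control_id}")
```

Note that `p3` is eligible as a control for `p1` even though `p3` later becomes a case; that is what sets this scheme apart from sampling survivors at the end.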

Doll & Hill

Framingham posited the hypothesis. This was the definitive epidemiological proof.
