Matching
Pre-analytic design strategy to control for confounding. You are manipulating data. You have corrupted the study population! Don’t forget this!
“Give me my controls that are most like my cases.”
Probability of matching: 1:1 (low), metric (low), Logistic Regression.
Matching wants to achieve covariate balance. Each treated unit is paired with a similar comparator unit.
“Capacitation” is the knowledge that ‘flows’ through a graph TODO: More on this.
Consider a case and control group where you have 4 people in the former and 19 in the latter for A1C. You match based on percentage and drop the unmatched (these may represent outliers, among other things).
How do you match?
- Balancing Metric
- Distance
- Matching Procedure
Balancing Metric
The distribution f(x | b(x)) of the covariates is the same for treated and comparison groups; that’s what makes b(x) a balancing score.
Raw features and balancing scores (propensity, prognosis, disease risk score)
Raw features: the dumbest. 1:1 exact matching on gender, age, HbA1C, etc. This is usually infeasible. People get dropped unmatched. You are not using your data well.
Another technique is using a score, which is a function of the covariates.
Propensity Score (cohort): P(Treatment | Observed covariates)
Prognosis Score (cohort): predicted outcome under control. P(Outcome | Patient, No exposure)
Disease Risk Score (case-control): P(Outcome = 1 | Patient)
How you derive these scores is usually via Logistic Regression, which is very common (linearity is doing big lifting here!). You can certainly use tree-based methods, SVMs, or neural networks instead.
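As a sketch of that usual approach (all data and coefficients below are made up), here is a propensity model fit by logistic regression, using plain numpy Newton-Raphson so nothing is hidden behind a library:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
age = rng.normal(55.0, 10.0, n)        # made-up covariates
hba1c = rng.normal(7.0, 1.2, n)
X = np.column_stack([np.ones(n), age, hba1c])  # intercept + covariates

# Treatment depends on the covariates, i.e. confounding is built in.
true_logits = 0.05 * (age - 55.0) + 0.8 * (hba1c - 7.0)
treated = rng.binomial(1, 1.0 / (1.0 + np.exp(-true_logits)))

# Fit logistic regression by Newton-Raphson (IRLS).
beta = np.zeros(X.shape[1])
for _ in range(25):
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    W = p * (1.0 - p)
    beta += np.linalg.solve(X.T @ (X * W[:, None]), X.T @ (treated - p))

propensity = 1.0 / (1.0 + np.exp(-X @ beta))   # P(treatment | covariates)
```

Each subject now carries one number, `propensity`, that summarizes all their covariates for matching purposes.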
Distance
You can look at raw features. This is dependent on the type of data (categorical? continuous?). There are 100s of ways.
With binary feature vectors, use Tanimoto/Jaccard or Hamming.
For continuous features, use Euclidean, Mahalanobis, or Canberra. Each has its use.
👉 Which you use depends on your research question. What if you have a lot of zeroes? think informative missingness!
You can also do linear distance or log-linear distance between scores.
Linear: |b(x2) - b(x1)|
Log-linear: |logit(b(x2)) - logit(b(x1))|
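A few of these distances are easy to sketch by hand (the vectors and scores below are invented):

```python
import numpy as np

def hamming(u, v):
    # Fraction of positions where the two vectors disagree.
    return float(np.mean(u != v))

def jaccard(u, v):
    # For binary vectors: 1 - |intersection| / |union|.
    union = np.sum((u == 1) | (v == 1))
    return 1.0 - np.sum((u == 1) & (v == 1)) / union if union else 0.0

def logit(p):
    return np.log(p / (1.0 - p))

u = np.array([1, 0, 1, 1, 0])
v = np.array([1, 1, 1, 0, 0])
print(hamming(u, v))                  # 0.4
print(jaccard(u, v))                  # 0.5
print(abs(logit(0.7) - logit(0.4)))   # log-linear distance between two scores
```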
Matching Procedure
How close do two units have to be to be considered ‘similar’? That’s the Tolerance Limit. A tolerance of zero (exact matching) would completely eliminate bias from the measured covariates! Of course that’s almost never feasible.
Now you can try Fine Balance: focus on one confounding variable (or a few) and require that the groups’ overall distributions of it match exactly. Individual pairs don’t have to match on it, only the group totals.
Now you can try “Within a Caliper”: take the whole pool and drop pairs whose score difference exceeds some threshold (the caliper). This is subject to ye olde bias-variance tradeoff. Narrow caliper? Low bias, high variance. Wide caliper? High bias, low variance.
Now you can also break the scores into strata, like quintiles (not the only choice of strata), and compare within each stratum.
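Stratifying on quintiles of a score can be sketched like this (the scores here are simulated, not real propensities):

```python
import numpy as np

rng = np.random.default_rng(1)
scores = rng.uniform(size=200)                      # e.g. estimated propensity scores
edges = np.quantile(scores, [0.2, 0.4, 0.6, 0.8])   # quintile cut points
strata = np.digitize(scores, edges)                 # labels 0..4, one per subject
# Then compare treated vs. comparison subjects within each stratum.
```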
Now you can do Ratio Matching: think about kNN, where each treated unit is matched to its k nearest comparators.
Is this 1:1?
Nope!
- Fixed Number of comparators (1 to 1)
- Variable number of comparators (1 to many)
- Variable treatment and comparators (many to many)
Replacement
Without replacement: increases bias (later treated units get stuck with worse matches). With replacement: decreases bias but increases variance (the same comparator can be reused).
Think greedy algorithms (you’re making a locally optimal decision at every step).
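A minimal sketch of greedy 1:1 caliper matching on a score (the function name and toy scores are mine), showing how replacement rescues a treated unit whose only good comparator was already taken:

```python
import numpy as np

def greedy_caliper_match(treated_ps, control_ps, caliper=0.05, replacement=False):
    """Greedy 1:1 matching on score distance within a caliper.

    Each treated unit grabs its closest available control: a locally
    optimal decision at every step. Without replacement, that control
    becomes unavailable to later treated units.
    """
    pairs, used = [], set()
    for i, pt in enumerate(treated_ps):
        dists = np.abs(np.asarray(control_ps, dtype=float) - pt)
        if not replacement:
            dists[list(used)] = np.inf       # already-used controls are off limits
        j = int(np.argmin(dists))
        if dists[j] <= caliper:              # pair only if inside the caliper
            pairs.append((i, j))
            used.add(j)
    return pairs

treated = [0.30, 0.52, 0.51]
controls = [0.29, 0.50, 0.90]
print(greedy_caliper_match(treated, controls))                    # [(0, 0), (1, 1)]
print(greedy_caliper_match(treated, controls, replacement=True))  # [(0, 0), (1, 1), (2, 1)]
```

With replacement, treated unit 2 reuses control 1 (lower bias, but the reuse inflates variance); without replacement it goes unmatched because its only close control is gone.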
Notes
Clean/prep your data so you reduce your reliance on matching as much as possible.
IID
Independence is violated! You’re forcing data points to be related to each other!
Count matched pairs by the exposure status of the case and its control:

| | Control E+ | Control E− |
|---|---|---|
| Case E+ | a | b |
| Case E− | c | d |

a and d are concordant pairs. b and c are discordant pairs. So the matched odds ratio is b/c.
You use a McNemar’s chi-squared here because of the independence problem: χ² = (b − c)² / (b + c), built from the discordant pairs only.
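With hypothetical pair counts, the discordant-pair arithmetic looks like:

```python
# Hypothetical matched-pair counts: b = case exposed / control unexposed,
# c = case unexposed / control exposed. Concordant pairs (a, d) drop out.
b, c = 25, 10
or_matched = b / c                    # matched odds ratio
chi2 = (b - c) ** 2 / (b + c)         # McNemar's statistic, 1 df
print(or_matched, chi2)               # OR = 2.5, chi2 ~ 6.43
```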
Now how does Logistic Regression change? The probability here is very simply the probability that the subject is the case in the matched set, based on their covariates. It’s a Softmax function within each matched set (conditional logistic regression).
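A sketch of that softmax, with made-up coefficients for one matched set of a case plus two controls:

```python
import numpy as np

beta = np.array([0.8, -0.3])             # hypothetical fitted coefficients
X_set = np.array([[1.2, 0.5],            # covariates: row 0 is the case,
                  [0.4, 0.9],            # rows 1-2 are its matched controls
                  [0.1, 0.2]])
scores = X_set @ beta
p = np.exp(scores) / np.exp(scores).sum()  # softmax within the matched set
# p[i] = P(subject i is the case | exactly one case in this set)
```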