Research Data Analysis
Gather data → Prep data → Analyze guided by research objectives → Interpret.
“Statistics” by William Hays is a very accessible book. There’s a Design of Experiments class at Mailman that’s highly recommended as well.
You state your hypothesis before you collect your data! No p-hacking here! Be nice!
Exploratory Analysis
Draw plots:
- See normality.
- See skewness and symmetry.
- See modality (how many peaks?)
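The plot-based checks above can be sketched numerically as well. A minimal sketch, assuming scipy is available and using a deliberately right-skewed sample (the lognormal choice is an illustration, not from the notes):

```python
# Exploratory pass: numeric companions to the plots —
# skewness for symmetry, Shapiro-Wilk for normality.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
sample = rng.lognormal(mean=0.0, sigma=0.5, size=500)  # deliberately skewed

skewness = stats.skew(sample)                    # > 0 means right-skewed
shapiro_stat, shapiro_p = stats.shapiro(sample)  # small p: not normal

print(f"skewness={skewness:.2f}, Shapiro-Wilk p={shapiro_p:.4f}")
# A histogram makes modality (number of peaks) visible:
# import matplotlib.pyplot as plt
# plt.hist(sample, bins=30); plt.show()
```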
Do Measures of Central Tendency:
- Do point estimates (mean, median, mode),
- Do interval estimates (CI)
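A minimal sketch of both estimate types, assuming roughly normal data so the t-based confidence interval applies:

```python
# Point estimates (mean, median) and an interval estimate (95% CI
# for the mean, via the t distribution and the standard error).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(loc=10.0, scale=2.0, size=100)  # hypothetical sample

mean = x.mean()
median = np.median(x)
ci = stats.t.interval(0.95, df=len(x) - 1, loc=mean, scale=stats.sem(x))

print(f"mean={mean:.2f}, median={median:.2f}, 95% CI=({ci[0]:.2f}, {ci[1]:.2f})")
```

(For continuous data the mode is usually read off a histogram rather than computed exactly.)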
Confirmatory Analysis (AKA Hypothesis Testing)
Do association between variables. Do differences between groups. Maybe do non-parametric testing.
- Here, you invoke Lord Popper and establish a null hypothesis (you can only falsify).
- Then you collect the sample.
- Then you calculate a test statistic and compare to a critical value.
- Then see if you can reject the null.
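The four steps above can be sketched with a one-sample t-test; the null value of 0 and the sample itself are hypothetical:

```python
# Confirmatory workflow: state the null (population mean = 0),
# collect the sample, compute the test statistic, compare to the
# critical value, then decide whether to reject.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
sample = rng.normal(loc=0.5, scale=1.0, size=50)  # true mean is off 0

t_stat = sample.mean() / (sample.std(ddof=1) / np.sqrt(len(sample)))
critical = stats.t.ppf(0.975, df=len(sample) - 1)  # two-sided, alpha=0.05

reject = abs(t_stat) > critical
print(f"t={t_stat:.2f}, critical={critical:.2f}, reject null: {reject}")
```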
“Science is a social enterprise” — Kuhn. There are ‘fashions’ and ‘trends’ in the world of statistics.
Measuring Correlation
Continuous variables. Pearson’s Coefficient is common here (assumes: continuous variables, normality, linear association). The null is that the coefficient is zero.
In the end this is a Signal/Noise ratio. How much signal is there amongst the noise? A lot of test statistics do this: Pearson puts the covariance in the numerator over the product of the two standard deviations in the denominator.
- Measure of strength of signal in your sample → Test Statistic.
- Is the signal robust enough to reject the null? → p-value.
Then there’s a test of significance. Note one- and two-sided tests (they encode an assumption about the direction of the association). TODO: Unpack this. Two-sided is more conservative.
Now if Pearson assumptions are not met, you can do Spearman Rank Coefficient. This is a non-parametric approach. Assume monotonicity!
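A sketch contrasting the two coefficients on hypothetical monotonic-but-nonlinear data (the exponential data-generating choice is an assumption for illustration):

```python
# Pearson assumes linearity; Spearman only assumes monotonicity.
# On exponential-shaped data, Spearman's rho stays near 1 while
# Pearson's r is pulled down by the nonlinearity.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.uniform(0, 5, size=200)
y = np.exp(x) + rng.normal(scale=1.0, size=200)  # monotonic, not linear

r, p_r = stats.pearsonr(x, y)       # linear association
rho, p_rho = stats.spearmanr(x, y)  # rank-based, monotonic association

print(f"Pearson r={r:.2f} (p={p_r:.1e}), Spearman rho={rho:.2f} (p={p_rho:.1e})")
```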
The 2x2 Table
The Chi-squared test is also a signal/noise comparison: χ² = Σ (O − E)²/E, where O is the observed frequency and E is the expected frequency under the null of no association.
You compute the degree of association with (a) Cohen’s or (b) Odds Ratio.
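A sketch on a hypothetical 2x2 table (rows = exposure, columns = outcome), using scipy's chi-squared test plus the odds ratio computed by hand from the cross-product:

```python
# Chi-squared tests whether association exists; the odds ratio
# measures its degree: OR = (a*d) / (b*c) for a 2x2 table.
import numpy as np
from scipy import stats

table = np.array([[30, 10],
                  [15, 45]])  # hypothetical counts

chi2, p, dof, expected = stats.chi2_contingency(table)
odds_ratio = (table[0, 0] * table[1, 1]) / (table[0, 1] * table[1, 0])

print(f"chi2={chi2:.1f} (df={dof}, p={p:.2e}), OR={odds_ratio:.1f}")
```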
Differences between groups
The null changes here: “There is no difference between groups”.
For the t-statistic, the signal is the difference between the two group means, and the noise is the standard error built from the pooled standard deviation.
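A sketch of that signal/noise computation for two hypothetical groups, checked against scipy's built-in t-test (which also pools variances by default):

```python
# Two-sample t: signal = difference of means, noise = standard
# error from the pooled standard deviation.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
a = rng.normal(loc=0.0, scale=1.0, size=60)
b = rng.normal(loc=1.0, scale=1.0, size=60)

diff = a.mean() - b.mean()                       # signal
sp = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)  # pooled SD (equal n)
se = sp * np.sqrt(2 / len(a))                    # noise
t_manual = diff / se

t_scipy, p = stats.ttest_ind(a, b)  # pooled-variance t-test
print(f"t={t_manual:.2f}, scipy t={t_scipy:.2f}, p={p:.2e}")
```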
Now: Groups are “the same” if the variance within groups is large relative to the variance between groups. This is what the F-statistic does. It looks at the ratio of between-group to within-group mean squares (each a sum of squares divided by its degrees of freedom).
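The between/within mean-squares ratio, computed by hand on hypothetical groups and checked against scipy's one-way ANOVA:

```python
# F = (between-group mean square) / (within-group mean square).
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
g1 = rng.normal(0.0, 1.0, 40)
g2 = rng.normal(0.0, 1.0, 40)
g3 = rng.normal(1.5, 1.0, 40)  # one group shifted

groups = [g1, g2, g3]
grand = np.concatenate(groups).mean()
n, k = sum(len(g) for g in groups), len(groups)

ss_between = sum(len(g) * (g.mean() - grand) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
f_manual = (ss_between / (k - 1)) / (ss_within / (n - k))

f_stat, p = stats.f_oneway(*groups)
print(f"F={f_manual:.2f}, scipy F={f_stat:.2f}, p={p:.2e}")
```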
There’s McNemar’s test too (for paired nominal data).
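McNemar's statistic is simple enough to compute directly; a sketch with hypothetical discordant counts (only the discordant cells of the paired 2x2 table carry information), using the continuity-corrected form:

```python
# McNemar's test for paired binary outcomes (e.g. before/after):
# chi2 = (|b - c| - 1)^2 / (b + c), referred to chi-squared with df=1.
from scipy import stats

b, c = 25, 10  # hypothetical discordant counts

chi2 = (abs(b - c) - 1) ** 2 / (b + c)  # continuity-corrected
p = stats.chi2.sf(chi2, df=1)

print(f"chi2={chi2:.2f}, p={p:.3f}")
```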