Research Data Analysis

Gather data → Prep Data → Analyze guided by research objectives → Interpret.

“Statistics” by William Hays is a very accessible book. There’s a Design of Experiments class at Mailman that’s highly recommended as well.

warning

You state your hypothesis before you collect your data! No p-hacking here! Be nice!

Exploratory Analysis

Draw plots:

  • See normality.
  • See skewness and symmetry.
  • See modality (how many peaks?)

Do Measures of Central Tendency:

  • Do point estimates (mean, median, mode).
  • Do interval estimates (confidence intervals).
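A minimal sketch of the point and interval estimates above, using scipy; the sample values are made up for illustration:

```python
# Point and interval estimates for a small sample (toy data).
import numpy as np
from scipy import stats

x = np.array([4.1, 5.0, 4.7, 5.3, 4.9, 5.8, 4.4, 5.1])

mean = x.mean()          # point estimate: mean
median = np.median(x)    # point estimate: median

# 95% CI for the mean using the t distribution (small sample).
sem = stats.sem(x)
ci = stats.t.interval(0.95, df=len(x) - 1, loc=mean, scale=sem)
```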

Confirmatory Analysis (AKA Hypothesis Testing)

Do association between variables. Do differences between groups. Maybe do non-parametric testing.

  1. Here, you invoke Lord Popper and establish a null hypothesis (you can only falsify).
  2. Then you collect the sample.
  3. Then you calculate a test statistic and compare to a critical value.
  4. Then see if you can reject the null.
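The four steps above can be sketched with a one-sample t-test; the data and the null value of 5.0 are made up:

```python
# Null (stated BEFORE collecting data): the population mean is 5.0.
import numpy as np
from scipy import stats

alpha = 0.05

# Step 2: collect the sample (toy data here).
x = np.array([5.6, 6.1, 5.9, 6.4, 5.8, 6.0, 6.3, 5.7])

# Step 3: test statistic and two-sided critical value.
t_stat, p_value = stats.ttest_1samp(x, popmean=5.0)
t_crit = stats.t.ppf(1 - alpha / 2, df=len(x) - 1)

# Step 4: reject the null if |t| exceeds the critical value
# (equivalently, if p < alpha).
reject = abs(t_stat) > t_crit
```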
Note

“Science is a social enterprise” — Kuhn. There are ‘fashions’ and ‘trends’ in the world of statistics.

Measuring Correlation

Continuous variables. Pearson’s Coefficient is common here (assumes: continuous, normal, linear association). $\rho \in [-1, +1]$. The null is that $\rho$ is zero.
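A minimal sketch of Pearson’s coefficient testing that null, with made-up, roughly linear data:

```python
# Pearson's r on toy data; H0: rho = 0.
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2])  # roughly y = 2x

r, p_value = stats.pearsonr(x, y)
```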

It's about Signal to Noise

In the end this is a Signal/Noise ratio. How much signal is there amongst the noise? A lot of test statistics work this way: Pearson puts the covariance in the numerator and the product of the two standard deviations (the noise) in the denominator.

  1. Measure of strength of signal in your sample → Test Statistic.
  2. Is the signal robust enough to reject the null? → p-value.

Then there’s a test of significance. Note one- and two-sided tests (about the direction of association). TODO: Unpack this. Two-sided is more conservative.

Now if Pearson’s assumptions are not met, you can use the Spearman Rank Coefficient. This is a non-parametric approach. Assume monotonicity!
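A quick sketch of Spearman on a monotonic but decidedly non-linear relationship (toy data):

```python
# Spearman's rank correlation: monotonic, not linear.
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.exp(x)  # perfectly monotonic in x, far from linear

rho, p_value = stats.spearmanr(x, y)
```

Because the relationship is perfectly monotonic, the ranks agree exactly and Spearman’s rho is 1, even though Pearson’s r would be well below 1 here.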

The 2x2 Table

The Chi-squared test has the same signal/noise structure. O is the observed frequency and E is the expected frequency.

$\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$

You compute the degree of association with (a) Cohen’s $\kappa$ or (b) the Odds Ratio.
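A sketch of the chi-squared test and odds ratio on a made-up 2x2 table:

```python
# Chi-squared test and odds ratio for a 2x2 table (toy counts).
import numpy as np
from scipy import stats

#                 outcome+  outcome-
table = np.array([[30, 10],    # exposed
                  [15, 45]])   # unexposed

chi2, p_value, dof, expected = stats.chi2_contingency(table)

# Odds ratio for the 2x2 table: (a*d) / (b*c).
a, b = table[0]
c, d = table[1]
odds_ratio = (a * d) / (b * c)
```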

Differences between groups

The null changes here: “There is no difference between groups”.

For the t-statistic, the signal is the difference between means (for two groups), and the noise is the pooled standard deviation.
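A minimal sketch of that two-sample t-test with made-up group data:

```python
# Two-sample t-test: signal = difference in means,
# noise = pooled standard deviation (toy data).
import numpy as np
from scipy import stats

group_a = np.array([5.1, 4.8, 5.6, 5.2, 4.9, 5.4])
group_b = np.array([6.0, 6.3, 5.8, 6.5, 6.1, 5.9])

# Default assumes equal variances (pooled SD).
t_stat, p_value = stats.ttest_ind(group_a, group_b)
```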

Now: groups look the same when the variance within groups dominates the variance between groups. This is what the F-statistic captures. It is the ratio of between-group to within-group mean squares (each a sum of squares divided by its degrees of freedom).
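A sketch of that ratio via one-way ANOVA on three made-up groups:

```python
# One-way ANOVA: F = between-group mean square
# over within-group mean square (toy data).
import numpy as np
from scipy import stats

g1 = np.array([5.1, 4.8, 5.6, 5.2, 4.9])
g2 = np.array([6.0, 6.3, 5.8, 6.5, 6.1])
g3 = np.array([4.2, 4.5, 3.9, 4.4, 4.1])

f_stat, p_value = stats.f_oneway(g1, g2, g3)
```

With groups this well separated, the between-group variance dwarfs the within-group variance, so F is large and p is small.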

There’s McNemar’s test too.
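McNemar’s test is for paired binary outcomes; it only uses the discordant cells of the table. A sketch of the chi-squared form (no continuity correction), with a made-up before/after table:

```python
# McNemar's test computed from the discordant cells (toy counts).
import numpy as np
from scipy import stats

#                 after: yes  after: no
table = np.array([[20,        5],    # before: yes
                  [15,        60]])  # before: no

b = table[0, 1]  # switched yes -> no
c = table[1, 0]  # switched no -> yes

# Chi-squared form, without continuity correction:
mcnemar_stat = (b - c) ** 2 / (b + c)
p_value = stats.chi2.sf(mcnemar_stat, df=1)
```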