Research Data Analysis
Gather data → Prep data → Analyze guided by research objectives → Interpret.
“Statistics” by William Hays is a very accessible book. There’s a Design of Experiments class at Mailman that’s highly recommended as well.
You state your hypothesis before you collect your data! No p-hacking here! Be nice!
Exploratory Analysis
Draw plots:
- See normality.
- See skewness and symmetry.
- See modality (how many peaks?)
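The plot-based checks above can be sketched numerically as well. A minimal sketch, assuming scipy is available and using a deliberately right-skewed sample (the lognormal choice is an illustration, not from the notes):

```python
# Exploratory pass: numeric companions to the plots —
# skewness for symmetry, Shapiro-Wilk for normality.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
sample = rng.lognormal(mean=0.0, sigma=0.5, size=500)  # deliberately skewed

skewness = stats.skew(sample)                    # > 0 means right-skewed
shapiro_stat, shapiro_p = stats.shapiro(sample)  # small p: not normal

print(f"skewness={skewness:.2f}, Shapiro-Wilk p={shapiro_p:.4f}")
# A histogram makes modality (number of peaks) visible:
# import matplotlib.pyplot as plt
# plt.hist(sample, bins=30); plt.show()
```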
Do Measures of Central Tendency:
- Do point estimates (mean, median, mode),
- Do interval estimates (CI)
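A minimal sketch of both estimate types, assuming roughly normal data so the t-based confidence interval applies:

```python
# Point estimates (mean, median) and an interval estimate (95% CI
# for the mean, via the t distribution and the standard error).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(loc=10.0, scale=2.0, size=100)  # hypothetical sample

mean = x.mean()
median = np.median(x)
ci = stats.t.interval(0.95, df=len(x) - 1, loc=mean, scale=stats.sem(x))

print(f"mean={mean:.2f}, median={median:.2f}, 95% CI=({ci[0]:.2f}, {ci[1]:.2f})")
```

(For continuous data the mode is usually read off a histogram rather than computed exactly.)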
Confirmatory Analysis (AKA Hypothesis Testing)
Do association between variables. Do differences between groups. Maybe do non-parametric testing.
- Here, you invoke Lord Popper and establish a null hypothesis (you can only falsify).
- Then you collect the sample.
- Then you calculate a test statistic and compare to a critical value.
- Then see if you can reject the null.
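The four steps above can be sketched with a one-sample t-test; the null value of 0 and the sample itself are hypothetical:

```python
# Confirmatory workflow: state the null (population mean = 0),
# collect the sample, compute the test statistic, compare to the
# critical value, then decide whether to reject.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
sample = rng.normal(loc=0.5, scale=1.0, size=50)  # true mean is off 0

t_stat = sample.mean() / (sample.std(ddof=1) / np.sqrt(len(sample)))
critical = stats.t.ppf(0.975, df=len(sample) - 1)  # two-sided, alpha=0.05

reject = abs(t_stat) > critical
print(f"t={t_stat:.2f}, critical={critical:.2f}, reject null: {reject}")
```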
“Science is a social enterprise” — Kuhn. There are ‘fashions’ and ‘trends’ in the world of statistics.
Measuring Correlation
Continuous variables. Pearson’s Coefficient is common here (assumes: continuous variables, normality, linear association). The null is that the coefficient is zero.
In the end this is a Signal/Noise ratio. How much signal is there amongst the noise? A lot of test statistics do this: Pearson puts the covariance in the numerator over the product of the two standard deviations in the denominator.
- Measure of strength of signal in your sample → Test Statistic.
- Is the signal robust enough to reject the null? → p-value.
Then there’s a test of significance. Note one- and two-sided tests (they encode an assumption about the direction of the association). TODO: Unpack this. Two-sided is more conservative.
Now if Pearson assumptions are not met, you can do Spearman Rank Coefficient. This is a non-parametric approach. Assume monotonicity!
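A sketch contrasting the two coefficients on hypothetical monotonic-but-nonlinear data (the exponential data-generating choice is an assumption for illustration):

```python
# Pearson assumes linearity; Spearman only assumes monotonicity.
# On exponential-shaped data, Spearman's rho stays near 1 while
# Pearson's r is pulled down by the nonlinearity.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.uniform(0, 5, size=200)
y = np.exp(x) + rng.normal(scale=1.0, size=200)  # monotonic, not linear

r, p_r = stats.pearsonr(x, y)       # linear association
rho, p_rho = stats.spearmanr(x, y)  # rank-based, monotonic association

print(f"Pearson r={r:.2f} (p={p_r:.1e}), Spearman rho={rho:.2f} (p={p_rho:.1e})")
```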
The 2x2 Table
The Chi-squared test is also a signal/noise comparison: χ² = Σ (O − E)²/E, where O is the observed frequency and E is the expected frequency under the null of no association.
You compute the degree of association with (a) Cohen’s or (b) Odds Ratio.
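A sketch on a hypothetical 2x2 table (rows = exposure, columns = outcome), using scipy's chi-squared test plus the odds ratio computed by hand from the cross-product:

```python
# Chi-squared tests whether association exists; the odds ratio
# measures its degree: OR = (a*d) / (b*c) for a 2x2 table.
import numpy as np
from scipy import stats

table = np.array([[30, 10],
                  [15, 45]])  # hypothetical counts

chi2, p, dof, expected = stats.chi2_contingency(table)
odds_ratio = (table[0, 0] * table[1, 1]) / (table[0, 1] * table[1, 0])

print(f"chi2={chi2:.1f} (df={dof}, p={p:.2e}), OR={odds_ratio:.1f}")
```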
Differences between groups
The null changes here: “There is no difference between groups”.
For the t-statistic, the signal is the difference between the two group means, and the noise is the standard error built from the pooled standard deviation.
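A sketch of that signal/noise computation for two hypothetical groups, checked against scipy's built-in t-test (which also pools variances by default):

```python
# Two-sample t: signal = difference of means, noise = standard
# error from the pooled standard deviation.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
a = rng.normal(loc=0.0, scale=1.0, size=60)
b = rng.normal(loc=1.0, scale=1.0, size=60)

diff = a.mean() - b.mean()                       # signal
sp = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)  # pooled SD (equal n)
se = sp * np.sqrt(2 / len(a))                    # noise
t_manual = diff / se

t_scipy, p = stats.ttest_ind(a, b)  # pooled-variance t-test
print(f"t={t_manual:.2f}, scipy t={t_scipy:.2f}, p={p:.2e}")
```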
Now: Groups are “the same” if the variance within groups is large relative to the variance between groups. This is what the F-statistic does. It looks at the ratio of between-group to within-group mean squares (each a sum of squares divided by its degrees of freedom).
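The between/within mean-squares ratio, computed by hand on hypothetical groups and checked against scipy's one-way ANOVA:

```python
# F = (between-group mean square) / (within-group mean square).
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
g1 = rng.normal(0.0, 1.0, 40)
g2 = rng.normal(0.0, 1.0, 40)
g3 = rng.normal(1.5, 1.0, 40)  # one group shifted

groups = [g1, g2, g3]
grand = np.concatenate(groups).mean()
n, k = sum(len(g) for g in groups), len(groups)

ss_between = sum(len(g) * (g.mean() - grand) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
f_manual = (ss_between / (k - 1)) / (ss_within / (n - k))

f_stat, p = stats.f_oneway(*groups)
print(f"F={f_manual:.2f}, scipy F={f_stat:.2f}, p={p:.2e}")
```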
There’s McNemar’s test too (for paired nominal data).
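McNemar's statistic is simple enough to compute directly; a sketch with hypothetical discordant counts (only the discordant cells of the paired 2x2 table carry information), using the continuity-corrected form:

```python
# McNemar's test for paired binary outcomes (e.g. before/after):
# chi2 = (|b - c| - 1)^2 / (b + c), referred to chi-squared with df=1.
from scipy import stats

b, c = 25, 10  # hypothetical discordant counts

chi2 = (abs(b - c) - 1) ** 2 / (b + c)  # continuity-corrected
p = stats.chi2.sf(chi2, df=1)

print(f"chi2={chi2:.2f}, p={p:.3f}")
```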