t-tests, χ², F-tests done by hand

Part 2 — Hypothesis testing without p-hacking

Learning objectives

Compute the ONE-SAMPLE t-statistic $T = (\bar X - \mu_0)/(s/\sqrt n)$ by hand and recognise it as the standardised distance of the sample mean from the null mean in units of the SE
Compute the TWO-SAMPLE WELCH t-statistic $T = (\bar X_1 - \bar X_2)/\sqrt{s_1^2/n_1 + s_2^2/n_2}$ and its Welch–Satterthwaite degrees of freedom, and use it as the DEFAULT for two-sample comparisons (not the Student equal-variance pooled-s form)
Recognise the PAIRED t-test as a one-sample t on the differences $d_i = x_i - y_i$ ; explain why pairing usually buys power (within-subject SD < between-subject SD)
Compute the χ² statistic for a 2-way contingency table $\chi^2 = \sum (O - E)^2/E$ with expected counts $E_{ij} = R_i C_j / N$ under independence, and state the df = (r−1)(c−1) rule
Apply the expected-count-≥-5 rule of thumb (Cochran 1954) and know to switch to FISHER'S EXACT TEST for small tables
Compute the one-way ANOVA F-statistic $F = MS_b / MS_w$ with $SS_b = \sum_g n_g (\bar x_g - \bar x)^2$ and $SS_w = \sum_g \sum_i (x_{gi} - \bar x_g)^2$ ; df_b = k−1, df_w = N−k
Distinguish the F-TEST OF VARIANCES $F = s_1^2/s_2^2$ from the ANOVA omnibus F, and know that the variance-ratio F is BRITTLE under non-normality (Box 1953) — Levene's or Brown–Forsythe replaces it
Recall the assumption profile of each test: t-tests robust to mild non-normality (Boneau 1960) but sensitive to outliers; χ² independence sensitive to small expected counts; ANOVA F robust to non-normality with equal n, sensitive to unequal variance + unequal n
Recognise the ROBUST ALTERNATIVES: Mann–Whitney / Wilcoxon for t; Fisher's exact for χ²; Levene / Brown–Forsythe for variance comparison; Welch's ANOVA for unequal variances; permutation tests in general (Conover 1999; Pesarin & Salmaso 2010)
Connect the by-hand mechanics to the N-P / §2.2 framework: each test statistic has a CENTRAL reference distribution under $H_0$ and a NONCENTRAL version under $H_1$ ; rejecting at α uses the central tail; power uses the noncentral tail

§2.1 framed hypothesis testing as a decision problem with α and β as its operating characteristics. §2.2 turned α and β into research-planning numbers via power analysis. §2.3 descends from those abstractions to the ACTUAL MECHANICS — the formulas that have been the workhorses of applied statistics for a century. Three test families dominate the literature: t-tests for means (Student 1908, Welch 1947), χ²-tests for categorical/contingency data (Pearson 1900), and F-tests for variances and ANOVA omnibus comparisons (Fisher 1925). Every applied paper you will read uses at least one of these three.

The teaching arc of §2.3 has five stops. First, the ONE-, TWO-SAMPLE, AND PAIRED t-TESTS — formulas, df conventions, and the Welch–Satterthwaite default that has quietly replaced Student's pooled-variance form. Second, the χ² TEST OF INDEPENDENCE for contingency tables — the expected-count-≥-5 rule, the (r−1)(c−1) df, and Fisher's exact alternative for small tables. Third, the F-TEST for variances AND the one-way ANOVA F — the same statistic family doing two different jobs, with very different assumption robustness. Fourth, an HONEST INVENTORY of when each test breaks: the analytic results in Boneau (1960) on t-test robustness, Box (1953) on F-test fragility, Cochran (1954) on χ² small-cell behaviour. Fifth, the ROBUST ALTERNATIVES (rank-based, exact, permutation) that you should know exist when the parametric assumptions fail. Two widgets thread the section: the first WALKS THROUGH the by-hand arithmetic for each test, the second SIMULATES the Type-I rate under controlled assumption violations.

One framing note before the formulas. Every test in §2.3 has the same structure: a TEST STATISTIC that is a fixed recipe applied to the sample, a REFERENCE DISTRIBUTION that the statistic follows under $H_0$ , and a CRITICAL VALUE that defines the rejection region at level α. The recipe and the reference distribution are what the textbook formulas give you; the reference distribution carries the $H_0$ probability calculation that turns an observed statistic into a p-value. Compute the statistic by hand, look up the critical value (or the p-value) in the reference distribution, and the decision is determined. That is the entire mechanics of a test in twenty words.

The one-sample t-test

The simplest setting: iid sample $X_1, \ldots, X_n \sim \mathcal{N}(\mu, \sigma^2)$ with $\sigma$ UNKNOWN, testing $H_0: \mu = \mu_0$ against $H_1: \mu \ne \mu_0$ . The natural standardised statistic IF $\sigma$ were known would be the z-statistic $Z = \sqrt n (\bar X - \mu_0)/\sigma$ , which is standard-normal under $H_0$ . The trouble is $\sigma$ is unknown. The Student t-test (Gosset 1908) plugs in the SAMPLE STANDARD DEVIATION $s$ for $\sigma$ and asks what distribution the resulting statistic has.

Define the SAMPLE MEAN and the SAMPLE SD with Bessel's n−1 correction:

\bar X = \frac{1}{n} \sum_{i=1}^n X_i, \qquad s^2 = \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar X)^2.

The t-statistic is

T = \frac{\bar X - \mu_0}{s / \sqrt n}.

Reading: T is the signed distance of the sample mean from the null mean, measured in units of the ESTIMATED standard error $s/\sqrt n$ . Gosset's result (Student 1908, Biometrika) is that under $H_0$ , $T \sim t_{n-1}$ — the Student-t distribution with n − 1 degrees of freedom. The df reduction reflects the one parameter (μ) estimated in computing $s$ .

The two-tailed test rejects $H_0$ at level α when $|T| > t_{1 - \alpha/2, n-1}$ . At $\alpha = 0.05$ and $n = 10$ (so df = 9), the critical value is $t_{0.975, 9} \approx 2.262$ ; at $n = 30$ it is $t_{0.975, 29} \approx 2.045$ ; as $n \to \infty$ the t distribution converges to standard normal and the critical value approaches 1.96.

Worked example. A new oven thermometer is calibrated against a 100°C reference. Eight readings: 99.6, 100.2, 99.4, 100.1, 99.9, 100.0, 99.7, 99.5. Test $H_0: \mu = 100$ .

n = 8, $\bar X = (99.6 + 100.2 + 99.4 + 100.1 + 99.9 + 100.0 + 99.7 + 99.5)/8 = 798.4/8 = 99.80$ .
Deviations: −0.20, +0.40, −0.40, +0.30, +0.10, +0.20, −0.10, −0.30. Squared: 0.04, 0.16, 0.16, 0.09, 0.01, 0.04, 0.01, 0.09. Sum: 0.60.
$s^2 = 0.60 / 7 \approx 0.0857$ ; $s \approx 0.293$ .
$\text{SE} = s/\sqrt 8 \approx 0.293/2.828 \approx 0.1036$ .
$T = (99.80 - 100)/0.1036 \approx -1.93$ .
df = 7. Critical value $t_{0.975, 7} \approx 2.365$ . $|T| = 1.93 < 2.365$ : FAIL to reject. The p-value is $\approx 0.095$ . The data are consistent with the thermometer being correctly calibrated at α = 0.05.

Things the example illustrates. (i) The t-statistic is a UNITLESS standardised distance — it ignores whether you measure temperature in Celsius or Fahrenheit. (ii) The decision is determined by comparing |T| to the critical value, not by any subjective judgement about whether the mean of 99.80 "looks close" to 100. (iii) Failing to reject is NOT proof of $H_0$ — the 95% CI for μ is $99.80 \pm 2.365 \cdot 0.1036 = [99.55, 100.05]$ , so means as different from 100 as 99.55 are still compatible with the data. §2.2's underpowered-study cautions apply: a non-significant result with a wide CI is INCONCLUSIVE, not negative evidence.

The two-sample t-test: Welch is the default

For two independent samples $X_1, \ldots, X_{n_1} \sim \mathcal{N}(\mu_1, \sigma_1^2)$ and $Y_1, \ldots, Y_{n_2} \sim \mathcal{N}(\mu_2, \sigma_2^2)$ , testing $H_0: \mu_1 = \mu_2$ , the textbook used to teach Student's POOLED-VARIANCE t-test, which assumes $\sigma_1 = \sigma_2$ and pools the two sample variances into one estimate. Modern practice uses WELCH'S t-test (Welch 1947, Biometrika), which does NOT assume equal variances and computes each group's SE separately:

T = \frac{\bar X - \bar Y}{\sqrt{s_1^2/n_1 + s_2^2/n_2}}.

The denominator is the SE of the difference of means estimated from the two sample variances — exactly the right SE under the only assumption that the two samples are independent and approximately normal. Under $H_0$ , T is approximately t-distributed with the WELCH–SATTERTHWAITE degrees of freedom

\nu = \frac{(s_1^2/n_1 + s_2^2/n_2)^2}{(s_1^2/n_1)^2/(n_1-1) + (s_2^2/n_2)^2/(n_2-1)}.

Reading the Welch df: it INTERPOLATES between $\min(n_1, n_2) - 1$ (worst case, one group dominates the variance) and $n_1 + n_2 - 2$ (Student's pooled-variance df, attained when $s_1 = s_2$ ). The df is in general FRACTIONAL — that is fine; software uses it directly. By hand, round down to an integer for table lookup (Welch 1947 §6 makes this precise).

Why Welch is the modern default: when the variances ARE equal, Welch loses very little efficiency relative to Student's pooled-variance form. When variances are UNEQUAL — which is common — Student's pooled form has the wrong Type-I rate (inflated if the smaller group has the larger variance, deflated otherwise; see Ruxton 2006 Behav Ecol for a clean review). Welch corrects this with one extra formula and is robust to which group has which variance. There is essentially no reason to ever default to the pooled-variance form. Most modern textbook treatments (Wasserman 2004 §10.6, R's default t.test() behaviour) make Welch the default.

Worked example. Two fertilisers, A and B, yields per plot (g):

A (n₁ = 6): 22, 25, 23, 27, 24, 26. $\bar X_A = 147/6 = 24.5$ . Deviations: −2.5, 0.5, −1.5, 2.5, −0.5, 1.5. Squared sum = 6.25 + 0.25 + 2.25 + 6.25 + 0.25 + 2.25 = 17.5. $s_A^2 = 17.5/5 = 3.5$ .
B (n₂ = 7): 28, 31, 26, 30, 32, 29, 33. $\bar X_B = 209/7 \approx 29.857$ . Deviations: −1.857, 1.143, −3.857, 0.143, 2.143, −0.857, 3.143. Squared sum ≈ 3.45 + 1.31 + 14.88 + 0.02 + 4.59 + 0.73 + 9.88 ≈ 34.86. $s_B^2 = 34.86/6 \approx 5.81$ .
$s_A^2/n_1 = 0.583$ , $s_B^2/n_2 = 0.830$ . SE = $\sqrt{0.583 + 0.830} = \sqrt{1.413} \approx 1.189$ .
$T = (24.5 - 29.857)/1.189 \approx -4.51$ .
$\nu = (1.413)^2 / (0.583^2/5 + 0.830^2/6) = 1.997 / (0.068 + 0.115) = 1.997 / 0.183 \approx 10.9$ .
Critical value $t_{0.975, 10.9} \approx 2.20$ . $|T| = 4.51 > 2.20$ : REJECT $H_0$ . p-value $\approx 0.001$ . Fertiliser B yields significantly more than A.

The Welch df (≈ 11) is BETWEEN the equal-variance pooled df of $n_1 + n_2 - 2 = 11$ and the worst-case $\min(n_1, n_2) - 1 = 5$ . The data here have $s_B/s_A = \sqrt{5.81/3.5} \approx 1.29$ , so the variances are mildly unequal but not extreme — Welch gives essentially the Student answer here. The structural point: Welch costs you nothing when variances are equal and protects you when they are not.

The paired t-test

When the data are PAIRED — same subject measured under two conditions, twins matched on covariates, before/after measurements — the paired t-test is the right tool. The recipe: compute the per-pair differences $d_i = x_i - y_i$ , then run a ONE-SAMPLE t-test on the $d_i$ 's testing $H_0: \mu_d = 0$ against $H_1: \mu_d \ne 0$ .

T = \frac{\bar d}{s_d / \sqrt n} \sim t_{n-1} \text{ under } H_0.

The crucial difference from a two-sample test: the paired test eliminates BETWEEN-SUBJECT variability. If x_i and y_i are correlated within subject (which they typically are — the same subject is at the same baseline in both conditions), the SD of the differences $s_d$ is SMALLER than the SD of x or y individually. That smaller SE buys POWER — a paired design with the same n almost always has more power than an unpaired design.

Worked example. Reaction times (ms) for 6 subjects before and after caffeine:

Subject: 1, 2, 3, 4, 5, 6.
Before: 245, 312, 278, 295, 260, 320.
After: 220, 290, 260, 280, 245, 305.
$d_i = x_i - y_i$ : 25, 22, 18, 15, 15, 15.
$\bar d = (25+22+18+15+15+15)/6 = 110/6 \approx 18.33$ .
Deviations: 6.67, 3.67, −0.33, −3.33, −3.33, −3.33. Squared sum ≈ 44.49 + 13.47 + 0.11 + 11.09 + 11.09 + 11.09 ≈ 91.34. $s_d^2 = 91.34/5 \approx 18.27$ . $s_d \approx 4.27$ .
SE = $4.27/\sqrt 6 \approx 1.745$ .
$T = 18.33 / 1.745 \approx 10.51$ . df = 5. $t_{0.975, 5} \approx 2.571$ . REJECT. p ≪ 0.001. Caffeine reduces reaction time.

Compare to running a two-sample test on the same data ignoring pairing: the between-subject SD of "Before" is much larger (subjects 1 and 5 differ by 60 ms at baseline), so the unpaired SE would be 20× larger and the unpaired T would be a fraction of the paired one. Pairing buys POWER precisely BECAUSE it cancels the between-subject baseline differences.

The first widget makes the recipe live. Pick a test, edit the sample data (or reroll a random draw), and the widget computes the test statistic ROW BY ROW: every intermediate number (sample mean, sum of squared deviations, sample variance, SE, T, df) is exposed. The reference distribution is rendered with the observed statistic marked and the rejection region shaded.

Things to verify in the widget:

One-sample t, default sample (random draw, μ₀ = 0): the widget should reject H₀ on the default seed since the generator pulls from N(0.7, 1) (μ shifted by 0.7). Reroll several times; the rejection rate across draws is the POWER of the test at this effect, not 5% — 5% is the rate UNDER H₀.
Two-sample Welch t: notice the Welch–Satterthwaite ν is fractional (e.g., 14.7), not an integer like n_1 + n_2 − 2. The closer the two sample variances are, the closer ν gets to that pooled df.
Paired t: replace the generated y_i with copies of x_i shifted by a CONSTANT. The differences d_i become a constant, s_d → 0, T → ∞. Pairing eliminates between-subject noise completely in this degenerate case.
χ² independence: notice the "min expected" row. Make a small table where the minimum expected cell is < 5 — the widget flags it and explicitly recommends Fisher's exact instead.
ANOVA F (k = 3 groups): the F statistic IS a ratio of mean-squares, not a t. Notice that doubling all the data values by a constant changes nothing (F is scale-invariant); doubling within-group spread while leaving means the same shrinks F (and the test fails to reject).

The χ² test of independence (Pearson 1900)

For a two-way r × c contingency table of COUNTS $O_{ij}$ , testing whether the row and column variables are independent, Pearson's χ² statistic is

\chi^2 = \sum_{i=1}^r \sum_{j=1}^c \frac{(O_{ij} - E_{ij})^2}{E_{ij}}, \qquad E_{ij} = \frac{R_i \cdot C_j}{N},

where $R_i$ is the i-th row total, $C_j$ is the j-th column total, and $N = \sum_{ij} O_{ij}$ is the grand total. The expected count $E_{ij}$ is what you would expect in cell (i, j) under the null hypothesis that the row and column variables are independent (so the joint probability factorises into the product of marginals).

Pearson (1900, Philos. Mag.) showed that under $H_0$ , $\chi^2 \sim \chi^2_{\nu}$ asymptotically, with $\nu = (r-1)(c-1)$ . The df comes from the constraints the marginal totals impose on the table — only $(r-1)(c-1)$ cell values are free once the marginals are fixed.

The big practical caveat: the χ² approximation is ASYMPTOTIC. It is unreliable when expected counts are small. The conventional rule of thumb (Cochran 1954 Biometrics) is that EVERY expected count should be at least 5 — softer versions allow 80% of cells with E ≥ 5 and no cell with E < 1. When the rule is violated, the standard recommendation is to use FISHER'S EXACT TEST (Fisher 1925; for r × c tables, see Agresti 2002 §3.5), which computes the exact hypergeometric probability of the observed table under the null and is valid for any sample size.

Worked example. A clinical trial randomises 200 patients to a new drug or a placebo and records whether they responded:

Table O = [[55, 45], [40, 60]] (drug × response). Row totals: 100, 100. Column totals: 95, 105. N = 200.
Expected: $E_{11} = (100 \cdot 95)/200 = 47.5$ . $E_{12} = (100 \cdot 105)/200 = 52.5$ . $E_{21} = 47.5$ . $E_{22} = 52.5$ . Min expected = 47.5 > 5. ✓ Cochran satisfied.
(O − E)²: (55 − 47.5)² = 56.25; (45 − 52.5)² = 56.25; same for row 2.
χ² = 56.25/47.5 + 56.25/52.5 + 56.25/47.5 + 56.25/52.5 ≈ 1.184 + 1.071 + 1.184 + 1.071 = 4.510.
df = (2−1)(2−1) = 1. $\chi^2_{0.95, 1} = 3.841$ . χ² = 4.51 > 3.841: REJECT. p ≈ 0.034.

The drug is associated with response at α = 0.05. Note the χ² test reports ASSOCIATION; it does not by itself tell you the DIRECTION (which group responded more). Inspecting the table, the drug group had 55/100 = 55% response vs 45/100 = 45% in placebo — a 10-percentage-point difference. The Pearson χ² is one piece of the diagnostic; effect size (odds ratio = (55·60)/(45·40) ≈ 1.833, or risk difference 10 pp) is the magnitude.

For a 2 × 2 table the SAME information can be extracted via a two-proportion z-test, and the two test statistics satisfy $\chi^2 = z^2$ exactly. For larger tables (r > 2 or c > 2), the χ² is the natural multi-cell generalisation; pairwise z-tests would require multiple-comparisons correction (§2.5).

χ² goodness-of-fit — same machinery, different hypothesis

The same Σ(O−E)²/E recipe applies to GOODNESS-OF-FIT testing: given observed counts $O_i$ in k cells and a THEORETICAL distribution $p_1, \ldots, p_k$ (with $\sum p_i = 1$ ), the expected counts are $E_i = N \cdot p_i$ and the test statistic is

\chi^2 = \sum_{i=1}^k \frac{(O_i - E_i)^2}{E_i} \sim \chi^2_{k - 1 - m}

where m is the number of parameters of the theoretical distribution estimated from the data. (If you pre-specified the distribution completely, m = 0 and df = k − 1; if you fit, e.g., the Normal mean and SD from the data first, m = 2 and df = k − 3.) The Cochran ≥ 5 rule still applies. Classical examples: testing whether a die is fair (k = 6, expected = N/6 per face, df = 5), or whether genetic offspring counts match the predicted 9:3:3:1 ratio (Fisher 1925).

F-tests: variance ratio vs ANOVA omnibus

Two distinct uses share the F-distribution. The first is the F-TEST OF VARIANCE EQUALITY: given two independent normal samples, test $H_0: \sigma_1^2 = \sigma_2^2$ via

F = \frac{s_1^2}{s_2^2} \sim F_{n_1 - 1, n_2 - 1} \text{ under } H_0.

This test is FAMOUSLY BRITTLE. Box (1953, Biometrika) showed that even mild departures from normality (kurtosis 3.5 vs 3.0, say) dramatically inflate the test's actual Type-I rate. Box called the practice "unleashing a hurricane to test whether the wind is gentle enough to sail" — the F-test's assumptions are stricter than the t-test that often follows it. Modern practice replaces the F-test of variance with LEVENE'S TEST (Levene 1960) — an ANOVA on the absolute deviations from group medians — or BROWN–FORSYTHE (Brown & Forsythe 1974), which uses absolute deviations from medians (more robust than means). Both are calibrated under non-normality where the variance-ratio F is not.

The second and far more important F-test is the ONE-WAY ANOVA F. Given k groups of (possibly unequal) sizes $n_g$ , total $N = \sum_g n_g$ , with iid Normal observations within each group, test $H_0: \mu_1 = \mu_2 = \ldots = \mu_k$ . Decompose total variance into BETWEEN-group and WITHIN-group sums of squares:

SS_{\text{between}} = \sum_{g=1}^k n_g (\bar x_g - \bar x)^2, \qquad SS_{\text{within}} = \sum_{g=1}^k \sum_{i=1}^{n_g} (x_{gi} - \bar x_g)^2,

where $\bar x_g$ is the g-th group mean and $\bar x$ is the grand mean (weighted by group size). The mean squares are MS_b = SS_b/(k−1) and MS_w = SS_w/(N − k). The F-statistic is

F = \frac{MS_{\text{between}}}{MS_{\text{within}}} \sim F_{k-1, N-k} \text{ under } H_0.

Reading: MS_w is the WITHIN-group variance estimate (the average of group-specific variances, weighted by df). MS_b is the BETWEEN-group variance estimate, scaled so that under $H_0$ it equals MS_w in expectation. If the group means truly differ, MS_b inflates while MS_w stays put, and F gets large.

Worked example. k = 3 teaching methods, n = 5 students per group, test scores:

Group 1: 72, 75, 70, 78, 73. $\bar x_1 = 368/5 = 73.6$ .
Group 2: 82, 79, 85, 80, 83. $\bar x_2 = 409/5 = 81.8$ .
Group 3: 76, 78, 74, 79, 77. $\bar x_3 = 384/5 = 76.8$ .
Grand mean $\bar x = (368 + 409 + 384)/15 = 1161/15 = 77.4$ .
$SS_b = 5(73.6 - 77.4)^2 + 5(81.8 - 77.4)^2 + 5(76.8 - 77.4)^2 = 5(14.44 + 19.36 + 0.36) = 5 \cdot 34.16 = 170.8$ .
Group 1 SS: (72-73.6)² + (75-73.6)² + (70-73.6)² + (78-73.6)² + (73-73.6)² = 2.56 + 1.96 + 12.96 + 19.36 + 0.36 = 37.2.
Group 2 SS: (82-81.8)² + (79-81.8)² + (85-81.8)² + (80-81.8)² + (83-81.8)² = 0.04 + 7.84 + 10.24 + 3.24 + 1.44 = 22.8.
Group 3 SS: (76-76.8)² + (78-76.8)² + (74-76.8)² + (79-76.8)² + (77-76.8)² = 0.64 + 1.44 + 7.84 + 4.84 + 0.04 = 14.8.
$SS_w = 37.2 + 22.8 + 14.8 = 74.8$ .
df_b = 2, df_w = 12. MS_b = 170.8/2 = 85.4. MS_w = 74.8/12 ≈ 6.23.
$F = 85.4 / 6.23 \approx 13.70$ . $F_{0.95, 2, 12} \approx 3.89$ . REJECT. p ≈ 0.0008.

The omnibus F says the three groups don't share a common mean. It does NOT say which pair differs — that requires POST-HOC pairwise tests with multiple-comparisons correction (Tukey HSD, Bonferroni — §2.5 will cover these). The natural follow-up is a Tukey honestly-significant-difference test on all three pairwise contrasts.

Assumptions and their violations

Each test rests on a specific set of assumptions. The honest summary, drawing on Boneau (1960), Box (1953), Cochran (1954), and a generation of follow-up work:

t-tests (all three forms). Assume the underlying data (or differences, for the paired test) are approximately Normal. Boneau (1960, Psychol Bull) ran the canonical simulation study and found: for n ≥ 25-30, the two-sample t-test's Type-I rate stays within ~1 percentage point of the nominal α even for moderately skewed or kurtotic populations. So t-tests are ROBUST TO MILD NON-NORMALITY at moderate-to-large n. They are SENSITIVE to extreme outliers: a single observation 5 SDs from the mean inflates s and can make T look more extreme than the data really support. Robust alternatives: WILCOXON RANK-SUM / MANN–WHITNEY for two-sample, WILCOXON SIGNED-RANK for one-sample / paired (Conover 1999, ch. 5–6).
χ² of independence. Assumes (i) independent observations and (ii) expected counts ≥ 5 per cell (Cochran 1954). The χ² approximation BREAKS DOWN for small expected counts — the discrete sum (O−E)²/E with so few possible O values cannot land arbitrarily close to the continuous χ² critical value. With small N or sparse tables, the test's actual Type-I rate is unstable around nominal 5%. Robust alternative: FISHER'S EXACT TEST (Fisher 1925; Agresti 2002 §3.5). Exact and valid for any N, at the cost of being computationally expensive for large tables (modern software handles up to ~10000 cells easily).
F-test of variance ratio. VERY sensitive to non-normality (Box 1953). The variance ratio F's actual Type-I rate at nominal 0.05 can rise to 0.15 or higher with kurtosis as mild as 4 (heavy tails common in real data). NEVER use this test for a real research question; use Levene's test or Brown-Forsythe instead.
ANOVA F-test. Assumes (i) iid Normal within groups and (ii) equal variances across groups (homoscedasticity). Robust to non-normality with EQUAL n per group (Tan 1982 Stat. Med.). NOT robust to unequal variances combined with unequal n: if the smaller group has the LARGER variance, Type-I rate INFLATES (liberal); if the smaller group has the SMALLER variance, Type-I rate DEFLATES (conservative). The asymmetric inflation is the classical "Behrens-Fisher problem" generalised to k groups. Robust alternative: WELCH'S ANOVA (Welch 1951) — same recipe as Welch's two-sample t but generalised to k groups, with a Satterthwaite-style df correction. Modern statistical software defaults to Welch's ANOVA in many cases.

The second widget makes the assumption-violation Type-I rates VISIBLE. Pick a scenario, slide an assumption-violation knob, and the widget runs 2000 Monte Carlo trials at each slider step UNDER $H_0$ . The empirical rejection rate is the test's ACTUAL Type-I rate at the current setting; compare it to the nominal 5% reference line. Where the curve diverges from 5% is the assumption violation biting.

Things to verify:

skewed-t scenario: at moderate right-skew, the t-test stays near nominal 5%. The CLT-based robustness at n = 30 absorbs mild non-normality. The sign-test alternative stays exactly calibrated by construction.
outlier-t scenario: at ε up to 10% contamination, the t-test's Type-I rate stays close to nominal — because under H₀ (true mean = 0), outliers inflate s in BOTH numerator (via x̄) and denominator (via SE) symmetrically. The POWER under H₁ is what really suffers from contamination (try replacing the H₀ mean with a non-zero mean in the source).
small-cell-chi2 scenario: at N = 12 (each expected cell ≈ 3), the χ²-test's rejection rate is unstable around nominal 5% (sometimes above, sometimes below — the discrete sampling is the cause). Fisher's exact stays at or slightly below 5% (Fisher is slightly conservative). At N ≥ 60 the two tests agree closely.
unequal-var-anova scenario: at variance ratio = 1 (assumption satisfied), the F-test rejects at ≈ 5%. As the ratio grows, with the BIG group (n=16) holding the BIG variance, the test becomes CONSERVATIVE (deflated Type-I < 5%). At ratio = 10, the F-test rejects only ≈ 3% — too few. The Welch-corrected version stays closer to 5%.

Try it

Pen-and-paper. Five paint samples' drying times (minutes): 38, 42, 35, 39, 41. The manufacturer claims μ = 40. Test $H_0: \mu = 40$ at α = 0.05. Compute $\bar x$ , s, SE, T. Step-by-step: $\bar x = 195/5 = 39$ . Deviations: −1, 3, −4, 0, 2. SS = 1 + 9 + 16 + 0 + 4 = 30. $s^2 = 30/4 = 7.5$ , $s = 2.74$ . SE = 2.74/√5 = 1.22. T = (39 − 40)/1.22 = −0.82. df = 4. $t_{0.975, 4} = 2.776$ . |T| < 2.776: FAIL to reject. The data are consistent with the manufacturer's claim.
Pen-and-paper. Two groups: control (n = 8, mean = 50, s² = 16) and treatment (n = 10, mean = 56, s² = 25). Compute the Welch t and ν. SE = √(16/8 + 25/10) = √(2 + 2.5) = √4.5 ≈ 2.12. T = (50 − 56)/2.12 = −2.83. ν = (4.5)² / (2²/7 + 2.5²/9) = 20.25 / (0.571 + 0.694) = 20.25/1.265 ≈ 16.0. $t_{0.975, 16} = 2.120$ . |T| > 2.120: REJECT.
Pen-and-paper. A 2 × 2 table testing whether a vaccine reduces infection: vaccinated {sick: 8, well: 92}; placebo {sick: 22, well: 78}. Row totals 100, 100. Column totals 30, 170. N = 200. Expected: E₁₁ = 100·30/200 = 15. E₁₂ = 85. E₂₁ = 15. E₂₂ = 85. (O − E)²/E: (8−15)²/15 + (92−85)²/85 + (22−15)²/15 + (78−85)²/85 = 49/15 + 49/85 + 49/15 + 49/85 = 3.267 + 0.576 + 3.267 + 0.576 = 7.686. df = 1. $\chi^2_{0.95, 1} = 3.841$ . REJECT — the vaccine is associated with fewer infections.
Pen-and-paper. The same vaccine trial with much smaller N: vaccinated {sick: 1, well: 9}; placebo {sick: 4, well: 6}. Row totals 10, 10. Column totals 5, 15. N = 20. Expected: E₁₁ = 10·5/20 = 2.5. ALL expected cells = 2.5 or 7.5 — minimum 2.5 < 5. Cochran rule VIOLATED. Don't use χ²; use Fisher's exact. Fisher's exact two-sided p ≈ 0.30. Even with the same proportional pattern, the small-N evidence is much weaker — and the χ² approximation would have wrongly suggested otherwise.
Pen-and-paper. Three groups of n = 4: A = {10, 12, 11, 13}, B = {14, 15, 13, 16}, C = {12, 13, 14, 11}. Means: 11.5, 14.5, 12.5. Grand mean = 12.83. SS_b = 4·(11.5−12.83)² + 4·(14.5−12.83)² + 4·(12.5−12.83)² = 4(1.778 + 2.778 + 0.111) = 4·4.667 = 18.67. Within-group SS: A: (10−11.5)² + (12−11.5)² + (11−11.5)² + (13−11.5)² = 2.25+0.25+0.25+2.25 = 5. B: similar = 5. C: similar = 5. SS_w = 15. df_b = 2, df_w = 9. F = (18.67/2)/(15/9) = 9.33/1.67 ≈ 5.59. $F_{0.95, 2, 9} = 4.26$ . REJECT.
In the test-statistic-builder, switch to χ² independence. Change a single cell so the minimum expected count drops below 5. Confirm the widget flags this and suggests Fisher's exact. Re-test: does the verdict at α = 0.05 change?
In the assumption-violator, switch to "unequal-var-anova" and slide the variance ratio from 1 to 20. Notice the F-test's Type-I rate DEFLATES below 5% (the big-σ group is also the big-n group → conservative). Now flip the design mentally: if the big-σ group were the SMALL one, you'd see INFLATION (liberal). This asymmetry is the structural reason Welch's ANOVA is the modern default.
Pen-and-paper. Use the test-statistic equivalence $\chi^2_{1} = z^2$ to convert a two-proportion z-test on the same vaccine data into a χ² statistic and vice versa. (Recall z = (p̂₁ − p̂₂)/SE_diff, where SE_diff uses pooled p̂ = (x₁+x₂)/(n₁+n₂). Verify your z² matches the χ² from Question 3.)
Pen-and-paper. Show that the one-sample t-statistic is SCALE-INVARIANT: multiplying every x_i by a constant c > 0 changes x̄ → c·x̄, s → c·s, μ₀ → c·μ₀ (if you scale the null hypothesis too), SE → c·SE, and so T is UNCHANGED. This is a sanity check: the test conclusion doesn't depend on whether you measure in Celsius or Fahrenheit (after correctly scaling the null hypothesis).
Pen-and-paper. The Wilcoxon signed-rank test is the robust alternative to the one-sample / paired t-test. Recipe: compute the differences d_i, rank |d_i| from 1 to n, attach signs back, sum the positive-ranked ones to get W₊. Under H₀ (median = 0), W₊ ~ a known discrete distribution (Wilcoxon 1945 Biometrics). Look up its variance: $\text{Var}(W_+) = n(n+1)(2n+1)/24$ for n ≥ 10. Verify against a small example. The point: when the t-test's normality assumption is suspect, the Wilcoxon recipe is the standard non-parametric backup.

Pause and reflect: §2.3 has descended from the abstract N-P framework of §2.1 and the power machinery of §2.2 to three CONCRETE TEST FAMILIES that account for the bulk of applied statistical inference. The mechanics are straightforward — every test is a recipe applied to data, followed by a tail-probability lookup. The HARD PART is knowing when each test is honest and when it is not: t-tests survive mild non-normality but flinch at outliers; χ² independence breaks down at small expected counts; the F-test of variance is a meteorite; the ANOVA F-test is robust until the design becomes both unequal-n and heteroscedastic. The widgets in this section are deliberately designed to MAKE THIS BREAKDOWN VISIBLE — robust alternatives (Welch, Fisher, Wilcoxon, Levene) are not exotic, they are the modern defaults whenever the parametric assumptions fail. §2.4 returns to what a p-value IS — and what it isn't — having now seen concrete p-value calculations under three different reference distributions.

What you now know

You can compute three families of test statistics by hand. The ONE-SAMPLE t is $(\bar X - \mu_0)/(s/\sqrt n) \sim t_{n-1}$ . The WELCH TWO-SAMPLE t is $(\bar X_1 - \bar X_2)/\sqrt{s_1^2/n_1 + s_2^2/n_2}$ with the Welch-Satterthwaite ν (and is the modern default — Student's pooled-variance form is brittle). The PAIRED t is a one-sample t on the differences, and pairing usually buys power. The χ² test of independence is $\sum (O - E)^2/E$ with $E_{ij} = R_i C_j/N$ and $\nu = (r-1)(c-1)$ ; the Cochran-1954 rule says you need expected counts ≥ 5, else Fisher's exact. The one-way ANOVA F is $MS_b/MS_w$ with $df_b = k-1, df_w = N-k$ .

You also know the ASSUMPTIONS and the FAILURE MODES. t-tests are robust to mild non-normality at moderate n (Boneau 1960) but sensitive to extreme outliers; Wilcoxon / Mann–Whitney are the rank-based fallbacks. χ² independence is sensitive to small expected counts; Fisher's exact replaces it. The F-test of variances is sensitive to non-normality (Box 1953); Levene's or Brown-Forsythe replace it. The ANOVA F is robust at equal n but brittle under unequal-n + unequal-σ; Welch's ANOVA is the corrected form. Permutation tests (Pesarin & Salmaso 2010) provide a third axis of robustness: they only assume EXCHANGEABILITY under $H_0$ and work for essentially any test statistic.

Where this lands in Part 2. §2.4 dissects what a p-value IS — the misinterpretations, the N-P vs Fisher muddle, what reporting CIs alongside p-values buys, the role of the §2.3 reference distributions in producing the p in the first place. §2.5 tackles multiple testing — when you run multiple t-tests as post-hoc follow-ups to an ANOVA omnibus, family-wise α inflates and you need Bonferroni / Tukey HSD / FDR control. §2.6 preregistration as the procedural fix for selective reporting. §2.7 equivalence testing (TOST) — the right way to conclude "no meaningful effect" rather than just "p ≥ 0.05". §2.8 the replication crisis: what happens to a literature when these tools are misused systematically.

References

Student (Gosset, W.S.) (1908). "The probable error of a mean." Biometrika 6(1), 1–25. (The foundational paper that introduced the t-distribution. Gosset was working at the Guinness brewery and published under "Student" because Guinness disallowed publishing under one's real name.)
Welch, B.L. (1947). "The generalization of 'Student's' problem when several different population variances are involved." Biometrika 34(1–2), 28–35. (The Welch correction for unequal variances in the two-sample t. The modern default form.)
Welch, B.L. (1951). "On the comparison of several mean values: an alternative approach." Biometrika 38(3–4), 330–336. (The k-group generalisation — Welch's ANOVA.)
Pearson, K. (1900). "On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling." Philos. Mag. Series 5, 50(302), 157–175. (Pearson's χ² statistic. The foundational paper for contingency-table testing.)
Fisher, R.A. (1925). Statistical Methods for Research Workers. Oliver and Boyd, Edinburgh. (The book that codified the t-test, χ²-test, F-test, and the p < 0.05 convention. The first comprehensive applied-statistics textbook of the modern era. Contains the exact-test machinery in Ch. 7.)
Lehmann, E.L., Romano, J.P. (2005). Testing Statistical Hypotheses (3rd ed.). Springer. (Chapter 5 covers t-tests rigorously; Chapter 14 covers χ² for goodness-of-fit and independence; Chapter 7 covers the F-test for ANOVA.)
Wasserman, L. (2004). All of Statistics. Springer. (Chapter 10 covers all three test families at the working-statistician level. Excellent companion if Lehmann-Romano is too dense.)
Casella, G., Berger, R.L. (2002). Statistical Inference (2nd ed.). Duxbury. (Chapter 8 covers t and F tests; §10.3 covers χ². The standard graduate-level reference.)
Boneau, C.A. (1960). "The effects of violations of assumptions underlying the t test." Psychological Bulletin 57(1), 49–64. (The canonical simulation study of t-test robustness. Showed that at n ≥ 30 the t-test stays within ~1 percentage point of nominal α for moderately skewed or kurtotic populations.)
Box, G.E.P. (1953). "Non-normality and tests on variances." Biometrika 40(3–4), 318–335. (The classic paper showing the F-test of variances is unreliable under non-normality. "Unleashing a hurricane to test whether the wind is gentle enough to sail." The reason modern practice uses Levene's test instead.)
Cochran, W.G. (1954). "Some methods for strengthening the common χ² tests." Biometrics 10(4), 417–451. (The expected-count-≥-5 rule of thumb. Also covers continuity corrections.)
Levene, H. (1960). "Robust tests for equality of variances." In I. Olkin (ed.), Contributions to Probability and Statistics, Stanford University Press. (Levene's test — ANOVA on absolute deviations from group means — as a robust replacement for the F-test of variances.)
Brown, M.B., Forsythe, A.B. (1974). "Robust tests for the equality of variances." J. Am. Stat. Assoc. 69(346), 364–367. (Brown-Forsythe uses absolute deviations from group MEDIANS — even more robust than Levene's.)
Conover, W.J. (1999). Practical Nonparametric Statistics (3rd ed.). Wiley. (The standard reference for the rank-based and exact alternatives: Wilcoxon signed-rank (Ch. 5), Mann-Whitney (Ch. 5), Kruskal-Wallis (Ch. 5), Fisher's exact (Ch. 4).)
Agresti, A. (2002). Categorical Data Analysis (2nd ed.). Wiley. (Chapter 3 covers 2-way contingency tables in depth, including the χ²-vs-Fisher tradeoff. The standard reference for categorical-data analysis.)
Ruxton, G.D. (2006). "The unequal variance t-test is an underused alternative to Student's t-test and the Mann–Whitney U test." Behavioral Ecology 17(4), 688–690. (The clean modern argument for using Welch's t-test as the default. Required reading before defaulting to Student's pooled-variance form.)