Permutation tests and exchangeability

Part 8 — Resampling and nonparametrics

Learning objectives

State the EXCHANGEABILITY assumption that justifies permutation tests under H0
Construct the permutation null distribution by random label shuffling
Compute and interpret the permutation p-value
Recognise when permutation tests are EXACT (small N, complete enumeration) vs APPROXIMATE (large N, Monte Carlo subset)
Apply paired permutation tests, stratified permutation tests, and permutation-based regression

Bootstrap (§8.1) builds CIs by resampling WITH replacement to mimic the sampling distribution. The complementary resampling technique for HYPOTHESIS testing is the PERMUTATION TEST (Fisher 1935): under the null hypothesis of no effect, group labels are EXCHANGEABLE, and the null distribution of any test statistic can be built by SHUFFLING the labels. The p-value is the fraction of shuffled statistics at least as extreme as the observed.

The exchangeability foundation

Suppose you have two groups (control C, treatment T) with $n_C$ and $n_T$ observations. Under H0 (no treatment effect), the underlying distribution of outcomes is the same for both groups. The treatment-vs-control LABELING is just a tag — under H0, ANY assignment of labels to the $N = n_C + n_T$ observations is equally probable. This is EXCHANGEABILITY.

The observed data $y_1, \ldots, y_N$ with its specific labelling is ONE of the $\binom{N}{n_C}$ equally probable label assignments under H0. Any test statistic $T$ takes a specific value $T_{\text{obs}}$ on the observed data. The NULL DISTRIBUTION of T under exchangeability is obtained by computing T on every possible label assignment (or on a random subset for computational ease). The PERMUTATION P-VALUE (one-sided, with large T counted as extreme) is:

$$p = \frac{1 + |\{\pi: T(y, \pi) \ge T_{\text{obs}}\}|}{1 + \#\text{ permutations considered}}.$$

(The +1 in numerator and denominator counts the observed labelling alongside the randomly drawn ones. This keeps the p-value valid, slightly conservative, and never exactly zero. Under complete enumeration the observed labelling is already among the assignments, so the p-value is simply the fraction of assignments with $T \ge T_{\text{obs}}$ and no +1 is added.)
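The p-value formula can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a reference implementation: the function name `perm_test_mean_diff`, the one-sided mean-difference statistic, and the small data set are all assumptions made for the example.

```python
import random

def perm_test_mean_diff(control, treatment, n_perm=5000, seed=0):
    """One-sided Monte Carlo permutation p-value for mean(treatment) - mean(control)."""
    rng = random.Random(seed)
    pooled = list(control) + list(treatment)
    n_c = len(control)

    def mean_diff(values):
        # The first n_c entries play the role of control, the rest treatment.
        c, t = values[:n_c], values[n_c:]
        return sum(t) / len(t) - sum(c) / len(c)

    t_obs = mean_diff(pooled)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # shuffling the values is equivalent to shuffling the labels
        if mean_diff(pooled) >= t_obs - 1e-12:  # small tolerance for float round-off
            hits += 1
    # The +1 terms count the observed labelling itself.
    return t_obs, (1 + hits) / (1 + n_perm)

# Illustrative (made-up) data
control = [4.1, 5.0, 4.7, 5.3, 4.4, 4.9, 5.1, 4.6]
treatment = [5.6, 6.1, 5.2, 6.4, 5.9, 5.5, 6.0, 5.8]
t_obs, p = perm_test_mean_diff(control, treatment)
print(f"T_obs = {t_obs:.3f}, permutation p = {p:.4f}")
```

Because the two groups barely overlap in this toy data, very few shuffles reach the observed difference and the p-value is small.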

Exact vs Monte Carlo permutation tests

EXACT: enumerate all $\binom{N}{n_C}$ label assignments. For N = 20 with a 10/10 split, this is $\binom{20}{10} = 184{,}756$, which is feasible. For N = 100 with 50/50, this is $\binom{100}{50} \approx 10^{29}$, which is infeasible.
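For small samples, complete enumeration can be sketched with `itertools.combinations`. The function name `exact_perm_pvalue` and the data are hypothetical; the sketch assumes a one-sided test of the mean difference and uses the fact that, with fixed group sizes, the mean difference is a monotone function of the treatment-group sum.

```python
from itertools import combinations
from math import comb

def exact_perm_pvalue(control, treatment):
    """Exact one-sided permutation p-value for the difference in means."""
    pooled = list(control) + list(treatment)
    n, n_t = len(pooled), len(treatment)
    t_sum_obs = sum(treatment)
    hits = 0
    for idx in combinations(range(n), n_t):  # every way to pick the treatment group
        t_sum = sum(pooled[i] for i in idx)
        # With fixed group sizes, a larger treatment sum means a larger mean difference.
        if t_sum >= t_sum_obs - 1e-12:
            hits += 1
    # The observed labelling is one of the enumerated assignments, so no +1 is needed.
    return hits / comb(n, n_t)

print(comb(20, 10))    # number of 10/10 assignments for N = 20
print(comb(100, 50))   # number of 50/50 assignments for N = 100

# Illustrative (made-up) data: 8 per group, comb(16, 8) = 12870 assignments
control = [4.1, 5.0, 4.7, 5.3, 4.4, 4.9, 5.1, 4.6]
treatment = [5.6, 6.1, 5.2, 6.4, 5.9, 5.5, 6.0, 5.8]
p_exact = exact_perm_pvalue(control, treatment)
print(f"exact p = {p_exact:.6f}")
```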

MONTE CARLO: randomly sample B (e.g., 5000 or 10000) label permutations. The Monte Carlo p-value approximates the exact one, with Monte Carlo SE $\sqrt{p(1-p)/B}$. For B = 5000 and a true p-value of 0.02, the SE is $\approx 0.002$, which is comfortably small. For very small p-values (e.g., 0.001), use larger B: at B = 5000 the SE is about 0.00045, nearly half the p-value itself, and the smallest p-value that can be reported is $1/(B+1) \approx 0.0002$.
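The SE formula also gives a rough way to choose B. In the sketch below, the helper names `mc_se` and `perms_needed` and the 10% relative-precision target are assumptions for illustration, not a standard rule.

```python
from math import sqrt, ceil

def mc_se(p, n_perm):
    """Monte Carlo standard error of a permutation p-value estimate."""
    return sqrt(p * (1 - p) / n_perm)

def perms_needed(p, rel_se=0.1):
    """Smallest B with SE <= rel_se * p, from solving sqrt(p(1-p)/B) = rel_se * p."""
    return ceil((1 - p) / (rel_se ** 2 * p))

print(f"SE at p=0.02,  B=5000: {mc_se(0.02, 5000):.5f}")
print(f"SE at p=0.001, B=5000: {mc_se(0.001, 5000):.5f}")
print(f"B for 10% relative SE at p=0.001: {perms_needed(0.001)}")
```

The required B grows roughly like 1/p, which is why very small p-values need many more permutations.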

The connection to t-tests

Under Normal-distributed data with equal variances, the permutation test of mean difference produces approximately the same p-value as the two-sample t-test. The permutation test makes NO parametric assumption — it just needs exchangeability — but its power is similar to the t-test when t-test assumptions hold. When data are non-Normal (skewed, heavy-tailed), the permutation test remains valid; the t-test may not.

Permutation tests for other statistics

The permutation test can use ANY test statistic, not just difference of means. Useful choices:

Difference of medians: robust to outliers. Permutation test of medians remains valid under skewness.
Wilcoxon rank-sum (Mann-Whitney) U: replaces values with ranks; without ties, the permutation null is the same as the exact classical Mann-Whitney null distribution.
Kolmogorov-Smirnov D: for continuous data (no ties), the permutation null is the same as the exact two-sample KS null distribution.
Quadratic form (Hotelling T²): multivariate generalisation.
Regression coefficient β̂: permute the dependent variable (or rows of X) to test individual coefficients under no-effect null.
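Since only the statistic changes between these choices, one way to organise the code is to pass the statistic in as a function. The sketch below assumes a one-sided Monte Carlo test; `permutation_pvalue`, `mean_diff`, `median_diff`, and the data (which include one deliberate outlier in the control group) are illustrative.

```python
import random
from statistics import median

def permutation_pvalue(x, y, statistic, n_perm=5000, seed=0):
    """One-sided Monte Carlo permutation p-value for any statistic(x, y)."""
    rng = random.Random(seed)
    pooled = list(x) + list(y)
    n_x = len(x)
    t_obs = statistic(x, y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if statistic(pooled[:n_x], pooled[n_x:]) >= t_obs - 1e-12:
            hits += 1
    return (1 + hits) / (1 + n_perm)

def mean_diff(x, y):
    return sum(y) / len(y) - sum(x) / len(x)

def median_diff(x, y):
    return median(y) - median(x)

# Made-up data: the control group contains a single large outlier (40)
control = [1, 2, 2, 3, 3, 4, 5, 40]
treatment = [4, 5, 6, 6, 7, 8, 9, 11]
p_mean = permutation_pvalue(control, treatment, mean_diff)
p_median = permutation_pvalue(control, treatment, median_diff)
print(f"mean difference:   p = {p_mean:.3f}")
print(f"median difference: p = {p_median:.3f}")
```

In this toy data the outlier dominates the mean difference, while the median difference still reflects the shift in the bulk of the values.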

Paired permutation test

For paired data (e.g., pre/post measurements on the same units), permute the within-pair labels (swap pre/post) for each pair independently. Under H0 the sign of each within-pair difference is exchangeable, so n pairs give $2^n$ equally likely sign patterns. With the mean (or sum) of the differences as the statistic this is the sign-flip test; with signed ranks it is the Wilcoxon signed-rank test. Use this for matched designs (twin studies, before-after, crossover trials).
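A minimal sketch of the paired test by complete enumeration of sign patterns follows. The function name `paired_sign_flip_pvalue` and the pre/post values are made up, and the test is one-sided on the sum of differences.

```python
from itertools import product

def paired_sign_flip_pvalue(pre, post):
    """Exact one-sided sign-flip p-value using the sum of within-pair differences."""
    diffs = [b - a for a, b in zip(pre, post)]
    t_obs = sum(diffs)
    hits = total = 0
    # Swapping pre/post within a pair flips the sign of that pair's difference.
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        if sum(s * d for s, d in zip(signs, diffs)) >= t_obs - 1e-12:
            hits += 1
    return hits / total  # the observed sign pattern is included in the enumeration

# Illustrative (made-up) data: 8 pairs, so 2**8 = 256 sign patterns
pre = [12, 15, 11, 14, 13, 16, 12, 15]
post = [14, 16, 13, 17, 13.5, 18, 11, 17]
p_paired = paired_sign_flip_pvalue(pre, post)
print(f"paired sign-flip p = {p_paired:.4f}")
```

For more than about 20 pairs the $2^n$ patterns become expensive, and random sign flips would replace the full enumeration.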

Stratified permutation

If your design has a STRATIFICATION variable (e.g., age group, sex, site), permute labels WITHIN STRATA. This preserves the stratification structure, so the null distribution reflects the design's balance. When the stratification variable predicts the outcome, the stratified null distribution has smaller variance and the test is more powerful than unstratified permutation.
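Within-stratum shuffling can be sketched as below. The function name `stratified_perm_pvalue`, the 0/1 label coding, and the two-site data (with very different baselines per site) are assumptions for the example. Passing a single constant stratum reduces the function to ordinary unstratified permutation, which makes the comparison easy.

```python
import random

def stratified_perm_pvalue(y, label, stratum, n_perm=5000, seed=0):
    """One-sided Monte Carlo p-value for the mean difference, shuffling within strata."""
    rng = random.Random(seed)
    by_stratum = {}
    for i, s in enumerate(stratum):
        by_stratum.setdefault(s, []).append(i)

    def mean_diff(lab):
        t = [v for v, l in zip(y, lab) if l == 1]
        c = [v for v, l in zip(y, lab) if l == 0]
        return sum(t) / len(t) - sum(c) / len(c)

    t_obs = mean_diff(label)
    hits = 0
    for _ in range(n_perm):
        perm = list(label)
        for idx in by_stratum.values():
            block = [label[i] for i in idx]
            rng.shuffle(block)          # labels move only within their own stratum
            for i, l in zip(idx, block):
                perm[i] = l
        if mean_diff(perm) >= t_obs - 1e-12:
            hits += 1
    return (1 + hits) / (1 + n_perm)

# Made-up data: site B has a much higher baseline than site A
y       = [10, 11, 12, 12, 13, 14, 20, 21, 22, 22, 23, 24]
label   = [0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1]          # 1 = treatment
stratum = ["A"] * 6 + ["B"] * 6
p_strat = stratified_perm_pvalue(y, label, stratum)
p_unstrat = stratified_perm_pvalue(y, label, ["all"] * 12)  # one stratum = no stratification
print(f"stratified p = {p_strat:.4f}, unstratified p = {p_unstrat:.4f}")
```

Here the site difference swamps the treatment effect when labels are shuffled freely, so the unstratified p-value is much larger than the stratified one.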

Permutation tests for regression

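For testing the coefficient β of a covariate X in a regression model, permute the X values across observations to build the null distribution of $\hat{\beta}$. When the model contains other covariates, permuting X alone breaks its relationship with them; a common alternative is to permute the residuals from the reduced model that omits X, which gives an approximate rather than exact test (see Anderson and Robinson 2001). This generalises permutation to settings with covariate adjustment, where the simple two-group permutation isn't applicable.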
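For the simplest case of a single predictor with no other covariates, permuting X can be sketched as follows. The helper names `slope` and `perm_test_slope` and the data are illustrative, and the test is two-sided on $|\hat{\beta}|$; models with additional covariates would need a residual-permutation scheme that is not shown here.

```python
import random

def slope(x, y):
    """Ordinary least-squares slope of y on x (simple regression with intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def perm_test_slope(x, y, n_perm=5000, seed=0):
    """Two-sided Monte Carlo permutation p-value for the slope, permuting x."""
    rng = random.Random(seed)
    b_obs = slope(x, y)
    x_perm = list(x)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(x_perm)  # breaks any x-y association while keeping both marginals
        if abs(slope(x_perm, y)) >= abs(b_obs) - 1e-12:
            hits += 1
    return b_obs, (1 + hits) / (1 + n_perm)

# Illustrative (made-up) data with a clear linear trend
x = list(range(1, 13))
y = [1.2, 1.9, 2.1, 3.4, 2.8, 4.1, 4.0, 5.2, 4.9, 6.3, 5.8, 7.0]
b_obs, p_slope = perm_test_slope(x, y)
print(f"slope = {b_obs:.3f}, permutation p = {p_slope:.4f}")
```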

What permutation tests CANNOT do

Permutation tests are HYPOTHESIS TESTS, not CI calculators. The output is a p-value about exchangeability, not a CI for the effect size. To get CIs, INVERT the test (find the interval of effect sizes that wouldn't be rejected at level α), or use bootstrap (§8.1) for the CI and permutation for the test.
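Test inversion can be sketched on a small data set with exact enumeration. The sketch assumes a constant additive treatment effect δ (so that subtracting the true δ from the treatment values restores exchangeability), a two-sided mean-difference test, and a coarse grid of candidate values; `exact_two_sided_p`, `shift_ci`, the grid, and the data are all hypothetical.

```python
from itertools import combinations

def exact_two_sided_p(control, treatment):
    """Exact two-sided permutation p-value for the absolute difference in means."""
    pooled = list(control) + list(treatment)
    n, n_t = len(pooled), len(treatment)
    n_c = n - n_t
    total = sum(pooled)

    def diff(t_sum):
        return t_sum / n_t - (total - t_sum) / n_c

    d_obs = abs(diff(sum(treatment)))
    hits = count = 0
    for idx in combinations(range(n), n_t):
        count += 1
        if abs(diff(sum(pooled[i] for i in idx))) >= d_obs - 1e-12:
            hits += 1
    return hits / count

def shift_ci(control, treatment, grid, alpha=0.05):
    """Keep every candidate shift d that the test does not reject at level alpha."""
    kept = [d for d in grid
            if exact_two_sided_p(control, [t - d for t in treatment]) > alpha]
    return min(kept), max(kept)

# Illustrative (made-up) data: 6 per group, comb(12, 6) = 924 assignments per test
control = [4.1, 5.0, 4.7, 5.3, 4.4, 4.9]
treatment = [5.6, 6.1, 5.2, 6.4, 5.9, 5.5]
grid = [i * 0.05 for i in range(41)]   # candidate shifts 0.00, 0.05, ..., 2.00
lo, hi = shift_ci(control, treatment, grid)
print(f"approximate 95% interval for the shift: [{lo:.2f}, {hi:.2f}]")
```

The interval is only as fine as the grid, and the grid must be wide enough to contain both endpoints.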

Also: permutation tests assume EXCHANGEABILITY, and they can fail when it is violated. The standard example is testing equality of MEANS when the group variances differ: the labels are then not exchangeable even if the means are equal, and the permutation test of the raw mean difference can have incorrect Type-I error, especially with unequal group sizes (Romano 1990). Use the t-test with Welch's correction, or a permutation test of a STUDENTIZED statistic.
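A studentized permutation test can be sketched by using the Welch t statistic, recomputed on every shuffle, in place of the raw mean difference. The names `welch_t` and `studentized_perm_pvalue` and the data (unequal group sizes, very different spreads) are illustrative, and the sketch assumes no shuffled group ever has zero variance.

```python
import random
from math import sqrt

def welch_t(x, y):
    """Welch t statistic: mean difference divided by its unpooled standard error."""
    def mean_var(v):
        m = sum(v) / len(v)
        return m, sum((a - m) ** 2 for a in v) / (len(v) - 1)
    mx, vx = mean_var(x)
    my, vy = mean_var(y)
    return (my - mx) / sqrt(vx / len(x) + vy / len(y))

def studentized_perm_pvalue(x, y, n_perm=5000, seed=0):
    """Two-sided Monte Carlo permutation p-value using |Welch t| as the statistic."""
    rng = random.Random(seed)
    pooled = list(x) + list(y)
    n_x = len(x)
    t_obs = welch_t(x, y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        # The standard error is recomputed for each shuffled split.
        if abs(welch_t(pooled[:n_x], pooled[n_x:])) >= abs(t_obs) - 1e-12:
            hits += 1
    return t_obs, (1 + hits) / (1 + n_perm)

# Made-up data: small tight group vs larger, much more variable group
x = [5.1, 4.9, 5.0, 5.2, 4.8, 5.3]
y = [3.0, 7.5, 4.2, 6.8, 2.1, 8.9, 5.5, 1.4, 9.6, 4.0, 6.1, 7.2]
t_stat, p_stud = studentized_perm_pvalue(x, y)
print(f"Welch t = {t_stat:.3f}, studentized permutation p = {p_stud:.3f}")
```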

Try it

Start with default: true δ = 0.6, N = 25 per group. The treatment group sits visibly above the control group. T_obs is in the right tail of the null distribution; permutation p ≈ 0.02. The t-test gives a similar p-value (Normal data, t-test assumption holds).
Drag δ to 0.0. T_obs is now near zero, in the bulk of the null distribution; permutation p ≈ 0.5. The test correctly fails to reject under truth.
Drag δ to 1.2 (large effect). T_obs is in the far right tail; p < 0.001 (Monte Carlo precision limits how small p can be reported). The test has high power for large effects.
Try N = 100. The permutation null distribution narrows: its standard deviation scales as 1/√N, so going from N = 25 to N = 100 per group roughly halves it. Smaller observed effects become detectable. The test gains POWER with N just as classical tests do.
Compare the permutation and t-test p-values across all settings. They should track closely under Normal data. The permutation test additionally would remain valid under skewed or otherwise non-Normal data — at the cost of slightly more compute than a closed-form t-test.

A clinician runs a two-arm trial with 10 patients per arm; the outcomes are highly skewed counts. Why might a permutation test be preferable to a t-test here?

What you now know

Permutation tests build the null distribution by shuffling group labels under the exchangeability assumption. The permutation p-value is the fraction of shuffled statistics at least as extreme as the observed. Exact tests enumerate all label assignments; Monte Carlo tests sample B permutations. The approach works with any test statistic and extends to paired, stratified, and regression designs. It approximately reproduces classical t-test p-values under Normal data and remains valid under non-Normality as long as exchangeability holds. §8.3 next: cross-validation done right, the canonical resampling-based tool for predictive-model evaluation.

References

Fisher, R.A. (1935). The Design of Experiments. Oliver and Boyd. (The original permutation-test source.)
Good, P. (2000). Permutation Tests: A Practical Guide to Resampling Methods for Testing Hypotheses (2nd ed.). Springer.
Pesarin, F., Salmaso, L. (2010). Permutation Tests for Complex Data. Wiley.
Romano, J.P. (1990). "On the behavior of randomization tests without a group invariance assumption." JASA 85(411), 686–692. (Heteroscedasticity caveat.)
Anderson, M.J., Robinson, J. (2001). "Permutation tests for linear models." Australian & New Zealand J. Stat. 43(1), 75–88.