Rank-based methods and U-statistics

Part 8 — Resampling and nonparametrics

Learning objectives

State the WILCOXON RANK-SUM (Mann-Whitney U) test and its connection to permutation tests
Compute and interpret the test statistic + Normal approximation p-value
Recognise the WILCOXON SIGNED-RANK test for paired data
Understand U-STATISTICS as the general framework: estimating expectations over symmetric kernels
Compare rank-based methods to t-tests under Normal, skewed, and heavy-tailed data

Rank-based methods replace observations with their RANKS in the combined sample, then compute test statistics on the ranks. Result: methods that are DISTRIBUTION-FREE (the null distribution doesn't depend on the underlying data distribution) and ROBUST to outliers and heavy tails. The price is a small loss of efficiency under Normal data (~5%) — well worth paying for distribution-free coverage.

Wilcoxon rank-sum (Mann-Whitney U)

Two-sample data: $n_A$ from group A, $n_B$ from group B. Pool the $N = n_A + n_B$ observations and rank them. Let $R_A$ be the sum of A's ranks. The Wilcoxon rank-sum statistic is

$$W = R_A.$$

Under H0 (same distribution), $E[W] = n_A (N + 1)/2$ and $\text{Var}(W) = n_A n_B (N + 1)/12$. For large N, the Normal approximation $Z = (W - E[W])/\sqrt{\text{Var}(W)}$ gives a two-sided p-value via $2(1 - \Phi(|Z|))$. The equivalent Mann-Whitney U statistic is $U = R_A - n_A(n_A + 1)/2$, which counts the pairs in which an A observation exceeds a B observation: the same test, with the statistic shifted by a constant.

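Crucially, the test is EXACT in small samples (enumerate the rank assignments under H0) and asymptotically Normal in large samples. Tie correction: tied observations receive midranks (the average of the ranks they span), and the variance becomes $\text{Var}(W) = \frac{n_A n_B}{12}\left[(N + 1) - \frac{\sum_j (t_j^3 - t_j)}{N(N - 1)}\right]$, where $t_j$ is the size of the $j$-th group of tied values.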
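The recipe above can be sketched in a few lines of Python. This is a minimal illustration, assuming NumPy and SciPy are available; the function name `rank_sum_test` is made up for this example, and no continuity correction is applied.

```python
import numpy as np
from scipy.stats import norm, rankdata


def rank_sum_test(a, b):
    """Wilcoxon rank-sum test via the Normal approximation (no continuity correction)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    n_a, n_b = len(a), len(b)
    n = n_a + n_b
    pooled = np.concatenate([a, b])

    # Pool and rank; rankdata assigns midranks to tied values by default.
    ranks = rankdata(pooled)
    w = ranks[:n_a].sum()              # W = R_A, the sum of A's ranks
    u = w - n_a * (n_a + 1) / 2        # Mann-Whitney U

    mean_w = n_a * (n + 1) / 2
    # Tie correction: sum of (t^3 - t) over groups of tied values, zero if no ties.
    _, counts = np.unique(pooled, return_counts=True)
    tie_term = (counts**3 - counts).sum() / (n * (n - 1))
    var_w = n_a * n_b / 12 * ((n + 1) - tie_term)

    z = (w - mean_w) / np.sqrt(var_w)
    p = 2 * norm.sf(abs(z))            # two-sided p-value, 2 * (1 - Phi(|z|))
    return w, u, z, p
```

For A = (1.1, 2.3, 3.8) and B = (2.9, 4.2, 5.0, 6.1), A's ranks are 1, 2 and 4, so W = 7 and U = 1; with E[W] = 12 and Var(W) = 8 the function returns Z = -5/√8 ≈ -1.77.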

The link to permutation tests

The Wilcoxon test is EXACTLY a permutation test using the rank-sum statistic. Under exchangeability of group labels, the rank distribution is determined by the $\binom{N}{n_A}$ ways of assigning A-labels to N ranks. The Normal approximation just summarizes this discrete null distribution. So Wilcoxon is in the permutation-test family (§8.2) — using a clever distribution-free test statistic.
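To make the permutation view concrete, the sketch below enumerates every way of assigning A-labels to the ranks 1..N and counts how many rank sums are at least as far from E[W] as the observed one. It assumes no ties, and the function name `exact_rank_sum_pvalue` is invented for this illustration; enumeration is only practical for small N.

```python
from itertools import combinations
from math import comb


def exact_rank_sum_pvalue(w_obs, n_a, n_b):
    """Exact two-sided p-value for W by enumerating all label assignments (no ties)."""
    n = n_a + n_b
    mean_w = n_a * (n + 1) / 2
    # Under H0 every size-n_a subset of the ranks 1..N is equally likely to be A's.
    extreme = sum(
        1
        for subset in combinations(range(1, n + 1), n_a)
        if abs(sum(subset) - mean_w) >= abs(w_obs - mean_w)
    )
    return extreme / comb(n, n_a)
```

With W = 7, $n_A = 3$ and $n_B = 4$, four of the 35 assignments are at least as extreme (rank sums 6, 7, 17 and 18), so the exact p-value is 4/35 ≈ 0.114, noticeably larger than the Normal approximation would suggest at this sample size.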

Wilcoxon signed-rank for paired data

For paired observations $(x_i, y_i)$, compute differences $d_i = y_i - x_i$. Rank $|d_i|$ from smallest to largest. The signed-rank statistic is

$$W^+ = \sum_i \text{sign}(d_i) \cdot \text{rank}(|d_i|).$$

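Under H0 (the differences are symmetric around zero), $E[W^+] = 0$ and $\text{Var}(W^+) = n(n+1)(2n+1)/6$, and $W^+$ is asymptotically Normal. (Some texts reserve $W^+$ for the sum of the positive ranks only, which has mean $n(n+1)/4$ and variance $n(n+1)(2n+1)/24$; the two versions give the same test.) The signed-rank test is the rank-based analogue of the paired t-test and is preferred when the differences are not Normally distributed.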
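A minimal sketch of the signed-rank calculation, using the signed-sum form of the statistic defined above. It assumes NumPy and SciPy; the function name `signed_rank_test` is hypothetical, and dropping zero differences is one common convention rather than something the text prescribes.

```python
import numpy as np
from scipy.stats import norm, rankdata


def signed_rank_test(x, y):
    """Signed-rank statistic (signed sum of ranks) with a Normal-approximation p-value."""
    d = np.asarray(y, dtype=float) - np.asarray(x, dtype=float)
    d = d[d != 0]                          # assumption: discard zero differences
    n = len(d)
    ranks = rankdata(np.abs(d))            # midranks for tied |d_i|
    w = float(np.sum(np.sign(d) * ranks))  # signed sum of ranks
    var_w = n * (n + 1) * (2 * n + 1) / 6  # null variance of the signed sum (no ties)
    z = w / np.sqrt(var_w)                 # null mean is zero
    p = 2 * norm.sf(abs(z))
    return w, z, p
```

For x = (10, 12, 9, 14, 11) and y = (12, 15, 8, 18, 16) the differences are 2, 3, -1, 4, 5, so the signed sum is 2 + 3 - 1 + 4 + 5 = 13 with null variance 55. With only five pairs the Normal approximation is rough, and an exact enumeration over the $2^n$ sign patterns would be preferable.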

U-statistics: the general framework

Hoeffding (1948) introduced U-statistics as a general framework for unbiased estimators of population functionals. A U-statistic of degree m is

$$U_n = \binom{n}{m}^{-1} \sum_{i_1 < \ldots < i_m} \psi(X_{i_1}, \ldots, X_{i_m}),$$

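where $\psi$ is a symmetric kernel. Examples: the sample variance is a U-statistic of degree 2 with kernel $(x_1 - x_2)^2 / 2$; Kendall's tau is a U-statistic of degree 2; Spearman's rho is not exactly a U-statistic but is asymptotically equivalent to one. The Mann-Whitney statistic is the two-sample version: $U / (n_A n_B)$ averages the kernel $\psi(x, y) = \mathbb{1}(x > y)$ over all pairs with $x$ from group A and $y$ from group B, which matches $U = R_A - n_A(n_A + 1)/2$ when there are no ties. Hoeffding showed that a U-statistic whose kernel has finite variance is asymptotically Normal (after suitable scaling, in the non-degenerate case), with a variance derivable from projections of the kernel.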
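The definition translates almost directly into code. The sketch below is a brute-force illustration (cost grows as $\binom{n}{m}$), with hypothetical helper names `u_statistic`, `variance_kernel` and `mann_whitney_proportion`; it checks that the degree-2 kernel $(x_1 - x_2)^2/2$ reproduces the unbiased sample variance.

```python
from itertools import combinations

import numpy as np


def u_statistic(x, kernel, m):
    """Average a symmetric kernel over all size-m subsets of the sample."""
    values = [kernel(*subset) for subset in combinations(x, m)]
    return float(np.mean(values))


def variance_kernel(x1, x2):
    """Degree-2 kernel whose U-statistic is the unbiased sample variance."""
    return (x1 - x2) ** 2 / 2


def mann_whitney_proportion(a, b):
    """Two-sample analogue: fraction of (a_i, b_j) pairs with a_i > b_j, i.e. U / (n_A n_B)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.mean(a[:, None] > b[None, :]))


x = [2.0, 4.0, 4.0, 5.0, 7.0]
print(u_statistic(x, variance_kernel, 2), np.var(x, ddof=1))  # the two values agree
```

The two-sample helper is not a one-sample U-statistic in the sense of the formula above; it averages over pairs drawn one from each group, which is the generalisation the Mann-Whitney statistic needs.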

Pitman ARE: rank vs t under Normal

The Pitman Asymptotic Relative Efficiency (ARE) of Wilcoxon vs t under Normal data is $3/\pi \approx 0.955$. Meaning: Wilcoxon needs about 5% more sample size to achieve the same power as t under Normal data. Normal data is not the worst case, though. Hodges-Lehmann (1956) proved that under location-shift alternatives the ARE of Wilcoxon vs t never falls below $108/125 = 0.864$, whatever the underlying continuous distribution, and it is frequently MUCH higher: it equals 1.5 for the double-exponential (Laplace) distribution and has no upper bound as the tails get heavier.
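A short worked calculation shows where $3/\pi$ comes from. It assumes the standard expression for the Pitman ARE of Wilcoxon relative to t under a location shift, for a density $f$ with variance $\sigma^2$ (this expression is not stated in the text above):

$$\text{ARE}(W, t) = 12\,\sigma^2 \left( \int f(x)^2 \, dx \right)^2.$$

For the Normal density with variance $\sigma^2$, $\int f(x)^2 \, dx = \frac{1}{2\sigma\sqrt{\pi}}$, so

$$\text{ARE}(W, t) = 12\,\sigma^2 \cdot \frac{1}{4\sigma^2 \pi} = \frac{3}{\pi} \approx 0.955,$$

and the sample-size ratio is $1/0.955 \approx 1.047$, which is the "about 5% more data" figure.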

Implication: use rank-based methods as the default. Lose about 5% efficiency under Normal data and at most about 14% in the worst case; gain dramatic robustness elsewhere.

Kruskal-Wallis: multi-group analogue

For comparing K > 2 groups (the rank analogue of one-way ANOVA), the Kruskal-Wallis test pools all the data, ranks them, and computes a chi-squared-like statistic from the per-group rank sums $R_k$ and group sizes $n_k$: $H = \frac{12}{N(N+1)} \sum_k R_k^2 / n_k - 3(N+1)$. Under H0, the statistic follows χ²(K-1) asymptotically. Post-hoc rank tests (Dunn's test with a Šidák adjustment, Conover-Iman) handle pairwise comparisons after a significant K-W test.
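As an illustration, the Kruskal-Wallis statistic can be computed from the pooled ranks with the usual formula $H = \frac{12}{N(N+1)} \sum_k R_k^2 / n_k - 3(N+1)$. The sketch below assumes NumPy and SciPy, omits the tie correction, and uses a made-up function name `kruskal_wallis`; `scipy.stats.kruskal` is the library routine to prefer in practice.

```python
import numpy as np
from scipy.stats import chi2, rankdata


def kruskal_wallis(*groups):
    """Kruskal-Wallis H statistic (no tie correction) and its chi-squared p-value."""
    sizes = [len(g) for g in groups]
    n = sum(sizes)
    ranks = rankdata(np.concatenate(groups))
    # Split the pooled ranks back into per-group blocks and sum each block.
    blocks = np.split(ranks, np.cumsum(sizes)[:-1])
    rank_sums = [block.sum() for block in blocks]
    h = 12 / (n * (n + 1)) * sum(r**2 / k for r, k in zip(rank_sums, sizes)) - 3 * (n + 1)
    p = chi2.sf(h, df=len(groups) - 1)     # asymptotic chi-squared(K - 1) reference
    return h, p
```

For three perfectly separated groups (1, 2, 3), (4, 5, 6), (7, 8, 9) the rank sums are 6, 15 and 24, giving H = 7.2 on 2 degrees of freedom.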

Hodges-Lehmann estimator

The rank-test counterpart to the t-test's estimator (mean difference): the median of all pairwise differences $y_j - x_i$ across $i, j$, taking one observation from each group. This is the Hodges-Lehmann estimator of the location shift, ROBUST to outliers and naturally paired with the Wilcoxon test. Use it to report a point estimate alongside the Wilcoxon p-value.
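A minimal sketch of the two-sample estimator, assuming NumPy; the function name `hodges_lehmann_shift` is chosen for this example only. It forms every pairwise difference, which is fine for moderate sample sizes but uses memory proportional to the product of the two sample sizes.

```python
import numpy as np


def hodges_lehmann_shift(x, y):
    """Median of all pairwise differences y_j - x_i between the two samples."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    diffs = np.subtract.outer(y, x)    # diffs[j, i] = y_j - x_i
    return float(np.median(diffs))
```

With x = (1, 2, 3) and y = (4, 6, 100), the nine differences are 1, 2, 3, 3, 4, 5, 97, 98, 99, so the estimate is 4, while the difference in means is about 34.7 because of the single outlier.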

When NOT to use rank-based methods

Hypothesis is about means specifically: e.g., regulatory thresholds based on means. Wilcoxon tests stochastic dominance, not mean differences.
Subgroup analyses: rank tests within tiny subgroups have low power; t-test under Normal assumption is fine if the assumption holds.
Complex regression with covariates: rank-based extensions (van der Waerden, quantile regression) exist but are more involved; OLS often preferred unless residuals are clearly non-Normal.

Try it

Start with Normal data, shift 0.70, N = 30. Both t-test and Wilcoxon give similar p-values (around 0.005-0.01). Normal data is the t-test's home turf; Wilcoxon is competitive.
Switch to Lognormal (skewed). Same shift, same N. Compare the p-values. Wilcoxon's p-value is typically smaller (more power) because the rank ordering separates the groups more cleanly than the noisy mean comparison.
Switch to Cauchy (heavy-tailed). Re-sample several times. t-test p-values are erratic — a single outlier can dominate the sample variance estimate. Wilcoxon p-values are much more stable.
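Set shift = 0 (true null). Re-sample many times under each distribution. Under Normal data, t-test p-values are uniform on [0, 1] and Wilcoxon p-values are (up to discreteness) uniform too, so both have correct Type-I error. Under Cauchy, Wilcoxon keeps its nominal level because it is distribution-free, whereas the t-test loses its guarantee: the Cauchy distribution has no finite variance, so the CLT does not apply at any N, and the t-test's actual Type-I error drifts away from the nominal level (typically it becomes conservative, rejecting less often than it should).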
Crank N to 200. Both tests gain power under Normal and Lognormal data, reaching decisive p-values for moderate shifts. Under Cauchy, only Wilcoxon improves reliably: the sample mean of Cauchy data does not concentrate as N grows, so the t-test stays erratic.
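The experiments above can be approximated offline with a small simulation. This is a rough sketch under stated assumptions, not a reproduction of the interactive demo: N is taken as the size of each group, the shift is added to every observation in the second group, the lognormal is exp of a standard Normal, and the names `rejection_rates` and `samplers` are invented here.

```python
import numpy as np
from scipy.stats import mannwhitneyu, ttest_ind


def rejection_rates(sampler, shift, n, reps=500, alpha=0.05, seed=0):
    """Fraction of simulated datasets in which each test rejects at level alpha."""
    rng = np.random.default_rng(seed)
    t_rej = w_rej = 0
    for _ in range(reps):
        a = sampler(rng, n)
        b = sampler(rng, n) + shift    # assumption: pure location shift
        t_rej += ttest_ind(a, b).pvalue < alpha
        w_rej += mannwhitneyu(a, b, alternative="two-sided").pvalue < alpha
    return t_rej / reps, w_rej / reps


# Hypothetical stand-ins for the three shapes named in the steps above.
samplers = {
    "normal": lambda rng, n: rng.standard_normal(n),
    "lognormal": lambda rng, n: rng.lognormal(size=n),
    "cauchy": lambda rng, n: rng.standard_cauchy(n),
}

if __name__ == "__main__":
    for name, sampler in samplers.items():
        for shift in (0.0, 0.7):
            t_rate, w_rate = rejection_rates(sampler, shift, n=30)
            print(f"{name:9s} shift={shift}: t={t_rate:.3f}  Wilcoxon={w_rate:.3f}")
```

Rows with shift 0 estimate the Type-I error and rows with shift 0.7 estimate power; the exact numbers depend on the seed and the number of replications.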

A scientist analyses a small clinical trial (N = 15 per arm) with outcomes that visibly skew right (lognormal-like). Why is Wilcoxon a better default than a t-test for this scenario?

What you now know

Rank-based methods (Wilcoxon rank-sum, signed-rank, Kruskal-Wallis) replace observations with ranks, making them distribution-free and robust. Pitman ARE of Wilcoxon vs t under Normal is 0.955; under any other distribution, Wilcoxon is at least 0.864 and often dominates. Hoeffding's (1948) U-statistic framework gives the general theory. The Hodges-Lehmann estimator pairs robust point estimation with Wilcoxon testing. §8.5 next: kernel density estimation, the nonparametric continuous-distribution estimator.

References

Wilcoxon, F. (1945). "Individual comparisons by ranking methods." Biometrics Bulletin 1(6), 80–83. (Original.)
Mann, H.B., Whitney, D.R. (1947). "On a test of whether one of two random variables is stochastically larger than the other." Annals of Mathematical Statistics 18(1), 50–60.
Hoeffding, W. (1948). "A class of statistics with asymptotically normal distribution." Annals of Mathematical Statistics 19(3), 293–325. (U-statistics.)
Hodges, J.L., Lehmann, E.L. (1956). "The efficiency of some nonparametric competitors of the t-test." Annals of Mathematical Statistics 27(2), 324–335.
Lehmann, E.L. (2006). Nonparametrics: Statistical Methods Based on Ranks (revised). Springer.