Rank-based methods and U-statistics

Part 8 — Resampling and nonparametrics

Learning objectives

  • State the WILCOXON RANK-SUM (Mann-Whitney U) test and its connection to permutation tests
  • Compute and interpret the test statistic + Normal approximation p-value
  • Recognise the WILCOXON SIGNED-RANK test for paired data
  • Understand U-STATISTICS as the general framework: estimating expectations over symmetric kernels
  • Compare rank-based methods to t-tests under Normal, skewed, and heavy-tailed data

Rank-based methods replace observations with their RANKS in the combined sample, then compute test statistics on the ranks. The resulting methods are DISTRIBUTION-FREE (for continuous data, the null distribution of the statistic does not depend on the underlying data distribution) and ROBUST to outliers and heavy tails. The price is a small loss of efficiency under Normal data (about 5%), usually well worth paying for distribution-free validity.

Wilcoxon rank-sum (Mann-Whitney U)

Two-sample data: $n_A$ observations from group A, $n_B$ from group B. Pool the $N = n_A + n_B$ observations and rank them. Let $R_A$ be the sum of A's ranks. The Wilcoxon rank-sum statistic is

$$W = R_A.$$

Under H0 (same distribution), $E[W] = n_A (N + 1)/2$ and $\text{Var}(W) = n_A n_B (N + 1)/12$. For large N, the Normal approximation $Z = (W - E[W])/\sqrt{\text{Var}(W)}$ gives a two-sided p-value via $2(1 - \Phi(|Z|))$. The equivalent Mann-Whitney U statistic is $U = R_A - n_A(n_A + 1)/2$; it is the same test, because U and W differ only by a constant.
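The formulas above can be traced on a toy example. The following is a minimal sketch, assuming no tied values; the function name `rank_sum_test` and the data are made up for illustration, and the Normal CDF is built from `math.erf` rather than taken from a statistics library.

```python
import math

def rank_sum_test(a, b):
    """Return (W, U, Z, two-sided Normal-approximation p-value).

    Assumes no tied values; ties would need midranks and a variance correction.
    """
    n_a, n_b = len(a), len(b)
    n = n_a + n_b
    # Pool the observations, remembering which group each came from.
    pooled = sorted([(x, "A") for x in a] + [(x, "B") for x in b])
    # Rank 1 goes to the smallest pooled observation; W sums A's ranks.
    w = sum(rank for rank, (_, label) in enumerate(pooled, start=1) if label == "A")
    u = w - n_a * (n_a + 1) / 2
    mean_w = n_a * (n + 1) / 2
    var_w = n_a * n_b * (n + 1) / 12
    z = (w - mean_w) / math.sqrt(var_w)
    phi = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))  # standard Normal CDF at |z|
    return w, u, z, 2 * (1 - phi)

# Illustrative data (hypothetical): group A sits mostly below group B.
a = [1.1, 2.3, 2.9, 4.0]
b = [3.5, 4.8, 5.2, 6.1, 7.4]
w, u, z, p = rank_sum_test(a, b)
print(w, u, round(z, 3), round(p, 3))
```

Here A holds ranks 1, 2, 3 and 5, so W = 11 and U = 1. With only nine observations the Normal approximation is rough, which is why small samples are usually handled by the exact null distribution instead.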

Crucially, the test is EXACT in small samples (enumerate the rank assignments under H0) and asymptotically Normal in large samples. Tie correction: when there are tied observations, midranks are used, and the variance formula gets a tie-correction factor.
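To make the tie handling concrete, here is a small sketch of midranks and one commonly used tie-corrected variance, $\text{Var}(W) = \frac{n_A n_B}{12}\left[(N+1) - \frac{\sum_j (t_j^3 - t_j)}{N(N-1)}\right]$, where $t_j$ is the size of the j-th group of tied values. The helper names `midranks` and `tie_corrected_variance` and the data are illustrative assumptions, not part of the original text.

```python
from collections import Counter

def midranks(values):
    """Rank `values`, giving each tied value the average of the ranks it spans."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current run of equal values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def tie_corrected_variance(a, b):
    """Null variance of the rank sum W when midranks are used."""
    n_a, n_b = len(a), len(b)
    n = n_a + n_b
    tie_sizes = Counter(a + b).values()
    correction = sum(t**3 - t for t in tie_sizes) / (n * (n - 1))
    return n_a * n_b / 12 * ((n + 1) - correction)

# Hypothetical data with a three-way tie at the value 2.
a = [1, 2, 2, 3]
b = [2, 4, 5]
ranks = midranks(a + b)
w = sum(ranks[:len(a)])
var_w = tie_corrected_variance(a, b)
print(ranks, w, var_w)
```

The three tied 2s occupy positions 2, 3 and 4, so each gets midrank 3. The correction shrinks the variance from 8 (the no-ties formula) to 52/7, reflecting that ties remove some of the variability in the ranks.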

The Wilcoxon test is EXACTLY a permutation test using the rank-sum statistic. Under exchangeability of group labels, the rank distribution is determined by the $\binom{N}{n_A}$ ways of assigning A-labels to N ranks. The Normal approximation just summarizes this discrete null distribution. So Wilcoxon is in the permutation-test family (§8.2), using a clever distribution-free test statistic.
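The permutation view can be sketched directly by enumerating every assignment of A-labels to the ranks. This is an illustrative sketch assuming no ties and a sample small enough to enumerate; `exact_rank_sum_pvalue` is a hypothetical name, and the two-sided p-value here counts rank sums at least as far from the null mean as the observed one.

```python
from itertools import combinations

def exact_rank_sum_pvalue(a, b):
    """Two-sided exact p-value for the rank-sum statistic (no ties assumed)."""
    n_a, n = len(a), len(a) + len(b)
    pooled = sorted(a + b)
    rank_of = {x: r for r, x in enumerate(pooled, start=1)}
    w_obs = sum(rank_of[x] for x in a)
    mean_w = n_a * (n + 1) / 2
    # Under H0 every subset of n_A ranks is equally likely to belong to A.
    null = [sum(c) for c in combinations(range(1, n + 1), n_a)]
    extreme = sum(1 for w in null if abs(w - mean_w) >= abs(w_obs - mean_w))
    return w_obs, extreme / len(null)

# Hypothetical data: 4 observations in A, 5 in B, so 126 label assignments.
a = [1.1, 2.3, 2.9, 4.0]
b = [3.5, 4.8, 5.2, 6.1, 7.4]
w_obs, p_exact = exact_rank_sum_pvalue(a, b)
print(w_obs, p_exact)
```

Only four of the 126 assignments give a rank sum as extreme as the observed W = 11 (sums of 10, 11, 29 and 30), so the exact p-value is 4/126. Enumeration grows combinatorially, so in practice it is limited to small samples.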

Wilcoxon signed-rank for paired data

For paired observations $(x_i, y_i)$, compute differences $d_i = y_i - x_i$ (dropping any that are exactly zero). Rank $|d_i|$ from smallest to largest. The signed-rank statistic is

$$W^+ = \sum_i \text{sign}(d_i) \cdot \text{rank}(|d_i|).$$

Under H0 (differences symmetric around zero), $E[W^+] = 0$ and $\text{Var}(W^+) = n(n+1)(2n+1)/6$, and $W^+$ is asymptotically Normal. (Many texts instead define the statistic as the sum of the ranks of the positive differences only, with mean $n(n+1)/4$ and variance $n(n+1)(2n+1)/24$; the two forms give the same test.) The signed-rank test is the rank-based analogue of the paired t-test; preferred when the differences are not Normal-distributed.
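A short sketch of the signed-rank computation, using the signed-sum form with null mean 0 and null variance n(n+1)(2n+1)/6. It assumes no ties among the absolute differences; the function name `signed_rank_test` and the paired data are invented for illustration.

```python
import math

def signed_rank_test(x, y):
    """Return (signed-rank sum, Z, two-sided Normal-approximation p-value)."""
    d = [yi - xi for xi, yi in zip(x, y) if yi != xi]  # drop zero differences
    n = len(d)
    # Order the differences by absolute size; rank 1 is the smallest |d|.
    order = sorted(range(n), key=lambda i: abs(d[i]))
    w = 0
    for rank, i in enumerate(order, start=1):
        w += rank if d[i] > 0 else -rank
    var_w = n * (n + 1) * (2 * n + 1) / 6
    z = w / math.sqrt(var_w)  # the null mean is 0
    phi = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
    return w, z, 2 * (1 - phi)

# Hypothetical before/after measurements on six subjects.
x = [10, 12, 9, 14, 11, 13]
y = [12, 15, 8, 18, 16, 19]
w, z, p = signed_rank_test(x, y)
print(w, round(z, 3), round(p, 3))
```

The differences are 2, 3, -1, 4, 5, 6, so the only negative difference carries rank 1 and the signed sum is 19. With six pairs the Normal approximation is only indicative; an exact version would enumerate the 2^n sign patterns.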

U-statistics: the general framework

Hoeffding (1948) introduced U-statistics as a general framework for unbiased estimators of population functionals. A U-statistic of degree m is

$$U_n = \binom{n}{m}^{-1} \sum_{i_1 < \cdots < i_m} \psi(X_{i_1}, \ldots, X_{i_m}),$$

where $\psi$ is a symmetric kernel. Examples: the sample variance is a U-statistic of degree 2 with kernel $(x_1 - x_2)^2 / 2$; Kendall's tau is a U-statistic of degree 2, and Spearman's rho is asymptotically equivalent to one; the Mann-Whitney U is, up to the factor $n_A n_B$, a two-sample U-statistic that averages the kernel $\psi(x, y) = \mathbb{1}(x < y)$ over all cross-group pairs. Hoeffding showed that non-degenerate U-statistics with finite kernel variance are asymptotically Normal, with closed-form variances derivable from kernel projection.
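The definition can be checked numerically. This sketch assumes a generic helper `u_statistic` (a hypothetical name) that averages a kernel over all size-m subsets, and verifies on made-up data that the degree-2 kernel (x1 - x2)^2 / 2 reproduces the usual sample variance. A second helper averages the indicator kernel over cross-group pairs, which is the two-sample case.

```python
from itertools import combinations
from math import comb
from statistics import variance

def u_statistic(data, kernel, m):
    """Average a symmetric kernel over all size-m subsets of `data`."""
    total = sum(kernel(*subset) for subset in combinations(data, m))
    return total / comb(len(data), m)

def mann_whitney_proportion(xs, ys):
    """Two-sample case: fraction of cross-group pairs with x < y."""
    return sum(x < y for x in xs for y in ys) / (len(xs) * len(ys))

# Hypothetical data.
data = [2.0, 4.0, 4.0, 5.0, 7.0]
var_u = u_statistic(data, lambda x1, x2: (x1 - x2) ** 2 / 2, m=2)
print(var_u, variance(data))  # the two values agree

xs = [1.1, 2.3, 2.9, 4.0]
ys = [3.5, 4.8, 5.2, 6.1, 7.4]
prop = mann_whitney_proportion(xs, ys)
print(prop)  # estimates P(X < Y)
```

Brute-force enumeration costs on the order of n^m kernel evaluations, so this form is useful for understanding rather than for large samples.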

Pitman ARE: rank vs t under Normal

The Pitman Asymptotic Relative Efficiency (ARE) of Wilcoxon vs t under Normal data is $3/\pi \approx 0.955$. Meaning: Wilcoxon needs about 5% more sample size to achieve the same power as t under Normal data. Hodges and Lehmann (1956) proved that the loss can never be much worse: under any continuous distribution, Wilcoxon's ARE vs t is at least 0.864 (the Hodges-Lehmann lower bound), and it is frequently MUCH higher, exceeding 1 under many skewed or heavy-tailed distributions.
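Reading the ARE as an asymptotic ratio of sample sizes, the arithmetic can be sketched as follows. The baseline of 100 observations for the t-test is a hypothetical figure, and the conversion is only a large-sample approximation.

```python
import math

are_normal = 3 / math.pi   # Wilcoxon vs t under Normal data, about 0.955
are_lower_bound = 108 / 125  # Hodges-Lehmann lower bound, 0.864

n_t = 100  # hypothetical sample size at which the t-test reaches the target power

# Approximate Wilcoxon sample size for the same power: n_t / ARE, rounded up.
n_wilcoxon_normal = math.ceil(n_t / are_normal)
n_wilcoxon_worst = math.ceil(n_t / are_lower_bound)
print(n_wilcoxon_normal, n_wilcoxon_worst)
```

So a study sized at 100 for the t-test would need roughly 105 observations for Wilcoxon under Normal data, and no more than about 116 under the least favourable distribution.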

Implication: use rank-based methods as the default. You lose about 5% efficiency under Normal data and never more than about 14% asymptotically, and you gain dramatic robustness elsewhere.

Kruskal-Wallis: multi-group analogue

For comparing K > 2 groups (multi-group ANOVA analogue), the Kruskal-Wallis test pools all data, ranks the pooled observations, and computes a chi-squared-like statistic from the per-group rank sums. Under H0, the statistic follows χ²(K-1) asymptotically. Post-hoc rank tests (Dunn-Sidák, Conover-Iman) handle pairwise comparisons after a significant K-W test.
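As an illustration, the usual Kruskal-Wallis statistic is $H = \frac{12}{N(N+1)} \sum_k R_k^2 / n_k - 3(N+1)$, where $R_k$ is the rank sum of group k. The sketch below assumes no ties; the function name `kruskal_wallis_h` and the data are hypothetical. For K = 3 the χ²(2) tail probability has the closed form exp(-H/2), which avoids needing a statistics library.

```python
import math

def kruskal_wallis_h(groups):
    """Kruskal-Wallis H from per-group rank sums (no ties assumed)."""
    pooled = sorted(x for g in groups for x in g)
    rank_of = {x: r for r, x in enumerate(pooled, start=1)}
    n = len(pooled)
    # Sum over groups of (rank sum)^2 / group size.
    total = sum(sum(rank_of[x] for x in g) ** 2 / len(g) for g in groups)
    return 12 / (n * (n + 1)) * total - 3 * (n + 1)

# Hypothetical data: three fully separated groups of three observations.
groups = [[1.2, 2.5, 3.1], [4.4, 5.0, 6.3], [7.7, 8.1, 9.6]]
h = kruskal_wallis_h(groups)
p = math.exp(-h / 2)  # chi-squared survival function, valid for 2 degrees of freedom only
print(round(h, 3), round(p, 4))
```

The rank sums are 6, 15 and 24, giving H = 7.2. With groups this small the chi-squared approximation is crude, and exact or permutation-based null distributions are generally preferred.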

Hodges-Lehmann estimator

The rank-test counterpart to the t-test's estimator (mean difference): the median of all pairwise differences $y_j - x_i$ across $i, j$. This is the Hodges-Lehmann location estimator, ROBUST to outliers and naturally paired with the Wilcoxon test. Used to report point estimates alongside Wilcoxon p-values.
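A minimal sketch of the estimator, contrasted with the difference of means on made-up data containing one gross outlier. The function name `hodges_lehmann_shift` is an illustrative choice.

```python
from statistics import mean, median

def hodges_lehmann_shift(x, y):
    """Median of all pairwise differences y_j - x_i (two-sample shift estimate)."""
    return median([yj - xi for xi in x for yj in y])

# Hypothetical data: group x contains one wild value.
x = [1, 2, 3, 100]
y = [4, 5, 6, 7]
hl = hodges_lehmann_shift(x, y)
mean_diff = mean(y) - mean(x)
print(hl, mean_diff)
```

The outlier drags the mean difference to -21, while the Hodges-Lehmann estimate stays at 3, in line with the shift visible in the bulk of the data. The pairwise enumeration uses n_A × n_B differences, which is fine for moderate samples.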

When NOT to use rank-based methods

  • Hypothesis is about means specifically: e.g., regulatory thresholds based on means. Wilcoxon tests stochastic dominance, not mean differences.
  • Subgroup analyses: rank tests within tiny subgroups have low power; t-test under Normal assumption is fine if the assumption holds.
  • Complex regression with covariates: rank-based extensions (van der Waerden, quantile regression) exist but are more involved; OLS often preferred unless residuals are clearly non-Normal.

[Interactive figure: Rank Test Explorer]

Try it

  • Start with Normal data, shift 0.70, N = 30. Both t-test and Wilcoxon give similar p-values (around 0.005-0.01). Normal data is the t-test's home turf; Wilcoxon is competitive.
  • Switch to Lognormal (skewed). Same shift, same N. Compare the p-values. Wilcoxon's p-value is typically smaller (more power) because the rank ordering separates the groups more cleanly than the noisy mean comparison.
  • Switch to Cauchy (heavy-tailed). Re-sample several times. t-test p-values are erratic — a single outlier can dominate the sample variance estimate. Wilcoxon p-values are much more stable.
  • Set shift = 0 (true null). Re-sample many times under each distribution. Under Normal data, both t-test and Wilcoxon p-values are uniform on [0, 1], so both have correct Type-I error. Under Cauchy, Wilcoxon p-values stay uniform, but the t-test's Type-I error can drift from the nominal level (it is typically conservative), because the Cauchy distribution has no finite variance and the CLT does not apply at any N.
  • Crank N to 200. Both tests gain power under Normal and Lognormal data and reach decisive p-values for moderate shifts. Under Cauchy, expect Wilcoxon to improve much more than the t-test, since the Cauchy sample mean does not concentrate as N grows.
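The shift = 0 experiment can also be approximated offline. The following sketch is not the widget's code: it simulates the rank-sum test under a true null for Normal and Cauchy data and estimates the Type-I error at the 5% level. The names `rank_sum_z` and `type_one_rate`, the seed, and the simulation sizes are all assumptions chosen for illustration.

```python
import math
import random

def rank_sum_z(a, b):
    """Standardized rank-sum statistic (continuous data, so no ties expected)."""
    n_a, n_b = len(a), len(b)
    n = n_a + n_b
    pooled = sorted([(x, 0) for x in a] + [(x, 1) for x in b])
    w = sum(r for r, (_, g) in enumerate(pooled, start=1) if g == 0)
    return (w - n_a * (n + 1) / 2) / math.sqrt(n_a * n_b * (n + 1) / 12)

def type_one_rate(draw, n_per_group=30, n_sims=2000, seed=0):
    """Fraction of null simulations rejected at the two-sided 5% level."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        a = [draw(rng) for _ in range(n_per_group)]
        b = [draw(rng) for _ in range(n_per_group)]
        if abs(rank_sum_z(a, b)) > 1.959964:  # Normal 97.5% quantile
            rejections += 1
    return rejections / n_sims

normal_rate = type_one_rate(lambda rng: rng.gauss(0, 1))
# Standard Cauchy draw via the inverse-CDF transform of a uniform variate.
cauchy_rate = type_one_rate(lambda rng: math.tan(math.pi * (rng.random() - 0.5)))
print(normal_rate, cauchy_rate)
```

Both rates should land near 0.05 up to simulation noise, which is the distribution-free property in action: the null distribution of the ranks is the same whether the data are Normal or Cauchy.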

A scientist tests a small clinical trial (N = 15 per arm) with outcomes that visibly skew right (lognormal-like). Why is Wilcoxon a better default than a t-test for this scenario?

What you now know

Rank-based methods (Wilcoxon rank-sum, signed-rank, Kruskal-Wallis) replace observations with ranks, making them distribution-free and robust. The Pitman ARE of Wilcoxon vs t is 0.955 under Normal data; under any continuous distribution it is at least 0.864, and under skewed or heavy-tailed data it often exceeds 1. Hoeffding's (1948) U-statistic framework gives the general theory. The Hodges-Lehmann estimator pairs robust point estimation with Wilcoxon testing. §8.5 next: kernel density estimation, the nonparametric continuous-distribution estimator.

References

  • Wilcoxon, F. (1945). "Individual comparisons by ranking methods." Biometrics Bulletin 1(6), 80–83. (Original.)
  • Mann, H.B., Whitney, D.R. (1947). "On a test of whether one of two random variables is stochastically larger than the other." Annals of Mathematical Statistics 18(1), 50–60.
  • Hoeffding, W. (1948). "A class of statistics with asymptotically normal distribution." Annals of Mathematical Statistics 19(3), 293–325. (U-statistics.)
  • Hodges, J.L., Lehmann, E.L. (1956). "The efficiency of some nonparametric competitors of the t-test." Annals of Mathematical Statistics 27(2), 324–335.
  • Lehmann, E.L. (2006). Nonparametrics: Statistical Methods Based on Ranks (revised). Springer.
