
Rank Tests & Permutation Inference

Distribution-free hypothesis testing through ranks and permutations — Wilcoxon, Mann-Whitney, Kruskal-Wallis, Pitman ARE, Hodges-Lehmann, and randomization inference for A/B testing

From the Sign Test to Wilcoxon Signed-Rank

The simplest hypothesis test you can run on paired data — pre-treatment vs post-treatment, twin-pair A vs twin-pair B, before-and-after readings on the same subject — is the sign test: count how many differences came out positive and ask if that count is consistent with random chance. The test makes no distributional assumption beyond continuity and symmetry around zero, so it works under conditions where Gaussian-based tests would fail. But it also throws away the magnitudes — a difference of $0.01$ counts the same as a difference of $5.0$. Wilcoxon's signed-rank statistic recovers the magnitude information through ranks while keeping the distribution-free property. Both tests turn out to be special cases of the more general permutation framework we develop in §2.

Definition 1 (Sign-Test Statistic).

For paired differences $D_1, \ldots, D_n$, the sign-test statistic is the count of positive differences,

$$N^+ \;=\; \sum_{i=1}^n \mathbf{1}\{D_i > 0\},$$

where $\mathbf{1}\{A\}$ is the indicator of event $A$.

Theorem 1 (Sign-Test Null Distribution).

Suppose $D_1, \ldots, D_n$ are independent and each $D_i$ is continuously distributed and symmetric about $0$. Then under $H_0\!:$ the common median is $0$,

$$N^+ \;\sim\; \mathrm{Binomial}\!\left(n,\, \tfrac{1}{2}\right).$$

Proof.

By continuity, $\mathbb{P}(D_i = 0) = 0$, so $\mathbf{1}\{D_i > 0\}$ is well-defined a.s. Symmetry of $D_i$ about $0$ gives $\mathbb{P}(D_i > 0) = \mathbb{P}(D_i < 0) = 1/2$. Independence of the $D_i$ then makes $\mathbf{1}\{D_1 > 0\}, \ldots, \mathbf{1}\{D_n > 0\}$ iid Bernoulli$(1/2)$, and their sum $N^+$ is $\mathrm{Binomial}(n, 1/2)$. $\square$

Definition 2 (Wilcoxon Signed-Rank Statistic).

With paired differences $D_1, \ldots, D_n$ having no ties in absolute value, let $R_i$ denote the rank of $|D_i|$ in the absolute-value sample (so $R_i \in \{1, \ldots, n\}$). The Wilcoxon signed-rank statistic is the sum of ranks at positive differences,

$$W^+ \;=\; \sum_{i\,:\,D_i > 0} R_i \;=\; \sum_{i=1}^n R_i \cdot \mathbf{1}\{D_i > 0\}.$$

Theorem 2 (Signed-Rank Null Distribution).

Under $H_0$ (paired differences symmetric about $0$, no ties in $|D_i|$), $W^+$ is distributed as $\sum_{i=1}^n i \cdot B_i$ with $B_i \stackrel{\mathrm{iid}}{\sim} \mathrm{Bernoulli}(1/2)$. Consequently,

$$\mathbb{E}[W^+] \;=\; \frac{n(n+1)}{4}, \qquad \mathrm{Var}(W^+) \;=\; \frac{n(n+1)(2n+1)}{24},$$

and the exact null distribution is symmetric about $n(n+1)/4$.

Proof.

Condition on the absolute ranks $R_1, \ldots, R_n$ (which form a permutation of $\{1, \ldots, n\}$). Symmetry of each $D_i$ about $0$ gives $\mathbb{P}(D_i > 0 \mid |D_i|) = 1/2$, and independence of the $D_i$ gives that the signs $\mathrm{sign}(D_i)$ are iid uniform on $\{-1, +1\}$ conditional on $(|D_1|, \ldots, |D_n|)$. Define $B_i = \mathbf{1}\{D_i > 0\}$, so $B_i \stackrel{\mathrm{iid}}{\sim} \mathrm{Bernoulli}(1/2)$ conditionally and unconditionally. Then

$$W^+ \;=\; \sum_{i=1}^n R_i B_i \;\stackrel{d}{=}\; \sum_{i=1}^n i \cdot B_{\sigma(i)} \;\stackrel{d}{=}\; \sum_{i=1}^n i \cdot B_i,$$

because $(B_{\sigma(1)}, \ldots, B_{\sigma(n)})$ is again iid Bernoulli$(1/2)$ for any permutation $\sigma$ (which here maps a fixed index to its rank position).

The moments follow termwise. Each $i \cdot B_i$ has mean $i/2$ and variance $i^2 \cdot \tfrac{1}{4}$, and the $B_i$ are independent, so

$$\mathbb{E}[W^+] \;=\; \tfrac{1}{2}\sum_{i=1}^n i \;=\; \frac{n(n+1)}{4}, \qquad \mathrm{Var}(W^+) \;=\; \tfrac{1}{4}\sum_{i=1}^n i^2 \;=\; \frac{n(n+1)(2n+1)}{24}.$$

For symmetry: the sign-flip $B \mapsto \mathbf{1} - B$ is a bijection on $\{0, 1\}^n$ that maps $W^+ = \sum_i i B_i$ to $\sum_i i (1 - B_i) = \tfrac{n(n+1)}{2} - W^+$. Hence the null distribution is symmetric about its mean $n(n+1)/4 = \tfrac{1}{2} \cdot \tfrac{n(n+1)}{2}$. $\square$

A concrete example illustrates the workflow. Take $n = 8$ paired differences from a treatment study,

$$D \;=\; (-0.4,\; 1.2,\; 2.8,\; -0.1,\; 3.7,\; 0.6,\; -1.5,\; 2.0).$$

Five of the eight are positive, so $N^+ = 5$ — the sign-test two-sided $p$-value from $\mathrm{Binomial}(8, 1/2)$ is $\approx 0.727$, well above $\alpha = 0.05$. The Wilcoxon signed-rank statistic uses the absolute ranks $R = (2, 4, 7, 1, 8, 3, 5, 6)$ and sums the ranks at positive differences: $W^+ = 4 + 7 + 8 + 3 + 6 = 28$. Enumerating all $2^8 = 256$ sign vectors gives a discrete null with mean $18$ and variance $51$, against which $W^+ = 28$ has two-sided $p \approx 0.195$. Both tests agree the effect is not significant at $\alpha = 0.05$, but the framework — replace data with rank-and-sign information, derive the null from the symmetry group of sign flips, get an exact $p$-value from enumeration — is the structure we generalize next.
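The whole workflow fits in a few lines. A minimal sketch (assuming numpy and scipy ≥ 1.7 for binomtest) that reproduces the numbers above by brute-force enumeration of the $2^8$ sign vectors:

```python
import numpy as np
from itertools import product
from scipy.stats import binomtest

D = np.array([-0.4, 1.2, 2.8, -0.1, 3.7, 0.6, -1.5, 2.0])
n = len(D)

# Sign test: N+ ~ Binomial(n, 1/2) under H0 (Theorem 1).
n_plus = int((D > 0).sum())                            # 5
p_sign = binomtest(n_plus, n, p=0.5).pvalue            # ~0.727

# Wilcoxon signed-rank: rank |D_i|, sum ranks at positive differences.
R = np.abs(D).argsort().argsort() + 1                  # (2, 4, 7, 1, 8, 3, 5, 6)
W_obs = int(R[D > 0].sum())                            # 4 + 7 + 8 + 3 + 6 = 28

# Exact null by enumerating all 2^n sign vectors (Theorem 2).
null = np.array([np.dot(R, b) for b in product([0, 1], repeat=n)])
mu = n * (n + 1) / 4                                   # 18
p_wsr = np.mean(np.abs(null - mu) >= abs(W_obs - mu))  # two-sided via symmetry, ~0.195

print(f"N+ = {n_plus}, sign p = {p_sign:.3f}; W+ = {W_obs}, exact p = {p_wsr:.3f}")
```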

At n = 8 the Wilcoxon null is centred at μ = n(n+1)/4 = 18 with variance n(n+1)(2n+1)/24 = 51 (Theorem 2). Drag D₁ across zero to flip its sign and watch which bars enter the rejection region.

The Permutation Principle and Exchangeability

Theorem 2’s null distribution didn’t really need symmetric continuous differences — it needed something subtler. The joint distribution of $(D_1, \ldots, D_n)$ had to be invariant under sign-flips. That kind of invariance generalises to a property called exchangeability, which is the engine driving every test in this topic and (it turns out) conformal prediction as well.

Definition 3 (Exchangeability).

A random vector $(X_1, \ldots, X_n)$ is exchangeable if for every permutation $\pi$ of $\{1, \ldots, n\}$,

$$(X_{\pi(1)}, \ldots, X_{\pi(n)}) \;\stackrel{d}{=}\; (X_1, \ldots, X_n).$$

Independence plus identical distribution (iid) implies exchangeability, but not conversely. Under exchangeability with no ties, the rank vector is uniformly distributed on the $n!$ permutations of $\{1, \ldots, n\}$.

Theorem 3 (Permutation-Test Exactness).

Let $T$ be any test statistic and $\mathcal{G}$ a finite group of transformations such that under $H_0$ the joint distribution of $X = (X_1, \ldots, X_n)$ is invariant under every $g \in \mathcal{G}$ — i.e., $g \cdot X \stackrel{d}{=} X$. Define the permutation $p$-value

$$p(X) \;=\; \frac{1}{|\mathcal{G}|} \sum_{g \in \mathcal{G}} \mathbf{1}\{T(g \cdot X) \geq T(X)\}.$$

Then under $H_0$, $\mathbb{P}(p(X) \leq \alpha) \leq \alpha$ for every $\alpha \in [0, 1]$, with equality at $\alpha$ values that are integer multiples of $1/|\mathcal{G}|$ (after random tie-breaking).

Proof.

Under $H_0$, for any fixed $g_0 \in \mathcal{G}$, the random vector $g_0 \cdot X$ has the same distribution as $X$. So the multiset of statistic values

$$M(X) \;=\; \{T(g \cdot X) : g \in \mathcal{G}\}$$

satisfies $M(g_0 \cdot X) = M(X)$ as a multiset, since

$$M(g_0 \cdot X) \;=\; \{T(g \cdot g_0 \cdot X) : g \in \mathcal{G}\} \;=\; \{T(h \cdot X) : h \in \mathcal{G}\} \;=\; M(X),$$

where $h = g g_0$ ranges over $\mathcal{G}$ as $g$ does (group property). The observed statistic $T(X)$ corresponds to $g = \mathrm{id}$ and is one element of $M(X)$. Under $H_0$, $T(X)$ is distributionally exchangeable with each of the $|\mathcal{G}|$ values $\{T(g \cdot X)\}$; this is the consequence of $g \cdot X \stackrel{d}{=} X$ applied to each $g$. Therefore the rank of $T(X)$ within the multiset $M(X)$ — call it $R$ — is uniformly distributed on $\{1, 2, \ldots, |\mathcal{G}|\}$ (with random tie-breaking).

The permutation $p$-value satisfies

$$p(X) \;=\; \frac{\#\{g : T(g \cdot X) \geq T(X)\}}{|\mathcal{G}|} \;=\; \frac{|\mathcal{G}| - R + 1}{|\mathcal{G}|}.$$

Since $R$ is uniform on $\{1, \ldots, |\mathcal{G}|\}$, $p(X)$ is uniform on $\{1/|\mathcal{G}|, 2/|\mathcal{G}|, \ldots, 1\}$. Therefore

$$\mathbb{P}(p(X) \leq \alpha) \;=\; \frac{\lfloor \alpha\, |\mathcal{G}| \rfloor}{|\mathcal{G}|} \;\leq\; \alpha,$$

with equality whenever $\alpha$ is a multiple of $1/|\mathcal{G}|$. $\square$

Remark (Permutation vs Randomization Inference).

The permutation principle is sometimes called the randomization principle, and the two terms are routinely conflated. There is a real distinction. In the population-sampling context (data drawn from an infinite population), Theorem 3’s invariance assumption is exchangeability of the sample under $H_0$ — a property of the joint distribution. In the finite-population or experimental context (Theorem 12 below; §8), the invariance is assured by the experimenter’s random treatment assignment — a property of the physical randomization mechanism. The two are mathematically identical (both are instances of Theorem 3 with different choices of $\mathcal{G}$), but their justifications differ. In observational studies, exchangeability is often a much stronger assumption than experimenters realise; in randomized experiments, it is guaranteed by design. We pick this back up explicitly in §8.

The proof is short and statistic-free. That’s the power of the permutation principle: validity comes from the symmetry of the null, not from any asymptotic property of $T$. Use a $t$-statistic, a mean difference, a max correlation, anything — the permutation test wraps it in an exact level-$\alpha$ envelope.
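To make the statistic-agnostic point concrete, here is a minimal sketch of Theorem 3: the group is supplied as any generator of transformed copies of the data (identity included) and the statistic is any callable. The helper names are illustrative, not from a library:

```python
import numpy as np
from itertools import product

def permutation_pvalue(x, statistic, group):
    """Theorem 3's p-value: the share of group elements g with T(g.x) >= T(x).
    `group(x)` must yield every transformed copy of x, identity included."""
    t_obs = statistic(x)
    vals = np.array([statistic(gx) for gx in group(x)])
    return float(np.mean(vals >= t_obs))

def sign_flips(x):
    # The 2^n sign-flip group of Section 1: null-preserving when x is symmetric about 0.
    for signs in product((-1.0, 1.0), repeat=len(x)):
        yield np.asarray(signs) * x

D = np.array([-0.4, 1.2, 2.8, -0.1, 3.7, 0.6, -1.5, 2.0])
# Any statistic gets the exact level-alpha envelope: |mean|, a t-statistic, a max...
print(permutation_pvalue(D, lambda d: abs(d.mean()), sign_flips))
```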


Wilcoxon Rank-Sum and Mann-Whitney $U$

The two-sample setting moves from paired data to two independent groups: $n_1$ observations $X_1, \ldots, X_{n_1}$ from one population and $n_2$ from another, with $N = n_1 + n_2$ pooled. Test scores from two teaching methods, activation latencies from two neural regions, conversion times for two checkout flows. The null is that the populations have the same continuous distribution; the alternative is that one is stochastically larger than the other. Pool the samples, rank them from $1$ to $N$, and take the rank-sum of the smaller group — that’s the Wilcoxon rank-sum statistic. Mann-Whitney’s $U$ is its combinatorial twin, counting pairs $(X_i, Y_j)$ in which $X$ comes out smaller. They contain the same information (Theorem 4), and both inherit exact null moments from the exchangeability of pooled ranks (Theorem 5).

Definition 4 (Wilcoxon Rank-Sum Statistic).

With $X_1, \ldots, X_{n_1}$ from one population and $Y_1, \ldots, Y_{n_2}$ from another, pool all $N = n_1 + n_2$ observations and rank them $1, \ldots, N$. Let $R_i^X$ denote the rank of $X_i$ in the pooled sample. The Wilcoxon rank-sum statistic is

$$W \;=\; \sum_{i=1}^{n_1} R_i^X.$$

Definition 5 (Mann-Whitney $U$ Statistic).

With pooled samples as above (and continuous data, so ties have probability zero), the Mann-Whitney $U$ statistic is the count of $(X, Y)$ pairs in which $X$ is smaller:

$$U \;=\; \sum_{i=1}^{n_1} \sum_{j=1}^{n_2} \mathbf{1}\{X_i < Y_j\}.$$

The ratio $U / (n_1 n_2)$ is an unbiased estimator of $\theta = \mathbb{P}(X < Y)$ when $X \sim F$ and $Y \sim G$ are independent draws from the two populations.

Theorem 4 (Equivalence of $W$ and $U$).

$$W \;=\; n_1 n_2 \;+\; \frac{n_1(n_1 + 1)}{2} \;-\; U.$$

Because $W$ and $U$ are linearly related, they yield identical tests; the choice between them is a matter of bookkeeping.

Proof.

Sort the $X$-sample so $X_{(1)} < X_{(2)} < \cdots < X_{(n_1)}$. The rank of $X_{(k)}$ in the pooled sample equals $k$ (its rank within $X$) plus the number of $Y$-values strictly less than it:

$$R_{X_{(k)}} \;=\; k \;+\; \#\{j : Y_j < X_{(k)}\}.$$

Sum over $k$:

$$W \;=\; \sum_k R_{X_{(k)}} \;=\; \sum_{k=1}^{n_1} k \;+\; \#\{(k, j) : Y_j < X_{(k)}\} \;=\; \frac{n_1(n_1+1)}{2} \;+\; \#\{(i, j) : Y_j < X_i\}.$$

Each pair $(i, j)$ contributes to exactly one of the events $\{X_i < Y_j\}$ or $\{Y_j < X_i\}$ (continuity rules out ties), so

$$\#\{(i, j) : Y_j < X_i\} \;=\; n_1 n_2 - U,$$

giving $W = n_1(n_1+1)/2 + n_1 n_2 - U$. $\square$

Theorem 5 (Null Moments of $U$).

Under $H_0$ (the pooled sample $(X_1, \ldots, X_{n_1}, Y_1, \ldots, Y_{n_2})$ is exchangeable, equivalent to $F = G$ continuous),

$$\mathbb{E}[U] \;=\; \frac{n_1 n_2}{2}, \qquad \mathrm{Var}(U) \;=\; \frac{n_1 n_2 (N + 1)}{12}.$$

For $n_1, n_2 \gtrsim 10$, the normal approximation $U \approx \mathcal{N}\!\left(n_1 n_2 / 2,\; n_1 n_2 (N+1)/12\right)$ is excellent; for smaller samples, enumerate the $\binom{N}{n_1}$ rank assignments exactly.

Proof.

Write $U = \sum_{i, j} U_{ij}$ with $U_{ij} = \mathbf{1}\{X_i < Y_j\}$. Under $H_0$ every pair $(X_i, Y_j)$ is exchangeable, so $\mathbb{E}[U_{ij}] = \mathbb{P}(X < Y) = 1/2$ by symmetry. Therefore $\mathbb{E}[U] = n_1 n_2 / 2$.

For the variance,

$$\mathrm{Var}(U) \;=\; \sum_{i, j} \mathrm{Var}(U_{ij}) \;+\; \sum_{(i, j) \neq (i', j')} \mathrm{Cov}(U_{ij}, U_{i'j'}).$$

Variance term. Each $U_{ij}$ is Bernoulli$(1/2)$, so $\mathrm{Var}(U_{ij}) = 1/4$. The diagonal contribution is $n_1 n_2 / 4$.

Covariance terms split by overlap pattern:

(a) Same $X$, different $Y$ ($i = i'$, $j \neq j'$). Three exchangeable continuous random variables $(X, Y_j, Y_{j'})$ have all $3! = 6$ orderings equally likely, so

$$\mathbb{P}(X < Y_j \text{ and } X < Y_{j'}) \;=\; \mathbb{P}(X = \min) \;=\; 1/3,$$

giving $\mathrm{Cov}(U_{ij}, U_{ij'}) = 1/3 - 1/4 = 1/12$.

(b) Different $X$, same $Y$ ($i \neq i'$, $j = j'$). Symmetric to (a): $\mathrm{Cov} = 1/12$.

(c) All distinct ($i \neq i'$, $j \neq j'$). $U_{ij}$ and $U_{i'j'}$ are functions of disjoint independent pairs, so they are independent and $\mathrm{Cov} = 0$.

Counting the pairs. Same-$X$ ordered pairs: $n_1$ choices for the shared $X$, $n_2(n_2 - 1)$ ordered choices for the $Y$-pair, total $n_1 n_2 (n_2 - 1)$. Same-$Y$ pairs by symmetry: $n_2 n_1 (n_1 - 1)$. Combining,

$$\mathrm{Var}(U) \;=\; \frac{n_1 n_2}{4} \;+\; \frac{n_1 n_2 (n_2 - 1)}{12} \;+\; \frac{n_1 n_2 (n_1 - 1)}{12} \;=\; \frac{n_1 n_2}{12}\bigl[3 + (n_2 - 1) + (n_1 - 1)\bigr] \;=\; \frac{n_1 n_2 (N + 1)}{12}. \;\square$$

Remark (scipy's Convention for $U$).

scipy.stats.mannwhitneyu(X, Y) returns $U$ for the first argument, counting $\#\{(i, j) : Y_j < X_i\}$ — the complement of our Definition 5. The two conventions satisfy $U_{XY} + U_{YX} = n_1 n_2$, encode the same information, and produce identical two-sided $p$-values; only the reported statistic differs. Practitioners who run mannwhitneyu(X, Y) and then look up “expected $U = n_1 n_2 / 2$” sometimes lose an hour to this. The fix is to pass the arguments in the order matching whichever convention you have in mind.

A worked example: $X = (1.2, 2.7, 3.1, 4.0, 5.5)$ and $Y = (2.1, 3.5, 4.8, 6.0, 6.2)$. Pooled ranks give $W = 22$ (the sum of $X$-ranks $1, 3, 4, 6, 8$) and $U = 18$ (the count of $(X_i, Y_j)$ pairs with $X_i < Y_j$). Theorem 5’s null moments give $\mathbb{E}[U] = 12.5$ and $\mathrm{Var}(U) \approx 22.92$. The two-sided exact $p$-value (scipy.stats.mannwhitneyu) is $\approx 0.31$ — consistent with no shift between the two populations. The interactive PermutationDistributionExplorer below lets you switch the underlying parametric families (normal, log-normal, exponential) and verify that $U$’s null moments hold across all of them — that’s the distribution-free promise.

The permutation framework is statistic-agnostic — same procedure, three interchangeable test statistics, validity from Theorem 3 in every case. Switch the group distribution to log-normal or shrink Δ to feel the power profile shift.
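A short scipy sketch (scipy ≥ 1.7 for method='exact') reproducing the worked example and checking both Theorem 4’s identity and the convention remark:

```python
import numpy as np
from scipy.stats import mannwhitneyu, rankdata

X = np.array([1.2, 2.7, 3.1, 4.0, 5.5])
Y = np.array([2.1, 3.5, 4.8, 6.0, 6.2])
n1, n2 = len(X), len(Y)
N = n1 + n2

ranks = rankdata(np.concatenate([X, Y]))        # pooled ranks 1..N
W = ranks[:n1].sum()                            # rank-sum of X: 22.0
U = int((X[:, None] < Y[None, :]).sum())        # Definition 5: #{X_i < Y_j} = 18
assert W == n1 * n2 + n1 * (n1 + 1) / 2 - U     # Theorem 4

res = mannwhitneyu(X, Y, alternative='two-sided', method='exact')
# scipy reports the complementary count #{Y_j < X_i} (see the remark above):
print(res.statistic, n1 * n2 - res.statistic)   # 7.0, 18.0
print(res.pvalue)                               # ~0.31
print(n1 * n2 / 2, n1 * n2 * (N + 1) / 12)      # null moments: 12.5, ~22.92
```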

Kruskal-Wallis: The $k$-Sample Extension

When there are $k$ groups instead of $2$ — three teaching methods, four drug doses, five regional cohorts — the natural extension is the Kruskal-Wallis test. It is the rank analog of one-way ANOVA’s $F$-test: pool, rank, and look at the between-group dispersion of mean ranks. The shape of the test mirrors ANOVA exactly, with the rank substitution buying distribution-freeness without giving up the chi-squared limit.

Definition 6 (Kruskal-Wallis Statistic).

Suppose we have $k$ groups of sizes $n_1, \ldots, n_k$ with total $N = \sum_j n_j$. Pool all observations and rank them $1, \ldots, N$; let $R_{ji}$ denote the rank of the $i$-th observation in group $j$, and let

$$\bar R_j \;=\; \frac{1}{n_j} \sum_{i=1}^{n_j} R_{ji}$$

be the mean rank in group $j$. The Kruskal-Wallis statistic is

$$H \;=\; \frac{12}{N(N+1)} \sum_{j=1}^k n_j \left( \bar R_j - \frac{N+1}{2} \right)^{\!2}.$$

The grand mean rank is $(N+1)/2$ — the average of the integers $1, \ldots, N$ — and the prefactor $12 / [N(N+1)]$ normalises by the variance $(N^2 - 1)/12$ of the discrete uniform on $\{1, \ldots, N\}$, putting $H$ on a scale comparable to a chi-squared.

Theorem 6 (Kruskal-Wallis Asymptotic Null).

Under $H_0$ (all $k$ groups have the same continuous distribution), as all $n_j \to \infty$ with $n_j / N \to \rho_j \in (0, 1)$,

$$H \;\xrightarrow{\;d\;}\; \chi^2_{k-1}.$$
Proof.

Define standardised mean-rank deviations

$$Z_j \;=\; \sqrt{n_j} \cdot \frac{\bar R_j - (N+1)/2}{\sqrt{N(N+1)/12}}, \qquad j = 1, \ldots, k.$$

Then $H = \sum_j Z_j^2 = Z^\top Z$. We compute the asymptotic distribution of $Z = (Z_1, \ldots, Z_k)^\top$.

Under $H_0$ the rank vector $(R_{ji})$ is a uniformly random permutation of $\{1, \ldots, N\}$. A single rank has mean $(N+1)/2$ and variance $(N^2 - 1)/12$. The mean rank in group $j$ averages $n_j$ ranks sampled without replacement from $\{1, \ldots, N\}$, so the finite-population variance formula gives

$$\mathrm{Var}(\bar R_j) \;=\; \frac{N^2 - 1}{12 n_j} \cdot \frac{N - n_j}{N - 1} \;=\; \frac{(N+1)(N - n_j)}{12 n_j},$$

where $(N - n_j)/(N - 1)$ is the finite-population correction. Therefore

$$\mathrm{Var}(Z_j) \;=\; \frac{n_j}{N(N+1)/12} \cdot \frac{(N+1)(N - n_j)}{12 n_j} \;=\; \frac{N - n_j}{N} \;\to\; 1 - \rho_j.$$

For the off-diagonal structure: the centred ranks $(R_{ji} - (N+1)/2)$ sum to zero, so the group sums satisfy $\sum_j n_j (\bar R_j - (N+1)/2) = 0$. Rescaling, the vector $S$ with $S_j = \sum_i (R_{ji} - (N+1)/2)/\sqrt{N(N+1)/12}$ lies on the hyperplane $\sum_j S_j = 0$, so its covariance is singular along the all-ones direction. After the further rescaling $Z_j = S_j / \sqrt{n_j}$,

$$\mathrm{Cov}(Z) \;=\; I - p p^\top, \qquad p = (\sqrt{\rho_1}, \ldots, \sqrt{\rho_k})^\top, \qquad \|p\|^2 = \sum_j \rho_j = 1.$$

The diagonal entries $1 - \rho_j$ match the marginal-variance calculation; the off-diagonal entries $-\sqrt{\rho_j \rho_l}$ match the constraint $\sum_j \sqrt{\rho_j} Z_j = 0$ (rescaled from $\sum_j n_j \bar R_j = \mathrm{const}$).

The multivariate finite-population CLT (Hájek 1960; van der Vaart 1998 §13.2) gives

$$Z \;\xrightarrow{\;d\;}\; \mathcal{N}(0,\, I - p p^\top)$$

as all $n_j \to \infty$ with $n_j / N \to \rho_j \in (0, 1)$.

Finally, for $Z \sim \mathcal{N}(0, \Sigma)$ with $\Sigma$ idempotent of rank $r$, the quadratic form $Z^\top Z$ is $\chi^2_r$-distributed. Here $\Sigma = I - p p^\top$ is idempotent — $(I - p p^\top)^2 = I - 2 p p^\top + p (p^\top p) p^\top = I - p p^\top$ since $p^\top p = 1$ — and has rank $k - 1$ (the direction $p$ spans the null space). Therefore

$$H \;=\; Z^\top Z \;\xrightarrow{\;d\;}\; \chi^2_{k-1}. \;\square$$

The translation from $F$ to $H$ is essentially “replace data values with their pooled ranks.” A worked example with three groups of $n_j = 8$ each, drawn under $H_0$ from a common normal, gives mean ranks $\bar R \approx (10.4, 14.5, 12.6)$ and $H = 1.365$ — a $\chi^2_2$ $p$-value of $\approx 0.50$, well within the null. Empirical type-I error at $\alpha \in \{0.01, 0.05, 0.10\}$ tracks the nominal levels closely for both balanced ($n_j = 15$) and unbalanced ($(n_1, n_2, n_3) = (10, 25, 40)$) designs.

Empirical null distribution of $H$ from $5000$ simulations under $H_0$ at $k = 3$, $n_j = 15$, overlaid with the asymptotic $\chi^2_2$ density. The convergence is visibly tight even at this moderate $n$. The static figure is the right tool for this story — interactivity wouldn't add insight to a one-shape, one-degree-of-freedom convergence.
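A sketch of the same computation: $H$ assembled by hand from Definition 6 and checked against scipy.stats.kruskal (the draw, and hence the value of $H$, depends on the seed):

```python
import numpy as np
from scipy.stats import chi2, kruskal, rankdata

rng = np.random.default_rng(0)
groups = [rng.normal(size=8) for _ in range(3)]        # k = 3 groups under H0

# H by hand (Definition 6): pooled ranks, dispersion of group mean ranks.
pooled = np.concatenate(groups)
N = len(pooled)
ranks = rankdata(pooled)
splits = np.split(ranks, np.cumsum([len(g) for g in groups])[:-1])
H = 12 / (N * (N + 1)) * sum(len(r) * (r.mean() - (N + 1) / 2) ** 2 for r in splits)

H_scipy, p_scipy = kruskal(*groups)                    # agrees for ties-free data
print(H, H_scipy, chi2.sf(H, df=2), p_scipy)           # chi-squared(k-1) tail
```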

Asymptotic Relative Efficiency (Pitman)

The rank tests of §§1, 3, 4 are valid under nearly any continuous distribution — that’s the distribution-free promise. A natural concern is that we might be paying for that generality with reduced power: under conditions where the optimal parametric test exists (the $t$-test under normality, say), are we losing meaningful sensitivity by using ranks? Pitman’s framework of asymptotic relative efficiency answers the question precisely. ARE compares the sample sizes two tests need to achieve equal power against the same local alternative; under contiguous alternatives $\theta_n = h/\sqrt{n}$, the ratio depends only on the noise density’s variance and the integral $\int f^2$. The result is the celebrated formula in Theorem 8 below — and the surprise is just how forgiving it is to give up the parametric optimum.

Theorem 7 (U-statistic Representation of $W^+$).

For continuous paired differences $D_1, \ldots, D_n$ with no ties in $|D_i|$,

$$W^+ \;=\; \sum_{i\,:\,D_i > 0} R_i \;=\; \sum_{1 \leq i \leq j \leq n} \mathbf{1}\{D_i + D_j > 0\}.$$

That is, $W^+$ counts the number of Walsh averages $A_{ij} = (D_i + D_j)/2$ that exceed zero (with the convention $A_{ii} = D_i$). This bridges $W^+$ to the Walsh-average construction we use for Hodges-Lehmann in §6.

Proof.

Split the pairs by the diagonal:

$$\sum_{i \leq j} \mathbf{1}\{D_i + D_j > 0\} \;=\; \underbrace{\sum_i \mathbf{1}\{2 D_i > 0\}}_{i = j} \;+\; \underbrace{\sum_{i < j} \mathbf{1}\{D_i + D_j > 0\}}_{i < j}.$$

The diagonal term is $\sum_i \mathbf{1}\{D_i > 0\}$.

For a pair of distinct indices with continuous data, exactly one of $|D_i|, |D_j|$ is larger; since the larger absolute value dominates the sum, the sign of $D_i + D_j$ matches the sign of whichever difference has the larger absolute value. Reorganise the off-diagonal sum over that larger member: for each $j$, every index $i \neq j$ with $|D_i| < |D_j|$ contributes $\mathbf{1}\{D_j > 0\}$, and since the number of indices $i$ with $|D_i| \leq |D_j|$ is $R_j$ (the rank of $|D_j|$), there are $R_j - 1$ such contributing indices. Hence

$$\sum_{i < j} \mathbf{1}\{D_i + D_j > 0\} \;=\; \sum_j (R_j - 1) \cdot \mathbf{1}\{D_j > 0\}.$$

Combining:

$$\sum_{i \leq j} \mathbf{1}\{D_i + D_j > 0\} \;=\; \sum_j \mathbf{1}\{D_j > 0\} \;+\; \sum_j (R_j - 1)\, \mathbf{1}\{D_j > 0\} \;=\; \sum_j R_j\, \mathbf{1}\{D_j > 0\} \;=\; W^+. \;\square$$

Theorem 8 (Pitman ARE Formula).

Under the contiguous local alternative $\theta_n = h / \sqrt{n}$ with paired differences $D_i = Z_i + \theta_n$, where $Z_i$ has density $f$ symmetric around $0$ with finite variance $\sigma^2$ and $\int f^2 < \infty$, the Pitman asymptotic relative efficiency of the Wilcoxon signed-rank test relative to the one-sample $t$-test is

$$\mathrm{ARE}(W, T) \;=\; \left( \frac{c_W}{c_T} \right)^{\!2} \;=\; 12\, \sigma^2 \!\left( \int f(x)^2\, dx \right)^{\!2},$$

with drift coefficients $c_T = 1/\sigma$ and $c_W = \sqrt{12}\, I(f)$ where $I(f) = \int f^2$. At standard symmetric densities,

$$\begin{array}{l|l} \text{Density} & \mathrm{ARE}(W, T) \\ \hline \text{Normal} & 3/\pi \approx 0.955 \\ \text{Logistic} & \pi^2 / 9 \approx 1.097 \\ \text{Uniform on } [-1, 1] & 1.000 \\ \text{Laplace} & 1.500 \\ \text{Heavy tails (e.g. } t_3, t_2\text{)} & \to \infty \end{array}$$
Proof.

Drift of the $t$-test. $T_n = \sqrt{n}\, \bar D / s$. Under $\theta_n = h/\sqrt{n}$ we have $\sqrt{n}\,\mathbb{E}[\bar D] = h$, and $s \to \sigma$ in probability, so $T_n \xrightarrow{d} \mathcal{N}(h/\sigma, 1)$. The drift coefficient is $c_T = 1/\sigma$.

Drift of Wilcoxon. Apply the U-statistic representation of $W^+$ from Theorem 7. Decompose $\mathbb{E}[W^+]$ into diagonal and off-diagonal contributions:

$$\mathbb{E}[W^+] \;=\; n \cdot \mathbb{P}(D_1 > 0) \;+\; \binom{n}{2} \cdot \mathbb{P}(D_1 + D_2 > 0).$$

(a) Diagonal term. Write $D_1 = Z_1 + \theta_n$ with $Z_1 \sim f$ symmetric. Then $\mathbb{P}(D_1 > 0) = \mathbb{P}(Z_1 > -\theta_n) = 1 - F(-\theta_n)$, where $F$ is the CDF of $Z_1$. Taylor-expand $F$ around $0$ using $F(0) = 1/2$ and $F'(0) = f(0)$:

$$F(-\theta_n) \;=\; \tfrac{1}{2} \;-\; f(0)\, \theta_n \;+\; O(\theta_n^2),$$

so $\mathbb{P}(D_1 > 0) = 1/2 + f(0)\, \theta_n + O(\theta_n^2)$.

(b) Off-diagonal term. $D_1 + D_2 = (Z_1 + Z_2) + 2 \theta_n$, where $Z_1 + Z_2$ has density $g = f \star f$ (convolution), symmetric around $0$. The density at $0$ is

$$g(0) \;=\; \int f(u)\, f(-u)\, du \;=\; \int f(u)^2\, du \;=\; I(f),$$

using the symmetry of $f$. The same Taylor expansion applied to the CDF $G$ of $Z_1 + Z_2$ (together with the symmetry identity $1 - G(-x) = G(x)$) gives

$$\mathbb{P}(D_1 + D_2 > 0) \;=\; G(2 \theta_n) \;=\; \tfrac{1}{2} \;+\; 2\, g(0)\, \theta_n \;+\; O(\theta_n^2) \;=\; \tfrac{1}{2} \;+\; 2\, I(f)\, \theta_n \;+\; O(\theta_n^2).$$

Combining and subtracting the null mean $n(n+1)/4$:

$$\mathbb{E}[W^+] - \tfrac{n(n+1)}{4} \;=\; n\, f(0)\, \theta_n \;+\; n(n-1)\, I(f)\, \theta_n \;+\; O(n^2 \theta_n^2).$$

With $\theta_n = h / \sqrt{n}$:

$$\mathbb{E}[W^+] - \tfrac{n(n+1)}{4} \;=\; h \sqrt{n}\, f(0) \;+\; h\, (n-1) \sqrt{n}\, I(f) \;+\; O(1) \;\sim\; h\, n^{3/2}\, I(f), \quad n \to \infty,$$

where the off-diagonal U-statistic part of order $n^2 \theta_n = n^{3/2} h$ dominates the diagonal of order $n \theta_n = \sqrt{n}\, h$.

The null variance of $W^+$ is $n(n+1)(2n+1)/24 \sim n^3 / 12$, so the standardisation divisor is $\sqrt{n^3/12} = n^{3/2}/\sqrt{12}$. The standardised statistic

$$W_n^* \;=\; \frac{W^+ - n(n+1)/4}{\sqrt{n(n+1)(2n+1)/24}}$$

satisfies $\mathbb{E}[W_n^*] \to h \sqrt{12}\, I(f)$. By the U-statistic CLT (Hoeffding 1948) and contiguity of $\theta_n$ with the null,

$$W_n^* \;\xrightarrow{\;d\;}\; \mathcal{N}(c_W h, 1), \qquad c_W \;=\; \sqrt{12}\, I(f).$$

ARE. The Pitman ARE compares the squared drift coefficients,

$$\mathrm{ARE}(W, T) \;=\; \left( \frac{c_W}{c_T} \right)^{\!2} \;=\; \frac{12\, I(f)^2}{1/\sigma^2} \;=\; 12\, \sigma^2 \, I(f)^2. \;\square$$

A direct numerical sanity check on the $\mathrm{ARE}(W, T) = 0.955$ figure under normality: at $n = 30$, $\theta = 0.3$, normal noise, $5000$ Monte Carlo replications, the Wilcoxon empirical power is $0.465$ vs the $t$-test’s $0.478$ — a ratio of $\approx 0.974$, just above the asymptotic $0.955$. The asymptotic ARE is the limit, and finite-$n$ ratios bracket it.
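The check is short enough to rerun; exact powers wobble with the seed, but the ratio should land near — and slightly above — the asymptotic $0.955$:

```python
import numpy as np
from scipy.stats import ttest_1samp, wilcoxon

rng = np.random.default_rng(1)
n, theta, B, alpha = 30, 0.3, 5000, 0.05

rej_t = rej_w = 0
for _ in range(B):
    d = rng.normal(loc=theta, size=n)          # local alternative, normal noise
    rej_t += ttest_1samp(d, 0.0).pvalue < alpha
    rej_w += wilcoxon(d).pvalue < alpha

print(f"t power ~ {rej_t / B:.3f}, Wilcoxon power ~ {rej_w / B:.3f}, "
      f"ratio ~ {rej_w / rej_t:.3f}")          # near, and above, ARE = 0.955
```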

Try Student-t with df = 5 (ARE ≈ 1.24) or contaminated normal with ε = 0.1 (ARE rises sharply with the contamination weight). The Hodges-Lehmann minimum is the parabolic density at which ARE(W, T) bottoms out at 0.864.

Remark (The Hodges-Lehmann 0.864 Floor).

A celebrated result of Hodges & Lehmann (1956) establishes that for any symmetric density $f$ with finite variance (and enough regularity for Theorem 8 to apply),

$$\mathrm{ARE}(W, T) \;\geq\; \frac{108}{125} \;=\; 0.864.$$

Wilcoxon never loses more than ~14% asymptotic efficiency relative to the $t$-test, regardless of the underlying density, and can gain unboundedly under heavy tails. The minimum is achieved at the parabolic density $f^*(x) = \tfrac{3}{4\sqrt{5}}(1 - x^2/5)$ on $|x| \leq \sqrt{5}$. The proof is a calculus-of-variations problem — minimise $12 \sigma^2 (\int f^2)^2$ subject to symmetry and unit variance — and is reproduced in Hodges & Lehmann (1956) and Lehmann (1975, §4.2). The bound is the rare statistical bargain: distribution-freeness with a worst-case efficiency cost smaller than most practitioners’ rounding error in sample-size calculations.

The Hodges-Lehmann Estimator

The Wilcoxon test gives a $p$-value but not a point estimate of the location parameter — there’s no obvious “Wilcoxon estimator of $\theta$” until you invert the test. Hodges & Lehmann’s 1963 construction does exactly that: take the median of all pairwise averages of paired differences. The result is a point estimator that sits between the sample mean and the sample median in robustness, with the same asymptotic relative efficiency to the mean that the Wilcoxon test has to the $t$-test (Theorem 11). And because the construction is test-inversion, it comes with an exact distribution-free confidence interval (Theorem 10).

A worked example with an outlier makes the case. Take $D = (-0.3, 0.2, 0.7, 0.9, 1.4, 1.8, 2.1, 8.0)$ — the $8.0$ is a single dominant value (think: a logging error, a server crash duration, a one-time mega-purchase). The sample mean is $1.85$, dragged up by the outlier. The sample median is $1.15$, ignoring the outlier completely. Hodges-Lehmann gives $\hat\theta_{\mathrm{HL}} = 1.20$ — between the mean and median, robust to the outlier without throwing it out. The $95\%$ HL confidence interval is $[0.30,\, 4.45]$, exact under symmetry of the underlying distribution.

Definition 7 (Walsh Averages and One-Sample Hodges-Lehmann Estimator).

For paired data $D_1, \ldots, D_n$, the Walsh averages are the pairwise means

$$A_{ij} \;=\; \frac{D_i + D_j}{2}, \qquad 1 \leq i \leq j \leq n.$$

There are $M = n(n+1)/2$ of them — $n$ diagonal entries $A_{ii} = D_i$ plus $n(n-1)/2$ off-diagonal entries. The one-sample Hodges-Lehmann estimator of the population location parameter is

$$\hat\theta_{\mathrm{HL}} \;=\; \mathrm{median}\{A_{ij} : 1 \leq i \leq j \leq n\}.$$

Definition 8 (Two-Sample Hodges-Lehmann Estimator).

For two samples $X_1, \ldots, X_{n_1}$ and $Y_1, \ldots, Y_{n_2}$ drawn from distributions differing by a location shift $\Delta$, the two-sample Hodges-Lehmann estimator of $\Delta$ is the median of all $n_1 n_2$ cross-sample differences:

$$\hat\Delta_{\mathrm{HL}} \;=\; \mathrm{median}\{Y_j - X_i : 1 \leq i \leq n_1,\; 1 \leq j \leq n_2\}.$$

Theorem 9 (Hodges-Lehmann as Test-Inversion of Wilcoxon).

$\hat\theta_{\mathrm{HL}}$ is the unique value $\theta_0$ at which the Wilcoxon signed-rank statistic applied to the shifted data $D - \theta_0$ takes its null mean $n(n+1)/4$. Equivalently, $\hat\theta_{\mathrm{HL}} = \mathrm{median}\{A_{ij}\}$.

Proof.

Apply the U-statistic representation of $W^+$ (Theorem 7) to the shifted data $D - \theta_0$:

$$W^+(D - \theta_0) \;=\; \sum_{i \leq j} \mathbf{1}\{(D_i - \theta_0) + (D_j - \theta_0) > 0\} \;=\; \sum_{i \leq j} \mathbf{1}\{(D_i + D_j)/2 > \theta_0\} \;=\; \#\{(i, j) : i \leq j,\, A_{ij} > \theta_0\}.$$

As $\theta_0$ increases from $-\infty$ to $+\infty$, the count $W^+(D - \theta_0)$ decreases from $M = n(n+1)/2$ to $0$, dropping by $1$ each time $\theta_0$ crosses a Walsh average. It equals the null mean $n(n+1)/4 = M/2$ exactly when $\theta_0$ is positioned with $M/2$ Walsh averages above it and $M/2$ below it — i.e., at the median of the discrete set $\{A_{ij}\}$. $\square$

In the $n = 8$ outlier example above, HL $= 1.20$ sits between the mean ($1.85$) and the median ($1.15$). The 95% HL CI is $[0.30, 4.45]$ — exact under the Wilcoxon symmetry assumption (Theorem 10). Inject another outlier and watch the right panel: the mean's RMSE rockets, the median is steady, and HL stays close behind the median.

Theorem 10 (Distribution-Free CI from Walsh-Average Ordering).

Let $A_{(1)} \leq A_{(2)} \leq \cdots \leq A_{(M)}$ be the sorted Walsh averages with $M = n(n+1)/2$, and let $w_\alpha$ be the integer satisfying $\mathbb{P}_{H_0}(W^+ \leq w_\alpha) \leq \alpha/2$ — the lower $\alpha/2$ critical value of the discrete null distribution of $W^+$. Then

$$\bigl[\, A_{(w_\alpha + 1)},\; A_{(M - w_\alpha)} \,\bigr]$$

is a $(1 - \alpha)$ confidence interval for the location parameter $\theta$, with coverage exact at $\alpha$ levels achievable by the Wilcoxon discrete null and conservative otherwise.

Proof.

By Theorem 9 (and its symmetric twin for the upper tail), the count $W^+(D - \theta_0)$ drops by $1$ at each Walsh-average crossing as $\theta_0$ increases. A two-sided level-$\alpha$ test fails to reject $H_0\!: \theta = \theta_0$ when

$$w_\alpha \;<\; W^+(D - \theta_0) \;<\; M - w_\alpha,$$

i.e., the count of Walsh averages strictly greater than $\theta_0$ is strictly between $w_\alpha$ and $M - w_\alpha$. This holds iff $\theta_0$ lies between the corresponding sorted Walsh averages — closing the interval as

$$\theta_0 \in \bigl[\, A_{(w_\alpha + 1)},\, A_{(M - w_\alpha)} \,\bigr].$$

By the duality of test inversion, the probability that this interval covers the true $\theta$ equals the probability that the level-$\alpha$ test fails to reject the truth, which is at least $1 - \alpha$ by the exactness of the Wilcoxon test under the symmetry assumption. $\square$
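A sketch reproducing the §6 worked example end-to-end: Walsh averages, the HL median (Theorem 9), and the order-statistic CI of Theorem 10. The critical value $w_\alpha = 4$ is the one behind the text's $95\%$ interval at $n = 8$:

```python
import numpy as np

D = np.array([-0.3, 0.2, 0.7, 0.9, 1.4, 1.8, 2.1, 8.0])
n = len(D)

i, j = np.triu_indices(n)                    # all pairs i <= j, diagonal included
walsh = np.sort((D[i] + D[j]) / 2)           # M = n(n+1)/2 = 36 Walsh averages
theta_hl = np.median(walsh)                  # 1.20

w = 4                                        # Wilcoxon lower critical value, n = 8, alpha = 0.05
ci = (walsh[w], walsh[-(w + 1)])             # [A_(w+1), A_(M-w)] = (0.30, 4.45)
print(theta_hl, ci, D.mean(), np.median(D))  # 1.20, (0.3, 4.45), 1.85, 1.15
```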

Theorem 11 (Hodges-Lehmann Asymptotic Distribution and ARE).

Under the regularity conditions of Theorem 8 ($f$ symmetric with finite variance, density continuous and bounded, $\int f^2 < \infty$),

$$\sqrt{n}\,(\hat\theta_{\mathrm{HL}} - \theta) \;\xrightarrow{\;d\;}\; \mathcal{N}\!\left(0,\; \frac{1}{12 \left(\int f^2\right)^{\!2}} \right).$$

Consequently:

$$\mathrm{ARE}(\hat\theta_{\mathrm{HL}},\, \bar D) = 12 \sigma^2 \!\left( \int f^2 \right)^{\!2} = \mathrm{ARE}(W, T), \qquad \mathrm{ARE}(\hat\theta_{\mathrm{HL}},\, \mathrm{median}) = \frac{3 \left( \int f^2 \right)^{\!2}}{f(0)^2}.$$

The HL estimator inherits the same efficiency advantages as the Wilcoxon test (Theorem 8), with the celebrated $0.864$ asymptotic-efficiency floor relative to the sample mean (Hodges & Lehmann 1956).

Proof.

The first claim follows from Theorem 8 and the test-inversion structure of Theorem 9. The Wilcoxon statistic standardised under $\theta_n = h/\sqrt{n}$ satisfies $W_n^* \xrightarrow{d} \mathcal{N}(c_W h, 1)$ with $c_W = \sqrt{12}\, I(f)$. By Theorem 9, $\hat\theta_{\mathrm{HL}}$ is the value of $\theta_0$ at which $W_n^*(D - \theta_0)$ equals zero — i.e., the zero of the standardised Wilcoxon as a function of the location parameter.

A delta-method argument inverts the asymptotic-mean function around its zero. The mean function shifts linearly in $\theta$ with slope $c_W$ in standardised units, equivalent to slope $c_W \sqrt{n}$ in the un-standardised $W^+$ sum; the noise scale is $\sqrt{\mathrm{Var}(W^+)} \sim n^{3/2}/\sqrt{12}$. The implicit-function calculation yields

$$\sqrt{n}\,(\hat\theta_{\mathrm{HL}} - \theta) \;\xrightarrow{\;d\;}\; \mathcal{N}\!\left( 0,\, \frac{1}{c_W^2} \right) \;=\; \mathcal{N}\!\left( 0,\, \frac{1}{12\, I(f)^2} \right).$$

The full Bahadur-representation foundation for this inversion (uniform convergence of the rank-statistic process) is in Lehmann (1975, §4.3) and van der Vaart (1998, §13).

ARE comparisons. The sample mean has asymptotic variance $\sigma^2 / n$, so

$$\mathrm{ARE}(\hat\theta_{\mathrm{HL}},\, \bar D) \;=\; \frac{\sigma^2 / n}{1 / (12 n I(f)^2)} \;=\; 12\, \sigma^2\, I(f)^2 \;=\; \mathrm{ARE}(W, T).$$

The sample median has asymptotic variance $1/(4 n f(0)^2)$, so

$$\mathrm{ARE}(\hat\theta_{\mathrm{HL}},\, \mathrm{median}) \;=\; \frac{1/(4 f(0)^2)}{1 / (12\, I(f)^2)} \;=\; \frac{3\, I(f)^2}{f(0)^2}. \;\square$$

Remark (When the Sample Median Beats HL (Laplace = MLE)).

The Hodges-Lehmann estimator is not a universal optimum — it loses to the sample median when the data are Laplace-distributed. Under the Laplace density $f(x) = \tfrac{1}{2}e^{-|x|}$, the maximum-likelihood estimator of the location parameter is the sample median (because the Laplace log-likelihood is the negative $L_1$ loss). Plugging in the Laplace values $f(0) = 1/2$, $\int f^2 = 1/4$:

$$\mathrm{ARE}(\hat\theta_{\mathrm{HL}},\, \mathrm{median}) \;=\; \frac{3 \cdot (1/4)^2}{(1/2)^2} \;=\; 3/4 \;=\; 0.75.$$

The median achieves asymptotic variance $1/(4 f(0)^2 n) = 1/n$, while HL achieves $1/(12 I(f)^2 n) = 4/(3n)$ — a 33% inflation. So under Laplace, use the median, not HL. The general moral: HL is a robust compromise across density shapes, not the best estimator at any single one. When the parametric form is known, the MLE is what beats it.
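A quick Monte Carlo confirmation of the $0.75$ figure; the sample size and replication count are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(2)
n, B = 100, 2000

def hodges_lehmann(d):
    i, j = np.triu_indices(len(d))
    return np.median((d[i] + d[j]) / 2)

draws = rng.laplace(size=(B, n))            # true location 0
hl = np.array([hodges_lehmann(d) for d in draws])
med = np.median(draws, axis=1)

print(f"n*var(HL) ~ {n * hl.var():.2f} (theory 4/3), "
      f"n*var(median) ~ {n * med.var():.2f} (theory 1), "
      f"ARE ~ {med.var() / hl.var():.2f} (theory 0.75)")
```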

Permutation Tests with Non-Rank Statistics

The permutation principle from §2 doesn’t require rank statistics. Any test statistic permits a permutation test, and dropping the rank structure can buy back the small power loss that ranks impose under nice distributions. The trade-off is the standard one: parametric efficiency (when assumptions hold) vs robustness to outliers and tails (when they don’t).

The canonical example is the permutation $t$-test: take the same two-sample setting from §3 with the Welch $t$-statistic

$$T \;=\; \frac{\bar X - \bar Y}{\sqrt{s_X^2 / n_1 + s_Y^2 / n_2}},$$

but instead of comparing $T$ to a $t_\nu$ reference distribution (which requires the Welch-Satterthwaite approximation), enumerate or sample the $\binom{N}{n_1}$ label-swap permutations and compute the empirical permutation distribution. Theorem 3 gives exact level-$\alpha$ control under the exchangeability null, regardless of how non-normal the data are. A worked example: two samples of size $12$ from a $t_3$ distribution shifted by $1.0$. The classical Welch $t$-test gives $p \approx 0.889$ (its reference distribution is suspect because $t_3$ is heavy-tailed); the permutation $t$ gives $p \approx 0.886$ — close, but the permutation version is the one that’s exactly right.

Remark (Raw vs Ranks — A Power Trade-Off, Not a Validity Trade-Off).

Both the permutation $t$-test and Mann-Whitney are exact under exchangeability — they differ only in the choice of statistic. The choice is a power decision, not a validity one:

  • Raw permutation (mean difference, Welch $t$, etc.) is more powerful when the parametric model is roughly right. Asymptotically, the permutation $t$ and the classical $t$ share the same drift coefficient $c_T = 1/\sigma$.
  • Rank permutation (Wilcoxon rank-sum, Mann-Whitney $U$) is more powerful when tails are heavy or outliers are present. Its drift coefficient is $c_W = \sqrt{12}\, I(f)$, which dominates $1/\sigma$ exactly when $\mathrm{ARE}(W, T) > 1$.
  • Random permutation (sampled rather than enumerated): when $\binom{N}{n_1}$ is too large for full enumeration, sample $B$ random permutations. The Phipson-Smyth $p$-value

$$\hat p \;=\; \frac{1 + \#\{T_\pi \geq T_{\mathrm{obs}}\}}{1 + B}$$

is slightly conservative; the “+1”s adjust for the observed value being one realisation, and the construction guarantees $\hat p > 0$ regardless of $B$ — see the sketch after this list.
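A minimal sketch of the sampled-permutation recipe in the §7 setting ($t_3$ noise, shift $1.0$), with the Phipson-Smyth correction; helper names like `perm_pvalue` are illustrative:

```python
import numpy as np
from scipy.stats import ttest_ind

def welch_t(x, y):
    return (x.mean() - y.mean()) / np.sqrt(
        x.var(ddof=1) / len(x) + y.var(ddof=1) / len(y))

def perm_pvalue(x, y, B=9999, seed=0):
    rng = np.random.default_rng(seed)
    pooled, n1 = np.concatenate([x, y]), len(x)
    t_obs = abs(welch_t(x, y))
    hits = sum(abs(welch_t(z[:n1], z[n1:])) >= t_obs
               for z in (rng.permutation(pooled) for _ in range(B)))
    return (1 + hits) / (1 + B)            # Phipson-Smyth: never exactly zero

rng = np.random.default_rng(3)
x = rng.standard_t(3, size=12) + 1.0       # t_3 noise, shift Delta = 1.0
y = rng.standard_t(3, size=12)
print(perm_pvalue(x, y), ttest_ind(x, y, equal_var=False).pvalue)
```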

The key point: the permutation framework is the umbrella, and rank tests are just one corner of it. When you encounter a clever test statistic in a paper, the question to ask is not “does it have a known asymptotic distribution?” — it’s “is the null exchangeable?” If the answer is yes, you have a finite-sample exact test for free.


A/B Testing as Randomization Inference

Online controlled experiments — A/B tests in tech, RCTs in medicine and policy — are the natural home of permutation inference, but with a twist: they are really randomization tests in the Fisher sense (per Remark 1), not permutation tests in the population-sampling sense. The validity claim is different and considerably stronger: the experimenter’s coin flip provides exactness regardless of the metric’s distribution.

The setup. A platform randomly assigns each of $n$ users to control (A) or treatment (B), measures an outcome metric (revenue, session length, conversion rate, ad clicks per impression) per user, and asks whether the treatment shifted the metric. Standard practice runs Welch’s $t$-test on the means and reports a $p$-value. The trouble: heavy-tailed metrics (revenue is famously log-normal-ish) can break the Welch-Satterthwaite reference distribution, producing tests with the wrong size. Randomization inference via permutation gives an exactly correct test even when Welch is wrong — and with one extra ingredient (a pre-period covariate) the CUPED adjustment can dramatically reduce variance without sacrificing exactness.

Definition 9 (Fisher's Sharp Null and the Randomization Distribution).

In an experiment, each unit $i$ has potential outcomes $Y_i(0), Y_i(1)$ under control and treatment respectively. Treatment is assigned by a known random mechanism $W = (W_1, \ldots, W_n)$ with $W_i \in \{0, 1\}$, and the observed outcome is $Y_i = Y_i(W_i)$. Fisher’s sharp null is

$$H_0^{\mathrm{sharp}} \;:\; Y_i(0) = Y_i(1) \quad \text{for every } i.$$

Under $H_0^{\mathrm{sharp}}$, $Y_i$ is constant in $W_i$, so swapping $W$ to any other assignment $w' \in \{0, 1\}^n$ drawn from the same randomization mechanism gives the same observed $Y$ but a different test statistic $T(w', Y)$. The randomization distribution of $T$ is the distribution of $T(W', Y)$ where $W'$ is drawn from the assignment mechanism.

Theorem 12 (Randomization-Test Exactness Under Fisher's Sharp Null).

Let $T$ be any test statistic, and let $W'$ range over the support of the randomization mechanism (e.g., all $\binom{n}{n_{\mathrm{treat}}}$ ways of assigning $n_{\mathrm{treat}}$ of $n$ users to treatment under simple random sampling). Define the randomization $p$-value

$$p(W, Y) \;=\; \frac{1}{|\mathrm{support}|} \sum_{w'} \mathbf{1}\{T(w', Y) \geq T(W, Y)\}.$$

Then under $H_0^{\mathrm{sharp}}$, $\mathbb{P}(p \leq \alpha) \leq \alpha$ for every $\alpha \in [0, 1]$, with equality at multiples of $1/|\mathrm{support}|$.

Proof.

Under H0sharpH_0^{\mathrm{sharp}}, YY is fixed (not random — every unit’s outcome is determined and unchanged by relabelling). The only randomness is in WW. Apply Theorem 3 with the group G\mathcal{G} being the action of permuting WW‘s entries (or, equivalently, swapping which YiY_i‘s are labelled “treatment”): the joint distribution of (T(W,Y))W(T(W', Y))_{W'} is invariant under group composition, so the rank of T(W,Y)T(W, Y) within {T(W,Y)}\{T(W', Y)\} is uniform on {1,,support}\{1, \ldots, |\text{support}|\}. The remainder of the argument is identical to the proof of Theorem 3. \square

The conceptual difference from §2’s permutation-test framing is what gives Theorem 12 its operational power. Validity comes from the experimenter’s coin flip, not from any assumption about the population. The data-generating process for the $Y_i$’s can be arbitrary — heteroscedastic, heavy-tailed, autocorrelated across users — and the randomization inference is still exact.

Remark (CUPED — Variance Reduction Without Breaking Randomization).

CUPED (Deng, Xu, Kohavi & Walker 2013, “Controlled-experiment Using Pre-Experiment Data”) replaces the metric $Y$ with

$$Y_{\mathrm{cuped},\, i} \;=\; Y_i \;-\; \beta\, (X_i - \bar X), \qquad \beta \;=\; \frac{\mathrm{Cov}(X, Y)}{\mathrm{Var}(X)},$$

where $X$ is a pre-experiment covariate per user (e.g., the user’s revenue in the prior month) and $\bar X$ is the pooled sample mean. The adjusted variable has the same expected mean as $Y$ — the term $(X_i - \bar X)$ has expectation zero — but variance $\mathrm{Var}(Y) - \mathrm{Cov}(X, Y)^2 / \mathrm{Var}(X) = \mathrm{Var}(Y)\,(1 - \rho_{X, Y}^2)$, strictly smaller whenever $X$ and $Y$ are correlated.

Crucially, the adjustment is symmetric in the labels — the same $\beta$ and the same $\bar X$ are used for both groups — so under $H_0^{\mathrm{sharp}}$ the $Y_{\mathrm{cuped}}$ values remain fixed under label permutations. Theorem 12’s randomization-inference exactness carries through unchanged. Variance reduction translates directly to power gain. Typical pre-period correlations of $\rho_{X, Y} \approx 0.7$ deliver power gains equivalent to a ${\sim}50\%$ reduction in sample size — for free, from data the experimenter already had.

At log-normal Y with n = 80, σ_log = 2 and Δ = 0, run the tracker to see the Welch-t bar settle around 0.030 (under-rejecting) while the randomization bar stabilizes at ~0.05. Toggle CUPED to confirm exactness still holds (Remark 6); raise Δ to feel the power gain.
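A compact sketch of the randomization pipeline the tracker animates — heavy-tailed $Y$, a correlated pre-period covariate $X$, a randomization $p$-value, then the CUPED-adjusted rerun. The data-generating constants are illustrative, not the widget's exact configuration:

```python
import numpy as np

rng = np.random.default_rng(4)
n, B = 80, 2000

# Log-normal metric Y with a correlated pre-period X; Delta = 0, so the sharp null holds.
z = rng.normal(size=n)
X = np.exp(2 * z)                                        # pre-experiment revenue
Y = np.exp(2 * (0.7 * z + 0.3 * rng.normal(size=n)))     # in-experiment revenue
W = rng.permutation(np.repeat([0, 1], n // 2))           # the experimenter's coin flip

def rand_pvalue(y, w, B, rng):
    t_obs = abs(y[w == 1].mean() - y[w == 0].mean())
    hits = sum(abs(y[wp == 1].mean() - y[wp == 0].mean()) >= t_obs
               for wp in (rng.permutation(w) for _ in range(B)))
    return (1 + hits) / (1 + B)                          # Phipson-Smyth form

# CUPED: one pooled beta and X-bar for both arms, so Y_cuped is fixed under
# label permutations and Theorem 12 still applies (Remark above).
beta = np.cov(X, Y)[0, 1] / np.var(X, ddof=1)
Y_cuped = Y - beta * (X - X.mean())

print(rand_pvalue(Y, W, B, rng), rand_pvalue(Y_cuped, W, B, rng))
```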

Limitations: Ties, Power, and the Covariate Problem

The rank-and-permutation toolkit is powerful, but it has three honest limitations worth flagging before we close. Ties violate the continuity assumption of §§1, 3, 4 and require a midrank correction (otherwise the test under-rejects). The discreteness of small-$n$ exact null distributions can erase the asymptotic Wilcoxon advantage at the very sample sizes where it should matter most. And the framework does not extend cleanly to covariate-adjusted inference — you cannot easily ask “is the treatment effect non-zero, controlling for age and baseline severity” within rank tests alone. Each limitation has a partial fix; the third one points outward to quantile regression and conformal prediction for the structural answer.

Remark (Tie Correction — Naive Variance Under-Rejects).

The clean null distributions of §1, §3, §4 assume continuous data with no ties. Real data have ties — especially on Likert scales (5- or 7-point ratings), integer counts, or coarsely binned measurements. The standard fix replaces tied values with their midrank (the average of the ranks they would have occupied) and corrects the null variance.

For Mann-Whitney $U$ with tie-group sizes $t_1, \ldots, t_g$,

$$\mathrm{Var}(U)_{\mathrm{corrected}} \;=\; \frac{n_1 n_2}{12} \left( N + 1 - \frac{\sum_k (t_k^3 - t_k)}{N(N-1)} \right) \;=\; \mathrm{Var}(U)_{\text{no ties}} \cdot \left( 1 - \frac{\sum_k (t_k^3 - t_k)}{N^3 - N} \right).$$

For Wilcoxon signed-rank with tied $|D_i|$’s,

$$\mathrm{Var}(W^+)_{\mathrm{corrected}} \;=\; \frac{n(n+1)(2n+1)}{24} \;-\; \frac{1}{48} \sum_k (t_k^3 - t_k).$$

scipy.stats.mannwhitneyu and scipy.stats.wilcoxon apply these corrections automatically. Skipping them produces tests with the wrong size — and the direction is worth getting right. The corrected variance is smaller than the uncorrected one (the bracketed factor $1 - \sum_k (t_k^3 - t_k)/(N^3 - N)$ is strictly less than $1$ when ties exist). Using the larger uncorrected variance in the $z$-denominator inflates the denominator, shrinks $|z|$, and inflates $p$-values — making the uncorrected test conservative (under-rejecting), sacrificing power for no Type I error protection.

Empirical type-I error rates for $5$-point Likert data with $n_1 = n_2 = 30$ and $2500$ Monte Carlo replications under $H_0$. Naive (uncorrected) variance gives empirical $\alpha \approx 0.033$ at nominal $0.05$ — conservative, under-rejecting. Manual tie-correction and scipy.stats.mannwhitneyu (which applies the correction automatically) hit $\alpha \approx 0.042$, close to nominal.
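A sketch of the midrank-and-correction computation on simulated 5-point Likert data (the data generation is illustrative):

```python
import numpy as np
from scipy.stats import rankdata

rng = np.random.default_rng(5)
x = rng.integers(1, 6, size=30)               # 5-point Likert: heavy ties
y = rng.integers(1, 6, size=30)

pooled = np.concatenate([x, y])
n1, n2, N = len(x), len(y), len(pooled)
ranks = rankdata(pooled)                      # midranks for tied values

_, t = np.unique(pooled, return_counts=True)  # tie-group sizes t_k
var_naive = n1 * n2 * (N + 1) / 12
var_corr = var_naive * (1 - (t**3 - t).sum() / (N**3 - N))
print(var_naive, var_corr)                    # corrected variance is strictly smaller
```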

Remark (Discreteness at Very Small $n$).

The Hodges-Lehmann lower bound (Remark 3) says rank tests lose at most ~14% asymptotic efficiency vs the $t$-test even under normality, and gain unboundedly under heavy tails. But asymptotic efficiency is not the same as finite-sample dominance.

At $n_1 = n_2 = 4$ (so $\binom{8}{4} = 70$ possible rank assignments), the achievable significance levels for an exact Mann-Whitney test are restricted to $0/70, 1/70, 2/70, \ldots$ — and the smallest two-sided level above zero is $2/70 = 0.029$. To get a level-$0.05$ test, we accept the next-larger achievable level $4/70 \approx 0.057$, which is effectively a level-$0.057$ test. The CLT-based $t$-test, by contrast, can hit any nominal level on the continuum.

This discreteness pinch shows up exactly where rank tests should otherwise win — at very small $n$ with heavy-tailed data — and it can erase the asymptotic Wilcoxon advantage. The fix is to keep using the exact null distribution (rather than its normal approximation) while acknowledging that the achievable $\alpha$ values are restricted, and to report the actual achieved level rather than the nominal one. scipy.stats.mannwhitneyu(method='exact') returns the exact discrete $p$-value.
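The achievable levels are easy to enumerate directly; a sketch using the rank-sum $W$ (equivalent to $U$ by Theorem 4):

```python
import numpy as np
from itertools import combinations

n1, N = 4, 8
# Exact null of the rank-sum W: all C(8, 4) = 70 equally likely rank subsets.
W_null = np.array([sum(c) for c in combinations(range(1, N + 1), n1)])

# Two-sided achievable levels from the lower tail (the null is symmetric).
levels = sorted({2 * np.mean(W_null <= w)
                 for w in range(W_null.min(), int(np.median(W_null)))})
print(levels[:3])   # [0.0286, 0.0571, 0.1143] -- exactly 0.05 is not achievable
```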

Remark (Forward Connections — Quantile Regression and Conformal Prediction).

The real limitation of rank tests is the covariate problem. Rank tests don’t extend cleanly to “test whether the treatment effect is non-zero, controlling for age, sex, and baseline severity.” The mean-rank framework lacks the regression structure that lets you partial out covariates without sacrificing exactness. Stratified rank tests like van Elteren (1960) exist but apply only to discrete strata (age bins, sex); for continuous covariates, they don’t help.

Two clean answers come from elsewhere in the T4 track:

  • Quantile Regression. Replaces the squared-error loss of OLS with the asymmetric pinball loss to recover conditional quantiles given covariates. Carries the robust-to-tails ethos of Wilcoxon into a regression framework — the conditional median is the $L_1$-loss minimiser, and conditional quantiles handle heavy-tailed outcomes natively.
  • Conformal Prediction. Turns finite-sample exchangeability arguments — exactly the assumption underwriting permutation inference here — into prediction intervals with marginal coverage guaranteed for any black-box predictor. The exchangeability that powers Theorems 3 and 12 in this topic is the same exchangeability that powers conformal’s Theorem 1.

The shared mathematical primitive — exchangeability as a finite-sample statistical engine — ties the T4 track together. This topic deploys it for testing; quantile regression and conformal prediction deploy it for conditional estimation and prediction respectively.

Notation Reference

| Symbol | Meaning | Section |
| --- | --- | --- |
| $D_i = X_i - Y_i$ | Paired difference | §1 |
| $\mathbf{1}\{A\}$ | Indicator of event $A$ | throughout |
| $N^+ = \sum_i \mathbf{1}\{D_i > 0\}$ | Sign-test statistic | §1 |
| $R_i$ | Rank of $\lvert D_i \rvert$ in the absolute-value sample | §1 |
| $W^+ = \sum_{i: D_i > 0} R_i$ | Wilcoxon signed-rank statistic | §1 |
| $\mathcal{G}$ | Finite group of null-preserving transformations | §2 |
| $g \cdot X$ | Transformation of the data vector by $g$ | §2 |
| $\stackrel{d}{=}$ | Equality in distribution | §2 |
| $p(X)$ | Permutation $p$-value | §2, §7, §8 |
| $n_1, n_2, N = n_1 + n_2$ | Two-sample sizes and pooled total | §3 |
| $R_i^X$ | Rank of $X_i$ in the pooled sample | §3 |
| $W = \sum_i R_i^X$ | Wilcoxon rank-sum statistic | §3 |
| $U = \sum_{i, j} \mathbf{1}\{X_i < Y_j\}$ | Mann-Whitney $U$ statistic | §3 |
| $\theta = \mathbb{P}(X < Y)$ | Population parameter $U/(n_1 n_2)$ estimates | §3 |
| $\bar R_j$ | Mean rank in group $j$ | §4 |
| $H$ | Kruskal-Wallis statistic | §4 |
| $\rho_j = \lim n_j/N$ | Limiting allocation in $k$-sample CLT | §4 |
| $\theta_n = h/\sqrt{n}$ | Contiguous local alternative | §5 |
| $c_T = 1/\sigma$, $c_W = \sqrt{12}\, I(f)$ | Drift coefficients | §5 |
| $I(f) = \int f^2(x)\, dx$ | Density-functional integral | §5, §6 |
| $A_{ij} = (D_i + D_j)/2$ | Walsh average | §6 |
| $M = n(n+1)/2$ | Number of Walsh averages | §6 |
| $\hat\theta_{\mathrm{HL}}$, $\hat\Delta_{\mathrm{HL}}$ | One-sample / two-sample Hodges-Lehmann estimators | §6 |
| $A_{(k)}$ | Walsh-average order statistic | §6 |
| $w_\alpha$ | Lower critical value of Wilcoxon null | §6 |
| $T = (\bar X - \bar Y)/\sqrt{s_X^2/n_1 + s_Y^2/n_2}$ | Welch $t$-statistic | §7 |
| $\hat p = (1 + \#\{T_\pi \geq T_{\mathrm{obs}}\})/(1 + B)$ | Phipson-Smyth $p$-value | §7, §8 |
| $Y_i(0), Y_i(1)$ | Potential outcomes | §8 |
| $W_i \in \{0, 1\}$ | Treatment assignment indicator | §8 |
| $H_0^{\mathrm{sharp}}: Y_i(0) = Y_i(1)\,\forall i$ | Fisher's sharp null | §8 |
| $Y_{\mathrm{cuped},\, i} = Y_i - \beta(X_i - \bar X)$ | CUPED-adjusted outcome | §8 |
| $\beta = \mathrm{Cov}(X, Y)/\mathrm{Var}(X)$ | Pooled CUPED coefficient | §8 |
| $\rho_{X, Y}$ | Pre/in-period correlation | §8 |
| $t_k$ | Size of the $k$-th tie group | §9 |
| $\sum_k (t_k^3 - t_k)$ | Tie-correction term | §9 |

Connections and Further Reading

This topic is the testing-and-inference foundation of formalML’s T4 (Nonparametric & Distribution-Free) track, sitting alongside conformal prediction (prediction under exchangeability) and quantile regression (covariate-adjusted location). The shared mathematical primitive — exchangeability as a finite-sample inferential engine — ties the three together; the difference is what’s being inferred (a parameter, a future observation, or a conditional quantile).

The classical foundations live in Wilcoxon (1945), Mann & Whitney (1947), Kruskal & Wallis (1952), Pitman (1937), and the Hodges-Lehmann pair (1956 for the ARE bound, 1963 for the estimator and CI). Fisher’s Design of Experiments (1935) is the source of randomization inference. Modern book-length treatments are Lehmann’s Nonparametrics (2006, revised from the 1975 Holden-Day edition) and Hájek-Šidák-Sen’s Theory of Rank Tests (1999) for the asymptotic theory in §§4–6. Van der Vaart’s Asymptotic Statistics (1998), chapters 13 (rank tests) and 14 (relative efficiency), supplies the modern measure-theoretic foundation. For the A/B-testing applications in §8, Kohavi-Tang-Xu’s Trustworthy Online Controlled Experiments (2020) is the operational guide, and Deng et al. (2013) is the original CUPED paper. See the References section above the page footer for the complete bibliography.

Connections

  • Both topics rest on exchangeability as the single assumption providing finite-sample exact inference. Where this topic uses exchangeability for testing (Theorems 3, 12), conformal uses it for prediction (Theorem 1 of conformal-prediction). The shared mathematical primitive ties the T4 track together; the difference is in what's being inferred — a parameter (here) or a future observation (there). conformal-prediction
  • Quantile regression is the covariate-adjusted counterpart of rank-based location inference. Where Wilcoxon and Hodges-Lehmann handle the unconditional location parameter, quantile regression handles the conditional quantile given $X$ — picking up the heavy-tail-robust ethos in a regression framework. §9 here flags the covariate problem as the structural limitation of rank tests; quantile regression is one of two clean answers to it. quantile-regression
  • T4's track closer generalizes §6 here from a Hodges-Lehmann *confidence interval* for a location parameter to an HL-style *prediction interval* for the next observation. The proof of Theorem 3 of prediction-intervals cites Theorem 10 here, and Theorem 5.3 (HL/conformal asymptotic equivalence) leans on the §5 Walsh-average asymptotic theory. The §6 batch comparison there documents HL's batch-coverage shortfall on shared calibration sets — a regime distinction worth carrying back to single-sample HL applications. prediction-intervals
  • T4 track closer. Depth-induced multivariate ranks are the multivariate generalization of the empirical-CDF ordering used here. Hodges-Lehmann-style multivariate rank tests built on depth-induced ranks are distribution-free under elliptical symmetry and inherit the depth consistency theorem of statistical-depth §4.1 — the multivariate version of the §3 Wilcoxon null. statistical-depth

References & Further Reading