Statistical Depth

Multivariate centrality scores — Tukey halfspace depth, the zoo of alternatives, asymptotic theory and NP-hardness, and ML applications via DD-classification, depth-based outlier detection, and functional depth

Overview

Statistical depth assigns each point in $\mathbb{R}^d$ a centrality score with respect to an empirical or population distribution, generalizing the median, quantile, and rank to dimensions where ordering is no longer well defined. The Tukey halfspace depth — the canonical construction — measures how surrounded a point is by the data: it is the smallest fraction of the sample on either side of any hyperplane through that point. Depth functions package multivariate location, spread, and outlyingness into a single object that obeys a clean axiomatic specification (Zuo and Serfling 2000), supports robust median estimation with breakdown point $\ge 1/(d+1)$ (Donoho and Gasko 1992), and underwrites a small zoo of multivariate ML tools — the DD-classifier, depth-based outlier detection, and functional depth for curves and time series.

The topic structure: motivation and the Zuo–Serfling axioms (§1), Tukey halfspace depth as the canonical depth function (§2), a comparative survey of Mahalanobis, simplicial, projection, and spatial depths (§3), the asymptotic theory and computational complexity of Tukey depth (§4), ML applications via DD-classification, outlier detection, and functional depth (§5), and connections plus honest limits (§6).

The reader is assumed to be comfortable with measure-theoretic probability, weak convergence, and the basic vocabulary of convex geometry — halfspaces, supporting hyperplanes, convex hulls, simplices. Empirical-process limit theorems make a cameo in §4 and are sketched rather than developed from scratch.

This topic closes the T4 Nonparametric & Distribution-Free track. Where the track’s other topics — Conformal Prediction, Quantile Regression, Rank Tests, Prediction Intervals, Extreme Value Theory — and the univariate machinery they share operate one variable at a time, statistical depth is what those topics look like in $\mathbb{R}^d$ when there is no canonical order on the sample space.

1. From quantile to depth

In one dimension a sample has an unambiguous middle. The median of $X_1, \dots, X_n \in \mathbb{R}$ minimizes $\sum_i |X_i - \theta|$, splits the sample into halves, and is the level-$1/2$ value of the empirical quantile function $\hat F^{-1}(p) = \inf\{x : \hat F(x) \ge p\}$. The empirical CDF $\hat F$ orders the sample, the quantile function inverts that order, and rank-based machinery — Wilcoxon tests, sign tests, prediction intervals built from order statistics — drops out. All of it rests on the fact that $\mathbb{R}$ is totally ordered.

In $\mathbb{R}^d$ for $d \ge 2$ that order is gone. There is no canonical way to say which of two points $\mathbf{x}, \mathbf{y} \in \mathbb{R}^2$ comes “before” the other; the natural componentwise partial order leaves most pairs incomparable, and the obvious workarounds — the componentwise median, the Mahalanobis-distance “center,” the geometric median — each capture some aspect of multivariate centrality but disagree on most samples and behave badly under affine reparametrization. The question “which point in this 2D cloud is most central?” has several reasonable answers, and which one is correct depends on what we want centrality to do.

A depth function is the answer that asks what we want first and derives the construction second. It assigns each point $\mathbf{x} \in \mathbb{R}^d$ a real number — the depth of $\mathbf{x}$ with respect to a distribution $F$ — large when $\mathbf{x}$ sits in the bulk of $F$ and small when $\mathbf{x}$ is near the edge or far outside. The depth function is then a multivariate centrality score, and the deepest point is the multivariate median.

1.1 Why componentwise quantiles fail

The fastest way to feel the gap is to try the obvious fix. Given a 2D sample, take the median of the first coordinate and the median of the second coordinate; call the resulting point the componentwise median. The construction is fast, well-defined, and almost always wrong for what we mean by “the center of the cloud.”

Two failures appear immediately. The first is affine non-equivariance: rotate the sample by $45^\circ$ and the componentwise median moves as a function of the rotation, not as the rotated original median. The “center” depends on the choice of axes, which is exactly what we do not want for a notion that ought to track the geometry of the cloud rather than the bookkeeping of its representation. The second is lack of fit to non-axis-aligned bulk: when the sample concentrates along a tilted ellipse, the componentwise median can lie outside the bulk entirely, because the marginal medians know nothing about the off-diagonal correlation. The geometric median (Fréchet 1948) — the minimizer of $\sum_i \|\mathbf{x} - \mathbf{X}_i\|$ — is equivariant under rigid motions, but it still depends on the choice of norm and is not equivariant under arbitrary affine maps.
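
A short numerical check makes the first failure concrete (a minimal sketch, assuming numpy; the sample matches Sample A of §1.4):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.multivariate_normal([0.0, 0.0], [[2.0, 1.0], [1.0, 1.0]], size=200)

c, s = np.cos(np.pi / 4), np.sin(np.pi / 4)
R = np.array([[c, -s], [s, c]])              # rotation by 45 degrees

cw_median = lambda A: np.median(A, axis=0)   # componentwise median

print(cw_median(X @ R.T))   # componentwise median of the rotated sample ...
print(R @ cw_median(X))     # ... is not the rotated componentwise median
```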

The mean and the Mahalanobis center handle affine equivariance correctly but inherit moment assumptions: the mean fails outright on heavy-tailed data, and the Mahalanobis center is anchored to the sample covariance, which has breakdown point $0$ — a single far outlier can drag the “center” anywhere we like.

What we want from multivariate centrality is a single function that (i) does not care about the choice of axes, (ii) puts the maximum at something that looks geometrically central regardless of the parametric family, (iii) decreases monotonically as we move away from that maximum, and (iv) does not let one or two outliers redefine where the bulk is. Listing those four desiderata is the Zuo–Serfling axiomatization.

1.2 What we want from a depth function

We pause to fix notation. Throughout this topic, $F$ denotes a probability distribution on $\mathbb{R}^d$, $\mathbf{X} \sim F$ a random vector with that distribution, and $\mathbf{X}_1, \dots, \mathbf{X}_n$ an iid sample from $F$ with empirical distribution $F_n = \frac{1}{n}\sum_{i=1}^n \delta_{\mathbf{X}_i}$. Affine transformations $\mathbf{x} \mapsto A\mathbf{x} + \mathbf{b}$ for nonsingular $A \in \mathbb{R}^{d \times d}$ and $\mathbf{b} \in \mathbb{R}^d$ are denoted $\mathcal{A}_{A,\mathbf{b}}$; the pushforward of $F$ under $\mathcal{A}_{A,\mathbf{b}}$ is written $F^{A,\mathbf{b}}$, the distribution of $A\mathbf{X} + \mathbf{b}$.

A depth function is a map $D : \mathbb{R}^d \times \mathcal{P}(\mathbb{R}^d) \to [0, \infty)$, written $D(\mathbf{x}; F)$, that takes a point and a distribution and returns a non-negative real number. The reading is: $D(\mathbf{x}; F)$ measures how central $\mathbf{x}$ is with respect to $F$, with larger values meaning more central. Several depth functions in §3 take values in $[0, 1]$; the unit-interval normalization is conventional but not part of the definition.

The deepest point of $F$ is any maximizer

$$\boldsymbol{\mu}_D(F) \;\in\; \arg\max_{\mathbf{x} \in \mathbb{R}^d} D(\mathbf{x}; F).$$

For most depth functions $\boldsymbol{\mu}_D$ is uniquely determined when $F$ has a density, but it is generally a set when $F$ is empirical — we will see in §2 that the empirical Tukey median is a region, not a point, and a tie-breaking rule (typically the centroid of that region) returns a single estimate.

The depth contour at level $\alpha \ge 0$ is the upper level set

$$D_\alpha(F) \;=\; \{\mathbf{x} \in \mathbb{R}^d : D(\mathbf{x}; F) \ge \alpha\}.$$

For halfspace depth these contours are nested closed convex sets indexed by $\alpha \in [0, 1/2]$; they form the multivariate analogue of confidence regions or quantile curves.

1.3 The Zuo–Serfling axioms

Zuo and Serfling (2000) formalize what a depth function ought to do via four axioms that together pick out a flexible but well-disciplined class. We state the axioms first, then motivate each one against the failure modes of §1.1.

Definition 1 (Zuo–Serfling depth function).

A map $D : \mathbb{R}^d \times \mathcal{P}(\mathbb{R}^d) \to [0, \infty)$ is a statistical depth function if it satisfies the following for every $F \in \mathcal{P}(\mathbb{R}^d)$:

(D1) Affine invariance. For every nonsingular $A \in \mathbb{R}^{d \times d}$, every $\mathbf{b} \in \mathbb{R}^d$, and every $\mathbf{x} \in \mathbb{R}^d$,

$$D(A\mathbf{x} + \mathbf{b}; F^{A,\mathbf{b}}) \;=\; D(\mathbf{x}; F).$$

(D2) Maximality at the center. If $F$ is centrally symmetric about $\boldsymbol{\mu}$ — that is, $\mathbf{X} - \boldsymbol{\mu} \stackrel{d}{=} \boldsymbol{\mu} - \mathbf{X}$ — then $D(\boldsymbol{\mu}; F) = \sup_{\mathbf{x}} D(\mathbf{x}; F)$, and the supremum is attained uniquely at $\boldsymbol{\mu}$.

(D3) Monotonicity along rays from the deepest point. For every $\mathbf{x}^* \in \arg\max_{\mathbf{x}} D(\mathbf{x}; F)$ and every $\mathbf{x} \in \mathbb{R}^d$, the function $t \mapsto D(\mathbf{x}^* + t(\mathbf{x} - \mathbf{x}^*); F)$ is non-increasing on $[0, \infty)$.

(D4) Vanishing at infinity. $D(\mathbf{x}; F) \to 0$ as $\|\mathbf{x}\| \to \infty$.

The motivation for each axiom matches one of §1.1’s pain points. (D1) removes the dependence on the choice of axes. (D2) anchors the depth function to a geometric notion of “center” whenever one exists unambiguously. (D3) says depth respects the geometry of “moving outward.” (D4) rules out the pathology where a depth function fails to distinguish “very far away” from “merely far away” — a regularity condition mostly used to guarantee that maxima exist.

The axioms are not independent — (D2) and (D3) interact through the choice of “center” — and in the literature several authors propose weaker variants. For this topic we adopt Zuo and Serfling’s original four. Each depth function in §3 will be checked against the axioms; one of them, Mahalanobis depth, satisfies the letter of the axioms but betrays their spirit on non-elliptical distributions, and the failure is exactly visible in the side-by-side panel.

1.4 Running examples

Three samples thread §§1–6.

Sample A — bivariate Gaussian. $\mathbf{X}_1, \dots, \mathbf{X}_n \stackrel{\text{iid}}{\sim} \mathcal{N}(\mathbf{0}, \boldsymbol{\Sigma})$ with

$$\boldsymbol{\Sigma} = \begin{pmatrix} 2 & 1 \\ 1 & 1 \end{pmatrix}, \qquad n = 200.$$

This is the canonical exemplar: an elliptical distribution where every depth function in §3 satisfies the axioms and contours are nested ellipses. The deepest point is the origin; depth contours are concentric tilted ellipses oriented along the leading eigenvector of $\boldsymbol{\Sigma}$.

Sample B — bivariate Cauchy. Same scale matrix as Sample A. The Cauchy has no mean and no covariance, so the sample mean and Mahalanobis center are unstable — but Tukey, simplicial, and projection depth still produce well-behaved contours, and the deepest point is still the origin (now interpreted as the symmetry center rather than the mean).

Sample C — contaminated Gaussian. A mixture, $0.90\, \mathcal{N}(\mathbf{0}, \boldsymbol{\Sigma}) + 0.10\, \mathcal{N}(\boldsymbol{\mu}_{\text{out}}, \mathbf{I})$ with $\boldsymbol{\mu}_{\text{out}} = (8, 8)^\top$. Depth functions with high breakdown (Tukey, simplicial, projection) keep the deepest point near $\mathbf{0}$ regardless of contamination; depth functions anchored to moments (Mahalanobis on the empirical covariance) drift toward the contamination cluster.

The figure below shows all three samples side by side together with three reasonable-looking but disagreeing notions of “center”: the componentwise median, the sample mean, and the geometric median (the minimizer of $\sum_i \|\mathbf{x} - \mathbf{X}_i\|$, computed by Weiszfeld’s iteration). On the Gaussian they roughly agree; on the Cauchy the mean is unstable; on the contaminated sample they disagree by a meaningful amount and none of them is obviously right. The §2 Tukey median, by contrast, will sit near the origin in all three panels.

Three reasonable-looking centers (componentwise median, sample mean, geometric median) plotted on Samples A, B, C, with the symmetry center marked as a gold star
On Sample A the three candidate centers and the symmetry center cluster near the origin; on Sample B the sample mean wanders because the Cauchy has no population mean; on Sample C the moment-based estimators drift toward the contamination at (8, 8) while the symmetry center stays put.
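
For reference, the geometric median in the panels can be computed by Weiszfeld’s iteration, an iteratively re-weighted mean. A minimal sketch, assuming numpy (the function name is ours):

```python
import numpy as np

def geometric_median(X, iters=200, tol=1e-9):
    """Weiszfeld's iteration: a re-weighted mean with weights 1/distance,
    converging to the minimizer of sum_i ||x - X_i||."""
    m = X.mean(axis=0)                        # start from the sample mean
    for _ in range(iters):
        d = np.linalg.norm(X - m, axis=1)
        w = 1.0 / np.maximum(d, tol)          # guard against zero distances
        m_new = (w[:, None] * X).sum(axis=0) / w.sum()
        if np.linalg.norm(m_new - m) < tol:
            break
        m = m_new
    return m
```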

2. Tukey halfspace depth

Tukey (1975) proposed the construction that defines the canonical multivariate centrality score. The idea is geometric: for a point $\mathbf{x}$ to be deep, every line through $\mathbf{x}$ should split the sample roughly in half; a point near the boundary of the cloud, by contrast, has at least one line through it that puts almost all of the sample on one side. The minimum count over all such lines — equivalently, all halfspaces with $\mathbf{x}$ on their boundary — is the halfspace depth.

2.1 Definition

Recall that a closed halfspace in $\mathbb{R}^d$ is a set of the form

$$H_{\mathbf{u}, c} \;=\; \{\mathbf{y} \in \mathbb{R}^d : \mathbf{u}^\top \mathbf{y} \ge c\}, \qquad \mathbf{u} \in \mathbb{S}^{d-1}, \; c \in \mathbb{R},$$

where $\mathbb{S}^{d-1} = \{\mathbf{u} \in \mathbb{R}^d : \|\mathbf{u}\| = 1\}$ is the unit sphere. The boundary hyperplane is $\partial H_{\mathbf{u}, c} = \{\mathbf{y} : \mathbf{u}^\top \mathbf{y} = c\}$.

Definition 2 (Population Tukey halfspace depth).

Let $F$ be a probability distribution on $\mathbb{R}^d$ and $\mathbf{x} \in \mathbb{R}^d$. The halfspace depth of $\mathbf{x}$ with respect to $F$ is

$$\operatorname{HD}(\mathbf{x}; F) \;=\; \inf_{H \ni \mathbf{x}} F(H),$$

where the infimum runs over all closed halfspaces $H \subset \mathbb{R}^d$ that contain $\mathbf{x}$.

Equivalently, every closed halfspace containing $\mathbf{x}$ has the form $H_{\mathbf{u}, c}$ with $\mathbf{u}^\top \mathbf{x} \ge c$. The worst case (smallest $F$-mass) is achieved by halfspaces whose boundary passes exactly through $\mathbf{x}$, so the infimum reduces to

$$\operatorname{HD}(\mathbf{x}; F) \;=\; \inf_{\mathbf{u} \in \mathbb{S}^{d-1}} F\bigl(\{\mathbf{y} : \mathbf{u}^\top(\mathbf{y} - \mathbf{x}) \ge 0\}\bigr) \;=\; \inf_{\mathbf{u} \in \mathbb{S}^{d-1}} \mathbb{P}\bigl(\mathbf{u}^\top(\mathbf{X} - \mathbf{x}) \ge 0\bigr).$$

The reading is direct: pick any direction $\mathbf{u}$, slide a hyperplane through $\mathbf{x}$ orthogonal to $\mathbf{u}$, and count the $F$-probability on the $\mathbf{u}$-side. Halfspace depth is the worst (smallest) such fraction.

Definition 3 (Empirical Tukey halfspace depth).

Let $\mathbf{X}_1, \dots, \mathbf{X}_n$ be a sample from $F$ with empirical distribution $F_n$. The empirical halfspace depth of $\mathbf{x}$ is

$$\operatorname{HD}_n(\mathbf{x}) \;=\; \operatorname{HD}(\mathbf{x}; F_n) \;=\; \frac{1}{n} \inf_{\mathbf{u} \in \mathbb{S}^{d-1}} \#\bigl\{i : \mathbf{u}^\top(\mathbf{X}_i - \mathbf{x}) \ge 0\bigr\}.$$

The unit-interval normalization is conventional: $\operatorname{HD}_n(\mathbf{x}) \in \{0/n, 1/n, \dots, n/n\}$. The maximum possible value is bounded by $1/2$ in $d \ge 2$ for any $F$ with no atom at $\mathbf{x}$ (one can always find a hyperplane through $\mathbf{x}$ that puts at most half the mass on either side), and the upper bound is attained only at a halfspace median — the Tukey median defined in §2.3.

A worked example fixes ideas. Consider four points in $\mathbb{R}^2$ at the corners of a unit square, $\{(0,0), (1,0), (0,1), (1,1)\}$, and the query point $\mathbf{x} = (0.5, 0.5)$. Every line through $\mathbf{x}$ either passes through two opposite corners — in which case each of the two closed halfspaces contains three points, the two boundary corners plus one more — or strictly separates two corners from two corners. The minimum closed-halfspace count is $2$, so $\operatorname{HD}_n(\mathbf{x}) = 2/4 = 1/2$. By contrast, the corner $\mathbf{x}' = (0, 0)$ has $\operatorname{HD}_n(\mathbf{x}') = 1/4$: the closed halfspace $\{\mathbf{y} : y_1 + y_2 \le 0\}$ contains $\mathbf{x}'$ and no other sample point. A point outside the square, say $\mathbf{x}'' = (2, 2)$, has $\operatorname{HD}_n(\mathbf{x}'') = 0$: the closed halfspace $\{\mathbf{y} : y_1 + y_2 \ge 4\}$ contains $\mathbf{x}''$ and none of the sample.

2.2 Verifying the Zuo–Serfling axioms

Halfspace depth was the prototype Zuo and Serfling had in mind when they wrote down the four axioms; verification is short and clean.

Theorem 1 (HD is a Zuo–Serfling depth function).

Tukey halfspace depth $\operatorname{HD}(\mathbf{x}; F)$ satisfies the four Zuo–Serfling axioms (D1)–(D4).

Proof.

(D1) Affine invariance. Let $A \in \mathbb{R}^{d \times d}$ be nonsingular and $\mathbf{b} \in \mathbb{R}^d$. The pushforward $F^{A,\mathbf{b}}$ is the distribution of $A\mathbf{X} + \mathbf{b}$. A closed halfspace containing $A\mathbf{x} + \mathbf{b}$ has the form $H_{\mathbf{u}, c}$ with $\mathbf{u}^\top(A\mathbf{x} + \mathbf{b}) \ge c$; pulling back under $\mathcal{A}_{A,\mathbf{b}}$,

$$\mathbf{u}^\top(A\mathbf{y} + \mathbf{b}) \ge c \;\iff\; (A^\top \mathbf{u})^\top \mathbf{y} \ge c - \mathbf{u}^\top \mathbf{b}.$$

So $H_{\mathbf{u}, c}$ pulls back to a halfspace with normal $\mathbf{v} = A^\top \mathbf{u}$ (rescaled to unit length, which only multiplies the threshold by a positive scalar) containing $\mathbf{x}$. The map $\mathbf{u} \mapsto A^\top \mathbf{u}/\|A^\top \mathbf{u}\|$ is a bijection of $\mathbb{S}^{d-1}$, so the infimum of $F^{A,\mathbf{b}}$-mass over halfspaces containing $A\mathbf{x} + \mathbf{b}$ equals the infimum of $F$-mass over halfspaces containing $\mathbf{x}$. Hence $\operatorname{HD}(A\mathbf{x} + \mathbf{b}; F^{A,\mathbf{b}}) = \operatorname{HD}(\mathbf{x}; F)$.

(D2) Maximality at the symmetry center. Suppose $F$ is centrally symmetric about $\boldsymbol{\mu}$, i.e., $\mathbf{X} - \boldsymbol{\mu} \stackrel{d}{=} \boldsymbol{\mu} - \mathbf{X}$. For any unit direction $\mathbf{u}$,

$$\mathbb{P}\bigl(\mathbf{u}^\top(\mathbf{X} - \boldsymbol{\mu}) \ge 0\bigr) \;=\; \mathbb{P}\bigl(\mathbf{u}^\top(\boldsymbol{\mu} - \mathbf{X}) \ge 0\bigr) \;=\; \mathbb{P}\bigl(\mathbf{u}^\top(\mathbf{X} - \boldsymbol{\mu}) \le 0\bigr).$$

The two probabilities sum to $1 + \mathbb{P}(\mathbf{u}^\top(\mathbf{X} - \boldsymbol{\mu}) = 0) \ge 1$, so each is at least $1/2$. Therefore $\operatorname{HD}(\boldsymbol{\mu}; F) = \inf_{\mathbf{u}} \mathbb{P}(\mathbf{u}^\top(\mathbf{X} - \boldsymbol{\mu}) \ge 0) \ge 1/2$.

For any $\mathbf{x} \neq \boldsymbol{\mu}$, take $\mathbf{u} = (\mathbf{x} - \boldsymbol{\mu})/\|\mathbf{x} - \boldsymbol{\mu}\|$ — the unit direction away from the center. Then

$$\mathbb{P}\bigl(\mathbf{u}^\top(\mathbf{X} - \mathbf{x}) \ge 0\bigr) \;=\; \mathbb{P}\bigl(\mathbf{u}^\top(\mathbf{X} - \boldsymbol{\mu}) \ge \|\mathbf{x} - \boldsymbol{\mu}\|\bigr).$$

By symmetry the distribution of $\mathbf{u}^\top(\mathbf{X} - \boldsymbol{\mu})$ is symmetric about $0$, so the right-hand probability is at most $1/2$, and strictly less than $1/2$ whenever the corresponding marginal CDF is continuous at $\|\mathbf{x} - \boldsymbol{\mu}\| > 0$. Under the standard regularity assumption (say, $F$ has a density), the strict inequality holds for every $\mathbf{x} \ne \boldsymbol{\mu}$, so $\operatorname{HD}(\mathbf{x}; F) < 1/2 \le \operatorname{HD}(\boldsymbol{\mu}; F)$. The supremum is uniquely attained at $\boldsymbol{\mu}$.

(D3) Monotonicity along rays. Fix any deepest point $\mathbf{x}^*$ of $F$ and any direction $\mathbf{v} \in \mathbb{R}^d$. We must show that $\phi(t) = \operatorname{HD}(\mathbf{x}^* + t\mathbf{v}; F)$ is non-increasing on $[0, \infty)$.

For each unit direction $\mathbf{u}$, consider the function

$$g_{\mathbf{u}}(t) \;=\; \mathbb{P}\bigl(\mathbf{u}^\top(\mathbf{X} - \mathbf{x}^* - t\mathbf{v}) \ge 0\bigr) \;=\; \mathbb{P}\bigl(\mathbf{u}^\top(\mathbf{X} - \mathbf{x}^*) \ge t\, \mathbf{u}^\top \mathbf{v}\bigr).$$

Each $g_{\mathbf{u}}$ is non-increasing in $t$ when $\mathbf{u}^\top \mathbf{v} > 0$ (the threshold rises as $t$ grows) and non-decreasing when $\mathbf{u}^\top \mathbf{v} < 0$. Split the sphere accordingly: $A(t) = \inf\{g_{\mathbf{u}}(t) : \mathbf{u}^\top \mathbf{v} \ge 0\}$ is an infimum of non-increasing functions, hence non-increasing, while $B(t) = \inf\{g_{\mathbf{u}}(t) : \mathbf{u}^\top \mathbf{v} < 0\}$ satisfies $B(t) \ge B(0) \ge \operatorname{HD}(\mathbf{x}^*; F)$. Since $\mathbf{x}^*$ is a maximizer, $\phi(t) \le \operatorname{HD}(\mathbf{x}^*; F)$ for every $t$, and therefore $\phi(t) = \min(A(t), B(t)) = \min\bigl(A(t), \operatorname{HD}(\mathbf{x}^*; F)\bigr)$ — the second branch can only bind at the maximal depth value. The minimum of a non-increasing function and a constant is non-increasing, so $\phi$ is non-increasing on $[0, \infty)$.

(D4) Vanishing at infinity. Let $(\mathbf{x}_k)$ be any sequence with $\|\mathbf{x}_k\| \to \infty$; passing to a subsequence, we may assume $\mathbf{x}_k/\|\mathbf{x}_k\| \to \mathbf{u}_*$. Then $\mathbf{u}_*^\top \mathbf{x}_k = \|\mathbf{x}_k\|(1 + o(1)) \to \infty$, so $\mathbb{P}\bigl(\mathbf{u}_*^\top(\mathbf{X} - \mathbf{x}_k) \ge 0\bigr) = \mathbb{P}\bigl(\mathbf{u}_*^\top \mathbf{X} \ge \mathbf{u}_*^\top \mathbf{x}_k\bigr) \to 0$. Hence $\operatorname{HD}(\mathbf{x}_k; F) \le \mathbb{P}\bigl(\mathbf{u}_*^\top(\mathbf{X} - \mathbf{x}_k) \ge 0\bigr) \to 0$. $\square$

So Tukey halfspace depth is a Zuo–Serfling depth function — the construction satisfies the entire axiomatic specification. The proofs above used only the closed-halfspace formulation; the same arguments work for the empirical version with $F_n$ in place of $F$.

2.3 The Tukey median and contour convexity

The Tukey median of $F$ is any maximizer of $\operatorname{HD}(\,\cdot\,; F)$:

$$\boldsymbol{\mu}_T(F) \;\in\; \arg\max_{\mathbf{x} \in \mathbb{R}^d} \operatorname{HD}(\mathbf{x}; F).$$

For populations $F$ with a density and central symmetry, the proof of (D2) shows $\boldsymbol{\mu}_T(F) = \boldsymbol{\mu}$ uniquely. For the empirical distribution $F_n$, the maximizer is generally a region, not a point: the level set $\{\mathbf{x} : \operatorname{HD}_n(\mathbf{x}) = \alpha^*\}$ at the maximum depth $\alpha^*$ is a non-empty closed convex set, and the conventional point estimator is its centroid.

The convexity comes from a structural fact about depth contours.

Proposition 1 (Convexity of halfspace depth contours).

For any $F$ and any $\alpha \ge 0$, the upper level set

$$\operatorname{HD}_\alpha(F) \;=\; \{\mathbf{x} \in \mathbb{R}^d : \operatorname{HD}(\mathbf{x}; F) \ge \alpha\}$$

is closed and convex.

Proof.

Let $\mathbf{x}, \mathbf{y} \in \operatorname{HD}_\alpha(F)$ and $\lambda \in [0, 1]$. Set $\mathbf{z} = \lambda \mathbf{x} + (1-\lambda) \mathbf{y}$. For any closed halfspace $H$ with $\mathbf{z} \in H$, write $H = H_{\mathbf{u}, c}$ with $\mathbf{u}^\top \mathbf{z} \ge c$. Either $\mathbf{u}^\top \mathbf{x} \ge c$ or $\mathbf{u}^\top \mathbf{y} \ge c$ — otherwise $\mathbf{u}^\top \mathbf{z}$ would be a convex combination of two values both less than $c$, a contradiction. So at least one of $\mathbf{x}, \mathbf{y}$ lies in $H$, and

$$F(H) \;\ge\; \min\bigl(\operatorname{HD}(\mathbf{x}; F), \operatorname{HD}(\mathbf{y}; F)\bigr) \;\ge\; \alpha.$$

Taking the infimum over $H \ni \mathbf{z}$ gives $\operatorname{HD}(\mathbf{z}; F) \ge \alpha$, so $\mathbf{z} \in \operatorname{HD}_\alpha(F)$. Closedness follows because $\mathbf{x} \mapsto \operatorname{HD}(\mathbf{x}; F)$ is upper semicontinuous (an infimum over $\mathbf{u}$ of upper semicontinuous functions of $\mathbf{x}$). $\square$

The depth contours are closed convex sets nested by depth level, and the Tukey median is the deepest such set. This convexity property is what makes halfspace depth particularly useful for constructing multivariate confidence regions.

2.4 Computing 2D Tukey depth

Computing $\operatorname{HD}_n(\mathbf{x})$ in $\mathbb{R}^d$ requires evaluating an infimum over the unit sphere $\mathbb{S}^{d-1}$. In 2D this infimum reduces to a search over the $n$ angular bearings of the data points relative to $\mathbf{x}$, and there is an exact $O(n \log n)$ algorithm — the rotating-halfplane sweep (Rousseeuw and Ruts 1996).

Algorithm (Rousseeuw–Ruts, sketch). Let $\mathbf{y}_i = \mathbf{X}_i - \mathbf{x}$. Compute angles $\theta_i = \operatorname{arctan2}(y_{i,2}, y_{i,1}) \in [-\pi, \pi)$ for each non-coincident $\mathbf{y}_i$. Sort the angles. The closed halfplane through $\mathbf{x}$ with normal at angle $\alpha + \pi/2$ contains exactly the points whose angles fall in $[\alpha, \alpha + \pi]$. As $\alpha$ sweeps through $[0, 2\pi)$, the count changes only when $\alpha$ crosses one of the $\theta_i$ or one of the $\theta_i - \pi$ markers; the minimum over all sweep positions, plus the count of points coincident with $\mathbf{x}$ (which lie on every halfplane boundary), is the depth count. The whole computation is dominated by the sort, $O(n \log n)$.
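
A transparent sketch of the same idea, assuming numpy: it trades the $O(n \log n)$ sweep for an $O(n^2)$ evaluation of the piecewise-constant count on every angular arc, which is easier to verify (the function name is ours):

```python
import numpy as np

def tukey_depth_2d(x, X, tol=1e-12):
    """Exact empirical Tukey halfspace depth of query x w.r.t. the rows
    of X (shape (n, 2)).

    The closed-halfplane count is piecewise constant in the normal's
    angle, jumping only at the critical angles theta_i +/- pi/2, so
    evaluating it on every arc (and at every arc endpoint) finds the
    exact minimum in O(n^2); the Rousseeuw-Ruts sweep is O(n log n)."""
    y = np.asarray(X, dtype=float) - np.asarray(x, dtype=float)
    n = len(y)
    r = np.hypot(y[:, 0], y[:, 1])
    m = int(np.sum(r <= tol))      # coincident points lie on every boundary
    y = y[r > tol]
    if len(y) == 0:
        return m / n
    theta = np.arctan2(y[:, 1], y[:, 0])   # bearing of each point from x
    crit = np.sort(np.concatenate([theta - np.pi / 2, theta + np.pi / 2]))
    mids = (crit + np.roll(crit, -1)) / 2  # one interior angle per arc
    mids[-1] += np.pi                      # fix the wrap-around arc
    betas = np.concatenate([crit, mids])
    u = np.stack([np.cos(betas), np.sin(betas)], axis=1)  # candidate normals
    counts = np.sum(u @ y.T >= -tol, axis=1)   # closed-halfplane counts
    return (counts.min() + m) / n
```

Run on the §2.1 unit-square example, the queries $(0.5, 0.5)$, $(0, 0)$, and $(2, 2)$ return $1/2$, $1/4$, and $0$ respectively.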

The companion algorithm in $d \ge 3$ is more involved: exact computation reduces to enumerating all hyperplanes through the query point and $d-1$ data points, an $O(n^{d-1})$ operation. Aloupis (2006) shows that exact computation of the depth at a single point — without requiring the full enumeration — is NP-hard when $d$ is part of the input. We return to that result in §4.4.

2.5 Depth contours

With the 2D algorithm in hand we can plot depth contours $\{\mathbf{x} : \operatorname{HD}_n(\mathbf{x}) \ge \alpha\}$ as a function of $\alpha$. For Sample A the contours are nested and roughly concentric ellipses, oriented along the leading eigenvector of $\boldsymbol{\Sigma}$, with the maximum depth attained near the origin at value $\approx 1/2$. For Sample C the contamination cluster does not redirect the contours: the deepest region remains near the origin, and the outliers register only as a small bump on the outer (low-$\alpha$) contours. This is the visual demonstration that halfspace depth has high breakdown — a quantitative argument follows in §4.2.

Tukey halfspace depth contours on Sample A (Gaussian) and Sample C (contaminated), showing the empirical Tukey median sitting near the origin in both panels
Static contour panels for Samples A and C. The empirical Tukey median (gold star) stays near the symmetry center (red plus) under 10% contamination; the outer contours barely register the (8, 8) cluster.

The interactive widget below recomputes contours on the fly. Switch the sample selector to compare A, B, C; drag the pink crosshair to read live depth at any query point; raise the grid resolution for a sharper picture (at the cost of more depth queries per redraw).

★ marks the empirical Tukey median (centroid of the deepest grid cell). Drag the pink crosshair to read depth at any query point. Grid resolution trades accuracy against wait time — gridSize² Tukey-depth queries per redraw.

3. A zoo of depth functions

Tukey halfspace depth is canonical but not unique. Several other constructions satisfy the Zuo–Serfling axioms, each making different tradeoffs along three axes the §2 algorithm exposed: how fast the depth is to compute, how strongly the depth depends on parametric model assumptions, and how the depth scales with dimension. We catalogue four of the most useful — Mahalanobis, simplicial, projection, and spatial depth — verify the relevant axioms, and put the contours side by side on a common sample.

3.1 Mahalanobis depth

The fastest depth function — closed-form, requiring only a $d \times d$ matrix inverse — is the one anchored to the first two moments.

Definition 4 (Mahalanobis depth).

For a distribution $F$ with mean $\boldsymbol{\mu}_F$ and positive-definite covariance $\boldsymbol{\Sigma}_F$, the Mahalanobis depth of $\mathbf{x}$ is

$$\operatorname{MD}(\mathbf{x}; F) \;=\; \frac{1}{1 + (\mathbf{x} - \boldsymbol{\mu}_F)^\top \boldsymbol{\Sigma}_F^{-1} (\mathbf{x} - \boldsymbol{\mu}_F)}.$$

The empirical version uses the sample mean and sample covariance: $\operatorname{MD}_n(\mathbf{x}) = \bigl(1 + (\mathbf{x} - \bar{\mathbf{X}})^\top \mathbf{S}^{-1} (\mathbf{x} - \bar{\mathbf{X}})\bigr)^{-1}$.
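In code, the empirical version is a few lines (a sketch, assuming numpy; the function name is ours):

```python
import numpy as np

def mahalanobis_depth(x, X):
    """Empirical Mahalanobis depth: 1 / (1 + squared Mahalanobis
    distance from x to the sample mean under the sample covariance)."""
    mu = X.mean(axis=0)
    S_inv = np.linalg.inv(np.cov(X, rowvar=False))
    dev = np.asarray(x, dtype=float) - mu
    return float(1.0 / (1.0 + dev @ S_inv @ dev))
```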

Mahalanobis depth satisfies (D1)–(D4): the affine-invariance proof is one line, since under $\mathbf{x} \mapsto A\mathbf{x} + \mathbf{b}$ the mean maps to $A\boldsymbol{\mu} + \mathbf{b}$, the covariance maps to $A\boldsymbol{\Sigma} A^\top$, and the quadratic form $(\mathbf{x} - \boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu})$ is preserved exactly. The maximum is attained at $\boldsymbol{\mu}_F$ with value $1$; monotonicity along rays is the convexity of quadratic forms; vanishing at infinity is automatic.

The catch is what maximum it produces. The $\arg\max$ of Mahalanobis depth is the mean, not the symmetry center. For a centrally symmetric $F$ with finite covariance the two coincide, so (D2) holds. For a non-elliptical $F$ — say, a banana-shaped distribution — Mahalanobis depth still produces ellipsoidal contours centered at the mean, even when the actual bulk of $F$ is far from elliptical. The depth is not wrong in the axiomatic sense; it is committed to the parametric ellipsoid, and any deviation from that model leaks straight into the depth contours. The breakdown point is also that of the sample covariance: zero. A single far outlier moves $\mathbf{S}$ enough that $\mathbf{S}^{-1}$ tilts toward the outlier, and the deepest point drifts.

3.2 Simplicial depth

Liu (1988, 1990) proposed a depth function defined entirely in terms of the convex-hull geometry of random simplices.

Definition 5 (Simplicial depth).

For a distribution $F$ on $\mathbb{R}^d$, the simplicial depth of $\mathbf{x}$ is

$$\operatorname{SD}(\mathbf{x}; F) \;=\; \mathbb{P}\bigl(\mathbf{x} \in \operatorname{conv}(\mathbf{X}_1, \dots, \mathbf{X}_{d+1})\bigr),$$

where $\mathbf{X}_1, \dots, \mathbf{X}_{d+1}$ are iid from $F$ and $\operatorname{conv}$ denotes the closed convex hull (here a $d$-simplex with probability $1$ when $F$ has a density).

The empirical version is a U-statistic of order $d+1$:

$$\operatorname{SD}_n(\mathbf{x}) \;=\; \binom{n}{d+1}^{-1} \sum_{1 \le i_1 < \cdots < i_{d+1} \le n} \mathbb{1}\bigl\{\mathbf{x} \in \operatorname{conv}(\mathbf{X}_{i_1}, \dots, \mathbf{X}_{i_{d+1}})\bigr\}.$$

The U-statistic structure is exactly what Liu’s result exploits: the kernel $h$ is symmetric, bounded, and has expectation $\operatorname{SD}(\mathbf{x}; F)$. Standard U-statistic theory (Hoeffding 1948) gives consistency, asymptotic normality, and explicit variance formulas for $\operatorname{SD}_n(\mathbf{x})$.

Theorem 2 (Liu 1990 — simplicial depth axioms).

$\operatorname{SD}(\,\cdot\,; F)$ satisfies (D1)–(D4) for every absolutely continuous $F$.

Proof.

(Sketch.) (D1) Affine invariance follows because affine maps preserve convex-hull membership: for nonsingular $A$, $\mathbf{x} \in \operatorname{conv}(\mathbf{X}_1, \dots, \mathbf{X}_{d+1})$ if and only if $A\mathbf{x} + \mathbf{b} \in \operatorname{conv}(A\mathbf{X}_1 + \mathbf{b}, \dots, A\mathbf{X}_{d+1} + \mathbf{b})$. The pushforward simplex has the same indicator, so the expectations match. (D2) For centrally symmetric $F$ at the symmetry center $\boldsymbol{\mu}$, use $\mathbf{X} - \boldsymbol{\mu} \stackrel{d}{=} \boldsymbol{\mu} - \mathbf{X}$ and apply the Bárány–Kalai random-simplex argument: the probability that the simplex contains $\boldsymbol{\mu}$ exceeds the probability that it contains any other point, with strict inequality under the density assumption. (D3) The probability $\mathbb{P}(\mathbf{x} \in \operatorname{conv}(\cdots))$ is non-increasing as $\mathbf{x}$ moves outward from the deepest point because the convex-hull membership constraint becomes more restrictive. (D4) As $\|\mathbf{x}\| \to \infty$, the probability that an iid $(d+1)$-point simplex contains $\mathbf{x}$ vanishes by tail decay of $F$. Full details in Liu (1990), Theorem 1. $\square$

Simplicial depth is fully nonparametric (no moment assumptions), affine invariant, and equipped with a clean U-statistic asymptotic theory. Its computational cost, however, is severe: the U-statistic of order $d+1$ has $\binom{n}{d+1}$ terms, and even in 2D this is $\binom{n}{3} \asymp n^3/6$ point-in-triangle tests per query.
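
The 2D empirical version is short to state and expensive to run; a direct $O(n^3)$ sketch assuming numpy and an $(n, 2)$ array, usable only at small $n$ (the function name is ours):

```python
import numpy as np
from itertools import combinations

def simplicial_depth_2d(x, X):
    """Empirical simplicial depth in 2D: the fraction of the C(n, 3)
    data triangles whose closed hull contains x."""
    p = np.asarray(x, dtype=float)

    def cross2(u, v):                  # z-component of the 2D cross product
        return u[0] * v[1] - u[1] * v[0]

    def contains(a, b, c):
        # p lies in the closed triangle abc iff the three signed areas
        # all share a sign (zeros allowed: p on an edge still counts)
        s1 = cross2(b - a, p - a)
        s2 = cross2(c - b, p - b)
        s3 = cross2(a - c, p - c)
        return min(s1, s2, s3) >= 0 or max(s1, s2, s3) <= 0

    n = len(X)
    hits = sum(contains(X[i], X[j], X[k])
               for i, j, k in combinations(range(n), 3))
    return hits / (n * (n - 1) * (n - 2) / 6)
```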

3.3 Projection depth

Random projections give a depth function that scales gracefully with dimension at the cost of trading exactness for a Monte-Carlo approximation.

Definition 6 (Projection depth).

For a distribution $F$ on $\mathbb{R}^d$, define the outlyingness of $\mathbf{x}$ as

$$O(\mathbf{x}; F) \;=\; \sup_{\mathbf{u} \in \mathbb{S}^{d-1}} \frac{|\mathbf{u}^\top \mathbf{x} - m(\mathbf{u}^\top \mathbf{X})|}{\operatorname{MAD}(\mathbf{u}^\top \mathbf{X})},$$

where $m(\,\cdot\,)$ is the univariate median and $\operatorname{MAD}(Y) = \operatorname{med}(|Y - m(Y)|)$ is the median absolute deviation. The projection depth is

$$\operatorname{PD}(\mathbf{x}; F) \;=\; \frac{1}{1 + O(\mathbf{x}; F)}.$$

The reading: project $F$ onto direction $\mathbf{u}$ to get the univariate distribution of $\mathbf{u}^\top \mathbf{X}$. Compute the standardized univariate distance from $\mathbf{u}^\top \mathbf{x}$ to the projected median. Take the supremum over directions. The deepest point in $\mathbb{R}^d$ is the one whose projected position is closest to the projected median in every direction — the multivariate generalization of “the median of every projection.”

In practice, the supremum over $\mathbb{S}^{d-1}$ is approximated by sampling $K$ random unit directions and taking the maximum:

$$\widehat{\operatorname{PD}}_K(\mathbf{x}) \;=\; \frac{1}{1 + \max_{k=1,\dots,K} \dfrac{|\mathbf{u}_k^\top \mathbf{x} - m_k|}{\operatorname{MAD}_k}},$$

where $m_k$ and $\operatorname{MAD}_k$ are the median and MAD of the projected sample $\{\mathbf{u}_k^\top \mathbf{X}_i\}_{i=1}^n$.

The cost is $O(K \cdot n)$ per query — linear in dimension and trivially parallel. Projection depth satisfies (D1)–(D4): affine invariance follows because the median and MAD are equivariant under univariate affine maps, and the supremum over $\mathbb{S}^{d-1}$ pulls back the same way Tukey’s does. The approximation $\widehat{\operatorname{PD}}_K \to \operatorname{PD}$ holds at every fixed $\mathbf{x}$ by the strong law applied to the supremum over a uniform sample on the sphere; the error is $O_p(\sqrt{\log K / K})$ uniformly on a compact set.
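
A Monte-Carlo sketch of $\widehat{\operatorname{PD}}_K$, assuming numpy (the function name is ours); note that sampling approximates the supremum from below, so the returned value can only overestimate the exact projection depth:

```python
import numpy as np

def projection_depth(x, X, K=200, rng=None):
    """Monte-Carlo projection depth over K random unit directions."""
    rng = np.random.default_rng(rng)
    U = rng.standard_normal((K, X.shape[1]))
    U /= np.linalg.norm(U, axis=1, keepdims=True)   # uniform on the sphere
    proj = X @ U.T                                  # (n, K) projected sample
    med = np.median(proj, axis=0)
    mad = np.median(np.abs(proj - med), axis=0)     # per-direction MAD
    out = np.max(np.abs(np.asarray(x) @ U.T - med) / mad)
    return 1.0 / (1.0 + out)
```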

3.4 Spatial depth

Spatial depth (Vardi and Zhang 2000; Serfling 2002) replaces the sup-over-directions structure of projection depth with an $L^1$ optimality criterion.

Definition 7 (Spatial depth).

For a distribution $F$ on $\mathbb{R}^d$ with no atom at $\mathbf{x}$,

$$\operatorname{Sp}(\mathbf{x}; F) \;=\; 1 - \biggl\| \mathbb{E}\biggl[\frac{\mathbf{X} - \mathbf{x}}{\|\mathbf{X} - \mathbf{x}\|}\biggr] \biggr\|.$$

The empirical version uses the sample mean of the unit-direction vectors. The geometric reading is clean. Each $\mathbf{X}_i$ contributes a unit vector pointing from $\mathbf{x}$ toward $\mathbf{X}_i$. If $\mathbf{x}$ sits at the geometric median of the sample, the unit vectors balance: their average is close to $\mathbf{0}$, so $\operatorname{Sp}_n(\mathbf{x}) \approx 1$. If $\mathbf{x}$ sits on the periphery, all the unit vectors point roughly the same way, the average has large norm, and $\operatorname{Sp}_n(\mathbf{x}) \approx 0$. The deepest point is the spatial median, which coincides with the Fréchet geometric median of §1.1.

Spatial depth is invariant only under orthogonal transformations and global rescaling — it depends on the choice of norm through the unit-direction normalization. In practice the dependence is usually mitigated by whitening the data with a (preferably robust) scatter estimate before computing the depth, which restores affine invariance. The cost is $O(n)$ per query, the cheapest of the four after Mahalanobis.
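
The empirical version in code (a sketch, assuming numpy; the function name is ours):

```python
import numpy as np

def spatial_depth(x, X, tol=1e-12):
    """Empirical spatial depth: one minus the norm of the average unit
    vector pointing from x toward each sample point."""
    diff = np.asarray(X, dtype=float) - np.asarray(x, dtype=float)
    norms = np.linalg.norm(diff, axis=1)
    keep = norms > tol                # drop sample points coincident with x
    units = diff[keep] / norms[keep, None]
    return 1.0 - float(np.linalg.norm(units.mean(axis=0)))
```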

3.5 Side-by-side comparison

All four depth functions evaluated on Sample A produce qualitatively similar contour pictures — nested, roughly elliptical, peaked near the origin. The differences appear in (a) the rate at which contours decay outward, (b) sensitivity to contamination, and (c) whether the contour shape can deviate from an ellipse when the underlying distribution does. The five-panel figure below shows the contours on a common Gaussian sample.

Five depth functions evaluated on the same Gaussian sample, contours plotted side by side
On Gaussian data — the elliptical regime — Tukey, Mahalanobis, simplicial, projection, and spatial depth all produce nearly affine images of one another. The choice between them comes down to runtime, robustness, and dimensional scaling rather than which 'center' they pick out.

The interesting failure mode appears on non-elliptical data, where Mahalanobis depth’s commitment to the ellipsoid model leaks into the contours and produces depth structure that does not track the actual density of the sample.

Banana-shaped sample with Tukey contours that follow the bend and Mahalanobis contours that stay elliptical
Tukey halfspace depth tracks the bend of the banana sample; Mahalanobis stays committed to the elliptical model rooted at the sample mean. The maximum depth values are not directly comparable (≈ 0.45 for Tukey vs exactly 1.00 for Mahalanobis by construction); the point is the geometry, which diverges.

The interactive comparison below lets you switch the sample (Gaussian / banana) and the right-hand depth function. The left panel always shows Tukey halfspace depth as the affine-invariant canonical reference. The simplicial branch is capped at $n = 60$; the projection-depth approximation runs at $K = 200$ directions.

On Gaussian data all five depths produce nearly affine images of one another — the elliptical bulk doesn't punish any choice. On the banana sample, Mahalanobis depth's elliptical commitment shows: contours stay aligned with the sample covariance even though the bulk bends. Tukey halfspace depth tracks the bend; the others vary.

4. Theory and computation

The §§2–3 implementations would be of limited value without theoretical guarantees that the empirical depth converges to its population counterpart, that the Tukey median is robust in a quantifiable sense, and that the asymptotic behaviour of empirical depth has a known limit law. This section assembles those guarantees and pairs each with its computational counterpart.

4.1 Consistency

Halfspace depth at a fixed point is consistent under minimal assumptions on $F$.

Theorem 3 (Donoho–Gasko consistency).

Let $F$ be a probability distribution on $\mathbb{R}^d$ that assigns mass zero to every hyperplane. Then for every $\mathbf{x} \in \mathbb{R}^d$,

$$\operatorname{HD}_n(\mathbf{x}) \;\xrightarrow{\text{a.s.}}\; \operatorname{HD}(\mathbf{x}; F) \quad \text{as } n \to \infty.$$

Proof.

Fix $\mathbf{x} \in \mathbb{R}^d$. Define the class of closed halfspaces with $\mathbf{x}$ on the boundary,

$$\mathcal{H}_{\mathbf{x}} \;=\; \{H_{\mathbf{u}, \mathbf{u}^\top \mathbf{x}} : \mathbf{u} \in \mathbb{S}^{d-1}\}.$$

By definition $\operatorname{HD}(\mathbf{x}; F) = \inf_{H \in \mathcal{H}_{\mathbf{x}}} F(H)$ and $\operatorname{HD}_n(\mathbf{x}) = \inf_{H \in \mathcal{H}_{\mathbf{x}}} F_n(H)$.

The class $\mathcal{H}_{\mathbf{x}}$ has finite Vapnik–Chervonenkis dimension at most $d+1$ — it is a sub-class of the VC class of all closed halfspaces, which has VC dimension $d+1$. By the Vapnik–Chervonenkis theorem (Glivenko–Cantelli for classes of finite VC dimension),

$$\sup_{H \in \mathcal{H}_{\mathbf{x}}} |F_n(H) - F(H)| \;\xrightarrow{\text{a.s.}}\; 0.$$

The sup-norm convergence transfers to the infimum: $\bigl|\inf_H F_n(H) - \inf_H F(H)\bigr| \le \sup_H |F_n(H) - F(H)|$, so

$$|\operatorname{HD}_n(\mathbf{x}) - \operatorname{HD}(\mathbf{x}; F)| \;\le\; \sup_{H \in \mathcal{H}_{\mathbf{x}}} |F_n(H) - F(H)| \;\xrightarrow{\text{a.s.}}\; 0.$$

The hyperplane-zero-mass assumption is needed only to handle boundary ties: without it, individual points on the boundary hyperplane $\partial H_{\mathbf{u}, \mathbf{u}^\top \mathbf{x}}$ could destabilize $F_n(H)$ as $n$ grows. Under the assumption, the closed-halfspace mass $F_n(H)$ converges to $F(H)$ uniformly over $H \in \mathcal{H}_{\mathbf{x}}$. $\square$

The same argument gives consistency for projection depth (with the additional assumption that the population MAD is bounded away from $0$), spatial depth (with no atom at $\mathbf{x}$), and Mahalanobis depth (under the finite-second-moment condition its definition already requires).
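
A quick empirical check of the theorem, reusing the tukey_depth_2d sketch from §2.4: at $\mathbf{x} = \mathbf{0}$ under $\mathcal{N}(\mathbf{0}, \mathbf{I}_2)$ the population depth is exactly $1/2$, so the printed values should drift toward $0.5$:

```python
import numpy as np

rng = np.random.default_rng(0)
for n in (50, 200, 1000):
    X = rng.standard_normal((n, 2))              # iid N(0, I_2)
    print(n, tukey_depth_2d(np.zeros(2), X))     # approaches 1/2 (Theorem 3)
```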

4.2 Breakdown of the Tukey median

The Tukey median’s breakdown point — the smallest fraction of contamination that can move it arbitrarily far — gives a quantitative version of the §2.5 contour-stability picture.

Theorem 4 (Donoho–Gasko 1992 breakdown bound).

Let $\mathbf{X}_1, \dots, \mathbf{X}_n$ be in general position in $\mathbb{R}^d$. The finite-sample breakdown point of the Tukey median satisfies

$$\varepsilon^*_n \;\ge\; \frac{1}{d+1}.$$

Proof.

(Sketch.) For any contamination scheme that replaces $k$ original points with arbitrary $\mathbf{Y}_j$, the contaminated empirical halfspace depth at any point $\mathbf{x}$ that was originally deep — say with depth $\alpha_n$ on the clean sample — satisfies

$$\operatorname{HD}_n^{\text{contaminated}}(\mathbf{x}) \;\ge\; \operatorname{HD}_n^{\text{clean}}(\mathbf{x}) - \frac{k}{n}.$$

The maximal clean-sample depth is at most $1/2$ but at least $\lceil n/(d+1) \rceil / n$ — this is the centerpoint theorem, a consequence of Helly’s theorem: every sample in $\mathbb{R}^d$ admits a point of empirical halfspace depth at least $\lceil n/(d+1) \rceil / n$. With $\alpha_n \ge 1/(d+1)$, the bound gives that the deepest point of the contaminated sample has depth at least $1/(d+1) - k/n$, while any point far outside the clean bulk has contaminated depth at most $k/n$; as long as $k/n$ stays below the stated bound, the maximizer is therefore confined to a bounded region. Full details: Donoho and Gasko (1992), Theorem 3.1. $\square$

In $d = 2$ the bound gives $\varepsilon^* \ge 1/3$, comfortably above the $1/n$ breakdown of the sample mean.

Two-panel figure: HD_n consistency on a Gaussian as n grows, and Tukey-median displacement vs sample-mean displacement as contamination ε grows from 0 to 0.45
Left: $\operatorname{HD}_n(\mathbf{0})$ on iid $\mathcal{N}(\mathbf{0}, \mathbf{I}_2)$ converges to the population value $1/2$ as $n$ grows from 50 to 5000 (Theorem 3 in action). Right: contamination at (15, 15) drags the sample mean linearly with $\varepsilon$; the Tukey median displacement stays small until $\varepsilon$ approaches the theoretical $1/3$ bound, then collapses.

4.3 Asymptotic distribution

Pointwise convergence (Theorem 3) is the consistency baseline. The next question is the rate.

Theorem 5 (Massé 2004 — asymptotic normality).

Let $F$ have a continuously differentiable density $f$ on $\mathbb{R}^d$ that is bounded away from zero on every compact set. Fix $\mathbf{x} \in \mathbb{R}^d$ at which the infimum defining $\operatorname{HD}(\mathbf{x}; F)$ is attained at a unique direction $\mathbf{u}_*$. Then

$$\sqrt{n}\bigl(\operatorname{HD}_n(\mathbf{x}) - \operatorname{HD}(\mathbf{x}; F)\bigr) \;\xrightarrow{\;d\;}\; \mathcal{N}\bigl(0,\, \sigma^2(\mathbf{x}, F)\bigr),$$

where $\sigma^2(\mathbf{x}, F) = \operatorname{HD}(\mathbf{x}; F)\bigl(1 - \operatorname{HD}(\mathbf{x}; F)\bigr)$.

Proof.

(Sketch in three steps.)

Step 1 — Hadamard differentiability of the depth functional. Define $\Phi : \mathcal{D} \to \mathbb{R}$ by $\Phi(F) = \inf_{\mathbf{u} \in \mathbb{S}^{d-1}} F(H_{\mathbf{u}, \mathbf{u}^\top \mathbf{x}})$, where $\mathcal{D}$ is the space of distribution functions on $\mathbb{R}^d$ equipped with the Kolmogorov sup-norm. Under the unique-minimizer assumption, the envelope theorem (a finite-dimensional version of Danskin’s theorem applied to the parametrization $\mathbf{u} \mapsto F(H_{\mathbf{u}, \mathbf{u}^\top \mathbf{x}})$) gives that $\Phi$ is Hadamard-differentiable at $F$ in the direction $h$ with derivative

$$D\Phi_F(h) \;=\; h(H_{\mathbf{u}_*, \mathbf{u}_*^\top \mathbf{x}}),$$

where $h$ is interpreted as a signed measure and $h(H_{\mathbf{u}_*, \mathbf{u}_*^\top \mathbf{x}})$ is its mass on the optimal halfspace.

Step 2 — Functional CLT for the empirical process. By Donsker’s theorem applied to the VC class of halfspaces, $\sqrt{n}(F_n - F)$ converges in distribution (in the space of bounded functions on $\mathcal{H}_{\mathbf{x}}$) to a Gaussian process $\mathbb{G}_F$ with covariance kernel $\operatorname{Cov}(\mathbb{G}_F(H_1), \mathbb{G}_F(H_2)) = F(H_1 \cap H_2) - F(H_1)\, F(H_2)$.

Step 3 — Functional delta method. Combining Steps 1 and 2,

$$\sqrt{n}\bigl(\operatorname{HD}_n(\mathbf{x}) - \operatorname{HD}(\mathbf{x}; F)\bigr) \;=\; \sqrt{n}\bigl(\Phi(F_n) - \Phi(F)\bigr) \;\xrightarrow{\;d\;}\; D\Phi_F(\mathbb{G}_F) \;=\; \mathbb{G}_F(H_{\mathbf{u}_*, \mathbf{u}_*^\top \mathbf{x}}).$$

The right-hand side is $\mathcal{N}\bigl(0,\, F(H_*)(1 - F(H_*))\bigr)$ where $H_* = H_{\mathbf{u}_*, \mathbf{u}_*^\top \mathbf{x}}$; recognizing $F(H_*) = \operatorname{HD}(\mathbf{x}; F)$ closes the proof. $\square$

The unique-minimizer assumption is essential: at points where the depth-defining infimum is attained on a whole set of directions (e.g., a centrally symmetric distribution at its center, where every direction attains $1/2$), the Hadamard-differentiability argument fails and the limit distribution becomes the infimum of a Gaussian process over the optimal-direction set.

4.4 Computational complexity

The §2 sweep gave $O(n \log n)$ for one query in 2D. The general-$d$ exact algorithm scales much worse.

Theorem 6 (Aloupis 2006).

Computing $\operatorname{HD}_n(\mathbf{x})$ exactly is NP-hard when the dimension $d$ is part of the input.

The result is stated and not proved here — the proof is a reduction from the maximum-bisection problem and requires a separate combinatorial-geometry treatment. The practical consequence is decisive: exact computation of halfspace depth is restricted to fixed low dimension. The $O(n^{d-1})$ exact algorithm via hyperplane enumeration is feasible for $d \le 3$ (where it is $O(n^2)$) and only marginally feasible for $d = 4$ at moderate $n$. For $d \ge 5$ the practitioner must fall back on the random-direction approximation $\widehat{\operatorname{HD}}_K(\mathbf{x}) = \min_{k=1,\dots,K} F_n(H_{\mathbf{u}_k, \mathbf{u}_k^\top \mathbf{x}})$.
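A sketch of that fallback, assuming numpy (the function name is ours). Any finite set of directions can miss the true minimizing halfspace, so the approximation is always an overestimate of the exact depth:

```python
import numpy as np

def halfspace_depth_approx(x, X, K=500, rng=None):
    """Random-direction approximation of empirical halfspace depth:
    minimize the closed-halfspace fraction over K sampled directions."""
    rng = np.random.default_rng(rng)
    U = rng.standard_normal((K, X.shape[1]))
    U /= np.linalg.norm(U, axis=1, keepdims=True)
    proj = (np.asarray(X, dtype=float) - np.asarray(x, dtype=float)) @ U.T
    # each direction u and its negation -u give the two closed halfspaces
    side = np.minimum((proj >= 0).mean(axis=0), (proj <= 0).mean(axis=0))
    return float(side.min())
```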

4.5 Projection-depth approximation efficacy

How accurate is the random-direction approximation as a function of $K$? The §3.3 analysis gave $\widehat{\operatorname{PD}}_K \to \operatorname{PD}$ with error $O_p(\sqrt{\log K / K})$. The same bound applies to the random-direction approximation of halfspace depth itself, since both are extrema over the sphere. In practice $K = 200$ is sufficient for sub-percent error on moderately concentrated samples.

Log-log plot of per-query runtime vs dimension, showing projection depth scaling linearly while exact halfspace stops at d ≤ 3
Projection depth scales linearly in $K \cdot n$, essentially flat in $d$ for fixed $n$. Exact halfspace runs only at $d \in \{2, 3\}$; the shaded $d \ge 4$ region marks the regime where exact computation is impractical (Theorem 6).

The widget below re-runs the benchmark live. Slide $K$ to see the projection-depth time scale linearly in the number of directions; slide $n$ to see both methods scale at their predicted rates.

Per-query timing in milliseconds. Exact halfspace runs only at d ∈ {2, 3} per the §4.4 NP-hardness result.

Projection depth's roughly flat curve in d (cost is O(K · n), independent of dimension after the projection step) is the punchline of §4.5: in practice, the only depth that scales is the approximate one. Increasing K trades accuracy for cost — the random-direction approximation has Op(√(log K / K)) error.

5. ML applications

The depth machinery of §§2–4 was developed without reference to a specific ML problem, but it slots cleanly into three: nonparametric classification through the DD-plot, multivariate outlier detection that does not assume ellipticity, and analysis of functional data — random samples in which each observation is a curve rather than a vector.

5.1 The DD-classifier

The depth-versus-depth plot, or DD-plot, is the depth-based analogue of a Q–Q plot for two distributions. Given samples $\mathbf{X}_1, \dots, \mathbf{X}_{n_0}$ from a putative class-$0$ distribution $F_0$ and $\mathbf{Y}_1, \dots, \mathbf{Y}_{n_1}$ from a class-$1$ distribution $F_1$, every point $\mathbf{z} \in \mathbb{R}^d$ has two depths:

$$D_0(\mathbf{z}) = D(\mathbf{z}; F_{n_0}) \quad \text{and} \quad D_1(\mathbf{z}) = D(\mathbf{z}; F_{n_1}).$$

Plotting the pair $(D_0(\mathbf{z}), D_1(\mathbf{z})) \in \mathbb{R}^2$ for each training point produces the DD-plot — a low-dimensional summary of how the two classes’ depth structures compare, regardless of the dimension of the original data.

Definition 8 (DD-classifier).

Let $D$ be a depth function. Given training samples for classes $0$ and $1$ with size-balancing weights, the DD-classifier assigns a new query point $\mathbf{z}$ to class $1$ iff

$$D_1(\mathbf{z}) \;>\; D_0(\mathbf{z})$$

(with ties broken by a convention).

The classifier’s decision boundary in the original feature space $\mathbb{R}^d$ corresponds to the diagonal $D_0 = D_1$ in the DD-plot. When the two depth structures are well separated, classes appear as distinct clusters in the DD-plot, and the diagonal is a competent decision rule.
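
The rule is two lines once a depth routine is fixed; a sketch assuming numpy, parametrized by any depth function such as the tukey_depth_2d sketch from §2.4 (both function names are ours):

```python
import numpy as np

def dd_classify(z, X0, X1, depth):
    """DD-classifier: assign z to the class under whose sample it is
    deeper; ties go to class 0 by convention."""
    return int(depth(z, X1) > depth(z, X0))

def dd_plot_coords(Z, X0, X1, depth):
    """(D0, D1) coordinates of the DD-plot for an array of points Z."""
    return np.array([[depth(z, X0), depth(z, X1)] for z in Z])
```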

Theorem 7 (Li, Cuesta-Albertos & Liu 2012).

Suppose the two classes have densities $f_0, f_1$ that are unimodal and elliptically symmetric about distinct centers, with priors $\pi_0 = \pi_1 = 1/2$. The Bayes-optimal classifier $\hat{c}_{\text{Bayes}}(\mathbf{z}) = \mathbb{1}\{f_1(\mathbf{z}) \ge f_0(\mathbf{z})\}$ is equivalent to the DD-classifier iff the densities are depth-calibrated: each $f_i$ is a strictly monotone function of $D(\,\cdot\,; F_i)$.

This depth-calibration condition holds for halfspace depth on Gaussians (where depth is monotone in Mahalanobis distance, hence in density). The implication: when the elliptical-symmetry assumption is reasonable, the DD-classifier is essentially Bayes-optimal without ever fitting a parametric model — only the empirical depth function is needed. The implementation below computes a DD-plot for the harder Iris pair (Versicolor vs. Virginica) using halfspace depth on the petal-length × petal-width subspace.

DD-plot and corresponding decision regions for the Iris Versicolor vs. Virginica classification using Tukey halfspace depth
Left: DD-plot. The diagonal D₀ = D₁ is the decision boundary; the two clusters separate cleanly. Right: feature-space decision regions corresponding to the DD-rule. Training accuracy is 0.960 — competitive with logistic regression on the same features without a parametric model in sight.

The widget below reproduces both views with the live training-accuracy readout. Toggle DD-plot ↔ feature space.

In the DD-plot, the diagonal D₀ = D₁ is the decision boundary. Points above the line are classified as Virginica; points below as Versicolor. The two clusters separate cleanly because the Iris petal subspace is nearly linearly separable.

5.2 Multivariate outlier detection

A multivariate outlier is a point that is far from the bulk of a distribution in a sense that respects the bulk’s shape. Mahalanobis distance is the standard tool, but it commits to ellipticity (§3.1). Depth-based outlier detection inverts the problem: outliers are the points with low depth.

Definition 9 (Depth-based outlier rule).

Fix a depth function $D$ and a threshold $\alpha \in (0, 1/2)$. A point $\mathbf{z}$ is flagged as an outlier with respect to $\mathbf{X}_1, \dots, \mathbf{X}_n$ if

$$D(\mathbf{z}; F_n) \;<\; \alpha.$$

The threshold $\alpha$ tunes the rule’s strictness. The most aggressive choice is $\alpha = 1/n$, which flags only points of zero empirical depth — points lying outside the convex hull of the sample; this catches isolated outliers but misses any contamination cluster dense enough to form its own convex region. A more useful default is a fixed $\alpha$ in the $0.02$–$0.10$ range, calibrated either to a clean reference distribution or to the fraction of training points one is willing to flag.
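
The rule in code (a sketch assuming numpy, again parametrized by any depth routine, e.g. the tukey_depth_2d sketch from §2.4):

```python
import numpy as np

def depth_outliers(Z, X, depth, alpha=0.05):
    """Flag each row of Z whose empirical depth with respect to the
    reference sample X falls below the threshold alpha."""
    return np.array([depth(z, X) < alpha for z in Z])
```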

A subtle but important property of depth-based outlier rules: they resist masking — the failure mode where multiple co-located outliers raise their own neighborhood density, inflate the sample covariance toward the contamination direction, and thereby pull their own Mahalanobis distances below the cutoff. Depth-based rules see through masking, at least partially, because the convex-hull and projection structures that define depth do not get pulled toward dense regions of contamination — a contamination cluster of $k$ points can change the halfspace depth of any point by at most $k/n$, regardless of how the cluster is arranged. The honest caveat: when the contamination cluster is dense enough to form its own convex region of significant volume, depth-based flagging can still miss outliers inside that region (they are deep with respect to the cluster, even if shallow with respect to the bulk), and false positives accumulate in the clean bulk’s outer rim.

Two-panel comparison of halfspace-depth flagging and Mahalanobis flagging on a sample with masking outliers
Halfspace-depth flagging at $\alpha = 0.05$ catches roughly two-thirds of the masking outliers; Mahalanobis at the standard $\chi^2_{2,0.975}$ cutoff catches none, because the masking cluster inflates the sample covariance toward itself. Both methods produce false positives in the clean rim — depth-based methods are not magic, just more honest about what an outlier is.

5.3 Functional depth

A functional dataset is a sample of curves $X_1(t), \dots, X_n(t)$, each a function on a common index set $T$. The natural depth analogue is band depth — defined by how often a curve lies inside the bands generated by other sample curves.

Definition 10 (Band depth (López-Pintado–Romo 2009)).

Let $X_1, \dots, X_n$ be sample curves on $T$. For an integer $J \ge 2$ define the band of $J$ curves $X_{i_1}, \dots, X_{i_J}$ as

$$B(X_{i_1}, \dots, X_{i_J}) = \bigl\{ (t, y) \in T \times \mathbb{R} : \min_j X_{i_j}(t) \le y \le \max_j X_{i_j}(t) \bigr\}.$$

The band depth of a curve $x$ is

$$\operatorname{BD}_n^{(J)}(x) = \binom{n}{J}^{-1} \sum_{\{i_1, \dots, i_J\}} \mathbb{1}\bigl\{ \operatorname{graph}(x) \subseteq B(X_{i_1}, \dots, X_{i_J}) \bigr\}.$$

The modified band depth relaxes the indicator to the fraction of $T$ on which the inclusion holds:

$$\operatorname{MBD}_n^{(J)}(x) = \binom{n}{J}^{-1} \sum_{\{i_1, \dots, i_J\}} \frac{1}{\lambda(T)}\, \lambda\bigl(\{t : \min_j X_{i_j}(t) \le x(t) \le \max_j X_{i_j}(t)\}\bigr),$$

where $\lambda$ denotes Lebesgue measure on $T$.

Band depth and modified band depth share the §1.3 axiomatic structure (with appropriate substitutions for functional spaces) and provide center-outward orderings of curves: the deepest curve is the functional median, the most central curve in the sample.
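
A sketch of modified band depth with $J = 2$ for curves sampled on a common grid, assuming numpy (the function name is ours):

```python
import numpy as np
from itertools import combinations

def modified_band_depth(curves):
    """Modified band depth, J = 2. curves: (n, T) array of curves on a
    common grid. Returns, for each curve, the average over curve pairs
    of the fraction of grid points where it lies inside the pair's
    envelope."""
    n = len(curves)
    depth = np.zeros(n)
    for i, j in combinations(range(n), 2):
        lo = np.minimum(curves[i], curves[j])   # lower envelope of the pair
        hi = np.maximum(curves[i], curves[j])   # upper envelope of the pair
        depth += ((curves >= lo) & (curves <= hi)).mean(axis=1)
    return depth / (n * (n - 1) / 2)
```

The functional median of the sample is then `curves[np.argmax(modified_band_depth(curves))]`.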

Functional depth on noisy sinusoids with three injected outliers (constant, cosine-shape, extreme amplitude)
Modified band depth with $J = 2$ on 30 noisy sinusoids + 3 injected outliers. Left: the functional median (highlighted) tracks the central trajectory; the bottom-10% band catches two of three injected outliers (the constant outlier hides inside the noise envelope). Right: depth values, with outliers (black) clustering at the low end.

5.4 Depth-weighted M-estimation

A natural generalization of the §4.2 Tukey median is depth-weighted M-estimation: instead of using the deepest point, build an estimator that weights all sample points by a function of their depth. For a weight function $w : [0, 1/2] \to \mathbb{R}_+$,

\hat{\boldsymbol{\mu}}_w = \frac{\sum_i w\bigl(\operatorname{HD}_n(\mathbf{X}_i)\bigr)\, \mathbf{X}_i}{\sum_i w\bigl(\operatorname{HD}_n(\mathbf{X}_i)\bigr)}.

For $w(d) = \mathbb{1}\{d \ge \alpha\}$ this is the centroid of the $\alpha$-trimmed Tukey region, generalizing the median. Continuous weight functions yield smooth estimators that retain a controllable breakdown point (Donoho and Gasko 1992, Theorem 4, gives an $\varepsilon^* \ge 1/(d+1)$ bound under a mild monotonicity condition on $w$).
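A minimal sketch of the trimmed case $w(d) = \mathbb{1}\{d \ge \alpha\}$, again relying on the random-direction depth approximation; `approx_halfspace_depth`, `trimmed_tukey_mean`, and the contamination setup are illustrative.

```python
import numpy as np

def approx_halfspace_depth(X, K=500, seed=0):
    """Approximate halfspace depth of each sample point: the minimum
    one-sided empirical mass over K random unit directions."""
    rng = np.random.default_rng(seed)
    U = rng.standard_normal((K, X.shape[1]))
    U /= np.linalg.norm(U, axis=1, keepdims=True)
    P = X @ U.T                                   # (n, K) projections
    above = (P[None, :, :] >= P[:, None, :]).mean(axis=1)
    below = (P[None, :, :] <= P[:, None, :]).mean(axis=1)
    return np.minimum(above, below).min(axis=1)

def trimmed_tukey_mean(X, alpha=0.15, **kw):
    """Centroid of the alpha-trimmed Tukey region: the depth-weighted
    estimator with weight w(d) = 1{d >= alpha}."""
    keep = approx_halfspace_depth(X, **kw) >= alpha
    return X[keep].mean(axis=0)

# 10% contamination near (8, 8); alpha = 0.15 sits above the contamination
# fraction, so the cluster (depth at most 0.10) is trimmed away.
rng = np.random.default_rng(2)
X = np.vstack([rng.standard_normal((180, 2)),
               rng.normal(loc=[8.0, 8.0], scale=0.5, size=(20, 2))])
print("trimmed:", trimmed_tukey_mean(X), " raw mean:", X.mean(axis=0))
```

The raw mean is dragged toward the cluster; the trimmed depth-weighted mean stays near the clean center, at the cost of also discarding the clean sample's shallow rim.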

6. Connections and limits

The depth machinery sits at a junction between the rest of the formalML curriculum and the broader statistical literature on which it is built. This section orients the topic among its neighbors, then lays out the assumptions and computational realities that bound where statistical depth is the right tool and where it isn’t.

6.1 Within formalML

Quantile regression. The §1.1 motivation — that componentwise quantiles fail to give a well-defined multivariate notion of center — is the converse of the quantile regression motivation. Quantile regression handles the conditional multivariate problem (fixing a regressor and asking about the response distribution) by working one quantile level at a time. Statistical depth handles the unconditional multivariate problem (no regressor, just the joint distribution) by collapsing the multidimensional information into a center-outward scalar. The two converge at the multivariate-quantile-regression frontier, where Tukey contours of the residual distribution serve as nonparametric prediction regions.

Conformal prediction. Conformal prediction sets achieve marginal coverage by ranking calibration scores and inverting at the desired quantile. When the score is a multivariate residual, the natural rank is depth rather than componentwise quantile — which gives depth-based conformal prediction: prediction regions whose shapes adapt to the residual distribution’s geometry rather than being committed to coordinate-aligned boxes.
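The procedure is easy to sketch in a split-conformal form: compute the depth of each calibration residual with respect to the calibration residuals, take the empirical $(1 - \mathrm{level})$ quantile of those depths as a threshold, and cover a new residual exactly when its depth clears the threshold. The sketch below assumes the random-direction depth approximation; the finite-sample quantile correction is omitted, and the names `depth_wrt` and `conformal_depth_region` are illustrative.

```python
import numpy as np

def depth_wrt(points, sample, K=200, seed=0):
    """Approximate halfspace depth of `points` w.r.t. `sample`: minimum
    one-sided empirical mass over K random unit directions."""
    rng = np.random.default_rng(seed)
    U = rng.standard_normal((K, sample.shape[1]))
    U /= np.linalg.norm(U, axis=1, keepdims=True)
    PS, PP = sample @ U.T, np.atleast_2d(points) @ U.T
    above = (PS[None, :, :] >= PP[:, None, :]).mean(axis=1)
    below = (PS[None, :, :] <= PP[:, None, :]).mean(axis=1)
    return np.minimum(above, below).min(axis=1)

def conformal_depth_region(resid_cal, level=0.9):
    """Membership test for a depth-based conformal region: a residual is
    covered iff its depth w.r.t. the calibration residuals reaches the
    empirical (1 - level) quantile of the calibration depths.
    (The exact finite-sample rank correction is omitted for brevity.)"""
    thr = np.quantile(depth_wrt(resid_cal, resid_cal), 1.0 - level)
    return lambda r: depth_wrt(r, resid_cal) >= thr

# Correlated calibration residuals: the region hugs the correlation axis
# instead of being a coordinate-aligned box.
rng = np.random.default_rng(3)
cal = rng.multivariate_normal([0, 0], [[1.0, 0.8], [0.8, 1.0]], size=300)
covered = conformal_depth_region(cal, level=0.9)
print(covered([0.5, 0.4]), covered([2.0, -2.0]))   # ~True, ~False
```

The point $(2, -2)$ lies against the correlation direction and is shallow with respect to the calibration residuals, so it falls outside the region even though each of its coordinates is unremarkable marginally; this is the shape-adaptivity the paragraph above describes.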

Extreme value theory. EVT and statistical depth are duals: EVT studies the outermost observations, depth studies the innermost. The two combine in multivariate extreme value theory, where shallow-depth observations identify candidate extremes and depth-based contours bound the central mass against which extremes are measured.

Rank tests. Depth provides the multivariate analogue of the rank ordering induced by the empirical CDF. Hodges-Lehmann-style multivariate rank tests built on depth-induced ranks are distribution-free under the elliptical-symmetry assumption and inherit the §4.1 consistency.

Prediction intervals. Multivariate prediction regions inherit the depth-conformal connection: a depth-based prediction region built from the calibration sample’s residual depths is the multivariate analogue of the standard quantile-based univariate prediction interval.

PCA and low-rank methods. PCA finds a lower-dimensional subspace explaining maximum variance; depth maps each point to a scalar quantifying its centrality. The two combine in projection depth (§3.3): the projection-depth value is the worst-case standardized residual over all 1D projections — a robustified version of the leading PCA loading.

Convex analysis. Theorem 1 (the axiomatic characterization) and Proposition 1 (convexity of contours) live entirely in the language of supporting hyperplanes, halfspaces, and convex hulls. The convex-analysis toolkit is the natural staging ground for the derivations and contour-level-set arguments.

6.2 Cross-site connections

The univariate quantile function on formalstatistics/order-statistics-and-quantiles is the foundational concept that depth generalizes. Empirical-process limit theorems on formalstatistics/empirical-processes — the Glivenko–Cantelli and Donsker theorems for the halfspace VC class — are the structural prerequisites for Theorems 3 and 5 here. The elliptical-symmetry framework on formalstatistics/multivariate-distributions is the assumption under which Theorem 7 (DD-classifier Bayes-optimality) holds. The integrals defining halfspace and band masses live in formalcalculus/multivariable-integration.

6.3 Limits

Sample inefficiency in high dimensions. Statistical depth is a fundamentally geometric construction, and geometry in high $d$ is hostile. The §4.1 consistency rate is dimension-free pointwise, but the uniform rate over depth contours degrades as $O(d/\sqrt{n})$. For practical purposes, depth-based methods are most useful at $d \le 20$. Beyond that, deep-region estimation requires sample sizes that are usually unavailable.

Computational cost. Theorem 6 (Aloupis 2006) is decisive: exact halfspace depth is NP-hard when $d$ is part of the input. The §3.3 projection construction provides a polynomial-time approximation, but the $O_p(\sqrt{\log K / K})$ approximation rate in the number of projections degrades quickly when $K$ must scale with $d$.
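The degradation is easy to see numerically. The sketch below (illustrative; the evaluation point and sample sizes are arbitrary choices) estimates the depth of $(1.5, 0, \dots, 0)$ under a standard Gaussian sample, whose population halfspace depth is $\Phi(-1.5) \approx 0.067$ in every dimension. In $d = 2$ a few hundred random directions suffice; in $d = 50$ even thousands rarely come close to the optimal direction, so the estimate barely moves from its starting value.

```python
import numpy as np

def random_tukey_depth(x, X, K, seed=0):
    """Upper-bound approximation to the halfspace depth of x w.r.t. X:
    the minimum one-sided empirical mass over K random unit directions."""
    rng = np.random.default_rng(seed)
    U = rng.standard_normal((K, X.shape[1]))
    U /= np.linalg.norm(U, axis=1, keepdims=True)
    p = (X - x) @ U.T                              # (n, K) signed projections
    return np.minimum((p >= 0).mean(axis=0), (p <= 0).mean(axis=0)).min()

rng = np.random.default_rng(4)
for d in (2, 10, 50):
    X = rng.standard_normal((1000, d))
    x = np.zeros(d); x[0] = 1.5                    # population depth ~ 0.067 in any d
    ests = [random_tukey_depth(x, X, K) for K in (10, 100, 1000, 5000)]
    print(f"d={d:2d}:", [round(e, 3) for e in ests])
```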

Ellipticity assumption in some applications. Theorem 7 (DD-classifier Bayes-optimality) requires elliptical class densities. When the class densities are non-elliptical, the DD-classifier is no longer Bayes-optimal — though it remains a competent classifier and inherits the §4.1 consistency. Practitioners should not lean on the Bayes-optimality result when the data exhibits multimodality, skewness, or heavy non-elliptical structure.

Non-uniqueness of the halfspace median. The Tukey median is in general a set, not a point; uniqueness requires additional symmetry assumptions on $F$. Implementations report the centroid of the deepest level set (§2.3), but this introduces an arbitrary choice that has no statistical justification when the deepest level set has positive volume.

Functional depth orders but does not metrize. Band depth (§5.3) ranks curves but provides no metric structure on the function space: two equally deep curves can be very different in shape. The comparison metric (e.g., $L^2$ distance) must come from outside the depth machinery.


References & Further Reading