
Conformal Prediction

Distribution-free prediction sets with finite-sample coverage from exchangeability alone

Distribution-Free Prediction and the Exchangeability Setup

We observe $n$ feature–response pairs $(X_1, Y_1), \ldots, (X_n, Y_n)$ and want to say something useful about the next pair $(X_{n+1}, Y_{n+1})$. A point estimate $\hat\mu(X_{n+1})$ answers “what’s the most likely value of $Y_{n+1}$” but leaves “how much should we trust this answer” untouched. A prediction set $\hat C_\alpha(X_{n+1}) \subseteq \mathcal{Y}$ is a data-dependent subset of the response space chosen so that

$$\mathbb{P}\bigl( Y_{n+1} \in \hat C_\alpha(X_{n+1}) \bigr) \;\geq\; 1 - \alpha,$$

where $\alpha \in (0, 1)$ is the miscoverage level and the probability is over the joint distribution of training, calibration, and test data. Conformal prediction is the construction that makes $\hat C_\alpha$ valid in this sense — without distributional assumptions on $(X, Y)$, without bounded-noise hypotheses, and without asymptotics. The only assumption is exchangeability.

Two structural points distinguish a prediction set from a confidence interval. First, a confidence interval covers a fixed parameter (a population mean, a quantile $\xi_p$); a prediction set covers a random variable — the unobserved $Y_{n+1}$. Second, a confidence interval shrinks at rate $1/\sqrt{n}$ as the sample size grows, whereas a prediction set has an irreducible width set by the noise distribution: even with infinite training data, a prediction set for $Y_{n+1} \mid X_{n+1}$ can never be tighter than the conditional spread of $Y$ at that input.

[Figure: side-by-side panels showing a confidence band for the conditional mean (narrow, shrinks with $n$) and a prediction set for the next observation (wider, set by noise scale).]
Confidence intervals shrink at rate $1/\sqrt{n}$ and cover the regression function $\mathbb{E}[Y \mid X = x]$. Prediction sets accommodate the irreducible noise of $Y \mid X$ and have a width floor independent of $n$. The two objects answer different questions.

This contrast — CI for a parameter, prediction set for a random variable — is the same one that formalStatistics: Order Statistics & Quantiles draws when introducing distribution-free CIs for the population quantile $\xi_p$ via paired order statistics. The argument used there relies on rank symmetry under exchangeability, and Topic 29 §29.10 Remark 21 explicitly motivates Vovk–Gammerman–Shafer’s distribution-free prediction as the natural generalization from a fixed parameter to a random variable. We will see this same rank-symmetry machinery in §3’s proof of marginal coverage.

Definition 1 (Exchangeability).

A finite sequence of random variables $Z_1, \ldots, Z_N$ is exchangeable if its joint distribution is invariant under coordinate permutation: for every permutation $\pi$ of $\{1, \ldots, N\}$,

$$(Z_1, \ldots, Z_N) \;\stackrel{d}{=}\; (Z_{\pi(1)}, \ldots, Z_{\pi(N)}).$$

Exchangeability is strictly weaker than iid. Every iid sample is exchangeable — permuting independent draws from a common distribution leaves the joint law unchanged — but the converse fails: an urn drawn without replacement is exchangeable (any permutation of the draws is equally likely as a sequence) yet not iid (later draws depend on earlier ones). The structure of exchangeable sequences is described by de Finetti’s representation theorem: an infinite exchangeable sequence is a mixture of iid sequences, with the mixing measure capturing whatever shared latent structure links the observations. For finite-sample purposes, that subtlety doesn’t bite — what we need is the symmetry property itself.
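The urn example can be made concrete with a few lines of enumeration. This is an illustrative sketch, not from the source: a two-red, two-blue urn drawn without replacement, where every distinct draw sequence has the same probability (exchangeability) while the second draw's distribution depends on the first (not iid).

```python
import itertools
from collections import Counter

# Urn with 2 red (1) and 2 blue (0) balls, drawn without replacement.
balls = [1, 1, 0, 0]
orderings = list(itertools.permutations(balls))  # all 4! = 24 orderings

# Each distinct sequence (e.g. (1,0,1,0)) arises from the same number of
# underlying orderings, so all 6 sequences have equal probability 1/6 --
# permuting a sequence never changes its probability: exchangeability.
probs = {seq: c / len(orderings) for seq, c in Counter(orderings).items()}
assert all(abs(p - 1 / 6) < 1e-12 for p in probs.values())

# Not iid: P(Z2 = red) = 1/2 marginally, but conditioning on the first
# draw being red leaves only 1 red among 3 balls, so P(Z2 = red | Z1 = red) = 1/3.
p_z2_red = sum(p for seq, p in probs.items() if seq[1] == 1)
p_z2_red_given_z1_red = (
    sum(p for seq, p in probs.items() if seq[0] == 1 and seq[1] == 1)
    / sum(p for seq, p in probs.items() if seq[0] == 1)
)
print(p_z2_red, p_z2_red_given_z1_red)  # 0.5 vs 0.333...
```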

The bet conformal prediction makes is that exchangeability holds for the augmented sequence $(X_1, Y_1), \ldots, (X_{n+1}, Y_{n+1})$ — training, calibration, and test points all drawn from the same source. This is genuinely weaker than iid: it accommodates dependence within the dataset (the urn case, time series with stationary increments, certain sampling-without-replacement designs) provided that no observation has a privileged position. What it does not accommodate is distribution shift — a test distribution that differs from the training distribution. We return to this in §8, where we visualize how naive conformal prediction degrades under covariate shift and see how the importance-weighted variant (Tibshirani et al. 2019), a sister technique to the tools of formalStatistics: Empirical Processes, restores marginal coverage when the shift is known.

The asymptotic alternative to conformal prediction is the empirical-processes route: Topic 32 of formalStatistics develops uniform-convergence bounds (DKW, VC dimension, Glivenko–Cantelli) that imply distribution-free coverage in the large-$n$ limit. Conformal prediction sidesteps the uniform-convergence rate entirely — it gives an exact finite-sample statement — at the price that the coverage is marginal rather than conditional (a distinction that becomes the impossibility theorem of §8). Both routes lead to distribution-free prediction; they pay different prices.

Split (Inductive) Conformal Prediction

The simplest realization of the marginal-coverage promise — and the one used in essentially every modern application — is split conformal prediction, sometimes called inductive conformal prediction. The construction is mechanical: split the data into a training set and a calibration set, fit a predictor on the former, score residuals on the latter, take an empirical quantile, declare the prediction interval. Three short paragraphs and a definition cover everything; the depth is in the proof, which lands in §3.

Partition the $n$ available observations into a training set of size $m$ and a calibration set of size $n_{\text{cal}}$:

$$\underbrace{(X_1, Y_1), \ldots, (X_m, Y_m)}_{\text{training}} \quad\text{and}\quad \underbrace{(X_{m+1}, Y_{m+1}), \ldots, (X_{m+n_{\text{cal}}}, Y_{m+n_{\text{cal}}})}_{\text{calibration}}.$$

Then run four steps.

Step 1 — Fit. Train a base predictor $\hat\mu$ using only the training set. The base predictor can be anything — ridge regression, a random forest, a neural network — and the coverage guarantee will not depend on what it is. The quality of $\hat\mu$ controls the width of the prediction interval, not its validity.

Step 2 — Score. Compute a nonconformity score $S_i = s(X_i, Y_i)$ at each calibration point. For regression, the standard choice is the absolute residual $S_i = |Y_i - \hat\mu(X_i)|$; for classification we will use a different score in §7. The score is “large when $(X, Y)$ looks anomalous given the fitted model.”

Step 3 — Threshold. Take the $\hat k = \lceil (1 - \alpha)(n_{\text{cal}} + 1) \rceil$-th smallest calibration score and call this value $\hat q_{1-\alpha}$. (When $\hat k > n_{\text{cal}}$ — which happens for very small calibration sets — cap at $n_{\text{cal}}$; equivalently, set $\hat q_{1-\alpha} = +\infty$ and the interval becomes all of $\mathbb{R}$.)

Step 4 — Predict. For a new test point $X_{n+1}$, the prediction set is the level-$\hat q_{1-\alpha}$ sublevel set of the score:

$$\hat C_\alpha(X_{n+1}) \;=\; \bigl\{ y \in \mathcal{Y} \,:\, s(X_{n+1}, y) \le \hat q_{1-\alpha} \bigr\}.$$

For the absolute-residual score this resolves to a symmetric interval centered on the prediction, $[\hat\mu(X_{n+1}) - \hat q_{1-\alpha},\ \hat\mu(X_{n+1}) + \hat q_{1-\alpha}]$. The interval has the same half-width $\hat q_{1-\alpha}$ at every test point, regardless of $X_{n+1}$ — split conformal with this score is not locally adaptive. We address that in §6 (CQR).

Definition 2 (Split Conformal Prediction Set).

With training-fitted predictor $\hat\mu$, calibration set of size $n_{\text{cal}}$, nonconformity score $s$, and miscoverage level $\alpha \in (0, 1)$, the split conformal prediction set at input $x$ is

$$\hat C_\alpha(x) \;=\; \bigl\{ y \in \mathcal{Y} \,:\, s(x, y) \le \hat q_{1-\alpha} \bigr\},$$

where $\hat q_{1-\alpha}$ is the

$$\hat k \;=\; \bigl\lceil (1 - \alpha)(n_{\text{cal}} + 1) \bigr\rceil$$

-th smallest of the calibration scores $(S_i)_{i=1}^{n_{\text{cal}}}$.

The “$+1$” inside the ceiling deserves a moment. It looks fussy — and it is, in the sense that for moderate $n_{\text{cal}}$ it changes $\hat q_{1-\alpha}$ by at most one rank — but it is also load-bearing. Under exchangeability, the (unobserved) test-point score $S_{n+1}$ will be inserted into the augmented sequence $(S_1, \ldots, S_{n_{\text{cal}}+1})$ at a uniformly random rank in $\{1, \ldots, n_{\text{cal}} + 1\}$. The ceiling-and-plus-one converts that rank uniformity directly into the marginal-coverage probability. Drop the $+1$ and the lower bound can fail; replace the ceiling with a floor and the bound flips to a strict inequality in the wrong direction. We reconstruct this argument in §3.

[Figure: three panels showing one run of split conformal; histogram of calibration scores with the threshold line, prediction band overlaid on training and calibration data, and a test set colored by coverage.]
One run of split conformal on heteroscedastic regression data. Left: calibration scores (absolute residuals between observed and predicted $y$) with the rank-181 threshold at $\alpha = 0.10$, $n_{\text{cal}} = 200$. Middle: the symmetric prediction band (predictor ± threshold) — note the constant width independent of $x$. Right: empirical coverage on a held-out test set (90.0% here, against a target of 90%). The width insensitivity to $x$ is what CQR (§6) fixes.

The construction is so spare that it raises an immediate question: why does this work? Nothing in steps 1–4 used the joint distribution of $(X, Y)$, the smoothness of $\hat\mu$, or any property of the base predictor beyond its having been fit before the calibration scores were computed. The next section gives the answer — Theorem 1 (marginal coverage), proved entirely from rank symmetry under exchangeability.

Marginal Coverage: The Central Theorem

Theorem 1 is the reason split conformal works. It is also the reason the construction is so insensitive to the choice of base predictor and nonconformity score — the proof never opens those black boxes. Everything follows from a single combinatorial fact: under exchangeability, the test-point score is equally likely to land at any rank within the augmented sequence of calibration-plus-test scores. Convert that rank uniformity to a coverage probability, and the bound is finite-sample, two-sided, and tight.

Theorem 1 (Marginal Coverage — Lei, G'Sell, Rinaldo, Tibshirani & Wasserman 2018).

Let $(X_i, Y_i)_{i=1}^{n_{\text{cal}} + 1}$ be exchangeable, where the first $n_{\text{cal}}$ pairs constitute the calibration set and $(X_{n_{\text{cal}}+1}, Y_{n_{\text{cal}}+1})$ is a test point. Suppose the nonconformity score $s$ does not depend on the calibration or test set — for example, it depends only on a separately trained predictor. Then the split conformal prediction set $\hat C_\alpha$ at level $\alpha \in (0, 1)$ satisfies

$$\mathbb{P}\bigl( Y_{n_{\text{cal}}+1} \in \hat C_\alpha(X_{n_{\text{cal}}+1}) \bigr) \;\geq\; 1 - \alpha.$$

Moreover, if the score distribution is continuous (no ties almost surely),

$$\mathbb{P}\bigl( Y_{n_{\text{cal}}+1} \in \hat C_\alpha(X_{n_{\text{cal}}+1}) \bigr) \;\leq\; 1 - \alpha + \frac{1}{n_{\text{cal}} + 1}.$$
Proof.

Let $S_i = s(X_i, Y_i)$ for $i = 1, \ldots, n_{\text{cal}} + 1$. Because $s$ is a fixed function — the training set has been consumed already, and $s$ does not depend on the remaining $n_{\text{cal}} + 1$ points — exchangeability of the underlying pairs lifts to exchangeability of the scores: the joint distribution of $(S_1, \ldots, S_{n_{\text{cal}}+1})$ is invariant under any permutation of indices.

Let $R$ denote the rank of the test-point score $S_{n_{\text{cal}}+1}$ in the augmented vector $(S_1, \ldots, S_{n_{\text{cal}}+1})$, that is,

$$R \;=\; \#\bigl\{ i \in \{1, \ldots, n_{\text{cal}} + 1\} \,:\, S_i \le S_{n_{\text{cal}}+1} \bigr\}.$$

Continuity of the score distribution ensures no ties almost surely, so $R$ is well-defined as an integer in $\{1, \ldots, n_{\text{cal}} + 1\}$. Exchangeability means each of the $n_{\text{cal}} + 1$ positions is equally likely to be the test point’s, so

$$\mathbb{P}(R = j) \;=\; \frac{1}{n_{\text{cal}} + 1} \quad \text{for each } j \in \{1, \ldots, n_{\text{cal}} + 1\}. \tag{$\ast$}$$

In words: $R$ is uniformly distributed on $\{1, \ldots, n_{\text{cal}} + 1\}$. This is the only probabilistic fact the proof needs.

Set $\hat k = \lceil (1 - \alpha)(n_{\text{cal}} + 1) \rceil$. The prediction set covers $Y_{n_{\text{cal}}+1}$ iff the test score lies at or below the threshold:

$$\begin{aligned} Y_{n_{\text{cal}}+1} \in \hat C_\alpha \;&\iff\; S_{n_{\text{cal}}+1} \le \hat q_{1-\alpha} \\ \;&\iff\; S_{n_{\text{cal}}+1} \text{ ranks at most } \hat k \text{ among } (S_1, \ldots, S_{n_{\text{cal}}}). \end{aligned}$$

A short combinatorial step converts the calibration-set rank to the augmented-set rank. If $S_{n_{\text{cal}}+1}$ occupies augmented rank $R$, exactly $R - 1$ of the augmented scores are strictly smaller than it; each of those $R - 1$ smaller scores must lie in the calibration set (the test set contributes only $S_{n_{\text{cal}}+1}$ itself). So $S_{n_{\text{cal}}+1}$ ranks at most $\hat k$ among the calibration scores precisely when $R \le \hat k$:

$$S_{n_{\text{cal}}+1} \text{ ranks at most } \hat k \text{ among calibration scores} \;\iff\; R \le \hat k.$$

Combining with $(\ast)$:

$$\mathbb{P}\bigl( Y_{n_{\text{cal}}+1} \in \hat C_\alpha \bigr) \;=\; \mathbb{P}(R \le \hat k) \;=\; \frac{\hat k}{n_{\text{cal}} + 1}.$$

The lower bound is immediate from the definition of $\hat k$:

$$\hat k \;=\; \lceil (1 - \alpha)(n_{\text{cal}} + 1) \rceil \;\geq\; (1 - \alpha)(n_{\text{cal}} + 1),$$

so $\mathbb{P}(Y_{n_{\text{cal}}+1} \in \hat C_\alpha) \ge 1 - \alpha$. For the upper bound, use $\lceil z \rceil \le z + 1$:

$$\hat k \;\leq\; (1 - \alpha)(n_{\text{cal}} + 1) + 1,$$

so

$$\mathbb{P}\bigl( Y_{n_{\text{cal}}+1} \in \hat C_\alpha \bigr) \;\leq\; (1 - \alpha) + \frac{1}{n_{\text{cal}} + 1}.$$

Remark (Marginal Coverage Proof Independence).

The proof never opens the black box of $\hat\mu$ or invokes a property of the data beyond exchangeability. It does not assume that $Y$ has finite moments, that $X$ lies in a bounded set, or that the marginal distribution is continuous — only that the score distribution is continuous (which holds almost surely whenever $Y \mid X$ has a continuous conditional density, the typical regression setting). The score function $s$ is chosen by the analyst; the only restriction is that $s$ not depend on the calibration or test data — equivalently, that any tunable parts of $s$, including the predictor $\hat\mu$ and its hyperparameters, are fit using only the training set.

The two-sided bound says coverage concentrates in a strip of width $\frac{1}{n_{\text{cal}} + 1}$ above $1 - \alpha$. At $n_{\text{cal}} = 200$ and $\alpha = 0.10$ the strip is $[0.90,\ 0.9050]$ — only $0.5$ percentage points wide. The next visualization runs $T = 500$ Monte Carlo trials at user-controlled $(\alpha, n_{\text{cal}})$ and shows the empirical coverage histogram concentrating inside the predicted band. Drag the calibration-size slider to see the strip narrow as $n_{\text{cal}}$ grows — and notice that even the $n_{\text{cal}} = 20$ regime (where the upper bound is fully $5$ percentage points above $1 - \alpha$) preserves marginal validity in expectation.

[Interactive figure: $T = 500$ trials, $n_{\text{train}} = 300$, $n_{\text{test}}$ per trial $= 20$. Drag $\alpha$ to retarget coverage; switch $n_{\text{cal}}$ to see the upper-bound strip narrow as $1/(n_{\text{cal}}+1)$ shrinks. The middle-panel histogram concentrates inside the orange $[1-\alpha,\ 1-\alpha+1/(n_{\text{cal}}+1)]$ band predicted by Theorem 1; the bottom-panel running mean converges into the same band as $t \to T$. The top panel’s red test points are the rare misses ($\approx \alpha$ fraction).]
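A minimal, non-interactive version of this Monte Carlo check, assuming nothing beyond numpy and exchangeable standard-normal scores, confirms the exact coverage $\hat k/(n_{\text{cal}}+1)$ from the proof:

```python
import numpy as np

# Monte Carlo check of Theorem 1 under continuous scores. With
# exchangeable scores, coverage equals ceil((1-alpha)(n_cal+1))/(n_cal+1)
# exactly, which lies in [1-alpha, 1-alpha + 1/(n_cal+1)].
rng = np.random.default_rng(1)
alpha, n_cal, trials = 0.10, 200, 20_000

# The scores' actual distribution is irrelevant; only exchangeability matters.
S = rng.standard_normal((trials, n_cal + 1))
cal, test = S[:, :-1], S[:, -1]
k = int(np.ceil((1 - alpha) * (n_cal + 1)))      # rank 181 here
q_hat = np.sort(cal, axis=1)[:, k - 1]           # per-trial threshold
coverage = np.mean(test <= q_hat)
print(coverage)  # close to 181/201 = 0.90497...
```

Swapping `standard_normal` for any other continuous distribution leaves the result unchanged, which is the distribution-free claim in miniature.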

The bound is tight, not just a one-sided guarantee — and that tightness is doing real work. A naive reading of “$\ge 1 - \alpha$” might suggest the procedure could over-cover wildly with no statistical penalty, but Theorem 1’s upper bound rules that out: split conformal does not waste coverage. As long as the score distribution is continuous and exchangeability holds, the procedure delivers within $\tfrac{1}{n_{\text{cal}} + 1}$ of nominal — no looser, no tighter.

What the bound does not say is anything about conditional coverage at a specific $X = x$. The marginal probability averages over the distribution of $X$, so a procedure that under-covers in one region and over-covers in another can still satisfy Theorem 1. We return to this in §8 with the Foygel Barber 2021 impossibility theorem.

Cross-Validation in Conformal: CV+

Split conformal pays a calibration tax. The data partition $(m,\ n_{\text{cal}})$ has to allocate enough mass to the calibration set for the threshold to be stable — typically $n_{\text{cal}} \approx n/2$ — which means the predictor $\hat\mu$ trains on only half the available data. For a fixed total budget $n$ this is genuinely wasteful: a better $\hat\mu$ would mean tighter prediction intervals (recall, the width depends on predictor quality even though validity doesn’t).

Cross-validation offers a way out. Partition $\{1, \ldots, n\}$ into $K$ folds of roughly equal size, and for each fold $k$ fit a predictor $\hat\mu_{-k}$ on the data with fold $k$ removed. The leave-one-fold-out residual at observation $i$ is

$$R_i \;=\; \bigl| Y_i - \hat\mu_{-\text{fold}(i)}(X_i) \bigr|,$$

and the CV+ prediction interval uses these residuals — combined with per-fold predictions at the test point — through the same quantile construction we will introduce for jackknife+ in §5. Every observation contributes to both the fit and the residual scoring; no calibration tax.
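The leave-fold-out residual computation can be sketched as follows, assuming numpy and an illustrative least-squares line as the base fit (any learner slots in via the `fit` argument, which is a name introduced here, not from the source):

```python
import numpy as np

def cv_plus_residuals(X, y, K=5, fit=None):
    """Leave-fold-out residuals R_i = |y_i - mu_{-fold(i)}(x_i)|.

    `fit` maps (X_train, y_train) to a prediction function; the default
    is a least-squares line, an illustrative stand-in for any learner.
    """
    if fit is None:
        def fit(Xtr, ytr):
            A = np.column_stack([np.ones_like(Xtr), Xtr])
            beta, *_ = np.linalg.lstsq(A, ytr, rcond=None)
            return lambda Xq: beta[0] + beta[1] * Xq

    n = len(y)
    folds = np.arange(n) % K       # fold(i) assignment
    R = np.empty(n)
    models = {}
    for k in range(K):             # K fits total, not n
        mask = folds != k
        models[k] = fit(X[mask], y[mask])
        # Each observation is scored by the model that never saw it.
        R[~mask] = np.abs(y[~mask] - models[k](X[~mask]))
    return R, models, folds

rng = np.random.default_rng(2)
X = rng.uniform(0, 1, 100)
y = 2 * X + 0.1 * rng.standard_normal(100)
R, models, folds = cv_plus_residuals(X, y, K=5)
print(R.mean())
```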

The trade-off is computational. Jackknife+ ($K = n$) requires $n$ predictor fits; $K$-fold CV+ requires only $K$ fits, typically $K = 5$ or $K = 10$. For expensive base predictors — random forests, neural networks, anything that costs minutes per fit — CV+ is the only practical route. The coverage guarantee is identical: Theorem 2 in §5 establishes the lower bound $\mathbb{P}(Y_{n+1} \in \hat C_\alpha^{\text{CV+}}) \ge 1 - 2\alpha$ for both jackknife+ ($K = n$) and $K$-fold CV+ verbatim. The factor of $2$ is worst-case over base predictors; in practice, both procedures achieve coverage very close to the nominal $1 - \alpha$.

The full procedural definition, the BCRT 2021 tournament/comparison-graph proof, and the empirical comparison to split conformal all live in §5. The anchor here exists so that any sister-site reference to “cross-validation in machine learning” can deep-link to a coherent treatment without minting a separate topic.

Full (Transductive) Conformal Prediction

CV+ avoids the calibration tax by recycling fits across folds. Full conformal — the original Vovk-Gammerman-Shafer construction — avoids it differently: by treating the candidate response itself as the variable and refitting the predictor once per candidate. Every observation is used for both fitting and calibration; the price is a per-candidate refit at test time.

The procedure. To form the prediction set $\hat C_\alpha(X_{n+1})$ at a new test input $X_{n+1}$, iterate over candidate values $y \in \mathcal{Y}$ and for each one run four steps.

Step 1 — Augment. Form the augmented dataset $\mathcal{D}_y = \{(X_i, Y_i)\}_{i=1}^{n} \cup \{(X_{n+1}, y)\}$ — the original $n$ observations plus the test input paired with the candidate $y$.

Step 2 — Fit. Fit a predictor $\hat\mu_y$ on $\mathcal{D}_y$, or equivalently define a score function $s_y$ from $\mathcal{D}_y$.

Step 3 — Score. Compute calibration scores $S_i^{(y)} = s_y(X_i, Y_i)$ for $i = 1, \ldots, n+1$ — every observation, including the augmented test pair.

Step 4 — Include. Include $y$ in $\hat C_\alpha(X_{n+1})$ iff the rank of $S_{n+1}^{(y)}$ among $(S_1^{(y)}, \ldots, S_{n+1}^{(y)})$ is at most $\lceil (1 - \alpha)(n + 1) \rceil$.
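The four steps can be sketched by brute force, assuming numpy, a least-squares-line base predictor, and a discretized candidate grid (all illustrative choices):

```python
import numpy as np

def full_conformal_interval(X, y, x_new, alpha=0.1, grid=None):
    """Full conformal by brute force: one refit per candidate y.

    Base predictor: least-squares line (illustrative). Continuous
    responses must be discretized; `grid` is that assumed discretization.
    """
    if grid is None:
        grid = np.linspace(y.min() - 2 * y.std(), y.max() + 2 * y.std(), 200)
    n = len(y)
    kept = []
    for y_cand in grid:
        # Step 1 -- augment with (x_new, y_cand).
        Xa, ya = np.append(X, x_new), np.append(y, y_cand)
        # Step 2 -- refit on the augmented data.
        A = np.column_stack([np.ones_like(Xa), Xa])
        beta, *_ = np.linalg.lstsq(A, ya, rcond=None)
        # Step 3 -- score all n+1 points (absolute residuals).
        S = np.abs(ya - A @ beta)
        # Step 4 -- keep y_cand iff the test score's rank is small enough.
        if np.sum(S <= S[-1]) <= np.ceil((1 - alpha) * (n + 1)):
            kept.append(y_cand)
    return (min(kept), max(kept)) if kept else (None, None)

rng = np.random.default_rng(3)
X = rng.uniform(0, 1, 50)
y = 1 + 3 * X + 0.2 * rng.standard_normal(50)
lo, hi = full_conformal_interval(X, y, x_new=0.5, alpha=0.1)
print(lo, hi)  # an interval around the true conditional mean 2.5
```

Reporting the kept candidates as a single interval is a simplification; in general the kept set need not be contiguous, though for well-behaved scores it usually is.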

The marginal-coverage guarantee extends from §3 with no change, provided the fitting algorithm treats the $n+1$ augmented points symmetrically. When the candidate $y$ equals the true (unobserved) $Y_{n+1}$, the augmented dataset is exchangeable, and Theorem 1’s argument — uniformity of the test-score rank in $\{1, \ldots, n+1\}$ — applies verbatim. Therefore $\mathbb{P}(Y_{n+1} \in \hat C_\alpha(X_{n+1})) \ge 1 - \alpha$.

Why study it. Full conformal is rarely the practical default but it earns its keep on two counts. First, statistical efficiency: every observation contributes to both the fit and the rank computation, so for small $n$ — say $n \le 100$ — the prediction interval is meaningfully tighter than split conformal at the same $\alpha$. Second, theoretical cleanliness: many extensions of conformal (online conformal under streaming data, conformal under model misspecification, conformal for transductive learning) are easier to state and analyze in the full-conformal framework, then specialized to split.

Computational cost. Naively, full conformal requires one predictor fit per candidate $y$. For finite response spaces — classification with $K$ classes — this is $O(K)$ fits per test point, eminently tractable. For continuous regression, the candidate space must be discretized to a grid, giving $O(\text{grid size})$ fits, which gets expensive but is embarrassingly parallel. For specific predictors the per-candidate refit collapses: ridge regression admits closed-form leave-one-out updates that turn full conformal into a single $O(n)$ pass, and recent work has produced similar closed-form recipes for kernel ridge, lasso, and certain neural network families.

[Figure: line plot of mean prediction interval width at a fixed test point as total dataset size $n$ varies on a log scale, with split conformal (blue) consistently wider than full conformal (green) and the gap closing as $n$ grows.]
Statistical efficiency comparison at a fixed test location, averaged over 60 replicates. Split conformal (blue) splits 50/50 between train and calibration; full conformal (green) uses all $n$ observations for both. The width gap is meaningful at small $n$ (full is ~25% tighter at $n = 40$) and shrinks to noise as $n \to \infty$. Split’s calibration tax is real, but it’s a tax on the small-data regime.

For most practical purposes — modern ML with $n$ in the thousands and base predictors that take seconds to fit — split conformal is the right default, and the rest of this topic uses it. Full conformal sits in the toolkit for the small-$n$ regime, the closed-form-update predictors, and the theoretical extensions that are easier to derive there. Jackknife+ and CV+ in §5 carve out a middle ground: leave-one-out residuals through a small number of refits, with the same exchangeability machinery delivering a (slightly weaker) coverage bound.

Jackknife+ and CV+

Split conformal pays a calibration tax; full conformal pays a per-candidate refit tax. Jackknife+ (Foygel Barber, Candès, Ramdas & Tibshirani 2021) charts a third route: use every observation for both fitting and residual scoring through leave-one-out refitting. Each calibration residual is computed against a predictor that did not see that observation, restoring statistical efficiency without the per-candidate explosion of full conformal. The price is mild — $n$ predictor fits (one per left-out observation), parallelizable — and the coverage guarantee weakens by a factor of two.

For each $i \in \{1, \ldots, n\}$, let $\hat\mu_{-i}$ denote the predictor fit on the dataset with the $i$-th observation removed, and define the leave-one-out residual

$$R_i \;=\; \bigl| Y_i - \hat\mu_{-i}(X_i) \bigr|.$$

This $R_i$ is an honest residual in the sense that it never used $(X_i, Y_i)$ in its training — exactly the property split conformal extracted by partitioning the data, but achieved here without sacrificing any sample.

Definition 3 (Jackknife+ Prediction Interval).

The jackknife+ prediction interval at test point $X_{n+1}$, miscoverage level $\alpha \in (0, 1)$, is

$$\hat C_\alpha^{\text{J+}}(X_{n+1}) \;=\; \Bigl[\, \hat Q_\alpha^{-}\bigl\{ \hat\mu_{-i}(X_{n+1}) - R_i \bigr\},\ \hat Q_{1-\alpha}^{+}\bigl\{ \hat\mu_{-i}(X_{n+1}) + R_i \bigr\} \,\Bigr],$$

where $\hat Q_\alpha^{-}$ is the $\lfloor \alpha (n+1) \rfloor$-th smallest of a collection of $n$ values (with $\hat Q_0^{-} = -\infty$) and $\hat Q_{1-\alpha}^{+}$ is the $\lceil (1-\alpha)(n+1) \rceil$-th smallest (with $\hat Q_1^{+} = +\infty$).

The endpoints look more elaborate than the symmetric split-conformal interval, but each is just an empirical quantile of $n$ leave-one-out predictions adjusted by their respective residuals. The lower endpoint asks “across all leave-one-out fits, what is a low quantile of $\hat\mu_{-i}(X_{n+1}) - R_i$?” — a low-side estimate that accommodates both predictor variability across LOO fits and the residual magnitudes. The upper endpoint is symmetric. Crucially, the $i$-th LOO predictor’s variability and its residual move together: a fold whose held-out point is hard to predict produces both a wild $\hat\mu_{-i}(X_{n+1})$ and a large $R_i$, and the construction lets these self-correct.
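Definition 3 translates directly to code. The sketch below assumes numpy and an illustrative least-squares-line base predictor:

```python
import numpy as np

def jackknife_plus(X, y, X_test, alpha=0.1):
    """Jackknife+ interval per Definition 3 (illustrative OLS base fit).

    n leave-one-out fits; the endpoints are order statistics of
    mu_{-i}(x) - R_i and mu_{-i}(x) + R_i at the floor/ceil ranks.
    """
    n = len(y)
    lo_vals = np.empty((n, len(X_test)))
    hi_vals = np.empty((n, len(X_test)))
    for i in range(n):
        mask = np.arange(n) != i
        A = np.column_stack([np.ones(n - 1), X[mask]])
        beta, *_ = np.linalg.lstsq(A, y[mask], rcond=None)
        mu_i = lambda Xq: beta[0] + beta[1] * Xq
        R_i = abs(y[i] - mu_i(X[i]))            # honest LOO residual
        lo_vals[i] = mu_i(X_test) - R_i
        hi_vals[i] = mu_i(X_test) + R_i
    # floor(alpha(n+1))-th smallest of the low values (-inf at rank 0)
    # and ceil((1-alpha)(n+1))-th smallest of the high values.
    k_lo = int(np.floor(alpha * (n + 1)))
    k_hi = int(np.ceil((1 - alpha) * (n + 1)))
    lo = np.sort(lo_vals, axis=0)[k_lo - 1] if k_lo >= 1 else -np.inf
    hi = np.sort(hi_vals, axis=0)[min(k_hi, n) - 1]
    return lo, hi

rng = np.random.default_rng(4)
X = rng.uniform(0, 1, 80)
y = 2 * X + 0.3 * rng.standard_normal(80)
lo, hi = jackknife_plus(X, y, np.array([0.25, 0.75]), alpha=0.1)
print(lo, hi)
```

The $n$ fits in the loop are independent of each other and parallelize trivially, which is the "price is mild" claim in practice.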

Theorem 2 (Jackknife+ Coverage Bound — Foygel Barber, Candès, Ramdas & Tibshirani 2021).

Under exchangeability of $(X_i, Y_i)_{i=1}^{n+1}$,

$$\mathbb{P}\bigl( Y_{n+1} \in \hat C_\alpha^{\text{J+}}(X_{n+1}) \bigr) \;\geq\; 1 - 2\alpha.$$

The constant $2$ cannot be improved without further structural assumptions on the predictor: BCRT 2021 exhibit a degenerate base predictor under which jackknife+ achieves coverage arbitrarily close to $1 - 2\alpha$.

Proof.

The proof follows BCRT 2021 §3.2; we reconstruct the key combinatorial moves and direct the reader to the original for the formalization of the tournament step.

For each $i \in \{1, \ldots, n\}$, define the leave-one-out swap residual

$$R_i^{\ast} \;=\; \bigl| Y_{n+1} - \hat\mu_{-i}(X_{n+1}) \bigr|,$$

the absolute residual we would have observed if the test point $(X_{n+1}, Y_{n+1})$ had taken the role of training point $i$ and $(X_i, Y_i)$ had taken the role of the test point. By exchangeability of the $n+1$ observations together with the fact that the predictor $\hat\mu_{-i}$ is a symmetric function of its inputs (the order of the training observations does not matter), the joint distribution of the residuals is invariant under any swap of indices $(i, n+1)$. In particular, $R_i$ and $R_i^{\ast}$ have the same marginal distribution for each $i$.

Consider the comparison graph $G$ on vertex set $\{1, \ldots, n+1\}$ defined by placing a directed edge $i \to j$ whenever the leave-$\{i,j\}$-out residual at $i$ exceeds the leave-$\{i,j\}$-out residual at $j$ (BCRT 2021 §3.2). For the swap-symmetry argument, what matters is the pairwise comparison between each training index $i \in \{1, \ldots, n\}$ and the test index $n+1$. The exchangeability swap above shows the comparison is symmetric in $i \leftrightarrow n+1$, so the in-degree of vertex $n+1$ in this graph is uniformly distributed on $\{0, 1, \ldots, n\}$.

The “bad event” for jackknife+ — the test point falling outside the prediction interval — splits into two failure modes:

$$\bigl\{ Y_{n+1} < \text{lower endpoint} \bigr\} \quad\text{and}\quad \bigl\{ Y_{n+1} > \text{upper endpoint} \bigr\}.$$

BCRT 2021 Lemma 1 establishes that each of these failure modes corresponds to at most an $\alpha$ fraction of comparison-graph configurations. Concretely: the lower endpoint is the $\lfloor \alpha(n+1) \rfloor$-th smallest of $\{\hat\mu_{-i}(X_{n+1}) - R_i\}$, and a counting argument over the comparison graph shows

$$\mathbb{P}\bigl( Y_{n+1} < \text{lower endpoint} \bigr) \;\leq\; \alpha,$$

with the symmetric statement for the upper endpoint. The argument is non-trivial because the residuals $R_i$ are not independent under the swap — they share the LOO predictor structure — but the tournament-style counting in BCRT 2021 Lemma 1 handles this dependence by showing the bad events at indices $i$ and $j$ cannot both occur for too many pairs simultaneously.

Combining the two bounds via a union bound:

$$\mathbb{P}\bigl( Y_{n+1} \notin \hat C_\alpha^{\text{J+}}(X_{n+1}) \bigr) \;\leq\; \alpha + \alpha \;=\; 2\alpha,$$

equivalently $\mathbb{P}(Y_{n+1} \in \hat C_\alpha^{\text{J+}}) \ge 1 - 2\alpha$.

Two ingredients carry the argument: (a) rank symmetry under exchangeability combined with the symmetry of $\hat\mu$ in its inputs, and (b) the union bound over the two endpoint-failure modes. The full formalization of step (a) — the comparison-graph counting that resolves the dependence between $R_i$ and $R_j$ — is BCRT 2021 Theorem 1, which we have not reproduced here.

Remark (Jackknife+ Worst-Case Tightness).

The factor of $2$ in $1 - 2\alpha$ is worst-case over base predictors. BCRT 2021 §4 constructs an adversarial base predictor — essentially a constant function on most of the input space with a single sharp anomaly — that drives jackknife+ to coverage arbitrarily close to $1 - 2\alpha$ from above. For reasonable base predictors (ridge, random forest, neural networks with standard regularization) the empirical coverage of jackknife+ is typically within a percentage point or two of the nominal $1 - \alpha$ — much closer to split conformal’s tight $1 - \alpha + 1/(n_{\text{cal}}+1)$ than the worst-case $1 - 2\alpha$ would suggest. The factor-of-two penalty is the theoretical price for refusing the calibration-set partition; the empirical price is usually negligible.

CV+ (the $K$-fold extension introduced in §3.8) inherits Theorem 2 verbatim: $\mathbb{P}(Y_{n+1} \in \hat C_\alpha^{\text{CV+}}) \ge 1 - 2\alpha$ for any $K \ge 2$. The proof is identical — the swap-symmetry and union-bound moves work for any leave-fold-out predictor structure, not just leave-one-out. The trade-off is the one we already saw: $K$ fits instead of $n$, with the empirical coverage typically a hair worse than jackknife+ at small $K$ (more residual variance) and effectively identical at $K = n$ (which recovers jackknife+).

[Figure: three panels — leave-one-out residuals on a small dataset, empirical coverage across split conformal, jackknife+, and CV+, and interval-width box plots — at $n = 60$, $\alpha = 0.10$, $T = 300$ trials.]
Left: leave-one-out residuals $R_i$ shown as red segments at three highlighted observations — each is the gap between the fully-fit prediction and the LOO-fit prediction at that point. Middle: empirical coverage. Split conformal lands at the marginal target; jackknife+ and CV+ land essentially at target too — well above the worst-case bound $1 - 2\alpha = 0.80$. Right: interval-width distributions. Jackknife+ and CV+ produce slightly tighter intervals than split conformal at small $n$ by avoiding the calibration tax.

The empirical takeaway visible in the figure: at n=60n = 60 — small enough that the calibration tax matters — jackknife+ and CV+ produce tighter intervals than split conformal at the same α\alpha, and the BCRT 12α1 - 2\alpha bound is loose by a wide margin. The choice between the three procedures is essentially a budget question. With abundant data and an expensive base predictor, split conformal is the workhorse. With a moderate nn and a cheap-to-refit predictor, jackknife+ extracts more from each observation. With expensive predictors but enough data to support fold-level fits, CV+ is the practical compromise.

What none of the three procedures achieve — and what §6 addresses — is locally adaptive width. Every procedure in §2–§5 produces a band whose half-width is a single global threshold; the band is wide where it doesn’t need to be (low-noise regions) and narrow where it shouldn’t be (high-noise regions). Conformalized quantile regression (CQR) fixes this by replacing the mean-residual nonconformity score with one built from quantile-regression endpoints.

Conformalized Quantile Regression (CQR)

The split-conformal interval [μ^(Xn+1)q^1α, μ^(Xn+1)+q^1α][\hat\mu(X_{n+1}) - \hat q_{1-\alpha},\ \hat\mu(X_{n+1}) + \hat q_{1-\alpha}] has the same half-width q^1α\hat q_{1-\alpha} everywhere. On homoscedastic data this is fine — the conditional spread of YXY \mid X is constant by hypothesis. On heteroscedastic data it is wasteful where the noise is small and over-confident where the noise is large: the band over-covers in low-noise regions and under-covers in high-noise regions. The marginal coverage guarantee from Theorem 1 still holds — it is an average over the input distribution — but the conditional coverage at any specific xx can drift far from 1α1 - \alpha in either direction.

Conformalized quantile regression (Romano, Patterson & Candès 2019) keeps the conformal envelope but swaps the base learner. Instead of fitting a single mean predictor μ^\hat\mu, fit two quantile regressions — one estimating the lower conditional quantile q^α/2(x)\hat q_{\alpha/2}(x) of YX=xY \mid X = x, the other estimating the upper conditional quantile q^1α/2(x)\hat q_{1-\alpha/2}(x). The conditional QR endpoints already capture heteroscedasticity by design (they are wider where YXY \mid X is more dispersed); the conformal layer wraps them with a calibration step that preserves marginal coverage exactly. CQR thus inherits the distribution-free guarantee of split conformal and the locally adaptive width of quantile regression — the best of both worlds, with the only assumption still exchangeability.

Quantile regression itself is a topic for another day. For our purposes, treat q^α/2(x)\hat q_{\alpha/2}(x) and q^1α/2(x)\hat q_{1-\alpha/2}(x) as black-box functions that take training data and return estimated conditional quantiles — fitted by check-loss minimization, possibly with regularization. Quantile Regression covers the estimation theory, asymptotic distribution, and broader applications.

Definition 4 (CQR Prediction Set).

Let q^α/2(x)\hat q_{\alpha/2}(x) and q^1α/2(x)\hat q_{1-\alpha/2}(x) be quantile-regression estimates of the conditional α/2\alpha/2 and 1α/21 - \alpha/2 quantiles of YX=xY \mid X = x, fit on the training set. For each calibration point ii, define the CQR nonconformity score

Ei  =  max{q^α/2(Xi)Yi, Yiq^1α/2(Xi)}.E_i \;=\; \max\bigl\{\, \hat q_{\alpha/2}(X_i) - Y_i,\ Y_i - \hat q_{1-\alpha/2}(X_i) \,\bigr\}.

The score EiE_i is positive when YiY_i lies outside the QR interval [q^α/2(Xi), q^1α/2(Xi)][\hat q_{\alpha/2}(X_i),\ \hat q_{1-\alpha/2}(X_i)] — measuring how far outside, in the direction of the violation — and negative when YiY_i lies inside. Let Q^1α\hat Q_{1-\alpha} denote the (1α)(ncal+1)\lceil (1-\alpha)(n_{\text{cal}} + 1) \rceil-th smallest of (Ei)i=1ncal(E_i)_{i=1}^{n_{\text{cal}}}. The CQR prediction set at test input xx is

C^αCQR(x)  =  [q^α/2(x)Q^1α, q^1α/2(x)+Q^1α].\hat C_\alpha^{\text{CQR}}(x) \;=\; \bigl[\, \hat q_{\alpha/2}(x) - \hat Q_{1-\alpha},\ \hat q_{1-\alpha/2}(x) + \hat Q_{1-\alpha} \,\bigr].

The construction is mechanically simple: shift the lower QR endpoint down by Q^1α\hat Q_{1-\alpha}, shift the upper up by the same amount. If the QR fits are well-calibrated and most YiY_i already fall inside their intervals, then EiE_i is negative for most ii, the empirical quantile Q^1α\hat Q_{1-\alpha} is small or even negative, and CQR produces an interval narrower than the QR interval itself (a tighter band, achievable because the calibration step “credits back” the QR’s natural slack). If the QR fits are systematically narrow or wrong, EiE_i is positive for many ii, Q^1α\hat Q_{1-\alpha} is large and positive, and the CQR interval inflates the QR endpoints to recover the marginal-coverage guarantee. Either way, the band width is allowed to vary with xx — which is the locally adaptive property naive split conformal lacks.
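The mechanics of Definition 4 fit in a few lines. Here is a NumPy sketch in which the two quantile estimators are opaque callables fit on a separate training split; the names `q_lo_fn` and `q_hi_fn` are illustrative, not from any library:

```python
import numpy as np

def cqr_interval(q_lo_fn, q_hi_fn, X_cal, y_cal, X_test, alpha=0.1):
    """CQR: calibrate the two QR endpoints with one shared shift Q_hat."""
    lo, hi = q_lo_fn(X_cal), q_hi_fn(X_cal)
    # E_i = max{q_lo(X_i) - Y_i, Y_i - q_hi(X_i)}: negative inside the
    # QR interval, positive by the violation amount outside it
    E = np.maximum(lo - y_cal, y_cal - hi)
    n = len(y_cal)
    k = int(np.ceil((1 - alpha) * (n + 1)))
    Q_hat = np.sort(E)[min(k, n) - 1]   # may be negative: the band shrinks
    return q_lo_fn(X_test) - Q_hat, q_hi_fn(X_test) + Q_hat
```

The sign of `Q_hat` is exactly the "credit back" behavior described above: well-calibrated QR fits yield mostly negative scores and a band narrower than the raw QR interval.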


Heteroscedasticity: σ(x) = 0.3 + h · |x|. At h = 0 the noise is constant and both bands are equivalent. As h grows, naive split-conformal's constant-width band over-covers near x = 0 and under-covers in the tails (visible in the right panel as a U-shaped dip below 1 − α). CQR's per-x quantile estimates track the noise envelope, keeping the right-panel curve much flatter — approximate conditional coverage as a side benefit of locally adaptive width.

Theorem 3 (CQR Coverage Inheritance — Romano, Patterson & Candès 2019).

Under exchangeability of (Xi,Yi)i=1ncal+1(X_i, Y_i)_{i=1}^{n_{\text{cal}}+1},

P(Yncal+1C^αCQR(Xncal+1))    1α.\mathbb{P}\bigl( Y_{n_{\text{cal}}+1} \in \hat C_\alpha^{\text{CQR}}(X_{n_{\text{cal}}+1}) \bigr) \;\geq\; 1 - \alpha.
Proof.

CQR is split conformal applied to the nonconformity score

s(x,y)  =  max{q^α/2(x)y, yq^1α/2(x)}.s(x, y) \;=\; \max\bigl\{\, \hat q_{\alpha/2}(x) - y,\ y - \hat q_{1-\alpha/2}(x) \,\bigr\}.

This score depends on the training set (through the fitted quantile regressions q^α/2\hat q_{\alpha/2} and q^1α/2\hat q_{1-\alpha/2}) but not on the calibration or test data — exactly the hypothesis of Theorem 1. The marginal-coverage statement therefore applies verbatim, giving P(Yncal+1C^αCQR)1α\mathbb{P}(Y_{n_{\text{cal}}+1} \in \hat C_\alpha^{\text{CQR}}) \ge 1 - \alpha. The two-sided refinement from Theorem 1 also extends: if the score distribution is continuous, coverage is bounded above by 1α+1/(ncal+1)1 - \alpha + 1/(n_{\text{cal}}+1).

Remark (CQR's Novelty Is the Score, Not the Theorem).

The proof of Theorem 3 is one paragraph because CQR is a special case of Theorem 1. The novelty of CQR is not a new probabilistic argument — it is the choice of score function. Naive split conformal with s(x,y)=yμ^(x)s(x, y) = |y - \hat\mu(x)| produces an interval of constant half-width q^1α\hat q_{1-\alpha} centered on μ^(x)\hat\mu(x). CQR replaces this with an interval whose endpoints are themselves regressed on xx: the lower endpoint is q^α/2(x)\hat q_{\alpha/2}(x) shifted down by Q^1α\hat Q_{1-\alpha}, the upper is q^1α/2(x)\hat q_{1-\alpha/2}(x) shifted up. Because the QR estimates capture heteroscedasticity at the base-learner level, the conformal envelope inherits a locally adaptive width without any additional machinery. The lesson generalizes: any time you want a different shape of prediction interval, design the nonconformity score to extract that shape, and let Theorem 1 deliver the coverage guarantee for free.

Remark (Approximate Conditional Coverage).

CQR does not achieve exact conditional coverage — that is impossible for any distribution-free procedure with finite, informative prediction sets, as we will show in §8. It does achieve an empirically much flatter conditional-coverage curve than naive split conformal, in the sense that the gap P(YC^(X)X=x)(1α)|\mathbb{P}(Y \in \hat C(X) \mid X = x) - (1 - \alpha)| is small on average over xx when the QR base learner is well-specified. The visualization above makes this precise on a heteroscedastic example: naive split conformal under-covers in the high-noise tails by 5–10 percentage points, while CQR holds within 2–3 percentage points across the input range. The improvement is empirical, not theoretical, and it depends on the QR fits being good enough to capture the conditional spread — which is itself a hard estimation problem for high-dimensional or low-data settings.

CQR is the practical default whenever the data is visibly heteroscedastic and computational budget allows fitting two quantile regressions. It is also the construction that points furthest in the direction of conditional coverage — a property that motivates the impossibility theorem of §8 and the importance-weighted extensions for known covariate shift. Before that, §7 takes the conformal envelope into a different setting entirely: classification, where the response is discrete and “interval” gets replaced by prediction set.

Adaptive Prediction Sets (APS) for Classification

Everything so far has been regression. Conformal prediction adapts to classification with a single change: replace the absolute-residual nonconformity score with one built on top of the classifier’s softmax probabilities. The output is no longer an interval — it is a set of predicted classes, sized to match the model’s uncertainty at the test input. Adaptive Prediction Sets (APS), due to Romano, Sesia & Candès (2020), is the canonical construction. Like CQR, it is split conformal applied to a different score, so Theorem 1 delivers the marginal coverage guarantee for free.

Setup. Let Y={1,,K}\mathcal{Y} = \{1, \ldots, K\} be the discrete response space. A classifier produces softmax probabilities π^(cx)[0,1]\hat\pi(c \mid x) \in [0, 1] with cπ^(cx)=1\sum_c \hat\pi(c \mid x) = 1. Order the classes at input xx from highest to lowest probability:

c(1)(x), c(2)(x), , c(K)(x),π^(c(1)x)π^(c(2)x)π^(c(K)x).c_{(1)}(x),\ c_{(2)}(x),\ \ldots,\ c_{(K)}(x), \quad \hat\pi(c_{(1)} \mid x) \geq \hat\pi(c_{(2)} \mid x) \geq \cdots \geq \hat\pi(c_{(K)} \mid x).

For any class yy, write ρ(y;x)\rho(y; x) for its rank in this descending ordering — so the most-probable class has rank 1, the second-most has rank 2, and so on.

Definition 5 (APS Score and Prediction Set, Deterministic Variant).

The APS nonconformity score at a labeled point (x,y)(x, y) is the cumulative mass of all classes ranked strictly above yy:

s(x,y)  =  j=1ρ(y;x)1π^(c(j)(x)x).s(x, y) \;=\; \sum_{j=1}^{\rho(y; x) - 1} \hat\pi\bigl( c_{(j)}(x) \mid x \bigr).

The top-predicted class has score 00 (no classes above it); the second has score π^(c(1)x)\hat\pi(c_{(1)} \mid x); and so on through the ordering. Given calibration scores Si=s(Xi,Yi)S_i = s(X_i, Y_i) for i=1,,ncali = 1, \ldots, n_{\text{cal}}, let q^1α\hat q_{1-\alpha} denote the (1α)(ncal+1)\lceil (1 - \alpha)(n_{\text{cal}} + 1) \rceil-th smallest. The deterministic APS prediction set at a new test input xx is

C^αAPS(x)  =  {cY:s(x,c)q^1α}.\hat C_\alpha^{\text{APS}}(x) \;=\; \bigl\{\, c \in \mathcal{Y} \,:\, s(x, c) \le \hat q_{1-\alpha} \,\bigr\}.

Equivalently: include classes from highest probability downward until the cumulative mass of the strictly-higher classes exceeds the threshold.
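A NumPy sketch of the deterministic variant, vectorized over a batch (assuming each row of `probs` already sums to 1; the function names are ours, not a library's):

```python
import numpy as np

def aps_score(probs, y):
    """Deterministic APS score: cumulative mass of the classes ranked
    strictly above the label y. probs: (n, K); y: (n,) integer labels."""
    order = np.argsort(-probs, axis=1)                # descending rank order
    sorted_p = np.take_along_axis(probs, order, axis=1)
    cum_above = np.cumsum(sorted_p, axis=1) - sorted_p  # mass strictly above
    rank = np.argmax(order == y[:, None], axis=1)       # rank of true label
    return cum_above[np.arange(len(y)), rank]

def aps_sets(probs_cal, y_cal, probs_test, alpha=0.1):
    S = aps_score(probs_cal, y_cal)
    n = len(S)
    k = int(np.ceil((1 - alpha) * (n + 1)))
    qhat = np.sort(S)[min(k, n) - 1]
    order = np.argsort(-probs_test, axis=1)
    sorted_p = np.take_along_axis(probs_test, order, axis=1)
    cum_above = np.cumsum(sorted_p, axis=1) - sorted_p
    keep = cum_above <= qhat    # prefix of the ranking; rank 1 always kept
    return [set(order[i, keep[i]]) for i in range(len(probs_test))]
```

Because `cum_above` is nondecreasing along each row, the kept classes always form a prefix of the probability ordering, and the top class (score 0) is always included.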

Three Gaussian blobs at fixed centers; σ controls within-class spread (smaller = more separable, larger = more overlap). At small σ the classifier is confident almost everywhere and APS produces singletons; at large σ the decision boundaries get thick and the set sizes grow to 2 or 3 there. Marginal coverage tracks 1 − α regardless of σ; per-set-size coverage stays close to nominal — APS is not over-extracting coverage from any one set-size bucket.

The deterministic APS variant always includes the top-predicted class (its score is 0q^1α0 \le \hat q_{1-\alpha} for any non-trivial threshold), so C^αAPS\hat C_\alpha^{\text{APS}} is never empty. Romano-Sesia-Candès also describe a randomized variant that achieves exactly 1α1 - \alpha coverage by allowing the borderline class to be included with a probability tuned to the calibration set; the deterministic version overcovers by at most 1ncal+1\frac{1}{n_{\text{cal}} + 1} — the same discretization slack we saw in Theorem 1’s upper bound — and is the version implemented in libraries like mapie.

Coverage. APS is split conformal with the score above. The score depends only on the training-fitted classifier (and therefore not on the calibration or test data), so Theorem 1 applies verbatim and gives

P(Yncal+1C^αAPS(Xncal+1))    1α,\mathbb{P}\bigl( Y_{n_{\text{cal}}+1} \in \hat C_\alpha^{\text{APS}}(X_{n_{\text{cal}}+1}) \bigr) \;\geq\; 1 - \alpha,

with the same two-sided refinement 1α+1/(ncal+1)\le 1 - \alpha + 1/(n_{\text{cal}}+1) when the score distribution is continuous. (At any fixed input the APS score takes at most KK distinct values, so ties among calibration scores are possible; the lower bound is unaffected by ties, and breaking ties at random recovers the upper bound.)

Why APS, not top-K? A naive alternative is “include the smallest set of classes whose cumulative softmax mass exceeds 1α1 - \alpha.” This is the top-KK procedure where KK varies per input. It works when the classifier’s softmax is well-calibrated — meaning the predicted probabilities match the true conditional probabilities P(Y=cX=x)\mathbb{P}(Y = c \mid X = x) — but degrades sharply when softmax is miscalibrated, which is the typical case for modern neural networks. APS is the calibrated alternative: the threshold q^1α\hat q_{1-\alpha} is determined empirically from the calibration set rather than implicitly through the softmax magnitudes, so coverage is decoupled from softmax-calibration assumptions. Whatever the classifier’s calibration quality, APS recovers the marginal-coverage guarantee.

Adaptivity. The set size at any test input is exactly the number of classes cc with s(x,c)q^1αs(x, c) \le \hat q_{1-\alpha} — equivalently, the smallest KK^\ast such that j=1K1π^(c(j)x)>q^1α\sum_{j=1}^{K^\ast - 1} \hat\pi(c_{(j)} \mid x) > \hat q_{1-\alpha}. Where the classifier is confident (π^(c(1))1\hat\pi(c_{(1)}) \approx 1), the second class already has score 1>q^1α\approx 1 > \hat q_{1-\alpha} (typically), so the set is a singleton. Where the classifier is genuinely uncertain across multiple classes, the top score grows slowly and several classes pass the threshold, producing a multi-class set. Romano-Sesia-Candès 2020 prove that APS achieves approximate conditional coverage when softmax is well-calibrated — the average gap P(YC^(X)X=x)(1α)|\mathbb{P}(Y \in \hat C(X) \mid X = x) - (1 - \alpha)| is small — though, as with CQR, exact conditional coverage is impossible by the result we turn to next.

The figure above is worth pausing on. The middle panel — the region map colored by set size — is the most direct visualization of what APS is doing. Inside each class blob the set is a singleton (green); along the decision boundaries between blobs, where two classes are competitive, the set grows to two (amber); in the central region where all three classes are roughly equally probable, the set grows to three (red, when present). The bar chart on the right confirms the marginal coverage and shows that per-set-size coverage is also close to 1α1 - \alpha — the procedure is not gaming the average by extracting all its coverage from one set-size bucket.

APS extends to multi-label classification, ordinal classification, and hierarchical class structures — each by appropriate redesign of the nonconformity score. The pattern from §6 holds: the score does the work, and Theorem 1 delivers the guarantee. Where the difficulty really lies is the conditional coverage problem: APS approximates it, but it cannot be achieved exactly without distributional assumptions. The next section makes that precise.

Conditional Coverage: Impossibility and Approximations

Theorem 1 guarantees marginal coverage: averaged over the joint distribution of training, calibration, and test data, the prediction set covers Yn+1Y_{n+1} with probability at least 1α1 - \alpha. The marginal probability is computed over the distribution of XX, so a procedure that systematically under-covers in one input region and over-covers in another can still satisfy Theorem 1. A user typically wants more — conditional coverage,

P(Yn+1C^α(Xn+1)Xn+1=x)    1αfor PX-almost every x.()\mathbb{P}\bigl( Y_{n+1} \in \hat C_\alpha(X_{n+1}) \,\big|\, X_{n+1} = x \bigr) \;\geq\; 1 - \alpha \quad \text{for } P_X\text{-almost every } x. \tag{$\ast\ast$}

Conditional coverage means the procedure works uniformly across the input space: no input region is systematically undercovered. CQR and APS are designed with this property in view, and they achieve it approximately under reasonable assumptions on the base learner. The result of this section is that exact conditional coverage — the bound at ()(\ast\ast) for every xx, distribution-free — is impossible for any procedure that produces finite, informative prediction sets. Foygel Barber et al. (2021) made this precise by exhibiting an adversarial family of distributions on which any conditionally valid procedure must produce arbitrarily wide intervals at the adversarial input.

Theorem 4 (Conditional Coverage Impossibility — Foygel Barber, Candès, Ramdas & Tibshirani 2021).

Let P\mathcal{P} be a class of distributions on R×R\mathbb{R} \times \mathbb{R} satisfying:

(i) XX has a continuous distribution under each PPP \in \mathcal{P};

(ii) P\mathcal{P} is closed under spiked-variance perturbations: for any σ0,M>0\sigma_0, M > 0, x0Rx_0 \in \mathbb{R}, and ε>0\varepsilon > 0, the distribution

XUniform[1,1],YX=xN ⁣(0, σ02+M21{xx0<ε/2})X \sim \text{Uniform}[-1, 1], \quad Y \mid X = x \sim N\!\bigl(0,\ \sigma_0^2 + M^2 \cdot \mathbb{1}\{|x - x_0| < \varepsilon/2\} \bigr)

belongs to P\mathcal{P}.

Suppose C^α\hat C_\alpha is a (data-dependent) prediction procedure satisfying conditional coverage ()(\ast\ast) uniformly over P\mathcal{P}. Then for any fixed sample size nn and any δ>0\delta > 0, there exists a distribution P0PP_0 \in \mathcal{P} (with spike width εδ\varepsilon \le \delta) such that, with probability at least 1/21/2 over the calibration data,

Lebesgue(C^α(x0))    2Φ1(1α/2)σ02+M2,\text{Lebesgue}\bigl( \hat C_\alpha(x_0) \bigr) \;\geq\; 2 \Phi^{-1}(1 - \alpha/2) \sqrt{\sigma_0^2 + M^2},

where Φ\Phi is the standard normal CDF. As MM can be taken arbitrarily large, the expected Lebesgue measure of C^α(x0)\hat C_\alpha(x_0) under P0P_0 is unbounded.

Proof.

The argument is constructive. Fix the sample size nn and the resolution parameter δ>0\delta > 0. We will exhibit a member of P\mathcal{P} on which any conditionally-valid procedure must produce an arbitrarily wide prediction set at x0=0x_0 = 0.

Step 1: Choose the spike. Set ε=min(δ,1/n)\varepsilon = \min(\delta, 1/n). Let P0P_0 be the spiked-variance distribution from condition (ii) with spike center x0=0x_0 = 0, spike width ε\varepsilon, baseline noise σ0\sigma_0, and spike magnitude MM (to be sent to infinity at the end). Under P0P_0, the conditional distribution at x=0x = 0 — the spike center — is N(0,σ02+M2)N(0, \sigma_0^2 + M^2), while at any xx outside the spike interval [ε/2,ε/2][-\varepsilon/2, \varepsilon/2] the conditional distribution is N(0,σ02)N(0, \sigma_0^2).

Step 2: The “no spike samples” event AA. Each draw (Xi,Yi)(X_i, Y_i) has XiX_i in the spike interval with probability ε/2\varepsilon/2 (the interval has length ε\varepsilon and XX is uniform on [1,1][-1, 1]). The probability that no calibration point falls in the spike interval is

P(A)  =  (1ε/2)ncal    (11/(2n))n    e1/2    0.61\mathbb{P}(A) \;=\; (1 - \varepsilon/2)^{n_{\text{cal}}} \;\geq\; (1 - 1/(2n))^{n} \;\to\; e^{-1/2} \;\approx\; 0.61

as nn \to \infty. For moderate nn — and for our ε1/n\varepsilon \le 1/n — this probability exceeds 1/21/2. So with probability at least 1/21/2 over the calibration data, no observation in the calibration set is informative about the spike.
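The bound in Step 2 is easy to verify numerically. With ε = 1/n the no-spike-sample probability (1 − 1/(2n))^n increases toward its limit e^(−1/2) ≈ 0.607, clearing 1/2 already at very small n; a quick check:

```python
import math

# P(no calibration point lands in the spike) = (1 - 1/(2n))^n when eps = 1/n.
# The sequence increases toward e^{-1/2}, so it exceeds 1/2 for all n >= 2.
for n in (2, 10, 100, 10_000):
    p_no_spike = (1 - 1 / (2 * n)) ** n
    assert p_no_spike > 0.5

# at n = 10^4 the probability is already within 1e-4 of the limit
assert abs((1 - 1 / 20_000) ** 10_000 - math.exp(-0.5)) < 1e-4
```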

Step 3: Indistinguishability. Let QQ denote the spike-free baseline distribution: XUniform[1,1]X \sim \text{Uniform}[-1, 1], YXN(0,σ02)Y \mid X \sim N(0, \sigma_0^2), no spike. On the event AA (no calibration point lands in the spike), the calibration data is observationally indistinguishable from a sample drawn under QQ — every observed pair (Xi,Yi)(X_i, Y_i) is consistent with both P0P_0 and QQ. Because C^α\hat C_\alpha is a measurable function of the calibration data, on the event AA the procedure produces the same prediction set it would have produced under QQ:

C^αA  =  C^α(Q)\hat C_\alpha \big|_{A} \;=\; \hat C_\alpha^{(Q)}

with C^α(Q)\hat C_\alpha^{(Q)} denoting the procedure’s output when calibrated under QQ. The procedure cannot tell which distribution it is operating under because the calibration data does not separate them.

Step 4: Conditional coverage at the spike. Now condition on event AA and on the test input Xn+1=x0=0X_{n+1} = x_0 = 0. Under the true distribution P0P_0, the conditional law Yn+1Xn+1=x0Y_{n+1} \mid X_{n+1} = x_0 is N(0,σ02+M2)N(0, \sigma_0^2 + M^2) — full spike variance. The conditional-coverage hypothesis ()(\ast\ast) requires

P(Yn+1C^α(Q)(x0)A, Xn+1=x0)    1α,\mathbb{P}\bigl( Y_{n+1} \in \hat C_\alpha^{(Q)}(x_0) \,\big|\, A,\ X_{n+1} = x_0 \bigr) \;\geq\; 1 - \alpha,

where we have substituted C^α=C^α(Q)\hat C_\alpha = \hat C_\alpha^{(Q)} from Step 3. The left side is the probability that an N(0,σ02+M2)N(0, \sigma_0^2 + M^2) random variable lies in a fixed (data-dependent) set C^α(Q)(x0)\hat C_\alpha^{(Q)}(x_0).

Step 5: Anderson’s lemma. For any measurable SRS \subseteq \mathbb{R} with P(ZS)1α\mathbb{P}(Z \in S) \ge 1 - \alpha where ZN(0,τ2)Z \sim N(0, \tau^2), the Lebesgue measure of SS is bounded below by the Lebesgue measure of the smallest symmetric interval centered at the mode containing the same probability mass:

Lebesgue(S)    2Φ1(1α/2)τ.\text{Lebesgue}(S) \;\geq\; 2 \Phi^{-1}(1 - \alpha/2) \cdot \tau.

(This is Anderson’s lemma applied to a unimodal Gaussian: among all sets of prescribed measure, the central interval maximizes the Gaussian probability; equivalently, among all sets of prescribed Gaussian probability, the central interval minimizes Lebesgue measure.) Applied to S=C^α(Q)(x0)S = \hat C_\alpha^{(Q)}(x_0) and τ2=σ02+M2\tau^2 = \sigma_0^2 + M^2:

Lebesgue(C^α(Q)(x0))    2Φ1(1α/2)σ02+M2.\text{Lebesgue}\bigl( \hat C_\alpha^{(Q)}(x_0) \bigr) \;\geq\; 2 \Phi^{-1}(1 - \alpha/2) \sqrt{\sigma_0^2 + M^2}.

Step 6: Send MM \to \infty. The right side grows without bound as MM increases, and the inequality holds on the event AA (probability at least 1/21/2). The expected Lebesgue measure of C^α(x0)\hat C_\alpha(x_0) under P0P_0 is therefore unbounded, completing the construction.

Remark (Impossibility Holds with Only Measurability).

The proof never invokes any regularity property of the procedure C^α\hat C_\alpha beyond its being a measurable function of the calibration data. The argument does not assume the procedure is symmetric, exchangeable, or based on any specific score; it does not assume the predictor μ^\hat\mu is consistent or unbiased; it does not even assume C^α\hat C_\alpha produces an interval (it works for arbitrary measurable subsets of R\mathbb{R}). The “no spike samples” event AA does the entire heavy lifting: on AA, the calibration data does not separate P0P_0 from QQ, so any procedure computed only from data cannot discriminate between the two — yet ()(\ast\ast) at x0x_0 requires distinguishing them. The impossibility is a statement about the information content of the calibration data, not about the procedure’s design.

Remark (Approximate Conditional Coverage of CQR/APS).

Theorem 4 does not contradict §6 (CQR) or §7 (APS). Both procedures retain marginal coverage — Theorem 1 still applies — and produce adaptive prediction sets whose width or size varies with xx. What they do not claim is conditional coverage uniformly over the spiked-variance family P\mathcal{P}. Under additional smoothness assumptions on YXY \mid X — assumptions that exclude spiked-variance distributions, such as Lipschitz continuity of the conditional CDF — CQR achieves an approximate conditional-coverage guarantee in which the average gap

EX[P(YC^(X)X)(1α)]\mathbb{E}_{X}\Bigl[\, \bigl| \mathbb{P}(Y \in \hat C(X) \mid X) - (1 - \alpha) \bigr| \,\Bigr]

is small. The empirical message of §6’s CQR-vs-naive comparison was exactly this: under reasonable conditions, CQR’s per-bin coverage curve is much flatter than naive split conformal’s. Theorem 4 says: drop the smoothness assumptions and adversarial constructions reassert themselves. Any practical claim to “conditional coverage” implicitly invokes such assumptions.

Six-panel figure organized as 2 rows by 3 spike-width values: top row shows training data with the spike region shaded and the split conformal prediction band overlaid; bottom row shows binned conditional coverage as a function of x, demonstrating that marginal coverage holds at every spike width but conditional coverage on the spike region degrades as the spike narrows
The impossibility theorem made tangible. Three spike widths ε ∈ {0.10, 0.30, 0.60}, with split conformal applied at α = 0.10. Top row: training data and prediction bands — the band has constant width regardless of spike width, because the calibration set rarely sees the spike. Bottom row: empirical conditional coverage as a function of x. Marginal coverage holds at the 0.90 target in every panel; conditional coverage on the spike region collapses as ε shrinks (the procedure has no information to widen the band where it should). At ε = 0.10 the spike-region coverage is far below nominal.

Remark (Covariate Shift Connection — Tibshirani, Foygel Barber, Candès & Ramdas 2019).

The impossibility motivates an extension that has become the standard fix when the user has known distribution-shift information. If the test distribution PXtestP_X^{\text{test}} differs from the training distribution PXtrainP_X^{\text{train}}, exchangeability fails and naive split conformal can lose its marginal coverage at the test distribution. Tibshirani, Foygel Barber, Candès & Ramdas (NeurIPS 2019) showed that importance-weighted split conformal — where calibration scores are weighted by the likelihood ratio w(x)=ptest(x)/ptrain(x)w(x) = p_{\text{test}}(x) / p_{\text{train}}(x) — restores marginal coverage at the test distribution under known shift. The construction is the weighted empirical quantile of the calibration scores, with the test point’s own weight entering the denominator (this is the weightedSplitConformal in nonparametric-ml.ts). It does not solve the conditional-coverage problem of Theorem 4 — that remains impossible — but it does solve the marginal-coverage-under-shift problem, which is the more practical concern in many deployment settings.
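A minimal NumPy sketch of the weighted threshold at a single test point, under the assumption that the likelihood ratio w is known exactly (the function name is ours, not the library's; the Gaussian density ratio in the test mirrors the figure's setup):

```python
import numpy as np

def weighted_quantile(scores, w_cal, w_test, alpha):
    """Weighted split-conformal threshold at one test point.

    w_cal = w(X_i) on the calibration set, w_test = w(x_test), where
    w = p_test / p_train. The test point's own weight is the atom at
    +infinity: if the calibration mass cannot reach 1 - alpha, the
    threshold is infinite and the prediction set is the whole line.
    """
    order = np.argsort(scores)
    s, w = scores[order], w_cal[order]
    cum = np.cumsum(w) / (w.sum() + w_test)   # weighted CDF of the scores
    idx = np.searchsorted(cum, 1 - alpha, side="left")
    return np.inf if idx == len(s) else s[idx]
```

Setting all weights to 1 recovers the unweighted ⌈(1 − α)(n_cal + 1)⌉ rule, so the weighted procedure is a strict generalization of split conformal.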


Train distribution: x ~ N(0, 1), blue histogram. Test distribution: x ~ N(Δ, 1), green histogram. Naive split conformal (blue band) was calibrated on the train distribution and progressively under-covers as Δ grows — the bottom-panel blue curve drops well below 1 − α. The TBCR 2019 importance-weighted variant (green band) uses the closed-form Gaussian density ratio w(x) = N(x; Δ, 1) / N(x; 0, 1) to recalibrate; the bottom-panel green curve holds at the 1 − α target across the entire shift range.

The takeaway from §8 is paired. Negative: distribution-free conditional coverage is impossible at full generality; any procedure claiming it must restrict to a smaller distribution class. Positive: marginal coverage under shift is recoverable when the shift is known, via importance-weighted variants of the procedures we’ve already developed. Conditional coverage in practice is achieved approximately by adaptive base learners (CQR, APS) and assessed empirically — an honest report of the per-region coverage gap, not a theoretical guarantee. With these limits drawn, §9 returns to the practical question that motivated the topic: how do we wrap conformal prediction around any black-box ML model?

Wrapping Black-Box ML Models

The whole point of conformal prediction’s distribution-free guarantee is that the base predictor can be anything — ridge regression, gradient boosting, random forest, neural network, transformer, anything that maps inputs to predictions. Theorem 1 doesn’t care what’s inside; it only cares that the score function is held fixed across calibration and test. This means the entire topic up to here applies, unchanged, to any model you already have.

The skeleton in five steps:

Step 1 — Split. Partition your data into training, calibration, and test (typically 50/50 between the first two; test is whatever you want to evaluate on later).

Step 2 — Fit. Train your favorite black-box model μ^\hat\mu on the training set. Hyperparameter tuning, early stopping, ensembling — all fine, as long as none of it touches the calibration set.

Step 3 — Score. Compute calibration scores Si=Yiμ^(Xi)S_i = |Y_i - \hat\mu(X_i)| on the calibration set (or the CQR/APS variants from §6/§7 if you want adaptivity).

Step 4 — Threshold. Set q^1α\hat q_{1-\alpha} to the (1α)(ncal+1)\lceil (1-\alpha)(n_{\text{cal}} + 1) \rceil-th smallest calibration score.

Step 5 — Predict. For each new test point Xn+1X_{n+1}, return C^α(Xn+1)=[μ^(Xn+1)q^1α, μ^(Xn+1)+q^1α]\hat C_\alpha(X_{n+1}) = [\hat\mu(X_{n+1}) - \hat q_{1-\alpha},\ \hat\mu(X_{n+1}) + \hat q_{1-\alpha}] (or the appropriate analog for your score function).

That is the entire procedure. There are no model-class-specific tweaks; ridge and a 100-million-parameter transformer use the same five steps. The cost of conformal wrapping is one calibration pass and one threshold lookup — additive overhead measured in milliseconds for any non-trivial base predictor.
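The five steps above can be sketched directly. This is a minimal NumPy version with the absolute-residual score; the `fit` callback is a hypothetical stand-in for any black-box trainer, and splitting/edge cases are kept deliberately simple:

```python
import numpy as np

def conformal_wrap(fit, X, y, alpha=0.1, seed=0):
    """Split conformal around an arbitrary base learner.

    `fit(X_tr, y_tr)` must return a predict function.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    tr, cal = idx[: len(y) // 2], idx[len(y) // 2:]   # Step 1: split 50/50
    mu = fit(X[tr], y[tr])                            # Step 2: fit
    S = np.abs(y[cal] - mu(X[cal]))                   # Step 3: score
    n_cal = len(cal)
    k = int(np.ceil((1 - alpha) * (n_cal + 1)))
    qhat = np.sort(S)[min(k, n_cal) - 1]              # Step 4: threshold

    def predict_set(X_new):                           # Step 5: predict
        p = mu(X_new)
        return p - qhat, p + qhat

    return predict_set
```

Swapping in the CQR or APS score changes only Step 3 and the shape of the returned set; the split, threshold, and guarantee are untouched.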

The empirical demonstration in Figure 9 below applies these five steps to three base predictors on the same heteroscedastic regression dataset: a degree-3 polynomial ridge (the predictor we have used throughout), a small MLP with two hidden layers, and a random forest. All three target α=0.10\alpha = 0.10 and all three achieve empirical marginal coverage near 0.90 — the conformal layer enforces the guarantee regardless of the base predictor’s particulars. What does vary across predictors is the width of the prediction interval: the better-fit predictor produces tighter intervals because its residuals on the calibration set are smaller, so q^1α\hat q_{1-\alpha} shrinks. Width depends on accuracy, coverage does not — a one-line summary of what conformal prediction buys you.

Four-panel figure: top row shows three predictor-specific prediction bands (ridge, MLP, random forest) on the same heteroscedastic data with each panel reporting empirical coverage and mean width; bottom panel overlays all three bands for direct visual comparison
Three base predictors, one conformal envelope, α = 0.10. Top row: ridge (blue), MLP (green), random forest (amber) — each panel reports empirical coverage and mean band width. All three land at ~0.90 coverage on the held-out test set. Bottom: direct overlay. The widths differ because the residuals do; the coverage doesn't differ because the conformal calibration step doesn't care which predictor produced the residuals.

For production deployments, the mapie library implements split conformal, jackknife+, CQR, and APS (deterministic and randomized) for arbitrary scikit-learn estimators with a few-line API. The Python reference implementation in this topic’s notebook uses mapie for the §9 black-box demonstration — the same pattern carries to PyTorch and JAX models with minimal adaptation. Conformal prediction is one of the rare statistical guarantees that scales from n=50n = 50 toy datasets to n=109n = 10^9 industrial pipelines without changing its assumptions or its math.

The wrapping pattern also extends in directions we have only sketched. Online conformal prediction (Vovk-Gammerman-Shafer 2005, Gibbs-Candès 2021) updates the threshold as new data arrives, retaining marginal coverage even when the data stream has slow distribution drift. Conformal prediction for time series handles non-exchangeable streams via blocked or weighted variants. Conformal classification with structured output spaces (multi-label, hierarchical, set-valued) generalizes APS by redesigning the score for each setting. None of these change the five-step skeleton above; they swap the score, the calibration step, or the data assumption — and let the marginal-coverage theorem do its work.

Notation Reference

The symbols used throughout this topic, gathered for quick lookup. The “Connections” and “References & Further Reading” sections that follow are auto-generated from the topic frontmatter; the related-topics relationships expressed there are the structured form of the cross-references made inline in §1, §6, and §8.

| Symbol | Meaning |
| --- | --- |
| $(X_i, Y_i)$ | Feature–response pair, $X_i \in \mathcal{X}$, $Y_i \in \mathcal{Y}$. |
| $n_{\text{cal}}$ | Calibration set size. |
| $\alpha \in (0, 1)$ | Miscoverage level; nominal coverage is $1 - \alpha$. |
| $\hat\mu(x)$ | Fitted base predictor (training data only, in the split case). |
| $s(x, y)$ | Nonconformity score; large = anomalous. |
| $S_i = s(X_i, Y_i)$ | Calibration score at observation $i$. |
| $\hat q_{1-\alpha}$ | Threshold: the $\lceil (1-\alpha)(n_{\text{cal}}+1) \rceil$-th smallest of $\{S_i\}$. |
| $\hat C_\alpha(x)$ | Prediction set: $\{y \in \mathcal{Y} : s(x, y) \le \hat q_{1-\alpha}\}$. |
| $R_i$ | Leave-one-out residual (jackknife+ / CV+). |
| $\hat Q_\alpha^-,\ \hat Q_{1-\alpha}^+$ | Empirical quantiles with floor/ceiling conventions (jackknife+). |
| $\hat q_{\alpha/2}(x),\ \hat q_{1-\alpha/2}(x)$ | Conditional quantile estimates (CQR base learner). |
| $E_i$ | CQR nonconformity score at observation $i$. |
| $\hat\pi(c \mid x)$ | Predicted softmax probability of class $c$ at input $x$ (APS). |
| $c_{(j)}(x)$ | $j$-th most probable class at $x$. |
| $\rho(y; x)$ | Rank of class $y$ in the descending probability ordering. |
| $\sigma^2(x; \varepsilon)$ | Spiked-variance adversarial family from the Theorem 4 proof. |
| $\Phi^{-1}$ | Standard normal inverse CDF (Anderson’s lemma in Theorem 4). |

Forward Connections to Planned formalML Topics

Three topics in the T4 Nonparametric & Distribution-Free track that pick up where this one leaves off. Plain-text references — links will activate as the topics ship.

Quantile Regression — the base learner inside CQR (§6). The full topic covers QR estimation theory, the asymptotic distribution of $\hat\beta(\tau)$, regularized variants (lasso, group lasso) for high dimensions, and the broader use of quantile regression beyond the conformal envelope.

Statistical Depth (coming soon) — depth-based prediction regions are the geometric alternative to the score-based conformal construction. For symmetric distributions the two routes reproduce each other; for asymmetric ones they diverge instructively, with depth regions tracking the geometric center of mass and conformal sets tracking the score quantile. The cross-fertilization runs both ways.

Prediction Intervals (coming soon) — the umbrella topic that positions conformal prediction inside the broader prediction-interval ecosystem (frequentist plug-in, Bayesian posterior predictive, quantile-regression direct, conformal). Each route has different assumptions and different asymptotic properties; the comparative table is the topic’s central contribution.

Connections

  • Concentration is the standard tool for bounding empirical-quantile fluctuation; conformal uses rank symmetry instead, which delivers an exact (not asymptotic) finite-sample statement. The contrast is instructive — concentration gives high-probability bounds with explicit slack terms, conformal gives exact marginal coverage with no slack. concentration-inequalities
  • Calibration scores function as low-dimensional summaries of prediction error; the leave-one-out updates underlying jackknife+ are formally analogous to LOO covariance updates in low-rank settings. pca-low-rank
  • Exchangeability — the only assumption conformal prediction requires — is a measure-theoretic invariance under coordinate permutation, strictly weaker than iid sampling. de Finetti's representation theorem describes the structure of exchangeable sequences. measure-theoretic-probability
  • PAC bounds give *learning* guarantees (low expected loss) under iid assumptions; conformal gives *coverage* guarantees under exchangeability. The two operate on orthogonal axes — predictability versus uncertainty — and the choice of which to use depends on whether the downstream task cares about average error or per-prediction reliability. pac-learning
  • Theorem 1 of this topic (split-conformal marginal coverage) is cited verbatim in §2 of prediction-intervals and reused in its §5.1 CQR-coverage-decomposition proof. The score-function abstraction in §2.1 of prediction-intervals lifts directly from the conformal frame; CQR is just split conformal under a quantile-regression score with the conformal $(1-\alpha)$-quantile threshold. prediction-intervals
  • T4 track closer. Depth-based conformal prediction sets achieve marginal coverage with shapes that adapt to the residual distribution's geometry, replacing componentwise quantile rankings with depth-induced rankings of multivariate scores. The split-conformal marginal-coverage theorem here transfers verbatim to the depth-as-score regime. statistical-depth

References & Further Reading