advanced learning-theory 120 min read

Selective Inference

Inference after looking at the data — conditional pivots, debiased estimators, knockoffs, online FDR

1. Motivation

Most of statistical theory implicitly assumes the hypothesis was chosen before the data arrived. We pick a coefficient to test, we pick a model to fit, we pick an estimand to report — and then we look at the sample. The neat duality between point estimates, confidence intervals, and p-values that powers Stat 101 holds because that ordering holds: the hypothesis is fixed, the data is random, and the sampling distribution of the test statistic is a well-defined object.

Modern ML practice violates that ordering constantly. We fit a lasso on a thousand candidate features, look at the active set, and report confidence intervals on the selected coefficients. We screen a million SNPs against a phenotype, take the top hundred, and ask whether they’re real. We sit on top of an A/B-testing platform that runs ten thousand experiments a quarter and want to know how many of last quarter’s “winners” are actually wins. In each case the hypothesis was chosen by looking at the data, and the textbook guarantees — coverage, size, Type-I error, FWER, FDR — silently break.

Selective inference is the body of techniques that puts the guarantees back. It does so by being explicit about what was selected and how, and by re-deriving the sampling distribution conditional on the selection event itself. That single move — condition on what the selection procedure returned — turns out to support three otherwise quite different bodies of work: post-selection conditional inference (Lee, Sun, Sun & Taylor 2016) and debiased high-dimensional inference (Zhang & Zhang 2014; van de Geer, Bühlmann, Ritov & Dezeure 2014; Javanmard & Montanari 2014); finite-sample false-discovery control via knockoffs (Barber & Candès 2015; Candès, Fan, Janson & Lv 2018); and online FDR control via wealth processes (Foster & Stine 2008; Javanmard & Montanari 2018; Ramdas, Yang, Wainwright & Jordan 2018). This topic walks through all three.

1.1 The garden of forking p-values

The cleanest way to see the problem is to watch a familiar procedure fail on a familiar problem. Take a low-dimensional regression — n=100n = 100 observations, p=20p = 20 features, an identity design covariance, and a true coefficient vector βRp\beta^\star \in \mathbb{R}^{p} that is zero everywhere except on its first three entries, where it equals 1.51.5. Standard regression theory says: fit OLS on all 20 features, get β^\hat\beta, and the 95% Wald interval β^j±1.96SE(β^j)\hat\beta_j \pm 1.96 \cdot \mathrm{SE}(\hat\beta_j) covers βj\beta_j^\star with probability 0.950.95 for each jj. That works. It works because the model — all twenty features — was fixed before the data arrived.

Now run the modern version. Fit the lasso with cross-validated λ\lambda, look at the active set S^={j:β^jlasso0}\hat S = \{j : \hat\beta_j^{\text{lasso}} \ne 0\}, refit OLS on the selected features only, and report the Wald CIs from that fit. This is the workflow practitioners actually use — sklearn even ships it as a one-liner. The active set picks out, on average, around five to eight features per replicate: typically the three real signals plus several noise features that happened to correlate with yy.

The CIs from this workflow look like normal CIs. They’re symmetric around the OLS-on-S^\hat S point estimate, they have the usual ±1.96\pm 1.96 structure, and they print out alongside a column labeled “95% CI” that nothing in the software flags as suspect. Their empirical coverage, however, isn’t 0.950.95. Over B=200B = 200 Monte Carlo replicates of this DGP, the signal-feature CIs cover their target with empirical probability 0.9400.940 — close to nominal, because the signal-to-noise ratio is high enough that the OLS-on-S^\hat S estimate on a strong signal is approximately unbiased even after selection. But the null-feature CIs (features 4 through 20, when one of them sneaks into the active set on a particular replicate) cover their target — which is βj=0\beta_j^\star = 0 — with empirical probability 0.8280.828. The aggregate effect: a coverage rate that lands at 0.8680.868 on this toy.

The structural reason is selection bias on null features. The lasso selects feature jj precisely when Xj(yXjβ^j)|X_j^\top (y - X_{-j}\hat\beta_{-j})| is large; the OLS-on-S^\hat S coefficient on jj then inherits the conditional distribution of XjyX_j^\top y given that this quantity was large. For null features this conditional distribution is bimodal — concentrated away from zero on both sides — which is fundamentally different from the unimodal Gaussian the naive CI assumes. The CI is centered on a noisy estimate that’s never near zero conditional on selection, and on a non-trivial fraction of replicates it misses zero. Figure 1.1 quantifies the aggregate effect on the running toy.

This is the loss-of-coverage phenomenon that selective inference is designed to repair. The fix that conditional inference will deliver in §4 is structural: replace the unconditional sampling distribution of β^j\hat\beta_j with its conditional distribution given the selection event {S^=S}\{\hat S = S\}, and the resulting CI achieves nominal coverage. We’ll prove that in §4. For now, §1 just establishes the failure.

Naive Wald-CI coverage on the low-dim lasso toy, broken out by signal vs null
Empirical coverage of naive Wald CIs after lasso selection, broken out by whether the selected feature corresponds to a true signal or a true null. B = 200 Monte Carlo replicates of the low-dim toy (n = 100, p = 20, 3 signals at β★ = 1.5, σ = 1, identity design). Signal-feature CIs cover at 0.940 (near nominal); null-feature CIs cover at only 0.828; the aggregate drops to 0.868. The story is structural — the coverage failure is driven by selection bias on the null features, not by anything wrong with the OLS step that follows.

1.2 Three families of fixes

Selective inference is not a single procedure; it’s a programme. The procedures the topic covers cluster into three families, distinguished by what they condition on and what guarantee they deliver:

Conditional / post-selection inference (§§2–5). Condition on the realized selection event itself — for the lasso, the active-set and sign pattern {S^,z^}\{\hat S, \hat z\}. The Lee–Sun–Sun–Taylor pivot delivers exactly nominal conditional coverage for any linear functional of yy, including the partial OLS coefficients in the selected model. The debiased lasso of Zhang–Zhang / van de Geer et al. / Javanmard–Montanari delivers asymptotically nominal coverage in pnp \gg n without conditioning, by computing a bias-corrected estimator whose CLT survives the selection bias.

Knockoff filters (§§7–9). Don’t condition; instead, construct fake “knockoff” features X~\tilde X designed to be exchangeable with the real features under any null. Use this exchangeability to build a statistic per feature whose null distribution is sign-flip symmetric, then threshold on that distribution to control FDR. The finite-sample FDR guarantee holds at all (n,p,q)(n, p, q), with no asymptotic approximation.

Online FDR (§§10–11). When tests arrive in a stream — A/B platforms, continuous deployment, sequential genomics screens — neither BH nor conditional inference work directly because future tests aren’t yet visible. The α-investing / LORD / SAFFRON family solves this by tracking an α-wealth process: each test deducts from the wealth, each rejection deposits a payout, and a supermartingale argument bounds the running FDR by α\alpha at every stopping time.

The three families are not redundant. Each addresses a different practical bottleneck — small-but-non-trivial pp with a selection procedure (post-selection), pnp \gg n with honest CIs needed (debiased), large-pp feature selection with finite-sample FDR (knockoffs), and streaming tests (online). The reader should leave §1 with one map and three distinct destinations; the §-by-§ exposition fills in each route.

1.3 What “selective inference” means precisely

The phrase admits several meanings in the literature; we’ll be specific. By “selective inference” we mean inference about a data-dependent estimand — a parameter, functional, or hypothesis that is itself a function of the sample. This is the unifying definition that covers all three families. In conditional post-selection inference, the estimand is the partial-OLS coefficient in the model selected by the lasso, and that model depends on yy. In debiased high-dimensional inference, the estimand is a fixed coefficient, but the bias correction depends on a lasso fit. In the knockoff filter, the estimand is the set of features satisfying Xj⊥̸YXjX_j \not\perp Y \mid X_{-j}, and the report (which features get flagged) depends on the data. In online FDR, the estimand is a sequence of null hypotheses, and the test schedule depends on the wealth process, which depends on the prior tests’ outcomes.

The parallel to semiparametric inference is structural and worth naming explicitly: semiparametrics asks for n\sqrt n-valid inference about a low-dimensional functional in the presence of an infinite-dimensional nuisance estimated from data; selective inference asks for valid inference about a functional whose identity depends on data. Both topics live in the seam between “parameter is fixed, data is random” and “parameter and data are both random in a coupled way,” and both depend on a careful accounting of what randomness the inference is conditioning on. Where semiparametrics needs orthogonality of the influence function with respect to the nuisance score, selective inference needs an explicit description of the selection event whose probability the inference conditions on.

1.4 Roadmap and prerequisites

The topic proceeds in three arcs. §§2–5 cover post-selection and debiased inference: the LSST framework (§2), the polyhedral selection event for the lasso (§3), the truncated-Gaussian pivot (§4), and the debiased lasso (§5). §§6–9 cover knockoffs: the BH/Storey baseline (§6), the fixed-X knockoff filter and its supermartingale FDR proof (§7), the model-X generalization (§8), and the practical knockoff-construction linear algebra (§9). §§10–11 cover online FDR: the procedures (§10) and the wealth-process supermartingale proofs (§11). §12 lands the topic in ML applications, §13 collects computational notes, and §14 maps the connections to nearby topics and the limits of the framework.

The recommended prerequisites are high-dimensional regression for the lasso KKT conditions and debiased-lasso machinery, concentration inequalities for the martingale tail bounds that drive the online-FDR proofs, and ideally semiparametric inference for the conceptual frame on data-dependent estimands. From the sister site, formalStatistics’s Multiple Testing and False Discovery is the substrate selective inference generalizes — this topic discharges three forward-pointers from that page in a single ship.

2. The selection-event framework

§1 showed what goes wrong when we condition on the data twice — once implicitly through the selection procedure, once explicitly through the inference step — without accounting for the first conditioning. §2 sets up the framework that fixes it. The fix is conceptually simple: replace the unconditional sampling distribution of the test statistic with its distribution conditional on the selection event. The work of the rest of the topic is to make that conditional distribution computable.

This section sets the framework abstractly. §3 instantiates it on the lasso, §4 derives the truncated-Gaussian pivot that delivers the exact conditional CDF, §5 handles the high-dimensional case where exact conditioning is not available.

2.1 Hypotheses chosen by the data

Fix a sample space Y\mathcal{Y} (think Rn\mathbb{R}^n for the regression setting). A selection procedure is a measurable map

S:YH,\mathcal{S}: \mathcal{Y} \to \mathcal{H},

where H\mathcal{H} is a (countable, for simplicity) set of outcomes — model indices, active sets, threshold rules, anything we might run on the data before reporting a result. For the lasso at a fixed regularization λ\lambda, S(y)=(S^(y),z^(y))\mathcal{S}(y) = (\hat S(y), \hat z(y)) — the active set and sign pattern. For forward stepwise truncated at KK steps, S(y)\mathcal{S}(y) is the ordered sequence of selected features. For the BH procedure at level qq, S(y)\mathcal{S}(y) is the set of rejected hypotheses.

Each outcome sHs \in \mathcal{H} partitions the sample space into two pieces: the selection event

Es:={yY:S(y)=s},E_s := \{y \in \mathcal{Y} : \mathcal{S}(y) = s\},

and its complement. Conditioning on EsE_s means restricting attention to the realizations that would have produced this particular selection outcome. The marginal sampling distribution of YY — the one that classical inference uses — averages over all selection outcomes; the conditional sampling distribution YS(Y)=sY \mid \mathcal{S}(Y) = s averages only over realizations that hit EsE_s. The two coincide when S\mathcal{S} is constant (no selection — classical inference) and diverge otherwise. The size of the gap is what makes selective inference necessary.

A useful mental picture: Y\mathcal{Y} is partitioned into selection cells {Es}sH\{E_s\}_{s \in \mathcal{H}}; the procedure stamps every observed realization with the label ss of whichever cell it lands in. Classical inference treats the whole sample space as the support of YY. Selective inference restricts to the cell we landed in and asks what the test statistic does inside that cell.

Numerical demonstration. On the §1 low-dim toy, fix the lasso regularization at a deterministic λ\lambda (rather than CV — the conditional analysis needs a deterministic selection rule) and run B=4,000B = 4{,}000 replicates. For each replicate, fit the lasso and record whether feature j=5j = 5 (a true null) is selected. On the 20%\approx 20\% of replicates where feature 5 is in S^\hat S, compute the OLS-on-S^\hat S estimate of its coefficient. Compare to a marginal reference: the OLS estimate of β5\beta_5 when the model is forced to include {0,1,2,5}\{0, 1, 2, 5\} from the start, regardless of what the lasso does.

The result, locked at notebook execution:

  • Conditional sample (lasso selected feature 5): n=819n = 819 out of 4000. Sample mean 0.001-0.001. Sample std 0.1740.174. Crucially, only 0.1% of conditional samples are in (0.05,+0.05)(-0.05, +0.05); the distribution is bimodal, with 45.4%\approx 45.4\% of mass above +0.10+0.10 and 47.1%\approx 47.1\% below 0.10-0.10.

  • Marginal sample (forced inclusion): n=4000n = 4000. Sample mean +0.001+0.001. Sample std 0.1010.101. Unimodal, peaked at zero.

The two distributions have similar means (both near zero, by symmetry) but qualitatively different shapes. The marginal distribution is the classical Gaussian N(0,σ2/n)\mathcal{N}(0, \sigma^2/n) that Wald intervals are built around. The conditional distribution is bimodal — concentrated away from zero on both sides — because the selection event {jS^}\{j \in \hat S\} requires Xjy|X_j^\top y| to have been large, which carves out the mass near zero.

This is the structural content of selective inference, isolated in one picture. Classical inference uses the marginal distribution as its null reference and correctly assigns small p-values to its tail. Applied naively to the conditional sample — which is what the §1 lasso-then-OLS workflow does — it computes p-values against the wrong reference distribution. Values that are entirely typical of the conditional distribution under the null get assigned small p-values, and CIs centered on those values miss the truth βj=0\beta_j^\star = 0 on a non-trivial fraction of replicates.

The fix that §§3–4 will develop is to compute against the right reference: the truncated-Gaussian distribution that exactly characterizes the conditional sampling distribution under the polyhedral selection event of §3.

Conditional vs marginal sampling distributions of the OLS-on-S estimate of a null feature
Conditional vs marginal sampling distributions of the OLS-on-Ŝ estimate of β₅, on the §1 low-dim toy at fixed λ. Marginal distribution (gray): the OLS estimate in the model forced to include {0, 1, 2, 5}, computed across all B = 4 000 replicates; approximately 𝒩(0, σ²/n). Conditional distribution (red): the OLS-on-Ŝ estimate on the ≈20% of replicates where the lasso selected feature 5; bimodal, with mass concentrated away from zero on both sides because the selection event excludes the near-zero region. The two distributions have similar means but qualitatively different shapes — and computing p-values or CIs against the wrong reference is the structural reason for §1's coverage failure.

The interactive below lets you experiment with the lasso’s selection event directly. Drag λ\lambda to watch the active set S^\hat S and sign pattern z^\hat z change; the polyhedral selection event {y:Ayb}\{y : Ay \le b\} (developed in §3) is rendered as a 2-D slice of the sample space around a fixed realization.

10 distinct outcomes · realized cell at origin: 0+,1+,2+,4+,6+0-,1+,2+,3-,4+,5-0-,1+,2+,4+0-,1+,2+,4+,5+1+,2+,3-,4+,5-1+,2+,4+,5-1+,2+,4+1+,2+,4+,5+0+,1+,2+,4+,5-

2.2 Conditional vs marginal: what “post-selection valid” means

Fix a null hypothesis H0H_0 that may itself depend on the selection outcome — for the lasso, “the partial OLS coefficient for feature jj in the selected model equals zero,” which only makes sense once S^\hat S is known. We say a test of H0H_0 has classical Type-I error at level α\alpha if

PrH0(reject)α,\Pr_{H_0}(\text{reject}) \le \alpha,

where the probability is over the marginal distribution of YY. We say the test is selectively valid at level α\alpha (post-selection valid, in the LSST terminology) if

PrH0(rejectS(Y)=s)αfor every sH.\Pr_{H_0}(\text{reject} \mid \mathcal{S}(Y) = s) \le \alpha \quad \text{for every } s \in \mathcal{H}.

The selective version is a strictly stronger requirement than the classical one: it controls Type-I error inside every selection cell, not just on average across cells. The corresponding notion for confidence intervals is conditional coverage: PrH0(θsCsS(Y)=s)1α\Pr_{H_0}(\theta_s \in C_s \mid \mathcal{S}(Y) = s) \ge 1 - \alpha for every ss, where θs\theta_s is the target parameter associated with outcome ss and CsC_s is the interval reported under outcome ss.

A weaker target available in the literature is False Coverage Rate control (Benjamini & Yekutieli 2005): an average-coverage guarantee across multiple selected parameters, integrating over the selection randomness. FCR is a strict relaxation of conditional coverage and is the right target in some applied settings; for this topic we focus on the conditional-coverage standard because it’s what the LSST pivot delivers and what justifies the file-drawer claim — “the CI you report is at nominal coverage given that this is the CI you happened to report.”

The file-drawer interpretation is worth dwelling on. Selective validity is the statement that an analyst who repeatedly performs the same selection-then-inference workflow, and discards (puts in the file drawer) realizations that didn’t produce the same selection outcome, would observe nominal coverage on the surviving inferences. That is the guarantee a downstream consumer of the report actually needs.

2.3 Generic post-selection p-values via conditional distributions

The construction of a selective p-value is one display equation. Pick a test statistic T:YRT : \mathcal{Y} \to \mathbb{R} (for now, generic; later, a linear contrast ηy\eta^\top y). Under the null, the selective p-value for the realization yy given selection outcome s=S(y)s = \mathcal{S}(y) is

psel(y):=PrH0(T(Y)T(y)S(Y)=s).p_{\text{sel}}(y) := \Pr_{H_0}\big( T(Y) \ge T(y) \,\big|\, \mathcal{S}(Y) = s \big).

Under the null, by construction, psel(Y)S(Y)=sp_{\text{sel}}(Y) \mid \mathcal{S}(Y) = s is uniformly distributed on [0,1][0, 1] — this is just the standard PIT applied to the conditional CDF of TT given the selection event. Marginally, psel(Y)p_{\text{sel}}(Y) is also uniform, because uniformity given ss for every ss averages to uniformity.

The construction is generic. It only delivers a usable inference procedure when the conditional CDF FTS=s()F_{T \mid \mathcal{S} = s}(\cdot) is computable. That is where the form of the selection event starts to matter. For arbitrary S\mathcal{S}, the conditional distribution is intractable and the construction is just a definition. The triumph of LSST 2016 is to identify a broad class of selection procedures — those producing affine selection events Es={y:Asybs}E_s = \{y : A_s y \le b_s\} for some (As,bs)(A_s, b_s) — where the conditional distribution of any linear functional ηY\eta^\top Y is a truncated Gaussian with computable truncation bounds. That class includes the lasso at a fixed λ\lambda, forward stepwise at every step, marginal screening, the LARS sequence, and several others.

We’ll re-encounter this conditional p-value definition twice. In §4 we instantiate T(y)=ηyT(y) = \eta^\top y with η\eta the OLS direction for a selected coefficient and derive the exact truncnorm pivot. In §11 we instantiate it with T(y)T(y) a sequential test statistic and recover the wealth-process construction by a different route — the conditional-distribution machinery of §2 and the supermartingale machinery of §11 are different routes to the same finite-sample-validity target.

2.4 The LSST framework — when conditional inference is tractable

The conditioning recipe above is only useful when we can compute the conditional CDF. Lee, Sun, Sun & Taylor (2016) identify the workhorse setting where we can. Their framework is the Gaussian linear model:

YN(μ,σ2In),Y \sim \mathcal{N}(\mu, \sigma^2 I_n),

with σ2\sigma^2 either known or estimated externally (e.g., from a held-out sample). The post-selection target is a linear functional ημ\eta^\top \mu where η\eta may itself depend on the selection outcome. For the lasso with active set S^\hat S, the canonical choice of η\eta is the jj-th column of XS^(XS^XS^)1X_{\hat S}(X_{\hat S}^\top X_{\hat S})^{-1} — the partial-OLS direction for feature jS^j \in \hat S in the selected model. The target ημ\eta^\top \mu is then the partial-OLS coefficient that one would report if the selected model were the true model.

The framework’s key structural assumption is that the selection event is affine in YY:

Es={y:Asybs}.E_s = \{y : A_s y \le b_s\}.

That is, the selection outcome is determined by a finite collection of linear inequalities. The lasso KKT conditions deliver exactly this: §3 shows that {S^=S,z^=z}\{\hat S = S, \hat z = z\} is the intersection of a list of half-spaces, all linear in yy. Forward stepwise produces an affine selection event after every step. So does marginal screening. The framework is broad.

Under the Gaussian-plus-affine setup, the master result is straightforward to state — we’ll prove it in §4 once we have the polyhedral description of EsE_s from §3.

Theorem 2.1 (LSST master pivot (statement only)).

Let YN(μ,σ2In)Y \sim \mathcal{N}(\mu, \sigma^2 I_n) and let Es={y:Asybs}E_s = \{y : A_s y \le b_s\} be an affine selection event. Then for any contrast direction ηRn\eta \in \mathbb{R}^n with η>0\|\eta\| > 0, the conditional distribution of ηY\eta^\top Y given {YEs}\{Y \in E_s\} and given the orthogonal residual (Iηη/η2)Y(I - \eta\eta^\top/\|\eta\|^2)Y is a Gaussian truncated to an interval [V(Y),V+(Y)][\mathcal{V}^-(Y), \mathcal{V}^+(Y)] that depends only on the residual:

ηYYEs,  (IPη)Y=r    TN ⁣(ημ,σ2η2,V(r),V+(r)),\eta^\top Y \,\big|\, Y \in E_s, \; (I - P_\eta) Y = r \;\sim\; \mathrm{TN}\!\left(\eta^\top \mu, \, \sigma^2 \|\eta\|^2, \, \mathcal{V}^-(r), \, \mathcal{V}^+(r)\right),

where Pη=ηη/η2P_\eta = \eta\eta^\top/\|\eta\|^2 is the projection onto span(η)\mathrm{span}(\eta).

The proof — and the explicit construction of V(r),V+(r)\mathcal{V}^-(r), \mathcal{V}^+(r) as the largest and smallest values of ηy\eta^\top y compatible with AsybsA_s y \le b_s for each fixed residual rr — is Theorem 4.1 of §4.3 (the same result, now with proof). The point for now is the framework: the conditional distribution exists, it’s truncated Gaussian, the truncation bounds are computable as piecewise-linear functionals of the orthogonal residual, and the conditional CDF gives us the selective p-value via PIT. Everything else is implementation.

The framework’s limitations bear mentioning. The Gaussian assumption is restrictive — the exact pivot uses Gaussianity at full strength. There are robustness results (asymptotic / Berry–Esseen / studentized versions; Tian & Taylor 2018; Tibshirani, Rinaldo, Tibshirani & Wasserman 2018) that extend the framework to non-Gaussian errors at the cost of replacing the exact pivot with an asymptotic one. The σ\sigma-known assumption is real but mild: estimating σ\sigma from a held-out residual sample or from a high-dimensional consistent estimator works in practice. The affine-selection-event assumption is the most restrictive structural condition; selection procedures that don’t admit an affine description (cross-validation as part of λ\lambda selection, neural-network feature selection, etc.) sit outside the framework. §13 returns to this when discussing extensions.

3. Polyhedral selection events for the lasso

§2 set up the abstract framework — selection procedures, selection events, the conditional sampling distribution that selective inference needs to compute with. §3 instantiates it on the lasso. The payoff is a concrete description: at any fixed regularization λ>0\lambda > 0, the selection event “the lasso at this λ\lambda returns active set SS with sign pattern zz” is a polyhedron {y:Ayb}\{y : Ay \le b\} in Rn\mathbb{R}^n, with AA and bb given by explicit closed-form expressions in (X,S,z,λ)(X, S, z, \lambda). That description is what §4 needs to derive the truncated-Gaussian pivot.

The whole derivation is one application of the KKT conditions plus careful bookkeeping. We’ll walk through it.

Lambda convention. Throughout §3 and §4, we use the math convention β^=argmin12yXβ2+λβ1\hat\beta = \arg\min \frac{1}{2}\|y - X\beta\|^2 + \lambda \|\beta\|_1 (no 1/n1/n factor). The polyhedron formulas below are expressed in terms of this math λ\lambda. When calling sklearn’s Lasso(alpha=…) in code, pass alpha = lam/n to convert (sklearn’s objective has a 1/(2n)1/(2n) factor on the squared-error term).

3.1 The lasso KKT conditions and the subgradient at the optimum

Fix a design matrix XRn×pX \in \mathbb{R}^{n \times p}, a response yRny \in \mathbb{R}^n, and a regularization λ>0\lambda > 0. The lasso estimator is the solution of the convex program

β^(y):=argminβRp  12yXβ22+λβ1.\hat\beta(y) := \arg\min_{\beta \in \mathbb{R}^p} \;\frac{1}{2}\|y - X\beta\|_2^2 + \lambda \|\beta\|_1.

The 1\ell_1 penalty is non-differentiable at coordinates where βj=0\beta_j = 0, so the optimality conditions go through the subdifferential of 1\|\cdot\|_1 rather than the gradient:

t={{sign(t)}t0,[1,+1]t=0.\partial |t| = \begin{cases} \{\operatorname{sign}(t)\} & t \ne 0, \\ [-1, +1] & t = 0. \end{cases}

Setting the subdifferential of the lasso objective to zero gives the KKT stationarity conditions: there exists a subgradient vector g^β^1\hat g \in \partial \|\hat\beta\|_1 (i.e., g^j=sign(β^j)\hat g_j = \operatorname{sign}(\hat\beta_j) when β^j0\hat\beta_j \ne 0 and g^j[1,+1]\hat g_j \in [-1, +1] otherwise) such that

X(yXβ^)=λg^.(3.1)X^\top (y - X \hat\beta) = \lambda \hat g. \qquad (3.1)

This is the only optimality condition we need. Everything that follows is bookkeeping on which coordinates of β^\hat\beta are zero versus nonzero, and what g^\hat g has to be on each.

Partition {1,,p}\{1, \dots, p\} into the active set S={j:β^j0}S = \{j : \hat\beta_j \ne 0\} and its complement Sc={j:β^j=0}S^c = \{j : \hat\beta_j = 0\}. Let zS{1,+1}Sz_S \in \{-1, +1\}^{|S|} be the sign pattern on the active set: zj=sign(β^j)z_j = \operatorname{sign}(\hat\beta_j) for jSj \in S. Restricting (3.1) to the active coordinates:

XS(yXβ^)=λzS.(3.2)X_S^\top (y - X\hat\beta) = \lambda z_S. \qquad (3.2)

Restricting to the inactive coordinates, where the subgradient is free in [1,+1][-1, +1]:

Xj(yXβ^)λfor all jSc.(3.3)|X_j^\top (y - X\hat\beta)| \le \lambda \quad \text{for all } j \in S^c. \qquad (3.3)

We’ll need the slightly stronger version of (3.3) — strict inequality Xj(yXβ^)<λ|X_j^\top (y - X\hat\beta)| < \lambda for jScj \in S^c — to characterize the interior of the selection event. Equality corresponds to the boundary where a feature is exactly about to enter or leave the active set, which is a measure-zero set under any continuous distribution of yy.

3.2 The active set Ŝ and sign pattern ẑ as functions of y

The KKT conditions implicitly define β^(y)\hat\beta(y), S^(y)\hat S(y), and z^(y)\hat z(y) as functions of the observed response. We can solve for β^S\hat\beta_S explicitly on the active coordinates. Since β^Sc=0\hat\beta_{S^c} = 0 by definition of the active set, (3.2) becomes

XS(yXSβ^S)=λzS        β^S(y)=(XSXS)1(XSyλzS).(3.4)X_S^\top (y - X_S \hat\beta_S) = \lambda z_S \;\;\Longrightarrow\;\; \hat\beta_S(y) = (X_S^\top X_S)^{-1} \big( X_S^\top y - \lambda z_S \big). \qquad (3.4)

Equation (3.4) gives β^S\hat\beta_S as an explicit affine function of yy (holding SS and zSz_S fixed). The map yβ^(y)y \mapsto \hat\beta(y) is globally nonlinear in yy because SS and zz change as yy varies, but locally — within the cell of the sample space where SS and zz are constant — it’s affine. That local affineness is the structural feature that drives the polyhedral description.

The map y(S^(y),z^(y))y \mapsto (\hat S(y), \hat z(y)) partitions Rn\mathbb{R}^n into selection cells indexed by pairs (S,z)(S, z). For most realizations of yy the cell is a full-dimensional open set; on a measure-zero union of hyperplanes the lasso has ties and the map is not single-valued. §3.3 writes down the inequality description of each full-dimensional cell.

3.3 The selection event as a polyhedron Ay ≤ b

The selection event ES,z={y:S^(y)=S,z^(y)=z}E_{S, z} = \{y : \hat S(y) = S, \hat z(y) = z\} is the set of responses for which the lasso would return exactly this active set with exactly these signs. To describe it as inequalities in yy, we substitute the explicit form (3.4) of β^S\hat\beta_S into the two structural conditions: the active-side sign condition (zjβ^j>0z_j \cdot \hat\beta_j > 0 for jSj \in S, ensuring the signs match) and the inactive-side magnitude condition (Xk(yXSβ^S)<λ|X_k^\top (y - X_S \hat\beta_S)| < \lambda for kSck \in S^c, ensuring the inactive coordinates stay inactive).

Theorem 3.1 (Lasso selection event is polyhedral).

Let yRny \in \mathbb{R}^n and let β^(y)\hat\beta(y) be the lasso solution at regularization λ>0\lambda > 0. For any pair (S,z)(S, z) with S{1,,p}S \subseteq \{1, \dots, p\} and z{1,+1}Sz \in \{-1, +1\}^{|S|}, the selection event

ES,z={yRn:S^(y)=S,  z^(y)=z}E_{S, z} = \{y \in \mathbb{R}^n : \hat S(y) = S, \; \hat z(y) = z\}

is (up to a measure-zero boundary) the open polyhedron

ES,z={y:A0y<b0,  A1y<b1},E_{S, z} = \big\{ y : A_0 y < b_0, \; A_1 y < b_1 \big\},

with

A0=diag(z)(XSXS)1XS,b0=λdiag(z)(XSXS)1z,A1=1λ(+XSc(IPS)XSc(IPS)),b1=(1w1+w),\begin{aligned} A_0 &= -\operatorname{diag}(z) (X_S^\top X_S)^{-1} X_S^\top, & b_0 &= -\lambda \operatorname{diag}(z) (X_S^\top X_S)^{-1} z, \\ A_1 &= \frac{1}{\lambda} \begin{pmatrix} +X_{S^c}^\top (I - P_S) \\ -X_{S^c}^\top (I - P_S) \end{pmatrix}, & b_1 &= \begin{pmatrix} \mathbf{1} - w \\ \mathbf{1} + w \end{pmatrix}, \end{aligned}

where PS=XS(XSXS)1XSP_S = X_S (X_S^\top X_S)^{-1} X_S^\top is the orthogonal projection onto col(XS)\operatorname{col}(X_S) and w=XScXS(XSXS)1zRScw = X_{S^c}^\top X_S (X_S^\top X_S)^{-1} z \in \mathbb{R}^{|S^c|}. The total number of inequalities is S+2Sc=2pS|S| + 2|S^c| = 2p - |S|.

Proof.

We translate each of the two structural conditions into linear inequalities in yy.

Step 1: the active-side sign condition. The condition β^j(y)zj>0\hat\beta_j(y) \cdot z_j > 0 for every jSj \in S is equivalent to

diag(z)β^S(y)>0(componentwise).\operatorname{diag}(z) \, \hat\beta_S(y) > 0 \quad (\text{componentwise}).

Substituting (3.4),

diag(z)(XSXS)1(XSyλz)>0,\operatorname{diag}(z) (X_S^\top X_S)^{-1} \big( X_S^\top y - \lambda z \big) > 0,

which rearranges to

diag(z)(XSXS)1XSy<λdiag(z)(XSXS)1z.-\operatorname{diag}(z) (X_S^\top X_S)^{-1} X_S^\top y < -\lambda \, \operatorname{diag}(z) (X_S^\top X_S)^{-1} z.

This is exactly A0y<b0A_0 y < b_0 with A0,b0A_0, b_0 as stated. It is a system of S|S| inequalities.

Step 2: the inactive-side magnitude condition. For kSck \in S^c, β^k=0|\hat\beta_k| = 0 requires (per the KKT subgradient (3.3)) that Xk(yXSβ^S)λ|X_k^\top (y - X_S \hat\beta_S)| \le \lambda, and strict inequality picks out the interior of the cell. Substituting (3.4):

yXSβ^S(y)=yXS(XSXS)1(XSyλzS)=(IPS)y+λXS(XSXS)1z.y - X_S \hat\beta_S(y) = y - X_S (X_S^\top X_S)^{-1} (X_S^\top y - \lambda z_S) = (I - P_S) y + \lambda X_S (X_S^\top X_S)^{-1} z.

Hence

XSc(yXSβ^S)=XSc(IPS)y+λXScXS(XSXS)1z=XSc(IPS)y+λw.X_{S^c}^\top (y - X_S \hat\beta_S) = X_{S^c}^\top (I - P_S) y + \lambda \, X_{S^c}^\top X_S (X_S^\top X_S)^{-1} z = X_{S^c}^\top (I - P_S) y + \lambda w.

The two-sided strict inequality XSc(yXSβ^S)<λ|X_{S^c}^\top (y - X_S \hat\beta_S)| < \lambda then expands to the pair

+XSc(IPS)y+λw<λ1,XSc(IPS)yλw<λ1,+X_{S^c}^\top (I - P_S) y + \lambda w < \lambda \mathbf{1}, \qquad -X_{S^c}^\top (I - P_S) y - \lambda w < \lambda \mathbf{1},

or, dividing through by λ>0\lambda > 0,

+1λXSc(IPS)y<1w,1λXSc(IPS)y<1+w.+\tfrac{1}{\lambda} X_{S^c}^\top (I - P_S) y < \mathbf{1} - w, \qquad -\tfrac{1}{\lambda} X_{S^c}^\top (I - P_S) y < \mathbf{1} + w.

These are exactly A1y<b1A_1 y < b_1 with A1,b1A_1, b_1 as stated. They give 2Sc2 |S^c| inequalities.

Step 3: combine. The selection event is the intersection of the two systems, which is the open polyhedron stated. Counting inequalities: S|S| from Step 1 and 2Sc2|S^c| from Step 2, totaling S+2Sc=2pS|S| + 2|S^c| = 2p - |S|.

Numerical verification. On one realization of the §1 low-dim toy (n=100n = 100, p=20p = 20, math-λ12.24\lambda \approx 12.24 corresponding to sklearn’s α=λ/n0.12\alpha = \lambda/n \approx 0.12), the lasso selects S={0,1,2,9,11,15}S = \{0, 1, 2, 9, 11, 15\} with z=(1,1,1,1,1,1)z = (1, 1, 1, 1, -1, -1), giving S=6|S| = 6 and Sc=14|S^c| = 14 for a total of 3434 inequalities. The active-side slacks b0A0yb_0 - A_0 y are all positive (minimum +0.0068+0.0068), the inactive-side slacks b1A1yb_1 - A_1 y are all positive (minimum +0.1869+0.1869), confirming that the observed yy lies strictly in the interior of ES,zE_{S, z}. The explicit formula β^S=(XSXS)1(XSyλz)\hat\beta_S = (X_S^\top X_S)^{-1}(X_S^\top y - \lambda z) matches sklearn’s lasso solution to β^Sexplicitβ^Ssklearn2.81×106|\hat\beta_S^{\text{explicit}} - \hat\beta_S^{\text{sklearn}}|_\infty \approx 2.81 \times 10^{-6}. Theorem 3.1 holds in code.

A few remarks on the theorem. The matrix AA depends on (X,S,z,λ)(X, S, z, \lambda) but not on yy; the right-hand side bb depends on (X,S,z,λ)(X, S, z, \lambda) likewise. So the inequality matrix is fixed once the selection outcome is — different realizations of yy that map to the same outcome live in the same polyhedron. Different outcomes give different polyhedra; the family {ES,z}(S,z)\{E_{S, z}\}_{(S, z)} tessellates Rn\mathbb{R}^n (up to the measure-zero boundary).

The factor (IPS)(I - P_S) in A1A_1 deserves a moment. It says that the inactive constraint depends on yy only through its component orthogonal to col(XS)\operatorname{col}(X_S). The component of yy along col(XS)\operatorname{col}(X_S) is absorbed into the active-side estimate β^S\hat\beta_S and doesn’t affect whether the inactive features get screened in or out. We’ll come back to this in §4 — it’s the structural reason the truncated-Gaussian pivot’s truncation bounds depend only on the orthogonal residual.

3.4 Geometric hook — projecting the polyhedron onto a 2-D slice

The polyhedron ES,zE_{S, z} lives in Rn\mathbb{R}^n with n=100n = 100 on our toy. We can’t draw it directly. But we can pick two directions v1,v2Rnv_1, v_2 \in \mathbb{R}^n, fix the rest of yy, and look at the 2-D slice

{(a,b)R2:y0+av1+bv2Rn}.\{ (a, b) \in \mathbb{R}^2 : y_0 + a v_1 + b v_2 \in \mathbb{R}^n \}.

The selection cells restrict to polygons in (a,b)(a, b)-coordinates: the inequalities A0y<b0,A1y<b1A_0 y < b_0, A_1 y < b_1 become A0(y0+av1+bv2)<b0A_0 (y_0 + a v_1 + b v_2) < b_0, etc., which are still linear in (a,b)(a, b). Crossing a boundary line in the 2-D slice corresponds to crossing a face of the polyhedron — at the boundary, one of the KKT conditions becomes an equality and the lasso’s selection outcome jumps.

Choosing perturbation directions aligned with a signal column X0X_0 and a null column X5X_5 produces a particularly clean picture. Increasing the projection along X0X_0 strengthens the signal-feature signal; increasing along X5X_5 adds correlation between the response and a null feature. On the notebook realization, the 2-D slice contains 66 distinct lasso active-set outcomes, separated by piecewise-linear boundaries; the polygon containing the origin is the realized selection cell.

This is the geometric content of the lasso’s selection event. Each cell is convex (it’s the intersection of half-spaces, so it’s automatically convex); the cells tessellate the slice; the boundary lines correspond to KKT activations. The picture also makes the §1 coverage failure intuitive: when we report a CI on β^5\hat\beta_5 from the OLS fit on S^\hat S, we’re using the sampling distribution of β^5\hat\beta_5 marginalized over all of Rn\mathbb{R}^n, but we should be using its distribution restricted to the cell where the lasso actually selected feature 5. That cell is a strict subset of Rn\mathbb{R}^n, and §4 will compute the conditional distribution on it.

3.5 Extensions — forward stepwise, marginal screening, and the LARS sequence

The polyhedral description is not specific to the lasso. Several other widely used selection procedures produce affine selection events too, and the LSST framework applies to all of them. We state the results without proof; the proofs follow the same recipe as Theorem 3.1.

Forward stepwise. At step kk of forward stepwise regression, given the previous active set Sk1S_{k-1} and sign pattern zk1z_{k-1}, the next feature to enter is chosen by maximizing Xj(yXSk1β^Sk1)|X_j^\top (y - X_{S_{k-1}} \hat\beta_{S_{k-1}})| over jSk1j \notin S_{k-1}. The event ”XjX_{j^\star} enters at step kk with sign zk,jz_{k, j^\star}” is the intersection of inequalities Xjrk1Xjrk1|X_{j^\star}^\top r_{k-1}| \ge |X_j^\top r_{k-1}| for every jjj \ne j^\star and zk,jXjrk1>0z_{k, j^\star} \cdot X_{j^\star}^\top r_{k-1} > 0 — all linear in yy via rk1=(IPSk1)yr_{k-1} = (I - P_{S_{k-1}}) y. Iterating, the cumulative selection event at step kk is a polyhedron with O(pk)O(p \cdot k) inequalities. Tibshirani, Taylor, Lockhart & Tibshirani (2016) write this out in full and give the corresponding selective p-values.

Marginal screening. Marginal screening at threshold tt selects S^={j:Xjy>t}\hat S = \{j : |X_j^\top y| > t\}. The event {S^=S,z^=z}\{\hat S = S, \hat z = z\} is {zjXjy>t:jS}{Xjy<t:jS}\{z_j \cdot X_j^\top y > t : j \in S\} \cap \{|X_j^\top y| < t : j \notin S\} — exactly 2pS2p - |S| linear inequalities, with the explicit form even simpler than the lasso’s because there’s no (XSXS)1(X_S^\top X_S)^{-1} to deal with.

LARS sequence. The Least-Angle Regression algorithm of Efron, Hastie, Johnstone & Tibshirani (2004) produces a piecewise-linear regularization path with O(p)O(p) knots; at each knot, a feature enters or leaves the active set. Each LARS-knot event is affine in yy, and the cumulative event “the first kk knots are (j1,j2,,jk)(j_1, j_2, \dots, j_k) with signs (s1,,sk)(s_1, \dots, s_k)” is a polyhedron. The LSST pivot specializes to the LARS sequence and gives selective pp-values for each variable as it enters.

The common structure across all these procedures is that the rule for selecting the next thing is a maximum or threshold on a linear functional of the current residual, which is itself linear in yy. Whenever that structure holds, the selection event is polyhedral and the LSST framework applies. Procedures that break the structure — cross-validation for λ\lambda selection, stability selection with bootstrap, neural-network feature selection — sit outside the framework and require different machinery (asymptotic results, splitting-based approaches, debiasing). §5 covers debiasing and §13 returns to the broader extension question.

Selection-outcome tessellation in a 2-D slice of the sample space
Selection-outcome tessellation in a 2-D slice of ℝⁿ. On a fixed realization y₀ from the §1 toy, perturbing y along the signal direction v₁ ∝ X₀ (horizontal) and a null direction v₂ (vertical) carves the 2-D plane into convex polygons, each colored by the lasso's active-set outcome at that perturbation. The polygon containing the origin is the realized selection cell; its boundaries are where a KKT condition becomes an equality and a feature enters or leaves the active set. On the realization shown, the slice contains 6 distinct selection outcomes.

4. Conditional z-tests and confidence intervals via the truncated-Gaussian pivot

§3 gave us an explicit polyhedral description of the lasso’s selection event. §4 turns that description into a pivotal quantity — a function of the observed data and the unknown parameter whose conditional distribution is fully known and free of nuisances. The pivot is what makes everything else work: PIT against the pivot gives an exact selective p-value; inverting the pivot gives a conditional confidence interval. The whole construction lives in five steps and the proof is one application of Gaussian independence plus one geometric calculation.

4.1 The contrast statistic η⊤y and its conditional Gaussianity

Fix the Gaussian linear-model assumption from §2.4: YN(μ,σ2In)Y \sim \mathcal{N}(\mu, \sigma^2 I_n), with σ2\sigma^2 known (or estimated externally to a held-out sample). The post-selection target is a linear functional ημ\eta^\top \mu, where the contrast direction ηRn\eta \in \mathbb{R}^n may depend on the realized selection outcome (S^,z^)(\hat S, \hat z). For the lasso, the canonical choice is the partial-OLS direction for feature jj in the selected model:

ηj:=XS^(XS^XS^)1ej,S^,(4.1)\eta_j := X_{\hat S} (X_{\hat S}^\top X_{\hat S})^{-1} e_{j, \hat S}, \qquad (4.1)

where ej,S^e_{j, \hat S} is the canonical basis vector for coordinate jj within the selected coordinate set. Under the well-specified assumption μ=Xβ\mu = X\beta^\star, the target ηjμ\eta_j^\top \mu equals the partial regression coefficient of feature jj in the model S^\hat S. For a true-signal feature jj with jTS^j \in T \subseteq \hat S (where TT is the true support), the target reduces to βj\beta_j^\star; for a false-positive feature jS^Tj \in \hat S \setminus T, the target is zero.

The unconditional distribution of ηY\eta^\top Y is straightforward: ηYN(ημ,σ2η2)\eta^\top Y \sim \mathcal{N}(\eta^\top \mu, \sigma^2 \|\eta\|^2). The classical Wald CI is ηy±zα/2ση\eta^\top y \pm z_{\alpha/2} \sigma \|\eta\|. Once we condition on the selection event {YES,z}\{Y \in E_{S, z}\}, that distribution changes — that’s the §1 coverage failure, restated.

The structural fact that makes conditioning tractable is a Gaussian-independence calculation. Decompose

Y=PηY+(IPη)Y,Pη:=ηηη2.(4.2)Y = P_\eta Y + (I - P_\eta) Y, \quad P_\eta := \frac{\eta \eta^\top}{\|\eta\|^2}. \qquad (4.2)

The two pieces are uncorrelated:

Cov(ηY,(IPη)Y)=σ2η(IPη)=σ2(ηηηη2η)=0.\operatorname{Cov}(\eta^\top Y, \,(I - P_\eta) Y) = \sigma^2 \eta^\top (I - P_\eta) = \sigma^2 \big(\eta^\top - \tfrac{\eta^\top \eta}{\|\eta\|^2} \eta^\top\big) = 0.

Since both are linear in the jointly Gaussian YY, uncorrelatedness implies independence. So ηY\eta^\top Y and the orthogonal residual R:=(IPη)YR := (I - P_\eta) Y are independent Gaussians, and conditioning on R=rR = r leaves the marginal distribution of ηY\eta^\top Y unchanged: it’s still N(ημ,σ2η2)\mathcal{N}(\eta^\top \mu, \sigma^2 \|\eta\|^2).

What conditioning does change is the selection event. Once R=rR = r is fixed, the only remaining randomness is in the scalar ηY\eta^\top Y, and the polyhedral constraint {Ayb}\{Ay \le b\} collapses to a constraint on that scalar alone. §4.2 makes the collapse explicit.

4.2 The truncation interval [V⁻(r), V⁺(r)] as a piecewise-linear functional

Parameterize YY in the eigenframe of the decomposition (4.2):

Y=ηYη2η+R,R=(IPη)Y.(4.3)Y = \frac{\eta^\top Y}{\|\eta\|^2} \eta + R, \quad R = (I - P_\eta) Y. \qquad (4.3)

Plugging (4.3) into the polyhedral constraint AYbAY \le b from Theorem 3.1 gives, row by row,

(Aη)iη2ηY+(AR)ibi,i=1,,m,\frac{(A\eta)_i}{\|\eta\|^2} \cdot \eta^\top Y + (AR)_i \le b_i, \quad i = 1, \dots, m,

where m=2pSm = 2p - |S| is the number of inequalities. Rearranging,

(Aη)iηYη2(bi(AR)i),i=1,,m.(4.4)(A\eta)_i \cdot \eta^\top Y \le \|\eta\|^2 \, \big(b_i - (AR)_i\big), \quad i = 1, \dots, m. \qquad (4.4)

Each row of (4.4) constrains the scalar ηY\eta^\top Y, and the form of the constraint depends on the sign of (Aη)i(A\eta)_i:

  • If (Aη)i>0(A\eta)_i > 0, the constraint is an upper bound on ηY\eta^\top Y.
  • If (Aη)i<0(A\eta)_i < 0, the constraint is a lower bound on ηY\eta^\top Y.
  • If (Aη)i=0(A\eta)_i = 0, the constraint reduces to (AR)ibi(AR)_i \le b_i — a constraint on the residual RR alone, which doesn’t involve ηY\eta^\top Y and is satisfied automatically given that YEY \in E.

Taking the binding bounds over all rows gives the truncation interval:

V(r):=maxi:(Aη)i<0η2(bi(Ar)i)(Aη)i,V+(r):=mini:(Aη)i>0η2(bi(Ar)i)(Aη)i.(4.5)\mathcal{V}^-(r) := \max_{i : (A\eta)_i < 0} \frac{\|\eta\|^2 \, (b_i - (Ar)_i)}{(A\eta)_i}, \qquad \mathcal{V}^+(r) := \min_{i : (A\eta)_i > 0} \frac{\|\eta\|^2 \, (b_i - (Ar)_i)}{(A\eta)_i}. \qquad (4.5)

Both bounds are piecewise-linear functionals of rr: each individual upper/lower bound is linear in rr, and the outer min/max\min/\max over rows gives a piecewise-linear envelope. The boundary set where V±\mathcal{V}^\pm switches which constraint is active is itself a union of hyperplanes in rr-space; on any one piece, V±(r)\mathcal{V}^\pm(r) is exactly linear.

4.3 The pivot theorem — full proof

We now have all the pieces. The marginal of ηY\eta^\top Y is unconditional Gaussian; the selection event restricts it to [V(R),V+(R)][\mathcal{V}^-(R), \mathcal{V}^+(R)] once we condition on R=rR = r; the resulting conditional distribution is therefore truncated Gaussian. The truncated-Gaussian CDF, evaluated at the observed value, is then Uniform(0,1)(0, 1) by the standard probability integral transform. That is the LSST pivot.

Theorem 4.1 (Lee–Sun–Sun–Taylor 2016, pivot for affine selection events).

Let YN(μ,σ2In)Y \sim \mathcal{N}(\mu, \sigma^2 I_n), let E={y:Ayb}E = \{y : Ay \le b\} be a polyhedral selection event, and let ηRn\eta \in \mathbb{R}^n with η>0\|\eta\| > 0. Define the truncation bounds V±\mathcal{V}^\pm as in (4.5). Then:

(a) The conditional distribution of ηY\eta^\top Y given {YE}\{Y \in E\} and R=rR = r is

ηYYE,R=r    TN ⁣(ημ,σ2η2,V(r),V+(r)),\eta^\top Y \,\big|\, Y \in E, \, R = r \;\sim\; \mathrm{TN}\!\big(\eta^\top \mu, \,\sigma^2 \|\eta\|^2,\, \mathcal{V}^-(r),\, \mathcal{V}^+(r)\big),

the Gaussian N(ημ,σ2η2)\mathcal{N}(\eta^\top \mu, \sigma^2 \|\eta\|^2) truncated to the interval [V(r),V+(r)][\mathcal{V}^-(r), \mathcal{V}^+(r)].

(b) Let F[,u]μ0,σ2F^{\mu_0, \sigma_*^2}_{[\ell, u]} denote the CDF of the N(μ0,σ2)\mathcal{N}(\mu_0, \sigma_*^2) distribution truncated to [,u][\ell, u]. Then the LSST pivot

F[V(R),V+(R)]ημ,σ2η2(ηY)    YE    Uniform(0,1).F^{\eta^\top \mu, \,\sigma^2 \|\eta\|^2}_{[\mathcal{V}^-(R), \, \mathcal{V}^+(R)]}\big(\eta^\top Y\big) \;\big|\; Y \in E \;\sim\; \mathrm{Uniform}(0, 1).

Proof.

Proof of (a). We argued in §4.1 that ηY\eta^\top Y and RR are independent under the unconditional Gaussian distribution. Conditioning on R=rR = r leaves the marginal of ηY\eta^\top Y unchanged: ηYR=r\eta^\top Y \mid R = r is still N(ημ,σ2η2)\mathcal{N}(\eta^\top \mu, \sigma^2 \|\eta\|^2).

Now condition additionally on {YE}\{Y \in E\}. Using the decomposition (4.3), the event {YE}\{Y \in E\} becomes a constraint on the pair (ηY,R)(\eta^\top Y, R); given R=rR = r, this further conditioning carves out the values of ηY\eta^\top Y that are compatible with rr. From §4.2 — specifically (4.5) — that compatible set is exactly [V(r),V+(r)][\mathcal{V}^-(r), \mathcal{V}^+(r)]. Restricting an unconditional Gaussian to a fixed interval gives a truncated Gaussian by definition, which is statement (a).

Proof of (b). Apply the probability integral transform conditional on R=rR = r and the selection event. The conditional CDF

F[V(r),V+(r)]ημ,σ2η2(ηY)    YE,R=rF^{\eta^\top \mu, \,\sigma^2 \|\eta\|^2}_{[\mathcal{V}^-(r), \,\mathcal{V}^+(r)]}\big(\eta^\top Y\big) \;\big|\; Y \in E, \, R = r

applies the conditional CDF to the conditional random variable, and therefore equals a Uniform(0,1)(0, 1) random variable (PIT applied to the conditional distribution in (a)). The conditional uniformity holds for every value of rr, so the joint distribution of the pivot is Uniform(0,1)(0, 1) over the full conditional distribution induced by integrating over RR — that is, conditional only on {YE}\{Y \in E\}.

A few notes on the theorem.

Why the residual conditioning is harmless. Conditioning on RR in addition to the selection event might look like a stronger restriction than we want — we’d usually report a CI conditional on the selection event alone, not on the residual. But because the resulting pivot is uniform conditional on R=rR = r for every rr, it’s also uniform marginally over RR, by the same averaging argument we used in §2.3. The PIT applied to the conditional distribution is uniform; that uniformity is preserved when we marginalize over the conditioning variable. So the pivot’s validity holds at the “selection event only” level of conditioning, which is what selective inference requires.

The Gaussian assumption is load-bearing. Two places in the proof use Gaussianity: (1) the independence of ηY\eta^\top Y and RR (which only follows from uncorrelatedness under joint Gaussianity); (2) the exact tractability of the conditional distribution as N(ημ,σ2η2)\mathcal{N}(\eta^\top \mu, \sigma^2 \|\eta\|^2). Both fail under heavy-tailed errors. Asymptotic robustness — replacing the exact truncnorm pivot with a studentized or Berry–Esseen-corrected version — is developed in Tian & Taylor (2018) and Tibshirani, Rinaldo, Tibshirani & Wasserman (2018), at the cost of an asymptotic rather than exact guarantee.

The truncation interval is informative. When V+(r)V(r)\mathcal{V}^+(r) - \mathcal{V}^-(r) is large relative to ση\sigma\|\eta\|, the truncation is barely binding and the truncated Gaussian is close to the unconditional Gaussian. When the interval is tight, the truncated distribution is highly constrained and the conditional CI is much wider than the naive CI — sometimes infinitely wide, when the observed ηy\eta^\top y is at the very edge of the truncation interval. §4.4 quantifies this.

The interactive below shows the pivot in action. Drag the observed-value slider for ηy\eta^\top y within the truncation interval to watch the pivot’s value range over [0,1][0, 1] — the visual feeling of PIT against a truncated distribution.

Solid teal: truncated 𝒩(η⊤μ, 1) restricted to [V⁻, V⁺] = [-2, 2]. Dashed gray: the unconditional Gaussian. Under H₀ : η⊤μ = 0, the pivot F(η⊤y) is uniform on [0, 1] conditional on the selection event — drag η⊤μ to watch the CDF re-shape and the pivot value change.

4.4 Inverting the pivot to get conditional CIs

The pivot in Theorem 4.1(b) is a function of the unknown parameter μ0:=ημ\mu_0 := \eta^\top \mu. To get a (1α)(1-\alpha) confidence interval for μ0\mu_0, we invert: define

C1α(ηy,r):={μ0:α2F[V(r),V+(r)]μ0,σ2η2(ηy)1α2}.(4.6)C_{1 - \alpha}(\eta^\top y, r) := \left\{ \mu_0 : \tfrac{\alpha}{2} \le F^{\mu_0, \,\sigma^2 \|\eta\|^2}_{[\mathcal{V}^-(r), \mathcal{V}^+(r)]}\big(\eta^\top y\big) \le 1 - \tfrac{\alpha}{2} \right\}. \qquad (4.6)

By construction, Pr(μ0C1αYE)=1α\Pr(\mu_0 \in C_{1 - \alpha} \mid Y \in E) = 1 - \alpha for every value of μ0\mu_0 — the conditional coverage holds at nominal level by design.

The set C1αC_{1-\alpha} is an interval because F[,u]μ0,σ2(t)F^{\mu_0, \sigma_*^2}_{[\ell, u]}(t) is monotone decreasing in μ0\mu_0 (for fixed t,σ,,ut, \sigma_*, \ell, u): increasing the mean of a truncated Gaussian shifts its mass to the right, lowering the CDF at any fixed point. So the lower endpoint LL is the unique value of μ0\mu_0 where the pivot equals 1α/21 - \alpha/2, and the upper endpoint UU is the unique value where it equals α/2\alpha/2. Both are found by 1-D root-finding (Brent’s method is standard).

Two qualitative features of the conditional CI deserve to be called out, because they’re what makes the procedure feel different from a Wald interval.

Conditional CIs can be much wider than naive ones. When the truncation interval is tight — say, V+(r)V(r)ση\mathcal{V}^+(r) - \mathcal{V}^-(r) \approx \sigma\|\eta\| — the conditional CI grows because small shifts in the underlying mean produce large changes in the truncated CDF only at the interval’s center, leaving the tails of the inversion problem flat. In the limiting case where the observed ηy\eta^\top y lies near one end of [V,V+][\mathcal{V}^-, \mathcal{V}^+], the conditional CI’s far endpoint can extend to ±\pm\infty. This is the price of conditional honesty: when the selection event provided strong information about ηY\eta^\top Y, the residual information about ημ\eta^\top \mu is correspondingly weak.

Naive CIs were never doing what they advertised. The naive Wald CI claims to cover μ0\mu_0 at (1α)(1-\alpha) unconditionally. After selection, what it actually achieves is something much weaker — sub-nominal conditional coverage, as §1 showed. The conditional CI delivers what the naive CI was implicitly claiming, by computing in the right reference distribution.

The implementation is uncomplicated: build (A,b,η)(A, b, \eta) from §3, compute V±\mathcal{V}^\pm from (4.5), evaluate the truncated-normal CDF, and use a 1-D root finder to invert.

4.5 Numerical verification — naive vs conditional empirical coverage

The §1 demonstration showed the naive workflow’s coverage failure at CV-tuned λ\lambda. §4.5 reruns the same Monte Carlo at the deterministic λ\lambda that the LSST framework needs, layering on the conditional procedure of §§4.1–4.4 alongside the naive workflow. On B=200B = 200 replicates of the low-dim toy, with the deterministic λ\lambda producing 1,2911{,}291 total inferences:

CategoryNaive coverageConditional coverageMedian naive widthMedian conditional widthInfinite-CI fraction
Signal-feature (n = 600)0.9250.9250.9550.9550.4040.4040.4830.483 (finite)3.0%3.0\%
Null-feature (n = 691)0.7890.7890.9670.9670.4010.4011.3001.300 (finite)12.4%12.4\%

The pattern matches Theorem 4.1’s promise. Naive coverage is sub-nominal on null-feature inferences (0.7890.789 vs nominal 0.950.95); conditional coverage hits nominal on both categories (0.9550.955 / 0.9670.967). Conditional CIs are wider — much wider on null-feature inferences where the truncation interval is tight — and a non-negligible fraction (12.4%12.4\% on nulls, 3.0%3.0\% on signals) extend to ±\pm\infty, reflecting the we have very little information about μ0\mu_0 after conditioning on this selection event situation. The infinite CI is not a bug; it’s the correct answer when the conditional likelihood is flat.

That is the §1–§4 arc landing. §1 stated the problem (post-selection coverage collapses); §2 framed the fix (condition on the selection event); §3 wrote down the selection event for the lasso (polyhedron); §4 produced the conditional sampling distribution (truncated Gaussian) and inverted to get CIs (nominal coverage). The rest of the topic builds on this foundation: §5 handles pnp \gg n via debiasing, §§6–9 layer in finite-sample FDR control via knockoffs, §§10–11 take everything online.

Conditional procedure gains +3.0 pp on signals, +17.8 pp on nulls.

B = 200 Monte Carlo replicates on the §1 low-dim toy at fixed λ; locked outputs from §4.5. The conditional procedure delivers the nominal 0.95 on both signal and null categories; null-feature coverage of the naive procedure sits at 0.789. The width-distribution panel shows the price: median conditional CI is 0.483 on signals (vs naive 0.404) and 1.300 on nulls (vs naive 0.401), with 12.4% of null-feature conditional CIs extending to ±∞.

Naive Wald vs LSST conditional CIs — coverage and width
Naive vs conditional CIs on the §1 low-dim toy at fixed λ, B = 200 Monte Carlo replicates. Left: empirical coverage by category (signal feature when selected vs null feature when selected) × inference type (naive Wald vs LSST conditional). Naive coverage on null-feature inferences is 0.789 (sub-nominal); conditional coverage hits 0.955 (signal) and 0.967 (null), at or above nominal. Right: distribution of CI widths on null-feature inferences. Conditional CIs are wider on average than naive ones; 12.4% of null-feature conditional CIs extend to ±∞ where the selection event was so informative that very little residual information about η⊤μ remains. The infinite CI is the correct answer when the conditional likelihood is flat.

5. The debiased lasso

§§2–4 gave us an exact post-selection pivot for any linear contrast in the Gaussian linear model, conditional on the polyhedral selection event. That construction is finite-sample exact and powerful, but it has limits. The truncated-Gaussian pivot requires a polyhedral selection event with a tractable number of inequalities — fine for moderate pp, but at pp in the thousands or millions the polyhedral description becomes computationally inconvenient and the conditional CIs frequently extend to ±\pm\infty. Worse, the LSST framework gives a CI for the partial regression coefficient in the selected model, which is a data-dependent target. For downstream procedures that want a pp-value per feature regardless of selection — most of §§6–9 — we need inference about the fixed coefficients βj\beta_j^\star of the true (sparse) model, not about partial-OLS targets.

The debiased lasso (Zhang & Zhang 2014; van de Geer, Bühlmann, Ritov & Dezeure 2014; Javanmard & Montanari 2014) solves both problems with a single move. It produces a coordinate-wise estimator b^j\hat b_j for each βj\beta_j^\star with an asymptotic Gaussian distribution centered on the truth, valid uniformly across jj at sparsity rate s0logp=o(n)s_0 \log p = o(\sqrt n). The construction is not conditional inference in the LSST sense — it doesn’t condition on a selection event — but it achieves the same headline guarantee (honest CIs after a lasso fit) by an entirely different route. The two approaches are complementary: LSST gives exact CIs for the data-dependent target at moderate pp; debiased lasso gives asymptotic CIs for the fixed target at pnp \gg n. §§6–9 will use the debiased lasso as the substrate for high-dimensional multiple-testing procedures.

5.1 Why naive lasso CIs are not honest — bias dominates the standard error

The lasso estimator at regularization λ\lambda is

β^=argminβ12nyXβ2+λβ1.\hat\beta = \arg\min_\beta \frac{1}{2n}\|y - X\beta\|^2 + \lambda \|\beta\|_1.

For §5 we switch to the sklearn convention with the 1/n1/n factor on the squared-error term, matching the way the debiased-lasso literature is usually written. The math-convention λmath\lambda_{\text{math}} of §§3–4 maps to this section’s λ\lambda via λmath=nλ\lambda_{\text{math}} = n\lambda. Throughout this section we work in the regime pnp \gg n with β\beta^\star truly sparse: β0=s0n\|\beta^\star\|_0 = s_0 \ll n. The canonical choice λσlogp/n\lambda \asymp \sigma \sqrt{\log p / n} delivers the standard lasso 1\ell_1-rate β^β1=OP(s0logp/n)\|\hat\beta - \beta^\star\|_1 = O_P(s_0 \sqrt{\log p / n}), which is the foundation for everything that follows.

The KKT conditions from §3.1 say that for any active jj, β^j\hat\beta_j satisfies Xj(yXβ^)/n=λsign(β^j)X_j^\top(y - X\hat\beta)/n = \lambda \cdot \operatorname{sign}(\hat\beta_j). Solving and substituting y=Xβ+εy = X\beta^\star + \varepsilon gives the structural decomposition

β^jβj=λsign(β^j)[(XX/n)jj1]+(noise and cross-terms),\hat\beta_j - \beta_j^\star = -\lambda \cdot \operatorname{sign}(\hat\beta_j) \cdot \big[ (X^\top X/n)^{-1}_{jj} \big] + \text{(noise and cross-terms)},

schematic but indicative: the lasso has a first-order bias of order λlogp/n\lambda \asymp \sqrt{\log p / n} that does not vanish as nn \to \infty at the standard rate. The would-be Wald standard error of an oracle (true-support-known) estimator is σ/n\sigma / \sqrt{n}. The ratio bias/SE is therefore logp\sqrt{\log p} — large, structural, and uncorrected by sample size.

On the running high-dimensional toy (n=100n = 100, p=200p = 200, s0=5s_0 = 5 signals at β=1.5\beta^\star = 1.5, Gaussian design, B=200B = 200 replicates), the canonical λ=σ2logp/n0.326\lambda = \sigma\sqrt{2\log p / n} \approx 0.326 for σ=1\sigma = 1 produces a lasso estimate of a signal coefficient with empirical mean 1.1421.142 — biased by 0.358-0.358 toward zero, matching the predicted λ0.326-\lambda \approx -0.326. The empirical SE is 0.1310.131. The bias/SE ratio is 2.732.73, so a naive ±1.96SE\pm 1.96 \cdot \text{SE} Wald interval around the biased estimate misses the truth 1.51.5 on essentially every replicate. The lasso isn’t broken — it’s a point estimator with bias, and that bias breaks any procedure built around a Wald-CI shape.

The debiased lasso adds a correction term that, to leading order, removes the bias while preserving the variance. The corrected estimator has an asymptotic Gaussian distribution centered on the truth, and the standard Wald construction recovers nominal coverage.

5.2 The Zhang–Zhang bias-correction identity and the choice of weight vector

The construction starts from a clean algebraic identity. For any vector u^jRn\hat u_j \in \mathbb{R}^n, define the bias-corrected estimator

b^j:=β^j+1nu^j(yXβ^).(5.1)\hat b_j := \hat\beta_j + \frac{1}{n} \hat u_j^\top (y - X\hat\beta). \qquad (5.1)

Substituting y=Xβ+εy = X\beta^\star + \varepsilon and rearranging:

b^jβj=[1nu^jXej](ββ^)+1nu^jε.(5.2)\hat b_j - \beta_j^\star = \big[\tfrac{1}{n}\hat u_j^\top X - e_j^\top\big](\beta^\star - \hat\beta) + \tfrac{1}{n} \hat u_j^\top \varepsilon. \qquad (5.2)

This is the Zhang–Zhang identity. The right-hand side has two pieces:

  • a Gaussian piece Gj:=1nu^jεN ⁣(0,σ2n2u^j2)G_j := \tfrac{1}{n} \hat u_j^\top \varepsilon \sim \mathcal{N}\!\big(0, \tfrac{\sigma^2}{n^2}\|\hat u_j\|^2\big) conditional on u^j\hat u_j;
  • a remainder Rj:=[1nu^jXej](ββ^)R_j := \big[\tfrac{1}{n} \hat u_j^\top X - e_j^\top\big] (\beta^\star - \hat\beta).

The Gaussian piece is exactly what we want: a centered Gaussian with computable variance. The remainder is the obstruction. If we can choose u^j\hat u_j so that 1nu^jX\tfrac{1}{n} \hat u_j^\top X is close to eje_j^\top in a sense compatible with the lasso rate on ββ^\beta^\star - \hat\beta, the remainder becomes negligibly small at the n\sqrt n scale, and b^j\hat b_j inherits the Gaussian distribution.

The construction of u^j\hat u_j amounts to building an approximate inverse of the empirical Gram matrix Σ^:=XX/n\hat\Sigma := X^\top X / n at row eje_j. Three flavors appear in the literature, all delivering the same asymptotic distribution:

Zhang–Zhang (2014). Run a nodewise lasso regressing XjX_j on XjX_{-j}:

γ^j=argminγRp112nXjXjγ2+λjγ1,\hat\gamma_j = \arg\min_{\gamma \in \mathbb{R}^{p-1}} \frac{1}{2n}\|X_j - X_{-j}\gamma\|^2 + \lambda_j \|\gamma\|_1,

and let zj:=XjXjγ^jz_j := X_j - X_{-j}\hat\gamma_j be the resulting residual. Then take

u^j:=nzjzjXj.(5.3)\hat u_j := \frac{n \cdot z_j}{z_j^\top X_j}. \qquad (5.3)

The geometric content: zjz_j is the part of XjX_j that is sparsely uncorrelated with the other columns. Multiplying by n/(zjXj)n/(z_j^\top X_j) rescales so that u^jXj/n=1\hat u_j^\top X_j / n = 1.

van de Geer et al. (2014). Construct an approximate inverse of Σ^\hat\Sigma row by row via nodewise lassos. Equivalent to Zhang–Zhang for the per-coordinate estimator; differs in the joint covariance structure.

Javanmard–Montanari (2014). Solve a quadratic program minuuΣ^u\min_u u^\top \hat\Sigma u subject to Σ^uejμ\|\hat\Sigma u - e_j\|_\infty \le \mu for a tuning parameter μ\mu. Direct optimization of the bias–variance trade-off.

We use Zhang–Zhang in the notebook. With u^j\hat u_j from (5.3), the Zhang–Zhang debiased estimator takes the simple form

b^j=β^j+zj(yXβ^)zjXj,SE^(b^j)=σzjzjXj.(5.4)\hat b_j = \hat\beta_j + \frac{z_j^\top (y - X\hat\beta)}{z_j^\top X_j}, \qquad \widehat{\mathrm{SE}}(\hat b_j) = \frac{\sigma \|z_j\|}{|z_j^\top X_j|}. \qquad (5.4)

5.3 The asymptotic-normality theorem

Theorem 5.1 (van de Geer et al. 2014, Theorem 2.2 / Javanmard–Montanari 2014, Theorem 3.1).

Assume y=Xβ+εy = X\beta^\star + \varepsilon with εN(0,σ2In)\varepsilon \sim \mathcal{N}(0, \sigma^2 I_n), β0=s0\|\beta^\star\|_0 = s_0, the design XX satisfying restricted eigenvalue condition at sparsity s0s_0 with constant κ>0\kappa > 0, sub-Gaussian rows, Σ1\Sigma^{-1} row-sparse with sparsity sjs_j, and the rate condition max(s0,sj)logp=o(n)\max(s_0, s_j) \cdot \log p = o(\sqrt n). Let β^\hat\beta be the lasso at λσlogp/n\lambda \asymp \sigma\sqrt{\log p / n} and let b^j\hat b_j be the Zhang–Zhang debiased estimator (5.4). Then

n(b^jβj)=1nu^jε+oP(1),b^jβjσzj/zjXjdN(0,1).\sqrt{n}(\hat b_j - \beta_j^\star) = \tfrac{1}{\sqrt n} \hat u_j^\top \varepsilon + o_P(1), \qquad \frac{\hat b_j - \beta_j^\star}{\sigma \|z_j\| / |z_j^\top X_j|} \xrightarrow{d} \mathcal{N}(0, 1).

The Wald interval b^j±zα/2σzj/zjXj\hat b_j \pm z_{\alpha/2} \cdot \sigma \|z_j\| / |z_j^\top X_j| achieves (1α)(1 - \alpha) asymptotic coverage of βj\beta_j^\star, uniformly across jj.

The theorem holds for either signal coordinates or null coordinates — the asymptotic distribution doesn’t depend on whether the coordinate is “selected” by the lasso. That’s the key contrast with the LSST framework, where the target is data-dependent. The debiased lasso gives a uniform-across-jj asymptotic guarantee about the fixed coefficients.

For the toy: n=100,p=200,s0=5n = 100, p = 200, s_0 = 5 gives s0logp26.5s_0 \log p \approx 26.5 vs n=10\sqrt n = 10 — formally the rate condition is violated. The toy is deliberately set in this borderline regime so the rate-condition slippage is visible (§5.5).

5.4 Proof sketch via decoupling and the rate condition

The proof traces four structural steps. Detailed rate bounds for the lasso 1\ell_1-norm and the nodewise lasso decorrelation rate are in high-dimensional regression; here we expose how the rate condition arises.

Step 1: decompose. From (5.2),

n(b^jβj)=1nu^jε+n[1nu^jXej](ββ^).\sqrt n (\hat b_j - \beta_j^\star) = \tfrac{1}{\sqrt n}\hat u_j^\top \varepsilon + \sqrt n \big[\tfrac{1}{n}\hat u_j^\top X - e_j^\top\big](\beta^\star - \hat\beta).

Step 2: Gaussian piece. Conditional on u^j\hat u_j, 1nu^jεN(0,σ2u^j2/n)\tfrac{1}{\sqrt n}\hat u_j^\top \varepsilon \sim \mathcal{N}(0, \sigma^2 \|\hat u_j\|^2/n). With Zhang–Zhang u^j\hat u_j, the variance converges to σ2Θjj1\sigma^2 \Theta_{jj}^{-1} where Θ=Σ1\Theta = \Sigma^{-1}.

Step 3: bound the remainder via Hölder.

Rj(n)=n[1nu^jXej](ββ^)n1nu^jXejββ^1.|R_j^{(n)}| = \sqrt n \big|\big[\tfrac{1}{n}\hat u_j^\top X - e_j^\top\big](\beta^\star - \hat\beta)\big| \le \sqrt n \, \|\tfrac{1}{n}\hat u_j^\top X - e_j^\top\|_\infty \cdot \|\beta^\star - \hat\beta\|_1.

Two rate bounds plug in: (a) the lasso 1\ell_1-rate β^β1=OP(s0logp/n)\|\hat\beta - \beta^\star\|_1 = O_P(s_0\sqrt{\log p/n}) (Bickel, Ritov & Tsybakov 2009); (b) the decorrelation rate 1nu^jXej=OP(logp/n)\|\tfrac{1}{n}\hat u_j^\top X - e_j^\top\|_\infty = O_P(\sqrt{\log p/n}) from the nodewise lasso. Multiplying:

Rj(n)nOP(logp/n)OP(s0logp/n)=OP(s0logp/n).|R_j^{(n)}| \le \sqrt n \cdot O_P(\sqrt{\log p/n}) \cdot O_P(s_0 \sqrt{\log p/n}) = O_P\big(s_0 \log p / \sqrt n\big).

The remainder is oP(1)o_P(1) iff s0logp=o(n)s_0 \log p = o(\sqrt n) — the rate condition.

Step 4: combine. Under the rate condition, Rj(n)=oP(1)R_j^{(n)} = o_P(1), Slutsky gives the limit N(0,σ2Θjj1)\mathcal{N}(0, \sigma^2 \Theta_{jj}^{-1}), and standardization gives the standard normal limit.

5.5 Numerical demonstration — the rate-condition slippage in action

We run the high-dim toy (n=100n = 100, p=200p = 200, s0=5s_0 = 5, β=1.5\beta^\star = 1.5, B=200B = 200, debiasing 20 coordinates per replicate for runtime) and verify two predictions of Theorem 5.1: that the standardized debiased estimator z^j:=(b^jβj)/SE^(b^j)\hat z_j := (\hat b_j - \beta_j^\star) / \widehat{\mathrm{SE}}(\hat b_j) has approximately N(0,1)\mathcal{N}(0, 1) histogram across replicates and coordinates, and that the Wald CIs have empirical coverage near nominal 0.950.95.

The locked numerical outputs reveal the rate-condition slippage:

CategoryMean lasso est.Mean debiased est.MC SECoverage (Wald 95%)z-stat mean / std
Signal coordinates (n = 1000, β★ = 1.5)1.1381.1381.4721.4720.1240.1240.8770.8770.273-0.273 / 1.2281.228
Null coordinates (n = 1000, β★ = 0.0)0.000-0.0000.005-0.0050.1230.1230.8940.8940.050-0.050 / 1.2201.220

The bias correction restores the mean to near the truth on signal coordinates (1.4721.472 vs lasso’s 1.1381.138 — the predicted β=1.5\sim\beta^\star = 1.5); the null-coordinate mean stays at zero. The CLT histogram is approximately N(0,1)\mathcal{N}(0, 1) — z-statistic mean within 0.3-0.3 of zero, std within 0.220.22 of one — but with slightly heavier tails than the standard normal.

The empirical coverage is 0.8770.877 on signal coordinates and 0.8940.894 on null coordinates — both under nominal 0.950.95. This is the s0logp/n2.6s_0 \log p / \sqrt n \approx 2.6 rate-condition slippage in action. The toy is deliberately positioned in this borderline regime so the under-coverage is visible: with this (n,p,s0)(n, p, s_0) choice, the remainder Rj(n)R_j^{(n)} is no longer oP(1)o_P(1), it’s OP(s0logp/n)=OP(2.6)O_P(s_0 \log p / \sqrt n) = O_P(2.6), and the CLT promise softens. Larger nn or sparser β\beta^\star would restore nominal coverage. The toy demonstrates what under-coverage looks like at the boundary of the framework’s applicability; it’s the visible analog of §1’s coverage failure, but generated by a different mechanism (asymptotic rate condition) and resolved by a different fix (more data, or a stronger sparsity assumption).

The debiased lasso is therefore the high-dimensional analog of LSST’s exact post-selection CI. It trades: exact finite-sample coverage for asymptotic coverage (under the rate condition); data-dependent target for fixed target (the true coefficient); conditional on selection event for unconditional, valid at all coordinates. These trades are exactly what §§6–9 need.

Left: histogram of standardized z-statistics for the signal coordinate pool from the §5.5 locked run, with the standard-normal density overlaid. The toy's z-stat mean of -0.27 and std 1.23 indicate the rate-condition slippage Theorem 5.1 anticipates. Right: predicted Wald coverage as a function of the rate ratio r = s₀ log p / √n. The black point is the slider's current (n, p, s₀); the teal point at r ≈ 2.65 is the §5.5 anchor (locked coverage averages 0.886 across signal and null). Slide n large or s₀ small to recover coverage; at the boundary regime the predicted curve dips into the 0.85–0.90 band.

Debiased-lasso CLT histogram and empirical coverage
Debiased-lasso verification on the §5 high-dim toy (n = 100, p = 200, s₀ = 5, β★ = 1.5, B = 200). Left: histogram of standardized z-statistics pooled across replicates and target coordinates; overlaid with the standard normal density. The fit is close, with mild tail asymmetry from the s₀ log p / √n ≈ 2.6 rate-condition slippage. Right: empirical coverage of debiased Wald CIs by coordinate category. Both signal coverage (0.877) and null coverage (0.894) sit under nominal 0.95 — the visible signature of the borderline rate condition. Larger n or sparser β★ would restore nominal coverage; this toy is positioned at the boundary to demonstrate what under-coverage looks like.

6. FDR, the Benjamini–Hochberg procedure, and Storey’s adaptive variant

§§1–5 worked at the level of a single inference target: a coefficient βj\beta_j, a partial-OLS slope ημ\eta^\top \mu, a debiased estimate b^j\hat b_j. §6 zooms out to the multiple-testing problem: given mm hypotheses H1,,HmH_1, \dots, H_m with pp-values p1,,pmp_1, \dots, p_m, we want to reject a subset of them, the rejection set R{1,,m}R \subseteq \{1, \dots, m\} defined by some procedure, while controlling the rate of errors among rejections. The Benjamini–Hochberg procedure is the original and most widely deployed tool for this; Storey’s adaptive variant improves its power by estimating the proportion of true nulls.

§6.4 closes by explaining why the BH framework doesn’t directly extend to post-selection settings — a structural mismatch that motivates the knockoff filter of §§7–9 as a genuinely different way to control FDR.

6.1 Multiple testing — FWER, FDR, and when each is the right target

Fix mm null hypotheses H1,,HmH_1, \dots, H_m with associated p-values p1,,pmp_1, \dots, p_m. Let T{1,,m}T \subseteq \{1, \dots, m\} denote the set of true nulls, with T=m0|T| = m_0. A multiple-testing procedure returns a (random) rejection set R{1,,m}R \subseteq \{1, \dots, m\} and partitions the outcome into VV (false discoveries — true nulls rejected), SS (true discoveries — alternatives rejected), R=V+SR = V + S.

Family-Wise Error Rate (FWER). FWER:=Pr(V1)\mathrm{FWER} := \Pr(V \ge 1), the probability of any false rejection. Controlled by Bonferroni at threshold α/m\alpha/m. The right target when each false rejection has costly downstream consequences.

False Discovery Rate (FDR). FDR:=E[V/max(R,1)]\mathrm{FDR} := \mathbb{E}[V / \max(R, 1)], the expected proportion of false rejections among rejections (Benjamini & Hochberg 1995). Less conservative than FWER; the right target when many true effects are expected and each rejection is preliminary. For ML feature-selection in pnp \gg n, FDR is the right scale.

Two strictly weaker variants in the literature: False Discovery Exceedance (FDX) controls Pr(V/R>q)\Pr(V/R > q) instead of E[V/R]\mathbb{E}[V/R]; False Coverage Rate (FCR) (Benjamini & Yekutieli 2005) is an average-coverage analog for selective intervals. We focus on FDR.

6.2 The BH step-up procedure and its FDR-control theorem

Benjamini–Hochberg procedure at level qq. Sort the p-values p(1)p(m)p_{(1)} \le \cdots \le p_{(m)}. Compute

R:=max{k{1,,m}:p(k)kq/m},(6.1)R := \max\big\{k \in \{1, \dots, m\} : p_{(k)} \le k q / m \big\}, \qquad (6.1)

with R:=0R := 0 if no such kk exists. Reject H(1),,H(R)H_{(1)}, \dots, H_{(R)}. Equivalently: HiH_i is rejected iff piRq/mp_i \le R q / m.

Theorem 6.1 (Benjamini–Hochberg 1995).

Under independence of the p-values and super-uniformity of null p-values, the BH procedure at level qq controls FDR: FDR(m0/m)qq.\mathrm{FDR} \le (m_0/m) \cdot q \le q.

We prove this via Storey’s martingale. Define

V(t):={iT:pit},M(t):=V(t)/t,t(0,1].V(t) := |\{i \in T : p_i \le t\}|, \quad M(t) := V(t)/t, \quad t \in (0, 1].

Lemma 6.2 (Storey 2002, in martingale form).

Let Ft:=σ(V(s):st)\mathcal{F}_t := \sigma(V(s) : s \ge t), the reverse-time filtration. For any 0<st10 < s \le t \le 1, E[M(s)Ft]=M(t)\mathbb{E}[M(s) \mid \mathcal{F}_t] = M(t).

Proof.

Conditional on Ft\mathcal{F}_t, V(t)V(t) is known, and each of the V(t)V(t) null p-values that are t\le t has conditional distribution Uniform[0,t]\mathrm{Uniform}[0, t]. The probability that one is s\le s is s/ts/t, so E[V(s)Ft]=V(t)s/t\mathbb{E}[V(s) \mid \mathcal{F}_t] = V(t) \cdot s/t, giving E[M(s)Ft]=V(t)/t=M(t)\mathbb{E}[M(s) \mid \mathcal{F}_t] = V(t)/t = M(t).

Proof.

Proof of Theorem 6.1. Define T:=Rq/mT^* := R q / m. TT^* is determined by sweeping tt from 11 down through the sorted p-values, stopping at the first t=p(k)t = p_{(k)} satisfying k=mt/qk = m t / q — a reverse-time stopping time on the filtration {Ft}\{\mathcal{F}_t\}. By Lemma 6.2 and optional stopping (the process M(t)M(t) is a martingale in reverse time and TT^* is a bounded stopping time),

E[M(T)]=E[M(1)]=m0.\mathbb{E}[M(T^*)] = \mathbb{E}[M(1)] = m_0.

On the event {R1}\{R \ge 1\}, Tm=RqT^* m = R q, so T/R=q/mT^*/R = q/m. Therefore

FDR=E ⁣[V(T)R1{R1}]=qmE ⁣[V(T)T1{R1}]=qmE[M(T)1{R1}]qm0m.\mathrm{FDR} = \mathbb{E}\!\left[\frac{V(T^*)}{R} \cdot \mathbb{1}\{R \ge 1\}\right] = \frac{q}{m} \cdot \mathbb{E}\!\left[\frac{V(T^*)}{T^*} \cdot \mathbb{1}\{R \ge 1\}\right] = \frac{q}{m} \cdot \mathbb{E}[M(T^*) \cdot \mathbb{1}\{R \ge 1\}] \le \frac{q m_0}{m}.

Independence can be relaxed to PRDS (Benjamini & Yekutieli 2001). Under arbitrary dependence, BH controls FDR at qHmqlogmq \cdot H_m \approx q \log m. The knockoff filter in §§7–9 gives finite-sample FDR control under arbitrary dependence.

6.3 Storey’s π̂₀ correction and the power improvement

The BH bound m0q/mm_0 q / m is loose by exactly π0=m0/m\pi_0 = m_0/m. Storey (2002) estimates π0\pi_0 and inflates BH’s threshold.

Storey’s π^0(λ)\hat\pi_0(\lambda) estimator. Pick λ(0,1)\lambda \in (0, 1) (canonically λ=0.5\lambda = 0.5):

π^0(λ):=1+{i:pi>λ}m(1λ).(6.2)\hat\pi_0(\lambda) := \frac{1 + |\{i : p_i > \lambda\}|}{m (1 - \lambda)}. \qquad (6.2)

Under the null, pUniform[0,1]p \sim \mathrm{Uniform}[0,1], so the expected fraction >λ> \lambda is 1λ1 - \lambda. The estimator inverts this.

Storey procedure at level qq. Run BH at the inflated level q/π^0(λ)q / \hat\pi_0(\lambda). Power gain 1/π^0(λ)1/π0\approx 1/\hat\pi_0(\lambda) \approx 1/\pi_0.

On a simulated stream of m=1000m = 1000 Gaussian tests with π0=0.85\pi_0 = 0.85, signal μalt=3\mu_{\text{alt}} = 3 and target q=0.10q = 0.10, the three procedures give:

ProcedureRejectionsFalse rejectionsFDPThreshold
Bonferroni (FWER)3400.0000.0000.00010.0001
BH (FDR)11090.0820.0820.011000.01100
Storey (FDR)118120.1020.1020.014050.01405

Storey’s estimate of π^0=0.840\hat\pi_0 = 0.840 closely matches the true π0=0.85\pi_0 = 0.85. BH’s FDP of 0.0820.082 matches the predicted π0q=0.085\pi_0 \cdot q = 0.085 bound; Storey’s 0.1020.102 hits nominal q=0.10q = 0.10 exactly, with 7%\approx 7\% more rejections than BH at the same nominal FDR target.

6.4 Why BH-style p-value procedures don’t directly generalize to selection

BH and Storey control FDR over a pre-specified set of hypotheses with valid p-values. After a selection step like the lasso, both requirements break:

  • Hypotheses become data-dependent. S^\hat S depends on yy, so different replicates test different hypotheses; Theorem 6.1’s proof uses the fact that each null is fixed across replicates.
  • Naive p-values aren’t uniform under the null. The §1 coverage failure has a direct p-value analog: OLS-on-S^\hat S Wald p-values are not uniform under the selection-conditional null.

Three routes around the problem:

  • Route A: LSST conditional p-values + BH. Selective FDR control, exact, requires affine-selection-event setup.
  • Route B: Debiased-lasso p-values + BH. Asymptotic FDR control under the rate condition.
  • Route C: Knockoffs. No p-values; sign-flip symmetric statistics + finite-sample FDR control. The topic of §§7–9.

Route C is genuinely different and is the next four sections.

Storey π̂₀ = 0.888
Bonferroni: 28 rej · FDP 0.036 BH: 114 rej · FDP 0.096 Storey: 118 rej · FDP 0.093

m = 1000 Gaussian-z tests, μ_alt = 3, q = 0.10. BH (green) and Storey (blue) thresholds slope as kq/m; Bonferroni (red dashed) is the horizontal q/m. Storey's π̂₀ correction inflates the threshold by ≈ 1/π̂₀; at high π₀ the inflation is small, at low π₀ the gain in power is substantial.

Sorted p-values with BH and Storey thresholds
Sorted p-values on one realization of the §6 simulated Gaussian-test stream (m = 1000, π₀ = 0.85, μ_alt = 3, q = 0.10). Sorted p-values plotted vs rank; BH threshold line kq/m in green, Storey's inflated threshold kq/(m π̂₀) in blue, Bonferroni horizontal threshold q/m in red. BH rejects up to the largest rank where the sorted p-value is below the BH line; Storey's inflation gives a steeper threshold and rejects more. Bonferroni is uniformly the most conservative.

7. The knockoff filter — fixed-X construction

§§1–6 worked through inference one coefficient at a time. §6.4 identified the structural problem with extending BH-style FDR control to post-selection settings. The knockoff filter (Barber & Candès 2015) cuts through both problems by abandoning p-values entirely. Instead, for each real feature XjX_j it constructs a knockoff copy X~j\tilde X_j, designed to be statistically indistinguishable from XjX_j under any null. The knockoff procedure converts the resulting sign-symmetry into a finite-sample FDR-control guarantee, valid for any sample size nn, any dimension pp (subject to n2pn \ge 2p), and arbitrary feature dependence.

This section develops fixed-X knockoffs. We use n=500n = 500, p=200p = 200 for §7 to stay within the n2pn \ge 2p constraint; §8 transitions to n=200n = 200, p=500p = 500 where model-X applies.

7.1 The knockoff feature X̃ — moment-matching and swap exchangeability

Fix XRn×pX \in \mathbb{R}^{n \times p} with n2pn \ge 2p, columns normalized so Xj2=1\|X_j\|_2 = 1. Let Σ:=XX\Sigma := X^\top X.

The fixed-X knockoff matrix X~Rn×p\tilde X \in \mathbb{R}^{n \times p} satisfies two moment-matching conditions:

X~X~=Σ,X~X=Σdiag(s),(7.1)\tilde X^\top \tilde X = \Sigma, \qquad \tilde X^\top X = \Sigma - \mathrm{diag}(s), \qquad (7.1)

for some sRps \in \mathbb{R}^p with sj0s_j \ge 0 and 2Σdiag(s)02\Sigma - \mathrm{diag}(s) \succeq 0. Jointly,

[X,X~][X,X~]=(ΣΣdiag(s)Σdiag(s)Σ).(7.2)[X, \tilde X]^\top [X, \tilde X] = \begin{pmatrix} \Sigma & \Sigma - \mathrm{diag}(s) \\ \Sigma - \mathrm{diag}(s) & \Sigma \end{pmatrix}. \qquad (7.2)

For any subset S{1,,p}S \subseteq \{1, \dots, p\}, the swap operation XjX~jX_j \leftrightarrow \tilde X_j for jSj \in S preserves the Gram matrix (the block structure is swap-invariant by construction). This is the swap exchangeability of the Gram matrix.

Equicorrelation construction. Simplest choice: sj=s=min(2λmin(Σ),1)s_j = s = \min(2\lambda_{\min}(\Sigma), 1) for all jj. Explicit knockoff:

X~=X(IΣ1diag(s))+UC,(7.4)\tilde X = X (I - \Sigma^{-1} \mathrm{diag}(s)) + U C, \qquad (7.4)

where URn×pU \in \mathbb{R}^{n \times p} has orthonormal columns orthogonal to col(X)\mathrm{col}(X) and CC=2diag(s)diag(s)Σ1diag(s)C^\top C = 2\mathrm{diag}(s) - \mathrm{diag}(s) \Sigma^{-1} \mathrm{diag}(s). The construction of UU requires nppn - p \ge p, i.e., n2pn \ge 2p.

On the §7 toy (n=500n = 500, p=200p = 200, identity design), the equicorrelation construction gives λmin(Σ)=0.1383\lambda_{\min}(\Sigma) = 0.1383, s=0.2767s = 0.2767, with empirical moment-matching errors of 2.4×1015\sim 2.4 \times 10^{-15} — machine precision.

Lemma 7.1 (Swap exchangeability of null statistics).

Suppose y=Xβ+εy = X\beta^\star + \varepsilon, εN(0,σ2In)\varepsilon \sim \mathcal{N}(0, \sigma^2 I_n). For any SS with βj=0\beta_j^\star = 0 for all jSj \in S: [X,X~]y  =d  [X,X~]swap(S)y.(7.5)[X, \tilde X]^\top y \;\stackrel{d}{=}\; [X, \tilde X]_{\text{swap}(S)}^\top y. \qquad (7.5)

Proof.

Both sides are jointly Gaussian (linear in yy). Match first and second moments.

First moment. E[[X,X~]y]=[X,X~]Xβ\mathbb{E}[[X, \tilde X]^\top y] = [X, \tilde X]^\top X \beta^\star. The jj-th coordinate equals kΣjkβk\sum_k \Sigma_{jk}\beta_k^\star; the (p+j)(p+j)-th equals kΣjkβksjβj\sum_k \Sigma_{jk}\beta_k^\star - s_j \beta_j^\star. For jSj \in S (null), βj=0\beta_j^\star = 0, so swapping XjX~jX_j \leftrightarrow \tilde X_j moves equal values across positions.

Second moment. Cov([X,X~]y)=σ2[X,X~][X,X~]\mathrm{Cov}([X, \tilde X]^\top y) = \sigma^2 [X, \tilde X]^\top [X, \tilde X], invariant under the swap by (7.2).

Equality of Gaussians’ first and second moments gives equality in distribution.

swap subset S:equicorrelation s = 1.000max | pre − post | = 0.0000

AR(1) Σ with ρ = 0.3, p = 6 features. The 2p × 2p Gram matrix [X, X̃]ᵀ [X, X̃] has block structure [[Σ, Σ − diag(s)], [Σ − diag(s), Σ]]. Swapping X_j ↔ X̃_j on any subset S preserves this block structure exactly — that's swap exchangeability of the Gram. The max-absolute-difference indicator on the right reads zero up to floating-point noise regardless of which features are in S. Under the conditional-independence null on a null feature j, that swap-invariance lifts to the full distribution (Lemma 7.1).

7.2 Symmetric statistics W_j and the sign-flip null distribution

For each feature jj, build a scalar WjW_j measuring “evidence that XjX_j is real relative to X~j\tilde X_j,” constructed so that swapping XjX~jX_j \leftrightarrow \tilde X_j flips the sign of WjW_j:

wj([X,X~]swap(S),y)={WjjS+WjjS.(7.6)w_j([X, \tilde X]_{\mathrm{swap}(S)}, y) = \begin{cases} -W_j & j \in S \\ +W_j & j \notin S. \end{cases} \qquad (7.6)

Lasso coefficient difference (LCD). Compute the lasso on Xˉ=[X,X~]\bar X = [X, \tilde X] at some λ\lambda, define WjLCD:=β^jaugβ^j+paugW_j^{\text{LCD}} := |\hat\beta_j^{\text{aug}}| - |\hat\beta_{j+p}^{\text{aug}}|. Anti-symmetric by construction. The FDR theorem holds for any anti-symmetric construction; LCD is the simplest.

Corollary 7.2 (Sign-flip exchangeability).

Under the Gaussian linear model with antisymmetric WW: (W1,,Wp)  =d  (ϵ1W1,,ϵpWp)(W_1, \dots, W_p) \;\stackrel{d}{=}\; (\epsilon_1 W_1, \dots, \epsilon_p W_p) for any sign-flip ϵ\epsilon that flips only on null indices. Conditional on W1,,Wp|W_1|, \dots, |W_p|, the signs of null WjW_j‘s are i.i.d. uniform on {1,+1}\{-1, +1\}.

Proof.

Combine Lemma 7.1 (swap-invariance of [X,X~]y[X, \tilde X]^\top y) with anti-symmetry (7.6).

7.3 The knockoff and knockoff+ thresholds and the FDR-control theorem

Define R+(t):={j:Wjt}R^+(t) := |\{j : W_j \ge t\}|, R(t):={j:Wjt}R^-(t) := |\{j : W_j \le -t\}|. By Corollary 7.2, R(t)R^-(t) is dominated by null features; signals push Wj>0W_j > 0 in expectation. This makes R(t)R^-(t) a conservative estimator of the null contribution to R+(t)R^+(t).

Knockoff thresholds at level qq:

Tknock:=min{t>0:R(t)/max(R+(t),1)q},(7.7)T^{\mathrm{knock}} := \min\{t > 0 : R^-(t) / \max(R^+(t), 1) \le q\}, \qquad (7.7)

T+knock:=min{t>0:(1+R(t))/max(R+(t),1)q}.(7.8)T^{+\mathrm{knock}} := \min\{t > 0 : (1 + R^-(t)) / \max(R^+(t), 1) \le q\}. \qquad (7.8)

Reject features with WjTW_j \ge T.

Theorem 7.3 (Barber–Candès 2015).

Under the Gaussian linear model with fixed-X knockoffs:

(a) Knockoff at level qq controls a modified FDR: E[V/(R++q1)]q\mathbb{E}[V/(R^+ + q^{-1})] \le q.

(b) Knockoff+ at level qq controls FDR: E[V/max(R+,1)]q\mathbb{E}[V/\max(R^+, 1)] \le q.

Both hold for any n,pn, p with n2pn \ge 2p, any feature dependence, any anti-symmetric WW.

7.4 Full proof — swap exchangeability plus the supermartingale lemma

We prove part (b).

Proof.

Step 1: bound FDR by a conditional expectation. Let T:=T+knockT := T^{+\mathrm{knock}}, V(t):={jN:Wjt}V(t) := |\{j \in \mathcal{N} : W_j \ge t\}|, S(t):={jN:Wjt}S(t) := |\{j \in \mathcal{N} : W_j \le -t\}|. By the stopping rule and R(t)S(t)R^-(t) \ge S(t):

V(T)R+(T)qV(T)1+R(T)qV(T)1+S(T),\frac{V(T)}{R^+(T)} \le \frac{q V(T)}{1 + R^-(T)} \le \frac{q V(T)}{1 + S(T)},

so FDRqE[V(T)/(1+S(T))]\mathrm{FDR} \le q \cdot \mathbb{E}[V(T)/(1 + S(T))].

Step 2: condition on W|W|, reduce to a Bernoulli sequence. By Corollary 7.2, given W1,,Wp|W_1|, \dots, |W_p|, the null signs are i.i.d. Bernoulli(1/2)(1/2). Order the null indices by decreasing W|W| with Rademacher signs ϵ1,ϵ2,\epsilon_1, \epsilon_2, \dots. Let Vk={ik:ϵi=+1}V_k = |\{i \le k : \epsilon_i = +1\}|, Sk=kVkS_k = k - V_k. The stopping time TT corresponds to a discrete stopping time KK, and V(T)/(1+S(T))=VK/(1+SK)V(T)/(1+S(T)) = V_K/(1+S_K).

Step 3: the supermartingale lemma. By Lemma 7.4 (stated below as a black box), E[VK/(1+SK)]1\mathbb{E}[V_K/(1+S_K)] \le 1 for any stopping time KK.

Step 4: combine. FDRqE[V(T)/(1+S(T))]q\mathrm{FDR} \le q \cdot \mathbb{E}[V(T)/(1 + S(T))] \le q.

Lemma 7.4 (Barber–Candès 2015, Lemma 1).

For any stopping time KK of an i.i.d. Rademacher sequence with running counts VkV_k and Sk=kVkS_k = k - V_k: E ⁣[VK1+SK]1.\mathbb{E}\!\left[\frac{V_K}{1 + S_K}\right] \le 1.

The proof of Lemma 7.4 is in Barber & Candès (2015, §5). The argument constructs a martingale on a transformed process (not the running ratio directly) and applies optional stopping; the technical bookkeeping verifies uniform integrability for unbounded stopping times. We state the result as a black box: under the i.i.d. Rademacher assumption, the running ratio’s expectation is bounded by 11 at any stopping time. A useful intuition (not the formal proof): the symmetry E[VK/(1+SK)]=E[SK/(1+VK)]\mathbb{E}[V_K/(1+S_K)] = \mathbb{E}[S_K/(1+V_K)] (by relabeling +1+1 and 1-1) plus the identity VK+(1+SK)=K+1V_K + (1+S_K) = K+1 reduce the bound to controlling E[(K+1)/(1+SK)]\mathbb{E}[(K+1)/(1+S_K)]. The actual control of this quantity at unbounded stopping times requires the technical martingale argument referenced above.

7.5 Numerical verification — empirical FDR ≤ q across nominal q

We run a Monte Carlo on the §7 fixed-X toy (n=500n = 500, p=200p = 200, s0=10s_0 = 10 signals at β=2.5\beta^\star = 2.5, identity design, B=100B = 100 replicates) at q{0.05,0.10,0.20}q \in \{0.05, 0.10, 0.20\} with the LCD statistic at λkf=0.01σ2logp/n\lambda_{kf} = 0.01 \cdot \sigma\sqrt{2 \log p / n}. The locked numerical outputs:

qqE[FDP]sd(FDP)E[#rej]Power
0.050.050.00000.00000.00000.00000.000.000.0000.000
0.100.100.01840.01840.10630.10630.360.360.0140.014
0.200.200.09290.09290.18960.18962.032.030.1170.117

Empirical FDR is at or below nominal at every qq, exactly as Theorem 7.3 promises — finite-sample FDR control verified. But the power on this toy is very low: 1.4%1.4\% at q=0.10q = 0.10, 11.7%11.7\% at q=0.20q = 0.20. The reason is the LCD statistic at this fixed λkf\lambda_{kf} choice: at small λ\lambda the augmented lasso barely separates the real signals from their knockoffs, so the WjW_j statistics have low discriminative power, and the procedure ends up using almost none of its FDR budget. The FDR-control guarantee is the §7 headline; the power profile depends sensitively on the choice of W statistic and the λ\lambda used in the augmented lasso.

§13.2 returns to this with two practical fixes that recover power on the same toy: (a) switch the W statistic from LCD to LSM (lasso signed max from lars_path), which integrates information across the entire regularization path rather than reading off a single λ\lambda; (b) tune λ\lambda via cross-validation on the augmented design. Either fix moves the power numbers above into the 80–90% range while preserving FDR control. The §8 model-X demonstration (with β=3.5\beta^\star = 3.5 and different design correlation structure) hits power 1.0\approx 1.0 at q=0.10q = 0.10 — strong enough signals to make the LCD-at-fixed-λ\lambda statistic work as intended.

n = 200 · p = 500 · s₀ = 10 · β★ = 3.5AR(1) ρ = 0.3 · LCD on augmented lasso, CV-tuned λ

At q = 0.10, empirical FDR ≈ 0.059 (controlled) with power ≈ 1.000. Anchor points are the Monte Carlo means from the §7 and §8 notebook runs (B = 100 replicates each); the connecting curves are linear interpolations between locked anchors. The FDR≤q guarantee holds in both regimes; power separation reflects the joint impact of signal strength, design correlation, and W-statistic choice.

Knockoff+ empirical FDR and power vs nominal q (fixed-X)
Knockoff+ empirical FDR (left) and power (right) as a function of nominal q on the §7 fixed-X toy (n = 500, p = 200, s₀ = 10, B = 100, LCD statistic at λ_kf = 0.01·σ√(2 log p / n)). Empirical FDR (brown circles) at or below the nominal y = q line (black dashed) at all three q tested; Theorem 7.3(b) verified in finite samples. Power (right panel) is low at this LCD-with-small-λ choice — 1.4% at q = 0.10, 11.7% at q = 0.20. The FDR-control guarantee holds independently of the W-statistic choice; the power numbers depend on it. §13.2 discusses LCD-vs-LSM and λ tuning options that recover power.

8. Model-X knockoffs

§7’s fixed-X procedure delivers finite-sample FDR control under three structural assumptions: n2pn \ge 2p, fixed design XX, and Gaussian linear yXy \mid X. The third rules out logistic regression, GLMs, non-parametric outcomes; the first rules out the genuinely high-dimensional regime pnp \gg n.

Candès, Fan, Janson & Lv (2018) generalize the knockoff machinery by flipping the roles of design and noise. Treat XX as a random vector with known distribution PXP_X; make no assumption about YXY \mid X. The null becomes the non-parametric conditional-independence statement XjYXjX_j \perp Y \mid X_{-j}. The n2pn \ge 2p constraint vanishes, replaced by the requirement that PXP_X is known.

We use the original §§5–9 spec: n=200n = 200, p=500p = 500, AR(1) design covariance with ρ=0.3\rho = 0.3, s0=10s_0 = 10 signals at β=3.5\beta^\star = 3.5, Gaussian noise.

8.1 The conditional-independence null Xⱼ ⊥ Y | X₋ⱼ

In fixed-X, the null was parametric: ”βj=0\beta_j^\star = 0 in y=Xβ+εy = X\beta^\star + \varepsilon.” In model-X, the null is non-parametric:

Hj0:XjYXj.(8.1)H_j^0: \quad X_j \perp Y \mid X_{-j}. \qquad (8.1)

The set of true nulls N:={j:XjYXj}\mathcal{N} := \{j : X_j \perp Y \mid X_{-j}\} is the complement of the Markov blanket of YY in the joint Bayesian network on (X1,,Xp,Y)(X_1, \dots, X_p, Y).

Useful properties: model-free (no assumption on YXY \mid X — applies to classification, survival, counts, anything); correctly accounts for interactions; reduces to fixed-X’s null in the Gaussian-linear special case.

8.2 The model-X swap exchangeability — what’s required of the knockoff distribution

The knockoff random vector X~\tilde X satisfies two properties:

Property 1 (Pairwise exchangeability). For any S{1,,p}S \subseteq \{1, \dots, p\}, (X,X~)=d(X,X~)swap(S)(X, \tilde X) \stackrel{d}{=} (X, \tilde X)_{\mathrm{swap}(S)}. The joint distributional analog of §7’s Gram-matrix invariance.

Property 2 (YY-blindness). X~YX\tilde X \perp Y \mid X. The knockoff is constructed from XX alone.

Lemma 8.1 (Joint swap-invariance on nulls).

For any SNS \subseteq \mathcal{N}, (X,X~,Y)=d(X,X~,Y)swap(S)(X, \tilde X, Y) \stackrel{d}{=} (X, \tilde X, Y)_{\mathrm{swap}(S)}.

Proof.

Factor P(X,X~,Y)=P(YX,X~)P(X,X~)P(X, \tilde X, Y) = P(Y \mid X, \tilde X) \cdot P(X, \tilde X). By Property 2, P(YX,X~)=P(YX)P(Y \mid X, \tilde X) = P(Y \mid X). For jSNj \in S \subseteq \mathcal{N}, the conditional-independence null gives P(YX)=P(YXj)P(Y \mid X) = P(Y \mid X_{-j}) — swapping XjX~jX_j \leftrightarrow \tilde X_j leaves P(YX)P(Y \mid X) invariant. P(X,X~)P(X, \tilde X) is invariant by Property 1. Multiply.

Gaussian model-X knockoffs. When XN(0,Σ)X \sim \mathcal{N}(0, \Sigma), sample

X~XN ⁣(X(IΣ1diag(s)),  2diag(s)diag(s)Σ1diag(s))(8.4)\tilde X \mid X \sim \mathcal{N}\!\big(X \cdot (I - \Sigma^{-1} \mathrm{diag}(s)), \; 2 \mathrm{diag}(s) - \mathrm{diag}(s) \Sigma^{-1} \mathrm{diag}(s)\big) \qquad (8.4)

independently for each row of XX. Constraints same as §7. Equicorrelation sj=min(2λmin(Σ),1)s_j = \min(2\lambda_{\min}(\Sigma), 1).

No n2pn \ge 2p constraint. Knockoffs are sampled row by row from a pp-dim conditional Gaussian. The price: PXP_X must be known. For non-Gaussian PXP_X, structured samplers (HMM, graphical-model, mixture) or deep knockoffs (Romano, Sesia & Candès 2020) extend the framework.

On the §8 toy (n=200n = 200, p=500p = 500, AR(1) ρ=0.3\rho = 0.3), λmin(Σ)=0.5385\lambda_{\min}(\Sigma) = 0.5385, s=1.0000s = 1.0000 (saturated at the cap), with empirical correlation between real features and knockoffs near zero (0.0055\approx -0.0055 on the diagonal vs target Σjjsj=0\Sigma_{jj} - s_j = 0).

8.3 The model-X FDR-control theorem

Theorem 8.2 (Candès, Fan, Janson & Lv 2018).

Let XRpX \in \mathbb{R}^p be a random vector, X~\tilde X a valid model-X knockoff, YY any outcome. Let WW be an anti-symmetric statistic. The knockoff+ procedure at level qq controls FDR over N\mathcal{N}: FDR=E ⁣[{jN:WjT+knock}max({j:WjT+knock},1)]q.\mathrm{FDR} = \mathbb{E}\!\left[\frac{|\{j \in \mathcal{N} : W_j \ge T^{+\mathrm{knock}}\}|}{\max(|\{j : W_j \ge T^{+\mathrm{knock}}\}|, 1)}\right] \le q. For any n,pn, p, any joint PX,YP_{X, Y}, any anti-symmetric WW.

Proof.

Same three-step structure as Theorem 7.3 with Lemma 8.1 substituting for Lemma 7.1.

Step 1 (decomposition). Identical: FDRqE[V(T)/(1+S(T))]\mathrm{FDR} \le q \cdot \mathbb{E}[V(T)/(1 + S(T))].

Step 2 (sign-flip exchangeability). Lemma 8.1 + anti-symmetry of WW gives: conditional on W|W|, the null signs are i.i.d. Rademacher.

Step 3 (supermartingale). Lemma 7.4 doesn’t care about the source of Rademacher signs.

Step 4 (combine). Identical to §7.4.

The model-X generalization changes the source of sign-flip exchangeability but leaves the FDR-control machinery untouched. Modular by design.

On the §8 toy at B=100B = 100 replicates, β=3.5\beta^\star = 3.5:

qqE[FDP]sd(FDP)E[#rej]Power
0.050.050.00000.00000.00000.00000.000.000.0000.000
0.100.100.05940.05940.08300.083010.7410.741.0001.000
0.200.200.15880.15880.14460.144612.3212.321.0001.000

Power saturates at 1.0001.000 for q0.10q \ge 0.10 — every signal is recovered — with FDR controlled at 0.0590.059 (nominal 0.100.10) and 0.1590.159 (nominal 0.200.20). The signal strength (β=3.5\beta^\star = 3.5) and the favorable correlation structure (AR(1) at ρ=0.3\rho = 0.3) combine to make the LCD statistic discriminative on this regime. Compare to §7’s near-zero power on the LCD-at-fixed-λ\lambda fixed-X toy at β=2.5\beta^\star = 2.5: the FDR-control theorem holds in both regimes, but the power depends on the joint geometry of design, signal strength, and W-statistic choice.

8.4 Fixed-X vs model-X — design-side vs response-side assumptions

Fixed-X (§7)Model-X (§8)
Design XXFixedRandom with known PXP_X
Outcome YXY \mid XGaussian linearAny
Dimensionalityn2pn \ge 2pNone
Nullβj=0\beta_j^\star = 0 in linear modelXjYXjX_j \perp Y \mid X_{-j}
KnockoffDeterministic from XX via UCUCSampled from P(X~X)P(\tilde X \mid X)
RequiresXX itselfPXP_X

Fixed-X is appropriate when the design is genuinely fixed (experimental design, controlled trials) and a Gaussian linear outcome model is reasonable. Model-X is appropriate when PXP_X is known or estimable (genomics with reference panels, A/B platforms with feature-distribution logs, ML feature selection with held-out unlabeled corpora).

Robustness to misspecified PXP_X. Bates, Sesia, Sabatti & Candès (2020) show FDR inflation \propto Wasserstein-distance error in PXP_X. Second-order knockoffs (using empirical Σ^\hat\Sigma as if Gaussian) are a common robust default.

Deep knockoffs (Romano, Sesia & Candès 2020). Train a neural-network generator to approximate the swap-exchangeability conditions for arbitrary PXP_X. Approximate FDR control; out of the §8 runtime budget.

Model-X knockoff+ empirical FDR and power vs nominal q
Model-X knockoff+ empirical FDR (left) and power (right) on the §§5–9 high-dim toy (n = 200, p = 500, AR(1) ρ = 0.3, β★ = 3.5, B = 100). FDR stays at or below the nominal y = q line at all three q tested, on a problem (p > n) that fixed-X cannot handle. Power saturates at 1.000 for q ≥ 0.10 — every signal recovered. §7's fixed-X power result on the n = 500, p = 200 regime with smaller β★ is overlaid (dashed) for visual reference; the comparison highlights how power depends jointly on signal strength, design structure, and W-statistic choice while FDR control holds universally.

9. Constructing knockoffs in practice

§§7–8 proved the FDR-control theorems. §9 turns to the construction problem: given a specific design or PXP_X, how do we actually build the knockoffs, and what design choices affect power?

The headline trade-off: larger ss gives less correlation between XjX_j and X~j\tilde X_j, which gives more discriminative WjW_j, which gives more power. But ss is constrained by 2Σdiag(s)02\Sigma - \mathrm{diag}(s) \succeq 0. Equicorrelation, SDP, and deep knockoffs are three ways of navigating this constraint.

9.1 Gaussian knockoffs from the joint covariance Σ

The Gaussian model-X construction (8.4) is fully specified once we pick ss. Constraints: sj0s_j \ge 0; 2Σdiag(s)02\Sigma - \mathrm{diag}(s) \succeq 0; for column-normalized XX with Σjj=1\Sigma_{jj} = 1, sj1s_j \le 1.

Why ss matters for power. Cov(Xj,X~j)=1sj\mathrm{Cov}(X_j, \tilde X_j) = 1 - s_j. If sj0s_j \approx 0, knockoffs are nearly identical to real features and WjW_j has no discriminative power. If sj1s_j \approx 1, weakly correlated and signal translates clearly to Wj>0W_j > 0. The design problem: choose ss as large as possible subject to PSD.

9.2 Equicorrelation sⱼ = min(2λ_min(Σ), 1) — the SDP-free shortcut

Set sj=ss_j = s (scalar) for all jj:

s=min(2λmin(Σ),1).s = \min(2\lambda_{\min}(\Sigma), 1).

Computational cost. One eigenvalue computation, O(p3)O(p^3). For p=500p = 500, 50\sim 50 ms. For p=5000p = 5000, 5\sim 5 s.

Power profile. Optimal at Σ=I\Sigma = I. As Σ\Sigma becomes ill-conditioned, λmin\lambda_{\min} shrinks, ss shrinks, power degrades. At Σ\Sigma near-singular (e.g., genomics LD blocks with ρ0.95\rho \ge 0.95), procedure has near-zero power. Fix: pre-cluster nearly-collinear features.

9.3 SDP knockoffs for tighter power when the dependency is justified

Optimize ss per feature:

maxsRpjsjs.t.2Σdiag(s)0,  0sj1.\max_{s \in \mathbb{R}^p} \sum_j s_j \quad \text{s.t.} \quad 2\Sigma - \mathrm{diag}(s) \succeq 0, \; 0 \le s_j \le 1.

A semidefinite program. Standard SDP solvers (O(p6.5)O(p^{6.5}) worst case): for p=500p = 500 feasible (1\sim 1 min via cvxpy + SCS); p=5000p = 5000 very slow; p50,000p \ge 50{,}000 infeasible. ASDP (Barber & Candès 2015) block-diagonalizes Σ\Sigma into chunks of 100\sim 100, scales to pp in millions.

Power improvement over equicorrelation: 10–30% when Σ\Sigma has heterogeneous structure. Uniform correlation (AR(1), compound symmetry): no improvement.

The notebook uses equicorrelation; SDP requires cvxpy. The knockpy package provides a production implementation.

9.4 Deep knockoffs and approximate model-X (conceptual pointer)

For non-Gaussian PXP_X, structured samplers exist (HMM knockoffs for Markov chains; Gibbs samplers for graphical models; mixture-component samplers).

Deep knockoffs (Romano, Sesia & Candès 2020). Train a neural-network generator GθG_\theta with a loss penalizing departures from swap-exchangeability. Three loss components: marginal-match (e.g., MMD), pairwise swap-invariance, decorrelation. Trade-off: approximate exchangeability \Rightarrow approximate FDR control. Out of the §8 runtime budget; §13 returns to misspecification.

9.5 Diagnostics and the power–correlation trade-off

Three diagnostics before trusting the procedure: (i) λmin(2Σdiag(s))ϵ\lambda_{\min}(2\Sigma - \mathrm{diag}(s)) \ge -\epsilon for small ϵ\epsilon; (ii) κ(Σ)=λmax/λmin<106\kappa(\Sigma) = \lambda_{\max}/\lambda_{\min} < 10^6 or so; (iii) empirical XjX~j/nΣjjsjX_j^\top \tilde X_j / n \approx \Sigma_{jj} - s_j within MC noise.

Power–correlation trade-off, demonstrated. Varying AR(1) correlation ρ{0,0.2,0.4,0.5,0.6,0.7}\rho \in \{0, 0.2, 0.4, 0.5, 0.6, 0.7\} on the §§5–9 toy at q=0.10q = 0.10, B=30B = 30:

ρ\rhoλmin(Σ)\lambda_{\min}(\Sigma)sequis_{\text{equi}}E[FDP]PowerE[#rej]
0.000.001.00001.00001.00001.00000.06010.06011.0001.00010.8310.83
0.200.200.66670.66671.00001.00000.05160.05161.0001.00010.6710.67
0.400.400.42860.42860.85710.85710.08860.08861.0001.00011.1311.13
0.500.500.33330.33330.66670.66670.08830.08831.0001.00011.2311.23
0.600.600.25000.25000.50000.50000.11010.11011.0001.00011.4011.40
0.700.700.17650.17650.35290.35290.09610.09611.0001.00011.2311.23

On this toy with β=3.5\beta^\star = 3.5 — strong enough signals — power stays at 1.0001.000 across the entire ρ\rho range, even as sequis_{\text{equi}} shrinks from 1.01.0 to 0.350.35. FDR stays controlled at all ρ\rho (E[FDP] 0.11\le 0.11); Theorem 8.2 doesn’t care about correlation strength. The expected-number-of-rejections column shows a slight excess over s0=10s_0 = 10 at higher ρ\rho — false discoveries are tolerated within the FDR budget. With weaker signals or larger ρ\rho (near-singular Σ\Sigma), power would visibly degrade — the shrinking ss is the relevant geometric resource.

Recovering power in correlated regimes. Three options: SDP knockoffs (§9.3); feature clustering (apply knockoffs at cluster level); switching the test statistic (LSM from the full path is sometimes more powerful than LCD).

s_equi ≈ 0.93 · E[FDP] ≈ 0.070 · power ≈ 1.000

Left: s_equi saturates at the PSD cap of 1 for ρ ≤ 0.3, then shrinks linearly with λ_min(Σ). Right: empirical FDR (red, B = 30 replicates) and power (teal) on the §§5–9 toy at q = 0.10. FDR stays at or below nominal at every ρ — Theorem 8.2 unaffected by correlation. Power saturates at 1.000 on this strong-signal toy (β★ = 3.5); pre-cluster nearly-collinear features or move to SDP knockoffs when ρ approaches singular regimes (ρ ≥ 0.9).

Equicorrelation s and empirical FDR/power vs AR(1) ρ
Left: equicorrelation s as a function of AR(1) correlation ρ on the §§5–9 toy. The s value saturates at the cap of 1 for ρ ≤ 0.3, then begins to be limited by 2λ_min(Σ) as ρ grows. Right: empirical FDR (red squares) stays at or below nominal q = 0.10 across all ρ — Theorem 8.2 unaffected by correlation. Empirical power (brown circles) stays at 1.000 across the ρ range on this strong-signal toy (β★ = 3.5); weaker signals or larger ρ would show power degradation as the shrinking s limits W-statistic discrimination.

10. Online FDR procedures

§§6–9 assumed batch testing. Many applications don’t fit: A/B-testing platforms, continuous deployment, sequential genomics screens. Online FDR (Foster & Stine 2008; Javanmard & Montanari 2018; Ramdas, Yang, Wainwright & Jordan 2017, 2018) extends the FDR framework via a wealth process — an evolving α-budget that decreases as tests are conducted and increases when discoveries are made.

§10 introduces three procedures; §11 proves FDR control.

10.1 The streaming setting and what “online” means here

Fix an infinite sequence of nulls H1,H2,H_1, H_2, \dots with p-values arriving over time. At time tt, observe ptp_t and decide immediately whether to reject — based only on (p1,,pt)(p_1, \dots, p_t).

Let Vt={st:null and rejected}V_t = |\{s \le t : \text{null and rejected}\}| and Rt={st:rejected}R_t = |\{s \le t : \text{rejected}\}|. Online FDR at time tt: FDRt:=E[Vt/max(Rt,1)]\mathrm{FDR}_t := \mathbb{E}[V_t/\max(R_t, 1)]. Online FDR control at level α\alpha: FDRτα\mathrm{FDR}_\tau \le \alpha for every stopping time τ\tau — strictly stronger than fixed-tt control.

Two structural points: the rejection rule cannot adapt to future observations; FDR is averaged over the stream cumulatively, not within a batch.

10.2 α-investing (Foster–Stine 2008)

The original wealth-process idea. Maintain Wt0W_t \ge 0 with W0=αW_0 = \alpha. At time tt, choose αtWt1\alpha_t \le W_{t-1}. Reject HtH_t iff ptαtp_t \le \alpha_t. Update:

Wt=Wt1αt+Rtψ,(10.1)W_t = W_{t-1} - \alpha_t + \mathcal{R}_t \cdot \psi, \qquad (10.1)

with payout ψ\psi (Foster–Stine use ψ=α\psi = \alpha). Self-funding: tests drain, rejections replenish. Natural rule: αt=ηWt1\alpha_t = \eta W_{t-1} for some fraction η(0,1)\eta \in (0, 1).

Guarantee. Controls marginal FDR =E[Vt]/E[Rt]= \mathbb{E}[V_t]/\mathbb{E}[R_t] at level α\alpha. Strictly weaker than true FDR; LORD and SAFFRON give the stronger guarantee.

10.3 LORD (Javanmard–Montanari 2018) — the rejection-payout schedule

“Levels based On Recent Discovery.” Fix a non-increasing sequence γ1γ2\gamma_1 \ge \gamma_2 \ge \dots with jγj=1\sum_j \gamma_j = 1 (the spending schedule); canonical: γjlog(j2)/(jelogj)\gamma_j \propto \log(j \vee 2)/(j e^{\sqrt{\log j}}). Let τ1<τ2<\tau_1 < \tau_2 < \dots be the past rejection times. The test level at time tt:

αt:=γtW0+ατj<tγtτj,(10.2)\alpha_t := \gamma_t \cdot W_0 + \alpha \cdot \sum_{\tau_j < t} \gamma_{t - \tau_j}, \qquad (10.2)

with W0=b0αW_0 = b_0 \alpha for some b0(0,1)b_0 \in (0, 1).

Reading the formula. First term: decaying background allocation. Second term: post-rejection bonuses tapering off per γ\gamma. The summability jγj<\sum_j \gamma_j < \infty is what makes the proof work.

Spending-schedule trade-off. Fast-decay γj1/j2\gamma_j \propto 1/j^2: aggressive on rejection-rich streams. Slow-decay (LORD’17 default): robust to rejection-poor stretches.

Guarantee. Controls true FDR under independence; proof in §11.

10.4 SAFFRON (Ramdas, Yang, Wainwright & Jordan 2018) — adaptivity through estimated π₀

“Serial estimate of the Alpha Fraction that is Futilely Rationed On true Null hypotheses.” Adapts spending to the observed pattern — streaming analog of Storey’s adaptive correction.

Fix candidacy threshold λ(0,1)\lambda \in (0, 1) (canonical λ=0.5\lambda = 0.5). Declare HtH_t a candidate if ptλp_t \le \lambda; let CtC_t be the cumulative candidate count. SAFFRON test level:

αt:=(W0+(ψα)Ct1)γ^tSAFFRON,(10.3)\alpha_t := (W_0 + (\psi - \alpha) C_{t-1}) \cdot \hat\gamma_t^{\mathrm{SAFFRON}}, \qquad (10.3)

where γ^tSAFFRON\hat\gamma_t^{\mathrm{SAFFRON}} is the SAFFRON-adapted spending schedule.

By paying full wealth only for candidates, SAFFRON effectively doubles the budget on informative tests. The published power gain over LORD: 1.51.53×3\times more rejections at the same nominal FDR, especially when π0\pi_0 is high. The §11 Monte Carlo shows a different picture on the notebook’s spending-schedule choice — LORD turns out to dominate SAFFRON on this toy. The §11.5 commentary unpacks why.

10.5 Reading the wealth-process visualization

All three procedures evolve WtW_t piecewise — decreasing at each test by αt\alpha_t, jumping up at each rejection. Three features to look for: decay rate between rejections (slow-decay LORD keeps WW above zero longer than aggressive α-investing); rejection-jump structure (LORD smooths into post-rejection bonuses); candidacy effect for SAFFRON (wealth flat on non-candidate tests, since they neither drain nor refill).

On one realization with T=1000,π0=0.95,μalt=3,α=0.05T = 1000, \pi_0 = 0.95, \mu_{\text{alt}} = 3, \alpha = 0.05 (46 true signals embedded in the stream): α-investing makes 00 rejections, LORD makes 1414, SAFFRON makes 00. The single-realization counts are noisy — both α-investing and SAFFRON happen to never spend enough budget to clear αtpt\alpha_t \ge p_t on a true signal on this particular draw, while LORD’s rejection-bonus structure cascades into 14 discoveries once the first one fires. The §11.5 MC averages across B=100B = 100 streams to smooth out this realization-by-realization variation.

α-investing: 0 rej LORD: 22 rej SAFFRON: 14 rej
47 true signals in stream

T = 1000, π₀ = 0.95, μ_alt = 3, α = 0.05. Single-realization variation: try several seeds to see how α-investing and SAFFRON occasionally clear zero rejections on the same stream where LORD captures most of the signals. The §11 Monte Carlo averages across B = 100 streams; the realization-by-realization variation seen here is what that averaging smooths out.

Wealth-path trajectories for alpha-investing, LORD, and SAFFRON
Top panel: wealth trajectories W_t for α-investing (red), LORD (green), and SAFFRON (blue) on one realization of the §10 stream (T = 1000, π₀ = 0.95, μ_alt = 3, α = 0.05). Each line decreases between rejection events and jumps up at rejections; LORD's post-rejection bonus structure produces enhanced spending over the next several tests, visible as the wealth's faster decay after each rejection. On this realization, LORD makes 14 rejections; α-investing and SAFFRON make zero. Bottom panel: cumulative FDP at each stream position with the nominal α = 0.05 line dashed. LORD stays well below α throughout.

11. Wealth processes and supermartingale FDR proofs

§10 introduced three procedures without proof. §11 proves them via a supermartingale construction adapted to the test-time filtration. Structurally parallel to §6.2’s BH proof (Storey martingale in batch setting) and §7.4’s knockoff supermartingale (Bernoulli sequence in Rademacher signs), now in forward time.

11.1 The α-wealth recursion in unified form

All three §10 procedures fit one umbrella:

  • Wealth Wt0W_t \ge 0 with initial W0W_0.
  • Spending αt=ϕt(Ft1)\alpha_t = \phi_t(\mathcal{F}_{t-1})Ft1\mathcal{F}_{t-1}-measurable.
  • Payout ψt\psi_t on rejection.

Recursion: Wt=Wt1αt+RtψtW_t = W_{t-1} - \alpha_t + \mathcal{R}_t \cdot \psi_t (= equation 10.1).

Summing telescopically,

s=1tαs=W0Wt+αRtW0+αRt,(11.2)\sum_{s=1}^t \alpha_s = W_0 - W_t + \alpha R_t \le W_0 + \alpha R_t, \qquad (11.2)

since Wt0W_t \ge 0. Total spent-α never exceeds initial budget plus α times rejections — the foundation of the FDR proof.

11.2 The supermartingale construction that bounds 𝔼[FDP_t]

Lemma 11.1 (Conditional-probability identity).

Under independence + super-uniformity, for null iNi \in \mathcal{N}: E[1{piαi}Fi1]αi.\mathbb{E}[\mathbb{1}\{p_i \le \alpha_i\} \mid \mathcal{F}_{i-1}] \le \alpha_i.

Proof.

αi\alpha_i is Fi1\mathcal{F}_{i-1}-measurable, pip_i is independent of Fi1\mathcal{F}_{i-1}, Pr(piαi)αi\Pr(p_i \le \alpha_i) \le \alpha_i by super-uniformity.

The natural candidate process Nt:=VtαRtN_t := V_t - \alpha R_t doesn’t give a clean supermartingale because the conditional increment at null indices is (1α)αi0(1-\alpha) \alpha_i \ne 0. The actual workhorse uses the spending budget (11.2) to bound the cumulative α\alpha-spending, combined with a leave-one-out trick (§11.3 Step 3) to bound true FDR rather than mFDR.

11.3 LORD FDR-control theorem with full proof

Theorem 11.2 (LORD FDR control).

Under independence + super-uniformity, LORD (10.2) controls online FDR: FDRτα\mathrm{FDR}_\tau \le \alpha for every stopping time τ\tau.

Proof.

Three structural steps.

Step 1: spending budget. Summing LORD’s rule (10.2):

i=1tαi=W0i=1tγi+ατjts=1tτjγsW0+αRt=α(b0+Rt).(11.5)\sum_{i=1}^t \alpha_i = W_0 \sum_{i=1}^t \gamma_i + \alpha \sum_{\tau_j \le t} \sum_{s=1}^{t-\tau_j} \gamma_s \le W_0 + \alpha R_t = \alpha (b_0 + R_t). \qquad (11.5)

Both inner sums bounded by jγj=1\sum_j \gamma_j = 1.

Step 2: null-rejection expectation (mFDR bound). By Lemma 11.1 and (11.5):

E[Vt]E ⁣[iN,itαi]E ⁣[itαi]α(E[Rt]+b0).\mathbb{E}[V_t] \le \mathbb{E}\!\left[\sum_{i \in \mathcal{N}, i \le t} \alpha_i\right] \le \mathbb{E}\!\left[\sum_{i \le t} \alpha_i\right] \le \alpha (\mathbb{E}[R_t] + b_0).

Gives mFDR α+αb0/E[Rt]\le \alpha + \alpha b_0/\mathbb{E}[R_t] — not yet true FDR.

Step 3: leave-one-out for true FDR (Ramdas et al. 2017 Lemma 1). For each null iNi \in \mathcal{N}, define Rt(i):=Rt1{reject Hi}R_t^{(-i)} := R_t - \mathbb{1}\{\text{reject } H_i\}. Then

Vtmax(Rt,1)=iN,it1{piαi}Rt(i)+1.\frac{V_t}{\max(R_t, 1)} = \sum_{i \in \mathcal{N}, i \le t} \frac{\mathbb{1}\{p_i \le \alpha_i\}}{R_t^{(-i)} + 1}.

Conditional on Fi1\mathcal{F}_{i-1}, Rt(i)R_t^{(-i)} depends only on pip_{-i} and is independent of pip_i (independence assumption). So

E ⁣[1{piαi}Rt(i)+1Fi1]=E ⁣[1Rt(i)+1Fi1]Pr(piαiFi1)αi.\mathbb{E}\!\left[\frac{\mathbb{1}\{p_i \le \alpha_i\}}{R_t^{(-i)} + 1} \,\Big|\, \mathcal{F}_{i-1}\right] = \mathbb{E}\!\left[\frac{1}{R_t^{(-i)} + 1} \,\Big|\, \mathcal{F}_{i-1}\right] \cdot \Pr(p_i \le \alpha_i \mid \mathcal{F}_{i-1}) \le \alpha_i.

Summing over ii and using (11.5) with the tightened constant from Ramdas et al. (2017, Lemma 1):

FDRtE ⁣[iN,itαi]α.\mathrm{FDR}_t \le \mathbb{E}\!\left[\sum_{i \in \mathcal{N}, i \le t} \alpha_i\right] \le \alpha.

For a stopping time τ\tau, the same argument applies at the stopped value.

The Step 3 leave-one-out is the technical heart of the FDR proof; it’s what distinguishes the FDR result from the easier mFDR bound. Under positive dependence (PRDS), Benjamini–Yekutieli-style modifications apply; under arbitrary dependence the FDR is inflated by a logm\log m-type factor.

11.4 SAFFRON FDR-control theorem with full proof

Theorem 11.3 (SAFFRON FDR control (Ramdas, Yang, Wainwright & Jordan 2018)).

Under independence + super-uniformity, SAFFRON controls online FDR.

Proof.

Same three-step structure as Theorem 11.2 with the candidacy-adjusted spending rule.

Step 1. SAFFRON’s rule (10.3) gives iαiα(b0+Ct)\sum_i \alpha_i \le \alpha(b_0 + C_t) — LORD’s (11.5) with RtR_t replaced by candidate count CtC_t.

Step 2. Under super-uniformity, expected null candidates are λm0\le \lambda m_0. SAFFRON adjusts the spending budget to use CtC_t as a proxy for the null contribution to the discovery pipeline — Storey’s π^0\hat\pi_0 correction at every step.

Step 3. Same leave-one-out as Theorem 11.2; candidacy event {piλ}\{p_i \le \lambda\} is independent of future p-values under the null. Full proof in Ramdas, Yang, Wainwright & Jordan (2018, Theorem 1).

The structural lesson: SAFFRON works because CtC_t accounts for null contributions to the discovery process at every step, not just at the end.

11.5 Stopping-time-uniform guarantees and continuous-monitoring safety

Theorems 11.2 and 11.3 hold at every stopping time, not just fixed tt. Practical implications:

Continuous monitoring. Analyst can monitor cumulative FDP at every tt and stop based on what they see, without invalidating FDR control. Impossible with batch BH.

Always-valid inference. Stopping-time-uniformity is the FDR analog of always-valid confidence sequences (Howard, Ramdas, McAuliffe & Sekhon 2021). Both frameworks compose: always-valid CIs feeding into online FDR rules give end-to-end peeking-safe inference.

Stopping early. Stopping rules like “halt when 50 rejections accumulate” or “halt when daily FDP exceeds 0.04 for three days” are stopping times. FDR control holds at the chosen stopping time regardless of the stopping rule.

The price of online safety. Online FDR is systematically less powerful than BH on the same data when both apply — the streaming constraint forces conservative spending. The right comparison is “online on a streaming setup vs no FDR control at all”; BH doesn’t apply in the streaming setting.

Monte Carlo verification. On 100 streams of T=1000T = 1000, π0=0.95\pi_0 = 0.95, α=0.05\alpha = 0.05, FDP checkpoints at t{100,250,500,750,1000}t \in \{100, 250, 500, 750, 1000\}:

ProcedurettE[FDP]P(FDP > α)E[#rej]
α-investing100010000.03330.03330.0400.0400.380.38
LORD100010000.03410.03410.3300.33019.6019.60
SAFFRON100010000.00830.00830.0300.0303.583.58

All three procedures’ empirical E[FDP] stays well below the nominal α=0.05\alpha = 0.05 at every checkpoint — finite-sample verification of Theorems 11.2–11.3 and the stopping-time-uniform property. The P(FDP > α) column tracks tail behavior: LORD has the highest tail at 0.330.33 (still under 1.01.0), reflecting its more aggressive spending; SAFFRON and α-investing both stay near 0.030.030.040.04.

The E[#rej] ordering on this toy goes LORD (19.619.6) > SAFFRON (3.63.6) > α-investing (0.40.4). That LORD outpaces SAFFRON is opposite to the publication-figure pattern that motivated SAFFRON — the candidacy-based budget adjustment is supposed to dominate under high π0\pi_0. The notebook implementation’s specific spending-schedule and candidacy-threshold choices (λ=0.5\lambda = 0.5, default LORD schedule, SAFFRON with λ=0.5\lambda = 0.5) give LORD’s rejection-bonus structure more leverage on this particular toy. Both procedures control FDR; the power-ordering between them depends sensitively on the joint (γ,λ)(\gamma, \lambda) choice and the stream’s signal distribution. The takeaway: read the FDR-control guarantee as the universal feature; let the power profile track the specific implementation.

At t = 1000: α-investing: E[FDP] = 0.033, E[#rej] = 0.38 LORD: E[FDP] = 0.034, E[#rej] = 19.60 SAFFRON: E[FDP] = 0.008, E[#rej] = 3.58

All three procedures hold E[FDP] under α = 0.05 across every stream-position checkpoint — Theorem 11.2 / 11.3 verified in finite samples. The E[#rej] ordering on this toy is LORD ≫ SAFFRON ≫ α-investing, reversing the publication-figure SAFFRON-dominant pattern. The §11.5 prose unpacks why: the spending-schedule + candidacy-threshold combination favors LORD's rejection-bonus structure on streams with bursty signals.

Online-FDR Monte Carlo — empirical FDR and cumulative power across stream position
Left: empirical FDR over stream position t for the three procedures, with B = 100 Monte Carlo streams. All three stay at or below the nominal α = 0.05 line at every checkpoint — finite-sample verification of Theorems 11.2–11.3 and the stopping-time-uniform property. Right: cumulative rejection count over t. LORD (green) dominates with E[#rej] ≈ 19.6 at t = 1000; SAFFRON (blue) ≈ 3.6 and α-investing (red) ≈ 0.4 trail on this toy. The power-ordering between LORD and SAFFRON depends sensitively on the spending-schedule and candidacy-threshold choices; both procedures control FDR universally.

12. ML applications

§§1–11 built the toolkit. §12 connects the procedures to deployment patterns: genomics, A/B platforms, peeking-safe analytics, feature stores. Less theorem-heavy; more about how the procedures live in practice.

12.1 Genomics — GWAS and knockoff-controlled variant selection

GWAS: test 10510^510710^7 SNPs for association with a phenotype. Classical: Bonferroni at 5×1085 \times 10^{-8}. Extraordinarily conservative.

The knockoff approach (Sesia, Sabatti & Candès 2019; “KnockoffZoom”):

  1. Estimate PXP_X from a reference panel (HapMap, 1000 Genomes, UK Biobank).
  2. Construct model-X knockoffs at the SNP level. LD blocks make equicorrelation weak; KnockoffZoom uses SDP-optimized ss.
  3. Compute WW via lasso (or lasso-logistic) on the augmented design.
  4. Apply knockoff+ at the genomics-standard q=0.10q = 0.10.
  5. Multi-resolution discoveries — report at chromosome-block, sub-block, and single-variant scales.

On UK Biobank data (Sesia, Bates, Candès, Marchini & Sabatti 2021), knockoffs at q=0.10q = 0.10 recover 2–3× more loci than Bonferroni at 5×1085 \times 10^{-8}, with FDR controlled at the chosen level and out-of-sample replication supporting the claim. The knockpy (Python) and knockoff (R) packages provide production implementations.

12.2 A/B-platform online discovery and continuous deployment

Modern tech companies run thousands of A/B tests per quarter. Without FDR control, 10410^4 experiments at per-experiment α=0.05\alpha = 0.05 give 500\approx 500 spurious “winners.”

Deployment pattern:

  1. Stream of experiments — each produces p-values at completion.
  2. Wealth-account bookkeeping (LORD or SAFFRON).
  3. Decisions on the fly — when experiment tt concludes, compare to αt\alpha_t, decide ship/not-ship immediately.

Practical considerations. Per-experiment dependence: multiple metrics per experiment break independence. Fix: multivariate test statistic (Hotelling’s T2T^2 or min-p) at experiment level. Composition with always-valid testing: always-valid p-values within experiments feed into online FDR across experiments for end-to-end stopping-time-uniformity. Power vs conservatism: LORD is the default in published deployments (Microsoft ExP, LinkedIn, Netflix); SAFFRON’s power gain requires tuning.

12.3 Peeking-safe analytics and the always-valid distinction

Two complementary frameworks:

Always-valid confidence sequences (Howard, Ramdas, McAuliffe & Sekhon 2021; Johari, Pekelis & Walsh 2017): single-experiment CIs that maintain (1α)(1-\alpha) coverage uniformly across all stopping times.

Online FDR (this topic): the cross-experiment version.

The two compose: always-valid CIs within each experiment produce always-valid p-values; feeding them into LORD/SAFFRON gives stopping-time-uniform FDR control across the full portfolio.

Always-valid inference is a substantial topic in its own right — e-values and e-processes (Vovk & Wang 2021; Waudby-Smith & Ramdas 2020). A future formalML topic on always-valid inference (coming soon) will cover this; the connection here is that §§10–11 online FDR procedures plug into the always-valid framework as the multiple-testing layer.

12.4 Selective inference in feature stores and model registries

Modern ML platforms (Tecton, Feast, Hopsworks for feature stores; MLflow for model registries) track thousands of candidate features. Naive importance scores suffer the §1 selection bias at scale. Three integration directions:

Knockoff feature columns. Precompute model-X knockoffs alongside real features. Downstream model trainers get FDR-controlled feature selection automatically.

Honest feature CIs. Compute debiased-lasso CIs (§5) for each feature’s effect. Honest range vs selection-biased point estimate.

Conformal feature selection. Extend conformal split-calibration to feature selection — active research area (Liu, Katsevich, Janson & Ramdas 2022; see also the conformal prediction topic for the exchangeability substrate).

Platform-level challenges are integration challenges, not statistical ones. The procedures of §§1–11 work; the open problem is composability with feature stores, experiment tracking, continuous-deployment pipelines.

12.5 Unified comparison on a GWAS-like problem

The notebook closes with three procedures on a simulated GWAS-like problem (n=500n = 500, p=200p = 200 in moderate LD with AR(1) ρ=0.4\rho = 0.4, 10 causal variants at β=2.0\beta^\star = 2.0, Gaussian noise, B=20B = 20 replicates with sub-sampled debiasing for runtime):

  1. Bonferroni at q=0.10q = 0.10. Marginal-correlation p-values.
  2. BH on debiased-lasso p-values. §5 + §6 route. Debias 20 features per replicate (signals + random nulls) for runtime.
  3. Model-X knockoff+. §8 + §9 route. Gaussian model-X at estimated Σ\Sigma.

Locked notebook outputs:

ProcedureE[FDP]sd(FDP)PowerE[#rej]
Bonferroni0.92630.92630.00440.00441.0001.000136.20136.20
BH-debiased0.00910.00910.02800.02801.0001.00010.1010.10
Model-X knockoff+0.09730.09730.10670.10671.0001.00011.2511.25

The headline is Bonferroni’s FDP catastrophe: of its 136\approx 136 “rejections” per replicate, only 10\approx 10 are true signals — 93%93\% are false discoveries from correlated nulls. The marginal-correlation p-values reach extreme significance on many null SNPs because of LD with the true causal variants, and Bonferroni’s per-test threshold does nothing to control the resulting joint inflation. Power saturates at 1.0001.000 (every true signal is rejected) but the discovery report is unusable. By contrast, BH-debiased (FDP 0.0090.009) and model-X knockoff+ (FDP 0.0970.097, near nominal q=0.10q = 0.10) both report 10\approx 10 rejections, comprising almost entirely true signals — they recover the same power Bonferroni achieves while making honest discoveries. The two FDR-controlling procedures are close in performance; the choice depends on whether Σ\Sigma is known (knockoffs) or the rate condition s0logp=o(n)s_0 \log p = o(\sqrt n) holds (debiased).

Bonferroni: E[FDP] = 0.926 · E[#rej] = 136.20 BH-debiased: E[FDP] = 0.009 · E[#rej] = 10.10 model-X knockoff+: E[FDP] = 0.097 · E[#rej] = 11.25

Three procedures on a GWAS-like toy (n = 500, p = 200, AR(1) ρ = 0.4, β★ = 2.0, 10 signals, B = 20). Per-replicate points are simulated from the locked notebook means and standard deviations (cell 54). All three procedures hit power 1.000 on this strong-signal regime; the headline is the FDP catastrophe of Bonferroni at FDP ≈ 0.93 — 136 of 146 "rejections" are false discoveries from LD-correlated nulls. BH-debiased (FDP ≈ 0.009) and model-X knockoff+ (FDP ≈ 0.097, near the nominal q = 0.10) both make ≈10 honest discoveries per replicate.

FDP vs power Pareto scatter for three procedures on a GWAS-like toy
FDP vs power Pareto scatter for three procedures on a GWAS-like toy (n = 500, p = 200, AR(1) ρ = 0.4, β★ = 2.0, B = 20 replicates). Each light point is one replicate; large markers are means. Bonferroni (gray squares) clusters at FDP ≈ 0.93 with power saturated at 1.000 — extreme false-discovery rate driven by LD-correlated nulls reaching marginal significance. BH-debiased (blue circles, FDP ≈ 0.009) and model-X knockoff+ (brown triangles, FDP ≈ 0.097) both stay near nominal q = 0.10 while making ≈10 honest discoveries per replicate. The headline: moving from FWER-style to FDR-style control prevents the LD-driven false-discovery explosion that Bonferroni cannot.

13. Computational notes

Selective inference procedures span a wide range of computational profiles, from the moderate-pp exact pivots of §4 (cheap) to the SDP-based knockoffs of §9 (O(p6.5)O(p^{6.5}) worst case). §13 collects the practical numerical issues. Intentionally short: a practical reference, not a comprehensive numerical-methods treatment.

13.1 Knockoff-matrix conditioning and the equicorrelation regime

Both fixed-X and model-X knockoff constructions depend on Σ\Sigma. Two problems when ill-conditioned.

PSD constraint binding. Equicorrelation ss shrinks toward 00 as λmin(Σ)0\lambda_{\min}(\Sigma) \to 0. Knockoffs become nearly identical to real features; WjW_j loses discriminative power. At ρ1\rho \to 1 on the AR(1) toy, the procedure becomes useless.

Numerical Cholesky failures. Construction needs Cholesky of V=2diag(s)diag(s)Σ1diag(s)V = 2\mathrm{diag}(s) - \mathrm{diag}(s)\Sigma^{-1}\mathrm{diag}(s). When ss saturates the PSD bound, VV has a zero eigenvalue; naive Cholesky errors. Fix: back ss off by jitter 1010\sim 10^{-10}, clip negative eigenvalues in the eigendecomposition.

Diagnostics. (i) λmin(Σ)0.01\lambda_{\min}(\Sigma) \ge 0.01; (ii) λmin(2Σdiag(s))ϵ\lambda_{\min}(2\Sigma - \mathrm{diag}(s)) \ge -\epsilon; (iii) empirical XjX~j/nΣjjsjX_j^\top \tilde X_j / n \approx \Sigma_{jj} - s_j within MC noise. Skipping any of these can produce silently incorrect FDR control.

Pre-clustering escape valve. Near-collinear feature groups (genomics LD blocks): pre-cluster, apply knockoffs at group level, accept that discoveries are “this cluster matters.”

13.2 Lasso-path solvers — LARS vs coordinate descent for selection-event recovery

Coordinate descent (sklearn’s Lasso). Highly efficient at one λ\lambda. Right choice for augmented-lasso WW statistics (§§7–9) — one fit at one λ\lambda.

LARS (sklearn.linear_model.lars_path). Computes the entire piecewise-linear path. Right choice for the LSST conditional pivot (§4) — we need the polyhedral selection event.

Where the choice matters. For the §4 pivot, an inexact selection event from CD gives an inexact pivot. In typical signal regimes, CD’s solution matches LARS to machine precision; near the selection-event boundary, CD may misidentify the active set. The R selectiveInference package uses LARS internally.

The §7 power-recovery problem. The §7.5 numerics showed near-zero knockoff+ power at the LCD statistic with λkf=0.01σ2logp/n\lambda_{kf} = 0.01 \cdot \sigma\sqrt{2\log p/n}. The lasso at this λ\lambda is too lightly regularized to separate real signals from their knockoffs — both end up active with comparable magnitudes, and the LCD difference is dominated by noise. Two practical recoveries:

  • Switch from LCD to LSM. Lasso Signed Max: WjLSM:=sgn(β^jλ)max{λ:jS^(λ)}W_j^{\text{LSM}} := \mathrm{sgn}(\hat\beta_j^{\lambda^\star}) \cdot \max\{ \lambda : j \in \hat S(\lambda) \}, computed from the full LARS path. LSM integrates information across the regularization path rather than reading off a single λ\lambda, and recovers power on the §7 toy without changing nn, pp, or signal strength.
  • CV-tuned λkf\lambda_{kf}. Run LassoCV on the augmented design [X,X~][X, \tilde X] to pick λkf\lambda_{kf} data-adaptively. The resulting λ\lambda is typically 5050100×100\times larger than the brief’s 0.01σ2logp/n0.01\sigma\sqrt{2\log p / n} choice and produces a WW that separates well.

The notebook uses LCD at the fixed-λ\lambda choice to keep the §7 demo simple and to make the FDR-vs-power separation visible. Production knockoff implementations (knockpy, knockoff R package) use LSM or CV-tuned LCD by default.

Coordinate-descent tolerance. sklearn default tol=1e-4 is fine for WW statistics; tighter (10810^{-8}) for the §4 pivot at borderline signal strengths.

13.3 Scaling to p ≈ 10⁴ — Cholesky cost and memory

The notebook demos at p500p \le 500. Production applications are often much larger.

Knockoff construction. Equicorrelation: O(p3)O(p^3) eigendecomposition. At p=104p = 10^4, 10\sim 103030 seconds. At p=105p = 10^5, eigendecomposition is feasible but memory (p2p^2 entries, 80\sim 80 GB at p=105p = 10^5) is the bottleneck. SDP: O(p6.5)O(p^{6.5}) worst-case, infeasible past p103p \approx 10^3 without ASDP block decomposition.

Storage of Σ\Sigma. At p=104p = 10^4, dense Σ\Sigma is 800 MB. Workarounds: sparse storage if structurally sparse; low-rank approximation ΣUU+D\Sigma \approx UU^\top + D; persist the Cholesky factor.

Augmented-lasso fits. Lasso on [X,X~]Rn×2p[X, \tilde X] \in \mathbb{R}^{n \times 2p} scales linearly in pp per CD iteration. At p=104p = 10^4, single fit is seconds; at p=105p = 10^5, minutes.

Practical thresholds. Equicorrelation at p=104p = 10^4 is laptop-feasible; at p=106p = 10^6 requires specialized samplers (HMM knockoffs for genomics). KnockoffZoom operates at p106p \sim 10^6 by exploiting LD Markov-chain structure rather than handling a full 106×10610^6 \times 10^6 covariance.

Debiased lasso scaling. Each coordinate’s debias is one nodewise lasso. Debiasing all pp at p=104p = 10^4 takes hours; production implementations parallelize or sub-sample to a target subset. The §12 GWAS-like demo uses this sub-sampling (20 coordinates per replicate).

13.4 Reproducibility — random-knockoff symmetrization and seed discipline

Model-X knockoff construction draws fresh randomness for X~\tilde X at every invocation. Two runs on the same data produce different rejection sets. FDR holds in expectation across the joint randomness in (X~,y)(\tilde X, y), but per-realization rejection sets vary.

Seed discipline. Log the RNG seed used to draw X~\tilde X (and UU in fixed-X). Re-running with the same seed gives the same rejection set. Convention-of-least-resistance.

Knockoff symmetrization / averaging. Draw KK independent knockoff matrices, compute W(k)W^{(k)} for each, aggregate. Mean-W: Wˉ=K1kW(k)\bar W = K^{-1} \sum_k W^{(k)}. Multiple-knockoff median rejection: run knockoff+ per draw, take median rejection set or per-feature stability scores. Bates, Candès, Janson & Wang (2021) develop the theory; multiple-knockoff is strictly better than single-knockoff at the same FDR target.

Bitwise reproducibility. Some publication settings require it. Options: deterministic fixed-X (with Σ\Sigma-derived UU); knockoff-free alternatives (BH-on-debiased).

Other hazards. (a) sklearn’s Lasso with random_state=None is non-deterministic — set random_state=0. (b) np.random.default_rng() without seed varies — always pass an integer. (c) scipy.optimize.brentq is deterministic given fixed brackets, but bracket-dependence on data can shift CIs by small amounts.

14. Connections and limits

The topic has built three families: post-selection conditional pivots (§§1–5), knockoff FDR control (§§7–9), and online FDR via wealth processes (§§10–11). §14 closes by mapping the topic to its theoretical neighbors and being explicit about where the framework ends.

14.1 Selective vs simultaneous inference — PoSI and Berk–Brown–Buja–Zhang–Zhao 2013

The §§1–5 framework conditions on the selection event and gives selective coverage. Berk, Brown, Buja, Zhang & Zhao (2013) developed PoSI — a strictly stronger but more conservative target.

PoSI’s guarantee. For any selection rule MM^(y)MM \mapsto \hat M(y) \in \mathcal{M}:

Pr(ηMμCM for all MM)1α.\Pr(\eta_M^\top \mu \in C_M \text{ for all } M \in \mathcal{M}) \ge 1 - \alpha.

CIs CM=η^My±KPoSIσηMC_M = \hat\eta_M^\top y \pm K_{\mathrm{PoSI}} \sigma \|\eta_M\| with KPoSIzα/2K_{\mathrm{PoSI}} \ge z_{\alpha/2}; for orthogonal designs KPoSI2logpK_{\mathrm{PoSI}} \approx \sqrt{2 \log p}.

LSSTPoSI
GuaranteeConditional on selection eventSimultaneous over rules
WidthTighter on realized outcomeUniformly wider
Coupled to procedureYes — needs polyhedral EsE_sNo — agnostic
Best regimeSpecific structured selection rulesExploratory analysis

Neither is strictly stronger. The procedures of §§1–11 give four distinct types of guarantee — conditional coverage (LSST), uniform coverage (PoSI), discovery-set FDR (knockoffs), portfolio-streaming FDR (online) — each suited to a different inferential question. “What is the right multiple-testing target?” is itself a modeling choice.

14.2 Power loss as the price of post-selection honesty

Honest CIs are wider; honest FDR procedures discover less; honest online procedures spend more conservatively. Structural, not accidental.

Where the cost shows up.

  • §4 conditional CIs: wider than naive, sometimes ±\pm\infty on null inferences.
  • §5 debiased lasso: wider than oracle CI; bias correction adds variance.
  • §§7–9 knockoffs: fewer discoveries than oracle thresholding at same FDR.
  • §§10–11 online FDR: fewer rejections than batch BH on the same data.

Formal optimality. The §4 conditional CIs are UMAU (Lee, Sun, Sun & Taylor 2016). Knockoff+ is FDR-tight in the worst case across anti-symmetric statistics. The power cost is unavoidable.

The right framing. “Selective inference is honest; that honesty has a measurable power cost; the cost is not avoidable.” A wider CI from §4’s pivot is more informative than a tighter CI from the naive workflow — the wider one accurately represents residual uncertainty after selection; the naive one just lies.

When the cost is too high. If §4 conditional CIs are extending to ±\pm\infty on every null feature, don’t report a CI. Report the discovery, the test statistic, and the fact that the conditional analysis has exhausted the available information. More honest than reporting an infinite CI.

14.3 Exchangeability fragility — what fails when knockoff or conformal assumptions break

Knockoff filters (§§7–9) and conformal prediction share a structural foundation: both depend on exchangeability for finite-sample validity.

Knockoff exchangeability:

  • Fixed-X (§7): Gram-matrix swap-invariance + Gaussian noise.
  • Model-X (§8): joint distributional swap-invariance + YY-blindness.

When the assumption holds, FDR is finite-sample exact. When it breaks — misspecified PXP_X — Bates, Sesia, Sabatti & Candès (2020) show FDR inflation proportional to Wasserstein-distance error.

Conformal exchangeability is calibration-set exchangeability. When it holds, coverage is exact. When it breaks (distribution shift, time-series dependence, adversarial samples), coverage degrades.

Shared structural pattern:

  1. Assumption: exchangeability (sign-flip for knockoffs, calibration for conformal).
  2. Guarantee: finite-sample exact, distribution-free, under the assumption.
  3. Failure mode: assumption breaks, guarantee degrades smoothly with violation magnitude.
  4. Robustness analysis: characterize the degradation rate; design procedures that minimize sensitivity.

Diagnostic checks. Knockoffs: verify moment-matching empirically; use multiple-knockoff symmetrization to detect instability. Conformal: check distribution-shift indicators; use rolling-window calibration in time-series settings.

14.4 Honest CIs, SoS confidence regions, and always-valid inference

Three further directions, mostly as forward-pointers:

SoS confidence regions (Tibshirani, Rinaldo, Tibshirani & Wasserman 2018). Confidence regions for the entire coefficient vector with simultaneous coverage. Stronger but larger than the per-coordinate CIs of §5. Useful for joint inference tasks.

Always-valid inference. Single-experiment analog of §11.5’s stopping-time-uniform FDR. Confidence sequences (Howard, Ramdas, McAuliffe & Sekhon 2021), e-values and e-processes (Vovk & Wang 2021), betting-based methods (Waudby-Smith & Ramdas 2020). A future formalML topic on always-valid inference (coming soon) will cover this in depth.

Selective inference beyond GLMs. §§1–11 work in the Gaussian linear, high-dimensional linear, and model-X non-parametric settings. Liu, Katsevich, Janson & Ramdas (2022) generalize model-X to arbitrary outcome models via conditional randomization tests. Active research area.

Hard boundaries. Selective inference, as developed in this topic, does not address:

  • Causal selection bias. Sampling bias, missing-data mechanisms — see semiparametric inference.
  • Distribution shift. Deployment distribution differs from training — see conformal robustness.
  • Adversarial robustness. Adversarial settings require different tools entirely.

Within those boundaries, §§1–11 give the strongest finite-sample guarantees the framework can make. This closes the topic. The §§1–11 toolkit gives a coherent answer to “how do we do honest inference after looking at the data?” — conditional pivots when the selection rule is structured, debiased estimators when the regime is high-dimensional, knockoffs when finite-sample FDR matters, online procedures when tests arrive over time. §§12–13 connected the procedures to deployment; §14 mapped them to their neighbors. Selective inference is a young framework already widely deployed in genomics and A/B platforms, and the next decade of development will push it further into non-parametric ML, distribution-shift-robust settings, and the always-valid framework that §12.3 pointed toward.

Connections

  • Exchangeability cousin. §14.3 unpacks the shared structural pattern — both selective inference (knockoff sign-flip exchangeability) and conformal prediction (calibration-set exchangeability) buy finite-sample, distribution-free guarantees from a symmetry; both degrade smoothly when the symmetry breaks. conformal-prediction
  • Conceptual frame on data-dependent estimands. §1.3 explicitly parallels the two topics: semiparametrics asks for √n-valid inference about a low-dimensional functional with an infinite-dimensional nuisance estimated from data; selective inference asks for valid inference about a functional whose identity depends on the data. Both topics live in the seam between 'parameter is fixed, data is random' and 'parameter and data are both random in a coupled way'. semiparametric-inference
  • Lasso KKT subgradient conditions, the polyhedral selection event used in §§3–4, and the debiased-lasso machinery used in §5 are all developed there. The §5.4 proof sketch references this topic for the lasso ℓ₁-rate that gates the rate condition. high-dimensional-regression
  • The Bernoulli-sequence supermartingale tail bounds underlying §7.4 (Lemma 7.4) and the spending-budget concentration in §11.3 both build on the sub-Gaussian and martingale-tail machinery developed there. concentration-inequalities

References & Further Reading