advanced learning-theory 60 min read

Always-Valid Inference

Time-uniform confidence sequences, e-processes, and the betting reformulation — sequential A/B testing without peeking penalties

1. Motivation — why fixed-nn confidence intervals fail under peeking

1.1 The peeking analyst

Picture an A/B test. We’ve decided to run it for 10,000 visitors, and we’ll form a 95% confidence interval for the lift at the end. After 1,000 visitors, we glance at the dashboard. The lift looks significant — the 95% CI excludes zero. We stop.

That action — peeking and stopping based on what we saw — quietly invalidates our confidence statement. The fixed-nn CI we relied on, say a Hoeffding band

Cnfixed  =  [Xˉn2σ2nlog2δ,  Xˉn+2σ2nlog2δ],C_n^{\text{fixed}} \;=\; \left[\, \bar X_n - \sqrt{\tfrac{2\sigma^2}{n}\log\tfrac{2}{\delta}},\; \bar X_n + \sqrt{\tfrac{2\sigma^2}{n}\log\tfrac{2}{\delta}} \,\right],

promises exactly one thing:

P ⁣(θCnfixed)    1δfor the single, pre-specified n.\mathbb{P}\!\left(\theta \in C_n^{\text{fixed}}\right) \;\geq\; 1 - \delta \qquad \text{for the single, pre-specified } n.

That guarantee says nothing about C1fixed,C2fixed,,CnfixedC_1^{\text{fixed}}, C_2^{\text{fixed}}, \ldots, C_n^{\text{fixed}} holding simultaneously. The naïve union bound P(nN:θCnfixed)Nδ\mathbb{P}(\exists\, n \leq N : \theta \notin C_n^{\text{fixed}}) \leq N\delta is vacuous for N1/δN \geq 1/\delta. We need a different kind of object — one whose coverage guarantee survives looking, deciding, and stopping.

1.2 A concrete optional-stopping demonstration

We’ll make this visceral. Generate i.i.d. XiN(0,1)X_i \sim \mathcal{N}(0, 1) — true mean μ=0\mu = 0. At each nn, form the fixed-nn Gaussian CI Cnfixed=[Xˉnzδ/2/n,Xˉn+zδ/2/n]C_n^{\text{fixed}} = [\bar X_n - z_{\delta/2}/\sqrt{n},\, \bar X_n + z_{\delta/2}/\sqrt{n}] with z0.0251.96z_{0.025} \approx 1.96. Adopt the aggressive-analyst stopping rule

τ  =  min{n1:0Cnfixed},τnmax if no exit by nmax.\tau \;=\; \min\{\, n \geq 1 : 0 \notin C_n^{\text{fixed}} \,\}, \qquad \tau \wedge n_{\max} \text{ if no exit by } n_{\max}.

Across B=5,000B = 5{,}000 Monte Carlo replicates, the fraction of replicates stopping at some τnmax\tau \leq n_{\max} with the CI excluding the truth climbs from 0.31\approx 0.31 at nmax=50n_{\max} = 50 to 0.65\approx 0.65 at nmax=10,000n_{\max} = 10{,}000, and would approach 11 as nmaxn_{\max} \to \infty. Drag the slider below to watch the failure rate grow with the analyst’s patience.

P(false rejection by n_max) ≈ 0.644 vs nominal δ = 0.050
Empirical false-rejection rate of the fixed-n Gaussian CI under aggressive stopping, plotted against n_max; trajectories on the right show how the running mean grazes the shrinking band.
Notebook Figure 1. Empirical false-rejection rate of the fixed-n Gaussian CI under the aggressive-analyst stopping rule, with 200 sample trajectories of the running mean and the shrinking ±z_(δ/2)/√n band.

The pathology is not subtle. The fixed-nn band shrinks like 1/n1/\sqrt{n}, but Xˉn\bar X_n fluctuates on the same 1/n1/\sqrt{n} scale by the CLT — so the band is exactly tight enough at any single nn and not tight enough to hold over all nn simultaneously.

1.3 The law of the iterated logarithm — a preview

The deepest reason the fixed-nn band fails under peeking comes from the law of the iterated logarithm (LIL), proved in Hartman and Wintner (1941) and sharpened by Strassen (1964). For i.i.d. mean-zero XiX_i with variance σ2\sigma^2, the partial sums Sn=i=1nXiS_n = \sum_{i=1}^n X_i satisfy

lim supnSn2nσ2loglogn  =  1almost surely.\limsup_{n \to \infty}\, \frac{S_n}{\sqrt{2 n \sigma^2 \log\log n}} \;=\; 1 \qquad \text{almost surely.}

Translated to the running mean, Xˉn|\bar X_n| infinitely often exceeds σ2loglogn/n\sigma\sqrt{2 \log\log n / n} — a width that grows by a factor of loglogn\sqrt{\log\log n} over the fixed-nn band. The random walk revisits arbitrarily large loglogn\sqrt{\log\log n}-scaled deviations forever. Any band whose half-width is O(1/n)O(1/\sqrt{n}) alone is eventually exited with probability one. The peeking analyst is a victim not of bad luck but of the law of large numbers’ fine print.

200 trajectories, n ∈ [10, 10000], log-x.
200 sample trajectories of the running mean overlaid with the shrinking fixed-n Gaussian band and the LIL-shaped band, on a log-x axis; the LIL band envelopes the fluctuations as n grows while the fixed-n band is exited persistently.
Notebook Figure 2. The same 200 trajectories from Figure 1 with two bands overlaid — the shrinking fixed-n Gaussian band (which the trajectories exit) and the LIL-shaped band σ√(2 log log n / n) (which envelopes them).

The LIL also tells us what kind of widening we’ll have to pay. Any honest confidence sequence must accommodate at least a loglogn\sqrt{\log\log n} factor. The modern boundaries of §§5–6 hit this rate up to constants; the formal discussion of the LIL as a lower bound is in §13.3.

1.4 What we want — time-uniform coverage

Definition 1.1 (Confidence sequence).

Let X1,X2,X_1, X_2, \ldots be a stochastic process adapted to a filtration (Fn)n0(\mathcal{F}_n)_{n \geq 0}, and let θ\theta be a parameter of the data-generating distribution. A (1δ)(1-\delta)-confidence sequence for θ\theta is a sequence (Cn)n1(C_n)_{n \geq 1} of Fn\mathcal{F}_n-measurable random sets satisfying

P ⁣(θCn for all n1)    1δ.(1.1)\mathbb{P}\!\left(\theta \in C_n \text{ for all } n \geq 1\right) \;\geq\; 1 - \delta. \qquad\qquad (1.1)

The “for all n1n \geq 1” is the entire game. It promises that the event “every CnC_n contains θ\theta” has probability 1δ\geq 1 - \delta — a statement about the full sample path, not about any individual nn.

Proposition 1.2 (Stopping-time equivalence).

(Cn)n1(C_n)_{n \geq 1} is a (1δ)(1-\delta)-confidence sequence if and only if for every stopping time τ\tau (taking values in N{}\mathbb{N} \cup \{\infty\}),

P ⁣(τ< and θCτ)    δ.(1.2)\mathbb{P}\!\left(\tau < \infty \text{ and } \theta \notin C_\tau\right) \;\leq\; \delta. \qquad\qquad (1.2)
Proof.

(\Rightarrow) Fix a stopping time τ\tau. On {τ<,  θCτ}\{\tau < \infty,\; \theta \notin C_\tau\}, there exists at least one finite index at which θCn\theta \notin C_n. Hence {τ<,  θCτ}{n1:θCn}\{\tau < \infty,\; \theta \notin C_\tau\} \subseteq \{\exists\, n \geq 1 : \theta \notin C_n\}, and taking probabilities gives P(τ<,θCτ)δ\mathbb{P}(\tau < \infty, \theta \notin C_\tau) \leq \delta by (1.1).

(\Leftarrow) Define the exit time τ=inf{n1:θCn}\tau^\star = \inf\{n \geq 1 : \theta \notin C_n\} (with τ=\tau^\star = \infty if no such nn). Each 1[θCn]\mathbf{1}[\theta \in C_n] is Fn\mathcal{F}_n-measurable, so τ\tau^\star is a stopping time. On {τ<}\{\tau^\star < \infty\}, θCτ\theta \notin C_{\tau^\star} by definition, so

P ⁣(n1:θCn)=P(τ<)=P(τ<,θCτ)δ\mathbb{P}\!\left(\exists\, n \geq 1 : \theta \notin C_n\right) = \mathbb{P}(\tau^\star < \infty) = \mathbb{P}(\tau^\star < \infty, \theta \notin C_{\tau^\star}) \leq \delta

by (1.2). Taking complements yields (1.1).

Time-uniform coverage and stopping-time validity are the same property. We use anytime-valid and time-uniform interchangeably, and say a procedure is anytime-valid if its guarantee holds under (1.2) for every stopping time, including stopping rules the analyst designs by staring at the data.

1.5 Roadmap

Arc A — Classical sequential analysis (§§3–4). Wald (1945, 1947) and Robbins (1970): the SPRT prototype and the method-of-mixtures structural template.

Arc B — Modern confidence sequences (§§5–6). Howard, Ramdas, McAuliffe, and Sekhon (2020, 2021): every fixed-nn Chernoff-style tail bound has a time-uniform analogue obtained by replacing Markov on the MGF with Ville on the supermartingale. Waudby-Smith and Ramdas (2024): the betting reformulation, with empirical-Bernstein confidence sequences matching LIL.

Arc C — e-values and e-processes (§§8–10). Vovk (1993, 2019), Vovk and Wang (2021), Grünwald, de Heide, and Koolen (2024): the formal definition of e-value, the duality between e-processes and anytime-valid tests, and the merging operations that handle multiple-testing without Bonferroni.

All three arcs reduce to the same one tool: Ville’s inequality applied to a nonnegative supermartingale. We develop that tool in §2.

2. From Markov to Ville — the supermartingale lift

We promised in §1 that the entire topic rests on one tool. This section delivers it. The fixed-nn Chernoff recipe — every tail bound on P(Snt)\mathbb{P}(S_n \geq t) the reader has ever seen — comes from a single application of Markov’s inequality to a moment-generating function. The time-uniform version comes from a single substitution: replace that Markov step with Ville’s inequality, applied to the corresponding nonnegative supermartingale. The bounds are uniformly valid over all sample sizes nn simultaneously, at the cost of a loglogn\sqrt{\log\log n} factor that §13.3 will show is unavoidable.

2.1 Markov’s inequality and the fixed-nn Chernoff recipe

Lemma 2.1 (Markov's inequality).

For any nonnegative random variable Y0Y \geq 0 and any a>0a > 0,

P(Ya)    E[Y]a.(2.1)\mathbb{P}(Y \geq a) \;\leq\; \frac{\mathbb{E}[Y]}{a}. \qquad\qquad (2.1)

Markov bounds a tail probability by an expectation. The Chernoff recipe takes Y=eλXY = e^{\lambda X} for some tilt λ>0\lambda > 0:

P(Xt)  =  P(eλXeλt)    eλtE[eλX].\mathbb{P}(X \geq t) \;=\; \mathbb{P}(e^{\lambda X} \geq e^{\lambda t}) \;\leq\; e^{-\lambda t}\, \mathbb{E}[e^{\lambda X}].

Apply this to Sn=i=1nXiS_n = \sum_{i=1}^n X_i for i.i.d. XiX_i with cumulant generating function ψ(λ)=logE[eλX1]\psi(\lambda) = \log \mathbb{E}[e^{\lambda X_1}]. Independence factors the expectation, so E[eλSn]=enψ(λ)\mathbb{E}[e^{\lambda S_n}] = e^{n\psi(\lambda)}, and Markov gives

P(Snt)    eλt+nψ(λ)for every λ0.(2.2)\mathbb{P}(S_n \geq t) \;\leq\; e^{-\lambda t + n \psi(\lambda)} \qquad \text{for every } \lambda \geq 0. \qquad\qquad (2.2)

Minimize over λ\lambda to get the Chernoff bound; specialize ψ\psi to get every named tail bound.

Example 1 (The sub-Gaussian case).

A random variable XX with mean μ\mu is sub-Gaussian with variance proxy σ2\sigma^2 if E[eλ(Xμ)]eλ2σ2/2\mathbb{E}[e^{\lambda(X - \mu)}] \leq e^{\lambda^2 \sigma^2 / 2} for all λR\lambda \in \mathbb{R}. Plugging into (2.2), P(Snnμt)exp(λt+nλ2σ2/2)\mathbb{P}(S_n - n\mu \geq t) \leq \exp(-\lambda t + n \lambda^2 \sigma^2 / 2). Minimum over λ>0\lambda > 0 at λ=t/(nσ2)\lambda^\star = t/(n\sigma^2) equals exp(t2/(2nσ2))\exp(-t^2 / (2 n \sigma^2)). Inverting at level δ\delta gives the fixed-nn Hoeffding CI

P ⁣(Xˉnμσ2log(1/δ)n)    δ.(2.3)\mathbb{P}\!\left(\bar X_n - \mu \geq \sigma\sqrt{\tfrac{2\log(1/\delta)}{n}}\right) \;\leq\; \delta. \qquad\qquad (2.3)

This is the band that failed under peeking in §1. It fails because the Markov step in (2.2) is a single-nn statement. To control the supremum over nn, we need an inequality that handles every nn at once.

2.2 Nonnegative supermartingales and Ville’s inequality

Definition 2.2 (Filtration, supermartingale).

A filtration on a probability space is an increasing sequence of σ\sigma-algebras (Fn)n0(\mathcal{F}_n)_{n \geq 0}, with FnFn+1\mathcal{F}_n \subseteq \mathcal{F}_{n+1} for all nn. A real-valued process (Mn)n0(M_n)_{n \geq 0} adapted to (Fn)(\mathcal{F}_n) is a supermartingale if EMn<\mathbb{E}|M_n| < \infty for all nn and

E[Mn+1Fn]    Mna.s.(2.4)\mathbb{E}[M_{n+1} \mid \mathcal{F}_n] \;\leq\; M_n \quad \text{a.s.} \qquad\qquad (2.4)

The supermartingale is nonnegative if Mn0M_n \geq 0 almost surely for every nn.

Example 2 (The sub-Gaussian supermartingale).

For i.i.d. mean-zero sub-Gaussian XiX_i with variance proxy σ2\sigma^2, fix λR\lambda \in \mathbb{R} and define

MnsG(λ)  =  exp ⁣(λSnnλ2σ2/2),M0sG(λ)=1.(2.5)M_n^{\mathrm{sG}}(\lambda) \;=\; \exp\!\left(\lambda S_n - n \lambda^2 \sigma^2 / 2\right), \qquad M_0^{\mathrm{sG}}(\lambda) = 1. \qquad\qquad (2.5)

Each MnsG(λ)0M_n^{\mathrm{sG}}(\lambda) \geq 0 trivially. The supermartingale property holds:

E ⁣[Mn+1sG(λ)MnsG(λ)  |  Fn]=E ⁣[eλXn+1λ2σ2/2]1\mathbb{E}\!\left[\frac{M_{n+1}^{\mathrm{sG}}(\lambda)}{M_n^{\mathrm{sG}}(\lambda)} \;\middle|\; \mathcal{F}_n\right] = \mathbb{E}\!\left[e^{\lambda X_{n+1} - \lambda^2 \sigma^2 / 2}\right] \leq 1

by the sub-Gaussian MGF bound. This is the Wald–Doob supermartingale — the canonical fixed-nn → time-uniform lift.

Theorem 2.3 (Ville's inequality (Ville 1939)).

Let (Mn)n0(M_n)_{n \geq 0} be a nonnegative supermartingale with E[M0]<\mathbb{E}[M_0] < \infty. Then for any a>0a > 0,

P ⁣(supn0Mna)    E[M0]a.(2.6)\mathbb{P}\!\left(\sup_{n \geq 0} M_n \geq a\right) \;\leq\; \frac{\mathbb{E}[M_0]}{a}. \qquad\qquad (2.6)

Same right-hand side as Markov (2.1), but the left-hand side controls the supremum of the process — every nn simultaneously. The same Markov budget E[M0]/a\mathbb{E}[M_0]/a buys a uniform statement over the whole sample path.

Proof.

We use Doob’s optional-stopping theorem for bounded stopping times (Theorem 2.5 below). Fix a>0a > 0 and define τa=inf{n0:Mna}\tau_a = \inf\{n \geq 0 : M_n \geq a\} (with τa=\tau_a = \infty if no such nn).

Each indicator 1{Mna}\mathbf{1}\{M_n \geq a\} is Fn\mathcal{F}_n-measurable, so {τan}Fn\{\tau_a \leq n\} \in \mathcal{F}_n — making τa\tau_a a stopping time. For any finite horizon NN, τaNN\tau_a \wedge N \leq N, so Theorem 2.5 gives E[MτaN]E[M0]\mathbb{E}[M_{\tau_a \wedge N}] \leq \mathbb{E}[M_0].

On {τaN}\{\tau_a \leq N\}, τaN=τa\tau_a \wedge N = \tau_a and MτaaM_{\tau_a} \geq a. On {τa>N}\{\tau_a > N\}, MτaN=MN0M_{\tau_a \wedge N} = M_N \geq 0. Hence

E[MτaN]aP(τaN),\mathbb{E}[M_{\tau_a \wedge N}] \geq a \cdot \mathbb{P}(\tau_a \leq N),

so aP(τaN)E[M0]a \cdot \mathbb{P}(\tau_a \leq N) \leq \mathbb{E}[M_0], i.e. P(τaN)E[M0]/a\mathbb{P}(\tau_a \leq N) \leq \mathbb{E}[M_0]/a.

The events {τaN}\{\tau_a \leq N\} are increasing with union {τa<}={supnMna}\{\tau_a < \infty\} = \{\sup_n M_n \geq a\}. Continuity of measure gives

P ⁣(supnMna)=P(τa<)=limNP(τaN)E[M0]/a.\mathbb{P}\!\left(\sup_n M_n \geq a\right) = \mathbb{P}(\tau_a < \infty) = \lim_{N \to \infty} \mathbb{P}(\tau_a \leq N) \leq \mathbb{E}[M_0]/a.
λ★ = μ₁/σ² = 0.30, drift λμ₁ − λ²/2 = 0.0450, GRO μ₁²/(2σ²) = 0.0450.
Empirical max crossing: H₀ = 0.039, H₁ = 1.000.
Log Wald–Doob supermartingale trajectories under H_0 (blue, bounded by Ville) and under H_1 (red, drifting up); lower panel shows empirical sup-crossing probability staying below δ under H_0 and climbing to 1 under H_1.
Notebook Figure 3. Wald–Doob supermartingale trajectories at λ = 0.3 — bounded under H_0 by Ville (top, blue) and exhibiting linear positive drift μ_1²/(2σ²) under H_1 (top, red). The bottom panel shows the empirical sup-crossing probability staying below δ = 0.05 under H_0 (≈ 0.04 — Ville slack) and climbing to ≈ 1 under H_1.

2.3 Doob’s optional-stopping theorem and stopping-time invariance

Definition 2.4 (Stopping time).

A random variable τ:Ω{0,1,2,}{}\tau : \Omega \to \{0, 1, 2, \ldots\} \cup \{\infty\} is a stopping time with respect to (Fn)(\mathcal{F}_n) if {τn}Fn\{\tau \leq n\} \in \mathcal{F}_n for every n0n \geq 0.

The event “we’ve stopped by time nn” is decidable from the first nn observations alone — no peeking into the future.

Theorem 2.5 (Doob's optional-stopping theorem (bounded stopping times)).

Let (Mn)n0(M_n)_{n \geq 0} be a supermartingale and τ\tau a stopping time with τN\tau \leq N almost surely for some finite NN. Then

E[Mτ]    E[M0].(2.7)\mathbb{E}[M_\tau] \;\leq\; \mathbb{E}[M_0]. \qquad\qquad (2.7)
Proof.

The key combinatorial identity: for any 0τN0 \leq \tau \leq N,

Mτ  =  M0+n=0N1(Mn+1Mn)1{n<τ}.M_\tau \;=\; M_0 \,+\, \sum_{n=0}^{N-1} (M_{n+1} - M_n)\, \mathbf{1}\{n < \tau\}.

For τ=k\tau = k, the indicators are 1{n<k}=1\mathbf{1}\{n < k\} = 1 for n=0,,k1n = 0, \ldots, k-1 and 00 otherwise, so the sum telescopes to Mk=MτM_k = M_\tau. Taking expectations and using {n<τ}={τn}cFn\{n < \tau\} = \{\tau \leq n\}^c \in \mathcal{F}_n,

E[(Mn+1Mn)1{n<τ}]=E[1{n<τ}E[Mn+1MnFn]]0\mathbb{E}[(M_{n+1} - M_n)\, \mathbf{1}\{n < \tau\}] = \mathbb{E}[\mathbf{1}\{n < \tau\}\, \mathbb{E}[M_{n+1} - M_n \mid \mathcal{F}_n]] \leq 0

by the supermartingale property. Summing over n=0,,N1n = 0, \ldots, N-1 gives E[Mτ]E[M0]\mathbb{E}[M_\tau] \leq \mathbb{E}[M_0].

For unbounded τ\tau the statement still holds for nonnegative supermartingales: E[Mτ1{τ<}]E[M0]\mathbb{E}[M_\tau \mathbf{1}\{\tau < \infty\}] \leq \mathbb{E}[M_0] — the uniform-integrability extension. We won’t need the unbounded version directly.

Corollary 2.6 (Stopping-time invariance of nonnegative supermartingales).

Let (Mn)n0(M_n)_{n \geq 0} be a nonnegative supermartingale with E[M0]1\mathbb{E}[M_0] \leq 1. Then for any stopping time τ\tau and any α(0,1)\alpha \in (0, 1),

P(Mτ1/α,  τ<)    α.(2.8)\mathbb{P}(M_\tau \geq 1/\alpha,\; \tau < \infty) \;\leq\; \alpha. \qquad\qquad (2.8)

Compare (2.8) with Proposition 1.2. Both characterize the property “guarantee holds for every stopping rule”. This is the structural reason supermartingales give anytime-valid inference.

2.4 The fixed-nn → time-uniform recipe

Step 1 — start with a fixed-nn Chernoff bound. Identify a class of distributions and an MGF upper bound ψ(λ)\psi(\lambda) such that E[eλXi]eψ(λ)\mathbb{E}[e^{\lambda X_i}] \leq e^{\psi(\lambda)} for λI\lambda \in I.

Step 2 — construct the Wald–Doob supermartingale. Mn(λ)=exp(λSnnψ(λ))M_n(\lambda) = \exp(\lambda S_n - n\, \psi(\lambda)), M0(λ)=1M_0(\lambda) = 1, is a nonnegative supermartingale by Example 2.

Step 3 — apply Ville. P(supnMn(λ)1/δ)δ\mathbb{P}(\sup_n M_n(\lambda) \geq 1/\delta) \leq \delta, or equivalently

P ⁣(n1:λSnnψ(λ)log(1/δ))δ.(2.9)\mathbb{P}\!\left(\exists\, n \geq 1 : \lambda S_n - n\,\psi(\lambda) \geq \log(1/\delta)\right) \leq \delta. \qquad\qquad (2.9)

Step 4 — mix over λ\lambda to get an adaptive boundary. Replace the single-λ\lambda bound with Mnmix=Mn(λ)dπ(λ)M_n^{\mathrm{mix}} = \int M_n(\lambda)\, d\pi(\lambda) for a mixing distribution π\pi. By Fubini, MnmixM_n^{\mathrm{mix}} is a nonnegative supermartingale. Ville applies, giving the method-of-mixtures boundary of Robbins (1970).

This cookbook drives §§3–6. Every confidence sequence is steps 1–4 instantiated with a different MGF bound and mixing distribution.

We close by validating the recipe numerically. Under H0:μ=0H_0: \mu = 0, MnsG(λ)M_n^{\mathrm{sG}}(\lambda) trajectories stay bounded by Ville. Under H1:μ=μ1>0H_1: \mu = \mu_1 > 0, logMn\log M_n exhibits linear positive drift λμ1λ2σ2/2\lambda \mu_1 - \lambda^2 \sigma^2/2, maximized at λ=μ1/σ2\lambda^\star = \mu_1 / \sigma^2, giving the optimal log-growth μ12/(2σ2)\mu_1^2 / (2\sigma^2) — exactly the KL divergence between N(μ1,σ2)\mathcal{N}(\mu_1, \sigma^2) and N(0,σ2)\mathcal{N}(0, \sigma^2). This is the first glimpse of the growth-rate-optimal property of §9.

σ = 1. Each single-λ line is tight at one n; the mixture tracks the lower envelope and only loses a constant factor over LIL.
Empirical sup-crossing probability vs n under H_0 (stays below δ, validating Ville) and under H_1 (climbs to 1, demonstrating power).
Notebook Figure 4. Validating Ville's recipe — the empirical sup-crossing probability stays below δ = 0.05 under H_0 (left, the type-I error guarantee) and climbs to ≈ 1 under H_1 (right, the power story).

3. The sequential probability ratio test (SPRT)

§2’s recipe is more easily appreciated through its first historical instance: Abraham Wald’s sequential probability ratio test, developed at Columbia’s Statistical Research Group in 1943 for wartime munitions inspection and declassified in Wald (1945, 1947). The SPRT is the prototype anytime-valid procedure — its statistic is a martingale under the null, Ville’s inequality gives the type-I error guarantee, and Wald’s identity gives the expected stopping time. The whole modern theory of e-processes is, structurally, the SPRT’s likelihood ratio reinterpreted as a betting wealth and generalized beyond simple-vs-simple testing.

3.1 Wald’s likelihood-ratio martingale

The SPRT addresses the simple-vs-simple hypothesis test H0:Xif0H_0: X_i \sim f_0 vs H1:Xif1H_1: X_i \sim f_1 for known densities. Wald’s central object is the likelihood ratio

Λn  =  i=1nf1(Xi)f0(Xi),Λ0=1.(3.1)\Lambda_n \;=\; \prod_{i=1}^n \frac{f_1(X_i)}{f_0(X_i)}, \qquad \Lambda_0 = 1. \qquad\qquad (3.1)

Read Λn\Lambda_n as the betting odds in favor of H1H_1 after seeing X1,,XnX_1, \ldots, X_n. The logarithm turns this into a sum, logΛn=Zi\log \Lambda_n = \sum Z_i with Zi=log(f1(Xi)/f0(Xi))Z_i = \log(f_1(X_i)/f_0(X_i)).

Lemma 3.1 (Likelihood-ratio martingale).

Let f0,f1f_0, f_1 be densities with respect to μ\mu, with f0>0f_0 > 0 on the support of f1f_1. For i.i.d. Xif0X_i \sim f_0 and Fn=σ(X1,,Xn)\mathcal{F}_n = \sigma(X_1, \ldots, X_n), (Λn)n0(\Lambda_n)_{n \geq 0} is a nonnegative martingale with EH0[Λn]=1\mathbb{E}_{H_0}[\Lambda_n] = 1.

Proof.

Nonnegativity is automatic. For the martingale property,

EH0[Λn+1Fn]=Λnf1(x)f0(x)f0(x)dμ(x)=Λnf1(x)dμ(x)=Λn.\mathbb{E}_{H_0}[\Lambda_{n+1} \mid \mathcal{F}_n] = \Lambda_n \cdot \int \frac{f_1(x)}{f_0(x)} f_0(x)\, d\mu(x) = \Lambda_n \cdot \int f_1(x)\, d\mu(x) = \Lambda_n.

Induction with EH0[Λ0]=1\mathbb{E}_{H_0}[\Lambda_0] = 1 closes the argument.

That this is a martingale (equality, not inequality) is the cleanest possible setup for the time-uniform lift.

Example 3 (Gaussian mean shift).

For f0=N(μ0,σ2)f_0 = \mathcal{N}(\mu_0, \sigma^2) and f1=N(μ1,σ2)f_1 = \mathcal{N}(\mu_1, \sigma^2),

Zi  =  μ1μ0σ2 ⁣(Xiμ0+μ12).(3.2)Z_i \;=\; \frac{\mu_1 - \mu_0}{\sigma^2}\!\left(X_i - \frac{\mu_0 + \mu_1}{2}\right). \qquad\qquad (3.2)

Under H0H_0, EH0[Zi]=KL(f0f1)<0\mathbb{E}_{H_0}[Z_i] = -\mathrm{KL}(f_0 \,\|\, f_1) < 0. Under H1H_1, EH1[Zi]=+KL(f1f0)>0\mathbb{E}_{H_1}[Z_i] = +\mathrm{KL}(f_1 \,\|\, f_0) > 0.

The SPRT chooses two thresholds 0<B<1<A0 < B < 1 < A and defines

τ  =  inf{n1:ΛnA or ΛnB},(3.3)\tau \;=\; \inf\{\, n \geq 1 : \Lambda_n \geq A \text{ or } \Lambda_n \leq B \,\}, \qquad\qquad (3.3)

with decision: reject H0H_0 if ΛτA\Lambda_\tau \geq A, accept H0H_0 if ΛτB\Lambda_\tau \leq B.

A = 19.00, B = 0.0526. Empirical α = 0.037 (target 0.050), β = 0.048 (target 0.050).
Log likelihood ratio trajectories under H_0 (blue, drifting down) and H_1 (red, drifting up), with horizontal boundaries at log A and log B; stopped paths marked with terminal dots.
Notebook Figure 5. SPRT trajectories under both hypotheses (μ_0 = 0, μ_1 = 0.3, σ = 1, α = β = 0.05) — log Λ_n drifts in opposite directions under H_0 and H_1, hitting the Wald boundaries log A ≈ 2.94 and log B ≈ -2.94 at the stopping time τ.

3.2 Boundary-crossing analysis and Wald’s identity

Type-I error. By Lemma 3.1 and Ville (Theorem 2.3),

α  =  PH0(reject)    PH0 ⁣(supnΛnA)    1A.(3.4)\alpha \;=\; \mathbb{P}_{H_0}(\text{reject}) \;\leq\; \mathbb{P}_{H_0}\!\left(\sup_n \Lambda_n \geq A\right) \;\leq\; \frac{1}{A}. \qquad\qquad (3.4)

Type-II error. The dual process 1/Λn1/\Lambda_n is a nonnegative martingale under H1H_1. Ville again:

β  =  PH1(accept)    B.(3.5)\beta \;=\; \mathbb{P}_{H_1}(\text{accept}) \;\leq\; B. \qquad\qquad (3.5)

Setting A=1/αA = 1/\alpha and B=βB = \beta controls both errors. Wald’s tighter approximation:

AWald=(1β)/α,BWald=β/(1α).(3.6)A^{\text{Wald}} = (1 - \beta)/\alpha, \qquad B^{\text{Wald}} = \beta/(1 - \alpha). \qquad\qquad (3.6)

Theorem 3.2 (Wald's identity).

Let Z1,Z2,Z_1, Z_2, \ldots be i.i.d. with EZ1<\mathbb{E}|Z_1| < \infty, and τ\tau a stopping time with E[τ]<\mathbb{E}[\tau] < \infty. Then

E ⁣[i=1τZi]  =  E[τ]E[Z1].(3.7)\mathbb{E}\!\left[\sum_{i=1}^\tau Z_i\right] \;=\; \mathbb{E}[\tau] \cdot \mathbb{E}[Z_1]. \qquad\qquad (3.7)
Proof.

The centered random walk Mn=SnnE[Z1]M_n = S_n - n \mathbb{E}[Z_1] is a martingale. Apply Doob’s optional-stopping theorem to τN\tau \wedge N: E[SτN]=E[τN]E[Z1]\mathbb{E}[S_{\tau \wedge N}] = \mathbb{E}[\tau \wedge N] \cdot \mathbb{E}[Z_1]. Under the integrability hypothesis, dominated convergence extends to NN \to \infty.

Under no-overshoot, logΛταlogA+(1α)logB\log \Lambda_\tau \approx \alpha \log A + (1-\alpha) \log B under H0H_0, giving

EH0[τ]    αlogA+(1α)logBKL(f0f1),(3.8)\mathbb{E}_{H_0}[\tau] \;\approx\; \frac{\alpha \log A + (1-\alpha)\log B}{-\mathrm{KL}(f_0 \,\|\, f_1)}, \qquad\qquad (3.8)

and analogously EH1[τ][(1β)logA+βlogB]/KL(f1f0)\mathbb{E}_{H_1}[\tau] \approx [(1-\beta) \log A + \beta \log B]/\mathrm{KL}(f_1 \,\|\, f_0).

Example 4 (Gaussian SPRT — the efficiency win).

μ0=0,μ1=0.3,σ=1,α=β=0.05\mu_0 = 0, \mu_1 = 0.3, \sigma = 1, \alpha = \beta = 0.05: KL=0.045\mathrm{KL} = 0.045, logA2.94\log A \approx 2.94, logB2.94\log B \approx -2.94. Plugging in, Wald’s identity gives EH0[τ]59\mathbb{E}_{H_0}[\tau] \approx 59, with empirical mean 64\approx 64 across B=5,000B = 5{,}000 Monte Carlo replicates. The fixed-nn Neyman–Pearson test at the same operating point needs nNP120n_{\text{NP}} \approx 120 — the SPRT detects in about half the samples.

Mean ratio E[τ]/n_NP = 49.0% — the Wald–Wolfowitz efficiency gap.
Bar chart comparing SPRT expected stopping time E[τ] against the fixed-n Neyman–Pearson sample size n_NP across effect sizes; SPRT is consistently ~50% of n_NP.
Notebook Figure 6. The Wald–Wolfowitz efficiency win — SPRT's expected stopping time is uniformly ~50% of the fixed-n Neyman–Pearson sample size across effect sizes from |μ_1 − μ_0|/σ = 0.1 to 1.0, at α = β = 0.05.

3.3 SPRT minimax optimality (Wald–Wolfowitz)

Theorem 3.3 (Wald–Wolfowitz (1948)).

Among all sequential tests with type-I error α\leq \alpha, type-II error β\leq \beta, and finite expected stopping times, the SPRT achieves the minimum EH0[τ]\mathbb{E}_{H_0}[\tau] and EH1[τ]\mathbb{E}_{H_1}[\tau] simultaneously.

Proof sketch. Place a Bayes prior π\pi on {H0,H1}\{H_0, H_1\} plus per-observation sampling cost cc. The Bayes-optimal test, by dynamic programming on the posterior πn=πΛn/(πΛn+(1π))\pi_n = \pi \Lambda_n / (\pi \Lambda_n + (1-\pi)), depends only on whether πn\pi_n exits a fixed interval — exactly the SPRT structure. Every SPRT corresponds to a Bayes-optimal test for some prior. Full development in Wald and Wolfowitz (1948), Lehmann and Romano (2005, ch. 4), Siegmund (1985, ch. II).

The simultaneous-minimization claim is remarkable: no procedure does better under one hypothesis without doing worse under the other. The 50% efficiency win of Example 4 is the best possible gain.

Caveat about scope. Theorem 3.3 covers simple-vs-simple. For composite alternatives, no single LR adapts to all — Robbins’s mixtures (§4) and the betting / predictable-mixture constructions (§6) recover anytime-validity at the price of giving up the simple-vs-simple optimum.

3.4 SPRT as the prototype e-process

Under H0H_0, Λn\Lambda_n is a nonnegative martingale with EH0[Λn]=1\mathbb{E}_{H_0}[\Lambda_n] = 1. So PH0(supnΛn1/α)α\mathbb{P}_{H_0}(\sup_n \Lambda_n \geq 1/\alpha) \leq \alpha by Ville. The rejection rule “reject H0H_0 when Λn1/α\Lambda_n \geq 1/\alpha” is anytime-valid at level α\alpha — for every stopping rule.

This is the definition of an e-process (formalized in §8). The likelihood ratio is the prototype: a nonnegative process with unit expectation under the null, applied via Ville. Every modern anytime-valid construction inherits this structure.

Under H1H_1, logΛn\log \Lambda_n has linear positive drift KL(f1f0)\mathrm{KL}(f_1 \,\|\, f_0) — the largest possible growth rate among nonnegative H0H_0-martingales. This is the growth-rate-optimal (GRO) property of Grünwald and Koolen (2022), formalized in §9. The Kelly-criterion betting interpretation (§6, §9) makes the identification precise: Λn\Lambda_n is the wealth of a gambler who knows f1/f0f_1/f_0 exactly and bets log-optimally.

4. Robbins’ method of mixtures

The SPRT works for simple-vs-simple but in practice the alternative is almost never a single distribution. The A/B-test analyst doesn’t know whether the lift will be 0.1, 0.3, or 0.5 — only that it’s some unknown μ0\mu \neq 0. The single-λ\lambda supermartingale of §2 is similarly fragile: tight at one sample size and one tilt, with the bound deteriorating rapidly away from those settings. Herbert Robbins (1970) proposed the structural fix — instead of choosing a single λ\lambda, integrate over a prior π(λ)\pi(\lambda). The mixed object inherits the supermartingale property from its constituents (by Fubini), and the resulting confidence sequence adapts simultaneously to every λ\lambda in the support of π\pi.

4.1 Mixing over alternatives to adapt

Two related problems motivate the mixing trick: composite alternatives (the SPRT of §3 requires a specific f1f_1; a misspecified μ1\mu_1 accumulates evidence at the wrong rate) and unknown sample size (the fixed-nn → time-uniform recipe noted that single-λ\lambda bounds are tight at one nn and loose elsewhere). The structural punchline: a single-λ\lambda test is fragile in two senses, and the same fix — average over λ\lambda — handles both.

4.2 The Robbins–Siegmund mixture supermartingale

Take a family (Mn(λ))n0(M_n(\lambda))_{n \geq 0} of nonnegative supermartingales with E[M0(λ)]=1\mathbb{E}[M_0(\lambda)] = 1 each. Choose a probability measure π\pi on Λ\Lambda — the mixing prior. Define

Mnmix  =  ΛMn(λ)dπ(λ).(4.1)M_n^{\mathrm{mix}} \;=\; \int_\Lambda M_n(\lambda)\, d\pi(\lambda). \qquad\qquad (4.1)

Theorem 4.1 (Mixture supermartingale).

Assume (λ,ω)Mn(λ)(ω)(\lambda, \omega) \mapsto M_n(\lambda)(\omega) is jointly measurable, Mn(λ)0M_n(\lambda) \geq 0, and each M(λ)M_\cdot(\lambda) is a supermartingale w.r.t. (Fn)(\mathcal{F}_n) with E[M0(λ)]=1\mathbb{E}[M_0(\lambda)] = 1. Then MnmixM_n^{\mathrm{mix}} is a nonnegative supermartingale with E[M0mix]=1\mathbb{E}[M_0^{\mathrm{mix}}] = 1.

Proof.

Nonnegativity and unit initial expectation are immediate from Tonelli. Adapted-ness: each Mn(λ)M_n(\lambda) is Fn\mathcal{F}_n-measurable; joint measurability + Fubini makes MnmixM_n^{\mathrm{mix}} Fn\mathcal{F}_n-measurable. Supermartingale property: by Fubini on the nonnegative integrand,

E[Mn+1mixFn]=E[Mn+1(λ)Fn]dπ(λ)Mn(λ)dπ(λ)=Mnmix.\mathbb{E}[M_{n+1}^{\mathrm{mix}} \mid \mathcal{F}_n] = \int \mathbb{E}[M_{n+1}(\lambda) \mid \mathcal{F}_n]\, d\pi(\lambda) \leq \int M_n(\lambda)\, d\pi(\lambda) = M_n^{\mathrm{mix}}.

Theorem 4.1 + Ville (Theorem 2.3) gives P(supnMnmix1/δ)δ\mathbb{P}(\sup_n M_n^{\mathrm{mix}} \geq 1/\delta) \leq \delta. Every mixing prior π\pi produces an anytime-valid level-δ\delta test.

There is a Bayesian–frequentist symmetry worth flagging. The integral (4.1) is exactly the marginal likelihood from a Bayesian model with prior π\pi. Yet our usage is purely frequentist: Ville’s inequality, not Bayes’s theorem, and the resulting type-I error guarantee holds unconditionally on π\pi. This is the safe testing perspective of Grünwald, de Heide, and Koolen (2024).

4.3 Conjugate Gaussian–Gaussian mixture and the closed-form boundary

The canonical example: XiiidN(μ,σ2)X_i \overset{\text{iid}}{\sim} \mathcal{N}(\mu, \sigma^2) with known σ\sigma, test H0:μ=0H_0: \mu = 0 vs composite μR\mu \in \mathbb{R}. The sub-Gaussian Wald–Doob martingale is Mn(λ)=exp(λSnnλ2σ2/2)M_n(\lambda) = \exp(\lambda S_n - n \lambda^2 \sigma^2 / 2). Mix with λN(0,ρ2)\lambda \sim \mathcal{N}(0, \rho^2).

The mixed integral combines exponents: λSnλ2(1+nρ2σ2)/(2ρ2)\lambda S_n - \lambda^2 \cdot (1 + n \rho^2 \sigma^2)/(2\rho^2). Completing the square in λ\lambda and integrating out gives the closed-form Robbins mixture

Mnmix  =  11+nρ2σ2exp ⁣(ρ2Sn22(1+nρ2σ2)).(4.3)M_n^{\mathrm{mix}} \;=\; \frac{1}{\sqrt{1 + n \rho^2 \sigma^2}}\, \exp\!\left(\frac{\rho^2 S_n^2}{2(1 + n \rho^2 \sigma^2)}\right). \qquad\qquad (4.3)

Applying Ville at threshold 1/δ1/\delta and inverting for the running mean (after translating from H0:μ=0H_0: \mu = 0 to general μ\mu by replacing XiX_i with XiμX_i - \mu):

Theorem 4.2 (Gaussian-mixture confidence sequence).

Let XiiidN(μ,σ2)X_i \overset{\text{iid}}{\sim} \mathcal{N}(\mu, \sigma^2) with σ\sigma known. For any ρ>0\rho > 0 and δ(0,1)\delta \in (0, 1),

Cnmix(ρ)  =  Xˉn  ±  2(1+nρ2σ2)n2ρ2log1+nρ2σ2δ(4.4)C_n^{\mathrm{mix}}(\rho) \;=\; \bar X_n \;\pm\; \sqrt{\frac{2(1 + n \rho^2 \sigma^2)}{n^2 \rho^2}\, \log \frac{\sqrt{1 + n \rho^2 \sigma^2}}{\delta}} \qquad\qquad (4.4)

is a (1δ)(1 - \delta)-confidence sequence for μ\mu.

Three things to read off. (i) The boundary scales as logn/n\sqrt{\log n / n} for large nn — a logn\sqrt{\log n} factor over the fixed-nn Hoeffding bound. (ii) The parameter ρ\rho is a bias–variance dial: large ρ\rho tight at small nn, small ρ\rho tight at large nn. Optimal ρ=1/(σn)\rho^\star = 1/(\sigma\sqrt{n^\star}) for design horizon nn^\star gives the tightest CS near nn^\star. (iii) The boundary (4.4) is the lower envelope of all single-λ\lambda boundaries (by Jensen on the log in (4.1)) — for every nn, the mixture is at worst a small constant factor wider than the best single-λ\lambda choice for that nn.

σ = 1. The mixture (4.4) tracks the lower envelope of single-λ boundaries; LIL is the asymptotic floor.
Log-log plot of single-λ sub-Gaussian boundaries (thin gray) and Gaussian-mixture envelope (blue) and LIL rate (dashed brown), showing the mixture tracking the lower envelope of single-λ choices.
Notebook Figure 7. Single-λ vs Gaussian-mixture envelope (log-log, n ∈ [10, 10^6]). Each thin gray line is a single-λ sub-Gaussian boundary tight at one n; the blue Gaussian-mixture (4.4) tracks the lower envelope at every n. The dashed brown LIL rate σ√(2 log log n / n) is the asymptotic floor.

4.4 Why mixtures almost match the LIL rate

The Gaussian mixture boundary is σlogn/n\sigma\sqrt{\log n / n}; the LIL lower bound is σ2loglogn/n\sigma\sqrt{2\log\log n / n}. The ratio is logn/loglogn\sqrt{\log n / \log\log n} — at n=106n = 10^6, 2.3\approx 2.3. Closing the gap requires heavier-tailed mixing densities:

  1. Stitched discretized mixtures (Howard et al. 2020, §4). Discretize a geometric grid of design horizons nk=2kn_k = 2^k, take a separate single-λ\lambda boundary tight at each nkn_k, and union-bound them with weights wk1/k2w_k \propto 1/k^2. Matches LIL with explicit constants. No closed form.
  2. Predictable-mixture / betting boundaries (Waudby-Smith and Ramdas 2024). Replace fixed π\pi with a data-dependent (predictable) sequence λ1,λ2,\lambda_1, \lambda_2, \ldots. Formally tighter than any fixed-prior mixture and matches the LIL rate. Developed in §6.

For practical horizons (n106n \leq 10^6), the Gaussian-mixture’s 2\sqrt{2}3×3\times widening over LIL is a modest price for the closed-form expression (4.3), and most A/B-testing platforms ship it as default. The formal LIL lower bound is in §13.3.

Max false-rejection over n ∈ [50, 10000]: fixed-n = 0.658, mixture = 0.032 (target δ = 0.05).
Side-by-side empirical false-rejection rate vs n_max — the fixed-n band climbs past 60%, while the Gaussian-mixture CS stays flat below δ = 0.05.
Notebook Figure 8. The mixture CS closes the loop on §1's peeking demonstration. Under the same aggressive-analyst stopping rule, the fixed-n band's empirical false-rejection rate climbs past 60% by n_max = 10,000 (left, anytime-invalid), while the Gaussian-mixture CS (4.4) stays flat at ≈ 0.03 — below the nominal δ = 0.05 for the full horizon (right, anytime-valid).

5. The Howard–Ramdas–McAuliffe–Sekhon boundary atlas

§§2–4 built one confidence sequence — the sub-Gaussian Robbins mixture. The systematic observation of Howard, Ramdas, McAuliffe, and Sekhon (2020), the paper that systematized the modern field, is that every fixed-nn Chernoff bound has a time-uniform analogue. This section catalogues the three boundary classes that cover essentially every applied use of AVI: sub-Gaussian, sub-exponential, sub-Bernoulli.

5.1 The Wald–Doob exponential supermartingale

For i.i.d. XiX_i with E[X1]=μ\mathbb{E}[X_1] = \mu, define the centered CGF ψ(λ)=logE[eλ(X1μ)]\psi(\lambda) = \log \mathbb{E}[e^{\lambda(X_1 - \mu)}]. The Wald–Doob exponential martingale is

Mn(λ)  =  exp ⁣(λ(Snnμ)nψ(λ)),M0(λ)=1.(5.2)M_n(\lambda) \;=\; \exp\!\left(\lambda (S_n - n\mu) - n\, \psi(\lambda)\right), \qquad M_0(\lambda) = 1. \qquad\qquad (5.2)

Lemma 5.1 (Sub-class supermartingale).

Let ψ~:IR\tilde\psi: I \to \mathbb{R} upper-bound ψ\psi on an interval I0I \ni 0: ψ(λ)ψ~(λ)\psi(\lambda) \leq \tilde\psi(\lambda). Then M~n(λ)=exp(λ(Snnμ)nψ~(λ))\tilde M_n(\lambda) = \exp(\lambda(S_n - n\mu) - n \tilde\psi(\lambda)) is a nonnegative supermartingale with E[M~n(λ)]1\mathbb{E}[\tilde M_n(\lambda)] \leq 1 for every λI\lambda \in I.

Proof.

Nonnegativity is immediate. One-step ratio:

E[M~n+1/M~nFn]=E[eλ(Xn+1μ)]eψ~(λ)=eψ(λ)ψ~(λ)1\mathbb{E}[\tilde M_{n+1}/\tilde M_n \mid \mathcal{F}_n] = \mathbb{E}[e^{\lambda(X_{n+1} - \mu)}] \cdot e^{-\tilde\psi(\lambda)} = e^{\psi(\lambda) - \tilde\psi(\lambda)} \leq 1

since ψ(λ)ψ~(λ)\psi(\lambda) \leq \tilde\psi(\lambda).

Recipe. Pick a distribution class → choose ψ~\tilde\psi → build M~n(λ)\tilde M_n(\lambda) → mix over λ\lambda → apply Ville → invert.

5.2 Sub-Gaussian boundaries

For sub-Gaussian variance proxy σ2\sigma^2, ψ~sG(λ)=λ2σ2/2\tilde\psi_{\mathrm{sG}}(\lambda) = \lambda^2 \sigma^2 / 2 for all λR\lambda \in \mathbb{R}. The single-λ\lambda boundary is a straight line in the (n,Sn)(n, S_n) plane with intercept log(1/δ)/λ\log(1/\delta)/\lambda and slope λσ2/2\lambda \sigma^2 / 2. The Robbins mixture (4.4) sits along the lower envelope of all these lines, at the cost of a logn/loglogn\sqrt{\log n / \log\log n} excess over LIL. This is the baseline anytime-valid boundary.

5.3 Sub-exponential boundaries via Bernstein tilting

Sub-Gaussian excludes Poisson, χ2\chi^2, exponential. The next-most-permissive class is sub-exponential.

Definition 5.2 (Sub-exponential).

A random variable XX with mean μ\mu is sub-exponential with parameters (ν2,c)(\nu^2, c) if

ψX(λ)    ψ~sE(λ)  :=  λ2ν22(1cλ)for all λ<1/c.(5.7)\psi_X(\lambda) \;\leq\; \tilde\psi_{\mathrm{sE}}(\lambda) \;:=\; \frac{\lambda^2 \nu^2}{2 \,(1 - c |\lambda|)} \qquad \text{for all } |\lambda| < 1/c. \qquad\qquad (5.7)

ν2\nu^2 is the variance scale, cc controls exponential-tail behavior. Setting c=0c = 0 recovers sub-Gaussian.

Examples. Centered Poisson(λ)(\lambda) is sub-exponential with ν2=λ,c=1\nu^2 = \lambda, c = 1; centered χd2\chi^2_d has ν2=4d,c=4\nu^2 = 4d, c = 4; centered Exponential(1) has ν2=1,c=1\nu^2 = 1, c = 1.

The time-uniform Bennett-mixture CS has asymptotic form

CnsEmix  =  Xˉn  ±  ν2log(n/δ)nsub-Gaussian-like main  ±  clog(1/δ)ntail correction.(5.9)C_n^{\mathrm{sE-mix}} \;=\; \bar X_n \;\pm\; \underbrace{\nu \sqrt{\frac{2 \log(\sqrt{n}/\delta)}{n}}}_{\text{sub-Gaussian-like main}} \;\pm\; \underbrace{\frac{c \log(1/\delta)}{n}}_{\text{tail correction}}. \qquad\qquad (5.9)

For n100c2/ν2n \geq 100 \cdot c^2 / \nu^2, the tail correction is negligible.

Example 5 (Poisson CS).

XiiidPoisson(μ)X_i \overset{\text{iid}}{\sim} \mathrm{Poisson}(\mu): ν2=μ,c=1\nu^2 = \mu, c = 1. CnXˉn±2Xˉnlog(n/δ)/n±log(1/δ)/nC_n \approx \bar X_n \pm \sqrt{2 \bar X_n \log(\sqrt{n}/\delta)/n} \pm \log(1/\delta)/n — the anytime-valid analogue of the Wald rate-ratio CI.

5.4 Sub-Bernoulli boundaries via the binary KL function

For bounded X[0,1]X \in [0, 1] with mean μ\mu, the Bernoulli MGF gives a tighter bound than worst-case sub-Gaussian σ2=1/4\sigma^2 = 1/4.

Lemma 5.3 (Bernoulli MGF bound).

If X[0,1]X \in [0,1] with E[X]=μ\mathbb{E}[X] = \mu,

E ⁣[eλX]    (1μ)+μeλ,(5.10)\mathbb{E}\!\left[e^{\lambda X}\right] \;\leq\; (1 - \mu) + \mu e^\lambda, \qquad\qquad (5.10)

with equality iff XX is Bernoulli(μ)(\mu).

Proof.

f(x)=eλxf(x) = e^{\lambda x} is convex. For x[0,1]x \in [0,1], eλx(1x)e0+xeλe^{\lambda x} \leq (1-x) e^0 + x e^\lambda. Take expectation.

Taking logs: ψX(λ)log((1μ)+μeλ)λμ\psi_X(\lambda) \leq \log((1-\mu) + \mu e^\lambda) - \lambda \mu. The fixed-nn Chernoff bound under this MGF is P(Xˉnμ+ϵ)exp(nkl(μ+ϵμ))\mathbb{P}(\bar X_n \geq \mu + \epsilon) \leq \exp(-n \cdot \mathrm{kl}(\mu + \epsilon \,\|\, \mu)) where kl(pq)=plog(p/q)+(1p)log((1p)/(1q))\mathrm{kl}(p \| q) = p \log(p/q) + (1-p)\log((1-p)/(1-q)) is the binary KL function. This is tighter than sub-Gaussian Hoeffding (which uses σ2=1/4\sigma^2 = 1/4); the asymmetry of binary KL near μ=0\mu = 0 or μ=1\mu = 1 captures the rare-event regime correctly.

The Wald–Doob supermartingale is M~nsB(λ,μ)=eλSn/[(1μ)+μeλ]n\tilde M_n^{\mathrm{sB}}(\lambda, \mu) = e^{\lambda S_n} / [(1-\mu) + \mu e^\lambda]^n, a martingale under Bernoulli(μ)(\mu). Mixing in the Bayesian-updated pp-parametrization with pBeta(a,b)p \sim \mathrm{Beta}(a, b) gives the Beta-binomial marginal likelihood ratio

MnsBmix(μ;a,b)  =  B(a+Sn,b+nSn)/B(a,b)μSn(1μ)nSn.(5.14)M_n^{\mathrm{sB-mix}}(\mu \,;\, a, b) \;=\; \frac{B(a + S_n, b + n - S_n) / B(a, b)}{\mu^{S_n} (1 - \mu)^{n - S_n}}. \qquad\qquad (5.14)

The CS is the set of μ\mu where MnsBmix(μ)<1/δM_n^{\mathrm{sB-mix}}(\mu) < 1/\deltaimplicit, computed by bisection.

Example 6 (A/B test for conversion lift).

n=1000,Xˉn=0.05,δ=0.05n = 1000, \bar X_n = 0.05, \delta = 0.05: sub-Gaussian Hoeffding gives ±0.043\pm 0.043 (relative half-width 86%, useless). Sub-Bernoulli mixture CS gives ±0.018\pm 0.018 (relative half-width 36%). The sub-Bernoulli adapts to μ0.05\mu \approx 0.05 via the asymmetric KL.

σ = 1; Gaussian mixture (4.4) at ρ² = 1
Log-log plot of fixed-n, sub-Gaussian, sub-exponential (Poisson), and sub-Bernoulli boundary half-widths, with LIL reference.
Notebook Figure 9. The boundary atlas — fixed-n Hoeffding (dotted gray, anytime-invalid), sub-Gaussian Gaussian-mixture (4.4), sub-exponential Bennett mixture (5.9) for Poisson(1), and sub-Bernoulli Beta-mixture (5.14) at μ = 0.05, all at δ = 0.05. The LIL rate σ√(2 log log n / n) is the asymptotic floor.

5.5 Numerical comparison with fixed-nn Hoeffding

Setup: XiiidN(0,1)X_i \overset{\text{iid}}{\sim} \mathcal{N}(0, 1), δ=0.05\delta = 0.05, ρ2=1\rho^2 = 1. Fixed-nn Hoeffding: 2log(2/δ)/n\sqrt{2 \log(2/\delta)/n}.

nnFixed-nn HoeffdingSub-Gaussian mix (4.4)RatioSub-Bern. mix @ Xˉ=0.05\bar X = 0.05Ratio
10210^20.2720.2720.3270.3271.20×1.20\times0.1060.1060.39×0.39\times
10310^30.0860.0860.1140.1141.32×1.32\times0.0290.0290.34×0.34\times
10410^40.0270.0270.0390.0391.44×1.44\times0.0090.0090.34×0.34\times
10510^50.00860.00860.01320.01321.54×1.54\times0.0030.0030.35×0.35\times
10610^60.00270.00270.004450.004451.64×1.64\times0.0010.0010.36×0.36\times

The sub-Gaussian mixture is 1.201.201.64×1.64\times wider than fixed-nn Hoeffding — the cost of anytime-validity. Sample-size cost is 1.5\approx 1.53×3\times. The sub-Bernoulli at low conversion rates is tighter than fixed-nn Hoeffding via the binary KL.

■ fixed-n band  ■ anytime-valid mixture
Aggressive stopping rule; B = 1500 replicates per class.
Three-panel empirical false-rejection rate under aggressive stopping for sub-Gaussian, Poisson, and Bernoulli streams; fixed-n climbs high, mixture stays below δ.
Notebook Figure 10. Empirical coverage across three distribution classes. The fixed-n band's false-rejection rate climbs above δ under aggressive stopping in all three regimes; the corresponding mixture CS (sub-Gaussian, Bennett, Beta-mixture) keeps it bounded by δ uniformly.

Practitioner verdict. For bounded [0,1][0,1] data, sub-Bernoulli by default. For sub-Gaussian data, Robbins mixture (4.4). For Poisson/χ2\chi^2/exponential, Bennett mixture (5.9) with cc correction (negligible for nc2/ν2n \gg c^2/\nu^2).

6. Empirical-Bernstein and predictable mixtures — the betting reformulation

§5’s atlas got us most of the way, but two limitations linger. First, the sub-Gaussian and sub-exponential boundaries use worst-case variance proxies; for low-variance data the boundary is loose by a substantial factor (e.g., 25×25\times for Bernoulli at μ=0.01\mu = 0.01). Second, the closed-form Gaussian-mixture has a residual logn/loglogn\sqrt{\log n / \log\log n} excess over LIL. The reformulation that fixes both is due to Waudby-Smith and Ramdas (2024), built on game-theoretic probability (Vovk, Shafer) and Kelly’s (1956) information-theoretic gambling.

6.1 Variance adaptation as wealth accumulation

For small λ\lambda, expand exp(λ(Xiμ)λ2σ2/2)1+λ(Xiμ)λ2σ2/2+O(λ3)\exp(\lambda(X_i - \mu) - \lambda^2 \sigma^2/2) \approx 1 + \lambda(X_i - \mu) - \lambda^2 \sigma^2/2 + O(\lambda^3). Replace the worst-case σ2\sigma^2 correction with the empirical per-step λ2(Xiμ)2/2\lambda^2 (X_i - \mu)^2 / 2, giving the product form:

Kn(λ)  =  i=1n[1+λ(Xiμ)].(6.1)K_n(\lambda) \;=\; \prod_{i=1}^n \left[1 + \lambda(X_i - \mu)\right]. \qquad\qquad (6.1)

A cumulative product, not an exponential of a sum — the wealth process of a gambler betting fraction λ\lambda at each round.

Definition 6.1 (Betting wealth, fixed λ).

For i.i.d. Xi[a,b]X_i \in [a, b] with E[Xi]=μ\mathbb{E}[X_i] = \mu under H0H_0 and λ<1/max(μa,bμ)|\lambda| < 1 / \max(\mu - a, b - \mu), the betting wealth process is (6.1) with K0=1K_0 = 1.

Lemma 6.2 (Fixed-bet wealth is a martingale under H_0).

Under Definition 6.1, (Kn(λ))n0(K_n(\lambda))_{n \geq 0} is a nonnegative martingale with EH0[Kn(λ)]=1\mathbb{E}_{H_0}[K_n(\lambda)] = 1.

Proof.

Nonnegativity is from the λ|\lambda| constraint. One-step ratio:

EH0[1+λ(Xn+1μ)Fn]=1+λ0=1.\mathbb{E}_{H_0}[1 + \lambda(X_{n+1} - \mu) \mid \mathcal{F}_n] = 1 + \lambda \cdot 0 = 1.

Multiplying by Kn(λ)K_n(\lambda) and induction with K0=1K_0 = 1 gives the claim.

Variance adaptation as the headline win. For Bernoulli at μ=0.05\mu = 0.05: Var(X)=0.0475\mathrm{Var}(X) = 0.0475, but worst-case sub-Gaussian σ2=1/4=0.25\sigma^2 = 1/4 = 0.25. The boundary based on σ2\sigma^2 is 0.25/0.04752.3×\sqrt{0.25/0.0475} \approx 2.3\times wider than what the actual data warrant. The betting wealth recovers that factor automatically by using the actual deviations XiμX_i - \mu.

6.2 Predictable processes and Kelly bets

The fixed-λ\lambda wealth (6.1) is already a martingale. Lemma 6.2 has more to offer: the property survives when λ\lambda is a predictable process.

Definition 6.3 (Predictable bet schedule).

A sequence (λn)n1(\lambda_n)_{n \geq 1} is predictable if λn\lambda_n is Fn1\mathcal{F}_{n-1}-measurable for each nn. The predictable-bet wealth process is

Kn({λi})  =  i=1n[1+λi(Xiμ)],K0=1.(6.2)K_n(\{\lambda_i\}) \;=\; \prod_{i=1}^n \left[1 + \lambda_i \cdot (X_i - \mu)\right], \qquad K_0 = 1. \qquad\qquad (6.2)

Theorem 6.4 (Predictable wealth is a martingale under H_0).

With λn\lambda_n predictable and positivity-preserving, (Kn({λi}))n0(K_n(\{\lambda_i\}))_{n \geq 0} is a nonnegative martingale with EH0[Kn]=1\mathbb{E}_{H_0}[K_n] = 1.

Proof.

By predictability, λn+1\lambda_{n+1} is Fn\mathcal{F}_n-measurable, so

EH0[Kn+1/KnFn]=1+λn+1EH0[Xn+1μFn]=1.\mathbb{E}_{H_0}[K_{n+1}/K_n \mid \mathcal{F}_n] = 1 + \lambda_{n+1} \cdot \mathbb{E}_{H_0}[X_{n+1} - \mu \mid \mathcal{F}_n] = 1.

Predictability preserves the unbiasedness while letting λn\lambda_n be learned from prior observations.

Definition 6.5 (Kelly-optimal bet).

Under alternative H1H_1 with EH1[X]=μ1μ\mathbb{E}_{H_1}[X] = \mu_1 \neq \mu, the Kelly-optimal bet is

λ  =  argmaxλEH1 ⁣[log ⁣(1+λ(Xμ))].(6.3)\lambda^\star \;=\; \arg\max_\lambda \mathbb{E}_{H_1}\!\left[\log\!\left(1 + \lambda(X - \mu)\right)\right]. \qquad\qquad (6.3)

Two reasons: (i) expected detection time is approximately log(1/α)/EH1[L1]\log(1/\alpha) / \mathbb{E}_{H_1}[L_1] where Ln=logKnL_n = \log K_n, so maximizing EH1[L1]\mathbb{E}_{H_1}[L_1] minimizes detection time; (ii) by the LLN, logKn/nEH1[L1]\log K_n / n \to \mathbb{E}_{H_1}[L_1] almost surely.

Closed-form Kelly for small bets. First-order expansion gives the approximation

λ    μ1μEH1[(Xμ)2].(6.4)\lambda^\star \;\approx\; \frac{\mu_1 - \mu}{\mathbb{E}_{H_1}[(X - \mu)^2]}. \qquad\qquad (6.4)

For Bernoulli the exact closed form is λ=(μ1μ0)/[μ0(1μ0)]\lambda^\star = (\mu_1 - \mu_0)/[\mu_0(1-\mu_0)] — see (9.7) in §9.3. The two agree to leading order in μ1μ0|\mu_1 - \mu_0|.

Empirical Kelly. Plug in the running estimates:

λ^n  =  Xˉn1μm^2n1.(6.5)\hat\lambda_n \;=\; \frac{\bar X_{n-1} - \mu}{\hat m_2^{n-1}}. \qquad\qquad (6.5)

Each λ^n\hat\lambda_n depends only on X1,,Xn1X_1, \ldots, X_{n-1}, hence predictable.

Theoretical Kelly growth = kl(μ₁ ‖ μ₀) = 0.0201. Empirical Ĝ ≈ 0.0206 per step.
Log betting wealth trajectories under H_0 (bounded, blue) and H_1 (drifting up, red) with the rejection threshold log(1/δ).
Notebook Figure 11. Betting-wealth trajectories under H_0: Bern(0.5) (blue, bounded by Ville) and H_1: Bern(0.6) (red, drifting up at Kelly rate KL(0.6 ‖ 0.5) ≈ 0.0201). The horizontal threshold log(1/δ) is the rejection boundary.

6.3 Waudby-Smith–Ramdas (2024) bounded-variable boundaries

Waudby-Smith and Ramdas (2024) systematized the betting approach for bounded Xi[0,1]X_i \in [0, 1]. The PrPl-EB schedule (predictable plug-in, empirical Bernstein):

λ^nEB(μ)  =  sgn(Xˉn1μ)min ⁣(2log(2/δ)nσ^n12,    λnmax).(6.6)\hat\lambda_n^{\mathrm{EB}}(\mu) \;=\; \mathrm{sgn}(\bar X_{n-1} - \mu) \cdot \min\!\left(\sqrt{\frac{2 \log(2/\delta)}{n\, \hat\sigma_{n-1}^2}}, \;\; \lambda_n^{\max}\right). \qquad\qquad (6.6)

Theorem 6.6 (Empirical-Bernstein CS (Waudby-Smith and Ramdas 2024, Theorem 2)).

For XiX_i i.i.d. bounded on [0,1][0, 1] with mean μ\mu and variance σ2>0\sigma^2 > 0, the PrPl-EB CS satisfies P(μCn n)1δ\mathbb{P}(\mu \in C_n\ \forall n) \geq 1 - \delta, and asymptotically

HW(n)    2σ2loglognnas n.(6.7)\mathrm{HW}(n) \;\sim\; \sqrt{\frac{2 \sigma^2 \log\log n}{n}} \qquad \text{as } n \to \infty. \qquad\qquad (6.7)

LIL match with the true variance σ2\sigma^2, not the worst-case bound. This closes both the LIL gap (cf. §13.3) and the variance-adaptation gap at once.

Side-by-side Bernoulli(0.05) CS over n — sub-Bernoulli Beta-mixture (left) vs PrPl-EB (right, visibly tighter).
Notebook Figure 12. On a Bernoulli(0.05) stream, PrPl-EB (Theorem 6.6) is uniformly tighter than the sub-Bernoulli Beta-mixture (5.14). At n = 10⁴, the ratio settles to ≈ 0.5 — a 4× sample-size win.

Example 7 (PrPl-EB beats Beta-mixture for low μ).

At Xˉn=0.05,n=104,δ=0.05\bar X_n = 0.05, n = 10^4, \delta = 0.05: PrPl-EB half-width 0.0099\approx 0.0099; sub-Bernoulli Beta-mixture 0.023\approx 0.023. A 57%57\% reduction.

6.4 Practical numerical computation

Log-space computation. Always store logKn=log(1+λ^i(Xiμ))\log K_n = \sum \log(1 + \hat\lambda_i (X_i - \mu)), never KnK_n directly. Overflow at n200n \approx 200 for GRO bets on well-separated data.

CS endpoint computation via bisection. Kn(μ)K_n(\mu) is unimodal in μ\mu; CS endpoints are roots of logKn(μ)log(1/δ)=0\log K_n(\mu) - \log(1/\delta) = 0 via bisection. Tolerance 10610^{-6} suffices.

Clipping for positivity. λ^n<c/max(μa,bμ)|\hat\lambda_n| < c/\max(\mu - a, b - \mu) for c(0,1)c \in (0, 1). Standard c=0.5c = 0.5.

Forward-pointing. The betting wealth is the prototype of what §8 calls an e-process. The Kelly criterion is the prototype of §9’s growth-rate optimality. §10’s merging operations combine wealth processes from independent experiments.

8. e-values and e-processes

We’ve spent six sections building one object — a nonnegative (super)martingale with unit initial expectation that we plug into Ville — under three different names. In §3 it was Wald’s likelihood ratio. In §§4–5 it was a Robbins mixture. In §6 it was Waudby-Smith–Ramdas’s betting wealth. The structural identity, since Vovk (1993) and Vovk and Wang (2021), is the e-value / e-process. This section formalizes the concept, states the duality between e-processes and anytime-valid tests, catalogues the e-processes already built, and develops the bridge to p-values.

8.1 Definition — E[E] ≤ 1

Definition 8.1 (e-value).

Let H0H_0 specify PH0\mathbb{P}_{H_0}. A nonnegative random variable E0E \geq 0 is an e-value for H0H_0 if

EH0[E]    1.(8.1)\mathbb{E}_{H_0}[E] \;\leq\; 1. \qquad\qquad (8.1)

“e” for “expectation-bounded” or “evidence.” The budget "1\leq 1" is the catch: under the null, EE can’t on average exceed 11. Spending that budget on the tail — making PH0(E1/α)\mathbb{P}_{H_0}(E \geq 1/\alpha) small — is the test design problem.

Lemma 8.2 (Markov's inequality gives the e-test).

For an e-value EE and α(0,1)\alpha \in (0, 1),

PH0(E1/α)    α.(8.2)\mathbb{P}_{H_0}(E \geq 1/\alpha) \;\leq\; \alpha. \qquad\qquad (8.2)
Proof.

Markov applied to EE with a=1/αa = 1/\alpha: PH0(E1/α)αEH0[E]α\mathbb{P}_{H_0}(E \geq 1/\alpha) \leq \alpha \cdot \mathbb{E}_{H_0}[E] \leq \alpha.

(8.2) is the single-shot level-α\alpha test: reject when E1/αE \geq 1/\alpha.

Definition 8.3 (e-process).

A sequence (En)n0(E_n)_{n \geq 0} of e-values adapted to (Fn)(\mathcal{F}_n) is an e-process for H0H_0 if it is a nonnegative supermartingale under PH0\mathbb{P}_{H_0} with EH0[E0]1\mathbb{E}_{H_0}[E_0] \leq 1.

An e-process is a predictable strategy for spending the unit budget over time.

8.2 From e-value to anytime-valid test via Ville

Theorem 8.4 (Anytime-valid test from an e-process).

For an e-process (En)(E_n) and any α(0,1)\alpha \in (0, 1) and stopping time τ\tau,

PH0(Eτ1/α,  τ<)    PH0 ⁣(supn0En1/α)    α.(8.4)\mathbb{P}_{H_0}(E_\tau \geq 1/\alpha,\; \tau < \infty) \;\leq\; \mathbb{P}_{H_0}\!\left(\sup_{n \geq 0} E_n \geq 1/\alpha\right) \;\leq\; \alpha. \qquad\qquad (8.4)
Proof.

Set inclusion plus Ville (Theorem 2.3) at threshold 1/α1/\alpha using EH0[E0]1\mathbb{E}_{H_0}[E_0] \leq 1.

Theorem 8.5 (e-process / anytime-valid-test duality (informal)).

Every sequence of {0,1}\{0,1\}-valued anytime-valid tests at level α\alpha arises from an e-process via ϕn=1{En1/α}\phi_n = \mathbf{1}\{E_n \geq 1/\alpha\} for some e-process (En)(E_n).

Precise statement and proof in Howard et al. (2020, §2.3) and Ramdas et al. (2023, Theorem 4.1). Anytime-valid testing IS e-process testing.

8.3 The likelihood ratio as the canonical e-process

The constructions of §§3–6 are all e-processes:

  • §3: Wald’s LR. EH0[Λn]=1\mathbb{E}_{H_0}[\Lambda_n] = 1 by Lemma 3.1. ✓
  • §§2, 4, 5: Wald–Doob supermartingales and mixtures. ✓ (sub-Gaussian, sub-exponential, sub-Bernoulli variants).
  • §6: Betting wealth. ✓

The likelihood ratio is the canonical e-process; every other construction is a relaxation or generalization for the composite-alternative case.

Example 8 (Reading off the type-I error guarantee).

Sub-Bernoulli Beta-mixture e-process on conversion data: at n=1000n = 1000, MnsBmix(μ=0.05)=22M_n^{\mathrm{sB-mix}}(\mu = 0.05) = 22. By (8.4), the rejection rule “reject μ=0.05\mu = 0.05 when En20E_n \geq 20” has type-I error 0.05\leq 0.05for every stopping rule.

KL(0.05 ‖ 0.10) = 0.0167 per step (theoretical Wald drift).
Log-scale e-process trajectories: Wald LR with correct alternative, Wald LR with mis-specified null, Beta-mixture, PrPl-EB, and the fixed-n -log p (not an e-process).
Notebook Figure 13. The e-process catalogue on a Bernoulli(0.05) stream testing H_0: μ = 0.10. Wald LR (correct alt), sub-Bernoulli Beta-mixture, PrPl-EB are all anytime-valid; the fixed-n p-value's −log p (gray dotted) is not.

8.4 e-to-p calibration and the calibrator-of-calibrators

Definition 8.6 (p-value).

P[0,1]P \in [0, 1] is a valid p-value for H0H_0 if PH0(Pα)α\mathbb{P}_{H_0}(P \leq \alpha) \leq \alpha for every α(0,1)\alpha \in (0, 1).

Lemma 8.7 (e-to-p calibration).

For an e-value EE, P=min(1,1/E)P = \min(1, 1/E) is a valid p-value.

Proof.

{Pα}={E1/α}\{P \leq \alpha\} = \{E \geq 1/\alpha\} for α(0,1)\alpha \in (0,1), so PH0(Pα)=PH0(E1/α)α\mathbb{P}_{H_0}(P \leq \alpha) = \mathbb{P}_{H_0}(E \geq 1/\alpha) \leq \alpha by Lemma 8.2.

For an e-process, Pn=min(1,1/En)P_n = \min(1, 1/E_n) is a running anytime-valid p-value sequence.

The p-to-e direction is harder. The naive E=1/PE = 1/P fails: E[1/P]=\mathbb{E}[1/P] = \infty for PUniform(0,1)P \sim \mathrm{Uniform}(0,1).

Definition 8.8 (p-to-e calibrator).

f:[0,1][0,]f: [0, 1] \to [0, \infty] is a p-to-e calibrator if f(P)f(P) is an e-value whenever PP is a valid p-value.

Theorem 8.9 (Vovk and Wang 2021, calibrator characterization).

ff is a p-to-e calibrator iff ff is decreasing, upper semi-continuous, and 01f(p)dp1\int_0^1 f(p)\, dp \leq 1. Admissible calibrators have =1\int = 1 and are generated by f(p)=dΦ/dpf(p) = -d\Phi/dp for concave Φ:[0,1][0,1]\Phi: [0, 1] \to [0, 1] with Φ(0)=1,Φ(1)=0\Phi(0) = 1, \Phi(1) = 0.

Proof in Vovk and Wang (2021, §2). Examples include the Vovk family fκ(p)=κpκ1f_\kappa(p) = \kappa p^{\kappa - 1} for κ(0,1)\kappa \in (0, 1) and the Shafer logarithmic f(p)=(1p+plogp)/(plog2p)f(p) = (1 - p + p \log p) / (p \log^2 p).

Why the asymmetry. e-to-p is canonical (lossless); p-to-e is non-canonical (lossy). p-values are rank statements; e-values are magnitude statements. e→p collapses magnitude into rank; p→e tries to recover magnitude from rank (impossible without extra structure). Takeaway: start with an e-process and report e-values directly; convert to p-values for reader-friendliness only.

Mean e-value under null (Theorem 8.9 says E[E] = 1): Vovk-κ: 0.985, Vovk-1/3: 0.948, Shafer: 0.891.
Two-panel: histogram of P = min(1, 1/E) from null e-values vs uniform reference; e-value distributions from Vovk-1/2, Vovk-1/3, Shafer-log calibrators applied to uniform null p-values.
Notebook Figure 14. The calibration round-trip. Left: P = min(1, 1/E) under H_0 has a distribution stochastically dominating Uniform(0,1) — valid p-values. Right: Vovk-1/2, Vovk-1/3, and Shafer-log calibrators applied to uniform p-values produce e-value distributions with E[E] = 1 (Theorem 8.9).

9. Growth-rate optimality, the Kelly connection, and universal e-processes

§8 named the e-process and showed every anytime-valid test is one. But it didn’t say which e-process is the right one. This section provides the optimality theory. The key concept is growth-rate optimality (GRO): the e-process that maximizes expected log-growth under the alternative is the fastest-detecting anytime-valid procedure. For simple-vs-simple, the GRO e-process is Wald’s likelihood ratio, and its growth rate equals the KL divergence — closing the loop from §3.4 to a sharp theorem. The same criterion has a 70-year-old gambling interpretation (Kelly 1956). For composite alternatives, universal e-processes match the alternative’s KL up to logarithmic regret, generalizing Cover’s (1991) universal portfolios.

9.1 GRO — definition and the optimization problem

Definition 9.1 (Growth rate of an e-process).

For e-process (En)(E_n) and alternative H1H_1,

GH1(E)  =  limn1nEH1[logEn].(9.1)G_{H_1}(E) \;=\; \lim_{n \to \infty}\, \frac{1}{n}\, \mathbb{E}_{H_1}[\log E_n]. \qquad\qquad (9.1)

For i.i.d. data and multiplicative-product e-processes En=ξiE_n = \prod \xi_i, GH1(E)=EH1[logξ1]G_{H_1}(E) = \mathbb{E}_{H_1}[\log \xi_1].

Definition 9.2 (Growth-rate-optimal e-process (Grünwald and Koolen 2022; Shafer 2021)).

Let EH0\mathcal{E}_{H_0} denote the set of all e-processes for H0H_0. The GRO e-process for testing H0H_0 against H1H_1 is

EGRO  =  argmaxEEH0GH1(E).(9.2)E^{\mathrm{GRO}} \;=\; \mathop{\mathrm{arg\,max}}_{E \in \mathcal{E}_{H_0}}\, G_{H_1}(E). \qquad\qquad (9.2)

Why log-growth, not raw growth? EH1[En]\mathbb{E}_{H_1}[E_n] has no upper bound (aggressive bets give arbitrary mean growth). But by LLN, Enexp(nGH1(E))E_n \approx \exp(n \cdot G_{H_1}(E)) almost surely under H1H_1 — maximizing log-growth corresponds to maximizing the typical detection rate.

9.2 KL divergence as the optimal growth rate

Theorem 9.3 (GRO theorem for simple-vs-simple).

For H0:f0H_0: f_0, H1:f1H_1: f_1 with f0,f1f_0, f_1 mutually absolutely continuous, Λn=f1(Xi)/f0(Xi)\Lambda_n = \prod f_1(X_i)/f_0(X_i) is GRO with GH1(Λ)=KL(f1f0)G_{H_1}(\Lambda) = \mathrm{KL}(f_1 \,\|\, f_0). For any other e-process EE, GH1(E)KL(f1f0)G_{H_1}(E) \leq \mathrm{KL}(f_1 \,\|\, f_0).

Proof.

Step 1: Λn\Lambda_n is an e-process by Lemma 3.1.

Step 2: EH1[logΛn]=nEXf1[log(f1/f0)]=nKL(f1f0)\mathbb{E}_{H_1}[\log \Lambda_n] = n \cdot \mathbb{E}_{X \sim f_1}[\log(f_1/f_0)] = n \cdot \mathrm{KL}(f_1 \,\|\, f_0).

Step 3: For any e-process EE with EH0[En]1\mathbb{E}_{H_0}[E_n] \leq 1, by Jensen (concavity of log\log):

EH1[log(En/Λn)]logEH1[En/Λn].\mathbb{E}_{H_1}[\log(E_n / \Lambda_n)] \leq \log \mathbb{E}_{H_1}[E_n / \Lambda_n].

The Radon–Nikodym derivative dPH1n/dPH0n=Λnd\mathbb{P}_{H_1}^{\otimes n} / d\mathbb{P}_{H_0}^{\otimes n} = \Lambda_n, so

EH1[En/Λn]=(En/Λn)f1dx=Enf0dx=EH0[En]1.\mathbb{E}_{H_1}[E_n / \Lambda_n] = \int (E_n / \Lambda_n) \prod f_1\, dx = \int E_n \prod f_0\, dx = \mathbb{E}_{H_0}[E_n] \leq 1.

Hence EH1[logEn]EH1[logΛn]=nKL(f1f0)\mathbb{E}_{H_1}[\log E_n] \leq \mathbb{E}_{H_1}[\log \Lambda_n] = n \cdot \mathrm{KL}(f_1 \,\|\, f_0).

The same KL divergence governs fixed-nn Chernoff exponents (Neyman–Pearson, Stein’s lemma) and anytime-valid growth rates. The LR achieves both simultaneously.

Operational consequence. Expected stopping time of the GRO rejection rule:

EH1[τ]    log(1/α)KL(f1f0).(9.4)\mathbb{E}_{H_1}[\tau] \;\approx\; \frac{\log(1/\alpha)}{\mathrm{KL}(f_1 \,\|\, f_0)}. \qquad\qquad (9.4)

This is the §3.2 SPRT efficiency calculation reinterpreted: the SPRT achieves the minimum stopping time because the LR is GRO. The Wald–Wolfowitz theorem (§3.3) is the same statement in classical-statistics language.

9.3 The betting interpretation

Kelly’s setup: a gambler with bankroll W0W_0. At each round, KK outcomes with true probabilities pip_i and “fair” odds oi=1/qio_i = 1/q_i. Per-round log-growth at portfolio bb:

E[log(W1/W0)]  =  ipilog(bi/qi)  =  KL(pq)KL(pb).(9.5)\mathbb{E}[\log(W_1/W_0)] \;=\; \sum_i p_i \log(b_i / q_i) \;=\; \mathrm{KL}(p \,\|\, q) - \mathrm{KL}(p \,\|\, b). \qquad\qquad (9.5)

Maximizing over bb gives the Kelly bet bi=pib_i^\star = p_ibet proportional to true probability.

Theorem 9.4 (Kelly 1956 — optimal gambling growth rate).

The Kelly bet b=pb^\star = p achieves G=KL(pq)G^\star = \mathrm{KL}(p \,\|\, q), the maximum expected log-growth.

The bridge. Theorem 9.4 and Theorem 9.3 say the same thing in different languages:

Anytime-valid testingKelly gambling
Null H0:f0H_0: f_0Bookmaker’s odds qq
Alternative H1:f1H_1: f_1True probabilities pp
e-process with EH0[E]1\mathbb{E}_{H_0}[E] \leq 1Wealth process at fair odds
GRO e-process (LR)Kelly bet (b=pb = p)
G=KL(f1f0)G^\star = \mathrm{KL}(f_1 \,\|\, f_0)G=KL(pq)G^\star = \mathrm{KL}(p \,\|\, q)

An e-process for H0H_0 is the wealth of a gambler placing fair bets at the odds implied by H0H_0. This is the test by betting perspective of Shafer (2021) and the game-theoretic probability program of Shafer and Vovk (2019).

Empirical Kelly and the §6 betting wealth. For Bernoulli with null μ0\mu_0 and unknown alternative, the exact Kelly bet (solving (9.2) for simple Bernoulli alternatives) is

λ(μ1)  =  μ1μ0μ0(1μ0),(9.7)\lambda^\star(\mu_1) \;=\; \frac{\mu_1 - \mu_0}{\mu_0 (1 - \mu_0)}, \qquad\qquad (9.7)

with Kelly growth rate equal to the binary KL: kl(μ1μ0)\mathrm{kl}(\mu_1 \,\|\, \mu_0). The empirical Kelly substitutes Xˉn1\bar X_{n-1} for μ1\mu_1 — the predictable plug-in rule (6.5).

Example 9 (Bernoulli Kelly explicit).

μ0=0.5,μ1=0.6\mu_0 = 0.5, \mu_1 = 0.6: λ=0.1/0.25=0.4\lambda^\star = 0.1/0.25 = 0.4. Kelly rate =kl(0.60.5)=0.02014= \mathrm{kl}(0.6 \| 0.5) = 0.02014. Expected detection time at α=0.05\alpha = 0.05: log(20)/0.02014149\log(20)/0.02014 \approx 149 — matches §6 simulation.

A note on over-betting. The §9 simulation also evaluates λ=1.0\lambda = 1.0 (over-aggressive). For Bernoulli(0.6)(0.6) vs Bernoulli(0.5)(0.5), λ=1\lambda = 1 gives expected log-growth 0.6log1.5+0.4log0.50.2430.277=0.0340.6 \log 1.5 + 0.4 \log 0.5 \approx 0.243 - 0.277 = -0.034negative. Over-betting beyond Kelly’s optimum is strictly worse than not betting at all; this is the Kelly criterion’s most counterintuitive consequence: under-aggressive (λ<λ\lambda < \lambda^\star) only loses efficiency, but over-aggressive (λ>2λ\lambda > 2\lambda^\star for symmetric Bernoulli) actually loses wealth.

Kelly G★ = kl(0.60 ‖ 0.5) = 0.0201 per step. Mis-tuned (λ=1) Ĝ ≈ -0.0368. Wealth shrinks under H_1 — over-bet.
Log e-process trajectories under H_1 for GRO Kelly LR (steep upward), empirical Kelly (slightly less steep), universal mixture (slightly less still), and the mis-tuned λ=1.0 bet (drifting DOWN).
Notebook Figure 15. GRO growth-rate race under H_1: Bern(0.6) testing μ = 0.5. GRO Kelly LR (red) hits the threshold fastest; empirical Kelly and universal mixture are close behind. The over-aggressive λ = 1.0 bet (orange) drifts DOWN under H_1 — over-betting is worse than not betting (notebook printout: G_emp = −0.0348 per step, −172.86% of GRO).

9.4 Universal portfolios as universal e-processes

The GRO theorem assumes a known alternative. In practice the alternative is composite; no single e-process can be GRO simultaneously for every fH1f \in H_1.

Cover (1991) addressed this for portfolio selection. Cover’s universal portfolio achieves logSnCoverlogSnO(Klogn)\log S_n^{\mathrm{Cover}} \geq \log S_n^\star - O(K \log n) uniformly over all return sequences.

Translation to e-processes. Replace “stock returns” with “per-observation betting wealth ratios.” The construction becomes the Robbins-mixture e-process

Enuniv  =  ΘEn(θ)dπ(θ).(9.10)E_n^{\mathrm{univ}} \;=\; \int_\Theta E_n(\theta)\, d\pi(\theta). \qquad\qquad (9.10)

Theorem 9.5 (Universal e-process (mixture form, informal)).

Under regularity conditions on (Θ,π)(\Theta, \pi), for any θΘ\theta^\star \in \Theta,

EH1(θ)[logEnuniv]    EH1(θ)[logEn(θ)]O(dimΘlogn).(9.11)\mathbb{E}_{H_1(\theta^\star)}[\log E_n^{\mathrm{univ}}] \;\geq\; \mathbb{E}_{H_1(\theta^\star)}[\log E_n(\theta^\star)] - O(\dim \Theta \cdot \log n). \qquad\qquad (9.11)

Proof via Laplace approximation. Full development in Cover (1991), Barron and Hengartner (1998), Grünwald and Koolen (2022, §4).

Structural significance. Universal e-processes generalize universal source codes (Cover and Thomas 2006, ch. 13) from information theory. A universal source code achieves the entropy rate of any source in a class up to O(logn/n)O(\log n / n) overhead per symbol; a universal e-process achieves the KL-growth rate of any alternative in a class up to O(logn/n)O(\log n / n) overhead per observation. The MDL principle (Grünwald 2007) recognizes both as instances of one principle.

For the practitioner: universal e-processes give honest, anytime-valid inference for composite alternatives without requiring parametric likelihood specification or prior. The cost is O((dimΘ)logn/n)O((\dim \Theta) \log n / n) regret — at n=106n = 10^6 and dimΘ=1\dim \Theta = 1, 14\approx 14 extra observations on top of the GRO oracle’s expected log(1/α)/KL\log(1/\alpha)/\mathrm{KL}.

Universal mixture over 21-point μ-grid in [0.01, 0.99]. Regret vs the oracle Kelly LR grows like log n (Theorem 9.5) — the cost of not knowing the true alternative.
Log-scale regret plot: log E_n^univ − log E_n(μ_1) growing like log n across three μ_1 values.
Notebook Figure 16. Universal-portfolio regret on Bern(μ_1) streams testing H_0: μ = 0.5. The regret log E_n^univ − log E_n(μ_1) scales as O(log n) (Theorem 9.5) — modest cost for not knowing the alternative.

10. Merging e-values

A single test rarely lives in isolation. A modern A/B-testing platform runs hundreds of concurrent experiments; a meta-analyst pools evidence from a dozen studies; a bandit tracks rewards across arms. The classical answer is Bonferroni on p-values — honest but punitively conservative. E-values handle the combination dramatically better. Arithmetic-mean merging is valid under arbitrary dependence. Product merging under independence is dramatically tighter than even the most efficient p-value combiner. Vovk and Wang (2021) prove that arithmetic mean is the dominant symmetric merger under arbitrary dependence.

10.1 Arithmetic mean — robust under arbitrary dependence

Theorem 10.1 (Arithmetic-mean merger).

Let E(1),,E(m)E^{(1)}, \ldots, E^{(m)} be e-values for H0H_0 with arbitrary joint distribution. The arithmetic mean EAM=1mE(i)E^{\mathrm{AM}} = \frac{1}{m} \sum E^{(i)} is an e-value for H0H_0.

Proof.

Nonnegativity is immediate. By linearity:

EH0[EAM]=1mEH0[E(i)]1mm1=1.\mathbb{E}_{H_0}[E^{\mathrm{AM}}] = \frac{1}{m}\sum \mathbb{E}_{H_0}[E^{(i)}] \leq \frac{1}{m}\cdot m \cdot 1 = 1.

Linearity does not require independence.

The dependence structure can be anything and the arithmetic mean remains valid. Compare with p-values: the smallest p-value is not generally a valid p-value (Bonferroni loses a factor of mm); the arithmetic mean of e-values loses nothing.

Extending to e-processes. If (En(1)),,(En(m))(E_n^{(1)}), \ldots, (E_n^{(m)}) are e-processes with arbitrary cross-dependence, Eˉn=1mEn(i)\bar E_n = \frac{1}{m} \sum E_n^{(i)} is itself an e-process (Theorem 10.1 + Theorem 4.1). Anytime-valid meta-analysis with no dependence assumption.

Example 10 (A/B testing across user segments).

m=10m = 10 segment-level e-processes En(i)E_n^{(i)} testing per-segment treatment effect. We want global anytime-valid test of “no effect in any segment.” Segments may be correlated (shared platform-level confounders). Arithmetic-mean process Eˉn\bar E_n is anytime-valid for the global null — no Bonferroni, no covariance estimation.

10.2 Geometric mean and product — independence required

Theorem 10.2 (Product merger under independence).

Let E(1),,E(m)E^{(1)}, \ldots, E^{(m)} be jointly independent e-values for H0H_0. The product Eprod=E(i)E^{\mathrm{prod}} = \prod E^{(i)} is an e-value.

Proof.

By independence, EH0[Eprod]=EH0[E(i)]1=1\mathbb{E}_{H_0}[E^{\mathrm{prod}}] = \prod \mathbb{E}_{H_0}[E^{(i)}] \leq \prod 1 = 1.

The product can be exponentially larger than the arithmetic mean. E(i)=5E^{(i)} = 5 for m=10m = 10: EAM=5E^{\mathrm{AM}} = 5, Eprod=510=9.8×106E^{\mathrm{prod}} = 5^{10} = 9.8 \times 10^6. Independent observations let us multiply our beliefs.

Geometric mean is dominated. EGMEAME^{\mathrm{GM}} \leq E^{\mathrm{AM}} pointwise (AM–GM); no reason to prefer it.

Arithmetic mean E^AM
E^AM = 5.000
p-equiv (Vovk-1/2): 1.60e-1
Valid under ANY dependence.
Product E^prod indep. required
E^prod = 3.125e+3
p-equiv (Vovk-1/2): 4.10e-7
Invalid under correlation — see Viz 10.2.
Geometric mean E^GM
E^GM = 5.000
Dominated by E^AM (AM–GM).
Bonferroni p (p-value path)
min p × m = 8.00e-1
Always loses factor of m. Use E^AM instead.

10.3 The Vovk–Wang (2021) merging-function characterization

Definition 10.3 (E-merging function).

A symmetric M:[0,)m[0,)M: [0, \infty)^m \to [0, \infty) is an e-merging function under arbitrary dependence if M(E(1),,E(m))M(E^{(1)}, \ldots, E^{(m)}) is an e-value whenever each E(i)E^{(i)} is, regardless of joint distribution.

Theorem 10.4 (Vovk and Wang 2021, Proposition 3.1).

Every symmetric e-merging function under arbitrary dependence is dominated pointwise by the arithmetic mean:

M(e1,,em)    1mi=1mei(e1,,em)[0,)m.(10.4)M(e_1, \ldots, e_m) \;\leq\; \frac{1}{m} \sum_{i=1}^m e_i \qquad \forall (e_1, \ldots, e_m) \in [0, \infty)^m. \qquad\qquad (10.4)

Proof in Vovk and Wang (2021, §3). The arithmetic mean is the unique (up to dominated alternatives) optimal symmetric e-merging function under arbitrary dependence.

Weighted means Mw(e)=wieiM_w(e) = \sum w_i e_i with wi0,wi1w_i \geq 0, \sum w_i \leq 1 are also valid mergers; encode prior beliefs about study quality.

Contrast with p-value merging. The unique symmetric p-merging function under arbitrary dependence is Bonferroni. There is no p-value analogue of the arithmetic-mean merger that improves linearly in mm without independence assumptions. The asymmetry: e-value’s EH0[E]1\mathbb{E}_{H_0}[E] \leq 1 is a linear constraint (Jensen-friendly); p-value’s P(Pα)α\mathbb{P}(P \leq \alpha) \leq \alpha is a quantile constraint (adds up under worst-case dependence — Bonferroni).

At ρ = 0.50: E[E^AM] = 1.008, E[E^prod] = 34.533. AM ≤ 1 always (Theorem 10.1); the product's truncated mean already exceeds 1 for ρ > 0 even after winsorization at 10⁶.
Bar chart comparing combined evidence — Bonferroni p, AM e-value, product e-value, Fisher combined p — at m=10 independent tests, each E=5.
Notebook Figure 17. Quantifying the Bonferroni penalty. At m = 10 independent tests, each with e-value E = 5: Bonferroni p ≈ 0.40 (not significant); arithmetic-mean e ≈ 5 (p-equivalent ≈ 0.20); product e ≈ 9.77 × 10⁶ (p-equivalent ≈ 1.02 × 10⁻⁷); Fisher combined p ≈ 2 × 10⁻⁵. Product merger gives ~10⁷× more evidence than Fisher at the same input strength.

10.4 When you should not Bonferroni

Bonferroni is right when (i) tests are fixed-sample, (ii) mm is small, (iii) external constraint demands p-values. For every other multiple-testing setting, e-value merging is dramatically better. Four headline cases:

Sequential multiple testing. Platform runs mm concurrent A/B tests, peeking at each. Bonferroni on fixed-nn p-values gives per-test α/m\alpha/m — compounds with sequential widening. E-value alternative: maintain anytime-valid En(i)E_n^{(i)} per test, merge via Eˉn\bar E_n. Global type-I at Eˉn1/α\bar E_n \geq 1/\alpha is α\leq \alpha by Ville. No Bonferroni.

Online FDR control. LORD/SAFFRON for p-values; e-LOND/e-LORD (Wang and Ramdas 2022) for e-values. The e-value versions use the wealth-process structure directly; strictly more powerful under independence.

Meta-analysis with unknown correlation. Random-effects models assume distributional structure; e-value AM merger is assumption-free.

Bandit / off-policy evaluation. Adaptive sampling breaks standard CIs. Per-arm e-process CSs (Karampatziakis–Mineiro–Ramdas 2021) merged via AM give a global anytime-valid CS for the policy value.

Unifying message. e-values let you merge across studies, segments, and time without paying the Bonferroni penalty.

11. ML applications — A/B testing, bandits, and adaptive data

This section closes the loop to practice. Every modern A/B-testing platform ships an anytime-valid CS as the default stopping criterion. Every contextual-bandit deployment that needs honest evaluation uses off-policy e-processes. The shift, between roughly 2017 and 2022, replaced the fixed-horizon Wald test that had been standard for two decades.

11.1 Modern experimentation platforms

The commercial A/B-testing market has standardized on anytime-valid methodology:

PlatformAnytime-valid constructionReference
EppoGaussian-mixture CS + PrPl-EB for binaryEppo methodology docs
StatsigSequential probability ratio test + mixture CSHoward et al. (2020)
Optimizely “Stats Engine”Mixture sequential probability ratio test (mSPRT)Johari, Pekelis, Walsh (2015); Johari et al. (2017) KDD
Microsoft EXPSequential tests + mixture CS, internal useDeng et al. (2016) DSAA
LinkedIn ExperimentationmSPRT for binary, Gaussian-mixture for continuousXu et al. (2015) KDD
NetflixAnytime-valid linear models / sequential causalLindon, Ham, Tingley, Bojinov (2022)
VWOBayesian sequential testing (posterior-CS)VWO methodology docs

All seven platforms ran fixed-horizon tt-tests as default in 2015. By 2022 all seven had transitioned to some form of anytime-valid reporting. The shift was driven by practitioner demand: analysts want to look at the dashboard daily and make calls in real time, and the fixed-horizon framework offered no honest way to do that. Johari, Koomen, Pekelis, and Walsh (2017) “Peeking at A/B tests” made the §1 peeking failure explicit.

11.2 Anytime-valid stopping criteria in practice

The operational pattern across platforms looks essentially identical.

Step 1 — Specify the metric and inferential target. Lift in revenue per user, conversion rate, retention, etc. H0:lift=0H_0: \mathrm{lift} = 0 (two-sided) or H0:lift0H_0: \mathrm{lift} \leq 0 (one-sided).

Step 2 — Choose the CS construction. Continuous: sub-Gaussian Robbins mixture (4.4). Counts: sub-exponential Bennett mixture (5.9). Binary: sub-Bernoulli Beta-mixture (5.14) for Xˉ0.5\bar X \approx 0.5, PrPl-EB betting CS (6.6) for boundary means.

Step 3 — Pick the level and mixing parameter. δ{0.05,0.01}\delta \in \{0.05, 0.01\}. Mixing ρ\rho tuned for expected horizon nn^\star.

Step 4 — Run and monitor. CS updates as observations stream in. Three actions: continue, stop-and-ship (CS excludes null favorably), stop-and-abandon (CS narrow enough to declare practical equivalence).

The crucial property: the stopping rule does not need to be pre-specified. Whether the analyst stops at n=100n = 100 or n=10,000n = 10{,}000, the type-I error guarantee holds.

Quantifying the cost. §5.5 showed 1.5\sim 1.53×3 \times sample-size cost vs fixed-nn at the same nominal width. Operationally: an experiment that “would have” reached 80%80\% power at n=5,000n = 5{,}000 with fixed-horizon typically needs n[7,500,15,000]n \in [7{,}500, 15{,}000] with anytime-valid CS if run to the planned horizon. Savings come from not having to — strong effects detected at a fraction of nn^\star. Platforms report 20–40% reduction in average experiment duration vs fixed-horizon baseline.

current n = 319
decision: SHIP
looks taken: 0
Stylized A/B-test dashboard mockup showing lift estimate + anytime-valid CS over time, with the fixed-n band overlay for contrast.
Notebook Figure 18. A stylized rendering of a modern experimentation platform's dashboard: lift estimate with the PrPl-EB anytime-valid CS (blue) vs the fixed-n band (red dashed). The platform's analysts can stop at any point — the CS guarantee survives.

11.3 Off-policy confidence sequences for bandits

Modern recommendation systems use contextual bandits with adaptive sampling — action probabilities depend on history. Standard i.i.d. analysis breaks.

The standard OPE estimator:

V^n(π)  =  1ni=1nρiRi,ρi  =  π(AiXi)μ(AiXi).(11.1)\hat V_n(\pi) \;=\; \frac{1}{n} \sum_{i=1}^n \rho_i R_i, \qquad \rho_i \;=\; \frac{\pi(A_i \mid X_i)}{\mu(A_i \mid X_i)}. \qquad\qquad (11.1)

Karampatziakis, Mineiro, and Ramdas (2021) constructed anytime-valid CSs for V(π)V(\pi):

KnOPE(V0;{λi})  =  i=1n[1+λi(ρiRiV0)].(11.2)K_n^{\mathrm{OPE}}(V_0; \{\lambda_i\}) \;=\; \prod_{i=1}^n \left[1 + \lambda_i \cdot (\rho_i R_i - V_0)\right]. \qquad\qquad (11.2)

By Theorem 6.4 + Ville, this is anytime-valid. Works under arbitrary adaptive behavior policies μ\mu — the behavior policy can depend on the entire history.

Example 11 (Continuous-monitoring contextual bandit).

News-recommendation system with 5-arm Thompson sampling bandit. Product team wants anytime-valid 95% CS for each arm’s expected CTR. Using (11.2) with importance weights from Thompson’s posterior, CS updates at each impression. After 100,000 impressions, CSs of width ≈ 0.02 around each arm — honest decisions about which arms to retire. The bandit’s exploration probabilities changing in real time does not invalidate the CSs.

Thompson behavior policy; greedy target π★ = arm 3 (true CTR = 0.30). Off-policy CS via (11.2) is anytime-valid under arbitrary behavior policy.
Two-panel: Thompson sampling posteriors on left; per-arm off-policy CS via (11.2) on right.
Notebook Figure 19. Off-policy CS for adaptive bandits — Thompson sampling generates arm-selection probabilities that depend on history (left), but the (11.2) off-policy CS (right) remains anytime-valid regardless. The bandit retires arms based on CSs that exclude the target value V_0.

11.4 Causal-effect-under-policy-shift extensions

Bibaut, Petersen, Vlassis, Dimakopoulou, and van der Laan (2021) “Sequential causal inference in a single world of connected units” handles time-evolving populations: treatments assigned over time (possibly adaptively), target causal estimand changes as population shifts, anytime-valid inference for the current estimand. Combines doubly-robust estimation (outcome models + propensity scores) with §6 betting-wealth machinery. Under standard rate conditions (oP(n1/4)o_P(n^{-1/4}) for nuisance), the CS is anytime-valid for time-varying causal estimands.

Three increasingly difficult scenarios: (i) static treatment + outcome drift; (ii) static treatment + covariate drift; (iii) adaptive treatment policy. The third subsumes the bandit case and extends to confounders, IVs, and Athey–Imbens-style econometric extensions.

11.5 Common pitfalls

Pitfall 1 — Treating fixed-nn CIs as “almost” anytime-valid. Wrap fixed-nn Gaussian band in “stop after nminn_{\min}” rule. The §1.2 simulation: nmin=100n_{\min} = 100, nmax=10,000n_{\max} = 10{,}000, empirical false-rejection 65%\approx 65\% at δ=0.05\delta = 0.05. Fixed-nn band is not anytime-valid for any nminn_{\min}.

Pitfall 2 — Miscalibrating the mixing prior. ρ=1/(σn)\rho^\star = 1/(\sigma\sqrt{n^\star}) gives tightest CS near nn^\star. Fix: tune ρ\rho to expected stopping time, or use PrPl-EB (parameter-free, self-tunes).

Pitfall 3 — Confusing CS coverage with sample-mean accuracy. CS is an interval estimate for the population parameter — not a prediction interval for the sample mean’s future trajectory.

Pitfall 4 — Neglecting multi-test merging from §10. m=10m = 10 concurrent A/B tests at per-test δ=0.05\delta = 0.05 → family-wise error 0.40\sim 0.40 if naively interpreted. Right fix: arithmetic-mean merging (Theorem 10.1) for global test; e-LOND/e-LORD for FDR control.

Pitfall 5 — Reporting CS at horizon nn without context. CS at single nn is wider than fixed-nn CI by logn/log(1/δ)\sqrt{\log n / \log(1/\delta)}. Report CS width with operational context — the CS bought the right to stop at any nn, including much earlier if evidence accumulated quickly.

Unifying message: anytime-validity is a property of the inference procedure, not a label slapped on a fixed-nn statistic.

12. Computational notes

Four implementation concerns recurred across §§3–11. Collected here as a reference.

12.1 Numerical stability at large nn

Always compute e-processes in log-space. Betting wealth Kn=[1+λi(Xiμ)]K_n = \prod [1 + \lambda_i(X_i - \mu)] overflows Float64 (10308\approx 10^{308}) at n200n \approx 200 on well-separated data. Universal fix: log_K_n = np.cumsum(np.log1p(lambdas * (X - mu))) — stable to n109+n \approx 10^{9+}.

Log-sum-exp for mixture e-processes. Enmix=kπkEn(k)E_n^{\mathrm{mix}} = \sum_k \pi_k E_n^{(k)}: shift by m=maxklogEn(k)m = \max_k \log E_n^{(k)} before exponentiating. scipy.special.logsumexp.

Cumulative-max for “ever-crossed” computations. np.maximum.accumulate(E_traj, axis=1) for the running supremum in O(n)O(n) vectorized time.

Positivity-preserving clip. λi<c/max(μa,bμ)|\lambda_i| < c/\max(\mu - a, b - \mu) for c(0,1)c \in (0, 1). Standard c=0.5c = 0.5.

12.2 Closed-form vs Monte Carlo boundary evaluation

Closed-form (sub-Gaussian (4.4), sub-exponential (5.9)): pre-compute boundary as function of (n,δ,σ,ρ)(n, \delta, \sigma, \rho). O(1)O(1) per nn. Right call for production A/B-testing.

Implicit + bisection (sub-Bernoulli (5.14), PrPl-EB (6.6)): roots of logEn(μ)log(1/δ)=0\log E_n(\mu) - \log(1/\delta) = 0 via bisection. O(log(1/ϵ))O(\log(1/\epsilon)) per nn.

Streaming update pattern. Maintain fixed grid of KK candidate μ\mu values, update each in O(1)O(1) per new observation. Per-observation cost O(K)O(K).

Anchors from notebook cell 31: closed-form (~10 μs flat), implicit (~5 ms flat), MC at B = 1000, n = 10⁵ takes ~2 s. Use closed-form in production.
Log-log runtime plot: closed-form O(1), implicit + bisection O(log(1/ε)), MC at B=1000 O(B·n), across n from 10^2 to 10^7.
Notebook Figure 20. Boundary-evaluation runtime hierarchy — closed-form Gaussian mixture is O(1) per n (~10 μs), implicit Beta-mixture + bisection is O(log(1/ε)) per n (~5 ms), MC verification at B = 1000 is O(B·n) (~2 s at n = 10^5). Ratios: implicit ~500× closed-form, MC ~200,000× closed-form.

12.3 Special-function inversions

  • scipy.special.betaln for sub-Bernoulli Beta-mixture (5.14). Overflow-safe.
  • scipy.special.lambertw(z, k=0).real for sub-exponential boundary equations x+αex=βx + \alpha e^x = \beta.
  • Inverse binary KL via brentq on g(μ)=kl(Xˉnμ)c/ng(\mu) = \mathrm{kl}(\bar X_n \,\|\, \mu) - c/n.
  • scipy.stats.norm.ppf(1 - \delta/2) for fixed-nn Gaussian quantile reference.

12.4 Reproducibility

Use np.random.default_rng(seed), not np.random.seed(). Per-experiment RNG with explicit state.

Seed-locked verification. Print MC summaries; numbers should reproduce exactly on fresh kernel run.

Numerical-determinism caveats. Float arithmetic is bit-deterministic; parallel BLAS reductions may reorder summations (1012\sim 10^{-12} variation in 104\sim 10^4-term sums). Below MC noise floor for these simulations.

13. Connections and limits

We close by situating AVI in the broader formalML landscape and surveying what the framework can’t yet do.

13.1 AVI vs PAC-Bayes — the same Ville step, two applications

The most striking structural identity in this topic: AVI and PAC-Bayes Bounds use the same supermartingale + Ville machinery for different applications.

PAC-Bayes: with prior PP and posterior QQ, the bound

Pr ⁣(EhQ[L(h)L^n(h)]    KL(QP)+log(1/δ)n)    δ\Pr\!\left(\, \mathbb{E}_{h \sim Q}[L(h) - \hat L_n(h)] \;\geq\; \sqrt{\frac{\mathrm{KL}(Q \,\|\, P) + \log(1/\delta)}{n}} \,\right) \;\leq\; \delta

via Donsker–Varadhan + Markov on an exponential PAC-Bayes supermartingale.

AVI: with null H0H_0 and e-process EnE_n, the bound Pr(supnEn1/δ)δ\Pr(\sup_n E_n \geq 1/\delta) \leq \delta via Markov on Wald–Doob’s supermartingale + GRO with rate KL(f1f0)\mathrm{KL}(f_1 \,\|\, f_0).

The two differ in application: PAC-Bayes is generalization (loss of a randomized predictor); AVI is testing (type-I error of a sequential test). The mathematical engine — Markov + supermartingale + KL change-of-measure via Donsker–Varadhan — is identical. The Donsker–Varadhan inequality is the central tool in both proofs:

logEQ[eX]  =  supPQ{EP[X]KL(PQ)}.\log \mathbb{E}_Q[e^X] \;=\; \sup_{P \ll Q} \left\{\, \mathbb{E}_P[X] - \mathrm{KL}(P \,\|\, Q) \,\right\}.

A reader who has done pac-bayes-bounds and this topic has met the same theorem twice. That unification is the central contribution of the game-theoretic probability program (Shafer and Vovk 2019).

13.2 AVI vs selective inference — different conditioning events

The structural cousin one track over is Selective Inference (T6). Both address the same family of practitioner failures with different conditioning.

Selective inference handles model-selection-induced peeking. After lasso, condition on the polyhedral selection event {Ayb}\{Ay \leq b\}. Truncated-Gaussian pivot (Lee, Sun, Sun, Taylor 2016).

Always-valid inference handles time-induced peeking. After A/B test, condition on stopping-time event {τn}\{\tau \leq n\}. Robbins mixture (§4) and PrPl-EB betting (§6).

Selective inferenceAlways-valid inference
Selection event Ay ≤ b (polyhedral)Stopping-time event τ ≤ n (data-adaptive)
Truncated-Gaussian pivotTime-uniform Ville bound
Conditional CI given selectionAnytime-valid CS over stopping rules
Power loss as price of post-selection honestySample-size cost as price of optional-stopping honesty

Complementary, not competing. A platform that runs an experiment, performs feature selection, and reports a CI should use both.

13.3 The LIL lower bound and the constant gap

The §1.3 preview promised LIL as the asymptotic floor. The formal Hartman–Wintner (1941) statement:

lim supnSn2nσ2loglogn  =  1almost surely.\limsup_{n \to \infty}\, \frac{|S_n|}{\sqrt{2 n \sigma^2 \log\log n}} \;=\; 1 \qquad \text{almost surely.}

Stopping-time lifting gives: no (1δ)(1-\delta)-confidence sequence can have half-width smaller than σ2loglogn/n\sigma\sqrt{2 \log\log n / n} for all nn simultaneously. Proof via the LIL almost-sure envelope crossing any narrower-than-LIL band at some random stopping time.

Asymptotic picture across the three regimes:

  • Fixed-nn Hoeffding (anytime-invalid): HW(n)σ2log(2/δ)/n\mathrm{HW}(n) \sim \sigma\sqrt{2 \log(2/\delta)/n} — pure 1/n1/\sqrt{n}.
  • Gaussian-mixture CS (4.4): σ2logn/n\sim \sigma\sqrt{2 \log n / n} — extra logn\sqrt{\log n}.
  • PrPl-EB betting CS (6.6, LIL-matching): σ2loglogn/n\sim \sigma\sqrt{2 \log\log n / n} — extra loglogn\sqrt{\log\log n}.
  • LIL lower bound: σ2loglogn/n\sim \sigma\sqrt{2 \log\log n / n} (the hard floor).

At n=106n = 10^6, loglogn2.62\log\log n \approx 2.62; unavoidable cost over fixed-nn is 2.621.62×\sqrt{2.62} \approx 1.62\times in width, 2.6×\approx 2.6\times in sample size. PrPl-EB hits this exactly; Gaussian mixture pays an extra logn/loglogn2.3×\sqrt{\log n / \log\log n} \approx 2.3\times at n=106n = 10^6 (total 3.7×\sim 3.7\times width, 14×\sim 14\times sample size). PrPl-EB is conjectured to be constant-optimal (Waudby-Smith and Ramdas 2024); formal proof open.

13.4 Composite nulls, nuisance parameters, and the frontier

The framework as developed handles simple and parametric-composite alternatives elegantly. Active frontier:

Composite nulls. H0={Pθ:θΘ0}H_0 = \{\mathbb{P}_\theta : \theta \in \Theta_0\} requires supθΘ0Eθ[En]1\sup_{\theta \in \Theta_0} \mathbb{E}_\theta[E_n] \leq 1. Grünwald, de Heide, Koolen (2024) “Safe testing” provides the framework. Larsson, Ramdas, Ruf (2024) give explicit constructions via online learning.

Nuisance parameters. Doubly-robust estimators + betting machinery: Bibaut et al. (2021) for causal nuisance.

Continuous-time AVI. Discrete-nn replaced by càdlàg supermartingales. Applications: high-frequency financial monitoring, physiological signal analysis.

Distribution-free / adversarial AVI. Cutkosky and Mhammedi (2024) — anytime-valid bounds under no distributional assumption, only an “online learning expert” structure.

Bayesian–frequentist hybrids. Variational posterior calibration via e-values: use Bayesian prediction, get frequentist coverage.

Multiple testing under arbitrary dependence. Wang and Ramdas (2022) e-LOND and e-LORD — online FDR procedures using e-process arithmetic; strictly more powerful than p-value-based equivalents.

13.5 Forward-pointers and the close

T6 Causal Inference Methods — doubly-robust + betting machinery (Bibaut et al. 2021) gives anytime-valid CIs for causal estimands under adaptive treatment.

T5 Bayesian Inference — §4.2 Bayes–frequentist correspondence — Robbins mixtures are Bayesian marginal likelihoods, interpreted frequentistically via Ville. Safe testing formalizes this duality.

T5 Variational Inference — e-value calibration of variational posteriors (Grünwald frontier).

T5 Sequential Monte Carlo — SMC importance-sampling weights have anytime-valid extensions via e-processes.

Closing. The AVI framework is the answer to the §1.1 question: how do we make honest inference under peeking? The classical fixed-horizon framework gives no answer — the §1.2 simulation showed false-rejection climbing to 65%\sim 65\% under peeking. AVI replaces the fixed-horizon assumption with anytime-validity, at the modest cost of a 1.5\sim 1.53×3\times sample-size widening. For modern data analysis — sequential A/B testing, multi-armed bandits, adaptive clinical trials, online causal inference, real-time monitoring of any kind — AVI is the right framework, and the e-process is the right primitive.

The conceptual scaffolding is small enough to hold in one’s head. A nonnegative supermartingale with unit initial expectation, run through Ville’s inequality. That single object — built as a Wald likelihood ratio, a Robbins mixture, a sub-class Wald–Doob supermartingale, or a Waudby-Smith–Ramdas betting wealth — handles every anytime-valid construction. The frameworks that look distinct (PAC-Bayes, selective inference, online FDR, off-policy evaluation, safe testing) use the same machinery with different problem-specific decorations.

What the practitioner gets is the operational right to peek. Dashboard checked at any cadence, experiment stopped at any sample size, decisions made based on current CS endpoints — all without inflating type-I error. The cost is 1.5×\sim 1.5\times the planned sample size at any single horizon, paid in exchange for the ability to stop early when evidence is overwhelming, which on average more than compensates. The platforms in §11.1 have all made this trade, and the §1 peeking analyst is no longer a problem to solve but a user whose intuitive workflow the framework was designed to support.

The mathematical machinery has been with us since Wald (1945) and Robbins (1970). What changed in the 2020s is the recognition — across academia (Howard et al. 2020, Waudby-Smith and Ramdas 2024) and industry (Optimizely, Eppo, Statsig, et al.) — that anytime-validity is the right default for modern data analysis, and that the supermartingale + Ville machinery is the right tool for delivering it. This topic was written from inside that recognition.

Connections

  • AVI lifts every fixed-n Chernoff bound to a time-uniform analogue via Ville's inequality on the corresponding Wald–Doob supermartingale (§2.4). The sub-Gaussian, sub-exponential, and sub-Bernoulli boundary atlas of §5 IS the time-uniform extension of the Chernoff machinery developed there. concentration-inequalities
  • Both topics use the same supermartingale + Ville + Donsker–Varadhan machinery for different applications: PAC-Bayes for generalization of randomized predictors, AVI for type-I error of sequential tests. The §13.1 structural identity makes this explicit — a reader who has done both has met the same theorem twice. pac-bayes-bounds
  • The growth-rate-optimal (GRO) e-process for testing H_0: f_0 vs H_1: f_1 has expected log-growth equal to KL(f_1 || f_0). Theorem 9.3 identifies KL as the Cramér rate function in fixed-n large deviations AND the Kelly-optimal log-growth rate in anytime-valid testing — the same divergence governs both. kl-divergence
  • Structural cousin one track over: both address post-hoc inference failures with different conditioning. Selective inference conditions on a polyhedral selection event Ay ≤ b; AVI conditions on stopping-time events τ ≤ n. The §13.2 comparison table makes the complementarity explicit. selective-inference
  • The supermartingale framework (filtrations, adapted processes, Doob's optional-stopping theorem, Ville's inequality) is the measure-theoretic substrate. Every confidence-sequence construction in this topic instantiates the same template — choose a Wald–Doob supermartingale, mix it, apply Ville. measure-theoretic-probability

References & Further Reading