
Concentration Inequalities

Quantitative convergence rates — from Markov's inequality to sub-Gaussian theory and generalization bounds

Overview & Motivation

The Measure-Theoretic Probability topic established that sample averages converge to population means. The Weak Law of Large Numbers tells us \bar{X}_n \xrightarrow{P} \mu. The Strong Law strengthens this to almost sure convergence. But neither law answers the question a practitioner actually asks: how many samples do we need?

Suppose we train a classifier on n samples and observe an empirical error rate of 5%. How confident are we that the true error is close to 5%? The law of large numbers says “eventually” — but “eventually” is not a sample size. We need quantitative bounds on tail probabilities: explicit formulas relating n, the error tolerance \varepsilon, and the failure probability \delta.

This is exactly what concentration inequalities provide. They answer questions of the form:

P\left(|\bar{X}_n - \mu| \geq \varepsilon\right) \leq f(n, \varepsilon)

where f decreases (typically exponentially) in n and \varepsilon. Inverting these bounds gives us sample complexity: the number of samples sufficient to guarantee |\bar{X}_n - \mu| < \varepsilon with probability at least 1 - \delta.

The progression of tightness. We develop a hierarchy of increasingly powerful bounds:

  1. Markov’s inequality uses only the mean — the weakest bound but the most general.
  2. Chebyshev’s inequality adds the variance — a polynomial (1/n) rate.
  3. Chernoff bounds exploit the full moment generating function — exponential (e^{-cn}) rates.
  4. Hoeffding’s inequality provides a clean, usable exponential bound for bounded random variables.
  5. Sub-Gaussian theory abstracts the pattern: a random variable concentrates like a Gaussian if its MGF is controlled by a quadratic.
  6. Bernstein’s inequality refines the bound when the variance is small relative to the range.
  7. McDiarmid’s inequality extends concentration from sums to arbitrary functions of independent variables.

Each step trades generality for tightness. By the end, we will have the precise tools to prove generalization bounds — the theoretical guarantee that a machine learning model’s training error approximates its true error — and derive sample complexity results for finite hypothesis classes.


Basic Tail Bounds: Markov and Chebyshev

The Tail Probability Question

Given a random variable X with known moments (mean, variance, etc.), what can we say about the probability that X deviates far from its expected value? This is the tail probability question, and it is the foundation of all concentration inequalities.

We begin with the simplest possible bound, which uses only the first moment.

Markov’s Inequality

Theorem 1 (Markov's Inequality).

Let X \geq 0 be a non-negative random variable with E[X] < \infty. Then for any t > 0:

P(X \geq t) \leq \frac{E[X]}{t}

Proof.

We write:

E[X] = \int_0^\infty X \, dP \geq \int_{\{X \geq t\}} X \, dP \geq \int_{\{X \geq t\}} t \, dP = t \cdot P(X \geq t)

The first inequality follows because X \geq 0 (so restricting the domain of integration decreases the integral). The second inequality follows because on the set \{X \geq t\}, we have X \geq t. Dividing both sides by t > 0 completes the proof. \square

Remark.

Markov’s inequality is remarkably general — it requires only non-negativity and finite mean. But this generality comes at a cost: the bound is often loose. For example, if X \sim \text{Exp}(1) (so E[X] = 1), Markov gives P(X \geq 5) \leq 1/5 = 0.20, while the true value is e^{-5} \approx 0.0067.
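A two-line numerical check of the remark's numbers (purely illustrative):

```python
# Markov's bound vs the exact tail of Exp(1) at t = 5, as in the remark above.
import math

t = 5.0
markov = 1.0 / t               # E[X]/t with E[X] = 1
true_tail = math.exp(-t)       # P(X >= t) = e^{-t} for Exp(1)
print(f"Markov: {markov:.4f}, true: {true_tail:.4f}")  # Markov: 0.2000, true: 0.0067
```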

Chebyshev’s Inequality

We can obtain a tighter bound by exploiting the variance. The idea is simple: apply Markov’s inequality to |X - \mu|^2.

Theorem 2 (Chebyshev's Inequality).

Let X have finite mean \mu = E[X] and finite variance \sigma^2 = \text{Var}(X). Then for any t > 0:

P(|X - \mu| \geq t) \leq \frac{\sigma^2}{t^2}

Proof.

Since (X - \mu)^2 \geq 0, we apply Markov’s inequality to Y = (X - \mu)^2 with threshold t^2:

P(|X - \mu| \geq t) = P((X - \mu)^2 \geq t^2) \leq \frac{E[(X - \mu)^2]}{t^2} = \frac{\sigma^2}{t^2} \quad \square

Corollary 1 (Chebyshev for Sample Means).

If X_1, \ldots, X_n are i.i.d. with mean \mu and variance \sigma^2, then:

P\left(|\bar{X}_n - \mu| \geq \varepsilon\right) \leq \frac{\sigma^2}{n\varepsilon^2}

Proof.

The sample mean \bar{X}_n has mean \mu and variance \sigma^2/n. Apply Chebyshev. \square

Remark.

This gives our first sample complexity bound: to guarantee P(|\bar{X}_n - \mu| \geq \varepsilon) \leq \delta, it suffices to take n \geq \sigma^2 / (\varepsilon^2 \delta). This is the WLLN proof we saw in the previous topic, now made quantitative. But the polynomial rate O(1/n) is far from optimal — we will soon see exponential bounds.
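Plugging in concrete numbers makes the rate tangible; a minimal sketch with illustrative values \sigma^2 = 1, \varepsilon = 0.1, \delta = 0.05:

```python
# Chebyshev sample complexity: n >= sigma^2 / (eps^2 * delta).
import math

sigma2, eps, delta = 1.0, 0.1, 0.05
n_required = math.ceil(sigma2 / (eps**2 * delta))
print(n_required)  # 2000
```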

One-Sided Variants

Proposition 1 (Cantelli's Inequality (One-sided Chebyshev)).

Let X have finite mean \mu and variance \sigma^2. Then for any t > 0:

P(X - \mu \geq t) \leq \frac{\sigma^2}{\sigma^2 + t^2}

Proof.

For any s > 0, we have:

P(X - \mu \geq t) = P(X - \mu + s \geq t + s) \leq \frac{E[(X - \mu + s)^2]}{(t + s)^2} = \frac{\sigma^2 + s^2}{(t+s)^2}

by Markov applied to (X - \mu + s)^2. Minimizing over s gives s^* = \sigma^2/t, yielding the result. \square
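Cantelli's bound is always at least as tight as the two-sided Chebyshev bound applied to the same one-sided event; a quick numeric comparison with illustrative values:

```python
# One-sided tail P(X - mu >= t): Cantelli vs the crude Chebyshev bound.
sigma2 = 1.0
for t in [0.5, 1.0, 2.0, 4.0]:
    cantelli = sigma2 / (sigma2 + t**2)
    chebyshev = min(sigma2 / t**2, 1.0)  # two-sided bound used one-sidedly
    assert cantelli <= chebyshev
    print(f"t={t}: Cantelli {cantelli:.3f} vs Chebyshev {chebyshev:.3f}")
```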

Tail bound comparison — Markov, Chebyshev, and the true tail for several distributions, showing the gap between polynomial bounds and true exponential tail decay

The figure above demonstrates how loose the basic tail bounds are compared to the true tail probability. In the next section, we introduce the key idea that bridges the gap: the Chernoff method.

# Tail Bound Comparison: Markov vs Chebyshev vs True Tail
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

t_vals = np.linspace(0.5, 8, 200)
mu, sigma2 = 1.0, 1.0  # Exp(1): mean=1, var=1

true_tail = np.exp(-t_vals)                           # P(X >= t) = e^{-t}
markov_bound = np.minimum(mu / t_vals, 1.0)           # E[X] / t
chebyshev_bound = np.minimum(sigma2 / (t_vals - mu)**2, 1.0)  # Var / (t - mu)^2

# Chebyshev is only meaningful for t > mu
chebyshev_bound[t_vals <= mu] = 1.0

plt.semilogy(t_vals, true_tail, 'k-', linewidth=2.5, label='True tail')
plt.semilogy(t_vals, markov_bound, '--', color='#D97706', linewidth=2, label='Markov')
plt.semilogy(t_vals[t_vals > mu], chebyshev_bound[t_vals > mu],
             '--', color='#534AB7', linewidth=2, label='Chebyshev')
plt.xlabel('t'); plt.ylabel('P(X >= t)')
plt.title('Exponential(1): Tail Bounds vs True Tail')
plt.legend()

The Chernoff Method and Moment Generating Functions

The Key Idea: Exponentiate and Apply Markov

The breakthrough from polynomial to exponential tail bounds comes from a remarkably simple trick: instead of applying Markov directly to X, apply it to e^{\lambda X} for an optimized parameter \lambda > 0.

Definition 1 (Moment Generating Function).

The moment generating function (MGF) of a random variable X is:

M_X(\lambda) = E[e^{\lambda X}], \quad \lambda \in \mathbb{R}

whenever the expectation exists. The MGF is also called the Laplace transform of the distribution of X.

Remark.

The MGF encodes all moments of X: E[X^k] = M_X^{(k)}(0). This is why exploiting the MGF gives stronger bounds than using any finite number of moments.

The Chernoff Bound

Theorem 3 (Chernoff Bound).

For any random variable X and any t \in \mathbb{R}:

P(X \geq t) \leq \inf_{\lambda > 0} e^{-\lambda t} M_X(\lambda)

Proof.

For any \lambda > 0, the function x \mapsto e^{\lambda x} is strictly increasing, so:

P(X \geq t) = P(e^{\lambda X} \geq e^{\lambda t}) \leq \frac{E[e^{\lambda X}]}{e^{\lambda t}} = e^{-\lambda t} M_X(\lambda)

where the inequality is Markov applied to the non-negative random variable e^{\lambda X}. Since this holds for all \lambda > 0, we take the infimum. \square

Remark.

The optimization over \lambda is the “Chernoff” step. Different choices of \lambda give different trade-offs between the e^{-\lambda t} decay and the MGF growth. The optimal \lambda^* satisfies M_X'(\lambda^*) / M_X(\lambda^*) = t, which is the equation for the Legendre transform (or rate function) of the log-MGF.

Application to Sums of Independent Variables

For a sum S_n = X_1 + \cdots + X_n of independent random variables, the MGF factorizes:

M_{S_n}(\lambda) = \prod_{i=1}^n M_{X_i}(\lambda)

This is because independence gives E[e^{\lambda S_n}] = E[\prod_i e^{\lambda X_i}] = \prod_i E[e^{\lambda X_i}]. The log-MGF (cumulant generating function) therefore adds:

\psi_{S_n}(\lambda) = \sum_{i=1}^n \psi_{X_i}(\lambda)

This additivity is why the Chernoff method is particularly powerful for sums of independent variables.

Proposition 2 (Chernoff for Bernoulli Sums).

Let X_1, \ldots, X_n be i.i.d. \text{Ber}(p) and S_n = \sum_{i=1}^n X_i. For any t > np:

P(S_n \geq t) \leq \left(\frac{np}{t}\right)^t \left(\frac{n - np}{n - t}\right)^{n-t}

Proof.

The MGF of \text{Ber}(p) is M_{X_i}(\lambda) = 1 - p + pe^{\lambda}. So M_{S_n}(\lambda) = (1 - p + pe^{\lambda})^n. The Chernoff bound gives P(S_n \geq t) \leq \inf_{\lambda > 0} e^{-\lambda t}(1 - p + pe^{\lambda})^n. Taking the derivative in \lambda and setting it to zero gives e^{\lambda^*} = \frac{t(1-p)}{(n-t)p}. Substituting yields the result after algebraic simplification. \square

The Log-MGF and Cramér’s Theorem (Preview)

Definition 2 (Cumulant Generating Function).

The cumulant generating function (CGF, or log-MGF) of X is:

\psi_X(\lambda) = \log M_X(\lambda) = \log E[e^{\lambda X}]

Definition 3 (Rate Function (Legendre–Fenchel Transform)).

The rate function of X is:

\psi_X^*(t) = \sup_{\lambda \in \mathbb{R}} \left(\lambda t - \psi_X(\lambda)\right)

This is the Legendre–Fenchel transform (convex conjugate) of \psi_X. The Chernoff bound becomes:

P(X \geq t) \leq e^{-\psi_X^*(t)}

The rate function \psi_X^*(t) measures how “surprising” it is to observe X \geq t. This connects to large deviations theory and information-theoretic quantities like the KL divergence.
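For a Bernoulli variable the rate function can be computed numerically and matches the binary KL divergence \mathrm{KL}(a \| p); a small check under assumed values p = 0.3, a = 0.5:

```python
# psi*(a) = sup_lambda [lambda*a - log(1 - p + p*e^lambda)] for X ~ Ber(p),
# compared against KL(a || p) = a log(a/p) + (1-a) log((1-a)/(1-p)).
import numpy as np

p, a = 0.3, 0.5
lam = np.linspace(-10, 10, 200001)
psi_star = np.max(lam * a - np.log(1 - p + p * np.exp(lam)))
kl = a * np.log(a / p) + (1 - a) * np.log((1 - a) / (1 - p))
print(round(psi_star, 6), round(kl, 6))  # both ≈ 0.087177
```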

The Chernoff method — optimizing the exponential tilt parameter λ, comparing Chernoff vs Chebyshev bounds for a Binomial distribution, and the rate function (Legendre transform)

# Chernoff Method: Optimizing the Exponential Tilt
import numpy as np
from scipy import stats

p, n = 0.3, 100
t_target = 40  # P(S_n >= 40) where np = 30

lambdas = np.linspace(0.01, 2.0, 200)
chernoff_bounds = []
for lam in lambdas:
    mgf = (1 - p + p * np.exp(lam))**n
    bound = np.exp(-lam * t_target) * mgf
    chernoff_bounds.append(bound)

chernoff_bounds = np.array(chernoff_bounds)
opt_idx = np.argmin(chernoff_bounds)

# The optimal λ* minimizes the Chernoff bound
print(f"Optimal λ = {lambdas[opt_idx]:.3f}")
print(f"Chernoff bound = {chernoff_bounds[opt_idx]:.2e}")
print(f"True probability = {1 - stats.binom.cdf(39, n, p):.2e}")

Hoeffding’s Inequality

The Chernoff method is powerful but requires computing the MGF of each distribution. Hoeffding’s inequality provides a distribution-free exponential bound for bounded random variables, avoiding the need to compute the specific MGF.

Hoeffding’s Lemma

The key technical ingredient is a bound on the MGF of a bounded, zero-mean random variable.

Lemma 1 (Hoeffding's Lemma).

Let X be a random variable with E[X] = 0 and a \leq X \leq b almost surely. Then for all \lambda \in \mathbb{R}:

E[e^{\lambda X}] \leq \exp\left(\frac{\lambda^2 (b - a)^2}{8}\right)

Proof.

Since a \leq X \leq b, we can write X = \alpha b + (1 - \alpha) a where \alpha = \frac{X - a}{b - a} \in [0, 1]. By convexity of e^{\lambda x}:

e^{\lambda X} \leq \frac{X - a}{b - a} e^{\lambda b} + \frac{b - X}{b - a} e^{\lambda a}

Taking expectations and using E[X] = 0 (so E[X - a] = -a and E[b - X] = b):

E[e^{\lambda X}] \leq \frac{-a}{b - a} e^{\lambda b} + \frac{b}{b - a} e^{\lambda a}

Let h = \lambda(b - a), p = \frac{-a}{b - a}. Then the right side equals e^{g(h)} where:

g(h) = -ph + \log(1 - p + pe^h)

We have g(0) = 0, g'(0) = 0, and g''(h) \leq 1/4 for all h: writing u = \frac{pe^h}{1 - p + pe^h} \in (0, 1), a direct computation gives g''(h) = u(1 - u) \leq 1/4. By Taylor’s theorem with the mean-value form of the remainder:

g(h) \leq \frac{h^2}{8} = \frac{\lambda^2(b-a)^2}{8} \quad \square
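The lemma can be sanity-checked against the exact MGF of a centered Bernoulli (here p = 0.2, so a = -p, b = 1 - p, and b - a = 1):

```python
# Exact MGF of X = B - p with B ~ Ber(p), vs the Hoeffding's Lemma bound e^{lambda^2/8}.
import numpy as np

p = 0.2
lam = np.linspace(-4, 4, 801)
mgf = (1 - p) * np.exp(-p * lam) + p * np.exp((1 - p) * lam)
bound = np.exp(lam**2 / 8)
assert np.all(mgf <= bound + 1e-12)
print("MGF bound verified on the grid")
```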

Hoeffding’s Inequality

Theorem 4 (Hoeffding's Inequality).

Let X_1, \ldots, X_n be independent random variables with a_i \leq X_i \leq b_i almost surely. Let S_n = \sum_{i=1}^n X_i. Then for any t > 0:

P(S_n - E[S_n] \geq t) \leq \exp\left(-\frac{2t^2}{\sum_{i=1}^n (b_i - a_i)^2}\right)

Combining this with the same bound applied to -X_1, \ldots, -X_n (a union bound over the two tails) gives the two-sided version:

P(|S_n - E[S_n]| \geq t) \leq 2\exp\left(-\frac{2t^2}{\sum_{i=1}^n (b_i - a_i)^2}\right)

Proof.

Apply the Chernoff method: for any \lambda > 0, by independence:

E\left[e^{\lambda(S_n - E[S_n])}\right] = \prod_{i=1}^n E\left[e^{\lambda(X_i - E[X_i])}\right]

Each Y_i = X_i - E[X_i] satisfies E[Y_i] = 0 and a_i - E[X_i] \leq Y_i \leq b_i - E[X_i], with range b_i - a_i. By Hoeffding’s Lemma:

E\left[e^{\lambda Y_i}\right] \leq \exp\left(\frac{\lambda^2(b_i - a_i)^2}{8}\right)

So:

P(S_n - E[S_n] \geq t) \leq \exp\left(-\lambda t + \frac{\lambda^2}{8}\sum_{i=1}^n(b_i - a_i)^2\right)

Minimizing over \lambda gives \lambda^* = \frac{4t}{\sum_i (b_i - a_i)^2}, yielding:

P(S_n - E[S_n] \geq t) \leq \exp\left(-\frac{2t^2}{\sum_{i=1}^n (b_i - a_i)^2}\right) \quad \square

Corollary 2 (Hoeffding for Sample Means).

If X_1, \ldots, X_n are i.i.d. with a \leq X_i \leq b, then:

P(|\bar{X}_n - \mu| \geq \varepsilon) \leq 2\exp\left(-\frac{2n\varepsilon^2}{(b-a)^2}\right)

Remark.

Inverting: to achieve P(|\bar{X}_n - \mu| \geq \varepsilon) \leq \delta, we need:

n \geq \frac{(b-a)^2}{2\varepsilon^2}\log\frac{2}{\delta}

Compare to Chebyshev’s n \geq \sigma^2/(\varepsilon^2\delta): Hoeffding’s dependence on \delta is \log(1/\delta) instead of 1/\delta — exponentially better for small \delta.
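To see the gap concretely, compare the two sample-complexity formulas for variables in [0, 1] (so (b-a)^2 = 1 and \sigma^2 \leq 1/4) at illustrative values \varepsilon = 0.1, \delta = 10^{-6}:

```python
# Hoeffding vs Chebyshev sample complexity at the same (eps, delta).
import math

eps, delta = 0.1, 1e-6
n_hoeffding = math.ceil(math.log(2 / delta) / (2 * eps**2))   # (b-a)^2 = 1
n_chebyshev = math.ceil(0.25 / (eps**2 * delta))              # worst-case sigma^2 = 1/4
print(n_hoeffding, n_chebyshev)  # 726 25000000
```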

Hoeffding's inequality — empirical verification showing the bound vs Monte Carlo estimates, sample complexity comparison, and tightness across different distributions

# Hoeffding's Inequality: Empirical Verification
import numpy as np

np.random.seed(42)
n_values = np.arange(10, 501, 5)
eps = 0.2
a, b = 0, 1  # Uniform[0,1]
mu, sigma2 = 0.5, 1/12
n_mc = 50000

empirical_probs = []
for n in n_values:
    samples = np.random.uniform(0, 1, (n_mc, n))
    means = samples.mean(axis=1)
    empirical_probs.append(np.mean(np.abs(means - mu) >= eps))

hoeffding_bound = 2 * np.exp(-2 * n_values * eps**2 / (b - a)**2)
chebyshev_bound = np.minimum(sigma2 / (n_values * eps**2), 1.0)

# The Hoeffding bound is a valid upper bound at every n
# Chebyshev is always looser (higher) than Hoeffding

Sub-Gaussian Random Variables

Hoeffding’s inequality works for bounded random variables, but many important random variables are unbounded (e.g., Gaussian). We need a framework that captures the tail behavior rather than relying on boundedness. This leads to the theory of sub-Gaussian random variables — a class defined by having tails that decay at least as fast as a Gaussian’s.

Motivation

A standard Gaussian Z \sim N(0, 1) has MGF E[e^{\lambda Z}] = e^{\lambda^2/2}. The Chernoff method gives:

P(Z \geq t) \leq e^{-t^2/2}

This Gaussian tail bound has the form e^{-ct^2}. Many distributions — even non-Gaussian ones — exhibit this same qualitative behavior. Sub-Gaussian theory captures this commonality.

Definition and Equivalent Characterizations

Definition 4 (Sub-Gaussian Random Variable).

A random variable X with E[X] = 0 is sub-Gaussian with parameter \sigma (written X \sim \text{subG}(\sigma^2)) if its MGF satisfies:

E[e^{\lambda X}] \leq e^{\sigma^2 \lambda^2 / 2} \quad \text{for all } \lambda \in \mathbb{R}

The parameter \sigma^2 is called the variance proxy; the closely related \psi_2-norm below is also known as the sub-Gaussian norm.

Proposition 3 (Equivalent Characterizations of Sub-Gaussian).

For a mean-zero random variable X, the following are equivalent (up to universal constants in the parameters):

  1. MGF condition: E[e^{\lambda X}] \leq e^{\sigma^2 \lambda^2 / 2} for all \lambda.
  2. Tail condition: P(|X| \geq t) \leq 2e^{-t^2/(2\sigma^2)} for all t \geq 0.
  3. Moment condition: \|X\|_p = (E[|X|^p])^{1/p} \leq \sigma\sqrt{p} for all p \geq 1.
  4. \psi_2-norm: \|X\|_{\psi_2} = \inf\{t > 0 : E[e^{X^2/t^2}] \leq 2\} < \infty.

Remark.

These characterizations are not equivalent with the same constant — they are equivalent up to universal multiplicative factors. The MGF condition is the cleanest for proving concentration bounds. The tail condition is most intuitive.

Key Examples

Example 1 (Gaussian). Z \sim N(0, \sigma^2) is \text{subG}(\sigma^2) — the defining example.

Example 2 (Rademacher). X = \pm 1 with equal probability. Then E[e^{\lambda X}] = \cosh(\lambda) \leq e^{\lambda^2/2}, so X \sim \text{subG}(1).

Example 3 (Bounded). If a \leq X \leq b and E[X] = 0, then X \sim \text{subG}((b-a)^2/4) by Hoeffding’s Lemma.

Example 4 (Not sub-Gaussian). X \sim \text{Exp}(1) - 1 is not sub-Gaussian: its MGF E[e^{\lambda X}] = e^{-\lambda}/(1 - \lambda), defined for \lambda < 1, diverges as \lambda \to 1, so it cannot be bounded by a quadratic for all \lambda.
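Example 2's claim \cosh(\lambda) \leq e^{\lambda^2/2} is easy to verify numerically on a grid:

```python
# Check the Rademacher MGF bound cosh(lambda) <= exp(lambda^2 / 2).
import numpy as np

lam = np.linspace(-5, 5, 1001)
assert np.all(np.cosh(lam) <= np.exp(lam**2 / 2) + 1e-12)
print("cosh(lambda) <= exp(lambda^2/2) on the grid: Rademacher is subG(1)")
```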

Concentration for Sums of Sub-Gaussians

Theorem 5 (Sub-Gaussian Sum).

If X_1, \ldots, X_n are independent with X_i \sim \text{subG}(\sigma_i^2), then \sum_{i=1}^n X_i \sim \text{subG}\left(\sum_{i=1}^n \sigma_i^2\right).

Proof.

By independence:

E\left[\exp\left(\lambda \sum_i X_i\right)\right] = \prod_i E[e^{\lambda X_i}] \leq \prod_i e^{\sigma_i^2 \lambda^2/2} = \exp\left(\frac{\lambda^2}{2}\sum_i \sigma_i^2\right) \quad \square

Corollary 3 (Sub-Gaussian Sample Means).

If X_1, \ldots, X_n are i.i.d. \text{subG}(\sigma^2), then:

P\left(|\bar{X}_n| \geq \varepsilon\right) \leq 2\exp\left(-\frac{n\varepsilon^2}{2\sigma^2}\right)

Maximal Inequality

Theorem 6 (Sub-Gaussian Maximum).

If X_1, \ldots, X_n are (not necessarily independent) sub-Gaussian with parameter \sigma^2, then:

E\left[\max_{1 \leq i \leq n} X_i\right] \leq \sigma\sqrt{2\log n}

Proof.

For any \lambda > 0:

E\left[\max_i X_i\right] \leq \frac{1}{\lambda}\log E\left[e^{\lambda \max_i X_i}\right] \leq \frac{1}{\lambda}\log\left(\sum_{i=1}^n E[e^{\lambda X_i}]\right) \leq \frac{1}{\lambda}\log(ne^{\sigma^2\lambda^2/2}) = \frac{\log n}{\lambda} + \frac{\sigma^2\lambda}{2}

where the first inequality is Jensen’s and the second uses e^{\lambda \max_i X_i} \leq \sum_i e^{\lambda X_i} (a log-sum-exp bound). Optimizing over \lambda: \lambda^* = \sqrt{2\log n}/\sigma gives the result. \square

Remark.

The \sqrt{\log n} rate is ubiquitous in high-dimensional statistics. It tells us that the maximum of n sub-Gaussian variables grows only logarithmically — a fundamental reason why many statistical procedures work in high dimensions.
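A quick simulation comparing the empirical mean of the maximum of n standard Gaussians against the \sigma\sqrt{2\log n} bound (a sketch; the sample sizes are arbitrary):

```python
# E[max of n N(0,1) variables] vs the sqrt(2 log n) bound from Theorem 6.
import numpy as np

rng = np.random.default_rng(0)
for n in [10, 100, 1000]:
    emp_max = rng.standard_normal((5000, n)).max(axis=1).mean()
    bound = np.sqrt(2 * np.log(n))
    assert emp_max <= bound
    print(f"n={n}: empirical E[max] ≈ {emp_max:.3f}, bound = {bound:.3f}")
```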

Sub-Gaussian theory — MGF comparison across distributions, tail decay rates, and the maximal inequality showing the sqrt(2 log n) scaling


Sub-Exponential Random Variables and Bernstein’s Inequality

Beyond Sub-Gaussian: Heavier Tails

Not all useful random variables are sub-Gaussian. The centered exponential X - E[X] where X \sim \text{Exp}(\lambda), or the square Z^2 of a Gaussian, have tails that decay as e^{-ct} rather than e^{-ct^2}. These are sub-exponential random variables.

Definition 5 (Sub-Exponential Random Variable).

A random variable X with E[X] = 0 is sub-exponential with parameters (\nu^2, b) if:

E[e^{\lambda X}] \leq \exp\left(\frac{\nu^2 \lambda^2}{2}\right) \quad \text{for all } |\lambda| < \frac{1}{b}

Remark.

The MGF condition looks like the sub-Gaussian condition, but it only holds for |\lambda| < 1/b rather than all \lambda. This restricted domain reflects the heavier tails.

Proposition 4 (Sub-Gaussian ⊂ Sub-Exponential).

Every sub-Gaussian random variable is sub-exponential (with b = 0, formally). The converse is false: Z^2 - 1 where Z \sim N(0,1) is sub-exponential but not sub-Gaussian.

Proposition 5 (Product Rule (Sub-Gaussian squares)).

If X \sim \text{subG}(\sigma^2), then X^2 - E[X^2] is sub-exponential with parameters (\nu^2, b) depending on \sigma.

Bernstein’s Inequality

Bernstein’s inequality is sharper than Hoeffding when the variance is small relative to the range. It interpolates between sub-Gaussian behavior (for small deviations) and sub-exponential behavior (for large deviations).

Theorem 7 (Bernstein's Inequality).

Let X_1, \ldots, X_n be independent mean-zero random variables with |X_i| \leq M almost surely. Let \sigma^2 = \frac{1}{n}\sum_{i=1}^n E[X_i^2]. Then:

P\left(\frac{1}{n}\sum_{i=1}^n X_i \geq t\right) \leq \exp\left(-\frac{n t^2 / 2}{\sigma^2 + Mt/3}\right)

Proof.

For bounded X_i with |X_i| \leq M, we have E[|X_i|^k] \leq \frac{k!}{2} \sigma_i^2 M^{k-2} for k \geq 2 (Bernstein’s moment condition). Since E[X_i] = 0, this gives:

E[e^{\lambda X_i}] = \sum_{k=0}^\infty \frac{\lambda^k E[X_i^k]}{k!} \leq 1 + \frac{\lambda^2\sigma_i^2}{2}\sum_{k=0}^\infty (M|\lambda|)^k = 1 + \frac{\lambda^2\sigma_i^2/2}{1 - M|\lambda|}

for |\lambda| < 1/M. Using 1 + x \leq e^x and the Chernoff method, then optimizing \lambda, we obtain the result. \square

Remark.

The denominator \sigma^2 + Mt/3 creates two regimes:

  • Small deviations (t \ll \sigma^2/M): the bound behaves as \exp(-nt^2/(2\sigma^2)) — sub-Gaussian with the variance, not the range.
  • Large deviations (t \gg \sigma^2/M): the bound behaves as \exp(-3nt/(2M)) — sub-exponential, linear in t.

When \sigma^2 \ll M^2, Bernstein is much tighter than Hoeffding in the small-deviation regime.
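The two regimes are visible numerically: with illustrative values n = 1000, t = 0.05, M = 1, Bernstein beats Hoeffding (which sees only the range b - a = 2M) dramatically when the variance is small:

```python
# Bernstein vs Hoeffding for mean-zero variables in [-M, M] with small variance.
import numpy as np

n, t, M = 1000, 0.05, 1.0
for sigma2 in [0.01, 0.05, 0.25]:
    bernstein = np.exp(-(n * t**2 / 2) / (sigma2 + M * t / 3))
    hoeffding = np.exp(-2 * n * t**2 / (2 * M)**2)  # range b - a = 2M
    print(f"sigma^2={sigma2}: Bernstein {bernstein:.2e}, Hoeffding {hoeffding:.2e}")
```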

Bernstein's inequality — sub-Gaussian vs sub-exponential tail decay, Bernstein vs Hoeffding comparison when variance is small, and the two-regime analysis


McDiarmid’s Inequality

From Sums to General Functions

Hoeffding’s inequality applies to sums \bar{X}_n = \frac{1}{n}\sum_i X_i. But in machine learning, we often care about more general functions of independent random variables — for instance, the empirical risk \hat{R}(h) = \frac{1}{n}\sum_{i=1}^n \ell(h(X_i), Y_i), or the supremum of an empirical process \sup_{h \in \mathcal{H}} |\hat{R}(h) - R(h)|.

McDiarmid’s inequality extends concentration to any function f(X_1, \ldots, X_n) that doesn’t depend too much on any single coordinate.

The Bounded Differences Condition

Definition 6 (Bounded Differences Property).

A function f : \mathcal{X}^n \to \mathbb{R} satisfies the bounded differences property with constants c_1, \ldots, c_n if for every i and every x_1, \ldots, x_n, x_i' \in \mathcal{X}:

\sup_{x_1, \ldots, x_n, x_i'} |f(x_1, \ldots, x_i, \ldots, x_n) - f(x_1, \ldots, x_i', \ldots, x_n)| \leq c_i

In words: changing any single input x_i to x_i' changes f by at most c_i.

McDiarmid’s Inequality

Theorem 8 (McDiarmid's Inequality).

Let X_1, \ldots, X_n be independent random variables, and let f satisfy the bounded differences property with constants c_1, \ldots, c_n. Then:

P(f(X_1, \ldots, X_n) - E[f(X_1, \ldots, X_n)] \geq t) \leq \exp\left(-\frac{2t^2}{\sum_{i=1}^n c_i^2}\right)

Proof.

The proof uses a martingale difference decomposition. Define:

D_i = E[f \mid X_1, \ldots, X_i] - E[f \mid X_1, \ldots, X_{i-1}]

Then f - E[f] = \sum_{i=1}^n D_i is a telescoping sum of martingale differences (E[D_i \mid X_1, \ldots, X_{i-1}] = 0).

The bounded differences condition implies that, conditionally on X_1, \ldots, X_{i-1}, each D_i is mean-zero and takes values in an interval of length at most c_i, so Hoeffding’s Lemma applies conditionally:

E[e^{\lambda D_i} \mid X_1, \ldots, X_{i-1}] \leq e^{\lambda^2 c_i^2 / 8}

Iterating this conditional bound over i = n, \ldots, 1 and running the same Chernoff optimization as in Hoeffding’s inequality yields the stated bound. \square

Remark.

McDiarmid’s inequality is sometimes called the “method of bounded differences.” It subsumes Hoeffding’s inequality: for f(x_1, \ldots, x_n) = \frac{1}{n}\sum_i x_i with x_i \in [a, b], each c_i = (b-a)/n, and the bound reduces to Hoeffding.

Application: Concentration of Empirical Risk

Example 5 (Empirical Risk). Let (X_1, Y_1), \ldots, (X_n, Y_n) be i.i.d. training samples, h a fixed hypothesis, and \ell a loss function bounded by M. Define the empirical risk:

\hat{R}(h) = \frac{1}{n}\sum_{i=1}^n \ell(h(X_i), Y_i)

Changing any single sample (X_i, Y_i) changes \hat{R}(h) by at most M/n. By McDiarmid:

P(|\hat{R}(h) - R(h)| \geq \varepsilon) \leq 2\exp(-2n\varepsilon^2/M^2)

This is the concentration result that enables generalization bounds: the empirical risk concentrates around the true risk at an exponential rate.
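A Monte Carlo check of this bound for the 0-1 loss (M = 1), with an assumed true risk R(h) = 0.3:

```python
# Concentration of empirical risk for a fixed hypothesis under 0-1 loss.
import numpy as np

rng = np.random.default_rng(1)
n, R, eps, reps = 200, 0.3, 0.1, 20000
emp_risks = rng.binomial(n, R, size=reps) / n     # sum of 0-1 losses, divided by n
deviation_freq = np.mean(np.abs(emp_risks - R) >= eps)
bound = 2 * np.exp(-2 * n * eps**2)               # McDiarmid with c_i = 1/n
print(f"empirical frequency: {deviation_freq:.4f}, bound: {bound:.4f}")
assert deviation_freq <= bound
```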

McDiarmid's inequality — empirical risk concentration, bounded differences for various functions, and sample complexity for PAC-style bounds


Applications to Machine Learning

Generalization Bounds for a Fixed Hypothesis

The simplest generalization bound follows directly from Hoeffding’s inequality (or McDiarmid). Given a hypothesis h with true risk R(h) = E[\ell(h(X), Y)] and empirical risk \hat{R}(h) = \frac{1}{n}\sum_{i=1}^n \ell(h(X_i), Y_i), if the loss is bounded by M:

Theorem 9 (Fixed-Hypothesis Generalization Bound).

With probability at least 1 - \delta:

|R(h) - \hat{R}(h)| \leq M\sqrt{\frac{\log(2/\delta)}{2n}}

Proof.

Apply Corollary 2 (Hoeffding for sample means) with \varepsilon = M\sqrt{\frac{\log(2/\delta)}{2n}} and solve 2e^{-2n\varepsilon^2/M^2} = \delta. \square

The Union Bound and Finite Hypothesis Classes

For a finite hypothesis class \mathcal{H} = \{h_1, \ldots, h_K\}, we want uniform convergence: \sup_{h \in \mathcal{H}} |\hat{R}(h) - R(h)| \leq \varepsilon simultaneously for all h. The union bound (Boole’s inequality) and Hoeffding give:

Theorem 10 (Uniform Convergence for Finite H).

With probability at least 1 - \delta:

\sup_{h \in \mathcal{H}} |R(h) - \hat{R}(h)| \leq M\sqrt{\frac{\log(2|\mathcal{H}|/\delta)}{2n}}

Proof.

By the union bound:

P(\exists h : |R(h) - \hat{R}(h)| \geq \varepsilon) \leq \sum_{k=1}^K P(|\hat{R}(h_k) - R(h_k)| \geq \varepsilon) \leq 2Ke^{-2n\varepsilon^2/M^2}

Set this equal to \delta and solve for \varepsilon. \square

Remark.

The price of uniformity is \log|\mathcal{H}| — logarithmic in the size of the hypothesis class. This is remarkably mild: even with |\mathcal{H}| = 10^{100} hypotheses, we only need \log(10^{100}) \approx 230 additional samples’ worth of concentration.

Sample Complexity

Corollary 4 (Sample Complexity for Finite H).

To achieve \sup_{h \in \mathcal{H}} |R(h) - \hat{R}(h)| \leq \varepsilon with probability \geq 1 - \delta, it suffices to have:

n \geq \frac{M^2}{2\varepsilon^2}\log\frac{2|\mathcal{H}|}{\delta}

This is the PAC-style sample complexity bound. The dependence on the key parameters is:

  • 1/\varepsilon^2: quadratic in the accuracy — fundamental to most concentration-based bounds.
  • \log|\mathcal{H}|: logarithmic in the hypothesis class size — the key insight that makes learning tractable.
  • \log(1/\delta): logarithmic in the confidence — exponentially better than Chebyshev’s 1/\delta.
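Corollary 4 can also be inverted to give an explicit accuracy \varepsilon for a given n; a small helper (the parameter values are illustrative):

```python
# Accuracy eps implied by Theorem 10: M * sqrt(log(2|H|/delta) / (2n)).
import math

def generalization_eps(n, num_hypotheses, delta, M=1.0):
    return M * math.sqrt(math.log(2 * num_hypotheses / delta) / (2 * n))

print(round(generalization_eps(n=1000, num_hypotheses=100, delta=0.05), 3))  # 0.064
```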

The Bridge to PAC Learning

This section has developed the concentration machinery that the PAC Learning Framework will use extensively. The key ingredients flowing forward are:

  1. Hoeffding’s inequality for bounding the gap between empirical and true risk.
  2. The union bound for extending single-hypothesis bounds to finite classes.
  3. Sub-Gaussian theory as the abstract framework for tail control.
  4. McDiarmid’s inequality for concentration of general functions (needed for Rademacher complexity).

The PAC learning framework formalizes this: a hypothesis class \mathcal{H} is PAC-learnable if there exists an algorithm that, given n \geq n(\varepsilon, \delta, \mathcal{H}) i.i.d. samples, outputs a hypothesis \hat{h} with R(\hat{h}) - \inf_{h \in \mathcal{H}} R(h) \leq \varepsilon with probability \geq 1 - \delta. The sample complexity n(\varepsilon, \delta, \mathcal{H}) depends on the “complexity” of \mathcal{H} — measured by VC dimension, Rademacher complexity, or covering numbers — all of which rely on the concentration inequalities developed here.

ML applications — generalization gap vs sample size, the price of uniform convergence, sample complexity landscape, and the concentration inequality hierarchy

# Generalization Bounds: Union Bound + Hoeffding for Finite H
import numpy as np

np.random.seed(42)
n_vals = np.arange(50, 2001, 20)
true_risk = 0.3
n_mc = 5000

# Empirical generalization gaps (Monte Carlo)
gen_gaps_95 = []
for n in n_vals:
    emp_risks = np.random.binomial(n, true_risk, n_mc) / n
    gaps = np.abs(emp_risks - true_risk)
    gen_gaps_95.append(np.percentile(gaps, 95))

# Hoeffding bound for 95% confidence (delta = 0.05)
hoeffding_95 = np.sqrt(np.log(2 / 0.05) / (2 * n_vals))

# Sample complexity: n >= log(2|H|/delta) / (2 * eps^2)
# For |H| = 100, delta = 0.05:
n_needed = lambda eps: np.log(2 * 100 / 0.05) / (2 * eps**2)

Connections & Further Reading

Connections Map

| Topic | Domain | Relationship |
| --- | --- | --- |
| Measure-Theoretic Probability | Probability & Statistics | Prerequisite — provides L^p spaces, convergence modes, and the expectation operator used throughout |
| PCA & Low-Rank Approximation | Linear Algebra | Matrix concentration inequalities bound the gap between sample and population covariance eigenvalues |
| PAC Learning Framework | Probability & Statistics | Direct downstream — uses Hoeffding + union bound for sample complexity; Rademacher complexity uses McDiarmid |
| Bayesian Nonparametrics (coming soon) | Probability & Statistics | Concentration bounds for posterior contraction rates |

Key Notation Summary

| Symbol | Meaning |
| --- | --- |
| M_X(\lambda) = E[e^{\lambda X}] | Moment generating function |
| \psi_X(\lambda) = \log M_X(\lambda) | Cumulant generating function (log-MGF) |
| \psi_X^*(t) | Rate function (Legendre–Fenchel transform) |
| \text{subG}(\sigma^2) | Sub-Gaussian with variance proxy \sigma^2 |
| \|\cdot\|_{\psi_2} | Sub-Gaussian (Orlicz) norm |
| R(h) = E[\ell(h(X), Y)] | True risk |
| \hat{R}(h) | Empirical risk |
| \mathcal{H} | Hypothesis class |


References & Further Reading

  • Boucheron, Lugosi & Massart (2013), Concentration Inequalities: A Nonasymptotic Theory of Independence — definitive reference; covers all topics on this page and much more.
  • Vershynin (2018), High-Dimensional Probability — excellent treatment of sub-Gaussian/sub-exponential theory.
  • Wainwright (2019), High-Dimensional Statistics: A Non-Asymptotic Viewpoint — connects concentration to statistical estimation and testing.
  • Shalev-Shwartz & Ben-David (2014), Understanding Machine Learning: From Theory to Algorithms — Chapters 2–4 use these inequalities for PAC learning.
  • Durrett (2019), Probability: Theory and Examples — Chapter 2 covers the classical large deviations perspective.
  • Hoeffding (1963), “Probability Inequalities for Sums of Bounded Random Variables” — the original paper, still beautifully clear.
  • McDiarmid (1989), “On the Method of Bounded Differences” — original presentation of the bounded differences inequality.