Intermediate · Probability · 50 min read

Measure-Theoretic Probability

From sigma-algebras to martingales — the rigorous foundations that underpin modern statistics and machine learning

Overview & Motivation

Why do we need measure theory for probability? The short answer: because naive probability breaks down in continuous settings.

Consider a “uniform” random variable $X$ on $[0, 1]$. We want $P(X \in A)$ for subsets $A \subseteq [0, 1]$. For intervals, this is simple: $P(X \in [a, b]) = b - a$. But what about arbitrary subsets? Can we assign a “length” (probability) to every subset of $[0, 1]$ while preserving countable additivity?

The answer, due to Vitali (1905), is no — there exist non-measurable sets. This forces us to restrict our attention to a carefully chosen collection of “well-behaved” subsets: a sigma-algebra.

This is not an abstract curiosity. Every time we write $E[X]$, compute a conditional expectation $E[X \mid \mathcal{G}]$, invoke the law of large numbers, or price a financial derivative, we are relying on measure-theoretic machinery. The Lebesgue integral replaces the Riemann integral because it handles limits of random variables correctly (via the Monotone and Dominated Convergence Theorems). Conditional expectation, defined as a Radon–Nikodym derivative, is the mathematical backbone of filtering, Bayesian inference, and martingale theory. And martingales themselves are the language of fair pricing in mathematical finance.

What We Cover

  1. Sigma-algebras & Measurable Spaces — the sets we can assign probabilities to, and why we need them.
  2. Measures & Probability Measures — Kolmogorov’s axioms, Lebesgue measure, and the Cantor set.
  3. Measurable Functions & Random Variables — formalizing “random quantities” as measurable maps.
  4. The Lebesgue Integral & Expectation — building the integral from simple functions, with full proofs of MCT and DCT.
  5. Convergence of Random Variables — the four modes, their hierarchy, the Laws of Large Numbers, and the CLT.
  6. Product Measures & Fubini’s Theorem — integrating over product spaces and why $E[XY] = E[X]E[Y]$ for independent variables.
  7. Conditional Expectation & Radon–Nikodym — the deepest idea: conditional expectation as an $L^2$ projection.
  8. A Preview of Martingales — filtrations, adapted processes, and connections to finance.

Connections

This topic connects to the rest of the formalML curriculum in several directions:

  • PCA & Low-Rank Approximation — the sample covariance $\hat{\Sigma} = \frac{1}{n-1} X^T X$ converges to the population covariance $\Sigma$ by the Law of Large Numbers; $L^2$ theory guarantees convergence of eigenvalues.
  • Concentration Inequalities — builds directly on the $L^p$ spaces and convergence theory developed here, quantifying rates of convergence beyond the LLN.
  • PAC Learning Framework (coming soon) — uses measure-theoretic probability to formalize learnability.
  • Bayesian Nonparametrics (coming soon) — requires conditional expectation and the Radon–Nikodym theorem for prior specifications on infinite-dimensional spaces.

Sigma-Algebras and Measurable Spaces

Why We Need Sigma-Algebras

The fundamental question of probability is: given a sample space $\Omega$, which subsets can we assign probabilities to?

For finite $\Omega$, the answer is easy: every subset. The power set $2^\Omega$ works. But for uncountable $\Omega$ — like $\mathbb{R}$ or $[0, 1]$ — the power set is too large. Vitali’s 1905 construction shows that no translation-invariant, countably additive measure can be defined on all subsets of $[0, 1]$. We must restrict to a smaller collection of sets that is still rich enough to do calculus.

The right structure is a sigma-algebra: a collection of subsets closed under complements and countable unions. This is precisely what we need for probability — we want to say “the probability of $A$ or $B$” (unions), “the probability of not $A$” (complements), and we want these operations to work for countable sequences of events.

Definition 1 (Sigma-algebra).

A sigma-algebra (or $\sigma$-algebra) on a set $\Omega$ is a collection $\mathcal{F} \subseteq 2^\Omega$ satisfying:

  1. $\Omega \in \mathcal{F}$ (the whole space is measurable),
  2. If $A \in \mathcal{F}$, then $A^c \in \mathcal{F}$ (closure under complements),
  3. If $A_1, A_2, \ldots \in \mathcal{F}$, then $\bigcup_{n=1}^\infty A_n \in \mathcal{F}$ (closure under countable unions).

The pair $(\Omega, \mathcal{F})$ is a measurable space.

Remark.

Properties (2) and (3) together imply closure under countable intersections (by De Morgan’s laws: $\bigcap A_n = (\bigcup A_n^c)^c$), and property (1) with (2) gives $\emptyset \in \mathcal{F}$.

Examples on a Finite Set

Let $\Omega = \{1, 2, 3\}$. Three sigma-algebras on $\Omega$:

  • Trivial: $\mathcal{F}_0 = \{\emptyset, \Omega\}$ — we can only say “something happens” or “nothing happens.”
  • Partial: $\mathcal{F}_1 = \{\emptyset, \{1\}, \{2, 3\}, \Omega\}$ — we can distinguish element 1 from the rest.
  • Power set: $\mathcal{F}_2 = 2^\Omega$ — we can distinguish every element.

The trivial sigma-algebra carries the least information; the power set carries the most. This idea — sigma-algebras as information — is the conceptual key to conditional expectation and filtrations.

Generated Sigma-Algebras and the Borel Sets

Given any collection $\mathcal{C}$ of subsets of $\Omega$, there is a smallest sigma-algebra containing $\mathcal{C}$, written $\sigma(\mathcal{C})$. It is the intersection of all sigma-algebras containing $\mathcal{C}$ — and since the power set $2^\Omega$ is always a sigma-algebra containing $\mathcal{C}$, we are never intersecting over an empty family.

Definition 2 (Borel sigma-algebra).

The Borel sigma-algebra on $\mathbb{R}$, denoted $\mathcal{B}(\mathbb{R})$, is the sigma-algebra generated by the open intervals:

$$\mathcal{B}(\mathbb{R}) = \sigma\bigl(\{(a, b) : a < b, \; a, b \in \mathbb{R}\}\bigr)$$

Equivalently, $\mathcal{B}(\mathbb{R}) = \sigma(\text{open sets})$. The Borel sigma-algebra contains all open sets, closed sets, countable intersections of open sets ($G_\delta$ sets), countable unions of closed sets ($F_\sigma$ sets), and much more. It is the standard sigma-algebra for probability on $\mathbb{R}$.

The Borel sets on $\mathbb{R}^d$ are defined analogously: $\mathcal{B}(\mathbb{R}^d) = \sigma(\text{open sets in } \mathbb{R}^d)$.

Here is a Python implementation that verifies the sigma-algebra axioms on finite sets and computes generated sigma-algebras by closure:

def is_sigma_algebra(omega, F):
    """Verify whether F is a sigma-algebra on omega (finite case)."""
    omega_set = frozenset(omega)
    F_sets = {frozenset(s) for s in F}

    # Axiom 1: omega in F
    if omega_set not in F_sets:
        return False, "Omega not in F"

    # Axiom 2: closure under complements
    for A in F_sets:
        if omega_set - A not in F_sets:
            return False, f"Complement of {set(A)} not in F"

    # Axiom 3: closure under unions (pairwise suffices on a finite set:
    # every finite union is built from pairwise unions, and "countable"
    # adds nothing new when F itself is finite)
    for A in F_sets:
        for B in F_sets:
            if A | B not in F_sets:
                return False, f"Union {set(A)} ∪ {set(B)} not in F"

    return True, "Valid sigma-algebra"


def generate_sigma_algebra(omega, generators):
    """Smallest sigma-algebra containing the generators: repeatedly close
    under complements and pairwise unions until a fixed point is reached."""
    omega_set = frozenset(omega)
    F = {frozenset(), omega_set} | {frozenset(g) for g in generators}
    while True:
        new = {omega_set - A for A in F} | {A | B for A in F for B in F}
        if new <= F:
            return F
        F |= new

Hasse diagrams of three sigma-algebras on a three-element set, ordered by inclusion. The trivial sigma-algebra has 2 elements, the partial has 4, and the power set has 8.

Interactive Sigma-Algebra Explorer on Ω = {1, 2, 3, 4}: clicking elements toggles generators and displays the generated sigma-algebra — for example, the single generator {1, 2} yields |𝓕| = 4 of the 16 subsets in 2^Ω.

Measures and Probability Measures

Definition of a Measure

A sigma-algebra tells us which subsets are measurable. A measure tells us how big they are.

Definition 3 (Measure).

Let $(\Omega, \mathcal{F})$ be a measurable space. A measure is a function $\mu : \mathcal{F} \to [0, \infty]$ satisfying:

  1. $\mu(\emptyset) = 0$ (the empty set has zero measure),
  2. Countable additivity: if $A_1, A_2, \ldots \in \mathcal{F}$ are pairwise disjoint, then $\mu\left(\bigcup_{n=1}^\infty A_n\right) = \sum_{n=1}^\infty \mu(A_n)$.

The triple $(\Omega, \mathcal{F}, \mu)$ is a measure space.

Remark.

Countable additivity is the crucial axiom. Finite additivity ($\mu(A \cup B) = \mu(A) + \mu(B)$ for disjoint $A, B$) is too weak — it cannot guarantee that limits of measurable operations behave well.

Fundamental Properties

Proposition 1 (Monotonicity of measures).

If $A \subseteq B$, then $\mu(A) \leq \mu(B)$.

Proof.

Write $B = A \cup (B \setminus A)$ with $A$ and $B \setminus A$ disjoint. Then $\mu(B) = \mu(A) + \mu(B \setminus A) \geq \mu(A)$.

Proposition 2 (Continuity from below).

If $A_1 \subseteq A_2 \subseteq \cdots$ and $A = \bigcup_{n=1}^\infty A_n$, then $\mu(A) = \lim_{n \to \infty} \mu(A_n)$.

Proof.

Define $B_1 = A_1$ and $B_n = A_n \setminus A_{n-1}$ for $n \geq 2$. Then the $B_n$ are pairwise disjoint, $A = \bigsqcup B_n$, and $A_n = \bigsqcup_{k=1}^n B_k$. By countable additivity:

$$\mu(A) = \sum_{n=1}^\infty \mu(B_n) = \lim_{N \to \infty} \sum_{n=1}^N \mu(B_n) = \lim_{N \to \infty} \mu(A_N)$$

Proposition 3 (Continuity from above).

If $A_1 \supseteq A_2 \supseteq \cdots$, $\mu(A_1) < \infty$, and $A = \bigcap_{n=1}^\infty A_n$, then $\mu(A) = \lim_{n \to \infty} \mu(A_n)$.

Proof.

Apply continuity from below to $A_1 \setminus A_n \uparrow A_1 \setminus A$, then use $\mu(A_1 \setminus A_n) = \mu(A_1) - \mu(A_n)$ (valid since $\mu(A_1) < \infty$).

Proposition 4 (Inclusion–exclusion).

For any $A, B \in \mathcal{F}$ with $\mu(A \cap B) < \infty$:

$$\mu(A \cup B) = \mu(A) + \mu(B) - \mu(A \cap B)$$

Continuity from below and above illustrated with nested sets — the measure of the limit equals the limit of the measures.

Lebesgue Measure

Definition 4 (Lebesgue measure).

Lebesgue measure $\lambda$ on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ is the unique measure satisfying:

$$\lambda([a, b]) = b - a \quad \text{for all } a \leq b$$

Key properties:

  • Translation invariance: $\lambda(A + x) = \lambda(A)$ for all $x \in \mathbb{R}$.
  • Scaling: $\lambda(cA) = |c| \cdot \lambda(A)$.
  • Countable sets have measure zero: $\lambda(\mathbb{Q}) = 0$.
  • The Cantor set has measure zero but is uncountable.

The Cantor set is a remarkable object: it is closed, uncountable, has Lebesgue measure zero, and is totally disconnected. We construct it by iteratively removing middle thirds from $[0, 1]$. After $n$ steps, the total length removed is $\sum_{k=0}^{n-1} 2^k / 3^{k+1} = 1 - (2/3)^n$, which converges to $1$ — leaving a set of measure zero that still contains uncountably many points (every number in $[0, 1]$ with a ternary expansion using only digits 0 and 2).
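The removed-length computation can be sanity-checked numerically. A minimal standard-library sketch (the function name `cantor_removed_length` is chosen here for illustration), using the closed form $1 - (2/3)^n$ derived above:

```python
from fractions import Fraction

def cantor_removed_length(n):
    """Total length removed after n steps of the middle-thirds construction.

    At step k (0-indexed) we remove 2^k intervals, each of length 3^-(k+1).
    """
    return sum(Fraction(2**k, 3**(k + 1)) for k in range(n))

# The partial sums telescope to 1 - (2/3)^n, which converges to 1:
for n in [1, 2, 5, 10]:
    assert cantor_removed_length(n) == 1 - Fraction(2, 3)**n

print(float(cantor_removed_length(50)))  # very close to 1
```

Exact rational arithmetic via `Fraction` avoids any floating-point doubt about the telescoping identity.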

The Cantor set construction: iterative removal of middle thirds, with the total removed measure converging to 1.

Probability Measures and Kolmogorov’s Axioms

Definition 5 (Probability measure).

A probability measure is a measure $P$ on $(\Omega, \mathcal{F})$ with $P(\Omega) = 1$. The triple $(\Omega, \mathcal{F}, P)$ is a probability space.

Kolmogorov’s axioms (1933) are precisely the axioms for a probability measure:

  1. $P(A) \geq 0$ for all $A \in \mathcal{F}$ (non-negativity).
  2. $P(\Omega) = 1$ (normalization).
  3. $P(\bigsqcup A_n) = \sum P(A_n)$ for pairwise disjoint $(A_n)$ (countable additivity).

Every familiar probability distribution defines a probability measure on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$. The uniform distribution on $[0, 1]$ is simply Lebesgue measure restricted to $[0, 1]$. A Gaussian $N(\mu, \sigma^2)$ defines $P(A) = \int_A \frac{1}{\sigma\sqrt{2\pi}} e^{-(x-\mu)^2/(2\sigma^2)} \, dx$ for Borel sets $A$.
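For the Gaussian measure, normalization and (finite) additivity can be checked directly from the CDF, which the standard library exposes via the error function. A small sketch (`gaussian_prob` is an illustrative helper, not from the text):

```python
from math import erf, sqrt

def gaussian_cdf(x, mu=0.0, sigma=1.0):
    """CDF of N(mu, sigma^2), expressed via the error function."""
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

def gaussian_prob(a, b, mu=0.0, sigma=1.0):
    """P(X in (a, b]) for X ~ N(mu, sigma^2)."""
    return gaussian_cdf(b, mu, sigma) - gaussian_cdf(a, mu, sigma)

# Normalization: P(R) = 1 (approximated by a huge interval)
assert abs(gaussian_prob(-1e6, 1e6) - 1.0) < 1e-12

# Additivity on the disjoint intervals (-1, 0] and (0, 1]:
lhs = gaussian_prob(-1, 1)
rhs = gaussian_prob(-1, 0) + gaussian_prob(0, 1)
assert abs(lhs - rhs) < 1e-12
```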


Measurable Functions and Random Variables

Measurable Functions

Definition 6 (Measurable function).

Let $(\Omega, \mathcal{F})$ and $(S, \mathcal{S})$ be measurable spaces. A function $f : \Omega \to S$ is $(\mathcal{F}, \mathcal{S})$-measurable if the preimage of every measurable set is measurable:

$$f^{-1}(B) := \{\omega \in \Omega : f(\omega) \in B\} \in \mathcal{F} \quad \text{for all } B \in \mathcal{S}$$

Remark.

It suffices to check preimages of a generating collection. For $S = \mathbb{R}$ with $\mathcal{S} = \mathcal{B}(\mathbb{R})$, it is enough to verify $f^{-1}((-\infty, a]) \in \mathcal{F}$ for all $a \in \mathbb{R}$.

Proposition 5 (Preservation of measurability).

If $f$ and $g$ are measurable functions $\Omega \to \mathbb{R}$, then so are $f + g$, $fg$, $f/g$ (where $g \neq 0$), $\max(f, g)$, $\min(f, g)$, $|f|$, $f^+$, and $f^-$.

Proposition 6 (Limits of measurable functions).

If $f_1, f_2, \ldots$ are measurable, then $\sup_n f_n$, $\inf_n f_n$, $\limsup_n f_n$, and $\liminf_n f_n$ are all measurable. In particular, if $f_n \to f$ pointwise, then $f$ is measurable.

Proof.

For $\sup_n f_n$: we have $\{\sup_n f_n \leq a\} = \bigcap_{n=1}^\infty \{f_n \leq a\}$, which is a countable intersection of measurable sets. The remaining cases follow by combining this with $\inf_n f_n = -\sup_n(-f_n)$ and $\limsup_n f_n = \inf_n \sup_{k \geq n} f_k$.

Proposition 6 is one of the key advantages of measurable functions over continuous functions: pointwise limits of measurable functions are measurable, while pointwise limits of continuous functions need not be continuous.

Random Variables

Definition 7 (Random variable).

A random variable on a probability space $(\Omega, \mathcal{F}, P)$ is a measurable function $X : (\Omega, \mathcal{F}) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$.

This is the measure-theoretic formalization of “a quantity whose value depends on the outcome of a random experiment.” The measurability condition $X^{-1}(B) \in \mathcal{F}$ ensures that $P(X \in B)$ is well-defined for every Borel set $B$.

A random vector $X : \Omega \to \mathbb{R}^d$ is an $(\mathcal{F}, \mathcal{B}(\mathbb{R}^d))$-measurable map. Component-wise: $X = (X_1, \ldots, X_d)$ is a random vector if and only if each $X_i$ is a random variable.

Distributions and Independence

Definition 8 (Law / distribution / pushforward).

The distribution (or law) of a random variable $X$ is the probability measure $\mu_X$ on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ defined by:

$$\mu_X(B) = P(X \in B) = P(X^{-1}(B)) \quad \text{for all } B \in \mathcal{B}(\mathbb{R})$$

This is the pushforward of $P$ by $X$, written $\mu_X = P \circ X^{-1}$ or $X_\# P$.

The cumulative distribution function (CDF) $F_X(x) = P(X \leq x) = \mu_X((-\infty, x])$ uniquely determines $\mu_X$.
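The pushforward can be checked by simulation. A standard-library sketch of the example in the figure below — if $X \sim N(0,1)$, then $Y = X^2$ is $\chi^2(1)$, whose CDF is $P(|X| \leq \sqrt{y}) = \operatorname{erf}(\sqrt{y/2})$ (`chi2_1_cdf` is an illustrative name):

```python
import random
from math import erf, sqrt

random.seed(0)
n = 200_000
samples = [random.gauss(0, 1)**2 for _ in range(n)]  # Y = X^2, the pushforward law

def chi2_1_cdf(y):
    """CDF of chi-squared with 1 degree of freedom: P(X^2 <= y) = P(|X| <= sqrt(y))."""
    return erf(sqrt(y / 2))

# Empirical CDF of the pushed-forward samples matches the chi2(1) CDF:
for y in [0.5, 1.0, 2.0, 4.0]:
    empirical = sum(s <= y for s in samples) / n
    assert abs(empirical - chi2_1_cdf(y)) < 0.01
```

With 200,000 samples the Monte Carlo error at each checkpoint is on the order of $10^{-3}$, comfortably inside the tolerance.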

The pushforward measure: if X is standard normal, then Y = X² has a chi-squared(1) distribution. The transformation maps the density through the change-of-variables formula.

Definition 9 (Independence).

Events $A_1, \ldots, A_n \in \mathcal{F}$ are independent if:

$$P\left(\bigcap_{i \in S} A_i\right) = \prod_{i \in S} P(A_i) \quad \text{for every subset } S \subseteq \{1, \ldots, n\}$$

Random variables $X_1, \ldots, X_n$ are independent if the sigma-algebras $\sigma(X_1), \ldots, \sigma(X_n)$ are independent, where $\sigma(X_i) = \{X_i^{-1}(B) : B \in \mathcal{B}(\mathbb{R})\}$.

Equivalently, $X_1, \ldots, X_n$ are independent if and only if the joint CDF factors: $F_{X_1, \ldots, X_n}(x_1, \ldots, x_n) = F_{X_1}(x_1) \cdots F_{X_n}(x_n)$.

Remark.

Pairwise vs. mutual independence. Pairwise independence does not imply mutual independence. A classical counterexample: let $X, Y$ be independent Rademacher ($\pm 1$ with equal probability) and $Z = XY$. Then each pair is independent, but $\{X = 1, Y = 1, Z = 1\}$ has probability $1/4 \neq 1/8$.
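The Rademacher counterexample is finite, so it can be verified exhaustively with exact arithmetic (a self-contained sketch; `prob` is an illustrative helper):

```python
from fractions import Fraction
from itertools import product

# Sample space: four equally likely outcomes (x, y), with z = x * y
outcomes = [(x, y, x * y) for x, y in product([-1, 1], repeat=2)]
p = Fraction(1, 4)  # probability of each outcome

def prob(event):
    return sum(p for w in outcomes if event(w))

P_X1 = prob(lambda w: w[0] == 1)   # 1/2
P_Y1 = prob(lambda w: w[1] == 1)   # 1/2
P_Z1 = prob(lambda w: w[2] == 1)   # 1/2

# Pairwise independence holds:
assert prob(lambda w: w[0] == 1 and w[1] == 1) == P_X1 * P_Y1
assert prob(lambda w: w[0] == 1 and w[2] == 1) == P_X1 * P_Z1
assert prob(lambda w: w[1] == 1 and w[2] == 1) == P_Y1 * P_Z1

# ... but mutual independence fails: 1/4 on the left, 1/8 on the right.
triple = prob(lambda w: w[0] == 1 and w[1] == 1 and w[2] == 1)
assert triple == Fraction(1, 4)
assert P_X1 * P_Y1 * P_Z1 == Fraction(1, 8)
```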


The Lebesgue Integral and Expectation

Simple Functions and the Construction

The Lebesgue integral is built in three stages: simple functions → non-negative functions → general functions.

Definition 10 (Simple function).

A simple function is a measurable function $\phi : \Omega \to \mathbb{R}$ taking finitely many values. We can write:

$$\phi = \sum_{i=1}^n a_i \mathbf{1}_{A_i}$$

where $a_1, \ldots, a_n$ are distinct values and $A_i = \phi^{-1}(\{a_i\}) \in \mathcal{F}$.

Definition 11 (Lebesgue integral of simple functions).

For a non-negative simple function $\phi = \sum a_i \mathbf{1}_{A_i}$:

$$\int_\Omega \phi \, d\mu = \sum_{i=1}^n a_i \, \mu(A_i)$$

Definition 12 (Lebesgue integral of non-negative functions).

For a measurable $f : \Omega \to [0, \infty]$:

$$\int_\Omega f \, d\mu = \sup\left\{\int_\Omega \phi \, d\mu : 0 \leq \phi \leq f, \; \phi \text{ simple}\right\}$$

Definition 13 (Lebesgue integral of general functions).

For measurable $f : \Omega \to \mathbb{R}$, write $f = f^+ - f^-$ where $f^+ = \max(f, 0)$ and $f^- = \max(-f, 0)$. Then $f$ is integrable (written $f \in L^1(\mu)$) if both $\int f^+ \, d\mu < \infty$ and $\int f^- \, d\mu < \infty$, and:

$$\int_\Omega f \, d\mu = \int_\Omega f^+ \, d\mu - \int_\Omega f^- \, d\mu$$

Riemann vs. Lebesgue

The Riemann integral partitions the domain into small intervals and sums $f(x_i^*) \, \Delta x_i$. The Lebesgue integral partitions the range into small intervals and sums $y_i \cdot \mu(\{f \in [y_i, y_{i+1})\})$.

This “horizontal slicing” is why the Lebesgue integral handles limits better: it does not care about the geometric arrangement of the domain, only about the measure of level sets.
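Horizontal slicing can be made concrete via the layer-cake formula $\int f \, d\lambda = \int_0^\infty \lambda(\{f > t\}) \, dt$. For $f(x) = x^2$ on $[0, 1]$ the super-level sets have measure $\lambda(\{f > t\}) = 1 - \sqrt{t}$, so summing those measures recovers $\int_0^1 x^2 \, dx = 1/3$ (a standard-library sketch; the function name is illustrative):

```python
from math import sqrt

def lebesgue_integral_x_squared(n_levels=100_000):
    """Layer-cake / horizontal slicing for f(x) = x^2 on [0, 1]:
    sum the measures of the super-level sets {f > t}, here 1 - sqrt(t),
    over a midpoint grid of levels t in [0, 1]."""
    dt = 1.0 / n_levels
    return sum((1 - sqrt((i + 0.5) * dt)) * dt for i in range(n_levels))

approx = lebesgue_integral_x_squared()
assert abs(approx - 1/3) < 1e-6  # matches the vertical (Riemann) answer
```

No slicing of the domain is involved: only the measures of level sets enter, which is exactly what survives for highly irregular functions.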

Riemann integration partitions the domain (vertical slicing), while Lebesgue integration partitions the range (horizontal slicing). The Lebesgue approach handles irregular functions where Riemann fails.

The Monotone Convergence Theorem

Theorem 1 (Monotone Convergence Theorem (MCT)).

Let $0 \leq f_1 \leq f_2 \leq \cdots$ be measurable functions with $f_n \uparrow f$ pointwise. Then:

$$\int_\Omega f \, d\mu = \lim_{n \to \infty} \int_\Omega f_n \, d\mu$$

Proof.

Step 1. Since $f_n \leq f$ for all $n$, we have $\int f_n \, d\mu \leq \int f \, d\mu$, so $\lim_n \int f_n \, d\mu \leq \int f \, d\mu$.

Step 2. We need to show $\int f \, d\mu \leq \lim_n \int f_n \, d\mu$. Since $\int f \, d\mu$ is the supremum over simple functions $\phi \leq f$, it suffices to show that for any non-negative simple $\phi \leq f$, we have $\int \phi \, d\mu \leq \lim_n \int f_n \, d\mu$.

Step 3. Fix such a $\phi$ and let $0 < \alpha < 1$. Define $A_n = \{f_n \geq \alpha \phi\}$. On $\{\phi > 0\}$ we have $f_n \uparrow f \geq \phi > \alpha \phi$, so every such point eventually lies in $A_n$; on $\{\phi = 0\}$ the inequality $f_n \geq 0 = \alpha \phi$ holds trivially. Hence $A_n \uparrow \Omega$.

Step 4. Then $\int f_n \, d\mu \geq \int_{A_n} f_n \, d\mu \geq \alpha \int_{A_n} \phi \, d\mu$. By continuity from below (applied to the measure $\mu_\phi(A) = \int_A \phi \, d\mu$), as $n \to \infty$:

$$\lim_n \int f_n \, d\mu \geq \alpha \int_\Omega \phi \, d\mu$$

Since $0 < \alpha < 1$ was arbitrary, let $\alpha \uparrow 1$ to get $\lim_n \int f_n \, d\mu \geq \int \phi \, d\mu$. Taking the supremum over $\phi$ gives the result.
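The MCT can be illustrated numerically with $f(x) = 1/\sqrt{x}$ on $(0, 1]$ and the truncations $f_n = \min(f, n)$, for which the integrals are exactly $2 - 1/n \uparrow 2 = \int_0^1 x^{-1/2} \, dx$ (a standard-library sketch; `integral_fn` is an illustrative name and the midpoint rule is only an approximation):

```python
def integral_fn(n, cells=1_000_000):
    """Midpoint-rule integral of f_n(x) = min(n, 1/sqrt(x)) over (0, 1]."""
    h = 1.0 / cells
    return sum(min(n, ((i + 0.5) * h) ** -0.5) * h for i in range(cells))

values = [integral_fn(n) for n in (1, 2, 5, 10)]

# The integrals increase monotonically, matching the exact value 2 - 1/n,
# and converge to the integral of the (unbounded) limit function:
assert all(a < b for a, b in zip(values, values[1:]))
for n, v in zip((1, 2, 5, 10), values):
    assert abs(v - (2 - 1/n)) < 1e-3
```

Note that the limit $f$ is unbounded yet integrable; the truncations are exactly the kind of monotone approximation the theorem is built for.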

Fatou’s Lemma and the Dominated Convergence Theorem

Lemma 1 (Fatou's Lemma).

If $f_n \geq 0$ are measurable, then:

$$\int_\Omega \liminf_{n \to \infty} f_n \, d\mu \leq \liminf_{n \to \infty} \int_\Omega f_n \, d\mu$$

Proof.

Define $g_n = \inf_{k \geq n} f_k$. Then $g_n \uparrow \liminf f_n$ and $g_n \leq f_n$, so $\int g_n \, d\mu \leq \int f_n \, d\mu$. Apply the MCT to $(g_n)$:

$$\int \liminf f_n \, d\mu = \lim_n \int g_n \, d\mu = \liminf_n \int g_n \, d\mu \leq \liminf_n \int f_n \, d\mu$$

Theorem 2 (Dominated Convergence Theorem (DCT)).

Let $f_n \to f$ pointwise (or $\mu$-a.e.), and suppose there exists an integrable $g$ with $|f_n| \leq g$ for all $n$. Then $f$ is integrable and:

$$\lim_{n \to \infty} \int_\Omega f_n \, d\mu = \int_\Omega f \, d\mu$$

Proof.

Since $|f_n| \leq g$ and $f_n \to f$ pointwise, $|f| \leq g$, so $f \in L^1$. Apply Fatou’s lemma to $g + f_n \geq 0$:

$$\int g \, d\mu + \int f \, d\mu = \int (g + f) \, d\mu \leq \liminf \int (g + f_n) \, d\mu = \int g \, d\mu + \liminf \int f_n \, d\mu$$

So $\int f \leq \liminf \int f_n$. Similarly, applying Fatou to $g - f_n \geq 0$:

$$\int g \, d\mu - \int f \, d\mu \leq \int g \, d\mu + \liminf \int (-f_n) \, d\mu = \int g \, d\mu - \limsup \int f_n \, d\mu$$

So $\limsup \int f_n \leq \int f$. Together: $\int f \leq \liminf \int f_n \leq \limsup \int f_n \leq \int f$.

The DCT is the workhorse of probability theory. Whenever we want to exchange a limit and an integral — which happens constantly in proving convergence results — we look for a dominating function. Without one, the exchange can fail dramatically, as the next example shows.
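The classic failure mode is $f_n = n \cdot \mathbf{1}_{(0, 1/n)}$ on $(0, 1)$: every $f_n$ integrates to $1$, but $f_n \to 0$ pointwise, and no integrable dominating function exists. A standard-library sketch (`integral_fn` and `envelope_integral` are illustrative names):

```python
from fractions import Fraction

def integral_fn(n):
    """Exact integral of f_n = n * 1_{(0, 1/n)} over (0, 1): height x width."""
    return n * Fraction(1, n)

# Every f_n integrates to 1, yet f_n -> 0 pointwise (for fixed x > 0,
# f_n(x) = 0 once n >= 1/x), so the integral of the limit is 0:
assert all(integral_fn(n) == 1 for n in range(1, 100))

# There is no integrable dominating function: the pointwise envelope
# sup_n f_n equals k on (roughly) each interval (1/(k+1), 1/k), and its
# integral grows like the harmonic series.
def envelope_integral(N):
    return sum(k * (1/k - 1/(k + 1)) for k in range(1, N))  # = sum of 1/(k+1)

assert envelope_integral(10_000) > 8  # ~ ln(N): diverges as N -> infinity
```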

Dominated Convergence Theorem in action: with a dominating function, the integral of the limit equals the limit of the integrals. Without one, the integral can diverge.

Expectation and $L^p$ Spaces

Definition 14 (Expectation).

The expectation of a random variable $X$ on $(\Omega, \mathcal{F}, P)$ is:

$$E[X] = \int_\Omega X \, dP$$

provided the integral exists. The variance is $\text{Var}(X) = E[(X - E[X])^2] = E[X^2] - (E[X])^2$.

Definition 15 ($L^p$ space).

For $1 \leq p < \infty$, the space $L^p(\Omega, \mathcal{F}, \mu)$ consists of all measurable $f$ with $\int |f|^p \, d\mu < \infty$, with norm:

$$\|f\|_p = \left(\int_\Omega |f|^p \, d\mu\right)^{1/p}$$

For $p = \infty$: $\|f\|_\infty = \inf\{M : \mu(\{|f| > M\}) = 0\}$ (essential supremum).

Theorem 3 ($L^p$ is a Banach space).

$L^p$ (with functions identified up to $\mu$-a.e. equality) is a complete normed space.

Theorem 4 (Hölder's inequality).

If $1/p + 1/q = 1$ with $1 \leq p, q \leq \infty$, then:

$$\int_\Omega |fg| \, d\mu \leq \|f\|_p \|g\|_q$$

The case $p = q = 2$ is the Cauchy–Schwarz inequality: $|E[XY]| \leq \sqrt{E[X^2]} \sqrt{E[Y^2]}$. The space $L^2$ is a Hilbert space with inner product $\langle f, g \rangle = \int fg \, d\mu$.
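Hölder and Cauchy–Schwarz hold exactly for any empirical (sample) measure, so they can be spot-checked on random data. A standard-library sketch (`Lp_norm` is an illustrative helper; the inequalities below are guaranteed, not statistical):

```python
import random
from math import fsum

random.seed(1)
n = 10_000
X = [random.gauss(0, 1) for _ in range(n)]
Y = [random.gauss(0, 1) for _ in range(n)]

def Lp_norm(Z, p):
    """Empirical L^p norm (E|Z|^p)^(1/p) under the uniform measure on samples."""
    return (fsum(abs(z)**p for z in Z) / len(Z)) ** (1 / p)

E_XY = abs(fsum(x * y for x, y in zip(X, Y)) / n)
E_absXY = fsum(abs(x * y) for x, y in zip(X, Y)) / n

# Cauchy-Schwarz (p = q = 2):
assert E_XY <= Lp_norm(X, 2) * Lp_norm(Y, 2)

# A general Hoelder pair, p = 3 and q = 3/2 (so 1/p + 1/q = 1):
assert E_absXY <= Lp_norm(X, 3) * Lp_norm(Y, 1.5)
```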


Convergence of Random Variables

The four modes of convergence — almost sure, in probability, in $L^p$, and in distribution — form a hierarchy that is central to asymptotic statistics and the theoretical foundations of machine learning.

The Four Modes

Definition 16 (Almost sure convergence).

$X_n \xrightarrow{\text{a.s.}} X$ if:

$$P\left(\omega : X_n(\omega) \to X(\omega)\right) = 1$$

That is, for almost every outcome $\omega$, the sequence of numbers $X_1(\omega), X_2(\omega), \ldots$ converges to $X(\omega)$.

Definition 17 (Convergence in probability).

$X_n \xrightarrow{P} X$ if for every $\varepsilon > 0$:

$$\lim_{n \to \infty} P(|X_n - X| > \varepsilon) = 0$$

Definition 18 (Convergence in $L^p$).

$X_n \xrightarrow{L^p} X$ if:

$$\lim_{n \to \infty} E[|X_n - X|^p] = 0$$

Definition 19 (Convergence in distribution).

$X_n \xrightarrow{d} X$ if $F_{X_n}(x) \to F_X(x)$ at every continuity point $x$ of $F_X$.

The Hierarchy

The implications between these modes form two chains:

$$L^p \implies \text{in probability} \implies \text{in distribution}$$

$$\text{a.s.} \implies \text{in probability} \implies \text{in distribution}$$

And the converses are generally false, with important exceptions.

Theorem 5 (L^p implies convergence in probability).

$$X_n \xrightarrow{L^p} X \implies X_n \xrightarrow{P} X$$

Proof.

By Markov’s inequality: $P(|X_n - X| > \varepsilon) \leq \frac{E[|X_n - X|^p]}{\varepsilon^p} \to 0$.

Theorem 6 (Almost sure implies convergence in probability).

$$X_n \xrightarrow{\text{a.s.}} X \implies X_n \xrightarrow{P} X$$

Proof.

Fix $\varepsilon > 0$ and define $A_n = \{|X_n - X| > \varepsilon\}$. Almost sure convergence means the event $\limsup_n A_n$ — that $|X_n - X| > \varepsilon$ infinitely often — has probability zero. By continuity from above, $P(A_n) \leq P\left(\bigcup_{k \geq n} A_k\right) \to P(\limsup_n A_n) = 0$.

Theorem 7 (Convergence in probability implies convergence in distribution).

$$X_n \xrightarrow{P} X \implies X_n \xrightarrow{d} X$$

Proof.

For any continuity point $x$ of $F_X$ and $\varepsilon > 0$:

$$F_{X_n}(x) = P(X_n \leq x) \leq P(X \leq x + \varepsilon) + P(|X_n - X| > \varepsilon)$$

Letting $n \to \infty$ then $\varepsilon \downarrow 0$ gives $\limsup F_{X_n}(x) \leq F_X(x)$. A similar lower bound gives the result.

Counterexamples

The converses fail in illuminating ways:

  • In probability does not imply a.s.: The “typewriter sequence” is the classic counterexample. Consider $[0, 1]$ with Lebesgue measure, and define $f_n = \mathbf{1}_{[k/m, (k+1)/m]}$ where $n$ enumerates the pairs $(m, k)$, $0 \leq k < m$, cycling through intervals of decreasing width. Then $f_n \to 0$ in probability (the interval width shrinks), but for every $\omega \in [0, 1]$, infinitely many $f_n(\omega) = 1$.

  • In distribution does not imply in probability: Let $X \sim N(0,1)$ and $Y_n = -X$. Then $Y_n \xrightarrow{d} X$ (since $-X \sim N(0,1)$ too), but $P(|Y_n - X| > 1) = P(|2X| > 1) > 0$ for all $n$.
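The typewriter sequence is easy to enumerate explicitly. A standard-library sketch (the `typewriter` helper is an illustrative name) verifying both halves of the claim — shrinking interval widths, yet every fixed point hit infinitely often:

```python
from fractions import Fraction

def typewriter(n):
    """n-th typewriter interval: enumerate pairs (m, k), k = 0..m-1,
    cycling through intervals of width 1/m."""
    m, count = 1, 0
    while count + m <= n:
        count += m
        m += 1
    k = n - count
    return Fraction(k, m), Fraction(k + 1, m)  # f_n = 1 on [k/m, (k+1)/m]

# Convergence in probability: the measure of {f_n = 1} is 1/m -> 0.
widths = [b - a for a, b in (typewriter(n) for n in range(5000, 5050))]
assert all(w < Fraction(1, 99) for w in widths)

# Failure of a.s. convergence: a fixed point, say x = 1/2, lands inside
# the interval at least once per cycle through each width 1/m.
x = Fraction(1, 2)
hits = sum(1 for n in range(10_000)
           if typewriter(n)[0] <= x <= typewriter(n)[1])
assert hits >= 100  # infinitely many hits in the limit
```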

The four modes of convergence illustrated: almost sure convergence shows paths settling down, convergence in probability shows the probability of large deviations shrinking, the SLLN shows running averages converging, and the CLT shows histograms approaching the bell curve.

The typewriter sequence: the indicator function cycles through intervals of decreasing width, converging in probability to zero but failing to converge almost surely at any point.

The Laws of Large Numbers

Theorem 8 (Weak Law of Large Numbers (WLLN)).

Let $X_1, X_2, \ldots$ be i.i.d. with $E[X_1] = \mu$ and $\text{Var}(X_1) = \sigma^2 < \infty$. Then:

$$\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i \xrightarrow{P} \mu$$

Proof.

By Chebyshev’s inequality: $P(|\bar{X}_n - \mu| > \varepsilon) \leq \frac{\text{Var}(\bar{X}_n)}{\varepsilon^2} = \frac{\sigma^2}{n\varepsilon^2} \to 0$.

Theorem 9 (Strong Law of Large Numbers (SLLN)).

Let $X_1, X_2, \ldots$ be i.i.d. with $E[|X_1|] < \infty$ and $E[X_1] = \mu$. Then:

$$\bar{X}_n \xrightarrow{\text{a.s.}} \mu$$

The SLLN is strictly stronger than the WLLN: it requires only a finite first moment (not second moment), and the convergence is almost sure. The proof uses the fourth-moment method or truncation arguments and is considerably more involved than the WLLN proof.
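The LLN is easy to observe by simulation. A standard-library sketch with $X_i \sim \text{Uniform}(0,1)$, $\mu = 1/2$ (`running_average_gap` is an illustrative name; this demonstrates, it does not prove):

```python
import random

random.seed(42)

def running_average_gap(n, trials=50):
    """Worst deviation |X_bar_n - mu| across independent runs of length n,
    for X_i ~ Uniform(0, 1) with mu = 1/2."""
    return max(abs(sum(random.random() for _ in range(n)) / n - 0.5)
               for _ in range(trials))

# Deviations shrink as n grows (at rate ~ 1/sqrt(n), per the CLT below):
assert running_average_gap(100_000) < 0.01
```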

The Central Limit Theorem

Theorem 10 (Central Limit Theorem (CLT)).

Let $X_1, X_2, \ldots$ be i.i.d. with $E[X_1] = \mu$ and $\text{Var}(X_1) = \sigma^2 \in (0, \infty)$. Then:

$$\frac{\bar{X}_n - \mu}{\sigma / \sqrt{n}} = \frac{\sum_{i=1}^n (X_i - \mu)}{\sigma \sqrt{n}} \xrightarrow{d} N(0, 1)$$

The CLT is the deepest result in elementary probability. Its measure-theoretic proof uses characteristic functions — one shows $\varphi_{Z_n}(t) \to e^{-t^2/2}$ pointwise for the standardized sums $Z_n$ — combined with Lévy’s continuity theorem, which converts pointwise convergence of characteristic functions into convergence in distribution.

Notice the distinction: the SLLN gives almost sure convergence of $\bar{X}_n$ to $\mu$ (a constant), while the CLT gives convergence in distribution of the rescaled fluctuations $\sqrt{n}(\bar{X}_n - \mu)/\sigma$ to a Gaussian. These are complementary, not competing, results.
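A quick simulation of the CLT for uniform summands, comparing the empirical CDF of the standardized sums against the standard normal CDF $\Phi$ (a standard-library sketch; sample sizes and tolerances are illustrative choices):

```python
import random
from math import erf, sqrt

random.seed(7)
n, reps = 400, 20_000
sigma = sqrt(1 / 12)  # standard deviation of Uniform(0, 1)

# Standardized sums Z = sqrt(n) * (X_bar - mu) / sigma
Z = [(sum(random.random() for _ in range(n)) / n - 0.5) * sqrt(n) / sigma
     for _ in range(reps)]

def Phi(x):
    """Standard normal CDF."""
    return 0.5 * (1 + erf(x / sqrt(2)))

# The empirical CDF of Z matches Phi at several checkpoints:
for x in (-1.0, 0.0, 1.0):
    empirical = sum(z <= x for z in Z) / reps
    assert abs(empirical - Phi(x)) < 0.02
```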


Product Measures and Fubini’s Theorem

Given measurable spaces $(\Omega_1, \mathcal{F}_1)$ and $(\Omega_2, \mathcal{F}_2)$, the product sigma-algebra $\mathcal{F}_1 \otimes \mathcal{F}_2$ is the sigma-algebra on $\Omega_1 \times \Omega_2$ generated by the measurable rectangles $\{A_1 \times A_2 : A_1 \in \mathcal{F}_1, A_2 \in \mathcal{F}_2\}$.

Theorem 11 (Product Measure).

If $(\Omega_1, \mathcal{F}_1, \mu_1)$ and $(\Omega_2, \mathcal{F}_2, \mu_2)$ are $\sigma$-finite measure spaces, there exists a unique measure $\mu_1 \otimes \mu_2$ on $(\Omega_1 \times \Omega_2, \mathcal{F}_1 \otimes \mathcal{F}_2)$ satisfying:

$$(\mu_1 \otimes \mu_2)(A_1 \times A_2) = \mu_1(A_1) \cdot \mu_2(A_2)$$

For probability spaces, this gives the joint distribution of independent random variables: if $X$ and $Y$ are independent with laws $\mu_X$ and $\mu_Y$, then $(X, Y)$ has law $\mu_X \otimes \mu_Y$.

Theorem 12 (Tonelli's Theorem).

If $f : \Omega_1 \times \Omega_2 \to [0, \infty]$ is $(\mathcal{F}_1 \otimes \mathcal{F}_2)$-measurable and $\mu_1, \mu_2$ are $\sigma$-finite, then:

$$\int_{\Omega_1 \times \Omega_2} f \, d(\mu_1 \otimes \mu_2) = \int_{\Omega_1}\left(\int_{\Omega_2} f(\omega_1, \omega_2) \, d\mu_2\right) d\mu_1 = \int_{\Omega_2}\left(\int_{\Omega_1} f(\omega_1, \omega_2) \, d\mu_1\right) d\mu_2$$

Theorem 13 (Fubini's Theorem).

If additionally $f$ is integrable (i.e., $\int |f| \, d(\mu_1 \otimes \mu_2) < \infty$), then the same iterated-integral equalities hold for signed $f$.

Remark.

Tonelli works for non-negative functions without integrability assumptions. Fubini requires integrability but allows signed functions. The standard workflow: use Tonelli to check $\int |f| \, d(\mu_1 \otimes \mu_2) < \infty$, then apply Fubini.

Probabilistic consequence. For independent random variables $X, Y$ with densities $f_X, f_Y$:

$$E[g(X, Y)] = \iint g(x, y) \, f_X(x) f_Y(y) \, dx \, dy$$

and the order of integration can be swapped freely. This is why we can factor joint expectations of independent variables: $E[XY] = E[X]E[Y]$.

import numpy as np
from scipy import integrate

# Verify Fubini: ∫∫ x²e^{-y} dx dy over [0,1]×[0,∞)
# Iterated integral 1: ∫₀¹ x² dx · ∫₀^∞ e^{-y} dy = (1/3)(1) = 1/3
result_1 = integrate.dblquad(lambda y, x: x**2 * np.exp(-y), 0, 1, 0, np.inf)

# Iterated integral 2 (reversed order)
result_2 = integrate.dblquad(lambda x, y: x**2 * np.exp(-y), 0, np.inf, 0, 1)

print(f"Order 1: {result_1[0]:.6f}")  # 0.333333
print(f"Order 2: {result_2[0]:.6f}")  # 0.333333

Conditional Expectation and Radon–Nikodym

Absolute Continuity and the Radon–Nikodym Theorem

Definition 20 (Absolute continuity).

A measure $\nu$ is absolutely continuous with respect to $\mu$ (written $\nu \ll \mu$) if $\mu(A) = 0 \implies \nu(A) = 0$ for all $A \in \mathcal{F}$.

Theorem 14 (Radon–Nikodym Theorem).

Let $\mu$ and $\nu$ be $\sigma$-finite measures on $(\Omega, \mathcal{F})$ with $\nu \ll \mu$. Then there exists a measurable function $f : \Omega \to [0, \infty)$, unique $\mu$-a.e., such that:

$$\nu(A) = \int_A f \, d\mu \quad \text{for all } A \in \mathcal{F}$$

The function $f$ is the Radon–Nikodym derivative $\frac{d\nu}{d\mu}$.

Proof.

The proof (due to von Neumann) uses the Riesz Representation Theorem on $L^2(\mu + \nu)$. The functional $\Lambda(g) = \int g \, d\nu$ is bounded on $L^2(\mu + \nu)$, so by Riesz there exists $h$ with $\int g \, d\nu = \int gh \, d(\mu + \nu)$ for all $g$. Taking $g = \mathbf{1}_A$ and rearranging yields $f = h / (1 - h)$.

If $X$ has density $f_X$ with respect to Lebesgue measure $\lambda$, then $\mu_X \ll \lambda$ with $\frac{d\mu_X}{d\lambda} = f_X$. The probability density function is a Radon–Nikodym derivative.

Application to finance. The Radon–Nikodym theorem enables change of measure — the foundation of risk-neutral pricing. If $P$ is the real-world measure and $Q \ll P$ is the risk-neutral measure:

$$E_Q[X] = E_P\left[\frac{dQ}{dP} X\right]$$

This is the mathematical core of the Fundamental Theorem of Asset Pricing.
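As a quick numerical sanity check of the change-of-measure identity, here is a Monte Carlo sketch under an assumed toy model (not from the text): under PP, XX is standard normal; under QQ, XX is normal with mean theta; the Radon–Nikodym derivative is then exp(theta·x − theta²/2).

```python
import numpy as np

# Monte Carlo sketch of E_Q[X] = E_P[(dQ/dP) X], under an assumed toy model:
# under P, X ~ N(0, 1); under Q, X ~ N(theta, 1), so the Radon-Nikodym
# derivative is dQ/dP(x) = exp(theta*x - theta^2/2).
rng = np.random.default_rng(0)
theta = 0.5
n = 1_000_000

x_p = rng.standard_normal(n)                  # samples drawn under P
weights = np.exp(theta * x_p - theta**2 / 2)  # dQ/dP evaluated at each sample

lhs = theta                                   # E_Q[X], known in closed form
rhs = np.mean(weights * x_p)                  # E_P[(dQ/dP) X], estimated under P

print(f"E_Q[X] exact:    {lhs:.4f}")
print(f"E_P[(dQ/dP) X]:  {rhs:.4f}")
print(f"E_P[dQ/dP]:      {np.mean(weights):.4f}")  # close to 1: Q is a probability measure
```

This reweighting is exactly importance sampling: expectations under QQ are computed from samples drawn under PP.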

Conditional Expectation

The measure-theoretic definition of conditional expectation is one of the deepest ideas in probability. We cannot define E[XG]E[X \mid \mathcal{G}] as a single number — it is a random variable that is G\mathcal{G}-measurable, capturing the “best prediction of XX given the information in G\mathcal{G}.”

Definition 21 (Conditional expectation).

Let XL1(Ω,F,P)X \in L^1(\Omega, \mathcal{F}, P) and let GF\mathcal{G} \subseteq \mathcal{F} be a sub-sigma-algebra. The conditional expectation E[XG]E[X \mid \mathcal{G}] is the (a.s. unique) G\mathcal{G}-measurable random variable satisfying:

GE[XG]dP=GXdPfor all GG\int_G E[X \mid \mathcal{G}] \, dP = \int_G X \, dP \quad \text{for all } G \in \mathcal{G}

Existence. This follows from the Radon–Nikodym theorem. Define ν(G)=GXdP\nu(G) = \int_G X \, dP on G\mathcal{G}. Then νPG\nu \ll P|_\mathcal{G}, and E[XG]=dνdPGE[X \mid \mathcal{G}] = \frac{d\nu}{dP|_\mathcal{G}}.
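The defining property is easy to verify by hand on a finite space. The following sketch uses an illustrative toy example (not from the text): two fair dice, X=D1+D2X = D_1 + D_2, and G=σ(D1)\mathcal{G} = \sigma(D_1), where the conditional expectation is D1+3.5D_1 + 3.5.

```python
import itertools

# Finite check of the defining property of conditional expectation (toy
# example): Omega = two fair dice, X = D1 + D2, G = sigma(D1), so
# E[X | G] = D1 + 3.5.  On each generating set {D1 = k}, the integral of
# E[X | G] must equal the integral of X.
omega = list(itertools.product(range(1, 7), repeat=2))  # 36 equally likely outcomes
p = 1 / 36

checks = []
for k in range(1, 7):
    G_k = [(d1, d2) for (d1, d2) in omega if d1 == k]
    int_X = sum((d1 + d2) * p for (d1, d2) in G_k)    # integral of X over {D1 = k}
    int_ce = sum((d1 + 3.5) * p for (d1, d2) in G_k)  # integral of E[X|G] over {D1 = k}
    checks.append((int_X, int_ce))
    print(f"D1={k}:  int X dP = {int_X:.4f},  int E[X|G] dP = {int_ce:.4f}")
```

Since every set in σ(D1)\sigma(D_1) is a union of the sets {D1=k}\{D_1 = k\}, checking the generators suffices.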

Properties of Conditional Expectation

Proposition 7 (Properties of conditional expectation).

Let X,YL1X, Y \in L^1 and G,H\mathcal{G}, \mathcal{H} be sub-sigma-algebras with HGF\mathcal{H} \subseteq \mathcal{G} \subseteq \mathcal{F}.

  1. Linearity: E[aX+bYG]=aE[XG]+bE[YG]E[aX + bY \mid \mathcal{G}] = aE[X \mid \mathcal{G}] + bE[Y \mid \mathcal{G}].
  2. Tower property: E[E[XG]H]=E[XH]E\bigl[E[X \mid \mathcal{G}] \bigm| \mathcal{H}\bigr] = E[X \mid \mathcal{H}].
  3. Taking out what is known: If YY is G\mathcal{G}-measurable and XYL1XY \in L^1, then E[XYG]=YE[XG]E[XY \mid \mathcal{G}] = Y \cdot E[X \mid \mathcal{G}].
  4. Independence: If XX is independent of G\mathcal{G}, then E[XG]=E[X]E[X \mid \mathcal{G}] = E[X].
  5. Trivial conditioning: E[X{,Ω}]=E[X]E[X \mid \{\emptyset, \Omega\}] = E[X].
  6. Full conditioning: E[XF]=XE[X \mid \mathcal{F}] = X.
  7. Jensen’s inequality: If φ\varphi is convex, then φ(E[XG])E[φ(X)G]\varphi(E[X \mid \mathcal{G}]) \leq E[\varphi(X) \mid \mathcal{G}].
Proof (Tower property).

We must show that E[XH]E[X \mid \mathcal{H}] satisfies the defining property for E[E[XG]H]E[E[X \mid \mathcal{G}] \mid \mathcal{H}]. For any HHH \in \mathcal{H}:

Since HG\mathcal{H} \subseteq \mathcal{G}, we have HGH \in \mathcal{G}, so by the definition of E[XG]E[X \mid \mathcal{G}]:

HE[XG]dP=HXdP\int_H E[X \mid \mathcal{G}] \, dP = \int_H X \, dP

And by the definition of E[XH]E[X \mid \mathcal{H}]: HE[XH]dP=HXdP\int_H E[X \mid \mathcal{H}] \, dP = \int_H X \, dP.

So HE[XG]dP=HE[XH]dP\int_H E[X \mid \mathcal{G}] \, dP = \int_H E[X \mid \mathcal{H}] \, dP for all HHH \in \mathcal{H}, which means E[XH]E[X \mid \mathcal{H}] satisfies the defining property for E[E[XG]H]E[E[X \mid \mathcal{G}] \mid \mathcal{H}]. By a.s. uniqueness, they are equal.
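The tower property can also be checked numerically. The sketch below uses an illustrative toy example (not from the text): two fair dice, with H=σ({D13})\mathcal{H} = \sigma(\{D_1 \leq 3\}) coarser than G=σ(D1)\mathcal{G} = \sigma(D_1); on a finite uniform space, conditional expectation is just a within-group mean.

```python
import numpy as np

# Tower property check on a finite space (toy example): two fair dice,
# X = D1 + D2, with H = sigma({D1 <= 3}) coarser than G = sigma(D1).
omega = [(d1, d2) for d1 in range(1, 7) for d2 in range(1, 7)]

def cond_exp(values, labels):
    """Pointwise E[values | sigma(labels)] on a finite uniform space."""
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels)
    out = np.empty(len(values))
    for lab in np.unique(labels):
        mask = labels == lab
        out[mask] = values[mask].mean()  # within-group average
    return out

X = [d1 + d2 for (d1, d2) in omega]
G_labels = [d1 for (d1, d2) in omega]       # generates sigma(D1)
H_labels = [d1 <= 3 for (d1, d2) in omega]  # generates the coarser sigma-algebra

lhs = cond_exp(cond_exp(X, G_labels), H_labels)  # E[E[X|G] | H]
rhs = cond_exp(X, H_labels)                      # E[X | H]

print("max |E[E[X|G]|H] - E[X|H]| =", np.max(np.abs(lhs - rhs)))
```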

Conditional Expectation as L2L^2 Projection

Here is the connection that ties conditional expectation to the Linear Algebra track. The conditional expectation E[XG]E[X \mid \mathcal{G}] is the orthogonal projection of XX onto L2(Ω,G,P)L^2(\Omega, \mathcal{G}, P) — the subspace of G\mathcal{G}-measurable square-integrable random variables.

Theorem 15 (L^2 projection characterization).

If XL2(Ω,F,P)X \in L^2(\Omega, \mathcal{F}, P), then E[XG]E[X \mid \mathcal{G}] is the unique element YL2(G)Y \in L^2(\mathcal{G}) minimizing:

E[(XY)2]=XY22E[(X - Y)^2] = \|X - Y\|_2^2

Proof.

We verify the orthogonality condition. For any ZL2(G)Z \in L^2(\mathcal{G}):

XE[XG],Z=E[(XE[XG])Z]=E[XZ]E[E[XG]Z]\langle X - E[X \mid \mathcal{G}], Z \rangle = E[(X - E[X \mid \mathcal{G}])Z] = E[XZ] - E[E[X \mid \mathcal{G}] \cdot Z]

By the “taking out what is known” property: E[E[XG]Z]=E[E[XZG]]=E[XZ]E[E[X \mid \mathcal{G}] \cdot Z] = E[E[XZ \mid \mathcal{G}]] = E[XZ] (using the tower property at the last step). So XE[XG],Z=0\langle X - E[X \mid \mathcal{G}], Z \rangle = 0 — the residual is orthogonal to every G\mathcal{G}-measurable function. Combined with the Pythagorean identity XY22=XE[XG]22+E[XG]Y22\|X - Y\|_2^2 = \|X - E[X \mid \mathcal{G}]\|_2^2 + \|E[X \mid \mathcal{G}] - Y\|_2^2, valid for any YL2(G)Y \in L^2(\mathcal{G}), this shows that E[XG]E[X \mid \mathcal{G}] is the unique minimizer of the squared error.

This connects directly to PCA: PCA projects data onto a low-dimensional subspace that minimizes mean squared error. Conditional expectation is the infinite-dimensional analog — projecting onto the subspace of functions measurable with respect to a sub-sigma-algebra.

For jointly normal (X,Y)(X, Y) with correlation ρ\rho, the conditional expectation takes the familiar regression form: E[YX=x]=μY+ρσYσX(xμX)E[Y \mid X = x] = \mu_Y + \rho \frac{\sigma_Y}{\sigma_X}(x - \mu_X). The residual variance is σY2(1ρ2)\sigma_Y^2(1 - \rho^2), so conditioning reduces the variance by the fraction ρ2\rho^2 — exactly the fraction of variance “explained” by XX.

Conditional expectation as L² projection: the left panel shows a bivariate normal scatter with the regression line (the conditional expectation), and the right panel shows the MSE curve with its minimum at the optimal slope.

import numpy as np

# Conditional expectation as L² projection for jointly normal (X, Y)
np.random.seed(42)
n = 5000
rho = 0.7
mu_x, mu_y, sigma_x, sigma_y = 2.0, 3.0, 1.0, 1.0

# Generate bivariate normal
Z1, Z2 = np.random.randn(n), np.random.randn(n)
X = mu_x + sigma_x * Z1
Y = mu_y + sigma_y * (rho * Z1 + np.sqrt(1 - rho**2) * Z2)

# E[Y | X = x] = mu_y + rho * (sigma_y / sigma_x) * (x - mu_x)
slope = rho * sigma_y / sigma_x
Y_hat = mu_y + slope * (X - mu_x)

# Verify: MSE is minimized at the conditional expectation
mse_optimal = np.mean((Y - Y_hat)**2)
mse_mean_only = np.mean((Y - mu_y)**2)
variance_reduction = 1 - mse_optimal / mse_mean_only

print(f"MSE (conditional): {mse_optimal:.4f}")    # ≈ σ_y²(1-ρ²) = 0.51
print(f"MSE (unconditional): {mse_mean_only:.4f}")  # ≈ σ_y² = 1.0
print(f"Variance reduction: {variance_reduction:.4f}")  # ≈ ρ² = 0.49

A Preview of Martingales

Filtrations and Adapted Processes

Definition 22 (Filtration).

A filtration on (Ω,F)(\Omega, \mathcal{F}) is an increasing sequence of sub-sigma-algebras:

F0F1F2F\mathcal{F}_0 \subseteq \mathcal{F}_1 \subseteq \mathcal{F}_2 \subseteq \cdots \subseteq \mathcal{F}

Each Fn\mathcal{F}_n represents the information available at time nn. The filtration models the flow of information over time.

Definition 23 (Adapted process).

A sequence of random variables (Xn)n0(X_n)_{n \geq 0} is adapted to a filtration (Fn)(\mathcal{F}_n) if XnX_n is Fn\mathcal{F}_n-measurable for each nn. In words: at time nn, we can observe XnX_n (it depends only on information available at time nn).

The natural filtration of (Xn)(X_n) is Fn=σ(X0,X1,,Xn)\mathcal{F}_n = \sigma(X_0, X_1, \ldots, X_n) — the sigma-algebra generated by the first n+1n+1 observations. This is the minimal filtration to which (Xn)(X_n) is adapted.

Martingales

Definition 24 (Martingale).

An adapted, integrable process (Mn)n0(M_n)_{n \geq 0} is a martingale with respect to (Fn)(\mathcal{F}_n) if:

E[Mn+1Fn]=Mnfor all n0E[M_{n+1} \mid \mathcal{F}_n] = M_n \quad \text{for all } n \geq 0

If \leq replaces ==, we have a supermartingale (expected to decrease). If \geq, a submartingale (expected to increase).

The martingale condition says: given everything we know now, our best prediction of tomorrow’s value is today’s value. The process has “no drift” — no systematic tendency to increase or decrease.

Examples

Random walk. Let Z1,Z2,Z_1, Z_2, \ldots be i.i.d. with E[Zi]=0E[Z_i] = 0. Then Mn=i=1nZiM_n = \sum_{i=1}^n Z_i is a martingale:

E[Mn+1Fn]=E[Mn+Zn+1Fn]=Mn+E[Zn+1Fn]=Mn+E[Zn+1]=MnE[M_{n+1} \mid \mathcal{F}_n] = E[M_n + Z_{n+1} \mid \mathcal{F}_n] = M_n + E[Z_{n+1} \mid \mathcal{F}_n] = M_n + E[Z_{n+1}] = M_n

using the “taking out what is known” property and independence.
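An empirical sketch of this calculation (illustrative, not from the text): because the increment Zn+1Z_{n+1} is independent of the past, conditioning on the current value MnM_n alone captures E[Mn+1Fn]E[M_{n+1} \mid \mathcal{F}_n] for this walk, so we can group simulated paths by MnM_n and compare group means of Mn+1M_{n+1}.

```python
import numpy as np

# Empirical martingale check for a symmetric +/-1 random walk: grouping paths
# by the value M_5 = k, the sample mean of M_6 should be close to k.
rng = np.random.default_rng(1)
n_paths, n_steps = 200_000, 10
steps = rng.choice([-1, 1], size=(n_paths, n_steps))
M = np.cumsum(steps, axis=1)       # M[:, j] holds M_{j+1}

n = 5                              # condition on M_5 (column index n - 1)
checks = []
for k in (-3, -1, 1, 3):
    mask = M[:, n - 1] == k
    emp = M[mask, n].mean()        # sample mean of M_6 on the event {M_5 = k}
    checks.append((k, emp))
    print(f"E[M_6 | M_5 = {k:+d}] ~ {emp:+.3f}   ({mask.sum()} paths)")
```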

Pólya urn. Start with 1 red and 1 blue ball. At each step, draw a ball, then replace it with 2 balls of the same color. Let MnM_n = fraction of red balls after nn draws. Then (Mn)(M_n) is a martingale — and by the martingale convergence theorem, MnM_n converges almost surely to a Beta(1,1)=Uniform(0,1)\text{Beta}(1, 1) = \text{Uniform}(0, 1) random variable.
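The uniform limit of the Pólya urn is easy to see by simulation. The following is an illustrative sketch: run many independent urns and compare the empirical mean and variance of the long-run red fraction with those of Uniform(0,1)\text{Uniform}(0, 1).

```python
import numpy as np

# Simulate many Polya urns (each starting with 1 red, 1 blue) and check that
# the long-run red fraction looks Uniform(0, 1), as the martingale convergence
# theorem predicts for this urn.
rng = np.random.default_rng(2)
n_urns, n_draws = 20_000, 500
red = np.ones(n_urns)
total = 2.0 * np.ones(n_urns)

for _ in range(n_draws):
    drew_red = rng.random(n_urns) < red / total  # draw one ball from each urn
    red += drew_red                              # return it plus one of the same color
    total += 1

fractions = red / total
print(f"mean ~ {fractions.mean():.3f}   (Uniform(0,1): 0.5)")
print(f"var  ~ {fractions.var():.3f}   (Uniform(0,1): 1/12 = 0.083...)")
```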

Likelihood ratio. If PP and QQ are probability measures with QPQ \ll P, and X1,X2,X_1, X_2, \ldots are i.i.d. under PP, then the likelihood ratio Ln=i=1ndQdP(Xi)L_n = \prod_{i=1}^n \frac{dQ}{dP}(X_i) is a PP-martingale. This connects to sequential hypothesis testing and the Radon–Nikodym derivative from the previous section.
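The martingale property implies E_P[L_n] = 1 for every n, which a simulation can confirm. The sketch below uses an assumed toy model (not from the text): P=N(0,1)P = N(0, 1), Q=N(μ,1)Q = N(\mu, 1), so the single-observation density ratio is exp(μx − μ²/2).

```python
import numpy as np

# Check E_P[L_n] = 1 for the likelihood-ratio martingale under a toy model:
# P = N(0, 1), Q = N(mu, 1), so dQ/dP(x) = exp(mu*x - mu^2/2).
rng = np.random.default_rng(3)
mu = 0.3
n_paths, n_obs = 500_000, 5

x = rng.standard_normal((n_paths, n_obs))  # i.i.d. observations under P
ratios = np.exp(mu * x - mu**2 / 2)        # dQ/dP at each observation
L = np.cumprod(ratios, axis=1)             # likelihood ratios L_1, ..., L_5
means = L.mean(axis=0)

for n, m in enumerate(means, start=1):
    print(f"E_P[L_{n}] ~ {m:.4f}")         # each close to 1
```

Note that under PP the likelihood ratio typically drifts toward 0 pathwise (its log has negative drift) even though its expectation stays exactly 1.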

Financial Interpretation

In mathematical finance, a martingale models a fair game — a process where no betting strategy can generate a positive expected profit.

  • A discounted asset price is a martingale under the risk-neutral measure QQ (Fundamental Theorem of Asset Pricing).
  • The Efficient Market Hypothesis (weak form) asserts that prices, conditioned on historical information, should be martingales.
  • In regime detection, the question is whether the martingale property holds uniformly or whether the drift switches between regimes. GARCH(1,1) captures time-varying conditional variance (Var(XtFt1)\text{Var}(X_t \mid \mathcal{F}_{t-1}) is Ft1\mathcal{F}_{t-1}-measurable), while the Statistical Jump Model detects changes in the conditional distribution itself.

Martingale examples: a simple random walk (martingale), a random walk with drift (submartingale), a Pólya urn process (martingale converging a.s.), and regime-switching volatility.


Connections & Further Reading

Cross-Track and Within-Track Connections

| Target | Track | Relationship |
| --- | --- | --- |
| PCA & Low-Rank Approximation | Linear Algebra | Σ^=1n1XTX\hat{\Sigma} = \frac{1}{n-1} X^T X converges to Σ\Sigma by the LLN; L2L^2 theory guarantees convergence of eigenvalues |
| Concentration Inequalities | Probability & Statistics | Builds on LpL^p spaces and convergence theory to quantify rates of convergence beyond the LLN |
| PAC Learning Framework | Probability & Statistics | Uses measure-theoretic probability to formalize learnability |
| Bayesian Nonparametrics | Probability & Statistics | Requires conditional expectation, Radon–Nikodym, and product measures for priors on infinite-dimensional spaces |
| Shannon Entropy & Mutual Information | Information Theory | Entropy is E[logp(X)]E[-\log p(X)], directly using the expectation and Radon–Nikodym machinery developed here; differential entropy requires the Lebesgue integral, and conditional entropy uses conditional expectation |
| Categories & Functors | Category Theory | The category Meas of measurable spaces and measurable functions provides the categorical framework for probability; random variables are morphisms in Meas, and the pushforward of probability measures is functorial |

Financial Applications

| Application | Connection |
| --- | --- |
| GARCH(1,1) | Conditional variance Var(XtFt1)\text{Var}(X_t \mid \mathcal{F}_{t-1}) is filtration-adapted |
| Statistical Jump Model | Regime probabilities are conditional expectations given the observed filtration |
| Option pricing (Black–Scholes) | Discounted prices are QQ-martingales; dQ/dPdQ/dP is the state price density |
| Efficient Market Hypothesis | Prices form a martingale w.r.t. the public-information filtration |

Notation Reference

| Symbol | Meaning |
| --- | --- |
| (Ω,F,P)(\Omega, \mathcal{F}, P) | Probability space |
| B(R)\mathcal{B}(\mathbb{R}) | Borel sigma-algebra on R\mathbb{R} |
| λ\lambda | Lebesgue measure |
| 1A\mathbf{1}_A | Indicator function of set AA |
| f+=max(f,0)f^+ = \max(f, 0) | Positive part |
| Lp(μ)L^p(\mu) | Space of pp-integrable functions |
| E[XG]E[X \mid \mathcal{G}] | Conditional expectation |
| dνdμ\frac{d\nu}{d\mu} | Radon–Nikodym derivative |
| a.s.\xrightarrow{\text{a.s.}}, P\xrightarrow{P}, Lp\xrightarrow{L^p}, d\xrightarrow{d} | Modes of convergence |


References & Further Reading

  • Probability: Theory and Examples — Durrett (2019). Primary reference; the standard graduate probability textbook.
  • Probability and Measure — Billingsley (2012). Careful measure-theoretic development.
  • Probability with Martingales — Williams (1991). Excellent introduction to martingale theory.
  • Real Analysis — Folland (2013). Standard reference for Lebesgue integration and LpL^p spaces.
  • Measure Theory and Probability Theory — Athreya & Lahiri (2006). Balanced between measure theory and probability.
  • A Probability Path — Resnick (2014). Transition from undergraduate to measure-theoretic probability.
  • Stochastic Calculus for Finance — Shreve (2004). Martingale theory applied to financial mathematics.