
Bayesian Nonparametrics

From the Dirichlet process to Gaussian processes and posterior consistency

Overview & Motivation

In parametric statistics, we choose a model family — say, Gaussian distributions $\mathcal{N}(\mu, \sigma^2)$ — and reduce inference to estimating a fixed, finite-dimensional parameter $\theta = (\mu, \sigma^2) \in \mathbb{R}^2$. This works beautifully when the model is well-specified. But what if the true data-generating process is multimodal, heavy-tailed, or otherwise poorly captured by any finite-dimensional family?

The PAC learning framework gave us one answer: control model complexity through the VC dimension or Rademacher complexity, using structural risk minimization (SRM) to balance approximation error and estimation error. Bayesian nonparametrics offers a fundamentally different approach: place a prior directly on an infinite-dimensional parameter space and let the effective complexity of the posterior grow with the data.

The distinction between the two paradigms is worth making precise:

  • Parametric models: fix the number of parameters a priori (e.g., fit a mixture of $K = 3$ Gaussians).
  • Nonparametric models: let the effective number of parameters grow with $n$ (e.g., the number of mixture components adapts to the data).

This naming is somewhat misleading — “nonparametric” models have more parameters than parametric ones, not fewer. A better name might be “infinite-parametric,” but the convention is firmly established.

Parametric vs nonparametric paradigm

Remark (The Bayesian Resolution of Model Selection).

Recall from the PAC Learning Framework that structural risk minimization balances approximation and estimation error by selecting from a nested sequence of hypothesis classes $\mathcal{H}_1 \subset \mathcal{H}_2 \subset \cdots$. Bayesian nonparametrics sidesteps model selection entirely: by placing a prior on the union $\bigcup_d \mathcal{H}_d$, the posterior automatically concentrates on the appropriate complexity level. The marginal likelihood provides an automatic “Occam’s razor” — complex models are penalized by the prior unless the data strongly support them.

We’ll develop three canonical nonparametric models in this topic:

  1. The Dirichlet Process — a prior on probability measures, used for clustering and density estimation.
  2. The Gaussian Process — a prior on functions, used for regression and classification.
  3. The Indian Buffet Process — a prior on binary matrices, used for latent feature models.

Each places a prior on an infinite-dimensional object (a measure, a function, a binary matrix with infinitely many columns), yet admits tractable posterior inference through clever constructive representations.


The Dirichlet Distribution

The Dirichlet process is the infinite-dimensional generalization of the Dirichlet distribution, so we begin by reviewing the finite-dimensional case carefully.

Definition 1 (Dirichlet Distribution).

For a positive integer $K$ and a parameter vector $\boldsymbol{\alpha} = (\alpha_1, \ldots, \alpha_K) \in \mathbb{R}_{>0}^K$, the Dirichlet distribution $\text{Dir}(\boldsymbol{\alpha})$ is the probability distribution on the $(K-1)$-simplex

$$\Delta_{K-1} = \left\{(p_1, \ldots, p_K) \in \mathbb{R}^K : p_k \geq 0, \sum_{k=1}^K p_k = 1\right\}$$

with density

$$f(\mathbf{p} \mid \boldsymbol{\alpha}) = \frac{\Gamma\!\left(\sum_{k=1}^K \alpha_k\right)}{\prod_{k=1}^K \Gamma(\alpha_k)} \prod_{k=1}^K p_k^{\alpha_k - 1},$$

where $\Gamma(\cdot)$ is the gamma function.

The concentration parameter $\alpha_0 = \sum_k \alpha_k$ controls how concentrated the distribution is around the base measure $\mathbf{b} = \boldsymbol{\alpha}/\alpha_0$:

  • When $\alpha_0 \gg 1$, draws are concentrated near $\mathbf{b}$ (low variance).
  • When $\alpha_0 \ll 1$, draws are concentrated near the vertices of the simplex (sparse, winner-take-all).
  • When $\alpha_k = 1$ for all $k$ (so $\alpha_0 = K$), the distribution is exactly uniform on the simplex.
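These regimes are easy to see by simulation. The sketch below (assuming NumPy; the helper `dirichlet_spread` is our own name, not a library function) measures the average distance of symmetric Dirichlet draws from the simplex barycenter:

```python
import numpy as np

rng = np.random.default_rng(0)

def dirichlet_spread(alpha0, K=3, n_draws=5000):
    """Mean Euclidean distance of Dir(alpha0/K, ..., alpha0/K) draws from the barycenter."""
    draws = rng.dirichlet(np.full(K, alpha0 / K), size=n_draws)
    return np.linalg.norm(draws - 1.0 / K, axis=1).mean()

# Large alpha0: draws huddle near the barycenter b = (1/3, 1/3, 1/3).
# Small alpha0: draws sit near the simplex vertices (sparse, winner-take-all).
print(dirichlet_spread(100.0), dirichlet_spread(0.1))
```

The first number is small and the second is close to the barycenter-to-vertex distance, matching the two bullet points above.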

Dirichlet distribution on the simplex

Proposition 1 (Moments of the Dirichlet).

If $\mathbf{p} \sim \text{Dir}(\boldsymbol{\alpha})$ with $\alpha_0 = \sum_k \alpha_k$, then:

  1. $\mathbb{E}[p_k] = \alpha_k / \alpha_0$,
  2. $\text{Var}(p_k) = \frac{\alpha_k(\alpha_0 - \alpha_k)}{\alpha_0^2(\alpha_0 + 1)}$,
  3. $\text{Cov}(p_j, p_k) = \frac{-\alpha_j \alpha_k}{\alpha_0^2(\alpha_0 + 1)}$ for $j \neq k$.
Proof.

These follow from the integral representation of the beta function. For the mean, note that marginally $p_k \sim \text{Beta}(\alpha_k, \alpha_0 - \alpha_k)$, so $\mathbb{E}[p_k] = \alpha_k / \alpha_0$. The variance follows from the Beta variance formula. For the covariance, we use the constraint $\sum_k p_k = 1$, which gives $\sum_{j \neq k} \text{Cov}(p_j, p_k) = -\text{Var}(p_k)$. By the symmetry structure of the Dirichlet, all off-diagonal covariances involving $p_k$ have the same sign (negative), and the formula follows from direct computation.
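Proposition 1 is straightforward to sanity-check by Monte Carlo (a sketch assuming NumPy; the parameter values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = np.array([2.0, 3.0, 5.0])
alpha0 = alpha.sum()
draws = rng.dirichlet(alpha, size=200_000)

# Closed forms from Proposition 1
mean_theory = alpha / alpha0
var_theory = alpha * (alpha0 - alpha) / (alpha0**2 * (alpha0 + 1))
cov01_theory = -alpha[0] * alpha[1] / (alpha0**2 * (alpha0 + 1))
cov01_mc = np.cov(draws[:, 0], draws[:, 1])[0, 1]

print(np.allclose(draws.mean(axis=0), mean_theory, atol=1e-2))  # True
print(np.allclose(draws.var(axis=0), var_theory, atol=1e-3))    # True
print(abs(cov01_mc - cov01_theory) < 1e-3)                      # True
```

Note the negative covariance: because the components must sum to one, any pair of coordinates is negatively correlated.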

Proposition 2 (Dirichlet–Multinomial Conjugacy).

If $\mathbf{p} \sim \text{Dir}(\boldsymbol{\alpha})$ and $\mathbf{n} = (n_1, \ldots, n_K) \mid \mathbf{p} \sim \text{Multinomial}(N, \mathbf{p})$, then

$$\mathbf{p} \mid \mathbf{n} \sim \text{Dir}(\alpha_1 + n_1, \ldots, \alpha_K + n_K).$$

Proof.

By Bayes’ theorem, $f(\mathbf{p} \mid \mathbf{n}) \propto f(\mathbf{n} \mid \mathbf{p}) f(\mathbf{p})$. The multinomial likelihood is $f(\mathbf{n} \mid \mathbf{p}) \propto \prod_k p_k^{n_k}$, and the Dirichlet prior density is $f(\mathbf{p}) \propto \prod_k p_k^{\alpha_k - 1}$. Multiplying:

$$f(\mathbf{p} \mid \mathbf{n}) \propto \prod_k p_k^{\alpha_k + n_k - 1},$$

which we recognize as $\text{Dir}(\boldsymbol{\alpha} + \mathbf{n})$.
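The conjugate update is a one-liner in code (a sketch assuming NumPy; the counts are illustrative):

```python
import numpy as np

alpha_prior = np.array([1.0, 1.0, 1.0])   # Dir(1, 1, 1): uniform prior on the simplex
counts = np.array([8, 3, 1])              # observed multinomial counts n

alpha_post = alpha_prior + counts         # Proposition 2: posterior is Dir(alpha + n)
posterior_mean = alpha_post / alpha_post.sum()
mle = counts / counts.sum()
# The posterior mean shrinks the empirical frequencies toward the prior mean (1/3, 1/3, 1/3).
print(posterior_mean, mle)
```

With 12 observations the shrinkage is mild; as the counts grow, the posterior mean approaches the empirical frequencies.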

Remark (Aggregation Property).

The Dirichlet distribution satisfies a crucial aggregation property: if $(p_1, \ldots, p_K) \sim \text{Dir}(\alpha_1, \ldots, \alpha_K)$ and we merge components $j$ and $k$ into $p_{jk} = p_j + p_k$, then the resulting vector follows $\text{Dir}(\ldots, \alpha_j + \alpha_k, \ldots)$ with the merged parameter. This property is exactly what allows the infinite-dimensional extension — the Dirichlet process — to be self-consistent under arbitrary partitions.


The Dirichlet Process

The Dirichlet process, introduced by Ferguson (1973), is the cornerstone of Bayesian nonparametrics. It is a distribution over probability distributions — a “prior over priors” — that generalizes the Dirichlet distribution to infinite-dimensional spaces.

Definition 2 (Dirichlet Process).

Let $(\Theta, \mathcal{A})$ be a measurable space, $\alpha > 0$ a concentration parameter, and $G_0$ a probability measure on $(\Theta, \mathcal{A})$ called the base measure. A random probability measure $G$ on $(\Theta, \mathcal{A})$ follows a Dirichlet process, written $G \sim \text{DP}(\alpha, G_0)$, if for every finite measurable partition $\{A_1, \ldots, A_K\}$ of $\Theta$:

$$(G(A_1), G(A_2), \ldots, G(A_K)) \sim \text{Dir}(\alpha G_0(A_1), \alpha G_0(A_2), \ldots, \alpha G_0(A_K)).$$

This definition is elegant but requires verification that such an object exists — the condition must be self-consistent across all possible partitions.

Theorem 1 (Existence and Uniqueness of the Dirichlet Process).

For any concentration parameter $\alpha > 0$ and base measure $G_0$ on a Polish space $(\Theta, \mathcal{A})$, there exists a unique probability measure on the space of probability measures over $(\Theta, \mathcal{A})$ satisfying the Dirichlet process definition.

Proof.

The key is the Kolmogorov extension theorem. We verify two conditions:

  1. Consistency under marginalization. If $\{A_1, \ldots, A_K\}$ is a partition and we merge $A_j \cup A_k$ into a single set, the resulting marginal distribution must be $\text{Dir}(\ldots, \alpha G_0(A_j) + \alpha G_0(A_k), \ldots)$. This follows from the aggregation property of the Dirichlet distribution (Remark 2).

  2. Consistency under refinement. If we refine $A_j$ into $A_j = B_1 \cup B_2$, the joint distribution of $(G(A_1), \ldots, G(B_1), G(B_2), \ldots, G(A_K))$ must agree with the Dirichlet definition on the finer partition. This follows from the conditional independence structure: $G(B_1)/G(A_j) \mid G(A_j) \sim \text{Beta}(\alpha G_0(B_1), \alpha G_0(B_2))$, independent of the other components.

With these consistency conditions verified, the Kolmogorov extension theorem guarantees the existence of a unique probability measure on the product $\sigma$-algebra.

The two parameters of the DP have clear roles:

  • Base measure $G_0$: the “prior guess” at what $G$ looks like. $\mathbb{E}[G(A)] = G_0(A)$ for every measurable $A$ — draws from the DP are centered around $G_0$.
  • Concentration parameter $\alpha$: controls how close $G$ is to $G_0$. As $\alpha \to \infty$, $G \to G_0$ in distribution. As $\alpha \to 0$, $G$ concentrates on a single atom.

Proposition 3 (Moments of the DP).

If $G \sim \text{DP}(\alpha, G_0)$ and $A \in \mathcal{A}$, then:

  1. $\mathbb{E}[G(A)] = G_0(A)$,
  2. $\text{Var}(G(A)) = \frac{G_0(A)(1 - G_0(A))}{\alpha + 1}$.
Proof.

These follow directly from the moments of the Beta distribution. For any measurable set $A$, consider the partition $\{A, A^c\}$. Then $(G(A), G(A^c)) \sim \text{Dir}(\alpha G_0(A), \alpha(1 - G_0(A)))$, so $G(A) \sim \text{Beta}(\alpha G_0(A), \alpha(1 - G_0(A)))$. The beta distribution gives $\mathbb{E}[G(A)] = \frac{\alpha G_0(A)}{\alpha} = G_0(A)$ and $\text{Var}(G(A)) = \frac{\alpha G_0(A) \cdot \alpha(1 - G_0(A))}{\alpha^2(\alpha + 1)} = \frac{G_0(A)(1 - G_0(A))}{\alpha + 1}$.
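Proposition 3 can be checked by Monte Carlo using a truncated version of the stick-breaking representation developed in the next section (a sketch assuming NumPy; `truncated_dp_mass` is our own helper, and the truncation level is a practical approximation):

```python
import numpy as np

rng = np.random.default_rng(0)

def truncated_dp_mass(alpha, p0, K=500, n_draws=2000):
    """Monte Carlo draws of G(A) via truncated stick-breaking, with G0(A) = p0."""
    V = rng.beta(1, alpha, size=(n_draws, K))
    w = np.empty_like(V)
    w[:, 0] = V[:, 0]
    w[:, 1:] = V[:, 1:] * np.cumprod(1 - V, axis=1)[:, :-1]
    in_A = rng.random((n_draws, K)) < p0   # each atom falls in A with probability G0(A)
    return (w * in_A).sum(axis=1)

alpha, p0 = 3.0, 0.3
GA = truncated_dp_mass(alpha, p0)
# Proposition 3 predicts E[G(A)] = 0.3 and Var(G(A)) = 0.3 * 0.7 / (3 + 1) = 0.0525.
print(GA.mean(), GA.var())
```

The simulated mean and variance match the closed forms, and notably the variance does not shrink as the truncation level grows — $G(A)$ stays random no matter how many atoms we use.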

The most surprising — and practically important — property of the Dirichlet process is its almost sure discreteness.

DP draws and discreteness

Theorem 2 (Almost Sure Discreteness).

If $G \sim \text{DP}(\alpha, G_0)$, then $G$ is almost surely a discrete measure, regardless of whether the base measure $G_0$ is continuous or discrete. That is, with probability one,

$$G = \sum_{k=1}^{\infty} w_k \delta_{\theta_k}$$

for some random weights $w_k \geq 0$ with $\sum_k w_k = 1$ and random atoms $\theta_k \in \Theta$.

Proof.

We prove this via the stick-breaking construction (§4), which provides an explicit representation $G = \sum_{k=1}^{\infty} w_k \delta_{\theta_k}$ with $w_k = V_k \prod_{j < k}(1 - V_j)$ and $V_k \stackrel{\text{iid}}{\sim} \text{Beta}(1, \alpha)$. Since each $V_k \in [0, 1]$ and $\mathbb{E}[V_k] = 1/(1 + \alpha) > 0$, the product $\prod_{j < k}(1 - V_j) \to 0$ almost surely, ensuring $\sum_k w_k = 1$ almost surely. The atoms $\theta_k$ are drawn i.i.d. from $G_0$, so $G$ is a countable mixture of point masses.

An alternative argument: for any fixed atom $\theta$, $\Pr[G(\{\theta\}) > 0] = 0$ when $G_0$ is non-atomic. But $G$ still has atoms — they arise randomly, not at pre-specified locations. The key insight is that the DP’s finite-dimensional distributions are Dirichlet, and as the partition becomes finer, the mass concentrates on increasingly few partition elements. In the limit, this produces a discrete measure with probability one.

Remark (Discreteness Is a Feature, Not a Bug).

The almost sure discreteness of the DP means draws $\theta_1, \theta_2, \ldots \mid G$ will exhibit ties — multiple observations share the same value. This clustering property is exactly what makes the DP useful for mixture modeling: the number of distinct values (clusters) grows logarithmically with $n$, adapting to the data.


Constructive Representations

Ferguson’s definition (Definition 2) is clean but non-constructive — it tells us the finite-dimensional marginals without directly telling us how to sample from the DP. Three equivalent constructive representations fill this gap, each offering different computational and conceptual advantages.

The Stick-Breaking Construction

Definition 3 (Stick-Breaking Construction (Sethuraman, 1994)).

Let $V_k \stackrel{\text{iid}}{\sim} \text{Beta}(1, \alpha)$ for $k = 1, 2, \ldots$ and $\theta_k \stackrel{\text{iid}}{\sim} G_0$, mutually independent. Define the stick-breaking weights

$$w_k = V_k \prod_{j=1}^{k-1} (1 - V_j), \qquad k = 1, 2, \ldots$$

Then $G = \sum_{k=1}^{\infty} w_k \delta_{\theta_k} \sim \text{DP}(\alpha, G_0)$.

The name is vivid: imagine a stick of length 1. Break off a fraction $V_1$ (the first weight $w_1 = V_1$). From the remaining piece of length $1 - V_1$, break off a fraction $V_2$ (giving $w_2 = V_2(1 - V_1)$). Continue ad infinitum. The process almost surely exhausts the stick: $\sum_k w_k = 1$ with probability one.

Theorem 3 (Stick-Breaking Equivalence).

The random measure $G = \sum_{k=1}^{\infty} w_k \delta_{\theta_k}$ constructed via stick-breaking is distributed as $\text{DP}(\alpha, G_0)$.

Proof.

We verify the defining property. Let $\{A_1, \ldots, A_K\}$ be a finite measurable partition of $\Theta$. Then

$$G(A_j) = \sum_{k=1}^{\infty} w_k \mathbf{1}[\theta_k \in A_j].$$

We need to show $(G(A_1), \ldots, G(A_K)) \sim \text{Dir}(\alpha G_0(A_1), \ldots, \alpha G_0(A_K))$. The key observation is self-similarity: peeling off the first atom gives the distributional equation $G \stackrel{d}{=} V_1 \delta_{\theta_1} + (1 - V_1) G'$, where $G'$ is an independent copy of $G$, $\theta_1 \sim G_0$, and $V_1 \sim \text{Beta}(1, \alpha)$. One then checks that the finite-dimensional distributions of the DP satisfy this fixed-point equation and that the solution is unique, confirming the DP distribution. The full calculation appears in Sethuraman (1994).

Constructive representations

In practice, we truncate the stick-breaking construction at KK components, which gives an excellent approximation when KK is large enough:

import numpy as np

rng = np.random.default_rng(0)

def stick_breaking_sample(alpha, G0_sampler, K=200):
    """Sample a truncated DP via stick-breaking."""
    V = rng.beta(1, alpha, size=K)
    sticks = np.cumprod(1 - V)                   # remaining stick length after each break
    w = V * np.concatenate(([1.0], sticks[:-1]))  # w_k = V_k * prod_{j<k} (1 - V_j)
    atoms = G0_sampler(K)
    return w, atoms

G0_sampler = lambda K: rng.normal(0, 1, K)  # standard normal base measure

The Chinese Restaurant Process

Definition 4 (Chinese Restaurant Process).

Consider a sequence of customers $\theta_1, \theta_2, \ldots$ arriving at a restaurant with infinitely many tables. The first customer sits at table 1. Customer $n+1$ sits at:

  • an occupied table $k$ (with currently $n_k$ customers) with probability $\frac{n_k}{\alpha + n}$,
  • a new table with probability $\frac{\alpha}{\alpha + n}$.

When a new table is opened, a dish $\phi \sim G_0$ is drawn for that table. Each customer at table $k$ receives the dish $\phi_k$ associated with their table.

Theorem 4 (CRP–DP Equivalence).

The sequence $\theta_1, \theta_2, \ldots$ generated by the Chinese Restaurant Process is exchangeable, and the directing measure (in the sense of de Finetti’s theorem) is $G \sim \text{DP}(\alpha, G_0)$.

Proof.

Step 1: Predictive distribution. By the CRP construction, the conditional distribution of $\theta_{n+1}$ given $\theta_1, \ldots, \theta_n$ is

$$\theta_{n+1} \mid \theta_1, \ldots, \theta_n \sim \frac{\alpha}{\alpha + n} G_0 + \frac{1}{\alpha + n} \sum_{i=1}^n \delta_{\theta_i}.$$

This is the Pólya urn predictive rule (Blackwell & MacQueen, 1973).

Step 2: Exchangeability. We verify that $p(\theta_1, \ldots, \theta_n)$ is invariant under permutations. The joint probability factors as:

$$p(\theta_1, \ldots, \theta_n) = \frac{\alpha^{K_n} \prod_{k=1}^{K_n} (n_k - 1)!}{\alpha(\alpha+1)\cdots(\alpha+n-1)} \prod_{k=1}^{K_n} G_0(\phi_k),$$

where $K_n$ is the number of distinct values (tables) and $n_k$ is the count at table $k$. This expression depends on $(\theta_1, \ldots, \theta_n)$ only through the partition structure (which values are equal) — not on the ordering. Hence the sequence is exchangeable.

Step 3: De Finetti’s representation. By de Finetti’s theorem, an exchangeable sequence of random variables is a mixture of i.i.d. sequences: $\theta_i \mid G \stackrel{\text{iid}}{\sim} G$ for some random $G$. The predictive distribution (Step 1) uniquely identifies $G \sim \text{DP}(\alpha, G_0)$ through the posterior characterization in §5.
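The invariance claimed in Step 2 can be verified numerically: two seating sequences that induce the same table sizes receive the same probability under the sequential CRP rule (a sketch in plain Python; `crp_seq_prob` is our own helper):

```python
def crp_seq_prob(labels, alpha):
    """Probability of one particular seating sequence under the CRP."""
    counts = {}
    p = 1.0
    for i, lab in enumerate(labels):
        if lab in counts:
            p *= counts[lab] / (alpha + i)  # join occupied table: n_k / (alpha + i)
            counts[lab] += 1
        else:
            p *= alpha / (alpha + i)        # open a new table: alpha / (alpha + i)
            counts[lab] = 1
    return p

alpha = 1.5
# Two seating orders inducing the same table sizes {3, 2, 1}
p1 = crp_seq_prob([0, 0, 1, 0, 2, 1], alpha)
p2 = crp_seq_prob([1, 0, 2, 0, 0, 1], alpha)
print(abs(p1 - p2) < 1e-12)  # True: probability depends only on the partition
```

Both sequences multiply the same factors in a different order, exactly as the closed-form joint probability predicts.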

The CRP provides an intuitive simulation algorithm:

import numpy as np

rng = np.random.default_rng(0)

alpha_crp = 2.0
n_customers = 50
tables = []       # table sizes
assignments = []  # which table each customer sits at

for i in range(n_customers):
    if len(tables) == 0:
        tables.append(1)
        assignments.append(0)
    else:
        # n_k / (alpha + i) for occupied tables, alpha / (alpha + i) for a new one
        probs = np.array(tables + [alpha_crp]) / (i + alpha_crp)
        choice = rng.choice(len(probs), p=probs)
        if choice == len(tables):   # new table
            tables.append(1)
            assignments.append(len(tables) - 1)
        else:                       # existing table
            tables[choice] += 1
            assignments.append(choice)

The Pólya Urn Scheme

Definition 5 (Pólya Urn Scheme (Blackwell & MacQueen, 1973)).

The Pólya urn provides a sequential construction equivalent to the CRP. Start with an urn containing a “paint can” of color $G_0$ with mass $\alpha$. At step $n+1$:

  1. Draw a ball from the urn uniformly at random (proportional to mass).
  2. If the paint can is drawn, generate $\theta_{n+1} \sim G_0$ and add a unit-mass ball of color $\theta_{n+1}$ to the urn.
  3. If a ball of color $\theta_i$ is drawn, set $\theta_{n+1} = \theta_i$ and add another unit-mass ball of the same color.

This produces the same predictive rule as the CRP:

$$\theta_{n+1} \mid \theta_1, \ldots, \theta_n \sim \frac{\alpha}{\alpha + n} G_0 + \frac{1}{\alpha + n} \sum_{i=1}^n \delta_{\theta_i}.$$

Remark (Expected Number of Clusters).

In the CRP, the expected number of distinct tables after $n$ customers is

$$\mathbb{E}[K_n] = \sum_{i=1}^{n} \frac{\alpha}{\alpha + i - 1} \approx \alpha \log\!\left(\frac{n}{\alpha} + 1\right).$$

This logarithmic growth is a hallmark of the DP: the number of clusters grows slowly, providing an automatic regularization effect.
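The formula is easy to verify empirically. The sketch below (assuming NumPy; both function names are ours) compares the exact sum, the logarithmic approximation, and a direct CRP simulation:

```python
import numpy as np

rng = np.random.default_rng(2)

def expected_tables(alpha, n):
    """Exact E[K_n] = sum_{i=1}^n alpha / (alpha + i - 1)."""
    i = np.arange(1, n + 1)
    return np.sum(alpha / (alpha + i - 1))

def simulate_tables(alpha, n):
    """One CRP run; return the number of occupied tables after n customers."""
    tables = []
    for i in range(n):
        probs = np.array(tables + [alpha]) / (i + alpha)
        choice = rng.choice(len(probs), p=probs)
        if choice == len(tables):
            tables.append(1)
        else:
            tables[choice] += 1
    return len(tables)

alpha, n = 2.0, 500
exact = expected_tables(alpha, n)
approx = alpha * np.log(n / alpha + 1)
mc = np.mean([simulate_tables(alpha, n) for _ in range(200)])
print(exact, approx, mc)  # exact and Monte Carlo agree; the log approximation is close
```

With $\alpha = 2$ and $n = 500$, all three values land near a dozen tables — logarithmic growth in action.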


Posterior Inference

One of the most elegant properties of the Dirichlet process is its conjugacy: the posterior of a DP prior given i.i.d. observations is again a DP, with parameters updated in a natural way.

Theorem 5 (DP Posterior Update).

Let $G \sim \text{DP}(\alpha, G_0)$ and $\theta_1, \ldots, \theta_n \mid G \stackrel{\text{iid}}{\sim} G$. Then the posterior distribution of $G$ given $\theta_1, \ldots, \theta_n$ is

$$G \mid \theta_1, \ldots, \theta_n \sim \text{DP}\!\left(\alpha + n,\; \frac{\alpha}{\alpha + n} G_0 + \frac{n}{\alpha + n} \hat{F}_n\right),$$

where $\hat{F}_n = \frac{1}{n}\sum_{i=1}^n \delta_{\theta_i}$ is the empirical distribution.

Proof.

We verify the defining property of the DP. Let $\{A_1, \ldots, A_K\}$ be a finite measurable partition of $\Theta$. Define $n_j = |\{i : \theta_i \in A_j\}|$, so $\sum_j n_j = n$.

Prior. $(G(A_1), \ldots, G(A_K)) \sim \text{Dir}(\alpha G_0(A_1), \ldots, \alpha G_0(A_K))$.

Likelihood. Given $G$, the observations $\theta_1, \ldots, \theta_n$ are i.i.d. from $G$, so the count vector $(n_1, \ldots, n_K) \mid G \sim \text{Multinomial}(n, (G(A_1), \ldots, G(A_K)))$.

Posterior. By Dirichlet–Multinomial conjugacy (Proposition 2):

$$(G(A_1), \ldots, G(A_K)) \mid \theta_1, \ldots, \theta_n \sim \text{Dir}(\alpha G_0(A_1) + n_1, \ldots, \alpha G_0(A_K) + n_K).$$

We rewrite the updated parameters:

$$\alpha G_0(A_j) + n_j = (\alpha + n) \left(\frac{\alpha}{\alpha + n} G_0(A_j) + \frac{n}{\alpha + n} \cdot \frac{n_j}{n}\right) = (\alpha + n) G_n(A_j),$$

where $G_n = \frac{\alpha}{\alpha + n} G_0 + \frac{n}{\alpha + n} \hat{F}_n$ is the updated base measure. Since this holds for every finite measurable partition, $G \mid \theta_1, \ldots, \theta_n \sim \text{DP}(\alpha + n, G_n)$.

DP posterior update

Corollary 1 (Posterior Predictive Distribution).

The predictive distribution for $\theta_{n+1}$ given $\theta_1, \ldots, \theta_n$ (marginalizing over $G$) is

$$\theta_{n+1} \mid \theta_1, \ldots, \theta_n \sim \frac{\alpha}{\alpha + n} G_0 + \frac{1}{\alpha + n} \sum_{i=1}^n \delta_{\theta_i}.$$

Proof.

Integrate $\theta_{n+1} \mid G \sim G$ against the posterior $G \mid \theta_1, \ldots, \theta_n \sim \text{DP}(\alpha + n, G_n)$: for any measurable $A$,

$$\Pr[\theta_{n+1} \in A \mid \theta_1, \ldots, \theta_n] = \mathbb{E}[G(A) \mid \theta_1, \ldots, \theta_n] = G_n(A) = \frac{\alpha}{\alpha + n} G_0(A) + \frac{n}{\alpha + n} \hat{F}_n(A).$$

This recovers the Pólya urn predictive rule (Definition 5).

Remark (Posterior as Weighted Average).

The posterior base measure $G_n = \frac{\alpha}{\alpha+n}G_0 + \frac{n}{\alpha+n}\hat{F}_n$ is a weighted average of the prior $G_0$ and the empirical distribution $\hat{F}_n$. As $n \to \infty$, the posterior concentrates around $\hat{F}_n$ — the data overwhelm the prior. As $\alpha \to \infty$, the posterior stays close to $G_0$ — the prior dominates. This interpolation between prior belief and data evidence is the essence of Bayesian learning.


Dirichlet Process Mixture Models

The DP by itself generates discrete distributions, but real data is often continuous. The Dirichlet process mixture model (DPMM) solves this by using the DP as a mixing distribution: each observation is drawn from a kernel (e.g., Gaussian) centered at a DP-sampled atom. This produces a countable mixture with an unknown number of components.

Definition 6 (Dirichlet Process Mixture Model).

A DPMM with concentration parameter $\alpha$, base measure $G_0$, and kernel $F(\cdot \mid \theta)$ is the hierarchical model:

$$G \sim \text{DP}(\alpha, G_0), \qquad \theta_i \mid G \stackrel{\text{iid}}{\sim} G, \qquad x_i \mid \theta_i \sim F(\cdot \mid \theta_i).$$

In the Gaussian case with $G_0 = \text{NIW}(\mu_0, \kappa_0, \nu_0, \Psi_0)$ (Normal-Inverse-Wishart) and $F(\cdot \mid \mu, \Sigma) = \mathcal{N}(\mu, \Sigma)$, this becomes a Gaussian mixture model with a random (potentially infinite) number of components.

The generative process via stick-breaking:

  1. Draw weights: $V_k \sim \text{Beta}(1, \alpha)$, set $w_k = V_k \prod_{j<k}(1-V_j)$.
  2. Draw atoms: $\theta_k \sim G_0$.
  3. For each observation $i$: assign to component $z_i$ with $\Pr[z_i = k] = w_k$, then draw $x_i \sim F(\cdot \mid \theta_{z_i})$.

Definition 7 (Collapsed DPMM via CRP).

The CRP representation provides an equivalent generative model that integrates out GG:

  1. For $i = 1, \ldots, n$, assign observation $i$ to cluster $k$ with probability:
    • $\frac{n_{k,-i}}{\alpha + n - 1}$ for existing cluster $k$ (where $n_{k,-i}$ is the count excluding $i$),
    • $\frac{\alpha}{\alpha + n - 1}$ for a new cluster.
  2. If assigned to a new cluster, draw $\theta_{\text{new}} \sim G_0$.
  3. Draw $x_i \sim F(\cdot \mid \theta_{z_i})$.

Gibbs sampling for DPMMs. The collapsed Gibbs sampler (Neal, 2000, Algorithm 3) iterates over observations, resampling each cluster assignment $z_i$ from its full conditional:

$$\Pr[z_i = k \mid z_{-i}, x_1, \ldots, x_n] \propto \begin{cases} n_{k,-i} \cdot F(x_i \mid \theta_k) & \text{existing cluster } k, \\ \alpha \cdot \int F(x_i \mid \theta) \, dG_0(\theta) & \text{new cluster.} \end{cases}$$

When the kernel $F$ and base measure $G_0$ are conjugate (e.g., Gaussian–NIW), the marginal likelihood $\int F(x_i \mid \theta) \, dG_0(\theta)$ and the posterior of $\theta_k$ given all observations assigned to cluster $k$ have closed-form expressions.

Example 1 (Gaussian DPMM — Univariate).

For $F(\cdot \mid \mu, \sigma^2) = \mathcal{N}(\mu, \sigma^2)$ with known variance $\sigma^2$ and $G_0 = \mathcal{N}(\mu_0, \sigma_0^2)$:

  • Marginal likelihood: $\int \mathcal{N}(x \mid \mu, \sigma^2) \, \mathcal{N}(\mu \mid \mu_0, \sigma_0^2) \, d\mu = \mathcal{N}(x \mid \mu_0, \sigma^2 + \sigma_0^2)$.
  • Posterior for cluster mean: $\mu_k \mid \{x_i : z_i = k\} \sim \mathcal{N}\!\left(\frac{\sigma_0^{-2}\mu_0 + n_k\sigma^{-2}\bar{x}_k}{\sigma_0^{-2} + n_k\sigma^{-2}},\; (\sigma_0^{-2} + n_k\sigma^{-2})^{-1}\right)$.
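The marginal-likelihood identity can be checked against numerical integration (a sketch assuming NumPy and SciPy; the parameter values are arbitrary):

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

sigma2, mu0, sigma02 = 0.5, 0.0, 10.0
x = 1.3

# Integrate the likelihood N(x | mu, sigma2) against the prior N(mu | mu0, sigma02)
numeric, _ = quad(
    lambda mu: norm.pdf(x, mu, np.sqrt(sigma2)) * norm.pdf(mu, mu0, np.sqrt(sigma02)),
    -50, 50,
)
closed_form = norm.pdf(x, mu0, np.sqrt(sigma2 + sigma02))
print(numeric, closed_form)  # the two agree to quadrature precision
```

The closed form is what the Gibbs sampler uses for the "new cluster" probability, so getting it right matters in practice.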

DPMM clustering

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def run_dpmm_gibbs(X, alpha, n_iter=50, sigma2=0.5, mu0=0.0, sigma02=10.0):
    """Collapsed Gibbs sampler for a univariate Gaussian DPMM (known variance)."""
    X = np.asarray(X)
    n = len(X)
    z = rng.integers(0, 3, size=n)  # random initial assignments
    for iteration in range(n_iter):
        for i in range(n):
            # Cluster memberships excluding observation i
            clusters = {}
            for j in range(n):
                if j == i:
                    continue
                clusters.setdefault(z[j], []).append(j)
            log_probs, cluster_ids = [], []
            for k, members in clusters.items():
                n_k = len(members)
                x_bar = np.mean(X[members])
                precision_post = 1/sigma02 + n_k/sigma2
                mu_post = (mu0/sigma02 + n_k*x_bar/sigma2) / precision_post
                sigma2_pred = sigma2 + 1/precision_post
                log_probs.append(np.log(n_k) + norm.logpdf(X[i], mu_post, np.sqrt(sigma2_pred)))
                cluster_ids.append(k)
            # New cluster: prior predictive N(mu0, sigma2 + sigma02)
            log_probs.append(np.log(alpha) + norm.logpdf(X[i], mu0, np.sqrt(sigma2 + sigma02)))
            cluster_ids.append(max(cluster_ids, default=-1) + 1)
            log_probs = np.array(log_probs)
            probs = np.exp(log_probs - log_probs.max())  # shift for numerical stability
            probs /= probs.sum()
            z[i] = cluster_ids[rng.choice(len(probs), p=probs)]
    return z

Gaussian Processes

While the Dirichlet process provides a nonparametric prior on probability measures, the Gaussian process provides a nonparametric prior on functions. In the Bayesian nonparametric view, a GP is a prior on an infinite-dimensional function space that admits tractable finite-dimensional marginals.

Definition 8 (Gaussian Process).

A Gaussian process is a collection of random variables $\{f(x)\}_{x \in \mathcal{X}}$, any finite subset of which has a joint Gaussian distribution. A GP is fully specified by its mean function $m(x)$ and covariance function (kernel) $k(x, x')$:

$$f \sim \mathcal{GP}(m, k), \qquad m(x) = \mathbb{E}[f(x)], \qquad k(x, x') = \text{Cov}(f(x), f(x')).$$

For any finite set of inputs $\mathbf{x} = (x_1, \ldots, x_n)$, the function values $(f(x_1), \ldots, f(x_n))$ follow a multivariate Gaussian:

$$(f(x_1), \ldots, f(x_n)) \sim \mathcal{N}(\mathbf{m}, \mathbf{K}),$$

where $\mathbf{m}_i = m(x_i)$ and $\mathbf{K}_{ij} = k(x_i, x_j)$.

Definition 9 (Common Kernels).

The choice of kernel encodes prior assumptions about the function:

  1. Squared exponential (RBF): $k(x, x') = \sigma_f^2 \exp\!\left(-\frac{\|x - x'\|^2}{2\ell^2}\right)$ — infinitely differentiable functions. Parameters: signal variance $\sigma_f^2$, length-scale $\ell$.

  2. Matérn-$\nu$: $k(x, x') = \sigma_f^2 \frac{2^{1-\nu}}{\Gamma(\nu)}\left(\frac{\sqrt{2\nu}\|x-x'\|}{\ell}\right)^\nu K_\nu\!\left(\frac{\sqrt{2\nu}\|x-x'\|}{\ell}\right)$ — $\lceil\nu\rceil - 1$ times differentiable. For $\nu = 3/2$: $k(x, x') = \sigma_f^2\left(1 + \frac{\sqrt{3}|x-x'|}{\ell}\right)\exp\!\left(-\frac{\sqrt{3}|x-x'|}{\ell}\right)$.

  3. Linear: $k(x, x') = \sigma_b^2 + \sigma_v^2 (x - c)(x' - c)$ — equivalent to Bayesian linear regression.
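The Matérn-3/2 formula translates directly into code. The sketch below (assuming NumPy; `matern32_kernel` is our own name) also checks the basic properties a valid kernel matrix must satisfy:

```python
import numpy as np

def matern32_kernel(X1, X2, length_scale=1.0, signal_var=1.0):
    """Matern-3/2 kernel: sigma_f^2 * (1 + sqrt(3) r / l) * exp(-sqrt(3) r / l)."""
    r = np.abs(np.asarray(X1).reshape(-1, 1) - np.asarray(X2).reshape(1, -1))
    s = np.sqrt(3.0) * r / length_scale
    return signal_var * (1.0 + s) * np.exp(-s)

x = np.linspace(0, 1, 5)
K = matern32_kernel(x, x)
# Symmetric, positive semidefinite, with k(x, x) = signal_var on the diagonal
print(np.allclose(K, K.T), np.linalg.eigvalsh(K).min(), np.diag(K))
```

Swapping this in for the RBF kernel in a GP changes the smoothness assumption from infinitely differentiable to once-differentiable sample paths.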

The real power of GPs lies in the closed-form posterior — conditioning on observed data is just Gaussian conditioning.

Theorem 6 (GP Posterior).

Let $f \sim \mathcal{GP}(0, k)$ and observe $\mathbf{y} = f(\mathbf{X}) + \boldsymbol{\varepsilon}$ where $\varepsilon_i \stackrel{\text{iid}}{\sim} \mathcal{N}(0, \sigma_n^2)$. Then the posterior $f \mid \mathbf{X}, \mathbf{y}$ is again a GP with:

$$\mathbb{E}[f(x^*) \mid \mathbf{X}, \mathbf{y}] = \mathbf{k}_*^\top (\mathbf{K} + \sigma_n^2 \mathbf{I})^{-1} \mathbf{y},$$

$$\text{Var}(f(x^*) \mid \mathbf{X}, \mathbf{y}) = k(x^*, x^*) - \mathbf{k}_*^\top (\mathbf{K} + \sigma_n^2 \mathbf{I})^{-1} \mathbf{k}_*,$$

where $\mathbf{k}_* = (k(x^*, x_1), \ldots, k(x^*, x_n))^\top$ and $\mathbf{K}_{ij} = k(x_i, x_j)$.

Proof.

Write the joint distribution of the training outputs $\mathbf{y}$ and the test output $f_* = f(x^*)$:

$$\begin{pmatrix} \mathbf{y} \\ f_* \end{pmatrix} \sim \mathcal{N}\!\left(\mathbf{0},\; \begin{pmatrix} \mathbf{K} + \sigma_n^2 \mathbf{I} & \mathbf{k}_* \\ \mathbf{k}_*^\top & k(x^*, x^*) \end{pmatrix}\right).$$

By the standard formula for Gaussian conditionals ($p(a \mid b) = \mathcal{N}(\mu_a + \Sigma_{ab}\Sigma_{bb}^{-1}(b - \mu_b),\; \Sigma_{aa} - \Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba})$):

$$f_* \mid \mathbf{y} \sim \mathcal{N}\!\left(\mathbf{k}_*^\top(\mathbf{K} + \sigma_n^2\mathbf{I})^{-1}\mathbf{y},\; k(x^*, x^*) - \mathbf{k}_*^\top(\mathbf{K} + \sigma_n^2\mathbf{I})^{-1}\mathbf{k}_*\right).$$

Since this holds for any finite set of test points, the posterior is a GP.

GP regression

In practice, the matrix inverse $(\mathbf{K} + \sigma_n^2 \mathbf{I})^{-1}$ is computed via the Cholesky decomposition for numerical stability:

import numpy as np

rng = np.random.default_rng(0)

def rbf_kernel(X1, X2, length_scale=1.0, signal_var=1.0):
    """Squared exponential (RBF) kernel."""
    sqdist = np.sum(X1**2, 1).reshape(-1, 1) + np.sum(X2**2, 1) - 2 * X1 @ X2.T
    return signal_var * np.exp(-0.5 * sqdist / length_scale**2)

# Illustrative toy data (any column arrays of inputs work here)
X_train = np.linspace(-3, 3, 10).reshape(-1, 1)
y_train = np.sin(X_train).ravel() + 0.1 * rng.standard_normal(10)
X_test = np.linspace(-4, 4, 100).reshape(-1, 1)
sigma_n = 0.1

# GP posterior via Cholesky decomposition
K_train = rbf_kernel(X_train, X_train) + sigma_n**2 * np.eye(len(X_train))
K_star = rbf_kernel(X_test, X_train)
K_ss = rbf_kernel(X_test, X_test)

L = np.linalg.cholesky(K_train)                               # O(n^3) factorization
alpha_gp = np.linalg.solve(L.T, np.linalg.solve(L, y_train))  # two O(n^2) triangular solves
mu_post = K_star @ alpha_gp                                   # posterior mean

v = np.linalg.solve(L, K_star.T)
var_post = np.diag(K_ss) - np.sum(v**2, axis=0)               # posterior variance
std_post = np.sqrt(np.maximum(var_post, 0))                   # clamp for numerics

Remark (GP–DP Connection).

The Dirichlet process and Gaussian process are complementary nonparametric priors: the DP is a prior on discrete measures (used for clustering and density estimation), while the GP is a prior on continuous functions (used for regression and classification). Both are infinite-dimensional priors with tractable finite-dimensional marginals. The DP’s finite marginals are Dirichlet; the GP’s finite marginals are Gaussian.


The Indian Buffet Process

The Dirichlet process provides a nonparametric prior for clustering (each observation belongs to exactly one cluster). But what if we want each observation to possess multiple latent features? The Indian Buffet Process (IBP), introduced by Griffiths and Ghahramani (2005), provides a nonparametric prior on binary feature matrices with infinitely many columns.

Definition 10 (Indian Buffet Process).

Consider $n$ customers sequentially visiting an Indian buffet with infinitely many dishes. Customer 1 tries $\text{Poisson}(\alpha)$ dishes. Customer $i$ (for $i \geq 2$):

  • tries each previously tasted dish $k$ with probability $m_k / i$, where $m_k$ is the number of previous customers who tried dish $k$,
  • then tries $\text{Poisson}(\alpha / i)$ new dishes that no previous customer has tried.

The result is a random binary matrix $\mathbf{Z} \in \{0,1\}^{n \times K}$, where $K$ is the (random, potentially infinite) number of dishes tasted by at least one customer, and $Z_{ik} = 1$ if customer $i$ tried dish $k$.

IBP feature allocation

Proposition 4 (Properties of the IBP).

If $\mathbf{Z}$ is generated by the IBP with parameter $\alpha$:

  1. The expected total number of dishes is $\mathbb{E}[K] = \alpha H_n$, where $H_n = \sum_{i=1}^n 1/i \approx \log n$ is the $n$-th harmonic number.
  2. The expected number of features per customer is $\alpha$.
  3. The distribution on equivalence classes of binary matrices (up to column permutation) is exchangeable.
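The generative scheme is only a few lines of code, and a small simulation confirms properties 1 and 2 empirically (a sketch, assuming NumPy; the helper `simulate_ibp` and all parameter values are ours):

```python
import numpy as np

def simulate_ibp(n, alpha, rng):
    """One IBP draw: returns the list of dish counts m_k."""
    m = []                                  # m[k] = customers who tried dish k
    for i in range(1, n + 1):
        # existing dishes: customer i tries dish k with probability m_k / i
        for k in range(len(m)):
            if rng.random() < m[k] / i:
                m[k] += 1
        # new dishes: Poisson(alpha / i) of them, each tried once so far
        m += [1] * rng.poisson(alpha / i)
    return m

rng = np.random.default_rng(42)
n, alpha, trials = 10, 2.0, 3000
draws = [simulate_ibp(n, alpha, rng) for _ in range(trials)]
Ks = [len(m) for m in draws]                # total dishes per draw
tot = [sum(m) for m in draws]               # total feature assignments per draw
H_n = sum(1.0 / i for i in range(1, n + 1))

print(np.mean(Ks))        # should be near alpha * H_n ≈ 5.86
print(np.mean(tot) / n)   # features per customer, should be near alpha = 2
```

The total dish count is a sum of independent $\text{Poisson}(\alpha/i)$ innovations, which is exactly why $\mathbb{E}[K] = \alpha H_n$.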

Remark (DP vs IBP).

The DP and IBP are complementary priors for different latent structures:

Property | Dirichlet Process | Indian Buffet Process
--- | --- | ---
Prior on | Probability measures | Binary matrices
Observation model | Each $x_i$ belongs to one cluster | Each $x_i$ has multiple features
Analogy | Chinese Restaurant Process | Indian Buffet
Underlying process | Beta (stick-breaking) | Beta process
Expected components | $\alpha \log n$ clusters | $\alpha H_n$ features

The IBP can be derived from a beta process prior, just as the CRP arises from the DP. The beta process $B \sim \text{BP}(c, B_0)$ is a completely random measure whose atoms have weights in $[0,1]$; each observation independently “selects” each atom with probability equal to its weight, producing the binary matrix $\mathbf{Z}$.


Posterior Consistency and Contraction Rates

The deepest connection between Bayesian nonparametrics and the PAC learning framework lies in the theory of posterior consistency. Just as PAC learning asks “does the learner converge to a good hypothesis as $n \to \infty$?”, posterior consistency asks “does the Bayesian posterior converge to the truth?”

Definition 11 (Posterior Consistency).

A Bayesian nonparametric model with prior $\Pi$ on a parameter space $\Theta$ is posterior consistent at the true parameter $\theta_0$ if, for every neighborhood $U$ of $\theta_0$ (in an appropriate topology),

$$\Pi(\theta \in U \mid X_1, \ldots, X_n) \xrightarrow{P_{\theta_0}} 1 \quad \text{as } n \to \infty.$$

Definition 12 (Posterior Contraction Rate).

The posterior contracts at rate $\varepsilon_n \to 0$ around $\theta_0$ if

$$\Pi(d(\theta, \theta_0) > M\varepsilon_n \mid X_1, \ldots, X_n) \xrightarrow{P_{\theta_0}} 0 \quad \text{as } n \to \infty,$$

for every $M > 0$ and some metric $d$ on $\Theta$. The rate $\varepsilon_n$ measures how fast the posterior concentrates.

Schwartz’s Theorem

The classical result on posterior consistency is due to Schwartz (1965), who identified the key condition: the prior must assign positive probability to Kullback-Leibler neighborhoods of the truth.

Theorem 7 (Schwartz's Theorem).

Let $P_0$ be the true data-generating distribution with density $p_0$, and let $\Pi$ be a prior on a space of densities. If $P_0$ is in the Kullback-Leibler support of $\Pi$ — that is, for every $\varepsilon > 0$,

$$\Pi\!\left(\left\{p : \text{KL}(p_0 \| p) < \varepsilon\right\}\right) > 0,$$

then the posterior is weakly consistent at $P_0$: for every weak neighborhood $U$ of $P_0$,

$$\Pi(p \in U \mid X_1, \ldots, X_n) \to 1 \quad P_0\text{-a.s.}$$

Proof.

The proof uses three key steps:

  1. Posterior ratio. For any measurable set $A$ in the complement of $U$, Bayes' rule gives:

$$\Pi(A \mid X_1, \ldots, X_n) = \frac{\int_A \prod_{i=1}^n \frac{p(X_i)}{p_0(X_i)} \, d\Pi(p)}{\int \prod_{i=1}^n \frac{p(X_i)}{p_0(X_i)} \, d\Pi(p)}.$$

  2. Numerator control. By the law of large numbers, for each $p$ with $\text{KL}(p_0 \| p) > \varepsilon$, the log-likelihood ratio satisfies $\frac{1}{n}\sum_i \log\frac{p(X_i)}{p_0(X_i)} \to -\text{KL}(p_0 \| p) < -\varepsilon$ almost surely, so the numerator decays exponentially fast (uniformity over $A$ is supplied by a testing argument).

  3. Denominator control. The KL support condition ensures the denominator decays more slowly than any exponential: for every $\varepsilon > 0$, $e^{n\varepsilon} \int \prod_{i=1}^n \frac{p(X_i)}{p_0(X_i)} \, d\Pi(p) \to \infty$ $P_0$-almost surely, because the prior places positive mass on densities within KL distance $\varepsilon$ of $p_0$.

Combining, the ratio $\Pi(A \mid X_1, \ldots, X_n) \to 0$ almost surely.
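Schwartz's mechanism is easy to watch in a toy model (a sketch; the uniform grid prior over Bernoulli parameters and all constants are ours): the prior trivially satisfies the KL support condition at $p_0$, and the posterior mass of a small neighborhood of the truth climbs toward 1 as $n$ grows:

```python
import numpy as np

rng = np.random.default_rng(1)
p0 = 0.3                                 # true Bernoulli parameter
grid = np.linspace(0.01, 0.99, 99)       # discrete prior support, uniform mass
X = rng.random(2000) < p0                # 2000 Bernoulli(p0) observations

def posterior_mass_near_truth(n, radius=0.05):
    """Posterior probability of {|p - p0| < radius} after n observations."""
    s = X[:n].sum()
    log_lik = s * np.log(grid) + (n - s) * np.log(1 - grid)
    post = np.exp(log_lik - log_lik.max())   # subtract max for stability
    post /= post.sum()
    return post[np.abs(grid - p0) < radius].sum()

for n in [10, 100, 2000]:
    print(n, posterior_mass_near_truth(n))   # mass near p0 grows toward 1
```

This is exactly the numerator/denominator race in the proof: candidates with $\text{KL}(p_0 \| p) > \varepsilon$ lose likelihood exponentially fast relative to those near $p_0$.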

Posterior Contraction Rates

Modern theory (Ghosal, Ghosh & van der Vaart, 2000) goes beyond consistency to rates. The key result establishes that posterior contraction rates are governed by the interplay between prior concentration (how much mass the prior places near the truth) and model complexity (measured by metric entropy).

Posterior consistency

Theorem 8 (Posterior Contraction Rate (Ghosal, Ghosh & van der Vaart)).

Suppose the following conditions hold for a sequence $\varepsilon_n \to 0$ with $n\varepsilon_n^2 \to \infty$:

  1. Prior concentration: $\Pi(\{p : \text{KL}(p_0 \| p) \leq \varepsilon_n^2,\; V_2(p_0, p) \leq \varepsilon_n^2\}) \geq e^{-c_1 n\varepsilon_n^2}$, where $V_2(p_0, p) = P_0[(\log(p_0/p))^2] - [\text{KL}(p_0 \| p)]^2$.

  2. Sieve complexity: There exist sets $\Theta_n$ (sieves) with $\Pi(\Theta_n^c) \leq e^{-c_2 n\varepsilon_n^2}$ and $\log N(\varepsilon_n, \Theta_n, d) \leq c_3 n\varepsilon_n^2$, where $N(\varepsilon, \Theta, d)$ is the $\varepsilon$-covering number.

Then $\Pi(d(p, p_0) > M\varepsilon_n \mid X_1, \ldots, X_n) \to 0$ in $P_0$-probability for sufficiently large $M$.
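In practice the rate $\varepsilon_n$ is found by balancing the two conditions: solve $\log N(\varepsilon_n, \Theta_n, d) \asymp n\varepsilon_n^2$. For a parametric-like class with $N(\varepsilon) \approx (C/\varepsilon)^d$ this balance gives $\varepsilon_n \asymp \sqrt{d \log n / n}$, which a few lines of bisection make concrete (a sketch; the constants $C$, $d$ and the bisection tolerance are ours):

```python
import math

def contraction_rate(n, d=1.0, C=1.0):
    """Solve the entropy balance n * eps^2 = d * log(C / eps) by bisection."""
    f = lambda eps: n * eps**2 - d * math.log(C / eps)  # increasing in eps
    lo, hi = 1e-12, 1.0
    for _ in range(200):
        mid = (lo + hi) / 2
        if f(mid) > 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2

for n in [100, 10_000, 1_000_000]:
    eps = contraction_rate(n)
    # the rescaled rate eps * sqrt(n / log n) should be roughly constant
    print(n, eps, eps * math.sqrt(n / math.log(n)))
```

The same balancing act with $\log N(\varepsilon) \asymp \varepsilon^{-d/\beta}$ (the entropy of a $\beta$-Hölder ball) recovers the nonparametric rate $n^{-\beta/(2\beta+d)}$ cited below.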

Remark (Connection to PAC Learning).

The parallel between posterior contraction and PAC learning is striking:

PAC Learning | Posterior Contraction
--- | ---
Sample complexity $n(\varepsilon, \delta)$ | Contraction rate $\varepsilon_n$
VC dimension / Rademacher complexity | Metric entropy $\log N(\varepsilon_n, \Theta_n, d)$
Approximation error (bias) | Prior concentration (KL support)
Estimation error (variance) | Sieve complexity (covering number)
Structural risk minimization | Bayesian model selection (marginal likelihood)

Both frameworks say: learning succeeds when the model class is rich enough to approximate the truth (low bias) but structured enough to avoid overfitting (controlled complexity). The key difference is the mechanism: PAC bounds are worst-case over distributions, while Bayesian rates depend on the prior and are typically average-case.

Consistency of DP Mixture Models

Proposition 5 (Consistency of DP Gaussian Mixtures).

Let $P_0$ be a distribution with a continuous, bounded density $p_0$ on $\mathbb{R}^d$. A Dirichlet process mixture of Gaussians with base measure $G_0 = \text{NIW}(\mu_0, \kappa_0, \nu_0, \Psi_0)$ and any concentration parameter $\alpha > 0$ is posterior consistent at $P_0$.

Proof.

We verify the KL support condition of Schwartz’s theorem. For any $\varepsilon > 0$, we need to show $\Pi(\text{KL}(p_0 \| p) < \varepsilon) > 0$.

  1. Since Gaussian mixtures are dense in continuous densities (in the $L^1$ sense), for any $\varepsilon' > 0$, there exists a finite Gaussian mixture $q = \sum_{k=1}^K \pi_k \mathcal{N}(\mu_k, \Sigma_k)$ with $\text{KL}(p_0 \| q) < \varepsilon' < \varepsilon$.

  2. The DP prior assigns positive probability to any finite mixture: the weights $(\pi_1, \ldots, \pi_K)$ can be approximated by stick-breaking weights (each $V_k \sim \text{Beta}(1, \alpha)$ has full support on $(0,1)$), and the atoms $(\mu_k, \Sigma_k)$ can be approximated since $G_0$ is a non-degenerate NIW (with full support on $\mathbb{R}^d \times S_+^d$).

  3. Since the KL divergence is continuous in the density (in the $L^1$ topology), a neighborhood of $q$ in the prior also satisfies $\text{KL}(p_0 \| p) < \varepsilon$, and this neighborhood has positive prior probability.
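Step 2 can be made concrete with a truncated stick-breaking draw (a sketch; for simplicity we use a univariate normal–inverse-gamma stand-in for the NIW base measure, and the truncation level and hyperparameters are ours): a random density from the prior is itself a countable Gaussian mixture, and its truncation integrates to nearly 1:

```python
import numpy as np

rng = np.random.default_rng(7)
alpha, T = 1.0, 200                       # concentration, truncation level

# stick-breaking weights w_k = V_k * prod_{j<k} (1 - V_j)
V = rng.beta(1.0, alpha, size=T)
w = V * np.cumprod(np.concatenate(([1.0], 1 - V[:-1])))

# atoms from a simple univariate base measure (stand-in for the NIW)
mu = rng.normal(0.0, 3.0, size=T)                  # component means
sig = 1.0 / np.sqrt(rng.gamma(3.0, 1.0, size=T))   # component std devs

# the induced random density: sum_k w_k N(mu_k, sig_k^2), on a grid
x = np.linspace(-30, 30, 6001)
dens = sum(wk * np.exp(-0.5 * ((x - m) / s)**2) / (s * np.sqrt(2 * np.pi))
           for wk, m, s in zip(w, mu, sig))
integral = dens.sum() * (x[1] - x[0])     # Riemann-sum approximation

print(w.sum())     # truncated weights: very close to 1 for T = 200
print(integral)    # the random density integrates to ~ 1
```

The leftover stick mass after $T$ breaks is $\prod_{k \le T}(1 - V_k)$, which vanishes geometrically — this is why finite truncations approximate the full DP mixture arbitrarily well.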

Remark (Minimax Rates).

Under regularity conditions (e.g., $p_0$ is $\beta$-Hölder smooth), DP Gaussian mixture models achieve the near-minimax contraction rate $\varepsilon_n = n^{-\beta/(2\beta + d)}(\log n)^t$ for some $t > 0$. This matches the minimax rate up to a logarithmic factor — the Bayesian nonparametric approach is rate-adaptive, automatically achieving near-optimal rates without needing to know the smoothness $\beta$ in advance.


Connections & Further Reading

Connection Map

Topic | Domain | Relationship
--- | --- | ---
PAC Learning Framework | Probability & Statistics | Posterior contraction rates parallel PAC sample complexity; Bayesian model selection provides an alternative to SRM
Concentration Inequalities | Probability & Statistics | Posterior contraction proofs use concentration of the log-likelihood ratio; GP concentration bounds use sub-Gaussian tail inequalities
Measure-Theoretic Probability | Probability & Statistics | The DP is defined on measure spaces; posterior consistency uses dominated convergence and the law of large numbers
Spectral Theorem | Linear Algebra | GP kernel matrices are positive semi-definite; the eigendecomposition of the kernel determines the GP’s RKHS
SVD | Linear Algebra | Low-rank GP approximations (Nyström method) use the SVD of the kernel matrix
PCA & Low-Rank Approximation | Linear Algebra | Kernel PCA is equivalent to projecting onto the leading eigenfunctions of the GP kernel; functional PCA uses GP priors

Key Notation Summary

Symbol | Meaning
--- | ---
$\text{DP}(\alpha, G_0)$ | Dirichlet process with concentration $\alpha$ and base measure $G_0$
$G_0$ | Base measure (prior guess for DP draws)
$\alpha$ | Concentration parameter
$\delta_\theta$ | Point mass (Dirac delta) at $\theta$
$\text{Dir}(\boldsymbol{\alpha})$ | Dirichlet distribution with parameter vector $\boldsymbol{\alpha}$
$V_k \stackrel{\text{iid}}{\sim} \text{Beta}(1, \alpha)$ | Stick-breaking beta variables
$w_k = V_k \prod_{j<k}(1-V_j)$ | Stick-breaking weight
$\hat{F}_n = \frac{1}{n}\sum_{i=1}^n \delta_{\theta_i}$ | Empirical distribution
$K_n$ | Number of distinct clusters after $n$ observations
$\mathcal{GP}(m, k)$ | Gaussian process with mean $m$ and kernel $k$
$\mathbf{k}_*^\top (\mathbf{K} + \sigma_n^2 \mathbf{I})^{-1} \mathbf{y}$ | GP posterior mean
$\text{KL}(p_0 \| p)$ | Kullback-Leibler divergence
$N(\varepsilon, \Theta, d)$ | $\varepsilon$-covering number of $\Theta$ under metric $d$
$n^{-\beta/(2\beta+d)}$ | Minimax contraction rate for $\beta$-smooth densities in $\mathbb{R}^d$


References & Further Reading

  • paper A Bayesian Analysis of Some Nonparametric Problems — Ferguson (1973) The foundational paper defining the Dirichlet process
  • paper A Constructive Definition of Dirichlet Priors — Sethuraman (1994) The stick-breaking construction
  • paper Ferguson Distributions Via Pólya Urn Schemes — Blackwell & MacQueen (1973) The Pólya urn representation and exchangeability
  • book Gaussian Processes for Machine Learning — Rasmussen & Williams (2006) Standard reference for GP theory and computation
  • book Fundamentals of Nonparametric Bayesian Inference — Ghosal & van der Vaart (2017) Comprehensive treatment of posterior consistency and contraction rates
  • paper Infinite Latent Feature Models and the Indian Buffet Process — Griffiths & Ghahramani (2005) Introduction of the IBP
  • paper Markov Chain Sampling Methods for Dirichlet Process Mixture Models — Neal (2000) Gibbs sampling algorithms for DPMMs
  • paper Convergence rates of posterior distributions — Ghosal, Ghosh & van der Vaart (2000) The general posterior contraction rate theorem
  • paper On Bayes procedures — Schwartz (1965) The foundational posterior consistency theorem
  • book Bayesian Nonparametrics — Hjort, Holmes, Müller & Walker (2010) Comprehensive survey of BNP methods