
Variational Bayes for Model Selection

The variational evidence approximation as a model-selection criterion — Laplace bridge to the BIC, KL projection bias, importance-weighted and annealed bias correction, and the bits-back coding interpretation

Overview & Motivation

Suppose we have thirty observations drawn from a noisy cubic — a curve we don’t know in advance, and that we’re trying to recover. We choose a Bayesian polynomial regression model

$$y_i \,=\, \beta_0 + \beta_1 x_i + \cdots + \beta_d x_i^d + \varepsilon_i, \qquad \varepsilon_i \sim \mathcal{N}(0, \sigma^2), \qquad \boldsymbol\beta \sim \mathcal{N}(\mathbf{0}, \tau^2 I),$$

and let the polynomial degree $d$ index a family of candidate models $\{M_1, M_2, \ldots, M_9\}$. The choice of $d$ is itself a modeling choice. Small $d$ gives a rigid, low-capacity fit — for $d = 1$ we’re committing to a straight line through data the line cannot possibly explain. Large $d$ gives the model enough flexibility to overfit — for $d = 9$ we have ten free coefficients trying to explain thirty noisy points, and at some point we’re fitting the noise. The right choice is somewhere in between. The Bayesian way to choose is to compute the marginal likelihood of each candidate,

$$p(y \mid M_d) \;=\; \int p(y \mid \boldsymbol\beta, M_d) \, p(\boldsymbol\beta \mid M_d) \, d\boldsymbol\beta,$$

and prefer the model with the highest score. This is Bayesian model selection in its classical form, and it underlies every Bayes-factor comparison.
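As a concrete reference point, here is a minimal sketch of the closed-form sweep (illustrative, not the post's notebook code): under the conjugate Gaussian model above, $y \sim \mathcal{N}(\mathbf{0},\, \sigma^2 I + \tau^2 X X^\top)$ marginally, so $\log p(y \mid M_d)$ is a single multivariate-normal log-density. The cubic coefficients, noise level, and prior variance below are assumptions chosen for illustration.

```python
# A minimal sketch of the closed-form log-evidence sweep for conjugate Bayesian
# polynomial regression; hyperparameters and the true cubic are illustrative.
import numpy as np
from scipy.stats import multivariate_normal

def log_evidence(x, y, d, sigma2=0.0625, tau2=1.0):
    """Closed-form log p(y | M_d): marginally y ~ N(0, sigma2 I + tau2 X X^T)."""
    X = np.vander(x, N=d + 1, increasing=True)        # columns 1, x, ..., x^d
    cov = sigma2 * np.eye(len(x)) + tau2 * X @ X.T     # marginal covariance of y
    return multivariate_normal(mean=np.zeros(len(x)), cov=cov).logpdf(y)

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=30)
y = 0.5 * x - 1.2 * x**2 + 2.0 * x**3 + rng.normal(0, 0.25, size=30)  # noisy cubic

scores = {d: log_evidence(x, y, d) for d in range(1, 10)}
print(max(scores, key=scores.get))   # typically 3: the evidence recovers the cubic
```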

What does the marginal log-evidence look like as we sweep $d$ from 1 to 9?

Figure: Bayesian Occam's razor on polynomial regression — three traces of log-evidence against polynomial degree $d = 1..9$. The closed-form log-evidence (blue) peaks at $d = 3$, the true cubic. The Laplace approximation (gray) overlays it exactly — a feature of Gaussian-conjugate priors. The mean-field ELBO (red) tracks the evidence near the simple-model end and pulls away as the posterior gains correlation structure. The growing gap is the topic.

The closed-form curve in blue is what we’d compute analytically: the marginal likelihood under the prior, integrated over $\boldsymbol\beta$. Two features stand out. First, the curve has an interior maximum at $d = 3$ — the data really were drawn from a cubic, and the evidence picks that up. Anything less is too rigid to fit, anything more is paying an Occam penalty for unused capacity. Second, the cliff between the underfit and well-fit regimes is striking, because lower-order curves fundamentally cannot fit the data and the evidence reflects that.

The mechanism behind the smooth Occam decay past $d = 3$ is structural: the prior $\mathcal{N}(\mathbf{0}, \tau^2 I)$ spreads mass over a $(d{+}1)$-dimensional coefficient space, and as that space grows the prior dilutes the predictive density at any particular dataset. High-capacity models lose out automatically, with no external regularizer needed. This is Bayesian Occam’s razor — the canonical “complex models are penalized by their own diffuseness” argument MacKay (1992, 2003 Ch. 28) developed in detail. We get it for free, and it is the reason Bayes-factor model comparison was the standard approach to model selection long before AIC, BIC, and cross-validation arrived as alternatives.

A second curve runs along the first, in gray dashed: the Laplace approximation to the evidence, computed by replacing the posterior with a Gaussian centered at the maximum-a-posteriori estimate. For Gaussian-conjugate models, the Laplace approximation is exact — no asymptotic remainder — because the joint log-density is exactly quadratic in $\boldsymbol\beta$. The gray curve overlays the blue one to within numerical precision. We’ll exploit this in §3 to derive the BIC as the leading-order term in the Laplace expansion as $n \to \infty$, and the constant Schwarz dropped will be a recurring theme — it is the Bayesian geometric content that classical information criteria discard.

A third curve, in red, sits below both: the evidence lower bound, the ELBO of the optimal mean-field variational approximation. By Jensen’s inequality the ELBO is a lower bound on the log-evidence, and by Variational Inference we know this bound is tight when the variational family contains the posterior. As $d$ grows, the posterior becomes a multivariate Gaussian with non-trivial correlation, and the mean-field family — which forces independence between coordinates — can no longer recover it. The bound becomes loose, and the gap grows.

That growing gap is the topic. The ELBO is not a wrong estimator of the evidence; it is a biased estimator with a known mechanism. The bias has a sign — it always understates the evidence — and it has a structure: it equals the reverse Kullback–Leibler divergence from the variational family’s optimum to the true posterior. As the family grows in expressiveness, the gap shrinks; as the posterior grows in complexity, the gap grows. Variational Bayes for model selection — VBMS — uses the ELBO as a stand-in for $\log p(y \mid M)$ in Bayes-factor model comparisons: fast, biased, with structure that we’ll learn to characterize, correct, and diagnose.

The plan of attack runs in three phases. First (§§2–3) we’ll define the VBMS criterion formally and prove the asymptotic bridge to the Bayesian Information Criterion via the Laplace expansion — the classical linkage between Bayesian and frequentist model selection that BIC made famous. Second (§§4–5) we’ll work two iconic examples end-to-end: a Bayesian Gaussian mixture model where mean-field VB performs automatic relevance determination over mixture components (Bishop 2006 §10.2) and a polynomial-regression Bayes-factor ranking where we can compare ELBO-based rankings to the closed-form evidence and see exactly when they agree (most of the time) and when they don’t. Third (§§6–10) we’ll characterize the bias geometrically as KL projection (§6), pin down when the bias preserves model rankings versus when it flips them (§7), develop bias-correction tools — importance-weighted ELBO and annealed importance sampling (§§8–9) — and close with the information-theoretic bits-back coding interpretation that places VBMS within the minimum-description-length framework (§10). Sections §§11–13 are computational notes, modern-ML applications, and synthesis.

The reader should leave with a precise statement of what the ELBO is and is not as an evidence estimator, a working knowledge of when VBMS is reliable and when to reach for AIS or SMC instead, and a small toolbox of diagnostic and corrective methods — Pareto-$\hat k$, IWELBO, AIS schedule design — for catching pathologies before they corrupt model selection.

Interactive figure: Bayesian Occam on polynomial regression ($n = 30$, $\sigma^2 = 0.0625$; $\arg\max_d \log p(y \mid M_d)$ at $d = 3$). The shaded region between the blue and red traces is the KL projection bias of §6. Increasing $\tau^2$ weakens the prior and smooths the Occam decay; decreasing $\tau^2$ penalizes high-$d$ models more sharply.

From ELBO to log-evidence — the VBMS criterion

§1 introduced the polynomial-regression Occam picture and the central observation: the mean-field ELBO sits below the closed-form log-evidence by a quantity that grows with model complexity. Before we can use the ELBO as a model-selection criterion, we need to make the relationship between ELBO and log-evidence precise. The variational-inference topic developed this from scratch; we recap the essentials here with model-selection emphasis.

A model $M$ with parameter $\theta$ and likelihood $p(y \mid \theta, M)$ places a prior $p(\theta \mid M)$ over its parameters. The marginal log-evidence

$$\log p(y \mid M) \;=\; \log \int p(y \mid \theta, M) \, p(\theta \mid M) \, d\theta$$

is the integral the Bayesian model-selection literature has been computing — or trying to compute — for half a century. Outside conjugate cases the integral is intractable. Variational inference replaces it with the evidence lower bound (ELBO) of an arbitrary distribution $q(\theta)$ on the parameter space:

$$\mathrm{ELBO}_M(q) \;:=\; \mathbb{E}_q[\log p(y, \theta \mid M)] - \mathbb{E}_q[\log q(\theta)].$$

Two facts (Variational Inference §2): the ELBO is a uniform lower bound on the log-evidence by Jensen’s inequality applied to the importance-sampling identity, and maximizing the ELBO over $q$ is equivalent to minimizing the reverse KL divergence $\mathrm{KL}(q \,\|\, p(\cdot \mid y, M))$, so the variational optimum $q^*_M$ is the closest member of $\mathcal{Q}_M$ to the true posterior under reverse-KL projection.

The VBMS criterion is the natural one — for each candidate model $M_k$ with variational family $\mathcal{Q}_k$, compute the maximum ELBO over $\mathcal{Q}_k$ and compare across models.

Definition 1 (VBMS criterion).

Let $M$ be a model with prior $p(\theta \mid M)$, likelihood $p(y \mid \theta, M)$, and variational family $\mathcal{Q}_M$. The variational Bayes for model selection criterion of $M$ is

$$\mathrm{VBMS}(M) \;:=\; \max_{q \in \mathcal{Q}_M} \mathrm{ELBO}_M(q).$$

We rank competing models $\{M_1, \ldots, M_K\}$ by $\mathrm{VBMS}(M_k)$, preferring the model with the highest score. The ratio $\exp\bigl(\mathrm{VBMS}(M_i) - \mathrm{VBMS}(M_j)\bigr)$ is the VBMS Bayes factor for $M_i$ versus $M_j$.

The fundamental identity linking VBMS to the log-evidence is the same evidence decomposition that anchored variational-inference §2 — restated here per model.

Theorem 1 (Evidence decomposition).

For any model $M$ and any $q \in \mathcal{Q}_M$ absolutely continuous with respect to $p(\cdot \mid y, M)$,

$$\log p(y \mid M) \;=\; \mathrm{ELBO}_M(q) \,+\, \mathrm{KL}\bigl(q \,\|\, p(\cdot \mid y, M)\bigr).$$
Proof.

From the definition of reverse KL,

$$\mathrm{KL}\bigl(q \,\|\, p(\cdot \mid y, M)\bigr) \;=\; \mathbb{E}_q[\log q(\theta)] - \mathbb{E}_q[\log p(\theta \mid y, M)].$$

Substitute Bayes’ rule $p(\theta \mid y, M) = p(y, \theta \mid M) / p(y \mid M)$ and pull $\log p(y \mid M)$ out of the expectation (it doesn’t depend on $\theta$):

$$\mathrm{KL}\bigl(q \,\|\, p(\cdot \mid y, M)\bigr) \;=\; \mathbb{E}_q[\log q(\theta)] - \mathbb{E}_q[\log p(y, \theta \mid M)] + \log p(y \mid M).$$

The first two terms on the right are precisely $-\mathrm{ELBO}_M(q)$. Rearranging gives the claim.

The decomposition is exact, requires no asymptotic regularity, and holds for any $q$ with the right support. Three consequences anchor the rest of the topic.

The ELBO is a uniform lower bound on the log-evidence. Reverse KL is non-negative, so $\mathrm{ELBO}_M(q) \le \log p(y \mid M)$ for all $q$. Equality holds if and only if $q = p(\cdot \mid y, M)$ almost everywhere — that is, when $\mathcal{Q}_M$ contains the posterior. For Gaussian-conjugate models with full-rank Gaussian variational family, equality is achievable; for almost everything else of practical interest, it is not.

The bias has a sign. VBMS systematically underestimates the log-evidence: $\mathrm{VBMS}(M) \le \log p(y \mid M)$, with the gap $\log p(y \mid M) - \mathrm{VBMS}(M) = \mathrm{KL}(q^*_M \,\|\, p(\cdot \mid y, M)) \ge 0$. As §6 will show, this gap is monotonically non-increasing in family expressiveness — adding capacity to $\mathcal{Q}_M$ can only improve the bound, never worsen it.

The bias is what we have to characterize. All of §6 (KL projection mechanism) and §7 (when does the bias preserve rankings versus flip them?) is about understanding this gap.

Three forms of the ELBO

Definition 1 gives one expression. Two algebraic rearrangements produce equivalent expressions, each useful in a different context.

Form 1 — Energy plus entropy. Substituting $H(q) = -\mathbb{E}_q[\log q]$ into Definition 1,

$$\mathrm{ELBO}_M(q) \;=\; \mathbb{E}_q[\log p(y, \theta \mid M)] + H(q).$$

The ELBO rewards $q$-mass in regions of high joint log-density (the energy term, by analogy with the Boltzmann distribution’s $-\beta E$) and rewards $q$ for being spread out (the entropy term). Maximizing the ELBO trades off between concentrating on the mode of $p(y, \theta \mid M)$ and avoiding collapse to a delta. This is the form §10 leans on for the bits-back coding interpretation — the energy term is an expected description length, the entropy term is the bits-back savings, and the variational free energy $-\mathrm{ELBO}$ is the expected net code length.

Form 2 — Reconstruction minus prior-KL. Decomposing $\log p(y, \theta \mid M) = \log p(y \mid \theta, M) + \log p(\theta \mid M)$,

$$\mathrm{ELBO}_M(q) \;=\; \mathbb{E}_q[\log p(y \mid \theta, M)] - \mathrm{KL}\bigl(q \,\|\, p(\cdot \mid M)\bigr).$$

The ELBO rewards $q$ for placing mass on $\theta$ values that explain the data (the reconstruction term) while penalizing $q$ for moving away from the prior (the prior-KL term). This is the form most VBMS implementations use in practice — it is how PyMC, Stan, and NumPyro compute the ELBO during ADVI optimization, and how variational autoencoders are trained.

Form 3 — Evidence minus posterior-KL. This is Theorem 1 rearranged:

$$\mathrm{ELBO}_M(q) \;=\; \log p(y \mid M) - \mathrm{KL}\bigl(q \,\|\, p(\cdot \mid y, M)\bigr).$$

The ELBO equals the log-evidence (a constant in $q$) minus the reverse-KL gap. This is the cleanest form for analyzing VBMS as a model-selection criterion — the bias is the second term, and it is the entire content of §6’s bias analysis.
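To make Theorem 1 concrete, here is a small numerical check (a sketch, not from the post) on a one-dimensional conjugate Gaussian model, where the posterior, the log-evidence, and the KL term are all available in closed form: a Monte Carlo estimate of the ELBO plus the exact KL recovers the exact log-evidence for any choice of $q$. The numbers are made up.

```python
# Numerical check of Theorem 1: log p(y) = ELBO(q) + KL(q || posterior)
# on a 1D conjugate Gaussian model (illustrative numbers; any q with full support works).
import numpy as np

rng = np.random.default_rng(1)
sigma2, tau2 = 0.5, 2.0                        # likelihood and prior variances
y = rng.normal(1.0, np.sqrt(sigma2), size=20)  # data generated at theta = 1.0

# Exact posterior N(m, v) and exact log-evidence for theta ~ N(0, tau2), y_i | theta ~ N(theta, sigma2)
n = len(y)
v = 1.0 / (n / sigma2 + 1.0 / tau2)
m = v * y.sum() / sigma2
log_evidence = (-0.5 * n * np.log(2 * np.pi * sigma2)
                + 0.5 * np.log(v / tau2)
                - 0.5 * (y @ y) / sigma2 + 0.5 * m**2 / v)

# A deliberately mismatched variational Gaussian q = N(mq, vq)
mq, vq = 0.3, 0.8

def elbo(mq, vq, n_samples=200_000):
    theta = rng.normal(mq, np.sqrt(vq), size=n_samples)
    log_joint = (-0.5 * ((y[None, :] - theta[:, None])**2 / sigma2 + np.log(2 * np.pi * sigma2)).sum(1)
                 - 0.5 * (theta**2 / tau2 + np.log(2 * np.pi * tau2)))
    log_q = -0.5 * ((theta - mq)**2 / vq + np.log(2 * np.pi * vq))
    return np.mean(log_joint - log_q)            # Monte Carlo ELBO

kl_q_post = 0.5 * (vq / v + (mq - m)**2 / v - 1.0 + np.log(v / vq))
print(log_evidence, elbo(mq, vq) + kl_q_post)    # agree up to Monte Carlo error
```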

The VBMS Bayes factor and the bias-asymmetry decomposition

For VBMS specifically, Form 3 is the operationally important one. Computing $\mathrm{VBMS}(M_i) - \mathrm{VBMS}(M_j)$ as a Bayes-factor approximation, the log-evidences and the KL gaps decompose cleanly:

$$\mathrm{VBMS}(M_i) - \mathrm{VBMS}(M_j) \;=\; \log \frac{p(y \mid M_i)}{p(y \mid M_j)} \;-\; \Bigl[\mathrm{KL}\bigl(q^*_i \,\|\, p(\cdot \mid y, M_i)\bigr) - \mathrm{KL}\bigl(q^*_j \,\|\, p(\cdot \mid y, M_j)\bigr)\Bigr].$$

Read literally: the VBMS Bayes factor equals the true log Bayes factor minus an asymmetry between the two models’ reverse-KL gaps. When the bias asymmetry is small relative to the true Bayes factor, VBMS preserves rankings; when the asymmetry is large, rankings can flip. §7 makes this precise — Wang & Blei’s (2019) consistency theorem says that under regularity conditions on the families and the model gap growing in $n$, the asymmetry stays bounded while the Bayes factor diverges, so VBMS rankings agree with closed-form Bayes factors at large $n$.

A practical note: maximizing the ELBO over $\mathcal{Q}_M$ is generally non-convex in the parameters of $q$, so $\mathrm{VBMS}(M)$ in practice means the ELBO at a local optimum found by ADVI or CAVI. In conjugate-exponential models with mean-field $q$ (CAVI updates closed-form), the local optimum is global; outside that case, the standard recipe is multiple random restarts with the maximum-ELBO restart selected (§11 returns to this).

Figure: The evidence decomposition across three nested families on the polynomial-regression target from §1, shown as stacked bars of ELBO + KL. The total height equals log p(y) — constant in q. The ELBO segment grows and the KL gap shrinks as the family grows: mean-field, block-diagonal, full-rank Gaussian (where the family contains the conjugate posterior and KL = 0). The bias of §6 is the KL segment.

The figure visualizes Theorem 1 across three nested families on the polynomial-regression target from §1. Total bar height equals $\log p(y \mid M_4)$, which is constant in $q$. The proportion split between ELBO segment and KL segment varies as $q$ ranges from mean-field through block-diagonal to full-rank. The full-rank Gaussian contains the conjugate posterior exactly, so the KL segment vanishes and the ELBO equals the log-evidence. Mean-field has positive KL gap, block-diagonal has less — the gap shrinks monotonically as the family grows, exactly as Theorem 1 plus the §6 monotonicity result will predict.


We now have the operational identity and the criterion. The next two sections turn the criterion into a useful model-selection tool: §3 builds the asymptotic bridge to the BIC via the Laplace expansion, and §4 works the canonical Bishop-style automatic-relevance-determination example end-to-end.

Laplace approximation and the BIC bridge

VBMS replaces the intractable evidence integral with an optimization over a variational family. The classical alternative — predating VBMS by half a century — replaces the integral with a Gaussian. The Laplace approximation says: assume the joint log-density $\log p(y, \theta \mid M)$ has a single mode at $\hat\theta_{\mathrm{MAP}}$, take the second-order Taylor expansion around that mode, and integrate the resulting quadratic form analytically. The exponential of a quadratic is Gaussian, the integral is closed-form, and the result is a recipe for $\log p(y \mid M)$ that requires only finding the MAP and computing the Hessian there. As $n \to \infty$ the MAP becomes increasingly well-localized, the quadratic approximation becomes exact, and the Laplace formula recovers the true log-evidence to higher and higher accuracy.

The classical Bayesian Information Criterion drops out as the leading-order term in the Laplace expansion at large $n$, with everything else thrown away as $O(1)$ noise. This is BIC’s secret history: it is not a frequentist criterion that happens to look Bayesian, it is a truncated Laplace approximation to a Bayesian quantity. We develop the Laplace approximation in this section both because it gives us a fast — and for Gaussian-conjugate models, exact — reference for the log-evidence to compare VBMS against, and because the asymptotic bridge to BIC anchors VBMS in the same model-selection universe as AIC, BIC, WAIC, and the rest of the classical-IC catalog.

The Laplace expansion

Fix a model $M$ with joint log-density

$$\ell(\theta) \;:=\; \log p(y, \theta \mid M) \;=\; \log p(y \mid \theta, M) + \log p(\theta \mid M),$$

a smooth function of $\theta \in \mathbb{R}^d$. Assume $\ell$ has a unique global maximum at the MAP estimator $\hat\theta := \arg\max_\theta \ell(\theta)$, with strictly positive-definite negative Hessian $H(\hat\theta) := -\nabla^2 \ell(\hat\theta)$. Taylor-expand $\ell$ around $\hat\theta$:

$$\ell(\theta) \;=\; \ell(\hat\theta) \,-\, \tfrac{1}{2} (\theta - \hat\theta)^\top H(\hat\theta) (\theta - \hat\theta) \,+\, O\bigl(\|\theta - \hat\theta\|^3\bigr).$$

The linear term vanishes because $\hat\theta$ is a critical point. Exponentiating and integrating, the Gaussian integral is closed-form,

$$\int \exp\!\left[-\tfrac{1}{2}(\theta - \hat\theta)^\top H(\hat\theta)(\theta - \hat\theta)\right] d\theta \;=\; \frac{(2\pi)^{d/2}}{\sqrt{\det H(\hat\theta)}},$$

and substituting and taking logs gives the Laplace expansion.

Theorem 2 (Laplace expansion).

Let $M$ be a model with smooth joint log-density $\ell(\theta) = \log p(y, \theta \mid M)$, MAP estimator $\hat\theta$ at an interior maximum, and positive-definite Hessian $H(\hat\theta) = -\nabla^2 \ell(\hat\theta)$ at $\hat\theta$. Then

$$\log p(y \mid M) \;=\; \ell(\hat\theta) + \tfrac{d}{2}\log(2\pi) - \tfrac{1}{2}\log \det H(\hat\theta) + O\bigl(n^{-1/2}\bigr).$$

The four terms each have a clean interpretation. The first is the data fit at the MAP. The second is the prior contribution, evaluated at the MAP. The third is a Gaussian volume constant. The fourth — the most subtle and most informative — is the geometric correction: $\det H(\hat\theta)$ measures the local curvature of the joint log-density at the MAP, so $-\tfrac{1}{2}\log\det H$ is large positive when the posterior is broad (small curvature) and large negative when the posterior is concentrated (large curvature). It is a local-volume measurement, exactly the “Bayesian Occam” content the §1 figure was capturing.

For Gaussian-conjugate models, the Laplace approximation is exact. The joint log-density $\ell(\theta)$ is then a quadratic in $\theta$ and the Taylor expansion has no remainder past the quadratic term. The $O(n^{-1/2})$ term in Theorem 2 is identically zero. We exploited this in our verified notebook: the closed-form $\log p(y \mid M_d)$ and the Laplace approximation agree to numerical precision (gap $< 10^{-13}$) across all $d$ from 1 to 9. This makes Laplace a free, fast, and exact reference for VBMS comparisons in the conjugate-Gaussian regime; we’ll use it that way in §5.
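Outside the conjugate case there is no closed form, but Theorem 2 is easy to implement directly. The sketch below (illustrative, not the post's notebook) finds the MAP with BFGS and builds the negative Hessian by central finite differences; `log_joint` is a placeholder for whatever computes $\log p(y, \theta \mid M)$.

```python
# A minimal Laplace-approximation sketch (Theorem 2); log_joint(theta) is a
# placeholder for the model's joint log-density log p(y, theta | M).
import numpy as np
from scipy.optimize import minimize

def laplace_log_evidence(log_joint, theta0, eps=1e-4):
    """log p(y|M) ≈ l(theta_hat) + (d/2) log(2*pi) - 0.5 * logdet H(theta_hat)."""
    res = minimize(lambda t: -log_joint(t), theta0, method="BFGS")   # MAP estimate
    theta_hat, d = res.x, len(res.x)

    # Negative Hessian of the log-joint by central finite differences
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            e_i, e_j = np.eye(d)[i] * eps, np.eye(d)[j] * eps
            H[i, j] = -(log_joint(theta_hat + e_i + e_j) - log_joint(theta_hat + e_i - e_j)
                        - log_joint(theta_hat - e_i + e_j) + log_joint(theta_hat - e_i - e_j)) / (4 * eps**2)

    sign, logdet = np.linalg.slogdet(H)
    return log_joint(theta_hat) + 0.5 * d * np.log(2 * np.pi) - 0.5 * logdet
```

For the conjugate polynomial-regression model this returns the closed-form evidence up to finite-difference error; for non-quadratic log-joints it carries the $O(n^{-1/2})$ remainder of Theorem 2.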

The Schwarz BIC as the leading-order Laplace approximation

Now consider $n \to \infty$ with iid data $y_1, \ldots, y_n \mid \theta \sim p(\cdot \mid \theta, M)$. The data-sum term in the Hessian scales linearly in $n$ while the prior contribution stays $O(1)$, so for large $n$ we have $H(\hat\theta) \approx n \cdot I(\hat\theta) + O(1)$, where $I(\theta) = \mathbb{E}_{y \mid \theta}\bigl[-\nabla^2 \log p(y \mid \theta, M)\bigr]$ is the per-observation Fisher information. The MAP $\hat\theta$ converges to the MLE $\hat\theta_{\mathrm{MLE}}$ at rate $O(n^{-1})$, and we can swap one for the other to leading order. Substituting into Theorem 2 and expanding $\log\det(n \, I + O(1)) = d \log n + \log\det I + O(n^{-1})$,

$$\log p(y \mid M) \;=\; \log p(y \mid \hat\theta_{\mathrm{MLE}}, M) - \tfrac{d}{2}\log n + \underbrace{O(1)}_{\text{Bayesian content}} + O\bigl(n^{-1/2}\bigr).$$

The first two terms are exactly $-\mathrm{BIC}(M)/2$ in Schwarz’s (1978) original formulation, where $\mathrm{BIC}(M) := -2 \log p(y \mid \hat\theta_{\mathrm{MLE}}, M) + d \log n$.

Corollary 1 (BIC asymptotic equivalence).

Under the regularity conditions of Theorem 2 plus iid sampling, as $n \to \infty$,

$$\log p(y \mid M) \;=\; -\mathrm{BIC}(M)/2 + C(M, p, \hat\theta) + O\bigl(n^{-1/2}\bigr),$$

where $C(M, p, \hat\theta) := \log p(\hat\theta \mid M) + \tfrac{d}{2}\log(2\pi) - \tfrac{1}{2}\log\det I(\hat\theta)$ is the $O(1)$ “Bayesian content” that BIC discards.

Figure: Laplace and BIC bridge to the closed-form log p(y) on the polynomial-regression spine at fixed d = 3, plotted against sample size n on a log scale. Top: closed-form (blue) and Laplace (gray dashed) overlay exactly because the conjugate-Gaussian case has no asymptotic remainder; -BIC/2 (orange) runs parallel but offset above. Bottom: the offset log p(y) - (-BIC/2) converges to a constant as n → ∞ — the O(1) Bayesian content of Corollary 1.

The bottom panel shows the convergence directly: as $n$ grows, the gap between the closed-form $\log p(y)$ and $-\mathrm{BIC}/2$ converges to a constant — and that constant is exactly $C(M, p, \hat\theta)$ in Corollary 1, the Bayesian content the BIC discards.

A historical note: Schwarz’s (1978) original derivation is exactly the calculation above, presented as a coding-theoretic argument and stripped of the $O(1)$ constants. The frequentist tradition has since rediscovered the BIC under various derivations — consistent model selection, Bayes factor approximation, MDL coding length — but these are all reconstructions of the leading-order Laplace approximation we developed here. For a deeper treatment of the regularity conditions, see Tierney & Kadane (1986) and the Bayesian Neural Networks topic §3, where the Laplace machinery is developed at full depth in the context of weight-space neural-network posteriors. Watanabe’s (2010) extension to singular models — where the Hessian is degenerate, regularity fails, and the BIC penalty $d \log n$ is wrong — produces the WBIC and WAIC of §13.


Variational Bayes for Bayesian GMMs

§3 developed the Laplace approximation and saw it land at the BIC asymptotically. The Bayesian Gaussian mixture model offers something different: a model where Bayesian Occam’s razor operates automatically inside the variational fit itself, without any external model-selection sweep. Pick a generous upper bound $K_{\max}$ on the number of mixture components, run mean-field VB once, and the variational posterior over the mixture weights $\boldsymbol\pi$ drives unused components’ weights to zero. The result is a single fit that effectively chooses its own $K$. This is automatic relevance determination (ARD), and it was the original motivation for “Variational Bayes” as a label — Bishop’s (2006) §10.2 worked example and Beal’s (2003) thesis built the framework around exactly this kind of self-pruning model.

The Bayesian GMM with conjugate hyperprior

The hierarchical Bayesian GMM places conjugate priors on every parameter the original GMM had as a free quantity:

$$\begin{aligned} \boldsymbol\pi &\sim \mathrm{Dir}(\alpha_0 \cdot \mathbf{1}) \\ \boldsymbol\mu_k \mid \boldsymbol\Lambda_k &\sim \mathcal{N}\!\bigl(\mathbf{m}_0, (\beta_0 \boldsymbol\Lambda_k)^{-1}\bigr) \\ \boldsymbol\Lambda_k &\sim \mathrm{Wishart}(\mathbf{W}_0, \nu_0) \\ z_n \mid \boldsymbol\pi &\sim \mathrm{Cat}(\boldsymbol\pi) \\ \mathbf{x}_n \mid z_n = k, \boldsymbol\mu_k, \boldsymbol\Lambda_k &\sim \mathcal{N}(\boldsymbol\mu_k, \boldsymbol\Lambda_k^{-1}) \end{aligned}$$

The Normal-Wishart prior on $(\boldsymbol\mu_k, \boldsymbol\Lambda_k)$ is conjugate for the Gaussian likelihood; the Dirichlet prior on $\boldsymbol\pi$ is conjugate for the categorical assignment $z_n$.

The hyperparameter that does the ARD work is $\alpha_0$. Take $\alpha_0 < 1$ — say $\alpha_0 = 0.1$ — and the Dirichlet prior puts most of its mass at the corners of the simplex, which corresponds to “mostly empty” mixture configurations where only a few components have non-trivial weight. The model is a priori sparse over components. With a uniform prior ($\alpha_0 = 1$), we’d lose the sparsity bias and get back to the standard non-ARD GMM behavior.

Mean-field VB and CAVI updates

The variational family is the standard mean-field factorization with each factor parameterized in its conjugate exponential form: $q(\boldsymbol\pi)$ Dirichlet, $q(\boldsymbol\mu_k, \boldsymbol\Lambda_k)$ Normal-Wishart, $q(z_n)$ Categorical.

Algorithm 1 (CAVI for Bayesian GMM).

Initialize responsibilities $\{r_{nk}\}$ (e.g., from k-means or random Gaussian distances). Iterate:

M-step (parameter updates given responsibilities):

$$\begin{aligned} \alpha_k &= \alpha_0 + N_k \\ \beta_k &= \beta_0 + N_k \\ \mathbf{m}_k &= \tfrac{1}{\beta_k}\bigl(\beta_0 \mathbf{m}_0 + N_k \bar{\mathbf{x}}_k\bigr) \\ \nu_k &= \nu_0 + N_k. \end{aligned}$$

E-step (responsibility updates given parameters):

$$\log r_{nk} \;=\; \mathbb{E}[\log \pi_k] + \tfrac{1}{2}\mathbb{E}[\log \det \boldsymbol\Lambda_k] - \tfrac{D}{2}\log(2\pi) - \tfrac{1}{2}\,\Delta^2_{nk} + \mathrm{const},$$

where $\Delta^2_{nk} = \mathbb{E}[(\mathbf{x}_n - \boldsymbol\mu_k)^\top \boldsymbol\Lambda_k (\mathbf{x}_n - \boldsymbol\mu_k)]$ is the expected Mahalanobis distance under the Normal-Wishart posterior. Normalize $r_{nk}$ across $k$ via softmax, recompute the responsibility-weighted statistics $N_k, \bar{\mathbf{x}}_k, \mathbf{S}_k$, and repeat until the ELBO converges.

The relevant moments are closed-form: $\mathbb{E}[\log \pi_k] = \psi(\alpha_k) - \psi(\sum_j \alpha_j)$ (digamma), and analogous expressions for $\mathbb{E}[\log \det \boldsymbol\Lambda_k]$ and $\Delta^2_{nk}$. Bishop’s (2006) §10.2 derives all of this; we don’t reprove it here.

Bayesian Occam at the entropy term

The ARD effect comes from the Dirichlet hyperprior interacting with the entropy term in the ELBO.

Theorem 3 (Bayesian Occam at the entropy term).

For the Bayesian GMM with $K_{\max}$ components and Dirichlet hyperprior $\mathrm{Dir}(\alpha_0 \cdot \mathbf{1})$ with $\alpha_0 < 1$, the variational optimum $q^*(\boldsymbol\pi) = \mathrm{Dir}(\alpha_1^*, \ldots, \alpha_{K_{\max}}^*)$ has $\alpha_k^* = \alpha_0 + N_k$. Components with $N_k \to 0$ get $\alpha_k^* \to \alpha_0$, contributing posterior expected weight

$$\mathbb{E}_{q^*}[\pi_k] \;=\; \frac{\alpha_k^*}{\sum_j \alpha_j^*} \;\to\; \frac{\alpha_0}{\sum_j \alpha_j^*}$$

which shrinks toward zero as the active components accumulate data. The result: at the variational optimum, the posterior over $\boldsymbol\pi$ concentrates on a small subset of components determined by which ones have data assigned to them.

The mechanism is direct: an unused component $k$ with no responsibility ($N_k \to 0$) gets posterior parameter $\alpha_k^* = \alpha_0$, while an active component $j$ with $N_j$ data points effectively assigned gets $\alpha_j^* = \alpha_0 + N_j \approx N_j$. The posterior expected-weight ratio is $\alpha_0 / N_j$. For $\alpha_0 = 0.1$ and $N_j \approx 100$, this ratio is $0.001$ — the unused component contributes about 0.1% of the mixture weight.

This is Bayesian Occam at the variational level. The ELBO rewards using more components when they help fit the data (reconstruction term increases) but penalizes unused components via the entropy of $q(\boldsymbol\pi)$ — the Dirichlet hyperprior with $\alpha_0 < 1$ tips the balance so that the cost of activating an unused component exceeds its benefit unless the data strongly require it.

Figure: Bishop §10.2 ARD made empirical — a 2D scatter of 400 points in four well-separated clusters with 2σ ellipses for each active component, alongside a bar chart of posterior mixture weights. With K_max = 10 and α₀ = 0.1, ARD identifies four active components (weights ≈ 0.25) and prunes six (weights ≈ 0.0002). Multiple random restarts converge to identical ELBO.

The result is exactly Theorem 3’s prediction: four components recover with weights $\approx 0.25$ each, and six components collapse to weights $\approx 0.0002$. The collapsed components have $\alpha_k^* \approx \alpha_0 = 0.1$ (no data contribution), while the active components have $\alpha_k^* \approx 100$.

From ARD to VBMS: ranking K_max

ARD gives us a single fit that selects $K$ automatically; VBMS gives us a procedure for ranking competing models by their ELBO. The two are equivalent in this setting. Running VBMS across $K_{\max} \in \{2, 3, \ldots, 10\}$ on the same four-cluster data, we’d find $\mathrm{VBMS}(M_K)$ jumping sharply between $K = 3$ and $K = 4$ (where the fourth cluster gets its own component) and then plateauing for $K \ge 4$ — additional components beyond four contribute essentially zero to the reconstruction term and are pruned to zero weight by ARD. The two procedures agree, with the single ARD fit being roughly an order of magnitude cheaper than the sweep.

This equivalence justifies the practical recipe: run the Bayesian GMM with a generous $K_{\max}$ and small $\alpha_0$, count active components in the variational posterior, and report the result. It scales to large $K_{\max}$ much better than fitting $K$ separate models.
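One hedged way to run this recipe without writing CAVI by hand is scikit-learn's BayesianGaussianMixture, which implements mean-field VB for this model family; its default priors and parameterization differ in details from the Bishop setup above, but the ARD behavior is the same. The cluster layout below is made up for illustration.

```python
# Sketch of the §4 recipe: generous K_max, small alpha_0, count surviving components.
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
centers = np.array([[-4, -4], [-4, 4], [4, -4], [4, 4]])
X = np.vstack([rng.normal(c, 0.8, size=(100, 2)) for c in centers])   # 4 clusters, 400 points

gmm = BayesianGaussianMixture(
    n_components=10,                                     # generous K_max
    weight_concentration_prior_type="dirichlet_distribution",
    weight_concentration_prior=0.1,                      # alpha_0 < 1: a priori sparse over components
    n_init=5, max_iter=500, random_state=0,
).fit(X)

active = gmm.weights_ > 0.01                             # components ARD kept
print("active components:", active.sum())                # expected: 4
print("posterior weights:", np.round(np.sort(gmm.weights_)[::-1], 3))
```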


Comparing Bayesian regression models (polynomial Bayes factors)

§1 introduced the polynomial-regression model zoo $M_1, \ldots, M_9$ and the closed-form log-evidence figure. §3 showed that the Laplace approximation lands exactly on the closed-form curve in the Gaussian-conjugate regime, with the BIC running parallel above. We now turn the picture into a Bayes-factor model-comparison procedure: rather than ARD-style automatic component selection (the §4 approach), we compute $\mathrm{VBMS}(M_d)$ for each model, take ratios as Bayes factors, and rank.

This is the canonical practitioner workflow for Bayesian model comparison via variational methods. PyMC, Stan, and NumPyro all expose ADVI and report the ELBO (as the tracked optimization objective) at convergence. The recipe in production: fit ADVI for each candidate model, take ELBO differences as VBMS Bayes factors, interpret on the Kass–Raftery (1995) scale, and report the winner.

The VBMS Bayes factor as a model-comparison criterion

We promote the §2 informal definition to a formal one.

Definition 2 (VBMS Bayes factor and induced ranking).

Let $\mathcal{M} = \{M_1, M_2, \ldots, M_K\}$ be a finite set of candidate models, each with its variational family $\mathcal{Q}_k$ and VBMS criterion $\mathrm{VBMS}(M_k) = \max_{q \in \mathcal{Q}_k} \mathrm{ELBO}_{M_k}(q)$. The VBMS Bayes factor for $M_i$ versus $M_j$ is

$$\mathrm{BF}_{ij}^{\mathrm{VBMS}} \;:=\; \exp\!\bigl(\mathrm{VBMS}(M_i) - \mathrm{VBMS}(M_j)\bigr).$$

The VBMS ranking induced on $\mathcal{M}$ is the permutation $\sigma$ such that $\mathrm{VBMS}(M_{\sigma(1)}) \ge \mathrm{VBMS}(M_{\sigma(2)}) \ge \cdots$.

By the §2 bias-asymmetry decomposition,

$$\log \mathrm{BF}_{ij}^{\mathrm{VBMS}} \;=\; \log \mathrm{BF}_{ij} \,-\, \bigl[\mathrm{KL}(q^*_i \,\|\, p(\cdot \mid y, M_i)) - \mathrm{KL}(q^*_j \,\|\, p(\cdot \mid y, M_j))\bigr].$$

The two Bayes factors agree exactly when the bias asymmetry is zero, and they agree on ranking whenever the bias asymmetry is dominated by the closed-form Bayes factor. The polynomial-regression zoo is a regime where this holds robustly; we’ll verify it numerically.

Figure: Polynomial Bayes-factor ranking — a bar chart of three estimators (closed-form, Laplace, VBMS mean-field) of the log Bayes factor relative to the argmax, across polynomial degrees d = 1..9. All three agree on rank-order with d = 3 as the winner. The underfit cliff (d = 1, d = 2) is severe; the overfit decay (d = 4..9) is gradual under closed-form/Laplace and steeper under VBMS due to the §6 bias growing with d. Rank-order preserved despite the bias asymmetry.

Three observations. First, all three estimators agree on the closed-form ranking: $d = 3$ wins under closed-form, Laplace, and VBMS, and the same underfit/overfit decay structure holds in all three columns. Second, the underfit cliff is captured by all three estimators with similar magnitudes. Third, the overfit decay shows the bias asymmetry: the VBMS overfit penalty is more aggressive than the closed-form, because the mean-field bias grows with $d$ and that growth gets attributed to the model rather than to the variational family. The crucial fact: despite the steeper VBMS overfit penalty, the rank-order is identical.

Proposition 1 (VBMS rank-order preservation for polynomial regression).

For Bayesian polynomial regression with Gaussian conjugate priors and the model zoo $\mathcal{M} = \{M_1, M_2, \ldots, M_D\}$, the VBMS ranking agrees with the closed-form ranking whenever, for every adjacent pair of degrees $d, d+1$,

$$\bigl|\log p(y \mid M_d) - \log p(y \mid M_{d+1})\bigr| \;>\; \bigl|\mathrm{KL}(q^*_d \,\|\, p(\cdot \mid y, M_d)) - \mathrm{KL}(q^*_{d+1} \,\|\, p(\cdot \mid y, M_{d+1}))\bigr|.$$

In words: as long as the model evidence gap between adjacent-degree models exceeds the bias asymmetry between them, the ranking is preserved.

Kass–Raftery interpretation and reporting

The Kass & Raftery (1995) interpretation scale for log Bayes factors:

  • $\log_{10} \mathrm{BF} \in [0, 0.5]$: barely worth mentioning
  • $[0.5, 1]$: substantial evidence
  • $[1, 2]$: strong evidence
  • $> 2$: decisive evidence

VBMS Bayes factors should not be taken at face value as quantitative log-evidence ratios. Use them for ranking and for “decisive” calls, but reach for IWELBO (§8), AIS (§9), or SMC when you need the magnitude precisely.
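A small helper along these lines (hypothetical, not from the post) turns a dictionary of VBMS scores into a ranked table of log10 Bayes factors against the best model, with the scale labels attached:

```python
# Rank models by VBMS and label evidence strength on the Kass-Raftery log10 scale.
import numpy as np

def rank_vbms(scores):
    """scores: dict model_name -> VBMS(M) (maximized ELBO, natural log)."""
    best = max(scores, key=scores.get)
    rows = []
    for name, s in sorted(scores.items(), key=lambda kv: -kv[1]):
        log10_bf = (scores[best] - s) / np.log(10)      # evidence for best over this model
        if log10_bf < 0.5:   label = "barely worth mentioning"
        elif log10_bf < 1:   label = "substantial"
        elif log10_bf < 2:   label = "strong"
        else:                label = "decisive"
        rows.append((name, round(s, 2), round(log10_bf, 2), label))
    return rows

for row in rank_vbms({"M_2": -41.7, "M_3": -33.2, "M_4": -34.9}):   # made-up scores
    print(row)
```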


The KL projection bias

§§1–5 took the bias as given and showed how it surfaces in concrete settings. We now develop the bias geometrically. The reverse-KL gap that defines the ELBO bias is not a vague approximation error; it is a projection. The variational family $\mathcal{Q}_M$ sits as a submanifold inside the space of all distributions on the parameter space, and the variational optimum $q^*_M$ is the reverse-KL projection of the true posterior $p(\cdot \mid y, M)$ onto that submanifold. The bias is the reverse-KL distance from the posterior to its projection — a geometric quantity, not an algebraic accident.

This geometric framing is what makes the bias-as-spine reading the topic’s central claim. The bias is not a bug; it is the price we pay for a tractable family, and the price has known structure.

Variational families as submanifolds; reverse-KL as the projection

Fix a model $M$ with posterior $p(\cdot \mid y, M)$. Let $\mathcal{P}$ be the space of probability distributions on the parameter space (with finite reverse-KL to the posterior), and let $\mathcal{Q}_M \subset \mathcal{P}$ be the variational family. The reverse-KL projection of the posterior onto $\mathcal{Q}_M$ is

$$q^*_M \;:=\; \arg\min_{q \in \mathcal{Q}_M} \mathrm{KL}\bigl(q \,\|\, p(\cdot \mid y, M)\bigr) \;=\; \arg\max_{q \in \mathcal{Q}_M} \mathrm{ELBO}_M(q),$$

with the equivalence by Theorem 1. The ELBO bias of VBMS is

$$\log p(y \mid M) - \mathrm{VBMS}(M) \;=\; \mathrm{KL}\bigl(q^*_M \,\|\, p(\cdot \mid y, M)\bigr),$$

the reverse-KL distance from the posterior to its projection on $\mathcal{Q}_M$.

Bias monotonicity in family expressiveness

The geometric picture immediately gives the monotonicity result.

Theorem 4 (Bias monotonicity in family expressiveness).

Let $\mathcal{Q}_1 \subseteq \mathcal{Q}_2 \subseteq \cdots \subseteq \mathcal{Q}_K$ be a nested chain of variational families on the parameter space of a model $M$, and let $q^*_k := \arg\max_{q \in \mathcal{Q}_k} \mathrm{ELBO}_M(q)$ for each $k$. Then the ELBO sequence is non-decreasing,

$$\mathrm{ELBO}_M(q^*_1) \;\le\; \mathrm{ELBO}_M(q^*_2) \;\le\; \cdots \;\le\; \mathrm{ELBO}_M(q^*_K) \;\le\; \log p(y \mid M),$$

and the corresponding bias sequence $b_k := \log p(y \mid M) - \mathrm{ELBO}_M(q^*_k) = \mathrm{KL}(q^*_k \,\|\, p(\cdot \mid y, M))$ is non-increasing in $k$.

Proof.

By nesting, $q^*_k$ is itself a member of $\mathcal{Q}_{k+1}$. The maximum of $\mathrm{ELBO}_M(q)$ over $\mathcal{Q}_{k+1}$ is therefore at least $\mathrm{ELBO}_M(q^*_k)$:

$$\mathrm{ELBO}_M(q^*_{k+1}) \;=\; \max_{q \in \mathcal{Q}_{k+1}} \mathrm{ELBO}_M(q) \;\ge\; \mathrm{ELBO}_M(q^*_k).$$

Subtracting from $\log p(y \mid M)$ flips the direction: $b_{k+1} \le b_k$. The upper bound $\mathrm{ELBO}_M(q^*_K) \le \log p(y \mid M)$ is Jensen’s inequality applied at the maximizer.

The proof is elementary, but the consequence is substantive: adding capacity to the variational family can only improve the bound on the log-evidence, never worsen it. We can grow $\mathcal{Q}$ from mean-field to block-diagonal to full-rank Gaussian to normalizing-flow without ever introducing the risk that a richer family makes the ELBO worse.

The closed-form mean-field bias for Gaussian targets

The polynomial-regression spine has one feature that lets us compute the bias exactly: the posterior is multivariate Gaussian, and the mean-field reverse-KL projection has a closed form.

Let $p(\cdot \mid y, M) = \mathcal{N}(\boldsymbol\mu_{\mathrm{post}}, \boldsymbol\Sigma_{\mathrm{post}})$ with $\boldsymbol\Lambda_{\mathrm{post}} := \boldsymbol\Sigma_{\mathrm{post}}^{-1}$, and let $\mathcal{Q}_{\mathrm{MF}}$ be the family of diagonal-covariance Gaussians. The optimal mean-field $q^*_{\mathrm{MF}} = \mathcal{N}(\boldsymbol\mu_{\mathrm{post}}, \boldsymbol\Sigma_q)$ has variances $\boldsymbol\Sigma_q = \mathrm{diag}\bigl(1/\boldsymbol\Lambda_{\mathrm{post},jj}\bigr)$ — the means match the posterior exactly, but the variances are the inverses of the posterior precision diagonal, not the diagonal of the posterior covariance.

The reverse-KL gap reduces (after the trace and $-d$ terms cancel) to

$$b_{\mathrm{MF}} \;=\; \mathrm{KL}\bigl(q^*_{\mathrm{MF}} \,\|\, p(\cdot \mid y, M)\bigr) \;=\; \tfrac{1}{2}\Bigl[\sum_j \log \boldsymbol\Lambda_{\mathrm{post},jj} - \log\det\boldsymbol\Lambda_{\mathrm{post}}\Bigr].$$

The mean-field bias for a multivariate Gaussian posterior is the log-determinant gap between the diagonal of the precision matrix and the full precision matrix. By Hadamard’s inequality, $\prod_j \boldsymbol\Lambda_{\mathrm{post},jj} \ge \det\boldsymbol\Lambda_{\mathrm{post}}$, with equality iff $\boldsymbol\Lambda_{\mathrm{post}}$ is diagonal — so $b_{\mathrm{MF}} \ge 0$ with equality precisely when the posterior factors across coordinates, i.e., when mean-field is exact. The bias measures how non-diagonal the posterior precision is.

For polynomial Bayesian regression at degree $d$, the posterior precision involves the Vandermonde Gram matrix, which is famously near-collinear — successive monomials have strong off-diagonal entries. The diagonal-vs-full log-determinant gap grows with $d$ accordingly. This is the §1 figure’s red trace’s growing distance from the blue trace, derived from first principles.
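The formula is two lines of code. The sketch below (assuming the same conjugate setup as §1, with illustrative hyperparameters) evaluates $b_{\mathrm{MF}}$ directly from the posterior precision $X^\top X/\sigma^2 + I/\tau^2$ and shows it growing with the degree $d$:

```python
# Closed-form mean-field bias b_MF for a Gaussian posterior, evaluated on the
# conjugate polynomial-regression posterior precision (illustrative hyperparameters).
import numpy as np

def mean_field_bias(precision):
    """0.5 * (sum_j log Lambda_jj - logdet Lambda) for a Gaussian posterior N(mu, Lambda^-1)."""
    sign, logdet = np.linalg.slogdet(precision)
    return 0.5 * (np.sum(np.log(np.diag(precision))) - logdet)

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=30)
sigma2, tau2 = 0.0625, 1.0
for d in range(1, 10):
    X = np.vander(x, N=d + 1, increasing=True)
    Lambda_post = X.T @ X / sigma2 + np.eye(d + 1) / tau2   # posterior precision
    print(d, round(mean_field_bias(Lambda_post), 3))        # grows with d: near-collinear Vandermonde
```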

Three families on the banana target

For non-Gaussian posteriors, the closed-form formula doesn’t apply, but the geometric monotonicity result of Theorem 4 still does. Consider the synthetic banana target whose joint log-density is

$$\log p_{\mathrm{unnorm}}(\theta_0, \theta_1) \;=\; -\tfrac{\theta_0^2}{2 \cdot 9} - \tfrac{(\theta_1 - 0.6\theta_0 - 0.3\theta_0^2)^2}{2}$$

— a 2D distribution where $\theta_0 \sim \mathcal{N}(0, 9)$ marginally and $\theta_1 \mid \theta_0 \sim \mathcal{N}(0.6\theta_0 + 0.3\theta_0^2, 1)$. The conditional mean of $\theta_1$ has both linear and quadratic structure in $\theta_0$, so the target has both correlation (visible to a Gaussian) and curvature (only visible to a flow). The marginal log-evidence $\log Z = \log(6\pi) \approx 2.936$ is analytically computable.
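The normalizer is easy to sanity-check numerically; the brute-force grid below (a sketch with generous, hand-picked integration bounds) recovers $\log Z \approx \log(6\pi)$:

```python
# Numerical check of the banana target's normalizer: the unnormalized density
# integrates to 6*pi, so log Z ≈ 2.936.
import numpy as np

def log_p_unnorm(t0, t1):
    return -t0**2 / (2 * 9) - (t1 - 0.6 * t0 - 0.3 * t0**2)**2 / 2

t0 = np.linspace(-12, 12, 1201)        # ~4 marginal std of theta_0
t1 = np.linspace(-15, 60, 3001)        # covers the conditional mean over that range
dt0, dt1 = t0[1] - t0[0], t1[1] - t1[0]
T0, T1 = np.meshgrid(t0, t1, indexing="ij")
Z = np.exp(log_p_unnorm(T0, T1)).sum() * dt0 * dt1   # Riemann-sum approximation
print(np.log(Z), np.log(6 * np.pi))    # both ≈ 2.936
```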

Three nested families, in order of expressiveness:

  1. Mean-field Gaussian — 4 parameters $(\mu_0, \mu_1, \log\sigma_0, \log\sigma_1)$. Diagonal covariance, no correlation captured.
  2. Full-rank Gaussian — 5 parameters. Captures linear correlation between $\theta_0$ and $\theta_1$ but not the quadratic structure.
  3. Polynomial autoregressive flow — 6 parameters implementing $\theta_0 = \mu_0 + \sigma_0 z_0$, $\theta_1 = a + b\theta_0 + c\theta_0^2 + \sigma_1 z_1$. The family contains the banana target.

Figure: Theorem 4 made geometric — three side-by-side panels showing the banana target (gray contours) overlaid with the optimal q from each nested variational family, with closed-form log Z = log(6π) ≈ 2.936. Mean-field can only place a diagonal-covariance ellipse; full-rank captures the linear correlation; the polynomial flow recovers the true coefficients within Monte Carlo noise. The bias shrinks monotonically as the family grows, vanishing when the family contains the target.

Mode-seeking versus mass-covering

A geometric note that distinguishes reverse-KL projection from forward-KL projection. Reverse KL $\mathrm{KL}(q \,\|\, p)$ heavily penalizes $q$-mass in regions where $p$ is small. The projection is therefore mode-seeking: $q^*$ concentrates within the support of $p$ and avoids placing mass where $p$ is small, even at the cost of failing to cover $p$’s tails. Forward KL $\mathrm{KL}(p \,\|\, q)$ is the opposite — it produces mass-covering projections that are broad and tail-covering at the cost of mode-localization.

For VBMS specifically, the choice of reverse KL has consequences: VBMS systematically underestimates posterior variance and gives confidence intervals that are too narrow. For point predictions and model ranking this is acceptable; for uncertainty quantification it requires correction (the §11 PSIS-VI Pareto-$\hat k$ diagnostic catches the worst cases).


Asymmetric bias and ranking preservation

§6 developed the bias geometrically. The natural next question — and the question that determines whether VBMS is a useful model-selection tool or just an interesting computational object — is what the bias does to model rankings. Are VBMS Bayes factors faithful to closed-form Bayes factors, and if not, in what regimes do they fail?

The Wang–Blei consistency theorem

Wang & Blei (2019) gave the cleanest statement of when VBMS rankings agree with closed-form rankings. Their result requires standard regularity conditions on the variational family — smooth parameterization, support compatibility with the posterior, ability to recover the true parameter at the population limit — plus an asymptotic identifiability assumption on the model space.

Theorem 5 (Wang–Blei ranking preservation).

Let $M_i$ and $M_j$ be two competing models with variational families $\mathcal{Q}_i$ and $\mathcal{Q}_j$ satisfying the Wang–Blei (2019) regularity conditions. Suppose the bias asymmetry is bounded:

$$\bigl|\mathrm{KL}(q^*_i \,\|\, p(\cdot \mid y, M_i)) - \mathrm{KL}(q^*_j \,\|\, p(\cdot \mid y, M_j))\bigr| \;<\; \bigl|\log p(y \mid M_i) - \log p(y \mid M_j)\bigr|.$$

Then $\mathrm{VBMS}(M_i) > \mathrm{VBMS}(M_j) \iff \log p(y \mid M_i) > \log p(y \mid M_j)$, so VBMS preserves the closed-form ranking. As $n \to \infty$ for $M_i \ne M_j$ both identified by the data, the bias asymmetry stays $O(1)$ while the closed-form log Bayes factor diverges as $\Theta(n)$, so the inequality holds for all sufficiently large $n$.

The substantive content: the bias asymmetry is bounded by the variational family’s structural mismatch with the posteriors, which is a model-and-family property, not a sample-size property. The log Bayes factor between two distinct identified models grows linearly in $n$ as the data accumulate. The asymmetry therefore stays $O(1)$, is eventually dominated, and rankings are preserved.

Empirical demonstration: hierarchical vs pooled Bayesian logistic regression

The cleanest empirical setup: hierarchical Bayesian logistic regression with random group intercepts versus complete-pooling Bayesian logistic regression on the same data. We sweep $n_{\mathrm{per\;group}} \in \{5, 15, 50, 100, 200\}$, fit ADVI on each model at each $n$, and run SMC as the gold-standard log-marginal-likelihood reference.

Figure: Wang–Blei consistency on hierarchical vs pooled Bayesian logistic regression. Left panel: per-data-point log-evidence vs n_per_group, with ELBO and SMC traces for both M_hier and M_pool. Right panel: |ELBO − SMC| for both models alongside the model gap at each n, on a log scale. M_hier matches the data-generating process; M_pool ignores group structure. Both VBMS (ELBO) and SMC consistently rank M_hier > M_pool at every n. M_hier's mean-field bias is structurally larger than M_pool's because the hierarchical posterior has correlated structure mean-field cannot capture. But the model gap dominates at every n — ranking preserved.

Two structural observations:

The asymmetry is exactly what Wang–Blei predict. $M_{\mathrm{hier}}$ has substantively more mean-field bias — the variational family forces the posterior over the 12 parameters to factor across coordinates, and the true posterior has non-trivial correlation between $\sigma_\alpha$ and the individual $\alpha_g$’s (the so-called “funnel” structure of hierarchical posteriors). $M_{\mathrm{pool}}$ has essentially zero bias because its 2D posterior is approximately bivariate Gaussian with weak correlation.

Ranking preservation holds at every $n$. The model gap grows roughly linearly in total $n$, exactly as Theorem 5 says it should. The bias asymmetry stays bounded across all five $n$-values. At the largest $n$, the model gap is over 200× the largest bias gap; even at the smallest $n$, the model gap dominates. VBMS and SMC produce identical rankings at every sample size we tried.

The boundary regime where rankings can flip

Theorem 5’s conclusion fails when the inequality fails — when the bias asymmetry exceeds the model gap. Three regimes where this can happen:

Weak model differences. When two models are nearly equally good — small effective sample size, ambiguous data, or genuine model uncertainty — the closed-form Bayes factor between them is small. Even modest bias asymmetry can flip the ranking. The §11 PSIS-$\hat k$ diagnostic catches this empirically.

Highly correlated hierarchical posteriors. Deep multilevel models with multiple variance components, crossed random effects, or partial pooling at scale produce posteriors with funnel-shaped or otherwise strongly correlated structure that mean-field VB destroys. Two remedies: switch to full-rank ADVI, or use non-centered parameterization.

Genuinely multimodal posteriors. When the posterior has multiple distinct modes — identifiability failure, label-switching in non-symmetric mixtures, bimodality from genuine bimodal likelihood evidence — mean-field reverse-KL projection collapses to a single mode. The ELBO on the collapsed approximation can be catastrophically biased. The remedy is to use AIS (§9) or normalizing-flow variational families that can in principle capture multiple modes.

Diagnostic procedure

Combining Theorem 5 with the boundary characterizations above, a defensible procedure for using VBMS in model comparison work:

  1. Fit ADVI for each candidate model with 5–20 random restarts and select the maximum-ELBO restart.
  2. Compute PSIS-$\hat k$ on the variational posterior of each fitted model (a minimal sketch follows this list).
  3. Apply the Yao et al. (2018a) decision rule. $\hat k < 0.5$: VBMS Bayes factors are reliable. $0.5 \le \hat k < 0.7$: bias is moderate, treat with caution. $\hat k \ge 0.7$: do not trust VBMS.
  4. For pairs flagged at step 3, replace the ELBO comparison with IWELBO (§8) at moderate $K$, AIS (§9) for a gold-standard reference, or SMC.
  5. Report VBMS rankings, not magnitudes. When magnitudes matter, reach for IWELBO or AIS.
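A minimal sketch of step 2, assuming a fitted mean-field Gaussian $q$ with mean `mu` and standard deviations `sd` and a callable `log_joint` for $\log p(y, \theta \mid M)$ (all placeholders), and assuming ArviZ's `psislw` behaves as documented for the Pareto-smoothing step:

```python
# PSIS k-hat for a fitted mean-field Gaussian variational posterior (sketch).
import numpy as np
import arviz as az

def psis_khat(log_joint, mu, sd, n_samples=4000, seed=0):
    rng = np.random.default_rng(seed)
    theta = rng.normal(mu, sd, size=(n_samples, len(mu)))          # draws from q
    log_q = np.sum(-0.5 * ((theta - mu) / sd) ** 2
                   - np.log(sd) - 0.5 * np.log(2 * np.pi), axis=1)
    log_w = np.array([log_joint(t) for t in theta]) - log_q        # log p(y, theta) - log q(theta)
    _, khat = az.psislw(log_w)                                     # Pareto-smoothed importance sampling
    return float(khat)

# k < 0.5: trust the ELBO comparison; 0.5 <= k < 0.7: caution; k >= 0.7: use IWELBO/AIS/SMC.
```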

Importance-weighted ELBO and bias correction

The §7 diagnostic procedure flagged certain model-comparison pairs for which the standard ELBO is too biased to trust. For those pairs, §8 develops one bias-correction tool — the importance-weighted ELBO of Burda, Grosse & Salakhutdinov (2016) — that progressively tightens the bound by averaging multiple importance samples from the same variational family.

The motivation: Theorem 1 says $\mathrm{ELBO}(q) \le \log p(y)$, with the gap equal to the reverse-KL gap. The §6 monotonicity result says that growing the variational family’s expressiveness can shrink the gap, but for fixed $q$ the bound is what it is. IWELBO turns this around: hold $q$ fixed and use the same $q$ multiple times with importance-weighted averaging to construct a sequence of bounds $\mathrm{IWELBO}_1(q), \mathrm{IWELBO}_2(q), \ldots$ that tighten monotonically and converge to $\log p(y)$ in the $K \to \infty$ limit.

Definition and intuition

For a fixed proposal distribution $q$ on the parameter space, the importance-sampling identity says

$$p(y) \;=\; \int q(\theta) \cdot \frac{p(y, \theta)}{q(\theta)} \, d\theta \;=\; \mathbb{E}_q[w(\theta)], \qquad w(\theta) := \frac{p(y, \theta)}{q(\theta)}.$$

A naive Monte Carlo estimator is $\hat p(y) = (1/K) \sum_k w(\theta_k)$ for $\theta_k \sim q$ iid. Taking logs,

$$\widehat{\log p}(y) \;:=\; \log \hat p(y) \;=\; \log\!\left(\tfrac{1}{K} \sum_{k=1}^K w(\theta_k)\right)$$

is a (biased but consistent) estimator of $\log p(y)$. The importance-weighted ELBO is its expectation under the iid sampling.

Definition 3 (Importance-weighted ELBO (IWELBO)).

For a model $M$, a proposal distribution $q$, and an integer $K \ge 1$,

$$\mathrm{IWELBO}_K(q) \;:=\; \mathbb{E}_{\theta_1, \ldots, \theta_K \overset{\mathrm{iid}}{\sim} q}\!\left[\log\!\left(\frac{1}{K} \sum_{k=1}^K \frac{p(y, \theta_k \mid M)}{q(\theta_k)}\right)\right].$$

When $K = 1$, $\mathrm{IWELBO}_1(q) = \mathrm{ELBO}_M(q)$, recovering the standard ELBO.
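A direct Monte Carlo implementation of Definition 3 is a log-sum-exp over $K$ importance weights, averaged over outer replications. The sketch below assumes a mean-field Gaussian proposal with mean `mu` and standard deviations `sd`, and a placeholder `log_joint` for $\log p(y, \theta \mid M)$:

```python
# Monte Carlo estimator of IWELBO_K for a mean-field Gaussian proposal (sketch).
import numpy as np
from scipy.special import logsumexp

def iwelbo(log_joint, mu, sd, K, n_outer=200, seed=0):
    """Average over n_outer replications of log( (1/K) sum_k p(y,theta_k)/q(theta_k) )."""
    rng = np.random.default_rng(seed)
    est = np.empty(n_outer)
    for r in range(n_outer):
        theta = rng.normal(mu, sd, size=(K, len(mu)))              # K iid draws from q
        log_q = np.sum(-0.5 * ((theta - mu) / sd) ** 2
                       - np.log(sd) - 0.5 * np.log(2 * np.pi), axis=1)
        log_w = np.array([log_joint(t) for t in theta]) - log_q    # log importance weights
        est[r] = logsumexp(log_w) - np.log(K)
    return est.mean()

# iwelbo(log_joint, mu, sd, K=1) recovers the standard ELBO estimate;
# increasing K tightens the bound monotonically in expectation (Theorem 6).
```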

Monotone tightness

The formal statement is the central result of Burda, Grosse & Salakhutdinov (2016).

Theorem 6 (IWELBO monotone tightness).

For any proposal $q$ with $\mathrm{KL}(q \,\|\, p(\cdot \mid y, M)) < \infty$ and any $K \ge 1$,

$$\mathrm{ELBO}(q) \;=\; \mathrm{IWELBO}_1(q) \;\le\; \mathrm{IWELBO}_2(q) \;\le\; \cdots \;\le\; \log p(y \mid M).$$

If additionally the importance weights $w(\theta) := p(y, \theta \mid M) / q(\theta)$ have finite variance under $q$, then $\mathrm{IWELBO}_K(q) \to \log p(y \mid M)$ as $K \to \infty$.

Proof.

The upper bound is Jensen’s inequality applied to the log of the importance-weighted average.

For monotonicity, we use the leave-one-out averaging trick. Let $\theta_1, \ldots, \theta_{K+1}$ be iid from $q$, write $w_k := w(\theta_k)$, and let $Z_{K+1} := (1/(K+1)) \sum_{k=1}^{K+1} w_k$. Observe that $Z_{K+1}$ is itself the average over $j$ of leave-one-out estimators $Z_K^{(-j)} := (1/K) \sum_{k \ne j} w_k$:

$$Z_{K+1} \;=\; \frac{1}{K+1} \sum_{j=1}^{K+1} Z_K^{(-j)}.$$

Applying Jensen’s inequality to the log of the average:

$$\log Z_{K+1} \;\ge\; \tfrac{1}{K+1} \sum_j \log Z_K^{(-j)}.$$

Take expectation over the iid sample. By exchangeability, each $Z_K^{(-j)}$ has the same distribution as $Z_K$, so the right-hand side equals $\mathbb{E}[\log Z_K] = \mathrm{IWELBO}_K(q)$. The left-hand side is $\mathrm{IWELBO}_{K+1}(q)$, giving the monotone tightening claim.

Convergence as $K \to \infty$ follows from the strong law of large numbers (which gives $Z_K \to p(y)$ almost surely) plus dominated convergence applied to $\log Z_K$.

The leave-one-out trick is doing the substantive work: $Z_{K+1}$ is more concentrated than $Z_K$, and Jensen’s inequality on the log rewards concentration.

Empirical demonstration

The polynomial-regression spine at d=4d = 4 gives a clean test case.

Top: IWELBO_K monotonically increasing toward closed-form log p(y) as K grows from 1 to 1000 on log scale. Bottom: bias decay shrinking with K.
IWELBO monotone tightening on polynomial regression at d=4. Top: IWELBO_K rises from the standard ELBO toward closed-form log p(y) as K grows, monotonically per Theorem 6. Bottom: the bias log p(y) - IWELBO_K shrinks sublinearly — limited by the heavy tails of the importance weights under mean-field q.

The monotone tightening of Theorem 6 is exactly observed — the IWELBO curve rises smoothly with KK on the log axis. The bias decay is sublinear in KK: a 1000-fold increase in KK reduces the bias by only about 60%. This is a known feature of importance sampling with heavy-tailed importance weights. In practice, IWELBO with K=50K = 50 to K=200K = 200 is a reasonable compromise between bias reduction and compute — beyond that the returns diminish quickly, and AIS (§9) becomes the better tool.
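The qualitative behaviour is easy to reproduce away from the polynomial spine. The sketch below uses a 1-D conjugate Gaussian-mean model, where log p(y) is available in closed form, and a deliberately shifted, over-dispersed Gaussian proposal; every number here (noise scale, prior scale, proposal mismatch, replication counts) is an illustrative choice, not the setup behind the figure.

import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm, multivariate_normal

rng = np.random.default_rng(1)
sigma, tau, n = 1.0, 2.0, 20                  # noise sd, prior sd, sample size
y = rng.normal(0.7, sigma, size=n)            # data drawn around theta = 0.7

# Closed-form log-evidence: y ~ N(0, sigma^2 I + tau^2 11^T)
cov = sigma**2 * np.eye(n) + tau**2 * np.ones((n, n))
log_evidence = multivariate_normal(mean=np.zeros(n), cov=cov).logpdf(y)

# Exact posterior N(mu_n, s_n^2), and a deliberately mismatched proposal q
s_n2 = 1.0 / (1.0 / tau**2 + n / sigma**2)
mu_n = s_n2 * y.sum() / sigma**2
q_mu, q_sd = mu_n + 0.4, 2.0 * np.sqrt(s_n2)  # shifted and over-dispersed

def log_joint(theta):                         # log p(y, theta), vectorized over theta
    return norm.logpdf(y[:, None], theta[None, :], sigma).sum(0) \
        + norm.logpdf(theta, 0.0, tau)

def iwelbo_k(K, n_outer=2000):
    vals = np.empty(n_outer)
    for r in range(n_outer):
        theta = rng.normal(q_mu, q_sd, size=K)
        log_w = log_joint(theta) - norm.logpdf(theta, q_mu, q_sd)
        vals[r] = logsumexp(log_w) - np.log(K)
    return vals.mean()

for K in (1, 5, 25, 125, 625):
    print(f"K={K:4d}  IWELBO_K = {iwelbo_k(K):.4f}   (log p(y) = {log_evidence:.4f})")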

The STL gradient-bias trick

A practical wrinkle for IWELBO when used as a training objective rather than just an evaluation tool.

Lemma 1 (STL gradient bias (Roeder–Wu–Duvenaud 2017)).

The standard reparameterization-gradient estimator of ϕIWELBOK(qϕ)\nabla_\phi \mathrm{IWELBO}_K(q_\phi) has variance that does not vanish as KK \to \infty. The “sticking the landing” (STL) estimator replaces the score-function term ϕlogqϕ(θk)\nabla_\phi \log q_\phi(\theta_k) with its gradient-detached version, eliminating the asymptotic gradient variance while preserving the gradient’s expected value.

The intuition: at the optimum of IWELBOK(qϕ)\mathrm{IWELBO}_K(q_\phi), the score-function term has zero expectation but non-zero variance that grows with KK. Detaching ϕ\phi inside logq\log q removes the score-function term entirely, leaving an estimator whose variance behaves correctly under high-KK optimization. Most modern PPL frameworks implement STL by default.
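The detach trick in Lemma 1 is a one-line change in an autodiff framework. The sketch below is PyTorch-flavoured purely for illustration (PyTorch is not otherwise used in this topic), with log_joint a hypothetical callable; the only point of interest is evaluating log q with detached variational parameters, which removes the score-function term from the gradient while keeping the pathwise term through the reparameterized samples.

import math
import torch
from torch.distributions import Normal

def iwelbo_stl_loss(mu, log_sigma, log_joint, K=64):
    """Negative IWELBO_K with the sticking-the-landing (STL) estimator (a sketch).

    mu, log_sigma : variational parameters (tensors with requires_grad=True)
    log_joint     : hypothetical callable returning log p(y, theta) elementwise
    """
    sigma = log_sigma.exp()
    eps = torch.randn(K)
    theta = mu + sigma * eps                       # reparameterized draws; pathwise gradients flow here
    # STL trick: evaluate log q_phi(theta) with phi detached, so the
    # score-function term contributes no gradient.
    log_q = Normal(mu.detach(), sigma.detach()).log_prob(theta)
    log_w = log_joint(theta) - log_q
    iwelbo = torch.logsumexp(log_w, dim=0) - math.log(K)
    return -iwelbo                                 # minimize the negative bound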


Annealed importance sampling as the gold-standard reference

§8 developed IWELBO. We saw that for well-behaved targets it works, but for harder targets it stalls. §9 develops the alternative: annealed importance sampling (AIS, Neal 2001), which sidesteps the heavy-tail problem by gradually transforming the proposal into the target through a sequence of intermediate distributions, with MCMC steps in between. The result is the gold-standard unbiased estimator of p(y)p(y).

The contrast with IWELBO is sharp. On the polynomial-regression spine at d=4d = 4, IWELBO at K=1000K = 1000 has measurable residual bias. AIS at T=100T = 100 schedule steps reaches the closed-form to within Monte Carlo noise — an order-of-magnitude improvement at one-tenth the inner-loop cost.

The annealing schedule and intermediate distributions

The core idea is geometric. We have a starting distribution we can sample from easily — typically the prior p(θ)p(\theta) — and a target distribution we can’t — the unnormalized joint p(θ)p(yθ)p(\theta) \cdot p(y \mid \theta), with normalizing constant p(y)p(y). AIS bridges them through a schedule of intermediate distributions parameterized by an inverse-temperature β[0,1]\beta \in [0, 1]:

pt(θ)    p(θ)p(yθ)βt,β0=0<β1<β2<<βT=1.p_t(\theta) \;\propto\; p(\theta) \cdot p(y \mid \theta)^{\beta_t}, \qquad \beta_0 = 0 < \beta_1 < \beta_2 < \cdots < \beta_T = 1.

At β0=0\beta_0 = 0: p0(θ)p(θ)p_0(\theta) \propto p(\theta) is the prior, with normalizing constant Z0=1Z_0 = 1. At βT=1\beta_T = 1: pT(θ)p(y,θ)p_T(\theta) \propto p(y, \theta), with normalizing constant ZT=p(y)Z_T = p(y).

AIS with MH transitions

Algorithm 2 (Annealed importance sampling (Neal 2001)).

Inputs: prior p(θ)p(\theta) from which we can sample; likelihood p(yθ)p(y \mid \theta) which we can evaluate; schedule 0=β0<β1<<βT=10 = \beta_0 < \beta_1 < \cdots < \beta_T = 1; MCMC transition kernels TtT_t that leave ptp_t invariant.

Per chain c=1,,Cc = 1, \ldots, C:

  1. Sample initial state θ0(c)p(θ)\theta_0^{(c)} \sim p(\theta) (the prior).
  2. Initialize log-weight logw(c)=0\log w^{(c)} = 0.
  3. For t=1,2,,Tt = 1, 2, \ldots, T:
    • Accumulate the weight increment: logw(c)+=(βtβt1)logp(yθt1(c))\log w^{(c)} \mathrel{+}= (\beta_t - \beta_{t-1}) \cdot \log p(y \mid \theta_{t-1}^{(c)}).
    • Sample θt(c)\theta_t^{(c)} via the transition kernel TtT_t targeting ptp_t.
  4. Return final log-weight logw(c)\log w^{(c)}.

Final estimate: logp^(y):=logsumexp(logw(c))logC\widehat{\log p}(y) := \mathrm{logsumexp}(\log w^{(c)}) - \log C.
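Algorithm 2 is short enough to sketch directly. The version below uses a random-walk Metropolis kernel as T_t and, for the usage example, the same illustrative 1-D conjugate Gaussian-mean model as before; the chain count, step size, number of MH sweeps per temperature, and the uniform schedule are all arbitrary illustrative choices.

import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm

def ais_log_evidence(sample_prior, log_prior, log_lik, betas,
                     n_chains=128, mh_steps=3, step=0.3, seed=0):
    """Annealed importance sampling (Algorithm 2) with a random-walk MH kernel (a sketch).

    sample_prior(rng, n) -> n prior draws; log_prior / log_lik vectorized over theta.
    Returns logsumexp(log w) - log C, the log of the average AIS weight.
    """
    rng = np.random.default_rng(seed)
    theta = sample_prior(rng, n_chains)
    log_w = np.zeros(n_chains)
    for b_prev, b in zip(betas[:-1], betas[1:]):
        log_w += (b - b_prev) * log_lik(theta)            # weight increment at theta_{t-1}
        for _ in range(mh_steps):                         # MH steps leaving p_t invariant
            prop = theta + step * rng.standard_normal(n_chains)
            log_alpha = (log_prior(prop) + b * log_lik(prop)
                         - log_prior(theta) - b * log_lik(theta))
            accept = np.log(rng.uniform(size=n_chains)) < log_alpha
            theta = np.where(accept, prop, theta)
    return logsumexp(log_w) - np.log(n_chains)

# Illustrative usage: theta ~ N(0, tau^2), y_i ~ N(theta, sigma^2).
rng = np.random.default_rng(1)
sigma, tau = 1.0, 2.0
y = rng.normal(0.7, sigma, size=20)
log_prior = lambda th: norm.logpdf(th, 0.0, tau)
log_lik = lambda th: norm.logpdf(y[:, None], np.atleast_1d(th)[None, :], sigma).sum(0)
betas = np.linspace(0.0, 1.0, 101)                        # T = 100 uniform steps
print(ais_log_evidence(lambda r, n: r.normal(0, tau, n), log_prior, log_lik, betas))
# Compare against the closed-form log-evidence of the conjugate model (see the earlier sketch).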

Unbiasedness

Theorem 7 (AIS unbiasedness (Neal 2001)).

Let ww denote the AIS importance weight from a single chain run as in Algorithm 2, with the MH kernels TtT_t leaving ptp_t invariant. Then E[w]=p(y)\mathbb{E}[w] = p(y), where the expectation is over the AIS trajectory.

Proof.

The single-chain log-weight is logw=t=1T(βtβt1)logp(yθt1)\log w = \sum_{t=1}^T (\beta_t - \beta_{t-1}) \cdot \log p(y \mid \theta_{t-1}), so

w  =  ZTZ0t=1Tpt(θt1)pt1(θt1).w \;=\; \frac{Z_T}{Z_0} \cdot \prod_{t=1}^T \frac{p_t(\theta_{t-1})}{p_{t-1}(\theta_{t-1})}.

Taking expectation over the trajectory and using the transition-invariance identity $\int p_t(\theta_{t-1}) \, T_t(\theta_{t-1} \to \theta_t) \, d\theta_{t-1} = p_t(\theta_t)$, we collapse the product inductively: integrating over $\theta_0$ turns $p_0(\theta_0) \, \tfrac{p_1(\theta_0)}{p_0(\theta_0)} \, T_1(\theta_0 \to \theta_1)$ into $p_1(\theta_1)$, which then meets the ratio $p_2(\theta_1)/p_1(\theta_1)$ at the next step, and so on. After all $T$ steps the remaining integral is $\int p_T(\theta_{T-1}) \, d\theta_{T-1} = 1$. So $\mathbb{E}[w] = Z_T / Z_0 = p(y)$.

The unbiasedness is exact at any schedule length T1T \ge 1 — the schedule length controls variance, not bias.

Empirical demonstration

Top: AIS estimate of log p(y) at T from 1 to 3000 on log scale, converging to closed-form. Bottom: |bias| vs T showing the dramatic drop.
AIS schedule sweep on polynomial regression at d=4. Top: AIS estimate (green) converges to closed-form log p(y) (blue dashed) as T grows. At T=1, AIS reduces to prior importance sampling — unbiased but high-variance. By T~100 the estimate is essentially exact. Bottom: |bias| collapsing as T grows.

Compare to IWELBO from §8 on the same target: at K=1000K = 1000, IWELBO has measurable bias; AIS at T=100T = 100 — a tenth the inner-loop count — is dramatically tighter. AIS is roughly two orders of magnitude more sample-efficient on this case. The reason: IWELBO does KK iid draws from a fixed proposal, so heavy importance-weight tails dominate. AIS does TT sequential draws through a schedule that gradually moves from prior to target.

Schedule design

A practical note: AIS’s sample efficiency depends on the schedule. Geometric schedules and adaptive schedules — Neal (2001) §6 — outperform linear when the posterior is much more concentrated than the prior. PyMC’s pm.sample_smc (Sequential Monte Carlo) is a related algorithm with adaptive scheduling.
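As a concrete illustration of the schedule choice, the snippet below contrasts a uniform grid with a power-law grid that concentrates steps near beta = 0, which is where the intermediate distributions change fastest when the posterior is far tighter than the prior; the exponent is an arbitrary illustrative value.

import numpy as np

T = 100
betas_linear = np.linspace(0.0, 1.0, T + 1)        # equal spacing everywhere
betas_power = np.linspace(0.0, 1.0, T + 1) ** 4    # finer spacing near beta = 0, coarser near beta = 1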

For VBMS in production, the practical recipe is: use ELBO for the cheap first pass, use IWELBO at moderate KK to tighten when needed, and reach for AIS or SMC when Pareto-k^\hat k flags problems or when the magnitude (not just the ranking) of the Bayes factor matters.


The MDL connection — variational free energy as bits-back coding

§§1–9 developed VBMS computationally. §10 changes the frame. We’ve been treating logp(yM)\log p(y \mid M) as a probabilistic quantity — the marginal likelihood of the data under the model. It is also a coding-theoretic quantity — the optimal expected description length of the data under the model, in Shannon’s sense. The variational free energy ELBO(q)-\mathrm{ELBO}(q) is the analogous coding length under a constructive coding scheme that uses qq as a substitute for the true posterior. The gap between them — exactly the §6 KL projection bias — is the coding overhead the variational scheme pays for using an imperfect codebook.

This is the Minimum Description Length frame for VBMS. Hinton & van Camp (1993) introduced the bits-back coding protocol that makes the connection precise. Honkela & Valpola (2004) extended it to general variational Bayes.

The bits-back coding protocol

The protocol, with q(θ)q(\theta) a variational approximation to p(θy)p(\theta \mid y):

  1. Sender draws θq(θ)\theta \sim q(\theta). The randomness used to draw θ\theta comes from an external source (an arithmetic-coded stream of auxiliary message bits). Drawing θ\theta from qq uses up logq(θ)-\log q(\theta) bits.
  2. Sender encodes θ\theta using a code based on the prior p(θ)p(\theta). Cost: logp(θ)-\log p(\theta) bits.
  3. Sender encodes yy using the conditional code p(yθ)p(y \mid \theta). Cost: logp(yθ)-\log p(y \mid \theta) bits.
  4. Receiver decodes θ\theta from the prior code and then yy from the conditional code. Because both parties can construct the same qq, the receiver re-runs the arithmetic coder on θ\theta under qq, recovering the auxiliary message bits the sender consumed in step 1, getting logq(θ)-\log q(\theta) bits back.

The net cost is the gross minus the bits back, in expectation under θq\theta \sim q.

The bits-back coding identity

Theorem 8 (Bits-back coding identity).

For a model MM with prior p(θM)p(\theta \mid M), likelihood p(yθ,M)p(y \mid \theta, M), and variational distribution q(θ)q(\theta) absolutely continuous with respect to p(y,M)p(\cdot \mid y, M), the expected net description length under the bits-back coding protocol is

Lbitsback(q,M)  =  Eq[logp(θM)logp(yθ,M)+logq(θ)]  =  ELBOM(q).L_{\mathrm{bits-back}}(q, M) \;=\; \mathbb{E}_q[-\log p(\theta \mid M) - \log p(y \mid \theta, M) + \log q(\theta)] \;=\; -\mathrm{ELBO}_M(q).

The coding overhead is

Lbitsback(q,M)    (logp(yM))  =  KL(qp(y,M))    0.L_{\mathrm{bits-back}}(q, M) \;-\; (-\log p(y \mid M)) \;=\; \mathrm{KL}\bigl(q \,\|\, p(\cdot \mid y, M)\bigr) \;\ge\; 0.
Proof.

The first identity is direct algebra:

Lbitsback(q,M)  =  (Eq[logp(y,θM)]Eq[logq(θ)])  =  ELBOM(q).L_{\mathrm{bits-back}}(q, M) \;=\; -\bigl(\mathbb{E}_q[\log p(y, \theta \mid M)] - \mathbb{E}_q[\log q(\theta)]\bigr) \;=\; -\mathrm{ELBO}_M(q).

The coding overhead identity follows from the §2 evidence decomposition (Theorem 1):

Lbitsback(q,M)(logp(yM))  =  logp(yM)ELBOM(q)  =  KL(qp(y,M)).L_{\mathrm{bits-back}}(q, M) - (-\log p(y \mid M)) \;=\; \log p(y \mid M) - \mathrm{ELBO}_M(q) \;=\; \mathrm{KL}\bigl(q \,\|\, p(\cdot \mid y, M)\bigr).

The two halves of Theorem 8 say the same thing in two languages. Probabilistically: the variational free energy ELBO(q)-\mathrm{ELBO}(q) is at least the negative log-evidence, with the gap equal to the reverse-KL gap. Information-theoretically: the bits-back coding length is at least the Shannon-optimal code length, with the overhead equal to the same reverse-KL gap. The §6 KL projection bias is the bits-back coding overhead. Same object; two disciplines.

Three horizontal bars: gross sender cost, Bayesian-optimal -log p(y), bits-back net -ELBO. KL gap shaded between Bayesian-optimal and bits-back net.
Bits-back coding identity numerically verified on polynomial regression at d=4. Gross sender cost (steps 2 + 3 of the protocol) minus the bits recovered in step 4 equals the bits-back net, exactly −ELBO. The Bayesian-optimal −log p(y) sits between the two, and its gap to the bits-back net is the KL projection bias of §6, the same object viewed two ways.
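Theorem 8 can also be checked numerically on a toy problem rather than the polynomial spine. In the sketch below, the Monte Carlo estimate of the net description length E_q[−log p(θ) − log p(y | θ) + log q(θ)] is compared against −log p(y) + KL(q ∥ p(· | y)), both available in closed form for a 1-D conjugate Gaussian-mean model; the model and the mismatched q are illustrative choices.

import numpy as np
from scipy.stats import norm, multivariate_normal

rng = np.random.default_rng(2)
sigma, tau, n = 1.0, 2.0, 20
y = rng.normal(0.7, sigma, size=n)

# Closed forms: log p(y), and the exact posterior N(mu_n, s_n^2)
cov = sigma**2 * np.eye(n) + tau**2 * np.ones((n, n))
log_evidence = multivariate_normal(np.zeros(n), cov).logpdf(y)
s_n2 = 1.0 / (1.0 / tau**2 + n / sigma**2)
mu_n = s_n2 * y.sum() / sigma**2

# A mismatched variational q and its closed-form KL to the posterior
q_mu, q_sd = mu_n + 0.3, 1.5 * np.sqrt(s_n2)
kl_q_post = (np.log(np.sqrt(s_n2) / q_sd)
             + (q_sd**2 + (q_mu - mu_n)**2) / (2 * s_n2) - 0.5)

# Monte Carlo bits-back net length: E_q[-log p(theta) - log p(y | theta) + log q(theta)]
theta = rng.normal(q_mu, q_sd, size=200_000)
net_len = np.mean(-norm.logpdf(theta, 0.0, tau)
                  - norm.logpdf(y[:, None], theta[None, :], sigma).sum(0)
                  + norm.logpdf(theta, q_mu, q_sd))

print(f"bits-back net (MC)      : {net_len:.4f} nats")
print(f"-log p(y) + KL (closed) : {-log_evidence + kl_q_post:.4f} nats")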

VBMS as MDL: shortest expected description length

The MDL frame for model selection: among candidate models {M1,,MK}\{M_1, \ldots, M_K\}, prefer the model that gives the shortest expected description length of the data. Under the optimal Bayesian code, this is the model with the highest logp(yM)\log p(y \mid M). Under the bits-back protocol with a chosen variational family QM\mathcal{Q}_M per model, this is the model with the highest VBMS(M)\mathrm{VBMS}(M) — the lowest variational free energy. In MDL language: VBMS is choosing the model whose variational coding scheme is most efficient.

This MDL frame matters for two reasons. First, it places VBMS within the broader minimum-description-length literature (Rissanen 1978, Grünwald 2007), connecting it to the BIC, MML, and other MDL-derived information criteria via shared coding-theoretic roots. Second, it suggests new model-selection criteria built on different coding protocols.

A coda on Hinton & van Camp’s original motivation: the 1993 paper introduced bits-back coding specifically for neural-network model compression — finding small networks whose weights have short bits-back description length, so that the network plus its weights compresses the training data well. Modern Bayesian neural network methods are direct descendants. The MDL connection is not just historical; it remains the motivating frame for several active areas of Bayesian deep learning.


Computational notes

The previous sections developed VBMS as a mathematical object. §11 is what the practitioner actually does at the keyboard.

The basic workflow

In PyMC, the canonical VBMS workflow looks like:

import pymc as pm
import arviz as az
import numpy as np

models = {"M_hier": build_hier_model(...), "M_pool": build_pool_model(...)}
elbos = {}
for name, model in models.items():
    with model:
        approx = pm.fit(n=30_000, method="advi", random_seed=0, progressbar=False)
    elbos[name] = -float(np.mean(approx.hist[-1000:]))  # last-1000-iter ELBO
log_BF = elbos["M_hier"] - elbos["M_pool"]

The output is a single scalar ELBO per model and a single log Bayes factor. Three immediate weaknesses: the ADVI objective is non-convex, so a single random_seed=0 run may have landed in a local optimum; the last-1000-iteration mean is a Monte Carlo estimate of the ELBO with non-trivial variance; and we have no diagnostic telling us whether the variational posterior is a reasonable approximation.

Multiple restarts and max-ELBO selection

ADVI maximizes a non-convex objective. With one random seed, the result depends on initialization. The standard remedy is multiple random restarts, fitting ADVI from 5–20 different random seeds and selecting the maximum-ELBO restart as VBMS(M)\mathrm{VBMS}(M).

The ill-behaved cases:

  • Variance collapse. Mean-field ADVI on heavy-tailed posteriors sometimes drives one or more variational variances toward zero, giving a degenerate qq whose reported “ELBO” is an artifact of the collapse and meaningless as an evidence bound.
  • Mode collapse. On multimodal posteriors, ADVI restarts from different seeds can land in different modes with different ELBOs.
  • Plateau ELBO. The ADVI optimization can plateau slightly below the true optimum.

A reasonable default: 10 restarts × 50,000 iterations, last-2000-iteration mean ELBO, max-ELBO selection across restarts.
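That default maps onto a small helper around the §11 workflow. The sketch below reuses pm.fit exactly as in the first listing; vbms_score is a hypothetical name, and approx.hist is the negative-ELBO trace that pm.fit records, so the helper negates its tail mean.

import numpy as np
import pymc as pm

def vbms_score(model, n_restarts=10, n_iter=50_000, tail=2_000):
    """Max-ELBO-over-restarts VBMS score for one PyMC model (a sketch)."""
    elbos = []
    for seed in range(n_restarts):
        with model:
            approx = pm.fit(n=n_iter, method="advi",
                            random_seed=seed, progressbar=False)
        elbos.append(-float(np.mean(approx.hist[-tail:])))  # last-`tail`-iteration mean ELBO
    return max(elbos), elbos                                 # max-ELBO selection plus the full spread

Keeping the full list of per-restart ELBOs (not just the max) is what makes the across-restart spread, the optimization noise floor flagged again in the honest-limits discussion, visible.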

The Pareto-k^\hat k diagnostic

Yao, Vehtari, Simpson & Gelman (2018) proposed Pareto-smoothed importance sampling (PSIS) on the variational posterior as a diagnostic. The recipe: sample SS points θsq(θ)\theta_s \sim q^*(\theta), compute log importance weights logws=logp(y,θs)logq(θs)\log w_s = \log p(y, \theta_s) - \log q^*(\theta_s), fit a generalized Pareto distribution to the upper tail, and report the shape parameter k^\hat k.

The interpretation:

  • k^<0.5\hat k < 0.5: the importance weights have finite variance, qq^* is a reliable approximation.
  • 0.5k^<0.70.5 \le \hat k < 0.7: the variance is finite but the tail is heavy enough that bias correction (IWELBO at moderate KK) is recommended.
  • k^0.7\hat k \ge 0.7: the variance is infinite or the variational approximation has gone wrong; do not trust ELBO or IWELBO. Use AIS (§9) or full-rank VI.
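The diagnostic recipe reduces to a few lines once the joint and the variational density can be evaluated on draws from q*. The sketch below assumes three hypothetical callables (q_sample, log_joint, and log_q) standing in for whatever your PyMC version exposes, and uses ArviZ's az.psislw for the generalized-Pareto fit.

import numpy as np
import arviz as az

def pareto_khat(q_sample, log_joint, log_q, S=4_000):
    """Pareto k-hat diagnostic for a fitted variational posterior (a sketch).

    q_sample(S)      -> S draws from q*
    log_joint(theta) -> log p(y, theta) at each draw
    log_q(theta)     -> log q*(theta) at each draw
    """
    theta = q_sample(S)
    log_w = log_joint(theta) - log_q(theta)      # log importance weights
    _, khat = az.psislw(log_w - log_w.max())     # Pareto-smoothed fit to the weight tail
    return float(khat)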

ADVI versus full-rank versus structured VI

PyMC’s pm.fit accepts three method choices, in increasing order of expressiveness:

  • method="advi" — mean-field Gaussian, O(d)O(d) parameters. Fastest. Default.
  • method="fullrank_advi" — full-rank Gaussian, O(d2)O(d^2) parameters. Captures linear correlation.
  • method="svgd" or particle methods — non-parametric variational families.

For VBMS the choice matters because each method gives a different bias profile. A reasonable default: start with mean-field, and escalate to full-rank when the dimension $d$ is moderate ($d \le 100$) and Pareto-$\hat k$ is borderline. Beyond $d = 100$, structured VI with hand-designed factor groups (Hoffman & Blei 2015) is more sample-efficient than full-rank.
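The escalation itself is a one-argument change to pm.fit. A minimal sketch, assuming the mean-field Pareto-k-hat value from the diagnostic above is in hand:

import pymc as pm

def refit_fullrank_if_borderline(model, khat, n_iter=50_000, seed=0):
    """Escalate from mean-field to full-rank ADVI when Pareto-k-hat is borderline (a sketch)."""
    if 0.5 <= khat < 0.7:
        with model:
            return pm.fit(n=n_iter, method="fullrank_advi",
                          random_seed=seed, progressbar=False)
    return None  # either the mean-field fit stands (khat < 0.5) or AIS/SMC is needed (khat >= 0.7)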

Hyperprior sensitivity

VBMS rankings can be sensitive to the strength of the prior at small nn. The §4 Bishop GMM ARD example showed this in one direction: α0<1\alpha_0 < 1 drives ARD; α01\alpha_0 \ge 1 disables it. The remedy: report VBMS rankings under at least two prior strengths and flag any pair whose ranking flips between them as “prior-sensitive.”

Budget guidelines and escalation

A working budget hierarchy:

  1. Mean-field ADVI, 10 restarts × 50K iter. ~1–10 seconds per model. First pass.
  2. Pareto-k^\hat k diagnostic. Always do this.
  3. Full-rank ADVI if Pareto-k^\hat k is borderline and d100d \le 100.
  4. IWELBO at moderate KK (K=50K = 50 to K=500K = 500) if ranking magnitudes matter.
  5. AIS or SMC if Pareto-k^\hat k is 0.7\ge 0.7, if multimodality is suspected, or if the magnitude of the Bayes factor matters.
  6. Run all of the above and compare if the comparison is high-stakes.

VBMS in modern ML

The §1–§11 development has been deliberately mathematical. We close the technical part of the topic with a brief tour of where VBMS shows up in modern machine learning practice.

BNN width and depth selection. A Bayesian neural network is a parametric model whose architecture is itself a model-selection choice. VBMS gives a fast first-pass ranking: fit ADVI on each architecture, take ELBO differences, rank. The §7 ranking-preservation criterion governs when this is reliable. The Bayesian Neural Networks topic develops the full BNN-VBMS workflow, including the MC-dropout-as-VI shortcut.

Neural architecture search. VBMS scales the BNN-width-selection idea to the full NAS problem: enumerate architectures, fit ADVI on each, rank by VBMS, take the top-kk for full HMC re-ranking, deploy the winner. Modern NAS literature has converged on this two-stage pattern (VBMS-style fast oracle plus expensive re-ranking).

Sparse Bayesian priors. Choosing among horseshoe, regularized horseshoe, spike-and-slab, R2-D2 priors on a specific dataset is a model-selection problem. VBMS over the hyperprior choices ranks them by ELBO. The mean-field bias is non-trivial here — sparse priors typically have heavy-tailed posteriors that mean-field underestimates — so full-rank or flow VI is the right tool. Sparse Bayesian Priors develops this in detail, with §11 specializing the VBMS escalation hierarchy to the four-prior menu and §6 working through the funnel pathology that makes naive HMC fail on the horseshoe.

Singular models and Watanabe’s WAIC. The §3 BIC asymptotic equivalence theorem required the Hessian to be positive definite at the MAP, which excludes singular models — mixture models, neural networks, and hidden Markov models. For these, the BIC’s (d/2)logn-(d/2) \log n penalty is wrong, and Watanabe’s (2010) WBIC and WAIC replace it with corrections derived from singular learning theory. The variational analog — VBMS for singular models — is an open research question.

Amortized variational inference. When VBMS is used inside an iterative algorithm, the cost of fitting a fresh ADVI from scratch for each candidate model becomes the bottleneck. Amortized inference — training an inference network that maps from data to variational parameters — replaces the per-task optimization with a forward pass. The planned Meta-Learning (coming soon) topic develops the connection.

Connections and limits

We’ve developed VBMS as a self-contained model-selection tool. What remains is to place the topic in its broader context — to show where VBMS sits in the landscape of model-selection methodology, what it does well, what it does badly, and which tool to reach for when VBMS is the wrong answer.

formalML topology

VBMS depends on five formalML topics as substrates and directly informs several planned T5 topics.

Backward dependencies. Variational Inference is the substrate — VBMS is variational inference applied to the marginal-likelihood integral. KL Divergence sits underneath — the §6 reverse-KL projection picture and the §7 mode-seeking distinction depend on the KL-divergence topic’s f-divergence framework. Minimum Description Length is the substrate of §10 — the bits-back coding identity discharges the curriculum-graph prereq edge directly. Bayesian Neural Networks §3 develops the Laplace approximation at full depth. Gradient Descent §4 develops the ADVI optimizer that §11 takes as a black box.

Forward dependencies. Several planned T5 topics build on VBMS. Sparse Bayesian Priors uses VBMS to rank horseshoe / regularized horseshoe / spike-and-slab / R2-D2 priors on a fixed dataset, with PSIS Pareto-k^\hat k as the gating diagnostic. Meta-Learning (coming soon) develops amortized variational inference. Stacking and Predictive Ensembles covers the M-open alternative to VBMS.

The M-closed / M-complete / M-open framework

The Bernardo & Smith (1994) framework distinguishes three regimes for Bayesian model comparison:

  • M-closed — the true data-generating process is one of the candidate models. This is the regime VBMS, BIC, and BMA all assume.
  • M-complete — the true data-generating process is outside the candidate set, but well-approximated by some convex combination. The asymptotic story is similar to M-closed.
  • M-open — the true data-generating process is genuinely outside the candidate set, and no convex combination of candidates is a good approximation. The right tool is stacking (Yao et al. 2018b).

For a topic on VBMS, M-closed is the assumed regime.

Honest limits

Three structural limits of VBMS:

The ELBO is a noisy optimization output, not a clean model-quality statistic. When the same model is fit multiple times with different random seeds, the ELBO can vary by 0.1 to several nats; this is optimization variance, not statistical evidence, so ELBO differences between models that are smaller than the across-restart spread should not decide a comparison.

Mode-seeking on multimodal posteriors. Reverse-KL projection is mode-seeking; on a posterior with $K$ well-separated, roughly equal-mass modes, mean-field VB collapses onto a single mode and the resulting ELBO undershoots the log-evidence by roughly $\log K$ nats. For physically meaningful multimodality, the ELBO bias is real and rankings can be catastrophically wrong.

The bias is a permanent feature, not an artifact. §6 made the geometric statement: for any parametric variational family that does not contain the posterior, the bias is non-zero. No engineering trick eliminates it — only growing the family does. For VBMS in production, the §11 escalation hierarchy is needed; bias correction is not a formality but a working necessity for high-stakes comparisons.

Alternatives: when to reach for which

Five major alternatives to VBMS for model selection:

Bayesian Information Criterion (BIC, Schwarz 1978). The leading-order Laplace approximation to logp(yM)\log p(y \mid M) as nn \to \infty. Reach for it when: nn is large, the model is regular, the data are iid, and the candidate set is M-closed. Don’t reach for it when: nn is small, the model is singular, or the comparison is in M-open.

Akaike Information Criterion (AIC, Akaike 1974). A predictive-accuracy criterion of the form $-2 \log L(\hat\theta_{\mathrm{MLE}}) + 2d$, where $d$ is the number of free parameters. Reach for it when: the goal is predictive performance rather than identifying the true model.

Widely Applicable Information Criterion (WAIC, Watanabe 2010). A predictive-accuracy criterion that handles singular models. Reach for it when: the model is singular; MCMC samples are available; the goal is predictive accuracy.

Pareto-smoothed importance-sampled leave-one-out cross-validation (PSIS-LOO, Vehtari et al. 2017). Approximate LOO via importance sampling on posterior draws. Reach for it when: the goal is small-data predictive performance; MCMC samples are available.

Bayesian Model Averaging (BMA, Hoeting et al. 1999). Rather than picking a single model, average the predictions of all candidate models weighted by their posterior model probabilities. Reach for it when: M-closed assumption holds; the goal is predictive performance under model uncertainty.

Stacking (Yao et al. 2018b). Average predictive distributions from candidate models via weights chosen to optimize cross-validated predictive accuracy. Reach for it when: M-open assumption holds. The Stacking and Predictive Ensembles topic develops the framework in detail.

A decision tree, simplified:

  • M-closed and want to identify the true model → VBMS or BIC for ranking, plus PSIS-LOO for confirmation.
  • M-closed and want predictive performance → BMA, with VBMS or marginal-likelihood-based weights.
  • M-complete and want predictive performance → BMA or stacking; the choice depends on how close to M-closed the data are.
  • M-open → stacking. Don’t use VBMS, BMA, or BIC.
  • Singular models (mixtures, NNs, HMMs) → WAIC for ranking; PSIS-LOO for predictive accuracy. VBMS is OK if you’re careful about Pareto-k^\hat k.
  • Small data with sensitivity to hyperpriors → PSIS-LOO with sensitivity sweeps; VBMS with hyperprior sensitivity analysis.
  • Large-scale exploration (NAS, hyperparameter grids) → VBMS for the first pass; AIS or PSIS-LOO for top-kk confirmation.

VBMS sits in the “fast, biased, useful for M-closed and M-complete” niche. It is not the right tool for M-open work or for high-stakes singular-model comparisons. Knowing where it fits and where it doesn’t is most of the value of the topic.

Closing remarks

Variational inference replaced the intractable evidence integral with an optimization. Variational Bayes for model selection took that optimization one step further and turned it into a model-comparison criterion — fast, biased, with bias structure characterizable through the §6 reverse-KL projection picture and the §7 Wang–Blei consistency theorem. The topic’s central claim, restated: the ELBO is not a wrong estimator of the log-evidence but a biased one with known mechanism, and that mechanism is informative enough that VBMS rankings agree with closed-form rankings in most regimes practitioners care about.

The right way to use VBMS, in one sentence: fit ADVI for each candidate model with multiple restarts; compute Pareto-k^\hat k on the variational posteriors; report VBMS rankings if Pareto-k^\hat k is small and the comparison is M-closed; escalate to IWELBO, AIS, or stacking when these conditions don’t hold.

VBMS is one tool. The topic was about understanding it well — its mechanics, its bias, its scope, and its place in the larger model-selection landscape.

Connections

  • Direct prerequisite. VI §2.2 explicitly forward-points to this topic: 'the ELBO is a free lower bound on the evidence...this is the substrate of variational Bayesian model selection.' This topic discharges that forward-pointer. The mean-field, ADVI, and full-rank machinery developed there is the substrate; this topic is what you do with it once you have multiple models to compare. variational-inference
  • The ELBO bias as a model-selection criterion is exactly the reverse-KL gap from the variational family to the posterior. Mode-seeking, mass-covering asymmetry, and the variational representations of f-divergences all bear on whether the bias preserves model rankings. §6 is the load-bearing section here. kl-divergence
  • §10's bits-back coding interpretation is the information-theoretic frame for VBMS: the variational free energy equals the expected description length of the data plus a coding overhead that vanishes only when the variational family contains the posterior. Discharges the curriculum-graph prereq edge `minimum-description-length → variational-bayes-for-model-selection`. minimum-description-length
  • The Laplace approximation developed in BNN §3 is the substrate of §3's classical-IC bridge. BNN §4's MC-dropout-as-VI also reads as a model-selection statement: dropout rate p selects among Bernoulli variational families, and VBMS gives a principled criterion for choosing p. bayesian-neural-networks
  • §13's honest-limits discussion places VBMS in the M-closed setting where it works well and contrasts with stacking's M-open / predictive-distribution-averaging treatment. VBMS picks one model; stacking averages many. stacking-and-predictive-ensembles
  • The GP marginal likelihood is a closed-form log-evidence; §5's polynomial-regression Bayes-factor example has a GP analog where kernel choice plays the model-selection role. GP §5.2's Bayesian Occam's razor narrative is the same mechanism §1 introduces here. gaussian-processes
  • §11 references PyMC's compute_log_likelihood and ArviZ's PSIS-VI utilities for the practical workflow; PP is the substrate of the §11 implementation examples. probabilistic-programming

References & Further Reading