advanced bayesian-ml 55 min read

Sequential Monte Carlo

A cloud of particles, a path of targets, and the unbiased log-evidence that falls out for free

§1. Overview and Motivation

We’re going to track a cloud of particles through a sequence of probability distributions. At each step we update each particle’s weight, occasionally throw out the low-weight ones and replicate the high-weight ones, and then jitter the survivors with a Markov move that respects the current target. Do that TT times against a well-chosen schedule of intermediate targets and you have sequential Monte Carlo — an algorithm that handles filtering on state-space models, sampling from intractable static posteriors, and unbiased estimation of marginal likelihoods, all with the same three moves and the same dozen lines of inner-loop code.

The geometric picture is worth holding onto. Imagine a fixed-size swarm of points {θ(i)}i=1N\{\theta^{(i)}\}_{i=1}^N in parameter space, each carrying a non-negative weight w(i)w^{(i)}. The weighted swarm represents the current distribution: high-weight clusters mark high-probability regions; the expectation of any test function ff under the current target is approximately iw(i)f(θ(i))\sum_i w^{(i)} f(\theta^{(i)}). As we slide the target distribution — making it sharper, or conditioning on more data, or bridging from a tractable prior to an intractable posterior — the swarm has to track it. SMC is the recipe for moving the swarm.

The topic delivers three structurally distinct uses of that one recipe:

  • Particle filtering on hidden Markov / state-space models, where the sequence of targets is the filtering posterior p(xty1:t)p(x_t \mid y_{1:t}) and the algorithm generalizes the Kalman filter to non-linear and non-Gaussian dynamics. We demo on a stochastic-volatility model where Kalman doesn’t apply.
  • SMC samplers in the Del Moral, Doucet, and Jasra (2006) sense, where we bridge a tractable prior to an intractable posterior along a synthetic annealing path πtπ01βtπTβt\pi_t \propto \pi_0^{1-\beta_t}\,\pi_T^{\beta_t}. The headline payoff: an unbiased estimator of the log marginal likelihood falls out as a side product, no extra work. We demo on a Gaussian-mixture model evidence calculation and a banana-shaped two-dimensional posterior.
  • Tempered transitions with adaptive schedules, where the temperature ladder βt\beta_t is chosen on the fly via an ESS criterion (Del Moral, Doucet, and Jasra 2012, building on Jasra, Stephens, Doucet, and Tsagaris 2011) rather than hand-tuned. We demo on a log-Gaussian Cox process where target curvature varies enough across the path that a fixed schedule wastes most of the compute.

That’s where we’re going. Let’s start with why anyone needs this — what breaks when we try to do it without the sequential structure.

1.1 Where importance sampling breaks

Standard importance sampling tells us we can estimate Eπ[f(θ)]\mathbb{E}_\pi[f(\theta)] for any target π\pi by drawing from a proposal qq and reweighting:

Eπ[f(θ)]i=1Nw(i)f(θ(i))i=1Nw(i),θ(i)q,w(i)=π(θ(i))q(θ(i)).\mathbb{E}_\pi[f(\theta)] \approx \frac{\sum_{i=1}^N w^{(i)}\, f(\theta^{(i)})}{\sum_{i=1}^N w^{(i)}}, \qquad \theta^{(i)} \sim q, \quad w^{(i)} = \frac{\pi(\theta^{(i)})}{q(\theta^{(i)})}.

In the limit NN \to \infty this estimator is consistent for any qq whose support contains π\pi‘s — a remarkably weak condition. So why doesn’t every Bayesian-inference paper just use it?

The answer is variance. The variance of the IS estimator is controlled by the chi-squared divergence between qq and π\pi, and that divergence explodes when qq misses the bulk of π\pi‘s mass. Concretely: if π\pi is a Gaussian centred at 55 and qq is a Gaussian centred at 00, both unit-variance, then most of qq‘s samples land in π\pi‘s left tail. A handful of samples that happen to drift to the right pick up enormous weights — and the normalized weights end up concentrated on one or two particles. This is weight degeneracy, and it’s the failure mode SMC is designed to fix.

Two panels showing weight degeneracy in one-shot importance sampling. Left: sorted normalized weights on log-log; the top few particles carry almost all the mass, with effective sample size around 10 out of 1000. Right: IS-reweighted histogram heavily concentrated near the few high-weight particles, missing the target mass.
Figure 1. Weight degeneracy in one-shot importance sampling at N = 1000 with target N(5, 1) and proposal N(0, 1). Effective sample size collapses to roughly 10 of 1000; the top five particles carry over 80% of the normalized mass.

We can see it numerically with N=1000N=1000 (Figure 1): even a five-standard-deviation gap between proposal and target leaves an effective sample size of roughly 1010 out of 10001000, with five particles carrying more than 80%80\% of the normalized mass.

The effective sample size (ESS), defined as ESS=1/i(w(i))2\mathrm{ESS} = 1 / \sum_i (w^{(i)})^2 on normalized weights, summarizes the damage in one number. ESS counts the number of “effective” particles — the number an iid sample from π\pi would need to match the IS estimator’s variance. A swarm of 1000 particles with ESS =3= 3 is contributing about as much information as 3 iid samples; the other 997 are dead weight.

The diagnosis is geometric. The proposal and target are too far apart in distribution space for a one-shot bridge. What we need is to interpolate between them.

1.2 The particle-cloud fix

The fix is structurally simple: instead of bridging qq to π\pi in one step, we build a path of intermediate distributions

q=π0,  π1,  π2,  ,  πT=πq = \pi_0,\; \pi_1,\; \pi_2,\; \ldots,\; \pi_T = \pi

where adjacent targets πt\pi_t and πt+1\pi_{t+1} are close enough that a one-step importance-reweighting between them isn’t degenerate. Then we sequentially march the particle cloud from one target to the next: at step tt, update each weight by the incremental ratio πt+1(θ(i))/πt(θ(i))\pi_{t+1}(\theta^{(i)})/\pi_t(\theta^{(i)}), throw out the low-weight particles and replicate the high-weight ones (the resampling move), and apply a Markov kernel that leaves πt+1\pi_{t+1} invariant to spread the survivors back out (the propagation move).

Three observations make the algorithm work:

  1. Reweighting between close targets is well-conditioned. If πt+1\pi_{t+1} differs from πt\pi_t by a small amount of extra likelihood or a small temperature step, then the importance ratios πt+1/πt\pi_{t+1}/\pi_t have small variance and ESS stays O(N)O(N).
  2. Resampling kills the survivor bias. Without it, the cloud would still degenerate after enough steps — just more slowly than one-shot IS. Resampling preserves the weighted distribution in expectation while exchanging it for a fresh equal-weighted cloud with the same empirical measure but more degrees of freedom for the next step.
  3. The Markov kernel restores diversity. Resampling duplicates particles, so we need a mutation step that moves the duplicates apart while preserving the current target. Any πt+1\pi_{t+1}-invariant Markov kernel works — Metropolis–Hastings, HMC, Gibbs — and the choice is essentially a hyperparameter.

The three moves — reweight, resample, propagate — are the SMC skeleton. Every variant we’ll meet is a specific way of choosing the target sequence and the propagation kernel; the skeleton itself doesn’t change.

1.3 Three uses of one skeleton

Particle filtering. Take a state-space model with latent state xtx_t, observation yty_t, transition density p(xtxt1)p(x_t \mid x_{t-1}), and emission density p(ytxt)p(y_t \mid x_t). The filtering target at time tt is πt(x0:t)=p(x0:ty1:t)\pi_t(x_{0:t}) = p(x_{0:t} \mid y_{1:t}). The reweighting ratio πt+1/πt\pi_{t+1}/\pi_t is the per-observation likelihood p(yt+1xt+1)p(y_{t+1} \mid x_{t+1}), and the propagation kernel is the forward transition p(xt+1xt)p(x_{t+1} \mid x_t). This is the bootstrap particle filter (Gordon, Salmond, and Smith 1993).

SMC samplers. Take a static target π(θ)\pi(\theta) and bridge from a tractable prior π0(θ)=p(θ)\pi_0(\theta) = p(\theta) via the geometric annealing path πt(θ)p(θ)p(yθ)βt\pi_t(\theta) \propto p(\theta)\, p(y \mid \theta)^{\beta_t} with 0=β0<β1<<βT=10 = \beta_0 < \beta_1 < \cdots < \beta_T = 1. The reweighting ratio is p(yθ(i))βt+1βtp(y \mid \theta^{(i)})^{\beta_{t+1} - \beta_t}, and the propagation kernel can be any πt\pi_t-invariant MCMC kernel.

Unbiased log-marginal-likelihood estimation. The SMC-samplers product Z^T=t=0T1r^t\widehat Z_T = \prod_{t=0}^{T-1} \widehat r_t is an unbiased estimator of the model evidence ZT=p(y)Z_T = p(y), with no assumption that the sampler has converged. We prove this in §8.2.

1.4 Roadmap and what this discharges

§2 reviews just enough importance sampling to set up what SMC fixes. §3 introduces the SMC skeleton abstractly. §§4–5 catalogue the two ingredient choices — resampling schemes and propagation kernels. §6 specializes to particle filtering with the stochastic-volatility worked example. §7 specializes to SMC samplers on static targets, and §8 extracts the unbiased log-evidence estimator with the Gaussian-mixture worked example. §9 develops the two adaptive moves — ESS-thresholded resampling and adaptive temperature schedules — with the banana worked example. §10 is the convergence theory. §11 is the log-Gaussian Cox process worked example. §§12–13 are computational discipline and connections.

The topic discharges three explicit forward-pointers: formalStatistics: bayesian-computation-and-mcmc ‘s sampler catalogue and importance-sampling variance vocabulary; Probabilistic Programming §6.3’s forward-pointer to pm.sample_smc, which §§8.4 and 11.3 reproduce with a hand-rolled NumPy implementation; and Variational Bayes for Model Selection §13’s forward-pointer to SMC samplers as the unbiased alternative to ELBO-as-log-evidence, discharged here in §8.2 (Theorem 4) and §8.4 (the Gaussian-mixture benchmark).

§2. Importance Sampling at a Glance — What SMC Fixes

Importance sampling is the prerequisite, not the target. We develop SMC against it, so we need the IS vocabulary in place: the variance identity, the ESS diagnostic, and a clean statement of why the one-shot strategy fails on Bayesian targets.

2.1 The IS identity and its variance

Let π\pi be a probability density on ΘRd\Theta \subseteq \mathbb{R}^d, and let qq be a proposal with q(θ)>0q(\theta) > 0 wherever π(θ)>0\pi(\theta) > 0. The importance weight at θ\theta is w(θ)=π(θ)/q(θ)w(\theta) = \pi(\theta)/q(\theta). The change-of-measure identity Eπ[f(θ)]=Eq[w(θ)f(θ)]\mathbb{E}_\pi[f(\theta)] = \mathbb{E}_q[w(\theta) f(\theta)] is immediate.

The unnormalized IS estimator:

μ^Nun=1Ni=1Nw(θ(i))f(θ(i)),θ(i)iidq,\hat\mu^{\,\mathrm{un}}_N = \frac{1}{N} \sum_{i=1}^N w(\theta^{(i)}) f(\theta^{(i)}), \qquad \theta^{(i)} \stackrel{\mathrm{iid}}{\sim} q,

is unbiased. The self-normalized IS estimator μ^Nsn=iw~(i)f(θ(i))/iw~(i)\hat\mu^{\,\mathrm{sn}}_N = \sum_i \tilde w^{(i)} f(\theta^{(i)}) / \sum_i \tilde w^{(i)} (with w~(i)=γ(θ(i))/q(θ(i))\tilde w^{(i)} = \gamma(\theta^{(i)})/q(\theta^{(i)}) for unnormalized target γ\gamma) is consistent but biased at O(1/N)O(1/N) from the delta-method Jensen gap.

Theorem 1 (Variance of the unnormalized IS estimator (chi-squared form)).

Suppose π\pi is normalized and Eπ[w(θ)f(θ)2]<\mathbb{E}_\pi[w(\theta) f(\theta)^2] < \infty. Then

Varq ⁣(μ^Nun)=1N(Eπ[w(θ)f(θ)2]Eπ[f]2).\mathrm{Var}_q\!\left(\hat\mu^{\,\mathrm{un}}_N\right) = \frac{1}{N}\left(\mathbb{E}_\pi[w(\theta) f(\theta)^2] - \mathbb{E}_\pi[f]^2\right).

In particular, taking f1f \equiv 1,

Varq ⁣(1Ni=1Nw(θ(i)))=χ2(πq)N.\mathrm{Var}_q\!\left(\frac{1}{N}\sum_{i=1}^N w(\theta^{(i)})\right) = \frac{\chi^2(\pi \,\Vert\, q)}{N}.
Proof.

The NN summands are iid under qq, so Varq(μ^Nun)=(1/N)Varq(w(θ)f(θ))\mathrm{Var}_q(\hat\mu^{\,\mathrm{un}}_N) = (1/N)\mathrm{Var}_q(w(\theta) f(\theta)). Expand:

Varq ⁣(w(θ)f(θ))=Eq ⁣[w(θ)2f(θ)2](Eπ[f])2.\mathrm{Var}_q\!\left(w(\theta) f(\theta)\right) = \mathbb{E}_q\!\left[w(\theta)^2 f(\theta)^2\right] - (\mathbb{E}_\pi[f])^2.

Change measure: Eq[w2f2]=(π/q)2f2qdθ=(π/q)f2πdθ=Eπ[wf2]\mathbb{E}_q[w^2 f^2] = \int (\pi/q)^2 f^2 q\, d\theta = \int (\pi/q) f^2 \pi\, d\theta = \mathbb{E}_\pi[w f^2]. For the f1f \equiv 1 case, Eπ[w]1=π2/qdθ1=χ2(πq)\mathbb{E}_\pi[w] - 1 = \int \pi^2/q\, d\theta - 1 = \chi^2(\pi \,\Vert\, q) by the standard chi-squared identity.

The variance identity is the source of every IS pathology. The estimator is consistent for any qq with full support on π\pi, but the rate is governed by χ2(πq)\chi^2(\pi \,\Vert\, q).

2.2 Effective sample size and weight degeneracy

Definition 1 (Effective sample size).

Given normalized weights w(i)0w^{(i)} \geq 0 with iw(i)=1\sum_i w^{(i)} = 1,

ESS=1i=1N(w(i))2.\mathrm{ESS} = \frac{1}{\sum_{i=1}^N (w^{(i)})^2}.

Lemma 1 (ESS bounds).

1ESSN1 \leq \mathrm{ESS} \leq N, with the upper bound attained iff all weights equal 1/N1/N and the lower bound iff one weight equals 11.

Proof.

Cauchy–Schwarz: 1=(w(i))2N(w(i))21 = (\sum w^{(i)})^2 \leq N \sum (w^{(i)})^2, so (w(i))21/N\sum (w^{(i)})^2 \geq 1/N and ESSN\mathrm{ESS} \leq N. Lower bound: (w(i))2maxiw(i)w(i)=maxiw(i)1\sum (w^{(i)})^2 \leq \max_i w^{(i)} \cdot \sum w^{(i)} = \max_i w^{(i)} \leq 1.

Proposition 1 (ESS via coefficient of variation).

For normalized weights with w(i)=1\sum w^{(i)} = 1,

ESS=N1+CV^2,CV^2=Ni(w(i))21.\mathrm{ESS} = \frac{N}{1 + \widehat{\mathrm{CV}}^2}, \qquad \widehat{\mathrm{CV}}^2 = N \sum_i (w^{(i)})^2 - 1.
Proof.

With wˉ=1/N\bar w = 1/N, (w(i)wˉ)2=(w(i))21/N\sum (w^{(i)} - \bar w)^2 = \sum (w^{(i)})^2 - 1/N, so CV^2=N(w(i))21\widehat{\mathrm{CV}}^2 = N \sum (w^{(i)})^2 - 1. Inverting gives the claim.

ESS, CV^2\widehat{\mathrm{CV}}^2, and asymptotically the chi-squared divergence are three faces of the same diagnostic: ESS/N1/(1+χ2(πq))\mathrm{ESS}/N \to 1/(1 + \chi^2(\pi \,\Vert\, q)) as NN \to \infty.

Two-panel ESS diagnostic: identity verification ESS = N / (1 + CV²) and ESS collapse as proposal-target gap grows.
Figure 2. ESS as a degeneracy diagnostic. Left: ESS matches N / (1 + CV²) exactly. Right: ESS collapses from N toward 1 as the proposal mean drifts five standard deviations away from the target mean.

2.3 Why a one-shot IS proposal fails on hard targets

The variance identity tells us we need χ2(πq)N\chi^2(\pi \,\Vert\, q) \ll N. In Bayesian inference the posterior πn(θ)p(θ)jp(yjθ)\pi_n(\theta) \propto p(\theta) \prod_j p(y_j \mid \theta) concentrates dramatically as nn grows.

Heuristic scaling. Under regularity conditions (Bernstein–von Mises), the posterior πn\pi_n concentrates at the MLE θ^n\hat\theta_n with effective volume nd/2\sim n^{-d/2}. The chi-squared divergence therefore scales as χ2(πnp)=O(nd/2)\chi^2(\pi_n \,\Vert\, p) = O(n^{d/2}) in the regular parametric setting. For irregular models the scaling is worse and can grow exponentially in nn. We won’t formalize this further (the precise statement is the Bernstein–von Mises theorem; see van der Vaart 1998 Chapter 10 for the rigorous version), but the message is unambiguous: one-shot IS from the prior is dead on arrival.

The SMC fix is now visible by inversion. Decompose the single shot into many small shots, each with bounded chi-squared. Choose a path π0=q,  π1,  ,  πT=π\pi_0 = q,\; \pi_1,\; \ldots,\; \pi_T = \pi such that χ2(πt+1πt)N\chi^2(\pi_{t+1} \,\Vert\, \pi_t) \ll N for each tt. The geometric annealing path achieves this when βt\beta_t varies slowly.

§3. The SMC Skeleton

We have the motivation (§1) and the IS toolkit (§2). Now we define the SMC algorithm abstractly — the skeleton that every specific variant in §§6–11 specializes.

3.1 A sequence of intermediate targets

Fix a final target π=πT\pi = \pi_T and a tractable starting distribution π0\pi_0. Build a path of intermediate distributions π0,π1,,πT\pi_0, \pi_1, \ldots, \pi_T connecting the two. Three families of paths recur:

  • Data-augmenting (particle filtering). πt(x0:t)=p(x0:ty1:t)\pi_t(x_{0:t}) = p(x_{0:t} \mid y_{1:t}). The state space grows with tt.
  • Annealing (SMC samplers). πt(θ)π0(θ)1βtπT(θ)βt\pi_t(\theta) \propto \pi_0(\theta)^{1-\beta_t}\,\pi_T(\theta)^{\beta_t} with β0=0<βT=1\beta_0 = 0 < \beta_T = 1.
  • Tempered. πt(θ)exp(βtU(θ))\pi_t(\theta) \propto \exp(-\beta_t U(\theta)) for a potential UU.

Write each target as πt(θ)=γt(θ)/Zt\pi_t(\theta) = \gamma_t(\theta) / Z_t, where γt\gamma_t is unnormalized (tractable) and Zt=γtdθZ_t = \int \gamma_t\, d\theta is the normalizing constant (typically intractable).

3.2 Reweight, resample, propagate — the three alternating moves

The SMC algorithm maintains a particle cloud {(θt(i),Wt(i))}i=1N\{(\theta_t^{(i)},\, W_t^{(i)})\}_{i=1}^N with θt(i)Θ\theta_t^{(i)} \in \Theta and Wt(i)0W_t^{(i)} \geq 0. We track unnormalized cumulative weights; self-normalized weights wt(i)=Wt(i)/jWt(j)w_t^{(i)} = W_t^{(i)} / \sum_j W_t^{(j)} are derived on demand.

Initialization. Draw θ0(i)iidπ0\theta_0^{(i)} \stackrel{\mathrm{iid}}{\sim} \pi_0 and set W0(i)=1W_0^{(i)} = 1.

For each t=0,1,,T1t = 0, 1, \ldots, T-1:

Move 1 (reweight). Incremental importance ratios

αt(i)=γt+1(θt(i))γt(θt(i)),Wt+1(i)=Wt(i)αt(i).\alpha_t^{(i)} = \frac{\gamma_{t+1}(\theta_t^{(i)})}{\gamma_t(\theta_t^{(i)})}, \qquad W_{t+1}^{(i)} = W_t^{(i)} \cdot \alpha_t^{(i)}.

Move 2 (resample, optional and ESS-thresholded). If ESSt<τN\mathrm{ESS}_t < \tau N, draw indices At(i)Categorical(wt+1(1),,wt+1(N))A_t^{(i)} \sim \mathrm{Categorical}(w_{t+1}^{(1)}, \ldots, w_{t+1}^{(N)}) (the specific scheme — multinomial, systematic, stratified, residual — is the topic of §4). Replace the cloud with {(θt(At(i)),Wˉt+1)}i=1N\{(\theta_t^{(A_t^{(i)})}, \bar W_{t+1})\}_{i=1}^N where Wˉt+1=1NjWt+1(j)\bar W_{t+1} = \frac{1}{N}\sum_j W_{t+1}^{(j)}.

Move 3 (propagate). Apply a Markov kernel Kt+1K_{t+1} that is πt+1\pi_{t+1}-invariant: πt+1(θ)Kt+1(θ,θ)dθ=πt+1(θ)\int \pi_{t+1}(\theta) K_{t+1}(\theta, \theta')\, d\theta = \pi_{t+1}(\theta'). For each particle, draw θt+1(i)Kt+1(θt(At(i)),)\theta_{t+1}^{(i)} \sim K_{t+1}(\theta_t^{(A_t^{(i)})}, \cdot).

Terminal output. After TT steps, EπT^[f]=iwT(i)f(θT(i))\widehat{\mathbb{E}_{\pi_T}}[f] = \sum_i w_T^{(i)} f(\theta_T^{(i)}) and Z^T=1NiWT(i)\widehat Z_T = \frac{1}{N}\sum_i W_T^{(i)}.

The convention is reweight → resample → propagate, matching Del Moral, Doucet, and Jasra (2006) §3 and PyMC’s pm.sample_smc.

3.3 The weighted empirical measure

The cloud represents a probability distribution via the weighted empirical measure

πtN(dθ)=i=1Nwt(i)δθt(i)(dθ).\pi_t^N(d\theta) = \sum_{i=1}^N w_t^{(i)}\, \delta_{\theta_t^{(i)}}(d\theta).

Three observations. (1) πtN\pi_t^N is the algorithm’s state; tracking it is what tracking the cloud means. (2) It’s a random measure — different runs of SMC produce different particle locations and weights. (3) Special cases correspond to special algorithm states: uniform weights right after a resample, point mass when the cloud collapses to one particle.

The §10.1 consistency theorem will establish that πtNπt\pi_t^N \to \pi_t weakly as NN \to \infty, with rates from the §10.3 CLT.

3.4 Notation, log-space discipline, and the cumulative-weight invariant

SymbolMeaning
ΘRd\Theta \subseteq \mathbb{R}^dparameter space (or state space)
πt=γt/Zt\pi_t = \gamma_t / Z_ttarget at step tt
θt(i)Θ\theta_t^{(i)} \in \Thetaparticle ii at step tt
Wt(i)0W_t^{(i)} \geq 0, t(i)=logWt(i)\ell_t^{(i)} = \log W_t^{(i)}cumulative unnormalized weight
wt(i)=Wt(i)/jWt(j)w_t^{(i)} = W_t^{(i)} / \sum_j W_t^{(j)}self-normalized weight
αt(i)=γt+1/γt\alpha_t^{(i)} = \gamma_{t+1}/\gamma_t at θt(i)\theta_t^{(i)}incremental ratio
Kt+1K_{t+1}πt+1\pi_{t+1}-invariant Markov kernel
ESSt=1/(wt(i))2\mathrm{ESS}_t = 1/\sum (w_t^{(i)})^2effective sample size
Z^t=1NWt(i)\widehat Z_t = \frac{1}{N}\sum W_t^{(i)}running normalizing-constant estimator

Log-space discipline. Particle weights on hard targets routinely span ten or more orders of magnitude. Linear-space representation silently underflows; the implementation enforces log-space throughout:

  • Store t(i)=logWt(i)\ell_t^{(i)} = \log W_t^{(i)} and update additively: t+1(i)=t(i)+logγt+1(θt(i))logγt(θt(i))\ell_{t+1}^{(i)} = \ell_t^{(i)} + \log \gamma_{t+1}(\theta_t^{(i)}) - \log \gamma_t(\theta_t^{(i)}).
  • Normalize via logsumexp: logwt(i)=t(i)logsumexp(t)\log w_t^{(i)} = \ell_t^{(i)} - \mathrm{logsumexp}(\ell_t).
  • Compute log-ESS: logESSt=logsumexp(2logwt)\log\mathrm{ESS}_t = -\mathrm{logsumexp}(2 \log w_t).
  • Compute log-Z^\widehat Z: logZ^t=logsumexp(t)logN\log\widehat Z_t = \mathrm{logsumexp}(\ell_t) - \log N.

A preview of the §8.2 unbiasedness theorem. Theorem 4 in §8.2 will establish that for every t{0,1,,T}t \in \{0, 1, \ldots, T\},

E ⁣[Z^t]=E ⁣[1Ni=1NWt(i)]=ZtZ0.\mathbb{E}\!\left[\widehat Z_t\right] = \mathbb{E}\!\left[\frac{1}{N}\sum_{i=1}^N W_t^{(i)}\right] = \frac{Z_t}{Z_0}.

The expectation is over all SMC randomness — initial draws from π0\pi_0, resampling indices, and Markov-kernel transitions. When π0\pi_0 is the prior (so Z0=1Z_0 = 1), Z^T\widehat Z_T is unbiased for the model evidence ZT=p(y)Z_T = p(y). This is the identity that justifies all of SMC’s evidence-estimation use cases. We use it as a structural fact while developing §§4–7; the full induction proof — the topic’s most important load-bearing result — is the subject of §8.2.

Resampling preserves the invariant. Pre-resample, Z^t+1=(1/N)iWt+1(i)\widehat Z_{t+1} = (1/N)\sum_i W_{t+1}^{(i)}. Post-resample, each particle carries Wˉt+1=(1/N)jWt+1(j)\bar W_{t+1} = (1/N)\sum_j W_{t+1}^{(j)}, so the new running estimator is (1/N)NWˉt+1=Wˉt+1(1/N) \cdot N \cdot \bar W_{t+1} = \bar W_{t+1}, identical to the pre-resample value. The resampling step is invariance-preserving by construction.

Three-panel SMC sweep on a Gaussian bridge from N(0, 4) to N(5, 1). Left: particle cloud trajectory across β with weight-encoded color. Center: ESS over the path with the τN threshold marked. Right: terminal cloud histogram matching the target Gaussian.
Figure 3. SMC skeleton on a Gaussian bridge at N = 200, T = 10. Two resampling events (marked R) keep ESS above τN; the terminal cloud matches the target Gaussian.

The viz below lets you change NN, TT, the resampling threshold τ\tau, and the RWM step size, and watch the cloud trajectory, the ESS profile, and the terminal histogram update.

log Ẑ_T = -0.233 (truth 0)
terminal μ = 5.007 (truth 5)
terminal σ = 1.096 (truth 1)
resamples = 2 / 10

Three behaviors worth discovering: set τ=0\tau = 0 (never resample) and the ESS collapses by the end of the path; set τ=1\tau = 1 (resample every step) and the algorithm injects more variance than necessary; shrink T=2T = 2 and the cloud can’t track the target. The §10.3 CLT formalizes the variance contributions of each move.

§4. Resampling Schemes

Move 2 of the skeleton — resampling — turns a degenerate weighted cloud into a fresh equal-weighted cloud. Four schemes are in common use, with a strict variance ordering:

Var(systematic)Var(stratified)Var(residual)Var(multinomial).\mathrm{Var}(\text{systematic}) \leq \mathrm{Var}(\text{stratified}) \leq \mathrm{Var}(\text{residual}) \leq \mathrm{Var}(\text{multinomial}).

All four are O(N)O(N) via numpy.searchsorted. Use searchsorted, never numpy.random.choice(..., p=weights) — this is a project-wide hard rule. Systematic is the default and what PyMC’s pm.sample_smc uses.

4.1 Multinomial resampling — the naive baseline

Multinomial resampling draws NN ancestor indices independently:

At(i)iidCategorical(wt(1),,wt(N)).A_t^{(i)} \stackrel{\mathrm{iid}}{\sim} \mathrm{Categorical}(w_t^{(1)}, \ldots, w_t^{(N)}).

The joint distribution of (Nt(1),,Nt(N))(N_t^{(1)}, \ldots, N_t^{(N)}) is multinomial(N;wt(1),,wt(N))(N; w_t^{(1)}, \ldots, w_t^{(N)}); the marginal of each Nt(i)N_t^{(i)} is Binomial(N,wt(i))(N, w_t^{(i)}).

cumw = np.cumsum(w_norm)
u = rng.random(N)
idx = np.clip(np.searchsorted(cumw, u), 0, N - 1)

4.2 Systematic resampling (Kitagawa 1996)

Systematic resampling draws a single uU(0,1/N)u \sim U(0, 1/N) and uses the deterministic grid uk=u+k/Nu_k = u + k/N. The negative association among the uku_k is what drives the variance reduction.

cumw = np.cumsum(w_norm)
u = (rng.random() + np.arange(N)) / N
idx = np.clip(np.searchsorted(cumw, u), 0, N - 1)

4.3 Stratified and residual resampling

Stratified resampling (Kitagawa 1996): NN independent uniforms, one per stratum: ukU(k/N,(k+1)/N)u_k \sim U(k/N, (k+1)/N). Same code as systematic but rng.random(N) instead of rng.random().

Residual resampling (Liu and Chen 1998): deterministic integer counts Nw(i)\lfloor N w^{(i)} \rfloor, plus a residual stage via multinomial on the leftover weights.

4.4 Variance reduction — why systematic dominates multinomial

Proposition 2 (Unbiased offspring counts).

For all four schemes, E[Nt(i)]=Nwt(i)\mathbb{E}[N_t^{(i)}] = N w_t^{(i)} for each ii.

Proof.

Multinomial: Nt(i)Binomial(N,wt(i))N_t^{(i)} \sim \mathrm{Binomial}(N, w_t^{(i)}) by construction. Systematic: Particle ii owns the cumulative-weight interval (ci1,ci](c_{i-1}, c_i] of length wt(i)w_t^{(i)}. Each grid point uk=u+k/Nu_k = u + k/N for uU(0,1/N)u \sim U(0, 1/N) falls in this interval with probability wt(i)w_t^{(i)} via a measure-shift argument. Summing over kk: E[Nt(i)]=Nwt(i)\mathbb{E}[N_t^{(i)}] = N w_t^{(i)}. Stratified: analogous, with each uku_k supported on a stratum of length 1/N1/N. Residual: Nt(i)=Nwt(i)+Ntres,(i)N_t^{(i)} = \lfloor N w_t^{(i)} \rfloor + N_t^{\text{res},(i)} where Ntres,(i)N_t^{\text{res},(i)} is Binomial in the residual stage; E[Nt(i)]=Nwt(i)\mathbb{E}[N_t^{(i)}] = N w_t^{(i)} by direct algebra.

Theorem 2 (Variance ordering).

For any bounded f:ΘRf: \Theta \to \mathbb{R},

Var ⁣[iNt(i)f(θt(i))]systematicVar ⁣[]stratifiedVar ⁣[]residualVar ⁣[]multinomial.\mathrm{Var}\!\left[\sum_i N_t^{(i)} f(\theta_t^{(i)})\right]_{\text{systematic}} \leq \mathrm{Var}\!\left[\cdot\right]_{\text{stratified}} \leq \mathrm{Var}\!\left[\cdot\right]_{\text{residual}} \leq \mathrm{Var}\!\left[\cdot\right]_{\text{multinomial}}.
Proof.

We prove the stratified ≤ multinomial direction by a law-of-total-variance argument; the others follow the same template.

Let XkmultiidU(0,1)X_k^{\text{mult}} \stackrel{\mathrm{iid}}{\sim} U(0,1) and XkstratU(k/N,(k+1)/N)X_k^{\text{strat}} \sim U(k/N, (k+1)/N) independently. Define g(u)=f(θt(A(u)))g(u) = f(\theta_t^{(A(u))}) where A(u)=min{j:cju}A(u) = \min\{j : c_j \geq u\} — the ancestor mapping induced by the cumulative-weight CDF.

Under stratified, Var[kg(Xkstrat)]=kVarU(k/N,(k+1)/N)[g]\mathrm{Var}[\sum_k g(X_k^{\text{strat}})] = \sum_k \mathrm{Var}_{U(k/N, (k+1)/N)}[g], since the XkstratX_k^{\text{strat}} are independent across kk.

Under multinomial, Var[kg(Xkmult)]=NVarU(0,1)[g]\mathrm{Var}[\sum_k g(X_k^{\text{mult}})] = N \cdot \mathrm{Var}_{U(0,1)}[g] by the same independence reasoning, since each XkmultX_k^{\text{mult}} has the same uniform marginal.

Apply the law of total variance with stratum indicator K=NXK = \lfloor NX \rfloor for XU(0,1)X \sim U(0, 1):

VarU(0,1)[g]=EK[VarU(K/N,(K+1)/N)[g]]+VarK[EU(K/N,(K+1)/N)[g]].\mathrm{Var}_{U(0,1)}[g] = \mathbb{E}_K[\mathrm{Var}_{U(K/N, (K+1)/N)}[g]] + \mathrm{Var}_K[\mathbb{E}_{U(K/N, (K+1)/N)}[g]].

The first term multiplied by NN equals the stratified variance. The second term is non-negative. Hence multinomial ≥ stratified, with strict inequality unless gg is constant within strata.

The other directions (systematic ≤ stratified, residual ≤ multinomial) are proved via the same negative-association machinery in Chopin and Papaspiliopoulos (2020) §9.3 and Douc, Cappé, and Moulines (2005) §4.3–§4.5.

4.5 The numpy.searchsorted primitive

All four schemes reduce to one operation: given a cumulative-weight array cc and “uniform array” uu, find ancestor indices via np.searchsorted(c, u). Cost: O(NlogN)O(N \log N) for NN queries (binary search), with excellent constants.

Why not np.random.choice(N, size=N, p=w_norm)? Three reasons: (1) performance — np.random.choice is 3–10× slower with allocator pressure on every call; (2) it only does multinomial — the worst scheme by Theorem 2; (3) GC pressure — fresh allocation each call vs reusable cumw buffer.

Project rule: always searchsorted, never np.random.choice with p=weights.

Four-panel comparison of multinomial, residual, stratified, and systematic resampling: offspring counts, estimator variance, wall-clock cost, and variance ratio sweep.
Figure 4. Variance ordering of the four resampling schemes at N = 200, M = 5000 replicates. Systematic and stratified attain the lowest variance; np.random.choice is shown for wall-clock contrast.

The viz below warps the weight distribution via a skewness parameter β\beta and shows offspring counts and estimator variance for all four schemes side-by-side.

larger β → more degenerate weights → wider variance gap
multinomialresidualstratifiedsystematicE[N⁽ⁱ⁾] = N w
multinomial: Var = 1.52e-4 (100% of mult.)
residual: Var = 8.74e-5 (57% of mult.)
stratified: Var = 7.91e-5 (52% of mult.)
systematic: Var = 6.57e-5 (43% of mult.)

§5. The Markov-kernel Propagation Step

Move 3 of the skeleton — propagation — moves each particle via a πt+1\pi_{t+1}-invariant Markov kernel. The kernel can be almost anything — any πt+1\pi_{t+1}-invariant kernel is admissible.

5.1 What kernel invariance buys us

A Markov kernel K:Θ×ΘR+K: \Theta \times \Theta \to \mathbb{R}_+ is π\pi-invariant when π(θ)K(θ,θ)dθ=π(θ)\int \pi(\theta) K(\theta, \theta')\, d\theta = \pi(\theta'). Equivalently: if XπX \sim \pi and YXK(X,)Y \mid X \sim K(X, \cdot), then YπY \sim \pi.

SMC uses MCMC kernels as invariance-preserving mutation steps — not as samplers in their own right. The crucial structural fact: SMC’s correctness does not require the kernel to mix well. Even a poorly-mixing kernel is fine for SMC, as long as it’s πt+1\pi_{t+1}-invariant. The §10 convergence theorems don’t depend on kernel mixing rates; they depend only on invariance.

Lemma 2 (Composition preserves invariance).

If K1K_1 and K2K_2 are π\pi-invariant, then K1K2K_1 \circ K_2 is π\pi-invariant.

Proof.

By Fubini, π(dθ)(K1K2)(θ,A)=K1(θ,A)[K2(θ,dθ)π(dθ)]=K1(θ,A)π(dθ)=π(A)\int \pi(d\theta) (K_1 \circ K_2)(\theta, A) = \int K_1(\theta', A) \left[\int K_2(\theta, d\theta') \pi(d\theta)\right] = \int K_1(\theta', A) \pi(d\theta') = \pi(A) using both invariance conditions.

So “multiple sweeps” of an invariant kernel is also invariant — the §3.2 algorithm is unchanged whether each propagation step applies KK once or kk times. This is what justifies the “RWM with kk sweeps” pattern in the §5.4 demo.

5.2 Metropolis–Hastings within SMC

The MH step: propose θq(θ)\theta' \sim q(\cdot \mid \theta), accept with probability α(θ,θ)=min{1,  π(θ)q(θθ)/[π(θ)q(θθ)]}\alpha(\theta, \theta') = \min\{1,\; \pi(\theta') q(\theta \mid \theta') / [\pi(\theta) q(\theta' \mid \theta)]\}. MH is π\pi-invariant by the detailed-balance proof in formalStatistics: bayesian-computation-and-mcmc §26.

Three observations about MH within SMC: (1) we only need π\pi up to a normalizing constant — the ratio kills the unknown ZZ; (2) each MH step costs one log-target evaluation per particle, plus the proposal density evaluation; (3) vectorization across NN particles is embarrassingly parallel.

5.3 Choosing the kernel — RWM, IMH, Gibbs, HMC

Random-walk Metropolis (RWM). q(θθ)=N(θθ,Σt)q(\theta' \mid \theta) = \mathcal{N}(\theta' \mid \theta, \Sigma_t), symmetric. Acceptance simplifies to α=min(1,γt+1(θ)/γt+1(θ))\alpha = \min(1, \gamma_{t+1}(\theta')/\gamma_{t+1}(\theta)). Cheap; scales poorly in dd. Roberts, Gelman, and Gilks (1997) give the optimal acceptance ≈ 0.234 with Σtd1/2\Sigma_t \sim d^{-1/2}. The SMC discipline: Σt=(2.382/d)Cov^(πtN)\Sigma_t = (2.38^2/d) \widehat{\mathrm{Cov}}(\pi_t^N), 3–10 sweeps per step.

Independent Metropolis–Hastings (IMH). q(θθ)=q(θ)q(\theta' \mid \theta) = q(\theta') — the proposal doesn’t depend on the current state. The standard SMC-samplers choice fits qq to the current cloud at every step. Cloud-fitted IMH stays high-acceptance because qq tracks πt+1\pi_{t+1}. PyMC’s pm.sample_smc default.

Gibbs sampling. Cycle through coordinates, sampling each from πt+1(θjθj)\pi_{t+1}(\theta_j \mid \theta_{-j}). Acceptance probability 1 when conditionals are closed-form.

Hamiltonian Monte Carlo (HMC). Augment with momentum, integrate Hamilton’s equations via leapfrog. Beskos et al. (2013): optimal scaling εd1/4\varepsilon \sim d^{-1/4}, dominant in d5d \gtrsim 5. Used in §11’s log-Gaussian Cox process demo.

Per-step kernel-cost summary:

KernelPer-particle cost per sweepTuning
RWM1 log-density evalstep size σ\sigma / covariance Σt\Sigma_t
IMH (cloud-fitted)1 log-density + 1 log-proposalproposal family
Gibbsdd conditional samplesnone (closed-form required)
HMCLL gradient evals + acceptanceε,L,M\varepsilon, L, M

5.4 The mutation-vs-resampling trade

Total SMC compute decomposes as CSMCTN(kClogdens+Cresample)C_{\mathrm{SMC}} \approx T \cdot N \cdot (k \cdot C_{\text{logdens}} + C_{\text{resample}}), dominated by the kernel cost when ClogdensC_{\text{logdens}} is non-trivial. The trade-off: smaller TT requires more kernel sweeps per step to keep the cloud mixing well, while larger TT amortizes kernel cost across more reweighting/resampling work. The §9.4 adaptive temperature schedule automates the TT side; the kernel and kk stay under user control.

Practical guidance:

  • RWM on smooth dd-dim targets: k=max(3,d/2)k = \max(3, \lceil d/2 \rceil) sweeps with adaptive Σt\Sigma_t.
  • IMH-cloud-fit on unimodal targets: k=1k = 1 application per step.
  • HMC with L=10L = 103030 leapfrog steps: k=1k = 1.
Three-panel comparison of RWM (k=1), RWM (k=5), and IMH cloud-fit propagation kernels on an anisotropic 2D Gaussian SMC sampler. Left: ESS over the path. Center: terminal cloud scatter. Right: per-step acceptance rate.
Figure 5. Three propagation kernels on an anisotropic 2D Gaussian bridge at N = 200, T = 8. IMH cloud-fit dominates RWM in both ESS and terminal-scatter tightness; RWM (k = 5) closes most of the gap by amortizing the kernel cost.

The viz below runs three kernel choices in parallel — RWM with k=1k = 1 sweep, RWM with k=5k = 5 sweeps, and IMH cloud-fit — on the same anisotropic 2D Gaussian bridge, with the same systematic-resample machinery. Watch the IMH-cloud-fit acceptance rate climb through the path as the cloud concentrates on πT\pi_T.

RWM (k=1)RWM (k=5)IMH cloud-fittarget π_T = N([3, 3], diag(0.3, 1.5))

§6. Particle Filtering on State-Space Models

§6 specializes the skeleton to state-space models — the original application of SMC (Gordon, Salmond, and Smith 1993). We develop the bootstrap and auxiliary particle filters, then deploy both on the stochastic-volatility model where the linear-Gaussian Kalman filter doesn’t apply.

6.1 The filtering recursion

A state-space model has latent process xtx_t with p(x0)p(x_0), p(xtxt1)p(x_t \mid x_{t-1}), and observation process ytxtp(ytxt)y_t \mid x_t \sim p(y_t \mid x_t). Joint density: p(x0:T,y1:T)=p(x0)tp(xtxt1)p(ytxt)p(x_{0:T}, y_{1:T}) = p(x_0) \prod_t p(x_t \mid x_{t-1}) p(y_t \mid x_t).

Theorem 3 (Filtering recursion).

For each t1t \geq 1:

(a) Predict. p(xty1:t1)=p(xtxt1)p(xt1y1:t1)dxt1p(x_t \mid y_{1:t-1}) = \int p(x_t \mid x_{t-1})\, p(x_{t-1} \mid y_{1:t-1})\, dx_{t-1}.

(b) Update. p(xty1:t)=p(ytxt)p(xty1:t1)/p(yty1:t1)p(x_t \mid y_{1:t}) = p(y_t \mid x_t)\, p(x_t \mid y_{1:t-1}) / p(y_t \mid y_{1:t-1}).

(c) Marginal likelihood. p(y1:t)=s=1tp(ysy1:s1)p(y_{1:t}) = \prod_{s=1}^t p(y_s \mid y_{1:s-1}) with p(ysy1:s1)=p(ysxs)p(xsy1:s1)dxsp(y_s \mid y_{1:s-1}) = \int p(y_s \mid x_s)\, p(x_s \mid y_{1:s-1})\, dx_s.

Proof.

(a) Marginalize xt1x_{t-1} out of the joint p(xt1,xty1:t1)p(x_{t-1}, x_t \mid y_{1:t-1}) and use Markov: p(xtxt1,y1:t1)=p(xtxt1)p(x_t \mid x_{t-1}, y_{1:t-1}) = p(x_t \mid x_{t-1}). (b) Bayes’ rule with conditional independence p(ytxt,y1:t1)=p(ytxt)p(y_t \mid x_t, y_{1:t-1}) = p(y_t \mid x_t). (c) Chain rule on the observation sequence.

The linear-Gaussian case reduces to the Kalman filter, which has closed-form predict and update steps (see formalStatistics: hidden-markov-models ). When the dynamics are non-linear or non-Gaussian, the integrals in Theorem 3 don’t admit closed-form evaluation, and we need a particle approximation.

6.2 The bootstrap particle filter

Specialize the skeleton: πt(x0:t)=p(x0:ty1:t)\pi_t(x_{0:t}) = p(x_{0:t} \mid y_{1:t}). Each particle is a trajectory x0:t(i)x_{0:t}^{(i)}. The marginal filtering posterior at time tt is p(xty1:t)iwt(i)δxt(i)p(x_t \mid y_{1:t}) \approx \sum_i w_t^{(i)} \delta_{x_t^{(i)}}.

The bootstrap choice: extend each trajectory by sampling xtp(xtxt1(i))x_t \sim p(x_t \mid x_{t-1}^{(i)}) from the prior dynamics. With this proposal the incremental IS weight collapses to

αt(i)p(ytxt(i)).\alpha_t^{(i)} \propto p(y_t \mid x_t^{(i)}).

The bootstrap PF algorithm:

Initialize: x_0^(i) ~ p(x_0), W_0^(i) = 1.
For t = 1, ..., T:
  (1) Propagate: x_t^(i) ~ p(x_t | x_{t-1}^(i))
  (2) Reweight: W_t^(i) = W_{t-1}^(i) · p(y_t | x_t^(i))
  (3) Diagnose: ESS_t, log Ẑ_t
  (4) Resample if ESS_t < τN

Order is propagate-then-reweight (the natural order for filtering — we extend the trajectory before scoring against the new observation). No MH correction: the prior dynamics aren’t a πt\pi_t-invariant kernel, but they are an unbiased proposal for the next-step posterior.

Log-likelihood as a side product. From the marginal-likelihood factorization, p^(ysy1:s1)=(1/N)ip(ysxs(i))\widehat{p}(y_s \mid y_{1:s-1}) = (1/N)\sum_i p(y_s \mid x_s^{(i)}). Cumulative logp^(y1:t)=logZ^t\log \widehat p(y_{1:t}) = \log \widehat Z_t is unbiased for logp(y1:t)\log p(y_{1:t}) by Theorem 4 — the same unbiasedness identity that powers SMC samplers’ evidence estimation.

The bootstrap PF fails when p(ytxt)p(y_t \mid x_t) is sharply peaked relative to p(xtxt1)p(x_t \mid x_{t-1}) — the prior dynamics place mass in the wrong region of state space, so few particles land where the likelihood is non-negligible. The auxiliary particle filter (§6.3) addresses this.

6.3 The auxiliary particle filter

APF (Pitt and Shephard 1999) chooses ancestors before propagating, with auxiliary weights that anticipate the next observation yty_t.

The APF algorithm:

For t = 1, ..., T:
  (a) Auxiliary weights: λ_{t-1}^(i) ∝ W_{t-1}^(i) · p̂(y_t | x_{t-1}^(i))
  (b) Resample ancestors A_t^(i) with probabilities ∝ λ_{t-1}^(i)
  (c) Propagate: x_t^(i) ~ p(x_t | x_{t-1}^{A_t^(i)})
  (d) Reweight with correction: W_t^(i) ∝ p(y_t | x_t^(i)) / p̂(y_t | x_{t-1}^{A_t^(i)})

The auxiliary target between consecutive filtering posteriors is πt1aux(x0:t1)p(x0:t1y1:t1)p^(ytxt1)\pi_{t-1}^{\text{aux}}(x_{0:t-1}) \propto p(x_{0:t-1} \mid y_{1:t-1}) \cdot \hat p(y_t \mid x_{t-1}) — the prior posterior weighted by an approximate one-step-ahead predictive likelihood. The APF is SMC with this synthetic intermediate target inserted between πt1\pi_{t-1} and πt\pi_t.

Tractable choices for p^\hat p: mean-predictive p^(ytxt1)=p(ytE[xtxt1])\hat p(y_t \mid x_{t-1}) = p(y_t \mid \mathbb{E}[x_t \mid x_{t-1}]) or mode-predictive. APF dominates the bootstrap PF when yty_t is informative and p^\hat p is a reasonable approximation; the gain shrinks when the dynamics are diffuse or when p^\hat p misses the predictive mode.

6.4 Worked example — the stochastic-volatility model

The SV model:

xt=ϕxt1+σεt,εtN(0,1),yt=exp(xt/2)ηt,ηtN(0,1),\begin{aligned} x_t &= \phi x_{t-1} + \sigma\, \varepsilon_t, &\varepsilon_t &\sim \mathcal{N}(0, 1),\\ y_t &= \exp(x_t/2)\, \eta_t, &\eta_t &\sim \mathcal{N}(0, 1), \end{aligned}

with ϕ(0,1)\phi \in (0, 1), σ>0\sigma > 0, stationary initial x0N(0,σ2/(1ϕ2))x_0 \sim \mathcal{N}(0, \sigma^2/(1-\phi^2)). The latent xtx_t is log-volatility; the observation yty_t is the return with conditional Gaussian variance exp(xt)\exp(x_t). Multiplicative emission means the Kalman filter doesn’t apply directly. The standard linearization logyt2=xt+logηt2\log y_t^2 = x_t + \log \eta_t^2 gives Harvey, Ruiz, and Shephard’s (1994) quasi-likelihood Kalman approximation, which underperforms the bootstrap PF on tail behavior.

The demo. Simulate T=100T = 100, ϕ=0.95\phi = 0.95, σ=0.2\sigma = 0.2. Run the bootstrap PF with N=500N = 500, τ=0.5\tau = 0.5, systematic resampling.

Four-panel stochastic-volatility particle-filter demo: truth latent x_t with observations, bootstrap PF posterior, auxiliary PF posterior, and per-step ESS for both.
Figure 6. Bootstrap and auxiliary particle filters on a stochastic-volatility model (T = 200, N = 500). The bootstrap filter tracks the latent log-volatility process via prior-dynamics propagation and emission-likelihood reweight; the APF improves on it when the emission is informative.

The viz below lets you sweep NN, TT, ϕ\phi, and σ\sigma to see how the bootstrap filter’s RMSE, ESS, and log-evidence respond. Try increasing ϕ\phi toward 0.99 to see how strong persistence helps the filter; try shrinking σ\sigma to see how a quiet latent process leaves the filter with little signal to learn from.

RMSE(filter mean vs truth) = 0.374
avg ESS = 364.9
log p̂(y_{1:T}) = -126.57

§7. SMC Samplers — Del Moral, Doucet, and Jasra (2006)

§7 takes the case where the target is static (a fixed Bayesian posterior) and the path of intermediate targets is constructed via annealing. This is the framework PyMC’s pm.sample_smc implements.

7.1 The annealing path

For Bayesian inference with prior p(θ)p(\theta) and likelihood p(yθ)p(y \mid \theta), the geometric annealing path:

γt(θ)=p(θ)p(yθ)βt,0=β0<β1<<βT=1.\gamma_t(\theta) = p(\theta)\, p(y \mid \theta)^{\beta_t}, \qquad 0 = \beta_0 < \beta_1 < \cdots < \beta_T = 1.

At β0=0\beta_0 = 0: prior. At βT=1\beta_T = 1: unnormalized posterior. Two reasons for the geometric path:

  1. Exponential-family closure. Conjugate models preserve parametric form along the path — useful for diagnostics and analytical comparison.
  2. Chi-squared control. χ2(πt+1πt)(βt+1βt)2Varπt[logp(yθ)]\chi^2(\pi_{t+1} \,\Vert\, \pi_t) \approx (\beta_{t+1} - \beta_t)^2 \cdot \mathrm{Var}_{\pi_t}[\log p(y \mid \theta)] to leading order, so adjacent targets have controllable chi-squared via the temperature step.

7.2 The reweighting formula along the path

Proposition 3 (Geometric-path incremental weights).

For γt(θ)=p(θ)p(yθ)βt\gamma_t(\theta) = p(\theta) p(y \mid \theta)^{\beta_t},

logαt(θ)=(βt+1βt)logp(yθ).\log \alpha_t(\theta) = (\beta_{t+1} - \beta_t)\, \log p(y \mid \theta).
Proof.

Direct: logαt=logγt+1logγt=logp(θ)+βt+1logp(yθ)logp(θ)βtlogp(yθ)=(βt+1βt)logp(yθ)\log \alpha_t = \log \gamma_{t+1} - \log \gamma_t = \log p(\theta) + \beta_{t+1} \log p(y \mid \theta) - \log p(\theta) - \beta_t \log p(y \mid \theta) = (\beta_{t+1} - \beta_t) \log p(y \mid \theta).

The prior cancels entirely; the incremental log-weight depends only on the log-likelihood times the temperature step. The variance of the log-weight increment under the current target is

Varπt[logαt]=(βt+1βt)2Varπt[logp(yθ)],\mathrm{Var}_{\pi_t}[\log \alpha_t] = (\beta_{t+1} - \beta_t)^2 \cdot \mathrm{Var}_{\pi_t}[\log p(y \mid \theta)],

which is the leading-order chi-squared expansion that justifies the path’s quadratic control.

7.3 Markov-kernel mutation between targets

Adaptive Gaussian IMH (PyMC’s default): proposal qt(θ)=N(θ;μ^t,Σ^t)q_t(\theta') = \mathcal{N}(\theta';\, \widehat{\boldsymbol{\mu}}_t,\, \widehat{\boldsymbol{\Sigma}}_t) fitted to the current weighted cloud. Refit at every step. Add ridge Σ^t+ϵId\widehat{\boldsymbol{\Sigma}}_t + \epsilon I_d for numerical stability when the cloud is anisotropic.

Adaptive RWM: qt(θθ)=N(θ;θ+(2.382/d)Σ^t)q_t(\theta' \mid \theta) = \mathcal{N}(\theta;\, \theta + (2.38^2/d)\widehat{\boldsymbol{\Sigma}}_t). Multiple sweeps per SMC step amortize the kernel cost.

7.4 SMC samplers vs annealed importance sampling

AIS (Neal 2001) = SMC samplers with (a) no resampling, (b) one kernel application per step. AIS works on easy problems but fails on long paths: cumulative weights spread across many orders of magnitude, ESS collapses, and the variance of logZ^T\log \widehat Z_T explodes.

SMC samplers = AIS + ESS-thresholded resampling + adaptive kernel tuning. The general algorithm reduces to AIS when τ=0\tau = 0 (never resample) and kernels are fixed across the path.

Modern practice uses SMC samplers exclusively; AIS retains niche advantages in distributed settings where resampling synchronization is expensive.

Four-panel SMC sampler on a Bayesian Poisson rate problem: cloud trajectory as ridgeline, ESS over the path, terminal vs quadrature reference, and one-shot IS failure.
Figure 7. SMC sampler on a non-conjugate 1D Bayesian Poisson rate problem at N = 500, T = 20, IMH cloud-fit. The one-shot IS panel shows the weight degeneracy that SMC fixes via incremental reweighting.

The viz below explores the four hand-tuned annealing schedules — linear, power-2, power-½, log-uniform — across path length TT and a “posterior concentration” CC that models how Varπt[logp(yθ)]\mathrm{Var}_{\pi_t}[\log p(y \mid \theta)] grows with the data size. Adjust CC upward to see how all four schedules struggle on more concentrated posteriors, and how back-loaded schedules (power-½) preserve ESS better than linear on hard targets.

larger C → posterior more concentrated → harder bridge
linear (β = t/T)(min ESS ≈ 83)power-2 (front-loaded)(min ESS ≈ 333)power-½ (back-loaded)(min ESS ≈ 21)log-uniform(min ESS ≈ 83)

§8. Unbiased Log-Marginal-Likelihood Estimation

The running normalizing-constant estimator Z^t\widehat Z_t has accompanied us since §3.4. §8 proves the unbiasedness theorem, the topic’s most important load-bearing result, and applies it to model-evidence comparison in §8.4.

8.1 The product-of-normalizers identity

Define the incremental ratio estimator r^t=(1/N)iαt(i)=(1/N)iγt+1(θt(i))/γt(θt(i))\widehat r_t = (1/N)\sum_i \alpha_t^{(i)} = (1/N)\sum_i \gamma_{t+1}(\theta_t^{(i)})/\gamma_t(\theta_t^{(i)}).

Proposition 4 (Product-of-normalizers factorization).

Under the §3.2 algorithm,

Z^T=t=0T1r^t.\widehat Z_T = \prod_{t=0}^{T-1} \widehat r_t.
Proof.

Z^0=1\widehat Z_0 = 1. Without resampling at step tt: Z^t+1=(1/N)iWt+1(i)=Z^tr^t\widehat Z_{t+1} = (1/N)\sum_i W_{t+1}^{(i)} = \widehat Z_t \cdot \widehat r_t (using Wt(i)Z^tW_t^{(i)} \equiv \widehat Z_t post-resample or the self-normalized variant otherwise). With resampling: the §3.4 invariant calculation shows Z^t+1\widehat Z_{t+1} is preserved across the resampling step. Iterating from Z^0=1\widehat Z_0 = 1 to step TT gives the product factorization.

8.2 The unbiasedness theorem

Theorem 4 (Unbiased SMC log-marginal-likelihood, Del Moral-Doucet-Jasra 2006 Proposition 1).

Consider the §3.2 SMC algorithm with: iid initial particles θ0(i)π0\theta_0^{(i)} \sim \pi_0 (normalized, so Z0=1Z_0 = 1); reweight by αt(i)\alpha_t^{(i)}; resample with any unbiased scheme (Proposition 2); propagate with any πt+1\pi_{t+1}-invariant Markov kernel. Then for every t{0,1,,T}t \in \{0, 1, \ldots, T\},

E[Z^t]=Zt.\mathbb{E}[\widehat Z_t] = Z_t.
Proof.

Layer 1 — Without resampling. Each particle’s cumulative weight is WT(i)=s=0T1γs+1(θs(i))/γs(θs(i))W_T^{(i)} = \prod_{s=0}^{T-1} \gamma_{s+1}(\theta_s^{(i)})/\gamma_s(\theta_s^{(i)}). The trajectory (θ0(i),θ1(i),,θT1(i))(\theta_0^{(i)}, \theta_1^{(i)}, \ldots, \theta_{T-1}^{(i)}) has joint density π0(θ0)s=1T1Ks(θs1,θs)\pi_0(\theta_0) \prod_{s=1}^{T-1} K_s(\theta_{s-1}, \theta_s). Compute:

E[WT]=π0(θ0)s=0T1γs+1(θs)γs(θs)s=1T1Ks(θs1,θs)dθ0dθT1.\mathbb{E}[W_T] = \int \pi_0(\theta_0) \cdot \prod_{s=0}^{T-1} \frac{\gamma_{s+1}(\theta_s)}{\gamma_s(\theta_s)} \cdot \prod_{s=1}^{T-1} K_s(\theta_{s-1}, \theta_s)\, d\theta_0 \cdots d\theta_{T-1}.

The s=0s=0 factor cancels: π0(θ0)γ1(θ0)/γ0(θ0)=γ1(θ0)\pi_0(\theta_0) \cdot \gamma_1(\theta_0)/\gamma_0(\theta_0) = \gamma_1(\theta_0) since π0=γ0\pi_0 = \gamma_0 (Z0=1Z_0 = 1).

Integrate θ0\theta_0 first. By K1K_1‘s π1\pi_1-invariance, γ1(θ0)K1(θ0,θ1)dθ0=γ1(θ1)\int \gamma_1(\theta_0) K_1(\theta_0, \theta_1)\, d\theta_0 = \gamma_1(\theta_1). This cancels with the s=1s=1 ratio’s denominator γ1(θ1)\gamma_1(\theta_1), leaving γ2(θ1)\gamma_2(\theta_1) as the integrand for θ1\theta_1.

Iterating: each kernel-invariance integration converts γs(θs1)Ks(θs1,θs)\gamma_s(\theta_{s-1}) K_s(\theta_{s-1}, \theta_s) into γs(θs)\gamma_s(\theta_s), which combines with the next ratio’s denominator to give γs+1(θs)\gamma_{s+1}(\theta_s).

After T1T-1 iterations: E[WT]=γT(θT1)dθT1=ZT\mathbb{E}[W_T] = \int \gamma_T(\theta_{T-1})\, d\theta_{T-1} = Z_T. Hence E[Z^T]=ZT\mathbb{E}[\widehat Z_T] = Z_T.

Layer 2 — With resampling. Induct on tt. Base case t=0t = 0 trivial. Inductive step: assume E[Z^t]=Zt\mathbb{E}[\widehat Z_t] = Z_t; show E[Z^t+1]=Zt+1\mathbb{E}[\widehat Z_{t+1}] = Z_{t+1}.

Using Proposition 4, Z^t+1=Z^tr^t\widehat Z_{t+1} = \widehat Z_t \cdot \widehat r_t. By tower rule, E[Z^t+1]=E[Z^tE[r^tFt1]]\mathbb{E}[\widehat Z_{t+1}] = \mathbb{E}[\widehat Z_t \cdot \mathbb{E}[\widehat r_t \mid \mathcal{F}_{t-1}]] where Ft1\mathcal{F}_{t-1} is the σ\sigma-algebra through step t1t-1.

The conditional E[r^tFt1]\mathbb{E}[\widehat r_t \mid \mathcal{F}_{t-1}] uses KtK_t‘s πt\pi_t-invariance to integrate over the kernel propagation, and Proposition 2’s offspring unbiasedness to integrate over resampling indices. After the bookkeeping (see Chopin and Papaspiliopoulos 2020 Theorem 7.4.1 for the detailed derivation):

E[r^tZ^tFt1]=Zt+1ZtZ^t.\mathbb{E}[\widehat r_t \cdot \widehat Z_t \mid \mathcal{F}_{t-1}] = \frac{Z_{t+1}}{Z_t} \cdot \widehat Z_t.

Iterating: E[Z^t+1]=(Zt+1/Zt)E[Z^t]=(Zt+1/Zt)Zt=Zt+1\mathbb{E}[\widehat Z_{t+1}] = (Z_{t+1}/Z_t) \cdot \mathbb{E}[\widehat Z_t] = (Z_{t+1}/Z_t) \cdot Z_t = Z_{t+1}.

Remark (The rigorous Feynman–Kac framework).

Layer 2’s conditional-expectation calculation is made fully rigorous via the Feynman–Kac path-measure formalism of Del Moral (2004). The technical machinery — Doob’s martingale theory, abstract empirical-process theory on path space — sits above this topic’s assumed background; see Chopin and Papaspiliopoulos (2020) Chapter 7 and Del Moral, Doucet, and Jasra (2006) Proposition 1 for the rigorous derivation.

Corollary. With π0\pi_0 the prior and γT(θ)=p(θ)p(yθ)\gamma_T(\theta) = p(\theta) p(y \mid \theta): E[Z^T]=p(y)\mathbb{E}[\widehat Z_T] = p(y). This is what PyMC’s pm.sample_smc returns as idata.sample_stats.log_marginal_likelihood.

8.3 Variance of the log-evidence estimator and the Jensen-gap correction

Z^T\widehat Z_T is unbiased for ZTZ_T; logZ^T\log \widehat Z_T is biased downward by Jensen’s inequality.

Proposition 5 (Jensen-gap bias of log Ẑ_T).

Let σ2=Var(Z^T)/ZT2\sigma^2 = \mathrm{Var}(\widehat Z_T)/Z_T^2. Then

E[logZ^T]=logZTσ22+O(σ4).\mathbb{E}[\log \widehat Z_T] = \log Z_T - \frac{\sigma^2}{2} + O(\sigma^4).
Proof.

logZ^T=logZT+log(1+U)\log \widehat Z_T = \log Z_T + \log(1 + U) with U=(Z^TZT)/ZTU = (\widehat Z_T - Z_T)/Z_T, E[U]=0\mathbb{E}[U] = 0 (by Theorem 4), E[U2]=σ2\mathbb{E}[U^2] = \sigma^2. Taylor: log(1+U)=UU2/2+O(U3)\log(1+U) = U - U^2/2 + O(U^3). Take expectation: E[log(1+U)]=0σ2/2+O(σ4)\mathbb{E}[\log(1+U)] = 0 - \sigma^2/2 + O(\sigma^4), since E[U3]=O(σ3)\mathbb{E}[U^3] = O(\sigma^3) but the next-order σ4\sigma^4 correction dominates after exact cancellation of the σ3\sigma^3 term against the symmetric tails.

Scaling: Var(Z^T)=O(T/N)\mathrm{Var}(\widehat Z_T) = O(T/N) under standard regularity (§10.3), so the Jensen gap is O(T/N)-O(T/N). At N=1000N = 1000, typical bias is 0.01\sim 0.010.10.1 — usually negligible, but the §8.4 demo shows how the bias accumulates when TT is forced large by a difficult target.

8.4 Worked example — Gaussian-mixture model evidence

The 2-component GMM: yipN(μ1,1)+(1p)N(μ2,1)y_i \sim p \cdot \mathcal{N}(\mu_1, 1) + (1-p) \cdot \mathcal{N}(\mu_2, 1), priors μ1,μ2N(0,4)\mu_1, \mu_2 \sim \mathcal{N}(0, 4), pBeta(2,2)p \sim \mathrm{Beta}(2, 2). Simulate n=50n = 50 at μ1=2\mu_1^\star = -2, μ2=2\mu_2^\star = 2, p=0.5p^\star = 0.5. The posterior is bimodal (label switching), which is the canonical SMC win-condition.

Four estimators compared:

  1. Hand-rolled SMC sampler. N=1000N = 1000, T=30T = 30 geometric, IMH cloud-fit.
  2. PyMC pm.sample_smc. Same model, default settings.
  3. Harmonic mean (Newton and Raftery 1994). The notorious failure baseline — biased upward and infinite-variance under standard regularity.
  4. Bridge sampling (Meng and Wong 1996). The gold-standard reference.

Predicted ordering. Bridge sampling and the SMC samplers (hand-rolled and PyMC) agree to Monte Carlo error. The harmonic mean is visibly biased upward — a textbook demonstration of why the harmonic-mean estimator is unfit for purpose.

Three-panel comparison of log-evidence estimators on a 2-component Gaussian mixture: violin plot of log Ẑ, MC SD, and bias vs bridge-sampling reference.
Figure 8. Log-evidence estimators on a 2-component GMM at M = 20 replicates. SMC and bridge sampling agree to within MC error; harmonic mean is visibly biased upward.

The viz below runs M=150M = 150 independent SMC instances on the §3 Gaussian-bridge target (truth ZT=1Z_T = 1, logZT=0\log Z_T = 0) and displays the histograms of Z^T\widehat Z_T and logZ^T\log \widehat Z_T. The Z^T\widehat Z_T histogram is centered on 1 (Theorem 4); the logZ^T\log \widehat Z_T histogram is shifted downward by the Jensen-gap prediction σ^2/2-\widehat\sigma^2/2 (Proposition 5).

Var(Ẑ) ≈ 9.67e-2

§9. Adaptive Choices: ESS-Thresholded Resampling and Adaptive Temperatures

§9 makes both schedule and resampling-trigger adaptive — the two adaptive mechanisms that turn textbook SMC into the production-grade algorithm deployed in PyMC. The §9.5 banana demo is the third headline demo of the topic.

9.1 The ESS threshold rule

The ESS threshold rule (Doucet, Godsill, and Andrieu 2000): resample if ESSt<τN\mathrm{ESS}_t < \tau N for τ(0,1]\tau \in (0, 1], typically τ=1/2\tau = 1/2.

Intuition: when ESS is high, the cloud is well-conditioned and resampling injects unnecessary variance. When ESS is low, resampling is needed to recover diversity. The threshold τ\tau is the breakpoint.

Practical regimes:

  • τ=0\tau = 0 — never resample (= AIS, fails on long paths).
  • τ=1\tau = 1 — resample every step (= Gordon-Salmond-Smith 1993 PF).
  • τ(0.3,0.7)\tau \in (0.3, 0.7) — practical sweet spot; τ=0.5\tau = 0.5 is the default.

9.2 ESS-thresholded resampling dominates every-step resampling

Theorem 5 (Resampling-variance comparison).

Under standard SMC regularity conditions, there exists τ(0,1)\tau^\star \in (0, 1) such that

Varτ(Z^T)Varevery(Z^T)c(TR)N,\mathrm{Var}_{\tau^\star}(\widehat Z_T) \leq \mathrm{Var}_{\mathrm{every}}(\widehat Z_T) - \frac{c \cdot (T - R^\star)}{N},

where RR^\star is the expected number of resampling events at τ\tau^\star and c>0c > 0 is a problem-dependent constant.

Proof.

Use the §10.3 CLT-form variance decomposition (anticipated here; proved independently in §10):

Var(Z^T)=ZT2Nt=0T1[Vtprop+1{resample at t}Vtresamp]+O(N2).\mathrm{Var}(\widehat Z_T) = \frac{Z_T^2}{N} \sum_{t=0}^{T-1} \bigl[V_t^{\mathrm{prop}} + \mathbb{1}\{\text{resample at }t\} \cdot V_t^{\mathrm{resamp}}\bigr] + O(N^{-2}).

Under every-step resampling, the indicator 1{resample at t}\mathbb{1}\{\text{resample at }t\} is always 1. Under ESS-thresholded resampling, the indicator fires only at R<TR^\star < T steps (in expectation). The propagation contribution VtpropV_t^{\mathrm{prop}} is identical under both rules — kernel mixing doesn’t depend on the resampling schedule.

Variance difference:

Varevery(Z^T)Varτ(Z^T)=ZT2Nt:not resampled at τVtresamp+O(N2)ZT2c(TR)N+O(N2),\mathrm{Var}_{\mathrm{every}}(\widehat Z_T) - \mathrm{Var}_{\tau^\star}(\widehat Z_T) = \frac{Z_T^2}{N} \sum_{t : \text{not resampled at }\tau^\star} V_t^{\mathrm{resamp}} + O(N^{-2}) \geq \frac{Z_T^2 \cdot c \cdot (T - R^\star)}{N} + O(N^{-2}),

where c=mintVtresamp>0c = \min_t V_t^{\mathrm{resamp}} > 0 is bounded below under standard regularity. When τ<1\tau^\star < 1 so R<TR^\star < T, the RHS is strictly positive.

9.3 Hand-tuned vs adaptive temperature schedules

Four hand-tuned schedules in common use: linear (βt=t/T\beta_t = t/T), geometric (log-equispaced), power-law (βt=(t/T)k\beta_t = (t/T)^k), and log-uniform.

Hand-tuning is fragile: (1) the optimal schedule depends on Varπt[logp(yθ)]\mathrm{Var}_{\pi_t}[\log p(y \mid \theta)], which varies along the path; (2) tuning is problem-specific and doesn’t generalize across dataset sizes; (3) the practitioner has to re-tune every time a model changes.

9.4 ESS-driven adaptive β

The adaptive rule (Del Moral, Doucet, and Jasra 2012, building on Jasra, Stephens, Doucet, and Tsagaris 2011): pre-commit to a target ESS fraction τadapt(0,1)\tau_{\mathrm{adapt}} \in (0, 1) (typically 0.8–0.95). At each step, define the projected ESS

ESSt+1(β)=(iwt(i)e(ββt)logp(yθt(i)))2i(wt(i))2e2(ββt)logp(yθt(i))\mathrm{ESS}_{t+1}(\beta) = \frac{\bigl(\sum_i w_t^{(i)} \cdot e^{(\beta-\beta_t) \log p(y \mid \theta_t^{(i)})}\bigr)^2}{\sum_i (w_t^{(i)})^2 \cdot e^{2(\beta-\beta_t) \log p(y \mid \theta_t^{(i)})}}

and bisect for βt+1\beta_{t+1} such that ESSt+1(βt+1)=τadaptN\mathrm{ESS}_{t+1}(\beta_{t+1}) = \tau_{\mathrm{adapt}} N.

Properties. ESS is monotonically decreasing in β\beta (smooth root-finding). Bisection convergence in O(log2(1/ϵ))O(\log_2(1/\epsilon)) steps. Path length TT is data-driven — the schedule stops when it can jump straight to β=1\beta = 1 without dropping below τadaptN\tau_{\mathrm{adapt}} N.

Bisection inner loop:

def find_next_beta(theta, log_w, log_lik_curr, beta_t, target_ess, tol=1e-3):
    def ess_at(beta):
        log_alpha = (beta - beta_t) * log_lik_curr
        log_w_new = log_w + log_alpha
        log_w_norm = log_w_new - logsumexp(log_w_new)
        return float(np.exp(-logsumexp(2.0 * log_w_norm)))
    lo, hi = beta_t, 1.0
    if ess_at(hi) >= target_ess:
        return hi
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if ess_at(mid) >= target_ess:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

Combined with §9.1 ESS-threshold resampling, the SMC sampler has two adaptive mechanisms. The standard configuration is τadapt=0.95\tau_{\mathrm{adapt}} = 0.95 and τ=0.5\tau = 0.5 (PyMC’s default).

Correctness with adaptive schedules. Theorem 4 unbiasedness holds under path-adapted schedules via a Doob-decomposition argument (Del Moral, Doucet, and Jasra 2012; rigorously refined in Beskos, Jasra, Kantas, and Thiery 2016). The intuition: the schedule depends on past particles but not future ones, so the martingale structure that makes Z^t\widehat Z_t unbiased is preserved.

9.5 Worked example — the banana distribution

The banana: γT(θ)exp(θ12/2(θ2θ12)2/2)\gamma_T(\theta) \propto \exp(-\theta_1^2/2 - (\theta_2 - \theta_1^2)^2/2) on R2\mathbb{R}^2. Visually consistent with the Riemann Manifold HMC §10 banana demo — the two topics share this target by design. Properly normalized 2D density with ZT=1Z_T = 1.

Path: π0=N(0,16I2)\pi_0 = \mathcal{N}(\mathbf{0}, 16 I_2) (wide isotropic prior) to the banana posterior.

Compare a fixed linear schedule (T=20T = 20) vs the adaptive schedule (τadapt=0.9\tau_{\mathrm{adapt}} = 0.9). N=500N = 500, IMH cloud-fit, τ=0.5\tau = 0.5. Adaptive typically uses T8T \approx 81212 on this target.

Four-panel banana SMC: target density, cloud trajectory under a fixed linear schedule, cloud trajectory under an ESS-adaptive schedule, and ESS profile for both.
Figure 9. Banana SMC at N = 500: fixed linear schedule (T = 20) versus ESS-adaptive (τ_adapt = 0.9, T ≈ 8-12). The adaptive schedule reaches β = 1 in roughly half as many steps while keeping ESS above τN throughout.

The viz below runs both fixed and adaptive schedules on the banana side-by-side. Pick a fixed-schedule curvature (linear, power-2, power-½) and watch how the cloud terminal scatter compares to the adaptive one. The adaptive schedule’s TT value is data-driven and shown in the legend.

fixed (linear, T = 20)adaptive (T realized = 200)dashed-red banana ridge θ_2 = θ_1²

§10. Convergence Theory

§10 supplies the asymptotic theory underlying §§3–9. Three results carry the load: consistency (§10.1), the central limit theorem with variance decomposition (§10.3), and finite-sample concentration bounds (§10.4).

10.1 Consistency of the weighted empirical measure

Theorem 6 (Consistency of SMC (no-resampling case)).

Under iid initial particles θ0(i)π0\theta_0^{(i)} \sim \pi_0, πt\pi_t-invariant kernels KtK_t, no resampling, and E[(Wt(1))2]<\mathbb{E}[(W_t^{(1)})^2] < \infty: for every bounded measurable f:ΘRf: \Theta \to \mathbb{R},

πtN(f)=iWt(i)f(θt(i))jWt(j)PEπt[f]as N.\pi_t^N(f) = \frac{\sum_i W_t^{(i)} f(\theta_t^{(i)})}{\sum_j W_t^{(j)}} \xrightarrow{\mathrm{P}} \mathbb{E}_{\pi_t}[f] \quad \text{as } N \to \infty.
Proof.

Let π^tN(f)=(1/N)iWt(i)f(θt(i))\widehat\pi_t^N(f) = (1/N)\sum_i W_t^{(i)} f(\theta_t^{(i)}) and Z^t=(1/N)iWt(i)\widehat Z_t = (1/N)\sum_i W_t^{(i)}. The self-normalized estimator is the ratio π^tN(f)/Z^t\widehat\pi_t^N(f) / \widehat Z_t.

Step 1: expectations. By Theorem 4 Layer 1 with f(θT1)f(\theta_{T-1}) in the final integrand, E[π^tN(f)]=ZtEπt[f]\mathbb{E}[\widehat\pi_t^N(f)] = Z_t \cdot \mathbb{E}_{\pi_t}[f]. By Theorem 4, E[Z^t]=Zt\mathbb{E}[\widehat Z_t] = Z_t.

Step 2: variances vanish. The trajectories are iid under no resampling, so Var(π^tN(f))=(1/N)Var(Wt(1)f(θt(1)))\mathrm{Var}(\widehat\pi_t^N(f)) = (1/N)\mathrm{Var}(W_t^{(1)} f(\theta_t^{(1)})) and analogously for Z^t\widehat Z_t. Both are O(1/N)O(1/N) by the regularity assumption E[(Wt(1))2]<\mathbb{E}[(W_t^{(1)})^2] < \infty.

Step 3: Slutsky. By Chebyshev, π^tN(f)PZtEπt[f]\widehat\pi_t^N(f) \xrightarrow{\mathrm{P}} Z_t \mathbb{E}_{\pi_t}[f] and Z^tPZt>0\widehat Z_t \xrightarrow{\mathrm{P}} Z_t > 0. By the continuous mapping theorem and Slutsky: πtN(f)PEπt[f]\pi_t^N(f) \xrightarrow{\mathrm{P}} \mathbb{E}_{\pi_t}[f].

With resampling. The general case requires the Feynman–Kac framework — Del Moral (2004) Chapter 3 / Chopin and Papaspiliopoulos (2020) Theorem 11.1. The rate stays O(N1/2)O(N^{-1/2}) with constants depending on the resampling schedule.

10.2 The Feynman–Kac formalism — cite, don’t re-derive

The convergence theory of SMC has a unified treatment via the Feynman–Kac path measure Qt\mathbb{Q}_t on path space Θt\Theta^t:

Qt(dθ0:t)=1Ztπ0(θ0)s=0t1αs(θs)Ks+1(θs,dθs+1).\mathbb{Q}_t(d\theta_{0:t}) = \frac{1}{Z_t} \cdot \pi_0(\theta_0) \prod_{s=0}^{t-1} \alpha_s(\theta_s)\, K_{s+1}(\theta_s, d\theta_{s+1}).

The framework’s payoffs: unbiasedness, consistency, and CLT all follow from properties of Qt\mathbb{Q}_t. The technicalities — Doob’s martingale theory, Markov-chain renewal, abstract empirical-process theory on path space — sit above this topic’s assumed background.

References:

  • Del Moral (2004) Feynman–Kac Formulae — the definitive monograph; Chapters 3 and 9.
  • Chopin and Papaspiliopoulos (2020) An Introduction to Sequential Monte Carlo — textbook treatment, Chapter 11.
  • Cappé, Moulines, and Rydén (2005) Inference in Hidden Markov Models — particle-filter-specific theory, Chapters 7–8.

10.3 The central limit theorem

Theorem 7 (Central limit theorem for SMC).

Under regularity (bounded log-weights, kernel mixing, E[(Wt(1))4]<\mathbb{E}[(W_t^{(1)})^4] < \infty): for every bounded measurable ff,

N(πtN(f)πt(f))dN(0,σt2(f))as N,\sqrt{N}\bigl(\pi_t^N(f) - \pi_t(f)\bigr) \xrightarrow{d} \mathcal{N}(0, \sigma_t^2(f)) \quad \text{as } N \to \infty,

with σt2(f)=σ02(f~0)+s=1t[σs,prop2+1{resample at s}σs,resamp2]\sigma_t^2(f) = \sigma_0^2(\tilde f_0) + \sum_{s=1}^t [\sigma_{s,\mathrm{prop}}^2 + \mathbb{1}\{\text{resample at }s\} \sigma_{s,\mathrm{resamp}}^2].

Proof.

No-resampling case. Use the joint CLT and delta method on the ratio πtN(f)=π^tN(f)/Z^t\pi_t^N(f) = \widehat\pi_t^N(f) / \widehat Z_t. Under no resampling, trajectories are iid:

N((π^tN(f),Z^t)(ZtEπt[f],Zt))dN(0,Σ).\sqrt{N}((\widehat\pi_t^N(f), \widehat Z_t) - (Z_t \mathbb{E}_{\pi_t}[f], Z_t)) \xrightarrow{d} \mathcal{N}(0, \boldsymbol{\Sigma}).

Apply delta method with g(u,v)=u/vg(u, v) = u/v, gradient (1/Zt,Eπt[f]/Zt2)(1/Z_t, -\mathbb{E}_{\pi_t}[f]/Z_t^2). After simplification:

σt2(f)=1Zt2Var[Wt(1)(f(θt(1))Eπt[f])].\sigma_t^2(f) = \frac{1}{Z_t^2} \mathrm{Var}[W_t^{(1)}(f(\theta_t^{(1)}) - \mathbb{E}_{\pi_t}[f])].

In the ideal-mixing limit (qtπtq_t \to \pi_t), this reduces to Varπt[f]\mathrm{Var}_{\pi_t}[f].

Resampling-induced correction. When resampling fires at step ss, σs,resamp2=Varπs[f~s]0\sigma_{s,\mathrm{resamp}}^2 = \mathrm{Var}_{\pi_s}[\tilde f_s] \geq 0 adds to the variance. Each resampling event adds non-negative variance — the cost paid for the diversity injection.

Corollary (CLT for the normalizing-constant estimator).

Under the same regularity conditions, the analogue holds for Z^T\widehat Z_T:

N(Z^TZT)dN(0,σ~T2),\sqrt{N}(\widehat Z_T - Z_T) \xrightarrow{d} \mathcal{N}(0, \tilde\sigma_T^2),

with σ~T2=ZT2t=0T1[σ~t,prop2+1{resample at t}σ~t,resamp2]\tilde\sigma_T^2 = Z_T^2 \sum_{t=0}^{T-1} [\tilde\sigma_{t,\mathrm{prop}}^2 + \mathbb{1}\{\text{resample at }t\}\tilde\sigma_{t,\mathrm{resamp}}^2].

The propagation and resampling-stage variance contributions have analogous interpretations to those in Theorem 7. The no-resampling case follows from the standard CLT applied to the iid cumulative-weight sum; the general case is Del Moral (2004) Theorem 9.4 / Chopin and Papaspiliopoulos (2020) Theorem 11.2.

This is the result invoked by §9.2 Theorem 5 in the variance-decomposition argument for ESS-thresholded vs every-step resampling. The §9.2 proof writes Var(Z^T)σ~T2/N\mathrm{Var}(\widehat Z_T) \approx \tilde\sigma_T^2/N for large NN and reads off the contribution of each resampling event.

AIS-on-long-paths corollary. Without resampling, σt2(f)\sigma_t^2(f) inherits the cumulative chi-squared structure of the trajectory measure, growing exponentially in TT. With ESS-thresholded resampling, variance grows linearly in TT. This is the precise statement of why AIS fails on long paths and SMC samplers succeed.

10.4 Finite-sample particle-cloud error bounds

Theorem 8 (Concentration bound for SMC, Chopin and Papaspiliopoulos 2020 Theorem 11.5).

Under standard regularity plus bounded log-weights, for f1\|f\|_\infty \leq 1 and ε>0\varepsilon > 0:

P(πtN(f)πt(f)>ε)Ctexp ⁣(Nε2ctσt2(f)).\mathbb{P}(|\pi_t^N(f) - \pi_t(f)| > \varepsilon) \leq C_t \exp\!\left(-\frac{N \varepsilon^2}{c_t \sigma_t^2(f)}\right).

The proof is a Bernstein-type inequality on the path-measure empirical process, derivable from the formalML topic Concentration Inequalities combined with SMC-specific cumulative-weight tail bounds. The constants Ct,ctC_t, c_t depend on the resampling schedule and the kernel mixing rate; see Chopin and Papaspiliopoulos (2020) §11.5 for the detailed derivation.

At N=1000N = 1000, σt2=1\sigma_t^2 = 1, the deviation probability falls below 10310^{-3} at ε0.1\varepsilon \approx 0.1 — a concrete sense in which 1000 particles gives roughly two-decimal-place accuracy with high probability.

Three-panel numerical CLT verification on a 1D Gaussian target: rescaled-deviation histogram with fitted normal, rescaled SD vs N, and raw variance vs N on log-log axes.
Figure 10. Numerical CLT verification on the §3.4 Gaussian bridge. The √N-scaled deviations collapse onto a normal density; the rescaled SD stays constant in N, confirming the n^{-1/2} rate predicted by Theorem 7.

The viz below runs M=150M = 150 independent SMC instances at user-controlled NN, computes N(πTN(f)πT(f))\sqrt{N}(\pi_T^N(f) - \pi_T(f)) for a test function ff, and shows: an example terminal cloud, the rescaled-deviation histogram with a fitted normal density, and a Q–Q plot against standard-normal quantiles. As NN grows, the histogram tightens around a single normal density (Theorem 7); the Q–Q plot stays close to the y=xy = x reference.

sample SD = 1.370

§11. Worked Example — The Log-Gaussian Cox Process

The fourth and final headline demo. The LGCP discretized onto a spatial grid is a high-dimensional Bayesian inference target (64–256 dims at the grid resolutions we cover). Visually consistent with the Riemann Manifold HMC §10 demo — the two topics share this target by design.

11.1 The model and the discretization grid

Log-Gaussian Cox process. Latent z(x)=logλ(x)GP(μ0,k(,))z(x) = \log \lambda(x) \sim \mathcal{GP}(\mu_0, k(\cdot, \cdot)) on the unit square; conditionally Poisson point process with intensity λ(x)=exp(z(x))\lambda(x) = \exp(z(x)).

Discretization on K=m×mK = m \times m grid. z=(z1,,zK)N(μ01K,ΣK)\mathbf{z} = (z_1, \ldots, z_K) \sim \mathcal{N}(\mu_0 \mathbf{1}_K,\, \boldsymbol{\Sigma}_K), ykzkPoisson(Aexp(zk))y_k \mid z_k \sim \mathrm{Poisson}(A \exp(z_k)) where A=1/KA = 1/K is the cell area.

Kernel. Squared exponential k(c,c)=σz2exp(cc2/(22))k(c, c') = \sigma_z^2 \exp(-\|c - c'\|^2/(2\ell^2)). We match the RMHMC topic’s settings: σz2=1.91\sigma_z^2 = 1.91, =0.10\ell = 0.10, μ0=log(50)\mu_0 = \log(50).

Posterior. logp(zy)=12(zμ01)ΣK1(zμ01)+k[ykzkAezk]+const\log p(\mathbf{z} \mid \mathbf{y}) = -\frac{1}{2}(\mathbf{z} - \mu_0\mathbf{1})^\top \boldsymbol{\Sigma}_K^{-1}(\mathbf{z} - \mu_0\mathbf{1}) + \sum_k [y_k z_k - A e^{z_k}] + \text{const}. The first term is a Gaussian prior; the second is the data-fit Poisson log-likelihood. The posterior is non-conjugate and multimodal.

11.2 The SMC-sampler setup

Annealing path. γt(z)=exp(12(zμ01)ΣK1(zμ01))exp(βtk[ykzkAezk])\gamma_t(\mathbf{z}) = \exp(-\frac{1}{2}(\mathbf{z} - \mu_0\mathbf{1})^\top \boldsymbol{\Sigma}_K^{-1}(\mathbf{z} - \mu_0\mathbf{1})) \cdot \exp(\beta_t \sum_k [y_k z_k - A e^{z_k}]).

Initialization. Sample from N(μ01,ΣK)\mathcal{N}(\mu_0 \mathbf{1},\, \boldsymbol{\Sigma}_K) via the precomputed Cholesky factor LK\mathbf{L}_K.

Reweighting by Proposition 3: logαt(i)=(βt+1βt)k[ykzk(i)Aezk(i)]\log \alpha_t^{(i)} = (\beta_{t+1} - \beta_t)\sum_k [y_k z_k^{(i)} - A e^{z_k^{(i)}}].

Adaptive schedule τadapt=0.9\tau_{\mathrm{adapt}} = 0.9. Path length TT is data-driven.

Propagation IMH cloud-fit with ridge εIK\varepsilon I_K for ε=104\varepsilon = 10^{-4}.

Resampling systematic at τ=0.5\tau = 0.5.

Particle count N=300N = 300 default.

11.3 Head-to-head: SMC vs RMHMC vs NUTS

Three samplers compared on the LGCP demo:

  1. SMC samplers (our implementation).
  2. Riemann-manifold HMC (the previous topic).
  3. NUTS (PyMC default).

Metrics: posterior intensity RMSE, ESS/sec, log-evidence accuracy, robustness across replicates.

Pareto frontier. Speed (ESS/sec): RMHMC > SMC ≈ NUTS. Log-evidence: SMC ≫ RMHMC = NUTS (only SMC returns it unbiased). Multimodality robustness: SMC > RMHMC > NUTS.

11.4 Scaling with grid size

SamplerPer-step costMemory
SMC (IMH cloud-fit)O(NK2)O(N K^2) covariance + O(NK)O(N K) likelihoodO(NK+K2)O(NK + K^2)
RMHMCO(LK3)O(L K^3) Fisher CholeskyO(K2)O(K^2)
NUTSO(LK)O(L K) leapfrogO(K)O(K)

The Pareto frontier shifts with KK: at small KK NUTS dominates; at medium KK RMHMC dominates; at large KK SMC closes the gap because its O(NK2)O(NK^2) cost beats RMHMC’s O(LK3)O(LK^3) Cholesky per step.

Four-panel log-Gaussian Cox process SMC inference: true intensity heatmap, SMC posterior mean heatmap, ESS through the adaptive path, and posterior trajectory at representative cells.
Figure 11. LGCP inference at m = 16 (K = 256), N = 300. The SMC sampler recovers the multi-modal intensity field; the high- and low-intensity cells' posterior trajectories converge from the prior mean toward their respective truths.

The viz below loads a precomputed payload (Python: notebooks/sequential-monte-carlo/precompute_lgcp.py) for m{8,12,16}m \in \{8, 12, 16\}. The K=256K = 256 case is too heavy for in-browser SMC sampling; the precompute handles it offline and the viz displays the result. Pick a grid resolution to switch between K=64K = 64, K=144K = 144, and K=256K = 256 inference results.

Loading LGCP precompute payload…

§12. Computational Notes

Implementation-discipline coda: four practical disciplines that separate correct from production-grade SMC.

12.1 Buffer hoisting and the resampling hot path

Pre-allocate per-particle buffers once before the outer loop; reuse with in-place writes:

particles = np.empty((N, d), dtype=np.float64)
log_weights = np.zeros(N, dtype=np.float64)
log_lik_cache = np.empty(N, dtype=np.float64)
cumulative_w = np.empty(N, dtype=np.float64)
resampled_indices = np.empty(N, dtype=np.intp)

Anti-patterns to avoid: np.random.choice(N, size=N, p=weights) (allocator pressure on every call); cumulative_w = np.cumsum(w) (fresh allocation) instead of np.cumsum(w, out=cumulative_w) (in-place); list-of-arrays storage instead of contiguous arrays.

12.2 Log-space weight updates and logsumexp idioms

Particle weights on hard targets span 10+ orders of magnitude. Linear-space representation silently underflows on the kind of targets we care about.

Four idioms.

  1. Additive log-weight update: log_w += delta_beta * log_lik_cache.
  2. Normalize via logsumexp: log_w_norm = log_w - logsumexp(log_w).
  3. ESS in log-space: ess = np.exp(-logsumexp(2.0 * log_w_norm)).
  4. Cumulative logZ^\log \widehat Z: log_z = logsumexp(log_w) - np.log(N).

Failure mode this prevents: Neal’s funnel target — weights span 1015\sim 10^{15} orders within 5 SMC steps; linear-space silently collapses to a single particle carrying all the mass, while log-space remains numerically well-behaved.

scipy.special.logsumexp uses the shift-then-exp trick: logsumexp(x)=maxixi+logiexp(ximaxixi)\mathrm{logsumexp}(\mathbf{x}) = \max_i x_i + \log \sum_i \exp(x_i - \max_i x_i).

12.3 Parallelization across particles

SMC has a clean parallelism structure:

  • Embarrassingly parallel within a step (reweight and propagate operate per-particle).
  • Synchronization required at resampling (needs the global cumulative-weight array).
  • Strict serial across steps (step t+1t+1 depends on step tt).

Vectorized NumPy (single node) — the default in this topic, handles N1000N \approx 1000 on a multicore laptop.

Distributed compute (MPI). For N104N \gg 10^4 or d103d \gg 10^3: assign each rank a slice of the cloud, use MPI_Allreduce on log-weight sums, and MPI_Alltoall at resample steps. The particles library implements this.

GPU acceleration. Feasible for kernel evaluation; tricky for resampling because of the synchronization step. Modest speedup (1–3×) vs optimized CPU NumPy for typical SMC samplers.

12.4 Diagnostics via ArviZ

Convert the weighted cloud to ArviZ InferenceData by resampling to equal weights:

def to_inferencedata(theta, log_w, var_names):
    import arviz as az
    log_w_norm = log_w - logsumexp(log_w)
    w_norm = np.exp(log_w_norm)
    idx = np.searchsorted(np.cumsum(w_norm), np.random.random(len(theta)))
    samples = theta[idx]
    samples_dict = {name: samples[:, i:i+1] for i, name in enumerate(var_names)}
    return az.from_dict(samples_dict)

Standard diagnostics: az.ess (bulk and tail), az.rhat for multi-replicate runs, trace plots (with the caveat that SMC output is a weighted cloud, not a chain history), posterior predictive checks.

Log-evidence convention. Attach to idata.sample_stats.log_marginal_likelihood per PyMC convention.

Diagnostic checklist:

  1. Path length: did the adaptive schedule converge to βT=1\beta_T = 1 within budget?
  2. Min ESS along the path: did it stay above τN\tau N except at resampling steps?
  3. Kernel acceptance rate: above 0.2 for IMH cloud-fit, near 0.234 for RWM?
  4. logZ^\log \widehat Z stability across replicates: agreement within MC error?

§13. Connections and Limits

Closing the topic with curriculum-positioning. Where does SMC sit relative to other specialized samplers? What does it gain or lose? What are the active research frontiers?

13.1 SMC in the specialized-sampler cluster

The four T5 Advanced Bayesian Computation topics:

  • Stochastic-Gradient MCMC — Langevin/HMC flavors, single-chain, scales via subsampling, no log-evidence.
  • Riemann Manifold HMC — Fisher-metric mass matrix, best for smooth anisotropic targets.
  • Reversible-Jump MCMC (coming soon) — transdimensional inference across model spaces of different dimension.
  • Probabilistic Programming — the PPL infrastructure wrapping NUTS and pm.sample_smc.

Pareto frontier:

SamplerBest regimeTrade-off
NUTSSmooth unimodal, moderate ddFast per-step; no log-evidence
RMHMCSmooth anisotropicPer-step O(d3)O(d^3); no log-evidence
SG-MCMCHuge-nn smoothBiased; subsamples data
SMC samplersMultimodal; log-evidence neededHigher constants; unbiased log-evidence
Mean-field VIFast approximatePosterior factorization bias
Normalizing-flow VIAmortized deploymentExpensive training
INLALatent GaussianSpecialized model class

Three-question taxonomy. (1) Need log-evidence? → SMC. (2) Multimodal? → SMC. (3) Otherwise smooth unimodal? → NUTS/RMHMC.

13.2 SMC vs normalizing flows

Normalizing flowsSMC samplers
Training costExpensive (often GPU)Negligible
Per-sample costO(d2)O(d^2) per evaluationO(Nd)O(N d) for cloud
Sample qualityLimited by training errorExact in NN \to \infty
Log-evidenceAvailable (change-of-variables)Available (running Z^T\widehat Z_T)
MultimodalRequires expressive architecturesNative via tempering
DeploymentOne training + many samplesMany parallel SMC runs

Use flows when the same target is sampled many times (amortized inference); use SMC when each problem instance has a different target (one-shot inference).

Hybrid approaches:

  • Normalizing-flow proposals in SMC (Arbel, Matthews, and Doucet 2021).
  • SMC-trained flows (Naesseth, Linderman, Ranganath, and Blei 2018).

13.3 SMC for Bayes factors and posterior model probabilities

Unbiased logZ^T\log \widehat Z_T (Theorem 4) supports direct Bayes-factor computation:

logBFij=logZ^T(i)logZ^T(j).\log \mathrm{BF}_{ij} = \log \widehat Z_T^{(i)} - \log \widehat Z_T^{(j)}.

Practical workflow: run SMC on each candidate model, compute pairwise log-Bayes-factors, apply Jeffreys’s scale (logBF>5\log \mathrm{BF} > 5 is strong evidence).

Posterior model probabilities via logsumexp across models:

p(Miy)=exp(logp(Mi)+logZ^T(i))jexp(logp(Mj)+logZ^T(j)).p(M_i \mid y) = \frac{\exp(\log p(M_i) + \log \widehat Z_T^{(i)})}{\sum_j \exp(\log p(M_j) + \log \widehat Z_T^{(j)})}.

Caveats: the Jensen-gap bias (§8.3) partly cancels in ratios but not exactly; Lindley’s paradox (prior sensitivity in Bayes factors); nested vs non-nested models require care in defining the prior.

This discharges the cross-site forward-pointer to formalStatistics: bayesian-model-comparison-and-bma — SMC samplers are the canonical unbiased alternative to bridge sampling and harmonic-mean estimators in the model-evidence catalogue (formalstatistics treats marginal-likelihood / Bayes-factor computation in that topic).

13.4 Frontiers

  • Particle Gibbs (Andrieu, Doucet, and Holenstein 2010): particle MCMC for joint (θ,x1:T)(\theta, x_{1:T}) inference. Combines an SMC sweep over states with a Metropolis acceptance on the parameter.
  • SMC² (Chopin, Jacob, and Papaspiliopoulos 2013): doubly-nested SMC; outer chain on parameters, inner SMC on states.
  • Divide-and-conquer SMC (Lindsten et al. 2017): for graphical-model factorizations where the target decomposes hierarchically.
  • Variational SMC (Naesseth et al. 2018): neural-network-parameterized SMC proposals trained end-to-end against the ELBO.

Open research (as of 2026):

  • Adaptive proposals via reinforcement learning, where the kernel choice itself is a learnable policy.
  • Quasi-Monte Carlo SMC (Gerber and Chopin 2015), trading independent uniforms for low-discrepancy sequences.
  • GPU acceleration for the resampling step — currently the bottleneck in distributed SMC implementations.

The specialized-sampler cluster completes with Reversible-Jump MCMC (coming soon), which handles transdimensional moves that SMC operates within. The two are routinely combined: SMC for adaptive temperature within each model dimension, RJ-MCMC for jumps between dimensions.

Connections

  • Probabilistic-programming exposes `pm.sample_smc` as a first-class inference engine; §8.4 and §11.3 here reproduce its results with a hand-rolled NumPy implementation, and §7.3 documents IMH-cloud-fit as the kernel `pm.sample_smc` uses by default. The discharge of probabilistic-programming's §6.3 forward-pointer to SMC lives in §8.2 (unbiased log-evidence) and §11 (LGCP comparison). probabilistic-programming
  • RMHMC and SMC samplers solve the same problem — sampling from a hard posterior — by opposite strategies: RMHMC adaptively shapes the kinetic energy, SMC adaptively schedules an annealing path. §11.3 places SMC on the Pareto frontier RMHMC introduced, discharging RMHMC §12's forward-pointer to sequential Monte Carlo. riemann-manifold-hmc
  • SG-MCMC and SMC are sibling specialized samplers in the T5 cluster — SG-MCMC scales via subsampling at the cost of bias, SMC scales via parallel particle clouds and returns unbiased log-evidence as a side product. §13.1's three-question taxonomy positions each. stochastic-gradient-mcmc
  • VBMS uses ELBO-as-log-evidence with a variational-family bias; SMC samplers return unbiased log-evidence at finite N (Theorem 4). §8.4 and §13.3 here discharge VBMS's §13 forward-pointer to SMC by showing the GMM Bayes-factor comparison and the posterior-model-probability formula. variational-bayes-for-model-selection
  • Annealed flow transport (Arbel–Matthews–Doucet 2021) uses normalizing-flow proposals inside SMC; SMC-trained flows (Naesseth et al. 2018) flip the direction. The §13.2 comparison and §13.4 hybrid table position the two methods. normalizing-flows
  • §10.4's Theorem 8 finite-sample concentration bound is a Bernstein-type inequality on the SMC path-measure empirical process — a direct application of the concentration toolkit to interacting-particle systems. concentration-inequalities
  • RJ-MCMC handles transdimensional moves where SMC operates within a fixed dimension; the two are routinely combined, with SMC running adaptive temperature on each model dimension and RJ-MCMC jumping between dimensions. SMC closes the specialized-sampler cluster that RJ-MCMC will complete. reversible-jump-mcmc

References & Further Reading