advanced bayesian-ml 45 min read

Riemann Manifold HMC

Hamiltonian Monte Carlo on the manifold of probability distributions, using the Fisher information metric to traverse pathological posteriors that defeat constant-mass-matrix HMC

§1. Overview & Motivation

1.1 Where HMC breaks

Standard Hamiltonian Monte Carlo treats parameter space as flat Euclidean territory. Once we pick a mass matrix MM — typically the identity, or at best a constant preconditioner estimated from a pilot run — every direction in the parameter space gets scaled the same way, everywhere on the support. The leapfrog integrator marches across this flat landscape with a uniform step size ϵ\epsilon, and as long as the target’s geometry is roughly homogeneous, that works beautifully.

It stops working the moment local curvature varies across the support. Three textbook examples expose the failure mode.

Neal’s funnel. Consider the hierarchical prior θτN(0,eτ)\theta \mid \tau \sim \mathcal{N}(0, e^\tau) with τN(0,9)\tau \sim \mathcal{N}(0, 9). The marginal density on (τ,θ)(\tau, \theta) traces a horn-shaped support that’s narrow at one end — small τ\tau, where θ\theta has variance near 1 — and broad at the other, where θ\theta has variance eτe^\tau. A constant mass matrix forces a single step size. Too small, and HMC takes years to traverse the broad end; too large, and it shoots past the narrow neck on every iteration. The funnel is the canonical failure case in Bayesian hierarchical modeling — every practitioner has been bitten by it.

The banana. Take a bivariate Gaussian on (θ1,θ2)(\theta_1, \theta_2) and apply the warp θ2θ2+b(θ12a2)\theta_2 \mapsto \theta_2 + b(\theta_1^2 - a^2). The result is the banana distribution of Haario, Saksman, and Tamminen (2001) — a curved ridge whose orientation rotates as we move along it. The Fisher information here varies smoothly with position: where the ridge is steep, G(θ)G(\theta) has a strong off-diagonal; where it’s shallow, GG is nearly diagonal. Standard HMC with constant MM can’t reflect that local rotation, so trajectories either spiral away from the ridge (large ϵ\epsilon) or crawl along it (small ϵ\epsilon).

Ill-conditioned hierarchical likelihoods. Even outside designed pathologies, real models often have parameters that mix on wildly different scales — population-level versus group-level variance components, regression coefficients with very different prior scales. Each direction wants its own step size. A diagonal preconditioner helps a bit but only handles conditioning that’s constant across the support; anywhere the curvature changes with position, we’re back to the same trade-off.

The pattern that connects all three: the local Fisher information

G(θ)  =  Eyθ ⁣[θ2logp(yθ)]G(\theta) \;=\; -\,\mathbb{E}_{y \mid \theta}\!\bigl[\nabla^2_\theta \log p(y \mid \theta)\bigr]

— the matrix of expected second derivatives of the log-likelihood, with the expectation taken over the data distribution at θ\theta — is not constant, but standard HMC pretends it is.

L = 40 leapfrog steps, identity mass matrix
Figure 1. Three constant-mass-matrix HMC trajectories on each pathology. At small ε the trajectory crawls in the loose regions and never reaches the tight ones; at large ε it overshoots in the tight regions. A single ε cannot serve both. The Riemannian fix (§§3 onward) replaces M with a function of position.

1.2 The geometric fix

The Girolami–Calderhead (2011) move is to stop pretending. We promote the mass matrix from a constant to a function of position: M(θ)=G(θ)M(\theta) = G(\theta), with GG the Fisher information at θ\theta. The Hamiltonian becomes

H(θ,p)  =  U(θ)  +  12pG(θ)1p  +  12logdetG(θ),H(\theta, p) \;=\; U(\theta) \;+\; \tfrac{1}{2}\, p^\top G(\theta)^{-1} p \;+\; \tfrac{1}{2}\,\log\det G(\theta),

where U(θ)=logp(θy)U(\theta) = -\log p(\theta \mid y) is the negative log-posterior (our potential energy), and the extra log-determinant comes from normalizing the momentum’s conditional distribution pθN(0,G(θ))p \mid \theta \sim \mathcal{N}(0, G(\theta)) at each position. Read geometrically, this says: parameter space is a Riemannian manifold equipped with the Fisher metric, momentum lives in the cotangent bundle, and Hamiltonian flow now follows geodesics of the manifold rather than straight lines.

The pathology of the funnel and the banana evaporates in this picture. Where the original geometry was steep, G(θ)G(\theta) is large and the metric scales the local step accordingly; where the geometry was loose, G(θ)G(\theta) is small and the step opens up. The same nominal ϵ\epsilon now produces locally-appropriate motion everywhere.

There’s a cost. The new Hamiltonian is not separableH(θ,p)H(\theta, p) doesn’t split into a U(θ)+K(p)U(\theta) + K(p) form, because the kinetic part KK depends on θ\theta through G(θ)G(\theta). The standard explicit leapfrog loses its symplecticity in that case, so we have to upgrade to an implicit integrator: each leapfrog step solves a fixed-point equation for the new momentum, and another for the new position. Each step also requires factorizing G(θ)G(\theta) — typically by Cholesky, since we need both G1G^{-1} and detG\det G.

For the targets RMHMC is designed for, the cost is worth it. For an isotropic Gaussian on a well-conditioned likelihood, it isn’t. §8 makes this trade quantitative.

1.3 What you’ll need from elsewhere

This is an advanced topic and we lean heavily on the foundation.

From formalStatistics: bayesian-computation-and-mcmc , we assume comfortable familiarity with §26.9–10 on standard HMC: the canonical joint π(θ,p)eH(θ,p)\pi(\theta, p) \propto e^{-H(\theta, p)}, Hamilton’s equations, the leapfrog integrator’s symplecticity and reversibility, the Metropolis correction, and the No-U-Turn heuristic for trajectory length. §2 here is a notation refresher, not a re-derivation. The §26.10 Remark 29 forward-pointer to Riemannian-manifold methods is discharged by the present topic.

From Information Geometry, we use the Fisher information metric and Čencov’s uniqueness theorem. The Fisher metric will be the mass matrix; we want the reader to recognize that this is not an arbitrary choice but the canonical Riemannian structure on the manifold of probability distributions — the unique (up to scale) metric invariant under sufficient statistics.

From Riemannian Geometry and Geodesics and Curvature, we use the metric tensor, the geodesic equation, Christoffel symbols, and parallel transport. RMHMC’s Hamiltonian flow is geodesic flow on the manifold of distributions, perturbed by the gradient of the potential.

From formalcalculus, we use the formalCalculus: jacobian for the volume-preservation theorem in §6 and the formalCalculus: inverse-implicit for the well-posedness of the implicit leapfrog in §5.

From the Stochastic-Gradient MCMC topic shipped earlier in T5, §10 introduced the position-dependent Riemannian metric in the SDE setting — RSGLD. We treat that as the SDE companion; the present topic does the Hamiltonian-flow side.

1.4 The Girolami–Calderhead 2011 program

The defining paper of Riemannian-manifold MCMC is Mark Girolami and Ben Calderhead’s “Riemann manifold Langevin and Hamiltonian Monte Carlo methods” (Journal of the Royal Statistical Society, Series B, vol. 73, 2011), with discussion. The paper does several things at once. It introduces the Riemann-manifold Langevin algorithm (RMLA) and Riemann-manifold HMC (RMHMC) as a pair, with the Fisher information as the canonical metric tensor for both. It works through the differential geometry needed for non-statisticians, derives the generalized leapfrog integrator, and provides three substantial empirical demonstrations: the banana distribution, a stochastic-volatility model, and a log-Gaussian Cox process for spatial statistics. The discussion section runs more than thirty pages and includes contributions from many of the leading figures in computational statistics.

The contribution that survives most cleanly into modern practice is the framework, not the specific integrator: the idea that a posterior’s local geometry can and should be exploited by gradient-based MCMC, and that the Fisher information is the natural carrier of that geometry. Variants and extensions have followed — SoftAbs metrics (Betancourt 2013) when the Fisher matrix is ill-defined or non-positive-definite, Hessian-based metrics for non-conjugate models, magnetic HMC, and the integration of RMHMC ideas into NUTS-style adaptive trajectory tuning. §12 surveys these.

1.5 Roadmap

The plan from here. §2 is a notation refresher for HMC — anyone who has read formalstatistics §26 can skim it. §3 rebuilds the Riemannian-geometry vocabulary we’ll need, anchored in the Fisher metric. §4 writes down the Riemannian Hamiltonian and Hamilton’s equations on the manifold; this is where the new log-determinant term gets interpreted. §5 is the core integrator construction: why the explicit leapfrog fails, what the implicit splitting looks like, and how to actually solve the fixed-point system in code. §6 collects the three load-bearing proofs — volume preservation, reversibility, detailed balance. §7 discusses tuning. §8 is the cost analysis. §§9–10 are worked examples: the banana, then a downsized log-Gaussian Cox process. §11 is the implementation patterns the project enforces — buffer hoisting, SPD-stable solves, multi-chain diagnostics. §12 zooms out to natural gradient, position-dependent SDEs, and the active research frontier.

The notebook tracks the brief section-for-section. Every section that carries new math has both expository markdown and validating code; §§9–10 are entirely empirical.

Two-panel figure showing standard HMC trajectories on Neal's funnel and the banana distribution at three step sizes. Trajectories either stall in the loose regions or overshoot in the tight regions, depending on the step size choice.
Figure 1 (static fallback). Standard HMC trajectories on Neal's funnel and the banana distribution. No single step size works everywhere.

§2. HMC at a glance

This section establishes the notation and the three properties of the standard leapfrog integrator — symplecticity, reversibility, energy conservation up to bounded oscillation — that the rest of the topic rebuilds for the new integrator. We assume the reader has worked through standard HMC in formalStatistics: bayesian-computation-and-mcmc §26.9–10 or Neal (2011); the treatment here is consciously compressed. If anything below feels like new material rather than recall, that’s a signal to read the prereqs before continuing.

2.1 The Hamiltonian and the canonical joint

Let π(θ)eU(θ)\pi(\theta) \propto e^{-U(\theta)} denote our target distribution on Rd\mathbb{R}^d, where U(θ)=logπ(θ)U(\theta) = -\log \pi(\theta) (up to additive constants) is the potential energy — the negative log-target. HMC augments parameter space with an auxiliary momentum variable pRdp \in \mathbb{R}^d and constructs a joint distribution

π(θ,p)    exp ⁣(H(θ,p)),H(θ,p)  =  U(θ)  +  12pM1p,\pi(\theta, p) \;\propto\; \exp\!\bigl(-H(\theta, p)\bigr), \qquad H(\theta, p) \;=\; U(\theta) \;+\; \tfrac{1}{2}\, p^\top M^{-1} p,

where MRd×dM \in \mathbb{R}^{d \times d} is a fixed, symmetric, positive-definite mass matrix and the second term is the kinetic energy K(p)=12pM1pK(p) = \tfrac{1}{2} p^\top M^{-1} p. Because KK depends only on pp (not on θ\theta), the joint factorizes: π(θ,p)=π(θ)N(p;0,M)\pi(\theta, p) = \pi(\theta) \cdot \mathcal{N}(p; 0, M). Marginalizing over pp recovers the target, so sampling (θ,p)π(\theta, p) \sim \pi and discarding pp samples θπ(θ)\theta \sim \pi(\theta) — the augmentation is “free.”

The Hamiltonian generates a continuous-time flow ϕt\phi_t on (θ,p)(\theta, p) space via Hamilton’s equations,

dθdt  =  Hp  =  M1p,dpdt  =  Hθ  =  U(θ).\frac{d\theta}{dt} \;=\; \frac{\partial H}{\partial p} \;=\; M^{-1} p, \qquad \frac{dp}{dt} \;=\; -\frac{\partial H}{\partial \theta} \;=\; -\nabla U(\theta).

Three properties of ϕt\phi_t matter for what follows. First, ϕt\phi_t preserves HH — the value of the Hamiltonian is constant along any flow trajectory, so trajectories stay on level sets of HH. Second, ϕt\phi_t is symplectic — it preserves the differential 2-form ω=idθidpi\omega = \sum_i d\theta_i \wedge dp_i, which in particular implies ϕt\phi_t is volume-preserving on (θ,p)(\theta, p) space. Third, ϕt\phi_t is reversible under momentum negation: if we define N:(θ,p)(θ,p)\mathcal{N}: (\theta, p) \mapsto (\theta, -p), then NϕtN=ϕt\mathcal{N} \circ \phi_t \circ \mathcal{N} = \phi_{-t}. These three together imply ϕt\phi_t preserves the canonical measure π(θ,p)\pi(\theta, p) exactly — if we could integrate Hamilton’s equations in closed form, HMC would need no Metropolis correction. We can’t, so we discretize.

2.2 The leapfrog integrator

The natural discretization is the leapfrog (or Störmer–Verlet) integrator. From (θn,pn)(\theta^n, p^n), a single leapfrog step of size ϵ\epsilon produces (θn+1,pn+1)(\theta^{n+1}, p^{n+1}) via

pn+12=pnϵ2U(θn),θn+1=θn+ϵM1pn+12,pn+1=pn+12ϵ2U(θn+1).\begin{aligned} p^{n+\tfrac{1}{2}} &= p^n - \tfrac{\epsilon}{2}\, \nabla U(\theta^n), \\ \theta^{n+1} &= \theta^n + \epsilon\, M^{-1} p^{n+\tfrac{1}{2}}, \\ p^{n+1} &= p^{n+\tfrac{1}{2}} - \tfrac{\epsilon}{2}\, \nabla U(\theta^{n+1}). \end{aligned}

The integrator is explicit — each line is a closed-form update — because U\nabla U depends only on θ\theta and M1pM^{-1}p depends only on pp. This is the defining structural property that breaks in §5: the Riemannian Hamiltonian’s kinetic term 12pG(θ)1p\tfrac{1}{2} p^\top G(\theta)^{-1} p depends on θ\theta, so the position update becomes θn+1=θn+ϵG(θ?)1pn+12\theta^{n+1} = \theta^n + \epsilon G(\theta^?)^{-1} p^{n+\tfrac{1}{2}}, and we have to decide which θ\theta to evaluate GG at — which is where implicitness enters.

Why the leapfrog works. Each of the three sub-steps is a shear in (θ,p)(\theta, p) space: the momentum half-steps move pp at fixed θ\theta, and the position step moves θ\theta at fixed pp. Shears preserve the symplectic 2-form trivially, and the composition of symplectic maps is symplectic, so the full leapfrog map Φϵ:(θn,pn)(θn+1,pn+1)\Phi_\epsilon: (\theta^n, p^n) \mapsto (\theta^{n+1}, p^{n+1}) is symplectic — and therefore volume-preserving. The same symmetry of the three sub-steps about the central position update gives reversibility: ΦϵNΦϵ=N\Phi_\epsilon \circ \mathcal{N} \circ \Phi_\epsilon = \mathcal{N}.

What the leapfrog does not preserve is HH exactly. It has O(ϵ3)O(\epsilon^3) local error and O(ϵ2)O(\epsilon^2) global error over a fixed integration time T=LϵT = L\epsilon. But because the integrator is symplectic, the energy error oscillates around zero with bounded amplitude rather than drifting secularly — a result from backward error analysis (Hairer, Lubich, and Wanner 2006, ch. IX) that makes long-trajectory simulation feasible at moderate ϵ\epsilon. The Metropolis correction at the end of each trajectory restores exactness despite the residual integrator error, with acceptance probability α=min(1,exp(H(θn,pn)H(θn+L,pn+L)))\alpha = \min\bigl(1, \exp(H(\theta^n, p^n) - H(\theta^{n+L}, p^{n+L}))\bigr).

2.3 Where the constant mass matrix becomes a ceiling

The mass matrix MM acts as a metric on momentum space. Reading K(p)=12pM1pK(p) = \tfrac{1}{2} p^\top M^{-1} p geometrically: large MM in a direction means small kinetic energy contribution from large momentum in that direction, so trajectories cover ground faster there. The optimal choice for a Gaussian target N(μ,Σ)\mathcal{N}(\mu, \Sigma) is M=Σ1M = \Sigma^{-1}, which makes the level sets of HH spherical and decorrelates the leapfrog dynamics. Hoffman and Gelman’s NUTS (2014) and Stan’s adaptive HMC both estimate MM during warmup, typically as a diagonal or full-rank empirical-covariance preconditioner.

The estimate is global. It captures the average curvature of the target over the warmup chain. When the target’s actual local curvature varies smoothly with position — when the Fisher information matrix G(θ)=Eyθ[θ2logp(yθ)]G(\theta) = -\mathbb{E}_{y \mid \theta}[\nabla^2_\theta \log p(y \mid \theta)] depends on θ\theta — no single constant MM can match all of it. Pick MM to suit the loose part of the support and the tight part rejects every proposal; pick MM to suit the tight part and the loose part mixes at glacial speed.

This is the gap. From §3 onward we promote MM to a function of θ\theta, recover the three properties of the leapfrog by other means (implicit splitting), and pay the per-step Cholesky cost in exchange for a step size that’s locally appropriate everywhere.

Leapfrog energy error |H(t) - H(0)| as a function of step number for three step sizes on an isotropic Gaussian. The error oscillates with bounded amplitude rather than drifting; the amplitude scales as ε².
Figure 2 (static fallback). Leapfrog energy error on an isotropic Gaussian — bounded oscillation, O(ε²) amplitude. The hallmark of a symplectic integrator.

§3. Riemannian geometry of probability

The pieces this section assembles: the Fisher information matrix recast as a Riemannian metric, the cotangent picture of momentum that makes the RMHMC kinetic energy 12pG(θ)1p\tfrac{1}{2} p^\top G(\theta)^{-1} p the natural choice rather than an arbitrary one, the geodesic equation in coordinate form, and Čencov’s uniqueness theorem. All of this material is developed at full length in Information Geometry and Riemannian Geometry; the present treatment stays terse and consciously selects what §§4–6 will need.

3.1 The Fisher information as a Riemannian metric

Let {p(y;θ):θΘRd}\{p(y; \theta) : \theta \in \Theta \subseteq \mathbb{R}^d\} be a parametric family of probability densities on a sample space Y\mathcal{Y}, smooth in θ\theta. The score function is the gradient of the log-density,

s(y;θ)  =  θlogp(y;θ)    Rd,s(y; \theta) \;=\; \nabla_\theta \log p(y; \theta) \;\in\; \mathbb{R}^d,

and under standard regularity (smooth interchange of \nabla and \int), it has mean zero:

Eyθ[s(y;θ)]  =  p(y;θ)θlogp(y;θ)dy  =  θ ⁣p(y;θ)dy  =  θ(1)  =  0.\mathbb{E}_{y \mid \theta}\bigl[s(y; \theta)\bigr] \;=\; \int p(y; \theta) \,\nabla_\theta \log p(y; \theta) \,dy \;=\; \nabla_\theta \!\int p(y; \theta)\, dy \;=\; \nabla_\theta(1) \;=\; 0.

The Fisher information matrix is the covariance of the score,

G(θ)  =  Eyθ[s(y;θ)s(y;θ)]    Rd×d.G(\theta) \;=\; \mathbb{E}_{y \mid \theta}\bigl[s(y; \theta)\, s(y; \theta)^\top\bigr] \;\in\; \mathbb{R}^{d \times d}.

A second differentiation gives the equivalent negative-expected-Hessian form. Starting from Eyθ[s(y;θ)]=0\mathbb{E}_{y \mid \theta}[s(y; \theta)] = 0 and differentiating in θ\theta again,

0  =  θ ⁣p(y;θ)s(y;θ)dy  =  [θp(y;θ)]s(y;θ)dy  +  p(y;θ)θs(y;θ)dy.0 \;=\; \nabla_\theta \!\int p(y; \theta)\, s(y; \theta)^\top \,dy \;=\; \int \bigl[\nabla_\theta p(y; \theta)\bigr] s(y; \theta)^\top \,dy \;+\; \int p(y; \theta)\, \nabla_\theta s(y; \theta)^\top \,dy.

The first integrand is pssp \cdot s \cdot s^\top (using θp=ps\nabla_\theta p = p \cdot s) and the second is pθ2logpp \cdot \nabla^2_\theta \log p, so

Eyθ[ss]  +  Eyθ[θ2logp]  =  0,i.e.,G(θ)  =  Eyθ[θ2logp(y;θ)].\mathbb{E}_{y \mid \theta}\bigl[s\, s^\top\bigr] \;+\; \mathbb{E}_{y \mid \theta}\bigl[\nabla^2_\theta \log p\bigr] \;=\; 0, \qquad\text{i.e.,}\qquad G(\theta) \;=\; -\,\mathbb{E}_{y \mid \theta}\bigl[\nabla^2_\theta \log p(y; \theta)\bigr].

A third equivalent form recovers the connection to information divergences: G(θ)G(\theta) is the Hessian of the KL divergence at the diagonal,

Gij(θ)  =  2θiθjKL(p(;θ)p(;θ))θ=θ.G_{ij}(\theta) \;=\; \frac{\partial^2}{\partial \theta'_i\, \partial \theta'_j}\, \mathrm{KL}\bigl(p(\cdot; \theta) \,\|\, p(\cdot; \theta')\bigr)\bigg|_{\theta' = \theta}.

This last identity — derived in the Information Geometry topic — is why G(θ)G(\theta) deserves the name “metric”: it measures, infinitesimally, the squared distance between p(;θ)p(\cdot; \theta) and a nearby distribution. The pair (Θ,G)(\Theta, G) is a Riemannian manifold, and we’ll use gij(θ)=Gij(θ)g_{ij}(\theta) = G_{ij}(\theta) when emphasizing the metric viewpoint and G(θ)G(\theta) when emphasizing the matrix viewpoint.

Remark (Non-likelihood targets).

For Bayesian inference, the target is a posterior π(θ)p(yobsθ)p(θ)\pi(\theta) \propto p(y_{\text{obs}} \mid \theta)\, p(\theta), and the natural metric is the Bayesian Fisher G(θ)=Eyθ[θ2logp(yθ)]θ2logp(θ)G(\theta) = -\mathbb{E}_{y \mid \theta}[\nabla^2_\theta \log p(y \mid \theta)] - \nabla^2_\theta \log p(\theta) — the expected likelihood-Fisher plus the prior precision. For toy targets defined directly by a log-density (the §1 banana, for instance), there is no expectation over data and two options exist. The observed Hessian 2logπ(θ)-\nabla^2 \log \pi(\theta) can fail to be positive-definite at some θ\theta, in which case SoftAbs regularization (Betancourt 2013, §12 below) restores SPD by mapping the eigenvalues through a smooth absolute-value. The Gauss-Newton alternative — derive the target as a least-squares posterior, then use the Fisher information of the corresponding likelihood — is always SPD by construction and is the route Girolami and Calderhead take for their banana benchmark. We use Gauss-Newton in §9; the construction is spelled out there.

3.2 Cotangent vectors and index-raising

Stand at a point θΘ\theta \in \Theta. The tangent space TθΘRdT_\theta \Theta \cong \mathbb{R}^d is the vector space of velocities — directions a parameter trajectory can move. The cotangent space TθΘRdT_\theta^* \Theta \cong \mathbb{R}^d is its dual: the space of linear functionals on TθΘT_\theta \Theta. Concretely, a tangent vector vTθΘv \in T_\theta \Theta has components viv^i and a cotangent vector pTθΘp \in T_\theta^* \Theta has components pip_i — the index position records which kind of object you’re holding.

The metric gives a canonical isomorphism between the two spaces. Each tangent vector vv defines a cotangent vector vv^\flat via v(w)=gθ(v,w)v^\flat(w) = g_\theta(v, w) for any wTθΘw \in T_\theta \Theta. In coordinates this is index lowering: vi=gij(θ)vjv_i = g_{ij}(\theta)\, v^j, summing over the repeated index. The inverse map, index raising, sends a cotangent vector pp to the tangent vector pp^\sharp with components pi=gij(θ)pjp^i = g^{ij}(\theta)\, p_j, where gij(θ)g^{ij}(\theta) is the matrix inverse of gij(θ)g_{ij}(\theta) — equivalently, G(θ)1G(\theta)^{-1}.

Why this matters for RMHMC. Position θ\theta lives on the manifold and its velocity θ˙\dot\theta is naturally a tangent vector. Momentum pp, the conjugate variable, is naturally a cotangent vector — it pairs with velocity via the Lagrangian formulation, pi=L/θ˙ip_i = \partial L / \partial \dot\theta^i, which lands in the dual space. The kinetic energy is the squared norm of pp measured in the inverse metric on the cotangent space:

K(θ,p)  =  12p,pg  =  12gij(θ)pipj  =  12pG(θ)1p.K(\theta, p) \;=\; \tfrac{1}{2}\, \langle p, p\rangle_{g^*} \;=\; \tfrac{1}{2}\, g^{ij}(\theta)\, p_i\, p_j \;=\; \tfrac{1}{2}\, p^\top G(\theta)^{-1} p.

This is exactly the kinetic term in the Girolami–Calderhead Hamiltonian. The G(θ)1G(\theta)^{-1} is doing index-raising — converting the cotangent momentum into the tangent velocity θ˙=G(θ)1p\dot\theta = G(\theta)^{-1} p, which is what Hamilton’s first equation will say in §4. Read this way, the RMHMC kinetic energy isn’t a clever choice; it’s the unique kinetic energy compatible with treating momentum as a cotangent vector and the metric as Fisher.

3.3 Geodesics, Christoffel symbols, and the geodesic equation

On a flat manifold, a free particle moves in a straight line. On a curved one, “straight” gets replaced by geodesic — the curve that locally minimizes arc length, or equivalently, the curve whose tangent vector parallel-transports along itself. In coordinates, a curve θ(t)\theta(t) is a geodesic of (Θ,g)(\Theta, g) if

θ¨k  +  Γijk(θ)θ˙iθ˙j  =  0,k=1,,d,\ddot\theta^{\,k} \;+\; \Gamma^k_{ij}(\theta)\, \dot\theta^{\,i}\, \dot\theta^{\,j} \;=\; 0, \qquad k = 1, \dots, d,

with Einstein summation over i,ji, j. The Christoffel symbols Γijk\Gamma^k_{ij} are determined by the metric:

Γijk(θ)  =  12gkl(θ)(igjl(θ)  +  jgil(θ)    lgij(θ)).\Gamma^k_{ij}(\theta) \;=\; \tfrac{1}{2}\, g^{kl}(\theta)\, \bigl(\partial_i\, g_{jl}(\theta) \;+\; \partial_j\, g_{il}(\theta) \;-\; \partial_l\, g_{ij}(\theta)\bigr).

These are not tensor components — they transform inhomogeneously under coordinate changes — but the geodesic equation as a whole is coordinate-invariant. On a flat manifold (gijg_{ij} constant), all Γijk0\Gamma^k_{ij} \equiv 0 and the equation collapses to θ¨=0\ddot\theta = 0, recovering Newton’s first law.

What this means for RMHMC. Hamilton’s equations on (Θ,g)(\Theta, g), derived in §4, contain Christoffel-symbol-like terms in the momentum update — the price of position-dependent kinetic energy. When the potential U(θ)U(\theta) is zero, Hamilton’s equations on the manifold reduce exactly to the geodesic equation: the leapfrog integrator becomes a geodesic integrator, and the trajectory follows the geometry of (Θ,g)(\Theta, g) with no external force. When U0U \ne 0, we get geodesic flow perturbed by U(θ)-\nabla U(\theta). The pathological posteriors of §1 are pathological precisely because their flat trajectories don’t track the manifold’s geometry; their geodesic trajectories do.

For the banana model of §9, the only nonzero Christoffel symbol turns out to be Γ112=2b\Gamma^2_{11} = 2b — and that single component is exactly what’s needed to bend the geodesic into the ridge’s curvature. The viz below demonstrates this with a corrected closed-form Christoffel computation; the static fallback PNG uses a slightly off implementation from the notebook and may show the geodesic curving the wrong direction, so trust the live viz for the geometric story.

Figure 3. Left: Fisher-metric ellipses at a 5×5 grid; at b = 0 the metric is constant and ellipses are circles, but as b grows they rotate and stretch along the banana's ridge. Right: with ridge-tangent initial velocity, the geodesic on (Θ, G) traces the ridge exactly (Γ²₁₁ = 2b yields d²θ₂/dθ₁² = -2b — matching the ridge curvature). Toggle off "ridge-tangent" to see how the geodesic deviates when the initial velocity is generic.

3.4 Čencov’s uniqueness

Why the Fisher metric and not some other position-dependent metric? Čencov’s theorem (Čencov 1972, refined by Campbell 1986; see Information Geometry for full proof) answers this.

Theorem 1 (Čencov uniqueness).

Let F\mathcal{F} be the category of finite statistical manifolds, where morphisms are Markov kernels. The Fisher information metric is the unique (up to a positive scalar multiple) Riemannian metric on every object of F\mathcal{F} that is invariant under all morphisms — that is, the unique metric for which every sufficient statistic acts as an isometry between source and target manifolds.

Translated: take any parametric family {p(y;θ)}\{p(y; \theta)\} and any sufficient statistic T(y)T(y) for it; the induced family {p(T(y);θ)}\{p(T(y); \theta)\} has its own statistical manifold, and the Fisher metric on the original manifold pushes forward to the Fisher metric on the new one. No other Riemannian metric on statistical manifolds has this invariance property.

For RMHMC, this is the foundational sanity check: the choice of position-dependent metric is not arbitrary tuning. The Fisher information is the unique canonical Riemannian structure on a manifold of distributions. Variants — SoftAbs, observed Hessian, Hessian-based metrics — sacrifice this invariance to gain computational tractability or PSD-ness. They are useful approximations; Fisher is the reference point against which they are judged.

Two-panel figure showing Fisher metric ellipses over the banana posterior and a comparison of a geodesic to a Euclidean straight line from a common starting point.
Figure 3 (static fallback). Notebook-rendered metric ellipses and geodesic vs straight-line trajectories. The live viz above uses a corrected Christoffel computation; this PNG retains the notebook's slightly off original for archival reasons.

§4. The Riemannian Hamiltonian

Standard HMC’s mass matrix was a single constant. The promotion to a position-dependent metric — M(θ)=G(θ)M(\theta) = G(\theta) with GG the Fisher information — changes three things at once: the momentum distribution, the kinetic energy, and the Hamiltonian itself. The third change is the subtle one: an extra log-determinant term appears in HH, and it has to be there for the marginalization to recover the target. This section assembles all of it.

4.1 The position-dependent mass matrix

In standard HMC, the mass matrix MRd×dM \in \mathbb{R}^{d \times d} is a fixed SPD matrix — typically the identity, or a covariance estimate from warmup. Read through the §3.2 lens, MM functions as a metric on momentum space: it tells us how to convert a cotangent vector (momentum) into a tangent vector (velocity) via θ˙=M1p\dot\theta = M^{-1} p, and how to assign a squared norm to momentum via K(p)=12pM1pK(p) = \tfrac{1}{2} p^\top M^{-1} p.

RMHMC’s move is to let this metric depend on position. We set M(θ)=G(θ)M(\theta) = G(\theta), where GG is the Fisher information matrix at θ\theta (canonically; see §3 for the why-Fisher argument). The momentum’s conditional distribution becomes pθN(0,G(θ))p \mid \theta \sim \mathcal{N}(0, G(\theta)) — a Gaussian whose covariance changes with where we are on the manifold. In coordinates where the manifold has high local curvature (large eigenvalues of GG), momentum is sampled with large variance, so trajectories explore vigorously in that region; where curvature is low, momentum is small and trajectories crawl. This is the inverse of what standard HMC does with a constant MM — there, momentum scale is fixed, and pathological geometry punishes the trajectory.

Two consequences are immediate and matter for everything that follows. The kinetic energy gains θ\theta-dependence: K(θ,p)=12pG(θ)1pK(\theta, p) = \tfrac{1}{2} p^\top G(\theta)^{-1} p. And because the momentum density N(0,G(θ))\mathcal{N}(0, G(\theta)) has a θ\theta-dependent normalizing constant, the joint density we’re sampling from will need an explicit log-determinant correction to remain consistent with the target. We work that out next.

4.2 The augmented joint density

The target is π(θ)eU(θ)\pi(\theta) \propto e^{-U(\theta)} with U(θ)=logπ(θ)U(\theta) = -\log \pi(\theta). Augment with momentum pθN(0,G(θ))p \mid \theta \sim \mathcal{N}(0, G(\theta)). The joint is the product of the marginal and the conditional:

π(θ,p)  =  π(θ)1(2π)ddetG(θ)exp ⁣(12pG(θ)1p).\pi(\theta, p) \;=\; \pi(\theta) \cdot \frac{1}{\sqrt{(2\pi)^d\, \det G(\theta)}} \exp\!\Bigl(-\tfrac{1}{2}\, p^\top G(\theta)^{-1} p\Bigr).

Take the negative log and collect terms up to additive constants in (θ,p)(\theta, p):

H(θ,p)  =  logπ(θ,p)  =  U(θ)  +  12pG(θ)1p  +  12logdetG(θ).H(\theta, p) \;=\; -\log \pi(\theta, p) \;=\; U(\theta) \;+\; \tfrac{1}{2}\, p^\top G(\theta)^{-1} p \;+\; \tfrac{1}{2}\, \log\det G(\theta).

This is the Girolami–Calderhead Hamiltonian. The three terms are the potential energy U(θ)U(\theta), the position-dependent kinetic energy K(θ,p)=12pG(θ)1pK(\theta, p) = \tfrac{1}{2} p^\top G(\theta)^{-1} p, and a new volume term V(θ)=12logdetG(θ)V(\theta) = \tfrac{1}{2} \log\det G(\theta) that wasn’t there in flat-space HMC.

The volume term is exactly what’s needed for marginal correctness. Marginalize pp from the joint π(θ,p)eH(θ,p)\pi(\theta, p) \propto e^{-H(\theta, p)}:

eH(θ,p)dp  =  eU(θ)e12logdetG(θ)e12pG(θ)1pdp.\int e^{-H(\theta, p)}\, dp \;=\; e^{-U(\theta)} \cdot e^{-\tfrac{1}{2}\log\det G(\theta)} \cdot \int e^{-\tfrac{1}{2} p^\top G(\theta)^{-1} p}\, dp.

The remaining Gaussian integral evaluates to (2π)d/2detG(θ)(2\pi)^{d/2} \sqrt{\det G(\theta)}, so the right-hand side simplifies to (2π)d/2eU(θ)π(θ)(2\pi)^{d/2}\, e^{-U(\theta)} \propto \pi(\theta). The detG(θ)\sqrt{\det G(\theta)} from the Gaussian normalizer exactly cancels the e12logdetGe^{-\tfrac{1}{2}\log\det G} from the volume term, leaving the target. Drop the volume term and the marginal would be wrong — it would be proportional to π(θ)/detG(θ)\pi(\theta) / \sqrt{\det G(\theta)}, which is not what we want to sample.

This is also why the volume term cannot be absorbed into U(θ)U(\theta) as a redefinition of the potential. It must appear in HH precisely as written, because the same factor of detG(θ)\sqrt{\det G(\theta)} also appears in the momentum normalizer that the sampler implicitly uses every time it draws a momentum, and the two have to cancel.

4.3 The three terms

Let’s interpret each piece of H=U+K+VH = U + K + V.

The potential energy U(θ)=logπ(θ)U(\theta) = -\log \pi(\theta) is unchanged from flat-space HMC. It tells the dynamics where the target prefers to be: gradients of UU pull trajectories toward high-density regions of π\pi.

The kinetic energy K(θ,p)=12pG(θ)1pK(\theta, p) = \tfrac{1}{2} p^\top G(\theta)^{-1} p is now position-dependent. Read geometrically, KK is the squared norm of the cotangent momentum pp measured in the inverse metric — exactly the construction of §3.2. The position-dependence makes KK non-separable from θ\theta: the kinetic and potential coordinates are entangled by the metric. This is the property that breaks the explicit leapfrog in §5.

The volume term V(θ)=12logdetG(θ)V(\theta) = \tfrac{1}{2}\log\det G(\theta) is the new piece. It’s a purely geometric potential — it depends only on the manifold, not on the target. We can think of it as a “geometric correction” that prevents the dynamics from preferring high-determinant regions of the manifold over low-determinant ones. Without VV, regions where G(θ)G(\theta) has large eigenvalues would oversample momentum and skew the joint.

A useful regrouping is U~(θ)=U(θ)+V(θ)\tilde{U}(\theta) = U(\theta) + V(\theta) — the augmented potential — so that H(θ,p)=U~(θ)+K(θ,p)H(\theta, p) = \tilde{U}(\theta) + K(\theta, p). Hamilton’s equations on (θ,p)(\theta, p) space then look more familiar: the gradient force is U~-\nabla \tilde{U}, splitting into a target-driven part and a geometry-driven part.

Remark (V is free pedagogy on the banana).

For the banana model of §9, the metric determinant is exactly constant: detG(θ)=1/a2\det G(\theta) = 1/a^2 uniformly across Θ\Theta. So V(θ)=logaV(\theta) = -\log a is constant and contributes nothing to the dynamics on the banana — the volume term is “free” pedagogy but vanishes in this particular example. §10’s log-Gaussian Cox process is the example where detG(θ)\det G(\theta) genuinely varies with θ\theta and the volume term has real teeth.

4.4 Hamilton’s equations on the manifold

Hamilton’s equations are θ˙i=H/pi\dot\theta^i = \partial H/\partial p_i and p˙i=H/θi\dot p_i = -\partial H/\partial \theta^i. With our three-term HH, the position update is immediate:

θ˙  =  Hp  =  G(θ)1p.\dot\theta \;=\; \frac{\partial H}{\partial p} \;=\; G(\theta)^{-1}\, p.

This is just the cotangent-to-tangent index-raising of §3.2 — the position evolves with velocity equal to the momentum’s tangent representative.

The momentum update requires differentiating each term of HH with respect to θk\theta_k. The potential gives kU\partial_k U. The kinetic term gives 12pk(G1)p\tfrac{1}{2} p^\top \partial_k(G^{-1}) p, and using the matrix identity kG1=G1(kG)G1\partial_k G^{-1} = -G^{-1}(\partial_k G)\, G^{-1}, this becomes 12pG1(kG)G1p-\tfrac{1}{2} p^\top G^{-1} (\partial_k G)\, G^{-1} p. The volume term gives 12klogdetG=12tr(G1kG)\tfrac{1}{2}\, \partial_k \log\det G = \tfrac{1}{2}\, \mathrm{tr}\bigl(G^{-1}\, \partial_k G\bigr) via the Jacobi formula. Collecting,

p˙k  =  kU(θ)  +  12pG(θ)1(kG(θ))G(θ)1p    12tr(G(θ)1kG(θ)).\dot p_k \;=\; -\,\partial_k U(\theta) \;+\; \tfrac{1}{2}\, p^\top G(\theta)^{-1} (\partial_k G(\theta))\, G(\theta)^{-1} p \;-\; \tfrac{1}{2}\, \mathrm{tr}\bigl(G(\theta)^{-1}\, \partial_k G(\theta)\bigr).

Three forces are visible. The first is the gradient of the potential — the same as standard HMC. The second is a Coriolis-like force: it is quadratic in momentum, depends on the derivative of the metric, and vanishes when the metric is constant. The third is the gradient of the volume potential, kV-\partial_k V.

The Coriolis term is what connects the dynamics to geodesic flow on the manifold. Substituting θ˙=G1p\dot\theta = G^{-1} p into the quadratic-in-pp term, it becomes 12θ˙(kG(θ))θ˙\tfrac{1}{2}\, \dot\theta^\top (\partial_k G(\theta))\, \dot\theta, which after some index gymnastics combines with k\partial_k of the kinetic energy to produce exactly the Christoffel-symbol contraction of §3.3. Concretely: when U0U \equiv 0 and V0V \equiv 0, the Hamiltonian flow reduces to

θ¨k  +  Γijk(θ)θ˙iθ˙j  =  0,\ddot\theta^{\,k} \;+\; \Gamma^k_{ij}(\theta)\, \dot\theta^{\,i}\, \dot\theta^{\,j} \;=\; 0,

the geodesic equation. With UU and VV present, we get geodesic flow perturbed by U~-\nabla \tilde{U}.

This is the picture to hold onto: continuous-time RMHMC is geodesic flow on the manifold of distributions, perturbed by the gradient of the augmented potential. In continuous time, the flow conserves HH exactly and preserves π(θ,p)eH\pi(\theta, p) \propto e^{-H} exactly. The problem §5 has to solve is how to discretize this flow in a way that still preserves the canonical measure — at least to the accuracy needed for a Metropolis correction to make up the residual error.

Click panel (a) to add a start (rolls oldest off).
Figure 4. Left: continuous-time RMHMC trajectories (olive, RK4 on Hamilton's equations) follow the banana ridge; constant-mass HMC trajectories (blue) overshoot. Right: the selected RMHMC trajectory's U, K, V time series with their sum H. On the banana V = -log a is constant (V line is flat) and H is conserved to RK4 tolerance — the §4 takeaway. Click the trajectory panel to add a new start, drag the slider to change T.
Two-panel figure with continuous-time RMHMC trajectories on the banana and the corresponding U/K/V energy decomposition along one trajectory.
Figure 4 (static fallback). RK4-integrated Hamilton's equations on the banana. Left: RMHMC and HMC trajectories from the same starts. Right: U, K, V time series along a single trajectory; V is flat because det G is constant on the banana.

§5. The Generalized Leapfrog Integrator

Continuous-time RMHMC preserves the canonical measure exactly. The discrete version cannot — we have to integrate Hamilton’s equations approximately, and any approximation introduces error. The question is whether we can choose the discrete integrator so that the structural properties — symplecticity, reversibility, volume preservation — survive the discretization, leaving only a bounded energy error that the Metropolis correction can mop up. The standard leapfrog of §2 does this for separable Hamiltonians. The Riemannian Hamiltonian is non-separable, and the standard leapfrog fails. The fix is to make the integrator implicit.

5.1 Non-separability and what it breaks

Recall the standard leapfrog of §2. It comes from the splitting H(θ,p)=U(θ)+K(p)H(\theta, p) = U(\theta) + K(p) — potential energy depends only on θ\theta, kinetic energy depends only on pp. The integrator alternates shears: a half-step of pp at fixed θ\theta (using U/θ\partial U/\partial \theta), a full step of θ\theta at fixed pp (using K/p\partial K/\partial p), and another half-step of pp. Each substep is a symplectic shear because it updates only one of the two variables, and the composition of symplectic maps is symplectic. The proof in §2 was almost trivial.

For the Riemannian Hamiltonian H(θ,p)=U(θ)+V(θ)+12pG(θ)1pH(\theta, p) = U(\theta) + V(\theta) + \tfrac{1}{2} p^\top G(\theta)^{-1} p, the kinetic energy K(θ,p)K(\theta, p) depends on both variables — the metric GG depends on position. The standard shear-based splitting no longer applies. If we try to write the same three-step structure naively — “half-step in pp using U\nabla U, full step in θ\theta using G1pG^{-1} p, another half-step in pp” — three things go wrong. First, the half-step in pp would miss the contribution K/θ\partial K/\partial \theta to p˙\dot p, so it’s not the right gradient. Second, even if we included K/θ\partial K/\partial \theta, the substep is no longer a shear: it updates pp at fixed θ\theta but with a gradient that depends on the same pp we’re updating. Third, the resulting map is not symplectic — energy drifts secularly with nn, and after a few hundred leapfrog steps the chain is sampling from the wrong distribution.

The pathology is reproducible and visible. The naive explicit Riemannian leapfrog applied to the banana shows H(θn,pn)H0|H(\theta_n, p_n) - H_0| growing roughly linearly with nn, never returning to baseline — the hallmark of a non-symplectic integrator. The generalized leapfrog, by contrast, keeps the energy error bounded just as the standard leapfrog did in §2.

5.2 The Leimkuhler–Reich splitting

The fix, due to Leimkuhler and Reich (2004, ch. 4), is to keep the three-step symmetric structure but allow each step to be defined implicitly. The integrator is sometimes called the Störmer–Verlet for general Hamiltonians or the generalized leapfrog; we’ll use the latter name throughout. For a step of size ϵ\epsilon from (θn,pn)(\theta^n, p^n) to (θn+1,pn+1)(\theta^{n+1}, p^{n+1}),

pn+12=pnϵ2θH ⁣(θn,pn+12),θn+1=θn+ϵ2[pH ⁣(θn,pn+12)+pH ⁣(θn+1,pn+12)],pn+1=pn+12ϵ2θH ⁣(θn+1,pn+12).\begin{aligned} p^{n+\tfrac{1}{2}} &= p^n - \tfrac{\epsilon}{2}\, \nabla_\theta H\!\bigl(\theta^n,\, p^{n+\tfrac{1}{2}}\bigr), \\ \theta^{n+1} &= \theta^n + \tfrac{\epsilon}{2}\, \bigl[\nabla_p H\!\bigl(\theta^n,\, p^{n+\tfrac{1}{2}}\bigr) \,+\, \nabla_p H\!\bigl(\theta^{n+1},\, p^{n+\tfrac{1}{2}}\bigr)\bigr], \\ p^{n+1} &= p^{n+\tfrac{1}{2}} - \tfrac{\epsilon}{2}\, \nabla_\theta H\!\bigl(\theta^{n+1},\, p^{n+\tfrac{1}{2}}\bigr). \end{aligned}

Two of the three lines are implicit. The first defines pn+1/2p^{n+1/2} via an equation whose right-hand side depends on pn+1/2p^{n+1/2} — the gradient θH\nabla_\theta H contains the kinetic-energy gradient θK(θ,p)\nabla_\theta K(\theta, p), which is quadratic in the momentum we’re trying to compute. The second defines θn+1\theta^{n+1} via an equation whose right-hand side contains pH(θn+1,pn+1/2)=G(θn+1)1pn+1/2\nabla_p H(\theta^{n+1}, p^{n+1/2}) = G(\theta^{n+1})^{-1} p^{n+1/2} — the new position appears inside the metric. Only the third step is explicit, because by then both θn+1\theta^{n+1} and pn+1/2p^{n+1/2} are known.

The symmetric structure is what makes this work. The map (θn,pn)(θn+1,pn+1)(\theta^n, p^n) \mapsto (\theta^{n+1}, p^{n+1}) defined by the three lines above is the composition of two pieces: a half-Hamiltonian-flow under a frozen-metric Hamiltonian and a half-Hamiltonian-flow under the full position-dependent one. Each is a true Hamiltonian flow, hence symplectic; the composition is symplectic. We prove this carefully in §6.

5.3 The two implicit momentum half-steps

Specialize to the Riemannian Hamiltonian. With pH(θ,p)=G(θ)1p\nabla_p H(\theta, p) = G(\theta)^{-1} p and θH(θ,p)=U(θ)+V(θ)+θK(θ,p)\nabla_\theta H(\theta, p) = \nabla U(\theta) + \nabla V(\theta) + \nabla_\theta K(\theta, p), where the kinetic-energy gradient has components

[θK(θ,p)]k  =  12pG(θ)1(kG(θ))G(θ)1p,\bigl[\nabla_\theta K(\theta, p)\bigr]_k \;=\; -\tfrac{1}{2}\, p^\top G(\theta)^{-1} \bigl(\partial_k G(\theta)\bigr)\, G(\theta)^{-1} p,

the first leapfrog line becomes

pn+12  =  pn    ϵ2U(θn)    ϵ2V(θn)    ϵ2θK ⁣(θn,pn+12).p^{n+\tfrac{1}{2}} \;=\; p^n \;-\; \tfrac{\epsilon}{2}\, \nabla U(\theta^n) \;-\; \tfrac{\epsilon}{2}\, \nabla V(\theta^n) \;-\; \tfrac{\epsilon}{2}\, \nabla_\theta K\!\bigl(\theta^n,\, p^{n+\tfrac{1}{2}}\bigr).

The first three terms on the right are constants once θn\theta^n is known. The fourth term is the implicit one: it’s quadratic in pn+1/2p^{n+1/2} through the structure θKpTensor(θ)p\nabla_\theta K \propto p^\top \mathrm{Tensor}(\theta) p. Write the equation as

pn+12  =  b    ϵ2θK ⁣(θn,pn+12),b  :=  pnϵ2[U(θn)+V(θn)].p^{n+\tfrac{1}{2}} \;=\; b \;-\; \tfrac{\epsilon}{2}\, \nabla_\theta K\!\bigl(\theta^n,\, p^{n+\tfrac{1}{2}}\bigr), \qquad b \;:=\; p^n - \tfrac{\epsilon}{2}\,\bigl[\nabla U(\theta^n) + \nabla V(\theta^n)\bigr].

Solve by fixed-point iteration: starting from p(0)=bp^{(0)} = b, iterate

p(k+1)  =  b    ϵ2θK ⁣(θn,p(k))p^{(k+1)} \;=\; b \;-\; \tfrac{\epsilon}{2}\, \nabla_\theta K\!\bigl(\theta^n,\, p^{(k)}\bigr)

until p(k+1)p(k)<tol\|p^{(k+1)} - p^{(k)}\| < \mathrm{tol}. The metric G(θn)G(\theta^n) doesn’t change across iterations, so a single Cholesky factorization of G(θn)G(\theta^n) — computed once before the iteration — provides G(θn)1G(\theta^n)^{-1} for every evaluation of θK\nabla_\theta K. This is the main reason the per-step cost is dominated by one Cholesky per half-step rather than one per iteration.

The third leapfrog line is the same form but explicit: once we know θn+1\theta^{n+1} and pn+1/2p^{n+1/2}, the right-hand side is just

pn+1  =  pn+12    ϵ2[U(θn+1)+V(θn+1)+θK ⁣(θn+1,pn+12)].p^{n+1} \;=\; p^{n+\tfrac{1}{2}} \;-\; \tfrac{\epsilon}{2}\,\bigl[\nabla U(\theta^{n+1}) + \nabla V(\theta^{n+1}) + \nabla_\theta K\!\bigl(\theta^{n+1},\, p^{n+\tfrac{1}{2}}\bigr)\bigr].

This requires the Cholesky of G(θn+1)G(\theta^{n+1}), which we’ll have computed during step 2 and can reuse.

5.4 The implicit position update

The middle line is

θn+1  =  θn  +  ϵ2[G(θn)1+G(θn+1)1]pn+12.\theta^{n+1} \;=\; \theta^n \;+\; \tfrac{\epsilon}{2}\,\bigl[G(\theta^n)^{-1} + G(\theta^{n+1})^{-1}\bigr] p^{n+\tfrac{1}{2}}.

The first metric inverse is known. The implicit part is the second term, G(θn+1)1pn+1/2G(\theta^{n+1})^{-1} p^{n+1/2}, which requires the metric at the unknown new position. Write the equation as

θn+1  =  c  +  ϵ2G(θn+1)1pn+12,c  :=  θn+ϵ2G(θn)1pn+12.\theta^{n+1} \;=\; c \;+\; \tfrac{\epsilon}{2}\, G(\theta^{n+1})^{-1} p^{n+\tfrac{1}{2}}, \qquad c \;:=\; \theta^n + \tfrac{\epsilon}{2}\, G(\theta^n)^{-1} p^{n+\tfrac{1}{2}}.

Fixed-point iteration starts from θ(0)=c\theta^{(0)} = c and iterates

θ(k+1)  =  c  +  ϵ2G(θ(k))1pn+12\theta^{(k+1)} \;=\; c \;+\; \tfrac{\epsilon}{2}\, G(\theta^{(k)})^{-1} p^{n+\tfrac{1}{2}}

until convergence. Each iteration computes a fresh Cholesky of G(θ(k))G(\theta^{(k)}). This is where the per-step cost can balloon — if convergence takes 10 iterations, that’s 10 Choleskys for this single step. Typical ϵ\epsilon regimes get away with 3–5 iterations; outside that regime, the integrator is signaling that ϵ\epsilon is too large.

A semi-explicit simplification is available when the metric has structure. If G(θ)G(\theta) is block-diagonal in the parameter space and one block depends only on a parameter subgroup that this step’s pn+1/2p^{n+1/2} projects onto trivially, the position update for that subgroup is closed-form. For the log-Gaussian Cox process in §10, the GMRF-prior block of GG is constant, and only the likelihood-Fisher block requires iteration.

5.5 Fixed-point iteration in practice

Three implementation concerns turn out to matter.

Existence and uniqueness. The implicit function theorem (see formalCalculus: inverse-implicit ) guarantees that for ϵ\epsilon small enough, both implicit equations have unique solutions. The argument: at ϵ=0\epsilon = 0, both maps are the identity, and the Jacobian of the fixed-point equation at ϵ=0\epsilon = 0 is the identity matrix. The IFT extends this to a neighborhood of ϵ=0\epsilon = 0.

Tolerance and warm-starting. The standard tolerance is tol=106\mathrm{tol} = 10^{-6} in the relative-change norm Δ/p(k)\|\Delta\|/\|p^{(k)}\| for the momentum half-step and analogously for the position. This is far below the discretization error of the integrator itself, so the fixed-point precision is not the binding constraint on chain accuracy — the leapfrog step size is. Warm-start the iteration from the explicit (zeroth-order) update: p(0)=bp^{(0)} = b for the momentum half-step, θ(0)=c\theta^{(0)} = c for the position step. With a tight ϵ\epsilon, this warm-start often produces convergence within one or two iterations.

Divergence diagnostics. If the iteration counter exceeds a cap (typical: 20 iterations) without convergence, treat the leapfrog step as divergent — abort the trajectory, reject the proposal (acceptance probability 0), and increment a divergence counter for that chain. NUTS practice treats divergences as a hard signal that warmup-time ϵ\epsilon adaptation should reduce step size; RMHMC inherits the same heuristic. A chain that produces more than ~1% divergent transitions has ϵ\epsilon too large for the local geometry it’s exploring.

Figure 5. RK4 reference (teal, dashed), generalized leapfrog (olive), and naive explicit Riemannian leapfrog (orange) all start at the same (θ₀, p₀). RK4 stays nearly flat (it's the continuous-time reference at high resolution); the generalized leapfrog oscillates with bounded amplitude (the §5 symplecticity claim made visible); the naive explicit drifts linearly and often diverges at moderate ε. Increase ε to see the regimes pull apart; increase L to see the long-time behavior. Heavy compute recommits only on slider release.
Two-panel figure comparing RK4 reference, generalized leapfrog, and naive explicit Riemannian leapfrog on the banana. Naive drifts secularly; GL oscillates with bounded amplitude; RK4 stays flat.
Figure 5 (static fallback). Energy drift comparison on the banana at ε = 0.1, L = 300. The generalized leapfrog's bounded oscillation is the symplecticity claim made empirical.

§6. Volume preservation, reversibility, and detailed balance

Continuous-time Hamiltonian flow preserves the canonical measure π(θ,p)eH(θ,p)\pi(\theta, p) \propto e^{-H(\theta, p)} exactly. A discrete integrator cannot — there is always some discretization error in HH. The Metropolis correction handles that error by accepting or rejecting proposals based on the energy difference. But the Metropolis correction itself relies on three structural properties of the proposal map: it must be volume-preserving in (θ,p)(\theta, p)-space, reversible under momentum negation, and produce a kernel that satisfies detailed balance. Standard HMC’s explicit leapfrog has these properties by inspection. The generalized leapfrog is not made of shears, so we have to prove them by direct calculation. Each of the three is load-bearing: drop volume preservation and the Jacobian factor in the acceptance ratio fails to collapse to 1; drop reversibility and the forward and reverse moves are no longer paired correctly; drop detailed balance and the chain doesn’t target the right distribution.

Throughout this section let Φϵ:(θn,pn)(θn+1,pn+1)\Phi_\epsilon: (\theta^n, p^n) \mapsto (\theta^{n+1}, p^{n+1}) denote one step of the generalized leapfrog of §5.2, and decompose it as Φϵ=Φϵ/2CΦϵBΦϵ/2A\Phi_\epsilon = \Phi^C_{\epsilon/2} \circ \Phi^B_\epsilon \circ \Phi^A_{\epsilon/2} where the substeps are the three lines of the integrator. Let N:(θ,p)(θ,p)\mathcal{N}: (\theta, p) \mapsto (\theta, -p) denote momentum negation.

6.1 Volume preservation

Theorem 2 (Volume preservation of the generalized leapfrog).

The generalized leapfrog map Φϵ\Phi_\epsilon has unit Jacobian determinant: detJΦϵ(θn,pn)=1\bigl|\det J_{\Phi_\epsilon}(\theta^n, p^n)\bigr| = 1 for every (θn,pn)(\theta^n, p^n) in the domain where the implicit equations have unique solutions.

Proof.

We compute the Jacobian determinant of each substep separately via the implicit function theorem, then show the three factors cancel in the product.

Let Hθp(θ,p)H_{\theta p}(\theta, p) denote the matrix of mixed partial derivatives [Hθp]ij=2H/θipj[H_{\theta p}]_{ij} = \partial^2 H / \partial \theta_i \partial p_j, evaluated at (θ,p)(\theta, p). By symmetry of mixed partials, Hpθ=HθpH_{p\theta} = H_{\theta p}^\top, and in particular det(I+αHpθ)=det(I+αHθp)\det(I + \alpha H_{p\theta}) = \det(I + \alpha H_{\theta p}) for any scalar α\alpha.

Substep A (implicit momentum half-step). The map Φϵ/2A:(θn,pn)(θn,pn+1/2)\Phi^A_{\epsilon/2}: (\theta^n, p^n) \mapsto (\theta^n, p^{n+1/2}) is defined by θ\theta unchanged, pn+1/2=pnϵ2θH(θn,pn+1/2)p^{n+1/2} = p^n - \tfrac{\epsilon}{2}\, \nabla_\theta H(\theta^n, p^{n+1/2}). The Jacobian has block-lower-triangular structure (since θ\theta doesn’t change), so detJΦA=det(pn+1/2/pn)\det J_{\Phi^A} = \det(\partial p^{n+1/2}/\partial p^n). Differentiate the implicit equation in pnp^n:

pn+1/2pn  =  Id    ϵ2Hθp(θn,pn+1/2)pn+1/2pn.\frac{\partial p^{n+1/2}}{\partial p^n} \;=\; I_d \;-\; \tfrac{\epsilon}{2}\, H_{\theta p}(\theta^n, p^{n+1/2}) \cdot \frac{\partial p^{n+1/2}}{\partial p^n}.

Solve: [Id+ϵ2Hθp(θn,pn+1/2)]pn+1/2/pn=Id\bigl[I_d + \tfrac{\epsilon}{2} H_{\theta p}(\theta^n, p^{n+1/2})\bigr] \cdot \partial p^{n+1/2}/\partial p^n = I_d, hence

detJΦϵ/2A  =  1det(Id+ϵ2Hθp(θn,pn+1/2)).\det J_{\Phi^A_{\epsilon/2}} \;=\; \frac{1}{\det\bigl(I_d + \tfrac{\epsilon}{2} H_{\theta p}(\theta^n, p^{n+1/2})\bigr)}.

Substep B (implicit position update). The map ΦϵB:(θn,pn+1/2)(θn+1,pn+1/2)\Phi^B_\epsilon: (\theta^n, p^{n+1/2}) \mapsto (\theta^{n+1}, p^{n+1/2}) is defined by pn+1/2p^{n+1/2} unchanged, θn+1=θn+ϵ2[pH(θn,pn+1/2)+pH(θn+1,pn+1/2)]\theta^{n+1} = \theta^n + \tfrac{\epsilon}{2}\bigl[\nabla_p H(\theta^n, p^{n+1/2}) + \nabla_p H(\theta^{n+1}, p^{n+1/2})\bigr]. The Jacobian is block-upper-triangular, so detJΦB=det(θn+1/θn)\det J_{\Phi^B} = \det(\partial \theta^{n+1}/\partial \theta^n). Differentiate the implicit equation in θn\theta^n:

θn+1θn  =  Id  +  ϵ2Hpθ(θn,pn+1/2)  +  ϵ2Hpθ(θn+1,pn+1/2)θn+1θn.\frac{\partial \theta^{n+1}}{\partial \theta^n} \;=\; I_d \;+\; \tfrac{\epsilon}{2} H_{p\theta}(\theta^n, p^{n+1/2}) \;+\; \tfrac{\epsilon}{2} H_{p\theta}(\theta^{n+1}, p^{n+1/2}) \cdot \frac{\partial \theta^{n+1}}{\partial \theta^n}.

Solve:

detJΦϵB  =  det(Id+ϵ2Hθp(θn,pn+1/2))det(Idϵ2Hθp(θn+1,pn+1/2)),\det J_{\Phi^B_\epsilon} \;=\; \frac{\det\bigl(I_d + \tfrac{\epsilon}{2} H_{\theta p}(\theta^n, p^{n+1/2})\bigr)}{\det\bigl(I_d - \tfrac{\epsilon}{2} H_{\theta p}(\theta^{n+1}, p^{n+1/2})\bigr)},

using det(I+αHpθ)=det(I+αHθp)\det(I + \alpha H_{p\theta}) = \det(I + \alpha H_{\theta p}) at the last step.

Substep C (explicit momentum half-step). The map Φϵ/2C:(θn+1,pn+1/2)(θn+1,pn+1)\Phi^C_{\epsilon/2}: (\theta^{n+1}, p^{n+1/2}) \mapsto (\theta^{n+1}, p^{n+1}) is defined by θ\theta unchanged, pn+1=pn+1/2ϵ2θH(θn+1,pn+1/2)p^{n+1} = p^{n+1/2} - \tfrac{\epsilon}{2} \nabla_\theta H(\theta^{n+1}, p^{n+1/2}). This is an explicit update — no implicit equation to solve. The Jacobian is block-lower-triangular with pn+1/pn+1/2=Idϵ2Hθp(θn+1,pn+1/2)\partial p^{n+1}/\partial p^{n+1/2} = I_d - \tfrac{\epsilon}{2} H_{\theta p}(\theta^{n+1}, p^{n+1/2}), so detJΦϵ/2C=det ⁣(Idϵ2Hθp(θn+1,pn+1/2))\det J_{\Phi^C_{\epsilon/2}} = \det\!\bigl(I_d - \tfrac{\epsilon}{2} H_{\theta p}(\theta^{n+1}, p^{n+1/2})\bigr).

The product. Multiplying the three determinants,

detJΦϵ  =  1det(I+ϵ2Hθp(θn,pn+1/2))det(I+ϵ2Hθp(θn,pn+1/2))det(Iϵ2Hθp(θn+1,pn+1/2))det(Iϵ2Hθp(θn+1,pn+1/2)).\det J_{\Phi_\epsilon} \;=\; \frac{1}{\det(I + \tfrac{\epsilon}{2} H_{\theta p}(\theta^n, p^{n+1/2}))} \cdot \frac{\det(I + \tfrac{\epsilon}{2} H_{\theta p}(\theta^n, p^{n+1/2}))}{\det(I - \tfrac{\epsilon}{2} H_{\theta p}(\theta^{n+1}, p^{n+1/2}))} \cdot \det(I - \tfrac{\epsilon}{2} H_{\theta p}(\theta^{n+1}, p^{n+1/2})).

The three factors cancel pairwise: substep A’s denominator cancels substep B’s numerator, and substep B’s denominator cancels substep C’s factor. The result is detJΦϵ=1\det J_{\Phi_\epsilon} = 1.

Remark.

None of the three substeps is individually volume-preserving — only the symmetric arrangement is. This is what the prose in §5.2 meant by “the symmetric structure is what makes this work.” The cancellation is not a coincidence; it’s the algebraic shadow of symplecticity. Leimkuhler and Reich (2004, theorem 4.1) give the corresponding symplecticity proof, which is the stronger geometric statement; unit-Jacobian is the determinantal consequence.

6.2 Reversibility

Theorem 3 (Reversibility under momentum negation).

The generalized leapfrog map Φϵ\Phi_\epsilon satisfies NΦϵNΦϵ=id\mathcal{N} \circ \Phi_\epsilon \circ \mathcal{N} \circ \Phi_\epsilon = \mathrm{id}, where N(θ,p)=(θ,p)\mathcal{N}(\theta, p) = (\theta, -p). Equivalently, NΦϵN=Φϵ1\mathcal{N} \circ \Phi_\epsilon \circ \mathcal{N} = \Phi_\epsilon^{-1}.

Proof.

For the Riemannian Hamiltonian H(θ,p)=U(θ)+V(θ)+12pG(θ)1pH(\theta, p) = U(\theta) + V(\theta) + \tfrac{1}{2} p^\top G(\theta)^{-1} p, the kinetic term is quadratic in pp, so

θH(θ,p)  =  θH(θ,p),pH(θ,p)  =  pH(θ,p).\nabla_\theta H(\theta, -p) \;=\; \nabla_\theta H(\theta, p), \qquad \nabla_p H(\theta, -p) \;=\; -\nabla_p H(\theta, p).

Apply Φϵ\Phi_\epsilon to the state N(θn+1,pn+1)=(θn+1,pn+1)\mathcal{N}(\theta^{n+1}, p^{n+1}) = (\theta^{n+1}, -p^{n+1}). Write the resulting state as (θ,p)(\theta^*, p^*) with intermediate momentum p+1/2p^{*+1/2}. The three leapfrog lines read

p+1/2=pn+1ϵ2θH(θn+1,p+1/2),θ=θn+1+ϵ2[pH(θn+1,p+1/2)+pH(θ,p+1/2)],p=p+1/2ϵ2θH(θ,p+1/2).\begin{aligned} p^{*+1/2} &= -p^{n+1} - \tfrac{\epsilon}{2}\, \nabla_\theta H(\theta^{n+1}, p^{*+1/2}), \\ \theta^* &= \theta^{n+1} + \tfrac{\epsilon}{2}\bigl[\nabla_p H(\theta^{n+1}, p^{*+1/2}) + \nabla_p H(\theta^*, p^{*+1/2})\bigr], \\ p^* &= p^{*+1/2} - \tfrac{\epsilon}{2}\, \nabla_\theta H(\theta^*, p^{*+1/2}). \end{aligned}

Claim: p+1/2=pn+1/2p^{*+1/2} = -p^{n+1/2}. Substitute pn+1/2-p^{n+1/2} for p+1/2p^{*+1/2} in the first line. The right-hand side becomes pn+1ϵ2θH(θn+1,pn+1/2)=pn+1ϵ2θH(θn+1,pn+1/2)-p^{n+1} - \tfrac{\epsilon}{2}\, \nabla_\theta H(\theta^{n+1}, -p^{n+1/2}) = -p^{n+1} - \tfrac{\epsilon}{2}\, \nabla_\theta H(\theta^{n+1}, p^{n+1/2}), using the even-in-pp property of θH\nabla_\theta H. The original step-3 of the forward leapfrog gave pn+1=pn+1/2ϵ2θH(θn+1,pn+1/2)p^{n+1} = p^{n+1/2} - \tfrac{\epsilon}{2}\, \nabla_\theta H(\theta^{n+1}, p^{n+1/2}), so pn+1ϵ2θH(θn+1,pn+1/2)=pn+1/2-p^{n+1} - \tfrac{\epsilon}{2}\, \nabla_\theta H(\theta^{n+1}, p^{n+1/2}) = -p^{n+1/2}. By uniqueness of the solution (IFT, §5.5), p+1/2=pn+1/2p^{*+1/2} = -p^{n+1/2}.

Claim: θ=θn\theta^* = \theta^n. Substitute into the second line. With p+1/2=pn+1/2p^{*+1/2} = -p^{n+1/2}, the odd-in-pp property of pH\nabla_p H gives θ=θn+1ϵ2[pH(θn+1,pn+1/2)+pH(θ,pn+1/2)]\theta^* = \theta^{n+1} - \tfrac{\epsilon}{2}\bigl[\nabla_p H(\theta^{n+1}, p^{n+1/2}) + \nabla_p H(\theta^*, p^{n+1/2})\bigr]. The original step-2 of the forward leapfrog was θn+1=θn+ϵ2[pH(θn,pn+1/2)+pH(θn+1,pn+1/2)]\theta^{n+1} = \theta^n + \tfrac{\epsilon}{2}[\nabla_p H(\theta^n, p^{n+1/2}) + \nabla_p H(\theta^{n+1}, p^{n+1/2})]. Substituting θ=θn\theta^* = \theta^n candidate: θn=?θn+1ϵ2[pH(θn+1,pn+1/2)+pH(θn,pn+1/2)]\theta^n \stackrel{?}{=} \theta^{n+1} - \tfrac{\epsilon}{2}[\nabla_p H(\theta^{n+1}, p^{n+1/2}) + \nabla_p H(\theta^n, p^{n+1/2})]. Rearranging gives the forward step-2. So θ=θn\theta^* = \theta^n satisfies the equation, and by uniqueness, θ=θn\theta^* = \theta^n.

Claim: p=pnp^* = -p^n. Compute from the third line: p=pn+1/2ϵ2θH(θn,pn+1/2)p^* = -p^{n+1/2} - \tfrac{\epsilon}{2} \nabla_\theta H(\theta^n, p^{n+1/2}). The forward step-1 was pn+1/2=pnϵ2θH(θn,pn+1/2)p^{n+1/2} = p^n - \tfrac{\epsilon}{2} \nabla_\theta H(\theta^n, p^{n+1/2}), so pn=pn+1/2ϵ2θH(θn,pn+1/2)=p-p^n = -p^{n+1/2} - \tfrac{\epsilon}{2} \nabla_\theta H(\theta^n, p^{n+1/2}) = p^*.

We have shown Φϵ(N(θn+1,pn+1))=(θn,pn)=N(θn,pn)\Phi_\epsilon(\mathcal{N}(\theta^{n+1}, p^{n+1})) = (\theta^n, -p^n) = \mathcal{N}(\theta^n, p^n). Applying N\mathcal{N} to both sides gives NΦϵN(θn+1,pn+1)=(θn,pn)\mathcal{N} \circ \Phi_\epsilon \circ \mathcal{N}(\theta^{n+1}, p^{n+1}) = (\theta^n, p^n). Since Φϵ(θn,pn)=(θn+1,pn+1)\Phi_\epsilon(\theta^n, p^n) = (\theta^{n+1}, p^{n+1}) by construction, this is (NΦϵN)Φϵ=id(\mathcal{N} \circ \Phi_\epsilon \circ \mathcal{N}) \circ \Phi_\epsilon = \mathrm{id}.

Composing LL leapfrog steps, the same argument by induction gives NΦϵLN=ΦϵL\mathcal{N} \circ \Phi_\epsilon^L \circ \mathcal{N} = \Phi_\epsilon^{-L}, so the trajectory of LL steps is reversible under momentum negation as a single unit.

round-trip error: 7.13e-6
Figure 6. Theorem 6.2 in action. The forward trajectory (olive) runs L generalized-leapfrog steps from (θ_0, p_0). The backward trajectory (pink, dashed) starts at (θ_L, -p_L) — the endpoint of the forward path with momentum negated — and runs L more steps. By reversibility, it retraces the forward path exactly, landing at (θ_0, -p_0) up to fixed-point tolerance. Round-trip error reported above. Crucially, this holds for ANY ε for which the implicit solver converges — not just small ε. Reversibility is the structural property; energy conservation is only approximate.

6.3 Detailed balance

The RMHMC proposal map is Ψ:=NΦϵL\Psi := \mathcal{N} \circ \Phi_\epsilon^L — run LL leapfrog steps, then negate momentum. By §6.2, Ψ\Psi is an involution: ΨΨ=id\Psi \circ \Psi = \mathrm{id}. By §6.1, detJΨ=1|\det J_\Psi| = 1 (the leapfrog has unit Jacobian; momentum negation has detN=1|\det \mathcal{N}| = 1). The Metropolis-corrected RMHMC kernel is

K((θ,p)A)  =  α(θ,p)1 ⁣[Ψ(θ,p)A]  +  (1α(θ,p))1 ⁣[(θ,p)A],K\bigl((\theta, p) \to A\bigr) \;=\; \alpha(\theta, p)\, \mathbf{1}\!\bigl[\Psi(\theta, p) \in A\bigr] \;+\; \bigl(1 - \alpha(\theta, p)\bigr)\, \mathbf{1}\!\bigl[(\theta, p) \in A\bigr],

with acceptance probability α(θ,p)=min(1,eH(θ,p)H(Ψ(θ,p)))\alpha(\theta, p) = \min\bigl(1,\, e^{H(\theta, p) - H(\Psi(\theta, p))}\bigr).

Theorem 4 (Detailed balance of the RMHMC kernel).

The RMHMC kernel KK satisfies detailed balance with respect to π(θ,p)eH(θ,p)\pi(\theta, p) \propto e^{-H(\theta, p)}.

Proof.

Detailed balance for a kernel with a deterministic proposal Ψ\Psi and Metropolis acceptance reduces to checking π(x)α(x)=π(Ψ(x))detJΨ(x)α(Ψ(x))\pi(x)\, \alpha(x) = \pi\bigl(\Psi(x)\bigr) \cdot |\det J_\Psi(x)| \cdot \alpha\bigl(\Psi(x)\bigr), where the Jacobian factor accounts for the change-of-variables under the proposal. By §6.1, detJΨ(x)=1|\det J_\Psi(x)| = 1.

Let y=Ψ(x)y = \Psi(x). Then Ψ(y)=x\Psi(y) = x by the involution property. The acceptance probabilities are α(x)=min(1,eH(x)H(y))\alpha(x) = \min\bigl(1,\, e^{H(x) - H(y)}\bigr) and α(y)=min(1,eH(y)H(x))\alpha(y) = \min\bigl(1,\, e^{H(y) - H(x)}\bigr). WLOG, H(x)H(y)H(x) \le H(y); then α(x)=eH(x)H(y)<1\alpha(x) = e^{H(x) - H(y)} < 1 and α(y)=1\alpha(y) = 1. Check:

π(x)α(x)=eH(x)eH(x)H(y)=eH(y)=eH(y)1=π(y)α(y).\pi(x)\, \alpha(x) = e^{-H(x)} \cdot e^{H(x) - H(y)} = e^{-H(y)} = e^{-H(y)} \cdot 1 = \pi(y)\, \alpha(y).

By symmetry the other case (H(x)H(y)H(x) \ge H(y)) gives the same conclusion. So π(x)α(x)=π(y)α(y)\pi(x)\, \alpha(x) = \pi(y)\, \alpha(y) in both cases, and detailed balance holds.

Remark.

The proof uses three structural facts about Ψ\Psi: it’s volume-preserving (detJΨ=1\det J_\Psi = 1), it’s an involution (ΨΨ=id\Psi \circ \Psi = \mathrm{id}), and the acceptance ratio is the symmetric min(1,π(y)/π(x))\min(1, \pi(y)/\pi(x)). All three properties came from §6.1 and §6.2 — the generalized leapfrog’s structural design. Drop any one and the calculation fails.

6.4 The Jacobian-determinant collapse

The detG(θ)\det G(\theta) factor that would normally appear in a Metropolis acceptance probability between distributions whose momentum-conditionals have different covariances does appear here — but it appears inside HH, via the volume term V(θ)=12logdetG(θ)V(\theta) = \tfrac{1}{2} \log \det G(\theta), rather than as a separate Jacobian factor in front.

To see this, write out the energy difference along a transition:

H(θ,p)H(θ,p)  =  [U(θ)U(θ)]  +  12logdetG(θ)detG(θ)  +  [K(θ,p)K(θ,p)].H(\theta', p') - H(\theta, p) \;=\; \bigl[U(\theta') - U(\theta)\bigr] \;+\; \tfrac{1}{2}\log\frac{\det G(\theta')}{\det G(\theta)} \;+\; \bigl[K(\theta', p') - K(\theta, p)\bigr].

The middle term is exactly logdetG(θ)/detG(θ)\log\sqrt{\det G(\theta')/\det G(\theta)} — the log-ratio of the metric volumes at the two states. In exponentiating to get the acceptance probability,

α  =  min ⁣(1,eH(θ,p)H(θ,p))  =  min ⁣(1,eU(θ)U(θ)detG(θ)detG(θ)eK(θ,p)K(θ,p)),\alpha \;=\; \min\!\Bigl(1,\, e^{H(\theta, p) - H(\theta', p')}\Bigr) \;=\; \min\!\biggl(1,\, e^{U(\theta) - U(\theta')} \cdot \sqrt{\tfrac{\det G(\theta)}{\det G(\theta')}} \cdot e^{K(\theta, p) - K(\theta', p)}\biggr),

the metric-volume factor multiplies the potential-energy and kinetic-energy ratios in exactly the way a change-of-measure factor would. The point is that we don’t compute it separately — packaging V(θ)V(\theta) into HH from §4.2 means a single energy difference automatically carries the metric-Jacobian correction. That is what the volume term does.

6.5 Putting it together

The RMHMC algorithm at each iteration:

  1. Resample momentum from pN(0,G(θ))p \sim \mathcal{N}(0, G(\theta)) at the current θ\theta — a Gibbs step in the augmented space.
  2. Propose Ψ(θ,p)\Psi(\theta, p) via LL generalized-leapfrog steps and momentum negation.
  3. Accept or reject by the Metropolis rule of §6.3.

Step (i) is exact and leaves π(θ,p)\pi(\theta, p) invariant by construction. Step (ii)–(iii) leaves π(θ,p)\pi(\theta, p) invariant by detailed balance. Composing the two steps gives a kernel that’s π(θ,p)\pi(\theta, p)-invariant. Marginalizing pp recovers π(θ)\pi(\theta).

The chain’s irreducibility, aperiodicity, and geometric-ergodicity properties are inherited from standard HMC under regularity conditions on the metric (Livingstone, Betancourt, Byrne, and Girolami 2019); §7 addresses what those conditions look like in practice.

Single-panel figure showing forward and backward generalized-leapfrog trajectories on the banana, with round-trip error reported numerically.
Figure 6 (static fallback). Forward and backward GL trajectories on the banana at ε = 0.15, L = 40. Round-trip error ≈ 1.8e-4 — at the fixed-point-tolerance floor.

§7. Step size, trajectory length, and ergodicity

Standard HMC has one main tuning knob — the leapfrog step size ϵ\epsilon — plus a secondary one, the trajectory length LL. RMHMC inherits both and adds a third: the implicit-solver tolerance and iteration cap. The interactions among the three are not what one would expect from naive intuition built on Euclidean HMC, and the practical guidance has to track that.

7.1 What controls the implicit-solver convergence

In standard HMC, ϵ\epsilon controls accuracy: smaller ϵ\epsilon means smaller per-step energy error and higher acceptance, at the price of more leapfrog steps to traverse the same trajectory time T=LϵT = L \epsilon. There’s a single trade-off: ϵ\epsilon vs LL at fixed TT.

RMHMC adds a second constraint. The implicit fixed-point iteration of §5.5 has to converge, and its contraction rate depends on the local geometry. Look at the momentum half-step’s iteration map,

F1(p)  =  b    ϵ2θK(θn,p),F_1(p^*) \;=\; b \;-\; \tfrac{\epsilon}{2}\, \nabla_\theta K(\theta^n, p^*),

where θK(θ,p)=12pG(θ)1(kG(θ))G(θ)1p\nabla_\theta K(\theta, p) = -\tfrac{1}{2} p^\top G(\theta)^{-1} (\partial_k G(\theta))\, G(\theta)^{-1} p in coordinates. The Jacobian of F1F_1 in pp^* is ϵ2(θK)/p-\tfrac{\epsilon}{2}\, \partial(\nabla_\theta K)/\partial p, and its operator norm scales as ϵG1(θn)G(θn)G1(θn)p\epsilon \cdot \|G^{-1}(\theta^n)\| \cdot \|\partial G(\theta^n)\| \cdot \|G^{-1}(\theta^n)\| \cdot \|p\|. The iteration is a contraction iff this norm is less than 1, which gives an effective ϵ\epsilon constraint that depends on position:

ϵ    1G1(θn)2G(θn)p.\epsilon \;\lesssim\; \frac{1}{\|G^{-1}(\theta^n)\|^2 \cdot \|\nabla G(\theta^n)\| \cdot \|p\|}.

In regions where the metric has large gradient — where the manifold’s local curvature is changing fast — the implicit solver demands smaller ϵ\epsilon. In flat regions, larger ϵ\epsilon is fine. A constant ϵ\epsilon that works in the flat region may fail to converge in the steep region.

The practical consequence: monitor FP iteration counts during sampling. If counts stay below a target — say, 5 — across the chain, ϵ\epsilon is comfortably below the position-dependent constraint everywhere the chain visits. If counts spike at certain positions, ϵ\epsilon is too large for the local geometry. NUTS adaptive warmup uses divergent transitions as the binary signal; RMHMC practice should additionally use the iteration count as a continuous signal during warmup.

divergent cells: 128 / 12500
Figure 7. At each grid point (θ₁, θ₂), 20 momenta sampled from N(0, G(θ)), one generalized-leapfrog step at the chosen ε, mean FP iteration count plotted as color. Cells where ≥50% of momenta hit the divergence cap are colored red. At small ε iteration counts are uniform; at large ε they spike in regions of unfavorable metric geometry — typically near the banana ridge's outer arms where the metric varies fastest. The chain's worst-case cost lives in the worst-case region.

7.2 Trajectory length and the U-turn heuristic

Choosing LL is the second tuning question. For standard HMC, the heuristic is: LϵL \epsilon should be on the order of one autocorrelation length of the target. Too short and consecutive states are correlated; too long and the trajectory wraps around the energy level set. The NUTS criterion (Hoffman and Gelman 2014) selects LL adaptively at each iteration: extend the trajectory until it starts curving back toward its origin.

The U-turn condition in standard HMC is the inner product (θLθ0)pL<0(\theta^L - \theta^0) \cdot p^L < 0 — when the displacement vector from origin and the current momentum point in roughly opposite directions, the trajectory is on the return half of an orbit. The Euclidean dot product is the right operation because momentum is a Euclidean vector in flat HMC.

In RMHMC, the inner product needs to be metric-aware. Momentum is a cotangent vector, displacement is a tangent vector, and the natural pairing is contraction via the metric — equivalently, the inner product on the cotangent bundle uses G1G^{-1}. The Riemannian U-turn criterion is

θLθ0,G(θL)1pL  <  0,\bigl\langle \theta^L - \theta^0,\, G(\theta^L)^{-1} p^L \bigr\rangle \;<\; 0,

which translates to: “the current velocity vector θ˙L=G(θL)1pL\dot\theta^L = G(\theta^L)^{-1} p^L points in a direction misaligned with the displacement.” Betancourt (2013) develops this and variants into a Riemannian NUTS. Implementations are available in Stan and a handful of research-grade libraries.

For this topic’s worked examples in §§9–10, we use a fixed LL chosen by short pilot runs — usually L=25L = 25 on the banana and L50L \approx 50 on the LCP. Fixed LL keeps the head-to-head comparisons with standard HMC clean.

A sometimes-useful rule of thumb: on smooth, log-concave targets the optimal LL scales roughly as the square root of the condition number of GG, evaluated at typical samples.

7.3 Geometric ergodicity

A Markov chain is geometrically ergodic if the total-variation distance from its stationary distribution decays exponentially in iteration count. Geometric ergodicity is what justifies finite-sample MCMC estimates with quantified bias.

Standard HMC’s geometric ergodicity is well-understood. Durmus, Moulines, and Saksman (2017) and Livingstone, Betancourt, Byrne, and Girolami (2019) establish sufficient conditions: smooth, super-exponentially light tails on the target; bounded gradients of UU; an appropriate choice of kinetic energy. Most well-posed Bayesian posteriors satisfy these.

For RMHMC the picture is more fragmentary. The main obstacle to direct theory: the metric tensor G(θ)G(\theta) can blow up or degenerate in regions where the Fisher information loses regularity. When G(θ)G(\theta) \to \infty or G(θ)1G(\theta)^{-1} \to \infty as θ\|\theta\| \to \infty, the standard drift-condition machinery breaks down. Partial results exist: for elliptically symmetric distributions, geometric ergodicity follows; for the banana, geometric ergodicity is plausible but not (to the author’s knowledge) explicitly proven in the literature; for log-concave targets with bounded metric eigenvalues, geometric ergodicity should follow from existing HMC results (Livingstone, Betancourt, Byrne, and Girolami 2019).

The working position for practitioners. On smooth targets with metrics that stay well-conditioned across the support, RMHMC behaves like geometrically ergodic HMC and the usual asymptotic theory applies. On targets where the metric becomes singular — Neal’s funnel as τ\tau \to -\infty, the LCP at very low intensity — the chain’s mixing is harder to characterize and divergence rates become diagnostically important. In all cases, multi-chain R^\hat R and ESS diagnostics (developed in §11) should be the first-line check on whether the chain is doing what we hope.

What’s open. Sharp conditions for geometric ergodicity of RMHMC on hierarchical models with hyperparameter degeneracies; quantitative mixing-rate bounds dependent on the metric’s curvature; the relationship between metric choice and ergodicity. These are active research areas in the computational-statistics literature.

Heatmap of mean fixed-point iteration count over the banana support at three step sizes, showing position-dependent solver cost.
Figure 7 (static fallback). Mean FP iteration count per GL step over the banana support at ε = 0.10, 0.20, 0.35. Cost concentrates near the ridge's outer arms where the metric gradient is large.

§8. Computational cost

RMHMC buys mixing efficiency on pathological targets at the price of per-step compute. Where standard HMC pays O(d)O(d) per leapfrog step (one gradient evaluation, one position update, one momentum update), RMHMC pays O(d3)O(d^3) for Cholesky factorization plus the inner-loop cost of the implicit fixed-point iteration. The trade is favorable when the mixing-rate improvement outpaces the per-step cost increase.

8.1 The per-step Cholesky factorization

Look at one generalized-leapfrog step. The implicit momentum half-step needs G(θn)1pG(\theta^n)^{-1} p^* at every iteration of its fixed-point solver. Compute G(θn)1G(\theta^n)^{-1} once via a single Cholesky factorization of G(θn)G(\theta^n), reuse for every iteration: cost is O(d3/3)O(d^3/3) FLOPs for the factorization plus O(d2)O(d^2) per back-substitution. The implicit position step is harder. Each iteration evaluates G(θ(k))1pn+1/2G(\theta^{(k)})^{-1} p^{n+1/2} at a different θ(k)\theta^{(k)}, so a fresh Cholesky of G(θ(k))G(\theta^{(k)}) is needed at every iteration. Cost: O(Iθd3/3)O(I_\theta \cdot d^3/3). Total per leapfrog step: roughly (1+Iθ)(1 + I_\theta) Choleskys, where IθI_\theta is the position-step FP iteration count.

In dimension dd, this asymptote is what governs scaling. For the banana at d=2d = 2, Cholesky is a six-FLOP operation and the per-step cost is dominated by Python/JS overhead. For the downsized LCP at d25d \approx 25, Cholesky is on the order of 5000 FLOPs and is becoming material against the rest of the step. At full LCP scale (d=64d = 64 or larger), Cholesky is the wall.

8.2 Fixed-point convergence rate

How many FP iterations Iθ,IpI_\theta, I_p actually occur in practice? The momentum half-step iteration has Jacobian norm scaling as ρϵθK(θ,)Lip\rho \approx \epsilon \cdot \|\nabla_\theta K(\theta, \cdot)\|_{\text{Lip}}. Banach contraction gives the iteration error after kk steps bounded by ρk\rho^k, and to reach tolerance τ\tau we need

k    log(1/τ)log(1/ρ).k \;\gtrsim\; \frac{\log(1/\tau)}{\log(1/\rho)}.

For ρ0.1\rho \approx 0.1 (typical of a well-tuned ϵ\epsilon) and τ=106\tau = 10^{-6}, this gives k6k \approx 6 — matching the median FP counts the §5 figure measured on the banana. For ρ0.5\rho \approx 0.5 (badly tuned), k20k \approx 20 and we’re hitting the divergence cap. The relationship is logarithmic in τ\tau but inverse-logarithmic in ρ\rho, so the iteration count’s sensitivity to ϵ\epsilon near the contraction boundary is sharp.

The operational consequence: the FP iteration cost is concentrated in the regions where the metric is unfavorable. A chain that visits both flat and curved regions amortizes the per-step cost over a mix of cheap and expensive steps; if the chain spends most of its time in the curved region, the worst-case cost dominates the average.

8.3 Memory layout and the half-triangle trick

The Fisher metric G(θ)G(\theta) is symmetric by construction. Materializing it elementwise as a full d×dd \times d matrix uses d2d^2 floats; storing only the lower-triangular half uses d(d+1)/2d(d+1)/2, a near-2× saving in memory. At low dd this is irrelevant — the banana’s 2×22 \times 2 metric occupies 32 bytes either way. At moderate dd, the saving compounds: a d=100d = 100 metric is 80 KB as a full matrix vs 40 KB as a half-triangle, and that’s per state in the chain.

The Cholesky algorithm in SciPy (and LAPACK underneath) already operates on a triangular layout internally — scipy.linalg.cho_factor(G, lower=True) returns the lower-triangular factor in-place over the lower half of GG. So the factorization itself doesn’t gain from half-triangle materialization unless we’re hand-rolling. Where the half-triangle trick does matter is when GG is computed elementwise from a closed-form expression: skip the upper triangle and mirror. For the LCP’s likelihood-Fisher block (§10), this saves roughly half the metric-evaluation cost.

A complementary structural exploit: when GG is block-diagonal, the Cholesky factors as a product of per-block factorizations and the per-step cost drops by the cube of the block-size ratio. The LCP example has this structure: a likelihood-Fisher block on the intensity field and a constant GMRF-prior block. We exploit it in §10.

8.4 The cost-vs-mixing trade

The fundamental question: when does RMHMC’s higher per-step cost more than pay for itself in mixing?

The wall-clock ratio is what matters. ESS-per-sample comparisons are agnostic to per-step cost. The right comparison is ESS per second.

Empirically on the banana (§9 quantifies): at b=0b = 0 the banana is a Gaussian and HMC wins on ESS/sec because there’s no geometric pathology to compensate for. As bb grows, the metric variation starts to matter and the relative ESS/sec of the two methods shifts. The exact crossover depends on implementation efficiency; the brief calls it at b0.5b \approx 0.5, but a careful in-browser benchmark may land it elsewhere depending on which language is paying which fixed overheads.

The same trade-off works in reverse on well-conditioned targets. An isotropic Gaussian, a moderately-correlated regression posterior, or a high-dimensional but globally Gaussian target — these don’t benefit from RMHMC. RMHMC’s natural habitat is targets where: (i) the Fisher metric is available, (ii) the target’s local curvature varies substantially across the support, (iii) the geometric pathology is severe enough that NUTS-style constant-mass HMC struggles.

A useful heuristic: if NUTS converges nicely on your problem at moderate-to-low divergence rates, you don’t need RMHMC. If it doesn’t — divergent transitions despite small ϵ\epsilon, R^\hat R that won’t converge, ESS stuck at a few percent of the sample count — your target has the kind of geometric structure RMHMC was built for.

Figure 8. Left: per-step wall-clock decomposition for the generalized leapfrog at three ε values. Cholesky dominates at high ε; FP-iteration overhead grows with ε too. Right: ESS/sec on a log y-axis for HMC (blue) and RMHMC (olive) across banana curvature b ∈ [0, 1.5]. RMHMC's per-step cost is ~6–10× HMC's, so its mixing-rate advantage has to overcome that gap; the crossover sits at modest b for the canonical a = 1 banana. Slide L or n_samples to re-run the sweep.
Two-panel figure with per-step cost decomposition and ESS/sec versus banana curvature for HMC and RMHMC.
Figure 8 (static fallback). Cost breakdown and ESS/sec sweep from the notebook. Cholesky dominates at high ε; ESS/sec crossover depends on banana curvature.

§9. Worked example: the banana distribution

The banana has been the running thread of this topic since §1: pathological enough to defeat constant-mass HMC but small enough (d=2d = 2) to debug everything by eye. This section runs the actual head-to-head benchmark — standard HMC, RMHMC, NUTS — on the banana with full diagnostics.

9.1 The target and the setup

Recall the banana posterior. The Bayesian model from §3.1 has parameters (θ1,θ2)R2(\theta_1, \theta_2) \in \mathbb{R}^2 with a Gaussian prior θ1N(0,a2)\theta_1 \sim \mathcal{N}(0, a^2) and an improper uniform prior on θ2\theta_2, plus a Gaussian likelihood yθN(θ2+b(θ12a2),  1)y \mid \theta \sim \mathcal{N}(\theta_2 + b(\theta_1^2 - a^2),\; 1) with observed yobs=0y_{\text{obs}} = 0. The posterior is

π(θ1,θ2)    exp ⁣(12[θ12/a2+(θ2+b(θ12a2))2]),\pi(\theta_1, \theta_2) \;\propto\; \exp\!\bigl(-\tfrac{1}{2}\bigl[\theta_1^2/a^2 + (\theta_2 + b(\theta_1^2 - a^2))^2\bigr]\bigr),

which is a non-linear reparameterization of a bivariate Gaussian. We work throughout with the canonical parameter choice a=1a = 1, b=1b = 1.

The analytical reference moments come from the marginal-plus-conditional decomposition. The marginal posterior on θ1\theta_1 is N(0,a2)\mathcal{N}(0, a^2), so E[θ1]=0\mathbb{E}[\theta_1] = 0 and Var[θ1]=a2\mathrm{Var}[\theta_1] = a^2. The conditional posterior on θ2\theta_2 given θ1\theta_1 is N(b(θ12a2),1)\mathcal{N}(-b(\theta_1^2 - a^2),\, 1). Therefore

E[θ2]  =  Eθ1 ⁣[b(θ12a2)]  =  b(a2a2)  =  0,\mathbb{E}[\theta_2] \;=\; \mathbb{E}_{\theta_1}\!\bigl[-b(\theta_1^2 - a^2)\bigr] \;=\; -b\,(a^2 - a^2) \;=\; 0,

using E[θ12]=a2\mathbb{E}[\theta_1^2] = a^2. For the variance, the law of total variance gives

Var[θ2]  =  Eθ1 ⁣[Var[θ2θ1]]  +  Varθ1 ⁣[E[θ2θ1]]  =  1  +  b2Var[θ12].\mathrm{Var}[\theta_2] \;=\; \mathbb{E}_{\theta_1}\!\bigl[\mathrm{Var}[\theta_2 \mid \theta_1]\bigr] \;+\; \mathrm{Var}_{\theta_1}\!\bigl[\mathbb{E}[\theta_2 \mid \theta_1]\bigr] \;=\; 1 \;+\; b^2 \cdot \mathrm{Var}[\theta_1^2].

Using Var[Z2]=2σ4\mathrm{Var}[Z^2] = 2\sigma^4 for ZN(0,σ2)Z \sim \mathcal{N}(0, \sigma^2), we get Var[θ12]=2a4\mathrm{Var}[\theta_1^2] = 2 a^4, so

Var[θ2]  =  1  +  2a4b2.\mathrm{Var}[\theta_2] \;=\; 1 \;+\; 2\, a^4\, b^2.

For (a,b)=(1,1)(a, b) = (1, 1): E[θ]=(0,0)\mathbb{E}[\theta] = (0, 0), Var[θ]=(1,3)\mathrm{Var}[\theta] = (1, 3). The standard deviation on θ2\theta_2 is 3\sqrt{3} times that on θ1\theta_1, and the support has curvature that an isotropic mass matrix cannot capture.

9.2 The Fisher metric, restated

The Gauss-Newton Fisher metric, derived in §3 and reused throughout §§4–8, is

G(θ1,θ2)  =  (1/a2+4b2θ122bθ12bθ11),G(\theta_1, \theta_2) \;=\; \begin{pmatrix} 1/a^2 + 4 b^2 \theta_1^2 & 2 b \theta_1 \\[2pt] 2 b \theta_1 & 1 \end{pmatrix},

with constant determinant detG=1/a2\det G = 1/a^2, so the volume term V(θ)=12logdetG(θ)=logaV(\theta) = \tfrac{1}{2}\log\det G(\theta) = -\log a is constant and contributes nothing to the dynamics. The metric reduces to the identity at θ1=0\theta_1 = 0 (the banana’s “flat” point) and stretches with θ1|\theta_1| along the ridge.

9.3 Head-to-head: HMC vs RMHMC

We run two samplers on the banana, each with 4 chains of 2000 post-warmup samples. Standard HMC uses an identity mass matrix and a tuned ϵ=0.10\epsilon = 0.10, L=25L = 25. RMHMC uses the Fisher metric with ϵ=0.15\epsilon = 0.15, L=25L = 25. The static fallback PNG adds a third comparison against NUTS from PyMC.

Three observations.

First, both samplers target the same distribution: the empirical means match the analytical (0,0)(0, 0) to within Monte Carlo error, and the empirical Var[θ2]\mathrm{Var}[\theta_2] matches the analytical 33 to within Monte Carlo error.

Second, ESS-per-sample is where RMHMC pulls ahead. Standard HMC’s bulk ESS sits at a fraction of the chain length because consecutive samples are heavily correlated: the trajectory length is too short to traverse a full autocorrelation length in the loose direction. RMHMC’s bulk ESS is much closer to the chain length — samples are nearly independent.

Third, ESS-per-second is the right comparison metric. RMHMC’s per-step cost is roughly 6–10× HMC’s, so its ESS advantage barely overcomes the cost gap on the banana. A careful NumPy/SciPy RMHMC against a careful NumPy HMC lands RMHMC at roughly 2× HMC’s ESS/sec on the banana at b=1b = 1. The cost advantage widens with curvature.

9.4 Trajectory geometry — what the leapfrog paths look like

The sample-quality numbers say RMHMC explores better; the trajectory plots say why. From a common starting point with momentum sampled from N(0,G(θ0))\mathcal{N}(0, G(\theta_0)), one HMC trajectory and one RMHMC trajectory at their respective tuned ϵ\epsilon for L=50L = 50 leapfrog steps.

The HMC trajectory is a Euclidean parabola: starting at the same place with some initial velocity, gravity (the gradient of UU) pulls it back toward the banana’s ridge but doesn’t follow the ridge — once near the ridge, the trajectory has too much momentum perpendicular to it and overshoots. The path bounces across the ridge several times during the trajectory, accumulating little net displacement along the ridge. Effective per-trajectory exploration in the ridge-following direction: roughly Lϵ\sqrt{L} \cdot \epsilon.

The RMHMC trajectory follows the ridge. At each leapfrog step, the metric G(θn)G(\theta^n) rotates and stretches the kinetic energy to align with the local geometry, so the trajectory naturally curves along the ridge instead of cutting across it. Effective per-trajectory exploration in the ridge-following direction: roughly LϵL \cdot \epsilon — linear rather than L\sqrt{L}. This is the geometric story made operational: position-dependent metrics convert random-walk-style exploration into directed-walk-style exploration along the manifold’s natural coordinates.

sampleraccdivFP (p, θ)wall (s)ESSminESS/sVar[θ₁]Var[θ₂]
HMC0.99600.004822914560.9102.525
RMHMC0.9861(6.4, 6.7)0.0949152831.0723.022
analytical1.0003.000
Figure 9. Side-by-side HMC vs RMHMC samples on the banana posterior at the current b. Diagnostics table shows acceptance rate, divergences, mean FP iterations (RMHMC only), wall-clock, Geyer-style ESSmin, ESS/sec, and empirical variances against the analytical Var[θ₁] = a² = 1, Var[θ₂] = 1 + 2 a⁴ b². The third sampler (NUTS) appears in the static-fallback PNG since it requires PyMC.
Three-panel scatter comparison of HMC, RMHMC, and NUTS samples on the banana posterior.
Figure 9a (static fallback). HMC, RMHMC, and NUTS samples on the banana at (a, b) = (1, 1). The live viz above shows the HMC vs RMHMC comparison; the third (NUTS) appears only in the static PNG since PyMC isn't available in the browser.
Single-trajectory comparison of HMC vs RMHMC leapfrog paths on the banana, illustrating why RMHMC follows the ridge while HMC bounces across it.
Figure 9b (static fallback). One HMC trajectory and one RMHMC trajectory from a common starting point with L = 50 leapfrog steps. The RMHMC path tracks the ridge; the HMC path overshoots and bounces.

§10. Worked example: log-Gaussian Cox process

The banana example of §9 is the visualizable case. The log-Gaussian Cox process is the historically-motivating case — Girolami and Calderhead’s 2011 paper used a d=64d = 64 LCP as their main empirical demonstration, and the LCP has been the canonical “real” RMHMC application ever since. We run a downsized d=25d = 25 version that fits the browser-runtime budget while exercising the integrator’s structural features the banana doesn’t: a non-trivial volume term V(θ)V(\theta), a Fisher metric with additive prior–plus–likelihood structure, and a dimension where Cholesky cost starts to matter relative to gradient evaluation.

10.1 The model

A log-Gaussian Cox process is a doubly-stochastic Poisson point process: events arrive on a spatial domain Ω\Omega with intensity λ(s)=exp(x(s))\lambda(s) = \exp(x(s)), where x(s)x(s) is itself a realization of a Gaussian process. Discretizing Ω\Omega into dd cells of equal area A=1/dA = 1/d, the latent vector x=(x1,,xd)\boldsymbol{x} = (x_1, \dots, x_d) is the log-intensity at each cell center, and the count in cell ii is

yixi    Poisson(Aexp(xi)).y_i \mid x_i \;\sim\; \mathrm{Poisson}\bigl(A \cdot \exp(x_i)\bigr).

The prior on x\boldsymbol{x} is a discretized Gaussian process: xN(μx1,K)\boldsymbol{x} \sim \mathcal{N}(\mu_x \mathbf{1}, K) where KK is a kernel covariance matrix and μx\mu_x sets the mean log-intensity. For this section we use a 5×55 \times 5 grid (d=25d = 25), a squared-exponential kernel with length-scale ρ\rho and unit variance, and μx=log50\mu_x = \log 50. We generate a synthetic ground truth xtrueN(μx1,K)\boldsymbol{x}_{\text{true}} \sim \mathcal{N}(\mu_x \mathbf{1}, K) once with a fixed seed, then simulate counts y\boldsymbol{y} from the resulting intensity.

The log-posterior is (up to additive constants):

logp(xy)  =  i[yixiAexp(xi)]    12(xμx1)K1(xμx1),\log p(\boldsymbol{x} \mid \boldsymbol{y}) \;=\; \sum_i \bigl[y_i x_i - A \exp(x_i)\bigr] \;-\; \tfrac{1}{2}(\boldsymbol{x} - \mu_x \mathbf{1})^\top K^{-1} (\boldsymbol{x} - \mu_x \mathbf{1}),

with gradient xlogp=yAexp(x)K1(xμx1)\nabla_{\boldsymbol{x}} \log p = \boldsymbol{y} - A \exp(\boldsymbol{x}) - K^{-1}(\boldsymbol{x} - \mu_x \mathbf{1}). The geometry of this target is genuinely hard for constant-mass HMC: cells with high observed counts have steep gradients and tight local curvature; cells with zero counts have gentle gradients and loose curvature; and the kernel prior introduces strong spatial correlations that make a diagonal preconditioner inadequate.

10.2 The Fisher metric

The Bayesian Fisher information of this model is the expected negative Hessian of the log-likelihood plus the prior precision. The Hessian of the log-likelihood is diagonal and the prior contributes its precision matrix Λ=K1\Lambda = K^{-1}:

G(x)  =  diag(Aexp(x1),,Aexp(xd))  +  Λ.G(\boldsymbol{x}) \;=\; \mathrm{diag}\bigl(A \exp(x_1), \dots, A \exp(x_d)\bigr) \;+\; \Lambda.

Two structural features matter. First, the likelihood part is diagonal and position-dependent, while the prior part Λ\Lambda is dense and constant. Second, the determinant of G(x)G(\boldsymbol{x}) varies non-trivially with x\boldsymbol{x} — unlike the banana, where detG\det G was constant. The volume term V(x)=12logdetG(x)V(\boldsymbol{x}) = \tfrac{1}{2}\log\det G(\boldsymbol{x}) contributes to both Hamilton’s equations and the Metropolis acceptance ratio, and its gradient is computable from a single Cholesky of G(x)G(\boldsymbol{x}) via the diagonal entries of G1G^{-1}.

10.3 RMHMC on the d=25d = 25 benchmark

The viz below runs RMHMC on the LCP posterior at user-selectable kernel length-scale ρ\rho and sample count, then summarizes posterior recovery. Both the posterior mean and pointwise posterior std are plotted as heatmaps alongside the synthetic ground truth.

Two things to call out. First, RMHMC correctly recovers the spatial pattern of xtrue\boldsymbol{x}_{\text{true}}. The empirical posterior mean agrees with the truth to within Monte Carlo error on all 25 components.

Second, the posterior standard deviation heatmap shows where the data is informative versus where it isn’t. Cells with high observed counts have tighter posterior — the likelihood dominates the prior. Cells with low counts have looser posterior — the kernel prior is closer to the binding constraint.

10.4 Scaling beyond d25d \approx 25

The Girolami–Calderhead 2011 paper used d=64d = 64. Their MATLAB runtime was on the order of tens of minutes for ~5000 RMHMC samples. Modern Python implementations are slower per-FLOP than tuned MATLAB but the wall-clock for the same problem at d=64d = 64 on a modern laptop is roughly similar — minutes, not seconds.

What changes as dd grows. Cholesky cost grows as d3d^3. At d=64d = 64, factorization is ~80× more expensive than at d=25d = 25. The per-step cost is now Cholesky-dominated. Gradient evaluation and FP iteration overhead become almost free by comparison.

The structural-exploitation opportunities open up. G(x)=D(x)+ΛG(\boldsymbol{x}) = D(\boldsymbol{x}) + \Lambda has a low-rank update structure if we hold Λ\Lambda‘s Cholesky fixed: the Woodbury identity factors (D+Λ)1(D + \Lambda)^{-1} as Λ1Λ1(D1+Λ1)1Λ1\Lambda^{-1} - \Lambda^{-1}(D^{-1} + \Lambda^{-1})^{-1}\Lambda^{-1}, and when DD is diagonal, the inner inverse is also diagonal. This drops the per-iteration Cholesky cost from O(d3)O(d^3) to O(d2)O(d^2). The half-triangle trick of §8.3 also pays off at d=64d = 64.

For dd much larger than 100\sim 100, the practitioner should consider alternatives. INLA approximates the posterior using nested Laplace approximations and runs in seconds even at d=1000d = 1000. Variational methods scale similarly. RMHMC stays competitive when full posterior samples are required and when the spatial dimension is moderate (up to a few hundred). Beyond that, the dimension is the practitioner’s enemy and structure-aware approximations win.

acc=0.997, div=0, wall=1.19s, ‖mean − x_true‖=2.17, mean(std)=0.565
Figure 10. Log-Gaussian Cox process on a 5×5 grid (d = 25). Top row: synthetic ground truth xtrue and the Poisson counts y it generated. Bottom row: RMHMC posterior mean (same color scale as xtrue, so cells should match visually) and pointwise posterior standard deviation. Shrinking ρ tightens the kernel prior and lets x track local data more aggressively; growing it smooths across cells.
Four-panel heatmap showing LCP true intensity, observed counts, RMHMC posterior mean, and pointwise posterior standard deviation on a 5x5 grid.
Figure 10 (static fallback). LCP posterior recovery at d = 25 from the notebook. Posterior mean tracks the ground truth; std is small where data is informative.

§11. Computational notes

What the topic looks like in production code rather than in pedagogical exposition. The patterns below are the ones our reference implementation in §§5–6 follows.

11.1 Buffer hoisting and the inner-loop discipline

Per-leapfrog cost on the banana (d=2d = 2) is microseconds; on the LCP (d=25d = 25) it’s a few hundred microseconds; in production at d100d \approx 100 it’s milliseconds. Across all three regimes, the largest avoidable cost is memory allocation in the inner loop. The discipline: allocate every scratch buffer once before the trajectory loop, reuse with in-place writes. For the generalized leapfrog of §5: the position scratch and momentum scratch allocated as fixed-length arrays before the outer sample loop; the Cholesky factor and metric inverse allocated once before the FP iteration loop.

11.2 The implicit-leapfrog inner solver: fixed-point vs Newton

The fixed-point iteration of §5.5 converges linearly at rate ρϵLK\rho \approx \epsilon \cdot L_K, where LKL_K is the local Lipschitz constant of the relevant kinetic-energy gradient. At well-tuned ϵ\epsilon, the iteration count is 3–6 to reach a 10610^{-6} tolerance, and the FP iteration is the right choice: simple to implement, no Jacobian assembly, robust to model structure.

Newton’s method is the standard alternative — quadratic convergence but per-iteration cost includes a Jacobian eval and a linear solve. The cost trade: for d=2d = 2 Newton’s per-iteration cost is roughly 2×2 \times FP’s. For larger dd where the Jacobian is a d×dd \times d system, Newton’s per-iteration cost is O(d3)O(d^3) from the solve — comparable to the per-leapfrog Cholesky. Newton wins when FP would take more than 2×\sim 2 \times as many iterations.

Practical recommendation: default to fixed-point. Monitor iteration counts during warmup. If counts exceed 10 at the chain’s typical (θ,p)(\theta, p) states, the right fix is to reduce ϵ\epsilon rather than switch solvers.

11.3 Cholesky idioms and SPD-stable solves

Three SciPy patterns recur:

from scipy.linalg import cho_factor, cho_solve

c, low = cho_factor(G, lower=True)
y = cho_solve((c, low), b)                    # solve G y = b
log_det_G = 2.0 * np.sum(np.log(np.diag(c)))  # log det G from L

The cho_factor path is roughly 2×2 \times faster than np.linalg.solve(G, b) for SPD systems because the latter uses LU decomposition, which doesn’t exploit symmetry. The log-determinant from LL is stable and exact; the alternative np.log(np.linalg.det(G)) overflows for moderate dd and loses precision for ill-conditioned matrices.

SPD stability and jitter. Numerical SPD-ness can fail at borderline-PSD matrices — the kernel covariance KK in §10 needed a 10610^{-6} jitter on the diagonal to be reliably factorable. When SPD fails, the metric is on the boundary of validity, and the fix is to add a small diagonal regularizer (SoftAbs in §12 is the principled version).

11.4 Multi-chain diagnostics via ArviZ

The diagnostics that matter for RMHMC are the same ones that matter for any HMC variant: R^\hat R for convergence, bulk and tail ESS for mixing, divergences for integrator health. ArviZ is the project-standard library.

import arviz as az

idata = az.from_dict(posterior={"theta_1": samples[..., 0], "theta_2": samples[..., 1]})
summary = az.summary(idata, kind="all", round_to=4)
az.plot_trace(idata)

R^<1.01\hat R < 1.01 is the standard convergence threshold. Bulk ESS should be at least 400 per parameter for reliable point estimates; tail ESS at least 100 for tail quantiles. Divergences should be 0 for well-tuned chains; up to a few percent acceptable if spatially localized.

RMHMC-specific additions. Track and plot mean fixed-point iteration counts per leapfrog step, per chain. Counts that drift over the chain indicate ϵ\epsilon that’s not adapted to the local geometry. Divergent transitions in RMHMC have the same diagnostic role as in standard NUTS — they signal ϵ\epsilon too large for the local curvature — but with the additional signal that the FP iteration cap was reached.


§12. Connections and limits

The Girolami–Calderhead 2011 program — Riemannian-manifold MCMC built on the Fisher information metric, implemented via implicit symplectic integration — opened a research thread that’s now run for over a decade and a half. This closing section surveys what RMHMC connects to, where the framework has been generalized, and what’s still open.

12.1 RMHMC and position-dependent SDEs

The cleanest framework that unifies RMHMC with its SDE cousin RSGLD is the complete-recipe construction of Ma, Chen, and Fox (2015, NeurIPS). They show that every π(z)\pi(z)-invariant Itô SDE on an extended state space z=(θ,p)z = (\theta, p) can be written as

dz  =  [(D(z)+Q(z))logπ(z)+Γ(z)]dt  +  2D(z)dWt,dz \;=\; \bigl[\,(D(z) + Q(z))\,\nabla \log \pi(z) + \Gamma(z)\bigr] dt \;+\; \sqrt{2\, D(z)}\, dW_t,

where D(z)D(z) is symmetric positive semidefinite, Q(z)Q(z) is skew-symmetric, and Γ(z)\Gamma(z) is a correction term derived from D+QD + Q. Every gradient-informed MCMC algorithm in the literature corresponds to a specific choice of (D,Q)(D, Q).

Standard SGLD: D=ID = I, Q=0Q = 0. Standard underdamped Langevin: DD is friction in the momentum block, QQ is the symplectic structure. Standard HMC: D=0D = 0, QQ symplectic. RSGLD (the stochastic-gradient-mcmc §10 algorithm): D=G(θ)1D = G(\theta)^{-1} in the position block, Q=0Q = 0. RMHMC: D=0D = 0, QQ symplectic, with the Riemannian Hamiltonian H=U+V+12pG1pH = U + V + \tfrac{1}{2} p^\top G^{-1} p inside logπ\nabla \log \pi.

The framework matters because it puts RMHMC in a continuous spectrum of related samplers. The “right” sampler for a given target is often a hybrid — say, RMHMC with a small amount of momentum-block friction for additional decorrelation, or RSGLD with a small Hamiltonian-style curl term for faster mixing. The complete recipe gives a principled way to construct and validate these hybrids.

12.2 Natural gradient as the first-order analog

The Fisher information metric was already a working tool in machine learning before RMHMC. Natural gradient descent (Amari 1998) replaces the Euclidean gradient L(θ)\nabla L(\theta) in optimization with the natural gradient G(θ)1L(θ)G(\theta)^{-1} \nabla L(\theta), on the argument that the natural gradient is invariant under reparameterization and respects the local geometry of the parameter manifold. Natural gradient appears throughout modern deep learning: K-FAC and its variants (Martens 2014, Martens and Grosse 2015) in supervised learning; trust-region policy optimization (TRPO; Schulman et al. 2015) in reinforcement learning; conjugate-computation variational inference (CVI; Khan and Lin 2017) and the broader natural-gradient VI line of work.

The connection to RMHMC. Both methods use the same Fisher matrix G(θ)G(\theta), but for different purposes: natural gradient uses it to precondition the search direction in optimization; RMHMC uses it to precondition the kinetic energy in sampling. Continuous-time RMHMC in the overdamped limit (large friction, momentum integrated out) reduces to a preconditioned Langevin dynamics whose drift term is exactly G(θ)1U(θ)-G(\theta)^{-1} \nabla U(\theta) — the natural gradient. So natural gradient is RMHMC’s first-order analog.

The practical consequence is straightforward. Models with reliable natural-gradient implementations — Bayesian deep learning with K-FAC variants, RL with TRPO-style natural policy gradient, NGVI for variational families — have the metric tensor already implemented; promoting it from a preconditioner to an RMHMC mass matrix is mechanical.

12.3 Beyond the Fisher metric

The Fisher metric is canonical (per Čencov, §3.4), but it isn’t always available or convenient. Several alternatives have been developed.

SoftAbs (Betancourt 2013). When the Fisher metric is ill-defined — for non-likelihood targets, or for likelihoods where Ey[θ2logp]\mathbb{E}_y[-\nabla^2_\theta \log p] has no closed form — the natural fallback is the observed Hessian θ2logπ(θ)-\nabla^2_\theta \log \pi(\theta). The problem: the observed Hessian can have negative eigenvalues at points where the log-target isn’t locally concave, making it unusable as an SPD metric. SoftAbs solves this by mapping the eigenvalues through a smooth absolute-value function. SoftAbs is the standard fallback metric in Stan’s experimental RMHMC implementations.

Outer-product-of-gradients (OPG) metric. The empirical Fisher GOPG(θ)=1Nn[logp(ynθ)][logp(ynθ)]G_{\text{OPG}}(\theta) = \tfrac{1}{N} \sum_n [\nabla \log p(y_n \mid \theta)][\nabla \log p(y_n \mid \theta)]^\top is the data-driven analog of the population Fisher. It’s always SPD and computable from likelihood gradients alone — no Hessian.

Magnetic HMC (Tripuraneni, Rowland, Ghahramani, and Turner 2017). Adds a non-conservative “magnetic” force term to the Hamiltonian, enabling rotational dynamics that can escape local modes and traverse pathological geometry differently from RMHMC. Magnetic HMC is sometimes used as a complement to RMHMC rather than a replacement.

Riemannian NUTS. Betancourt (2013) and subsequent implementations adapt the NUTS adaptive-trajectory-length criterion to the Riemannian setting using metric-aware U-turn conditions (cf. §7.2). The resulting sampler combines RMHMC’s position-aware metric with NUTS’s automatic LL-selection, and is the production-ready combination of the two threads.

12.4 Open frontiers

What’s not yet shipped technology.

High-dimensional efficiency. Per-step Cholesky of G(θ)G(\theta) at O(d3)O(d^3) limits RMHMC to d100d \lesssim 100 for routine use. Beyond that, structure-aware variants — low-rank metric updates via Woodbury (§10.4), block-diagonal metrics for hierarchical models, GPU-accelerated Cholesky for dense GG at d500d \approx 500 — extend the reach but remain bespoke. A general-purpose RMHMC at d>1000d > 1000 is an open challenge.

Automatic metric selection. Choosing between Fisher, SoftAbs, observed Hessian, and OPG metrics is currently done by trial and error. Adaptive metric selection — picking the right metric at each θ\theta, possibly switching mid-chain — is an active research direction with promising preliminary results but no general framework.

Cost-quality Pareto for Bayesian computation. RMHMC sits on a Pareto frontier alongside NUTS, mean-field variational inference, normalizing-flow VI, INLA, sequential Monte Carlo (coming soon), and various optimization-based approximations. Knowing where on that frontier to place a given problem is the operational research question that determines what’s worth running. Empirical Pareto benchmarks across these methods exist for specific application domains but not as a general taxonomy.

Differentiable programming and autodiff for RMHMC. Modern PPL backends (PyTorch, JAX, PyMC’s PyTensor) can autodiff arbitrary log-target functions. This makes the observed Hessian and SoftAbs metric available for any model the PPL can express, without hand-derived Fisher matrices. The next-generation RMHMC implementations build on this — the implementation cost drops from “weeks of model-specific Fisher derivation” to “one function definition.” Early indications from Betancourt’s experimental Stan branch suggest a substantial shift toward RMHMC becoming the default for posteriors that NUTS handles poorly.

Stochastic-gradient extensions. The stochastic-gradient-mcmc topic developed RSGLD as the Langevin counterpart of RMHMC; full Riemannian SG-HMC is an active research thread, on the boundary between “well-developed theory, partial implementations” and “actively being worked out.”

Connections

  • Variational families and RMHMC's Fisher-metric trajectories solve the same problem — posterior approximation under target geometry — by opposite means: one parametric (a variational family is fit by gradient descent), one nonparametric MCMC. §12.2's natural-gradient view explains how the two methods share the Fisher metric as their underlying preconditioner. variational-inference
  • SMC adaptively schedules an annealing path while RMHMC adaptively shapes the kinetic energy; both are responses to multi-modal or anisotropic targets. SMC is the alternative when the target is a sequence π_t indexed by time; RMHMC is the alternative when the target is a single hard posterior. sequential-monte-carlo
  • RJMCMC handles dimension-changing moves; RMHMC handles within-dimension curvature. The two are routinely combined in trans-dimensional hierarchical models, with RMHMC running the within-model dynamics. reversible-jump-mcmc
  • PPLs (Stan, PyMC, NumPyro) are how practitioners get RMHMC into production. Stan's experimental SoftAbs RMHMC is the canonical implementation; PyMC supports it via blackjax. The PPL gives you the joint log-density and gradient automatically; RMHMC needs the Fisher metric on top. probabilistic-programming

References & Further Reading