
Information Geometry & Fisher Metric

The Fisher information metric, α-connections, and divergence functions on statistical manifolds

Overview & Motivation

The space of probability distributions is not just a set — it is a Riemannian manifold with a canonical metric. This observation goes back to Rao (1945) and was later systematized by Amari (1985); it places statistical inference squarely inside the Riemannian framework we built in Riemannian Geometry and Geodesics & Curvature.

Here is the key intuition. Consider two pairs of Gaussians:

  • \mathcal{N}(0, 1) and \mathcal{N}(0.01, 1) — the means differ by 0.01, the variance is 1.
  • \mathcal{N}(0, 0.01) and \mathcal{N}(0.01, 0.01) — the means differ by the same 0.01, but the variance is 0.01.

In Euclidean parameter space, these two pairs are the same distance apart: |\Delta\mu| = 0.01. But statistically, the second pair is far more distinguishable — a variance of 0.01 means the distributions are tightly concentrated, so a shift of 0.01 in the mean is enormous relative to the spread. The Fisher information metric captures this: the “true” distance between distributions depends on where you are in parameter space, exactly as a Riemannian metric varies from point to point on a manifold.
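To make this concrete, here is a quick numerical sketch using the leading-order Fisher-Rao distance for equal-variance Gaussians, |\Delta\mu|/\sigma (derived later in this topic); the helper name is ours:

```python
def local_fisher_distance(mu1, mu2, sigma):
    """Leading-order Fisher-Rao distance between equal-variance Gaussians: |dmu| / sigma."""
    return abs(mu1 - mu2) / sigma

# Pair 1: N(0, 1) vs N(0.01, 1) -- variance 1, so sigma = 1
d1 = local_fisher_distance(0.0, 0.01, 1.0)
# Pair 2: N(0, 0.01) vs N(0.01, 0.01) -- variance 0.01, so sigma = 0.1
d2 = local_fisher_distance(0.0, 0.01, 0.1)

print(d1, d2)  # 0.01 vs 0.1: ten times farther apart at equal Euclidean separation
```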

This topic brings together the three pillars of the Differential Geometry track. The parameter space of a statistical model is a smooth manifold. The Fisher information defines a Riemannian metric on this manifold. And the resulting geodesics and curvature have direct statistical meaning: geodesics are the paths of natural gradient descent, and curvature controls the precision of statistical estimation via the Cramér–Rao bound.

What We Cover

  1. Statistical manifolds — parametric families as smooth manifolds, identifiability, and Čencov’s uniqueness theorem
  2. The Fisher information metric — score functions, the Fisher matrix, and why it is a Riemannian metric
  3. Classical families — the Gaussian manifold as the Poincaré half-plane, Bernoulli geometry, and exponential families
  5. \alpha-connections — Amari’s one-parameter family, e/m-duality, and dually flat manifolds
  6. Divergence functions — KL divergence, \alpha-divergences, Bregman divergences, and the generalized Pythagorean theorem
  6. Geodesics — Fisher-Rao geodesics, the Mahalanobis distance, and hyperbolic geometry of variance
  7. The Cramér–Rao bound — curvature and estimation precision, efficient estimators
  8. Computational notes — symbolic Fisher metric, Christoffel symbols, geodesic solvers, natural gradient
  9. Information geometry in ML — natural gradient descent, variational inference, Adam, and optimal transport

Prerequisites

This topic builds directly on all three preceding topics in the Differential Geometry track:

  • Smooth Manifolds — charts, tangent spaces, and the differential structure that parameter spaces inherit
  • Riemannian Geometry — metric tensors, the Levi-Civita connection, parallel transport, and the machinery for measuring lengths and angles
  • Geodesics & Curvature — the geodesic equation, curvature tensors, and the Gauss–Bonnet theorem

We also draw on the Spectral Theorem for eigendecomposition of the Fisher matrix, and connect to PCA & Low-Rank Approximation through the lens of preconditioning.


Statistical Manifolds & Parametric Families

Definition 1 (Statistical Model).

A parametric statistical model is a family \mathcal{S} = \{p_\theta : \theta \in \Theta\} of probability distributions on a sample space \mathcal{X}, where:

  1. \Theta \subseteq \mathbb{R}^n is an open subset (the parameter space),
  2. The map \theta \mapsto p_\theta(x) is smooth (infinitely differentiable) for each x \in \mathcal{X},
  3. The map \theta \mapsto p_\theta is injective (identifiability): distinct parameters give distinct distributions.

Under these conditions, \mathcal{S} inherits the smooth manifold structure of \Theta. The dimension of the statistical manifold is n = \dim(\Theta).

The identifiability requirement (condition 3) ensures that the parameter space faithfully represents the set of distributions — there is no redundancy. Without identifiability, the Fisher metric degenerates: it becomes only positive semi-definite rather than positive definite, because some parameter directions produce no change in the distribution.

Examples. The Gaussian family \mathcal{N}(\mu, \sigma^2) has parameter space \Theta = \mathbb{R} \times \mathbb{R}_+, a 2-dimensional manifold. The Bernoulli family \text{Ber}(p) has \Theta = (0, 1), a 1-dimensional manifold. The exponential family \text{Exp}(\lambda) has \Theta = \mathbb{R}_+. The multinomial on k categories, \text{Mult}(p_1, \ldots, p_k) with \sum_i p_i = 1, has \Theta equal to the open (k-1)-simplex.

Each point \theta \in \Theta represents an entire probability distribution p_\theta. As we move through the parameter space, we trace out a path through the space of distributions. The tangent space at \theta consists of directions in which we can perturb the parameter — and, as we will see, these tangent vectors can be identified with score functions.

Statistical manifolds: the Gaussian family as curves in function space, the parameter space as a 2D manifold, and the identifiability requirement


The Fisher Information Metric

With the smooth manifold structure in place, we now equip the parameter space with a Riemannian metric. The construction proceeds in three steps: define the score function, take its covariance, and verify that the result is a valid Riemannian metric.

Definition 2 (Score Function).

The score function of a statistical model \{p_\theta\} is the gradient of the log-likelihood with respect to the parameters:

s_i(x; \theta) = \frac{\partial}{\partial \theta^i} \log p_\theta(x)

The score function s_i measures the sensitivity of the log-likelihood to changes in the i-th parameter.

Proposition 1 (Zero Mean of the Score).

For any statistical model satisfying the regularity conditions of Definition 1, the score has zero mean:

\mathbb{E}_\theta[s_i(x; \theta)] = 0 \quad \text{for all } i \text{ and } \theta

Proof.

We compute directly, using the fact that \int p_\theta(x)\, dx = 1:

\mathbb{E}_\theta[s_i] = \int \frac{\partial}{\partial \theta^i} \log p_\theta(x) \cdot p_\theta(x)\, dx = \int \frac{\frac{\partial}{\partial \theta^i} p_\theta(x)}{p_\theta(x)} \cdot p_\theta(x)\, dx = \int \frac{\partial}{\partial \theta^i} p_\theta(x)\, dx

Interchanging the derivative and integral (justified by the smoothness assumption):

= \frac{\partial}{\partial \theta^i} \int p_\theta(x)\, dx = \frac{\partial}{\partial \theta^i} 1 = 0 \qquad \square

Since the score has zero mean, its covariance matrix is simply \mathbb{E}[s_i\, s_j]. This covariance is the Fisher information matrix.

Definition 3 (Fisher Information Matrix).

The Fisher information matrix of a statistical model \{p_\theta\} is the n \times n matrix

g_{ij}(\theta) = \mathbb{E}_\theta\bigl[s_i(x; \theta)\, s_j(x; \theta)\bigr] = \int \frac{\partial \log p_\theta}{\partial \theta^i} \frac{\partial \log p_\theta}{\partial \theta^j}\, p_\theta(x)\, dx

Equivalently, under the same regularity conditions:

g_{ij}(\theta) = -\mathbb{E}_\theta\!\left[\frac{\partial^2 \log p_\theta(x)}{\partial \theta^i\, \partial \theta^j}\right]

The equivalence of the two forms is a standard computation: differentiate the zero-mean identity \mathbb{E}_\theta[s_i] = 0 with respect to \theta^j and use the product rule.
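As a sanity check of the equivalence, a small finite-difference computation for the Bernoulli model (where the expectation is a two-term sum) compares \mathbb{E}[s^2] against -\mathbb{E}[\partial^2 \log p_\theta]; the helper functions are ours:

```python
import math

def fisher_two_forms(p, eps=1e-6):
    """Compare E[s^2] with -E[d^2 log p / dp^2] for Ber(p), via central differences."""
    def log_lik(x, q):
        return x * math.log(q) + (1 - x) * math.log(1 - q)

    def score(x, q):
        return (log_lik(x, q + eps) - log_lik(x, q - eps)) / (2 * eps)

    def second_deriv(x, q):
        return (log_lik(x, q + eps) - 2 * log_lik(x, q) + log_lik(x, q - eps)) / eps**2

    pmf = {0: 1 - p, 1: p}
    e_s2 = sum(w * score(x, p)**2 for x, w in pmf.items())
    e_hess = -sum(w * second_deriv(x, p) for x, w in pmf.items())
    return e_s2, e_hess

a, b = fisher_two_forms(0.3)
print(a, b)  # the two forms agree (both about 4.76 at p = 0.3)
```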

Theorem 1 (Fisher Information is a Riemannian Metric).

Under the identifiability condition (Definition 1), the Fisher information matrix g_{ij}(\theta) satisfies:

  1. Symmetry: g_{ij} = g_{ji} (by definition, since s_i s_j = s_j s_i).
  2. Positive semi-definiteness: For any vector v \in \mathbb{R}^n, \sum_{i,j} g_{ij} v^i v^j = \mathbb{E}\!\left[\left(\sum_i v^i s_i\right)^{\!2}\right] \geq 0
  3. Positive definiteness: If \sum_{i,j} g_{ij} v^i v^j = 0, then \sum_i v^i s_i(x; \theta) = 0 for p_\theta-almost all x, which means \sum_i v^i \frac{\partial}{\partial \theta^i} \log p_\theta(x) = 0 a.s. By identifiability, this forces v = 0.
  4. Smoothness: g_{ij}(\theta) is smooth in \theta because p_\theta is smooth.

Hence (\Theta, g) is a Riemannian manifold.

Proof.

Properties (1) and (2) are immediate from the definition. For (3), suppose \sum g_{ij} v^i v^j = 0. Then \mathbb{E}[(\sum_i v^i s_i)^2] = 0, so \sum_i v^i s_i(x; \theta) = 0 for p_\theta-a.e. x. This means

\sum_i v^i \frac{\partial}{\partial \theta^i} \log p_\theta(x) = 0 \quad \text{a.e.}

so the first-order change of p_\theta in the direction v vanishes: p_{\theta + tv}(x) = p_\theta(x) + o(t) for a.e. x. For a regular model, identifiability rules out a direction in which the family is stationary to first order, so v = 0. Property (4) follows from the smoothness of \theta \mapsto p_\theta and the dominated convergence theorem for the expectation integral. \square

Theorem 2 (Čencov's Uniqueness Theorem).

The Fisher information metric is the unique Riemannian metric on the space of probability distributions, up to a positive constant factor, that is invariant under sufficient statistics (Markov embeddings).

Remark (Significance of Čencov's Theorem).

We state Čencov’s theorem without proof (the full proof requires the theory of Markov kernels — see Čencov 1982). The significance is profound: the Fisher metric is not one choice among many possible Riemannian metrics on statistical manifolds. It is the canonical choice, determined uniquely by the natural invariance requirement that statistical geometry should not depend on the particular representation of the data.


The Fisher information metric: score function, Fisher matrix formula, and metric ellipses on the Gaussian parameter space


Fisher Metric for Classical Families

We now compute the Fisher metric explicitly for three canonical parametric families. Each reveals different geometric structure.

Example 1 (Gaussian Family).

For the Gaussian family \mathcal{N}(\mu, \sigma^2) with parameters \theta = (\mu, \sigma):

\log p(x; \mu, \sigma) = -\log \sigma - \frac{(x - \mu)^2}{2\sigma^2} - \frac{1}{2}\log(2\pi)

The score functions are:

s_\mu = \frac{x - \mu}{\sigma^2}, \qquad s_\sigma = -\frac{1}{\sigma} + \frac{(x - \mu)^2}{\sigma^3}

Computing the expectations g_{ij} = \mathbb{E}[s_i s_j]:

  • g_{11} = \mathbb{E}[s_\mu^2] = \mathbb{E}\!\left[\frac{(x-\mu)^2}{\sigma^4}\right] = \frac{1}{\sigma^2}
  • g_{22} = \mathbb{E}[s_\sigma^2] = \mathbb{E}\!\left[\frac{1}{\sigma^2} - \frac{2(x-\mu)^2}{\sigma^4} + \frac{(x-\mu)^4}{\sigma^6}\right] = \frac{2}{\sigma^2}
  • g_{12} = \mathbb{E}[s_\mu\, s_\sigma] = 0 (the integrand is odd in x - \mu, so the expectation vanishes)

The Fisher metric is:

g = \frac{1}{\sigma^2}\begin{pmatrix} 1 & 0 \\ 0 & 2 \end{pmatrix}

The Riemannian line element is ds^2 = \frac{1}{\sigma^2}(d\mu^2 + 2\, d\sigma^2). Up to the constant factor of 2 in the d\sigma^2 term, this is the Poincaré upper half-plane metric on the half-plane \{(\mu, \sigma) : \sigma > 0\}.

Proposition 2 (Gaussian Curvature of the Gaussian Manifold).

The Gaussian family (\Theta, g) with the Fisher metric has constant negative sectional curvature K = -\frac{1}{2}.

This is computed by applying the Riemann curvature tensor formula from Geodesics & Curvature to the Fisher metric g = \mathrm{diag}(1/\sigma^2, 2/\sigma^2). The Gaussian manifold is a surface of constant negative curvature — a hyperbolic space. This means that the space of Gaussian distributions, equipped with the Fisher metric, has the same local geometry as the Poincaré half-plane.

Example 2 (Bernoulli Family).

For the Bernoulli family \text{Ber}(p) with parameter \theta = p \in (0, 1):

\log p(x; p) = x \log p + (1 - x)\log(1 - p)

The score is s_p = \frac{x}{p} - \frac{1-x}{1-p}, and the Fisher information is:

g(p) = \mathbb{E}[s_p^2] = \frac{1}{p(1-p)}

This diverges as p \to 0 or p \to 1: near the boundary, tiny changes in p produce highly distinguishable distributions. The length element ds = dp/\sqrt{p(1-p)} nevertheless remains integrable: the substitution p = \sin^2 t gives ds = 2\, dt, so the whole manifold has finite Fisher-Rao length \int_0^1 dp/\sqrt{p(1-p)} = \pi. The Bernoulli manifold is isometric to an open arc of length \pi: the information density blows up at the boundary, while distances stay bounded.
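The closed form is easy to check: integrating ds = dp/\sqrt{p(1-p)} gives the Fisher-Rao distance d(p_1, p_2) = 2\,|\arcsin\sqrt{p_1} - \arcsin\sqrt{p_2}| (a short sketch; the function name is ours):

```python
import math

def fisher_rao_bernoulli(p1, p2):
    """Fisher-Rao distance on the Bernoulli manifold via the p = sin^2(t) substitution."""
    return 2.0 * abs(math.asin(math.sqrt(p1)) - math.asin(math.sqrt(p2)))

print(fisher_rao_bernoulli(0.5, 0.9))  # about 0.9273
print(fisher_rao_bernoulli(0.0, 1.0))  # pi: the whole manifold has finite length
```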

Theorem 3 (Fisher Metric for Exponential Families).

For an exponential family in natural parameters,

p(x; \eta) = h(x) \exp\!\bigl(\eta^\top T(x) - A(\eta)\bigr)

the Fisher metric is the Hessian of the log-partition function:

g_{ij}(\eta) = \frac{\partial^2 A}{\partial \eta^i\, \partial \eta^j} = \mathrm{Cov}_\eta(T_i, T_j)

Proof.

The score function in natural parameters is:

s_i(x; \eta) = \frac{\partial}{\partial \eta^i}\!\left[\eta^\top T(x) - A(\eta)\right] = T_i(x) - \frac{\partial A}{\partial \eta^i}

Since \mathbb{E}[T_i] = \frac{\partial A}{\partial \eta^i} (the mean parameters are the gradient of the log-partition function), the score has the form s_i = T_i - \mathbb{E}[T_i]. Therefore:

g_{ij} = \mathbb{E}[s_i\, s_j] = \mathbb{E}\bigl[(T_i - \mathbb{E}[T_i])(T_j - \mathbb{E}[T_j])\bigr] = \mathrm{Cov}(T_i, T_j)

For the Hessian form, differentiate \mathbb{E}[T_i] = \frac{\partial A}{\partial \eta^i} again:

g_{ij} = \mathrm{Cov}(T_i, T_j) = \frac{\partial}{\partial \eta^j}\, \mathbb{E}[T_i] = \frac{\partial^2 A}{\partial \eta^i\, \partial \eta^j} \qquad \square

This is a striking result: for exponential families, the Fisher metric is simply the Hessian of a single scalar function A(\eta). The strict convexity of A(\eta) (a standard property of log-partition functions for minimal families) guarantees positive definiteness — the Fisher metric is automatically a valid Riemannian metric.
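A minimal check of Theorem 3 for the Poisson family (natural parameter \eta = \log\lambda, sufficient statistic T(x) = x, log-partition A(\eta) = e^\eta), where both sides should equal \mathrm{Var}(X) = \lambda; the setup is ours:

```python
import math

lam = 3.5
eta = math.log(lam)

def pmf(x):
    """Poisson(lambda) probability mass function."""
    return math.exp(-lam) * lam**x / math.factorial(x)

# Cov(T, T) = Var(X), summed over (effectively) the full support
support = range(60)
mean = sum(x * pmf(x) for x in support)
var = sum((x - mean)**2 * pmf(x) for x in support)

hessian_A = math.exp(eta)  # A''(eta) = e^eta = lambda
print(var, hessian_A)  # both 3.5, as the theorem predicts
```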

Fisher metric for classical families: Gaussian = Poincaré half-plane, Bernoulli Fisher information, exponential family Hessian


α-Connections and Dual Geometry

The Levi-Civita connection from Riemannian Geometry is the unique torsion-free, metric-compatible connection. Amari’s key insight (1985) is that statistical manifolds carry not one but a one-parameter family of connections — the \alpha-connections — and the interplay between them reveals the deepest geometric structure.

Definition 4 (α-Connection).

For \alpha \in \mathbb{R}, the \alpha-connection \nabla^{(\alpha)} on a statistical manifold has Christoffel symbols:

\Gamma^{(\alpha)}_{ij,k} = \mathbb{E}\!\left[\bigl(\partial_i \partial_j \log p_\theta\bigr)\, \partial_k \log p_\theta\right] + \frac{1 - \alpha}{2}\,\mathbb{E}\!\left[\partial_i \log p_\theta \cdot \partial_j \log p_\theta \cdot \partial_k \log p_\theta\right]

where \partial_i = \frac{\partial}{\partial \theta^i}.

Special cases:

  • \alpha = 0: the Levi-Civita connection of the Fisher metric (the Riemannian geometry default)
  • \alpha = 1: the e-connection (exponential connection)
  • \alpha = -1: the m-connection (mixture connection)

The \alpha-connections differ from the Levi-Civita connection by a cubic tensor (the skewness tensor or Amari–Chentsov tensor) whose contribution vanishes when \alpha = 0. The crucial property is duality.

Theorem 4 (Duality of α-Connections).

The \alpha-connection and the (-\alpha)-connection are dual with respect to the Fisher metric:

X\, g(Y, Z) = g\bigl(\nabla^{(\alpha)}_X Y,\, Z\bigr) + g\bigl(Y,\, \nabla^{(-\alpha)}_X Z\bigr)

for all vector fields X, Y, Z on the statistical manifold.

Proof.

Write X = \partial_k, Y = \partial_i, Z = \partial_j in local coordinates. The left side is \partial_k g_{ij}. The right side is:

g_{lj}\,\Gamma^{(\alpha)\,l}_{ki} + g_{il}\,\Gamma^{(-\alpha)\,l}_{kj}

Using the definition of the \alpha-Christoffel symbols and the symmetry of the Fisher metric, the \alpha-dependent third-moment terms from \Gamma^{(\alpha)} and \Gamma^{(-\alpha)} carry opposite signs (the \frac{1-\alpha}{2} factor becomes \frac{1+\alpha}{2} upon negating \alpha) and cancel in the sum. What remains is exactly the Levi-Civita compatibility equation \partial_k g_{ij} = g_{lj}\,\Gamma^{(0)\,l}_{ki} + g_{il}\,\Gamma^{(0)\,l}_{kj}, which holds because \nabla^{(0)} is metric-compatible. \square

The most important instance of duality is between the e-connection (\alpha = 1) and the m-connection (\alpha = -1). This duality underlies the entire structure of exponential families.

Definition 5 (Dually Flat Manifold).

A statistical manifold is dually flat if there exist coordinate systems \theta (natural parameters) and \eta (expectation parameters) such that:

  • The e-connection (\alpha = 1) is flat in \theta-coordinates: all Christoffel symbols \Gamma^{(1)\,k}_{ij} = 0.
  • The m-connection (\alpha = -1) is flat in \eta-coordinates: all Christoffel symbols \Gamma^{(-1)\,k}_{ij} = 0.
  • The Legendre transform links the two coordinate systems:

\eta_i = \frac{\partial A}{\partial \theta^i}, \qquad \theta^i = \frac{\partial A^*}{\partial \eta_i}

where A(\theta) is the log-partition function and A^*(\eta) = \sup_\theta\{\theta \cdot \eta - A(\theta)\} is its convex conjugate.

Theorem 5 (Exponential Families are Dually Flat).

Every exponential family is a dually flat manifold: the natural parameters are e-affine coordinates and the expectation parameters \eta = \mathbb{E}[T(x)] are m-affine coordinates.

For the Gaussian family \mathcal{N}(\mu, \sigma^2), the natural parameters are \theta^1 = \mu/\sigma^2 and \theta^2 = -1/(2\sigma^2), and the expectation parameters are \eta_1 = \mathbb{E}[X] = \mu and \eta_2 = \mathbb{E}[X^2] = \mu^2 + \sigma^2. Straight lines in \theta-coordinates are e-geodesics; straight lines in \eta-coordinates are m-geodesics. These are generically different curves.
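A small sketch of how the two geodesics differ, converting between (\mu, \sigma), natural, and expectation coordinates for the Gaussian family (helper names are ours). Averaging two unit-variance Gaussians in natural coordinates keeps \sigma = 1, while averaging in expectation coordinates inflates the variance to absorb the spread of the means:

```python
import math

def to_natural(mu, sigma):
    return (mu / sigma**2, -1.0 / (2.0 * sigma**2))

def from_natural(t1, t2):
    sigma2 = -1.0 / (2.0 * t2)
    return (t1 * sigma2, math.sqrt(sigma2))

def to_expectation(mu, sigma):
    return (mu, mu**2 + sigma**2)

def from_expectation(e1, e2):
    return (e1, math.sqrt(e2 - e1**2))

a, b = (0.0, 1.0), (4.0, 1.0)  # (mu, sigma) for two unit-variance Gaussians

# e-geodesic midpoint: average in natural coordinates
e_mid = from_natural(*[(u + v) / 2 for u, v in zip(to_natural(*a), to_natural(*b))])
# m-geodesic midpoint: average in expectation coordinates
m_mid = from_expectation(*[(u + v) / 2 for u, v in zip(to_expectation(*a), to_expectation(*b))])

print(e_mid)  # (2.0, 1.0): the e-geodesic midpoint keeps sigma = 1
print(m_mid)  # (2.0, 2.236...): the m-geodesic midpoint has sigma = sqrt(5)
```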


Dual geometry: α-connection geodesics, dually flat structure, θ- vs η-coordinate grids


Divergence Functions

Divergence functions measure the “distance” between probability distributions, but they are not true distances — they violate symmetry and/or the triangle inequality. Nevertheless, they encode the geometry of statistical manifolds and are fundamental to inference and learning.

Definition 6 (KL Divergence).

The Kullback–Leibler divergence from p to q is:

D_{\mathrm{KL}}(p \,\|\, q) = \int p(x) \log \frac{p(x)}{q(x)}\, dx

Properties:

  1. D_{\mathrm{KL}}(p \,\|\, q) \geq 0 (Gibbs’ inequality), with equality iff p = q.
  2. D_{\mathrm{KL}}(p \,\|\, q) \neq D_{\mathrm{KL}}(q \,\|\, p) in general — KL divergence is not symmetric.
  3. KL divergence does not satisfy the triangle inequality.

Despite not being a distance, KL divergence has a deep connection to the Fisher metric.

Proposition 3 (Fisher Metric as Hessian of KL Divergence).

The Fisher information matrix is the Hessian of the KL divergence:

g_{ij}(\theta) = \frac{\partial^2}{\partial \theta'^i\, \partial \theta'^j}\, D_{\mathrm{KL}}(p_\theta \,\|\, p_{\theta'}) \Big|_{\theta' = \theta}

Proof.

Taylor-expand D_{\mathrm{KL}}(p_\theta \,\|\, p_{\theta'}) around \theta' = \theta. At \theta' = \theta, the divergence is zero. The first-order term vanishes:

\frac{\partial}{\partial \theta'^i} D_{\mathrm{KL}}(p_\theta \,\|\, p_{\theta'}) \Big|_{\theta'=\theta} = -\int p_\theta(x)\, \frac{\partial_i p_\theta(x)}{p_\theta(x)}\, dx = -\frac{\partial}{\partial \theta^i} \int p_\theta\, dx = 0

For the second-order term:

\frac{\partial^2}{\partial \theta'^i\, \partial \theta'^j} D_{\mathrm{KL}}(p_\theta \,\|\, p_{\theta'}) \Big|_{\theta'=\theta} = -\int p_\theta\, \frac{\partial^2 \log p_{\theta'}}{\partial \theta'^i\, \partial \theta'^j}\Big|_{\theta'=\theta}\, dx = -\mathbb{E}_\theta\!\left[\frac{\partial^2 \log p_\theta}{\partial \theta^i\, \partial \theta^j}\right] = g_{ij}(\theta)

Thus D_{\mathrm{KL}}(p_\theta \,\|\, p_{\theta+\delta}) \approx \frac{1}{2}\, g_{ij}(\theta)\, \delta^i \delta^j for small \delta. The Fisher metric is the infinitesimal KL divergence. \square
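Proposition 3 is easy to verify numerically with the closed-form Gaussian KL divergence (the formula is standard; function names are ours):

```python
import math

def kl_gauss(mu1, s1, mu2, s2):
    """D_KL( N(mu1, s1^2) || N(mu2, s2^2) ), closed form."""
    return math.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5

mu, sigma = 1.0, 2.0
d_mu, d_sigma = 1e-3, -2e-3

kl = kl_gauss(mu, sigma, mu + d_mu, sigma + d_sigma)
quad = 0.5 * (d_mu**2 / sigma**2 + 2 * d_sigma**2 / sigma**2)  # (1/2) g_ij delta^i delta^j
print(kl, quad)  # agree up to the cubic remainder in delta
```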

Definition 7 (α-Divergence).

The \alpha-divergence is a one-parameter family interpolating between forward and reverse KL:

D_\alpha(p \,\|\, q) = \frac{4}{1 - \alpha^2}\left(1 - \int p(x)^{(1+\alpha)/2}\, q(x)^{(1-\alpha)/2}\, dx\right)

Special cases:

  • \alpha \to +1: D_{\mathrm{KL}}(p \,\|\, q) (forward KL)
  • \alpha \to -1: D_{\mathrm{KL}}(q \,\|\, p) (reverse KL)
  • \alpha = 0: D_0(p \,\|\, q) = 4\!\left(1 - \int \sqrt{p\, q}\, dx\right) = 2\!\int \bigl(\sqrt{p} - \sqrt{q}\bigr)^2 dx, twice the squared Hellinger distance (with the convention H^2 = \int (\sqrt{p} - \sqrt{q})^2\, dx)
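The limiting behavior can be checked by discretizing two Gaussians on a grid and evaluating D_\alpha for \alpha near 1, where it should approach the forward KL (a numerical sketch; names are ours):

```python
import numpy as np

def alpha_div(p, q, dx, alpha):
    """alpha-divergence of two densities sampled on a uniform grid with spacing dx."""
    integral = np.sum(p**((1 + alpha) / 2) * q**((1 - alpha) / 2)) * dx
    return 4.0 / (1.0 - alpha**2) * (1.0 - integral)

x = np.linspace(-10.0, 10.0, 20001)
dx = x[1] - x[0]
p = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)                           # N(0, 1)
q = np.exp(-(x - 1)**2 / (2 * 1.5**2)) / (1.5 * np.sqrt(2 * np.pi))  # N(1, 1.5^2)

kl_forward = np.sum(p * np.log(p / q)) * dx
print(alpha_div(p, q, dx, 0.999), kl_forward)  # nearly equal: D_alpha -> forward KL
```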

Definition 8 (Bregman Divergence).

For a strictly convex, differentiable function F : \mathbb{R}^n \to \mathbb{R}, the Bregman divergence is:

D_F(x \,\|\, y) = F(x) - F(y) - \langle \nabla F(y),\, x - y \rangle

For exponential families in natural parameters \theta, the KL divergence is a Bregman divergence with F = A (the log-partition function), with the arguments transposed:

D_{\mathrm{KL}}(p_\theta \,\|\, p_{\theta'}) = D_A(\theta' \,\|\, \theta)
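This identity is quick to check for the Bernoulli family, whose natural parameter is \eta = \log\frac{p}{1-p} with A(\eta) = \log(1 + e^\eta) (a sketch; helper names are ours):

```python
import math

def A(eta):
    """Log-partition function of the Bernoulli family in its natural parameter."""
    return math.log(1.0 + math.exp(eta))

def bregman_A(eta_prime, eta):
    grad_A = 1.0 / (1.0 + math.exp(-eta))  # A'(eta) = sigmoid(eta) = p
    return A(eta_prime) - A(eta) - grad_A * (eta_prime - eta)

def kl_bernoulli(p, q):
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

p, q = 0.3, 0.7
eta_p = math.log(p / (1 - p))
eta_q = math.log(q / (1 - q))
print(kl_bernoulli(p, q), bregman_A(eta_q, eta_p))  # equal, as the identity predicts
```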

The generalized Pythagorean theorem connects divergences to the dual geometry of the \alpha-connections.

Theorem 6 (Generalized Pythagorean Theorem).

On a dually flat manifold, let p, q, r be three distributions such that q lies on an m-flat submanifold containing p, and the e-geodesic from r to q meets that submanifold orthogonally (in the Fisher metric) at q. Then:

D_{\mathrm{KL}}(p \,\|\, r) = D_{\mathrm{KL}}(p \,\|\, q) + D_{\mathrm{KL}}(q \,\|\, r)

Proof.

In a dually flat manifold, the KL divergence has the canonical Bregman form D_{\mathrm{KL}}(p_\theta \,\|\, p_{\theta'}) = A(\theta') + A^*(\eta) - \theta' \cdot \eta, where \eta = \nabla A(\theta) is the expectation parameter of the first argument. Expanding the right side:

D_{\mathrm{KL}}(p \,\|\, q) + D_{\mathrm{KL}}(q \,\|\, r) = \bigl[A(\theta_q) + A^*(\eta_p) - \theta_q \cdot \eta_p\bigr] + \bigl[A(\theta_r) + A^*(\eta_q) - \theta_r \cdot \eta_q\bigr]

The orthogonality condition says (\theta_r - \theta_q) \cdot (\eta_p - \eta_q) = 0, i.e., \theta_r \cdot \eta_p - \theta_r \cdot \eta_q - \theta_q \cdot \eta_p + \theta_q \cdot \eta_q = 0. Using A(\theta_q) + A^*(\eta_q) = \theta_q \cdot \eta_q (the Legendre identity), the sum simplifies to:

A(\theta_r) + A^*(\eta_p) - \theta_r \cdot \eta_p = D_{\mathrm{KL}}(p \,\|\, r) \qquad \square

This theorem is the information-geometric foundation of variational inference: minimizing D_{\mathrm{KL}}(q \,\|\, p^*) over a variational family is a geodesic projection in the dual geometry, and the Pythagorean theorem guarantees that the optimal q decomposes the divergence into an “explained” part and an irreducible “approximation error.”


Divergence functions: KL asymmetry, α-divergence family, Pythagorean theorem diagram


Geodesics on Statistical Manifolds

The Fisher information metric defines geodesics on statistical manifolds via the geodesic equation from Geodesics & Curvature. These geodesics have concrete statistical interpretations and differ dramatically from Euclidean straight lines.

For the Gaussian family, the Fisher metric ds^2 = \frac{1}{\sigma^2}(d\mu^2 + 2\, d\sigma^2) is (up to a constant) the Poincaré upper half-plane metric. The geodesics are:

  • Vertical lines: \mu = \text{const}, \sigma varies — changing the variance while keeping the mean fixed.
  • Semicircles centered on the \mu-axis (exactly circular once \mu is rescaled by \sqrt{2}) — the shortest paths between distributions with different means and variances.

These are precisely the geodesics of hyperbolic geometry, consistent with the constant curvature K = -1/2.

Proposition 4 (Fisher-Rao Distance for Gaussians with Equal Variance).

For \mathcal{N}(\mu_1, \sigma^2) and \mathcal{N}(\mu_2, \sigma^2) (same variance), to leading order in |\mu_1 - \mu_2|/\sigma:

d_{\mathrm{FR}} \approx \frac{|\mu_1 - \mu_2|}{\sigma}

This is the Mahalanobis distance. (For widely separated means the geodesic bows out toward larger \sigma, so the exact distance is strictly smaller; the identity is exact in the infinitesimal limit.)

The Fisher-Rao distance naturally produces the Mahalanobis distance — the standard “number of standard deviations” between means. This is not a coincidence: the Fisher metric is the infinitesimal Mahalanobis metric.

Proposition 5 (Fisher-Rao Distance for Gaussians with Equal Means).

For \mathcal{N}(\mu, \sigma_1^2) and \mathcal{N}(\mu, \sigma_2^2) (same mean):

d_{\mathrm{FR}} = \sqrt{2}\,\bigl|\log(\sigma_1 / \sigma_2)\bigr|

Distances along the variance axis are logarithmic — the geometry is hyperbolic.

The logarithmic scaling means that doubling \sigma from 1 to 2 covers the same Fisher-Rao distance as doubling from 100 to 200. This is the natural scale for variance: what matters is the ratio, not the absolute difference.

Remark (Infinitesimal Relationship: KL and Fisher-Rao).

For infinitesimally close distributions p_\theta and p_{\theta + \delta}:

D_{\mathrm{KL}}(p_\theta \,\|\, p_{\theta+\delta}) \approx \frac{1}{2}\, d_{\mathrm{FR}}(\theta, \theta + \delta)^2

The KL divergence is half the squared Fisher-Rao distance at infinitesimal scale. But for distributions that are not close, the two quantities disagree: the Fisher-Rao distance is a true metric (symmetric, satisfies the triangle inequality), while the KL divergence is neither.


Fisher-Rao geodesics, Fisher-Rao distance vs KL divergence, hyperbolic distance for variance


The Cramér–Rao Bound

The Cramér–Rao bound connects the Fisher information metric to the fundamental limits of statistical estimation. It is the information-geometric version of the uncertainty principle: curvature (Fisher information) controls precision (estimator variance).

Theorem 7 (Cramér–Rao Lower Bound).

Let T(X) be an unbiased estimator of \theta^i (i.e., \mathbb{E}_\theta[T] = \theta^i). Then:

\mathrm{Var}_\theta(T) \geq [g^{-1}(\theta)]_{ii}

More generally, for the covariance matrix of any unbiased estimator \mathbf{T} of \theta:

\mathrm{Cov}_\theta(\mathbf{T}) \succeq g^{-1}(\theta)

where \succeq denotes the Loewner (positive semidefinite) ordering.

Proof.

We prove the scalar case (n = 1) using Cauchy-Schwarz in L^2(p_\theta), which carries precisely the inner product defined by the Fisher metric.

Since T is unbiased, \mathbb{E}[T] = \theta, and the score has zero mean, so:

\mathrm{Cov}(T, s) = \mathbb{E}[T s] - \mathbb{E}[T]\,\mathbb{E}[s] = \mathbb{E}[T s] - \theta \cdot 0 = \mathbb{E}[T s]

We compute \mathbb{E}[T s] by differentiating the unbiasedness condition:

\frac{d}{d\theta} \mathbb{E}_\theta[T] = \frac{d}{d\theta} \int T(x)\, p_\theta(x)\, dx = \int T(x)\, \frac{\partial p_\theta}{\partial \theta}\, dx = \int T(x)\, s(x;\theta)\, p_\theta(x)\, dx = \mathbb{E}[T s]

Since \frac{d}{d\theta}\theta = 1, we have \mathrm{Cov}(T, s) = 1.

Now apply Cauchy-Schwarz:

1 = |\mathrm{Cov}(T, s)|^2 \leq \mathrm{Var}(T) \cdot \mathrm{Var}(s) = \mathrm{Var}(T) \cdot g(\theta)

Therefore \mathrm{Var}(T) \geq 1/g(\theta) = [g^{-1}(\theta)]_{11}. \square

Definition 9 (Efficient Estimator).

An unbiased estimator T is efficient if it achieves the Cramér–Rao bound with equality:

\mathrm{Var}(T) = \frac{1}{g(\theta)}

Efficient estimators exist only when the score function is an affine function of T (the equality case of Cauchy-Schwarz). The maximum likelihood estimator (MLE) is asymptotically efficient: as the sample size n \to \infty, the MLE achieves the bound.
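A Monte Carlo sketch of efficiency for the Bernoulli model, where the MLE is the sample mean and its variance p(1-p)/n exactly matches the bound 1/(n\, g(p)) (the simulation setup is ours):

```python
import random

# For Ber(p), g(p) = 1/(p(1-p)); n i.i.d. samples carry information n * g(p).
random.seed(0)
p, n, trials = 0.3, 100, 20000

estimates = []
for _ in range(trials):
    estimates.append(sum(1 if random.random() < p else 0 for _ in range(n)) / n)

mean_hat = sum(estimates) / trials
var_hat = sum((e - mean_hat)**2 for e in estimates) / (trials - 1)
cr_bound = p * (1 - p) / n  # = 1 / (n g(p))
print(var_hat, cr_bound)  # both about 0.0021: the sample mean attains the bound
```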

Remark (Geometric Interpretation).

The Cramér–Rao bound has a clean geometric interpretation. The Fisher metric measures the “curvature” of the log-likelihood surface:

  • Large g(\theta) means the log-likelihood is sharply peaked — samples carry a lot of information about \theta, so estimation is precise (low variance).
  • Small g(\theta) means the log-likelihood is flat — samples are uninformative about \theta, so estimation is imprecise (high variance).

The bound \mathrm{Var}(T) \geq 1/g(\theta) says: no estimator, no matter how clever, can beat the information content of the data. This is the statistical analogue of the Heisenberg uncertainty principle — but with the Fisher information playing the role of Planck’s constant.

The Cramér–Rao bound: Fisher information and precision, the inequality diagram, MLE convergence


Computational Notes

The computations of information geometry can be automated symbolically and solved numerically.

Symbolic Fisher Metric Derivation

Using SymPy, we can derive the Fisher metric for the Gaussian family from scratch:

import sympy as sp

x, mu = sp.symbols('x mu', real=True)
sigma = sp.Symbol('sigma', positive=True)

# Gaussian density and log-likelihood
p = sp.exp(-(x - mu)**2 / (2 * sigma**2)) / (sigma * sp.sqrt(2 * sp.pi))
log_p = sp.log(p)

# Score functions
s_mu = sp.simplify(sp.diff(log_p, mu))        # (x - mu) / sigma^2
s_sigma = sp.simplify(sp.diff(log_p, sigma))  # -1/sigma + (x - mu)^2 / sigma^3

# Fisher matrix entries g_ij = E[s_i * s_j], computed as integrals against p
g_11 = sp.simplify(sp.integrate(s_mu**2 * p, (x, -sp.oo, sp.oo)))         # 1/sigma^2
g_22 = sp.simplify(sp.integrate(s_sigma**2 * p, (x, -sp.oo, sp.oo)))      # 2/sigma^2
g_12 = sp.simplify(sp.integrate(s_mu * s_sigma * p, (x, -sp.oo, sp.oo)))  # 0

print(f"Fisher metric: diag({g_11}, {g_22}), off-diagonal {g_12}")
# Fisher metric: diag(1/sigma^2, 2/sigma^2), off-diagonal 0

Christoffel Symbols for the Gaussian Manifold

With coordinates (\mu, \sigma) and metric g = \mathrm{diag}(1/\sigma^2, 2/\sigma^2):

# Christoffel symbols: Gamma^k_{ij} = (1/2) g^{kl}(d_i g_{jl} + d_j g_{il} - d_l g_{ij})
# For diagonal metric g = diag(f, h) where f = 1/sigma^2, h = 2/sigma^2:
# Nonzero symbols:
#   Gamma^sigma_{mu,mu}    = -f'/(2h)   = (2/sigma^3) / (2 * 2/sigma^2) = 1/(2*sigma)
#   Gamma^mu_{mu,sigma}    = f'/(2f)    = (-2/sigma^3) / (2/sigma^2) = -1/sigma
#   Gamma^sigma_{sigma,sigma} = h'/(2h) = (-4/sigma^3) / (2 * 2/sigma^2) = -1/sigma
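These values can be confirmed symbolically by applying the Christoffel formula directly to the Fisher metric with SymPy (a sketch; helper names are ours):

```python
import sympy as sp

mu = sp.Symbol('mu', real=True)
sigma = sp.Symbol('sigma', positive=True)
coords = [mu, sigma]

g = sp.diag(1 / sigma**2, 2 / sigma**2)  # Fisher metric of the Gaussian family
g_inv = g.inv()

def christoffel(k, i, j):
    """Gamma^k_{ij} = (1/2) g^{kl} (d_i g_{jl} + d_j g_{il} - d_l g_{ij})."""
    return sp.simplify(sum(
        sp.Rational(1, 2) * g_inv[k, l] * (
            sp.diff(g[j, l], coords[i])
            + sp.diff(g[i, l], coords[j])
            - sp.diff(g[i, j], coords[l])
        ) for l in range(2)
    ))

print(christoffel(1, 0, 0))  # Gamma^sigma_{mu,mu}       = 1/(2*sigma)
print(christoffel(0, 0, 1))  # Gamma^mu_{mu,sigma}       = -1/sigma
print(christoffel(1, 1, 1))  # Gamma^sigma_{sigma,sigma} = -1/sigma
```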

Numerical Geodesic Solver

The geodesic equation on the Gaussian manifold is the ODE system:

$$\ddot{\mu} + 2\,\Gamma^\mu_{\mu\sigma}\,\dot{\mu}\dot{\sigma} = 0, \qquad \ddot{\sigma} + \Gamma^\sigma_{\mu\mu}\,\dot{\mu}^2 + \Gamma^\sigma_{\sigma\sigma}\,\dot{\sigma}^2 = 0$$

We solve this with a 4th-order Runge-Kutta integrator:

import numpy as np

def geodesic_step_gaussian(state, dt):
    """Single RK4 step for the geodesic ODE; state = [mu, sigma, dmu, dsigma]."""

    def derivs(s):
        _, sig, dm, ds = s                        # metric is independent of mu
        ddmu = (2 / sig) * dm * ds                # -2 * Gamma^mu_{mu,sigma} * dmu * dsigma
        ddsig = -dm**2 / (2*sig) + ds**2 / sig    # -Gamma^sigma_{mu,mu} dmu^2 - Gamma^sigma_{sigma,sigma} dsigma^2
        return np.array([dm, ds, ddmu, ddsig])

    k1 = derivs(state)
    k2 = derivs(state + 0.5*dt*k1)
    k3 = derivs(state + 0.5*dt*k2)
    k4 = derivs(state + dt*k3)
    return state + (dt/6) * (k1 + 2*k2 + 2*k3 + k4)
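A quick property test: geodesics are constant-speed curves, so the squared Fisher norm of the velocity, $(\dot\mu^2 + 2\dot\sigma^2)/\sigma^2$, should be conserved along the integrated trajectory. A self-contained sketch (restating the ODE right-hand side so it runs on its own):

```python
import numpy as np

def derivs(s):
    # Right-hand side of the geodesic ODE; s = [mu, sigma, dmu, dsigma]
    _, sig, dm, ds = s
    return np.array([dm, ds, (2 / sig) * dm * ds,
                     -dm**2 / (2 * sig) + ds**2 / sig])

def rk4_step(state, dt):
    k1 = derivs(state)
    k2 = derivs(state + 0.5 * dt * k1)
    k3 = derivs(state + 0.5 * dt * k2)
    k4 = derivs(state + dt * k3)
    return state + (dt / 6) * (k1 + 2*k2 + 2*k3 + k4)

def fisher_speed2(s):
    # Squared Fisher norm of the velocity: (dmu^2 + 2 dsigma^2) / sigma^2
    _, sig, dm, ds = s
    return (dm**2 + 2 * ds**2) / sig**2

state = np.array([0.0, 1.0, 1.0, 0.3])
initial = fisher_speed2(state)
for _ in range(1000):
    state = rk4_step(state, 1e-3)
print(abs(fisher_speed2(state) - initial))  # near zero: geodesics have constant speed
```

Any appreciable drift in this quantity would signal a bug in the Christoffel symbols or the integrator.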

Natural Gradient Implementation

Side-by-side comparison of Euclidean and natural gradient descent minimizing $D_{\mathrm{KL}}(\mathcal{N}(0,1) \,\|\, \mathcal{N}(\mu, \sigma^2))$:

def gradient_descent(mu, sigma, target_mu, target_sigma, lr, natural=False, steps=100):
    """Euclidean or natural gradient descent on KL divergence."""
    trajectory = [(mu, sigma)]
    for _ in range(steps):
        # Euclidean gradient of D_KL(target || model)
        grad_mu = (mu - target_mu) / sigma**2
        grad_sigma = 1/sigma - (target_sigma**2 + (target_mu - mu)**2) / sigma**3

        if natural:
            # Fisher metric inverse: g^{-1} = diag(sigma^2, sigma^2/2)
            nat_mu = sigma**2 * grad_mu
            nat_sigma = (sigma**2 / 2) * grad_sigma
            mu -= lr * nat_mu
            sigma -= lr * nat_sigma
        else:
            mu -= lr * grad_mu
            sigma -= lr * grad_sigma

        sigma = max(sigma, 0.01)  # Keep sigma positive
        trajectory.append((mu, sigma))
    return trajectory
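A usage sketch (restating the update rule compactly so the snippet runs on its own): starting from $(\mu, \sigma) = (1, 0.5)$ with the standard-normal target, natural gradient descent homes in on $(0, 1)$.

```python
def kl_descent(mu, sigma, lr=0.1, steps=200, natural=True):
    """Minimize D_KL(N(0,1) || N(mu, sigma^2)); update rule restated from above."""
    for _ in range(steps):
        g_mu = mu / sigma**2
        g_sigma = 1 / sigma - (1 + mu**2) / sigma**3
        if natural:
            # Precondition with the inverse Fisher metric diag(sigma^2, sigma^2/2)
            g_mu, g_sigma = sigma**2 * g_mu, (sigma**2 / 2) * g_sigma
        mu, sigma = mu - lr * g_mu, max(sigma - lr * g_sigma, 0.01)
    return mu, sigma

m_n, s_n = kl_descent(1.0, 0.5, natural=True)
m_e, s_e = kl_descent(1.0, 0.5, natural=False)
print(f"natural:   mu={m_n:.4f}, sigma={s_n:.4f}")
print(f"euclidean: mu={m_e:.4f}, sigma={s_e:.4f}")
```

Note how the natural update for $\mu$ reduces to $\mu \leftarrow (1 - \eta)\,\mu$: the $1/\sigma^2$ in the Euclidean gradient is cancelled by the metric, so the step size no longer depends on where we are on the manifold.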

Fisher-Rao Distance Matrix

Pairwise Fisher-Rao distances between Gaussians using Rao’s formula:

def fisher_rao_distance(mu1, s1, mu2, s2):
    """Closed-form Fisher-Rao distance for univariate Gaussians."""
    return np.sqrt(2) * np.arccosh(
        1 + ((mu1 - mu2)**2 + 2*(s1 - s2)**2) / (4 * s1 * s2)
    )

Note that this formula uses the half-plane metric $ds^2 = \frac{1}{\sigma^2}(d\mu^2 + 2\,d\sigma^2)$, which includes the factor of $2$ in the $d\sigma^2$ term. The $\operatorname{arccosh}$ reflects the hyperbolic nature of the geometry.
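The distance matrix itself is then one nested loop away. A self-contained sketch (restating the closed form) builds it for four Gaussians and recovers the motivating example from the overview: equal mean shifts are far more significant at small variance.

```python
import numpy as np

def fisher_rao_distance(mu1, s1, mu2, s2):
    # Closed-form Fisher-Rao distance for univariate Gaussians (Rao's formula)
    return np.sqrt(2) * np.arccosh(
        1 + ((mu1 - mu2)**2 + 2 * (s1 - s2)**2) / (4 * s1 * s2))

gaussians = [(0.0, 1.0), (1.0, 1.0), (0.0, 0.1), (1.0, 0.1)]
n = len(gaussians)
D = np.array([[fisher_rao_distance(*gaussians[i], *gaussians[j])
               for j in range(n)] for i in range(n)])

print(np.round(D, 3))
# The sigma = 0.1 pair is much farther apart than the sigma = 1 pair,
# despite the identical Euclidean separation |delta mu| = 1.
```

The matrix is symmetric with zero diagonal, as a distance matrix must be.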

Computational information geometry: numerical geodesics, natural vs Euclidean gradient, Fisher-Rao distance matrix


Information Geometry in Machine Learning

Information geometry provides the natural mathematical framework for understanding and improving several core machine learning algorithms.

Natural Gradient Descent

Standard gradient descent updates parameters as $\theta_{t+1} = \theta_t - \eta \, \nabla L(\theta_t)$, using the Euclidean gradient. But the Euclidean gradient depends on the parameterization: reparameterizing the same model changes the gradient direction. This is undesirable: the “steepest descent” direction should be an intrinsic property of the model, not an artifact of how we chose to write down its parameters.

Amari’s natural gradient (1998) fixes this by using the Fisher-Rao metric:

$$\theta_{t+1} = \theta_t - \eta \, g^{-1}(\theta_t) \, \nabla L(\theta_t)$$

The natural gradient $\tilde{\nabla} L = g^{-1} \nabla L$ is the steepest descent direction in the Riemannian metric defined by the Fisher information. It is:

  • Reparameterization invariant: changing coordinates $\theta \to \phi(\theta)$ does not change the natural gradient direction (it transforms covariantly).
  • Asymptotically efficient: for maximum likelihood estimation, natural gradient descent converges to the optimal rate.
  • Follows geodesics approximately: the natural gradient flow traces out curves that are close to Fisher-Rao geodesics.
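The invariance claim can be checked numerically. In this added sketch, we reparameterize the Gaussian by its variance $v = \sigma^2$: the Fisher metric becomes $\text{diag}(1/v, 1/(2v^2))$, and the natural gradient computed in either chart is related by exactly the Jacobian $dv/d\sigma = 2\sigma$, as a tangent vector must be. (The evaluation point $(0.7, 0.4)$ is arbitrary.)

```python
# Gradients of D_KL(N(0,1) || N(mu, sigma^2)) at an arbitrary point
mu, sigma = 0.7, 0.4
grad_mu = mu / sigma**2
grad_sigma = 1 / sigma - (1 + mu**2) / sigma**3

# Natural gradient component in (mu, sigma): g^{-1} = diag(sigma^2, sigma^2 / 2)
nat_sigma = (sigma**2 / 2) * grad_sigma

# Reparameterize by the variance v = sigma^2; chain rule for the gradient (a covector)
v = sigma**2
grad_v = grad_sigma / (2 * sigma)
# Fisher metric in (mu, v) has g_vv = 1/(2 v^2), so its inverse entry is 2 v^2
nat_v = 2 * v**2 * grad_v

# A tangent vector pushes forward with the Jacobian dv/dsigma = 2*sigma
print(nat_v, 2 * sigma * nat_sigma)  # equal: the natural gradient is chart-independent
```

The Euclidean gradients `grad_sigma` and `grad_v` disagree by more than a Jacobian factor; only the metric-corrected versions transform consistently.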

Variational Inference as e-Projection

Variational inference minimizes $D_{\mathrm{KL}}(q \,\|\, p^*)$ over a variational family $\mathcal{Q}$ to approximate an intractable posterior $p^*$. In information-geometric terms, this is the e-projection of $p^*$ onto $\mathcal{Q}$ (Csiszár's I-projection): the distribution in $\mathcal{Q}$ closest to $p^*$ with $q$ in the first slot of the divergence. (The dual m-projection, which minimizes $D_{\mathrm{KL}}(p^* \,\|\, q)$ by moment matching, underlies expectation propagation instead.)

When $\mathcal{Q}$ is m-flat (for example, a linear or mixture family), the generalized Pythagorean theorem guarantees, for every $q \in \mathcal{Q}$:

$$D_{\mathrm{KL}}(q \,\|\, p^*) = D_{\mathrm{KL}}(q \,\|\, q^*) + D_{\mathrm{KL}}(q^* \,\|\, p^*)$$

where $q^*$ is the optimal variational approximation. The second term is the irreducible approximation error (determined by the expressiveness of $\mathcal{Q}$), and the first term is what the optimization eliminates.
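The decomposition can be verified numerically in a discrete toy model (our construction, for illustration): on a three-point sample space, take the family to be all distributions with mean $1$. The optimal approximation is an exponential tilting of $p^*$, found here by bisection, and the Pythagorean identity holds to machine precision for any member of the family.

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0])
p_star = np.array([0.6, 0.3, 0.1])   # "posterior" to approximate
c = 1.0                              # family Q = {q : E_q[x] = c}

def kl(a, b):
    mask = a > 0
    return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))

def tilt(lam):
    # Exponential tilting of p*; its mean is increasing in lam
    w = p_star * np.exp(lam * x)
    return w / w.sum()

# Projection of p* onto Q: bisect for the tilting parameter matching the mean
lo, hi = -30.0, 30.0
for _ in range(200):
    mid = (lo + hi) / 2
    lo, hi = (mid, hi) if tilt(mid) @ x < c else (lo, mid)
q_star = tilt((lo + hi) / 2)

q = np.array([0.25, 0.5, 0.25])      # another member of Q (mean 1)
lhs = kl(q, p_star)
rhs = kl(q, q_star) + kl(q_star, p_star)
print(lhs, rhs)  # equal: the generalized Pythagorean identity
```

The identity can be confirmed by hand: for any $q$ in the family, $\log(q^*/p^*)$ is affine in $x$, so its expectation under $q$ and under $q^*$ coincide.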

Adam as Approximate Natural Gradient

The Adam optimizer (Kingma & Ba, 2015) maintains running estimates of the first and second moments of the gradient. The second moment estimate $v_t \approx \mathbb{E}[(\nabla L)^2]$ is a diagonal approximation to the Fisher information matrix: $\text{diag}(v_t) \approx \text{diag}(g(\theta))$. The Adam update

$$\theta_{t+1} = \theta_t - \eta \, \frac{m_t}{\sqrt{v_t} + \epsilon}$$

is therefore an approximate natural gradient step with a diagonal Fisher matrix.

K-FAC (Martens & Grosse, 2015) improves on this by using a block-diagonal, Kronecker-factored approximation to the full Fisher matrix. For a neural network layer with input $a$ and output gradient $g$, K-FAC approximates the Fisher block as $\mathbb{E}[a a^T] \otimes \mathbb{E}[g g^T]$, which is far cheaper to invert than the full Fisher matrix while capturing more curvature structure than Adam’s diagonal.
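The computational payoff of the Kronecker factorization is easy to demonstrate. For symmetric factors, $(A \otimes G)^{-1}\,\mathrm{vec}(V) = \mathrm{vec}(A^{-1} V G^{-1})$ (row-major vectorization), so inverting the big matrix reduces to inverting the two small factors. A generic linear-algebra sketch with random stand-ins for $\mathbb{E}[a a^T]$ and $\mathbb{E}[g g^T]$, not K-FAC itself:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 4, 3
# SPD stand-ins for the Kronecker factors E[a a^T] and E[g g^T]
A = rng.standard_normal((m, m)); A = A @ A.T + m * np.eye(m)
G = rng.standard_normal((n, n)); G = G @ G.T + n * np.eye(n)
V = rng.standard_normal((m, n))   # a gradient block to precondition

# Direct: solve the full (m*n x m*n) Kronecker system
direct = np.linalg.solve(np.kron(A, G), V.ravel())
# Factored: only the small m x m and n x n inverses are needed
factored = (np.linalg.solve(A, V) @ np.linalg.inv(G)).ravel()
print(np.allclose(direct, factored))
```

For a layer with $10^3$ inputs and outputs, the full block is $10^6 \times 10^6$ while the factors are $10^3 \times 10^3$, which is the whole point of the approximation.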

Optimal Transport Connections

The Fisher-Rao metric and the Wasserstein distance define different geometries on the space of distributions:

  • Fisher-Rao measures informational distance: how distinguishable are two distributions from finite samples?
  • Wasserstein measures physical distance: what is the cost of transporting mass from one distribution to the other?

Otto (2001) showed that the Wasserstein space carries a formal Riemannian structure where the gradient flow of the KL divergence is the Fokker-Planck equation. The interplay between these two geometries — informational and physical — is an active research frontier connecting information geometry to optimal transport theory.

Loss Landscape Curvature

The curvature of the loss landscape at a minimum affects generalization. The “flat minima” conjecture (Hochreiter & Schmidhuber, 1997; Keskar et al., 2017) suggests that minima occupying large, flat regions of the loss surface generalize better than sharp minima. The Fisher information matrix at convergence is related to the Hessian of the loss, and its eigenspectrum characterizes the sharpness of the minimum.

Specifically, for a model trained with maximum likelihood, the Fisher information matrix equals the expected Hessian of the negative log-likelihood. The eigenvalues of $g(\theta^*)$ at the converged parameters $\theta^*$ measure the curvature in each direction: large eigenvalues correspond to sharp directions, and the Spectral Theorem guarantees that these principal curvature directions exist and are orthogonal.
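As a small added illustration, the Fisher matrix can be estimated from samples as $\mathbb{E}[s\,s^T]$ over the score vectors and then eigendecomposed. For the Gaussian at $\sigma = 0.5$ the analytic spectrum is $\{4, 8\}$, with the $\sigma$ direction the sharper of the two:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n = 0.0, 0.5, 500_000

x = rng.normal(mu, sigma, size=n)
# Score vectors for (mu, sigma) at each sample
scores = np.stack([(x - mu) / sigma**2,
                   -1 / sigma + (x - mu)**2 / sigma**3], axis=1)

# Empirical Fisher E[s s^T]; analytically diag(1/sigma^2, 2/sigma^2) = diag(4, 8)
F = scores.T @ scores / n
eigvals, eigvecs = np.linalg.eigh(F)
print(np.round(F, 2))
print(np.round(eigvals, 2))   # approx [4, 8]: principal information directions
```

Here the eigenvectors are (up to sampling noise) the coordinate axes, because the Fisher matrix of the Gaussian is diagonal in $(\mu, \sigma)$; for a generic model the eigenbasis rotates away from the parameter axes.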

Information geometry in ML: reparameterization invariance, VI as e-projection, Adam vs natural gradient convergence


Connections & Further Reading

Information Geometry & Fisher Metric is the capstone of the Differential Geometry track. It connects back to every topic in the track and forward to applications across the curriculum:

| Connected Topic | Domain | Relationship |
| --- | --- | --- |
| Smooth Manifolds | Differential Geometry | Parameter spaces as smooth manifolds; tangent spaces spanned by score functions |
| Riemannian Geometry | Differential Geometry | Fisher metric as Riemannian metric; Levi-Civita as the $\alpha = 0$ connection |
| Geodesics & Curvature | Differential Geometry | Fisher-Rao geodesics; curvature $K = -1/2$ for Gaussians; Gauss-Bonnet theorem |
| The Spectral Theorem | Linear Algebra | Eigendecomposition of the Fisher matrix reveals principal information directions |
| PCA & Low-Rank Approximation | Linear Algebra | Natural gradient as Fisher preconditioning; covariance structure |
| Persistent Homology | Topology & TDA | Euler characteristic in Gauss-Bonnet for statistical manifolds |
| Shannon Entropy & Mutual Information | Information Theory | Entropy and mutual information are developed on the Information Theory track; KL divergence $D_{KL}(p \,\|\, q) = H(p,q) - H(p)$ is the divergence whose Hessian gives the Fisher metric, with f-divergence generalizations in KL Divergence & f-Divergences |

Completing the Differential Geometry Track

With this topic, the four-part Differential Geometry track is complete:

  1. Smooth Manifolds (intermediate) — the foundational structure: charts, tangent spaces, and smooth maps
  2. Riemannian Geometry (advanced) — metric tensors, connections, and parallel transport
  3. Geodesics & Curvature (intermediate) — geodesic equations, curvature tensors, and the Gauss-Bonnet theorem
  4. Information Geometry & Fisher Metric (advanced) — the Fisher metric on statistical manifolds, $\alpha$-connections, and natural gradient descent

The track moves from abstract manifold structure through Riemannian geometry to the concrete, application-rich setting of statistical manifolds — where the geometric machinery built in the first three topics has direct consequences for machine learning.

Connections

  • Information geometry instantiates Riemannian geometry on statistical manifolds: the Fisher metric is a specific Riemannian metric, the Levi-Civita connection is the α = 0 member of Amari's α-connection family, and parallel transport on the Gaussian manifold is parallel transport on the Poincaré half-plane. riemannian-geometry
  • Statistical manifolds inherit their smooth structure from the parameter space. Charts on the Gaussian manifold are parameterizations (μ, σ) or (η₁, η₂), and the tangent space at a distribution is spanned by score functions — the derivatives of the log-likelihood. smooth-manifolds
  • Fisher-Rao geodesics on the Gaussian manifold are semicircles in the Poincaré half-plane, exactly the geodesics computed by the geodesic equation. The Gaussian curvature K = -1/2 appears via the Riemann tensor machinery, and the Gauss-Bonnet theorem constrains the topology of statistical manifolds. geodesics-curvature
  • The Fisher information matrix is symmetric positive definite, and its eigendecomposition reveals the principal directions of statistical information — the directions in parameter space along which estimation is most and least precise. spectral-theorem
  • Natural gradient descent on the Fisher manifold is the information-geometric version of preconditioning with the Hessian. The Fisher metric plays the same role for statistical models that the covariance matrix plays for PCA: it defines the natural inner product on the parameter space. pca-low-rank
  • The Euler characteristic appears in the Gauss–Bonnet theorem for statistical manifolds, connecting the topology of the parameter space to the integral of Fisher-Rao curvature — a link between topological data analysis and information geometry. persistent-homology

References & Further Reading

  • book Methods of Information Geometry — Amari & Nagaoka (2000) The foundational monograph on information geometry: α-connections, duality, divergences, and applications to statistics and machine learning
  • book Information Geometry and Its Applications — Amari (2016) Amari's comprehensive update covering modern applications including neural networks, machine learning, and signal processing
  • book Differential-Geometrical Methods in Statistics — Amari (1985) The original Lecture Notes in Statistics volume that introduced α-connections and dually flat geometry to statistics
  • paper Natural Gradient Works Efficiently in Learning — Amari (1998) Introduced natural gradient descent — steepest descent in the Fisher-Rao metric — and proved its reparameterization invariance and efficiency
  • paper Information and the Accuracy Attainable in the Estimation of Statistical Parameters — Rao (1945) Rao's seminal paper introducing the Fisher information metric as a Riemannian metric on statistical manifolds — the birth of information geometry
  • paper Optimizing Neural Networks with Kronecker-factored Approximate Curvature — Martens & Grosse (2015) K-FAC: practical natural gradient for deep learning using block-diagonal Kronecker-factored Fisher approximation
  • paper Statistical Decision Rules and Optimal Inference — Čencov (1982) Čencov's uniqueness theorem: the Fisher metric is the unique (up to scale) Riemannian metric invariant under sufficient statistics