
KL Divergence & f-Divergences

Statistical divergences — measuring distributional mismatch through coding excess, convex generators, and variational representations

Overview & Motivation

Cross-entropy loss, variational inference, and generative adversarial networks look like three unrelated ideas. But they share a common engine: each one minimizes a divergence — a function that measures how one probability distribution differs from another.

  • Cross-entropy loss minimizes the KL divergence from the true label distribution to the model’s predictions (forward KL).
  • Variational inference minimizes the KL divergence from an approximate posterior to the true posterior (reverse KL), which is equivalent to maximizing the ELBO.
  • GANs minimize the Jensen–Shannon divergence between the real data distribution and the generator’s output.

All three are special cases of f-divergence minimization, a framework that unifies a family of distributional distance measures through convex analysis. Understanding this framework gives us a single lens through which cross-entropy loss, the ELBO, and the GAN objective are all the same idea in different clothes.

This topic develops the theory systematically: we start with KL divergence and its operational meaning as the excess cost of miscoding, then explore how the direction of the KL divergence (forward vs reverse) determines fundamentally different fitting behavior. We generalize to f-divergences — showing that KL, reverse KL, $\chi^2$, Hellinger, total variation, and Jensen–Shannon are all members of one family parameterized by convex generator functions. Variational representations via Fenchel conjugates turn these abstract measures into optimization problems that can be solved with neural networks. Rényi divergences provide a complementary one-parameter family with applications to differential privacy and hypothesis testing.

Prerequisites

This topic builds on:

  • Shannon Entropy & Mutual Information — KL divergence is defined as the gap between cross-entropy and entropy: $D_{\mathrm{KL}}(p \| q) = H(p, q) - H(p)$. We use entropy, mutual information, and the data processing inequality throughout.

What We Cover

  1. KL divergence — definition, Gibbs’ inequality, asymmetry, cross-entropy decomposition
  2. Forward vs reverse KL — mode-covering vs mode-seeking, connections to MLE and variational inference
  3. f-divergences — the unifying framework via convex generator functions
  4. Properties of f-divergences — non-negativity, joint convexity, data processing inequality, Pinsker’s inequality
  5. Variational representations — Fenchel conjugates, Donsker–Varadhan, NWJ bound, connection to GANs
  6. Rényi divergence — the $\alpha$-divergence family, monotonicity, special cases
  7. Computational notes — estimation, cross-entropy loss, ELBO, practical implementations

KL Divergence: Definition & Properties

In Shannon Entropy & Mutual Information, we defined the entropy $H(p)$ as the minimum average code length for a source with distribution $p$. If we use a code optimized for a different distribution $q$, the average code length increases to $H(p, q) = -\sum_x p(x) \log_2 q(x)$ — the cross-entropy. The difference is the excess cost of using the wrong code.

Definition 1 (KL Divergence).

The Kullback–Leibler divergence (or relative entropy) from distribution $p$ to distribution $q$ over a finite alphabet $\mathcal{X}$ is

$$D_{\mathrm{KL}}(p \| q) = \sum_{x \in \mathcal{X}} p(x) \log_2 \frac{p(x)}{q(x)} = \mathbb{E}_p\!\left[\log_2 \frac{p(X)}{q(X)}\right]$$

with the conventions $0 \log(0/q) = 0$ and $p \log(p/0) = +\infty$ when $p > 0$.

The operational meaning: $D_{\mathrm{KL}}(p \| q)$ is the expected number of extra bits needed to encode samples from $p$ using a code optimized for $q$, beyond the minimum achieved by the optimal code for $p$.

Definition 2 (Cross-Entropy).

The cross-entropy from $p$ to $q$ is

$$H(p, q) = -\sum_{x \in \mathcal{X}} p(x) \log_2 q(x)$$

It measures the expected code length when encoding data from $p$ with a code optimized for $q$.

The key decomposition connecting these quantities is immediate:

Proposition 1 (Cross-Entropy Decomposition).

$$H(p, q) = H(p) + D_{\mathrm{KL}}(p \| q)$$

Cross-entropy equals entropy plus KL divergence. Since $D_{\mathrm{KL}}(p \| q) \geq 0$ (Gibbs’ inequality, below), the cross-entropy is always at least as large as the entropy.

Proof.

$$H(p, q) = -\sum_x p(x) \log_2 q(x) = -\sum_x p(x) \log_2 p(x) + \sum_x p(x) \log_2 \frac{p(x)}{q(x)} = H(p) + D_{\mathrm{KL}}(p \| q)$$

$\square$

Gibbs’ Inequality

The most fundamental property of KL divergence is non-negativity: using the wrong code never helps.

Proposition 2 (Non-negativity of KL (Gibbs' Inequality)).

For any distributions $p$ and $q$ over the same alphabet, $D_{\mathrm{KL}}(p \| q) \geq 0$.

Proof.

Since $-\log$ is a strictly convex function, Jensen’s inequality gives:

$$-D_{\mathrm{KL}}(p \| q) = \sum_x p(x) \log_2 \frac{q(x)}{p(x)} = \mathbb{E}_p\!\left[\log_2 \frac{q(X)}{p(X)}\right] \leq \log_2 \mathbb{E}_p\!\left[\frac{q(X)}{p(X)}\right] = \log_2 \sum_x q(x) = \log_2 1 = 0$$

Therefore $D_{\mathrm{KL}}(p \| q) \geq 0$.

$\square$

Proposition 3 (KL Divergence and Equality).

$D_{\mathrm{KL}}(p \| q) = 0$ if and only if $p = q$ (for all $x$ where $p(x) > 0$).

Proof.

Equality in Jensen’s inequality holds iff the random variable $q(X)/p(X)$ is constant $p$-almost surely. Since $\sum_x q(x) = \sum_x p(x) = 1$, that constant must be $1$, giving $q(x) = p(x)$ wherever $p(x) > 0$.

$\square$

Asymmetry

Unlike a true distance, KL divergence is not symmetric.

Proposition 4 (Asymmetry of KL).

In general, $D_{\mathrm{KL}}(p \| q) \neq D_{\mathrm{KL}}(q \| p)$.

Counterexample. Let $p = (0.9, 0.1)$ and $q = (0.5, 0.5)$. Then:

$$D_{\mathrm{KL}}(p \| q) = 0.9 \log_2 \frac{0.9}{0.5} + 0.1 \log_2 \frac{0.1}{0.5} \approx 0.531 \text{ bits}$$

$$D_{\mathrm{KL}}(q \| p) = 0.5 \log_2 \frac{0.5}{0.9} + 0.5 \log_2 \frac{0.5}{0.1} \approx 0.737 \text{ bits}$$

The asymmetry is not a defect — it encodes fundamentally different information about the relationship between $p$ and $q$. The direction you choose determines the fitting behavior, as we explore in the next section.

KL divergence is also not a metric: it violates the triangle inequality. It is not even a semimetric (which requires symmetry). Despite this, it plays a central role because its information-theoretic interpretation is exact and its connection to maximum likelihood estimation is direct.
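The counterexample above is quick to check numerically. A standalone sketch (the `kl_bits` helper here is ours, written just for this check):

```python
import numpy as np

p = np.array([0.9, 0.1])
q = np.array([0.5, 0.5])

def kl_bits(a, b):
    """D_KL(a || b) in bits, for strictly positive distributions."""
    return float(np.sum(a * np.log2(a / b)))

forward = kl_bits(p, q)   # ≈ 0.531 bits
reverse = kl_bits(q, p)   # ≈ 0.737 bits
print(forward, reverse)   # the two directions clearly differ
```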

Explore the interactive visualization below. Drag the bars to adjust both distributions and watch the KL divergence, cross-entropy, and their asymmetry update in real time.


KL divergence properties — cross-entropy gap, asymmetry, and Gibbs' inequality

import numpy as np

def kl_divergence(p, q):
    """KL divergence D_KL(p || q) in bits."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    if np.any(q[mask] <= 0):
        return np.inf
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

def cross_entropy(p, q):
    """Cross-entropy H(p, q) in bits."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    if np.any(q[mask] <= 0):
        return np.inf
    return -np.sum(p[mask] * np.log2(q[mask]))

# Cross-entropy decomposition: H(p, q) = H(p) + D_KL(p || q)
p = np.array([0.4, 0.35, 0.15, 0.1])
q = np.array([0.25, 0.25, 0.25, 0.25])

H_p = -np.sum(p * np.log2(p))              # ≈ 1.8016 bits
H_pq = cross_entropy(p, q)                  # 2.0000 bits
D_KL = kl_divergence(p, q)                  # ≈ 0.1984 bits
# Verify: H_pq ≈ H_p + D_KL → 2.0000 ≈ 1.8016 + 0.1984 ✓

Forward vs Reverse KL

The asymmetry of KL divergence is not merely a mathematical curiosity — it produces two fundamentally different fitting behaviors. When we approximate a complex distribution $p$ with a simpler model $q$, the direction of the KL divergence we minimize determines what kind of approximation we get.

Forward KL: Mode-Covering

Minimizing forward KL $D_{\mathrm{KL}}(p \| q)$ over the model family $\{q_\theta\}$ produces a model that covers all modes of $p$.

$$\min_\theta D_{\mathrm{KL}}(p \| q_\theta) = \min_\theta \mathbb{E}_p\!\left[\log \frac{p(X)}{q_\theta(X)}\right] = \min_\theta \left[-\mathbb{E}_p[\log q_\theta(X)]\right] + \text{const}$$

The penalty comes from $p(x) \log(p(x)/q(x))$: if $p(x) > 0$ but $q(x) \approx 0$, the term $\log(p(x)/q(x)) \to +\infty$. This penalty forces $q$ to place mass everywhere $p$ does — it must cover all modes, even at the cost of putting wasted probability mass in regions between modes.

Remark (Forward KL as MLE).

Minimizing forward KL $D_{\mathrm{KL}}(p \| q_\theta)$ over $\theta$ is equivalent to maximum likelihood estimation. Since $D_{\mathrm{KL}}(p \| q_\theta) = H(p, q_\theta) - H(p)$ and $H(p)$ does not depend on $\theta$, we have $\arg\min_\theta D_{\mathrm{KL}}(p \| q_\theta) = \arg\min_\theta H(p, q_\theta) = \arg\max_\theta \mathbb{E}_p[\log q_\theta(X)]$.

When $p$ is the empirical distribution over training data, $\mathbb{E}_p[\log q_\theta(X)] = \frac{1}{n}\sum_{i=1}^n \log q_\theta(x_i)$ — the log-likelihood. Cross-entropy loss in classification is forward KL minimization.

Reverse KL: Mode-Seeking

Minimizing reverse KL $D_{\mathrm{KL}}(q \| p)$ produces a model that seeks a single mode of $p$.

$$\min_\theta D_{\mathrm{KL}}(q_\theta \| p) = \min_\theta \mathbb{E}_{q_\theta}\!\left[\log \frac{q_\theta(X)}{p(X)}\right]$$

Now the penalty comes from $q(x) \log(q(x)/p(x))$: if $p(x) \approx 0$ but $q(x) > 0$, the term explodes. This forces $q$ to avoid placing mass where $p$ does not — but it has no penalty for ignoring modes of $p$ where $q$ is already zero. The result: $q$ locks onto a single mode and ignores the rest.

Remark (Reverse KL and ELBO).

In variational inference, we approximate an intractable posterior $p(z|x)$ with a tractable family $q_\phi(z)$. The evidence lower bound (ELBO) satisfies:

$$\text{ELBO} = \log p(x) - D_{\mathrm{KL}}(q_\phi(z) \| p(z|x))$$

Since $\log p(x)$ is fixed, maximizing the ELBO is equivalent to minimizing the reverse KL from $q$ to the true posterior. This explains why variational autoencoders (VAEs) tend to produce approximations that are too concentrated — the reverse KL lets the approximation ignore modes of the posterior.

The visualization below demonstrates this contrast on a bimodal target. A single Gaussian fit under forward KL spreads to cover both peaks (with wasted mass in the valley), while under reverse KL it locks onto one peak.
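The same contrast can be reproduced without any plotting. A self-contained sketch, with a discretized bimodal target and a brute-force grid search standing in for gradient-based fitting (the target parameters and search grid are arbitrary choices):

```python
import numpy as np

# Discretized bimodal target: equal mixture of N(-2, 0.5^2) and N(+2, 0.5^2)
x = np.linspace(-6, 6, 241)
p = np.exp(-(x + 2)**2 / (2 * 0.5**2)) + np.exp(-(x - 2)**2 / (2 * 0.5**2))
p /= p.sum()

def kl(a, b):
    """Discrete D_KL(a || b) in nats (both strictly positive here)."""
    return float(np.sum(a * np.log(a / b)))

def gaussian(mu, sigma):
    g = np.exp(-(x - mu)**2 / (2 * sigma**2))
    return g / g.sum()

# Grid search over a single-Gaussian family, one search per KL direction
mus, sigmas = np.linspace(-3, 3, 61), np.linspace(0.3, 3.0, 28)
best_fwd, best_rev = (np.inf, None), (np.inf, None)
for mu in mus:
    for sigma in sigmas:
        q = gaussian(mu, sigma)
        f, r = kl(p, q), kl(q, p)
        if f < best_fwd[0]: best_fwd = (f, (mu, sigma))
        if r < best_rev[0]: best_rev = (r, (mu, sigma))

print("forward KL fit (mode-covering):", best_fwd[1])  # mu ≈ 0, broad sigma
print("reverse KL fit (mode-seeking):", best_rev[1])   # mu ≈ ±2, narrow sigma
```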


Forward vs reverse KL — mode-covering vs mode-seeking behavior on a bimodal target


f-Divergences: A Unifying Framework

KL divergence is one member of a much larger family. Ali & Silvey (1966) and Csiszár (1967) independently showed that a single construction — parameterized by a convex function — generates an entire family of divergences, all sharing the key properties we proved for KL.

Definition 3 (f-Divergence).

Let $f: (0, \infty) \to \mathbb{R}$ be a convex function with $f(1) = 0$. The f-divergence from $p$ to $q$ is

$$D_f(p \| q) = \sum_{x \in \mathcal{X}} q(x)\, f\!\left(\frac{p(x)}{q(x)}\right) = \mathbb{E}_q\!\left[f\!\left(\frac{p(X)}{q(X)}\right)\right]$$

with the conventions $0 \cdot f(0/0) = 0$ and $q \cdot f(p/0) = p \lim_{t \to \infty} f(t)/t$ when $q = 0$, $p > 0$.

The generator function $f$ determines which divergence we get. Every $f$ satisfying the conditions above — convex, with $f(1) = 0$ — produces a valid divergence that is non-negative and (given strict convexity at $1$) zero iff $p = q$.

The following list shows six important special cases, each recoverable by choosing the appropriate generator:

  • KL divergence: $f(t) = t \log t$ gives $\sum p(x) \log(p(x)/q(x))$
  • Reverse KL: $f(t) = -\log t$ gives $\sum q(x) \log(q(x)/p(x))$
  • $\chi^2$ divergence: $f(t) = (t - 1)^2$ gives $\sum (p(x) - q(x))^2/q(x)$
  • Squared Hellinger: $f(t) = (\sqrt{t} - 1)^2$ gives $\sum (\sqrt{p(x)} - \sqrt{q(x)})^2$
  • Total variation: $f(t) = |t - 1|/2$ gives $\frac{1}{2}\sum |p(x) - q(x)|$
  • Jensen–Shannon: $f(t) = t \log\frac{2t}{t+1} + \log\frac{2}{t+1}$ gives $\frac{1}{2}D_{\mathrm{KL}}(p \| m) + \frac{1}{2}D_{\mathrm{KL}}(q \| m)$

where $m = (p + q)/2$ in the Jensen–Shannon case.

Definition 4 (Total Variation Distance).

The total variation distance between $p$ and $q$ is

$$\mathrm{TV}(p, q) = \frac{1}{2}\sum_{x \in \mathcal{X}} |p(x) - q(x)|$$

It is the maximum difference in probability assigned to any event: $\mathrm{TV}(p, q) = \max_{A \subseteq \mathcal{X}} |p(A) - q(A)|$.

Definition 5 (Jensen–Shannon Divergence).

The Jensen–Shannon divergence is the symmetrized, smoothed KL divergence:

$$\mathrm{JS}(p \| q) = \frac{1}{2} D_{\mathrm{KL}}(p \| m) + \frac{1}{2} D_{\mathrm{KL}}(q \| m), \qquad m = \frac{p + q}{2}$$

Unlike KL divergence, JS is symmetric, bounded ($0 \leq \mathrm{JS} \leq \log 2$), and its square root is a metric. It is the divergence minimized (implicitly) in the original GAN objective.
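These properties are easy to confirm numerically. A small sketch using natural-log JS, so the upper bound is $\log 2$ (the example distributions are arbitrary):

```python
import numpy as np

def kl(p, q):
    """D_KL(p || q) in nats; assumes supp(p) is contained in supp(q)."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def js(p, q):
    """Jensen–Shannon divergence via the mixture m = (p + q)/2."""
    m = (p + q) / 2
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.1, 0.2, 0.7])
print(js(p, q) == js(q, p))          # symmetric
print(0.0 <= js(p, q) <= np.log(2))  # bounded by log 2

# Disjoint supports hit the bound: JS = log 2, while KL would be infinite
a, b = np.array([1.0, 0.0]), np.array([0.0, 1.0])
print(np.isclose(js(a, b), np.log(2)))
```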

Explore the f-divergence family below. The left panel shows the generator functions — all convex, all passing through $(1, 0)$. The right panel shows how each divergence responds to distributional mismatch.

f-divergence family — generator functions, comparison curves, and Pinsker's inequality

def f_divergence(p, q, f_func):
    """General f-divergence D_f(p || q) = sum q(x) f(p(x)/q(x))."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    result = 0.0
    for pi, qi in zip(p, q):
        if qi > 0 and pi > 0:
            result += qi * f_func(pi / qi)
        elif qi > 0 and pi == 0:
            result += qi * f_func(0.0)
        elif qi == 0 and pi > 0:
            return np.inf
    return result

# Generator functions (use natural log — the standard convention for
# f-divergences; kl_divergence() above uses log2 for bits)
f_kl         = lambda t: t * np.log(t) if t > 0 else 0.0
f_reverse_kl = lambda t: -np.log(t) if t > 0 else np.inf
f_chi_sq     = lambda t: (t - 1) ** 2
f_hellinger  = lambda t: (np.sqrt(max(t, 0)) - 1) ** 2
f_tv         = lambda t: abs(t - 1) / 2
f_js         = lambda t: (t * np.log(t / ((t + 1) / 2)) + np.log(1 / ((t + 1) / 2))
                          if t > 0 else np.log(2))

Properties of f-Divergences

The power of the f-divergence framework is that all members inherit fundamental properties from the convexity of $f$ alone. We do not need separate proofs for KL, $\chi^2$, Hellinger, etc. — one proof covers them all.

Theorem 1 (Non-negativity of f-Divergences).

For any convex $f$ with $f(1) = 0$, $D_f(p \| q) \geq 0$ for all distributions $p, q$. Equality holds iff $p = q$ when $f$ is strictly convex at $1$.

Proof.

By Jensen’s inequality applied to the convex function $f$:

$$D_f(p \| q) = \mathbb{E}_q\!\left[f\!\left(\frac{p(X)}{q(X)}\right)\right] \geq f\!\left(\mathbb{E}_q\!\left[\frac{p(X)}{q(X)}\right]\right) = f\!\left(\sum_x q(x) \cdot \frac{p(x)}{q(x)}\right) = f\!\left(\sum_x p(x)\right) = f(1) = 0$$

When $f$ is strictly convex at $1$, equality in Jensen’s holds iff $p(x)/q(x)$ is constant $q$-a.s., which forces $p = q$.

$\square$

This is the same proof structure as Gibbs’ inequality for KL — because Gibbs’ inequality is the special case $f(t) = t \log t$.

Theorem 2 (Joint Convexity of f-Divergences).

$D_f(p \| q)$ is jointly convex in the pair $(p, q)$. That is, for $\lambda \in [0, 1]$:

$$D_f(\lambda p_1 + (1-\lambda) p_2 \| \lambda q_1 + (1-\lambda) q_2) \leq \lambda D_f(p_1 \| q_1) + (1-\lambda) D_f(p_2 \| q_2)$$

Proof.

The perspective function $g(a, b) = b\, f(a/b)$ is jointly convex when $f$ is convex (this is a standard result in Convex Analysis). Since $D_f(p \| q) = \sum_x g(p(x), q(x))$ is a sum of jointly convex functions, it is jointly convex.

$\square$

Joint convexity means that divergence minimization — minimizing $D_f(p \| q_\theta)$ over parameters $\theta$ — is a convex problem when the mapping $\theta \mapsto q_\theta$ is affine.
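Joint convexity can be spot-checked numerically. A sketch using the $\chi^2$ generator on random distribution pairs (the trial count, alphabet size, and positivity epsilon are arbitrary choices):

```python
import numpy as np

def chi2(p, q):
    """chi^2 divergence: f(t) = (t - 1)^2, i.e. sum (p - q)^2 / q."""
    return float(np.sum((p - q)**2 / q))

rng = np.random.default_rng(0)

def rand_dist(k=4):
    v = rng.random(k) + 1e-6   # keep entries strictly positive
    return v / v.sum()

# D_f(lam*p1 + (1-lam)*p2 || lam*q1 + (1-lam)*q2)
#   <= lam*D_f(p1 || q1) + (1-lam)*D_f(p2 || q2)
for _ in range(1000):
    p1, p2, q1, q2 = rand_dist(), rand_dist(), rand_dist(), rand_dist()
    lam = rng.random()
    lhs = chi2(lam*p1 + (1-lam)*p2, lam*q1 + (1-lam)*q2)
    rhs = lam*chi2(p1, q1) + (1-lam)*chi2(p2, q2)
    assert lhs <= rhs + 1e-12
print("joint convexity held on 1000 random trials")
```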

Data Processing Inequality for f-Divergences

The data processing inequality (DPI) says that processing data can only lose information — it can never increase the divergence between two distributions.

Theorem 3 (Data Processing Inequality for f-Divergences).

For any f-divergence $D_f$ and any Markov kernel (channel) $T$:

$$D_f(Tp \| Tq) \leq D_f(p \| q)$$

where $(Tp)(y) = \sum_x T(y|x)\, p(x)$ is the output distribution when $p$ is passed through channel $T$.

Proof.

For each output $y$, the ratio $(Tp)(y)/(Tq)(y)$ is a weighted average of the input ratios $p(x)/q(x)$:

$$\frac{(Tp)(y)}{(Tq)(y)} = \frac{\sum_x T(y|x)\, p(x)}{\sum_x T(y|x)\, q(x)} = \sum_x w_x(y)\, \frac{p(x)}{q(x)}$$

where the weights $w_x(y) = T(y|x)\, q(x) / (Tq)(y)$ sum to $1$. By the convexity of $f$ (Jensen’s inequality):

$$f\!\left(\frac{(Tp)(y)}{(Tq)(y)}\right) \leq \sum_x w_x(y)\, f\!\left(\frac{p(x)}{q(x)}\right)$$

Multiplying by $(Tq)(y)$ and summing over $y$ gives $D_f(Tp \| Tq) \leq D_f(p \| q)$.

$\square$

This is strictly more general than the mutual information DPI from Shannon Entropy & Mutual Information. The mutual information DPI $I(X; Z) \leq I(X; Y)$ for a Markov chain $X \to Y \to Z$ is the special case where $D_f = D_{\mathrm{KL}}$ applied to the joint vs product of marginals.
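A small sketch makes the DPI concrete: pushing two distributions through a noisy channel can only shrink their KL divergence (the channel matrix and input distributions here are arbitrary examples):

```python
import numpy as np

def kl(p, q):
    """D_KL(p || q) in nats; assumes strictly positive q."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Channel T[y, x] = T(y | x): each column is a conditional distribution
T = np.array([[0.8, 0.1, 0.2],
              [0.1, 0.7, 0.3],
              [0.1, 0.2, 0.5]])

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.2, 0.3, 0.5])

Tp, Tq = T @ p, T @ q           # output distributions
print(kl(Tp, Tq) <= kl(p, q))   # processing never increases divergence
```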

Pinsker’s Inequality

Pinsker’s inequality provides a bridge between KL divergence and total variation — two divergences with very different structures.

Theorem 4 (Pinsker's Inequality).

$$\mathrm{TV}(p, q) \leq \sqrt{\frac{1}{2} D_{\mathrm{KL}}(p \| q)}$$

or equivalently, $\mathrm{TV}(p, q)^2 \leq \frac{1}{2} D_{\mathrm{KL}}(p \| q)$, with the KL divergence measured in nats.

The proof involves a careful comparison of the generator functions for TV and KL via a quadratic bound on $t \log t$ near $t = 1$ (see Cover & Thomas, Ch. 11, or Tsybakov, 2009). The bound is tight: equality is approached as $p$ and $q$ become close.

Pinsker’s inequality is practically important: total variation has a clean interpretation (maximum probability difference over events) but is hard to work with in optimization; KL divergence has a clean optimization theory (convexity, connections to MLE) but a less transparent geometric interpretation. Pinsker’s inequality lets us convert bounds between them.
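Pinsker’s bound is likewise easy to spot-check over random pairs (in nats; the alphabet size, trial count, and positivity epsilon in this sketch are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)

def rand_dist(k):
    v = rng.random(k) + 1e-3   # keep entries strictly positive
    return v / v.sum()

for _ in range(1000):
    p, q = rand_dist(5), rand_dist(5)
    tv = 0.5 * np.sum(np.abs(p - q))
    kl = np.sum(p * np.log(p / q))          # nats
    assert tv <= np.sqrt(kl / 2) + 1e-12
print("Pinsker's inequality held on 1000 random pairs")
```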

Data processing inequality for f-divergences — noise degrades all divergences


Variational Representations

The variational representation of f-divergences is one of the most powerful results in modern information theory. It transforms divergence computation from a density ratio problem (requiring knowledge of $p$ and $q$) into an optimization problem (requiring only samples from $p$ and $q$).

The key idea comes from Convex Analysis: every convex function can be represented as the supremum of affine functions via its Fenchel conjugate (also called the convex conjugate or Legendre transform):

$$f^*(s) = \sup_{t > 0}\{st - f(t)\}$$

Theorem 5 (Variational Representation of f-Divergences).

For any f-divergence with generator $f$:

$$D_f(p \| q) = \sup_{T: \mathcal{X} \to \mathrm{dom}(f^*)} \left\{\mathbb{E}_p[T(X)] - \mathbb{E}_q[f^*(T(X))]\right\}$$

The supremum is attained at $T^*(x) = f'(p(x)/q(x))$.

Proof.

By the Fenchel–Young inequality, $f(t) \geq st - f^*(s)$ for all $s, t$ (this is the definition of the conjugate). Therefore for any function $T$:

$$D_f(p \| q) = \sum_x q(x)\, f\!\left(\frac{p(x)}{q(x)}\right) \geq \sum_x q(x) \left[\frac{p(x)}{q(x)}\, T(x) - f^*(T(x))\right] = \mathbb{E}_p[T(X)] - \mathbb{E}_q[f^*(T(X))]$$

This gives $D_f(p \| q) \geq \sup_T \{\mathbb{E}_p[T] - \mathbb{E}_q[f^*(T)]\}$.

For the reverse inequality, set $T^*(x) = f'(p(x)/q(x))$ — the derivative of $f$ at the density ratio. The Fenchel–Young equality $f(t) = t f'(t) - f^*(f'(t))$ (holding when $f$ is differentiable) shows this achieves equality.

$\square$

Donsker–Varadhan and NWJ Bounds

For KL divergence specifically, $f(t) = t \log t$ gives $f^*(s) = e^{s-1}$, yielding:

Theorem 6 (Donsker–Varadhan Representation).

$$D_{\mathrm{KL}}(p \| q) = \sup_T \left\{\mathbb{E}_p[T(X)] - \log \mathbb{E}_q[e^{T(X)}]\right\}$$

The supremum over all measurable functions $T: \mathcal{X} \to \mathbb{R}$ is achieved at $T^*(x) = \log(p(x)/q(x)) + C$ for any constant $C$.

The Nguyen–Wainwright–Jordan (NWJ) bound is a related lower bound that replaces $\log \mathbb{E}_q[e^T]$ with $\mathbb{E}_q[e^{T-1}]$, which is easier to optimize:

$$D_{\mathrm{KL}}(p \| q) \geq \sup_T \left\{\mathbb{E}_p[T(X)] - \mathbb{E}_q[e^{T(X) - 1}]\right\}$$
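On a finite alphabet the optimal critics are known in closed form, so both bounds can be checked directly. A sketch (using $T^* = \log(p/q)$ for Donsker–Varadhan and $T^* = 1 + \log(p/q)$ for NWJ; the distributions are arbitrary):

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.3, 0.5])

kl = np.sum(p * np.log(p / q))  # exact KL in nats

# Donsker–Varadhan objective at the optimal critic T* = log(p/q)
T = np.log(p / q)
dv = np.sum(p * T) - np.log(np.sum(q * np.exp(T)))

# NWJ objective at its optimal critic T* = 1 + log(p/q)
T_nwj = 1 + np.log(p / q)
nwj = np.sum(p * T_nwj) - np.sum(q * np.exp(T_nwj - 1))

print(np.isclose(dv, kl))   # DV is tight at its optimum
print(np.isclose(nwj, kl))  # NWJ is also tight at its optimum
```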

Connection to GANs

Remark (GAN as JS Minimization).

The original GAN objective (Goodfellow et al., 2014) is the variational representation of the Jensen–Shannon divergence. The discriminator $D(x)$ plays the role of the variational function $T(x)$, and the optimal discriminator satisfies $D^*(x) = p(x)/(p(x) + q(x))$ — a monotone transform of the density ratio between real and generated distributions.

More generally, the f-GAN framework (Nowozin et al., 2016) shows that any f-divergence can serve as a GAN objective: train the generator to minimize $D_f(p_{\text{real}} \| p_{\text{gen}})$ and the discriminator to maximize the variational lower bound. The choice of $f$ determines the GAN’s training dynamics and mode-collapse behavior.

Variational representations — Fenchel conjugate, NWJ bound, and GAN connection


Rényi Divergence & $\alpha$-Divergences

The f-divergence family is parameterized by a function. Rényi divergences provide a complementary parameterization by a single scalar $\alpha$, interpolating between familiar divergences.

Definition 6 (Rényi Divergence).

The Rényi divergence of order $\alpha > 0$, $\alpha \neq 1$, from $p$ to $q$ is

$$D_\alpha(p \| q) = \frac{1}{\alpha - 1} \log \sum_{x \in \mathcal{X}} p(x)^\alpha\, q(x)^{1-\alpha}$$

with the convention that $D_\alpha(p \| q) = +\infty$ if $p(x) > 0$ and $q(x) = 0$ for some $x$.

The Rényi family provides a continuous spectrum of divergences. As $\alpha$ varies, we recover fundamental information-theoretic quantities:

  • $\alpha \to 0$: $-\log q(\mathrm{supp}(p))$ — support divergence
  • $\alpha = 1/2$: $-2\log\sum_x \sqrt{p(x)q(x)}$ — twice the Bhattacharyya distance
  • $\alpha \to 1$: $D_{\mathrm{KL}}(p \| q)$ — KL divergence (by L’Hôpital)
  • $\alpha = 2$: $\log\sum_x p(x)^2/q(x)$ — related to $\chi^2$ divergence
  • $\alpha \to \infty$: $\log\max_x p(x)/q(x)$ — max-divergence

The convergence $D_\alpha \to D_{\mathrm{KL}}$ as $\alpha \to 1$ is verified by L’Hôpital’s rule applied to the $0/0$ indeterminate form in the definition.

Theorem 7 (Monotonicity of Rényi Divergence).

For fixed distributions $p$ and $q$, the map $\alpha \mapsto D_\alpha(p \| q)$ is non-decreasing on $(0, \infty)$.

Proof.

Define $M(\alpha) = \sum_x p(x)^\alpha q(x)^{1-\alpha}$, so $D_\alpha = \log M(\alpha) / (\alpha - 1)$. Taking the derivative and applying Hölder’s inequality shows $dD_\alpha/d\alpha \geq 0$. The key step uses the log-convexity of $M(\alpha)$ — a consequence of Hölder’s inequality applied to the sum $\sum p^\alpha q^{1-\alpha}$ with conjugate exponents $1/\alpha$ and $1/(1-\alpha)$.

$\square$

This monotonicity has important consequences:

  • Differential privacy: The $(\varepsilon, \delta)$-DP guarantee is controlled by the max-divergence ($\alpha = \infty$). Rényi DP (Mironov, 2017) uses finite $\alpha$ to get tighter composition bounds, leveraging the monotonicity to relate different privacy definitions.
  • Hypothesis testing: The Chernoff information, which governs the optimal Bayesian error exponent, is $\max_{0 < \alpha < 1}\, (1 - \alpha)\, D_\alpha(p \| q)$.

Rényi divergence — monotonicity in α, convergence to KL at α = 1, and ordering of special cases

def renyi_divergence(p, q, alpha):
    """Rényi divergence D_alpha(p || q) in nats."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    if abs(alpha - 1.0) < 1e-10:
        # Limit: KL divergence in nats
        mask = p > 0
        if np.any(q[mask] <= 0):
            return np.inf
        return np.sum(p[mask] * np.log(p[mask] / q[mask]))
    summand = np.sum(p ** alpha * q ** (1 - alpha))
    if summand <= 0:
        return np.inf
    return np.log(summand) / (alpha - 1)

# Verify: D_alpha → D_KL as alpha → 1
p = np.array([0.7, 0.2, 0.1])
q = np.array([0.3, 0.4, 0.3])
alphas = [0.5, 0.9, 0.99, 0.999, 1.0, 1.001, 1.01, 1.1, 2.0]
for a in alphas:
    print(f"D_{a:.3f}(p || q) = {renyi_divergence(p, q, a):.6f} nats")
# Observe smooth convergence to D_KL at α = 1

Computational Notes

Plug-In Estimation from Samples

Given i.i.d. samples from $p$ and $q$, the simplest KL estimator replaces the true distributions with empirical histograms:

$$\hat{D}_{\mathrm{KL}}(\hat{p} \| \hat{q}) = \sum_x \hat{p}(x) \log_2 \frac{\hat{p}(x)}{\hat{q}(x)}$$

This plug-in estimator is consistent but converges slowly — $O(1/\sqrt{n})$ in general — and is biased in finite samples due to the nonlinearity of $\log$. For continuous distributions, binning introduces discretization error, and kernel-based or $k$-NN estimators (Pérez-Cruz, 2008; Wang et al., 2009) are preferred.
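A sketch of the plug-in estimator in action, with add-one (Laplace) smoothing to keep empirical bins positive (the smoothing is our choice for this sketch, not part of the estimator’s definition):

```python
import numpy as np

rng = np.random.default_rng(42)
support = np.arange(4)
p = np.array([0.4, 0.35, 0.15, 0.1])
q = np.array([0.25, 0.25, 0.25, 0.25])
true_kl = np.sum(p * np.log2(p / q))  # ≈ 0.1984 bits

def plugin_kl(n):
    """Plug-in KL estimate from n samples of each distribution."""
    xp = rng.choice(support, size=n, p=p)
    xq = rng.choice(support, size=n, p=q)
    p_hat = (np.bincount(xp, minlength=4) + 1) / (n + 4)  # Laplace smoothing
    q_hat = (np.bincount(xq, minlength=4) + 1) / (n + 4)
    return np.sum(p_hat * np.log2(p_hat / q_hat))

for n in [100, 1000, 10000, 100000]:
    est = np.mean([plugin_kl(n) for _ in range(20)])
    print(f"n = {n:6d}: plug-in estimate ≈ {est:.4f} (true {true_kl:.4f})")
```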

Cross-Entropy Loss Decomposition

During training of a classifier with model $q_\theta$ and true distribution $p$:

$$H(p, q_\theta) = H(p) + D_{\mathrm{KL}}(p \| q_\theta)$$

The entropy $H(p)$ is the irreducible noise floor — we cannot reduce the loss below $H(p)$ no matter how good the model. The KL divergence $D_{\mathrm{KL}}(p \| q_\theta)$ is the reducible component that training shrinks toward zero. For one-hot labels, $H(p) = 0$ and cross-entropy equals KL divergence.

ELBO as Reverse KL

The evidence lower bound (ELBO) in variational inference decomposes as:

$$\text{ELBO}(q_\phi) = \log p(x) - D_{\mathrm{KL}}(q_\phi(z) \| p(z|x))$$

Since $\log p(x)$ is fixed, maximizing the ELBO is equivalent to minimizing the reverse KL to the posterior. The gap between $\log p(x)$ and the ELBO is exactly the reverse KL divergence — a direct measure of how well the approximate posterior fits the true posterior.

Practical Implementations

import numpy as np
from scipy.special import rel_entr
import torch
import torch.nn.functional as F

# SciPy: KL divergence via relative entropy
# rel_entr(p, q) returns p * log(p/q) elementwise (in nats)
p = np.array([0.4, 0.35, 0.15, 0.1])
q = np.array([0.25, 0.25, 0.25, 0.25])
kl_nats = np.sum(rel_entr(p, q))  # D_KL in nats

# PyTorch: cross-entropy loss (forward KL for one-hot labels)
logits = torch.tensor([2.0, 1.0, 0.5, -0.5])
target = torch.tensor(0)  # one-hot: class 0
ce_loss = F.cross_entropy(logits, target)  # -log(softmax(logits)[0])

# PyTorch: KL divergence (expects log-probabilities for input)
log_q = F.log_softmax(logits, dim=0)
p_tensor = torch.tensor([0.4, 0.35, 0.15, 0.1])
kl = F.kl_div(log_q, p_tensor, reduction='sum')  # D_KL(p || q) in nats

Computational divergences — KL estimation convergence, cross-entropy loss decomposition, ELBO trajectory


Connections & Further Reading

KL divergence and its generalizations connect information theory to optimization, geometry, and learning theory. This topic sits at the intersection of several curriculum tracks:

  • Shannon Entropy & Mutual Information (Information Theory) — KL divergence decomposes as $D_{\mathrm{KL}}(p \| q) = H(p, q) - H(p)$. Mutual information is $I(X;Y) = D_{\mathrm{KL}}(p_{XY} \| p_X p_Y)$. The data processing inequality for mutual information is a special case of the f-divergence DPI.

  • Information Geometry & Fisher Metric (Differential Geometry) — The Fisher information matrix is the Hessian of KL divergence at $p = q$. The dual $e$- and $m$-connections arise from the asymmetry of KL. The $\alpha$-connections generalize via Rényi divergences.

  • Convex Analysis (Optimization) — f-divergences are defined through convex generators; the Fenchel conjugate $f^*$ yields the variational representation; joint convexity makes divergence minimization a convex program.

  • Measure-Theoretic Probability (Probability & Statistics) — Continuous KL requires the Radon–Nikodym derivative. Absolute continuity ($P \ll Q$) is necessary for finite KL divergence.

Downstream on this track

  • Rate-Distortion Theory — the minimum rate $R(D)$ for encoding a source at distortion level $D$ is an optimization over mutual information — itself a KL divergence — subject to a distortion constraint. The Blahut–Arimoto algorithm iteratively minimizes KL divergence to compute $R(D)$.
  • Minimum Description Length — model selection via code length: the regret of a universal code is bounded by the KL divergence between the true and estimated distributions. The minimax regret connects to the capacity of the model class.

Notation Reference

  • $D_{\mathrm{KL}}(p \| q)$ — KL divergence from $p$ to $q$
  • $H(p, q)$ — Cross-entropy from $p$ to $q$
  • $D_f(p \| q)$ — f-divergence with generator $f$
  • $f^*(s) = \sup_t\{st - f(t)\}$ — Fenchel conjugate of $f$
  • $\mathrm{TV}(p, q)$ — Total variation distance
  • $\mathrm{JS}(p \| q)$ — Jensen–Shannon divergence
  • $D_\alpha(p \| q)$ — Rényi divergence of order $\alpha$


References & Further Reading