advanced learning-theory 75 min read

The Information Bottleneck

Compressing X while preserving information about Y

1. What problem does the information bottleneck solve?

1.1 Predictive but compact: the trade-off as a design principle

Representations sit between raw data and downstream tasks. A useful representation does two opposed things at once: it throws away what the task doesn’t need (so we aren’t swamped by noise), and it keeps what the task does need (so we can still answer the question). The information bottleneck principle formalizes exactly that opposition as a single optimization problem.

The setup names three random variables. The input XX is whatever the world hands us — a document, an image, a sensor trace. The target YY is whatever we ultimately care about — a topic label, a class, a downstream measurement. The representation TT is the compressed object we want to learn, and we require TT to be a (possibly stochastic) function of XX alone. Formally, TT is conditionally independent of YY given XX:

Y    X    T,Y \;\to\; X \;\to\; T,

which is the Markov chain we’ll be working inside for the rest of the topic. Concretely, TT‘s distribution is fully specified by an encoder p(tx)p(t \mid x) — the only knob the IB principle gets to turn.

Two strategies fail immediately. If we set T=XT = X, the encoder is the identity; we have compressed nothing and we have learned nothing. If we set TT to a constant, the encoder throws everything away; we have compressed perfectly and we have predicted nothing. Useful representations live in between, and the IB principle picks the trade-off explicitly via the Lagrangian

Lβ(p(tx))  =  I(X;T)    βI(T;Y),β>0.(1.1)\mathcal{L}_\beta\bigl(p(t\mid x)\bigr) \;=\; I(X;T) \;-\; \beta\, I(T;Y), \qquad \beta > 0. \quad\quad (1.1)

The first term is the cost of remembering: how much information about XX leaks through the encoder. The second is the value of predicting: how much information about YY the representation TT retains. The single scalar β\beta shifts where we want to land between the two failure modes. As β0\beta \to 0, only compression matters and the minimizer collapses to a constant. As β\beta \to \infty, only prediction matters and the minimizer recovers the lossless representation. Everything interesting happens in between, on the IB curve — the Pareto frontier in the (I(X;T),I(T;Y))(I(X;T),\, I(T;Y)) plane traced out as β\beta sweeps.

We will rederive every piece of this picture over the next eleven sections. For now, it is enough to know that we have a single-parameter family of optimization problems and that the parameter has an interpretable meaning: β\beta is the bit-for-bit exchange rate between compression and prediction.

1.2 A motivating vignette: clustering documents by topic

The original Tishby–Pereira–Bialek paper grounds the principle in document clustering, which is still the cleanest first example. Imagine eight documents drawn from two hidden topics — call them “sports” and “finance.” Each document x{0,1,,7}x \in \{0,1,\ldots,7\} is summarized by a word-count signature, and the topic label y{0,1}y \in \{0,1\} is what a downstream task cares about. A representation TT is a (soft) clustering: the encoder p(tx)p(t \mid x) says how strongly document xx is assigned to cluster tt.

We give every document equal mass and arrange the topic structure as cleanly as possible: documents 00 through 33 come from topic 00, documents 44 through 77 from topic 11. The joint distribution p(x,y)p(x,y) then has uniform mass 1/81/8 on its eight-point support, and the marginals come out to p(x)=1/8p(x) = 1/8 for every document and p(y)=1/2p(y) = 1/2 for each topic. Quick consequences:

H(X)=log28=3,H(Y)=log22=1,I(X;Y)=H(Y)=1.H(X) = \log_2 8 = 3, \qquad H(Y) = \log_2 2 = 1, \qquad I(X;Y) = H(Y) = 1.

The last equality uses that YY is a deterministic function of XX on this construction, so H(YX)=0H(Y\mid X) = 0 and I(X;Y)=H(Y)H(YX)=1I(X;Y) = H(Y) - H(Y\mid X) = 1.

Three reference clusterings stake out the corners of the IB plane:

ClusteringI(X;T)I(X;T)I(T;Y)I(T;Y)What it does
T=constT = \text{const}0000Full compression. All documents in one cluster — predictiveness destroyed.
T=YT = Y1111Topic-aligned. Two clusters, one per topic — minimum compression that still captures all of YY.
T=XT = X3311Identity. Eight clusters, one per document — zero compression, but no additional predictiveness over T=YT = Y.

The third row is the punchline: keeping all three bits of XX buys exactly the same one bit of YY-information as keeping only the one bit that says “which topic.” Compressing XX from three bits down to one bit is therefore free on this construction — the extra two bits of XX are pure noise from YY‘s point of view. The IB curve, traced over β\beta, will recover this fact algorithmically without being told what YY is in advance. We come back to this same eight-document table in §5 when we run the discrete IB algorithm on it.

Five reference clusterings on the 8-document toy plotted in the IB plane
Five reference clusterings on the eight-document, two-topic toy, plotted by their (I(X;T), I(T;Y)) coordinates in bits. The diagonal I(T;Y) ≤ I(X;T) is a hard upper bound (data-processing inequality); the dashed horizontal at I(X;Y) = 1 bit is the DPI ceiling. T = X (3, 1) and the 4-cluster topic-pure pairs (2, 1) both sit on the ceiling, but T = Y (1, 1) achieves it at the smallest possible I(X;T). The mis-aligned 2-cluster split shows that compression alone is not enough — its (1, 0) coordinate keeps one bit of X but throws away the predictively useful one.

Two clusters aligned with the two topics. Captures all of Y at the smallest possible I(X;T) — sits on the optimal corner of the IB plane.

Assignment p(t | x)
t = 0t = 1x0x1x2x3x4x5x6x7y=0y=0y=0y=0y=1y=1y=1y=1
I(X;T) = 1.000 bits
I(T;Y) = 1.000 bits

1.3 Why mutual information is the right currency

Why measure “informativeness” with mutual information rather than (say) classification accuracy, correlation, or one of the many divergences floating around in the ML literature? Four reasons make MI the natural choice for both axes of the IB plane.

Invariance to relabeling. I(X;T)I(X;T) depends only on the joint distribution p(x,t)p(x,t), not on how cluster names are assigned. Two clusterings that partition XX identically have the same mutual information regardless of whether we call the clusters "{0,1}\{0,1\}" or ”{sports,finance}\{\text{sports}, \text{finance}\}.” Compression should count statistical structure, not labels, and MI does this by construction.

A coding interpretation. Shannon’s source-coding theorem (the foundation we lean on from Shannon entropy) tells us that I(X;T)I(X;T) is, asymptotically, the average number of bits per sample needed to communicate TT to a receiver who already knows XX. So I(X;T)I(X;T) has a literal units interpretation — bits. The compression term in the IB Lagrangian carries a cost in the same currency as the predictiveness term.

A clean entropy decomposition. Writing H(X)=I(X;T)+H(XT)H(X) = I(X;T) + H(X \mid T) splits the entropy of XX into “the bits about XX captured by TT” and “the bits about XX lost in TT.” Maximizing I(X;T)I(X;T) is exactly minimizing what we lose; minimizing I(X;T)I(X;T) is exactly maximizing what we throw away. The IB Lagrangian is asking us to throw away what is irrelevant and to keep what is relevant, both measured in the same units.

The data-processing inequality. For any Markov chain TXYT \leftarrow X \to Y,

I(T;Y)    I(X;Y).(1.2)I(T;Y) \;\le\; I(X;Y). \quad\quad (1.2)

No matter what encoder we cook up from XX, the representation TT cannot predict YY better than XX itself can. This places a hard ceiling on the predictiveness side of the IB plane: the IB curve is bounded above by the constant I(X;Y)I(X;Y). It also tells us when compression is genuinely free. If I(X;Y)H(X)I(X;Y) \ll H(X) — that is, if XX has lots of bits unrelated to YY — then there is plenty of room to compress without paying anything in predictiveness. Our document-clustering toy has H(X)=3H(X) = 3 and I(X;Y)=1I(X;Y) = 1, so two of the three bits in XX are pure noise from YY‘s standpoint, and the IB algorithm will discard them as β\beta varies.

These four properties together — invariance, units, decomposition, the DPI ceiling — make I(X;T)I(X;T) and I(T;Y)I(T;Y) the natural axes of the trade-off. The IB Lagrangian is, in a real sense, the most parsimonious functional you can write that respects all four.

1.4 Roadmap and what’s not in scope

The topic covers three substantive arcs.

Discrete IB (Tishby–Pereira–Bialek 1999), §§3–5. We derive the stationarity conditions on Lβ\mathcal{L}_\beta by variational calculus on the encoder, obtain the three coupled updates for p(tx)p(t\mid x), p(t)p(t), and p(yt)p(y\mid t), prove convergence of the iterative algorithm via a Lyapunov argument inherited from the Blahut–Arimoto algorithm in rate-distortion theory, and run the algorithm on the eight-document toy to trace out the IB curve.

Gaussian IB (Chechik–Globerson–Tishby–Weiss 2005), §§6–7. When (X,Y)(X,Y) are jointly Gaussian, the IB problem has a closed-form solution: the optimal encoder is linear-plus-Gaussian-noise, and the optimal projection directions are the canonical-correlation directions of ΣX1/2ΣXYΣX1/2\Sigma_X^{-1/2}\, \Sigma_{X\mid Y}\, \Sigma_X^{-1/2}. The IB curve becomes piecewise-analytic, with phase transitions at critical βc\beta_c values where new directions activate. This is the analytical sandbox where every quantity is exactly computable.

Variational IB and the deep-learning lift, §§8–10. The Alemi–Fischer–Dillon–Murphy 2017 variational lower bound on Lβ-\mathcal{L}_\beta makes the IB Lagrangian tractable for encoders that are too high-dimensional to enumerate, including neural-network encoders. We will derive the bound, work through the reparametrization trick, and exercise the construction on a closed-form linear-Gaussian VIB sandbox. Then we take the deep-learning controversy head-on: Tishby–Zaslavsky 2015 and Shwartz-Ziv–Tishby 2017 argued that deep networks generically traverse a “fitting then compression” trajectory in the information plane; Saxe et al. 2018 showed that the compression phase disappears when the activation is ReLU instead of tanh, and may have been an artifact of how MI was estimated. We will present both sides faithfully.

Connections and limits, §§11–12. IB is mathematically adjacent to several other principles: rate-distortion (the closest parent — same Lagrangian template, different distortion), PAC-Bayes (which uses a KL term as a complexity penalty, structurally analogous to the I(X;T)I(X;T) term), and InfoNCE in contrastive learning (which is provably an MI lower bound, so contrastive methods are implicitly doing IB-style optimization). We close with computational notes — logsumexp stability, plug-in MI estimator pitfalls — and a list of honest limits.

What we are not doing. We are not training a deep VIB on real image data; the under-60-second CPU runtime budget rules it out, and Alemi et al.’s MNIST numbers will be discussed but not reproduced. We are not adjudicating the deep-learning debate; we will present the controversy faithfully and let the reader form their own view. We are not building a full rate-distortion theory from scratch; that lives in rate-distortion, and we’ll cite forward to it at the relevant point in §11.

2. The information bottleneck Lagrangian

2.1 The setting and the Markov chain

Let XX be the input random variable, YY the target, and TT the learned representation. We assume the joint distribution p(x,y)p(x, y) is given. In §3 the iterative algorithm will use this distribution directly; in §12.2 we’ll discuss how to estimate it from a finite sample. The triple (X,Y,T)(X, Y, T) satisfies the Markov property

Y    X    T,Y \;\to\; X \;\to\; T,

equivalent to the conditional-independence statement TYXT \perp Y \mid X. In words: TT is allowed to depend on XX only through XX itself, never directly through YY. This is exactly what it means for TT to be a (possibly randomized) function of XX.

The single degree of freedom in the problem is the encoder

p(tx),tT.p(t \mid x), \qquad t \in \mathcal{T}.

We are free to choose the alphabet T\mathcal{T}, including its cardinality. For the discrete IB of §§3–5 we take T=k|\mathcal{T}| = k for some user-specified kk; for the Gaussian IB of §§6–7 we take T=Rd\mathcal{T} = \mathbb{R}^d; for the variational IB of §§8–10 the alphabet is whatever the variational family supports.

Three derived distributions matter, and the IB algorithm of §3 will turn out to update exactly these three:

p(t)  =  xp(x)p(tx),p(xt)  =  p(x)p(tx)p(t),p(yt)  =  xp(yx)p(xt).(2.1)p(t) \;=\; \sum_x p(x)\,p(t \mid x), \qquad p(x \mid t) \;=\; \frac{p(x)\,p(t \mid x)}{p(t)}, \qquad p(y \mid t) \;=\; \sum_x p(y \mid x)\,p(x \mid t). \quad\quad (2.1)

The first is the cluster marginal. The second is the cluster-conditional distribution of XX, by Bayes. The third is the cluster-conditional distribution of YY, computed using the Markov property p(yx,t)=p(yx)p(y \mid x, t) = p(y \mid x) — the encoder’s job is to forward whatever it knows about XX, not to invent anything new about YY.

2.2 The variational problem

The IB problem has two equivalent forms, and it pays to be precise about both.

The constrained form. Fix a target predictiveness level IY[0,I(X;Y)]I_Y^* \in [0, I(X;Y)]. The problem is

minp(tx)  I(X;T)subject toI(T;Y)IY,tp(tx)=1  x.(2.2)\min_{p(t \mid x)} \; I(X; T) \quad \text{subject to} \quad I(T; Y) \ge I_Y^*, \quad \sum_t p(t \mid x) = 1 \; \forall x. \quad\quad (2.2)

We want the most compressed representation whose predictiveness is at least IYI_Y^*. Sweeping IYI_Y^* across its allowed range traces out the IB curve.

The Lagrangian form. Introduce a Lagrange multiplier β0\beta \ge 0 for the predictiveness constraint and, for now, handle the normalization constraints implicitly. The Lagrangian we will use from §3 onward is

Fβ(p(tx))  =  I(X;T)    βI(T;Y),minimized over normalized encoders.(2.3)\mathcal{F}_\beta(p(t \mid x)) \;=\; I(X; T) \;-\; \beta\, I(T; Y), \qquad \text{minimized over normalized encoders.} \quad\quad (2.3)

The Lagrangian form is a relaxation: for each β\beta, the minimizer of Fβ\mathcal{F}_\beta also solves (2.2) for some value of IYI_Y^* (namely the value I(T;Y)I(T;Y) achieved by the minimizer), and the two forms generate the same Pareto frontier. We work with the Lagrangian form because its stationarity conditions differentiate cleanly — see §3.2.

A subtlety worth flagging now: the Lagrangian relaxation can miss points on the IB curve where the curve has a corner (i.e., where the right-hand and left-hand slopes disagree). At such points, no single β\beta produces that operating point as a Lagrangian minimum — only an open interval of β\beta values does, and the achievable point sits “between” the two corner branches. This will matter in §7 when we discuss Gaussian-IB phase transitions, where the curve has exactly such corners.

2.3 The IB curve in the information plane

Define the IB curve as the value function of (2.2):

IY(R)  :=  supp(tx){I(T;Y)  :  I(X;T)R},R[0,).(2.4)I_Y^*(R) \;:=\; \sup_{p(t \mid x)} \Bigl\{\, I(T; Y) \;:\; I(X; T) \le R \,\Bigr\}, \qquad R \in [0, \infty). \quad\quad (2.4)

This is the upper Pareto boundary of the achievable region in the (I(X;T),I(T;Y))(I(X;T), I(T;Y)) plane — every operating point sits on or below it.

Theorem 1 (IB-curve shape).

IYI_Y^* is non-decreasing, concave, and bounded above by I(X;Y)I(X; Y), with equality IY(R)=I(X;Y)I_Y^*(R) = I(X; Y) achieved for all RR large enough to support a sufficient statistic for YY in XX.

Proof.

Monotonicity. Immediate: enlarging the constraint I(X;T)RI(X;T) \le R enlarges the feasible set, so the supremum cannot decrease.

Concavity by time-sharing. Fix R1,R20R_1, R_2 \ge 0 and α(0,1)\alpha \in (0, 1). Let ϵ>0\epsilon > 0 and pick encoders p1(tx)p_1(t \mid x) over an alphabet T1\mathcal{T}_1 and p2(tx)p_2(t \mid x) over an alphabet T2\mathcal{T}_2 satisfying

I(X;Ti)Ri,I(Ti;Y)IY(Ri)ϵ,i=1,2,I(X; T_i) \le R_i, \qquad I(T_i; Y) \ge I_Y^*(R_i) - \epsilon, \qquad i = 1, 2,

where TiT_i is the representation produced by encoder pip_i. By relabeling we may assume T1T2=\mathcal{T}_1 \cap \mathcal{T}_2 = \emptyset.

Now introduce a Bernoulli switch SBern(α)S \sim \mathrm{Bern}(\alpha), drawn independently of everything else, and define the time-shared representation

Tα  =  {T1if S=1,T2if S=0,T_\alpha \;=\; \begin{cases} T_1 & \text{if } S = 1,\\ T_2 & \text{if } S = 0,\end{cases}

which corresponds to the encoder pα(tx)=αp1(tx)1{tT1}+(1α)p2(tx)1{tT2}p_\alpha(t \mid x) = \alpha\, p_1(t \mid x)\, \mathbf{1}\{t \in \mathcal{T}_1\} + (1 - \alpha)\, p_2(t \mid x)\, \mathbf{1}\{t \in \mathcal{T}_2\}. Because the codomains are disjoint, SS is recoverable from TαT_\alpha, so H(STα)=0H(S \mid T_\alpha) = 0.

Compression side. Using H(STα)=0H(S \mid T_\alpha) = 0 and the chain rule,

I(X;Tα)  =  I(X;Tα,S)  =  I(X;S)=0+I(X;TαS)  =  αI(X;T1)+(1α)I(X;T2),I(X; T_\alpha) \;=\; I(X; T_\alpha, S) \;=\; \underbrace{I(X; S)}_{= 0} \,+\, I(X; T_\alpha \mid S) \;=\; \alpha\, I(X; T_1) \,+\, (1 - \alpha)\, I(X; T_2),

where I(X;S)=0I(X; S) = 0 because SS is independent of XX, and the conditional MI splits because conditional on S=sS = s, the representation TαT_\alpha is exactly TsT_s. Hence I(X;Tα)αR1+(1α)R2I(X; T_\alpha) \le \alpha R_1 + (1 - \alpha) R_2.

Prediction side. By the same argument,

I(Tα;Y)  =  I(Tα,S;Y)  =  I(S;Y)=0+I(Tα;YS)  =  αI(T1;Y)+(1α)I(T2;Y),I(T_\alpha; Y) \;=\; I(T_\alpha, S; Y) \;=\; \underbrace{I(S; Y)}_{= 0} \,+\, I(T_\alpha; Y \mid S) \;=\; \alpha\, I(T_1; Y) \,+\, (1 - \alpha)\, I(T_2; Y),

which is at least αIY(R1)+(1α)IY(R2)ϵ\alpha\, I_Y^*(R_1) + (1 - \alpha)\, I_Y^*(R_2) - \epsilon.

Hence TαT_\alpha is a feasible encoder for the constraint level αR1+(1α)R2\alpha R_1 + (1 - \alpha) R_2, with predictiveness at least αIY(R1)+(1α)IY(R2)ϵ\alpha\, I_Y^*(R_1) + (1 - \alpha)\, I_Y^*(R_2) - \epsilon. Therefore

IY(αR1+(1α)R2)    αIY(R1)+(1α)IY(R2)ϵ.I_Y^*\bigl(\alpha R_1 + (1 - \alpha) R_2\bigr) \;\ge\; \alpha\, I_Y^*(R_1) + (1 - \alpha)\, I_Y^*(R_2) - \epsilon.

Letting ϵ0\epsilon \to 0 gives concavity.

The ceiling. The data-processing inequality applied to the Markov chain YXTY \to X \to T gives I(T;Y)I(X;Y)I(T; Y) \le I(X; Y) for every encoder, so IY(R)I(X;Y)I_Y^*(R) \le I(X; Y) for all RR. Equality is achieved at any RRR \ge R^\dagger, where RR^\dagger is the rate needed to encode a minimal sufficient statistic for YY in XX. When such a statistic exists with finite alphabet, RR^\dagger is finite; otherwise the equality is achieved only in the limit RR \to \infty.

The IB curve is therefore a concave non-decreasing curve starting at the origin, climbing as RR grows, and saturating at I(X;Y)I(X; Y) once we can afford to encode a sufficient statistic. Past saturation, additional compression budget is wasted on bits of XX that don’t help with YY. The whole geometric content of the IB principle lives in the shape of this curve.

2.4 What β controls

The Lagrange multiplier β\beta in Fβ\mathcal{F}_\beta admits a clean geometric reading: it is the reciprocal of the tangent slope of the IB curve at the operating point.

Suppose p(tx)p^*(t \mid x) minimizes Fβ\mathcal{F}_\beta and achieves the point (R,IY)(R^*, I_Y^*) on the IB curve, with IY=IY(R)I_Y^* = I_Y^*(R^*). By the KKT conditions for problem (2.2), β\beta is the dual variable for the predictiveness constraint, and where the curve is differentiable the tangent at (R,IY)(R^*, I_Y^*) has slope 1/β1/\beta:

dIYdRR=R  =  1β.(2.5)\left.\frac{d I_Y^*}{d R}\right|_{R = R^*} \;=\; \frac{1}{\beta}. \quad\quad (2.5)

This identity organizes the β\beta knob into three regimes.

Small β (large slope, near origin). The compression term dominates the Lagrangian. In the limit β0+\beta \to 0^+, the optimum collapses to R=0R^* = 0 and T=constT = \text{const}.

Large β (small slope, near saturation). The predictiveness term dominates. As β\beta \to \infty, the optimum recovers a minimal sufficient statistic: IY=I(X;Y)I_Y^* = I(X; Y) at the smallest RR that supports it.

The critical β_c. Between these regimes sits a sharp threshold. Let σ0:=limR0+dIY/dR\sigma_0 := \lim_{R \to 0^+} dI_Y^*/dR be the initial slope of the IB curve at the origin. For β<1/σ0\beta < 1/\sigma_0, every nontrivial encoder achieves Fβ>0\mathcal{F}_\beta > 0, so the optimum is the trivial T=constT = \text{const}. At βc:=1/σ0\beta_c := 1/\sigma_0, a nontrivial branch of solutions emerges. For the discrete IB, βc\beta_c depends on p(x,y)p(x, y) and is generally found numerically by sweeping β\beta. For the Gaussian IB, σ0\sigma_0 equals the largest eigenvalue λ1\lambda_1 of a canonical-correlation matrix, so βc=1/λ1\beta_c = 1/\lambda_1 — this is the first of several phase transitions worked out in §7.

A concrete preview makes the picture tangible. The scalar-Gaussian case (X,YRX, Y \in \mathbb{R} jointly Gaussian with correlation ρ\rho) admits the closed-form IB curve

IY(R)  =  12log2(1ρ2(122R)),I_Y^*(R) \;=\; -\tfrac{1}{2}\log_2\bigl(1 - \rho^2\,(1 - 2^{-2R})\bigr),

saturating at the predictiveness ceiling I(X;Y)=12log2(1ρ2)I(X;Y) = -\tfrac{1}{2}\log_2(1 - \rho^2). The threshold value βc=1/ρ2\beta_c = 1/\rho^2 is the value below which the trivial encoder is optimal. At ρ=0.8\rho = 0.8, βc=1.5625\beta_c = 1.5625, and the three tangent operating points referenced by the figure below land at (0.254,0.152)(0.254, 0.152) bits for β=1.8\beta = 1.8, (1.208,0.529)(1.208, 0.529) bits for β=4\beta = 4, and (2.145,0.674)(2.145, 0.674) bits for β=12\beta = 12 — each with tangent slope 1/β1/\beta, confirming the KKT identity (2.5). We derive this scalar-Gaussian result formally in §6 and §7; the figure previews the general shape.

Scalar-Gaussian IB curves at four correlations, plus tangent lines at three β operating points
Left: scalar-Gaussian IB curves I_Y*(R) = −½ log₂(1 − ρ²(1 − 2⁻²ᴿ)) for ρ ∈ {0.35, 0.55, 0.80, 0.95}; saturation at I(X;Y) = −½ log₂(1 − ρ²) shown as dotted horizontals. The diagonal I(T;Y) ≤ I(X;T) (dashed gray) is the DPI ceiling. Right: the ρ = 0.8 curve with three tangent lines at β ∈ {1.8, 4, 12}; the slope at each operating point is 1/β by the KKT identity (2.5). The critical β_c = 1/ρ² = 1.5625 is the value below which the trivial T = const is optimal.

Drag β past β_c = 1/ρ² to escape the trivial encoder; the tangent line's slope is always 1/β, which is the KKT identity (2.5). At ρ = 0.8 the §2.4 reference operating points sit at (0.254, 0.152), (1.208, 0.529), and (2.145, 0.674) bits for β ∈ {1.8, 4, 12}.

This figure previews a result derived formally in §6 (the Gaussian closed form), framed for §2 as a concrete example that makes the curve shape and the β\beta-as-slope interpretation tangible.

3. The IB fixed-point equations

3.1 Stationarity conditions on the Lagrangian

We are looking for encoders p(tx)p^*(t \mid x) that minimize the Lagrangian

Fβ(p(tx))  =  I(X;T)βI(T;Y)(3.1)\mathcal{F}_\beta\bigl(p(t \mid x)\bigr) \;=\; I(X; T) - \beta\, I(T; Y) \quad\quad (3.1)

subject to the normalization constraints tp(tx)=1\sum_t p(t \mid x) = 1 for every xx. Augmenting with multipliers μ(x)\mu(x) gives the full Lagrangian

Lβ  =  I(X;T)βI(T;Y)+xμ(x)[tp(tx)1].(3.2)\mathcal{L}_\beta \;=\; I(X; T) - \beta\, I(T; Y) + \sum_x \mu(x)\left[\sum_t p(t \mid x) - 1\right]. \quad\quad (3.2)

At a stationary point we need Lβ/p(tx)=0\partial \mathcal{L}_\beta / \partial p(t \mid x) = 0 for every (x,t)(x, t) with p(x)>0p(x) > 0, together with the normalization constraint.

Theorem 2 (IB stationarity condition).

At any stationary point of Fβ\mathcal{F}_\beta over normalized encoders,

p(tx)  =  p(t)Z(x,β)exp ⁣[βDKL(p(yx)p(yt))],(3.3)p(t \mid x) \;=\; \frac{p(t)}{Z(x, \beta)}\,\exp\!\Bigl[-\beta\, D_{\mathrm{KL}}\bigl(p(y \mid x) \,\big\|\, p(y \mid t)\bigr)\Bigr], \quad\quad (3.3)

where p(t)=xp(x)p(tx)p(t) = \sum_x p(x)\, p(t \mid x) is the cluster marginal, p(yt)=xp(yx)p(xt)p(y \mid t) = \sum_x p(y \mid x)\, p(x \mid t) is the cluster-conditional target, and

Z(x,β)  =  tp(t)exp ⁣[βDKL(p(yx)p(yt))](3.4)Z(x, \beta) \;=\; \sum_{t} p(t)\, \exp\!\Bigl[-\beta\, D_{\mathrm{KL}}\bigl(p(y \mid x) \,\big\|\, p(y \mid t)\bigr)\Bigr] \quad\quad (3.4)

is the per-input normalization.

Equation (3.3) is the central formula of discrete IB. Read it as a soft-assignment rule: input xx is assigned to cluster tt with probability proportional to the cluster’s prior p(t)p(t) times the exponential of β-\beta times the KL between the true target distribution at xx, namely p(yx)p(y \mid x), and the cluster-conditional target distribution, p(yt)p(y \mid t). Two clusters whose target distributions are close to p(yx)p(y \mid x) both attract xx; the one whose target distribution is closer wins more probability. The bandwidth of “closer wins” is set by β\beta — at small β\beta assignments are diffuse; at large β\beta each input goes almost entirely to its closest cluster.

The two other quantities p(t)p(t) and p(yt)p(y \mid t) are not independent free parameters — they are determined by p(tx)p(t \mid x) through (2.1). So (3.3) is really a self-consistency equation: the encoder p(tx)p(t \mid x) defines the marginal and the decoder, which in turn feed back into the right-hand side. We will turn this self-consistency into the iterative algorithm in §3.3.

3.2 Derivation via variational calculus

We compute the two partial derivatives that show up in (3.2). Throughout we adopt the convention that log\log is the natural log (the conversion to bits is a global factor of 1/ln21/\ln 2 that drops out of the stationarity condition).

Proof.

Partial derivative of I(X;T)I(X; T). Write

I(X;T)  =  tp(t)logp(t)+x,tp(x)p(tx)logp(tx).(3.5)I(X; T) \;=\; -\sum_{t'} p(t') \log p(t') + \sum_{x', t'} p(x')\, p(t' \mid x')\, \log p(t' \mid x'). \quad\quad (3.5)

The first piece depends on p(tx)p(t \mid x) only through p(t)=xp(x)p(tx)p(t) = \sum_{x'} p(x') p(t \mid x'), with p(t)/p(tx)=p(x)δt,t\partial p(t')/\partial p(t \mid x) = p(x)\,\delta_{t, t'}. Differentiating p(t)logp(t)-p(t) \log p(t) with respect to p(t)p(t) gives logp(t)1-\log p(t) - 1, so the chain-rule contribution is (logp(t)+1)p(x)-\bigl(\log p(t) + 1\bigr)\, p(x). The second piece depends explicitly on p(tx)p(t \mid x) at (x,t)=(x,t)(x', t') = (x, t), giving p(x)(logp(tx)+1)p(x)\bigl(\log p(t \mid x) + 1\bigr). The two +1+1‘s cancel:

I(X;T)p(tx)  =  p(x)[logp(tx)logp(t)].(3.6)\frac{\partial I(X; T)}{\partial p(t \mid x)} \;=\; p(x)\bigl[\log p(t \mid x) - \log p(t)\bigr]. \quad\quad (3.6)

Partial derivative of I(T;Y)I(T; Y). Using p(t,y)=xp(x,y)p(tx)p(t, y) = \sum_{x'} p(x', y)\, p(t \mid x'), write

I(T;Y)  =  x,t,yp(x,y)p(tx)logp(yt)  +  H(Y),(3.7)I(T; Y) \;=\; \sum_{x', t', y} p(x', y)\, p(t' \mid x')\, \log p(y \mid t') \;+\; H(Y), \quad\quad (3.7)

where the H(Y)H(Y) term is constant in p(tx)p(t \mid x) and drops out. The remaining sum depends on p(tx)p(t \mid x) both explicitly (through the p(tx)p(t' \mid x') factor at (x,t)=(x,t)(x', t') = (x, t)) and implicitly (through every p(yt)p(y \mid t'), which is itself a function of p(tx)p(t \mid x)).

Explicit contribution. Direct: p(x,y)logp(yt)p(x, y)\, \log p(y \mid t), summed over yy, gives p(x)yp(yx)logp(yt)p(x) \sum_y p(y \mid x)\, \log p(y \mid t).

Implicit contribution. From p(yt)=(xp(x,y)p(tx))/p(t)p(y \mid t') = \bigl(\sum_{x'} p(x', y)\, p(t \mid x')\bigr) / p(t') a short calculation gives

p(yt)p(tx)  =  δt,tp(x)p(t)[p(yx)p(yt)].(3.8)\frac{\partial p(y \mid t')}{\partial p(t \mid x)} \;=\; \delta_{t, t'} \cdot \frac{p(x)}{p(t)}\,\bigl[p(y \mid x) - p(y \mid t)\bigr]. \quad\quad (3.8)

The implicit contribution to I(T;Y)/p(tx)\partial I(T;Y) / \partial p(t \mid x) is then

x,t,yp(x,y)p(tx)1p(yt)p(yt)p(tx)  =  p(x)y[p(yx)p(yt)]  =  0.(3.9)\sum_{x', t', y} p(x', y)\, p(t' \mid x')\, \frac{1}{p(y \mid t')}\, \frac{\partial p(y \mid t')}{\partial p(t \mid x)} \;=\; p(x) \sum_y \bigl[p(y \mid x) - p(y \mid t)\bigr] \;=\; 0. \quad\quad (3.9)

So the implicit term vanishes identically — both summands have yp(y)=1\sum_y p(y \mid \cdot) = 1 and the difference cancels. This is the key simplification of the derivation, and we verify it numerically below. We are left with

I(T;Y)p(tx)  =  p(x)yp(yx)logp(yt).(3.10)\frac{\partial I(T; Y)}{\partial p(t \mid x)} \;=\; p(x) \sum_y p(y \mid x)\, \log p(y \mid t). \quad\quad (3.10)

Assembling the KKT condition. Differentiating (3.2) and setting the result to zero:

p(x)[logp(tx)logp(t)]βp(x)yp(yx)logp(yt)+μ(x)  =  0.(3.11)p(x)\bigl[\log p(t \mid x) - \log p(t)\bigr] - \beta\, p(x) \sum_y p(y \mid x)\, \log p(y \mid t) + \mu(x) \;=\; 0. \quad\quad (3.11)

Dividing through by p(x)>0p(x) > 0 and using

yp(yx)logp(yt)  =  DKL(p(yx)p(yt))H(YX=x),\sum_y p(y \mid x)\, \log p(y \mid t) \;=\; -D_{\mathrm{KL}}\bigl(p(y \mid x) \,\|\, p(y \mid t)\bigr) - H(Y \mid X = x),

we obtain

logp(tx)  =  logp(t)βDKL(p(yx)p(yt))  +  [terms independent of t].(3.12)\log p(t \mid x) \;=\; \log p(t) - \beta\, D_{\mathrm{KL}}\bigl(p(y \mid x) \,\|\, p(y \mid t)\bigr) \;+\; \bigl[\text{terms independent of } t\bigr]. \quad\quad (3.12)

The tt-independent terms (βH(YX=x)\beta H(Y \mid X = x) and μ(x)/p(x)\mu(x) / p(x)) are absorbed into the normalization Z(x,β)Z(x, \beta). Exponentiating and using tp(tx)=1\sum_t p(t \mid x) = 1 produces (3.3) and (3.4).

The implicit-term-vanishes identity (3.9) expresses a structural feature of the IB problem: p(yt)p(y \mid t), viewed as a function of the encoder, is the “Bayes-optimal” estimate of YY given T=tT = t averaged over the encoder’s induced distribution on XX, so the gradient of any quantity computed from p(yt)p(y \mid t) that is “evaluated at its own optimal estimate” picks up no first-order correction. This is the same observation that makes EM monotone and that makes the IB algorithm Blahut–Arimoto-style monotone, which we’ll prove in §4.

Remark (Numerical check of (3.9)).

The notebook (cell 13) verifies (3.9) by comparing the analytical gradient I(T;Y)/p(tx)\partial I(T;Y) / \partial p(t \mid x) — using only the explicit contribution (3.10) — against a finite-difference reference of the full I(T;Y)I(T;Y) on a random soft encoder with k=3k = 3 clusters and ndocs=8n_{\text{docs}} = 8. The maximum and mean absolute differences both come out to 8.66×1028.66 \times 10^{-2}, the finite-difference truncation error at step size 10310^{-3} — the implicit term contributes nothing detectable, consistent with the identity. If the implicit term were nonzero, the discrepancy would grow with the step size; instead it shrinks, the signature of pure O(h2)O(h^2) truncation error.

3.3 The three coupled updates

A stationary point of Fβ\mathcal{F}_\beta must satisfy three coupled self-consistency equations:

p(tx)  =  p(t)Z(x,β)exp ⁣[βDKL(p(yx)p(yt))],p(t)  =  xp(x)p(tx),p(yt)  =  1p(t)xp(yx)p(x)p(tx).(3.13)\begin{aligned} p(t \mid x) &\;=\; \frac{p(t)}{Z(x, \beta)}\, \exp\!\bigl[-\beta\, D_{\mathrm{KL}}(p(y \mid x) \,\|\, p(y \mid t))\bigr], \\[4pt] p(t) &\;=\; \sum_x p(x)\, p(t \mid x), \\[4pt] p(y \mid t) &\;=\; \frac{1}{p(t)} \sum_x p(y \mid x)\, p(x)\, p(t \mid x). \end{aligned} \quad\quad (3.13)

These are the IB equations. The natural way to attack a coupled fixed-point system like (3.13) is alternating substitution: hold two of the three quantities fixed, update the third, repeat. Concretely, fix some initial encoder p(0)(tx)p^{(0)}(t \mid x) and iterate

p(n)(t)  =  xp(x)p(n)(tx),p(n)(yt)  =  1p(n)(t)xp(yx)p(x)p(n)(tx),p(n+1)(tx)  =  p(n)(t)Z(n)(x,β)exp ⁣[βDKL(p(yx)p(n)(yt))].(3.14)\begin{aligned} p^{(n)}(t) &\;=\; \sum_x p(x)\, p^{(n)}(t \mid x), \\[2pt] p^{(n)}(y \mid t) &\;=\; \frac{1}{p^{(n)}(t)} \sum_x p(y \mid x)\, p(x)\, p^{(n)}(t \mid x), \\[2pt] p^{(n+1)}(t \mid x) &\;=\; \frac{p^{(n)}(t)}{Z^{(n)}(x, \beta)}\, \exp\!\bigl[-\beta\, D_{\mathrm{KL}}(p(y \mid x) \,\|\, p^{(n)}(y \mid t))\bigr]. \end{aligned} \quad\quad (3.14)

This is the iterative IB algorithm (Tishby, Pereira, and Bialek 1999). §4 proves that the algorithm decreases Fβ\mathcal{F}_\beta monotonically and converges to a stationary point of (3.13); §5 runs it on the eight-document toy.

Three structural features of (3.13):

Labeling invariance. Permuting the cluster labels in TT leaves the IB equations invariant. Two clusters t,tt, t' with the same prior p(t)=p(t)p(t) = p(t') and the same conditional p(yt)=p(yt)p(y \mid t) = p(y \mid t') are indistinguishable — under (3.13) they will receive identical assignment probabilities from every xx. This is the source of the bifurcation pattern in §5: clusters that are degenerate at low β\beta split into distinct clusters as β\beta crosses a phase-transition value.

The trivial fixed point. Setting p(tx)=q(t)p(t \mid x) = q(t) (any xx-independent distribution) makes the right-hand side of the first equation independent of xx as well. So the trivial encoder is always a fixed point — it is the unique fixed point for β<βc\beta < \beta_c, and loses stability at β=βc\beta = \beta_c.

The implicit data dependence. The IB equations involve p(yx)p(y \mid x), p(x)p(x), and p(x,y)p(x, y) only — they don’t use any external “distortion function.” This is the structural feature that distinguishes IB from rate-distortion (see §3.4).

3.4 Why these are a Blahut–Arimoto cousin

The IB iteration (3.14) inherits its template from the Blahut–Arimoto algorithm for computing rate-distortion functions (covered in rate-distortion theory). The BA algorithm minimizes the rate-distortion Lagrangian

FsRD(p(x^x))  =  I(X;X^)+sE[d(X,X^)],s>0,(3.15)\mathcal{F}_s^{\mathrm{RD}}\bigl(p(\hat x \mid x)\bigr) \;=\; I(X; \hat X) + s\, \mathbb{E}\bigl[d(X, \hat X)\bigr], \qquad s > 0, \quad\quad (3.15)

with stationarity condition

p(x^x)  =  p(x^)ZRD(x,s)exp ⁣[sd(x,x^)].(3.16)p(\hat x \mid x) \;=\; \frac{p(\hat x)}{Z^{\mathrm{RD}}(x, s)}\, \exp\!\bigl[-s\, d(x, \hat x)\bigr]. \quad\quad (3.16)

Compare side-by-side with the IB condition (3.3):

p(tx)  =  p(t)Z(x,β)exp ⁣[βDKL(p(yx)p(yt))].p(t \mid x) \;=\; \frac{p(t)}{Z(x, \beta)}\, \exp\!\bigl[-\beta\, D_{\mathrm{KL}}(p(y \mid x) \,\|\, p(y \mid t))\bigr].

Structurally identical. The role of the rate-distortion distortion d(x,x^)d(x, \hat x) is played in IB by

dIB(x,t)  =  DKL(p(yx)p(yt)),(3.17)d_{\mathrm{IB}}(x, t) \;=\; D_{\mathrm{KL}}\bigl(p(y \mid x) \,\|\, p(y \mid t)\bigr), \quad\quad (3.17)

which we may call the predictive distortion — a measure of how badly the cluster-conditional p(yt)p(y \mid t) approximates the true conditional p(yx)p(y \mid x). Two clusters are “close” in the IB sense if they make similar predictions about YY, not if they look similar in input space. Compression is therefore task-aligned: it preserves bits useful for predicting YY and discards everything else.

There is one crucial difference between IB and rate-distortion, however, which is the key technical reason IB is harder. In rate-distortion, the distortion d(x,x^)d(x, \hat x) is given as input — fixed before the algorithm starts. In IB, the predictive distortion dIB(x,t)d_{\mathrm{IB}}(x, t) depends on p(yt)p(y \mid t), which depends on the encoder p(tx)p(t \mid x), which depends on the predictive distortion. The “distortion” emerges from the iteration itself. As a consequence:

  • The rate-distortion Lagrangian FsRD\mathcal{F}_s^{\mathrm{RD}} is convex in p(x^x)p(\hat x \mid x) for fixed distortion, so BA converges to the global minimum.
  • The IB Lagrangian Fβ\mathcal{F}_\beta is not convex in p(tx)p(t \mid x), so the IB iteration converges only to a local stationary point — generally to one of many.

This non-convexity has consequences we’ll feel in §5: running the IB iteration with different random initializations finds different stationary points, and the global IB curve has to be traced by either (a) running many restarts at each β\beta and keeping the best, or (b) using β\beta-annealing — solving at small β\beta first and warm-starting at the next β\beta.

Despite the non-convexity, the IB iteration enjoys a clean monotonicity property — Fβ\mathcal{F}_\beta decreases at every step — and convergence to a stationary point. We prove both in §4.

4. Convergence of the IB updates

4.1 The Lagrangian decreases at every step

The IB iteration (3.14) updates the encoder by a non-trivial coupled rule, so monotone decrease of Fβ\mathcal{F}_\beta is not obvious from the update form alone. The clean proof goes through a wider three-argument functional — the same Csiszár–Tusnády construction that gives convergence of Blahut–Arimoto for rate-distortion and of EM for likelihood maximization.

Setup: the auxiliary functional. Treat the encoder p(tx)p(t \mid x), the cluster marginal q(t)q(t), and the cluster decoder r(yt)r(y \mid t) as independent arguments. Define

F[p,q,r]  :=  x,tp(x)p(tx)logp(tx)q(t)  +  βx,tp(x)p(tx)DKL(p(yx)r(yt)).(4.1)F\bigl[p, q, r\bigr] \;:=\; \sum_{x, t} p(x)\, p(t \mid x)\, \log \frac{p(t \mid x)}{q(t)} \;+\; \beta \sum_{x, t} p(x)\, p(t \mid x)\, D_{\mathrm{KL}}\bigl(p(y \mid x) \,\|\, r(y \mid t)\bigr). \quad\quad (4.1)

The first sum looks like I(X;T)I(X; T) but uses the auxiliary qq in the denominator. The second looks like the expected KL “predictive distortion” but uses the auxiliary rr as the cluster decoder. Both reduce to their actual-quantity versions when qq and rr equal the encoder’s induced marginal and decoder respectively.

Lemma 1 (F relates to the IB Lagrangian).

For every normalized encoder pp and every q,rq, r,

F[p,q,r]  =  Fβ(p)+βI(X;Y)+DKL(p(t)q(t))+βEtp(t) ⁣[DKL(p(yt)r(yt))].(4.2)F\bigl[p, q, r\bigr] \;=\; \mathcal{F}_\beta(p) + \beta\, I(X; Y) + D_{\mathrm{KL}}\bigl(p(t) \,\|\, q(t)\bigr) + \beta\, \mathbb{E}_{t \sim p(t)}\!\Bigl[D_{\mathrm{KL}}\bigl(p(y \mid t) \,\|\, r(y \mid t)\bigr)\Bigr]. \quad\quad (4.2)

In particular, F[p,q,r]Fβ(p)+βI(X;Y)F[p, q, r] \ge \mathcal{F}_\beta(p) + \beta\, I(X; Y), with equality iff q(t)=p(t)q(t) = p(t) and r(yt)=p(yt)r(y \mid t) = p(y \mid t).

Proof.

Splitting the logarithm in the first sum of (4.1):

x,tp(x)p(tx)logp(tx)q(t)  =  I(X;T)+DKL(p(t)q(t)).\sum_{x, t} p(x)\, p(t \mid x)\, \log \frac{p(t \mid x)}{q(t)} \;=\; I(X; T) + D_{\mathrm{KL}}(p(t) \,\|\, q(t)).

For the second sum, expand the KL and use xp(x,y)p(tx)=p(t,y)\sum_x p(x, y)\, p(t \mid x) = p(t, y) together with yp(t,y)logr(yt)=p(t)[H(p(yt))+DKL(p(yt)r(yt))]\sum_y p(t, y)\, \log r(y \mid t) = -p(t)\bigl[H(p(y \mid t)) + D_{\mathrm{KL}}(p(y \mid t) \,\|\, r(y \mid t))\bigr]:

βx,tp(x)p(tx)DKL(p(yx)r(yt))  =  β[I(X;Y)I(T;Y)]+βEt[DKL(p(yt)r(yt))].\beta \sum_{x, t} p(x)\, p(t \mid x)\, D_{\mathrm{KL}}\bigl(p(y \mid x) \,\|\, r(y \mid t)\bigr) \;=\; \beta\bigl[I(X; Y) - I(T; Y)\bigr] + \beta\, \mathbb{E}_t\bigl[D_{\mathrm{KL}}(p(y \mid t) \,\|\, r(y \mid t))\bigr].

Adding the two contributions yields (4.2). The two KL terms are non-negative, with simultaneous equality iff q=p(t)q = p(t) and r=p(yt)r = p(y \mid t).

Lemma 2 (IB iteration is alternating minimization on F).

For fixed q,rq, r, the minimizer of pF[p,q,r]p \mapsto F[p, q, r] over normalized encoders is

p(tx)  =  q(t)Z(x,β)exp[βDKL(p(yx)r(yt))].(4.3)p^*(t \mid x) \;=\; \frac{q(t)}{Z(x, \beta)}\, \exp\bigl[-\beta\, D_{\mathrm{KL}}(p(y \mid x) \,\|\, r(y \mid t))\bigr]. \quad\quad (4.3)

Proof.

With q,rq, r fixed, FF is a sum of X|X| independent per-xx problems. Each per-xx problem is convex in p(x)p(\cdot \mid x). Differentiating with multiplier μ(x)\mu(x) for normalization and setting the result to zero gives (4.3) after absorbing tt-independent terms into the normalization.

Theorem 3 (Monotone decrease).

Let {p(n)}\{p^{(n)}\} be the IB iterates produced by (3.14), and Fβ(n):=Fβ(p(n))\mathcal{F}_\beta^{(n)} := \mathcal{F}_\beta(p^{(n)}). Then for every n0n \ge 0,

Fβ(n+1)    Fβ(n),(4.4)\mathcal{F}_\beta^{(n+1)} \;\le\; \mathcal{F}_\beta^{(n)}, \quad\quad (4.4)

with equality iff p(n)p^{(n)} is a fixed point of (3.13).

Proof.

Let q(n):=p(n)(t)q^{(n)} := p^{(n)}(t) and r(n):=p(n)(yt)r^{(n)} := p^{(n)}(y \mid t). By Lemma 1 with q=q(n)q = q^{(n)}, r=r(n)r = r^{(n)} (the equality case),

F[p(n),q(n),r(n)]  =  Fβ(p(n))+βI(X;Y).F[p^{(n)}, q^{(n)}, r^{(n)}] \;=\; \mathcal{F}_\beta(p^{(n)}) + \beta\, I(X; Y).

The next iterate p(n+1)p^{(n+1)} minimizes pF[p,q(n),r(n)]p \mapsto F[p, q^{(n)}, r^{(n)}] by Lemma 2, so

F[p(n+1),q(n),r(n)]    F[p(n),q(n),r(n)].F[p^{(n+1)}, q^{(n)}, r^{(n)}] \;\le\; F[p^{(n)}, q^{(n)}, r^{(n)}].

By Lemma 1 (inequality direction) with p=p(n+1)p = p^{(n+1)},

F[p(n+1),q(n),r(n)]    Fβ(p(n+1))+βI(X;Y).F[p^{(n+1)}, q^{(n)}, r^{(n)}] \;\ge\; \mathcal{F}_\beta(p^{(n+1)}) + \beta\, I(X; Y).

Chaining gives Fβ(n+1)Fβ(n)\mathcal{F}_\beta^{(n+1)} \le \mathcal{F}_\beta^{(n)}. Equality requires both inequalities to be tight, which happens iff p(n)p^{(n)} is already a fixed point.

The construction is worth lingering on: F[p,q,r]F[p, q, r] is a Lyapunov-with-slack — it equals Fβ(p)\mathcal{F}_\beta(p) plus two KL-divergence penalties that vanish precisely when qq and rr are the encoder’s induced quantities. The IB iteration alternately tightens the slack and minimizes the loosened objective. This Csiszár–Tusnády pattern proves Blahut–Arimoto, EM, and CAVI convergence — IB specializes it to the predictive-KL distortion.

4.2 Boundedness and limit-point existence

Boundedness of Fβ(n)\mathcal{F}_\beta^{(n)}. I(X;T)0I(X; T) \ge 0 and I(T;Y)I(X;Y)I(T; Y) \le I(X; Y) by DPI, so

Fβ(p)    βI(X;Y).(4.5)\mathcal{F}_\beta(p) \;\ge\; -\beta\, I(X; Y). \quad\quad (4.5)

Monotone decrease plus a lower bound implies the sequence Fβ(n)\mathcal{F}_\beta^{(n)} converges to some limit Fβ()\mathcal{F}_\beta^{(\infty)}.

Compactness. Iterates p(n)(tx)p^{(n)}(t \mid x) live in the compact product of simplices xΔT1\prod_x \Delta^{|\mathcal{T}|-1}. By Bolzano–Weierstrass, every subsequence has a convergent sub-subsequence.

Limit points are fixed points. The IB update map TIB\mathcal{T}_{\mathrm{IB}} is continuous on the simplex interior. If p(nk)pp^{(n_k)} \to p^*, then TIB(p(nk))TIB(p)\mathcal{T}_{\mathrm{IB}}(p^{(n_k)}) \to \mathcal{T}_{\mathrm{IB}}(p^*). By the equality case of Theorem 3, p=TIB(p)p^* = \mathcal{T}_{\mathrm{IB}}(p^*).

Theorem 4 (Convergence to a fixed point).

Every limit point of the IB iteration is a fixed point of (3.13), and the sequence of Lagrangian values Fβ(n)\mathcal{F}_\beta^{(n)} converges.

4.3 What the algorithm converges to: stationary, not global

Theorems 3 and 4 give stationary-point convergence, not global-minimum convergence. Three observations.

Trivial vs nontrivial fixed points. The encoder p(tx)=q(t)p(t \mid x) = q(t) (any xx-independent distribution) is always a fixed point. For β<βc\beta < \beta_c it is the unique fixed point and the global minimum. For β>βc\beta > \beta_c the trivial fixed point becomes a saddle: small perturbations grow under iteration.

Symmetry-broken fixed points. Multiple equivalent solutions exist by cluster-label permutation — all give the same Fβ\mathcal{F}_\beta value but trip up direct iterate comparisons.

Genuinely distinct local minima. Beyond labeling, the IB landscape can have multiple genuinely distinct stationary points at different Fβ\mathcal{F}_\beta values. Random initialization picks between basins essentially at random.

The numerical experiment in Figure 3 illustrates this: five random initializations at β=5\beta = 5, k=4k = 4 on the eight-document toy land at slightly different Fβ\mathcal{F}_\beta values — some runs find the optimum, others get stuck.

Convergence traces of F_β from five random initializations on the 8-document toy at β = 5
Left: 𝓕_β traced over 60 iterations from 5 random initializations on the eight-document toy at β = 5, k = 4. Every trace decreases monotonically (Theorem 3); four runs converge to the global optimum 𝓕_β* = −4 ln 2 ≈ −2.7726 nats, one run gets stuck in a suboptimal local minimum. Right: per-step decrement 𝓕_β(p^(n−1)) − 𝓕_β(p^(n)) on the same runs, on a symlog axis. Each value is ≥ 0 (Theorem 3); the symlog axis shows the magnitude collapsing toward machine precision as the iteration converges.

4.4 The non-convexity hazard and annealing in β

Two practical remedies for the non-convexity exposed in Figure 3.

Random restarts. At each β\beta of interest, run the iteration from MM random initializations and keep the lowest Fβ\mathcal{F}_\beta at convergence. Embarrassingly parallel. The failure mode is high MM — when the landscape has many basins, MM must scale with the number of basins.

β-annealing. Solve at a small β0\beta_0 first (where the landscape is convex, trivial encoder is the unique fixed point), then incrementally raise β\beta, warm-starting each new β\beta from the previous β\beta‘s converged encoder. The IB curve is continuous in β\beta, so neighboring β\beta values have neighboring optimal encoders. Annealing traces a smooth path along the IB curve.

Annealing has a structural benefit beyond non-convexity: it makes the cluster bifurcations explicit. At small β\beta, all clusters collapse onto the trivial fixed point. As β\beta crosses βc\beta_c, two clusters split apart. At larger β\beta thresholds, further splits occur — each phase transition activates an additional cluster. The annealing trace through the IB plane traces out the staircase of phase transitions that §5 makes visible for the discrete toy and §7 for the Gaussian closed form.

§5 implements both strategies on the eight-document toy and traces the full IB curve.

5. The discrete IB in practice

5.1 Initialization and random restarts

§4 left us with an algorithm that decreases Fβ\mathcal{F}_\beta monotonically and converges to a fixed point — without guaranteeing the best fixed point. Three practical choices fill in the operational picture.

Encoder initialization. The IB iteration is a self-map on the simplex product xΔT1\prod_x \Delta^{|\mathcal{T}|-1}. We start from a random softmax draw:

p(0)(tx)  =  exp(Wxt)texp(Wxt),WxtN(0,σ2).(5.1)p^{(0)}(t \mid x) \;=\; \frac{\exp\bigl(W_{x t}\bigr)}{\sum_{t'} \exp\bigl(W_{x t'}\bigr)}, \qquad W_{x t} \sim \mathcal{N}(0, \sigma^2). \quad\quad (5.1)

Small σ\sigma (say σ[0.05,0.5]\sigma \in [0.05, 0.5]) yields a near-trivial initial encoder. The random-softmax form is easier to control numerically than near-uniform perturbations, and it keeps every cluster nonempty at initialization so that no p(0)(t)=0p^{(0)}(t) = 0 propagates a log0\log 0 through the first update.

The escape-from-saddle problem. At any β>βc\beta > \beta_c, the trivial encoder p(tx)=q(t)p(t \mid x) = q(t) is a saddle point of Fβ\mathcal{F}_\beta. An iterate that lands too close to it can take many iterations to escape. With σ=0\sigma = 0 exactly, the iteration is stationary and never moves. Always use σ>0\sigma > 0.

Random restarts. At each β\beta of interest, run from MM independent initializations and keep the lowest Fβ\mathcal{F}_\beta at convergence. For small toys M=5M = 5 to 2020 suffices.

β-annealing. Solve at a small starting β0\beta_0 first, then incrementally increase β\beta, warm-starting each new β\beta from the previous β\beta‘s converged encoder. Two practical wrinkles: (i) at each β\beta step, add a tiny random perturbation to the warm-start encoder to break exact saddle-point degeneracy, and (ii) the β\beta grid should be log-spaced and dense near suspected phase transitions to resolve their location.

From §5.3 onward: anneal forward across β\beta as the primary strategy, with random-restart sanity checks at selected β\beta to confirm we haven’t missed a lower-Fβ\mathcal{F}_\beta branch.

5.2 A worked example on the eight-document toy

Before sweeping β\beta, exhibit the algorithm on the running eight-document toy from §1.2 at fixed β=5\beta = 5. Since βc=1\beta_c = 1 for this toy, β=5\beta = 5 sits well past the first transition. With k=4k = 4 clusters and random-softmax initialization at σ=0.5\sigma = 0.5, the iteration converges in roughly 17 iterations.

The trajectory has a characteristic shape: in the first few iterations, Fβ\mathcal{F}_\beta drops sharply as the encoder finds the topic-aligned partition; over the next dozen iterations, Fβ\mathcal{F}_\beta creeps toward its limit while the two “spare” clusters merge with the two topic clusters. By iteration 30, the per-step decrement is at machine precision.

The converged encoder uses only 2\approx 2 effective clusters (out of the 4 allocated), with the two spare clusters bleeding into the active ones. The converged (I(X;T),I(T;Y))(I(X;T), I(T;Y)) equals (ln2,ln2)(0.693,0.693)(\ln 2, \ln 2) \approx (0.693, 0.693) nats — exactly the optimum achieved by the T=YT = Y encoder. The optimum Fβ=(1β)ln2=4ln22.7726\mathcal{F}_\beta^* = (1 - \beta) \ln 2 = -4 \ln 2 \approx -2.7726 nats matches the value found by four of the five seeds in Figure 3 (one seed gets stuck in a suboptimal local minimum that uses one cluster’s worth of effective capacity inefficiently).

5.3 Tracing the IB curve by sweeping β

The eight-document toy has only one phase transition because H(Y)=1H(Y) = 1 bit caps the staircase. For a multi-transition example we introduce a richer toy — call it T3 for “three-topic graded purity”:

Six documents in three topics, with p(x)p(x) uniform. Conditional p(yx)p(y \mid x) varies by document pair:

  • Docs 0, 1: p(yx)=(0.85,0.10,0.05)p(y \mid x) = (0.85,\, 0.10,\, 0.05) — very topic-0 pure.
  • Docs 2, 3: p(yx)=(0.10,0.85,0.05)p(y \mid x) = (0.10,\, 0.85,\, 0.05) — very topic-1 pure.
  • Docs 4, 5: p(yx)=(0.15,0.15,0.70)p(y \mid x) = (0.15,\, 0.15,\, 0.70) — moderately topic-2 pure.

Three distinct conditional distributions, with the first two highly distinguishable (purity 0.85) and the third moderately distinct from the others (purity 0.70). For this construction the numerical values are H(Y)=1.088H(Y) = 1.088 nats, H(YX)=0.618H(Y \mid X) = 0.618 nats, hence I(X;Y)=0.470I(X;Y) = 0.470 nats (=0.678(= 0.678 bits)).

Setting k=4k = 4 (one spare cluster to confirm self-disabling), we sweep β\beta across 40 log-spaced values in [0.3,100][0.3,\, 100], annealing forward with warm-starts. The result is plotted in Figure 4: the IB curve in the (I(X;T),I(T;Y))(I(X;T), I(T;Y)) plane (left), and the same data plotted as I(X;T)I(X;T) and I(T;Y)I(T;Y) vs β\beta (right). The plane plot shows the smooth, concave curve from origin to saturation; the β\beta plot exposes the two phase transitions as visible steepness changes — the first around β4\beta \approx 4 where the encoder starts to differentiate, and the second around β11\beta \approx 11 where the third cluster activates.

The saturation point of the curve sits at (R,I(X;Y))=(log3,0.470)(R^\dagger, I(X;Y)) = (\log 3,\, 0.470) nats =(1.099,0.470)= (1.099,\, 0.470) nats, since 3 clusters suffice to encode a minimal sufficient statistic.

Two consistency checks. Random restarts at β{2,5,15}\beta \in \{2,\, 5,\, 15\} confirm the annealing trace is the global optimum: 10 random initializations at each β\beta converge to within machine precision of the annealing value.

IB curve on the T3 toy traced by annealing across 40 log-spaced β values
Left: IB curve traced by annealing across 40 log-spaced β values on the T3 toy. The DPI ceiling I(X;Y) = 0.470 nats appears as a dashed horizontal; the saturation rate R† = log 3 = 1.099 nats appears as a dotted vertical. Random-restart points at β ∈ {2, 5, 15} (10 restarts each) confirm the annealing trace is the global optimum. Right: the same data plotted as I(X;T) and I(T;Y) vs β on a log axis. The two phase transitions appear as visible steepness changes in I(T;Y).

The sweep is β-annealed (each new β warm-starts from the previous β's converged encoder). Saturation point sits at (R†, I(X;Y)) = (log 3, 0.470) nats; the spare fourth cluster self-disables. Two phase transitions are visible in the right panel as steepness changes in I(T;Y).

5.4 Reading the trajectory: how clusters split with β

The IB curve in the left panel of Figure 4 is concave and continuous — phase transitions are invisible in that view. They become visible in two derived quantities, which we plot in Figure 5.

The effective cluster count. Define the perplexity of the cluster marginal:

Keff(β)  :=  exp(H(p(t)))  =  exp ⁣( ⁣tp(t)logp(t)).(5.2)K_{\mathrm{eff}}(\beta) \;:=\; \exp\bigl(H(p(t))\bigr) \;=\; \exp\!\Bigl(-\!\sum_t p(t)\, \log p(t)\Bigr). \quad\quad (5.2)

This is the “effective number of clusters” — equal to kk when TT is uniform over all kk clusters, equal to 11 when TT collapses to a single cluster, and continuously interpolating in between when the cluster marginal is non-uniform. As β\beta grows, KeffK_{\mathrm{eff}} traces out a staircase: flat plateaus where the cluster structure is stable, separated by rapid climbs at the phase transitions where new clusters activate.

Cluster bifurcations as heatmaps. The right panel of Figure 5 shows the encoder p(tx)p(t \mid x) at four representative β\beta values, sampled to span the cluster-count regimes: pre-first-transition (Keff1K_{\mathrm{eff}} \approx 1), between transitions (Keff2K_{\mathrm{eff}} \approx 2), post-second-transition (Keff3K_{\mathrm{eff}} \approx 3), and at large β\beta (sharp assignments). Reading across the panels: at small β\beta all six documents are assigned to roughly the same cluster. At intermediate β\beta, the encoder splits into two clusters separating docs 4–5 from the rest. At larger β\beta, the topic-0 and topic-1 pairs separate. At very large β\beta, the assignments are nearly one-hot — three discrete clusters, one per topic-pair.

Two structural observations. First, the order of bifurcations is not arbitrary: the first split separates the documents whose conditional distributions are most divergent. The cheapest predictive bit to encode first is the one that distinguishes the largest-divergence groups. Second, the fourth cluster we allocated self-disables: across all β\beta, one cluster carries near-zero marginal probability p(t)0p(t) \approx 0. IB uses only as many clusters as the structure of p(yx)p(y \mid x) supports.

The staircase pattern is the discrete analog of the phase transitions we will see in closed form for Gaussian IB in §7, where each canonical-correlation direction activates at a specific βc\beta_c value determined by the spectrum.

K_eff staircase and encoder-heatmap snapshots across the T3 β sweep
Left: K_eff(β) = exp(H(p(t))) across the annealing sweep on T3, on a log-β axis. Horizontal guides at K_eff = 1, 2, 3. The staircase pattern climbs from ≈ 1 to ≈ 2 at the first phase transition (β ≈ 4) and to ≈ 3 at the second (β ≈ 11). The fourth (spare) cluster never activates. Right: encoder p(t | x) heatmaps at four selected β values, one per K_eff regime — each subpanel a 6 × 4 grid of cells with shading proportional to p(t | x).

At small β, all six documents share one cluster (K_eff ≈ 1). Crossing the first phase transition splits the cluster structure — first the topic-2 pair separates from the topic-0/topic-1 group (purity 0.70 is the least correlated), then the topic-0 and topic-1 pairs separate at a higher β. The fourth allocated cluster self-disables. At large β, assignments approach one-hot and K_eff stabilizes near 3.

6. The Gaussian information bottleneck

6.1 The jointly-Gaussian setting and why it admits closed form

Through §5 the IB picture is operationally complete but every quantity is numerical. Switching to a jointly-Gaussian setting — the Chechik–Globerson–Tishby–Weiss 2005 setting — replaces iteration with matrix calculus. The Gaussian IB has a closed-form optimal encoder, an explicit expression for the IB curve as a piecewise-analytic function of β\beta, exact locations for the phase transitions in terms of a spectrum, and a clean geometric reading via canonical-correlation analysis.

The setup. Let XRpX \in \mathbb{R}^p and YRqY \in \mathbb{R}^q be zero-mean jointly Gaussian random vectors with covariance structure

ΣX=E[XX],ΣY=E[YY],ΣXY=E[XY]Rp×q.(6.1)\Sigma_X = \mathbb{E}[X X^\top], \qquad \Sigma_Y = \mathbb{E}[Y Y^\top], \qquad \Sigma_{XY} = \mathbb{E}[X Y^\top] \in \mathbb{R}^{p \times q}. \quad\quad (6.1)

We assume ΣX,ΣY\Sigma_X, \Sigma_Y are positive definite. The conditional covariance of XX given YY has the Schur-complement form

ΣXY  =  ΣXΣXYΣY1ΣYX,(6.2)\Sigma_{X \mid Y} \;=\; \Sigma_X - \Sigma_{XY}\, \Sigma_Y^{-1}\, \Sigma_{YX}, \quad\quad (6.2)

and ΣXYΣX\Sigma_{X \mid Y} \preceq \Sigma_X in the positive-semidefinite order. The representation TT takes values in Rd\mathbb{R}^d, where the dimension dd is determined from the spectrum below.

Why closed form is available. Two facts make the IB Lagrangian collapse to a covariance computation.

Gaussian mutual information. For jointly Gaussian (U,V)(U, V),

I(U;V)  =  12logΣUΣUV.(6.3)I(U; V) \;=\; \tfrac{1}{2} \log \frac{|\Sigma_U|}{|\Sigma_{U \mid V}|}. \quad\quad (6.3)

Mutual information depends only on second moments.

Linear-Gaussian preserves Gaussianity. If T=AX+ξT = A X + \xi where AA is deterministic and ξ\xi is Gaussian noise independent of XX, then the joint (X,T)(X, T) and (T,Y)(T, Y) are also jointly Gaussian. Restricting the encoder to be linear plus Gaussian noise makes the Lagrangian a function of the deterministic parameters (A,Σξ)(A, \Sigma_\xi) alone.

6.2 The optimal encoder is linear plus Gaussian noise

Theorem 5 (Linear-Gaussian optimality (Chechik et al. 2005)).

Let (X,Y)(X, Y) be zero-mean jointly Gaussian. The minimizer of the IB Lagrangian Fβ\mathcal{F}_\beta over all encoders p(tx)p(t \mid x) producing a dd-dimensional representation TRdT \in \mathbb{R}^d is of the form

T  =  AX+ξ,ξN(0,Σξ),ξX,(6.4)T \;=\; A\, X + \xi, \qquad \xi \sim \mathcal{N}(0,\, \Sigma_\xi), \qquad \xi \perp X, \quad\quad (6.4)

for some ARd×pA \in \mathbb{R}^{d \times p} and ΣξRd×d\Sigma_\xi \in \mathbb{R}^{d \times d} positive semi-definite. Equivalently, the optimal encoder is the Gaussian conditional

p(tx)  =  N(t;Ax,Σξ).(6.5)p^*(t \mid x) \;=\; \mathcal{N}\bigl(t;\, A\, x,\, \Sigma_\xi\bigr). \quad\quad (6.5)

The rigorous proof is in Chechik et al. 2005 (Theorem 1), with an entropy-power-inequality refinement in Painsky and Tishby 2017 that handles the equality cases more carefully. The intuition: for jointly-Gaussian (X,Y)(X, Y), the predictive distortion DKL(p(yx)p(yt))D_{\mathrm{KL}}(p(y \mid x) \,\|\, p(y \mid t)) from (3.3) is the KL between two Gaussians when TT is also Gaussian, and that KL depends only on second moments of (X,T,Y)(X, T, Y). The linear-Gaussian family spans the second-moment space, so the optimum lies in that family.

The parameterization. Standard formulas for Gaussian MI give:

ΣT  =  AΣXA+Σξ,ΣTX  =  Σξ,ΣTY  =  AΣXYA+Σξ.(6.6)\Sigma_T \;=\; A\, \Sigma_X\, A^\top + \Sigma_\xi, \qquad \Sigma_{T \mid X} \;=\; \Sigma_\xi, \qquad \Sigma_{T \mid Y} \;=\; A\, \Sigma_{X \mid Y}\, A^\top + \Sigma_\xi. \quad\quad (6.6)

I(X;T)  =  12logAΣXA+ΣξΣξ,I(T;Y)  =  12logAΣXA+ΣξAΣXYA+Σξ.(6.7)I(X; T) \;=\; \tfrac{1}{2} \log \frac{|A \Sigma_X A^\top + \Sigma_\xi|}{|\Sigma_\xi|}, \qquad I(T; Y) \;=\; \tfrac{1}{2} \log \frac{|A \Sigma_X A^\top + \Sigma_\xi|}{|A \Sigma_{X \mid Y} A^\top + \Sigma_\xi|}. \quad\quad (6.7)

Combining and rearranging,

Fβ(A,Σξ)  =  1β2logAΣXA+Σξ  +  β2logAΣXYA+Σξ    12logΣξ.(6.8)\mathcal{F}_\beta(A, \Sigma_\xi) \;=\; \frac{1 - \beta}{2}\, \log \bigl|A \Sigma_X A^\top + \Sigma_\xi\bigr| \;+\; \frac{\beta}{2}\, \log \bigl|A \Sigma_{X \mid Y} A^\top + \Sigma_\xi\bigr| \;-\; \frac{1}{2}\, \log |\Sigma_\xi|. \quad\quad (6.8)

Gauge freedom. The Lagrangian (6.8) is invariant under invertible linear transformations of TT. To remove the gauge we normalize the noise covariance:

Σξ  =  Id.(6.9)\Sigma_\xi \;=\; I_d. \quad\quad (6.9)

The remaining optimization is over ARd×pA \in \mathbb{R}^{d \times p} alone:

Fβ(A)  =  1β2logAΣXA+I  +  β2logAΣXYA+I.(6.10)\mathcal{F}_\beta(A) \;=\; \frac{1 - \beta}{2}\, \log \bigl|A \Sigma_X A^\top + I\bigr| \;+\; \frac{\beta}{2}\, \log \bigl|A \Sigma_{X \mid Y} A^\top + I\bigr|. \quad\quad (6.10)

6.3 Reduction to a generalized eigenproblem

For symmetric Σ\Sigma,

AlogAΣA+I  =  2(AΣA+I)1AΣ.(6.11)\nabla_A \log\bigl|A \Sigma A^\top + I\bigr| \;=\; 2\,(A \Sigma A^\top + I)^{-1}\, A\, \Sigma. \quad\quad (6.11)

Setting AFβ(A)=0\nabla_A \mathcal{F}_\beta(A) = 0 gives

(1β)(AΣXA+I)1AΣX  +  β(AΣXYA+I)1AΣXY  =  0.(6.12)(1 - \beta)\,(A \Sigma_X A^\top + I)^{-1}\, A\, \Sigma_X \;+\; \beta\,(A \Sigma_{X \mid Y} A^\top + I)^{-1}\, A\, \Sigma_{X \mid Y} \;=\; 0. \quad\quad (6.12)

The diagonalization ansatz. Define

M  :=  ΣXYΣX1    Rp×p.(6.13)M \;:=\; \Sigma_{X \mid Y}\, \Sigma_X^{-1} \;\in\; \mathbb{R}^{p \times p}. \quad\quad (6.13)

MM has eigenvalues λ1λ2λp\lambda_1 \le \lambda_2 \le \cdots \le \lambda_p in (0,1](0, 1] (positive because ΣXY\Sigma_{X \mid Y} is PSD; at most 11 because ΣXYΣX\Sigma_{X \mid Y} \preceq \Sigma_X). The right eigenvectors viv_i satisfy ΣXYvi=λiΣXvi\Sigma_{X \mid Y} v_i = \lambda_i \Sigma_X v_i. Normalize in the ΣX\Sigma_X-inner-product: viΣXvj=δijv_i^\top \Sigma_X v_j = \delta_{ij}.

Ansatz. Take

A  =  diag(α1,,αd)  V,V=[v1  v2    vd].(6.14)A \;=\; \mathrm{diag}(\alpha_1, \ldots, \alpha_d)\; V^\top, \qquad V = [v_1\; v_2\; \cdots\; v_d]. \quad\quad (6.14)

Plugging into (6.10), both AΣXAA \Sigma_X A^\top and AΣXYAA \Sigma_{X \mid Y} A^\top become diagonal, and the Lagrangian decouples across directions:

Fβ(α1,,αd)  =  i=1d[1β2log(1+αi2)  +  β2log(1+αi2λi)].(6.15)\mathcal{F}_\beta(\alpha_1, \ldots, \alpha_d) \;=\; \sum_{i=1}^d \biggl[\frac{1 - \beta}{2}\, \log(1 + \alpha_i^2) \;+\; \frac{\beta}{2}\, \log(1 + \alpha_i^2 \lambda_i)\biggr]. \quad\quad (6.15)

Setting ui:=αi2u_i := \alpha_i^2 and Fβ/ui=0\partial \mathcal{F}_\beta / \partial u_i = 0,

ui  =  αi2  =  β(1λi)1λi,(6.16)u_i^* \;=\; \alpha_i^{*\,2} \;=\; \frac{\beta(1 - \lambda_i) - 1}{\lambda_i}, \quad\quad (6.16)

provided the right-hand side is non-negative — i.e., β>1/(1λi)\beta > 1/(1 - \lambda_i). Otherwise the directional minimum is at αi=0\alpha_i^* = 0 and direction ii is inactive.

Ansatz optimality. Is the diagonal-in-eigenbasis form (6.14) optimal among all AA? Yes — by the von Neumann trace inequality applied to the simultaneous diagonalization of ΣX\Sigma_X and ΣXY\Sigma_{X \mid Y} in the ΣX\Sigma_X-inner-product basis {vi}\{v_i\}. See Chechik et al. 2005 (Lemma 3.2) for the detailed argument.

Summary. The optimal Gaussian-IB encoder at β\beta is

Aβ  =  diag(α1,,αd)Vd,αi2  =  max ⁣(0,  β(1λi)1λi),(6.17)A^*_\beta \;=\; \mathrm{diag}(\alpha_1^*, \ldots, \alpha_d^*)\, V_d^\top, \qquad \alpha_i^{*\,2} \;=\; \max\!\biggl(0,\; \frac{\beta(1 - \lambda_i) - 1}{\lambda_i}\biggr), \quad\quad (6.17)

where VdV_d holds the dd smallest-eigenvalue eigenvectors of MM. Each direction switches on at the critical

βc,i  =  11λi.(6.18)\beta_{c,\, i} \;=\; \frac{1}{1 - \lambda_i}. \quad\quad (6.18)

6.4 Reading the canonical-correlation spectrum

The eigenvalues λi\lambda_i of M=ΣXYΣX1M = \Sigma_{X \mid Y}\, \Sigma_X^{-1} measure the fraction of XX‘s variance unexplained by YY. λi1\lambda_i \to 1 means direction viv_i is uncorrelated with YY; λi0\lambda_i \to 0 means direction viv_i is strongly determined by YY.

Canonical correlations. The eigenvalues of MM are related to the canonical correlations between XX and YY:

λi  =  1ρi2,(6.19)\lambda_i \;=\; 1 - \rho_i^2, \quad\quad (6.19)

where ρ1ρ2\rho_1 \ge \rho_2 \ge \cdots are the canonical correlations (sorted descending), with ρi[0,1]\rho_i \in [0, 1]. Substituting into (6.18):

βc,i  =  1ρi2.(6.20)\beta_{c,\, i} \;=\; \frac{1}{\rho_i^2}. \quad\quad (6.20)

The IB phase transitions happen at the reciprocals of the squared canonical correlations, in order of decreasing ρi\rho_i. The most-correlated direction activates first; ρi=0\rho_i = 0 directions never activate. The IB algorithm finds exactly the CCA directions, in order of statistical informativeness — the same projections Hotelling 1936 derived from a correlation-maximization argument, here motivated by predictive compression instead.

The 4-D toy. We build a synthetic 4-D jointly-Gaussian example with designed canonical correlations ρ=(0.95,0.70,0.30,0)\rho = (0.95,\, 0.70,\, 0.30,\, 0). Each ρi\rho_i encodes a different “informativeness regime.” The phase-transition staircase comes out:

  • β<1.108\beta < 1.108: trivial encoder, T=constT = \text{const}.
  • 1.108<β<2.0411.108 < \beta < 2.041: one active direction (ρ1=0.95)(\rho_1 = 0.95).
  • 2.041<β<11.1112.041 < \beta < 11.111: two active directions.
  • β>11.111\beta > 11.111: three active directions; the fourth never activates.

The total I(X;Y)=12ilog(1ρi2)=1.548I(X;Y) = -\tfrac{1}{2}\sum_i \log(1 - \rho_i^2) = 1.548 nats. A Monte Carlo verification with n=50,000n = 50{,}000 samples recovers the canonical correlations to within ρ^ρ0.0023|\hat\rho - \rho| \le 0.0023.

Bar charts of ρ² vs λ side-by-side and β_c thresholds for the 4-D Gaussian toy
Left: bar chart of the canonical-correlation spectrum — ρ_i² and λ_i = 1 − ρ_i² side by side for the four designed correlations. Visual confirmation that ρ_i² + λ_i = 1. Right: horizontal bar chart of β_{c,i} = 1/ρ_i² on a log axis. Direction 1 activates at β_{c,1} ≈ 1.11; direction 2 at ≈ 2.04; direction 3 at ≈ 11.11. Direction 4 (ρ_4 = 0) has β_{c,4} = ∞ — never activates.

7. Phase transitions on the Gaussian IB curve

7.1 Critical β values determined by the spectrum

From §6, the optimal Gaussian-IB encoder at parameter β\beta is given by (6.17) with thresholds βc,i=1/(1λi)=1/ρi2\beta_{c, i} = 1/(1 - \lambda_i) = 1/\rho_i^2. Sorting eigenvalues ascending gives an ordered sequence

βc,1  <  βc,2  <    <  βc,r,(7.1)\beta_{c,\, 1} \;<\; \beta_{c,\, 2} \;<\; \cdots \;<\; \beta_{c,\, r}, \quad\quad (7.1)

where r=rank(ΣXY)r = \mathrm{rank}(\Sigma_{XY}). Directions with ρi=0\rho_i = 0 never activate.

The first phase transition is at βc,1\beta_{c, 1}, not at β=1\beta = 1. For general jointly Gaussian (X,Y)(X, Y), the first transition is the inverse of the largest squared canonical correlation, which is in (0,1](0, 1]. So βc,11\beta_{c, 1} \ge 1 always, with equality iff XX and YY share a perfectly correlated direction.

The activation order is by canonical correlation, not by the original coordinates. The CCA directions viv_i are linear combinations of the original components of XX; they need not align with any coordinate axis. The IB principle is coordinate-free: it sees only the second-moment structure of (X,Y)(X, Y), expressed through ρ1,ρ2,\rho_1, \rho_2, \ldots.

7.2 Closed-form curve and the C¹-but-not-C² distinction

Plugging (6.17) into (6.7) uses two algebraic identities:

1+αi2  =  (β1)(1λi)λi,1+αi2λi  =  β(1λi).(7.2)1 + \alpha_i^{*\,2} \;=\; \frac{(\beta - 1)(1 - \lambda_i)}{\lambda_i}, \qquad 1 + \alpha_i^{*\,2}\, \lambda_i \;=\; \beta\,(1 - \lambda_i). \quad\quad (7.2)

For each active direction ii:

I(X;Tβ)i  =  12log ⁣((β1)(1λi)λi),(7.3)I(X; T_\beta)_i \;=\; \tfrac{1}{2} \log \!\biggl(\frac{(\beta - 1)(1 - \lambda_i)}{\lambda_i}\biggr), \quad\quad (7.3)

I(Tβ;Y)i  =  12log ⁣(β1βλi).(7.4)I(T_\beta; Y)_i \;=\; \tfrac{1}{2} \log \!\biggl(\frac{\beta - 1}{\beta\, \lambda_i}\biggr). \quad\quad (7.4)

Summing over all active directions ii at parameter β\beta:

I(X;Tβ)  =  12i:β>βc,ilog ⁣((β1)(1λi)λi),I(Tβ;Y)  =  12i:β>βc,ilog ⁣(β1βλi).(7.5)I(X; T_\beta) \;=\; \frac{1}{2} \sum_{i:\, \beta > \beta_{c,\, i}} \log\!\biggl(\frac{(\beta - 1)(1 - \lambda_i)}{\lambda_i}\biggr), \qquad I(T_\beta; Y) \;=\; \frac{1}{2} \sum_{i:\, \beta > \beta_{c,\, i}} \log\!\biggl(\frac{\beta - 1}{\beta\, \lambda_i}\biggr). \quad\quad (7.5)

The IB curve is piecewise analytic in β\beta: across each open interval (βc,k,βc,k+1)(\beta_{c,\, k},\, \beta_{c,\, k+1}), exactly kk directions are active. At each phase transition, the activating direction contributes zero exactly at the threshold, so the curve is continuous in β\beta.

The slope dI(T;Y)/dI(X;T)=1/βdI(T;Y) / dI(X;T) = 1/\beta at every interior point. Differentiating (7.3) and (7.4):

dI(X;Tβ)idβ  =  12(β1),dI(Tβ;Y)idβ  =  12β(β1).(7.6)\frac{dI(X; T_\beta)_i}{d\beta} \;=\; \frac{1}{2\,(\beta - 1)}, \qquad \frac{dI(T_\beta; Y)_i}{d\beta} \;=\; \frac{1}{2\,\beta\,(\beta - 1)}. \quad\quad (7.6)

The ratio is 1/β1/\beta per direction. Summing over kk active directions multiplies both by kk, leaving the ratio unchanged. The IB curve has tangent slope 1/β1/\beta at every interior point, recovering (2.5).

Curvature jumps at each phase transition. From the chain rule and (7.6),

d2I(T;Y)dI(X;T)2  =  2(β1)kβ2.(7.7)\frac{d^2 I(T;Y)}{d I(X;T)^2} \;=\; -\frac{2\,(\beta - 1)}{k\,\beta^2}. \quad\quad (7.7)

At each βc,k+1\beta_{c,\, k+1}, the curvature magnitude decreases by a factor of k/(k+1)k/(k+1) — the curve becomes less curved when a new direction activates. The IB curve is C1C^1 but not C2C^2: the tangent is everywhere continuous (slope 1/β1/\beta on both sides of each transition), but the curvature jumps. This is why phase transitions are invisible in a smooth-looking plot of I(T;Y)I(T;Y) against I(X;T)I(X;T) yet very visible in the β\beta-parameterization: in the IB plane they show up only as a faint curvature change, while in the β\beta-plot they are sharp kinks in the activation timing.

Saturation. As β\beta \to \infty, I(Tβ;Y)iI(T_\beta; Y)_i saturates at 12logλi-\tfrac{1}{2}\, \log \lambda_i per active direction. Total:

limβI(Tβ;Y)  =  12i:ρi>0logλi  =  12logΣXYΣX  =  I(X;Y),(7.8)\lim_{\beta \to \infty} I(T_\beta; Y) \;=\; -\frac{1}{2} \sum_{i:\, \rho_i > 0} \log \lambda_i \;=\; -\frac{1}{2} \log \frac{|\Sigma_{X \mid Y}|}{|\Sigma_X|} \;=\; I(X; Y), \quad\quad (7.8)

recovering the DPI ceiling exactly.

7.3 Dimension allocation: how each new direction turns on

The closed form decomposes I(X;Tβ)I(X; T_\beta) and I(Tβ;Y)I(T_\beta; Y) by canonical-correlation direction, and that decomposition is plotted as a stacked area in Figure 8.

Activation is staged. Below βc,1\beta_{c,\, 1}, all bands are flat at zero. At each βc,i\beta_{c,\, i}, direction ii‘s band starts climbing from zero.

Within an active region, contributions are unequal. For two simultaneously-active directions i,ji, j:

I(T;Y)i    I(T;Y)j  =  12log(λj/λi),I(T;Y)_i \;-\; I(T;Y)_j \;=\; \tfrac{1}{2}\, \log(\lambda_j / \lambda_i),

constant in β\beta once both are active. The IB doesn’t rebalance — it adds new directions while preserving the existing ordering between them.

Most-correlated direction dominates. For the 4-D toy:

  • Direction 1 (ρ1=0.95)(\rho_1 = 0.95): saturation 12log(0.0975)1.164-\tfrac{1}{2}\log(0.0975) \approx 1.164 nats — 75.2% of I(X;Y)I(X;Y).
  • Direction 2 (ρ2=0.70)(\rho_2 = 0.70): 0.337\approx 0.337 nats — 21.8%.
  • Direction 3 (ρ3=0.30)(\rho_3 = 0.30): 0.047\approx 0.047 nats — 3.0%.
  • Direction 4: 0 (never activates).

This is the Gaussian-IB analog of “dominant principal component” in PCA, with one substantive difference: PCA ranks directions by XX-variance; Gaussian IB ranks them by predictiveness of YY.

Stacked-area decomposition of I(X;T) and I(T;Y) by CCA direction across the β sweep
Per-direction contributions to I(X;T) (left) and I(T;Y) (right) across the β sweep on the 4-D toy, as stacked-area plots. Each colored band is one canonical-correlation direction's contribution; phase-transition vertical lines mark β_{c,i}. The right panel's total saturates at I(X;Y) = 1.548 nats (dashed horizontal). The fourth direction's band is always empty — ρ_4 = 0 means β_{c,4} = ∞.
dir 1: ρ = 0.95, λ = 0.098, β_c = 1.11
dir 2: ρ = 0.70, λ = 0.510, β_c = 2.04
dir 3: ρ = 0.30, λ = 0.910, β_c = 11.11
dir 4: ρ = 0.00, λ = 1.000, β_c =

Direction 1 dominates the predictiveness budget: it saturates at −½ log λ_1 ≈ 1.164 nats (75.2% of I(X;Y)). Direction 2 contributes ≈ 0.337 nats (21.8%), direction 3 ≈ 0.047 nats (3.0%). Direction 4 (ρ_4 = 0) never activates.

7.4 Numerical verification on a 4-D toy

We check (7.3)–(7.5) by Monte Carlo. Protocol:

  1. Sample (X,Y)(X, Y) from the joint Gaussian designed in §6.4.
  2. For each β{1.5,2.5,5.0,15.0}\beta \in \{1.5,\, 2.5,\, 5.0,\, 15.0\}, construct AβA^*_\beta and sample T=AβX+ξT = A^*_\beta X + \xi.
  3. Compute empirical I(X;T)I(X; T) and I(T;Y)I(T; Y) using the Gaussian-MI formula on the empirical covariances.
  4. Compare to the closed-form values from (7.5).

The four β land in different active-direction regimes — β = 1.5 (one active direction), 2.5 (two), 5.0 (two), 15.0 (three). Closed-form values, in nats:

βI(X;T) (cf)I(T;Y) (cf)
1.50.76610.6146
2.51.49810.9898
5.02.47891.2775
15.03.89441.4443

At nMC=20,000n_{\text{MC}} = 20{,}000 samples, agreement is within O(1/n)102O(1/\sqrt{n}) \approx 10^{-2} nats — confirming both the encoder construction and (7.5) at once. The notebook prints the side-by-side comparison; max-abs-difference across the four β is below 6×1036 \times 10^{-3} nats.

Closed-form Gaussian IB curve on the 4-D toy with phase-transition markers and MC verification
Left: closed-form Gaussian IB curve on the 4-D toy in the (I(X;T), I(T;Y)) plane. Three phase-transition operating points marked at β_{c,1}, β_{c,2}, β_{c,3}. The curve is C¹ — tangent continuous at each transition (slope 1/β everywhere) — but C² fails (curvature drops by factor k/(k+1) at each transition). Right: I(X;T) and I(T;Y) as functions of β on a log axis. Phase transitions appear as kinks in the β-parameterization.

Phase transitions at β_c = (1.108, 2.041, 11.111) activate the three nonzero canonical-correlation directions. The fourth (ρ_4 = 0) never activates. Saturation at I(X;Y) = 1.548 nats. Try β = 1.5 → (0.766, 0.615), β = 2.5 → (1.498, 0.990), β = 5.0 → (2.479, 1.278), β = 15 → (3.894, 1.444) — the four §7.4 verification points.

8. Variational IB and the deep-learning lift

8.1 Why direct IB doesn’t scale to high-dimensional encoders

The §3 IB algorithm has two requirements that limit its reach in practice. First, every iteration uses the conditional p(yx)p(y \mid x) at every xx — that means we need the whole joint distribution p(x,y)p(x, y) as input. Second, the encoder p(tx)p(t \mid x) is represented explicitly for every xx. Both requirements break down when XX is high-dimensional continuous data.

We need two structural changes. First, amortize the encoder: instead of storing p(tx)p(t \mid x) for every xx, parameterize it by a neural network pθ(tx)p_\theta(t \mid x) whose parameters θ\theta are shared across xx. Second, upper-bound the intractable terms in the IB Lagrangian by quantities estimable from samples.

This is what Alemi, Fischer, Dillon, and Murphy did in 2017, building on Achille and Soatto (information dropout), Kingma and Welling (VAE), and Tishby–Zaslavsky. The result, Variational Information Bottleneck (VIB), is the workhorse of the IB-in-deep-learning literature.

8.2 The Alemi–Fischer–Dillon–Murphy 2017 variational bound

Upper bound on I(X;T)I(X; T). For any auxiliary distribution q(t)q(t) over T\mathcal{T},

I(X;T)  =  Ex[DKL(p(tx)q(t))]    DKL(p(t)q(t))    Ex[DKL(p(tx)q(t))],(8.1)I(X; T) \;=\; \mathbb{E}_x\bigl[D_{\mathrm{KL}}(p(t \mid x) \,\|\, q(t))\bigr] \;-\; D_{\mathrm{KL}}(p(t) \,\|\, q(t)) \;\le\; \mathbb{E}_x\bigl[D_{\mathrm{KL}}(p(t \mid x) \,\|\, q(t))\bigr], \quad\quad (8.1)

with equality iff q(t)=p(t)q(t) = p(t). The practical choice — for tractability and for connecting to the VAE — is q(t)=N(0,I)q(t) = \mathcal{N}(0, I).

Lower bound on I(T;Y)I(T; Y). For any auxiliary r(yt)r(y \mid t),

I(T;Y)    E(t,y)[logr(yt)]  +  H(Y),(8.2)I(T; Y) \;\ge\; \mathbb{E}_{(t, y)}\bigl[\log r(y \mid t)\bigr] \;+\; H(Y), \quad\quad (8.2)

with equality iff r(yt)=p(yt)r(y \mid t) = p(y \mid t).

Combining. The VIB upper bound on Fβ\mathcal{F}_\beta (absorbing constants) is

LVIB(pθ,q,rϕ)  :=  Ex[DKL(pθ(tx)q(t))]    βE(x,y,t)[logrϕ(yt)].(8.3)\mathcal{L}_{\mathrm{VIB}}\bigl(p_\theta, q, r_\phi\bigr) \;:=\; \mathbb{E}_x\bigl[D_{\mathrm{KL}}(p_\theta(t \mid x) \,\|\, q(t))\bigr] \;-\; \beta\, \mathbb{E}_{(x, y, t)}\bigl[\log r_\phi(y \mid t)\bigr]. \quad\quad (8.3)

Compare to (3.1): same two-term structure (rate +β+ \beta-weighted distortion), but each term is now a bound rather than the exact MI.

Connection to the ELBO. Equation (8.3) is structurally identical to the negative variational evidence lower bound for a supervised latent-variable model: encoder pθ(tx)p_\theta(t \mid x), prior q(t)q(t), decoder rϕ(yt)r_\phi(y \mid t). VIB is VAE with the targets switched — instead of reconstructing XX, predict YY. This identification (Alemi et al. 2017, §3) unlocked the full deep-learning toolkit for IB.

When is the bound tight? With qq free: q=p(t)q = p(t). With rr free: r=p(yt)r = p(y \mid t). In practice qq is fixed at N(0,I)\mathcal{N}(0, I), so the bound is strictly loose. The gap is DKL(pθ(t)q(t))D_{\mathrm{KL}}(p_\theta(t) \,\|\, q(t)) — the KL between the encoder’s induced marginal and the chosen prior. This becomes a regularization that pulls the encoder toward producing a Gaussian-marginal TT.

8.3 Amortized encoders and the reparametrization trick

Diagonal-Gaussian amortized encoder. Parameterize

pθ(tx)  =  N(t;μθ(x),diag(σθ2(x))),(8.4)p_\theta(t \mid x) \;=\; \mathcal{N}\bigl(t;\, \mu_\theta(x),\, \mathrm{diag}\bigl(\sigma_\theta^2(x)\bigr)\bigr), \quad\quad (8.4)

with μθ(x),σθ(x)Rd\mu_\theta(x), \sigma_\theta(x) \in \mathbb{R}^d neural-network outputs sharing parameters across all xx.

Reparametrization (Kingma and Welling 2014). Sampling tpθ(tx)t \sim p_\theta(t \mid x) is differentiable in θ\theta via

t  =  μθ(x)+σθ(x)ϵ,ϵN(0,I).(8.5)t \;=\; \mu_\theta(x) \,+\, \sigma_\theta(x) \odot \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I). \quad\quad (8.5)

Stochasticity lives in ϵ\epsilon, independent of θ\theta — gradients pass straight through.

Minibatch SGD. For dataset {(xn,yn)}n=1N\{(x_n, y_n)\}_{n=1}^N,

L^VIB(θ,ϕ)  =  1Nn=1N[DKL(pθ(txn)q(t))    βEϵ[logrϕ(ynμθ(xn)+σθ(xn)ϵ)]].(8.6)\widehat{\mathcal{L}}_{\mathrm{VIB}}(\theta, \phi) \;=\; \frac{1}{N} \sum_{n=1}^N \Bigl[D_{\mathrm{KL}}(p_\theta(t \mid x_n) \,\|\, q(t)) \;-\; \beta\, \mathbb{E}_{\epsilon}[\log r_\phi(y_n \mid \mu_\theta(x_n) + \sigma_\theta(x_n) \odot \epsilon)]\Bigr]. \quad\quad (8.6)

The VIB training loop is almost identical to a VAE training loop, with one substitution: the decoder predicts YY from TT instead of XX from TT. Code-wise, switching from VAE to VIB is a one-line change.

8.4 A closed-form linear-Gaussian VIB sandbox

To pull the VIB framework off the SGD treadmill, restrict to the linear-Gaussian case: jointly Gaussian (X,Y)(X, Y) from §6 setup, Gaussian encoder pθ(tx)=N(Ax,Σξ)p_\theta(t \mid x) = \mathcal{N}(A x, \Sigma_\xi), fixed Gaussian prior q(t)=N(0,I)q(t) = \mathcal{N}(0, I), and Gaussian decoder rϕ(yt)=N(Bt,Ση)r_\phi(y \mid t) = \mathcal{N}(B t, \Sigma_\eta).

Closed-form objective. Expanding (8.3):

LVIB(A,Σξ,B,Ση)  =  12[tr(ΣT)logΣξd]  +  β2NLL(B,Ση;A,Σξ),(8.7)\mathcal{L}_{\mathrm{VIB}}(A, \Sigma_\xi, B, \Sigma_\eta) \;=\; \tfrac{1}{2}\bigl[\mathrm{tr}(\Sigma_T) - \log|\Sigma_\xi| - d\bigr] \;+\; \tfrac{\beta}{2}\, \mathrm{NLL}(B, \Sigma_\eta;\, A, \Sigma_\xi), \quad\quad (8.7)

where ΣT=AΣXA+Σξ\Sigma_T = A \Sigma_X A^\top + \Sigma_\xi. The optimal decoder is B=ΣYXAΣT1B^* = \Sigma_{YX}\, A^\top\, \Sigma_T^{-1} and Ση=ΣYT\Sigma_\eta^* = \Sigma_{Y \mid T}. After this inner optimization,

LVIB(A,Σξ)  =  12[tr(ΣT)logΣξd]rate (upper bound on I(X;T))  +  β2logΣYΣYXAΣT1AΣXYdistortion (lower bound on I(T;Y)).(8.8)\mathcal{L}_{\mathrm{VIB}}^*(A, \Sigma_\xi) \;=\; \underbrace{\tfrac{1}{2}\bigl[\mathrm{tr}(\Sigma_T) - \log|\Sigma_\xi| - d\bigr]}_{\text{rate (upper bound on } I(X;T) \text{)}} \;+\; \underbrace{\tfrac{\beta}{2}\, \log\bigl|\Sigma_Y - \Sigma_{YX}\, A^\top\, \Sigma_T^{-1}\, A\, \Sigma_{XY}\bigr|}_{\text{distortion (lower bound on } -I(T;Y) \text{)}}. \quad\quad (8.8)

Gap to the true IB Lagrangian. The rate term overshoots I(X;T)I(X; T) by exactly

DKL(N(0,ΣT)N(0,I)),(8.9)D_{\mathrm{KL}}\bigl(\mathcal{N}(0,\, \Sigma_T) \,\|\, \mathcal{N}(0,\, I)\bigr), \quad\quad (8.9)

the KL from the encoder’s induced marginal to the unit-Gaussian prior. The gap vanishes only when ΣT=I\Sigma_T = I, which is generally not the IB optimum’s marginal.

Numerical experiment. Optimize (8.8) over (A,Σξ)(A, \Sigma_\xi) at 40 log-spaced β\beta on the 4-D toy from §6.4 with L-BFGS-B and β\beta-annealing warm-starts. At each β\beta, compute the actual (I(X;T),I(T;Y))(I(X;T), I(T;Y)) using §6’s closed-form Gaussian MI formulas — this evaluates the bound’s tightness, not its self-consistency.

The result: VIB trajectory traces a curve that stays inside (below and right of) the true IB curve, with the gap visible across all β\beta. At small β\beta the gap is small (both near origin); at intermediate β\beta the gap is largest; at large β\beta both saturate at I(T;Y)=I(X;Y)I(T;Y) = I(X;Y).

VIB trajectory inside the closed-form Gaussian IB curve, plus the predictiveness gap
Left: linear-Gaussian VIB trajectory traced across 40 β values on the 4-D toy, against the closed-form IB curve from §7. The VIB optimum sits strictly inside the IB curve at every β. Right: the predictiveness gap I(T;Y)_IB − I(T;Y)_VIB as a function of β on a log axis — small near the origin, peaks at intermediate β, decays to zero at saturation.

Loading VIB precompute (40 L-BFGS-B optima)…

9. The fitting and compression controversy

9.1 The Tishby–Zaslavsky and Shwartz-Ziv–Tishby claim

The most-discussed application of the IB principle in deep learning has been a descriptive one: the claim that deep networks, when trained by SGD, traverse a characteristic trajectory in the information plane that explains why they generalize. Tishby and Zaslavsky proposed this picture in 2015; Shwartz-Ziv and Tishby formalized and experimentally documented it in 2017.

For a feedforward classifier with hidden layers T1,T2,,TLT_1, T_2, \ldots, T_L, estimate I(X;Tl)I(X; T_l) and I(Tl;Y)I(T_l; Y) from the training data each epoch and plot trajectories. Shwartz-Ziv and Tishby (on tanh-activated networks) observed every layer’s trajectory tracing two phases:

The fitting phase. Early in training, I(X;Tl)I(X; T_l) and I(Tl;Y)I(T_l; Y) both grow. Diagonal climb in the IB plane.

The compression phase. Later in training, I(Tl;Y)I(T_l; Y) stays high while I(X;Tl)I(X; T_l) decreases. Leftward bend in the plane. Interpretation: late-phase SGD performs a diffusion in weight space that compresses the representations.

Three claims rode this picture:

  1. Compression is generic. Predicted for any deep classifier trained by SGD.
  2. Compression explains generalization. Late SGD performs implicit IB-style regularization.
  3. Deep nets implicitly do IB. Slogan packaging of (1) + (2).

The picture was provocative, visually compelling, and immediately influential. It also turned out to be contested.

9.2 The information-plane visualization

Setting aside the interpretive claims, the visualization — (I(X;Tl),I(Tl;Y))(I(X; T_l),\, I(T_l; Y)) over training — is a legitimate tool. Its catch is that computing it requires an MI estimator for high-dimensional continuous TT.

Binning. Discretize TT, plug-in. Sensitive to bin width and saturating activations.

kk-NN-based (Kraskov–Stögbauer–Grassberger 2004). Principled, slow, biased in high dimensions.

Parametric / variational. Fit a model. Tractable when correctly specified; biased when not.

The choice of estimator can change the qualitative shape of the trajectory. This is the methodological wedge that Saxe et al. drove into the original picture.

9.3 Saxe et al. 2018: tanh vs ReLU and the compression artifact

Saxe, Bansal, Dapello, Advani, Kolchinsky, Tracey, and Cox (ICLR 2018) ran the experiment with two variations the original work hadn’t probed.

Activation function. Shwartz-Ziv–Tishby used tanh; Saxe et al. replicated with ReLU. With tanh, compression appeared; with ReLU, the compression phase disappeared. I(X;Tl)I(X; T_l) grew monotonically throughout. Same architecture, same training, same dataset — qualitatively different trajectories depending on activation.

MI estimator artifact. Tanh activations saturate near ±1\pm 1. During training, activation values cluster into the boundary bins. The binning estimator interprets this as information loss even though no real information has been lost. With ReLU (no saturation), no such clustering.

Analytical sandbox. Saxe et al. analyzed linear networks where MI is computable in closed form. In the linear-Gaussian setting, SGD shows no compression phase. This was the cleanest evidence that compression is not intrinsic to SGD on deep nets.

Their conclusion. The compression phase in the original experiments is artifactual — an interaction between tanh’s saturation and the binning estimator. Switch to ReLU or to closed-form MI, and the phase disappears. The two-phase trajectory is not a generic feature of SGD on deep networks.

A short demonstration of the analytical case appears in Figure 10. We run gradient descent on the closed-form linear-Gaussian VIB objective from §8 at fixed β=5\beta = 5, track the trajectory using the exact Gaussian MI formulas (no estimator), and observe: I(X;T)I(X; T) and I(T;Y)I(T; Y) both grow monotonically. There is no compression phase. The “compression delta” — peak I(X;T)I(X;T) minus final I(X;T)I(X;T) — comes out to 0.00000.0000 nats on this run.

Gradient-descent trajectory under linear-Gaussian VIB at β=5: no compression phase
Left: gradient-descent trajectory on the closed-form linear-Gaussian VIB at fixed β = 5 on the 4-D toy, plotted in the information plane. Each point is one GD step (n = 400 total); color encodes step index. The trajectory traces from the trivial encoder (start) toward the VIB optimum (converged), staying inside the §7 IB curve. Right: I(X;T) and I(T;Y) as functions of GD step. Both grow monotonically — no compression phase. The annotated compression delta (peak I(X;T) − final) is essentially zero.

9.4 What the controversy leaves intact (and what it doesn’t)

What did not survive Saxe et al. 2018 and the work that followed:

  • The claim that compression is a generic feature of SGD on deep networks.
  • The claim that compression explains deep-network generalization.
  • The packaging “deep learning is information bottleneck.”

What did survive:

  • The information plane as a visualization tool. Plotting (I(X;Tl),I(Tl;Y))(I(X; T_l), I(T_l; Y)) over training remains useful for understanding what a network’s hidden representations are doing, with the caveat that the MI estimator matters.
  • VIB as a regularizer. Alemi et al.’s MNIST experiments showed measurable robustness gains against adversarial examples, independent of any descriptive claim about implicit IB in unmodified SGD.
  • The fitting phase. Uncontroversially observed by everyone.
  • Compression as an inducible phenomenon. Saturating activations, stochastic noise injection (Achille and Soatto’s Information Dropout), explicit IB regularization (VIB) — all produce compression on purpose. Not automatic, but a useful design choice.

Goldfeld, van den Berg, Greenewald, Melnyk, Nguyen, Kingsbury, and Polyanskiy (NeurIPS 2019) showed that compression is closely tied to entropic regularization of the activations — either via discretization, or via injected noise. Compression observed in the information plane is real when it appears, but it appears as a consequence of specific structural choices, not as a universal SGD phenomenon.

Honest assessment. The IB principle is most useful held as a prescriptive Lagrangian rather than a descriptive theory of deep learning. The prescriptive reading (VIB, Information Dropout) is well-founded and has measurable benefits. The descriptive reading (deep nets implicitly do IB) was overreached and has been substantially walked back. The framework is sharp in its proper domain — clear XX, clear YY, the need for a compressed TT that retains predictiveness — and weaker the further one pushes it past that domain.

10. IB beyond compression-vs-prediction

The IB Lagrangian’s two-term structure generalizes naturally to multi-term variants by adding mutual-information quantities with their own Lagrange multipliers. Three application stories illustrate this modularity.

10.1 Information dropout and the emergence of invariance

Achille and Soatto (2018) implement VIB via a multiplicative-noise injection on activations:

T  =  f(WX)exp(Z),ZN(0,σθ2(X)),(10.1)T \;=\; f(W X) \,\odot\, \exp(Z), \qquad Z \sim \mathcal{N}\bigl(0,\, \sigma_\theta^2(X)\bigr), \quad\quad (10.1)

with σθ(X)\sigma_\theta(X) a learned, input-dependent noise scale. For ReLU activations, (10.1) implements a log-normal noise on the positive part of TT. The objective is the VIB Lagrangian (8.3) with a log-normal prior.

The invariance claim. A representation TT is invariant to a nuisance factor NN when TNYT \perp N \mid Y. Achille and Soatto show that minimizing the VIB Lagrangian with sufficient β\beta produces representations whose XX-content is exactly the part of XX that predicts YY. Nuisances are squeezed out by the compression term. Invariance is a consequence of IB compression, not an extra ingredient.

The disentanglement claim. Under information dropout, components of TT become approximately independent conditional on YY, more sharply at higher β\beta. The IB Lagrangian’s β\beta is, structurally, the β\beta-VAE’s β\beta (Higgins et al. 2017) — they are the same quantity in different applications.

What the framing buys. Information dropout gives a one-line architectural change (multiplicative noise on activations) that implements VIB-style regularization. The IB principle used prescriptively — as a design tool — produces invariance and partial disentanglement automatically.

10.2 IB framings of fair representation learning

Moyer, Gao, Brekelmans, Galstyan, and Ver Steeg (2018) cast fair representation learning as a generalized IB problem with an explicit sensitive-attribute penalty.

Setup. XX is the input, YY the prediction target, SS a sensitive attribute. The goal: TT that predicts YY well, contains minimal info about SS, and stays compact about XX. The fair-IB Lagrangian:

Lfair(p(tx))  =  I(X;T)rate    βYI(T;Y)utility  +  βSI(T;S)leakage,(10.2)\mathcal{L}_{\mathrm{fair}}\bigl(p(t \mid x)\bigr) \;=\; \underbrace{I(X; T)}_{\text{rate}} \;-\; \underbrace{\beta_Y\, I(T; Y)}_{\text{utility}} \;+\; \underbrace{\beta_S\, I(T; S)}_{\text{leakage}}, \quad\quad (10.2)

with two Lagrange multipliers controlling utility and leakage. The Pareto frontier sweeps a three-axis trade-off.

Variational treatment. Each MI term gets a variational bound following §8: upper bound on I(X;T)I(X;T) via E[DKL(p(tx)q(t))]\mathbb{E}[D_{\mathrm{KL}}(p(t \mid x) \,\|\, q(t))], lower bound on I(T;Y)I(T;Y) via an auxiliary decoder rYr_Y, upper bound on I(T;S)I(T;S) via a second auxiliary decoder rSr_S. End-to-end trainable by SGD.

Comparison to adversarial fair-rep. The pre-Moyer-et-al. approach was adversarial: learn TT alongside a discriminator DSD_S, train TT to fool DSD_S. Minimax, GAN-style, training instability. The IB approach bounds I(T;S)I(T;S) directly via its variational form — single-level minimization, no minimax. Empirically (UCI Adult, Heritage Health), IB matches or beats adversarial fair-rep with more stable training.

What the framing buys. Each new constraint is one variational term plus one β\beta. Single-level optimization, end-to-end differentiable.

10.3 IB framings of privacy and the utility–leakage trade-off

Replace the sensitive attribute SS with all of the information we want to protect:

Lpriv(p(tx))  =  I(T;Y)  +  βSI(T;S),(10.3)\mathcal{L}_{\mathrm{priv}}\bigl(p(t \mid x)\bigr) \;=\; -\, I(T; Y) \;+\; \beta_S\, I(T; S), \quad\quad (10.3)

(or with XX in the leakage slot for full anonymization).

Connection to differential privacy. DP gives worst-case guarantees (bounded logp(TD)/p(TD)\log p(T \mid D)/p(T \mid D') for adjacent datasets); IB-privacy gives average-case guarantees (bounded I(T;X)I(T; X) or I(T;S)I(T; S)). DP is stronger and has composition theorems; IB-privacy is more flexible — it produces a Pareto frontier rather than a single budget.

Operational interpretation. Issa, Wagner, and Kamath (2020) showed that I(T;S)I(T; S) has a clean operational meaning: it is the expected log-ratio by which a Bayesian adversary’s posterior on SS improves after observing TT. Minimizing I(T;S)I(T; S) is literally minimizing the expected information gain an adversary obtains.

Practical use cases. Federated learning, synthetic data generation, selective-attribute privacy. The deep-learning machinery (VIB encoders, decoders, reparametrization) transfers directly.

A unifying observation. Each application here is one instance of a multi-term IB Lagrangian:

ApplicationRatePredictivenessOther
Standard IB / VIBI(X;T)I(X;T)βI(T;Y)\beta\, I(T;Y)
Information dropoutI(X;T)I(X;T) via mult. noiseβI(T;Y)\beta\, I(T;Y)
Fair-rep (Moyer et al.)I(X;T)I(X;T)βYI(T;Y)\beta_Y\, I(T;Y)βSI(T;S)\beta_S\, I(T;S) leakage
PrivacyI(X;T)I(X;T) anonymizationβI(T;Y)\beta\, I(T;Y) utilityβSI(T;S)\beta_S\, I(T;S) leakage

§11 places this template in its broader context.

11. Connections to other principles

11.1 IB and rate-distortion: the same Lagrangian template

The classical rate-distortion Lagrangian for source XX with distortion dd:

FsRD(p(x^x))  =  I(X;X^)  +  sE[d(X,X^)],s0.(11.1)\mathcal{F}_s^{\mathrm{RD}}\bigl(p(\hat x \mid x)\bigr) \;=\; I(X; \widehat{X}) \;+\; s\, \mathbb{E}\bigl[d(X, \widehat{X})\bigr], \qquad s \ge 0. \quad\quad (11.1)

The IB Lagrangian rewritten:

Fβ(p(tx))  =  I(X;T)  +  β[I(X;Y)I(T;Y)]    βI(X;Y).(11.2)\mathcal{F}_\beta\bigl(p(t \mid x)\bigr) \;=\; I(X; T) \;+\; \beta\, \bigl[I(X;Y) - I(T;Y)\bigr] \;-\; \beta\, I(X;Y). \quad\quad (11.2)

IB is rate-distortion with the predictive distortion:

dIB(X,T)  =  DKL(p(yX)p(yT)),E[dIB(X,T)]  =  I(X;Y)I(T;Y).(11.3)d_{\mathrm{IB}}(X, T) \;=\; D_{\mathrm{KL}}\bigl(p(y \mid X) \,\|\, p(y \mid T)\bigr), \qquad \mathbb{E}[d_{\mathrm{IB}}(X, T)] \;=\; I(X;Y) - I(T;Y). \quad\quad (11.3)

Where they differ. RD’s distortion is fixed before the algorithm runs. IB’s distortion depends on the encoder iterate through the induced p(yt)p(y \mid t). Two consequences:

  1. RD Lagrangian is convex in p(x^x)p(\hat x \mid x) for fixed distortion → BA converges globally. IB is not convex → BA finds local stationary points.
  2. R(D) admits closed forms. Gaussian RD has R(D)=12log(σ2/D)R(D) = \tfrac{1}{2} \log(\sigma^2 / D). Gaussian IB has a closed-form curve (§7) with a different distortion. Both curves are convex-decreasing.

Side-by-side numerical comparison at common rates and ρ=0.9\rho = 0.9:

R (nats)RD: D=e2RD = e^{-2R}IB: DpredD_{\mathrm{pred}}
0.200.67030.6750
0.500.36790.4716
1.000.13530.2277
2.000.01830.0376

The IB distortion is larger at every rate — the price of using a self-emergent distortion that depends on YY rather than a fixed Euclidean one.

Side-by-side Gaussian R(D) and Gaussian IB R(D_pred) at ρ = 0.9
Left: the Gaussian rate-distortion function R(D) = ½ log(1/D) for scalar source X ~ N(0, 1) with squared-error distortion. Convex decreasing in (D, R). The Lagrangian 𝓕_s^RD is annotated. Right: the Gaussian IB function R(D_pred) at ρ = 0.9, where D_pred = I(X;Y) − I(T;Y). Convex decreasing, structurally identical shape, with the IB Lagrangian 𝓕_β annotated. The substantive difference: RD's distortion is fixed externally; IB's depends on the encoder iterate.

Both curves are convex decreasing in (D, R) — structurally identical shape. The substantive difference is the distortion's provenance: RD takes d(X, X̂) as input, fixed before the algorithm runs; IB derives D_pred = D_KL(p(y|x) ‖ p(y|t)) from the encoder iterate itself. The side-by-side at ρ = 0.9 reproduces the brief's §11.1 table: at R = 0.20, RD has D ≈ 0.670 vs IB D_pred ≈ 0.675; at R = 2.00, RD ≈ 0.018 vs IB ≈ 0.038.

The full rate-distortion theory is developed in rate-distortion; §11.1 here just makes the structural parallel explicit.

11.2 IB and PAC-Bayes: KL as a complexity penalty

PAC-Bayes (McAllester 1999; Catoni 2007) bounds generalization for stochastic learners. With probability 1δ\ge 1 - \delta over draws of training set SS:

EhQ[L(h)]    EhQ[L^S(h)]  +  DKL(QP)+log(1/δ)2n,(11.4)\mathbb{E}_{h \sim Q}\bigl[L(h)\bigr] \;\le\; \mathbb{E}_{h \sim Q}\bigl[\widehat{L}_S(h)\bigr] \;+\; \sqrt{\frac{D_{\mathrm{KL}}(Q \,\|\, P) + \log(1 / \delta)}{2 n}}, \quad\quad (11.4)

where QQ is the data-conditional posterior and PP is a data-independent prior. The KL term is a complexity penalty.

Structural parallel with IB. The IB rate I(X;T)=Ex[DKL(p(tx)p(t))]I(X; T) = \mathbb{E}_x[D_{\mathrm{KL}}(p(t \mid x) \,\|\, p(t))] is a KL between a data-conditional encoder and its marginal. PAC-Bayes’s DKL(QP)D_{\mathrm{KL}}(Q \,\|\, P) is structurally analogous.

Russo–Zou / Xu–Raginsky. Russo and Zou (2016) and Xu and Raginsky (2017) gave a tight information-theoretic generalization bound: for stochastic algorithm p(S,A)p(S, A) with sub-Gaussian loss,

E[L(A)L^S(A)]    2σ2I(S;A)n.(11.5)\bigl|\mathbb{E}\bigl[L(A) - \widehat{L}_S(A)\bigr]\bigr| \;\le\; \sqrt{\frac{2\, \sigma^2\, I(S; A)}{n}}. \quad\quad (11.5)

The generalization gap is bounded by the mutual information between training set and learned hypothesis.

The unifying picture. A stochastic learning rule whose output depends only weakly on the input (in MI/KL terms) generalizes well. IB says the same thing for stochastic encoders. Same arithmetic, three framings. For VIB: bounding I(X;T)I(X; T) during training literally controls a generalization gap, in the Russo–Zou–Xu–Raginsky sense.

11.3 IB and InfoNCE / contrastive learning

InfoNCE (van den Oord, Li, and Vinyals 2018) is the loss behind modern self-supervised representation learning. For positive pairs (xi,yi)(x_i, y_i) and K1K - 1 negatives,

LInfoNCE(θ)  =  E ⁣[logefθ(xi,yi)1Kj=1Kefθ(xi,yj)].(11.6)\mathcal{L}_{\mathrm{InfoNCE}}(\theta) \;=\; -\, \mathbb{E}\!\left[\log \frac{e^{f_\theta(x_i, y_i)}}{\frac{1}{K}\sum_{j = 1}^K e^{f_\theta(x_i, y_j)}}\right]. \quad\quad (11.6)

The MI lower bound. For the optimal-discriminator score,

I(X;Y)    logK    LInfoNCE(θ).(11.7)I(X; Y) \;\ge\; \log K \;-\; \mathcal{L}_{\mathrm{InfoNCE}}(\theta^*). \quad\quad (11.7)

Minimizing InfoNCE maximizes a tractable lower bound on I(X;Y)I(X; Y). Modern contrastive methods scale KK aggressively (SimCLR: 8192-sample batches; MoCo: even larger via momentum queues) to tighten this bound.

Connection to IB. Casting contrastive in IB notation: XX first view, YY second view, T=gθ(X)T = g_\theta(X) the representation. Contrastive learning maximizes a lower bound on I(T;Y)I(T; Y) — the predictiveness term in the IB Lagrangian.

What is missing: the compression term I(X;T)I(X; T). Contrastive learning has no explicit rate penalty. The compression that emerges comes from the limited capacity of the representation (typically 128–256-dim projection heads). The bottleneck is structural rather than information-theoretic.

This makes contrastive learning a partial IB — the predictiveness half with architectural compression. Saunshi et al. (2019) closed the loop, showing that contrastive learning produces representations with downstream-classification generalization bounds structurally similar to PAC-Bayes-style bounds.

11.4 IB and the minimal sufficient statistic

The classical notion of sufficient statistic (Fisher 1922; Lehmann–Scheffé 1950) is the statistical-inference grandparent of IB.

Classical definition. A statistic T=T(X)T = T(X) is sufficient for YY given XX if p(xt)p(x \mid t) does not depend on YY — equivalently, I(T;Y)=I(X;Y)I(T; Y) = I(X; Y). A statistic is minimal sufficient if it is sufficient and is a function of every other sufficient statistic.

The classical concept is binary. Sufficient or not. No continuous “how sufficient” scale.

IB makes sufficiency continuous. As β\beta \to \infty, the IB optimum approaches the minimal sufficient statistic: limβI(Tβ;Y)=I(X;Y)\lim_{\beta \to \infty} I(T_\beta^*; Y) = I(X; Y), and among sufficient statistics, the IB β=\beta = \infty optimum has the smallest I(X;T)I(X; T). The IB problem at finite β\beta is the lossy version of the classical sufficient-statistic problem, with β\beta as the lossiness knob.

This generalizes-the-classical the same way rate-distortion generalizes Shannon source coding:

ClassicalModern lossy version
Shannon source coding (lossless)Rate-distortion
Sufficient statistic (classical)Information bottleneck

Exponential-family sufficiency. For exponential-family p(yx)=h(y)exp(η(x)T0(y)A(η(x)))p(y \mid x) = h(y) \exp(\eta(x)^\top T_0(y) - A(\eta(x))), the natural parameter η(X)\eta(X) is a finite-dimensional sufficient statistic. IB applied to such families produces finite-dimensional representations — the algorithm rediscovers the exponential-family structure. In the Gaussian case (§6), this is exact: the IB optimum at β=\beta = \infty recovers the canonical-correlation projection, which is the linear-Gaussian minimal sufficient statistic.

The lineage.

YearContributionAuthor(s)
1922Sufficient statisticsFisher
1948Mutual information / source codingShannon
1950Minimal sufficient statisticsLehmann, Scheffé
1959Rate-distortion theoryShannon
1972Blahut–Arimoto algorithmBlahut; Arimoto (independently)
1999Information bottleneckTishby, Pereira, Bialek
2005Gaussian IBChechik, Globerson, Tishby, Weiss
2017Variational IBAlemi, Fischer, Dillon, Murphy

IB is, in this lineage, a 1999 modern reframing of a 1922 classical idea via Shannon’s information theory.

12. Computational notes and honest limits

12.1 Numerical stability of the IB updates

The IB iteration involves three operations that lose precision in standard float64, especially at large β\beta. Implementing in log-space throughout is the fix.

Issue 1: the exponential. (3.3) involves exp(βDKL)\exp(-\beta\, D_{\mathrm{KL}}) which underflows when βDKL>700\beta\, D_{\mathrm{KL}} > 700. Use log-space:

logp(n+1)(tx)  =  logp(n)(t)βDKL(p(yx)p(n)(yt))logZ(x,β),(12.1)\log p^{(n+1)}(t \mid x) \;=\; \log p^{(n)}(t) - \beta\, D_{\mathrm{KL}}\bigl(p(y \mid x) \,\|\, p^{(n)}(y \mid t)\bigr) - \log Z(x, \beta), \quad\quad (12.1)

with logZ(x,β)=logsumexpt[logp(n)(t)βDKL]\log Z(x, \beta) = \mathrm{logsumexp}_t[\log p^{(n)}(t) - \beta\, D_{\mathrm{KL}}] via scipy.special.logsumexp (or a hand-rolled max-stable version in TS).

Issue 2: zero-probability components. Extinct clusters during annealing have p(t)=0p(t) = 0; log0=\log 0 = -\infty propagates NaN. Clip: logp(t)logmax(p(t),10300)\log p(t) \to \log \max(p(t),\, 10^{-300}).

Issue 3: KL with vanishing summand. Use scipy.special.rel_entr(p, q) instead of computing plog(p/q)p \log(p / q) by hand. rel_entr returns 00 when p=0p = 0 and ++\infty when p>0,q=0p > 0,\, q = 0.

Issue 4: convergence test. Compare Fβ\mathcal{F}_\beta values, not encoders. Encoder distance is misleading near phase transitions where the cluster labeling can flip. Use relative-decrement Fβ(n)Fβ(n+1)<ϵFβ(n)|\mathcal{F}_\beta^{(n)} - \mathcal{F}_\beta^{(n+1)}| < \epsilon \cdot |\mathcal{F}_\beta^{(n)}|.

Implementation rule. Track logp(tx),logp(t),logp(yt)\log p(t \mid x), \log p(t), \log p(y \mid t) throughout. Carry out KLs, normalizations, marginalizations in log-space. Convert to probabilities only at final output. This is what the §5 viz components do — see T3IBCurveTracer.tsx and ClusterBifurcationCascade.tsx in the source.

12.2 Estimating MI on continuous data: plug-in pitfalls

The discrete IB algorithm assumed access to the joint p(x,y)p(x, y). For continuous high-dimensional data — images, text, audio — we have samples. Every estimator has pathologies.

Binning (plug-in). Bias: positive (Roulston 1999) and bin-count-dependent. The Saxe et al. 2018 critique of Shwartz-Ziv–Tishby (§9.3) hinged on this — tanh saturation clusters activations into bins, dropping binned MI without any real information loss.

kk-NN (Kraskov–Stögbauer–Grassberger 2004). O(NlogN)O(N \log N) per estimate; substantial negative bias in high dimensions (Belghazi et al. 2018); sensitive to kk.

Variational bounds. MINE (Belghazi et al. 2018), InfoNCE (van den Oord et al. 2018). Training-time friendly, differentiable, scalable. Tightness depends on critic capacity; reliably tracks changes in MI; absolute values biased.

Recommendation.

TaskEstimator
Exploratory analysis, small datakk-NN
Training-time optimization (VIB and descendants)Variational bounds
Descriptive claims about a network’s information planeBe very careful — see Goldfeld et al. 2019 for guidance

12.3 Within-formalML siblings and forward pointers

Direct prerequisites.

Soft prerequisites and structural cousins.

Forward pointers.

12.4 Honest limits — what IB doesn’t tell you

1. IB doesn’t predict generalization in deep nets. The §9 controversy. The original “deep learning is IB” framing did not survive Saxe et al. Used prescriptively (VIB, Information Dropout), the framework has measurable transfer and robustness benefits; descriptively, it does not explain what SGD on unmodified networks is doing.

2. IB doesn’t tell you the optimal β\beta a priori. The Lagrange multiplier is application-dependent. No general procedure to pick “the right β\beta” from data alone. In practice: cross-validate downstream performance, or target a specific rate or predictiveness.

3. IB requires either the joint p(x,y)p(x, y) or a viable variational bound. For finite high-dim samples, neither is unproblematic. VIB’s bounds are loose by design — the encoder’s induced marginal is forced toward the prior, a deviation from the true IB optimum (§8.4 exhibited this on the Gaussian sandbox).

4. Non-uniqueness. Labeling degeneracy and genuinely distinct basins. Annealing helps; doesn’t eliminate. See Figure 3 for an explicit example on the 8-document toy.

5. The compression-prediction framing isn’t always the right one. Transfer learning across many tasks, pure self-supervised learning, learned-loss problems all push outside strict IB. Sometimes multi-constraint IB (§10) is the answer; sometimes a different framework.

6. Sample-complexity guarantees are weak. Russo–Zou / Xu–Raginsky (§11.2) gives MI-based generalization bounds, rarely tight in practice. VIB models typically outperform the bound; the gap is active research (Achille and Soatto 2018).

A closing thought. The IB principle is most useful held as a prescriptive Lagrangian rather than a descriptive theory of deep learning. It compresses, it predicts, and it has clean closed-form structure in the Gaussian setting that anchors intuition for general cases. Within its domain — clear XX, clear YY, the need for a compressed TT that retains predictiveness — the IB and its variational descendants are sharp tools.

Connections

  • Mutual information — the currency of the IB Lagrangian — is defined and developed there. The IB curve's saturation point I(X;Y) and the DPI ceiling both inherit machinery from the prereq, and §1.3 reuses the entropy-decomposition view I(X;T) = H(X) − H(X|T) to motivate the compression term. shannon-entropy
  • The predictive distortion d_IB(x,t) = D_KL(p(y|x) ‖ p(y|t)) that emerges from the §3 stationarity condition is exactly the KL from the prereq, and the §4 Lyapunov construction uses KL non-negativity twice (one penalty per auxiliary argument q and r). kl-divergence
  • Closest parent. The Blahut–Arimoto algorithm template that the §3.4 IB iteration inherits is developed there; §11.1 compares R(D) against R(D_pred) explicitly, showing IB as RD with a self-emergent distortion. rate-distortion
  • Downstream cousin. The §11.4 minimal-sufficient-statistic framing connects the IB optimum at β → ∞ to the classical sufficiency notion; representation-learning develops the contrastive and equiangular-tight-frame views of the same compress-while-preserving question. representation-learning
  • The §8 VIB bound is structurally identical to the supervised-VAE ELBO — same encoder–prior–decoder construction, with Y replacing X in the reconstruction target. The Alemi et al. 2017 identification of VIB as 'VAE with the targets switched' transfers the entire ELBO toolkit to IB. variational-inference
  • PAC-Bayes uses D_KL(Q ‖ P) as a complexity penalty; the IB rate I(X;T) = E_x[D_KL(p(t|x) ‖ p(t))] is the same arithmetic in a different framing. §11.2 makes the Russo–Zou / Xu–Raginsky generalization-bound bridge explicit. pac-bayes-bounds
  • Sibling Lagrangian framework. Both MDL and IB trade code length against fidelity; the prescriptive 'penalize complexity' arithmetic is shared between them. IB does not depend on MDL formally, but §11 places both in the broader rate-distortion lineage. minimum-description-length

References & Further Reading