intermediate bayesian-ml 50 min read

Mixed-Effects Models

Partial-pooling theory in its own right: Henderson's mixed-model equations and the BLUP shrinkage formula, REML for variance components via flat-prior Bayesian marginalization, and the frequentist–Bayesian bridge that fits the same hierarchical model two ways and shows them agree numerically

Overview

Mixed-effects models sit at one of those rare junctions where four threads of statistical theory converge on the same algebra. The frequentist GLS-with-known-covariance estimator gives the cleanest answer when the marginal covariance is given. Henderson’s penalized-likelihood derivation produces the same answer without inverting the marginal covariance at all, by treating the random effects as quantities to be solved for jointly with the fixed effects. The REML / Bayesian-marginalization story corrects the maximum-likelihood downward bias on variance components by integrating the fixed effects out under a flat prior. And the full Bayesian fit puts a prior on everything and runs NUTS. When the data is informative enough — which is to say, in the regime where mixed-effects models are most often appropriate — the four routes give numerically equivalent answers. That convergence is the structural reason this model class has been a workhorse of applied statistics for fifty years, and the reason it deserves its own topic in T5.

We develop the four routes in turn on a single running example: six classrooms, $J = 6$, with $N = 60$ student-level observations spread unevenly across them. §1 motivates partial pooling geometrically as the third response to grouped data, after complete pooling and no pooling. §2 writes the linear mixed model in matrix form and identifies the marginal covariance $\mathbf{V} = \mathbf{Z}\mathbf{G}\mathbf{Z}^\top + \mathbf{R}$ that drives generalized least squares. §3 derives Henderson’s mixed-model equations and the closed-form BLUP shrinkage formula $\lambda_j = \tau^2/(\tau^2 + \sigma^2/n_j)$ — the single number that characterizes how mixed-effects models partial-pool. §4 derives REML by marginalizing the fixed effects out under a flat improper prior, transparent about the connection to the Gaussian-process marginal-likelihood story. §5 fits the same six-classroom data through both statsmodels.MixedLM (REML) and PyMC (full Bayesian), shows the two views agree numerically, and re-encounters the centered-vs-non-centered story from probabilistic programming in its native habitat. §6 closes with where mixed-effects models shine, where they break down, and the connections that motivate the rest of T5.

The reader who has internalized partial pooling from formalStatistics: Hierarchical Bayes and Partial Pooling on the cross-site companion will find this topic the engineering counterpart: same theory, more matrix algebra, code that runs.

1. Three responses to grouped data

1.1 Six classrooms

Six teachers give the same standardized exam to their classes at the end of a term. Class sizes vary — one is a tiny seminar of four students, one is a survey lecture of twenty, the rest fall in between. Alongside each student’s exam score $y$, the school district records a standardized index of preparation hours $x \in [0, 5]$; call it the prep-hours index, with $x = 0$ meaning no out-of-class preparation and $x = 5$ near the upper observed end. We have $N = 60$ student-level observations spread unevenly across $J = 6$ classrooms, with $(n_1, \ldots, n_6) = (4, 6, 8, 10, 12, 20)$.

Panel (a) of Figure 1 shows all sixty points, color-coded by classroom. Some classrooms sit consistently higher (more effective teaching, more prepared students, kinder grading curves — we don’t know, and the data doesn’t tell us); others sit lower. Within each classroom, scores trend upward with prep hours, more or less linearly, more or less consistently. The question is how to summarize the relationship between prep hours and exam score in a way that respects the classroom structure.

There are three natural answers, each tempting in its own way.

1.2 Three classical responses

The first answer is complete pooling: ignore the classroom labels entirely, fit a single line through all sixty points, and report one intercept and one slope. Operationally, this is just OLS on the $(x_i, y_i)$ pairs with no notion of classroom. The fit is clean — one slope, one intercept, narrow standard errors thanks to the full $N = 60$. The cost is also clean: the line is wrong for any classroom whose typical score sits far from the grand mean. A consistently high-scoring classroom sits above the line, a consistently low one sits below, and the line itself is averaging across classroom signal we know is there. We’ve thrown away the structure the data was offering us.

The second answer is no pooling: fit each classroom separately. Concretely, encode classroom identity with six dummy variables and run OLS on the joint design matrix, holding the slope on prep hours shared across classrooms (the prep-hours-to-score relationship is presumably the same across teachers; we’ll relax even this in §2). We get six intercepts, one per classroom, and one shared slope. The fit is now sensitive to classroom — the largest classroom’s intercept is well-estimated, with a tight standard error proportional to $\sigma/\sqrt{n_6} = \sigma/\sqrt{20}$. But the small classrooms? Classroom 1 has four students. A single unusually weak or strong student moves its intercept dramatically; the standard error on that intercept is $\sigma/\sqrt{4} = \sigma/2$, more than twice that of the largest classroom. We’ve gone from one too-aggressive average to six estimates whose reliability varies by a factor of $\sqrt{20/4} \approx 2.2$ across classrooms.

The third answer is partial pooling. Don’t treat the six classroom intercepts as six independent parameters; treat them as six draws from a common population of classroom-level intercepts, and estimate that population’s mean and spread alongside the per-classroom values. The estimate for Classroom $j$’s intercept becomes a compromise: a weighted average of the no-pooling estimate $\hat\alpha_j^{\mathrm{no\text{-}pool}}$ (what the classroom’s own data says) and the global complete-pooling intercept $\hat\alpha^{\mathrm{pool}}$ (what the entire population says), with the weight depending on how much data the classroom has and how much classrooms vary at the population level. Lots of data plus large between-classroom variation — trust the classroom’s own data, pull little. Little data plus small between-classroom variation — distrust the classroom’s own data, pull hard toward the grand mean.

Panel (c) of Figure 1 shows the partial-pooling fit on the same scatterplot. The lines for the largest classrooms barely move from where panel (b)’s no-pooling fits placed them. The lines for the smallest classrooms drift visibly toward the grand-mean line that panel (a) drew. The magnitude of the drift is the shrinkage we’ll derive in §3 — for now, it’s a visual fact about the figure.

The geometric punchline: panel (c) sits between panels (a) and (b), and the location depends on group size. Partial pooling is the answer that admits both the within-group signal and the between-group signal exist, and weighs them against each other. The rest of the topic spells out how.

Interactive figure. Six classrooms, sizes (4, 6, 8, 10, 12, 20). Sliding pooling strength toward 0 collapses the classroom-level lines onto the dashed population-mean line (complete pooling); sliding it toward 1 returns each line to the OLS fit through its own data (no pooling). Anywhere in between, the smallest classroom (n = 4) shrinks toward the dashed line much faster than the largest (n = 20) because λⱼ depends on group size. The default position reproduces the topic's REML shrinkage factors (~0.22 for n = 4, rising to ~0.59 for n = 20).

1.3 What makes an effect “random”?

The terminology that names this third response — “random effects,” “mixed-effects models” — has a long history of confusing people, in part because the word random is doing two jobs at once. Andrew Gelman’s 2005 paper “Analysis of variance: why it is more important than ever” surveys five competing definitions in the literature and argues for a pragmatic one we’ll adopt here:

A coefficient is fixed if we model it as a single unknown number to be estimated. A coefficient is random if we model it as a draw from a distribution whose own parameters are estimated.

The slope on prep hours in our setup is fixed: there’s one unknown number $\beta$, and the data estimates it. The classroom intercepts are random: the six $\alpha_j$ values are six draws from a common $\mathcal{N}(\alpha, \tau^2)$, and the data estimates the population mean $\alpha$ and the population spread $\tau$ as well as the realized $\alpha_j$. The terminology is unfortunate — both kinds of coefficients are unknown and both are estimated — but the modeling commitment is real and consequential.

That commitment is what enables shrinkage. Six independent intercepts have no reason to inform one another; they’re separate parameters in a flat parameter space. Six draws from a parent distribution share that parent — the same data that informs the parent’s mean and spread also pulls each child toward its siblings. This is the structural reason small classrooms shrink hard and large classrooms shrink little: a small-$n$ classroom has weak within-group signal that the parent can dominate, while a large-$n$ classroom has strong within-group signal that overpowers the parent’s pull.

Three remarks before we move on. First, fixed and random are modeling choices, not properties of the world. We treat the classroom intercepts as random because we believe the six teachers were sampled from some population of teachers and we want to generalize. If we cared specifically about these six teachers and had no interest in extrapolating to others, fixed-effects with classroom dummies would be the right call. Second, the same coefficient can be modeled differently in different studies — replication studies routinely promote a treatment effect from fixed (one study) to random (across studies), because the across-study generalization is the whole point. Third, the random-effects framing brings in a hyperprior. Frequentist mixed-effects models (Henderson, REML) treat the hyperprior parameters $(\alpha, \tau^2)$ as point estimates; Bayesian mixed-effects models (PyMC, Stan) put a prior on them and integrate. We’ll see in §5 that for the typical case, the two views give numerically similar answers. For now, both views agree on the structural point: classroom-level intercepts are tied together by a common parent.

The rest of the topic makes the pieces precise. §2 writes the linear mixed model in matrix form and identifies the marginal covariance $\mathbf{V} = \mathbf{Z}\mathbf{G}\mathbf{Z}^\top + \mathbf{R}$ that drives the GLS solution. §3 derives Henderson’s mixed-model equations and the closed-form BLUP shrinkage formula. §4 introduces REML for estimating the variance components $(\tau^2, \sigma^2)$. §5 fits §1’s six-classroom data both ways — statsmodels.MixedLM with REML and a PyMC GLMM — and shows the two views agree.

Three-panel scatter plot of sixty student-level exam scores against prep hours, color-coded by classroom. Panel (a) shows complete pooling — a single black OLS line through all sixty points, missing several classrooms by visible vertical offsets. Panel (b) shows no pooling — six classroom-colored lines with the shared slope but separate intercepts; the largest classroom's line sits cleanly through its data while the smallest classroom's line bends to its four points. Panel (c) shows partial pooling — six lines with the shared slope and partial-pooled intercepts plus a thin dashed black population-mean line; the largest classroom barely moves from panel (b) while the smallest drifts visibly toward the population-mean line.
Figure 1. Three classical responses to grouped data. (a) Complete pooling — one OLS line, ignoring classroom. (b) No pooling — six classroom-specific intercepts with a shared slope. (c) Partial pooling — same shared slope, but each intercept is the partial-pooled estimate from a random-intercept LMM. The smallest classroom's line drifts noticeably toward the population mean; the largest classroom's barely moves.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

# Six-classroom DGP: J=6 classrooms, N=60 students, varying group sizes
rng = np.random.default_rng(20260429)
J, n_per_group = 6, np.array([4, 6, 8, 10, 12, 20])
N = n_per_group.sum()
classroom = np.repeat(np.arange(J), n_per_group)
alpha_t, beta_t, tau_t, sigma_t = 50.0, 5.0, 5.0, 8.0
a = rng.normal(0.0, tau_t, size=J)
x = rng.uniform(0.0, 5.0, size=N)
y = alpha_t + beta_t * x + a[classroom] + rng.normal(0.0, sigma_t, size=N)
data = pd.DataFrame({"y": y, "x": x, "classroom": classroom})

# Three pooling regimes
pooled = stats.linregress(x, y)                                  # complete pooling
unpooled = smf.ols("y ~ C(classroom) + x", data).fit()           # no pooling
mlm = smf.mixedlm("y ~ x", data, groups="classroom").fit(        # partial pooling
    reml=True, method=["powell"]                                 # Powell — see note
)

The explicit method=["powell"] matters. statsmodels’ default optimizer chain (bfgs / lbfgs / cg) silently returns converged=False on this small-$J$ data with the realized $\hat\tau^2$ near the boundary; Powell converges cleanly to the same REML optimum that §4’s custom implementation finds independently. The shrinkage factors at the REML estimates are $\hat\lambda_j \approx (0.222, 0.298, 0.359, 0.412, 0.451, 0.588)$ across the six classrooms — small classrooms shrink hard, the largest shrinks visibly less.

2. The linear mixed model

§1 gave us three procedures and three pictures. We’re going to fold all three into a single matrix equation, the linear mixed model, and then notice something about its marginal distribution that turns out to drive everything in §3 and §4. The translation costs us nothing geometrically — the same six lines from §1’s panel (c) come out the other side — but it gains us the algebraic apparatus we need to derive Henderson’s equations and to estimate variance components.

2.1 The model in matrix form

Stack the $N = 60$ student-level observations into $\mathbf{y} \in \mathbb{R}^N$, with the row order matching §1’s classroom layout (rows 1–4 are Classroom 1, rows 5–10 are Classroom 2, and so on). The linear mixed model writes $\mathbf{y}$ as a sum of three pieces:

$$\mathbf{y} \;=\; \mathbf{X}\boldsymbol{\beta} \;+\; \mathbf{Z}\mathbf{u} \;+\; \boldsymbol{\varepsilon}, \quad\quad (2.1)$$

where each piece has a job:

  • $\mathbf{X} \in \mathbb{R}^{N \times p}$ is the fixed-effects design matrix. Each column is a covariate that we model with a single shared coefficient. For our six-classroom data, $p = 2$: column one is a column of ones (the intercept), column two is the prep-hours predictor $x_i$.
  • $\boldsymbol{\beta} \in \mathbb{R}^p$ is the fixed-effects vector. Here $\boldsymbol{\beta} = (\alpha, \beta)^\top$ — the global intercept and the shared slope on prep hours.
  • $\mathbf{Z} \in \mathbb{R}^{N \times q}$ is the random-effects design matrix. Each column corresponds to one entry of the random-effects vector $\mathbf{u}$, and its rows mark which observations are affected by that entry. For a random-intercept-only model with $J = 6$ classrooms, $q = J = 6$ and $\mathbf{Z}$ is the indicator matrix that puts a 1 in row $i$, column $j$ if observation $i$ belongs to Classroom $j$.
  • $\mathbf{u} \in \mathbb{R}^q$ is the random-effects vector. Here $\mathbf{u} = (a_1, \ldots, a_6)^\top$ — the six classroom-level intercept deviations.
  • $\boldsymbol{\varepsilon} \in \mathbb{R}^N$ is the residual vector — student-level noise.

The distributional assumptions complete the model:

$$\mathbf{u} \;\sim\; \mathcal{N}(\mathbf{0}, \mathbf{G}), \qquad \boldsymbol{\varepsilon} \;\sim\; \mathcal{N}(\mathbf{0}, \mathbf{R}), \qquad \mathbf{u} \;\perp\; \boldsymbol{\varepsilon}, \quad\quad (2.2)$$

where $\mathbf{G} \in \mathbb{R}^{q \times q}$ is the random-effects covariance and $\mathbf{R} \in \mathbb{R}^{N \times N}$ is the residual covariance. For our running example, both covariances are scaled identities:

$$\mathbf{G} \;=\; \tau^2 \mathbf{I}_J, \qquad \mathbf{R} \;=\; \sigma^2 \mathbf{I}_N. \quad\quad (2.3)$$

The first equality says the six classroom-level deviations are mutually independent draws from a common $\mathcal{N}(0, \tau^2)$ — six iid pulls from one parent distribution, the operational meaning of “the classrooms are exchangeable draws from a population.” The second says student-level noise is iid Gaussian. We’ll relax both in §6’s remarks but keep them throughout §3–§5; they cover the bulk of standard mixed-effects practice and let us write the cleanest proofs.

The trio (2.1)–(2.3) is the linear mixed model.

Definition 1 (Linear mixed model).

A linear mixed model on an $N$-vector of observations $\mathbf{y}$ is a triple $(\mathbf{X}, \mathbf{Z}, (\mathbf{G}, \mathbf{R}))$ with $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\mathbf{u} + \boldsymbol{\varepsilon}$, $\mathbf{u} \sim \mathcal{N}(\mathbf{0}, \mathbf{G})$, $\boldsymbol{\varepsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{R})$, and $\mathbf{u} \perp \boldsymbol{\varepsilon}$. The vector $\boldsymbol{\beta}$ collects the fixed effects and $\mathbf{u}$ collects the random effects; $\mathbf{X}$ and $\mathbf{Z}$ are the corresponding design matrices.

Notice what Definition 1 does not commit to: it does not say what $\mathbf{X}$ and $\mathbf{Z}$ look like, or how $\mathbf{G}$ and $\mathbf{R}$ are structured. Different choices encode different scientific assumptions — random intercepts only versus random slopes too, crossed versus nested classrooms, autoregressive errors within a longitudinal subject, and so on. The model in (2.1) is a chassis; the assumptions about $\mathbf{X}, \mathbf{Z}, \mathbf{G}, \mathbf{R}$ are what we drop into it for a particular study.
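To make the chassis concrete, here is a minimal sketch (a toy variant, not the running example’s model) of what changes when we drop in random slopes alongside random intercepts: each group contributes two columns to $\mathbf{Z}$, and $\mathbf{G}$ becomes block-diagonal with one $2 \times 2$ covariance block per group. All numerical values are illustrative.

```python
import numpy as np

# Toy shape: 3 groups, 4 observations each, one covariate x.
rng = np.random.default_rng(0)
J, n = 3, 4
N = J * n
group = np.repeat(np.arange(J), n)
x = rng.uniform(0.0, 5.0, size=N)

# Random intercept + random slope: q = 2J. Column 2j flags membership in
# group j; column 2j + 1 carries x for that group's rows only.
Z = np.zeros((N, 2 * J))
Z[np.arange(N), 2 * group] = 1.0
Z[np.arange(N), 2 * group + 1] = x

# G: one 2x2 block per group (illustrative values), allowing an
# intercept-slope covariance within each group; groups stay independent.
G_block = np.array([[25.0, 2.0],
                    [2.0, 1.0]])
G = np.kron(np.eye(J), G_block)
R = 64.0 * np.eye(N)
V = Z @ G @ Z.T + R        # same marginal-covariance recipe, new blocks
```

The marginal covariance still follows $\mathbf{V} = \mathbf{Z}\mathbf{G}\mathbf{Z}^\top + \mathbf{R}$; only the block contents change.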

2.2 Building $\mathbf{X}$ and $\mathbf{Z}$ on the six-classroom data

Concretely, on §1’s dataset:

$$\mathbf{X} \;=\; \begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_{60} \end{pmatrix} \in \mathbb{R}^{60 \times 2}, \qquad \mathbf{Z} \;=\; \begin{pmatrix} \mathbf{1}_4 & \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{0} \\ \mathbf{0} & \mathbf{1}_6 & \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{0} \\ \mathbf{0} & \mathbf{0} & \mathbf{1}_8 & \mathbf{0} & \mathbf{0} & \mathbf{0} \\ \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{1}_{10} & \mathbf{0} & \mathbf{0} \\ \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{1}_{12} & \mathbf{0} \\ \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{1}_{20} \end{pmatrix} \in \mathbb{R}^{60 \times 6}, \quad\quad (2.4)$$

where $\mathbf{1}_k$ denotes a column of $k$ ones and $\mathbf{0}$ is a zero column of the matching height. Reading $\mathbf{Z}$ row by row: row 1 (a Classroom-1 student) is $(1, 0, 0, 0, 0, 0)$, so $\mathbf{Z}\mathbf{u}$ at that row contributes $a_1$ — the deviation of Classroom 1’s intercept from the global $\alpha$. Row 5 (the first Classroom-2 student) is $(0, 1, 0, 0, 0, 0)$, contributing $a_2$. Each row picks out exactly one classroom random effect, which is the matrix encoding of “every student inherits exactly one classroom’s deviation.”

The fixed-effects piece $\mathbf{X}\boldsymbol{\beta} = \alpha \mathbf{1}_N + \beta \mathbf{x}$ is the global linear trend, the same for every student. The random-effects piece $\mathbf{Z}\mathbf{u}$ adds a classroom-specific intercept shift on top. The residual $\boldsymbol{\varepsilon}$ adds student-level noise. Three additive layers of structure, one matrix equation. Panels (a) and (b) of Figure 2 visualize $\mathbf{X}$ and $\mathbf{Z}$ explicitly; the reader should be able to read the block-of-ones pattern in $\mathbf{Z}$ as the matrix encoding of classroom membership.

2.3 The marginal distribution

If we never observed individual students and only saw the data after marginalizing over the random effects, what distribution would we see? This is the marginal distribution of $\mathbf{y}$, and it’s the engine driving everything that follows.

Theorem 1 (Marginal distribution of the LMM).

Under Definition 1,
$$\mathbf{y} \;\sim\; \mathcal{N}(\mathbf{X}\boldsymbol{\beta},\, \mathbf{V}), \qquad \mathbf{V} \;=\; \mathbf{Z}\mathbf{G}\mathbf{Z}^\top \;+\; \mathbf{R}. \quad\quad (2.5)$$

Proof.

The vector $\mathbf{y}$ is an affine combination of two independent Gaussian vectors plus a deterministic shift: $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\mathbf{u} + \boldsymbol{\varepsilon}$, where $\mathbf{u}$ and $\boldsymbol{\varepsilon}$ are jointly Gaussian (independent, hence jointly Gaussian) and $\mathbf{X}\boldsymbol{\beta}$ is a constant. Affine transformations of Gaussian vectors are Gaussian, so $\mathbf{y}$ is Gaussian; we just need its mean and covariance.

For the mean, $\mathbb{E}[\mathbf{u}] = \mathbf{0}$ and $\mathbb{E}[\boldsymbol{\varepsilon}] = \mathbf{0}$ give $\mathbb{E}[\mathbf{y}] = \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\,\mathbb{E}[\mathbf{u}] + \mathbb{E}[\boldsymbol{\varepsilon}] = \mathbf{X}\boldsymbol{\beta}$.

For the covariance, the constant $\mathbf{X}\boldsymbol{\beta}$ contributes nothing, so $\mathrm{Cov}(\mathbf{y}) = \mathrm{Cov}(\mathbf{Z}\mathbf{u} + \boldsymbol{\varepsilon}) = \mathrm{Cov}(\mathbf{Z}\mathbf{u}) + \mathrm{Cov}(\boldsymbol{\varepsilon}) + \mathrm{Cov}(\mathbf{Z}\mathbf{u}, \boldsymbol{\varepsilon}) + \mathrm{Cov}(\boldsymbol{\varepsilon}, \mathbf{Z}\mathbf{u})$. The cross-covariances vanish because $\mathbf{u} \perp \boldsymbol{\varepsilon}$, leaving the two diagonal terms. The first is $\mathrm{Cov}(\mathbf{Z}\mathbf{u}) = \mathbf{Z}\,\mathrm{Cov}(\mathbf{u})\,\mathbf{Z}^\top = \mathbf{Z}\mathbf{G}\mathbf{Z}^\top$, the standard linear-transformation rule. The second is $\mathrm{Cov}(\boldsymbol{\varepsilon}) = \mathbf{R}$ by assumption. Adding them gives $\mathbf{V} = \mathbf{Z}\mathbf{G}\mathbf{Z}^\top + \mathbf{R}$.

The two-term decomposition in (2.5) is worth pausing on. Two students in the same classroom share the same random intercept $a_j$, so they’re positively correlated: their covariance is $\tau^2$ from the shared $a_j$, with no additional contribution from the independent residuals. Two students in different classrooms are independent, because their two random intercepts are independent draws from $\mathcal{N}(0, \tau^2)$ and their residuals are also independent. The same student paired with itself contributes $\tau^2 + \sigma^2$ — once for the random intercept and once for the residual.
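This covariance bookkeeping is easy to verify by simulation. A minimal sketch at the running example’s variance components ($\tau^2 = 25$, $\sigma^2 = 64$), with two classrooms of two students each: the empirical covariance of a same-classroom pair should approach $\tau^2$, a cross-classroom pair $0$, and each variance $\tau^2 + \sigma^2$.

```python
import numpy as np

# Monte Carlo check of the covariance entries: two classrooms, two students
# each, intercept-only model (global mean 0) so only the covariance remains.
rng = np.random.default_rng(1)
tau2, sigma2 = 25.0, 64.0
M = 200_000                                        # replications

a = rng.normal(0.0, np.sqrt(tau2), size=(M, 2))    # classroom effects
eps = rng.normal(0.0, np.sqrt(sigma2), size=(M, 4))
y = a[:, [0, 0, 1, 1]] + eps                       # columns: 2 students x 2 classrooms

C = np.cov(y, rowvar=False)                        # 4x4 empirical covariance
# Expect C[0, 1] ≈ tau2 (same classroom), C[0, 2] ≈ 0 (different classrooms),
# and C[0, 0] ≈ tau2 + sigma2.
```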

Plugging the random-intercept-only $\mathbf{G} = \tau^2 \mathbf{I}_J$ and $\mathbf{R} = \sigma^2 \mathbf{I}_N$ into (2.5) makes this concrete. With rows ordered by classroom as in §1, $\mathbf{V}$ becomes block-diagonal, with one $n_j \times n_j$ block per classroom:

$$\mathbf{V} \;=\; \mathrm{blkdiag}(\mathbf{V}_1, \ldots, \mathbf{V}_J), \qquad \mathbf{V}_j \;=\; \tau^2 \mathbf{1}_{n_j}\mathbf{1}_{n_j}^\top \;+\; \sigma^2 \mathbf{I}_{n_j}. \quad\quad (2.6)$$

Each block has $\tau^2 + \sigma^2$ on its diagonal and $\tau^2$ on every off-diagonal entry — what the longitudinal-data literature calls compound symmetry and what panel (c) of Figure 2 displays as a heatmap. The off-block entries are exactly zero, which is the matrix encoding of “different classrooms are independent.” The visual takeaway: $\mathbf{V}$ is sparse where the model commits to independence and dense where it commits to shared structure, and the block pattern is the geometry of the hierarchy itself.

The intra-classroom correlation,

$$\rho \;=\; \frac{\mathrm{Cov}(y_{ij}, y_{i'j})}{\sqrt{\mathrm{Var}(y_{ij})\,\mathrm{Var}(y_{i'j})}} \;=\; \frac{\tau^2}{\tau^2 + \sigma^2}, \quad\quad (2.7)$$

is called the intraclass correlation coefficient (ICC). It’s the fraction of total variance explained by the grouping — $\rho \to 0$ means classrooms barely matter, $\rho \to 1$ means classrooms explain almost everything. For our DGP, $\rho = 25/(25 + 64) \approx 0.281$, so about 28% of the total variance lives at the classroom level. The ICC is the most-quoted single number in mixed-effects practice, and (2.7) is its full derivation.

Interactive figure. Six block-diagonal blocks of sizes (4, 6, 8, 10, 12, 20). Each block is compound-symmetric — τ²+σ² on the diagonal, τ² on every within-block off-diagonal, zero off-block. Sliding τ²/σ² → 0 collapses V to σ²·I_N and the block structure disappears (the between-classroom contribution is gone); sliding τ²/σ² → 1 pushes the within-block off-diagonals up toward the diagonal as the random-effects prior dominates the residual. Toggling ZGZᵀ shows the between-group structure alone — six dense blocks of τ² with off-block zeros and no diagonal residual. Toggling R shows what remains when the random effects are stripped — a plain σ²·I_N diagonal with no block structure at all.

2.4 GLS and the chicken-and-egg problem

Theorem 1 hands us a marginal distribution that’s exactly the setup of generalized least squares: a Gaussian linear model with mean $\mathbf{X}\boldsymbol{\beta}$ and a known, non-spherical covariance $\mathbf{V}$. Aitken’s classical theorem says the best linear unbiased estimator (BLUE) of $\boldsymbol{\beta}$ in this setup is the GLS estimator.

Proposition 1 (GLS estimator and Aitken's theorem).

If $\mathbf{V}$ is known and positive-definite, the unique BLUE of $\boldsymbol{\beta}$ in the model $\mathbf{y} \sim \mathcal{N}(\mathbf{X}\boldsymbol{\beta}, \mathbf{V})$ is
$$\hat{\boldsymbol{\beta}}_{\mathrm{GLS}} \;=\; (\mathbf{X}^\top \mathbf{V}^{-1} \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{V}^{-1} \mathbf{y}, \quad\quad (2.8)$$
with covariance $\mathrm{Cov}(\hat{\boldsymbol{\beta}}_{\mathrm{GLS}}) = (\mathbf{X}^\top \mathbf{V}^{-1} \mathbf{X})^{-1}$.

Proof.

Pre-multiply (2.1) by $\mathbf{V}^{-1/2}$, the symmetric square-root inverse of $\mathbf{V}$ (which exists and is unique because $\mathbf{V}$ is symmetric positive-definite). Define $\tilde{\mathbf{y}} = \mathbf{V}^{-1/2}\mathbf{y}$ and $\tilde{\mathbf{X}} = \mathbf{V}^{-1/2}\mathbf{X}$. The transformed model is $\tilde{\mathbf{y}} = \tilde{\mathbf{X}}\boldsymbol{\beta} + \tilde{\boldsymbol{\eta}}$ with $\tilde{\boldsymbol{\eta}} = \mathbf{V}^{-1/2}(\mathbf{Z}\mathbf{u} + \boldsymbol{\varepsilon}) \sim \mathcal{N}(\mathbf{0}, \mathbf{I}_N)$, an ordinary spherical Gaussian regression. Apply OLS-is-BLUE (Gauss–Markov) on the whitened system: $\hat{\boldsymbol{\beta}} = (\tilde{\mathbf{X}}^\top \tilde{\mathbf{X}})^{-1} \tilde{\mathbf{X}}^\top \tilde{\mathbf{y}} = (\mathbf{X}^\top \mathbf{V}^{-1} \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{V}^{-1} \mathbf{y}$, exactly (2.8). The estimator remains BLUE for the original model because whitening is an invertible linear transformation, so the linear unbiased estimators of the two models coincide. The covariance follows by the same transformation rule applied to the OLS covariance on the whitened system.
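The whitening argument is easy to check numerically. A minimal self-contained sketch (random positive-definite $\mathbf{V}$ and design, not the running example’s data): OLS on the whitened system reproduces the direct GLS formula (2.8).

```python
import numpy as np

# Random well-conditioned SPD covariance and a small design, then two routes
# to the same GLS estimate: direct formula vs OLS on the whitened system.
rng = np.random.default_rng(2)
N = 30
X = np.column_stack([np.ones(N), rng.uniform(0.0, 5.0, size=N)])
A = rng.normal(size=(N, N))
V = A @ A.T + N * np.eye(N)        # symmetric positive-definite by construction
y = X @ np.array([50.0, 5.0]) + rng.multivariate_normal(np.zeros(N), V)

# Route 1: direct GLS, (X' V^{-1} X)^{-1} X' V^{-1} y via linear solves
Vi_X = np.linalg.solve(V, X)
Vi_y = np.linalg.solve(V, y)
beta_gls = np.linalg.solve(X.T @ Vi_X, X.T @ Vi_y)

# Route 2: whiten with the symmetric square-root inverse V^{-1/2}, then OLS
w, U = np.linalg.eigh(V)
V_isqrt = U @ np.diag(w ** -0.5) @ U.T
beta_white = np.linalg.lstsq(V_isqrt @ X, V_isqrt @ y, rcond=None)[0]
# beta_gls and beta_white agree to floating-point precision
```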

We’ll use the GLS covariance in §3 to read off standard errors for the fixed effects, and in §5 to compare frequentist Wald intervals against PyMC posterior intervals.

So far so good — except for one inconvenient fact. We don’t know $\mathbf{V}$. Equation (2.5) gives $\mathbf{V} = \mathbf{Z}\mathbf{G}\mathbf{Z}^\top + \mathbf{R}$, which depends on $\mathbf{G}$ and $\mathbf{R}$, which depend on the variance components $(\tau^2, \sigma^2)$, which are unknown parameters we have to estimate from the same data we’re using to fit $\boldsymbol{\beta}$. Plugging the true $\mathbf{V}$ into (2.8) gives a clean estimator of $\boldsymbol{\beta}$; plugging the OLS estimate of $\boldsymbol{\beta}$ into a residuals-based estimate of $\mathbf{V}$ would give a biased mess. We need a way to estimate everything jointly.

Two paths lead through this circularity, and they’re the subjects of §3 and §4 respectively. §3 (Henderson’s path) does not try to invert $\mathbf{V}$ at all; it treats the random effects $\mathbf{u}$ as unknown quantities to be solved for jointly with $\boldsymbol{\beta}$, and writes down the normal equations for the joint penalized log-density. The resulting mixed-model equations solve for $(\hat{\boldsymbol{\beta}}, \hat{\mathbf{u}})$ simultaneously without ever forming $\mathbf{V}^{-1}$ explicitly, and the prediction $\hat{\mathbf{u}}$ has its own name — the best linear unbiased predictor, BLUP — and a clean shrinkage interpretation. §4 (REML’s path) estimates the variance components $(\tau^2, \sigma^2)$ first, then plugs them into (2.8). The naive route — maximum likelihood on the marginal distribution — turns out to be biased downward for the variance components. Restricted maximum likelihood (REML) fixes the bias by marginalizing $\boldsymbol{\beta}$ out before estimating the variance components.

The two paths produce the same $\hat{\boldsymbol{\beta}}$ and the same $\hat{\mathbf{u}}$ when run with the same variance-component estimates, so they’re complementary rather than competitive.

Three-panel heatmap layout with width ratios approximately 1:2:4. Panel (a) shows the 60×2 fixed-effects design matrix X — column 0 is a uniform stripe of ones, column 1 is a noisy grayscale gradient of prep-hours values. Panel (b) shows the 60×6 random-effects design matrix Z — six binary indicator columns with the iconic block-of-ones staircase stepping diagonally down the matrix, marking classroom membership. Panel (c) shows the 60×60 marginal covariance V = ZGZᵀ + R as a heatmap — six block-diagonal blocks of sizes 4, 6, 8, 10, 12, 20, each compound-symmetric (diagonal τ²+σ²=89, within-block off-diagonal τ²=25, off-block 0).
Figure 2. The LMM is constructively a matrix model. (a) X — fixed-effects design, no classroom structure. (b) Z — the classroom indicator matrix; the staircase of blocks encodes classroom membership. (c) V = ZGZᵀ + R — block-diagonal with compound-symmetric blocks; sparse where the model commits to independence (off-block) and dense where it commits to shared structure (within-block).
# Build X and Z explicitly, then form V at the truth and solve GLS by hand
import numpy as np
from scipy.linalg import solve

X = np.column_stack([np.ones(N), data["x"].to_numpy()])             # 60 × 2
Z = np.zeros((N, J)); Z[np.arange(N), classroom] = 1.0              # 60 × 6
G_t, R_t = (tau_t ** 2) * np.eye(J), (sigma_t ** 2) * np.eye(N)
V_t = Z @ G_t @ Z.T + R_t                                           # 60 × 60

# GLS at the truth — never form V^{-1}, solve V x = ... instead
beta_gls = solve(X.T @ solve(V_t, X), X.T @ solve(V_t, y), assume_a="pos")
icc_t = tau_t ** 2 / (tau_t ** 2 + sigma_t ** 2)                    # ≈ 0.281

The GLS-with-true-$\mathbf{V}$ fit gives $\hat\alpha \approx 50.99$ and $\hat\beta \approx 4.86$, both within sampling noise of the truths $(50, 5)$. The OLS standard errors (which assume $\mathbf{R} = \sigma^2\mathbf{I}_N$ rather than the block-diagonal $\mathbf{V}$) are systematically too narrow on the intercept by about a factor of $\sqrt{1 + (n_j - 1)\rho}$ averaged over students — the standard “ignore-the-clustering” anticonservatism that mixed-effects practice routinely confronts.
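To put numbers on that factor, here is a quick computation of the design-effect expression at the DGP truth ($\tau^2 = 25$, $\sigma^2 = 64$, so $\rho \approx 0.281$):

```python
import numpy as np

# Design-effect factor sqrt(1 + (n_j - 1) * rho) at the DGP truth.
tau2, sigma2 = 25.0, 64.0
rho = tau2 / (tau2 + sigma2)                 # ICC ≈ 0.281
n_j = np.array([4, 6, 8, 10, 12, 20])
deff = np.sqrt(1.0 + (n_j - 1) * rho)
# deff ≈ [1.36, 1.55, 1.72, 1.88, 2.02, 2.52]: the clustering penalty on the
# naive OLS intercept standard error grows with classroom size, because more
# students share the same random intercept.
```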

3. Henderson’s mixed-model equations and BLUP

§2 ended on a circularity: GLS gives the cleanest fixed-effects estimator when the marginal covariance V\mathbf{V} is known, but V=ZGZ+R\mathbf{V} = \mathbf{Z}\mathbf{G}\mathbf{Z}^\top + \mathbf{R} depends on variance components we have to estimate from the same data. There are two ways forward, and §3 takes the first one — Henderson’s. Don’t try to invert V\mathbf{V} at all. Treat the random effects u\mathbf{u} as quantities to be solved for jointly with β\boldsymbol{\beta}, paying a Gaussian “penalty” for each uju_j that pulls it toward zero, and let the penalty’s strength encode the prior uN(0,G)\mathbf{u} \sim \mathcal{N}(\mathbf{0}, \mathbf{G}). The resulting block linear system is Henderson’s mixed-model equations; its solution simultaneously gives β^GLS\hat{\boldsymbol{\beta}}_{\mathrm{GLS}} and the best linear unbiased predictor of u\mathbf{u}, the BLUP.

What we get out of this section is the closed-form shrinkage formula that drove §1’s panel (c) — the BLUP for classroom jj is the no-pooling residual mean times a shrinkage factor λj=τ2/(τ2+σ2/nj)\lambda_j = \tau^2 / (\tau^2 + \sigma^2/n_j) — together with the structural understanding of why that formula is what falls out of the general theory. We’re still assuming the variance components (τ2,σ2)(\tau^2, \sigma^2) are known throughout §3; §4 fills in their estimation.

3.1 The joint penalized log-density

Henderson’s 1950 / 1975 trick is to write down the joint density of (y,u)(\mathbf{y}, \mathbf{u}) given β\boldsymbol{\beta}, treat u\mathbf{u} as if it were a parameter to be optimized over rather than a latent variable to be marginalized, and maximize the joint log-density over (β,u)(\boldsymbol{\beta}, \mathbf{u}) together. The result is a clean linear system whose left-hand side has dimension (p+q)×(p+q)(p + q) \times (p + q) — small even when NN is enormous — and whose right-hand side reads off the data through XR1y\mathbf{X}^\top \mathbf{R}^{-1} \mathbf{y} and ZR1y\mathbf{Z}^\top \mathbf{R}^{-1} \mathbf{y}.

Lemma 1 (Joint penalized log-density).

Under Definition 1, the joint density of the observations and the random effects given the fixed effects factors as p(y,uβ)  =  p(yu,β)p(u),(3.1)p(\mathbf{y}, \mathbf{u} \mid \boldsymbol{\beta}) \;=\; p(\mathbf{y} \mid \mathbf{u}, \boldsymbol{\beta}) \cdot p(\mathbf{u}), \quad\quad (3.1) with yu,βN(Xβ+Zu,R)\mathbf{y} \mid \mathbf{u}, \boldsymbol{\beta} \sim \mathcal{N}(\mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\mathbf{u}, \mathbf{R}) and uN(0,G)\mathbf{u} \sim \mathcal{N}(\mathbf{0}, \mathbf{G}). Up to constants that don’t depend on (β,u)(\boldsymbol{\beta}, \mathbf{u}), the negative joint log-density is logp(y,uβ)  =  12(yXβZu)R1(yXβZu)  +  12uG1u.(3.2)-\log p(\mathbf{y}, \mathbf{u} \mid \boldsymbol{\beta}) \;=\; \tfrac{1}{2}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta} - \mathbf{Z}\mathbf{u})^\top \mathbf{R}^{-1} (\mathbf{y} - \mathbf{X}\boldsymbol{\beta} - \mathbf{Z}\mathbf{u}) \;+\; \tfrac{1}{2}\,\mathbf{u}^\top \mathbf{G}^{-1} \mathbf{u}. \quad\quad (3.2)

Proof.

The factorization in (3.1) is the chain rule applied to a Gaussian linear model: conditional on u\mathbf{u}, the residual ε=yXβZu\boldsymbol{\varepsilon} = \mathbf{y} - \mathbf{X}\boldsymbol{\beta} - \mathbf{Z}\mathbf{u} has distribution N(0,R)\mathcal{N}(\mathbf{0}, \mathbf{R}), so yu,βN(Xβ+Zu,R)\mathbf{y} \mid \mathbf{u}, \boldsymbol{\beta} \sim \mathcal{N}(\mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\mathbf{u}, \mathbf{R}). The marginal uN(0,G)\mathbf{u} \sim \mathcal{N}(\mathbf{0}, \mathbf{G}) is part of Definition 1. Equation (3.2) is the sum of the two negative Gaussian log-densities, with constant terms (12logdetR-\tfrac{1}{2}\log\det \mathbf{R}, 12logdetG-\tfrac{1}{2}\log\det \mathbf{G}, N+q2log2π-\tfrac{N + q}{2}\log 2\pi) that don’t depend on the optimization variables absorbed into a single additive constant.

The structure of (3.2) is worth pausing on. The first term is the negative log-likelihood of y\mathbf{y} given fixed and random effects — a generalized least-squares residual norm under the residual covariance R\mathbf{R}. The second term is a quadratic penalty on u\mathbf{u}: the further u\mathbf{u} strays from 0\mathbf{0} (the random-effects prior mean), the larger the penalty, with the penalty’s strength governed by G1\mathbf{G}^{-1}. The penalty is the algebraic encoding of “we believe a priori that the random effects are small, with the meaning of small set by G\mathbf{G}.” Larger G\mathbf{G} (looser prior) → softer penalty → more freedom for u\mathbf{u} to take large values. Smaller G\mathbf{G} (tighter prior) → harder penalty → u\mathbf{u} is pulled aggressively toward zero. The whole shrinkage story sits inside that one penalty term.

The reader who has seen Bayesian ridge regression will recognize (3.2) immediately: it’s the maximum a posteriori (MAP) objective for u\mathbf{u} under a Gaussian prior, with β\boldsymbol{\beta} playing the role of an unpenalized nuisance parameter. Henderson didn’t derive his equations as Bayesian MAP — he derived them via a clever marginal-likelihood argument that we’ll see in §4 — but the algebraic form is identical, and the Bayesian reading is what §5 will use to bridge to the PyMC fit.

3.2 The mixed-model equations

Maximizing (3.2) jointly in (β,u)(\boldsymbol{\beta}, \mathbf{u}) — equivalently, minimizing the negative log-density — is a problem in linear algebra. The objective is quadratic in both arguments, so its gradients are linear, and setting both gradients to zero gives a block linear system: Henderson’s mixed-model equations.

Theorem 2 (Mixed-model equations).

Under Definition 1 with V,G,R\mathbf{V}, \mathbf{G}, \mathbf{R} all positive-definite, the joint maximum of (3.2) over (β,u)(\boldsymbol{\beta}, \mathbf{u}) solves the block linear system (XR1XXR1ZZR1XZR1Z+G1)(β^u^)  =  (XR1yZR1y),(3.3)\begin{pmatrix} \mathbf{X}^\top \mathbf{R}^{-1} \mathbf{X} & \mathbf{X}^\top \mathbf{R}^{-1} \mathbf{Z} \\ \mathbf{Z}^\top \mathbf{R}^{-1} \mathbf{X} & \mathbf{Z}^\top \mathbf{R}^{-1} \mathbf{Z} + \mathbf{G}^{-1} \end{pmatrix} \begin{pmatrix} \hat{\boldsymbol{\beta}} \\ \hat{\mathbf{u}} \end{pmatrix} \;=\; \begin{pmatrix} \mathbf{X}^\top \mathbf{R}^{-1} \mathbf{y} \\ \mathbf{Z}^\top \mathbf{R}^{-1} \mathbf{y} \end{pmatrix}, \quad\quad (3.3) and the solution satisfies β^  =  (XV1X)1XV1y  =  β^GLS,u^  =  GZV1(yXβ^).(3.4)\hat{\boldsymbol{\beta}} \;=\; (\mathbf{X}^\top \mathbf{V}^{-1} \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{V}^{-1} \mathbf{y} \;=\; \hat{\boldsymbol{\beta}}_{\mathrm{GLS}}, \qquad \hat{\mathbf{u}} \;=\; \mathbf{G}\mathbf{Z}^\top \mathbf{V}^{-1}(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}). \quad\quad (3.4)

Proof.

For the block system, take partial derivatives of (3.2) with respect to β\boldsymbol{\beta} and u\mathbf{u} and set each to zero: β:XR1(yXβZu)  =  0,(3.5a)\frac{\partial}{\partial \boldsymbol{\beta}}: \quad -\mathbf{X}^\top \mathbf{R}^{-1}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta} - \mathbf{Z}\mathbf{u}) \;=\; \mathbf{0}, \quad\quad (3.5\text{a}) u:ZR1(yXβZu)+G1u  =  0.(3.5b)\frac{\partial}{\partial \mathbf{u}}: \quad -\mathbf{Z}^\top \mathbf{R}^{-1}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta} - \mathbf{Z}\mathbf{u}) + \mathbf{G}^{-1}\mathbf{u} \;=\; \mathbf{0}. \quad\quad (3.5\text{b}) Rearrange (3.5a) into XR1Xβ+XR1Zu=XR1y\mathbf{X}^\top \mathbf{R}^{-1} \mathbf{X} \boldsymbol{\beta} + \mathbf{X}^\top \mathbf{R}^{-1} \mathbf{Z} \mathbf{u} = \mathbf{X}^\top \mathbf{R}^{-1} \mathbf{y}, the top row of (3.3). Rearrange (3.5b) into ZR1Xβ+(ZR1Z+G1)u=ZR1y\mathbf{Z}^\top \mathbf{R}^{-1} \mathbf{X} \boldsymbol{\beta} + (\mathbf{Z}^\top \mathbf{R}^{-1} \mathbf{Z} + \mathbf{G}^{-1}) \mathbf{u} = \mathbf{Z}^\top \mathbf{R}^{-1} \mathbf{y}, the bottom row. Stacking gives (3.3).

For the equivalence to GLS, let C=ZR1Z+G1\mathbf{C} = \mathbf{Z}^\top \mathbf{R}^{-1} \mathbf{Z} + \mathbf{G}^{-1}, the (q×q)(q \times q) bottom-right block of (3.3). Eliminate u\mathbf{u} from (3.5b): u=C1ZR1(yXβ)\mathbf{u} = \mathbf{C}^{-1}\mathbf{Z}^\top \mathbf{R}^{-1}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}). Substitute into (3.5a): XR1(yXβZC1ZR1(yXβ))  =  0.\mathbf{X}^\top \mathbf{R}^{-1}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta} - \mathbf{Z}\mathbf{C}^{-1}\mathbf{Z}^\top \mathbf{R}^{-1}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})) \;=\; \mathbf{0}. Factor out (yXβ)(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}): X[R1R1ZC1ZR1](yXβ)  =  0.(3.7)\mathbf{X}^\top \bigl[\mathbf{R}^{-1} - \mathbf{R}^{-1}\mathbf{Z}\mathbf{C}^{-1}\mathbf{Z}^\top \mathbf{R}^{-1}\bigr] (\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) \;=\; \mathbf{0}. \quad\quad (3.7) The bracketed matrix is exactly V1\mathbf{V}^{-1} by the Woodbury identity: (R+ZGZ)1  =  R1R1Z(ZR1Z+G1)1ZR1  =  R1R1ZC1ZR1.(3.8)(\mathbf{R} + \mathbf{Z}\mathbf{G}\mathbf{Z}^\top)^{-1} \;=\; \mathbf{R}^{-1} - \mathbf{R}^{-1}\mathbf{Z}(\mathbf{Z}^\top \mathbf{R}^{-1}\mathbf{Z} + \mathbf{G}^{-1})^{-1}\mathbf{Z}^\top \mathbf{R}^{-1} \;=\; \mathbf{R}^{-1} - \mathbf{R}^{-1}\mathbf{Z}\mathbf{C}^{-1}\mathbf{Z}^\top \mathbf{R}^{-1}. \quad\quad (3.8) (The identity (3.8) is standard linear algebra; see Searle, Casella, McCulloch §7.4 for the version specialized to mixed models.) Substituting (3.8) into (3.7) gives XV1(yXβ)=0\mathbf{X}^\top \mathbf{V}^{-1}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) = \mathbf{0}, equivalent to XV1Xβ=XV1y\mathbf{X}^\top\mathbf{V}^{-1}\mathbf{X}\boldsymbol{\beta} = \mathbf{X}^\top\mathbf{V}^{-1}\mathbf{y}, which solves to β^=(XV1X)1XV1y\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top\mathbf{V}^{-1}\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{V}^{-1}\mathbf{y} — the GLS estimator.

For u^\hat{\mathbf{u}}: substituting β^\hat{\boldsymbol{\beta}} into the eliminated form gives u^=C1ZR1(yXβ^)\hat{\mathbf{u}} = \mathbf{C}^{-1}\mathbf{Z}^\top \mathbf{R}^{-1}(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}). The identity C1ZR1=GZV1\mathbf{C}^{-1}\mathbf{Z}^\top \mathbf{R}^{-1} = \mathbf{G}\mathbf{Z}^\top \mathbf{V}^{-1} (from rearranging (3.8) after right-multiplication by ZG\mathbf{Z}\mathbf{G}) gives the announced form u^=GZV1(yXβ^)\hat{\mathbf{u}} = \mathbf{G}\mathbf{Z}^\top\mathbf{V}^{-1}(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}), which is (3.4).
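Both identities in the proof are one-line numerical checks. A small sketch on random positive-definite G\mathbf{G} and R\mathbf{R} (not tied to the classroom data):

```python
import numpy as np

rng = np.random.default_rng(1)
N, q = 12, 4
Z = rng.normal(size=(N, q))
G = np.diag(rng.uniform(0.5, 2.0, q))                     # positive-definite G (diagonal for simplicity)
R = np.diag(rng.uniform(0.5, 2.0, N))                     # positive-definite R
V = Z @ G @ Z.T + R

R_inv, G_inv = np.linalg.inv(R), np.linalg.inv(G)
C = Z.T @ R_inv @ Z + G_inv
V_inv = np.linalg.inv(V)

# (3.8): Woodbury -- V^{-1} = R^{-1} - R^{-1} Z C^{-1} Z' R^{-1}
woodbury = R_inv - R_inv @ Z @ np.linalg.inv(C) @ Z.T @ R_inv
print(np.allclose(V_inv, woodbury))                       # True

# the BLUP identity used for (3.4): C^{-1} Z' R^{-1} = G Z' V^{-1}
print(np.allclose(np.linalg.inv(C) @ Z.T @ R_inv, G @ Z.T @ V_inv))   # True
```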

The first form of u^\hat{\mathbf{u}} in (3.4) — GZV1(yXβ^)\mathbf{G}\mathbf{Z}^\top\mathbf{V}^{-1}(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}) — has a clean probabilistic reading: it is the conditional mean of u\mathbf{u} given y\mathbf{y} under the joint Gaussian model. Standard multivariate-Gaussian conditioning gives

E[uy]  =  Cov(u,y)Cov(y)1(yE[y])  =  GZV1(yXβ),(3.9)\mathbb{E}[\mathbf{u} \mid \mathbf{y}] \;=\; \mathrm{Cov}(\mathbf{u}, \mathbf{y})\,\mathrm{Cov}(\mathbf{y})^{-1}(\mathbf{y} - \mathbb{E}[\mathbf{y}]) \;=\; \mathbf{G}\mathbf{Z}^\top\mathbf{V}^{-1}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}), \quad\quad (3.9)

using Cov(u,y)=Cov(u,Zu+ε)=GZ\mathrm{Cov}(\mathbf{u}, \mathbf{y}) = \mathrm{Cov}(\mathbf{u}, \mathbf{Z}\mathbf{u} + \boldsymbol{\varepsilon}) = \mathbf{G}\mathbf{Z}^\top and Cov(y)=V\mathrm{Cov}(\mathbf{y}) = \mathbf{V}. So the BLUP equals the posterior mean of u\mathbf{u} given the data — under the prior uN(0,G)\mathbf{u} \sim \mathcal{N}(\mathbf{0}, \mathbf{G}) that the model already commits us to. We’ll return to this in §5: the Bayesian PyMC fit doesn’t compute Henderson’s equations, but its posterior mean for the random effects is exactly u^\hat{\mathbf{u}} (up to the difference between plugged-in REML estimates of (G,R)(\mathbf{G}, \mathbf{R}) and properly-marginalized Bayesian estimates), which is why the two views agree numerically.

The conditional covariance is Cov(uy)=GGZV1ZG\mathrm{Cov}(\mathbf{u} \mid \mathbf{y}) = \mathbf{G} - \mathbf{G}\mathbf{Z}^\top\mathbf{V}^{-1}\mathbf{Z}\mathbf{G}, a useful identity for the BLUP standard errors below.

A practical note. The MMEs in (3.3) have dimension (p+q)×(p+q)=8×8(p + q) \times (p + q) = 8 \times 8 on our running example, while the GLS form (3.4) requires V1\mathbf{V}^{-1}, an N×N=60×60N \times N = 60 \times 60 matrix. For larger problems the gap widens dramatically — a study with N=105N = 10^5 observations across J=100J = 100 groups and p=5p = 5 fixed effects has MMEs of dimension 105×105105 \times 105 versus a GLS that nominally inverts a 105×10510^5 \times 10^5 matrix. The savings is what made Henderson’s formulation practical at all in the pre-computer era and is what makes industrial-scale mixed-effects fits tractable today; modern implementations exploit additional sparsity in Z\mathbf{Z} (which is highly sparse for grouped data) and in the bottom-right block of (3.3) to bring the cost down further. statsmodels.MixedLM and lme4’s lmer both ultimately solve a sparse version of (3.3).
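To make the sparsity point concrete, here is a sketch of the MME assembly in scipy.sparse on freshly simulated six-classroom data — an illustration of the structure, not the actual internals of statsmodels or lme4. For grouped data Z\mathbf{Z} has exactly one nonzero per row, so ZR1Z\mathbf{Z}^\top\mathbf{R}^{-1}\mathbf{Z} is diagonal and the whole left-hand side is sparse:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve

rng = np.random.default_rng(6)
sizes = np.array([4, 6, 8, 10, 12, 20])
J, N = len(sizes), sizes.sum()
g = np.repeat(np.arange(J), sizes)
x = rng.uniform(0, 3, N)
tau2, sig2 = 25.0, 64.0
y = 50 + 5 * x + rng.normal(0, np.sqrt(tau2), J)[g] + rng.normal(0, np.sqrt(sig2), N)

X = sp.csr_matrix(np.column_stack([np.ones(N), x]))       # 60 x 2 (dense in practice)
Z = sp.csr_matrix((np.ones(N), (np.arange(N), g)), shape=(N, J))   # one nonzero per row
Rinv = sp.identity(N) / sig2
Ginv = sp.identity(J) / tau2

# assemble Henderson's (p+q) x (p+q) system entirely in sparse arithmetic
M = sp.bmat([[X.T @ Rinv @ X, X.T @ Rinv @ Z],
             [Z.T @ Rinv @ X, Z.T @ Rinv @ Z + Ginv]]).tocsc()
rhs = np.concatenate([X.T @ Rinv @ y, Z.T @ Rinv @ y])
sol = spsolve(M, rhs)                                     # 8-vector: (alpha, beta, u_1..u_6)
beta_hat, u_hat = sol[:2], sol[2:]
```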

3.3 Shrinkage in the random-intercept special case

The general form (3.4) is what runs in software. The pedagogical payoff comes from specializing it to the random-intercept-only model, where every quantity in (3.3) collapses to something an undergraduate can compute by hand.

Proposition 2 (BLUP shrinkage for the random-intercept model).

Suppose G=τ2IJ\mathbf{G} = \tau^2 \mathbf{I}_J and R=σ2IN\mathbf{R} = \sigma^2 \mathbf{I}_N, and the random-effects design Z\mathbf{Z} encodes group membership (each row a one-hot vector over JJ groups). Define the no-pooling residual mean for group jj as rˉj(β)  =  1njij(yijxijβ),(3.11)\bar{r}_j(\boldsymbol{\beta}) \;=\; \frac{1}{n_j} \sum_{i \in j} (y_{ij} - \mathbf{x}_{ij}^\top \boldsymbol{\beta}), \quad\quad (3.11) the average residual of group jj given the fixed-effects fit, and the shrinkage factor λj  =  τ2τ2+σ2/nj    (0,1).(3.12)\lambda_j \;=\; \frac{\tau^2}{\tau^2 + \sigma^2 / n_j} \;\in\; (0, 1). \quad\quad (3.12) Then the BLUP for group jj — the jj-th component of u^\hat{\mathbf{u}} in (3.4), evaluated at β^\hat{\boldsymbol{\beta}} — is u^j  =  λjrˉj(β^),(3.13)\hat{u}_j \;=\; \lambda_j \cdot \bar{r}_j(\hat{\boldsymbol{\beta}}), \quad\quad (3.13) and the conditional variance of uju_j given y\mathbf{y} is Var(ujy)  =  (1λj)τ2.(3.14)\mathrm{Var}(u_j \mid \mathbf{y}) \;=\; (1 - \lambda_j) \tau^2. \quad\quad (3.14)

Proof.

For the random-intercept design, ZZ=diag(n1,,nJ)\mathbf{Z}^\top\mathbf{Z} = \mathrm{diag}(n_1, \ldots, n_J) — column jj of Z\mathbf{Z} has njn_j ones and NnjN - n_j zeros. With R=σ2IN\mathbf{R} = \sigma^2\mathbf{I}_N and G=τ2IJ\mathbf{G} = \tau^2\mathbf{I}_J: C  =  ZR1Z+G1  =  1σ2diag(n1,,nJ)+1τ2IJ  =  diag(njσ2+1τ2)j.\mathbf{C} \;=\; \mathbf{Z}^\top \mathbf{R}^{-1}\mathbf{Z} + \mathbf{G}^{-1} \;=\; \frac{1}{\sigma^2}\mathrm{diag}(n_1, \ldots, n_J) + \frac{1}{\tau^2}\mathbf{I}_J \;=\; \mathrm{diag}\bigl(\tfrac{n_j}{\sigma^2} + \tfrac{1}{\tau^2}\bigr)_j. So C\mathbf{C} is diagonal, with C1\mathbf{C}^{-1} having jj-th entry (njσ2+1τ2)1=σ2τ2njτ2+σ2\bigl(\tfrac{n_j}{\sigma^2} + \tfrac{1}{\tau^2}\bigr)^{-1} = \tfrac{\sigma^2 \tau^2}{n_j \tau^2 + \sigma^2}. The jj-th component of ZR1(yXβ^)\mathbf{Z}^\top \mathbf{R}^{-1}(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}) is 1σ2ij(yijxijβ^)=njσ2rˉj(β^)\tfrac{1}{\sigma^2}\sum_{i \in j}(y_{ij} - \mathbf{x}_{ij}^\top\hat{\boldsymbol{\beta}}) = \tfrac{n_j}{\sigma^2}\bar{r}_j(\hat{\boldsymbol{\beta}}). Multiplying: u^j  =  σ2τ2njτ2+σ2njσ2rˉj(β^)  =  njτ2njτ2+σ2rˉj(β^)  =  λjrˉj(β^),\hat{u}_j \;=\; \frac{\sigma^2 \tau^2}{n_j \tau^2 + \sigma^2} \cdot \frac{n_j}{\sigma^2}\,\bar{r}_j(\hat{\boldsymbol{\beta}}) \;=\; \frac{n_j \tau^2}{n_j \tau^2 + \sigma^2}\,\bar{r}_j(\hat{\boldsymbol{\beta}}) \;=\; \lambda_j\,\bar{r}_j(\hat{\boldsymbol{\beta}}), which is (3.13). For (3.14), use Cov(uy)=GGZV1ZG\mathrm{Cov}(\mathbf{u} \mid \mathbf{y}) = \mathbf{G} - \mathbf{G}\mathbf{Z}^\top\mathbf{V}^{-1}\mathbf{Z}\mathbf{G}. 
The jj-th diagonal entry of ZV1Z\mathbf{Z}^\top\mathbf{V}^{-1}\mathbf{Z} is nj/(σ2+njτ2)n_j / (\sigma^2 + n_j\tau^2) (a one-line application of Woodbury to the jj-th block Vj=σ2Inj+τ21nj1nj\mathbf{V}_j = \sigma^2\mathbf{I}_{n_j} + \tau^2\mathbf{1}_{n_j}\mathbf{1}_{n_j}^\top), so the (j,j)(j, j) entry of GZV1ZG\mathbf{G}\mathbf{Z}^\top\mathbf{V}^{-1}\mathbf{Z}\mathbf{G} is τ4nj/(σ2+njτ2)=λjτ2\tau^4 \cdot n_j / (\sigma^2 + n_j\tau^2) = \lambda_j \tau^2, and therefore Var(ujy)=τ2λjτ2=(1λj)τ2\mathrm{Var}(u_j \mid \mathbf{y}) = \tau^2 - \lambda_j\tau^2 = (1 - \lambda_j)\tau^2.
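Proposition 2 is directly checkable against the general formula (3.4). The sketch below re-simulates the six-classroom design (fresh draws), computes the BLUP via GZV1(yXβ^)\mathbf{G}\mathbf{Z}^\top\mathbf{V}^{-1}(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}), and confirms it equals λjrˉj(β^)\lambda_j\,\bar{r}_j(\hat{\boldsymbol{\beta}}) to machine precision:

```python
import numpy as np
from scipy.linalg import solve

rng = np.random.default_rng(0)
sizes = np.array([4, 6, 8, 10, 12, 20])
J, N = len(sizes), sizes.sum()
g = np.repeat(np.arange(J), sizes)
x = rng.uniform(0, 3, N)
tau2, sig2 = 25.0, 64.0
y = 50 + 5 * x + rng.normal(0, np.sqrt(tau2), J)[g] + rng.normal(0, np.sqrt(sig2), N)

X = np.column_stack([np.ones(N), x])
Z = np.zeros((N, J)); Z[np.arange(N), g] = 1.0
V = tau2 * (Z @ Z.T) + sig2 * np.eye(N)

beta = solve(X.T @ solve(V, X), X.T @ solve(V, y), assume_a="pos")   # GLS
u_blup = tau2 * Z.T @ solve(V, y - X @ beta)                         # G Z' V^{-1} residual, (3.4)

resid = y - X @ beta
r_bar = np.array([resid[g == j].mean() for j in range(J)])           # (3.11)
lam = tau2 / (tau2 + sig2 / sizes)                                   # (3.12)
print(np.allclose(u_blup, lam * r_bar))                              # True
```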

The three identities (3.12)–(3.14) deserve to be read as a unit. The shrinkage factor λj\lambda_j depends only on the variance ratio τ2/σ2\tau^2 / \sigma^2 and the group size njn_j — nothing else. It is the number that characterizes how mixed-effects models partial-pool. Read it as a competition between two precisions: 1/τ21/\tau^2 is the prior precision of the random effect (how much the population-level prior pulls toward zero), and nj/σ2n_j/\sigma^2 is the data precision for group jj (how much group jj’s own data informs its random effect). The shrinkage factor

λj  =  nj/σ2nj/σ2+1/τ2  =  data precisiondata precision+prior precision\lambda_j \;=\; \frac{n_j/\sigma^2}{n_j/\sigma^2 + 1/\tau^2} \;=\; \frac{\text{data precision}}{\text{data precision} + \text{prior precision}}

is the data-precision share of the total. Three regimes deserve names:

  • njn_j \to \infty at fixed τ2\tau^2. Data precision dominates; λj1\lambda_j \to 1; u^jrˉj(β^)\hat{u}_j \to \bar{r}_j(\hat{\boldsymbol{\beta}}), the no-pooling estimate. The classroom’s own data overwhelms the prior, and the BLUP equals the no-pooling residual mean.
  • nj0n_j \to 0 at fixed τ2\tau^2. Data precision vanishes; λj0\lambda_j \to 0; u^j0\hat{u}_j \to 0, the prior mean. With no within-classroom data, the BLUP can do nothing but report the prior — every group looks like the population.
  • τ2\tau^2 \to \infty at fixed njn_j. Prior precision vanishes; λj1\lambda_j \to 1; u^jrˉj\hat{u}_j \to \bar{r}_j. An uninformative random-effects prior recovers no pooling. τ20\tau^2 \to 0 has the symmetric effect: λj0\lambda_j \to 0 and u^j0\hat{u}_j \to 0, complete pooling.

Equation (3.14) is the conditional-uncertainty companion: when the data dominates (λj1\lambda_j \to 1), the conditional variance of uju_j collapses to zero; when the prior dominates (λj0\lambda_j \to 0), the conditional variance returns to the prior variance τ2\tau^2. The two limits are dual — high information content shrinks both the bias and the variance of the BLUP toward zero.

For our six-classroom data with the truth τ2=25\tau^2 = 25, σ2=64\sigma^2 = 64, the shrinkage factors are λ10.610,λ20.701,λ30.758,λ40.796,λ50.824,λ60.887\lambda_1 \approx 0.610, \lambda_2 \approx 0.701, \lambda_3 \approx 0.758, \lambda_4 \approx 0.796, \lambda_5 \approx 0.824, \lambda_6 \approx 0.887, and the conditional variances (1λj)τ2(1 - \lambda_j)\tau^2 run from 9.759.75 (Classroom 1) down to 2.832.83 (Classroom 6). The smallest classroom has more than three times the BLUP uncertainty of the largest, and shrinks more than three times as hard — both in the same direction, both governed by the same single number njn_j.

Left: λⱼ as a function of group size n at the current variance ratio. The curve is concave and saturating in n — small groups sit on the steep ascent (sharp marginal returns to data), large groups sit on the plateau (diminishing returns). The six dots mark the actual classroom sizes; their λⱼ values are the y-coordinates. Right: for each classroom, an open dot on the dashed vertical at the complete-pooling intercept α̂, an open dot at the no-pooling intercept α̂ⱼ(no-pool), and a filled circle at the partial-pooling intercept α̂ⱼ. The filled circle's position along the row is λⱼ — the left endpoint is λ = 0 (complete pooling), the right endpoint is λ = 1 (no pooling). Dragging the variance-ratio slider updates both panels together: the curve changes shape and the dots move, the partial-pooling circles slide along their rows, and small classrooms always travel a larger fraction of the way back toward α̂.

The partial-pooled intercept for Classroom jj, which is what we plotted in §1’s panel (c), is the global intercept plus the BLUP:

α^jpartial  =  α^+u^j  =  α^+λjrˉj(β^).(3.15)\hat{\alpha}_j^{\mathrm{partial}} \;=\; \hat{\alpha} + \hat{u}_j \;=\; \hat{\alpha} + \lambda_j \bar{r}_j(\hat{\boldsymbol{\beta}}). \quad\quad (3.15)

Rewriting rˉj\bar{r}_j in terms of the no-pooling intercept α^jno-pool=α^+rˉj(β^)\hat{\alpha}_j^{\mathrm{no\text{-}pool}} = \hat{\alpha} + \bar{r}_j(\hat{\boldsymbol{\beta}}) — the per-classroom intercept from §1’s panel (b) — gives the equivalent weighted-average form

α^jpartial  =  λjα^jno-pool+(1λj)α^,(3.16)\hat{\alpha}_j^{\mathrm{partial}} \;=\; \lambda_j\,\hat{\alpha}_j^{\mathrm{no\text{-}pool}} + (1 - \lambda_j)\,\hat{\alpha}, \quad\quad (3.16)

which makes panel (c)’s geometry algebraic: the partial-pooled intercept is a convex combination of the no-pooling estimate (panel b) and the global intercept (panel a’s line), with the convex weight λj\lambda_j set entirely by group size and the variance ratio.
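The convex-combination identity (3.16) can be verified in the intercept-only special case, where the no-pooling intercepts are just the group means. A sketch with fresh simulated data:

```python
import numpy as np
from scipy.linalg import solve

rng = np.random.default_rng(2)
sizes = np.array([4, 6, 8, 10, 12, 20])
J, N = len(sizes), sizes.sum()
g = np.repeat(np.arange(J), sizes)
tau2, sig2 = 25.0, 64.0
y = 50 + rng.normal(0, np.sqrt(tau2), J)[g] + rng.normal(0, np.sqrt(sig2), N)

Z = np.zeros((N, J)); Z[np.arange(N), g] = 1.0
V = tau2 * (Z @ Z.T) + sig2 * np.eye(N)
ones = np.ones(N)

alpha = (ones @ solve(V, y)) / (ones @ solve(V, ones))    # GLS global intercept
no_pool = np.array([y[g == j].mean() for j in range(J)])  # per-group intercepts
u = tau2 * Z.T @ solve(V, y - alpha * ones)               # BLUPs via (3.4)
lam = tau2 / (tau2 + sig2 / sizes)

# (3.16): partial-pool = lam * no-pool + (1 - lam) * global, to machine precision
print(np.allclose(alpha + u, lam * no_pool + (1 - lam) * alpha))     # True
```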

3.4 BLUE, BLUP, and the predictive frame

The terminology Best Linear Unbiased Predictor — BLUP — deserves a remark. We’ve estimated β\boldsymbol{\beta} throughout as a fixed unknown parameter; the GLS estimator β^\hat{\boldsymbol{\beta}} is the Best Linear Unbiased Estimator (BLUE) under Gauss–Markov conditions, a standard fact about linear models. We’ve also computed u^\hat{\mathbf{u}}, but the random effects u\mathbf{u} are not parameters in the same sense — they’re realized values of a random vector with a known prior distribution. The convention in the mixed-effects literature is to call the corresponding estimator a predictor rather than an estimator, on the grammatical grounds that we estimate parameters and predict random variables. The mean-squared-error optimality result is the same in both cases — minimum MSE among linear unbiased rules — but the framing matters for two reasons.

First, the standard error of u^j\hat{u}_j is not the standard error of estimation of a fixed parameter; it is the standard error of prediction of a random quantity, and (3.14) is what governs it. The frequentist prediction interval u^j±1.96(1λj)τ2\hat{u}_j \pm 1.96 \sqrt{(1 - \lambda_j)\tau^2} covers the true realized uju_j with the stated probability under repeated sampling of (y,u)(\mathbf{y}, \mathbf{u}) pairs — a frequentist coverage statement, not the Bayesian “what is the probability that uju_j lies in this interval.” The latter requires a posterior, and §5 shows that for this model the posterior interval and the prediction interval numerically coincide.

Second, the unbiasedness of the BLUP is unbiasedness in the predictive sense: E[u^u]=0\mathbb{E}[\hat{\mathbf{u}} - \mathbf{u}] = \mathbf{0}, where the expectation is over both y\mathbf{y} and u\mathbf{u}. Crucially, u^\hat{\mathbf{u}} is not unbiased for any particular realized u\mathbf{u} — it shrinks toward zero, so for a u\mathbf{u} that happens to be far from zero, u^\hat{\mathbf{u}} is systematically biased toward zero in the conditional sense E[u^u]u\mathbb{E}[\hat{\mathbf{u}} \mid \mathbf{u}] \ne \mathbf{u}. The unbiasedness in the predictive average integrates over the prior, where the expected u\mathbf{u} is zero. This is the Stein-estimator family at work — biased point estimates can have lower MSE than unbiased ones — and it is the structural reason mixed-effects models systematically beat both complete pooling and no pooling on out-of-sample prediction.
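The MSE claim is a two-minute Monte Carlo. At known variance components, repeatedly draw (u,rˉ)(\mathbf{u}, \bar{\mathbf{r}}) pairs and score all three rules against the realized effects — a sketch, with the group residual means drawn directly from their sampling distribution rather than from student-level data:

```python
import numpy as np

rng = np.random.default_rng(3)
sizes = np.array([4, 6, 8, 10, 12, 20])
J = len(sizes)
tau2, sig2 = 25.0, 64.0
lam = tau2 / (tau2 + sig2 / sizes)                        # shrinkage factors (3.12)

mse = {"blup": 0.0, "no_pool": 0.0, "pool": 0.0}
trials = 2000
for _ in range(trials):
    u = rng.normal(0, np.sqrt(tau2), J)                   # realized random effects
    r_bar = u + rng.normal(0, np.sqrt(sig2 / sizes))      # group residual means
    mse["blup"] += np.mean((lam * r_bar - u) ** 2)        # partial pooling
    mse["no_pool"] += np.mean((r_bar - u) ** 2)           # no pooling
    mse["pool"] += np.mean(u ** 2)                        # complete pooling: estimate 0
mse = {k: v / trials for k, v in mse.items()}
print(mse)   # blup < no_pool < pool
```

The biased rule wins: the BLUP's average MSE sits below both alternatives, matching the expected values (1λj)τ2(1 - \lambda_j)\tau^2, σ2/nj\sigma^2/n_j, and τ2\tau^2 respectively.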

In modern Bayesian language we’d say nothing fancy at all about any of this: we have a joint Gaussian model, E[uy]\mathbb{E}[\mathbf{u} \mid \mathbf{y}] is the posterior mean, that’s the MMSE estimator, and the predictive interval is the posterior credible interval. The frequentist BLUP framing was historically prior to the Bayesian framing in this context — Henderson 1950, Patterson and Thompson 1971 — but the two views give the same point estimates and (asymptotically) the same intervals, which is one of the nicer correspondences in applied statistics.

Three-panel figure on shrinkage. Panel (a) plots the shrinkage curve λ(n) = τ²/(τ² + σ²/n) over n in [1,50] with horizontal references at λ=0 (complete pooling) and λ=1 (no pooling); the six classrooms are marked as colored dots at (n_j, λ_j). Panel (b) shows six rows sorted by n_j ascending, each with a hollow circle at the no-pooling residual mean and a filled colored circle at the BLUP λ_j times the residual mean, with an arrow connecting them and a vertical reference at zero; smaller classrooms have visibly longer arrows. Panel (c) shows BLUPs with conditional 95% intervals plus no-pooling residual means with their own larger 95% intervals in gray for comparison.
Figure 3. Shrinkage in action. (a) The shrinkage factor λ_j as a function of group size at the truth — monotonic, with the six classrooms at distinct points. (b) Each classroom's no-pooling deviation pulled toward zero by λ_j; smallest classrooms shrink hardest. (c) BLUPs with conditional ±1.96 SE intervals (colored) versus no-pooling means with their wider ±1.96 SE intervals (gray) — the BLUP is biased toward zero AND has narrower uncertainty, the structural reason mixed-effects beats both alternatives in MSE.
# Henderson MMEs at the truth, then verify against §1's REML fit
G_t, R_t = (tau_t ** 2) * np.eye(J), (sigma_t ** 2) * np.eye(N)
R_inv, G_inv = np.linalg.inv(R_t), np.linalg.inv(G_t)
M = np.block([[X.T @ R_inv @ X,           X.T @ R_inv @ Z],          # (p+q)×(p+q)
              [Z.T @ R_inv @ X,           Z.T @ R_inv @ Z + G_inv]])  # = 8×8 here
rhs = np.concatenate([X.T @ R_inv @ y, Z.T @ R_inv @ y])
sol = solve(M, rhs, assume_a="pos")
beta_mme, u_mme = sol[:2], sol[2:]                                    # MME solve
lam = (tau_t ** 2) / (tau_t ** 2 + sigma_t ** 2 / n_per_group)        # shrinkage

The MME solve gives α^50.99\hat\alpha \approx 50.99, β^4.86\hat\beta \approx 4.86 (matching §2’s GLS-with-true-V\mathbf{V} fit to machine precision), and BLUPs u^j\hat{u}_j that agree with mlm.random_effects from §1’s REML fit to within the difference between truth-plug-in and REML-plug-in variance components. The shrinkage factors at the truth are (0.610,0.701,0.758,0.796,0.824,0.887)(0.610, 0.701, 0.758, 0.796, 0.824, 0.887) — exactly Proposition 2’s prediction, and exactly the curve that panel (a) draws.

4. REML for variance components

§3 assumed the variance components (τ2,σ2)(\tau^2, \sigma^2) were known. They aren’t. Estimating them is where the structural difficulty of mixed-effects fitting actually lives — the fixed effects and random effects come along for free once (τ2,σ2)(\tau^2, \sigma^2) are pinned down via §3’s MMEs, but pinning them down is a job in its own right and the obvious approach (maximum likelihood on the marginal yN(Xβ,V)\mathbf{y} \sim \mathcal{N}(\mathbf{X}\boldsymbol{\beta}, \mathbf{V})) is biased downward in a way that’s significant for the small-to-moderate group counts that motivate mixed-effects in the first place.

REML — restricted or residual maximum likelihood — is the standard fix. The original Patterson–Thompson 1971 derivation transformed y\mathbf{y} to a vector of error contrasts orthogonal to X\mathbf{X} and computed the likelihood of those. We’ll take the equivalent and tighter route: marginalize β\boldsymbol{\beta} out under a flat improper prior, leaving a residual likelihood that depends only on (G,R)(\mathbf{G}, \mathbf{R}), and maximize it. The structural answer is the same as what you’d get from §1’s textbook degrees-of-freedom correction, generalized to arbitrary mixed models. The mechanics turn out to mirror the Gaussian-process §5 marginal-likelihood-based kernel-hyperparameter learning almost exactly — same Bayesian-marginalization machinery, same Occam’s-razor-flavored objective, different model class.

4.1 Why ML is biased downward

The cleanest illustration is the elementary one: ordinary linear regression. With yN(Xβ,σ2IN)\mathbf{y} \sim \mathcal{N}(\mathbf{X}\boldsymbol{\beta}, \sigma^2 \mathbf{I}_N), pp fixed-effects parameters, no random effects, the maximum-likelihood estimator of σ2\sigma^2 is

σ^ML2  =  1Ni=1N(yixiβ^)2  =  RSS(β^)N,\hat{\sigma}^2_{\mathrm{ML}} \;=\; \frac{1}{N}\,\sum_{i=1}^N (y_i - \mathbf{x}_i^\top \hat{\boldsymbol{\beta}})^2 \;=\; \frac{\mathrm{RSS}(\hat{\boldsymbol{\beta}})}{N},

where RSS(β^)=yXβ^2\mathrm{RSS}(\hat{\boldsymbol{\beta}}) = \|\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}\|^2 is the residual sum of squares at the OLS fit. The textbook unbiased estimator divides by NpN - p instead of NN:

σ^unbiased2  =  RSS(β^)Np,E[σ^unbiased2]  =  σ2.(4.1)\hat{\sigma}^2_{\mathrm{unbiased}} \;=\; \frac{\mathrm{RSS}(\hat{\boldsymbol{\beta}})}{N - p}, \qquad \mathbb{E}[\hat{\sigma}^2_{\mathrm{unbiased}}] \;=\; \sigma^2. \quad\quad (4.1)

The factor NpN - p in the denominator is the residual degrees of freedom: there are NN raw observations, but pp of them have been used to fit β^\hat{\boldsymbol{\beta}}, leaving NpN - p “effective” observations to estimate σ2\sigma^2. The ML estimator, by dividing by NN rather than NpN - p, fails to deduct the cost of estimating β^\hat{\boldsymbol{\beta}} and consequently underestimates σ2\sigma^2 by a factor of (Np)/N(N - p)/N. For N=100,p=5N = 100, p = 5, that’s a 5% downward bias — small but real. For N=12,p=5N = 12, p = 5, ML recovers only 7/1258%7/12 \approx 58\% of the truth — a 42% downward bias — catastrophic.
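The two denominators are worth seeing head-to-head. A Monte Carlo sketch at the catastrophic end, N=12N = 12, p=5p = 5, on a hypothetical design with σ2=4\sigma^2 = 4:

```python
import numpy as np

rng = np.random.default_rng(4)
N, p, sigma2, trials = 12, 5, 4.0, 5000
X = rng.normal(size=(N, p))                     # one fixed design, many noise draws
ml = unb = 0.0
for _ in range(trials):
    y = X @ np.ones(p) + rng.normal(0, np.sqrt(sigma2), N)
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    rss = np.sum((y - X @ beta) ** 2)
    ml += rss / N                               # ML: divide by N
    unb += rss / (N - p)                        # textbook: divide by N - p
ml, unb = ml / trials, unb / trials
print(f"E[ML] ~ {ml:.2f}, E[unbiased] ~ {unb:.2f}, truth = {sigma2}")
```

The ML average settles near σ2(Np)/N2.33\sigma^2 (N - p)/N \approx 2.33, the corrected average near the truth.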

Mixed-effects fits suffer from the same problem, in a more elaborate form. The marginal log-likelihood we’d plausibly maximize in (β,G,R)(\boldsymbol{\beta}, \mathbf{G}, \mathbf{R}) jointly is

ML(β,G,R)  =  N2log2π12logdetV12(yXβ)V1(yXβ),(4.2)\ell_{\mathrm{ML}}(\boldsymbol{\beta}, \mathbf{G}, \mathbf{R}) \;=\; -\tfrac{N}{2}\log 2\pi - \tfrac{1}{2}\log\det \mathbf{V} - \tfrac{1}{2}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^\top \mathbf{V}^{-1}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}), \quad\quad (4.2)

with V=ZGZ+R\mathbf{V} = \mathbf{Z}\mathbf{G}\mathbf{Z}^\top + \mathbf{R}. Maximizing in β\boldsymbol{\beta} at fixed (G,R)(\mathbf{G}, \mathbf{R}) gives the GLS estimator from §2; maximizing in (G,R)(\mathbf{G}, \mathbf{R}) at the resulting β^GLS\hat{\boldsymbol{\beta}}_{\mathrm{GLS}}profiling β\boldsymbol{\beta} out — gives the profile log-likelihood in (G,R)(\mathbf{G}, \mathbf{R}):

prof(G,R)  =  N2log2π12logdetV12RSSV(β^GLS),(4.3)\ell_{\mathrm{prof}}(\mathbf{G}, \mathbf{R}) \;=\; -\tfrac{N}{2}\log 2\pi - \tfrac{1}{2}\log\det \mathbf{V} - \tfrac{1}{2}\mathrm{RSS}_\mathbf{V}(\hat{\boldsymbol{\beta}}_{\mathrm{GLS}}), \quad\quad (4.3)

where RSSV(β^GLS)=(yXβ^GLS)V1(yXβ^GLS)\mathrm{RSS}_\mathbf{V}(\hat{\boldsymbol{\beta}}_{\mathrm{GLS}}) = (\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}_{\mathrm{GLS}})^\top \mathbf{V}^{-1}(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}_{\mathrm{GLS}}) is the GLS-weighted residual sum of squares. Maximizing prof\ell_{\mathrm{prof}} over (G,R)(\mathbf{G}, \mathbf{R}) gives the ML estimates (G^ML,R^ML)(\hat{\mathbf{G}}_{\mathrm{ML}}, \hat{\mathbf{R}}_{\mathrm{ML}}), and these are biased downward for exactly the same reason (4.1) is: prof\ell_{\mathrm{prof}} has paid no penalty for the pp degrees of freedom consumed by β^GLS\hat{\boldsymbol{\beta}}_{\mathrm{GLS}}.

For our six-classroom example with p=2p = 2, the bias is mild — the ML estimate is about (Np)/N=58/6096.7%(N - p)/N = 58/60 \approx 96.7\% of the truth, a 3.3% downward bias. For a study with more covariates or fewer observations the bias scales linearly in pp. With six covariates and twenty observations, ML would underestimate σ2\sigma^2 by 30%. The variance-component estimates inherit that bias, the fixed-effects standard errors (which depend on V1\mathbf{V}^{-1}) inherit it from there, and confidence intervals based on the ML fit are systematically too narrow.

REML’s job is to clean this up.

4.2 The REML derivation by Bayesian marginalization

Patterson and Thompson 1971 derived REML by transforming y\mathbf{y} to a vector of error contrasts — linear combinations ay\mathbf{a}^\top \mathbf{y} with aX=0\mathbf{a}^\top \mathbf{X} = \mathbf{0} — and writing down the likelihood of those contrasts. The contrasts are statistics that depend on the data only through residuals from the fixed-effects fit, so their distribution depends on (G,R)(\mathbf{G}, \mathbf{R}) but not on β\boldsymbol{\beta}, and maximizing their likelihood gives unbiased variance-component estimates. The construction is geometrically clean but algebraically heavy. We’ll take the equivalent route that’s cleaner algebraically and connects directly to the GP §5 story: integrate β\boldsymbol{\beta} out under a flat improper prior p(β)1p(\boldsymbol{\beta}) \propto 1 on Rp\mathbb{R}^p.

Theorem 3 (REML log-likelihood).

Under Definition 1 with V,G,R\mathbf{V}, \mathbf{G}, \mathbf{R} positive-definite, the marginal density of y\mathbf{y} obtained by integrating β\boldsymbol{\beta} out under a flat improper prior p(β)1p(\boldsymbol{\beta}) \propto 1 on Rp\mathbb{R}^p is well-defined up to a multiplicative constant, and the resulting REML log-likelihood (up to an additive constant in (G,R)(\mathbf{G}, \mathbf{R})) is REML(G,R)  =  12logdetV    12logdet(XV1X)    12RSSV(β^GLS).(4.4)\ell_{\mathrm{REML}}(\mathbf{G}, \mathbf{R}) \;=\; -\tfrac{1}{2}\log\det \mathbf{V} \;-\; \tfrac{1}{2}\log\det(\mathbf{X}^\top \mathbf{V}^{-1}\mathbf{X}) \;-\; \tfrac{1}{2}\mathrm{RSS}_\mathbf{V}(\hat{\boldsymbol{\beta}}_{\mathrm{GLS}}). \quad\quad (4.4)

Proof.

The marginal density of y\mathbf{y} after integrating out β\boldsymbol{\beta} is pREML(yG,R)  =  Rpp(yβ,G,R)p(β)dβ    RpN(y;Xβ,V)dβ,(4.5)p_{\mathrm{REML}}(\mathbf{y} \mid \mathbf{G}, \mathbf{R}) \;=\; \int_{\mathbb{R}^p} p(\mathbf{y} \mid \boldsymbol{\beta}, \mathbf{G}, \mathbf{R})\,p(\boldsymbol{\beta})\,d\boldsymbol{\beta} \;\propto\; \int_{\mathbb{R}^p} \mathcal{N}(\mathbf{y};\, \mathbf{X}\boldsymbol{\beta},\, \mathbf{V})\,d\boldsymbol{\beta}, \quad\quad (4.5) the proportionality absorbing the unknown constant from the improper prior. The integrand is a Gaussian in β\boldsymbol{\beta} — quadratic exponential — so the integral is a Gaussian integral and we can evaluate it in closed form. Expand the quadratic in the integrand: (yXβ)V1(yXβ)  =  βXV1Xβ2βXV1y+yV1y.(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^\top \mathbf{V}^{-1}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) \;=\; \boldsymbol{\beta}^\top \mathbf{X}^\top \mathbf{V}^{-1}\mathbf{X}\boldsymbol{\beta} - 2\boldsymbol{\beta}^\top \mathbf{X}^\top \mathbf{V}^{-1}\mathbf{y} + \mathbf{y}^\top \mathbf{V}^{-1}\mathbf{y}. Complete the square in β\boldsymbol{\beta}. With A=XV1X\mathbf{A} = \mathbf{X}^\top \mathbf{V}^{-1}\mathbf{X} and b=XV1y\mathbf{b} = \mathbf{X}^\top \mathbf{V}^{-1}\mathbf{y}, the quadratic-in-β\boldsymbol{\beta} piece is βAβ2βb\boldsymbol{\beta}^\top \mathbf{A}\boldsymbol{\beta} - 2\boldsymbol{\beta}^\top \mathbf{b}, which equals (βA1b)A(βA1b)bA1b(\boldsymbol{\beta} - \mathbf{A}^{-1}\mathbf{b})^\top \mathbf{A}(\boldsymbol{\beta} - \mathbf{A}^{-1}\mathbf{b}) - \mathbf{b}^\top \mathbf{A}^{-1}\mathbf{b}. 
Recognizing A1b=β^GLS\mathbf{A}^{-1}\mathbf{b} = \hat{\boldsymbol{\beta}}_{\mathrm{GLS}} and substituting, (yXβ)V1(yXβ)  =  (ββ^GLS)A(ββ^GLS)  +  RSSV(β^GLS),(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^\top \mathbf{V}^{-1}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) \;=\; (\boldsymbol{\beta} - \hat{\boldsymbol{\beta}}_{\mathrm{GLS}})^\top \mathbf{A}(\boldsymbol{\beta} - \hat{\boldsymbol{\beta}}_{\mathrm{GLS}}) \;+\; \mathrm{RSS}_\mathbf{V}(\hat{\boldsymbol{\beta}}_{\mathrm{GLS}}), where RSSV(β^GLS)=yV1ybA1b\mathrm{RSS}_\mathbf{V}(\hat{\boldsymbol{\beta}}_{\mathrm{GLS}}) = \mathbf{y}^\top \mathbf{V}^{-1}\mathbf{y} - \mathbf{b}^\top \mathbf{A}^{-1}\mathbf{b}. The integral in (4.5) factors: pREML(yG,R)    1(2π)N/2detVexp ⁣(12RSSV(β^GLS))Rpexp ⁣(12(ββ^GLS)A(ββ^GLS))dβ.p_{\mathrm{REML}}(\mathbf{y} \mid \mathbf{G}, \mathbf{R}) \;\propto\; \frac{1}{(2\pi)^{N/2}\sqrt{\det \mathbf{V}}} \,\exp\!\left(-\tfrac{1}{2}\mathrm{RSS}_\mathbf{V}(\hat{\boldsymbol{\beta}}_{\mathrm{GLS}})\right) \cdot \int_{\mathbb{R}^p} \exp\!\left(-\tfrac{1}{2}(\boldsymbol{\beta} - \hat{\boldsymbol{\beta}}_{\mathrm{GLS}})^\top \mathbf{A}(\boldsymbol{\beta} - \hat{\boldsymbol{\beta}}_{\mathrm{GLS}})\right) d\boldsymbol{\beta}. The Gaussian integral evaluates to (2π)p/2(detA)1/2(2\pi)^{p/2} (\det \mathbf{A})^{-1/2}: pREML(yG,R)    1(2π)(Np)/2detVdet(XV1X)exp ⁣(12RSSV(β^GLS)).p_{\mathrm{REML}}(\mathbf{y} \mid \mathbf{G}, \mathbf{R}) \;\propto\; \frac{1}{(2\pi)^{(N-p)/2}\sqrt{\det \mathbf{V}}\sqrt{\det(\mathbf{X}^\top \mathbf{V}^{-1}\mathbf{X})}} \,\exp\!\left(-\tfrac{1}{2}\mathrm{RSS}_\mathbf{V}(\hat{\boldsymbol{\beta}}_{\mathrm{GLS}})\right). 
Take logs and drop the Np2log2π-\tfrac{N - p}{2}\log 2\pi constant (it doesn’t depend on (G,R)(\mathbf{G}, \mathbf{R})): REML(G,R)  =  12logdetV12logdet(XV1X)12RSSV(β^GLS),\ell_{\mathrm{REML}}(\mathbf{G}, \mathbf{R}) \;=\; -\tfrac{1}{2}\log\det \mathbf{V} - \tfrac{1}{2}\log\det(\mathbf{X}^\top \mathbf{V}^{-1}\mathbf{X}) - \tfrac{1}{2}\mathrm{RSS}_\mathbf{V}(\hat{\boldsymbol{\beta}}_{\mathrm{GLS}}), which is (4.4).
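The theorem is easy to sanity-check numerically: with a single covariate the integral in (4.5) is one-dimensional, so direct quadrature should reproduce (4.4) up to the dropped constant. A minimal sketch, where the design, covariance, and data are arbitrary stand-ins rather than the classroom example:

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
N, p = 5, 1
X = rng.normal(size=(N, p))
V = np.diag(rng.uniform(0.5, 2.0, N))        # any positive-definite V works
y = rng.normal(size=N)

# Closed form (4.4): -1/2 log|V| - 1/2 log|X' V^-1 X| - 1/2 RSS_V
Vinv = np.linalg.inv(V)
A = X.T @ Vinv @ X
beta_gls = np.linalg.solve(A, X.T @ Vinv @ y)
r = y - X @ beta_gls
ell_closed = -0.5 * (np.linalg.slogdet(V)[1] + np.linalg.slogdet(A)[1] + r @ Vinv @ r)

# Direct quadrature of (4.5): integrate the N(y; X beta, V) density over beta
sd = 1.0 / np.sqrt(A[0, 0])                  # spread of the integrand in beta
f = lambda b: multivariate_normal.pdf(y, mean=X[:, 0] * b, cov=V)
log_integral = np.log(quad(f, beta_gls[0] - 10 * sd, beta_gls[0] + 10 * sd)[0])

# The two differ by exactly the -(N - p)/2 log 2pi constant the theorem drops
assert np.isclose(ell_closed, log_integral + (N - p) / 2 * np.log(2 * np.pi))
```

The same check works for p > 1 with a multidimensional quadrature, but the one-dimensional case already exercises every term in the proof.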

The structural difference between (4.3) and (4.4) is one term — 12logdet(XV1X)-\tfrac{1}{2}\log\det(\mathbf{X}^\top \mathbf{V}^{-1}\mathbf{X}) — and that one term is the entire fix. It penalizes variance components that produce large fixed-effects information A=XV1X\mathbf{A} = \mathbf{X}^\top \mathbf{V}^{-1}\mathbf{X}, which counteracts the ML profile likelihood’s tendency to push σ2\sigma^2 down to inflate the GLS information matrix. The Patterson–Thompson construction produces the same correction by a different route: their error-contrast likelihood is the likelihood of an (Np)(N - p)-dimensional vector of statistics whose distribution doesn’t involve β\boldsymbol{\beta}, and writing it out yields exactly (4.4) up to constants. The flat-prior marginalization route is shorter, more transparent about what the correction is (a marginalization of a nuisance parameter), and connects directly to GP §5: the GP marginal log-likelihood

logp(yX,θkernel)  =  12logdetKkernel12yKkernel1yN2log2π\log p(\mathbf{y} \mid \mathcal{X}, \boldsymbol{\theta}_{\mathrm{kernel}}) \;=\; -\tfrac{1}{2}\log\det \mathbf{K}_{\mathrm{kernel}} - \tfrac{1}{2}\mathbf{y}^\top \mathbf{K}_{\mathrm{kernel}}^{-1}\mathbf{y} - \tfrac{N}{2}\log 2\pi

is structurally identical to (4.4) with Xβ0\mathbf{X}\boldsymbol{\beta} \equiv \mathbf{0} — no fixed effects to marginalize, no logdet\log\det correction, just the determinant penalty and the residual quadratic. REML is the marginal-likelihood machinery, plus a logdet(XV1X)\log\det(\mathbf{X}^\top \mathbf{V}^{-1}\mathbf{X}) correction for the pp fixed effects we’ve integrated out.

A remark on the improper prior. We used p(β)1p(\boldsymbol{\beta}) \propto 1 on Rp\mathbb{R}^p, which doesn’t normalize. The marginal density (4.5) is therefore defined only up to a multiplicative constant — the constant of integration of the improper prior — and that’s why (4.4) is the REML log-likelihood “up to an additive constant.” For the purpose of maximizing in (G,R)(\mathbf{G}, \mathbf{R}) the constant is irrelevant. For Bayesian readers: the improper-prior marginalization is what makes REML a partial Bayesian procedure rather than a fully Bayesian one — we marginalize β\boldsymbol{\beta} but point-estimate (G,R)(\mathbf{G}, \mathbf{R}). Putting a proper prior on (G,R)(\mathbf{G}, \mathbf{R}) and integrating those out too is the fully Bayesian alternative, which §5 takes up.

4.3 The profile form for the random-intercept model

Specializing (4.4) to the random-intercept-only model with G=τ2IJ\mathbf{G} = \tau^2 \mathbf{I}_J and R=σ2IN\mathbf{R} = \sigma^2 \mathbf{I}_N produces a function of two scalar variance components that we can plot, optimize numerically, and validate against statsmodels.MixedLM(reml=True). Three computational ingredients make this practical:

  1. logdetV\log\det \mathbf{V} via the Sylvester identity. Direct computation of detV\det \mathbf{V} on a 60×6060 \times 60 matrix is fine here, but for general use the Sylvester determinant identity gives logdetV  =  logdet(R)+logdet(Iq+ZR1ZG).(4.6)\log\det \mathbf{V} \;=\; \log\det(\mathbf{R}) + \log\det(\mathbf{I}_q + \mathbf{Z}^\top \mathbf{R}^{-1}\mathbf{Z}\,\mathbf{G}). \quad\quad (4.6) The right side reduces a 60×6060 \times 60 determinant to a 6×66 \times 6 determinant plus the determinant of the diagonal 60×6060 \times 60 R\mathbf{R}, which is just a sum of log diagonal entries. The savings are dramatic at scale: a study with N=105N = 10^5 and J=100J = 100 does a 100×100100 \times 100 determinant rather than a 105×10510^5 \times 10^5 one.

  2. V1\mathbf{V}^{-1} via Woodbury. The same identity from §3’s proof reduces V1\mathbf{V}^{-1} to a single J×JJ \times J inversion.

  3. Reparameterization to unconstrained variables. The variance components (τ2,σ2)(\tau^2, \sigma^2) are positive-real; numerical optimizers prefer unconstrained variables. Standard practice is to optimize over (logτ,logσ)(\log \tau, \log \sigma), matching the §3 transform table from probabilistic programming. The Jacobian factor doesn’t appear in (4.4) because we’re maximizing a likelihood, not a posterior, but it would appear if we put proper log-Normal priors on (τ,σ)(\tau, \sigma) in §5’s Bayesian fit.
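The first two ingredients are easy to verify on a small synthetic random-intercept design before trusting them inside an optimizer. A minimal sketch; the sizes and variance values below are arbitrary stand-ins:

```python
import numpy as np

rng = np.random.default_rng(1)
N, J = 12, 3
Z = np.zeros((N, J))
Z[np.arange(N), rng.integers(0, J, N)] = 1.0        # random group membership
tau2, sig2 = 2.0, 0.5
G, R = tau2 * np.eye(J), sig2 * np.eye(N)
V = Z @ G @ Z.T + R

# Ingredient 1 -- Sylvester (4.6): a J x J determinant plus the diagonal R
lhs = np.linalg.slogdet(V)[1]
rhs = N * np.log(sig2) + np.linalg.slogdet(np.eye(J) + Z.T @ Z * (tau2 / sig2))[1]
assert np.isclose(lhs, rhs)

# Ingredient 2 -- Woodbury: V^-1 from a single J x J inversion
Rinv = np.eye(N) / sig2
M = np.linalg.inv(np.linalg.inv(G) + Z.T @ Rinv @ Z)    # J x J core matrix
Vinv = Rinv - Rinv @ Z @ M @ Z.T @ Rinv
assert np.allclose(Vinv, np.linalg.inv(V))
```

Both assertions pass at machine precision; at N in the tens the direct and structured computations are interchangeable, and the structured versions are what scale.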

For a numerical optimizer to find the REML optimum, plug (4.6) and Woodbury into (4.4), differentiate analytically (or use scipy.optimize.minimize with method='L-BFGS-B' and finite-difference gradients), and let the optimizer report (τ^REML2,σ^REML2)(\hat\tau^2_{\mathrm{REML}}, \hat\sigma^2_{\mathrm{REML}}). The result must agree with statsmodels.MixedLM(reml=True).fit().cov_re and .scale to numerical precision — which it does, because MixedLM does exactly this optimization internally.

4.4 ML vs REML on the six-classroom data

Run both estimators on §1’s data. Maximize the profile likelihood (4.3) for ML and the REML likelihood (4.4) for REML, both over the same (logτ,logσ)(\log\tau, \log\sigma) space, and compare. Two consistency checks the run should confirm: (i) τ^REML2>τ^ML2\hat\tau^2_{\mathrm{REML}} > \hat\tau^2_{\mathrm{ML}} and σ^REML2>σ^ML2\hat\sigma^2_{\mathrm{REML}} > \hat\sigma^2_{\mathrm{ML}} (REML’s bias correction pushes both variance components up); (ii) the REML estimates match statsmodels to numerical precision.

The visual story is in Figure 4. The two log-likelihood surfaces over (logτ,logσ)(\log\tau, \log\sigma) have different optima — REML’s optimum sits at slightly larger variance components than ML’s, exactly the bias correction (4.4) was designed to produce. In this particular six-classroom realization with p=2p = 2, the upward shift in σ^2\hat\sigma^2 is only slight (well under 1% in the reported estimates), with a substantially larger relative shift in τ^2\hat\tau^2 because the logdetXV1X\log\det \mathbf{X}^\top \mathbf{V}^{-1}\mathbf{X} term interacts nonlinearly with τ2\tau^2. The relevant expected benchmark is the classical degrees-of-freedom heuristic: ML’s variance estimate is biased downward by a factor of roughly (Np)/N(N - p)/N — here on the order of a few percent — but that’s the bias factor in expectation, not the exact gap any particular sample exhibits. On a study with p=10p = 10 and N=30N = 30, the same comparison produces ML estimates 30% lower than REML’s, which is the regime where REML is non-negotiable.

A note on degenerate optima. The REML log-likelihood for a random-intercept model can have τ^2=0\hat\tau^2 = 0 as a boundary optimum if the data shows little classroom-level variation — the case where the model “decides” classrooms don’t matter. This is not a bug; it’s the data telling us partial pooling collapses to complete pooling. MixedLM handles this gracefully (returns τ^2=0\hat\tau^2 = 0 and a warning); custom optimizers should optimize over logτ(,)\log\tau \in (-\infty, \infty) with care at the lower bound. Our DGP has τ=5>0\tau = 5 > 0 and J=6J = 6 classrooms, so the boundary case won’t trigger here, but it’s worth flagging because applied users encounter it routinely with small JJ or weak between-group signal.

ML grid argmax: τ̂² = 22.24, σ̂² = 77.94. REML grid argmax: τ̂² = 30.17, σ̂² = 77.94. Truth: τ² = 25, σ² = 64.
Both surfaces are evaluated on a 60×60 grid in (log τ, log σ) using the block-diagonal V eigenstructure for the random-intercept model; each cell evaluates ℓ_ML or ℓ_REML in closed form. The two optima sit at slightly different locations: REML's lies at larger variance components than ML's because the −½ log|XᵀV⁻¹X| correction in eq. (4.4) penalizes parameter combinations that produce large fixed-effects information, undoing ML's tendency to push σ² downward. On this p=2 realization, the gap is small but visible — on small-N high-p data it dominates and ML becomes seriously biased. The truth (black ×) sits between the two optima, slightly above both — sampling noise on J=6 classrooms.

4.5 The Bayesian view: REML as an empirical-Bayes hyperprior

The marginalization step in (4.5) integrates β\boldsymbol{\beta} out under a flat improper prior. What if we put a prior on (G,R)(\mathbf{G}, \mathbf{R}) too and integrate them out? That’s full Bayesian inference, and it’s what §5’s PyMC fit does. The relationship to REML is direct: REML is empirical Bayes on the variance components — point-estimate them by maximizing a marginal likelihood, then plug those point estimates into Henderson’s MMEs (which are themselves the conditional posterior mode for (β,u)(\boldsymbol{\beta}, \mathbf{u}) given (G,R)=(G^,R^)(\mathbf{G}, \mathbf{R}) = (\hat{\mathbf{G}}, \hat{\mathbf{R}})).

This empirical-Bayes-meets-MAP reading is what makes the §5 frequentist–Bayesian comparison so clean: REML and the PyMC posterior agree wherever the variance-component posterior is concentrated near (G^REML,R^REML)(\hat{\mathbf{G}}_{\mathrm{REML}}, \hat{\mathbf{R}}_{\mathrm{REML}}), which is the typical regime when JJ isn’t too small. The discrepancies show up when the variance-component posterior is wide (small JJ, weak between-group signal, posterior bimodality at τ2=0\tau^2 = 0) — exactly the regime where empirical-Bayes plug-in underestimates uncertainty and full Bayesian propagation pays its rent.

For the six-classroom data at J=6J = 6, the variance-component posterior is concentrated enough that REML and the PyMC fit will agree closely, and that agreement is the closure point §5 builds toward.

Two-panel side-by-side log-likelihood surfaces over (log-tau, log-sigma) with shared axes. Panel (a) shows the ML profile log-likelihood as a filled viridis contour with a black star marking the truth (log 5, log 8) and a magenta cross marking the ML optimum, which sits at smaller variance components than the truth. Panel (b) shows the REML log-likelihood with the same truth marker and a cyan cross at the REML optimum, which sits closer to the truth and at slightly larger variance components than ML's. The two surfaces look similar away from the optima.
Figure 4. ML versus REML log-likelihood surfaces over (log τ, log σ). The truth (black star) sits between the two optima, and REML's optimum (cyan ×) is at larger variance components than ML's (magenta ×). On p=2 data the bias correction is small but visible; on small-N high-p data it dominates.
# Custom REML and ML log-likelihoods, optimized via L-BFGS-B
import numpy as np
from scipy.optimize import minimize
from scipy.linalg import cho_factor, cho_solve

def neg_ell(theta, kind):                                # theta = (log τ, log σ)
    tau2, sig2 = np.exp(2 * theta[0]), np.exp(2 * theta[1])
    V = Z @ (tau2 * np.eye(J)) @ Z.T + sig2 * np.eye(N)
    L, _ = cho_factor(V, lower=True)
    A = X.T @ cho_solve((L, True), X)                    # X^T V^{-1} X
    beta = np.linalg.solve(A, X.T @ cho_solve((L, True), y))
    rss = (y - X @ beta) @ cho_solve((L, True), y - X @ beta)
    logdet_V = 2 * np.sum(np.log(np.diag(L)))
    if kind == "ml":
        return 0.5 * (N * np.log(2 * np.pi) + logdet_V + rss)
    return 0.5 * (logdet_V + np.linalg.slogdet(A)[1] + rss)   # REML

x0 = np.array([np.log(5.0), np.log(8.0)])
ml = minimize(neg_ell, x0, args=("ml",),   method="L-BFGS-B")
re = minimize(neg_ell, x0, args=("reml",), method="L-BFGS-B")

The custom REML fit returns τ^REML23.30\hat\tau_{\mathrm{REML}}^2 \approx 3.30 and σ^REML256.1\hat\sigma_{\mathrm{REML}}^2 \approx 56.1, agreeing with mlm.cov_re.iloc[0,0] and mlm.scale from §1’s statsmodels fit to roughly 1.5×1031.5 \times 10^{-3} — the two implementations solve the same optimization to the optimizer’s tolerance. The ML estimates are τ^ML22.70\hat\tau_{\mathrm{ML}}^2 \approx 2.70, σ^ML256.0\hat\sigma_{\mathrm{ML}}^2 \approx 56.0 — both below REML’s by the correction that the Bayesian marginalization in (4.4) introduces. The truth (25,64)(25, 64) sits in the right neighborhood; this realization happens to under-shoot τ2\tau^2 (one of the regimes where the variance-component posterior has nontrivial mass near zero, as §5 will exhibit).
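For readers reproducing this without §1’s mlm object in scope, the same REML-vs-ML comparison runs end to end through statsmodels. A self-contained sketch of the pattern; the seed, the uneven group sizes, and the fixed-effects values (50, 5) are assumptions, so its estimates will differ from the realization reported above:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Regenerate six-classroom data from the stated DGP: tau = 5, sigma = 8.
# Seed, group sizes, and fixed effects here are illustrative assumptions.
rng = np.random.default_rng(42)
n_j = np.array([4, 6, 8, 10, 14, 18])                      # N = 60, uneven
classroom = np.repeat(np.arange(6), n_j)
x = rng.uniform(0, 4, n_j.sum())
a = rng.normal(0, 5, 6)                                    # random intercepts
y = 50 + 5 * x + a[classroom] + rng.normal(0, 8, n_j.sum())

df = pd.DataFrame({"y": y, "x": x, "classroom": classroom})
reml = smf.mixedlm("y ~ x", df, groups=df["classroom"]).fit(reml=True)
ml = smf.mixedlm("y ~ x", df, groups=df["classroom"]).fit(reml=False)

print("REML: tau2 =", reml.cov_re.iloc[0, 0], " sigma2 =", reml.scale)
print("ML:   tau2 =", ml.cov_re.iloc[0, 0], " sigma2 =", ml.scale)
```

Swapping reml=True for reml=False is the entire frequentist side of the §4.4 comparison; cov_re and scale are the τ² and σ² estimates the custom optimizer must reproduce.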

5. The frequentist–Bayesian bridge

We’ve spent four sections building the frequentist machinery. §1 motivated partial pooling geometrically. §2 gave the linear mixed model and the marginal covariance V=ZGZ+R\mathbf{V} = \mathbf{Z}\mathbf{G}\mathbf{Z}^\top + \mathbf{R}. §3 derived Henderson’s mixed-model equations and the BLUP shrinkage formula. §4 derived REML and showed it’s a Bayesian-marginalization-of-β\boldsymbol{\beta} in disguise.

§4 closed with a remark that’s worth taking seriously: REML is empirical Bayes on the variance components. We marginalize β\boldsymbol{\beta} (frequentists call this profiling), point-estimate (G,R)(\mathbf{G}, \mathbf{R}) by maximizing the residual likelihood, then plug those point estimates into Henderson’s MMEs to get (β^,u^)(\hat{\boldsymbol{\beta}}, \hat{\mathbf{u}}). The plug-in step is what makes it empirical Bayes rather than fully Bayesian — full Bayesian inference would put a prior on (G,R)(\mathbf{G}, \mathbf{R}) and integrate over the variance-component posterior rather than picking a single estimate.

This section runs the full Bayesian fit on §1’s data and shows what we get. Three claims to defend, in order:

  1. The Bayesian fit recovers REML. The variance-component posteriors concentrate near the REML estimates, the fixed-effects posteriors concentrate near the GLS estimates, and the random-effects posterior means concentrate near the BLUPs. When the data is informative enough — and J=6J = 6 classrooms with N=60N = 60 observations is plenty — the two views agree numerically.
  2. The Bayesian fit gives properly propagated uncertainty. REML’s plug-in step ignores variance-component uncertainty when reporting fixed-effects standard errors and BLUP intervals. The Bayesian fit propagates that uncertainty through to every downstream quantity. Whether the difference matters depends on how concentrated the variance-component posterior is — for our data it’s small, but the experiment will show it.
  3. Centered-vs-non-centered shows up here, in its native habitat. Probabilistic programming §5 used eight schools with known σj\sigma_j as the iconic centered-vs-non-centered example. The funnel pathology those parameterizations diagnose is the characteristic geometry of hierarchical models with small group counts — which is exactly the regime mixed-effects practitioners run into routinely. Our J=6J = 6 classrooms are smaller than eight schools’ J=8J = 8, the variance components are unknown, and the funnel will be present. Same lesson, different data, in the regime where it actually applies.

5.1 The Bayesian LMM specification

Take the §1 model and put priors on everything.

αN(50,202)βN(0,102)τHalfNormal(10)σHalfNormal(10)ajτN(0,τ2),j=1,,6yijα,β,aj,σN(α+βxij+aj,σ2).(5.1)\begin{aligned} \alpha &\sim \mathcal{N}(50,\, 20^2) \\ \beta &\sim \mathcal{N}(0,\, 10^2) \\ \tau &\sim \mathrm{HalfNormal}(10) \\ \sigma &\sim \mathrm{HalfNormal}(10) \\ a_j \mid \tau &\sim \mathcal{N}(0,\, \tau^2), \qquad j = 1, \ldots, 6 \\ y_{ij} \mid \alpha, \beta, a_j, \sigma &\sim \mathcal{N}(\alpha + \beta\, x_{ij} + a_j,\, \sigma^2). \quad\quad (5.1) \end{aligned}

Two notes on the prior choices. The fixed-effects priors αN(50,400)\alpha \sim \mathcal{N}(50, 400) and βN(0,100)\beta \sim \mathcal{N}(0, 100) are intentionally weak — they cover the plausible range of exam-score intercepts and prep-hours slopes by orders of magnitude, with no real prior commitment. They behave essentially like a flat prior on the support that matters; the data dominates them. The variance-component priors τ,σHalfNormal(10)\tau, \sigma \sim \mathrm{HalfNormal}(10) are what Gelman et al. (2008) call weakly informative: they’re not flat (a flat prior on a positive-real parameter is improper and creates pathologies near zero), but they put modest probability on values up to 30\sim 30, which comfortably accommodates the truth (τ,σ)=(5,8)(\tau, \sigma) = (5, 8) without imposing much.

The choice of HalfNormal over the historically common InverseGamma(0.001, 0.001) is deliberate. The Gelman 2006 prior-sensitivity paper showed that the inverse-gamma prior on variance components creates an artifact: for small data the posterior pulls τ\tau away from zero in proportion to the prior’s tail behavior, not the data. HalfNormal (or HalfCauchy / HalfStudentT, equivalent in this regime) shrinks toward zero gracefully when the data is uninformative and gets out of the way when the data is informative. It’s the modern default and what Stan’s manual recommends.
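The contrast in prior mass near zero is visible directly from scipy’s distribution objects. A small sketch, where the τ ≤ 1 threshold is an arbitrary stand-in for “small between-group variation”:

```python
from scipy import stats

tau_prior = stats.halfnorm(scale=10.0)               # HalfNormal(10) on tau
tau2_prior = stats.invgamma(a=0.001, scale=0.001)    # InvGamma(0.001, 0.001) on tau^2

# Mass each prior places on genuinely small between-group variation, tau <= 1
p_halfnormal = tau_prior.cdf(1.0)     # P(tau <= 1)
p_invgamma = tau2_prior.cdf(1.0)      # P(tau^2 <= 1), i.e. P(tau <= 1)
print(p_halfnormal, p_invgamma)

# HalfNormal keeps honest mass near zero; InvGamma(eps, eps) starves the
# neighborhood of tau = 0, the artifact the Gelman 2006 paper documents
assert p_halfnormal > p_invgamma
```

When the data cannot rule out small τ, that starved neighborhood is exactly where the inverse-gamma prior distorts the posterior.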

5.2 The funnel reappears

Run the centered model with NUTS. PyMC’s defaults — 1000 warmup, 1000 draws, 4 chains, target_accept=0.9 — should be plenty for a model this small. The diagnostics will report some number of divergent transitions — leapfrog trajectories where the symplectic integrator’s energy error exceeded NUTS’s tolerance. The count will be small but nonzero, R^\hat R for τ\tau will sit around 1.01–1.02 (acceptable but not great), and the effective sample size for τ\tau will be a fraction of the nominal 4000 draws. This is the same pathology probabilistic programming §5 identified on eight schools: the joint distribution of (aj,logτ)(a_j, \log\tau) has a funnel — wide at the top (large τ\tau, the aja_j have lots of room) and tight at the bottom (small τ\tau, the aja_j are squeezed near zero). NUTS’s leapfrog integrator picks one step size ϵ\epsilon during warmup and reuses it everywhere; the step size that’s right for the wide region is too large for the narrow region, and divergences cluster at the funnel’s neck.

Why does our problem funnel? Because J=6J = 6 is small. The eight-schools case was the iconic funnel because J=8J = 8 schools and the variance components were small relative to the group SEs. We have J=6J = 6 classrooms and modest within-classroom data, so the variance components are estimable but not pinned down — the posterior on τ\tau has nontrivial mass near small values, and the funnel is exactly what you get when p(ajτ)p(a_j \mid \tau) is conditionally tight at small τ\tau.

The pedagogical point is that the funnel is not a sampler bug. It’s a geometric mismatch between the model’s natural parameterization and the geometry HMC samples on, and the divergence count is the diagnostic signal. Increasing target_accept to 0.99 reduces the divergence count by shrinking the leapfrog step size, but slows the sampler dramatically — the symptom is suppressed without addressing the cause. Reparameterizing is the structural fix.

5.3 The non-centered rewrite

The non-centered parameterization introduces standard-Normal latent variables zjz_j and defines the random effects deterministically:

zjN(0,1),aj    τzj,j=1,,J.(5.2)z_j \sim \mathcal{N}(0, 1), \qquad a_j \;\equiv\; \tau \cdot z_j, \qquad j = 1, \ldots, J. \quad\quad (5.2)

Probabilistically, integrating zjz_j out of aj=τzja_j = \tau z_j gives ajτN(0,τ2)a_j \mid \tau \sim \mathcal{N}(0, \tau^2), exactly the centered prior. The two parameterizations encode the same generative model. Geometrically, NUTS samples on (α,β,τ,σ,z1,,zJ)(\alpha, \beta, \tau, \sigma, z_1, \ldots, z_J) rather than (α,β,τ,σ,a1,,aJ)(\alpha, \beta, \tau, \sigma, a_1, \ldots, a_J), and the joint distribution of (zj,logτ)(z_j, \log\tau) is rectangular under the prior — independent components, no funnel — while the joint of (aj,logτ)(a_j, \log\tau) is the funnel we just diagnosed. Same posterior, different geometry to traverse.
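The geometric claim can be checked without running a sampler at all, by drawing from the two priors and conditioning on τ. A minimal sketch using the HalfNormal(10) scale from (5.1):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
tau = np.abs(rng.normal(0.0, 10.0, n))    # HalfNormal(10) prior draws of tau
z = rng.normal(0.0, 1.0, n)               # standard-Normal latents
a_centered = tau * z                      # centered: a | tau ~ N(0, tau^2)

# Split the draws at the median tau: the conditional spread of the centered
# a collapses as tau shrinks (the funnel); z's spread never moves (rectangle)
small, large = tau < np.median(tau), tau >= np.median(tau)
print("centered a std:", a_centered[small].std(), "vs", a_centered[large].std())
print("non-centered z std:", z[small].std(), "vs", z[large].std())
```

The centered spread differs by a factor of several between the two halves, while z’s is flat at 1 regardless of τ; that τ-dependent spread is exactly what forces NUTS to compromise on a single step size.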

Run the non-centered model with the same NUTS settings. The expected diagnostics: zero divergences (or very nearly), R^1.00\hat R \approx 1.00 across the board, full effective sample size for τ\tau. The fix is mechanical, costs three lines, and is the standard Bayesian-hierarchical trick — you’ll see it in any modern Stan or PyMC tutorial.

5.4 The numerical convergence

Three claims from the head of this section, now made concrete on §1’s data.

Claim 1: variance-component posteriors track REML. The non-centered NUTS fit produces 4000 posterior draws of (τ,σ)(\tau, \sigma). Plot the marginals as densities, overlay vertical reference lines at the REML estimates from §4 and the truth. On this realization, the posterior mean of τ2\tau^2 lands in the same ballpark as τ^REML2\hat\tau^2_{\mathrm{REML}} (within roughly 20% — the looser end is expected because τ2\tau^2 is the harder variance component to identify with only J=6J = 6 groups), and the posterior mean of σ2\sigma^2 is much closer (within a couple of percent of σ^REML2\hat\sigma^2_{\mathrm{REML}}). Both posteriors concentrate near the REML estimates rather than near the truth, which is what “the data is informative enough” looks like for J=6,N=60J = 6, N = 60 — the REML estimate is the right comparison point because the empirical-Bayes-vs-Bayesian comparison should agree wherever the variance-component posterior is concentrated, regardless of where that posterior sits relative to the data-generating truth.

Claim 2: fixed-effects posteriors track GLS, and BLUPs track posterior means. Fixed-effects posterior means α^Bayes,β^Bayes\hat\alpha_{\mathrm{Bayes}}, \hat\beta_{\mathrm{Bayes}} should match the GLS estimates from §2 and §4 to within Monte Carlo noise. Random-effects posterior means E[ajy]\mathbb{E}[a_j \mid \mathbf{y}] should match the BLUPs u^j\hat u_j from §3 to within Monte Carlo noise. The two views agree on point estimates.

Claim 3: Bayesian intervals are slightly wider. Where REML’s Wald intervals on β\boldsymbol{\beta} assume the variance components are known at their REML estimates, the Bayesian fit propagates the variance-component posterior through to the fixed-effects posterior. The Bayesian interval widths are inflated by a small factor — typically 5–15% for this regime — relative to REML’s. The factor scales with how much the variance-component posterior is non-degenerate; for studies with J=3J = 3 instead of J=6J = 6 the inflation could be 30% or more, which is the regime where empirical-Bayes plug-in starts to seriously underestimate uncertainty.

The four-panel Figure 5 makes all three claims visually direct.

5.5 What this comparison teaches

Three takeaways, each worth being explicit about.

First, frequentist and Bayesian mixed-effects fits are not different methods. They fit the same model. Both compute the conditional posterior of (β,u)(\boldsymbol{\beta}, \mathbf{u}) given (G,R)(\mathbf{G}, \mathbf{R}) and the data — Henderson’s MMEs are the conditional posterior mode. They differ only in how they handle (G,R)(\mathbf{G}, \mathbf{R}): REML point-estimates them by maximizing a marginal likelihood, the Bayesian fit puts a prior on them and integrates. When the data is informative, the variance-component posterior is concentrated near the REML estimates and the two procedures give numerically equivalent answers. When it isn’t, they diverge in predictable ways and the Bayesian fit is the more defensible one.

Second, the centered-vs-non-centered story is a fact about HMC, not about mixed-effects. The funnel is a feature of any hierarchical model whose variance components are unknown and whose group counts are small enough to make the variance components only weakly identified. Mixed-effects models hit this regime constantly — most applied studies have JJ in the dozens or fewer, and most have variance-component posteriors with nontrivial mass near zero. The non-centered rewrite is the standard fix and it’s a three-line code change. The next time you see a Bayesian hierarchical model with a divergent transition warning, this is the pattern to recognize.

Third, PPL idiom hides exactly nothing about what the model is doing. The MMEs we derived in §3 are not the algorithm PyMC runs — PyMC runs NUTS on the joint log-density, the MMEs are not in the picture — but the model PyMC is fitting is the model whose conditional posterior mode the MMEs solve. The frequentist–Bayesian bridge is structural: same model, same data, same conditional structure, two different ways to deal with the variance components. Once you see this, the choice of which to use becomes a question about the data regime and the use case, not a question about which philosophical tribe you belong to.

Each row is one classroom; the orange interval is the REML BLUP 95% Wald CI from statsmodels (frequentist plug-in path); the blue interval is the Bayesian 95% credible interval from a non-centered PyMC fit at the chosen τ prior. The orange reference is fixed — REML ignores the Bayesian prior. The blue intervals shift with the dropdown: at scale = 0.5 the prior shrinks every classroom toward 0 regardless of group data; at scale = 10 (the topic's default) Bayesian and REML agree numerically; at scale = 50 they remain in agreement because the data has dominated the prior. The §5.5 claim "Bayesian and REML are not different methods, just different ways of handling the variance components" becomes the visible behavior of the blue intervals collapsing into the orange ones as the prior relaxes.
Four-panel 2×2 comparison figure. Panel (a) — variance-component posteriors: τ posterior in blue and σ posterior in green as kernel densities, with vertical reference lines at the truth (black dashed), REML estimates (orange solid), and posterior mean (color-matched solid); REML and posterior-mean references nearly coincide. Panel (b) — fixed-effects forest plot for (α, β): Bayesian 95% CI in blue, REML 95% Wald in orange, true value as a black ×; the Bayesian intervals are slightly wider. Panel (c) — funnel comparison: side-by-side scatter of centered fit (a₁, log τ) in transparent blue with red × marks for divergent transitions (funnel shape), and non-centered fit (z₁, log τ) in transparent green showing a rectangular blob with no divergences. Panel (d) — BLUPs vs Bayesian posterior means: six rows sorted by n_j, each with Bayesian 95% CI in blue (above) and BLUP interval in orange (below), with markers at point estimates and a black × at the true a_j; the two intervals are nearly coincident.
Figure 5. The frequentist–Bayesian bridge on the six-classroom data. (a) Variance-component posteriors track REML. (b) Bayesian fixed-effects intervals are slightly wider than REML's Wald intervals. (c) Centered (a₁, log τ) shows the funnel and divergent transitions; non-centered (z₁, log τ) is rectangular. (d) BLUPs are the random-effects posterior means modulo variance-component handling.
# PyMC: centered + non-centered on the same data
import pymc as pm

with pm.Model() as centered:
    alpha = pm.Normal("alpha", mu=50.0, sigma=20.0)
    beta = pm.Normal("beta", mu=0.0, sigma=10.0)
    tau = pm.HalfNormal("tau", sigma=10.0)
    sigma = pm.HalfNormal("sigma", sigma=10.0)
    a = pm.Normal("a", mu=0.0, sigma=tau, shape=J)               # centered
    pm.Normal("y", mu=alpha + beta * x + a[classroom], sigma=sigma, observed=y)
    idata_c = pm.sample(1000, tune=1000, chains=4, target_accept=0.9)

with pm.Model() as non_centered:
    alpha = pm.Normal("alpha", mu=50.0, sigma=20.0)
    beta = pm.Normal("beta", mu=0.0, sigma=10.0)
    tau = pm.HalfNormal("tau", sigma=10.0)
    sigma = pm.HalfNormal("sigma", sigma=10.0)
    z = pm.Normal("z", mu=0.0, sigma=1.0, shape=J)               # standardized
    a = pm.Deterministic("a", tau * z)                           # non-centered
    pm.Normal("y", mu=alpha + beta * x + a[classroom], sigma=sigma, observed=y)
    idata_nc = pm.sample(1000, tune=1000, chains=4, target_accept=0.9)

The centered fit produces ~30–60 divergent transitions out of 4000 draws on this realization, with R^(τ)1.02\hat R(\tau) \approx 1.02 and visibly funnel-shaped joint draws of (a1,logτ)(a_1, \log\tau). The non-centered fit produces zero or very nearly zero divergences (single digits at most — what the §5.3 prose calls “very nearly”), R^1.00\hat R \approx 1.00 across the board, and full effective sample size for τ\tau. The non-centered posterior means recover α^Bayes50.99\hat\alpha_{\mathrm{Bayes}} \approx 50.99, β^Bayes4.86\hat\beta_{\mathrm{Bayes}} \approx 4.86 (matching GLS / REML to within Monte Carlo noise), τ^Bayes23.9\hat\tau_{\mathrm{Bayes}}^2 \approx 3.9 and σ^Bayes257.2\hat\sigma_{\mathrm{Bayes}}^2 \approx 57.2 (matching REML to a few percent), and per-classroom posterior means that track the BLUPs from §3 to within sampling noise. Three independently-derived procedures, one numerically consistent answer.

6. Connections, limits, and references

Mixed-effects models sit at one of those rare junctions where four independent threads of statistical theory converge on the same algebra: the frequentist GLS-with-known-covariance estimator from §2, Henderson’s penalized-likelihood derivation from §3, the REML / Bayesian-marginalization story from §4, and the full Bayesian fit from §5. The four routes give numerically equivalent answers when the data is informative enough — which is to say, in the regime where mixed-effects models are most often appropriate. That convergence is the structural reason the topic has been a workhorse of applied statistics for fifty years.

The remainder of §6 takes stock: where mixed-effects shine, where they break down, and what neighboring topics connect.

6.1 Where mixed-effects shine

Five regimes where the linear mixed model is the right tool:

  • Clustered cross-sectional data. The §1 six-classroom example is the canonical case: observations cluster naturally into groups, and the grouping captures variation we want to model rather than ignore. Students nest in classrooms, classrooms in schools, patients in clinics, customers in stores. Whenever the natural unit of analysis is the individual but the natural unit of variation is the group, mixed-effects models are the principled response.
  • Longitudinal / repeated-measures data. Each subject contributes a sequence of measurements over time. The within-subject correlation is non-negotiable — ignoring it gives anti-conservative standard errors — and the between-subject heterogeneity in baseline level (random intercept) and time trend (random slope) is the substantive question. The sleepstudy dataset from lme4 is the canonical pedagogical example.
  • Multilevel / hierarchical data. Three or four levels of nesting (students within classrooms within schools within districts), each level carrying its own random effects, makes \mathbf{Z} nontrivial but leaves the algebra unchanged. Crossed random effects (an item-response model with student-level and item-level random effects, neither nested in the other) extend the same framework with a slightly different \mathbf{Z}.
  • Recommender systems with user / item random effects. The ML-native regime. A rating y_{ui} = \mu + a_u + b_i + \varepsilon_{ui} with a_u \sim \mathcal{N}(0, \tau_u^2) and b_i \sim \mathcal{N}(0, \tau_i^2) is a crossed random-effects model whose BLUPs are exactly the regularized user / item baseline biases of a matrix-factorization recommender. Koren, Bell, and Volinsky 2009 makes the connection explicit.
  • A/B testing and meta-analysis. When the same experiment is run across many sites, the treatment effect is naturally modeled as varying by site, with a population-level mean (fixed effect) and site-level deviations (random effects). Bayesian meta-analysis is essentially this model under a different name; Rubin’s eight-schools is its iconic instance.
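To make the recommender regime concrete, here is a small self-contained sketch (synthetic data; the names, sizes, and known variance components are all illustrative assumptions) that solves for the user and item BLUPs by block Gauss–Seidel on Henderson's equations. Each coordinate update is a residual sum shrunk by the ridge count n + \sigma^2/\tau^2, the crossed-effects version of the \lambda_j formula from §3:

```python
import numpy as np

rng = np.random.default_rng(1)
n_users, n_items, n_obs = 50, 40, 2000

# Simulate ratings r_ui = mu + a_u + b_i + noise.
a_true = rng.normal(0.0, 0.5, n_users)   # user biases, tau_u = 0.5
b_true = rng.normal(0.0, 0.3, n_items)   # item biases, tau_i = 0.3
u = rng.integers(0, n_users, n_obs)
i = rng.integers(0, n_items, n_obs)
r = 3.5 + a_true[u] + b_true[i] + rng.normal(0.0, 1.0, n_obs)

# Variance components treated as known for this sketch.
sigma2, tau_u2, tau_i2 = 1.0, 0.25, 0.09
mu = r.mean()
a_hat, b_hat = np.zeros(n_users), np.zeros(n_items)

# Block Gauss-Seidel on the mixed-model equations: each BLUP update is
# the sum of that unit's residuals divided by n + sigma^2 / tau^2.
for _ in range(25):
    for uu in range(n_users):
        m = u == uu
        a_hat[uu] = (r[m] - mu - b_hat[i[m]]).sum() / (m.sum() + sigma2 / tau_u2)
    for ii in range(n_items):
        m = i == ii
        b_hat[ii] = (r[m] - mu - a_hat[u[m]]).sum() / (m.sum() + sigma2 / tau_i2)
```

The same update rule appears as the regularized baseline predictor in Koren, Bell, and Volinsky 2009, with the regularizer \lambda playing the role of \sigma^2/\tau^2.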

6.2 Where mixed-effects break down

Five structural failure modes:

  • Non-Gaussian likelihoods → GLMMs. When \mathbf{y} is binary, count, ordinal, censored, or otherwise non-Gaussian, the linear mixed model is replaced by the generalized linear mixed model. The marginal likelihood (4.2) generalizes, but the integral over the random effects no longer has a closed form. Standard responses: PQL (Breslow & Clayton 1993) for fast approximate inference; the Laplace approximation (Wolfinger & O’Connell 1993, the default in lme4::glmer); fully Bayesian inference via a PyMC GLMM with NUTS.
  • Very high-dimensional random effects → factor models or GPs. When q is in the thousands, the MMEs become a memory and computational problem even with sparsity exploits. Latent-factor models replace \mathbf{u} \sim \mathcal{N}(\mathbf{0}, \mathbf{G}) with \mathbf{u} = \mathbf{B}\mathbf{f} + \boldsymbol{\eta}. Gaussian processes apply when the random-effects index has a meaningful metric: the random-effects covariance becomes a kernel.
  • Crossed effects with massive cardinality → variational methods. Industrial recommender systems have J_{\mathrm{users}} in the hundreds of millions. Variational inference gives a tractable approximation; mean-field VI is the natural starting point, and structured VI that respects the random-effects structure (Hoffman et al. 2013) is state of the art.
  • Unidentified or near-degenerate variance components. When J is very small (J = 2 or J = 3), variance components are weakly identified, and REML can return \hat\tau^2 = 0 as a boundary optimum. Bayesian inference handles this gracefully (a wide variance-component posterior and appropriately conservative fixed-effects intervals), but at J = 2 even Bayesian inference identifies \tau^2 almost entirely from the prior. The honest response is either to collect more groups or to use fixed effects.
  • Misspecified random-effects distributions. The Gaussian assumption \mathbf{u} \sim \mathcal{N}(\mathbf{0}, \mathbf{G}) is a modeling commitment. When the random effects are heavy-tailed, the Gaussian prior shrinks large effects too aggressively. Replace it with Student-t random effects (a one-line PyMC change), or reach for Bayesian nonparametric priors; Dirichlet-process random effects are the canonical option.

6.3 Connections to neighboring topics

Within T5 (Bayesian & Probabilistic ML):

  • Variational Inference — §6.2’s crossed-effects-at-scale response. The mean-field and structured VI machinery from VI §3–4 applies directly to mixed-effects with very large \mathbf{u}.
  • Gaussian Processes — sibling. The marginal-likelihood-based hyperparameter learning from GP §5 is the structural analog of REML, and both topics close with a Bayesian hyperprior alternative. When the random-effects index has a meaningful metric, the LMM becomes a GP.
  • Probabilistic Programming — direct upstream. PP §5’s eight-schools fit IS a hierarchical mixed-effects model; this topic developed the partial-pooling theory PP took as given. The centered-vs-non-centered story from PP §5 reappeared in §5.3 in its native habitat.
  • Sparse Bayesian Priors (coming soon) — for variable selection on the fixed effects \boldsymbol{\beta} in models with many predictors.
  • Bayesian Neural Networks (coming soon) — when the fixed-effects predictor \mathbf{X}\boldsymbol{\beta} is replaced by a neural network and the random effects encode group-level deviations.


References & Further Reading