advanced learning-theory 70 min read

Generalized Method of Moments

Hansen's framework for over-identified estimation — efficient weighting, the J-statistic, and the bridge from ML to double machine learning

§1 — Introduction and motivation

Eighty-nine years separate Karl Pearson’s introduction of the method of moments in 1894 from Lars Peter Hansen’s generalization in 1982. In that span the field of statistics underwent the likelihood revolution, the rise of the Neyman–Pearson testing framework, and the bootstrap; econometrics absorbed all three and grew its own preoccupations on top. By the early 1980s rational-expectations models routinely produced more moment restrictions on a parameter vector than the parameter vector had components, and the just-identified inversion Pearson had relied on could no longer carry the load. Hansen’s resolution — minimize a weighted quadratic form in the sample-moment vector — is what we now call generalized method of moments (GMM). It earned Hansen a share of the 2013 Nobel Prize and has been the workhorse estimator of modern econometrics ever since.

We develop GMM from the just-identified Pearson setup forward, building the asymptotic theory, deriving the efficient weighting matrix and the Hansen J-statistic in full, and connecting GMM to maximum likelihood, instrumental variables, and the modern double machine learning of Chernozhukov et al. (2018). The reader should have seen Semiparametric Inference (the efficient-weighting matrix realizes the semiparametric bound), Convex Analysis (the GMM criterion is convex-quadratic in the moment residuals), and Concentration Inequalities (uniform laws of large numbers are the workhorse step in the consistency proof) before this topic.

1.1 From Pearson to Hansen — a century of moment matching

Pearson’s (1894) “Contributions to the mathematical theory of evolution” introduced the method of moments as a procedure for fitting a mixture of two Gaussians to a dataset of crab carapace measurements. The recipe was simple: compute the first kk sample moments, set them equal to the first kk population moments expressed as functions of θ\theta, and solve the resulting system for θ\theta. With kk equations in kk unknowns the problem was usually well-posed; Pearson did the arithmetic by hand for k=5k = 5 in the crab study, an order of difficulty that limited the method’s reach.

Then Fisher’s likelihood program subsumed nearly everything. From the 1920s through the 1960s the method of moments survived in textbooks as a pedagogical introduction to estimation, but in research it was sidelined as a “rough” alternative to maximum likelihood — asymptotically efficient only by accident, and noticeably inefficient in standard parametric families.

The setting that brought moment-based estimation back was rational-expectations macroeconomics. Hansen and Singleton’s “Generalized instrumental variables estimation of nonlinear rational expectations models” (1982, Econometrica) modeled a consumer’s Euler equation as

E ⁣[(δRt+1β1)Ft]=0,\mathbb{E}\!\left[\bigl(\delta\, R_{t+1}^{\,\beta} - 1\bigr) \,\big|\, \mathcal{F}_t\right] = 0,

where δ\delta is the discount factor, β\beta the coefficient of relative risk aversion, Rt+1R_{t+1} the gross asset return, and Ft\mathcal{F}_t the agent’s information set at time tt. The conditional moment condition becomes a family of unconditional moment conditions — one for every Ft\mathcal{F}_t-measurable instrument ZtZ_t we choose — and as soon as we pick more than two instruments we have more equations than the two parameters (δ,β)(\delta, \beta) to identify. The companion paper, Hansen’s “Large sample properties of generalized method of moments estimators” (1982, Econometrica), gave the framework that made this over-identified system into a well-defined estimator. Hansen shared the 2013 Nobel Prize for the asset-pricing applications of this machinery.

1.2 The over-identified problem — what L > k moment conditions break

Suppose we observe an i.i.d. sample X1,,XnX_1, \dots, X_n from a distribution P0P_0 indexed by an unknown parameter θ0ΘRk\theta_0 \in \Theta \subseteq \mathbb{R}^k. Economic theory — or a structural model, or a domain assumption — supplies a vector-valued moment function

g ⁣:X×ΘRL,g \colon \mathcal{X} \times \Theta \to \mathbb{R}^L,

such that the population moment condition

EP0 ⁣[g(X,θ0)]  =  0\mathbb{E}_{P_0}\!\left[g(X, \theta_0)\right] \;=\; 0

holds at the true parameter and at no other parameter in Θ\Theta. The integer LL is the number of moment conditions; kk is the number of unknown parameters. Three regimes:

  • Under-identified (L<kL < k): fewer equations than unknowns. Even in expectation the moment conditions do not pin θ0\theta_0 down — the parameter is not identifiable from these moments alone, and no amount of data can rescue us.
  • Just-identified (L=kL = k): as many equations as unknowns. The Pearson case. The system E[g(X,θ)]=0\mathbb{E}[g(X,\theta)] = 0 generically has a unique solution θ0\theta_0, and the sample-moment system gˉn(θ)=n1i=1ng(Xi,θ)=0\bar g_n(\theta) = n^{-1}\sum_{i=1}^n g(X_i, \theta) = 0 generically has a unique solution θ^n\hat\theta_n that we can compute by inverting gg.
  • Over-identified (L>kL > k): more equations than unknowns. In expectation the system is consistent — θ0\theta_0 zeroes all LL moment conditions simultaneously by hypothesis — but in any finite sample gˉn(θ)=0\bar g_n(\theta) = 0 has no solution at all. We have a redundancy of information that no choice of θ\theta can absorb exactly.

The over-identified regime is the interesting one and the one Hansen’s framework targets. Why might we want more moment conditions than parameters? Because more conditions, even though they cannot all hold exactly in-sample, contain more information about θ0\theta_0 than any subset of kk of them — and we want an estimator that uses all LL of them simultaneously rather than picking a subset and throwing the rest away.

Concrete prototype: linear instrumental variables. We posit a structural equation

Yi  =  Xiθ0+εi,E[Ziεi]=0,Y_i \;=\; X_i^\top \theta_0 + \varepsilon_i, \qquad \mathbb{E}[Z_i \varepsilon_i] = 0,

where XiRkX_i \in \mathbb{R}^k is endogenous (correlated with εi\varepsilon_i) and ZiRLZ_i \in \mathbb{R}^L is a vector of LkL \ge k instruments assumed orthogonal to the structural error. The moment function is g(Xi,Yi,Zi;θ)=Zi(YiXiθ)g(X_i, Y_i, Z_i; \theta) = Z_i (Y_i - X_i^\top \theta). With L=kL = k exactly, the unique solution to gˉn(θ)=0\bar g_n(\theta) = 0 is the textbook IV estimator θ^=(ZX)1ZY\hat\theta = (Z^\top X)^{-1} Z^\top Y. With L>kL > k — more instruments than endogenous regressors — there is no θ\theta that orthogonalizes the residual against every column of ZZ, and we need a principled rule for combining the LL moment conditions. That rule is GMM.

1.3 The GMM idea in one paragraph

The natural object to minimize is the magnitude of the sample-moment vector gˉn(θ)\bar g_n(\theta). But magnitude relative to what metric? Hansen’s framework parameterizes this freedom. Pick any positive-definite L×LL \times L matrix WW — this is the weighting matrix — and define the GMM criterion

Jn(θ,W)  =  ngˉn(θ)Wgˉn(θ).J_n(\theta, W) \;=\; n \cdot \bar g_n(\theta)^\top \, W \, \bar g_n(\theta).

The GMM estimator corresponding to WW is the minimizer

θ^W  =  argminθΘJn(θ,W).\hat\theta_W \;=\; \arg\min_{\theta \in \Theta} J_n(\theta, W).

Three facts make this work as a foundational estimation framework. First, any positive-definite WW delivers a consistent estimator: as nn \to \infty, θ^Wpθ0\hat\theta_W \to_p \theta_0 regardless of the choice of WW. Second, different choices of WW deliver different asymptotic variances, all of the same “sandwich” form, partially ordered by the Loewner ordering on positive-semidefinite matrices — and one specific choice uniquely minimizes the asymptotic variance: the inverse of the moment-residual variance W=Ω1W^\star = \Omega^{-1}, where Ω=Var(ngˉn(θ0))\Omega = \mathrm{Var}\bigl(\sqrt{n}\,\bar g_n(\theta_0)\bigr). The resulting minimum-variance V=(GΩ1G)1V^\star = (G^\top \Omega^{-1} G)^{-1} matches the semiparametric efficiency bound for the moment-condition model and gives Hansen’s framework its claim to optimality. Third, the value of the criterion at its efficient-weight minimum, J^=Jn(θ^W,Ω^1)\hat J = J_n(\hat\theta_{W^\star}, \hat\Omega^{-1}), has an asymptotic χLk2\chi^2_{L-k} distribution under correct specification — a free over-identification test that emerges as a by-product of the same quadratic form the estimator minimizes. Sections §4 through §8 develop these three facts in full.

1.4 Where GMM sits in the T6 track

GMM is the over-identified extension of formalStatistics: method-of-moments . That topic covers the just-identified Pearson case in detail; this topic picks up where Pearson stops, generalizing to L>kL > k and developing the asymptotic machinery the over-identified case requires.

Three formalML prerequisites carry direct weight. Concentration Inequalities supplies the uniform laws of large numbers that make the consistency proof go through: we need gˉn(θ)E[g(X,θ)]\bar g_n(\theta) \to \mathbb{E}[g(X,\theta)] uniformly over θΘ\theta \in \Theta, not just pointwise, and the empirical-process bounds from that topic are how we get there. Convex Analysis gives us the geometry of the criterion: Jn(θ,W)J_n(\theta, W) is a positive-semidefinite quadratic form in gˉn(θ)\bar g_n(\theta), which means the criterion is convex in θ\theta whenever g(X,θ)g(X, \theta) is affine in θ\theta (the linear case) and the first-order conditions are well-behaved more generally. Semiparametric Inference supplies the efficiency bound: the variance V=(GΩ1G)1V^\star = (G^\top \Omega^{-1} G)^{-1} that efficient GMM achieves is the semiparametric efficiency bound for the moment-condition model, derivable from the efficient influence function of θ0\theta_0 on the tangent space orthogonal to the moment-condition score.

Two formalML topics pick up where GMM leaves off. Causal Inference Methods (coming soon) extends to doubly-robust estimation, where the augmented inverse-probability-weighted estimator is GMM with a specific Neyman-orthogonal moment condition. The same topic covers double machine learning (Chernozhukov et al. 2018), where GMM-style score equations on first-stage nuisance functions produce second-stage point estimators whose asymptotic distribution is unaffected by the convergence rate of the machine-learned first-stage estimators.


§2 — Classical method of moments: the just-identified case

2.1 Sample moments and the moment equations

Given an i.i.d. sample X1,,XnX_1, \dots, X_n from P0P_0 indexed by θ0ΘRk\theta_0 \in \Theta \subseteq \mathbb{R}^k, Pearson’s construction picks kk scalar moment functions g1,,gkg_1, \dots, g_k — a vector-valued g ⁣:X×ΘRkg \colon \mathcal{X} \times \Theta \to \mathbb{R}^k — satisfying the population moment condition EP0[g(X,θ0)]=0\mathbb{E}_{P_0}[g(X, \theta_0)] = 0 at the true parameter, and at no other parameter in Θ\Theta. The method-of-moments estimator θ^n\hat\theta_n is the solution to the sample-moment equations

gˉn(θ^n)  =  1ni=1ng(Xi,θ^n)  =  0.\bar g_n(\hat\theta_n) \;=\; \frac{1}{n} \sum_{i=1}^n g(X_i, \hat\theta_n) \;=\; 0.

Two practical observations frame everything that follows. First, the choice of moment functions is a design choice: different functions give different estimators with different asymptotic variances, and the estimator is only as efficient as the moments are informative about θ0\theta_0. Second, in general the system gˉn(θ^n)=0\bar g_n(\hat\theta_n) = 0 has no closed-form solution, and we solve it numerically via Newton–Raphson or scipy.optimize.fsolve. The canonical illustrations below — Gaussian on the first two moments and Gamma on the first two moments — happen to admit clean closed forms; most nontrivial cases do not.

2.2 Worked examples — Gaussian and Gamma

Gaussian on μ,σ2\mu, \sigma^2. For XN(μ,σ2)X \sim \mathcal{N}(\mu, \sigma^2) with θ=(μ,σ2)\theta = (\mu, \sigma^2), the first two population moments are E[X]=μ\mathbb{E}[X] = \mu and E[X2]=μ2+σ2\mathbb{E}[X^2] = \mu^2 + \sigma^2. The MoM sample-moment equations are Xˉ=μ^\bar X = \hat\mu and X2=μ^2+σ^2\overline{X^2} = \hat\mu^2 + \hat\sigma^2, solved in closed form by μ^=Xˉ\hat\mu = \bar X and σ^2=X2Xˉ2=Sn2\hat\sigma^2 = \overline{X^2} - \bar X^2 = S_n^2 (the sample variance with divisor nn).

For the Gaussian, this MoM estimator coincides exactly with maximum likelihood: the sufficient statistics for (μ,σ2)(\mu, \sigma^2) are precisely the first two sample moments, so the score equations θlogp(X;θ)=0\nabla_\theta \log p(X; \theta) = 0 reduce to the same system the MoM solves. We will see in §11 that every MLE is a just-identified GMM with the score equations as moment conditions; for the Gaussian, the first-two-moments MoM is also the score-equation MoM, and the two coincide.

Gamma on α,β\alpha, \beta. For XGamma(α,β)X \sim \mathrm{Gamma}(\alpha, \beta) with density p(x;α,β)=βαxα1eβx/Γ(α)p(x; \alpha, \beta) = \beta^\alpha x^{\alpha-1} e^{-\beta x} / \Gamma(\alpha) on x>0x > 0, the mean is μ=α/β\mu = \alpha / \beta and the variance is σ2=α/β2\sigma^2 = \alpha / \beta^2. Equating to the sample mean and sample variance,

α^β^  =  Xˉ,α^β^2  =  Sn2,\frac{\hat\alpha}{\hat\beta} \;=\; \bar X, \qquad \frac{\hat\alpha}{\hat\beta^2} \;=\; S_n^2,

and solving gives the closed form

β^  =  XˉSn2,α^  =  Xˉ2Sn2.\hat\beta \;=\; \frac{\bar X}{S_n^2}, \qquad \hat\alpha \;=\; \frac{\bar X^2}{S_n^2}.

The MLE for the Gamma is not this estimator. The Gamma’s sufficient statistics are Xi\sum X_i and logXi\sum \log X_i, not Xi\sum X_i and Xi2\sum X_i^2, and the MLE α^MLE\hat\alpha_{\mathrm{MLE}} solves the digamma equation

logαψ(α)  =  logXˉlogX,\log \alpha - \psi(\alpha) \;=\; \log \bar X - \overline{\log X},

where ψ=Γ/Γ\psi = \Gamma'/\Gamma is the digamma function. This has no closed form; we solve it numerically. The MoM estimator on the Gamma is therefore consistent but not efficient — and the gap to MLE is large enough to see clearly at moderate sample sizes, as the visualization below demonstrates.

Var(MoM) / Var(MLE) = 1.66× — Gamma MoM is inefficient relative to MLE.
Sampling distributions of Gamma MoM and MLE estimators of α across 1000 Monte Carlo replicates at n=200
Figure 2.1 — Gamma sampling distributions of α̂. MoM is centered on the truth but spread wider than MLE. The relative-efficiency gap is set by the ratio of the asymptotic variances, det(V_MoM)/det(V_MLE), which exceeds 1 for every α > 0.

2.3 Asymptotic normality via the delta method

Theorem 2.1 (Just-identified MoM asymptotic normality).

Suppose θ^npθ0\hat\theta_n \to_p \theta_0, g(x,θ)g(x, \theta) is twice continuously differentiable in θ\theta in an open neighborhood of θ0\theta_0, the Jacobian

G  =  G(θ0)  =  E ⁣[g(X,θ0)θ]G \;=\; G(\theta_0) \;=\; \mathbb{E}\!\left[\frac{\partial g(X, \theta_0)}{\partial \theta^\top}\right]

is non-singular, and Ω=E ⁣[g(X,θ0)g(X,θ0)]\Omega = \mathbb{E}\!\left[g(X, \theta_0)\, g(X, \theta_0)^\top\right] is finite. Then

n(θ^nθ0)  d  N ⁣(0,G1ΩG).\sqrt{n}\,(\hat\theta_n - \theta_0) \;\to_d\; \mathcal{N}\!\left(0,\, G^{-1} \Omega \, G^{-\top}\right).
Proof.

The estimator solves gˉn(θ^n)=0\bar g_n(\hat\theta_n) = 0. By the consistency assumption, θ^n\hat\theta_n lies in the smooth neighborhood for nn large enough. Taylor-expand each component gˉn,j\bar g_{n,j} around θ0\theta_0 to first order, with the second-order remainder absorbed by a mean-value point θn,j\theta_{n,j}^\star on the segment between θ0\theta_0 and θ^n\hat\theta_n:

0  =  gˉn(θ^n)  =  gˉn(θ0)  +  Gn(θn)(θ^nθ0),0 \;=\; \bar g_n(\hat\theta_n) \;=\; \bar g_n(\theta_0) \;+\; G_n(\theta_n^\star)\,(\hat\theta_n - \theta_0),

where Gn(θ)=gˉn(θ)/θG_n(\theta) = \partial \bar g_n(\theta) / \partial \theta^\top is the sample-Jacobian. The mean-value point varies by component, but the rest of the argument needs only Gn(θn)pGG_n(\theta_n^\star) \to_p G, which follows from uniform LLN on g/θ\partial g / \partial \theta over the neighborhood (a workhorse application of Concentration Inequalities) together with θnpθ0\theta_n^\star \to_p \theta_0.

Rearrange and multiply by n\sqrt{n}:

n(θ^nθ0)  =  Gn(θn)1ngˉn(θ0).\sqrt{n}\,(\hat\theta_n - \theta_0) \;=\; -\,G_n(\theta_n^\star)^{-1} \cdot \sqrt{n}\, \bar g_n(\theta_0).

Two convergence facts complete the proof. By the central limit theorem applied to the i.i.d. observations g(Xi,θ0)g(X_i, \theta_0) with mean 00 and covariance Ω\Omega,

ngˉn(θ0)  =  1ni=1ng(Xi,θ0)  d  N(0,Ω).\sqrt{n}\, \bar g_n(\theta_0) \;=\; \frac{1}{\sqrt{n}} \sum_{i=1}^n g(X_i, \theta_0) \;\to_d\; \mathcal{N}(0, \Omega).

By the continuous mapping theorem applied to matrix inversion (continuous at non-singular matrices), Gn(θn)1pG1G_n(\theta_n^\star)^{-1} \to_p G^{-1}. Slutsky’s theorem combines these:

n(θ^nθ0)  d  G1N(0,Ω)  =  N ⁣(0,G1ΩG).\sqrt{n}\,(\hat\theta_n - \theta_0) \;\to_d\; -\,G^{-1} \cdot \mathcal{N}(0, \Omega) \;=\; \mathcal{N}\!\left(0, \, G^{-1} \Omega \, G^{-\top}\right).

This is the just-identified case of the sandwich variance formula §5 will develop in full generality. With L=kL = k, the Jacobian GG is square and invertible by identification, so the “sandwich” collapses to the triple product G1ΩGG^{-1} \Omega \, G^{-\top}. The weighting matrix WW does not appear because L=kL = k leaves no choice of how to combine moment conditions — the just-identified system has a unique solution.

A useful corollary: when the moment functions are the score, g(X,θ)=θlogp(X;θ)g(X, \theta) = \nabla_\theta \log p(X; \theta), the asymptotic variance reduces to the inverse Fisher information I1\mathcal{I}^{-1} — the Cramér–Rao lower bound. To see this, note that under regularity G=E[2logp/θθ]=IG = \mathbb{E}[\partial^2 \log p / \partial \theta \partial \theta^\top] = -\mathcal{I} (the negative Hessian of the log-likelihood in expectation), and Ω=E[logplogp]=I\Omega = \mathbb{E}[\nabla \log p \cdot \nabla \log p^\top] = \mathcal{I} (the information matrix equality). So G1ΩG=(I)1I(I)=I1G^{-1} \Omega \, G^{-\top} = (-\mathcal{I})^{-1} \mathcal{I} (-\mathcal{I})^{-\top} = \mathcal{I}^{-1}. Choosing the score as moment function gives MLE efficiency; any other choice gives a consistent estimator with a (weakly) larger asymptotic variance in the Loewner ordering. This is why MoM on the Gaussian (where the first-two-moments are sufficient) matches MLE, and MoM on the Gamma (where the first-two-moments are not sufficient) does not.

2.4 Why we need GMM — over-identification kills direct inversion

What happens if we add a third moment condition to the Gaussian setup? Take

g(X,μ,σ2)  =  (Xμ(Xμ)2σ2(Xμ)3(Xμ)43σ4),L=4,  k=2.g(X, \mu, \sigma^2) \;=\; \begin{pmatrix} X - \mu \\ (X - \mu)^2 - \sigma^2 \\ (X - \mu)^3 \\ (X - \mu)^4 - 3 \sigma^4 \end{pmatrix}, \qquad L = 4, \; k = 2.

The third and fourth components exploit the Gaussian’s symmetry and the closed form for its fourth central moment (E[(Xμ)4]=3σ4\mathbb{E}[(X-\mu)^4] = 3\sigma^4). In population all four moment conditions hold at (μ0,σ02)(\mu_0, \sigma_0^2). In any finite sample, however, the sample third central moment fluctuates around zero at rate n1/2n^{-1/2} and the sample fourth central moment fluctuates around 3σ^n43 \hat\sigma_n^4 at the same rate. So the system gˉn(θ)=0\bar g_n(\theta) = 0 has no solution at finite nn: the first two equations pin down (μ^,σ^2)=(Xˉ,Sn2)(\hat\mu, \hat\sigma^2) = (\bar X, S_n^2), and the third and fourth equations evaluated at that solution leave residuals of order Op(n1/2)O_p(n^{-1/2}).

Four moment-residual curves as functions of candidate μ, showing that no single μ zeros all four conditions simultaneously
Figure 2.2 — Over-identification visualized. The four moment-residual curves intersect zero at different values of the candidate parameter, so no single μ zeros all four conditions simultaneously. As n grows the residual curves shrink toward zero at rate n^{-1/2} but never coincide.

Three responses to this redundancy. We can drop two moment conditions and revert to a just-identified system — but which two, and why throw away information theory has supplied? We can use the first two to estimate θ\theta and use the residuals in the third and fourth as a test of correct specification — a quick precursor to the Hansen J-statistic of §8. Or we can combine all four moment conditions through a positive-definite weighting matrix that absorbs their relative noise levels and trades off between them. This is GMM, and §3 develops it.


§3 — The GMM framework: moment conditions and weighted quadratic forms

This is the central definitional section. We formalize the over-identified moment-condition model, define the GMM criterion Jn(θ,W)J_n(\theta, W) as a weighted quadratic form in the sample-moment vector, derive the first-order conditions that the GMM estimator solves, and lay out the rank condition that makes the whole construction work. Hansen’s (1982) original paper develops all four pieces in roughly five pages of dense notation; we expand them with concrete geometry and a worked running example.

3.1 Moment conditions and population identification

The GMM data-generating model has three pieces.

  1. Sample. An i.i.d. sample X1,,XnX_1, \dots, X_n from a distribution P0P_0 on a measurable space (X,B)(\mathcal{X}, \mathcal{B}).

  2. Parameter. An unknown θ0ΘRk\theta_0 \in \Theta \subseteq \mathbb{R}^k, with Θ\Theta an open subset of Euclidean space. (Compactness is sometimes imposed for the consistency proof in §4; we relax it where the geometry allows.)

  3. Moment function. A known map g ⁣:X×ΘRLg \colon \mathcal{X} \times \Theta \to \mathbb{R}^L, measurable in its first argument and continuously differentiable in its second, satisfying the population moment condition

    EP0[g(X,θ0)]  =  0.\mathbb{E}_{P_0}[g(X, \theta_0)] \;=\; 0.

The integer LL is the number of moment conditions; kk is the number of unknown parameters. We assume LkL \ge k throughout. When L=kL = k we recover the just-identified Pearson setup of §2; the over-identified case L>kL > k is the one Hansen 1982 targets and the one this topic is about.

The moment condition is an identifying restriction: we further assume the population moment m(θ):=EP0[g(X,θ)]m(\theta) := \mathbb{E}_{P_0}[g(X, \theta)] satisfies m(θ)=0m(\theta) = 0 if and only if θ=θ0\theta = \theta_0. Without this global identification hypothesis the population objective the GMM estimator targets has multiple zeros and consistency fails outright.

Running example. Throughout this section we use a synthetic problem that strips IV and endogeneity away and isolates the over-identified structure. We observe four scalar measurements Xi=(Xi,1,Xi,2,Xi,3,Xi,4)R4X_i = (X_{i,1}, X_{i,2}, X_{i,3}, X_{i,4})^\top \in \mathbb{R}^4 at each ii, and on theoretical grounds we know the population means satisfy

E[X1]=θ01,E[X2]=θ02,E[X3]=θ01+θ02,E[X4]=2θ01θ02.\mathbb{E}[X_1] = \theta_{01}, \quad \mathbb{E}[X_2] = \theta_{02}, \quad \mathbb{E}[X_3] = \theta_{01} + \theta_{02}, \quad \mathbb{E}[X_4] = 2\theta_{01} - \theta_{02}.

With four measurements and two unknown parameters (k=2k = 2, L=4L = 4), the system is over-identified. The moment function is

g(X,θ)  =  XAθ,A  =  (10011121)R4×2.g(X, \theta) \;=\; X - A\theta, \qquad A \;=\; \begin{pmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 1 \\ 2 & -1 \end{pmatrix} \in \mathbb{R}^{4 \times 2}.

This is generalized-least-squares-via-moments — the simplest nontrivial over-identified GMM problem. The linear setup makes everything in §3 concrete and the criterion surface in §3.2 literally elliptical. §9 will replace the synthetic design AA with a structural instrumental variables setup.

3.2 The GMM criterion function

The sample moment vector is the empirical analog of m(θ)m(\theta),

gˉn(θ)  :=  1ni=1ng(Xi,θ)    RL.\bar g_n(\theta) \;:=\; \frac{1}{n} \sum_{i=1}^n g(X_i, \theta) \;\in\; \mathbb{R}^L.

By LLN, gˉn(θ)pm(θ)\bar g_n(\theta) \to_p m(\theta) pointwise (we will need uniform convergence for the consistency proof in §4, but pointwise convergence suffices to define the estimator). For the running example, gˉn(θ)=XˉAθ\bar g_n(\theta) = \bar X - A\theta where Xˉ=(1/n)iXiR4\bar X = (1/n)\sum_i X_i \in \mathbb{R}^4.

The GMM criterion function measures the magnitude of gˉn(θ)\bar g_n(\theta) relative to a positive-definite weighting matrix WW:

Jn(θ,W)  :=  ngˉn(θ)Wgˉn(θ)()J_n(\theta, W) \;:=\; n \cdot \bar g_n(\theta)^\top W \, \bar g_n(\theta) \qquad\qquad (\star)

where WRL×LW \in \mathbb{R}^{L \times L} is symmetric and positive-definite. The factor of nn standardizes the criterion to an Op(1)O_p(1) scale at θ0\theta_0: without it, Jn(θ0,W)=Op(n1)0J_n(\theta_0, W) = O_p(n^{-1}) \to 0 and the asymptotic theory would be awkward. With the nn factor, Jn(θ0,W)J_n(\theta_0, W) converges in distribution (§8 gives the χLk2\chi^2_{L-k} limit at the efficient weighting, and the analogous quadratic-form-in-Gaussians limit in general).

The criterion is non-negative, and equals zero only when gˉn(θ)=0\bar g_n(\theta) = 0. In the over-identified case (L>kL > k), Jn(θ,W)>0J_n(\theta, W) > 0 for all θΘ\theta \in \Theta at finite nn — the system has no solution that zeroes all LL residuals simultaneously — so the minimization

θ^W  :=  argminθΘJn(θ,W)\hat\theta_W \;:=\; \arg\min_{\theta \in \Theta} J_n(\theta, W)

is a genuine minimization rather than root-finding. The minimum value J^:=Jn(θ^W,W)\hat J := J_n(\hat\theta_W, W) is the residual GMM criterion — the unavoidable mismatch the over-identified system leaves behind, which §8 will repurpose as a specification test.

When gg is affine in θ\theta, as in the running example, Jn(θ,W)J_n(\theta, W) is exactly quadratic in θ\theta:

Jn(θ,W)  =  n(XˉAθ)W(XˉAθ).J_n(\theta, W) \;=\; n \,(\bar X - A\theta)^\top W \,(\bar X - A\theta).

Its contours in θ\theta-space are concentric ellipses centered at θ^W\hat\theta_W, with shape controlled by AWAA^\top W A and eccentricity reflecting how the four moment conditions trade off against each other under the metric WW.

θ̂_W = (0.950, 0.968)  ·  J_min = 4.07  ·  ‖θ̂_W − θ₀‖ = 0.059
Contour plot of J_n(θ, I) over a 2D grid in (θ_1, θ_2) for the running-example design
Figure 3.1 — GMM criterion surface for the running example under identity weighting. The contours are concentric ellipses; the estimator θ̂_W is the unique minimum, marked by the dot. The cross marks the population truth θ₀.
Three side-by-side criterion surfaces under identity, suboptimal-diagonal, and efficient weighting matrices
Figure 3.2 — How the weighting matrix shapes the criterion surface. Identity (left) treats every moment equally; diagonal (center) downweights noisier moments; efficient W* = Ω⁻¹ (right) accounts for cross-moment correlations and yields the tightest confidence ellipse. The estimate θ̂_W shifts modestly across panels; the *uncertainty* shrinks substantially.

For general nonlinear gg, Jn(θ,W)J_n(\theta, W) is convex in a neighborhood of θ0\theta_0 — its Hessian at θ0\theta_0 is 2G0WG02\,G_0^\top W G_0, which is positive-definite under the rank condition in §3.4 — but may be non-convex globally. Practical optimization uses Newton–Raphson, quasi-Newton, or scipy.optimize.minimize; §13 covers the tips and traps.

3.3 First-order conditions — the GMM normal equations

Differentiate JnJ_n with respect to θ\theta:

Jn(θ,W)θ  =  2nGn(θ)Wgˉn(θ),\frac{\partial J_n(\theta, W)}{\partial \theta} \;=\; 2n \cdot G_n(\theta)^\top W \, \bar g_n(\theta),

where Gn(θ):=gˉn(θ)/θRL×kG_n(\theta) := \partial \bar g_n(\theta) / \partial \theta^\top \in \mathbb{R}^{L \times k} is the sample Jacobian. Setting this to zero gives the GMM normal equations:

Gn(θ^W)Wgˉn(θ^W)  =  0.G_n(\hat\theta_W)^\top \, W \, \bar g_n(\hat\theta_W) \;=\; 0.

This is the over-identified analog of the just-identified system gˉn(θ^n)=0\bar g_n(\hat\theta_n) = 0 from §2. In the just-identified case (L=kL = k), GnG_n is k×kk \times k and generically invertible, so GnWgˉn=0G_n^\top W \bar g_n = 0 if and only if gˉn=0\bar g_n = 0 — the weighting matrix drops out and we recover the unique just-identified solution. In the over-identified case (L>kL > k), the system reduces from LL equations to kk equations through left-multiplication by GnWG_n^\top W, projecting the LL-dimensional moment residual onto the kk-dimensional column space of W1/2GnW^{1/2} G_n. The geometric reading is set the projection of the moment residual onto the parameter-direction subspace (in the metric WW) to zero — we cannot zero the full LL-dimensional residual at any θ\theta, but we can zero its projection along the kk directions that matter for θ\theta.

For affine gg, the GMM normal equations have a closed-form solution. The running example gives Gn=AG_n = -A (independent of θ\theta), so

AWAθ^W  =  AWXˉ,θ^W  =  (AWA)1AWXˉ.A^\top W A \, \hat\theta_W \;=\; A^\top W \, \bar X, \qquad \hat\theta_W \;=\; (A^\top W A)^{-1} \, A^\top W \, \bar X.

This is generalized least squares, with the role of the GLS weighting matrix played by the GMM weighting matrix WW. For W=IW = I we get ordinary least squares of Xˉ\bar X on AθA\theta; for W=diag(1/Var(Xj))W = \mathrm{diag}(1/\mathrm{Var}(X_j)) we get weighted least squares with variance-inverse weights; for the efficient choice W=Ω1W^\star = \Omega^{-1} where Ω\Omega is the full L×LL \times L moment-covariance matrix (off-diagonal terms included), we get the Aitken / Gauss–Markov estimator that achieves the smallest asymptotic variance. §6 derives this efficient choice.

3.4 Identification rank conditions

The GMM estimator is well-defined and the asymptotic theory works only if the model identifies θ0\theta_0. Two layers.

Global identification. The population moment m(θ)=E[g(X,θ)]m(\theta) = \mathbb{E}[g(X, \theta)] has a unique zero at θ0\theta_0:

m(θ)  =  0        θ  =  θ0.m(\theta) \;=\; 0 \;\iff\; \theta \;=\; \theta_0.

This is a hypothesis on the model — we assume it. For affine gg of the form g(X,θ)=XAθg(X, \theta) = X - A\theta, global identification reduces to AA having full column rank kk: equivalently AAA^\top A being invertible. Geometrically, the columns of AA must span a kk-dimensional subspace of RL\mathbb{R}^L, so no two parameter vectors map to the same population mean vector.

Local identification — the rank condition. The Jacobian at the truth,

G0  :=  G(θ0)  =  E ⁣[g(X,θ0)θ]    RL×k,G_0 \;:=\; G(\theta_0) \;=\; \mathbb{E}\!\left[\frac{\partial g(X, \theta_0)}{\partial \theta^\top}\right] \;\in\; \mathbb{R}^{L \times k},

has full column rank kk:

rank(G0)  =  k.\mathrm{rank}(G_0) \;=\; k.

This is the rank condition of GMM, and it carries three consequences we will rely on throughout the rest of the topic.

  1. The Hessian of the population criterion Q0(θ)=m(θ)Wm(θ)Q_0(\theta) = m(\theta)^\top W m(\theta) at θ0\theta_0 is 2G0WG02\,G_0^\top W G_0, positive-definite under the rank condition — so θ0\theta_0 is a strict local minimum of the population objective.
  2. The sandwich variance VW=(G0WG0)1G0WΩWG0(G0WG0)1V_W = (G_0^\top W G_0)^{-1} G_0^\top W \Omega W G_0 (G_0^\top W G_0)^{-1} that §5 derives is well-defined (the “bread” G0WG0G_0^\top W G_0 is invertible).
  3. The efficient-weighting variance V=(G0Ω1G0)1V^\star = (G_0^\top \Omega^{-1} G_0)^{-1} that §6 derives is well-defined.

For the running example, G0=AG_0 = -A has rank 2 iff the columns of AA are linearly independent. With the given AA the first two rows already form a 2×22 \times 2 identity submatrix, so the rank condition holds. The visualization below lets you watch the criterion surface degenerate as the design matrix is artificially collapsed toward rank deficiency.

condition number κ(AᵀA) = 2.34e+0  ·  effective rank = 2

What can go wrong? If the four moment conditions were E[Xj]=αjθ01+βjθ02\mathbb{E}[X_j] = \alpha_j \theta_{01} + \beta_j \theta_{02} with (αj,βj)=(1,0.5),(2,1),(3,1.5),(4,2)(\alpha_j, \beta_j) = (1, 0.5), (2, 1), (3, 1.5), (4, 2) for all jj, every row of AA would be a scalar multiple of (1,0.5)(1, 0.5)^\topAA has rank 1, θ02\theta_{02} is not identified by these moments (only the linear combination θ01+0.5θ02\theta_{01} + 0.5\,\theta_{02} is), and no amount of data rescues the estimator. The model is under-identified despite having L=4>k=2L = 4 > k = 2 moment conditions. Over-identification counts equations, not information; the rank condition counts information.

When the rank condition holds exactly but G0G_0 is nearly rank-deficient — the columns of G0G_0 are nearly collinear in the metric W1/2W^{1/2} — we have weakly-identified parameters (in the IV setting, weak instruments). The asymptotic theory still goes through but the asymptotic-normal approximation deteriorates at finite samples, and standard errors can be wildly understated. §9 covers the diagnostics for this case (the Staiger–Stock F-statistic for linear IV, the Anderson–Rubin test as a weak-instrument-robust alternative).


§4 — Consistency of GMM estimators

We turn to the first asymptotic result: as nn \to \infty, the GMM estimator θ^Wn\hat\theta_{W_n} converges in probability to the true parameter θ0\theta_0. The result is almost free in the sense that it requires no particular choice of weighting matrix — any positive-definite limiting WW delivers consistency — and depends only on global identification plus a uniform law of large numbers on the sample-moment vector. What does require care is the bridge from pointwise convergence of gˉn(θ)\bar g_n(\theta) at each θ\theta to uniform convergence over Θ\Theta, which is what the consistency proof actually needs. The technical workhorse is the empirical-process machinery developed in Concentration Inequalities.

4.1 Uniform laws of large numbers

The ordinary LLN gives us gˉn(θ)pm(θ)\bar g_n(\theta) \to_p m(\theta) for each fixed θ\theta — pointwise convergence. But the GMM estimator is the argmin of a function of θ\theta, and pointwise convergence of the integrand is not enough to conclude that the argmin converges. A small pathological dip of gˉn\bar g_n far from θ0\theta_0 can fool the optimizer even when each gˉn(θ)\bar g_n(\theta) is close to m(θ)m(\theta) in expectation; what we need is that no such dip happens anywhere in Θ\Theta.

A function class F={g(,θ):θΘ}\mathcal{F} = \{g(\cdot, \theta) : \theta \in \Theta\} obeys the uniform law of large numbers if

supθΘgˉn(θ)m(θ)  p  0as n.\sup_{\theta \in \Theta} \|\bar g_n(\theta) - m(\theta)\| \;\to_p\; 0 \quad \text{as } n \to \infty.

Two standard conditions, both inherited from Concentration Inequalities, deliver this. The bracketing-entropy condition asks that Θ\Theta be compact, g(X,)g(X, \cdot) be continuous in θ\theta for P0P_0-a.e. XX, and a dominating function d(X)d(X) exist with supθΘg(X,θ)d(X)\sup_{\theta \in \Theta} \|g(X, \theta)\| \le d(X) and E[d(X)]<\mathbb{E}[d(X)] < \infty. This is the textbook Newey–McFadden setup (1994, Lemma 2.4); it gives op(1)o_p(1) convergence with no explicit rate. The Rademacher complexity condition asks that the class F\mathcal{F} have Rademacher complexity Rn(F)\mathfrak{R}_n(\mathcal{F}) that vanishes with nn. This gives an explicit rate supθgˉnm=Op(Rn(F))\sup_\theta \|\bar g_n - m\| = O_p(\mathfrak{R}_n(\mathcal{F})) — at the parametric Op(n1/2)O_p(n^{-1/2}) when F\mathcal{F} is VC or has bounded entropy.

For the linear running example g(X,θ)=XAθg(X, \theta) = X - A\theta, the function class is affine in kk parameters, has VC-dimension O(k)O(k), and the uniform LLN holds at the optimal parametric rate Op(k/n)O_p(\sqrt{k/n}). The connection to Concentration Inequalities is direct: the uniform LLN follows from concentration of the empirical-process supremum supθgˉn(θ)m(θ)\sup_\theta |\bar g_n(\theta) - m(\theta)|, controlled by Talagrand’s inequality, the bounded-differences inequality, or Massart’s symmetrization argument applied to the function class F\mathcal{F}.

4.2 The population objective and global identification

Define the population criterion as the limit of Jn/nJ_n / n:

Q0(θ,W)  :=  m(θ)Wm(θ).Q_0(\theta, W) \;:=\; m(\theta)^\top W m(\theta).

Two properties drive consistency. First, Q0(θ,W)0Q_0(\theta, W) \ge 0 for all θ\theta, with equality iff m(θ)=0m(\theta) = 0; this follows from positive-definiteness of WW. Second, under the global-identification hypothesis m(θ)=0    θ=θ0m(\theta) = 0 \iff \theta = \theta_0, the population criterion has a unique minimizer at θ0\theta_0, where Q0(θ0,W)=0Q_0(\theta_0, W) = 0.

Combining: θ0=argminθΘQ0(θ,W)\theta_0 = \arg\min_{\theta \in \Theta} Q_0(\theta, W) uniquely. Consistency reduces to verifying that the sample argmin θ^W\hat\theta_W converges to the population argmin θ0\theta_0 — a question for the argmax theorem (the consistency lemma for extremum estimators). The sample criterion is the empirical version of Q0Q_0, scaled by nn: Qn(θ,Wn):=Jn(θ,Wn)/n=gˉn(θ)Wngˉn(θ)Q_n(\theta, W_n) := J_n(\theta, W_n)/n = \bar g_n(\theta)^\top W_n \bar g_n(\theta). The factor of nn does not move the argmin.

4.3 The consistency theorem

Theorem 4.1 (Consistency of GMM).

Assume (i) i.i.d. sample; (ii) compact parameter space Θθ0\Theta \ni \theta_0; (iii) global identification (m(θ)=0    θ=θ0m(\theta) = 0 \iff \theta = \theta_0); (iv) continuity of g(X,)g(X, \cdot) on Θ\Theta for P0P_0-a.e. XX; (v) dominating function E ⁣[supθΘg(X,θ)]<\mathbb{E}\!\left[\sup_{\theta \in \Theta} \|g(X, \theta)\|\right] < \infty; (vi) weighting WnpWW_n \to_p W with WW symmetric positive-definite deterministic. Then

θ^Wn  :=  argminθΘJn(θ,Wn)  p  θ0.\hat\theta_{W_n} \;:=\; \arg\min_{\theta \in \Theta} J_n(\theta, W_n) \;\to_p\; \theta_0.
Proof.

Four steps.

Step 1: uniform LLN. By (i), (iv), (v), the class F={g(,θ):θΘ}\mathcal{F} = \{g(\cdot, \theta) : \theta \in \Theta\} satisfies the bracketing-entropy condition and the uniform LLN holds: supθΘgˉn(θ)m(θ)p0\sup_{\theta \in \Theta} \|\bar g_n(\theta) - m(\theta)\| \to_p 0.

Step 2: uniform convergence of the criterion. We show supθQn(θ,Wn)Q0(θ,W)p0\sup_\theta |Q_n(\theta, W_n) - Q_0(\theta, W)| \to_p 0. Using the symmetric-WW identity aWabWb=(ab)W(a+b)a^\top W a - b^\top W b = (a - b)^\top W (a + b),

gˉnWngˉnmWm  =  gˉn(WnW)gˉn  +  (gˉnm)W(gˉn+m).\bar g_n^\top W_n \bar g_n - m^\top W m \;=\; \bar g_n^\top (W_n - W) \bar g_n \;+\; (\bar g_n - m)^\top W (\bar g_n + m).

Take operator-norm bounds, sup over θ\theta, and apply Slutsky with Step 1’s uniform LLN and assumption (vi). Result: supθQn(θ,Wn)Q0(θ,W)p0\sup_\theta |Q_n(\theta, W_n) - Q_0(\theta, W)| \to_p 0.

Step 3: identification. By (iii) and positive-definiteness of WW, Q0(θ,W)λmin(W)m(θ)20Q_0(\theta, W) \ge \lambda_{\min}(W) \|m(\theta)\|^2 \ge 0 with equality iff θ=θ0\theta = \theta_0. So θ0\theta_0 is the unique minimizer of Q0Q_0 on Θ\Theta.

Step 4: argmin convergence. Fix ε>0\varepsilon > 0 and let Bεc=ΘB(θ0,ε)B_\varepsilon^c = \Theta \setminus B(\theta_0, \varepsilon). By compactness of Θ\Theta and continuity of Q0Q_0, δ:=infθBεcQ0(θ,W)>0\delta := \inf_{\theta \in B_\varepsilon^c} Q_0(\theta, W) > 0. On the failure event {θ^WnBεc}\{\hat\theta_{W_n} \in B_\varepsilon^c\}, the argmin definition and Step 2 give Q0(θ^Wn,W)2supθQnQ0Q_0(\hat\theta_{W_n}, W) \le 2 \sup_\theta |Q_n - Q_0|. So δ2supθQnQ0\delta \le 2 \sup_\theta |Q_n - Q_0|, which holds with probability going to zero by Step 2. P(θ^Wnθ0>ε)0P(\|\hat\theta_{W_n} - \theta_0\| > \varepsilon) \to 0.

(Originally Hansen 1982; the textbook treatment with these regularity conditions follows Newey & McFadden 1994 §2.) The proof carries a useful geometric reading. The sample criterion tracks the population criterion uniformly; the population criterion has a strict global minimum at θ0\theta_0 separated by a positive gap from any point outside the ε\varepsilon-neighborhood; therefore the sample minimizer must eventually fall inside the ε\varepsilon-neighborhood. Compactness of Θ\Theta guarantees the positive gap; the uniform LLN guarantees the sample criterion tracks the population criterion everywhere on Θ\Theta, not just near the truth.

RMSE ratio at n = 800: identity / efficient = 1.16×
Log-log RMSE convergence for identity and efficient weighting on the running example, with sampling-distribution clouds at three sample sizes
Figure 4.1 — Consistency at the parametric rate. Left: log-log RMSE follows the n^{-1/2} reference dashed line for both identity and efficient weighting; the efficient line sits below. Right: sampling-distribution clouds shrink toward θ₀ as n grows from 25 → 200 → 1000.

Two consequences worth noting before §5. Robustness to the weighting matrix. Theorem 4.1 imposes no special structure on WW beyond positive-definiteness — any positive-definite limiting weighting matrix gives a consistent estimator. The first-step identity-weight GMM and the second-step efficient Ω^1\hat\Omega^{-1}-weighted GMM are both consistent. The proof template generalizes. The four-step proof — uniform LLN on the integrand, uniform convergence of the criterion, identification of the population minimizer, argmin convergence — is the template for proving consistency of every extremum estimator in statistics: MLE, M-estimators, GEL, sieve estimators, neural-network ERM. GMM is the prototype.

4.4 What can go wrong

Four failure modes, each illustrating which assumption above is doing the work.

Global identification failure. If the population moment m(θ)=0m(\theta) = 0 has multiple solutions, θ0\theta_0 is not the unique minimizer of Q0Q_0 — assumption (iii) breaks. The sample criterion has multiple local minima of comparable depth, and the argmin is not a well-defined limit object. Mitigation: re-parametrize to break the symmetry, or use a sign-restricted optimizer.

Rank deficiency. If G0G_0 has rank <k< k, the population criterion is flat along the null direction of G0WG0G_0^\top W G_0 near θ0\theta_0. The sample criterion inherits a near-flat ridge, the argmin is non-unique up to motion along the ridge, and the asymptotic variance in §5 blows up.

Weak identification. G0G_0 has full column rank but is nearly rank-deficient. Consistency still holds in principle (Theorem 4.1 applies — the rank condition is local-identification, not consistency), but the rate of convergence is governed by λmin(G0WG0)1/2\lambda_{\min}(G_0^\top W G_0)^{-1/2}. At finite samples, the consistency-implied Gaussian approximation breaks down — the sampling distribution has heavy tails and CI coverage drops. §9.4 develops the linear-IV special case.

Dependent data. For non-i.i.d. data — time series, clustered samples — the LLN still applies under standard mixing or ergodicity conditions, but the moment-covariance Ω\Omega in §5 must account for autocorrelation (HAC / Newey–West estimators replace the simple sample covariance). Consistency direction (Theorem 4.1) carries over with minimal modification; the asymptotic variance does not.


§5 — Asymptotic normality of GMM estimators

§4 established that θ^Wpθ0\hat\theta_W \to_p \theta_0. We now refine that result with the rate and limit distribution: n(θ^Wθ0)dN(0,VW)\sqrt{n}(\hat\theta_W - \theta_0) \to_d \mathcal{N}(0, V_W), where VWV_W is the sandwich variance that gives Hansen’s framework its characteristic asymptotic structure. Three ingredients — the Jacobian G0G_0, the weighting matrix WW, and the moment-covariance Ω\Omega — enter the formula, and reading their interactions sets up the efficient-weighting choice of §6 and the J-statistic of §8.

5.1 The sandwich variance formula

Theorem 5.1 (Asymptotic normality of GMM).

In addition to the consistency conditions of Theorem 4.1, assume (vii) θ0\theta_0 lies in the interior of Θ\Theta; (viii) E[g(X,θ0)2]<\mathbb{E}[\|g(X, \theta_0)\|^2] < \infty and Ω:=E[g(X,θ0)g(X,θ0)]\Omega := \mathbb{E}[g(X, \theta_0)\, g(X, \theta_0)^\top] is positive-definite; (ix) g(X,)g(X, \cdot) is continuously differentiable on a neighborhood Nθ0\mathcal{N} \ni \theta_0 for P0P_0-a.e. XX, with a Jacobian uniform LLN and dominating function; (x) G0:=E[g(X,θ0)/θ]G_0 := \mathbb{E}[\partial g(X, \theta_0) / \partial \theta^\top] has full column rank kk. Then

n(θ^Wθ0)  d  N(0,VW),VW  =  (G0WG0)1G0WΩWG0(G0WG0)1.\sqrt{n}\,(\hat\theta_W - \theta_0) \;\to_d\; \mathcal{N}(0, V_W), \qquad V_W \;=\; (G_0^\top W G_0)^{-1} \, G_0^\top W \Omega W G_0 \, (G_0^\top W G_0)^{-1}.

The factor G0WG0G_0^\top W G_0 is the bread; the factor G0WΩWG0G_0^\top W \Omega W G_0 is the meat. The bread appears twice (once on each side of the meat) because the GMM normal equations are obtained by left-multiplying the moment residual by GnWG_n^\top W, so the inverse Hessian of the criterion enters the variance once for each application.

Two special cases collapse the sandwich. In the just-identified case (L=kL = k), G0G_0 is k×kk \times k and invertible by (x). Algebraic cancellation gives VW=G01ΩG0V_W = G_0^{-1}\, \Omega \, G_0^{-\top} — the weighting matrix drops out, recovering the §2.3 just-identified-MoM variance. Under efficient weighting (W=Ω1W = \Omega^{-1}), the sandwich collapses to VΩ1=(G0Ω1G0)1V_{\Omega^{-1}} = (G_0^\top \Omega^{-1} G_0)^{-1}. §6 will show this is the Loewner-minimum of VWV_W over all positive-definite WW — Hansen’s efficiency bound.

5.2 Proof via Taylor expansion of the first-order conditions

Proof.

The GMM estimator satisfies the first-order conditions Gn(θ^W)Wngˉn(θ^W)=0G_n(\hat\theta_W)^\top \, W_n \, \bar g_n(\hat\theta_W) = 0. By consistency, θ^W\hat\theta_W lies in the differentiability neighborhood N\mathcal{N} eventually. Five steps complete the argument.

Step 1 — mean-value expansion. By the mean-value theorem applied component-wise to gˉn\bar g_n,

gˉn(θ^W)  =  gˉn(θ0)  +  Gn(θˉn)(θ^Wθ0),\bar g_n(\hat\theta_W) \;=\; \bar g_n(\theta_0) \;+\; G_n(\bar\theta_n)\,(\hat\theta_W - \theta_0),

where θˉn\bar\theta_n lies on the segment from θ0\theta_0 to θ^W\hat\theta_W.

Step 2 — substitute into the FOCs. This gives An(θ^Wθ0)=Gn(θ^W)Wngˉn(θ0)A_n \cdot (\hat\theta_W - \theta_0) = -G_n(\hat\theta_W)^\top W_n \bar g_n(\theta_0), where An:=Gn(θ^W)WnGn(θˉn)A_n := G_n(\hat\theta_W)^\top W_n G_n(\bar\theta_n).

Step 3 — matrix convergence. By consistency, the Jacobian uniform LLN, and continuous mapping, AnpG0WG0A_n \to_p G_0^\top W G_0. The rank condition (x) makes this invertible, so An1p(G0WG0)1A_n^{-1} \to_p (G_0^\top W G_0)^{-1}.

Step 4 — CLT on the moment vector at θ0\theta_0. ngˉn(θ0)dN(0,Ω)\sqrt{n}\, \bar g_n(\theta_0) \to_d \mathcal{N}(0, \Omega) by the classical CLT applied to i.i.d. g(Xi,θ0)g(X_i, \theta_0) with mean 0 and covariance Ω\Omega.

Step 5 — Slutsky. n(θ^Wθ0)=An1Gn(θ^W)Wnngˉn(θ0)\sqrt{n}\,(\hat\theta_W - \theta_0) = -A_n^{-1} G_n(\hat\theta_W)^\top W_n \cdot \sqrt{n}\, \bar g_n(\theta_0). The limit is a linear transformation of a Gaussian: LZLZ with L=(G0WG0)1G0WL = -(G_0^\top W G_0)^{-1} G_0^\top W and ZN(0,Ω)Z \sim \mathcal{N}(0, \Omega), so LZN(0,LΩL)=N(0,VW)LZ \sim \mathcal{N}(0, L \Omega L^\top) = \mathcal{N}(0, V_W).

(Hansen 1982; textbook treatment in Newey & McFadden 1994 §2 and Hayashi 2000 Ch. 3.) The proof is structurally identical to the just-identified case of §2.3, with one substitution: instead of inverting gˉn(θ^n)=0\bar g_n(\hat\theta_n) = 0 directly (impossible in the over-identified case), we invert the first-order conditions GnWngˉn(θ^W)=0G_n^\top W_n \bar g_n(\hat\theta_W) = 0. The mapping from ”LL moment-condition equations” to ”kk first-order-condition equations” is achieved by left-multiplying by GnWnG_n^\top W_n, projecting the LL-dimensional moment residual onto the kk-dimensional parameter-direction subspace. This projection lives in the bread of the sandwich.

5.3 Reading the sandwich

The variance VWV_W has three ingredients, each contributing a specific piece of statistical content. The Jacobian G0=E[g/θ]G_0 = \mathbb{E}[\partial g / \partial \theta^\top] measures signal strength — how strongly the population moment responds to a parameter perturbation at the truth. A large G0G_0 (well-identified parameter, strong instruments) gives small VWV_W. The covariance Ω=E[gg]\Omega = \mathbb{E}[g g^\top] is the moment-noise covariance — diagonal entries capture per-moment noise levels, off-diagonals capture cross-moment correlations. The weighting matrix WW is the only ingredient under the analyst’s control: different WW‘s produce different consistent estimators with different VWV_W‘s.

Three weighting values of practical interest: identity (W=IW = I, easy to compute, generally suboptimal); diagonal inverse-variance (W=diag(1/Var(gj))W = \mathrm{diag}(1/\mathrm{Var}(g_j)), captures heteroskedasticity but ignores correlations); efficient (W=Ω1W = \Omega^{-1}, achieves the Hansen bound). For score moments g=logpg = \nabla \log p, the efficient sandwich collapses to VΩ1=I1V_{\Omega^{-1}} = \mathcal{I}^{-1}, the Cramér–Rao bound — maximum likelihood is efficient GMM with score moments. The viz below makes the three ingredients literal: it shows how the bread / meat / sandwich operator-norms shift across weighting choices, and how the resulting 95% confidence ellipses nest.

‖V_I‖/‖V*‖ = 1.50× — efficient sandwich tightens by this factor on the largest direction.
Three 95% confidence ellipses for identity, diagonal, and efficient weightings overlaid on a θ_1, θ_2 plane
Figure 5.1 — 95% confidence ellipses for three weighting matrices on the running example. Identity (gray) and diagonal-inverse-variance (gold) ellipses contain the efficient (accent) ellipse — the Loewner-ordering inclusions are visible by eye.

In practice we don’t know G0G_0 or Ω\Omega exactly. The natural plug-in estimators are

G^n  =  1ni=1ng(Xi,θ^W)θ,Ω^n  =  1ni=1ng(Xi,θ^W)g(Xi,θ^W),\hat G_n \;=\; \frac{1}{n} \sum_{i=1}^n \frac{\partial g(X_i, \hat\theta_W)}{\partial \theta^\top}, \qquad \hat\Omega_n \;=\; \frac{1}{n} \sum_{i=1}^n g(X_i, \hat\theta_W)\, g(X_i, \hat\theta_W)^\top,

and the plug-in sandwich is V^W=(G^nWnG^n)1G^nWnΩ^nWnG^n(G^nWnG^n)1\hat V_W = (\hat G_n^\top W_n \hat G_n)^{-1}\, \hat G_n^\top W_n \hat\Omega_n W_n \hat G_n\, (\hat G_n^\top W_n \hat G_n)^{-1}. Under the regularity conditions of Theorem 5.1, V^WpVW\hat V_W \to_p V_W, so confidence intervals built from V^W\hat V_W have asymptotically correct coverage. For dependent data, Ω^n\hat\Omega_n is replaced by a HAC estimator like Newey–West.

5.4 Loewner ordering on asymptotic variances

Different choices of WW produce different asymptotic variances. We compare them via the Loewner ordering on positive-semidefinite matrices: V1V2    V2V10    uV1uuV2u  uV_1 \preceq V_2 \iff V_2 - V_1 \succeq 0 \iff u^\top V_1 u \le u^\top V_2 u \;\forall u. The Loewner ordering captures the variance comparison faithfully: V1V2V_1 \preceq V_2 means every linear combination of θ^\hat\theta has smaller asymptotic variance under W1W_1 than under W2W_2. Equivalently, every confidence ellipse from V1V_1 is contained in the corresponding ellipse from V2V_2. The ordering is partial: not every two asymptotic-variance matrices are comparable. But there is a unique smallest element.

Q-Q plot of standardized GMM estimator components against the standard normal distribution at three sample sizes
Figure 5.2 — Q-Q plot of √n(θ̂_W − θ₀) standardized by the sandwich variance, against the standard normal. The asymptotic-normal approximation tightens visibly from n = 50 to n = 500.

Question. Is there a WW that achieves VWVWV_W \preceq V_{W'} for all positive-definite WW'?

Answer (proved in §6). Yes. The efficient weighting matrix W=Ω1W^\star = \Omega^{-1} uniquely (up to positive scaling) achieves the Loewner-minimum VW=(G0Ω1G0)1    VWV_{W^\star} = (G_0^\top \Omega^{-1} G_0)^{-1} \;\preceq\; V_W for every positive-definite WW.


§6 — Efficient weighting and the Hansen bound

§5 derived the sandwich asymptotic variance VWV_W for any positive-definite WW and noted in §5.4 that the family {VW}W0\{V_W\}_{W \succ 0} has a Loewner-minimum. This section proves that the minimum is achieved uniquely (up to positive scaling) at W=Ω1W^\star = \Omega^{-1}, derives the resulting Hansen efficiency bound V=(G0Ω1G0)1V^\star = (G_0^\top \Omega^{-1} G_0)^{-1}, and connects the bound to the semiparametric efficiency machinery from the prerequisite Semiparametric Inference. The agreement between Hansen’s purely algebraic argument and the Hilbert-space tangent-space construction of the semiparametric efficiency bound is one of the most satisfying results in this part of asymptotic statistics.

6.1 Minimizing the asymptotic variance

We want to find the positive-definite weighting matrix WW that minimizes VWV_W in the Loewner order. Direct calculus on the matrix-valued objective VWV_W is awkward — the cone of positive-definite matrices has no obvious differential structure that makes “differentiating VWV_W with respect to WW” tractable. The standard reduction is to scalar sub-problems: for every direction uRk{0}u \in \mathbb{R}^k \setminus \{0\}, find the WW that minimizes the scalar variance uVWuu^\top V_W u. If a single WW^\star minimizes uVWuu^\top V_W u for every uu simultaneously, that WW^\star is the Loewner-minimum of the family.

The strategy is global rather than local: we will not differentiate at all. Instead we exhibit WW^\star directly and verify the Loewner inequality VWVWV_W \succeq V_{W^\star} via a matrix Cauchy–Schwarz argument. This is Hansen’s original 1982 proof technique, and it generalizes the classical Gauss–Markov / Aitken theorem to nonlinear moment-condition models.

6.2 The efficient weighting matrix theorem

Theorem 6.1 (Efficiency of Ω⁻¹-weighted GMM).

Under the conditions of Theorem 5.1, for every positive-definite WRL×LW \in \mathbb{R}^{L \times L},

VW    V  :=  (G0Ω1G0)1,V_W \;\succeq\; V^\star \;:=\; (G_0^\top \Omega^{-1} G_0)^{-1},

with equality if and only if W=cΩ1W = c \Omega^{-1} for some scalar c>0c > 0.

Proof.

We invert and use matrix Cauchy–Schwarz, which is the cleaner form of the algebra.

Step 1 — whitening. Let Ω=LL\Omega = L L^\top be the Cholesky decomposition. Define G~:=L1G0\tilde G := L^{-1} G_0 and W~:=LWL\tilde W := L^\top W L. Then W~\tilde W is symmetric positive-definite, G~\tilde G has full column rank kk, and

G0WG0=G~W~G~,G0WΩWG0=G~W~2G~,G0Ω1G0=G~G~.G_0^\top W G_0 = \tilde G^\top \tilde W \tilde G, \quad G_0^\top W \Omega W G_0 = \tilde G^\top \tilde W^2 \tilde G, \quad G_0^\top \Omega^{-1} G_0 = \tilde G^\top \tilde G.

So VW1=G~W~G~(G~W~2G~)1G~W~G~V_W^{-1} = \tilde G^\top \tilde W \tilde G \cdot (\tilde G^\top \tilde W^2 \tilde G)^{-1} \cdot \tilde G^\top \tilde W \tilde G and V1=G~G~V^{\star -1} = \tilde G^\top \tilde G.

Step 2 — matrix Cauchy–Schwarz. For any matrices A,BA, B with BBB^\top B invertible,

AB(BB)1BA    AA,A^\top B \, (B^\top B)^{-1} \, B^\top A \;\preceq\; A^\top A,

with equality iff col(A)col(B)\mathrm{col}(A) \subseteq \mathrm{col}(B). Proof: PB:=B(BB)1BP_B := B (B^\top B)^{-1} B^\top is the orthogonal projection onto col(B)\mathrm{col}(B), so IPBI - P_B is positive semi-definite, and AAAB(BB)1BA=A(IPB)A0A^\top A - A^\top B (B^\top B)^{-1} B^\top A = A^\top (I - P_B) A \succeq 0.

Step 3 — apply with A=G~A = \tilde G, B=W~G~B = \tilde W \tilde G. The pieces compute to VW1V1V_W^{-1} \preceq V^{\star -1}. Inverting reverses the Loewner order:

VW    V.V_W \;\succeq\; V^\star.

Step 4 — uniqueness. Equality requires col(G~)col(W~G~)\mathrm{col}(\tilde G) \subseteq \mathrm{col}(\tilde W \tilde G), which forces W~=cIL\tilde W = c I_L for some c>0c > 0 on col(G~)\mathrm{col}(\tilde G). Translating back, W=cΩ1W = c \Omega^{-1}.

(Hansen 1982; textbook treatment in Newey & McFadden 1994 §2.3.)

6.3 The Hansen efficiency bound

The single-matrix form V=(G0Ω1G0)1V^\star = (G_0^\top \Omega^{-1} G_0)^{-1} — the Hansen efficiency bound — is structurally cleaner than the sandwich VWV_W. The collapse from “bread × meat × bread” to a single matrix happens because under efficient weighting the meat equals the bread:

G0WΩWG0  =  G0Ω1ΩΩ1G0  =  G0Ω1G0  =  (V)1.G_0^\top W^\star \Omega W^\star G_0 \;=\; G_0^\top \Omega^{-1} \Omega \Omega^{-1} G_0 \;=\; G_0^\top \Omega^{-1} G_0 \;=\; (V^\star)^{-1}.

The visualization below makes this concrete. Panel A: a scatter of (logdetVW,trVW)(\log \det V_W, \mathrm{tr}\, V_W) over 200 random positive-definite weighting matrices drawn from a Wishart distribution. VV^\star sits at the southwest extreme — there is no WW that yields a smaller uncertainty volume and a smaller total uncertainty. Panel B: nested 95% confidence ellipses for the interpolation Wα=(1α)I+αΩ1W_\alpha = (1 - \alpha) I + \alpha \Omega^{-1} as α\alpha slides from 0 (identity) to 1 (efficient).

Histogram of det(V*)/det(V_W) ratios across random PD W's, with nested ellipses showing W_alpha shrinkage
Figure 6.1 — Hansen-bound visualization. Left: histogram of det(V*)/det(V_W) across 5000 Wishart-sampled W's — the ratio is always ≤ 1, confirming V* is the determinant minimum. Right: nested ellipses for W_α = (1−α) I + α Ω⁻¹ as α varies from 0 to 1; the ellipse shrinks monotonically by Theorem 6.1.

6.4 Connection to the semiparametric efficiency bound

The Hansen bound V=(G0Ω1G0)1V^\star = (G_0^\top \Omega^{-1} G_0)^{-1} admits a second derivation — entirely independent of Hansen’s algebra — from the semiparametric efficiency machinery developed in the prerequisite Semiparametric Inference. The two derivations are different proof technologies but they yield the same lower bound.

For the moment-condition model P={P:EP[g(X,θ(P))]=0}\mathcal{P} = \{P : \mathbb{E}_P[g(X, \theta(P))] = 0\}, the efficient influence function (EIF) for θ0\theta_0 at P0P_0 is (Bickel–Klaassen–Ritov–Wellner 1993, Theorem 5.1)

ϕ(X)  =  (G0Ω1G0)1G0Ω1g(X,θ0).\phi^\star(X) \;=\; -(G_0^\top \Omega^{-1} G_0)^{-1} \, G_0^\top \Omega^{-1} \, g(X, \theta_0).

Computing its variance:

Vsp  =  E[ϕϕ]  =  (G0Ω1G0)1G0Ω1ΩΩ1G0(G0Ω1G0)1  =  (G0Ω1G0)1  =  V.V_{\rm sp}^\star \;=\; \mathbb{E}[\phi^\star \phi^{\star\top}] \;=\; (G_0^\top \Omega^{-1} G_0)^{-1} G_0^\top \Omega^{-1} \cdot \Omega \cdot \Omega^{-1} G_0 (G_0^\top \Omega^{-1} G_0)^{-1} \;=\; (G_0^\top \Omega^{-1} G_0)^{-1} \;=\; V^\star.

The semiparametric bound equals the Hansen bound. Efficient GMM is therefore not merely efficient within the GMM family (Theorem 6.1) — it is efficient within the entire semiparametric class of RAL estimators of θ0\theta_0 under the moment-condition restriction. No future ML, nonparametric, or regularized estimator can asymptotically beat VV^\star on this model class. This is the rigorous statement of what makes GMM the canonical estimator for moment-condition models: it is information-theoretically optimal in the strongest sense available.


§7 — Two-step feasible efficient GMM

§6 proved that W=Ω1W^\star = \Omega^{-1} achieves the Hansen efficiency bound. But Ω=E[g(X,θ0)g(X,θ0)]\Omega = \mathbb{E}[g(X, \theta_0)\, g(X, \theta_0)^\top] is a population object we do not observe. The two-step feasible efficient GMM procedure resolves this: run GMM once with some preliminary weighting to get a consistent first-step estimate θ^(1)\hat\theta^{(1)}, use those residuals to construct Ω^n\hat\Omega_n, then run GMM again with W^=Ω^n1\hat W = \hat\Omega_n^{-1}. The second-step estimator θ^(2)\hat\theta^{(2)} inherits the efficiency bound asymptotically. This is the practical algorithm of applied GMM; almost every GMM regression run in econometrics today follows this two-step structure.

7.1 The two-step algorithm

Algorithm 7.1 (Two-step efficient GMM (Hansen 1982)).

Given a sample X1,,XnX_1, \dots, X_n and a moment function g(X,θ)RLg(X, \theta) \in \mathbb{R}^L with LkL \ge k:

  1. Choose a first-step weighting matrix W(1)W^{(1)}, positive-definite and not depending on the parameter. Common choices: W(1)=ILW^{(1)} = I_L (always works), or W(1)=(ZZ/n)1W^{(1)} = (Z^\top Z / n)^{-1} for linear IV.
  2. First-step GMM: θ^(1)=argminθΘngˉn(θ)W(1)gˉn(θ)\hat\theta^{(1)} = \arg\min_{\theta \in \Theta} n \, \bar g_n(\theta)^\top W^{(1)} \bar g_n(\theta).
  3. Estimate Ω\Omega from first-step residuals: Ω^n=(1/n)i=1ng(Xi,θ^(1))g(Xi,θ^(1))\hat\Omega_n = (1/n) \sum_{i=1}^n g(X_i, \hat\theta^{(1)})\, g(X_i, \hat\theta^{(1)})^\top.
  4. Second-step GMM with W^=Ω^n1\hat W = \hat\Omega_n^{-1}: θ^(2)=argminθΘngˉn(θ)Ω^n1gˉn(θ)\hat\theta^{(2)} = \arg\min_{\theta \in \Theta} n \, \bar g_n(\theta)^\top \hat\Omega_n^{-1} \bar g_n(\theta).
  5. Return θ^(2)\hat\theta^{(2)} as the efficient estimator, with asymptotic variance V^(2)=(G^nΩ^n1G^n)1\hat V^{(2)} = (\hat G_n^\top \hat\Omega_n^{-1} \hat G_n)^{-1}.

Theorem 7.1 (Efficiency of the two-step estimator).

Under the conditions of Theorem 5.1, the two-step estimator from Algorithm 7.1 satisfies

n(θ^(2)θ0)  d  N(0,V),V=(G0Ω1G0)1.\sqrt{n}\,(\hat\theta^{(2)} - \theta_0) \;\to_d\; \mathcal{N}(0, V^\star), \qquad V^\star = (G_0^\top \Omega^{-1} G_0)^{-1}.
Proof.

Two steps. Step 1: Ω^npΩ\hat\Omega_n \to_p \Omega by consistency of θ^(1)\hat\theta^{(1)} (Theorem 4.1) combined with a uniform mean-value bound on ggg g^\top over the neighborhood N\mathcal{N}. Step 2: Apply Theorem 5.1 with Wn=Ω^n1pΩ1=WW_n = \hat\Omega_n^{-1} \to_p \Omega^{-1} = W^\star (continuous mapping). The limit variance is VW=VV_{W^\star} = V^\star.

The proof carries the central insight: the first-step weighting matrix doesn’t need to be efficient — it just needs to deliver consistency. Any positive-definite W(1)W^{(1)} does the job. The efficiency comes entirely from the second-step weighting Ω^n1\hat\Omega_n^{-1}.

The visualization below traces Algorithm 7.1 end-to-end on the running example. Three stacked panels: (a) the step-1 criterion surface Jn(θ,I)J_n(\theta, I) with the step-1 estimate marked; (b) the estimated Ω^n\hat\Omega_n shown as a heatmap (off-diagonals capture cross-sensor moment correlations); (c) the step-2 criterion surface Jn(θ,Ω^n1)J_n(\theta, \hat\Omega_n^{-1}), with step-1 and step-2 estimates overlaid. Raise σ3\sigma_3 to make the step-1 / step-2 ellipse shapes visibly different.

Side-by-side criterion surfaces for step 1 (W = I) and step 2 (W = Ω̂⁻¹) on the running example
Figure 7.1 — Criterion surface deformation from step 1 to step 2. Identity weighting produces nearly circular contours; the efficient weighting Ω̂⁻¹ elongates the contours along the direction the moment data most strongly informs, tightening the resulting confidence ellipse.

7.2 Estimating Ω from first-step residuals

For i.i.d. data the uncentered estimator Ω^n=(1/n)igigi\hat\Omega_n = (1/n) \sum_i g_i g_i^\top (with gi:=g(Xi,θ^(1))g_i := g(X_i, \hat\theta^{(1)})) is the natural choice — uncentered because E[g(X,θ0)]=0\mathbb{E}[g(X, \theta_0)] = 0 by the moment-condition restriction. Some software packages center anyway (subtract gˉn\bar g_n) as a defensive measure against misspecification; for correctly-specified models the two estimators are asymptotically equivalent.

For dependent data — time series, clustered samples — the simple covariance estimator is inconsistent because it ignores the contributions of E[gigj]\mathbb{E}[g_i g_j^\top] at lags ij>0|i - j| > 0. The HAC estimator (Newey–West 1987; Andrews–Monahan 1991) generalizes:

Ω^nHAC  =  Γ^0  +  =1Lnk ⁣(Ln)(Γ^+Γ^),\hat\Omega_n^{\rm HAC} \;=\; \hat\Gamma_0 \;+\; \sum_{\ell=1}^{L_n} k\!\left(\frac{\ell}{L_n}\right) (\hat\Gamma_\ell + \hat\Gamma_\ell^\top),

where Γ^\hat\Gamma_\ell are sample auto-covariances, Lnn1/3L_n \sim n^{1/3} is a bandwidth, and k()k(\cdot) is a kernel ensuring positive-definiteness. The canonical applied workflow: use Newey–West for time series, use the plain sample covariance for i.i.d. cross-sections.

7.3 Iterated GMM and convergence

The two-step procedure stops after one update of W^\hat W. Iterated GMM keeps going: re-estimate Ω^(t)\hat\Omega^{(t)} from residuals at θ^(t)\hat\theta^{(t)}, set W(t+1)=(Ω^(t))1W^{(t+1)} = (\hat\Omega^{(t)})^{-1}, recompute θ^(t+1)\hat\theta^{(t+1)}, and iterate to a fixed point. Iterated GMM is a Picard iteration on the update map T(θ)=argminθJn(θ,Ω^(θ)1)T(\theta) = \arg\min_\theta J_n(\theta, \hat\Omega(\theta)^{-1}).

Three properties: (a) asymptotic equivalence — both θ^(2)\hat\theta^{(2)} and θ^iter\hat\theta^{\rm iter} converge at rate n1/2n^{-1/2} with the same asymptotic variance VV^\star; (b) invarianceθ^iter\hat\theta^{\rm iter} depends only on the sample and the moment function, not on W(1)W^{(1)}; (c) fixed-point characterization — for affine moment functions, this fixed point coincides with the continuous-updating estimator (CUE) of §10. In applied practice, two-step is the default; iterated GMM is used when reviewers ask for “specification-agnostic” results; CUE / EL / GEL (§10) are used when two-step bias matters.

7.4 Finite-sample bias of two-step GMM — the Hansen–Heaton–Yaron critique

Hansen, Heaton, and Yaron (1996) found that the estimated Ω^n\hat\Omega_n depends on the same data used in step 2, inducing a finite-sample bias of order O(L/n)O(L / n). The mechanism: Ω^n\hat\Omega_n is constructed from θ^(1)\hat\theta^{(1)}, which is a function of the sample; using it as the weighting matrix in step 2 creates a correlation between Ω^n1\hat\Omega_n^{-1} and gˉn\bar g_n in the FOCs that classical asymptotics ignores.

Two-step, iterated, and oracle GMM bias at small n with L = 4 moment conditions
Figure 7.2 — Finite-sample bias hierarchy. The oracle estimator (efficient weighting at the unknown true Ω) is unbiased to leading order; iterated GMM reduces two-step bias by a constant factor (∼ 30-50%); both decay at rate O(L/n). The gap closes as n grows.

Three responses to the HHY critique shape the modern literature: iterated GMM mitigates but does not eliminate (bias drops by a constant factor); continuous-updating (CUE) jointly optimizes θ\theta and W(θ)W(\theta) with smaller higher-order bias; empirical likelihood (EL) has the smallest higher-order bias in the GEL class (Newey–Smith 2004). §10 develops all three.


§8 — The Hansen J-statistic and over-identification testing

The two-step procedure of §7 produces, almost as a by-product, a test of correct specification. The minimum value of the efficient-weight GMM criterion — the Hansen J-statistic — has an asymptotic χLk2\chi^2_{L - k} distribution under correct specification. The same machinery that produces the point estimate produces a free specification test.

8.1 The J-statistic as a quadratic form

After running two-step (or iterated) efficient GMM, the Hansen J-statistic is

J^  :=  Jn ⁣(θ^(2),Ω^n1)  =  ngˉn(θ^(2))Ω^n1gˉn(θ^(2)).\hat J \;:=\; J_n\!\left(\hat\theta^{(2)}, \, \hat\Omega_n^{-1}\right) \;=\; n \cdot \bar g_n(\hat\theta^{(2)})^\top \, \hat\Omega_n^{-1} \, \bar g_n(\hat\theta^{(2)}).

Under correct specification (EP0[g(X,θ0)]=0\mathbb{E}_{P_0}[g(X, \theta_0)] = 0), gˉn(θ^(2))=Op(n1/2)\bar g_n(\hat\theta^{(2)}) = O_p(n^{-1/2}), so J^=Op(1)\hat J = O_p(1) — bounded in probability, with the χLk2\chi^2_{L-k} distribution below. Under misspecification (no θ\theta satisfies the population moment condition), gˉn(θ^(2))\bar g_n(\hat\theta^{(2)}) is bounded away from zero, so J^=Op(n)\hat J = O_p(n) — the criterion diverges, and the test rejects with probability tending to 1.

8.2 Asymptotic distribution under H₀

Theorem 8.1 (Asymptotic distribution of the J-statistic).

Under the conditions of Theorem 5.1, with two-step or iterated efficient GMM weighting,

J^  d  χLk2.\hat J \;\to_d\; \chi^2_{L - k}.
Proof.

Four steps.

Step 1 — linearize gˉn(θ^(2))\bar g_n(\hat\theta^{(2)}) around θ0\theta_0. From the proof of Theorem 5.1, n(θ^(2)θ0)=(G0Ω1G0)1G0Ω1ngˉn(θ0)+op(1)\sqrt{n}(\hat\theta^{(2)} - \theta_0) = -(G_0^\top \Omega^{-1} G_0)^{-1} G_0^\top \Omega^{-1} \cdot \sqrt{n}\, \bar g_n(\theta_0) + o_p(1). Mean-value expansion of gˉn\bar g_n then gives

ngˉn(θ^(2))  =  M0ngˉn(θ0)  +  op(1),\sqrt{n}\, \bar g_n(\hat\theta^{(2)}) \;=\; M_0 \cdot \sqrt{n}\, \bar g_n(\theta_0) \;+\; o_p(1),

where M0:=ILG0(G0Ω1G0)1G0Ω1M_0 := I_L - G_0\, (G_0^\top \Omega^{-1} G_0)^{-1} G_0^\top \Omega^{-1} is the residual-projection matrix.

Step 2 — CLT. ngˉn(θ0)dZN(0,Ω)\sqrt{n}\, \bar g_n(\theta_0) \to_d Z \sim \mathcal{N}(0, \Omega). By continuous mapping and Ω^npΩ\hat\Omega_n \to_p \Omega, J^d(M0Z)Ω1(M0Z)\hat J \to_d (M_0 Z)^\top \Omega^{-1} (M_0 Z).

Step 3 — whitening. Let Ω=LL\Omega = L L^\top and Z~:=L1ZN(0,IL)\tilde Z := L^{-1} Z \sim \mathcal{N}(0, I_L), G~:=L1G0\tilde G := L^{-1} G_0. Then M0Z=LMG~Z~M_0 Z = L \cdot M_{\tilde G} \tilde Z, where MG~:=ILG~(G~G~)1G~M_{\tilde G} := I_L - \tilde G (\tilde G^\top \tilde G)^{-1} \tilde G^\top is the standard Euclidean orthogonal projection onto col(G~)\mathrm{col}(\tilde G)^\perp, a symmetric idempotent matrix of rank LkL - k.

Step 4 — compute. (M0Z)Ω1(M0Z)=(LMG~Z~)LL1(LMG~Z~)=MG~Z~2(M_0 Z)^\top \Omega^{-1} (M_0 Z) = (L M_{\tilde G} \tilde Z)^\top L^{-\top} L^{-1} (L M_{\tilde G} \tilde Z) = \|M_{\tilde G} \tilde Z\|^2. Since Z~N(0,IL)\tilde Z \sim \mathcal{N}(0, I_L) and MG~M_{\tilde G} is an orthogonal projection of rank LkL - k, MG~Z~2χLk2\|M_{\tilde G} \tilde Z\|^2 \sim \chi^2_{L - k}.

(Hansen 1982; the residual-projection proof is the canonical textbook treatment.) Geometric reading: the sample-moment vector at the GMM optimum is the residual after projecting the LL-dimensional moment-noise vector onto the kk-dimensional parameter-identifying subspace col(G0)\mathrm{col}(G_0). The kk projection components are absorbed into θ^(2)\hat\theta^{(2)}; the LkL - k residual components are what J^\hat J measures. A just-identified model (L=kL = k) has zero over-identifying restrictions, J^0\hat J \equiv 0 exactly, and there is no specification test. Over-identification is precisely the resource the J-test exploits.

The visualization below runs the J-statistic Monte Carlo on the running example. Top panel: empirical density of J^\hat J over 400 two-step replicates at the user’s chosen nn, with the χ22\chi^2_2 theoretical density overlaid. The empirical 95th percentile and Type-I rate are reported. Bottom panel: power curve — the empirical rejection rate at α=0.05\alpha = 0.05 as we shift sensor 3’s true mean away from its θ0\theta_0-implied value by δ\delta. At δ=0\delta = 0 the rate hovers near α\alpha; as δ\delta grows the rate rises toward 1, consistent with the consistency of the J-test against fixed misspecifications.

mean Ĵ = 2.02 (theory L−k = 2) · empirical Type-I = 4.3%
Empirical histogram of the J-statistic under H₀ overlaid with the χ²_{L-k} density
Figure 8.1 — Empirical Ĵ distribution under correct specification, B = 5000 replicates at n = 500. The histogram tracks the χ²₂ density closely; Q-Q-plot inset (right) shows the asymptotic-normal approximation is essentially exact at this sample size.
J-test power curve as misspecification amplitude grows
Figure 8.2 — Power against misspecification of moment g₃. Each curve corresponds to a different sample size; all converge to 1 as δ grows. The δ = 0 curve hovers at the nominal α = 0.05, confirming correct Type-I control.

8.3 Power against misspecification

The J-test is consistent against any fixed misspecification: if the model is wrong, the test rejects with probability tending to 1.

Fixed alternatives. Suppose the true distribution P0P_0 satisfies m(θ)0m(\theta) \ne 0 for every θΘ\theta \in \Theta. The “pseudo-true” parameter θ0:=argminθm(θ)Ω1m(θ)\theta_0^\star := \arg\min_\theta m(\theta)^\top \Omega^{-1} m(\theta) is the limit of θ^(2)\hat\theta^{(2)}, and the population objective at this point is strictly positive. So J^/npQ0(θ0)>0\hat J / n \to_p Q_0(\theta_0^\star) > 0, J^p\hat J \to_p \infty at rate nn, and the test rejects with probability tending to 1.

Local alternatives (Pitman drift). Suppose the true population moment is m(θ0)=n1/2δm(\theta_0) = n^{-1/2} \delta for some fixed direction δRLcol(G0)\delta \in \mathbb{R}^L \setminus \mathrm{col}(G_0). Following the proof of Theorem 8.1 with the local-alternative drift, the limiting distribution is

J^  d  χLk2(λ),λ=δΩ1δ\hat J \;\to_d\; \chi^2_{L - k}(\lambda), \qquad \lambda = \delta^\top \Omega^{-1} \delta

— a non-central chi-squared. Power increases with λ\lambda: the J-test is most powerful against misspecifications δ\delta that are large in the Ω1\Omega^{-1} norm and orthogonal (in that norm) to the parameter-identifying subspace.

Where the J-test is blind. A misspecification δcol(G0)\delta \in \mathrm{col}(G_0) — a deviation in the parameter-direction subspace — is absorbed by the GMM optimizer into a shift of θ^(2)\hat\theta^{(2)} and contributes nothing to J^\hat J. The J-test cannot detect parametric misspecification within the model’s identification capacity; it only detects mismatches orthogonal to the parameter-direction subspace.

8.4 Reading a J-test in practice

The standard reporting convention is the p-value: pJ^=P(χLk2>J^)p_{\hat J} = P(\chi^2_{L-k} > \hat J). Reject H0H_0 at level α\alpha if pJ^<αp_{\hat J} < \alpha. Three practical pitfalls.

Pitfall 1: “Low J → correctly specified” is a Type II error trap. A non-rejecting J-test is consistent with correct specification but does not prove it. Power against subtle misspecifications can be low at moderate nn.

Pitfall 2: A significant J does not localize the culprit. The J-statistic is a single scalar; it does not localize the problem to a specific moment condition. To diagnose, examine individual moment residuals gˉn(θ^(2))j\bar g_n(\hat\theta^{(2)})_j; the component(s) with the largest standardized residuals point to candidate misspecified moments. Newey (1985) develops formal tests of subsets of moment conditions.

Pitfall 3: Weak identification inflates the false-rejection rate. When G0G_0 is near-singular, the asymptotic χLk2\chi^2_{L-k} approximation breaks down. The Anderson–Rubin test (§9.4) provides a weak-instrument-robust alternative.

Relationship to other specification tests. The Sargan test (Sargan 1958) is the J-test specialized to linear IV — same statistic, same null distribution, older econometric name. The Hausman test (Hausman 1978) tests whether two consistent estimators agree — a specification check based on the difference between estimators. Conditional moment tests (Newey 1985) test subsets of moment conditions, useful for localizing the source of a significant J-statistic.


§9 — Linear GMM, instrumental variables, and 2SLS

This section makes the abstract framework concrete via the canonical applied case: linear instrumental-variables regression. The IV setting is where GMM was born — Hansen and Singleton’s (1982) Euler equations are nonlinear IV — and where the framework continues to do most of its applied work.

9.1 The linear IV model

The structural equation is Yi=Xiθ0+εiY_i = X_i^\top \theta_0 + \varepsilon_i, i=1,,ni = 1, \dots, n, where XiRkX_i \in \mathbb{R}^k is a vector of regressors with some component endogenous, E[Xiεi]0\mathbb{E}[X_i \varepsilon_i] \ne 0. OLS is biased because the orthogonality condition fails.

The fix is an instrument vector ZiRLZ_i \in \mathbb{R}^L with LkL \ge k satisfying two conditions: exogeneity E[Ziεi]=0\mathbb{E}[Z_i \varepsilon_i] = 0 and relevance E[ZiXi]RL×k\mathbb{E}[Z_i X_i^\top] \in \mathbb{R}^{L \times k} has rank kk. Given (Yi,Xi,Zi)(Y_i, X_i, Z_i), the GMM moment function is g(Yi,Xi,Zi;θ)=Zi(YiXiθ)RLg(Y_i, X_i, Z_i; \theta) = Z_i \,(Y_i - X_i^\top \theta) \in \mathbb{R}^L. By exogeneity, E[g(Y,X,Z;θ0)]=0\mathbb{E}[g(Y, X, Z; \theta_0)] = 0 at the truth. By relevance, G0=E[ZX]G_0 = -\mathbb{E}[Z X^\top] has full column rank kk. The over-identification degree is LkL - k.

The sample-moment vector is gˉn(θ)=(1/n)Z(YXθ)\bar g_n(\theta) = (1/n) Z^\top (Y - X \theta) where Z,X,YZ, X, Y stack rows. The sample Jacobian is Gn=(1/n)ZXG_n = -(1/n) Z^\top X, independent of θ\theta (the moment function is affine in θ\theta).

The visualization below sets k=1k = 1 (one endogenous regressor) and L=4L = 4 (four instruments), with a heteroskedasticity-controlled error variance. As you raise endogeneity ρ\rho, the OLS sampling distribution drifts away from θ0=1\theta_0 = 1. As you raise heteroskedasticity γ\gamma, the efficient-GMM advantage over 2SLS grows.

SD: OLS=0.126 · 2SLS=0.204 · eff=0.187
Three sampling distributions overlaid: OLS biased, 2SLS unbiased but spread, efficient GMM unbiased and tighter
Figure 9.1 — Linear IV sampling distributions at n = 500, L = 4 instruments, ρ = 0.5 endogeneity, γ = 1.5 heteroskedasticity. OLS is biased toward the confounded mean; 2SLS is consistent for θ₀ = 1 but inefficient; efficient GMM tightens the distribution by exploiting the heteroskedasticity through Ω̂⁻¹ weighting.

9.2 2SLS as (ZZ/n)1(Z^\top Z / n)^{-1}-weighted GMM

The GMM estimator with a generic weighting matrix WW is

θ^W  =  (XZWZX)1XZWZY,(9.1)\hat\theta_W \;=\; (X^\top Z \, W \, Z^\top X)^{-1} X^\top Z \, W \, Z^\top Y, \qquad\qquad (9.1)

closed-form because gg is affine in θ\theta.

Theorem 9.1 (2SLS as GMM).

The two-stage least squares estimator is the GMM estimator with weighting matrix W=(ZZ/n)1W = (Z^\top Z / n)^{-1}:

θ^2SLS  =  (XPZX)1XPZY,PZ=Z(ZZ)1Z.\hat\theta_{2SLS} \;=\; (X^\top P_Z X)^{-1} X^\top P_Z Y, \qquad P_Z = Z (Z^\top Z)^{-1} Z^\top.
Proof.

Substitute W=n(ZZ)1W = n (Z^\top Z)^{-1} into (9.1); the scalar nn cancels, leaving (XZ(ZZ)1ZX)1XZ(ZZ)1ZY=(XPZX)1XPZY(X^\top Z (Z^\top Z)^{-1} Z^\top X)^{-1} X^\top Z (Z^\top Z)^{-1} Z^\top Y = (X^\top P_Z X)^{-1} X^\top P_Z Y.

The textbook “two-stage” interpretation: PZX=X^P_Z X = \hat X is the OLS prediction of XX from ZZ (first stage), and θ^2SLS\hat\theta_{2SLS} is the OLS regression of YY on X^\hat X (second stage). Under homoskedasticity E[ε2Z]=σε2\mathbb{E}[\varepsilon^2 | Z] = \sigma_\varepsilon^2, Ω=σε2E[ZZ]\Omega = \sigma_\varepsilon^2 \cdot \mathbb{E}[Z Z^\top], and W=(ZZ/n)1W = (Z^\top Z / n)^{-1} differs from W=Ω1W^\star = \Omega^{-1} only by the scalar σε2\sigma_\varepsilon^2. So 2SLS is efficient GMM under homoskedasticity. The Hansen bound becomes V2SLS=σε2(E[XZ](E[ZZ])1E[ZX])1V^\star_{\rm 2SLS} = \sigma_\varepsilon^2 \cdot (\mathbb{E}[X Z^\top] (\mathbb{E}[Z Z^\top])^{-1} \mathbb{E}[Z X^\top])^{-1}, estimated as V^2SLS=σ^ε2(XPZX/n)1\hat V_{\rm 2SLS} = \hat\sigma_\varepsilon^2 \cdot (X^\top P_Z X / n)^{-1} — the standard 2SLS variance formula.

9.3 Efficient GMM under heteroskedasticity

When E[ε2Z]\mathbb{E}[\varepsilon^2 | Z] depends on ZZ, Ω\Omega is no longer proportional to E[ZZ]\mathbb{E}[Z Z^\top], and 2SLS is inefficient. The fix is the two-step procedure of §7 specialized to linear IV.

Algorithm 9.1 (Heteroskedasticity-robust efficient GMM).

  1. Run 2SLS to obtain θ^2SLS\hat\theta_{2SLS}.
  2. Compute residuals e^i=YiXiθ^2SLS\hat e_i = Y_i - X_i^\top \hat\theta_{2SLS}.
  3. Estimate the heteroskedasticity-robust moment covariance: Ω^n=(1/n)i=1nZiZie^i2\hat\Omega_n = (1/n) \sum_{i=1}^n Z_i Z_i^\top \hat e_i^2.
  4. Re-estimate GMM with W^=Ω^n1\hat W = \hat\Omega_n^{-1}: θ^eff=(XZW^ZX)1XZW^ZY\hat\theta^{\rm eff} = (X^\top Z \hat W Z^\top X)^{-1} X^\top Z \hat W Z^\top Y.
  5. The plug-in efficient variance is V^eff=((XZ/n)W^(ZX/n))1\hat V^{\rm eff} = ((X^\top Z / n) \, \hat W \, (Z^\top X / n))^{-1}.

This estimator goes by several names: efficient GMM (Hansen 1982), heteroskedasticity-robust 2SLS (applied econometrics), White (1980) estimator in the just-identified case L=kL = k, or sandwich estimator (statistical learning).

9.4 Weak instruments and near-identification

For a scalar endogenous regressor (k=1k = 1), the first-stage equation is Xi=Ziπ+viX_i = Z_i^\top \pi + v_i. The strength of the instruments is measured by the concentration parameter μ2=π(ZZ)π/σv2\mu^2 = \pi^\top (Z^\top Z) \pi / \sigma_v^2. The first-stage F-statistic is F=μ2/LF = \mu^2 / L (large-nn approximation), an asymptotic chi-squared random variable under the null that all πj=0\pi_j = 0.

Staiger and Stock (1997, Econometrica 65(3): 557–586) proposed the most-cited rule of thumb — first-stage F > 10 as a working threshold separating “strong” from “weak” identification — based on simulation evidence that the 2SLS asymptotic-normal approximation deteriorates below this point. Stock and Yogo (2005) subsequently formalized the threshold via tabulated critical values: under their bias-bound criterion (2SLS bias 10%\le 10\% of OLS bias) at L=1L = 1 instrument, the 5%-level critical value is approximately 11.0; values for other (L,k)(L, k) pairs are in Stock-Yogo Tables 1–2.

When the first-stage F is below 10, two pathologies appear. Bias: 2SLS becomes biased toward OLS in finite samples, with bias of order O(1/F)O(1/F). Standard-error distortion: the 2SLS asymptotic-normal CI undercovers — the nominal-95% CI may have actual coverage of 80% or less.

The Anderson–Rubin test (Anderson and Rubin 1949) introduced a test that does not condition on estimating θ\theta and is therefore robust to weak instruments. The AR statistic at a hypothesized value θ\theta^\star is

AR(θ)  =  ngˉn(θ)Ω^(θ)1gˉn(θ),\mathrm{AR}(\theta^\star) \;=\; n \cdot \bar g_n(\theta^\star)^\top \hat\Omega(\theta^\star)^{-1} \bar g_n(\theta^\star),

where Ω^(θ)\hat\Omega(\theta^\star) is the moment-covariance estimated under H0:θ=θH_0: \theta = \theta^\star. Under H0H_0, by the same residual-projection argument as §8.2 but without subtracting the kk parameter-estimation degrees of freedom,

AR(θ)  d  χL2(not χLk2).\mathrm{AR}(\theta^\star) \;\to_d\; \chi^2_L \qquad (\text{not } \chi^2_{L-k}).

The AR confidence set is CR1αAR={θ:AR(θ)<χL,1α2}\mathrm{CR}_{1-\alpha}^{\rm AR} = \{ \theta^\star : \mathrm{AR}(\theta^\star) < \chi^2_{L, \, 1-\alpha} \} — robust to weak instruments. Kleibergen (2002) developed the K-statistic and Moreira (2003) the conditional-likelihood-ratio test as closely related weak-instrument-robust alternatives.

The visualization below makes the weak-instrument failure mode visible. Slide the first-stage strength τ\tau down toward zero: the 2SLS sampling distribution develops bias toward OLS and heavy tails, while the AR confidence set widens dramatically (often spanning the whole real line) to honestly reflect the lost identification. The 2SLS Wald 95% CI under-covers — its width does not grow as τ\tau shrinks.

mean first-stage F = 7.7 vs SY threshold = 10 — weak instruments
Three-panel diagnostic showing first-stage F, 2SLS bias, and AR set width as instrument strength scales
Figure 9.2 — Weak-instrument diagnostics. As the first-stage strength scale τ shrinks: F-statistic drops below the SY-10 threshold; 2SLS becomes biased toward OLS with heavier tails; the AR set widens to honestly reflect lost identification, while the 2SLS Wald CI under-covers.

For ML / causal-inference applications with strong, theory-motivated instruments (e.g., randomized experiments treated as instruments for compliance), weak instruments are rarely binding. For applied micro applications relying on borderline-relevant instruments, the weak-instrument diagnostics are essential.


§10 — Modern GMM: CUE, empirical likelihood, and GEL

Three modern estimators — the continuous-updating estimator (CUE), Owen’s empirical likelihood (EL), and the generalized empirical likelihood (GEL) family unifying both — provide alternatives to two-step GMM with the same first-order asymptotic efficiency but provably smaller higher-order bias. Newey and Smith (2004) gave the unifying analysis: CUE, EL, and exponential tilting (ET) are all members of a single one-parameter family with a bias hierarchy where EL is strictly preferred.

10.1 The continuous-updating estimator (CUE)

CUE collapses the two steps into a single joint optimization:

θ^CUE  :=  argminθΘ  ngˉn(θ)Ω^n(θ)1gˉn(θ),\hat\theta_{\rm CUE} \;:=\; \arg\min_{\theta \in \Theta} \; n \cdot \bar g_n(\theta)^\top \, \hat\Omega_n(\theta)^{-1} \, \bar g_n(\theta),

where Ω^n(θ)=(1/n)ig(Xi,θ)g(Xi,θ)\hat\Omega_n(\theta) = (1/n) \sum_i g(X_i, \theta) g(X_i, \theta)^\top is recomputed at every θ\theta. The criterion is genuinely nonlinear, even when gg is affine in θ\theta.

Three properties of CUE: (a) smaller higher-order bias than two-step GMM — CUE breaks the gˉn\bar g_nΩ^\hat\Omega asymmetry; (b) reparametrization invariance — invariant to smooth invertible θh(θ)\theta \mapsto h(\theta); (c) computational cost — nonlinear optimization (BFGS or Newton) starting from the two-step estimate; typically 10\sim 10 iterations. Under standard regularity n(θ^CUEθ0)dN(0,V)\sqrt{n}(\hat\theta_{\rm CUE} - \theta_0) \to_d \mathcal{N}(0, V^\star) with the same V=(G0Ω1G0)1V^\star = (G_0^\top \Omega^{-1} G_0)^{-1} as two-step.

10.2 Empirical likelihood (Owen 1988, 1990, 2001)

Empirical likelihood replaces the parametric likelihood with a nonparametric likelihood defined over discrete distributions supported on the data. Given probability weights p1,,pnp_1, \dots, p_n, define the empirical likelihood L(p)=ipiL(p) = \prod_i p_i subject to pi0p_i \ge 0, pi=1\sum p_i = 1, and pig(Xi,θ)=0\sum p_i g(X_i, \theta) = 0.

Profile empirical likelihood. At each θ\theta, profile out pp. The Lagrangian KKT conditions give pi(θ,λ)=1/(n(1+λg(Xi,θ)))p_i(\theta, \lambda) = 1 / (n \,(1 + \lambda^\top g(X_i, \theta))), where λ=λ(θ)\lambda = \lambda(\theta) solves the inner moment constraint

i=1ng(Xi,θ)1+λg(Xi,θ)  =  0.(10.1)\sum_{i=1}^n \frac{g(X_i, \theta)}{1 + \lambda^\top g(X_i, \theta)} \;=\; 0. \qquad\qquad (10.1)

The inner problem (10.1) is the FOC of a convex optimization over λ\lambda — minimize ilog(1+λg(Xi,θ))-\sum_i \log(1 + \lambda^\top g(X_i, \theta)) — with unique interior solution. Newton-Raphson with backtracking (to maintain 1+λgi>01 + \lambda^\top g_i > 0) converges in 5\sim 5 iterations.

The profile empirical-likelihood-ratio statistic is R(θ):=2ilog(1+λ(θ)g(Xi,θ))\ell_R(\theta) := 2 \sum_i \log(1 + \lambda(\theta)^\top g(X_i, \theta)), and the empirical likelihood estimator is θ^EL=argminθR(θ)\hat\theta_{\rm EL} = \arg\min_\theta \ell_R(\theta).

Theorem 10.1 (Wilks' theorem for empirical likelihood (Owen 1990)).

Under the conditions of Theorem 5.1, R(θ0)dχk2\ell_R(\theta_0) \to_d \chi^2_k — the parametric Wilks asymptotic, despite the nonparametric construction. The EL over-identification statistic satisfies R(θ^EL)dχLk2\ell_R(\hat\theta_{\rm EL}) \to_d \chi^2_{L-k} — the same degrees of freedom as Hansen’s J-statistic.

EL confidence regions CR1αEL={θ:R(θ)<χk,1α2}\mathrm{CR}_{1-\alpha}^{\rm EL} = \{\theta : \ell_R(\theta) < \chi^2_{k, 1-\alpha}\} are transformation-invariant, have data-determined shape, and have better finite-sample coverage in many settings (DiCiccio-Hall-Romano 1991).

10.3 Generalized empirical likelihood (GEL): the Newey–Smith unification

Newey-Smith (2004) observed that CUE, EL, and ET all fit a single GEL family. Let ρ:RR\rho: \mathbb{R} \to \mathbb{R} be twice continuously differentiable, concave on a neighborhood of zero, with ρ(0)=ρ(0)=1\rho'(0) = \rho''(0) = -1. The GEL estimator with carrier ρ\rho is

θ^GEL  :=  argminθmaxλ    1ni=1nρ ⁣(λg(Xi,θ)).\hat\theta_{\rm GEL} \;:=\; \arg\min_\theta \, \max_\lambda \;\; \frac{1}{n} \sum_{i=1}^n \rho\!\bigl(\lambda^\top g(X_i, \theta)\bigr).

Three canonical members of the Cressie-Read family ργ(v)=((1+γv)(γ+1)/γ1)/(γ+1)\rho_\gamma(v) = -((1 + \gamma v)^{(\gamma+1)/\gamma} - 1)/(\gamma+1):

γ\gammaργ(v)\rho_\gamma(v)EstimatorSource
11v2/2v-v^2/2 - vCUEHansen-Heaton-Yaron 1996
00log(1v)\log(1 - v)ELOwen 1988, 1990
1-1exp(v)+v+1-\exp(v) + v + 1 (centered)ETImbens 1997; Kitamura-Stutzer 1997

All GEL estimators have n(θ^GELθ0)dN(0,V)\sqrt{n}(\hat\theta_{\rm GEL} - \theta_0) \to_d \mathcal{N}(0, V^\star) — first-order equivalent.

10.4 Higher-order properties and the Newey–Smith bias hierarchy

Newey-Smith (2004) computed the O(1/n)O(1/n) bias for each estimator and proved a strict ordering:

bias2step(n)    biasiterated(n)    biasCUE(n)    biasET(n)    biasEL(n).\text{bias}_{\rm 2step}(n) \;\gtrsim\; \text{bias}_{\rm iterated}(n) \;\gtrsim\; \text{bias}_{\rm CUE}(n) \;\gtrsim\; \text{bias}_{\rm ET}(n) \;\gtrsim\; \text{bias}_{\rm EL}(n).

The visualization below makes the hierarchy visible. Panel A: the Cressie-Read carrier ργ(v)\rho_\gamma(v) morphs as you slide γ\gamma — at γ=0\gamma=0 it’s the EL log carrier; at γ=1\gamma=1 the CUE quadratic; at γ=1\gamma=-1 the ET exponential. Panel B: an empirical bar chart of mean θ^θ0\|\hat\theta - \theta_0\| for two-step, iterated, CUE, and EL on the running example at small nn where the gap is visible.

Bias hierarchy bar chart: two-step ≥ iterated ≥ CUE ≥ ET ≥ EL
Figure 10.1 — Newey-Smith bias hierarchy on the extended L = 8 running example at n = 50. The ordering is exactly as the theory predicts; the gap between two-step and EL is ∼ 30% in this regime.

Computational hierarchy:

EstimatorImplementationPer-replicate cost
Two-step GMMClosed-form (linear) / 2 BFGS (nonlinear)O(L2k)O(L^2 k)
Iterated GMMPicard iterationO(tL2k)O(t \cdot L^2 k)
CUESingle BFGS with θ\theta-dependent WWO(tL3)O(t \cdot L^3)
ELNested BFGS + Newton on λ\lambdaO(touttinnL2)O(t_{\rm out} \cdot t_{\rm in} \cdot nL^2)
ETSame as EL with exponential carriersimilar to EL

Recommendation. Modern applied workflow: (1) always report two-step efficient GMM as the baseline; (2) also report CUE or EL when LkL - k is large or L/n>0.05L/n > 0.05; (3) use EL-based confidence regions when interest centers on a nonlinear function of θ\theta.


§11 — GMM and maximum likelihood

Maximum likelihood is the canonical estimator of parametric statistics. GMM is the canonical estimator of moment-condition models. ML is just-identified GMM with the score equations as moment conditions, and the Cramér–Rao bound is the Hansen efficiency bound under score moments. Conversely, when the assumed likelihood is wrong, the resulting “quasi-MLE” is consistent for a pseudo-true parameter but inherits the GMM sandwich variance rather than the Fisher-information inverse. Every M-estimator — MLE, OLS, quantile regression, Huber regression, quasi-MLE — is just-identified GMM with a specific score-like moment function. GMM extends M-estimation by allowing L>kL > k.

11.1 ML as just-identified GMM

The MLE solves iθlogp(Xi;θ^MLE)=0\sum_i \nabla_\theta \log p(X_i; \hat\theta_{\mathrm{MLE}}) = 0. Identify the score with the moment function: g(X,θ):=θlogp(X;θ)Rkg(X, \theta) := \nabla_\theta \log p(X; \theta) \in \mathbb{R}^k. The MLE is the just-identified GMM estimator with L=kL = k. Under regularity:

  • Eθ0[g(X,θ0)]=0\mathbb{E}_{\theta_0}[g(X, \theta_0)] = 0 (the score has mean zero at the truth).
  • G0=Eθ0[2logp/θθ]=I(θ0)G_0 = \mathbb{E}_{\theta_0}[\partial^2 \log p / \partial \theta \partial \theta^\top] = -\mathcal{I}(\theta_0) (Hessian of the log-likelihood in expectation = negative Fisher information).
  • Ω=Eθ0[logplogp]=I(θ0)\Omega = \mathbb{E}_{\theta_0}[\nabla \log p \cdot \nabla \log p^\top] = \mathcal{I}(\theta_0) (the information matrix equality; Fisher 1925).

Substituting into the just-identified sandwich:

V  =  G01ΩG0  =  (I)1I(I)  =  I1.V \;=\; G_0^{-1} \, \Omega \, G_0^{-\top} \;=\; (-\mathcal{I})^{-1} \, \mathcal{I} \, (-\mathcal{I})^{-\top} \;=\; \mathcal{I}^{-1}.

The MLE achieves the Cramér–Rao lower bound. From the §6 perspective: V=(G0Ω1G0)1=(II1I)1=I1V^\star = (G_0^\top \Omega^{-1} G_0)^{-1} = (\mathcal{I} \cdot \mathcal{I}^{-1} \cdot \mathcal{I})^{-1} = \mathcal{I}^{-1}. So MLE is efficient GMM with score moments — and the Cramér–Rao bound is the moment-condition special case of the Hansen efficiency bound.

11.2 Quasi-MLE and the sandwich variance

What if we maximize a wrong likelihood? Suppose the data come from P0P_0 but we maximize logq(Xi;θ)\sum \log q(X_i; \theta) for a working density q(;θ)pP0q(\cdot; \theta) \ne p_{P_0}. The estimator solves iθlogq(Xi;θ^)=0\sum_i \nabla_\theta \log q(X_i; \hat\theta) = 0 and is called the quasi-MLE.

The quasi-MLE is consistent for the pseudo-true parameter θ0:=argmaxθEP0[logq(X;θ)]\theta_0^\star := \arg\max_\theta \mathbb{E}_{P_0}[\log q(X; \theta)]. Under misspecification, the information matrix equality fails: G0:=EP0[2logq/θθ]EP0[logqlogq]=:ΩG_0^\star := \mathbb{E}_{P_0}[\partial^2 \log q / \partial \theta \partial \theta^\top] \ne -\mathbb{E}_{P_0}[\nabla \log q \cdot \nabla \log q^\top] =: -\Omega^\star. The quasi-MLE asymptotic variance is the GMM sandwich:

VQMLE  =  (G0)1Ω(G0).V_{\mathrm{QMLE}} \;=\; (G_0^\star)^{-1} \, \Omega^\star \, (G_0^\star)^{-\top}.

This is the Eicker–Huber–White sandwich (Eicker 1967; Huber 1967; White 1980), familiar as “robust standard errors” or vcov_type='HC0'.

Canonical example: heteroskedastic OLS. Y=Xθ0+εY = X^\top \theta_0 + \varepsilon under the working assumption of homoskedastic Gaussian errors. OLS is the quasi-MLE. Under homoskedasticity, the information matrix equality holds and the naive CI has correct coverage. Under heteroskedasticity, the equality fails and the naive CI under-covers; the sandwich CI restores nominal coverage. The visualization below makes this concrete: slide the heteroskedasticity scale up and watch the naive coverage drop while the sandwich stays near 95%.

gap: naive under-covers by 8.4pp; sandwich within 0.4pp
Coverage rates of naive vs sandwich CIs for OLS under varying heteroskedasticity
Figure 11.1 — Empirical 95% CI coverage at n = 200, B = 1000. Naive (homoskedastic) CIs under-cover by 10–20 percentage points under strong heteroskedasticity; sandwich CIs hold the nominal rate throughout.

11.3 M-estimation as a unifying framework

Pick a loss function ρ\rho and define θ^M:=argminθ(1/n)iρ(Xi,θ)\hat\theta_M := \arg\min_\theta (1/n) \sum_i \rho(X_i, \theta). The FOC iψ(Xi,θ^M)=0\sum_i \psi(X_i, \hat\theta_M) = 0 with ψ:=θρ\psi := \nabla_\theta \rho is just-identified GMM with g=ψg = \psi. The asymptotic variance is the sandwich VM=G01ΩG0V_M = G_0^{-1} \, \Omega \, G_0^{-\top}.

Five M-estimators:

Estimatorρ(x,θ)\rho(x, \theta)ψ(x,θ)\psi(x, \theta)Information identity holds?
MLElogp(x;θ)-\log p(x; \theta)logp-\nabla \log pYes (correctly specified)
Quasi-MLElogq(x;θ)-\log q(x; \theta)logq-\nabla \log qNo (misspecified)
OLS(yxθ)2/2(y - x^\top \theta)^2 / 2x(yxθ)-x(y - x^\top \theta)Only under homoskedastic Gaussian
Quantile regression (τ\tau)ρτ(yxθ)\rho_\tau(y - x^\top \theta)x(τI[y<xθ])x \cdot (\tau - \mathbb{I}[y < x^\top \theta])No (non-smooth)
Huber regressionρH(yxθ)\rho_H(y - x^\top \theta)xψH(yxθ)x \cdot \psi_H(y - x^\top \theta)No

where ρτ(r)=r(τI[r<0])\rho_\tau(r) = r(\tau - \mathbb{I}[r < 0]) (Koenker-Bassett 1978) and ρH\rho_H is the Huber loss (Huber 1964). Where GMM extends M-estimation. The over-identified case L>kL > k has no counterpart in classical M-estimation: there is no ρ\rho whose gradient is the full LL-vector gg. GMM absorbs M-estimation when L=kL = k and strictly generalizes it when L>kL > k by introducing the weighting matrix WW. In modern causal inference (§12), this generalization is the substantive value of GMM.


§12 — GMM in modern causal inference

The most active recent application of GMM is in causal inference with machine-learned nuisance functions — what Chernozhukov, Chetverikov, Demirer, Duflo, Hansen, Newey, and Robins (2018) call double / debiased machine learning (DML). We want to estimate a low-dimensional causal parameter θ0Rk\theta_0 \in \mathbb{R}^k but identification requires high-dimensional nuisance functions η0(W)\eta_0(W) estimated with ML at sub-parametric rates. DML answers: how can we plug ML-estimated η^\hat\eta into a target-parameter estimator and still get n\sqrt{n}-consistent, asymptotically-normal θ^n\hat\theta_n?

The answer combines two ideas: Neyman orthogonality — design g(O;θ,η)g(O; \theta, \eta) so that ηE[g(O;θ0,η0)]η=η0=0\partial_\eta \mathbb{E}[g(O; \theta_0, \eta_0)]|_{\eta = \eta_0} = 0 — and cross-fitting — estimate η^\hat\eta on one fold, evaluate gg on another fold. Under both, θ^n\hat\theta_n achieves n\sqrt{n}-consistency and asymptotic normality even when η^\hat\eta converges at n1/4n^{-1/4} rate, and the asymptotic variance is the semiparametric efficiency bound.

12.1 Doubly robust estimation as GMM

The partial-linear model. Observe Oi=(Yi,Xi,Wi)O_i = (Y_i, X_i, W_i) where

Yi  =  Xiθ0+g0(Wi)+εi,Xi  =  m0(Wi)+vi,E[εiXi,Wi]=0,    E[viWi]=0.Y_i \;=\; X_i\, \theta_0 + g_0(W_i) + \varepsilon_i, \quad X_i \;=\; m_0(W_i) + v_i, \quad \mathbb{E}[\varepsilon_i | X_i, W_i] = 0, \;\; \mathbb{E}[v_i | W_i] = 0.

Nuisances η0=(0,m0)\eta_0 = (\ell_0, m_0) where 0(W):=E[YW]\ell_0(W) := \mathbb{E}[Y | W]. The Robinson (1988) partialling-out construction: given ^,m^\hat\ell, \hat m, form residuals Y~i:=Yi^(Wi)\tilde Y_i := Y_i - \hat \ell(W_i), X~i:=Xim^(Wi)\tilde X_i := X_i - \hat m(W_i). The estimator is the simple OLS slope θ^partial:=iX~iY~i/iX~i2\hat\theta_{\rm partial} := \sum_i \tilde X_i \tilde Y_i / \sum_i \tilde X_i^2. This is GMM with the Robinson moment g(O;θ,η)=X~(Y~θX~)g(O; \theta, \eta) = \tilde X (\tilde Y - \theta \tilde X).

For the average treatment effect with binary treatment A{0,1}A \in \{0, 1\}, the AIPW moment is

gDR(O;θ,η)  =  μ1(W)μ0(W)  +  A(Yμ1(W))e(W)    (1A)(Yμ0(W))1e(W)    θ,g_{\rm DR}(O; \theta, \eta) \;=\; \mu_1(W) - \mu_0(W) \;+\; \frac{A\,(Y - \mu_1(W))}{e(W)} \;-\; \frac{(1-A)\,(Y - \mu_0(W))}{1 - e(W)} \;-\; \theta,

with η=(μ0,μ1,e)\eta = (\mu_0, \mu_1, e). The estimator has the double robustness property: consistent if either the outcome regressions μ^a\hat\mu_a are consistent or the propensity e^\hat e is consistent.

12.2 Double machine learning

Algorithm 12.1 (DML, K-fold cross-fitting).

  1. Partition the data into KK folds I1,,IK\mathcal{I}_1, \dots, \mathcal{I}_K.
  2. For each k=1,,Kk = 1, \dots, K: estimate η^(k)\hat\eta^{(-k)} on observations outside fold kk; evaluate g(Oi;θ,η^(k))g(O_i; \theta, \hat\eta^{(-k)}) on observations in fold kk.
  3. Solve the cross-fitted moment equation (1/n)kiIkg(Oi;θ^DML,η^(k))=0(1/n) \sum_k \sum_{i \in \mathcal{I}_k} g(O_i; \hat\theta_{\rm DML}, \hat\eta^{(-k)}) = 0 for θ^DML\hat\theta_{\rm DML}.

Theorem 12.1 (Chernozhukov et al. (2018)).

Assume the moment function gg is Neyman-orthogonal (§12.3) and the product of nuisance estimation errors satisfies g^(k)g0L2(P)m^(k)m0L2(P)=op(n1/2)\|\hat g^{(-k)} - g_0\|_{L^2(P)} \cdot \|\hat m^{(-k)} - m_0\|_{L^2(P)} = o_p(n^{-1/2}) — the mixed-bias condition (a standard sufficient condition is η^j(k)ηj,0L2(P)=op(n1/4)\|\hat\eta_j^{(-k)} - \eta_{j,0}\|_{L^2(P)} = o_p(n^{-1/4}) for each nuisance). Under standard regularity (smooth gg, bounded moments, identification at θ0\theta_0),

n(θ^DMLθ0)  d  N(0,V),\sqrt{n}\,(\hat\theta_{\rm DML} - \theta_0) \;\to_d\; \mathcal{N}(0, V^\star),

where VV^\star is the semiparametric efficiency bound.

The result is striking. With ML nuisance estimators converging at n1/4n^{-1/4} rather than the parametric n1/2n^{-1/2}, the DML point estimator still attains n1/2n^{-1/2}-rate and Gaussian asymptotic inference, and achieves the semiparametric efficiency bound. The “double” in DML refers to the mixed-bias product condition: the product of the two nuisance error rates only needs to be op(n1/2)o_p(n^{-1/2}) — a substantially weaker requirement than the op(n1/2)o_p(n^{-1/2})-each rate the naive plug-in argument would demand.

The visualization below illustrates the bias hierarchy on a partial-linear DGP with nonlinear nuisances g0(W)=sin(πW)g_0(W) = \sin(\pi W), m0(W)=W2m_0(W) = W^2, fit with degree-4 polynomial regression (a simpler-than-RF nuisance estimator that still exhibits the asymptotic story). The bar chart shows |bias| and SD across replicates for the three estimators; the histogram shows the sampling distributions overlaid. OLS is biased because it doesn’t adjust for WW; naive plug-in is biased at moderate nn because the same data fits nuisances and structural estimator; DML cross-fits and the bias collapses.

OLS bias = 0.004 · naive = 0.005 · DML = 0.006
DML vs naive plug-in vs OLS Monte Carlo bias comparison on partial-linear model with random-forest nuisances
Figure 12.1 — DML vs naive plug-in vs OLS on the partial-linear DGP with random-forest nuisances. DML eliminates the bias from same-sample nuisance estimation; naive plug-in retains a noticeable residual bias even with the same nuisance functional form.

12.3 Neyman orthogonality: the central design constraint

The DML construction works because g(O;θ,η)g(O; \theta, \eta) satisfies Neyman orthogonality:

tE ⁣[g(O;θ0,η0+t(ηη0))]t=0  =  0for all admissible η.()\frac{\partial}{\partial t}\, \mathbb{E}\!\left[g(O; \theta_0, \eta_0 + t \cdot (\eta - \eta_0))\right] \Bigg|_{t = 0} \;=\; 0 \quad \text{for all admissible } \eta. \qquad (\star)

This is a pathwise / Gâteaux derivative condition: small perturbations of η\eta around η0\eta_0 leave the population expectation of gg unchanged to first order. The plug-in estimator gˉn(θ;η^)\bar g_n(\theta; \hat\eta) is therefore first-order insensitive to η^η0\hat\eta - \eta_0.

Verifying orthogonality for the Robinson moment. With g(O;θ,η)=(Xm(W))(Y(W)θ(Xm(W)))g(O; \theta, \eta) = (X - m(W))(Y - \ell(W) - \theta(X - m(W))), perturbing \ell gives g/η0=(Xm0(W))=v\partial g / \partial \ell |_{\eta_0} = -(X - m_0(W)) = -v, with E[v(0)]=E[(0)E[vW]]=0\mathbb{E}[-v(\ell - \ell_0)] = \mathbb{E}[(\ell-\ell_0) \cdot \mathbb{E}[-v|W]] = 0 using E[vW]=0\mathbb{E}[v|W] = 0. Similarly g/mη0=θ0vε\partial g/\partial m|_{\eta_0} = \theta_0 v - \varepsilon, with E[θ0vεW]=0\mathbb{E}[\theta_0 v - \varepsilon | W] = 0. The Robinson moment is Neyman orthogonal by construction.

Constructing orthogonal moments via the EIF. Start with a “naive” moment g0(O;θ,η)g_0(O; \theta, \eta) satisfying E[g0(O;θ0,η0)]=0\mathbb{E}[g_0(O; \theta_0, \eta_0)] = 0 but not Neyman orthogonal. The orthogonalized moment is g(O;θ,η)=g0(O;θ,η)+ϕ(O;θ,η)g(O; \theta, \eta) = g_0(O; \theta, \eta) + \phi^\star(O; \theta, \eta), where ϕ\phi^\star is the EIF of the nuisance-correction term from the semiparametric efficiency machinery (§6.4). This Neyman orthogonalization is the same operation that produces AIPW from naive outcome regression, Robinson partialling-out from naive OLS, targeted minimum-loss from naive plug-in, and most modern doubly-robust estimators.

Three reasons GMM-with-ML-nuisances is the modern frontier: (1) asymptotic theory is portable — Theorem 5.1’s sandwich, §6’s efficiency bound, §8’s J-statistic all generalize; (2) multiple identifying moments combine optimally — efficient GMM with the union of moments achieves the semiparametric efficiency bound; (3) specification testing is free — Hansen’s J-statistic generalizes to the DML setting (Chernozhukov-Newey-Singh 2022).


§13 — Computational notes, limits, and connections

13.1 Numerical optimization tips

Affine moment functions: use the closed form. numpy.linalg.solve(A.T @ W @ A, A.T @ W @ b) computes the GMM estimate in microseconds. Do not call scipy.optimize.minimize on a linear problem. For smooth nonlinear moments, scipy.optimize.minimize(method='BFGS') with the two-step estimate as starting value is the practical default; pass analytic gradients via jac= when available.

The CUE objective is generally nonconvex even for affine moments. Starting from the two-step estimate θ^(2)\hat\theta^{(2)} — asymptotically equivalent to CUE — typically lands in the convex basin of the global minimum. For EL, use nested optimization with a convex inner problem: Newton-Raphson with line search on the inner λ\lambda (converges in 5\sim 5 iterations), BFGS on the profile R(θ)\ell_R(\theta) as the outer loop.

Convergence diagnostics: monitor the norm of the FOC residual GnWgˉn(θ)\|G_n^\top W \bar g_n(\theta)\| at the optimum (should be near machine epsilon), the condition number of GnWGnG_n^\top W G_n (large = weak identification), and the J-statistic value vs χLk2\chi^2_{L-k} critical (large = specification rejection). Multi-start optimization is the standard defensive strategy for nonconvex problems.

13.2 Bootstrap for GMM and the J-statistic

The naïve nonparametric bootstrap fails for GMM because the bootstrap moment condition is not zero at θ=θ^n\theta = \hat\theta_n in the over-identified case. The Brown-Newey (1995) / Hall-Horowitz (1996) recentered bootstrap fixes this: define g(X,θ):=g(X,θ)gˉn(θ^n)g^\star(X^\star, \theta) := g(X^\star, \theta) - \bar g_n(\hat\theta_n) and run GMM on gg^\star. Hall and Horowitz (1996) proved that the recentered bootstrap yields an asymptotic refinement: bootstrap CIs and J-test critical values have coverage error O(n1)O(n^{-1}) vs O(n1/2)O(n^{-1/2}) asymptotic. Modern variants: wild bootstrap (Davidson-MacKinnon 2010) for heteroskedasticity-robust refinement; block bootstrap (Künsch 1989) for time-series GMM. Default for most applied work: recentered bootstrap with B1000B \sim 1000.

13.3 Bayesian GMM via Chernozhukov–Hong (2003)

Chernozhukov-Hong define a Laplace-type estimator (LTE) by treating the GMM criterion as a quasi-log-likelihood:

π(θdata)    exp ⁣(12Jn(θ,W))π(θ).\pi^\star(\theta \mid \text{data}) \;\propto\; \exp\!\left(-\frac{1}{2} J_n(\theta, W)\right) \cdot \pi(\theta).

Under standard regularity, the posterior mean is consistent, posterior variance equals the GMM sandwich variance, and posterior credible regions equal asymptotic confidence regions to first order. The visualization below runs a 2D random-walk Metropolis sampler from this quasi-posterior on the running example. Top panel: the LTE sample cloud overlaid with the frequentist sandwich 95% ellipse — the two agree at moderate nn. Bottom panel: marginal posterior densities for θ1\theta_1 and θ2\theta_2, with the Gaussian asymptotic curves overlaid.

MH accept rate = 10.1% · θ̄ = (0.973, 0.942)
2D Metropolis quasi-posterior cloud with frequentist sandwich 95% ellipse overlay
Figure 13.1 — Laplace-type estimator quasi-posterior vs frequentist sandwich at n = 500. The MH sample cloud and the sandwich ellipse agree to first order; the LTE provides a Bayesian-flavored uncertainty quantification without specifying a full likelihood model.

LTE is useful when: the GMM criterion is non-smooth (quantile IV, M-estimation with non-differentiable loss); the criterion is multimodal; the model is weakly identified and MCMC reveals the indeterminate direction; informative prior information is available.

13.4 Cross-site connections and further reading

Inbound connections from sister sites. The just-identified predecessor is formalStatistics: method-of-moments , which develops the §2 Pearson construction in detail. The deferred reciprocal in docs/plans/deferred-reciprocals.md auto-discharges via the formalstatisticsPrereqs reciprocal on ship.

Outbound connections within formalML. Three previously-shipped topics underwrite GMM’s machinery: Concentration Inequalities (uniform-LLN bridge for the §4 consistency proof), Convex Analysis (convex-quadratic geometry of the GMM criterion), and Semiparametric Inference (efficient influence function and the semiparametric efficiency bound that equals the Hansen bound). Two planned formalML topics will pick up where GMM leaves off: Causal Inference Methods (coming soon) — doubly-robust estimation, double machine learning, Neyman orthogonality, AIPW; and Empirical Processes (coming soon) — formal development of uniform LLN and Glivenko-Cantelli machinery.

Recommended further reading. Original sources: Hansen (1982), Hansen-Singleton (1982), Owen (1988). Textbook treatments: Hayashi (2000) Ch. 3-5; Hansen B.E. (2022) Ch. 13-15 (free online); Newey-McFadden (1994). Modern developments: Newey-Smith (2004) for GEL unification; Chernozhukov et al. (2018) for DML; Kennedy (2022) for EIF formulas. Software: linearmodels (Python); gmm package (R); statsmodels.sandbox.regression.gmm (Python); ivreg2 (Stata). For DML: econml (Microsoft); doubleml (CRAN/PyPI).

Closing remark. GMM is one of the few estimation frameworks that has stayed central through forty years of changing computational tools. Pearson’s method of moments (1894) became Hansen’s GMM (1982) became Newey-Smith’s GEL (2004) became Chernozhukov et al.’s DML (2018). Each generation built on the same algebraic core — minimize a weighted quadratic in the sample-moment vector — and adapted it to the computational reality of the day. The modern incarnation, GMM with cross-fitted ML nuisances, is the framework that lets us put random forests inside causal-inference confidence intervals without giving up the n\sqrt{n}-consistency that classical statistics requires.

Connections

  • The uniform LLN required for the §4 consistency proof comes directly from the empirical-process / Talagrand machinery developed there. Pointwise convergence of the sample-moment vector at each θ is not enough — the GMM estimator is an argmin, and we need sup_θ ‖ḡ_n − m‖ →_p 0 to conclude argmin convergence. The bracketing-entropy and Rademacher-complexity routes both deliver this. concentration-inequalities
  • The GMM criterion J_n(θ, W) = n ḡ_n(θ)ᵀ W ḡ_n(θ) is a convex quadratic form in the sample-moment residual. When g(X, θ) is affine in θ (the linear IV / running-example case), J_n is globally convex in θ with a closed-form minimizer; for nonlinear g it is convex on a neighborhood of θ₀. The first-order conditions of §3.3 are the convex normal equations. convex-analysis
  • The Hansen efficiency bound V* = (G₀ᵀ Ω⁻¹ G₀)⁻¹ equals the semiparametric efficiency bound for the moment-condition model — derivable independently via the efficient influence function (§6.4). The DML machinery of §12 is exactly Neyman-orthogonalized GMM with ML-estimated nuisance functions; the orthogonalization step uses the same EIF construction. semiparametric-inference

References & Further Reading