
Causal Inference Methods

Identification, doubly-robust estimation, and inference under confounding — from potential outcomes and IPW through AIPW, TMLE, and DML, with instrumental variables, front-door identification, heterogeneous treatment effects, and sensitivity analysis

§1 Motivation: prediction is not enough

A trained-and-tested machine-learning model that predicts well on held-out data is a wonderful thing. It tells us — with quantified uncertainty, even — what we should expect to see when a fresh covariate vector $X$ arrives at deployment. It does not, in general, tell us anything about what would happen if we intervened to change $X$. Those are different questions, and the distinction is the whole reason this topic exists.

This section establishes the distinction concretely, surveys the standard pathologies of naive adjustment, sketches the rest of the topic, and previews the running example we’ll thread through every estimator. The arc is short: predictive estimands measure associations in a distribution; causal estimands measure effects of interventions; under the right assumptions the two coincide; and the methods we’ll develop are recipes for moving from association to effect when those assumptions hold.

Statistical estimands vs causal estimands. Let $D \in \{0, 1\}$ be a binary treatment, $Y$ a real-valued outcome, and $X$ a vector of pre-treatment covariates we observe. Two estimands sit close together in notation and very far apart in meaning:

$$\tau_{\text{assoc}} \;=\; \mathbb{E}[Y \mid D = 1] - \mathbb{E}[Y \mid D = 0], \qquad \tau \;=\; \mathbb{E}[Y(1)] - \mathbb{E}[Y(0)].$$

The first is a contrast of conditional means in the observed-data distribution — a statistical estimand, computable from any iid sample $(X_i, D_i, Y_i)_{i=1}^n$ by replacing each $\mathbb{E}[\cdot]$ with a sample average. The second is a contrast of expected potential outcomes $Y(1), Y(0)$: the values $Y$ would take if every unit were forced into treatment (resp. control). Those potential outcomes are counterfactual — for any given unit we observe only one of them, namely $Y = Y(D)$ — so the second estimand cannot, in general, be read off the observed-data distribution without further assumptions. We unpack that machinery in §2 (notation) and §3 (assumptions). For now, take it on faith that $\tau$ is the object of interest and $\tau_{\text{assoc}}$ is its statistical doppelgänger.

The two estimands agree in randomized experiments, where treatment is assigned independently of $X$: randomization severs the link that would otherwise let $X$ drive both $D$ and $Y$ simultaneously. They disagree in observational data — sometimes by a lot. A treatment that looks like it works because sicker patients are the ones who receive it can produce $\mathbb{E}[Y \mid D = 1] < \mathbb{E}[Y \mid D = 0]$ (apparently harmful) when in fact $\mathbb{E}[Y(1)] > \mathbb{E}[Y(0)]$ (actually beneficial). That sign flip is the bias the rest of this topic is built to retire.

When "controlling for $X$" goes wrong. The textbook prescription — adjust for all variables that affect both $D$ and $Y$ — is correct in spirit and easy to get wrong in practice. Two mistakes are common enough to deserve names.

Colliders. A collider is a variable that is a common effect of two others. Conditioning on a collider — say, by adding it as a regressor — opens a path between its parents that was not there before. The cleanest example: take $A$ and $B$ to be independent standard normals and let $C = A + B + \varepsilon$ be a noisy sum. Marginally, $\operatorname{Cor}(A, B) = 0$. Conditional on $C \approx 0$, the only way to have $C \approx 0$ is to have $A \approx -B$, so $\operatorname{Cor}(A, B \mid C \approx 0)$ is sharply negative. We have manufactured a dependence by adjusting for a variable we should have left alone.
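The collider example above is a two-minute simulation. This is a minimal sketch (variable names are ours): the marginal correlation of $A$ and $B$ is zero, but restricting to a thin slice around $C \approx 0$ produces a sharply negative correlation.

```python
import numpy as np

# A and B are independent; C = A + B + noise is their common effect.
# Conditioning on C ≈ 0 manufactures a dependence that is absent marginally.
rng = np.random.default_rng(0)
n = 100_000
A = rng.standard_normal(n)
B = rng.standard_normal(n)
C = A + B + 0.5 * rng.standard_normal(n)

marginal_corr = np.corrcoef(A, B)[0, 1]        # ~ 0 by construction
near_zero = np.abs(C) < 0.1                    # condition on the collider
collider_corr = np.corrcoef(A[near_zero], B[near_zero])[0, 1]  # sharply negative
```

For Gaussian $A, B$ and this noise level, the conditional correlation works out to roughly $-0.8$: adjusting for $C$ does not "control for" anything, it creates the association.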

Mediators. A mediator sits on a causal path from $D$ to $Y$ — treatment causes the mediator, the mediator causes the outcome. Adjusting for a mediator subtracts out part of the very effect we want to measure, leaving only the direct portion. This is sometimes exactly what we want (controlled-direct-effect questions) and sometimes catastrophic (when the total effect is the thing that matters). The mistake is symmetric to the collider one: the textbook rule "control for $X$" is not a license to control for every $X$.

The lesson is not “adjust less” or “adjust more.” It is adjust correctly — which requires knowing enough of the causal structure to identify confounders without sweeping up colliders or mediators. §2 makes this precise via potential outcomes; §3 makes it precise via d-separation on DAGs.

Two failure modes on a stripped-down DGP. Left: confounding inflates the naive contrast above the true τ = 1. Right: conditioning on a common-effect collider C manufactures a spurious anti-correlation between independent A and B.

Map of the topic. The topic is organized along the identification-and-confounding-control axis: given an estimand, what assumptions let us recover it from observed data, and what estimators implement that recovery efficiently? §§2–3 build the framework — potential outcomes, ignorability, positivity, the identification theorem. §§4–8 develop estimators that assume ignorability: IPW (§4), outcome regression and the g-formula (§5), augmented IPW (§6), targeted maximum likelihood (§7), and double machine learning (§8). The pedagogical arc is that each new estimator improves on its predecessors by being more robust to nuisance-model misspecification, culminating in DML, which licenses arbitrary off-the-shelf ML nuisance fits while preserving $\sqrt{n}$-asymptotic inference. §§9–10 handle the case where ignorability is unavailable: instrumental variables and front-door identification. §§11–12 cover heterogeneous treatment effects and sensitivity to unmeasured confounding. §13 threads the running example through every estimator end-to-end, and §14 closes with limits and connections.

The signature theorem of this topic is the doubly-robust property of AIPW (§6.2): AIPW is consistent for $\tau$ whenever either the propensity model or the outcome model is correctly specified — a remarkable two-shots-on-goal guarantee no single-model estimator can match. The doubly-robust property is the bridge from the §§4–5 single-model estimators to the §§7–8 machinery that lets us use modern ML for nuisance estimation.

The running example. Throughout the topic we work with a partially-linear data-generating process in the Robinson (1988) tradition:

$$Y \;=\; \tau D + g(X) + \varepsilon, \qquad D \mid X \sim \text{Bernoulli}(e(X)),$$

with $X \in \mathbb{R}^{10}$, $\varepsilon \sim \mathcal{N}(0, 1)$ independent of $(X, D)$, smooth nonlinear outcome confounder $g$, smooth nonlinear propensity score $e$, and known true ATE $\tau = 1$. The concrete forms we use are

$$g(X) \;=\; \sin(X_1) + \tfrac{1}{2} X_2^2, \qquad e(X) \;=\; \sigma\!\bigl(1.5\,(0.5 X_1 + \Phi(X_3) - 0.3 X_4 - 0.5)\bigr),$$

where $\sigma$ is the logistic sigmoid and $\Phi$ the standard normal CDF. Propensity is coupled to the outcome via $X_1$ (the same variable that drives the nonlinear $\sin$ component of $g$); without this coupling, the naive contrast would be unbiased and the §5 misspecification demonstration would fail to land. The DGP is benign enough that closed-form intuition is available, but rich enough that lasso, random-forest, and OLS nuisance fits give qualitatively different downstream behavior — which is exactly what we need to make the §6–§8 robustness story visible.
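The DGP above can be written down directly. This is a sketch (function name and seed are ours); $\Phi$ is implemented via `math.erf` to keep the example dependency-free:

```python
import math
import numpy as np

def make_robinson_dgp(n, tau=1.0, seed=0):
    """Partially-linear DGP: Y = tau*D + g(X) + eps, D | X ~ Bern(e(X))."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, 10))
    g = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2
    # Phi = standard normal CDF via the error function
    Phi = 0.5 * (1.0 + np.vectorize(math.erf)(X[:, 2] / math.sqrt(2.0)))
    e = 1.0 / (1.0 + np.exp(-1.5 * (0.5 * X[:, 0] + Phi - 0.3 * X[:, 3] - 0.5)))
    D = rng.binomial(1, e)
    Y = tau * D + g + rng.standard_normal(n)
    return X, D, Y, e

X, D, Y, e_true = make_robinson_dgp(50_000)
naive = Y[D == 1].mean() - Y[D == 0].mean()   # confounded contrast; overshoots tau = 1
```

The coupling through $X_1$ makes the naive contrast overshoot the true $\tau = 1$ by several tenths — the bias the later estimators remove.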

§2 The potential-outcomes framework

§1 introduced the causal estimand $\tau = \mathbb{E}[Y(1) - Y(0)]$ without defining the notation. This section unpacks it: what $Y(d)$ means, what assumptions make it well-defined, why we can never observe it directly, and how the various target estimands of causal inference (ATE, ATT, CATE) relate. The framework we adopt is the Neyman–Rubin potential-outcomes framework, due to Neyman (1923) and elevated to general statistical use by Rubin (1974). The DAG-based framework of Pearl reappears in §3 and §10; for setting up notation, potential outcomes are cleaner.

Counterfactual notation. For each unit $i$, the potential outcomes are $Y_i(0)$ and $Y_i(1)$ — the values of the outcome that would obtain if unit $i$ received control ($D = 0$) or treatment ($D = 1$), respectively. We treat $(Y_i(0), Y_i(1))$ as fixed, unit-level attributes — properties of the unit, not random variables associated with the experiment. Randomness enters through treatment assignment $D_i$ and through whatever measurement error or post-treatment noise is built into the outcome. The unit-level treatment effect is $\tau_i = Y_i(1) - Y_i(0)$. This is the quantity we would want to know for each unit if we could observe both potential outcomes. In reality we observe only $Y_i = Y_i(D_i)$, the realized outcome under the assigned treatment, and never the counterfactual $Y_i(1 - D_i)$.

Definition 1 (SUTVA, consistency, ATE).

The Stable Unit Treatment Value Assumption (SUTVA; Rubin 1980) requires both no interference — unit $i$'s potential outcomes depend on $i$'s own treatment, not on other units' treatments — and no hidden treatment versions — each value $d$ corresponds to a single well-defined intervention. Under SUTVA the consistency identity is definitional:

$$Y_i \;=\; Y_i(D_i) \;=\; D_i\, Y_i(1) + (1 - D_i)\, Y_i(0).$$

The average treatment effect is $\tau = \mathbb{E}[Y(1) - Y(0)]$; the ATT is $\tau_{\text{ATT}} = \mathbb{E}[Y(1) - Y(0) \mid D = 1]$; and the conditional average treatment effect is $\tau(x) = \mathbb{E}[Y(1) - Y(0) \mid X = x]$. The three are linked by $\tau = \mathbb{E}_X[\tau(X)]$ and $\tau_{\text{ATT}} = \mathbb{E}[\tau(X) \mid D = 1]$.

The fundamental problem of causal inference. The defining difficulty of causal inference is that for each unit we observe $Y(D)$ but never $Y(1 - D)$. The unit-level effect $\tau_i$ is unobservable. Holland (1986) named this the fundamental problem of causal inference, and it is genuinely a problem — not a technical detail to be waved away. Standard statistical inference deals with missing data as a contingent feature of the sampling design; here, the missing data is structural — half of every potential-outcome pair, every time.

(A) Counterfactual world: both Y(0) and Y(1) for every unit. (B) The fundamental problem: filled dot = observed, open dot = counterfactual. The interactive slider sets effect heterogeneity: α = 0 gives τᵢ = τ for every unit; α > 0 gives τᵢ = τ + α·εᵢ.

The way out is to give up on unit-level effects and aim for averages over the population. The estimand $\tau$ is an average over the population's unit-level effects, not a unit-level quantity. Under the assumptions of §3, those averages can be expressed as functionals of the observed-data distribution despite the structural missingness.

§3 Identification under ignorability and positivity

§2 set up the potential-outcomes notation and showed why unit-level effects are unobservable. The next question is when averages of those unit-level effects can be recovered from observed-data quantities. The answer comes in three pieces: a structural condition (ignorability), a regularity condition (positivity), and a theorem that exhibits the ATE as two distinct functionals of the observed-data distribution — one a propensity-weighted contrast, the other a covariate-adjusted regression contrast. These two representations are the seeds of the §4 IPW estimator and the §5 outcome-regression estimator, respectively.

Definition 2 (Ignorability and positivity).

Conditional ignorability (no unmeasured confounding, conditional exchangeability) is

$$\{Y(0), Y(1)\} \;\perp\!\!\!\perp\; D \,\big|\, X.$$

Positivity (overlap) is $0 < e(x) < 1$ a.s., where the propensity score is $e(x) = \Pr(D = 1 \mid X = x)$.

Ignorability is a substantive assumption about the data-generating process. It holds, by construction, when $X$ contains every variable that simultaneously affects both treatment assignment and potential outcomes. There is no statistical test for ignorability from observed data alone; the field treats it as a structural claim defended on substantive grounds (study design, domain knowledge, sensitivity analysis — §12). Positivity is a regularity condition that makes $1/e(X)$ and $1/(1 - e(X))$ finite. Practical positivity violations — where $e(x)$ is technically bounded but lives close to $\{0, 1\}$ on a non-trivial mass of $X$ — are a frequent cause of trouble in real applied work.
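A practical overlap check is worth automating. This is a sketch (function and variable names are ours): given fitted or true propensity scores, report the mass near the boundary, where the $1/e(X)$ weights of §4 explode.

```python
import numpy as np

def positivity_report(e, delta=0.05):
    """Fraction of units with propensity within delta of {0, 1}, and the
    largest inverse-probability weight implied by the scores."""
    e = np.asarray(e)
    near_boundary = float(np.mean((e < delta) | (e > 1 - delta)))
    max_weight = max(1.0 / e.min(), 1.0 / (1.0 - e.max()))
    return near_boundary, max_weight

rng = np.random.default_rng(0)
# mild propensities vs. a steep model that pushes scores toward {0, 1}
e_good = 1.0 / (1.0 + np.exp(-0.5 * rng.standard_normal(10_000)))
e_bad = 1.0 / (1.0 + np.exp(-4.0 * rng.standard_normal(10_000)))

frac_good, _ = positivity_report(e_good)
frac_bad, _ = positivity_report(e_bad)
```

The steep model places a large share of units inside the practical-violation region even though $0 < e(x) < 1$ holds exactly — precisely the distinction drawn above.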


Randomized experiments. The cleanest setting where both ignorability and positivity hold by construction is a randomized experiment. If $D \perp\!\!\!\perp (Y(0), Y(1))$ unconditionally and $\Pr(D = 1) \in (0, 1)$, then ignorability holds trivially with any $X$ and positivity holds because the propensity is just the fixed assignment probability. Everything we develop in §§4–10 is in some sense machinery for approximating the conditions of a randomized experiment when randomization is impossible — by reweighting (IPW), by regression adjustment (g-formula), by augmentation (AIPW), or by instrumental-variable manipulation (IV).

Theorem 1 (Identification of the ATE).

Assume SUTVA, consistency, ignorability $\{Y(0), Y(1)\} \perp\!\!\!\perp D \mid X$, and positivity $0 < e(x) < 1$ a.s. Then the ATE $\tau = \mathbb{E}[Y(1) - Y(0)]$ is identified by both:

(a) The g-formula (standardization) functional:

$$\tau \;=\; \mathbb{E}_X\!\bigl[\mu_1(X) - \mu_0(X)\bigr], \qquad \text{where } \mu_d(x) = \mathbb{E}[Y \mid X = x, D = d].$$

(b) The inverse-probability-weighting (IPW) functional:

$$\tau \;=\; \mathbb{E}\!\left[\frac{D\,Y}{e(X)} \;-\; \frac{(1 - D)\,Y}{1 - e(X)}\right].$$
Proof.

We show (a): $\mathbb{E}[Y(d)] = \mathbb{E}_X[\mu_d(X)]$ for each $d \in \{0, 1\}$; the ATE follows by taking the difference. The three-step chain is iterated expectation → ignorability → consistency:

$$\begin{aligned} \mathbb{E}[Y(d)] &= \mathbb{E}_X\!\bigl[\mathbb{E}[Y(d) \mid X]\bigr] && \text{(iterated expectation)} \\ &= \mathbb{E}_X\!\bigl[\mathbb{E}[Y(d) \mid X, D = d]\bigr] && \text{(ignorability)} \\ &= \mathbb{E}_X\!\bigl[\mathbb{E}[Y \mid X, D = d]\bigr] && \text{(consistency)} \\ &= \mathbb{E}_X[\mu_d(X)]. \end{aligned}$$

The second line is the load-bearing one. Ignorability says the conditional distribution of $Y(d)$ given $X$ does not depend on $D$, so conditioning on $D = d$ adds no information beyond $X$. The third line is a pure relabeling: within the sub-population $\{D = d\}$, the observed $Y$ is the potential outcome $Y(d)$ by consistency. Positivity enters implicitly: we need the conditioning event $\{D = d, X = x\}$ to have positive probability in every covariate stratum $x$.

For (b), show $\mathbb{E}[D Y / e(X)] = \mathbb{E}[Y(1)]$; the second term follows by symmetry. The five-step chain is consistency → iterated expectation → factoring → ignorability → cancellation:

$$\begin{aligned} \mathbb{E}\!\left[\frac{D Y}{e(X)}\right] &= \mathbb{E}\!\left[\frac{D\, Y(1)}{e(X)}\right] && (\text{consistency: } D Y = D\, Y(1)) \\ &= \mathbb{E}\!\left[\,\mathbb{E}\!\left[\frac{D\, Y(1)}{e(X)} \,\Big|\, X, Y(1)\right]\right] && \text{(iterated expectation)} \\ &= \mathbb{E}\!\left[\frac{Y(1)}{e(X)} \cdot \mathbb{E}[D \mid X, Y(1)]\right] && (\text{factor } (X, Y(1))\text{-measurable terms}) \\ &= \mathbb{E}\!\left[\frac{Y(1)}{e(X)} \cdot \mathbb{E}[D \mid X]\right] && \text{(ignorability)} \\ &= \mathbb{E}\!\left[\frac{Y(1)}{e(X)} \cdot e(X)\right] && (\text{definition of } e) \\ &= \mathbb{E}[Y(1)]. \end{aligned}$$

The fourth line is the load-bearing one — ignorability says $D \perp\!\!\!\perp Y(1) \mid X$, so $\mathbb{E}[D \mid X, Y(1)] = e(X)$, which cancels against the $1/e(X)$ weight. The symmetric calculation for $D = 0$ gives $\mathbb{E}[(1 - D)\,Y/(1 - e(X))] = \mathbb{E}[Y(0)]$; subtraction yields (b).

The two representations are point-equal but constructed from different ingredients — outcome means $\mu_d$ in (a), the propensity $e$ in (b). That asymmetry is exactly what §6's augmented-IPW estimator will exploit to deliver double robustness.
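Theorem 1 can be checked numerically on a toy example of our own (binary $X$, all ingredients known): the g-formula functional is computable exactly, and the IPW functional is approximated by Monte Carlo with the oracle propensity.

```python
import numpy as np

# Toy observed-data distribution: binary X with known propensity e(x)
# and known outcome means mu_d(x). True ATE = sum_x P(x) * (mu1 - mu0).
p_x = np.array([0.4, 0.6])      # P(X = 0), P(X = 1)
e = np.array([0.3, 0.7])        # propensity e(x)
mu0 = np.array([1.0, 2.0])      # E[Y | X = x, D = 0]
mu1 = np.array([2.5, 4.0])      # E[Y | X = x, D = 1]

# (a) g-formula functional, computed exactly
tau_g = float(np.sum(p_x * (mu1 - mu0)))        # = 1.8

# (b) IPW functional, approximated by Monte Carlo
rng = np.random.default_rng(1)
n = 500_000
X = rng.binomial(1, p_x[1], n)
D = rng.binomial(1, e[X])
Y = np.where(D == 1, mu1[X], mu0[X]) + rng.standard_normal(n)
tau_ipw = np.mean(D * Y / e[X] - (1 - D) * Y / (1 - e[X]))
```

Both functionals land on the same number, as the theorem requires — the sampling error in `tau_ipw` is of order $n^{-1/2}$.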

Propensity histograms by treatment arm on the canonical Robinson DGP. The interactive viz varies the steepness slider in [0.5, 5] and shades the positivity-violation regions [0, δ] ∪ [1−δ, 1].

§4 The IPW estimator

The IPW functional from §3 is literally an expectation, so the natural estimator just replaces it with a sample average. This gives the inverse-probability-weighting estimator (Horvitz–Thompson 1952; Robins, Rotnitzky & Zhao 1994).

The Horvitz–Thompson construction. Given a sample $(X_i, D_i, Y_i)_{i=1}^n$ and access to (or a fitted estimate of) the propensity $e(X_i)$, the HT-IPW estimator is

$$\hat\tau_{\text{IPW}} \;=\; \frac{1}{n}\sum_{i=1}^n \left[\frac{D_i Y_i}{e(X_i)} \;-\; \frac{(1-D_i)\, Y_i}{1 - e(X_i)}\right].$$

The intuition is reweighting to balance. Each treated unit gets weight $1/e(X_i)$ — larger weight for units unlikely to be treated, smaller for those nearly guaranteed treatment. The reweighted treatment arm behaves as though drawn from the marginal distribution of $X$. The same construction balances the control arm. IPW constructs a synthetic randomized experiment from observational data: each unit's contribution is up-weighted by the reciprocal of the probability that it ended up in the arm we observe it in.

The propensity score as a balancing score. Rosenbaum and Rubin (1983) showed that the propensity is a balancing score: conditional on $e(X)$ alone — not on the full $X$ — treated and control units have the same distribution of $X$:

$$X \;\perp\!\!\!\perp\; D \,\big|\, e(X).$$

Under ignorability, the same conditioning transfers to potential outcomes: $\{Y(0), Y(1)\} \perp\!\!\!\perp D \mid e(X)$. This is the structural reason IPW works with a scalar nuisance even when $X$ is high-dimensional. The downside is that a misspecified propensity model loses this property — the estimated $\hat e$ is no longer a true balancing score, and IPW becomes biased in proportion to the misspecification error.

Asymptotic normality and the sandwich. Under ignorability, positivity, and correct specification, IPW is asymptotically normal:

$$\sqrt{n}\,(\hat\tau_{\text{IPW}} - \tau) \;\xrightarrow{d}\; \mathcal{N}\!\bigl(0,\; V_{\text{IPW}}\bigr),$$

with $V_{\text{IPW}} = \mathbb{E}[\psi_{\text{IPW}}^2]$ the variance of the per-unit summand $\psi_{\text{IPW}} = D Y / e(X) - (1 - D)\,Y / (1 - e(X)) - \tau$. The result follows from M-estimator theory (Stefanski & Boos 2002), with details on the Statistics: Point Estimation page. Remarkably, estimating the propensity gives a more efficient IPW estimator than using the truth (Hirano, Imbens & Ridder 2003). The plug-in sandwich estimator gives valid Wald 95% CIs $\hat\tau \pm 1.96\sqrt{\widehat{\operatorname{Var}}}$.

Stabilized weights and trimming. The chief weakness of IPW is variance inflation under near-positivity violation. When some unit has $\hat e(X_i)$ near $\{0, 1\}$, its weight explodes; that single unit dominates and $\hat\tau$ inherits enormous variance. The Hájek (stabilized) estimator normalizes weights to sum to 1 within each arm; trimming discards units with $\hat e(X) \notin [\delta, 1 - \delta]$, at the cost of changing the estimand to the trimmed sub-population's ATE. Crump et al. (2009) give an optimal trimming rule; $\delta \in [0.01, 0.05]$ is the practical default.
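The Hájek estimator with optional trimming fits in a few lines. This is a sketch (names are ours), exercised on a toy confounded DGP with the oracle propensity and true $\tau = 1$:

```python
import numpy as np

def hajek_ipw(Y, D, e_hat, delta=0.0):
    """Stabilized IPW: weights normalized to sum to 1 within each arm.
    Units with e_hat outside [delta, 1 - delta] are trimmed (which changes
    the estimand to the retained subpopulation's ATE)."""
    keep = (e_hat >= delta) & (e_hat <= 1 - delta)
    Y, D, e_hat = Y[keep], D[keep], e_hat[keep]
    w1 = D / e_hat
    w0 = (1 - D) / (1 - e_hat)
    return np.sum(w1 * Y) / np.sum(w1) - np.sum(w0 * Y) / np.sum(w0)

# toy confounded DGP: X drives both treatment and outcome, true tau = 1
rng = np.random.default_rng(0)
n = 100_000
X = rng.standard_normal(n)
e = 1.0 / (1.0 + np.exp(-X))
D = rng.binomial(1, e)
Y = 1.0 * D + X + rng.standard_normal(n)

tau_hajek = hajek_ipw(Y, D, e, delta=0.01)   # ≈ 1, unlike the naive contrast
```

Normalizing the weights does not remove the estimand-level sensitivity to extreme propensities — it only prevents a single huge weight from dominating the sample average.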

Simulation summary: mean τ̂ = 0.987, SD = 0.130, mean SE = 0.142, 95% coverage = 98.0%, max IPW weight = 43.6.

§5 Outcome regression and the g-formula

The g-formula representation $\tau = \mathbb{E}_X[\mu_1(X) - \mu_0(X)]$ is the second identification functional. Where IPW reweights, the outcome-regression estimator fills in the missing potential outcomes via a regression model.

Standardization. Fit a regression model for the conditional outcome means $\mu_d(x)$ — typically by splitting the sample on $D$ and fitting separate regressions, or by fitting a single joint model with $D$ as a feature. Then plug into the g-formula sample average:

$$\hat\tau_{\text{OR}} \;=\; \frac{1}{n}\sum_{i=1}^n \bigl[\hat\mu_1(X_i) - \hat\mu_0(X_i)\bigr].$$

For each unit — regardless of which arm it was observed in — we compute the model-implied counterfactual contrast and average over the empirical $X$-distribution. Under correct specification, OR is consistent and asymptotically normal. The naive plug-in SE $n^{-1/2} \cdot \operatorname{sd}\bigl(\hat\mu_1(X_i) - \hat\mu_0(X_i)\bigr)$ ignores first-stage fitting noise and tends to under-cover; proper inference uses the bootstrap or the §6 AIPW-based variance.
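The standardization recipe can be sketched with per-arm OLS on a toy linear DGP of our own (true $\tau = 1$), where the linear model is correctly specified, so $\hat\tau_{\text{OR}}$ recovers $\tau$ while the naive contrast does not:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000
X = rng.standard_normal(n)
e = 1.0 / (1.0 + np.exp(-X))          # confounded assignment
D = rng.binomial(1, e)
Y = 1.0 * D + 2.0 * X + rng.standard_normal(n)

def fit_ols(x, y):
    """Intercept + slope via least squares."""
    A = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta

b0 = fit_ols(X[D == 0], Y[D == 0])    # mu_0 fit on controls only
b1 = fit_ols(X[D == 1], Y[D == 1])    # mu_1 fit on treated only
mu0_hat = b0[0] + b0[1] * X           # predictions for ALL units
mu1_hat = b1[0] + b1[1] * X
tau_or = np.mean(mu1_hat - mu0_hat)   # ≈ 1

naive = Y[D == 1].mean() - Y[D == 0].mean()  # badly confounded
```

The key move is predicting $\hat\mu_d$ for every unit, including units observed in the other arm — that is what averages the contrast over the marginal $X$-distribution rather than the treatment-conditional one.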

Bias under misspecification. If $\hat\mu_d \to \tilde\mu_d \neq \mu_d$ in probability, the asymptotic bias is the population-mean difference $\mathbb{E}[\tilde\mu_1(X) - \tilde\mu_0(X)] - \tau$ — direct, with no correcting term. OR's failure mode is bias without variance blowup. IPW under near-positivity violation is high-bias-and-high-variance; OR under misspecification is high-bias-and-low-variance. The two trade off in opposite directions.

The g-formula as integral form. The representation $\tau = \int \bigl[\mu_1(x) - \mu_0(x)\bigr]\, dF_X(x)$ integrates the conditional contrast against the marginal covariate distribution rather than the treatment-conditional one. The sample-average plug-in is the Monte Carlo approximation against $\hat F_X$. For longitudinal settings with time-varying treatment, this becomes Robins's g-computation algorithm (Robins 1986); §14.2 sketches the extension.

§6 Augmented IPW and double robustness

Signature section. §4 gave us IPW: correct propensity gives consistency, misspecified propensity gives bias. §5 gave us OR: correct outcome gives consistency, misspecified outcome gives bias. AIPW combines both into a single score whose bias is the product of two nuisance errors. Either nuisance correct ⇒ AIPW consistent.

The AIPW score.

$$\psi_{\text{AIPW}}(W;\, \tau, e, \mu) \;=\; \mu_1(X) - \mu_0(X) \;+\; \frac{D\,(Y - \mu_1(X))}{e(X)} \;-\; \frac{(1-D)\,(Y - \mu_0(X))}{1 - e(X)} \;-\; \tau.$$

The empirical estimator is the sample average. Structure: OR plus IPW-weighted augmentation by outcome residuals. The augmentation corrects OR when the outcome model is wrong; it vanishes in expectation when the outcome model is right ($\mathbb{E}[Y - \mu_d \mid X, D = d] = 0$). A symmetric rewriting shows it equally corrects IPW under a wrong propensity and vanishes under the right propensity. Each nuisance covers the other.
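The two-shots-on-goal structure is easy to see numerically. This is a sketch (names are ours) on a toy linear DGP with true $\tau = 1$: the first call feeds AIPW a deliberately wrong outcome model (identically zero) with the oracle propensity; the second feeds it the oracle outcome means with a deliberately wrong constant propensity. Both recover $\tau$.

```python
import numpy as np

def aipw(Y, D, e_hat, mu0_hat, mu1_hat):
    """Sample average of the AIPW score (without the -tau centering)."""
    aug1 = D * (Y - mu1_hat) / e_hat
    aug0 = (1 - D) * (Y - mu0_hat) / (1 - e_hat)
    return np.mean(mu1_hat - mu0_hat + aug1 - aug0)

rng = np.random.default_rng(0)
n = 200_000
X = rng.standard_normal(n)
e = 1.0 / (1.0 + np.exp(-X))
D = rng.binomial(1, e)
Y = 1.0 * D + 2.0 * X + rng.standard_normal(n)

# (i) wrong outcome model, correct propensity — still consistent
tau_wrong_mu = aipw(Y, D, e, np.zeros(n), np.zeros(n))
# (ii) correct outcome means, wrong (constant) propensity — still consistent
tau_wrong_e = aipw(Y, D, np.full(n, 0.5), 2.0 * X, 1.0 + 2.0 * X)
```

A plain IPW or OR estimator handed the same wrong nuisance would be biased; the AIPW augmentation is what cancels each error against the other correct nuisance.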

Theorem 2 (Double robustness of AIPW).

Assume SUTVA, consistency, ignorability $\{Y(0), Y(1)\} \perp\!\!\!\perp D \mid X$, and positivity $0 < e(x) < 1$ a.s. Suppose the plug-in nuisances satisfy $\hat\mu_d \xrightarrow{p} \tilde\mu_d$ and $\hat e \xrightarrow{p} \tilde e$, with $\tilde e$ bounded away from $\{0, 1\}$. If at least one of (i) $\tilde\mu_d = \mu_d$ (correct outcome) or (ii) $\tilde e = e$ (correct propensity) holds, then $\hat\tau_{\text{AIPW}} \xrightarrow{p} \tau$. Under mild rate conditions on the nuisance estimators, $\sqrt{n}\,(\hat\tau_{\text{AIPW}} - \tau) \xrightarrow{d} \mathcal{N}(0, V_{\text{AIPW}})$ with $V_{\text{AIPW}} = \mathbb{E}[\psi_{\text{AIPW}}^2]$, equal to the Hahn (1998) semiparametric efficiency bound when both nuisances are correctly specified.

Proof.

We decompose the asymptotic bias as a sum of two product-of-errors terms in six steps.

Let $\bar\psi(\tilde\mu, \tilde e) = \mathbb{E}[\psi_{\text{AIPW}}(W;\, \tau, \tilde e, \tilde\mu) + \tau]$, so that $\bar\psi(\mu, e) = \tau$. By the LLN, $\hat\tau_{\text{AIPW}} \xrightarrow{p} \bar\psi(\tilde\mu, \tilde e)$.

Step 1: Condition on $X$ in each summand. For the IPW-weighted residual term, by consistency $D Y = D\, Y(1)$ and by ignorability $D \perp\!\!\!\perp Y(1) \mid X$:

$$\mathbb{E}[D\,(Y - \tilde\mu_1(X)) \mid X] = e(X)\,\bigl(\mu_1(X) - \tilde\mu_1(X)\bigr).$$

Dividing by $\tilde e(X)$ and taking the symmetric calculation for the control arm:

$$\mathbb{E}\!\left[\frac{D\,(Y - \tilde\mu_1(X))}{\tilde e(X)} \,\Big|\, X\right] = \frac{e(X)}{\tilde e(X)}\,\bigl(\mu_1(X) - \tilde\mu_1(X)\bigr), \qquad \mathbb{E}\!\left[\frac{(1-D)\,(Y - \tilde\mu_0(X))}{1 - \tilde e(X)} \,\Big|\, X\right] = \frac{1 - e(X)}{1 - \tilde e(X)}\,\bigl(\mu_0(X) - \tilde\mu_0(X)\bigr).$$

Step 2: Combine. The conditional expectation of the full AIPW summand (with $\tau$ added back) given $X$ is

$$\tilde\mu_1(X) - \tilde\mu_0(X) + \frac{e(X)}{\tilde e(X)}\bigl(\mu_1(X) - \tilde\mu_1(X)\bigr) - \frac{1 - e(X)}{1 - \tilde e(X)}\bigl(\mu_0(X) - \tilde\mu_0(X)\bigr).$$

Step 3: Add and subtract $\mu_1(X) - \mu_0(X)$. Write $\tilde\mu_d = \mu_d + \Delta_{\mu,d}$ with $\Delta_{\mu,d}(X) = \tilde\mu_d(X) - \mu_d(X)$:

$$\bigl[\mu_1(X) - \mu_0(X)\bigr] + \Delta_{\mu,1}(X)\Bigl[1 - \tfrac{e(X)}{\tilde e(X)}\Bigr] - \Delta_{\mu,0}(X)\Bigl[1 - \tfrac{1 - e(X)}{1 - \tilde e(X)}\Bigr].$$

Step 4: Simplify the brackets. Each is a propensity error: $1 - e(X)/\tilde e(X) = \Delta_e(X)/\tilde e(X)$ and $1 - (1 - e(X))/(1 - \tilde e(X)) = -\Delta_e(X)/(1 - \tilde e(X))$, where $\Delta_e(X) = \tilde e(X) - e(X)$. Substituting:

$$\bigl[\mu_1(X) - \mu_0(X)\bigr] + \frac{\Delta_{\mu,1}(X)\,\Delta_e(X)}{\tilde e(X)} + \frac{\Delta_{\mu,0}(X)\,\Delta_e(X)}{1 - \tilde e(X)}.$$

Step 5: Take unconditional expectation.

$$\boxed{\;\bar\psi(\tilde\mu, \tilde e) - \tau \;=\; \mathbb{E}\!\left[\frac{\Delta_{\mu,1}\,\Delta_e}{\tilde e}\right] + \mathbb{E}\!\left[\frac{\Delta_{\mu,0}\,\Delta_e}{1 - \tilde e}\right].\;}$$

Step 6: Conclude. The asymptotic bias is a sum of products of outcome-model and propensity errors. If either nuisance is correctly specified (one factor identically zero), both products vanish and the bias is zero.

A consequence: the bias is second-order in nuisance errors — bounded by $\|\Delta_\mu\|_2 \cdot \|\Delta_e\|_2$ via Cauchy–Schwarz. When both nuisances converge at rate $o_P(n^{-1/4})$, the product converges at rate $o_P(n^{-1/2})$, asymptotically negligible relative to the $O_P(n^{-1/2})$ leading-order CLT fluctuation. This is the Neyman-orthogonality mechanism underlying §8's DML — the same algebra, repackaged as a $\sqrt{n}$-rate guarantee.

The efficient influence function and semiparametric efficiency. The AIPW score is the EIF for the ATE under ignorability. Hahn (1998) established that for any regular estimator with $\sqrt{n}\,(\hat\tau - \tau) \xrightarrow{d} \mathcal{N}(0, V)$,

$$V \geq V^* = \mathbb{E}\!\left[\frac{\sigma_1^2(X)}{e(X)} + \frac{\sigma_0^2(X)}{1 - e(X)} + \bigl(\mu_1(X) - \mu_0(X) - \tau\bigr)^2\right],$$

with $\sigma_d^2(X) = \operatorname{Var}(Y \mid X, D = d)$, and AIPW achieves $V^*$ when both nuisances are correctly specified. The EIF is Neyman-orthogonal to the nuisances at the true values, which is what licenses ML nuisance estimators in DML (§8).

Four nuisance-specification cells. AIPW (green) recovers τ in 3 of 4; IPW (blue) and OR (brown) each recover in 2.

§7 Targeted Maximum Likelihood (TMLE)

TMLE produces an estimator asymptotically equivalent to AIPW — same limit distribution, same efficiency bound, same doubly-robust consistency — via a different numerical construction that solves the EIF estimating equation exactly in-sample and preserves the parameter space for bounded outcomes. Introduced by van der Laan and Rubin (2006), developed in van der Laan and Rose (2011).

Motivation. AIPW's augmentation term has a generally non-zero empirical mean at the initial $\hat\mu$ — the initial fit wasn't optimized for the EIF. TMLE adjusts $\hat\mu \to \hat\mu^*$ to enforce the targeting equations:

$$\frac{1}{n}\sum_i \frac{D_i\,(Y_i - \hat\mu_1^*(X_i))}{\hat e(X_i)} = 0, \qquad \frac{1}{n}\sum_i \frac{(1-D_i)\,(Y_i - \hat\mu_0^*(X_i))}{1 - \hat e(X_i)} = 0.$$

The TMLE estimator is then $\hat\tau_{\text{TMLE}} = n^{-1}\sum_i \bigl[\hat\mu_1^*(X_i) - \hat\mu_0^*(X_i)\bigr]$ — OR with the targeted regression, equivalently AIPW with the augmentation absorbed.

The one-step targeting update. The linear submodel is

$$\hat\mu_d^*(X) = \hat\mu_d(X) + \hat\epsilon_d\, H_d(X), \qquad H_1(X) = 1/\hat e(X), \quad H_0(X) = 1/(1 - \hat e(X)).$$

OLS on each arm, regressing residuals on the clever covariate, gives

$$\hat\epsilon_d = \frac{\sum_{i:\,D_i = d} H_d(X_i)\,(Y_i - \hat\mu_d(X_i))}{\sum_{i:\,D_i = d} H_d(X_i)^2}.$$

The OLS first-order condition is the targeting equation.
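The one-step update above is a few lines of code. This is a sketch (names and the toy DGP are ours) using the oracle propensity and deliberately poor initial outcome fits; after targeting, the empirical augmentation term — the arm-1 targeting equation — is exactly zero in-sample.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
X = rng.standard_normal(n)
e_hat = 1.0 / (1.0 + np.exp(-X))      # oracle propensity for the sketch
D = rng.binomial(1, e_hat)
Y = 1.0 * D + 2.0 * X + rng.standard_normal(n)

mu_hat = {0: np.zeros(n), 1: np.zeros(n)}   # bad initial fits on purpose
mu_star = {}
for d in (0, 1):
    arm = D == d
    H = 1.0 / e_hat if d == 1 else 1.0 / (1.0 - e_hat)   # clever covariate
    # closed-form OLS slope of residuals on H within the arm
    eps = np.sum(H[arm] * (Y[arm] - mu_hat[d][arm])) / np.sum(H[arm] ** 2)
    mu_star[d] = mu_hat[d] + eps * H

tau_tmle = np.mean(mu_star[1] - mu_star[0])
# targeting equation for the treated arm: exactly zero up to float precision
aug1 = np.mean(D * (Y - mu_star[1]) / e_hat)
```

Because the propensity is correct here, the targeted estimate is consistent for $\tau = 1$ even though the initial outcome fits are useless — the same double robustness as AIPW, realized by fluctuating the regression instead of adding a correction term.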

Remark (Logistic-submodel TMLE for bounded outcomes).

For bounded $Y \in [0, 1]$ (e.g., binary $Y$, or a dichotomized continuous outcome), the logistic submodel is

$$\operatorname{logit}\bigl(\hat\mu_d^*(X)\bigr) = \operatorname{logit}\bigl(\hat\mu_d(X)\bigr) + \hat\epsilon_d\, H_d(X),$$

with $\hat\epsilon_d$ from a logistic regression of $Y$ on $H_d(X)$ with $\operatorname{logit}(\hat\mu_d)$ as offset. The same targeting property holds, and the update keeps $\hat\mu_d^*(X) \in [0, 1]$ by construction — whereas the linear submodel can drift outside $[0, 1]$ for binary outcomes. The notebook verifies this on a dichotomized Robinson outcome: the linear submodel gives $\hat\tau = +0.3316$, the logistic submodel gives $\hat\tau = +0.3318$, with all $\hat\mu_d^*(X) \in [0, 1]$.

Theorem 3 (Asymptotic equivalence of TMLE and AIPW).

Under SUTVA, consistency, ignorability, positivity, and the nuisance rate condition $\|\hat\eta - \eta_0\|_{L^2(P)} = o_P(n^{-1/4})$, TMLE is asymptotically normal:

$$\sqrt{n}\,(\hat\tau_{\text{TMLE}} - \tau) \xrightarrow{d} \mathcal{N}(0, V^*),$$

where $V^*$ is the Hahn (1998) bound. The proof uses the same product-of-errors decomposition as §6.3, with the additional observation that $\hat\epsilon_d = O_P(n^{-1/2})$ and hence asymptotically negligible.

Finite-sample regimes where TMLE wins over AIPW: (i) when the outcome model is misspecified but the propensity is correct, TMLE absorbs the augmentation correction into the regression, giving marginally tighter variance at small $n$; (ii) for bounded outcomes, the logistic submodel keeps $\hat\mu_d^*(X) \in [0, 1]$ by construction.

Simulation readout: AIPW mean τ̂ = 0.9978, TMLE mean τ̂ = 0.9963; mean |augmentation term| = 3.41e-2 for AIPW vs 5.17e-17 for TMLE — zero to floating-point precision, as the targeting equation requires.

§8 Double / debiased Machine Learning (DML)

DML (Chernozhukov et al. 2018) lets us use flexible ML nuisance estimators — lasso, random forests, neural nets — and still get n\sqrt n-asymptotic-normal inference. Two ingredients: Neyman orthogonality (already in the AIPW score) and cross-fitting.

Why naive plug-in fails with ML nuisance. When η^\hat\eta and the score are computed on the same data, in-sample residuals are correlated with the in-sample fit; the resulting overfitting bias can decay more slowly than n1/2n^{-1/2}, swamping the CLT fluctuation at the n\sqrt n scale. For parametric estimators with p/n0p/n \to 0, the overfit bias is OP(p/n)O_P(p/n) and negligible. For ML estimators that can memorize or interpolate (RF, NN, lasso boundaries with CV), it is nontrivial.

Neyman orthogonality. The AIPW score has the structural property that the first-order variation of the expected score under nuisance perturbations is identically zero at the truth: tE[ψ(W;τ,η0+th)]t=0=0\partial_t \mathbb{E}[\psi(W; \tau, \eta_0 + t h)]|_{t=0} = 0 for every direction hh. We verify this component-by-component for the AIPW score:

(i) Perturbing ee. The ee-dependent terms are D(Yμ1)/eD(Y - \mu_1)/e and (1D)(Yμ0)/(1e)-(1-D)(Y - \mu_0)/(1-e). Differentiating: eψ[he]=D(Yμ1,0)he/e02(1D)(Yμ0,0)he/(1e0)2\partial_e \psi[h_e] = -D(Y - \mu_{1,0}) h_e / e_0^2 - (1-D)(Y - \mu_{0,0}) h_e / (1 - e_0)^2. At the truth, conditioning on XX: E[D(Yμ1,0)X]=e0(X)E[Y(1)μ1,0(X)X]=0\mathbb{E}[D(Y - \mu_{1,0}) \mid X] = e_0(X) \cdot \mathbb{E}[Y(1) - \mu_{1,0}(X) \mid X] = 0 by ignorability and the definition of μ1,0\mu_{1,0}. The control term vanishes symmetrically.

(ii) Perturbing μ1\mu_1. The μ1\mu_1-dependent terms are +μ1+\mu_1 and Dμ1/e-D\mu_1/e. Differentiating: μ1ψ[hμ1]=hμ1(X)(1D/e(X))\partial_{\mu_1} \psi[h_{\mu_1}] = h_{\mu_1}(X)(1 - D/e(X)). At the truth, E[1D/e0(X)X]=1e0(X)/e0(X)=0\mathbb{E}[1 - D/e_0(X) \mid X] = 1 - e_0(X)/e_0(X) = 0.

(iii) Perturbing μ0\mu_0. By symmetry, E[μ0ψ]=0\mathbb{E}[\partial_{\mu_0} \psi] = 0.

All three directional derivatives vanish — the AIPW score is Neyman-orthogonal at the true nuisances. Consequence: the Taylor expansion of E[ψ(W;τ0,η^)]\mathbb{E}[\psi(W; \tau_0, \hat\eta)] around η0\eta_0 has the first-order term vanishing, leaving an O(η^η02)O(\|\hat\eta - \eta_0\|^2) remainder. If η^η0=oP(n1/4)\|\hat\eta - \eta_0\| = o_P(n^{-1/4}), this is oP(n1/2)o_P(n^{-1/2}) — asymptotically negligible relative to the n\sqrt n-rate CLT fluctuation.

Cross-fitting. The Neyman-orthogonality argument requires η^\hat\eta and the score evaluation to be independent. Sample-splitting restores independence: partition {1,,n}\{1, \ldots, n\} into KK folds; for each fold, fit nuisance on the other K1K-1 folds and evaluate the score on the held-out fold; average. Common K=5K = 5 or K=10K = 10.
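The cross-fitting recipe can be sketched in a few lines, assuming scikit-learn for the nuisance fits (a logistic propensity and per-arm linear outcome regressions — both well-specified for this toy DGP, so the example isolates the fold mechanics rather than the ML flexibility):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(3)
n, p, tau = 2000, 5, 1.0
X = rng.normal(size=(n, p))
e = 1 / (1 + np.exp(-X[:, 0]))
D = rng.binomial(1, e)
Y = tau * D + X @ np.array([0.5, 1.0, 0.0, 0.0, -0.5]) + rng.normal(size=n)

scores = np.empty(n)
for train, test in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    # Nuisances fit on the complement folds, score evaluated on the held-out fold
    e_hat = (LogisticRegression(max_iter=1000)
             .fit(X[train], D[train])
             .predict_proba(X[test])[:, 1]
             .clip(0.01, 0.99))
    mu1 = (LinearRegression()
           .fit(X[train][D[train] == 1], Y[train][D[train] == 1])
           .predict(X[test]))
    mu0 = (LinearRegression()
           .fit(X[train][D[train] == 0], Y[train][D[train] == 0])
           .predict(X[test]))
    scores[test] = (mu1 - mu0
                    + D[test] * (Y[test] - mu1) / e_hat
                    - (1 - D[test]) * (Y[test] - mu0) / (1 - e_hat))

tau_hat = scores.mean()
se_hat = scores.std(ddof=1) / np.sqrt(n)   # plug-in EIF variance
```

Swapping the linear nuisances for lasso, forests, or networks changes nothing structurally — the fold loop and the score evaluation are the whole harness.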

Theorem 4 (DML guarantee (Chernozhukov et al. 2018)).

Under SUTVA, consistency, ignorability, positivity, and the rate condition η^(k)η0L2(P)=oP(n1/4)\|\hat\eta^{(-k)} - \eta_0\|_{L^2(P)} = o_P(n^{-1/4}) for each fold kk, the DML estimator is asymptotically normal:

n(τ^DMLτ)  d  N(0,V),\sqrt n\,(\hat\tau_{\text{DML}} - \tau) \;\xrightarrow{d}\; \mathcal{N}(0, V^*),

where V=E[ψAIPW(W;τ,η0)2]V^* = \mathbb{E}[\psi_{\text{AIPW}}(W; \tau, \eta_0)^2] is the Hahn (1998) semiparametric efficiency bound.

The oP(n1/4)o_P(n^{-1/4}) rate is satisfied by lasso (restricted-eigenvalue + sparsity), random forests (Athey–Wager honest splitting), neural networks (Farrell, Liang, Misra 2021), and series methods. The lasso rate is covered in detail on high-dimensional regression.

Simulation readout (cross-fit DML score): mean τ̂ = 1.013, SD = 0.114.

§9 Instrumental variables

§§4–8 assume ignorability. When an unobserved UU affects both DD and YY, those estimators are biased and no test on (X,D,Y)(X, D, Y) alone reveals it. Two responses: bound the magnitude via sensitivity analysis (§12), or find an instrument ZZ that allows point identification despite the unmeasured confounding.

Exclusion and relevance. An instrument ZZ satisfies (a) Independence: Z ⁣ ⁣ ⁣(Y(0),Y(1),D(0),D(1))Z \perp\!\!\!\perp (Y(0), Y(1), D(0), D(1)) — typically by design (randomized ZZ); (b) Exclusion restriction: Y(d,z)=Y(d)Y(d, z) = Y(d)ZZ affects YY only through DD; (c) Relevance: Cov(Z,D)0\operatorname{Cov}(Z, D) \neq 0ZZ moves the treatment, testable from data. When all three hold, the Wald estimator τ^Wald=(YˉZ=1YˉZ=0)/(DˉZ=1DˉZ=0)\hat\tau_{\text{Wald}} = (\bar Y_{Z=1} - \bar Y_{Z=0})/(\bar D_{Z=1} - \bar D_{Z=0}) identifies a causal effect: under homogeneous effects, τ\tau directly; under heterogeneity with monotonicity, the LATE. Wald generalizes to two-stage least squares (2SLS) for multi-valued ZZ or with controls XX.

Theorem 5 (LATE under monotonicity (Imbens & Angrist 1994)).

Add (d) Monotonicity: Pr(D(1)D(0))=1\Pr(D(1) \geq D(0)) = 1 — no defiers. Then

τ^WaldpτLATE=E[Y(1)Y(0)complier].\hat\tau_{\text{Wald}} \xrightarrow{p} \tau_{\text{LATE}} = \mathbb{E}[Y(1) - Y(0) \mid \text{complier}].
Proof.

The population partitions into four principal strata by (D(0),D(1))(D(0), D(1)): never-takers (0,0)(0, 0), always-takers (1,1)(1, 1), compliers (0,1)(0, 1), defiers (1,0)(1, 0) — the last empty by monotonicity. Compute the numerator and denominator of the Wald ratio in turn.

Numerator. By independence + exclusion, E[YZ=1]E[YZ=0]=E[Y(D(1))Y(D(0))]\mathbb{E}[Y \mid Z = 1] - \mathbb{E}[Y \mid Z = 0] = \mathbb{E}[Y(D(1)) - Y(D(0))]. Partition by stratum and apply consistency:

  • Never-takers: D(0)=D(1)=0D(0) = D(1) = 0, so Y(D(1))Y(D(0))=0Y(D(1)) - Y(D(0)) = 0. Contributes 0.
  • Always-takers: D(0)=D(1)=1D(0) = D(1) = 1, so Y(D(1))Y(D(0))=0Y(D(1)) - Y(D(0)) = 0. Contributes 0.
  • Compliers: D(0)=0D(0) = 0, D(1)=1D(1) = 1, so Y(D(1))Y(D(0))=Y(1)Y(0)Y(D(1)) - Y(D(0)) = Y(1) - Y(0). Contributes (Y(1)Y(0))Pr(complier)(Y(1) - Y(0)) \cdot \Pr(\text{complier}) on average.
  • Defiers: empty by monotonicity.

So E[Y(D(1))Y(D(0))]=E[Y(1)Y(0)complier]Pr(complier)\mathbb{E}[Y(D(1)) - Y(D(0))] = \mathbb{E}[Y(1) - Y(0) \mid \text{complier}] \cdot \Pr(\text{complier}).

Denominator. E[DZ=1]E[DZ=0]=E[D(1)D(0)]\mathbb{E}[D \mid Z = 1] - \mathbb{E}[D \mid Z = 0] = \mathbb{E}[D(1) - D(0)]. Stratum contributions: never-takers 0, always-takers 0, compliers +1+1, defiers excluded. So E[D(1)D(0)]=Pr(complier)\mathbb{E}[D(1) - D(0)] = \Pr(\text{complier}).

Ratio. The complier probability cancels:

τ^Wald  p  E[Y(1)Y(0)complier]Pr(complier)Pr(complier)  =  τLATE.\hat\tau_{\text{Wald}} \;\xrightarrow{p}\; \frac{\mathbb{E}[Y(1) - Y(0) \mid \text{complier}] \cdot \Pr(\text{complier})}{\Pr(\text{complier})} \;=\; \tau_{\text{LATE}}.

LATE ≠ ATE in general: different instruments select different complier populations and yield different LATE estimates. Without additional assumptions (e.g., homogeneous effects), the ATE is not IV-identified.
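The proof's principal-strata bookkeeping is easy to verify by simulation. A sketch under an assumed strata DGP (parameter values ours) with heterogeneous effects — compliers gain 2.0, the other strata 0.5 — in which the Wald ratio recovers the complier effect, not the ATE:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200_000
# Principal strata: never-takers, compliers, always-takers (no defiers)
stratum = rng.choice(["nt", "co", "at"], size=n, p=[0.3, 0.4, 0.3])
Z = rng.binomial(1, 0.5, size=n)                       # randomized instrument
D = np.where(stratum == "at", 1, np.where(stratum == "co", Z, 0))
# Heterogeneous effects: compliers gain 2.0, others would gain 0.5
effect = np.where(stratum == "co", 2.0, 0.5)
Y0 = rng.normal(size=n) + (stratum == "at") * 1.0      # confounded baseline level
Y = Y0 + effect * D

wald = (Y[Z == 1].mean() - Y[Z == 0].mean()) / (D[Z == 1].mean() - D[Z == 0].mean())
ate = effect.mean()                                    # = 0.4 * 2.0 + 0.6 * 0.5 = 1.1
```

The Wald ratio lands near 2.0 (the complier effect) while the population ATE is 1.1 — a concrete instance of the gap the paragraph above describes.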

Weak-IV pathologies. When Cov(Z,D)\operatorname{Cov}(Z, D) is small, the denominator approaches zero, the ratio variance explodes, the finite-sample distribution becomes Cauchy-like, and standard Wald CIs lose coverage. The first-stage F-statistic is the standard diagnostic: Staiger–Stock (1997) and Stock–Yogo (2005) give the rule-of-thumb F>10F > 10; Andrews, Stock, and Sun (2019) tighten this to F>23.1F > 23.1 for nominal 5% Wald coverage. Weak-IV-robust alternatives — the Anderson–Rubin test and its CLR refinement (Moreira 2003) — invert the null into reduced-form tests that preserve size when FF is small. Conley, Hansen, and Rossi (2012) extend the toolkit with sensitivity analysis for partial violations of the exclusion restriction.

Simulation readout: median first-stage F = 12.6 (strong); IQR(τ̂) = 0.860.

§10 Front-door identification

When unobserved confounders prevent back-door adjustment and no instrument is available, the front-door criterion (Pearl 1995) provides identification via a mediator MM on the DYD \to Y pathway.

Conditions: (a) all DYD \to Y paths go through MM — no direct DYD \to Y effect; (b) DMD \to M is unconfounded — no back-door from DD to MM; (c) the back-door from MM to YY is blocked by DD. The canonical example is smoking → tar → cancer with unobserved genetic confounder.

The front-door criterion. Under (a)–(c), Pearl’s do-calculus gives

Pr(Y=ydo(D=d))=mPr(M=mD=d)dPr(Y=yD=d,M=m)Pr(D=d).\Pr(Y = y \mid \operatorname{do}(D = d)) = \sum_m \Pr(M = m \mid D = d) \sum_{d'} \Pr(Y = y \mid D = d', M = m)\,\Pr(D = d').

For binary DD with continuous YY and MM, the ATE is

τ=g(m)[p(mD=1)p(mD=0)]dm,g(m)=dE[YM=m,D=d]Pr(D=d).\tau = \int g(m)\bigl[p(m \mid D = 1) - p(m \mid D = 0)\bigr]\, dm, \qquad g(m) = \sum_{d'} \mathbb{E}[Y \mid M = m, D = d']\,\Pr(D = d').

The notebook implements this via a single OLS fit of YY on (M,D,MD)(M, D, M \cdot D), computes g^(Mi)\hat g(M_i) for each unit, and takes a difference of arm-means; the result tracks the oracle estimator (which knows UU) to within Monte Carlo noise while the naive contrast is biased upward by the confounding through UU.
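The recipe described above — one OLS of YY on (M, D, M·D), then a difference of arm-means of ĝ(M_i) — is a few lines of numpy. A sketch under an assumed linear-Gaussian version of the front-door DGP (the parameter values here are illustrative, chosen so that γ·β_M = 0.8):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100_000
gamma, beta_M, delta_U = 1.0, 0.8, 1.5            # true tau = gamma * beta_M = 0.8
U = rng.normal(size=n)                            # unobserved confounder
D = rng.binomial(1, 1 / (1 + np.exp(-1.5 * U)))
M = gamma * D + rng.normal(size=n)                # D -> M unconfounded
Y = beta_M * M + delta_U * U + rng.normal(size=n) # U confounds D and Y, not M | D

# OLS of Y on (1, M, D, M*D) estimates E[Y | M, D]
Z = np.column_stack([np.ones(n), M, D, M * D])
coef, *_ = np.linalg.lstsq(Z, Y, rcond=None)

def ey(m, d):                                     # fitted E[Y | M=m, D=d]
    return coef[0] + coef[1] * m + coef[2] * d + coef[3] * m * d

pD1 = D.mean()
g = ey(M, 0) * (1 - pD1) + ey(M, 1) * pD1         # g(M_i) = sum_d' E[Y|M,d'] P(D=d')
tau_fd = g[D == 1].mean() - g[D == 0].mean()      # front-door ATE estimate
naive = Y[D == 1].mean() - Y[D == 0].mean()       # biased upward through U
```

The front-door estimate centers at 0.8 while the naive contrast absorbs the U-pathway and lands far above it, mirroring the pattern in the figure below.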

DAG schematic of the front-door criterion (U → D → M → Y, with an unobserved U → Y arrow) and the resulting Monte Carlo distributions of naive, front-door, and oracle estimators on the worked DGP.
§10 front-door DGP: U unobserved, D | U, M | D, Y = β_M M + δ_U U + ε with true τ = γ · β_M = 0.8. The front-door estimator and the oracle (which knows U) both center at 0.8 ± MC noise; the naive contrast is biased upward to ≈ 2.03.

Generalizations. Front-door and back-door are two criteria within Pearl’s do-calculus, the complete identification framework. Shpitser–Pearl (2006) gives a completeness theorem: an effect is identifiable iff do-calculus derives its expression. For longitudinal time-varying treatment with time-varying confounding, Robins’s g-methods generalize the back-door criterion — the g-formula, marginal structural models, structural nested models. Hernán & Robins (2020) is the standard reference.

§11 Heterogeneous treatment effects

§§4–10 targeted scalar estimands. CATE τ(x)=E[Y(1)Y(0)X=x]\tau(x) = \mathbb{E}[Y(1) - Y(0) \mid X = x] is the function-valued target. Under ignorability and positivity, τ(x)=μ1(x)μ0(x)\tau(x) = \mu_1(x) - \mu_0(x). CATE is the substrate for individualized decisions.

Meta-learners. Künzel et al. (2019) taxonomy:

  • S-learner. Single μ^(x,d)\hat\mu(x, d); CATE = μ^(x,1)μ^(x,0)\hat\mu(x, 1) - \hat\mu(x, 0). Biased toward zero when regularization shrinks the DD coefficient.
  • T-learner. Two regressions μ^0\hat\mu_0, μ^1\hat\mu_1 fit separately per arm. Inefficient under arm imbalance, but free of the S-learner's regularization-induced shrinkage toward zero.
  • X-learner (Künzel et al. 2019). Imputed pseudo-effects combined via propensity weights — better at extrapolating across imbalanced arms.
  • R-learner (Nie & Wager 2021). Partial out m^(x)=E[YX]\hat m(x) = \mathbb{E}[Y \mid X] and e^(x)=E[DX]\hat e(x) = \mathbb{E}[D \mid X]; minimize the Robinson residual loss. Neyman-orthogonal at the population level.
  • DR-learner (Kennedy 2020). Cross-fit AIPW pseudo-outcomes Y~i=ψAIPW(Wi;η^(k))\tilde Y_i = \psi_{\text{AIPW}}(W_i; \hat\eta^{(-k)}) — by construction E[Y~X=x]=τ(x)\mathbb{E}[\tilde Y \mid X = x] = \tau(x) — then regress Y~\tilde Y on XX. Inherits the doubly-robust property pointwise. Modern default.
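A minimal DR-learner sketch, assuming scikit-learn: cross-fit AIPW pseudo-outcomes in a fold loop, then a second-stage regression of the pseudo-outcome on XX. The nuisance classes and hyperparameters here are illustrative, not the notebook's.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(6)
n = 4000
X = rng.normal(size=(n, 2))
tau_x = 1.0 + 0.5 * X[:, 0]                      # true CATE, linear in x1
e = 1 / (1 + np.exp(-0.5 * X[:, 1]))
D = rng.binomial(1, e)
Y = tau_x * D + X[:, 1] + rng.normal(size=n)

pseudo = np.empty(n)
for tr, te in KFold(4, shuffle=True, random_state=0).split(X):
    # Cross-fit nuisances, then form the AIPW pseudo-outcome on the held-out fold
    e_hat = (LogisticRegression().fit(X[tr], D[tr])
             .predict_proba(X[te])[:, 1].clip(0.02, 0.98))
    mu1 = (RandomForestRegressor(n_estimators=100, min_samples_leaf=20, random_state=0)
           .fit(X[tr][D[tr] == 1], Y[tr][D[tr] == 1]).predict(X[te]))
    mu0 = (RandomForestRegressor(n_estimators=100, min_samples_leaf=20, random_state=0)
           .fit(X[tr][D[tr] == 0], Y[tr][D[tr] == 0]).predict(X[te]))
    pseudo[te] = (mu1 - mu0 + D[te] * (Y[te] - mu1) / e_hat
                  - (1 - D[te]) * (Y[te] - mu0) / (1 - e_hat))

# Second stage: regress the pseudo-outcome on X to estimate the CATE function
cate = RandomForestRegressor(n_estimators=200, min_samples_leaf=50,
                             random_state=0).fit(X, pseudo)
```

The same pseudo-outcomes averaged over the sample give back the AIPW ATE — the scalar and function-valued targets share one construction.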

Causal forests. Wager & Athey (2018), Athey, Tibshirani, and Wager (2019) adapt random forests to CATE with (i) a treatment-effect splitting criterion (maximize within-leaf treatment-effect variance) and (ii) honest splitting — fit the tree on one half, estimate within-leaf effects on the other. Honesty enables pointwise asymptotic normality (τ^(x)τ(x))/σ^(x)dN(0,1)(\hat\tau(x) - \tau(x))/\hat\sigma(x) \xrightarrow{d} \mathcal{N}(0, 1) via the infinitesimal jackknife.

Figure: heatmaps over a grid with x1 horizontal and x2 vertical, both in [−2, 2] (diverging blue–red color scale centered at the grid midpoint), comparing the true CATE τ(x) = 1 + 0.5 x1 − 0.3 x2 + 0.4 x1 x2 against the T-learner (kernel smoother) and the DR-learner (cross-fit AIPW + smoother).

Why CATE matters. The optimal treatment rule π(x)=1{τ(x)>c}\pi^*(x) = \mathbb{1}\{\tau(x) > c\} depends on the CATE function pointwise. Policy value V(π)=E[Y(π(X))]V(\pi) = \mathbb{E}[Y(\pi(X))] quantifies expected outcome under the rule. Policy learning (Kitagawa & Tetenov 2018; Athey & Wager 2021), heterogeneity tests (Athey & Imbens 2016; Chernozhukov et al. 2018), and subgroup analysis via generalized random forests are all built on CATE estimates.

§12 Sensitivity analysis

The unmeasured-confounding problem. The bias of any back-door-adjusted estimator under unobserved UU scales as EX[(E[UX,D=1]E[UX,D=0])γU]\mathbb{E}_X[(\mathbb{E}[U \mid X, D=1] - \mathbb{E}[U \mid X, D=0]) \cdot \gamma_U], where γU\gamma_U is the partial effect of UU on YY. If either factor is zero, the bias vanishes. Sensitivity analysis asks: how large would the unobserved piece need to be to nullify the observed effect?

Rosenbaum bounds. Parameterize the maximum within-matched-pair odds-ratio gap an unobserved UU can induce by Γ1\Gamma \geq 1. The tipping-point Γ\Gamma is the smallest value at which the worst-case p-value crosses 0.05. Published medical studies typically report tipping points in [1.2,3.0][1.2, 3.0]. Rosenbaum (1987, 2002) is the comprehensive reference.

The E-value (and Cinelli–Hazlett robustness value). For an observed risk ratio RR>1\text{RR} > 1, VanderWeele and Ding (2017) define

E-value=RR+RR(RR1).\text{E-value} = \text{RR} + \sqrt{\text{RR} \cdot (\text{RR} - 1)}.

For RR1\text{RR} \leq 1, replace RR\text{RR} by 1/RR1/\text{RR}. Interpretation: the minimum strength of association, on the risk-ratio scale, that an unmeasured confounder would need with both treatment and outcome to fully explain away the effect. For continuous outcomes, approximate via RRexp(0.91d)\text{RR} \approx \exp(0.91 \, d) for standardized effect size dd. The E-value can be computed at the point estimate and at the CI bound; the latter is more conservative. VanderWeele–Ding benchmarks: 1.5 is modest, 3 moderate, 5+ strong.

For a linear regression Y=τD+Xγ+δZ+εY = \tau D + X\gamma + \delta Z + \varepsilon with unobserved ZZ, the Cinelli–Hazlett (2020) Robustness Value is

RVq=12 ⁣(fq4+4fq2fq2),fq=qt^df,RV_q = \tfrac{1}{2}\!\left(\sqrt{f_q^4 + 4 f_q^2} - f_q^2\right), \qquad f_q = \frac{q \cdot |\hat t|}{\sqrt{df}},

where t^=β^D/SE^\hat t = \hat\beta_D / \widehat{\text{SE}}. RVqRV_q is the confounding strength at which RYZD,X2=RDZX2=RVqR^2_{Y \sim Z \mid D, X} = R^2_{D \sim Z \mid X} = RV_q produces bias equal to qβ^Dq \cdot \hat\beta_D; the conservative RVq,αRV_{q, \alpha} is the strength needed to push the CI to include zero. The two diagnostics live on different scales and should be reported together — the E-value on the risk-ratio scale, the RV on the partial-R2R^2 scale.
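Both diagnostics are closed-form and trivial to package. A sketch of the two formulas above (the function names are ours):

```python
import math

def e_value(rr):
    """VanderWeele-Ding E-value for an observed risk ratio."""
    rr = 1 / rr if rr < 1 else rr              # symmetrize protective effects
    return rr + math.sqrt(rr * (rr - 1))

def robustness_value(t_stat, dof, q=1.0):
    """Cinelli-Hazlett RV_q from the treatment t-statistic and residual dof."""
    f_q = q * abs(t_stat) / math.sqrt(dof)
    return 0.5 * (math.sqrt(f_q ** 4 + 4 * f_q ** 2) - f_q ** 2)

ev = e_value(2.0)                              # ≈ 3.414
rv = robustness_value(t_stat=19.0, dof=1988)   # ≈ 0.34
```

With a t-statistic and degrees of freedom in the rough range of the §13 AIPW fit, the RV comes out near the partial-R² reported there.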

Simulation readout: tipping point at U strength ≳ 1.5.

Practical reporting. Three recommendations (VanderWeele & Ding 2017; Cinelli & Hazlett 2020): (1) always report some sensitivity diagnostic for observational studies — the E-value is the cheapest; (2) pair point estimates with CI-based sensitivity on multiple scales (E-value and RV); (3) calibrate against domain benchmarks. The caveat: sensitivity analysis quantifies robustness to unmeasured confounding, not to misspecification among observed covariates — the §6 doubly-robust property addresses the latter.

§13 Worked example: end-to-end on the Robinson DGP

The full pipeline threads §§4–8 estimators through one estimation routine. The DGP is the canonical Robinson DGP from §1.4: XN(0,I10)X \sim \mathcal{N}(0, I_{10}), e(X)=σ(1.5(0.5X1+Φ(X3)0.3X40.5))e(X) = \sigma(1.5(0.5 X_1 + \Phi(X_3) - 0.3 X_4 - 0.5)), DXBernoulli(e(X))D \mid X \sim \text{Bernoulli}(e(X)), Y=τD+sin(X1)+0.5X22+εY = \tau D + \sin(X_1) + 0.5 X_2^2 + \varepsilon, τ=1\tau = 1.
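The DGP takes only a few lines to generate. A sketch from the stated formulas (the helper name is ours); the naive difference-in-means at the end illustrates the confounding through the first covariate:

```python
import numpy as np
from math import erf

def robinson_dgp(n, tau=1.0, seed=0):
    """Draw one sample from the canonical Robinson DGP as specified above."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, 10))
    Phi = 0.5 * (1 + np.vectorize(erf)(X[:, 2] / np.sqrt(2)))  # standard normal CDF
    e = 1 / (1 + np.exp(-1.5 * (0.5 * X[:, 0] + Phi - 0.3 * X[:, 3] - 0.5)))
    D = rng.binomial(1, e)
    Y = tau * D + np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(size=n)
    return X, D, Y, e

X, D, Y, e = robinson_dgp(5000)
naive = Y[D == 1].mean() - Y[D == 0].mean()   # confounded: overshoots tau = 1
```

Only the first covariate confounds — it enters both the propensity and the outcome — so any of the §§4–8 estimators that adjust for it correctly will pull the naive contrast back toward 1.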

Nuisance fits are lasso CV (logistic with L1 penalty for the propensity, lasso CV per arm for the outcome) and a kernel-smoother RF proxy (the in-browser substitute for the notebook’s full RandomForestRegressor). The forest plot shows IPW (Hájek), OR (g-comp), AIPW, DML-lasso, and DML-RF-proxy point estimates with 95% Wald CIs; the lower panel overlays the Monte Carlo distributions across BB replicates.

Figure panel (B): Monte Carlo distributions across B replicates.

Remark (Cross-check against doubleml).

The Python notebook includes an optional cross-check against the doubleml package on a single sample. Hand-rolled DML-lasso returns τ^=+0.9893\hat\tau = +0.9893 with SE^=0.0637\widehat{\text{SE}} = 0.0637; doubleml.DoubleMLPLR returns τ^=+0.9939\hat\tau = +0.9939 with SE^=0.0592\widehat{\text{SE}} = 0.0592. The two agree to within Monte Carlo noise, as expected — the hand-rolled implementation gives us full control over the cross-fitting harness and the score evaluation, while doubleml is the reference for production work.

Coverage table. Across B=100B = 100 replicates at n=1000n = 1000 on the canonical Robinson DGP, the notebook reports:

  • IPW (Hájek): coverage 0.97, mean SE 0.10.
  • OR (g-comp): coverage 0.07 — the naive plug-in SE ignores first-stage fitting noise; under-coverage is expected, and the §6 AIPW-based variance is the right alternative.
  • AIPW: coverage 0.89, mean SE 0.07.
  • DML-lasso: coverage 0.95, mean SE 0.10.
  • DML-RF: coverage 0.91, mean SE 0.07.

Sensitivity diagnostics on the AIPW estimate. On the single-sample AIPW point estimate at n=2000n = 2000 with τ^=+0.97\hat\tau = +0.97, SE^=0.051\widehat{\text{SE}} = 0.051: E-value (point) = 2.96, E-value (CI lower) = 2.74, RVq=1=0.343RV_{q=1} = 0.343, RVq=1,α=0.05=0.314RV_{q=1, \alpha=0.05} = 0.314. All four diagnostics agree that nullifying this effect would require an unmeasured confounder roughly as strong as a moderate epidemiological exposure-outcome pair (RR ≈ 3 on the risk-ratio scale, or partial R2R^2 of about 0.34 on both the treatment and the outcome residualized on observed covariates).

§14 Connections and limits

Rubin vs Pearl. Two traditions — potential outcomes (Neyman, Rubin) and DAGs (Wright, Pearl). For binary-treatment ATE, they give equivalent identification claims and estimators; they differ in what they make easy to express. The Rubin tradition makes unit-level claims and randomization-based design natural; the Pearl tradition makes structural claims and do-calculus natural. Modern applied work mixes both. The framework chosen for this topic is the potential-outcomes one because it gives the cleanest setup for the §§4–8 estimator zoo; the §10 front-door section pivots to the DAG framework because that is where it lives most naturally.

Longitudinal extensions. Time-varying treatment and time-varying confounding break the static back-door criterion: a covariate measured after an earlier treatment can be both a confounder for later treatment and a mediator for the earlier one. Robins’s g-methods generalize: the longitudinal g-formula, marginal structural models with longitudinal IPW, structural nested models with longitudinal OR. Hernán & Robins (2020) is the standard reference.

What causal inference cannot do. (a) The no-unmeasured-confounders ceiling: no procedure recovers τ\tau from observational data alone when ignorability fails — sensitivity analysis (§12) bounds the magnitude of unobserved confounding required to overturn a finding but cannot identify τ\tau itself. (b) Effects of non-manipulable “treatments” (race, gender, age) require careful specification of the manipulation regime — Holland (1986)'s “no causation without manipulation” formulation is one stance; modern mediation analysis (VanderWeele 2015) is another. (c) Extrapolation beyond observed treatment support requires parametric assumptions outside the framework — IPW, OR, AIPW are silent on what happens for XX values with e(X){0,1}e(X) \in \{0, 1\} on the support boundary.

Forward pointers. Mediation analysis (VanderWeele 2015) extends the potential-outcomes framework to decompose total effects into direct and indirect components. Dynamic treatment regimes (Murphy 2003; Chakraborty & Moodie 2013) extend longitudinally and connect to offline reinforcement learning (Levine et al. 2020) and off-policy evaluation (Thomas & Brunskill 2016). Causal discovery (PC, FCI, NOTEARS — Zheng et al. 2018) tries to learn the DAG from data when it is not known a priori. Online causal inference and bandits (Athey et al. 2022; Hadad et al. 2021) extend the framework to adaptive data collection.

The §§3–8 framework is a building block. Future estimators built on top of sequence models, learned policies, and adaptive data will inherit the AIPW score, the doubly-robust property, Neyman orthogonality, and cross-fitting as substrate — the same way every modern model-selection rule inherits training-error-plus-capacity-penalty from structural risk minimization and every modern generalization bound inherits empirical-process / uniform-convergence machinery from formal empirical-process theory.

Connections

  • PAC-Bayes bounds and the §8.2 Neyman-orthogonality argument share the same Taylor-expansion-around-truth machinery: the first-order term vanishes at the true parameter / true nuisance, leaving a second-order remainder that the Donsker-class structure tames. DML inherits the $\sqrt{n}$ rate from a Neyman-orthogonal score the way PAC-Bayes inherits the dimension-free bound from its orthogonality between the posterior and the prior. pac-bayes-bounds
  • SRM picks a single hypothesis class from a nested family; DML picks a single ATE estimator from a class of cross-fitted nuisance fits. Both rely on the rate of nuisance estimation falling below a critical threshold ($o_P(n^{-1/4})$ for DML, $o_P(n^{-1/2})$ for SRM consistency); both depend on uniform-convergence machinery underwriting the per-instance plug-in argument. SRM is model selection in supervised learning; DML is model selection at the nuisance-estimation step of causal inference. structural-risk-minimization
  • Lasso is the workhorse nuisance estimator in §§8, 13 of this topic; the restricted-eigenvalue + sparsity rate that delivers $o_P(n^{-1/4})$ consistency comes directly from the high-dimensional-regression topic. The §8.2 Neyman-orthogonality argument requires $\|\hat\eta - \eta_0\|_2 = o_P(n^{-1/4})$, which lasso achieves under sparsity (Bickel–Ritov–Tsybakov 2009) covered in that topic. high-dimensional-regression
  • The plug-in sandwich variance for IPW (§4.3), the AIPW EIF variance (§6.4), and the DML $\sqrt{n}$-asymptotic-normality result (§8.4) all rely on standard concentration inequalities (Hoeffding, Bernstein, McDiarmid) for the empirical-process step. The §12 Rosenbaum bound is itself a closed-form Bernoulli concentration result expressed in tipping-point form. concentration-inequalities

References & Further Reading