advanced unsupervised 65 min read

Density-Ratio Estimation

Direct estimation of p/q via the Bregman framework: KMM, KLIEP, uLSIF, and classification-DRE under one convex loss, the Nguyen-Wainwright-Jordan variational lift to neural witnesses, and applications to covariate shift, weighted conformal prediction, and MMD two-sample testing

Motivation: why estimate p/qp/q directly

The density-ratio function r(x)=p(x)/q(x)r(x) = p(x) / q(x) pulls more weight than any single object in modern unsupervised and generative ML. It tells us how to reweight samples from one distribution to look like samples from another (covariate-shift correction); it is the optimal witness for distinguishing two distributions (two-sample testing and GAN discriminators); it is the integrand of the Kullback–Leibler divergence (mutual-information estimation and variational inference). Estimating rr should be easy — we have samples from both pp and qq, and the ratio is one scalar function. Yet the most obvious recipe — fit a density estimator to each sample, then divide — is so badly behaved that an entire family of direct density-ratio estimators grew up to avoid it. This section explains what goes wrong with divide-then-estimate, where the direct approach pays off, and why we built the next twelve sections around it.

The plug-in pathology

The plug-in recipe writes itself: estimate p^\hat p from the pp-samples, estimate q^\hat q from the qq-samples, declare r^(x)=p^(x)/q^(x)\hat r(x) = \hat p(x) / \hat q(x), and call it a day. The trouble is concentrated in the denominator. Wherever qq has little mass — but pp might have a lot — q^(x)\hat q(x) is a small noisy positive number, and dividing by it explodes the noise in r^\hat r. The picture is starkest in the tails of qq that overlap the bulk of pp: the true ratio is large there (that’s exactly the region where reweighting matters most), but the plug-in’s variance is largest there too, so the plug-in tells us nothing reliable about the very region we care about.

We can see this numerically on a setup with a closed-form ratio. Take p=N(0,1)p = \mathcal{N}(0, 1) and q=N(1,1)q = \mathcal{N}(1, 1). The ratio admits a one-line algebra:

r(x)  =  φ(x)φ(x1)  =  exp ⁣(12x),r(x) \;=\; \frac{\varphi(x)}{\varphi(x - 1)} \;=\; \exp\!\left(\tfrac{1}{2} - x\right),

where φ\varphi is the standard normal density. The ratio grows exponentially as xx \to -\infty — at x=3x = -3 it is e3.533e^{3.5} \approx 33 — exactly where q^\hat q from a finite sample is most fragile. The figure below shows the two KDEs alongside the ratio comparison: the plug-in tracks rr inside the bulk and falls apart in the left tail, with log-MSE several orders of magnitude worse there than in the centre.

Two-panel figure: left shows KDE estimates of two shifted Gaussians overlaid on their true densities; right shows the plug-in ratio vs the closed-form ratio on a log y-axis. The plug-in tracks the true ratio in the bulk and diverges in the left tail.
The plug-in KDE-divided-by-KDE estimator of the density ratio for the running toy. Left: KDE estimates track the true densities in the bulk but show finite-sample noise in the tails. Right: the plug-in ratio diverges from the closed-form ratio in the left tail of q where the true ratio is largest.

Three applications in one breath

The reason a community of estimators evolved around rr rather than (p,q)(p, q) is that several questions different ML papers care about turn out to be the same question dressed differently — namely, “what is p(x)/q(x)p(x)/q(x)?”

Covariate shift. Train and test draws have the same conditional p(yx)p(y \mid x) but different marginals: ptrain(x)ptest(x)p_{\text{train}}(x) \ne p_{\text{test}}(x). Shimodaira (2000) showed that the risk-minimization estimator on training data is asymptotically optimal for the test distribution if we reweight each training loss by r(x)=ptest(x)/ptrain(x)r(x) = p_{\text{test}}(x) / p_{\text{train}}(x). The covariate-shift problem is a density-ratio-estimation problem in disguise. We come back to it in §9.

Two-sample testing. Asking “are these two samples from the same distribution?” is asking whether r1r \equiv 1. The maximum mean discrepancy of Gretton et al. (2012) and the classifier-as-statistic constructions of Lopez-Paz and Oquab (2017) both sit on this observation. We come back to it in §11.

Mutual information and KL. The KL divergence is KL(pq)=Ep[logr]\mathrm{KL}(p \,\|\, q) = \mathbb{E}_p[\log r], and mutual information is the KL between the joint and the product of marginals. Estimating either reduces to estimating logr\log r from samples — exactly what the Nguyen–Wainwright–Jordan (2010) variational lower bound and MINE (Belghazi et al. 2018) do. We come back to this view in §8.

What “direct” buys

There is a single recurring intuition behind direct DRE: estimating two unknown functions and dividing them solves two hard problems to answer one easy question. The two densities pp and qq are infinite-dimensional, and density estimation suffers a curse of dimensionality with rate n2/(4+d)n^{-2/(4+d)} for kernel methods at second-order smoothness (Stone 1982). The ratio rr is also a function on Rd\mathbb{R}^d, but in many practical settings rr is much smoother than pp or qq individually — they share the same fine structure, and the ratio cancels it out. Direct DRE attacks rr in a single optimization that is, in a precise sense the Bregman-divergence framework of §3 will make clear, the only loss whose population minimizer is rr. The classic estimators all fall out of choosing different convex ϕ\phi‘s in that one framework.

The other thing “direct” buys is selection by cross-validation against a real loss. We cannot cross-validate a plug-in ratio because we have no loss whose minimum is at the truth; the formalStatistics: Kernel Density Estimation bandwidth that minimizes a density loss is not the bandwidth that minimizes the ratio’s MSE. KLIEP (§5) lets us cross-validate against its own KL objective; uLSIF (§6) gives us analytic leave-one-out — a closed form, no refitting. This is the practical reason uLSIF became the default in industry.

Roadmap

§2 fixes the object and proves the importance-weighting identity that every downstream application depends on. §3 wraps the four estimator families into a single Bregman-divergence loss family, so subsequent sections feel like specializations rather than four unrelated papers. §4–§7 are those four specializations: KMM (kernel mean matching), KLIEP, LSIF / uLSIF, and probabilistic classification. §8 takes the variational ff-divergence view, which subsumes the previous four and lifts to neural witness functions and the GAN family. §9 turns the estimators loose on the canonical downstream application, covariate-shift correction; §10 extends that to conformal prediction under shift via weighted exchangeability. §11 closes the loop with kernel methods by showing MMD as a special-case ratio statistic. §12 collects the practical questions — bandwidth, regularization, effective sample size, where DRE breaks. §13 maps connections to the sister sites and the rest of formalML.

The density-ratio object and the importance-weighting identity

This section fixes the object we’ll spend the rest of the topic estimating, and proves the one identity every downstream section depends on. The identity is a single line of measure-theoretic algebra, but it’s load-bearing: covariate-shift correction (§9), two-sample testing (§11), and the variational ff-divergence view (§8) all dissolve into special cases of it. We work it carefully here so that later sections can call on it without ceremony.

Setup

We work with two probability distributions pp and qq on a measurable space XRd\mathcal{X} \subseteq \mathbb{R}^d, both with densities (with respect to Lebesgue measure unless we flag otherwise). We observe two iid samples:

X1p,,Xnpp  iid  p,X1q,,Xnqq  iid  q,X_1^p, \ldots, X_{n_p}^p \;\stackrel{\text{iid}}{\sim}\; p, \qquad X_1^q, \ldots, X_{n_q}^q \;\stackrel{\text{iid}}{\sim}\; q,

and our goal is to recover the density ratio

r(x)  :=  p(x)q(x),xX,r(x) \;:=\; \frac{p(x)}{q(x)}, \qquad x \in \mathcal{X},

as a function r:XR0r: \mathcal{X} \to \mathbb{R}_{\geq 0}. The asymmetry of the role labels is deliberate: pp is the target distribution we’d like to take expectations under, and qq is the source distribution we have abundant samples from. In covariate shift (§9) pp is the test marginal and qq is the train marginal; in two-sample testing (§11) the role labels collapse and we only care whether r1r \equiv 1.

Two notational alternates worth flagging because they appear in the literature. (i) Some authors estimate w(x)=q(x)/p(x)w(x) = q(x)/p(x), the reciprocal, when their downstream use is reweighting pp-samples to look like qq-samples; the two are interchangeable up to a sign flip in the log. (ii) When the discussion is purely measure-theoretic, rr is written as the Radon–Nikodym derivative dp/dq\mathrm{d}p / \mathrm{d}q and the integrals below become dq\int \cdot \, \mathrm{d}q. We stick with the density-ratio notation throughout because every estimator in §3–§8 works with densities.

The importance-weighting identity

This is the workhorse.

Detour: what pqp \ll q actually asserts. Before stating the identity, we unpack the absolute-continuity hypothesis, because the entire proof leans on it. Consider the probability measures pp and qq as set functions on a σ\sigma-algebra F\mathcal{F} of subsets of X\mathcal{X} (the Borel σ\sigma-algebra in everything that follows). We say pp is absolutely continuous with respect to qq, written pqp \ll q, if every F\mathcal{F}-measurable set AA with q(A)=0q(A) = 0 also has p(A)=0p(A) = 0 — in words, qq vanishing on a region forces pp to vanish there too. The Radon–Nikodym theorem (Royden 1988, Theorem 11.23) then promises that under pqp \ll q there exists a qq-almost-everywhere unique non-negative F\mathcal{F}-measurable function r:XR0r: \mathcal{X} \to \mathbb{R}_{\geq 0} — the Radon–Nikodym derivative, written r=dp/dqr = \mathrm{d}p / \mathrm{d}q — such that

p(A)  =  Ar(x)q(dx)for every AF.p(A) \;=\; \int_A r(x) \, q(\mathrm{d}x) \qquad \text{for every } A \in \mathcal{F}.

For two densities with respect to a common dominating measure (Lebesgue, throughout this topic), this abstract derivative is just the pointwise ratio: r(x)=p(x)/q(x)r(x) = p(x)/q(x) wherever q(x)>0q(x) > 0, extended by zero wherever q(x)=0q(x) = 0. So the Radon–Nikodym derivative and the density ratio agree as functions on X\mathcal{X}.

Theorem 1 (Importance-Weighting Identity).

Let pp and qq be probability densities on X\mathcal{X} with pqp \ll q, and let r=dp/dqr = \mathrm{d}p / \mathrm{d}q. Then for any measurable f:XRf: \mathcal{X} \to \mathbb{R} with Epf(X)<\mathbb{E}_p|f(X)| < \infty,

Ep[f(X)]  =  Eq[r(X)f(X)].\mathbb{E}_p[f(X)] \;=\; \mathbb{E}_q[r(X)\, f(X)].
Proof.

Four moves.

Move 1 — write the pp-expectation as an integral against pp. By definition,

Ep[f(X)]  =  Xf(x)p(dx).\mathbb{E}_p[f(X)] \;=\; \int_{\mathcal{X}} f(x)\, p(\mathrm{d}x).

Move 2 — invoke Radon–Nikodym to rewrite the measure pp as an rr-weighted version of qq. The detour above tells us that for every F\mathcal{F}-measurable set AA, p(A)=Ardqp(A) = \int_A r\, \mathrm{d}q. The standard approximation argument (first indicators, then simple functions, then non-negative measurable functions by monotone convergence, then arbitrary measurable ff by splitting f=f+ff = f^+ - f^- and applying dominated convergence under Epf<\mathbb{E}_p|f| < \infty) lifts this from sets to integrals:

Xf(x)p(dx)  =  Xf(x)r(x)q(dx).\int_{\mathcal{X}} f(x)\, p(\mathrm{d}x) \;=\; \int_{\mathcal{X}} f(x)\, r(x)\, q(\mathrm{d}x).

Move 3 — handle the null set {q=0}\{q = 0\} explicitly. The integrand f(x)r(x)f(x)\, r(x) on the right-hand side is unambiguously defined on {q>0}\{q > 0\} and is irrelevant on {q=0}\{q = 0\} since q({q=0})=0q(\{q = 0\}) = 0 kills its contribution. The convention r:=0r := 0 on {q=0}\{q = 0\} is the standard choice; any other convention agrees qq-almost-everywhere, and integrals don’t care.

Move 4 — recognize the qq-expectation. The right-hand side of Move 2 is, by definition, Eq[r(X)f(X)]\mathbb{E}_q[r(X)\, f(X)]. Combining Moves 1–4 gives the identity. Finiteness of the right-hand side follows from Eqrf=Epf<\mathbb{E}_q|r\, f| = \mathbb{E}_p|f| < \infty.

The identity has two readings. Statistical: if we have qq-samples, we can estimate pp-expectations by reweighting. Geometric: the ratio rr is the unique deformation that pulls qq onto pp. Both readings will reappear.

Corollary 1 (Importance-Sampling Estimator).

With X1q,,XnqqiidqX_1^q, \ldots, X_{n_q}^q \stackrel{\text{iid}}{\sim} q and rr the true ratio, the estimator

μ^IS  :=  1nqi=1nqr(Xiq)f(Xiq)\hat \mu^{\mathrm{IS}} \;:=\; \frac{1}{n_q} \sum_{i=1}^{n_q} r(X_i^q)\, f(X_i^q)

is unbiased for Ep[f(X)]\mathbb{E}_p[f(X)] and has variance nq1Varq(r(X)f(X))n_q^{-1} \mathrm{Var}_q(r(X) f(X)).

Proof.

Unbiasedness is the identity above applied inside the expectation of each summand. The variance formula is the iid variance of a sample mean.

Absolute continuity and the support condition

The identity requires pqp \ll q. In density terms this is the support condition: supp(p)supp(q)\operatorname{supp}(p) \subseteq \operatorname{supp}(q) — every region where pp puts mass must also be a region where qq puts mass. If the support condition fails, rr is undefined (or infinite) on a set of positive pp-measure, and no finite-sample estimator can hope to recover it there: the qq-samples never visit that region. Direct DRE methods (§4–§7) silently inherit this requirement; classification-based DRE (§7) makes it operationally visible because the classifier fails to separate the classes only where both supports overlap.

Absolute continuity is necessary but not sufficient for μ^IS\hat \mu^{\mathrm{IS}} to be useful. Even when rr is everywhere finite, the IS estimator’s variance can be infinite. Setting f1f \equiv 1 in the corollary’s variance formula gives

nqVar(μ^IS)  =  Varq(r(X))  =  Eq[r(X)2]1.n_q \cdot \mathrm{Var}(\hat \mu^{\mathrm{IS}}) \;=\; \mathrm{Var}_q(r(X)) \;=\; \mathbb{E}_q[r(X)^2] - 1.

The quantity Eq[r(X)2]1\mathbb{E}_q[r(X)^2] - 1 is exactly the chi-squared divergence χ2(pq)\chi^2(p \,\|\, q). So the IS estimator has finite variance for the simplest possible ff (the constant one) if and only if χ2(pq)<\chi^2(p \,\|\, q) < \infty. For our running shift-Gaussian toy with p=N(0,1)p = \mathcal{N}(0, 1) and q=N(μq,1)q = \mathcal{N}(\mu_q, 1), a one-page calculation gives

χ2(pq)  =  eμq21,\chi^2(p \,\|\, q) \;=\; e^{\mu_q^2} - 1,

which is gentle for μq=1\mu_q = 1 (χ21.7\chi^2 \approx 1.7) but ferocious for μq=3\mu_q = 3 (χ28102\chi^2 \approx 8102). The IS estimator’s variance at μq=3\mu_q = 3 is roughly 5000×5000\times that at μq=1\mu_q = 1 — even though the true ratio exists and is finite everywhere.

The runtime diagnostic that picks up this pathology in practice is the effective sample size

neff  :=  (i=1nqwi)2i=1nqwi2,wi=r(Xiq),n_{\text{eff}} \;:=\; \frac{\left( \sum_{i=1}^{n_q} w_i \right)^2}{\sum_{i=1}^{n_q} w_i^2}, \qquad w_i = r(X_i^q),

which ranges from 11 (one weight dominates) to nqn_q (all weights equal). It estimates how many “effectively-independent” pp-samples our weighted qq-sample is worth. When neffnqn_{\text{eff}} \ll n_q, the IS estimator is unreliable regardless of what nqn_q says.

Numerical hook: verifying the identity and watching it strain

We verify the identity numerically with the closed-form ratio from §1, then re-run the experiment under a larger shift to watch the variance and neffn_{\text{eff}} degrade. Target: Ep[X2]\mathbb{E}_p[X^2], which equals 11 in closed form for p=N(0,1)p = \mathcal{N}(0, 1). We compare two estimators: the direct sample mean from pp-samples, and the importance-weighted sample mean from qq-samples using the true ratio.

Two-panel figure showing Monte Carlo convergence of the importance-sampling estimator vs the direct estimator of the second moment of p at two different shift levels. The IS estimator has wider confidence bands, especially at the larger shift.
The IS estimator from q-samples weighted by the true ratio is unbiased for E_p[X squared], but its 95 percent confidence band is wider than the direct estimator. The gap grows with the shift mu_q because the chi-squared divergence governs the IS variance, not the sample size.
import numpy as np

rng = np.random.default_rng(42)
def is_estimator(mu_q, n):
    x_q = rng.normal(mu_q, 1.0, size=n)
    w = np.exp(0.5 * mu_q**2 - mu_q * x_q)  # true r at x_q
    return (w * x_q**2).mean()

def direct_estimator(n):
    x_p = rng.normal(0.0, 1.0, size=n)
    return (x_p**2).mean()

At μq=1\mu_q = 1, the IS estimator’s standard error at n=104n = 10^4 is about 3.5×3.5\times the direct estimator’s; at μq=3\mu_q = 3, it’s about 80×80\times. The empirical neffn_{\text{eff}} at n=104n = 10^4 collapses from 3500\sim 3500 to 20\sim 20 between the two shifts — the IS estimator effectively sees only a few dozen “useful” samples in the high-shift regime.

The Bregman-divergence loss family for DRE

This section is what turns the next four sections from “four unrelated 2008–2012 papers” into “four specializations of one framework.” Sugiyama, Suzuki, and Kanamori (2012, Chapter 5) showed that every classical direct-DRE estimator — KMM, KLIEP, LSIF, uLSIF, and probabilistic classification — corresponds to a choice of strictly convex generator ϕ\phi in a single Bregman-divergence loss. The framework promises three things at once: a unified derivation, a unified convergence theory (each loss is convex, with the true ratio as its unique population minimizer), and an unambiguous cross-validation criterion (each loss can be estimated unbiasedly from samples without knowing rr). This section proves the master identity and shows how the LSIF and KLIEP objectives drop out of it; §3.4 previews the logistic-loss specialization that §7 will unpack in full.

This section has no figure by design — the numerical hook is a small sanity-check cell at the end that verifies, on the running toy, that both the LSIF and KLIEP empirical objectives bottom out at the true ratio.

Bregman divergences and the master identity

A Bregman divergence generated by a strictly convex differentiable function ϕ:R>0R\phi: \mathbb{R}_{>0} \to \mathbb{R} is the function BRϕ:R>0×R>0R0\mathrm{BR}_\phi : \mathbb{R}_{>0} \times \mathbb{R}_{>0} \to \mathbb{R}_{\geq 0} defined by

BRϕ(u,v)  :=  ϕ(u)    ϕ(v)    ϕ(v)(uv).\mathrm{BR}_\phi(u, v) \;:=\; \phi(u) \;-\; \phi(v) \;-\; \phi'(v)\,(u - v).

Geometrically, ϕ(v)+ϕ(v)(uv)\phi(v) + \phi'(v)(u - v) is the tangent line to ϕ\phi at vv evaluated at uu, so BRϕ(u,v)\mathrm{BR}_\phi(u, v) measures how much ϕ(u)\phi(u) exceeds that tangent at uu. Convexity gives BRϕ(u,v)0\mathrm{BR}_\phi(u, v) \geq 0, with equality iff u=vu = v (strict convexity is what rules out flat segments). Three examples set the pattern:

  • ϕ(t)=12t2\phi(t) = \tfrac{1}{2} t^2: BRϕ(u,v)=12(uv)2\mathrm{BR}_\phi(u, v) = \tfrac{1}{2}(u - v)^2 — squared loss.
  • ϕ(t)=tlogtt\phi(t) = t \log t - t: BRϕ(u,v)=ulog(u/v)u+v\mathrm{BR}_\phi(u, v) = u \log(u/v) - u + v — KL between two positive scalars.
  • ϕ(t)=logt\phi(t) = -\log t: BRϕ(u,v)=u/vlog(u/v)1\mathrm{BR}_\phi(u, v) = u/v - \log(u/v) - 1 — Itakura–Saito divergence.

To lift the pointwise divergence to a discrepancy between functions rr and gg, integrate against qq:

Dϕ(r,g)  :=  Eq[BRϕ(r(X),g(X))]  =  Eq[ϕ(r)ϕ(g)ϕ(g)(rg)].D_\phi(r, g) \;:=\; \mathbb{E}_q\bigl[\mathrm{BR}_\phi\bigl(r(X), g(X)\bigr)\bigr] \;=\; \mathbb{E}_q\bigl[\phi(r) - \phi(g) - \phi'(g)\,(r - g)\bigr].

This DϕD_\phi is the target functional. Its minimum over gg is 00, attained at g=rg = r. But we can’t compute it directly because rr appears inside ϕ(r)\phi(r) and ϕ(g)r\phi'(g) r, and rr is unknown. The master identity converts DϕD_\phi into a computable objective.

Theorem 2 (Master DRE Identity (Sugiyama–Suzuki–Kanamori 2012, Theorem 5.1)).

Let pqp \ll q and r=p/qr = p/q. For any measurable g:XR>0g: \mathcal{X} \to \mathbb{R}_{>0},

Dϕ(r,g)  =  Eq[ϕ(r(X))]constant in g  +  Jϕ(g),D_\phi(r, g) \;=\; \underbrace{\mathbb{E}_q\bigl[\phi(r(X))\bigr]}_{\text{constant in } g} \;+\; J_\phi(g),

where

Jϕ(g)  :=  Eq[ϕ(g(X))g(X)ϕ(g(X))]    Ep[ϕ(g(X))].J_\phi(g) \;:=\; \mathbb{E}_q\bigl[\phi'(g(X))\, g(X) - \phi(g(X))\bigr] \;-\; \mathbb{E}_p\bigl[\phi'(g(X))\bigr].
Proof.

Expand the definition of DϕD_\phi:

Dϕ(r,g)  =  Eq[ϕ(r)]    Eq[ϕ(g)]    Eq[ϕ(g)(rg)].D_\phi(r, g) \;=\; \mathbb{E}_q[\phi(r)] \;-\; \mathbb{E}_q[\phi(g)] \;-\; \mathbb{E}_q[\phi'(g)\, (r - g)].

Distribute the third term: Eq[ϕ(g)(rg)]=Eq[ϕ(g)r]Eq[ϕ(g)g]\mathbb{E}_q[\phi'(g)\, (r - g)] = \mathbb{E}_q[\phi'(g)\, r] - \mathbb{E}_q[\phi'(g)\, g]. Substitute back:

Dϕ(r,g)  =  Eq[ϕ(r)]    Eq[ϕ(g)]    Eq[ϕ(g)r]  +  Eq[ϕ(g)g].D_\phi(r, g) \;=\; \mathbb{E}_q[\phi(r)] \;-\; \mathbb{E}_q[\phi(g)] \;-\; \mathbb{E}_q[\phi'(g)\, r] \;+\; \mathbb{E}_q[\phi'(g)\, g].

The middle term Eq[ϕ(g)r]\mathbb{E}_q[\phi'(g)\, r] is exactly an importance-weighted expectation: by the identity in §2.2 with f=ϕgf = \phi' \circ g,

Eq[ϕ(g(X))r(X)]  =  Ep[ϕ(g(X))].\mathbb{E}_q[\phi'(g(X))\, r(X)] \;=\; \mathbb{E}_p[\phi'(g(X))].

Substitute and collect terms in gg:

Dϕ(r,g)  =  Eq[ϕ(r)]  +  {Eq[ϕ(g)gϕ(g)]    Ep[ϕ(g)]}.D_\phi(r, g) \;=\; \mathbb{E}_q[\phi(r)] \;+\; \bigl\{ \mathbb{E}_q[\phi'(g)\, g - \phi(g)] \;-\; \mathbb{E}_p[\phi'(g)] \bigr\}.

The first term is constant in gg; the brace is Jϕ(g)J_\phi(g) by definition.

Two consequences. First, JϕJ_\phi is computable from samples — we replace the two expectations with empirical analogues

J^ϕ(g)  =  1nqi=1nq[ϕ(g(Xiq))g(Xiq)ϕ(g(Xiq))]    1npj=1npϕ(g(Xjp)),\hat J_\phi(g) \;=\; \frac{1}{n_q} \sum_{i=1}^{n_q} \bigl[\phi'(g(X_i^q)) g(X_i^q) - \phi(g(X_i^q))\bigr] \;-\; \frac{1}{n_p} \sum_{j=1}^{n_p} \phi'(g(X_j^p)),

and the estimator is unbiased for Jϕ(g)J_\phi(g). Second, JϕJ_\phi inherits convexity from ϕ\phi — for a fixed gg-parametrization that’s linear in gg, the empirical J^ϕ\hat J_\phi is convex in the parameters, so first-order methods find the unique minimizer. The two consequences together are what makes the Bregman framework operational: pick ϕ\phi for the downstream use, optimize J^ϕ\hat J_\phi, done.

Squared loss recovers LSIF

Choose ϕ(t)=12t2\phi(t) = \tfrac{1}{2} t^2. Then ϕ(t)=t\phi'(t) = t, and

ϕ(t)tϕ(t)  =  t212t2  =  12t2.\phi'(t)\, t - \phi(t) \;=\; t^2 - \tfrac{1}{2} t^2 \;=\; \tfrac{1}{2} t^2.

The master identity collapses to

JLSIF(g)  =  12Eq[g(X)2]    Ep[g(X)],J_{\mathrm{LSIF}}(g) \;=\; \tfrac{1}{2}\, \mathbb{E}_q[g(X)^2] \;-\; \mathbb{E}_p[g(X)],

with sample version J^LSIF(g)=(1/(2nq))ig(Xiq)2(1/np)jg(Xjp)\hat J_{\mathrm{LSIF}}(g) = (1/(2 n_q)) \sum_i g(X_i^q)^2 - (1/n_p) \sum_j g(X_j^p). Population stationarity in gg gives δJLSIF/δg=qgp=0\delta J_{\mathrm{LSIF}}/\delta g = q g - p = 0, so g=p/q=rg^* = p/q = r, as the framework requires.

The LSIF specialization gets a closed form for linear-in-parameters models. Parametrize g(x)=αψ(x)g(x) = \boldsymbol{\alpha}^\top \boldsymbol{\psi}(x) with ψ:XRb\boldsymbol{\psi}: \mathcal{X} \to \mathbb{R}^b a fixed basis (we’ll typically take ψ\boldsymbol{\psi} to be Gaussian kernels centred at a subsample). Plugging into the sample objective,

J^LSIF(α)  =  12αH^α    h^α,\hat J_{\mathrm{LSIF}}(\boldsymbol{\alpha}) \;=\; \tfrac{1}{2}\, \boldsymbol{\alpha}^\top \hat{\mathbf{H}}\, \boldsymbol{\alpha} \;-\; \hat{\mathbf{h}}^\top \boldsymbol{\alpha},

where

H^  =  1nqi=1nqψ(Xiq)ψ(Xiq)Rb×b,h^  =  1npj=1npψ(Xjp)Rb.\hat{\mathbf{H}} \;=\; \frac{1}{n_q} \sum_{i=1}^{n_q} \boldsymbol{\psi}(X_i^q) \boldsymbol{\psi}(X_i^q)^\top \in \mathbb{R}^{b \times b}, \qquad \hat{\mathbf{h}} \;=\; \frac{1}{n_p} \sum_{j=1}^{n_p} \boldsymbol{\psi}(X_j^p) \in \mathbb{R}^b.

H^\hat{\mathbf{H}} is positive semi-definite, so this is a convex quadratic. The minimizer is the linear system H^α^=h^\hat{\mathbf{H}}\, \hat{\boldsymbol{\alpha}} = \hat{\mathbf{h}}, with the regularized version (H^+λI)α^=h^(\hat{\mathbf{H}} + \lambda \mathbf{I}) \hat{\boldsymbol{\alpha}} = \hat{\mathbf{h}} that yields α^=(H^+λI)1h^\hat{\boldsymbol{\alpha}} = (\hat{\mathbf{H}} + \lambda \mathbf{I})^{-1} \hat{\mathbf{h}}. §6 picks up here.

KL loss recovers KLIEP

Choose ϕ(t)=tlogtt\phi(t) = t \log t - t. Then ϕ(t)=logt\phi'(t) = \log t, and

ϕ(t)tϕ(t)  =  tlogt(tlogtt)  =  t.\phi'(t)\, t - \phi(t) \;=\; t \log t - (t \log t - t) \;=\; t.

The master identity gives

JKLIEPunc(g)  =  Eq[g(X)]    Ep[logg(X)],J_{\mathrm{KLIEP}}^{\mathrm{unc}}(g) \;=\; \mathbb{E}_q[g(X)] \;-\; \mathbb{E}_p[\log g(X)],

the unconstrained KLIEP objective. Population stationarity: δJ/δg=qp/g=0\delta J / \delta g = q - p/g = 0, so g=p/q=rg^* = p/q = r.

The original KLIEP (Sugiyama et al. 2008) presents the constrained form,

maxg  Ep[logg(X)]subject toEq[g(X)]=1,\max_g \; \mathbb{E}_p[\log g(X)] \quad \text{subject to} \quad \mathbb{E}_q[g(X)] = 1,

which is mathematically equivalent (the equality constraint is exactly the first-order condition that the multiplier in front of Eq[g]\mathbb{E}_q[g] equal 11). The constrained form has a tighter feasible set and is what Sugiyama et al.’s projected-gradient algorithm operates on; the unconstrained form is what gradient methods see directly. §5 covers both.

Logistic loss recovers probabilistic classification (preview)

The probabilistic-classification approach to DRE proceeds by pooling pp-samples (label y=1y = 1) and qq-samples (label y=0y = 0) and fitting a classifier η^(x)=logP(y=1x)/P(y=0x)\hat \eta(x) = \log \mathbb{P}(y = 1 \mid x) / \mathbb{P}(y = 0 \mid x). The Bayes-rule reformulation in §7.2 will give

r(x)  =  nqnpexp ⁣(η^(x)),r(x) \;=\; \frac{n_q}{n_p}\, \exp\!\bigl(\hat \eta(x)\bigr),

and the classifier’s training loss — logistic cross-entropy on the pooled labels — falls into the Bregman family with generator

ϕLR(t)  =  tlogt(1+t)log(1+t)+(1+t)log2,\phi_{\mathrm{LR}}(t) \;=\; t \log t - (1 + t) \log(1 + t) + (1 + t)\log 2,

modulo the normalization conventions catalogued in Menon and Ong (2016). We don’t unpack this here because §7 owns the full derivation, the calibration discussion, and the numerical comparison with uLSIF. The point for §3 is that the framework is complete: KMM (§4, via a kernelized squared-loss view), LSIF/uLSIF, KLIEP, and probabilistic classification all live inside one Bregman generator ϕ\phi, and the choice of ϕ\phi controls the bias–variance trade-off rather than the algorithm class.

KMM — kernel mean matching

§4 is the first section in which we estimate something. Kernel mean matching (Huang, Smola, Gretton, Borgwardt, and Schölkopf 2007) was the first widely-used direct DRE method, and it takes a sample-reweighting view of the problem: rather than estimate a function r()r(\cdot) on all of X\mathcal{X}, KMM estimates only the nqn_q numbers w1,,wnqw_1, \ldots, w_{n_q} that approximate {r(Xiq)}i=1nq\{r(X_i^q)\}_{i=1}^{n_q}. These weights are the only quantities a downstream importance-weighted estimator (covariate-shift correction in §9, weighted conformal in §10) ever asks for, so producing a function r()r(\cdot) on the whole space is unnecessary effort. KMM trades that out-of-sample generality for a tighter sample-only optimization with a clean QP structure and a population-level guarantee that the kernel-mean discrepancy is zero exactly at the true ratio.

The reweighted moment-matching objective in an RKHS

Let k:X×XRk: \mathcal{X} \times \mathcal{X} \to \mathbb{R} be a positive-definite kernel and let H\mathcal{H} be its reproducing kernel Hilbert space. For any probability measure PP on X\mathcal{X} satisfying EXP[k(X,X)]<\mathbb{E}_{X \sim P}[\sqrt{k(X, X)}] < \infty, the kernel mean embedding of PP is

μP  :=  EXP[k(X,)]    H.\mu_P \;:=\; \mathbb{E}_{X \sim P}[k(X, \cdot)] \;\in\; \mathcal{H}.

μP\mu_P is an element of H\mathcal{H}; the reproducing property gives μP,fH=EP[f(X)]\langle \mu_P, f \rangle_\mathcal{H} = \mathbb{E}_P[f(X)] for every fHf \in \mathcal{H}, so μP\mu_P encodes all moments of PP that H\mathcal{H} can witness.

Definition 1 (Characteristic Kernel).

A positive-definite kernel kk is characteristic if the kernel-mean-embedding map PμPP \mapsto \mu_P is injective on the space of probability measures — equivalently, μP=μQP=Q\mu_P = \mu_Q \Rightarrow P = Q.

The Gaussian RBF k(x,y)=exp(xy2/(2σk2))k(x, y) = \exp(-\|x - y\|^2 / (2 \sigma_k^2)) is characteristic on Rd\mathbb{R}^d for any σk>0\sigma_k > 0 (Sriperumbudur et al. 2010). This is the property that makes both KMM identifiable (§4) and the MMD a valid two-sample test statistic (§11).

The KMM construction is to find a non-negative weight function g:XR0g: \mathcal{X} \to \mathbb{R}_{\geq 0} such that the kernel mean of the gg-reweighted source distribution matches the kernel mean of the target:

μqg  :=  Eq[g(X)k(X,)]  =?  μp.\mu_q^g \;:=\; \mathbb{E}_q[g(X)\, k(X, \cdot)] \;\stackrel{?}{=}\; \mu_p.

By the importance-weighting identity, μqg=μp\mu_q^g = \mu_p holds at g=rg = r, since

Eq[r(X)k(X,)]  =  Ep[k(X,)]  =  μp.\mathbb{E}_q[r(X)\, k(X, \cdot)] \;=\; \mathbb{E}_p[k(X, \cdot)] \;=\; \mu_p.

The population KMM objective is the squared H\mathcal{H}-norm of the gap:

JKMM(g)  :=  12μqgμpH2.J_{\mathrm{KMM}}(g) \;:=\; \tfrac{1}{2}\, \big\| \mu_q^g - \mu_p \big\|^2_\mathcal{H}.

JKMMJ_{\mathrm{KMM}} is convex and non-negative, and for a characteristic kernel the population minimizer is g=rg^\star = r (uniquely qq-a.e.). The injectivity of the kernel-mean embedding does the work: JKMM(g)=0J_{\mathrm{KMM}}(g) = 0 implies Eq[g(X)k(X,)]=Ep[k(X,)]\mathbb{E}_q[g(X) k(X, \cdot)] = \mathbb{E}_p[k(X, \cdot)], which says gq=pg \cdot q = p as measures, hence g=p/qg = p/q qq-a.e.

The quadratic program and its KKT conditions

KMM works on the empirical version, replacing population kernel means with sample averages and estimating a single weight wi0w_i \geq 0 for each qq-sample. Expanding the squared H\mathcal{H}-norm via the reproducing property and rescaling, the empirical objective becomes

12wKwκw,\frac{1}{2}\, \mathbf{w}^\top \mathbf{K}\, \mathbf{w} - \boldsymbol{\kappa}^\top \mathbf{w},

where Kii=k(Xiq,Xiq)\mathbf{K}_{ii'} = k(X_i^q, X_{i'}^q) and κi=(nq/np)jk(Xiq,Xjp)\boldsymbol{\kappa}_i = (n_q / n_p) \sum_j k(X_i^q, X_j^p). Adding the two practical constraints, the KMM quadratic program is

minwRnq12wKwκwsubject to0wiB,i=1,,nq,1nqiwi1ϵ.\begin{aligned} \min_{\mathbf{w} \in \mathbb{R}^{n_q}} \quad & \tfrac{1}{2}\, \mathbf{w}^\top \mathbf{K}\, \mathbf{w} - \boldsymbol{\kappa}^\top \mathbf{w} \\ \text{subject to} \quad & 0 \le w_i \le B, \quad i = 1, \ldots, n_q, \\ & \Big| \tfrac{1}{n_q} \textstyle\sum_i w_i - 1 \Big| \le \epsilon. \end{aligned}

K\mathbf{K} is positive semi-definite (it’s a Gram matrix), so the objective is convex; the feasible set is a polytope; the QP has a unique optimum (modulo the kernel of K\mathbf{K}, which collapses if the kernel is strictly positive definite, as the Gaussian RBF is).

The KKT stationarity condition reduces, for an interior optimum with no active bounds and slack normalization, to w=K1κ\mathbf{w}^\star = \mathbf{K}^{-1} \boldsymbol{\kappa}. Pure-interior optima do happen in practice, but the bound constraints typically activate on a handful of indices — those are the qq-samples where the unconstrained QP wants a negative or pathologically large weight, and KMM clips them to the polytope boundary instead.

Constraints — why BB and ϵ\epsilon exist, and what they cost

The capping constraint wiBw_i \le B. Without an upper bound, individual weights can grow arbitrarily large when a qq-sample sits in a region where pp has substantial mass but few other qq-samples cover. The unbounded QP responds by inflating a single wiw_i to absorb that mismatch, which produces an estimator whose effective sample size collapses. The capping bound BB is a hard prior on the maximum allowable weight; Huang et al. recommend B[10,103]B \in [10, 10^3] in practice.

The normalization constraint nq1iwi1ϵ|n_q^{-1} \sum_i w_i - 1| \le \epsilon. The true ratio satisfies Eq[r(X)]=Ep[1]=1\mathbb{E}_q[r(X)] = \mathbb{E}_p[1] = 1, so a faithful weight vector should average to roughly 11. The relaxed inequality version with slack ϵ\epsilon accommodates finite-sample fluctuations; Huang et al. recommend ϵ=(nq1)/nq\epsilon = (\sqrt{n_q} - 1)/\sqrt{n_q}.

Bandwidth selection. The Gaussian kernel needs a bandwidth σk\sigma_k. The median heuristic — set σk=median{XiXj:1i<jnq+np}\sigma_k = \mathrm{median}\{ \|X_i - X_j\| : 1 \le i < j \le n_q + n_p \} over the pooled sample — is the standard choice and works well across orders of magnitude in dimension. More principled bandwidth selection requires a held-out scoring rule, which KMM doesn’t naturally provide (its objective is unobserved at the population level); §6’s uLSIF gives us analytic LOO-CV, which is a major reason it overtook KMM in practice.

Numerical demonstration

We run KMM on the shift-Gaussian toy (p=N(0,1)p = \mathcal{N}(0, 1), q=N(1,1)q = \mathcal{N}(1, 1)) and compare the recovered weights w^i\hat w_i against r(Xiq)=exp(12Xiq)r(X_i^q) = \exp(\tfrac{1}{2} - X_i^q) from the closed form.

Two-panel figure: left shows q-samples on a 1-D axis with marker size proportional to KMM weight, overlaid on the closed-form true ratio. Right shows scatter of recovered weights against true ratio with a diagonal y equals x reference.
KMM weights track the true ratio at the q-samples. Left: high-weight samples cluster in the left tail where p has mass but q has few representatives. Right: scatter of recovered KMM weights against the true ratio falls on the diagonal — KMM has recovered the right reweighting.
import numpy as np
from scipy.optimize import minimize
from scipy.spatial.distance import cdist

def kmm_fit(x_p, x_q, sigma_k, B=1000.0):
    n_q, n_p = len(x_q), len(x_p)
    K = np.exp(-cdist(x_q[:, None], x_q[:, None], 'sqeuclidean') / (2 * sigma_k**2))
    kappa = (n_q / n_p) * np.exp(-cdist(x_q[:, None], x_p[:, None], 'sqeuclidean') / (2 * sigma_k**2)).sum(1)
    eps = (np.sqrt(n_q) - 1) / np.sqrt(n_q)
    cons = ({'type': 'ineq', 'fun': lambda w: B - w},
            {'type': 'ineq', 'fun': lambda w: w},
            {'type': 'ineq', 'fun': lambda w: eps - abs(w.mean() - 1.0)})
    res = minimize(lambda w: 0.5 * w @ K @ w - kappa @ w, x0=np.ones(n_q),
                   jac=lambda w: K @ w - kappa, method='SLSQP', constraints=cons)
    return res.x

On the running toy, KMM recovers weights with Pearson correlation >0.6> 0.6 to the closed-form ratio at the qq-samples, with mean(w) 1\approx 1 and an effective sample size around nq/6n_q / 6. Raising μq\mu_q stretches the support gap and the recovered weights degrade — KMM has no out-of-sample function, so it inherits any support-overlap fragility directly.

KLIEP — Kullback–Leibler importance estimation procedure

§5 picks up the KL specialization of the Bregman framework from §3.3 and runs it through to a working estimator. KLIEP (Sugiyama, Suzuki, Nakajima, Kashima, von Bünau, and Kawanabe 2008) was historically the first direct DRE method to deliver a principled bandwidth-selection scheme — the KLIEP objective is itself a valid held-out scoring rule, so cross-validation comes for free. That is a meaningful operational advantage over KMM (§4), which has no built-in CV criterion; it’s the reason KLIEP and its descendants (uLSIF in §6) eclipsed KMM as the practitioner default.

The KL objective and the linear-in-parameters model

Take the §3.3 KL choice ϕ(t)=tlogtt\phi(t) = t \log t - t. The cleanest derivation of KLIEP doesn’t start from ϕ\phi — it starts from the question “if I treat r^(x)q(x)\hat r(x) \cdot q(x) as my estimate of p(x)p(x), what loss should I minimize?” The natural answer is the KL divergence from pp to the estimated density:

KL(pr^q)  =  p(x)logp(x)r^(x)q(x)dx  =  Ep[logr(X)]    Ep[logr^(X)].\mathrm{KL}\bigl(p \,\big\|\, \hat r \cdot q\bigr) \;=\; \int p(x) \log \frac{p(x)}{\hat r(x)\, q(x)}\, \mathrm{d}x \;=\; \mathbb{E}_p\bigl[\log r(X)\bigr] \;-\; \mathbb{E}_p\bigl[\log \hat r(X)\bigr].

The first term is constant in r^\hat r — it’s just KL(pq)\mathrm{KL}(p \,\|\, q) as far as the optimizer cares. Minimizing the KL is therefore equivalent to maximizing Ep[logr^(X)]\mathbb{E}_p[\log \hat r(X)]. The KL framing makes one constraint visible: for r^q\hat r \cdot q to be a proper probability density, we need r^(x)q(x)dx=Eq[r^(X)]=1\int \hat r(x)\, q(x)\, \mathrm{d}x = \mathbb{E}_q[\hat r(X)] = 1. The non-negativity r^0\hat r \geq 0 is the second standing requirement.

The constrained KLIEP problem is therefore

maxr^    Ep[logr^(X)]subject toEq[r^(X)]=1,r^0.\max_{\hat r} \;\; \mathbb{E}_p\bigl[\log \hat r(X)\bigr] \quad \text{subject to} \quad \mathbb{E}_q[\hat r(X)] = 1, \quad \hat r \geq 0.

Linear-in-parameters model. Sugiyama et al. (2008) parametrize r^\hat r as a non-negative linear combination of a fixed kernel basis:

r^(x)  =  =1bαψ(x),ψ(x)=exp ⁣(xc22σ2),\hat r(x) \;=\; \sum_{\ell = 1}^{b} \alpha_\ell\, \psi_\ell(x), \qquad \psi_\ell(x) = \exp\!\left( -\frac{\|x - c_\ell\|^2}{2 \sigma^2} \right),

with αR0b\boldsymbol{\alpha} \in \mathbb{R}_{\geq 0}^b and {c}=1b\{c_\ell\}_{\ell=1}^{b} a random subsample of the pp-samples. The centres cc_\ell are fixed at initialization (Sugiyama et al. recommend b=100b = 100). The only tunable hyperparameter is the bandwidth σ\sigma.

Convex-program structure

Substituting the linear model and replacing expectations with sample averages,

maxαRb  L^(α)  :=  1npj=1nplog ⁣(αψ(Xjp))subject toαψˉq=1,α0,\max_{\boldsymbol{\alpha} \in \mathbb{R}^b}\; \hat{\mathcal{L}}(\boldsymbol{\alpha}) \;:=\; \frac{1}{n_p} \sum_{j=1}^{n_p} \log\!\bigl(\boldsymbol{\alpha}^\top \boldsymbol{\psi}(X_j^p)\bigr) \quad \text{subject to} \quad \boldsymbol{\alpha}^\top \bar{\boldsymbol{\psi}}_q = 1, \quad \boldsymbol{\alpha} \geq \mathbf{0},

where ψˉq:=(1/nq)iψ(Xiq)Rb\bar{\boldsymbol{\psi}}_q := (1/n_q) \sum_i \boldsymbol{\psi}(X_i^q) \in \mathbb{R}^b. The objective is the average of log\log of a linear function of α\boldsymbol{\alpha} — concave on the feasible region. The feasible set {α0,  ψˉqα=1}\{\boldsymbol{\alpha} \geq 0,\; \bar{\boldsymbol{\psi}}_q^\top \boldsymbol{\alpha} = 1\} is a convex polytope. The constrained maximization is therefore a convex program with a unique global optimum.

Projected-gradient ascent — Sugiyama et al. (2008), Algorithm 1

The gradient of L^\hat{\mathcal{L}} at α\boldsymbol{\alpha} has a clean form. Let ΨpRnp×b\boldsymbol{\Psi}_p \in \mathbb{R}^{n_p \times b} be the design matrix with (Ψp)j=ψ(Xjp)(\boldsymbol{\Psi}_p)_{j\ell} = \psi_\ell(X_j^p), and write r=ΨpαRnp\boldsymbol{r} = \boldsymbol{\Psi}_p \boldsymbol{\alpha} \in \mathbb{R}^{n_p} for the current estimate evaluated at the pp-samples. Then

αL^(α)  =  1npΨp(1/r)    Rb,\nabla_{\boldsymbol{\alpha}} \hat{\mathcal{L}}(\boldsymbol{\alpha}) \;=\; \frac{1}{n_p} \boldsymbol{\Psi}_p^\top \bigl(1 / \boldsymbol{r}\bigr) \;\in\; \mathbb{R}^b,

where the division is component-wise. Each iteration takes a gradient-ascent step and then projects back onto the feasible set:

Algorithm 1 (KLIEP (Sugiyama et al. 2008, Algorithm 1)).

  1. Initialize α1/(ψˉq1)\boldsymbol{\alpha} \gets \mathbf{1} / (\bar{\boldsymbol{\psi}}_q^\top \mathbf{1}) so the equality constraint holds at start.
  2. Repeat:
    • Gradient ascent: αα+ηΨp(1/(Ψpα))\boldsymbol{\alpha} \gets \boldsymbol{\alpha} + \eta\, \boldsymbol{\Psi}_p^\top (1 / (\boldsymbol{\Psi}_p \boldsymbol{\alpha})).
    • Project onto ψˉqα=1\bar{\boldsymbol{\psi}}_q^\top \boldsymbol{\alpha} = 1: αα+(1ψˉqα)ψˉq/ψˉq2\boldsymbol{\alpha} \gets \boldsymbol{\alpha} + (1 - \bar{\boldsymbol{\psi}}_q^\top \boldsymbol{\alpha})\, \bar{\boldsymbol{\psi}}_q / \|\bar{\boldsymbol{\psi}}_q\|^2.
    • Project onto α0\boldsymbol{\alpha} \geq \mathbf{0}: αmax(α,0)\boldsymbol{\alpha} \gets \max(\boldsymbol{\alpha}, \mathbf{0}) element-wise.
    • Re-normalize: αα/(ψˉqα)\boldsymbol{\alpha} \gets \boldsymbol{\alpha} / (\bar{\boldsymbol{\psi}}_q^\top \boldsymbol{\alpha}).
  3. Until L^(αt)L^(αt1)<tol|\hat{\mathcal{L}}(\boldsymbol{\alpha}_{t}) - \hat{\mathcal{L}}(\boldsymbol{\alpha}_{t-1})| < \mathrm{tol} or max iterations reached.

Convergence of alternating projection onto convex sets is classical (von Neumann 1933; Dykstra 1983); KLIEP’s variant doesn’t add the Dykstra correction, but in practice it converges to a feasible point within numerical tolerance of the constrained optimum, especially when the active set on the non-negativity constraint stabilizes after a few hundred iterations.

Two-panel figure showing KLIEP convergence. Left panel: log-likelihood vs iteration with horizontal reference line at the truth-evaluated ceiling. Right panel: recovered r-hat overlaid on the true ratio on a log y-axis.
KLIEP convergence trace at the median-heuristic bandwidth on the running toy. The projected-gradient log-likelihood approaches its truth-evaluated ceiling within a few hundred iterations. The limiting r-hat tracks the true ratio in the bulk and slightly mis-fits the deep left tail where q-samples are scarce.

Bandwidth selection by KL cross-validation

The bandwidth σ\sigma in the Gaussian-kernel basis is the only hyperparameter left after fixing b=100b = 100 centres. The KLIEP objective doubles as a held-out scoring rule, because the empirical Ep[logr^(X)]\mathbb{E}_p[\log \hat r(X)] on data the estimator has not seen estimates the population Ep[logr^(X)]=KL(pr^q)+const\mathbb{E}_p[\log \hat r(X)] = -\mathrm{KL}(p \,\|\, \hat r \cdot q) + \text{const} unbiasedly. So KK-fold KL-CV with K=5K = 5 proceeds:

  1. Partition the pp-samples into KK folds {Pk}k=1K\{P_k\}_{k=1}^K.
  2. For each candidate σ\sigma and each fold kk: fit KLIEP on kkPk\bigcup_{k' \neq k} P_{k'} paired with the full qq-sample, then score L^k(σ):=Pk1jPklogr^σ(Xjp)\hat{\mathcal{L}}_k(\sigma) := |P_k|^{-1} \sum_{j \in P_k} \log \hat r_\sigma(X_j^p).
  3. The CV score is L^CV(σ):=K1kL^k(σ)\hat{\mathcal{L}}^{\mathrm{CV}}(\sigma) := K^{-1} \sum_k \hat{\mathcal{L}}_k(\sigma).
  4. Choose σ:=argmaxσL^CV(σ)\sigma^\star := \arg\max_\sigma \hat{\mathcal{L}}^{\mathrm{CV}}(\sigma).

This is the principled bandwidth-selection scheme KMM (§4) lacked.

Two-panel figure of KLIEP bandwidth selection. Left: KL cross-validation score vs bandwidth on a log-sigma axis with sigma-star marked. Right: KLIEP fits at three bandwidths (too-small, sigma-star, too-large) overlaid on the true ratio.
KL cross-validation for KLIEP bandwidth selection on the running toy. Left: the 5-fold CV log-likelihood is unimodal in log sigma and its peak picks the operational bandwidth. Right: the recovered r-hat at sigma-star fits the bulk well, the under-smoothed version is jagged, and the over-smoothed version flattens the exponential left tail.

On the verified-execution stack the CV-picked bandwidth ends up at the smallest value in the sweep grid — a finite-sample artifact of the kernel basis being centred on pp-samples (which have little mass in the deep left tail where the true ratio is largest). The meaningful comparison metric is therefore bulk-grid Pearson on x[2,3]x \in [-2, 3] rather than full-grid Pearson, and the bulk-grid fit is high (Pearson >0.85> 0.85). uLSIF in §6 gives us a smoother LOO-CV landscape and dominates KLIEP in operational use.

LSIF and uLSIF — least-squares importance fitting

§6 closes the loop on the kernel-based estimator family. LSIF (Kanamori, Hido, and Sugiyama 2009) is the squared-loss specialization of the Bregman framework from §3.2 — replacing the KL objective of §5 with 12Eq[g(X)2]Ep[g(X)]\tfrac{1}{2} \mathbb{E}_q[g(X)^2] - \mathbb{E}_p[g(X)], which turns the optimization into a linearly-constrained convex quadratic program with a non-negativity constraint α0\boldsymbol{\alpha} \geq 0. uLSIF is the simple but consequential observation that dropping the non-negativity constraint turns the QP into a single linear solve with a closed-form solution, and — more importantly — admits an analytic leave-one-out cross-validation formula that costs O(b3+b2(np+nq))O(b^3 + b^2 (n_p + n_q)) to compute, no refitting required. That single feature is why uLSIF overtook both LSIF and KLIEP as the practitioner default for kernel-based DRE.

The squared-loss closed form

Recall from §3.2 that the squared-loss Bregman objective is

JLSIF(g)  =  12Eq[g(X)2]    Ep[g(X)],J_{\mathrm{LSIF}}(g) \;=\; \tfrac{1}{2}\, \mathbb{E}_q[g(X)^2] \;-\; \mathbb{E}_p[g(X)],

with sample version J^LSIF(g)=(1/(2nq))ig(Xiq)2(1/np)jg(Xjp)\hat J_{\mathrm{LSIF}}(g) = (1/(2 n_q)) \sum_i g(X_i^q)^2 - (1/n_p) \sum_j g(X_j^p), and population minimizer g=rg^\star = r. Using the same linear-in-parameters model as KLIEP, r^(x)=αψ(x)\hat r(x) = \boldsymbol{\alpha}^\top \boldsymbol{\psi}(x), and adding a Tikhonov regularizer 12λα2\tfrac{1}{2} \lambda \|\boldsymbol{\alpha}\|^2, the LSIF problem with non-negativity is

minαRb    12α(H^+λI)α    h^αsubject toα0,\min_{\boldsymbol{\alpha} \in \mathbb{R}^b} \;\; \tfrac{1}{2}\, \boldsymbol{\alpha}^\top (\hat{\mathbf{H}} + \lambda \mathbf{I})\, \boldsymbol{\alpha} \;-\; \hat{\mathbf{h}}^\top \boldsymbol{\alpha} \quad \text{subject to} \quad \boldsymbol{\alpha} \geq \mathbf{0},

where H^=ΨqΨq/nqRb×b\hat{\mathbf{H}} = \boldsymbol{\Psi}_q^\top \boldsymbol{\Psi}_q / n_q \in \mathbb{R}^{b \times b} and h^=Ψp1np/npRb\hat{\mathbf{h}} = \boldsymbol{\Psi}_p^\top \mathbf{1}_{n_p} / n_p \in \mathbb{R}^b. Without the non-negativity constraint, first-order optimality is the linear system (H^+λI)α^=h^(\hat{\mathbf{H}} + \lambda \mathbf{I})\, \hat{\boldsymbol{\alpha}} = \hat{\mathbf{h}}, with the closed-form solution

α^(λ)  =  (H^+λI)1h^.\hat{\boldsymbol{\alpha}}(\lambda) \;=\; (\hat{\mathbf{H}} + \lambda \mathbf{I})^{-1}\, \hat{\mathbf{h}}.

uLSIF — dropping non-negativity for the closed form

The motivation for dropping the non-negativity constraint is computational: the NNQP solver needs an iterative active-set sweep with O(b)O(b) outer iterations and O(b3)O(b^3) inner work, while the unconstrained closed form is a single O(b3)O(b^3) linear solve. The motivation for trusting the relaxation is empirical: when the Gaussian basis is rich enough and the bandwidth σ\sigma is sensible, the unconstrained α^\hat{\boldsymbol{\alpha}} rarely has many negative components, and even when it does, the predicted ratio r^(x)=αψ(x)\hat r(x) = \boldsymbol{\alpha}^\top \boldsymbol{\psi}(x) is positive almost everywhere in the support overlap.

The standard uLSIF post-processing is to clip predictions at zero: r^+(x):=max(r^(x),0)\hat r_+(x) := \max(\hat r(x), 0). Kanamori, Suzuki, and Sugiyama (2012, §4.3) show that the convergence rate of r^+\hat r_+ to the true rr matches that of the constrained LSIF estimator up to a constant — the relaxation pays no statistical price asymptotically and a modest one in finite samples, but the operational simplification (closed form + analytic LOO) more than compensates.

Analytic leave-one-out cross-validation

LOO-CV evaluates the LSIF objective on left-out samples. Naively, this needs np+nqn_p + n_q refits per (σ,λ)(\sigma, \lambda), each costing O(b3)O(b^3). For uLSIF, the closed form lets us compute all refits in two precomputed b×bb \times b matrix inverses plus O(b2)O(b^2) per left-out sample, by Sherman–Morrison rank-one updates.

Lemma 1 (Leave-out empirical matrices).

The leave-ii-out empirical second-moment matrix and leave-jj-out empirical mean satisfy

H^(i,q)  =  nqnq1H^    1nq1ψq(i)ψq(i),h^(j,p)  =  npnp1h^    1np1ψp(j),\hat{\mathbf{H}}^{(-i,q)} \;=\; \frac{n_q}{n_q - 1}\, \hat{\mathbf{H}} \;-\; \frac{1}{n_q - 1}\, \boldsymbol{\psi}_q^{(i)} \boldsymbol{\psi}_q^{(i)\top}, \qquad \hat{\mathbf{h}}^{(-j,p)} \;=\; \frac{n_p}{n_p - 1}\, \hat{\mathbf{h}} \;-\; \frac{1}{n_p - 1}\, \boldsymbol{\psi}_p^{(j)},

where ψq(i):=ψ(Xiq)\boldsymbol{\psi}_q^{(i)} := \boldsymbol{\psi}(X_i^q) and ψp(j):=ψ(Xjp)\boldsymbol{\psi}_p^{(j)} := \boldsymbol{\psi}(X_j^p).

Proof.

H^=(1/nq)iψq(i)ψq(i)\hat{\mathbf{H}} = (1/n_q) \sum_{i'} \boldsymbol{\psi}_q^{(i')} \boldsymbol{\psi}_q^{(i')\top}, so iiψq(i)ψq(i)=nqH^ψq(i)ψq(i)\sum_{i' \ne i} \boldsymbol{\psi}_q^{(i')} \boldsymbol{\psi}_q^{(i')\top} = n_q \hat{\mathbf{H}} - \boldsymbol{\psi}_q^{(i)} \boldsymbol{\psi}_q^{(i)\top}, and dividing by (nq1)(n_q - 1) gives the first display. The argument for h^(j,p)\hat{\mathbf{h}}^{(-j,p)} is identical.

Lemma 2 (Sherman–Morrison).

For invertible ARb×b\mathbf{A} \in \mathbb{R}^{b \times b} and uRb\mathbf{u} \in \mathbb{R}^b with 1uA1u01 - \mathbf{u}^\top \mathbf{A}^{-1} \mathbf{u} \neq 0,

(Auu)1  =  A1  +  A1uuA11uA1u.(\mathbf{A} - \mathbf{u}\, \mathbf{u}^\top)^{-1} \;=\; \mathbf{A}^{-1} \;+\; \frac{\mathbf{A}^{-1} \mathbf{u}\, \mathbf{u}^\top \mathbf{A}^{-1}}{1 - \mathbf{u}^\top \mathbf{A}^{-1} \mathbf{u}}.

Theorem 3 (Exact analytic LOO for uLSIF).

Let B:=H^+λI\mathbf{B} := \hat{\mathbf{H}} + \lambda \mathbf{I} be the regularized matrix from the full fit, and let Aq:=(nq/(nq1))H^+λI\mathbf{A}_q := (n_q / (n_q - 1))\, \hat{\mathbf{H}} + \lambda \mathbf{I}. Define, for each i,ji, j:

r~i:=ψq(i)Aq1h^,ηi:=ψq(i)Aq1ψq(i),sj:=ψp(j)α^,δj:=ψp(j)B1ψp(j).\tilde r_i := \boldsymbol{\psi}_q^{(i)\top} \mathbf{A}_q^{-1} \hat{\mathbf{h}}, \quad \eta_i := \boldsymbol{\psi}_q^{(i)\top} \mathbf{A}_q^{-1} \boldsymbol{\psi}_q^{(i)}, \quad s_j := \boldsymbol{\psi}_p^{(j)\top} \hat{\boldsymbol{\alpha}}, \quad \delta_j := \boldsymbol{\psi}_p^{(j)\top} \mathbf{B}^{-1} \boldsymbol{\psi}_p^{(j)}.

Then the exact leave-one-out predicted ratios are

r^iLOO,q  =  r~i1ηi/(nq1),r^jLOO,p  =  npsjδjnp1.\hat r^{\mathrm{LOO},q}_i \;=\; \frac{\tilde r_i}{1 - \eta_i / (n_q - 1)}, \qquad \hat r^{\mathrm{LOO},p}_j \;=\; \frac{n_p\, s_j - \delta_j}{n_p - 1}.
Proof.

For the qq-removal, the first lemma gives the leave-ii-out regularized matrix as

B(i,q)  =  H^(i,q)+λI  =  Aq    1nq1ψq(i)ψq(i).\mathbf{B}^{(-i,q)} \;=\; \hat{\mathbf{H}}^{(-i,q)} + \lambda \mathbf{I} \;=\; \mathbf{A}_q \;-\; \frac{1}{n_q - 1}\, \boldsymbol{\psi}_q^{(i)} \boldsymbol{\psi}_q^{(i)\top}.

Aq\mathbf{A}_q is independent of ii, so we precompute Aq1\mathbf{A}_q^{-1} once per (σ,λ)(\sigma, \lambda). Apply Sherman–Morrison with A=Aq\mathbf{A} = \mathbf{A}_q and u=ψq(i)/nq1\mathbf{u} = \boldsymbol{\psi}_q^{(i)} / \sqrt{n_q - 1}:

(B(i,q))1  =  Aq1  +  1nq1Aq1ψq(i)ψq(i)Aq11ηi/(nq1).(\mathbf{B}^{(-i,q)})^{-1} \;=\; \mathbf{A}_q^{-1} \;+\; \frac{1}{n_q - 1} \cdot \frac{\mathbf{A}_q^{-1} \boldsymbol{\psi}_q^{(i)}\, \boldsymbol{\psi}_q^{(i)\top} \mathbf{A}_q^{-1}}{1 - \eta_i / (n_q - 1)}.

h^\hat{\mathbf{h}} is unchanged when we remove a qq-sample, so α^(i,q)=(B(i,q))1h^\hat{\boldsymbol{\alpha}}^{(-i,q)} = (\mathbf{B}^{(-i,q)})^{-1}\, \hat{\mathbf{h}} and the prediction at XiqX_i^q is

r^iLOO,q  =  ψq(i)(B(i,q))1h^  =  r~i1ηi/(nq1).\hat r^{\mathrm{LOO},q}_i \;=\; \boldsymbol{\psi}_q^{(i)\top} (\mathbf{B}^{(-i,q)})^{-1}\, \hat{\mathbf{h}} \;=\; \frac{\tilde r_i}{1 - \eta_i / (n_q - 1)}.

For the pp-removal, the regularized matrix B\mathbf{B} is unchanged (depends only on qq-samples), but h^\hat{\mathbf{h}} is replaced by h^(j,p)=(np/(np1))h^(1/(np1))ψp(j)\hat{\mathbf{h}}^{(-j,p)} = (n_p / (n_p - 1)) \hat{\mathbf{h}} - (1/(n_p - 1)) \boldsymbol{\psi}_p^{(j)}, so

r^jLOO,p  =  ψp(j)α^(j,p)  =  npsjδjnp1.\hat r^{\mathrm{LOO},p}_j \;=\; \boldsymbol{\psi}_p^{(j)\top} \hat{\boldsymbol{\alpha}}^{(-j,p)} \;=\; \frac{n_p\, s_j - \delta_j}{n_p - 1}.

Corollary 2 (Asymptotic form (Kanamori–Hido–Sugiyama 2009, §3.5)).

In the limit nq,npn_q, n_p \to \infty with λ\lambda fixed, AqB\mathbf{A}_q \to \mathbf{B}, so ηihi:=ψq(i)B1ψq(i)\eta_i \to h_i := \boldsymbol{\psi}_q^{(i)\top} \mathbf{B}^{-1} \boldsymbol{\psi}_q^{(i)} and r~iri:=ψq(i)α^\tilde r_i \to r_i := \boldsymbol{\psi}_q^{(i)\top} \hat{\boldsymbol{\alpha}}, and the exact formulas collapse to

r^iLOO,q    ri1hi/nq,r^jLOO,p    sjδj/np.\hat r^{\mathrm{LOO},q}_i \;\longrightarrow\; \frac{r_i}{1 - h_i / n_q}, \qquad \hat r^{\mathrm{LOO},p}_j \;\longrightarrow\; s_j - \delta_j / n_p.

The asymptotic form differs from the exact form by an O(1/n)O(1/n) correction in the denominators. For nq100n_q \ge 100 the two forms agree to better than 1%1\% on every LOO quantity — well below the finite-sample noise in the scoring rule. We use the asymptotic form because it reuses the single matrix inverse B1\mathbf{B}^{-1} that we already had to compute for the fit, instead of requiring a second inverse Aq1\mathbf{A}_q^{-1} per (σ,λ)(\sigma, \lambda) pair.

Numerical demonstration

We sweep uLSIF over a grid of (σ,λ)(\sigma, \lambda) pairs, pick the analytic-LOO-CV optimum, and compare the fit and runtime to the KLIEP estimate from §5 on the same toy.

Two-panel figure of uLSIF LOO-CV. Left panel: heatmap of LOO score over a sigma-lambda grid with argmin marked. Right panel: three-way comparison of true ratio, uLSIF fit at optimum, and KLIEP fit overlaid on a log y-axis.
uLSIF LOO-CV landscape and three-way fit comparison on the running toy. The analytic LOO-CV surface is smooth and unimodal in log sigma. The optimum-bandwidth uLSIF fit tracks the true ratio with Pearson approximately 0.86; the KLIEP and uLSIF fits agree closely in the bulk and both compress the deep left tail (a generic kernel-basis limitation discussed in §12.3).
import numpy as np
from scipy.spatial.distance import cdist

def ulsif_fit(x_p, x_q, sigma, lam, b=100, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    centers = rng.choice(x_p, size=b, replace=False)
    Psi_p = np.exp(-cdist(x_p[:, None], centers[:, None], 'sqeuclidean') / (2 * sigma**2))
    Psi_q = np.exp(-cdist(x_q[:, None], centers[:, None], 'sqeuclidean') / (2 * sigma**2))
    H = Psi_q.T @ Psi_q / len(x_q)
    h = Psi_p.mean(axis=0)
    alpha = np.linalg.solve(H + lam * np.eye(b), h)
    return alpha, centers

On the verified-execution stack the uLSIF analytic-LOO sweep and KLIEP 5-fold CV sweep land within a constant factor of each other on this toy. The systematic difference is what uLSIF buys per fit: the LOO score sweeps λ\lambda in the same matrix inverse, while KLIEP requires KK refits per (σ)(\sigma) for the K-fold CV. uLSIF also dominates on accuracy: Pearson(uLSIF, true r) is approximately 0.86 vs KLIEP’s 0.04 at the median-σ heuristic baseline.

Probabilistic classification as DRE

§7 is the section where direct DRE quietly stops being a separate algorithm family and becomes “fit any well-calibrated classifier on a pooled labelled dataset.” The reformulation is short — half a page of Bayes’ rule — and the consequences are large: every gradient-boosted tree, every neural-network classifier, every penalized logistic regression that exists in a practitioner’s pipeline can be repurposed as a density-ratio estimator without rebuilding the algorithm. The catch is calibration: the classifier’s predicted probabilities must approximate the true posterior on the pooled-label problem, not just rank-order it. AUC alone is not enough.

The pooled-dataset construction

Pool the pp-samples and qq-samples into one dataset of size N=np+nqN = n_p + n_q, and attach a binary label that records which sample each came from:

Dpool  :=  {(Xip,Y=1)}i=1np{(Xjq,Y=0)}j=1nq.\mathcal{D}_{\mathrm{pool}} \;:=\; \bigl\{ (X_i^p, Y = 1) \bigr\}_{i=1}^{n_p} \,\cup\, \bigl\{ (X_j^q, Y = 0) \bigr\}_{j=1}^{n_q}.

This dataset has the structure of a labelled binary classification problem with class-conditional densities f(xY=1)=p(x)f(x \mid Y = 1) = p(x) and f(xY=0)=q(x)f(x \mid Y = 0) = q(x), and class-prior probabilities π1=np/N\pi_1 = n_p / N and π0=nq/N\pi_0 = n_q / N. Fit a probabilistic classifier — anything that produces an estimate π^(x):=P^(Y=1X=x)\hat\pi(x) := \hat{\mathbb{P}}(Y = 1 \mid X = x) rather than just a hard label. The next subsection shows that this estimate, divided by 1π^(x)1 - \hat\pi(x) and multiplied by the prior-odds ratio, is an estimate of r(x)r(x).

The Bayes-rule identity

Theorem 4 (Classification-DRE identity).

Under the pooled-dataset construction with class-conditionals f(xY=1)=p(x)f(x \mid Y = 1) = p(x), f(xY=0)=q(x)f(x \mid Y = 0) = q(x) and priors π1,π0\pi_1, \pi_0,

r(x)  =  p(x)q(x)  =  π0π1P(Y=1X=x)P(Y=0X=x)  =  nqnpP(Y=1X=x)P(Y=0X=x).r(x) \;=\; \frac{p(x)}{q(x)} \;=\; \frac{\pi_0}{\pi_1} \cdot \frac{\mathbb{P}(Y = 1 \mid X = x)}{\mathbb{P}(Y = 0 \mid X = x)} \;=\; \frac{n_q}{n_p} \cdot \frac{\mathbb{P}(Y = 1 \mid X = x)}{\mathbb{P}(Y = 0 \mid X = x)}.
Proof.

By Bayes’ rule applied to the pooled distribution,

P(Y=1X=x)  =  π1p(x)π1p(x)+π0q(x),P(Y=0X=x)  =  π0q(x)π1p(x)+π0q(x).\mathbb{P}(Y = 1 \mid X = x) \;=\; \frac{\pi_1\, p(x)}{\pi_1\, p(x) + \pi_0\, q(x)}, \qquad \mathbb{P}(Y = 0 \mid X = x) \;=\; \frac{\pi_0\, q(x)}{\pi_1\, p(x) + \pi_0\, q(x)}.

Dividing — the mixture denominator cancels — gives

P(Y=1X=x)P(Y=0X=x)  =  π1p(x)π0q(x)  =  π1π0r(x),\frac{\mathbb{P}(Y = 1 \mid X = x)}{\mathbb{P}(Y = 0 \mid X = x)} \;=\; \frac{\pi_1\, p(x)}{\pi_0\, q(x)} \;=\; \frac{\pi_1}{\pi_0}\, r(x),

and the claim follows by solving for r(x)r(x) with π1/π0=np/nq\pi_1 / \pi_0 = n_p / n_q.

In the balanced case np=nqn_p = n_q the prior-odds ratio is 11 and r(x)=exp(η(x))r(x) = \exp(\eta(x)), where η(x):=logP(Y=1x)/P(Y=0x)\eta(x) := \log \mathbb{P}(Y = 1 \mid x) / \mathbb{P}(Y = 0 \mid x) is the logit. The DRE estimator is then r^(x)=exp(η^(x))\hat r(x) = \exp(\hat\eta(x)) — the exponentiated logit of any probabilistic classifier we choose to fit. Substituting the linear logistic-loss reparametrization with g=eηg = e^\eta recovers exactly the Bregman generator

ϕLR(t)  =  tlogt(1+t)log(1+t)+(1+t)log2\phi_{\mathrm{LR}}(t) \;=\; t \log t - (1 + t) \log(1 + t) + (1 + t) \log 2

from Menon and Ong (2016), closing the §3.4 preview.

Calibration

The classification-DRE identity assumes the classifier’s predicted probabilities π^(x)\hat\pi(x) are approximately equal to the true posterior — that is, the classifier is calibrated:

P(Y=1π^(X)=p)    p,for all p[0,1].\mathbb{P}(Y = 1 \mid \hat\pi(X) = p) \;\approx\; p, \quad \text{for all } p \in [0, 1].

This is a strictly stronger requirement than discriminative performance. A classifier with perfect AUC =1= 1 can be arbitrarily badly calibrated — for example, if it squashes all probabilities to {ϵ,1ϵ}\{\epsilon, 1 - \epsilon\} for some small ϵ\epsilon, the rank-order is right but the absolute values are wrong, and the DRE estimate takes only two values regardless of xx.

Which off-the-shelf classifiers are calibrated?

  • L2-penalized logistic regression with proper-loss training is well-calibrated by construction, because the loss is a strictly proper scoring rule (Gneiting and Raftery 2007).
  • SVMs, random forests, gradient-boosted trees, AdaBoost are typically not well-calibrated. SVMs hinge-loss optimize for margin, not probabilities; tree ensembles push predictions toward extremes (Niculescu-Mizil and Caruana 2005). These need post-hoc recalibration — Platt scaling (sigmoid fit on a held-out calibration set) for SVMs, isotonic regression (monotone non-parametric fit) for tree ensembles.
  • Deep neural network classifiers are usually over-confident on modern architectures (Guo et al. 2017); calibration is an active research area, with temperature scaling the most common post-hoc fix.

Numerical demonstration

We pool the np=nq=300n_p = n_q = 300 shift-Gaussian samples from §1 with labels {1,0}\{1, 0\}, fit an L2-penalized logistic regression on the Gaussian basis at the same centres and bandwidth uLSIF picked in §6, exponentiate the logit to get r^LR\hat r_{\mathrm{LR}}, and compare it head-to-head with r^uLSIF\hat r_{\mathrm{uLSIF}} and the closed-form truth.

Two-panel figure: left shows three-way comparison of true ratio, uLSIF fit, and logistic-regression fit on a log y-axis. Right shows reliability diagram with predicted probability in deciles vs observed fraction with diagonal calibration reference line.
Classification-DRE vs uLSIF on the running toy. Left: logistic regression on the same Gaussian basis produces a ratio estimate effectively indistinguishable from uLSIF, with Pearson approximately 0.99 between the two. Right: the reliability diagram sits on the diagonal, confirming that the Bayes-rule identity's hypothesis is met.
from sklearn.linear_model import LogisticRegression
import numpy as np

def lr_dre(x_p, x_q, sigma, C, b=100, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    centers = rng.choice(x_p, size=b, replace=False)
    x = np.concatenate([x_p, x_q])
    y = np.concatenate([np.ones(len(x_p)), np.zeros(len(x_q))])
    Psi = np.exp(-((x[:, None] - centers[None, :])**2) / (2 * sigma**2))
    lr = LogisticRegression(penalty='l2', C=C, solver='lbfgs').fit(Psi, y)
    return lr, centers

On the running toy, Pearson(LR, uLSIF) is about 0.990.99 — the two estimators agree more closely with each other than either does with the truth, since both share the same finite-sample basis limitations. The reliability-diagram maximum deviation from the diagonal is below 0.100.10, confirming that logistic regression is well-calibrated on this problem. Swapping in a tree ensemble visibly biases the DRE estimate even though the AUC barely moves — the calibration warning of §7.3 in action.

The ff-divergence variational view and neural DRE

§8 is the section that opens the door from kernel-basis methods to arbitrary function classes — and ultimately to deep generative models. Nguyen, Wainwright, and Jordan (2010) showed that every ff-divergence between pp and qq has a variational representation as the supremum, over a class of “witness” functions TT, of a sample-computable functional whose optimizer is T=f(r)T^\star = f'(r). The construction recovers LSIF and KLIEP as special cases of one framework, but the framework’s real power is that TT can be parametrized by anything we like — including a small neural network trained by SGD. §8.3 implements that idea, and §8.4 closes the section with the observation that vanilla GANs are doing exactly this estimation implicitly.

The Nguyen–Wainwright–Jordan variational lower bound

Definition 2 (f-divergence).

For a convex function f:(0,)Rf: (0, \infty) \to \mathbb{R} with f(1)=0f(1) = 0, the ff-divergence between probability densities p,qp, q with pqp \ll q is

Df(pq)  :=  Xf ⁣(p(x)q(x))q(x)dx  =  Eq[f(r(X))].D_f(p \,\|\, q) \;:=\; \int_\mathcal{X} f\!\left(\frac{p(x)}{q(x)}\right) q(x)\, \mathrm{d}x \;=\; \mathbb{E}_q[f(r(X))].

Familiar choices: f(u)=uloguf(u) = u \log u gives KL(pq)=Ep[logr]\mathrm{KL}(p \,\|\, q) = \mathbb{E}_p[\log r]; f(u)=(u1)2/2f(u) = (u - 1)^2 / 2 gives 12χ2(pq)\tfrac{1}{2} \chi^2(p \,\|\, q); f(u)=ulogu(1+u)log ⁣(1+u2)f(u) = u \log u - (1 + u) \log\!\bigl(\tfrac{1 + u}{2}\bigr) gives twice the Jensen–Shannon divergence. Df0D_f \geq 0 with equality iff p=qp = q (Jensen’s inequality on convex ff and Eq[r]=1\mathbb{E}_q[r] = 1).

Definition 3 (Fenchel conjugate).

The Fenchel conjugate of a convex function f:RR{+}f: \mathbb{R} \to \mathbb{R} \cup \{+\infty\} is

f(t)  :=  supuR{utf(u)}.f^\star(t) \;:=\; \sup_{u \in \mathbb{R}}\, \bigl\{ u\, t - f(u) \bigr\}.

The Fenchel–Young inequality f(u)+f(t)utf(u) + f^\star(t) \geq u\, t holds for all u,tu, t, with equality iff t=f(u)t = f'(u) (assuming ff differentiable on the interior of its domain).

For our purposes, ff is differentiable on (0,)(0, \infty) and the relevant conjugates compute cleanly:

f(u)f(u)f(u)f'(u)f(t)f^\star(t)corresponding DRE estimator
uloguu \log ulogu+1\log u + 1et1e^{t - 1}KLIEP (§5)
(u1)2/2(u - 1)^2 / 2u1u - 1t2/2+tt^2/2 + tLSIF (§6)
logu-\log u1/u-1/u1log(t)-1 - \log(-t), t<0t < 0reverse-KL DRE

Theorem 5 (NWJ Variational Representation (Nguyen–Wainwright–Jordan 2010, Lemma 1)).

Let ff be a convex function with Fenchel conjugate ff^\star. Then

Df(pq)  =  supT:Xdom(f){Ep[T(X)]    Eq[f(T(X))]},D_f(p \,\|\, q) \;=\; \sup_{T: \mathcal{X} \to \operatorname{dom}(f^\star)}\, \Bigl\{\, \mathbb{E}_p[T(X)] \;-\; \mathbb{E}_q[f^\star(T(X))] \,\Bigr\},

where the supremum runs over all measurable TT for which the integrals exist, and is achieved at T(x)  =  f(r(x))T^\star(x) \;=\; f'(r(x)).

Proof.

Lower bound (\geq). Apply Fenchel–Young pointwise with u=r(x)u = r(x) and t=T(x)t = T(x):

f(r(x))    T(x)r(x)f(T(x)).f(r(x)) \;\geq\; T(x)\, r(x) - f^\star(T(x)).

Taking Eq\mathbb{E}_q on both sides,

Eq[f(r(X))]    Eq[T(X)r(X)]    Eq[f(T(X))].\mathbb{E}_q[f(r(X))] \;\geq\; \mathbb{E}_q[T(X)\, r(X)] \;-\; \mathbb{E}_q[f^\star(T(X))].

By the §2.2 importance-weighting identity, Eq[T(X)r(X)]=Ep[T(X)]\mathbb{E}_q[T(X)\, r(X)] = \mathbb{E}_p[T(X)], so

Df(pq)    Ep[T(X)]    Eq[f(T(X))]D_f(p \,\|\, q) \;\geq\; \mathbb{E}_p[T(X)] \;-\; \mathbb{E}_q[f^\star(T(X))]

for every TT in the supremum’s class.

Achievement of the bound at T=f(r)T^\star = f'(r). Fenchel–Young holds with equality at t=f(u)t = f'(u), so for every xx,

f(r(x))  =  r(x)T(x)f(T(x)),f(r(x)) \;=\; r(x)\, T^\star(x) - f^\star(T^\star(x)),

and the same chain in reverse delivers Df=Ep[T]Eq[f(T)]D_f = \mathbb{E}_p[T^\star] - \mathbb{E}_q[f^\star(T^\star)].

Corollary 3 (Variational DRE estimator).

A sample-based estimator T^\hat T of T=f(r)T^\star = f'(r) is obtained by maximizing the empirical Lagrangian

T^  =  argmaxTT{1npj=1npT(Xjp)    1nqi=1nqf(T(Xiq))},\hat T \;=\; \arg\max_{T \in \mathcal{T}}\, \Biggl\{\, \frac{1}{n_p} \sum_{j=1}^{n_p} T(X_j^p) \;-\; \frac{1}{n_q} \sum_{i=1}^{n_q} f^\star(T(X_i^q)) \,\Biggr\},

and the density-ratio estimator is r^(x)=(f)1(T^(x))\hat r(x) = (f')^{-1}(\hat T(x)).

Recovering LSIF and KLIEP

Two short verifications that the NWJ framework subsumes the Bregman-DRE machinery of §3.

Chi-squared \Rightarrow LSIF. Take f(u)=(u1)2/2f(u) = (u - 1)^2 / 2 (so Df=12χ2(pq)D_f = \tfrac{1}{2} \chi^2(p \,\|\, q)). From the table, f(u)=u1f'(u) = u - 1 and f(t)=t2/2+tf^\star(t) = t^2 / 2 + t. Reparametrizing g(x)=T(x)+1g(x) = T(x) + 1 and expanding,

supg{Ep[g]12Eq[g2]}  =  Df(pq)+12,\sup_g \bigl\{ \mathbb{E}_p[g] - \tfrac{1}{2} \mathbb{E}_q[g^2] \bigr\} \;=\; D_f(p \,\|\, q) + \tfrac{1}{2},

which up to the additive 1/21/2 is exactly JLSIF(g)-J_{\mathrm{LSIF}}(g) from §3.2. Maximizing NWJ is equivalent to minimizing the LSIF objective.

KL \Rightarrow KLIEP. Take f(u)=uloguf(u) = u \log u (so Df=KL(pq)D_f = \mathrm{KL}(p \,\|\, q)). From the table, f(u)=logu+1f'(u) = \log u + 1 and f(t)=et1f^\star(t) = e^{t - 1}. Reparametrizing g(x)=eT(x)1g(x) = e^{T(x) - 1},

supg{Ep[logg]Eq[g]}  =  Df(pq)1,\sup_g \bigl\{ \mathbb{E}_p[\log g] - \mathbb{E}_q[g] \bigr\} \;=\; D_f(p \,\|\, q) - 1,

which is JKLIEPunc(g)-J^{\mathrm{unc}}_{\mathrm{KLIEP}}(g) from §3.3. NWJ with f=KLf = \mathrm{KL} recovers the unconstrained KLIEP objective.

The two recoveries make explicit what the Bregman framework of §3 promised: the same loss family that produced the kernel-basis estimators of §5 and §6 is the NWJ family in disguise, with the witness function TT playing the role of f(g)f'(g).

Neural DRE — parametrize the witness with an MLP

The NWJ corollary’s “any function class T\mathcal{T}” is the operational handle for neural DRE. Pick a small multilayer perceptron Tθ:RdRT_\theta: \mathbb{R}^d \to \mathbb{R}, train by SGD on the empirical NWJ lower bound for some ff, recover r^(x)=(f)1(Tθ(x))\hat r(x) = (f')^{-1}(T_\theta(x)). For our running 1-D toy with np=nq=300n_p = n_q = 300, an MLP with two hidden layers of 3232 ReLU units (≈1,1001{,}100 parameters) and 500\sim 500 Adam steps converges to a fit competitive with uLSIF, in a couple of seconds on CPU.

We use the KL generator f(u)=uloguf(u) = u \log u because the KL ratio is the most common neural-DRE target (mutual-information estimation in MINE — Belghazi et al. 2018 — uses essentially the same objective). The empirical NWJ objective to maximize is

L^NWJKL(θ)  =  1npj=1npTθ(Xjp)    1nqi=1nqexp ⁣(Tθ(Xiq)1),\hat L_{\mathrm{NWJ-KL}}(\theta) \;=\; \frac{1}{n_p} \sum_{j=1}^{n_p} T_\theta(X_j^p) \;-\; \frac{1}{n_q} \sum_{i=1}^{n_q} \exp\!\bigl(T_\theta(X_i^q) - 1\bigr),

and the predicted ratio is r^(x)=exp(Tθ(x)1)\hat r(x) = \exp(T_\theta(x) - 1). As the optimization converges, L^NWJKL(θ)\hat L_{\mathrm{NWJ-KL}}(\theta) approaches KL(pq)=(μpμq)2/2=0.5\mathrm{KL}(p \,\|\, q) = (\mu_p - \mu_q)^2 / 2 = 0.5 — the closed-form benchmark for our shift-Gaussian toy.

Two-panel figure of neural DRE training. Left: NWJ-KL lower bound vs SGD iteration with horizontal reference line at the closed-form KL ceiling. Right: neural-DRE r-hat overlaid on uLSIF r-hat and the closed-form true ratio on a log y-axis.
Neural NWJ-KL training converges to its closed-form KL ceiling on the running toy in about 500 Adam steps. The resulting neural ratio estimator is functionally indistinguishable from uLSIF on this 1-D problem; the advantage of the neural class shows in higher-dimensional settings where the kernel-basis estimators degrade.
import torch
import torch.nn as nn

class WitnessMLP(nn.Module):
    def __init__(self, d_in=1, hidden=32, depth=2):
        super().__init__()
        layers = [nn.Linear(d_in, hidden), nn.ReLU()]
        for _ in range(depth - 1):
            layers += [nn.Linear(hidden, hidden), nn.ReLU()]
        layers += [nn.Linear(hidden, 1)]
        self.net = nn.Sequential(*layers)
    def forward(self, x):
        return self.net(x).squeeze(-1)

def nwj_kl_objective(T_p, T_q, clip_q=10.0):
    T_q_safe = torch.clamp(T_q, max=clip_q)
    return T_p.mean() - torch.exp(T_q_safe - 1.0).mean()

The exponential in ff^\star can overflow on rare XiqX_i^q values where Tθ(Xiq)T_\theta(X_i^q) becomes large; we clip the argument at a safe upper bound. Full-batch gradients are fine at n=300n = 300; for larger nn, mini-batches with batch size 128\geq 128 are needed for stable estimates of both expectations. The MLP has no inductive bias toward the actual closed-form r(x)=e1/2xr(x) = e^{1/2 - x} — it could learn arbitrary 1-D functions — yet the NWJ loss reliably steers it toward the truth.

GAN as implicit density-ratio estimation

§8.4 closes the section with a conceptual observation: vanilla generative adversarial networks (Goodfellow et al. 2014) are implicitly doing density-ratio estimation. The discriminator is the ratio estimator; the generator is updated to push the ratio toward 11.

The vanilla GAN objective is the minimax

minG  maxD  V(G,D)  :=  Expdata[logD(x)]  +  ExpG[log(1D(x))],\min_G\; \max_D\; V(G, D) \;:=\; \mathbb{E}_{x \sim p_{\mathrm{data}}}[\log D(x)] \;+\; \mathbb{E}_{x \sim p_G}[\log(1 - D(x))],

where pdatap_{\mathrm{data}} is the true data distribution and pGp_G is the distribution induced by pushing latent samples zpzz \sim p_z through the generator GG. For fixed GG, the inner maximum over DD is achieved at

D(x)  =  pdata(x)pdata(x)+pG(x).D^\star(x) \;=\; \frac{p_{\mathrm{data}}(x)}{p_{\mathrm{data}}(x) + p_G(x)}.

This is the Bayes-optimal probabilistic classifier on the pooled labelled dataset with balanced priors. By the §7.2 classification-DRE identity (with np=nqn_p = n_q so the prior-odds = 1),

pdata(x)pG(x)  =  D(x)1D(x),D(x)  =  σ ⁣(logr(x)).\frac{p_{\mathrm{data}}(x)}{p_G(x)} \;=\; \frac{D^\star(x)}{1 - D^\star(x)}, \qquad D^\star(x) \;=\; \sigma\!\bigl(\log r(x)\bigr).

So the discriminator implements the sigmoid of the log-ratio: D=σlogrD^\star = \sigma \circ \log r, equivalently r=D/(1D)r = D^\star / (1 - D^\star). The GAN apparatus is a classification-DRE machine. Mohamed and Lakshminarayanan (2016) made this explicit and built the “implicit generative models” framework around it. Nowozin, Cseke, and Tomioka (2016) extended the construction to arbitrary ff-divergences via the NWJ lower bound — the ff-GAN family.

Two-panel figure: left shows trained discriminator D(x) overlaid on closed-form Bayes-optimal D-star with a horizontal line at 0.5. Right shows recovered ratio r-hat = D/(1-D) overlaid on closed-form r on a log y-axis.
The GAN discriminator as an implicit DRE estimator. Left: trained on fixed (p, q) data with BCE, the discriminator's sigmoid output is indistinguishable from the Bayes-optimal D-star. Right: inverting via r-hat = D/(1-D) recovers the true ratio to high precision.

In a full GAN, the generator alternates to push r^\hat r toward 11 — this demo isolates the inner-DD DRE step. The connection clarifies §10–§12: many modern generative models (GANs, normalizing flows trained adversarially, score-based diffusion models with auxiliary discriminators) have discriminator components that double as ratio estimators, and the downstream uses that need rr explicitly can build on them.

Covariate-shift correction

§9 turns the estimator machinery of §4–§8 onto the canonical downstream application: covariate-shift correction, in which training and test data come from distributions with the same conditional p(yx)p(y \mid x) but different marginals ptrain(x)ptest(x)p_{\text{train}}(x) \ne p_{\text{test}}(x). Shimodaira (2000) showed that under that assumption, the test-data risk minimizer is recovered asymptotically by reweighting the training loss by r(x)=ptest(x)/ptrain(x)r(x) = p_{\text{test}}(x) / p_{\text{train}}(x). The DRE estimators of §4–§8 produce r^\hat r from a labelled train set and an unlabelled test set — exactly the data we typically have.

Notation note: from this section through §10, pp plays the role of ptestp_{\text{test}} and qq plays the role of ptrainp_{\text{train}}. The role labels reverse from §1–§8 to keep r(x)=p(x)/q(x)r(x) = p(x)/q(x) as the importance weight applied to training data (the operational quantity).

The covariate-shift assumption

Train and test data are drawn from joint distributions ptrain(x,y)p_{\text{train}}(x, y) and ptest(x,y)p_{\text{test}}(x, y) on X×Y\mathcal{X} \times \mathcal{Y}. The covariate-shift assumption is that the conditional distribution of the label given the features is the same on both sides:

ptrain(yx)  =  ptest(yx)  =:  p(yx),butptrain(x)    ptest(x).p_{\text{train}}(y \mid x) \;=\; p_{\text{test}}(y \mid x) \;=:\; p(y \mid x), \quad \text{but} \quad p_{\text{train}}(x) \;\ne\; p_{\text{test}}(x).

Real-world settings that approximately satisfy this include selection bias in clinical trials, domain adaptation in vision (the underlying object/label relationship is fixed but imaging conditions shift the pixel marginal), and active learning where a query strategy shifts the input distribution but not the labelling oracle.

Importance-weighted ERM

Theorem 6 (Importance-Weighted Risk Identity).

Suppose covariate shift holds and ptest(x)ptrain(x)p_{\text{test}}(x) \ll p_{\text{train}}(x), with ratio r(x)=ptest(x)/ptrain(x)r(x) = p_{\text{test}}(x) / p_{\text{train}}(x). For any loss LL and predictor ff with Etrainr(X)L(f(X),Y)<\mathbb{E}_{\text{train}}|r(X)\, L(f(X), Y)| < \infty,

Rtest(f)  :=  E(X,Y)ptest[L(f(X),Y)]  =  E(X,Y)ptrain[r(X)L(f(X),Y)].R_{\text{test}}(f) \;:=\; \mathbb{E}_{(X, Y) \sim p_{\text{test}}}[L(f(X), Y)] \;=\; \mathbb{E}_{(X, Y) \sim p_{\text{train}}}\bigl[r(X)\, L(f(X), Y)\bigr].
Proof.

By the covariate-shift assumption and the §2.2 importance-weighting identity,

Rtest(f)  =  L(f(x),y)p(yx)ptest(x)dxdy  =  L(f(x),y)p(yx)r(x)ptrain(x)dxdy.R_{\text{test}}(f) \;=\; \int L(f(x), y)\, p(y \mid x)\, p_{\text{test}}(x)\, \mathrm{d}x\, \mathrm{d}y \;=\; \int L(f(x), y)\, p(y \mid x)\, r(x)\, p_{\text{train}}(x)\, \mathrm{d}x\, \mathrm{d}y.

The integrand factorizes back into r(x)L(f(x),y)ptrain(x,y)r(x)\, L(f(x), y)\, p_{\text{train}}(x, y).

The natural finite-sample estimator is the importance-weighted empirical risk

R^IW(f)  :=  1ntraini=1ntrainr(Xitrain)L(f(Xitrain),Yitrain),\hat R_{\text{IW}}(f) \;:=\; \frac{1}{n_{\text{train}}} \sum_{i=1}^{n_{\text{train}}} r(X_i^{\text{train}})\, L(f(X_i^{\text{train}}), Y_i^{\text{train}}),

and the corresponding IW-ERM estimator f^IW:=argminfR^IW(f)\hat f_{\text{IW}} := \arg\min_f \hat R_{\text{IW}}(f).

Theorem 7 (Shimodaira 2000, IW-ERM Consistency under Misspecification).

Let {fθ:θΘ}\{f_\theta : \theta \in \Theta\} be a parametric hypothesis class with Θ\Theta compact, and suppose the model is misspecified. Under standard regularity conditions, the IW-ERM estimator

θ^IW(n)  :=  argminθΘ1ni=1nr(Xitrain)L(fθ(Xitrain),Yitrain)\hat\theta_{\text{IW}}^{(n)} \;:=\; \arg\min_{\theta \in \Theta}\, \frac{1}{n} \sum_{i=1}^n r(X_i^{\text{train}})\, L(f_\theta(X_i^{\text{train}}), Y_i^{\text{train}})

converges in probability to θtest:=argminθΘRtest(fθ)\theta_{\text{test}}^\star := \arg\min_{\theta \in \Theta} R_{\text{test}}(f_\theta) as nn \to \infty. The unweighted ERM converges instead to θtrain\theta_{\text{train}}^\star, which differs from θtest\theta_{\text{test}}^\star in general when the model is misspecified.

Proof.

By the IW identity, Etrain[r(X)L(fθ(X),Y)]=Rtest(fθ)\mathbb{E}_{\text{train}}[r(X)\, L(f_\theta(X), Y)] = R_{\text{test}}(f_\theta) pointwise in θ\theta. A uniform law of large numbers on compact Θ\Theta gives supθR^IW(fθ)Rtest(fθ)0\sup_\theta |\hat R_{\text{IW}}(f_\theta) - R_{\text{test}}(f_\theta)| \to 0 in probability. Continuity of Rtest(fθ)R_{\text{test}}(f_\theta) in θ\theta and identifiability of the argmin deliver θ^IW(n)θtest\hat\theta_{\text{IW}}^{(n)} \to \theta_{\text{test}}^\star in probability via M-estimator consistency (van der Vaart 1998, Theorem 5.7).

The misspecification clause matters: when the model is well-specified — when {fθ}\{f_\theta\} contains the test-optimal ff^\star — the unweighted and IW-ERM estimators have the same population minimizer (namely ff^\star), and reweighting only inflates variance without changing the limit. The IW correction is doing work only in the misspecified case, which is the case that matters in practice.

Variance inflation and effective sample size

The IW-ERM estimator’s asymptotic correctness comes at a variance cost. The variance inflation factor relative to the unweighted estimator is approximately χ2(ptestptrain)+1\chi^2(p_{\text{test}} \,\|\, p_{\text{train}}) + 1 — the same chi-squared quantity from §2.3. The operational diagnostic is again the effective sample size neff=(iwi)2/iwi2n_{\text{eff}} = (\sum_i w_i)^2 / \sum_i w_i^2. Two common practical adjustments: truncated importance weights wimin(wi,B)w_i \to \min(w_i, B) for some upper cap BB (trades bias for variance) and self-normalized weights R^IWSN(f)=(iwiLi)/(iwi)\hat R_{\text{IW}}^{\text{SN}}(f) = (\sum_i w_i L_i) / (\sum_i w_i) (introduces O(1/ntrain)O(1/n_{\text{train}}) bias but reduces variance).

Numerical demonstration

We set up a misspecified linear-regression problem under covariate shift: true conditional y=x2+εy = x^2 + \varepsilon, ptrain(x)=N(1,1)p_{\text{train}}(x) = \mathcal{N}(1, 1), ptest(x)=N(0,1)p_{\text{test}}(x) = \mathcal{N}(0, 1). The unweighted train fit converges to slope 22, intercept 00 (the best linear fit under ptrainp_{\text{train}}); the test-oracle fit converges to slope 00, intercept 11. IW correction is supposed to move us from the former to the latter.

Two-panel figure: left shows training scatter with four linear fits overlaid (unweighted, IW-true, IW-uLSIF, oracle) and the true y equals x squared curve. Right shows bar chart of test MSE for the four estimators.
Covariate-shift correction on a misspecified linear-regression problem. The unweighted train fit converges to slope 2 because the train marginal puts mass at x equals 1 where x squared is steep. IW correction with either the true ratio or the uLSIF estimate moves the slope to near zero, matching the test-oracle fit. The bar chart makes the gap quantitative: unweighted test MSE is about three times the IW-corrected MSE on this DGP.
import numpy as np

def iw_linear_fit(x, y, w=None):
    n = len(x)
    if w is None:
        w = np.ones(n)
    sw = w.sum()
    swx = (w * x).sum()
    swy = (w * y).sum()
    swxx = (w * x * x).sum()
    swxy = (w * x * y).sum()
    b = (sw * swxy - swx * swy) / (sw * swxx - swx**2)
    a = (swy - b * swx) / sw
    return a, b

The IW fit with the uLSIF-estimated ratio recovers the test-optimal slope to within 0.3 and tracks the IW-true-r fit’s test MSE within 10%\sim 10\%. The empirical neffn_{\text{eff}} at μtrain=1\mu_{\text{train}} = 1 is about 200200 out of 500500 training samples — well above the 0.10.1 threshold from §12.2, so the IW estimator is operating in a regime where the asymptotic guarantees apply.

Conformal prediction under covariate shift

§10 carries the §9 covariate-shift application one step further: instead of producing a point predictor with calibrated risk, we produce a prediction interval with calibrated coverage. Vanilla split conformal prediction (Vovk, Gammerman, and Shafer 2005; Lei et al. 2018) gives a finite-sample, distribution-free coverage guarantee — but only under exchangeability of train, calibration, and test data, which covariate shift breaks. Tibshirani, Barber, Candès, and Ramdas (2019) showed that weighted exchangeability survives under covariate shift, with weights proportional to r(x)r(x).

Weighted exchangeability and the conformal quantile

Definition 4 (Weighted Exchangeability (TBCR 2019, Definition 1)).

Random variables Z1,,ZnZ_1, \ldots, Z_n with joint density ff are weighted exchangeable with weight functions w1,,wn:ZR0w_1, \ldots, w_n: \mathcal{Z} \to \mathbb{R}_{\geq 0} if there exists a permutation-invariant function g:ZnR0g: \mathcal{Z}^n \to \mathbb{R}_{\geq 0} such that

f(z1,,zn)  =  (i=1nwi(zi))g(z1,,zn).f(z_1, \ldots, z_n) \;=\; \biggl(\prod_{i=1}^{n} w_i(z_i)\biggr) \cdot g(z_1, \ldots, z_n).

Exchangeability is the special case wi1w_i \equiv 1 (with gg itself the joint density). In the covariate-shift setting, (Z1,,Zm,Zm+1)(Z_1, \ldots, Z_m, Z_{m+1}) with Zi=(Xi,Yi)Z_i = (X_i, Y_i) has joint density

f(z1,,zm+1)  =  i=1mptrain(zi)ptest(zm+1)  =  r(xm+1)i=1m+1ptrain(zi),f(z_1, \ldots, z_{m+1}) \;=\; \prod_{i=1}^m p_{\text{train}}(z_i) \cdot p_{\text{test}}(z_{m+1}) \;=\; r(x_{m+1}) \cdot \prod_{i=1}^{m+1} p_{\text{train}}(z_i),

the second equality by ptest(z)=r(x)ptrain(z)p_{\text{test}}(z) = r(x)\, p_{\text{train}}(z) under conditional invariance. Setting g(z1,,zm+1)=i=1m+1ptrain(zi)g(z_1, \ldots, z_{m+1}) = \prod_{i=1}^{m+1} p_{\text{train}}(z_i) (permutation-symmetric) and wi(z)=1w_i(z) = 1 for imi \leq m, wm+1(z)=r(x)w_{m+1}(z) = r(x) exhibits the factorization. So the calibration plus test data are weighted exchangeable with weights “1 for calibration points, rr for the test point.”

Lemma 3 (Weighted permutation (TBCR 2019, Lemma 3)).

Let Z1,,ZnZ_1, \ldots, Z_n be weighted exchangeable with weights w1,,wnw_1, \ldots, w_n, and let V:ZnRV: \mathcal{Z}^n \to \mathbb{R} be permutation-invariant. For any i{1,,n}i \in \{1, \ldots, n\} and any measurable set AA,

P(V(Z(i),Zi)A)  =  E ⁣[wi(Z(i))j=1nwj(Z(j))1{V(Z)A}].\mathbb{P}(V(Z_{(i)}, Z_{-i}) \in A) \;=\; \mathbb{E}\!\left[\, \frac{w_i(Z_{(i)})}{\sum_{j=1}^n w_j(Z_{(j)})} \cdot \mathbb{1}\{V(Z) \in A\} \,\right].

This lemma says that when we apply a permutation-invariant statistic to a weighted-exchangeable sequence, the marginal distribution at position ii is a weighted average of the joint statistic, with weights proportional to wiw_i. For the conformal application, VV is the empirical CDF of conformity scores.

The weighted split-conformal algorithm

Algorithm 2 (Weighted Split Conformal (TBCR 2019, Algorithm 1)).

  1. Split labelled PtrainP_{\text{train}} data into training and calibration folds of sizes ntrn_{\text{tr}} and mm.
  2. Train predictor μ^\hat\mu on the training fold.
  3. Compute calibration residuals Ri:=Yiμ^(Xi)R_i := |Y_i - \hat\mu(X_i)| for i=1,,mi = 1, \ldots, m.
  4. For each new test point Xm+1=xX_{m+1} = x^\star:
    • Compute weights wi:=r^(Xi)w_i := \hat r(X_i) for i=1,,mi = 1, \ldots, m and wm+1:=r^(x)w_{m+1} := \hat r(x^\star).
    • Normalize: p~i:=wi/j=1m+1wj\tilde p_i := w_i / \sum_{j=1}^{m+1} w_j.
    • Define the weighted empirical CDF F~(t):=i=1mp~i1{Rit}+p~m+11{+t}\tilde F(t) := \sum_{i=1}^m \tilde p_i \cdot \mathbb{1}\{R_i \leq t\} + \tilde p_{m+1} \cdot \mathbb{1}\{+\infty \leq t\}.
    • The weighted conformal quantile is q^(x):=inf{t:F~(t)1α}\hat q(x^\star) := \inf\{ t : \tilde F(t) \geq 1 - \alpha \}.
    • The prediction set is C^(x):={y:yμ^(x)q^(x)}\hat C(x^\star) := \bigl\{ y : |y - \hat\mu(x^\star)| \leq \hat q(x^\star) \bigr\}.

The vanilla split-conformal special case is recovered when all wi1w_i \equiv 1. Under covariate shift, the unequal weights reweight which calibration residuals “count” — points near the test region get more weight, points far from it get less — and the resulting quantile adapts to the test distribution.

Theorem 8 (Weighted Conformal Coverage (TBCR 2019, Theorem 2)).

Under the covariate-shift assumption with iid calibration data from PtrainP_{\text{train}} and an independent test point from PtestP_{\text{test}}, if we run Algorithm 2 above using the true rr for the weights, the resulting prediction set satisfies

P(Ym+1C(Xm+1))    1α.\mathbb{P}\bigl(Y_{m+1} \in C(X_{m+1})\bigr) \;\geq\; 1 - \alpha.

The guarantee is finite-sample, distribution-free over PtrainP_{\text{train}}.

Proof.

The conformity score Vi=Ri=Yiμ^(Xi)V_i = R_i = |Y_i - \hat\mu(X_i)| is permutation-invariant in its construction. The calibration-plus-test sequence is weighted exchangeable with weights wi=1w_i = 1 (calibration) and wm+1=r(xm+1)w_{m+1} = r(x_{m+1}) (test). The weighted-permutation lemma above implies that the rank of the test residual Rm+1R_{m+1} among {R1,,Rm,Rm+1}\{R_1, \ldots, R_m, R_{m+1}\} has the weighted discrete distribution with probabilities p~i\tilde p_i. The quantile q^\hat q is then constructed so that, with probability at least 1α1 - \alpha, Rm+1q^R_{m+1} \leq \hat q, which is the event Ym+1C(Xm+1)Y_{m+1} \in C(X_{m+1}).

Coverage validity as a function of DRE accuracy

The TBCR coverage guarantee assumes we use the true ratio rr, which in practice we don’t have. The natural substitution is r^\hat r from any DRE estimator.

Theorem 9 (Approximate Coverage with Estimated Ratio (TBCR 2019, Theorem 3)).

Under the §10.2 setup with weights computed from an estimate r^\hat r instead of the true rr, the resulting prediction set satisfies

P(Ym+1C^(Xm+1))    (1α)    dTV(Ptest,r^Ptrain)  +  O(1/m).\bigl|\, \mathbb{P}(Y_{m+1} \in \hat C(X_{m+1})) \;-\; (1 - \alpha) \,\bigr| \;\leq\; d_{\mathrm{TV}}\bigl(P_{\text{test}}, \hat r \cdot P_{\text{train}}\bigr) \;+\; O(1/m).

The bound says: the coverage error is governed by how badly r^\hat r misrepresents the test distribution, plus a finite-sample O(1/m)O(1/m) correction. The DRE accuracy is the bottleneck. Any of the §4–§8 estimators that fits rr accurately will give approximately valid coverage when plugged in.

Numerical demonstration

We reuse the §9 covariate-shift DGP (true conditional y=x2+εy = x^2 + \varepsilon; ptrain(x)=N(1,1)p_{\text{train}}(x) = \mathcal{N}(1, 1); ptest(x)=N(0,1)p_{\text{test}}(x) = \mathcal{N}(0, 1)), train an OLS linear predictor on the training fold (to keep the conformal demo decoupled from §9’s IW machinery), compute residuals on a held-out calibration fold, then evaluate three conformal procedures on a fresh test set.

Two-panel figure of weighted conformal coverage under covariate shift. Left: test scatter with three prediction intervals overlaid as bands. Right: empirical coverage rate vs nominal coverage level for the three procedures with diagonal calibration reference.
Weighted conformal coverage under covariate shift on the running toy. Vanilla split conformal under-covers because calibration residuals come from the train marginal, not the test marginal. Weighted conformal with the true ratio achieves nominal coverage as the theorem guarantees; weighted conformal with the uLSIF estimate is within a few percentage points of nominal — the residual gap is the §10.3 DRE-accuracy term.
import numpy as np

def weighted_conformal_quantile(residuals, w_cal, w_test, alpha):
    pairs = sorted(zip(residuals, w_cal), key=lambda t: t[0])
    sum_w = sum(w_cal) + w_test
    target = 1 - alpha
    cum = 0.0
    for r, w in pairs:
        cum += w / sum_w
        if cum >= target:
            return r
    return float('inf')

On the verified-execution stack, vanilla unweighted coverage at 1α=0.901 - \alpha = 0.90 is about 0.750.75 — a 15-point shortfall. Weighted conformal with the true rr achieves 0.920.92 (essentially nominal up to discrete-quantile adjustment); weighted conformal with uLSIF achieves 0.900.90 — within a percentage point of nominal. The interval widths in the left panel show how weighting adapts: the weighted bands are slightly wider in the test region (where unweighted intervals are systematically too short) and slightly narrower elsewhere.

MMD as a special-case kernel ratio

§11 closes the loop on the kernel-methods machinery that §4 (KMM) opened, by examining the maximum mean discrepancy — a kernel-embedding distance between distributions that doubles as a two-sample test statistic. The two-sample-testing problem “is p=qp = q?” is the special case of the density-ratio problem “what is r=p/qr = p/q?” where we only care whether r1r \equiv 1. Gretton, Borgwardt, Rasch, Schölkopf, and Smola (2012) showed that MMD admits a clean kernel-mean-embedding form, an unbiased sample estimator, and a permutation-based null calibration that yields a finite-sample-valid test.

MMD definition and kernel-mean-embedding form

Definition 5 (Maximum Mean Discrepancy).

For a function class F\mathcal{F} and probability distributions P,QP, Q on X\mathcal{X}, the maximum mean discrepancy is

MMD(P,Q;F)  :=  supfFEP[f(X)]EQ[f(X)].\mathrm{MMD}(P, Q; \mathcal{F}) \;:=\; \sup_{f \in \mathcal{F}}\, \bigl|\, \mathbb{E}_P[f(X)] - \mathbb{E}_Q[f(X)] \,\bigr|.

When F\mathcal{F} is the unit ball of an RKHS H\mathcal{H} with kernel kk,

MMD(P,Q;H)  =  μPμQH,\mathrm{MMD}(P, Q; \mathcal{H}) \;=\; \bigl\|\, \mu_P - \mu_Q \,\bigr\|_\mathcal{H},

where μP\mu_P is the kernel mean embedding from §4.1.

Proof.

By the reproducing property, for fHf \in \mathcal{H},

EP[f(X)]EQ[f(X)]  =  f,μPμQH.\mathbb{E}_P[f(X)] - \mathbb{E}_Q[f(X)] \;=\; \langle f, \mu_P - \mu_Q \rangle_\mathcal{H}.

Cauchy–Schwarz gives f,μPμQHfHμPμQH|\langle f, \mu_P - \mu_Q \rangle_\mathcal{H}| \leq \|f\|_\mathcal{H} \cdot \|\mu_P - \mu_Q\|_\mathcal{H}, with equality at f=(μPμQ)/μPμQHf = (\mu_P - \mu_Q) / \|\mu_P - \mu_Q\|_\mathcal{H}. The supremum over fH1\|f\|_\mathcal{H} \leq 1 is therefore μPμQH\|\mu_P - \mu_Q\|_\mathcal{H} exactly.

The squared MMD admits an expansion in pairwise kernel evaluations:

MMD2(P,Q;H)  =  EX,XP[k(X,X)]    2EXP,YQ[k(X,Y)]  +  EY,YQ[k(Y,Y)],\mathrm{MMD}^2(P, Q; \mathcal{H}) \;=\; \mathbb{E}_{X, X' \sim P}[k(X, X')] \;-\; 2\, \mathbb{E}_{X \sim P, Y \sim Q}[k(X, Y)] \;+\; \mathbb{E}_{Y, Y' \sim Q}[k(Y, Y')],

which follows from expanding μPμQH2\|\mu_P - \mu_Q\|^2_\mathcal{H}. The corresponding U-statistic estimator

MMD^u2  :=  1n(n1)iik(Xi,Xi)    2nmi,jk(Xi,Yj)  +  1m(m1)jjk(Yj,Yj)\widehat{\mathrm{MMD}}^2_u \;:=\; \frac{1}{n(n - 1)} \sum_{i \ne i'} k(X_i, X_{i'}) \;-\; \frac{2}{n\, m} \sum_{i, j} k(X_i, Y_j) \;+\; \frac{1}{m(m - 1)} \sum_{j \ne j'} k(Y_j, Y_{j'})

is unbiased for MMD2(P,Q)\mathrm{MMD}^2(P, Q); removing the diagonals avoids the O(1/n)O(1/n) bias of the plug-in V-statistic.

For a characteristic kernel (the Gaussian RBF on Rd\mathbb{R}^d qualifies),

MMD(P,Q;H)  =  0        P=Q,\mathrm{MMD}(P, Q; \mathcal{H}) \;=\; 0 \;\iff\; P = Q,

so MMD genuinely separates distinct distributions.

Two-sample testing as ratio-distinguishability

The hypothesis-testing problem is H0:P=QH_0: P = Q vs H1:PQH_1: P \ne Q, which under any characteristic kernel is equivalent to H0:MMD2(P,Q)=0H_0: \mathrm{MMD}^2(P, Q) = 0 vs H1:MMD2>0H_1: \mathrm{MMD}^2 > 0. From the DRE perspective, this is the question “is r1r \equiv 1?” — the special case of density-ratio estimation where we don’t need to know rr, just whether it’s the trivial ratio.

Under H0H_0, the pooled sample (X1,,Xn,Y1,,Ym)(X_1, \ldots, X_n, Y_1, \ldots, Y_m) is iid from the common distribution, so the labels assigning each point to “the XX sample” vs “the YY sample” are exchangeable. A permutation test exploits this directly: shuffle the labels BB times, recompute the MMD statistic, and use the resulting empirical null distribution to compute a p-value.

Theorem 10 (Finite-Sample Validity of Permutation Test).

Under H0:P=QH_0: P = Q, the permutation pp-value p^\hat p satisfies P(p^α)α\mathbb{P}(\hat p \leq \alpha) \leq \alpha for every α(0,1)\alpha \in (0, 1) and every B1B \geq 1. The test controls Type-I error exactly at level α\alpha for α{1/(B+1),2/(B+1),}\alpha \in \{1/(B+1), 2/(B+1), \ldots\}.

Proof.

Under H0H_0, the labels are exchangeable, so (Tobs,T(1),,T(B))(T_{\text{obs}}, T^{(1)}, \ldots, T^{(B)}) are exchangeable; the rank of TobsT_{\text{obs}} among the B+1B + 1 statistics is uniform on {1,,B+1}\{1, \ldots, B + 1\}. The event p^α\hat p \leq \alpha corresponds to TobsT_{\text{obs}} being in the top α(B+1)\lfloor \alpha (B+1) \rfloor statistics, which has probability at most α(B+1)/(B+1)α\lfloor \alpha (B+1) \rfloor / (B+1) \leq \alpha.

Kernel choice and the discriminative-power-vs-smoothness trade-off

The kernel bandwidth σk\sigma_k governs both the MMD test’s power and the KMM estimator’s smoothness, but in opposite directions. Small σk\sigma_k gives a witness function class with high resolution but high variance — test power is low for moderate-shift alternatives. Large σk\sigma_k gives a smooth witness class but compresses fine-scale differences. Gretton et al. (2012, §8) recommend choosing σk\sigma_k to maximize test power on a held-out fold, which typically lands at the median-pairwise-distance heuristic for Gaussian-vs-Gaussian-like alternatives. For KMM-style DRE, slightly larger σk\sigma_k is preferred; the two-sample-testing optimum and the DRE optimum coincide only when the alternative is “Gaussian-shaped” at the scale of the median heuristic. Sutherland et al. (2017) derive a data-dependent kernel-selection objective for multimodal alternatives.

Numerical demonstration

We run the MMD permutation test under two scenarios on the running 1-D toy: (i) null, where both samples are iid from N(0,1)\mathcal{N}(0, 1), and (ii) alternative, where the second sample comes from the shifted N(0.5,1)\mathcal{N}(0.5, 1). Sample sizes n=m=200n = m = 200, kernel bandwidth from the median heuristic on the pooled sample, B=500B = 500 permutations. A third panel computes a small empirical power curve to confirm Type-I control at δ=0\delta = 0 and rising power as the shift grows.

Three-panel figure of MMD permutation test. Left: histogram of permutation-distribution MMD-squared values under the null with observed statistic in the bulk. Middle: same histogram under the shifted alternative with observed statistic far to the right. Right: empirical power curve, rejection rate vs shift size.
MMD permutation test under null and shifted alternative on the running toy. Under the null, the observed statistic falls in the bulk of the permutation distribution. Under even a modest shift of 0.5 sigma, the observed statistic separates cleanly. The empirical power curve rises from near alpha at delta equals 0 toward one as delta grows.
import numpy as np
from scipy.spatial.distance import cdist

def mmd_u_squared(X, Y, sigma):
    Kxx = np.exp(-cdist(X[:, None], X[:, None], 'sqeuclidean') / (2 * sigma**2))
    Kyy = np.exp(-cdist(Y[:, None], Y[:, None], 'sqeuclidean') / (2 * sigma**2))
    Kxy = np.exp(-cdist(X[:, None], Y[:, None], 'sqeuclidean') / (2 * sigma**2))
    np.fill_diagonal(Kxx, 0); np.fill_diagonal(Kyy, 0)
    n, m = len(X), len(Y)
    return Kxx.sum() / (n*(n-1)) - 2*Kxy.mean() + Kyy.sum() / (m*(m-1))

Under H0H_0, the observed statistic is one of many similar permutation values (p-value >0.1> 0.1). Under the shifted alternative, the observed statistic is in the far right tail (p-value <0.01< 0.01). The empirical power increases monotonically with shift size, from near α=0.05\alpha = 0.05 at δ=0\delta = 0 (Type-I level) to near 11 at δ=1.0\delta = 1.0.

Practical considerations, diagnostics, and pitfalls

§12 brings the four estimator families back to the practitioner’s bench and asks: what do we tune, what do we monitor, and where does DRE quietly fail? The earlier sections derived each estimator carefully and ran a controlled-toy demo; this section consolidates the operational guidance and surfaces a single failure mode that all four estimators share — the curse of dimensionality.

Bandwidth and basis selection across KLIEP, LSIF, uLSIF

The three kernel-basis estimators (§5 KLIEP, §6 LSIF, §6 uLSIF) share a Gaussian basis ψ(x)=exp(xc2/(2σ2))\boldsymbol{\psi}_\ell(x) = \exp(-\|x - c_\ell\|^2 / (2 \sigma^2)) with bb centres drawn as a random subsample of the pp-samples.

  • Number of centres bb. Sugiyama et al. (2008) recommend b=100b = 100 as a default. Increasing bb past 100100 buys little on 1-D and 2-D problems but matters for d5d \geq 5.
  • Bandwidth σ\sigma. KMM (§4) has no built-in CV criterion — median-pairwise-distance heuristic. KLIEP (§5) uses KK-fold KL cross-validation. uLSIF (§6) uses analytic LOO-CV from the Sherman–Morrison machinery — no refits, single O(b3+b2(np+nq))O(b^3 + b^2 (n_p + n_q)) score per (σ,λ)(\sigma, \lambda) pair.
  • Tikhonov ridge λ\lambda for uLSIF. Log-spaced grid λ[104,101]\lambda \in [10^{-4}, 10^{-1}] with 7–9 candidates, swept alongside the bandwidth via analytic LOO. Larger λ\lambda shrinks r^\hat r toward 11 — useful when the support overlap is poor.

Effective sample size as a runtime diagnostic

The effective-sample-size diagnostic neff=(iwi)2/iwi2n_{\text{eff}} = (\sum_i w_i)^2 / \sum_i w_i^2 is the most important runtime check on any DRE-driven pipeline. After fitting any DRE estimator, compute neffn_{\text{eff}} on the qq-samples using wi=r^(Xiq)w_i = \hat r(X_i^q):

neff/nqn_{\text{eff}} / n_qInterpretationAction
>0.5> 0.5Reweighting is well-conditionedProceed with downstream IW estimator
0.10.50.1 \le \cdot \le 0.5Moderate effective shrinkage; usable but variableUse IW with awareness; consider truncation
0.01<0.10.01 \le \cdot < 0.1Severe shrinkage; IW estimator is high-varianceStrongly consider self-normalized IW or truncation; sanity-check r^\hat r
<0.01< 0.01r^\hat r concentrates almost all mass on a few samplesAlmost certainly the DRE estimator is failing; check the support overlap

When neffn_{\text{eff}} is small relative to nqn_q, distinguishing “true large χ2\chi^2” from “DRE over-fitting” requires a side check, typically by comparing r^\hat r to the truncated estimator min(r^,B)\min(\hat r, B) and seeing whether the IW point estimate is stable to truncation.

The curse of dimensionality for DRE

All four estimator families fit a function on Rd\mathbb{R}^d from finite samples. Nonparametric estimation in dd dimensions suffers a curse-of-dimensionality rate n2/(4+d)n^{-2/(4+d)} for second-order-smooth targets (Stone 1982). For uLSIF (Kanamori, Suzuki, and Sugiyama 2012, §6), the MISE converges at rate n4/(4+d)n^{-4/(4+d)} in dimension dd. For d=1d = 1 this is n4/5n^{-4/5}; for d=20d = 20 it is n4/24n0.17n^{-4/24} \approx n^{-0.17} — essentially flat.

We demo this on a controlled DGP: p=N(0d,Id)p = \mathcal{N}(\mathbf{0}_d, \mathbf{I}_d), q=N(1d,Id)q = \mathcal{N}(\mathbf{1}_d, \mathbf{I}_d), with closed-form ratio r(x)=exp(d/21dx)r(\mathbf{x}) = \exp(d/2 - \mathbf{1}_d^\top \mathbf{x}). Fit uLSIF in dimensions d{1,2,5,10,20}d \in \{1, 2, 5, 10, 20\} at fixed np=nq=500n_p = n_q = 500.

Two-panel figure of curse of dimensionality. Left: log MSE of uLSIF r-hat vs closed-form r, as a function of input dimension, on a log y-axis. Right: effective sample size n_eff over n_q at the q-samples, showing two curves — one for r_hat-induced ESS and one for true-r ESS.
The curse of dimensionality for DRE on isotropic shift Gaussians. Left: log MSE grows by roughly an order of magnitude with each five-dimensional increment, consistent with the Stone 1982 rate. Right: the TRUE-r effective sample size collapses exponentially with d; the r-hat-induced ESS is misleadingly stable because the kernel-basis estimator over-smooths toward uniform reweighting in high dimensions, masking the conceptual curse.

The curse of dimensionality has two faces. (i) Conceptually, the importance weights themselves become wildly variable — Var(r)\mathrm{Var}(r) grows log-linearly in dd and TRUE-rr ESS collapses exponentially. (ii) Operationally, kernel-basis estimators at median bandwidth degrade toward uniform r^\hat r, which inflates r^\hat r-weight ESS and silently masks the conceptual curse. Past d20d \approx 20, kernel-basis DRE requires either explicit structural assumptions, neural classification-DRE that can exploit feature-learning, or acceptance that the IW estimator will be high-variance and partial corrections like truncation are necessary.

When to prefer classification-DRE over kernel-DRE

Use kernel-basis DRE (uLSIF in §6 as the default) when:

  • Input dimension is moderate (d20d \leq 20).
  • The downstream use only needs sample-level weights or function evaluations on the train/test feature distributions.
  • The smoothness of the underlying rr is plausible (Hölder-continuous with order 1\geq 1).
  • The runtime budget is small and the practitioner wants principled hyperparameter selection without an SGD pipeline.

Use classification-DRE (logistic regression in §7 as the default, neural classifier in §8 for high dd) when:

  • Input dimension is high (d>50d > 50), or the input space is structured (images, text, time series).
  • A pretrained classifier or feature extractor is already available — exponentiating its logit gives a DRE estimator for free.
  • The classifier loss is naturally proper; for tree ensembles or SVMs, apply post-hoc Platt scaling or isotonic regression first.
  • The end use is a function on the input space rather than sample-level weights — neural classifiers extrapolate naturally, while in-sample-only KMM does not.

Use neural variational DRE (§8.3 NWJ-trained MLP) when:

  • The estimation target is a divergence (KL or chi-squared) rather than the ratio itself — mutual-information estimation, distributional distances for representation learning.
  • High dimension and structured inputs make classification-DRE the natural choice, but the downstream use is a divergence integral rather than per-sample weights.

Avoid DRE entirely when: neff/nqn_{\text{eff}} / n_q from any estimator falls below 0.010.01, or when concept drift / label shift is suspected (covariate shift’s load-bearing assumption has failed).

Connections, limits, and forward pointers

This closing section maps the cross-site prerequisite structure, lists within-formalML sibling topics that share machinery with DRE, and surfaces the topic’s honest limits and open research directions.

DRE shares machinery or framing with the following sibling topics:

  • Kernel Regression (T2 Supervised Learning). The kernel-weight matrix construction in §4–§6 (Gaussian RBF basis, median-heuristic bandwidth, Tikhonov-regularized linear systems, leave-one-out cross-validation) is the same machinery this topic uses for the Nadaraya–Watson estimator. KMM, KLIEP, and uLSIF can be read as “Nadaraya–Watson for the density ratio.”
  • Gaussian Processes (T5 Bayesian ML). The kernel-mean-embedding machinery in §4.1 and reused in §11.1 lives in the same RKHS framework Gaussian processes use for posterior computation. The “characteristic kernel” notion that drives MMD’s injectivity and KMM’s identifiability is the same condition that gives GPs their universality.
  • Normalizing Flows (T3 Unsupervised & Generative). Both topics target generative modeling — flows estimate pp via change-of-variables; DRE estimates p/qp/q directly. The §8.4 GAN-as-implicit-DRE observation connects the two: a normalizing flow trained adversarially has a discriminator implicitly doing DRE.
  • Conformal Prediction (T4 Statistical ML Theory). The §10 weighted-exchangeability extension of split conformal is the immediate downstream use of DRE in calibrated prediction-interval construction. Tibshirani, Barber, Candès, and Ramdas (2019) built that bridge; DRE supplies the r^\hat r that closes it.
  • Clustering (T3 Unsupervised & Generative). The §1.1 plug-in-pathology demo — KDE divided by KDE — is the same KDE machinery the clustering topic uses for mean-shift’s density gradient. Both topics extend KDE in different directions: clustering follows the gradient uphill; DRE replaces the quotient with direct estimation.

Honest limits of DRE

The four estimator families derived in this topic are mathematically clean and operationally effective in the regime they were built for — tabular data, moderate dimension, smooth ratios — but they have hard limits the practitioner should hold close.

Curse of dimensionality. All four estimators degrade polynomially with input dimension at the Stone (1982) rate. By d50d \approx 50, none of them produce a usable estimate at any reasonable sample size without structural assumptions. The neural classification-DRE of §7 and §8 partially escape this via feature-learning, but only when the feature representation actually compresses the effective dimension.

Support overlap is often violated in practice. The absolute-continuity hypothesis pqp \ll q from §2.3 is a strong assumption. In real covariate-shift settings — clinical trials with new patient demographics, vision models deployed in new geographic conditions — the test distribution routinely has support outside the training distribution. When this happens, the true rr is infinite on a set of positive pp-measure, no DRE estimator can recover it there, and the IW estimator’s bias is not just a finite-sample artefact but a fundamental obstruction.

Calibration requirements for classification-DRE. §7’s classification-DRE identity assumes the classifier’s predicted probabilities approximate the true posterior. Most modern classifiers are miscalibrated by default (Niculescu-Mizil and Caruana 2005; Guo et al. 2017). Plugging an uncalibrated classifier into the DRE identity gives a systematically biased ratio estimate, and the bias propagates into every downstream IW estimator without warning.

IW estimator variance is fundamental, not removable. Even with a perfectly known rr, the IW estimator’s variance is inflated by χ2(pq)+1\chi^2(p \,\|\, q) + 1 relative to the unweighted estimator. When the chi-squared divergence is large — which is the regime where reweighting matters most — IW estimators are unavoidably high-variance. Truncation and self-normalization help but introduce bias. There is no free lunch: large χ2\chi^2 implies high IW variance.

DRE in modern deep-learning practice. The classical DRE pipeline — fit r^\hat r, then plug into a downstream IW estimator — is not always how the deep-learning community engages with density ratios. Modern generative models (GANs, score-based diffusion, VAE-GAN hybrids), contrastive representation learning (InfoNCE, SimCLR), and energy-based models all implicitly estimate density ratios as a side effect of their primary training objective. These implicit estimators are typically more accurate on high-dimensional structured data than explicit kernel-DRE, but they don’t expose r^\hat r as a first-class object; recovering it requires careful logit extraction or auxiliary discriminator probes.

Open problems and forward pointers

DRE is an active research area with several open directions that exceed the scope of this topic.

  • Bregman-divergence DRE with neural witnesses. The §3 Bregman framework with the §8 neural parametrization is a marriage of convenience that hasn’t been fully systematized — for arbitrary ϕ\phi, when does the neural-NWJ training actually converge to f(r)f'(r) at the right rate? Recent work (Rhodes, Xu, Gutmann 2020 on telescoping density-ratio estimation; Choi, Liao, Ermon 2021 on featurized density-ratio estimation) makes progress.
  • DRE under sample-selection bias. Sample-selection bias is a strict generalization of covariate shift in which the training-sample inclusion probability depends on (x,y)(x, y) rather than xx alone. The Heckman (1979) selection-correction framework provides one route; modern semi-parametric efficiency theory (Kennedy, Ma, McHugh, Small 2017) provides a richer one.
  • Off-policy evaluation in reinforcement learning. Off-policy evaluation — estimating the value of a target policy from data collected by a different behaviour policy — is DRE applied to action-conditioned distributions, often with horizon effects that produce exponential blow-up in the ratio. Doubly-robust estimators (Jiang and Li 2016) and marginalized importance sampling (Liu et al. 2018) are active areas.
  • Score-based DRE. A recent line of work uses score matching to estimate logr\nabla \log r directly, then integrates to recover rr up to a normalizing constant. Hyvärinen (2005) introduced the score-matching framework; Choi, Liao, and Ermon (2021) adapted it to DRE specifically.
  • Conditional DRE. Estimating p(xz)/q(xz)p(x \mid z) / q(x \mid z) for some conditioning variable zz comes up in conditional-distribution shift, fairness-constrained DRE, and treatment-effect estimation.
  • DRE for sequential and non-stationary data. When the training and test distributions are not just shifted but evolving over time, the iid foundation of §2 breaks down. Online DRE, drift detection, and concept-drift correction are active areas (Gama et al. 2014 for a survey).

Connections

  • Both topics extend kernel density estimation: clustering follows the density gradient uphill (mean-shift), DRE replaces the quotient $\hat p / \hat q$ with direct estimation that avoids the §1.1 tail blowup. clustering
  • Shares the Gaussian-RBF basis, median-heuristic bandwidth selection, and Tikhonov-regularized linear-system machinery; KMM, KLIEP, and uLSIF are 'Nadaraya–Watson for the density ratio.' kernel-regression
  • The kernel-mean-embedding machinery in §4.1 and §11.1 lives in the same RKHS framework Gaussian processes use; the characteristic-kernel notion that drives MMD's injectivity is the same condition that gives GPs their universality. gaussian-processes
  • Both target generative modeling; flows estimate $p$ via change-of-variables, DRE estimates $p/q$ directly. The §8.4 GAN-as-implicit-DRE observation connects the two: an adversarially trained flow has a discriminator implicitly doing DRE. normalizing-flows
  • The §10 weighted-exchangeability extension is the immediate downstream use of DRE in calibrated prediction-interval construction. DRE supplies the $\hat r$ that closes the Tibshirani–Barber–Candès–Ramdas (2019) coverage guarantee under shift. conformal-prediction
  • KLIEP minimizes $\mathrm{KL}(p \,\|\, \hat r \cdot q)$; the §8 NWJ variational view recovers the KL divergence as the value of the bound at $T^\star = \log r + 1$; mutual information is the KL between joint and product of marginals. kl-divergence
  • The Bregman-divergence master identity in §3 rests on convexity of the generator $\phi$; KMM (§4.2), KLIEP (§5.2), and uLSIF (§6.1) are convex programs solved by Lagrangian-duality / KKT machinery. convex-analysis

References & Further Reading