formalML
The mathematical machinery behind modern machine learning
Deep-dive explainers combining rigorous mathematics, interactive visualizations, and working code. Built for practitioners, graduate students, and researchers.
Latest Topics
Double Descent
The interpolation threshold, the minimum-norm interpolator, and the modern overparameterized regime
Double descent is the empirical and analytic phenomenon that test error, plotted as a function of model complexity, has two descents rather than one — a classical descent in the underparameterized regime, a spike at the interpolation threshold where the number of parameters equals the number of training points, and a second descent into the overparameterized regime that classical learning theory deemed hopeless. This topic develops the linear theory of double descent in full: §§1–3 establish the empirical phenomenon and pin down what makes the interpolation threshold special. §4 introduces the minimum-norm interpolator. §§5–6 develop the Marchenko–Pastur random-matrix machinery and the Hastie–Montanari–Rosset–Tibshirani 2022 closed form that gives the analytic double-descent curve. §7 separates model-wise from sample-wise double descent and examines the 'more data can hurt' pathology. §8 generalizes to random-features models and the kernel limit. §9 proves that gradient descent from zero initialization converges to the minimum-norm interpolator, making the implicit regularization explicit. §10 is a sidebar on deep double descent. §§11–12 introduce effective dimension and revise the classical capacity-control framework. §§13–14 close with computational notes and an honest accounting of what the linear theory predicts versus what remains open in the feature-learning regime.
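A minimal sketch of the §4 object in action, assuming an illustrative random-features model and Gaussian DGP (neither is from the article): `np.linalg.pinv` returns the minimum-norm least-squares solution, which interpolates the training data once the feature count p reaches n, and sweeping p through the threshold p = n typically traces the spike-and-second-descent shape.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_true = 50, 20                       # training points, raw input dimension (illustrative)

# Toy linear DGP in the raw inputs: y = x . beta + noise
beta = rng.normal(size=d_true) / np.sqrt(d_true)
X_train = rng.normal(size=(n, d_true))
X_test = rng.normal(size=(1000, d_true))
y_train = X_train @ beta + 0.5 * rng.normal(size=n)
y_test = X_test @ beta

def random_features(X, p, seed=1):
    """Project into p fixed random tanh features (a toy random-features model)."""
    W = np.random.default_rng(seed).normal(size=(X.shape[1], p)) / np.sqrt(X.shape[1])
    return np.tanh(X @ W)

for p in [5, 25, 45, 50, 55, 100, 400]:  # sweep through the threshold p = n = 50
    Phi_tr, Phi_te = random_features(X_train, p), random_features(X_test, p)
    # pinv gives the minimum-norm least-squares solution; for p >= n it
    # interpolates the training labels exactly.
    w = np.linalg.pinv(Phi_tr) @ y_train
    test_mse = np.mean((Phi_te @ w - y_test) ** 2)
    print(f"p = {p:4d}  test MSE = {test_mse:.3f}")
```

The tanh feature map and noise level are arbitrary choices; the qualitative spike near p = n, not its exact height, is the point of the sweep.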
VC Dimension
Combinatorial capacity, Sauer–Shelah, and uniform convergence for binary hypothesis classes
The Vapnik–Chervonenkis dimension is the combinatorial summary of a binary hypothesis class's capacity — the size of the largest point set on which the class can realize every conceivable labeling. Finiteness of this single integer is the necessary and sufficient condition for distribution-free uniform convergence of empirical risk to population risk, and the Sauer–Shelah lemma converts it into a polynomial growth function Π(n) ≤ (en/d)^d for n ≥ d that powers the Fundamental Theorem of Statistical Learning. This topic develops VC dimension at the intermediate level on two anchor classes (half-planes and axis-aligned rectangles in the plane) carried through every section. §§1–4 introduce the motivation, shattering, growth function, and VC dimension itself. §5 proves the Sauer–Shelah lemma in full via Pajor's shifting trick. §6 chains Sauer–Shelah with Hoeffding through symmetrization to produce the FTSL (both the upper- and lower-bound directions). §7 separates realizable (1/ε) from agnostic (1/ε²) rates. §8 computes the VC dimension of canonical classes — half-spaces via Radon's theorem, axis-aligned rectangles via pigeonhole, and neural networks via the Bartlett bound. §9 bridges to Rademacher complexity. §10 is the integrative empirical-shatter-check experiment that verifies every bound on the same two anchor classes. §§11–13 cover ML applications (SVM margins, decision-tree pruning, the deep-learning puzzle), computational notes, and forward-pointers.
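A minimal sketch of the §10 shatter check on the half-plane anchor class, assuming only NumPy and SciPy (this is not the article's own experiment code): enumerate all 2^n labelings of a point set and test each for realizability with an exact linear-programming feasibility check, so that three generic points come out shattered while four never do, as Radon's theorem predicts.

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def separable(points, labels):
    """Exact linear-separability check: the labeling is realizable by a
    half-plane iff y_i (w . x_i + b) >= 1 has a feasible (w, b)."""
    # Variables (w1, w2, b); encode -y_i (w . x_i + b) <= -1 for linprog.
    A = -labels[:, None] * np.hstack([points, np.ones((len(points), 1))])
    res = linprog(c=np.zeros(3), A_ub=A, b_ub=-np.ones(len(points)),
                  bounds=[(None, None)] * 3)
    return res.status == 0  # status 0: feasible optimum found

def shattered(points):
    """True iff half-planes realize all 2^n labelings of `points`."""
    n = len(points)
    return all(separable(points, np.array(lab))
               for lab in itertools.product([-1.0, 1.0], repeat=n))

rng = np.random.default_rng(0)
print(shattered(rng.normal(size=(3, 2))))  # 3 generic points: True
print(shattered(rng.normal(size=(4, 2))))  # any 4 points: False, so VC dim = 3
```

The LP makes the check exact rather than heuristic: feasibility of the margin system is equivalent to strict separability, so no perceptron iteration cap can cause a false negative.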
Causal Inference Methods
Identification, doubly-robust estimation, and inference under confounding — from potential outcomes and IPW through AIPW, TMLE, and DML, with instrumental variables, front-door identification, heterogeneous treatment effects, and sensitivity analysis
How do we estimate causal effects from observational data? This topic develops the modern toolkit for the binary-treatment average treatment effect under ignorability — potential outcomes, identification, inverse-probability weighting (IPW), outcome regression, augmented IPW (AIPW) with its doubly-robust property, targeted maximum likelihood estimation (TMLE), and double/debiased machine learning (DML) — then extends to instrumental variables and front-door identification when ignorability fails, to heterogeneous treatment effects with meta-learners and causal forests, and to sensitivity analysis via E-values and Cinelli–Hazlett robustness values. The signature theorem is the doubly-robust property of AIPW: consistency for the average treatment effect whenever either the propensity model or the outcome model is correctly specified — a remarkable two-shots-on-goal guarantee that no single-model estimator can match. The framework is threaded through a single Robinson partially linear DGP from §1 onward, so that every estimator can be compared empirically on the same canonical workbench.
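A minimal sketch of the doubly-robust property on a Robinson-style partially linear DGP (the DGP constants and model choices here are illustrative assumptions, and plain plug-in fits stand in for the cross-fitted learners DML would use): the outcome regressions are deliberately misspecified linear fits, yet the AIPW estimate stays near the truth because the propensity model is correctly specified.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)
n, tau = 5000, 2.0                      # sample size, true ATE (illustrative)

# Robinson-style partially linear DGP:
# Y = tau*T + g(X) + noise,  P(T=1 | X) = sigmoid(X0 - 0.5*X2)
X = rng.normal(size=(n, 3))
g = np.sin(X[:, 0]) + X[:, 1] ** 2      # nonlinear nuisance
e_true = 1 / (1 + np.exp(-(X[:, 0] - 0.5 * X[:, 2])))
T = rng.binomial(1, e_true)
Y = tau * T + g + rng.normal(size=n)

# Propensity model: logistic regression, correctly specified here.
e_hat = LogisticRegression().fit(X, T).predict_proba(X)[:, 1]
# Outcome models: linear fits, deliberately misspecified (g is nonlinear).
mu0 = LinearRegression().fit(X[T == 0], Y[T == 0]).predict(X)
mu1 = LinearRegression().fit(X[T == 1], Y[T == 1]).predict(X)

# AIPW / doubly-robust score: consistent if EITHER e_hat or (mu0, mu1) is right.
psi = (mu1 - mu0
       + T * (Y - mu1) / e_hat
       - (1 - T) * (Y - mu0) / (1 - e_hat))
print(f"AIPW ATE estimate: {psi.mean():.3f}  (truth {tau})")
```

Swapping which nuisance is broken (a correct outcome model with a deliberately wrong propensity score) is the natural second run: AIPW recovers the truth either way, which is exactly the two-shots-on-goal guarantee named above.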