
PCA & Low-Rank Approximation

From variance maximization to modern dimensionality reduction — the mathematical theory behind the most widely used unsupervised learning method

Overview & Motivation

The SVD established that the right singular vectors of a centered data matrix X are the principal components, and the Eckart–Young–Mirsky theorem makes the truncated SVD the optimal low-rank approximation. This topic develops PCA as a standalone framework — starting from the variance-maximization problem, connecting it to the SVD and the Spectral Theorem, and extending it to the modern variants that handle nonlinear, corrupted, and high-dimensional data.

Principal Component Analysis (PCA) finds the orthogonal directions of maximum variance in a dataset X ∈ ℝ^{n×d} (rows = observations, columns = features). It answers a simple question: if we could keep only k < d directions, which k capture the most information?

The answer is the top k eigenvectors of the sample covariance matrix Σ̂ = Xᵀ X / (n − 1) — equivalently, the top k right singular vectors of X. This topic covers:

  • Variance maximization: PCA as a constrained optimization problem (Rayleigh quotient).
  • Optimal projection: the Eckart–Young connection — PCA is the best kk-dimensional linear projection.
  • The scree plot: heuristics for choosing kk (elbow, Kaiser, parallel analysis).
  • Kernel PCA: extending PCA to nonlinear manifolds via the kernel trick.
  • Probabilistic PCA: a latent variable model that handles missing data and provides a generative framework.
  • Robust PCA: separating low-rank structure from sparse corruption (Principal Component Pursuit).
  • Sparse PCA: interpretable principal components via ℓ₁-penalized loadings.

What We Cover

  1. PCA from First Principles — the Rayleigh quotient and sequential variance maximization.
  2. PCA via the SVD — the correspondence table and numerical superiority.
  3. Geometric Interpretation — optimal projection and the dual view (variance = error minimization).
  4. The Scree Plot — dimensionality selection (elbow, Kaiser, parallel analysis).
  5. Low-Rank Approximation — the Eckart–Young connection and effective rank.
  6. Computing PCA in Practice — scikit-learn classes and the standardization question.
  7. The Biplot — visualizing observations and variables together.
  8. Kernel PCA — nonlinear dimensionality reduction via the kernel trick.
  9. Probabilistic PCA — the latent variable model and EM for missing data.
  10. Robust PCA — Principal Component Pursuit for low-rank + sparse decomposition.
  11. Sparse PCA — interpretable loadings via ℓ₁ regularization.

PCA from First Principles: Variance Maximization

The Setup

Let X ∈ ℝ^{n×d} be a centered data matrix (n observations, d features, column means subtracted). The sample covariance matrix is:

\hat{\Sigma} = \frac{1}{n-1} X^T X \in \mathbb{R}^{d \times d}

This matrix is symmetric and positive semi-definite. By the Spectral Theorem, it has an orthogonal eigendecomposition:

\hat{\Sigma} = Q \Lambda Q^T

with eigenvalues λ_1 ≥ λ_2 ≥ ⋯ ≥ λ_d ≥ 0 and orthonormal eigenvectors q_1, …, q_d.

The Variance-Maximization Problem

The first principal component is the unit vector w_1 that maximizes the variance of the projection Xw_1:

w_1 = \arg\max_{\|w\| = 1} \text{Var}(Xw) = \arg\max_{\|w\| = 1} w^T \hat{\Sigma} w

This is a Rayleigh quotient problem. By the Spectral Theorem (specifically, the Courant–Fischer minimax principle), the maximum is achieved at w_1 = q_1 (the top eigenvector), and the maximum variance is λ_1.

Interactive demo: a 2-D example tracing the Rayleigh quotient R(θ) = wᵀΣ̂w as w rotates, alongside the covariance matrix, its eigenvalues and eigenvectors, the explained variance ratios, and the condition number κ = λ₁/λ₂; the maximum R = λ₁ is attained along PC1.

Proposition 1 (Sequential variance maximization).

The k-th principal component is the unit vector w_k that maximizes wᵀ Σ̂ w subject to ‖w‖ = 1 and w ⊥ w_1, …, w_{k−1}. The solution is w_k = q_k (the k-th eigenvector of Σ̂), and the maximum variance is λ_k.

Proof.

By induction, using the Spectral Theorem's eigenvalue ordering.

After deflating the first k − 1 directions, we restrict to the orthogonal complement of span(q_1, …, q_{k−1}). Any unit vector w in this subspace can be written as w = \sum_{j=k}^{d} c_j q_j with \sum c_j^2 = 1. Then:

w^T \hat{\Sigma} w = \sum_{j=k}^{d} c_j^2 \lambda_j \leq \lambda_k \sum_{j=k}^{d} c_j^2 = \lambda_k

with equality when c_k = 1 and c_j = 0 for j > k, i.e., w = q_k. This is the restricted Rayleigh quotient, and the bound follows from the eigenvalue ordering λ_k ≥ λ_{k+1} ≥ ⋯ ≥ λ_d. ∎

The Principal Components

The principal components (or PC scores) are the projections of the data onto the eigenvectors:

Z = XQ \in \mathbb{R}^{n \times d}

where Z_{ij} = x_i^T q_j is the i-th observation's coordinate along the j-th principal direction. The columns of Z are uncorrelated, with variances λ_1, …, λ_d:

\frac{1}{n-1} Z^T Z = \frac{1}{n-1} Q^T X^T X Q = Q^T \hat{\Sigma} Q = \Lambda

This diagonalization — transforming correlated features into uncorrelated components ordered by variance — is the essence of PCA.

PCA variance maximization: data cloud with principal directions and PC scores
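The decorrelation identity above is easy to verify numerically — a minimal sketch on synthetic correlated Gaussian data (the sizes and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 4

# Correlated Gaussian data, then center the columns
X = rng.standard_normal((n, d)) @ rng.standard_normal((d, d))
X = X - X.mean(axis=0)

# Eigendecomposition of the sample covariance (eigh returns ascending order)
S = X.T @ X / (n - 1)
lam, Q = np.linalg.eigh(S)
lam, Q = lam[::-1], Q[:, ::-1]

# PC scores Z = XQ: columns are uncorrelated, with variances lambda_1..lambda_d
Z = X @ Q
C = Z.T @ Z / (n - 1)
print(np.allclose(C, np.diag(lam)))  # True: (1/(n-1)) Z^T Z = Lambda
```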


PCA via the SVD

The SVD Connection

The SVD topic (§8) established the fundamental connection; we now develop it fully.

Let X = UΣVᵀ be the SVD of the centered data matrix. Then:

\hat{\Sigma} = \frac{1}{n-1} X^T X = \frac{1}{n-1} V \Sigma^T \Sigma V^T = V \left(\frac{\Sigma^2}{n-1}\right) V^T

Comparing with the Spectral Theorem decomposition Σ̂ = QΛQᵀ:

Theorem 1 (PCA via SVD).

Let X ∈ ℝ^{n×d} be a centered data matrix with SVD X = UΣVᵀ. The principal directions are the columns of V, the PC scores are the columns of UΣ, and the i-th eigenvalue of the sample covariance matrix is λ_i = σ_i²/(n − 1).

PCA quantity | Spectral Theorem | SVD
Principal directions | Eigenvectors q_i of Σ̂ | Right singular vectors v_i of X
Eigenvalues (variances) | λ_i | σ_i² / (n − 1)
PC scores | Z = XQ | Z = UΣ
Explained variance ratio | λ_i / Σ_j λ_j | σ_i² / Σ_j σ_j²

Numerical Superiority of the SVD

Forming Xᵀ X explicitly squares the condition number: κ(Xᵀ X) = κ(X)². For ill-conditioned data, the smallest eigenvalues can be lost to floating-point roundoff. The SVD computes the same decomposition directly from X, preserving numerical accuracy. This is why every serious PCA implementation uses the SVD internally.

Definition 1 (Condition number for PCA).

The condition number of PCA is κ = σ_1/σ_d (the ratio of the largest to smallest singular value of X). When κ is large, the data is ill-conditioned for PCA — the smallest principal components are unreliable.

import numpy as np

# PCA via eigendecomposition vs. SVD: verify the connection
np.random.seed(42)
n, d = 500, 3
cov_true = np.array([[3.0, 1.5, 0.5], [1.5, 2.0, 0.3], [0.5, 0.3, 0.5]])
X = np.random.multivariate_normal(np.zeros(d), cov_true, n)
X_centered = X - X.mean(axis=0)

# Method 1: Spectral Theorem on covariance
Sigma_hat = (X_centered.T @ X_centered) / (n - 1)
evals_eig, Q = np.linalg.eigh(Sigma_hat)
idx = np.argsort(evals_eig)[::-1]
evals_eig, Q = evals_eig[idx], Q[:, idx]

# Method 2: SVD of centered data
U, sigma, Vt = np.linalg.svd(X_centered, full_matrices=False)
evals_svd = sigma**2 / (n - 1)

# Eigenvalues match: λ_i = σ_i² / (n-1)
for i in range(d):
    print(f"  λ_{i+1}: eigendecomp = {evals_eig[i]:.6f}  SVD = {evals_svd[i]:.6f}")

# Condition numbers: SVD avoids squaring
print(f"\n  κ(X) = {sigma[0]/sigma[-1]:.2f}")
print(f"  κ(X^T X) = {evals_eig[0]/evals_eig[-1]:.2f}  ≈  κ(X)² = {(sigma[0]/sigma[-1])**2:.2f}")

Geometric Interpretation

PCA as Optimal Linear Projection

PCA finds the k-dimensional subspace that best preserves the data's variance. Equivalently, it finds the projection that minimizes reconstruction error.

Theorem 2 (PCA as optimal projection).

Let X ∈ ℝ^{n×d} be centered data with SVD X = UΣVᵀ. The k-dimensional subspace that minimizes the total squared reconstruction error is spanned by the top k right singular vectors v_1, …, v_k:

\min_{\dim W = k} \sum_{i=1}^{n} \|x_i - \text{proj}_W(x_i)\|^2 = \sum_{j=k+1}^{d} \sigma_j^2 = (n-1) \sum_{j=k+1}^{d} \lambda_j

This is the Eckart–Young–Mirsky theorem applied to the centered data matrix.

Proof.

The reconstruction error for a rank-k orthogonal projection P is \|X - XP\|_F^2 = \|X(I - P)\|_F^2. We need to show this is minimized when P = V_k V_k^T.

Expand using the SVD X = UΣVᵀ:

\|X(I - P)\|_F^2 = \|U \Sigma V^T (I - P)\|_F^2 = \|\Sigma V^T (I - P)\|_F^2

where the last equality uses the unitary invariance of the Frobenius norm (\|UA\|_F = \|A\|_F for orthogonal U).

Since I − P projects onto the orthogonal complement of the k-dimensional subspace, the minimum is achieved when P projects onto the span of the top k right singular vectors. The error equals \sum_{j=k+1}^{d} \sigma_j^2 — the sum of the squared singular values we discard. This is exactly the Eckart–Young theorem. ∎

The Dual View: Variance Maximization = Error Minimization

PCA simultaneously:

  1. Maximizes the variance of the projected data: \sum_{j=1}^{k} λ_j.
  2. Minimizes the reconstruction error: \sum_{j=k+1}^{d} λ_j.

Since the total variance \sum_{j=1}^{d} λ_j is constant, maximizing (1) is equivalent to minimizing (2). This dual view is central to understanding dimensionality reduction.
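A quick numeric check of the dual view — captured variance plus reconstruction error is the same constant for every k (synthetic data; a sketch):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 6)) @ rng.standard_normal((6, 6))
X = X - X.mean(axis=0)
n = X.shape[0]

U, s, Vt = np.linalg.svd(X, full_matrices=False)
lam = s**2 / (n - 1)
total = lam.sum()

for k in range(1, 7):
    captured = lam[:k].sum()                     # variance of projected data
    Xk = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]      # rank-k reconstruction
    error = ((X - Xk) ** 2).sum() / (n - 1)      # discarded variance
    assert np.isclose(captured + error, total)
print("captured variance + reconstruction error = total variance for all k")
```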

Theorem 3 (Optimal projection subspace).

Among all k-dimensional subspaces W ⊆ ℝ^d, the one that minimizes the total squared distance from the data to its projections is the span of the top k principal directions:

W^* = \text{span}(v_1, \ldots, v_k) = \arg\min_{\dim(W) = k} \sum_{i=1}^{n} \|x_i - P_W x_i\|^2

where P_W is the orthogonal projection onto W.

The Reconstruction Formula

Given the top k principal directions V_k = [v_1, …, v_k] and PC scores Z_k = X V_k, the rank-k reconstruction is:

\hat{X}_k = Z_k V_k^T = X V_k V_k^T

The projection matrix P_k = V_k V_kᵀ is the orthogonal projector onto the principal subspace.
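The two routes to the rank-k reconstruction — project-then-reconstruct with V_k, or truncate the SVD — coincide, which a few lines of numpy confirm (a sketch; sizes arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((100, 5))
X = X - X.mean(axis=0)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
Vk = Vt[:k].T                                  # d x k, top-k principal directions

X_proj = X @ Vk @ Vk.T                         # X_hat_k = X V_k V_k^T
X_trunc = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]   # truncated SVD U_k Sigma_k V_k^T
print(np.allclose(X_proj, X_trunc))            # True
```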

PCA geometry: projection, reconstruction, and the dual view


The Scree Plot and Dimensionality Selection

The Scree Plot

The scree plot displays the eigenvalues (or singular values, or explained variance ratios) in descending order. The goal is to identify the “elbow” — the point where the eigenvalues transition from large (signal) to small (noise).

Definition 2 (Explained variance ratio).

The explained variance ratio for k components is:

\text{EVR}(k) = \frac{\sum_{i=1}^{k} \lambda_i}{\sum_{i=1}^{d} \lambda_i} = \frac{\sum_{i=1}^{k} \sigma_i^2}{\sum_{i=1}^{d} \sigma_i^2}

where λ_i are the eigenvalues of the covariance matrix and σ_i are the singular values of the centered data matrix.

Selection Criteria

Several heuristics exist for choosing the number of components k:

Method | Rule | When to use
Elbow method | Choose k at the "bend" in the scree plot | Visual; subjective but widely used
Kaiser criterion | Keep components with λ_i > 1 (for standardized data) | Only for correlation-matrix PCA
Explained variance threshold | Smallest k with EVR(k) ≥ τ (e.g., τ = 0.95) | When a target variance fraction is needed
Parallel analysis | Compare eigenvalues to those from random data of the same dimensions | Most statistically principled

Parallel Analysis

Parallel analysis (Horn, 1965) generates many random datasets of the same size (n × d) from i.i.d. standard normals, computes their eigenvalues, and retains only those components whose observed eigenvalues exceed the 95th percentile of the random eigenvalues. This controls for the "eigenvalue inflation" that occurs in finite samples.

Interactive demo: eigenvalue spectrum and cumulative explained variance with a selectable k; on the demo data the criteria give Kaiser = 3, parallel analysis = 2, elbow = 2, and 95% threshold = 10.

from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

# Load and standardize the Wine dataset (13 features)
wine = load_wine()
X_wine = StandardScaler().fit_transform(wine.data)
n, d = X_wine.shape

# PCA via SVD
U, sigma, Vt = np.linalg.svd(X_wine, full_matrices=False)
eigenvalues = sigma**2 / (n - 1)
evr = eigenvalues / eigenvalues.sum()
cumulative_evr = np.cumsum(evr)

# Parallel analysis: compare to eigenvalues from random data
n_permutations = 100
random_eigenvalues = np.zeros((n_permutations, d))
for j in range(n_permutations):
    X_random = np.random.randn(n, d)
    _, s_rand, _ = np.linalg.svd(X_random, full_matrices=False)
    random_eigenvalues[j] = s_rand**2 / (n - 1)

random_95th = np.percentile(random_eigenvalues, 95, axis=0)
k_parallel = np.sum(eigenvalues > random_95th)

# Selection results
k_kaiser = np.sum(eigenvalues > 1.0)
k_95 = np.searchsorted(cumulative_evr, 0.95) + 1

print(f"Kaiser criterion (λ > 1): k = {k_kaiser}")
print(f"95% variance threshold:   k = {k_95}")
print(f"Parallel analysis:        k = {k_parallel}")

Scree plot with selection criteria on the Wine dataset


Low-Rank Approximation as Optimal Projection

The Eckart–Young Connection

The Eckart–Young–Mirsky theorem states that the best rank-k approximation to A is the truncated SVD A_k. Applied to the centered data matrix X:

X_k = U_k \Sigma_k V_k^T = \text{(PC scores)}_k \times \text{(principal directions)}_k^T

This is exactly the PCA reconstruction: project onto the top k PCs, then reconstruct. The Eckart–Young theorem guarantees this is optimal among all rank-k approximations.

Effective Rank

Definition 3 (Effective rank).

The effective rank of a matrix measures how many singular values contribute meaningfully. Two common definitions:

  • Hard threshold: r_ε = |{ i : σ_i > ε·σ_1 }| for some tolerance ε.
  • Entropy-based (Roy & Vetterli, 2007): \text{erank}(A) = \exp\left(-\sum_{i} p_i \log p_i\right), where p_i = σ_i² / ‖A‖_F².

When the effective rank is much smaller than min(m, n), the data is approximately low-rank and PCA is highly effective.

Remark.

The entropy-based effective rank has a nice information-theoretic interpretation: it measures the uniformity of the singular value distribution. When all singular values are equal, erank(A) = min(m, n) (full rank). When one singular value dominates, erank(A) ≈ 1.
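Both definitions fit in a few lines; the helper below is an illustrative sketch (function name, sizes, and the ε = 0.05 tolerance are arbitrary choices):

```python
import numpy as np

def effective_rank(A, eps=1e-6):
    """Hard-threshold and entropy-based effective ranks (illustrative helper)."""
    s = np.linalg.svd(A, compute_uv=False)
    hard = int(np.sum(s > eps * s[0]))
    p = s**2 / np.sum(s**2)      # p_i = sigma_i^2 / ||A||_F^2
    p = p[p > 0]                 # drop exact zeros before taking the log
    entropy = np.exp(-np.sum(p * np.log(p)))
    return hard, entropy

rng = np.random.default_rng(3)
# Rank-3 matrix plus small noise: the hard rank recovers 3,
# the entropy rank lands between 1 and 3
A_noisy = rng.standard_normal((50, 3)) @ rng.standard_normal((3, 20))
A_noisy += 0.01 * rng.standard_normal((50, 20))
print(effective_rank(A_noisy, eps=0.05))
```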

Low-rank approximation: singular value spectrum, error decay, and explained variance ratio


Computing PCA in Practice

scikit-learn

Class | When to use | Algorithm
PCA(n_components=k) | Moderate data; exact PCA | Full SVD via LAPACK or randomized SVD (auto)
PCA(svd_solver='randomized') | Large d, small k | Halko–Martinsson–Tropp randomized SVD
IncrementalPCA(n_components=k) | Out-of-core / streaming | Incremental SVD update
TruncatedSVD(n_components=k) | Sparse matrices (text, graphs) | ARPACK or randomized

Remark.

PCA centers the data automatically. TruncatedSVD does not center — use it for sparse data where centering would destroy sparsity (e.g., TF-IDF matrices in LSA).

Standardize or Not?

  • Standardize (scale to unit variance) when features have different units or very different scales. This is PCA on the correlation matrix R = D^{-1/2} Σ̂ D^{-1/2}, where D = diag(Σ̂).
  • Don't standardize when features are commensurable (same units, similar scales) — PCA on the covariance matrix preserves the natural scaling.

The Kaiser criterion (λ > 1) only makes sense when PCA is computed on the correlation matrix (standardized data), because the average eigenvalue of a d × d correlation matrix is exactly 1.
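A small sketch of why the choice matters: with one feature on a 1000× scale, covariance-matrix PCA is swallowed by that feature, while standardizing first is the same as eigendecomposing the correlation matrix R (synthetic data; scales arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
# Three features on wildly different scales
X = rng.standard_normal((300, 3)) * np.array([1.0, 10.0, 1000.0])
X = X - X.mean(axis=0)
n = X.shape[0]

S = X.T @ X / (n - 1)                       # covariance matrix
D_inv = np.diag(1.0 / np.sqrt(np.diag(S)))
R = D_inv @ S @ D_inv                       # correlation matrix

# PCA on standardized data == eigendecomposition of R
Xs = X / X.std(axis=0, ddof=1)
lam_std = np.linalg.eigvalsh(Xs.T @ Xs / (n - 1))[::-1]
lam_R = np.linalg.eigvalsh(R)[::-1]
assert np.allclose(lam_std, lam_R)

# Covariance PCA is dominated by the large-scale feature
lam_cov = np.linalg.eigvalsh(S)[::-1]
print(f"covariance PC1 share of variance: {lam_cov[0] / lam_cov.sum():.4f}")  # near 1
```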


Application: High-Dimensional Data Visualization

The Biplot

The biplot (Gabriel, 1971) simultaneously displays observations (PC scores) and variables (loadings) in the principal component space. It is the standard visualization for PCA results:

  • Observations are plotted as points using their PC scores (z_{i1}, z_{i2}).
  • Variables are plotted as arrows (loading vectors) showing each original feature’s contribution to the PCs.
  • Arrow length indicates the feature’s variance explained by the displayed PCs.
  • Arrow angle between two features approximates their correlation (small angle ≈ high positive correlation; 90° ≈ uncorrelated; 180° ≈ high negative correlation).

Interpreting Loadings

The loadings are the entries of the principal direction vectors v_j. The loading v_{jk} tells us how much feature k contributes to PC j. Large loadings (in absolute value) indicate features that are important for that component.
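In practice one can read loadings by ranking features by |v_{jk}| per component — a sketch on the Wine data used elsewhere in this topic (the top-3 cutoff is an arbitrary choice):

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

wine = load_wine()
X = StandardScaler().fit_transform(wine.data)

pca = PCA(n_components=2).fit(X)
for j, v in enumerate(pca.components_):
    top = np.argsort(np.abs(v))[::-1][:3]      # 3 largest |loadings|
    feats = ", ".join(f"{wine.feature_names[t]} ({v[t]:+.2f})" for t in top)
    print(f"PC{j + 1}: {feats}")
```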

PCA biplot: score plot with loading arrows on the Wine dataset


Kernel PCA

The Limitation of Linear PCA

Linear PCA finds directions of maximum variance in the original feature space. But many real datasets have nonlinear structure — data lying on a curved manifold (Swiss roll), concentric circles, or a spiral. Linear PCA cannot “unfold” these structures.

The Kernel Trick

Kernel PCA (Schölkopf, Smola & Müller, 1998) applies PCA in a high-dimensional (possibly infinite-dimensional) feature space φ(x) ∈ ℋ without ever computing φ explicitly:

  1. Compute the n × n kernel matrix K_{ij} = κ(x_i, x_j), where κ is a kernel function.
  2. Center the kernel matrix: \tilde{K} = K - \frac{1}{n} \mathbf{1} K - \frac{1}{n} K \mathbf{1} + \frac{1}{n^2} \mathbf{1} K \mathbf{1}, where \mathbf{1} is the n × n all-ones matrix.
  3. Eigendecompose: \tilde{K} = \alpha \Lambda \alpha^T (Spectral Theorem, since \tilde{K} is symmetric PSD).
  4. The kernel PC scores are Z_k = α_k √λ_k for the top k eigenvectors.
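The four steps map directly to numpy; the sketch below implements them on a small dataset and checks against scikit-learn's KernelPCA (up to the usual sign ambiguity of eigenvectors; the γ value and dataset are arbitrary):

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.decomposition import KernelPCA

X, _ = make_moons(n_samples=100, noise=0.05, random_state=0)
n, gamma = X.shape[0], 2.0

# Step 1: RBF kernel matrix K_ij = exp(-gamma ||x_i - x_j||^2)
sq = np.sum(X**2, axis=1)
K = np.exp(-gamma * (sq[:, None] + sq[None, :] - 2 * X @ X.T))

# Step 2: double centering with the matrix of 1/n entries
one_n = np.full((n, n), 1.0 / n)
K_tilde = K - one_n @ K - K @ one_n + one_n @ K @ one_n

# Steps 3-4: eigendecompose, scale eigenvectors by sqrt(eigenvalue)
lam, alpha = np.linalg.eigh(K_tilde)
lam, alpha = lam[::-1], alpha[:, ::-1]
Z_manual = alpha[:, :2] * np.sqrt(np.maximum(lam[:2], 0))

# Compare to sklearn: columns agree up to sign
Z_sk = KernelPCA(n_components=2, kernel='rbf', gamma=gamma).fit_transform(X)
for j in range(2):
    sign = np.sign(Z_manual[:, j] @ Z_sk[:, j])
    assert np.allclose(sign * Z_manual[:, j], Z_sk[:, j], atol=1e-6)
print("manual kernel PCA matches sklearn (up to sign)")
```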

Definition 4 (Common kernels for Kernel PCA).

Kernel | Formula | Effect
Linear | κ(x, y) = xᵀy | Standard PCA
RBF (Gaussian) | κ(x, y) = exp(−γ‖x − y‖²) | Smooth nonlinear embedding
Polynomial | κ(x, y) = (xᵀy + c)ᵖ | Polynomial feature interactions

The RBF kernel with bandwidth γ is the most commonly used; it implicitly maps to an infinite-dimensional feature space.

Interactive demo: original data (200 points), the RBF kernel matrix exp(−γ‖x − y‖²) with γ = 1, and the resulting kernel PCA embedding.

from sklearn.decomposition import PCA, KernelPCA
from sklearn.datasets import make_circles

# Concentric circles: linear PCA fails, kernel PCA succeeds
X_circles, y_circles = make_circles(n_samples=500, factor=0.3, noise=0.05)

# Linear PCA cannot separate the two rings
Z_linear = PCA(n_components=2).fit_transform(X_circles)

# RBF Kernel PCA separates them
Z_rbf = KernelPCA(n_components=2, kernel='rbf', gamma=10).fit_transform(X_circles)

# The bandwidth γ controls locality:
#   Small γ → global (nearly linear)
#   Large γ → local (sharp boundaries)
for gamma in [0.1, 2, 10, 50]:
    Z = KernelPCA(n_components=2, kernel='rbf', gamma=gamma).fit_transform(X_circles)
    # Separation of the two rings along the first kernel PC
    gap = abs(Z[y_circles == 0, 0].mean() - Z[y_circles == 1, 0].mean()) / Z[:, 0].std()
    print(f"  γ = {gamma:5.1f}: ring separation along PC1 = {gap:.2f}")

Kernel PCA: unfolding nonlinear structure on circles and Swiss roll


Probabilistic PCA

PCA as a Latent Variable Model

Probabilistic PCA (Tipping & Bishop, 1999) recasts PCA as a generative model:

x = Wz + \mu + \varepsilon

where:

  • z ~ 𝒩(0, I_k) is a k-dimensional latent variable.
  • W ∈ ℝ^{d×k} is the loading matrix.
  • μ ∈ ℝ^d is the data mean.
  • ε ~ 𝒩(0, σ²I_d) is isotropic Gaussian noise.

Theorem 4 (MLE for Probabilistic PCA).

The maximum likelihood estimates are:

\hat{W}_{\text{ML}} = U_k (\Lambda_k - \sigma^2 I)^{1/2} R

where U_k contains the top k eigenvectors of Σ̂, Λ_k = diag(λ_1, …, λ_k), R is an arbitrary k × k rotation, and:

\hat{\sigma}^2_{\text{ML}} = \frac{1}{d - k} \sum_{j=k+1}^{d} \lambda_j

The noise variance σ̂² is the average of the discarded eigenvalues — the "lost" variance is modeled as isotropic noise.
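The closed form is straightforward to verify on data simulated from the model itself (a sketch; R is taken to be the identity, and sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)
n, d, k = 2000, 5, 2

# Simulate from the PPCA model x = Wz + noise (mu = 0)
W_true = rng.standard_normal((d, k))
sigma2_true = 0.1
Z = rng.standard_normal((n, k))
X = Z @ W_true.T + np.sqrt(sigma2_true) * rng.standard_normal((n, d))
X = X - X.mean(axis=0)

# Closed-form MLE from the spectrum of the sample covariance
S = X.T @ X / (n - 1)
lam, U = np.linalg.eigh(S)
lam, U = lam[::-1], U[:, ::-1]

sigma2_ml = lam[k:].mean()                               # average discarded eigenvalue
W_ml = U[:, :k] @ np.diag(np.sqrt(lam[:k] - sigma2_ml))  # taking R = I

# Implied model covariance: top-k eigenvalues match S, the rest collapse to sigma^2
C = W_ml @ W_ml.T + sigma2_ml * np.eye(d)
ev_C = np.linalg.eigvalsh(C)[::-1]
assert np.allclose(ev_C[:k], lam[:k])
assert np.allclose(ev_C[k:], sigma2_ml)
print(f"sigma2_ml = {sigma2_ml:.4f}  (true value {sigma2_true})")
```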

Why Probabilistic PCA?

  1. Missing data: the EM algorithm naturally handles incomplete observations.
  2. Model comparison: the marginal likelihood allows Bayesian model selection for k.
  3. Generative model: we can sample new data points from p(x).
  4. Mixture of PPCA: captures multimodal structure (cluster-specific PCA).

def ppca_em(X, k, n_iter=100, missing_mask=None):
    """Probabilistic PCA via EM. missing_mask[i,j]=True means X[i,j] is missing."""
    n, d = X.shape
    mu = np.nanmean(X, axis=0) if missing_mask is not None else X.mean(axis=0)
    W = np.random.randn(d, k) * 0.1
    sigma2 = 1.0

    for iteration in range(n_iter):
        # E-step: posterior p(z|x) = N(M^{-1} W^T (x - mu), sigma^2 M^{-1})
        M = W.T @ W + sigma2 * np.eye(k)
        M_inv = np.linalg.inv(M)

        Ez = np.zeros((n, k))
        Ezz = np.zeros((n, k, k))

        for i in range(n):
            if missing_mask is not None and np.any(missing_mask[i]):
                obs = ~missing_mask[i]
                W_obs = W[obs, :]
                x_obs = X[i, obs] - mu[obs]
                M_obs = W_obs.T @ W_obs + sigma2 * np.eye(k)
                Ez[i] = np.linalg.solve(M_obs, W_obs.T @ x_obs)
                Ezz[i] = sigma2 * np.linalg.inv(M_obs) + np.outer(Ez[i], Ez[i])
            else:
                xi = X[i] - mu
                Ez[i] = M_inv @ W.T @ xi
                Ezz[i] = sigma2 * M_inv + np.outer(Ez[i], Ez[i])

        # M-step: update W and sigma^2
        S1 = np.zeros((d, k))
        S2 = np.zeros((k, k))
        for i in range(n):
            xi = X[i] - mu
            if missing_mask is not None and np.any(missing_mask[i]):
                obs = ~missing_mask[i]
                S1[obs] += np.outer(xi[obs], Ez[i])
            else:
                S1 += np.outer(xi, Ez[i])
            S2 += Ezz[i]

        W = S1 @ np.linalg.inv(S2)
        # Update sigma^2 (shown for the fully observed case; with missing
        # entries, sum only over observed coordinates)
        total = 0.0
        for i in range(n):
            xi = X[i] - mu
            total += xi @ xi - 2 * Ez[i] @ (W.T @ xi) + np.trace(Ezz[i] @ (W.T @ W))
        sigma2 = total / (n * d)

    return W, sigma2, mu

Probabilistic PCA: eigenvalue spectrum and missing data recovery


Robust PCA: Principal Component Pursuit

The Problem

Standard PCA is sensitive to gross corruption: a few large outliers or structured noise can distort the principal components. Robust PCA (Candès, Li, Ma & Wright, 2011) separates the data matrix into a low-rank component L and a sparse component S:

X = L + S

where L captures the "clean" low-rank structure and S captures sparse corruption (outliers, anomalies).

Principal Component Pursuit

Theorem 5 (Principal Component Pursuit).

Under mild incoherence conditions, the solution to:

\min_{L, S} \|L\|_* + \lambda \|S\|_1 \quad \text{subject to} \quad L + S = X

recovers the true L and S exactly, where ‖L‖_* is the nuclear norm (sum of singular values) and ‖S‖₁ is the entry-wise ℓ₁ norm. The default parameter is λ = 1/√max(m, n).

The nuclear norm promotes low rank; the ℓ₁ norm promotes sparsity. This convex relaxation can be solved efficiently by the Alternating Direction Method of Multipliers (ADMM).

Interactive demo: PCP on a synthetic rank-3 matrix as the corruption fraction varies from 5% to 30%; at 10% corruption the recovered L has the true rank 3 and S has 76/750 nonzero entries, with 12.6% relative error.

Applications

  • Video surveillance: L = static background, S = moving foreground objects.
  • Face recognition: L = faces under normal lighting, S = shadows and occlusions.
  • Financial data: L = systematic market factors, S = idiosyncratic shocks or anomalies.

def robust_pca_admm(X, lam=None, mu=None, tol=1e-7, max_iter=500):
    """Robust PCA via ADMM (Principal Component Pursuit).

    Solves: min ||L||_* + λ||S||_1  s.t. L + S = X
    """
    m, n = X.shape
    if lam is None:
        lam = 1.0 / np.sqrt(max(m, n))
    if mu is None:
        mu = m * n / (4.0 * np.sum(np.abs(X)))

    L = np.zeros_like(X)
    S = np.zeros_like(X)
    Y = np.zeros_like(X)  # dual variable

    def soft_threshold(Z, threshold):
        return np.sign(Z) * np.maximum(np.abs(Z) - threshold, 0)

    def svd_threshold(Z, threshold):
        U, sigma, Vt = np.linalg.svd(Z, full_matrices=False)
        sigma_thresh = np.maximum(sigma - threshold, 0)
        return U @ np.diag(sigma_thresh) @ Vt, np.sum(sigma_thresh > 0)

    for iteration in range(max_iter):
        # Update L (nuclear norm proximal operator)
        L, rank_L = svd_threshold(X - S + Y / mu, 1.0 / mu)

        # Update S (L1 proximal operator)
        S = soft_threshold(X - L + Y / mu, lam / mu)

        # Update dual variable
        residual = X - L - S
        Y = Y + mu * residual

        if np.linalg.norm(residual, 'fro') / np.linalg.norm(X, 'fro') < tol:
            break

    return L, S

Robust PCA: original, low-rank, and sparse decomposition


Sparse PCA

The Interpretability Problem

Standard PCA components are linear combinations of all features — every loading is typically nonzero. In high-dimensional settings (d ≫ n), this makes PCs hard to interpret: what does a component that loads on all 10,000 genes actually mean?

Sparse PCA Formulation

Sparse PCA adds an ℓ₁ penalty to the loadings to encourage sparsity:

\max_{\|w\| = 1} \, w^T \hat{\Sigma} w - \alpha \|w\|_1

where α > 0 controls the sparsity–variance trade-off. Larger α → fewer nonzero loadings → more interpretable but less variance explained.

Definition 5 (Sparse PCA trade-off).

Sparse PCA sacrifices optimality (in variance explained) for interpretability (sparsity of loadings). The trade-off is controlled by the regularization parameter α:

  • α = 0: standard PCA (dense loadings, maximum variance).
  • α → ∞: trivial solution (zero loadings).
  • Intermediate α: sparse loadings that explain most of the variance.

Several algorithms exist: SCoTLASS (Jolliffe et al., 2003), SPCA (Zou et al., 2006), and the coordinate descent approach in scikit-learn’s SparsePCA.
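A quick sweep over α makes the trade-off concrete — a sketch on the Wine data (the α grid and the zero threshold are arbitrary choices):

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import SparsePCA

X = StandardScaler().fit_transform(load_wine().data)
d = X.shape[1]

counts = []
for alpha in [0.1, 1.0, 5.0]:
    spca = SparsePCA(n_components=2, alpha=alpha, random_state=0).fit(X)
    # Average number of nonzero loadings per component
    nnz = np.mean(np.sum(np.abs(spca.components_) > 1e-10, axis=1))
    counts.append(nnz)
    print(f"alpha = {alpha:4.1f}: avg nonzero loadings per PC = {nnz:.1f} of {d}")
```

Larger α should drive more loadings to exactly zero, at the cost of variance explained.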

from sklearn.decomposition import SparsePCA, PCA

# Compare standard PCA vs. Sparse PCA on the Wine dataset
wine = load_wine()
X_std = StandardScaler().fit_transform(wine.data)

pca = PCA(n_components=3).fit(X_std)
spca = SparsePCA(n_components=3, alpha=1.0, random_state=42).fit(X_std)

# Standard PCA: all 13 features contribute to each component
print("Standard PCA loadings (PC 1):")
print(f"  Nonzero: {np.sum(np.abs(pca.components_[0]) > 0.01)}/13")

# Sparse PCA: only a few features per component
print("Sparse PCA loadings (PC 1):")
print(f"  Nonzero: {np.sum(np.abs(spca.components_[0]) > 0.01)}/13")

Sparse PCA: loading sparsity patterns across regularization strengths


Connections & Further Reading

PCA stands at the crossroads of the Linear Algebra track and the rest of the formalML curriculum.

Topic | PCA Connection
The Spectral Theorem | PCA eigendecomposition is the Spectral Theorem applied to Σ̂. The Rayleigh quotient and Courant–Fischer theorem give the variance-maximization formulation.
Singular Value Decomposition | PCA is the SVD of the centered data matrix. The Eckart–Young theorem makes PCA the optimal low-rank projection. The SVD avoids squaring the condition number.
Tensor Decompositions | Tucker decomposition generalizes PCA to multi-way arrays. Multilinear PCA (MPCA) applies PCA along each tensor mode.
Persistent Homology | PCA is often a preprocessing step for TDA: reduce dimensionality before computing persistent homology. The explained variance ratio guides the choice of dimensionality.
The Mapper Algorithm | Mapper frequently uses PCA (via SVD) as a filter function — projecting data onto the first 1–2 principal components before the covering step.
Sheaf Theory | The Sheaf Laplacian's eigendecomposition (Spectral Theorem) combined with PCA on the resulting feature vectors provides multi-scale structural analysis.
Measure-Theoretic Probability | The Law of Large Numbers justifies why Σ̂ → Σ; the L² projection interpretation of conditional expectation is the infinite-dimensional analog of PCA's variance-minimizing projection.

The Linear Algebra Track Dependency Graph

Spectral Theorem
├── Singular Value Decomposition
│   └── PCA & Low-Rank Approximation (this topic)
└── Tensor Decompositions

PCA is the SVD applied to centered data. The Spectral Theorem provides the eigendecomposition of the covariance matrix. Tensor Decompositions generalize both to multilinear algebra.


References & Further Reading

  • book Introduction to Linear Algebra — Strang (2016) Chapter 7: SVD and PCA connection
  • book Pattern Recognition and Machine Learning — Bishop (2006) Chapter 12: Probabilistic PCA and EM derivation
  • book The Elements of Statistical Learning — Hastie, Tibshirani & Friedman (2009) Chapter 14.5: PCA, Sparse PCA, kernel PCA
  • book Matrix Analysis — Horn & Johnson (2013) Rayleigh quotient, Courant–Fischer, and Eckart–Young proofs
  • paper Probabilistic Principal Component Analysis — Tipping & Bishop (1999) Original PPCA paper
  • paper Robust Principal Component Analysis? — Candès, Li, Ma & Wright (2011) Principal Component Pursuit
  • paper Sparse Principal Component Analysis — Zou, Hastie & Tibshirani (2006) SPCA formulation and algorithm
  • paper Nonlinear Component Analysis as a Kernel Eigenvalue Problem — Schölkopf, Smola & Müller (1998) Original Kernel PCA paper
  • paper A generalization of principal component analysis to the exponential family — Collins, Dasgupta & Schapire (2001) Exponential family PCA
  • paper A parallel analysis criterion for determining the number of common factors or principal components — Horn (1965) Parallel analysis for dimensionality selection