Persistent Homology
Tracking topological features across scales — the workhorse of topological data analysis
Overview & Motivation
Consider two point clouds with identical first- and second-order statistics — same mean, same variance, same covariance. One is sampled from an annulus (a ring), the other from a Gaussian cluster. Every classical statistical test says they’re the same. But one has a hole and the other doesn’t, and that difference can be the signal you need — whether you’re detecting anomalies in sensor networks, classifying dynamical regimes in time series, or finding structure in high-dimensional embeddings.
The problem is: what does “has a hole” even mean for a finite point cloud? Points are discrete; they don’t have holes in any obvious sense. The Vietoris-Rips construction (see Simplicial Complexes) gives us a way to build a continuous-ish structure from discrete data, but it introduces a new problem: the result depends on a scale parameter ε. Choose ε too small and you see only isolated points. Too large and everything collapses into a single blob.
Persistent homology is the answer: instead of choosing a single scale, track the topology at every scale simultaneously. Features that persist across a wide range of ε are real structure; features that flicker in and out are noise. This insight — that persistence is a proxy for significance — is what makes TDA useful for machine learning.
Formal Framework
Filtrations
Definition 1.
A filtration of a simplicial complex K is a nested sequence of subcomplexes:

∅ = K_0 ⊆ K_1 ⊆ K_2 ⊆ ⋯ ⊆ K_n = K

Each K_i is a simplicial complex, and the inclusion K_i ⊆ K_{i+1} adds one or more simplices.
For the Vietoris-Rips complex, the filtration is parameterized by ε: as ε increases from 0 to ∞, simplices enter the complex at their birth time (the minimum ε at which they appear). An edge {u, v} is born at ε = d(u, v). A triangle is born at the length of its longest edge.
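These birth times are computable directly from the pairwise distance matrix. The helper below is an illustrative sketch (rips_birth_times is not a library function): it enumerates simplices and assigns each one its longest-edge length.

```python
import itertools
import numpy as np

def rips_birth_times(points, max_dim=2):
    """Birth time of each simplex in the Vietoris-Rips filtration:
    vertices are born at 0, higher simplices at their longest edge length."""
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    births = {}
    for k in range(max_dim + 1):
        for simplex in itertools.combinations(range(len(points)), k + 1):
            if k == 0:
                births[simplex] = 0.0
            else:
                births[simplex] = max(d[i, j] for i, j in itertools.combinations(simplex, 2))
    return births

# Right triangle: the edge {0,1} is born at its length, the triangle at sqrt(2)
pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
births = rips_birth_times(pts)
print(births[(0, 1)], births[(0, 1, 2)])  # 1.0 1.4142135623730951
```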
Chain Groups and the Boundary Operator
To detect “holes,” we need algebra. The key idea: a loop is a collection of edges that forms a cycle, and a hole is a loop that isn’t the boundary of something filled in.
Definition 2.
The k-chain group C_k(K) of a simplicial complex K is the free abelian group (or ℤ/2ℤ-vector space) generated by the k-simplices of K. A k-chain is a formal sum of k-simplices:

c = a_1 σ_1 + a_2 σ_2 + ⋯ + a_m σ_m
Remark.
Working over ℤ/2ℤ (coefficients are 0 or 1, with 1 + 1 = 0) simplifies everything: we don’t need to worry about orientations. Most TDA software uses ℤ/2ℤ coefficients. Over ℤ, you get torsion information (e.g., distinguishing a Möbius band from a cylinder), but this is rarely needed in data analysis.
Definition 3.
The boundary operator ∂_k : C_k(K) → C_{k−1}(K) maps each k-simplex to the alternating sum of its (k−1)-dimensional faces:

∂_k [v_0, …, v_k] = Σ_{i=0}^{k} (−1)^i [v_0, …, v̂_i, …, v_k]

where v̂_i denotes omission. Over ℤ/2ℤ, the signs disappear and this reduces to the sum of all faces.
Example 1.
For the triangle with vertices a, b, c, edges [a,b], [b,c], [a,c], and the filled face [a,b,c]:

∂_2: The boundary of the triangle is the sum of its edges:

∂_2 [a,b,c] = [b,c] − [a,c] + [a,b]

∂_1: The boundary of each edge is the difference of its endpoints:

∂_1 [a,b] = b − a,  ∂_1 [b,c] = c − b,  ∂_1 [a,c] = c − a

Written as matrices (with ℤ/2ℤ coefficients):

∂_1 = | 1 0 1 |      ∂_2 = | 1 |
      | 1 1 0 |            | 1 |
      | 0 1 1 |            | 1 |

The rows of ∂_1 correspond to vertices a, b, c; columns to edges [a,b], [b,c], [a,c].
The following property is the engine of homology:
Theorem 1 (Fundamental Property of the Boundary Operator).
For all k:

∂_{k−1} ∘ ∂_k = 0
That is, the boundary of a boundary is always zero.
Proof.
It suffices to check on generators. Let σ = [v_0, …, v_k] be a k-simplex. We compute ∂_{k−1}(∂_k σ) by expanding both operators:

∂_k σ = Σ_{i=0}^{k} (−1)^i [v_0, …, v̂_i, …, v_k]

Applying ∂_{k−1} to each term, we omit a second vertex v_j (j ≠ i). For each term (−1)^i [v_0, …, v̂_i, …, v_k], the boundary is:

Σ_{j≠i} (−1)^i (−1)^{s(i,j)} [v_0, …, v̂_i, …, v̂_j, …, v_k]

where the exponent s(i,j) depends on the position of v_j in the face after v_i has been removed: s(i,j) = j if j < i, and s(i,j) = j − 1 if j > i.

Now consider a specific (k−2)-face [v_0, …, v̂_p, …, v̂_q, …, v_k] with p < q. It appears exactly twice in the double sum:
- First omit v_p, then omit v_q: sign is (−1)^p (−1)^{q−1}
- First omit v_q, then omit v_p: sign is (−1)^q (−1)^p
The sum of these two signs is:

(−1)^{p+q−1} + (−1)^{p+q} = 0

Every (k−2)-face appears exactly twice with opposite signs, so the entire sum vanishes. Over ℤ/2ℤ, the argument is even simpler: each face appears twice, and 1 + 1 = 0.
∎

Example 2.
We can verify this directly for the triangle. From Example 1:

∂_1(∂_2 [a,b,c]) = ∂_1([b,c] − [a,c] + [a,b]) = (c − b) − (c − a) + (b − a) = 0

The boundary of the triangle is the cycle [b,c] − [a,c] + [a,b], and the boundary of that cycle is zero — exactly because it closes up.
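The same check can be run numerically. A minimal sketch using the boundary matrices from Example 1, with ℤ/2ℤ coefficients:

```python
import numpy as np

# Boundary matrices of the filled triangle over Z/2Z (from Example 1).
# Rows of d1: vertices a, b, c; columns: edges [a,b], [b,c], [a,c].
d1 = np.array([[1, 0, 1],
               [1, 1, 0],
               [0, 1, 1]])
# Single column of d2: the filled face [a,b,c]; rows: the three edges.
d2 = np.array([[1], [1], [1]])

# The fundamental property: composing boundaries gives zero (mod 2)
print((d1 @ d2) % 2)  # [[0] [0] [0]]
```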
Homology Groups
The property ∂ ∘ ∂ = 0 gives us two important subgroups:
Definition 4.
The cycle group is Z_k = ker ∂_k — chains with zero boundary (they “close up”).
The boundary group is B_k = im ∂_{k+1} — chains that are the boundary of something.
Since ∂ ∘ ∂ = 0, every boundary is a cycle: B_k ⊆ Z_k.
Definition 5.
The k-th homology group of K is the quotient:

H_k(K) = Z_k / B_k

The k-th Betti number is β_k = rank H_k(K).
Homology counts holes that are not filled in:
- β_0 = number of connected components
- β_1 = number of loops (independent 1-cycles not bounding a filled region)
- β_2 = number of voids (enclosed cavities)
Example 3.
For the filled triangle [a,b,c]:
- Z_1 has rank 1 (one independent cycle: the loop around the triangle)
- B_1 has rank 1 (that loop is the boundary of the filled face)
- β_1 = 0 — no holes. The apparent loop is a boundary.
- β_0 = 1 — one connected component.
Now remove the 2-simplex (just the wireframe triangle, no fill):
- Z_1 still has rank 1
- B_1 now has rank 0 (no 2-simplices to bound anything)
- β_1 = 1 — one hole. The loop is no longer a boundary.
This is the key insight: the same cycle can be a hole or not, depending on whether there’s something to fill it.
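This computation reduces to linear algebra over ℤ/2ℤ: β_1 = dim ker ∂_1 − rank ∂_2. A minimal sketch (the gf2_rank helper is illustrative, not a library routine):

```python
import numpy as np

def gf2_rank(M):
    """Rank of a binary matrix over Z/2Z via Gaussian elimination."""
    M = M.copy() % 2
    rank = 0
    for col in range(M.shape[1]):
        pivot = next((r for r in range(rank, M.shape[0]) if M[r, col]), None)
        if pivot is None:
            continue
        M[[rank, pivot]] = M[[pivot, rank]]       # move pivot row into place
        for r in range(M.shape[0]):
            if r != rank and M[r, col]:
                M[r] = (M[r] + M[rank]) % 2       # clear the column elsewhere
        rank += 1
    return rank

d1 = np.array([[1, 0, 1], [1, 1, 0], [0, 1, 1]])  # vertex-edge boundary
d2 = np.array([[1], [1], [1]])                    # edge-triangle boundary

# beta_1 = dim ker(d1) - rank(d2) = (#edges - rank d1) - rank d2
beta1_filled = (3 - gf2_rank(d1)) - gf2_rank(d2)  # triangle with its face
beta1_wire = (3 - gf2_rank(d1)) - 0               # wireframe: no 2-simplices
print(beta1_filled, beta1_wire)  # 0 1
```

The same cycle space (dimension 1) appears in both cases; only the boundary space changes, exactly as in Example 3.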
Persistent Homology
Now we combine filtrations with homology. As ε increases, new simplices enter the complex. Each new simplex either:
- Creates a topological feature (a new cycle appears that isn’t a boundary), or
- Destroys a topological feature (a cycle that was a hole gets filled in)
Definition 6.
Given a filtration K_0 ⊆ K_1 ⊆ ⋯ ⊆ K_n, persistent homology tracks, for each topological feature:
- Its birth time b: the filtration index where the feature first appears
- Its death time d: the filtration index where the feature is destroyed
- Its persistence: d − b, measuring how long the feature survives
A feature with d = ∞ (never destroyed) is called an essential feature.
Definition 7.
The persistence diagram is the multiset of points (b, d) ∈ ℝ² corresponding to birth-death pairs, together with infinitely many copies of every point on the diagonal y = x.
Equivalently, the persistence barcode represents each feature as a horizontal interval [b, d).
Points far from the diagonal have high persistence — they represent features that survive across a wide range of scales. Points near the diagonal have low persistence — they are topological “noise,” features that appear and immediately disappear. The diagonal points are a technical convenience that makes the space of persistence diagrams well-behaved.
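To build intuition for how a barcode encodes this, here is a toy rendering of birth-death pairs as ASCII bars over a scale range. The barcode_lines helper is hypothetical, purely for illustration:

```python
def barcode_lines(diagram, width=40, eps_max=2.0):
    """Render (birth, death) pairs as ASCII bars over the range [0, eps_max]."""
    lines = []
    for birth, death in diagram:
        end = min(death, eps_max)  # clip essential (infinite) bars at eps_max
        start_col = int(width * birth / eps_max)
        end_col = max(start_col + 1, int(width * end / eps_max))
        bar = " " * start_col + "=" * (end_col - start_col)
        lines.append(f"[{birth:5.2f}, {death:5.2f})  {bar}")
    return lines

# One persistent loop plus two short-lived noise features
for line in barcode_lines([(0.2, 1.8), (0.5, 0.55), (0.9, 1.0)]):
    print(line)
```

The long bar dominates the output; the two near-diagonal features barely register — the visual signature of signal versus noise.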
Remark.
There is a deep algebraic structure here that we’re glossing over: the inclusions K_i ⊆ K_j induce maps on homology H_k(K_i) → H_k(K_j). The persistence module is the entire sequence of these maps, and the Structure Theorem for persistence modules (a consequence of the classification of finitely generated modules over a PID) guarantees that this module decomposes into interval summands — which are exactly the bars in the barcode. This decomposition is unique and computable via matrix reduction.
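The matrix reduction mentioned above can be sketched in a few lines. Assuming simplices are indexed in filtration order and each column stores its boundary as a set of row indices (ℤ/2ℤ coefficients), the standard algorithm repeatedly adds earlier columns that share the same lowest nonzero entry; the surviving (low, column) pairs are the birth-death pairs. This is an illustrative sketch, not production code:

```python
def low(col):
    """Row index of the lowest nonzero entry of a Z/2 column (a set of rows)."""
    return max(col) if col else None

def reduce_boundary(columns):
    """Standard persistence algorithm: reduce each column left to right by
    adding earlier reduced columns with the same low (set XOR = mod-2 sum)."""
    lows = {}  # low row index -> position of the column with that low
    for j, col in enumerate(columns):
        while col and low(col) in lows:
            col = col ^ columns[lows[low(col)]]
        columns[j] = col
        if col:
            lows[low(col)] = j
    # Each surviving (low(col), j) is a birth-death pair of filtration indices
    return [(low(c), j) for j, c in enumerate(columns) if c]

# Filtration of the triangle: vertices 0,1,2, then edges 3=[a,b], 4=[b,c],
# 5=[a,c], then the filled face 6. Columns hold boundaries as row-index sets.
cols = [set(), set(), set(),      # vertices: empty boundary
        {0, 1}, {1, 2}, {0, 2},   # edges
        {3, 4, 5}]                # filled face
print(reduce_boundary(cols))  # [(1, 3), (2, 4), (5, 6)]
```

The pairs (1, 3) and (2, 4) record components killed when edges merge them; (5, 6) records the loop created by the third edge and filled by the face.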
Visual Intuition
[Interactive visualization: the top panel shows a Vietoris-Rips complex built from points sampled near a circle; the bottom panels show the corresponding persistence diagram and persistence barcode. Dragging the scale slider updates all three panels together.]
What to notice:
- H₀ features (teal): At ε = 0, every point is its own connected component — 15 features are born simultaneously. As ε increases, components merge. Each merge event kills a component: you see teal points appear in the persistence diagram with increasing death values. One component survives to ε = ∞ (the essential feature).
- H₁ features (purple): At a critical ε around 0.7-0.9, the edges form a cycle around the point cloud but the interior hasn’t filled in yet — a loop is born. As ε increases further, 2-simplices fill the interior, and the loop dies. The persistence (death minus birth) of this feature is the key topological signal: it tells you “this point cloud has a circular structure.”
- Signal vs. noise: The long-lived H₁ bar stands out from the short bars near the diagonal. That gap between persistent and ephemeral features is what makes TDA work in practice — it’s a built-in notion of statistical significance, grounded in geometry rather than distributional assumptions.
The Stability Theorem
The practical utility of persistent homology rests on a single theorem: small perturbations of the input data produce small changes in the persistence diagram. Without this, the method would be useless — any noise would scramble the output.
Definition 8.
The bottleneck distance between two persistence diagrams D_1 and D_2 is:

d_B(D_1, D_2) = inf_γ sup_{x ∈ D_1} ‖x − γ(x)‖_∞

where the infimum is over all bijections γ : D_1 → D_2 (including diagonal points).
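For tiny diagrams, this definition can be evaluated by brute force: augment each diagram with the diagonal projections of the other’s points, then minimize over all bijections. The bottleneck_small helper below is illustrative (exponential in diagram size, so only for toy inputs):

```python
import itertools

def bottleneck_small(D1, D2):
    """Brute-force bottleneck distance between two tiny persistence diagrams."""
    def diag(p):                      # nearest diagonal point to p
        m = (p[0] + p[1]) / 2
        return (m, m)
    def cost(p, q):
        if p[0] == p[1] and q[0] == q[1]:
            return 0.0                # diagonal matched to diagonal is free
        return max(abs(p[0] - q[0]), abs(p[1] - q[1]))  # L-infinity norm
    # Augment each side so every point has a diagonal partner available
    A = list(D1) + [diag(q) for q in D2]
    B = list(D2) + [diag(p) for p in D1]
    best = float("inf")
    for perm in itertools.permutations(range(len(B))):
        best = min(best, max(cost(A[i], B[perm[i]]) for i in range(len(A))))
    return best

# Shifting one point's birth by 0.1 moves the diagram by exactly 0.1
print(bottleneck_small([(0.0, 1.0)], [(0.1, 1.0)]))  # 0.1
```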
Theorem 2 (Stability Theorem (Cohen-Steiner, Edelsbrunner, Harer 2007)).
Let f, g : X → ℝ be tame functions on a topological space X. Then:

d_B(Dgm(f), Dgm(g)) ≤ ‖f − g‖_∞

The bottleneck distance between persistence diagrams is bounded by the sup-norm distance between the functions that generate them.
Proof.
The full proof uses the theory of interleavings of persistence modules and is beyond our scope. The essential idea: if ‖f − g‖_∞ ≤ ε, then the sublevel set filtrations of f and g are ε-interleaved — and vice versa — which forces a matching between their persistence diagrams that moves no point by more than ε. See Cohen-Steiner, Edelsbrunner & Harer (2007) for the complete argument.
∎

For point cloud data, the relevant corollary is:
Corollary 1.
Let X and Y be finite point clouds with d_H(X, Y) ≤ ε (Hausdorff distance). Then:

d_B(Dgm(X), Dgm(Y)) ≤ ε

In plain language: if you jiggle each point by at most ε, the persistence diagram moves by at most ε. (For Vietoris-Rips filtrations the constant doubles, since each pairwise distance can shift by up to 2ε.) This is what makes persistent homology safe to use on noisy real-world data.
Working Code
Computing Persistence with Ripser
import numpy as np
from ripser import ripser
from persim import plot_diagrams
# Generate an annulus (has a hole) and a Gaussian cluster (no hole)
np.random.seed(42)
n = 200
theta = np.random.uniform(0, 2 * np.pi, n)
r = 1.0 + np.random.normal(0, 0.1, n)
annulus = np.column_stack([r * np.cos(theta), r * np.sin(theta)])
cluster = np.random.randn(n, 2) * np.std(annulus, axis=0) + np.mean(annulus, axis=0)
# Compute persistence (H₀ and H₁)
result_annulus = ripser(annulus, maxdim=1)
result_cluster = ripser(cluster, maxdim=1)
# The annulus has a high-persistence H₁ feature (the loop)
dgm1 = result_annulus['dgms'][1]
max_pers = np.max(dgm1[:, 1] - dgm1[:, 0])
print(f"Annulus max H₁ persistence: {max_pers:.4f}")
# The cluster has only low-persistence H₁ noise
dgm1_c = result_cluster['dgms'][1]
max_pers_c = np.max(dgm1_c[:, 1] - dgm1_c[:, 0]) if len(dgm1_c) else 0.0
print(f"Cluster max H₁ persistence: {max_pers_c:.4f}")
Verifying the Stability Theorem Empirically
from persim import bottleneck
# Clean circle
n_pts = 100
theta_base = np.linspace(0, 2 * np.pi, n_pts, endpoint=False)
base_cloud = np.column_stack([np.cos(theta_base), np.sin(theta_base)])
base_dgm = ripser(base_cloud, maxdim=1)['dgms'][1]
# Add increasing noise and measure diagram movement
for noise_std in [0.0, 0.1, 0.2, 0.3, 0.5]:
    noise = noise_std * np.random.randn(n_pts, 2)
    noisy_cloud = base_cloud + noise
    noisy_dgm = ripser(noisy_cloud, maxdim=1)['dgms'][1]
    bn = bottleneck(base_dgm, noisy_dgm)
    # Max Euclidean displacement of any point (an upper bound on d_H)
    eps = np.max(np.linalg.norm(noise, axis=1)) if noise_std > 0 else 0.0
    print(f"noise σ={noise_std:.1f}: max displacement={eps:.3f}, d_B={bn:.3f}, "
          f"Rips bound (d_B ≤ 2ε) holds: {bn <= 2 * eps + 1e-10}")
Vectorization: Persistence Images for ML
Persistence diagrams aren’t fixed-size vectors, so we can’t feed them directly to most classifiers. Persistence images (Adams et al., 2017) solve this by placing a Gaussian at each point and evaluating on a grid:
from persim import PersistenceImager
# Create a persistence image from H₁
pimgr = PersistenceImager(pixel_size=0.05, birth_range=(0, 1.5), pers_range=(0, 1.0))
pimgr.fit([result_annulus['dgms'][1]])
img = pimgr.transform(result_annulus['dgms'][1])
print(f"Feature vector shape: {img.flatten().shape}")
# This fixed-size vector goes into any sklearn classifier
Full Pipeline: Time Series Classification
The Takens delay embedding theorem connects time series analysis to topology:
Periodic signals produce loops in the embedding; chaotic signals do not.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
def takens_embedding(ts, dim=2, delay=10):
    """Takens delay embedding of a 1D time series."""
    n = len(ts) - (dim - 1) * delay
    return np.array([ts[i:i + dim * delay:delay] for i in range(n)])
def topological_features(ts):
    """Extract topological feature vector from a time series."""
    embedded = takens_embedding(ts, dim=2, delay=10)
    result = ripser(embedded, maxdim=1)
    features = []
    for d in range(2):  # H₀ and H₁
        dgm = result['dgms'][d]
        finite = dgm[np.isfinite(dgm[:, 1])]
        if len(finite) > 0:
            pers = finite[:, 1] - finite[:, 0]
            features.extend([
                np.max(pers),        # Max persistence
                np.mean(pers),       # Mean persistence
                np.std(pers),        # Std of persistence
                np.sum(pers > 0.1),  # Count of significant features
                np.sum(pers ** 2),   # Total persistence (squared L² norm)
            ])
        else:
            features.extend([0, 0, 0, 0, 0])
    return np.array(features)
# Generate labeled time series from 3 dynamical regimes
X, y = [], []
for regime in range(3):
    for _ in range(50):
        t = np.linspace(0, 6 * np.pi, 300)
        noise = 0.05 * np.random.randn(300)
        if regime == 0:  # Periodic
            ts = np.sin(t) + noise
        elif regime == 1:  # Quasi-periodic
            ts = np.sin(t) + 0.5 * np.sin(np.sqrt(2) * t) + noise
        else:  # Chaotic (logistic map)
            ts = np.zeros(300)
            ts[0] = np.random.uniform(0.05, 0.15)  # random start so samples differ
            for i in range(1, 300):
                ts[i] = 3.9 * ts[i - 1] * (1 - ts[i - 1])
            ts = ts + noise  # same observation noise as the other regimes
        X.append(topological_features(ts))
        y.append(regime)
X, y = np.array(X), np.array(y)
scores = cross_val_score(RandomForestClassifier(n_estimators=100, random_state=42),
                         X, y, cv=5)
print(f"5-fold CV accuracy: {scores.mean():.3f} ± {scores.std():.3f}")
This pipeline — time series → Takens embedding → persistent homology → feature vector → classifier — is one of the most successful applications of TDA in practice. It works because the persistence features are invariant to reparameterization of the time series (you can stretch or compress time and the topology doesn’t change) and stable under noise (by the Stability Theorem).
Connections & Applications
Persistent homology sits at the center of a growing web of mathematical and applied techniques:
-
Topological machine learning: Persistence features have been used for protein classification (Cang & Wei, 2017), materials science (Hiraoka et al., 2016), cosmology (Xu et al., 2019), and neuroscience (Giusti et al., 2015). The common thread: data has shape, and shape carries information that statistics alone miss.
-
Persistence landscapes (Bubenik, 2015) and persistence images (Adams et al., 2017) transform diagrams into Banach space elements or fixed-size vectors, enabling statistical inference and ML integration.
-
Extended persistence and zigzag persistence generalize the one-parameter filtration to more complex parameter spaces, capturing richer topological information.
-
The Stability Theorem extends to other metrics: Wasserstein distances between diagrams, interleaving distances between persistence modules, and Gromov-Hausdorff distances between metric spaces. This creates a rich metric geometry on the space of shapes.
-
Takens embedding connects persistent homology to dynamical systems and time series analysis — see Perea & Harer (2015) for the theoretical foundation and the working code above for the practical pipeline.
References & Further Reading
- Gunnar Carlsson, “Topology and Data” (2009) — the survey that defined TDA as a field; essential reading
- David Cohen-Steiner, Herbert Edelsbrunner & John Harer, “Stability of Persistence Diagrams” (2007) — the foundational stability result; Theorem 2 in this topic
- Herbert Edelsbrunner & John Harer, Computational Topology: An Introduction (2010) — chapters 4-7 cover persistent homology comprehensively
- Henry Adams, Tegan Emerson, Michael Kirby, Rachel Neville, Chris Peterson, Patrick Shipman, Sofya Chepushtanova, Eric Hanson, Francis Motta & Lori Ziegelmeier, “Persistence Images: A Stable Vector Representation of Persistent Homology” (2017) — the standard vectorization method for ML pipelines
- Jose A. Perea & John Harer, “Sliding Windows and Persistence: An Application of Topological Methods to Signal Analysis” (2015) — time series → point cloud → topology via Takens embedding
- Ulrich Bauer, “Ripser: Efficient Computation of Vietoris-Rips Persistence Barcodes” (2021) — state-of-the-art algorithm; what powers the ripser library
- Robert Ghrist, Elementary Applied Topology (2014) — the most visual introduction to applied algebraic topology