intermediate · graph-theory · 45 min read

Random Walks & Mixing

Markov chains on graphs — from the transition matrix to mixing times, hitting times, and the spectral gap

Overview & Motivation

In the Graph Laplacians & Spectrum topic, we studied graphs through the lens of the Laplacian matrix $L = D - A$ and its eigenvalues. The normalized Laplacian $\mathcal{L} = I - D^{-1/2} A D^{-1/2}$ made a brief appearance alongside the transition matrix $P = D^{-1}A$. Now we place $P$ center stage.

A random walk on a graph is the simplest possible stochastic process on a network: at each step, the walker moves to a uniformly random neighbor. Despite this simplicity, the walk’s behavior reveals deep structural properties:

  • How quickly does the walk forget its starting position? The mixing time measures convergence to the stationary distribution. Fast mixing implies good connectivity — the graph has no bottlenecks.

  • How far apart are two vertices in “random walk distance”? The hitting time $h(i,j)$ measures the expected number of steps to reach $j$ from $i$. Unlike shortest-path distance, hitting time is sensitive to the graph’s branching structure and reflects the “difficulty” of finding a target by random exploration.

  • What is the electrical interpretation? Identifying edges with unit resistors turns the graph into an electrical network. The effective resistance between two vertices equals the commute time (normalized by total edge weight) — a beautiful equivalence between probability and physics.

The spectral gap $\gamma = 1 - \lambda_2(P) = \lambda_2(\mathcal{L})$ is the key quantity: it controls mixing time ($t_{\mathrm{mix}} = \Theta(1/\gamma)$ up to logarithmic factors), bounds hitting times, and characterizes Expander Graphs — sparse graphs that mix as fast as dense ones.

Roadmap. We build the theory in layers: define the walk and its transition matrix (§1), prove convergence to the stationary distribution (§2), quantify convergence speed via total variation and mixing time (§3–4), handle bipartite graphs with lazy walks (§5), develop hitting times and commute times (§6–7), and connect everything to applications (§8–9).


1. Markov Chains on Graphs

1.1 The Transition Matrix

A random walk on a graph $G = (V, E)$ with $|V| = n$ is a sequence of random variables $X_0, X_1, X_2, \ldots$ taking values in $V$, where at each step the walker moves to a uniformly random neighbor:

$$\Pr[X_{t+1} = j \mid X_t = i] = P_{ij} = \frac{A_{ij}}{d_i}$$

Definition 1 (Random Walk on a Graph).

A random walk on a graph $G = (V, E)$ is a Markov chain $(X_t)_{t \geq 0}$ on the state space $V$ with transition probabilities $P_{ij} = A_{ij} / d_i$: at each step, the walker moves from its current vertex to a uniformly random neighbor.

Definition 2 (Transition Matrix).

The transition matrix of the simple random walk on $G$ is:

$$P = D^{-1}A$$

where $D = \mathrm{diag}(d_1, \ldots, d_n)$ is the degree matrix. Each row of $P$ sums to 1 — $P$ is a (row) stochastic matrix.

The entry $P_{ij}$ gives the probability of stepping from vertex $i$ to vertex $j$ in one step. The matrix $P^t$ gives $t$-step transition probabilities: $(P^t)_{ij} = \Pr[X_t = j \mid X_0 = i]$.
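A minimal numerical sketch (the 4-vertex path graph is an arbitrary choice): build $P$ and confirm that it, and its powers, stay row-stochastic.

```python
import numpy as np

# Path graph on 4 vertices (an arbitrary small example)
A = np.array([[0., 1., 0., 0.],
              [1., 0., 1., 0.],
              [0., 1., 0., 1.],
              [0., 0., 1., 0.]])

P = np.diag(1.0 / A.sum(axis=1)) @ A        # P = D^{-1} A
P5 = np.linalg.matrix_power(P, 5)           # 5-step transition probabilities
```

Each row of `P5` is the distribution of $X_5$ given the corresponding start vertex.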

Transition matrices for named graphs — heatmaps showing the row-stochastic structure of P for Path, Cycle, Complete, Star, Barbell, Grid, Hypercube, and Petersen graphs

1.2 Irreducibility and Aperiodicity

Two properties determine the walk’s long-run behavior:

  • Irreducible: Every vertex can be reached from every other vertex (equivalently, $G$ is connected).
  • Aperiodic: The walk does not get trapped in a deterministic cycle. A connected graph has period 2 if and only if it is bipartite; all other connected graphs are aperiodic.

Combining these:

Definition 3 (Ergodic Chain).

A Markov chain is ergodic if it is both irreducible and aperiodic. For graph random walks, this means $G$ is connected and non-bipartite.
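One concrete diagnostic, sketched below: a connected graph is bipartite exactly when $-1$ is an eigenvalue of $P$. The cycles $C_4$ and $C_5$ are arbitrary examples.

```python
import numpy as np

def walk_eigenvalues(A):
    """Eigenvalues of P = D^{-1}A, via the similar symmetric matrix D^{-1/2} A D^{-1/2}."""
    s = 1.0 / np.sqrt(A.sum(axis=1))
    return np.linalg.eigvalsh(s[:, None] * A * s[None, :])

def cycle(n):
    """Adjacency matrix of the n-cycle."""
    A = np.zeros((n, n))
    for i in range(n):
        A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1.0
    return A

ev4 = walk_eigenvalues(cycle(4))   # bipartite: eigenvalue -1 appears
ev5 = walk_eigenvalues(cycle(5))   # non-bipartite: eigenvalue -1 absent
```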


2. Stationary Distribution & Convergence

2.1 The Stationary Distribution

Definition 4 (Stationary Distribution).

A probability distribution $\boldsymbol{\pi} \in \mathbb{R}^n$ is stationary for $P$ if $\boldsymbol{\pi}^T P = \boldsymbol{\pi}^T$. Equivalently, if the walk’s initial distribution is $\boldsymbol{\pi}$, it remains $\boldsymbol{\pi}$ after any number of steps.

Theorem 1 (Stationarity of π).

The stationary distribution of the simple random walk on $G$ is $\pi_i = d_i / (2m)$, where $m = |E|$.

Proof.

We verify $\boldsymbol{\pi}^T P = \boldsymbol{\pi}^T$ directly:

$$(\boldsymbol{\pi}^T P)_j = \sum_{i=1}^n \frac{d_i}{2m} \cdot \frac{A_{ij}}{d_i} = \frac{1}{2m}\sum_{i=1}^n A_{ij} = \frac{d_j}{2m} = \pi_j$$

The degree in the numerator cancels, and $\sum_i A_{ij} = d_j$ since the column sum of the adjacency matrix equals the degree of vertex $j$.

The stationary distribution assigns probability proportional to degree. High-degree vertices are visited more often — they are “easier to find” because more edges lead to them. On a $d$-regular graph, all degrees are equal and $\boldsymbol{\pi} = (1/n, \ldots, 1/n)$ — the uniform distribution.

Corollary 1 (Uniform Stationary Distribution on Regular Graphs).

On a $d$-regular graph, the stationary distribution is uniform: $\boldsymbol{\pi} = (1/n, \ldots, 1/n)$.
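Theorem 1 can be spot-checked numerically; the 3-leaf star below is a hypothetical example with non-uniform degrees.

```python
import numpy as np

# Star with hub 0 and three leaves — an irregular graph, so pi is non-uniform
A = np.zeros((4, 4))
A[0, 1:] = A[1:, 0] = 1.0

d = A.sum(axis=1)
pi = d / d.sum()                   # pi_i = d_i / (2m): the degree sum is 2m
P = np.diag(1.0 / d) @ A

residual = pi @ P - pi             # stationarity: pi^T P = pi^T
```

The hub has degree 3 out of a degree sum of 6, so it carries half the stationary mass.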

2.2 Detailed Balance (Reversibility)

Theorem 2 (Detailed Balance (Reversibility)).

The simple random walk on any undirected graph is reversible: $\pi_i P_{ij} = \pi_j P_{ji}$ for all $i, j$.

Proof.

$$\pi_i P_{ij} = \frac{d_i}{2m} \cdot \frac{A_{ij}}{d_i} = \frac{A_{ij}}{2m} = \frac{A_{ji}}{2m} = \frac{d_j}{2m} \cdot \frac{A_{ji}}{d_j} = \pi_j P_{ji}$$

The equality $A_{ij} = A_{ji}$ follows from the graph being undirected.

Remark (Reversibility and the Inner Product).

Detailed balance means the walk is time-reversible: a video of the walk looks the same played forwards and backwards (in distribution). Algebraically, reversibility is equivalent to $P$ being self-adjoint with respect to the $\ell^2(\boldsymbol{\pi})$ inner product $\langle f, g \rangle_{\boldsymbol{\pi}} = \sum_i \pi_i f(i) g(i)$.

2.3 Convergence

Theorem 3 (Convergence Theorem).

If the random walk on $G$ is ergodic (connected, non-bipartite), then for any starting vertex $x$:

$$\lim_{t \to \infty} P^t(x, \cdot) = \boldsymbol{\pi}$$

Proof.

By the Perron–Frobenius theorem applied to the non-negative irreducible matrix $P$: the largest eigenvalue $\mu_1 = 1$ is simple (by irreducibility), and aperiodicity ensures $|\mu_i| < 1$ for all $i \geq 2$.

The spectral decomposition gives:

$$P^t = \boldsymbol{1}\boldsymbol{\pi}^T + \sum_{i=2}^n \mu_i^t \, \mathbf{f}_i \mathbf{w}_i^T$$

where $\mathbf{f}_i$ and $\mathbf{w}_i$ are the right and left eigenvectors for $\mu_i$. Since $|\mu_i| < 1$, each term $\mu_i^t \to 0$ as $t \to \infty$, so $P^t \to \boldsymbol{1}\boldsymbol{\pi}^T$. That is, every row of $P^t$ converges to $\boldsymbol{\pi}$.
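A numerical sketch of the theorem (the 4-vertex graph below, a triangle with one pendant vertex, is an arbitrary ergodic example):

```python
import numpy as np

# Triangle {0, 1, 2} plus a pendant vertex 3 attached to vertex 2:
# connected and non-bipartite, hence ergodic (an arbitrary example)
A = np.array([[0., 1., 1., 0.],
              [1., 0., 1., 0.],
              [1., 1., 0., 1.],
              [0., 0., 1., 0.]])

d = A.sum(axis=1)
pi = d / d.sum()
P = np.diag(1.0 / d) @ A

Pt = np.linalg.matrix_power(P, 200)   # every row approaches pi
```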

Convergence of empirical visit frequencies to the stationary distribution for Complete, Cycle, and Barbell graphs — showing how graph topology controls convergence speed


3. Total Variation & Mixing Time

The convergence theorem says the walk eventually reaches stationarity, but how fast? We need a distance measure between distributions.

Definition 5 (Total Variation Distance).

The total variation distance between probability distributions $p$ and $q$ on $V$ is:

$$\|p - q\|_{\mathrm{TV}} = \frac{1}{2}\sum_{v \in V} |p(v) - q(v)|$$

Total variation ranges from 0 (identical distributions) to 1 (distributions with disjoint support). It has a clean probabilistic interpretation: $\|p - q\|_{\mathrm{TV}} = \max_{A \subseteq V} |p(A) - q(A)|$ — the maximum difference in probability assigned to any event.

Definition 6 (Mixing Time).

The mixing time to accuracy $\varepsilon$ is the worst-case time for the walk to get $\varepsilon$-close to stationarity:

$$t_{\mathrm{mix}}(\varepsilon) = \max_{x \in V} \min\{t \geq 0 : \|P^t(x, \cdot) - \boldsymbol{\pi}\|_{\mathrm{TV}} \leq \varepsilon\}$$

By convention, $t_{\mathrm{mix}} = t_{\mathrm{mix}}(1/4)$.

Examples illustrating the range of mixing behaviors:

  • Complete graph $K_n$: $t_{\mathrm{mix}} = 1$. Every vertex is adjacent to every other — the walk mixes in a single step.
  • Cycle $C_n$: $t_{\mathrm{mix}} = \Theta(n^2)$. The walk must diffuse around the ring (for even $n$ the cycle is bipartite, so use the lazy walk of §5).
  • Path $P_n$: $t_{\mathrm{mix}} = \Theta(n^2)$ for the lazy walk (the path is bipartite). Similar to the cycle, but without the wraparound.
  • Barbell $B_k$: $t_{\mathrm{mix}} = \Theta(k^3)$. The bridge is a bottleneck — the walk gets stuck in one clique.
  • Hypercube $Q_d$: $t_{\mathrm{mix}} = \Theta(d \log d)$. Fast mixing despite sparsity — each vertex has degree $d$ out of $2^d$ vertices.
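A brute-force estimator makes the first bullet concrete; `mixing_time` is a helper name introduced here for illustration, not something defined earlier.

```python
import numpy as np

def mixing_time(A, eps=0.25, t_max=10_000):
    """Smallest t with worst-case TV distance <= eps (brute force over powers of P)."""
    d = A.sum(axis=1)
    pi = d / d.sum()
    P = np.diag(1.0 / d) @ A
    Pt = np.eye(len(A))
    for t in range(t_max + 1):
        if 0.5 * np.abs(Pt - pi).sum(axis=1).max() <= eps:
            return t
        Pt = Pt @ P
    return None   # did not mix within t_max steps

K8 = np.ones((8, 8)) - np.eye(8)   # complete graph K_8
```

On $K_8$ the walk is already within TV distance $1/8 \leq 1/4$ after one step, so the helper returns 1.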

4. Spectral Analysis of Mixing

4.1 Spectral Decomposition of PtP^t

Theorem 4 (Spectral Decomposition of P^t).

Let $\mu_1 = 1 \geq \mu_2 \geq \cdots \geq \mu_n \geq -1$ be the eigenvalues of $P$, with corresponding right eigenvectors $\mathbf{f}_i$ (normalized in $\ell^2(\boldsymbol{\pi})$). Then:

$$P^t(x, y) = \pi_y \left(1 + \sum_{i=2}^n \mu_i^t \frac{f_i(x) f_i(y)}{\|f_i\|_{\pi}^2}\right)$$

Proof.

Since $P$ is similar to the symmetric matrix $S = D^{-1/2} A D^{-1/2}$, the Spectral Theorem gives real eigenvalues and a complete eigenbasis. Write $P^t$ in its spectral decomposition:

$$P^t = \sum_{i=1}^n \mu_i^t \, \mathbf{f}_i \mathbf{w}_i^T$$

where $\mathbf{f}_i$ and $\mathbf{w}_i$ are right and left eigenvectors. The $i = 1$ term is $\boldsymbol{1} \boldsymbol{\pi}^T$ (since $P\boldsymbol{1} = \boldsymbol{1}$ and $\boldsymbol{\pi}^T P = \boldsymbol{\pi}^T$). For ergodic chains, $|\mu_i| < 1$ for $i \geq 2$, so each remaining term decays exponentially.

The rate of decay is controlled by $\lambda_\star = \max_{i \geq 2} |\mu_i|$, the second-largest eigenvalue in absolute value. This leads to:

4.2 The Spectral Gap

Definition 7 (Spectral Gap).

The spectral gap of the random walk is:

$$\gamma = 1 - \lambda_\star \quad \text{where} \quad \lambda_\star = \max_{i \geq 2}|\mu_i|$$

Corollary 2 (Spectral Gap Equals λ₂(𝓛)).

When $\mu_2 \geq |\mu_n|$ — which always holds for the lazy walk of §5 — we have $\lambda_\star = \mu_2$, and the spectral gap equals the second-smallest eigenvalue of the normalized Laplacian: $\gamma = 1 - \mu_2 = \lambda_2(\mathcal{L})$.

This is the bridge between random walks and spectral graph theory: the same eigenvalue $\lambda_2(\mathcal{L})$ that measures algebraic connectivity in the Laplacian world controls mixing speed in the random walk world.
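A quick numerical check of this correspondence (the 5-cycle is an arbitrary connected graph):

```python
import numpy as np

# The 5-cycle: any connected graph works here
n = 5
A = np.zeros((n, n))
for i in range(n):
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1.0

s = 1.0 / np.sqrt(A.sum(axis=1))
S = s[:, None] * A * s[None, :]              # similar to P: same eigenvalues
mu = np.sort(np.linalg.eigvalsh(S))[::-1]    # mu_1 = 1 >= mu_2 >= ...
lam = np.linalg.eigvalsh(np.eye(n) - S)      # normalized Laplacian spectrum (ascending)

gap = 1 - mu[1]                              # spectral gap 1 - mu_2
```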

4.3 Mixing Time Bounds

Theorem 5 (Upper Mixing Time Bound).

$$\|P^t(x, \cdot) - \boldsymbol{\pi}\|_{\mathrm{TV}} \leq \frac{1}{2}\sqrt{1/\pi_{\min}} \cdot \lambda_\star^t$$

Proof.

Apply Cauchy–Schwarz in $\ell^2(\boldsymbol{\pi})$. The total variation distance satisfies:

$$\|P^t(x, \cdot) - \boldsymbol{\pi}\|_{\mathrm{TV}} = \frac{1}{2}\sum_y \left|\frac{P^t(x,y)}{\pi_y} - 1\right| \pi_y \leq \frac{1}{2}\sqrt{\sum_y \left(\frac{P^t(x,y)}{\pi_y} - 1\right)^2 \pi_y}$$

The $\ell^2(\boldsymbol{\pi})$ norm of $P^t(x,\cdot)/\boldsymbol{\pi} - 1$ is bounded by $\sqrt{1/\pi_{\min}} \cdot \lambda_\star^t$ using the spectral decomposition.

This gives the mixing time bound: $t_{\mathrm{mix}} \leq \frac{1}{\gamma}\left(\frac{1}{2}\ln\frac{1}{\pi_{\min}} + \ln 2\right)$.

Theorem 6 (Lower Mixing Time Bound).

$$t_{\mathrm{mix}}(\varepsilon) \geq \frac{1}{\gamma}\ln\frac{1}{2\varepsilon}$$

Proof.

Consider the second eigenvector as a test function. For the worst-case starting vertex, the TV distance at time $t$ is at least $\frac{1}{2}\lambda_\star^t$. Setting this equal to $\varepsilon$ and solving for $t$ gives the lower bound.

Summary: $t_{\mathrm{mix}} = \Theta\left(\frac{1}{\gamma} \ln\frac{1}{\pi_{\min}}\right)$. The spectral gap $\gamma$ is the dominant factor; the $\ln(1/\pi_{\min})$ term accounts for graphs with extreme degree variation.


Mixing time comparison: TV distance curves for graph families, showing how topology determines convergence speed

Spectral gap vs mixing time scatter plot across 150 random graphs — confirming the Θ(1/γ) relationship


5. Lazy Random Walks

What happens on bipartite graphs? The path graph $P_n$ and the grid graph are bipartite, so $P$ has eigenvalue $\mu_n = -1$ and the walk oscillates instead of converging: the distribution at even time steps differs from that at odd time steps. The fix is simple:

Definition 8 (Lazy Random Walk).

The lazy random walk has transition matrix:

$$P_{\text{lazy}} = \frac{1}{2}(I + P)$$

At each step, the walker stays put with probability $1/2$ and moves to a random neighbor with probability $1/2$.

Proposition 1 (Lazy Walk Spectrum).

If $\mu_1, \ldots, \mu_n$ are the eigenvalues of $P$, then the eigenvalues of $P_{\text{lazy}}$ are $\mu_i^{(\text{lazy})} = (1 + \mu_i)/2$.

Proof.

If $P\mathbf{v} = \mu_i \mathbf{v}$, then:

$$P_{\text{lazy}} \mathbf{v} = \frac{1}{2}(I + P)\mathbf{v} = \frac{1 + \mu_i}{2}\mathbf{v}$$

So $\mathbf{v}$ is an eigenvector of $P_{\text{lazy}}$ with eigenvalue $(1 + \mu_i)/2$.

Since $\mu_i \in [-1, 1]$, we have $\mu_i^{(\text{lazy})} \in [0, 1]$ — no negative eigenvalues, no oscillation. The lazy walk converges even on bipartite graphs, at the cost of halving the spectral gap: $\gamma_{\text{lazy}} = \gamma / 2$.
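A sketch verifying Proposition 1 on the bipartite 4-cycle:

```python
import numpy as np

# Bipartite 4-cycle: the standard walk has eigenvalue -1, the lazy walk does not
A = np.array([[0., 1., 0., 1.],
              [1., 0., 1., 0.],
              [0., 1., 0., 1.],
              [1., 0., 1., 0.]])

s = 1.0 / np.sqrt(A.sum(axis=1))
S = s[:, None] * A * s[None, :]                       # symmetric, same spectrum as P
mu = np.linalg.eigvalsh(S)                            # eigenvalues of P (ascending)
mu_lazy = np.linalg.eigvalsh(0.5 * (np.eye(4) + S))   # eigenvalues of P_lazy
```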

Lazy vs standard random walk on the bipartite path graph — the standard walk oscillates while the lazy walk converges smoothly


6. Hitting Times & Commute Times

Mixing time measures convergence of the distribution to stationarity. Hitting times measure something complementary: how long it takes a single walk to reach a specific target.

Definition 9 (Hitting Time).

The hitting time from $i$ to $j$ is the expected first passage time:

$$h(i, j) = \mathbb{E}_i[\min\{t \geq 0 : X_t = j\}]$$

with $h(j, j) = 0$ by convention.

Hitting times satisfy a linear recurrence: $h(i, j) = 1 + \sum_{k \neq j} P_{ik} \, h(k, j)$ for $i \neq j$.

Proposition 2 (Hitting Time via Fundamental Matrix).

For each target $j$, the hitting times $h(i, j)$ satisfy the linear system:

$$(I - P_{\bar{j}}) \mathbf{h}_j = \mathbf{1}$$

where $P_{\bar{j}}$ is the transition matrix with row and column $j$ removed.

Hitting times are asymmetric in general: $h(i, j) \neq h(j, i)$. On a star graph $S_n$ with 1 hub and $n - 1$ leaves, any leaf reaches the hub in exactly 1 step ($h(\text{leaf}, \text{hub}) = 1$), but the hub must guess the right leaf, paying a 2-step round trip for each wrong guess: $h(\text{hub}, \text{leaf}) = 2n - 3$. To go from one leaf to another, the walker first steps to the hub (1 step) and then searches for the target leaf, giving $h(\text{leaf}_a, \text{leaf}_b) = 2n - 2$. The asymmetry reflects the graph’s branching structure.
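The star-graph asymmetry can be verified by solving the linear system of Proposition 2; `hitting_times_to` is a small helper written here for illustration (§9 gives an all-pairs version).

```python
import numpy as np

def hitting_times_to(A, j):
    """Hitting times h(i, j) into target j via (I - P_restricted) h = 1."""
    P = np.diag(1.0 / A.sum(axis=1)) @ A
    idx = [i for i in range(len(A)) if i != j]
    h = np.linalg.solve(np.eye(len(idx)) - P[np.ix_(idx, idx)], np.ones(len(idx)))
    out = np.zeros(len(A))
    out[idx] = h
    return out

# Star with hub 0 and 5 leaves (a hypothetical example, n = 6 vertices)
A = np.zeros((6, 6))
A[0, 1:] = A[1:, 0] = 1.0

h_to_hub = hitting_times_to(A, 0)    # every leaf reaches the hub in 1 step
h_to_leaf = hitting_times_to(A, 1)   # from the hub, the solver returns 9 steps
```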

Definition 10 (Commute Time).

The commute time between $i$ and $j$ is the expected round-trip time:

$$\kappa(i, j) = h(i, j) + h(j, i)$$

Unlike hitting times, commute times are symmetric: $\kappa(i, j) = \kappa(j, i)$.


Hitting time heatmaps for Path, Star, and Barbell graphs — showing the asymmetry of hitting times and the structure-dependent patterns


7. Effective Resistance & Electrical Networks

The connection between random walks and electrical networks is one of the most beautiful results in combinatorics. Replace each edge with a unit resistor, and the graph becomes an electrical circuit. The Laplacian is the conductance matrix: applying Kirchhoff’s laws gives $L\mathbf{v} = \mathbf{i}_{\text{ext}}$.

Definition 11 (Effective Resistance).

The effective resistance between vertices $i$ and $j$ is:

$$R_{\text{eff}}(i, j) = (\mathbf{e}_i - \mathbf{e}_j)^T L^\dagger (\mathbf{e}_i - \mathbf{e}_j)$$

where $L^\dagger$ is the Moore–Penrose pseudoinverse of the Laplacian and $\mathbf{e}_i$ is the $i$-th standard basis vector.

Effective resistance is a metric on $V$: it satisfies non-negativity, symmetry, and the triangle inequality. It behaves like physical resistance — adding edges decreases resistance (parallel paths), removing edges increases it.

Theorem 7 (Commute Time–Effective Resistance Identity).

$$\kappa(i, j) = 2m \cdot R_{\text{eff}}(i, j)$$

where $m = |E|$ is the total number of edges.

Proof.

Express $h(i,j)$ via the fundamental matrix $Z = (I - P + \boldsymbol{1}\boldsymbol{\pi}^T)^{-1}$: one has $h(i,j) = (Z_{jj} - Z_{ij})/\pi_j$. Adding $h(i,j) + h(j,i)$ and relating $Z$ to the Laplacian pseudoinverse $L^\dagger$ (after appropriate normalization), the entries telescope to give $\kappa(i,j) = 2m\,(L^\dagger_{ii} + L^\dagger_{jj} - 2L^\dagger_{ij}) = 2m \cdot R_{\text{eff}}(i,j)$.

See Doyle & Snell (1984) for the complete derivation.
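Theorem 7 can be checked on the 3-vertex path, where the series rule gives $R_{\text{eff}}(0,2) = 2$; the `hit` helper below just solves the hitting-time system of §6.

```python
import numpy as np

# Path 0 - 1 - 2: two unit resistors in series, so R_eff(0, 2) = 2, and m = 2 edges
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])

L = np.diag(A.sum(axis=1)) - A
L_pinv = np.linalg.pinv(L)
R_02 = L_pinv[0, 0] + L_pinv[2, 2] - 2 * L_pinv[0, 2]

def hit(A, j):
    """Hitting times into target j via the linear system of Proposition 2."""
    P = np.diag(1.0 / A.sum(axis=1)) @ A
    idx = [i for i in range(len(A)) if i != j]
    h = np.linalg.solve(np.eye(len(idx)) - P[np.ix_(idx, idx)], np.ones(len(idx)))
    out = np.zeros(len(A))
    out[idx] = h
    return out

m = int(A.sum() / 2)
kappa_02 = hit(A, 2)[0] + hit(A, 0)[2]   # commute time between the endpoints
```

Here $\kappa(0,2) = 8 = 2 \cdot 2 \cdot 2 = 2m \cdot R_{\text{eff}}(0,2)$, as the theorem predicts.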

Properties of effective resistance:

  1. $R_{\text{eff}}(i,j) \geq 0$, with equality if and only if $i = j$.
  2. Symmetry: $R_{\text{eff}}(i,j) = R_{\text{eff}}(j,i)$.
  3. Triangle inequality: $R_{\text{eff}}(i,k) \leq R_{\text{eff}}(i,j) + R_{\text{eff}}(j,k)$.
  4. Rayleigh monotonicity: adding edges can only decrease effective resistance.

Sum rule (Kirchhoff index): $\sum_{i < j} R_{\text{eff}}(i,j) = n \cdot \mathrm{tr}(L^\dagger) = n \sum_{k=2}^n 1/\lambda_k$, where the $\lambda_k$ are the nonzero Laplacian eigenvalues.

Effective resistance network visualization — edge widths proportional to resistance, showing how bottlenecks correspond to high resistance


8. Applications: PageRank, DeepWalk, Label Propagation

8.1 PageRank

PageRank (Page, Brin, Motwani & Winograd, 1999) models a web surfer who follows links randomly, but occasionally “teleports” to a uniformly random page:

$$P_{\text{PR}} = \alpha P + (1 - \alpha) \frac{1}{n}\mathbf{1}\mathbf{1}^T$$

with $\alpha = 0.85$ typically. The teleportation term is a rank-one perturbation that guarantees ergodicity — even if the original graph is disconnected or bipartite.

Remark (PageRank as Perturbed Random Walk).

The teleportation term guarantees a spectral gap of at least $1 - \alpha$, so PageRank converges in $O\left(\frac{1}{1 - \alpha} \log n\right)$ iterations regardless of graph structure. This is why power iteration converges fast for PageRank — the teleportation creates an artificial spectral gap.

PageRank on a small web graph — node sizes proportional to PageRank score, comparing with the simple random walk stationary distribution

8.2 DeepWalk and node2vec

DeepWalk (Perozzi, Al-Rfou & Skiena, 2014) treats random walks as “sentences” and vertices as “words,” then applies the skip-gram model from NLP to learn vertex embeddings. The key insight: DeepWalk implicitly factorizes a matrix related to $\log(\max(P^k, 1))$, connecting random walk statistics to low-rank matrix approximations studied in PCA & Low-Rank Approximation.

8.3 Label Propagation

Label propagation spreads known labels through the graph:

$$\mathbf{y}^{(t+1)} = \alpha P \mathbf{y}^{(t)} + (1 - \alpha) \mathbf{y}^{(0)}$$

At convergence, each vertex’s label is a discounted average of its neighbors’ labels, weighted by the random walk transition probabilities. The mixing time determines how far labels spread — fast-mixing graphs propagate labels globally, while graphs with bottlenecks propagate locally.
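A minimal sketch of the iteration (the 4-cycle and its seed labels are hypothetical), comparing the iterate with the closed-form fixed point $\mathbf{y}^\star = (1-\alpha)(I - \alpha P)^{-1}\mathbf{y}^{(0)}$:

```python
import numpy as np

# Hypothetical example: a 4-cycle with two labeled seeds (+1 at vertex 0, -1 at vertex 2)
A = np.array([[0., 1., 0., 1.],
              [1., 0., 1., 0.],
              [0., 1., 0., 1.],
              [1., 0., 1., 0.]])
P = np.diag(1.0 / A.sum(axis=1)) @ A

alpha = 0.8
y0 = np.array([1.0, 0.0, -1.0, 0.0])

y = y0.copy()
for _ in range(500):                      # the update is a contraction with factor alpha
    y = alpha * P @ y + (1 - alpha) * y0

# Closed-form fixed point: y* = (1 - alpha) (I - alpha P)^{-1} y0
y_star = (1 - alpha) * np.linalg.solve(np.eye(4) - alpha * P, y0)
```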

Remark (Over-Smoothing in GNNs).

Stacking $t$ GCN layers applies (roughly) $P^t$ to the feature matrix. By the convergence theorem, $P^t \to \boldsymbol{1}\boldsymbol{\pi}^T$ — all vertex features collapse to a single vector. This is over-smoothing: the random walk convergence theorem in disguise. The number of useful GNN layers should roughly match $1/\gamma$, the mixing time. Message Passing & GNNs develops solutions to this problem.


9. Computational Notes

Transition matrix construction. Given an adjacency matrix $A$ as a dense NumPy array, the transition matrix is $P = D^{-1}A$:

import numpy as np

def transition_matrix(A):
    """P = D^{-1} A for the simple random walk (assumes every vertex has degree >= 1)."""
    D_inv = np.diag(1.0 / A.sum(axis=1))
    return D_inv @ A

Random walk simulation. Simulate a walk and track visit frequencies:

def simulate_walk(P, start, n_steps, rng=None):
    """Simulate a random walk, return visit counts."""
    rng = rng or np.random.default_rng()
    n = len(P)
    counts = np.zeros(n)
    v = start
    for _ in range(n_steps):
        counts[v] += 1
        v = rng.choice(n, p=P[v])
    return counts / n_steps

Mixing time estimation via spectral decomposition. Compute TV distance at each step without matrix exponentiation:

def mixing_profile(A, max_t=200):
    """Worst-case TV distance to stationarity at each step (assumes G connected)."""
    deg = A.sum(axis=1)
    pi = deg / deg.sum()
    inv_sqrt_deg = 1.0 / np.sqrt(deg)
    # S = D^{-1/2} A D^{-1/2} is symmetric and has the same eigenvalues as P
    S = inv_sqrt_deg[:, None] * A * inv_sqrt_deg[None, :]
    evals, evecs = np.linalg.eigh(S)
    sqrt_deg = np.sqrt(deg)
    tv = []
    for t in range(max_t + 1):
        # P^t(x, y) = sqrt(d_y / d_x) * sum_k evals_k^t evecs[x, k] evecs[y, k]
        St = (evecs * evals**t) @ evecs.T
        Pt = inv_sqrt_deg[:, None] * St * sqrt_deg[None, :]
        tv.append(0.5 * np.abs(Pt - pi).sum(axis=1).max())
    return tv

Hitting time solver. Solve the linear system for all-pairs hitting times:

def all_pairs_hitting_times(A):
    """Solve (I - P_{-j}) h_j = 1 for each target j."""
    P = transition_matrix(A)
    n = len(A)
    H = np.zeros((n, n))
    for j in range(n):
        idx = [i for i in range(n) if i != j]
        P_red = P[np.ix_(idx, idx)]
        h = np.linalg.solve(np.eye(n-1) - P_red, np.ones(n-1))
        for k, i in enumerate(idx):
            H[i, j] = h[k]
    return H

Effective resistance. Compute via the Laplacian pseudoinverse:

def effective_resistance_matrix(A):
    """R_eff(i,j) via Laplacian pseudoinverse."""
    L = np.diag(A.sum(axis=1)) - A
    evals, evecs = np.linalg.eigh(L)
    # Pseudoinverse: invert nonzero eigenvalues
    L_pinv = sum(
        (1/lam) * np.outer(v, v)
        for lam, v in zip(evals, evecs.T) if lam > 1e-10
    )
    R = np.zeros_like(A, dtype=float)
    for i in range(len(A)):
        for j in range(i+1, len(A)):
            R[i,j] = R[j,i] = L_pinv[i,i] + L_pinv[j,j] - 2*L_pinv[i,j]
    return R

PageRank via power iteration. The standard implementation using networkx:

import networkx as nx

G = nx.karate_club_graph()
pr = nx.pagerank(G, alpha=0.85)

# Manual power iteration
A = nx.to_numpy_array(G)
P = transition_matrix(A)
n = len(A)
alpha = 0.85
P_pr = alpha * P + (1 - alpha) / n * np.ones((n, n))
r = np.ones(n) / n
for _ in range(100):
    r = r @ P_pr

10. Connections & Further Reading

Cross-Track and Within-Track Connections

  • Graph Laplacians & Spectrum: $P = D^{-1}A$ complements $L_{\text{rw}} = I - P$. The spectral gap $\gamma = \lambda_2(\mathcal{L})$ controls both algebraic connectivity and mixing time.
  • The Spectral Theorem: $P$ is similar to the symmetric matrix $D^{-1/2} A D^{-1/2}$. The spectral decomposition of $P^t$ underlies every mixing bound.
  • Shannon Entropy & Mutual Information: $D_{\mathrm{KL}}(P^t(x,\cdot) \,\|\, \boldsymbol{\pi})$ decreases monotonically along the walk. The stationary entropy $H(\boldsymbol{\pi}) = \log n$ if and only if the graph is regular.
  • Measure-Theoretic Probability: convergence in total variation is convergence in $L^1$ on the probability simplex. The ergodic theorem says time averages converge almost surely.
  • Concentration Inequalities: Chernoff bounds for sums along a Markov chain use the spectral gap to account for dependence between consecutive samples.
  • PCA & Low-Rank Approximation: DeepWalk/node2vec embeddings implicitly factorize random walk probabilities — a spectral approximation related to Laplacian eigenvectors.
  • Expander Graphs: the sparse graphs that mix the fastest — their spectral gap $\gamma$ is bounded away from zero even as $n \to \infty$, giving $O(\log n)$ mixing time. The expander walk sampling theorem extends the mixing perspective to derandomization.
  • Message Passing & GNNs: over-smoothing is the convergence theorem applied to feature propagation: $t$ layers $\approx P^t$, and $P^t \to \boldsymbol{1}\boldsymbol{\pi}^T$ collapses features. GNN depth should respect the mixing time.

Key Notation

  • $P$ (transition matrix): $D^{-1}A$
  • $\boldsymbol{\pi}$ (stationary distribution): $\pi_i = d_i/(2m)$
  • $\gamma$ (spectral gap): $1 - \lambda_\star$
  • $t_{\mathrm{mix}}$ (mixing time): worst-case time for TV distance to stationarity to drop below $\varepsilon$
  • $P_{\text{lazy}}$ (lazy walk): $(I + P)/2$
  • $h(i,j)$ (hitting time): expected first passage time from $i$ to $j$
  • $\kappa(i,j)$ (commute time): $h(i,j) + h(j,i) = 2m \cdot R_{\text{eff}}(i,j)$
  • $R_{\text{eff}}$ (effective resistance): $(\mathbf{e}_i - \mathbf{e}_j)^T L^\dagger (\mathbf{e}_i - \mathbf{e}_j)$


References & Further Reading

  • [Book] Levin, Peres & Wilmer (2009). Markov Chains and Mixing Times. The standard textbook — covers mixing times, spectral methods, coupling, and hitting times with full proofs.
  • [Book] Chung (1997). Spectral Graph Theory. Chapter 2 covers random walks on graphs and the connection between mixing and the normalized Laplacian spectrum.
  • [Paper] Page, Brin, Motwani & Winograd (1999). The PageRank Citation Ranking: Bringing Order to the Web. PageRank = stationary distribution of a random walk with teleportation — the foundational application.
  • [Paper] Perozzi, Al-Rfou & Skiena (2014). DeepWalk: Online Learning of Social Representations. Random walks for vertex embedding — the bridge from Markov chains to representation learning.
  • [Paper] Lovász & Kannan (1999). Faster Mixing via Average Conductance. Sharper mixing time bounds via average conductance — extends Cheeger's inequality to the random walk setting.
  • [Book] Doyle & Snell (1984). Random Walks and Electric Networks. The classic treatment of the equivalence between random walk quantities and electrical network quantities.