intermediate unsupervised 50 min read

Clustering

Mean-shift mode-finding on the KDE substrate: bandwidth-mode-count duality, basin-of-attraction partitions, and the k-means / GMM contrasts

Motivation: when linear partitions fail

The classic clustering algorithms — Lloyd’s k-means most prominently — partition space into pieces with flat boundaries. That’s fine when the true cluster shapes happen to be convex blobs, but it fails predictably the moment a cluster has an elbow, a curve, or any non-convex structure. Mean-shift clustering takes a different angle: rather than partitioning space, it climbs the density. Each point trails uphill toward a local high-density peak — a mode of the kernel density estimator f^h\hat f_h we inherit from formalStatistics: kernel-density-estimation — and the cluster assignment falls out of which peak the trajectory lands at.

This section sets up the contrast geometrically. We defer the algebra (§3) and the convergence theory (§4) until we have a clear picture of what the algorithm is trying to do.

The k-means linear-Voronoi assumption and what breaks under non-convex clusters

Lloyd’s k-means picks KK centroids μ1,,μKRd\boldsymbol{\mu}_1, \ldots, \boldsymbol{\mu}_K \in \mathbb{R}^d, assigns each data point xi\mathbf{x}_i to the nearest centroid, recomputes each centroid as the mean of its cluster, and iterates until convergence. The fixed point has a tidy geometric description: each cluster is the set of points closer to its own centroid than to any other — a Voronoi cell of the centroid configuration — and the boundary between adjacent clusters is the perpendicular bisector of the segment joining their centroids, a flat hyperplane. Every cluster is therefore a convex polytope.

That’s fine when the underlying generative shapes are convex; k-means on a well-separated isotropic Gaussian mixture recovers exactly the right partition. The convexity assumption is silent in the algorithm’s specification — there’s no parameter that relaxes it — so two crescents nested into a moon shape, two concentric rings, or an elongated strip parallel to another all force k-means to slice through what a human eye reads as a single connected piece. The algorithm doesn’t fail loudly; it returns a confident-looking wrong partition.

Density modes as a clustering target

Suppose x1,,xn\mathbf{x}_1, \ldots, \mathbf{x}_n are i.i.d. samples from a density f:RdR0f : \mathbb{R}^d \to \mathbb{R}_{\geq 0}. In the modal sense (Hartigan 1975), a cluster is a connected region of Rd\mathbb{R}^d on which ff stays above some level, separated from other such regions by lower-density valleys. Each such region corresponds to a mode of ff: a local maximum where f=0\nabla f = \mathbf{0} and the Hessian is negative definite. The clustering problem reduces to two coupled subproblems — find the modes of ff, and decide which mode each data point’s steepest-ascent trajectory converges to.

We don’t have ff; we have f^h\hat f_h. So the operational problem is: find the modes of f^h\hat f_h, and trace each data point’s steepest-ascent trajectory on f^h\hat f_h to its mode. Mean-shift is the algorithm that does this, with one elegant trick — the gradient-ascent step has a closed-form update that doesn’t require explicitly computing f^h\nabla \hat f_h at each iteration. For now, the geometry: imagine the KDE surface in 2D as a hilly landscape, place a ball at each data point, let each ball roll uphill until it stops at a peak, and group together the data points whose balls stopped at the same peak. That’s the algorithm.

The signature contrast on make_moons

The cleanest demonstration is sklearn.datasets.make_moons(n_samples=200, noise=0.1, random_state=42): two interleaved crescents in R2\mathbb{R}^2 separated by a thin gap. K-means with K=2K = 2 has nowhere reasonable to put a separating hyperplane — the perpendicular-bisector boundary cuts both crescents roughly in the middle, producing two clusters whose members are spread across both true crescents. Mean-shift with moderate bandwidth follows the density: each crescent has its own high-density ridge and its own mode, and the algorithm assigns each data point to its crescent.

Quantitatively, the adjusted Rand index (ARI; Hubert and Arabie 1985) measures cluster-assignment agreement against ground truth, with ARI=1\mathrm{ARI} = 1 meaning perfect recovery and ARI=0\mathrm{ARI} = 0 meaning no better than random. On this dataset, k-means scores ARI0.22\mathrm{ARI} \approx 0.22 with a clearly-wrong linear cut. Gaussian-kernel mean-shift at h=0.40h = 0.40 scores ARI0.43\mathrm{ARI} \approx 0.43 — roughly twice as high. The mean-shift partition is shape-correct: the two clusters follow the crescent ridges, with some boundary points near the gap mis-classified into the wrong crescent at noise = 0.1. The silhouette score (Rousseeuw 1987) rewards convex compactness and so reports the opposite ordering — k-means looks “tighter,” in the wrong way. The silhouette/ARI disagreement is a useful diagnostic any time a clustering metric prefers convex partitions over the right ones.

Side-by-side k-means and mean-shift on the moons dataset with ARI and silhouette annotations
K-means versus mean-shift on the moons. K-means cuts both crescents through their middles; mean-shift follows the density ridges.

Roadmap

§2 recaps the KDE substrate f^h\hat f_h and pins down what “mode” means in Rd\mathbb{R}^d. §3 derives the mean-shift update as gradient ascent on f^h\hat f_h — the central algorithmic result. §4 proves convergence: the monotone-density-ascent theorem (Comaniciu and Meer 2002) and Cheng’s (1995) fixed-point convergence. §5 introduces the bandwidth-mode-count duality and the mode tree. §6 takes a brief tour of kernel choice and explains why it matters less than for KDE-as-density-estimator. §7 turns the trajectory map into a clustering by tracking basins of attraction and merging modes; §8 covers the from-scratch implementation, the sklearn.cluster.MeanShift pitfalls, and scaling strategies. §9 and §10 are the contrast sections: k-means as the h0h \to 0 Voronoi limit, and GMM-via-EM as the parametric soft-clustering alternative. §11 points lightly at spectral clustering and DBSCAN — proper coverage lives in graph-laplacians and random-walks. §12 closes with cross-site references, honest limits, and forward pointers.


The KDE substrate and the mode-finding picture

We pinned down in §1 that the clustering problem we’ll solve is find the modes of f^h\hat f_h and assign each data point to the mode its steepest-ascent trajectory lands at. This section makes both halves precise.

Recap of the multivariate KDE

Given i.i.d. samples x1,,xnRd\mathbf{x}_1, \ldots, \mathbf{x}_n \in \mathbb{R}^d from an unknown density ff, the kernel density estimator with kernel function K:RdR0K: \mathbb{R}^d \to \mathbb{R}_{\geq 0} and bandwidth h>0h > 0 is

f^h(x)  =  1nhdi=1nK ⁣(xxih),xRd.\hat f_h(\mathbf{x}) \;=\; \frac{1}{n h^d} \sum_{i=1}^n K\!\left(\frac{\mathbf{x} - \mathbf{x}_i}{h}\right), \qquad \mathbf{x} \in \mathbb{R}^d.

We make three simplifying choices throughout this topic. (a) The kernel is radial: K(u)=ck,dk(u2)K(\mathbf{u}) = c_{k,d}\, k(\|\mathbf{u}\|^2) for some univariate profile function kk (non-increasing on [0,)[0, \infty)) and a normalizing constant ck,dc_{k,d}. The profile function is the trick that powers the §3 derivation. (b) The bandwidth is isotropic and scalar, h(0,)h \in (0, \infty). (c) The default kernel is Gaussian, K(u)=(2π)d/2exp(12u2)K(\mathbf{u}) = (2\pi)^{-d/2} \exp(-\tfrac{1}{2}\|\mathbf{u}\|^2), with profile k(s)=exp(s/2)k(s) = \exp(-s/2) on s=u20s = \|\mathbf{u}\|^2 \geq 0.

These choices buy a clean mean-shift derivation in §3 without giving up much practical generality. Anisotropic bandwidth matrices and non-radial kernels are addressed in §5.4 and §6.4 as scope-bounding remarks.

What “mode of a multivariate density” means

Definition 1 (Mode of a smooth density).

For a smooth density g:RdR0g: \mathbb{R}^d \to \mathbb{R}_{\geq 0}, a mode is a point mRd\mathbf{m}^* \in \mathbb{R}^d at which the gradient vanishes and the Hessian is negative definite:

g(m)  =  0,2g(m)0.\nabla g(\mathbf{m}^*) \;=\; \mathbf{0}, \qquad \nabla^2 g(\mathbf{m}^*) \prec 0.

We write MhRd\mathcal{M}_h \subset \mathbb{R}^d for the set of modes of f^h\hat f_h and M(h):=MhM(h) := |\mathcal{M}_h| for the mode count.

The first condition says m\mathbf{m}^* is a stationary point; the second rules out saddles, minima, and degenerate plateaus, leaving exactly the local-maximum points. The cardinality M(h)M(h) is the number of clusters mean-shift discovers — and it depends on hh in the monotone way §5 develops.

A subtle point. Mean-shift’s convergence theory (§4) will guarantee convergence to a stationary point of f^h\hat f_h, not necessarily a mode. On data, saddle points are measure-zero attractors — almost every trajectory lands at a mode — but a pedantic statement of the algorithm’s guarantee is “stationary point,” not “mode.” We’ll handle this carefully when the theorem statements arrive in §4.

Mode-finding as the clustering target

Clustering algorithms divide cleanly into two paradigms. Partition-based methods — k-means, k-medoids, GMM-via-EM — pick a target number of clusters KK in advance and solve an optimization that assigns each point to a discrete label, minimizing some within-cluster loss. The cluster count KK is a hyperparameter, the cluster shape is implicit in the loss, and the algorithm carries no representation of the underlying density. Density-based methods — mean-shift, DBSCAN, level-set tree clustering — estimate the density first and read clusters off its structure. Cluster count is output, not input.

Mean-shift specifically defines clusters via the trajectory map Th:RdRdT_h: \mathbb{R}^d \to \mathbb{R}^d, which sends each starting point to the limit of its mean-shift iteration. Two points x,y\mathbf{x}, \mathbf{y} belong to the same cluster if and only if their trajectories converge to the same mode: Th(x)=Th(y)T_h^\infty(\mathbf{x}) = T_h^\infty(\mathbf{y}). The cluster count is M(h)=MhM(h) = |\mathcal{M}_h|, fixed by hh and the data via the KDE; no KK to pick.

This is fundamentally different from k-means’ formulation, and it’s why mean-shift is the right tool for the moons problem and k-means isn’t. The trade-off is real: mean-shift offloads the cluster-count choice into a bandwidth choice, and §5 will deal with how to pick hh well.

KDE surface and mode locations

To make the geometric picture concrete, we plot f^h\hat f_h on a fine 2D grid for make_blobs(n_samples=400, centers=[(-3,0),(0,2.5),(3,-0.5)], cluster_std=[0.4,0.6,0.8]) with Gaussian kernel and h=0.40h = 0.40, and locate the modes by brute-force grid search. With this bandwidth and these well-separated blobs the mode count agrees with the ground-truth cluster count, M(h)=3M(h) = 3.

The grid-search mode finder is the geometric stand-in for mean-shift. Grid search costs O(Ngrid)O(N_{\text{grid}}) density evaluations where NgridN_{\text{grid}} grows exponentially in dd, and it gives no per-point cluster assignment — it just locates the modes. Mean-shift will produce both the mode locations and the per-point assignments in one pass.

KDE surface with overlaid scatter and three mode markers on a 3-blob dataset
$\\hat f_h$ on the 3-blob dataset at $h = 0.40$ with the three modes located by neighborhood max-filter.

Mean-shift as gradient ascent on f^h\hat f_h

The point of §3 is to derive the mean-shift update rule and show that it’s gradient ascent on f^h\hat f_h in disguise. The derivation is worth doing in detail because three things follow from it: (a) the closed-form update doesn’t require computing f^h(x)\nabla \hat f_h(\mathbf{x}) explicitly at each iteration — the algorithmic payoff; (b) the update has a clean geometric reading as “move toward the sample-weighted mean,” motivating both the name and the §4 monotone-ascent proof; (c) the same algebraic decomposition reappears in §4, §7, and §9.

The profile-function decomposition and the gradient of f^h\hat f_h

Start from the radial-kernel form: K(u)=ck,dk(u2)K(\mathbf{u}) = c_{k,d}\, k(\|\mathbf{u}\|^2). Substituting into the KDE,

f^h(x)  =  ck,dnhdi=1nk ⁣(xxih2).\hat f_h(\mathbf{x}) \;=\; \frac{c_{k,d}}{n h^d} \sum_{i=1}^n k\!\left(\left\|\frac{\mathbf{x} - \mathbf{x}_i}{h}\right\|^2\right).

Write ui(x):=(xxi)/h2u_i(\mathbf{x}) := \|(\mathbf{x} - \mathbf{x}_i)/h\|^2. Differentiating by the chain rule, ui=(2/h2)(xxi)\nabla u_i = (2/h^2)(\mathbf{x} - \mathbf{x}_i), so

f^h(x)  =  2ck,dnhd+2i=1nk(ui(x))(xxi).\nabla \hat f_h(\mathbf{x}) \;=\; \frac{2 c_{k,d}}{n h^{d+2}} \sum_{i=1}^n k'(u_i(\mathbf{x}))\, (\mathbf{x} - \mathbf{x}_i).

Because kk is non-increasing, k0k' \leq 0. The algebra is cleaner if we work with g(s):=k(s)0g(s) := -k'(s) \geq 0 — the derived profile (or shadow profile) of Comaniciu and Meer (2002). Two examples: for the Gaussian kernel, k(s)=exp(s/2)k(s) = \exp(-s/2) gives g(s)=12exp(s/2)=12k(s)g(s) = \tfrac{1}{2}\exp(-s/2) = \tfrac{1}{2}k(s) — the Gaussian’s shadow is itself, up to a constant. For the Epanechnikov kernel, k(s)=(1s)1{s1}k(s) = (1-s)\mathbb{1}\{s \leq 1\} gives g(s)=1{s1}g(s) = \mathbb{1}\{s \leq 1\} — the shadow is the uniform kernel on the unit ball.

Substituting k=gk' = -g and flipping the displacement sign,

f^h(x)  =  2ck,dnhd+2i=1ng(ui(x))(xix).\nabla \hat f_h(\mathbf{x}) \;=\; \frac{2 c_{k,d}}{n h^{d+2}} \sum_{i=1}^n g(u_i(\mathbf{x}))\, (\mathbf{x}_i - \mathbf{x}).

The gradient identity

Split the sum and factor:

i=1ng(ui)(xix)  =  [i=1ng(ui)][ig(ui)xiig(ui)x].\sum_{i=1}^n g(u_i)\, (\mathbf{x}_i - \mathbf{x}) \;=\; \left[\sum_{i=1}^n g(u_i)\right] \left[ \frac{\sum_i g(u_i)\, \mathbf{x}_i}{\sum_i g(u_i)} - \mathbf{x} \right].

The scalar ig(ui)\sum_i g(u_i) is strictly positive whenever any sample lies inside the kernel’s support. Define the sample-weighted mean at x\mathbf{x}, the load-bearing object of the section:

mh(x)  :=  i=1ng ⁣((xxi)/h2)xii=1ng ⁣((xxi)/h2).\mathbf{m}_h(\mathbf{x}) \;:=\; \frac{\sum_{i=1}^n g\!\left(\|(\mathbf{x} - \mathbf{x}_i)/h\|^2\right)\, \mathbf{x}_i}{\sum_{i=1}^n g\!\left(\|(\mathbf{x} - \mathbf{x}_i)/h\|^2\right)}.

In words: mh(x)\mathbf{m}_h(\mathbf{x}) is a weighted average of the samples, with each xi\mathbf{x}_i weighted by how close it is to x\mathbf{x} on the kernel’s scale. Substituting back,

f^h(x)  =  2ck,dnhd+2i=1ng(ui(x))  (mh(x)x).\nabla \hat f_h(\mathbf{x}) \;=\; \frac{2 c_{k,d}}{n h^{d+2}}\sum_{i=1}^n g(u_i(\mathbf{x}))\;\bigl(\mathbf{m}_h(\mathbf{x}) - \mathbf{x}\bigr).

This is the central identity. The gradient of f^h\hat f_h at x\mathbf{x} is a strictly positive scalar multiple of the displacement from x\mathbf{x} to the sample-weighted mean. The direction of steepest ascent on f^h\hat f_h at x\mathbf{x} is the direction from x\mathbf{x} toward mh(x)\mathbf{m}_h(\mathbf{x}). We didn’t have to compute the gradient explicitly to find the ascent direction; the sample-weighted mean already encodes it.

Writing α(x):=(2ck,d/(nhd+2))ig(ui(x))>0\alpha(\mathbf{x}) := (2 c_{k,d} / (n h^{d+2})) \sum_i g(u_i(\mathbf{x})) > 0, the gradient identity reads f^h(x)=α(x)(mh(x)x)\nabla \hat f_h(\mathbf{x}) = \alpha(\mathbf{x}) (\mathbf{m}_h(\mathbf{x}) - \mathbf{x}).

The mean-shift update rule

Definition 2 (Mean-shift iteration).

Given data x1,,xnRd\mathbf{x}_1, \ldots, \mathbf{x}_n \in \mathbb{R}^d, a bandwidth h>0h > 0, a profile function kk with derived profile g=kg = -k', and a starting point x0Rd\mathbf{x}_0 \in \mathbb{R}^d, the mean-shift trajectory from x0\mathbf{x}_0 is the sequence (xt)t0(\mathbf{x}_t)_{t \geq 0} defined by

xt+1  =  mh(xt)  =  i=1ng ⁣((xtxi)/h2)xii=1ng ⁣((xtxi)/h2).\mathbf{x}_{t+1} \;=\; \mathbf{m}_h(\mathbf{x}_t) \;=\; \frac{\sum_{i=1}^n g\!\left(\|(\mathbf{x}_t - \mathbf{x}_i)/h\|^2\right)\, \mathbf{x}_i}{\sum_{i=1}^n g\!\left(\|(\mathbf{x}_t - \mathbf{x}_i)/h\|^2\right)}.

The mean-shift vector at xt\mathbf{x}_t is mh(xt)xt\mathbf{m}_h(\mathbf{x}_t) - \mathbf{x}_t.

Proposition 1 (Mean-shift as gradient ascent).

Each mean-shift iteration is a gradient-ascent step on f^h\hat f_h with adaptive step size α(xt)1\alpha(\mathbf{x}_t)^{-1}:

xt+1xt  =  mh(xt)xt  =  1α(xt)f^h(xt).\mathbf{x}_{t+1} - \mathbf{x}_t \;=\; \mathbf{m}_h(\mathbf{x}_t) - \mathbf{x}_t \;=\; \frac{1}{\alpha(\mathbf{x}_t)}\,\nabla \hat f_h(\mathbf{x}_t).
Proof.

Rearrange the §3.2 gradient identity: f^h(x)=α(x)(mh(x)x)\nabla \hat f_h(\mathbf{x}) = \alpha(\mathbf{x})(\mathbf{m}_h(\mathbf{x}) - \mathbf{x}), so mh(x)x=α(x)1f^h(x)\mathbf{m}_h(\mathbf{x}) - \mathbf{x} = \alpha(\mathbf{x})^{-1} \nabla \hat f_h(\mathbf{x}).

For the Gaussian kernel, g(s)=12exp(s/2)g(s) = \tfrac{1}{2}\exp(-s/2), the 12\tfrac{1}{2} cancels in the weighted-mean ratio and the update reads

xt+1  =  iexp(xtxi2/(2h2))xiiexp(xtxi2/(2h2)).\mathbf{x}_{t+1} \;=\; \frac{\sum_i \exp(-\|\mathbf{x}_t - \mathbf{x}_i\|^2 / (2h^2))\, \mathbf{x}_i}{\sum_i \exp(-\|\mathbf{x}_t - \mathbf{x}_i\|^2 / (2h^2))}.

This has the structural form of kernel regression’s Nadaraya–Watson estimator with the positions xi\mathbf{x}_i playing the role of the responses being averaged.

For the flat (uniform) kernel — Epanechnikov mean-shift — the update collapses to the unweighted local mean within radius hh.

Why “mean shift” — interpretation

Ascent direction. The mean-shift vector mh(x)x\mathbf{m}_h(\mathbf{x}) - \mathbf{x} points along f^h(x)\nabla \hat f_h(\mathbf{x}).

Adaptive step size. The scalar α(x)\alpha(\mathbf{x}) is large where many samples are near x\mathbf{x} (dense regions) and small where samples are sparse. The per-iteration displacement shrinks as the trajectory approaches a mode (where f^h0\nabla \hat f_h \to 0) and grows where the data is sparse. No learning rate to tune.

Log-density form. Dividing both sides of the gradient identity by f^h(x)\hat f_h(\mathbf{x}),

logf^h(x)  =  α(x)f^h(x)(mh(x)x),\nabla \log \hat f_h(\mathbf{x}) \;=\; \frac{\alpha(\mathbf{x})}{\hat f_h(\mathbf{x})}\,\bigl(\mathbf{m}_h(\mathbf{x}) - \mathbf{x}\bigr),

so the mean-shift vector is proportional to logf^h(x)\nabla \log \hat f_h(\mathbf{x}) — the form §4’s monotone-ascent argument naturally lives in.

Mean-shift trajectories on the moons KDE

The from-scratch implementation is the central code-experiment. A vectorized Gaussian-kernel mean-shift function takes data XRn×dX \in \mathbb{R}^{n \times d}, query points QRm×dQ \in \mathbb{R}^{m \times d}, and bandwidth hh, and returns the full trajectory history. The inner loop computes the m×nm \times n matrix of pairwise squared distances via scipy.spatial.distance.cdist, exponentiates to obtain the Gaussian-kernel weights, and performs the weighted-mean update by matrix multiplication.

import numpy as np
from scipy.spatial.distance import cdist

def mean_shift_trajectories(X, Q, h, max_iter=100, tol=1e-6):
    Xt = Q.copy()
    history = [Xt.copy()]
    for _ in range(max_iter):
        D2 = cdist(Xt, X, 'sqeuclidean')
        # log-domain stability: row-max subtract
        log_w = -D2 / (2 * h * h)
        log_w -= log_w.max(axis=1, keepdims=True)
        W = np.exp(log_w)
        Xt_new = (W @ X) / W.sum(axis=1, keepdims=True)
        step = np.linalg.norm(Xt_new - Xt, axis=1).max()
        Xt = Xt_new
        history.append(Xt.copy())
        if step < tol:
            break
    return np.stack(history)

The figure plots 12 trajectories on the moons KDE from a sparse grid; the two converged modes are the upper-crescent ridge and the lower-crescent ridge of f^h\hat f_h.

12 mean-shift trajectories overlaid on the moons KDE contour backdrop, converging to two distinct modes
Twelve mean-shift trajectories on the moons KDE at $h = 0.30$. Step length shrinks visibly as each trajectory approaches a mode.

Convergence theory

§3 gave us the algorithm. What §3 did not give us is a guarantee that this algorithm does anything sensible. §4 supplies the missing guarantees. The first is qualitative: each iteration weakly increases f^h\hat f_h (Comaniciu and Meer 2002, Theorem 1). The second is sharper: the trajectory converges to a stationary point of f^h\hat f_h in the limit (Cheng 1995, Theorem 1).

We continue to assume the radial-kernel form with k:[0,)[0,)k: [0, \infty) \to [0, \infty) non-increasing, differentiable, and convex. The Gaussian profile k(s)=exp(s/2)k(s) = \exp(-s/2) is strictly convex; the Epanechnikov profile is linear inside its support and weakly convex.

The monotone-density-ascent theorem

Theorem 1 (Monotone Density Ascent (Comaniciu–Meer 2002, Thm 1)).

Let KK be a radial kernel with non-increasing, differentiable, convex profile kk, and let f^h\hat f_h be the corresponding KDE on data x1,,xnRd\mathbf{x}_1, \ldots, \mathbf{x}_n \in \mathbb{R}^d. For any starting point x0Rd\mathbf{x}_0 \in \mathbb{R}^d at which ig(ui(x0))>0\sum_i g(u_i(\mathbf{x}_0)) > 0, the mean-shift iteration xt+1=mh(xt)\mathbf{x}_{t+1} = \mathbf{m}_h(\mathbf{x}_t) satisfies

f^h(xt+1)f^h(xt)    ck,dnhd+2xt+1xt2i=1ng(ui(xt))    0,\hat f_h(\mathbf{x}_{t+1}) - \hat f_h(\mathbf{x}_t) \;\geq\; \frac{c_{k,d}}{n h^{d+2}} \, \|\mathbf{x}_{t+1} - \mathbf{x}_t\|^2 \, \sum_{i=1}^n g(u_i(\mathbf{x}_t)) \;\geq\; 0,

with equality if and only if xt+1=xt\mathbf{x}_{t+1} = \mathbf{x}_t.

The bound is sharper than just ”f^h\hat f_h is non-decreasing”: the density gain is at least quadratic in the step size.

Proof.

Move 1 — convexity inequality. Because kk is convex and differentiable, the supporting-line inequality k(b)k(a)+k(a)(ba)k(b) \geq k(a) + k'(a)(b - a) holds for all a,b0a, b \geq 0. Set a=ui(xt)a = u_i(\mathbf{x}_t) and b=ui(xt+1)b = u_i(\mathbf{x}_{t+1}) for each ii, and substitute k(a)=g(a)k'(a) = -g(a):

k(ui(xt+1))k(ui(xt))    g(ui(xt))(ui(xt+1)ui(xt)).k(u_i(\mathbf{x}_{t+1})) - k(u_i(\mathbf{x}_t)) \;\geq\; -g(u_i(\mathbf{x}_t)) \bigl(u_i(\mathbf{x}_{t+1}) - u_i(\mathbf{x}_t)\bigr).

Move 2 — chain across samples. Multiply by ck,d/(nhd)c_{k,d}/(n h^d) and sum. The LHS telescopes into f^h(xt+1)f^h(xt)\hat f_h(\mathbf{x}_{t+1}) - \hat f_h(\mathbf{x}_t):

f^h(xt+1)f^h(xt)    ck,dnhdi=1ng(ui(xt))(ui(xt+1)ui(xt)).\hat f_h(\mathbf{x}_{t+1}) - \hat f_h(\mathbf{x}_t) \;\geq\; -\frac{c_{k,d}}{n h^d} \sum_{i=1}^n g(u_i(\mathbf{x}_t)) \bigl(u_i(\mathbf{x}_{t+1}) - u_i(\mathbf{x}_t)\bigr).

Move 3 — expand the difference of squared distances. Write ui(x)=xxi2/h2u_i(\mathbf{x}) = \|\mathbf{x} - \mathbf{x}_i\|^2 / h^2 and expand xt+12xt22(xt+1xt)xi\|\mathbf{x}_{t+1}\|^2 - \|\mathbf{x}_t\|^2 - 2(\mathbf{x}_{t+1} - \mathbf{x}_t)^\top \mathbf{x}_i. Sum over ii and pull out the xi\mathbf{x}_i-independent terms, with St:=ig(ui(xt))S_t := \sum_i g(u_i(\mathbf{x}_t)):

i=1ng(ui)(ui(xt+1)ui(xt))  =  1h2[(xt+12xt2)St2(xt+1xt)ig(ui)xi].\sum_{i=1}^n g(u_i)\bigl(u_i(\mathbf{x}_{t+1}) - u_i(\mathbf{x}_t)\bigr) \;=\; \frac{1}{h^2}\Bigl[\bigl(\|\mathbf{x}_{t+1}\|^2 - \|\mathbf{x}_t\|^2\bigr)\, S_t - 2(\mathbf{x}_{t+1} - \mathbf{x}_t)^\top \sum_i g(u_i)\, \mathbf{x}_i\Bigr].

The mean-shift definition gives ig(ui)xi=Stxt+1\sum_i g(u_i)\, \mathbf{x}_i = S_t\, \mathbf{x}_{t+1}, so the sum becomes Sth2[xt+12xt22(xt+1xt)xt+1]\frac{S_t}{h^2}\bigl[\|\mathbf{x}_{t+1}\|^2 - \|\mathbf{x}_t\|^2 - 2(\mathbf{x}_{t+1} - \mathbf{x}_t)^\top \mathbf{x}_{t+1}\bigr].

Move 4 — recognize the squared step size. Expand the bracket:

xt+12xt22(xt+1xt)xt+1  =  (xt+122xtxt+1+xt2)  =  xt+1xt2.\|\mathbf{x}_{t+1}\|^2 - \|\mathbf{x}_t\|^2 - 2(\mathbf{x}_{t+1} - \mathbf{x}_t)^\top \mathbf{x}_{t+1} \;=\; -\bigl(\|\mathbf{x}_{t+1}\|^2 - 2 \mathbf{x}_t^\top \mathbf{x}_{t+1} + \|\mathbf{x}_t\|^2\bigr) \;=\; -\|\mathbf{x}_{t+1} - \mathbf{x}_t\|^2.

So ig(ui)(ui(xt+1)ui(xt))=(St/h2)xt+1xt2\sum_i g(u_i)(u_i(\mathbf{x}_{t+1}) - u_i(\mathbf{x}_t)) = -(S_t/h^2)\,\|\mathbf{x}_{t+1} - \mathbf{x}_t\|^2. Plugging back into the Move 2 inequality, the two minus signs cancel:

f^h(xt+1)f^h(xt)    ck,dnhd+2Stxt+1xt2.\hat f_h(\mathbf{x}_{t+1}) - \hat f_h(\mathbf{x}_t) \;\geq\; \frac{c_{k,d}}{n h^{d+2}} \, S_t \, \|\mathbf{x}_{t+1} - \mathbf{x}_t\|^2.

This is the claimed inequality.

Stationarity characterization

A fixed point of the mean-shift map is a point x\mathbf{x}^* with mh(x)=x\mathbf{m}_h(\mathbf{x}^*) = \mathbf{x}^*.

Proposition 2 (Stationarity characterization).

At any x\mathbf{x}^* with ig(ui(x))>0\sum_i g(u_i(\mathbf{x}^*)) > 0,

mh(x)=x        f^h(x)=0.\mathbf{m}_h(\mathbf{x}^*) = \mathbf{x}^* \;\iff\; \nabla \hat f_h(\mathbf{x}^*) = \mathbf{0}.
Proof.

The gradient identity from §3.2 reads f^h(x)=α(x)(mh(x)x)\nabla \hat f_h(\mathbf{x}^*) = \alpha(\mathbf{x}^*)(\mathbf{m}_h(\mathbf{x}^*) - \mathbf{x}^*), with α(x)>0\alpha(\mathbf{x}^*) > 0.

The fixed-point set of mh\mathbf{m}_h equals the set of stationary points of f^h\hat f_h, which includes modes (local maxima), saddles, and minima. The monotone-ascent property rules out convergence to a strict local minimum. Saddles are measure-zero attractors — almost every starting point converges to a mode.

Convergence to a stationary point

Theorem 2 (Trajectory convergence (Cheng 1995, Thm 1)).

Under the Theorem 1 assumptions, and assuming additionally that f^h\hat f_h has only isolated stationary points (a generic-position assumption that holds almost surely for data with continuous distribution), the mean-shift trajectory (xt)t0(\mathbf{x}_t)_{t \geq 0} from any x0Rd\mathbf{x}_0 \in \mathbb{R}^d converges to a stationary point x\mathbf{x}^* of f^h\hat f_h:

limtxt  =  xwithf^h(x)=0.\lim_{t \to \infty} \mathbf{x}_t \;=\; \mathbf{x}^* \quad \text{with} \quad \nabla \hat f_h(\mathbf{x}^*) = \mathbf{0}.
Proof.

Step 1 — the trajectory is bounded. For the Gaussian kernel, each iterate xt+1=mh(xt)\mathbf{x}_{t+1} = \mathbf{m}_h(\mathbf{x}_t) is a convex combination of the data points (the weights g(ui(xt))/Stg(u_i(\mathbf{x}_t))/S_t are non-negative and sum to one). So xt+1conv({x1,,xn})\mathbf{x}_{t+1} \in \mathrm{conv}(\{\mathbf{x}_1, \ldots, \mathbf{x}_n\}) for every t0t \geq 0, and the trajectory lies in the bounded convex hull of the data.

Step 2 — the density values converge. By Theorem 1, f^h(xt)\hat f_h(\mathbf{x}_t) is monotone non-decreasing; bounded above by f^hck,dk(0)/hd\hat f_h \leq c_{k,d}\, k(0)/h^d. A monotone bounded sequence converges, so f^h(xt)L\hat f_h(\mathbf{x}_t) \to L.

Step 3 — step sizes are square-summable, hence vanishing. Sum the Theorem 1 bound over t=0,,T1t = 0, \ldots, T-1:

f^h(xT)f^h(x0)    ck,dnhd+2t=0T1Stxt+1xt2.\hat f_h(\mathbf{x}_T) - \hat f_h(\mathbf{x}_0) \;\geq\; \frac{c_{k,d}}{n h^{d+2}} \sum_{t=0}^{T-1} S_t\, \|\mathbf{x}_{t+1} - \mathbf{x}_t\|^2.

Taking TT \to \infty, the LHS is bounded by Lf^h(x0)L - \hat f_h(\mathbf{x}_0), so tStxt+1xt2<\sum_t S_t \|\mathbf{x}_{t+1} - \mathbf{x}_t\|^2 < \infty. On the bounded set, StS_t is bounded below by some Smin>0S_{\min} > 0 (continuity and positivity of gg). Dividing,

t=0xt+1xt2  <  ,soxt+1xt0.\sum_{t=0}^\infty \|\mathbf{x}_{t+1} - \mathbf{x}_t\|^2 \;<\; \infty, \qquad \text{so} \qquad \|\mathbf{x}_{t+1} - \mathbf{x}_t\| \to 0.

Step 4 — accumulation points are stationary; isolation forces convergence. Let x\mathbf{x}^* be an accumulation point (Bolzano–Weierstrass) and (xtk)(\mathbf{x}_{t_k}) a convergent subsequence. By continuity of mh\mathbf{m}_h and step-vanishing,

mh(x)  =  limkmh(xtk)  =  limkxtk+1  =  limkxtk  =  x.\mathbf{m}_h(\mathbf{x}^*) \;=\; \lim_k \mathbf{m}_h(\mathbf{x}_{t_k}) \;=\; \lim_k \mathbf{x}_{t_k+1} \;=\; \lim_k \mathbf{x}_{t_k} \;=\; \mathbf{x}^*.

Every accumulation point is a fixed point of mh\mathbf{m}_h — a stationary point of f^h\hat f_h. The accumulation-point set AA is discrete (isolated-stationary-points assumption). Choose δ>0\delta > 0 smaller than half the minimum pairwise distance in AA. Eventually xt+1xt<δ\|\mathbf{x}_{t+1} - \mathbf{x}_t\| < \delta for all large tt; the trajectory cannot jump between different δ\delta-balls, forcing convergence to some xA\mathbf{x}^* \in A.

Step-size bounds and the local-curvature scale

Near a mode, the rate is linear in the favorable case. Linearizing f^h\nabla \hat f_h around a mode x\mathbf{x}^*, f^h(x)H(x)(xx)\nabla \hat f_h(\mathbf{x}) \approx \mathbf{H}(\mathbf{x}^*)(\mathbf{x} - \mathbf{x}^*) where H(x):=2f^h(x)0\mathbf{H}(\mathbf{x}^*) := \nabla^2 \hat f_h(\mathbf{x}^*) \prec 0. The §3.3 Proposition gives

xt+1x    (I+1α(x)H(x))(xtx).\mathbf{x}_{t+1} - \mathbf{x}^* \;\approx\; \Bigl(\mathbf{I} + \tfrac{1}{\alpha(\mathbf{x}^*)} \mathbf{H}(\mathbf{x}^*)\Bigr)(\mathbf{x}_t - \mathbf{x}^*).

The local contraction factor is the operator norm of I+(1/α(x))H(x)\mathbf{I} + (1/\alpha(\mathbf{x}^*))\mathbf{H}(\mathbf{x}^*). Eigenvalues lie in (1,1)(-1, 1) — giving linear contraction — when λi(H(x))<2α(x)|\lambda_i(\mathbf{H}(\mathbf{x}^*))| < 2\alpha(\mathbf{x}^*). For Gaussian-kernel mean-shift this is automatic near typical modes.

Numerical verification

We verify the two empirical predictions: (a) f^h(xt)\hat f_h(\mathbf{x}_t) is monotone non-decreasing; (b) xt+1xt0\|\mathbf{x}_{t+1} - \mathbf{x}_t\| \to 0. We run mean-shift from B=100B = 100 random starting points in [5,5]×[3,5][-5, 5] \times [-3, 5] on the 3-blob KDE at h=0.65h = 0.65 (the plateau bandwidth in our reproduced data). For each trajectory we record f^h(xt)\hat f_h(\mathbf{x}_t) and xt+1xt\|\mathbf{x}_{t+1} - \mathbf{x}_t\| at every iteration. The left panel below shows monotone-non-decreasing density curves with zero violations; the right panel shows geometrically-decaying step sizes (straight lines in log scale), consistent with the linear-contraction sketch.

Two-panel convergence diagnostic showing monotone density curves and geometric step-size decay
Convergence diagnostic on 100 random trajectories. Left: every trajectory's density curve is monotone non-decreasing (Theorem 1 visible). Right: step sizes decay geometrically (linear contraction).

Bandwidth, resolution, and the mode tree

§§3–4 took hh as given. §5 asks how the answer depends on hh. The center-of-gravity result is the bandwidth-mode-count duality — as hh increases, M(h)M(h) is non-increasing, with modes merging but never splitting. The duality has a clean proof in d=1d = 1 from Silverman (1981) via the heat-equation reformulation; the multivariate version, due to Chacón (2015) and Chacón and Duong (2018), follows the same convolution-semigroup picture and holds in a generic-position sense.

Bandwidth-as-resolution

The KDE is a convolution: f^h(x)=(μ^ϕh)(x)\hat f_h(\mathbf{x}) = (\hat\mu * \phi_h)(\mathbf{x}), where μ^=1niδxi\hat\mu = \tfrac{1}{n}\sum_i \delta_{\mathbf{x}_i} is the empirical measure and ϕh\phi_h is the Gaussian density with isotropic scalar bandwidth hh. The bandwidth is the scale at which we look at the data.

The bandwidth’s role in density estimation: small hh has low bias and high variance, large hh has high bias and low variance, with AMISE-optimal rate hn1/(d+4)h^* \asymp n^{-1/(d+4)}. The clustering picture is the same trade-off through a different loss — we care about mode structure. Small hh over-clusters (M(h)nM(h) \to n); large hh under-clusters (M(h)1M(h) \to 1). The “right” hh is the one whose mode structure matches the cluster structure.

The Chacón–Duong bandwidth-mode-count duality

Theorem 3 (Bandwidth-mode-count duality).

Let f^h\hat f_h be the Gaussian-kernel KDE with isotropic scalar bandwidth h>0h > 0 on data x1,,xnRd\mathbf{x}_1, \ldots, \mathbf{x}_n \in \mathbb{R}^d with distinct points, and let M(h)M(h) be its mode count. For d=1d = 1 the function hM(h)h \mapsto M(h) is non-increasing on (0,)(0, \infty) (Silverman 1981, Theorem 2). For d2d \geq 2, the same conclusion holds generically — for almost every data configuration and almost every hh, modes can merge as hh grows but cannot split (Chacón 2015; Chacón and Duong 2018, Ch. 7).

Proof.

Step 1 — convolution semigroup. Gaussian densities form a semigroup under convolution: if σ1,σ2>0\sigma_1, \sigma_2 > 0 then ϕσ1ϕσ2=ϕσ12+σ22\phi_{\sigma_1} * \phi_{\sigma_2} = \phi_{\sqrt{\sigma_1^2 + \sigma_2^2}} (probabilistic proof: independent N(0,σi2I)N(\mathbf{0}, \sigma_i^2 \mathbf{I}) Gaussians sum to N(0,(σ12+σ22)I)N(\mathbf{0}, (\sigma_1^2 + \sigma_2^2)\mathbf{I})). For any 0<h1<h20 < h_1 < h_2 and σ:=h22h12\sigma := \sqrt{h_2^2 - h_1^2},

ϕh2=ϕh1ϕσ    f^h2=μ^ϕh2=(μ^ϕh1)ϕσ=f^h1ϕσ.\phi_{h_2} = \phi_{h_1} * \phi_\sigma \;\Longrightarrow\; \hat f_{h_2} = \hat\mu * \phi_{h_2} = (\hat\mu * \phi_{h_1}) * \phi_\sigma = \hat f_{h_1} * \phi_\sigma.

Step 2 — heat-equation reformulation. Define v(x,t):=(f^h1ϕt)(x)v(\mathbf{x}, t) := (\hat f_{h_1} * \phi_{\sqrt{t}})(\mathbf{x}). Then vv satisfies the heat equation tv=12Δv\partial_t v = \tfrac{1}{2}\Delta v with v(x,0)=f^h1(x)v(\mathbf{x}, 0) = \hat f_{h_1}(\mathbf{x}). Setting t=h22h12t = h_2^2 - h_1^2 identifies v(x,t)=f^h12+t(x)v(\mathbf{x}, t) = \hat f_{\sqrt{h_1^2 + t}}(\mathbf{x}).

Step 3 — critical-point monotonicity in d=1d = 1. Apply the heat operator to xv\partial_x v, which also solves the heat equation. Sturm’s lacunary theorem (Sturm 1836; modern proofs in Matano 1982 and Angenent 1988) says: if w(x,t)w(x, t) solves the 1D heat equation and w(,0)w(\cdot, 0) has finitely many zeros, then w(,t)w(\cdot, t) has finitely many zeros for all t>0t > 0 and the count is non-increasing in tt. The zeros of xv(,t)\partial_x v(\cdot, t) are the stationary points of v(,t)v(\cdot, t), so the stationary-point count of f^h\hat f_h is non-increasing in hh.

Modes are strict local maxima — a subset of stationary points. In 1D with f^h0\hat f_h \to 0 at ±\pm \infty, stationary points alternate max–min–max–… starting and ending with a max. Stationary points disappear only in min–max pairs at degenerate critical points, each annihilation removing one min and one max. The mode count is exactly half the stationary-point count (plus a half) and is non-increasing.

Step 4 — multivariate generalization. In d2d \geq 2, Chacón (2015) and Chacón and Duong (2018, Ch. 7) carry out the Morse-theoretic analysis: generically (Sard’s theorem), all critical points of f^h\hat f_h are non-degenerate, and as hh grows, critical points disappear only in max–saddle or min–saddle annihilations. Pathological data configurations exist in d2d \geq 2 where the strong statement fails on a measure-zero set of bandwidths; they don’t occur for data sampled from a continuous distribution.

Remark.

Known counterexamples in d2d \geq 2 are tuned configurations where two well-separated modes briefly coalesce at one bandwidth, then re-separate as a third mode pinches off at a slightly larger bandwidth. Measure-zero in data-space; Carreira-Perpiñán (2015, §3.2) discusses the literature. The numerical experiment in §5.5 will show small blips in M(h)M(h) on the moons that are a finite-precision shadow of this caveat — they should not be smoothed away.

The mode tree

The bandwidth-indexed family {Mh}\{\mathcal{M}_h\} traces a mode tree — a dendrogram whose leaves are fine-scale modes at small hh and whose internal nodes are merge events, with root at large hh where M(h)=1M(h) = 1. A flat region of the tree (long horizontal segment with no merges) corresponds to a stable bandwidth range over which the cluster assignment doesn’t change. The cluster count at this flat region is the natural cluster count — the scree-criterion principle of §5.4.

Mode-tree algorithms: Minnotte and Scott (1993) introduced the 1D version; Klemelä (2009) carries the multivariate version.

Selectors in practice

Four selectors are in use.

Silverman’s rule of thumb.

hS  =  σ^(4(d+2)n)1/(d+4),h_S \;=\; \hat\sigma \left(\frac{4}{(d+2)\, n}\right)^{1/(d+4)},

with σ^\hat\sigma a marginal-spread estimate. Cheap, often within a factor of two of optimal, easy to nudge upward to find the plateau.

Plug-in selectors (Sheather–Jones 1991; Wand and Jones 1994 for d2d \geq 2) estimate AMISE-optimal hh from the data. Excellent for density estimation; tend to under-smooth for clustering.

LSCV (Rudemo 1982; Bowman 1984; Scott and Terrell 1987): minimize J(h)=f^h2(2/n)if^h,i(xi)J(h) = \int \hat f_h^2 - (2/n)\sum_i \hat f_{h, -i}(\mathbf{x}_i). Consistent (Stone 1984), high-variance.

Scree criterion on the mode tree (Carreira-Perpiñán 2015, §4.1). Compute M(h)M(h) on a log-spaced grid, identify the longest plateau; the cluster count at the plateau is the natural cluster count.

For clustering, the workflow: start with Silverman’s hSh_S, sweep around it, find the plateau, report the cluster assignment at the plateau midpoint.

Numerical experiment: M(h)M(h) on the moons and 4-blob datasets

We sweep hh across 50 log-spaced values in [0.05,2.0][0.05, 2.0] on the moons (n=200n = 200, ground-truth K=2K = 2) and a varied-std 4-blob (n=400n = 400, ground-truth K=4K = 4). At each hh, run from-scratch mean-shift starting at every data point, count distinct converged modes. The 4-blob curve is cleanly monotone with a wide M=4M = 4 plateau; the moons curve descends monotonically with a narrow M=2M = 2 plateau followed by small blips back up before collapsing to M=1M = 1 — the multivariate-Silverman caveat from the Theorem 3 remark made empirically visible. The blips are not failures of the duality; they make the moons a cautionary example for naive scree-criterion bandwidth selection.

Two-panel figure showing M(h) step curve with shaded plateaus and a linked scatter at the selected bandwidth
$M(h)$ sweep with the M=2 plateau highlighted. Below the plateau, the moons over-cluster; above it, both crescents collapse into one.

Kernel-function choice

The §3.2 derivation went through for any radial kernel with non-increasing differentiable convex profile, so the mean-shift algorithm and its convergence theory are unchanged across kernels. What changes is the weighting function g=kg = -k' in the sample-weighted mean, and as a result the local curvature of f^h\hat f_h near each mode. The cluster assignment in the plateau regime is remarkably robust to those local changes.

The standard zoo

KernelProfile k(s)k(s)Shadow g(s)g(s)Support
Gaussianes/2e^{-s/2}12es/2\tfrac{1}{2} e^{-s/2}[0,)[0, \infty)
Epanechnikov(1s)1{s1}(1-s)\,\mathbb{1}\{s \leq 1\}1{s1}\mathbb{1}\{s \leq 1\}[0,1][0, 1]
Biweight (quartic)(1s)21{s1}(1-s)^2\,\mathbb{1}\{s \leq 1\}2(1s)1{s1}2(1-s)\,\mathbb{1}\{s \leq 1\}[0,1][0, 1]
Triweight(1s)31{s1}(1-s)^3\,\mathbb{1}\{s \leq 1\}3(1s)21{s1}3(1-s)^2\,\mathbb{1}\{s \leq 1\}[0,1][0, 1]
Uniform (box)1{s1}\mathbb{1}\{s \leq 1\}δ(s1)\delta(s - 1)[0,1][0, 1]

The uniform kernel’s profile is discontinuous; “uniform-kernel mean-shift” is shorthand for Epanechnikov-shadow mean-shift (Cheng 1995, §3) — both use g(s)=1{s1}g(s) = \mathbb{1}\{s \leq 1\}.

By Hodges and Lehmann’s (1956) efficiency comparison, kernel efficiencies fall in a tight band — Gaussian 95%, biweight 99%, triweight 99%, uniform 93% — smaller than a 5% bandwidth perturbation.

Flat-kernel mean-shift

Epanechnikov shadow g(s)=1{s1}g(s) = \mathbb{1}\{s \leq 1\} gives

xt+1  =  1Nh(xt)xiNh(xt)xi,Nh(xt):={xi:xtxih}.\mathbf{x}_{t+1} \;=\; \frac{1}{|\mathcal{N}_h(\mathbf{x}_t)|}\sum_{\mathbf{x}_i \in \mathcal{N}_h(\mathbf{x}_t)} \mathbf{x}_i, \qquad \mathcal{N}_h(\mathbf{x}_t) := \{\mathbf{x}_i : \|\mathbf{x}_t - \mathbf{x}_i\| \leq h\}.

Replace xt\mathbf{x}_t by the arithmetic mean of data points within radius hh. Most computationally tractable variant; the version most users carry in their head. Two practical wrinkles: empty neighborhood (stop the trajectory; locally expand hh; or seed inside the convex hull) and boundary non-smoothness of f^h\hat f_h (the C0C^0 Epanechnikov gives jump discontinuities in f^h\nabla \hat f_h at distance exactly hh from each sample — measure-zero, harmless for mean-shift).

Why kernel choice rarely moves the cluster assignment

The basin-of-attraction structure is determined by the topology of f^h\hat f_h. At bandwidth in the §5 plateau, f^h\hat f_h is locally smoothed enough that the global mode structure is robust to kernel-shape changes. The cluster assignment is invariant up to label permutation across kernel choice.

Two reasons. Smoothing-stable mode count: §5.2’s duality holds for any convex-profile radial kernel; the plateau corresponds to a bandwidth range with stable mode count regardless of small perturbations including kernel shape. Basin-boundary stability: basin boundaries are determined by saddle-ridge structure, stable under the same perturbations. Carreira-Perpiñán (2015, §3.3): “The choice of kernel function has very little effect on the clusters produced by mean-shift, given a sensible bandwidth choice.”

In sharp contrast to KDE-as-density-estimator, where kernel sensitivity at the AMISE-optimal level is real (capped at ~5% by Hodges–Lehmann efficiency). For cluster assignments, sensitivity is essentially zero in the plateau regime.

Four side-by-side scatter panels showing mean-shift cluster assignments with Gaussian, Epanechnikov, biweight, and triweight kernels
Four kernels at the same plateau bandwidth on the moons. Cluster assignments are nearly identical; ARI versus Gaussian is at most 0.04 different.

When kernel choice does matter

Heavy-tailed data, far from the bulk. Gaussian’s exponentially decaying tails give distant points non-zero weight; compact-support kernels give zero. Can shift mode locations near data extremes by amounts crossing cluster boundaries.

Computational cost. Gaussian needs all nn data points per query — O(n)O(n) per iteration. Compact-support kernels accelerate to O(logn+k)O(\log n + k) via kd-tree / ball-tree (§8.3). For large nn, order-of-magnitude difference.

Smoothness of f^h\hat f_h. Gaussian CC^\infty; Epanechnikov C0C^0; biweight C1C^1; triweight C2C^2. For mean-shift itself it doesn’t matter; for downstream applications using f^h\hat f_h beyond cluster assignment (Hessian estimation, level-set trees), pick a smooth profile.

Recommendation: Gaussian by default; switch to Epanechnikov if computational cost forces it.


From trajectories to clusters

§3.5 showed individual trajectories converging. §5.5 used trajectory endpoints as cluster proxies. §7 makes both moves formal.

The trajectory map and basin-of-attraction clustering

Definition 3 (Trajectory map and basin).

The trajectory map is

Th:RdMh,Th(x):=limtxt    where    x0=x.T_h: \mathbb{R}^d \to \mathcal{M}_h, \qquad T_h(\mathbf{x}) := \lim_{t \to \infty} \mathbf{x}_t \;\;\text{where}\;\; \mathbf{x}_0 = \mathbf{x}.

The basin of attraction of mode m\mathbf{m}^* is B(m):=Th1(m)B(\mathbf{m}^*) := T_h^{-1}(\mathbf{m}^*). The basins partition Rd\mathbb{R}^d (almost everywhere); the partition is the basin-of-attraction clustering induced by f^h\hat f_h.

Two data points are in the same cluster iff their trajectories converge to the same mode. The map ThT_h defines a spatial clustering — every point in Rd\mathbb{R}^d has a cluster label, not just the data points. Any new xnew\mathbf{x}_{\text{new}} is classified by running a fresh trajectory.

Basin boundaries are the (d1)(d-1)-dimensional saddle ridges of f^h\hat f_h.

Mode merging with tolerance

In finite precision, two trajectories that should converge to the same mode land at endpoints separated by a small distance. The mode-extraction step needs a merging tolerance δ\delta.

Greedy-union rule (used throughout): walk through endpoints, attach each to an existing mode if within δ\delta, else create a new mode. Heuristics for δ\delta: δ10tol\delta \approx 10 \cdot \text{tol} (safety margin) or δ0.1h\delta \approx 0.1 \cdot h (well below inter-mode separation). The from-scratch implementation uses δ=103\delta = 10^{-3}.

The mode-merging tolerance is a numerical parameter, not a clustering hyperparameter; cluster assignment is invariant to δ\delta over a wide range.

Blurring mean-shift

Cheng (1995, §4) introduces a variant where data points themselves move at each iteration:

xi(t+1)  =  jg((xi(t)xj(t))/h2)xj(t)jg((xi(t)xj(t))/h2).\mathbf{x}_i^{(t+1)} \;=\; \frac{\sum_j g\bigl(\|(\mathbf{x}_i^{(t)} - \mathbf{x}_j^{(t)})/h\|^2\bigr)\, \mathbf{x}_j^{(t)}}{\sum_j g\bigl(\|(\mathbf{x}_i^{(t)} - \mathbf{x}_j^{(t)})/h\|^2\bigr)}.

The KDE is rebuilt at each iteration. Converges much faster — data collapses to modes in geometric time — but Cheng (1995, Theorem 2) shows long-time blurring collapses to a single point regardless of hh. The practical algorithm runs only a small number of iterations.

Standard (non-blurring) mean-shift is the default: respects the mode-tree structure and the §4 convergence theory.

Initialization

Sample-as-queries — initialize at the data points. Standard for clustering data points.

Grid-as-queries — initialize on a regular grid covering the data extent. Produces the basin-of-attraction map for Rd\mathbb{R}^d (not just the data).

Adaptive initialization — only at “promising” locations (high-f^h\hat f_h regions, coarse-pass modes). For n104n \leq 10^4 in 2D, savings are modest; for very large nn in higher dd, substantial.

The basin-of-attraction map on the moons

80 × 80 = 6400 trajectories from a grid over the moons range, Gaussian kernel at h=0.40h = 0.40, deduped at tolerance 10310^{-3}. Endpoints aligned to data-trajectory modes for consistent label coloring between basin map and scatter. The basin boundary appears as a smooth curve threading between the two crescents, far from either’s bulk.

Filled raster of basin labels on the moons with the saddle-ridge boundary visible as a smooth curve
Basin-of-attraction map at 80×80 grid resolution. Every point in $\\mathbb{R}^2$ has a cluster label — clustering is a spatial property of $\\hat f_h$.

Practical mean-shift

§§3–7 covered the algorithm and theory. §8 is about what happens when you actually want to run mean-shift on real data.

Design choices in the from-scratch implementation

The mean_shift_trajectories function from §3.5 makes four design choices.

Vectorized pairwise distance. scipy.spatial.distance.cdist for the m×nm \times n squared-distance matrix per iteration — O(mn)O(m n) in compiled C. Memory 8mn8 m n bytes (float64), fine for m,n104m, n \leq 10^4. At larger scale, batch queries in chunks of 103\sim 10^3.

Log-domain stability. Per-row max-subtract before exponentiating the Gaussian-kernel weights — the subtracted constant cancels in the weighted-mean ratio. Same trick as softmax.

Vectorized convergence. Stop when the maximum per-query step falls below tol. Per-query stopping is more efficient but harder to vectorize.

Full trajectory history returned. (T+1,m,d)(T+1, m, d) array — more memory than necessary if you only need final positions, but needed for diagnostic plots and basin visualizations. At our scale (T100T \leq 100, md104m d \leq 10^4) it’s 10\leq 10 MB.

A production implementation would tighten all four. The notebook implementation is optimized for pedagogical clarity, not throughput.

sklearn.cluster.MeanShift pitfalls

Three quirks worth knowing.

The estimate_bandwidth default. MeanShift() with bandwidth=None calls estimate_bandwidth(X) internally, which is O(n2)O(n^2) in time and memory. For n=104n = 10^4: 10810^8 distance computations, ~800 MB intermediate memory. For n=105n = 10^5: 101010^{10} computations, 80 GB. Always pass an explicit bandwidth=h chosen via §5.4’s selectors.

The internal kernel is flat, not Gaussian. Uses Epanechnikov-shadow update (unweighted mean inside bandwidth ball). Not user-selectable. §6.3 already showed kernel-invariance in the plateau regime; this is rarely a correctness issue but is the source of rare ARI < 1.0 disagreement on boundary points.

Orphan points get label -1. Trajectories not reaching a mode within max_iter (default 300) iterations are labeled -1. Rare at reasonable bandwidth, common at adversarial small hh.

bin_seeding=True initializes only at coarse-grid bin centers — substantial speedup for large data.

Scaling: kd-tree and ball-tree neighbor queries

For compact-support kernels, only Nh(x)\mathcal{N}_h(\mathbf{x}) contributes to mh(x)\mathbf{m}_h(\mathbf{x}). Brute force is O(n)O(n) per query; kd-tree (Bentley 1975) or ball-tree (Omohundro 1989) is O(logn+k)O(\log n + k) where kk is the average neighborhood size. Build time O(nlogn)O(n \log n).

At n=105n = 10^5 in R2\mathbb{R}^2 with Epanechnikov: brute force 101010^{10} distance computations per iteration; kd-tree 1.7×1061.7 \times 10^6 neighbor lookups. Three orders of magnitude per iteration.

For Gaussian, truncate: ignore xxi>ch\|\mathbf{x} - \mathbf{x}_i\| > c \cdot h for c[3,4]c \in [3, 4] (captures 99.7% of Gaussian mass). Truncated Gaussian admits the same kd-tree acceleration.

scipy.spatial.cKDTree and sklearn.neighbors.BallTree provide tree implementations. Kd-tree for d3d \leq 3; ball-tree for d6d \geq 6; benchmark in between.

The sample-and-cluster pattern

For n104n \gg 10^4, the sample-and-cluster pattern (Carreira-Perpiñán 2015, §6.2) trades small approximation error for order-of-magnitude speedup:

  1. Sample mnm \ll n seed points.
  2. Run mean-shift trajectories from seeds; dedupe to mode set.
  3. Assign remaining data via nearest-mode kd-tree query.

Total cost O(mTlogm)+O(nlogm)O(m T \log m) + O(n \log m). With m=nm = \sqrt{n}, dominant term O(nlogn)O(n \log n).

Approximation error from sub-sampling: low-density modes may be missed. Harmless for typical clustering; mitigated by stratified sampling for anomaly localization.

Sanity check: from-scratch vs sklearn

Run both implementations on the moons, 3-blob, and 4-blob at plateau bandwidths, with the from-scratch in flat-kernel mode for like-for-like comparison against sklearn’s internal Epanechnikov shadow. Expect mode counts to agree dataset-by-dataset and ARIs in [0.95,1.00][0.95, 1.00] — small disagreement on boundary points is normal.

Three side-by-side scatter panels comparing from-scratch and sklearn mean-shift assignments on three datasets
From-scratch flat-kernel mean-shift versus sklearn.cluster.MeanShift. Mode counts agree exactly; ARI ≥ 0.97 across the three datasets.

k-means as the h0h \to 0 Voronoi limit

K-means and mean-shift cluster the same data with different parameters and produce different partitions on anything non-Gaussian. §9 makes the relationship precise. Gaussian-kernel mean-shift in the h0h \to 0 limit collapses onto Lloyd’s hard nearest-mean assignment.

Lloyd’s algorithm and the Voronoi-partition fixed point

Lloyd’s algorithm (Lloyd 1982; MacQueen 1967 historical antecedent) picks KK centroids (typically via k-means++ seeding, Arthur and Vassilvitskii 2007), then alternates:

  1. Assignment. c(xi)=argminkxiμkc(\mathbf{x}_i) = \arg\min_k \|\mathbf{x}_i - \boldsymbol{\mu}_k\|.
  2. Update. μk(1/Sk)iSkxi\boldsymbol{\mu}_k \leftarrow (1/|S_k|) \sum_{i \in S_k} \mathbf{x}_i.

The fixed point locally minimizes the WCSS J=kiSkxiμk2J = \sum_k \sum_{i \in S_k} \|\mathbf{x}_i - \boldsymbol{\mu}_k\|^2. Local convergence; k-means++ providing O(logK)O(\log K) expected-WCSS approximation.

At a fixed point, xi\mathbf{x}_i belongs to cluster kk iff xiV(μk)\mathbf{x}_i \in V(\boldsymbol{\mu}_k) — the Voronoi cell of μk\boldsymbol{\mu}_k. Each cell is a convex polytope; cell boundaries are perpendicular bisectors of centroid pairs.

What each method assumes

K-means. Cluster count KK fixed in advance; cluster shape implicitly convex (Voronoi polytope); no density representation. WCSS objective is, up to constants, the NLL of an isotropic Gaussian mixture with equal variances and equal mixing weights (Bishop 2006, §9.1). Works exactly on well-separated isotropic Gaussians. Fails confidently on non-convex / non-isotropic / unequal-spread clusters.

Mean-shift. Cluster count M(h)M(h) emerges from hh; cluster shape is the basin of attraction of f^h\hat f_h; density-based. Handles any data with well-defined density modes. Trades KK hyperparameter for hh hyperparameter.

K-means optimizes WCSS; mean-shift optimizes f^h\hat f_h. The silhouette/ARI disagreement on the moons (§1.3) reflects this — silhouette rewards convex compactness (the k-means objective), ARI rewards getting the right partition.

The h0h \to 0 collapse theorem

Theorem 4 (Gaussian-kernel mean-shift collapse to nearest sample).

Let f^h\hat f_h be the Gaussian-kernel KDE on data x1,,xnRd\mathbf{x}_1, \ldots, \mathbf{x}_n \in \mathbb{R}^d with distinct points. Fix xRd\mathbf{x} \in \mathbb{R}^d with a unique nearest sample xi\mathbf{x}_{i^*} — i.e., di2<di2d_{i^*}^2 < d_i^2 for all iii \neq i^*, where di:=xxid_i := \|\mathbf{x} - \mathbf{x}_i\|. Then

limh0+mh(x)  =  xi,withmh(x)xi    D(n1)exp ⁣(Δ2(x)2h2),\lim_{h \to 0^+} \mathbf{m}_h(\mathbf{x}) \;=\; \mathbf{x}_{i^*}, \qquad \text{with}\qquad \|\mathbf{m}_h(\mathbf{x}) - \mathbf{x}_{i^*}\| \;\leq\; D \cdot (n-1)\, \exp\!\left(-\frac{\Delta^2(\mathbf{x})}{2 h^2}\right),

where D=maxixixiD = \max_i \|\mathbf{x}_i - \mathbf{x}_{i^*}\| and Δ2(x):=minii(di2di2)>0\Delta^2(\mathbf{x}) := \min_{i \neq i^*}(d_i^2 - d_{i^*}^2) > 0.

The convergence rate is exponentially fast in 1/h21/h^2.

Proof.

Factor out wiw_{i^*} from numerator and denominator of mh(x)=iwixi/iwi\mathbf{m}_h(\mathbf{x}) = \sum_i w_i \mathbf{x}_i / \sum_i w_i with wi=exp(di2/(2h2))w_i = \exp(-d_i^2/(2h^2)). With ri:=wi/wir_i := w_i / w_{i^*}, for iii \neq i^*,

ri  =  exp ⁣(di2di22h2)    exp ⁣(Δ2(x)2h2)  =:  ε(h),r_i \;=\; \exp\!\left(-\frac{d_i^2 - d_{i^*}^2}{2 h^2}\right) \;\leq\; \exp\!\left(-\frac{\Delta^2(\mathbf{x})}{2 h^2}\right) \;=:\; \varepsilon(h),

with ri=1r_{i^*} = 1. Compute:

mh(x)xi  =  iiri(xixi)1+iiri.\mathbf{m}_h(\mathbf{x}) - \mathbf{x}_{i^*} \;=\; \frac{\sum_{i \neq i^*} r_i\, (\mathbf{x}_i - \mathbf{x}_{i^*})}{1 + \sum_{i \neq i^*} r_i}.

Take the Euclidean norm; apply the triangle inequality with xixiD\|\mathbf{x}_i - \mathbf{x}_{i^*}\| \leq D and the bound ri/(1+ri)ri\sum r_i / (1 + \sum r_i) \leq \sum r_i (denominator 1\geq 1):

mh(x)xi    Diiri    D(n1)ε(h).\|\mathbf{m}_h(\mathbf{x}) - \mathbf{x}_{i^*}\| \;\leq\; D \sum_{i \neq i^*} r_i \;\leq\; D (n-1) \varepsilon(h).

Corollary 1 (Basin = Voronoi cell at $h = 0$).

For any x\mathbf{x} with a unique nearest sample xi\mathbf{x}_{i^*}, limh0+Th(x)=xi\lim_{h \to 0^+} T_h(\mathbf{x}) = \mathbf{x}_{i^*}, and the basin partition {B(xi)}i=1n\{B(\mathbf{x}_i)\}_{i=1}^n converges to the Voronoi diagram of the data, {V(xi)}i=1n\{V(\mathbf{x}_i)\}_{i=1}^n.

Proof.

Theorem 4 shows mh(x)xi\mathbf{m}_h(\mathbf{x}) \to \mathbf{x}_{i^*} in one iteration. The next iteration starts arbitrarily close to xi\mathbf{x}_{i^*}, at which the nearest sample is xi\mathbf{x}_{i^*} itself (distance 0), so the trajectory stays put. Th(x)xiT_h(\mathbf{x}) \to \mathbf{x}_{i^*}; the preimage converges to the Voronoi cell.

Remark.

The corollary makes the "K=nK = n" equivalence precise: mean-shift in the h0h \to 0 limit produces the Voronoi diagram of the data — the partition k-means with K=nK = n frozen-data centroids would produce. For K<nK < n k-means, equivalence is approximate and conditional: in the M(h)=KM(h) = K plateau on isotropic-Gaussian-mixture data, both algorithms recover the same partition. The equivalence breaks on non-isotropic or non-Gaussian data — on the moons no hh value matches k-means at any KK.

Numerical experiment: ARI(mean-shift, k-means) on the 3-blob

Run k-means with K=3K = 3 as reference. Sweep hh across the M(h)=3M(h) = 3 plateau and below; at each hh compute mean-shift labels and ARIs against k-means and against ground truth. ARI(mean-shift, k-means) ≈ 1.0 across the plateau; ARI drops below the plateau when M(h)>3M(h) > 3. The basin-mosaic at decreasing hh shows boundaries first straightening (within the plateau’s lower edge) then fragmenting (below the plateau).

Row of mean-shift basin panels at decreasing bandwidth next to a k-means Voronoi panel
Mean-shift basins at three decreasing bandwidths alongside the k-means Voronoi partition. In the plateau, ARI(mean-shift, k-means) hits 0.99.

GMM-via-EM and soft assignment

§9 contrasted against k-means — the simplest hard-partition baseline. §10 contrasts against the probabilistic hard-partition baseline: Gaussian mixture models fit by expectation-maximization (Dempster, Laird, and Rubin 1977).

The Gaussian mixture model and its likelihood

p(xθ)  =  k=1KπkN(xμk,Σk),p(\mathbf{x} \mid \boldsymbol{\theta}) \;=\; \sum_{k=1}^K \pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k),

with πk0\pi_k \geq 0 summing to 1, component means μkRd\boldsymbol{\mu}_k \in \mathbb{R}^d, covariances ΣkRd×d\boldsymbol{\Sigma}_k \in \mathbb{R}^{d \times d} positive definite. Latent-variable form: ziCat(π)z_i \sim \mathrm{Cat}(\boldsymbol{\pi}), xizi=kN(μk,Σk)\mathbf{x}_i \mid z_i = k \sim \mathcal{N}(\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k).

Log-likelihood:

(θ)  =  i=1nlogk=1KπkN(xiμk,Σk).\ell(\boldsymbol{\theta}) \;=\; \sum_{i=1}^n \log \sum_{k=1}^K \pi_k\, \mathcal{N}(\mathbf{x}_i \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k).

Log-sum-over-components has no closed-form maximum.

EM for GMMs: responsibilities as soft assignments

E-step. Compute posterior responsibilities:

γik(t)  :=  πk(t)N(xiμk(t),Σk(t))lπl(t)N(xiμl(t),Σl(t)).\gamma_{ik}^{(t)} \;:=\; \frac{\pi_k^{(t)}\, \mathcal{N}(\mathbf{x}_i \mid \boldsymbol{\mu}_k^{(t)}, \boldsymbol{\Sigma}_k^{(t)})}{\sum_l \pi_l^{(t)}\, \mathcal{N}(\mathbf{x}_i \mid \boldsymbol{\mu}_l^{(t)}, \boldsymbol{\Sigma}_l^{(t)})}.

γik[0,1]\gamma_{ik} \in [0, 1] is the soft assignment; kγik=1\sum_k \gamma_{ik} = 1.

M-step. With Nk=iγik(t)N_k = \sum_i \gamma_{ik}^{(t)}:

πk(t+1)=Nkn,μk(t+1)=1Nkiγik(t)xi,Σk(t+1)=1Nkiγik(t)(xiμk(t+1))(xiμk(t+1)).\pi_k^{(t+1)} = \frac{N_k}{n}, \quad \boldsymbol{\mu}_k^{(t+1)} = \frac{1}{N_k} \sum_i \gamma_{ik}^{(t)}\, \mathbf{x}_i, \quad \boldsymbol{\Sigma}_k^{(t+1)} = \frac{1}{N_k} \sum_i \gamma_{ik}^{(t)}\, (\mathbf{x}_i - \boldsymbol{\mu}_k^{(t+1)})(\mathbf{x}_i - \boldsymbol{\mu}_k^{(t+1)})^\top.

Weighted single-Gaussian MLEs. See Bishop (2006, §9.2.2) for the full derivation.

Theorem 5 (EM monotone-likelihood ascent (Dempster–Laird–Rubin 1977, Thm 1)).

At each EM iteration the log-likelihood (θ(t))\ell(\boldsymbol{\theta}^{(t)}) is monotone non-decreasing; the algorithm converges to a local maximum of \ell (under regularity conditions on the latent-variable model).

The structural connection to mean-shift: GMM-EM’s M-step μk=(iγikxi)/(iγik)\boldsymbol{\mu}_k = (\sum_i \gamma_{ik} \mathbf{x}_i)/(\sum_i \gamma_{ik}) is the §3.3 mean-shift update form with responsibilities as kernel weights. Differences: parametric vs fixed-bandwidth weighting; KK centroids vs nn query points; soft per-component vs hard per-mode. The h0h \to 0 Gaussian-mean-shift → k-means limit (§9.3) and the Σk=σ2I,σ0\boldsymbol{\Sigma}_k = \sigma^2 \mathbf{I}, \sigma \to 0 GMM-EM → k-means limit both converge on k-means.

Identifiability, label-switching, and what “the” cluster labels mean

Permuting component labels leaves the density unchanged. The likelihood has K!K! equivalent global maxima differing only in component naming.

Consequences: EM with random initialization converges to permuted optima; Bayesian inference suffers label-switching (Stephens 2000); cluster-quality metrics must be permutation-invariant (ARI does it, raw accuracy doesn’t).

The label-switching issue affects names, not partition. Two GMM fits agreeing on which samples cluster together but disagreeing on naming are operationally the same fit.

A separate identifiability concern: GMM likelihood is unbounded above. Setting μk=xi\boldsymbol{\mu}_k = \mathbf{x}_i and Σk0\boldsymbol{\Sigma}_k \to 0 sends \ell \to \infty. sklearn.mixture.GaussianMixture regularizes via reg_covar (default 10610^{-6}).

The M-open critique

GMM produces correct cluster assignments when data is genuinely a Gaussian mixture. When it isn’t, assignments are systematically biased: GMM fits ellipsoidal components to non-ellipsoidal cluster structure.

Bernardo and Smith (1994) framework: M-closed = true model is in our class; M-complete = known true model is intractable; M-open = true model is outside our class.

For clustering, GMM is M-open whenever cluster shapes aren’t ellipsoidal — most real data. Moons aren’t Gaussian. Concentric rings aren’t. Image segmentation, text embedding, biological data — none are Gaussian in any literal sense.

The critique cuts in two directions. Against GMM: when true clusters are non-Gaussian, biased assignments. Against mean-shift: non-parametric in f^h\hat f_h but commits to “cluster = basin of attraction of kernel-smoothed empirical density.” Less restrictive than ellipsoidal Gaussian, but not assumption-free.

Practical takeaway. Approximately-Gaussian clusters with arbitrary covariance shape: GMM-EM. Non-Gaussian or unknown-shape clusters: mean-shift. On real data, “run both and compare.”

GMM responsibilities vs mean-shift basins

3-blob (where GMM is correct): GMM-EM recovers components; responsibilities sharp inside blobs, soft at boundaries; mean-shift hard MAP agrees up to label permutation; ARI 1.0\approx 1.0 on both.

Moons (both partially M-open): GMM-EM with K=2K = 2 fits two ellipsoids that span across the crescent gap, ARI 0.46\approx 0.46. Mean-shift at the plateau bandwidth finds the right cluster count and shape but mis-classifies boundary points, ARI 0.43\approx 0.43. Both algorithms beat k-means’ ARI 0.22\approx 0.22 from §1.3 but neither recovers the moons perfectly at noise = 0.1 — the crescents are intrinsically hard at this noise level.

2x2 grid showing GMM soft responsibilities and mean-shift hard basins on the 3-blob and moons datasets
GMM responsibilities (left column, RGB blend) vs mean-shift basins (right column). On 3-blob, both agree. On moons, both improve over k-means but neither recovers the crescents perfectly.

Pointers: spectral and density-based clustering

§§9–10 covered partition-based contrasts. §11 covers two methods that don’t fit cleanly into either partition-based or mode-based — spectral clustering and DBSCAN. Light pointer; proper coverage of spectral clustering is in graph-laplacians and random-walks.

Affinity graphs and the Laplacian

Spectral clustering builds an affinity graph: vertices are samples, edge weights WijW_{ij} encode pairwise similarity. Standard constructions: Gaussian similarity Wij=exp(xixj2/(2σ2))W_{ij} = \exp(-\|\mathbf{x}_i - \mathbf{x}_j\|^2 / (2\sigma^2)); kk-NN graph; ϵ\epsilon-graph. kk-NN is most common: sparse, robust to σ\sigma, density-adaptive.

Graph Laplacian: L:=DWL := D - W, D=diag(jWij)D = \mathrm{diag}(\sum_j W_{ij}). Normalized variants: Lsym:=D1/2LD1/2L_{\mathrm{sym}} := D^{-1/2} L D^{-1/2} (Ng–Jordan–Weiss); Lrw:=D1LL_{\mathrm{rw}} := D^{-1} L (Shi–Malik). Eigenvectors corresponding to smallest eigenvalues encode cluster structure: for an ideal KK-component disconnected graph the bottom KK eigenvectors are component indicators; for almost-disconnected graphs they approximately encode the clusters (Mohar 1991; von Luxburg 2007).

Shi–Malik and Ng–Jordan–Weiss

Shi and Malik (2000). Bottom eigenvectors of LrwL_{\mathrm{rw}}; for K=2K = 2, partition by sign of second-smallest eigenvector; for K>2K > 2, recursive bipartitioning or k-means on bottom KK eigenvectors. Motivated by the normalized cut objective.

Ng, Jordan, and Weiss (2002). Bottom KK eigenvectors of LsymL_{\mathrm{sym}} as VRn×KV \in \mathbb{R}^{n \times K}. Row-normalize: Yij=Vij/lVil2Y_{ij} = V_{ij}/\sqrt{\sum_l V_{il}^2}. Run k-means on rows of YY. Geometric interpretation: row-normalized embedding lies on the unit sphere in RK\mathbb{R}^K, where cluster structure is spherical.

For the variational characterization, Cheeger inequality, perturbation analysis, random-walk dual — consult graph-laplacians and random-walks. von Luxburg (2007) is the standard intro.

DBSCAN

Density-Based Spatial Clustering of Applications with Noise (Ester, Kriegel, Sander, and Xu 1996). Two hyperparameters: ϵ\epsilon (neighborhood radius) and MinPts\text{MinPts}. Point types: core (Nϵ(xi)MinPts|\mathcal{N}_\epsilon(\mathbf{x}_i)| \geq \text{MinPts}), border (in a core’s neighborhood, not itself core), noise (neither). Clusters are connected components of density-reachability between core points; noise gets label 1-1.

DBSCAN vs mean-shift: both density-based, but DBSCAN uses hard threshold (ϵ\epsilon, MinPts) while mean-shift uses KDE smoothing. DBSCAN handles noise via 1-1; mean-shift assigns every point. DBSCAN has 2 hyperparameters, mean-shift 1. HDBSCAN (Campello, Moulavi, Sander 2013) replaces fixed ϵ\epsilon with a hierarchy.

For uniform-density clusters, DBSCAN excels. For varied-density data, it struggles — single ϵ\epsilon can’t resolve both tight and diffuse clusters; mean-shift’s bandwidth adapts via KDE smoothing.

Numerical demonstration on make_circles

make_circles(n_samples=200, factor=0.5, noise=0.05): outer ring radius 1.0, inner ring radius 0.5. KDE density is concentrated on ring-shaped ridges, not point modes — mean-shift trajectories converge to points along the ridges, basin partition fragments. Spectral and DBSCAN both succeed by exploiting connected-component structure on the affinity graph.

Four panels: mean-shift fails (ARI 0\approx 0); spectral K=2K = 2 with k=10k = 10-NN affinity succeeds (ARI 1\approx 1); DBSCAN ϵ=0.20\epsilon = 0.20 MinPts =5= 5 succeeds (ARI 1\approx 1); spectral embedding showing linear separability of the two rings.

Light pointer per §11. Full spectral-clustering theory lives in graph-laplacians and random-walks in the Graph Theory track.

Four-panel comparison on the concentric-circles dataset: mean-shift fragments, spectral and DBSCAN succeed, spectral embedding shows linear separability
Mean-shift fails on circles (no point modes for ring-shaped density). Spectral and DBSCAN succeed via connected-component structure. The bottom-eigenvector embedding makes the success geometrically concrete.

Connections, limits, and forward pointers

Within-formalML siblings

Kernel regression (Supervised Learning, intermediate). Covariate-conditional KDE counterpart. The NW estimator is structurally identical to Gaussian-kernel mean-shift update; kernel regression averages responses, mean-shift averages positions; same machinery.

Graph Laplacians (Graph Theory, advanced). Proper home for spectral clustering — eigenvalue spectrum, Cheeger inequality, perturbation analysis. §11 gestures; graph-laplacians develops.

Random walks (Graph Theory, intermediate). Dual Laplacian view: LrwL_{\mathrm{rw}} generates a random walk; mixing times encoded by spectral eigenvalues.

Normalizing flows (Unsupervised, advanced). Continuous-density estimation via invertible neural transformations. Mean-shift’s KDE is the non-parametric precursor to flow-based estimators.

Forward pointers

Density-ratio estimation (coming soon). KLIEP, LSIF, uLSIF — density-ratio estimators that plug KDE substrates into r(x)=pnum/pdenr(\mathbf{x}) = p_{\text{num}}/p_{\text{den}}. The unsupervised.ts exports meanShift and modeFinder are direct dependencies. Closes the T3 Unsupervised track.

Honest limits

Curse of dimensionality on hh. AMISE-optimal hn1/(d+4)h^* \asymp n^{-1/(d+4)} grows with dd, but the bandwidth-ball volume grows much faster — B(x,h)hd|B(\mathbf{x}, h)| \propto h^d. In d=10d = 10 with n=103n = 10^3, the typical neighborhood at hh^* contains essentially zero samples. Mean-shift becomes infeasible above d10d \approx 10 without dimensionality reduction (PCA, autoencoder, t-SNE / UMAP) as preprocessing.

Mode-vs-cluster distinction. Mean-shift converges to stationary points; the modal-clustering hypothesis (Hartigan 1975) asserts cluster = basin of attraction of a density mode. Three failure modes: well-separated clusters that don’t produce distinct density modes (concentric rings); spurious modes from noise; genuinely ambiguous cluster structure where no algorithm agrees.

No cluster-count guarantees under noise. Under additive noise, KDE develops spurious local maxima counting toward M(h)M(h). The bandwidth-mode-count duality is preserved (still monotone) but the plateau may shift or disappear. Carreira-Perpiñán (2015, §3.3) discusses noise-robust mode estimation; practical fix is to combine moderate bandwidth with a low-density-prune step dropping modes with insufficient basin mass.

Take-home. Mean-shift is the right tool for cluster discovery via density modes on moderate-dimensional, moderately-noised data with moderately-distinct cluster shapes. Outside that regime, k-means, GMM, spectral, DBSCAN are alternatives — on hard data, ensembles often do better than any single one.

Connections

  • The Gaussian-kernel mean-shift update (§3.3) and the Nadaraya–Watson regression estimator are structurally identical — both compute a kernel-weighted average over the data. Kernel regression averages responses given covariates; mean-shift averages positions given the current iterate. The same bandwidth-selection machinery applies to both. kernel-regression
  • The §11 spectral-clustering pointer gestures at the eigenvalue structure of the normalized Laplacian. The proper development — Cheeger inequality, perturbation analysis, Shi–Malik / Ng–Jordan–Weiss derivations — lives in graph-laplacians on the Graph Theory track. graph-laplacians
  • The §11 spectral pointer also covers the random-walk Laplacian $L_{\mathrm{rw}}$. The dynamics view (mixing times, spectral gap, hitting times) is the substrate random-walks develops; this topic just names the connection and redirects. random-walks
  • Mean-shift's KDE is the non-parametric precursor to flow-based density estimation. The §12.3 forward pointer notes that the unsupervised.ts shared module introduced here for the mean-shift estimators is the same module that normalizing-flows can plug into for KDE baselines. normalizing-flows
  • Each mean-shift iteration (§3.3 Proposition) is a gradient-ascent step on $\hat f_h$ with adaptive step size $\alpha(\mathbf{x}_t)^{-1}$. The convergence theory (§4) is the gradient-ascent convergence story specialized to a smooth bounded objective with isolated stationary points. gradient-descent

References & Further Reading