
Gradient Descent & Convergence

The workhorse of continuous optimization: convergence rates under smoothness and strong convexity, acceleration, coordinate descent, mirror descent, and stochastic variants

Prerequisites: Convex Analysis

Overview & Motivation

In Convex Analysis we established the geometric and analytic infrastructure that makes optimization tractable: convex sets are closed under line segments, every local minimum of a convex function is a global minimum, and smooth convex functions admit tangent-line lower bounds. That machinery tells us what to optimize but not how. This topic is about the how.

Gradient descent is the simplest first-order algorithm for unconstrained minimization, and arguably the most important. The idea is disarmingly simple: to minimize $f$, start at an initial guess $x_0$ and repeatedly take steps in the direction that decreases $f$ fastest — the negative gradient $-\nabla f(x_k)$. Each step replaces the function by its local linear approximation and moves to where that approximation is most favorable.

What makes gradient descent remarkable is not the algorithm itself but the convergence theory that governs it. With the right assumptions — smoothness, convexity, and sometimes strong convexity — we can prove exactly how fast the iterates approach the optimum, and these rates are tight: in the worst case, no unaccelerated first-order method can do better, and the improvement available through acceleration is also known exactly.

What We Cover

  1. The Gradient Descent Algorithm — the update rule and its geometric interpretation as steepest descent.
  2. Smoothness & the Descent Lemma — how $L$-smoothness provides a quadratic upper bound that guarantees progress at each step.
  3. Convergence for Convex Functions — the $O(1/k)$ sublinear rate, proven via telescoping.
  4. Strong Convexity & Linear Convergence — the condition number $\kappa = L/\mu$ and the geometric rate $(1 - 1/\kappa)^k$.
  5. Step Size Selection — fixed, exact line search, and Armijo backtracking.
  6. Nesterov Accelerated Gradient — the momentum trick that achieves the optimal $O(1/k^2)$ rate.
  7. Coordinate Descent — updating one variable at a time, and when that beats full-gradient steps.
  8. Mirror Descent & Bregman Divergence — generalizing Euclidean geometry for constrained domains.
  9. Stochastic Gradient Descent — noisy gradients, mini-batches, and decreasing step sizes.
  10. Computational Notes — NumPy implementations and convergence diagnostics.

Gradient descent on a 2D quadratic: contour trajectory and convergence plot


The Gradient Descent Algorithm

The gradient $\nabla f(x)$ points in the direction of steepest increase of $f$ at $x$. To decrease $f$, we walk in the opposite direction.

Definition 1 (Gradient Descent Update).

Given a differentiable function $f : \mathbb{R}^n \to \mathbb{R}$, a starting point $x_0 \in \mathbb{R}^n$, and a step size (learning rate) $\eta > 0$, the gradient descent algorithm generates iterates

x_{k+1} = x_k - \eta \nabla f(x_k), \quad k = 0, 1, 2, \ldots

The iteration terminates when $\|\nabla f(x_k)\|$ falls below a chosen tolerance, or after a fixed budget of steps.

Geometrically, each iterate replaces $f$ by its first-order Taylor approximation $f(x_k) + \nabla f(x_k)^\top (x - x_k)$ and moves in the direction that decreases this linear model fastest. The step size $\eta$ controls how far we trust the linear approximation — too large and we overshoot; too small and progress is glacially slow.

For a quadratic $f(x) = \frac{1}{2}x^\top A x$ with $A$ symmetric positive definite, the gradient is $\nabla f(x) = Ax$ and the update simplifies to $x_{k+1} = (I - \eta A)x_k$. The convergence behavior of this linear map is entirely determined by the eigenvalues of $I - \eta A$: convergence requires $|1 - \eta \lambda_i| < 1$ for every eigenvalue $\lambda_i$, which means $\eta$ must lie in $(0, 2/\lambda_{\max})$.
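This stability threshold is easy to verify numerically. A minimal sketch (the matrix and step sizes are illustrative), iterating $x_{k+1} = (I - \eta A)x_k$ on a small diagonal quadratic:

```python
import numpy as np

# f(x) = 0.5 x^T A x with eigenvalues 1 and 10, so 2/lambda_max = 0.2
A = np.diag([1.0, 10.0])

def final_norm(eta, steps=200):
    """Run gradient descent x <- (I - eta*A) x and return ||x_steps||."""
    x = np.array([1.0, 1.0])
    for _ in range(steps):
        x = x - eta * (A @ x)
    return np.linalg.norm(x)

print(final_norm(0.15))  # eta < 0.2: contracts toward the minimizer 0
print(final_norm(0.25))  # eta > 0.2: the lambda = 10 mode blows up
```

Both modes contract only when every factor $|1 - \eta\lambda_i|$ is below one; with $\eta = 0.25$ the factor for $\lambda = 10$ is $1.5$, so that component grows geometrically.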


Smoothness & the Descent Lemma

The key to analyzing gradient descent is smoothness — a Lipschitz condition on the gradient that provides a global quadratic upper bound on ff.

Definition 2 (L-Smoothness).

A differentiable function $f : \mathbb{R}^n \to \mathbb{R}$ is $L$-smooth (or has $L$-Lipschitz continuous gradient) if

\|\nabla f(x) - \nabla f(y)\| \leq L \|x - y\| \quad \text{for all } x, y \in \mathbb{R}^n

The constant $L > 0$ is the smoothness parameter. For twice-differentiable functions, $L$-smoothness is equivalent to $\nabla^2 f(x) \preceq L I$ for all $x$ — that is, the Hessian’s eigenvalues are bounded above by $L$.

Smoothness says the gradient cannot change too fast. This is exactly the condition that allows us to guarantee progress at each gradient step: if the gradient doesn’t change too fast between $x_k$ and $x_{k+1}$, then the decrease predicted by the linear approximation actually holds (up to a quadratic correction).

Theorem 1 (The Descent Lemma).

If $f$ is $L$-smooth, then for all $x, y \in \mathbb{R}^n$:

f(y) \leq f(x) + \nabla f(x)^\top (y - x) + \frac{L}{2}\|y - x\|^2

In particular, taking a gradient step $y = x - \frac{1}{L}\nabla f(x)$ yields the guaranteed decrease:

f\!\left(x - \frac{1}{L}\nabla f(x)\right) \leq f(x) - \frac{1}{2L}\|\nabla f(x)\|^2

Proof.

By the fundamental theorem of calculus,

f(y) - f(x) = \int_0^1 \nabla f(x + t(y-x))^\top (y-x) \, dt

Adding and subtracting $\nabla f(x)^\top(y-x)$:

f(y) - f(x) = \nabla f(x)^\top (y-x) + \int_0^1 \bigl[\nabla f(x + t(y-x)) - \nabla f(x)\bigr]^\top (y-x) \, dt

By Cauchy–Schwarz and the $L$-Lipschitz condition on the gradient:

\bigl|\bigl[\nabla f(x + t(y-x)) - \nabla f(x)\bigr]^\top (y-x)\bigr| \leq L \cdot t \|y-x\|^2

Integrating from $0$ to $1$:

f(y) \leq f(x) + \nabla f(x)^\top(y-x) + \frac{L}{2}\|y-x\|^2

For the guaranteed decrease, substitute $y = x - \frac{1}{L}\nabla f(x)$:

f(y) \leq f(x) + \nabla f(x)^\top\!\left(-\frac{1}{L}\nabla f(x)\right) + \frac{L}{2}\left\|\frac{1}{L}\nabla f(x)\right\|^2 = f(x) - \frac{1}{2L}\|\nabla f(x)\|^2

The Descent Lemma tells us that at each point $x$, the function $f$ is sandwiched between a linear lower bound (from convexity, if $f$ is convex) and a quadratic upper bound (from $L$-smoothness). The gradient step minimizes the upper bound exactly, and the decrease is proportional to $\|\nabla f(x)\|^2$.
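The guaranteed-decrease inequality can be checked numerically on a quadratic, where $L = \lambda_{\max}(A)$. A small sketch (the matrix is illustrative):

```python
import numpy as np

A = np.array([[3.0, 1.0], [1.0, 2.0]])   # symmetric positive definite
L = np.linalg.eigvalsh(A)[-1]            # smoothness constant = lambda_max

f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

def max_violation(n_samples=100, seed=0):
    """Largest violation of f(x - g/L) <= f(x) - ||g||^2/(2L) over random x."""
    rng = np.random.default_rng(seed)
    worst = -np.inf
    for _ in range(n_samples):
        x = rng.standard_normal(2)
        g = grad(x)
        worst = max(worst, f(x - g / L) - (f(x) - g @ g / (2 * L)))
    return worst

print(max_violation())  # should be <= 0 up to rounding
```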

Smoothness and the Descent Lemma: quadratic upper bounds at multiple points


Convergence for Convex Functions

Armed with the Descent Lemma, we can prove the first convergence result. For a convex, $L$-smooth function, gradient descent with step size $\eta = 1/L$ achieves an $O(1/k)$ rate — sublinear, but dimension-free.

Theorem 2 (Convergence for Convex + L-Smooth Functions).

Let $f$ be convex and $L$-smooth, and let $x^*$ be a minimizer of $f$. Then gradient descent with step size $\eta = 1/L$ satisfies:

f(x_k) - f(x^*) \leq \frac{L \|x_0 - x^*\|^2}{2k}

Proof.

From the Descent Lemma, each step guarantees:

f(x_{k+1}) \leq f(x_k) - \frac{1}{2L}\|\nabla f(x_k)\|^2 \qquad (1)

Since $f$ is convex, the first-order condition gives us $f(x^*) \geq f(x_k) + \nabla f(x_k)^\top(x^* - x_k)$, which rearranges to:

f(x_k) - f(x^*) \leq \nabla f(x_k)^\top(x_k - x^*) \leq \|\nabla f(x_k)\| \cdot \|x_k - x^*\| \qquad (2)

Rather than combining (1) and (2) directly, we obtain a tighter bound by tracking the distance to the optimum:

\|x_{k+1} - x^*\|^2 = \left\|x_k - \frac{1}{L}\nabla f(x_k) - x^*\right\|^2 = \|x_k - x^*\|^2 - \frac{2}{L}\nabla f(x_k)^\top(x_k - x^*) + \frac{1}{L^2}\|\nabla f(x_k)\|^2

Rearranging and using the convexity inequality $\nabla f(x_k)^\top(x_k - x^*) \geq f(x_k) - f(x^*)$ from (2):

f(x_k) - f(x^*) \leq \frac{L}{2}\bigl(\|x_k - x^*\|^2 - \|x_{k+1} - x^*\|^2\bigr) + \frac{1}{2L}\|\nabla f(x_k)\|^2

From (1), $\frac{1}{2L}\|\nabla f(x_k)\|^2 \leq f(x_k) - f(x_{k+1})$. Substituting this and cancelling $f(x_k)$ from both sides:

f(x_{k+1}) - f(x^*) \leq \frac{L}{2}\bigl(\|x_k - x^*\|^2 - \|x_{k+1} - x^*\|^2\bigr)

Now telescope: sum both sides from $k = 0$ to $k = K-1$:

\sum_{k=1}^{K} \bigl(f(x_k) - f(x^*)\bigr) \leq \frac{L}{2}\|x_0 - x^*\|^2

Since $f(x_k) - f(x^*)$ is non-increasing (by the Descent Lemma), $K \cdot (f(x_K) - f(x^*)) \leq \sum_{k=1}^{K}(f(x_k) - f(x^*))$, giving:

f(x_K) - f(x^*) \leq \frac{L\|x_0 - x^*\|^2}{2K}

The $O(1/k)$ rate tells us that to reach $\varepsilon$-accuracy, we need $k \sim L\|x_0 - x^*\|^2 / \varepsilon$ iterations. This depends on the initial distance to the optimum but — crucially — not on the dimension $n$.
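Theorem 2's bound can be sanity-checked on a small quadratic, where $x^* = 0$ and $f(x^*) = 0$ (a sketch; the matrix and iteration count are illustrative):

```python
import numpy as np

A = np.diag([1.0, 4.0, 9.0])       # L = lambda_max = 9, minimizer x* = 0
L = 9.0
f = lambda x: 0.5 * x @ A @ x      # f* = 0

x = np.ones(3)
R2 = x @ x                          # ||x_0 - x*||^2
ok = True
for k in range(1, 201):
    x = x - (A @ x) / L             # gradient step with eta = 1/L
    ok &= f(x) <= L * R2 / (2 * k) + 1e-12   # Theorem 2 bound at iterate k
print(ok)
```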

Convergence for convex functions: rate comparison and contour paths


Strong Convexity & Linear Convergence

The $O(1/k)$ rate is slow. If we strengthen our assumption from plain convexity to strong convexity, the rate improves dramatically from sublinear to linear (i.e., geometric).

Definition 3 (μ-Strong Convexity).

A differentiable function $f$ is $\mu$-strongly convex (with $\mu > 0$) if for all $x, y \in \mathbb{R}^n$:

f(y) \geq f(x) + \nabla f(x)^\top(y - x) + \frac{\mu}{2}\|y - x\|^2

Equivalently, $f(x) - \frac{\mu}{2}\|x\|^2$ is convex, or $\nabla^2 f(x) \succeq \mu I$ for all $x$ (when $f$ is twice differentiable).

Strong convexity says that $f$ curves upward at least as fast as a quadratic with parameter $\mu$. Together with $L$-smoothness, this gives a “quadratic sandwich”:

f(x) + \nabla f(x)^\top(y-x) + \frac{\mu}{2}\|y-x\|^2 \leq f(y) \leq f(x) + \nabla f(x)^\top(y-x) + \frac{L}{2}\|y-x\|^2

Definition 4 (Condition Number).

For an $L$-smooth, $\mu$-strongly convex function, the condition number is:

\kappa = \frac{L}{\mu} \geq 1

For a quadratic $f(x) = \frac{1}{2}x^\top A x$ with $A$ symmetric positive definite, $L = \lambda_{\max}(A)$, $\mu = \lambda_{\min}(A)$, and $\kappa$ equals the spectral condition number of $A$ (see The Spectral Theorem). For least-squares problems $\min \|Ax - b\|^2$, the relevant condition number is $\sigma_{\max}^2/\sigma_{\min}^2$ from the SVD.

Theorem 3 (Linear Convergence Under Strong Convexity).

Let $f$ be $L$-smooth and $\mu$-strongly convex with condition number $\kappa = L/\mu$. Then gradient descent with step size $\eta = 1/L$ satisfies:

f(x_k) - f(x^*) \leq \left(1 - \frac{1}{\kappa}\right)^k \bigl(f(x_0) - f(x^*)\bigr)

Proof.

The Descent Lemma gives $f(x_{k+1}) \leq f(x_k) - \frac{1}{2L}\|\nabla f(x_k)\|^2$. Subtracting $f(x^*)$:

f(x_{k+1}) - f(x^*) \leq f(x_k) - f(x^*) - \frac{1}{2L}\|\nabla f(x_k)\|^2 \qquad (3)

We need a lower bound on $\|\nabla f(x_k)\|^2$ in terms of $f(x_k) - f(x^*)$. Strong convexity gives us a key inequality (the Polyak–Łojasiewicz condition):

\|\nabla f(x)\|^2 \geq 2\mu \bigl(f(x) - f(x^*)\bigr)

To see why: by $\mu$-strong convexity, for any $y$,

f(y) \geq f(x) + \nabla f(x)^\top(y-x) + \frac{\mu}{2}\|y-x\|^2

Minimizing the right side over $y$ (a quadratic in $y$), the minimum is achieved at $y = x - \frac{1}{\mu}\nabla f(x)$ and equals $f(x) - \frac{1}{2\mu}\|\nabla f(x)\|^2$. Since the inequality holds for every $y$, minimizing both sides over $y$ gives $f(x^*) \geq$ this minimum:

f(x^*) \geq f(x) - \frac{1}{2\mu}\|\nabla f(x)\|^2

which rearranges to $\|\nabla f(x)\|^2 \geq 2\mu(f(x) - f(x^*))$.

Substituting into (3):

f(x_{k+1}) - f(x^*) \leq f(x_k) - f(x^*) - \frac{\mu}{L}\bigl(f(x_k) - f(x^*)\bigr) = \left(1 - \frac{1}{\kappa}\right)\bigl(f(x_k) - f(x^*)\bigr)

Applying this recursion $k$ times completes the proof.

The linear rate means $\log(f(x_k) - f(x^*))$ decreases at a constant rate per iteration — the convergence plot is a straight line on a log scale. To reach $\varepsilon$-accuracy, we need $k \sim \kappa \log(1/\varepsilon)$ iterations. The dependence on $1/\varepsilon$ is logarithmic (much better than the $1/\varepsilon$ for the convex case), but the price is a linear dependence on $\kappa$. Ill-conditioned problems (large $\kappa$) converge slowly because the contours are highly elongated ellipses, and gradient descent zigzags across the narrow valley.
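The geometric rate of Theorem 3 can likewise be checked on a quadratic with $\mu = 1$, $L = 10$ (values illustrative):

```python
import numpy as np

A = np.diag([1.0, 10.0])            # mu = 1, L = 10, kappa = 10
mu, L = 1.0, 10.0
f = lambda x: 0.5 * x @ A @ x       # minimized at x* = 0 with f* = 0

x = np.array([1.0, 1.0])
gap0 = f(x)                          # f(x_0) - f*
ok = True
for k in range(1, 101):
    x = x - (A @ x) / L              # gradient step with eta = 1/L
    ok &= f(x) <= (1 - mu / L) ** k * gap0 + 1e-15   # Theorem 3 bound
print(ok)
```

On a log scale the left-hand side falls along a straight line, as the text describes.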


Strong convexity: quadratic sandwich, linear convergence, and iterations vs κ


Step Size Selection

So far we’ve used the fixed step size $\eta = 1/L$, which requires knowing $L$ in advance. In practice, there are several strategies.

Exact line search chooses the step size that minimizes $f$ along the gradient direction:

\eta_k = \arg\min_{\eta > 0} f(x_k - \eta \nabla f(x_k))

For a quadratic $f(x) = \frac{1}{2}x^\top A x$, this has a closed-form solution:

\eta_k^* = \frac{\|\nabla f(x_k)\|^2}{\nabla f(x_k)^\top A \nabla f(x_k)}

Exact line search always converges at least as fast as fixed step size but is usually impractical for non-quadratic objectives.
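The closed-form step for quadratics agrees with a brute-force one-dimensional search, which is a quick way to sanity-check it (a sketch; the matrix, start point, and grid are illustrative):

```python
import numpy as np

A = np.array([[4.0, 1.0], [1.0, 2.0]])
f = lambda x: 0.5 * x @ A @ x

x = np.array([1.0, -1.0])
g = A @ x
eta_star = (g @ g) / (g @ A @ g)    # closed-form exact line-search step

# sanity check against a dense 1-D grid search over eta
etas = np.linspace(1e-4, 1.0, 10001)
eta_grid = etas[np.argmin([f(x - e * g) for e in etas])]
print(eta_star, eta_grid)           # the two should agree to grid resolution
```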

Definition 5 (Armijo Sufficient Decrease Condition).

Given a sufficient-decrease parameter $c \in (0, 1)$ (typically $c = 10^{-4}$) and backtracking factor $\beta \in (0, 1)$ (typically $\beta = 0.5$), the Armijo backtracking line search starts with $\alpha = \alpha_0$ and repeatedly sets $\alpha \leftarrow \beta \alpha$ until:

f(x_k - \alpha \nabla f(x_k)) \leq f(x_k) - c \, \alpha \|\nabla f(x_k)\|^2

This is the sufficient decrease (Armijo) condition: it requires a decrease proportional to the step size and the squared gradient norm.

Proposition 1 (Armijo Termination).

If $f$ is $L$-smooth, Armijo backtracking with any $c < 1$ and $\beta \in (0,1)$ terminates after finitely many reductions. Specifically, any $\alpha \leq \frac{2(1-c)}{L}$ satisfies the Armijo condition, so the search terminates with $\alpha \geq \beta \, \frac{2(1-c)}{L}$ (or with $\alpha = \alpha_0$, if $\alpha_0$ already satisfies the condition).

Proof.

From the Descent Lemma, for any $\alpha > 0$:

f(x - \alpha \nabla f(x)) \leq f(x) - \alpha \|\nabla f(x)\|^2 + \frac{L \alpha^2}{2}\|\nabla f(x)\|^2 = f(x) - \alpha\left(1 - \frac{L\alpha}{2}\right)\|\nabla f(x)\|^2

For the Armijo condition to hold, it suffices that $\alpha(1 - L\alpha/2) \geq c\alpha$, which simplifies to $\alpha \leq 2(1-c)/L$. Since the search reduces $\alpha$ by factors of $\beta$, it finds such an $\alpha$ within $\lceil \log_\beta(2(1-c)/(L\alpha_0)) \rceil$ reductions.
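Proposition 1's threshold can be verified directly on a quadratic (a sketch; the matrix and parameters are illustrative):

```python
import numpy as np

A = np.diag([1.0, 8.0])             # L-smooth with L = 8
L, c = 8.0, 1e-4
f = lambda x: 0.5 * x @ A @ x

x = np.array([1.0, 1.0])
g = A @ x                           # gradient at x
alpha = 2 * (1 - c) / L             # largest step Proposition 1 guarantees

# the Armijo sufficient-decrease condition holds at this alpha
armijo_ok = f(x - alpha * g) <= f(x) - c * alpha * (g @ g)
print(armijo_ok)
```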

Step size selection: trajectory comparison and convergence for fixed, exact line search, and Armijo backtracking


Nesterov Accelerated Gradient

Gradient descent with step size $1/L$ converges at rate $O(1/k)$ for convex functions. Can we do better? Nesterov showed in 1983 that the answer is yes — and that $O(1/k^2)$ is optimal among all first-order methods.

Definition 6 (Nesterov Accelerated Gradient).

The Nesterov accelerated gradient (NAG) method maintains a sequence of iterates $x_k$ and extrapolation points $y_k$:

y_k = x_k + \frac{t_k - 1}{t_{k+1}}(x_k - x_{k-1})

x_{k+1} = y_k - \frac{1}{L} \nabla f(y_k)

where the sequence $t_k$ satisfies $t_1 = 1$ and $t_{k+1} = \frac{1 + \sqrt{1 + 4t_k^2}}{2}$.

The critical difference from vanilla gradient descent: the gradient is evaluated at the extrapolated point $y_k$, not at the current iterate $x_k$. The extrapolation step adds a “momentum” term — it looks ahead in the direction the iterates have been moving, then computes the gradient at that look-ahead point.

Theorem 4 (Nesterov's Optimal Rate).

Let $f$ be convex and $L$-smooth. Then the Nesterov accelerated gradient method satisfies:

f(x_k) - f(x^*) \leq \frac{2L\|x_0 - x^*\|^2}{(k+1)^2}

This is an $O(1/k^2)$ rate, which is optimal among all first-order methods that access $f$ only through gradient evaluations (Nesterov, 1983).

We state this theorem without proof; the argument relies on a carefully constructed Lyapunov (potential) function, for which we refer the reader to Nesterov’s original paper.

Remark (Why O(1/k²) Is Optimal).

Nesterov also proved a matching lower bound: for any first-order method that generates iterates in the Krylov subspace $\text{span}\{\nabla f(x_0), \nabla f(x_1), \ldots\}$, there exists an $L$-smooth convex function for which $f(x_k) - f(x^*) \geq \Omega(1/k^2)$. The accelerated gradient method achieves this lower bound and is therefore optimal in this class. This is one of the rare results in optimization where we have matching upper and lower bounds.


Nesterov acceleration: GD vs Nesterov trajectories, convergence comparison, and momentum visualization


Coordinate Descent

Instead of computing the full gradient $\nabla f(x) \in \mathbb{R}^n$, coordinate descent updates one coordinate at a time. This is attractive when $n$ is large and computing a single partial derivative is cheap.

Definition 7 (Coordinate Descent).

Cyclic coordinate descent updates coordinates in order $i = 1, 2, \ldots, n, 1, 2, \ldots$:

x^{(k+1)}_i = x^{(k)}_i - \frac{1}{L_i} \frac{\partial f}{\partial x_i}(x^{(k)})

where $L_i$ is the smoothness constant along coordinate $i$. For $f(x) = \frac{1}{2}x^\top A x$, this is $x_i \leftarrow x_i - \frac{(Ax)_i}{A_{ii}}$.

Randomized coordinate descent selects $i$ uniformly at random at each step.

The characteristic visual signature of coordinate descent is the “staircase” pattern on contour plots: each step moves along a coordinate axis, creating axis-aligned zigzags. Compare this with gradient descent, which moves along the gradient direction (which is generally not axis-aligned).

Theorem 5 (Randomized Coordinate Descent Convergence).

Let $f$ be convex with coordinate-wise smoothness constants $L_1, \ldots, L_n$, and set $L_{\max} = \max_i L_i$. Then randomized coordinate descent with step sizes $1/L_i$ satisfies:

\mathbb{E}[f(x_k) - f(x^*)] \leq \frac{2n L_{\max} \|x_0 - x^*\|^2}{k}

Proof.

At each step, we select coordinate $i$ uniformly at random and update $x_i \leftarrow x_i - \frac{1}{L_i} \nabla_i f(x)$. The coordinate-wise Descent Lemma gives:

f(x^+) \leq f(x) - \frac{1}{2L_i} |\nabla_i f(x)|^2

Taking expectations over the random coordinate, and using $L_i \leq L_{\max}$:

\mathbb{E}_i[f(x^+)] \leq f(x) - \frac{1}{n}\sum_{i=1}^n \frac{1}{2L_i}|\nabla_i f(x)|^2 \leq f(x) - \frac{1}{2n L_{\max}}\|\nabla f(x)\|^2

The rest follows the same telescoping argument as Theorem 2, with $L$ replaced by $n L_{\max}$.

Remark (When Coordinate Descent Wins).

The $O(n/k)$ rate looks worse than GD’s $O(1/k)$, but each CD step costs one gradient component while each GD step costs all $n$. When measured in coordinate gradient evaluations, the rates are comparable. CD wins when: (1) the function has separable or block-separable structure, (2) $n$ is very large and computing the full gradient is expensive, or (3) the coordinate-wise smoothness constants $L_i$ vary widely (some coordinates are much smoother than others).

Coordinate descent: cyclic and randomized staircase trajectories and convergence comparison


Mirror Descent & Bregman Divergence

Gradient descent implicitly uses Euclidean distance to measure progress: the update $x_{k+1} = x_k - \eta \nabla f(x_k)$ can be written as

x_{k+1} = \arg\min_x \left\{ \nabla f(x_k)^\top x + \frac{1}{2\eta}\|x - x_k\|^2 \right\}

But Euclidean distance is not always the right geometry. When optimizing over the probability simplex $\Delta = \{x \geq 0 : \sum_i x_i = 1\}$, the Euclidean distance treats points near the boundary the same as interior points, which is wasteful — near a vertex, most directions leave the simplex. Mirror descent replaces $\|x - y\|^2$ with a Bregman divergence matched to the geometry of the domain.

Definition 8 (Bregman Divergence).

Given a strictly convex, differentiable function $\phi : \mathcal{X} \to \mathbb{R}$ (the mirror map), the Bregman divergence is:

D_\phi(x, y) = \phi(x) - \phi(y) - \langle \nabla \phi(y), x - y \rangle

This measures the gap between $\phi(x)$ and its first-order Taylor approximation at $y$. By strict convexity, $D_\phi(x, y) \geq 0$ with equality iff $x = y$.

When $\phi(x) = \frac{1}{2}\|x\|^2$, the Bregman divergence reduces to $D_\phi(x,y) = \frac{1}{2}\|x - y\|^2$ — half the squared Euclidean distance. When $\phi(x) = \sum_i x_i \log x_i$ (the negative entropy), the Bregman divergence is the generalized KL divergence $D_\phi(x, y) = \sum_i \bigl(x_i \log(x_i / y_i) - x_i + y_i\bigr)$, which equals the KL divergence on the simplex.
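Both special cases are easy to confirm numerically. A minimal sketch (the test points are illustrative):

```python
import numpy as np

def bregman(phi, grad_phi, x, y):
    """D_phi(x, y) = phi(x) - phi(y) - <grad phi(y), x - y>."""
    return phi(x) - phi(y) - grad_phi(y) @ (x - y)

# Euclidean mirror map recovers half the squared distance
x, y = np.array([1.0, 2.0]), np.array([0.0, 1.0])
d_euc = bregman(lambda z: 0.5 * z @ z, lambda z: z, x, y)

# negative entropy on the simplex recovers the KL divergence
p, q = np.array([0.2, 0.3, 0.5]), np.array([0.4, 0.4, 0.2])
d_ent = bregman(lambda z: np.sum(z * np.log(z)),
                lambda z: np.log(z) + 1.0, p, q)
print(d_euc, d_ent)
```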

Definition 9 (Mirror Descent).

The mirror descent update with mirror map $\phi$ and step size $\eta$ is:

x_{k+1} = \arg\min_{x \in \mathcal{X}} \left\{ \eta \langle \nabla f(x_k), x \rangle + D_\phi(x, x_k) \right\}

When $\phi(x) = \sum_i x_i \log x_i$ and $\mathcal{X}$ is the simplex, this yields the exponentiated gradient update:

x_{k+1,i} = \frac{x_{k,i} \exp(-\eta \, \nabla_i f(x_k))}{\sum_j x_{k,j} \exp(-\eta \, \nabla_j f(x_k))}

The exponentiated gradient update naturally maintains positivity and the simplex constraint without explicit projection. Mirror descent with negative entropy is the natural algorithm for the simplex: its iterates curve along the simplex boundary rather than cutting through the interior in straight lines.

Proposition 2 (Mirror Descent Convergence).

Let $f$ be convex and $G$-Lipschitz ($\|\nabla f(x)\|_* \leq G$ in the dual norm for all $x \in \mathcal{X}$), and let $\phi$ be $\sigma$-strongly convex with respect to a norm $\|\cdot\|$. Then mirror descent with step size $\eta = \frac{R}{G}\sqrt{\frac{2\sigma}{k}}$ (where $R^2 = \max_{x \in \mathcal{X}} D_\phi(x, x_0)$) satisfies:

\frac{1}{k}\sum_{t=0}^{k-1} \bigl(f(x_t) - f(x^*)\bigr) \leq RG\sqrt{\frac{2}{\sigma k}}

Proof.

From the mirror descent update definition and the three-point identity for Bregman divergence:

\eta \langle \nabla f(x_t), x_t - x^* \rangle \leq D_\phi(x^*, x_t) - D_\phi(x^*, x_{t+1}) + \frac{\eta^2}{2\sigma}\|\nabla f(x_t)\|_*^2

By convexity, $f(x_t) - f(x^*) \leq \langle \nabla f(x_t), x_t - x^* \rangle$. Summing from $t = 0$ to $k-1$, the Bregman terms telescope to $D_\phi(x^*, x_0) - D_\phi(x^*, x_k) \leq R^2$. Using $\|\nabla f(x_t)\|_* \leq G$ and optimizing $\eta$ gives the result.

The connection to Proximal Methods is deep: mirror descent is a special case of the proximal point algorithm with Bregman divergence replacing Euclidean distance. Proximal operators generalize the gradient step to handle non-smooth regularizers like the $\ell_1$ norm.


Mirror descent: Bregman divergence, simplex trajectories, and convergence comparison


Stochastic Gradient Descent

In machine learning, the objective is often an average over a large dataset: $f(x) = \frac{1}{N}\sum_{i=1}^N f_i(x)$. Computing the full gradient $\nabla f(x) = \frac{1}{N}\sum_i \nabla f_i(x)$ requires $N$ gradient evaluations per step, which is expensive when $N$ is large. Stochastic gradient descent (SGD) replaces the full gradient with a noisy but unbiased estimate.

Definition 10 (Stochastic Gradient Descent).

Given an unbiased gradient estimator $g(x)$ satisfying $\mathbb{E}[g(x)] = \nabla f(x)$ and $\mathbb{E}[\|g(x) - \nabla f(x)\|^2] \leq \sigma^2$, stochastic gradient descent performs:

x_{k+1} = x_k - \eta_k \, g(x_k)

The mini-batch variant averages $B$ independent samples: $g_B(x) = \frac{1}{B}\sum_{j=1}^B g_j(x)$, reducing the variance bound to $\sigma^2/B$.

The step size schedule is crucial. With a fixed step size $\eta$, SGD converges to a neighborhood of the optimum but cannot reach it — the noise prevents the iterates from settling down. A decreasing schedule $\eta_k = O(1/k)$ allows convergence to the exact optimum, at the cost of slower progress.
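The $\sigma^2/B$ variance reduction from mini-batch averaging is easy to see empirically. A sketch with synthetic Gaussian gradient noise (all values illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, B, trials = 2.0, 16, 20000

# empirical variance of one noisy gradient coordinate vs. a mini-batch mean
single = rng.normal(0.0, sigma, size=trials)
batch = rng.normal(0.0, sigma, size=(trials, B)).mean(axis=1)
print(single.var(), batch.var())   # roughly sigma^2 and sigma^2 / B
```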

Theorem 6 (SGD Convergence).

Let $f$ be convex and $L$-smooth, and let the gradient estimator have bounded variance $\sigma^2$. Then:

Fixed step size $\eta$:

\frac{1}{K}\sum_{k=0}^{K-1}\mathbb{E}[\|\nabla f(x_k)\|^2] \leq \frac{2(f(x_0) - f(x^*))}{\eta K} + \eta L \sigma^2

Optimizing $\eta = \sqrt{2(f(x_0) - f(x^*))/(KL\sigma^2)}$ gives an $O(1/\sqrt{K})$ rate.

Decreasing step size $\eta_k = c/(k+k_0)$: SGD converges to the optimum at rate $O(1/K)$ for strongly convex objectives.

Proof.

From the smoothness of $f$ and the SGD update:

\mathbb{E}[f(x_{k+1})] \leq f(x_k) - \eta_k \|\nabla f(x_k)\|^2 + \frac{L\eta_k^2}{2}\mathbb{E}[\|g(x_k)\|^2]

Since $\mathbb{E}[\|g(x_k)\|^2] = \|\nabla f(x_k)\|^2 + \mathbb{E}[\|g(x_k) - \nabla f(x_k)\|^2] \leq \|\nabla f(x_k)\|^2 + \sigma^2$ (bias–variance decomposition):

\mathbb{E}[f(x_{k+1})] \leq f(x_k) - \eta_k\left(1 - \frac{L\eta_k}{2}\right)\|\nabla f(x_k)\|^2 + \frac{L\eta_k^2 \sigma^2}{2}

For $\eta_k \leq 1/L$, the factor $(1 - L\eta_k/2) \geq 1/2$. Rearranging and summing over $k$ gives the stated bound. The key tension: the first term decreases with $K$ (more iterations help), but the second term $\eta L\sigma^2$ remains constant for fixed $\eta$. Decreasing $\eta_k$ eliminates the residual at the cost of slower convergence.

Stochastic gradient descent: step size comparison, mini-batch variance reduction, and noisy trajectory


Computational Notes

Here we collect practical implementations. The gradient_descent function below supports fixed step size and Armijo backtracking, and returns the full convergence history for diagnostic plotting.

import numpy as np

def gradient_descent(f, grad_f, x0, eta=None, L=None, tol=1e-8, max_iter=1000,
                     line_search='fixed'):
    """
    General-purpose gradient descent.

    Parameters
    ----------
    f : callable — objective function
    grad_f : callable — gradient of f
    x0 : ndarray — initial point
    eta : float — fixed step size (required if line_search='fixed')
    L : float — Lipschitz constant; used for eta=1/L if eta not given
    tol : float — gradient norm tolerance
    max_iter : int — iteration budget
    line_search : str — 'fixed' or 'armijo'

    Returns
    -------
    dict with keys 'x', 'f_vals', 'grad_norms', 'trajectory', 'n_iter'
    """
    if eta is None and L is not None:
        eta = 1.0 / L
    if eta is None and line_search == 'fixed':
        raise ValueError("line_search='fixed' requires eta or L")

    x = x0.copy()
    f_vals = [f(x)]
    grad_norms = [np.linalg.norm(grad_f(x))]
    trajectory = [x.copy()]

    for k in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:
            break

        if line_search == 'armijo':
            step = armijo_backtracking(f, grad_f, x, -g)
        else:
            step = eta

        x = x - step * g
        f_vals.append(f(x))
        grad_norms.append(np.linalg.norm(grad_f(x)))
        trajectory.append(x.copy())

    return {
        'x': x,
        'f_vals': np.array(f_vals),
        'grad_norms': np.array(grad_norms),
        'trajectory': np.array(trajectory),
        'n_iter': len(f_vals) - 1
    }

The Armijo backtracking function:

def armijo_backtracking(f, grad_f, x, d, alpha0=1.0, beta=0.5, c=1e-4, max_iter=50):
    """Armijo backtracking line search."""
    alpha = alpha0
    fx = f(x)
    gx = grad_f(x)
    for _ in range(max_iter):
        if f(x + alpha * d) <= fx + c * alpha * (gx @ d):
            return alpha
        alpha *= beta
    return alpha

Nesterov accelerated gradient:

def nesterov_gd(A, x0, n_iter=100):
    """Nesterov accelerated gradient for f(x) = 0.5 x^T A x."""
    L = np.linalg.eigvalsh(A)[-1]  # largest eigenvalue = smoothness constant
    eta = 1.0 / L
    x, y = x0.copy(), x0.copy()
    traj = [x0.copy()]
    t = 1.0
    for k in range(n_iter):
        x_new = y - eta * (A @ y)
        t_new = (1 + np.sqrt(1 + 4 * t**2)) / 2
        y = x_new + ((t - 1) / t_new) * (x_new - x)
        x, t = x_new, t_new
        traj.append(x.copy())
    return np.array(traj)

Cyclic coordinate descent:

def cyclic_cd(A, x0, n_iter=50):
    """Cyclic coordinate descent for f(x) = 0.5 x^T A x."""
    x = x0.copy()
    traj = [x.copy()]
    for _ in range(n_iter):
        for i in range(len(x0)):
            grad_i = A[i, :] @ x
            x[i] -= grad_i / A[i, i]
            traj.append(x.copy())
    return np.array(traj)

Mirror descent on the simplex (exponentiated gradient):

def mirror_descent_simplex(grad_f, x0, eta, n_iter=50):
    """Mirror descent with neg-entropy on the probability simplex."""
    x = x0.copy()
    traj = [x.copy()]
    for _ in range(n_iter):
        g = grad_f(x)
        x_new = x * np.exp(-eta * g)
        x_new /= x_new.sum()
        x = x_new
        traj.append(x.copy())
    return np.array(traj)

SGD with mini-batch averaging:

def sgd_minibatch(A, x0, eta_schedule, noise_std=1.0, batch_size=1,
                  n_iter=500, seed=42):
    """SGD with Gaussian gradient noise and mini-batch averaging."""
    rng = np.random.RandomState(seed)
    x = x0.copy()
    traj = [x.copy()]
    for k in range(n_iter):
        eta = eta_schedule(k)
        true_grad = A @ x
        noise = rng.randn(batch_size, len(x)) * noise_std
        x = x - eta * (true_grad + noise.mean(axis=0))
        traj.append(x.copy())
    return np.array(traj)

Convergence diagnostics. When running gradient descent in practice, monitor four quantities: (1) the objective value f(xk)f(x_k), which should decrease monotonically for deterministic GD; (2) the gradient norm f(xk)\|\nabla f(x_k)\|, which indicates proximity to a stationary point; (3) the step size history (if using backtracking); and (4) the trajectory in parameter space (for low-dimensional problems).

Computational diagnostics: Rosenbrock function trajectory, convergence, gradient norm, and step size history


Connections & Further Reading

Gradient descent sits at the center of the Optimization track, building on convex analysis and connecting forward to proximal methods and duality.

  • Convex Analysis — Every convergence proof here relies on convexity and smoothness established in Convex Analysis. The Descent Lemma is the $L$-smooth analogue of the first-order convexity condition; strong convexity provides the quadratic lower bound that yields linear convergence.
  • The Spectral Theorem — The condition number $\kappa = L/\mu$ for quadratics equals $\lambda_{\max}/\lambda_{\min}$ of the Hessian. Ill-conditioned Hessians (large eigenvalue spread) cause the zigzag behavior that slows gradient descent.
  • Singular Value Decomposition — For least-squares problems $\min \|Ax - b\|^2$, the condition number of $A^\top A$ governs GD convergence. The SVD reveals this as $\sigma_{\max}^2/\sigma_{\min}^2$, connecting linear algebra to optimization dynamics.
  • Proximal Methods — Mirror descent generalizes to proximal operators: $\text{prox}_{\eta f}(x) = \arg\min_y \{f(y) + \frac{1}{2\eta}\|y - x\|^2\}$. Proximal gradient descent handles composite objectives $f + g$ where $g$ is non-smooth but has a cheap proximal operator (e.g., $\ell_1$ regularization).
  • Lagrangian Duality & KKT — Constrained optimization extends the unconstrained theory here. The KKT conditions generalize “gradient equals zero” to include constraints, and dual gradient descent operates on the Lagrangian dual function.

The Optimization Track

Convex Analysis
    ├── Gradient Descent & Convergence (this topic)
    │       └── Proximal Methods
    └── Lagrangian Duality & KKT


References & Further Reading