
Gradient Descent and Convexity

Cornell University

Last lecture we proved that GD converges, but that alone doesn't guarantee it minimizes the loss. So when can we say it minimizes the loss? When $\ell = f(x)$ is convex.

Convexity

Definition

Graphically, a function is convex if the line segment drawn between any two points on its graph lies above (or on) the graph.

Consider a function $f: \R^d \to \R$.

0th Order Definition

$$\forall x, y \in \R^d,\ \alpha \in [0, 1]: \quad f(\alpha x + (1 - \alpha) y) \le \alpha f(x) + (1 - \alpha) f(y)$$
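For example, $f(x) = x^2$ satisfies this definition: for any $x, y \in \R$ and $\alpha \in [0, 1]$,

$$\alpha x^2 + (1 - \alpha) y^2 - (\alpha x + (1 - \alpha) y)^2 = \alpha (1 - \alpha) (x - y)^2 \ge 0$$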

1st Order Definition

$$\forall x, y \in \R^d: \quad f(x) + (y - x)^T \nabla f(x) \le f(y)$$

That is, the local linear approximation to $f$ at any point lies below $f$ everywhere.

2nd Order Definition

$$\forall x, u \in \R^d,\ \alpha \in \R: \quad \frac{\partial^2}{\partial \alpha^2} f(x + \alpha u) \ge 0$$

The second derivative along any direction is always nonnegative. Or, in terms of the Hessian,

$$u^T \nabla^2 f(x)\, u \ge 0$$

which is equivalent to saying the Hessian matrix $H(x) = \nabla^2 f(x)$ is positive semi-definite.
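As a quick sanity check of this second-order definition, here is a minimal numpy sketch (the function and test point are illustrative assumptions, not from the lecture): it estimates the Hessian of log-sum-exp, a standard smooth convex function, by finite differences and confirms it is positive semi-definite.

```python
import numpy as np

# f here is log-sum-exp, a standard smooth convex function used purely as
# an illustration (it is not a function from the lecture).
def f(x):
    return np.log(np.sum(np.exp(x)))

def hessian_fd(f, x, h=1e-4):
    """Estimate the Hessian of f at x with central finite differences."""
    d = len(x)
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            ei, ej = np.eye(d)[i], np.eye(d)[j]
            H[i, j] = (f(x + h*ei + h*ej) - f(x + h*ei - h*ej)
                       - f(x - h*ei + h*ej) + f(x - h*ei - h*ej)) / (4 * h * h)
    return H

x = np.random.default_rng(0).standard_normal(4)
H = hessian_fd(f, x)
# PSD up to numerical error <=> smallest eigenvalue is (nearly) nonnegative
print(np.linalg.eigvalsh(H).min() >= -1e-6)   # True
```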

Good Properties

We want to optimize over convex functions because they have the really nice property that a local minimum is a global minimum.

There is a neat proof directly from the 1st order definition: take a point $x$ such that $\nabla f(x) = 0$, i.e. $x$ is a local minimum. From the definition, for any $y$ we have

$$
\begin{align}
f(x) + (y - x)^T \nabla f(x) &\le f(y)\\
f(x) + (y - x)^T 0 &\le f(y)\\
f(x) &\le f(y)
\end{align}
$$

So $x$ is also a global minimum.

$\mu$-Strongly Convex

Definition

For $\mu \in \R$, $\mu > 0$, we call a function $f$ $\mu$-strongly convex when:

1st Order Definition

$$f(x) + (y - x)^T \nabla f(x) + \frac{\mu}{2} \|y - x\|^2 \le f(y)$$

For strong convexity, $f$ not only has to lie above its local linear approximation, it has to lie above that approximation plus a parabola with positive curvature $\mu$.

2nd Order Definition

$$\forall x, u \in \R^d,\ \alpha \in \R \text{ s.t. } \|u\|^2 = 1: \quad \frac{\partial^2}{\partial \alpha^2} f(x + \alpha u) \ge \mu$$

This is perhaps the most natural way to define it: the curvature in every direction is at least $\mu$.

Another Very Important Definition

$$\exists g \text{ convex s.t. } f(w) = g(w) + \frac{\mu}{2} \|w\|^2$$

This is very useful in ML because it means any function is $\mu$-strongly convex as long as it can be written as a convex function plus an L2 regularizer.
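For instance, ridge regression fits this pattern exactly. Here is a minimal sketch, assuming made-up data `X`, `Y` and a hypothetical regularization strength `mu`: the least-squares term plays the role of the convex $g$, and the regularizer shifts every Hessian eigenvalue up by $\mu$.

```python
import numpy as np

# Minimal sketch with made-up data: ridge regression is "convex g(w) plus
# an L2 regularizer", hence mu-strongly convex by the definition above.
rng = np.random.default_rng(0)
n, d, mu = 50, 3, 0.1               # mu: hypothetical regularization strength
X, Y = rng.standard_normal((n, d)), rng.standard_normal(n)

def ridge_loss(w):
    g = np.mean((X @ w - Y) ** 2)   # g(w): convex least-squares term
    return g + (mu / 2) * (w @ w)   # plus (mu/2)||w||^2

print(ridge_loss(np.zeros(d)))      # usable like any other loss

# For this quadratic the Hessian is (2/n) X^T X + mu I; X^T X is PSD, so the
# smallest eigenvalue is at least mu, matching the 2nd-order definition.
H = (2 / n) * X.T @ X + mu * np.eye(d)
print(np.linalg.eigvalsh(H).min() >= mu - 1e-9)   # True
```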

PL property

If a function is $\mu$-strongly convex, it satisfies the Polyak-Lojasiewicz (PL) property, which says (where $f^*$ is the global minimum value):

$$\left\| \nabla f(x) \right\|^2 \ge 2 \mu \left( f(x) - f^* \right)$$

This inequality says that our function’s gradient is only close to 0 when we are close to the global minimum.
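To make this concrete, here is a small numpy sketch (the quadratic and its data are made up for illustration) that checks the PL inequality at many random points:

```python
import numpy as np

# Sketch: check the PL inequality at random points for a made-up
# mu-strongly convex quadratic f(x) = 0.5 x^T A x, whose minimum is f* = 0.
rng = np.random.default_rng(0)
B = rng.standard_normal((5, 5))
A = B @ B.T + 0.5 * np.eye(5)       # symmetric positive definite
mu = np.linalg.eigvalsh(A).min()    # strong-convexity constant of f
f_star = 0.0                        # f is minimized at x = 0

f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

for _ in range(1000):
    x = rng.standard_normal(5)
    g = grad(x)
    assert g @ g >= 2 * mu * (f(x) - f_star) - 1e-9
print("PL inequality held at all 1000 sampled points")
```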

GD Converge Linearly on PL Function

At the end of last lecture's convergence proof, we had this inequality:

$$f(w_{t+1}) \le f(w_t) - \alpha \left(1 - \frac{\alpha L}{2} \right) \| \nabla f(w_t) \|^2$$

By assuming that $\alpha L \le 1$, we concluded that GD decreases the loss at each step by an amount proportional to $\alpha$ and the squared norm of the gradient. What happens when we have a PL function?

To simplify this inequality first, note that $\alpha L \le 1$ implies $-\left(1 - \frac{\alpha L}{2}\right) \le -\frac{1}{2}$, so

$$f(w_{t+1}) \le f(w_t) - \frac{\alpha}{2} \| \nabla f(w_t) \|^2$$

With the PL property, we can replace the squared gradient norm by the gap to the global minimum:

$$
\begin{align}
f(w_{t+1}) &\le f(w_t) - \mu \alpha \left( f(w_t) - f^* \right)\\
f(w_{t+1}) - f^* &\le (f(w_t) - f^*) - \mu \alpha \left( f(w_t) - f^* \right)\\
f(w_{t+1}) - f^* &\le (1 - \mu \alpha) \left( f(w_t) - f^* \right)
\end{align}
$$

Call this multiplicative factor $\sigma = 1 - \mu \alpha$, so we have

$$f(w_{t+1}) - f^* \le \sigma \left( f(w_t) - f^* \right)$$

Since $\mu \le L$ (the curvature lower bound cannot exceed the upper bound) and $\alpha L \le 1$, we have $\mu \alpha \le 1$, so $\sigma \in [0, 1)$. We can therefore safely apply this inequality recursively $T$ times and get

$$f(w_T) - f^* \le \sigma^T \left( f(w_0) - f^* \right)$$

Recall $\sigma = 1 - \mu \alpha$, and that $1 - x \le e^{-x}$ holds for all $x$, so we can rewrite the above inequality as:

$$f(w_T) - f^* \le \exp(-\mu \alpha)^T \left( f(w_0) - f^* \right) = \exp(-T \mu \alpha) \left( f(w_0) - f^* \right)$$

This shows that, for $\mu$-strongly convex functions, gradient descent with a constant step size converges exponentially quickly to the optimum. This is sometimes called convergence at a linear rate. (Yes, it's confusing to call an exponential convergence rate "linear"; the name comes from numerical optimization, where it means the error shrinks by a constant factor per iteration, i.e. linearly on a log scale.)
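A short numpy sketch of this rate on a synthetic strongly convex quadratic (all problem data below is made up): at every step, the suboptimality gap stays below the $\exp(-T \mu \alpha)$ bound we just derived.

```python
import numpy as np

# A minimal sketch of the linear rate on a synthetic strongly convex
# quadratic f(w) = 0.5 w^T A w (all problem data below is made up).
rng = np.random.default_rng(0)
B = rng.standard_normal((10, 10))
A = B @ B.T + np.eye(10)            # symmetric positive definite Hessian
L = np.linalg.eigvalsh(A).max()     # smoothness constant
mu = np.linalg.eigvalsh(A).min()    # strong-convexity constant

f = lambda w: 0.5 * w @ A @ w       # minimized at w* = 0 with f* = 0
grad = lambda w: A @ w

alpha = 1 / L                       # constant step size with alpha * L <= 1
w = rng.standard_normal(10)
gap0 = f(w)
for T in range(1, 51):
    w = w - alpha * grad(w)
    # check the bound we just derived: gap_T <= exp(-T * mu * alpha) * gap_0
    assert f(w) <= np.exp(-T * mu * alpha) * gap0 + 1e-9
print("final suboptimality gap:", f(w))
```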

What's really interesting is the following.

Now take a closer look at the value $\sigma = 1 - \mu \alpha$. To get the quickest convergence, we know from last time that we need to set $\alpha = 1/L$, so we have

$$\alpha = \frac{1}{L}, \quad \sigma = 1 - \frac{\mu}{L}$$

$$f(w_T) - f^* \le \exp\left(-\frac{\mu T}{L}\right) \left( f(w_0) - f^* \right)$$

Set our goal as follows: for a given margin $\epsilon > 0$, by running GD for $T$ steps, we can output a prediction $\hat w$ such that $f(\hat w) - f^* \le \epsilon$, i.e.

$$f(w_T) - f^* \le \epsilon$$

It suffices to make the exponential upper bound at most $\epsilon$:

$$\exp\left(-\frac{\mu T}{L}\right) \left( f(w_0) - f^* \right) \le \epsilon$$

Solving for $T$, we get

$$T \ge \frac{L}{\mu} \log\left( \frac{f(w_0) - f^*}{\epsilon} \right)$$

In CS, when we see a $\log$ term, we tend to say we can ignore it, so this inequality tells us that the number of iterations GD needs essentially doesn't depend on the initialization $w_0$ or on how accurate we want our solution to be ($\epsilon$); those enter only through the log. It depends instead on $\frac{L}{\mu}$.
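For example (with made-up numbers): if $\frac{L}{\mu} = 100$ and $\frac{f(w_0) - f^*}{\epsilon} = 10^6$, we need $T \ge 100 \log(10^6) \approx 1382$ iterations, and demanding $10\times$ more accuracy adds only $100 \log 10 \approx 230$ more.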

Condition Number

We will call $\kappa = \frac{L}{\mu}$ the condition number of our problem. The condition number encodes how hard a strongly convex problem is to solve; the ratio is invariant to rescaling the objective function $f$. Recall the definitions of $L$ and $\mu$: $L$ is an upper bound on $f$'s second directional derivative and $\mu$ is a lower bound.

$$\forall x, u \in \R^d,\ \alpha \in \R \text{ s.t. } \|u\|^2 = 1: \quad \left|\frac{\partial^2}{\partial \alpha^2} f(x + \alpha u)\right| \le L$$

$$\forall x, u \in \R^d,\ \alpha \in \R \text{ s.t. } \|u\|^2 = 1: \quad \frac{\partial^2}{\partial \alpha^2} f(x + \alpha u) \ge \mu$$

The name comes from numerical analysis, where the condition number of a matrix is the ratio between its largest and smallest singular values. The $\kappa$ we have here is indeed the condition number of the Hessian matrix of our loss. Since the Hessian of a convex function is symmetric positive semi-definite, its singular values are its eigenvalues, so its condition number is just the ratio between the biggest and smallest eigenvalue.

The in-class demo showed an example where running GD on linear regression works horribly on raw data but works well on normalized data. One nice thing about linear regression is that we can calculate the exact Hessian. Looking at the Hessian, we see that with unnormalized data the condition number is huge ($\sim 10^{14}$), so GD takes forever to converge.

$$
\begin{align}
\ell(w) &= \frac{1}{n} \sum_{i=1}^n (x_i^T w - y_i)^2 = \frac{1}{n} \|Xw - Y\|^2\\
\nabla \ell(w) &= \frac{2}{n} \sum_{i=1}^n x_i (x_i^T w - y_i) = \frac{2}{n} X^T (Xw - Y)\\
\nabla^2 \ell(w) &= \frac{2}{n} \sum_{i=1}^n x_i x_i^T = \frac{2}{n} X^T X
\end{align}
$$
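We can reproduce the flavor of that demo with synthetic stand-in data (the feature scales below are made up, chosen so the raw condition number lands near $10^{14}$):

```python
import numpy as np

# Stand-in for the demo: features on wildly different (made-up) scales give
# the Hessian (2/n) X^T X a huge condition number; normalization fixes it.
rng = np.random.default_rng(0)
n = 1000
scales = np.array([1e-3, 1.0, 1e4])               # hypothetical feature scales
X_raw = rng.standard_normal((n, 3)) * scales

def hessian_cond(X):
    H = (2 / X.shape[0]) * X.T @ X                # exact Hessian derived above
    eig = np.linalg.eigvalsh(H)
    return eig.max() / eig.min()

X_norm = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)
print(f"raw:        kappa = {hessian_cond(X_raw):.2e}")    # ~1e14
print(f"normalized: kappa = {hessian_cond(X_norm):.2e}")   # ~1
```

Mean-centering and dividing by the per-feature standard deviation puts all the Hessian eigenvalues on the same scale, which is exactly why the normalized run converged quickly.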

Time Complexity

We conclude that running gradient descent takes $O(\kappa)$ steps and $O(nd\kappa)$ total time ($O(nd)$ time to compute each gradient step).