
Gradient Descent and Convexity

Cornell University

Last lecture we proved that GD converges, but that alone doesn't guarantee it minimizes the loss. So when can we say it minimizes the loss? When $\ell = f(x)$ is convex.

Convexity

Definition

Graphically, a function is convex if the line segment drawn between any two points on its graph lies above (or on) the graph.

Consider a function $f: \R^d \to \R$.

0th Order Definition

$$\forall x, y \in \R^d,\ \alpha \in [0, 1]: \quad f(\alpha x + (1 - \alpha) y) \le \alpha f(x) + (1 - \alpha) f(y)$$
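For example, $f(x) = x^2$ satisfies this definition: for any $x, y \in \R$ and $\alpha \in [0, 1]$,

$$\alpha x^2 + (1 - \alpha) y^2 - (\alpha x + (1 - \alpha) y)^2 = \alpha (1 - \alpha) (x - y)^2 \ge 0$$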

1st Order Definition

$$\forall x, y \in \R^d: \quad f(x) + (y - x)^T \nabla f(x) \le f(y)$$

That is, the local linear approximation to $f$ at any point lies below $f$ everywhere.

2nd Order Definition

$$\forall x, u \in \R^d,\ \alpha \in \R: \quad \frac{\partial^2}{\partial \alpha^2} f(x + \alpha u) \ge 0$$

The second derivative along any direction is always nonnegative. Or, in terms of the Hessian,

$$u^T \nabla^2 f(x)\, u \ge 0$$

which is equivalent to saying the Hessian matrix $H(x) = \nabla^2 f(x)$ is positive semi-definite.
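As a quick sanity check of this second-order definition, here is a minimal numpy sketch (the function and test point are illustrative assumptions, not from the lecture): it estimates the Hessian of log-sum-exp, a standard smooth convex function, by finite differences and confirms it is positive semi-definite.

```python
import numpy as np

# f here is log-sum-exp, a standard smooth convex function used purely as
# an illustration (it is not a function from the lecture).
def f(x):
    return np.log(np.sum(np.exp(x)))

def hessian_fd(f, x, h=1e-4):
    """Estimate the Hessian of f at x with central finite differences."""
    d = len(x)
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            ei, ej = np.eye(d)[i], np.eye(d)[j]
            H[i, j] = (f(x + h*ei + h*ej) - f(x + h*ei - h*ej)
                       - f(x - h*ei + h*ej) + f(x - h*ei - h*ej)) / (4 * h * h)
    return H

x = np.random.default_rng(0).standard_normal(4)
H = hessian_fd(f, x)
# PSD up to numerical error <=> smallest eigenvalue is (nearly) nonnegative
print(np.linalg.eigvalsh(H).min() >= -1e-6)   # True
```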

Good Properties

We want to optimize over convex functions because they have the really nice property that a local minimum is a global minimum.

There is a neat proof directly from the 1st order definition: take a point $x$ such that $\nabla f(x) = 0$, i.e. $x$ is a local minimum. From the definition, for any $y$ we have

$$
\begin{align}
f(x) + (y - x)^T \nabla f(x) &\le f(y)\\
f(x) + (y - x)^T 0 &\le f(y)\\
f(x) &\le f(y)
\end{align}
$$

So $x$ is also a global minimum.

$\mu$-Strongly Convex

Definition

For $\mu \in \R$, $\mu > 0$, we call a function $f$ $\mu$-strongly convex when:

1st Order Definition

$$f(x) + (y - x)^T \nabla f(x) + \frac{\mu}{2} \|y - x\|^2 \le f(y)$$

For strong convexity, $f$ not only has to lie above its local linear approximation, it has to lie above that approximation plus a parabola with positive curvature $\mu$.

2nd Order Definition

$$\forall x, u \in \R^d,\ \alpha \in \R \text{ s.t. } \|u\|^2 = 1: \quad \frac{\partial^2}{\partial \alpha^2} f(x + \alpha u) \ge \mu$$

This is perhaps the most natural way to define it: the curvature in every direction is at least $\mu$.

Another Very Important Definition

$$\exists g \text{ convex s.t. } f(w) = g(w) + \frac{\mu}{2} \|w\|^2$$

This is very useful in ML because it means any function is $\mu$-strongly convex as long as it can be written as a convex function plus an L2 regularizer.
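For instance, ridge regression fits this pattern exactly. Here is a minimal sketch, assuming made-up data `X`, `Y` and a hypothetical regularization strength `mu`: the least-squares term plays the role of the convex $g$, and the regularizer shifts every Hessian eigenvalue up by $\mu$.

```python
import numpy as np

# Minimal sketch with made-up data: ridge regression is "convex g(w) plus
# an L2 regularizer", hence mu-strongly convex by the definition above.
rng = np.random.default_rng(0)
n, d, mu = 50, 3, 0.1               # mu: hypothetical regularization strength
X, Y = rng.standard_normal((n, d)), rng.standard_normal(n)

def ridge_loss(w):
    g = np.mean((X @ w - Y) ** 2)   # g(w): convex least-squares term
    return g + (mu / 2) * (w @ w)   # plus (mu/2)||w||^2

print(ridge_loss(np.zeros(d)))      # usable like any other loss

# For this quadratic the Hessian is (2/n) X^T X + mu I; X^T X is PSD, so the
# smallest eigenvalue is at least mu, matching the 2nd-order definition.
H = (2 / n) * X.T @ X + mu * np.eye(d)
print(np.linalg.eigvalsh(H).min() >= mu - 1e-9)   # True
```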

PL property

If a function is $\mu$-strongly convex, it satisfies the Polyak-Lojasiewicz (PL) property, which says (where $f^*$ is the global minimum value):

$$\left\| \nabla f(x) \right\|^2 \ge 2 \mu \left( f(x) - f^* \right)$$

This inequality says that our function’s gradient is only close to 0 when we are close to the global minimum.
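To make this concrete, here is a small numpy sketch (the quadratic and its data are made up for illustration) that checks the PL inequality at many random points:

```python
import numpy as np

# Sketch: check the PL inequality at random points for a made-up
# mu-strongly convex quadratic f(x) = 0.5 x^T A x, whose minimum is f* = 0.
rng = np.random.default_rng(0)
B = rng.standard_normal((5, 5))
A = B @ B.T + 0.5 * np.eye(5)       # symmetric positive definite
mu = np.linalg.eigvalsh(A).min()    # strong-convexity constant of f
f_star = 0.0                        # f is minimized at x = 0

f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

for _ in range(1000):
    x = rng.standard_normal(5)
    g = grad(x)
    assert g @ g >= 2 * mu * (f(x) - f_star) - 1e-9
print("PL inequality held at all 1000 sampled points")
```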

GD Converge Linearly on PL Function

At the end of last lecture's convergence proof, we had this inequality:

$$f(w_{t+1}) \le f(w_t) - \alpha \left(1 - \frac{\alpha L}{2} \right) \| \nabla f(w_t) \|^2$$

By assuming that $\alpha L \le 1$, we concluded that GD decreases the loss at each step by an amount proportional to $\alpha$ and the squared norm of the gradient. What happens when we have a PL function?

To simplify this inequality first, note that $\alpha L \le 1$ implies $-\left(1 - \frac{\alpha L}{2}\right) \le -\frac{1}{2}$, so

$$f(w_{t+1}) \le f(w_t) - \frac{\alpha}{2} \| \nabla f(w_t) \|^2$$

With the PL property, we can replace the squared gradient norm by the gap to the global minimum:

$$
\begin{align}
f(w_{t+1}) &\le f(w_t) - \mu \alpha \left( f(w_t) - f^* \right)\\
f(w_{t+1}) - f^* &\le (f(w_t) - f^*) - \mu \alpha \left( f(w_t) - f^* \right)\\
f(w_{t+1}) - f^* &\le (1 - \mu \alpha) \left( f(w_t) - f^* \right)
\end{align}
$$

Call this multiplicative factor $\sigma = 1 - \mu \alpha$, so we have

$$f(w_{t+1}) - f^* \le \sigma \left( f(w_t) - f^* \right)$$

Since $\mu \le L$ (the curvature lower bound cannot exceed the upper bound) and $\alpha L \le 1$, we have $\mu \alpha \le 1$, so $\sigma \in [0, 1)$. We can therefore safely apply this inequality recursively $T$ times and get

$$f(w_T) - f^* \le \sigma^T \left( f(w_0) - f^* \right)$$

Recall $\sigma = 1 - \mu \alpha$, and that $1 - x \le e^{-x}$ holds for all $x$, so we can rewrite the above inequality as:

$$f(w_T) - f^* \le \exp(-\mu \alpha)^T \left( f(w_0) - f^* \right) = \exp(-T \mu \alpha) \left( f(w_0) - f^* \right)$$

This shows that, for $\mu$-strongly convex functions, gradient descent with a constant step size converges exponentially quickly to the optimum. This is sometimes called convergence at a linear rate. (Yes, it's confusing to call an exponential convergence rate "linear"; the name comes from numerical optimization, where it means the error shrinks by a constant factor per iteration, i.e. linearly on a log scale.)
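A short numpy sketch of this rate on a synthetic strongly convex quadratic (all problem data below is made up): at every step, the suboptimality gap stays below the $\exp(-T \mu \alpha)$ bound we just derived.

```python
import numpy as np

# A minimal sketch of the linear rate on a synthetic strongly convex
# quadratic f(w) = 0.5 w^T A w (all problem data below is made up).
rng = np.random.default_rng(0)
B = rng.standard_normal((10, 10))
A = B @ B.T + np.eye(10)            # symmetric positive definite Hessian
L = np.linalg.eigvalsh(A).max()     # smoothness constant
mu = np.linalg.eigvalsh(A).min()    # strong-convexity constant

f = lambda w: 0.5 * w @ A @ w       # minimized at w* = 0 with f* = 0
grad = lambda w: A @ w

alpha = 1 / L                       # constant step size with alpha * L <= 1
w = rng.standard_normal(10)
gap0 = f(w)
for T in range(1, 51):
    w = w - alpha * grad(w)
    # check the bound we just derived: gap_T <= exp(-T * mu * alpha) * gap_0
    assert f(w) <= np.exp(-T * mu * alpha) * gap0 + 1e-9
print("final suboptimality gap:", f(w))
```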

What's really interesting is the following.

Now take a closer look at the value $\sigma = 1 - \mu \alpha$. To get the quickest convergence, we know from last time that we need to set $\alpha = 1/L$, so we have

$$\alpha = \frac{1}{L}, \quad \sigma = 1 - \frac{\mu}{L}$$

$$f(w_T) - f^* \le \exp\left(-\frac{\mu T}{L}\right) \left( f(w_0) - f^* \right)$$

Set our goal as follows: for a given margin $\epsilon > 0$, by running GD for $T$ steps, we can output a prediction $\hat w$ such that $f(\hat w) - f^* \le \epsilon$, i.e.

$$f(w_T) - f^* \le \epsilon$$

It suffices to make the exponential upper bound at most $\epsilon$:

$$\exp\left(-\frac{\mu T}{L}\right) \left( f(w_0) - f^* \right) \le \epsilon$$

Solving for $T$, we get

$$T \ge \frac{L}{\mu} \log\left( \frac{f(w_0) - f^*}{\epsilon} \right)$$

In CS, when we see a $\log$ term, we tend to say we can ignore it, so this inequality tells us that the number of iterations GD needs essentially doesn't depend on the initialization $w_0$ or on how accurate we want our solution to be ($\epsilon$); those enter only through the log. It depends instead on $\frac{L}{\mu}$.
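For example (with made-up numbers): if $\frac{L}{\mu} = 100$ and $\frac{f(w_0) - f^*}{\epsilon} = 10^6$, we need $T \ge 100 \log(10^6) \approx 1382$ iterations, and demanding $10\times$ more accuracy adds only $100 \log 10 \approx 230$ more.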

Condition Number

We will call $\kappa = \frac{L}{\mu}$ the condition number of our problem. The condition number encodes how hard a strongly convex problem is to solve; the ratio is invariant to rescaling the objective function $f$. Recall the definitions of $L$ and $\mu$: $L$ is an upper bound on $f$'s second directional derivative and $\mu$ is a lower bound.

$$\forall x, u \in \R^d,\ \alpha \in \R \text{ s.t. } \|u\|^2 = 1: \quad \left|\frac{\partial^2}{\partial \alpha^2} f(x + \alpha u)\right| \le L$$

$$\forall x, u \in \R^d,\ \alpha \in \R \text{ s.t. } \|u\|^2 = 1: \quad \frac{\partial^2}{\partial \alpha^2} f(x + \alpha u) \ge \mu$$

The name comes from numerical analysis, where the condition number of a matrix is the ratio between its largest and smallest singular values. The $\kappa$ we have here is indeed the condition number of the Hessian matrix of our loss. Since the Hessian of a convex function is symmetric positive semi-definite, its singular values are its eigenvalues, so its condition number is just the ratio between the biggest and smallest eigenvalue.

The in-class demo showed an example where running GD on linear regression works horribly on raw data but works well on normalized data. One nice thing about linear regression is that we can calculate the exact Hessian. Looking at the Hessian, we see that with unnormalized data the condition number is huge ($\sim 10^{14}$), so GD takes forever to converge.

$$
\begin{align}
\ell(w) &= \frac{1}{n} \sum_{i=1}^n (x_i^T w - y_i)^2 = \frac{1}{n} \|Xw - Y\|^2\\
\nabla \ell(w) &= \frac{2}{n} \sum_{i=1}^n x_i (x_i^T w - y_i) = \frac{2}{n} X^T (Xw - Y)\\
\nabla^2 \ell(w) &= \frac{2}{n} \sum_{i=1}^n x_i x_i^T = \frac{2}{n} X^T X
\end{align}
$$
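We can reproduce the flavor of that demo with synthetic stand-in data (the feature scales below are made up, chosen so the raw condition number lands near $10^{14}$):

```python
import numpy as np

# Stand-in for the demo: features on wildly different (made-up) scales give
# the Hessian (2/n) X^T X a huge condition number; normalization fixes it.
rng = np.random.default_rng(0)
n = 1000
scales = np.array([1e-3, 1.0, 1e4])               # hypothetical feature scales
X_raw = rng.standard_normal((n, 3)) * scales

def hessian_cond(X):
    H = (2 / X.shape[0]) * X.T @ X                # exact Hessian derived above
    eig = np.linalg.eigvalsh(H)
    return eig.max() / eig.min()

X_norm = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)
print(f"raw:        kappa = {hessian_cond(X_raw):.2e}")    # ~1e14
print(f"normalized: kappa = {hessian_cond(X_norm):.2e}")   # ~1
```

Mean-centering and dividing by the per-feature standard deviation puts all the Hessian eigenvalues on the same scale, which is exactly why the normalized run converged quickly.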

Time Complexity

We conclude that running gradient descent takes $O(\kappa)$ steps and $O(nd\kappa)$ total time ($O(nd)$ time to compute each gradient step).