
Gradient Descent

Cornell University

Optimization Methods

Empirical Risk Minimization

For a model / hypothesis $h$ and its associated parameters $w \in \R^d$, we can write out its loss in the following form

$$
f(w) = \frac{1}{|\mathcal{D}|} \sum_{x \in \mathcal{D}} f(w; x)
$$
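As a concrete illustration, here is a minimal sketch of evaluating this empirical risk with NumPy, assuming a squared-error per-example loss on a toy linear model (the names `per_example_loss` and `empirical_risk` and the synthetic data are illustrative choices, not part of the original notes).

```python
import numpy as np

def per_example_loss(w, x, y):
    """Squared-error loss f(w; x) for one example (x, y) under a linear model."""
    return 0.5 * (x @ w - y) ** 2

def empirical_risk(w, X, Y):
    """f(w) = (1/|D|) * sum over x in D of f(w; x): the average loss on the dataset."""
    return np.mean([per_example_loss(w, x, y) for x, y in zip(X, Y)])

# Toy dataset: n = 100 examples in d = 3 dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
Y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)
print(empirical_risk(np.zeros(3), X, Y))
```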

Gradient Descent

With $w_0$ initialized to some value, we then update $w$ at each step according to the following rule

$$
w_{t+1} = w_t - \alpha_t \cdot \nabla f(w_t) = w_t - \alpha_t \cdot \frac{1}{|\mathcal{D}|} \sum_{x \in \mathcal{D}} \nabla f(w_t; x)
$$

where $\alpha_t$ is the step size (learning rate).

GD takes $O(nd)$ time (where $n = |\mathcal{D}|$) and $O(d)$ memory per iteration.
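Below is a minimal sketch of the update loop with a constant step size, reusing the toy linear-model setup and `empirical_risk` from the previous snippet (the gradient formula and the values of `alpha` and `T` are illustrative assumptions).

```python
def gradient(w, X, Y):
    """Full-batch gradient of the empirical risk for the squared-error linear model."""
    return X.T @ (X @ w - Y) / len(Y)

def gradient_descent(w0, X, Y, alpha=0.1, T=100):
    """Run T steps of w_{t+1} = w_t - alpha * grad f(w_t)."""
    w = w0.copy()
    for _ in range(T):
        w = w - alpha * gradient(w, X, Y)
    return w

w_gd = gradient_descent(np.zeros(3), X, Y)
print(empirical_risk(w_gd, X, Y))  # should be much lower than the risk at w = 0
```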

Newton’s Method

$$
w_{t+1} = w_t - H(w_t) \nabla f(w_t).
$$

Here $H(w_t) = \left( \nabla^2 f(w_t) \right)^{-1}$ is the inverse of the Hessian matrix $\nabla^2 f(w_t)$, the matrix of second-order partial derivatives of $f$, so the update is sometimes written directly as $w_{t+1} = w_t - \left( \nabla^2 f(w_t) \right)^{-1} \nabla f(w_t)$.

Newton’s method takes $O(nd^2 + d^3)$ time and $O(d^2)$ memory per iteration. Beyond the requirement that second-order derivatives exist, the enormous extra time spent computing (and inverting) the Hessian when $d$ gets big is the main reason people avoid Newton’s method.
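For comparison, here is a minimal sketch of one Newton step on the same toy least-squares problem, whose Hessian is $X^T X / n$; solving the linear system instead of forming the explicit inverse is a standard numerical choice. This is illustrative and reuses `gradient` and `empirical_risk` from the snippets above.

```python
def hessian(w, X, Y):
    """Hessian of the empirical risk for the squared-error linear model: X^T X / n."""
    return X.T @ X / len(Y)

def newton_step(w, X, Y):
    """One Newton update w - H^{-1} grad, computed by solving the d x d system H p = grad."""
    g = gradient(w, X, Y)
    H = hessian(w, X, Y)
    return w - np.linalg.solve(H, g)  # O(d^3) solve, avoids the explicit inverse

w_newton = newton_step(np.zeros(3), X, Y)
print(empirical_risk(w_newton, X, Y))  # the objective is quadratic, so one step reaches the minimum
```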

Gradient Descent Converges

L-Smooth

We can prove that gradient descent converges under the assumption that the second derivative of the objective is bounded. Formally, such a function is called L-smooth.

We call a function L-smooth when its gradient is L-Lipschitz continuous. That is

$$
\| \nabla f(w) - \nabla f(u) \|_2 \le L \| w - u \|_2
$$

From the above definition, we can derive the following property, which guarantees that a small change in the weights only results in a small change in the gradient (the rate at which the gradient can change is bounded by $L$). This is the intuition behind why we need L-smoothness for GD to converge.

$$
\forall \mathbf w, \mathbf u \in \R^d,\ \alpha \in \R \ \text{ s.t. } \|\mathbf u\| = 1, \quad \left| \frac{\partial^2}{\partial\alpha^2} f(\mathbf w + \alpha \mathbf u) \right| \le L
$$

For example, take $f(x) = \sin(x)$: we have $|f''(x)| = |-\sin(x)| \le 1$, so $f$ is 1-smooth and $f'$ is 1-Lipschitz continuous. As another example, take $g(x) = x^3$: then $g'(x) = 3x^2$ and $g''(x) = 6x$, so a small change in $x$ can produce an arbitrarily large change in the gradient, which is bad for gradient descent.
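A quick numerical illustration of the difference (the sample points and the offset of $0.01$ are arbitrary choices): the gradient-change ratio $|f'(w) - f'(u)| / |w - u|$ stays below 1 for $\sin$, but grows without bound for $x^3$.

```python
import numpy as np

def lipschitz_ratio(grad, w, u):
    """|grad(w) - grad(u)| / |w - u|, which L-smoothness bounds by L."""
    return abs(grad(w) - grad(u)) / abs(w - u)

f_prime = np.cos                # derivative of sin(x), bounded
g_prime = lambda x: 3 * x ** 2  # derivative of x^3, unbounded slope

for x in [1.0, 10.0, 100.0]:
    print(lipschitz_ratio(f_prime, x, x + 0.01),  # stays <= 1
          lipschitz_ratio(g_prime, x, x + 0.01))  # grows roughly like 6x
```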

Proof

Assume there exists a global minimum value $f^*$ such that $\forall w \in \R^d,\; f(w) \ge f^*$.

First, we look at how the loss changes from step $t$ to step $t+1$.

$$
f(w_{t+1}) = f(w_t - \alpha \nabla f(w_t))
$$

By the Fundamental Theorem of Calculus, $h(a) = h(0) + \int^a_0 h'(t)\, dt$. Applying this with $h(\eta) = f(w_t - \eta \nabla f(w_t))$, so that $h(0) = f(w_t)$ and $h(\alpha) = f(w_{t+1})$, we get

$$
f(w_{t+1}) = f(w_t) + \int^\alpha_0 \frac{\partial}{\partial\eta} f(w_t - \eta \nabla f(w_t)) \; d\eta
$$

We rewrite this equation with $\hat w = w_t - \eta \nabla f(w_t)$ (refer to the appendix for a detailed derivation of the derivative inside the integral)

$$
f(w_{t+1}) = f(w_t) + \int^\alpha_0 -\nabla f(\hat w)^T \nabla f(w_t) \; d\eta
$$

Subtract and add $\nabla f(w_t)^T \nabla f(w_t)$ inside the integral

$$
\begin{align}
f(w_{t+1}) &= f(w_t) + \int^\alpha_0 - \nabla f(w_t)^T \nabla f(w_t) + \nabla f(w_t)^T \nabla f(w_t) - \nabla f(\hat w)^T \nabla f(w_t) \; d\eta \\
&= f(w_t) + \int^\alpha_0 - \nabla f(w_t)^T \nabla f(w_t) + (\nabla f(w_t)^T - \nabla f(\hat w)^T) \nabla f(w_t) \; d\eta \\
&= f(w_t) - \int^\alpha_0 \nabla f(w_t)^T \nabla f(w_t) \; d\eta + \int^\alpha_0 (\nabla f(w_t) - \nabla f(\hat w))^T \nabla f(w_t) \; d\eta
\end{align}
$$

Evaluate the first integral and bound the second integrand using the Cauchy–Schwarz inequality

$$
f(w_{t+1}) \le f(w_t) - \alpha \nabla f(w_t)^T \nabla f(w_t) + \int_{0}^{\alpha} \| \nabla f(w_t) \| \cdot \| \nabla f(w_t) - \nabla f(w_t - \eta \nabla f(w_t)) \| \; d\eta
$$

Since our function is L-smooth,

$$
\begin{align}
f(w_{t+1}) &\le f(w_t) - \alpha \nabla f(w_t)^T \nabla f(w_t) + \int_{0}^{\alpha} \| \nabla f(w_t) \| \cdot L \| w_t - (w_t - \eta \nabla f(w_t)) \| \; d\eta \\
&\le f(w_t) - \alpha \nabla f(w_t)^T \nabla f(w_t) + \int_{0}^{\alpha} \| \nabla f(w_t) \| \cdot L \| \eta \nabla f(w_t) \| \; d\eta \\
&\le f(w_t) - \alpha \| \nabla f(w_t) \|^2 + L \| \nabla f(w_t) \|^2 \int_{0}^{\alpha} \eta \; d\eta \\
&\le f(w_t) - \alpha \| \nabla f(w_t) \|^2 + \frac{\alpha^2 L}{2} \| \nabla f(w_t) \|^2 \\
&\le f(w_t) - \alpha \left(1 - \frac{\alpha L}{2} \right) \cdot \| \nabla f(w_t) \|^2
\end{align}
$$

Note that the loss is only guaranteed to decrease at each step when $1 - \frac{\alpha L}{2} > 0$. We now make the assumption that $\alpha L \le 1$, which satisfies this condition, and use it below to conclude that GD converges.
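As a sanity check, the per-step inequality above can be verified numerically on the toy quadratic problem from the earlier snippets, where the smoothness constant $L$ is the largest eigenvalue of the Hessian $X^T X / n$ (the choice $\alpha = 1/L$ and the number of steps are illustrative).

```python
# Verify f(w_{t+1}) <= f(w_t) - alpha * (1 - alpha * L / 2) * ||grad f(w_t)||^2 for a few steps.
L = np.linalg.eigvalsh(X.T @ X / len(Y)).max()  # smoothness constant of the quadratic objective
alpha = 1.0 / L
w = np.zeros(3)
for _ in range(5):
    g = gradient(w, X, Y)
    w_next = w - alpha * g
    lhs = empirical_risk(w_next, X, Y)
    rhs = empirical_risk(w, X, Y) - alpha * (1 - alpha * L / 2) * np.dot(g, g)
    assert lhs <= rhs + 1e-12
    w = w_next
```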

How Fast Does it Converge & Where Does it Go?

From our assumption $\alpha L \le 1$ above, we have $1 - \frac{\alpha L}{2} \ge \frac{1}{2}$, which gives

$$
f(w_{t+1}) \le f(w_t) - \frac{\alpha}{2} \| \nabla f(w_t) \|^2
$$

We now rearrange this inequality and sum it over $T$ iterations

$$
\begin{align}
\frac{\alpha}{2} \| \nabla f(w_t) \|^2 &\le f(w_t) - f(w_{t+1}) \\
\frac{\alpha}{2} \sum_{t=0}^{T-1} \| \nabla f(w_t) \|^2 &\le f(w_0) - f(w_{T})
\end{align}
$$
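The second line follows because the sum on the right-hand side telescopes: consecutive terms cancel, leaving only the first and last loss values.

$$
\sum_{t=0}^{T-1} \left( f(w_t) - f(w_{t+1}) \right) = f(w_0) - f(w_T)
$$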

Recall that we assumed a global minimum loss $f^*$. Also, over these $T$ iterations the minimum gradient norm $\min_t \| \nabla f(w_t) \|^2$ is at most the average, so $T$ times the minimum is at most the sum. Putting these together,

$$
\frac{\alpha T}{2} \min_{t \in [0, T-1]} \| \nabla f(w_t) \|^2 \le \frac{\alpha}{2} \sum_{t=0}^{T-1} \| \nabla f(w_t) \|^2 \le f(w_0) - f(w_{T}) \le f(w_0) - f^*
$$

Dividing through by $\frac{\alpha T}{2}$ gives the following inequality. We started by looking at the change in loss at each iteration, then accumulated that change over $T$ iterations, and ended up with this bound involving a $\min$.

$$
\min_{t \in [0, T-1]} \| \nabla f(w_t) \|^2 \le 2\frac{f(w_0) - f^*}{\alpha T}
$$

What is the smallest gradient norm we can get in theory? It is 0, which we attain when we reach a minimum of the loss function $f$, though we do not know whether that is a local minimum or a global one. No matter what kind of minimum it is, if we reach it, we say that gradient descent has converged. Therefore, we want to make the bound on the right as tight as possible.

We can do this either by running more iterations $T$ or by using a larger learning rate $\alpha$. With a limited number of iterations, let's tweak $\alpha$. Recall that $\alpha L \le 1$, so the largest step size we may use is $\alpha = \frac{1}{L}$; substituting it in, we have

$$
\min_{t \in [0, T-1]} \| \nabla f(w_t) \|^2 \le 2L \frac{f(w_0) - f^*}{T}
$$

This means that the smallest gradient norm we observe over $T$ iterations shrinks at a rate proportional to $1/T$, and GD converges.
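As an illustrative check (reusing the toy quadratic setup, `empirical_risk`, and `gradient` from the earlier snippets, and running with $\alpha = 1/L$), the running minimum of $\| \nabla f(w_t) \|^2$ indeed stays below the bound $2L (f(w_0) - f^*) / T$:

```python
# Track min_t ||grad f(w_t)||^2 over T steps and compare it with the bound 2L (f(w_0) - f*) / T.
L = np.linalg.eigvalsh(X.T @ X / len(Y)).max()
alpha = 1.0 / L
f_star = empirical_risk(np.linalg.lstsq(X, Y, rcond=None)[0], X, Y)  # global minimum of the quadratic
w = np.zeros(3)
f0 = empirical_risk(w, X, Y)
min_sq_grad = np.inf
for T in range(1, 201):
    g = gradient(w, X, Y)                       # gradient at w_{T-1}
    min_sq_grad = min(min_sq_grad, np.dot(g, g))
    assert min_sq_grad <= 2 * L * (f0 - f_star) / T + 1e-12
    w = w - alpha * g
```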

Note that this proof does not show that GD reaches a global minimum, for the reasons explained above.

Appendix

The derivative of $f$ at a point $w$ in the direction $u$ is defined as follows

$$
\nabla f(w)^T u = \lim_{\delta \rightarrow 0} \frac{f(w + \delta u) - f(w)}{\delta}
$$

Also the definition of a partial derivative is

$$
\frac{\partial}{\partial x} f(x,y,z) = \lim_{\delta \to 0} \frac{f(x+\delta, y, z) - f(x, y, z)}{\delta}
$$

According to the definition of the partial derivative, we add a small $\delta$ to the $\eta$ we are differentiating with respect to:

$$
\begin{align}
\frac{\partial}{\partial\eta} f(w_t - \eta \nabla f(w_t)) &= \lim_{\delta \to 0} \frac{f(w_t - (\eta + \delta) \nabla f(w_t)) - f(w_t - \eta \nabla f(w_t))}{\delta} && \text{partial derivative definition} \\
&= \lim_{\delta \to 0} \frac{f(w_t - \eta \nabla f(w_t) - \delta \nabla f(w_t)) - f(w_t - \eta \nabla f(w_t))}{\delta} \\
&= \lim_{\delta \to 0} \frac{f(\hat w - \delta \nabla f(w_t)) - f(\hat w)}{\delta} && \text{call } \hat w = w_t - \eta \nabla f(w_t) \\
&= -\nabla f(\hat w)^T \nabla f(w_t) && \text{definition of the directional derivative} \\
&= -\nabla f(w_t - \eta \nabla f(w_t))^T \nabla f(w_t)
\end{align}
$$

So we have translated this partial derivative into the directional derivative of $f$ at the point $w_t - \eta \nabla f(w_t)$ in the direction $-\nabla f(w_t)$.
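A small numerical check of this identity (illustrative only; it reuses the toy setup, `empirical_risk`, and `gradient` from the earlier snippets, and the helper `loss_along_ray` is a hypothetical name):

```python
# Check d/d(eta) f(w_t - eta * grad f(w_t)) == -grad f(w_hat)^T grad f(w_t) with a finite difference.
def loss_along_ray(eta, w_t, X, Y):
    """f evaluated at w_t - eta * grad f(w_t)."""
    return empirical_risk(w_t - eta * gradient(w_t, X, Y), X, Y)

w_t, eta, delta = np.ones(3), 0.3, 1e-6
g_t = gradient(w_t, X, Y)
w_hat = w_t - eta * g_t

finite_diff = (loss_along_ray(eta + delta, w_t, X, Y) - loss_along_ray(eta, w_t, X, Y)) / delta
analytic = -gradient(w_hat, X, Y) @ g_t
print(finite_diff, analytic)  # these should agree up to O(delta)
```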