Gradient Descent September 12, 2022
Optimization Methods ¶ Empirical Risk Minimization ¶ For a model / hypothesis h h h and its associated parameters w ∈ R d w \in \R^d w ∈ R d , we can write out its loss in the following form
f ( w ) = 1 ∣ D ∣ ∑ x ∈ D f ( w ; x ) f(w) = \frac{1}{|\mathcal{D}|} \sum_{x \in \mathcal{D}} f(w; x) f ( w ) = ∣ D ∣ 1 x ∈ D ∑ f ( w ; x ) Gradient Descent ¶ With w 0 w_0 w 0 initialized to some value, we then update w w w in each step according to the following rule
w t + 1 = w t − α t ⋅ ∇ f ( w t ) = w t − α t ⋅ 1 n ∑ x ∈ D ∇ f ( w ; x ) w_{t+1} = w_t - \alpha_t \cdot \nabla f(w_t) = w_t - \alpha_t \cdot \frac{1}{n} \sum_{x \in \mathcal{D}} \nabla f(w; x) w t + 1 = w t − α t ⋅ ∇ f ( w t ) = w t − α t ⋅ n 1 x ∈ D ∑ ∇ f ( w ; x ) GD takes O ( n d ) O(nd) O ( n d ) and O ( d ) O(d) O ( d ) memory.
Newton’s Method ¶ w t + 1 = w t − H ( w t ) ∇ f ( w t ) . w_{t+1} = w_t - H(w_t) \nabla f(w_t). w t + 1 = w t − H ( w t ) ∇ f ( w t ) . H ( x ) H(x) H ( x ) is the Hessian matrix, which is the second order derivatives matrix. Sometimes people directly write H ( w t ) = ( ∇ 2 f ( w t ) ) − 1 H(w_t) = \left( \nabla^2 f(w_t) \right)^{-1} H ( w t ) = ( ∇ 2 f ( w t ) ) − 1 .
Newton’s method takes O ( n d 2 + d 3 ) O(nd^2 + d^3) O ( n d 2 + d 3 ) time and O ( d 2 ) O(d^2) O ( d 2 ) memory. Despite the problem with existence of second order derivative, the enormous extra time we have to spend calculating Hessian matrix when d d d gets big is the main reason why people avoid Newton’s method.
Gradient Descent Converges ¶ L-Smooth ¶ We can prove that gradient descent works under the assumption that the second derivative of the objective is bounded. Formally, such function is called L-smooth .
We call a function L-smooth when its gradient is L-Lipschitz continuous . That is
∥ ∇ f ( w ) − ∇ f ( u ) ∥ 2 ≤ L ∥ u − v ∥ 2 \| \nabla f(w) - \nabla f(u) \|_2 \le L \| u-v \|_2 ∥∇ f ( w ) − ∇ f ( u ) ∥ 2 ≤ L ∥ u − v ∥ 2 From the above definition, we can derive the following property, which guarantees that small change in weight only results in small change in gradient (the rate of gradient change is bounded by L L L ) This is the intuition behind why we need L-smooth for GD to converge.
∀ w , u ∈ R d , α ∈ R s . t . ∥ u ∥ = 1 , ∥ ∂ 2 ∂ α 2 f ( w + α u ) ∥ ≤ L \forall \mathbf w,\mathbf u \in \R^d, \alpha \in \R \; s.t. \|\mathbf u\|=1, \; \|\frac{\partial^2}{\partial\alpha^2} f(\mathbf w+\alpha \mathbf u) \| \le L ∀ w , u ∈ R d , α ∈ R s . t .∥ u ∥ = 1 , ∥ ∂ α 2 ∂ 2 f ( w + α u ) ∥ ≤ L For example, take f ( x ) = sin ( x ) f(x) = \sin(x) f ( x ) = sin ( x ) as an example, we have ∣ f ′ ′ ( x ) ∣ = ∣ − sin ( x ) ∣ ≤ 1 |f''(x)| = |-\sin(x)| \le 1 ∣ f ′′ ( x ) ∣ = ∣ − sin ( x ) ∣ ≤ 1 , so we call f f f is 1-smooth and f ′ f' f ′ is 1-Lipchitz continuous. Another example g ( x ) = x 3 g(x) = x^3 g ( x ) = x 3 , so g ′ ( x ) = 3 x 2 , g ′ ′ ( x ) = 6 x g'(x) = 3x^2, g''(x) = 6x g ′ ( x ) = 3 x 2 , g ′′ ( x ) = 6 x and a small x change can result in arbitrary large gradient descent, so this is bad for gradient descent.
Proof ¶ Assume there exists a global minimum f ∗ s . t . ∀ w ∈ R d , f ( w ) ≥ f ∗ f^* \; s.t. \; \forall w \in \R^d, \; f(w) \ge f^* f ∗ s . t . ∀ w ∈ R d , f ( w ) ≥ f ∗
First, we start with looking at how the loss changes from step t t t to step t + 1 t+1 t + 1 .
f ( w t + 1 ) = f ( w t − α ∇ f ( w t ) ) f(w_{t+1}) = f(w_t - \alpha \nabla f(w_t)) f ( w t + 1 ) = f ( w t − α ∇ f ( w t )) By Fundamental Theorem of Calculus: h ( a ) = h ( 0 ) + ∫ 0 a h ′ ( t ) d t h(a) = h(0) + \int^a_0 h'(t) dt h ( a ) = h ( 0 ) + ∫ 0 a h ′ ( t ) d t
f ( w t + 1 ) = f ( w t ) + ∫ 0 α ∂ ∂ η f ( w t − η ∇ f ( w t ) ) d η f(w_{t+1}) = f(w_t) + \int^\alpha_0 \frac{\partial}{\partial\eta} f(w_t - \eta \nabla f(w_t)) \; d\eta f ( w t + 1 ) = f ( w t ) + ∫ 0 α ∂ η ∂ f ( w t − η ∇ f ( w t )) d η We rewrite this equation with w ^ = w t − η ∇ f ( w t ) \hat w = w_t - \eta \nabla f(w_t) w ^ = w t − η ∇ f ( w t ) (refer to the appendix for a detailed derivation)
f ( w t + 1 ) = f ( w t ) + ∫ 0 α − ∇ f ( w ^ ) T ∇ f ( w t ) d η f(w_{t+1}) = f(w_t) + \int^\alpha_0 -\nabla f(\hat w)^T \nabla f(w_t) \; d\eta f ( w t + 1 ) = f ( w t ) + ∫ 0 α − ∇ f ( w ^ ) T ∇ f ( w t ) d η Subtract and add ∇ f ( w t ) T ∇ f ( w t ) \nabla f(w_t)^T \nabla f(w_t) ∇ f ( w t ) T ∇ f ( w t )
f ( w t + 1 ) = f ( w t ) + ∫ 0 α − ∇ f ( w t ) T ∇ f ( w t ) + ∇ f ( w t ) T ∇ f ( w t ) − ∇ f ( w ^ ) T ∇ f ( w t ) d η = f ( w t ) + ∫ 0 α − ∇ f ( w t ) T ∇ f ( w t ) + ( ∇ f ( w t ) T − ∇ f ( w ^ ) T ) ∇ f ( w t ) d η = f ( w t ) − ∫ 0 α ∇ f ( w t ) T ∇ f ( w t ) d η + ∫ 0 α ( ∇ f ( w t ) − ∇ f ( w ^ ) ) T ∇ f ( w t ) d η \begin{align}
f(w_{t+1}) &= f(w_t) + \int^\alpha_0 - \nabla f(w_t)^T \nabla f(w_t) + \nabla f(w_t)^T \nabla f(w_t) - \nabla f(\hat w)^T \nabla f(w_t) \; d\eta \\
&= f(w_t) + \int^\alpha_0 - \nabla f(w_t)^T \nabla f(w_t) + (\nabla f(w_t)^T - \nabla f(\hat w)^T) \nabla f(w_t) \; d\eta \\
&= f(w_t) - \int^\alpha_0 \nabla f(w_t)^T \nabla f(w_t) \; d\eta
+ \int^\alpha_0 (\nabla f(w_t) - \nabla f(\hat w))^T \nabla f(w_t) \; d\eta
\end{align} f ( w t + 1 ) = f ( w t ) + ∫ 0 α − ∇ f ( w t ) T ∇ f ( w t ) + ∇ f ( w t ) T ∇ f ( w t ) − ∇ f ( w ^ ) T ∇ f ( w t ) d η = f ( w t ) + ∫ 0 α − ∇ f ( w t ) T ∇ f ( w t ) + ( ∇ f ( w t ) T − ∇ f ( w ^ ) T ) ∇ f ( w t ) d η = f ( w t ) − ∫ 0 α ∇ f ( w t ) T ∇ f ( w t ) d η + ∫ 0 α ( ∇ f ( w t ) − ∇ f ( w ^ ) ) T ∇ f ( w t ) d η Evaluate the first integral and replace the second integral with the triangular inequality
f ( w t + 1 ) ≤ f ( w t ) − α ∇ f ( w t ) T ∇ f ( w t ) + ∫ 0 α ∥ ∇ f ( w t ) ∥ ⋅ ∥ ∇ f ( w t ) − ∇ f ( w t − η ∇ f ( w t ) ) ∥ d η f(w_{t+1}) \le f(w_t) -\alpha \nabla f(w_t)^T \nabla f(w_t) + \int_{0}^{\alpha} \| \nabla f(w_t) \| \cdot \| \nabla f(w_t) - \nabla f(w_t - \eta \nabla f(w_t)) \| \; d \eta \\ f ( w t + 1 ) ≤ f ( w t ) − α ∇ f ( w t ) T ∇ f ( w t ) + ∫ 0 α ∥∇ f ( w t ) ∥ ⋅ ∥∇ f ( w t ) − ∇ f ( w t − η ∇ f ( w t )) ∥ d η Since our function is L-smooth,
f ( w t + 1 ) ≤ f ( w t ) − α ∇ f ( w t ) T ∇ f ( w t ) + ∫ 0 α ∥ ∇ f ( w t ) ∥ ⋅ L ∥ w t − w t − η ∇ f ( w t ) ∥ d η ≤ f ( w t ) − α ∇ f ( w t ) T ∇ f ( w t ) + ∫ 0 α ∥ ∇ f ( w t ) ∥ ⋅ L ∥ η ∇ f ( w t ) ∥ d η ≤ f ( w t ) − α ∥ ∇ f ( w t ) ∥ 2 + L ∥ ∇ f ( w t ) ∥ 2 ∫ 0 α η d η ≤ f ( w t ) − α ∥ ∇ f ( w t ) ∥ 2 + α 2 L 2 ∥ ∇ f ( w t ) ∥ 2 ≤ f ( w t ) − α ( 1 − α L 2 ) ⋅ ∥ ∇ f ( w t ) ∥ 2 \begin{align}
f(w_{t+1}) &\le f(w_t) - \alpha \nabla f(w_t)^T \nabla f(w_t) + \int_{0}^{\alpha} \| \nabla f(w_t) \| \cdot L \| w_t - w_t - \eta \nabla f(w_t) \| \; d \eta \\
&\le f(w_t) - \alpha \nabla f(w_t)^T \nabla f(w_t) + \int_{0}^{\alpha} \| \nabla f(w_t) \| \cdot L \| \eta \nabla f(w_t) \| \; d \eta \\
&\le f(w_t) - \alpha \| \nabla f(w_t) \|^2 + L \| \nabla f(w_t) \|^2 \int_{0}^{\alpha} \eta \; d \eta \\
&\le f(w_t) - \alpha \| \nabla f(w_t) \|^2 + \frac{\alpha^2 L}{2} \| \nabla f(w_t) \|^2 \\
&\le f(w_t) - \alpha \left(1 - \frac{\alpha L}{2} \right) \cdot \| \nabla f(w_t) \|^2 \\
\end{align} f ( w t + 1 ) ≤ f ( w t ) − α ∇ f ( w t ) T ∇ f ( w t ) + ∫ 0 α ∥∇ f ( w t ) ∥ ⋅ L ∥ w t − w t − η ∇ f ( w t ) ∥ d η ≤ f ( w t ) − α ∇ f ( w t ) T ∇ f ( w t ) + ∫ 0 α ∥∇ f ( w t ) ∥ ⋅ L ∥ η ∇ f ( w t ) ∥ d η ≤ f ( w t ) − α ∥∇ f ( w t ) ∥ 2 + L ∥∇ f ( w t ) ∥ 2 ∫ 0 α η d η ≤ f ( w t ) − α ∥∇ f ( w t ) ∥ 2 + 2 α 2 L ∥∇ f ( w t ) ∥ 2 ≤ f ( w t ) − α ( 1 − 2 αL ) ⋅ ∥∇ f ( w t ) ∥ 2 Note Gradient Descent only converges when 1 − α L 2 > 0 1 - \frac{\alpha L}{2} \gt 0 1 − 2 αL > 0 , so the loss function is guaranteed to decrease at each step. We now make an assumption that α L ≤ 1 \alpha L \le 1 αL ≤ 1 and we can safely conclude that GD converges .
How fast it Converges & Where does it Go? ¶ From our assumption α L ≤ 1 \alpha L \le 1 αL ≤ 1 above, we can get rid of the α L 2 \frac{\alpha L}{2} 2 αL term
f ( w t + 1 ) ≤ f ( w t ) − α 2 ∥ ∇ f ( w t ) ∥ 2 f(w_{t+1}) \le f(w_t) - \frac{\alpha}{2} \| \nabla f(w_t) \|^2 f ( w t + 1 ) ≤ f ( w t ) − 2 α ∥∇ f ( w t ) ∥ 2 We now move stuff a little bit and calculate the accumulated value when we run T T T iterations
α 2 ∥ ∇ f ( w t ) ∥ 2 ≤ f ( w t ) − f ( w t + 1 ) α 2 ∑ t = 0 T − 1 ∥ ∇ f ( w t ) ∥ 2 ≤ f ( w 0 ) − f ( w T ) \begin{align}
\frac{\alpha}{2} \| \nabla f(w_t) \|^2 &\le f(w_t) - f(w_{t+1}) \\
\frac{\alpha}{2} \sum_{t=0}^{T-1} \| \nabla f(w_t) \|^2 &\le f(w_0) - f(w_{T})
\end{align} 2 α ∥∇ f ( w t ) ∥ 2 2 α t = 0 ∑ T − 1 ∥∇ f ( w t ) ∥ 2 ≤ f ( w t ) − f ( w t + 1 ) ≤ f ( w 0 ) − f ( w T ) Recall we have a global minimum loss f ∗ f^* f ∗ . We can also find the minimum gradient norm ∥ ∇ f ( w t ) ∥ 2 \| \nabla f(w_t) \|^2 ∥∇ f ( w t ) ∥ 2 when we run these T T T iterations.
α T 2 min t ∈ [ 0 , T − 1 ] ∥ ∇ f ( w t ) ∥ 2 ≤ α 2 ∑ t = 0 T − 1 ∥ ∇ f ( w t ) ∥ 2 ≤ f ( w 0 ) − f ( w T ) ≤ f ( w 0 ) − f ∗ \frac{\alpha T}{2} \min_{t \in [0, T-1]} \| \nabla f(w_t) \|^2
\le \frac{\alpha}{2} \sum_{t=0}^{T-1} \| \nabla f(w_t) \|^2
\le f(w_0) - f(w_{T})
\le f(w_0) - f^* 2 α T t ∈ [ 0 , T − 1 ] min ∥∇ f ( w t ) ∥ 2 ≤ 2 α t = 0 ∑ T − 1 ∥∇ f ( w t ) ∥ 2 ≤ f ( w 0 ) − f ( w T ) ≤ f ( w 0 ) − f ∗ We now have this following equation. We started off looking at the change in loss from each iteration. Then we accumulated the difference to look at the change by running T T T iterations. Now we ended up here with this inequality with a min \min min .
min t ∈ [ 0 , T − 1 ] ∥ ∇ f ( w t ) ∥ 2 ≤ 2 f ( w 0 ) − f ∗ α T \min_{t \in [0, T-1]} \| \nabla f(w_t) \|^2 \le 2\frac{f(w_0) - f^*}{\alpha T} t ∈ [ 0 , T − 1 ] min ∥∇ f ( w t ) ∥ 2 ≤ 2 α T f ( w 0 ) − f ∗ What is the minimum norm of gradient we can get in theory? It’s 0 when we reach a minimum of loss function f f f , though we do not know whether that is a local minimum or a global one. No matter what kind of a minimum it is, if we reach it, we can call Gradient Descent has converged. Therefore, we want to make the bound on the right as tight as possible.
We can do this by either running more iterations T T T or have a faster learning rate α \alpha α . Within limited iterations, let’s tweak α \alpha α here. Recall that α L ≤ 1 \alpha L \le 1 αL ≤ 1 , so replace α = 1 L \alpha = \frac 1 L α = L 1 , we have
min t ∈ [ 0 , T − 1 ] ∥ ∇ f ( w t ) ∥ 2 ≤ 2 L f ( w 0 ) − f ∗ T \min_{t \in [0, T-1]} \| \nabla f(w_t) \|^2 \le 2L \frac{f(w_0) - f^*}{T} t ∈ [ 0 , T − 1 ] min ∥∇ f ( w t ) ∥ 2 ≤ 2 L T f ( w 0 ) − f ∗ This means that the smallest gradient we observe after T T T iterations is getting smaller proportional to 1 / T 1/T 1/ T and GD converges.
Note this proof doesn’t show it reaches a global minimum for the reasons explained above.
Appendix ¶ The derivative of f f f at point w w w in the direction u u u is defined as the following
∇ f ( w ) T u = lim δ → 0 f ( w + δ u ) − f ( w ) δ \nabla f(w)^T u = \lim_{\delta \rightarrow 0} \frac{f(w + \delta u) - f(w)}{\delta} ∇ f ( w ) T u = δ → 0 lim δ f ( w + δ u ) − f ( w ) Also the definition of a partial derivative is
∂ ∂ x f ( x , y , z ) = lim δ → 0 f ( x + δ , y , z ) − f ( x , y , z ) δ \frac{\partial}{\partial x}f(x,y,z) = \lim_{\delta\to0} \frac{f(x+\delta,y,z) - f(x,y,z)}{\delta} ∂ x ∂ f ( x , y , z ) = δ → 0 lim δ f ( x + δ , y , z ) − f ( x , y , z ) According to the definition of partial derivative, we add a little δ \delta δ on the η \eta η we are differentiating with:
$$
∂ ∂ η f ( w t − η ∇ f ( w t ) ) = lim δ → 0 f ( w t − ( η + δ ) ∇ f ( w t ) ) − f ( w t − η ∇ f ( w t ) ) δ partial derivative definition = lim δ → 0 f ( w t − η ∇ f ( w t ) − δ ∇ f ( w t ) ) − f ( w t − η ∇ f ( w t ) ) δ = lim δ → 0 f ( w ^ − δ ∇ f ( w t ) ) − f ( w ^ ) δ Call w ^ = w t − η ∇ f ( w t ) = − ∇ f ( w ^ ) T ∇ f ( w t ) definition of gradient = − ∇ f ( w t − η ∇ f ( w t ) ) T ∇ f ( w t ) \begin{align}
\frac{\partial}{\partial\eta} f(w_t - \eta \nabla f(w_t))
&= \lim_{\delta \to 0} \frac{f(w_t - (\eta + \delta) \nabla f(w_t)) - f(w_t - \eta \nabla f(w_t))}{\delta} && \text{partial derivative definition} \\
&= \lim_{\delta \to 0} \frac{f(w_t - \eta \nabla f(w_t) - \delta \nabla f(w_t)) - f(w_t - \eta \nabla f(w_t))}{\delta}\\
&= \lim_{\delta \to 0} \frac{f(\hat w - \delta \nabla f(w_t)) - f(\hat w)}{\delta} && \text{Call $\hat w = w_t - \eta \nabla f(w_t)$} \\
&= -\nabla f(\hat w)^T \nabla f(w_t) && \text{definition of gradient}\\
&= -\nabla f(w_t - \eta \nabla f(w_t))^T \nabla f(w_t)
\end{align} ∂ η ∂ f ( w t − η ∇ f ( w t )) = δ → 0 lim δ f ( w t − ( η + δ ) ∇ f ( w t )) − f ( w t − η ∇ f ( w t )) = δ → 0 lim δ f ( w t − η ∇ f ( w t ) − δ ∇ f ( w t )) − f ( w t − η ∇ f ( w t )) = δ → 0 lim δ f ( w ^ − δ ∇ f ( w t )) − f ( w ^ ) = − ∇ f ( w ^ ) T ∇ f ( w t ) = − ∇ f ( w t − η ∇ f ( w t ) ) T ∇ f ( w t ) partial derivative definition Call w ^ = w t − η ∇ f ( w t ) definition of gradient $$
So we have translated this partial derivative to the gradient at point w t − η ∇ f ( w t ) w_t - \eta \nabla f(w_t) w t − η ∇ f ( w t ) in the direction of − ∇ f ( w t ) -\nabla f(w_t) − ∇ f ( w t ) .