
Gradient Descent

Cornell University

argmin from Last Time

In the previous lecture on Logistic Regression we wrote down expressions for the parameters in our model as solutions to optimization problems that do not have closed form solutions. Specifically, given data $\{(x_i, y_i)\}_{i=1}^n$ with $x_i \in \mathbb{R}^d$ and $y_i \in \{+1, -1\}$ we saw that

$$\hat{w}_{\text{MLE}} = \operatorname*{argmin}_{w \in \mathbb{R}^d,\, b \in \mathbb{R}}\; \sum_{i=1}^{n} \log(1 + e^{-y_i(w^T x_i + b)})$$

and

$$\hat{w}_{\text{MAP}} = \operatorname*{argmin}_{w \in \mathbb{R}^d,\, b \in \mathbb{R}}\; \sum_{i=1}^{n} \log(1 + e^{-y_i(w^T x_i + b)}) + \lambda w^T w$$

These notes will discuss general strategies to solve these problems and, therefore, we abstract our problem to

$$\min_{w} \ell(w),$$

where $\ell : \mathbb{R}^d \rightarrow \mathbb{R}$. In other words, we would like to find the vector $w$ that makes $\ell(w)$ as small as possible.
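As a concrete running example of such an $\ell$, here is a minimal sketch in Python/NumPy of the MAP objective above and its gradient. The function names are our own, and folding the bias $b$ into $w$ via a constant-1 feature is a simplifying assumption (it also regularizes $b$, which the objective above does not):

```python
import numpy as np

def logistic_loss(w, X, y, lam=0.0):
    """sum_i log(1 + exp(-y_i w^T x_i)) + lam * w^T w, with b folded into w."""
    margins = y * (X @ w)                      # y_i * w^T x_i for each i
    return np.sum(np.logaddexp(0.0, -margins)) + lam * w @ w

def logistic_grad(w, X, y, lam=0.0):
    """Gradient of logistic_loss with respect to w."""
    margins = y * (X @ w)
    # d/dw log(1 + e^{-m_i}) = -y_i x_i / (1 + e^{m_i})
    coeffs = -y / (1.0 + np.exp(margins))
    return X.T @ coeffs + 2 * lam * w
```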

Assumptions

Before diving into the mathematical and algorithmic details we will make some assumptions about our problem to simplify the discussion. Specifically, we will assume that:

  1. $\ell$ is convex, so any local minimum we find is a global minimum.
  2. $\ell$ is at least thrice continuously differentiable, so the Taylor approximations we use later are valid.
  3. There are no constraints placed on $w$. It is common to consider the problem $\min_{w \in \mathcal{C}} \ell(w)$ where $\mathcal{C}$ represents some constraints on $w$ (e.g., $w$ is entrywise non-negative). Adding constraints is a level of complexity we will not address here.

Defining a Minimizer

We call $w^\ast$ a local minimizer of $\ell$ if:

$\exists\, \epsilon > 0$ such that $\forall w$ with $\|w - w^\ast\|_2 < \epsilon$ we have $\ell(w^\ast) \leq \ell(w)$.

$w^\ast$ is a global minimizer if

$\forall w$, we have that $\ell(w^\ast) \leq \ell(w)$.

Taylor Expansion

Recall that the first order Taylor expansion of $\ell$ centered at $w$ can be written as

$$\ell(w + p) \approx \ell(w) + g(w)^T p,$$

where $g(w)$ is the gradient of $\ell$ evaluated at $w$, so $g(w) = \nabla\ell(w)$. This is the linear approximation of $\ell$ and has error $\mathcal{O}(\|p\|_2^2)$.

Similarly, the second order Taylor expansion of $\ell$ centered at $w$ can be written as

$$\ell(w + p) \approx \ell(w) + g(w)^T p + \frac{1}{2} p^T H(w) p$$

where $H(w)$ is the Hessian of $\ell$ evaluated at $w$. This is the quadratic approximation of $\ell$ and has error $\mathcal{O}(\|p\|_2^3)$.
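A quick numerical check of these error rates, as a sketch on a smooth test function of our own choosing, $\ell(w) = \log(1 + e^{a^T w})$ with $a = (1, 2)$:

```python
import numpy as np

a = np.array([1.0, 2.0])
ell = lambda w: np.log1p(np.exp(a @ w))
grad = lambda w: (1.0 / (1.0 + np.exp(-(a @ w)))) * a        # sigmoid(a^T w) * a
def hess(w):
    s = 1.0 / (1.0 + np.exp(-(a @ w)))
    return s * (1 - s) * np.outer(a, a)

w = np.array([0.3, -0.1])
p = np.array([1.0, 1.0])
for t in [1e-1, 1e-2, 1e-3]:
    first = ell(w) + grad(w) @ (t * p)
    second = first + 0.5 * (t * p) @ hess(w) @ (t * p)
    print(t, abs(ell(w + t * p) - first), abs(ell(w + t * p) - second))
# Shrinking t by 10x shrinks the first order error ~100x (O(||p||_2^2))
# and the second order error ~1000x (O(||p||_2^3)).
```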

Search Direction Methods

In general we cannot find the exact point $w^\ast$ where the minimum is achieved. Therefore, the core idea is: given a starting point $w^0$, construct a sequence of iterates $w^1, w^2, \ldots$ with the goal that $w^k \rightarrow w^\ast$ as $k \rightarrow \infty$. In a search direction method we construct $w^{k+1}$ from $w^k$ by writing it as $w^{k+1} = w^k + s$ for some “step” $s$. Concretely, this means our methods will have the following generic format:

Input: initial guess $w^0$

k = 0;

While not converged:

  1. Pick a step $s$
  2. $w^{k+1} = w^k + s$ and $k = k + 1$
  3. Check for convergence; if converged set $\hat{w} = w^k$

Return: $\hat{w}$

There are two clearly ambiguous steps in the above algorithm: how we pick $s$ and how we determine that we have converged. We will talk about two main methods for picking $s$ and briefly touch on determining convergence.

Figure: A search direction method applied to optimize a function of two variables.
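A minimal sketch of this generic template in Python; the `pick_step` argument is a placeholder of our own standing in for whichever strategy the following sections develop:

```python
import numpy as np

def search_direction_method(w0, pick_step, max_iter=1000, tol=1e-6):
    """Generic template: repeatedly pick a step s and set w <- w + s.

    pick_step(w, k) returns the step s; how it does so is what
    distinguishes gradient descent, Adagrad, Newton's method, etc.
    """
    w = w0.copy()
    for k in range(max_iter):
        s = pick_step(w, k)
        w = w + s
        if np.linalg.norm(s) < tol:   # one simple convergence check
            break
    return w
```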

Gradient Descent

We use the first order derivative to approximate our function $\ell$. A first attempt would be to minimize this approximation directly. However, that is not attainable, because a (non-constant) linear function has no minimum: it decreases all the way to negative infinity. Therefore, we instead take a modest step along this approximation to decrease the value by a bit.

Looking directly at the first order approximation, gradient descent can also be interpreted as follows: given that we are currently at $w^k$, determine the direction in which the function decreases the fastest at this point (the negative gradient direction) and update $w$ by taking a step in that direction. Consider the linear approximation to $\ell$ at $w^k$ provided above by the Taylor series:

$$\ell(w^{k+1}) = \ell(w^k + s) \approx \ell(w^k) + (\nabla\ell(w^k))^T s$$

Over all directions of a fixed length, the right-hand side decreases fastest when $s$ points along $-\nabla\ell(w^k)$: we want a smaller value, so we move against the gradient. Taking a step whose length is governed by some $\alpha > 0$, we set

$$s = -\alpha \nabla\ell(w^k)$$

Choosing $\alpha$

$\alpha$ is often referred to as the step size in classical optimization, but the learning rate in machine learning. There are three common strategies for setting it:

  1. Line search: find the best $\alpha$ at each iteration. This is expensive, so in practice we rarely use it.
  2. Fixed $\alpha$: keep $\alpha$ constant across iterations. Simple, but the method might not converge if $\alpha$ is too large.
  3. Decaying $\alpha$: set $\alpha = c/k$ at iteration $k$ for some constant $c > 0$, so we take big steps early on and smaller steps as we approach the minimizer. This guarantees convergence; see the sketch below.
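As a sketch, the decaying-$\alpha$ rule plugs into the `search_direction_method` template sketched above like this; the constant `c` and the reference to the earlier `logistic_grad` helper are illustrative choices of our own:

```python
def gd_step(grad, c=1.0):
    """Decaying-alpha gradient descent: at iteration k take s = -(c/k) * grad(w)."""
    def pick_step(w, k):
        alpha = c / (k + 1)           # iterations are 0-indexed, so use k + 1
        return -alpha * grad(w)
    return pick_step

# Hypothetical usage with the template and loss sketched earlier:
# w_hat = search_direction_method(np.zeros(d),
#                                 gd_step(lambda w: logistic_grad(w, X, y)))
```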

Adagrad

Adagrad (technically, diagonal Adagrad) can be seen as an improvement on the decaying-$\alpha$ strategy, where we determine the step size on a per-entry basis from the steepness of the function: a big “decay” for steep directions, and a small “decay” for flat directions.

Note that in gradient descent we have the same learning rate across entries, which basically says “we learn the same amount from both rare and common features”. However, we want rare features to keep receiving meaningfully large updates even after many updates to the common features have accumulated.

Therefore, we decrease the step size by a lot along directions where we have already taken large steps (the steep directions, common features) and decrease the step size only by a bit along directions where we have taken small steps (the flat directions, rare features).

Adagrad keeps track of the history of steepness by maintaining a running sum of the squared gradient at each entry. It then sets a small learning rate for entries with large past gradients and a large learning rate for entries with small past gradients.

Input: $\ell$, $\nabla\ell$, parameter $\epsilon > 0$, and initial learning rate $\alpha$.

Set $w_j^0 = 0$ and $z_j = 0$ for $j = 1, \ldots, d$. k = 0;

While not converged:

  1. Compute the gradient $\nabla\ell(w^k)$ and record each of its entries $g_j = [\nabla\ell(w^k)]_j$
  2. For each dimension $j$, update the running sum of squared gradients: $z_j = z_j + g_j^2$ for $j = 1, \ldots, d$
  3. $w_j^{k+1} = w_j^k - \alpha \frac{g_j}{\sqrt{z_j + \epsilon}}$ for $j = 1, \ldots, d$
  4. $k = k + 1$
  5. Check for convergence; if converged set $\hat{w} = w^k$

Return: $\hat{w}$
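A direct sketch of this pseudocode in Python; the variable names mirror the pseudocode, while the gradient-norm convergence test and default constants are choices of our own:

```python
import numpy as np

def adagrad(grad, d, alpha=0.1, eps=1e-8, max_iter=1000, tol=1e-6):
    """Diagonal Adagrad, following the pseudocode above.

    z accumulates the running sum of squared gradients per entry, so entries
    with a large gradient history get a smaller effective step size.
    """
    w = np.zeros(d)
    z = np.zeros(d)
    for k in range(max_iter):
        g = grad(w)
        if np.linalg.norm(g) < tol:        # small gradient => converged
            break
        z += g ** 2                        # per-entry sum of squared gradients
        w = w - alpha * g / np.sqrt(z + eps)
    return w
```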

Convergence

Assuming the gradient is non-zero, we can show that there is always some small enough $\alpha$ such that $\ell(w^k - \alpha \nabla\ell(w^k)) < \ell(w^k)$. In particular, we have

$$\ell(w^k - \alpha \nabla\ell(w^k)) = \ell(w^k) - \alpha (\nabla\ell(w^k))^T \nabla\ell(w^k) + \mathcal{O}(\alpha^2)$$

Since $(\nabla\ell(w^k))^T \nabla\ell(w^k) > 0$ and $\alpha^2 \rightarrow 0$ faster than $\alpha$ as $\alpha \rightarrow 0$, we conclude that for some sufficiently small $\alpha > 0$ we have $\ell(w^k - \alpha \nabla\ell(w^k)) < \ell(w^k)$.
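A tiny numerical illustration of this fact on the toy function $\ell(w) = w^T w$ (our own example): large steps can overshoot, but small enough ones always decrease $\ell$.

```python
import numpy as np

ell = lambda w: w @ w
grad = lambda w: 2 * w
w = np.array([1.0, 1.0])
for alpha in [1.5, 0.5, 0.1, 0.01]:
    print(alpha, ell(w - alpha * grad(w)) < ell(w))
# alpha = 1.5 overshoots (prints False); the smaller steps all decrease ell.
```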

Newton’s Method

Newton’s Method instead uses the second order derivative to approximate $\ell$. We pretend our approximation represents $\ell$ very well, find the minimum point of the quadratic approximation, and take it as our next estimate of the minimizer of the original function $\ell$. This works really well when the function indeed looks locally like a quadratic, but when our approximation is off, the result can be way off and the method may never converge.

$$\ell(w^k + s) \approx \ell(w^k) + (\nabla\ell(w^k))^T s + \frac{1}{2} s^T H(w^k) s$$

Specifically, we now choose a step by explicitly minimizing the quadratic approximation to $\ell$ at $w^k$. We can find this minimum because $\ell$ is convex. To accomplish this, we solve

$$\min_{s}\; \ell(w^k) + (\nabla\ell(w^k))^T s + \frac{1}{2} s^T H(w^k) s$$

by differentiating and setting the gradient equal to zero. Since the gradient of our quadratic approximation with respect to $s$ is $\nabla\ell(w^k) + H(w^k) s$, this implies that $s$ solves the linear system

$$H(w^k) s = -\nabla\ell(w^k)$$

Convexity of $\ell$ guarantees that $H(w)$ is positive semi-definite for all $w$. (In fact, for this system to have a unique solution we need a positive definite $H$, which is guaranteed by a strictly convex $\ell$; we will introduce a workaround for the semi-definite case later.)
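As a sketch, one Newton step amounts to a single linear solve. On a strictly convex quadratic it reaches the minimizer exactly, as the toy example of our own below shows:

```python
import numpy as np

def newton_step(w, grad, hess):
    """One Newton step: solve H(w) s = -grad(w) for s.

    Solving the linear system directly is preferred to forming H's inverse.
    """
    return np.linalg.solve(hess(w), -grad(w))

# On ell(w) = 0.5 w^T A w - b^T w, one Newton step from anywhere lands
# exactly on the minimizer A^{-1} b:
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
w = np.zeros(2)
s = newton_step(w, lambda v: A @ v - b, lambda v: A)
print(w + s, np.linalg.solve(A, b))   # identical
```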

Potential Issues

Bad Approximation

Newton’s method has very good properties in the neighborhood of a strict local minimizer, and once close enough to a solution it converges rapidly. However, if you start far from where the quadratic approximation is valid, Newton’s Method can diverge or enter a cycle. A practical fix is to introduce a step size $\alpha > 0$ and formally set

$$s = -\alpha [H(w^k)]^{-1} \nabla\ell(w^k)$$

We typically start with $\alpha = 1$ since that is the proper step to take if the quadratic is a good approximation. However, if this step seems poor (e.g., $\ell(w^k + s) > \ell(w^k)$) then we can decrease $\alpha$ at that iteration.

Positive Semi-Definite $H(w^k)$

In principle this means that $H(w^k) s = -\nabla\ell(w^k)$ may not have a solution, or, if $H(w^k)$ has a zero eigenvalue and the system is consistent, it has infinitely many solutions. A “quick fix” is to simply solve

$$(H(w^k) + \epsilon I) s = -\nabla\ell(w^k)$$

for some small parameter $\epsilon$ instead. This lightly regularizes the quadratic approximation to $\ell$ at $w^k$ and ensures it has a strict global minimizer.
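Both fixes combine into a single step rule. The sketch below uses a halving schedule for $\alpha$ and default tolerances that are our own choices, not prescribed by the notes:

```python
import numpy as np

def damped_newton_step(w, grad, hess, ell, eps=1e-6):
    """Newton step with both fixes: a ridge term eps*I guards against a
    semi-definite Hessian, and alpha is halved while the step increases ell.
    """
    g = grad(w)
    p = np.linalg.solve(hess(w) + eps * np.eye(len(w)), -g)
    alpha = 1.0                                # full step if the quadratic is good
    while ell(w + alpha * p) > ell(w) and alpha > 1e-10:
        alpha *= 0.5                           # backtrack when the step is poor
    return alpha * p
```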

Slow Running Time

Newton’s method requires a second order approximation of the original function, which involves the Hessian matrix $H$. This means computing $\mathcal{O}(d^2)$ entries per iteration, and solving the resulting linear system with a dense solver costs $\mathcal{O}(d^3)$, compared to the $\mathcal{O}(d)$ entries of the gradient needed by a first order method.

Checking for convergence

  1. Relative change in the iterates, i.e., $\frac{\|w^{k+1} - w^k\|_2}{\|w^k\|_2} < \delta_1$ for some small $\delta_1 > 0$.
  2. A reasonably small gradient, i.e., $\|\nabla\ell(w^k)\|_2 < \delta_2$ for some small $\delta_2 > 0$.
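Both checks are cheap to compute given the gradient. A minimal sketch combining them; the guard against division by zero and the choice to stop when either check fires are our own:

```python
import numpy as np

def converged(w_new, w_old, grad, delta1=1e-6, delta2=1e-6):
    """True if either convergence check above fires."""
    rel_change = np.linalg.norm(w_new - w_old) / max(np.linalg.norm(w_old), 1e-12)
    small_grad = np.linalg.norm(grad(w_new)) < delta2
    return rel_change < delta1 or small_grad
```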