Recall from the previous lecture that the running time of GD on a strongly convex (or PL-condition) function depends only on the condition number $\kappa$. When the condition number is large, convergence can become slow.
The simplest possible setting with a condition number larger than 1 is a 2D quadratic. To get a high condition number, we want the bigger eigenvalue of the Hessian matrix to be very big and the smaller one to be very small. Consider the following example with $a > b$: here $a$ is just the curvature of the first dimension and $b$ is the curvature of the second dimension.
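A minimal concrete instance, assuming the standard $\tfrac{1}{2}$-scaled form (the exact constants are illustrative):

$$
f(w) = \frac{1}{2}\left(a\,w_1^2 + b\,w_2^2\right), \qquad \nabla^2 f(w) = \begin{pmatrix} a & 0 \\ 0 & b \end{pmatrix}, \qquad a > b > 0.
$$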
By the definition of the condition number in linear algebra, we know $\kappa = \frac{a}{b}$. By its definition in ML optimization, we know $\kappa = \frac{L}{\mu}$, so we can just set $a = L$ and $b = \mu$.
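Spelling this out: with step size $\alpha$, GD on the diagonal quadratic above decouples across the two coordinates,

$$
w_{t+1} = w_t - \alpha \nabla f(w_t) \quad\Longrightarrow\quad w_{t+1,1} = (1-\alpha L)\,w_{t,1}, \qquad w_{t+1,2} = (1-\alpha \mu)\,w_{t,2},
$$

so after $T$ steps

$$
f(w_T) = \frac{1}{2}\left( L\,(1-\alpha L)^{2T} w_{0,1}^2 + \mu\,(1-\alpha \mu)^{2T} w_{0,2}^2 \right).
$$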
Therefore, the final value of $f(w_T)$ will be dominated by the larger of the two exponential terms, $(1-\alpha L)^{2T}$ or $(1-\alpha \mu)^{2T}$.
To minimize $f(w_T)$, we have to make $|1-\alpha L|$ and $|1-\alpha\mu|$ small at the same time, i.e., minimize $\max(|1-\alpha L|,\, |1-\alpha\mu|)$. If we just look at the first dimension and minimize $|1-\alpha L|$ alone, we set $\alpha = \frac{1}{L}$; since $L > \mu$, this is always the smaller of the two candidate step sizes. So for the first dimension, whose curvature $L$ is larger, we want a smaller learning rate. On the other hand, if we only look at the second dimension, whose curvature $\mu$ is smaller, we want a larger learning rate, $\alpha = \frac{1}{\mu}$.
Now that we have to minimize both at the same time, we are forced to choose something in the middle. The maximum is minimized when the two factors balance, $\alpha L - 1 = 1 - \alpha\mu$, so $\alpha = \frac{2}{L+\mu}$. Substituting this $\alpha$ in:
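Both contraction factors then take the same value, which is worth writing out:

$$
\left|1 - \frac{2L}{L+\mu}\right| = \left|1 - \frac{2\mu}{L+\mu}\right| = \frac{L-\mu}{L+\mu} = \frac{\kappa-1}{\kappa+1} = 1 - \frac{2}{\kappa+1}.
$$

For large $\kappa$ this is roughly $1 - 2/\kappa$ per step, so reaching accuracy $\varepsilon$ takes on the order of $\kappa \log(1/\varepsilon)$ iterations, which is exactly how a large condition number slows GD down.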
We want a way to detect high vs. low curvature on the fly during GD. The idea is:
If a dimension has high curvature, we tend to overshoot, so in the next step the gradient will point in the opposite direction.
If a dimension has low curvature, the gradient is more likely to keep pointing in the same direction, which means the step size is too small.
Therefore, we want to make steps smaller when gradients reverse sign and larger when gradients consistently point in the same direction. Polyak momentum does exactly this; a minimal sketch is given below.
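As a hedged illustration (not the lecture's code), here is a minimal NumPy sketch of the heavy-ball update $w_{t+1} = w_t - \alpha \nabla f(w_t) + \beta\,(w_t - w_{t-1})$ on the ill-conditioned quadratic from above; the tuned constants follow the standard quadratic analysis and are an assumption here:

```python
import numpy as np

# Ill-conditioned 2D quadratic from above: f(w) = 0.5 * (L * w1^2 + mu * w2^2)
L, mu = 100.0, 1.0

def grad_f(w):
    """Gradient of the diagonal quadratic."""
    return np.array([L * w[0], mu * w[1]])

def heavy_ball(w0, alpha, beta, steps):
    """Polyak momentum: w_{t+1} = w_t - alpha * grad_f(w_t) + beta * (w_t - w_{t-1})."""
    w_prev, w = w0.copy(), w0.copy()
    for _ in range(steps):
        w_next = w - alpha * grad_f(w) + beta * (w - w_prev)
        w_prev, w = w, w_next
    return w

# Standard tuning for quadratics (assumed, matching the sqrt(kappa) analysis):
alpha = 4.0 / (np.sqrt(L) + np.sqrt(mu)) ** 2
beta = ((np.sqrt(L) - np.sqrt(mu)) / (np.sqrt(L) + np.sqrt(mu))) ** 2
print(heavy_ball(np.array([1.0, 1.0]), alpha, beta, steps=200))
```

Intuitively, in the high-curvature coordinate the gradient flips sign every step, so the $\beta\,(w_t - w_{t-1})$ term partially cancels it; in the low-curvature coordinate successive steps agree and the momentum term accumulates.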
As before, the value of $f(w_T)$ will be dominated by the largest of the exponential $\lambda^T$ terms, where $\lambda$ ranges over the eigenvalues of the iteration matrix. Therefore, we want to minimize the norms of all these eigenvalues at the same time.
Since we are only analyzing eigenvalues, the choice of basis is of no importance, so we swap the 2nd and 3rd columns and the 2nd and 3rd rows. This is equivalent to swapping the 2nd and 3rd basis vectors in both the domain and the codomain, i.e., a similarity transform $PAP^{-1}$ with a permutation matrix $P$, which leaves the eigenvalues unchanged.
This new matrix is block-diagonal, so to find its eigenvalues we just have to find the eigenvalues of the block $B$.
Recall that we want to minimize the norms of all the eigenvalues at the same time. We achieve this when they all have the same norm. For $B$ specifically, it has two eigenvalues $\lambda_1, \lambda_2$, and we want $|\lambda_1| = |\lambda_2|$.
Write out the characteristic polynomial of the matrix $B$:
To find the exact value of $|\lambda|$, we don't have to go through computing the norm of a complex number. Instead, recall that the product of all eigenvalues equals the determinant of the matrix, so $|\lambda_1||\lambda_2| = |\det B|$.
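For concreteness, assuming $B$ is the usual per-coordinate heavy-ball iteration matrix (a standard form; the lecture's $B$ may be written slightly differently), with curvature $h$ in this coordinate:

$$
B = \begin{pmatrix} 1 + \beta - \alpha h & -\beta \\ 1 & 0 \end{pmatrix}, \qquad \det B = \beta.
$$

When $\lambda_1, \lambda_2$ form a complex-conjugate pair, $|\lambda_1| = |\lambda_2|$, so $|\lambda_1||\lambda_2| = |\det B|$ gives $|\lambda| = \sqrt{\beta}$.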
This is a special case of the Karush–Kuhn–Tucker (KKT) conditions. When we solve a linear programming problem with $k$ variables and $n$ inequality constraints, $k$ of these $n$ inequalities will actually hold with equality at the optimum. Here we have 2 variables and 2 inequality constraints, so the minimization problem is solved when both constraints hold with equality. So we solve
Therefore, we have shown that GD with momentum on this simple quadratic example has a faster convergence rate than vanilla GD. The result does generalize beyond this example to other classes of functions.
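To make the gap concrete, here is a small sketch comparing vanilla GD with its best fixed step size $\alpha = 2/(L+\mu)$ against heavy ball on the same illustrative quadratic (all constants and helper names are mine, for illustration):

```python
import numpy as np

L, mu = 100.0, 1.0
H = np.array([L, mu])  # diagonal of the Hessian

def f(w):
    return 0.5 * np.sum(H * w ** 2)

def run(alpha, beta, steps):
    """Vanilla GD when beta == 0, Polyak momentum otherwise."""
    w_prev = w = np.array([1.0, 1.0])
    for _ in range(steps):
        w_next = w - alpha * H * w + beta * (w - w_prev)
        w_prev, w = w, w_next
    return f(w)

steps = 100
sL, smu = np.sqrt(L), np.sqrt(mu)
print("vanilla GD:", run(2.0 / (L + mu), 0.0, steps))
print("heavy ball:", run(4.0 / (sL + smu) ** 2, ((sL - smu) / (sL + smu)) ** 2, steps))
```

After the same number of steps, the momentum run should reach a far smaller objective value, reflecting the $\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}$ vs. $\frac{\kappa-1}{\kappa+1}$ contraction rates.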
When we use momentum with SGD, it is not guaranteed to give a better result, but it is still widely used in practice.
One disadvantage of Polyak momentum is that the momentum term is not guaranteed to point in the right direction. Moreover, the nice acceleration is only guaranteed for quadratics. Therefore, we introduce Nesterov momentum, which works for general strongly convex objectives.
The difference: instead of evaluating the gradient at the current position, we pretend to have already taken the momentum step and evaluate the gradient at that look-ahead point.
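A minimal sketch of this look-ahead rule, in one common formulation (the step-size and momentum choices are standard for strongly convex objectives, but assumed here rather than taken from the lecture):

```python
import numpy as np

L, mu = 100.0, 1.0

def grad_f(w):
    """Gradient of the illustrative diagonal quadratic f(w) = 0.5*(L*w1^2 + mu*w2^2)."""
    return np.array([L * w[0], mu * w[1]])

def nesterov(w0, alpha, beta, steps):
    """Nesterov momentum: take the momentum step first, then evaluate the gradient there."""
    w_prev = w = w0.copy()
    for _ in range(steps):
        lookahead = w + beta * (w - w_prev)   # pretend the momentum step was already taken
        w_next = lookahead - alpha * grad_f(lookahead)
        w_prev, w = w, w_next
    return w

alpha = 1.0 / L                               # standard 1/L step size (assumed)
beta = (np.sqrt(L) - np.sqrt(mu)) / (np.sqrt(L) + np.sqrt(mu))
print(nesterov(np.array([1.0, 1.0]), alpha, beta, steps=300))
```

Compared with the heavy-ball sketch above, the only change is where the gradient is evaluated: at `lookahead` rather than at the current iterate `w`.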