
Back Propagation

Cornell University

Reverse Mode Automatic Differentiation

We need to fix an output of the NN. In most cases, we just use the final output $\ell$ of the NN ($\ell$ stands for loss). For each intermediate value $y$, we want to compute the partial derivative $\frac{\partial \ell}{\partial y}$.

Same as before: we replace each number/array $y$ with a pair, except this time the pair is not $(y, \frac{\partial y}{\partial x})$ but instead $(y, \nabla_y \ell)$, where $\ell$ is our final output (the loss of the NN). Observe that $\nabla_y \ell$ always has the same shape as $y$.
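
As a minimal sketch of this pairing (the class name `Tensor` and its fields are our own invention, not any particular library's API), we can store the value together with a gradient accumulator of the same shape:

```python
import numpy as np

class Tensor:
    """A value paired with a gradient accumulator of the same shape.

    The forward pass fills in `value`; the backward pass later fills in
    `grad`, which holds the gradient of the final loss with respect to
    this value, i.e. nabla_y ell.
    """
    def __init__(self, value):
        self.value = np.asarray(value, dtype=float)
        self.grad = np.zeros_like(self.value)  # same shape as the value

y = Tensor([1.0, 2.0, 3.0])
print(y.value.shape == y.grad.shape)  # True: the gradient matches the value's shape
```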

Deriving Backprop from the Chain Rule

Single Variable

Suppose that the output $\ell$ can be written as

$$\ell = f(u), \; u = g(y)$$

Here, $g$ represents the immediate next operation that takes $y$ (producing a new intermediate $u$), and $f$ represents all the rest of the computation between $u$ and $\ell$. By the chain rule,

$$\begin{align}
\frac{\partial \ell}{\partial y} &= \frac{\partial \ell}{\partial u} \frac{\partial u}{\partial y} \\
&= g'(y) \cdot \frac{\partial \ell}{\partial u}
\end{align}$$

Therefore, to compute $\frac{\partial \ell}{\partial y}$, in addition to $\frac{\partial \ell}{\partial u}$ (the derivative through all the rest of the computation), we only need the "local" information $g'(y)$, which is available in the program around where $y$ is computed and used. That is, if we ignore $\frac{\partial \ell}{\partial u}$ for a moment, we could compute $\frac{\partial \ell}{\partial y}$ at the same time we compute $y$.
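
For a concrete, hypothetical example, take $g = \sin$. If we are handed $\frac{\partial \ell}{\partial u}$ for $u = \sin(y)$, the local rule multiplies it by $g'(y) = \cos(y)$:

```python
import numpy as np

def sin_backward(y, dl_du):
    """Local backward rule for u = sin(y): dl/dy = g'(y) * dl/du = cos(y) * dl/du."""
    return np.cos(y) * dl_du

# Suppose ell = u**2 with u = sin(y); then dl/du = 2*u and dl/dy = 2*sin(y)*cos(y).
y = 0.3
u = np.sin(y)
dl_du = 2.0 * u
print(sin_backward(y, dl_du), 2.0 * np.sin(y) * np.cos(y))  # the two should match
```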

Multi Variable

More generally, suppose that the output $\ell$ can be written as

$$\ell = F(u_1, u_2, \ldots, u_k), \; u_i = g_i(y)$$

Here, $g_1, g_2, \ldots, g_k$ represent the immediate next operations that take $y$; the $u_i$ are the outputs of these operations, and $F$ represents all the rest of the computation between the $u_i$ and $\ell$. By the multivariate chain rule,

$$\begin{align}
\frac{\partial \ell}{\partial y} &= \sum_{i=1}^k \frac{\partial \ell}{\partial u_i} \frac{\partial u_i}{\partial y} \\
\nabla_y \ell &= \sum_{i=1}^k D g_i(y)^T \nabla_{u_i} \ell
\end{align}$$

Note that here we are considering the more general case, where $y$ and the $u_i$ are vectors (while $\ell$ remains the final scalar loss). So $g_i$ takes in a vector $y$ and outputs a vector $u_i$. When we take the derivative of $g_i$ at $y$, we get a matrix (the Jacobian), and we call this matrix $D g_i(y)$.

Again, we see that we can compute $\nabla_y \ell$ using only $\nabla_{u_i} \ell$ and the "local" information $D g_i(y)$.
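
As a sketch of this formula (the operations $g_1, g_2$ below are made-up examples, not anything from the notes), each operation that consumes $y$ contributes one term $D g_i(y)^T \nabla_{u_i} \ell$ to a running sum:

```python
import numpy as np

y = np.array([1.0, 2.0])

# Two made-up operations that both consume y.
# g1(y) = A @ y              -> Jacobian D g1(y) = A
# g2(y) = y**2 (elementwise) -> Jacobian D g2(y) = diag(2*y)
A = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
Dg1 = A
Dg2 = np.diag(2.0 * y)

# Pretend these gradients of the loss w.r.t. u1 = g1(y) and u2 = g2(y)
# were already computed further along in the backward pass.
du1 = np.array([0.1, 0.2, 0.3])
du2 = np.array([0.4, 0.5])

# nabla_y ell = sum_i D g_i(y)^T nabla_{u_i} ell
dy = Dg1.T @ du1 + Dg2.T @ du2
print(dy)  # same shape as y
```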

Computing Backprop

When to Compute?

Note that the value $\ell$ is computed as follows: $y \to u \to \ell$. So when we are computing $y$, which is also when we would ideally want to compute $\frac{\partial \ell}{\partial y}$, the value of $u$ is not yet available. Without $u$, we cannot compute $\frac{\partial \ell}{\partial u}$, which is a necessary ingredient for $\frac{\partial \ell}{\partial y}$. Therefore, we cannot compute $\frac{\partial \ell}{\partial y}$ along with $y$.

In other words, we cannot compute $\frac{\partial \ell}{\partial y}$ until after we've computed everything between $y$ and $\ell$. The solution is to remember the order in which we computed the values between $y$ and $\ell$, and then compute their gradients in the reverse order.
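
One simple way to realize this (a toy sketch, not any particular framework's implementation) is to record each operation on a "tape" during the forward pass and replay the tape backwards:

```python
import numpy as np

tape = []  # records (operation name, input value) in forward order

def forward_sin(y):
    u = np.sin(y)
    tape.append(("sin", y))     # remember how to push a gradient from u back to y
    return u

def forward_square(u):
    v = u ** 2
    tape.append(("square", u))
    return v

# Forward pass: y -> u -> ell
y = 0.3
u = forward_sin(y)
ell = forward_square(u)

# Backward pass: walk the tape in reverse order, starting from d ell / d ell = 1.
grad = 1.0
for op, inp in reversed(tape):
    if op == "square":
        grad = 2.0 * inp * grad    # d(u^2)/du = 2u
    elif op == "sin":
        grad = np.cos(inp) * grad  # d(sin y)/dy = cos(y)

print(grad, 2.0 * np.sin(y) * np.cos(y))  # the two should match
```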

Computational Graph

We draw a graph to indicate the computational dependencies, just like the $y \to u \to \ell$ chain above:

[Figure: example computational graph]

Algorithm

Generally speaking,

  1. For each tensor, initialize a gradient "accumulator" to 0. Set the gradient of $\ell$ with respect to itself to 1.

  2. For each tensor $y'$ that $\ell$ depends on, in the reverse order of computation, compute $\nabla_{y'} \ell$, the derivative of $\ell$ with respect to $y'$. This is obtainable because by the time we get to $y'$, all the intermediate values $u$ between $y'$ and $\ell$, together with their gradients $\nabla_u \ell$, have already been computed.

    Once this step has been executed for $y'$, its gradient is fully accumulated in its gradient accumulator.

In a computational graph, this means

  1. For each tensor, initialize a gradient "accumulator" to 0. Set the gradient of $\ell$ with respect to itself to 1.

  2. For each tensor $y'$ on a path to $\ell$, starting from $\ell$ and walking backwards along the graph, compute $\nabla_{y'} \ell$, the derivative of $\ell$ with respect to $y'$. Specifically, we compute this value by looking at all of its direct descendants $u_i$: $\nabla_{y'} \ell = \sum_{i=1}^k \frac{\partial \ell}{\partial u_i} \frac{\partial u_i}{\partial y'}$. We can do this because the $u_i$ are the nodes between $y'$ and $\ell$, so by the time we get to $y'$, all the $\nabla_{u_i} \ell$ values have already been computed.

    Once this step has been executed for every $y'$, the computation is done (see the sketch after this list).
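
Putting both steps together, here is a minimal sketch of backprop over a computational graph with gradient accumulators (the `Node` class and operations below are illustrative, not a real autodiff library):

```python
import numpy as np

class Node:
    """A scalar node in the computational graph."""
    def __init__(self, value, inputs=()):
        self.value = value
        self.inputs = inputs   # list of (input_node, local derivative d self / d input)
        self.grad = 0.0        # gradient accumulator, initialized to 0

def mul(a, b):
    return Node(a.value * b.value, [(a, b.value), (b, a.value)])

def sin(a):
    return Node(np.sin(a.value), [(a, np.cos(a.value))])

def backprop(loss, order):
    """`order` is the forward order of computation; we walk it in reverse."""
    loss.grad = 1.0                      # d ell / d ell = 1
    for node in reversed(order):
        for inp, local_deriv in node.inputs:
            # Each descendant contributes (d ell / d node) * (d node / d input).
            inp.grad += node.grad * local_deriv

# Example: ell = sin(x) * x, so d ell / dx = cos(x) * x + sin(x)
x = Node(0.5)
u = sin(x)
ell = mul(u, x)
backprop(ell, order=[x, u, ell])
print(x.grad, np.cos(0.5) * 0.5 + np.sin(0.5))  # the two should match
```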