
Automatic Differentiation

Cornell University

Derivative

Gradients

Suppose I have a function $f$ from $\mathbb{R}^d$ to $\mathbb{R}$. The gradient $\nabla f$ is the vector of partial derivatives of the function. Mathematically, it is a function from $\mathbb{R}^d$ to $\mathbb{R}^d$ such that

$$\left(\nabla f(w) \right)_i = \frac{\partial}{\partial w_i} f(w) = \lim_{\delta \rightarrow 0} \frac{f(w + \delta e_i) - f(w)}{\delta}$$

Another equivalent definition is that $\nabla f(w)^T$ is the linear map such that for any $u \in \mathbb{R}^d$,

$$\nabla f(w)^T u = \lim_{\delta \rightarrow 0} \frac{f(w + \delta u) - f(w)}{\delta}$$

More informally, the gradient uniquely characterizes $f$ near a point $w_0$ in the following way:

$$f(w) \approx f(w_0) + (w - w_0)^T \nabla f(w_0)$$
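
As a quick sanity check (a minimal sketch, not part of the original notes), we can compare the directional-derivative definition against a hand-computed gradient. For $f(w) = w_1^2 + 3 w_2$ the gradient is $(2 w_1, 3)$:

import numpy as np

def f(w):
    return w[0]**2 + 3*w[1]

def grad_f(w):
    # hand-computed gradient of f
    return np.array([2*w[0], 3.0])

w = np.array([1.0, 2.0])
u = np.array([0.5, -1.0])
delta = 1e-6

# (f(w + delta*u) - f(w)) / delta should approach grad_f(w)^T u as delta -> 0
print((f(w + delta*u) - f(w)) / delta)  # close to -2.0
print(grad_f(w) @ u)                    # exactly 2*1*0.5 + 3*(-1) = -2.0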

Gradient Operator

Here we introduce something much more general.

For a function $F$ from one vector space $U$ to another vector space $V$, the derivative of $F$ is a function $DF$ which maps $U$ to $\mathcal{L}(U,V)$, where $\mathcal{L}(U,V)$ denotes the set of linear maps from $U$ to $V$. This means $DF$ takes in an element $x \in U$ and returns the derivative of $F$ at this point $x$. This derivative $DF(x)$ takes in a direction vector $\Delta \in U$ and outputs the directional derivative of $F$ at point $x$ in direction $\Delta$.

The derivative is defined as the unique function such that for any $x$ and $\Delta$ in $U$,

$$\lim_{\alpha \rightarrow 0} \frac{F(x + \alpha \Delta) - F(x)}{\alpha} = (DF(x)) \Delta$$

As a special case, note that any linear map from $\mathbb{R}^n$ to $\mathbb{R}$ can be written in the form $\Delta \mapsto b^T \Delta$ for some $b \in \mathbb{R}^n$. Therefore, if $F: \mathbb{R}^n \rightarrow \mathbb{R}$, we can always write $DF(x)\Delta$ in the form $b^T \Delta$; this $b$ is exactly the gradient $\nabla F(x)$.
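
As a concrete example (not from the original notes) of why the extra generality matters, take $F(X) = X^2$ on the vector space of $n \times n$ matrices. Expanding $F(X + \alpha \Delta) = X^2 + \alpha (X\Delta + \Delta X) + \alpha^2 \Delta^2$ and taking the limit in the definition gives $(DF(X))\Delta = X\Delta + \Delta X$: a perfectly good linear map in $\Delta$, even though (since $X\Delta \ne \Delta X$ in general) it cannot be summarized by a single gradient vector.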

Symbolic differentiation

  1. Write your function as a single mathematical expression.
  2. Apply the chain rule, product rule, ..., to differentiate that expression.
  3. Execute the expression as code (a short sketch follows below).
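
As a minimal sketch of these three steps (the notes don't prescribe a tool; SymPy is one common choice), for $f(x) = 2x^2 - 1$:

import sympy as sp

x = sp.symbols('x')
expr = 2*x*x - 1              # step 1: the function as a single expression
dexpr = sp.diff(expr, x)      # step 2: apply the differentiation rules symbolically
dfdx = sp.lambdify(x, dexpr)  # step 3: turn the derivative expression into code

print(dexpr)      # 4*x
print(dfdx(3.0))  # 12.0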

Problems

The derivative expression can grow much faster than the original expression ("expression swell"), and the whole computation must first be written as one closed-form expression, which rules out ordinary program constructs like loops and branches.

Numerical Differentiation

Just take a small enough value (like $10^{-8}$) and use it in place of the infinitely small $\epsilon$:

$$\frac{\partial}{\partial x} f(x) \approx \frac{f(x + \epsilon) - f(x)}{\epsilon}$$
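
A minimal sketch of how sensitive this is to the choice of step size (the exact digits depend on floating-point details):

def forward_difference(f, x, eps):
    # one-sided finite difference approximation to df/dx
    return (f(x + eps) - f(x)) / eps

def f(x):
    return 2*x*x - 1  # exact derivative at x = 3 is 12.0

print(forward_difference(f, 3.0, 1e-8))   # close to 12, small error
print(forward_difference(f, 3.0, 1e-1))   # off by ~0.2: truncation error dominates
print(forward_difference(f, 3.0, 1e-12))  # noticeably off: round-off error dominates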

Problems

Choosing $\epsilon$ is a delicate tradeoff, as the sketch above illustrates: too large and truncation error dominates; too small and floating-point round-off dominates. Either way we lose some precision, and estimating a full gradient over $\mathbb{R}^d$ takes $d + 1$ function evaluations.

Automatic Differentiation (Forward Mode)

Automatic differentiation allows us to compute derivatives automatically, without the expression swell of symbolic differentiation or the loss of precision of numerical differentiation. There are two rough classes of methods: forward mode and reverse mode. We introduce forward mode here.

Forward mode fixes one input variable $x$ over $\mathbb{R}$. At each step of the computation, as we're computing some value $y$, we also compute $\frac{\partial y}{\partial x}$. We can do this with a dual-numbers approach: each number $y$ is replaced with a pair $(y, \frac{\partial y}{\partial x})$. Each arithmetic operation then updates both components; for example, multiplication follows the product rule: $(y_1, y_1') \cdot (y_2, y_2') = (y_1 y_2,\; y_1' y_2 + y_1 y_2')$.

Demo

def to_dualnumber(x):
    if isinstance(x, DualNumber):
        return x
    elif isinstance(x, float):
        return DualNumber(x)
    elif isinstance(x, int):
        return DualNumber(float(x))
    else:
        raise Exception("couldn't convert {} to a dual number".format(x))

class DualNumber(object):
    def __init__(self, y, dydx=0.0):
        super().__init__()
        self.y = y
        self.dydx = dydx
        
    def __repr__(self):
        return "(y = {}, dydx = {})".format(self.y, self.dydx)

    # operator overloading
    def __add__(self, other):
        other = to_dualnumber(other)
        return DualNumber(self.y + other.y, self.dydx + other.dydx)
    def __sub__(self, other):
        other = to_dualnumber(other)
        return DualNumber(self.y - other.y, self.dydx - other.dydx)
    def __mul__(self, other):
        other = to_dualnumber(other)
        return DualNumber(self.y * other.y, self.dydx * other.y + self.y * other.dydx)
    def __truediv__(self, other):
        other = to_dualnumber(other)  # convert plain numbers, as the other operators do
        # quotient rule: d(u/v) = du/v - u dv / v^2
        return DualNumber(self.y / other.y, self.dydx / other.y - self.y * other.dydx / (other.y * other.y))
    
    def __radd__(self, other):
        return to_dualnumber(other).__add__(self)
    def __rsub__(self, other):
        return to_dualnumber(other).__sub__(self)
    def __rmul__(self, other):
        return to_dualnumber(other).__mul__(self)
    def __rtruediv__(self, other):
        return to_dualnumber(other).__truediv__(self)
    
def forward_mode_diff(f, xv):
    """
    Computes df/dx at x = xv.
    f is a function that uses +, -, *, / or other operators we have overloaded;
    xv is the point at which we want to calculate the derivative.
    """
    # x is a variable that has value xv; dx/dx = 1.0
    x = DualNumber(xv, 1.0)
    # f(x) is a DualNumber: x is a DualNumber, and every operation f applies
    # (like + or *) is overloaded for DualNumber, so the result carries the
    # derivative along with the value and we can read it off directly.
    return f(x).dydx

def f(x):
    return 2*x*x - 1  # df/dx = 4x

def dfdx(x):
    return 4*x  # the exact derivative, computed by hand

def numerical_derivative(f, x, eps=1e-5):
    # central finite difference, for comparison
    return (f(x + eps) - f(x - eps)) / (2*eps)

print(dfdx(3.0)) # 12.0
print(numerical_derivative(f, 3.0)) # 12.000000000078613
print(forward_mode_diff(f, 3.0)) # 12.0

Benefits

Simple in-place operations; easy to extend to compute higher-order derivatives (a sketch follows below).
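
For instance (a sketch, not from the notes), the DualNumber class above happens to nest: its y and dydx slots can themselves hold DualNumbers, so differentiating the derivative yields a second derivative:

def second_derivative(f, xv):
    # run forward mode on the function "derivative of f at x", tracking x itself
    return forward_mode_diff(lambda x: forward_mode_diff(f, x), xv)

print(second_derivative(f, 3.0))  # 4.0, since d^2/dx^2 (2x^2 - 1) = 4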

Problem

Each forward pass differentiates with respect to a single scalar input. For a function of a vector in $\mathbb{R}^d$, recovering the full gradient takes $d$ separate passes, one per coordinate, which gets expensive when $d$ is large; see the sketch below.
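
A minimal sketch of the one-pass-per-coordinate workaround (forward_mode_grad is a hypothetical helper, not from the notes), reusing the DualNumber class above:

def forward_mode_grad(f, xv):
    # seed dydx = 1.0 for one coordinate at a time: d passes in total
    grad = []
    for i in range(len(xv)):
        xs = [DualNumber(v, 1.0 if j == i else 0.0) for j, v in enumerate(xv)]
        grad.append(f(xs).dydx)
    return grad

# example: f(w) = w0*w1 + w1, so the gradient at (2, 3) is (w1, w0 + 1) = (3, 3)
print(forward_mode_grad(lambda w: w[0]*w[1] + w[1], [2.0, 3.0]))  # [3.0, 3.0]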