Automatic Differentiation (AutoDiff or AD) is a set of techniques used to evaluate the partial derivatives of a function. Its strength comes from the fact that it is the most computationally efficient way to do so.
AD has a two-sided nature: it is partly symbolic and partly numerical, since it keeps track of the expression (unlike the numerical method) while returning a final numerical value (unlike the symbolic method).
What AD is not
It’s not numerical differentiation
Numerical differentiation uses a finite difference approximation. For a multivariable scalar function \(f:\mathbb{R}^n \to \mathbb{R}\), we can estimate the gradient \(\nabla f = \left(\dfrac{\partial f}{\partial x_1}, …, \dfrac{\partial f}{\partial x_n}\right)\) using

$$\dfrac{\partial f}{\partial x_i}(x) \approx \dfrac{f(x + h e_i) - f(x)}{h}$$

with \(e_i\) the \(i^{th}\) unit vector and \(h > 0\) very small (usually \(h \approx 10^{-5}\)).
It seems to be a good way to compute derivatives since it is easy to implement, but it has the disadvantage of a time complexity of \(\mathcal{O}(n)\) evaluations of \(f\) for a vector of dimension \(n\) (an issue for a model with millions of parameters), and it requires a careful choice of \(h\). Moreover, since it is still an approximation, it can lead to large numerical errors due to Truncation Error and Round-off Error.
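To make this concrete, here is a minimal sketch of a forward-difference gradient estimate in Python (the helper name `numerical_gradient` and the test function are illustrative choices, not from any library):

```python
import numpy as np

def numerical_gradient(f, x, h=1e-5):
    """Estimate the gradient of f at x with forward differences.

    Requires n + 1 evaluations of f for an n-dimensional input,
    which is exactly the O(n) cost mentioned above.
    """
    x = np.asarray(x, dtype=float)
    grad = np.zeros_like(x)
    fx = f(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = h                      # h * e_i
        grad[i] = (f(x + step) - fx) / h
    return grad

# Quick check on f(x) = sum(x_i^2), whose gradient is 2x
print(numerical_gradient(lambda x: np.sum(x ** 2), [1.0, 2.0, 3.0]))
# -> approximately [2., 4., 6.]
```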
It’s not Symbolic Differentiation
Symbolic differentiation is a straightforward application of the defined derivative rules to the expression of the function. For example, if \(f(x) = g(x)h(x)\) then

$$\dfrac{d}{dx}f(x) = \dfrac{dg(x)}{dx}h(x) + g(x)\dfrac{dh(x)}{dx}$$
It can be very useful to get some insight into the evaluated function (such as finding where \(\dfrac{d}{dx}f(x) = 0\)). The problem here is a phenomenon called expression swell: derivative expressions can grow exponentially larger than the expression they derive from, which makes their evaluation inefficient.
E.g. from this paper [1]: iterations of the logistic map \(l_{n+1} = 4l_n(1-l_n)\text{, with } l_1=x\), illustrating the expression swell:
n | \(l_n\) | \(\dfrac{d}{dx}l_n\) |
---|---|---|
1 | \(x\) | \(1\) |
2 | \(4x(1-x)\) | \(4(1-x)-4x\) |
3 | \(16x(1-x)(1-2x)^2\) | \(16(1 − x)(1 − 2x)^2 − 16x(1 − 2x)^2 − 64x(1 − x)(1 − 2x)\) |
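You can observe this swell yourself with a symbolic library such as SymPy; the snippet below is only an illustration of the table above (the operation count is a rough proxy for expression size):

```python
import sympy as sp

x = sp.symbols('x')
l = x                          # l_1 = x
for n in range(1, 6):
    d = sp.diff(l, x)          # symbolic derivative d l_n / dx
    print(n, sp.count_ops(d))  # number of operations in the derivative grows quickly
    l = 4 * l * (1 - l)        # next iteration of the logistic map
```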
If we are only interested in an accurate numerical evaluation of the derivative, and not in its symbolic form, it would be interesting to store only the values of the intermediate sub-expressions in memory: we could apply symbolic differentiation to each elementary operation and keep the intermediate numerical results. This is AD in forward mode.
AD and its two modes
The main idea of AD is to decompose the main expression into a composition of elementary operations called the evaluation trace.
AD is mostly based on the fact that all numerical computations are compositions of basic operations whose derivatives are known (Griewank and Walther, 2008), and that those derivatives can be combined through the Chain Rule. These basic operations usually include:
- Basic arithmetic operations
- Unary sign switch
- Transcendental functions such as \(e^x\), \(ln(x)\) and trigonometric functions.
For the rest of the text, we will take the following function to illustrate the ideas:

$$f(a, b) = b\,sin(a) + b^2$$
and we will denote by \(w_i, i=0, …, l\) the intermediate variables. Let’s compute the evaluation trace of \(f\) and represent it as a computational graph, which is, as you will notice, a Directed Acyclic Graph (DAG).
intermediate vars | expressions | values |
---|---|---|
\(a\) | \(a\) | \(2\) |
\(b\) | \(b\) | \(5\) |
\(w_0\) | \(sin(a)\) | \(0.909\) |
\(w_1\) | \(b*w_0\) | \(4.546\) |
\(w_2\) | \(b^2\) | \(25\) |
\(r\) | \(w_1 + w_2\) | \(29.546\) |
```mermaid
flowchart LR
    A[$$a$$] --> B(( $$w_0$$ ))
    C[$$b$$] --> D(( $$w_1$$ ))
    B --> D
    C --> E(( $$w_2$$ ))
    E --> F(( $$r$$ ))
    D --> F
```
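Evaluating this trace is nothing more than running the program one elementary operation at a time; here is the same computation written out in plain Python for \((a, b) = (2, 5)\):

```python
import numpy as np

a, b = 2.0, 5.0
w0 = np.sin(a)   # 0.909
w1 = b * w0      # 4.546
w2 = b ** 2      # 25
r = w1 + w2      # 29.546
```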
Forward Mode (Tangent)
Forward mode is the easiest and most straightforward version of AD. Let’s return to our base function:

$$f(a, b) = b\,sin(a) + b^2$$

To compute the derivative of \(f\) with respect to \(a\), we set the following notation for the derivative of each intermediate variable \(w_i\):

$$\dot{w_i} = \dfrac{\partial w_i}{\partial a}$$
We first initialize the two values \(\dot{a}\) and \(\dot{b}\) to \(\dot{a} = 1\) and \(\dot{b} = 0\), mark the final result \(\dot{r} = \dfrac{\partial r}{\partial a} = \,?\) as the quantity we want, and then apply the Chain Rule to each elementary operation.
intermediate vars | derivative expression | derivative value (aka. tangent trace) |
---|---|---|
\(\dot{a}\) | \(\dot{a} = \dfrac{\partial a}{\partial a}\) | \(1\) |
\(\dot{b}\) | \(\dot{b} = \dfrac{\partial b}{\partial a}\) | \(0\) |
\(\dot{w_0}\) | \(\dot{a}*cos(a) = cos(a)\) | \(-0.416\) |
\(\dot{w_1}\) | \(\dot{b}w_0 + b\dot{w_0} = b*cos(a)\) | \(-2.081\) |
\(\dot{w_2}\) | \(2b\dot{b}\) | \(0\) |
\(\dot{r}\) | \(\dot{w_1} + \dot{w_2} = b*cos(a)\) | \(-2.081\) |
As you can see, this can be generalized to compute one column of the Jacobian matrix of a function \(f:\mathbb{R}^n \to \mathbb{R}^m\) at a given point \(a\): setting \(\dot{x} = e_i\) yields \(\dot{y_j} = \dfrac{\partial y_j}{\partial x_i}\Bigg | _{x = a}\text{, }j=1, …, m\).
The full Jacobian is then computed in \(n\) forward evaluations. The forward mode is also a matrix-free and excellent way to compute a Jacobian-vector product \(J_f\,r\), by initializing \(\dot{x} = r\). In general, for functions \(f:\mathbb{R}^n \to \mathbb{R}^m\) where \(n \gg m\), the second mode, the reverse mode, is preferable.
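A common way to implement the forward mode is with dual numbers, which carry a primal value and its tangent together and overload the elementary operations. The sketch below is a minimal illustration (the `Dual` class and the standalone `sin` helper are names I made up, not part of any particular library):

```python
import numpy as np

class Dual:
    """A value paired with its tangent (directional derivative)."""
    def __init__(self, value, dot=0.0):
        self.value, self.dot = value, dot

    def __add__(self, other):
        return Dual(self.value + other.value, self.dot + other.dot)

    def __mul__(self, other):
        # product rule applied to the tangents
        return Dual(self.value * other.value,
                    self.dot * other.value + self.value * other.dot)

def sin(x):
    return Dual(np.sin(x.value), x.dot * np.cos(x.value))

# df/da at (a, b) = (2, 5): seed a-dot = 1 and b-dot = 0
a, b = Dual(2.0, 1.0), Dual(5.0, 0.0)
r = b * sin(a) + b * b
print(r.value, r.dot)   # approximately 29.546 and -2.081
```

Seeding \((\dot{a}, \dot{b})\) with an arbitrary direction vector instead of a unit vector yields the corresponding Jacobian-vector product in a single pass.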
Reverse Mode (Adjoint)
Reverse mode Automatic Differentiation is a generalization of the backpropagation algorithm. Here, unlike the forward mode, we compute the derivatives backward from a given output in a second pass, using the adjoint of each variable:

$$\bar{w_i} = \dfrac{\partial r}{\partial w_i}$$
Unlike the forward mode, the reverse mode computes the contribution of each input and intermediate variable to the change of the output variable \(r\). The reverse mode principally relies on the multivariable case of the Chain Rule:

$$\bar{w_i} = \sum_{j \in \text{children}(i)} \bar{w_j}\,\dfrac{\partial w_j}{\partial w_i}$$
Then, after the forward pass to compute the primals, we will do a second pass starting from the result to compute the adjoint trace, for \((a, b) = (2, 5)\). First we initialize the final result: \(\bar{r} = \dfrac{\partial r}{\partial r} = 1\)
intermediate vars | derivative expression | derivative value (aka. adjoint trace) |
---|---|---|
\(\bar{r}\) | \(\bar{r}\) | \(1\) |
\(\bar{w_1}\) | \(\bar{w_1}\) += \(\dfrac{\partial r}{\partial r} *\dfrac{\partial r}{\partial w_1} = \dfrac{\partial r}{\partial r} = \bar{r}\) | \(1\) |
\(\bar{w_2}\) | \(\bar{w_2}\) += \(\bar{r}\) | \(1\) |
\(\bar{b}\) | \(\bar{b}\) += \(\bar{w_2}*2b\) = \(2b\) | \(10\) |
\(\bar{w_0}\) | \(\bar{w_0}\) += \(\bar{w_1}*b\) = \(b\) | \(5\) |
\(\bar{b}\) | \(\bar{b}\) += \(\bar{w_1}*w_0\) = \(2b + sin(a)\) | \(10.909\) |
\(\bar{a}\) | \(\bar{a}\) += \(\bar{w_0}cos(a)\) = \(bcos(a)\) | \(-2.081\) |
An easy implementation
We will rely on Andrej Karpathy’s implementation of reverse-mode AD, micrograd [2].
We can first define a class `Value` that will save:
- the primitive value
- the adjoint trace
- the final derivative value
```python
class Value():
    def __init__(self, value):
        self.value = value
        self._trace = lambda: None
        self.diff = 0
```
In practice, the `self._trace` attribute will be defined during the forward pass and then called during the backward pass.
Let’s recall the original example; its forward primal trace can be represented as a computational graph.
```mermaid
flowchart TB
    A[$$a$$] --> B(( $$w_0$$ ))
    C[$$b$$] --> D(( $$w_1$$ ))
    B --> D
    C --> E(( $$w_2$$ ))
    E --> F(( $$r$$ ))
    D --> F
```
We can obtain the reverse topological order of this graph thanks to a DFS algorithm (we therefore need to keep track of the parents of each intermediate value): \([r, w_1, w_2, b, w_0, a]\).
```python
class Value():
    def __init__(self, value, parents=()):
        self.value = value
        self._trace = lambda: None
        self.diff = 0
        self.parents = parents
```
So, in practice, for the backward pass, we will iterate through this reverse topological order. What can then be done is to define the `_trace` function of each child during the forward pass, and then execute it during the backward pass so that it updates the `self.diff` of its parents.
To do so, we have to redefine the basic elementary operations inside the `Value` class so that each one defines the `_trace` of the child variable it creates, e.g. the \(+\), \(\times\) and \(sin\) operations:
```python
import numpy as np

# The following methods go inside the Value class defined above.

def __add__(self, other):
    out = Value(self.value + other.value, (self, other))

    def _trace():
        # d(out)/d(self) = 1 and d(out)/d(other) = 1
        self.diff += out.diff
        other.diff += out.diff

    out._trace = _trace
    return out

def __mul__(self, other):
    out = Value(self.value * other.value, (self, other))

    def _trace():
        # product rule: each factor receives out.diff times the other factor
        self.diff += out.diff * other.value
        other.diff += out.diff * self.value

    out._trace = _trace
    return out

def sin(self):
    out = Value(np.sin(self.value), (self,))

    def _trace():
        self.diff += out.diff * np.cos(self.value)

    out._trace = _trace
    return out
```
We also need to define the DFS algorithm, which will also initialize the `diff` of the given variable to 1 before running the `_trace` functions in reverse topological order.
```python
def backprop(self):
    # This method also goes inside the Value class.
    visited = set()
    topo_order = []

    def visit(node):
        if node not in visited:
            visited.add(node)
            for parent in set(node.parents):
                visit(parent)
            topo_order.append(node)

    visit(self)
    self.diff = 1                        # seed the adjoint of the output
    for value in reversed(topo_order):   # reverse topological order
        value._trace()
```
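With these pieces in place, we can run a quick sanity check on the running example (this assumes the methods above are defined inside the `Value` class and that `numpy` is imported as `np`):

```python
a, b = Value(2.0), Value(5.0)
r = b * a.sin() + b * b   # f(a, b) = b*sin(a) + b^2
r.backprop()

print(r.value)            # approximately 29.546
print(a.diff, b.diff)     # approximately -2.081 and 10.909
```

The results match the adjoint trace computed by hand above.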
Conclusion
That is all for the basics of autodiff. You can go more in depth with the survey of Baydin et al. [1] and the step-by-step introduction of Fang et al. [3], or check this repo that covers the subject and applies it to the backpropagation of an MLP. There are also some good videos on YouTube that cover the subject [4] [5].
References
[1]: Baydin et al. “Automatic differentiation in machine learning: a survey”. The Journal of Machine Learning Research, 18(153):1–43, 2018. link
[2]: Andrej Karpathy. “micrograd”. GitHub. link
[3]: Fang et al. “A Step-by-step Introduction to the Implementation of Automatic Differentiation”. link
[4]: Ari Seff. “What is Automatic Differentiation?”. YouTube.
[5]: archquant. “Automatic Differentiation Explained with Example”. YouTube.