
A quick review of Automatic Differentiation and its modes

August 23, 2024

Automatic Differentiation (AutoDiff or AD) is a set of techniques used to evaluate the partial derivatives of a function. Its strength comes from the fact that it is the most computationally efficient way to do so.

AD has a two-sided nature: it is partly symbolic and partly numerical, since it keeps track of the expression (unlike the numerical method) and gives a final numerical value (unlike the symbolic method).

What AD is not

It’s not numerical differentiation

Numerical Differentiation uses a finite difference approximation. For a multivariable scalar function \(f:\mathbb{R}^n \to \mathbb{R}\), we can estimate the gradient \(\nabla f = \left(\dfrac{\partial f}{\partial x_1}, …, \dfrac{\partial f}{\partial x_n}\right)\) using

$$\dfrac{\partial f(x)}{\partial x_i} \approx \frac{f(x+he_i) - f(x)}{h}$$

with \( e_i \) the \(i^{th}\) unit vector and \(h > 0\) very small (usually \(h \approx 10^{-5}\)).

It seems to be a good way to compute derivatives since it is easy to implement, but it has the disadvantage of a time complexity of \(\mathcal{O}(n)\) for a vector of dimension \(n\) (an issue for a model with millions of parameters) and requires a careful choice of \(h\). Moreover, since it is still an approximation, it can lead to large numerical errors due to truncation error and round-off error.
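
For illustration, here is a minimal finite-difference sketch in Python (the helper numerical_grad and the choice of h are only illustrative, not part of any library):

import numpy as np

def numerical_grad(f, x, h=1e-5):
	# Forward-difference approximation of the gradient of f at x:
	# one extra function evaluation per coordinate, hence O(n) cost.
	grad = np.zeros_like(x, dtype=float)
	fx = f(x)
	for i in range(len(x)):
		e = np.zeros_like(x, dtype=float)
		e[i] = h
		grad[i] = (f(x + e) - fx) / h
	return grad

# Example: f(a, b) = b*sin(a) + b^2 at (a, b) = (2, 5)
f = lambda x: x[1] * np.sin(x[0]) + x[1] ** 2
print(numerical_grad(f, np.array([2.0, 5.0])))  # ~[-2.081, 10.909]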

It’s not Symbolic Differentiation

Symbolic Differentiation is a straightforward application of the known derivative rules to the function's expression, e.g. if \(f(x) = g(x)h(x)\) then

$$\dfrac{\partial f}{\partial x} = \dfrac{\partial g(x)}{\partial x}h(x) + g(x)\dfrac{\partial h(x)}{\partial x}$$

It can be very useful to get some insight into the evaluated function (such as finding where \(\dfrac{d}{dx}f(x) = 0\)). The problem here is a phenomenon called expression swell: the derivative expressions can grow exponentially larger than the expression they are derived from, making their evaluation inefficient.

For example, from this paper [1]: iterations of the logistic map \(l_{n+1} = 4l_n(1-l_n)\text{, with } l_1=x\), illustrating the expression swell:

| \(n\) | \(l_n\) | \(\dfrac{d}{dx}l_n\) |
|---|---|---|
| 1 | \(x\) | \(1\) |
| 2 | \(4x(1-x)\) | \(4(1-x)-4x\) |
| 3 | \(16x(1-x)(1-2x)^2\) | \(16(1-x)(1-2x)^2 - 16x(1-2x)^2 - 64x(1-x)(1-2x)\) |
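
We can reproduce this growth with a computer algebra system; here is a small sketch using SymPy (used purely for illustration):

import sympy as sp

x = sp.symbols('x')
l = x  # l_1 = x
for n in range(1, 4):
	print(n, sp.diff(l, x))  # derivative of l_n: the expression keeps growing
	l = 4 * l * (1 - l)      # next iterate of the logistic map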

If we are only interested in an accurate numerical evaluation of the derivative, and not in its symbolic form, it would be interesting to store only the values of the intermediate sub-expressions in memory:

we could apply symbolic differentiation to each elementary operation and keep only the intermediate numerical results: this is AD in forward mode.

AD and its two modes

The main idea of AD is to decompose the expression into a composition of elementary operations, called the evaluation trace.

AD is mostly based on the fact that every numerical computation is a composition of basic operations whose derivatives are known (Griewank and Walther, 2008), and that those derivatives can be combined through the Chain Rule. These basic operations usually consist of arithmetic operations (addition, multiplication, ...) and elementary functions (sin, cos, exp, log, ...).

For the rest of the text, we will use the following function as a running example:

$$f(a, b) = b*sin(a) + b^2 = r$$

and we will denote by \(w_i, i=0, …, l\) the intermediate variables. Let's compute the evaluation trace of \(f\) and represent it in a computational graph, which is, as you will notice, a Directed Acyclic Graph (DAG).

| intermediate vars | expressions | values |
|---|---|---|
| \(a\) | \(a\) | \(2\) |
| \(b\) | \(b\) | \(5\) |
| \(w_0\) | \(sin(a)\) | \(0.909\) |
| \(w_1\) | \(b*w_0\) | \(4.546\) |
| \(w_2\) | \(b^2\) | \(25\) |
| \(r\) | \(w_1 + w_2\) | \(29.546\) |

flowchart LR
A[$$a$$] --> B(( $$w_0$$ ))
C[$$b$$] --> D(( $$w_1$$ ))
B --> D
C --> E(( $$w_2$$ ))
E --> F(( $$r$$ ))
D --> F
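
Evaluating this trace step by step with plain floats gives the values above (a quick sketch just to make the numbers concrete):

import numpy as np

a, b = 2.0, 5.0
w0 = np.sin(a)  # 0.909
w1 = b * w0     # 4.546
w2 = b ** 2     # 25.0
r = w1 + w2     # 29.546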

Forward Mode (Tangent)

Forward Mode is the easiest and most straightforward version of AD. Let's return to our base function:

$$f(a, b) = b*sin(a) + b^2 = r$$

So to compute the derivative of \(f\) with respect to \(a\), we introduce the following notation for the derivative of each intermediate variable \(w_i\):

$$\dot{w_i} = \dfrac{\partial w_i}{\partial a}$$

We first begin by initializing the two values \(\dot{a} = 1\) and \(\dot{b} = 0\), with the final result \(\dot{r} = \dfrac{\partial r}{\partial a}\) still to be determined, and then apply the Chain Rule to each elementary operation.

| intermediate vars | derivative expression | derivative value (aka. tangent trace) |
|---|---|---|
| \(\dot{a}\) | \(\dot{a} = \dfrac{\partial a}{\partial a}\) | \(1\) |
| \(\dot{b}\) | \(\dot{b} = \dfrac{\partial b}{\partial a}\) | \(0\) |
| \(\dot{w_0}\) | \(\dot{a}*cos(a) = cos(a)\) | \(-0.416\) |
| \(\dot{w_1}\) | \(\dot{b}w_0 + b\dot{w_0} = b*cos(a)\) | \(-2.081\) |
| \(\dot{w_2}\) | \(2b\dot{b}\) | \(0\) |
| \(\dot{r}\) | \(\dot{w_1} + \dot{w_2} = b*cos(a)\) | \(-2.081\) |

As you can see, this can be generalized to compute one column of the Jacobian matrix of a function \(f:\mathbb{R}^n \to \mathbb{R}^m\) at a given point \(\mathbf{a}\): seeding \(\dot{x_i} = 1\) yields \(\dot{y_j} = \dfrac{\partial y_j}{\partial x_i}\Bigg|_{\mathbf{x} = \mathbf{a}}\text{, } j=1, …, m\)
$$ \mathbf{J}_f = \left[ \begin{array}{ccc} \frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial y_m}{\partial x_1} & \cdots & \frac{\partial y_m}{\partial x_n} \end{array} \right] \Bigg|_{\mathbf{x} = \mathbf{a}} $$

The full Jacobian is then computed in \(n\) forward evaluations. The forward mode is also a matrix-free and efficient way to compute Jacobian-vector products

$$ \mathbf{J}_f \mathbf{r} = \left[ \begin{array}{ccc} \frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial y_m}{\partial x_1} & \cdots & \frac{\partial y_m}{\partial x_n} \end{array} \right] \left[ \begin{array}{c} r_1 \\ \vdots \\ r_n \end{array} \right] $$

by initializing \(\dot{\mathbf{x}} = \mathbf{r}\). In general, for functions \(f:\mathbb{R}^n \to \mathbb{R}^m\) where \(n \gg m\), the second mode, the reverse mode, is preferable.
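
To make the forward mode concrete, here is a minimal dual-number sketch in Python; the Dual class and its methods are my own illustrative names, not a library API:

import numpy as np

class Dual:
	# Carries a primal value together with its tangent d (here d = dw/da, seeded at the inputs).
	def __init__(self, value, d=0.0):
		self.value, self.d = value, d
	def __add__(self, other):
		return Dual(self.value + other.value, self.d + other.d)
	def __mul__(self, other):
		return Dual(self.value * other.value, self.d * other.value + self.value * other.d)
	def sin(self):
		return Dual(np.sin(self.value), np.cos(self.value) * self.d)

# df/da at (a, b) = (2, 5): seed a with tangent 1 and b with tangent 0
a, b = Dual(2.0, 1.0), Dual(5.0, 0.0)
r = b * a.sin() + b * b
print(r.value, r.d)  # ~29.546, ~-2.081 (= b*cos(a))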

Reverse Mode (Adjoint)

Reverse-mode Automatic Differentiation is a generalization of the backpropagation algorithm. Here, unlike in the forward mode, the derivatives are computed backward, in a second pass, from a given output \(y\): for each variable (for instance \(a\)) we define its adjoint

$$\bar{a} = \dfrac{\partial y}{\partial a}$$

Unlike the forward mode, which propagates derivatives with respect to one input, the reverse mode computes the contribution of each input and intermediate variable to the change of the output variable \(r\). It mainly relies on the multivariable case of the Chain Rule:

$$\dfrac{\partial (y_1, ..., y_k)}{\partial x_i} = \sum^m_{l=1}\dfrac{\partial (y_1, ..., y_k)}{\partial u_l}\dfrac{\partial u_l}{\partial x_i}$$ and in the special case where k = 1 $$\dfrac{\partial y}{\partial x_i} = \sum^m_{l=1}\dfrac{\partial y}{\partial u_l}\dfrac{\partial u_l}{\partial x_i}$$

Then, after the forward pass computing the primals, we do a second pass starting from the result to compute the adjoint trace, for \((a, b) = (2, 5)\). First, we initialize the final result: \(\bar{r} = \dfrac{\partial r}{\partial r} = 1\)

| intermediate vars | derivative expression | derivative value (aka. adjoint trace) |
|---|---|---|
| \(\bar{r}\) | \(\bar{r}\) | \(1\) |
| \(\bar{w_1}\) | \(\bar{w_1}\) += \(\dfrac{\partial r}{\partial r}\dfrac{\partial r}{\partial w_1} = \bar{r}\) | \(1\) |
| \(\bar{w_2}\) | \(\bar{w_2}\) += \(\bar{r}\) | \(1\) |
| \(\bar{b}\) | \(\bar{b}\) += \(\bar{w_2}*2b = 2b\) | \(10\) |
| \(\bar{w_0}\) | \(\bar{w_0}\) += \(\bar{w_1}*b = b\) | \(5\) |
| \(\bar{b}\) | \(\bar{b}\) += \(\bar{w_1}*w_0 = 2b + sin(a)\) | \(10.909\) |
| \(\bar{a}\) | \(\bar{a}\) += \(\bar{w_0}*cos(a) = b*cos(a)\) | \(-2.081\) |
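
As a sanity check, a reverse-mode AD library such as JAX gives the same adjoints (a quick sketch, assuming JAX is installed):

import jax
import jax.numpy as jnp

def f(a, b):
	return b * jnp.sin(a) + b ** 2

# jax.grad builds the gradient with reverse-mode AD for scalar outputs
df_da, df_db = jax.grad(f, argnums=(0, 1))(2.0, 5.0)
print(df_da, df_db)  # ~-2.081, ~10.909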

An easy implementation

We will rely on Andrej Karpathy's micrograd implementation of reverse-mode AD [2]. We can first define a class Value that stores the primal value, a _trace function (the local backward rule), the accumulated derivative diff, and the parents of the node.

In practice, the self._trace function will be defined during the forward pass and then called during the backward pass. Let's recall the original example; its forward primal trace can be represented as a computational graph.

flowchart TB
A[$$a$$] --> B(( $$w_0$$ ))
C[$$b$$] --> D(( $$w_1$$ ))
B --> D
C --> E(( $$w_2$$ ))
E --> F(( $$r$$ ))
D --> F

We can obtain the reverse topological order of this graph thanks to a DFS algorithm (we therefore need to keep track of the parents of each intermediate value): \([r, w_1, w_2, b, w_0, a]\).

import numpy as np

class Value():
	def __init__(self, value, parents = ()):
		self.value = value           # primal value, computed in the forward pass
		self._trace = lambda : None  # local backward rule, set by the operation that created this node
		self.diff = 0                # accumulated derivative (adjoint), filled in the backward pass
		self.parents = parents       # the Values this node was computed from

So, in practice, for the backward pass, we will iterate through this reverse topological order. What we can then do is define the _trace function of each child during the forward pass, and then execute it during the backward pass so that it updates the self.diff of its parents.

$$ \renewcommand{\arraystretch}{1.5} \begin{array}{|c|c|} \hline \text{Visited Value in the backward pass} & \text{Result of the attributed \_trace() function} \\ \hline r & \begin{array}{c} \bar{w_1} \mathrel{+}= \bar{r}\\ \bar{w_2} \mathrel{+}= \bar{r} \end{array} \\ \hline w_1 & \begin{array}{c} \bar{b} \mathrel{+}= w_0 \cdot \bar{w_1} \\ \bar{w_0} \mathrel{+}= b \cdot \bar{w_1} \end{array} \\ \hline w_2 & \bar{b} \mathrel{+}= 2b \cdot \bar{w_2} \\ \hline w_0 & \bar{a} \mathrel{+}= \cos(a) \cdot \bar{w_0} \\ \hline b & \text{None} \\ \hline a & \text{None} \\ \hline \end{array} $$

To do so, we have to redefine the basic elementary operations inside the Value class so that each one defines the _trace of the child variable it produces, e.g. the \(+\), \(\times\) and \(sin\) operations:

def __add__(self, other):
	# Forward: compute the primal value and record the operands as parents
	out = Value(self.value + other.value, (self, other))
	def _trace():
		# d(x + y)/dx = d(x + y)/dy = 1
		self.diff += out.diff
		other.diff += out.diff
	out._trace = _trace
	return out

def __mul__(self, other):
	out = Value(self.value * other.value, (self, other))
	def _trace():
		# d(x * y)/dx = y and d(x * y)/dy = x
		self.diff += out.diff * other.value
		other.diff += out.diff * self.value
	out._trace = _trace
	return out

def sin(self):
	out = Value(np.sin(self.value), (self,))
	def _trace():
		# d(sin(x))/dx = cos(x)
		self.diff += out.diff * np.cos(self.value)
	out._trace = _trace
	return out

We also need to define the DFS-based backprop method, which initializes the diff of the output variable to 1 and then runs the traces in reverse topological order.

def backprop(self):
	visited = set()
	topo_order = []
	def visit(node):
		# Post-order DFS over the computational graph
		if node not in visited:
			visited.add(node)
			for parent in node.parents:
				visit(parent)
			topo_order.append(node)
	visit(self)
	self.diff = 1  # seed the adjoint of the output
	for value in reversed(topo_order):
		value._trace()
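
Putting everything together on the running example (a minimal usage sketch, assuming __add__, __mul__, sin and backprop are defined as methods of the Value class above):

a = Value(2.0)
b = Value(5.0)
r = b * a.sin() + b * b  # f(a, b) = b*sin(a) + b^2
r.backprop()
print(r.value)           # ~29.546
print(a.diff, b.diff)    # ~-2.081 (= b*cos(a)), ~10.909 (= sin(a) + 2b)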

Conclusion

That is all for the basics of autodiff. You can go more in depth with the survey by Baydin et al. [1] or the step-by-step implementation by Fang et al. [3], and check this repo, which covers the subject and applies it to the backpropagation of an MLP. There are also some good videos on YouTube covering the subject [4] [5].

References

[1] : Baydin et al. “Automatic differentiation in machine learning: a survey”. The Journal of Machine Learning Research, 18(153):1–43, 2018. link

[2] : Andrej Karpathy. “micrograd”. Github. link

[3] : Fang et al. “A Step-by-step Introduction to the Implementation of Automatic Differentiation”. link

[4] : Ari Seff. What is Automatic Differentiation?

[5] : archquant. Automatic Differentiation Explained with Example