Autodifferentiation and backpropagation
Backpropagation is the fundamental algorithm that allows us to train neural networks, and autodifferentiation is the primary tool used to do this in practice. Let's start by writing a basic autograd library, and then use it to train a basic neural network using backpropagation.
Smallgrad
What is autodifferentiation
An autodiff library lets us compute the gradient of some derived value with respect to its (possibly many) inputs. E.g.
# Start with some basic value
x = ...
# Compute z as a complex transformation of x
y = foo(x)
z = bar(y)
# Compute the gradient of z wrt its inputs (x)
dzdx = grad(z)
This requires two steps:
- We keep track of how downstream values (z) are derived from their inputs (x). We can represent this as a directed acyclic graph (DAG)
- Then we can just apply the chain rule
As long as we are working with compositions of basic transformations with known derivatives, this is actually all quite straightforward, and just comes down to the bookkeeping involved in tracking the DAG.
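As a concrete illustration of this idea (with hypothetical transformations: here I assume foo(x) = x² and bar(y) = sin(y), neither of which appears in the article), the chain rule just multiplies the known local derivatives together:

```python
import math

# Hypothetical example: z = bar(foo(x)) with foo(x) = x**2, bar(y) = sin(y).
# Each function has a known local derivative, so the chain rule gives
# dz/dx = bar'(foo(x)) * foo'(x).
def grad_z_wrt_x(x: float) -> float:
    y = x ** 2           # forward: y = foo(x)
    dy_dx = 2 * x        # local derivative foo'(x)
    dz_dy = math.cos(y)  # local derivative bar'(y)
    return dz_dy * dy_dx

# Compare against a central finite-difference approximation
x, eps = 0.7, 1e-6
numeric = (math.sin((x + eps) ** 2) - math.sin((x - eps) ** 2)) / (2 * eps)
print(abs(grad_z_wrt_x(x) - numeric) < 1e-5)  # True
```

An autograd library automates exactly this bookkeeping for arbitrary compositions.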
Andrej Karpathy's micrograd is a very simple but complete implementation of this using purely scalar values. I'd like to implement a version that can natively handle vector and matrix linear algebra operations to be slightly closer to real-world applications.
Basic Design
I'm aiming for a simple and clear pedagogical implementation of an autograd library that is functional enough to train a simple neural network. I'm not worried about performance, arbitrary tensor computations or deep learning.
We need:
- Objects representing basic units of data that we'll be working with (scalars, vectors and matrices)
- Implementations of basic operations between the different components and their associated gradients
- A way to track dependencies and build a graph from a sequence of operations
- A method of propagating gradients backwards through this graph from the end-point to the leaf nodes
- Some convenience tools for optimising neural networks
I'll be mostly copying how things work in PyTorch, as it's what I'm most familiar with.
A Basic Scalar
We'll start with a simple Scalar class representing a single number. This should support basic mathematical operations such as multiplication, addition, negation etc:
class Scalar:
    def __init__(
        self,
        value: float,
    ):
        self._value = value

    def __add__(self, other: Scalar):
        return Scalar(
            value=self._value + other._value,
        )

    def __mul__(self, other: Scalar):
        return Scalar(
            value=self._value * other._value,
        )

    ...
This works as expected:
x = Scalar(1.0)
y = Scalar(2.0)
z = x + y
z
# Scalar(3.0)
We need to add two things to this: firstly, the tracking of dependencies, i.e. we need to somehow store that z = x + y; and secondly, a method for computing the gradients of each operation and propagating them through the graph, accumulating as we go.
We can do this in one go by storing within the class a list of pairs, each mapping a parent node to a "gradient accumulator function". This function gets passed the upstream gradient, computes the local gradient and accumulates them together. Then we need to add a new .backward() method to the class which will start the calculation off and propagate the gradients through the graph.
type Node = Scalar
type Gradient = float
type GradientAccumulator = Callable[[Gradient], Gradient]
type Parents = list[tuple[Node, GradientAccumulator]]

class Scalar:
    def __init__(
        self,
        value: float,
        parents: Parents | None = None,
    ):
        self._value = value
        self._parents = parents or list()
        self._grad = 0.0

    def __add__(self, other: Scalar):
        return Scalar(
            value=self._value + other._value,
            parents=[(self, lambda grad: grad), (other, lambda grad: grad)],
        )

    def __mul__(self, other: Scalar):
        return Scalar(
            value=self._value * other._value,
            parents=[
                (self, lambda grad: grad * other._value),
                (other, lambda grad: grad * self._value),
            ],
        )

    ...

    def backward(self, gradient=None):
        if gradient is None:
            gradient = 1.0
        self._grad += gradient
        for node, grad in self._parents:
            node.backward(gradient=grad(gradient))
Let's be specific: If we have

$$y = f(x), \quad z = g(y), \quad l = h(z)$$

Our root node is $l$, and the leaf node is $x$, and ultimately we want to compute $\frac{dl}{dx} = h'(z)\,g'(y)\,f'(x)$ from the "local" derivatives of each function $f$, $g$, and $h$.

Working through what happens when we call l.backward():

- gradient gets set to 1.0, and this is stored as l._grad
- It is multiplied by $h'(z)$ and then passed into z.backward()
- This is stored as z._grad, then we multiply by $g'(y)$, so gradient now equals $h'(z)\,g'(y)$, and is then passed into y.backward()
- This is stored as y._grad, then we multiply by $f'(x)$, so gradient now equals $h'(z)\,g'(y)\,f'(x)$, and is then passed into x.backward()
- This is stored as x._grad, which is exactly $\frac{dl}{dx}$
This is exactly what we want, in just a couple of lines of code. Some minor points though:
Firstly we accumulate the gradients in each node, i.e. we do x._grad += gradient. This is to correctly compute the gradients when nodes have multiple paths to the root. I.e. if we had a graph where $x$ feeds into both $y$ and $z$, and both feed into $l$:

$$y = f(x), \quad z = g(x), \quad l = h(y, z)$$

Then expanding it all out we see

$$\frac{dl}{dx} = \frac{\partial l}{\partial y} f'(x) + \frac{\partial l}{\partial z} g'(x)$$

i.e. we need to follow both routes from $l$ to $x$ and sum them to compute the full derivative.
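To see the accumulation in action, here is a minimal self-contained cut of the Scalar class above (just __mul__ and backward), checking that l = x * x reaches x via two paths, each contributing x, so dl/dx = 2x:

```python
# Minimal cut of the Scalar class, to demonstrate gradient accumulation
# across shared nodes: l = x * x has x as a parent twice.
class Scalar:
    def __init__(self, value, parents=None):
        self._value = value
        self._parents = parents or []
        self._grad = 0.0

    def __mul__(self, other):
        return Scalar(
            self._value * other._value,
            parents=[
                (self, lambda grad: grad * other._value),
                (other, lambda grad: grad * self._value),
            ],
        )

    def backward(self, gradient=1.0):
        self._grad += gradient
        for node, accumulate in self._parents:
            node.backward(accumulate(gradient))

x = Scalar(3.0)
l = x * x
l.backward()
print(x._grad)  # 6.0, i.e. 2x: the contributions from both paths are summed
```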
Secondly, this uses a very simple recursive depth-first graph traversal, which is quite inefficient, especially with large complex graphs. A better implementation would traverse the graph and build a topologically sorted list of nodes first, then propagate gradients through it.
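A sketch of what that could look like, assuming nodes expose _parents and _grad as in the Scalar class above (backward_toposorted is a hypothetical free function, not part of the library):

```python
def backward_toposorted(root) -> None:
    # Build a reverse topological ordering of the graph via DFS, then
    # propagate gradients through it in a single pass: each node is
    # visited exactly once, rather than once per path to the root.
    order, visited = [], set()

    def visit(node):
        if id(node) in visited:
            return
        visited.add(id(node))
        for parent, _ in node._parents:
            visit(parent)
        order.append(node)

    visit(root)
    root._grad = 1.0
    for node in reversed(order):
        for parent, accumulate in node._parents:
            parent._grad += accumulate(node._grad)

# Quick check with a two-node chain x -> y, where dy/dx = 2
class _Node:
    def __init__(self, parents=None):
        self._parents = parents or []
        self._grad = 0.0

x = _Node()
y = _Node(parents=[(x, lambda grad: 2.0 * grad)])
backward_toposorted(y)
print(x._grad)  # 2.0
```

The key property is that by the time a node is processed, all gradient contributions flowing into it have already been accumulated.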
A Vector Class
Next let's add a basic Vector class, representing a 1D array of numbers. We'll follow the design of the Scalar class with regards to how the gradient calculation will work.
Now that we're not just working with a single scalar value we need to think a little harder about how the backwards calculation will work. The first thing to recognise is that the Vector class is really just a collection of Scalar values. As each scalar is associated with a gradient value, the gradient of the vector must always be an equally sized vector of gradient values.
Now let's think about how the gradient computation will work when we are transforming Vectors into Scalars and vice-versa. Take a basic .sum() method for example - that returns a Scalar value that is the sum of the elements of the Vector:
class Vector:
    def __init__(
        self,
        value: np.ndarray,
        parents: Parents | None = None,
    ):
        assert value.ndim == 1, "Vector must be 1D"
        self._value = value
        self._parents = parents or list()
        self._grad = np.zeros_like(value)

    def zero_grad(self):
        self._grad[:] = 0.0

    def __len__(self):
        return len(self._value)

    def size(self):
        return len(self)

    def sum(self):
        return Scalar(
            value=self._value.sum(),
            parents=[(self, lambda grad: ...)],
        )

    ...
What should the gradient accumulator function be for .sum()? Vector.sum() returns a Scalar, so it will be called with a scalar gradient. If the elements of our vector are $x_i$, then the result of the sum is $s = \sum_i x_i$ and $\frac{\partial s}{\partial x_i} = 1$, which just means that we can pass the gradient through to each element of the vector:
def sum(self):
    return Scalar(
        value=self._value.sum(),
        parents=[(self, lambda grad: np.full(self.size(), grad))],
    )
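We can sanity check this rule numerically: perturbing any single element of $x$ changes the sum at unit rate, so the finite-difference gradient of the sum is a vector of ones:

```python
import numpy as np

# Finite-difference check that d(sum(x))/dx_i = 1 for every element.
rng = np.random.default_rng(0)
x = rng.normal(size=4)
eps = 1e-6
numeric = np.array([
    ((x + eps * e).sum() - (x - eps * e).sum()) / (2 * eps)
    for e in np.eye(len(x))
])
print(np.allclose(numeric, np.ones_like(x)))  # True
```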
The next extension to consider is broadcasting. Vectors should behave similarly to numpy arrays in that multiplying them by a Scalar should result in each element being multiplied by the Scalar. This means we first "broadcast" the Scalar into an equally sized same-valued Vector and then perform an elementwise multiplication. What should the backwards gradient accumulator be in this case?
def __mul__(self, other: Scalar | Vector):
    if isinstance(other, Scalar):
        return Vector(
            value=self._value * other._value,
            parents=[
                (self, lambda grad: grad * other._value),
                (other, ...),
            ],
        )
    ...
We're simply multiplying every element of the vector by the scalar, so going backwards we need to sum up the gradient contributions from each element, reducing the vector of gradients back to a scalar:
def __mul__(self, other: Scalar | Vector):
    if isinstance(other, Scalar):
        return Vector(
            value=self._value * other._value,
            parents=[
                (self, lambda grad: grad * other._value),
                (other, lambda grad: (grad * self._value).sum()),
            ],
        )
    ...
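Again, a quick numerical check of the scalar's accumulator: for $y = s x$ (scalar $s$ broadcast over vector $x$) and a scalar loss with upstream gradient vector $g$, the derivative with respect to $s$ should be $(g \cdot x)$, i.e. the elementwise contributions summed:

```python
import numpy as np

# Finite-difference check: l(s) = sum(g * (s * x)) implies dl/ds = sum(g * x).
rng = np.random.default_rng(1)
x, g, s = rng.normal(size=5), rng.normal(size=5), 0.8
eps = 1e-6
numeric = ((g * ((s + eps) * x)).sum() - (g * ((s - eps) * x)).sum()) / (2 * eps)
print(np.isclose(numeric, (g * x).sum()))  # True
```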
The .backward() gradient propagation function is basically exactly the same as the Scalar version, with a slight modification to work with an array of gradient values rather than single floats.
def backward(self, gradient):
    self._grad += gradient
    for node, grad in self._parents:
        node.backward(gradient=grad(gradient))
This is basically all we need (plus a few other similar operators) and we can now work easily with vectors and scalars.
A Matrix Class
Lastly we'll move onto a Matrix class so that we can do arbitrary linear algebra calculations. This is almost exactly the same as the Vector, and the most interesting thing to deal with is matrix multiplication. If we have

$$z = W x$$

where $x$ and $z$ are vectors and $W$ is a matrix, what is $\frac{\partial z}{\partial x}$ and $\frac{\partial z}{\partial W}$?
The first is very straightforward:

$$\frac{\partial z}{\partial x} = W$$

This is perhaps easiest to see by noting that $z = Wx$ is shorthand for $z_i = \sum_j W_{ij} x_j$, so $\frac{\partial z_i}{\partial x_j}$ has two free indices $i, j$ and the result is a matrix (two dimensional tensor). Differentiating, $\frac{\partial z_i}{\partial x_j} = W_{ij}$, or equivalently $\frac{\partial z}{\partial x} = W$.

What is $\frac{\partial z}{\partial W}$ then? This is shorthand for $\frac{\partial z_i}{\partial W_{jk}}$, which has three free indices $i, j, k$, so the result is a three dimensional tensor. Again we note that $z_i = \sum_l W_{il} x_l$, so the result is

$$\frac{\partial z_i}{\partial W_{jk}} = \delta_{ij} x_k$$

This can be written as

$$\frac{\partial z}{\partial W} = I \otimes x$$

where $\otimes$ is the tensor outer product and $I$ the identity matrix.
In practice we never want to form this sparse 3D tensor as it would be a huge waste of space. If we have $l = f(z)$ where $l$ is scalar and $g = \frac{\partial l}{\partial z}$, then

$$\frac{\partial l}{\partial W_{jk}} = \sum_i g_i \frac{\partial z_i}{\partial W_{jk}} = \sum_i g_i \delta_{ij} x_k = g_j x_k$$

i.e. $\frac{\partial l}{\partial W} = g x^\top$, the outer product of $g$ and $x$.
We can work through similar reasoning for matrix-matrix multiplication, and if $Z = WX$ and $G = \frac{\partial l}{\partial Z}$ then

$$\frac{\partial l}{\partial W} = G X^\top$$

and

$$\frac{\partial l}{\partial X} = W^\top G$$
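These formulas are easy to verify numerically. A sketch of a finite-difference check: choosing the scalar loss $l = \sum_{ij} G_{ij} Z_{ij}$ (so that $\frac{\partial l}{\partial Z} = G$ exactly), the rules above predict the elementwise gradients of $W$ and $X$:

```python
import numpy as np

# Finite-difference check of the matrix-matrix rules: with Z = W @ X and
# l = (G * Z).sum() (so dl/dZ = G exactly), we should recover
# dl/dW = G @ X.T and dl/dX = W.T @ G.
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))
X = rng.normal(size=(4, 2))
G = rng.normal(size=(3, 2))

def loss(W, X):
    return (G * (W @ X)).sum()

def numeric_grad(f, A, eps=1e-6):
    # Central differences, one matrix element at a time.
    out = np.zeros_like(A)
    for idx in np.ndindex(A.shape):
        up, down = A.copy(), A.copy()
        up[idx] += eps
        down[idx] -= eps
        out[idx] = (f(up) - f(down)) / (2 * eps)
    return out

print(np.allclose(numeric_grad(lambda A: loss(A, X), W), G @ X.T))  # True
print(np.allclose(numeric_grad(lambda A: loss(W, A), X), W.T @ G))  # True
```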
With this, the Matrix class looks like
class Matrix:
    def __init__(
        self,
        value: np.ndarray,
        parents: Parents | None = None,
    ):
        assert value.ndim == 2, "Matrix must be 2D"
        self._value = value
        self._parents = parents or list()
        self._grad = np.zeros_like(value)

    ...

    def __add__(self, other: Scalar | Matrix):
        if isinstance(other, Scalar):
            other = Matrix(
                value=np.full(self.shape(), other._value),
                parents=[(other, lambda grad: grad.sum())],
            )
        return Matrix(
            value=self._value + other._value,
            parents=[(self, lambda grad: grad), (other, lambda grad: grad)],
        )

    ...

    def __matmul__(self, other: Vector | Matrix):
        if isinstance(other, Vector):
            return Vector(
                value=self._value @ other._value,
                parents=[
                    (self, lambda grad: np.outer(grad, other._value)),
                    (other, lambda grad: self._value.T @ grad),
                ],
            )
        return Matrix(
            value=self._value @ other._value,
            parents=[
                (self, lambda grad: grad @ other._value.T),
                (other, lambda grad: self._value.T @ grad),
            ],
        )

    ...

    def backward(self, gradient):
        self._grad += gradient
        for node, grad in self._parents:
            node.backward(gradient=grad(gradient))
Notes, inefficiencies, improvements
This is very simple and functional, but it is a long way from anything that would be useful for non-trivial tasks. A non-exhaustive list of inefficiencies and potential improvements:
- Significant repetition of code between the Scalar, Vector and Matrix classes
- No arbitrarily sized tensor class that could subsume the above
- No support for working with non-gradient tracked values (e.g. multiplying a Matrix with an np.ndarray)
- Many missing methods and functions
Fitting Models with Gradient Descent
We have enough already to fit a model using gradient descent. E.g. we can fit a linear regression model by minimising the mean of the squared residuals. Say we have data $X, y$ and we're trying to estimate the parameters $\beta$ of the model $y = X\beta + \epsilon$. Firstly initialise the parameters $\beta$ to some random values, then:

- Do a forward pass of the model: $\hat{y} = X\beta$
- Compute the residuals $r = \hat{y} - y$
- Compute the loss $L = \frac{1}{N} \sum_i r_i^2$
- Do the backwards pass, computing the gradient of the loss wrt the parameters, $\frac{\partial L}{\partial \beta}$
- Update the parameters $\beta \leftarrow \beta - \eta \frac{\partial L}{\partial \beta}$, where $\eta$ is the learning rate

and repeat in a loop until the loss converges. Obviously we could just solve OLS analytically but this is a useful test. This is what the code looks like:
def test_ols():
    # Fitting OLS with gradient descent
    rng = np.random.default_rng(0)
    N = 100
    k = 5
    sigma = 0.05
    beta = rng.normal(size=k)
    X = rng.normal(size=(N, k))
    y = X @ beta + sigma * rng.normal(size=N)

    # True OLS solution
    beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

    # Solve by gradient descent using smallgrad
    beta_p = Vector(rng.normal(size=k))

    def forward():
        yhat = Matrix(X) @ beta_p
        res = yhat - Vector(y)
        mse = (res * res).sum() / Scalar(N)
        return mse

    lr = 1e-1
    epochs = 1000
    for _ in range(epochs):
        beta_p.zero_grad()
        loss = forward()
        loss.backward()
        beta_p._value -= lr * beta_p._grad

    np.testing.assert_allclose(beta_p._value, beta_ols)
and it works!
If we want to fit non-linear models (e.g. neural networks) we need to add some form of non-linearity. ReLU is the simplest so we'll go with that, and the implementation is very straightforward:
def relu(self) -> Vector:
    return Vector(
        value=np.maximum(self._value, 0),
        parents=[
            (self, lambda grad: grad * np.where(self._value > 0, 1.0, 0.0)),
        ],
    )
This enables us to train a basic MLP with a single hidden layer and ReLU activations, giving exactly the same results as PyTorch:
def test_mlp():
    rng = np.random.default_rng(42)
    N = 100
    k = 5
    n_hidden = 10
    sigma = 0.1
    X_np, y_np = make_regression(
        n_samples=N, n_features=k, noise=sigma, random_state=42
    )
    W1_np = rng.normal(size=(k, n_hidden)) * 0.1
    w2_np = rng.normal(size=n_hidden) * 0.1
    lr = 1e-2
    epochs = 20

    # smallgrad
    W1 = Matrix(W1_np.copy())
    w2 = Vector(w2_np.copy())
    for _ in range(epochs):
        W1.zero_grad()
        w2.zero_grad()
        hidden = (Matrix(X_np) @ W1).relu()
        yhat = hidden @ w2
        res = yhat - Vector(y_np)
        loss = (res * res).sum() / Scalar(N)
        loss.backward()
        W1._value -= lr * W1._grad
        w2._value -= lr * w2._grad

    # torch
    X_t = torch.tensor(X_np)
    y_t = torch.tensor(y_np)
    W1_t = torch.tensor(W1_np.copy(), requires_grad=True)
    w2_t = torch.tensor(w2_np.copy(), requires_grad=True)
    optim_t = torch.optim.SGD([W1_t, w2_t], lr=lr)
    for _ in range(epochs):
        optim_t.zero_grad()
        hidden_t = (X_t @ W1_t).relu()
        yhat_t = hidden_t @ w2_t
        loss_t = ((yhat_t - y_t) ** 2).mean()
        loss_t.backward()
        optim_t.step()

    np.testing.assert_allclose(W1._value, W1_t.detach().numpy(), rtol=1e-5)
    np.testing.assert_allclose(w2._value, w2_t.detach().numpy(), rtol=1e-5)
Easy!
Summary
This is a simple and straightforward implementation of a PyTorch-like autograd engine that is nonetheless fully-featured enough to train a basic neural network. Creating it has been a very useful learning experience; as usual, I find that sitting down to actually implement something I think I understand exposes all kinds of holes in my knowledge. I could go further and add all kinds of additional functionality, but for now this is good enough.