
Autodifferentiation and backpropagation

Backpropagation is the fundamental algorithm that allows us to train neural networks, and autodifferentiation is the primary tool used to implement it in practice. Let's start by writing a basic autograd library, and then use it to train a basic neural network using backpropagation.

Smallgrad

What is autodifferentiation?

An autodiff library lets us compute the gradient of some derived value with respect to its (possibly many) inputs. E.g.

# Start with some basic value
x = ...

# Compute z as a complex transformation of x
y = foo(x)
z = bar(y)

# Compute the gradient of z wrt its inputs (x)
dzdx = grad(z)

This requires two steps:

  1. We keep track of how downstream values (z) are derived from their inputs (x). We can represent this as a directed acyclic graph (DAG)
  2. Then we can just apply the chain rule: dz/dx = dz/dy × dy/dx

As long as we are working with compositions of basic transformations with known derivatives then this is actually all quite straightforward, and just comes down to the bookkeeping involved in tracking the DAG.
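
To make the chain rule step concrete before building anything, here is a tiny standalone sketch. The definitions of foo and bar are hypothetical stand-ins chosen so the derivatives are known:

```python
import math

# Hypothetical stand-ins for foo and bar: z = bar(foo(x)) = sin(x**2)
def foo(x):
    return x * x          # dy/dx = 2x

def bar(y):
    return math.sin(y)    # dz/dy = cos(y)

x = 0.5
y = foo(x)
z = bar(y)

# Chain rule: dz/dx = dz/dy * dy/dx
dzdx = math.cos(y) * 2 * x

# Sanity check against a central finite-difference approximation
eps = 1e-6
dzdx_fd = (bar(foo(x + eps)) - bar(foo(x - eps))) / (2 * eps)
```

An autograd library automates exactly this bookkeeping: it records that z depends on y and y on x, and multiplies the local derivatives together for us.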

Andrej Karpathy's micrograd is a very simple but complete implementation of this using purely scalar values. I'd like to implement a version that can natively handle vector and matrix linear algebra operations to be slightly closer to real-world applications.

Basic Design

I'm aiming for a simple and clear pedagogical implementation of an autograd library that is functional enough to train a simple neural network. I'm not worried about performance, arbitrary tensor computations or deep learning.

We need:

  1. Objects representing basic units of data that we'll be working with (scalars, vectors and matrices)
  2. Implementations of basic operations between the different components and their associated gradients
  3. A way to track dependencies and build a graph from a sequence of operations
  4. A method of propagating gradients backwards through this graph from the end-point to the leaf nodes
  5. Some convenience tools for optimising neural networks

I'll be mostly copying how things work in PyTorch, as it's what I'm most familiar with.

A Basic Scalar

We'll start with a simple Scalar class representing a single number. This should support basic mathematical operations such as multiplication, addition, negation etc:

class Scalar:
    def __init__(
        self,
        value: float,
    ):
        self._value = value

    def __add__(self, other: Scalar):
        return Scalar(
            value=self._value + other._value,
        )

    def __mul__(self, other: Scalar):
        return Scalar(
            value=self._value * other._value,
        )
    ...

This works as expected:

x = Scalar(1.0)
y = Scalar(2.0)
z = x + y
z
# Scalar(3.0)

We need to add two things to this: firstly, the tracking of dependencies, i.e. we need to store somehow that z = x + y; and secondly, a method for computing the gradients of each operation and propagating them through the graph, accumulating as we go.

We can do this in one go by storing within the class a mapping from parent node to a "gradient accumulator function". This function gets passed the upstream gradient, computes the local gradient and accumulates them together. Then we need to add a new .backward() method to the class which will start the calculation off and propagate the gradients through the graph.

type Node = Scalar
type Gradient = float
type GradientAccumulator = Callable[[Gradient], Gradient]
type Parents = list[tuple[Node, GradientAccumulator]]

class Scalar:
    def __init__(
        self,
        value: float,
        parents: Parents | None = None,
    ):
        self._value = value
        self._parents = parents or list()
        self._grad = 0.0

    def __add__(self, other: Scalar):
        return Scalar(
            value=self._value + other._value,
            parents=[(self, lambda grad: grad), (other, lambda grad: grad)],
        )

    def __mul__(self, other: Scalar):
        return Scalar(
            value=self._value * other._value,
            parents=[
                (self, lambda grad: grad * other._value),
                (other, lambda grad: grad * self._value),
            ],
        )

    ...

    def backward(self, gradient=None):
        if gradient is None:
            gradient = 1.0
        self._grad += gradient
        for node, grad in self._parents:
            node.backward(gradient=grad(gradient))

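Putting the pieces together, here is a condensed, self-contained version of the class with a quick end-to-end check, using z = x·y + x (so ∂z/∂x = y + 1 and ∂z/∂y = x):

```python
from __future__ import annotations

# Condensed version of the Scalar class above, for an end-to-end check.
class Scalar:
    def __init__(self, value, parents=None):
        self._value = value
        self._parents = parents or []  # list of (node, gradient accumulator)
        self._grad = 0.0

    def __add__(self, other: Scalar):
        return Scalar(
            self._value + other._value,
            parents=[(self, lambda g: g), (other, lambda g: g)],
        )

    def __mul__(self, other: Scalar):
        return Scalar(
            self._value * other._value,
            parents=[
                (self, lambda g: g * other._value),
                (other, lambda g: g * self._value),
            ],
        )

    def backward(self, gradient=None):
        if gradient is None:
            gradient = 1.0
        self._grad += gradient
        for node, acc in self._parents:
            node.backward(gradient=acc(gradient))

x = Scalar(2.0)
y = Scalar(3.0)
z = x * y + x   # z = 8.0; dz/dx = y + 1 = 4.0, dz/dy = x = 2.0
z.backward()
```

Note that x already receives gradient contributions along two paths here (directly, and via x·y), which is why backward() accumulates with += rather than assigning.
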
Let's be specific: If we have

y = f(x), z = g(y), l = h(z)

Our root node is l and the leaf node is x, and ultimately we want to compute ∂l/∂x from the "local" derivatives of each function: ∂y/∂x, ∂z/∂y, and ∂l/∂z.

Working through what happens when we call l.backward():

  1. gradient gets set to 1.0, and this is stored as l._grad (∂l/∂l = 1)
  2. It is multiplied by ∂l/∂z and then passed into z.backward()
  3. This is stored as z._grad, then we multiply by ∂z/∂y, so gradient now equals ∂l/∂z × ∂z/∂y = ∂l/∂y, and it is then passed into y.backward()
  4. This is stored as y._grad, then we multiply by ∂y/∂x, so gradient now equals ∂l/∂x, and it is then passed into x.backward()
  5. This is stored as x._grad

This is exactly what we want, in just a couple of lines of code. Some minor points though:

Firstly we accumulate the gradients in each node, i.e. we do x._grad += gradient. This is to correctly compute the gradients when nodes have multiple paths to the root. I.e. if we had the following graph:

y = f(x), z = g(y) + x, l = h(z)

Then expanding it all out we see l = h(g(f(x)) + x) and

∂l/∂x = ∂l/∂z × ∂z/∂x + ∂l/∂z × ∂z/∂y × ∂y/∂x

i.e. we need to follow both routes from l to x and sum them to compute the full derivative.
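
We can sanity-check the sum-over-paths formula numerically with concrete (hypothetical) choices for f, g and h:

```python
# Hypothetical choices: f(x) = x**2, g(y) = 2*y, h(z) = z**3,
# so l = h(g(f(x)) + x) = (2*x**2 + x)**3.
def f(x): return x * x      # dy/dx = 2x
def g(y): return 2 * y      # dz/dy = 2
def h(z): return z ** 3     # dl/dz = 3z**2

x = 0.7
y = f(x)
z = g(y) + x
l = h(z)

# Sum over both paths from l back to x:
#   direct path:  dl/dz * dz/dx = 3z**2 * 1
#   via y:        dl/dz * dz/dy * dy/dx = 3z**2 * 2 * 2x
dldx = 3 * z**2 * 1 + 3 * z**2 * 2 * (2 * x)

# Central finite-difference check
eps = 1e-6
dldx_fd = (h(g(f(x + eps)) + (x + eps)) - h(g(f(x - eps)) + (x - eps))) / (2 * eps)
```
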

Secondly, this uses a very simple recursive depth-first graph traversal, which is quite inefficient: shared nodes are revisited once per path to the root, and deep graphs can hit Python's recursion limit. A better implementation would first traverse the graph to build a topologically sorted list of nodes, then propagate gradients through that list in a single pass.
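
A sketch of what such a traversal might look like. Here backward_toposorted is a hypothetical helper, and Node is a stripped-down stand-in for the classes in this post (a value, a list of (parent, gradient accumulator) pairs, and a grad):

```python
class Node:
    def __init__(self, value, parents=None):
        self._value = value
        self._parents = parents or []  # list of (node, gradient accumulator)
        self._grad = 0.0

def backward_toposorted(root, gradient=1.0):
    # Build a topological order via iterative depth-first search,
    # visiting each node exactly once even if it has multiple consumers.
    order, visited = [], set()
    stack = [(root, False)]
    while stack:
        node, processed = stack.pop()
        if processed:
            order.append(node)
            continue
        if id(node) in visited:
            continue
        visited.add(id(node))
        stack.append((node, True))
        for parent, _ in node._parents:
            stack.append((parent, False))

    # Propagate gradients in reverse topological order: each node's
    # gradient is fully accumulated before it is pushed to its parents.
    root._grad += gradient
    for node in reversed(order):
        for parent, acc in node._parents:
            parent._grad += acc(node._grad)

# Example graph: z = x * y + x with x = 2, y = 3,
# so dz/dx = y + 1 = 4 and dz/dy = x = 2.
x = Node(2.0)
y = Node(3.0)
p = Node(6.0, parents=[(x, lambda g: g * 3.0), (y, lambda g: g * 2.0)])
z = Node(8.0, parents=[(p, lambda g: g), (x, lambda g: g)])
backward_toposorted(z)
```
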

A Vector Class

Next let's add a basic Vector class, representing a 1D array of numbers. We'll follow the design of the Scalar class with regards to how the gradient calculation will work.

Now that we're not just working with a single scalar value we need to think a little harder about how the backwards calculation will work. The first thing to recognise is that the Vector class is really just a collection of Scalar values. As each scalar is associated with a gradient value, the gradient of the vector must always be an equally sized vector of gradient values.

Now let's think about how the gradient computation will work when we are transforming Vectors into Scalars and vice-versa. Take a basic .sum() method for example - that returns a Scalar value that is the sum of the elements of the Vector:

class Vector:
    def __init__(
        self,
        value: np.ndarray,
        parents: Parents | None = None,
    ):
        assert value.ndim == 1, "Vector must be 1D"
        self._value = value
        self._parents = parents or list()
        self._grad = np.zeros_like(value)

    def zero_grad(self):
        self._grad[:] = 0.0

    def __len__(self):
        return len(self._value)

    def size(self):
        return len(self)

    def sum(self):
        return Scalar(
            value=self._value.sum(),
            parents=[(self, lambda grad: ...)],
        )

    ...

What should the gradient accumulator function be for .sum()? Vector.sum() returns a Scalar, so it will be called with a scalar gradient. If the elements of our vector are x_i, then the result of the sum is y = Σ_i x_i and ∂y/∂x_i = 1, which just means that ∂l/∂x_i = ∂l/∂y, so we can simply pass the gradient through to each element of the vector:

    def sum(self):
        return Scalar(
            value=self._value.sum(),
            parents=[(self, lambda grad: np.full(self.size(), grad))],
        )

The next extension to consider is broadcasting. Vectors should behave similarly to numpy arrays in that multiplying them by a Scalar should result in each element being multiplied by the Scalar. This means we first "broadcast" the Scalar into an equally sized same-valued Vector and then perform an elementwise multiplication. What should the backwards gradient accumulator be in this case?

    def __mul__(self, other: Scalar | Vector):
        if isinstance(other, Scalar):
            return Vector(
                value=self._value * other._value,
                parents=[
                    (self, lambda grad: grad * other._value),
                    (other, ...),
                ],
            )
        ...

We're multiplying every element of the vector by the same scalar, so going backwards we need to sum up the gradient contributions from each element, reducing the vector of gradients back to a scalar:

    def __mul__(self, other: Scalar | Vector):
        if isinstance(other, Scalar):
            return Vector(
                value=self._value * other._value,
                parents=[
                    (self, lambda grad: grad * other._value),
                    (other, lambda grad: (grad * self._value).sum()),
                ],
            )
        ...
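
As a quick sanity check of that reduction, we can compare the claimed scalar gradient against a finite-difference estimate in plain numpy (the upstream gradient vector here is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=5)          # the vector
upstream = rng.normal(size=5)   # an arbitrary upstream gradient dl/d(s*x)
s = 0.7                         # the scalar being broadcast

# Scalar loss l(s) = upstream . (s * x); claimed dl/ds = (upstream * x).sum()
def loss(s):
    return upstream @ (s * x)

claimed = (upstream * x).sum()

# Central finite-difference estimate of dl/ds
eps = 1e-6
fd = (loss(s + eps) - loss(s - eps)) / (2 * eps)
```
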

The .backward() gradient propagation method is essentially the same as the Scalar version, with a slight modification to work with an array of gradient values rather than a single float.

    def backward(self, gradient):
        self._grad += gradient
        for node, grad in self._parents:
            node.backward(gradient=grad(gradient))

This is basically all we need (plus a few other similar operators) and we can now work easily with vectors and scalars.

A Matrix Class

Lastly, we'll move on to a Matrix class so that we can do arbitrary linear algebra calculations. This is almost exactly the same as the Vector class, and the most interesting thing to deal with is matrix multiplication. If we have

y=Ax

where x, y are vectors and A is a matrix, what are ∂y/∂x and ∂y/∂A?

The first is very straightforward:

∂y/∂x = A

This is perhaps easiest to see by noting that ∂y/∂x is shorthand for ∂y_i/∂x_j with two free indices i, j, so the result is a matrix (a two-dimensional tensor). Since y_i = Σ_j A_{i,j} x_j, we have ∂y_i/∂x_j = A_{i,j}, or equivalently ∂y/∂x = A.

What is ∂y/∂A then? This is shorthand for ∂y_i/∂A_{j,k}, which has three free indices i, j, k, so the result is a three-dimensional tensor. Again we note that y_i = Σ_j A_{i,j} x_j, so the result is

∂y_i/∂A_{j,k} = x_k if i = j, 0 otherwise

This can be written as

∂y/∂A = x^T ⊗ I

where ⊗ is the tensor outer product and I is the identity matrix.

In practice we never want to form this sparse 3D tensor, as it would be a huge waste of space. If we have l = f(y) where l is a scalar and y = Ax, then ∂l/∂A = (∂l/∂y)(∂y/∂A) = g x^T, where g = ∂l/∂y.
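
This is easy to verify numerically: taking a (hypothetical) loss l = g·(Ax), the claimed gradient g x^T should match entrywise finite differences:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 4))
x = rng.normal(size=4)
g = rng.normal(size=3)          # upstream gradient dl/dy

# Scalar loss l(A) = g . (A x); claimed dl/dA = outer(g, x)
def loss(A):
    return g @ (A @ x)

claimed = np.outer(g, x)

# Central finite differences, one entry of A at a time
eps = 1e-6
fd = np.zeros_like(A)
for j in range(A.shape[0]):
    for k in range(A.shape[1]):
        Ap, Am = A.copy(), A.copy()
        Ap[j, k] += eps
        Am[j, k] -= eps
        fd[j, k] = (loss(Ap) - loss(Am)) / (2 * eps)
```
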

We can work through similar reasoning for matrix-matrix multiplication: if C = AB and l = f(C), then

∂l/∂A = (∂l/∂C)(∂C/∂A) = G B^T

and

∂l/∂B = (∂l/∂C)(∂C/∂B) = A^T G

where G = ∂l/∂C.
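
The same finite-difference check works for the matrix-matrix case, using the (hypothetical) loss l = Σ_{i,k} G_{i,k} (AB)_{i,k}:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(3, 4))
B = rng.normal(size=(4, 2))
G = rng.normal(size=(3, 2))     # upstream gradient dl/dC

# Scalar loss l = sum(G * (A @ B)); claimed gradients G @ B.T and A.T @ G
def loss(A, B):
    return (G * (A @ B)).sum()

claimed_A = G @ B.T
claimed_B = A.T @ G

eps = 1e-6

def fd_grad(M, f):
    # Entrywise central finite differences of f with respect to M
    out = np.zeros_like(M)
    for idx in np.ndindex(*M.shape):
        Mp, Mm = M.copy(), M.copy()
        Mp[idx] += eps
        Mm[idx] -= eps
        out[idx] = (f(Mp) - f(Mm)) / (2 * eps)
    return out

fd_A = fd_grad(A, lambda M: loss(M, B))
fd_B = fd_grad(B, lambda M: loss(A, M))
```
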

With this, the Matrix class looks like

class Matrix:
    def __init__(
        self,
        value: np.ndarray,
        parents: Parents | None = None,
    ):
        assert value.ndim == 2, "Matrix must be 2D"
        self._value = value
        self._parents = parents or list()
        self._grad = np.zeros_like(value)

    ...

    def __add__(self, other: Scalar | Matrix):
        if isinstance(other, Scalar):
            other = Matrix(
                value=np.full(self.shape(), other._value),
                parents=[(other, lambda grad: grad.sum())],
            )
        return Matrix(
            value=self._value + other._value,
            parents=[(self, lambda grad: grad), (other, lambda grad: grad)],
        )
    
    ...

    def __matmul__(self, other: Vector | Matrix):
        if isinstance(other, Vector):
            return Vector(
                value=self._value @ other._value,
                parents=[
                    (self, lambda grad: np.outer(grad, other._value)),
                    (other, lambda grad: self._value.T @ grad),
                ],
            )
        return Matrix(
            value=self._value @ other._value,
            parents=[
                (self, lambda grad: grad @ other._value.T),
                (other, lambda grad: self._value.T @ grad),
            ],
        )
    
    ...

    def backward(self, gradient):
        self._grad += gradient
        for node, grad in self._parents:
            node.backward(gradient=grad(gradient))

Notes, inefficiencies, improvements

This is very simple and functional, but is very far off anything that would be useful for non-trivial tasks. A non-exhaustive list of inefficiencies and potential improvements:

  1. The recursive depth-first backward pass revisits shared nodes once per path; a topologically sorted traversal would visit each node exactly once
  2. Only scalars, 1D vectors and 2D matrices are supported, rather than arbitrary N-dimensional tensors
  3. No attention has been paid to performance or memory use: every operation allocates fresh arrays, and nothing is vectorised or fused across the graph

Fitting Models with Gradient Descent

We already have enough to fit a model using gradient descent. For example, we can fit a linear regression model by minimising the mean of the squared residuals. Say we have y, X and we're trying to estimate the parameters β of the model y = Xβ. First, initialise the parameters β to some random values, then:

  1. Do a forward pass of the model: ŷ = Xβ
  2. Compute the residuals r = y - ŷ
  3. Compute the loss l = (1/N) Σ_i r_i²
  4. Do the backwards pass, computing the gradient of the loss wrt the parameters: g = ∂l/∂β
  5. Update the parameters β ← β - αg, where α is the learning rate

and repeat in a loop until β converges. Obviously we could just solve OLS analytically but this is a useful test. This is what the code looks like:

def test_ols():
    # Fitting OLS with gradient descent
    rng = np.random.default_rng(0)
    N = 100
    k = 5
    sigma = 0.05
    beta = rng.normal(size=k)
    X = rng.normal(size=(N, k))
    y = X @ beta + sigma * rng.normal(size=N)

    # True OLS solution
    beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

    # Solve by gradient descent using smallgrad
    beta_p = Vector(rng.normal(size=k))

    def forward():
        yhat = Matrix(X) @ beta_p
        res = yhat - Vector(y)
        mse = (res * res).sum() / Scalar(N)
        return mse

    lr = 1e-1
    epochs = 1000
    for _ in range(epochs):
        beta_p.zero_grad()
        loss = forward()
        loss.backward()
        beta_p._value -= lr * beta_p._grad

    np.testing.assert_allclose(beta_p._value, beta_ols)

and it works!

If we want to fit non-linear models (e.g. neural networks) we need to add some form of non-linearity. ReLU is the simplest, so we'll go with that, and the implementation is very straightforward:

    def relu(self) -> Vector:
        return Vector(
            value=np.maximum(self._value, 0),
            parents=[
                (self, lambda grad: grad * np.where(self._value > 0, 1.0, 0.0)),
            ],
        )

This enables us to train a basic MLP with a single hidden layer and ReLU activations, giving exactly the same results as PyTorch:

def test_mlp():
    rng = np.random.default_rng(42)
    N = 100
    k = 5
    n_hidden = 10
    sigma = 0.1
    X_np, y_np = make_regression(
        n_samples=N, n_features=k, noise=sigma, random_state=42
    )

    W1_np = rng.normal(size=(k, n_hidden)) * 0.1
    w2_np = rng.normal(size=n_hidden) * 0.1

    lr = 1e-2
    epochs = 20

    # smallgrad
    W1 = Matrix(W1_np.copy())
    w2 = Vector(w2_np.copy())

    for _ in range(epochs):
        W1.zero_grad()
        w2.zero_grad()
        hidden = (Matrix(X_np) @ W1).relu()
        yhat = hidden @ w2
        res = yhat - Vector(y_np)
        loss = (res * res).sum() / Scalar(N)
        loss.backward()
        W1._value -= lr * W1._grad
        w2._value -= lr * w2._grad

    # torch
    X_t = torch.tensor(X_np)
    y_t = torch.tensor(y_np)
    W1_t = torch.tensor(W1_np.copy(), requires_grad=True)
    w2_t = torch.tensor(w2_np.copy(), requires_grad=True)
    optim_t = torch.optim.SGD([W1_t, w2_t], lr=lr)

    for _ in range(epochs):
        optim_t.zero_grad()
        hidden_t = (X_t @ W1_t).relu()
        yhat_t = hidden_t @ w2_t
        loss_t = ((yhat_t - y_t) ** 2).mean()
        loss_t.backward()
        optim_t.step()

    np.testing.assert_allclose(W1._value, W1_t.detach().numpy(), rtol=1e-5)
    np.testing.assert_allclose(w2._value, w2_t.detach().numpy(), rtol=1e-5)

Easy!

Summary

This is a simple and straightforward implementation of a PyTorch-like autograd engine that is nonetheless fully-featured enough to train a basic neural network. Creating it has been a very useful learning experience; as usual, I find that sitting down to actually implement something I think I understand exposes all kinds of holes in my knowledge. I could go further and add all kinds of additional functionality, but for now this is good enough.