cpp_grad: a first adventure into backpropagation and neural networks

Machine Learning

Tom Kite


February 25, 2023

cpp_grad: a first adventure into backpropagation and neural networks

When I discovered Andrej Karpathy’s tutorial for micrograd I was immediately hooked. In my experience, it is rare to find resources on neural networks that build the code from scratch. Andrej Karpathy, however, does a great job of explaining the fundamentals and using simple yet powerful Python code to demonstrate the principles of backpropagation. I wanted to implement the code for myself and decided C++ would deliver a healthy challenge while potentially giving enough speed to make the code useful. I called it cpp_grad.

DALL·E 2’s interpretation of: “a dramatic visualization of a neural network backpropagating.” DALL·E 2’s interpretation of: “a dramatic visualization of a neural network backpropagating.”

What is micrograd?

At its heart, it is a simple wrapper built around arithmetic types (float, int, etc.) which allows you to find the derivatives of the operations you perform on them. Consider the equation L = x + yz. If you want to calculate the derivative dL/dz, then you have to propagate the derivative backward, firstly knowing dL/dy and then knowing dy/dz. Concretely, dL/dz = dL/dy * dy/dz through the chain rule. This is why we call the process backpropagation. More generally, we refer to this concept as differentiable programming.

And why do we care about the derivatives of operations? The answer is that knowing the derivatives allows us to solve optimization problems. In the case of neural networks, we have many weights and biases to optimize. We have a series of inputs and target values we would like to achieve from each input. We can define a function called the loss, which decreases as we get close to replicating the outputs, only reaching zero when we have exactly replicated the target outputs. Minimizing this loss function by using the derivatives constitutes training of the network.

Beyond this introduction, I won’t discuss the details of micrograd - Check out Andrej’s video on that for a better explanation than I could hope to give. I will, however, discuss the challenges I faced in implementing the code in C++, and the results I obtained.

Challenges in C++ Implementation

I didn’t realize when I first started this task how much this would refine my thinking of C++. I have developed in C++ for a number of years now, I have listened to podcasts with Bjarne Stroustrup and read the “tour guide”. I am familiar with the ideas of RAII, object lifetime, the lack of garbage collection. But this project really made me confront everything I had read and understood in theory.

The essential issue is that micrograd’s backpropagation relies on each value keeping track of its parents and the operation that gave rise to it. For example, the value C = A + B must remember who A and B are so it can update their gradients later. Simple enough, right? The line of code that first made me realize something subtle was going on here was writing a loss function as a sum of squares. Here we will simply write L = A^2 + B^2 + C^2 as an example. The point is that A, B, and C are not going to be local variables stored on the stack unless you have some horribly hardcoded neural network. Instead, A, B, and C will most likely all appear as temporary variables in a loop.

To further emphasize this point, we want to be able to write code like x = x + 10. But then who is the parent of x and where is the intermediate value of x stored in memory?

The key to resolving this is to divorce the user-facing memory address accessible through the variable named x and the true memory address of the backpropagated values which must be created and kept alive somehow when temporary intermediate values arise. To do this I made a Value class, just like in micrograd, which itself manages a pointer to a hidden type called _Value. The idea is that users will never directly interact with _Value. I was cautious to only have shared pointers to _Value types, meaning that once no references exist to a given parent it can safely be deleted, since no backpropagation could occur. This happens in practice when the MLP or loss Value goes out of scope, and the chain of previous _Value instances can be garbage collected. This ended up working, after some trial and error.

Here are some code snippets:

// Constructors and destructors
    _ptr = std::make_shared<_Value<T>>(static_cast<T>(0));
Value(const T& data)
    _ptr = std::make_shared<_Value<T>>(data);
    _ptr = nullptr;

// Copy and move constructors
Value(const Value& other)
    _ptr = other._ptr;
Value(Value&& other)
    _ptr = other._ptr;
    other._ptr = nullptr;

// Copy and move assignment operators
Value<T>& operator=(const Value<T>& other)
    if (&other != this)
        _ptr = other._ptr;
    return *this;
Value<T>& operator=(Value<T>&& other)
    _ptr = other._ptr;
    other._ptr = nullptr;
    return *this;

Interesting Bug

One interesting bug I found (I crashed my VM multiple times in building my first MLP) is that the lambda closures used in the backpropagation could contain shared pointers to the class they belong to. This would prevent the reference count from ever reaching zero, quickly stacking up tens of gigabytes in memory during training. To solve this I used a raw pointer within the lambda closures. This solution is quite inelegant, and could lead to issues upon extensions of the basic operations Value allows. A potential future improvement here would be to standardize the creation of a _back function, which is only allowed to interface with a raw pointer to a _Value, and through that its parents.

See the code snippet below for an example of my solution:

Value<T> operator*(const Value<T>& other) const
    auto out = Value<T>(
        get_data() * other.get_data(),
        { get_ptr(), other.get_ptr() } // Keeping track of parents

    _Value<T>* this_ptr = get_ptr().get();
    _Value<T>* other_ptr = other.get_ptr().get();
    _Value<T>* out_ptr = out.get_ptr().get();

    auto _back = [=]() // Lambda closure only contains raw pointers
        this_ptr->get_grad() += other_ptr->get_data() * out_ptr->get_grad();
        other_ptr->get_grad() += this_ptr->get_data() * out_ptr->get_grad();

    return out;

Building Module Classes

With the underlying classes built and finally robust (as far as I could tell), it was time to build up the Module classes as in micrograd. The details here are very similar to the Python implementation and didn’t give any new problems. In essence, once the underlying Value classes are well written it is easy to use them just as you would built-in types: a principle at the core of C++’s philosophy.

I tested the MLP on the MNIST handwritten numbers data set. I chose one hidden layer of size 30 and trained in batches of 50 for 3 epochs. This took 58 minutes, running on my desktop computer (in a VM), and yielded an accuracy of 85% as measured on the test set. Success! Much better than the one in ten odds from random guess - the model was definitely learning.

Using C++ Code in Python

One additional aspect I was excited about in this project was using cppyy to make the C++ code accessible through Python. Following a small learning curve, this worked really well: the speed of C++ with the ease of Python scripting. Even accessing the MNIST dataset is far easier through Python modules!

Here is an example of the Python I used to train the model:

import cppyy
import cppyy.gbl as cpp
from cppyy.gbl import std


(raw_train_X, raw_train_y), (raw_test_X, raw_test_y) = mnist.load_data()
train_X = std.vector[std.vector[float]]()
train_y = std.vector[std.vector[float]]()
test_X = std.vector[std.vector[float]]()
test_y = std.vector[std.vector[float]]()

[train_X.push_back(std.vector[float](list(x.flatten()/255 - 0.5))) for x in raw_train_X];
[train_y.push_back(std.vector[float](list(np.eye(10)[y]))) for y in raw_train_y];
[test_X.push_back(std.vector[float](list(x.flatten()/255 - 0.5))) for x in raw_test_X];
[test_y.push_back(std.vector[float](list(np.eye(10)[y]))) for y in raw_test_y];

mnist_model = cpp.MLP[float]([784, 30, 10])

BATCH = 50
running_loss = 0

for i in range(EPOCHS):
    for j in range(0, len(train_X), BATCH):
        for k in range(min(BATCH, len(train_X)-j)):
            loss = mnist_model.loss(train_X[j+k], train_y[j+k])
            running_loss += loss.get_data()
        print(f"Epoch: {i}, Batch: {j}, Loss: {running_loss/k}")
        running_loss = 0.0

Using the code this way took 144 minutes, so around twice as long as the C++ version due to some overhead in the Python interpreter. A strong price to pay for the ease of use, but a nice option to have nonetheless. This could likely be improved by writing more of the training loops in C++ and only calling them from Python.