Automatic differentiation is a core building block of modern deep learning frameworks, but most existing autodiff systems are embedded in large, complex libraries. cppautograd is a lightweight, from-scratch engine in C++ that demonstrates how to build the essentials of automatic differentiation while addressing practical performance and memory concerns.

What Is cppautograd?

cppautograd provides:

  • A tensor type supporting scalar and multidimensional operations
  • Gradient propagation (backward) through those operations
  • A minimal, intuitive API inspired by LibTorch

The goal is educational clarity while maintaining decent performance; making it production-ready would require numerous further optimizations and a significant amount of work.

Some Challenges & Optimizations

During development, several problems surfaced; I outline a few of them below. In short:

1. Indexing overhead

Using std::vector::operator[] for tensor element access was surprisingly slow. On my machine, switching to raw pointer arithmetic improved throughput by 4x.
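
To illustrate the pattern (a standalone sketch, not cppautograd's actual kernel code), the same inner loop can be written through operator[] or through raw pointers hoisted out of the loop:

#include <cstddef>
#include <vector>

// Illustration only: element-wise product sum written two ways.
double dot_indexed(const std::vector<double>& a, const std::vector<double>& b) {
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i)
        sum += a[i] * b[i];                  // every access goes through operator[]
    return sum;
}

double dot_raw(const std::vector<double>& a, const std::vector<double>& b) {
    const double* pa = a.data();             // grab the raw pointers once
    const double* pb = b.data();
    double sum = 0.0;
    for (std::size_t i = 0, n = a.size(); i < n; ++i)
        sum += pa[i] * pb[i];                // plain pointer arithmetic in the hot loop
    return sum;
}

How much this helps depends on the compiler and build flags, but it removes a layer of indirection from the innermost loop.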

2. Memory explosion from intermediates

Naively storing copies of every intermediate tensor to enable backprop is the easiest solution, but it quickly leads to high memory use.
To mitigate this, cppautograd uses a wrapper around the tensor data that deallocates intermediates once they are no longer needed and avoids unnecessary deep copies.
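
A minimal sketch of the idea, assuming a shared, reference-counted buffer so that copies and intermediates reuse the same storage and the buffer is freed as soon as the last handle goes away (the class names here are hypothetical, not cppautograd's actual types):

#include <memory>
#include <vector>

// Hypothetical sketch: tensor data lives behind a shared handle, so
// copying a Tensor copies the handle rather than the buffer, and the
// buffer is deallocated automatically once no tensor references it.
struct TensorData {
    std::vector<double> values;
    std::vector<double> grad;
};

class Tensor {
public:
    explicit Tensor(std::vector<double> values)
        : data_(std::make_shared<TensorData>(TensorData{std::move(values), {}})) {}

    // Copies and intermediates share the same underlying buffer.
    Tensor(const Tensor&) = default;
    Tensor& operator=(const Tensor&) = default;

private:
    std::shared_ptr<TensorData> data_;
};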

3. Multithreading & batch-level parallelism

A straightforward multithreaded matrix multiplication saw limited gains due to thread overhead and memory bandwidth limits. To reduce thread-launch costs, the design instead parallelizes at the batch level, assigning batches to separate threads and then averaging their gradients. This further improved throughput by roughly 10x.
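
A sketch of the pattern, with a hypothetical per-batch worker standing in for the real forward/backward pass: each thread owns its own gradient buffer, and the results are averaged once all workers have joined.

#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical per-batch worker: in a real engine this would run the
// forward and backward pass for one batch; here it just fills dummy values.
void compute_batch_gradient(std::size_t batch, std::vector<double>& grad_out) {
    for (double& g : grad_out)
        g = static_cast<double>(batch);
}

std::vector<double> parallel_gradient(std::size_t num_batches, std::size_t grad_size) {
    std::vector<std::vector<double>> per_batch(num_batches, std::vector<double>(grad_size, 0.0));
    std::vector<std::thread> workers;
    workers.reserve(num_batches);

    // One thread per batch; each worker writes only to its own buffer.
    for (std::size_t b = 0; b < num_batches; ++b)
        workers.emplace_back(compute_batch_gradient, b, std::ref(per_batch[b]));
    for (auto& w : workers)
        w.join();

    // Average the per-batch gradients into a single update.
    std::vector<double> averaged(grad_size, 0.0);
    for (const auto& g : per_batch)
        for (std::size_t i = 0; i < grad_size; ++i)
            averaged[i] += g[i] / static_cast<double>(num_batches);
    return averaged;
}

Because each worker writes only to its own buffer, no locking is needed during the backward pass; the only synchronization point is the final join and averaging step.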

Interested in giving it a try?

cppautograd includes a demo / tutorial directory. You can try things like:

Tensor a = Tensor({3.0}, true);
Tensor b = Tensor({4.0}, true);
Tensor c = a * b + 2.0;        // forward pass builds the computation graph
c.backward();                  // backward pass propagates gradients to a and b
std::cout << "grad(a) = " << a.get_grad() << "\n";
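
Since c = a * b + 2.0, the gradient of c with respect to a is b, so this should print a gradient of 4 for a (and the gradient of b would be 3).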