Differentiable Programming - A Simple Introduction

What is Differentiable Programming, and how is it different from Deep Learning? Check out this introduction to learn everything you need to know!


Differentiable Programming is a relatively new term that is often conflated with Deep Learning. The two do overlap, but only in one direction: Deep Learning is a subset of Differentiable Programming.

In this article we'll explain what Differentiable Programming is and how it differs from Deep Learning, in particular with reference to its greater generality. We'll learn through example by solving a physics-based problem in three (progressively smarter) ways. Let's get started!

Introduction to Differentiable Programming

Many Machine Learning techniques at their core boil down to minimizing some loss function to learn a model that is well-suited to solve some problem. Deep Learning builds on top of this central idea by imposing two requirements on the model itself - a network architecture and training via automatic differentiation. In contrast, Differentiable Programming demands only one of these requirements - training via automatic differentiation.

Differentiable Programming refers to utilizing automatic differentiation in some way that allows a program to optimize its parameters in order to get better at some task. It requires only three things:

  1. A parameterized function / method / model to be optimized
  2. A loss that is suitable to measure performance, and
  3. (Automatic) differentiability of the object to be optimized

While Deep Learning certainly checks these boxes, it is not the only field that does. Differentiable Programming can be applied to a wide variety of tasks in other areas, including probabilistic programming, Bayesian inference, robotics, and physics.
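To make the three ingredients concrete, here is a minimal sketch that involves no neural network at all. The `Dual` class is a toy forward-mode automatic differentiation helper written for this illustration (it is not taken from any library), and the one-parameter model, the data, and the learning rate are all made-up assumptions.

```python
from dataclasses import dataclass


@dataclass
class Dual:
    """A value together with its derivative (toy forward-mode autodiff)."""
    val: float
    dot: float = 0.0

    def __add__(self, other):
        other = _lift(other)
        return Dual(self.val + other.val, self.dot + other.dot)
    __radd__ = __add__

    def __sub__(self, other):
        other = _lift(other)
        return Dual(self.val - other.val, self.dot - other.dot)

    def __mul__(self, other):
        other = _lift(other)
        # Product rule: (fg)' = f'g + fg'
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)
    __rmul__ = __mul__


def _lift(x):
    """Treat plain numbers as constants (derivative zero)."""
    return x if isinstance(x, Dual) else Dual(float(x))


def model(w, x):
    """Ingredient 1: a parameterized function (here a single weight w)."""
    return w * x


def loss(w, data):
    """Ingredient 2: a loss measuring performance (mean squared error)."""
    total = Dual(0.0)
    for x, y in data:
        err = model(w, x) - y
        total = total + err * err
    return total * (1.0 / len(data))


data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]  # generated by y = 3x
w = 0.0
for _ in range(100):
    # Ingredient 3: seeding dot=1.0 makes .dot carry d(loss)/dw automatically.
    grad = loss(Dual(w, 1.0), data).dot
    w -= 0.05 * grad

print(w)  # approaches 3.0
```

Nothing here resembles a network architecture, yet the program still tunes its own parameter by differentiating through ordinary code, which is the point of the three-item list above.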

The Problem: Aiming a Cannon

We'll explore the topic of Differentiable Programming by considering the following question:

Given a target at a known distance, how can we adjust a cannon's angle and ejection velocity in order to hit the target?

We'll solve this problem in three ways - first using a purely Deep Learning approach, second using a purely Newtonian approach, and finally using a hybrid Newtonian Deep Learning approach.

Neural Network Approach

We'll first approach this problem using only neural networks, working under the assumption that we don't know anything about physics. Instead, we just have a cannon that we can shoot many, many times in order to collect a lot of data, recording the launch angle, ejection velocity, and corresponding landing distance for each datum. Our goal is to understand how we can exploit this empirical data in order to go in the other direction, namely mapping from a target distance to suitable control parameters that will land our projectile at the desired target.

We might think at a first glance that this problem is easy. We want to input distances and output control parameters, so we can just train a model to perform this mapping using all of the data that we collected.

A first attempt at solving this problem might involve creating a Neural Network to map from distances to corresponding control parameters

Unfortunately, this approach will not work because we do not know how to measure loss. We can input a target distance and get hypothesized control parameters, but how do we determine if these are "good" control parameters? Is the mean absolute error (MAE) between the hypothesized and true control parameters suitable? What if there are multiple solutions (which there are, in this case)? Remember that we don't know anything about physics in this scenario!
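With the benefit of hindsight, the multiple-solutions problem is easy to demonstrate. The sketch below assumes the textbook drag-free range formula for a launch from level ground (something our physics-blind experimenter would not have), and the specific velocities and angles are arbitrary illustrative values.

```python
import math

G = 9.81  # assumed gravitational acceleration, m/s^2


def landing_distance(velocity, angle):
    """Assumed physics: drag-free projectile launched from level ground."""
    return velocity ** 2 * math.sin(2.0 * angle) / G


recorded = (8.0, math.radians(60.0))    # controls that happen to be in the data
predicted = (8.0, math.radians(30.0))   # different controls proposed by a model

# Both shots land in the same place ...
same_landing_spot = math.isclose(landing_distance(*recorded),
                                 landing_distance(*predicted))

# ... yet a parameter-space MAE penalizes the prediction anyway.
parameter_mae = sum(abs(p - r) for p, r in zip(predicted, recorded)) / len(recorded)

print(same_landing_spot, parameter_mae)
```

A loss defined on the control parameters would therefore punish a perfectly good answer, which is why the loss has to be measured on the landing distance instead.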

Instead, we map in the other direction, from control parameters to resulting distance, because the loss is easy to calculate - we simply measure the difference between our predicted distance and the true distance (and square it for differentiability)[1].

Mapping instead from control parameters to the resulting projectile distance provides a simple and intuitive way to measure loss

Let's train such a neural network. Below you can see the results of training on 1,000 data points, where the blue surface is the true underlying function that we are trying to replicate.

Neural Network learning on data (green) to approximate the true underlying function (blue) over 100 epochs
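For readers who want to see what such a training run might look like, here is a rough sketch in plain Python. It is not the code behind the figure: the network size, learning rate, and the 200 simulated shots (rather than 1,000) are arbitrary choices, and `fire_cannon` is a hypothetical stand-in for real experiments that assumes a drag-free launch from level ground.

```python
import math
import random

G = 9.81                # assumed gravitational acceleration, m/s^2
V_MAX = 10.0            # maximum ejection velocity used in the article, m/s
D_MAX = V_MAX ** 2 / G  # longest reachable distance under the assumed physics
HIDDEN = 8              # number of hidden units (arbitrary choice)
LEARNING_RATE = 0.02    # arbitrary choice


def fire_cannon(velocity, angle):
    """Stand-in for a real experiment (assumed drag-free range formula)."""
    return velocity ** 2 * math.sin(2.0 * angle) / G


rng = random.Random(0)
# Each datum: (velocity, angle, landing distance), all rescaled to [0, 1].
data = []
for _ in range(200):
    v, a = rng.uniform(0.0, V_MAX), rng.uniform(0.0, math.pi / 2)
    data.append((v / V_MAX, a / (math.pi / 2), fire_cannon(v, a) / D_MAX))

# A 2 -> HIDDEN -> 1 network with tanh hidden units.
w1 = [[rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)] for _ in range(HIDDEN)]
b1 = [0.0] * HIDDEN
w2 = [rng.uniform(-0.5, 0.5) for _ in range(HIDDEN)]
b2 = 0.0


def forward(x1, x2):
    """Return the hidden activations and the predicted (scaled) distance."""
    hidden = [math.tanh(w[0] * x1 + w[1] * x2 + b) for w, b in zip(w1, b1)]
    return hidden, sum(wo * h for wo, h in zip(w2, hidden)) + b2


def mean_squared_error():
    return sum((forward(x1, x2)[1] - y) ** 2 for x1, x2, y in data) / len(data)


initial_loss = mean_squared_error()
for epoch in range(100):
    for x1, x2, y in data:
        hidden, pred = forward(x1, x2)
        d_pred = 2.0 * (pred - y)  # derivative of the squared error
        for j in range(HIDDEN):
            # Backpropagate through the output weight and the tanh unit.
            d_pre = d_pred * w2[j] * (1.0 - hidden[j] ** 2)
            w2[j] -= LEARNING_RATE * d_pred * hidden[j]
            w1[j][0] -= LEARNING_RATE * d_pre * x1
            w1[j][1] -= LEARNING_RATE * d_pre * x2
            b1[j] -= LEARNING_RATE * d_pre
        b2 -= LEARNING_RATE * d_pred
final_loss = mean_squared_error()

print(initial_loss, final_loss)
```

The gradients are written out by hand here only to keep the example dependency-free; in practice an automatic differentiation library would compute them.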

After training, we have a neural network which has learned to map from control parameters to the resulting distance. Let's see how the loss surface changes as we vary the target distance in the plot below. The x-axis is the initial velocity of the projectile (ranging from 0 to a maximum of 10 m/s), the y-axis is the angle of the shot (ranging from 0 to pi/2 radians, exclusive), and the z-axis is the MSE between the resulting distance and target distance.

The empirical loss surface (as a function of initial velocity and launch angle) changes form depending on target distance

Now that we have a neural network that approximates the mapping from control parameters to resulting distance, how are we to get suitable control parameters for a given target distance? Remember we want to input a target distance and output control parameters, but our neural network maps in the other direction.

As we can see in the above plot, we have a loss surface which is a function of the control parameters and whose shape is parameterized by our target distance; therefore, we can simply use gradient descent[2]! While we first used gradient descent in order to learn an approximation of the control parameters to distances mapping, we are now using gradient descent to minimize the corresponding loss surface of this mapping as parameterized by our input target distance. Let's perform this gradient descent now:

Path of gradient descent towards empirical loss function minimum curve

We started with an initial guess and successfully learned better control parameters that got us closer to our target. Below you can see an animation of different trajectories during the descent, which took several hundred iterations in total.

During gradient descent, the landing spot tends towards the target

I said the learned control parameters are better, not correct, intentionally. Remember, although we are minimizing our loss with gradient descent (and may in fact reach a loss of zero), this loss is with respect to the approximate form of the true underlying function which defines the relationship between control parameters and resulting distance. Although we probably learn better control parameters, we don't know that they're perfect or even sufficient. In fact, the final trajectory in the animation above has a 20-centimeter error from the target even though we set our gradient descent to have a maximum of a 5-centimeter error. The error is at most 5 centimeters with respect to the approximate model, but is technically unbounded with respect to the true model.

In order to bound the error, we would need to input a target, use gradient descent (on the fixed neural model) to learn control parameters, use these control parameters to run experiments and collect the true resulting distance, and then compare this distance to the input target distance, performing some statistical analysis over large amounts of data.

Gradient descent to optimize the control parameters happens with respect to the empirical model - to determine its performance, the results need to be compared to real experimentation.

While this whole process seems labor-intensive and insufficient, especially given the unboundedness of the difference between predicted performance and true performance, it truly is the best we can do in this scenario (save training the neural network on more data). How can we improve our approach?

Newtonian Approach

By now you will have noticed the glaringly obvious problem with our above approach - we are not taking advantage of the hundreds of years of physics that provides us with insight into the problem at hand, insight which is conveniently encapsulated in the form of mathematical relationships which we can exploit.

Above, we were approximating the mapping from control parameters to the resulting projectile distance, but if we know the laws of motion why would we approximate this function with a neural network? In this case, a quite simple kinematic analysis provides us with the true underlying mapping, which in turn gives us the true loss surface, again as a function of initial velocity and ejection angle and parameterized by the target distance. Let's again take a look at how this loss surface changes as the target distance parameter is varied.

How the true loss surface (which our Neural Network sought to approximate in the last section) changes as parameterized by target distance
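For reference, one way to carry out that kinematic analysis is sketched below. It assumes a launch from ground level over flat terrain with no air resistance; the symbols (v for ejection velocity, θ for launch angle, g for gravitational acceleration, T for time of flight, d* for the target distance) are notation chosen here rather than taken from the article.

```latex
\begin{aligned}
x(t) &= v\cos(\theta)\,t, \qquad y(t) = v\sin(\theta)\,t - \tfrac{1}{2} g t^{2} \\
T &= \frac{2 v \sin\theta}{g} \qquad \text{(from } y(T) = 0,\ T > 0\text{)} \\
d(v,\theta) &= x(T) = \frac{2 v^{2}\sin\theta\cos\theta}{g} = \frac{v^{2}\sin(2\theta)}{g} \\
L(v,\theta;\,d^{*}) &= \bigl(d(v,\theta) - d^{*}\bigr)^{2} \\
\frac{\partial L}{\partial v} &= 2\bigl(d - d^{*}\bigr)\,\frac{2 v \sin(2\theta)}{g}, \qquad
\frac{\partial L}{\partial \theta} = 2\bigl(d - d^{*}\bigr)\,\frac{2 v^{2}\cos(2\theta)}{g}
\end{aligned}
```

Under these assumptions the loss vanishes exactly along the curve v² sin(2θ) = g d*, so every target within range has a whole family of solutions rather than a single one.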

The easily observed difference in form between the true loss surface and approximate-model loss surface helps us highlight the issue with the previous method - minimizing a perturbed loss surface can yield imperfect control parameters. You'll notice that, despite the fact that the surfaces trend the same way, the parabola which defines the minimum curve in the real model (blue) is not identical to the minimum curve of the neural model (green). That is, even if we properly minimize on the empirical loss surface, the resulting control parameters may not lie on the true solution curve.

Gradient descent on the empirical loss surface compared to the true loss surface. Note that the projections of the minimum curves of the empirical and true loss surfaces onto the control parameter plane would not perfectly align.

Given our true loss surface, we can perform gradient descent just as before. You can see the gradient descent path in the plot below.

Path of gradient descent on true loss function
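A descent like the one pictured can be sketched in a few lines. The version below assumes the drag-free, level-ground range formula d = v² sin(2θ) / g as the physical model; the starting guess, learning rate, 5-centimeter tolerance, and the name `aim` are illustrative choices, not the settings used for the plot.

```python
import math

G = 9.81       # assumed gravitational acceleration, m/s^2
V_MAX = 10.0   # maximum ejection velocity used in the article, m/s


def landing_distance(velocity, angle):
    """Assumed physical model: drag-free launch from level ground."""
    return velocity ** 2 * math.sin(2.0 * angle) / G


def loss_gradient(velocity, angle, target):
    """Analytic gradient of (landing_distance - target)^2."""
    error = landing_distance(velocity, angle) - target
    d_velocity = 2.0 * error * 2.0 * velocity * math.sin(2.0 * angle) / G
    d_angle = 2.0 * error * 2.0 * velocity ** 2 * math.cos(2.0 * angle) / G
    return d_velocity, d_angle


def aim(target, velocity=5.0, angle=0.3, learning_rate=0.01,
        tolerance=0.05, max_steps=10_000):
    """Gradient descent on the true loss; returns (velocity, angle, steps)."""
    for step in range(max_steps):
        if abs(landing_distance(velocity, angle) - target) <= tolerance:
            return velocity, angle, step
        d_velocity, d_angle = loss_gradient(velocity, angle, target)
        # Keep the controls inside the physically allowed ranges.
        velocity = min(max(velocity - learning_rate * d_velocity, 0.0), V_MAX)
        angle = min(max(angle - learning_rate * d_angle, 1e-3), math.pi / 2 - 1e-3)
    raise RuntimeError("did not reach the target within max_steps")


velocity, angle, steps = aim(target=6.0)
print(velocity, angle, steps, landing_distance(velocity, angle))
```

Because the loss is evaluated on the physical model itself, the tolerance checked inside `aim` is a bound on the modeled landing error, not on the error of an approximation to it.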

If we examine the trajectories of the projectile during the descent, we can see a slow-but-steady tuning to sufficient control parameters.

Gradient descent of the true loss function takes fewer iterations to reach suitable control parameters

The true loss surface has the benefit of being globally smoother, yielding fewer iterations and a more robust descent. Also, we don't have to worry about ill-conditioning in contrast to the empirical loss surface, where if we initialized in the top right-hand corner (red arrow) our descent would've resulted in wildly incorrect control parameters. Relating to this, using the true loss surface has the benefit of ensuring that found solutions are actually correct (insofar as our physical model actually reflects reality, but this is the scope of science and not Machine Learning).

The true loss surface (blue, left) is globally smooth and provides more robust and accurate gradient descent when compared to the empirical (green, right) loss surface

At this point you may be wondering why we even brought up neural networks in the first place. If we can simply perform gradient descent on the true loss function, why would we not just do this from the outset? The answer is that the best method actually combines the two previous methods, yielding a hybrid Newtonian Neural Network approach, which is precisely the Differentiable Programming approach.

Differentiable Programming Approach

In either of the two cases above, we needed to perform gradient descent. This means that if we were to deploy such a model, we would have to worry about unknown runtimes, poorly tuned learning rates, getting stuck in local minima, etc. The need for gradient-based optimization ultimately resulted from our inability to devise a way to measure loss with respect to output control parameters:

Previously, in our physics-blind method, we did not know how to measure loss with respect to output control parameters

This inability to devise a suitable loss resulted in us flipping the inputs and outputs, giving us a sensible loss as the MSE between target distance and predicted distance, and then performing gradient descent to get sufficient control parameters. While we can exploit our knowledge of physics to optimize the true loss surface in this way, we can do better by using the physical model to generate a sensible loss function.

The result is a Differentiable Programming approach where we are mapping from a target distance to corresponding control parameters and then to the true distance resulting from these parameters. This is effectively an autoencoding network in which the neural network learns the "inverse"[3] of the physical model which maps from control parameters to resulting distances:

We incorporate prior domain-specific knowledge to create an "autoencoding" network which we can backpropagate through to train our approximation network

Since our physical model constitutes a composition of differentiable functions, we can backpropagate through the network and update the parameters of the neural network to learn.
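The following sketch shows one way this backpropagation could look, written in plain Python with hand-derived gradients so that each step of the chain rule is visible. The physical model (drag-free range formula), the sigmoid scaling that keeps the outputs within the allowed velocity and angle ranges, the network size, and the training targets are all assumptions made for illustration.

```python
import math
import random

G = 9.81                # assumed gravitational acceleration, m/s^2
V_MAX = 10.0            # maximum ejection velocity used in the article, m/s
D_MAX = V_MAX ** 2 / G  # longest reachable distance under the assumed physics
HIDDEN = 8              # number of hidden units (arbitrary choice)

rng = random.Random(0)
w1 = [rng.uniform(-1.0, 1.0) for _ in range(HIDDEN)]  # input -> hidden
b1 = [0.0] * HIDDEN
w2 = [[rng.uniform(-0.1, 0.1) for _ in range(HIDDEN)] for _ in range(2)]
b2 = [0.0, 0.0]                                        # hidden -> (v, angle)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def physics(velocity, angle):
    """Differentiable physical model (assumed drag-free range formula)."""
    return velocity ** 2 * math.sin(2.0 * angle) / G


def predict(target):
    """Network: target distance -> (velocity, angle), plus cached activations."""
    x = target / D_MAX
    hidden = [math.tanh(w * x + b) for w, b in zip(w1, b1)]
    s = [sigmoid(sum(w * h for w, h in zip(row, hidden)) + b)
         for row, b in zip(w2, b2)]
    return V_MAX * s[0], (math.pi / 2) * s[1], (x, hidden, s)


def loss(target):
    velocity, angle, _ = predict(target)
    return ((physics(velocity, angle) - target) / D_MAX) ** 2


def train_step(target, learning_rate=0.1):
    velocity, angle, (x, hidden, s) = predict(target)
    # Backward pass through the loss and the physical model.
    d_dist = 2.0 * (physics(velocity, angle) - target) / D_MAX ** 2
    d_velocity = d_dist * 2.0 * velocity * math.sin(2.0 * angle) / G
    d_angle = d_dist * 2.0 * velocity ** 2 * math.cos(2.0 * angle) / G
    # Backward pass through the sigmoid output scaling.
    d_z = [d_velocity * V_MAX * s[0] * (1.0 - s[0]),
           d_angle * (math.pi / 2) * s[1] * (1.0 - s[1])]
    # Backward pass through the network, followed by a plain SGD update.
    for j in range(HIDDEN):
        d_pre = (d_z[0] * w2[0][j] + d_z[1] * w2[1][j]) * (1.0 - hidden[j] ** 2)
        for k in range(2):
            w2[k][j] -= learning_rate * d_z[k] * hidden[j]
        w1[j] -= learning_rate * d_pre * x
        b1[j] -= learning_rate * d_pre
    for k in range(2):
        b2[k] -= learning_rate * d_z[k]


targets = [1.0 + 8.0 * i / 49 for i in range(50)]  # reachable targets, metres
initial_loss = sum(loss(t) for t in targets) / len(targets)
for epoch in range(200):
    rng.shuffle(targets)
    for t in targets:
        train_step(t)
final_loss = sum(loss(t) for t in targets) / len(targets)

print(initial_loss, final_loss)
```

Note that no recorded control parameters appear anywhere in the training loop: the only supervision is the target distance itself, compared against the distance that the physical model says the predicted controls would produce.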

The result is a prediction network which maps from target distances to suitable control parameters, which was our goal all along. Once we use it to generate predicted control parameters, the physical model can then be used to verify that the control parameters yield a landing distance that is within the allowed error of our target distance.

If the landing distance is not suitably close, we can once again perform gradient descent on the true model, but this time using the predicted control parameters as a starting point rather than a random initialization. Therefore, with a Differentiable Programming approach, we have:

  1. A pre-trained neural model that can quickly provide control parameter estimates given a target distance
  2. A method of verifying that these control parameters will indeed land the projectile sufficiently close to the target, and
  3. A fast way of adjusting the control parameters if they are not sufficient (requiring far fewer iterations on average than a random initialization).
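The three pieces can be wired together roughly as follows. In this sketch `predict_controls` is a hypothetical, deliberately imperfect stand-in for the trained network (it is not a learned model), the physical model is the assumed drag-free range formula, and the tolerance and learning rate are arbitrary.

```python
import math

G = 9.81          # assumed gravitational acceleration, m/s^2
TOLERANCE = 0.05  # allowed landing error in metres (arbitrary choice)


def landing_distance(velocity, angle):
    """Physical model (assumed drag-free launch from level ground)."""
    return velocity ** 2 * math.sin(2.0 * angle) / G


def predict_controls(target):
    """Hypothetical stand-in for the trained network: a slightly-off guess."""
    return 1.05 * math.sqrt(G * target), math.pi / 4


def refine(target, velocity, angle, learning_rate=0.01, max_steps=10_000):
    """Verify the controls and, if needed, descend on the true squared error."""
    for step in range(max_steps):
        error = landing_distance(velocity, angle) - target
        if abs(error) <= TOLERANCE:  # verification against the physical model
            return velocity, angle, step
        d_velocity = 2.0 * error * 2.0 * velocity * math.sin(2.0 * angle) / G
        d_angle = 2.0 * error * 2.0 * velocity ** 2 * math.cos(2.0 * angle) / G
        velocity -= learning_rate * d_velocity
        angle -= learning_rate * d_angle
    raise RuntimeError("no suitable control parameters found")


target = 6.0
# Warm start: begin from the (stand-in) network's prediction.
warm_v, warm_a, warm_steps = refine(target, *predict_controls(target))
# Cold start: begin from an arbitrary fixed guess instead.
cold_v, cold_a, cold_steps = refine(target, 5.0, 0.3)

print(warm_steps, cold_steps)
```

How much the warm start helps depends entirely on how good the predictor is; with a well-trained network the verification step would often pass immediately and no descent would be needed at all.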

We can see the total system in the chart below:

Complete schematic of the Differentiable Programming approach to solving the Cannon Problem

Final Words

While we provide a simple, high-level use case of Differentiable Programming for hybrid neural-physical models, its applications far exceed this example. Some areas that utilize Differentiable Programming[4] to inject their projects with Artificial Intelligence are:

For those interested in more advanced use cases, the team at the Julia Programming Language has some great resources on Differentiable Programming[4].


  1. Note here that we are doing pure Machine Learning - we could know nothing about the situation at hand, viewing the data only as 3 columns of numbers a, b, and c, and still train the model successfully as long as we are told that MSE is an appropriate loss function.
  2. It should be noted here that, even if the underlying true model is not differentiable or even continuous, the approximate model is guaranteed to be differentiable (at least almost everywhere, for piecewise-linear activations such as ReLU), so we can be sure that gradient descent is a viable method of optimization. This guarantee stems from the fact that a neural network is a composition of differentiable functions and is therefore itself differentiable.
  3. This is not a true inverse in the mathematical sense because the forward mapping from control parameters to resulting distances is not injective. It is simply an inverse in the sense that it finds a path through the minimizing level curves of the forward mapping as a function of the target distance.
  4. A lecture covering these topics can be found here and was the inspiration for the cannon problem in this article.