How physics advanced Generative AI

Many cutting-edge Generative AI models are inspired by concepts from physics. In this guide, we’ll take a high-level look at how physics is driving advancements in AI.

Distinct fields often cross-pollinate, borrowing important concepts that help drive their progress. Concepts from mathematics lie at the foundation of progress in physics; concepts from physics often inspire frameworks in economics.

Artificial Intelligence (AI) has joined this cohort, pulling in ideas from physics to develop state-of-the-art models and inform how they work at a fundamental level. While ideas from physics have been incorporated into AI before, only recently have physics-inspired models like DALL-E 2 and Stable Diffusion outperformed other approaches so decisively.

In this article, we’ll take a high-level look at these recent advancements and show how concepts from two distinct subfields of physics - electrostatics and thermodynamics - have elevated the performance of Generative AI models to a new echelon.

A Generative AI model generating images of human faces by using principles from electrostatics (provided by PFGM authors)

This article is geared towards anyone who is interested in the high-level concepts of how these powerful models work. We won’t get into particular mathematical details, so the explanations should be helpful to readers at all experience levels in AI.

Lessons from Electrostatics and Thermodynamics

Both of the cases we’ll look at are most often applied to Generative AI for images. For electrostatics, the treatment of a probability density as an electric charge density is the kernel of the method, where the motion of electrons according to the laws of physics can be exploited to generate novel images.

In the second case of thermodynamics, the treatment of the pixels in an image as atoms is the kernel of the method, where the natural movement of these atoms forward and backward in time can similarly be exploited to generate images.

Let’s take a look at the first case now.

Generative AI with electrostatics

Electrostatics can be viewed as the study of electric charges. Charge densities are continuous objects that have different amounts of charge in different areas. A region with high charge density would repel (or attract) electrons with greater force than a region with low charge density.

This electrically-charged rod has different amounts of charge (number of electrons) at different points on the rod

We can plot the charge density of this rod - for each point on the rod, we plot “how much” charge is at that point. As we can see, there is a lot of charge in the middle, which tapers off to a lower charge at either end of the rod.

At each point on the rod, the height of the curve specifies the charge density

On the other hand, there are also probability densities. These curves show how likely each value of something is. Below, we show the probability density curve for the height of human males. As we can see, a male with a height of 5’11” (71 in, 180cm) is fairly likely, whereas heights much taller or shorter than this are less likely.

The height distribution for human males can be plotted in a similar fashion

You may have noticed that these curves look very similar. A specific class of Generative AI model - Poisson Flow Generative Models (PFGMs) - observes this too. PFGMs work by treating a probability density as a charge density.

Specifically, to generate data we need to sample from the probability distribution of that type of data. If we want to generate a sample of realistic humans (considering only height and weight), it is unlikely that they will look like this:

Such unlikely heights and weights form even less likely combinations, and taken together as a triplet they form an even less likely sample

In particular, it is fairly unlikely to have someone that tall and thin, or that short and wide, never mind having a sample of 3 such extremes simultaneously. We need to be able to sample from the distribution according to how likely the combinations of height and weight are in order to generate more realistic novel data, like this:

Considering only height and weight, this sample of males is much more realistic than the above
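To make this concrete, here is a minimal sketch (with made-up numbers, for illustration only) contrasting the two sampling strategies. Sampling height and weight jointly respects which combinations are likely, while sampling them independently can produce the unrealistic pairings shown earlier:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Illustrative values only: heights in inches, weights in pounds,
# with a positive covariance since taller people tend to weigh more.
mean = [70.0, 170.0]
cov = [[9.0, 30.0],
       [30.0, 400.0]]

# Sampling from the joint distribution respects likely combinations.
realistic = rng.multivariate_normal(mean, cov, size=3)

# Sampling each attribute independently ignores the correlation, so it
# can pair a very tall height with a very low weight.
unrealistic = np.stack([rng.normal(70.0, 3.0, size=3),
                        rng.normal(170.0, 20.0, size=3)], axis=1)
```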

With Generative AI, we attempt to use a set of example data points to learn what combinations are likely in order to generate realistic data. This set of example data points is called the training data, and it dictates what type of data we will generate. For example, if our training data are images of human faces, then we will be training the model to generate images of human faces.

How does this relate to electrostatics?

Data distribution as a charge distribution

In general, it can be hard to learn to generate samples similar to the training data. Rather than trying to do this directly, PFGMs exploit a clever trick using electrostatics to circumvent this issue.

Instead of looking at the data as a probability distribution, PFGMs change perspective and look at this distribution as a charge distribution. More likely data points (higher probability density) are considered to have more charge (higher charge density).

This, on its own, is not much help - but PFGMs utilize a crucial fact: when viewed as a charge distribution, the distribution will repel itself. Over time, this repulsion will “inflate” and gradually transform the distribution into a big uniform hemisphere. We can see a video of this process below:

When treated as electrons, the training data repel themselves to form a uniform hemisphere over time (provided by PFGM authors)

We see that the example heart-shaped distribution is morphed into the hemispherical distribution by following, at each point, trajectories like those shown by the black curves below.

For several randomly-selected points in the data (heart-shaped), we see the trajectories (black curves) that map them to the hemisphere (source)

How does this process help us? We said earlier that it is difficult to sample from the data distribution, which is ultimately our goal. What’s not difficult is to sample from this uniform hemisphere. Since it is so uniform and regular, we can sample from the hemisphere simply by picking any point at random on it.

Let’s exploit this fact: rather than trying to model the data distribution directly and sample from it directly, we will instead sample a point on the uniform hemisphere and then use physics to map this back into the data distribution. The goal of Poisson Flow Generative Models is to learn trajectory curves like those we saw in the diagram above. These curves, which result from the laws of physics, provide the mapping between the two distributions.

Since normal forward-time physics maps the data to the hemisphere along the trajectories, we use the PFGM to go backwards in time to map in the other direction. Rather than trying to model the probability distribution of the data directly, we just model the transformation between the complicated probability distribution and the simple hemispherical distribution that we can easily choose points from.

We learn how the laws of physics map between data distributions in order to generate novel images from data that is easy to sample

This whole process is illustrated in the above figure. To summarize:

  1. Our end goal is New data. We can’t get there by sampling from the data distribution directly because it is too complicated.
  2. The Laws of physics transform this complicated data distribution into the simple hemispherical distribution.
  3. Our PFGM learns this transformation (i.e. the trajectories) for our particular set of training data.
  4. We then sample from the hemisphere, which is easy to do.
  5. Once we have this sample, we run physics in reverse-time to move backwards along the trajectories that we just learned, arriving at the data distribution and thereby generating novel data.

Don’t worry if this is confusing - it’s a tricky concept to understand. The important part is that physics provides the bridge between what we want (new data), and what we can easily get (data on the hemisphere).
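For the programmatically inclined, here is a minimal sketch of steps 4 and 5, assuming we already have a trained network (the hypothetical poisson_field below) that gives the direction of the learned trajectories at any point. The real PFGM sampler is more sophisticated - it solves an ODE with an adaptive solver over a normalized field - but the core loop is just “start on the hemisphere, step backward in time”:

```python
import numpy as np

def sample_hemisphere(dim, radius):
    """Step 4: pick a uniformly random point on a large hemisphere - the easy part."""
    v = np.random.randn(dim + 1)   # random direction in (dim + 1)-dimensional space
    v /= np.linalg.norm(v)         # normalize onto the unit sphere
    v[-1] = abs(v[-1])             # reflect onto the upper half, i.e. the hemisphere
    return radius * v

def generate(poisson_field, dim, steps=1000, radius=1000.0, step_size=1e-2):
    """Step 5: follow the learned trajectories backward from the hemisphere to the data."""
    x = sample_hemisphere(dim, radius)
    for _ in range(steps):
        x = x - step_size * poisson_field(x)  # step against the field: backward in time
    return x[:dim]                            # drop the auxiliary coordinate
```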

This approach can be applied in other areas too - let’s take a look now at how we do that with thermodynamics for Generative AI.

Generative AI with thermodynamics

Thermodynamics can be viewed as the study of randomness. For example, if we throw a bunch of coins on the ground randomly, we can ask how the probability of 50% of them landing heads-up compares to the probability of 100% of them landing heads-up.

Let’s look at the case of four coins. The probability that 100% (four) of them land heads-up is less than the probability of just 50% (two) of them landing heads-up. This is because there are six ways for only two coins to land heads-up, while there is only one way for all four coins to land heads-up.

There are more ways for only two coins to fall heads-up because there is flexibility in terms of which two coins fall heads-up, whereas there is no such flexibility in the four coin case - all coins must fall heads-up

In this case, we see that 50% of the coins being heads-up is 6 times more likely than 100%. If we extend this same thought experiment to ten coins, then 50% (five) of the coins landing heads-up is 252 times more likely than 100% (ten) of them landing heads-up. If we extend this to just fifty coins, that factor balloons to about 126 trillion. What if we extend this concept to billions of coins?
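These factors are just binomial coefficients - the number of ways to choose which coins land heads-up - and we can verify them in a couple of lines:

```python
from math import comb

# How many times more likely is "exactly half heads" than "all heads"?
for n in (4, 10, 50):
    ratio = comb(n, n // 2) // comb(n, n)   # comb(n, n) == 1: one all-heads outcome
    print(f"{n} coins: {ratio:,} times more likely")

# 4 coins: 6 times more likely
# 10 coins: 252 times more likely
# 50 coins: 126,410,606,437,752 times more likely (about 126 trillion)
```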

From coins to atoms: Diffusion

Thermodynamics casts atoms as “coins” and studies the consequences of the above phenomenon in physical systems. For example, if a drop of food coloring is placed into a glass of water, the food coloring spreads out to eventually create a uniform color in the glass. Why is this?

Food coloring naturally spreads out over time to create a uniform color in the glass (source)

The uniform color is a result of the atoms of food coloring spreading out over time. There are many more ways for the billions of atoms to be in different places than all in the same place, just as there are many more ways for 50% of coins to land heads-up than 100% of them. When all of the atoms are concentrated in a single drop, they can be considered to be “100% heads-up”; when the atoms are spread out evenly, they can be considered to be “50% heads-up”.

Remember, the “50% heads-up” state is more likely, and only becomes more likely as the number of coins grows - it was 126 trillion times more likely with only 50 coins. When we consider atoms as coins, we must keep in mind that there are trillions of billions of atoms in just a drop of food coloring. With this number of atoms, it becomes overwhelmingly more likely that they will end up spread out than in a concentrated drop. So, simply through random motion, the drop will spread out over time as it approaches this 50% state of uniform color.

This process is called diffusion, and it inspires models like DALL-E 2 and Stable Diffusion.

From atoms to pixels: Diffusion in Generative AI

Just as thermodynamics views atoms as coins, Diffusion Models view the pixels of images as atoms. In the same way that the random motion of food coloring always leads to a uniform color, the “random motion” of pixels always leads to “TV static”, which is the image equivalent of uniform food coloring.

Random motion of the atoms will always lead to a uniform color, and random motion of the pixels (i.e. slightly altering their values) will always lead to TV static
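In code, this “random motion” of pixels amounts to repeatedly shrinking the image slightly and adding a little Gaussian noise. The sketch below uses a single fixed noise level beta for simplicity; real Diffusion Models vary the noise level over a schedule of steps:

```python
import numpy as np

def diffuse_forward(image, steps=1000, beta=0.02):
    """Nudge every pixel with random noise until the image becomes pure static."""
    x = np.asarray(image, dtype=np.float64)
    for _ in range(steps):
        noise = np.random.randn(*x.shape)                    # random motion of pixels
        x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * noise  # shrink signal, add noise
    return x  # after enough steps: indistinguishable from "TV static"
```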

Importantly, no matter where we place the initial drop of food coloring, over time all possible starting positions will yield this same final state of uniform color.

As time progresses, all starting drops approach the same final state

Note in particular that, from this uniform state, it is impossible to go backward and figure out where the drop initially was, since all initial states lead to it. Because the forward process is not injective, it cannot be reversed in general.

Since all possible drops lead to the same final state, it is impossible to know where the initial drop was when looking only at the final state. 

We always know how drops will diffuse forward in time, but we don’t know how to reverse-diffuse the uniform coloring due to this lack of injectivity. However, if we restrict our attention to one particular drop, then we can model this process both forward and backward in time.

If we consider only one initial drop location, then we can successfully model the diffusion process forward and backward through time

Diffusion Models use this same principle in the image domain. In particular, the different “drops” for Diffusion Models correspond to different types of images. For example, these drops could correspond to images of dogs, images of humans, and images of handwritten digits.

Each type of image (dog faces, human faces, digits) is analogous to a different initial “drop” in the liquid

By picking just one type of image, say images of dogs, Diffusion Models can learn to go backwards in time for that one type of image, just like how we can learn to go backwards in time from the uniform color by picking just one drop.

By picking just one type of image, in this case images of dogs, we can learn to go backwards in time from TV static to images of dogs

Image generation with Diffusion Models

It may be unclear why we would want to do this - if we have a dataset of images of dogs, why would we want to go forward and backward like this? The answer lies in the fact that the figure directly above is slightly deceptive - a particular image of a dog is not analogous to the drop of food coloring; rather, it is the entire class of dog images that is analogous to the drop.

Particular images of dogs are actually analogous to particular atoms in the drop of food coloring. Recall from above that restricting our attention to one initial drop allowed us to model the diffusion process forward and backward in time.

From above, we saw that focusing on one specific starting drop allows us to model the dynamics in forward and reverse time

Understanding how the diffusion process works in reverse-time allows us to trace individual atoms back to their starting points in the drop. In particular, we pick a random atom from the uniform food coloring, and then reverse time to see where in the initial drop of food coloring it started from.

Picking one drop allows us to model diffusion in reverse-time, which allows us to trace individual atoms back to their starting positions

We mimic this process with Diffusion Models. Analogously, we pick a random image of TV static (“atom”) and then go backwards through time to figure out where it started in the data distribution (“initial drop”). That is, we determine which image of a dog led to that image of TV static in forward-time.

Images are like atoms - we use a set of examples (training data) of a specific type (e.g. dogs) to learn how the diffusion process works for any particular image of that type. Then we pick a random image of TV static (not in the training data) and use this knowledge to generate a novel image.

This process is very similar to PFGMs. With PFGMs, we modeled the physics that maps our data distribution to a uniform hemisphere. Since the hemisphere is easy to sample from, we pick a point on it and run physics in reverse-time to generate a new image. With Diffusion Models, we model the physics that maps our data distribution to TV static. Since TV static is easy to generate, we pick a random image of TV static and run physics in reverse-time to generate a new image.

Sampling from the data distribution is hard, but sampling from the TV static distribution is not. Noting that physics transforms the former into the latter, we use reverse-time physics to transform a sample from the latter into one from the former.
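Assuming a trained network (the hypothetical denoiser below) that predicts the noise added at each step, generation simply inverts the forward noising sketch from earlier, starting from pure static. Real samplers also re-inject a small amount of fresh noise at each step and use a varying noise schedule, but the essence is this reverse-time loop:

```python
import numpy as np

def generate_image(denoiser, shape, steps=1000, beta=0.02):
    """Run the noising process in reverse: from "TV static" back to an image."""
    x = np.random.randn(*shape)           # easy to sample: a random image of static
    for t in reversed(range(steps)):
        predicted_noise = denoiser(x, t)  # the network's guess of the step-t noise
        # Invert one forward step: x_new = sqrt(1 - beta) * x_old + sqrt(beta) * noise
        x = (x - np.sqrt(beta) * predicted_noise) / np.sqrt(1.0 - beta)
    return x
```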

Diffusion Models lie at the foundation of much of the progress in Generative AI in the image domain. Text-to-image models like Imagen and DALL-E 2 augment this process, allowing us to tell the model what we want the generated image to look like.

Final Words

Many of the recent advancements in Artificial Intelligence are inspired by ideas from physics. As we have seen, these high-level ideas lie at the foundation of modern methods in Generative AI, powering the newest generation of AI models.

If you enjoyed this article, feel free to check out some of our other articles to learn about the Emergent Abilities of Large Language Models or How ChatGPT actually works. Alternatively, feel free to subscribe to our newsletter to stay in the loop when we release new content like this.