Modern Generative AI is capable of creating high-quality images in a matter of seconds, given only a textual description. These models have changed the way that many industries operate and will operate in the future, but **how do these modern Generative AI image models actually work?**

In this article, we’ll take a look at this topic from a bird's-eye view. If you want to familiarize yourself with Generative AI more generally first, you can read our Introduction to Generative AI.

## How images are made with Generative AI

While Generative AI can be used to make images in many ways, the advancements of the past few years generally rely on a newer paradigm called **Diffusion Models**. Diffusion Models are a mathematical framework for generating data which is inspired by **thermodynamics**.

If a drop of food coloring is placed into a glass of water, thermodynamic diffusion is the process that describes how the food coloring spreads out to eventually create a uniform color in the glass.

**Diffusion models work by applying this concept to the image space**. Given an image, we can **diffuse** it, which corresponds to slightly altering the pixel values in the image over time. As the image diffuses, it will eventually reach "TV static", which is the image equivalent of the uniform color in the food coloring case.
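The forward process can be sketched in a few lines of code. This is a toy illustration, not a production implementation: `beta`, the per-step noise rate, and the step count are arbitrary illustrative values, not settings from any real model.

```python
import numpy as np

def forward_diffuse(image, num_steps=1000, beta=0.02, seed=0):
    """Gradually mix Gaussian noise into an image until it resembles TV static.

    Toy sketch of the forward diffusion process: at each step the signal is
    shrunk slightly and a little fresh noise is mixed in.
    """
    rng = np.random.default_rng(seed)
    x = image.astype(np.float64)
    for _ in range(num_steps):
        noise = rng.standard_normal(x.shape)
        # Shrink the existing pixel values and add a small amount of noise.
        x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * noise
    return x

# A flat gray "image": after enough steps it is indistinguishable from static.
image = np.full((8, 8), 0.5)
static = forward_diffuse(image)
```

After many steps, essentially none of the original signal remains and the pixel values are simply samples from a standard normal distribution, i.e. "TV static".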

This process is impossible to reverse in real life. When food coloring has been thoroughly stirred into water, there is no way to "unstir" it and recover the original, concentrated drop of food coloring. While we can’t go backward in time in real life, Diffusion Models *learn* how to simulate this reverse-time process for images. That is, **Diffusion Models learn how to go backwards in time from TV static to images**.

Learning this reverse-time process allows us to generate novel images. To do this, we first generate an image of TV static, which is **easy** for computers to do. We then simulate the reverse-diffusion process, which allows us to go backward in time to determine what original image *leads to that TV static* when it diffuses (in forward time). **This is how Generative AI creates images**.
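To make the loop structure concrete, here is a toy one-dimensional sketch of reverse diffusion. The "data distribution" is just a narrow Gaussian around 3.0, and the pull toward it is a hand-written caricature of what a trained denoising network learns; all the specific numbers are illustrative assumptions, not the update rule of a real diffusion model.

```python
import numpy as np

def toy_reverse_diffusion(num_samples=16, num_steps=100, seed=0):
    """Caricature of reverse diffusion in one dimension.

    The 'data distribution' is a narrow Gaussian around 3.0, so the pull
    toward it can be hand-written here. A real diffusion model replaces
    this pull with a trained neural network's predictions.
    """
    rng = np.random.default_rng(seed)
    data_mean = 3.0  # assumed toy data distribution

    x = rng.standard_normal(num_samples)  # start from pure "static"
    for step in range(num_steps):
        # Each reverse step nudges the samples toward the data distribution
        # and re-injects a little noise (the process stays stochastic).
        strength = 1.0 / (num_steps - step)  # stronger pull near the end
        x = x + strength * (data_mean - x) + 0.05 * rng.standard_normal(num_samples)
    return x

samples = toy_reverse_diffusion()
```

The samples begin as pure noise and, step by step, migrate toward the data distribution, which is exactly the shape of the image-generation loop described above.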

Below we can see the reverse-diffusion process for generating images of handwritten digits. 16 samples of "TV static" were created, and we can see how the images evolve as they go backward through time to yield 16 images of handwritten digits:

The central idea at play here is that it is **very difficult** to tell a computer to generate an image of a dog, of a human face, or of anything else of interest. On the other hand, it is **easy** to generate TV static, and (relatively) easy to **transform** the TV static into an image of a face using AI (and some insight from physics). So, rather than trying to sample **directly** from the distribution of the training data, we indirectly sample from it in this roundabout way to create new images.

If you want a more detailed explanation or want to learn about the principles that inspired the development of Diffusion Models, you can check out our beginner guide.

Next, we briefly introduce the mathematical formulation of this process. If you are not so familiar with these concepts, feel free to skip to the next section where we discuss how Diffusion Models are incorporated into modern Generative AI image models like DALL-E 2 and Stable Diffusion.

### Mathematical Formulation

Formally, Diffusion Models are parameterized as a **Markov chain**. The forward diffusion process is modeled as a series of discrete timesteps, where the latent variables are the gradually noisier images. The transitions in the chain are conditional Gaussian distributions, which correspond to random motion of the pixels.
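In the standard (DDPM-style) notation, with $x_0$ the original image, $x_1, \dots, x_T$ the increasingly noisy latents, and $\beta_t$ a small per-step variance schedule, the forward transitions described above are written as:

```latex
q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\; \sqrt{1-\beta_t}\, x_{t-1},\; \beta_t \mathbf{I}\right),
\qquad
q(x_{1:T} \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1})
```

Each transition slightly shrinks the previous image and adds a small amount of Gaussian noise, which is the "random motion of the pixels" mentioned above.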

To travel backwards along this chain and learn the reverse diffusion process, we train a denoising model that takes in a latent variable and predicts the one that precedes it.

The objective is maximum likelihood, meaning that the goal is to find the parameters *theta* of the denoising model that maximize the likelihood of our data.

In practice, we maximize a lower bound on this value, which allows us to rewrite the problem in terms of Kullback-Leibler divergences. From this new formulation, the fact that the conditional distributions in the Diffusion chain are Gaussian, paired with the Markov assumption, allows the problem to be reduced to a tractable form.
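One common way to write the resulting bound (here in DDPM-style notation, with $x_0$ the image, $x_1, \dots, x_T$ the noisy latents, $q$ the forward process, and $p_\theta$ the denoising model) is:

```latex
\log p_\theta(x_0) \;\ge\;
\mathbb{E}_q\Big[\log p_\theta(x_0 \mid x_1)\Big]
\;-\; \sum_{t=2}^{T} \mathbb{E}_q\Big[ D_{\mathrm{KL}}\big( q(x_{t-1} \mid x_t, x_0) \,\|\, p_\theta(x_{t-1} \mid x_t) \big) \Big]
\;-\; D_{\mathrm{KL}}\big( q(x_T \mid x_0) \,\|\, p(x_T) \big)
```

Because every distribution in the KL terms is Gaussian, each term has a closed form, which is what makes the objective tractable.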

These details are not crucial and are just placed here to highlight how these lofty maximum likelihood objectives become tenable as we impose assumptions by imposing restrictions on our model and how it is trained. For a full treatment of this subject, see our dedicated Introduction to Diffusion Models.

Lastly, we note the distinction between the denoising model and the Diffusion Model. The Diffusion Model is the mathematical **framework** of this Markov chain that provides asymptotic guarantees, whereas the denoising model is the actual **neural network** that is trained to traverse backwards along the Markov chain.

## How do text-to-image Generative AI models work?

Above, we saw how Diffusion Models work to generate images. You may have noticed that we can generate images, but not *specific* types of images. For example, we saw how Diffusion Models can generate images of dogs, but not a dog with *specific features*. What if we want the dog to have black hair, or be sticking its tongue out, or be a specific breed?

Text-to-image models allow us to do this. They use a textual description to **control** the Diffusion Process in order to generate images that correspond to the description. While there is no single approach to how text-to-image models operate, we will take a look at the overarching principles that power these models to see how they work at a conceptual level.

In particular, diffusion-based text-to-image models will generally include two essential components:

- A **text encoder** that maps the text to a vector which captures the *meaning* of the text
- A **Diffusion Model** that *decodes* this “meaning vector” into an image

Let’s take a high-level look at how these pieces fit together now.

### Extracting meaning from text

The first step in the text-to-image process is to extract meaning from the text by using a **text encoder**. A text encoder converts the text into a high-dimensional vector, embedding the information in such a way that captures meaning. Let’s take a look at a simple example to build intuition for how this process works.

Let’s assume we have three words - “woman”, “man”, and “boy”. Let’s assume that each word comes paired with a *vector* - [1, 0], [1, 1], and [0, 1].

As of now, we have been given these three vector-word pairs and nothing more. There are no pre-existing relationships that have been provided, and we have not (yet) defined any ourselves. But now let’s assume we are given **one more word** - “girl” - and **no vector**. What vector might correspond to this word?

Let’s try to formulate a mapping schema for our existing vector-word pairs. We see that the vectors for “woman” and “man” share the **same first entry** and **differing second entry**. Therefore, “woman” and “man” are alike on some axis, and differ along another. Let's consider first a way in which they are alike. Since both a woman and man are **adults**, let’s let a **1** in the first entry mean “**adult**”. If we examine the vector for “boy”, we find that the first-entry value is **0**. Since a boy is not an adult, this meaning for the first entry is **consistent**.

Similarly, if we inspect “man” and “boy” in the same way, we again find that they are similar along one axis and differ along another. We have already established how they are dissimilar (i.e. whether they have the quality of "adultness"), so now let's consider how they are similar. Since both a man and boy are **males**, let’s let a **1** in the second entry mean **“male”**. Again, this interpretation is internally **consistent** when we additionally consider “woman”.

What does this newly-defined interpretation schema mean with respect to our new word, “girl”? Well, since girls are not adults, then according to this schema the first vector entry should be **zero**. Similarly, since girls are also not male, the second entry should be **zero**. In this way we can deduce that **the new vector should be [0, 0]**.
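The toy schema above can be written out in a few lines of code. The vectors and the "adult"/"male" axes are exactly the ones from the example; nothing else is assumed.

```python
# Each word maps to a (adult, male) vector, per the schema in the text.
vectors = {
    "woman": (1, 0),
    "man":   (1, 1),
    "boy":   (0, 1),
}

def describe(vector):
    """Read a vector back out under the (adult, male) interpretation."""
    adult, male = vector
    return ("adult" if adult else "child") + " " + ("male" if male else "female")

# Deduce the vector for "girl": not an adult (0), not male (0).
vectors["girl"] = (0, 0)
print(describe(vectors["girl"]))  # prints "child female"
```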

As we can see, **our vectors capture information about the meaning of the words**. It is not the words *themselves* that the vectors describe, but the *objects* that the words reference. That is, the vector [1, 0] does not describe the *word* “woman” itself, but instead the *concept* of women as a Platonic ideal.

**The job of the text encoder is to learn such a “meaning” schema**. The text encoder is the component in a text-to-image model that is used to extract meaning from the text so that we can use this semantic representation. Note that this statement isn’t *quite* true for some models like DALL-E 2 and is more accurate for a model like Imagen, but this understanding suffices for our purposes.

Additionally, for text-to-image models the “meaning vectors” generally have many **more than two “entries”** (or “components”). For example, Imagen’s “meaning vectors” have over one thousand components. Further, text-to-image models allow the entry values to be **any real number**, like 1.2 or -4 or 0.75, instead of just 0 or 1. These factors together allow for a much more expansive and finer-grained understanding of the meaning of language.
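With real-valued vectors, "closeness in meaning" is typically measured with cosine similarity. The three-component embeddings below are made-up illustrative values, not the output of any actual text encoder; the point is only that related concepts land close together while unrelated ones do not.

```python
import numpy as np

# Made-up, illustrative embeddings (real ones have hundreds or thousands
# of components and come from a trained text encoder).
embeddings = {
    "woman": np.array([ 1.2, -0.3,  0.8]),
    "queen": np.array([ 1.1, -0.2,  0.9]),
    "truck": np.array([-0.9,  1.5, -0.4]),
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: near 1 means similar meaning."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# "woman" is far more similar to "queen" than to "truck" in this toy space.
woman_queen = cosine_similarity(embeddings["woman"], embeddings["queen"])
woman_truck = cosine_similarity(embeddings["woman"], embeddings["truck"])
```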

### Manifesting meaning visually

Remember, above we said that the word “woman” is different from the *concept* of a woman. The actual word “woman” is one *representation* of this concept. This concept affords other representations as well, like this one:

Therefore, we have these two objects - the *word* woman, and an *image* of a woman - that reference the same “meaning”.

Our text encoder just *learned* how to map from the textual representation of a woman to the concept of a woman in the form of a vector. Above we saw that there exist interpretation schemas in which a vector can be considered to capture information about the concept that a given word references. We now *use* that meaning to generate another representation of it. In particular, we have learned to map **from words to meaning**, now we must learn to map **from meaning to images**.

This is where **Diffusion Models** come into the picture. We use the Diffusion Model to generate a new image, and we **condition** this process using the meaning vector.

**Conditioning** can be considered the practice of providing additional information to a process to impose a *condition* on its outcome. For example, let’s say that we want to randomly generate a number that corresponds to the roll of a six-sided die. If we sample uniformly, then we will have a **1 in 6 chance** of generating any given integer from 1 to 6. Now let’s repeat this process, but this time **condition** it by requiring the generated number to be **even**. Now there is a **1 in 3 chance** for each of 2, 4, and 6, while there is no chance for the odd numbers 1, 3, and 5. We have **conditioned** this generation process with additional information to affect the generated output.
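The die example can be sketched directly in code. Here the condition is imposed by rejection sampling (re-rolling until the condition holds), which is just one simple way to realize it.

```python
import random

def roll(rng):
    """Unconditioned roll: each face 1-6 has probability 1/6."""
    return rng.randint(1, 6)

def roll_conditioned_on_even(rng):
    """Conditioned roll: re-roll until the result is even, so each of
    2, 4, and 6 has probability 1/3 and the odd faces have probability 0."""
    while True:
        value = rng.randint(1, 6)
        if value % 2 == 0:
            return value

rng = random.Random(0)
rolls = [roll_conditioned_on_even(rng) for _ in range(1000)]
# Every roll is even, and each of 2, 4, 6 appears roughly a third of the time.
```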

Similarly for our text-to-image model, we can condition the Diffusion Model as it creates images. By incorporating our “meaning vector” in the Diffusion process, we can condition the image that is being generated to properly capture the meaning held within the “meaning vector”.
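To make this concrete, here is a sketch of one widely used conditioning technique, classifier-free guidance (employed by models such as Imagen and Stable Diffusion). The `model` here is a trivial placeholder standing in for a real denoising network, and `guidance_scale` is an illustrative value; treat this as a sketch of the idea, not a faithful implementation of any particular system.

```python
import numpy as np

def guided_noise_estimate(model, x_t, t, meaning_vector, guidance_scale=7.5):
    """Classifier-free guidance, sketched: the denoiser predicts the noise
    twice, with and without the "meaning vector", and the two predictions
    are extrapolated so the result leans toward the conditioned one."""
    eps_conditioned = model(x_t, t, meaning_vector)
    eps_unconditioned = model(x_t, t, None)
    return eps_unconditioned + guidance_scale * (eps_conditioned - eps_unconditioned)

# Placeholder "denoiser" just to make the sketch runnable: it shifts its
# output when a meaning vector is present. A real model is a large trained
# neural network that ingests the noisy image, timestep, and text embedding.
def fake_model(x_t, t, meaning_vector):
    shift = 0.0 if meaning_vector is None else 1.0
    return np.zeros_like(x_t) + shift

x_t = np.zeros(4)
eps = guided_noise_estimate(fake_model, x_t, t=10, meaning_vector=np.ones(8))
```

This guided noise estimate is then used at every step of the reverse-diffusion loop, which is how the "meaning vector" steers the image toward the text description.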

Technical note

DALL-E 2 actually includes another component that maps between different vectors in the representation space. The details are out of the scope of this article, but feel free to check out How DALL-E 2 Actually Works for more details.

To see an example of how this conditioning actually happens from a technical point of view, interested readers can see here.

It is important to note that the Diffusion process is **still stochastic**. This means that *even with the same conditioning*, a different image will be generated each time we reverse-diffuse. That is, we can generate **multiple different images **given the **same input text**.

This fact is important because there is no single image that properly represents **all** semantic information in a “meaning”. When we say “an image of a woman”, what do we mean? What color of hair does she have? Where is she situated? What emotion is she expressing, if any? If you ask a room of 10 people to imagine “an image of a woman”, each of them will depict it differently in their minds’ eyes.

Indeed, when we ask Stable Diffusion to generate “An image of a woman”, it outputs many images that can each be considered to reflect this prompt. This aspect of the model is valuable because each of these images is indeed “An image of a woman”.

If you’re interested in how these models are actually built, you can check out our **MinImagen** article. We go through how to build a minimal implementation of Imagen, and provide all code, documentation, and a thorough guide on each salient part.

## Other Generative AI Paradigms

While Diffusion Models are generally what power modern Generative AI applications in the image domain, other paradigms exist as well. Two popular paradigms are Vector-Quantized Variational Autoencoders (VQ-VAEs) and Generative Adversarial Networks (GANs). While we have mentioned that Diffusion Models power many of the recent advancements in Generative AI, each method has its pros and cons, and paradigms like GANs still find their foundational principles in use in other domains.

We mention these paradigms here simply as pointers for further reading for those interested.

## Final Words

In this article, we’ve taken a look at the progress in Generative AI in the image domain. After understanding the intuition behind Diffusion Models, we examined how they are put to use in text-to-image models like DALL-E 2.

In the next article in our *Everything you need to know about Generative AI* series, we will look at recent progress in Generative AI in the language domain, which powers applications like ChatGPT.