
# Modern Generative AI for images

Modern Generative AI models for images are powering a range of creative applications and changing the way we work. This guide gives an overview of these models and how they work.

Modern Generative AI is capable of creating high-quality images in a matter of seconds, given only a textual description. These models have changed the way that many industries operate and will operate in the future, but how do these modern Generative AI image models actually work?

In this article, we’ll take a look at this topic from a bird's-eye view. If you want to familiarize yourself with Generative AI more generally first, you can read our Introduction to Generative AI.

## How images are made with Generative AI

While Generative AI can be used to make images in many ways, the advancements of the past few years generally rely on a newer paradigm called Diffusion Models. Diffusion Models are a mathematical framework for generating data which is inspired by thermodynamics.

If a drop of food coloring is placed into a glass of water, thermodynamic diffusion is the process that describes how the food coloring spreads out to eventually create a uniform color in the glass.

Diffusion models work by applying this concept to the image space. Given an image, we can diffuse it, which corresponds to slightly altering the pixel values in the image over time. As the image diffuses it will eventually reach "TV static", which is the image equivalent of uniform color for the food coloring case.
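To make the forward process concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not how any production model is implemented: the "image" is a flat list of 1,000 pixel values, and the per-step noise level `beta`, the step count, and the square-root scaling (a common variance-preserving choice) are all picked for demonstration only.

```python
import math
import random


def diffuse(pixels, num_steps=1000, beta=0.02, seed=0):
    """Toy forward diffusion: blend a little Gaussian noise into every pixel, many times."""
    rng = random.Random(seed)
    keep = math.sqrt(1.0 - beta)  # how much of the current pixel value survives a step
    spread = math.sqrt(beta)      # how much fresh noise is mixed in per step
    x = list(pixels)
    for _ in range(num_steps):
        x = [keep * value + spread * rng.gauss(0.0, 1.0) for value in x]
    return x


def mean(values):
    return sum(values) / len(values)


# Toy "image": 500 bright pixels (+1.0) followed by 500 dark pixels (-1.0).
image = [1.0] * 500 + [-1.0] * 500
static = diffuse(image)

# Before diffusion the two halves are clearly different. Afterwards both
# halves average out near zero: the structure has dissolved into static.
print(mean(image[:500]), mean(image[500:]))
print(round(mean(static[:500]), 2), round(mean(static[500:]), 2))
```

Each individual step barely changes the image, but after enough steps no trace of the original bright and dark halves remains, which is the "TV static" endpoint described above.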

This process is impossible to reverse in real life. When food coloring has been thoroughly stirred into water, there is no way to "unstir" it and recover the original, concentrated drop of food coloring. While we can’t go backward in time in real life, Diffusion Models learn how to simulate this reverse-time process for images. That is, Diffusion Models learn how to go backwards in time from TV static to images.

Learning this reverse-time process allows us to generate novel images. To do this, we first generate an image of TV static, which is easy for computers to do. We then simulate the reverse-diffusion process, which allows us to go backward in time to determine what original image leads to that TV static when it diffuses (in forward time). This is how Generative AI creates images.

Below we can see the reverse-diffusion process for generating images of handwritten digits. 16 samples of "TV static" were created, and we can see how the images evolve as they go backward through time to yield 16 images of handwritten digits:

The central idea at play here is that it is very difficult to tell a computer to generate an image of a dog, of a human face, or of anything else of interest. On the other hand, it is easy to generate TV static, and (relatively) easy to transform the TV static into an image of a face using AI (and some insight from physics). So, rather than trying to sample directly from the distribution for the training data, we indirectly sample it in this roundabout way to create new images.

If you want a more detailed explanation or want to learn about the principles that inspired the development of Diffusion Models, you can check out our beginner guide.

Next, we briefly introduce the mathematical formulation of this process. If you are not so familiar with these concepts, feel free to skip to the next section where we discuss how Diffusion Models are incorporated into modern Generative AI image models like DALL-E 2 and Stable Diffusion.

### Mathematical Formulation

Formally, Diffusion Models are parameterized as a Markov chain. The forward diffusion process is modeled as a series of discrete timesteps, where the latent variables are the gradually noisier images. The transitions in the chain are conditional Gaussian distributions, which correspond to random motion of the pixels.
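For readers who prefer symbols, one common way to write this chain is sketched below. The notation is an assumption borrowed from the standard denoising-diffusion literature rather than something defined in this article: x_0 is a training image, x_1 through x_T are the progressively noisier latent variables, and beta_t is a small per-step noise variance.

```latex
\begin{align*}
q(x_t \mid x_{t-1}) &= \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\right) \\
q(x_{1:T} \mid x_0) &= \prod_{t=1}^{T} q(x_t \mid x_{t-1})
\end{align*}
```

The product in the second line is the Markov property at work: each noisier image depends only on the one immediately before it.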

To travel backwards along this chain and learn the reverse diffusion process, we train a denoising model that takes in a latent variable and predicts the one before it.

The objective is maximum likelihood, meaning that the goal is to find the parameters θ of the denoising model that maximize the likelihood of our data.

In practice, we maximize a lower bound on this value, which allows us to rewrite the problem in terms of Kullback-Leibler divergences. In this new formulation, the fact that the conditional distributions in the Diffusion chain are Gaussian, paired with the Markov assumption, allows the problem to be reduced to a tractable form.
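As a sketch of what this looks like on paper, the form below follows the notation commonly used for denoising diffusion models. The symbols are assumptions not defined in this article: q is the fixed forward (noising) process, p_theta is the learned reverse process with mean mu_theta and covariance Sigma_theta, and x_0 through x_T are the image and its noisier latents.

```latex
\begin{align*}
p_\theta(x_{t-1} \mid x_t) &= \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right) \\[4pt]
\mathbb{E}\left[-\log p_\theta(x_0)\right] &\le \mathbb{E}_q\Big[
  D_{\mathrm{KL}}\!\left(q(x_T \mid x_0)\,\|\,p(x_T)\right)
  + \sum_{t=2}^{T} D_{\mathrm{KL}}\!\left(q(x_{t-1} \mid x_t, x_0)\,\|\,p_\theta(x_{t-1} \mid x_t)\right)
  - \log p_\theta(x_0 \mid x_1)
\Big]
\end{align*}
```

Minimizing the right-hand side is equivalent to maximizing a lower bound on the log-likelihood. Because every distribution inside the KL terms is Gaussian, each term has a closed form, which is what makes the objective tractable.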

These details are not crucial and are included here only to highlight how this lofty maximum likelihood objective becomes tractable once we impose restrictions on our model and how it is trained. For a full treatment of this subject, see our dedicated Introduction to Diffusion Models.

Lastly, we note the distinction between the denoising model and the Diffusion Model. The Diffusion Model is the mathematical framework of this Markov chain that provides asymptotic guarantees, whereas the denoising model is the actual neural network that is trained to traverse backwards along the Markov chain.


## How do text-to-image Generative AI models work?

Above, we saw how Diffusion Models work to generate images. You may have noticed that we can generate images, but not specific types of images. For example, a Diffusion Model can generate an image of a dog, but not a dog with specific features. What if we want the dog to have black fur, or be sticking its tongue out, or be a specific breed?

Text-to-image models allow us to do this. They use a textual description to control the Diffusion Process in order to generate images that correspond to the description. While there is no single approach to how text-to-image models operate, we will take a look at the overarching principles that power these models to see how they work at a conceptual level.

In particular, diffusion-based text-to-image models will generally include two essential components:

1. A text encoder that maps the text to a vector which captures the meaning of the text
2. A Diffusion Model that decodes this “meaning vector” into an image

Let’s take a high-level look at how these pieces fit together now.

### Extracting meaning from text

The first step in the text-to-image process is to extract meaning from the text by using a text encoder. A text encoder converts the text into a high-dimensional vector, embedding the information in such a way that captures meaning. Let’s take a look at a simple example to build intuition for how this process works.

Let’s assume we have three words - “woman”, “man”, and “boy”. Let’s assume that each word comes paired with a vector - [1, 0], [1, 1], and [0, 1], respectively.

As of now, we have been given these three vector-word pairs and nothing more. There are no pre-existing relationships that have been provided, and we have not (yet) defined any ourselves. But now let’s assume we are given one more word - “girl” - and no vector. What vector might correspond to this word?

Let’s try to formulate a mapping schema for our existing vector-word pairs. We see that the vectors for “woman” and “man” share the same first entry and differing second entry. Therefore, “woman” and “man” are alike on some axis, and differ along another. Let's consider first a way in which they are alike. Since both a woman and man are adults, let’s let a 1 in the first entry mean “adult”. If we examine the vector for “boy”, we find that the first-entry value is 0. Since a boy is not an adult, this meaning for the first entry is consistent.

Similarly, if we inspect “man” and “boy” in the same way, we again find that they are similar along one axis and differ along another. We have already established how they are dissimilar (i.e. whether they have the quality of "adultness"), so now let's consider how they are similar. Since both a man and boy are males, let’s let a 1 in the second entry mean “male”. Again, this interpretation is internally consistent when we additionally consider “woman”.

What does this newly-defined interpretation schema mean with respect to our new word, “girl”? Well, since girls are not adults, then according to this schema the first vector entry should be zero. Similarly, since girls are also not male, the second entry should be zero. In this way we can deduce that the new vector should be [0, 0].
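The deduction above can be written out as a short sketch. The `encode` helper is hypothetical and exists only to spell out the (adult, male) schema from this toy example. The final vector-arithmetic check is an extra illustration that is not part of the reasoning above, but it happens to give the same answer.

```python
# Toy "meaning vectors" from the example, read as (adult, male).
vectors = {"woman": (1, 0), "man": (1, 1), "boy": (0, 1)}


def encode(is_adult, is_male):
    """Hypothetical helper: apply the two-entry schema deduced in the text."""
    return (int(is_adult), int(is_male))


# The schema reproduces all three known word-vector pairs.
print(encode(is_adult=True, is_male=False) == vectors["woman"])   # True
print(encode(is_adult=True, is_male=True) == vectors["man"])      # True
print(encode(is_adult=False, is_male=True) == vectors["boy"])     # True

# A girl is neither an adult nor male, so the schema assigns her (0, 0).
girl = encode(is_adult=False, is_male=False)
print(girl)  # (0, 0)

# Extra illustration: "woman" - "man" + "boy", computed entry by entry.
analogy = tuple(
    w - m + b
    for w, m, b in zip(vectors["woman"], vectors["man"], vectors["boy"])
)
print(analogy)  # (0, 0)
```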

As we can see, our vectors capture information about the meaning of the words. It is not the words themselves that the vectors describe, but the objects that the words reference. That is, the vector [1, 0] does not describe the word “woman” itself, but instead the concept of women as a Platonic ideal.

The job of the text encoder is to learn such a “meaning” schema. The text encoder is the component in a text-to-image model that is used to extract meaning from the text so that we can use this semantic representation. Note that this statement isn’t quite true for some models like DALL-E 2 and is more accurate for a model like Imagen, but this understanding suffices for our purposes.

Additionally, for text-to-image models the “meaning vectors” generally have many more than two “entries” (or “components”). For example, Imagen’s “meaning vectors” have over one thousand components. Further, text-to-image models allow the entry values to be any real number, like 1.2 or -4 or 0.75, instead of just 0 or 1. These factors together allow for a much more expansive and finer-grained understanding of the meaning of language.

### Manifesting meaning visually

Remember, above we said that the word “woman” is different from the concept of a woman. The actual word “woman” is one representation of this concept. This concept affords other representations as well, like this one:

Therefore, we have these two objects - the word woman, and an image of a woman - that reference the same “meaning”.

Our text encoder just learned how to map from the textual representation of a woman to the concept of a woman in the form of a vector. Above we saw that there exist interpretation schemas in which a vector can be considered to capture information about the concept that a given word references. We now use that meaning to generate another representation of it. In particular, we have learned to map from words to meaning; now we must learn to map from meaning to images.

This is where Diffusion Models come into the picture. We use the Diffusion Model to generate a new image, and we condition this process using the meaning vector.

Conditioning can be considered the practice of providing additional information to a process to impose a condition on its outcome. For example, let’s say that we want to randomly generate a number that corresponds to the side of a die. If we sample uniformly, then we will have a 1 in 6 chance of generating any given integer from 1 to 6. Now let’s repeat this process, but this time condition it by requiring the generated number to be even. Now there is a 1 in 3 chance for each number 2, 4, and 6, while there is no chance for each odd number 1, 3, and 5. We have conditioned this generation process with additional information to affect the generated output.
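The die example can be checked with a few lines of Python. This is a minimal sketch: `die_probabilities` is a hypothetical helper written for this illustration, and exact fractions are used so the results match the numbers in the text.

```python
from fractions import Fraction


def die_probabilities(condition=lambda face: True):
    """Probability of each face of a fair die, given that `condition` holds."""
    faces = range(1, 7)
    allowed = [face for face in faces if condition(face)]
    return {
        face: Fraction(1, len(allowed)) if face in allowed else Fraction(0)
        for face in faces
    }


unconditioned = die_probabilities()
even_only = die_probabilities(lambda face: face % 2 == 0)

print(unconditioned[2])  # 1/6
print(even_only[2])      # 1/3
print(even_only[3])      # 0
```

The sampling procedure itself is unchanged. Only the extra information (the roll must be even) reshapes which outcomes are possible and how likely each one is.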

Similarly for our text-to-image model, we can condition the Diffusion Model as it creates images. By incorporating our “meaning vector” in the Diffusion process, we can condition the image that is being generated to properly capture the meaning held within the “meaning vector”.

Technical note

DALL-E 2 actually includes another component that maps between different vectors in the representation space. The details are out of the scope of this article, but feel free to check out How DALL-E 2 Actually Works for more details.

To see an example of how this conditioning actually happens from a technical point of view, interested readers can see here.

It is important to note that the Diffusion process is still stochastic. This means that even with the same conditioning, a different image will be generated each time we reverse-diffuse. That is, we can generate multiple different images given the same input text.
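The effect of this randomness can be illustrated with a toy sampler. To be clear about what this is: it is not a real Diffusion Model and involves no trained network. The function `toy_generate` and its parameters `pull` and `noise_scale` are made up for this sketch. It only mimics the overall shape of the process: start from random noise, then take many small noisy steps that are steered by a conditioning vector.

```python
import random


def toy_generate(condition, seed, num_steps=50, pull=0.1, noise_scale=0.05):
    """Toy stand-in for conditioned reverse diffusion (no neural network involved)."""
    rng = random.Random(seed)
    # Start from pure noise, one value per entry of the conditioning vector.
    x = [rng.gauss(0.0, 1.0) for _ in condition]
    for _ in range(num_steps):
        # Move a little toward the condition and add a little fresh randomness.
        x = [
            value + pull * (target - value) + noise_scale * rng.gauss(0.0, 1.0)
            for value, target in zip(x, condition)
        ]
    return x


meaning_vector = [0.0, 0.0]  # the toy vector for "girl" from the earlier example

sample_a = toy_generate(meaning_vector, seed=1)
sample_b = toy_generate(meaning_vector, seed=2)

print(sample_a)
print(sample_b)
print(sample_a == sample_b)  # False: same conditioning, different random draws
```

Both samples end up close to the conditioning vector, yet they are not identical, because each run draws different random numbers along the way. Reusing the same seed reproduces the same output exactly.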

This fact is important because there is no single image that properly represents all semantic information in a “meaning”. When we say “an image of a woman”, what do we mean? What color of hair does she have? Where is she situated? What emotion is she expressing, if any? If you ask a room of 10 people to imagine “an image of a woman”, each of them will depict it differently in their minds’ eyes.

Indeed, when we ask Stable Diffusion to generate “An image of a woman”, it outputs many images that can each be considered to reflect this prompt. This aspect of the model is valuable because each of these images is indeed “An image of a woman”.

If you’re interested in how these models are actually built, you can check out our MinImagen article. We go through how to build a minimal implementation of Imagen, and provide all code, documentation, and a thorough guide on each salient part.
