Stable Diffusion 1 vs 2 - What you need to know

Stable Diffusion 2 was released recently, sparking some debate about its performance relative to Stable Diffusion 1. Learn where the differences between the two models stem from and what they mean in practice in this simple guide.

The open-source community has been busy exploring Stable Diffusion 2 since its release just a couple of weeks ago. In some cases, users have reported apparently lower performance compared to Stable Diffusion 1.

In this article, we'll summarize all of the important points in the Stable Diffusion 1 vs Stable Diffusion 2 debate. We will look at the underlying reason these differences exist in the first section, but if you want to get straight to the practical differences, you can jump down to the Negative Prompts section. Let's get started!

Release Note

On the day after the publication of this article, Stable Diffusion 2.1 was released. 2.1 was intended to address many of the relative shortcomings of 2.0 compared to 1.5. The contents of this article remain relevant in understanding Stable Diffusion 1 vs 2, but readers should be sure to additionally read the appended Stable Diffusion 2.1 section for an understanding of the full picture.

OpenCLIP

The most important shift that Stable Diffusion 2 makes is replacing the text encoder. Stable Diffusion 1 uses OpenAI's CLIP, an open-source model that learns how well a caption describes an image. While the model itself is open-source, the dataset on which it was trained is, importantly, not publicly available.

Stable Diffusion 2 instead uses OpenCLIP, an open-source version of CLIP, which was trained using a known dataset - an aesthetic subset of LAION-5B that filters out NSFW images. Stability AI says that OpenCLIP "greatly improves the quality" of generated images and is, in fact, superior to an unreleased version of CLIP on the metrics measured.

Why this matters

Questions about the relative performance of these models aside, the shift from CLIP to OpenCLIP is the source of many of the discrepancies between Stable Diffusion 1 and Stable Diffusion 2.

In particular, many users of Stable Diffusion 2 have claimed that it cannot represent celebrities or artistic styles as well as Stable Diffusion 1, despite the fact that the training data for Stable Diffusion 2 was not deliberately filtered to remove artists. This discrepancy stems from the fact that CLIP's training data contained more images of celebrities and artworks than the LAION dataset does. Since CLIP's dataset is not available to the public, it is not possible to recover this same functionality using the LAION dataset alone. In other words, many of the canonical prompting methods for Stable Diffusion 1 are all but obsolete for Stable Diffusion 2.

What this means

This change to a fully open-source, open-data model marks an important shift in the Stable Diffusion story. It will be on the shoulders of the open-source community to finetune Stable Diffusion 2 and build out the capabilities that people want to see, but this was in fact the intent of Stable Diffusion ab initio - a community-driven, fully open project. While some users may be disappointed in the relative performance of Stable Diffusion 2 at this point, the Stability AI team has spent over 1 million A100 hours creating a solid foundation to build upon.

Further, while not mentioned explicitly by the creators, this shift away from CLIP may afford project contributors some protection against potential liability issues, which is important given the wave of intellectual property litigation that is surely down the road for models like these.

With this contextual backdrop in mind, it's now time to discuss the practical differences between Stable Diffusion 1 and 2.

Negative Prompts

We begin with an inspection of negative prompting, which appears to be much more important to strong performance in Stable Diffusion (SD) 2 than in SD 1, as we can see below:

(modified from source)

Let's take a look at negative prompting in more detail now.

Simple Prompt

To begin we feed the prompt "infinity pool" to both Stable Diffusion 1.5 and Stable Diffusion 2 with no negative prompt. Three images from each model are shown, where each column corresponds to a different random seed.

Settings

  • prompt: "infinity pool"
  • size: 512x512
  • guidance scale: 12
  • steps: 50
  • sampler: DDIM
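
The article doesn't tie these experiments to any particular implementation, but the settings above map directly onto Hugging Face's diffusers library. Below is a minimal sketch of how such a comparison could be reproduced; the model IDs, seeds, and library choice are illustrative assumptions rather than details from the original setup.

```python
import torch
from diffusers import StableDiffusionPipeline, DDIMScheduler

# Assumed Hugging Face model IDs; the 512x512 "base" checkpoint stands in for SD 2
model_ids = ["runwayml/stable-diffusion-v1-5", "stabilityai/stable-diffusion-2-base"]

for model_id in model_ids:
    pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")
    pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)  # DDIM sampler

    # One image per random seed (seeds here are illustrative)
    for seed in (119, 120, 121):
        generator = torch.Generator("cuda").manual_seed(seed)
        image = pipe(
            "infinity pool",
            height=512,
            width=512,
            guidance_scale=12,
            num_inference_steps=50,
            generator=generator,
        ).images[0]
        image.save(f"{model_id.split('/')[-1]}_seed{seed}.png")
```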

As we can see, Stable Diffusion 1.5 seems to perform better than Stable Diffusion 2 overall. In SD 2, the leftmost image has a patch that fits poorly within the image, and the rightmost image is almost incoherent.

Now we generate images in the same fashion from the same starting noise, this time using negative prompting. We add the negative prompt "ugly, tiling, poorly drawn hands, poorly drawn feet, poorly drawn face, out of frame, mutation, mutated, extra limbs, extra legs, extra arms, disfigured, deformed, cross-eye, body out of frame, blurry, bad art, bad anatomy, blurred, text, watermark, grainy", which is a negative prompt used by Emad Mostaque.

With negative prompting added, SD 1.5 generally performs better, although the middle image potentially has poorer caption alignment. For SD 2, the improvements are more drastic, although overall performance is still not as good as SD 1.5's.

Settings

  • prompt: "infinity pool"
  • size: 512x512
  • guidance scale: 12
  • steps: 50
  • sampler: DDIM
  • negative prompt: "ugly, tiling, poorly drawn hands, poorly drawn feet, poorly drawn face, out of frame, mutation, mutated, extra limbs, extra legs, extra arms, disfigured, deformed, cross-eye, body out of frame, blurry, bad art, bad anatomy, blurred, text, watermark, grainy"
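
For reference, the negative prompt plugs into the same call via the negative_prompt argument - a minimal sketch, reusing the pipeline from the previous snippet and the same seed so that the "with" and "without" images start from the same initial noise:

```python
negative_prompt = (
    "ugly, tiling, poorly drawn hands, poorly drawn feet, poorly drawn face, "
    "out of frame, mutation, mutated, extra limbs, extra legs, extra arms, "
    "disfigured, deformed, cross-eye, body out of frame, blurry, bad art, "
    "bad anatomy, blurred, text, watermark, grainy"
)

# Same seed as before, so only the negative prompt differs between runs
generator = torch.Generator("cuda").manual_seed(119)
image = pipe(
    "infinity pool",
    negative_prompt=negative_prompt,
    height=512,
    width=512,
    guidance_scale=12,
    num_inference_steps=50,
    generator=generator,
).images[0]
```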

We now directly compare SD 2's performance with and without the negative prompt. Inspection supports the proposition that negative prompting is essential to SD 2.

Below we can see a comparison of the final images generated by SD 1.5 and 2, both with and without a negative prompt, starting from the same random seed.

Complicated Prompt

We run the same experiment as above, this time with a more complicated (positive) prompt. This time, rather than "infinity pool", we use "infinity pool with a tropical forest in the background, high resolution, detail, 8 k, dslr, good lighting, ray tracing, realistic". While we could have omitted the "with a tropical forest in the background" part in order to isolate purely aesthetic additions, we include it to better probe the semantic fit for a more complicated prompt.

Again, we show the results with no negative prompting. The images no longer look photorealistic, and the caption alignment is arguably better. The water texture is also much better with SD 1.5.

Settings

  • prompt: "infinity pool with a tropical forest in the background, high resolution, detail, 8 k, dslr, good lighting, ray tracing, realistic"
  • size: 512x512
  • guidance scale: 12
  • steps: 50
  • sampler: DDIM

Once we add the same negative prompt as in the previous example, we see some interesting results. In particular, it appears that the negative prompt may actually adversely affect SD 1 while consistently helping SD 2. Each image from SD 2 is better with negative prompting, while caption alignment for SD 1 seems to go down across the board. Interestingly, adding the negative prompt seems to push the generated images towards photorealism.

Settings

  • prompt: "infinity pool with a tropical forest in the background, high resolution, detail, 8 k, dslr, good lighting, ray tracing, realistic"
  • size: 512x512
  • guidance scale: 12
  • steps: 50
  • sampler: DDIM
  • negative prompt: "ugly, tiling, poorly drawn hands, poorly drawn feet, poorly drawn face, out of frame, mutation, mutated, extra limbs, extra legs, extra arms, disfigured, deformed, cross-eye, body out of frame, blurry, bad art, bad anatomy, blurred, text, watermark, grainy"

We again directly compare the images generated from various random seeds with and without a negative prompt for SD 2.

Finally, we again show a matrix comparing SD 1.5 and SD 2, each with and without the negative prompt:

Textual Inversion

Beyond ordinary negative prompts, Stable Diffusion also supports textual inversion. Textual inversion is a method in which a small handful of reference images can be used to generate a new "word" that represents the images. Once learned, the "word" can then be used as usual in prompts, allowing us to generate images that map faithfully to the reference images. In the example below, 4 images of a small figure are inverted to "S_*". This "word" is then used as usual in various prompts, combining the reference images faithfully with other semantic concepts:

(image modified from source)

In the below example, we have several images created using Stable Diffusion 2.0 from the base prompt "a delicious hamburger". This prompt is then augmented with a positive addition (either a normal prompt fragment or a textually-inverted token) and/or a negative addition (again, either a normal negative prompt or a textually-inverted token). For example, the rightmost image in the second row augments the base prompt with a textually-inverted token that references Midjourney, along with a normal negative prompt "ugly, boring, bad anatomy".

(source)

As we can see, the use of textual inversion significantly improves the performance of Stable Diffusion 2.0. The above image was pulled from this blog post by Max Woolf, which serves as a great treatment of the topic.
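
For readers who want to try this themselves: recent versions of diffusers can load a pre-trained textual inversion embedding directly into a pipeline, after which its token can be used in the positive or negative prompt like any other word. The embedding path and token below are placeholders, not the embeddings used in the post above - a rough sketch:

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-base", torch_dtype=torch.float16
).to("cuda")

# Hypothetical embedding learned from a handful of reference images
pipe.load_textual_inversion("path/to/learned_embeds.bin", token="<my-style>")

# The learned token is used like any other word, here as a positive style cue
image = pipe(
    "a delicious hamburger in the style of <my-style>",
    negative_prompt="ugly, boring, bad anatomy",
).images[0]
```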

Celebrities

Given that LAION contains fewer celebrity images than CLIP's training data, it may be unsurprising that many SD 2 users have observed a poorer ability to generate images of celebrities compared to SD 1.5.

Below we show the images generated from 3 random seeds (columns) with and without negative prompting for both SD 1.5 and SD 2. The prompt is "keanu reeves". A full-resolution version of this image is also available.

(full resolution)

Settings

  • prompt: "keanu reeves"
  • size: 512x512
  • guidance scale: 7
  • steps: 50
  • seed: 119
  • sampler: DDIM
  • negative prompt: "ugly, tiling, poorly drawn hands, poorly drawn feet, poorly drawn face, out of frame, mutation, mutated, extra limbs, extra legs, extra arms, disfigured, deformed, cross-eye, body out of frame, blurry, bad art, bad anatomy, blurred, text, watermark, grainy"

Overall, the performance of SD 2 is comparable to SD 1.5 with respect to this particular prompt. That having been said, Stable Diffusion 2's ability to depict celebrities appears to break down when they are combined with semantic concepts. We present comparisons for two such prompts below, where again each column within an image corresponds to a given random seed. This time, we use negative prompting in every case.

Settings

  • prompt: "a white marble bust of Robert Downey Jr. in a museum, cinematic lighting, hyperdetailed, 8 k realistic, global illumination, radiant light, frostbite 3 engine, cryengine, trending on artstation, digital art, fantasy background"
  • size: 512x512
  • guidance scale: 12
  • steps: 50
  • seed: 120-122
  • sampler: DPM-Solver++
  • negative prompt: "ugly, tiling, out of frame, deformed, blurry, bad art, blurred, watermark, grainy"

Settings

  • prompt: "a studio photograph of Robert Downey Jr., cinematic lighting, hyperdetailed, 8 k realistic, global illumination, radiant light, frostbite 3 engine, cryengine, trending on artstation, digital art"
  • size: 512x512
  • guidance scale: 7
  • steps: 50
  • seed: 119-121
  • sampler: DPM-Solver++
  • negative prompt: "ugly, tiling, poorly drawn hands, poorly drawn feet, poorly drawn face, out of frame, mutation, mutated, extra limbs, extra legs, extra arms, disfigured, deformed, cross-eye, body out of frame, blurry, bad art, bad anatomy, blurred, text, watermark, grainy"

As we can see, Stable Diffusion 1.5 tends to outperform Stable Diffusion 2 on this front (SD 2 even appears to depict Steve Carell instead of Robert Downey Jr. at one point). While this disparity is expected, its magnitude is perhaps greater than expected given the results from the Keanu Reeves example.
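
Note that the settings above swap the DDIM sampler used earlier for DPM-Solver++. In diffusers terms, that corresponds to swapping the pipeline's scheduler - a sketch for the marble bust prompt, again treating the model ID and library choice as assumptions:

```python
import torch
from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-base", torch_dtype=torch.float16
).to("cuda")

# DPMSolverMultistepScheduler is diffusers' implementation of DPM-Solver++
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)

image = pipe(
    "a white marble bust of Robert Downey Jr. in a museum, cinematic lighting, "
    "hyperdetailed, 8 k realistic, global illumination, radiant light, "
    "frostbite 3 engine, cryengine, trending on artstation, digital art, fantasy background",
    negative_prompt="ugly, tiling, out of frame, deformed, blurry, bad art, blurred, watermark, grainy",
    guidance_scale=12,
    num_inference_steps=50,
    generator=torch.Generator("cuda").manual_seed(120),
).images[0]
```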

Artistic Images

As mentioned in the OpenCLIP section, in addition to containing fewer celebrity images than the CLIP training data, the LAION dataset also contains fewer artistic images. This means that it is more difficult to generate stylized images, and that the canonical method of "_____ in the style of _____" no longer functions as it did in Stable Diffusion 1. Below we compare the images from 4 random seeds for Stable Diffusion 1.5 and Stable Diffusion 2, where we have attempted to generate an image in the style of Greg Rutkowski.

(full resolution)

Settings

  • prompt: "A monster fighting a hero by greg rutkowski, romanticism, cinematic lighting, hyperdetailed, 8 k realistic, global illumination, radiant light, trending on artstation, digital art"
  • size: 512x512
  • guidance scale: 9
  • steps: 50
  • seed: 119-122
  • sampler: DPM-Solver++
  • negative prompt: "ugly, tiling, poorly drawn hands, poorly drawn feet, poorly drawn face, out of frame, mutation, mutated, extra limbs, extra legs, extra arms, disfigured, deformed, cross-eye, body out of frame, blurry, bad art, bad anatomy, blurred, text, watermark, grainy"

The results are drastic - Stable Diffusion 1.5 is once again the clear winner over Stable Diffusion 2 (out of the box). While augmenting the prompt with other descriptors that do not explicitly reference artists can still produce stylized images with SD 2, the performance is still not as good as SD 1.5's, as seen below:

(full resolution)

Settings

  • "Original" prompt: "a cute rabbit"
  • "Augmented" prompt: "a cute rabbit, cinematic lighting, hyperdetailed, 8 k realistic, global illumination, radiant light, frostbite 3 engine, cryengine, trending on artstation, digital art, fantasy background"
  • size: 512x512
  • guidance scale: 9
  • steps: 50
  • seed: 119-122
  • sampler: DPM-Solver++
  • negative prompt: "ugly, tiling, out of frame, deformed, blurry, bad art, blurred, watermark, grainy"

On the other hand, some users have found that SD 2 is highly capable of generating photorealistic images:

(modified from source)

Text Coherence

One place where Stable Diffusion 2 may have an advantage out-of-the-box compared to Stable Diffusion 1 is text coherence. Most text-to-image models are comically poor at representing text. This is wholly unsurprising - while we as humans have an easy time parsing text, we must remember that words are part of extremely complicated linguistic systems, arranged according to special rules to convey meaning. Further, these words themselves are composed of letters in apparently near-random ways; and, further still, the actual visual manifestation of these letters can vary drastically (e.g. compare the Jokerman and Consolas fonts). These considerations (and others) provide some explanation for the inability of these models to properly convey text, especially beyond simple words.

That having been said, it appears that Stable Diffusion 2 may be slightly better at conveying text than Stable Diffusion 1. Below we provide several images for comparison:

(full resolution)

As we can see, the results are not stellar in either case, and negative prompting appears to have very little effect on this front. While it is difficult to come up with an objective measure for how well these models generate text, it could be argued that the average person would consider Stable Diffusion 2 to be slightly better.

Other Models

In addition to the shift from CLIP to OpenCLIP, Stable Diffusion 2 released with some other great features that we summarize below.

Depth Model

A depth model was released alongside SD 2. This model takes in a 2D image and returns a predicted depth map for that image. This information can then be used to condition image generation in addition to text, allowing users to generate new images that remain faithful to the geometry of a reference image.

(source)

Below we can see a variety of such images that all preserve the same basic geometric structure.

(source)
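
The depth-conditioned model is exposed in diffusers as its own pipeline, which infers the depth map from the input image automatically. A minimal sketch, where the input file name is a placeholder:

```python
import torch
from diffusers import StableDiffusionDepth2ImgPipeline
from diffusers.utils import load_image

pipe = StableDiffusionDepth2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-depth", torch_dtype=torch.float16
).to("cuda")

init_image = load_image("reference_room.png")  # placeholder reference image

# A depth map is predicted from init_image and used, together with the prompt,
# to condition generation so that the scene geometry is preserved
image = pipe(
    prompt="a cozy wooden cabin interior, warm lighting",
    image=init_image,
    negative_prompt="blurry, deformed",
    strength=0.7,
).images[0]
```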

Upscaling Model

Stable Diffusion 2 also released with an upscaling model, which can upscale images to 4x their original side length. That means the upscaled images have 16 times the area of the original!

Below we can see the result of upscaling one of our previously generated images:

If we zoom in on the eye of the rabbit in each image, the difference is immediately evident and very impressive.
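
The upscaler is also text-conditioned, so it takes a prompt describing the image alongside the low-resolution input. A minimal sketch with placeholder file names; note that large inputs are memory-hungry, since the diffusion runs at the input resolution:

```python
import torch
from PIL import Image
from diffusers import StableDiffusionUpscalePipeline

pipe = StableDiffusionUpscalePipeline.from_pretrained(
    "stabilityai/stable-diffusion-x4-upscaler", torch_dtype=torch.float16
).to("cuda")
pipe.enable_attention_slicing()  # helps with memory on larger inputs

low_res = Image.open("rabbit_512.png").convert("RGB")  # placeholder low-res image

# Each side is scaled 4x, i.e. 16x the pixel count
upscaled = pipe(prompt="a cute rabbit", image=low_res).images[0]
upscaled.save("rabbit_2048.png")
```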

Inpainting Model

Stable Diffusion 2 also comes with an updated inpainting model, which lets you modify subsections of an image in such a way that the patch fits in aesthetically:

(source)
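
In diffusers, the inpainting checkpoint likewise has its own pipeline, which takes the original image plus a black-and-white mask marking the region to regenerate. A rough sketch with placeholder file names and prompt:

```python
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting", torch_dtype=torch.float16
).to("cuda")

init_image = Image.open("photo.png").convert("RGB").resize((512, 512))
mask_image = Image.open("mask.png").convert("RGB").resize((512, 512))  # white = repaint

image = pipe(
    prompt="a golden retriever sitting on a park bench",
    image=init_image,
    mask_image=mask_image,
).images[0]
```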

768 x 768 Model

Finally, Stable Diffusion 2 now offers support for 768 x 768 images - over twice the area of the 512 x 512 images of Stable Diffusion 1.
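
The 768 x 768 model lives under a separate checkpoint from the 512 x 512 "base" checkpoint assumed in the snippets above. A minimal loading sketch; the Euler scheduler pairing follows common usage for this checkpoint and is an assumption rather than something tested in this article:

```python
import torch
from diffusers import StableDiffusionPipeline, EulerDiscreteScheduler

# 768x768-native checkpoint (the 512x512 sketches above assume "stable-diffusion-2-base")
pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2", torch_dtype=torch.float16
)
pipe.scheduler = EulerDiscreteScheduler.from_config(pipe.scheduler.config)
pipe = pipe.to("cuda")

image = pipe("infinity pool", height=768, width=768).images[0]
```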

Stable Diffusion 2.1

Stable Diffusion 2.1 was released shortly after the release of Stable Diffusion 2.0. SD 2.1 is intended to address many of the shortcomings of 2.0 relative to 1.5. Let's take a look at how 2.1 manages to do this.

NSFW Filter

The biggest change that 2.1 makes relative to 2.0 is a modified NSFW filter. Recall that 2.0 was trained on a subset of the LAION dataset that was filtered for inappropriate content using an NSFW filter; this aggressive filtering, in turn, lowered the model's ability to depict humans.

Stable Diffusion 2.1 is also trained with such a filter, although the filter itself is modified to be less restrictive. In particular, the filter throws fewer false positives, which drastically increases the number of images that make it through the filter to train the model. This increase in training data results in an improved ability to depict people. We again show several images of Robert Downey Jr. created using identical settings except for the model version used to generate them, this time including Stable Diffusion 2.1.

An image of Robert Downey Jr. created with each model using identical settings

Settings

  • prompt: "a studio photograph of Robert Downey Jr., cinematic lighting, hyperdetailed, 8 k realistic, global illumination, radiant light, frostbite 3 engine, cryengine, trending on artstation, digital art"
  • size: 512x512
  • guidance scale: 7
  • steps: 50
  • seed: 119
  • sampler: DPM-Solver++
  • negative prompt: "ugly, tiling, poorly drawn hands, poorly drawn feet, poorly drawn face, out of frame, mutation, mutated, extra limbs, extra legs, extra arms, disfigured, deformed, cross-eye, body out of frame, blurry, bad art, bad anatomy, blurred, text, watermark, grainy"

As we can see, Stable Diffusion 2.1 is a marked improvement over Stable Diffusion 2, being able to actually depict Robert Downey Jr. Additionally, the skin texture for SD 2.1 is even better than SD 1.5's.
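
If you want to reproduce this comparison, the only thing that changes between rows is the checkpoint the pipeline is loaded from; everything else is held fixed. The model IDs below are the ones commonly used on the Hugging Face Hub and are assumptions about the setup rather than details from the original experiments:

```python
import torch
from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler

# Assumed Hugging Face model IDs for the versions compared above
checkpoints = {
    "sd-1-5": "runwayml/stable-diffusion-v1-5",
    "sd-2-0": "stabilityai/stable-diffusion-2-base",
    "sd-2-1": "stabilityai/stable-diffusion-2-1-base",
}

prompt = (
    "a studio photograph of Robert Downey Jr., cinematic lighting, hyperdetailed, "
    "8 k realistic, global illumination, radiant light, frostbite 3 engine, "
    "cryengine, trending on artstation, digital art"
)

for name, model_id in checkpoints.items():
    pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")
    pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)  # DPM-Solver++
    # (the long negative prompt from the settings above would also be passed via negative_prompt=...)
    image = pipe(
        prompt,
        guidance_scale=7,
        num_inference_steps=50,
        generator=torch.Generator("cuda").manual_seed(119),
    ).images[0]
    image.save(f"rdj_{name}.png")
```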

Artistic Styles

Unfortunately, SD 2.1's ability to depict specific artists' styles still apparently falls short of SD 1.5. Below we again see images created with identical settings except for the model used to create them. The images are intended to capture the style of Greg Rutkowski.

Settings

  • prompt: "A monster fighting a hero by greg rutkowski, romanticism, cinematic lighting, hyperdetailed, 8 k realistic, global illumination, radiant light, trending on artstation, digital art"
  • size: 512x512
  • guidance scale: 9
  • steps: 50
  • seed: 158
  • sampler: DPM-Solver++
  • negative prompt: "ugly, tiling, poorly drawn hands, poorly drawn feet, poorly drawn face, out of frame, mutation, mutated, extra limbs, extra legs, extra arms, disfigured, deformed, cross-eye, body out of frame, blurry, bad art, bad anatomy, blurred, text, watermark, grainy"

As we can see, Stable Diffusion 1.5 still reigns supreme on this front.

General Images

We repeat the experiment from the section above regarding plain vs "augmented" prompts, again changing only the model version.

Settings

  • "Original" prompt: "a cute rabbit"
  • "Augmented" prompt: "a cute rabbit, cinematic lighting, hyperdetailed, 8 k realistic, global illumination, radiant light, frostbite 3 engine, cryengine, trending on artstation, digital art, fantasy background"
  • size: 512x512
  • guidance scale: 9
  • steps: 50
  • seed: 119
  • sampler: DPM-Solver++
  • negative prompt: "ugly, tiling, out of frame, deformed, blurry, bad art, blurred, watermark, grainy"

As we can see, 2.1's "original" textures are an improvement over 2.0's. 2.1's "augmented" image is more stylized than 2.0's but very similar overall.

Conclusion

While these experiments are certainly not rigorous or exhaustive, they provide some insight into the relative performance of Stable Diffusion 1 and Stable Diffusion 2.

If you have more questions about text-to-image models, check out some of the below resources for further learning:

  1. How do I build a text-to-image model?
  2. What is classifier-free guidance?
  3. What is prompt engineering?
  4. How does DALL-E 2 work?
  5. How does Imagen work?

Alternatively, consider following our YouTube channel, Twitter, or newsletter to stay up to date with our latest tutorials and deep dives!