* This blog post is a summary of this video.

SDXL 1.0 vs Stable Diffusion 1.5: Hands-on Comparison

Author: SiliconThaumaturgy
Time: 2024-03-23 13:40:00

Introducing SDXL 1.0 and Stable Diffusion 1.5

After another short delay, the long-awaited SDXL version 1.0 has been officially released. It makes the original Stable Diffusion 1.5 checkpoint look like a joke in terms of output quality. However, Stable Diffusion 1.5 came out over nine months ago, and the community has refined and polished this diamond in the rough into innumerable high-performing checkpoints like Dreamshaper and Epic Realism.

While I have full confidence that one day SDXL will far surpass even the top tier Stable Diffusion 1.5 checkpoints, let's see how things stack up right now at the release of SDXL 1.0.

SDXL 1.0 Key Details

SDXL has 2.6 billion parameters, more than three times as many as Stable Diffusion 1.5. That is still fewer than DALL-E 2, which has 3.5 billion parameters, and Imagen, which has 4.6 billion. The text encoder uses both OpenCLIP ViT-G and ViT-L; OpenCLIP ViT-G alone has 695 million parameters and over 80% accuracy on public datasets.

Stable Diffusion 1.5 Key Details

Stable Diffusion 1.5 has 860 million parameters. The text encoder was a frozen version of ViT-L/14 from OpenAI, with 123 million parameters and 75.5% accuracy.

Comparing Model Parameters and Architecture

Let's take a quick look at these models on paper. As a warning, this is the nerdiest part of this video—if you're only interested in more concrete details, skip ahead with the bookmarks.

The single defining stat for AI models is the number of parameters. On that front, Stable Diffusion 1.5 has 860 million parameters and Stable Diffusion 2 has only slightly more at 865 million. In comparison, SDXL has 2.6 billion parameters, which is more than three times as many.
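As a quick sanity check on those figures, the ratios work out as follows (the parameter counts are the ones quoted above, not exact weights-file counts):

```python
# UNet parameter counts quoted above (approximate)
sd15 = 860e6   # Stable Diffusion 1.5
sd2 = 865e6    # Stable Diffusion 2
sdxl = 2.6e9   # SDXL

print(f"SDXL vs SD 1.5: {sdxl / sd15:.1f}x")  # ~3.0x
print(f"SD 2 vs SD 1.5: {sd2 / sd15:.3f}x")   # ~1.006x
```

So "more than three times as many" checks out, while Stable Diffusion 2's bump over 1.5 was well under one percent.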

One thing people often forget is that diffusion models use several neural networks. The most important other one is the text encoder. For Stable Diffusion 1.5, the text encoder was a frozen ViT-L/14 model from OpenAI with 123 million parameters.

In contrast, Stable Diffusion 2 used OpenCLIP ViT-H/14, trained on LAION-2B, with 354 million parameters, almost triple the size of Stable Diffusion 1.5's encoder.

SDXL actually uses two different text encoders: OpenCLIP ViT-G and ViT-L. ViT-G has almost six times as many parameters as Stable Diffusion 1.5. Per the OpenCLIP GitHub, this model has over 80% accuracy on public datasets, the highest among those trained on public data.

ViT-L is the exact same encoder used for Stable Diffusion 1.5. Per the SDXL paper, the prompt is run through both CLIP text encoders and the resulting embeddings are combined. I'm not clear why this was done; maybe the smaller model helps kick-start generation, or maybe ViT-L has some secret sauce that OpenCLIP lacks.
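A minimal sketch of that dual-encoder combination, assuming (as the SDXL paper describes) the two encoders' per-token features are concatenated along the channel axis. The dimensions (768 for ViT-L, 1280 for OpenCLIP ViT-G) are the published feature widths, and the random arrays are illustrative stand-ins for real text-encoder outputs:

```python
import numpy as np

# Hypothetical per-token embeddings for a 77-token prompt:
# ViT-L produces 768-dim features, OpenCLIP ViT-G produces 1280-dim features.
tokens = 77
emb_vit_l = np.random.randn(tokens, 768)
emb_vit_g = np.random.randn(tokens, 1280)

# Concatenate along the feature axis, giving one 2048-dim
# embedding per token for the UNet's cross-attention to consume.
combined = np.concatenate([emb_vit_l, emb_vit_g], axis=-1)
print(combined.shape)  # (77, 2048)
```

The UNet then only ever sees the fused 2048-dim representation, which is one plausible reason the smaller ViT-L was kept: its features come along for free at little extra cost.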

Examining Practical Differences

File Size Requirements

Base Stable Diffusion 1.5 is 4 GB, with pruned custom checkpoints as small as 2 GB. Between the SDXL 1.0 base and refiner, you'll need 12.1 GB. This huge size is the price for tripling parameters. Not a win for SDXL, but hard drive space is cheap.

Maximum Image Generation Size

With my 3090 and 24 GB VRAM, I generated 5.3 megapixel images with Stable Diffusion 1.5. With SDXL, I reached 7.9 megapixels, almost 50% larger. Using tiled VAE, I completed a 10.2 megapixel image. That's a 92% increase in maximum size for SDXL—a huge win since VRAM bottlenecks consumer hardware.
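For readers curious how tiled VAE stretches VRAM that far: the idea is to decode the latent in small tiles so peak decoder memory scales with the tile size, not the full image. A toy sketch, with an 8x nearest-neighbor upscale standing in for the real VAE decoder (production implementations also overlap and blend tiles to hide seams):

```python
import numpy as np

def decode_tile(tile):
    # Stand-in for the VAE decoder: 8x upscale per side,
    # matching Stable Diffusion's latent-to-pixel ratio.
    return tile.repeat(8, axis=0).repeat(8, axis=1)

def tiled_decode(latent, tile=64):
    """Decode a 2-D latent tile by tile, so only one tile's worth
    of decoder activations is ever live at once."""
    h, w = latent.shape
    out = np.zeros((h * 8, w * 8), dtype=latent.dtype)
    for y in range(0, h, tile):
        for x in range(0, w, tile):
            patch = latent[y:y + tile, x:x + tile]
            out[y * 8:(y + patch.shape[0]) * 8,
                x * 8:(x + patch.shape[1]) * 8] = decode_tile(patch)
    return out

# A 400x400 latent decodes to 3200x3200 pixels, i.e. ~10.2 megapixels.
latent = np.random.randn(400, 400)
image = tiled_decode(latent)
print(image.shape)  # (3200, 3200)
```

With this stand-in decoder the tiled result is identical to decoding the whole latent at once; the real win is that the expensive per-tile step never holds the full image's activations in VRAM.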

Image Generation Speed

For Automatic1111, performance is very competitive: SDXL is 35% slower at low resolutions, roughly equal around 1200p, then 30% faster above 1500p. For ComfyUI, the gap widens at both ends: compared to Stable Diffusion 1.5, SDXL is 50% faster at 1024p but 50% slower at 512p. I'd call this a draw, since each model wins in different cases.

Evaluating Output Quality

Hand Quality

Making anatomically correct hands has been a huge struggle. My testing shows SDXL scores well below the average custom Stable Diffusion 1.5 checkpoint for hand quality. Results are much better than base SD 1.5, but we seem to have hit a capability cap there. For now, you can still get much better hands from Stable Diffusion 1.5 versions.

Cohesion at High Resolutions

SDXL performs comparably or better than recent SD 1.5 checkpoints for image cohesion, so we aren't losing ground. However, I noticed SDXL tends to stretch images instead of twinning them at higher resolutions. You'll need to develop an eye for this distortion. SDXL has a higher maximum resolution before artifacts and performs well at lower resolutions too—another win over current SD 1.5.

Style Flexibility

Like base Stable Diffusion 1.5, SDXL lacks built-in style flexibility. Style is defined more by the specific checkpoint. We'll likely see user-created SDXL checkpoints soon that provide stylistic range, like we did for SD 1.5. The community is much larger and more experienced now, so I expect SDXL checkpoints to surpass SD 1.5 in under 3 months.

The Outlook for SDXL Evolution

While SDXL currently lacks the range of optimized SD 1.5 checkpoints, the architecture is vastly superior. It enables efficient use of VRAM, larger images, and faster generation speed. For the first time since ControlNet, I'm fully on board this hype train. I hope this video gave compelling reasons to take SDXL seriously. The future is bright!

FAQ

Q: How do the parameters compare between SDXL 1.0 and Stable Diffusion 1.5?
A: SDXL 1.0 has over 3 times as many parameters - 2.6 billion vs. 860 million for Stable Diffusion 1.5.

Q: What is the maximum image size each model can generate?
A: With 24 GB of VRAM, SDXL can generate images roughly 50% larger than Stable Diffusion 1.5, or up to 92% larger with tiled VAE, before running into VRAM limitations.

Q: Which model generates images faster?
A: It depends on the image size, but SDXL tends to be 30%+ faster at high resolutions in Automatic1111.

Q: How does hand quality compare between the models?
A: Stable Diffusion 1.5 currently has better quality hands when using specialized checkpoints.

Q: Will SDXL overtake Stable Diffusion 1.5 in output quality?
A: Given rapid advancement of checkpoints, SDXL will likely surpass Stable Diffusion 1.5 within 3 months.