* This blog post is a summary of this video.

Releasing New High-Quality SDXL Model for Free: Tips, Tricks and Best Practices

Author: Think Diffusion
Time: 2024-03-23 09:45:00

Overview of the Powerful New SDXL Model

Two weeks ago we released a new SDXL model trained on 10,000 hand-tagged 4K images over 1,800,000 training steps. We worked very hard on this model for the past few months with some of the top model creators, and both the training time and the dataset surpass those of most other custom SDXL models. Best of all, it's 100% free to use!

The total training time was 750 hours, plus 300-400 hours spent manually captioning the images. Training was split across three separate datasets: 7,000 safe-for-work images, 2,000 background images, and 1,000 not-safe-for-work (NSFW) images. The results were then merged into the final model, so yes, this model supports NSFW imagery.
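As a quick sanity check, the split and the time budget quoted above can be tallied in a few lines of Python. This is only arithmetic on the numbers in this post; the variable names are illustrative and not part of any training script.

```python
# Image counts per subset, as quoted in the post.
split = {
    "safe_for_work": 7_000,
    "backgrounds": 2_000,
    "nsfw": 1_000,
}

total_images = sum(split.values())

# Share of each subset in the merged dataset, as a percentage.
shares = {name: 100 * count / total_images for name, count in split.items()}

# Time budget: 750 hours of training plus 300-400 hours of manual captioning.
training_hours = 750
captioning_hours = (300, 400)
total_hours = tuple(training_hours + h for h in captioning_hours)

print(total_images)  # 10000
print(shares)        # {'safe_for_work': 70.0, 'backgrounds': 20.0, 'nsfw': 10.0}
print(total_hours)   # (1050, 1150)
```

In other words, the three subsets add up to the 10,000 images mentioned earlier, and the overall effort comes to roughly 1,050 to 1,150 hours.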

In my experience, SDXL has a really good cinematic feel if you include 'cinematic' in the prompt. So if you prefer that look, I'm sure you'll find landscapes, objects, and things like food very interesting to generate.

Training Data and Compute Statistics

The dataset totals more than 42 GB of images. The smallest image is 1365x248 pixels, but most are higher resolution, with the largest being 4622x6753 pixels. The total training compute was over 750 GPU hours on state-of-the-art hardware with optimal hyperparameters.
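To put these figures in perspective, here is a rough back-of-the-envelope calculation. It assumes decimal gigabytes and treats 42 GB as a lower bound, since the post only says "more than". The batch size is not stated, so the steps-per-image figure is just a ratio and should not be read as a number of epochs.

```python
dataset_bytes = 42 * 10**9   # lower bound, assuming decimal GB
image_count = 10_000
training_steps = 1_800_000

# Average file size and the ratio of training steps to images.
avg_mb_per_image = dataset_bytes / image_count / 10**6
steps_per_image = training_steps / image_count


def megapixels(width, height):
    """Return the pixel count of an image in megapixels."""
    return width * height / 1_000_000


smallest = megapixels(1365, 248)
largest = megapixels(4622, 6753)

print(f"{avg_mb_per_image:.1f} MB per image on average")  # 4.2 MB per image on average
print(f"{steps_per_image:.0f} training steps per image")  # 180 training steps per image
print(f"{smallest:.2f} MP smallest, {largest:.1f} MP largest")  # 0.34 MP smallest, 31.2 MP largest
```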

Image Quality and Style

Here are some example images from our community on Discord. You can really feel the cinematic style coming through with a high level of detail and realism. SDXL does a great job capturing this look as you'll see in the following comparison with other top models.

Recommended Settings and Prompting Methods

Based on extensive testing, here are the recommended settings and best practices for prompting this model:

Key Model Parameters

Set the guidance (CFG) scale between 5 and 10; 7 is a good starting point. Use the DPM++ 2M sampler. You don't really need a secondary refiner model. The VAE strength can be normal or high, based on your preference.
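One way to keep these settings handy is a small settings object that rejects values outside the recommended range. The sketch below is illustrative only: the class and field names are hypothetical and do not belong to any particular UI or library, and reading "CFG scale 7" as the default guidance value is an assumption. If you use the Hugging Face diffusers library, the CFG scale corresponds to the `guidance_scale` argument and DPM++ 2M corresponds to `DPMSolverMultistepScheduler`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GenerationSettings:
    """Hypothetical container for the settings recommended in this post."""

    sampler: str = "DPM++ 2M"
    cfg_scale: float = 7.0      # assumed default, inside the 5-10 range
    use_refiner: bool = False   # the post says a refiner is not needed

    def __post_init__(self):
        # Reject guidance values outside the recommended 5-10 window.
        if not 5 <= self.cfg_scale <= 10:
            raise ValueError(
                f"cfg_scale {self.cfg_scale} is outside the recommended 5-10 range"
            )


defaults = GenerationSettings()
print(defaults)
```

Raising an error is a deliberately strict choice for a sketch; a warning would work just as well if you want to experiment outside the recommended range.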

Prompt Structure for Best Results

The hand captioning for the training data follows a 3-part structure:

  1. Style/category keywords like 'landscape photography', 'closeup portrait' etc.
  2. Detailed description of the full image and subjects.
  3. Extra optional details you want the model to learn that fit the description like 'sunny day'.
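Since the captions follow this three-part structure, it may help to build prompts the same way. The helper below is a minimal sketch: the function name, the comma separator, and the sample description are assumptions for illustration, as the post does not specify how the parts were joined.

```python
def build_prompt(style, description, extras=()):
    """Join the three caption parts into a single prompt string.

    style:       style/category keywords, e.g. 'landscape photography'
    description: detailed description of the image and its subjects
    extras:      optional extra details, e.g. 'sunny day'
    """
    parts = [style.strip(), description.strip()]
    parts.extend(extra.strip() for extra in extras)
    # Skip empty parts so optional fields do not leave stray commas.
    return ", ".join(part for part in parts if part)


prompt = build_prompt(
    "landscape photography",
    "a mountain lake surrounded by pine trees",
    extras=["sunny day", "cinematic"],
)
print(prompt)
# landscape photography, a mountain lake surrounded by pine trees, sunny day, cinematic
```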

Comparisons to Other Top SDXL Models

Here are some quick comparisons to demonstrate the advantages of this new model over the current top performers:

As you can see, the images generated have a more realistic, less saturated look compared to other models while still retaining impressive levels of detail and accuracy.

FAQ


Q: How much training data was used for this model?
A: The training dataset contained 10,000 4K images, each hand-tagged, and the model was trained for 1.8 million steps.

Q: What is the footprint size of the full dataset?
A: The total footprint of the dataset adds up to more than 42 GB of images.

Q: What is the resolution range of the training images?
A: The smallest image is 1365x248 pixels, but most are higher resolution, up to 4622x6753 pixels.

Q: What is the best sampler to use?
A: The recommended sampler is DPM++ 2M CM.

Q: Do I need to use a refiner?
A: No, a refiner is not necessary.

Q: What CFG (guidance) scale works best?
A: A CFG scale between 5 and 10 generally works very well.

Q: How can I prompt for cinematic images?
A: Including "cinematic" in the prompt tends to give images a nice cinematic style and feel.

Q: What is the hand-captioning format used?
A: The images use a 3-part caption structure: style/category, detailed description, and extra descriptive details.

Q: Where can I access this model?
A: The model is available for free on Civitai and can also be tested on the ThinkDiffusion website.

Q: What if I need more assistance with this model?
A: Check the video description and comments section for helpful prompt examples and other tips.