SD3 Medium Base Model in ComfyUI: Not as Good as Expected – Better to Wait for Fine-Tuned Versions

黎黎原上咩
13 Jun 2024 · 07:39

TLDR: Stability AI's SD3, initially met with high expectations, faced setbacks before launch, including leadership changes and financial struggles. Despite these, SD3 was released on June 12th, featuring improved photorealism, prompt adherence, and text generation. The release includes three model files that differ in which text encoders are bundled, with ComfyUI as the officially recommended interface. While SD3 shows promise, it has clear flaws, particularly in generating human figures, and the community awaits fine-tuned versions for better performance.

Takeaways

  • 😀 Stability AI announced the release of SD3, a major version following SD 1.5 and SDXL, which was highly anticipated.
  • 😔 The company faced leadership changes and financial difficulties, leading to concerns about the future of SD3.
  • 📅 Despite setbacks, SD3 was officially open-sourced and released on June 12th as scheduled.
  • 🖼️ SD3 showcases excellent photorealistic effects and the ability to understand complex prompts involving spatial relationships and compositional elements.
  • 📝 Improvements in text generation are evident, with no artifacts or spelling errors in the generated images.
  • 🔧 The new architecture used by SD3 is the Multimodal Diffusion Transformer (MMDiT), which underpins these advantages.
  • 📚 The official recommendation for using SD3 is through ComfyUI, which was recently updated to support SD3.
  • 📁 Three model files were released, with the smallest being SD3 Medium at 4.34 GB, requiring separate CLIP downloads for use in ComfyUI.
  • 💻 Users need to upgrade ComfyUI to the latest version to ensure compatibility with SD3.
  • 🐑 SD3 has shown the ability to correctly interpret and generate text on images, such as writing a nickname on a hat.
  • 🙁 However, SD3 has flaws, particularly in generating human figures, which has been a point of complaint among users.
  • 🔮 The future of SD3 depends on the adoption and development of third-party models, such as control nets and other enhancements.

Q & A

  • What was the initial expectation for SD3 based on previous versions?

    -The initial expectation for SD3 was that it would be another major version like SD 1.5 and SDXL, expected to be widely used and highly anticipated.

  • What challenges did Stability AI face before the release of SD3?

    -Stability AI faced several challenges: the resignation of founder and CEO Emad Mostaque, the departure of the core research team, and funding difficulties stemming from its free open-source business model, which put the company's financial situation in jeopardy.

  • When was SD3 officially released by Stability AI?

    -SD3 was officially released by Stability AI on June 12th, as announced at AMD's launch event.

  • What are the notable capabilities of SD3 as showcased in the initial images?

    -The initial images showcased SD3's excellent photorealistic effect, adherence to complex prompts involving spatial relationships, compositional elements, actions, and styles, and an evident improvement in text generation without artifacts or spelling errors.

  • What is the Multimodal Diffusion Transformer (MMDiT) and how does it contribute to SD3's advantages?

    -The Multimodal Diffusion Transformer (MMDiT) is the new architecture used by SD3; it underpins the model's photorealistic effect, prompt adherence, and improved text generation capabilities.

  • How many model files were released for SD3 and what are their sizes?

    -Three model files were released for SD3: the smallest, SD3 Medium at 4.34 GB, contains only the core model weights; a 5.97 GB version bundles the two CLIP text encoders; and the largest is the full package with the T5 encoder included as well.
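As a quick reference, the three released files can be summarized in code. The filenames below are the ones commonly published for the June 12th release, and the size of the largest variant is an approximation from public listings; verify both against the official download page:

```python
# Hedged summary of the three SD3 Medium checkpoint variants.
# Filenames and the ~10.9 GB figure are assumptions based on public
# listings; the 4.34 GB and 5.97 GB sizes are from the release notes.
SD3_VARIANTS = {
    "sd3_medium.safetensors": {
        "size_gb": 4.34,
        "includes": [],  # core weights only; text encoders loaded separately
    },
    "sd3_medium_incl_clips.safetensors": {
        "size_gb": 5.97,
        "includes": ["clip_g", "clip_l"],  # both CLIP encoders bundled, no T5
    },
    "sd3_medium_incl_clips_t5xxlfp8.safetensors": {
        "size_gb": 10.9,  # approximate
        "includes": ["clip_g", "clip_l", "t5xxl_fp8"],  # the full package
    },
}

def needs_separate_clips(filename: str) -> bool:
    """Return True if this checkpoint requires downloading CLIP files separately."""
    return not SD3_VARIANTS[filename]["includes"]
```

This is why the bare 4.34 GB checkpoint needs the extra CLIP downloads mentioned above, while the full package works on its own.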

  • What is the recommended software to use with SD3 according to the official recommendation?

    -The official recommendation is to use ComfyUI with SD3.

  • What are the hardware requirements for using SD3 in ComfyUI?

    -Using SD3 in ComfyUI requires a graphics card with enough VRAM to hold the model plus the additional CLIP text encoders, and it is advised not to enable T5 if the graphics card has limited VRAM.
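The advice above can be sketched as a simple heuristic. The 12 GB threshold here is an illustrative assumption, not an official requirement from Stability AI:

```python
def suggest_t5(vram_gb: float, threshold_gb: float = 12.0) -> bool:
    """Illustrative heuristic: enable the T5 text encoder only when the GPU
    has comfortably more VRAM than the model plus all three text encoders.
    The 12 GB default threshold is an assumption, not an official figure."""
    return vram_gb >= threshold_gb

# e.g. an 8 GB card would skip T5 and rely on the two CLIP encoders alone,
# while a 16 GB card could load the full set of text encoders.
```

Skipping T5 trades some prompt-understanding quality for a much smaller memory footprint, since the T5-XXL encoder is by far the largest of the three.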

  • What was the result of testing SD3's text generation ability with a prompt for a sheep with a hat?

    -SD3 correctly generated an image of a sheep wearing a yellow hat with 'Mimi' written on it, demonstrating its text generation ability.

  • What are some of the flaws observed in SD3's performance?

    -Some flaws observed in SD3's performance include poor generation of human figures and issues with certain prompts leading to broken or scary images.

  • What is the future outlook for SD3 and what factors will influence it?

    -The future of SD3 depends on the adoption and speed of third-party models, fine-tuning, and the development of control mechanisms like LoRA and ControlNet.

  • Why can't the Pony series model author adapt to SD3?

    -Due to license issues, the Pony series model author, AstraliteHeart, confirmed they cannot adapt their model to SD3.

Outlines

00:00

🚀 Launch of Stability AI's SD3

Stability AI's much-anticipated SD3 was expected to follow the success of its predecessors, SD 1.5 and SDXL. However, the company faced challenges: founder and CEO Emad Mostaque stepped down, the core research team resigned, and the free open-source business model led to funding issues. Despite these setbacks, SD3 was officially released on June 12th as promised. The video explores the new features of SD3, how to download and install it, and compares its image quality and usage to Midjourney. It also covers the hardware requirements and provides a hands-on demonstration using ComfyUI.

05:09

🔍 SD3's Features and Performance Review

The video reviews the features and performance of Stability AI's newly released SD3 model, highlighting its photorealistic effects, prompt adherence, and improved text generation. It also covers the new architecture, the Multimodal Diffusion Transformer (MMDiT), which underpins these features. The model's capabilities are demonstrated through hands-on testing in ComfyUI, comparing the different model files and their requirements. While SD3 performs well in areas such as spatial relationships and text generation, it struggles with generating human figures. The section concludes with workflow customization, the potential for future fine-tuned versions of SD3, and the licensing issues preventing the Pony series model from being adapted to SD3.

Keywords

💡SD3

SD3 refers to Stable Diffusion 3, the major version update of Stability AI's image generation model, following SD 1.5 and SDXL. In the video it is the central topic, with its release and features being the main focus. The script notes the anticipation around SD3 and the subsequent concerns about its development due to internal company issues.

💡ComfyUI

ComfyUI is a node-based user interface for running diffusion models and the officially recommended way to use SD3. It is mentioned multiple times in the script in the context of downloading, installing, and exploring SD3's new features.

💡Photorealistic effect

The term 'photorealistic effect' is used to describe the quality of images generated by SD3, emphasizing their high level of realism that closely resembles real-world photographs. This is highlighted as one of the notable capabilities of SD3 in the script.

💡Prompt adherence

Prompt adherence refers to the ability of SD3 to understand and follow the instructions given in a prompt, which can include complex elements like spatial relationships, composition, actions, and styles. The script provides examples of how SD3 adheres to prompts, such as the positions of objects in an image following the user's instructions.

💡Text generation

Text generation is a feature of SD3 that is mentioned as having improved capabilities. It involves the model's ability to generate text that is coherent and free of errors, as demonstrated by the script's examples of text on a flag and graffiti.

💡Multimodal Diffusion Transformer (MMDiT)

The Multimodal Diffusion Transformer, or MMDiT, is the new architecture used by SD3. It is the technical foundation for the advantages mentioned in the script, such as photorealism and prompt adherence. While the script does not delve into technical details, it is presented as a key innovation of SD3.

💡Checkpoints

Checkpoints in the context of the script refer to different versions or stages of the SD3 model that have been saved and can be downloaded. There are three mentioned in the script, each with different sizes and features, and they are essential for users looking to utilize the model in ComfyUI.

💡Hardware requirements

Hardware requirements pertain to the specifications needed to run the SD3 model effectively. The script discusses memory usage and recommends disabling certain features, such as the T5 encoder, if the user's graphics card has limited VRAM, indicating that high-performance hardware is necessary for optimal use of SD3.

💡Fine-tuned versions

Fine-tuned versions suggest improved or customized iterations of the base SD3 model that may offer better performance or specific functionalities. The script expresses the hope that such versions will become available in the future, implying that the current release may have some limitations.

💡Third-party models

Third-party models refer to versions of SD3 that are developed or adapted by entities other than the original creators. The script mentions that the future success of SD3 may depend on the adoption and speed at which these third-party models are created and integrated.

💡License issues

License issues are mentioned in the context of a specific model author being unable to adapt their work to SD3 due to licensing restrictions. This highlights potential legal or compatibility challenges that can affect the availability and development of certain models or features within the SD3 ecosystem.

Highlights

Stability AI announced the upcoming release of SD3 in February, expected to be widely used and highly anticipated.

SD3 faced challenges with the company's founder and CEO stepping down and the core research team resigning.

Funding difficulties due to the free open-source business model put Stability AI's financial situation in jeopardy.

SD3 was officially released on June 12th as scheduled, despite the company's internal issues.

The SD3 Medium model showcases excellent photorealistic effects, producing images that look like real-world photographs.

The model demonstrates prompt adherence, understanding complex prompts involving spatial relationships and compositional elements.

SD3 shows an evident improvement in text generation with no artifacts or spelling errors.

The new architecture used by SD3 is the Multimodal Diffusion Transformer (MMDiT), responsible for its advantages.

The official recommendation for using SD3 is through ComfyUI.

Three checkpoints of the SD3 model were released, with the smallest being 4.34 GB and requiring separate CLIP downloads.

An extra TripleCLIPLoader node is needed to use the SD3 Medium model in ComfyUI, offering flexibility.

The largest SD3 model includes all necessary components, making it a one-stop solution for users.

ComfyUI was updated to support SD3, enhancing user experience.

The SD3 Medium model together with all three separate text encoder files is recommended for flexibility and functionality.

SD3 correctly generated text on a sheep's hat, demonstrating its text generation ability.

SD3 has been criticized for poor performance in generating human figures.

Despite flaws, SD3's understanding of spatial relationships and prompts is accurate.

The future of SD3 depends on the adoption and speed of third-party models and control mechanisms.

Due to license issues, the Pony series model author cannot adapt to SD3.