Stable Diffusion 3 - RAW First Impression!

Olivio Sarikas
23 Feb 2024 · 13:37

TLDR: The video discusses the announcement of Stable Diffusion 3, an AI image generation model, with a critical analysis of its capabilities. The creator compares it to Midjourney, noting that while Stable Diffusion 3 excels at text rendering, it has limitations in detailed elements like shadows and complex structures. The video showcases various examples of AI-generated images, highlighting the strengths and weaknesses of each model, and suggests that community training will improve their artistic and stylistic output over time.


  • 🚀 The introduction of Stable Diffusion 3 has generated significant hype in the AI image generation market.
  • 🔍 The video aims to critically evaluate the images produced by Stable Diffusion 3, noting that early examples may be cherry-picked.
  • 📸 The Stable Diffusion 3 website allows sign-up for early access; models range from 800 million to 8 billion parameters to suit different system capabilities.
  • 🌐 The new model accepts multimodal inputs, potentially including 3D shapes, offering more control over artistic output.
  • 🤖 An example of a robot with a long text on its shield demonstrates Stable Diffusion 3's strength in handling text, despite some limitations in detailing smaller elements.
  • 🎨 A video showcases elements being replaced with consistency in style and detail, though some elements, like the sushi placement, are not accurate.
  • 🖼️ Comparisons with Midjourney highlight differences in aesthetic quality and adherence to prompts, with each AI having its own strengths and weaknesses.
  • 🕯️ An image of a kitchen table setting with an embroidered cloth and a candle shows good design but lacks accurate shadow rendering from the candlelight.
  • 🐯 Gemini's attempt at creating an image from a prompt including a tiger and a cloth with text shows promise but does not fully adhere to the prompt.
  • 🧪 Stable Diffusion 3's ability to generate complex and specific images, such as glass bottles with colored liquids, is commendable, though not perfect.
  • 🤹 The depiction of clowns in a diner scene reveals common AI shortcomings with hands and anatomy, yet some images are aesthetically pleasing despite these issues.

Q & A

  • What is the main focus of the video regarding Stable Diffusion 3?

    -The main focus of the video is to critically analyze the images generated by Stable Diffusion 3 and compare its performance with Midjourney, highlighting both the strengths and limitations of the AI in creating images.

  • How can interested individuals access Stable Diffusion 3?

    -Interested individuals can sign up on the official website for Early Access to Stable Diffusion 3, and hopefully be chosen to use it.

  • What are the different model sizes available for Stable Diffusion 3?

    -Stable Diffusion 3 offers different model sizes ranging from 800 million to 8 billion parameters, allowing users with different system capabilities to access and use the models.

  • What does 'multimodal inputs' mean in the context of Stable Diffusion 3?

    -Multimodal inputs refer to the ability of Stable Diffusion 3 to accept various types of inputs, such as images, text, and potentially other formats like 3D shapes, which could provide more control over the composition and artistic output.

  • What is one notable improvement in Stable Diffusion 3 as demonstrated in the video?

    -One notable improvement in Stable Diffusion 3 is its ability to handle long text inputs, which was previously very challenging for AI image generation models.

  • What are some limitations observed in the images generated by Stable Diffusion 3?

    -Some limitations include issues with rendering detailed parts like hands of a robot, background elements like packages melting into each other, and inconsistencies in lighting and shadows.

  • How does the video compare Stable Diffusion 3 with Midjourney in terms of artistic expression?

    -The video suggests that while Stable Diffusion 3 excels in handling text and creating complex images, Midjourney tends to produce results that are more aesthetically pleasing and artistically expressive.

  • What is the significance of the 'mind-blowing' example shown in the video?

    -The 'mind-blowing' example demonstrates Stable Diffusion 3's ability to create images whose various elements are consistent and work well together, even when certain parts are animated, for example with parallax movement.

  • How does the video address the issue of hands often looking deformed in AI-generated images?

    -The video points out that hands are often a weak point in AI-generated images, appearing deformed or with incorrect anatomy, and highlights this as an area where the technology still has room for improvement.

  • What potential does the video see for Stable Diffusion 3 in video creation?

    -The video suggests that the text handling capabilities and the new abilities introduced by Stable Diffusion 3 could be very impactful for video creation, potentially leading to mind-blowing results.

  • What is the overall conclusion of the video about AI image generation?

    -The overall conclusion is that while AI image generation has come a long way and shows great potential, there are still areas that need improvement, and the journey towards perfect AI-generated images is ongoing.



๐Ÿ–ผ๏ธ Introduction to Stability Diffusion 3 and Comparison with Mid Journey

The paragraph introduces Stability Diffusion 3, a new image AI that has generated a lot of hype. The speaker aims to critically evaluate the images produced by this AI, noting that they might be cherry-picked. A comparison is drawn with Mid Journey, another AI, highlighting the strengths and weaknesses of both. The speaker emphasizes the importance of signing up for early access and supporting through Patreon. Stability Diffusion 3's features are discussed, including its multi-modal inputs and various model sizes, indicating its potential for widespread use. Examples of images created by Stability AI's founder demonstrate the AI's capabilities and limitations, such as handling long text and detailed backgrounds.


🎨 Analysis of AI-Generated Images and Their Fidelity to Prompts

This paragraph delves into a detailed analysis of several AI-generated images, evaluating their adherence to the given prompts and the quality of their artistic expression. The speaker discusses the success of Stable Diffusion 3 in creating a '90s desktop computer image and contrasts it with Midjourney's output. The limitations of both AIs are highlighted, such as issues with shadows and the placement of elements. The speaker also notes the potential for improvement through community training and the aesthetic appeal of the images, despite some inaccuracies in following the prompts.


🤖 Examination of AI's Handling of Complex Image Prompts and Anatomical Accuracy

The speaker examines the AI's ability to handle complex image prompts, such as a scene with specific arrangements and colors of objects. The AI's performance in creating a perfect image with transparent bottles is praised, but issues with color order and aesthetic appeal are noted. The paragraph also discusses the AI's struggle with anatomical accuracy, particularly in depicting animals. The speaker appreciates the AI's attempt to reflect environmental colors in the subjects and the composition's correctness, despite some imperfections. The paragraph concludes with a recognition of the AI's potential to improve over time.

🎭 Evaluation of AI's Artistic Expression and Shortcomings in Detail

The speaker evaluates the AI's artistic expression in creating images with clowns and a diner setting, highlighting the AI's shortcomings in rendering hands and facial features accurately. The paragraph discusses the differences between the AI's output and the original prompts, noting the AI's struggle with detailed elements like hands and the positioning of objects. Despite these issues, the speaker appreciates the overall visual appeal and artistic expression of the images. The paragraph ends with a positive note on the AI's potential for future improvements and its impact on video creation.



💡Stable Diffusion 3

Stable Diffusion 3 is a newly announced AI model that generates images based on text prompts. It is the subject of the video, with the creator providing a critical analysis of its capabilities. The video discusses the hype around this technology and compares it to other AI models like Midjourney, highlighting its strengths and weaknesses.

💡Cherry Picking

Cherry picking refers to the selection of the best or most favorable examples from a larger set to present a more positive impression. In the context of the video, the creator is aware that the images showcased for Stable Diffusion 3 might be cherry-picked to display the AI's best performance, and thus seeks to critically evaluate the technology beyond these select examples.

💡Multimodal Inputs

Multimodal inputs refer to the ability of a system to accept and process different types of data or inputs simultaneously. In the context of the video, Stable Diffusion 3 is mentioned to accept multimodal inputs, suggesting that it can handle not just text, but possibly images, videos, or other data types to generate more controlled and diverse outputs.

💡Open Source

Open source describes a philosophy and practice of allowing users to access, use, modify, and distribute a product or service freely. In the video, it is mentioned that the different model sizes of Stable Diffusion 3 will be open source, meaning that they will be available for the community to use and improve upon across various systems and GPUs.
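
As a rough illustration of why the 800 million to 8 billion parameter range matters for different GPUs, the sketch below estimates the memory needed just to hold a model's weights at a few common numeric precisions. The function name `weight_memory_gb` is hypothetical, and the video does not state which precision Stable Diffusion 3 uses; the figures cover weights only and ignore activations and any additional components, so treat them as a lower bound rather than a system requirement.

```python
# Back-of-the-envelope estimate: memory for model weights only.
# Assumption: each parameter is stored at a fixed width (4, 2, or 1 bytes).
# This ignores activations, text encoders, and other runtime overhead.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_memory_gb(num_params: int, dtype: str = "fp16") -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    # The two ends of the size range mentioned in the video.
    for label, n in [("800M", 800_000_000), ("8B", 8_000_000_000)]:
        for dtype in BYTES_PER_PARAM:
            print(f"{label} @ {dtype}: ~{weight_memory_gb(n, dtype):.1f} GB")
```

Under these assumptions, the smallest model would need about 1.6 GB for its weights at 16-bit precision and the largest about 16 GB, which is one way to see why a range of sizes lets the models run on more modest hardware.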


💡Aesthetics

Aesthetics refers to the appreciation of beauty or good taste, and the creation of visually pleasing compositions. In the context of the video, aesthetics is a critical criterion used to evaluate the quality of the images generated by AI models, with the creator comparing the visual appeal and artistic expression of the outputs.

💡Community Training

Community training involves the collective effort of a group of users to improve a machine learning model by contributing data, feedback, and computational resources. In the video, the creator mentions that the AI models will benefit from community training, which will help refine the models and address shortcomings over time.

💡Image Generation

Image generation is the process of creating visual content using AI algorithms based on given inputs, such as text descriptions. It is a core focus of the video, which evaluates the capabilities of Stable Diffusion 3 and other AI models in producing high-quality, realistic, and artistically expressive images from textual prompts.

💡Artificial Intelligence (AI)

Artificial Intelligence (AI) refers to the simulation of human intelligence in machines that are programmed to think, learn, and problem-solve like humans. In the video, AI is the driving technology behind the image generation models discussed, and the creator evaluates the AI's ability to understand and execute complex visual and textual prompts.


💡Animation

Animation is the process of creating the illusion of motion and change through a series of images or frames. In the context of the video, the creator is impressed by the animated elements in some of the AI-generated images, such as the parallax movement in a video example, demonstrating the AI's potential to create dynamic visual content.


💡Critique

Critique is the act of analyzing and offering judgments on the merits of a subject, often with a focus on areas for improvement. In the video, the creator provides a critical look at the images generated by Stable Diffusion 3, highlighting both its strengths in text generation and its weaknesses in handling certain visual details.


Stable Diffusion 3 has been announced, generating hype in the AI image generation market.

The speaker aims to critically analyze the images produced by Stable Diffusion 3, noting that early examples may be cherry-picked.

Stable Diffusion 3 is compared to Midjourney, with the latter praised for aesthetics but criticized for not always following the prompt.

Stable Diffusion 3 offers different model sizes from 800 million to 8 billion parameters, democratizing access to the models.

Stable Diffusion 3 is expected to be open-source, allowing use on various systems with different GPU capabilities.

The AI now accepts multimodal inputs, which could include 3D shapes or other inputs for greater control over artistic output.

Despite the AI's proficiency with text, it still struggles with detailed elements like the hands of a robot in an image.

An example of Stable Diffusion 3's capabilities includes a video where elements are replaced seamlessly, maintaining consistency and detail.

The AI can create images with a mix of artistic styles, such as a digital painting of a cat that transitions into a photographic-style raccoon.

Stable Diffusion 3's image of a '90s desktop computer is accurate and detailed, showing the AI's potential for creating realistic scenes.

Midjourney's results, while aesthetically pleasing, sometimes fail to follow the prompt accurately.

Gemini's attempt at creating an image with an embroidered cloth and a baby tiger shows some promise, but has inaccuracies.

Stable Diffusion 3 can produce aesthetically pleasing images with correct color handling and light effects.

The AI struggles with anatomical correctness, as seen in an image where the cat's head appears too small.

The AI's ability to handle complex compositions, like a scene with clowns in a diner, is improving but still has noticeable shortcomings.

Stable Diffusion 3's potential for video creation is hinted at, suggesting future advancements in multimedia AI capabilities.