Stable Diffusion ComfyUI & Suno AI Create AI Music Video On Our Control

Future Thinker @Benji
9 May 2024 18:08

TLDR: In this video, the creator discusses the limitations of an AI music video tool called Noisy AI, which generates music videos based on text prompts but often results in inconsistent and disappointing outputs. The creator then proposes a superior method for producing AI music videos using a combination of tools such as ComfyUI, Stable Diffusion, and Suno AI. The process involves using a large language model to transform content into text prompts for Stable Diffusion, creating motion animations, and generating music with Suno AI. The creator demonstrates how to edit these elements together in CapCut to produce a high-quality, cohesive music video with more control over the content and style, ultimately resulting in a more satisfying and engaging final product.


  • 🎥 The video discusses creating music videos using AI tools, specifically mentioning Stable Diffusion, AnimateDiff, and Stable Video Diffusion (SVD).
  • 🚀 The presenter is initially inspired by Noisy AI's introduction videos, which appear to generate music videos from text prompts.
  • 😕 Upon further investigation, the presenter expresses disappointment with Noisy AI's Discord tool, comparing it to social media aggregators from the early 2000s that have since become obsolete.
  • 🤖 The AI models used by Noisy AI seem to be repurposed from other sources, such as Stable Diffusion, rather than being unique to their platform.
  • 📚 The video content generated by Noisy AI's Discord bot does not match the provided lyrics, indicating a lack of integration between the music and visuals.
  • 🚫 The presenter deems the output from Noisy AI as 'garbage' due to the mismatch between the lyrics and the generated scenes, and the lack of creative control.
  • 🌟 The video suggests creating a more controlled and higher quality music video workflow using ComfyUI, Stable Diffusion, and an AI music generator like Suno AI.
  • 🎼 Suno AI is praised for generating an R&B style song that fits the theme of long-distance love, which will be used to create an emotionally resonant music video.
  • 🎭 The workflow involves using a large language model to transform content into text prompts for Stable Diffusion, creating both A-roll (singer's performance) and B-roll (background story) scenes.
  • 🔄 The A-roll scenes are processed through an AnimateDiff video workflow to restyle the source footage, while B-roll scenes are generated using Stable Diffusion prompts based on the song's lyrics.
  • ✂️ The final step is editing the generated scenes with the music in an app like CapCut, allowing for fine-tuning and creative control over the final music video.
  • ✨ The presenter emphasizes the importance of having control over the video content and the potential for improvement with further research and development.

Q & A

  • What is the main topic of the video?

    -The main topic of the video is creating music videos with AI models and tools: it discusses the limitations of a tool called Noisy AI and demonstrates an alternative workflow using ComfyUI and Suno AI.

  • What is the creator's opinion on Noisy AI?

    -The creator is disappointed with Noisy AI, as it does not seem to generate high-quality music videos as advertised on their website. The creator finds the output to be inconsistent and not matching the user-provided music.

  • What are the issues the creator found with Noisy AI's generated music videos?

    -The creator found that the generated music videos from Noisy AI had mismatched lyrics and video content, deformed characters, and lacked the creative control the creator desired.

  • What alternative workflow does the creator propose?

    -The creator proposes building a workflow in ComfyUI that uses a large language model to transform content into text prompts for Stable Diffusion, creates motion animations, and uses Suno AI for music generation.

  • How does the creator plan to generate the music for the music video?

    -The creator plans to use Suno AI to generate the music, which they find to be of better quality for the purpose of creating an emotional and fitting soundtrack for the video.

  • What role does the A-roll play in the music video?

    -The A-roll represents the main performance or singing scenes in the music video, which are transformed through Stable Diffusion animations to match the style and mood of the music.

  • How does the creator intend to improve the lip syncing in the video?

    -The creator plans to use lip sync technology to improve the synchronization between the singer's mouth movements and the music, although they mention this as a potential future enhancement rather than a feature of the current workflow.

  • What is the significance of using a large language model in the workflow?

    -The large language model is used to transform the song lyrics into more descriptive stories or scene descriptions, which then serve as prompts for generating the visuals using Stable Video Diffusion.

  • What is the advantage of using the proposed workflow over Noisy AI?

    -The proposed workflow offers more control over the video content, allows for customization of character and background styles, and provides a higher quality output that is more closely aligned with the user's creative vision.

  • How does the creator feel about the quality of the music generated by Suno AI compared to other AI music generators?

    -The creator prefers Suno AI for generating music of higher quality and emotional depth, as opposed to other generators that may produce more humorous or less serious tracks.

  • What is the final step in the creator's workflow for making the music video?

    -The final step is to import all the generated scenes and the AI music into a video editing software like CapCut, where the scenes are arranged, effects are added, and transitions are set to complete the music video.



🎬 Introduction to AI Music Video Creation

The speaker begins by discussing various tools for generating animations and AI music. They express excitement about a new tool called Noisy AI, which seemingly generates music videos from text prompts. However, after exploring the tool's Discord community and generated content, disappointment sets in due to the lack of originality and quality. The speaker criticizes the tool for merely combining existing AI models and not offering a unique approach to music video creation. They also point out inconsistencies in the generated videos, such as mismatched lyrics and video content, and the prevalence of artifacts like deformed hands. The speaker concludes this paragraph by advocating for creating a more controlled and personalized music video workflow using local tools.


🚀 Building a Personalized Music Video Workflow

The speaker outlines their plan to create a superior music video workflow using local tools and AI models. They describe using a large language model to transform content into text prompts for Stable Video Diffusion, which will be used to create motion animations. The speaker intends to keep each scene short, approximately 3-4 seconds, to fit the music video format. They also mention using a Stable Diffusion AnimateDiff video workflow to create styles from reference source videos. The music is generated with Suno AI, which the speaker has already experimented with and found to be of higher quality than other options. The speaker then demonstrates how they will use these tools to create a music video, starting with generating scenes and editing them together with the generated music.
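The 3-4 second scene length implies a rough shot count for a given song. A minimal sketch of that arithmetic follows; the song length, the frame rate, and the function name `plan_scenes` are illustrative assumptions, not values taken from the video.

```python
import math


def plan_scenes(song_seconds: float, scene_seconds: float = 3.5, fps: int = 8):
    """Return (number of scenes, frames per scene) needed to cover a song.

    All default values are assumptions chosen for illustration.
    """
    if song_seconds <= 0 or scene_seconds <= 0 or fps <= 0:
        raise ValueError("all inputs must be positive")
    # Round up so the last partial scene still covers the end of the song.
    scene_count = math.ceil(song_seconds / scene_seconds)
    # Each scene is rendered as a fixed number of frames.
    frames_per_scene = round(scene_seconds * fps)
    return scene_count, frames_per_scene


# Hypothetical two-minute song cut into 4-second scenes at 8 frames per second.
scenes, frames = plan_scenes(song_seconds=120, scene_seconds=4, fps=8)
print(scenes, frames)
```

Shorter scenes raise the number of clips to generate but keep each generation small, which is the trade-off the speaker accepts for a music video format.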


🎤 Crafting Emotional Visuals for Music

The speaker generated an R&B style song using Suno AI, focusing on the theme of long-distance love and sadness. They discuss how to translate the song's emotions into visual elements for the music video. The speaker uses video clips of people singing to represent the singer's role in the video, which will be transformed using a Stable Diffusion animation workflow. Additionally, they plan to use Stable Video Diffusion to create love story scenes for the B-roll, using prompts generated from the song's lyrics. The speaker describes the process of generating scene descriptions from the lyrics using a large language model and then using a fine-tuned Llama 3 model to create Stable Diffusion prompts for each scene. They also touch on the potential for improving lip syncing and other video editing techniques in future projects.
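The lyrics-to-prompt step can be sketched as plain text handling. The example below only builds the instructions that would be sent to a language model; it does not call any model. The sample lyrics, the section-splitting rule, and the function names are assumptions for illustration and are not taken from the video.

```python
def split_into_sections(lyrics: str) -> list[str]:
    """Split lyrics on blank lines; each section becomes one B-roll scene."""
    sections = [block.strip() for block in lyrics.split("\n\n")]
    return [section for section in sections if section]


def build_scene_request(section: str, theme: str) -> str:
    """Build the instruction text for one lyric section (hypothetical wording)."""
    return (
        f"Theme: {theme}\n"
        f"Lyrics:\n{section}\n"
        "Describe one 3-4 second visual scene for these lyrics, "
        "then write a single-line Stable Diffusion prompt for it."
    )


# Hypothetical lyrics, two sections separated by a blank line.
lyrics = (
    "Miles apart tonight\nI still hear your voice\n\n"
    "Counting days alone\nWaiting by the phone"
)
llm_requests = [
    build_scene_request(section, "long-distance love")
    for section in split_into_sections(lyrics)
]
print(len(llm_requests))
```

Each request string would then be passed to whichever language model is available locally, and the returned prompt fed to the image or video generation step.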


🎉 Final Thoughts on AI Music Video Tutorial

The speaker concludes the tutorial by emphasizing the ease of compiling scenes and adding AI-generated music to create a cohesive music video. They highlight the benefits of having more control over the arrangement of scenes, effects, and transitions compared to the Noisy AI model. The speaker also suggests areas for improvement, such as better lip syncing and research into synchronization techniques. They express satisfaction with the quality of the final product, stating it surpasses the output from the Noisy AI models. The speaker encourages viewers to use their preferred AI music generator and offers inspiration for creating their own music videos, ending the video on a positive note and looking forward to future interactions.



💡Stable Diffusion

Stable Diffusion is a latent diffusion model that generates images from text descriptions; related tools such as AnimateDiff and Stable Video Diffusion (SVD) extend this approach to short animations. In the context of the video, it is used to create visuals for a music video by transforming text prompts into animated scenes. An example from the script is the use of 'stable video diffusions animate diff and SVD' to generate animation.

💡AI Music Generator

An AI Music Generator is a tool or software that uses artificial intelligence to compose music based on certain parameters or prompts given by the user. In the video, the AI music generator is used to create an original song that will be synchronized with the visuals in the music video. The script mentions using Suno AI (transcribed as 'zuno AI') for creating music.


💡Discord

Discord is a communication platform that allows users to chat via text, voice conversations, and video calls. In the script, it is mentioned as a place where the user interacted with a bot to generate music videos, indicating a community or server setup for collaborative content creation.

💡Text Prompt

A text prompt is a short piece of text used as an input to an AI system to generate a specific output, such as an image, video, or piece of music. In the video's context, text prompts are used to instruct the AI to generate scenes for the music video, like 'robot dancing in Disco'.

💡A-Roll and B-Roll

In video production, the A-Roll typically refers to the main action or dialogue in a scene, while the B-Roll consists of supplementary footage that adds context or emphasis. The script discusses using AI to generate both A-Roll, which includes the singer's performance, and B-Roll, which tells the background story related to the music and lyrics.
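One simple way to arrange the two kinds of footage is to alternate them on a timeline. The sketch below is an assumed arrangement rule for illustration only; the video arranges clips by hand in an editor, and the `Clip` type, clip names, and durations here are hypothetical.

```python
from dataclasses import dataclass
from itertools import zip_longest


@dataclass
class Clip:
    name: str
    roll: str       # "A" for performance footage, "B" for story footage
    seconds: float


def interleave(a_roll: list[Clip], b_roll: list[Clip]) -> list[tuple[float, Clip]]:
    """Alternate A-roll and B-roll clips, returning (start time, clip) pairs."""
    timeline = []
    start = 0.0
    for pair in zip_longest(a_roll, b_roll):
        for clip in pair:
            if clip is None:  # one list ran out before the other
                continue
            timeline.append((start, clip))
            start += clip.seconds
    return timeline


# Hypothetical clips: two singer shots and one story shot.
a_roll = [Clip("singer_1", "A", 4.0), Clip("singer_2", "A", 3.0)]
b_roll = [Clip("airport", "B", 3.0)]
for start, clip in interleave(a_roll, b_roll):
    print(f"{start:5.1f}s  {clip.roll}-roll  {clip.name}")
```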

💡Lip Sync

Lip Sync refers to the process of matching the movements of the mouth in a video to the lyrics or dialogue in a soundtrack or voiceover. The video aims to improve the quality of the music video by incorporating lip-syncing to make the singer's performance more realistic and synchronized with the music.

💡ComfyUI

ComfyUI is a node-based graphical interface for building and running Stable Diffusion pipelines. In the context of the video, it is the tool used to build the workflow that generates the music video scenes, with each processing step represented as a connected node.
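ComfyUI workflows can be exported as JSON in an API format and queued on a local server, which makes it possible to swap in a new text prompt per scene from a script. The sketch below only prepares the request body and sends nothing. It assumes the API-format layout of node IDs mapped to `class_type` and `inputs`, and a `prompt` key in the request body; the node ID, the single-node fragment, and the prompt text are hypothetical.

```python
import copy
import json


def set_prompt_text(workflow: dict, node_id: str, text: str) -> dict:
    """Return a copy of an API-format workflow with one text input replaced."""
    updated = copy.deepcopy(workflow)  # leave the caller's workflow untouched
    updated[node_id]["inputs"]["text"] = text
    return updated


def build_queue_body(workflow: dict) -> bytes:
    """Encode the JSON body for queueing a workflow (assumed body shape)."""
    return json.dumps({"prompt": workflow}).encode("utf-8")


# Hypothetical single-node fragment; a real export contains the full graph.
workflow = {
    "6": {"class_type": "CLIPTextEncode", "inputs": {"text": "", "clip": ["4", 1]}}
}
body = build_queue_body(
    set_prompt_text(workflow, "6", "couple at an airport, cinematic")
)
print(body.decode("utf-8"))
```

With a local server running, the body would typically be sent as an HTTP POST to the server's `/prompt` route; the host and port depend on the local setup and are left out here.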

💡Suno AI

Suno AI is an AI music generator mentioned in the script as an alternative to another platform for creating music. The user found that Suno AI produced higher quality songs suitable for the music video, indicating its role in the creative process.

💡Large Language Model

A Large Language Model is a neural network trained on large volumes of text to understand and generate human language. In the video, it is used to transform lyrics into more descriptive stories, which then serve as prompts for generating scenes with Stable Diffusion.

💡AnimateDiff

AnimateDiff is an add-on for Stable Diffusion that adds a motion module so the model can produce short animated clips rather than single images. In the script, it is used to create styles from a reference source video, contributing to the visual style of the generated music video scenes.

💡CapCut

CapCut is a video editing application mentioned in the script as a tool for compiling and editing the generated scenes into a cohesive music video. It is described as being very easy to use, allowing for simple video editing without the need for tutorials.


The video discusses creating music videos using AI tools on a local PC.

Noisy AI is introduced as a tool for generating music videos from text prompts.

The quality of Noisy AI's generated content is compared to other AI models like Sora and SVD.

The speaker expresses disappointment with Noisy AI's output, citing inconsistency and lack of originality.

A critique of the AI industry's current state, drawing parallels to the early days of social media.

The video demonstrates how to use a text prompt to generate a scene using Stable Video Diffusion.

The issue of AI-generated videos lacking consistency in details like hands and fingers is mentioned.

The music for the generated videos is not created by the AI but must be provided by the user.

The video content and lyrics generated by the AI do not match, leading to a lack of cohesion.

The speaker proposes creating a more controlled and higher quality music video workflow using ComfyUI.

Stable Diffusion with AnimateDiff and SVD is used to create motion animations and style transformations.

Suno AI is used to generate the music for the music video, offering a higher quality alternative.

The process involves using large language models to transform lyrics into descriptive stories for each scene.

Llama 3 fine-tuned models are used to connect with stable diffusion and generate each scene.

The final music video is edited using CapCut, combining the A-roll and B-roll with the generated music.

The video emphasizes the importance of having control over the arrangement and effects in the final video.

Suggestions for improvement include better lip syncing and further research for higher quality output.

The tutorial concludes with encouragement for viewers to create their own AI music videos with more control and quality.