ComfyUI for Everything (other than stable diffusion)

Design Input
19 Mar 2024, 32:52

TLDR: The video explores various functionalities of ComfyUI beyond Stable Diffusion, showcasing nearly 30 use cases such as image-to-text, caption creation, and generating sound effects from images. It demonstrates workflows for image background removal, video background removal, text generation with local and third-party-hosted LLMs, image enhancement, filters, and combining image-to-text and text-to-audio models, and it introduces Scrintal for note-taking and mood boards. The video highlights ComfyUI's potential as a versatile AI tool for non-coders.


  • 😀 ComfyUI offers a wide range of functionalities beyond Stable Diffusion, including image-to-text, sound effects from images, and various image enhancements.
  • 🔍 The LLaVA module is an image-to-text model that can describe images and answer detailed questions about them. It requires downloading its model files first.
  • 🖼️ ComfyUI can remove backgrounds from images using different models, some of which let you specify which object to keep, offering flexibility in image editing.
  • 📝 Scrintal is a visual note-taking platform that can be used for tasks like creating mood boards, and it works well alongside ComfyUI for documenting ideas.
  • 🎥 The video-to-mask module removes the background from videos, with options to set the frame rate and limit the number of frames processed.
  • 📝 Large language models (LLMs), run locally or on third-party servers, can be used within ComfyUI to generate text, with settings such as maximum tokens and temperature controlling output length and creativity.
  • 🔍 The text generator part of ComfyUI supports different LLMs, which can be run locally or through external services like OpenAI's API.
  • 🖌️ Image filters and enhancements in ComfyUI include options for sharpening, upscaling, and applying various effects to adjust the style and appearance of images.
  • 🎨 Creative image filters like channel shake, watercolor, motion blur, and color adaptation are available in ComfyUI for artistic image transformations.
  • 🌈 Color adjustments, film grain, and look-up table (LUT) color styles can be applied to images to fine-tune their visual appeal.
  • 🔊 ComfyUI can generate audio from text using the AudioLDM model, and can even combine image descriptions with sound effects to create immersive audio-visual content.

Q & A

  • What is the main topic of the video script?

    -The main topic of the video script is exploring various use cases of ComfyUI beyond its application for running stable diffusion, including image to text, creating captions, sound effects, and other image enhancements and filters.

  • What is LLaVA in the context of the video script?

    -In the context of the video script, LLaVA is an image-to-text model that can understand and describe what is happening in an image, allowing users to ask detailed questions about the image content.

  • How can ComfyUI be used to remove the background of an image?

    -ComfyUI offers several workflows to remove the background of an image. It can automatically detect the main element and remove the background or allow users to specify which object to keep in the image using different models for various purposes, such as human segmentation.

  • What is Scrintal and how does it relate to ComfyUI?

    -Scrintal is a visual note-taking platform that can be used in conjunction with ComfyUI. It allows users to place cards on a board, connect ideas, and document them better with the ability to add lists, images, PDFs, videos, and more.

  • How can ComfyUI be used to remove the background of a video?

    -ComfyUI can remove the background of a video by using a video to mask component that segments the subject from the background. Users can adjust settings such as frame rate and the number of frames to process, and then merge the frames back to create a video with the background removed.

  • What is the purpose of the LLM part in ComfyUI?

    -The LLM part in ComfyUI is a text generator component that allows users to run different large language model (LLM) workflows, either locally on their computer or through external services, to generate prompts or other text from user inputs.

  • How can ComfyUI enhance an image using sharpening or upscaling?

    -ComfyUI can enhance an image with components that apply sharpening to reduce blur and add texture, or by upscaling it with various models that increase its size without losing quality and, in some cases, even improve it.

  • What are some of the image filters that ComfyUI offers?

    -ComfyUI offers image filters such as channel shake, watercolor, motion blur, depth of field, and color adaptation, which can be used to add different visual effects and touches to images.

  • How can ComfyUI generate audio from text?

    -ComfyUI can generate audio from text using AudioLDM, a latent diffusion generative model. Users provide a prompt, and the model creates an audio clip that corresponds to the described scene or situation.

  • What is the process of combining image and audio in ComfyUI as described in the script?

    -The process involves using the LLaVA model to describe an image, feeding the description to a local LLM to suggest sound effects, and then using those suggestions as prompts for the audio generative model to create the actual sound effects.

  • How can ComfyUI be used to create a grid view of images with text descriptions?

    -ComfyUI allows users to write text directly on top of images using text creators, combine the results into an image batch, and then create an image grid with customizable border color, thickness, and column count to display multiple images together.

  • What is the potential of ComfyUI according to the video script?

    -According to the video script, ComfyUI has the potential to become a comprehensive tool not just for stable diffusion but also for harnessing the full potential of new AI technologies and models, especially for non-coders.



🖼️ Image to Text with ComfyUI

The paragraph introduces various capabilities of ComfyUI beyond Stable Diffusion, starting with image-to-text conversion. It highlights the LLaVA module, which interprets images and answers questions about them. The process involves downloading the required model files and writing prompts that ask for a description of the image or for specific details. Parameters such as the token limit and temperature control the length and creativity of the output, which can range from detailed image descriptions to answers about the materials visible in a scene.
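To illustrate what the temperature setting does in general, here is a minimal sketch of temperature-scaled softmax sampling weights. It is not taken from the video or from any ComfyUI node; the function name and the example scores are assumptions chosen for illustration.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert raw token scores into probabilities.

    Lower temperatures concentrate probability on the top-scoring token;
    higher temperatures spread it out, which reads as more "creative" output.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [score / temperature for score in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


# Hypothetical scores for three candidate next tokens.
logits = [2.0, 1.0, 0.1]
cautious = softmax_with_temperature(logits, 0.2)
creative = softmax_with_temperature(logits, 1.5)
```

With the low temperature almost all of the probability lands on the first token, while the higher temperature gives the other two tokens a realistic chance of being picked.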


🌟 Background Removal Techniques

This section explores different methods for background removal from images using Comi's modules. It explains the automatic selection of main elements for background subtraction and the flexibility of specifying objects to keep. The paragraph also discusses the use of human segmentation models and adjusting parameters like threshold values for better results, emphasizing the trade-off between quality and flexibility.
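The effect of a threshold value can be sketched with plain Python. This is an assumption-laden illustration, not the code of any specific ComfyUI node: it assumes the segmentation model returns one confidence value per pixel between 0 and 1, and the function name `apply_mask` is hypothetical.

```python
def apply_mask(pixels, mask, threshold=0.5):
    """Return RGBA pixels that are opaque where the mask confidence
    reaches the threshold and fully transparent everywhere else."""
    if len(pixels) != len(mask):
        raise ValueError("pixels and mask must have the same length")
    result = []
    for (r, g, b), confidence in zip(pixels, mask):
        alpha = 255 if confidence >= threshold else 0
        result.append((r, g, b, alpha))
    return result


# Three hypothetical pixels: clearly subject, uncertain edge, clearly background.
pixels = [(200, 180, 160), (90, 90, 90), (10, 20, 30)]
mask = [0.92, 0.55, 0.08]

loose = apply_mask(pixels, mask, threshold=0.3)   # keeps the uncertain edge pixel
strict = apply_mask(pixels, mask, threshold=0.7)  # drops the uncertain edge pixel
```

A lower threshold keeps more of the uncertain edge pixels, and a higher one trims them away, which is the kind of adjustment the section describes.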


📋 Note-Taking and Organization alongside ComfyUI

The speaker presents ComfyUI as a versatile tool and pairs it with a separate tool for tasks like note-taking and creating mood boards. The paragraph introduces Scrintal, a visual note-taking platform that lets users organize information with cards, lists, images, and various media. It also mentions templates for different use cases and how the speaker keeps ComfyUI resources organized there.


🎥 Video Background Removal and Enhancement

This paragraph covers video processing, specifically removing backgrounds from videos using ComfyUI. It describes loading a video, selecting a frame rate, and setting a frame limit to keep processing efficient. It mentions different models for human segmentation and the merging of processed frames back into a video with the background removed. The paragraph also touches on adjusting these settings for different video lengths and frame counts.
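How a frame rate and a frame limit interact can be sketched as a simple index calculation. The function and parameter names below are hypothetical and are not the settings of any particular video loader node; the sketch only assumes that a lower target rate skips source frames and that a limit of 0 means no cap.

```python
def select_frames(total_frames, source_fps, target_fps, frame_limit=0):
    """Return the indices of the source frames that would be processed.

    target_fps <= 0 (or >= source_fps) keeps every frame.
    frame_limit == 0 means "no cap".
    """
    if target_fps <= 0 or target_fps >= source_fps:
        step = 1.0
    else:
        step = source_fps / target_fps

    indices = []
    i = 0
    while True:
        index = int(i * step)
        if index >= total_frames:
            break
        if frame_limit and len(indices) >= frame_limit:
            break
        indices.append(index)
        i += 1
    return indices


# A 10-second clip at 30 fps, sampled at 10 fps, capped at 5 frames for a quick test run.
preview = select_frames(300, source_fps=30, target_fps=10, frame_limit=5)
# The same clip sampled at 10 fps without a cap.
full = select_frames(300, source_fps=30, target_fps=10)
```

Capping the frame count is a cheap way to check that the segmentation looks right before processing the whole clip.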


📝 Local and Remote Text Generation with LLMs

The focus shifts to text generation using locally installed models or remote services. The paragraph outlines how to install and select models for text generation within ComfyUI, showcasing the VLM Nodes extension. It also discusses connecting to external platforms like OpenAI and choosing among their models, and customizing prompts for specific text generation tasks, highlighting the flexibility of using different models and settings.


🖌️ Image Filters and Effects

This section showcases various image filters and effects available in ComfyUI, such as channel shake, watercolor, motion blur, and depth of field blur. The paragraph explains how these filters can be applied to enhance or alter the appearance of images, providing creative options for image editing. It also mentions that filter parameters can be adjusted to fine-tune the effects, emphasizing the artistic potential of these tools.


🎨 Color Adjustments and Audio Generation

The paragraph introduces color adjustment features for images, such as brightness, contrast, saturation, and sharpness, as well as the application of film grain and color style adjustments using loots. It also discusses the integration of audio generation from text using an Audio LDM model, which can create sound effects based on prompts, enhancing the multimedia capabilities of Comi.
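As a rough illustration of what brightness, contrast, and saturation controls do to a single pixel, here is a minimal sketch using common textbook formulas. It is an assumption for illustration only and does not reproduce the exact math of any ComfyUI adjustment node; the function names are hypothetical.

```python
def clamp(value):
    """Round to the nearest integer and keep the result in the 0-255 range."""
    return max(0, min(255, int(round(value))))


def adjust_pixel(rgb, brightness=1.0, contrast=1.0, saturation=1.0):
    """Apply brightness, contrast, and saturation factors to one RGB pixel.

    A factor of 1.0 leaves the corresponding property unchanged.
    """
    # Brightness: scale every channel.
    r, g, b = (channel * brightness for channel in rgb)
    # Contrast: scale each channel's distance from mid-gray (128).
    r, g, b = (128 + (channel - 128) * contrast for channel in (r, g, b))
    # Saturation: blend between the gray (luma) value and the original color.
    gray = 0.299 * r + 0.587 * g + 0.114 * b
    r, g, b = (gray + (channel - gray) * saturation for channel in (r, g, b))
    return tuple(clamp(channel) for channel in (r, g, b))


original = (100, 150, 200)
unchanged = adjust_pixel(original)
brighter = adjust_pixel(original, brightness=2.0)
grayscale = adjust_pixel(original, saturation=0.0)
```

Setting saturation to 0 collapses all three channels to the same gray value, while values above 1 push colors further from gray.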

📚 Combining Text, Audio, and Images

This section describes the process of combining text, audio, and images to create multimedia content. The paragraph explains how to use LLaVA to describe an image, an LLM to suggest sound effects based on that description, and an audio generative model to produce the soundscape. It illustrates the potential of chaining different AI models into a cohesive multimedia experience.

🛠️ Advanced Image Manipulation Techniques

The paragraph explores advanced image manipulation techniques in ComfyUI, such as creating drop shadows, strokes, and outer glows around objects, as well as generating color palettes from images. It discusses reducing image opacity and generating 3D views from a single image, highlighting the experimental nature of some features and their potential applications in design and character creation.

📝 Writing Text on Images and Creating Grids

The final workflow paragraph demonstrates how to write text directly on images and create grid views for comparison or display. It explains the text creators in ComfyUI, which allow adjustments to margins, line spacing, and shadows. The paragraph concludes by combining images and text into a grid layout, showcasing ComfyUI's versatility for organizing and presenting visual information.
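The grid settings mentioned in this workflow (column count and border thickness) boil down to a small layout calculation. The sketch below is a generic illustration under the assumption that all images share one size; the function name and parameters are hypothetical and not the interface of any ComfyUI grid node.

```python
def grid_layout(image_count, columns, cell_width, cell_height, border=0):
    """Return the canvas size and the top-left paste position of each image.

    The border is drawn around the outside of the grid and between cells.
    """
    if columns < 1:
        raise ValueError("columns must be at least 1")
    rows = -(-image_count // columns)  # ceiling division
    canvas_width = columns * cell_width + (columns + 1) * border
    canvas_height = rows * cell_height + (rows + 1) * border

    positions = []
    for i in range(image_count):
        row, col = divmod(i, columns)
        x = border + col * (cell_width + border)
        y = border + row * (cell_height + border)
        positions.append((x, y))
    return (canvas_width, canvas_height), positions


# Five 100x80 images in three columns with a 10-pixel border.
size, positions = grid_layout(5, columns=3, cell_width=100, cell_height=80, border=10)
```

Here the five images fill a 340x190 canvas across two rows, leaving the last cell of the second row empty.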

🌐 Comi as a Comprehensive AI Tool

The concluding paragraph emphasizes Comi's potential as a comprehensive tool for non-coders to harness the power of AI technologies and models. It invites viewers to explore Comi for its capabilities beyond stable diffusion and mentions the availability of installation videos and templates on Patreon, encouraging engagement and further exploration of the tool.




💡ComfyUI

ComfyUI is a user interface platform that allows users to interact with various AI models and tools. It is not limited to running Stable Diffusion, which is a type of AI model for generating images from text prompts. In the video, ComfyUI is showcased as a versatile tool that can perform a multitude of functions, such as image-to-text conversion, creating captions, and generating sound effects, making it a comprehensive solution for different creative workflows.

💡Stable Diffusion

Stable Diffusion is an AI model that generates images from textual descriptions. It is one of the functionalities that ComfyUI can interface with, but as the video emphasizes, ComfyUI offers a wide range of other capabilities beyond just running stable diffusion. The script mentions that users do not need to rely on any other software for their creative tasks because of the diverse features available within ComfyUI.


💡LLaVA

LLaVA, which stands for Large Language and Vision Assistant, is an image-to-text model that can analyze and describe images. In the context of the video, LLaVA is used to understand what is happening in an image and answer questions about it. For example, the script describes using LLaVA to describe the location and activities in an image, showcasing its ability to interpret visual data and provide textual responses.

💡Background Removal

Background removal is a process where the background of an image is extracted and removed, typically to isolate the main subject or object. The video script explains different workflows for background removal, including using AI models that can automatically detect the main element in an image or allowing users to specify what they want to keep in the image. This feature is useful for creating images with a transparent background or for compositing images in different contexts.

💡Scrintal

Scrintal is a visual note-taking platform that allows users to organize ideas and information on a board where cards can be placed and connected. It is highlighted in the video as a tool that complements ComfyUI, offering functionalities like taking notes, creating mood boards, and documenting research or lecture notes with various media types embedded within each card.

💡Video to Masks

Video to Masks is a feature that enables the removal of the background from a video, isolating the main subject or object within the video frames. The script describes a workflow where a video is loaded, and a model like U2-Net is used for human segmentation to separate the dancer from the background. This process can be applied to multiple frames, resulting in a video with a removed background, which is useful for creating composite videos or special effects.

💡LLM (Large Language Models)

Large Language Models (LLMs) are AI models designed to process and generate human-like text based on input prompts. The video discusses different ways to utilize LLMs within ComfyUI, such as running them locally on a user's computer or using third-party platforms like OpenAI. LLMs are used for generating prompts, creating text, and even suggesting sound effects based on image descriptions.

💡Upscale Enhancer

Upscale Enhancer refers to a set of tools and models that can improve the resolution and quality of images. In the video, it is mentioned in the context of sharpening images to remove blur and adding texture, as well as upscaling images using models like Ultra Sharp to increase their size without losing detail. These tools are useful for enhancing image quality and preparing images for different uses.

💡Image Filters

Image filters are effects that can be applied to images to alter their appearance or add specific visual styles. The script mentions several types of image filters available in ComfyUI, such as channel shake, watercolor, motion blur, and depth of field effects. These filters can be used to create unique visual effects and enhance images for various purposes.


💡Text-to-Audio

Text-to-Audio is a feature that generates audio, such as sound effects and ambient soundscapes, from written text. The video demonstrates how ComfyUI can use AudioLDM, a latent diffusion model, to generate audio from a text prompt, such as the soundscape of a busy city or footsteps in a kitchen. This capability allows users to add sound effects to videos or create audio content from textual descriptions.

💡Image-to-3D Generator

Image-to-3D Generator is a tool that can create 3D representations from 2D images. The video script describes using the Stable Zero123 model to generate six images from different angles, effectively reconstructing a 3D space from a single image. Although still experimental, this feature has potential applications in object and character design, as well as architectural visualization.

💡Inpaint Workflow

Inpaint Workflow refers to a process where parts of an image are removed or altered, and the software fills in the missing areas to create a seamless result. In the context of the video, the script explains using a Stable Diffusion model to remove objects like windows or chairs from an image and blend the changes with the rest of the image, demonstrating the capability to modify images with minimal traces of edits.

💡Text on Image

Text on Image is a feature that allows users to overlay text directly onto images. The video script describes using text creators within ComfyUI to add descriptions or labels to images, which can be useful for creating comparisons, mood boards, or grids of images with annotations. This functionality enhances the visual presentation by integrating textual information directly into the image composition.


ComfyUI offers over 30 different use cases beyond stable diffusion.

LLaVA is an image-to-text model that can understand and describe images.

ComfyUI can generate captions and sound effects directly from images.

Users can create more complex workflows by combining various ComfyUI functions.

Background removal workflows automatically identify and remove image backgrounds.

ComfyUI allows for fine-tuning of object segmentation with adjustable threshold values.

Scrintal is a visual note-taking platform that can be used alongside ComfyUI for better documentation.

ComfyUI's video-to-mask module can remove backgrounds from videos, creating isolated subjects.

The LLM part of ComfyUI enables text generation using different LLMs, locally or via external services.

ComfyUI supports upscaling and enhancing images with models like Ultra Sharp for improved clarity.

Image filters in ComfyUI, such as channel shake and watercolor effects, add artistic touches to images.

ComfyUI can adapt image colors to match a reference image for consistent color themes.

Adjustment filters in ComfyUI allow for fine-tuning of brightness, contrast, and saturation.

Text-to-audio capabilities in ComfyUI generate sound effects based on textual prompts.

ComfyUI can combine image and text-to-audio workflows to create videos with synchronized sound effects.

Layer effects in ComfyUI, such as drop shadows and outer glows, enhance the visual presentation of images.

Color palette generation from images is possible with ComfyUI for mood board creation.

ComfyUI's image-to-3D generator creates 3D views of spaces from single images.

The inpainting workflow in ComfyUI allows for the removal of objects from images with Stable Diffusion.

ComfyUI enables adding text overlays on images and creating grid views for comparison or display.

ComfyUI is a versatile tool for non-coders to harness the full potential of AI technologies and models.