13 Jan 2023, 24:22

TLDR: Discover an innovative method for training your face or any character onto various Stable Diffusion models using textual inversion embeddings. This one-time training allows for seamless application across community-trained models, saving time and effort. Learn how to select high-quality images, caption them accurately, and train the embedding on the Stable Diffusion 1.5 base for maximum compatibility. Master the art of balancing learning rates and training steps to avoid overfitting and achieve the desired results. Apply your trained embedding to any model, and use tricks like the XY plot to determine the best parameters for generating images. Revolutionize your creative process with this powerful technique.


  • 🎯 The video introduces a method to apply one's face or any desired style onto various models of Stable Diffusion without retraining the models repeatedly.
  • 🚀 This solution is called 'textual inversion embeddings', which allows training an embedding with one's face or style just once and applying it to any model.
  • 🌟 The process involves using high-quality, high-resolution images as the base for training the embedding, emphasizing the importance of image quality for the final result.
  • 📸 Images for training should be diverse, capturing different angles, expressions, and backgrounds, and should be free of noise and pixelation.
  • 💡 The video provides a detailed guide on selecting and preparing the images, including resizing and captioning them accurately to ensure the AI understands the subject matter.
  • 📈 The training process requires careful selection of parameters such as learning rate, batch size, and gradient accumulation steps, which can significantly impact the outcome.
  • 🔄 It's important to monitor the training process and determine the optimal step at which the character starts to look best without being overtrained.
  • 🛠️ The video offers tips on continuing the training process if the initial results are not satisfactory, by adjusting the learning rate and other parameters.
  • 🔍 A useful trick for comparing different embeddings and their training steps is presented using an XY plot, which helps in identifying the best parameters for each case.
  • 🎭 Once trained, the textual inversion embedding can be applied to any Stable Diffusion model created by the community that uses the same base version, offering a wide range of applications.

Q & A

  • What is the main topic of the video?

    -The main topic of the video is training textual inversion embeddings with Stable Diffusion, specifically how to put one's face or any desired subject on various models with a single training process.

  • What is a textual inversion embedding?

    -A textual inversion embedding is a small file that is trained on one's own images to represent a style, face, or concept, which can then be applied to any compatible model.

  • Why is image selection important in the training process?

    -Image selection is crucial because the quality and resolution of the base images directly affect the final results. High-quality, high-resolution images with good variation lead to better training outcomes.

  • What is the recommended number of images for training a textual inversion embedding?

    -The recommended number of images varies, but at least 10 high-quality images are suggested. More images with different angles, backgrounds, and lighting can improve the final results, but also increase training time.

  • How can one upscale and resize images for training?

    -Images can be upscaled and resized using tools like birme.net or the Extras tab of the Stable Diffusion web UI. Manual resizing and centering of the subject in the final image are also necessary for optimal results.

  • What is the purpose of captioning in the training process?

    -Captioning is used to describe every detail in the images so that the training process understands and learns the specific characteristics of the sample images, which helps in creating a more accurate embedding.

  • Why is choosing a unique name for the embedding important?

    -A unique name for the embedding is important to avoid confusion with existing known entities in the Stable Diffusion model, ensuring that the embedding represents the intended subject or style accurately.

  • What is the optimal learning rate for training an embedding?

    -The optimal learning rate depends on the number of training images and the desired flexibility of the model. It should be high enough to learn quickly but low enough to avoid overtraining and loss of flexibility.

  • How can one determine if an embedding is overtrained?

    -Overtraining can be identified by a decline in the quality of the generated images, with the subject becoming too rigid or artifacts appearing. The training can be adjusted by reducing the learning rate or stopping at an earlier step.

  • What is the XY plot and how is it used?

    -The XY plot is a tool that generates images at different training steps (X value) and with different CFG scales (Y value). It helps in comparing the results to determine the best parameters for using the embedding on a new model.

  • How can the trained embedding be applied to other models?

    -Once the embedding is trained, it can be applied to any other Stable Diffusion models created by the community that use the same base version as the trained embedding, allowing the subject's face or style to be used across various models with no additional training.



🤖 Introduction to Textual Inversion Embeddings

This paragraph introduces the concept of textual inversion embeddings, a method that allows individuals to train a small file, known as an embedding, using their own images. The speaker explains that this embedding can then be applied to any model, making it a useful tool for those who want to put their face on new Stable Diffusion models without the need for repeated training. The video promises to show viewers how to train their own face using textual inversion embeddings and apply them to various models with a one-time training process.


🖼️ Selecting and Preparing Images for Training

The speaker emphasizes the importance of selecting high-quality, high-resolution images for training the embedding. They recommend using at least 10 varied images and provide tips on how to find and download suitable images. The paragraph also explains the process of resizing images to 512 by 512 pixels using tools like birme.net and the need to center the subject in the images. Additionally, the speaker discusses the pre-processing of images, which involves automatically generating and refining captions for each image to help the AI understand what the sample images represent.
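The manual resize-and-center step above can be sketched in Python with Pillow (a rough illustration rather than the video's tool; the function name and file paths are placeholders):

```python
from PIL import Image

def prepare_training_image(src_path: str, dst_path: str, size: int = 512) -> None:
    """Center-crop an image to a square, then resize it to size x size."""
    img = Image.open(src_path).convert("RGB")
    w, h = img.size
    side = min(w, h)
    # Crop a centered square so the subject stays in the middle of the frame.
    left = (w - side) // 2
    top = (h - side) // 2
    img = img.crop((left, top, left + side, top + side))
    # Downscale with a high-quality filter to avoid pixelation.
    img = img.resize((size, size), Image.LANCZOS)
    img.save(dst_path, quality=95)
```

For images where the subject is off-center, the crop box would need to be adjusted by hand, which is why the video recommends checking each result manually.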


🧠 Creating and Training the Embedding

This section delves into the process of creating an embedding using the Stable Diffusion web UI. The speaker explains how to choose a unique name for the embedding, select the appropriate model base (in this case, the 1.5 model), and determine the number of vectors for the token based on the number of training images. They also discuss the importance of selecting the right learning rate to avoid overtraining and provide guidance on choosing between fixed and varied learning rates. The paragraph concludes with the speaker starting the training process and explaining the various settings and their impact on the training.
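The varied learning rate mentioned above is written as a stepped schedule string, where each entry is a rate followed by the step it applies until and the final entry applies to all remaining steps. A minimal sketch of how such a string resolves to a rate per step (the helper and its parsing are illustrative, not the web UI's actual code):

```python
def lr_at_step(schedule: str, step: int) -> float:
    """Resolve a stepped schedule like "0.05:10, 0.02:100, 0.005:1000, 0.001".

    Each "rate:until_step" entry applies up to and including its step;
    a trailing bare rate covers everything after the last boundary.
    """
    for entry in schedule.split(","):
        entry = entry.strip()
        if ":" in entry:
            rate, until = entry.split(":")
            if step <= int(until):
                return float(rate)
        else:
            return float(entry)  # terminal rate for all remaining steps
    raise ValueError("schedule has no terminal entry")

lr_at_step("0.05:10, 0.02:100, 0.005:1000, 0.001", 50)  # -> 0.02
```

Starting high and decaying like this learns the broad features quickly, then refines details slowly enough to avoid overtraining.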


🔍 Evaluating and Adjusting the Training Process

The speaker describes how to evaluate the training process by examining images generated at different steps to determine the optimal point before overtraining occurs. They explain how to continue training from a specific step if necessary, by adjusting the learning rate to improve the final embedding. The paragraph also covers the use of the XY plot as a tool for comparing different embeddings and determining the best parameters for generating images of the character with various models and CFG scales.
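The XY plot described above simply renders one image per combination of checkpoint and CFG scale. A rough sketch of the grid it enumerates (the embedding name and checkpoint values are hypothetical; intermediate checkpoints are assumed to be saved under `name-steps` suffixes):

```python
from itertools import product

embedding = "my-character"              # hypothetical embedding name
checkpoints = [500, 1000, 1500, 2000]   # X axis: training steps saved so far
cfg_scales = [5, 7, 9, 11]              # Y axis: CFG scale values to compare

# One grid cell per (checkpoint, CFG) pair, exactly what the XY plot draws.
grid = [(f"{embedding}-{steps}", cfg) for steps, cfg in product(checkpoints, cfg_scales)]
print(len(grid))  # 16 cells in a 4x4 grid
```

Scanning the finished grid makes it easy to spot where quality peaks on the step axis and which CFG scale best balances likeness against flexibility.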


🚀 Applying the Trained Embedding to New Models

In this final paragraph, the speaker demonstrates how to apply the trained textual inversion embedding to new models, such as the Protogen model presented in a previous video. They explain that once the embedding is trained correctly, it can be used on any Stable Diffusion models created by the community that use the same base model. The speaker also shares a trick for using the XY plot to quickly assess which embedding and parameters yield the best results, making it easy to determine the optimal settings for generating images of the character with different models.



💡Stable Diffusion

Stable Diffusion is an AI model used for generating images based on textual descriptions. It is a type of deep learning model that has been trained on a large dataset of images and text. In the context of the video, Stable Diffusion is the platform on which the user can train their own 'textual inversion' embeddings, allowing them to apply specific styles or faces to a variety of models.

💡Textual Inversion Embeddings

Textual Inversion Embeddings is a method that involves training a small file, known as an embedding, using one's own images to represent a particular style, face, or concept. This embedding can then be applied to any model. In the video, the creator demonstrates how to train a face using this method and apply it to various Stable Diffusion models.


💡Protogen

Protogen is a specific model within the Stable Diffusion framework that has been trained by the community. It is mentioned in the video as an example of a model where one can apply their trained textual inversion embeddings, essentially placing their face onto the models created by the community.


💡Training

In the context of the video, training refers to the process of teaching the AI model to recognize and reproduce specific visual elements, such as a person's face or a particular style, through the use of textual inversion embeddings. This is achieved by feeding the model a set of images and their corresponding captions, allowing the model to learn the unique characteristics it needs to generate.


💡Embeddings

Embeddings are representations of words, phrases, or images in a mathematical space that capture their semantic meaning. In the context of the video, embeddings are small files that contain the learned features of a particular subject, such as a face or style, which can be applied to various models.

💡Trigger Words

Trigger words are specific phrases or terms that are used to instruct the AI model to apply a certain embedding or style when generating an image. In the video, the creator emphasizes the importance of choosing unique trigger words for their embeddings to ensure that the AI correctly applies the intended style or face.

💡High-Resolution Images

High-resolution images are those with a large number of pixels, providing more detail and clarity. In the context of training AI models, using high-resolution images ensures that the embeddings and the resulting generated images are of higher quality, capturing more details from the original images.


💡Captioning

Captioning in this context refers to the process of providing detailed textual descriptions for each image during the training of the AI model. These descriptions help the model understand and learn the specific elements present in the images, such as the subject's face or style, which are to be replicated in the generated images.
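The captions produced during pre-processing live as plain-text sidecar files, one `.txt` per image sharing the image's base name, which is the layout the web UI reads during training. A small sketch of writing a refined caption back to that layout (the helper name is illustrative):

```python
from pathlib import Path

def write_caption(image_path: str, caption: str) -> Path:
    """Store a caption as a .txt sidecar next to its training image."""
    txt = Path(image_path).with_suffix(".txt")
    txt.write_text(caption, encoding="utf-8")
    return txt

# e.g. write_caption("dataset/01.png", "a photo of a woman, braided hair, white collar")
```

Editing these sidecar files by hand is how the auto-generated captions get refined before training starts.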

💡Learning Rate

The learning rate is a hyperparameter in machine learning models that determines how much the model's weights are updated during training. A higher learning rate may lead to faster training but can also result in overfitting, while a lower learning rate ensures a more gradual and stable learning process. In the video, the creator advises on choosing an appropriate learning rate to prevent overtraining and maintain the flexibility of the model.


💡VRAM

VRAM, or Video RAM, is the memory used by graphics processing units (GPUs) to store image data. In the context of training AI models like Stable Diffusion, having more VRAM allows for processing more data simultaneously, which can speed up training and allow for the use of larger batch sizes.


💡Overfitting

Overfitting occurs when a machine learning model learns the training data too well, including its noise and outliers, which can lead to poor performance on new, unseen data. In the context of the video, overfitting can result in the AI model generating images that are too similar to the sample images, losing the ability to apply the learned style or face to different models flexibly.


Introducing textual inversion embeddings for Stable Diffusion models.

You can now apply your face or any desired style to multiple models without retraining.

The process involves a one-time training of your chosen subject using textual inversion.

Embeddings are small files that can be easily shared and applied to any compatible model.

The training process requires high-quality, high-resolution images of the subject.

Proper image selection and captioning are crucial for the success of the training.

The embedding file is created by choosing a unique name and setting the appropriate parameters.

The learning rate plays a vital role in preventing overtraining and maintaining model flexibility.

The training process can be monitored by observing the images generated at different steps.

If overtraining occurs, you can adjust the learning rate and continue training the embedding.

Embeddings can be applied to any Stable Diffusion 1.5 model created by the community.

The process is demonstrated using the character Wednesday Addams played by Jenna Ortega.

The method allows for the training of various subjects, including fictional characters and pets.

The training can be optimized by using the right batch size and gradient accumulation steps.
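Batch size and gradient accumulation trade off against VRAM: the weights update once per accumulation cycle, so the effective batch is their product. A one-line sketch of that relationship (names are illustrative):

```python
def effective_batch_size(batch_size: int, grad_accum_steps: int) -> int:
    """Gradient accumulation sums gradients over grad_accum_steps
    forward/backward passes before each weight update, so the update
    behaves as if it saw batch_size * grad_accum_steps images."""
    return batch_size * grad_accum_steps

effective_batch_size(2, 4)  # -> 8 images per weight update
```

A GPU that cannot fit a batch of 8 can still approximate it with batch size 2 and 4 accumulation steps, at the cost of a slower wall-clock pass.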

The XY plot is a useful tool for comparing different training steps and CFG scales.

Once trained, the embedding can be applied to new models with ease, showcasing its practical applications.