Latest AI Developments: ZeroScope V2, Stable Diffusion XL 0.9, and More
Table of Contents
- ZeroScope V2 TechSoup Video Models Released
- Stable Diffusion XL 0.9 Beta Model Unveiled
- Textbooks are All You Need: High Performance Small LM
- Convergence of Language, Perception and Action
- 30 Billion Parameter Chat Model Open Sourced
- Speech-to-Speech Translation with Accents
- Transformers' Limitations in Complex Tasks
- Motion Language Model for Body Movements
- Bypassing Paywalls with ChatGPT
- Public Awareness of AI Models Still Low
ZeroScope V2 TechSoup Video Models Released
The most exciting news this week is definitely the ZeroScope V2 TechSoup video model collection: three watermark-free, ModelScope-based video models, each trained for a different output format. For those who don't know what ModelScope is or want to learn more about current text-to-video developments, check out this informative video for more context.
The first is the ZeroScope V2 XL model, which generates video at a resolution of 1024x576 at up to 24 FPS. The second is the ZeroScope V2 576w model, a lighter alternative to XL trained to generate 576x320 video at up to 24 FPS. The third is the ZeroScope V2 Dark model, which focuses on 30 FPS generation, albeit at a lower 448x256 resolution.
All three models were trained on 9,923 clips comprising 29,769 tagged frames, an impressive dataset. You can try them out now using Automatic1111's text-to-video extension; further instructions are on the Hugging Face model cards.
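If you would rather script generation than use the web UI, models like these can also be run through Hugging Face's diffusers library. Below is a minimal sketch for the 576w model, assuming the weights are published under the cerspense/zeroscope_v2_576w repo id on Hugging Face and a CUDA GPU is available; the prompt is purely illustrative, and the XL and Dark variants follow the same pattern with their own resolutions and frame counts.

```python
# Minimal text-to-video sketch for ZeroScope V2 576w via diffusers.
# Assumes torch, diffusers, and accelerate are installed, and that the
# weights live at the repo id "cerspense/zeroscope_v2_576w" (assumption).
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video

pipe = DiffusionPipeline.from_pretrained(
    "cerspense/zeroscope_v2_576w", torch_dtype=torch.float16
)
pipe.enable_model_cpu_offload()  # offloads submodules to keep VRAM usage low

prompt = "a golden retriever running along a beach at sunset"

# 576x320 and 24 frames match the format this model was trained on.
result = pipe(prompt, num_inference_steps=40, height=320, width=576, num_frames=24)
video_frames = result.frames  # newer diffusers releases batch this; use result.frames[0]

video_path = export_to_video(video_frames)  # writes an .mp4 and returns its path
print(video_path)
```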
Three High Quality Models for Text-to-Video Generation
This ZeroScope V2 collection offers three high-quality models for watermark-free text-to-video generation, each focused on a different resolution and frame rate to suit different needs. The XL model produces 1024x576 video at up to 24 FPS. The 576w model generates 576x320 video at up to 24 FPS as a lighter-weight option. Finally, the Dark model targets high-frame-rate 30 FPS video at a resolution of 448x256.
Trained on Nearly 10,000 Video Clips
A key strength of these ZeroScope V2 models is the robust dataset they were trained on: 9,923 video clips with 29,769 tagged frames in total. Such a large and diverse training set allows the models to generate high-quality, realistic video from text prompts across a range of resolutions and frame rates.
Stable Diffusion XL 0.9 Beta Model Unveiled
Another major headline is Stability AI's release of Stable Diffusion XL 0.9, a new base model focused on images with greater detail and better composition. The model is still in beta but is available now on Clipdrop, a free image-generation service.
The official comparisons show that XL 0.9 produces much better compositions, colors, and fine details than the earlier SD XL model. Another comparison pits SD XL 0.9 against SD 1.5: recreating XL 0.9 quality in SD 1.5 took around two hours per image using ControlNet, inpainting, and upscaling, while SD XL 0.9 instantly generates incredible quality with top-notch composition, lighting, and color.
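Besides Clipdrop, the 0.9 weights were also published as a gated research release on Hugging Face. Below is a minimal sketch of running the base model through diffusers, assuming access to the stabilityai/stable-diffusion-xl-base-0.9 repo has been granted and a CUDA GPU is available; the prompt and step count are purely illustrative.

```python
# Minimal Stable Diffusion XL 0.9 sketch via diffusers (needs >= 0.18).
# Assumes gated access to "stabilityai/stable-diffusion-xl-base-0.9" on
# Hugging Face has been granted under its research license.
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-0.9",
    torch_dtype=torch.float16,
    variant="fp16",
    use_safetensors=True,
)
pipe.to("cuda")

# Illustrative prompt; adjust steps and guidance to taste.
prompt = "a lighthouse on a rocky cliff at golden hour, cinematic lighting"
image = pipe(prompt, num_inference_steps=30).images[0]
image.save("sdxl_09_sample.png")
```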
FAQ
Q: What is ZeroScope V2?
A: ZeroScope V2 is a new collection of three high-quality text-to-video models released by TechSoup, trained on nearly 10,000 video clips.
Q: What improvements does Stable Diffusion XL 0.9 offer?
A: The XL 0.9 model generates images with greater detail, better composition, colors and lighting than previous Stable Diffusion models.
Q: What is special about the Textbooks are All You Need model?
A: This model was trained on roughly 6 billion tokens of high-quality, textbook-like data filtered from the web, yet achieves over 50% on the HumanEval coding benchmark, outperforming models 10-100x its size.
Q: What does the Kosmos-2 model allow?
A: Kosmos-2 takes a step toward artificial general intelligence by understanding multimodal input across language, visual perception, and action.
Q: What does the AudioPaLM model do?
A: AudioPaLM is a speech-to-speech translation model that preserves the speaker's voice and allows choosing an output accent.