This ML Scientist reproduced Karpathy's GPT-2 for Audio!!!
TLDR
Machine learning engineer Shas Vah successfully adapted Andrej Karpathy's GPT-2 to audio, creating a model that takes audio input and produces audio output. Despite overfitting and other imperfections, the model demonstrates the potential of the GPT-2 architecture for multimodal applications. Shas shares insights from the project's development, emphasizing the role of data and computational resources in advancing such models.
Takeaways
- 🧠 Shas Vah, a machine learning engineer, successfully reproduced Andrej Karpathy's GPT-2 for audio, demonstrating the model's ability to process and generate audio data.
- 📚 Shas has a background in data science with a focus on machine learning and has been working on language models in his role at Expedia, which inspired him to explore the audio modality.
- 🔊 The project leverages the GPT-2 architecture, with the primary change being the adaptation to handle audio data through the use of a specialized tokenizer.
- 📈 Despite the model's overfitting, the fact that it can generate audio at all is considered a significant achievement and a proof of concept.
- 🎵 Shas used an open domain audio dataset from LibriVox for training, which allowed the model to learn the format of audio sequences quickly.
- 🔧 The training setup was modified so the inputs hold audio tokens rather than text, and the model was trained to a point where it could generate audio that, while not perfect, showed promise (see the config sketch after this list).
- 🔄 Shas discussed the potential for multimodal models that can process text, audio, and images natively without separate model heads, highlighting the flexibility of large language models.
- 🚀 The experiment suggests that with more data and computational power, it's possible to achieve higher quality audio generation and even explore other modalities like video.
- 💡 Shas was inspired by the potential of LLMs to extend beyond text to other modalities and encourages others to experiment with different data sets and models to see what's possible.
- 🔗 Shas shared his work on GitHub and Medium, providing others with the opportunity to replicate his experiment and explore further applications.
- 🌐 The discussion touched on the importance of efficiency in model deployment, with Shas expressing excitement about the potential for models to run on lower-powered devices.
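As a rough illustration of the takeaway above about resizing the inputs, the sketch below shows the kind of configuration change involved, written in the style of Karpathy's GPT-2 code. The specific numbers (codebook size, context length, the extra separator token) are assumptions for illustration, not values confirmed in the interview.

```python
from dataclasses import dataclass

@dataclass
class GPTConfig:
    # Illustrative values only; Shas's exact settings are not given in the interview.
    block_size: int = 1024    # context length, now measured in audio-codec tokens
    vocab_size: int = 12_289  # e.g. 3 codebooks x 4096 codes + 1 separator token
    n_layer: int = 12
    n_head: int = 12
    n_embd: int = 768

# Text GPT-2 uses a 50,257-token BPE vocabulary. For audio, the architecture stays
# the same; only the embedding table (vocab_size) and the meaning of the input
# sequence change, since the tokens now come from an audio codec instead of text.
audio_config = GPTConfig()
```

The rest of the GPT-2 code (attention blocks, optimizer, training loop) can remain untouched, which is what makes this a reproduction "for audio" rather than a new architecture.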
Q & A
What is the main achievement of the machine learning engineer Shas Vah in the context of this script?
-Shas Vah has successfully ported Andrej Karpathy's GPT-2 model to work with audio, creating a model that can take audio input and generate audio output based on the GPT-2 architecture.
What is the current limitation of the model that Shas Vah has developed?
-The model currently overfits heavily and is far from perfect. It works to some extent but cannot generalize much beyond the specific training data it was exposed to.
What is Shas Vah's professional background, and how does it relate to his project on GPT-2 for audio?
-Shas Vah works at Expedia as a machine learning scientist. He has a background in data science and has been working with large language models (LLMs) in his role at Expedia's NLP team, which gave him the necessary knowledge to undertake this personal research project.
What inspired Shas Vah to attempt the adaptation of GPT-2 for audio?
-Shas Vah was inspired by Andrej Karpathy's video on GPT-2 and the idea of creating a singular model capable of reasoning over multiple modalities, such as text, images, and audio, without needing separate model heads for each.
Can you explain the concept of 'native multimodality' as mentioned in the script?
-Native multimodality refers to a single model that can process and reason over different types of data (like text, images, and audio) natively, without needing separate specialized components or 'heads' for each data type.
What is the significance of the tokenization method used in Shas Vah's project?
-The tokenization method is crucial: it converts audio into a sequence of tokens the model can understand and process. Shas Vah used the SNAC tokenizer, whose hierarchical structure represents the audio at several levels of temporal granularity.
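For concreteness, here is a minimal sketch of what that hierarchical encoding and flattening could look like, assuming the tokenizer in question is the open-source SNAC codec (hubertsiuzdak/snac) and that its 24 kHz model returns three code levels in a 1:2:4 frame ratio. The flattening scheme and ID offsets are illustrative assumptions, not necessarily the exact ones Shas used.

```python
import torch
from snac import SNAC  # pip install snac -- assumes the open-source SNAC codec

codec = SNAC.from_pretrained("hubertsiuzdak/snac_24khz").eval()

def flatten_codes(codes: list[torch.Tensor], codebook_size: int = 4096) -> list[int]:
    """Interleave hierarchical SNAC codes, frame by frame, into one flat stream.

    Assumes three levels where each finer level has twice the frames of the one
    above it. Offsetting each level by the codebook size keeps the levels in
    disjoint ID ranges, so a single GPT-2 embedding table can cover all of them.
    """
    coarse, mid, fine = (c.squeeze(0).tolist() for c in codes)
    flat: list[int] = []
    for t in range(len(coarse)):
        flat.append(coarse[t])
        flat.extend(m + codebook_size for m in mid[2 * t : 2 * t + 2])
        flat.extend(f + 2 * codebook_size for f in fine[4 * t : 4 * t + 4])
    return flat

# Encode one second of (here random) 24 kHz audio and flatten it for the model.
audio = torch.randn(1, 1, 24_000)            # (batch, channels, samples)
with torch.inference_mode():
    codes = codec.encode(audio)
tokens = flatten_codes(codes)
```

Under these assumptions, each codec frame contributes seven tokens (1 coarse + 2 mid + 4 fine) to the sequence the model is trained on.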
What is the current state of the project in terms of training and data used?
-The model has been trained on a small dataset from LibriVox and begins overfitting after a few thousand training steps. The limited dataset restricts the variety of audio patterns the model can learn.
What are Shas Vah's future plans for improving the model?
-Shas Vah plans to build a larger and more diverse dataset to pre-train the model on. He is also considering experimenting with different architectures like Mamba and exploring the possibility of zero-shot voice cloning.
What is the computational cost of training the model as described in the script?
-Training the model on a larger dataset would require significant computational resources. Shas Vah mentions that training on a larger scale could take weeks or months, even with access to GPUs.
How does Shas Vah's project relate to other works in the field, such as Meta's Chameleon and Byte GPT?
-Like Meta's Chameleon and Byte GPT, Shas Vah's project explores training models across multiple modalities. However, his work is at a much smaller scale and focuses specifically on adapting GPT-2 for audio.
What are some of the technical challenges Shas Vah faced or might face in his project?
-Some of the challenges include creating a tokenizer that effectively converts audio into a format the model can learn from, managing overfitting due to a small dataset, and the computational demands of training a model on a larger and more diverse dataset.
Outlines
🤖 Machine Learning Engineer's Audio GPT-2 Adaptation
Shas Vah, a machine learning engineer at Expedia, discusses his personal project of adapting Andrej Karpathy's GPT-2 model to take audio input and produce audio output. He emphasizes the novelty of the approach despite the model's current overfitting issues. Shas shares his educational background in data science and his experience with large language models (LLMs) at work, which inspired him to experiment with audio processing on top of the GPT-2 architecture. He also mentions previous work such as Meta's Chameleon and Byte GPT, which demonstrated multimodal capabilities in models.
🔊 Exploring Audio Tokenization and Model Training
The discussion shifts to the technical details of the project: how Shas used the SNAC tokenizer to convert audio into a hierarchical token sequence, how those tokens are flattened for input to the GPT-2 model, and the challenges of working with a small LibriVox dataset. Shas details the modifications he made to the original code and the rapid overfitting caused by the limited data variety. He also touches on how training on a larger, more diverse dataset could improve the realism of the generated audio.
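One possible way to turn such a corpus into training sequences, reusing the `flatten_codes` helper sketched earlier; the separator ID and directory layout are hypothetical, and the real preprocessing may differ:

```python
from pathlib import Path

import torch
import torchaudio

SEP_TOKEN = 3 * 4096  # hypothetical separator ID, one past the three offset code ranges

def build_token_stream(clip_dir: str, codec, target_sr: int = 24_000) -> torch.Tensor:
    """Tokenize every clip in a directory and join the results with a separator token."""
    stream: list[int] = []
    for path in sorted(Path(clip_dir).glob("*.wav")):
        wav, sr = torchaudio.load(str(path))
        mono = wav.mean(dim=0, keepdim=True)                       # mix down to one channel
        mono = torchaudio.functional.resample(mono, sr, target_sr)
        with torch.inference_mode():
            codes = codec.encode(mono.unsqueeze(0))                # (1, 1, samples) -> codes
        stream.append(SEP_TOKEN)                                   # mark the clip boundary
        stream.extend(flatten_codes(codes))
    return torch.tensor(stream, dtype=torch.long)
```

Training batches can then be cut from this stream as contiguous block_size windows, exactly as in the text setup, which helps explain why the model picks up the separator-and-frame format so quickly even on little data.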
📈 Model's Rapid Learning and Overfitting
Shas describes the model's quick learning curve, noting how rapidly it picked up the formatting of the SNAC token sequences. He discusses the use of a separator in the token sequence and the model's adherence to that format. Despite the overfitting, Shas is encouraged that the model can generate arbitrarily long output that decodes into audio, indicating it has learned the correct format, even if the content still needs refinement with more varied data.
🎙️ Generating Audio and Future Improvements
The conversation continues with Shas demonstrating the model's current capabilities, including generating audio from a single separator token and reproducing audio from the training dataset. He talks about the slow inference process on Google Colab and his plan to build a better dataset for pre-training. Shas also considers training the model on multiple voices and the challenge of collecting a large amount of clean, single-voice audio data.
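A rough sketch of that demo flow, assuming a nanoGPT-style `generate` method on the model and inverting the earlier (hypothetical) flattening scheme; `codec` and `SEP_TOKEN` are the same assumed objects as above:

```python
def unflatten_codes(tokens: list[int], codebook_size: int = 4096) -> list[torch.Tensor]:
    """Invert flatten_codes: split the flat stream back into the three code levels."""
    coarse: list[int] = []
    mid: list[int] = []
    fine: list[int] = []
    for i in range(0, len(tokens) - 6, 7):        # 1 coarse + 2 mid + 4 fine per frame
        coarse.append(tokens[i])
        mid.extend(t - codebook_size for t in tokens[i + 1 : i + 3])
        fine.extend(t - 2 * codebook_size for t in tokens[i + 3 : i + 7])
    return [torch.tensor(xs, dtype=torch.long).unsqueeze(0) for xs in (coarse, mid, fine)]

# Prompt the model with just the separator token, as in the demo, then decode
# whatever it continues with back into a waveform (dropping stray separators).
prompt = torch.tensor([[SEP_TOKEN]], dtype=torch.long)
generated = model.generate(prompt, max_new_tokens=7 * 200)[0].tolist()  # ~200 codec frames
audio_tokens = [t for t in generated if t != SEP_TOKEN]
with torch.inference_mode():
    waveform = codec.decode(unflatten_codes(audio_tokens))
```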
🚀 Scaling Up with More Data and Compute
Shas reflects on the potential of scaling up the project with more data and computational resources. He references the experience of training large models like GPT-3 and the substantial number of tokens needed to reach decent output quality. Shas also contemplates zero-shot voice cloning and the implications of training on multiple voices, while acknowledging the current limitations in compute power and the costs of training large models.
🌐 Broadening LLMs to Multimodal Applications
Shas shares his excitement about the future of LLMs, particularly their growing efficiency and the possibility of running them on devices with limited compute. He mentions projects such as MatMul-free language models that aim to reduce memory usage while maintaining performance. Shas encourages more experimentation with LLMs across different modalities, such as video, and suggests that with the right tokenizer one could train models like GPT-2 to generate content in new ways.
🔍 Looking Forward to Advances in LLM Efficiency
In the closing segment, Shas expresses his enthusiasm for ongoing advances in LLM efficiency, citing examples of models running on smartphones and the potential for local training of large models in the future. He also discusses the impact of these developments on demand for GPU resources and the possibility of training even larger models as compute efficiency improves.
📝 Conclusions and Contacting Shas
The interview concludes with Shas sharing his thoughts on how easy it has become to experiment with LLMs, along with his contact information for anyone interested in following his work or discussing ideas. He highlights the importance of empirical testing, in addition to theoretical considerations, when working with LLMs.
Keywords
💡Machine Learning Engineer
💡GPT-2
💡Overfitting
💡Multi-modality
💡Tokenizer
💡Data Set
💡NLP (Natural Language Processing)
💡Inference
💡AudioPaLM
💡Fine-tuning
💡Collaboratory (Colab)
Highlights
A machine learning engineer, Shas Vah, has successfully ported Andrej Karpathy's GPT-2 for audio processing.
The model takes audio input and produces audio output, demonstrating the GPT-2 architecture's adaptability to audio.
Despite overfitting, the fact that the model works at all is described as 'magical' and worth exploring.
Shas Vah works at Expedia as a machine learning scientist and conducted this project as personal research.
Shas has a background in data science and machine learning, with recent experience in large language models.
The project's code is based on Andrej Karpathy's original GPT-2, with modifications for audio data.
Interest in audio and text-to-speech models led Shas to explore merging these with large language models.
The idea was inspired by the launch of GPT-4 and its native multimodality, aiming for a singular model for reasoning.
Previous work like Meta's Chameleon and Byte GPT showed the potential for models to handle multiple modalities.
Shas discusses the hierarchical structure of the SNAC tokenizer and its role in the project.
The hierarchical audio tokens are flattened into a single sequence for the model to process.
Shas built a tokenizer around the SNAC model and used it to encode and decode audio.
The model was trained on a small dataset from LibriVox, leading to quick overfitting.
The model learned to produce sequences in the SNAC format very quickly, even with limited data.
At around 4,000 to 5,000 steps, the model began overfitting and replicating the training data.
Shas is considering building a larger, more diverse dataset for better model training.
The project suggests that with enough data and compute, multimodal models can be trained effectively.
Shas encourages others to experiment with different modalities and models, given the accessibility of tools.
The efficiency of models running on lower-end hardware is a significant development in the field.
Shas is inspired by the potential for local training of large models in the future.
The project and discussion highlight the importance of experimentation and the potential of current LLMs.