This ML Scientist reproduced Karpathy's GPT-2 for Audio!!!

1littlecoder Podcast
15 Jun 2024 · 30:37

TLDR: Machine learning engineer Shas Vah successfully adapted Andrej Karpathy's GPT-2 for audio processing, creating a model that takes audio input and produces audio output. Despite overfitting and imperfections, the model demonstrates the potential of the GPT-2 architecture for multimodal applications. Shas shares insights on the project's development, emphasizing the importance of data and computational resources in advancing machine learning models.


  • 🧠 Shas Vah, a machine learning engineer, successfully reproduced Andrej Karpathy's GPT-2 for audio, demonstrating the model's ability to process and generate audio data.
  • 📚 Shas has a background in data science with a focus on machine learning and has been working on language models in his role at Expedia, which inspired him to explore the audio modality.
  • 🔊 The project leverages the GPT-2 architecture, with the primary change being the adaptation to handle audio data through the use of a specialized tokenizer.
  • 📈 Despite the model's overfitting, the fact that it can generate audio at all is considered a significant achievement and a proof of concept.
  • 🎵 Shas used a public-domain audio dataset from LibriVox for training, which allowed the model to learn the format of audio sequences quickly.
  • 🔧 The model was trained with modifications to the input size to accommodate audio data, rather than text, and it was trained to a point where it could generate audio that, while not perfect, showed promise.
  • 🔄 Shas discussed the potential for multimodal models that can process text, audio, and images natively without separate model heads, highlighting the flexibility of large language models.
  • 🚀 The experiment suggests that with more data and computational power, it's possible to achieve higher quality audio generation and even explore other modalities like video.
  • 💡 Shas was inspired by the potential of LLMs to extend beyond text to other modalities and encourages others to experiment with different data sets and models to see what's possible.
  • 🔗 Shas shared his work on GitHub and Medium, providing others with the opportunity to replicate his experiment and explore further applications.
  • 🌐 The discussion touched on the importance of efficiency in model deployment, with Shas expressing excitement about the potential for models to run on lower-powered devices.
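
One plausible reading of the input-size takeaway above is that the vocabulary shrinks from GPT-2's text vocabulary to a much smaller set of audio codes. The sketch below assumes the 124M GPT-2 shape from Karpathy's reproduction and a hypothetical audio vocabulary of 4,096 codes plus one separator, then counts parameters for both. The audio figures are illustrative assumptions, not values reported for the project.

```python
from dataclasses import dataclass


@dataclass
class GPTConfig:
    block_size: int = 1024    # context length in tokens
    vocab_size: int = 50257   # GPT-2 text vocabulary
    n_layer: int = 12
    n_head: int = 12
    n_embd: int = 768


def count_parameters(cfg: GPTConfig) -> int:
    """Parameter count for a GPT-2 style model with tied input/output embeddings."""
    token_emb = cfg.vocab_size * cfg.n_embd
    pos_emb = cfg.block_size * cfg.n_embd
    # Per block: attention (4*d^2 + 4*d), MLP (8*d^2 + 5*d), two LayerNorms (4*d).
    per_block = 12 * cfg.n_embd ** 2 + 13 * cfg.n_embd
    final_norm = 2 * cfg.n_embd
    return token_emb + pos_emb + cfg.n_layer * per_block + final_norm


text_cfg = GPTConfig()
# Assumption: 4,096 audio codes plus one separator token.
audio_cfg = GPTConfig(vocab_size=4096 + 1)

print(f"text vocabulary:  {count_parameters(text_cfg):,} parameters")
print(f"audio vocabulary: {count_parameters(audio_cfg):,} parameters")
```

Under these assumptions only the token embedding table changes size, so the transformer blocks themselves are untouched.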

Q & A

  • What is the main achievement of the machine learning engineer Shas Vah in the context of this script?

    -Shas Vah has successfully ported Andrej Karpathy's GPT-2 model to work with audio, creating a model that can take audio input and generate audio output based on the GPT-2 architecture.

  • What is the current limitation of the model that Shas Vah has developed?

    -The model currently overfits heavily and is far from perfect. It works to some extent but generalizes poorly beyond the specific training data it has seen.

  • What is Shas Vah's professional background, and how does it relate to his project on GPT-2 for audio?

    -Shas Vah works at Expedia as a machine learning scientist. He has a background in data science and has been working with large language models (LLMs) in his role at Expedia's NLP team, which gave him the necessary knowledge to undertake this personal research project.

  • What inspired Shas Vah to attempt the adaptation of GPT-2 for audio?

    -Shas Vah was inspired by Andrej Karpathy's video on GPT-2 and the idea of creating a singular model capable of reasoning over multiple modalities, such as text, images, and audio, without needing separate model heads for each.

  • Can you explain the concept of 'native multimodality' as mentioned in the script?

    -Native multimodality refers to a single model that can process and reason over different types of data (like text, images, and audio) natively, without needing separate specialized components or 'heads' for each data type.

  • What is the significance of the tokenization method used in Shas Vah's project?

    -The tokenization method is crucial as it converts audio into a sequence of tokens that the model can understand and process. Shas Vah used the SNACK tokenizer, which has a hierarchical structure, allowing the audio to be represented at different levels of granularity.

  • What is the current state of the project in terms of training and data used?

    -The project has been trained on a small dataset from LibriVox, resulting in overfitting after a certain number of training steps. The dataset is limited, which restricts the model's ability to learn a wide variety of audio patterns.

  • What are Shas Vah's future plans for improving the model?

    -Shas Vah plans to build a larger and more diverse dataset to pre-train the model on. He is also considering experimenting with different architectures like Mamba and exploring the possibility of zero-shot voice cloning.

  • What is the computational cost of training the model as described in the script?

    -Training the model on a larger dataset would require significant computational resources. Shas Vah mentions that training on a larger scale could take weeks or months, even with access to GPUs.

  • How does Shas Vah's project relate to other works in the field, such as Meta's Chameleon and Byte GPT?

    -Like Meta's Chameleon and Byte GPT, Shas Vah's project explores the possibility of training models on multiple modalities. However, his work is at a smaller scale and focuses specifically on adapting GPT-2 for audio processing.

  • What are some of the technical challenges Shas Vah faced or might face in his project?

    -Some of the challenges include creating a tokenizer that effectively converts audio into a format the model can learn from, managing overfitting due to a small dataset, and the computational demands of training a model on a larger and more diverse dataset.



🤖 Machine Learning Engineer's Audio GPT-2 Adaptation

Shas Vah, a machine learning engineer at Expedia, discusses his personal project of adapting Andrej Karpathy's GPT-2 model for audio input and output. He emphasizes the novelty of this approach, despite the model's current overfitting issues. Shas shares his educational background in data science and his experience with large language models (LLMs) at work, which inspired him to experiment with audio processing using the GPT-2 architecture. He also mentions previous works like Meta's Chameleon and Byte GPT, which have shown multimodal capabilities in models.


🔊 Exploring Audio Tokenization and Model Training

The discussion shifts to the technical aspects of Shas Vah's project, focusing on how he used the SNACK tokenizer to convert audio into a hierarchical token sequence. He explains the process of flattening these tokens for input into the GPT-2 model and the challenges of working with a small dataset from LibriVox. Shas details the modifications he made to the original code and the rapid overfitting of the model due to the limited data variety. He also touches on the potential of training on a larger and more diverse dataset to improve the model's realism.
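
The flattening step described above can be sketched as follows. This is a minimal illustration, not the project's code: it assumes three code levels whose lengths stand in a 1:2:4 ratio, a hypothetical separator ID of 4096, and one possible depth-first interleaving order. The function name `flatten_frames` and all values are made up for the example.

```python
from typing import List

SEPARATOR = 4096  # hypothetical ID, one past an assumed 4,096-entry codebook


def flatten_frames(coarse: List[int], mid: List[int], fine: List[int]) -> List[int]:
    """Interleave three code levels (lengths T, 2T, 4T) into one token stream."""
    if len(mid) != 2 * len(coarse) or len(fine) != 4 * len(coarse):
        raise ValueError("expected level lengths in a 1:2:4 ratio")
    tokens: List[int] = []
    for t, c in enumerate(coarse):
        tokens.append(SEPARATOR)   # marks the start of a frame
        tokens.append(c)           # one coarse code per frame
        for j in range(2):         # two mid codes per coarse code
            tokens.append(mid[2 * t + j])
            # two fine codes follow each mid code
            start = 4 * t + 2 * j
            tokens.extend(fine[start:start + 2])
    return tokens


flat = flatten_frames(
    coarse=[10, 11],
    mid=[20, 21, 22, 23],
    fine=[30, 31, 32, 33, 34, 35, 36, 37],
)
print(flat)
# [4096, 10, 20, 30, 31, 21, 32, 33, 4096, 11, 22, 34, 35, 23, 36, 37]
```

Each frame becomes a fixed run of eight tokens, which gives a language model a regular pattern to pick up before it learns anything about the audio content itself.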


📈 Model's Rapid Learning and Overfitting

Shas describes the model's quick learning curve, noting how it rapidly picked up the formatting sequence of the SNACK tokens. He discusses the use of a separator in the token sequence and the model's adherence to this format. Despite the model's overfitting, Shas is encouraged by its ability to generate output of unbounded length that still decodes into audio, indicating that the model has learned the correct format, even if the content is yet to be refined with more varied data.
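
Whether generated output adheres to the separator format can be checked mechanically before decoding. The sketch below assumes a hypothetical layout of eight tokens per frame (one separator, one coarse code, two mid codes, four fine codes) and a made-up separator ID; the real layout depends on the tokenizer configuration.

```python
from typing import List, Tuple

SEPARATOR = 4096   # hypothetical separator ID
FRAME_LEN = 8      # assumed: separator + 1 coarse + 2 mid + 4 fine codes


def follows_format(tokens: List[int]) -> bool:
    """True if separators appear at every frame start and nowhere else."""
    if len(tokens) % FRAME_LEN != 0:
        return False
    for i, tok in enumerate(tokens):
        at_frame_start = i % FRAME_LEN == 0
        if (tok == SEPARATOR) != at_frame_start:
            return False
    return True


def unflatten(tokens: List[int]) -> Tuple[List[int], List[int], List[int]]:
    """Undo the interleave and return (coarse, mid, fine) code lists."""
    if not follows_format(tokens):
        raise ValueError("token stream does not follow the frame format")
    coarse: List[int] = []
    mid: List[int] = []
    fine: List[int] = []
    for start in range(0, len(tokens), FRAME_LEN):
        _, c, m0, f0, f1, m1, f2, f3 = tokens[start:start + FRAME_LEN]
        coarse.append(c)
        mid.extend([m0, m1])
        fine.extend([f0, f1, f2, f3])
    return coarse, mid, fine


good = [4096, 10, 20, 30, 31, 21, 32, 33]
bad = [10, 4096, 20, 30, 31, 21, 32, 33]
print(follows_format(good), follows_format(bad))
print(unflatten(good))
```

A stream that passes this check can be handed to the audio decoder, which is why learning the format early matters even when the content is still poor.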


🎙️ Generating Audio and Future Improvements

The conversation continues with Shas demonstrating the model's current capabilities, including generating audio from a single separator input and replicating input audio from the training dataset. He talks about the slow inference process on Google Colab and his future plans to build a better dataset for pre-training. Shas also considers the possibility of training the model on multiple voices and the challenge of collecting a large amount of clean, single-voice audio data.
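
Generating from a single separator is ordinary autoregressive sampling: feed the tokens so far, sample the next one, append, and repeat. The sketch below shows that loop with a stand-in model that returns uniform logits over a tiny hypothetical vocabulary. The names `generate`, `sample_next`, and `uniform_logits` are illustrative, and a real run would call the trained network instead.

```python
import math
import random
from typing import Callable, List, Sequence

VOCAB_SIZE = 9   # toy vocabulary: codes 0..7 plus a separator
SEPARATOR = 8    # hypothetical separator ID for this toy setup


def sample_next(logits: Sequence[float], temperature: float, rng: random.Random) -> int:
    """Sample one token ID from temperature-scaled softmax probabilities."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    weights = [math.exp(value - peak) for value in scaled]
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


def generate(
    next_logits: Callable[[List[int]], Sequence[float]],
    prompt: List[int],
    max_new_tokens: int,
    temperature: float = 1.0,
    seed: int = 0,
) -> List[int]:
    """Extend the prompt one sampled token at a time."""
    rng = random.Random(seed)
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(sample_next(next_logits(tokens), temperature, rng))
    return tokens


def uniform_logits(tokens: List[int]) -> List[float]:
    """Stand-in for a trained model: every token is equally likely."""
    return [0.0] * VOCAB_SIZE


out = generate(uniform_logits, prompt=[SEPARATOR], max_new_tokens=16)
print(len(out), out[0] == SEPARATOR)
```

Because every new token requires a pass over the sequence so far, long audio clips need many sequential steps, which is consistent with the slow inference mentioned above.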


🚀 Scaling Up with More Data and Compute

Shas reflects on the potential of scaling up the project with more data and computational resources. He references the experience of training large models like GPT-3 and the need for a substantial number of tokens to achieve decent output quality. Shas also contemplates the idea of zero-shot voice cloning and the implications of training on multiple voices. He acknowledges the current limitations in compute power and the costs associated with training large models.


🌐 Broadening LLMs to Multimodal Applications

Shas shares his excitement about the future of LLMs, particularly their efficiency and the possibility of running them on devices with less compute. He mentions projects like the MatMul-free LLM that aim to reduce memory usage while maintaining performance. Shas encourages more experimentation with LLMs across different modalities, such as video, and suggests that with the right tokenizer, one could train models like GPT-2 to generate content in new ways.


🔍 Looking Forward to Advances in LLM Efficiency

In the final segment, Shas expresses his enthusiasm for the ongoing advancements in LLM efficiency, citing examples of models running on smartphones and the potential for local training of large models in the future. He also discusses the impact of these developments on the demand for GPU resources and the possibility of training even larger models as compute efficiency improves.

📝 Conclusions and Contacting Shas Vah

The interview concludes with Shas sharing his thoughts on the ease of experimenting with LLMs and his contact information for those interested in following his work or discussing ideas. He highlights the importance of empirical testing in addition to theoretical considerations when working with LLMs.



💡Machine Learning Engineer

A machine learning engineer is a professional who applies machine learning techniques to build systems that can learn from and make decisions based on data. In the context of the video, Shas Vah, the machine learning engineer, has taken the initiative to adapt Andrej Karpathy's GPT-2 model for processing audio inputs and generating audio outputs, showcasing the potential of machine learning in audio applications.


💡GPT-2

GPT-2, which stands for 'Generative Pre-trained Transformer 2,' is an advanced artificial neural network developed by OpenAI. It is designed to generate human-like text based on given prompts. In the video, Shas Vah has adapted this model to work with audio, demonstrating the flexibility and adaptability of GPT-2's underlying architecture.


💡Overfitting

Overfitting occurs in machine learning when a model learns the training data too well, to the extent that it negatively impacts the model's performance on new, unseen data. In the script, it is mentioned that the adapted GPT-2 model for audio is overfitting, meaning it has learned the training data so well that it can replicate it but may not generalize well to new audio inputs.
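
A common way to spot overfitting during training is to track loss on held-out data alongside training loss and note where the held-out loss stops improving. The sketch below uses an invented loss history purely for illustration. The numbers are not measurements from the project, and `first_overfit_step` is a hypothetical helper.

```python
from typing import List, Optional, Tuple

# (step, training loss, validation loss)
LossRecord = Tuple[int, float, float]


def first_overfit_step(history: List[LossRecord], patience: int = 2) -> Optional[int]:
    """Return the step with the best validation loss once that loss has
    failed to improve `patience` evaluations in a row, else None."""
    best_val = float("inf")
    best_step: Optional[int] = None
    evals_without_gain = 0
    for step, _train_loss, val_loss in history:
        if val_loss < best_val:
            best_val, best_step, evals_without_gain = val_loss, step, 0
        else:
            evals_without_gain += 1
            if evals_without_gain >= patience:
                return best_step
    return None


# Made-up values: training loss keeps falling while validation loss turns upward.
history = [
    (1000, 5.0, 5.1),
    (2000, 4.0, 4.3),
    (3000, 3.0, 4.0),
    (4000, 2.0, 4.2),
    (5000, 1.0, 4.6),
]
print(first_overfit_step(history))  # 3000
```

The widening gap between the two losses after the returned step is the signature of a model memorizing its training set.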


💡Multi-modality

Multi-modality refers to the ability of a system to process and understand multiple types of input data, such as text, images, and audio. The video discusses the concept of a 'native multi-modality' model, which is a single model capable of reasoning over different data types without needing separate specialized components for each.


💡Tokenizer

A tokenizer is a tool used in natural language processing to convert text or, in this case, audio into tokens, which are discrete units the model can process. The script describes the use of a tokenizer to convert audio into a sequence of tokens for the adapted GPT-2 model, illustrating an essential step in preparing data for machine learning models.

💡Data Set

A data set is a collection of data used for analysis or to train machine learning models. In the script, Shas Vah mentions using a data set from LibriVox, a public-domain audiobook collection, to train the adapted GPT-2 model, emphasizing the importance of quality data for model training.

💡NLP (Natural Language Processing)

NLP is a field of computer science and artificial intelligence that focuses on the interaction between computers and human language. The video touches on the use of large language models like GPT-2 in NLP tasks, and how these models are now being adapted for other modalities such as audio.


💡Inference

Inference in the context of machine learning refers to the process of making predictions or decisions based on a trained model. The script discusses the inference phase for the adapted GPT-2 model, where it generates audio output based on the learned patterns from the training data.

💡AudioPaLM

AudioPaLM is mentioned in the script as a previous work that involved processing both audio and text. It serves as an example in the script's discussion of the evolution of machine learning models that can handle multiple modalities, including audio.


💡Fine-tuning

Fine-tuning is the process of further training a machine learning model on a specific task after it has been pre-trained on a larger dataset. The script suggests that a model like the adapted GPT-2 could be fine-tuned for various tasks involving audio, such as text-to-speech conversion.

💡Colaboratory (Colab)

Colab, short for 'Google Colaboratory,' is a cloud-based platform for machine learning education and research. It is mentioned in the script as the environment where Shas Vah trained the adapted GPT-2 model, highlighting the accessibility of machine learning experimentation through cloud computing.


A machine learning engineer, Shas Vah, has successfully ported Andrej Karpathy's GPT-2 for audio processing.

The model takes audio input and produces audio output, demonstrating the GPT-2 architecture's adaptability to audio.

Despite overfitting, the model's functionality is considered 'magical' and worthy of exploration.

Shas Vah works at Expedia as a machine learning scientist and conducted this project as personal research.

Shas has a background in data science and machine learning, with recent experience in large language models.

The project's code is based on Andrej Karpathy's original GPT-2, with modifications for audio data.

Interest in audio and text-to-speech models led Shas to explore merging these with large language models.

The idea was inspired by the launch of GPT-4o and its native multimodality, aiming for a singular model for reasoning.

Previous work like Meta's Chameleon and Byte GPT showed the potential for models to handle multiple modalities.

Shas discusses the hierarchical structure of the SNACK tokenizer and its role in the project.

The tokenizer flattens the hierarchical audio tokens into a sequence for model processing.

Shas built a tokenizer around the SNACK model and used it for encoding and decoding audio.

The model was trained on a small dataset from LibriVox, leading to quick overfitting.

The model learned to format sequences in the SNACK style very quickly, even with limited data.

At around 4,000 to 5,000 steps, the model began overfitting and replicating the training data.

Shas is considering building a larger, more diverse dataset for better model training.

The project suggests that with enough data and compute, multimodal models can be trained effectively.

Shas encourages others to experiment with different modalities and models, given the accessibility of tools.

The efficiency of models running on lower-end hardware is a significant development in the field.

Shas is inspired by the potential for local training of large models in the future.

The project and discussion highlight the importance of experimentation and the potential of current LLMs.