This course is designed to help beginners learn how to train a language model from start to finish. Imad will guide you through the whole process, using Moroccan Darija as an example.

In this course, you will learn:

- How to load text data
- How to train a tokenizer from scratch using the Byte Pair Encoding (BPE) method
- How to use the tokenizer to encode text data
- How the Transformer architecture works in language models
- How to pre-train a model
- How to create a supervised fine-tuning dataset
- How to fine-tune the model and build an AI assistant that you can chat with
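
To make the tokenizer steps in the list above more concrete, here is a minimal byte-level BPE sketch in plain Python. It is an illustration only and not necessarily the code used in the course: the function names (`train_bpe`, `encode`, `decode`), the sample sentence, and the number of merges are all assumptions chosen for the example.

```python
from collections import Counter


def get_pair_counts(ids):
    """Count how often each adjacent pair of token ids occurs."""
    return Counter(zip(ids, ids[1:]))


def merge_pair(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    merged = []
    i = 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            merged.append(new_id)
            i += 2
        else:
            merged.append(ids[i])
            i += 1
    return merged


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merges, starting from raw UTF-8 bytes (ids 0-255)."""
    ids = list(text.encode("utf-8"))
    merges = {}  # (id_a, id_b) -> new_id, kept in training order
    for step in range(num_merges):
        counts = get_pair_counts(ids)
        if not counts:
            break  # nothing left to merge
        pair = max(counts, key=counts.get)  # most frequent adjacent pair
        new_id = 256 + step
        ids = merge_pair(ids, pair, new_id)
        merges[pair] = new_id
    return merges


def encode(text, merges):
    """Encode text by replaying the learned merges in training order."""
    ids = list(text.encode("utf-8"))
    for pair, new_id in merges.items():
        ids = merge_pair(ids, pair, new_id)
    return ids


def decode(ids, merges):
    """Map token ids back to bytes, then to a string."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (a, b), new_id in merges.items():
        vocab[new_id] = vocab[a] + vocab[b]
    return b"".join(vocab[i] for i in ids).decode("utf-8", errors="replace")


if __name__ == "__main__":
    sample = "salam salam labas"  # made-up sample text, not from the course data
    merges = train_bpe(sample, num_merges=5)
    token_ids = encode(sample, merges)
    print(token_ids)
    print(decode(token_ids, merges))
```

Starting from bytes means any script, including Arabic-script Darija, can be encoded without an unknown-token fallback; a real tokenizer would typically add pre-tokenization, special tokens, and a much larger number of merges.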

You can find the slides, notebook, and scripts in this GitHub repository:  
https://github.com/ImadSaddik/Train_Your_Language_Model_Course

The supervised fine-tuning dataset is available here:  
https://github.com/ImadSaddik/BoDmaghDataset  
https://huggingface.co/datasets/ImadSaddik/BoDmaghDataset

The tokenizers trained on AtlaSet can be found here:  
https://github.com/ImadSaddik/DarijaTokenizers

You can access AtlaSet on Hugging Face here:  
https://huggingface.co/datasets/atlasia/Atlaset

To connect with Imad Saddik, check out his social accounts:  
- LinkedIn: https://www.linkedin.com/in/imadsaddik/  
- YouTube: https://www.youtube.com/@3CodeCampers  
- Discord: imad_saddik

❤️ Support for this channel comes from our friends at Scrimba – the coding platform that's reinvented interactive learning: https://scrimba.com/freecodecamp

⭐️ Course Contents ⭐️  
(0:00:00) About the Course  
(0:03:03) Introduction  
(0:07:24) Training Data  
(0:15:33) Tokenization  
(0:29:00) The Transformer Architecture  
(0:52:21) Pre-training  
(1:24:46) Fine-tuning Dataset  
(1:33:05) Instruction Fine-tuning  
(2:06:17) Fine-tuning with LoRA  
(2:20:39) Let's Scale Everything  
(3:09:40) Bonus  
(3:27:10) Conclusion  

🎉 Thanks to our Champion and Sponsor supporters:
👾 Drake Milly
👾 Ulises Moralez
👾 Goddard Tan
👾 David MG
👾 Matthew Springman
👾 Claudio
👾 Oscar R.
👾 jedi-or-sith
👾 Nattira Maneerat
👾 Justin Hual

--

Learn to code for free and get a developer job: https://www.freecodecamp.org

Read hundreds of articles on programming: https://freecodecamp.org/news

freeCodeCamp

The course teaches how to train a large language model (LLM) from scratch using WhatsApp or Telegram chat data. It covers the entire process, from data extraction and cleaning to tokenization, Transformer architecture implementation, and fine-tuning. The goal is to help learners create models that mimic someone's communication style or that serve underrepresented languages. The course is divided into two parts: the first uses small datasets to introduce the basic concepts, and the second scales up to larger datasets. The steps covered are exporting and cleaning data, encoding text with Byte Pair Encoding (BPE), building a Transformer model, and fine-tuning it. Slides, notebooks, and code are provided to make the concepts easier to understand and apply.

Train Your Own LLM – Tutorial