Best of Transformers — November 2024

1
Video
3Blue1Brown·2y
Large Language Models explained briefly
The video explains what large language models (LLMs) are, how they work, and why training them is so demanding. An LLM predicts the next word in a sequence by assigning a probability to each candidate, and it learns those probabilities from vast amounts of text. Transformers, introduced in 2017, process text in parallel rather than word by word, which makes computation more efficient. Pre-training is followed by reinforcement learning with human feedback to refine the model's predictions. The scale of data and computation involved is enormous and depends on specialized hardware such as GPUs.
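As a rough illustration of "predicting the next word based on probabilities", the sketch below turns raw scores into a probability distribution with softmax and samples one word from it. The vocabulary and scores are invented for the example; a real LLM would produce scores over tens of thousands of tokens.

```python
import math
import random

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next_word(vocab, logits, rng):
    """Pick one word at random, weighted by its probability."""
    probs = softmax(logits)
    return rng.choices(vocab, weights=probs, k=1)[0]

# Hypothetical vocabulary and scores, made up for illustration only.
vocab = ["mat", "moon", "roof"]
logits = [2.0, 0.5, 1.0]

probs = softmax(logits)
next_word = sample_next_word(vocab, logits, random.Random(0))
```

Sampling, rather than always taking the most likely word, is one reason the same prompt can yield different continuations.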
177
3
2
Article
gitconnected·2y
Let’s Build our own GPT Model from Scratch with PyTorch
Learn how to build a basic Generative Pre-trained Transformer (GPT) model from scratch using PyTorch. This tutorial covers auto-regressive models, character-level tokenization, data batching, and training using text in the style of William Shakespeare. It provides a detailed implementation of a bi-gram language model including the use of multi-head attention, forward and training operations, and generating new text tokens.
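To make character-level tokenization and the bigram idea concrete, here is a minimal sketch in plain Python. It is not the tutorial's PyTorch code: it swaps the trained neural model for simple bigram counts, and the one-line corpus is a stand-in for the Shakespeare text.

```python
from collections import Counter, defaultdict

text = "to be or not to be"  # tiny stand-in corpus (assumption)

# Character-level tokenization: each distinct character gets an integer id.
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}

def encode(s):
    """Map a string to a list of token ids."""
    return [stoi[c] for c in s]

def decode(ids):
    """Map a list of token ids back to a string."""
    return "".join(itos[i] for i in ids)

# Count-based bigram model: how often each token follows each other token.
ids = encode(text)
counts = defaultdict(Counter)
for prev, nxt in zip(ids, ids[1:]):
    counts[prev][nxt] += 1

def next_token_probs(prev_id):
    """Estimate P(next token | previous token) from the counts."""
    total = sum(counts[prev_id].values())
    return {nxt: c / total for nxt, c in counts[prev_id].items()}

probs_after_t = next_token_probs(stoi["t"])
```

A bigram model conditions only on the previous token; attention lets the model condition on the whole preceding context instead.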
43
3
Article
AIModels.fyi·2y
Get ready to lose to Transformers on Lichess
A study trains large transformer models to play chess by generalizing strategies rather than memorizing moves, using a dataset called ChessBench with 10 million human games. The models reached near-grandmaster level without explicit search, which suggests that transformers may be useful for other strategic planning tasks.
13
1
4
Article
Towards AI·2y
Transformers For Images!!
This post explores how transformers are applied to images in computer vision, detailing three main methods: Pixel Transformers, Vision Transformers (ViT) by Google Brain, and Swin Transformers by Microsoft. It highlights the limitations of CNNs and describes how image patches, window attention, and hierarchical patches reduce the computational cost of attention over images.
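The image-patch idea can be sketched in a few lines. The function below is a hypothetical helper, not code from the post: it splits a single-channel image, stored as a list of rows, into flattened non-overlapping patches, each of which would become one token.

```python
def patchify(image, patch):
    """Split an H x W grid into flattened, non-overlapping patch x patch tiles."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be divisible by the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            # Read the tile row by row into one flat list.
            tile = [image[r][c]
                    for r in range(top, top + patch)
                    for c in range(left, left + patch)]
            patches.append(tile)
    return patches

# Hypothetical 4x4 image with pixel values 0..15.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
patches = patchify(image, 2)
```

Here 16 pixels become 4 tokens. Because self-attention cost grows quadratically with sequence length, shortening the sequence this way is what makes attention over images tractable.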
12
5
Article
Daily Dose of Data Science | Avi Chawla | Substack·2y
Extending the Context Length of LLMs
The post explains techniques for extending the context length of large language models (LLMs), highlighting methods such as sparse attention and flash attention. These techniques reduce the computational cost of longer context windows, making it feasible to process many more tokens without a drastic increase in cost. It also discusses why positional embeddings matter, particularly rotary positional embeddings (RoPE), which preserve the relative positions of tokens.
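The claim that RoPE preserves relative position can be checked numerically. The sketch below is a simplified, plain-Python version of the rotation, assuming an even vector length and the commonly used base of 10000; the query and key values are arbitrary.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate consecutive pairs of dimensions by position-dependent angles."""
    d = len(vec)  # assumed to be even
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)  # lower frequency for later pairs
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Arbitrary example query and key vectors (illustrative values).
q = [1.0, 0.5, -0.3, 2.0]
k = [0.2, -1.0, 0.7, 0.4]

# Same offset of 3 between query and key, at different absolute positions.
score_near = dot(rope(q, 5), rope(k, 2))
score_far = dot(rope(q, 105), rope(k, 102))
```

The two scores agree up to floating-point error, because the dot product of two rotated vectors depends only on the difference between their rotation angles, that is, on the distance between the tokens.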
10
