The Transformer model uses attention mechanisms to significantly boost the training speed and performance of neural machine translation applications. Its structure is highly parallelizable, consisting of encoding and decoding components built from self-attention layers. The high-level view includes word embeddings and feed-forward neural networks.
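As a rough illustration of that high-level view, the sketch below wires a single encoder layer out of a self-attention sub-layer followed by a position-wise feed-forward network. This is a minimal PyTorch sketch under my own assumptions (the class name `EncoderLayer`, the default sizes, and the use of `nn.MultiheadAttention` are illustrative choices, not the post's or the paper's reference code), and it omits layer normalization, masking, and positional encoding, which are covered later in the article.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """Illustrative encoder layer: self-attention, then a feed-forward network.
    Simplified sketch; layer normalization and masking are omitted."""
    def __init__(self, d_model=512, num_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x):
        # Self-attention: every position attends to every other position in the sequence.
        attn_out, _ = self.self_attn(x, x, x)
        x = x + attn_out      # residual connection around attention
        x = x + self.ff(x)    # residual connection around the feed-forward network
        return x

# Usage: a batch of 2 sequences, 5 token embeddings each, model width 512.
layer = EncoderLayer()
tokens = torch.randn(2, 5, 512)
print(layer(tokens).shape)  # torch.Size([2, 5, 512])
```

Because the feed-forward network is applied to each position independently and attention is computed as matrix multiplications over the whole sequence, every token can be processed in parallel, which is where the training-speed advantage over recurrent models comes from.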
Table of contents
- A High-Level Look
- Bringing The Tensors Into The Picture
- Now We're Encoding!
- Self-Attention at a High Level
- Self-Attention in Detail
- Matrix Calculation of Self-Attention
- The Beast With Many Heads
- Representing The Order of The Sequence Using Positional Encoding
- The Residuals
- The Decoder Side
- The Final Linear and Softmax Layer
- Recap Of Training
- The Loss Function
- Go Forth And Transform
- Acknowledgements