Explores alternatives to standard autoregressive transformer LLMs, including linear attention hybrids like Qwen3-Next and Kimi Linear that use Gated DeltaNet for improved efficiency, text diffusion models that generate tokens in parallel through iterative denoising, code world models that simulate program execution for better code generation, and small recursive transformers that iteratively refine answers with a tiny, weight-shared network.
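To make the first of these concrete: the gated delta rule underlying Gated DeltaNet can be written as a simple recurrence over a fixed-size state matrix, $S_t = \alpha_t S_{t-1}\left(I - \beta_t k_t k_t^\top\right) + \beta_t v_t k_t^\top$, with output $o_t = S_t q_t$. Below is a minimal NumPy sketch of that recurrence; it is illustrative only, the names and scalar gates are simplifying assumptions, and real implementations use learned per-head gates and chunked, hardware-efficient kernels rather than a token-by-token loop.

```python
import numpy as np

def gated_delta_step(S, q, k, v, alpha, beta):
    """One recurrent step of a gated delta-rule layer (illustrative sketch).

    S: (d_v, d_k) fast-weight state carried across tokens
    q, k: (d_k,) query/key for the current token; v: (d_v,) value
    alpha in (0, 1): forget gate; beta in (0, 1): write strength

    Implements S_t = alpha * S_{t-1} @ (I - beta * k k^T) + beta * v k^T
    and returns the output o_t = S_t @ q.
    """
    # Decay the state and erase the old association along k (the delta rule),
    # then write the new key/value association.
    S = alpha * (S - beta * np.outer(S @ k, k)) + beta * np.outer(v, k)
    return S, S @ q

# Toy usage: the state stays (d_v, d_k) regardless of sequence length,
# i.e. O(1) memory per token instead of a growing KV cache.
d_k = d_v = 4
rng = np.random.default_rng(0)
S = np.zeros((d_v, d_k))
for _ in range(8):
    q, k, v = rng.standard_normal((3, d_k))
    S, o = gated_delta_step(S, q, k / np.linalg.norm(k), v,
                            alpha=0.95, beta=0.5)
```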
Table of contents

1. Transformer-Based LLMs
2. (Linear) Attention Hybrids
3. Text Diffusion Models
4. World Models
5. Small Recursive Transformers
6. Conclusion