Explores alternatives to standard autoregressive transformer LLMs, including linear attention hybrids like Qwen3-Next and Kimi Linear that use Gated DeltaNet for improved efficiency, text diffusion models that generate tokens in parallel through iterative denoising, and code world models that simulate program execution for better …
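To make the "Gated DeltaNet" idea mentioned above concrete, here is a minimal NumPy sketch of the delta-rule recurrence that underlies DeltaNet-style linear attention: a fixed-size state matrix is updated per token by erasing the old value bound to the current key and writing the new one, so memory cost stays constant in sequence length. The function name, shapes, and the per-token write strength `beta` are illustrative assumptions, not the actual Qwen3-Next or Kimi Linear implementation (which adds gating, chunking, and learned projections).

```python
import numpy as np

def delta_rule_attention(q, k, v, beta):
    """Toy sequential sketch of DeltaNet-style linear attention.

    q, k, v: (T, d) arrays of query/key/value vectors; beta: (T,) write strengths.
    Recurrent state update (the "delta rule"):
        S_t = S_{t-1} (I - beta_t * k_t k_t^T) + beta_t * v_t k_t^T
    Output per token: o_t = S_t q_t.
    Note the state S is (d, d) regardless of sequence length T.
    """
    T, d = q.shape
    S = np.zeros((d, d))
    out = np.zeros((T, d))
    for t in range(T):
        kt = k[t]
        # Erase whatever value is currently associated with k_t ...
        S = S - beta[t] * np.outer(S @ kt, kt)
        # ... then write the new value v_t under key k_t.
        S = S + beta[t] * np.outer(v[t], kt)
        out[t] = S @ q[t]
    return out
```

With orthonormal keys and `beta = 1`, the state acts as an exact key-value store: querying with key `k_i` after writing returns `v_i`, which is the associative-memory view of linear attention.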

38-minute read · From sebastianraschka.com
Table of contents
1. Transformer-Based LLMs
2. (Linear) Attention Hybrids
3. Text Diffusion Models
4. World Models
5. Small Recursive Transformers
6. Conclusion