Best of Deep Learning · April 2026

  1.
    Article
    Zed · 7w

    How We Developed Zeta2 — Zed's Blog

    Zed's team details how they built Zeta2, their improved edit prediction model. Key improvements include richer input context (finer-grained edit history, LSP-resolved type/symbol definitions), a switch from Qwen 2.5 Coder (7B) to Seed Coder (8B) as the base model, and a knowledge distillation pipeline using Claude Sonnet as the teacher model. They addressed the 'reversal problem' where the model incorrectly deleted intentional user edits by improving teacher prompting and edit granularity. Training data shifted from synthetic GitHub commit examples to opt-in real user traces from open source repos, yielding ~250-300k training requests per week. The result is a 30% better acceptance rate and faster responses, validated through dogfooding, shadow releases, and gradual rollout.

  2.
    Video
    Dave's Garage · 6w

    Training a Neural Network on a Vintage PDP-11 from 1979!

    A hands-on demonstration of training a minimal single-layer, single-head transformer on a genuine 1979 PDP-11/44 minicomputer. The project, called Attention11, is written in raw PDP-11 assembly language and uses fixed-point arithmetic instead of floating point. The task is simple — learning to reverse an 8-digit sequence — but it exposes the full mechanics of transformer training: forward pass, softmax, loss calculation, backpropagation, and weight updates. With only 1,216 parameters and fitting in 32KB of memory, the model converges to 100% accuracy in about 350 training steps (~3.5 minutes on the 11/44). The piece demystifies modern AI by showing that the core learning loop is pure arithmetic — making guesses, measuring error, and nudging weights — and argues that hardware constraints force better engineering thinking.
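The fixed-point arithmetic mentioned above can be sketched in a few lines. This is only an illustration: the Q8.8 format (8 fractional bits) and the helper names `to_fixed`, `fixed_mul`, and `sgd_update` are assumptions made for this example, not details taken from Attention11, which is written in PDP-11 assembly.

```python
# Minimal fixed-point sketch. ASSUMPTION: Q8.8 format (8 fractional bits);
# the summary does not state which format Attention11 actually uses.
FRAC_BITS = 8
SCALE = 1 << FRAC_BITS  # 256


def to_fixed(x: float) -> int:
    """Convert a float to a scaled integer."""
    return int(round(x * SCALE))


def to_float(q: int) -> float:
    """Convert a scaled integer back to a float."""
    return q / SCALE


def fixed_mul(a: int, b: int) -> int:
    """Multiply two fixed-point values and rescale the product.

    Python's >> on negative integers rounds toward negative infinity.
    """
    return (a * b) >> FRAC_BITS


def sgd_update(weight: int, grad: int, lr: int) -> int:
    """One weight nudge, w <- w - lr * grad, using only integer operations."""
    return weight - fixed_mul(lr, grad)


product = fixed_mul(to_fixed(1.5), to_fixed(2.0))
new_weight = sgd_update(to_fixed(1.0), to_fixed(0.5), to_fixed(0.25))
print(to_float(product), to_float(new_weight))
```

Every quantity stays an integer, so the whole update needs only integer multiply, shift, and subtract, which is the kind of operation a machine without floating-point hardware can run directly.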

  3.
    Video
    bycloud · 4w

    A new way to fine-tune LLMs just dropped

    Evolution strategies (ES), long considered unscalable for deep neural networks, are making a comeback in LLM fine-tuning. Two key papers are driving this revival: 'Evolution Strategies at Scale' (Sept 2025), which showed ES can fine-tune billion-parameter models using a population of just 30 models by exploiting the low intrinsic dimensionality of useful update directions; and EGGROLL (Nov 2025), which structures perturbations as LoRA updates to dramatically reduce compute costs. EGGROLL enables massively parallel inference-only training without backpropagation, outperforming GRPO on benchmarks like Countdown (35% vs 23% accuracy) and GSM8K while running up to 32x more parallel generations on the same hardware. The key insight is that ES fits naturally into RL-style fine-tuning where only a coarse outcome-level reward is available, avoiding the sparse credit assignment problem that plagues token-level RL methods like GRPO.
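To make the idea of backpropagation-free training concrete, here is a minimal sketch of a vanilla evolution-strategies update on a toy problem. It is not the method from either paper: the perturbations are full-rank Gaussian noise rather than LoRA-structured, and `es_step`, `toy_reward`, the step size, and the noise scale are all assumptions chosen for illustration. Only the population size of 30 comes from the summary.

```python
import numpy as np


def es_step(theta, reward_fn, rng, pop_size=30, sigma=0.1, lr=0.02):
    """One vanilla ES update: perturb, score, and move toward better samples.

    Only forward evaluations of reward_fn are needed; no gradients are taken.
    """
    # Sample one Gaussian perturbation per population member.
    eps = rng.standard_normal((pop_size, theta.size))
    # Score each perturbed copy with a single outcome-level reward.
    rewards = np.array([reward_fn(theta + sigma * e) for e in eps])
    # Standardize rewards so the update is insensitive to reward scale.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # Reward-weighted average of perturbations estimates an ascent direction.
    grad_est = adv @ eps / (pop_size * sigma)
    return theta + lr * grad_est


def toy_reward(theta):
    """Hypothetical reward: negative squared distance to a fixed target."""
    target = np.array([1.0, -2.0, 0.5])
    return -np.sum((theta - target) ** 2)


rng = np.random.default_rng(0)
theta = np.zeros(3)
start = toy_reward(theta)
for _ in range(300):
    theta = es_step(theta, toy_reward, rng)
final = toy_reward(theta)
print(start, final)
```

Each population member is scored independently, so the inner loop parallelizes trivially, which is the property the summary highlights for inference-only training.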

  4.
    Article
    Daily Dose of Data Science | Avi Chawla | Substack · 5w

    Google solved an Old RNN Problem

    Google Research introduces 'Memory Caching,' a technique that addresses the long-standing limitation of RNNs losing information over long sequences. Instead of relying on a single fixed-size memory state, the approach splits sequences into segments and saves the RNN's memory state at each segment boundary. During generation, each token attends to all saved checkpoints, achieving O(NL) complexity — a middle ground between RNNs' O(L) and Transformers' O(L²). Four variants are proposed: Residual Memory, Gated Residual Memory (GRM), Memory Soup, and Sparse Selective Caching (SSC), with GRM performing best. The technique significantly closes the recall gap between RNNs and Transformers and shows that hybrid architectures are implicitly a special case of Memory Caching. Experiments are at academic scale (up to 1.3B params), so frontier-scale performance remains an open question.
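The segment-and-checkpoint idea can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the decaying-sum recurrence, the softmax read over checkpoints, and the function name `run_with_memory_cache` are not taken from the paper, and none of the four named variants is reproduced. The sketch only shows why the cost scales with the number of saved checkpoints N times the sequence length L.

```python
import numpy as np


def run_with_memory_cache(xs, segment_len, decay=0.9):
    """Toy recurrent pass that caches its state at every segment boundary.

    ASSUMPTION: a simple decaying-sum recurrence stands in for the RNN, and
    each token reads from the cache with softmax attention.
    """
    d = xs.shape[1]
    state = np.zeros(d)
    checkpoints = []  # saved states, one per completed segment
    outputs = []
    for t, x in enumerate(xs):
        # Ordinary fixed-size recurrent update.
        state = decay * state + x
        # The token attends over all cached checkpoints plus the live state,
        # so per-token cost grows with the number of checkpoints, not with t.
        memories = np.stack(checkpoints + [state])
        scores = memories @ x / np.sqrt(d)
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        outputs.append(weights @ memories)
        # Save a checkpoint whenever a segment ends.
        if (t + 1) % segment_len == 0:
            checkpoints.append(state.copy())
    return np.array(outputs), checkpoints


xs = np.random.default_rng(0).standard_normal((12, 4))
outputs, checkpoints = run_with_memory_cache(xs, segment_len=4)
print(outputs.shape, len(checkpoints))
```

With 12 tokens and segments of length 4, three checkpoints are saved, and no token ever attends over more than four memory slots.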

  5.
    Article
    Hacker News · 5w

    GitHub - SeanFDZ/macmind: Single-layer transformer in HyperTalk for the classic Macintosh

    MacMind is a 1,216-parameter single-layer transformer neural network implemented entirely in HyperTalk — Apple's 1987 scripting language for HyperCard — and trained on a real Macintosh SE/30. It learns the bit-reversal permutation (the first step of the Fast Fourier Transform) from random examples using full backpropagation, self-attention, and stochastic gradient descent. No compiled code or external libraries are used. The project is designed as a transparent, inspectable demonstration that the math behind modern LLMs is not magic — the same forward pass, loss computation, and weight update loop that powers GPT-4 runs here on a 68030 processor at 8 MHz. After training (~1,000 steps, taking hours on real hardware), the attention map independently discovers the FFT butterfly routing pattern first published by Cooley and Tukey in 1965. A Python/NumPy reference implementation is included for validation.
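For reference, the bit-reversal permutation that the model is trained to learn can be computed directly. This is a plain Python sketch, not code from the MacMind repository; the function name is hypothetical, and the sequence length of 8 (3 index bits) is an assumption, since the summary does not state it.

```python
def bit_reverse_permutation(n_bits: int) -> list[int]:
    """Return the permutation mapping each index to its bit-reversed index."""
    perm = []
    for i in range(1 << n_bits):
        rev = 0
        x = i
        # Peel bits off the low end of x and push them onto the low end of
        # rev, which reverses their order.
        for _ in range(n_bits):
            rev = (rev << 1) | (x & 1)
            x >>= 1
        perm.append(rev)
    return perm


# ASSUMPTION: 8 positions (3 index bits) chosen for illustration.
perm8 = bit_reverse_permutation(3)
print(perm8)  # [0, 4, 2, 6, 1, 5, 3, 7]
```

Applying the permutation twice returns every index to its original position, so the target mapping is its own inverse.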