Lucebox optimization hub: hand-tuned LLM inference, built for specific consumer hardware. - Luce-Org/lucebox-hub

Lucebox is an open-source project that publishes hand-tuned LLM inference optimizations for specific consumer GPUs, starting with two releases for the RTX 3090. The first is a megakernel for Qwen3.5-0.8B that fuses all 24 layers into a single CUDA dispatch, achieving 1.87 tok/J at 413 tok/s decode, which matches Apple silicon efficiency at 2× the throughput. The second is a GGUF port of DFlash speculative decoding with DDTree for Qwen3.5-27B, reaching up to 207 tok/s (3.43× faster than autoregressive decoding) and fitting 128K context in 24 GB of VRAM using Q4_K_M quantization. Both releases include full writeups, reproducible benchmarks, and MIT-licensed source code. The roadmap targets the Ryzen AI MAX+ 395 and heterogeneous CPU+GPU optimizations next.
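The efficiency figure for the megakernel can be turned into an implied power draw, since tokens per second divided by tokens per joule gives joules per second (watts). The sketch below is only back-of-the-envelope arithmetic on the two numbers quoted above; the function names are hypothetical, and the result is an implied average rather than a measured value from the project's benchmarks.

```python
def implied_power_watts(tokens_per_second: float, tokens_per_joule: float) -> float:
    """Average power implied by a throughput and an energy-efficiency figure.

    (tok/s) / (tok/J) = J/s = W
    """
    return tokens_per_second / tokens_per_joule


def joules_per_token(tokens_per_joule: float) -> float:
    """Energy cost of one decoded token, the reciprocal of tok/J."""
    return 1.0 / tokens_per_joule


# Figures quoted in the summary for Qwen3.5-0.8B on the RTX 3090.
DECODE_TOK_PER_S = 413.0
DECODE_TOK_PER_J = 1.87

power = implied_power_watts(DECODE_TOK_PER_S, DECODE_TOK_PER_J)
energy = joules_per_token(DECODE_TOK_PER_J)

print(f"implied average power: {power:.0f} W")
print(f"energy per token: {energy:.2f} J")
```

Under these assumptions the quoted numbers correspond to roughly 220 W during decode, or a little over half a joule per token. The summary does not say whether this is measured at the GPU or for the whole system.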

01 · Megakernel Qwen3.5 0.8B on RTX 3090

02 · DFlash DDtree Qwen3.5 27B GGUF on RTX 3090