Lucebox is an open-source project that publishes hand-tuned LLM inference optimizations for specific consumer GPUs, starting with two releases for the RTX 3090. The first is a megakernel for Qwen3.5-0.8B that fuses all 24 layers into a single CUDA dispatch and achieves 1.87 tok/J at 413 tok/s decode, matching Apple silicon efficiency at 2× the throughput. The second is a GGUF port of DFlash speculative decoding with DDTree for Qwen3.5-27B; it reaches up to 207 tok/s (3.43× faster than autoregressive decoding) and fits 128K context in 24 GB of VRAM using Q4_K_M quantization. Both releases include full writeups, reproducible benchmarks, and MIT-licensed source code. Next on the roadmap are the Ryzen AI MAX+ 395 and heterogeneous CPU+GPU optimizations.
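The headline figures above imply two quantities that are not stated directly: the average power draw behind the megakernel's efficiency number, and the autoregressive baseline behind the DFlash speedup. The following is a minimal back-of-the-envelope sketch in Python, not part of the project's code. The function names are hypothetical, and the baseline calculation assumes the 207 tok/s peak and the 3.43× speedup come from the same run, which the summary does not confirm.

```python
def implied_power_watts(tokens_per_second: float, tokens_per_joule: float) -> float:
    """Average power implied by throughput and energy efficiency.

    (tok/s) / (tok/J) = J/s = W
    """
    return tokens_per_second / tokens_per_joule


def implied_baseline_tps(accelerated_tps: float, speedup: float) -> float:
    """Baseline throughput implied by an accelerated rate and its speedup factor.

    Assumption: both numbers describe the same measurement.
    """
    return accelerated_tps / speedup


# Megakernel figures quoted in the summary: 413 tok/s at 1.87 tok/J.
megakernel_power = implied_power_watts(413.0, 1.87)
megakernel_joules_per_token = 1.0 / 1.87

# DFlash figures quoted in the summary: up to 207 tok/s, 3.43x over autoregressive.
dflash_baseline = implied_baseline_tps(207.0, 3.43)

print(f"Implied megakernel power draw: {megakernel_power:.1f} W")
print(f"Energy per token: {megakernel_joules_per_token:.3f} J")
print(f"Implied autoregressive baseline: {dflash_baseline:.1f} tok/s")
```

Under these assumptions the megakernel figures correspond to roughly 221 W of average draw (about 0.53 J per token), and the DFlash figures to an autoregressive baseline of about 60 tok/s. If the peak throughput and the speedup were measured on different workloads, the second number is only a rough guide.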
Table of contents
- Inside the box
- 01 · Megakernel Qwen3.5 0.8B on RTX 3090
- 02 · DFlash DDtree Qwen3.5 27B GGUF on RTX 3090
- Why this exists
- Requirements
- Repository layout
- Roadmap
- Citation
- Inspired by
- Community