Hacker News

Taalas, a startup, has built a fixed-function ASIC with the Llama 3.1 8B weights physically etched into silicon transistors, achieving 17,000 tokens/second inference, roughly 10x faster, cheaper, and more energy-efficient than GPU-based systems. Because the model's 32 layers are hardwired sequentially on-chip, data flows directly through physical transistors, without the constant weight fetches from VRAM that create the memory bandwidth bottleneck on GPUs. The chip uses on-chip SRAM only for the KV cache and LoRA adapters, avoiding external DRAM entirely. A novel single-transistor 4-bit multiply scheme enables dense weight storage. The base chip uses a generic logic gate grid that needs only two mask customizations per model, so a new model variant can be taped out in roughly two months.
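To make the memory bandwidth bottleneck concrete, here is a back-of-the-envelope sketch in Python. It assumes a nominal 8 billion parameters, the 4-bit weights mentioned above, and single-stream decoding in which every generated token requires streaming all weights from memory once. The 1 TB/s bandwidth figure is an illustrative assumption, not a measurement of any particular GPU, and the function names are hypothetical. The estimate ignores batching (which amortizes weight reads across sequences), KV cache traffic, and compute time.

```python
# Rough estimate of the memory-bandwidth ceiling on single-stream decoding.
# All constants below are illustrative assumptions, not vendor figures.

N_PARAMS = 8e9             # nominal parameter count for an "8B" model (assumption)
BITS_PER_WEIGHT = 4        # 4-bit weights, as in the summary above
ASSUMED_BANDWIDTH = 1e12   # 1 TB/s memory bandwidth (hypothetical GPU)
TARGET_TOKENS_PER_S = 17_000  # throughput quoted in the summary


def weight_bytes(n_params: float, bits_per_weight: int) -> float:
    """Total bytes occupied by the model weights."""
    return n_params * bits_per_weight / 8


def max_tokens_per_second(bandwidth: float, n_params: float, bits: int) -> float:
    """Upper bound on tokens/s if every token streams all weights once."""
    return bandwidth / weight_bytes(n_params, bits)


def bandwidth_needed(tokens_per_s: float, n_params: float, bits: int) -> float:
    """Bytes/s required to sustain a given single-stream token rate."""
    return tokens_per_s * weight_bytes(n_params, bits)


if __name__ == "__main__":
    size_gb = weight_bytes(N_PARAMS, BITS_PER_WEIGHT) / 1e9
    ceiling = max_tokens_per_second(ASSUMED_BANDWIDTH, N_PARAMS, BITS_PER_WEIGHT)
    needed_tb = bandwidth_needed(TARGET_TOKENS_PER_S, N_PARAMS, BITS_PER_WEIGHT) / 1e12
    print(f"weights: {size_gb:.1f} GB")
    print(f"ceiling at assumed bandwidth: {ceiling:.0f} tokens/s")
    print(f"bandwidth needed for target rate: {needed_tb:.0f} TB/s")
```

Under these assumptions the weights occupy 4 GB, the streaming ceiling is 250 tokens/s, and sustaining 17,000 tokens/s by streaming alone would take 68 TB/s. That gap is the motivation for keeping the weights in place in the silicon instead of fetching them for every token.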

How does Taalas "print" an LLM onto a chip?

How do NVIDIA GPUs process stuff? (Inefficiency 101)

But isn't fabricating a custom chip for every model super expensive?