The attention mechanism in transformers relies on three matrices: Query (Q), Key (K), and Value (V). These matrices are created by multiplying the input embeddings with learned weight matrices (Wq, Wk, Wv). The Query represents what each token is looking for, the Key represents what each token contains, and the Value holds the actual content each token contributes once attention decides which tokens to focus on.
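As a minimal sketch of the projections described above: each of Q, K, and V is just a matrix product of the input embeddings with its own weight matrix. The sizes below (4 tokens, embedding dimension 8, head dimension 4) and the random weights are illustrative assumptions; in a real model the weights are learned during training.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 4 tokens, embedding dim 8, head dim 4.
seq_len, d_model, d_head = 4, 8, 4

X = rng.standard_normal((seq_len, d_model))   # input token embeddings
Wq = rng.standard_normal((d_model, d_head))   # learned in practice, random here
Wk = rng.standard_normal((d_model, d_head))
Wv = rng.standard_normal((d_model, d_head))

Q = X @ Wq  # what each token is looking for
K = X @ Wk  # what each token contains
V = X @ Wv  # the content each token passes along

print(Q.shape, K.shape, V.shape)
```

All three projections read from the same embeddings X, which is why the separate weight matrices (covered later in the article) are what let each role specialize.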
Table of contents

- Why Q, K, V Matrices Matter
- The Intuition
- Attention Pipeline
- A Simple Example
- The Weight Matrices
- Constructing the Query matrix
- Constructing the Key matrix
- Constructing the Value matrix
- Construction Pseudocode
- Why Separate Weight Matrices
- Impact of Chosen Dimension
- Role of Matrices in Attention
- The First Step
- Footnote