Stealing Part of a Production Language Model | AI Paper Explained
A Google DeepMind research paper demonstrates a model-stealing attack that extracts the embedding projection layer (the final layer) from closed-source LLMs like GPT-4 and PaLM-2 using only standard API access. By exploiting APIs that expose log probabilities or accept a logit bias, an attacker can reconstruct the full logit vector for many prompts, stack those vectors into a matrix, and apply SVD to recover the hidden dimension size and an approximation of the weight matrix (determined only up to a linear transformation). The researchers estimated that the full embedding projection layer of GPT-3.5-Turbo could be extracted for under $2,000 in API queries. OpenAI and Google have since deployed mitigations. The video explains the math behind the attack, including how partial top-k log probabilities can be extended to full logit vectors using biased queries.