Entropy often struggles with generic secrets and short strings. We look at how token efficiency can better identify strings that don’t look like normal text.

Aikido Security

Token Efficiency is proposed as a better alternative to Shannon Entropy for filtering candidate secrets in secrets scanning. It is the ratio of a string's length to its BPE token count under the cl100k_base tokenizer, that is, the average number of characters per token. Rare or non-natural-language strings (like API keys and passwords) split into many short tokens and so produce low token efficiency scores. Benchmarked against the CredData dataset, Token Efficiency combined with a word filter achieves an F1 of 0.874, versus 0.715 for Entropy combined with a word filter. Integrated into Betterleaks (a Gitleaks successor), the approach reaches an F1 of 0.892, beating CredSweeper's 0.85. The post includes implementation code, benchmark results, and concrete examples showing cases where entropy fails but token efficiency succeeds.
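
To make the two measures concrete, here is a minimal Python sketch of both. It is an illustration under stated assumptions, not the code from the post or from Betterleaks: the function names are hypothetical, the tokenizer is passed in as a callable so the metric itself has no dependencies, and the `threshold` default is a placeholder rather than a benchmarked value. The `cl100k_encoder` helper assumes the third-party `tiktoken` package is installed.

```python
import math
from collections import Counter
from typing import Callable, Sequence


def shannon_entropy(s: str) -> float:
    """Shannon entropy of the character distribution, in bits per character."""
    if not s:
        return 0.0
    n = len(s)
    counts = Counter(s)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def token_efficiency(s: str, encode: Callable[[str], Sequence]) -> float:
    """Average characters per token: len(s) / number of tokens from `encode`."""
    tokens = encode(s)
    if not tokens:
        return 0.0
    return len(s) / len(tokens)


def cl100k_encoder() -> Callable[[str], Sequence]:
    """Return the cl100k_base BPE encoder (assumes `tiktoken` is installed)."""
    import tiktoken  # imported lazily so the functions above work without it

    return tiktoken.get_encoding("cl100k_base").encode


def looks_like_secret(
    s: str, encode: Callable[[str], Sequence], threshold: float = 2.5
) -> bool:
    """Flag strings whose token efficiency falls below `threshold`.

    The default threshold is a hypothetical placeholder, not a value
    taken from the benchmark; tune it against labeled data.
    """
    return bool(s) and token_efficiency(s, encode) < threshold
```

With `tiktoken` available, `encode = cl100k_encoder()` followed by `token_efficiency(candidate, encode)` gives the characters-per-token score for a candidate string. Ordinary words usually map to one or a few tokens, so natural text scores high, while key-like strings fragment into many short tokens and score low.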

Rare Not Random: Using Token Efficiency for Secrets Scanning