Token Efficiency is proposed as a superior alternative to Shannon Entropy for filtering candidate secrets in secrets scanning. It measures the ratio of string length to BPE token count (characters per token) using the cl100k_base tokenizer. Rare or non-natural-language strings, such as API keys and passwords, split into many short tokens and therefore produce low token efficiency scores. Benchmarked against the CredData dataset, Token Efficiency combined with a word filter achieves an F1 of 0.874, versus 0.715 for Entropy combined with the same word filter. Integrated into Betterleaks (a Gitleaks successor), the approach reaches an F1 of 0.892, beating CredSweeper's 0.85. The post includes implementation code, benchmark results, and concrete examples of cases where entropy fails but token efficiency succeeds.
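The two metrics compared above can be sketched in a few lines of Python. This is an illustrative sketch, not the post's implementation: the function names, the `encode` parameter, and the toy vocabulary are all assumptions made for the example. The toy encoder is a greedy longest-match stand-in for a real BPE tokenizer, used only so the snippet runs without extra dependencies.

```python
import math
from collections import Counter
from typing import Callable, Sequence


def shannon_entropy(s: str) -> float:
    """Shannon entropy of s, in bits per character."""
    if not s:
        return 0.0
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in Counter(s).values())


def token_efficiency(s: str, encode: Callable[[str], Sequence]) -> float:
    """Characters per token: len(s) divided by the number of tokens."""
    if not s:
        return 0.0
    tokens = encode(s)
    return len(s) / len(tokens) if tokens else 0.0


# Hypothetical stand-in vocabulary; a real BPE vocabulary such as
# cl100k_base is learned from data and is far larger.
TOY_VOCAB = {"password", "secret", "token", "config"}


def toy_encode(s: str) -> list[str]:
    """Greedy longest-match tokenizer (NOT real BPE, illustration only)."""
    tokens, i = [], 0
    while i < len(s):
        for j in range(len(s), i, -1):
            if s[i:j] in TOY_VOCAB:
                tokens.append(s[i:j])
                i = j
                break
        else:
            # No vocabulary entry matches here: emit a single character.
            tokens.append(s[i])
            i += 1
    return tokens


if __name__ == "__main__":
    for candidate in ("password", "x9Kq2LmZ"):
        print(candidate,
              round(shannon_entropy(candidate), 3),
              token_efficiency(candidate, toy_encode))
```

To approximate the setup described above, `encode` could be swapped for `tiktoken.get_encoding("cl100k_base").encode`, assuming the `tiktoken` package is installed. Any cutoff for what counts as "low" efficiency would need to be tuned against labeled data; none is stated in this summary.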

12m read time. From aikido.dev
Table of contents: Entropy, Byte-pair encoding, Token efficiency, CredData, Examples, Performance, Token efficiency w/ Betterleaks
