This paper presents a comprehensive analysis of Large Language Model (LLM) tokenizers, focusing on methods for detecting untrained and under-trained tokens. It demonstrates how prevalent such tokens are across a range of models and draws out insights for improving the efficiency and safety of language models.
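One common heuristic for finding candidate untrained tokens, sketched below, is to look for embedding rows with unusually small norms, since a token that never appears in training data tends to keep its near-zero initialization. This is only an illustrative example with toy data; the paper's own detection indicators may differ, and the token name used here is purely for demonstration.

```python
import math

def small_norm_tokens(embeddings, vocab, z_thresh=-1.5):
    """Return tokens whose embedding L2 norm falls more than
    |z_thresh| standard deviations below the mean norm — a rough
    proxy for 'rarely or never updated during training'."""
    norms = [math.sqrt(sum(x * x for x in row)) for row in embeddings]
    mean = sum(norms) / len(norms)
    std = math.sqrt(sum((n - mean) ** 2 for n in norms) / len(norms))
    return [tok for tok, n in zip(vocab, norms)
            if std > 0 and (n - mean) / std < z_thresh]

# Toy data: one token's embedding was initialized near zero and
# never trained, so its norm is far below the others.
vocab = ["the", "cat", " SolidGoldMagikarp", "dog"]
embeddings = [
    [0.9, -0.4, 0.5],
    [0.7, 0.6, -0.3],
    [1e-4, -1e-4, 1e-4],   # untrained: tiny norm
    [-0.5, 0.8, 0.4],
]
print(small_norm_tokens(embeddings, vocab))  # flags " SolidGoldMagikarp"
```

In practice this check would run over a real model's (un)embedding matrix, and a norm-based filter is usually only a first pass before verifying that the flagged tokens actually behave anomalously when prompted.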

From arxiv.org