This paper presents a comprehensive analysis of Large Language Model (LLM) tokenizers, focusing on methods for detecting untrained and under-trained tokens. It demonstrates that such tokens are prevalent across a wide range of models and draws out implications for improving the efficiency and safety of language models.
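One common heuristic for surfacing candidate untrained tokens (a sketch of the general idea, not necessarily the paper's exact method) is that an embedding row that was never updated during training tends to sit near the mean of the embedding matrix, while trained rows drift away from it. The snippet below flags tokens whose distance from the mean embedding is anomalously small; the function name, threshold, and synthetic data are illustrative assumptions.

```python
import numpy as np

def flag_undertrained(embeddings: np.ndarray, threshold: float = 0.3) -> list[int]:
    """Return indices of candidate untrained tokens.

    embeddings: (vocab_size, dim) token embedding matrix.
    threshold: fraction of the median distance-from-mean below which
        a token is flagged. Both the metric and the cutoff are
        illustrative choices, not the paper's published criterion.
    """
    mean_vec = embeddings.mean(axis=0)
    # Distance of each token's embedding from the mean embedding.
    dists = np.linalg.norm(embeddings - mean_vec, axis=1)
    cutoff = threshold * np.median(dists)
    return [i for i, d in enumerate(dists) if d < cutoff]

# Synthetic demo: 100 "trained" rows with unit-scale noise, plus two rows
# pinned near the mean to mimic embeddings that were never updated.
rng = np.random.default_rng(0)
emb = rng.normal(0.0, 1.0, size=(102, 64))
emb[100] = emb[:100].mean(axis=0) + rng.normal(0, 0.01, 64)
emb[101] = emb[:100].mean(axis=0) + rng.normal(0, 0.01, 64)
print(flag_undertrained(emb))  # the two pinned rows are flagged
```

In practice such a scan runs over a real model's input (or unembedding) matrix, and the flagged candidates are then verified behaviorally, e.g. by prompting the model with the suspect tokens.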
From arxiv.org