A comprehensive analysis of Large Language Model (LLM) tokenizers is presented, focusing on methods for detecting untrained and under-trained tokens. The prevalence of such tokens across a range of models is demonstrated, along with insights into how identifying them can improve the efficiency and safety of language models.
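As a minimal sketch of one common detection heuristic (not necessarily the method used in this work): tokens that were rarely or never seen during training tend to have embedding rows that barely moved from initialization, so their norms are often atypical compared to the rest of the vocabulary. The model name and the 2-sigma cutoff below are illustrative assumptions.

```python
# Flag candidate under-trained tokens by looking for embedding rows
# with unusually small L2 norm relative to the vocabulary median.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # assumption: any causal LM works here
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

emb = model.get_input_embeddings().weight.detach()  # (vocab_size, hidden_dim)
norms = emb.norm(dim=1)

# Tokens far below the median norm are candidates for being under-trained;
# the 2-sigma threshold is an arbitrary illustrative choice.
median, std = norms.median(), norms.std()
suspect_ids = (norms < median - 2 * std).nonzero(as_tuple=True)[0]

for token_id in suspect_ids.tolist()[:20]:  # print a few candidates
    token = tokenizer.convert_ids_to_tokens(token_id)
    print(token_id, repr(token), f"{norms[token_id].item():.3f}")
```

Candidates flagged this way would still need verification, e.g. by prompting the model with the suspect token and checking whether it can reproduce it.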