In computer science, we refer to human languages, like English and Mandarin, as “natural” languages. In contrast, languages designed to interact with computers, like Assembly and LISP, are called…

Towards Data Science is a community-powered publication that showcases work in data science, machine learning and artificial intelligence. Every day newcomers, seasoned researchers and industry practitioners publish tutorials, research notes and real-world case studies that help the field move forward.

Towards Data Science

Natural Language Processing (NLP) bridges the gap between human and machine understanding by transforming raw text into a computable format. Tokenization, the process of breaking down text into manageable units called tokens, is crucial. Before tokenizing, text is standardized to ensure consistency by converting it to lowercase, removing punctuation, and normalizing characters. Various tokenization methods like word-level, character-level, and subword tokenization (e.g., Byte-Pair Encoding, WordPiece) prepare text for vectorization, making it intelligible to models. Implementations in Python using libraries like Hugging Face facilitate these processes.

The Art of Tokenization: Breaking Down Text for AI