This post discusses the development of go-sentencepiece, a pure Go implementation of the SentencePiece tokenizer used by Google's AI models such as Gemma and Gemini. Unlike the C++ implementation and its Python bindings, go-sentencepiece doesn't require a C compiler. It focuses on BPE (byte pair encoding) tokenization and supports only encoding and decoding, not training. The implementation uses optimized algorithms that significantly improve performance. The tokenizer is configured by a protobuf file, and an online demo is available for testing.
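To make the encoding step concrete, here is a minimal sketch of BPE encoding in Go. It is not go-sentencepiece's code or API: the `pair` type, the `encodeBPE` function, and the `toyRanks` merge table are hypothetical names invented for illustration, and the loop is the naive quadratic version rather than an optimized one. A real SentencePiece model loads its vocabulary from the protobuf file instead of a hard-coded map.

```go
package main

import "fmt"

// pair is a hypothetical key type for two adjacent symbols.
type pair struct{ left, right string }

// toyRanks is an illustrative merge table (an assumption, not real model data).
// A lower rank means the pair is merged earlier.
var toyRanks = map[pair]int{
	{"l", "l"}:   0,
	{"h", "e"}:   1,
	{"he", "ll"}: 2,
}

// encodeBPE splits text into single code points, then repeatedly merges the
// adjacent pair with the lowest rank until no known pair remains.
func encodeBPE(text string, ranks map[pair]int) []string {
	symbols := []string{}
	for _, r := range text {
		symbols = append(symbols, string(r))
	}
	for {
		best, bestRank := -1, 0
		for i := 0; i+1 < len(symbols); i++ {
			rank, ok := ranks[pair{symbols[i], symbols[i+1]}]
			if ok && (best == -1 || rank < bestRank) {
				best, bestRank = i, rank
			}
		}
		if best == -1 {
			return symbols // no mergeable pair left
		}
		// Replace the two symbols at best and best+1 with their concatenation.
		symbols[best] += symbols[best+1]
		symbols = append(symbols[:best+1], symbols[best+2:]...)
	}
}

func main() {
	fmt.Println(encodeBPE("hello", toyRanks))
}
```

Each pass rescans the whole symbol list, so the cost grows quadratically with input length in the worst case. That rescanning is the kind of overhead that more careful data structures can avoid.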