sentencepiece
Description:
Unsupervised text tokenizer and detokenizer
Type: Formula  |  Tracked Since: Dec 28, 2025
Links: Homepage  |  formulae.brew.sh
Category: AI/ML
Tags: nlp tokenization machine-learning bpe ai
Install: brew install sentencepiece
About:
SentencePiece is an unsupervised text tokenizer and detokenizer designed primarily for neural network-based text processing systems. It implements subword segmentation algorithms such as BPE and the Unigram language model, enabling efficient handling of large vocabularies and out-of-vocabulary words. The library is widely used to build tokenizers for modern NLP models such as T5, ALBERT, and XLNet.
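As a minimal sketch of the training step, assuming a one-sentence-per-line corpus in a hypothetical file corpus.txt and the spm_train tool that a standard SentencePiece install provides:
  # Learn an 8,000-piece BPE model from raw text; "m" is a placeholder model prefix.
  spm_train --input=corpus.txt --model_prefix=m --vocab_size=8000 --model_type=bpe
Training writes m.model and m.vocab; passing --model_type=unigram selects the Unigram language model algorithm instead.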
Key Features:
  • Language-agnostic: Treats all input text as raw, requiring no pre-tokenization
  • Implements Byte Pair Encoding (BPE) and Unigram language model algorithms
  • Integrates with major frameworks such as TensorFlow and PyTorch
  • Provides an efficient C++ implementation for high-performance inference (see the round-trip sketch after this list)
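A quick round trip through the tokenizer and detokenizer, assuming the placeholder m.model trained above and the spm_encode/spm_decode tools shipped alongside the library:
  # Segment raw text into subword pieces, then losslessly reverse the segmentation.
  echo "Hello world." | spm_encode --model=m.model --output_format=piece | spm_decode --model=m.model --input_format=piece
Because SentencePiece encodes whitespace as part of the pieces, decoding reproduces the original text exactly, with no language-specific detokenization rules.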
Use Cases:
  • Preprocessing text data for training Large Language Models (LLMs), as in the ID-encoding sketch after this list
  • Tokenizing multilingual datasets for machine translation tasks
  • Compressing text sequences for efficient storage and transmission
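For the preprocessing use case above, a sketch that turns raw text into integer IDs a model can consume, reusing the placeholder corpus.txt and m.model from the earlier sketches:
  # Emit space-separated token IDs instead of subword pieces.
  spm_encode --model=m.model --output_format=id < corpus.txt > corpus.ids
  # Dump the learned vocabulary (piece and score per line) for inspection.
  spm_export_vocab --model=m.model --output=vocab.tsv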
Alternatives:
  • Hugging Face Tokenizers – Provides a broader Rust-based ecosystem with fast training and inference, but SentencePiece remains the underlying standard for many models.
  • spaCy – Focuses on linguistic feature extraction and morphological analysis rather than pure subword segmentation for deep learning.
Version History
Detected            | Version | Rev | Change       | Commit
Sep 14, 2024 5:41pm |         | 0   | VERSION_BUMP | 49230c72