Byte Pair Encoding (BPE)
Introduction
Byte Pair Encoding (BPE) is a subword tokenization algorithm that has become a fundamental component in modern Natural Language Processing (NLP). Originally developed for data compression, BPE was later adapted for use in machine translation and has since become a standard preprocessing step for many state-of-the-art language models.
How BPE Works
The BPE algorithm works by iteratively merging the most frequent pair of adjacent symbols (bytes or characters) in a text corpus. Here's a step-by-step breakdown (a minimal code sketch follows the list):
- Initialize Vocabulary: Split every word in the corpus into individual characters; the initial vocabulary is the set of those characters.
- Count Pairs: Count the frequency of every adjacent symbol pair across the corpus, weighted by word frequency.
- Merge Most Frequent Pair: Replace every occurrence of the most frequent pair with a new merged symbol.
- Update Vocabulary: Add the new symbol to the vocabulary.
- Repeat: Continue the process until a desired vocabulary size is reached or no more merges are possible.
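The loop above translates almost directly into code. The following is a minimal, unoptimized sketch in Python; the corpus representation (a mapping from each word, split into symbols, to its frequency) and all function names are illustrative rather than taken from any particular library.

from collections import Counter

def count_pairs(corpus):
    # Count adjacent symbol pairs across all words, weighted by word frequency.
    pairs = Counter()
    for symbols, freq in corpus.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(corpus, pair):
    # Replace every occurrence of `pair` with a single merged symbol.
    merged = {}
    for symbols, freq in corpus.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

def learn_bpe(corpus, num_merges):
    # Learn up to `num_merges` merge rules from the corpus.
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(corpus)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        corpus = merge_pair(corpus, best)
    return merges, corpus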
Example
Consider the following text with word frequencies:
"low" (5), "lower" (2), "newest" (6), "widest" (3)
- Initial vocabulary:
{l, o, w, e, r, n, s, i, d, t}
- After the first merge (the most frequent pair, 'e' followed by 's', which occurs 9 times: 6 in "newest" and 3 in "widest"), the new symbol 'es' is added:
{l, o, w, e, r, n, s, i, d, t, es}
- The next merge is 'es' followed by 't' (also 9 occurrences), adding 'est'. Continue merging until the desired vocabulary size is reached; the snippet below verifies the pair counts for the first merge.
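As a self-contained check of the example, this short Python snippet counts the adjacent pairs in the corpus above (the corpus literal mirrors the word frequencies given earlier):

from collections import Counter

corpus = {
    ("l", "o", "w"): 5,
    ("l", "o", "w", "e", "r"): 2,
    ("n", "e", "w", "e", "s", "t"): 6,
    ("w", "i", "d", "e", "s", "t"): 3,
}

pairs = Counter()
for symbols, freq in corpus.items():
    for a, b in zip(symbols, symbols[1:]):
        pairs[(a, b)] += freq

# ('e', 's') and ('s', 't') both occur 9 times; 'es' is a valid first merge.
print(pairs.most_common(3))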
Implementation in NLP
BPE is particularly useful for NLP because it:
- Handles out-of-vocabulary words by breaking them into known subword units (see the snippet after this list)
- Reduces vocabulary size while maintaining meaningful representations
- Balances between word-level and character-level representations
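To see the out-of-vocabulary behaviour in practice, the snippet below encodes a few words with tiktoken and decodes each token individually. The word list is just for illustration, and the exact splits depend on the merges learned by the chosen encoding.

import tiktoken

enc = tiktoken.get_encoding("o200k_base")

for word in ["lowest", "supercalifragilistic", "detokenization"]:
    token_ids = enc.encode(word)
    # Decode each token id separately to see the subword pieces.
    pieces = [enc.decode([t]) for t in token_ids]
    print(f"{word!r} -> {pieces}")

Common words typically come back as a single token, while rare or invented words fall apart into several known pieces.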
Common Implementations
- SentencePiece: Google's implementation that supports both BPE and unigram language model tokenization.
- Hugging Face Tokenizers: Provides a fast BPE implementation along with other tokenization algorithms (a minimal training example follows this list).
- Subword NMT: The original implementation for neural machine translation.
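As one concrete illustration, here is a minimal sketch of training a BPE tokenizer with the Hugging Face Tokenizers library (pip install tokenizers). The toy corpus and the vocab_size value are placeholders, not recommendations.

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Tiny in-memory corpus; real training would use files or a larger iterator.
corpus = ["low lower newest widest", "the lowest and widest"]

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

trainer = BpeTrainer(vocab_size=100, special_tokens=["[UNK]"])
tokenizer.train_from_iterator(corpus, trainer)

print(tokenizer.encode("lowest").tokens)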
Applications
BPE is widely used in:
- Pre-training language models (e.g., the GPT series, RoBERTa; BERT uses the closely related WordPiece algorithm)
- Neural machine translation
- Text generation tasks
- Any NLP task requiring subword tokenization
Advantages and Limitations
Advantages
- Handles rare words effectively
- Reduces vocabulary size
- Maintains meaningful subword units
- Language-agnostic
Limitations
- May split words in unintuitive ways
- Requires preprocessing and training on a corpus
- Fixed vocabulary size must be determined in advance
Encoding
Use the following prebuilt encodings. For modern models such as GPT-4o and GPT-4.1, prefer o200k_base. For GPT-4 and GPT-3.5, use cl100k_base. The snippet after the list shows how to look up the right encoding for a given model programmatically.
- o200k_base (modern models: GPT-4o, GPT-4.1, o1, etc.)
- cl100k_base (GPT-4, GPT-3.5)
- p50k_base
- p50k_edit
- r50k_base
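Rather than hard-coding an encoding name, tiktoken can resolve the encoding from a model name. A small sketch (the model names are just examples; coverage depends on the installed tiktoken version):

import tiktoken

for model in ["gpt-4o", "gpt-4", "gpt-3.5-turbo"]:
    # encoding_for_model maps a model name to its tokenizer encoding.
    enc = tiktoken.encoding_for_model(model)
    print(model, "->", enc.name)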
Code Example
JavaScript (Node)
npm install gpt-tokenizer
import { encode, decode } from "gpt-tokenizer/encoding/o200k_base";
const text = "Hello, 世界!";
const tokens = encode(text);
console.log("token count:", tokens.length);
console.log("tokens:", tokens);
console.log("decoded:", decode(tokens));
// If you need GPT-4/3.5 compatibility instead:
// import { encode as encodeCl100k } from 'gpt-tokenizer/encoding/cl100k_base';
// const tokensCl = encodeCl100k(text);
Python
pip install tiktoken
import tiktoken
enc = tiktoken.get_encoding("o200k_base")
text = "Hello, 世界!"
tokens = enc.encode(text)
print("token count:", len(tokens))
print("tokens:", tokens)
print("decoded:", enc.decode(tokens))
# For GPT-4/3.5 models:
# enc = tiktoken.get_encoding("cl100k_base")
Tip: Token counts are model-dependent. Always match the encoding to your target model.
Conclusion
BPE provides a practical compromise between character- and word-level tokenization, enabling robust handling of rare and novel words while keeping the vocabulary manageable for efficient training and inference.
Further Reading
- OpenAI tiktoken
- Tiktokenizer
- Hugging Face Tokenizers Documentation
- Hugging Face Byte-Pair Encoding tokenization
- GPT Tokenizer
- How LLMs See the World
Last updated: August 2025