Byte Pair Encoding (BPE)
Introduction
Byte Pair Encoding (BPE) is a subword tokenization algorithm that has become a fundamental component in modern Natural Language Processing (NLP). Originally developed for data compression, BPE was later adapted for use in machine translation and has since become a standard preprocessing step for many state-of-the-art language models.
How BPE Works
The BPE algorithm works by iteratively merging the most frequent pair of adjacent symbols (bytes or characters) in a text corpus. Here's a step-by-step breakdown (a minimal code sketch follows the list):
- Initialize Vocabulary: Split every word in the corpus into individual characters; the initial vocabulary is the set of those characters.
- Count Pairs: Count the frequency of every adjacent symbol pair across the corpus, weighted by word frequency.
- Merge Most Frequent Pair: Replace every occurrence of the most frequent pair with a new merged symbol.
- Update Vocabulary: Add the new symbol to the vocabulary.
- Repeat: Continue the process until a desired vocabulary size is reached or no more merges are possible.
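The loop above translates almost directly into code. The following is a minimal, unoptimized sketch in Python; the corpus representation (a mapping from each word, split into symbols, to its frequency) and all function names are illustrative rather than taken from any particular library.

from collections import Counter

def count_pairs(corpus):
    # Count adjacent symbol pairs across all words, weighted by word frequency.
    pairs = Counter()
    for symbols, freq in corpus.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(corpus, pair):
    # Replace every occurrence of `pair` with a single merged symbol.
    merged = {}
    for symbols, freq in corpus.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

def learn_bpe(corpus, num_merges):
    # Learn up to `num_merges` merge rules from the corpus.
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(corpus)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        corpus = merge_pair(corpus, best)
    return merges, corpus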
Example
Consider the following text with word frequencies:
"low" (5), "lower" (2), "newest" (6), "widest" (3)
- Initial vocabulary:
{l, o, w, e, r, n, s, i, d, t}
- After the first merge (the most frequent pair, 'e' followed by 's', which occurs 9 times: 6 in "newest" and 3 in "widest"), the new symbol 'es' is added:
{l, o, w, e, r, n, s, i, d, t, es}
- The next merge is 'es' followed by 't' (also 9 occurrences), adding 'est'. Continue merging until the desired vocabulary size is reached; the snippet below verifies the pair counts for the first merge.
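As a self-contained check of the example, this short Python snippet counts the adjacent pairs in the corpus above (the corpus literal mirrors the word frequencies given earlier):

from collections import Counter

corpus = {
    ("l", "o", "w"): 5,
    ("l", "o", "w", "e", "r"): 2,
    ("n", "e", "w", "e", "s", "t"): 6,
    ("w", "i", "d", "e", "s", "t"): 3,
}

pairs = Counter()
for symbols, freq in corpus.items():
    for a, b in zip(symbols, symbols[1:]):
        pairs[(a, b)] += freq

# ('e', 's') and ('s', 't') both occur 9 times; 'es' is a valid first merge.
print(pairs.most_common(3))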
Implementation in NLP
BPE is particularly useful for NLP because it:
- Handles out-of-vocabulary words by breaking them into known subword units (see the snippet after this list)
- Reduces vocabulary size while maintaining meaningful representations
- Balances between word-level and character-level representations
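To see the out-of-vocabulary behaviour in practice, the snippet below encodes a few words with tiktoken and decodes each token individually. The word list is just for illustration, and the exact splits depend on the merges learned by the chosen encoding.

import tiktoken

enc = tiktoken.get_encoding("o200k_base")

for word in ["lowest", "supercalifragilistic", "detokenization"]:
    token_ids = enc.encode(word)
    # Decode each token id separately to see the subword pieces.
    pieces = [enc.decode([t]) for t in token_ids]
    print(f"{word!r} -> {pieces}")

Common words typically come back as a single token, while rare or invented words fall apart into several known pieces.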
Common Implementations
- SentencePiece: Google's implementation that supports both BPE and unigram language model tokenization.
- Hugging Face Tokenizers: Provides a fast BPE implementation along with other tokenization algorithms (a minimal training example follows this list).
- Subword NMT: The original implementation for neural machine translation.
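As one concrete illustration, here is a minimal sketch of training a BPE tokenizer with the Hugging Face Tokenizers library (pip install tokenizers). The toy corpus and the vocab_size value are placeholders, not recommendations.

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Tiny in-memory corpus; real training would use files or a larger iterator.
corpus = ["low lower newest widest", "the lowest and widest"]

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

trainer = BpeTrainer(vocab_size=100, special_tokens=["[UNK]"])
tokenizer.train_from_iterator(corpus, trainer)

print(tokenizer.encode("lowest").tokens)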
Applications
BPE is widely used in:
- Pre-training language models (e.g., the GPT series, RoBERTa; BERT uses the closely related WordPiece algorithm)
- Neural machine translation
- Text generation tasks
- Any NLP task requiring subword tokenization
Advantages and Limitations
Advantages
- Handles rare words effectively
- Reduces vocabulary size
- Maintains meaningful subword units
- Language-agnostic
Limitations
- May split words in unintuitive ways
- Requires preprocessing and training on a corpus
- Fixed vocabulary size must be determined in advance
Encoding
Use the following prebuilt encodings. For modern models such as GPT-4o and GPT-4.1, prefer o200k_base. For GPT-4 and GPT-3.5, use cl100k_base. The snippet after the list shows how to look up the right encoding for a given model programmatically.
- o200k_base (modern models: GPT-4o, GPT-4.1, o1, etc.)
- cl100k_base (GPT-4, GPT-3.5)
- p50k_base
- p50k_edit
- r50k_base
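Rather than hard-coding an encoding name, tiktoken can resolve the encoding from a model name. A small sketch (the model names are just examples; coverage depends on the installed tiktoken version):

import tiktoken

for model in ["gpt-4o", "gpt-4", "gpt-3.5-turbo"]:
    # encoding_for_model maps a model name to its tokenizer encoding.
    enc = tiktoken.encoding_for_model(model)
    print(model, "->", enc.name)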
Code Example
JavaScript (Node)
npm install gpt-tokenizer
import { encode, decode } from "gpt-tokenizer/encoding/o200k_base";
const text = "Hello, 世界!";
const tokens = encode(text);
console.log("token count:", tokens.length);
console.log("tokens:", tokens);
console.log("decoded:", decode(tokens));
// If you need GPT-4/3.5 compatibility instead:
// import { encode as encodeCl100k } from 'gpt-tokenizer/encoding/cl100k_base';
// const tokensCl = encodeCl100k(text);
Python
pip install tiktoken
import tiktoken
enc = tiktoken.get_encoding("o200k_base")
text = "Hello, 世界!"
tokens = enc.encode(text)
print("token count:", len(tokens))
print("tokens:", tokens)
print("decoded:", enc.decode(tokens))
# For GPT-4/3.5 models:
# enc = tiktoken.get_encoding("cl100k_base")
Tip: Token counts are model-dependent. Always match the encoding to your target model.
Conclusion
BPE provides a practical compromise between character- and word-level tokenization, enabling robust handling of rare and novel words while keeping the vocabulary manageable for efficient training and inference.
Further Reading
- OpenAI tiktoken
- Tiktokenizer
- Hugging Face Tokenizers Documentation
- Hugging Face Byte-Pair Encoding tokenization
- GPT Tokenizer
- How LLMs See the World
Last updated: August 2025