Byte-Pair Encoding
A subword tokenization algorithm that starts from individual characters and iteratively merges the most frequent adjacent symbol pair in a training corpus, building up a vocabulary of subword units. It balances vocabulary size against the ability to represent rare words.
Why It Matters
BPE (or a close variant of it) is the tokenization method used by most modern LLMs. It can represent any input, including misspellings and newly coined terms, by falling back to smaller subwords or single characters, while keeping the vocabulary size manageable.
Example
Starting from the characters 'l', 'o', 'w', 'e', 'r', BPE might first merge 'l' + 'o' → 'lo', then 'lo' + 'w' → 'low', building up common subwords that recur across many words in the corpus.
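The merge-learning loop described above can be sketched in a few lines of Python. This is a minimal illustration over a toy vocabulary (the words and frequencies are made up for the example), not a production tokenizer: words are stored as space-separated symbols, and on each iteration the most frequent adjacent pair is merged everywhere it occurs.

```python
import re
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every standalone occurrence of the pair with its merged symbol."""
    bigram = re.escape(" ".join(pair))
    pattern = re.compile(r"(?<!\S)" + bigram + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# Toy corpus: each word is a sequence of characters with a made-up frequency.
vocab = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}

merges = []
for _ in range(5):  # learn 5 merge rules
    pairs = get_pair_counts(vocab)
    best = max(pairs, key=pairs.get)  # most frequent adjacent pair
    vocab = merge_pair(best, vocab)
    merges.append(best)

print(merges)
# → [('e', 's'), ('es', 't'), ('l', 'o'), ('lo', 'w'), ('n', 'e')]
```

Note how 'lo' and then 'low' emerge exactly as in the example: once a pair like ('l', 'o') is merged into one symbol, it can itself participate in later merges.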
Think of it like...
Like creating abbreviations for common letter combinations in shorthand writing — frequent patterns get their own symbol, making the system efficient.
Related Terms
Tokenization
The process of breaking text into smaller units (tokens) for processing by NLP models. Tokenization can split text into words, subwords, or characters depending on the method used.
Token
The basic unit of text that language models process. A token can be a word, part of a word, or a punctuation mark. Text is broken into tokens before being fed into an LLM, and the model generates output one token at a time.
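Once a merge list has been learned, tokenizing new text is a matter of splitting each word into characters and replaying the merges in order. The sketch below uses an assumed, illustrative merge list (not taken from any real model) to show both cases: a word that collapses into known subwords, and an unseen word that partially falls back to single characters.

```python
# Illustrative merge list (assumed for this example, not from a real model).
MERGES = [("e", "s"), ("es", "t"), ("l", "o"), ("lo", "w")]

def apply_merges(word, merges):
    """Split a word into characters, then apply BPE merges in learned order."""
    symbols = list(word)
    for a, b in merges:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)  # merge the adjacent pair into one token
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

print(apply_merges("lowest", MERGES))  # → ['low', 'est']
print(apply_merges("lowly", MERGES))   # → ['low', 'l', 'y']  (partial fallback)
```

The second call shows why BPE never fails on unknown words: any leftover characters simply remain as single-character tokens.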