Byte-Pair Encoding
A subword tokenization algorithm that starts from individual characters and iteratively merges the most frequent adjacent symbol pair in a training corpus, building up a vocabulary of subword units. It balances vocabulary size against the ability to represent rare words.
Why It Matters
BPE (or a close variant of it) is the tokenization method used by most modern LLMs. It can represent any input, including misspellings and newly coined terms, by falling back to smaller subwords or single characters, while keeping the vocabulary size manageable.
Example
Starting from the characters 'l', 'o', 'w', 'e', 'r', BPE might first merge 'l' + 'o' → 'lo', then 'lo' + 'w' → 'low', building up common subwords that recur across many words in the corpus.
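The merge-learning loop described above can be sketched in a few lines of Python. This is a minimal illustration over a toy vocabulary (the words and frequencies are made up for the example), not a production tokenizer: words are stored as space-separated symbols, and on each iteration the most frequent adjacent pair is merged everywhere it occurs.

```python
import re
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every standalone occurrence of the pair with its merged symbol."""
    bigram = re.escape(" ".join(pair))
    pattern = re.compile(r"(?<!\S)" + bigram + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# Toy corpus: each word is a sequence of characters with a made-up frequency.
vocab = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}

merges = []
for _ in range(5):  # learn 5 merge rules
    pairs = get_pair_counts(vocab)
    best = max(pairs, key=pairs.get)  # most frequent adjacent pair
    vocab = merge_pair(best, vocab)
    merges.append(best)

print(merges)
# → [('e', 's'), ('es', 't'), ('l', 'o'), ('lo', 'w'), ('n', 'e')]
```

Note how 'lo' and then 'low' emerge exactly as in the example: once a pair like ('l', 'o') is merged into one symbol, it can itself participate in later merges.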
Think of it like...
Like creating abbreviations for common letter combinations in shorthand writing — frequent patterns get their own symbol, making the system efficient.
Related Terms
Tokenization
The process of breaking text into smaller units (tokens) for processing by NLP models. Tokenization can split text into words, subwords, or characters depending on the method used.
Token
The basic unit of text that language models process. A token can be a word, part of a word, or a punctuation mark. Text is broken into tokens before being fed into an LLM, and the model generates output one token at a time.
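Once a merge list has been learned, tokenizing new text is a matter of splitting each word into characters and replaying the merges in order. The sketch below uses an assumed, illustrative merge list (not taken from any real model) to show both cases: a word that collapses into known subwords, and an unseen word that partially falls back to single characters.

```python
# Illustrative merge list (assumed for this example, not from a real model).
MERGES = [("e", "s"), ("es", "t"), ("l", "o"), ("lo", "w")]

def apply_merges(word, merges):
    """Split a word into characters, then apply BPE merges in learned order."""
    symbols = list(word)
    for a, b in merges:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)  # merge the adjacent pair into one token
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

print(apply_merges("lowest", MERGES))  # → ['low', 'est']
print(apply_merges("lowly", MERGES))   # → ['low', 'l', 'y']  (partial fallback)
```

The second call shows why BPE never fails on unknown words: any leftover characters simply remain as single-character tokens.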