Tokenization Strategy
The set of rules that determines how text is split into tokens. Different strategies (word-level, subword, character-level) trade off vocabulary size against sequence length.
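The tradeoff can be seen by tokenizing the same sentence at each granularity. This is a toy sketch (the subword splits are hand-coded; real tokenizers learn them from data), but it shows how finer-grained strategies shrink the vocabulary while lengthening the sequence:

```python
text = "unhappiness is unhelpful"

# Word-level: one token per whitespace-delimited word (short sequence, huge vocab).
word_tokens = text.split()

# Character-level: one token per character (tiny vocab, long sequence).
char_tokens = list(text)

# Subword-level (toy): split off a known prefix; real tokenizers learn such pieces.
def toy_subword(word):
    if word.startswith("un") and len(word) > 2:
        return ["un", word[2:]]
    return [word]

subword_tokens = [piece for w in word_tokens for piece in toy_subword(w)]

print(len(word_tokens), len(subword_tokens), len(char_tokens))  # 3 5 24
```

The subword sequence sits between the two extremes: longer than word-level, far shorter than character-level, with a vocabulary of reusable pieces.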
Why It Matters
Your tokenization strategy affects model efficiency, multilingual support, and how well the model handles rare or novel words.
Example
A subword tokenizer splits 'unhappiness' into 'un' + 'happiness', while a word-level tokenizer treats it as a single token. The subword approach generalizes better to new word forms it has never seen whole.
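The advantage shows up when a word is missing from the vocabulary. In this sketch (toy hand-picked vocabulary; real vocabularies are learned), a word-level lookup collapses the novel form to an unknown token, while a greedy longest-match subword pass covers it from known pieces:

```python
# Toy vocabulary of known pieces; a trained tokenizer would learn these.
vocab = {"un", "happy", "happiness", "ness"}

def word_level(word):
    # Word-level: a novel form becomes a single unknown token.
    return [word] if word in vocab else ["<unk>"]

def subword_level(word):
    # Subword (greedy longest match): reuse known pieces to cover new forms.
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            tokens.append("<unk>")  # no piece matches: fall back per character
            i += 1
    return tokens

print(word_level("unhappiness"))     # ['<unk>']
print(subword_level("unhappiness"))  # ['un', 'happiness']
```

The word-level model loses all information about the unseen word; the subword model keeps its meaningful parts.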
Think of it like...
Like choosing how to cut a log — you can cut it into large planks, medium boards, or small strips, each useful for different building projects.
Related Terms
Tokenization
The process of breaking text into smaller units (tokens) for processing by NLP models. Tokenization can split text into words, subwords, or characters depending on the method used.
Byte-Pair Encoding
A subword tokenization algorithm that starts with individual characters and iteratively merges the most frequent pairs to create a vocabulary of subword units. It balances vocabulary size with handling of rare words.
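The merge loop can be sketched in a few lines. This follows the standard toy formulation (words stored as space-separated symbols with corpus frequencies; the corpus counts here are illustrative): count adjacent symbol pairs, merge the most frequent pair everywhere, and repeat.

```python
from collections import Counter

def get_pair_counts(vocab):
    # Count adjacent symbol pairs, weighted by word frequency.
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    # Replace every occurrence of the pair with its concatenation.
    a, b = pair
    return {word.replace(f"{a} {b}", f"{a}{b}"): freq
            for word, freq in vocab.items()}

# Corpus words as space-separated characters, with frequencies.
vocab = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}

merges = []
for _ in range(4):
    best = max(get_pair_counts(vocab), key=get_pair_counts(vocab).get)
    vocab = merge_pair(best, vocab)
    merges.append(best)

print(merges)  # [('e', 's'), ('es', 't'), ('l', 'o'), ('lo', 'w')]
```

Each learned merge becomes a vocabulary entry, so frequent fragments like 'est' end up as single tokens while rare words still decompose into smaller known pieces.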