Tokenization Strategy
The set of rules that determines how text is split into tokens. Different strategies (word-level, subword, character-level) trade off vocabulary size against sequence length.
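The tradeoff can be seen by tokenizing the same sentence at each granularity. This is a toy sketch (the subword splits are hand-coded; real tokenizers learn them from data), but it shows how finer-grained strategies shrink the vocabulary while lengthening the sequence:

```python
text = "unhappiness is unhelpful"

# Word-level: one token per whitespace-delimited word (short sequence, huge vocab).
word_tokens = text.split()

# Character-level: one token per character (tiny vocab, long sequence).
char_tokens = list(text)

# Subword-level (toy): split off a known prefix; real tokenizers learn such pieces.
def toy_subword(word):
    if word.startswith("un") and len(word) > 2:
        return ["un", word[2:]]
    return [word]

subword_tokens = [piece for w in word_tokens for piece in toy_subword(w)]

print(len(word_tokens), len(subword_tokens), len(char_tokens))  # 3 5 24
```

The subword sequence sits between the two extremes: longer than word-level, far shorter than character-level, with a vocabulary of reusable pieces.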
Why It Matters
Your tokenization strategy affects model efficiency, multilingual support, and how well the model handles rare or novel words.
Example
A subword tokenizer splits 'unhappiness' into 'un' + 'happiness', while a word-level tokenizer treats it as a single token. The subword approach generalizes better to new word forms it has never seen whole.
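The advantage shows up when a word is missing from the vocabulary. In this sketch (toy hand-picked vocabulary; real vocabularies are learned), a word-level lookup collapses the novel form to an unknown token, while a greedy longest-match subword pass covers it from known pieces:

```python
# Toy vocabulary of known pieces; a trained tokenizer would learn these.
vocab = {"un", "happy", "happiness", "ness"}

def word_level(word):
    # Word-level: a novel form becomes a single unknown token.
    return [word] if word in vocab else ["<unk>"]

def subword_level(word):
    # Subword (greedy longest match): reuse known pieces to cover new forms.
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            tokens.append("<unk>")  # no piece matches: fall back per character
            i += 1
    return tokens

print(word_level("unhappiness"))     # ['<unk>']
print(subword_level("unhappiness"))  # ['un', 'happiness']
```

The word-level model loses all information about the unseen word; the subword model keeps its meaningful parts.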
Think of it like...
Like choosing how to cut a log — you can cut it into large planks, medium boards, or small strips, each useful for different building projects.
Related Terms
Tokenization
The process of breaking text into smaller units (tokens) for processing by NLP models. Tokenization can split text into words, subwords, or characters depending on the method used.
Byte-Pair Encoding
A subword tokenization algorithm that starts with individual characters and iteratively merges the most frequent pairs to create a vocabulary of subword units. It balances vocabulary size with handling of rare words.
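The merge loop can be sketched in a few lines. This follows the standard toy formulation (words stored as space-separated symbols with corpus frequencies; the corpus counts here are illustrative): count adjacent symbol pairs, merge the most frequent pair everywhere, and repeat.

```python
from collections import Counter

def get_pair_counts(vocab):
    # Count adjacent symbol pairs, weighted by word frequency.
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    # Replace every occurrence of the pair with its concatenation.
    a, b = pair
    return {word.replace(f"{a} {b}", f"{a}{b}"): freq
            for word, freq in vocab.items()}

# Corpus words as space-separated characters, with frequencies.
vocab = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}

merges = []
for _ in range(4):
    best = max(get_pair_counts(vocab), key=get_pair_counts(vocab).get)
    vocab = merge_pair(best, vocab)
    merges.append(best)

print(merges)  # [('e', 's'), ('es', 't'), ('l', 'o'), ('lo', 'w')]
```

Each learned merge becomes a vocabulary entry, so frequent fragments like 'est' end up as single tokens while rare words still decompose into smaller known pieces.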