Artificial Intelligence

Tokenization

The process of breaking text into smaller units (tokens) for processing by NLP models. Depending on the method, tokenization can split text into words, subwords, or individual characters.
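The three granularities can be sketched in a few lines. This is an illustrative comparison, not the output of any particular tokenizer; the subword split shown is a hand-picked example of the kind of pieces a learned tokenizer (e.g. BPE) might produce.

```python
text = "unhappiness"

# Word-level: the whole word is a single token
word_tokens = [text]            # ['unhappiness']

# Character-level: one token per character
char_tokens = list(text)        # ['u', 'n', 'h', 'a', 'p', 'p', 'i', 'n', 'e', 's', 's']

# Subword-level: a trained tokenizer learns frequent pieces from data;
# this particular split is illustrative, not from a real model
subword_tokens = ["un", "happi", "ness"]
```

Note how the subword split keeps meaningful pieces ("un", "ness") without needing the full word in the vocabulary.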

Why It Matters

Tokenization is typically the first step in an NLP pipeline. How text is tokenized affects the model's vocabulary size, its handling of rare or unseen words, and how well it supports different languages.

Example

The sentence "I don't like mushrooms" might be tokenized as ['I', 'don', "'", 't', 'like', 'mushrooms'] or as ['I', 'do', "n't", 'like', 'mush', 'rooms'], depending on the tokenizer.
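The first split above can be reproduced with a simple punctuation-aware regex. The regex rule here is an illustrative choice, not the rule set of any specific library; real tokenizers use more elaborate (often learned) rules.

```python
import re

sentence = "I don't like mushrooms"

# Split into runs of word characters OR single punctuation marks;
# the apostrophe is not a word character, so "don't" breaks apart
tokens = re.findall(r"\w+|[^\w\s]", sentence)
# ['I', 'don', "'", 't', 'like', 'mushrooms']
```

A different rule set, such as one that keeps contractions together as "n't", would yield the second split shown above.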

Think of it like...

Like breaking a sentence into Scrabble tiles — the way you split it up determines what building blocks the model has to work with.

Related Terms