CLIP
Contrastive Language-Image Pre-training — an OpenAI model trained to understand the relationship between images and text. CLIP can match images to text descriptions without being trained on specific image categories.
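The core mechanic is simple: CLIP embeds an image and a set of candidate text labels into the same vector space, then picks the label whose embedding is most similar to the image's. Below is a minimal sketch of that zero-shot classification step, assuming unit-normalized embeddings already produced by CLIP's encoders; the vectors here are hand-picked mocks, not real model outputs.

```python
import numpy as np

def zero_shot_classify(image_emb, label_embs, labels):
    """Pick the label whose text embedding is most similar to the image.
    With unit vectors, the dot product is the cosine similarity."""
    sims = label_embs @ image_emb
    return labels[int(np.argmax(sims))]

# Mock unit vectors standing in for real CLIP embeddings.
image_emb = np.array([0.9, 0.1, 0.1])
image_emb /= np.linalg.norm(image_emb)

labels = ["a photo of a dog", "a photo of a cat"]
label_embs = np.array([[1.0, 0.0, 0.0],   # pretend "dog" direction
                       [0.0, 1.0, 0.0]])  # pretend "cat" direction

print(zero_shot_classify(image_emb, label_embs, labels))
# → a photo of a dog
```

Note that the candidate labels are ordinary text, so new categories can be added at query time with no retraining — that is what makes the classification "zero-shot."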
Why It Matters
CLIP bridged the gap between vision and language, enabling zero-shot image classification and text-based image search, and its encoders underpin text-to-image systems such as DALL-E and Stable Diffusion.
Example
Searching a photo library by typing 'sunset over mountains' — CLIP matches the text description to images with similar semantic content without needing labeled training data.
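That search example can be sketched as a similarity ranking: embed the query text once, then sort the library's precomputed image embeddings by cosine similarity. The embeddings below are mock 2-D vectors standing in for real CLIP outputs.

```python
import numpy as np

def search(query_emb, photo_embs, photo_names, k=2):
    """Return the k photo names most similar to the text query."""
    sims = photo_embs @ query_emb           # cosine similarities (unit-ish vectors)
    order = np.argsort(sims)[::-1][:k]      # best match first
    return [photo_names[i] for i in order]

query_emb = np.array([0.8, 0.6])            # pretend embedding of "sunset over mountains"
photo_names = ["beach.jpg", "mountains.jpg", "city.jpg"]
photo_embs = np.array([[0.9, 0.1],
                       [0.7, 0.7],
                       [0.1, 0.9]])

print(search(query_emb, photo_embs, photo_names))
# → ['mountains.jpg', 'beach.jpg']
```

In a real system the image embeddings are computed once and stored in a vector index, so a search only costs one text-encoder pass plus a nearest-neighbor lookup.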
Think of it like...
Like a bilingual person who can describe any photo in words and find any photo from a description — they understand both languages and the connections between them.
Related Terms
Contrastive Learning
A self-supervised technique where the model learns by comparing similar (positive) and dissimilar (negative) pairs of examples. It learns representations where similar items are close and different items are far apart.
Multimodal AI
AI systems that can process and generate multiple types of data — text, images, audio, video — within a single model. Multimodal models understand the relationships between different data types.
Zero-Shot Learning
A model's ability to perform a task it was never explicitly trained on or shown examples of. The model transfers its general knowledge — for CLIP, natural-language descriptions of the candidate categories — to handle entirely new cases.
Embedding
A numerical representation of data (text, images, etc.) as a vector of numbers in a high-dimensional space. Similar items are placed closer together in this space, enabling machines to understand semantic relationships.
Text-to-Image
AI models that generate visual images from natural language text descriptions (prompts). This technology converts written descriptions into original images, illustrations, or photorealistic visuals.
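The contrastive objective described above can be made concrete: CLIP trains with a symmetric cross-entropy over the image-text similarity matrix of a batch, where matched pairs sit on the diagonal. This is a minimal NumPy sketch with mock unit embeddings, not the production training code.

```python
import numpy as np

def clip_contrastive_loss(img_embs, txt_embs, temperature=0.07):
    """Symmetric cross-entropy over the image-text similarity matrix.
    The loss pulls matched (diagonal) pairs together and pushes
    mismatched pairs apart."""
    logits = (img_embs @ txt_embs.T) / temperature
    n = logits.shape[0]

    def log_softmax(x, axis):
        x = x - x.max(axis=axis, keepdims=True)
        return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

    diag = np.arange(n)
    loss_img = -log_softmax(logits, axis=1)[diag, diag].mean()  # image → text
    loss_txt = -log_softmax(logits, axis=0)[diag, diag].mean()  # text → image
    return (loss_img + loss_txt) / 2

# Mock embeddings where each image perfectly matches the text at its index:
# the loss is near zero because every diagonal pair dominates its row/column.
print(clip_contrastive_loss(np.eye(3), np.eye(3)))
```

Minimizing this loss over many batches is what shapes the shared embedding space that the classification and search examples above rely on.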