Multimodal AI
AI systems that can process and generate multiple types of data (text, images, audio, video) within a single model. A multimodal model learns the relationships between these data types, such as which caption describes which image.
Why It Matters
Multimodal AI enables richer, more natural interactions, such as showing a model a photo and asking questions about it, or generating images from text descriptions.
Example
GPT-4 Vision analyzing a photo of a damaged car and generating a repair cost estimate, or Claude reading a chart image and explaining the trends.
Think of it like...
Like a person who can read, listen, look at pictures, and respond verbally, using all their senses together to understand and communicate.
Related Terms
Vision-Language Model
An AI model that can process both visual and textual inputs, understanding images and generating text about them. VLMs combine computer vision with language understanding.
Text-to-Image
AI models that generate visual images from natural language text descriptions (prompts). This technology converts written descriptions into original images, illustrations, or photorealistic visuals.
Text-to-Speech
AI technology that converts written text into natural-sounding human speech. Modern TTS systems can generate voices with realistic intonation, emotion, and even clone specific voices.
Speech-to-Text
AI technology that converts spoken audio into written text (also called automatic speech recognition or ASR). Modern systems handle accents, background noise, and multiple speakers.
CLIP
Contrastive Language-Image Pre-training — an OpenAI model trained to understand the relationship between images and text. CLIP can match images to text descriptions without having been trained on those specific image categories, a capability known as zero-shot classification.
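The matching step behind CLIP-style zero-shot classification can be sketched in a few lines: embed the image and each candidate caption, normalize the embeddings, and pick the caption with the highest cosine similarity. The vectors below are toy stand-ins (real CLIP embeddings are high-dimensional and produced by trained image and text encoders), so only the matching logic is illustrative.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity of two unit vectors."""
    return sum(x * y for x, y in zip(a, b))

# Toy stand-ins for embeddings from CLIP's image and text encoders.
image_embedding = normalize([0.9, 0.1, 0.2])
caption_embeddings = {
    "a photo of a dog":   normalize([0.8, 0.2, 0.1]),
    "a photo of a cat":   normalize([0.1, 0.9, 0.3]),
    "a chart of revenue": normalize([0.2, 0.1, 0.95]),
}

# Zero-shot classification: choose the caption whose embedding
# lies closest to the image embedding.
scores = {cap: cosine(image_embedding, emb)
          for cap, emb in caption_embeddings.items()}
best = max(scores, key=scores.get)
print(best)  # → a photo of a dog
```

Because the captions are ordinary text, new "categories" can be added at inference time simply by writing new captions, which is what makes the approach zero-shot.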