Mixture of Experts
An architecture where a model consists of multiple specialized sub-networks (experts) and a gating mechanism that routes each input to only the most relevant experts. Only a fraction of the total parameters are active per input.
Why It Matters
MoE enables models with massive total parameter counts that remain efficient to run, because only a small subset of parameters activates per query. GPT-4 is widely rumored to use an MoE architecture, though this has not been officially confirmed.
Example
A model with 8 expert networks where each input activates only 2 — the model has the capacity of all 8 experts but roughly the computational cost of just 2.
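The 8-expert, top-2 routing described above can be sketched in a few lines. This is a minimal illustration, not a production MoE layer: the expert networks are stand-in linear maps with made-up sizes, and the gating network is a single random weight matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_EXPERTS = 8   # total expert networks
TOP_K = 2         # experts activated per input
DIM = 4           # hypothetical feature dimension

# Each "expert" here is just a linear map; real experts are full sub-networks.
experts = [rng.standard_normal((DIM, DIM)) for _ in range(NUM_EXPERTS)]
gate_weights = rng.standard_normal((DIM, NUM_EXPERTS))  # gating network

def moe_forward(x):
    """Route input x to the TOP_K experts scored highest by the gate."""
    logits = x @ gate_weights                 # one score per expert
    chosen = np.argsort(logits)[-TOP_K:]      # indices of the top-2 experts
    # Softmax over only the selected experts' scores.
    weights = np.exp(logits[chosen])
    weights /= weights.sum()
    # Only TOP_K of the NUM_EXPERTS experts actually run: sparse activation.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, chosen))

y = moe_forward(rng.standard_normal(DIM))
print(y.shape)  # output has the same shape as a dense layer would produce
```

Note that the gate's scores also weight how the two chosen experts' outputs are combined, so routing is learned jointly with the experts during training.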
Think of it like...
Like a hospital with many specialists — you do not see every doctor on every visit; a triage nurse (the gating mechanism) routes you to the right specialists for your condition.
Related Terms
Transformer
A neural network architecture introduced in 2017 that uses self-attention mechanisms to process sequential data in parallel rather than sequentially. Transformers are the foundation of modern LLMs like GPT, Claude, and Gemini.
Sparse Model
A neural network where most parameters are zero or inactive for any given input. Sparse models achieve high capacity with lower computational cost by only using relevant parameters.
Parameter
Any learnable value in a machine learning model that is adjusted during training. Parameters include weights and biases in neural networks. Model size is often described by parameter count.