Mixture of Experts
An architecture where a model consists of multiple specialized sub-networks (experts) and a gating mechanism that routes each input to only the most relevant experts. Only a fraction of the total parameters are active per input.
Why It Matters
MoE enables models with massive total parameter counts that remain efficient to run, because only a small subset of parameters activates per query. GPT-4 is widely rumored to use an MoE architecture, though this has not been officially confirmed.
Example
A model with 8 expert networks where each input activates only 2 — the model has the capacity of all 8 experts but roughly the computational cost of just 2.
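The 8-expert, top-2 routing described above can be sketched in a few lines. This is a minimal illustration, not a production MoE layer: the expert networks are stand-in linear maps with made-up sizes, and the gating network is a single random weight matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_EXPERTS = 8   # total expert networks
TOP_K = 2         # experts activated per input
DIM = 4           # hypothetical feature dimension

# Each "expert" here is just a linear map; real experts are full sub-networks.
experts = [rng.standard_normal((DIM, DIM)) for _ in range(NUM_EXPERTS)]
gate_weights = rng.standard_normal((DIM, NUM_EXPERTS))  # gating network

def moe_forward(x):
    """Route input x to the TOP_K experts scored highest by the gate."""
    logits = x @ gate_weights                 # one score per expert
    chosen = np.argsort(logits)[-TOP_K:]      # indices of the top-2 experts
    # Softmax over only the selected experts' scores.
    weights = np.exp(logits[chosen])
    weights /= weights.sum()
    # Only TOP_K of the NUM_EXPERTS experts actually run: sparse activation.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, chosen))

y = moe_forward(rng.standard_normal(DIM))
print(y.shape)  # output has the same shape as a dense layer would produce
```

Note that the gate's scores also weight how the two chosen experts' outputs are combined, so routing is learned jointly with the experts during training.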
Think of it like...
Like a hospital with many specialists — you do not see every doctor on every visit; a triage nurse (the gating mechanism) routes you to the right specialists for your condition.
Related Terms
Transformer
A neural network architecture introduced in 2017 that uses self-attention mechanisms to process sequential data in parallel rather than sequentially. Transformers are the foundation of modern LLMs like GPT, Claude, and Gemini.
Sparse Model
A neural network where most parameters are zero or inactive for any given input. Sparse models achieve high capacity with lower computational cost by only using relevant parameters.
Parameter
Any learnable value in a machine learning model that is adjusted during training. Parameters include weights and biases in neural networks. Model size is often described by parameter count.