LoRA
Low-Rank Adaptation — a parameter-efficient fine-tuning technique that freezes the original model weights and trains a pair of small low-rank matrices alongside selected weight matrices (typically the attention projections). Only these adapters are updated, which dramatically reduces the compute and memory needed for fine-tuning.
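A minimal NumPy sketch of the idea (toy dimensions and rank chosen for illustration, not from any real model): the frozen weight W is augmented with trainable factors B and A, and the layer computes Wx + BAx. Initializing B to zero means the adapter starts as a no-op, which is the standard LoRA initialization.

```python
import numpy as np

rng = np.random.default_rng(0)

d, k, r = 512, 512, 8          # layer dims and LoRA rank (r << d, k)
W = rng.normal(size=(d, k))    # pretrained weight, frozen during training

# Trainable low-rank factors; B starts at zero so the adapter
# initially leaves the layer's output unchanged.
A = rng.normal(scale=0.01, size=(r, k))
B = np.zeros((d, r))

def lora_forward(x):
    # h = W x + B A x — only A and B would receive gradient updates
    return W @ x + B @ (A @ x)

x = rng.normal(size=k)
# With B = 0 the adapted layer matches the frozen base layer exactly.
assert np.allclose(lora_forward(x), W @ x)

# Parameter savings: trainable adapter params vs. full fine-tuning.
full_params = d * k            # 262,144
lora_params = r * (d + k)      # 8,192 — about 3% of the layer
```

The savings grow with layer size: the adapter costs r(d + k) parameters against d·k for the full matrix, so at realistic transformer dimensions the trainable fraction drops well below 1%.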
Why It Matters
LoRA makes fine-tuning large models practical on consumer hardware. You can customize a 70B-parameter model on a single GPU instead of needing a cluster.
Example
Fine-tuning Llama 2 70B with LoRA requires a fraction of the hundreds of GB of GPU memory that full fine-tuning demands — and combined with 4-bit quantization (QLoRA), models of this scale can be trained on a single high-end GPU, making them accessible to individual developers and small teams.
Think of it like...
Like altering a suit instead of making a new one from scratch — small, targeted changes to key areas give you a custom fit without rebuilding everything.
Related Terms
Fine-Tuning
The process of taking a pre-trained model and further training it on a smaller, domain-specific dataset to specialize its behavior for a particular task or domain. Fine-tuning adjusts the model's weights to improve performance on the target task.
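A toy NumPy illustration of that definition (the "pretrained" weights, the domain shift, and the learning rate are all made up for the sketch): starting from existing weights, a few gradient-descent steps on a small domain-specific dataset move the model toward the new task.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Pretrained" parameters of a tiny linear model, stand-in for a
# model already trained on a broad task.
w = rng.normal(size=3)

# Small domain-specific dataset whose target relationship is shifted
# away from what the pretrained weights encode.
X = rng.normal(size=(64, 3))
w_domain = w + np.array([0.5, -0.3, 0.2])   # hypothetical domain optimum
y = X @ w_domain

# Fine-tuning: continue gradient descent from the pretrained weights
# on the new data (MSE loss), rather than training from scratch.
lr = 0.05
for _ in range(500):
    grad = 2 * X.T @ (X @ w - y) / len(X)
    w -= lr * grad

# The weights have converged toward the domain-specific optimum.
```

The key point the sketch captures is the starting position: fine-tuning reuses the pretrained weights as initialization, so far less data and compute are needed than training from random initialization.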
QLoRA
Quantized Low-Rank Adaptation — combines LoRA with quantization to further reduce memory requirements for fine-tuning. It quantizes the base model to 4-bit precision while training LoRA adapters in higher precision.
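A toy sketch of that split in NumPy (real QLoRA uses blockwise NF4 quantization via the bitsandbytes library; the symmetric integer scheme and dimensions here are simplifying assumptions): the frozen base weight is stored at 4-bit precision, while the LoRA factors stay in full precision and are the only trained parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 256, 256, 8

W = rng.normal(size=(d, k)).astype(np.float32)  # pretrained base weight

# 4-bit symmetric quantization of the frozen base: 15 signed levels
# in [-7, 7], each value storable in 4 bits.
scale = np.abs(W).max() / 7.0
W4 = np.clip(np.round(W / scale), -7, 7).astype(np.int8)

# LoRA adapters stay in full precision and are the only trained params.
A = rng.normal(scale=0.01, size=(r, k)).astype(np.float32)
B = np.zeros((d, r), dtype=np.float32)

def qlora_forward(x):
    # Dequantize the base on the fly, then add the high-precision
    # low-rank adapter path.
    return (W4.astype(np.float32) * scale) @ x + B @ (A @ x)

x = rng.normal(size=k).astype(np.float32)
y = qlora_forward(x)
```

Storing the base at 4 bits instead of 16 cuts its memory footprint by 4x, which is where most of QLoRA's savings over plain LoRA come from.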
Quantization
The process of reducing the precision of a model's numerical weights (e.g., from 32-bit to 8-bit or 4-bit), making the model smaller and faster while accepting a small trade-off in accuracy.
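The trade-off is easy to see in a small NumPy sketch (symmetric int8 quantization, one scale for the whole tensor — real systems typically use per-channel or blockwise scales):

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=1000).astype(np.float32)  # fp32 weights

# Symmetric int8 quantization: map [-max|w|, max|w|] onto [-127, 127].
scale = np.abs(w).max() / 127.0
q = np.round(w / scale).astype(np.int8)       # 4x smaller than fp32

# Dequantize for use in computation; a small rounding error remains.
w_hat = q.astype(np.float32) * scale
err = np.abs(w - w_hat).max()                 # at most half a step, scale/2
```

Each weight shrinks from 32 bits to 8 (or 4), and the accuracy cost is bounded by the quantization step size — the "small trade-off" the definition refers to.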