04 - Fine-Tuning and Training
1. Pre-training vs Fine-tuning
What: A staged process for creating useful LLMs: pre-training, then fine-tuning, then alignment.
```
┌──────────────────┐      ┌──────────────────┐      ┌────────────────┐
│   PRE-TRAINING   │ ───▶ │   FINE-TUNING    │ ───▶ │   ALIGNMENT    │
│                  │      │                  │      │                │
│ Massive corpus   │      │ Task-specific    │      │ RLHF / DPO     │
│ (internet text)  │      │ datasets         │      │ Human prefs    │
│                  │      │                  │      │                │
│ Next-token       │      │ Instruction      │      │ Safety +       │
│ prediction       │      │ following        │      │ helpfulness    │
│                  │      │                  │      │                │
│ Weeks on 1000s   │      │ Hours on 10s     │      │ Days on 100s   │
│ of GPUs          │      │ of GPUs          │      │ of GPUs        │
└──────────────────┘      └──────────────────┘      └────────────────┘
```

| Aspect | Pre-training | Fine-tuning |
|---|---|---|
| Data | Trillions of tokens (web, books, code) | Thousands to millions of examples |
| Objective | Next token prediction | Task-specific loss |
| Cost | $1M - $100M+ | $100 - $10,000 |
| Updates | All parameters | All or subset (LoRA) |
| Result | General language model | Specialized model |
2. LoRA (Low-Rank Adaptation)
What: Parameter-efficient fine-tuning that adds small trainable matrices to frozen model weights. Instead of updating all billions of parameters, train only ~0.1-1% of them.
How it works:
Original weight matrix W: (d × d), e.g., (4096 × 4096), frozen during fine-tuning.

LoRA adds a low-rank update: W + ΔW = W + B × A
- A: (r × d), e.g., (8 × 4096), trainable (down-projection)
- B: (d × r), e.g., (4096 × 8), trainable (up-projection)
- r = rank (typically 4-64), much smaller than d
```
        ┌───────────────┐
        │    Input x    │
        │    (d dims)   │
        └───────┬───────┘
                │
        ┌───────┴───────┐
        │               │
        ▼               ▼
   ┌─────────┐     ┌─────────┐
   │    W    │     │    A    │  (r × d), down-project
   │ (frozen)│     └────┬────┘
   │         │          ▼
   │         │     ┌─────────┐
   │         │     │    B    │  (d × r), up-project
   │         │     └────┬────┘
   └────┬────┘          │
        └───────┬───────┘
               add
                ▼
         Output (d dims)
```

Advantages:
- Train 0.1% of parameters instead of 100%
- Can run on consumer GPUs (single 24GB GPU for 7B models)
- Multiple LoRA adapters can be swapped at inference time
- Minimal quality loss compared to full fine-tuning
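A minimal sketch of the idea in PyTorch, illustrative rather than any particular library's implementation: a frozen linear layer W with a trainable low-rank update B × A added on top (the class name, `alpha` scaling, and initialization follow common conventions but are assumptions here).

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer W plus a trainable low-rank update B @ A."""
    def __init__(self, d_in: int, d_out: int, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = nn.Linear(d_in, d_out, bias=False)
        self.base.weight.requires_grad_(False)                # freeze W
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)    # down-projection (r × d_in)
        self.B = nn.Parameter(torch.zeros(d_out, r))          # up-projection (d_out × r), zero-init so ΔW starts at 0
        self.scale = alpha / r                                 # common LoRA scaling convention

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # y = W x + scale * B (A x); only A and B receive gradients
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T

layer = LoRALinear(4096, 4096, r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable: {trainable:,} / {total:,}")  # ~65K of ~16.8M parameters (~0.4%)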
3. QLoRA
What: LoRA + quantization. Quantize the frozen base model to 4-bit precision, then train LoRA adapters in 16-bit precision on top.
```
Full fine-tuning of 70B model: ~140GB VRAM (impossible on consumer hardware)
LoRA on 70B model:             ~70GB VRAM
QLoRA on 70B model:            ~35GB VRAM (fits on 2× A100 or 1× A100 80GB)
```

Key innovations:
- 4-bit NormalFloat (NF4) quantization: information-theoretically optimal for normally distributed weights
- Double quantization: quantize the quantization constants too
- Paged optimizers: spill optimizer state to CPU RAM when GPU memory runs out
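A hedged sketch of how this is typically wired up with the Hugging Face transformers + peft + bitsandbytes stack; the model id, rank, and target modules are placeholders, not a recommendation.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit NF4 base model with double quantization (requires bitsandbytes)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",        # placeholder model id
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# 16-bit LoRA adapters on top of the frozen 4-bit base
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # which layers to adapt; varies by architecture
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```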
4. RLHF (Reinforcement Learning from Human Feedback)
What: Training methodology that aligns LLMs with human preferences using a reward model trained on human comparisons.
Step 1: Supervised Fine-Tuning (SFT)
Train on high-quality (prompt, response) pairs
Step 2: Reward Model Training
Humans rank model outputs: Response A > Response B
Train a reward model to predict human preferences
Step 3: RL Optimization (PPO)
Model generates responses → Reward model scores them
→ PPO optimizes policy to maximize reward
→ KL penalty prevents model from diverging too far from SFT model

Reward model:
```python
# Simplified: reward model scores a (prompt, response) pair
reward = reward_model(prompt, response)  # scalar score

# Training signal for a labeled pair (response_a preferred over response_b):
# loss = -log(sigmoid(reward(a) - reward(b)))
```

Challenges:
- Reward hacking: Model finds exploits in reward model
- Mode collapse: Model generates repetitive "safe" responses
- Expensive: Requires multiple models in memory simultaneously
- Human preference data is expensive and subjective
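A minimal, runnable sketch of the pairwise reward loss from Step 2 above, assuming the reward model has already produced one scalar per sequence (function and variable names are illustrative):

```python
import torch
import torch.nn.functional as F

def reward_pair_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise loss: push the chosen response's reward above the rejected one's.

    reward_chosen, reward_rejected: shape (batch,), scalar reward per sequence.
    """
    # -log sigmoid(r_chosen - r_rejected), averaged over the batch
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy usage with made-up reward values
r_a = torch.tensor([1.2, 0.3, 2.0])   # rewards for preferred responses
r_b = torch.tensor([0.5, 0.4, -1.0])  # rewards for rejected responses
print(reward_pair_loss(r_a, r_b))     # loss shrinks as preferred rewards pull ahead
```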
5. DPO (Direct Preference Optimization)
What: Simpler alternative to RLHF that eliminates the separate reward model. Directly optimizes the language model on preference pairs.
RLHF pipeline: SFT → Reward Model → PPO → Aligned Model
DPO pipeline:  SFT → DPO → Aligned Model (skip reward model + RL)

How DPO works:
- Takes pairs of (preferred response, rejected response) for each prompt
- Directly increases probability of preferred response and decreases probability of rejected response
- Uses a mathematical reformulation that implicitly learns the reward
```python
# DPO loss (simplified)
loss = -log(sigmoid(
    beta * (log_prob_preferred - log_prob_rejected
            - log_prob_ref_preferred + log_prob_ref_rejected)
))
# beta controls strength of preference optimization
# ref = reference (SFT) model → prevents drift
```

Advantages over RLHF:
- No separate reward model needed
- No RL training loop (more stable)
- Simpler to implement
- Lower memory requirements
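A runnable version of the loss above, assuming per-example summed log-probabilities have already been computed under the policy and the frozen reference model (the function and argument names are illustrative):

```python
import torch
import torch.nn.functional as F

def dpo_loss(
    policy_logp_chosen: torch.Tensor,    # log p_policy(chosen | prompt), shape (batch,)
    policy_logp_rejected: torch.Tensor,  # log p_policy(rejected | prompt)
    ref_logp_chosen: torch.Tensor,       # log p_ref(chosen | prompt), reference model is frozen
    ref_logp_rejected: torch.Tensor,
    beta: float = 0.1,
) -> torch.Tensor:
    # Implicit reward margin: how much more the policy prefers chosen over rejected,
    # relative to the reference model.
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    margin = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(margin).mean()

# Toy usage with made-up log-probabilities
loss = dpo_loss(
    torch.tensor([-12.0, -30.5]), torch.tensor([-15.0, -28.0]),
    torch.tensor([-13.0, -29.0]), torch.tensor([-14.0, -29.5]),
)
print(loss)
```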
6. Instruction Tuning
What: Fine-tuning a base model on (instruction, response) pairs so it follows human instructions instead of just completing text.
Base model input:  "Write a haiku about coding"
Base model output: "competitions are held annually in Japan..." (completion)

Instruction-tuned input:  "Write a haiku about coding"
Instruction-tuned output: "Lines of logic flow / Debugging into the night / Code compiles at last"

Dataset formats:
```json
{
  "instruction": "Explain quantum computing to a 5-year-old",
  "input": "",
  "output": "Imagine you have a magic coin..."
}
```

With context:

```json
{
  "instruction": "Summarize this article",
  "input": "The Federal Reserve announced today...",
  "output": "The Fed raised interest rates by 0.25%..."
}
```

Notable instruction datasets:
- Alpaca: 52K instructions generated by GPT-3.5
- Dolly: 15K human-written instructions (Databricks)
- OpenAssistant: 161K messages in conversation trees
- FLAN: Google's massive multi-task instruction collection
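A minimal sketch of turning one record in the instruction/input/output format above into a single training string. The template here is an Alpaca-style stand-in; real projects use whatever template their base model expects.

```python
def format_example(record: dict) -> str:
    """Render an {instruction, input, output} record as one training string."""
    if record.get("input"):
        prompt = (
            f"### Instruction:\n{record['instruction']}\n\n"
            f"### Input:\n{record['input']}\n\n"
            "### Response:\n"
        )
    else:
        prompt = f"### Instruction:\n{record['instruction']}\n\n### Response:\n"
    return prompt + record["output"]

example = {
    "instruction": "Summarize this article",
    "input": "The Federal Reserve announced today...",
    "output": "The Fed raised interest rates by 0.25%...",
}
print(format_example(example))
```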
7. Dataset Preparation
Best practices for fine-tuning data:
| Aspect | Recommendation |
|---|---|
| Quality | 1000 high-quality examples > 100K noisy ones |
| Format | Consistent structure (instruction/input/output or conversation) |
| Diversity | Cover edge cases, different phrasings, varied difficulty |
| Deduplication | Remove near-duplicates to prevent overfitting |
| Balance | Roughly equal representation of categories/tasks |
| Validation | Hold out 10-20% for evaluation |
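A small sketch of two of these steps, deduplication and a held-out validation split, over a list of records like the JSON examples above. This does exact-match dedup only; near-duplicate detection usually needs embeddings or MinHash. All names here are illustrative.

```python
import json
import random

def dedupe_and_split(records: list[dict], val_fraction: float = 0.1, seed: int = 42):
    """Drop exact duplicates, shuffle, and hold out a validation set."""
    seen, unique = set(), []
    for rec in records:
        key = json.dumps(rec, sort_keys=True)  # exact-match key; swap in a fuzzier key for near-dups
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    random.Random(seed).shuffle(unique)
    n_val = max(1, int(len(unique) * val_fraction))
    return unique[n_val:], unique[:n_val]      # (train, validation)

train, val = dedupe_and_split(
    [{"instruction": "a", "output": "b"}] * 3 + [{"instruction": "c", "output": "d"}]
)
print(len(train), len(val))  # 1 1 (duplicates removed, one example held out)
```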
Common format (ChatML):
```
<|im_start|>system
You are a helpful coding assistant.<|im_end|>
<|im_start|>user
Write a function to reverse a string in Python.<|im_end|>
<|im_start|>assistant
def reverse_string(s: str) -> str:
    return s[::-1]<|im_end|>
```

8. Evaluation
How to measure fine-tuned model quality:
| Metric | What it measures |
|---|---|
| Perplexity | How well model predicts held-out text (lower = better) |
| BLEU / ROUGE | N-gram overlap with reference outputs |
| Human evaluation | Subjective quality ratings |
| Task-specific | Accuracy, F1, exact match for classification/QA |
| LLM-as-judge | Use GPT-4 or Claude to rate outputs |
| Benchmarks | MMLU, HellaSwag, HumanEval, etc. |
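Perplexity from the table above is just the exponential of the average per-token negative log-likelihood on held-out text. A tiny sketch, assuming you already have the log-probability the model assigned to each held-out token:

```python
import math

def perplexity(token_logprobs: list[float]) -> float:
    """exp(mean negative log-likelihood) over held-out tokens; lower is better."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# Toy example: log-probs the model assigned to each held-out token
print(perplexity([-0.5, -1.2, -0.1, -2.3]))  # ≈ 2.79
```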
When to fine-tune vs. use RAG vs. prompt engineering:
```
┌───────────────────────┐
│ Can prompting solve   │──Yes──▶ Use prompting (cheapest)
│ it?                   │
└───────────┬───────────┘
            No
            │
┌───────────▼───────────┐
│ Is the knowledge in   │──No───▶ Use RAG (add knowledge)
│ the model already?    │
└───────────┬───────────┘
  Yes (but wrong style/format)
            │
┌───────────▼───────────┐
│ Fine-tune             │  (change behavior/style)
└───────────────────────┘
```