
04 - Fine-Tuning and Training


1. Pre-training vs Fine-tuning

What: Two-stage process for creating useful LLMs.

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”      β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”      β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  PRE-TRAINING   β”‚  β†’   β”‚   FINE-TUNING    β”‚  β†’   β”‚  ALIGNMENT   β”‚
β”‚                 β”‚      β”‚                  β”‚      β”‚              β”‚
β”‚ Massive corpus  β”‚      β”‚ Task-specific    β”‚      β”‚ RLHF / DPO   β”‚
β”‚ (internet text) β”‚      β”‚ datasets         β”‚      β”‚ Human prefs  β”‚
β”‚                 β”‚      β”‚                  β”‚      β”‚              β”‚
β”‚ Next-token      β”‚      β”‚ Instruction      β”‚      β”‚ Safety +     β”‚
β”‚ prediction      β”‚      β”‚ following        β”‚      β”‚ helpfulness  β”‚
β”‚                 β”‚      β”‚                  β”‚      β”‚              β”‚
β”‚ Weeks on 1000s  β”‚      β”‚ Hours on 10s     β”‚      β”‚ Days on 100s β”‚
β”‚ of GPUs         β”‚      β”‚ of GPUs          β”‚      β”‚ of GPUs      β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜      β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜      β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
| Aspect | Pre-training | Fine-tuning |
| --- | --- | --- |
| Data | Trillions of tokens (web, books, code) | Thousands to millions of examples |
| Objective | Next-token prediction | Task-specific loss |
| Cost | $1M - $100M+ | $100 - $10,000 |
| Updates | All parameters | All or a subset (LoRA) |
| Result | General language model | Specialized model |

2. LoRA (Low-Rank Adaptation)

What: Parameter-efficient fine-tuning that adds small trainable matrices alongside the frozen model weights. Instead of updating all of the model's billions of parameters, you train only ~0.1-1% of them.

How it works:

Original weight matrix W: (d Γ— d), e.g., (4096 Γ— 4096)
Frozen during fine-tuning.

LoRA adds: W + Ξ”W = W + A Γ— B
  A: (d Γ— r)  β€” e.g., (4096 Γ— 8)    ← trainable
  B: (r Γ— d)  β€” e.g., (8 Γ— 4096)    ← trainable

r = rank (typically 4-64), much smaller than d

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   Input x    β”‚
β”‚   (d dims)   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜
         β”‚
    β”Œβ”€β”€β”€β”€β”΄β”€β”€β”€β”€β”
    β”‚         β”‚
    β–Ό         β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”
β”‚   W    β”‚ β”‚  A  β”‚ (d Γ— r) β€” down-project
β”‚(frozen)β”‚ β”‚     β”‚
β”‚        β”‚ β””β”€β”€β”¬β”€β”€β”˜
β”‚        β”‚    β–Ό
β”‚        β”‚ β”Œβ”€β”€β”€β”€β”€β”
β”‚        β”‚ β”‚  B  β”‚ (r Γ— d) β€” up-project
β”‚        β”‚ β””β”€β”€β”¬β”€β”€β”˜
β””β”€β”€β”€β”¬β”€β”€β”€β”€β”˜    β”‚
    β”‚         β”‚
    β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”˜
         β”‚ add
         β–Ό
    Output (d dims)
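
To make the shapes concrete, here is a minimal LoRA-style linear layer in PyTorch. This is an illustrative sketch, not the peft library's implementation; the class name, initialization, and scaling factor are assumptions.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer W plus a trainable low-rank update A @ B (illustrative helper)."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():   # freeze W (and its bias)
            p.requires_grad = False
        d_in, d_out = base.in_features, base.out_features
        self.A = nn.Parameter(torch.randn(d_in, r) * 0.01)  # down-projection (d Γ— r)
        self.B = nn.Parameter(torch.zeros(r, d_out))         # up-projection (r Γ— d), zero-init so Ξ”W starts at 0
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # y = x W + scale * (x A) B, i.e. the low-rank path is added to the frozen output
        return self.base(x) + (x @ self.A @ self.B) * self.scale

layer = LoRALinear(nn.Linear(4096, 4096), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable: {trainable:,} / {total:,}")  # ~65K of ~16.8M parameters (about 0.4%)
```

Because the update is just a matrix product, the adapter can be merged into W after training (W += scale * A @ B), so inference adds no extra latency.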

Advantages:

  • Train 0.1% of parameters instead of 100%
  • Can run on consumer GPUs (single 24GB GPU for 7B models)
  • Multiple LoRA adapters can be swapped at inference time
  • Minimal quality loss compared to full fine-tuning

3. QLoRA

What: LoRA + quantization. Quantize the base model to 4-bit precision, then train LoRA adapters on top in 16-bit precision.

Full fine-tuning of a 70B model:  ~500GB+ VRAM  (weights + gradients + optimizer states; multi-node only)
LoRA on a 70B model:              ~140GB+ VRAM  (the frozen 16-bit base weights dominate)
QLoRA on a 70B model:             ~35-48GB VRAM (4-bit base weights; fits on a single 48GB GPU)

Key innovations:

  • 4-bit NormalFloat (NF4) quantization β€” information-theoretically optimal for normally distributed weights
  • Double quantization β€” quantize the quantization constants too
  • Paged optimizers β€” use CPU RAM when GPU runs out
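
In practice this is usually wired up with the Hugging Face transformers + peft + bitsandbytes stack. A hedged sketch of the typical configuration follows; the model id and hyperparameters are placeholders, and exact argument names can vary across library versions.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit NF4 base model with double quantization (the QLoRA recipe)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",        # placeholder model id
    quantization_config=bnb_config,
    device_map="auto",
)

# LoRA adapters (kept in higher precision) on the attention projections
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```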

4. RLHF (Reinforcement Learning from Human Feedback)

What: Training methodology that aligns LLMs with human preferences using a reward model trained on human comparisons.

Step 1: Supervised Fine-Tuning (SFT)
  Train on high-quality (prompt, response) pairs

Step 2: Reward Model Training
  Humans rank model outputs: Response A > Response B
  Train a reward model to predict human preferences

Step 3: RL Optimization (PPO)
  Model generates responses β†’ Reward model scores them
  β†’ PPO optimizes policy to maximize reward
  β†’ KL penalty prevents model from diverging too far from SFT model

Reward model:

```python
# Simplified: reward model scores a (prompt, response) pair
reward = reward_model(prompt, response)  # scalar score

# Training signal:
# For pair (response_a > response_b):
# loss = -log(sigmoid(reward(a) - reward(b)))
```
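
A minimal, runnable version of that training signal in PyTorch (a sketch; it assumes the scalar rewards for a batch of comparisons have already been computed):

```python
import torch
import torch.nn.functional as F

def reward_pair_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry pairwise loss: push the chosen response's reward above the rejected one's."""
    # -log(sigmoid(r_chosen - r_rejected)), written with logsigmoid for numerical stability
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Dummy scores for a batch of 3 human comparisons
chosen = torch.tensor([1.2, 0.3, 2.0])
rejected = torch.tensor([0.4, 0.5, -1.0])
print(reward_pair_loss(chosen, rejected))  # small when chosen rewards exceed rejected ones
```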

Challenges:

  • Reward hacking: Model finds exploits in reward model
  • Mode collapse: Model generates repetitive "safe" responses
  • Expensive: Requires multiple models in memory simultaneously
  • Human preference data is expensive and subjective

5. DPO (Direct Preference Optimization)

What: Simpler alternative to RLHF that eliminates the separate reward model. Directly optimizes the language model on preference pairs.

RLHF pipeline:  SFT β†’ Reward Model β†’ PPO β†’ Aligned Model
DPO pipeline:   SFT β†’ DPO β†’ Aligned Model  (skip reward model + RL)

How DPO works:

  • Takes pairs of (preferred response, rejected response) for each prompt
  • Directly increases probability of preferred response and decreases probability of rejected response
  • Uses a mathematical reformulation that implicitly learns the reward

```python
# DPO loss (simplified)
loss = -log(sigmoid(
    beta * (log_prob_preferred - log_prob_rejected
            - log_prob_ref_preferred + log_prob_ref_rejected)
))
# beta controls strength of preference optimization
# ref = reference (SFT) model β€” prevents drift
```
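
Fleshing that out, here is a minimal batched DPO loss in PyTorch; it assumes the per-response log-probabilities (summed over response tokens) have already been computed for both the policy and the frozen reference model:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss over a batch of (preferred, rejected) pairs."""
    # Implicit rewards are beta-scaled log-prob ratios against the reference model
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between chosen and rejected implicit rewards
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Dummy sequence log-probs for 2 preference pairs
loss = dpo_loss(torch.tensor([-12.0, -8.0]), torch.tensor([-15.0, -9.5]),
                torch.tensor([-13.0, -8.5]), torch.tensor([-14.0, -9.0]))
print(loss)
```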

Advantages over RLHF:

  • No separate reward model needed
  • No RL training loop (more stable)
  • Simpler to implement
  • Lower memory requirements

6. Instruction Tuning

What: Fine-tuning a base model on (instruction, response) pairs so it follows human instructions instead of just completing text.

Base model input: "Write a haiku about coding"
Base model output: "competitions are held annually in Japan..."  (completion)

Instruction-tuned input: "Write a haiku about coding"
Instruction-tuned output: "Lines of logic flow / Debugging into the night / Code compiles at last"

Dataset formats:

```json
{
  "instruction": "Explain quantum computing to a 5-year-old",
  "input": "",
  "output": "Imagine you have a magic coin..."
}

// With context:
{
  "instruction": "Summarize this article",
  "input": "The Federal Reserve announced today...",
  "output": "The Fed raised interest rates by 0.25%..."
}
```
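
At training time each record is rendered into a single prompt string. Here is a sketch using an Alpaca-style template; the wording of the template is a common convention, not a fixed standard:

```python
def format_example(record: dict) -> str:
    """Render an (instruction, input, output) record into one training string."""
    if record.get("input"):
        prompt = (
            "Below is an instruction, paired with an input that provides context.\n\n"
            f"### Instruction:\n{record['instruction']}\n\n"
            f"### Input:\n{record['input']}\n\n"
            "### Response:\n"
        )
    else:
        prompt = (
            "Below is an instruction.\n\n"
            f"### Instruction:\n{record['instruction']}\n\n"
            "### Response:\n"
        )
    return prompt + record["output"]

print(format_example({
    "instruction": "Summarize this article",
    "input": "The Federal Reserve announced today...",
    "output": "The Fed raised interest rates by 0.25%...",
}))
```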

Notable instruction datasets:

  • Alpaca: 52K instructions generated by GPT-3.5
  • Dolly: 15K human-written instructions (Databricks)
  • OpenAssistant: 161K messages in conversation trees
  • FLAN: Google's massive multi-task instruction collection

7. Dataset Preparation

Best practices for fine-tuning data:

| Aspect | Recommendation |
| --- | --- |
| Quality | 1000 high-quality examples > 100K noisy ones |
| Format | Consistent structure (instruction/input/output or conversation) |
| Diversity | Cover edge cases, different phrasings, varied difficulty |
| Deduplication | Remove near-duplicates to prevent overfitting |
| Balance | Roughly equal representation of categories/tasks |
| Validation | Hold out 10-20% for evaluation |
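
Deduplication and the held-out split are easy to sketch in plain Python. This is a simplified exact-match version; production pipelines typically use fuzzy matching (e.g., MinHash) to catch near-duplicates:

```python
import hashlib
import random

def prepare_dataset(examples: list[dict], val_fraction: float = 0.1, seed: int = 42):
    """Drop exact duplicates (by normalized text hash), then split into train/validation."""
    seen, unique = set(), []
    for ex in examples:
        key = hashlib.md5(
            (ex["instruction"].strip().lower() + ex["output"].strip().lower()).encode()
        ).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(ex)
    random.Random(seed).shuffle(unique)
    n_val = int(len(unique) * val_fraction)
    return unique[n_val:], unique[:n_val]  # (train, validation)

train, val = prepare_dataset([
    {"instruction": "Say hi", "output": "Hello!"},
    {"instruction": "Say hi", "output": "Hello!"},   # exact duplicate, removed
    {"instruction": "Add 2 + 2", "output": "4"},
], val_fraction=0.5)
print(len(train), len(val))  # 1 1
```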

Common format (ChatML):

```
<|im_start|>system
You are a helpful coding assistant.<|im_end|>
<|im_start|>user
Write a function to reverse a string in Python.<|im_end|>
<|im_start|>assistant
def reverse_string(s: str) -> str:
    return s[::-1]<|im_end|>
```
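
A small helper that renders a list of chat messages into this layout (a sketch; many tokenizers expose an equivalent transformation via tokenizer.apply_chat_template):

```python
def to_chatml(messages: list[dict]) -> str:
    """Render [{'role': ..., 'content': ...}, ...] into ChatML-formatted text."""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>" for m in messages]
    return "\n".join(parts) + "\n"

print(to_chatml([
    {"role": "system", "content": "You are a helpful coding assistant."},
    {"role": "user", "content": "Write a function to reverse a string in Python."},
    {"role": "assistant", "content": "def reverse_string(s: str) -> str:\n    return s[::-1]"},
]))
```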

8. Evaluation

How to measure fine-tuned model quality:

| Metric | What it measures |
| --- | --- |
| Perplexity | How well the model predicts held-out text (lower = better) |
| BLEU / ROUGE | N-gram overlap with reference outputs |
| Human evaluation | Subjective quality ratings |
| Task-specific | Accuracy, F1, exact match for classification/QA |
| LLM-as-judge | Use GPT-4 or Claude to rate outputs |
| Benchmarks | MMLU, HellaSwag, HumanEval, etc. |
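
Perplexity in particular falls straight out of the model's average cross-entropy loss. Here is a sketch using the Hugging Face transformers API; the model name is a placeholder for your fine-tuned checkpoint:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; point this at the fine-tuned model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

text = "Held-out evaluation text goes here."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # With labels == input_ids the model returns the mean next-token cross-entropy
    loss = model(**inputs, labels=inputs["input_ids"]).loss

print(f"perplexity: {torch.exp(loss).item():.2f}")  # lower = better
```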

When to fine-tune vs. use RAG vs. prompt engineering:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Can prompting solve   │──Yes──→ Use prompting (cheapest)
β”‚ it?                   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
            No
            ↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Is the knowledge in   │──No───→ Use RAG (add knowledge)
β”‚ the model already?    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
            Yes (but wrong style/format)
            ↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Fine-tune             β”‚ (change behavior/style)
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
