04 - Fine-Tuning and Training
1. Pre-training vs Fine-tuning
What: A staged process for creating useful LLMs: pre-training, then fine-tuning, then alignment.
```
┌──────────────────┐      ┌──────────────────┐      ┌────────────────┐
│   PRE-TRAINING   │ ───▶ │   FINE-TUNING    │ ───▶ │   ALIGNMENT    │
│                  │      │                  │      │                │
│ Massive corpus   │      │ Task-specific    │      │ RLHF / DPO     │
│ (internet text)  │      │ datasets         │      │ Human prefs    │
│                  │      │                  │      │                │
│ Next-token       │      │ Instruction      │      │ Safety +       │
│ prediction       │      │ following        │      │ helpfulness    │
│                  │      │                  │      │                │
│ Weeks on 1000s   │      │ Hours on 10s     │      │ Days on 100s   │
│ of GPUs          │      │ of GPUs          │      │ of GPUs        │
└──────────────────┘      └──────────────────┘      └────────────────┘
```

| Aspect | Pre-training | Fine-tuning |
|---|---|---|
| Data | Trillions of tokens (web, books, code) | Thousands to millions of examples |
| Objective | Next token prediction | Task-specific loss |
| Cost | $1M - $100M+ | $100 - $10,000 |
| Updates | All parameters | All or subset (LoRA) |
| Result | General language model | Specialized model |
2. LoRA (Low-Rank Adaptation)
What: Parameter-efficient fine-tuning that adds small trainable matrices to frozen model weights. Instead of updating all billions of parameters, train only ~0.1-1% of them.
How it works:
Original weight matrix W: (d × d), e.g., (4096 × 4096), frozen during fine-tuning.

LoRA adds a low-rank update: W + ΔW = W + B × A
- A: (r × d), e.g., (8 × 4096), trainable (down-projection)
- B: (d × r), e.g., (4096 × 8), trainable (up-projection)
- r = rank (typically 4-64), much smaller than d
```
        ┌───────────────┐
        │    Input x    │
        │    (d dims)   │
        └───────┬───────┘
                │
        ┌───────┴───────┐
        │               │
        ▼               ▼
   ┌─────────┐     ┌─────────┐
   │    W    │     │    A    │  (r × d), down-project
   │ (frozen)│     └────┬────┘
   │         │          ▼
   │         │     ┌─────────┐
   │         │     │    B    │  (d × r), up-project
   │         │     └────┬────┘
   └────┬────┘          │
        └───────┬───────┘
               add
                ▼
         Output (d dims)
```

Advantages:
- Train 0.1% of parameters instead of 100%
- Can run on consumer GPUs (single 24GB GPU for 7B models)
- Multiple LoRA adapters can be swapped at inference time
- Minimal quality loss compared to full fine-tuning
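A minimal sketch of the idea in PyTorch, illustrative rather than any particular library's implementation: a frozen linear layer W with a trainable low-rank update B × A added on top (the class name, `alpha` scaling, and initialization follow common conventions but are assumptions here).

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer W plus a trainable low-rank update B @ A."""
    def __init__(self, d_in: int, d_out: int, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = nn.Linear(d_in, d_out, bias=False)
        self.base.weight.requires_grad_(False)                # freeze W
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)    # down-projection (r × d_in)
        self.B = nn.Parameter(torch.zeros(d_out, r))          # up-projection (d_out × r), zero-init so ΔW starts at 0
        self.scale = alpha / r                                 # common LoRA scaling convention

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # y = W x + scale * B (A x); only A and B receive gradients
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T

layer = LoRALinear(4096, 4096, r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable: {trainable:,} / {total:,}")  # ~65K of ~16.8M parameters (~0.4%)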
3. QLoRA
What: LoRA + quantization. Quantize the frozen base model to 4-bit precision, then train LoRA adapters in 16-bit precision on top.
```
Full fine-tuning of 70B model: ~140GB VRAM (impossible on consumer hardware)
LoRA on 70B model:             ~70GB VRAM
QLoRA on 70B model:            ~35GB VRAM (fits on 2× A100 or 1× A100 80GB)
```

Key innovations:
- 4-bit NormalFloat (NF4) quantization: information-theoretically optimal for normally distributed weights
- Double quantization: quantize the quantization constants too
- Paged optimizers: spill optimizer state to CPU RAM when GPU memory runs out
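A hedged sketch of how this is typically wired up with the Hugging Face transformers + peft + bitsandbytes stack; the model id, rank, and target modules are placeholders, not a recommendation.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit NF4 base model with double quantization (requires bitsandbytes)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",        # placeholder model id
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# 16-bit LoRA adapters on top of the frozen 4-bit base
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # which layers to adapt; varies by architecture
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```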
4. RLHF (Reinforcement Learning from Human Feedback)
What: Training methodology that aligns LLMs with human preferences using a reward model trained on human comparisons.
Step 1: Supervised Fine-Tuning (SFT)
Train on high-quality (prompt, response) pairs
Step 2: Reward Model Training
Humans rank model outputs: Response A > Response B
Train a reward model to predict human preferences
Step 3: RL Optimization (PPO)
Model generates responses → Reward model scores them
→ PPO optimizes policy to maximize reward
→ KL penalty prevents model from diverging too far from SFT model

Reward model:
```python
# Simplified: reward model scores a (prompt, response) pair
reward = reward_model(prompt, response)  # scalar score

# Training signal for a labeled pair (response_a preferred over response_b):
# loss = -log(sigmoid(reward(a) - reward(b)))
```

Challenges:
- Reward hacking: Model finds exploits in reward model
- Mode collapse: Model generates repetitive "safe" responses
- Expensive: Requires multiple models in memory simultaneously
- Human preference data is expensive and subjective
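A minimal, runnable sketch of the pairwise reward loss from Step 2 above, assuming the reward model has already produced one scalar per sequence (function and variable names are illustrative):

```python
import torch
import torch.nn.functional as F

def reward_pair_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise loss: push the chosen response's reward above the rejected one's.

    reward_chosen, reward_rejected: shape (batch,), scalar reward per sequence.
    """
    # -log sigmoid(r_chosen - r_rejected), averaged over the batch
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy usage with made-up reward values
r_a = torch.tensor([1.2, 0.3, 2.0])   # rewards for preferred responses
r_b = torch.tensor([0.5, 0.4, -1.0])  # rewards for rejected responses
print(reward_pair_loss(r_a, r_b))     # loss shrinks as preferred rewards pull ahead
```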
5. DPO (Direct Preference Optimization)
What: Simpler alternative to RLHF that eliminates the separate reward model. Directly optimizes the language model on preference pairs.
RLHF pipeline: SFT → Reward Model → PPO → Aligned Model
DPO pipeline:  SFT → DPO → Aligned Model (skip reward model + RL)

How DPO works:
- Takes pairs of (preferred response, rejected response) for each prompt
- Directly increases probability of preferred response and decreases probability of rejected response
- Uses a mathematical reformulation that implicitly learns the reward
```python
# DPO loss (simplified)
loss = -log(sigmoid(
    beta * (log_prob_preferred - log_prob_rejected
            - log_prob_ref_preferred + log_prob_ref_rejected)
))
# beta controls strength of preference optimization
# ref = reference (SFT) model → prevents drift
```

Advantages over RLHF:
- No separate reward model needed
- No RL training loop (more stable)
- Simpler to implement
- Lower memory requirements
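A runnable version of the loss above, assuming per-example summed log-probabilities have already been computed under the policy and the frozen reference model (the function and argument names are illustrative):

```python
import torch
import torch.nn.functional as F

def dpo_loss(
    policy_logp_chosen: torch.Tensor,    # log p_policy(chosen | prompt), shape (batch,)
    policy_logp_rejected: torch.Tensor,  # log p_policy(rejected | prompt)
    ref_logp_chosen: torch.Tensor,       # log p_ref(chosen | prompt), reference model is frozen
    ref_logp_rejected: torch.Tensor,
    beta: float = 0.1,
) -> torch.Tensor:
    # Implicit reward margin: how much more the policy prefers chosen over rejected,
    # relative to the reference model.
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    margin = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(margin).mean()

# Toy usage with made-up log-probabilities
loss = dpo_loss(
    torch.tensor([-12.0, -30.5]), torch.tensor([-15.0, -28.0]),
    torch.tensor([-13.0, -29.0]), torch.tensor([-14.0, -29.5]),
)
print(loss)
```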
6. Instruction Tuning
What: Fine-tuning a base model on (instruction, response) pairs so it follows human instructions instead of just completing text.
Base model input:  "Write a haiku about coding"
Base model output: "competitions are held annually in Japan..." (completion)

Instruction-tuned input:  "Write a haiku about coding"
Instruction-tuned output: "Lines of logic flow / Debugging into the night / Code compiles at last"

Dataset formats:
```json
{
  "instruction": "Explain quantum computing to a 5-year-old",
  "input": "",
  "output": "Imagine you have a magic coin..."
}
```

With context:

```json
{
  "instruction": "Summarize this article",
  "input": "The Federal Reserve announced today...",
  "output": "The Fed raised interest rates by 0.25%..."
}
```

Notable instruction datasets:
- Alpaca: 52K instructions generated by GPT-3.5
- Dolly: 15K human-written instructions (Databricks)
- OpenAssistant: 161K messages in conversation trees
- FLAN: Google's massive multi-task instruction collection
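A minimal sketch of turning one record in the instruction/input/output format above into a single training string. The template here is an Alpaca-style stand-in; real projects use whatever template their base model expects.

```python
def format_example(record: dict) -> str:
    """Render an {instruction, input, output} record as one training string."""
    if record.get("input"):
        prompt = (
            f"### Instruction:\n{record['instruction']}\n\n"
            f"### Input:\n{record['input']}\n\n"
            "### Response:\n"
        )
    else:
        prompt = f"### Instruction:\n{record['instruction']}\n\n### Response:\n"
    return prompt + record["output"]

example = {
    "instruction": "Summarize this article",
    "input": "The Federal Reserve announced today...",
    "output": "The Fed raised interest rates by 0.25%...",
}
print(format_example(example))
```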
7. Dataset Preparation
Best practices for fine-tuning data:
| Aspect | Recommendation |
|---|---|
| Quality | 1000 high-quality examples > 100K noisy ones |
| Format | Consistent structure (instruction/input/output or conversation) |
| Diversity | Cover edge cases, different phrasings, varied difficulty |
| Deduplication | Remove near-duplicates to prevent overfitting |
| Balance | Roughly equal representation of categories/tasks |
| Validation | Hold out 10-20% for evaluation |
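A small sketch of two of these steps, deduplication and a held-out validation split, over a list of records like the JSON examples above. This does exact-match dedup only; near-duplicate detection usually needs embeddings or MinHash. All names here are illustrative.

```python
import json
import random

def dedupe_and_split(records: list[dict], val_fraction: float = 0.1, seed: int = 42):
    """Drop exact duplicates, shuffle, and hold out a validation set."""
    seen, unique = set(), []
    for rec in records:
        key = json.dumps(rec, sort_keys=True)  # exact-match key; swap in a fuzzier key for near-dups
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    random.Random(seed).shuffle(unique)
    n_val = max(1, int(len(unique) * val_fraction))
    return unique[n_val:], unique[:n_val]      # (train, validation)

train, val = dedupe_and_split(
    [{"instruction": "a", "output": "b"}] * 3 + [{"instruction": "c", "output": "d"}]
)
print(len(train), len(val))  # 1 1 (duplicates removed, one example held out)
```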
Common format (ChatML):
```
<|im_start|>system
You are a helpful coding assistant.<|im_end|>
<|im_start|>user
Write a function to reverse a string in Python.<|im_end|>
<|im_start|>assistant
def reverse_string(s: str) -> str:
    return s[::-1]<|im_end|>
```

8. Evaluation
How to measure fine-tuned model quality:
| Metric | What it measures |
|---|---|
| Perplexity | How well model predicts held-out text (lower = better) |
| BLEU / ROUGE | N-gram overlap with reference outputs |
| Human evaluation | Subjective quality ratings |
| Task-specific | Accuracy, F1, exact match for classification/QA |
| LLM-as-judge | Use GPT-4 or Claude to rate outputs |
| Benchmarks | MMLU, HellaSwag, HumanEval, etc. |
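Perplexity from the table above is just the exponential of the average per-token negative log-likelihood on held-out text. A tiny sketch, assuming you already have the log-probability the model assigned to each held-out token:

```python
import math

def perplexity(token_logprobs: list[float]) -> float:
    """exp(mean negative log-likelihood) over held-out tokens; lower is better."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# Toy example: log-probs the model assigned to each held-out token
print(perplexity([-0.5, -1.2, -0.1, -2.3]))  # ≈ 2.79
```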
When to fine-tune vs. use RAG vs. prompt engineering:
```
┌───────────────────────┐
│ Can prompting solve   │──Yes──▶ Use prompting (cheapest)
│ it?                   │
└───────────┬───────────┘
            No
            │
┌───────────▼───────────┐
│ Is the knowledge in   │──No───▶ Use RAG (add knowledge)
│ the model already?    │
└───────────┬───────────┘
  Yes (but wrong style/format)
            │
┌───────────▼───────────┐
│ Fine-tune             │  (change behavior/style)
└───────────────────────┘
```