01 - LLM Fundamentals
1. Transformer Architecture
What: The Transformer is the neural network architecture behind all modern LLMs. Introduced in "Attention Is All You Need" (2017), it replaced RNNs by processing entire sequences in parallel using self-attention.
```
                    Transformer

  Input Tokens
       │
┌─────────────────┐     ┌─────────────────────┐
│ Input Embedding │  +  │ Positional Encoding │
└────────┬────────┘     └──────────┬──────────┘
         └────────────┬────────────┘
                      │
      ┌───────────────────────────────────┐
      │  Multi-Head Self-Attention        │
      │               │                   │
      │  Add & Layer Norm                 │
      │               │                   │   × N layers
      │  Feed-Forward Network (FFN)       │
      │               │                   │
      │  Add & Layer Norm                 │
      └───────────────────────────────────┘
                      │
                Output Logits
```

Key insight: Every layer has two sub-layers: attention (tokens communicate with each other) and FFN (each token processes independently). Residual connections + layer norm wrap each sub-layer.
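To make the two sub-layers concrete, here is a minimal sketch of one post-norm Transformer block in PyTorch (module and parameter names are illustrative, not from any particular codebase; dimensions follow GPT-2 small):

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model=768, n_heads=12, d_ff=3072):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                 # x: (batch, seq_len, d_model)
        # sub-layer 1: tokens communicate via self-attention
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)      # residual + layer norm
        # sub-layer 2: each token is processed independently by the FFN
        x = self.norm2(x + self.ffn(x))   # residual + layer norm
        return x
```

Stacking N of these blocks over the embedded input gives the × N column in the diagram above.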
2. Self-Attention
What: The mechanism that lets each token "look at" every other token in the sequence to decide what's relevant.
How it works:
For each token, three vectors are computed from learned weight matrices:
- Query (Q): "What am I looking for?"
- Key (K): "What do I contain?"
- Value (V): "What information do I provide?"
```
Attention(Q, K, V) = softmax(Q × K^T / √d_k) × V
```

Step-by-step:
- Compute Q, K, V by multiplying input embeddings by learned weight matrices
- Dot product Q with all K's → raw attention scores
- Scale by √d_k (prevents softmax saturation for large dimensions)
- Apply softmax → attention weights (sum to 1)
- Weighted sum of V's → output for that position
```python
import torch
import torch.nn.functional as F

def self_attention(Q, K, V):
    d_k = Q.size(-1)
    # (batch, seq_len, d_k) × (batch, d_k, seq_len) → (batch, seq_len, seq_len)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / (d_k ** 0.5)
    weights = F.softmax(scores, dim=-1)
    # (batch, seq_len, seq_len) × (batch, seq_len, d_v) → (batch, seq_len, d_v)
    return torch.matmul(weights, V)
```

Why scaling matters: Without the √d_k scaling, dot products grow large with dimension size, pushing softmax into regions with tiny gradients (the vanishing gradient problem).
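A quick shape sanity-check of the function above, with toy dimensions chosen arbitrarily:

```python
Q = torch.randn(1, 5, 64)   # (batch, seq_len, d_k)
K = torch.randn(1, 5, 64)
V = torch.randn(1, 5, 64)
out = self_attention(Q, K, V)
print(out.shape)            # torch.Size([1, 5, 64])
```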
3. Multi-Head Attention
What: Instead of one attention function, run multiple attention "heads" in parallel, each learning different relationship patterns.
```
┌───────────────────────────────────────────┐
│           Multi-Head Attention            │
│                                           │
│   Head 1: syntax relationships            │
│   Head 2: semantic meaning                │
│   Head 3: positional patterns             │
│   Head 4: coreference resolution          │
│   ...                                     │
│                                           │
│   Concat all heads → Linear projection    │
└───────────────────────────────────────────┘
```

How it works:
- Split Q, K, V into h heads (e.g., 8 or 32)
- Each head has dimension d_k / h → same total compute
- Run attention on each head independently
- Concatenate results and project through a linear layer
```python
# GPT-2 uses 12 heads with d_model=768
# Each head: 768/12 = 64 dimensions
# Total params same as single-head, but captures diverse patterns
```

Why multiple heads: A single attention function can only compute one weighted average. Multiple heads let the model attend to information from different representation subspaces at different positions simultaneously.
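Putting the pieces together, a minimal multi-head attention module, assuming a GPT-2-style fused QKV projection (class and variable names are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=768, n_heads=12):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads             # e.g. 768 / 12 = 64
        self.qkv = nn.Linear(d_model, 3 * d_model)   # fused Q, K, V projections
        self.proj = nn.Linear(d_model, d_model)      # final linear projection

    def forward(self, x):                            # x: (B, T, d_model)
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # split into heads: (B, n_heads, T, d_head); each head attends independently
        q = q.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        k = k.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        v = v.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        scores = q @ k.transpose(-2, -1) / (self.d_head ** 0.5)
        weights = F.softmax(scores, dim=-1)
        out = (weights @ v).transpose(1, 2).reshape(B, T, C)  # concat heads
        return self.proj(out)
```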
4. Positional Encoding
What: Transformers process all tokens in parallel (no inherent order). Positional encodings inject sequence order information.
Original approach (sinusoidal):

```
PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
```

Each dimension uses a sinusoid at a different frequency. This lets the model learn relative positions because PE(pos+k) can be expressed as a linear function of PE(pos).
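A minimal sketch that builds this encoding table (function name is illustrative):

```python
import torch

def sinusoidal_pe(seq_len, d_model):
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)  # (seq_len, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)           # even dims 2i
    angle = pos / (10000.0 ** (i / d_model))     # one frequency per dim pair
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angle)               # even dimensions
    pe[:, 1::2] = torch.cos(angle)               # odd dimensions
    return pe

pe = sinusoidal_pe(1024, 768)   # fixed (not trained), added to token embeddings
```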
Modern approaches:
| Method | Used By | How |
|---|---|---|
| Sinusoidal | Original Transformer | Fixed sin/cos functions |
| Learned | GPT-2, BERT | Trainable embedding per position |
| RoPE (Rotary) | LLaMA, Mistral | Rotates Q/K vectors by position angle |
| ALiBi | BLOOM | Adds linear bias to attention scores |
RoPE (Rotary Position Embedding):
- Encodes position by rotating the query and key vectors
- Naturally captures relative position (rotation difference)
- Enables better length extrapolation than learned positions
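A minimal sketch of the rotation itself, using the interleaved-pair formulation (the function name and base=10000.0 default are assumptions based on the common variant; real implementations apply this per head inside attention):

```python
import torch

def rope(x, base=10000.0):
    """x: (seq_len, d) queries or keys, d even. Rotates each dimension pair."""
    seq_len, d = x.shape
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)  # (seq_len, 1)
    inv_freq = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)
    angles = pos * inv_freq            # (seq_len, d/2): angle grows with position
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, 0::2], x[:, 1::2]    # each (even, odd) pair is a 2-D point
    out = torch.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin # standard 2-D rotation, pair by pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```

Because the dot product of a query rotated by position m and a key rotated by position n depends only on the offset m - n, attention scores see relative position for free.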
5. Encoder vs Decoder
What: The original Transformer has both an encoder (processes input) and decoder (generates output). Modern LLMs typically use only one.
```
┌───────────────┐            ┌───────────────┐
│    ENCODER    │            │    DECODER    │
│               │            │               │
│ Bidirectional │   cross    │ Causal (left  │
│   attention   │───────────▶│  to right)    │
│               │    attn    │   attention   │
│ Sees full     │            │               │
│   input       │            │ + masked      │
│               │            │   self-attn   │
└───────────────┘            └───────────────┘
```

Causal masking (decoder): Each token can only attend to previous tokens. Implemented by masking future positions to -infinity before softmax.
```python
import torch

# Causal mask for sequence length 4 (build the -inf matrix first: 0 * -inf is NaN)
mask = torch.triu(torch.full((4, 4), float('-inf')), diagonal=1)
# tensor([[ 0., -inf, -inf, -inf],
#         [ 0.,  0., -inf, -inf],
#         [ 0.,  0.,  0., -inf],
#         [ 0.,  0.,  0.,  0.]])
# usage: add to the raw attention scores before softmax (scores + mask)
```

6. GPT vs BERT Architecture Comparison
| Aspect | GPT (Decoder-only) | BERT (Encoder-only) |
|---|---|---|
| Attention | Causal (left-to-right) | Bidirectional (sees full context) |
| Pre-training | Next token prediction | Masked language modeling (MLM) + next sentence prediction |
| Generation | Autoregressive (generates tokens one by one) | Not designed for generation |
| Best for | Text generation, chat, code, reasoning | Classification, NER, embeddings, search |
| Examples | GPT-4, Claude, LLaMA, Mistral | BERT, RoBERTa, DeBERTa |
| Context | Unidirectional (only sees past tokens) | Full context (sees entire input) |
Encoder-Decoder models (T5, BART): Use both. Best for sequence-to-sequence tasks like translation, summarization.
Why decoder-only won for LLMs:
- Simpler architecture, easier to scale
- Next-token prediction is a universal objective
- In-context learning emerged naturally at scale
- Bidirectional attention isn't needed when you can prompt effectively
Key parameters that define model size:
| Parameter | GPT-2 | GPT-3 | LLaMA 2 70B |
|---|---|---|---|
| Layers | 12 | 96 | 80 |
| Hidden dim | 768 | 12288 | 8192 |
| Attention heads | 12 | 96 | 64 |
| Parameters | 117M | 175B | 70B |
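As a rough sanity check on this table, each Transformer block contributes about 4·d² parameters for the Q/K/V/output projections plus 8·d² for an FFN with 4× expansion, i.e. ~12·d² per block. A back-of-the-envelope estimate (the 12·d² rule and the vocab size are simplifying assumptions; it ignores biases, layer norms, and LLaMA's different FFN shape):

```python
def estimate_params(n_layers, d_model, vocab_size=50257):
    block = 12 * d_model ** 2                        # ~4·d² attention + ~8·d² FFN
    return n_layers * block + vocab_size * d_model   # + token embeddings

print(f"GPT-2: ~{estimate_params(12, 768) / 1e6:.0f}M")    # ~124M (often quoted as 117M)
print(f"GPT-3: ~{estimate_params(96, 12288) / 1e9:.0f}B")  # ~175B
```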
7. How LLMs Generate Text
Autoregressive generation: Generate one token at a time, feeding each generated token back as input.
Input: "The cat sat on the"
Step 1: β "mat" (append)
Step 2: "The cat sat on the mat" β "." (append)
Step 3: "The cat sat on the mat." β [EOS]KV Cache: During generation, recompute is expensive. Cache the Key and Value tensors from previous tokens so each new token only computes attention against cached KV pairs.
```
Without KV cache: O(n²) per token (recompute all attention)
With KV cache:    O(n) per token (only the new token attends to the cache)
Trade-off: memory ↑ but speed ↑↑↑
```
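A toy single-head illustration of the cache mechanics (names and shapes are illustrative; real implementations keep one cache per layer and per head):

```python
import torch
import torch.nn.functional as F

d_k = 64
k_cache = torch.empty(0, d_k)   # grows by one row per generated token
v_cache = torch.empty(0, d_k)

def decode_step(q_new, k_new, v_new):
    """q_new, k_new, v_new: (1, d_k) projections of the newest token only."""
    global k_cache, v_cache
    k_cache = torch.cat([k_cache, k_new])       # append instead of recomputing
    v_cache = torch.cat([v_cache, v_new])
    scores = q_new @ k_cache.T / d_k ** 0.5     # (1, cache_len): O(n), not O(n²)
    return F.softmax(scores, dim=-1) @ v_cache  # (1, d_k)

# each step feeds in only the newest token's projections
for _ in range(5):
    out = decode_step(torch.randn(1, d_k), torch.randn(1, d_k), torch.randn(1, d_k))
```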
Decoding strategies:
- Greedy: Always pick the highest-probability token. Fast but repetitive.
- Top-k: Sample from top k tokens. More diverse.
- Top-p (nucleus): Sample from smallest set of tokens whose cumulative probability exceeds p.
- Temperature: Scale logits before softmax. Lower = more deterministic, higher = more random.
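A minimal sketch combining these sampling strategies for a single 1-D logits vector (the function name is illustrative; the shift in the top-p branch keeps the first token that crosses the threshold, following the common implementation pattern):

```python
import torch
import torch.nn.functional as F

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None):
    """logits: (vocab,) tensor. Returns a sampled token id."""
    logits = logits / temperature                  # <1 sharpens, >1 flattens
    if top_k is not None:
        kth_best = torch.topk(logits, top_k).values[-1]
        logits = logits.masked_fill(logits < kth_best, float('-inf'))
    if top_p is not None:
        sorted_logits, sorted_idx = torch.sort(logits, descending=True)
        cum_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
        remove = cum_probs > top_p                 # tokens past the nucleus
        remove[1:] = remove[:-1].clone()           # shift right by one...
        remove[0] = False                          # ...so the nucleus itself is kept
        logits[sorted_idx[remove]] = float('-inf')
    probs = F.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()

next_id = sample_next_token(torch.randn(50257), temperature=0.8, top_p=0.9)
```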