01 - LLM Fundamentals
1. Transformer Architecture
What: The Transformer is the neural network architecture behind all modern LLMs. Introduced in "Attention Is All You Need" (2017), it replaced RNNs by processing entire sequences in parallel using self-attention.
```
                    Transformer

  Input Tokens
       │
┌─────────────────┐     ┌─────────────────────┐
│ Input Embedding │  +  │ Positional Encoding │
└────────┬────────┘     └──────────┬──────────┘
         └────────────┬────────────┘
                      │
      ┌───────────────────────────────────┐
      │  Multi-Head Self-Attention        │
      │               │                   │
      │  Add & Layer Norm                 │
      │               │                   │   × N layers
      │  Feed-Forward Network (FFN)       │
      │               │                   │
      │  Add & Layer Norm                 │
      └───────────────────────────────────┘
                      │
                Output Logits
```

Key insight: Every layer has two sub-layers: attention (tokens communicate with each other) and FFN (each token processes independently). Residual connections + layer norm wrap each sub-layer.
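To make the two sub-layers concrete, here is a minimal sketch of one post-norm Transformer block in PyTorch (module and parameter names are illustrative, not from any particular codebase; dimensions follow GPT-2 small):

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model=768, n_heads=12, d_ff=3072):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                 # x: (batch, seq_len, d_model)
        # sub-layer 1: tokens communicate via self-attention
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)      # residual + layer norm
        # sub-layer 2: each token is processed independently by the FFN
        x = self.norm2(x + self.ffn(x))   # residual + layer norm
        return x
```

Stacking N of these blocks over the embedded input gives the × N column in the diagram above.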
2. Self-Attention
What: The mechanism that lets each token "look at" every other token in the sequence to decide what's relevant.
How it works:
For each token, three vectors are computed from learned weight matrices:
- Query (Q): "What am I looking for?"
- Key (K): "What do I contain?"
- Value (V): "What information do I provide?"
```
Attention(Q, K, V) = softmax(Q × K^T / √d_k) × V
```

Step-by-step:
- Compute Q, K, V by multiplying input embeddings by learned weight matrices
- Dot product Q with all K's → raw attention scores
- Scale by √d_k (prevents softmax saturation for large dimensions)
- Apply softmax → attention weights (sum to 1)
- Weighted sum of V's → output for that position
```python
import torch
import torch.nn.functional as F

def self_attention(Q, K, V):
    d_k = Q.size(-1)
    # (batch, seq_len, d_k) × (batch, d_k, seq_len) → (batch, seq_len, seq_len)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / (d_k ** 0.5)
    weights = F.softmax(scores, dim=-1)
    # (batch, seq_len, seq_len) × (batch, seq_len, d_v) → (batch, seq_len, d_v)
    return torch.matmul(weights, V)
```

Why scaling matters: Without the √d_k scaling, dot products grow large with dimension size, pushing softmax into regions with tiny gradients (the vanishing gradient problem).
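A quick shape sanity-check of the function above, with toy dimensions chosen arbitrarily:

```python
Q = torch.randn(1, 5, 64)   # (batch, seq_len, d_k)
K = torch.randn(1, 5, 64)
V = torch.randn(1, 5, 64)
out = self_attention(Q, K, V)
print(out.shape)            # torch.Size([1, 5, 64])
```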
3. Multi-Head Attention
What: Instead of one attention function, run multiple attention "heads" in parallel, each learning different relationship patterns.
```
┌───────────────────────────────────────────┐
│           Multi-Head Attention            │
│                                           │
│   Head 1: syntax relationships            │
│   Head 2: semantic meaning                │
│   Head 3: positional patterns             │
│   Head 4: coreference resolution          │
│   ...                                     │
│                                           │
│   Concat all heads → Linear projection    │
└───────────────────────────────────────────┘
```

How it works:
- Split Q, K, V into h heads (e.g., 8 or 32)
- Each head has dimension d_k / h → same total compute
- Run attention on each head independently
- Concatenate results and project through a linear layer
```python
# GPT-2 uses 12 heads with d_model=768
# Each head: 768/12 = 64 dimensions
# Total params same as single-head, but captures diverse patterns
```

Why multiple heads: A single attention function can only compute one weighted average. Multiple heads let the model attend to information from different representation subspaces at different positions simultaneously.
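Putting the pieces together, a minimal multi-head attention module, assuming a GPT-2-style fused QKV projection (class and variable names are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=768, n_heads=12):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads             # e.g. 768 / 12 = 64
        self.qkv = nn.Linear(d_model, 3 * d_model)   # fused Q, K, V projections
        self.proj = nn.Linear(d_model, d_model)      # final linear projection

    def forward(self, x):                            # x: (B, T, d_model)
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # split into heads: (B, n_heads, T, d_head); each head attends independently
        q = q.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        k = k.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        v = v.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        scores = q @ k.transpose(-2, -1) / (self.d_head ** 0.5)
        weights = F.softmax(scores, dim=-1)
        out = (weights @ v).transpose(1, 2).reshape(B, T, C)  # concat heads
        return self.proj(out)
```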
4. Positional Encoding
What: Transformers process all tokens in parallel (no inherent order). Positional encodings inject sequence order information.
Original approach (sinusoidal):

```
PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
```

Each dimension uses a sinusoid at a different frequency. This lets the model learn relative positions because PE(pos+k) can be expressed as a linear function of PE(pos).
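A minimal sketch that builds this encoding table (function name is illustrative):

```python
import torch

def sinusoidal_pe(seq_len, d_model):
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)  # (seq_len, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)           # even dims 2i
    angle = pos / (10000.0 ** (i / d_model))     # one frequency per dim pair
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angle)               # even dimensions
    pe[:, 1::2] = torch.cos(angle)               # odd dimensions
    return pe

pe = sinusoidal_pe(1024, 768)   # fixed (not trained), added to token embeddings
```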
Modern approaches:
| Method | Used By | How |
|---|---|---|
| Sinusoidal | Original Transformer | Fixed sin/cos functions |
| Learned | GPT-2, BERT | Trainable embedding per position |
| RoPE (Rotary) | LLaMA, Mistral | Rotates Q/K vectors by position angle |
| ALiBi | BLOOM | Adds linear bias to attention scores |
RoPE (Rotary Position Embedding):
- Encodes position by rotating the query and key vectors
- Naturally captures relative position (rotation difference)
- Enables better length extrapolation than learned positions
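A minimal sketch of the rotation itself, using the interleaved-pair formulation (the function name and base=10000.0 default are assumptions based on the common variant; real implementations apply this per head inside attention):

```python
import torch

def rope(x, base=10000.0):
    """x: (seq_len, d) queries or keys, d even. Rotates each dimension pair."""
    seq_len, d = x.shape
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)  # (seq_len, 1)
    inv_freq = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)
    angles = pos * inv_freq            # (seq_len, d/2): angle grows with position
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, 0::2], x[:, 1::2]    # each (even, odd) pair is a 2-D point
    out = torch.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin # standard 2-D rotation, pair by pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```

Because the dot product of a query rotated by position m and a key rotated by position n depends only on the offset m - n, attention scores see relative position for free.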
5. Encoder vs Decoder
What: The original Transformer has both an encoder (processes input) and decoder (generates output). Modern LLMs typically use only one.
```
┌───────────────┐            ┌───────────────┐
│    ENCODER    │            │    DECODER    │
│               │            │               │
│ Bidirectional │   cross    │ Causal (left  │
│   attention   │───────────▶│  to right)    │
│               │    attn    │   attention   │
│ Sees full     │            │               │
│   input       │            │ + masked      │
│               │            │   self-attn   │
└───────────────┘            └───────────────┘
```

Causal masking (decoder): Each token can only attend to previous tokens. Implemented by masking future positions to -infinity before softmax.
```python
import torch

# Causal mask for sequence length 4 (build the -inf matrix first: 0 * -inf is NaN)
mask = torch.triu(torch.full((4, 4), float('-inf')), diagonal=1)
# tensor([[ 0., -inf, -inf, -inf],
#         [ 0.,  0., -inf, -inf],
#         [ 0.,  0.,  0., -inf],
#         [ 0.,  0.,  0.,  0.]])
# usage: add to the raw attention scores before softmax (scores + mask)
```

6. GPT vs BERT Architecture Comparison
| Aspect | GPT (Decoder-only) | BERT (Encoder-only) |
|---|---|---|
| Attention | Causal (left-to-right) | Bidirectional (sees full context) |
| Pre-training | Next token prediction | Masked language modeling (MLM) + next sentence prediction |
| Generation | Autoregressive (generates tokens one by one) | Not designed for generation |
| Best for | Text generation, chat, code, reasoning | Classification, NER, embeddings, search |
| Examples | GPT-4, Claude, LLaMA, Mistral | BERT, RoBERTa, DeBERTa |
| Context | Unidirectional (only sees past tokens) | Full context (sees entire input) |
Encoder-Decoder models (T5, BART): Use both. Best for sequence-to-sequence tasks like translation, summarization.
Why decoder-only won for LLMs:
- Simpler architecture, easier to scale
- Next-token prediction is a universal objective
- In-context learning emerged naturally at scale
- Bidirectional attention isn't needed when you can prompt effectively
Key parameters that define model size:
| Parameter | GPT-2 | GPT-3 | LLaMA 2 70B |
|---|---|---|---|
| Layers | 12 | 96 | 80 |
| Hidden dim | 768 | 12288 | 8192 |
| Attention heads | 12 | 96 | 64 |
| Parameters | 117M | 175B | 70B |
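As a rough sanity check on this table, each Transformer block contributes about 4·d² parameters for the Q/K/V/output projections plus 8·d² for an FFN with 4× expansion, i.e. ~12·d² per block. A back-of-the-envelope estimate (the 12·d² rule and the vocab size are simplifying assumptions; it ignores biases, layer norms, and LLaMA's different FFN shape):

```python
def estimate_params(n_layers, d_model, vocab_size=50257):
    block = 12 * d_model ** 2                        # ~4·d² attention + ~8·d² FFN
    return n_layers * block + vocab_size * d_model   # + token embeddings

print(f"GPT-2: ~{estimate_params(12, 768) / 1e6:.0f}M")    # ~124M (often quoted as 117M)
print(f"GPT-3: ~{estimate_params(96, 12288) / 1e9:.0f}B")  # ~175B
```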
7. How LLMs Generate Text
Autoregressive generation: Generate one token at a time, feeding each generated token back as input.
Input: "The cat sat on the"
Step 1: β "mat" (append)
Step 2: "The cat sat on the mat" β "." (append)
Step 3: "The cat sat on the mat." β [EOS]KV Cache: During generation, recompute is expensive. Cache the Key and Value tensors from previous tokens so each new token only computes attention against cached KV pairs.
```
Without KV cache: O(n²) per token (recompute all attention)
With KV cache:    O(n) per token (only the new token attends to the cache)
Trade-off: memory ↑ but speed ↑↑↑
```
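A toy single-head illustration of the cache mechanics (names and shapes are illustrative; real implementations keep one cache per layer and per head):

```python
import torch
import torch.nn.functional as F

d_k = 64
k_cache = torch.empty(0, d_k)   # grows by one row per generated token
v_cache = torch.empty(0, d_k)

def decode_step(q_new, k_new, v_new):
    """q_new, k_new, v_new: (1, d_k) projections of the newest token only."""
    global k_cache, v_cache
    k_cache = torch.cat([k_cache, k_new])       # append instead of recomputing
    v_cache = torch.cat([v_cache, v_new])
    scores = q_new @ k_cache.T / d_k ** 0.5     # (1, cache_len): O(n), not O(n²)
    return F.softmax(scores, dim=-1) @ v_cache  # (1, d_k)

# each step feeds in only the newest token's projections
for _ in range(5):
    out = decode_step(torch.randn(1, d_k), torch.randn(1, d_k), torch.randn(1, d_k))
```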
Decoding strategies:
- Greedy: Always pick the highest-probability token. Fast but repetitive.
- Top-k: Sample from top k tokens. More diverse.
- Top-p (nucleus): Sample from smallest set of tokens whose cumulative probability exceeds p.
- Temperature: Scale logits before softmax. Lower = more deterministic, higher = more random.
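A minimal sketch combining these sampling strategies for a single 1-D logits vector (the function name is illustrative; the shift in the top-p branch keeps the first token that crosses the threshold, following the common implementation pattern):

```python
import torch
import torch.nn.functional as F

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None):
    """logits: (vocab,) tensor. Returns a sampled token id."""
    logits = logits / temperature                  # <1 sharpens, >1 flattens
    if top_k is not None:
        kth_best = torch.topk(logits, top_k).values[-1]
        logits = logits.masked_fill(logits < kth_best, float('-inf'))
    if top_p is not None:
        sorted_logits, sorted_idx = torch.sort(logits, descending=True)
        cum_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
        remove = cum_probs > top_p                 # tokens past the nucleus
        remove[1:] = remove[:-1].clone()           # shift right by one...
        remove[0] = False                          # ...so the nucleus itself is kept
        logits[sorted_idx[remove]] = float('-inf')
    probs = F.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()

next_id = sample_next_token(torch.randn(50257), temperature=0.8, top_p=0.9)
```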