
01 - LLM Fundamentals


1. Transformer Architecture

What: The Transformer is the neural network architecture behind all modern LLMs. Introduced in "Attention Is All You Need" (2017), it replaced RNNs by processing entire sequences in parallel using self-attention.

┌──────────────────────────────────────────────────────┐
│                     Transformer                      │
│                                                      │
│  Input Tokens                                        │
│      ↓                                               │
│  ┌───────────────────┐   ┌─────────────────────┐     │
│  │  Input Embedding  │   │ Positional Encoding │     │
│  └─────────┬─────────┘   └──────────┬──────────┘     │
│            └───────────┬────────────┘                │
│                        ↓                             │
│  ┌──────────────────────────────────┐  ×N layers     │
│  │   Multi-Head Self-Attention      │                │
│  │          ↓                       │                │
│  │   Add & Layer Norm               │                │
│  │          ↓                       │                │
│  │   Feed-Forward Network (FFN)     │                │
│  │          ↓                       │                │
│  │   Add & Layer Norm               │                │
│  └──────────────────────────────────┘                │
│                        ↓                             │
│              Output Logits                           │
└──────────────────────────────────────────────────────┘

Key insight: Every layer has two sub-layers: self-attention (tokens communicate with each other) and an FFN (each token is processed independently), with a residual connection and layer norm around each.
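
The sketch below shows what one such layer can look like in PyTorch. The dimensions (768, 12 heads), the post-norm placement, and the use of nn.MultiheadAttention are illustrative choices, not a reference implementation.

python
import torch.nn as nn

class TransformerBlock(nn.Module):
    """One Transformer layer: attention + FFN, each wrapped in a
    residual connection and layer norm (post-norm, as in the 2017 paper)."""

    def __init__(self, d_model=768, n_heads=12):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                   # x: (batch, seq_len, d_model)
        attn_out, _ = self.attn(x, x, x)    # tokens communicate
        x = self.norm1(x + attn_out)        # residual + layer norm
        x = self.norm2(x + self.ffn(x))     # per-token FFN, residual + norm
        return x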


2. Self-Attention

What: The mechanism that lets each token "look at" every other token in the sequence to decide what's relevant.

How it works:

For each token, three vectors are computed from learned weight matrices:

  • Query (Q): "What am I looking for?"
  • Key (K): "What do I contain?"
  • Value (V): "What information do I provide?"

Attention(Q, K, V) = softmax(Q × K^T / √d_k) × V

Step-by-step:

  1. Compute Q, K, V by multiplying input embeddings by learned weight matrices
  2. Dot product Q with all K's β†’ raw attention scores
  3. Divide by √d_k (prevents softmax saturation for large dimensions)
  4. Apply softmax β†’ attention weights (sum to 1)
  5. Weighted sum of V's β†’ output for that position

python
import torch
import torch.nn.functional as F

def self_attention(Q, K, V):
    d_k = Q.size(-1)
    # (batch, seq_len, d_k) × (batch, d_k, seq_len) → (batch, seq_len, seq_len)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / (d_k ** 0.5)
    weights = F.softmax(scores, dim=-1)
    # (batch, seq_len, seq_len) × (batch, seq_len, d_v) → (batch, seq_len, d_v)
    return torch.matmul(weights, V)

Why scaling matters: Without the division by √d_k, dot products grow with the dimension size, pushing the softmax into regions with tiny gradients (the vanishing-gradient problem).
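
A quick, illustrative demonstration of the saturation effect (the sizes are arbitrary and exact numbers vary run to run, since the vectors are random):

python
import torch
import torch.nn.functional as F

d_k = 512
q = torch.randn(d_k)
k = torch.randn(10, d_k)                 # 10 keys to attend over

raw = k @ q                              # unscaled scores, std ~ sqrt(d_k)
scaled = raw / (d_k ** 0.5)              # scaled scores, std ~ 1

print(F.softmax(raw, dim=-1).max())      # typically ~1.0: one token takes nearly all weight
print(F.softmax(scaled, dim=-1).max())   # noticeably softer distribution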


3. Multi-Head Attention

What: Instead of one attention function, run multiple attention "heads" in parallel, each learning different relationship patterns.

┌──────────────────────────────────────────┐
│            Multi-Head Attention          │
│                                          │
│   Head 1: syntax relationships           │
│   Head 2: semantic meaning               │
│   Head 3: positional patterns            │
│   Head 4: coreference resolution         │
│   ...                                    │
│                                          │
│   Concat all heads → Linear projection   │
└──────────────────────────────────────────┘

How it works:

  1. Split Q, K, V into h heads (e.g., 8 or 32)
  2. Each head works on a slice of dimension d_model / h, so total compute stays about the same
  3. Run attention on each head independently
  4. Concatenate results and project through a linear layer

python
# GPT-2 uses 12 heads with d_model=768
# Each head: 768/12 = 64 dimensions
# Total params same as single-head, but captures diverse patterns

Why multiple heads: A single attention head can only compute one weighted average per position. Multiple heads let the model attend to information from different representation subspaces at different positions simultaneously.
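
A minimal sketch of the split / attend / concatenate pattern, reusing the self_attention function defined earlier (the per-head weight matrices and the final output projection are omitted for brevity):

python
import torch

def multi_head_attention(Q, K, V, n_heads):
    """Split d_model into n_heads subspaces, attend in each, concatenate back."""
    batch, seq_len, d_model = Q.shape
    d_head = d_model // n_heads

    def split(x):   # (batch, seq, d_model) -> (batch, n_heads, seq, d_head)
        return x.view(batch, seq_len, n_heads, d_head).transpose(1, 2)

    heads = self_attention(split(Q), split(K), split(V))   # attention per head
    # (batch, n_heads, seq, d_head) -> (batch, seq, d_model)
    return heads.transpose(1, 2).reshape(batch, seq_len, d_model)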


4. Positional Encoding

What: Transformers process all tokens in parallel (no inherent order). Positional encodings inject sequence order information.

Original approach (sinusoidal):

PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

Each dimension uses a different frequency sinusoid. This lets the model learn relative positions because PE(pos+k) can be expressed as a linear function of PE(pos).
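
A direct translation of those two formulas into a lookup table (a sketch; it assumes d_model is even):

python
import torch

def sinusoidal_pe(max_len, d_model):
    """Build the sinusoidal position table, one row per position."""
    pos = torch.arange(max_len).unsqueeze(1)        # (max_len, 1)
    i = torch.arange(0, d_model, 2)                 # even dimension indices
    freq = 1.0 / (10000 ** (i / d_model))           # one frequency per sin/cos pair
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos * freq)
    pe[:, 1::2] = torch.cos(pos * freq)
    return pe                                       # added to the input embeddings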

Modern approaches:

| Method | Used By | How |
| --- | --- | --- |
| Sinusoidal | Original Transformer | Fixed sin/cos functions |
| Learned | GPT-2, BERT | Trainable embedding per position |
| RoPE (Rotary) | LLaMA, Mistral | Rotates Q/K vectors by position angle |
| ALiBi | BLOOM | Adds linear bias to attention scores |

RoPE (Rotary Position Embedding):

  • Encodes position by rotating the query and key vectors
  • Naturally captures relative position (rotation difference)
  • Enables better length extrapolation than learned positions
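
A simplified sketch of the rotation idea (real implementations differ in how they pair dimensions and cache the angles; this only illustrates the mechanism):

python
import torch

def rope(x):
    """Rotate each (even, odd) pair of dimensions of x by a position-dependent
    angle. x: (seq_len, d) with d even; applied to Q and K before their dot product."""
    seq_len, d = x.shape
    pos = torch.arange(seq_len).unsqueeze(1)              # (seq_len, 1)
    theta = 10000 ** (-torch.arange(0, d, 2) / d)         # (d/2,) frequencies
    angle = pos * theta                                   # (seq_len, d/2)
    x1, x2 = x[:, 0::2], x[:, 1::2]                       # paired dimensions
    out = torch.empty_like(x)
    out[:, 0::2] = x1 * torch.cos(angle) - x2 * torch.sin(angle)
    out[:, 1::2] = x1 * torch.sin(angle) + x2 * torch.cos(angle)
    return out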

5. Encoder vs Decoder

What: The original Transformer has both an encoder (processes the input) and a decoder (generates the output). Modern LLMs typically use only one.

┌──────────────┐        ┌──────────────┐
│   ENCODER    │        │   DECODER    │
│              │        │              │
│ Bidirectional│        │ Causal (left │
│ attention    │───────→│ to right)    │
│              │ cross  │ attention    │
│ Sees full    │ attn   │              │
│ input        │        │ + masked     │
│              │        │ self-attn    │
└──────────────┘        └──────────────┘

Causal masking (decoder): Each token can only attend to previous tokens. Implemented by masking future positions to -infinity before softmax.

python
import torch

# Causal mask for sequence length 4: -inf above the diagonal, 0 elsewhere
mask = torch.triu(torch.full((4, 4), float('-inf')), diagonal=1)
# tensor([[ 0., -inf, -inf, -inf],
#         [ 0.,   0., -inf, -inf],
#         [ 0.,   0.,   0., -inf],
#         [ 0.,   0.,   0.,   0.]])
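
Plugging such a mask into the attention function from section 2 gives a causal variant; a simplified sketch (single head, no padding mask):

python
import torch
import torch.nn.functional as F

def causal_self_attention(Q, K, V):
    # Q, K, V: (batch, seq_len, d_k)
    d_k, seq_len = Q.size(-1), Q.size(-2)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / (d_k ** 0.5)
    mask = torch.triu(torch.full((seq_len, seq_len), float('-inf')), diagonal=1)
    scores = scores + mask                 # future positions become -inf
    weights = F.softmax(scores, dim=-1)    # -inf turns into weight 0
    return torch.matmul(weights, V)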

6. GPT vs BERT Architecture Comparison

| Aspect | GPT (Decoder-only) | BERT (Encoder-only) |
| --- | --- | --- |
| Attention | Causal (left-to-right) | Bidirectional (sees full context) |
| Pre-training | Next-token prediction | Masked language modeling (MLM) + next sentence prediction |
| Generation | Autoregressive (generates tokens one by one) | Not designed for generation |
| Best for | Text generation, chat, code, reasoning | Classification, NER, embeddings, search |
| Examples | GPT-4, Claude, LLaMA, Mistral | BERT, RoBERTa, DeBERTa |
| Context | Unidirectional (only sees past tokens) | Full context (sees entire input) |

Encoder-Decoder models (T5, BART): Use both. Best for sequence-to-sequence tasks like translation and summarization.

Why decoder-only won for LLMs:

  • Simpler architecture, easier to scale
  • Next-token prediction is a universal objective
  • In-context learning emerged naturally at scale
  • Bidirectional attention isn't needed when you can prompt effectively

Key parameters that define model size:

| Parameter | GPT-2 | GPT-3 | LLaMA 2 70B |
| --- | --- | --- | --- |
| Layers | 12 | 96 | 80 |
| Hidden dim | 768 | 12288 | 8192 |
| Attention heads | 12 | 96 | 64 |
| Parameters | 117M | 175B | 70B |
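
A common back-of-the-envelope check (an approximation, not an official parameter count): the Transformer blocks contribute roughly 12 × layers × hidden_dim² weights (about 4d² for the attention projections plus 8d² for a 4×-wide FFN); embeddings are extra.

python
# Rough block-parameter estimate: ~12 * layers * hidden_dim^2 (embeddings not included)
print(12 * 12 * 768 ** 2)       # GPT-2:       ~85M  (plus ~38M token embeddings)
print(12 * 96 * 12288 ** 2)     # GPT-3:       ~174B (close to the quoted 175B)
print(12 * 80 * 8192 ** 2)      # LLaMA 2 70B: ~64B  (its GQA/SwiGLU sizing differs a bit)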

7. How LLMs Generate Text

Autoregressive generation: Generate one token at a time, feeding each generated token back as input.

Input:  "The cat sat on the"
Step 1: → "mat"    (append)
Step 2: "The cat sat on the mat" → "."  (append)
Step 3: "The cat sat on the mat." → [EOS]
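
A minimal greedy-decoding loop; model and tokenizer here are hypothetical stand-ins (a callable returning logits of shape (batch, seq_len, vocab_size) and an encode/decode pair), not a specific library API:

python
import torch

def generate(model, tokenizer, prompt, max_new_tokens=20, eos_id=None):
    ids = tokenizer.encode(prompt)                 # list of token ids
    for _ in range(max_new_tokens):
        x = torch.tensor([ids])                    # (1, seq_len)
        logits = model(x)                          # (1, seq_len, vocab_size)
        next_id = int(logits[0, -1].argmax())      # greedy: pick the most likely token
        ids.append(next_id)                        # feed it back as input
        if eos_id is not None and next_id == eos_id:
            break
    return tokenizer.decode(ids)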

KV Cache: During generation, recomputing attention over every previous token at each step is expensive. Cache the Key and Value tensors from previous tokens so each new token only computes attention against the cached KV pairs.

Without KV cache: O(n²) per token (recompute all attention)
With KV cache:    O(n) per token (only the new token attends to the cache)

Trade-off: Memory ↑ but Speed ↑↑↑
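
A sketch of the cache idea for one attention head (real inference engines keep a cache per layer and per head, and usually preallocate it):

python
import torch
import torch.nn.functional as F

def attend_with_cache(q_new, k_new, v_new, cache):
    """q_new, k_new, v_new: (1, d) projections for the newest token only.
    cache: dict of K and V rows for all previously processed tokens."""
    cache["K"] = torch.cat([cache["K"], k_new])      # grows by one row per step
    cache["V"] = torch.cat([cache["V"], v_new])
    d_k = q_new.size(-1)
    scores = (q_new @ cache["K"].T) / (d_k ** 0.5)   # (1, n): n dot products, not n^2
    weights = F.softmax(scores, dim=-1)
    return weights @ cache["V"]                      # (1, d)

# usage sketch: cache = {"K": torch.empty(0, 64), "V": torch.empty(0, 64)}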

Decoding strategies:

  • Greedy: Always pick the highest-probability token. Fast but repetitive.
  • Top-k: Sample from top k tokens. More diverse.
  • Top-p (nucleus): Sample from smallest set of tokens whose cumulative probability exceeds p.
  • Temperature: Scale logits before softmax. Lower = more deterministic, higher = more random.
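
A sketch combining temperature scaling with top-k filtering (top-p instead keeps the smallest set of sorted tokens whose cumulative probability exceeds p):

python
import torch
import torch.nn.functional as F

def sample_next_token(logits, temperature=1.0, top_k=50):
    """logits: (vocab_size,) scores for the next position."""
    logits = logits / temperature                    # <1 sharpens, >1 flattens
    if top_k is not None:
        kth = torch.topk(logits, top_k).values[-1]   # k-th largest logit
        logits = logits.masked_fill(logits < kth, float('-inf'))  # drop the tail
    probs = F.softmax(logits, dim=-1)
    return int(torch.multinomial(probs, num_samples=1))           # sampled token id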
