02 - Tokenization and Embeddings
1. What is Tokenization?
What: Tokenization converts raw text into a sequence of integer token IDs that the model can process. It's the first and last step of any LLM pipeline.
"Hello, world!" โ ["Hello", ",", " world", "!"] โ [15496, 11, 995, 0]
tokenize token IDs2
Why not characters or words?
- Character-level: Sequences too long, model struggles with long-range dependencies
- Word-level: Vocabulary too large, can't handle unseen words
- Subword: best of both worlds, with a reasonable vocabulary size (~32K-100K) that still handles any text (see the sketch below)
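A quick way to see the three granularities side by side, as a sketch assuming tiktoken is installed (the sentence is arbitrary):

```python
# Rough comparison of character-, word-, and subword-level splits.
import tiktoken

text = "Tokenization handles unseen words gracefully"

chars = list(text)                                       # character-level
words = text.split()                                     # naive word-level
enc = tiktoken.get_encoding("cl100k_base")
subwords = [enc.decode([t]) for t in enc.encode(text)]   # subword-level

print(len(chars), len(words), len(subwords))
# For typical English text: characters >> subwords >= words
```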
2. BPE (Byte Pair Encoding)
What: The most common tokenization algorithm. Used by GPT, LLaMA, and most modern LLMs.
How it works:
- Start with a vocabulary of individual characters (bytes)
- Count all adjacent pairs in the training corpus
- Merge the most frequent pair into a new token
- Repeat until desired vocabulary size is reached
Corpus: "low low low low low lowest lowest newer newer wider"
Step 0: Characters: l, o, w, e, s, t, n, r, i, d, ...
Step 1: Most frequent pair: (l, o) → merge into "lo"
Step 2: Most frequent pair: (lo, w) → merge into "low"
Step 3: Most frequent pair: (e, r) → merge into "er"
Step 4: Most frequent pair: (n, ew) → merge into "new"
...
Result: Common words become single tokens, rare words split into subwords.
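The merge loop itself is only a few lines. Below is a minimal, didactic sketch of BPE training on the toy corpus above; real tokenizers (tiktoken, Hugging Face tokenizers) operate on bytes, use pre-tokenization, and have defined tie-breaking rules, so treat this as an illustration only.

```python
# Toy BPE trainer: repeatedly merge the most frequent adjacent pair.
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs across the corpus (word -> frequency)."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

corpus = "low low low low low lowest lowest newer newer wider".split()
words = Counter(tuple(w) for w in corpus)   # each word as a tuple of characters

for step in range(6):
    pair = get_pair_counts(words).most_common(1)[0][0]
    print(f"step {step + 1}: merge {pair}")
    words = merge_pair(words, pair)
```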
```python
# tiktoken: OpenAI's tokenizer
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")

tokens = enc.encode("Hello, world!")
print(tokens)       # [9906, 11, 1917, 0]
print(len(tokens))  # 4 tokens

# Rare/long word gets split
tokens = enc.encode("antidisestablishmentarianism")
print(len(tokens))  # ~6-8 subword tokens
```
3. Other Tokenization Methods
| Method | Used By | Key Difference |
|---|---|---|
| BPE | GPT, LLaMA, Mistral | Byte-level merges, most common |
| WordPiece | BERT, DistilBERT | Similar to BPE but uses likelihood instead of frequency |
| SentencePiece | T5, LLaMA | Language-agnostic; works on the raw character stream (whitespace included), no pre-tokenization |
| Unigram | T5 (via SentencePiece) | Starts with large vocab, prunes based on loss |
SentencePiece key advantage: No language-specific pre-processing needed. Works on raw text/bytes directly. Handles CJK, emoji, code, etc. uniformly.
WordPiece vs BPE:
- BPE: Greedily merges most frequent pair
- WordPiece: Merges pair that maximizes likelihood of training data
- In practice, results are very similar
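One way to see these tokenizers side by side, assuming the `transformers` library is installed and can download the tokenizer files, is to split the same rare word with BERT's WordPiece and GPT-2's byte-level BPE:

```python
# WordPiece (BERT) vs byte-level BPE (GPT-2) on the same rare word.
from transformers import AutoTokenizer

bert = AutoTokenizer.from_pretrained("bert-base-uncased")   # WordPiece
gpt2 = AutoTokenizer.from_pretrained("gpt2")                # byte-level BPE

word = "antidisestablishmentarianism"
print(bert.tokenize(word))  # WordPiece marks continuation pieces with "##"
print(gpt2.tokenize(word))  # BPE pieces; a leading "Ġ" would mark a preceding space
```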
4. Token Limits and Context Windows
What: Every LLM has a maximum number of tokens it can process (context window). This includes both input and output.
| Model | Context Window |
|---|---|
| GPT-3 | 4,096 tokens |
| GPT-4 | 8K / 128K tokens |
| Claude 3.5 | 200K tokens |
| LLaMA 3 | 8K / 128K tokens |
| Gemini 1.5 | 1M / 2M tokens |
Rule of thumb: ~4 characters per token in English (varies by language and content).
Why context limits exist:
- Self-attention is O(n^2) in sequence length, so memory and compute scale quadratically
- KV cache memory grows linearly with sequence length
- Positional encodings may not generalize beyond training length
Strategies for long contexts:
- Sliding window attention (Mistral)
- Sparse attention patterns
- Retrieval-augmented generation (RAG): don't put everything in context
- Summarization / compression of earlier context
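A quick sketch that ties the rule of thumb to a context budget using tiktoken; the prompt file name and the headroom value are placeholders:

```python
# Count tokens to budget a prompt against a context window.
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")
context_window = 8192               # example: base GPT-4

prompt = open("prompt.txt").read()  # hypothetical prompt file
n_tokens = len(enc.encode(prompt))

print(f"{len(prompt)} chars -> {n_tokens} tokens "
      f"({len(prompt) / n_tokens:.1f} chars/token)")

if n_tokens > context_window - 1024:   # leave headroom for the reply
    print("Prompt too long: truncate, summarize, or retrieve selectively")
```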
5. Embedding Spaces
What: Embeddings are dense vector representations of tokens (or sentences) in a continuous high-dimensional space where semantic relationships are captured by geometry.
The classic analogy diagram: the offset from "man" to "king" (a royalty direction) roughly matches the offset from "woman" to "queen", and the offset from "man" to "woman" (a gender direction) roughly matches the offset from "king" to "queen".

king - man + woman ≈ queen
Token embeddings in LLMs:
```python
# Simplified: each token maps to a learned vector
import torch
import torch.nn as nn

vocab_size = 50257   # GPT-2 vocabulary size
d_model = 768        # embedding dimension

embedding_table = nn.Embedding(vocab_size, d_model)
# Weight shape: (50257, 768), i.e. one 768-dim vector per token

token_ids = torch.tensor([9906, 11, 1917])   # "Hello, world"
vectors = embedding_table(token_ids)         # shape: (3, 768)
```
6. Word2Vec
What: Foundational embedding algorithm (2013) that learns word vectors by predicting context. Not used in modern LLMs directly, but the concept underpins all embedding models.
Two architectures:
| | CBOW | Skip-gram |
|---|---|---|
| Input | Context words | Center word |
| Predicts | Center word | Context words |
| Better for | Frequent words | Rare words |
Skip-gram example:
"The cat sat on the mat"
Center: "sat" โ Predict: "The", "cat", "on", "the"
Window size = 2 (2 words each side)2
3
4
5
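A small sketch of how skip-gram training pairs can be generated for the example above (plain Python, window size 2):

```python
# Generate (center, context) training pairs for skip-gram.
def skipgram_pairs(tokens, window=2):
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

sentence = "The cat sat on the mat".split()
for center, context in skipgram_pairs(sentence):
    if center == "sat":
        print(center, "->", context)
# sat -> The, sat -> cat, sat -> on, sat -> the
```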
Key properties of learned vectors:
- Similar words cluster together (dog ≈ cat)
- Analogies via vector arithmetic (king - man + woman ≈ queen); see the sketch after this list
- Dimensions capture semantic features (not individually interpretable)
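For a quick check of the analogy property with real pretrained vectors, gensim's downloader works; the sketch below uses a small GloVe model for convenience (a word2vec model such as word2vec-google-news-300 behaves the same way but is a much larger download):

```python
# king - man + woman with pretrained vectors (downloads on first run).
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
# "queen" typically comes out on top
```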
7. Semantic Similarity
What: Measuring how similar two pieces of text are in meaning, using their embedding vectors.
Cosine similarity is the standard metric:
cos(A, B) = (A · B) / (||A|| × ||B||)
Range: [-1, 1]
 1 = identical direction (same meaning)
 0 = orthogonal (unrelated)
-1 = opposite direction (opposite meaning)
```python
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Embedding model outputs (get_embedding: any embedding model or API)
emb_dog = get_embedding("dog")
emb_puppy = get_embedding("puppy")
emb_car = get_embedding("car")

cosine_similarity(emb_dog, emb_puppy)  # ~0.85 (very similar)
cosine_similarity(emb_dog, emb_car)    # ~0.15 (unrelated)
```
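The `get_embedding` call above is a placeholder. One concrete (assumed) way to implement it locally is with the sentence-transformers library; any embedding API (OpenAI, Cohere, etc.) would slot in the same way:

```python
# Assumed implementation of get_embedding via sentence-transformers
# (pip install sentence-transformers). "all-MiniLM-L6-v2" is a small
# 384-dimensional model.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def get_embedding(text):
    return model.encode(text)   # returns a 1-D numpy array

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(get_embedding("dog"), get_embedding("puppy")))  # high
print(cosine_similarity(get_embedding("dog"), get_embedding("car")))    # low
```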
8. Dimensionality
What: The number of dimensions in an embedding vector. Higher dimensions can capture more nuance but cost more to store and compute.
| Model | Dimensions | Notes |
|---|---|---|
| Word2Vec | 100-300 | Classic, lightweight |
| BERT base | 768 | Per-token embeddings |
| OpenAI text-embedding-3-small | 1536 | Sentence embeddings |
| OpenAI text-embedding-3-large | 3072 | Higher quality, more expensive |
| Cohere embed-v3 | 1024 | Optimized for search |
Dimensionality reduction: Sometimes useful to reduce dimensions for storage/speed:
- PCA, t-SNE (visualization), UMAP
- Matryoshka embeddings (used by OpenAI's text-embedding-3 models): the model is trained so that truncated vectors still work well (see the sketch below)
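A rough sketch of what truncation looks like mechanically (the vector here is random, purely to show the operation); OpenAI's embeddings API exposes the same idea through a `dimensions` parameter:

```python
# Matryoshka-style truncation: keep the first k dimensions, re-normalize.
# Only models trained for this retain most of their quality after truncation.
import numpy as np

def truncate(embedding, k):
    v = np.asarray(embedding)[:k]
    return v / np.linalg.norm(v)

full = np.random.randn(3072)    # stand-in for a 3072-dim embedding
short = truncate(full, 256)     # 256 dims: ~12x less storage
print(short.shape)              # (256,)
```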
Trade-off: More dimensions = better semantic capture, but more memory, slower similarity search, and diminishing returns beyond a point.