02 - Tokenization and Embeddings
1. What is Tokenization?
What: Tokenization converts raw text into a sequence of integer token IDs that the model can process. It's the first and last step of any LLM pipeline.
"Hello, world!" โ ["Hello", ",", " world", "!"] โ [15496, 11, 995, 0]
tokenize token IDs2
Why not characters or words?
- Character-level: Sequences too long, model struggles with long-range dependencies
- Word-level: Vocabulary too large, can't handle unseen words
- Subword: best of both worlds, with a reasonable vocabulary size (~32K-100K) that still handles any text (see the sketch below)
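A quick way to see the three granularities side by side, as a sketch assuming tiktoken is installed (the sentence is arbitrary):

```python
# Rough comparison of character-, word-, and subword-level splits.
import tiktoken

text = "Tokenization handles unseen words gracefully"

chars = list(text)                                       # character-level
words = text.split()                                     # naive word-level
enc = tiktoken.get_encoding("cl100k_base")
subwords = [enc.decode([t]) for t in enc.encode(text)]   # subword-level

print(len(chars), len(words), len(subwords))
# For typical English text: characters >> subwords >= words
```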
2. BPE (Byte Pair Encoding)
What: The most common tokenization algorithm. Used by GPT, LLaMA, and most modern LLMs.
How it works:
- Start with a vocabulary of individual characters (bytes)
- Count all adjacent pairs in the training corpus
- Merge the most frequent pair into a new token
- Repeat until desired vocabulary size is reached
Corpus: "low low low low low lowest lowest newer newer wider"
Step 0: Characters: l, o, w, e, s, t, n, r, i, d, ...
Step 1: Most frequent pair: (l, o) → merge into "lo"
Step 2: Most frequent pair: (lo, w) → merge into "low"
Step 3: Most frequent pair: (e, r) → merge into "er"
Step 4: Most frequent pair: (n, ew) → merge into "new"
...
Result: Common words become single tokens, rare words split into subwords.
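The merge loop itself is only a few lines. Below is a minimal, didactic sketch of BPE training on the toy corpus above; real tokenizers (tiktoken, Hugging Face tokenizers) operate on bytes, use pre-tokenization, and have defined tie-breaking rules, so treat this as an illustration only.

```python
# Toy BPE trainer: repeatedly merge the most frequent adjacent pair.
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs across the corpus (word -> frequency)."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

corpus = "low low low low low lowest lowest newer newer wider".split()
words = Counter(tuple(w) for w in corpus)   # each word as a tuple of characters

for step in range(6):
    pair = get_pair_counts(words).most_common(1)[0][0]
    print(f"step {step + 1}: merge {pair}")
    words = merge_pair(words, pair)
```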
```python
# tiktoken: OpenAI's tokenizer
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")

tokens = enc.encode("Hello, world!")
print(tokens)       # [9906, 11, 1917, 0]
print(len(tokens))  # 4 tokens

# Rare/long word gets split
tokens = enc.encode("antidisestablishmentarianism")
print(len(tokens))  # ~6-8 subword tokens
```
3. Other Tokenization Methods
| Method | Used By | Key Difference |
|---|---|---|
| BPE | GPT, LLaMA, Mistral | Byte-level merges, most common |
| WordPiece | BERT, DistilBERT | Similar to BPE but uses likelihood instead of frequency |
| SentencePiece | T5, LLaMA | Language-agnostic; works on the raw character stream (whitespace included), no pre-tokenization |
| Unigram | T5 (via SentencePiece) | Starts with large vocab, prunes based on loss |
SentencePiece key advantage: No language-specific pre-processing needed. Works on raw text/bytes directly. Handles CJK, emoji, code, etc. uniformly.
WordPiece vs BPE:
- BPE: Greedily merges most frequent pair
- WordPiece: Merges pair that maximizes likelihood of training data
- In practice, results are very similar
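One way to see these tokenizers side by side, assuming the `transformers` library is installed and can download the tokenizer files, is to split the same rare word with BERT's WordPiece and GPT-2's byte-level BPE:

```python
# WordPiece (BERT) vs byte-level BPE (GPT-2) on the same rare word.
from transformers import AutoTokenizer

bert = AutoTokenizer.from_pretrained("bert-base-uncased")   # WordPiece
gpt2 = AutoTokenizer.from_pretrained("gpt2")                # byte-level BPE

word = "antidisestablishmentarianism"
print(bert.tokenize(word))  # WordPiece marks continuation pieces with "##"
print(gpt2.tokenize(word))  # BPE pieces; a leading "Ġ" would mark a preceding space
```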
4. Token Limits and Context Windows
What: Every LLM has a maximum number of tokens it can process (context window). This includes both input and output.
| Model | Context Window |
|---|---|
| GPT-3 | 4,096 tokens |
| GPT-4 | 8K / 128K tokens |
| Claude 3.5 | 200K tokens |
| LLaMA 3 | 8K / 128K tokens |
| Gemini 1.5 | 1M / 2M tokens |
Rule of thumb: ~4 characters per token in English (varies by language and content).
Why context limits exist:
- Self-attention is O(n^2) in sequence length, so memory and compute scale quadratically
- KV cache memory grows linearly with sequence length
- Positional encodings may not generalize beyond training length
Strategies for long contexts:
- Sliding window attention (Mistral)
- Sparse attention patterns
- Retrieval-augmented generation (RAG): don't put everything in context
- Summarization / compression of earlier context
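A quick sketch that ties the rule of thumb to a context budget using tiktoken; the prompt file name and the headroom value are placeholders:

```python
# Count tokens to budget a prompt against a context window.
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")
context_window = 8192               # example: base GPT-4

prompt = open("prompt.txt").read()  # hypothetical prompt file
n_tokens = len(enc.encode(prompt))

print(f"{len(prompt)} chars -> {n_tokens} tokens "
      f"({len(prompt) / n_tokens:.1f} chars/token)")

if n_tokens > context_window - 1024:   # leave headroom for the reply
    print("Prompt too long: truncate, summarize, or retrieve selectively")
```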
5. Embedding Spaces
What: Embeddings are dense vector representations of tokens (or sentences) in a continuous high-dimensional space where semantic relationships are captured by geometry.
The classic analogy diagram: the offset from "man" to "king" (a royalty direction) roughly matches the offset from "woman" to "queen", and the offset from "man" to "woman" (a gender direction) roughly matches the offset from "king" to "queen".

king - man + woman ≈ queen
Token embeddings in LLMs:
```python
# Simplified: each token maps to a learned vector
import torch
import torch.nn as nn

vocab_size = 50257   # GPT-2 vocabulary size
d_model = 768        # embedding dimension

embedding_table = nn.Embedding(vocab_size, d_model)
# Weight shape: (50257, 768), i.e. one 768-dim vector per token

token_ids = torch.tensor([9906, 11, 1917])   # "Hello, world"
vectors = embedding_table(token_ids)         # shape: (3, 768)
```
6. Word2Vec
What: Foundational embedding algorithm (2013) that learns word vectors by predicting context. Not used in modern LLMs directly, but the concept underpins all embedding models.
Two architectures:
| | CBOW | Skip-gram |
|---|---|---|
| Input | Context words | Center word |
| Predicts | Center word | Context words |
| Better for | Frequent words | Rare words |
Skip-gram example:
"The cat sat on the mat"
Center: "sat" โ Predict: "The", "cat", "on", "the"
Window size = 2 (2 words each side)2
3
4
5
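A small sketch of how skip-gram training pairs can be generated for the example above (plain Python, window size 2):

```python
# Generate (center, context) training pairs for skip-gram.
def skipgram_pairs(tokens, window=2):
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

sentence = "The cat sat on the mat".split()
for center, context in skipgram_pairs(sentence):
    if center == "sat":
        print(center, "->", context)
# sat -> The, sat -> cat, sat -> on, sat -> the
```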
Key properties of learned vectors:
- Similar words cluster together (dog ≈ cat)
- Analogies via vector arithmetic (king - man + woman ≈ queen); see the sketch after this list
- Dimensions capture semantic features (not individually interpretable)
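For a quick check of the analogy property with real pretrained vectors, gensim's downloader works; the sketch below uses a small GloVe model for convenience (a word2vec model such as word2vec-google-news-300 behaves the same way but is a much larger download):

```python
# king - man + woman with pretrained vectors (downloads on first run).
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
# "queen" typically comes out on top
```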
7. Semantic Similarity
What: Measuring how similar two pieces of text are in meaning, using their embedding vectors.
Cosine similarity is the standard metric:
cos(A, B) = (A · B) / (||A|| × ||B||)
Range: [-1, 1]
 1 = identical direction (same meaning)
 0 = orthogonal (unrelated)
-1 = opposite direction (opposite meaning)
```python
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Embedding model outputs (get_embedding: any embedding model or API)
emb_dog = get_embedding("dog")
emb_puppy = get_embedding("puppy")
emb_car = get_embedding("car")

cosine_similarity(emb_dog, emb_puppy)  # ~0.85 (very similar)
cosine_similarity(emb_dog, emb_car)    # ~0.15 (unrelated)
```
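The `get_embedding` call above is a placeholder. One concrete (assumed) way to implement it locally is with the sentence-transformers library; any embedding API (OpenAI, Cohere, etc.) would slot in the same way:

```python
# Assumed implementation of get_embedding via sentence-transformers
# (pip install sentence-transformers). "all-MiniLM-L6-v2" is a small
# 384-dimensional model.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def get_embedding(text):
    return model.encode(text)   # returns a 1-D numpy array

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(get_embedding("dog"), get_embedding("puppy")))  # high
print(cosine_similarity(get_embedding("dog"), get_embedding("car")))    # low
```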
8. Dimensionality
What: The number of dimensions in an embedding vector. Higher dimensions can capture more nuance but cost more to store and compute.
| Model | Dimensions | Notes |
|---|---|---|
| Word2Vec | 100-300 | Classic, lightweight |
| BERT base | 768 | Per-token embeddings |
| OpenAI text-embedding-3-small | 1536 | Sentence embeddings |
| OpenAI text-embedding-3-large | 3072 | Higher quality, more expensive |
| Cohere embed-v3 | 1024 | Optimized for search |
Dimensionality reduction: Sometimes useful to reduce dimensions for storage/speed:
- PCA, t-SNE (visualization), UMAP
- Matryoshka embeddings (used by OpenAI's text-embedding-3 models): the model is trained so that truncated vectors still work well (see the sketch below)
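A rough sketch of what truncation looks like mechanically (the vector here is random, purely to show the operation); OpenAI's embeddings API exposes the same idea through a `dimensions` parameter:

```python
# Matryoshka-style truncation: keep the first k dimensions, re-normalize.
# Only models trained for this retain most of their quality after truncation.
import numpy as np

def truncate(embedding, k):
    v = np.asarray(embedding)[:k]
    return v / np.linalg.norm(v)

full = np.random.randn(3072)    # stand-in for a 3072-dim embedding
short = truncate(full, 256)     # 256 dims: ~12x less storage
print(short.shape)              # (256,)
```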
Trade-off: More dimensions = better semantic capture, but more memory, slower similarity search, and diminishing returns beyond a point.