
02 - Tokenization and Embeddings


1. What is Tokenization?

What: Tokenization converts raw text into a sequence of integer token IDs that the model can process. It's the first step of any LLM pipeline (encoding text into IDs) and the last (decoding generated IDs back into text).

"Hello, world!" โ†’ ["Hello", ",", " world", "!"] โ†’ [15496, 11, 995, 0]
                    tokenize                        token IDs

Why not characters or words?

  • Character-level: Sequences too long, model struggles with long-range dependencies
  • Word-level: Vocabulary too large, can't handle unseen words
  • Subword: Best of both worlds; reasonable vocabulary size (~32K-100K) and can handle any text

2. BPE (Byte Pair Encoding)

What: The most common tokenization algorithm. Used by GPT, LLaMA, and most modern LLMs.

How it works:

  1. Start with a vocabulary of individual characters (bytes)
  2. Count all adjacent pairs in the training corpus
  3. Merge the most frequent pair into a new token
  4. Repeat until desired vocabulary size is reached
Corpus: "low low low low low lowest lowest newer newer wider"

Step 0: Characters: l, o, w, e, s, t, n, r, i, d, ...
Step 1: Most frequent pair: (l, o) โ†’ merge into "lo"
Step 2: Most frequent pair: (lo, w) โ†’ merge into "low"
Step 3: Most frequent pair: (e, r) โ†’ merge into "er"
Step 4: Most frequent pair: (n, ew) โ†’ merge into "new"
...

Result: Common words become single tokens, rare words split into subwords.
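
The merge loop above fits in a few lines of plain Python. The sketch below is a toy illustration on the same corpus, not a production tokenizer: real BPE implementations work on bytes, handle pre-tokenization and special tokens, and use deterministic tie-breaking.

python
from collections import Counter

def learn_bpe_merges(corpus, num_merges):
    # Each word starts as a tuple of characters, counted by frequency
    words = Counter(tuple(w) for w in corpus.split())
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the whole corpus
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent pair wins
        merges.append(best)
        # Replace every occurrence of the pair with the merged symbol
        new_words = Counter()
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_words[tuple(out)] += freq
        words = new_words
    return merges

corpus = "low low low low low lowest lowest newer newer wider"
print(learn_bpe_merges(corpus, 5))
# e.g. [('l', 'o'), ('lo', 'w'), ('e', 'r'), ...]; exact order depends on tie-breaking

In practice you use an existing library rather than rolling your own; the example below uses GPT-4's actual tokenizer via tiktoken.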

python
# tiktoken: OpenAI's BPE tokenizer library
import tiktoken
enc = tiktoken.encoding_for_model("gpt-4")

tokens = enc.encode("Hello, world!")
print(tokens)        # [9906, 11, 1917, 0]
print(len(tokens))   # 4 tokens

# Rare/long word gets split
tokens = enc.encode("antidisestablishmentarianism")
print(len(tokens))   # ~6-8 subword tokens

3. Other Tokenization Methods

| Method        | Used By                 | Key Difference                                                    |
| ------------- | ----------------------- | ----------------------------------------------------------------- |
| BPE           | GPT, LLaMA, Mistral     | Byte-level merges, most common                                     |
| WordPiece     | BERT, DistilBERT        | Similar to BPE but uses likelihood instead of frequency            |
| SentencePiece | T5, LLaMA               | Language-agnostic, treats input as raw bytes, no pre-tokenization  |
| Unigram       | T5 (via SentencePiece)  | Starts with large vocab, prunes based on loss                      |

SentencePiece key advantage: No language-specific pre-processing needed. Works on raw text/bytes directly. Handles CJK, emoji, code, etc. uniformly.

WordPiece vs BPE:

  • BPE: Greedily merges most frequent pair
  • WordPiece: Merges pair that maximizes likelihood of training data
  • In practice, results are very similar
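
If you have the Hugging Face transformers library installed, you can see WordPiece's continuation markers directly (this downloads the BERT tokenizer on first use; the exact split in the comment is illustrative):

python
from transformers import AutoTokenizer

# WordPiece tokenizer used by BERT; word-internal pieces are prefixed with "##"
bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
print(bert_tok.tokenize("antidisestablishmentarianism"))
# something like ['anti', '##dis', '##esta', '##blish', '##ment', '##arian', '##ism']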

4. Token Limits and Context Windows

What: Every LLM has a maximum number of tokens it can process (context window). This includes both input and output.

| Model      | Context Window   |
| ---------- | ---------------- |
| GPT-3      | 4,096 tokens     |
| GPT-4      | 8K / 128K tokens |
| Claude 3.5 | 200K tokens      |
| LLaMA 3    | 8K / 128K tokens |
| Gemini 1.5 | 1M / 2M tokens   |

Rule of thumb: ~4 characters per token in English (varies by language and content).
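
You can check the rule of thumb and budget prompts with tiktoken. In the sketch below, the 8,192-token limit and 1,024-token output reserve are illustrative numbers, not properties of any specific model:

python
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")

text = "Tokenization converts raw text into integer token IDs that the model can process."
print(len(text) / len(enc.encode(text)))   # roughly 4-5 characters per token for English prose

def fits_in_context(prompt, max_tokens=8192, reserved_for_output=1024):
    # The context window covers input *and* output, so leave room for the reply
    return len(enc.encode(prompt)) <= max_tokens - reserved_for_output

print(fits_in_context(text))  # True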

Why context limits exist:

  • Self-attention is O(n^2) in sequence length, so memory and compute scale quadratically
  • KV cache memory grows linearly with sequence length
  • Positional encodings may not generalize beyond training length

Strategies for long contexts:

  • Sliding window attention (Mistral), sketched after this list
  • Sparse attention patterns
  • Retrieval-augmented generation (RAG): retrieve only the relevant chunks instead of putting everything in context
  • Summarization / compression of earlier context
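
A minimal sketch of the sliding-window idea as a causal attention mask (illustrative only; real implementations such as Mistral's also combine this with a rolling KV cache):

python
import torch

def sliding_window_mask(seq_len, window):
    # True where attention is allowed: each position sees itself
    # and at most the previous `window - 1` positions
    i = torch.arange(seq_len).unsqueeze(1)   # query positions, shape (seq_len, 1)
    j = torch.arange(seq_len).unsqueeze(0)   # key positions,   shape (1, seq_len)
    return (j <= i) & (j > i - window)

print(sliding_window_mask(6, 3).int())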

5. Embedding Spaces

What: Embeddings are dense vector representations of tokens (or sentences) in a continuous high-dimensional space where semantic relationships are captured by geometry.

      queen ←───(gender)──── king
        ↑                      ↑
        | (royalty)            | (royalty)
        |                      |
      woman ←───(gender)───── man

  king - man + woman ≈ queen

Token embeddings in LLMs:

python
# Simplified: each token ID indexes into a learned lookup table of vectors
import torch
import torch.nn as nn

vocab_size = 50257      # GPT-2 vocabulary size
d_model = 768           # embedding dimension

embedding_table = nn.Embedding(vocab_size, d_model)
# Weight shape: (50257, 768), one 768-dim vector per token

token_ids = torch.tensor([9906, 11, 1917])  # token IDs for "Hello, world"
vectors = embedding_table(token_ids)        # shape: (3, 768)

6. Word2Vec

What: Foundational embedding algorithm (2013) that learns word vectors by predicting context. Not used in modern LLMs directly, but the concept underpins all embedding models.

Two architectures:

|            | CBOW           | Skip-gram     |
| ---------- | -------------- | ------------- |
| Input      | Context words  | Center word   |
| Predicts   | Center word    | Context words |
| Better for | Frequent words | Rare words    |

Skip-gram example:
"The cat sat on the mat"

Center: "sat" โ†’ Predict: "The", "cat", "on", "the"
Window size = 2 (2 words each side)
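
A minimal sketch of how skip-gram (center word, context word) training pairs are generated; the actual Word2Vec model then learns vectors by predicting the context word from the center word:

python
def skipgram_pairs(sentence, window=2):
    # For each center word, pair it with every word within `window` positions
    words = sentence.split()
    pairs = []
    for i, center in enumerate(words):
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if j != i:
                pairs.append((center, words[j]))
    return pairs

print(skipgram_pairs("The cat sat on the mat"))
# includes ('sat', 'The'), ('sat', 'cat'), ('sat', 'on'), ('sat', 'the'), ...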

Key properties of learned vectors:

  • Similar words cluster together (dog ≈ cat)
  • Analogies via vector arithmetic (king - man + woman ≈ queen), sketched below
  • Dimensions capture semantic features (not individually interpretable)
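
A minimal sketch of the analogy arithmetic using made-up 2-dimensional vectors (real Word2Vec vectors have 100-300 dimensions; the numbers here exist only to make the arithmetic visible):

python
import numpy as np

def nearest(vocab, query, exclude=()):
    # Return the word whose vector is most cosine-similar to `query`
    best, best_sim = None, -1.0
    for word, vec in vocab.items():
        if word in exclude:
            continue
        sim = np.dot(query, vec) / (np.linalg.norm(query) * np.linalg.norm(vec))
        if sim > best_sim:
            best, best_sim = word, sim
    return best

# Toy vectors: dimension 0 ~ "maleness", dimension 1 ~ "royalty"
vocab = {
    "king":  np.array([0.9, 0.9]),
    "queen": np.array([0.1, 0.9]),
    "man":   np.array([0.9, 0.1]),
    "woman": np.array([0.1, 0.1]),
    "apple": np.array([0.5, 0.5]),
}
query = vocab["king"] - vocab["man"] + vocab["woman"]
print(nearest(vocab, query, exclude={"king", "man", "woman"}))  # queen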

7. Semantic Similarity

What: Measuring how similar two pieces of text are in meaning, using their embedding vectors.

Cosine similarity is the standard metric:

cos(A, B) = (A · B) / (||A|| × ||B||)

Range: [-1, 1]
  1  = identical direction (same meaning)
  0  = orthogonal (unrelated)
 -1  = opposite direction (opposite meaning)
python
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Embedding model outputs; get_embedding is a placeholder for whatever
# embedding model or API you use (OpenAI, Cohere, sentence-transformers, ...)
emb_dog = get_embedding("dog")
emb_puppy = get_embedding("puppy")
emb_car = get_embedding("car")

cosine_similarity(emb_dog, emb_puppy)  # ~0.85 (very similar)
cosine_similarity(emb_dog, emb_car)    # ~0.15 (unrelated)

8. Dimensionality

What: The number of dimensions in an embedding vector. Higher dimensions can capture more nuance but cost more to store and compute.

| Model                         | Dimensions | Notes                          |
| ----------------------------- | ---------- | ------------------------------ |
| Word2Vec                      | 100-300    | Classic, lightweight           |
| BERT base                     | 768        | Per-token embeddings           |
| OpenAI text-embedding-3-small | 1536       | Sentence embeddings            |
| OpenAI text-embedding-3-large | 3072       | Higher quality, more expensive |
| Cohere embed-v3               | 1024       | Optimized for search           |

Dimensionality reduction: Sometimes useful to reduce dimensions for storage/speed:

  • PCA, t-SNE (visualization), UMAP
  • Matryoshka embeddings (used by OpenAI's text-embedding-3 models): the model is trained so truncated vectors still work (sketched below)
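
A minimal sketch of Matryoshka-style truncation, assuming the embedding model was trained so that the leading dimensions carry the most information (the 3072-dim random vector below is only a stand-in for a real embedding):

python
import numpy as np

def truncate_embedding(vec, dims):
    # Keep the first `dims` dimensions, then re-normalize to unit length
    # so cosine similarity still behaves sensibly
    v = np.asarray(vec, dtype=float)[:dims]
    return v / np.linalg.norm(v)

full = np.random.randn(3072)            # stand-in for a full-size embedding
short = truncate_embedding(full, 256)   # far cheaper to store and compare
print(short.shape)                      # (256,)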

Trade-off: More dimensions = better semantic capture, but more memory, slower similarity search, and diminishing returns beyond a point.
