07 - Embeddings and Similarity Search
1. Embedding Models
What: Models that convert text (or images, audio) into fixed-size dense vectors. These vectors capture semantic meaning: similar texts produce similar vectors.
"The cat sat on the mat" โ [0.12, -0.34, 0.56, ..., 0.78] (1536 dims)
"A feline rested on a rug" โ [0.11, -0.32, 0.55, ..., 0.77] (very similar!)
"Stock prices rose today" โ [-0.45, 0.67, -0.12, ..., 0.23] (very different)2
3
Key embedding models:
| Model | Provider | Dimensions | Context | Notes |
|---|---|---|---|---|
| text-embedding-3-small | OpenAI | 1536 | 8191 tokens | Best price/performance |
| text-embedding-3-large | OpenAI | 3072 | 8191 tokens | Highest quality (OpenAI) |
| embed-v3 | Cohere | 1024 | 512 tokens | Strong multilingual |
| e5-large-v2 | Microsoft (open) | 1024 | 512 tokens | Good open-source option |
| bge-large-en-v1.5 | BAAI (open) | 1024 | 512 tokens | Top open-source |
| all-MiniLM-L6-v2 | Sentence-Transformers | 384 | 256 tokens | Fast, lightweight |
| nomic-embed-text | Nomic (open) | 768 | 8192 tokens | Long context, open |
```python
# OpenAI embeddings
from openai import OpenAI

client = OpenAI()
response = client.embeddings.create(
    model="text-embedding-3-small",
    input=["Hello world", "Goodbye world"]
)
embedding_1 = response.data[0].embedding  # list of 1536 floats
embedding_2 = response.data[1].embedding
```

```python
# Open-source with sentence-transformers
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(["Hello world", "Goodbye world"])
# numpy array of shape (2, 384)
```
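As a quick check of the claim at the top of this section, you can compare embeddings directly. A minimal sketch using `util.cos_sim` from sentence-transformers (cosine similarity, covered next); the exact scores will vary by model:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')
emb = model.encode([
    "The cat sat on the mat",
    "A feline rested on a rug",
    "Stock prices rose today",
])

# Paraphrases score much higher than unrelated text
print(util.cos_sim(emb[0], emb[1]))  # high (paraphrase)
print(util.cos_sim(emb[0], emb[2]))  # low (unrelated)
```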
2. Cosine Similarity
What: Measures the angle between two vectors, ignoring magnitude. The most common similarity metric for text embeddings.
```
cos(A, B) = (A · B) / (||A|| × ||B||)

        B
       /|
      / |
     /  |
    / θ |
   /    |
  A─────

cos(θ) = 1  → identical direction (most similar)
cos(θ) = 0  → perpendicular (unrelated)
cos(θ) = -1 → opposite direction (most different)
```
```python
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# In practice, most embedding models normalize vectors to unit length,
# so cosine similarity equals the dot product (faster!)
```
Why cosine over Euclidean: Cosine is invariant to vector magnitude, so a long document and a short document about the same topic score as similar even though their embedding vectors may differ in length.
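To see that invariance concretely, here is a toy sketch (made-up 3-d vectors, not real embeddings): scaling one vector leaves cosine unchanged but inflates the Euclidean distance.

```python
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 3.0])
b = np.array([1.1, 1.9, 3.2])

print(cosine_similarity(a, b))       # ~0.998
print(cosine_similarity(a, 10 * b))  # same ~0.998: scaling changes nothing
print(np.linalg.norm(a - b))         # ~0.24
print(np.linalg.norm(a - 10 * b))    # ~35: Euclidean explodes with magnitude
```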
3. Dot Product
What: Simple multiplication of corresponding elements, summed. Equivalent to cosine similarity when vectors are normalized (unit length).
```
dot(A, B) = Σ(A_i × B_i)

For normalized vectors: dot(A, B) = cos(A, B)
```
When to use:
- Vectors are already L2-normalized → use dot product (faster than cosine)
- Vectors have meaningful magnitude → use cosine (normalizes automatically)
- Most modern embedding APIs return normalized vectors, so dot product is preferred (see the sketch below)
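A minimal sketch of that pattern with toy vectors (`normalize` is a local helper here, not a library call): normalize once at index time, then rank by plain dot product at query time.

```python
import numpy as np

def normalize(v):
    return v / np.linalg.norm(v)

# Normalize document vectors once, when they are indexed
docs = np.array([normalize(d) for d in [
    np.array([0.3, 1.2, -0.5]),
    np.array([1.0, 0.1, 0.4]),
    np.array([-0.2, 0.9, 0.8]),
]])

query = normalize(np.array([0.2, 1.0, -0.3]))

scores = docs @ query         # dot product == cosine for unit vectors
ranked = np.argsort(-scores)  # doc indices, most similar first
print(ranked, scores[ranked])
```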
4. Distance Metrics Comparison
| Metric | Formula | Range | Best For |
|---|---|---|---|
| Cosine similarity | A·B / (‖A‖ × ‖B‖) | [-1, 1] | Text similarity (default) |
| Dot product | A·B | (-inf, inf) | Normalized vectors, fast |
| Euclidean (L2) | sqrt(Σ(A_i - B_i)²) | [0, inf) | When magnitude matters |
| Manhattan (L1) | Σ\|A_i - B_i\| | [0, inf) | High-dimensional, sparse |
```
Cosine similarity:  measures the angle            → "How similar in direction?"
Euclidean distance: measures the straight line    → "How far apart?"
Dot product:        measures the projection       → "How aligned, and how large?"
```
Practical rule: Use cosine similarity / cosine distance for text embeddings. Use Euclidean for image embeddings or when magnitude carries information.
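A quick side-by-side of the metrics on toy vectors, using scipy's distance helpers (note that scipy's `cosine` returns cosine *distance*, i.e. 1 - similarity):

```python
import numpy as np
from scipy.spatial.distance import cityblock, cosine, euclidean

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])  # same direction as a, twice the magnitude

print(1 - cosine(a, b))  # cosine similarity: 1.0 (identical direction)
print(np.dot(a, b))      # dot product: 28.0 (alignment AND magnitude)
print(euclidean(a, b))   # L2: ~3.74 (the magnitude gap shows up)
print(cityblock(a, b))   # L1 (Manhattan): 6.0
```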
5. Re-ranking
What: A second-stage model that re-scores initial retrieval results for better accuracy. Retrieval is fast but imprecise; re-ranking is slow but accurate.
Query: "How to handle auth in Next.js"
Stage 1 โ Retrieval (bi-encoder, fast):
Embed query โ ANN search โ Top 20 candidates (milliseconds)
Stage 2 โ Re-ranking (cross-encoder, accurate):
Score each (query, candidate) pair โ Re-order top 20 โ Return top 52
Why two stages:
| | Bi-encoder (retrieval) | Cross-encoder (re-ranking) |
|---|---|---|
| Input | Encodes query and doc separately | Encodes query + doc together |
| Speed | O(1) per doc (pre-computed embeddings) | O(n) โ must process each pair |
| Accuracy | Good | Much better |
| Use case | Narrow 1M docs to 20 | Re-order 20 to find best 5 |
```python
# Using Cohere re-ranker
import cohere

co = cohere.Client('your-key')
results = co.rerank(
    query="How to handle auth in Next.js",
    documents=["Doc about Next.js auth...", "Doc about React hooks...", ...],
    top_n=5,
    model="rerank-english-v3.0"
)
```

```python
# Open-source: cross-encoder/ms-marco-MiniLM-L-12-v2
from sentence_transformers import CrossEncoder

model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-12-v2')
scores = model.predict([
    ("query", "doc1"),
    ("query", "doc2"),
])
```
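Putting the two stages together, a sketch of the full pipeline (here `ann_search` is a hypothetical stand-in for whatever vector index handles stage 1, not a real API):

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-12-v2')

def search(query, top_k=5, candidates_k=20):
    # Stage 1: fast ANN retrieval (ann_search is a placeholder, not a real API)
    candidates = ann_search(query, k=candidates_k)  # list of document strings

    # Stage 2: the cross-encoder scores each (query, doc) pair jointly
    scores = reranker.predict([(query, doc) for doc in candidates])

    # Re-order candidates by score and keep the best top_k
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return ranked[:top_k]
```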
6. Hybrid Search
What: Combines dense vector search (semantic) with sparse keyword search (BM25/TF-IDF) for better retrieval. Catches both semantic matches and exact keyword matches.
Query: "HNSW algorithm performance benchmarks"
Dense search (semantic):
โ "Approximate nearest neighbor methods show strong recall..."
โ Might miss exact acronym "HNSW"
Sparse search (keyword/BM25):
โ "HNSW: Hierarchical Navigable Small World graphs..."
โ Might miss semantically similar but differently worded docs
Hybrid (combine both):
โ Gets both semantic matches AND keyword matches2
How to combine scores:
```python
# Reciprocal Rank Fusion (RRF): the most common approach
def rrf_score(dense_rank, sparse_rank, k=60):
    return 1 / (k + dense_rank) + 1 / (k + sparse_rank)

# Weighted combination (normalize both scores to a comparable range first)
final_score = alpha * dense_score + (1 - alpha) * sparse_score
# alpha = 0.7 is a common default (favoring semantic)
```
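Extending the `rrf_score` idea into a full fusion step (a sketch assuming each backend returns doc IDs in rank order; the inputs here are hypothetical):

```python
def rrf_fuse(dense_results, sparse_results, k=60):
    """Fuse two rank-ordered lists of doc IDs with Reciprocal Rank Fusion."""
    scores = {}
    for rank, doc_id in enumerate(dense_results, start=1):
        scores[doc_id] = scores.get(doc_id, 0.0) + 1 / (k + rank)
    for rank, doc_id in enumerate(sparse_results, start=1):
        scores[doc_id] = scores.get(doc_id, 0.0) + 1 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# "d2" ranks well in both lists, so it tops the fused ranking
print(rrf_fuse(["d1", "d2", "d3"], ["d2", "d4", "d1"]))
# ['d2', 'd1', 'd4', 'd3']
```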
Which databases support hybrid search:
- Pinecone: Sparse-dense vectors
- Weaviate: Built-in BM25 + vector
- Qdrant: Sparse vectors support
- Elasticsearch: kNN + BM25
- pgvector + pg_trgm: Combine vector search with text search in Postgres
7. Embedding Best Practices
| Practice | Why |
|---|---|
| Use the same model for indexing and querying | Different models produce incompatible vector spaces |
| Chunk text before embedding | Embedding models have token limits and work best on focused text |
| Prefix queries with task description | Some models (e5, nomic) expect a "query: " or "search_query: " prefix (see the sketch after this table) |
| Normalize vectors | Enables faster dot product search instead of cosine |
| Batch embedding calls | Reduce API latency and cost |
| Cache embeddings | Don't re-embed unchanged documents |
| Evaluate with your actual data | Benchmark accuracy matters more than leaderboard scores |
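For the prefix row above, here is what that looks like with an e5-family model (a sketch; e5 models are trained with "query: " and "passage: " prefixes, and omitting them hurts retrieval quality):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('intfloat/e5-large-v2')

# e5 expects "passage: " on indexed documents and "query: " on searches
docs = model.encode(
    ["passage: HNSW builds a layered proximity graph for fast ANN search..."],
    normalize_embeddings=True,
)
query = model.encode(
    ["query: how does HNSW work?"],
    normalize_embeddings=True,
)

scores = query @ docs.T  # normalized vectors: dot product == cosine similarity
```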