07 - Embeddings and Similarity Search
1. Embedding Models
What: Models that convert text (or images, audio) into fixed-size dense vectors. These vectors capture semantic meaning: similar texts produce similar vectors.
"The cat sat on the mat" โ [0.12, -0.34, 0.56, ..., 0.78] (1536 dims)
"A feline rested on a rug" โ [0.11, -0.32, 0.55, ..., 0.77] (very similar!)
"Stock prices rose today" โ [-0.45, 0.67, -0.12, ..., 0.23] (very different)2
3
Key embedding models:
| Model | Provider | Dimensions | Context | Notes |
|---|---|---|---|---|
| text-embedding-3-small | OpenAI | 1536 | 8191 tokens | Best price/performance |
| text-embedding-3-large | OpenAI | 3072 | 8191 tokens | Highest quality (OpenAI) |
| embed-v3 | Cohere | 1024 | 512 tokens | Strong multilingual |
| e5-large-v2 | Microsoft (open) | 1024 | 512 tokens | Good open-source option |
| bge-large-en-v1.5 | BAAI (open) | 1024 | 512 tokens | Top open-source |
| all-MiniLM-L6-v2 | Sentence-Transformers | 384 | 256 tokens | Fast, lightweight |
| nomic-embed-text | Nomic (open) | 768 | 8192 tokens | Long context, open |
```python
# OpenAI embeddings
from openai import OpenAI

client = OpenAI()
response = client.embeddings.create(
    model="text-embedding-3-small",
    input=["Hello world", "Goodbye world"]
)
embedding_1 = response.data[0].embedding  # list of 1536 floats
embedding_2 = response.data[1].embedding
```

```python
# Open-source with sentence-transformers
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(["Hello world", "Goodbye world"])
# numpy array of shape (2, 384)
```
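As a quick check of the claim at the top of this section, you can compare embeddings directly. A minimal sketch using `util.cos_sim` from sentence-transformers (cosine similarity, covered next); the exact scores will vary by model:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')
emb = model.encode([
    "The cat sat on the mat",
    "A feline rested on a rug",
    "Stock prices rose today",
])

# Paraphrases score much higher than unrelated text
print(util.cos_sim(emb[0], emb[1]))  # high (paraphrase)
print(util.cos_sim(emb[0], emb[2]))  # low (unrelated)
```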
2. Cosine Similarity
What: Measures the angle between two vectors, ignoring magnitude. The most common similarity metric for text embeddings.
```
cos(A, B) = (A · B) / (||A|| × ||B||)

        B
       /|
      / |
     /  |
    / θ |
   /    |
  A─────

cos(θ) = 1  → identical direction (most similar)
cos(θ) = 0  → perpendicular (unrelated)
cos(θ) = -1 → opposite direction (most different)
```
```python
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# In practice, most embedding models normalize vectors to unit length,
# so cosine similarity equals the dot product (faster!)
```
Why cosine over Euclidean: Cosine is invariant to vector magnitude, so a long document and a short document about the same topic score as similar even though their embedding vectors may differ in length.
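To see that invariance concretely, here is a toy sketch (made-up 3-d vectors, not real embeddings): scaling one vector leaves cosine unchanged but inflates the Euclidean distance.

```python
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 3.0])
b = np.array([1.1, 1.9, 3.2])

print(cosine_similarity(a, b))       # ~0.998
print(cosine_similarity(a, 10 * b))  # same ~0.998: scaling changes nothing
print(np.linalg.norm(a - b))         # ~0.24
print(np.linalg.norm(a - 10 * b))    # ~35: Euclidean explodes with magnitude
```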
3. Dot Product
What: Simple multiplication of corresponding elements, summed. Equivalent to cosine similarity when vectors are normalized (unit length).
```
dot(A, B) = Σ(A_i × B_i)

For normalized vectors: dot(A, B) = cos(A, B)
```
When to use:
- Vectors are already L2-normalized → use dot product (faster than cosine)
- Vectors have meaningful magnitude → use cosine (normalizes automatically)
- Most modern embedding APIs return normalized vectors, so dot product is preferred (see the sketch below)
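A minimal sketch of that pattern with toy vectors (`normalize` is a local helper here, not a library call): normalize once at index time, then rank by plain dot product at query time.

```python
import numpy as np

def normalize(v):
    return v / np.linalg.norm(v)

# Normalize document vectors once, when they are indexed
docs = np.array([normalize(d) for d in [
    np.array([0.3, 1.2, -0.5]),
    np.array([1.0, 0.1, 0.4]),
    np.array([-0.2, 0.9, 0.8]),
]])

query = normalize(np.array([0.2, 1.0, -0.3]))

scores = docs @ query         # dot product == cosine for unit vectors
ranked = np.argsort(-scores)  # doc indices, most similar first
print(ranked, scores[ranked])
```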
4. Distance Metrics Comparison
| Metric | Formula | Range | Best For |
|---|---|---|---|
| Cosine similarity | A·B / (‖A‖ × ‖B‖) | [-1, 1] | Text similarity (default) |
| Dot product | A·B | (-inf, inf) | Normalized vectors, fast |
| Euclidean (L2) | sqrt(Σ(A_i - B_i)²) | [0, inf) | When magnitude matters |
| Manhattan (L1) | Σ\|A_i - B_i\| | [0, inf) | High-dimensional, sparse |
```
Cosine similarity:  measures the angle            → "How similar in direction?"
Euclidean distance: measures the straight line    → "How far apart?"
Dot product:        measures the projection       → "How aligned, and how large?"
```
Practical rule: Use cosine similarity / cosine distance for text embeddings. Use Euclidean for image embeddings or when magnitude carries information.
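A quick side-by-side of the metrics on toy vectors, using scipy's distance helpers (note that scipy's `cosine` returns cosine *distance*, i.e. 1 - similarity):

```python
import numpy as np
from scipy.spatial.distance import cityblock, cosine, euclidean

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])  # same direction as a, twice the magnitude

print(1 - cosine(a, b))  # cosine similarity: 1.0 (identical direction)
print(np.dot(a, b))      # dot product: 28.0 (alignment AND magnitude)
print(euclidean(a, b))   # L2: ~3.74 (the magnitude gap shows up)
print(cityblock(a, b))   # L1 (Manhattan): 6.0
```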
5. Re-ranking
What: A second-stage model that re-scores initial retrieval results for better accuracy. Retrieval is fast but imprecise; re-ranking is slow but accurate.
Query: "How to handle auth in Next.js"
Stage 1 โ Retrieval (bi-encoder, fast):
Embed query โ ANN search โ Top 20 candidates (milliseconds)
Stage 2 โ Re-ranking (cross-encoder, accurate):
Score each (query, candidate) pair โ Re-order top 20 โ Return top 52
Why two stages:
| | Bi-encoder (retrieval) | Cross-encoder (re-ranking) |
|---|---|---|
| Input | Encodes query and doc separately | Encodes query + doc together |
| Speed | O(1) per doc (pre-computed embeddings) | O(n) โ must process each pair |
| Accuracy | Good | Much better |
| Use case | Narrow 1M docs to 20 | Re-order 20 to find best 5 |
```python
# Using Cohere re-ranker
import cohere

co = cohere.Client('your-key')
results = co.rerank(
    query="How to handle auth in Next.js",
    documents=["Doc about Next.js auth...", "Doc about React hooks...", ...],
    top_n=5,
    model="rerank-english-v3.0"
)
```

```python
# Open-source: cross-encoder/ms-marco-MiniLM-L-12-v2
from sentence_transformers import CrossEncoder

model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-12-v2')
scores = model.predict([
    ("query", "doc1"),
    ("query", "doc2"),
])
```
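Putting the two stages together, a sketch of the full pipeline (here `ann_search` is a hypothetical stand-in for whatever vector index handles stage 1, not a real API):

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-12-v2')

def search(query, top_k=5, candidates_k=20):
    # Stage 1: fast ANN retrieval (ann_search is a placeholder, not a real API)
    candidates = ann_search(query, k=candidates_k)  # list of document strings

    # Stage 2: the cross-encoder scores each (query, doc) pair jointly
    scores = reranker.predict([(query, doc) for doc in candidates])

    # Re-order candidates by score and keep the best top_k
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return ranked[:top_k]
```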
6. Hybrid Search
What: Combines dense vector search (semantic) with sparse keyword search (BM25/TF-IDF) for better retrieval. Catches both semantic matches and exact keyword matches.
Query: "HNSW algorithm performance benchmarks"
Dense search (semantic):
โ "Approximate nearest neighbor methods show strong recall..."
โ Might miss exact acronym "HNSW"
Sparse search (keyword/BM25):
โ "HNSW: Hierarchical Navigable Small World graphs..."
โ Might miss semantically similar but differently worded docs
Hybrid (combine both):
โ Gets both semantic matches AND keyword matches2
How to combine scores:
```python
# Reciprocal Rank Fusion (RRF): the most common approach
def rrf_score(dense_rank, sparse_rank, k=60):
    return 1 / (k + dense_rank) + 1 / (k + sparse_rank)

# Weighted combination (normalize both scores to a comparable range first)
final_score = alpha * dense_score + (1 - alpha) * sparse_score
# alpha = 0.7 is a common default (favoring semantic)
```
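Extending the `rrf_score` idea into a full fusion step (a sketch assuming each backend returns doc IDs in rank order; the inputs here are hypothetical):

```python
def rrf_fuse(dense_results, sparse_results, k=60):
    """Fuse two rank-ordered lists of doc IDs with Reciprocal Rank Fusion."""
    scores = {}
    for rank, doc_id in enumerate(dense_results, start=1):
        scores[doc_id] = scores.get(doc_id, 0.0) + 1 / (k + rank)
    for rank, doc_id in enumerate(sparse_results, start=1):
        scores[doc_id] = scores.get(doc_id, 0.0) + 1 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# "d2" ranks well in both lists, so it tops the fused ranking
print(rrf_fuse(["d1", "d2", "d3"], ["d2", "d4", "d1"]))
# ['d2', 'd1', 'd4', 'd3']
```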
Which databases support hybrid search:
- Pinecone: Sparse-dense vectors
- Weaviate: Built-in BM25 + vector
- Qdrant: Sparse vectors support
- Elasticsearch: kNN + BM25
- pgvector + pg_trgm: Combine vector search with text search in Postgres
7. Embedding Best Practices
| Practice | Why |
|---|---|
| Use the same model for indexing and querying | Different models produce incompatible vector spaces |
| Chunk text before embedding | Embedding models have token limits and work best on focused text |
| Prefix queries with task description | Some models (e5, nomic) expect a "query: " or "search_query: " prefix (see the sketch after this table) |
| Normalize vectors | Enables faster dot product search instead of cosine |
| Batch embedding calls | Reduce API latency and cost |
| Cache embeddings | Don't re-embed unchanged documents |
| Evaluate with your actual data | Benchmark accuracy matters more than leaderboard scores |
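For the prefix row above, here is what that looks like with an e5-family model (a sketch; e5 models are trained with "query: " and "passage: " prefixes, and omitting them hurts retrieval quality):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('intfloat/e5-large-v2')

# e5 expects "passage: " on indexed documents and "query: " on searches
docs = model.encode(
    ["passage: HNSW builds a layered proximity graph for fast ANN search..."],
    normalize_embeddings=True,
)
query = model.encode(
    ["query: how does HNSW work?"],
    normalize_embeddings=True,
)

scores = query @ docs.T  # normalized vectors: dot product == cosine similarity
```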