
07 - Embeddings and Similarity Search


1. Embedding Models

What: Models that convert text (or images, audio) into fixed-size dense vectors. These vectors capture semantic meaning: similar texts produce similar vectors.

"The cat sat on the mat"   →  [0.12, -0.34, 0.56, ..., 0.78]  (1536 dims)
"A feline rested on a rug" →  [0.11, -0.32, 0.55, ..., 0.77]  (very similar!)
"Stock prices rose today"  →  [-0.45, 0.67, -0.12, ..., 0.23]  (very different)

Key embedding models:

| Model | Provider | Dimensions | Context | Notes |
|---|---|---|---|---|
| text-embedding-3-small | OpenAI | 1536 | 8191 tokens | Best price/performance |
| text-embedding-3-large | OpenAI | 3072 | 8191 tokens | Highest quality (OpenAI) |
| embed-v3 | Cohere | 1024 | 512 tokens | Strong multilingual |
| e5-large-v2 | Microsoft (open) | 1024 | 512 tokens | Good open-source option |
| bge-large-en-v1.5 | BAAI (open) | 1024 | 512 tokens | Top open-source |
| all-MiniLM-L6-v2 | Sentence-Transformers | 384 | 256 tokens | Fast, lightweight |
| nomic-embed-text | Nomic (open) | 768 | 8192 tokens | Long context, open |
```python
# OpenAI embeddings
from openai import OpenAI
client = OpenAI()

response = client.embeddings.create(
    model="text-embedding-3-small",
    input=["Hello world", "Goodbye world"]
)

embedding_1 = response.data[0].embedding  # list of 1536 floats
embedding_2 = response.data[1].embedding

# Open-source with sentence-transformers
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(["Hello world", "Goodbye world"])
# numpy array: (2, 384)
```

2. Cosine Similarity

What: Measures the angle between two vectors, ignoring magnitude. The most common similarity metric for text embeddings.

cos(A, B) = (A · B) / (||A|| × ||B||)

           B
          /|
         / |
        /  |
       / θ |
      /    |
     A─────

cos(θ) = 1   → identical direction (most similar)
cos(θ) = 0   → perpendicular (unrelated)
cos(θ) = -1  → opposite direction (most different)
```python
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# In practice, most embedding models normalize vectors to unit length,
# so cosine similarity = dot product (faster!)
```

Why cosine over Euclidean: Cosine is invariant to vector magnitude. A long document and a short document about the same topic point in a similar direction, so they score as similar under cosine even though their vectors differ in length.
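This invariance is easy to check numerically. A minimal NumPy sketch (vectors are illustrative, not real embeddings): scaling a vector by 10 leaves its cosine similarity to a query essentially unchanged, while its Euclidean distance changes dramatically.

```python
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

query    = np.array([0.2, 0.5, 0.4])
doc      = np.array([0.3, 0.4, 0.5])   # "short document"
long_doc = 10 * doc                    # same direction, 10x the magnitude

# Cosine similarity is identical up to float rounding...
print(cosine_similarity(query, doc), cosine_similarity(query, long_doc))

# ...but Euclidean distance grows with the scaling
print(np.linalg.norm(query - doc), np.linalg.norm(query - long_doc))
```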


3. Dot Product

What: Simple multiplication of corresponding elements, summed. Equivalent to cosine similarity when vectors are normalized (unit length).

dot(A, B) = Σ(A_i × B_i)

For normalized vectors: dot(A, B) = cos(A, B)

When to use:

  • Vectors are already L2-normalized → use dot product (faster than cosine)
  • Vectors have meaningful magnitude → use cosine (normalizes automatically)
  • Most modern embedding APIs return normalized vectors, so dot product is preferred
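A quick NumPy check of the equivalence (illustrative vectors): after L2 normalization, the plain dot product matches the cosine formula.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 0.5, 1.0])

cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# L2-normalize to unit length, then take the dot product
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)
dot_of_units = np.dot(a_unit, b_unit)

print(np.isclose(cosine, dot_of_units))  # True
```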

4. Distance Metrics Comparison

| Metric | Formula | Range | Best For |
|---|---|---|---|
| Cosine similarity | A·B / (\|\|A\|\| × \|\|B\|\|) | [-1, 1] | Text similarity (default) |
| Dot product | A·B | (-inf, inf) | Normalized vectors, fast |
| Euclidean (L2) | sqrt(Σ(A_i - B_i)^2) | [0, inf) | When magnitude matters |
| Manhattan (L1) | Σ\|A_i - B_i\| | [0, inf) | High-dimensional, sparse |

Cosine similarity:   Measures angle          → "How similar in direction?"
Euclidean distance:  Measures straight-line  → "How far apart?"
Dot product:         Measures projection     → "How aligned and how large?"

Practical rule: Use cosine similarity / cosine distance for text embeddings. Use Euclidean for image embeddings or when magnitude carries information.
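All four metrics on one illustrative pair (NumPy sketch, made-up vectors); note the first two are similarities (higher = closer) while the last two are distances (lower = closer).

```python
import numpy as np

a = np.array([0.5, 0.1, 0.4])
b = np.array([0.4, 0.2, 0.3])

cosine    = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))  # angle
dot       = np.dot(a, b)                                            # projection
euclidean = np.linalg.norm(a - b)                                   # L2 straight-line
manhattan = np.sum(np.abs(a - b))                                   # L1 coordinate-wise

print(cosine, dot, euclidean, manhattan)
```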


5. Re-ranking

What: A second-stage model that re-scores initial retrieval results for better accuracy. Retrieval is fast but imprecise; re-ranking is slow but accurate.

Query: "How to handle auth in Next.js"

Stage 1 - Retrieval (bi-encoder, fast):
  Embed query → ANN search → Top 20 candidates (milliseconds)

Stage 2 - Re-ranking (cross-encoder, accurate):
  Score each (query, candidate) pair → Re-order top 20 → Return top 5

Why two stages:

| | Bi-encoder (retrieval) | Cross-encoder (re-ranking) |
|---|---|---|
| Input | Encodes query and doc separately | Encodes query + doc together |
| Speed | O(1) per doc (pre-computed embeddings) | O(n), must process each pair |
| Accuracy | Good | Much better |
| Use case | Narrow 1M docs to 20 | Re-order 20 to find best 5 |
```python
# Using Cohere re-ranker
import cohere
co = cohere.Client('your-key')

results = co.rerank(
    query="How to handle auth in Next.js",
    documents=["Doc about Next.js auth...", "Doc about React hooks...", ...],
    top_n=5,
    model="rerank-english-v3.0"
)

# Open-source: cross-encoder/ms-marco-MiniLM-L-12-v2
from sentence_transformers import CrossEncoder
model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-12-v2')
scores = model.predict([
    ("query", "doc1"),
    ("query", "doc2"),
])
```
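Whichever re-ranker produced the scores, the final step is just a sort over (score, candidate) pairs. A toy sketch (scores and documents are made up, standing in for real cross-encoder output):

```python
candidates = ["Next.js auth doc", "React hooks doc", "CSS grid doc"]
scores = [0.92, 0.31, 0.07]  # toy cross-encoder scores, one per candidate

# Sort candidates by score, highest first, and keep the top-k
reranked = sorted(zip(scores, candidates), reverse=True)
top_k = [doc for _, doc in reranked[:2]]
print(top_k)  # ['Next.js auth doc', 'React hooks doc']
```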

6. Hybrid Search

What: Combines dense vector search (semantic) with sparse keyword search (BM25/TF-IDF) for better retrieval. Catches both semantic matches and exact keyword matches.

Query: "HNSW algorithm performance benchmarks"

Dense search (semantic):
  ✓ "Approximate nearest neighbor methods show strong recall..."
  ✗ Might miss exact acronym "HNSW"

Sparse search (keyword/BM25):
  ✓ "HNSW: Hierarchical Navigable Small World graphs..."
  ✗ Might miss semantically similar but differently worded docs

Hybrid (combine both):
  ✓ Gets both semantic matches AND keyword matches

How to combine scores:

```python
# Reciprocal Rank Fusion (RRF), the most common approach
def rrf_score(dense_rank, sparse_rank, k=60):
    return 1 / (k + dense_rank) + 1 / (k + sparse_rank)

# Weighted combination
final_score = alpha * dense_score + (1 - alpha) * sparse_score
# alpha = 0.7 is a common default (favoring semantic)
```
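The per-pair RRF formula extends naturally to fusing whole result lists: each document accumulates 1/(k + rank) from every list it appears in. A self-contained sketch with made-up document IDs:

```python
from collections import defaultdict

def rrf_merge(ranked_lists, k=60):
    """Fuse ranked lists of doc IDs with Reciprocal Rank Fusion."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

dense_results  = ["d3", "d1", "d7"]   # toy semantic ranking
sparse_results = ["d1", "d9", "d3"]   # toy BM25 ranking

print(rrf_merge([dense_results, sparse_results]))
# ['d1', 'd3', 'd9', 'd7'] -- d1 wins by ranking high in both lists
```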

Which databases support hybrid search:

  • Pinecone: Sparse-dense vectors
  • Weaviate: Built-in BM25 + vector
  • Qdrant: Sparse vectors support
  • Elasticsearch: kNN + BM25
  • pgvector + pg_trgm: Combine vector search with text search in Postgres

7. Embedding Best Practices

| Practice | Why |
|---|---|
| Use the same model for indexing and querying | Different models produce incompatible vector spaces |
| Chunk text before embedding | Embedding models have token limits and work best on focused text |
| Prefix queries with task description | Some models (e5, nomic) expect "query: " or "search_query: " prefix |
| Normalize vectors | Enables faster dot product search instead of cosine |
| Batch embedding calls | Reduce API latency and cost |
| Cache embeddings | Don't re-embed unchanged documents |
| Evaluate with your actual data | Benchmark accuracy matters more than leaderboard scores |
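The caching practice can be sketched as a content-hash keyed store. `fake_embed` below is a hypothetical stand-in for a real embedding API call; it counts its invocations to show the cache working.

```python
import hashlib

calls = {"count": 0}
cache = {}

def fake_embed(text):
    # Hypothetical stand-in for a real embedding API call
    calls["count"] += 1
    return [float(len(text)), 0.0]

def cached_embed(text):
    # Key on a hash of the content: unchanged text is never re-embedded
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if key not in cache:
        cache[key] = fake_embed(text)
    return cache[key]

cached_embed("hello world")
cached_embed("hello world")   # cache hit, no second embed call
print(calls["count"])  # 1
```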

Frontend interview preparation reference.