06 - Vector Databases
1. What is a Vector Database?
What: A database optimized for storing, indexing, and querying high-dimensional vectors (embeddings). Unlike traditional databases that search by exact match or range, vector databases find the most similar vectors using distance metrics.
Traditional DB: SELECT * FROM docs WHERE category = 'AI' (exact match)
Vector DB: Find 5 vectors closest to query_vector (similarity search)

┌─────────────────────────────────────────┐
│ Vector Database                         │
│                                         │
│ Store:  [0.12, -0.34, 0.56, ..., 0.78]  │ ← 1536-dim vectors
│ Index:  HNSW / IVF / Flat               │ ← fast ANN search
│ Query:  Find k nearest neighbors        │ ← cosine / L2 / dot product
│ Filter: + metadata filtering            │ ← combine with traditional filters
│                                         │
└─────────────────────────────────────────┘

2. ANN (Approximate Nearest Neighbor) Search
What: Finding the exact nearest neighbors in high-dimensional space is slow (O(n): every vector must be checked). ANN algorithms trade a small amount of accuracy for massive speed improvements.
Exact search: Check all 1M vectors → 100% accurate, slow
ANN search: Check ~1000 vectors → 95-99% accurate, 100x faster

Why approximate is fine: In practice, the top-5 results from ANN almost always include the true top-5. The "missed" results are usually nearly as relevant.
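To make the O(n) cost concrete, here is a minimal brute-force baseline in NumPy; the corpus size, dimensionality, and variable names are illustrative assumptions (scaled down from the 1M example above). It scores the query against every stored vector, which is exactly the full scan that ANN indexes avoid.

import numpy as np

# Illustrative corpus: 100k unit-normalized vectors, 384-dim (assumed sizes)
rng = np.random.default_rng(0)
vectors = rng.normal(size=(100_000, 384)).astype(np.float32)
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)

query = rng.normal(size=384).astype(np.float32)
query /= np.linalg.norm(query)

# Exact search: compare the query with every stored vector (the O(n) scan)
scores = vectors @ query                # dot product == cosine on unit vectors
top5 = np.argsort(-scores)[:5]          # indices of the 5 most similar vectors
print(top5, scores[top5])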
3. HNSW (Hierarchical Navigable Small World)
What: The most popular ANN index. Builds a multi-layer graph where each layer is progressively sparser, enabling fast traversal from coarse to fine.
Layer 3:  A ──────────── D                    (sparse, long-range)
          │              │
Layer 2:  A ── B ─────── D                    (medium density)
          │    │         │
Layer 1:  A ── B ── C ── D ── E ── F          (dense, short-range)
          │    │    │    │    │    │
Layer 0:  A    B    C    D    E    F    G  H  I  J    (all nodes)

How search works:
- Start at top layer → find closest node using greedy traversal
- Drop to next layer → continue greedy search from that node
- Repeat until reaching bottom layer
- Return k nearest neighbors
Parameters:
- M: Max connections per node (higher = better recall, more memory)
- ef_construction: Search depth during build (higher = better index quality)
- ef_search: Search depth at query time (higher = better recall, slower)
Trade-offs:
- Fast query time: O(log n)
- High memory usage (graph structure)
- Slow to build
- Best for: high-recall, low-latency requirements
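A minimal sketch of how M, ef_construction, and ef_search show up in practice, using the hnswlib library; the dataset sizes and parameter values are illustrative assumptions, not tuned recommendations.

# pip install hnswlib
import hnswlib
import numpy as np

dim, num_elements = 384, 10_000                      # assumed sizes
data = np.random.rand(num_elements, dim).astype(np.float32)

# Build the graph: M and ef_construction control index quality and memory
index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=num_elements, M=16, ef_construction=200)
index.add_items(data, np.arange(num_elements))

# ef (ef_search) trades recall for latency at query time
index.set_ef(64)
labels, distances = index.knn_query(data[:1], k=5)   # 5 nearest neighbors
print(labels, distances)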
4. IVF (Inverted File Index)
What: Partitions vectors into clusters (using k-means), then only searches the nearest clusters at query time.
Build phase:
All vectors → K-means clustering → N clusters (centroids)
Query phase:
Query vector → Find nprobe nearest centroids → Search only those clusters
┌─────────┐  ┌─────────┐  ┌─────────┐  ┌─────────┐
│Cluster 1│  │Cluster 2│  │Cluster 3│  │Cluster 4│
│ • • •   │  │ • • •   │  │ • •     │  │ • • • • │
│ • •     │  │ • • • • │  │ • • •   │  │ • •     │
└─────────┘  └─────────┘  └─────────┘  └─────────┘
     ↑            ↑
     nprobe=2: only search these two clusters

Parameters:
- nlist: Number of clusters (typically sqrt(n))
- nprobe: Number of clusters to search (higher = better recall, slower)
Trade-offs:
- Lower memory than HNSW
- Faster build time
- Requires training (k-means on representative data)
- Best for: large datasets, memory-constrained environments
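As a sketch of nlist and nprobe in code, here is an IVF index built with the FAISS library; the sizes and parameter values are illustrative assumptions.

# pip install faiss-cpu
import faiss
import numpy as np

dim, n = 384, 100_000                         # assumed sizes
data = np.random.rand(n, dim).astype(np.float32)

nlist = 316                                   # number of clusters, ~sqrt(n)
quantizer = faiss.IndexFlatL2(dim)            # assigns vectors to centroids
index = faiss.IndexIVFFlat(quantizer, dim, nlist)

index.train(data)                             # k-means on representative data
index.add(data)

index.nprobe = 8                              # clusters to scan per query
distances, ids = index.search(data[:1], 5)    # 5 nearest neighbors
print(ids, distances)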
5. Vector Database Comparison
| Database | Type | Index Types | Best For |
|---|---|---|---|
| Pinecone | Managed cloud | Proprietary | Production RAG, zero ops |
| Chroma | Embedded / local | HNSW | Prototyping, local dev |
| pgvector | Postgres extension | IVF, HNSW | Existing Postgres stack |
| Weaviate | Self-hosted / cloud | HNSW | Multi-modal, GraphQL API |
| Qdrant | Self-hosted / cloud | HNSW | Filtering + vector search |
| Milvus | Self-hosted / cloud | IVF, HNSW, DiskANN | Large-scale, GPU support |
| FAISS | Library (not DB) | IVF, HNSW, PQ | Research, custom pipelines |
6. Pinecone
What: Fully managed vector database. No infrastructure to manage.
from pinecone import Pinecone
pc = Pinecone(api_key="your-key")
index = pc.Index("my-index")
# Upsert vectors
index.upsert(vectors=[
{"id": "doc1", "values": [0.1, 0.2, ...], "metadata": {"source": "docs"}},
{"id": "doc2", "values": [0.3, 0.4, ...], "metadata": {"source": "blog"}},
])
# Query
results = index.query(
vector=[0.15, 0.25, ...],
top_k=5,
filter={"source": {"$eq": "docs"}}, # metadata filtering
include_metadata=True
)
Key features: Serverless tier, namespaces, metadata filtering, hybrid search (sparse + dense).
7. Chroma
What: Open-source embedding database. Runs in-process (no server needed) or client-server. Great for prototyping.
import chromadb
client = chromadb.Client() # in-memory
# client = chromadb.PersistentClient(path="./chroma_db") # persistent
collection = client.create_collection("my_docs")
# Add documents (Chroma can auto-embed with default model)
collection.add(
documents=["Doc about AI", "Doc about cooking"],
ids=["doc1", "doc2"],
metadatas=[{"topic": "ai"}, {"topic": "food"}]
)
# Query
results = collection.query(
query_texts=["machine learning"],
n_results=5,
where={"topic": "ai"}
)

8. pgvector
What: PostgreSQL extension that adds vector similarity search. Use your existing Postgres database for embeddings; no separate infrastructure.
-- Enable extension
CREATE EXTENSION vector;
-- Create table with vector column
CREATE TABLE documents (
id SERIAL PRIMARY KEY,
content TEXT,
embedding vector(1536) -- OpenAI embedding dimension
);
-- Insert
INSERT INTO documents (content, embedding)
VALUES ('About AI', '[0.1, 0.2, ...]');
-- Cosine similarity search
SELECT content, 1 - (embedding <=> query_vector) AS similarity
FROM documents
ORDER BY embedding <=> '[0.15, 0.25, ...]' -- <=> is cosine distance
LIMIT 5;
-- Create HNSW index for faster queries
CREATE INDEX ON documents
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
Operators: <-> L2 distance, <#> negative inner product, <=> cosine distance.
Trade-offs:
- Pro: No new infrastructure, ACID transactions, join with relational data
- Con: Not as fast as purpose-built vector DBs at scale, limited to Postgres
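If you query pgvector from application code rather than psql, a sketch with the psycopg driver and the pgvector Python helper might look like this; the connection string, table, and embedding below are placeholders.

# pip install "psycopg[binary]" pgvector
import numpy as np
import psycopg
from pgvector.psycopg import register_vector

conn = psycopg.connect("dbname=mydb user=me")   # placeholder connection
register_vector(conn)                           # adapt numpy arrays <-> vector

query_embedding = np.random.rand(1536).astype(np.float32)  # placeholder

rows = conn.execute(
    """
    SELECT content, 1 - (embedding <=> %s) AS similarity
    FROM documents
    ORDER BY embedding <=> %s
    LIMIT 5
    """,
    (query_embedding, query_embedding),
).fetchall()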
9. Key Considerations
Choosing a vector database:
Small project / prototype? → Chroma (embedded, zero setup)
Already using Postgres? → pgvector (no new infra)
Production, want managed? → Pinecone (serverless)
Need advanced filtering? → Qdrant or Weaviate
Massive scale (100M+ vectors)? → Milvus or Pinecone
Research / custom pipeline? → FAISS (library)

Important metrics:
- QPS (Queries Per Second): How many searches can you serve?
- Recall@k: What % of true top-k results does your ANN return?
- Latency (p99): Worst-case query time
- Memory per vector: Storage cost at scale
- Build time: How long to index your data?
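Recall@k is straightforward to measure yourself: run a sample of queries through both a brute-force exact search and your ANN index, then compare the returned IDs. A minimal sketch, with hypothetical result lists:

def recall_at_k(exact_ids, ann_ids, k=5):
    """Fraction of the true top-k that the ANN index actually returned."""
    return len(set(exact_ids[:k]) & set(ann_ids[:k])) / k

# Hypothetical results for one query: exact search vs. ANN search
print(recall_at_k(["d1", "d2", "d3", "d4", "d5"],
                  ["d1", "d3", "d2", "d9", "d5"]))   # 0.8 → 4 of the true 5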