05 - RAG Fundamentals
1. What is RAG?
What: Retrieval-Augmented Generation (RAG) is a pattern that enhances LLM responses by retrieving relevant external documents and including them in the prompt context. The model generates answers grounded in retrieved evidence rather than relying solely on its training data.
┌──────────┐     ┌─────────────────┐     ┌──────────────┐
│   User   │     │   Vector DB /   │     │     LLM      │
│  Query   │────→│    Retriever    │────→│  Generation  │
│          │     │                 │     │              │
│"How do I │     │  Returns top-k  │     │  Generates   │
│ deploy on│     │  relevant docs  │     │ answer using │
│ Vercel?" │     │                 │     │  retrieved   │
│          │     │                 │     │   context    │
└──────────┘     └─────────────────┘     └──────────────┘
The full RAG pipeline:
1. INDEXING (offline, one-time)
Documents → Chunk → Embed → Store in Vector DB (see the sketch after this list)
2. RETRIEVAL (per query)
Query → Embed → Search Vector DB → Top-k chunks
3. GENERATION (per query)
System prompt + Retrieved chunks + User query → LLM → Answer
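A minimal sketch of the indexing stage, assuming the OpenAI embeddings API; the sample chunks are illustrative and a plain Python list stands in for a real vector database:
from openai import OpenAI
client = OpenAI()
# Pre-chunked documents (chunking strategies are covered in section 3);
# the sources and text below are illustrative
chunks = [
    {"source": "deploy.md", "text": "To deploy on Vercel, push to the main branch ..."},
    {"source": "auth.md", "text": "Authentication in Next.js can be handled with middleware ..."},
]
# Embed each chunk and store it; a plain list stands in for a vector DB
index = []
for chunk in chunks:
    embedding = client.embeddings.create(
        model="text-embedding-3-small",
        input=chunk["text"]
    ).data[0].embedding
    index.append({"embedding": embedding, **chunk})
In production you would batch the embedding calls and upsert the records into a vector database (Pinecone, pgvector, Chroma, etc.).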
2. When to Use RAG vs Fine-tuning
| Scenario | Use RAG | Use Fine-tuning |
|---|---|---|
| Knowledge updates frequently | Yes | No (need to retrain) |
| Need source attribution | Yes (can cite chunks) | No |
| Domain-specific behavior/style | No | Yes |
| Proprietary data | Yes | Yes (but data exposure risk) |
| Low latency requirement | Adds retrieval latency | No extra latency |
| Budget constraint | Cheaper (no training) | Training cost |
| Factual accuracy critical | Yes (grounded) | Risky (hallucination) |
Combine both: Fine-tune for style/format + RAG for knowledge = best of both worlds.
3. Chunking Strategies
What: Documents must be split into chunks before embedding. Chunk size and strategy dramatically affect retrieval quality.
Fixed-size chunking:
# Simple: split every N tokens/characters with overlap
def fixed_size_chunks(text, chunk_size=512, overlap=50):
    chunks = []
    for i in range(0, len(text), chunk_size - overlap):
        chunks.append(text[i:i + chunk_size])
    return chunks
# Typical: chunk_size=512 tokens, overlap=50 tokens
Semantic chunking strategies:
| Strategy | How | Best For |
|---|---|---|
| Fixed-size | Split every N tokens | Simple, general purpose |
| Sentence-based | Split on sentence boundaries | Preserving meaning |
| Paragraph-based | Split on paragraph breaks | Well-structured documents |
| Recursive | Try largest separator first, then smaller | Code, markdown |
| Semantic | Split when embedding similarity drops | Dense technical content |
| Document-aware | Use headers, sections, pages | PDFs, documentation |
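A minimal sketch of the recursive strategy: try the coarsest separator first, pack pieces up to a size limit, and recurse with finer separators on oversized pieces (the separator order and character-based limit are illustrative assumptions):
def recursive_split(text, max_len=512, separators=("\n\n", "\n", ". ", " ")):
    # Text already fits: keep it as a single chunk
    if len(text) <= max_len:
        return [text]
    for sep in separators:
        if sep not in text:
            continue  # try the next (finer) separator
        chunks, current = [], ""
        for piece in text.split(sep):
            candidate = current + sep + piece if current else piece
            if len(candidate) <= max_len:
                current = candidate            # keep packing pieces into this chunk
            else:
                if current:
                    chunks.append(current)     # flush the packed chunk
                if len(piece) <= max_len:
                    current = piece
                else:                          # piece itself too big: recurse with finer separators
                    chunks.extend(recursive_split(piece, max_len, separators))
                    current = ""
        if current:
            chunks.append(current)
        return chunks
    # No separator found at all: hard-split by characters
    return [text[i:i + max_len] for i in range(0, len(text), max_len)]
For markdown or code, the separator list can start with headers or function boundaries so chunks follow the document's own structure.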
Chunk size trade-offs:
Small chunks (100-200 tokens):
+ More precise retrieval
+ Less noise in results
- May lose context
- More chunks to search
Large chunks (500-1000 tokens):
+ More context preserved
+ Fewer chunks to manage
- May include irrelevant content
- Lower retrieval precision
Sweet spot: 256-512 tokens with 10-20% overlap
4. Retrieval + Generation Flow
Detailed flow with code:
from openai import OpenAI
client = OpenAI()
# Step 1: Embed the query
query = "How do I handle authentication in Next.js?"
query_embedding = client.embeddings.create(
    model="text-embedding-3-small",
    input=query
).data[0].embedding
# Step 2: Search the vector database (vector_db is a placeholder for your vector store client)
results = vector_db.query(
    vector=query_embedding,
    top_k=5,
    include_metadata=True
)
# Step 3: Build context from retrieved chunks
context = "\n\n---\n\n".join([
    f"Source: {r.metadata['source']}\n{r.metadata['text']}"
    for r in results
])
# Step 4: Generate answer
response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": f"""Answer based on the provided context.
If the context doesn't contain the answer, say so.
Context:
{context}"""},
        {"role": "user", "content": query}
    ]
)
answer = response.choices[0].message.content
Advanced retrieval patterns:
Basic RAG:   Query → Retrieve → Generate
Multi-query: Query → LLM generates 3 query variants
             → Retrieve for each → Merge results → Generate
HyDE:        Query → LLM generates hypothetical answer
             → Embed hypothetical answer → Retrieve → Generate
Iterative:   Query → Retrieve → Generate partial answer
             → Generate follow-up query → Retrieve more → Final answer
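As one example, a sketch of the multi-query pattern, reusing client and the vector_db placeholder from the code above; the prompt wording, line-based parsing, and the r.id attribute are assumptions:
# Multi-query retrieval: paraphrase the query, retrieve for each variant, merge results
def multi_query_retrieve(query, top_k=5):
    # Ask the LLM for paraphrased variants of the original query
    variants_response = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": f"Rewrite this question in 3 different ways, one per line:\n{query}"
        }]
    )
    variants = [query] + variants_response.choices[0].message.content.splitlines()
    # Retrieve for each variant, deduplicating results by chunk id
    merged = {}
    for variant in variants:
        if not variant.strip():
            continue
        embedding = client.embeddings.create(
            model="text-embedding-3-small",
            input=variant
        ).data[0].embedding
        for r in vector_db.query(vector=embedding, top_k=top_k, include_metadata=True):
            merged[r.id] = r
    return list(merged.values())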
5. Hallucination Reduction
What: RAG reduces but doesn't eliminate hallucination. The model can still generate unsupported claims.
Strategies:
| Strategy | How |
|---|---|
| Grounding instructions | "Only answer based on the provided context" |
| Citation requirement | "Cite the source for each claim" |
| Confidence scoring | Ask model to rate its confidence |
| Chunk relevance filtering | Only include chunks above similarity threshold |
| Answer validation | Second LLM call to verify claims against sources (sketch below) |
| Abstention | Instruct (or fine-tune) the model to say "I don't know" when context is insufficient |
system_prompt = """Answer the question based ONLY on the provided context.
Rules:
1. If the context doesn't contain enough information, say "I don't have enough
information to answer this question."
2. Do not use any prior knowledge — only the provided context.
3. Cite sources using [Source: filename] after each claim.
4. If you're unsure about a claim, preface it with "Based on the context..."
"""Common failure modes:
Common failure modes:
1. Retrieved chunks are irrelevant → Model ignores them and uses training data
Fix: Set similarity threshold, re-rank results (see the sketch after this list)
2. Chunks contain conflicting information → Model picks one arbitrarily
Fix: Instruct to acknowledge conflicts
3. Partial information → Model "fills in the gaps" with hallucinated details
Fix: Instruct to only state what's explicitly in context
4. Model merges information from different chunks incorrectly
Fix: Ask model to reason about each source separately
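For failure mode 1, a minimal sketch of chunk relevance filtering, reusing results from section 4; the 0.75 cutoff and the r.score attribute are assumptions to tune per embedding model and similarity metric:
# Keep only chunks whose similarity clears a threshold; abstain if none do
SIMILARITY_THRESHOLD = 0.75  # illustrative; depends on embedding model and metric
relevant = [r for r in results if r.score >= SIMILARITY_THRESHOLD]
if not relevant:
    answer = "I don't have enough information to answer this question."
else:
    context = "\n\n---\n\n".join(
        f"Source: {r.metadata['source']}\n{r.metadata['text']}" for r in relevant
    )
    # ...then generate as in section 4, using only the filtered context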
6. RAG Evaluation
Metrics for evaluating RAG systems:
| Metric | What it measures |
|---|---|
| Retrieval precision | % of retrieved chunks that are relevant |
| Retrieval recall | % of relevant chunks that were retrieved |
| Answer faithfulness | Is the answer supported by retrieved context? |
| Answer relevance | Does the answer address the question? |
| Context relevance | Are retrieved contexts relevant to the question? |
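Retrieval precision and recall can be computed directly once each evaluation question is labeled with its relevant chunk IDs; a minimal sketch (the ID-based labeling is an assumption):
# Retrieval precision/recall for a single query, given labeled relevant chunk ids
def retrieval_precision_recall(retrieved_ids, relevant_ids):
    retrieved = set(retrieved_ids)
    hits = retrieved & set(relevant_ids)
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall
# Example: 3 of 5 retrieved chunks are relevant; 3 of the 4 relevant chunks were found
p, r = retrieval_precision_recall(["c1", "c2", "c3", "c4", "c5"], {"c1", "c2", "c3", "c9"})
# p == 0.6, r == 0.75
Answer faithfulness and answer/context relevance usually need an LLM-as-judge, which is what the frameworks below automate.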
Frameworks: RAGAS, TruLens, LangSmith for automated RAG evaluation.