05 - RAG Fundamentals
1. What is RAG?
What: Retrieval-Augmented Generation (RAG) is a pattern that enhances LLM responses by retrieving relevant external documents and including them in the prompt context. The model generates answers grounded in retrieved evidence rather than relying solely on its training data.
┌──────────┐     ┌─────────────────┐     ┌──────────────┐
│   User   │     │   Vector DB /   │     │     LLM      │
│  Query   │────→│    Retriever    │────→│  Generation  │
│          │     │                 │     │              │
│"How do I │     │  Returns top-k  │     │  Generates   │
│ deploy on│     │  relevant docs  │     │ answer using │
│ Vercel?" │     │                 │     │  retrieved   │
│          │     │                 │     │   context    │
└──────────┘     └─────────────────┘     └──────────────┘
The full RAG pipeline:
1. INDEXING (offline, one-time)
Documents → Chunk → Embed → Store in Vector DB (see the sketch after this list)
2. RETRIEVAL (per query)
Query → Embed → Search Vector DB → Top-k chunks
3. GENERATION (per query)
System prompt + Retrieved chunks + User query → LLM → Answer
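A minimal sketch of the indexing stage, assuming the OpenAI embeddings API; the sample chunks are illustrative and a plain Python list stands in for a real vector database:
from openai import OpenAI
client = OpenAI()
# Pre-chunked documents (chunking strategies are covered in section 3);
# the sources and text below are illustrative
chunks = [
    {"source": "deploy.md", "text": "To deploy on Vercel, push to the main branch ..."},
    {"source": "auth.md", "text": "Authentication in Next.js can be handled with middleware ..."},
]
# Embed each chunk and store it; a plain list stands in for a vector DB
index = []
for chunk in chunks:
    embedding = client.embeddings.create(
        model="text-embedding-3-small",
        input=chunk["text"]
    ).data[0].embedding
    index.append({"embedding": embedding, **chunk})
In production you would batch the embedding calls and upsert the records into a vector database (Pinecone, pgvector, Chroma, etc.).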
2. When to Use RAG vs Fine-tuning
| Scenario | Use RAG | Use Fine-tuning |
|---|---|---|
| Knowledge updates frequently | Yes | No (need to retrain) |
| Need source attribution | Yes (can cite chunks) | No |
| Domain-specific behavior/style | No | Yes |
| Proprietary data | Yes | Yes (but data exposure risk) |
| Low latency requirement | Adds retrieval latency | No extra latency |
| Budget constraint | Cheaper (no training) | Training cost |
| Factual accuracy critical | Yes (grounded) | Risky (hallucination) |
Combine both: Fine-tune for style/format + RAG for knowledge = best of both worlds.
3. Chunking Strategies
What: Documents must be split into chunks before embedding. Chunk size and strategy dramatically affect retrieval quality.
Fixed-size chunking:
# Simple: split every N tokens/characters with overlap
def fixed_size_chunks(text, chunk_size=512, overlap=50):
    chunks = []
    for i in range(0, len(text), chunk_size - overlap):
        chunks.append(text[i:i + chunk_size])
    return chunks
# Typical: chunk_size=512 tokens, overlap=50 tokens
Semantic chunking strategies:
| Strategy | How | Best For |
|---|---|---|
| Fixed-size | Split every N tokens | Simple, general purpose |
| Sentence-based | Split on sentence boundaries | Preserving meaning |
| Paragraph-based | Split on paragraph breaks | Well-structured documents |
| Recursive | Try largest separator first, then smaller | Code, markdown |
| Semantic | Split when embedding similarity drops | Dense technical content |
| Document-aware | Use headers, sections, pages | PDFs, documentation |
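A minimal sketch of the recursive strategy: try the coarsest separator first, pack pieces up to a size limit, and recurse with finer separators on oversized pieces (the separator order and character-based limit are illustrative assumptions):
def recursive_split(text, max_len=512, separators=("\n\n", "\n", ". ", " ")):
    # Text already fits: keep it as a single chunk
    if len(text) <= max_len:
        return [text]
    for sep in separators:
        if sep not in text:
            continue  # try the next (finer) separator
        chunks, current = [], ""
        for piece in text.split(sep):
            candidate = current + sep + piece if current else piece
            if len(candidate) <= max_len:
                current = candidate            # keep packing pieces into this chunk
            else:
                if current:
                    chunks.append(current)     # flush the packed chunk
                if len(piece) <= max_len:
                    current = piece
                else:                          # piece itself too big: recurse with finer separators
                    chunks.extend(recursive_split(piece, max_len, separators))
                    current = ""
        if current:
            chunks.append(current)
        return chunks
    # No separator found at all: hard-split by characters
    return [text[i:i + max_len] for i in range(0, len(text), max_len)]
For markdown or code, the separator list can start with headers or function boundaries so chunks follow the document's own structure.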
Chunk size trade-offs:
Small chunks (100-200 tokens):
+ More precise retrieval
+ Less noise in results
- May lose context
- More chunks to search
Large chunks (500-1000 tokens):
+ More context preserved
+ Fewer chunks to manage
- May include irrelevant content
- Lower retrieval precision
Sweet spot: 256-512 tokens with 10-20% overlap
4. Retrieval + Generation Flow
Detailed flow with code:
from openai import OpenAI
client = OpenAI()
# Step 1: Embed the query
query = "How do I handle authentication in Next.js?"
query_embedding = client.embeddings.create(
    model="text-embedding-3-small",
    input=query
).data[0].embedding
# Step 2: Search the vector database (vector_db is a placeholder for your vector store client)
results = vector_db.query(
    vector=query_embedding,
    top_k=5,
    include_metadata=True
)
# Step 3: Build context from retrieved chunks
context = "\n\n---\n\n".join([
    f"Source: {r.metadata['source']}\n{r.metadata['text']}"
    for r in results
])
# Step 4: Generate answer
response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": f"""Answer based on the provided context.
If the context doesn't contain the answer, say so.
Context:
{context}"""},
        {"role": "user", "content": query}
    ]
)
answer = response.choices[0].message.content
Advanced retrieval patterns:
Basic RAG:   Query → Retrieve → Generate
Multi-query: Query → LLM generates 3 query variants
             → Retrieve for each → Merge results → Generate
HyDE:        Query → LLM generates hypothetical answer
             → Embed hypothetical answer → Retrieve → Generate
Iterative:   Query → Retrieve → Generate partial answer
             → Generate follow-up query → Retrieve more → Final answer
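As one example, a sketch of the multi-query pattern, reusing client and the vector_db placeholder from the code above; the prompt wording, line-based parsing, and the r.id attribute are assumptions:
# Multi-query retrieval: paraphrase the query, retrieve for each variant, merge results
def multi_query_retrieve(query, top_k=5):
    # Ask the LLM for paraphrased variants of the original query
    variants_response = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": f"Rewrite this question in 3 different ways, one per line:\n{query}"
        }]
    )
    variants = [query] + variants_response.choices[0].message.content.splitlines()
    # Retrieve for each variant, deduplicating results by chunk id
    merged = {}
    for variant in variants:
        if not variant.strip():
            continue
        embedding = client.embeddings.create(
            model="text-embedding-3-small",
            input=variant
        ).data[0].embedding
        for r in vector_db.query(vector=embedding, top_k=top_k, include_metadata=True):
            merged[r.id] = r
    return list(merged.values())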
5. Hallucination Reduction
What: RAG reduces but doesn't eliminate hallucination. The model can still generate unsupported claims.
Strategies:
| Strategy | How |
|---|---|
| Grounding instructions | "Only answer based on the provided context" |
| Citation requirement | "Cite the source for each claim" |
| Confidence scoring | Ask model to rate its confidence |
| Chunk relevance filtering | Only include chunks above similarity threshold |
| Answer validation | Second LLM call to verify claims against sources (sketch below) |
| Abstention | Instruct (or fine-tune) the model to say "I don't know" when context is insufficient |
system_prompt = """Answer the question based ONLY on the provided context.
Rules:
1. If the context doesn't contain enough information, say "I don't have enough
information to answer this question."
2. Do not use any prior knowledge — only the provided context.
3. Cite sources using [Source: filename] after each claim.
4. If you're unsure about a claim, preface it with "Based on the context..."
"""Common failure modes:
Common failure modes:
1. Retrieved chunks are irrelevant → Model ignores them and uses training data
Fix: Set similarity threshold, re-rank results (see the sketch after this list)
2. Chunks contain conflicting information → Model picks one arbitrarily
Fix: Instruct to acknowledge conflicts
3. Partial information → Model "fills in the gaps" with hallucinated details
Fix: Instruct to only state what's explicitly in context
4. Model merges information from different chunks incorrectly
Fix: Ask model to reason about each source separately
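For failure mode 1, a minimal sketch of chunk relevance filtering, reusing results from section 4; the 0.75 cutoff and the r.score attribute are assumptions to tune per embedding model and similarity metric:
# Keep only chunks whose similarity clears a threshold; abstain if none do
SIMILARITY_THRESHOLD = 0.75  # illustrative; depends on embedding model and metric
relevant = [r for r in results if r.score >= SIMILARITY_THRESHOLD]
if not relevant:
    answer = "I don't have enough information to answer this question."
else:
    context = "\n\n---\n\n".join(
        f"Source: {r.metadata['source']}\n{r.metadata['text']}" for r in relevant
    )
    # ...then generate as in section 4, using only the filtered context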
6. RAG Evaluation
Metrics for evaluating RAG systems:
| Metric | What it measures |
|---|---|
| Retrieval precision | % of retrieved chunks that are relevant |
| Retrieval recall | % of relevant chunks that were retrieved |
| Answer faithfulness | Is the answer supported by retrieved context? |
| Answer relevance | Does the answer address the question? |
| Context relevance | Are retrieved contexts relevant to the question? |
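Retrieval precision and recall can be computed directly once each evaluation question is labeled with its relevant chunk IDs; a minimal sketch (the ID-based labeling is an assumption):
# Retrieval precision/recall for a single query, given labeled relevant chunk ids
def retrieval_precision_recall(retrieved_ids, relevant_ids):
    retrieved = set(retrieved_ids)
    hits = retrieved & set(relevant_ids)
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall
# Example: 3 of 5 retrieved chunks are relevant; 3 of the 4 relevant chunks were found
p, r = retrieval_precision_recall(["c1", "c2", "c3", "c4", "c5"], {"c1", "c2", "c3", "c9"})
# p == 0.6, r == 0.75
Answer faithfulness and answer/context relevance usually need an LLM-as-judge, which is what the frameworks below automate.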
Frameworks: RAGAS, TruLens, LangSmith for automated RAG evaluation.