05 - RAG Fundamentals


1. What is RAG?

What: Retrieval-Augmented Generation (RAG) is a pattern that enhances LLM responses by retrieving relevant external documents and including them in the prompt context. The model generates answers grounded in retrieved evidence rather than relying solely on its training data.

┌───────────┐     ┌─────────────────┐     ┌──────────────┐
│  User     │     │  Vector DB /    │     │    LLM       │
│  Query    │────→│  Retriever      │────→│  Generation  │
│           │     │                 │     │              │
│"How do I  │     │ Returns top-k   │     │ Generates    │
│ deploy on │     │ relevant docs   │     │ answer using │
│ Vercel?"  │     │                 │     │ retrieved    │
│           │     │                 │     │ context      │
└───────────┘     └─────────────────┘     └──────────────┘

The full RAG pipeline:

1. INDEXING (offline, one-time)
   Documents → Chunk → Embed → Store in Vector DB

2. RETRIEVAL (per query)
   Query → Embed → Search Vector DB → Top-k chunks

3. GENERATION (per query)
   System prompt + Retrieved chunks + User query → LLM → Answer
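
A minimal sketch of the indexing step, assuming the OpenAI embeddings API plus a generic vector store; documents, chunk_text, and vector_db (with an upsert method) are placeholders for illustration, not a specific library's interface:

python
# Indexing sketch (offline): chunk -> embed -> store
# documents, chunk_text, and vector_db are placeholders, not a specific library's API
from openai import OpenAI

client = OpenAI()

def embed_text(text: str) -> list[float]:
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return resp.data[0].embedding

for doc_id, doc in enumerate(documents):                 # documents: list of raw strings
    for chunk_id, chunk in enumerate(chunk_text(doc)):   # chunk_text: any splitter from section 3
        vector_db.upsert(
            id=f"{doc_id}-{chunk_id}",
            vector=embed_text(chunk),
            metadata={"source": f"doc-{doc_id}", "text": chunk},
        )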

2. When to Use RAG vs Fine-tuning

| Scenario | Use RAG | Use Fine-tuning |
|---|---|---|
| Knowledge updates frequently | Yes | No (need to retrain) |
| Need source attribution | Yes (can cite chunks) | No |
| Domain-specific behavior/style | No | Yes |
| Proprietary data | Yes | Yes (but data exposure risk) |
| Low latency requirement | Adds retrieval latency | No extra latency |
| Budget constraint | Cheaper (no training) | Training cost |
| Factual accuracy critical | Yes (grounded) | Risky (hallucination) |

Combine both: Fine-tune for style/format + RAG for knowledge = best of both worlds.


3. Chunking Strategies

What: Documents must be split into chunks before embedding. Chunk size and strategy dramatically affect retrieval quality.

Fixed-size chunking:

python
# Simple: split every N tokens/characters with a fixed overlap between chunks
# (text is the document string to split)
chunk_size = 512   # typical: 512 tokens; measured in characters here for simplicity
overlap = 50       # typical: 50 tokens of overlap
chunks = []
for i in range(0, len(text), chunk_size - overlap):
    chunks.append(text[i:i + chunk_size])

Chunking strategy comparison:

| Strategy | How | Best for |
|---|---|---|
| Fixed-size | Split every N tokens | Simple, general purpose |
| Sentence-based | Split on sentence boundaries | Preserving meaning |
| Paragraph-based | Split on paragraph breaks | Well-structured documents |
| Recursive | Try largest separator first, then smaller | Code, markdown |
| Semantic | Split when embedding similarity drops | Dense technical content |
| Document-aware | Use headers, sections, pages | PDFs, documentation |
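
A hand-rolled sketch of the recursive strategy, using character lengths rather than token counts (illustrative only, not any specific library's implementation):

python
# Recursive splitter sketch: try coarse separators first, fall back to finer ones
def recursive_split(text, max_len=512, separators=("\n\n", "\n", ". ", " ")):
    if len(text) <= max_len:
        return [text]
    for sep in separators:
        parts = text.split(sep)
        if len(parts) > 1:
            chunks, current = [], ""
            for part in parts:
                candidate = f"{current}{sep}{part}" if current else part
                if len(candidate) <= max_len:
                    current = candidate
                else:
                    if current:
                        chunks.append(current)
                    current = part
            if current:
                chunks.append(current)
            # Re-split any piece that is still too long using finer separators
            return [c for chunk in chunks for c in recursive_split(chunk, max_len, separators)]
    # No separator found: hard character split as a last resort
    return [text[i:i + max_len] for i in range(0, len(text), max_len)]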

Chunk size trade-offs:

Small chunks (100-200 tokens):
  + More precise retrieval
  + Less noise in results
  - May lose context
  - More chunks to search

Large chunks (500-1000 tokens):
  + More context preserved
  + Fewer chunks to manage
  - May include irrelevant content
  - Lower retrieval precision

Sweet spot: 256-512 tokens with 10-20% overlap
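
Since the sizes above are stated in tokens, a token-aware variant of the fixed-size splitter can be sketched with the tiktoken tokenizer (assuming that package is available):

python
# Token-based fixed-size chunking sketch (assumes tiktoken is installed)
import tiktoken

def chunk_by_tokens(text, chunk_size=384, overlap=50):
    enc = tiktoken.get_encoding("cl100k_base")   # encoding used by recent OpenAI models
    tokens = enc.encode(text)
    step = chunk_size - overlap
    return [enc.decode(tokens[i:i + chunk_size]) for i in range(0, len(tokens), step)]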

4. Retrieval + Generation Flow

Detailed flow with code:

python
from openai import OpenAI

client = OpenAI()

# Step 1: Embed the query
query = "How do I handle authentication in Next.js?"
query_embedding = client.embeddings.create(
    model="text-embedding-3-small",
    input=query
).data[0].embedding

# Step 2: Search the vector database
# (vector_db stands in for any initialized vector store client; the interface is illustrative)
results = vector_db.query(
    vector=query_embedding,
    top_k=5,
    include_metadata=True
)

# Step 3: Build context from retrieved chunks
context = "\n\n---\n\n".join([
    f"Source: {r.metadata['source']}\n{r.metadata['text']}"
    for r in results
])

# Step 4: Generate answer
response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": f"""Answer based on the provided context.
If the context doesn't contain the answer, say so.

Context:
{context}"""},
        {"role": "user", "content": query}
    ]
)

Advanced retrieval patterns:

Basic RAG:     Query → Retrieve → Generate

Multi-query:   Query → LLM generates 3 query variants
               → Retrieve for each → Merge results → Generate

HyDE:          Query → LLM generates hypothetical answer
               → Embed hypothetical answer → Retrieve → Generate

Iterative:     Query → Retrieve → Generate partial answer
               → Generate follow-up query → Retrieve more → Final answer
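
A sketch of the multi-query pattern, reusing the client and the vector_db placeholder from the example above (the prompt wording and result interface are illustrative assumptions):

python
# Multi-query retrieval sketch: paraphrase the query, retrieve for each variant, merge
def multi_query_retrieve(query, n_variants=3, top_k=5):
    # 1. Ask the LLM for paraphrased variants of the query
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": f"Rewrite this search query {n_variants} different ways, "
                       f"one per line, with no numbering:\n{query}",
        }],
    )
    variants = [query] + [
        v.strip() for v in resp.choices[0].message.content.splitlines() if v.strip()
    ]

    # 2. Retrieve for each variant, deduplicating chunks by their text
    merged = {}
    for variant in variants:
        emb = client.embeddings.create(
            model="text-embedding-3-small", input=variant
        ).data[0].embedding
        for r in vector_db.query(vector=emb, top_k=top_k, include_metadata=True):
            merged[r.metadata["text"]] = r

    # 3. Feed the merged chunks into the generation step as before
    return list(merged.values())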

5. Hallucination Reduction

What: RAG reduces but doesn't eliminate hallucination. The model can still generate unsupported claims.

Strategies:

| Strategy | How |
|---|---|
| Grounding instructions | "Only answer based on the provided context" |
| Citation requirement | "Cite the source for each claim" |
| Confidence scoring | Ask model to rate its confidence |
| Chunk relevance filtering | Only include chunks above a similarity threshold |
| Answer validation | Second LLM call to verify claims against sources |
| Abstention | Train model to say "I don't know" when context is insufficient |

Example grounding prompt:

python
system_prompt = """Answer the question based ONLY on the provided context.
Rules:
1. If the context doesn't contain enough information, say "I don't have enough
   information to answer this question."
2. Do not use any prior knowledge — only the provided context.
3. Cite sources using [Source: filename] after each claim.
4. If you're unsure about a claim, preface it with "Based on the context..."
"""

Common failure modes:

1. Retrieved chunks are irrelevant → Model ignores them and uses training data
   Fix: Set similarity threshold, re-rank results

2. Chunks contain conflicting information → Model picks one arbitrarily
   Fix: Instruct to acknowledge conflicts

3. Partial information → Model "fills in the gaps" with hallucinated details
   Fix: Instruct to only state what's explicitly in context

4. Model merges information from different chunks incorrectly
   Fix: Ask model to reason about each source separately
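
A sketch of the answer validation strategy: a second LLM call checks the draft answer against the retrieved context before it is returned (the prompt wording is an assumption):

python
# Answer-validation sketch: verify the draft answer against the retrieved context
def verify_answer(draft, context):
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": f"""Context:
{context}

Draft answer:
{draft}

Is every claim in the draft answer directly supported by the context?
Reply with SUPPORTED or UNSUPPORTED, then a one-line explanation.""",
        }],
    )
    return resp.choices[0].message.content

verdict = verify_answer(response.choices[0].message.content, context)
if verdict.startswith("UNSUPPORTED"):
    # Fall back: re-retrieve with a refined query, or abstain
    print("Answer failed verification:", verdict)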

6. RAG Evaluation

Metrics for evaluating RAG systems:

| Metric | What it measures |
|---|---|
| Retrieval precision | % of retrieved chunks that are relevant |
| Retrieval recall | % of relevant chunks that were retrieved |
| Answer faithfulness | Is the answer supported by the retrieved context? |
| Answer relevance | Does the answer address the question? |
| Context relevance | Are the retrieved contexts relevant to the question? |

Frameworks: RAGAS, TruLens, and LangSmith provide automated RAG evaluation.
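
Retrieval precision and recall can be computed directly once you have labeled relevant chunk ids per query; a minimal sketch (the ids below are made up for illustration):

python
# Retrieval precision/recall sketch (assumes labeled relevant chunk ids per query)
def retrieval_metrics(retrieved_ids, relevant_ids):
    hits = set(retrieved_ids) & set(relevant_ids)
    precision = len(hits) / len(retrieved_ids) if retrieved_ids else 0.0
    recall = len(hits) / len(relevant_ids) if relevant_ids else 0.0
    return {"precision": precision, "recall": recall}

# 3 of the 5 retrieved chunks are relevant; 3 of the 4 relevant chunks were found
print(retrieval_metrics({"c1", "c2", "c3", "c4", "c5"}, {"c1", "c2", "c3", "c9"}))
# -> {'precision': 0.6, 'recall': 0.75}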
