Module 8: RAG — Retrieval-Augmented Generation
Giving your agent access to external knowledge through RAG.
The Knowledge Problem
LLMs have a knowledge cutoff — they only know what was in their training data. They cannot answer questions about events after their cutoff date, your company's internal documents, or any private data they were never trained on. They also hallucinate confidently when they do not know something.
Retrieval-Augmented Generation (RAG) solves this by retrieving relevant documents at query time and injecting them into the LLM's context. Instead of relying on memorised knowledge, the model reads the actual source material and generates an answer grounded in that material.
RAG is the most widely deployed pattern in production AI systems because it is simpler and cheaper than fine-tuning, can be updated instantly (just add or remove documents), and provides citations back to source material.
An LLM without RAG is like a doctor taking an exam from memory alone. RAG gives them access to the latest medical journals during the exam — they still need expertise to interpret the material, but their answers are grounded in current evidence rather than potentially outdated knowledge.
Up-to-Date Answers
Query live data sources instead of relying on training-data knowledge that may be months or years old.
Domain Specificity
Ground answers in your company's internal docs, policies, and knowledge base without fine-tuning.
Reduced Hallucination
When the model answers from provided context, it is far less likely to fabricate facts.
Verifiable Sources
Every answer can cite the specific document chunks it drew from, enabling user verification.
RAG Architecture Overview
A RAG system has two phases: an indexing phase (offline, run once or periodically) and a query phase (online, run for every user question). Understanding both phases is essential to building an effective pipeline.
Indexing Phase (Offline)
- Load — Read documents from files, databases, APIs, or web pages
- Chunk — Split documents into smaller, meaningful pieces
- Embed — Convert each chunk into a vector (a list of numbers) using an embedding model
- Store — Save the vectors and original text in a vector database
Query Phase (Online)
- Embed Query — Convert the user's question into a vector using the same embedding model
- Retrieve — Find the most similar document chunks using vector similarity search
- Augment — Insert retrieved chunks into the LLM prompt as context
- Generate — The LLM reads the context and produces a grounded answer
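The Retrieve step works by comparing vectors. As a minimal sketch of what "vector similarity" means (pure Python, toy 3-dimensional vectors — real embeddings have hundreds or thousands of dimensions):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity: 1.0 = same direction, 0.0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" — illustrative values only
query = [0.9, 0.1, 0.0]
chunk_about_refunds = [0.8, 0.2, 0.1]
chunk_about_hiring = [0.0, 0.1, 0.9]

print(cosine_similarity(query, chunk_about_refunds))  # high, ~0.98
print(cosine_similarity(query, chunk_about_hiring))   # low, ~0.01
```

Chunks whose vectors point in nearly the same direction as the query vector are semantically related to it; retrieval simply returns the closest ones.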
RAG does not fine-tune or modify the model. It augments the prompt with retrieved context at query time. This means you can update your knowledge base instantly without retraining anything.
Document Chunking and Embedding
Chunking is arguably the most important step in a RAG pipeline. If your chunks are too large, they waste context tokens and dilute relevance. If they are too small, they lose context and the model cannot synthesise a coherent answer.
The goal is to create chunks that are self-contained (make sense on their own), focused (about one topic), and appropriately sized (typically 200-1000 tokens).
| Strategy | How It Works | Best For | Drawback |
|---|---|---|---|
| Fixed-size | Split every N characters/tokens | Simple implementation | May split mid-sentence or mid-paragraph |
| Sentence-based | Split on sentence boundaries | Natural text, articles | Variable chunk sizes, some sentences too short |
| Recursive character | Try paragraph, then sentence, then word boundaries | General-purpose, preserves structure | More complex to implement |
| Semantic | Use embeddings to detect topic shifts | Long documents with multiple topics | Expensive, requires embedding each sentence |
| Document-aware | Use headings, sections, or markup structure | Markdown, HTML, structured docs | Requires document-specific parsing |
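As an illustration of the document-aware strategy, here is a hedged sketch that splits Markdown text at top-level and second-level headings; the function name and regex are my own, not from any library:

```python
import re

def chunk_markdown_by_heading(text: str) -> list[str]:
    """Split a Markdown document into one chunk per heading section.

    Document-aware chunking: each chunk keeps its heading,
    so it stays self-contained and focused on one topic.
    """
    # Split immediately before lines starting with '# ' or '## '
    sections = re.split(r"(?m)^(?=#{1,2} )", text)
    return [s.strip() for s in sections if s.strip()]

doc = """# Refunds
Refunds are issued within 14 days.

## Exceptions
Digital goods are non-refundable.

# Shipping
Orders ship within 2 business days.
"""
for chunk in chunk_markdown_by_heading(doc):
    print(repr(chunk.splitlines()[0]))  # first line of each chunk
```

A production version would also cap section length (falling back to recursive splitting for very long sections) and carry the heading path into chunk metadata.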
Chunking with Overlap
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping chunks by character count."""
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunk = text[start:end]
        # Try to end at a sentence boundary, but only if that keeps
        # the chunk at least half the target size
        last_period = chunk.rfind(". ")
        if last_period > chunk_size * 0.5:
            end = start + last_period + 1
            chunk = text[start:end]
        if chunk.strip():
            chunks.append(chunk.strip())
        if end >= len(text):
            break  # avoid re-emitting the tail as an extra overlapping chunk
        start = end - overlap  # overlap for context continuity
    return chunks
# Example
with open("company_handbook.txt", encoding="utf-8") as f:
    text = f.read()
chunks = chunk_text(text, chunk_size=800, overlap=100)
print(f"Created {len(chunks)} chunks from {len(text)} characters")
Overlap is critical. An overlap of 50-100 tokens (the example above measures in characters, but the principle is the same) ensures that information at chunk boundaries is not lost. Without overlap, a sentence that spans two chunks will be incomplete in both.
Generating Embeddings
# Using OpenAI embeddings (most popular choice)
import openai
client = openai.OpenAI()
def get_embeddings(texts: list[str]) -> list[list[float]]:
"""Generate embeddings for a batch of texts."""
response = client.embeddings.create(
model="text-embedding-3-small", # 1536 dimensions, cheap
input=texts
)
return [item.embedding for item in response.data]
# Embed a single query
query_embedding = get_embeddings(["What is the refund policy?"])[0]
# Embed all chunks
chunk_embeddings = get_embeddings(chunks)
Always use the same embedding model for both indexing and querying. Mixing models (e.g., indexing with OpenAI and querying with Cohere) will produce meaningless similarity scores because the vector spaces are different.
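Once query and chunks live in the same vector space, retrieval is conceptually just "compare the query vector against every chunk vector and keep the top k". A brute-force sketch (fine for small collections; vector databases exist to do this efficiently at scale):

```python
import math

def top_k(query_vec: list[float], chunk_vecs: list[list[float]], k: int = 3) -> list[int]:
    """Return indices of the k chunks most similar to the query (cosine)."""
    def cos(a: list[float], b: list[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) *
                      math.sqrt(sum(x * x for x in b)))

    # Score every chunk, then sort by similarity, highest first
    scores = [(cos(query_vec, v), i) for i, v in enumerate(chunk_vecs)]
    scores.sort(reverse=True)
    return [i for _, i in scores[:k]]

# Toy 2-D vectors: chunk 1 points exactly the same way as the query
vecs = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
print(top_k([0.6, 0.8], vecs, k=2))  # → [1, 2]
```

This linear scan is O(n) per query; the HNSW and IVF indexes mentioned below trade a little accuracy for sub-linear search time over millions of vectors.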
Vector Databases
Vector databases are purpose-built for storing and searching high-dimensional vectors. They use specialised indexing algorithms (like HNSW or IVF) that make similarity search fast even over millions of vectors.
ChromaDB
Open-source, runs locally, Python-native. Perfect for prototyping and small-to-medium datasets. Has built-in embedding support.
Pinecone
Fully managed cloud service. Scales to billions of vectors. Serverless pricing model. Best for production workloads.
pgvector
PostgreSQL extension. Use your existing Postgres infrastructure. Great when you already have a SQL database and want to add vector search.
Weaviate
Open-source with hybrid search (vector + keyword). Built-in support for multi-modal data. Good for complex search requirements.
Qdrant
Open-source, Rust-based, very fast. Rich filtering capabilities. Both cloud and self-hosted options.
FAISS
Meta's library for efficient similarity search. Not a database (no persistence built-in), but extremely fast for in-memory search.
Setting Up ChromaDB
import chromadb
# Persistent storage (survives restarts)
client = chromadb.PersistentClient(path="./chroma_db")
# Create a collection
collection = client.get_or_create_collection(
name="company_docs",
metadata={"hnsw:space": "cosine"} # cosine similarity
)
# Index documents
collection.add(
documents=chunks, # raw text
ids=[f"chunk_{i}" for i in range(len(chunks))], # unique IDs
metadatas=[{"source": "handbook.pdf", "page": i // 5}
for i in range(len(chunks))] # metadata for filtering
)
# Query
results = collection.query(
query_texts=["What is the vacation policy?"],
n_results=5,
where={"source": "handbook.pdf"} # optional metadata filter
)
# ChromaDB returns cosine *distances*: lower means more similar
for doc, dist in zip(results["documents"][0], results["distances"][0]):
    print(f"[{dist:.3f}] {doc[:100]}...")
Building a Complete RAG Pipeline
Let us build a production-quality RAG pipeline that indexes a collection of documents and answers questions from them. This example uses ChromaDB for storage and Anthropic's Claude for generation.
import anthropic
import chromadb
import os
class RAGPipeline:
"""Complete RAG pipeline: index, retrieve, generate."""
def __init__(self, collection_name: str = "knowledge_base"):
self.ai = anthropic.Anthropic()
self.db = chromadb.PersistentClient(path="./rag_db")
self.collection = self.db.get_or_create_collection(collection_name)
def index_document(self, text: str, source: str, chunk_size: int = 800):
"""Chunk and index a document."""
chunks = chunk_text(text, chunk_size=chunk_size, overlap=100)
self.collection.add(
documents=chunks,
ids=[f"{source}_{i}" for i in range(len(chunks))],
metadatas=[{"source": source, "chunk_index": i}
for i in range(len(chunks))]
)
print(f"Indexed {len(chunks)} chunks from '{source}'")
def index_directory(self, dir_path: str):
"""Index all .txt and .md files in a directory."""
for fname in os.listdir(dir_path):
if fname.endswith((".txt", ".md")):
                path = os.path.join(dir_path, fname)
                with open(path, encoding="utf-8") as f:
                    text = f.read()
                self.index_document(text, source=fname)
def retrieve(self, query: str, n_results: int = 5) -> list[dict]:
"""Retrieve relevant chunks for a query."""
if self.collection.count() == 0:
return []
results = self.collection.query(
query_texts=[query],
n_results=min(n_results, self.collection.count())
)
return [
{"text": doc, "source": meta["source"], "score": dist}
for doc, meta, dist in zip(
results["documents"][0],
results["metadatas"][0],
results["distances"][0]
)
]
def query(self, question: str, n_results: int = 5) -> dict:
"""Full RAG: retrieve context, then generate an answer."""
chunks = self.retrieve(question, n_results)
if not chunks:
return {"answer": "No documents indexed yet.", "sources": []}
# Build context string with source attribution
context_parts = []
for i, c in enumerate(chunks, 1):
context_parts.append(f"[Source {i}: {c['source']}]\n{c['text']}")
context = "\n\n---\n\n".join(context_parts)
system_prompt = f"""You are a helpful assistant that answers questions
based ONLY on the provided context. If the context does not contain
the answer, say "I don't have enough information to answer that."
Always cite which source(s) you used, e.g., [Source 1].
Context:
{context}"""
response = self.ai.messages.create(
model="claude-sonnet-4-20250514",
max_tokens=1024,
system=system_prompt,
messages=[{"role": "user", "content": question}]
)
return {
"answer": response.content[0].text,
"sources": [c["source"] for c in chunks],
"chunks_used": len(chunks)
}
# Usage
rag = RAGPipeline()
rag.index_directory("./company_docs")
result = rag.query("What is our remote work policy?")
print(result["answer"])
print("Sources:", result["sources"])
Always instruct the LLM to say "I don't know" when the context does not contain the answer. Without this instruction, the model will often hallucinate an answer that sounds plausible but is not grounded in your documents.
Retrieval quality is the bottleneck, not generation quality. If the wrong chunks are retrieved, even the best LLM will produce a bad answer. Spend most of your optimisation effort on chunking, embedding choice, and retrieval tuning rather than prompt engineering the generation step.
Evaluation and Quality
Measuring RAG quality requires evaluating both the retrieval step and the generation step separately. A common mistake is only evaluating the final answer without understanding whether failures come from bad retrieval or bad generation.
| Metric | What It Measures | How to Compute |
|---|---|---|
| Recall@k | Is the correct document in the top-k retrieved chunks? | Label ground-truth docs, check if retrieved |
| Precision@k | What fraction of retrieved chunks are actually relevant? | Human annotation or LLM-as-judge |
| Faithfulness | Is the answer grounded in the retrieved context (no hallucination)? | LLM-as-judge: "Is this claim supported by the context?" |
| Answer Relevance | Does the answer actually address the question asked? | LLM-as-judge or human evaluation |
| Latency | End-to-end response time | Measure retrieval + generation time |
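Recall@k from the table above can be computed directly once each test question has a labelled ground-truth document. A sketch (function name and data are illustrative):

```python
def recall_at_k(retrieved_ids: list[list[str]], relevant_ids: list[str], k: int = 5) -> float:
    """Fraction of queries whose ground-truth doc appears in the top-k results.

    retrieved_ids[i] is the ranked list of doc IDs returned for query i;
    relevant_ids[i] is the single ground-truth ID for that query.
    """
    hits = sum(
        1 for ranked, truth in zip(retrieved_ids, relevant_ids)
        if truth in ranked[:k]
    )
    return hits / len(relevant_ids)

# Three test queries: the first two hit within the top 3, the third misses
retrieved = [["d3", "d1", "d7"], ["d2", "d9", "d4"], ["d8", "d5", "d6"]]
truth = ["d1", "d4", "d2"]
print(recall_at_k(retrieved, truth, k=3))  # 2 of 3 queries hit
```

Tracking Recall@k separately from answer quality tells you whether a failure is a retrieval problem (fix chunking/embedding) or a generation problem (fix the prompt).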
Simple Evaluation Script
def evaluate_rag(pipeline: RAGPipeline, test_cases: list[dict]) -> dict:
"""Evaluate RAG pipeline on test cases.
Each test case: {"question": str, "expected_source": str, "expected_keywords": [str]}
"""
results = {"retrieval_hits": 0, "answer_relevant": 0, "total": len(test_cases)}
for tc in test_cases:
result = pipeline.query(tc["question"])
# Check retrieval: did we find the right source?
if tc["expected_source"] in result["sources"]:
results["retrieval_hits"] += 1
# Check answer: does it contain expected keywords?
answer_lower = result["answer"].lower()
if all(kw.lower() in answer_lower for kw in tc["expected_keywords"]):
results["answer_relevant"] += 1
results["retrieval_accuracy"] = results["retrieval_hits"] / results["total"]
results["answer_accuracy"] = results["answer_relevant"] / results["total"]
return results
# Example test cases
test_cases = [
{"question": "What is the vacation policy?",
"expected_source": "hr_handbook.txt",
"expected_keywords": ["days", "annual"]},
{"question": "How do I submit expenses?",
"expected_source": "finance_guide.txt",
"expected_keywords": ["receipt", "submit"]},
]
metrics = evaluate_rag(rag, test_cases)
print(f"Retrieval accuracy: {metrics['retrieval_accuracy']:.0%}")
print(f"Answer accuracy: {metrics['answer_accuracy']:.0%}")
Start with a simple pipeline, measure its performance, then optimise. Common improvements include: better chunking strategies, re-ranking retrieved results with a cross-encoder, query expansion (rewriting the query for better retrieval), and hybrid search combining vector similarity with keyword matching (BM25).
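One common way to implement the hybrid search mentioned above is Reciprocal Rank Fusion (RRF), which merges a vector-similarity ranking with a keyword (BM25) ranking without having to calibrate their raw scores against each other. A sketch (the constant k=60 is the value conventionally used in the RRF literature; the ID lists are illustrative):

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked ID lists into one using RRF.

    Each document scores 1 / (k + rank) in every list it appears in;
    documents ranked highly by multiple retrievers rise to the top.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["doc_a", "doc_c", "doc_b"]   # ranked by cosine similarity
keyword_hits = ["doc_b", "doc_a", "doc_d"]  # ranked by BM25
print(reciprocal_rank_fusion([vector_hits, keyword_hits]))
```

Because RRF only uses rank positions, it works even when the two retrievers produce scores on completely different scales.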