Module 8: RAG — Retrieval-Augmented Generation
Giving your agent access to external knowledge through RAG.
The Knowledge Problem
LLMs have a knowledge cutoff — they only know what was in their training data. They cannot answer questions about events after their cutoff date, your company's internal documents, or any private data they were never trained on. They also hallucinate confidently when they do not know something.
Retrieval-Augmented Generation (RAG) solves this by retrieving relevant documents at query time and injecting them into the LLM's context. Instead of relying on memorised knowledge, the model reads the actual source material and generates an answer grounded in that material.
RAG is the most widely deployed pattern in production AI systems because it is simpler and cheaper than fine-tuning, can be updated instantly (just add or remove documents), and provides citations back to source material.
An LLM without RAG is like a doctor taking an exam from memory alone. RAG gives them access to the latest medical journals during the exam — they still need expertise to interpret the material, but their answers are grounded in current evidence rather than potentially outdated knowledge.
Up-to-Date Answers
Query live data sources instead of relying on training-data knowledge that may be months or years old.
Domain Specificity
Ground answers in your company's internal docs, policies, and knowledge base without fine-tuning.
Reduced Hallucination
When the model answers from provided context, it is far less likely to fabricate facts.
Verifiable Sources
Every answer can cite the specific document chunks it drew from, enabling user verification.
RAG Architecture Overview
A RAG system has two phases: an indexing phase (offline, run once or periodically) and a query phase (online, run for every user question). Understanding both phases is essential to building an effective pipeline.
Indexing Phase (Offline)
- Load — Read documents from files, databases, APIs, or web pages
- Chunk — Split documents into smaller, meaningful pieces
- Embed — Convert each chunk into a vector (a list of numbers) using an embedding model
- Store — Save the vectors and original text in a vector database
Query Phase (Online)
- Embed Query — Convert the user's question into a vector using the same embedding model
- Retrieve — Find the most similar document chunks using vector similarity search
- Augment — Insert retrieved chunks into the LLM prompt as context
- Generate — The LLM reads the context and produces a grounded answer
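The Retrieve step works by comparing vectors. As a minimal sketch of what "vector similarity" means (pure Python, toy 3-dimensional vectors — real embeddings have hundreds or thousands of dimensions):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity: 1.0 = same direction, 0.0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" — illustrative values only
query = [0.9, 0.1, 0.0]
chunk_about_refunds = [0.8, 0.2, 0.1]
chunk_about_hiring = [0.0, 0.1, 0.9]

print(cosine_similarity(query, chunk_about_refunds))  # high, ~0.98
print(cosine_similarity(query, chunk_about_hiring))   # low, ~0.01
```

Chunks whose vectors point in nearly the same direction as the query vector are semantically related to it; retrieval simply returns the closest ones.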
RAG does not fine-tune or modify the model. It augments the prompt with retrieved context at query time. This means you can update your knowledge base instantly without retraining anything.
Document Chunking and Embedding
Chunking is arguably the most important step in a RAG pipeline. If your chunks are too large, they waste context tokens and dilute relevance. If they are too small, they lose context and the model cannot synthesise a coherent answer.
The goal is to create chunks that are self-contained (make sense on their own), focused (about one topic), and appropriately sized (typically 200-1000 tokens).
| Strategy | How It Works | Best For | Drawback |
|---|---|---|---|
| Fixed-size | Split every N characters/tokens | Simple implementation | May split mid-sentence or mid-paragraph |
| Sentence-based | Split on sentence boundaries | Natural text, articles | Variable chunk sizes, some sentences too short |
| Recursive character | Try paragraph, then sentence, then word boundaries | General-purpose, preserves structure | More complex to implement |
| Semantic | Use embeddings to detect topic shifts | Long documents with multiple topics | Expensive, requires embedding each sentence |
| Document-aware | Use headings, sections, or markup structure | Markdown, HTML, structured docs | Requires document-specific parsing |
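As an illustration of the document-aware strategy, here is a hedged sketch that splits Markdown text at top-level and second-level headings; the function name and regex are my own, not from any library:

```python
import re

def chunk_markdown_by_heading(text: str) -> list[str]:
    """Split a Markdown document into one chunk per heading section.

    Document-aware chunking: each chunk keeps its heading,
    so it stays self-contained and focused on one topic.
    """
    # Split immediately before lines starting with '# ' or '## '
    sections = re.split(r"(?m)^(?=#{1,2} )", text)
    return [s.strip() for s in sections if s.strip()]

doc = """# Refunds
Refunds are issued within 14 days.

## Exceptions
Digital goods are non-refundable.

# Shipping
Orders ship within 2 business days.
"""
for chunk in chunk_markdown_by_heading(doc):
    print(repr(chunk.splitlines()[0]))  # first line of each chunk
```

A production version would also cap section length (falling back to recursive splitting for very long sections) and carry the heading path into chunk metadata.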
Chunking with Overlap
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping chunks by character count."""
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunk = text[start:end]
        # Try to end at a sentence boundary, but only if that keeps
        # the chunk at least half the target size
        last_period = chunk.rfind(". ")
        if last_period > chunk_size * 0.5:
            end = start + last_period + 1
            chunk = text[start:end]
        if chunk.strip():
            chunks.append(chunk.strip())
        if end >= len(text):
            break  # avoid re-emitting the tail as an extra overlapping chunk
        start = end - overlap  # overlap for context continuity
    return chunks
# Example
with open("company_handbook.txt", encoding="utf-8") as f:
    text = f.read()
chunks = chunk_text(text, chunk_size=800, overlap=100)
print(f"Created {len(chunks)} chunks from {len(text)} characters")
Overlap is critical. An overlap of 50-100 tokens (the example above measures in characters, but the principle is the same) ensures that information at chunk boundaries is not lost. Without overlap, a sentence that spans two chunks will be incomplete in both.
Generating Embeddings
# Using OpenAI embeddings (most popular choice)
import openai
client = openai.OpenAI()
def get_embeddings(texts: list[str]) -> list[list[float]]:
"""Generate embeddings for a batch of texts."""
response = client.embeddings.create(
model="text-embedding-3-small", # 1536 dimensions, cheap
input=texts
)
return [item.embedding for item in response.data]
# Embed a single query
query_embedding = get_embeddings(["What is the refund policy?"])[0]
# Embed all chunks
chunk_embeddings = get_embeddings(chunks)
Always use the same embedding model for both indexing and querying. Mixing models (e.g., indexing with OpenAI and querying with Cohere) will produce meaningless similarity scores because the vector spaces are different.
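Once query and chunks live in the same vector space, retrieval is conceptually just "compare the query vector against every chunk vector and keep the top k". A brute-force sketch (fine for small collections; vector databases exist to do this efficiently at scale):

```python
import math

def top_k(query_vec: list[float], chunk_vecs: list[list[float]], k: int = 3) -> list[int]:
    """Return indices of the k chunks most similar to the query (cosine)."""
    def cos(a: list[float], b: list[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) *
                      math.sqrt(sum(x * x for x in b)))

    # Score every chunk, then sort by similarity, highest first
    scores = [(cos(query_vec, v), i) for i, v in enumerate(chunk_vecs)]
    scores.sort(reverse=True)
    return [i for _, i in scores[:k]]

# Toy 2-D vectors: chunk 1 points exactly the same way as the query
vecs = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
print(top_k([0.6, 0.8], vecs, k=2))  # → [1, 2]
```

This linear scan is O(n) per query; the HNSW and IVF indexes mentioned below trade a little accuracy for sub-linear search time over millions of vectors.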
Vector Databases
Vector databases are purpose-built for storing and searching high-dimensional vectors. They use specialised indexing algorithms (like HNSW or IVF) that make similarity search fast even over millions of vectors.
ChromaDB
Open-source, runs locally, Python-native. Perfect for prototyping and small-to-medium datasets. Has built-in embedding support.
Pinecone
Fully managed cloud service. Scales to billions of vectors. Serverless pricing model. Best for production workloads.
pgvector
PostgreSQL extension. Use your existing Postgres infrastructure. Great when you already have a SQL database and want to add vector search.
Weaviate
Open-source with hybrid search (vector + keyword). Built-in support for multi-modal data. Good for complex search requirements.
Qdrant
Open-source, Rust-based, very fast. Rich filtering capabilities. Both cloud and self-hosted options.
FAISS
Meta's library for efficient similarity search. Not a database (no persistence built-in), but extremely fast for in-memory search.
Setting Up ChromaDB
import chromadb
# Persistent storage (survives restarts)
client = chromadb.PersistentClient(path="./chroma_db")
# Create a collection
collection = client.get_or_create_collection(
name="company_docs",
metadata={"hnsw:space": "cosine"} # cosine similarity
)
# Index documents
collection.add(
documents=chunks, # raw text
ids=[f"chunk_{i}" for i in range(len(chunks))], # unique IDs
metadatas=[{"source": "handbook.pdf", "page": i // 5}
for i in range(len(chunks))] # metadata for filtering
)
# Query
results = collection.query(
query_texts=["What is the vacation policy?"],
n_results=5,
where={"source": "handbook.pdf"} # optional metadata filter
)
# ChromaDB returns cosine *distances*: lower means more similar
for doc, dist in zip(results["documents"][0], results["distances"][0]):
    print(f"[{dist:.3f}] {doc[:100]}...")
Building a Complete RAG Pipeline
Let us build a production-quality RAG pipeline that indexes a collection of documents and answers questions from them. This example uses ChromaDB for storage and Anthropic's Claude for generation.
import anthropic
import chromadb
import os
class RAGPipeline:
"""Complete RAG pipeline: index, retrieve, generate."""
def __init__(self, collection_name: str = "knowledge_base"):
self.ai = anthropic.Anthropic()
self.db = chromadb.PersistentClient(path="./rag_db")
self.collection = self.db.get_or_create_collection(collection_name)
def index_document(self, text: str, source: str, chunk_size: int = 800):
"""Chunk and index a document."""
chunks = chunk_text(text, chunk_size=chunk_size, overlap=100)
self.collection.add(
documents=chunks,
ids=[f"{source}_{i}" for i in range(len(chunks))],
metadatas=[{"source": source, "chunk_index": i}
for i in range(len(chunks))]
)
print(f"Indexed {len(chunks)} chunks from '{source}'")
def index_directory(self, dir_path: str):
"""Index all .txt and .md files in a directory."""
for fname in os.listdir(dir_path):
if fname.endswith((".txt", ".md")):
                path = os.path.join(dir_path, fname)
                with open(path, encoding="utf-8") as f:
                    text = f.read()
                self.index_document(text, source=fname)
def retrieve(self, query: str, n_results: int = 5) -> list[dict]:
"""Retrieve relevant chunks for a query."""
if self.collection.count() == 0:
return []
results = self.collection.query(
query_texts=[query],
n_results=min(n_results, self.collection.count())
)
return [
{"text": doc, "source": meta["source"], "score": dist}
for doc, meta, dist in zip(
results["documents"][0],
results["metadatas"][0],
results["distances"][0]
)
]
def query(self, question: str, n_results: int = 5) -> dict:
"""Full RAG: retrieve context, then generate an answer."""
chunks = self.retrieve(question, n_results)
if not chunks:
return {"answer": "No documents indexed yet.", "sources": []}
# Build context string with source attribution
context_parts = []
for i, c in enumerate(chunks, 1):
context_parts.append(f"[Source {i}: {c['source']}]\n{c['text']}")
context = "\n\n---\n\n".join(context_parts)
system_prompt = f"""You are a helpful assistant that answers questions
based ONLY on the provided context. If the context does not contain
the answer, say "I don't have enough information to answer that."
Always cite which source(s) you used, e.g., [Source 1].
Context:
{context}"""
response = self.ai.messages.create(
model="claude-sonnet-4-20250514",
max_tokens=1024,
system=system_prompt,
messages=[{"role": "user", "content": question}]
)
return {
"answer": response.content[0].text,
"sources": [c["source"] for c in chunks],
"chunks_used": len(chunks)
}
# Usage
rag = RAGPipeline()
rag.index_directory("./company_docs")
result = rag.query("What is our remote work policy?")
print(result["answer"])
print("Sources:", result["sources"])
Always instruct the LLM to say "I don't know" when the context does not contain the answer. Without this instruction, the model will often hallucinate an answer that sounds plausible but is not grounded in your documents.
Retrieval quality is the bottleneck, not generation quality. If the wrong chunks are retrieved, even the best LLM will produce a bad answer. Spend most of your optimisation effort on chunking, embedding choice, and retrieval tuning rather than prompt engineering the generation step.
Evaluation and Quality
Measuring RAG quality requires evaluating both the retrieval step and the generation step separately. A common mistake is only evaluating the final answer without understanding whether failures come from bad retrieval or bad generation.
| Metric | What It Measures | How to Compute |
|---|---|---|
| Recall@k | Is the correct document in the top-k retrieved chunks? | Label ground-truth docs, check if retrieved |
| Precision@k | What fraction of retrieved chunks are actually relevant? | Human annotation or LLM-as-judge |
| Faithfulness | Is the answer grounded in the retrieved context (no hallucination)? | LLM-as-judge: "Is this claim supported by the context?" |
| Answer Relevance | Does the answer actually address the question asked? | LLM-as-judge or human evaluation |
| Latency | End-to-end response time | Measure retrieval + generation time |
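Recall@k from the table above can be computed directly once each test question has a labelled ground-truth document. A sketch (function name and data are illustrative):

```python
def recall_at_k(retrieved_ids: list[list[str]], relevant_ids: list[str], k: int = 5) -> float:
    """Fraction of queries whose ground-truth doc appears in the top-k results.

    retrieved_ids[i] is the ranked list of doc IDs returned for query i;
    relevant_ids[i] is the single ground-truth ID for that query.
    """
    hits = sum(
        1 for ranked, truth in zip(retrieved_ids, relevant_ids)
        if truth in ranked[:k]
    )
    return hits / len(relevant_ids)

# Three test queries: the first two hit within the top 3, the third misses
retrieved = [["d3", "d1", "d7"], ["d2", "d9", "d4"], ["d8", "d5", "d6"]]
truth = ["d1", "d4", "d2"]
print(recall_at_k(retrieved, truth, k=3))  # 2 of 3 queries hit
```

Tracking Recall@k separately from answer quality tells you whether a failure is a retrieval problem (fix chunking/embedding) or a generation problem (fix the prompt).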
Simple Evaluation Script
def evaluate_rag(pipeline: RAGPipeline, test_cases: list[dict]) -> dict:
"""Evaluate RAG pipeline on test cases.
Each test case: {"question": str, "expected_source": str, "expected_keywords": [str]}
"""
results = {"retrieval_hits": 0, "answer_relevant": 0, "total": len(test_cases)}
for tc in test_cases:
result = pipeline.query(tc["question"])
# Check retrieval: did we find the right source?
if tc["expected_source"] in result["sources"]:
results["retrieval_hits"] += 1
# Check answer: does it contain expected keywords?
answer_lower = result["answer"].lower()
if all(kw.lower() in answer_lower for kw in tc["expected_keywords"]):
results["answer_relevant"] += 1
results["retrieval_accuracy"] = results["retrieval_hits"] / results["total"]
results["answer_accuracy"] = results["answer_relevant"] / results["total"]
return results
# Example test cases
test_cases = [
{"question": "What is the vacation policy?",
"expected_source": "hr_handbook.txt",
"expected_keywords": ["days", "annual"]},
{"question": "How do I submit expenses?",
"expected_source": "finance_guide.txt",
"expected_keywords": ["receipt", "submit"]},
]
metrics = evaluate_rag(rag, test_cases)
print(f"Retrieval accuracy: {metrics['retrieval_accuracy']:.0%}")
print(f"Answer accuracy: {metrics['answer_accuracy']:.0%}")
Start with a simple pipeline, measure its performance, then optimise. Common improvements include: better chunking strategies, re-ranking retrieved results with a cross-encoder, query expansion (rewriting the query for better retrieval), and hybrid search combining vector similarity with keyword matching (BM25).
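One common way to implement the hybrid search mentioned above is Reciprocal Rank Fusion (RRF), which merges a vector-similarity ranking with a keyword (BM25) ranking without having to calibrate their raw scores against each other. A sketch (the constant k=60 is the value conventionally used in the RRF literature; the ID lists are illustrative):

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked ID lists into one using RRF.

    Each document scores 1 / (k + rank) in every list it appears in;
    documents ranked highly by multiple retrievers rise to the top.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["doc_a", "doc_c", "doc_b"]   # ranked by cosine similarity
keyword_hits = ["doc_b", "doc_a", "doc_d"]  # ranked by BM25
print(reciprocal_rank_fusion([vector_hits, keyword_hits]))
```

Because RRF only uses rank positions, it works even when the two retrievers produce scores on completely different scales.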