AI Agent Series — Ran Wei

Module 7: Memory & Context Management

Managing what your agent remembers.

1. The Context Window Problem

Every LLM has a finite context window — the maximum number of tokens it can process in a single request. This includes the system prompt, all conversation history, tool definitions, and the model's response. When your conversation exceeds this limit, you must decide what to keep and what to discard.

Context windows have grown dramatically (from 4K tokens in early GPT-3.5 to 200K in Claude and 1M+ in Gemini), but they are still finite. More importantly, longer contexts cost more money and increase latency. Even if a model supports 200K tokens, sending 200K tokens on every request is wasteful if only 5K are relevant.

Memory management is therefore about two things: staying within limits, and staying efficient. A well-designed memory system gives the agent access to everything it needs while keeping token usage minimal.
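As a minimal sketch of that budget check, using the rough 4-characters-per-token heuristic (the function names here are illustrative, not from any library):

```python
def estimate_tokens(text: str) -> int:
    """Rough heuristic: ~4 characters per token for English prose."""
    return len(text) // 4

def within_budget(messages: list[dict], budget: int = 8000) -> bool:
    """Check a message list against a token budget before sending a request."""
    total = sum(estimate_tokens(m["content"]) for m in messages)
    return total <= budget
```

A check like this runs before every API call; if it fails, one of the trimming strategies below kicks in.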

ANALOGY

Think of the context window as a desk. You can only spread so many papers on it at once. Memory management is the art of filing, summarising, and retrieving papers so the right information is on the desk when you need it — without burying yourself in old notes.

Model                  Context Window   Approx. Pages of Text
GPT-4o                 128K tokens      ~300 pages
Claude Sonnet / Opus   200K tokens      ~500 pages
Gemini 1.5 Pro         1M+ tokens       ~2,500 pages
NOTE

Token counts are approximate. 1 token is roughly 4 characters in English, or about 0.75 words. Code and non-English text tend to use more tokens per word.

2. Short-Term Memory — Sliding Window

The simplest memory strategy is a sliding window: keep the most recent N messages and discard everything older. This mirrors how humans focus on the current conversation while forgetting earlier details.

The key design decision is where to cut. A naive approach drops the oldest messages, but this can remove the system prompt or important early instructions. A better approach always preserves the system prompt and the first user message, then applies the window to everything in between.

class SlidingWindowMemory:
    """Keep the system prompt + last N message pairs."""

    def __init__(self, max_pairs: int = 20):
        self.system_prompt = ""
        self.messages: list[dict] = []
        self.max_pairs = max_pairs  # each pair = user + assistant

    def set_system(self, prompt: str):
        self.system_prompt = prompt

    def add(self, role: str, content: str):
        self.messages.append({"role": role, "content": content})
        # Keep max_pairs * 2 messages (user+assistant pairs)
        max_msgs = self.max_pairs * 2
        if len(self.messages) > max_msgs:
            self.messages = self.messages[-max_msgs:]

    def get_messages(self) -> list[dict]:
        """Return messages formatted for the API."""
        result = []
        if self.system_prompt:
            result.append({"role": "system", "content": self.system_prompt})
        result.extend(self.messages)
        return result

    @property
    def token_estimate(self) -> int:
        """Rough token count (4 chars per token)."""
        total = len(self.system_prompt)
        total += sum(len(m["content"]) for m in self.messages)
        return total // 4
TIP

Count tokens, not messages. A single message with a large tool result might use 5,000 tokens, while ten short chat messages might use only 500. Use tiktoken (OpenAI) or Anthropic's token counting API for accurate counts.

Token-Based Window

import tiktoken

class TokenWindowMemory:
    """Keep messages that fit within a token budget."""

    def __init__(self, max_tokens: int = 8000, model: str = "gpt-4o"):
        self.max_tokens = max_tokens
        self.encoder = tiktoken.encoding_for_model(model)
        self.messages: list[dict] = []

    def _count_tokens(self, messages: list[dict]) -> int:
        return sum(len(self.encoder.encode(m["content"])) for m in messages)

    def add(self, role: str, content: str):
        self.messages.append({"role": role, "content": content})
        # Trim from the front until we're within budget
        while (len(self.messages) > 2 and
               self._count_tokens(self.messages) > self.max_tokens):
            self.messages.pop(1)  # keep index 0 (first message)

3. Summarisation Strategy

Sliding windows lose information permanently. A summarisation strategy compresses older messages into a concise summary, preserving key facts while reducing token count. You use the LLM itself to generate these summaries.

The typical approach is: when the conversation exceeds a threshold, take the oldest chunk of messages, summarise them, replace them with the summary, and continue. This creates a layered memory where recent messages are verbatim and older ones are compressed.

import anthropic

client = anthropic.Anthropic()

class SummarisedMemory:
    """Memory that summarises old messages to stay within budget."""

    def __init__(self, summary_threshold: int = 8000):
        self.messages: list[dict] = []
        self.summary: str = ""
        self.summary_threshold = summary_threshold

    def add(self, role: str, content: str):
        self.messages.append({"role": role, "content": content})
        if self._estimate_tokens() > self.summary_threshold:
            self._compress()

    def _estimate_tokens(self) -> int:
        total = len(self.summary)
        total += sum(len(m["content"]) for m in self.messages)
        return total // 4

    def _compress(self):
        """Summarise the oldest half of messages."""
        mid = len(self.messages) // 2
        old_messages = self.messages[:mid]
        self.messages = self.messages[mid:]

        # Use the LLM to create a summary
        conversation = "\n".join(
            f'{m["role"]}: {m["content"]}' for m in old_messages
        )
        prompt = f"""Summarise this conversation, preserving:
- Key decisions and facts
- User preferences mentioned
- Any pending tasks or commitments

Previous summary: {self.summary or 'None'}

Conversation to summarise:
{conversation}"""

        response = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=500,
            messages=[{"role": "user", "content": prompt}]
        )
        self.summary = response.content[0].text

    def get_messages(self) -> list[dict]:
        msgs = []
        if self.summary:
            msgs.append({"role": "system",
                         "content": f"Conversation summary so far:\n{self.summary}"})
        msgs.extend(self.messages)
        return msgs
PITFALL

Summaries lose nuance. If the user said "I'm allergic to peanuts" 50 messages ago and the summary didn't capture it, the agent might recommend a peanut dish. For safety-critical information, use explicit fact extraction alongside summaries.

TIP

Combine summarisation with a fact store: after each conversation turn, extract key facts (user preferences, constraints, decisions) into a structured store. Include these facts in every request regardless of the summary window.
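A fact store along those lines can be sketched as follows (a hypothetical in-memory version; in practice the extraction step would itself be an LLM call, and the store would be backed by Redis or SQL as in section 5):

```python
class FactStore:
    """Minimal structured fact store, kept outside the summary window."""

    def __init__(self):
        self.facts: dict[str, str] = {}

    def set_fact(self, key: str, value: str):
        """Record an extracted fact, e.g. set_fact("allergy", "peanuts")."""
        self.facts[key] = value

    def as_context(self) -> str:
        """Render all facts for inclusion in every request."""
        if not self.facts:
            return ""
        lines = "\n".join(f"- {k}: {v}" for k, v in self.facts.items())
        return f"Known facts (always apply):\n{lines}"
```

Because `as_context()` is injected into every request, a fact survives even after the turn that produced it has been summarised away.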

4. Long-Term Memory — Semantic Retrieval

For agents that interact with users over days, weeks, or months, you need long-term memory that persists across sessions. The most effective approach uses vector embeddings to store and retrieve relevant past interactions semantically.

Instead of replaying the entire history, you embed each conversation turn and store it in a vector database. When the user sends a new message, you search for the most relevant past interactions and include them as context. This gives the agent seemingly unlimited memory while using minimal tokens.

import chromadb
from datetime import datetime

class LongTermMemory:
    """Semantic long-term memory using vector embeddings."""

    def __init__(self, user_id: str):
        self.client = chromadb.PersistentClient(path="./memory_db")
        self.collection = self.client.get_or_create_collection(
            name=f"memory_{user_id}",
            metadata={"hnsw:space": "cosine"}
        )
        self.user_id = user_id

    def store(self, content: str, metadata: dict | None = None):
        """Store a memory with timestamp."""
        meta = {"timestamp": datetime.now().isoformat(),
                "user_id": self.user_id}
        if metadata:
            meta.update(metadata)
        self.collection.add(
            documents=[content],
            metadatas=[meta],
            ids=[f"mem_{datetime.now().timestamp()}"]
        )

    def recall(self, query: str, n_results: int = 5) -> list[str]:
        """Retrieve relevant memories for a query."""
        if self.collection.count() == 0:
            return []
        results = self.collection.query(
            query_texts=[query],
            n_results=min(n_results, self.collection.count())
        )
        return results["documents"][0]

    def get_context_string(self, query: str) -> str:
        """Format memories as context for the LLM."""
        memories = self.recall(query)
        if not memories:
            return ""
        formatted = "\n".join(f"- {m}" for m in memories)
        return f"Relevant memories from past conversations:\n{formatted}"

Episodic Memory

Stores specific interactions and events. "Last Tuesday, the user asked about flight prices to Tokyo."

Semantic Memory

Stores general facts and knowledge. "The user prefers window seats and vegetarian meals."

Procedural Memory

Stores learned processes. "When deploying, always run tests first, then push to staging."
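
The three memory types above can be represented as tagged records in the same store; a toy sketch (the records reuse the examples from this section):

```python
memories = [
    {"type": "episodic",
     "content": "Last Tuesday the user asked about flight prices to Tokyo."},
    {"type": "semantic",
     "content": "The user prefers window seats and vegetarian meals."},
    {"type": "procedural",
     "content": "When deploying, run tests first, then push to staging."},
]

def by_type(kind: str) -> list[str]:
    """Return stored contents matching one memory type."""
    return [m["content"] for m in memories if m["type"] == kind]
```

Tagging by type lets retrieval treat each kind differently, e.g. always including procedural memories for a task while retrieving episodic ones by similarity.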

5. Persistent Storage Patterns

Different types of memory data call for different storage backends. In practice, a production agent uses multiple storage systems together.

Storage Type      Best For                                         Example Tools                            Access Pattern
Vector Database   Semantic search over past conversations          ChromaDB, Pinecone, Weaviate, pgvector   Similarity search by embedding
Key-Value Store   User preferences, session state, fast lookups    Redis, DynamoDB                          Exact key lookup
SQL Database      Structured records, audit trails, relationships  PostgreSQL, SQLite                       Queries with filters and joins
Document Store    Conversation logs, complex nested data           MongoDB, Firestore                       Document queries

User Profile Store with Redis

import redis
import json

class UserProfileMemory:
    """Fast key-value store for user preferences and facts."""

    def __init__(self, user_id: str):
        self.redis = redis.Redis(host="localhost", port=6379, db=0)
        self.user_id = user_id
        self.key = f"agent:user:{user_id}:profile"

    def set_fact(self, key: str, value: str):
        """Store a user fact. e.g., set_fact('timezone', 'UTC+1')"""
        self.redis.hset(self.key, key, value)

    def get_fact(self, key: str) -> str | None:
        val = self.redis.hget(self.key, key)
        return val.decode() if val else None

    def get_all_facts(self) -> dict:
        data = self.redis.hgetall(self.key)
        return {k.decode(): v.decode() for k, v in data.items()}

    def get_context_string(self) -> str:
        facts = self.get_all_facts()
        if not facts:
            return ""
        lines = [f"- {k}: {v}" for k, v in facts.items()]
        return "Known user preferences:\n" + "\n".join(lines)
NOTE

Always separate facts (timezone, language preference, name) from episodic memories (what happened in past conversations). Facts should be deterministic lookups; episodes should be semantic search.
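
A minimal sketch of that routing rule (the key set here is illustrative):

```python
# Deterministic facts resolve via exact key-value lookup; everything
# else falls through to semantic search over episodic memories.
FACT_KEYS = {"timezone", "language", "name"}

def route(lookup: str) -> str:
    """Decide which memory backend should answer a lookup."""
    return "kv_lookup" if lookup in FACT_KEYS else "semantic_search"
```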

6. Putting It All Together

A production-grade memory system typically layers multiple strategies. Here is a complete example combining sliding window (recent), summarisation (medium-term), and vector retrieval (long-term).

class AgentMemory:
    """Layered memory: recent window + summary + long-term retrieval."""

    def __init__(self, user_id: str, system_prompt: str):
        self.system_prompt = system_prompt
        self.window = SlidingWindowMemory(max_pairs=10)
        self.summarised = SummarisedMemory()
        self.long_term = LongTermMemory(user_id)
        self.profile = UserProfileMemory(user_id)

    def add_turn(self, user_msg: str, assistant_msg: str):
        # Store in the short-term window
        self.window.add("user", user_msg)
        self.window.add("assistant", assistant_msg)
        # Feed the summariser so older turns get compressed
        self.summarised.add("user", user_msg)
        self.summarised.add("assistant", assistant_msg)
        # Store in long-term memory for semantic retrieval
        self.long_term.store(f"User: {user_msg}\nAssistant: {assistant_msg}")

    def build_messages(self, current_query: str) -> list[dict]:
        # Layer 1: System prompt + user profile
        profile_ctx = self.profile.get_context_string()
        system = self.system_prompt
        if profile_ctx:
            system += f"\n\n{profile_ctx}"

        # Layer 2: Relevant long-term memories
        ltm_ctx = self.long_term.get_context_string(current_query)
        if ltm_ctx:
            system += f"\n\n{ltm_ctx}"

        # Layer 3: Conversation summary (if any)
        if self.summarised.summary:
            system += f"\n\nConversation summary:\n{self.summarised.summary}"

        # Layer 4: Recent messages (verbatim)
        messages = [{"role": "system", "content": system}]
        messages.extend(self.window.messages)
        messages.append({"role": "user", "content": current_query})
        return messages
TIP

Monitor your token usage per layer. In most applications, the system prompt + profile + retrieved memories should use no more than 30% of your token budget, leaving 70% for the actual conversation and model response.
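
A rough sketch of such a monitor, using the 4-characters-per-token heuristic from earlier (function and key names are illustrative):

```python
def layer_report(system_ctx: str, conversation: str, budget: int = 8000) -> dict:
    """Report each layer's share of the token budget (~4 chars per token)."""
    sys_tokens = len(system_ctx) // 4    # system prompt + profile + memories
    conv_tokens = len(conversation) // 4  # recent messages + current query
    return {
        "system_pct": round(100 * sys_tokens / budget, 1),
        "conversation_pct": round(100 * conv_tokens / budget, 1),
        "context_over_30pct": sys_tokens > 0.3 * budget,
    }
```

Logging a report like this per request makes it easy to spot when retrieved memories or a growing profile start crowding out the conversation itself.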

Up Next

Module 8 — Retrieval-Augmented Generation (RAG)