Module 7: Memory & Context Management
Managing what your agent remembers.
The Context Window Problem
Every LLM has a finite context window — the maximum number of tokens it can process in a single request. This includes the system prompt, all conversation history, tool definitions, and the model's response. When your conversation exceeds this limit, you must decide what to keep and what to discard.
Context windows have grown dramatically (from 4K tokens in early GPT-3.5 to 200K in Claude and 1M+ in Gemini), but they are still finite. More importantly, longer contexts cost more money and increase latency. Even if a model supports 200K tokens, sending 200K tokens on every request is wasteful if only 5K are relevant.
Memory management is therefore about two things: staying within limits, and staying efficient. A well-designed memory system gives the agent access to everything it needs while keeping token usage minimal.
Think of the context window as a desk. You can only spread so many papers on it at once. Memory management is the art of filing, summarising, and retrieving papers so the right information is on the desk when you need it — without burying yourself in old notes.
| Model | Context Window | Approx. Pages of Text |
|---|---|---|
| GPT-4o | 128K tokens | ~300 pages |
| Claude Sonnet / Opus | 200K tokens | ~500 pages |
| Gemini 1.5 Pro | 1M+ tokens | ~2,500 pages |
Token counts are approximate. 1 token is roughly 4 characters in English, or about 0.75 words. Code and non-English text tend to use more tokens per word.
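These rules of thumb are easy to check in code. A quick character-based estimator (approximate by design, and the 300-words-per-page figure is an assumption, not a standard):

```python
def estimate_tokens(text: str) -> int:
    """Rough estimate: ~4 characters per token for English text."""
    return max(1, len(text) // 4)

def tokens_to_pages(tokens: int, words_per_page: int = 300) -> int:
    """~0.75 words per token, ~300 words per printed page."""
    return int(tokens * 0.75) // words_per_page

# 128K tokens is roughly 96K words, or ~320 pages
pages = tokens_to_pages(128_000)
```

For real token counts, use the tokenizer-backed approaches shown later in this module.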
Short-Term Memory — Sliding Window
The simplest memory strategy is a sliding window: keep the most recent N messages and discard everything older. This mirrors how humans focus on the current conversation while forgetting earlier details.
The key design decision is where to cut. A naive approach drops the oldest messages, but this can remove the system prompt or important early instructions. A better approach always preserves the system prompt and the first user message, then applies the window to everything in between.
```python
class SlidingWindowMemory:
    """Keep the system prompt + last N message pairs."""

    def __init__(self, max_pairs: int = 20):
        self.system_prompt = ""
        self.messages: list[dict] = []
        self.max_pairs = max_pairs  # each pair = user + assistant

    def set_system(self, prompt: str):
        self.system_prompt = prompt

    def add(self, role: str, content: str):
        self.messages.append({"role": role, "content": content})
        # Keep max_pairs * 2 messages (user + assistant pairs)
        max_msgs = self.max_pairs * 2
        if len(self.messages) > max_msgs:
            self.messages = self.messages[-max_msgs:]

    def get_messages(self) -> list[dict]:
        """Return messages formatted for the API."""
        result = []
        if self.system_prompt:
            result.append({"role": "system", "content": self.system_prompt})
        result.extend(self.messages)
        return result

    @property
    def token_estimate(self) -> int:
        """Rough token count (~4 chars per token)."""
        total = len(self.system_prompt)
        total += sum(len(m["content"]) for m in self.messages)
        return total // 4
```
Count tokens, not messages. A single message with a large tool result might use 5,000 tokens, while ten short chat messages might use only 500. Use tiktoken (OpenAI) or Anthropic's token counting API for accurate counts.
Token-Based Window
```python
import tiktoken

class TokenWindowMemory:
    """Keep messages that fit within a token budget."""

    def __init__(self, max_tokens: int = 8000, model: str = "gpt-4o"):
        self.max_tokens = max_tokens
        self.encoder = tiktoken.encoding_for_model(model)
        self.messages: list[dict] = []

    def _count_tokens(self, messages: list[dict]) -> int:
        return sum(len(self.encoder.encode(m["content"])) for m in messages)

    def add(self, role: str, content: str):
        self.messages.append({"role": role, "content": content})
        # Trim from the front until we're within budget
        while (len(self.messages) > 2
               and self._count_tokens(self.messages) > self.max_tokens):
            self.messages.pop(1)  # keep index 0 (first message)
```
Summarisation Strategy
Sliding windows lose information permanently. A summarisation strategy compresses older messages into a concise summary, preserving key facts while reducing token count. You use the LLM itself to generate these summaries.
The typical approach is: when the conversation exceeds a threshold, take the oldest chunk of messages, summarise them, replace them with the summary, and continue. This creates a layered memory where recent messages are verbatim and older ones are compressed.
```python
import anthropic

client = anthropic.Anthropic()

class SummarisedMemory:
    """Memory that summarises old messages to stay within budget."""

    def __init__(self, max_tokens: int = 6000, summary_threshold: int = 8000):
        self.messages: list[dict] = []
        self.summary: str = ""
        self.max_tokens = max_tokens
        self.summary_threshold = summary_threshold

    def add(self, role: str, content: str):
        self.messages.append({"role": role, "content": content})
        if self._estimate_tokens() > self.summary_threshold:
            self._compress()

    def _estimate_tokens(self) -> int:
        total = len(self.summary)
        total += sum(len(m["content"]) for m in self.messages)
        return total // 4

    def _compress(self):
        """Summarise the oldest half of messages."""
        mid = len(self.messages) // 2
        old_messages = self.messages[:mid]
        self.messages = self.messages[mid:]

        # Use the LLM to create a summary
        conversation = "\n".join(
            f'{m["role"]}: {m["content"]}' for m in old_messages
        )
        prompt = f"""Summarise this conversation, preserving:
- Key decisions and facts
- User preferences mentioned
- Any pending tasks or commitments

Previous summary: {self.summary or 'None'}

Conversation to summarise:
{conversation}"""
        response = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=500,
            messages=[{"role": "user", "content": prompt}],
        )
        self.summary = response.content[0].text

    def get_messages(self) -> list[dict]:
        msgs = []
        if self.summary:
            msgs.append({
                "role": "system",
                "content": f"Conversation summary so far:\n{self.summary}",
            })
        msgs.extend(self.messages)
        return msgs
```
Summaries lose nuance. If the user said "I'm allergic to peanuts" 50 messages ago and the summary didn't capture it, the agent might recommend a peanut dish. For safety-critical information, use explicit fact extraction alongside summaries.
Combine summarisation with a fact store: after each conversation turn, extract key facts (user preferences, constraints, decisions) into a structured store. Include these facts in every request regardless of the summary window.
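One way to implement the fact store is to ask the model to return a flat JSON object of durable facts after each turn and merge it into a running dict. A minimal sketch — the prompt wording is illustrative, the actual LLM call is elided, and the parser assumes the model returns plain JSON:

```python
import json

def fact_prompt(user_msg: str, assistant_msg: str) -> str:
    """Build an extraction prompt for one conversation turn."""
    return (
        "Extract durable user facts (preferences, constraints, decisions) "
        'from this exchange as a flat JSON object, e.g. {"allergy": "peanuts"}. '
        "Return {} if there are none.\n\n"
        f"User: {user_msg}\nAssistant: {assistant_msg}"
    )

def merge_facts(store: dict[str, str], raw_json: str) -> dict[str, str]:
    """Merge newly extracted facts into the store; newer values win."""
    try:
        new_facts = json.loads(raw_json)
    except json.JSONDecodeError:
        return store  # ignore malformed model output
    store.update({k: str(v) for k, v in new_facts.items()})
    return store

facts: dict[str, str] = {}
# In practice raw_json comes from an LLM call using fact_prompt(...).
facts = merge_facts(facts, '{"allergy": "peanuts", "seat": "window"}')
```

Because the store is a plain dict, these facts can be injected into every request (e.g. appended to the system prompt) independently of what the summary happened to keep.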
Long-Term Memory — Semantic Retrieval
For agents that interact with users over days, weeks, or months, you need long-term memory that persists across sessions. The most effective approach uses vector embeddings to store and retrieve relevant past interactions semantically.
Instead of replaying the entire history, you embed each conversation turn and store it in a vector database. When the user sends a new message, you search for the most relevant past interactions and include them as context. This gives the agent seemingly unlimited memory while using minimal tokens.
```python
import chromadb
from datetime import datetime

class LongTermMemory:
    """Semantic long-term memory using vector embeddings."""

    def __init__(self, user_id: str):
        self.client = chromadb.PersistentClient(path="./memory_db")
        self.collection = self.client.get_or_create_collection(
            name=f"memory_{user_id}",
            metadata={"hnsw:space": "cosine"},
        )
        self.user_id = user_id

    def store(self, content: str, metadata: dict = None):
        """Store a memory with a timestamp."""
        meta = {
            "timestamp": datetime.now().isoformat(),
            "user_id": self.user_id,
        }
        if metadata:
            meta.update(metadata)
        self.collection.add(
            documents=[content],
            metadatas=[meta],
            ids=[f"mem_{datetime.now().timestamp()}"],
        )

    def recall(self, query: str, n_results: int = 5) -> list[str]:
        """Retrieve relevant memories for a query."""
        if self.collection.count() == 0:
            return []
        results = self.collection.query(
            query_texts=[query],
            n_results=min(n_results, self.collection.count()),
        )
        return results["documents"][0]

    def get_context_string(self, query: str) -> str:
        """Format memories as context for the LLM."""
        memories = self.recall(query)
        if not memories:
            return ""
        formatted = "\n".join(f"- {m}" for m in memories)
        return f"Relevant memories from past conversations:\n{formatted}"
```
Episodic Memory
Stores specific interactions and events. "Last Tuesday, the user asked about flight prices to Tokyo."
Semantic Memory
Stores general facts and knowledge. "The user prefers window seats and vegetarian meals."
Procedural Memory
Stores learned processes. "When deploying, always run tests first, then push to staging."
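These categories become actionable if each stored memory carries its type, so retrieval can filter on it. A minimal in-memory sketch (the enum values mirror the taxonomy above; in a vector store the same tag would go in the record's metadata):

```python
from dataclasses import dataclass
from enum import Enum

class MemoryType(Enum):
    EPISODIC = "episodic"      # specific events and interactions
    SEMANTIC = "semantic"      # general facts about the user
    PROCEDURAL = "procedural"  # learned processes

@dataclass
class MemoryRecord:
    content: str
    type: MemoryType

def filter_memories(records: list[MemoryRecord],
                    type_: MemoryType) -> list[str]:
    """Return only memories of the requested type."""
    return [r.content for r in records if r.type is type_]

records = [
    MemoryRecord("Asked about flight prices to Tokyo last Tuesday",
                 MemoryType.EPISODIC),
    MemoryRecord("Prefers window seats and vegetarian meals",
                 MemoryType.SEMANTIC),
    MemoryRecord("Always run tests before pushing to staging",
                 MemoryType.PROCEDURAL),
]
```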
Persistent Storage Patterns
Different types of memory data call for different storage backends. In practice, a production agent uses multiple storage systems together.
| Storage Type | Best For | Example Tools | Access Pattern |
|---|---|---|---|
| Vector Database | Semantic search over past conversations | ChromaDB, Pinecone, Weaviate, pgvector | Similarity search by embedding |
| Key-Value Store | User preferences, session state, fast lookups | Redis, DynamoDB | Exact key lookup |
| SQL Database | Structured records, audit trails, relationships | PostgreSQL, SQLite | Queries with filters and joins |
| Document Store | Conversation logs, complex nested data | MongoDB, Firestore | Document queries |
User Profile Store with Redis
```python
import redis

class UserProfileMemory:
    """Fast key-value store for user preferences and facts."""

    def __init__(self, user_id: str):
        self.redis = redis.Redis(host="localhost", port=6379, db=0)
        self.user_id = user_id
        self.key = f"agent:user:{user_id}:profile"

    def set_fact(self, key: str, value: str):
        """Store a user fact, e.g. set_fact('timezone', 'UTC+1')."""
        self.redis.hset(self.key, key, value)

    def get_fact(self, key: str) -> str | None:
        val = self.redis.hget(self.key, key)
        return val.decode() if val else None

    def get_all_facts(self) -> dict:
        data = self.redis.hgetall(self.key)
        return {k.decode(): v.decode() for k, v in data.items()}

    def get_context_string(self) -> str:
        facts = self.get_all_facts()
        if not facts:
            return ""
        lines = [f"- {k}: {v}" for k, v in facts.items()]
        return "Known user preferences:\n" + "\n".join(lines)
```
Always separate facts (timezone, language preference, name) from episodic memories (what happened in past conversations). Facts should be deterministic lookups; episodes should be semantic search.
Putting It All Together
A production-grade memory system typically layers multiple strategies. Here is a complete example combining sliding window (recent), summarisation (medium-term), and vector retrieval (long-term).
```python
class AgentMemory:
    """Layered memory: recent window + summary + long-term retrieval."""

    def __init__(self, user_id: str, system_prompt: str):
        self.system_prompt = system_prompt
        self.window = SlidingWindowMemory(max_pairs=10)
        self.summarised = SummarisedMemory()
        self.long_term = LongTermMemory(user_id)
        self.profile = UserProfileMemory(user_id)

    def add_turn(self, user_msg: str, assistant_msg: str):
        # Store in the short-term window
        self.window.add("user", user_msg)
        self.window.add("assistant", assistant_msg)
        # Feed the medium-term layer so it can summarise as it grows
        self.summarised.add("user", user_msg)
        self.summarised.add("assistant", assistant_msg)
        # Store in long-term memory
        self.long_term.store(f"User: {user_msg}\nAssistant: {assistant_msg}")

    def build_messages(self, current_query: str) -> list[dict]:
        # Layer 1: system prompt + user profile
        profile_ctx = self.profile.get_context_string()
        system = self.system_prompt
        if profile_ctx:
            system += f"\n\n{profile_ctx}"
        # Layer 2: relevant long-term memories
        ltm_ctx = self.long_term.get_context_string(current_query)
        if ltm_ctx:
            system += f"\n\n{ltm_ctx}"
        # Layer 3: conversation summary (if any)
        if self.summarised.summary:
            system += f"\n\nConversation summary:\n{self.summarised.summary}"
        # Layer 4: recent messages (verbatim)
        messages = [{"role": "system", "content": system}]
        messages.extend(self.window.messages)
        messages.append({"role": "user", "content": current_query})
        return messages
```
Monitor your token usage per layer. In most applications, the system prompt + profile + retrieved memories should use no more than 30% of your token budget, leaving 70% for the actual conversation and model response.
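A rough way to enforce that split is to estimate tokens per layer before each request. A sketch using the ~4-characters-per-token heuristic (the layer names and the 30% threshold here are illustrative, not part of any API):

```python
def layer_token_report(layers: dict[str, str], budget: int,
                       system_share: float = 0.30) -> dict:
    """Estimate tokens per layer (~4 chars/token) and flag overruns."""
    estimates = {name: len(text) // 4 for name, text in layers.items()}
    # Layers that ride along in the system prompt on every request
    system_layers = {"system", "profile", "memories", "summary"}
    system_total = sum(t for n, t in estimates.items() if n in system_layers)
    return {
        "per_layer": estimates,
        "system_total": system_total,
        "over_budget": system_total > budget * system_share,
    }

report = layer_token_report(
    {"system": "x" * 4000, "profile": "x" * 2000,
     "memories": "x" * 6000, "conversation": "x" * 20000},
    budget=8000,
)
# System layers total 3000 tokens, over the 2400-token (30%) allowance
```

Logging this report on every request makes regressions visible early, for example when a growing profile or over-eager retrieval starts crowding out the conversation itself.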