Module 15: Deployment & Best Practices
Taking your agent from prototype to production: containerisation, monitoring, scaling, cost management, and a comprehensive deployment checklist.
Containerisation
Containerising your agent ensures it runs consistently across development, staging, and production environments. A Docker container packages your agent code, dependencies, and configuration into a single deployable unit.
Dockerfile for an Agent Service
```dockerfile
# Dockerfile
FROM python:3.12-slim

WORKDIR /app

# Install dependencies first (cached layer)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application code
COPY src/ ./src/
COPY prompts/ ./prompts/
COPY config/ ./config/

# Non-root user for security
RUN useradd --create-home appuser
USER appuser

# Health check endpoint (slim images do not ship curl, so use Python's stdlib)
HEALTHCHECK --interval=30s --timeout=10s --retries=3 \
  CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')" || exit 1

EXPOSE 8000
CMD ["uvicorn", "src.main:app", "--host", "0.0.0.0", "--port", "8000"]
```
Docker Compose for Multi-Agent Systems
When you have multiple agents running as separate services (as discussed in Module 11), Docker Compose lets you define and run them together.
```yaml
# docker-compose.yml
version: "3.9"
services:
  orchestrator:
    build: ./orchestrator
    ports: ["8000:8000"]
    environment:
      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}
      - REDIS_URL=redis://redis:6379
    depends_on: [redis]
  research-agent:
    build: ./agents/research
    environment:
      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}
  writer-agent:
    build: ./agents/writer
    environment:
      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}
  redis:
    image: redis:7-alpine
    volumes: ["redis-data:/data"]
volumes:
  redis-data:
```
Never bake API keys into your Docker image. Use environment variables or a secrets manager (AWS Secrets Manager, HashiCorp Vault, etc.). The .env file should be in your .gitignore and .dockerignore.
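As a first line of defence, the agent can read its key from the environment at startup and refuse to run without it, which keeps secrets out of the image and surfaces misconfiguration immediately. A minimal sketch (the `load_api_key` helper and its error message are illustrative, not part of any SDK):

```python
import os

def load_api_key(var: str = "ANTHROPIC_API_KEY") -> str:
    """Read an API key from the environment, failing fast if it is missing."""
    key = os.environ.get(var)
    if not key:
        raise RuntimeError(
            f"{var} is not set. Provide it via the environment or a secrets "
            "manager; never hard-code it in the image."
        )
    return key
```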
Python agent containers can be large (1GB+) due to ML dependencies. Use multi-stage builds and python:3.12-slim as the base to keep images lean. Only install the packages your agent actually needs in production.
Error Handling
Production agents face a range of failures: API rate limits, network timeouts, malformed LLM responses, and tool execution errors. Robust error handling with retries and circuit breakers keeps your agent resilient.
Exponential Backoff with Jitter
```python
import time, random
import anthropic

def call_with_retry(func, max_retries: int = 3, base_delay: float = 1.0):
    """Call a function with exponential backoff and jitter on failure."""
    for attempt in range(max_retries + 1):
        try:
            return func()
        except anthropic.RateLimitError:
            if attempt == max_retries:
                raise
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            print(f"Rate limited. Retrying in {delay:.1f}s (attempt {attempt + 1})")
            time.sleep(delay)
        except anthropic.APIStatusError as e:
            if e.status_code >= 500:  # Server errors are retryable
                if attempt == max_retries:
                    raise
                time.sleep(base_delay * (2 ** attempt))
            else:
                raise  # Client errors (4xx) are not retryable
        except anthropic.APIConnectionError:
            if attempt == max_retries:
                raise
            time.sleep(base_delay * (2 ** attempt))
    raise Exception("Max retries exceeded")
```
Circuit Breaker Pattern
If an API is consistently failing, stop hitting it and fail fast instead of waiting through retries. A circuit breaker "opens" after a threshold of failures and "closes" after a cooldown period.
```python
import time

class CircuitBreaker:
    """Prevent cascading failures with a circuit breaker."""

    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 60.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failure_count = 0
        self.last_failure_time = 0.0
        self.state = "closed"  # closed = normal, open = failing fast

    def call(self, func):
        """Execute function through the circuit breaker."""
        if self.state == "open":
            if time.time() - self.last_failure_time > self.reset_timeout:
                self.state = "half-open"  # Try one request
            else:
                raise Exception("Circuit breaker is OPEN - failing fast")
        try:
            result = func()
            if self.state == "half-open":
                self.state = "closed"
                self.failure_count = 0
            return result
        except Exception:
            self.failure_count += 1
            self.last_failure_time = time.time()
            if self.failure_count >= self.failure_threshold:
                self.state = "open"
                print(f"Circuit breaker OPENED after {self.failure_count} failures")
            raise

# Usage
api_breaker = CircuitBreaker(failure_threshold=5, reset_timeout=60)
try:
    result = api_breaker.call(lambda: client.messages.create(...))
except Exception:
    # Serve cached response or graceful fallback
    result = fallback_response()
```
Combine retries with circuit breakers: retry individual transient failures, but if failures accumulate, the circuit breaker stops further attempts. This protects both your agent and the upstream API from overload.
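The combination can be sketched end to end: wrap each call in the retry helper, then route the whole retry cycle through the breaker, so the breaker counts one failure per exhausted retry sequence. This is a self-contained illustration of the pattern, not production code: `TransientError` stands in for the real SDK exceptions, and the compact `Breaker` mirrors the `CircuitBreaker` above.

```python
import time, random

class TransientError(Exception):
    """Stand-in for a retryable API error."""

def retry(func, max_retries: int = 3, base_delay: float = 0.01):
    """Exponential backoff with jitter, as in the helper above."""
    for attempt in range(max_retries + 1):
        try:
            return func()
        except TransientError:
            if attempt == max_retries:
                raise
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))

class Breaker:
    """Minimal circuit breaker: opens after N consecutive failed cycles."""
    def __init__(self, threshold: int = 3, reset_timeout: float = 60.0):
        self.threshold, self.reset_timeout = threshold, reset_timeout
        self.failures, self.opened_at, self.state = 0, 0.0, "closed"

    def call(self, func):
        if self.state == "open":
            if time.time() - self.opened_at > self.reset_timeout:
                self.state = "half-open"  # allow one probe request
            else:
                raise RuntimeError("circuit open")
        try:
            result = func()
            self.state, self.failures = "closed", 0
            return result
        except Exception:
            self.failures += 1
            self.opened_at = time.time()
            if self.failures >= self.threshold:
                self.state = "open"
            raise

def guarded_call(breaker: Breaker, func):
    """Retry transient failures; let accumulated failures trip the breaker."""
    return breaker.call(lambda: retry(func))
```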
Monitoring and Observability
You cannot fix what you cannot see. Production agents need comprehensive logging, metrics, and tracing to diagnose issues and track performance over time.
Structured Logging
```python
import logging, json, time

class AgentLogger:
    """Structured logging for agent interactions."""

    def __init__(self, agent_name: str):
        self.logger = logging.getLogger(agent_name)
        self.logger.setLevel(logging.INFO)
        handler = logging.StreamHandler()
        handler.setFormatter(logging.Formatter('%(message)s'))
        self.logger.addHandler(handler)

    def log_llm_call(self, request_id: str, model: str, input_tokens: int,
                     output_tokens: int, latency_ms: float, success: bool):
        self.logger.info(json.dumps({
            "event": "llm_call",
            "request_id": request_id,
            "model": model,
            "input_tokens": input_tokens,
            "output_tokens": output_tokens,
            "latency_ms": round(latency_ms, 2),
            "success": success,
            "cost_usd": self._estimate_cost(model, input_tokens, output_tokens),
            "timestamp": time.time(),
        }))

    def log_tool_call(self, request_id: str, tool_name: str,
                      latency_ms: float, success: bool, error: str = ""):
        self.logger.info(json.dumps({
            "event": "tool_call",
            "request_id": request_id,
            "tool": tool_name,
            "latency_ms": round(latency_ms, 2),
            "success": success,
            "error": error,
            "timestamp": time.time(),
        }))

    def _estimate_cost(self, model: str, input_tokens: int, output_tokens: int) -> float:
        prices = {
            "claude-sonnet-4-20250514": (3.0, 15.0),  # (input, output) per 1M tokens
            "claude-haiku-4-20250514": (0.25, 1.25),
        }
        input_price, output_price = prices.get(model, (3.0, 15.0))
        return (input_tokens * input_price + output_tokens * output_price) / 1_000_000
```
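As a sanity check on the `_estimate_cost` arithmetic, consider a typical call with 1,000 input tokens and 500 output tokens at the $3/$15-per-million Sonnet rates used above:

```python
# cost = (input_tokens * input_price + output_tokens * output_price) / 1_000_000
input_tokens, output_tokens = 1_000, 500
input_price, output_price = 3.0, 15.0   # USD per 1M tokens

cost = (input_tokens * input_price + output_tokens * output_price) / 1_000_000
print(f"${cost:.4f}")  # prints $0.0105, roughly one cent per call
```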
Key Metrics to Track
| Metric | What It Tells You | Alert Threshold |
|---|---|---|
| Request latency (p50, p95, p99) | How fast your agent responds | p95 > 10s |
| Error rate | How often requests fail | > 5% |
| Token usage per request | How much context the agent uses | > 80% of context window |
| Cost per request | How much each interaction costs | > $0.50/request |
| Tool call success rate | How reliable your tools are | < 95% |
| Task completion rate | How often the agent finishes the task | < 90% |
| Human escalation rate | How often the agent cannot handle a request | > 20% |
Send structured logs to a centralised system (ELK stack, Datadog, Grafana Cloud) and set up dashboards that show real-time agent health. Build alerts for anomalies — a sudden spike in token usage often means your agent is stuck in a loop.
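The loop-detection heuristic can be as simple as comparing each request's token count against a rolling baseline. A sketch (the window size and the 3x multiplier are arbitrary starting points, not recommendations):

```python
from collections import deque

class TokenSpikeDetector:
    """Flag requests whose token usage far exceeds the recent average."""

    def __init__(self, window: int = 100, multiplier: float = 3.0):
        self.recent = deque(maxlen=window)
        self.multiplier = multiplier

    def check(self, tokens: int) -> bool:
        """Return True if this request looks anomalous; always records the value."""
        baseline = sum(self.recent) / len(self.recent) if self.recent else None
        self.recent.append(tokens)
        return baseline is not None and tokens > baseline * self.multiplier
```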
Scaling
AI agents are primarily I/O-bound (waiting for LLM API responses), not CPU-bound. This means scaling strategies differ from traditional compute-heavy applications.
Horizontal Scaling with FastAPI
```python
import time
import anthropic
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
client = anthropic.AsyncAnthropic()

class AgentRequest(BaseModel):
    user_id: str
    message: str

class AgentResponse(BaseModel):
    response: str
    latency_ms: float

@app.post("/agent", response_model=AgentResponse)
async def handle_request(req: AgentRequest):
    """Handle agent requests asynchronously for high throughput."""
    start = time.time()
    response = await client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=2048,
        system="You are a helpful assistant.",
        messages=[{"role": "user", "content": req.message}]
    )
    latency = (time.time() - start) * 1000
    return AgentResponse(
        response=response.content[0].text,
        latency_ms=latency,
    )

@app.get("/health")
async def health():
    return {"status": "healthy"}
```
Async I/O
Use asyncio and async API clients. A single process can handle hundreds of concurrent LLM requests while waiting for responses.
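A semaphore caps the number of in-flight requests so a burst of work does not blow straight through your rate limit. A minimal sketch, with a stand-in coroutine in place of a real async client call:

```python
import asyncio

async def call_llm(prompt: str) -> str:
    """Stand-in for an async LLM API call (e.g. via AsyncAnthropic)."""
    await asyncio.sleep(0.01)  # simulate network latency
    return f"response to: {prompt}"

async def run_batch(prompts, max_concurrent: int = 50):
    """Run many LLM calls concurrently, capped to respect rate limits."""
    sem = asyncio.Semaphore(max_concurrent)

    async def bounded(prompt):
        async with sem:
            return await call_llm(prompt)

    return await asyncio.gather(*(bounded(p) for p in prompts))

results = asyncio.run(run_batch([f"task {i}" for i in range(100)]))
```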
Worker Processes
Run multiple Uvicorn workers (--workers 4) behind a load balancer. Each worker handles its own async event loop.
Queue-Based
For long-running agent tasks, use a task queue (Celery, Redis Queue) to decouple request acceptance from processing.
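The decoupling can be sketched with the standard library; a real deployment would use Celery or RQ backed by Redis, but the shape is the same: the request handler enqueues a job and returns an id immediately, while a separate worker drains the queue.

```python
import queue, threading, uuid

tasks: queue.Queue = queue.Queue()
results: dict[str, str] = {}

def accept_request(message: str) -> str:
    """Web-handler side: enqueue the task and return a job id immediately."""
    job_id = str(uuid.uuid4())
    tasks.put((job_id, message))
    return job_id

def worker():
    """Worker side: drain the queue, running each long agent task."""
    while True:
        job_id, message = tasks.get()
        results[job_id] = f"processed: {message}"  # run the agent here
        tasks.task_done()

threading.Thread(target=worker, daemon=True).start()
```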
Auto-Scaling
Deploy on Kubernetes or cloud services with auto-scaling based on request queue depth or latency metrics.
The bottleneck is almost always the LLM API, not your application code. Scaling your agent horizontally does not help if the LLM provider is rate-limiting you. Monitor your API rate limits and request higher limits from your provider before scaling your infrastructure.
Cost Management
LLM API costs can escalate quickly in production, especially with multi-agent systems where each user request triggers multiple API calls. Proactive cost management is essential.
| Model | Input (per 1M tokens) | Output (per 1M tokens) | Typical cost per task |
|---|---|---|---|
| Claude Opus 4 | $15.00 | $75.00 | $0.10 - $1.00 |
| Claude Sonnet 4 | $3.00 | $15.00 | $0.02 - $0.15 |
| Claude Haiku 4 | $0.25 | $1.25 | $0.001 - $0.01 |
| GPT-4o | $2.50 | $10.00 | $0.01 - $0.10 |
| GPT-4o-mini | $0.15 | $0.60 | $0.001 - $0.01 |
Cost Optimisation Strategies
```python
import anthropic

class CostOptimiser:
    """Route requests to the cheapest capable model."""

    def __init__(self):
        self.client = anthropic.Anthropic()
        self.model_tiers = {
            "simple": "claude-haiku-4-20250514",    # Classification, extraction
            "medium": "claude-sonnet-4-20250514",   # Analysis, writing
            "complex": "claude-opus-4-20250514",    # Complex reasoning
        }

    def classify_complexity(self, task: str) -> str:
        """Use a cheap model to classify task complexity."""
        response = self.client.messages.create(
            model="claude-haiku-4-20250514",
            max_tokens=10,
            system="Classify task complexity as: simple, medium, or complex. "
                   "One word only.",
            messages=[{"role": "user", "content": task}]
        )
        complexity = response.content[0].text.strip().lower()
        return complexity if complexity in self.model_tiers else "medium"

    def run(self, task: str) -> str:
        """Route to the appropriate model based on complexity."""
        complexity = self.classify_complexity(task)
        model = self.model_tiers[complexity]
        print(f"Routing to {model} (complexity: {complexity})")
        response = self.client.messages.create(
            model=model, max_tokens=2048,
            messages=[{"role": "user", "content": task}]
        )
        return response.content[0].text
```
Key cost-saving techniques:
- Model routing — use cheap models for simple tasks, expensive models only when needed (5-50x savings)
- Prompt caching — cache system prompts and frequently used context to reduce input tokens
- Response caching — cache identical or near-identical queries to avoid redundant API calls
- Token budgets — set `max_tokens` appropriately for each task; do not request 4096 tokens for a yes/no answer
- Batch processing — use batch APIs for non-time-sensitive workloads (typically 50% cheaper)
- Spending alerts — set hard spending caps on API keys and alert when approaching limits
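Response caching from the list above can be as little as a dict keyed on a hash of the model and conversation. A sketch (in production you would add a TTL and keep the cache in Redis rather than process memory):

```python
import hashlib, json

_cache: dict[str, str] = {}

def cache_key(model: str, messages: list) -> str:
    """Stable key: hash of the model plus the exact conversation."""
    payload = json.dumps({"model": model, "messages": messages}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def cached_call(model: str, messages: list, call_api) -> str:
    """Return a cached response for identical requests; otherwise call the API."""
    key = cache_key(model, messages)
    if key not in _cache:
        _cache[key] = call_api(model, messages)
    return _cache[key]
```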
The classification call in the model router costs fractions of a cent. Even if it sometimes routes incorrectly, the overall savings from using cheaper models for 60-80% of requests far outweigh the cost of the routing call itself.
Production Checklist
Before deploying an agent to production, walk through this comprehensive checklist. Each item addresses a common failure mode in production AI systems.
| Category | Item | Status |
|---|---|---|
| Security | API keys stored in secrets manager (not in code or env files) | |
| Security | Input validation and prompt injection defences (Module 13) | |
| Security | Output filtering for PII and harmful content | |
| Security | Least-privilege permissions for all tools | |
| Reliability | Exponential backoff with jitter for API calls | |
| Reliability | Circuit breakers for external dependencies | |
| Reliability | Timeouts on all LLM calls (30-120s depending on task) | |
| Reliability | Graceful degradation when services are unavailable | |
| Observability | Structured logging for all LLM calls and tool executions | |
| Observability | Metrics dashboard with latency, error rate, and cost | |
| Observability | Alerting for error rate spikes and cost anomalies | |
| Cost | Model routing for cost-appropriate model selection | |
| Cost | Hard spending caps on API keys | |
| Cost | Response caching where applicable | |
| Testing | Benchmark suite passing (Module 14) | |
| Testing | Regression tests for all prompt changes | |
| Safety | Human-in-the-loop for high-risk actions | |
| Safety | Rate limiting per user and per action type | |
| Safety | Audit trail for all agent decisions and actions |
What's Next?
You have completed all 15 modules of the AI Agent Series. You now have the foundational knowledge and practical skills to build, test, and deploy production-grade AI agents. Here is a suggested path for continuing your learning:
Build a Real Project
Pick a problem you care about and build an end-to-end agent. A customer support bot, a research assistant, or a code review agent are all great starting points.
Explore Frameworks
Try LangChain, CrewAI, or the Anthropic Agent SDK for building more sophisticated multi-agent systems with less boilerplate.
Go Deeper on RAG
Production RAG systems involve chunking strategies, embedding model selection, re-ranking, and hybrid search. Module 8 was just the beginning.
Contribute to Open Source
Contribute to MCP servers, A2A implementations, or agent frameworks. The ecosystem is young and welcomes contributors.
- Stay current with the rapidly evolving agent ecosystem by following the official documentation for Anthropic, MCP, and A2A
- Join community forums and Discord servers to share patterns and learn from others building agents
- Read research papers on agent architectures, tool use, and safety — the academic frontier is moving fast
That wraps up the AI Agent Series. The field is evolving rapidly: stay curious, keep building, and always prioritise safety.