AI Agent Series — Ran Wei

Module 15: Deployment & Best Practices

Taking your agent from prototype to production: containerisation, monitoring, scaling, cost management, and a comprehensive deployment checklist.

1. Containerisation

Containerising your agent ensures it runs consistently across development, staging, and production environments. A Docker container packages your agent code, dependencies, and configuration into a single deployable unit.

Dockerfile for an Agent Service

# Dockerfile
FROM python:3.12-slim

WORKDIR /app

# Install dependencies first (cached layer)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application code
COPY src/ ./src/
COPY prompts/ ./prompts/
COPY config/ ./config/

# Non-root user for security
RUN useradd --create-home appuser
USER appuser

# Health check endpoint (python:3.12-slim does not ship curl, so use Python)
HEALTHCHECK --interval=30s --timeout=10s --retries=3 \
  CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')"

EXPOSE 8000
CMD ["uvicorn", "src.main:app", "--host", "0.0.0.0", "--port", "8000"]

Docker Compose for Multi-Agent Systems

When you have multiple agents running as separate services (as discussed in Module 11), Docker Compose lets you define and run them together.

# docker-compose.yml
version: "3.9"
services:
  orchestrator:
    build: ./orchestrator
    ports: ["8000:8000"]
    environment:
      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}
      - REDIS_URL=redis://redis:6379
    depends_on: [redis]

  research-agent:
    build: ./agents/research
    environment:
      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}

  writer-agent:
    build: ./agents/writer
    environment:
      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}

  redis:
    image: redis:7-alpine
    volumes: ["redis-data:/data"]

volumes:
  redis-data:
TIP

Never bake API keys into your Docker image. Use environment variables or a secrets manager (AWS Secrets Manager, HashiCorp Vault, etc.). The .env file should be in your .gitignore and .dockerignore.
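As a minimal illustration of that rule, configuration can be read from the environment at startup and fail loudly when a secret is missing. A sketch (the helper name is ours; the variable name matches the Compose file above):

```python
import os

def load_secret(name: str = "ANTHROPIC_API_KEY") -> str:
    """Read a secret from the environment, failing loudly if it is absent."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set; inject it via your secrets manager or environment"
        )
    return value
```

Failing at startup is preferable to discovering a missing key on the first user request.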

PITFALL

Python agent containers can be large (1GB+) due to ML dependencies. Use multi-stage builds and python:3.12-slim as the base to keep images lean. Only install the packages your agent actually needs in production.
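A multi-stage build along these lines keeps build tooling out of the final image (a sketch, assuming a plain requirements.txt project like the Dockerfile above):

```dockerfile
# Build stage: compile wheels with full build tooling available
FROM python:3.12-slim AS builder
WORKDIR /app
COPY requirements.txt .
RUN pip wheel --no-cache-dir --wheel-dir /wheels -r requirements.txt

# Runtime stage: install only the prebuilt wheels
FROM python:3.12-slim
WORKDIR /app
COPY --from=builder /wheels /wheels
RUN pip install --no-cache-dir /wheels/* && rm -rf /wheels
COPY src/ ./src/
```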

2. Error Handling

Production agents face a range of failures: API rate limits, network timeouts, malformed LLM responses, and tool execution errors. Robust error handling with retries and circuit breakers keeps your agent resilient.

Exponential Backoff with Jitter

import time, random
import anthropic

def call_with_retry(func, max_retries: int = 3, base_delay: float = 1.0):
    """Call a function with exponential backoff and jitter on failure."""
    for attempt in range(max_retries + 1):
        try:
            return func()
        except anthropic.RateLimitError:
            if attempt == max_retries:
                raise
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            print(f"Rate limited. Retrying in {delay:.1f}s (attempt {attempt + 1})")
            time.sleep(delay)
        except anthropic.APIStatusError as e:
            if e.status_code >= 500:  # Server errors are retryable
                if attempt == max_retries:
                    raise
                time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 1))
            else:
                raise  # Client errors (4xx) are not retryable
        except anthropic.APIConnectionError:
            if attempt == max_retries:
                raise
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 1))
    raise Exception("Max retries exceeded")

Circuit Breaker Pattern

If an API is consistently failing, stop hitting it and fail fast instead of waiting through retries. A circuit breaker "opens" after a threshold of failures and "closes" after a cooldown period.

import time

class CircuitBreaker:
    """Prevent cascading failures with a circuit breaker."""

    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 60.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failure_count = 0
        self.last_failure_time = 0
        self.state = "closed"  # closed = normal, open = failing fast

    def call(self, func):
        """Execute function through the circuit breaker."""
        if self.state == "open":
            if time.time() - self.last_failure_time > self.reset_timeout:
                self.state = "half-open"  # Try one request
            else:
                raise Exception("Circuit breaker is OPEN - failing fast")

        try:
            result = func()
            if self.state == "half-open":
                self.state = "closed"
                self.failure_count = 0
            return result
        except Exception:
            self.failure_count += 1
            self.last_failure_time = time.time()
            if self.failure_count >= self.failure_threshold:
                self.state = "open"
                print(f"Circuit breaker OPENED after {self.failure_count} failures")
            raise

# Usage
api_breaker = CircuitBreaker(failure_threshold=5, reset_timeout=60)
try:
    result = api_breaker.call(lambda: client.messages.create(...))
except Exception as e:
    # Serve cached response or graceful fallback
    result = fallback_response()
NOTE

Combine retries with circuit breakers: retry individual transient failures, but if failures accumulate, the circuit breaker stops further attempts. This protects both your agent and the upstream API from overload.
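A minimal sketch of that layering (the class and exception names here are illustrative, not from the examples above): retries absorb one-off failures, while the breaker short-circuits once failures accumulate.

```python
import time

class CircuitOpenError(Exception):
    """Raised when the breaker refuses a call."""

class TransientError(Exception):
    """Stand-in for a retryable failure such as a rate limit."""

class SimpleBreaker:
    """Minimal breaker: opens after `threshold` consecutive failures."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.failures = 0

    def call(self, func):
        if self.failures >= self.threshold:
            raise CircuitOpenError("circuit open, failing fast")
        try:
            result = func()
            self.failures = 0
            return result
        except Exception:
            self.failures += 1
            raise

def resilient_call(breaker, func, max_retries: int = 3, base_delay: float = 0.01):
    """Retry transient failures, but stop immediately once the breaker opens."""
    for attempt in range(max_retries + 1):
        try:
            return breaker.call(func)
        except CircuitOpenError:
            raise  # no point retrying against an open circuit
        except TransientError:
            if attempt == max_retries:
                raise
            time.sleep(base_delay * (2 ** attempt))
    raise RuntimeError("unreachable")
```

The same shape works with the fuller CircuitBreaker and call_with_retry above: wrap the retry loop inside the breaker, not the other way around.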

3. Monitoring and Observability

You cannot fix what you cannot see. Production agents need comprehensive logging, metrics, and tracing to diagnose issues and track performance over time.

Structured Logging

import logging, json, time, uuid

class AgentLogger:
    """Structured logging for agent interactions."""

    def __init__(self, agent_name: str):
        self.logger = logging.getLogger(agent_name)
        self.logger.setLevel(logging.INFO)
        handler = logging.StreamHandler()
        handler.setFormatter(logging.Formatter('%(message)s'))
        self.logger.addHandler(handler)

    def log_llm_call(self, request_id: str, model: str, input_tokens: int,
                     output_tokens: int, latency_ms: float, success: bool):
        self.logger.info(json.dumps({
            "event": "llm_call",
            "request_id": request_id,
            "model": model,
            "input_tokens": input_tokens,
            "output_tokens": output_tokens,
            "latency_ms": round(latency_ms, 2),
            "success": success,
            "cost_usd": self._estimate_cost(model, input_tokens, output_tokens),
            "timestamp": time.time(),
        }))

    def log_tool_call(self, request_id: str, tool_name: str,
                      latency_ms: float, success: bool, error: str = ""):
        self.logger.info(json.dumps({
            "event": "tool_call",
            "request_id": request_id,
            "tool": tool_name,
            "latency_ms": round(latency_ms, 2),
            "success": success,
            "error": error,
            "timestamp": time.time(),
        }))

    def _estimate_cost(self, model: str, input_tokens: int, output_tokens: int) -> float:
        prices = {
            "claude-sonnet-4-20250514": (3.0, 15.0),   # per 1M tokens
            "claude-haiku-4-20250514":  (0.25, 1.25),
        }
        input_price, output_price = prices.get(model, (3.0, 15.0))
        return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

Key Metrics to Track

| Metric | What It Tells You | Alert Threshold |
| Request latency (p50, p95, p99) | How fast your agent responds | p95 > 10s |
| Error rate | How often requests fail | > 5% |
| Token usage per request | How much context the agent uses | > 80% of context window |
| Cost per request | How much each interaction costs | > $0.50/request |
| Tool call success rate | How reliable your tools are | < 95% |
| Task completion rate | How often the agent finishes the task | < 90% |
| Human escalation rate | How often the agent cannot handle a request | > 20% |

TIP

Send structured logs to a centralised system (ELK stack, Datadog, Grafana Cloud) and set up dashboards that show real-time agent health. Build alerts for anomalies — a sudden spike in token usage often means your agent is stuck in a loop.

4. Scaling

AI agents are primarily I/O-bound (waiting for LLM API responses), not CPU-bound. This means scaling strategies differ from traditional compute-heavy applications.

Horizontal Scaling with FastAPI

from fastapi import FastAPI
from pydantic import BaseModel
import time, anthropic

app = FastAPI()
client = anthropic.AsyncAnthropic()

class AgentRequest(BaseModel):
    user_id: str
    message: str

class AgentResponse(BaseModel):
    response: str
    latency_ms: float

@app.post("/agent", response_model=AgentResponse)
async def handle_request(req: AgentRequest):
    """Handle agent requests asynchronously for high throughput."""
    start = time.time()

    response = await client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=2048,
        system="You are a helpful assistant.",
        messages=[{"role": "user", "content": req.message}]
    )

    latency = (time.time() - start) * 1000
    return AgentResponse(
        response=response.content[0].text,
        latency_ms=latency
    )

@app.get("/health")
async def health():
    return {"status": "healthy"}

Async I/O

Use asyncio and async API clients. A single process can handle hundreds of concurrent LLM requests while waiting for responses.

Worker Processes

Run multiple Uvicorn workers (--workers 4) behind a load balancer. Each worker handles its own async event loop.

Queue-Based

For long-running agent tasks, use a task queue (Celery, Redis Queue) to decouple request acceptance from processing.

Auto-Scaling

Deploy on Kubernetes or cloud services with auto-scaling based on request queue depth or latency metrics.
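The async I/O point is easy to demonstrate: with an async client, one process overlaps the waiting time of many in-flight requests. A sketch with a stand-in coroutine in place of a real LLM call:

```python
import asyncio, time

async def fake_llm_call(prompt: str) -> str:
    """Stand-in for an LLM API call: all the time is spent waiting."""
    await asyncio.sleep(0.05)
    return f"answer to: {prompt}"

async def handle_batch(prompts: list[str]) -> list[str]:
    # All requests wait concurrently, so wall time stays close to one call's
    # latency rather than the sum of all of them
    return await asyncio.gather(*(fake_llm_call(p) for p in prompts))

start = time.time()
results = asyncio.run(handle_batch([f"q{i}" for i in range(100)]))
elapsed = time.time() - start
```

One hundred 50ms "calls" complete in roughly the time of one, which is why a single async worker can serve hundreds of concurrent agent requests.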

NOTE

The bottleneck is almost always the LLM API, not your application code. Scaling your agent horizontally does not help if the LLM provider is rate-limiting you. Monitor your API rate limits and request higher limits from your provider before scaling your infrastructure.

5. Cost Management

LLM API costs can escalate quickly in production, especially with multi-agent systems where each user request triggers multiple API calls. Proactive cost management is essential.

| Model | Input (per 1M tokens) | Output (per 1M tokens) | Typical cost per task |
| Claude Opus 4 | $15.00 | $75.00 | $0.10 - $1.00 |
| Claude Sonnet 4 | $3.00 | $15.00 | $0.02 - $0.15 |
| Claude Haiku 4 | $0.25 | $1.25 | $0.001 - $0.01 |
| GPT-4o | $2.50 | $10.00 | $0.01 - $0.10 |
| GPT-4o-mini | $0.15 | $0.60 | $0.001 - $0.01 |
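The per-task figures follow directly from the token prices. A quick back-of-the-envelope helper (prices copied from the table; the dictionary keys are illustrative shorthand, not API model IDs):

```python
PRICES_PER_M = {  # (input, output) USD per 1M tokens, from the table above
    "claude-sonnet-4": (3.00, 15.00),
    "claude-haiku-4":  (0.25, 1.25),
}

def task_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    input_price, output_price = PRICES_PER_M[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

# A Sonnet call with 2,000 input and 500 output tokens:
# (2000 * 3.00 + 500 * 15.00) / 1e6 = 0.0135 USD
```

Multiply by calls per task: a multi-agent pipeline that makes five such calls per user request already costs several cents each time.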

Cost Optimisation Strategies

class CostOptimiser:
    """Route requests to the cheapest capable model."""

    def __init__(self):
        self.client = anthropic.Anthropic()
        self.model_tiers = {
            "simple":  "claude-haiku-4-20250514",    # Classification, extraction
            "medium":  "claude-sonnet-4-20250514",   # Analysis, writing
            "complex": "claude-opus-4-20250514",     # Complex reasoning
        }

    def classify_complexity(self, task: str) -> str:
        """Use a cheap model to classify task complexity."""
        response = self.client.messages.create(
            model="claude-haiku-4-20250514",
            max_tokens=10,
            system="Classify task complexity as: simple, medium, or complex. "
                   "One word only.",
            messages=[{"role": "user", "content": task}]
        )
        complexity = response.content[0].text.strip().lower()
        return complexity if complexity in self.model_tiers else "medium"

    def run(self, task: str) -> str:
        """Route to the appropriate model based on complexity."""
        complexity = self.classify_complexity(task)
        model = self.model_tiers[complexity]
        print(f"Routing to {model} (complexity: {complexity})")

        response = self.client.messages.create(
            model=model, max_tokens=2048,
            messages=[{"role": "user", "content": task}]
        )
        return response.content[0].text

Key cost-saving techniques:

  1. Model routing — use cheap models for simple tasks, expensive models only when needed (5-50x savings)
  2. Prompt caching — cache system prompts and frequently used context to reduce input tokens
  3. Response caching — cache identical or near-identical queries to avoid redundant API calls
  4. Token budgets — set max_tokens appropriately for each task; do not request 4096 tokens for a yes/no answer
  5. Batch processing — use batch APIs for non-time-sensitive workloads (typically 50% cheaper)
  6. Spending alerts — set hard spending caps on API keys and alert when approaching limits
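Technique 3, response caching, can be as simple as keying on a hash of the prompt. A minimal in-memory sketch (the llm_fn parameter stands in for any real API call):

```python
import hashlib

class ResponseCache:
    """Cache LLM responses keyed by a hash of the prompt."""

    def __init__(self):
        self._store: dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    def get_or_call(self, prompt: str, llm_fn) -> str:
        key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
        if key in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[key] = llm_fn(prompt)
        return self._store[key]
```

In production you would add a TTL and use a shared store such as the Redis service from the Compose file, but the hit/miss accounting is the same.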
TIP

The classification call in the model router costs fractions of a cent. Even if it sometimes routes incorrectly, the overall savings from using cheaper models for 60-80% of requests far outweigh the cost of the routing call itself.

6. Production Checklist

Before deploying an agent to production, walk through this comprehensive checklist. Each item addresses a common failure mode in production AI systems.

Security
  [ ] API keys stored in secrets manager (not in code or env files)
  [ ] Input validation and prompt injection defences (Module 13)
  [ ] Output filtering for PII and harmful content
  [ ] Least-privilege permissions for all tools

Reliability
  [ ] Exponential backoff with jitter for API calls
  [ ] Circuit breakers for external dependencies
  [ ] Timeouts on all LLM calls (30-120s depending on task)
  [ ] Graceful degradation when services are unavailable

Observability
  [ ] Structured logging for all LLM calls and tool executions
  [ ] Metrics dashboard with latency, error rate, and cost
  [ ] Alerting for error rate spikes and cost anomalies

Cost
  [ ] Model routing for cost-appropriate model selection
  [ ] Hard spending caps on API keys
  [ ] Response caching where applicable

Testing
  [ ] Benchmark suite passing (Module 14)
  [ ] Regression tests for all prompt changes

Safety
  [ ] Human-in-the-loop for high-risk actions
  [ ] Rate limiting per user and per action type
  [ ] Audit trail for all agent decisions and actions
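One checklist item worth showing concretely is the timeout requirement: wrap every blocking LLM call so a hung request cannot stall a worker indefinitely. A sketch using a thread pool (the helper name is ours; pass a timeout in the 30-120s range suggested above):

```python
import concurrent.futures

def call_with_timeout(func, timeout_s: float = 60.0):
    """Run a blocking call, raising TimeoutError if it exceeds timeout_s."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        return pool.submit(func).result(timeout=timeout_s)
```

Note that the timed-out call still runs to completion in its thread, so this bounds caller latency rather than cancelling work; true cancellation needs async clients or process pools.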

7. What's Next?

You have completed all 15 modules of the AI Agent Series. You now have the foundational knowledge and practical skills to build, test, and deploy production-grade AI agents. Here is a suggested path for continuing your learning:

Build a Real Project

Pick a problem you care about and build an end-to-end agent. A customer support bot, a research assistant, or a code review agent are all great starting points.

Explore Frameworks

Try LangChain, CrewAI, or the Anthropic Agent SDK for building more sophisticated multi-agent systems with less boilerplate.

Go Deeper on RAG

Production RAG systems involve chunking strategies, embedding model selection, re-ranking, and hybrid search. Module 8 was just the beginning.

Contribute to Open Source

Contribute to MCP servers, A2A implementations, or agent frameworks. The ecosystem is young and welcomes contributors.

CONGRATULATIONS

You have reached the end of the series. The field is evolving rapidly: stay curious, keep building, and always prioritise safety.