
Production Checklist

This checklist ensures your FastAgentic deployment is ready for production traffic. Each section explains why it matters and how to implement it.


Quick Assessment

Rate your readiness (0-3 per category):

Category        0            1            2                  3
Observability   Nothing      Logs only    Basic tracing      Full analytics
Security        None         Auth only    + Guardrails       + PII handling
Reliability     No handling  Retry only   + Circuit breaker  + Fallbacks
Operations      Manual       Docker       Kubernetes         + CI/CD
Data            None         Checkpoints  + Encryption       + Backup

Score interpretation:

  • 0-5: Development only
  • 6-10: Staging / internal
  • 11-15: Production ready


Checklist

1. Observability

Why it matters: You can't fix what you can't see. In production you can't attach a debugger; traces, logs, and metrics are your only window into what the agent did.

Essential

  • Tracing enabled — Every request can be traced end-to-end

    from fastagentic import App
    from fastagentic.integrations.langfuse import LangfuseHook

    app = App(hooks=[LangfuseHook()])
    

  • Structured logging — JSON logs with request IDs

    app = App(
        log_format="json",
        log_level="INFO",
    )
    

  • Health endpoint — Kubernetes/load balancer can check status

    curl http://localhost:8000/health
    # {"status": "healthy", "version": "0.2.0"}
    

  • Cost tracking — Know how much each request costs

    LangfuseHook(
        track_cost=True,
        cost_per_user=True,
    )
    

  • Metrics exported — Prometheus can scrape metrics

    curl http://localhost:8000/metrics
    # fastagentic_requests_total{endpoint="/chat"} 1234
    

  • Alerting configured — Get notified on errors

    # alerts.yaml (Prometheus alerting rule)
    groups:
      - name: fastagentic
        rules:
          - alert: HighErrorRate
            expr: rate(fastagentic_errors_total[5m]) > 0.1
            for: 5m
    

Tools: Langfuse, Logfire, Datadog


2. Security

Why it matters: AI agents are attack surfaces. Prompt injection is LLM01, the top entry in the OWASP Top 10 for LLM Applications.

Essential

  • Authentication enabled — Know who's calling

    app = App(
        auth=OIDCAuth(
            issuer="https://auth.example.com",
            audience="my-agent",
        ),
    )
    

  • Prompt injection protection — Block malicious inputs

    from fastagentic.integrations.lakera import LakeraHook
    
    app = App(hooks=[LakeraHook(on_detection="reject")])
    

  • HTTPS only — Never expose HTTP in production

    # Terminate TLS at your load balancer or ingress and
    # redirect all HTTP traffic to HTTPS; never expose plain HTTP
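
If no load balancer or ingress sits in front of the process, ASGI servers can terminate TLS themselves. A minimal sketch with uvicorn, assuming the FastAgentic App is exposed as an ASGI application in main.py (the "main:app" target and certificate paths are examples):

    import uvicorn

    # Serve TLS directly from the process; prefer LB/ingress termination when available
    uvicorn.run(
        "main:app",
        host="0.0.0.0",
        port=8443,
        ssl_certfile="/etc/tls/tls.crt",
        ssl_keyfile="/etc/tls/tls.key",
    )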
    

  • Rate limiting — Prevent abuse

    app = App(
        rate_limit=RateLimit(
            rpm=60,
            by="user",
        ),
    )
    

  • PII handling — Detect and handle sensitive data

    LakeraHook(categories=["pii"])
    

  • Audit logging — Record who did what

    @hook("on_request")
    async def audit(ctx: HookContext):
        await audit_log.record(
            user=ctx.user.id,
            action=ctx.endpoint,
            input=ctx.request,
        )
    

  • Secrets management — No hardcoded API keys

    # Read keys from environment variables or a secrets manager;
    # never hardcode or commit them
    app = App()  # Reads OPENAI_API_KEY, etc. from the environment
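
It also helps to fail fast at startup when a required secret is missing, rather than erroring mid-request. A small sketch (the variable names are examples):

    import os

    # Fail fast at boot if a required secret is absent
    REQUIRED = ["OPENAI_API_KEY", "AUTH_ISSUER", "AUTH_AUDIENCE"]
    missing = [name for name in REQUIRED if not os.getenv(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")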
    

Tools: Lakera, Guardrails AI


3. Reliability

Why it matters: LLM APIs fail. Rate limits hit. Networks time out. Your agent should handle all of this gracefully.

Essential

  • Retry policy — Handle transient failures

    @agent_endpoint(
        retry=RetryPolicy(
            max_attempts=3,
            backoff="exponential",
            retry_on=["rate_limit", "timeout"],
        ),
    )
    

  • Timeouts — Don't hang forever

    @agent_endpoint(
        timeout=Timeout(
            total_ms=60000,      # 1 minute max
            llm_call_ms=30000,   # 30 seconds per LLM call
        ),
    )
    

  • Graceful degradation — Fail helpfully

    @hook("on_error")
    async def handle_error(ctx: HookContext):
        return {"error": "Service temporarily unavailable"}
    

  • Circuit breaker — Stop calling failing services

    @agent_endpoint(
        circuit_breaker=CircuitBreaker(
            failure_threshold=5,
            reset_timeout_ms=30000,
        ),
    )
    

  • Model fallbacks — Switch models when primary fails

    from fastagentic.integrations.portkey import PortkeyGateway
    
    app = App(
        llm_gateway=PortkeyGateway(
            config={
                "strategy": {"mode": "fallback"},
                "targets": [
                    {"virtual_key": "openai"},
                    {"virtual_key": "anthropic"},
                ],
            },
        ),
    )
    

  • Idempotency — Safe to retry requests

    @agent_endpoint(
        idempotency=True,  # Same request ID = same result
    )
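
With idempotency on, client retries become safe to fire blindly. A hedged sketch, assuming the server derives the request ID from an Idempotency-Key header (the header name is an assumption; check how your deployment identifies requests):

    import httpx

    # Reuse the same key on every retry so the server returns the cached result
    headers = {"Idempotency-Key": "run-2024-0001"}
    resp = httpx.post(
        "https://agent.example.com/chat",
        json={"message": "hello"},
        headers=headers,
    )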
    

Reference: Reliability Patterns


4. Durability

Why it matters: Long-running agent workflows should survive crashes and restarts.

Essential

  • Durable store configured — Checkpoints persist

    app = App(
        durable_store="redis://localhost:6379",
    )
    

  • Checkpointing enabled — Resume from failure

    @agent_endpoint(
        durable=True,  # Auto-checkpoint
    )
    

  • Backup strategy — Don't lose data

    # Redis: enable RDB snapshots or AOF persistence (appendonly yes)
    # Postgres: schedule regular dumps, e.g. pg_dump piped to compressed storage
    

  • Encryption at rest — Protect checkpoint data

    app = App(
        durable_store="redis://localhost:6379",
        encryption_key=os.getenv("CHECKPOINT_ENCRYPTION_KEY"),
    )
    

  • TTL configured — Clean up old checkpoints

    app = App(
        checkpoint_ttl_hours=168,  # 7 days
    )
    

Reference: Architecture - Durability


5. Operations

Why it matters: Deploying, monitoring, and updating your agent should be automated and safe.

Essential

  • Containerized — Consistent deployment

    FROM python:3.11-slim
    WORKDIR /app
    COPY . .
    RUN pip install fastagentic
    CMD ["fastagentic", "run", "--host", "0.0.0.0"]
    

  • Resource limits — Prevent runaway consumption

    resources:
      requests:
        memory: "256Mi"
        cpu: "200m"
      limits:
        memory: "1Gi"
        cpu: "1000m"
    

  • Readiness probe — Don't route traffic until ready

    readinessProbe:
      httpGet:
        path: /health
        port: 8000
      initialDelaySeconds: 5
    

  • Horizontal scaling — Handle traffic spikes

    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: my-agent        # your Deployment's name
      minReplicas: 2
      maxReplicas: 10
      metrics:
        - type: Resource
          resource:
            name: cpu
            target:
              type: Utilization
              averageUtilization: 70
    

  • Rolling updates — Zero-downtime deployments

    strategy:
      type: RollingUpdate
      rollingUpdate:
        maxUnavailable: 0
        maxSurge: 1
    

  • CI/CD pipeline — Automated testing and deployment

    # .github/workflows/deploy.yml
    - run: fastagentic test contract
    - run: fastagentic test integration
    - run: kubectl apply -f k8s/
    

Reference: Docker, Kubernetes


6. Testing

Why it matters: Untested agents break in production. Contract tests catch regressions.

Essential

  • Contract tests — API contracts are validated

    fastagentic test contract
    # ✓ /chat - input schema valid
    # ✓ /chat - output schema valid
    

  • Integration tests — End-to-end flows work

    # "client" is an async HTTP test client bound to the app (e.g. httpx.AsyncClient)
    async def test_chat_endpoint():
        response = await client.post("/chat", json={"message": "hello"})
        assert response.status_code == 200
    

  • Load tests — Know your limits

    locust -f loadtest.py --host http://localhost:8000
    # Requests/s, p99 latency, error rate
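
A minimal loadtest.py for the command above (the payload mirrors the /chat example; tune user counts and wait times to your expected traffic):

    # loadtest.py — one simulated user type hitting the chat endpoint
    from locust import HttpUser, between, task

    class ChatUser(HttpUser):
        wait_time = between(1, 3)  # seconds of think time between requests

        @task
        def chat(self):
            self.client.post("/chat", json={"message": "hello"})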
    

  • Evaluation baseline — Track quality over time

    BraintrustHook(
        project="my-agent",
        baseline_experiment="v1.0",  # Compare against baseline
    )
    


7. Documentation

Why it matters: Your team needs to operate and debug the system.

Essential

  • API documented — Others can use your agent

    curl http://localhost:8000/openapi.json
    # or http://localhost:8000/docs (Swagger UI)
    

  • Runbook exists — How to handle common issues

      • What to do when error rate spikes
      • How to restart stuck runs
      • How to roll back deployments

  • Architecture diagram — Visual overview

  • Incident playbook — Step-by-step for outages

  • On-call guide — Who to contact, escalation paths

Quick Start: Minimum Viable Production

Copy this configuration as your production baseline:

import os

from fastagentic import App, agent_endpoint
from fastagentic.auth import OIDCAuth        # adjust import path to your version
from fastagentic.ratelimit import RateLimit  # adjust import path to your version
from fastagentic.integrations.langfuse import LangfuseHook
from fastagentic.integrations.lakera import LakeraHook
from fastagentic.reliability import RetryPolicy, Timeout

app = App(
    title="My Production Agent",
    version="1.0.0",

    # Observability + guardrail hooks
    hooks=[
        LangfuseHook(),
        LakeraHook(on_detection="reject"),
    ],

    # Durability (point this at your production Redis)
    durable_store="redis://localhost:6379",

    # Auth
    auth=OIDCAuth(
        issuer=os.getenv("AUTH_ISSUER"),
        audience=os.getenv("AUTH_AUDIENCE"),
    ),

    # Rate limiting
    rate_limit=RateLimit(rpm=60, by="user"),
)

@agent_endpoint(
    path="/chat",
    runnable=...,
    durable=True,
    retry=RetryPolicy(max_attempts=3, backoff="exponential"),
    timeout=Timeout(total_ms=60000),
)
async def chat(message: str) -> str:
    ...  # implementation is provided by the runnable configured above

Common Production Issues

"My agent is slow"

  1. Check trace latency in Langfuse
  2. Is it the LLM call or tool execution?
  3. Consider caching with Portkey (see the sketch below)
  4. Review timeout configuration
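
For step 3, Portkey-style gateways can cache responses through their config. A sketch reusing the gateway setup from the Reliability section (the cache block follows Portkey's documented simple/semantic modes; verify against your gateway version):

    from fastagentic.integrations.portkey import PortkeyGateway

    app = App(
        llm_gateway=PortkeyGateway(
            config={
                # Serve semantically similar prompts from cache for up to an hour
                "cache": {"mode": "semantic", "max_age": 3600},
            },
        ),
    )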

"I'm getting rate limited"

  1. Check RPM/TPM usage in observability
  2. Implement request queuing (see the semaphore sketch below)
  3. Add model fallbacks via Portkey
  4. Consider caching similar requests
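
A minimal in-process version of step 2, bounding concurrent upstream calls with an asyncio semaphore (the limit and the call_llm helper are illustrative):

    import asyncio

    LLM_CONCURRENCY = asyncio.Semaphore(10)  # at most 10 in-flight upstream calls

    async def call_llm_queued(prompt: str) -> str:
        # Excess requests wait here instead of bursting into the provider's limiter
        async with LLM_CONCURRENCY:
            return await call_llm(prompt)  # call_llm: your existing client helper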

"Users are injecting prompts"

  1. Enable Lakera with on_detection="reject"
  2. Review flagged inputs in logs
  3. Tune detection categories
  4. Consider fail-closed (on_failure="reject")

"Checkpoints are filling up disk"

  1. Configure TTL: checkpoint_ttl_hours=168
  2. Set up periodic cleanup job
  3. Archive old checkpoints to S3 (sketch below)
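
A sketch of step 3 as a periodic cleanup job, assuming checkpoints live in Redis under a checkpoint:* key prefix (the key layout and bucket name are assumptions):

    import boto3
    import redis

    r = redis.Redis.from_url("redis://localhost:6379")
    s3 = boto3.client("s3")

    # Move each checkpoint blob to S3, then free the Redis memory
    for key in r.scan_iter("checkpoint:*"):
        s3.put_object(Bucket="agent-checkpoints", Key=key.decode(), Body=r.dump(key))
        r.delete(key)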

Next Steps