Production Checklist¶
This checklist ensures your FastAgentic deployment is ready for production traffic. Each section explains why it matters and how to implement it.
Quick Assessment¶
Rate your readiness (0-3 per category):
| Category | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| Observability | Nothing | Logs only | Basic tracing | Full analytics |
| Security | None | Auth only | + Guardrails | + PII handling |
| Reliability | No handling | Retry only | + Circuit breaker | + Fallbacks |
| Operations | Manual | Docker | Kubernetes | + CI/CD |
| Data | None | Checkpoints | + Encryption | + Backup |
Score interpretation:

- 0-5: Development only
- 6-10: Staging / internal
- 11-15: Production ready
Checklist¶
1. Observability¶
Why it matters: You can't fix what you can't see. In production, traces and logs are your only window into what an agent actually did.
Essential¶
- [ ] Tracing enabled — Every request can be traced end-to-end
- [ ] Structured logging — JSON logs with request IDs
- [ ] Health endpoint — Kubernetes/load balancer can check status
Recommended¶
- [ ] Cost tracking — Know how much each request costs
- [ ] Metrics exported — Prometheus can scrape metrics
- [ ] Alerting configured — Get notified on errors
Tools: Langfuse, Logfire, Datadog
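Structured logging needs nothing beyond the standard library. A minimal sketch (not tied to any FastAgentic API; the `JsonFormatter` class and field names are illustrative) that emits one JSON object per log line, carrying a request ID so every log entry can be joined back to its trace:

```python
import json
import logging
import uuid

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line."""
    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            # Attached per-request via logger.info(..., extra={...})
            "request_id": getattr(record, "request_id", None),
        })

def make_logger() -> logging.Logger:
    logger = logging.getLogger("agent")
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger

logger = make_logger()
request_id = str(uuid.uuid4())
logger.info("chat request received", extra={"request_id": request_id})
```

Because every line is valid JSON, log aggregators can filter by `request_id` without regex parsing.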
2. Security¶
Why it matters: AI agents are attack surfaces. Prompt injection is the #1 OWASP risk for LLM apps.
Essential¶
- [ ] Authentication enabled — Know who's calling
- [ ] Prompt injection protection — Block malicious inputs
- [ ] HTTPS only — Never expose HTTP in production
Recommended¶
- [ ] Rate limiting — Prevent abuse
- [ ] PII handling — Detect and handle sensitive data
- [ ] Audit logging — Record who did what
- [ ] Secrets management — No hardcoded API keys
Tools: Lakera, Guardrails AI
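Per-user rate limiting usually comes down to a token bucket: each user gets a budget that refills continuously over time. A minimal sketch of the idea (the `RateLimiter` class here is illustrative, not a FastAgentic API):

```python
import time
from collections import defaultdict

class RateLimiter:
    """Per-user token bucket allowing `rpm` requests per minute."""
    def __init__(self, rpm: int):
        self.rpm = rpm
        # Each user starts with a full bucket.
        self.tokens = defaultdict(lambda: float(rpm))
        self.last = defaultdict(time.monotonic)

    def allow(self, user: str) -> bool:
        now = time.monotonic()
        elapsed = now - self.last[user]
        self.last[user] = now
        # Refill in proportion to elapsed time, capped at the bucket size.
        self.tokens[user] = min(self.rpm, self.tokens[user] + elapsed * self.rpm / 60)
        if self.tokens[user] >= 1:
            self.tokens[user] -= 1
            return True
        return False
```

The bucket allows short bursts up to `rpm` requests while enforcing the average rate, which tends to be friendlier to real clients than a fixed-window counter.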
3. Reliability¶
Why it matters: LLM APIs fail. Rate limits hit. Networks timeout. Your agent should handle this gracefully.
Essential¶
- [ ] Retry policy — Handle transient failures
- [ ] Timeouts — Don't hang forever
- [ ] Graceful degradation — Fail helpfully
Recommended¶
- [ ] Circuit breaker — Stop calling failing services
- [ ] Model fallbacks — Switch models when primary fails
- [ ] Idempotency — Safe to retry requests
Reference: Reliability Patterns
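Retry with exponential backoff, jitter, and a total time budget is the core of most retry policies. A minimal, framework-agnostic sketch (the `retry` helper is illustrative, not FastAgentic's `RetryPolicy`):

```python
import random
import time

def retry(fn, max_attempts=3, base_delay=0.5, budget=30.0):
    """Call fn, retrying on exception with exponential backoff plus jitter.

    Gives up after max_attempts or once the total time budget is spent,
    re-raising the last exception.
    """
    start = time.monotonic()
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts or time.monotonic() - start > budget:
                raise
            # Backoff doubles each attempt: 0.5s, 1s, 2s, ...
            # Jitter spreads out retries so clients don't stampede in sync.
            time.sleep(base_delay * 2 ** (attempt - 1) + random.uniform(0, 0.1))
```

The time budget matters as much as the attempt count: three retries of a 60-second call can hold a request open for minutes without it.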
4. Durability¶
Why it matters: Long-running agent workflows should survive crashes and restarts.
Essential¶
- [ ] Durable store configured — Checkpoints persist
- [ ] Checkpointing enabled — Resume from failure
Recommended¶
- [ ] Backup strategy — Don't lose data
- [ ] Encryption at rest — Protect checkpoint data
- [ ] TTL configured — Clean up old checkpoints
Reference: Architecture - Durability
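TTL-based cleanup is just "delete anything older than the cutoff". A minimal sketch of the selection logic, assuming you can list checkpoints with their creation timestamps (the function and data shape are illustrative, not a FastAgentic API):

```python
import time

def expired_checkpoints(checkpoints: dict, ttl_hours: float, now: float = None):
    """Return IDs of checkpoints older than the TTL.

    `checkpoints` maps checkpoint ID -> creation time (epoch seconds).
    """
    now = time.time() if now is None else now
    cutoff = now - ttl_hours * 3600
    return [cid for cid, created in checkpoints.items() if created < cutoff]
```

Run this from a periodic job (cron, Kubernetes CronJob), and archive the returned IDs to cold storage before deleting if you need them for audits.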
5. Operations¶
Why it matters: Deploying, monitoring, and updating your agent should be automated and safe.
Essential¶
- [ ] Containerized — Consistent deployment
- [ ] Resource limits — Prevent runaway consumption
- [ ] Readiness probe — Don't route traffic until ready
Recommended¶
- [ ] Horizontal scaling — Handle traffic spikes
- [ ] Rolling updates — Zero-downtime deployments
- [ ] CI/CD pipeline — Automated testing and deployment
Reference: Docker, Kubernetes
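The key operational distinction is liveness ("the process is up") versus readiness ("it's safe to route traffic here"). A stdlib-only sketch of the two endpoints (the paths and `READY` flag are illustrative conventions, not FastAgentic behavior):

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Flip to True once warmup completes (model clients built, durable store pinged).
READY = {"ok": False}

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":      # liveness: process is up
            self._reply(200, {"status": "alive"})
        elif self.path == "/readyz":     # readiness: safe to route traffic
            code = 200 if READY["ok"] else 503
            self._reply(code, {"ready": READY["ok"]})
        else:
            self._reply(404, {"error": "not found"})

    def _reply(self, code, body):
        self.send_response(code)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(json.dumps(body).encode())
```

Kubernetes restarts a pod that fails liveness but only withholds traffic from one that fails readiness, so a slow model warmup should fail `/readyz` without ever failing `/healthz`.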
6. Testing¶
Why it matters: Untested agents break in production. Contract tests catch regressions.
Essential¶
- [ ] Contract tests — API contracts are validated
- [ ] Integration tests — End-to-end flows work
Recommended¶
- [ ] Load tests — Know your limits
- [ ] Evaluation baseline — Track quality over time
7. Documentation¶
Why it matters: Your team needs to operate and debug the system.
Essential¶
- [ ] API documented — Others can use your agent
- [ ] Runbook exists — How to handle common issues:
    - What to do when error rate spikes
    - How to restart stuck runs
    - How to roll back deployments
Recommended¶
- [ ] Architecture diagram — Visual overview
- [ ] Incident playbook — Step-by-step for outages
- [ ] On-call guide — Who to contact, escalation paths
Quick Start: Minimum Viable Production¶
Copy this configuration as your production baseline:
```python
import os

from fastagentic import App, agent_endpoint
from fastagentic.integrations.langfuse import LangfuseHook
from fastagentic.integrations.lakera import LakeraHook
from fastagentic.reliability import RetryPolicy, Timeout
# OIDCAuth and RateLimit are used below; import them from wherever
# your FastAgentic version exposes them.

app = App(
    title="My Production Agent",
    version="1.0.0",
    # Observability
    hooks=[
        LangfuseHook(),
        LakeraHook(on_detection="reject"),
    ],
    # Durability
    durable_store="redis://localhost:6379",
    # Auth
    auth=OIDCAuth(
        issuer=os.getenv("AUTH_ISSUER"),
        audience=os.getenv("AUTH_AUDIENCE"),
    ),
    # Rate limiting
    rate_limit=RateLimit(rpm=60, by="user"),
)

@agent_endpoint(
    path="/chat",
    runnable=...,
    durable=True,
    retry=RetryPolicy(max_attempts=3, backoff="exponential"),
    timeout=Timeout(total_ms=60000),
)
async def chat(message: str) -> str:
    pass
```
Common Production Issues¶
"My agent is slow"¶
- Check trace latency in Langfuse
- Is it the LLM call or tool execution?
- Consider caching with Portkey
- Review timeout configuration
"I'm getting rate limited"¶
- Check RPM/TPM usage in observability
- Implement request queuing
- Add model fallbacks via Portkey
- Consider caching similar requests
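Caching similar requests usually means keying on a normalized form of the prompt so trivial differences (case, extra whitespace) still hit the cache. A minimal sketch of that idea (the `ResponseCache` class is illustrative; Portkey and similar gateways provide this out of the box):

```python
import hashlib

class ResponseCache:
    """Cache LLM responses keyed by a normalized prompt hash."""
    def __init__(self):
        self.store = {}

    def key(self, prompt: str) -> str:
        # Lowercase and collapse whitespace so near-identical prompts collide.
        normalized = " ".join(prompt.lower().split())
        return hashlib.sha256(normalized.encode()).hexdigest()

    def get(self, prompt: str):
        return self.store.get(self.key(prompt))

    def put(self, prompt: str, response: str):
        self.store[self.key(prompt)] = response
```

Exact-match caching like this only helps with repeated prompts; for paraphrases you'd need semantic (embedding-based) caching, at the cost of occasional false hits.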
"Users are injecting prompts"¶
- Enable Lakera with `on_detection="reject"`
- Review flagged inputs in logs
- Tune detection categories
- Consider fail-closed (`on_failure="reject"`)
"Checkpoints are filling up disk"¶
- Configure TTL: `checkpoint_ttl_hours=168`
- Set up periodic cleanup job
- Archive old checkpoints to S3