
Alerting Guide

Configure alerts for FastAgentic applications to detect issues before they impact users.

Alert Categories

Category       Severity   Response Time
Availability   Critical   Immediate
Performance    Warning    15 minutes
Cost           Warning    1 hour
Capacity       Info       Next business day
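
How aggressively each category escalates can be encoded in Alertmanager routing. The fragment below is an illustrative sketch only: the receiver names match the full configuration later in this guide, but the intervals are placeholder values to tune against the response times above.

route:
  receiver: 'default'
  routes:
    # Availability (critical): page immediately, keep re-notifying until resolved
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
      group_wait: 0s
      repeat_interval: 30m
    # Performance and cost (warning): Slack is enough, re-notify a few times a day
    - match:
        severity: warning
      receiver: 'slack-warnings'
      repeat_interval: 4h
    # Capacity (info): one notification per day
    - match:
        severity: info
      receiver: 'slack-info'
      repeat_interval: 24h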

Prometheus Alert Rules

prometheus-rules.yaml

groups:
  - name: fastagentic-availability
    rules:
      - alert: FastAgenticDown
        expr: up{job="fastagentic"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "FastAgentic instance down"
          description: "{{ $labels.instance }} has been down for more than 1 minute"
          runbook: "https://docs.example.com/runbooks/fastagentic-down"

      - alert: HighErrorRate
        expr: |
          sum by (path) (rate(fastagentic_requests_total{status=~"5.."}[5m]))
          / sum by (path) (rate(fastagentic_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on {{ $labels.path }}"
          description: "Error rate is {{ $value | humanizePercentage }}"
          runbook: "https://docs.example.com/runbooks/high-error-rate"

      - alert: DurableStoreDown
        expr: fastagentic_durable_store_up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Durable store unavailable"
          description: "Checkpoint storage is down"
          runbook: "https://docs.example.com/runbooks/durable-store-down"

  - name: fastagentic-performance
    rules:
      - alert: HighLatency
        expr: |
          histogram_quantile(0.95, rate(fastagentic_request_duration_seconds_bucket[5m])) > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency on {{ $labels.path }}"
          description: "P95 latency is {{ $value | humanizeDuration }}"

      - alert: SlowCheckpoints
        expr: |
          histogram_quantile(0.95, rate(fastagentic_checkpoint_duration_seconds_bucket[5m])) > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Slow checkpoint writes"
          description: "P95 checkpoint time is {{ $value | humanizeDuration }}"

      - alert: HighRunDuration
        expr: |
          histogram_quantile(0.95, rate(fastagentic_run_duration_seconds_bucket[5m])) > 300
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Agent runs taking too long"
          description: "P95 run duration is {{ $value | humanizeDuration }}"

  - name: fastagentic-cost
    rules:
      - alert: HighCostPerHour
        expr: sum(rate(fastagentic_cost_usd_total[1h])) * 3600 > 100
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "High hourly cost"
          description: "Hourly cost is ${{ $value | printf \"%.2f\" }}"

      - alert: TenantCostExceeded
        expr: |
          sum(increase(fastagentic_cost_usd_total[24h])) by (tenant) > 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Tenant {{ $labels.tenant }} exceeded daily budget"
          description: "Daily cost is ${{ $value | printf \"%.2f\" }}"

      - alert: UnusualTokenUsage
        expr: |
          sum(rate(fastagentic_tokens_total[1h]))
          > 2 * sum(rate(fastagentic_tokens_total[1h] offset 1d))
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Unusual token usage spike"
          description: "Token usage 2x higher than yesterday"

  - name: fastagentic-capacity
    rules:
      - alert: HighConcurrency
        expr: sum(fastagentic_runs_active) / count(fastagentic_runs_active) > 0.8
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High concurrent run utilization"
          description: "{{ $value | humanizePercentage }} capacity used"

      - alert: RateLimitHits
        expr: sum(rate(fastagentic_rate_limit_hits_total[5m])) > 10
        for: 5m
        labels:
          severity: info
        annotations:
          summary: "Users hitting rate limits"
          description: "{{ $value }} rate limit hits per second"

      - alert: QuotaNearLimit
        expr: fastagentic_quota_usage_ratio > 0.9
        for: 5m
        labels:
          severity: info
        annotations:
          summary: "User {{ $labels.user }} near quota limit"
          description: "{{ $value | humanizePercentage }} of quota used"
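
To load these rules, reference the file from the Prometheus server configuration and reload Prometheus. A minimal sketch, assuming the rules file sits next to prometheus.yml and that a fastagentic scrape job already exists (the target address is a placeholder):

prometheus.yml (fragment)

rule_files:
  - 'prometheus-rules.yaml'

scrape_configs:
  - job_name: 'fastagentic'
    static_configs:
      - targets: ['fastagentic:8000']  # replace with your actual endpoint(s)

Running promtool check rules prometheus-rules.yaml validates the file before you reload.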

Alertmanager Configuration

alertmanager.yml

global:
  resolve_timeout: 5m
  slack_api_url: 'https://hooks.slack.com/services/...'

route:
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
      continue: true
    - match:
        severity: warning
      receiver: 'slack-warnings'
    - match:
        severity: info
      receiver: 'slack-info'

receivers:
  - name: 'default'
    slack_configs:
      - channel: '#alerts'

  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: '<pagerduty-service-key>'
        severity: critical

  - name: 'slack-warnings'
    slack_configs:
      - channel: '#agent-alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'

  - name: 'slack-info'
    slack_configs:
      - channel: '#agent-info'
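
Prometheus also has to be pointed at Alertmanager before any of the rules above produce notifications. A minimal sketch, assuming Alertmanager is reachable at alertmanager:9093 (adjust the target for your deployment):

prometheus.yml (fragment)

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']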

Runbook Templates

High Error Rate

Alert: HighErrorRate

Symptoms:
- HTTP 5xx responses > 5%
- Users reporting failures

Diagnosis:
1. Check application logs: fastagentic tail --level ERROR
2. Verify LLM provider status
3. Check durable store connectivity
4. Review recent deployments

Resolution:
1. If the LLM provider is down: Enable a fallback model
2. If the durable store is down: See the DurableStoreDown runbook
3. If caused by a recent deployment: Roll back
4. If unknown: Escalate to the on-call engineer

Durable Store Down

Alert: DurableStoreDown

Symptoms:
- Checkpoints failing
- Runs cannot resume
- New runs may fail

Diagnosis:
1. Check store health: fastagentic inspect --config
2. Verify network connectivity
3. Check store logs (Redis/Postgres)

Resolution:
1. If network issue: Check security groups and DNS
2. If the store crashed: Restart the store service
3. If disk full: Expand storage
4. Fail over to a backup if available

High Cost

Alert: HighCostPerHour

Symptoms:
- Cost exceeding budget
- Possible abuse or bug

Diagnosis:
1. Identify high-cost users: Query fastagentic_cost_usd_total by tenant (see the recording-rule sketch below)
2. Check for unusual patterns
3. Review recent changes
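
For step 1, the per-tenant breakdown can be precomputed as a Prometheus recording rule so it is cheap to query during an incident. A sketch, assuming fastagentic_cost_usd_total carries a tenant label (as the TenantCostExceeded alert above already assumes); the record name is only a suggestion:

groups:
  - name: fastagentic-cost-recording
    rules:
      # Rolling 24h spend per tenant, in USD
      - record: tenant:fastagentic_cost_usd:increase24h
        expr: sum by (tenant) (increase(fastagentic_cost_usd_total[24h]))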

Resolution:
1. If abuse: Apply rate limits or block the user
2. If a bug: Fix and deploy
3. If legitimate growth: Adjust alert thresholds and add capacity

Integration Examples

PagerDuty

receivers:
  - name: 'pagerduty'
    pagerduty_configs:
      - routing_key: '<routing-key>'
        severity: '{{ .GroupLabels.severity }}'
        description: '{{ .GroupLabels.alertname }}'
        details:
          summary: '{{ .CommonAnnotations.summary }}'
          description: '{{ .CommonAnnotations.description }}'
          runbook: '{{ .CommonAnnotations.runbook }}'

Slack

receivers:
  - name: 'slack'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/...'
        channel: '#alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: |
          *Summary:* {{ .CommonAnnotations.summary }}
          *Description:* {{ .CommonAnnotations.description }}
          *Runbook:* {{ .CommonAnnotations.runbook }}
        actions:
          - type: button
            text: 'Runbook'
            url: '{{ .CommonAnnotations.runbook }}'

OpsGenie

receivers:
  - name: 'opsgenie'
    opsgenie_configs:
      - api_key: '<api-key>'
        message: '{{ .GroupLabels.alertname }}'
        priority: '{{ if eq .GroupLabels.severity "critical" }}P1{{ else }}P3{{ end }}'

Next Steps