Skip to content

Production Checklist

Essential steps for production deployment.

Security

Authentication

  • [ ] Set strong JWT secret

    M9M_JWT_SECRET=$(openssl rand -hex 32)
    

  • [ ] Configure JWT expiration

    security:
      jwtExpiration: 24h
      refreshTokenExpiration: 168h
    

  • [ ] Enable API key authentication for service accounts

Encryption

  • [ ] Set encryption key for credentials

    M9M_ENCRYPTION_KEY=$(openssl rand -hex 16)
    

  • [ ] Enable TLS/HTTPS

    server:
      tls:
        enabled: true
        certFile: /etc/m9m/tls.crt
        keyFile: /etc/m9m/tls.key
    

Network

  • [ ] Configure CORS appropriately

    security:
      cors:
        allowedOrigins:
          - "https://app.example.com"
        allowedMethods:
          - GET
          - POST
          - PUT
          - DELETE
    

  • [ ] Set up firewall rules

  • [ ] Use private network for database connections
  • [ ] Enable rate limiting
    security:
      rateLimit:
        enabled: true
        requests: 100
        window: 60
    

Access Control

  • [ ] Disable debug endpoints in production

    server:
      enablePprof: false
    

  • [ ] Use minimal permissions for service accounts

  • [ ] Audit credential access

Database

  • [ ] Use PostgreSQL for production

    database:
      type: postgres
      url: "postgres://user:pass@host:5432/m9m?sslmode=require"
    

  • [ ] Enable SSL connections

  • [ ] Configure connection pooling
    database:
      maxOpenConns: 25
      maxIdleConns: 5
      connMaxLifetime: 5m
    

Backups

  • [ ] Set up automated backups

    pg_dump -h localhost -U m9m m9m > backup.sql
    

  • [ ] Test backup restoration

  • [ ] Store backups in separate location
  • [ ] Configure retention policy

Performance

  • [ ] Add database indexes
  • [ ] Monitor query performance
  • [ ] Set up connection pooling (PgBouncer)

High Availability

Multiple Instances

  • [ ] Run at least 2 replicas

    replicas: 2
    

  • [ ] Configure load balancer

  • [ ] Enable health checks

Queue

  • [ ] Use Redis for distributed queue

    queue:
      type: redis
      url: "redis://redis:6379"
    

  • [ ] Configure Redis persistence

  • [ ] Set up Redis Sentinel or Cluster

Failover

  • [ ] Configure pod anti-affinity
  • [ ] Set up PodDisruptionBudget
  • [ ] Test failover scenarios

Monitoring

Metrics

  • [ ] Enable Prometheus metrics

    monitoring:
      enabled: true
      metricsPort: 9090
    

  • [ ] Set up Grafana dashboards

  • [ ] Configure alerting rules

Key Metrics to Monitor

Metric Alert Threshold
m9m_execution_errors > 5 per minute
m9m_execution_duration p99 > 30s
m9m_queue_size > 1000
m9m_active_workflows Unexpected changes
Memory usage > 80%
CPU usage > 80%

Logging

  • [ ] Use JSON log format

    logging:
      format: json
      level: info
    

  • [ ] Set up log aggregation (ELK, Loki)

  • [ ] Configure log retention
  • [ ] Don't log sensitive data

Tracing

  • [ ] Enable distributed tracing
    tracing:
      enabled: true
      endpoint: "http://jaeger:14268/api/traces"
    

Performance

Resource Allocation

  • [ ] Set appropriate resource limits
    resources:
      requests:
        memory: "256Mi"
        cpu: "250m"
      limits:
        memory: "512Mi"
        cpu: "1000m"
    

Autoscaling

  • [ ] Configure HPA
    autoscaling:
      enabled: true
      minReplicas: 2
      maxReplicas: 10
      targetCPU: 70
    

Timeouts

  • [ ] Set appropriate timeouts
    server:
      readTimeout: 30s
      writeTimeout: 30s
      idleTimeout: 120s
    

Operations

Deployment

  • [ ] Use container orchestration (Kubernetes)
  • [ ] Implement blue/green or rolling deployments
  • [ ] Version container images
  • [ ] Don't use latest tag in production

Configuration

  • [ ] Use environment variables or secrets manager
  • [ ] Don't commit secrets to version control
  • [ ] Use ConfigMaps for non-sensitive config

Updates

  • [ ] Plan maintenance windows
  • [ ] Test updates in staging first
  • [ ] Document rollback procedures
  • [ ] Keep dependencies updated

Disaster Recovery

Backup Strategy

Data Frequency Retention
Database Daily 30 days
Workflows On change Indefinite
Credentials Daily 30 days
Config On change Version controlled

Recovery Plan

  • [ ] Document recovery procedures
  • [ ] Test recovery regularly
  • [ ] Define RTO and RPO
  • [ ] Have runbooks ready

Compliance

Audit Logging

  • [ ] Enable audit logging

    audit:
      enabled: true
      logLevel: info
    

  • [ ] Log authentication events

  • [ ] Log workflow modifications
  • [ ] Log credential access

Data Protection

  • [ ] Encrypt data at rest
  • [ ] Encrypt data in transit (TLS)
  • [ ] Implement data retention policies
  • [ ] Handle PII appropriately

Pre-Launch Checklist

Final Verification

  • [ ] All secrets are properly configured
  • [ ] TLS certificates are valid and not expiring soon
  • [ ] Database backups are working
  • [ ] Monitoring and alerting are configured
  • [ ] Health checks are passing
  • [ ] Load testing completed
  • [ ] Security scan completed
  • [ ] Documentation is up to date
  • [ ] Runbooks are ready
  • [ ] Support contacts are defined

Go-Live

  • [ ] Announce maintenance window
  • [ ] Deploy to production
  • [ ] Verify health checks
  • [ ] Test critical workflows
  • [ ] Monitor metrics
  • [ ] Verify alerts work
  • [ ] Celebrate!