Production Deployment Checklist

Ensuring Your Application is Ready for Production

Introduction to Production Readiness

Deploying an application to production is a significant milestone that requires careful planning and preparation. Unlike development or staging environments, production serves real users and handles real data. A hastily deployed application can lead to downtime, security breaches, data loss, and reputational damage.

Think of preparing for production deployment like preparing for a space launch: every system must be checked, redundancies must be in place, and contingency plans must be ready. There's no quick "fix it in production" – everything needs to be right the first time.

graph TD
    A[Development] --> B[Testing/QA]
    B --> C[Staging]
    C --> D[Pre-Production Checklist]
    D --> E[Production Deployment]

    subgraph "Pre-Production Checklist"
        F[Functionality]
        G[Performance]
        H[Security]
        I[Reliability]
        J[Scalability]
        K[Monitoring]
        L[Documentation]
        M[Compliance]
        N[Backup/Recovery]
        O[Deployment Plan]
    end

In this lecture, we'll cover a comprehensive production checklist to ensure your application is truly ready for prime time. We'll approach this from multiple angles: technical requirements, operational considerations, security aspects, and business needs.

Functionality Verification

Before deployment, ensure that all features work as expected in an environment that mirrors production as closely as possible.

Core Feature Testing

Cross-browser and Device Testing

Acceptance Testing

graph LR
    A[Feature Development] --> B[Unit Tests]
    B --> C[Integration Tests]
    C --> D[End-to-End Tests]
    D --> E[UAT]
    E --> F[Production Ready]
    G[Cross-browser Testing] --> F
    H[Accessibility Testing] --> F
    I[Security Testing] --> F

Real-world example: A financial services company had completed all technical testing for their new online banking platform. However, during final UAT, they discovered that their international customers couldn't complete wire transfers due to a form validation issue with international account numbers. Finding this before production prevented a significant business impact.
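
To make the kind of validation involved more concrete, here is a minimal sketch of an international account number check. It assumes the account numbers are IBANs, which the example above does not actually state, and the function name isValidIban is hypothetical. It applies only the generic mod-97 checksum and does not enforce country-specific lengths, so treat it as an illustration rather than a complete validator.

```javascript
// Sketch: validate an IBAN using the mod-97 checksum.
// Assumption: "international account numbers" means IBANs.
function isValidIban(input) {
  const iban = String(input).replace(/\s+/g, '').toUpperCase();

  // Two-letter country code, two check digits, then up to 30 alphanumerics.
  if (!/^[A-Z]{2}\d{2}[A-Z0-9]{1,30}$/.test(iban)) return false;

  // Move the first four characters to the end, then map A..Z to 10..35.
  const rearranged = iban.slice(4) + iban.slice(0, 4);
  let remainder = 0;
  for (const ch of rearranged) {
    const value = ch >= 'A' ? ch.charCodeAt(0) - 55 : Number(ch);
    // Fold each decimal digit into a running remainder to avoid big integers.
    for (const digit of String(value)) {
      remainder = (remainder * 10 + Number(digit)) % 97;
    }
  }
  return remainder === 1;
}

console.log(isValidIban('GB82 WEST 1234 5698 7654 32'));
```

A UAT script that feeds a few real-format numbers from each supported country through the form is one way to catch this class of issue before launch.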

Performance Optimization

Performance issues that were tolerable in development become critical in production. Verify that your application performs well under expected loads.

Front-end Performance

// Example webpack.config.js for production optimization
const TerserPlugin = require('terser-webpack-plugin');
const CssMinimizerPlugin = require('css-minimizer-webpack-plugin');

module.exports = {
  mode: 'production',
  optimization: {
    minimizer: [
      new TerserPlugin({
        terserOptions: {
          compress: {
            drop_console: true,
          },
        },
      }),
      new CssMinimizerPlugin(),
    ],
    splitChunks: {
      chunks: 'all',
      maxInitialRequests: Infinity,
      minSize: 20000,
      cacheGroups: {
        vendor: {
          test: /[\\/]node_modules[\\/]/,
          name(module) {
            const packageName = module.context.match(
              /[\\/]node_modules[\\/](.*?)([\\/]|$)/
            )[1];
            return `npm.${packageName.replace('@', '')}`;
          },
        },
      },
    },
  },
  // ... other webpack configuration
};

Back-end Performance

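// Example Redis caching middleware for Express
// Note: this uses the callback-style API of node-redis v3; v4 and later are promise-based.
const redis = require('redis');
const client = redis.createClient(process.env.REDIS_URL);

const cacheMiddleware = (duration) => {
  return (req, res, next) => {
    // Only cache idempotent reads
    if (req.method !== 'GET') return next();

    const key = `__express__${req.originalUrl || req.url}`;

    client.get(key, (err, data) => {
      if (!err && data) {
        // Cache hit
        try {
          return res.json(JSON.parse(data));
        } catch (parseErr) {
          // Corrupt cache entry: fall through and regenerate the response
        }
      }

      // Cache miss (or Redis error) - keep a reference to the original res.json
      const originalJson = res.json;

      // Override res.json to cache successful responses before sending them
      res.json = function (body) {
        if (res.statusCode < 400) {
          client.setex(key, duration, JSON.stringify(body));
        }
        return originalJson.call(this, body);
      };

      next();
    });
  };
};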

Load Testing

// Example k6 load testing script
import http from 'k6/http';
import { sleep, check } from 'k6';

export const options = {
  stages: [
    { duration: '1m', target: 50 },   // Ramp up to 50 users over 1 minute
    { duration: '3m', target: 50 },   // Stay at 50 users for 3 minutes
    { duration: '1m', target: 100 },  // Ramp up to 100 users over 1 minute
    { duration: '5m', target: 100 },  // Stay at 100 users for 5 minutes
    { duration: '1m', target: 0 },    // Ramp down to 0 users
  ],
  thresholds: {
    http_req_duration: ['p(95)<500'], // 95% of requests must complete below 500ms
    'http_req_duration{staticAsset:yes}': ['p(95)<100'], // Static assets should be faster
    http_req_failed: ['rate<0.01'],   // Error rate must be less than 1% (built-in k6 metric)
  },
};

export default function() {
  const BASE_URL = 'https://staging-app.example.com';
  
  // Load the home page
  const homeRes = http.get(`${BASE_URL}/`);
  check(homeRes, {
    'homepage status is 200': (r) => r.status === 200,
    'homepage loads in under 1s': (r) => r.timings.duration < 1000,
  });
  
  // Simulate user browsing behavior
  sleep(Math.random() * 3 + 2); // Random sleep between 2-5 seconds
  
  // Load product page
  const productRes = http.get(`${BASE_URL}/products/popular`);
  check(productRes, {
    'product page status is 200': (r) => r.status === 200,
  });
  
  sleep(Math.random() * 3 + 1);
  
  // Submit search
  const searchRes = http.post(`${BASE_URL}/api/search`, {
    query: 'test product',
  });
  check(searchRes, {
    'search status is 200': (r) => r.status === 200,
    'search has results': (r) => JSON.parse(r.body).results.length > 0,
  });
  
  sleep(Math.random() * 5 + 3);
}
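
To clarify what a threshold such as p(95)<500 is asserting, the following sketch computes a percentile from a list of latency samples. It uses the simple nearest-rank method, which is an assumption made for illustration; k6's own calculation may interpolate differently. The sample values are invented.

```javascript
// Sketch: nearest-rank percentile over latency samples (milliseconds).
function percentile(samples, p) {
  if (samples.length === 0) throw new Error('no samples');
  const sorted = [...samples].sort((a, b) => a - b);
  // Nearest rank: the smallest value with at least p% of samples at or below it.
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}

// Hypothetical samples; two slow outliers are enough to breach a p95 target.
const latenciesMs = [120, 95, 480, 210, 130, 610, 150, 175, 99, 300];
const p95 = percentile(latenciesMs, 95);
console.log(p95 < 500 ? 'threshold met' : 'threshold breached');
```

The point of a percentile threshold is that averages hide tail latency: the mean of these samples is well under 500 ms even though the slowest requests are not.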

Performance checklist:

Security Hardening

Security vulnerabilities can lead to data breaches, system compromise, and legal liability. Thoroughly review and address security concerns before deployment.

Authentication and Authorization

Data Protection

Common Vulnerabilities

// Example Node.js security middleware setup
const express = require('express');
const helmet = require('helmet');
const csurf = require('csurf');
const rateLimit = require('express-rate-limit');
const mongoSanitize = require('express-mongo-sanitize');
const xss = require('xss-clean');
const hpp = require('hpp');

const app = express();

// Set security HTTP headers
app.use(helmet());

// Rate limiting to prevent brute force attacks
const limiter = rateLimit({
  windowMs: 15 * 60 * 1000, // 15 minutes
  max: 100, // Limit each IP to 100 requests per windowMs
  message: 'Too many requests from this IP, please try again later'
});
app.use('/api', limiter);

// Body parser, reading data from body into req.body
app.use(express.json({ limit: '10kb' })); // Limit body size

// Data sanitization against NoSQL query injection
app.use(mongoSanitize());

// Data sanitization against XSS
app.use(xss());

// Prevent parameter pollution
app.use(hpp({
  whitelist: ['price', 'duration', 'rating'] // Parameters that can be duplicated
}));

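// CSRF protection (for browser-based submissions)
// csurf with { cookie: true } requires cookie-parser to be registered first.
// Note: the csurf package has been deprecated; evaluate a maintained alternative for new projects.
app.use(require('cookie-parser')());
app.use(csurf({ cookie: true }));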
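
To show what the windowMs and max options in the rate limiter above mean, here is a minimal fixed-window counter. It is an illustrative sketch, not how express-rate-limit is implemented internally; the class name FixedWindowLimiter and the injectable now function are assumptions made so the behaviour can be tested without waiting.

```javascript
// Sketch: fixed-window rate limiting keyed by client identifier (e.g. IP).
class FixedWindowLimiter {
  constructor({ windowMs, max, now = Date.now }) {
    this.windowMs = windowMs;
    this.max = max;
    this.now = now;        // injectable clock, useful for tests
    this.hits = new Map(); // key -> { windowStart, count }
  }

  // Returns true if the request is allowed, false if the limit is exceeded.
  allow(key) {
    const t = this.now();
    const entry = this.hits.get(key);
    if (!entry || t - entry.windowStart >= this.windowMs) {
      // First request, or the previous window has expired: start a new one.
      this.hits.set(key, { windowStart: t, count: 1 });
      return true;
    }
    entry.count += 1;
    return entry.count <= this.max;
  }
}

const limiter = new FixedWindowLimiter({ windowMs: 15 * 60 * 1000, max: 100 });
console.log(limiter.allow('203.0.113.7'));
```

An in-memory map like this only works for a single process. With several application instances behind a load balancer, the counters would need to live in a shared store.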

Dependency Scanning

# Example security scanning command with npm audit
# (--omit=dev replaces the older --production flag on npm 8 and later)
npm audit --omit=dev
npm audit fix

# Using Snyk for deeper vulnerability scanning
snyk test
snyk monitor
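
One way to turn these scans into a deployment gate is to read the JSON report and fail the pipeline when serious findings remain. The sketch below assumes the report shape produced by npm audit --json on npm 7 and later, where severity counts sit under metadata.vulnerabilities; the function name and sample report are hypothetical. For simple cases, npm's own --audit-level flag may be enough.

```javascript
// Sketch: count audit findings at or above a minimum severity.
const SEVERITY_ORDER = ['info', 'low', 'moderate', 'high', 'critical'];

function countAtOrAbove(auditReport, minSeverity) {
  const start = SEVERITY_ORDER.indexOf(minSeverity);
  if (start === -1) throw new Error(`unknown severity: ${minSeverity}`);
  // Assumed shape: { metadata: { vulnerabilities: { low: n, high: n, ... } } }
  const counts = (auditReport.metadata && auditReport.metadata.vulnerabilities) || {};
  return SEVERITY_ORDER.slice(start).reduce(
    (sum, level) => sum + (counts[level] || 0),
    0
  );
}

// Hypothetical report, e.g. parsed from: npm audit --omit=dev --json
const sampleReport = {
  metadata: {
    vulnerabilities: { info: 0, low: 3, moderate: 2, high: 1, critical: 0, total: 6 },
  },
};

if (countAtOrAbove(sampleReport, 'high') > 0) {
  console.log('Blocking deployment: high or critical vulnerabilities found');
}
```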

Security Testing

Security checklist:

Infrastructure and Deployment

A robust infrastructure setup ensures that your application runs reliably and can scale as needed.

Infrastructure as Code

Document and automate your infrastructure setup with IaC tools:

# Example Terraform configuration for web app infrastructure
provider "aws" {
  region = "us-west-2"
}

# VPC Configuration
resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16"
  enable_dns_support = true
  enable_dns_hostnames = true
  
  tags = {
    Name = "production-vpc"
    Environment = "production"
  }
}

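# Look up the availability zones referenced by the subnets below
data "aws_availability_zones" "available" {
  state = "available"
}

# Create public and private subnets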
resource "aws_subnet" "public" {
  count = 2
  vpc_id = aws_vpc.main.id
  cidr_block = "10.0.${count.index}.0/24"
  availability_zone = data.aws_availability_zones.available.names[count.index]
  map_public_ip_on_launch = true
  
  tags = {
    Name = "public-subnet-${count.index + 1}"
    Environment = "production"
  }
}

resource "aws_subnet" "private" {
  count = 2
  vpc_id = aws_vpc.main.id
  cidr_block = "10.0.${count.index + 100}.0/24"
  availability_zone = data.aws_availability_zones.available.names[count.index]
  
  tags = {
    Name = "private-subnet-${count.index + 1}"
    Environment = "production"
  }
}

# Load balancer configuration
resource "aws_lb" "web" {
  name = "web-lb"
  internal = false
  load_balancer_type = "application"
  security_groups = [aws_security_group.lb.id]
  subnets = aws_subnet.public[*].id
  
  enable_deletion_protection = true
  
  tags = {
    Environment = "production"
  }
}

# ECS Cluster for containers
resource "aws_ecs_cluster" "main" {
  name = "production-cluster"
  
  setting {
    name = "containerInsights"
    value = "enabled"
  }
}

# ... additional resources for databases, caching, etc.

High Availability and Fault Tolerance

graph TD
    subgraph "Region 1"
        A[Load Balancer] --> B[App Server 1]
        A --> C[App Server 2]
        A --> D[App Server 3]
        B --> E[Primary Database]
        C --> E
        D --> E
        E --> F[Read Replica 1]
    end

    subgraph "Region 2 (Failover)"
        G[Load Balancer] --> H[App Server 4]
        G --> I[App Server 5]
        H --> J[Secondary Database]
        I --> J
        E -.Replication.-> J
    end

    K[Global DNS] --> A
    K -.Failover.-> G

Containerization and Orchestration

# Example Kubernetes deployment manifest
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-backend
  namespace: production
  labels:
    app: backend
    tier: api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: backend
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: backend
        tier: api
    spec:
      containers:
      - name: api
        image: example/backend:1.2.3
        ports:
        - containerPort: 8080
        resources:
          requests:
            cpu: 100m
            memory: 256Mi
          limits:
            cpu: 500m
            memory: 512Mi
        env:
        - name: NODE_ENV
          value: "production"
        - name: DATABASE_URL
          valueFrom:
            secretKeyRef:
              name: db-credentials
              key: url
        readinessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 10
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 15
          periodSeconds: 20
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
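
The probes in this manifest assume the container answers on /health. A minimal sketch of such an endpoint follows, using only Node's built-in request and response objects. The names buildHealthResponse and healthHandler, and the idea of passing in a function that reports dependency status, are assumptions for illustration.

```javascript
// Sketch: a health endpoint suitable for the /health probes above.
// `checks` maps a dependency name to true (healthy) or false (unhealthy).
function buildHealthResponse(checks) {
  const failing = Object.keys(checks).filter((name) => !checks[name]);
  const healthy = failing.length === 0;
  return {
    statusCode: healthy ? 200 : 503,
    body: { status: healthy ? 'ok' : 'unavailable', failing },
  };
}

// Returns a (req, res) handler; getChecks is a hypothetical callback that
// reports the current state of dependencies such as the database.
function healthHandler(getChecks) {
  return (req, res) => {
    const { statusCode, body } = buildHealthResponse(getChecks());
    res.writeHead(statusCode, { 'Content-Type': 'application/json' });
    res.end(JSON.stringify(body));
  };
}

// Example wiring (not started here):
//   require('http').createServer(healthHandler(() => ({ database: true }))).listen(8080);
```

One design consideration: if the liveness probe also fails whenever a dependency is down, Kubernetes will restart pods that are otherwise fine. Many teams therefore keep dependency checks on the readiness probe only and give liveness a simpler endpoint.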

Deployment Strategy

graph LR
    subgraph "Blue-Green Deployment"
        A[Load Balancer] --> B[Blue Environment]
        A -.Switch.-> C[Green Environment]
    end

    subgraph "Canary Release"
        D[Load Balancer] --> E[90% Production v1]
        D --> F[10% Production v2]
    end

    subgraph "Rolling Update"
        G[Deploy v2] --> H[Replace Instance 1]
        H --> I[Replace Instance 2]
        I --> J[Replace Instance 3]
    end
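
The 90/10 canary split in the diagram is usually applied per user rather than per request, so that a given user consistently sees one version. A rough sketch of that idea is below. In practice the split is often configured in the load balancer or service mesh instead of application code, and the function names here are hypothetical.

```javascript
// Sketch: deterministic canary assignment by hashing a stable user identifier.
const crypto = require('crypto');

// Map a user ID to a bucket in the range 0..99.
function bucketFor(userId) {
  const digest = crypto.createHash('sha256').update(String(userId)).digest();
  return digest.readUInt32BE(0) % 100;
}

// Users whose bucket falls below canaryPercent are routed to the new version.
function chooseVersion(userId, canaryPercent) {
  return bucketFor(userId) < canaryPercent ? 'v2' : 'v1';
}

console.log(chooseVersion('user-42', 10));
```

Because the bucket depends only on the user ID, raising canaryPercent from 10 to 25 keeps the original canary users on v2 and adds new ones, which makes a gradual ramp-up easier to reason about.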

Monitoring and Observability

Proper monitoring is essential for maintaining visibility into your application's performance and health in production.

Metrics Collection

// Example Node.js Prometheus metrics setup
const express = require('express');
const promClient = require('prom-client');
const app = express();

// Create a Registry to register metrics
const register = new promClient.Registry();
promClient.collectDefaultMetrics({ register });

// Create custom metrics
const httpRequestDurationMicroseconds = new promClient.Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [0.1, 0.3, 0.5, 0.7, 1, 3, 5, 7, 10]
});

const httpRequestCounter = new promClient.Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'route', 'status_code']
});

// Register the metrics
register.registerMetric(httpRequestDurationMicroseconds);
register.registerMetric(httpRequestCounter);

// Middleware to track request metrics
app.use((req, res, next) => {
  const start = Date.now();
  
  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000;
    const route = req.route ? req.route.path : req.path;
    
    // Record metrics
    httpRequestDurationMicroseconds
      .labels(req.method, route, res.statusCode)
      .observe(duration);
      
    httpRequestCounter
      .labels(req.method, route, res.statusCode)
      .inc();
  });
  
  next();
});

// Expose metrics endpoint for Prometheus to scrape
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});
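
The buckets option in the histogram above is easier to choose once you see how observations are counted. Prometheus histogram buckets are cumulative: each bucket counts every observation less than or equal to its upper bound, plus an implicit +Inf bucket. The sketch below reproduces that counting by hand; it is an illustration only and is not prom-client's implementation.

```javascript
// Sketch: cumulative bucket counting as used by Prometheus histograms.
function bucketCounts(buckets, observations) {
  const counts = [...buckets]
    .sort((a, b) => a - b)
    .map((le) => ({ le, count: 0 }));
  counts.push({ le: Infinity, count: 0 }); // the implicit +Inf bucket

  for (const value of observations) {
    for (const bucket of counts) {
      // An observation increments every bucket whose bound it fits under.
      if (value <= bucket.le) bucket.count += 1;
    }
  }
  return counts;
}

// Same bucket boundaries as the example above; durations are invented.
const exampleBuckets = [0.1, 0.3, 0.5, 0.7, 1, 3, 5, 7, 10];
const exampleDurations = [0.05, 0.2, 0.45, 0.9, 12];
console.log(bucketCounts(exampleBuckets, exampleDurations));
```

A practical consequence is that percentile estimates are only as precise as the bucket boundaries near your target. If your latency objective is 500 ms, having boundaries close to 0.5 matters more than having many buckets above 5 seconds.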

Logging

Tracing

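// Example OpenTelemetry tracing setup for Node.js
// (package names as of the 1.x SDK; older releases shipped these as
// @opentelemetry/node and @opentelemetry/tracing)
const opentelemetry = require('@opentelemetry/api');
const { NodeTracerProvider } = require('@opentelemetry/sdk-trace-node');
const { SimpleSpanProcessor } = require('@opentelemetry/sdk-trace-base');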
const { JaegerExporter } = require('@opentelemetry/exporter-jaeger');
const { ExpressInstrumentation } = require('@opentelemetry/instrumentation-express');
const { HttpInstrumentation } = require('@opentelemetry/instrumentation-http');
const { registerInstrumentations } = require('@opentelemetry/instrumentation');

// Create and configure the tracer provider
const provider = new NodeTracerProvider();

// Configure span processor and exporter
const exporter = new JaegerExporter({
  serviceName: 'api-service',
  endpoint: 'http://jaeger:14268/api/traces',
});

provider.addSpanProcessor(new SimpleSpanProcessor(exporter));
provider.register();

// Register automatic instrumentations
registerInstrumentations({
  instrumentations: [
    new HttpInstrumentation(),
    new ExpressInstrumentation(),
  ],
  tracerProvider: provider,
});

const tracer = opentelemetry.trace.getTracer('api-tracer');

// Example of manual instrumentation (assumes the Express `app` and a `db` data-access object are defined elsewhere)
app.get('/orders/:id', async (req, res) => {
  const span = tracer.startSpan('get-order-details');
  
  try {
    span.setAttribute('order.id', req.params.id);
    
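    // Create a child span for the database query. The parent is passed through
    // a context object (a `parent` span option is not part of the 1.x API).
    const parentContext = opentelemetry.trace.setSpan(
      opentelemetry.context.active(),
      span
    );
    const dbSpan = tracer.startSpan('database-query', undefined, parentContext);

    let order;
    try {
      order = await db.getOrder(req.params.id);
    } finally {
      // End the child span even if the query throws
      dbSpan.end();
    }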
    
    if (!order) {
      span.setAttribute('error', true);
      span.setAttribute('error.message', 'Order not found');
      res.status(404).json({ error: 'Order not found' });
      return;
    }
    
    res.json(order);
  } catch (error) {
    span.setAttribute('error', true);
    span.setAttribute('error.message', error.message);
    res.status(500).json({ error: 'Internal server error' });
  } finally {
    span.end();
  }
});

Alerting

Dashboards

Monitoring checklist:

Backup and Disaster Recovery

Prepare for the unexpected with comprehensive backup and recovery procedures.

Backup Strategy

# Example AWS RDS backup configuration using Terraform
resource "aws_db_instance" "production" {
  identifier           = "production-db"
  engine               = "postgres"
  engine_version       = "13.4"
  instance_class       = "db.r5.large"
  allocated_storage    = 100
  storage_type         = "gp2"
  db_name              = "appdb"  # older AWS provider versions call this argument "name"
  username             = var.db_username
  password             = var.db_password
  multi_az             = true
  publicly_accessible  = false
  deletion_protection  = true
  
  # Backup configuration
  backup_retention_period = 7  # 7 days of retention
  backup_window           = "03:00-05:00"  # UTC time
  copy_tags_to_snapshot   = true
  
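  # Export database logs to CloudWatch Logs
  # (point-in-time recovery is already enabled by the non-zero backup_retention_period above)
  enabled_cloudwatch_logs_exports = ["postgresql", "upgrade"]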
  
  # Automatic minor version upgrade
  auto_minor_version_upgrade = true
  
  # Enhanced monitoring
  monitoring_interval = 30
  monitoring_role_arn = aws_iam_role.rds_monitoring_role.arn
  
  tags = {
    Environment = "production"
  }
}
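
Configuring backups is only half of the job; it also helps to check that they keep happening. The sketch below flags a backup as stale when the most recent snapshot is older than a chosen limit. With a daily backup window like the one above, a limit of roughly 26 hours is a plausible starting point, but that number and the function names are assumptions. Fetching the snapshot timestamp from your provider is left out.

```javascript
// Sketch: check that the latest backup is recent enough.
function backupAgeHours(latestBackupIso, now = new Date()) {
  return (now.getTime() - new Date(latestBackupIso).getTime()) / 3600000;
}

function isBackupFresh(latestBackupIso, maxAgeHours, now = new Date()) {
  const age = backupAgeHours(latestBackupIso, now);
  // An unparseable timestamp yields NaN and is treated as not fresh.
  return Number.isFinite(age) && age >= 0 && age <= maxAgeHours;
}

// Hypothetical timestamps for illustration.
const lastSnapshot = '2024-01-01T03:30:00Z';
const checkTime = new Date('2024-01-02T02:00:00Z');
console.log(isBackupFresh(lastSnapshot, 26, checkTime));
```

A freshness check does not prove a backup can be restored. Periodic restore tests into a scratch environment remain the stronger verification.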

Disaster Recovery Plan

graph TD
    A[Disaster Event] --> B{Severity Assessment}
    B -->|Critical| C[Activate Full DR Plan]
    B -->|Major| D[Partial System Recovery]
    B -->|Minor| E[Component-Level Recovery]
    C --> F[Failover to Secondary Region]
    C --> G[Notify All Stakeholders]
    C --> H[Engage Incident Response Team]
    D --> I[Restore Affected Services]
    D --> J[Notify Affected Teams]
    E --> K[Restore from Backup]
    E --> L[Internal Notification Only]
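
The decision flow in this diagram can also be kept as data, which makes it easy to print in a runbook or wire into an incident tool. The sketch below encodes the three severity levels and their actions exactly as drawn; the constant and function names are hypothetical.

```javascript
// Sketch: the disaster recovery decision flow expressed as a lookup table.
const DR_PLAYBOOK = Object.freeze({
  critical: {
    response: 'Activate Full DR Plan',
    actions: [
      'Failover to Secondary Region',
      'Notify All Stakeholders',
      'Engage Incident Response Team',
    ],
  },
  major: {
    response: 'Partial System Recovery',
    actions: ['Restore Affected Services', 'Notify Affected Teams'],
  },
  minor: {
    response: 'Component-Level Recovery',
    actions: ['Restore from Backup', 'Internal Notification Only'],
  },
});

function planFor(severity) {
  const plan = DR_PLAYBOOK[String(severity).toLowerCase()];
  if (!plan) throw new Error(`unknown severity: ${severity}`);
  return plan;
}

console.log(planFor('Major').actions.join(', '));
```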

High Availability Configuration

Recovery checklist:

Documentation and Runbooks

Comprehensive documentation ensures that operations can continue smoothly even when key team members are unavailable.

System Documentation

Operational Runbooks

Template for Incident Response Runbook

Incident Response: Database Connection Failures

Description: This runbook covers the procedure for handling database connection failures in the production environment.

Symptoms:

  • API endpoints returning 500 errors
  • Error logs showing database connection timeouts or failures
  • Database connection pool exhaustion alerts

Initial Assessment:

  1. Check database monitoring dashboard
  2. Verify if connection errors are affecting all services or specific ones
  3. Check recent deployments or infrastructure changes

Resolution Steps:

  1. Check database server status:
    aws rds describe-db-instances --db-instance-identifier production-db
  2. Verify connection pool settings in affected services:
    kubectl exec -it [pod-name] -- env | grep DB_POOL
  3. Check for open connections and potential connection leaks:
    SELECT count(*), state FROM pg_stat_activity GROUP BY state;
  4. If necessary, restart application pods to reset connection pools:
    kubectl rollout restart deployment/api-service
  5. If database server is overloaded, scale up resources or enable read replicas for read traffic:
    aws rds modify-db-instance --db-instance-identifier production-db --db-instance-class db.r5.2xlarge

Escalation:

  1. If issue persists for more than 15 minutes, escalate to Database Administrator
  2. If downtime exceeds 30 minutes, notify Engineering Manager and Product Owner

Prevention:

  • Implement proper connection pooling with appropriate timeouts
  • Set up proactive monitoring for connection pool metrics
  • Configure automatic scaling based on connection utilization

Documentation checklist:

Compliance and Legal Requirements

Ensure your application meets all relevant regulatory and legal requirements before deployment.

Data Privacy Compliance

Compliance Checklist

Accessibility

Compliance verification:

Miscellaneous Checks

Additional considerations that don't fit neatly into other categories.

License Compliance

Search Engine Optimization

Analytics and Tracking

User Documentation

Pre-Launch Final Checklist

A comprehensive checklist to review before the final go-live decision.

Production Go-Live Checklist

Functionality
  • [ ] All critical user flows have been tested and verified
  • [ ] Cross-browser compatibility confirmed
  • [ ] Mobile responsiveness validated
  • [ ] All integrations with external services are working
  • [ ] Form validations are functioning properly
  • [ ] User acceptance testing completed and signed off
Performance
  • [ ] Load testing completed with expected traffic volume
  • [ ] Asset optimization verified (minification, bundling)
  • [ ] Database queries optimized and indexed
  • [ ] Caching strategy implemented and tested
  • [ ] CDN configured for static assets
  • [ ] Performance monitoring in place
Security
  • [ ] Security scan completed with no critical findings
  • [ ] Dependency vulnerabilities addressed
  • [ ] HTTPS properly configured with valid certificates
  • [ ] Authentication and authorization thoroughly tested
  • [ ] Data encryption implemented for sensitive information
  • [ ] Security headers properly configured
Infrastructure
  • [ ] Production environment provisioned and configured
  • [ ] High availability setup verified
  • [ ] Auto-scaling configured and tested
  • [ ] Database backups configured and verified
  • [ ] DNS configuration prepared
  • [ ] SSL certificates installed and valid
Monitoring and Support
  • [ ] Logging properly configured and centralized
  • [ ] Monitoring dashboards created and accessible
  • [ ] Alerts configured for critical metrics
  • [ ] On-call rotation established
  • [ ] Incident response procedures documented
  • [ ] Runbooks created for common issues
Compliance and Documentation
  • [ ] Privacy policy updated and accessible
  • [ ] Terms of service updated and accessible
  • [ ] Cookie consent implemented if required
  • [ ] Accessibility requirements met
  • [ ] System documentation updated
  • [ ] API documentation current
Business Readiness
  • [ ] Support team trained on the new features
  • [ ] Customer communication plan in place
  • [ ] Analytics tracking configured
  • [ ] Rollback plan documented and tested
  • [ ] Stakeholder sign-off obtained
  • [ ] Go/No-Go meeting completed with approval
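
If the checklist is tracked in a machine-readable form, the go/no-go input can be computed rather than tallied by hand. The following sketch assumes a simple object of categories and boolean items, which is an invented format, and reports every unmet item. The actual decision still belongs to the go/no-go meeting.

```javascript
// Sketch: summarize a go-live checklist into a recommendation.
// Assumed shape: { Category: { 'item description': true | false, ... }, ... }
function evaluateGoLive(checklist) {
  const unmet = [];
  for (const [category, items] of Object.entries(checklist)) {
    for (const [item, done] of Object.entries(items)) {
      if (!done) unmet.push(`${category}: ${item}`);
    }
  }
  return { decision: unmet.length === 0 ? 'go' : 'no-go', unmet };
}

// Abbreviated, hypothetical checklist state.
const checklistState = {
  Security: {
    'Security scan completed with no critical findings': true,
    'Dependency vulnerabilities addressed': false,
  },
  'Business Readiness': {
    'Rollback plan documented and tested': true,
  },
};

console.log(evaluateGoLive(checklistState));
```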

Deployment Day Procedures

Procedures for the actual deployment to minimize risk and ensure a smooth transition.

Pre-Deployment Tasks

  1. Conduct final go/no-go meeting with all stakeholders
  2. Verify that backup of current production is available
  3. Ensure all team members are available and roles are assigned
  4. Set up communication channels for deployment coordination
  5. Notify relevant teams about the deployment window

Deployment Steps

  1. Activate maintenance mode or display maintenance banner (if applicable)
  2. Execute deployment according to documented procedure
  3. Run database migrations (if applicable)
  4. Verify deployment success with smoke tests
  5. Update CDN resources (if applicable)
  6. Verify all services are operational
  7. Run post-deployment validation tests
  8. Disable maintenance mode

Post-Deployment Monitoring

  1. Actively monitor system metrics for anomalies
  2. Watch error rates and performance indicators
  3. Monitor user feedback channels
  4. Have team members test critical user flows
  5. Be prepared to roll back if significant issues arise

Rollback Procedure

  1. Decision criteria: When to trigger a rollback
  2. Step-by-step rollback process
  3. Verification steps after rollback
  4. Post-rollback communication plan
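
Rollback decision criteria work best when they are agreed on before deployment day and expressed as numbers. The sketch below shows one possible shape for such a rule, comparing post-deployment metrics against limits. The specific limits (1% error rate, a doubling over baseline, 500 ms p95 latency) and all names are assumptions for illustration; each team should set its own.

```javascript
// Sketch: evaluate rollback criteria from post-deployment metrics.
function shouldRollBack(metrics, limits) {
  const { baselineErrorRate, currentErrorRate, p95LatencyMs } = metrics;
  const reasons = [];

  if (currentErrorRate > limits.maxErrorRate) {
    reasons.push('error rate above absolute limit');
  }
  // Relative check: catches regressions even when the absolute rate is low.
  if (
    baselineErrorRate > 0 &&
    currentErrorRate / baselineErrorRate > limits.maxErrorRateIncrease
  ) {
    reasons.push('error rate increased sharply over baseline');
  }
  if (p95LatencyMs > limits.maxP95LatencyMs) {
    reasons.push('p95 latency above limit');
  }
  return { rollBack: reasons.length > 0, reasons };
}

// Hypothetical limits and measurements.
const rollbackLimits = { maxErrorRate: 0.01, maxErrorRateIncrease: 2, maxP95LatencyMs: 500 };
const afterDeploy = { baselineErrorRate: 0.002, currentErrorRate: 0.02, p95LatencyMs: 320 };
console.log(shouldRollBack(afterDeploy, rollbackLimits));
```

Returning the reasons alongside the decision gives the post-rollback communication something concrete to cite.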

Post-Deployment Activities

Activities to ensure continued success after the initial deployment.

Immediate Post-Deployment

First Week Activities

Longer-term Follow-up

Conclusion and Key Takeaways

A successful production deployment requires careful planning, thorough testing, and meticulous attention to detail. The key takeaways from this lecture include:

Remember: Production readiness is not a one-time event but an ongoing process. The standards and practices covered in this checklist should be continuously reviewed and improved based on real-world experience.

Practical Exercise: Creating a Production Readiness Plan

Exercise Overview

In this exercise, you'll create a production readiness plan for a sample application:

  1. Review the sample application architecture and requirements
  2. Identify critical components and potential failure points
  3. Create a detailed production readiness checklist
  4. Develop runbooks for key operational tasks
  5. Design a deployment and rollback strategy
  6. Present your plan to the class for feedback

Resources Required

For detailed exercise instructions and starter code, refer to the course repository: Production Readiness Workshop (Example URL)

Additional Resources

Books

Online Resources

Tools

Next Lecture Preview: SSL Certificates Setup

In our next session, we'll explore SSL certificate management, covering: