Rolling Updates: Implementing Continuous Deployment with Zero Downtime

Gradual instance-by-instance deployment for reliable system updates

Introduction to Rolling Updates

Rolling updates represent one of the most widely used deployment strategies for achieving zero-downtime deployments. Unlike blue-green deployments that switch traffic between two complete environments, or canary releases that gradually shift traffic percentages, rolling updates focus on incrementally updating instances within a single environment.

In a rolling update, application instances are updated one subset at a time. The application remains available throughout the deployment because only a portion of the instances is out of service at any given moment. The deployment proceeds through the environment until all instances are running the new version.

sequenceDiagram
    participant LB as Load Balancer
    participant I1 as Instance 1
    participant I2 as Instance 2
    participant I3 as Instance 3
    participant I4 as Instance 4
    Note over I1,I4: Initial State: All instances running v1
    LB->>I1: User Request
    I1->>LB: Response (v1)
    LB->>I2: User Request
    I2->>LB: Response (v1)
    Note over I1,I4: Phase 1: Update Instance 1
    LB--xI1: Stop sending traffic
    Note over I1: Update to v2
    Note over I1: Health check
    LB->>I1: Resume traffic
    LB->>I1: User Request
    I1->>LB: Response (v2)
    LB->>I2: User Request
    I2->>LB: Response (v1)
    Note over I1,I4: Phase 2: Update Instance 2
    LB--xI2: Stop sending traffic
    Note over I2: Update to v2
    Note over I2: Health check
    LB->>I2: Resume traffic
    LB->>I2: User Request
    I2->>LB: Response (v2)
    Note over I1,I4: Continue until all instances are updated
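
The sequence above can be sketched as a short loop. This is a minimal in-memory simulation under stated assumptions, not an orchestrator: `rollingUpdate`, the shape of the instance objects, and the `healthCheck` callback are hypothetical names introduced only for illustration.

```javascript
// Minimal in-memory simulation of a rolling update (illustrative only).
// Each instance is a plain object: { id, version, inService }.
function rollingUpdate(instances, newVersion, batchSize, healthCheck) {
  const log = [];
  for (let i = 0; i < instances.length; i += batchSize) {
    const batch = instances.slice(i, i + batchSize);

    // 1. Take the batch out of the load balancer rotation.
    batch.forEach((inst) => { inst.inService = false; });

    // 2. Update each instance in the batch.
    batch.forEach((inst) => { inst.version = newVersion; });

    // 3. Halt the rollout if any updated instance fails its health check.
    const failed = batch.filter((inst) => !healthCheck(inst));
    if (failed.length > 0) {
      log.push(`halted: ${failed.map((f) => f.id).join(', ')} unhealthy`);
      return { completed: false, log };
    }

    // 4. Return the batch to rotation and record how many instances serve traffic.
    batch.forEach((inst) => { inst.inService = true; });
    const inService = instances.filter((inst) => inst.inService).length;
    log.push(`batch done: ${inService}/${instances.length} in service`);
  }
  return { completed: true, log };
}

// Four instances, updated one at a time, with a health check that always passes.
const fleet = [1, 2, 3, 4].map((id) => ({ id, version: 'v1', inService: true }));
const result = rollingUpdate(fleet, 'v2', 1, () => true);
console.log(result.completed, fleet.map((inst) => inst.version));
```

Halting on the first failed health check leaves the remaining instances on the old version, which is what keeps a bad release from reaching the whole fleet.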

The Rotating Tires Analogy

Rolling updates can be compared to how a mechanic replaces tires on a car:

  • Traditional Deployment: Take the entire car off the road, replace all four tires at once, then put it back in service (causing downtime).
  • Blue-Green Deployment: Build a second identical car with new tires, then switch drivers from the old car to the new one (instant cutover).
  • Canary Release: Replace one tire, drive the car carefully with a few passengers to test it, then gradually replace the rest if no issues arise.
  • Rolling Update: Jack up one corner of the car at a time, replace that tire, lower it, and move to the next tire. The car remains partially functional and never completely "off the road" during the update process.

Benefits and Challenges of Rolling Updates

Benefits

  • Zero Downtime: Application remains available throughout the update process
  • Resource Efficiency: Requires no additional infrastructure beyond a small buffer capacity
  • Gradual Rollout: Issues initially affect only a subset of instances, so they can be detected before they reach the whole fleet
  • Simple Implementation: Most orchestration platforms support rolling updates natively
  • Built-in Verification: Health checks ensure each new instance is functioning before proceeding
  • Traffic Management: Load balancers automatically route around instances being updated
  • Minimal Configuration: Requires little special configuration beyond normal deployment

Challenges

  • Version Coexistence: Multiple versions run simultaneously during the rollout
  • Database Compatibility: Database changes must be compatible with both versions
  • Rollback Complexity: Partial rollbacks can lead to more version mixing
  • Slower Deployment: Takes longer to complete than an all-at-once deployment
  • Subtle Issues: Some problems may only appear when old and new versions interact
  • Instance Readiness: Proper health checks are critical to keep traffic away from instances that are not yet ready
  • Capacity Planning: Must maintain sufficient capacity during the update process

Real-World Example: E-commerce Platform Rolling Update

A major e-commerce platform implements rolling updates for their product catalog service:

  • Deployment Configuration:
    • 30 total instances across 3 availability zones
    • Update 3 instances at a time (10% of capacity)
    • 30-second health check grace period before allowing traffic
    • 60-second pause between batch updates to monitor performance
  • Benefits Realized:
    • Maintained 100% availability during deployments
    • Detected performance regressions before affecting all users
    • Reduced deployment risk during high-traffic holiday periods
  • Challenges Addressed:
    • Implemented backward-compatible API versions to handle version coexistence
    • Used database change patterns that work with both old and new code
    • Integrated comprehensive monitoring to catch subtle issues

The platform now deploys updates to production multiple times per day with minimal risk, compared to their previous weekly deployment schedule with occasional downtime.
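
As a rough check on what this configuration implies for deployment time, the arithmetic can be sketched as follows. The function `estimateRolloutSeconds` is a hypothetical helper, and the estimate assumes batches run strictly one after another and counts only the grace period and the inter-batch pause, so it is a lower bound that ignores the time needed to install and start each instance.

```javascript
// Lower-bound estimate of rollout duration from batch settings.
function estimateRolloutSeconds({ instances, batchSize, graceSeconds, pauseSeconds }) {
  const batches = Math.ceil(instances / batchSize);
  // One grace period per batch, one pause between consecutive batches.
  const total = batches * graceSeconds + (batches - 1) * pauseSeconds;
  return { batches, total };
}

// Values from the example above: 30 instances, 3 at a time, 30 s grace, 60 s pause.
const estimate = estimateRolloutSeconds({
  instances: 30,
  batchSize: 3,
  graceSeconds: 30,
  pauseSeconds: 60,
});
console.log(estimate); // { batches: 10, total: 840 }
```

Under these assumptions the rollout takes at least 840 seconds (14 minutes), which is consistent with deploying several times per day.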

Key Concepts in Rolling Updates

Update Strategies

There are several variations of rolling update strategies, each with different trade-offs:

| Strategy | Description | Pros | Cons |
|---|---|---|---|
| Basic Rolling Update | Update one instance at a time in sequence | Minimal capacity impact, simplest to implement | Slowest deployment time, extended version coexistence |
| Batched Rolling Update | Update multiple instances simultaneously in batches | Faster deployment, balanced risk/speed | Requires more capacity headroom during deployment |
| Surge Rolling Update | Create new instances before terminating old ones | Maintains full capacity, faster readiness | Temporarily requires additional infrastructure |
| Zone-Based Rolling Update | Update all instances in one availability zone at a time | Geographic isolation of risk, simpler tracking | Requires cross-zone redundancy, zone imbalance during update |

graph TD
    subgraph "Basic Rolling Update"
        A1["75% v1<br/>25% v2"] --> A2["50% v1<br/>50% v2"] --> A3["25% v1<br/>75% v2"] --> A4["0% v1<br/>100% v2"]
    end
    subgraph "Surge Rolling Update"
        B1["100% v1<br/>25% v2"] --> B2["75% v1<br/>50% v2"] --> B3["50% v1<br/>75% v2"] --> B4["25% v1<br/>100% v2"] --> B5["0% v1<br/>100% v2"]
    end
    subgraph "Zone-Based Rolling Update"
        C1["Zone A: v2<br/>Zone B: v1<br/>Zone C: v1"] --> C2["Zone A: v2<br/>Zone B: v2<br/>Zone C: v1"] --> C3["Zone A: v2<br/>Zone B: v2<br/>Zone C: v2"]
    end
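
To make the capacity difference between the basic and surge strategies concrete, the following sketch traces instance counts step by step for a one-at-a-time rollout. The function `traceRollout` and its output shape are hypothetical, and the model is deliberately simplified: it assumes every new instance becomes healthy immediately.

```javascript
// Trace v1/v2 instance counts for a one-at-a-time rollout.
// surge = false: an old instance is taken down before its replacement starts.
// surge = true:  the replacement starts before the old instance is retired.
function traceRollout(total, surge) {
  const steps = [];
  let v1 = total;
  let v2 = 0;
  while (v1 > 0) {
    if (surge) {
      v2 += 1; // start the replacement first
      steps.push({ v1, v2, serving: v1 + v2 });
      v1 -= 1; // then retire an old instance
    } else {
      v1 -= 1; // take an old instance out of service
      steps.push({ v1, v2, serving: v1 + v2 });
      v2 += 1; // then bring it back on the new version
    }
    steps.push({ v1, v2, serving: v1 + v2 });
  }
  return steps;
}

const servingCounts = (steps) => steps.map((step) => step.serving);
console.log('basic:', servingCounts(traceRollout(4, false)).join(' '));
console.log('surge:', servingCounts(traceRollout(4, true)).join(' '));
```

With four instances, the basic strategy dips to three serving instances during each step, while the surge strategy never drops below four but briefly runs five.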

Critical Components for Successful Rolling Updates

Rolling Update Configuration Best Practices

  • Batch Size: Start with small batch sizes (10-20% of total capacity) and adjust based on confidence and urgency
  • Health Check Grace Period: Allow sufficient time for application startup and warming before health checks (30+ seconds for most applications)
  • Inter-batch Delay: Consider adding a delay between batches to observe performance and catch issues
  • Timeout: Set an overall deployment timeout to prevent "stuck" deployments
  • Capacity Buffer: Maintain 20-30% extra capacity to handle traffic during instance updates
  • Failure Threshold: Define a threshold for failed instances that will trigger a rollback
  • Cross-Zone Distribution: Ensure updates affect instances across availability zones evenly
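
The interaction between batch size and capacity buffer can be checked with simple arithmetic. The sketch below uses a hypothetical helper, `utilizationDuringBatch`, and assumes load is spread evenly across the instances that remain in service while a batch is out of rotation.

```javascript
// Utilization of the remaining instances while one batch is out of service.
function utilizationDuringBatch({ instances, batchSize, currentUtilization }) {
  const load = instances * currentUtilization; // load in "instance equivalents"
  const remaining = instances - batchSize;
  return load / remaining;
}

// 30 instances running at 70% utilization, updating 3 at a time (10% batches).
const during = utilizationDuringBatch({ instances: 30, batchSize: 3, currentUtilization: 0.7 });
console.log(during.toFixed(2)); // 0.78
```

If the result approaches or exceeds 1.0, the batch size is too large for the available headroom, and either the batch should shrink or the buffer should grow.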

Rolling Update Implementations Across Platforms

Kubernetes Rolling Updates

Kubernetes provides built-in support for rolling updates through Deployment resources.

Kubernetes Deployment with Rolling Update Strategy


apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
  labels:
    app: my-app
spec:
  replicas: 5
  selector:
    matchLabels:
      app: my-app
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # How many pods can be created above desired number
      maxUnavailable: 1  # How many pods can be unavailable during the update
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-app
        image: my-app:v2
        ports:
        - containerPort: 8080
        readinessProbe:    # Defines when a pod is ready to serve traffic
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 5
        livenessProbe:     # Defines when a pod should be restarted
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 15
          periodSeconds: 20
        resources:
          limits:
            cpu: "500m"
            memory: "512Mi"
          requests:
            cpu: "100m"
            memory: "128Mi"
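
The effect of `maxSurge` and `maxUnavailable` in the manifest above can be worked out by hand. The sketch below is a hypothetical helper (`resolveRollingUpdateBounds` is not a Kubernetes API) that mirrors the documented rounding rules: a percentage for `maxSurge` is rounded up, and a percentage for `maxUnavailable` is rounded down.

```javascript
// Compute the pod-count bounds a Deployment observes during a rolling update.
// Values may be absolute numbers (1) or percentage strings ("25%").
function resolveRollingUpdateBounds(replicas, maxSurge, maxUnavailable) {
  const resolve = (value, roundUp) => {
    if (typeof value === 'number') return value;
    const scaled = (parseInt(value, 10) * replicas) / 100; // "25%" of replicas
    return roundUp ? Math.ceil(scaled) : Math.floor(scaled);
  };
  const surge = resolve(maxSurge, true);              // percentages round up
  const unavailable = resolve(maxUnavailable, false); // percentages round down
  return {
    maxPods: replicas + surge,             // upper bound on total pods
    minAvailable: replicas - unavailable,  // lower bound on available pods
  };
}

// The manifest above: 5 replicas, maxSurge 1, maxUnavailable 1.
console.log(resolveRollingUpdateBounds(5, 1, 1)); // { maxPods: 6, minAvailable: 4 }
```

For the example Deployment this means at most six pods exist at once and at least four remain available throughout the rollout.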

Kubernetes Rolling Update Command


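# Update a deployment with a new image
# (the --record flag is deprecated; record the reason with an annotation instead)
kubectl set image deployment/my-app my-app=my-app:v2
kubectl annotate deployment/my-app kubernetes.io/change-cause="Update image to my-app:v2" --overwrite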

# Monitor the rollout status
kubectl rollout status deployment/my-app

# Pause a rollout if issues are detected
kubectl rollout pause deployment/my-app

# Resume a rollout after issues are resolved
kubectl rollout resume deployment/my-app

# Rollback to the previous version if needed
kubectl rollout undo deployment/my-app

# View rollout history
kubectl rollout history deployment/my-app

AWS Auto Scaling Group Rolling Updates

AWS Auto Scaling Groups support rolling updates through instance refresh and launch template updates.

AWS CloudFormation Template for ASG with Rolling Update


Resources:
  MyAutoScalingGroup:
    Type: AWS::AutoScaling::AutoScalingGroup
    Properties:
      VPCZoneIdentifier:
        - subnet-12345678
        - subnet-87654321
      LaunchTemplate:
        LaunchTemplateId: !Ref MyLaunchTemplate
        Version: !GetAtt MyLaunchTemplate.LatestVersionNumber
      MinSize: 2
      MaxSize: 10
      DesiredCapacity: 5
      HealthCheckType: ELB
      HealthCheckGracePeriod: 300
      TargetGroupARNs:
        - !Ref MyTargetGroup
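      Tags:
        - Key: Name
          Value: MyApp
          PropagateAtLaunch: true
    # Rolling update behavior is set with the UpdatePolicy attribute, which is a
    # sibling of Properties. The Auto Scaling group resource has no instance
    # refresh property; an instance refresh (including MinHealthyPercentage,
    # InstanceWarmup, and checkpoint preferences) is started through the API or CLI.
    # The values below are illustrative.
    UpdatePolicy:
      AutoScalingRollingUpdate:
        MinInstancesInService: 4   # Keep at least 4 of the 5 instances serving
        MaxBatchSize: 1            # Replace one instance at a time
        PauseTime: PT5M            # Wait 5 minutes between batches
        WaitOnResourceSignals: false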

  MyLaunchTemplate:
    Type: AWS::EC2::LaunchTemplate
    Properties:
      LaunchTemplateName: MyAppLaunchTemplate
      VersionDescription: Initial version
      LaunchTemplateData:
        ImageId: ami-0abcdef1234567890
        InstanceType: t3.medium
        SecurityGroupIds:
          - sg-12345678
        UserData:
          Fn::Base64: !Sub |
            #!/bin/bash
            echo "Starting application version 1.0"
            # Application startup commands

AWS CLI Commands for Rolling Update


# Start an instance refresh (rolling update)
aws autoscaling start-instance-refresh \
  --auto-scaling-group-name MyAutoScalingGroup \
  --strategy Rolling \
  --preferences '{"MinHealthyPercentage": 90, "InstanceWarmup": 300}'

# Check the status of an instance refresh
aws autoscaling describe-instance-refreshes \
  --auto-scaling-group-name MyAutoScalingGroup

# Cancel an instance refresh if needed
aws autoscaling cancel-instance-refresh \
  --auto-scaling-group-name MyAutoScalingGroup

Docker Swarm Rolling Updates

Docker Swarm provides rolling update capabilities for services through update-config settings.

Docker Swarm Service with Rolling Update Configuration


version: '3.8'
services:
  web:
    image: nginx:latest
    deploy:
      replicas: 6
      update_config:
        parallelism: 2         # Update 2 containers at a time
        delay: 10s             # Wait 10s between updating a group of containers
        order: start-first     # Start new containers first, then stop old ones
        failure_action: pause  # Pause deployment if a container fails to start
        monitor: 60s           # Monitor for failures for 60s after each task update
      rollback_config:
        parallelism: 3         # Rollback 3 containers at a time
        delay: 5s              # Wait 5s between rolling back a group of containers
        failure_action: continue
        monitor: 30s
        order: stop-first
      restart_policy:
        condition: on-failure
        delay: 5s
        max_attempts: 3
    ports:
      - "80:80"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost"]
      interval: 5s
      timeout: 2s
      retries: 3
      start_period: 5s

Docker CLI Commands for Rolling Update


# Deploy or update a stack with rolling update configuration
docker stack deploy -c docker-compose.yml my-stack

# Update a service with a new image (with rolling update)
docker service update --image nginx:new-version my-stack_web

# Monitor service update progress
docker service ps my-stack_web

# Rollback a service update
docker service rollback my-stack_web

Version Compatibility Strategies

Since rolling updates involve running multiple versions simultaneously, compatibility between versions is critical for success.

API and Contract Compatibility

API Version Compatibility Example


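// Express.js API with version compatibility (routes as registered in the v2 release)
const express = require('express');
const app = express();

// API v1 endpoint as shipped in the original (v1) release, shown for reference.
// It is commented out because Express dispatches to the first handler registered
// for a path: leaving it active would make the backward-compatible v1 handler
// further down unreachable.
//
// app.get('/api/v1/users/:id', (req, res) => {
//   const userId = req.params.id;
//   const user = getUserFromDatabase(userId);
//
//   // v1 response format
//   res.json({
//     id: user.id,
//     name: user.name,
//     email: user.email
//   });
// });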

// API v2 endpoint (new version with additional fields)
app.get('/api/v2/users/:id', (req, res) => {
  const userId = req.params.id;
  const user = getEnhancedUserFromDatabase(userId);
  
  // v2 response format with additional fields
  res.json({
    id: user.id,
    name: user.name,
    email: user.email,
    profile: {
      avatar: user.avatar_url,
      bio: user.bio,
      created_at: user.joined_date
    }
  });
});

// Backward compatible v1 handler in v2 code
// This handles v1 requests during rolling update
app.get('/api/v1/users/:id', (req, res) => {
  const userId = req.params.id;
  const user = getEnhancedUserFromDatabase(userId);
  
  // Maintain v1 response format for backward compatibility
  res.json({
    id: user.id,
    name: user.name,
    email: user.email
  });
});

// Forward compatibility example
function processUserRequest(request) {
  // Handle new fields gracefully in old code
  // Ignore fields that didn't exist in previous version
  const userId = request.id;
  const userName = request.name;
  
  // Check if new fields exist before using them
  const userSettings = request.settings ? request.settings : getDefaultSettings();
  
  // Process with available data
  return processUser(userId, userName, userSettings);
}

Database Schema Evolution

Database changes during rolling updates must support both old and new application versions.

graph TD
    A[Database Schema Changes] --> B[Additive Changes]
    A --> C[Schema Versioning]
    A --> D[Transition Periods]
    B --> B1[Add Tables/Columns]
    B --> B2[Add Indexes]
    B --> B3[Widen Fields]
    C --> C1[Schema Version Tables]
    C --> C2[Feature Flags]
    D --> D1[Maintain Old & New]
    D --> D2[Data Synchronization]
    D --> D3[Clean Up Post-Deployment]

Database Schema Evolution Example


-- Step 1: Add new columns without modifying existing ones
-- (Deploy this before application update)
ALTER TABLE customers ADD COLUMN phone_number VARCHAR(20);
ALTER TABLE customers ADD COLUMN address_line2 VARCHAR(100);

-- Step 2: Application code handles both schemas
-- Old version ignores new columns
-- New version uses new columns when available

-- Example application code (JavaScript) that works before and after the data migration:
function getCustomerContact(customerId) {
  const customer = getCustomerFromDatabase(customerId);
  
  // The new column exists after Step 1 but stays NULL until the row is migrated
  if (customer.phone_number) {
    // Row already populated - use the new field directly
    return {
      name: customer.name,
      phone: customer.phone_number,
      address: formatAddress(customer)
    };
  } else {
    // Row not yet migrated - use legacy contact info
    return {
      name: customer.name,
      phone: customer.contact_number, // Old field name
      address: customer.address // Single field in old schema
    };
  }
}

-- Step 3: Data migration (after deployment completes)
-- Update new fields with data from old fields
UPDATE customers 
SET phone_number = contact_number 
WHERE phone_number IS NULL AND contact_number IS NOT NULL;

-- Step 4: Eventually remove old columns
-- (Only after all instances are updated and data is migrated)
-- ALTER TABLE customers DROP COLUMN contact_number;

Database Changes Best Practices During Rolling Updates

  • Apply Database Changes First: Deploy schema changes before application updates
  • Never Remove: Don't remove columns or tables until all application instances are updated
  • Use Default Values: Provide sensible defaults for new columns
  • Split Large Changes: Break complex schema changes into multiple smaller deployments
  • Maintain Triggers/Views: Use database objects to maintain compatibility
  • Test Both Versions: Verify old and new application versions work with updated schema
  • Database Migrations: Use migration tools that support rollbacks
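
One way to keep both versions working during the transition period is to write to both the old and the new column and to read with a fallback. The sketch below assumes the `contact_number` and `phone_number` columns from the example above; `buildContactUpdate` and `readPhone` are hypothetical helpers, not part of any library.

```javascript
// Column values to write for a phone number while old and new versions coexist.
function buildContactUpdate(phone) {
  return {
    contact_number: phone, // legacy column, still read by old instances
    phone_number: phone,   // new column, read by new instances
  };
}

// Read path: prefer the new column and fall back to the legacy one.
function readPhone(row) {
  return row.phone_number ?? row.contact_number ?? null;
}

const written = buildContactUpdate('555-0100');
console.log(readPhone(written));                        // 555-0100
console.log(readPhone({ contact_number: '555-0199' })); // 555-0199
```

Once every instance runs the new version and the data migration has finished, the legacy write can be removed and the old column dropped.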

Health Checks and Readiness Probes

Health checks and readiness probes are critical components of successful rolling updates, ensuring that new instances are fully functional before receiving traffic.

Types of Health Checks

| Type | Purpose | Implementation |
|---|---|---|
| Liveness Probe | Determines if an instance should be restarted | Basic check that application process is responsive |
| Readiness Probe | Determines if an instance should receive traffic | Deep check that application is fully initialized and ready |
| Startup Probe | Handles initial application startup period | Specialized check for applications with long startup times |

Comprehensive Health Check Endpoint Implementation


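// Express.js health check endpoint implementation
// Assumes `db` is an initialized database client exposing query(), and that
// checkSystemResources() and checkConnectionPool() are application-specific helpers.
const express = require('express');
const axios = require('axios');
const app = express();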

// Simple liveness check - Is the application running?
app.get('/health/liveness', (req, res) => {
  res.status(200).json({ status: 'UP' });
});

// Readiness check - Is the application ready to serve traffic?
app.get('/health/readiness', async (req, res) => {
  try {
    // Check database connectivity
    const dbStatus = await checkDatabaseConnection();
    
    // Check dependent services
    const servicesStatus = await checkDependentServices();
    
    // Check internal state
    const appStatus = checkApplicationState();
    
    // Determine overall health status
    const isHealthy = dbStatus.healthy && 
                     servicesStatus.every(s => s.healthy) && 
                     appStatus.healthy;
    
    // Return detailed health status
    if (isHealthy) {
      res.status(200).json({
        status: 'UP',
        checks: {
          database: dbStatus,
          services: servicesStatus,
          application: appStatus
        },
        version: process.env.APP_VERSION || '1.0.0',
        timestamp: new Date().toISOString()
      });
    } else {
      res.status(503).json({
        status: 'DOWN',
        checks: {
          database: dbStatus,
          services: servicesStatus,
          application: appStatus
        },
        version: process.env.APP_VERSION || '1.0.0',
        timestamp: new Date().toISOString()
      });
    }
  } catch (error) {
    res.status(500).json({
      status: 'DOWN',
      error: error.message,
      timestamp: new Date().toISOString()
    });
  }
});

// Database connection check
async function checkDatabaseConnection() {
  const startTime = Date.now();
  try {
    await db.query('SELECT 1');
    return { 
      healthy: true, 
      responseTime: Date.now() - startTime // ms, measured rather than hard-coded
    };
  } catch (error) {
    return { 
      healthy: false, 
      error: error.message 
    };
  }
}

// Check dependent services health
async function checkDependentServices() {
  const services = [
    { name: 'auth-service', url: 'http://auth-service/health' },
    { name: 'payment-service', url: 'http://payment-service/health' }
  ];
  
  return Promise.all(services.map(async (service) => {
    try {
      const startTime = Date.now();
      const response = await axios.get(service.url, { timeout: 1000 });
      const responseTime = Date.now() - startTime;
      
      return {
        name: service.name,
        healthy: response.status === 200,
        responseTime
      };
    } catch (error) {
      return {
        name: service.name,
        healthy: false,
        error: error.message
      };
    }
  }));
}

// Application state check
function checkApplicationState() {
  // Check if application has completed initialization
  const initialized = global.appInitialized === true;
  
  // Check if application has necessary resources
  const resources = checkSystemResources();
  
  // Check connection pool health
  const connectionPool = checkConnectionPool();
  
  return {
    healthy: initialized && resources.healthy && connectionPool.healthy,
    initialized,
    resources,
    connectionPool
  };
}

Health Check Best Practices

  • Multi-Level Checks: Implement both shallow and deep health checks
  • Proper Timing: Set appropriate initialDelaySeconds to allow application startup
  • Check Dependencies: Verify connectivity to databases and services
  • Efficiency: Keep health checks lightweight and efficient
  • Avoid Side Effects: Health checks should not modify state or trigger business logic
  • Proper Status Codes: Use appropriate HTTP status codes for different scenarios
  • Version Information: Include application version in health check response
  • Performance Metrics: Include response time and resource utilization

Real-World Example: Health Check Engineering

A payment processing company implemented a sophisticated health check system for their microservices architecture:

  • Three-Tiered Health Checks:
    • Level 1 (Liveness): Basic process health (sub-50ms response time)
    • Level 2 (Readiness): Database connectivity and configuration validation
    • Level 3 (Deep Health): End-to-end test transactions, cache validation, and consistency checks
  • Progressive Exposure: New instances follow a staged traffic pattern:
    1. Pass Level 1 health check → Service registered but marked as "warming up"
    2. Pass Level 2 health check → Receive 5% of traffic for 60 seconds
    3. No errors during initial traffic → Receive full traffic share
    4. Level 3 health check runs continuously in background
  • Results: 99.99% successful deployments with zero customer impact, even with 20+ daily deployments across their service fleet
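
The staged traffic pattern above can be expressed as a small decision function. This is a sketch under stated assumptions: `trafficShare` and its input fields are hypothetical, the 5% figure is interpreted as a fraction of the instance's normal share, and removing an instance that produces errors during the trial is an assumption, since the example does not say what happens in that case.

```javascript
// Traffic share for a new instance under the staged exposure pattern.
// Returns a fraction of the instance's normal share: 0, 0.05, or 1.
function trafficShare({ level1Passed, level2Passed, trialSeconds, trialErrors }) {
  if (!level1Passed || !level2Passed) return 0; // registered but still warming up
  if (trialErrors > 0) return 0;                // assumption: errors remove the instance
  if (trialSeconds < 60) return 0.05;           // 5% of traffic for the first 60 seconds
  return 1;                                     // full traffic share
}

console.log(trafficShare({ level1Passed: true, level2Passed: false, trialSeconds: 0, trialErrors: 0 })); // 0
console.log(trafficShare({ level1Passed: true, level2Passed: true, trialSeconds: 20, trialErrors: 0 })); // 0.05
console.log(trafficShare({ level1Passed: true, level2Passed: true, trialSeconds: 60, trialErrors: 0 })); // 1
```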

Connection Draining and Session Management

Managing existing connections and user sessions during rolling updates is critical to providing seamless user experiences.

Connection Draining

Connection draining is the process of allowing existing connections to complete naturally before removing an instance from service. This helps prevent disruptions to active user sessions during updates.

sequenceDiagram
    participant LB as Load Balancer
    participant I1 as Instance 1 (to be updated)
    participant User1 as User with Active Session
    participant User2 as New User
    User1->>I1: Established Session
    Note over LB,I1: Begin update process
    LB-->>I1: Stop sending new connections
    User2->>LB: New Request
    LB-->>User2: Route to other instances
    User1->>I1: Continue existing session
    I1->>User1: Response
    Note over LB,I1: Connection draining period
    User1->>I1: Final request in session
    I1->>User1: Final response
    Note over LB,I1: Draining complete
    LB-->>I1: Terminate instance
    Note over I1: Update to new version
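
Inside the application, draining comes down to refusing new work while counting in-flight requests down to zero. The sketch below is a minimal, framework-independent illustration; `DrainTracker` and its methods are hypothetical names, and a real service would also enforce a timeout so that a stuck request cannot block the update indefinitely.

```javascript
// Tracks in-flight requests and signals when an instance has fully drained.
class DrainTracker {
  constructor() {
    this.inFlight = 0;
    this.draining = false;
    this.onDrained = null;
  }

  // Returns false when the instance is draining and must not take new work.
  tryAccept() {
    if (this.draining) return false;
    this.inFlight += 1;
    return true;
  }

  // Call when a previously accepted request completes.
  finish() {
    this.inFlight -= 1;
    this.maybeNotify();
  }

  // Stop accepting new work; onDrained runs once in-flight work reaches zero.
  startDraining(onDrained) {
    this.draining = true;
    this.onDrained = onDrained;
    this.maybeNotify();
  }

  maybeNotify() {
    if (this.draining && this.inFlight === 0 && this.onDrained) {
      const callback = this.onDrained;
      this.onDrained = null; // fire only once
      callback();
    }
  }
}

const tracker = new DrainTracker();
tracker.tryAccept();                                  // an active request
tracker.startDraining(() => console.log('drained'));  // update begins
console.log(tracker.tryAccept());                     // false: new work is refused
tracker.finish();                                     // last request completes, prints "drained"
```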

AWS Connection Draining Configuration


# CloudFormation template for ELB with connection draining
Resources:
  MyLoadBalancer:
    Type: AWS::ElasticLoadBalancing::LoadBalancer
    Properties:
      Listeners:
        - LoadBalancerPort: '80'
          InstancePort: '80'
          Protocol: HTTP
      HealthCheck:
        Target: HTTP:80/health
        HealthyThreshold: '3'
        UnhealthyThreshold: '5'
        Interval: '30'
        Timeout: '5'
      ConnectionDrainingPolicy:
        Enabled: true
        Timeout: 300  # 5 minutes to allow connections to complete

Nginx Connection Draining Example


# Nginx configuration for graceful shutdown
events {
    worker_connections 1024;
}

http {
    upstream backend {
        server backend1.example.com;
        server backend2.example.com;
        server backend3.example.com;
    }
    
    server {
        listen 80;
        
        location / {
            proxy_pass http://backend;
            proxy_http_version 1.1;
            proxy_set_header Connection "";  # Enable keepalive connections
            
            # Retry on the next upstream server when one fails or is being updated,
            # and bound how long the proxy waits on any single server
            proxy_next_upstream error timeout http_502 http_503 http_504;
            proxy_connect_timeout 5s;
            proxy_read_timeout 60s;
            proxy_send_timeout 60s;
        }
    }
}

# Graceful shutdown script
#!/bin/bash
# Signal Nginx to stop accepting new connections but finish processing existing ones
nginx -s quit

# Wait for connections to drain (adjust timeout as needed)
timeout=300  # 5 minutes
interval=5   # Check every 5 seconds
elapsed=0

while [ "$elapsed" -lt "$timeout" ]; do
    # Count established connections whose local port is exactly 80
    # (a plain grep for ":80" would also match ports such as 8080)
    active_connections=$(ss -Htn state established '( sport = :80 )' | wc -l)
    
    if [ "$active_connections" -eq 0 ]; then
        echo "All connections drained, proceeding with update"
        break
    fi
    
    echo "Waiting for $active_connections connections to complete..."
    sleep "$interval"
    elapsed=$((elapsed + interval))
done

# Continue with the update process

Session Management Strategies

Managing user sessions across instance updates requires proper design to prevent session loss.

| Strategy | Description | Pros | Cons |
|---|---|---|---|
| External Session Store | Store session data in Redis, Memcached, or database | Complete session persistence, works with any update strategy | Extra infrastructure, potential performance impact |
| Sticky Sessions | Route users to the same instance for their session duration | Simple to implement, no external dependencies | Sessions lost during instance updates, load balancing challenges |
| Client-Side Sessions | Store session data in cookies or local storage | No server-side state, simplified architecture | Limited storage capacity, security considerations |
| Session Replication | Replicate session data across instances | High availability, no external dependencies | Resource intensive, complex implementation |

External Session Store Implementation


// Node.js with Express and Redis Session Store
const express = require('express');
const session = require('express-session');
const RedisStore = require('connect-redis').default;
const { createClient } = require('redis');

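const app = express();

// Parse form-encoded request bodies so req.body is populated in the /login handler
app.use(express.urlencoded({ extended: false }));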

// Create Redis client
const redisClient = createClient({
  url: process.env.REDIS_URL || 'redis://localhost:6379'
});

redisClient.connect().catch(console.error);

// Configure session middleware with Redis store
app.use(session({
  store: new RedisStore({ client: redisClient }),
  secret: process.env.SESSION_SECRET || 'your-secret-key',
  resave: false,
  saveUninitialized: false,
  cookie: {
    secure: process.env.NODE_ENV === 'production',
    maxAge: 1000 * 60 * 60 * 24 // 1 day
  }
}));

// Session usage in application
app.get('/profile', (req, res) => {
  if (!req.session.user) {
    return res.redirect('/login');
  }
  
  res.render('profile', { user: req.session.user });
});

app.post('/login', (req, res) => {
  // Authenticate user
  const user = authenticateUser(req.body.username, req.body.password);
  
  if (user) {
    // Store user in session
    req.session.user = {
      id: user.id,
      username: user.username,
      email: user.email,
      preferences: user.preferences
    };
    
    res.redirect('/dashboard');
  } else {
    res.render('login', { error: 'Invalid credentials' });
  }
});

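// Start the HTTP server and keep a reference for graceful shutdown
// (port 3000 is an assumed default)
const server = app.listen(process.env.PORT || 3000);

// Graceful shutdown handling
process.on('SIGTERM', () => {
  console.log('Received SIGTERM signal, shutting down gracefully');
  
  // Stop accepting new connections and let in-flight requests finish first,
  // because those requests may still need the session store
  server.close(async () => {
    console.log('HTTP server closed');
    
    // Close the Redis connection once no request can use it
    await redisClient.quit();
    process.exit(0);
  });
});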

Connection Draining and Session Management Best Practices

  • Appropriate Timeouts: Set connection draining timeouts based on typical session duration
  • Externalize State: Keep session data outside application instances
  • Graceful Shutdown: Implement proper shutdown hooks to handle in-flight requests
  • Sticky Session Fallbacks: If using sticky sessions, implement fallback for session recovery
  • Session Versioning: Include version information in session data for compatibility
  • Monitoring: Track connection counts during draining to ensure proper completion
  • Client Communication: Consider notifying clients about maintenance or updates
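
Session versioning can be sketched as an upgrade step applied whenever a session is loaded. Everything in this example is hypothetical: the `version` field, the `theme` and `preferences` fields, and the `upgradeSession` helper illustrate the pattern and are not taken from the application above. The upgraded session keeps the old field so that instances still running the previous version can read it during the rollout.

```javascript
const CURRENT_SESSION_VERSION = 2;

// Upgrade a session object to the current format without breaking old readers.
function upgradeSession(session) {
  const version = session.version ?? 1; // sessions written before versioning have no field
  if (version >= CURRENT_SESSION_VERSION) return session;
  return {
    ...session, // keeps `theme` for instances still on the old version
    version: CURRENT_SESSION_VERSION,
    preferences: { theme: session.theme ?? 'default' }, // nested form read by new instances
  };
}

console.log(upgradeSession({ userId: 7, theme: 'dark' }));
```

After the rollout completes and no old instance remains, a later release can stop writing the legacy field.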

Monitoring and Rollback Strategies

Effective monitoring and the ability to quickly roll back are essential safety nets for rolling updates.

Monitoring During Rolling Updates

graph TD
    A[Key Metrics to Monitor] --> B[System Metrics]
    A --> C[Application Metrics]
    A --> D[Business Metrics]
    A --> E[Deployment Metrics]
    B --> B1[CPU/Memory Usage]
    B --> B2[Network Traffic]
    B --> B3[Disk I/O]
    C --> C1[Error Rates]
    C --> C2[Response Times]
    C --> C3[Request Throughput]
    D --> D1[Conversion Rates]
    D --> D2[User Engagement]
    D --> D3[Transaction Values]
    E --> E1[Deployment Progress]
    E --> E2[Health Check Success]
    E --> E3[Instance Startup Time]
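
Because old and new versions serve traffic side by side during a rolling update, application metrics are most useful when broken down by version. The sketch below computes a per-version error rate from a list of request records; `errorRateByVersion` and the record shape (`{ version, status }`) are assumptions made for illustration, and a production system would read these figures from its metrics backend instead.

```javascript
// Compute the share of 5xx responses for each application version.
function errorRateByVersion(requests) {
  const totals = {};
  for (const { version, status } of requests) {
    if (!totals[version]) totals[version] = { total: 0, errors: 0 };
    totals[version].total += 1;
    if (status >= 500) totals[version].errors += 1;
  }
  const rates = {};
  for (const [version, { total, errors }] of Object.entries(totals)) {
    rates[version] = errors / total;
  }
  return rates;
}

const rates = errorRateByVersion([
  { version: 'v1', status: 200 },
  { version: 'v1', status: 200 },
  { version: 'v2', status: 500 },
  { version: 'v2', status: 200 },
]);
console.log(rates); // { v1: 0, v2: 0.5 }
```

A new version whose error rate is clearly above the old version's is a stronger rollback signal than a fleet-wide average, which the still-healthy old instances would dilute.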

Prometheus Alert Rules for Deployment Monitoring


groups:
- name: deployment_alerts
  rules:
  - alert: HighErrorRateAfterDeployment
    expr: |
      sum(rate(http_requests_total{status=~"5.."}[5m])) by (service, version)
      / sum(rate(http_requests_total[5m])) by (service, version) > 0.05
      and on(service) (time() - max(deployment_time) by (service)) < 3600
    for: 2m
    labels:
      severity: critical
      category: deployment
    annotations:
      summary: "High error rate after deployment: {{ $labels.service }} v{{ $labels.version }}"
      description: "Error rate of {{ $value | humanizePercentage }} exceeds 5% threshold after recent deployment"
      
  - alert: LatencySpikeDuringDeployment
    expr: |
      (
        avg(rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m])) by (service)
        /
        avg(rate(http_request_duration_seconds_sum[1h] offset 2h) / rate(http_request_duration_seconds_count[1h] offset 2h)) by (service)
      ) > 1.5
      and on(service) (time() - max(deployment_time) by (service)) < 3600
    for: 3m
    labels:
      severity: warning
      category: deployment
    annotations:
      summary: "Latency spike during deployment: {{ $labels.service }}"
      description: "Response times are {{ $value | humanizePercentage }} of baseline during deployment"
      
  - alert: DeploymentStalled
    expr: |
      deployment_status{status="in_progress"} == 1
      and
      (time() - deployment_time) > 1800
    for: 5m
    labels:
      severity: warning
      category: deployment
    annotations:
      summary: "Deployment stalled: {{ $labels.service }}"
      description: "Deployment has been in progress for over 30 minutes without completion"

Rollback Strategies

Having effective rollback mechanisms is crucial for responding to issues during rolling updates.

| Rollback Approach | Description | Best For |
|---|---|---|
| Full Rollback | Revert all instances to the previous version | Critical issues affecting all new instances |
| Partial Rollback | Revert only problematic instances while continuing update for others | Issues affecting specific infrastructure or regions |
| Pause and Fix | Pause the rolling update, fix issues, and continue with corrected version | Minor problems that can be quickly resolved |
| Forward Fix | Deploy a new version that addresses the issues without reverting | Situations where rollback would cause more problems |

Automated Rollback Script for Kubernetes


#!/bin/bash
# Automated monitoring and rollback script for Kubernetes deployments

# Configuration
NAMESPACE="production"
DEPLOYMENT="my-app"
ERROR_THRESHOLD=5  # Error percentage triggering rollback
LATENCY_THRESHOLD=1.5  # Latency increase factor
MONITOR_DURATION=600  # Monitor for 10 minutes after deployment
INTERVAL=15  # Check every 15 seconds

# Get the current deployment status
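# (read from the Deployment's revision annotation; the first row of
# "kubectl rollout history" is the oldest revision, not the current one)
current_revision=$(kubectl get deployment/"$DEPLOYMENT" -n "$NAMESPACE" -o jsonpath='{.metadata.annotations.deployment\.kubernetes\.io/revision}')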
echo "Deployment $DEPLOYMENT updated to revision $current_revision"

# Record deployment time for future reference
deployment_time=$(date +%s)
echo "Starting post-deployment monitoring at $(date)"

# Monitor for the specified duration
end_time=$(($(date +%s) + $MONITOR_DURATION))

while [ $(date +%s) -lt $end_time ]; do
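  # Get error rate from Prometheus. -G with --data-urlencode sends the PromQL
  # as an encoded query parameter, so curl does not glob the {} and [] in it.
  ERROR_RATE=$(curl -s -G "http://prometheus:9090/api/v1/query" --data-urlencode "query=sum(rate(http_requests_total{namespace='$NAMESPACE',deployment='$DEPLOYMENT',status=~'5..'}[1m]))/sum(rate(http_requests_total{namespace='$NAMESPACE',deployment='$DEPLOYMENT'}[1m]))*100" | jq -r '.data.result[0].value[1]')
  
  # Get latency data from Prometheus
  CURRENT_LATENCY=$(curl -s -G "http://prometheus:9090/api/v1/query" --data-urlencode "query=histogram_quantile(0.95,sum(rate(http_request_duration_seconds_bucket{namespace='$NAMESPACE',deployment='$DEPLOYMENT'}[1m])) by (le))" | jq -r '.data.result[0].value[1]')
  BASELINE_LATENCY=$(curl -s -G "http://prometheus:9090/api/v1/query" --data-urlencode "query=histogram_quantile(0.95,sum(rate(http_request_duration_seconds_bucket{namespace='$NAMESPACE',deployment='$DEPLOYMENT'}[1h] offset 2h)) by (le))" | jq -r '.data.result[0].value[1]')
  
  # Missing or non-numeric samples ("null", "NaN") would break bc, so treat them as 0
  for metric in ERROR_RATE CURRENT_LATENCY BASELINE_LATENCY; do
    [[ ${!metric} =~ ^[0-9]+(\.[0-9]+)?$ ]] || printf -v "$metric" '0'
  done
  
  # Only compute the ratio when a non-zero baseline exists (avoids division by zero)
  if (( $(echo "$BASELINE_LATENCY > 0" | bc -l) )); then
    LATENCY_RATIO=$(echo "$CURRENT_LATENCY / $BASELINE_LATENCY" | bc -l)
  else
    LATENCY_RATIO=0
  fi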
  
  echo "Current metrics - Error rate: ${ERROR_RATE}%, P95 Latency: ${CURRENT_LATENCY}s (${LATENCY_RATIO}x baseline)"
  
  # Check if error rate exceeds threshold
  if (( $(echo "$ERROR_RATE > $ERROR_THRESHOLD" | bc -l) )); then
    echo "Error rate exceeded threshold (${ERROR_RATE}% > ${ERROR_THRESHOLD}%), initiating rollback..."
    kubectl rollout undo deployment/$DEPLOYMENT -n $NAMESPACE
    
    # Send alert to Alertmanager (v2 API; the v1 API has been removed from recent releases)
    curl -s -X POST "http://alertmanager:9093/api/v2/alerts" -H "Content-Type: application/json" -d '[{"labels":{"alertname":"DeploymentRollback","severity":"critical","deployment":"'"$DEPLOYMENT"'"},"annotations":{"summary":"Automatic rollback triggered","description":"Error rate of '"$ERROR_RATE"'% exceeded threshold of '"$ERROR_THRESHOLD"'%"}}]'
    
    echo "Rollback initiated at $(date)"
    exit 1
  fi
  
  # Check if latency exceeds threshold
  if (( $(echo "$LATENCY_RATIO > $LATENCY_THRESHOLD" | bc -l) )); then
    echo "Latency ratio exceeded threshold (${LATENCY_RATIO}x > ${LATENCY_THRESHOLD}x), initiating rollback..."
    kubectl rollout undo deployment/$DEPLOYMENT -n $NAMESPACE
    
    # Send alert to Alertmanager (v2 API; the v1 API has been removed from recent releases)
    curl -s -X POST "http://alertmanager:9093/api/v2/alerts" -H "Content-Type: application/json" -d '[{"labels":{"alertname":"DeploymentRollback","severity":"critical","deployment":"'"$DEPLOYMENT"'"},"annotations":{"summary":"Automatic rollback triggered","description":"Latency increase of '"$LATENCY_RATIO"'x exceeded threshold of '"$LATENCY_THRESHOLD"'x"}}]'
    
    echo "Rollback initiated at $(date)"
    exit 1
  fi
  
  # Check deployment status to ensure it's still progressing
  # (kubectl writes failures to stderr, so capture both streams)
  deployment_status=$(kubectl rollout status deployment/$DEPLOYMENT -n $NAMESPACE --watch=false 2>&1)
  if [[ $deployment_status == *"successfully rolled out"* ]]; then
    echo "Deployment completed successfully, continuing to monitor..."
  elif [[ $deployment_status == *"error"* ]]; then
    echo "Deployment encountered errors, initiating rollback..."
    kubectl rollout undo deployment/$DEPLOYMENT -n $NAMESPACE
    echo "Rollback initiated at $(date)"
    exit 1
  fi
  
  sleep $INTERVAL
done

echo "Post-deployment monitoring completed successfully at $(date)"
echo "Deployment is stable"
exit 0

Real-World Example: Progressive Rollback Strategy

A large SaaS company implemented a sophisticated monitoring and rollback strategy for their microservices platform:

  1. Deployment Phases: Rolling updates progressed through phases, with increasing traffic percentages
  2. Automated Canaries: First 5% of instances served as canaries with enhanced monitoring
  3. Multi-Dimensional Metrics: Each phase evaluated against 12 key metrics, including:
    • Error rates (by error type and endpoint)
    • Latency percentiles (p50, p90, p99)
    • CPU and memory utilization patterns
    • Business metrics (conversion rates, etc.)
    • Dependent service impacts
  4. Rollback Strategy: A tiered response system based on severity:
    • Level 1 (Minor Issues): Pause deployment, investigate, proceed if resolved
    • Level 2 (Moderate Issues): Partial rollback of affected instances/regions
    • Level 3 (Critical Issues): Full automatic rollback with engineer notification
    • Level 4 (Catastrophic Issues): Circuit breaker activation and incident response
  5. Post-Mortem Process: Any rollback triggered a thorough analysis process before the next deployment attempt

This approach reduced their deployment incidents by 87% while increasing deployment frequency by 3x, allowing them to safely ship new features multiple times per day.
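
The tiered response can be sketched as a lookup from severity level to action. The grading rule below, which compares each metric with its baseline, is a hypothetical stand-in: the ratio thresholds and the names `gradeMetric` and `respond` are assumptions for illustration, not details of the company's system.

```javascript
// Actions per severity level, following the four tiers listed above.
const RESPONSES = [
  'continue',                              // level 0: no issue detected
  'pause-and-investigate',                 // level 1: minor
  'partial-rollback',                      // level 2: moderate
  'full-rollback-and-notify',              // level 3: critical
  'circuit-breaker-and-incident-response', // level 4: catastrophic
];

// Hypothetical rule: grade a metric by how far it exceeds its baseline.
// The ratio thresholds are illustrative assumptions.
function gradeMetric(current, baseline) {
  if (baseline <= 0) return current > 0 ? 1 : 0; // no baseline: ask a human to look
  const ratio = current / baseline;
  if (ratio >= 10) return 4;
  if (ratio >= 3) return 3;
  if (ratio >= 2) return 2;
  if (ratio >= 1.25) return 1;
  return 0;
}

// The overall response is driven by the worst-graded metric.
function respond(metrics) {
  const level = Math.max(0, ...metrics.map((m) => gradeMetric(m.current, m.baseline)));
  return { level, action: RESPONSES[level] };
}

const outcome = respond([
  { name: 'error_rate_percent', current: 2, baseline: 1 }, // 2x baseline
  { name: 'p99_latency_ms', current: 450, baseline: 400 }, // 1.125x baseline
]);
console.log(outcome); // prints { level: 2, action: 'partial-rollback' }
```

Taking the maximum across metrics means a single badly degraded signal is enough to escalate, even when every other metric looks healthy.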

Advanced Rolling Update Techniques

Traffic Shaping During Rolling Updates

Advanced traffic management techniques can improve the safety and efficiency of rolling updates.

Shadow Testing Implementation


# Nginx configuration for shadow testing
http {
    upstream production_backend {
        server prod1.example.com;
        server prod2.example.com;
        server prod3.example.com;
    }
    
    upstream shadow_backend {
        server shadow1.example.com;
        server shadow2.example.com;
    }
    
    server {
        listen 80;
        
        location / {
            # Mirror each request to the shadow backend; the client only
            # ever sees the production response
            mirror /shadow;
            
            # Primary request to production backend
            proxy_pass http://production_backend;
        }
        
        # Shadow endpoint that won't return response to client
        location = /shadow {
            internal;
            
            # Shadow only GET requests. The "mirror" directive is not allowed
            # inside "if", so the method filter lives here instead.
            if ($request_method != GET) {
                return 204;
            }
            
            proxy_pass http://shadow_backend$request_uri;
            proxy_ignore_client_abort on;
            proxy_connect_timeout 1s;  # Short timeout to not impact performance
            proxy_read_timeout 10s;
        }
    }
}

Feature Flag Integration with Rolling Update


// Using feature flags with rolling updates
import express from 'express';
import LaunchDarkly from 'launchdarkly-node-server-sdk';

// Initialize LaunchDarkly client (the Node server SDK exposes init(), not initialize())
const ldClient = LaunchDarkly.init('YOUR_SDK_KEY');

// Wait for client to be ready
await ldClient.waitForInitialization();

// Application configuration with feature flags
const app = express();

app.get('/api/products', async (req, res) => {
  const user = {
    key: req.user.id,
    custom: {
      groups: req.user.groups,
      // Add instance version as an attribute
      instanceVersion: process.env.APP_VERSION || '1.0.0'
    }
  };
  
  // Check if new product search algorithm should be enabled
  // Can be targeted to specific app versions, users, or instance groups
  const useNewAlgorithm = await ldClient.variation('new-search-algorithm', user, false);
  
  if (useNewAlgorithm) {
    // Use new implementation (only in updated instances)
    const products = await newSearchImplementation(req.query);
    res.json(products);
  } else {
    // Use old implementation (works in both old and new instances)
    const products = await legacySearchImplementation(req.query);
    res.json(products);
  }
});

// Feature flag for gradual activation during rolling update
// 1. Create the feature flag and leave it off
// 2. Perform the rolling update so every instance runs the flag-guarded code
// 3. Gradually enable the flag for an increasing percentage of traffic
// 4. Monitor, and turn the flag off if issues occur (without rolling back the deployment)
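
Enabling a flag for an increasing percentage of traffic is usually done with deterministic bucketing, so that a given user keeps the same experience as the percentage grows. The following is a minimal, generic sketch of that idea. It is not LaunchDarkly's actual algorithm; the helper names `fnv1a`, `bucketFor`, and `isEnabled` and the sample user keys are assumptions made for illustration.

```javascript
// 32-bit FNV-1a hash; any stable hash would do for this sketch.
function fnv1a(text) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash;
}

// Map a user to a stable bucket in [0, 100) for a given flag.
// Including the flag key keeps buckets independent across flags.
function bucketFor(flagKey, userKey) {
  return fnv1a(`${flagKey}:${userKey}`) % 100;
}

// A user is in the rollout when their bucket falls below the current percentage.
function isEnabled(flagKey, userKey, rolloutPercent) {
  return bucketFor(flagKey, userKey) < rolloutPercent;
}

// Raising the percentage only ever adds users; nobody flips back to the old path.
const sampleUsers = ['user-1', 'user-2', 'user-3', 'user-4'];
for (const percent of [0, 10, 50, 100]) {
  const enabled = sampleUsers.filter((u) => isEnabled('new-search-algorithm', u, percent));
  console.log(`${percent}% ->`, enabled);
}
```

Because the bucket depends only on the flag key and the user key, every instance computes the same answer for the same user, whether it has been updated yet or not.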

Automated Rollout Strategies

Automation can enhance the reliability and efficiency of rolling updates.

graph TD
    A[Automated Rollout] --> B[Progressive Exposure]
    A --> C[Metric-Based Progression]
    A --> D[Circuit Breaking]
    B --> B1[Instance Batching]
    B --> B2[Traffic Shifting]
    C --> C1[Automatic Analysis]
    C --> C2[Statistical Validation]
    D --> D1[Automatic Pause]
    D --> D2[Rollback Triggers]
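
Instance batching, the first branch under Progressive Exposure, can be sketched as a small planning function. The example below assumes cumulative exposure targets of 5%, 25%, 50%, and 100%; those percentages and the name `planBatches` are illustrative choices, not values from any particular tool.

```javascript
// Split a fleet into batches that reach a series of cumulative exposure targets.
// The default percentages are illustrative assumptions.
function planBatches(totalInstances, cumulativePercents = [5, 25, 50, 100]) {
  const batches = [];
  let updated = 0;
  for (const percent of cumulativePercents) {
    // Round up so that even a tiny fleet updates at least one instance first.
    const target = Math.min(totalInstances, Math.ceil((totalInstances * percent) / 100));
    if (target > updated) {
      batches.push(target - updated);
      updated = target;
    }
  }
  return batches;
}

console.log(planBatches(20)); // prints [ 1, 4, 5, 10 ]
console.log(planBatches(3));  // prints [ 1, 1, 1 ]
```

Steps that would not add any instance are skipped, which is why a three-instance fleet ends up with three single-instance batches instead of four.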

Real-World Example: Google's Automated Canary Analysis

Google has developed a sophisticated automated deployment platform with rolling updates that incorporates:

  • Progressive Rollout: Updates proceed through multiple phases from canary to global deployment
  • Automatic Analysis: Statistical models compare metrics between versions
  • Multi-Dimensional Evaluation: System evaluates key SLI/SLOs across services
  • Bake-In Periods: Mandatory observation periods between deployment phases
  • Self-Service Tools: Service owners can configure rollout parameters and key metrics
  • Feedback Loops: Deployment system learns from past successes and failures

This system enables Google to safely deploy thousands of changes per day across their service fleet, with minimal human intervention for routine deployments.
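
The interplay between automatic analysis and mandatory observation periods can be sketched as a per-phase gate. This is a simplified illustration and not a description of Google's system: a plain ratio check stands in for a real statistical comparison, and `evaluatePhase`, its parameters, and the sample thresholds are all assumptions.

```javascript
// Decide what to do with the current rollout phase.
// Each check compares the new version (canary) with the old one (baseline);
// maxRatio is the largest canary/baseline ratio still considered healthy.
function evaluatePhase({ elapsedSeconds, bakeSeconds, checks }) {
  const failed = checks
    .filter((c) => (c.baseline > 0 ? c.canary / c.baseline > c.maxRatio : c.canary > 0))
    .map((c) => c.name);

  // A failing metric ends the phase immediately, even mid-observation.
  if (failed.length > 0) return { decision: 'rollback', failed };
  // Healthy so far, but the observation period has not finished.
  if (elapsedSeconds < bakeSeconds) return { decision: 'wait', failed };
  // Healthy for the whole observation period: move to the next phase.
  return { decision: 'promote', failed };
}

const phaseChecks = [
  { name: 'error_rate_percent', canary: 1.1, baseline: 1.0, maxRatio: 1.5 },
  { name: 'p95_latency_ms', canary: 210, baseline: 200, maxRatio: 1.2 },
];
console.log(evaluatePhase({ elapsedSeconds: 120, bakeSeconds: 600, checks: phaseChecks }).decision); // prints "wait"
console.log(evaluatePhase({ elapsedSeconds: 900, bakeSeconds: 600, checks: phaseChecks }).decision); // prints "promote"
```

Checking for failures before checking the clock lets a bad release be stopped early, while a good one still has to wait out the full observation period.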

Learning Activities

Activity 1: Rolling Update Strategy Design

Design a rolling update strategy for a microservices-based application with the following characteristics:

Your strategy should include:

Activity 2: Implementing Rolling Updates in Kubernetes

Create Kubernetes deployment configurations for a three-tier application:

  1. Frontend web application with 5 replicas
  2. Backend API service with 3 replicas
  3. Database with primary and standby replicas

Implement the following:

Activity 3: Rolling Update Failure Scenarios

Analyze the following rolling update failure scenarios and develop response plans:

  1. New version passes health checks but shows increased error rates after 10 minutes
  2. Database connection pool exhaustion during update
  3. Intermittent failures affecting only some instances
  4. Version compatibility issue between services updated at different times
  5. Memory leak in new version causing gradual degradation

For each scenario, describe:

Key Takeaways

Further Learning Resources