Rolling Updates: Implementing Continuous Deployment with Zero Downtime

Introduction to Rolling Updates

Rolling updates represent one of the most widely used deployment strategies for achieving zero-downtime deployments. Unlike blue-green deployments that switch traffic between two complete environments, or canary releases that gradually shift traffic percentages, rolling updates focus on incrementally updating instances within a single environment.

In a rolling update, application instances are updated one subset at a time. This allows the application to remain available throughout the deployment process, as only a portion of the instances are unavailable at any given time. The deployment proceeds through the environment until all instances are running the new version.

sequenceDiagram
    participant LB as Load Balancer
    participant I1 as Instance 1
    participant I2 as Instance 2
    participant I3 as Instance 3
    participant I4 as Instance 4
    Note over I1,I4: Initial State: All instances running v1
    LB->>I1: User Request
    I1->>LB: Response (v1)
    LB->>I2: User Request
    I2->>LB: Response (v1)
    Note over I1,I4: Phase 1: Update Instance 1
    LB--xI1: Stop sending traffic
    Note over I1: Update to v2
    Note over I1: Health check
    LB->>I1: Resume traffic
    LB->>I1: User Request
    I1->>LB: Response (v2)
    LB->>I2: User Request
    I2->>LB: Response (v1)
    Note over I1,I4: Phase 2: Update Instance 2
    LB--xI2: Stop sending traffic
    Note over I2: Update to v2
    Note over I2: Health check
    LB->>I2: Resume traffic
    LB->>I2: User Request
    I2->>LB: Response (v2)
    Note over I1,I4: Continue until all instances are updated

The Rotating Tires Analogy

Rolling updates can be compared to how a mechanic replaces tires on a car:

Traditional Deployment: Take the entire car off the road, replace all four tires at once, then put it back in service (causing downtime).
Blue-Green Deployment: Build a second identical car with new tires, then switch drivers from the old car to the new one (instant cutover).
Canary Release: Replace one tire, drive the car carefully with a few passengers to test it, then gradually replace the rest if no issues arise.
Rolling Update: Jack up one corner of the car at a time, replace that tire, lower it, and move to the next tire. The car remains partially functional and never completely "off the road" during the update process.

Benefits and Challenges of Rolling Updates

Benefits

Zero Downtime: Application remains available throughout the update process
Resource Efficiency: Requires no additional infrastructure beyond a small buffer capacity
Gradual Rollout: Issues affect only a subset of instances and can be detected before the rollout reaches the whole fleet
Simple Implementation: Most orchestration platforms support rolling updates natively
Built-in Verification: Health checks ensure each new instance is functioning before proceeding
Traffic Management: Load balancers automatically route around instances being updated
Minimal Configuration: Requires little special configuration beyond normal deployment

Challenges

Version Coexistence: Multiple versions run simultaneously during the rollout
Database Compatibility: Database changes must be compatible with both versions
Rollback Complexity: Partial rollbacks can lead to more version mixing
Slower Deployment: Takes longer to complete than an all-at-once deployment
Subtle Issues: Some problems may only appear when old and new versions interact
Instance Readiness: Proper health checks are critical to keep traffic away from instances that are not yet ready
Capacity Planning: Must maintain sufficient capacity during the update process

Real-World Example: E-commerce Platform Rolling Update

A major e-commerce platform implements rolling updates for their product catalog service:

Deployment Configuration:
- 30 total instances across 3 availability zones
- Update 3 instances at a time (10% of capacity)
- 30-second health check grace period before allowing traffic
- 60-second pause between batch updates to monitor performance

Benefits Realized:
- Maintained 100% availability during deployments
- Detected performance regressions before affecting all users
- Reduced deployment risk during high-traffic holiday periods

Challenges Addressed:
- Implemented backward-compatible API versions to handle version coexistence
- Used database change patterns that work with both old and new code
- Integrated comprehensive monitoring to catch subtle issues

The platform now deploys updates to production multiple times per day with minimal risk, compared to their previous weekly deployment schedule with occasional downtime.
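
The deployment configuration above implies a lower bound on how long a full rollout takes. The following is a rough sketch of that arithmetic, assuming batches run strictly one after another and ignoring the time needed to install and start the new version. The function and parameter names are illustrative and not part of any platform API.

```javascript
// Estimate the minimum duration of a batched rolling update.
// Assumption: batches run sequentially; install and startup time are not counted.
function estimateRollout({ totalInstances, batchSize, gracePeriodSec, pauseSec }) {
  const batches = Math.ceil(totalInstances / batchSize);
  // Every batch waits out the health check grace period. The pause sits
  // between batches, so there is one fewer pause than there are batches.
  const minSeconds = batches * gracePeriodSec + (batches - 1) * pauseSec;
  // Share of the fleet still serving traffic while one batch is being updated.
  const capacityDuringBatch = (totalInstances - batchSize) / totalInstances;
  return { batches, minSeconds, capacityDuringBatch };
}

const plan = estimateRollout({
  totalInstances: 30,
  batchSize: 3,
  gracePeriodSec: 30,
  pauseSec: 60,
});
console.log(plan);
// { batches: 10, minSeconds: 840, capacityDuringBatch: 0.9 }
```

Under these assumptions the rollout needs at least 14 minutes of waiting time, with 90% of the fleet serving traffic at any moment.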

Key Concepts in Rolling Updates

Update Strategies

There are several variations of rolling update strategies, each with different trade-offs:

Strategy	Description	Pros	Cons
Basic Rolling Update	Update one instance at a time in sequence	Minimal capacity impact, simplest to implement	Slowest deployment time, extended version coexistence
Batched Rolling Update	Update multiple instances simultaneously in batches	Faster deployment, balanced risk/speed	Requires more capacity headroom during deployment
Surge Rolling Update	Create new instances before terminating old ones	Maintains full capacity, faster readiness	Temporarily requires additional infrastructure
Zone-Based Rolling Update	Update all instances in one availability zone at a time	Geographic isolation of risk, simpler tracking	Requires cross-zone redundancy, zone imbalance during update

graph TD
    subgraph "Basic Rolling Update"
        A1["75% v1<br/>25% v2"] --> A2["50% v1<br/>50% v2"] --> A3["25% v1<br/>75% v2"] --> A4["0% v1<br/>100% v2"]
    end
    subgraph "Surge Rolling Update"
        B1["100% v1<br/>25% v2"] --> B2["75% v1<br/>50% v2"] --> B3["50% v1<br/>75% v2"] --> B4["25% v1<br/>100% v2"] --> B5["0% v1<br/>100% v2"]
    end
    subgraph "Zone-Based Rolling Update"
        C1["Zone A: v2<br/>Zone B: v1<br/>Zone C: v1"] --> C2["Zone A: v2<br/>Zone B: v2<br/>Zone C: v1"] --> C3["Zone A: v2<br/>Zone B: v2<br/>Zone C: v2"]
    end
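
The capacity percentages for the basic and surge strategies can be derived from simple counting. The sketch below assumes a fleet of four instances updated one at a time; `rolloutPhases` is a hypothetical helper written for this illustration, not a platform API.

```javascript
// Per-step instance counts for a fleet of `n` instances updated one at a time.
// "basic" replaces an instance in place; "surge" starts the replacement first,
// so the old instance keeps running until the next step.
function rolloutPhases(n, strategy) {
  const phases = [];
  for (let updated = 1; updated <= n; updated++) {
    const old = strategy === 'surge' ? n - updated + 1 : n - updated;
    phases.push({ v1: old, v2: updated });
  }
  // Surge needs one final step to retire the last old instance.
  if (strategy === 'surge') phases.push({ v1: 0, v2: n });
  return phases;
}

const asPercent = (count, n) => `${(count / n) * 100}%`;

for (const strategy of ['basic', 'surge']) {
  const steps = rolloutPhases(4, strategy)
    .map(p => `${asPercent(p.v1, 4)} v1 / ${asPercent(p.v2, 4)} v2`);
  console.log(strategy, steps);
}
```

With surge, the combined v1 and v2 count peaks at n + 1 instances, which is the temporary extra infrastructure listed as the cost of that strategy.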

Critical Components for Successful Rolling Updates

Load Balancing: Distributes traffic only to available and healthy instances
Health Checks: Verifies new instances are ready before sending traffic
Connection Draining: Allows in-flight requests to complete before removing instances
Session Persistence: Handles user sessions during the update process
Capacity Planning: Ensures sufficient resources during the update
Monitoring: Detects issues with new instances before proceeding
Rollback Capability: Reverts to previous version if problems occur

Rolling Update Configuration Best Practices

Batch Size: Start with small batch sizes (10-20% of total capacity) and adjust based on confidence and urgency
Health Check Grace Period: Allow sufficient time for application startup and warming before health checks (30+ seconds for most applications)
Inter-batch Delay: Consider adding a delay between batches to observe performance and catch issues
Timeout: Set an overall deployment timeout to prevent "stuck" deployments
Capacity Buffer: Maintain 20-30% extra capacity to handle traffic during instance updates
Failure Threshold: Define a threshold for failed instances that will trigger a rollback
Cross-Zone Distribution: Ensure updates affect instances across availability zones evenly
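
How batch size and failure threshold interact can be sketched as a small control loop. This is a minimal illustration, not a production orchestrator: `updateInstance` and `isHealthy` are hypothetical callbacks supplied by the caller, and the option names are assumptions.

```javascript
// Minimal rolling-update loop: update in batches, verify health after each
// batch, and stop as soon as the failure threshold is reached.
async function rollingUpdate(instances, { batchPercent = 10, failureThreshold = 1, updateInstance, isHealthy }) {
  const batchSize = Math.max(1, Math.floor((instances.length * batchPercent) / 100));
  const updated = [];
  let failures = 0;

  for (let i = 0; i < instances.length; i += batchSize) {
    const batch = instances.slice(i, i + batchSize);
    await Promise.all(batch.map(updateInstance));
    const results = await Promise.all(batch.map(isHealthy));
    failures += results.filter(ok => !ok).length;
    updated.push(...batch);
    if (failures >= failureThreshold) {
      // The caller decides how to roll back the instances listed in `updated`.
      return { status: 'rollback', updated, failures };
    }
  }
  return { status: 'complete', updated, failures };
}

// Simulated fleet of 10 instances where instance-5 fails its health check.
const fleet = Array.from({ length: 10 }, (_, i) => `instance-${i + 1}`);
const demoRun = rollingUpdate(fleet, {
  batchPercent: 20,
  failureThreshold: 1,
  updateInstance: async () => {},                  // pretend the deploy succeeded
  isHealthy: async name => name !== 'instance-5',  // one bad instance
});
demoRun.then(result => console.log(result.status, result.updated.length));
// rollback 6
```

In the simulated run the loop stops after the third batch, so four of the ten instances never receive the faulty version.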

Rolling Update Implementations Across Platforms

Kubernetes Rolling Updates

Kubernetes provides built-in support for rolling updates through Deployment resources.

Kubernetes Deployment with Rolling Update Strategy


apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
  labels:
    app: my-app
spec:
  replicas: 5
  selector:
    matchLabels:
      app: my-app
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # How many pods can be created above desired number
      maxUnavailable: 1  # How many pods can be unavailable during the update
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-app
        image: my-app:v2
        ports:
        - containerPort: 8080
        readinessProbe:    # Defines when a pod is ready to serve traffic
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 5
        livenessProbe:     # Defines when a pod should be restarted
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 15
          periodSeconds: 20
        resources:
          limits:
            cpu: "500m"
            memory: "512Mi"
          requests:
            cpu: "100m"
            memory: "128Mi"
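
The effect of maxSurge and maxUnavailable on pod counts can be worked out by hand. The sketch below does so for the manifest above. It relies on the documented Kubernetes rounding rules for percentage values (maxSurge rounds up, maxUnavailable rounds down); the helper names are made up for this example.

```javascript
// Resolve a maxSurge / maxUnavailable setting (absolute number or "25%" string)
// into a pod count. Percentages round up for surge and down for unavailable.
function resolveSetting(value, replicas, roundUp) {
  if (typeof value === 'number') return value;
  const pods = (parseFloat(value) * replicas) / 100;
  return roundUp ? Math.ceil(pods) : Math.floor(pods);
}

// Upper bound on total pods and lower bound on available pods during a rollout.
function rolloutBounds(replicas, maxSurge, maxUnavailable) {
  const surge = resolveSetting(maxSurge, replicas, true);
  const unavailable = resolveSetting(maxUnavailable, replicas, false);
  return {
    maxTotalPods: replicas + surge,
    minAvailablePods: replicas - unavailable,
  };
}

console.log(rolloutBounds(5, 1, 1));         // { maxTotalPods: 6, minAvailablePods: 4 }
console.log(rolloutBounds(5, '25%', '25%')); // { maxTotalPods: 7, minAvailablePods: 4 }
```

For the five-replica Deployment shown, at most six pods exist at once and at least four stay available throughout the update.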

Kubernetes Rolling Update Command


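# Update a deployment with a new image
# (the --record flag is deprecated; record the reason with the change-cause annotation instead)
kubectl set image deployment/my-app my-app=my-app:v2
kubectl annotate deployment/my-app kubernetes.io/change-cause="Update image to my-app:v2" --overwrite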

# Monitor the rollout status
kubectl rollout status deployment/my-app

# Pause a rollout if issues are detected
kubectl rollout pause deployment/my-app

# Resume a rollout after issues are resolved
kubectl rollout resume deployment/my-app

# Rollback to the previous version if needed
kubectl rollout undo deployment/my-app

# View rollout history
kubectl rollout history deployment/my-app

AWS Auto Scaling Group Rolling Updates

AWS Auto Scaling Groups support rolling updates through instance refresh and launch template updates.

AWS CloudFormation Template for ASG with Rolling Update


Resources:
  MyAutoScalingGroup:
    Type: AWS::AutoScaling::AutoScalingGroup
    Properties:
      VPCZoneIdentifier:
        - subnet-12345678
        - subnet-87654321
      LaunchTemplate:
        LaunchTemplateId: !Ref MyLaunchTemplate
        Version: !GetAtt MyLaunchTemplate.LatestVersionNumber
      MinSize: 2
      MaxSize: 10
      DesiredCapacity: 5
      HealthCheckType: ELB
      HealthCheckGracePeriod: 300
      TargetGroupARNs:
        - !Ref MyTargetGroup
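      Tags:
        - Key: Name
          Value: MyApp
          PropagateAtLaunch: true
    # Rolling update behavior when the launch template changes.
    # UpdatePolicy is a resource attribute (a sibling of Properties, not a property).
    # Instance refresh preferences such as checkpoints are set through the
    # Auto Scaling API or CLI rather than in this resource.
    UpdatePolicy:
      AutoScalingRollingUpdate:
        MaxBatchSize: 1            # Replace one instance at a time
        MinInstancesInService: 4   # Keep at least 4 of the 5 instances serving traffic
        PauseTime: PT5M            # Wait 5 minutes between batches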

  MyLaunchTemplate:
    Type: AWS::EC2::LaunchTemplate
    Properties:
      LaunchTemplateName: MyAppLaunchTemplate
      VersionDescription: Initial version
      LaunchTemplateData:
        ImageId: ami-0abcdef1234567890
        InstanceType: t3.medium
        SecurityGroupIds:
          - sg-12345678
        UserData:
          Fn::Base64: !Sub |
            #!/bin/bash
            echo "Starting application version 1.0"
            # Application startup commands

AWS CLI Commands for Rolling Update


# Start an instance refresh (rolling update)
aws autoscaling start-instance-refresh \
  --auto-scaling-group-name MyAutoScalingGroup \
  --strategy Rolling \
  --preferences '{"MinHealthyPercentage": 90, "InstanceWarmup": 300}'

# Check the status of an instance refresh
aws autoscaling describe-instance-refreshes \
  --auto-scaling-group-name MyAutoScalingGroup

# Cancel an instance refresh if needed
aws autoscaling cancel-instance-refresh \
  --auto-scaling-group-name MyAutoScalingGroup

Docker Swarm Rolling Updates

Docker Swarm provides rolling update capabilities for services through update-config settings.

Docker Swarm Service with Rolling Update Configuration


version: '3.8'
services:
  web:
    image: nginx:latest
    deploy:
      replicas: 6
      update_config:
        parallelism: 2         # Update 2 containers at a time
        delay: 10s             # Wait 10s between updating a group of containers
        order: start-first     # Start new containers first, then stop old ones
        failure_action: pause  # Pause deployment if a container fails to start
        monitor: 60s           # Monitor for failures for 60s after each task update
      rollback_config:
        parallelism: 3         # Rollback 3 containers at a time
        delay: 5s              # Wait 5s between rolling back a group of containers
        failure_action: continue
        monitor: 30s
        order: stop-first
      restart_policy:
        condition: on-failure
        delay: 5s
        max_attempts: 3
    ports:
      - "80:80"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost"]
      interval: 5s
      timeout: 2s
      retries: 3
      start_period: 5s

Docker CLI Commands for Rolling Update


# Deploy or update a stack with rolling update configuration
docker stack deploy -c docker-compose.yml my-stack

# Update a service with a new image (with rolling update)
docker service update --image nginx:new-version my-stack_web

# Monitor service update progress
docker service ps my-stack_web

# Rollback a service update
docker service rollback my-stack_web

Version Compatibility Strategies

Since rolling updates involve running multiple versions simultaneously, compatibility between versions is critical for success.

API and Contract Compatibility

Backward Compatibility: New versions must accept inputs from old clients/services
Forward Compatibility: Old versions should handle inputs from new clients/services
Versioned APIs: Use explicit API versioning to manage compatibility
Request/Response Evolution: Add fields rather than changing existing ones
Graceful Degradation: Handle missing features or parameters elegantly

API Version Compatibility Example


// Express.js API with version compatibility
const express = require('express');
const app = express();

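// API v1 endpoint as shipped in the original release (shown for reference only).
// It returned { id, name, email }. In this release it is superseded by the
// backward compatible v1 handler further down, because Express only runs the
// first handler that matches a route and sends a response:
//
// app.get('/api/v1/users/:id', (req, res) => {
//   const user = getUserFromDatabase(req.params.id);
//   res.json({ id: user.id, name: user.name, email: user.email });
// });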

// API v2 endpoint (new version with additional fields)
app.get('/api/v2/users/:id', (req, res) => {
  const userId = req.params.id;
  const user = getEnhancedUserFromDatabase(userId);
  
  // v2 response format with additional fields
  res.json({
    id: user.id,
    name: user.name,
    email: user.email,
    profile: {
      avatar: user.avatar_url,
      bio: user.bio,
      created_at: user.joined_date
    }
  });
});

// Backward compatible v1 handler in v2 code
// This handles v1 requests during rolling update
app.get('/api/v1/users/:id', (req, res) => {
  const userId = req.params.id;
  const user = getEnhancedUserFromDatabase(userId);
  
  // Maintain v1 response format for backward compatibility
  res.json({
    id: user.id,
    name: user.name,
    email: user.email
  });
});

// Forward compatibility example
function processUserRequest(request) {
  // Handle new fields gracefully in old code
  // Ignore fields that didn't exist in previous version
  const userId = request.id;
  const userName = request.name;
  
  // Check if new fields exist before using them
  const userSettings = request.settings ? request.settings : getDefaultSettings();
  
  // Process with available data
  return processUser(userId, userName, userSettings);
}

Database Schema Evolution

Database changes during rolling updates must support both old and new application versions.

graph TD
    A[Database Schema Changes] --> B[Additive Changes]
    A --> C[Schema Versioning]
    A --> D[Transition Periods]
    B --> B1[Add Tables/Columns]
    B --> B2[Add Indexes]
    B --> B3[Widen Fields]
    C --> C1[Schema Version Tables]
    C --> C2[Feature Flags]
    D --> D1[Maintain Old & New]
    D --> D2[Data Synchronization]
    D --> D3[Clean Up Post-Deployment]

Database Schema Evolution Example


-- Step 1: Add new columns without modifying existing ones
-- (Deploy this before application update)
ALTER TABLE customers ADD COLUMN phone_number VARCHAR(20);
ALTER TABLE customers ADD COLUMN address_line2 VARCHAR(100);

-- Step 2: Application code handles both schemas
-- Old version ignores new columns
-- New version uses new columns when available

-- Example application code (JavaScript, not SQL) handling both schema versions:
function getCustomerContact(customerId) {
  const customer = getCustomerFromDatabase(customerId);
  
  // Handle both schema versions
  if (customer.phone_number) {
    // New schema - use new field directly
    return {
      name: customer.name,
      phone: customer.phone_number,
      address: formatAddress(customer)
    };
  } else {
    // Old schema - use legacy contact info
    return {
      name: customer.name,
      phone: customer.contact_number, // Old field name
      address: customer.address // Single field in old schema
    };
  }
}

-- Step 3: Data migration (after deployment completes)
-- Update new fields with data from old fields
UPDATE customers 
SET phone_number = contact_number 
WHERE phone_number IS NULL AND contact_number IS NOT NULL;

-- Step 4: Eventually remove old columns
-- (Only after all instances are updated and data is migrated)
-- ALTER TABLE customers DROP COLUMN contact_number;

Database Changes Best Practices During Rolling Updates

Apply Database Changes First: Deploy schema changes before application updates
Never Remove: Don't remove columns or tables until all application instances are updated
Use Default Values: Provide sensible defaults for new columns
Split Large Changes: Break complex schema changes into multiple smaller deployments
Maintain Triggers/Views: Use database objects to maintain compatibility
Test Both Versions: Verify old and new application versions work with updated schema
Database Migrations: Use migration tools that support rollbacks
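
One way to keep old and new columns in sync during the transition period is to read with a fallback and write to both columns. The sketch below assumes the `phone_number` and `contact_number` columns from the schema example above; the row objects are plain stand-ins for whatever a database driver would return, and both helper names are hypothetical.

```javascript
// Read path: prefer the new column, fall back to the legacy one.
function readPhone(row) {
  return row.phone_number ?? row.contact_number ?? null;
}

// Write path: populate both columns so that instances running either
// version see the same value until the legacy column is dropped.
function buildPhoneUpdate(phone) {
  return { phone_number: phone, contact_number: phone };
}

console.log(readPhone({ contact_number: '555-0100' }));                            // 555-0100
console.log(readPhone({ phone_number: '555-0199', contact_number: '555-0100' })); // 555-0199
console.log(buildPhoneUpdate('555-0123'));
```

Once every instance runs the new version and the data migration has completed, the fallback and the second write can be removed along with the legacy column.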

Health Checks and Readiness Probes

Health checks and readiness probes are critical components of successful rolling updates, ensuring that new instances are fully functional before receiving traffic.

Types of Health Checks

Type	Purpose	Implementation
Liveness Probe	Determines if an instance should be restarted	Basic check that application process is responsive
Readiness Probe	Determines if an instance should receive traffic	Deep check that application is fully initialized and ready
Startup Probe	Handles initial application startup period	Specialized check for applications with long startup times

Comprehensive Health Check Endpoint Implementation


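// Express.js health check endpoint implementation
const express = require('express');
const axios = require('axios'); // HTTP client used by checkDependentServices()
const app = express();

// Assumed to exist elsewhere in the application: `db` (a client exposing query()),
// checkSystemResources(), and checkConnectionPool().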

// Simple liveness check - Is the application running?
app.get('/health/liveness', (req, res) => {
  res.status(200).json({ status: 'UP' });
});

// Readiness check - Is the application ready to serve traffic?
app.get('/health/readiness', async (req, res) => {
  try {
    // Check database connectivity
    const dbStatus = await checkDatabaseConnection();
    
    // Check dependent services
    const servicesStatus = await checkDependentServices();
    
    // Check internal state
    const appStatus = checkApplicationState();
    
    // Determine overall health status
    const isHealthy = dbStatus.healthy && 
                     servicesStatus.every(s => s.healthy) && 
                     appStatus.healthy;
    
    // Return detailed health status
    if (isHealthy) {
      res.status(200).json({
        status: 'UP',
        checks: {
          database: dbStatus,
          services: servicesStatus,
          application: appStatus
        },
        version: process.env.APP_VERSION || '1.0.0',
        timestamp: new Date().toISOString()
      });
    } else {
      res.status(503).json({
        status: 'DOWN',
        checks: {
          database: dbStatus,
          services: servicesStatus,
          application: appStatus
        },
        version: process.env.APP_VERSION || '1.0.0',
        timestamp: new Date().toISOString()
      });
    }
  } catch (error) {
    res.status(500).json({
      status: 'DOWN',
      error: error.message,
      timestamp: new Date().toISOString()
    });
  }
});

// Database connection check
async function checkDatabaseConnection() {
  const startTime = Date.now();
  try {
    await db.query('SELECT 1');
    return { 
      healthy: true, 
      responseTime: Date.now() - startTime // measured round trip in ms
    };
  } catch (error) {
    return { 
      healthy: false, 
      error: error.message 
    };
  }
}

// Check dependent services health
async function checkDependentServices() {
  const services = [
    { name: 'auth-service', url: 'http://auth-service/health' },
    { name: 'payment-service', url: 'http://payment-service/health' }
  ];
  
  return Promise.all(services.map(async (service) => {
    try {
      const startTime = Date.now();
      const response = await axios.get(service.url, { timeout: 1000 });
      const responseTime = Date.now() - startTime;
      
      return {
        name: service.name,
        healthy: response.status === 200,
        responseTime
      };
    } catch (error) {
      return {
        name: service.name,
        healthy: false,
        error: error.message
      };
    }
  }));
}

// Application state check
function checkApplicationState() {
  // Check if application has completed initialization
  const initialized = global.appInitialized === true;
  
  // Check if application has necessary resources
  const resources = checkSystemResources();
  
  // Check connection pool health
  const connectionPool = checkConnectionPool();
  
  return {
    healthy: initialized && resources.healthy && connectionPool.healthy,
    initialized,
    resources,
    connectionPool
  };
}

Health Check Best Practices

Multi-Level Checks: Implement both shallow and deep health checks
Proper Timing: Set appropriate initialDelaySeconds to allow application startup
Check Dependencies: Verify connectivity to databases and services
Efficiency: Keep health checks lightweight and efficient
Avoid Side Effects: Health checks should not modify state or trigger business logic
Proper Status Codes: Use appropriate HTTP status codes for different scenarios
Version Information: Include application version in health check response
Performance Metrics: Include response time and resource utilization

Real-World Example: Health Check Engineering

A payment processing company implemented a sophisticated health check system for their microservices architecture:

Three-Tiered Health Checks:
- Level 1 (Liveness): Basic process health (sub-50ms response time)
- Level 2 (Readiness): Database connectivity and configuration validation
- Level 3 (Deep Health): End-to-end test transactions, cache validation, and consistency checks

Progressive Exposure: New instances follow a staged traffic pattern:
1. Pass Level 1 health check → Service registered but marked as "warming up"
2. Pass Level 2 health check → Receive 5% of traffic for 60 seconds
3. No errors during initial traffic → Receive full traffic share
4. Level 3 health check runs continuously in background

Results: 99.99% successful deployments with zero customer impact, even with 20+ daily deployments across their service fleet
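
The staged traffic pattern described above can be modeled as a small decision function. This is only a sketch of the idea: the state names, the input shape, and the choice to hold an instance at the limited share when errors occur are assumptions, since the example does not say what happens in that case.

```javascript
// Staged traffic exposure for a new instance.
const WARMUP_TRAFFIC_FRACTION = 0.05; // 5% of traffic during the trial window
const WARMUP_SECONDS = 60;

function trafficStage({ livenessOk, readinessOk, secondsSinceReady, errorsSinceReady }) {
  if (!livenessOk) return { state: 'unregistered', fraction: 0 };
  if (!readinessOk) return { state: 'warming-up', fraction: 0 };
  if (secondsSinceReady < WARMUP_SECONDS) {
    return { state: 'limited', fraction: WARMUP_TRAFFIC_FRACTION };
  }
  // Assumption: an instance that produced errors stays at the limited share.
  if (errorsSinceReady > 0) {
    return { state: 'limited', fraction: WARMUP_TRAFFIC_FRACTION };
  }
  return { state: 'full', fraction: 1 };
}

console.log(trafficStage({ livenessOk: true, readinessOk: false, secondsSinceReady: 0, errorsSinceReady: 0 }));
console.log(trafficStage({ livenessOk: true, readinessOk: true, secondsSinceReady: 20, errorsSinceReady: 0 }));
console.log(trafficStage({ livenessOk: true, readinessOk: true, secondsSinceReady: 90, errorsSinceReady: 0 }));
```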

Connection Draining and Session Management

Managing existing connections and user sessions during rolling updates is critical to providing seamless user experiences.

Connection Draining

Connection draining is the process of allowing existing connections to complete naturally before removing an instance from service. This helps prevent disruptions to active user sessions during updates.

sequenceDiagram
    participant LB as Load Balancer
    participant I1 as Instance 1 (to be updated)
    participant User1 as User with Active Session
    participant User2 as New User
    User1->>I1: Established Session
    Note over LB,I1: Begin update process
    LB-->>I1: Stop sending new connections
    User2->>LB: New Request
    LB-->>User2: Route to other instances
    User1->>I1: Continue existing session
    I1->>User1: Response
    Note over LB,I1: Connection draining period
    User1->>I1: Final request in session
    I1->>User1: Final response
    Note over LB,I1: Draining complete
    LB-->>I1: Terminate instance
    Note over I1: Update to new version
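
On the instance side, draining amounts to refusing new work while waiting for in-flight requests to finish, with a timeout as a safety net. The class below is a minimal sketch of that bookkeeping; `DrainTracker` and its method names are invented for this example and do not belong to any library.

```javascript
// Tracks in-flight requests so an instance can drain before it is updated.
class DrainTracker {
  constructor() {
    this.inFlight = 0;
    this.draining = false;
  }

  // Returns false once draining has started, so the caller can reject the
  // request (for example with a 503) and let the load balancer route elsewhere.
  tryAccept() {
    if (this.draining) return false;
    this.inFlight += 1;
    return true;
  }

  // Call when a previously accepted request has finished.
  complete() {
    this.inFlight -= 1;
  }

  // Resolves true when all requests have finished, or false if the timeout
  // expires first.
  async drain(timeoutMs, pollMs = 50) {
    this.draining = true;
    const deadline = Date.now() + timeoutMs;
    while (this.inFlight > 0 && Date.now() < deadline) {
      await new Promise(resolve => setTimeout(resolve, pollMs));
    }
    return this.inFlight === 0;
  }
}

const tracker = new DrainTracker();
tracker.tryAccept();                        // one request in flight
setTimeout(() => tracker.complete(), 100);  // it finishes 100 ms later
tracker.drain(1000).then(drained => console.log('drained:', drained));
```

A request handler would call `tryAccept()` on entry and `complete()` when the response has been sent; the shutdown hook would await `drain()` before exiting.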

AWS Connection Draining Configuration


# CloudFormation template for ELB with connection draining
Resources:
  MyLoadBalancer:
    Type: AWS::ElasticLoadBalancing::LoadBalancer
    Properties:
      Listeners:
        - LoadBalancerPort: '80'
          InstancePort: '80'
          Protocol: HTTP
      HealthCheck:
        Target: HTTP:80/health
        HealthyThreshold: '3'
        UnhealthyThreshold: '5'
        Interval: '30'
        Timeout: '5'
      ConnectionDrainingPolicy:
        Enabled: true
        Timeout: 300  # 5 minutes to allow connections to complete

Nginx Connection Draining Example


# Nginx configuration for graceful shutdown
events {
    worker_connections 1024;
}

http {
    upstream backend {
        server backend1.example.com;
        server backend2.example.com;
        server backend3.example.com;
    }
    
    server {
        listen 80;
        
        location / {
            proxy_pass http://backend;
            proxy_http_version 1.1;
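            proxy_set_header Connection "";  # Clear the Connection header so upstream connections can be kept alive
            
            # Retry on another upstream server when one fails or returns a gateway error,
            # and bound how long a request may wait on a backend
            proxy_next_upstream error timeout http_502 http_503 http_504;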
            proxy_connect_timeout 5s;
            proxy_read_timeout 60s;
            proxy_send_timeout 60s;
        }
    }
}

# Graceful shutdown script (a separate shell script, not part of the Nginx configuration)
#!/bin/bash
# Signal Nginx to stop accepting new connections but finish processing existing ones
nginx -s quit

# Wait for connections to drain (adjust timeout as needed)
timeout=300  # 5 minutes
interval=5   # Check every 5 seconds
elapsed=0

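while [ "$elapsed" -lt "$timeout" ]; do
    # Count established connections on local port 80 only
    # (a plain grep for ":80" would also match ports such as 8080 and closed sockets)
    active_connections=$(ss -Htn state established '( sport = :80 )' | wc -l)
    
    if [ "$active_connections" -eq 0 ]; then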
        echo "All connections drained, proceeding with update"
        break
    fi
    
    echo "Waiting for $active_connections connections to complete..."
    sleep $interval
    elapsed=$((elapsed + interval))
done

# Continue with the update process

Session Management Strategies

Managing user sessions across instance updates requires proper design to prevent session loss.

Strategy	Description	Pros	Cons
External Session Store	Store session data in Redis, Memcached, or database	Complete session persistence, works with any update strategy	Extra infrastructure, potential performance impact
Sticky Sessions	Route users to the same instance for their session duration	Simple to implement, no external dependencies	Sessions lost during instance updates, load balancing challenges
Client-Side Sessions	Store session data in cookies or local storage	No server-side state, simplified architecture	Limited storage capacity, security considerations
Session Replication	Replicate session data across instances	High availability, no external dependencies	Resource intensive, complex implementation

External Session Store Implementation


// Node.js with Express and Redis Session Store
const express = require('express');
const session = require('express-session');
const RedisStore = require('connect-redis').default;
const { createClient } = require('redis');

const app = express();

// Create Redis client
const redisClient = createClient({
  url: process.env.REDIS_URL || 'redis://localhost:6379'
});

redisClient.connect().catch(console.error);

// Configure session middleware with Redis store
app.use(session({
  store: new RedisStore({ client: redisClient }),
  secret: process.env.SESSION_SECRET || 'your-secret-key',
  resave: false,
  saveUninitialized: false,
  cookie: {
    secure: process.env.NODE_ENV === 'production',
    maxAge: 1000 * 60 * 60 * 24 // 1 day
  }
}));

// Session usage in application
app.get('/profile', (req, res) => {
  if (!req.session.user) {
    return res.redirect('/login');
  }
  
  res.render('profile', { user: req.session.user });
});

app.post('/login', (req, res) => {
  // Authenticate user
  const user = authenticateUser(req.body.username, req.body.password);
  
  if (user) {
    // Store user in session
    req.session.user = {
      id: user.id,
      username: user.username,
      email: user.email,
      preferences: user.preferences
    };
    
    res.redirect('/dashboard');
  } else {
    res.render('login', { error: 'Invalid credentials' });
  }
});

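// Start the HTTP server and keep a reference to it for graceful shutdown
const server = app.listen(process.env.PORT || 3000);

// Graceful shutdown handling
process.on('SIGTERM', () => {
  console.log('Received SIGTERM signal, shutting down gracefully');
  
  // Stop accepting new connections and let in-flight requests finish first,
  // since they may still need the session store
  server.close(async () => {
    console.log('HTTP server closed');
    
    // Close Redis connection once no request can use it
    await redisClient.quit();
    process.exit(0);
  });
});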

Connection Draining and Session Management Best Practices

Appropriate Timeouts: Set connection draining timeouts based on typical session duration
Externalize State: Keep session data outside application instances
Graceful Shutdown: Implement proper shutdown hooks to handle in-flight requests
Sticky Session Fallbacks: If using sticky sessions, implement fallback for session recovery
Session Versioning: Include version information in session data for compatibility
Monitoring: Track connection counts during draining to ensure proper completion
Client Communication: Consider notifying clients about maintenance or updates
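
Session versioning can be sketched as a chain of small migrations applied when a session is loaded. The version numbers and field names below are hypothetical. The migration is additive on purpose, so instances still running the old version can keep reading the fields they know.

```javascript
// Upgrade session payloads written by older releases to the current shape.
const CURRENT_SESSION_VERSION = 2;

const sessionMigrations = {
  // v1 stored a flat `theme` field; v2 adds a nested `preferences` object.
  // The old field is kept so that v1 instances can still read the session.
  1: session => ({
    ...session,
    preferences: { theme: session.theme ?? 'default' },
    version: 2,
  }),
};

function upgradeSession(session) {
  // Sessions without a version field are treated as version 1.
  let current = { version: 1, ...session };
  while (current.version < CURRENT_SESSION_VERSION) {
    current = sessionMigrations[current.version](current);
  }
  return current;
}

console.log(upgradeSession({ userId: 42, theme: 'dark' }));
console.log(upgradeSession({ userId: 7, version: 2, preferences: { theme: 'light' } }));
```

Sessions that are already at the current version pass through unchanged, so the function is safe to call on every request.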

Monitoring and Rollback Strategies

Effective monitoring and the ability to quickly roll back are essential safety nets for rolling updates.

Monitoring During Rolling Updates

graph TD
    A[Key Metrics to Monitor] --> B[System Metrics]
    A --> C[Application Metrics]
    A --> D[Business Metrics]
    A --> E[Deployment Metrics]
    B --> B1[CPU/Memory Usage]
    B --> B2[Network Traffic]
    B --> B3[Disk I/O]
    C --> C1[Error Rates]
    C --> C2[Response Times]
    C --> C3[Request Throughput]
    D --> D1[Conversion Rates]
    D --> D2[User Engagement]
    D --> D3[Transaction Values]
    E --> E1[Deployment Progress]
    E --> E2[Health Check Success]
    E --> E3[Instance Startup Time]

Prometheus Alert Rules for Deployment Monitoring


groups:
- name: deployment_alerts
  rules:
  - alert: HighErrorRateAfterDeployment
    expr: |
      sum(rate(http_requests_total{status=~"5.."}[5m])) by (service, version)
      / sum(rate(http_requests_total[5m])) by (service, version) > 0.05
      and on(service) (time() - max(deployment_time) by (service)) < 3600
    for: 2m
    labels:
      severity: critical
      category: deployment
    annotations:
      summary: "High error rate after deployment: {{ $labels.service }} v{{ $labels.version }}"
      description: "Error rate of {{ $value | humanizePercentage }} exceeds 5% threshold after recent deployment"
      
  - alert: LatencySpikeDuringDeployment
    expr: |
      (
        avg(rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m])) by (service)
        /
        avg(rate(http_request_duration_seconds_sum[1h] offset 2h) / rate(http_request_duration_seconds_count[1h] offset 2h)) by (service)
      ) > 1.5
      and on(service) (time() - max(deployment_time) by (service)) < 3600
    for: 3m
    labels:
      severity: warning
      category: deployment
    annotations:
      summary: "Latency spike during deployment: {{ $labels.service }}"
      description: "Response times are {{ $value | humanizePercentage }} of baseline during deployment"
      
  - alert: DeploymentStalled
    expr: |
      deployment_status{status="in_progress"} == 1
      and on(service)
      (time() - deployment_time) > 1800
    for: 5m
    labels:
      severity: warning
      category: deployment
    annotations:
      summary: "Deployment stalled: {{ $labels.service }}"
      description: "Deployment has been in progress for over 30 minutes without completion"
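
The condition behind the HighErrorRateAfterDeployment rule reduces to a ratio check combined with a time window. The sketch below restates it as plain arithmetic, assuming the inputs are per-second request rates like those `rate()` would return. It does not model the `for: 2m` clause, which additionally requires the condition to hold continuously before the alert fires.

```javascript
// Plain-arithmetic version of the error-rate alert condition.
function highErrorRate({ errorRate, totalRate, secondsSinceDeploy }) {
  // With no traffic the ratio is undefined, so the condition cannot be met.
  if (totalRate === 0) return false;
  const ratio = errorRate / totalRate;
  return ratio > 0.05 && secondsSinceDeploy < 3600;
}

console.log(highErrorRate({ errorRate: 3, totalRate: 50, secondsSinceDeploy: 600 }));  // true  (6% errors)
console.log(highErrorRate({ errorRate: 2, totalRate: 50, secondsSinceDeploy: 600 }));  // false (4% errors)
console.log(highErrorRate({ errorRate: 3, totalRate: 50, secondsSinceDeploy: 7200 })); // false (outside the 1 hour window)
```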

Rollback Strategies

Having effective rollback mechanisms is crucial for responding to issues during rolling updates.

Rollback Approach	Description	Best For
Full Rollback	Revert all instances to the previous version	Critical issues affecting all new instances
Partial Rollback	Revert only problematic instances while continuing update for others	Issues affecting specific infrastructure or regions
Pause and Fix	Pause the rolling update, fix issues, and continue with corrected version	Minor problems that can be quickly resolved
Forward Fix	Deploy a new version that addresses the issues without reverting	Situations where rollback would cause more problems

Automated Rollback Script for Kubernetes


#!/bin/bash
# Automated monitoring and rollback script for Kubernetes deployments

# Configuration
NAMESPACE="production"
DEPLOYMENT="my-app"
ERROR_THRESHOLD=5  # Error percentage triggering rollback
LATENCY_THRESHOLD=1.5  # Latency increase factor
MONITOR_DURATION=600  # Monitor for 10 minutes after deployment
INTERVAL=15  # Check every 15 seconds

# Get the current deployment revision from the annotation maintained by the Deployment controller
current_revision=$(kubectl get deployment "$DEPLOYMENT" -n "$NAMESPACE" -o jsonpath='{.metadata.annotations.deployment\.kubernetes\.io/revision}')
echo "Deployment $DEPLOYMENT updated to revision $current_revision"

# Record deployment time for future reference
deployment_time=$(date +%s)
echo "Starting post-deployment monitoring at $(date)"

# Monitor for the specified duration
end_time=$(($(date +%s) + $MONITOR_DURATION))

while [ $(date +%s) -lt $end_time ]; do
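  # Get error rate from Prometheus. -G with --data-urlencode stops curl from globbing the
  # {} and [] in the PromQL and encodes the query; jq falls back to 0 when no series match.
  ERROR_RATE=$(curl -s -G "http://prometheus:9090/api/v1/query" --data-urlencode "query=sum(rate(http_requests_total{namespace='$NAMESPACE',deployment='$DEPLOYMENT',status=~'5..'}[1m]))/sum(rate(http_requests_total{namespace='$NAMESPACE',deployment='$DEPLOYMENT'}[1m]))*100" | jq -r '.data.result[0].value[1] // "0"')
  
  # Get latency data from Prometheus
  CURRENT_LATENCY=$(curl -s -G "http://prometheus:9090/api/v1/query" --data-urlencode "query=histogram_quantile(0.95,sum(rate(http_request_duration_seconds_bucket{namespace='$NAMESPACE',deployment='$DEPLOYMENT'}[1m])) by (le))" | jq -r '.data.result[0].value[1] // "0"')
  BASELINE_LATENCY=$(curl -s -G "http://prometheus:9090/api/v1/query" --data-urlencode "query=histogram_quantile(0.95,sum(rate(http_request_duration_seconds_bucket{namespace='$NAMESPACE',deployment='$DEPLOYMENT'}[1h] offset 2h)) by (le))" | jq -r '.data.result[0].value[1] // "0"')
  # Treat non-numeric results such as NaN (no traffic) as 0 so bc does not fail
  for v in ERROR_RATE CURRENT_LATENCY BASELINE_LATENCY; do
    [[ ${!v} =~ ^[0-9]+(\.[0-9]+)?$ ]] || printf -v "$v" 0
  done
  # Skip the ratio when there is no baseline to avoid dividing by zero
  LATENCY_RATIO=0
  if (( $(echo "$BASELINE_LATENCY > 0" | bc -l) )); then
    LATENCY_RATIO=$(echo "$CURRENT_LATENCY / $BASELINE_LATENCY" | bc -l)
  fi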
  
  echo "Current metrics - Error rate: ${ERROR_RATE}%, P95 Latency: ${CURRENT_LATENCY}s (${LATENCY_RATIO}x baseline)"
  
  # Check if error rate exceeds threshold
  if (( $(echo "$ERROR_RATE > $ERROR_THRESHOLD" | bc -l) )); then
    echo "Error rate exceeded threshold (${ERROR_RATE}% > ${ERROR_THRESHOLD}%), initiating rollback..."
    kubectl rollout undo deployment/$DEPLOYMENT -n $NAMESPACE
    
    # Send alert to Alertmanager (the v1 API has been removed; v2 expects a JSON content type)
    curl -s -X POST "http://alertmanager:9093/api/v2/alerts" -H "Content-Type: application/json" -d '[{"labels":{"alertname":"DeploymentRollback","severity":"critical","deployment":"'"$DEPLOYMENT"'"},"annotations":{"summary":"Automatic rollback triggered","description":"Error rate of '"$ERROR_RATE"'% exceeded threshold of '"$ERROR_THRESHOLD"'%"}}]'
    
    echo "Rollback initiated at $(date)"
    exit 1
  fi
  
  # Check if latency exceeds threshold
  if (( $(echo "$LATENCY_RATIO > $LATENCY_THRESHOLD" | bc -l) )); then
    echo "Latency ratio exceeded threshold (${LATENCY_RATIO}x > ${LATENCY_THRESHOLD}x), initiating rollback..."
    kubectl rollout undo deployment/$DEPLOYMENT -n $NAMESPACE
    
    # Send alert to Alertmanager (the v1 API has been removed; v2 expects a JSON content type)
    curl -s -X POST "http://alertmanager:9093/api/v2/alerts" -H "Content-Type: application/json" -d '[{"labels":{"alertname":"DeploymentRollback","severity":"critical","deployment":"'"$DEPLOYMENT"'"},"annotations":{"summary":"Automatic rollback triggered","description":"Latency increase of '"$LATENCY_RATIO"'x exceeded threshold of '"$LATENCY_THRESHOLD"'x"}}]'
    
    echo "Rollback initiated at $(date)"
    exit 1
  fi
  
  # Check deployment status to ensure it's still progressing
  deployment_status=$(kubectl rollout status deployment/$DEPLOYMENT -n $NAMESPACE --watch=false)
  if [[ $deployment_status == *"successfully rolled out"* ]]; then
    echo "Deployment completed successfully, continuing to monitor..."
  elif [[ $deployment_status == *"error"* ]]; then
    echo "Deployment encountered errors, initiating rollback..."
    kubectl rollout undo deployment/$DEPLOYMENT -n $NAMESPACE
    echo "Rollback initiated at $(date)"
    exit 1
  fi
  
  sleep $INTERVAL
done

echo "Post-deployment monitoring completed successfully at $(date)"
echo "Deployment is stable"
exit 0
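
The threshold checks at the core of the script are easier to test when separated from the `kubectl` and `curl` calls. A minimal sketch in Python, assuming the same 5% error and 1.5x latency thresholds as the script; the function name `should_roll_back` and the convention that `None` means "no data" are assumptions made for this illustration.

```python
from typing import Optional


def should_roll_back(
    error_rate_pct: Optional[float],
    current_latency: Optional[float],
    baseline_latency: Optional[float],
    error_threshold_pct: float = 5.0,
    latency_threshold: float = 1.5,
) -> Optional[str]:
    """Return a reason string if a rollback is warranted, otherwise None.

    None inputs stand for "no data" (for example, an empty Prometheus result).
    """
    if error_rate_pct is not None and error_rate_pct > error_threshold_pct:
        return f"error rate {error_rate_pct:.1f}% > {error_threshold_pct:.1f}%"

    # Only compare latency when both values exist and the baseline is non-zero,
    # which avoids a division by zero on services with no historical traffic.
    if current_latency is not None and baseline_latency:
        ratio = current_latency / baseline_latency
        if ratio > latency_threshold:
            return f"latency {ratio:.2f}x baseline > {latency_threshold:.2f}x"

    return None
```

In this sketch missing data never triggers a rollback. A stricter policy could treat a prolonged absence of metrics as a reason to pause the rollout instead.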

Real-World Example: Progressive Rollback Strategy

A large SaaS company implemented a sophisticated monitoring and rollback strategy for their microservices platform:

Deployment Phases: Rolling updates progressed through phases, with increasing traffic percentages
Automated Canaries: First 5% of instances served as canaries with enhanced monitoring
Multi-Dimensional Metrics: Each phase evaluated against 12 key metrics, including:
- Error rates (by error type and endpoint)
- Latency percentiles (p50, p90, p99)
- CPU and memory utilization patterns
- Business metrics (conversion rates, etc.)
- Dependent service impacts
Rollback Strategy: A tiered response system based on severity:
- Level 1 (Minor Issues): Pause deployment, investigate, proceed if resolved
- Level 2 (Moderate Issues): Partial rollback of affected instances/regions
- Level 3 (Critical Issues): Full automatic rollback with engineer notification
- Level 4 (Catastrophic Issues): Circuit breaker activation and incident response
Post-Mortem Process: Any rollback triggered a thorough analysis process before the next deployment attempt

This approach reduced their deployment incidents by 87% while increasing deployment frequency by 3x, allowing them to safely ship new features multiple times per day.

Advanced Rolling Update Techniques

Traffic Shaping During Rolling Updates

Advanced traffic management techniques can improve the safety and efficiency of rolling updates.

Ring Deployment Model: Deploy to concentric "rings" of instances with increasing criticality
Dark Launching: Deploy new code but don't activate new functionality until stable
Shadow Testing: Send duplicate traffic to new instances and discard their responses, so users only ever see responses from the current version
Incremental Feature Activation: Combine rolling updates with feature flags
Synthetic User Testing: Automatically test new instances with synthetic traffic
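
As an illustration of the ring deployment model, the sketch below groups instances by ring and returns them in deployment order. It is a minimal Python example under stated assumptions: the instance names and the convention that ring 0 is the least critical ring are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def plan_rings(instance_rings: Dict[str, int]) -> List[List[str]]:
    """Group instances by ring number and return the groups in deployment order.

    `instance_rings` maps an instance name to its ring; the lowest ring number
    is the least critical and is updated first.
    """
    rings: Dict[int, List[str]] = defaultdict(list)
    for name, ring in instance_rings.items():
        rings[ring].append(name)
    return [sorted(rings[ring]) for ring in sorted(rings)]


plan = plan_rings({
    "internal-1": 0,
    "beta-1": 1,
    "beta-2": 1,
    "prod-eu-1": 2,
    "prod-us-1": 2,
})
for step, ring in enumerate(plan):
    print(f"step {step}: {ring}")
```

Each inner list would be updated (and observed) as a unit before the rollout moves outward to the next, more critical ring.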

Shadow Testing Implementation


# Nginx configuration for shadow testing
http {
    upstream production_backend {
        server prod1.example.com;
        server prod2.example.com;
        server prod3.example.com;
    }
    
    upstream shadow_backend {
        server shadow1.example.com;
        server shadow2.example.com;
    }
    
    server {
        listen 80;
        
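        location / {
            # Primary request to production backend
            proxy_pass http://production_backend;
            
            # Mirror requests to the shadow endpoint. The mirror directive is not
            # allowed inside an "if" block, so non-GET requests are filtered in
            # the /shadow location instead.
            mirror /shadow;
            mirror_request_body off;
        }
        
        # Shadow endpoint; nginx discards its response instead of returning it to the client
        location = /shadow {
            internal;
            # Shadow only GET requests
            if ($request_method != GET) {
                return 204;
            }
            proxy_pass http://shadow_backend$request_uri;
            proxy_pass_request_body off;
            proxy_set_header Content-Length "";
            proxy_ignore_client_abort on;
            proxy_connect_timeout 1s;  # Short timeout to not impact performance
            proxy_read_timeout 10s;
        }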
    }
}
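
The same idea can be expressed at the application level, which also makes it possible to compare the two responses. The following Python sketch is only an illustration: `handle_with_shadow`, the `mismatches` list, and the handler signature are hypothetical, and real handlers would make network calls rather than return strings.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, List, Tuple

Handler = Callable[[str], str]

# Background workers for shadow requests, so they never delay the real response.
_shadow_pool = ThreadPoolExecutor(max_workers=4)
mismatches: List[str] = []


def handle_with_shadow(path: str, primary: Handler, shadow: Handler) -> Tuple[str, Future]:
    """Serve from `primary` and replay the same request against `shadow` in the background.

    Only the primary response is returned to the caller. The shadow response is
    compared and then discarded. The Future is returned so callers (or tests)
    can wait for the comparison to finish.
    """
    response = primary(path)

    def replay() -> None:
        try:
            if shadow(path) != response:
                mismatches.append(path)
        except Exception:
            # A failing shadow must never affect the real request; just record it.
            mismatches.append(path)

    return response, _shadow_pool.submit(replay)
```

Paths that end up in `mismatches` point to behavioral differences between the versions that are worth reviewing before the new version receives live traffic.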

Feature Flag Integration with Rolling Update


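// Using feature flags with rolling updates
// (run as an ES module so that top-level await is available)
import express from 'express';
import LaunchDarkly from 'launchdarkly-node-server-sdk';

// Initialize LaunchDarkly client (the server-side SDK exposes init(), not initialize())
const ldClient = LaunchDarkly.init('YOUR_SDK_KEY');

// Wait for client to be ready
await ldClient.waitForInitialization();

// Application configuration with feature flags
// (req.user is assumed to be populated by authentication middleware)
const app = express();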

app.get('/api/products', async (req, res) => {
  const user = {
    key: req.user.id,
    custom: {
      groups: req.user.groups,
      // Add instance version as an attribute
      instanceVersion: process.env.APP_VERSION || '1.0.0'
    }
  };
  
  // Check if new product search algorithm should be enabled
  // Can be targeted to specific app versions, users, or instance groups
  const useNewAlgorithm = await ldClient.variation('new-search-algorithm', user, false);
  
  if (useNewAlgorithm) {
    // Use new implementation (only in updated instances)
    const products = await newSearchImplementation(req.query);
    res.json(products);
  } else {
    // Use old implementation (works in both old and new instances)
    const products = await legacySearchImplementation(req.query);
    res.json(products);
  }
});

// Feature flag for gradual activation during rolling update
// 1. Create the feature flag and leave it switched off
// 2. Perform the rolling update so every instance carries the new (still inactive) code path
// 3. Gradually enable the feature flag for an increasing percentage of traffic
// 4. Monitor and switch the flag off if issues occur (without rolling back the deployment)
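
Gradually enabling a flag for an increasing percentage of traffic (step 3) is commonly done by hashing a stable user key into a bucket. The Python sketch below shows the general technique only; it is not LaunchDarkly's bucketing algorithm, and the function name `in_rollout` and the 10,000-bucket resolution are assumptions.

```python
import hashlib


def in_rollout(flag_key: str, user_key: str, percentage: float) -> bool:
    """Return True if `user_key` falls inside the first `percentage` percent of buckets.

    The bucket depends only on the flag and user keys, so a user's assignment is
    stable across requests and instances, and a user enabled at 10% stays
    enabled when the percentage is raised to 50%.
    """
    digest = hashlib.sha256(f"{flag_key}:{user_key}".encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a bucket in [0, 10000).
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < percentage * 100
```

Including the flag key in the hash keeps different flags from always selecting the same users for early exposure.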

Automated Rollout Strategies

Automation can enhance the reliability and efficiency of rolling updates.

graph TD
    A[Automated Rollout] --> B[Progressive Exposure]
    A --> C[Metric-Based Progression]
    A --> D[Circuit Breaking]
    B --> B1[Instance Batching]
    B --> B2[Traffic Shifting]
    C --> C1[Automatic Analysis]
    C --> C2[Statistical Validation]
    D --> D1[Automatic Pause]
    D --> D2[Rollback Triggers]
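
Instance batching, metric-based progression, and rollback triggers can be combined in a small control loop. The following is a minimal Python sketch under stated assumptions: `update`, `healthy`, and `rollback` are hypothetical callbacks that a real system would back with an orchestrator API and a metrics query.

```python
from typing import Callable, List, Sequence


def make_batches(instances: Sequence[str], batch_size: int) -> List[List[str]]:
    """Split instances into consecutive batches of at most `batch_size`."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [list(instances[i:i + batch_size]) for i in range(0, len(instances), batch_size)]


def progressive_rollout(
    instances: Sequence[str],
    batch_size: int,
    update: Callable[[str], None],
    healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
) -> bool:
    """Update batch by batch; roll back everything updated so far if a batch is unhealthy."""
    updated: List[str] = []
    for batch in make_batches(instances, batch_size):
        for instance in batch:
            update(instance)
            updated.append(instance)
        # Gate: only continue to the next batch if this one is healthy.
        if not all(healthy(instance) for instance in batch):
            for instance in reversed(updated):
                rollback(instance)
            return False
    return True
```

Replacing the rollback loop with an early `return` would turn the same gate into an automatic pause rather than a rollback trigger.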

Real-World Example: Google's Automated Canary Analysis

Google has developed a sophisticated automated deployment platform with rolling updates that incorporates:

Progressive Rollout: Updates proceed through multiple phases from canary to global deployment
Automatic Analysis: Statistical models compare metrics between versions
Multi-Dimensional Evaluation: The system evaluates key SLIs and SLOs across services
Bake Periods: Mandatory observation periods between deployment phases
Self-Service Tools: Service owners can configure rollout parameters and key metrics
Feedback Loops: Deployment system learns from past successes and failures

This system enables Google to safely deploy thousands of changes per day across their service fleet, with minimal human intervention for routine deployments.
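
To make the idea of comparing metrics between versions concrete, the sketch below checks a canary's latency samples against the baseline's. It is a deliberately simple stand-in written in Python, not Google's method: a production analysis would typically use a proper statistical test, and the `max_ratio` and `min_samples` defaults are illustrative assumptions.

```python
from statistics import median
from typing import Sequence


def canary_passes(
    baseline: Sequence[float],
    canary: Sequence[float],
    max_ratio: float = 1.2,
    min_samples: int = 5,
) -> bool:
    """Return True if the canary's median is within `max_ratio` of the baseline's median.

    Returns False while either side has fewer than `min_samples` observations,
    which models an observation period that must elapse before promotion.
    """
    if len(baseline) < min_samples or len(canary) < min_samples:
        return False  # not enough data yet: keep observing
    base = median(baseline)
    if base <= 0:
        return False  # no meaningful baseline to compare against
    return median(canary) / base <= max_ratio
```

Using the median rather than the mean keeps a single slow request from failing an otherwise healthy canary.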

Learning Activities

Activity 1: Rolling Update Strategy Design

Design a rolling update strategy for a microservices-based application with the following characteristics:

User-facing web application with authentication
A fleet of API microservices (10 different service types)
PostgreSQL database for persistent data
Redis for caching and session management
Average user session duration of 30 minutes
Business requirement: 99.9% uptime during updates

Your strategy should include:

Update sequence across services
Batch sizing and update parameters
Health check requirements
Session handling approach
Database change management
Monitoring and rollback plan

Activity 2: Implementing Rolling Updates in Kubernetes

Create Kubernetes deployment configurations for a three-tier application:

Frontend web application with 5 replicas
Backend API service with 3 replicas
Database with primary and standby replicas

Implement the following:

Appropriate rolling update strategy for each component
Health check probes (liveness, readiness, startup)
Resource limits and requests
Service configurations with appropriate selectors
Hooks for pre/post-deployment tasks (for example, container lifecycle hooks or Jobs)
Monitoring and alerting configuration

Activity 3: Rolling Update Failure Scenarios

Analyze the following rolling update failure scenarios and develop response plans:

New version passes health checks but shows increased error rates after 10 minutes
Database connection pool exhaustion during update
Intermittent failures affecting only some instances
Version compatibility issue between services updated at different times
Memory leak in new version causing gradual degradation

For each scenario, describe:

Detection method
Immediate mitigation
Root cause analysis approach
Prevention strategy for future deployments

Key Takeaways

Rolling updates provide zero-downtime deployments by incrementally updating instances
This approach balances availability with resource efficiency but requires careful management of version coexistence
Proper health checks and readiness probes are critical to successful rolling updates
Connection draining and session management help provide seamless user experiences
Database schema changes must be designed to work with both old and new application versions
Effective monitoring and automated rollback mechanisms provide safety nets
Advanced techniques like traffic shaping and progressive exposure can enhance safety
Most modern orchestration platforms provide built-in support for rolling updates