Introduction to Rolling Updates
Rolling updates represent one of the most widely used deployment strategies for achieving zero-downtime deployments. Unlike blue-green deployments that switch traffic between two complete environments, or canary releases that gradually shift traffic percentages, rolling updates focus on incrementally updating instances within a single environment.
In a rolling update, application instances are updated one subset at a time. This allows the application to remain available throughout the deployment process, as only a portion of the instances are unavailable at any given time. The deployment proceeds through the environment until all instances are running the new version.
The Rotating Tires Analogy
Rolling updates can be compared to how a mechanic replaces tires on a car:
- Traditional Deployment: Take the entire car off the road, replace all four tires at once, then put it back in service (causing downtime).
- Blue-Green Deployment: Build a second identical car with new tires, then switch drivers from the old car to the new one (instant cutover).
- Canary Release: Replace one tire, drive the car carefully with a few passengers to test it, then gradually replace the rest if no issues arise.
- Rolling Update: Jack up one corner of the car at a time, replace that tire, lower it, and move to the next tire. The car remains partially functional and never completely "off the road" during the update process.
Benefits and Challenges of Rolling Updates
Benefits
- Zero Downtime: Application remains available throughout the update process
- Resource Efficiency: Requires no additional infrastructure beyond a small buffer capacity
- Gradual Rollout: A faulty release reaches only a subset of instances, so issues can be detected before every instance is affected
- Simple Implementation: Most orchestration platforms support rolling updates natively
- Built-in Verification: Health checks ensure each new instance is functioning before proceeding
- Traffic Management: Load balancers automatically route around instances being updated
- Minimal Configuration: Requires little special configuration beyond normal deployment
Challenges
- Version Coexistence: Multiple versions run simultaneously during the rollout
- Database Compatibility: Database changes must be compatible with both versions
- Rollback Complexity: Partial rollbacks can lead to more version mixing
- Slower Deployment: Takes longer to complete than an all-at-once deployment
- Subtle Issues: Some problems may only appear when old and new versions interact
- Instance Readiness: Proper health checks are critical to keep traffic away from instances that are not yet ready
- Capacity Planning: Must maintain sufficient capacity during the update process
Real-World Example: E-commerce Platform Rolling Update
A major e-commerce platform implements rolling updates for their product catalog service:
- Deployment Configuration:
  - 30 total instances across 3 availability zones
  - Update 3 instances at a time (10% of capacity)
  - 30-second health check grace period before allowing traffic
  - 60-second pause between batch updates to monitor performance
- Benefits Realized:
  - Maintained 100% availability during deployments
  - Detected performance regressions before affecting all users
  - Reduced deployment risk during high-traffic holiday periods
- Challenges Addressed:
  - Implemented backward-compatible API versions to handle version coexistence
  - Used database change patterns that work with both old and new code
  - Integrated comprehensive monitoring to catch subtle issues
The platform now deploys updates to production multiple times per day with minimal risk, compared to their previous weekly deployment schedule with occasional downtime.
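The configuration above implies a minimum rollout duration that is easy to work out. The following is a rough sketch, not the platform's actual tooling: `estimateRollout` and its parameter names are hypothetical, and the estimate ignores instance boot time, which adds to every batch.

```javascript
// Rough lower bound on rollout duration for a batched rolling update.
// Ignores instance boot time and image pulls, which add to every batch.
function estimateRollout({ instances, batchSize, graceSeconds, pauseSeconds }) {
  const batches = Math.ceil(instances / batchSize);
  // Every batch waits out the grace period; pauses occur only between batches.
  const seconds = batches * graceSeconds + (batches - 1) * pauseSeconds;
  return { batches, seconds, minutes: seconds / 60 };
}

const estimate = estimateRollout({
  instances: 30,
  batchSize: 3,
  graceSeconds: 30,
  pauseSeconds: 60,
});
console.log(estimate); // { batches: 10, seconds: 840, minutes: 14 }
```

With 10 batches, a 30-second grace period per batch, and 9 pauses of 60 seconds, a full rollout takes at least 14 minutes, which is one reason rolling updates are slower than all-at-once deployments.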
Key Concepts in Rolling Updates
Update Strategies
There are several variations of rolling update strategies, each with different trade-offs:
| Strategy | Description | Pros | Cons |
|---|---|---|---|
| Basic Rolling Update | Update one instance at a time in sequence | Minimal capacity impact, simplest to implement | Slowest deployment time, extended version coexistence |
| Batched Rolling Update | Update multiple instances simultaneously in batches | Faster deployment, balanced risk/speed | Requires more capacity headroom during deployment |
| Surge Rolling Update | Create new instances before terminating old ones | Maintains full capacity, faster readiness | Temporarily requires additional infrastructure |
| Zone-Based Rolling Update | Update all instances in one availability zone at a time | Geographic isolation of risk, simpler tracking | Requires cross-zone redundancy, zone imbalance during update |
[Diagram: version mix at each stage of a rollout. In a standard rolling update the v1 share falls in 25% steps while the v2 share rises to 100%. In a surge rolling update, 25% v2 capacity is added alongside 100% v1 before old instances are removed, ending at 0% v1 and 100% v2. In a zone-based rolling update, Zone A moves to v2 first, then Zone B, then Zone C.]
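The capacity trade-off between in-place and surge updates can be illustrated with a small simulation. This is a simplified sketch under stated assumptions: `simulateRollout` is a hypothetical helper, stopped instances are assumed to serve no traffic, and new instances are assumed to serve as soon as they start.

```javascript
// Simulates the instance mix after each step of a rolling update.
// surge = true starts replacement instances before stopping old ones.
function simulateRollout(total, batchSize, { surge = false } = {}) {
  const steps = [];
  let v1 = total;
  let v2 = 0;
  while (v1 > 0) {
    const n = Math.min(batchSize, v1);
    // In-place: old instances stop first, so capacity dips by n.
    // Surge: new instances start first, so capacity never drops below total.
    const minServing = surge ? total : total - n;
    const peakInstances = surge ? total + n : total;
    v1 -= n;
    v2 += n;
    steps.push({ v1, v2, minServing, peakInstances });
  }
  return steps;
}

console.log(simulateRollout(4, 1));                  // capacity dips to 3 during each step
console.log(simulateRollout(4, 1, { surge: true })); // capacity stays at 4, peak of 5 instances
```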
Critical Components for Successful Rolling Updates
- Load Balancing: Distributes traffic only to available and healthy instances
- Health Checks: Verifies new instances are ready before sending traffic
- Connection Draining: Allows in-flight requests to complete before removing instances
- Session Persistence: Handles user sessions during the update process
- Capacity Planning: Ensures sufficient resources during the update
- Monitoring: Detects issues with new instances before proceeding
- Rollback Capability: Reverts to previous version if problems occur
Rolling Update Configuration Best Practices
- Batch Size: Start with small batch sizes (10-20% of total capacity) and adjust based on confidence and urgency
- Health Check Grace Period: Allow sufficient time for application startup and warming before health checks (30+ seconds for most applications)
- Inter-batch Delay: Consider adding a delay between batches to observe performance and catch issues
- Timeout: Set an overall deployment timeout to prevent "stuck" deployments
- Capacity Buffer: Maintain 20-30% extra capacity to handle traffic during instance updates
- Failure Threshold: Define a threshold for failed instances that will trigger a rollback
- Cross-Zone Distribution: Ensure updates affect instances across availability zones evenly
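One way to honor the cross-zone guideline is to build batches by taking instances from each zone in turn. The sketch below is illustrative only: `planBatches`, the zone names, and the instance IDs are hypothetical, and real orchestrators usually handle this placement themselves.

```javascript
// Builds update batches that draw instances from each zone in turn,
// so no single batch drains one availability zone.
function planBatches(instancesByZone, batchSize) {
  const queues = Object.values(instancesByZone).map((ids) => [...ids]);
  const ordered = [];
  // Round-robin across zones: a1, b1, c1, a2, b2, c2, ...
  while (queues.some((q) => q.length > 0)) {
    for (const q of queues) {
      if (q.length > 0) ordered.push(q.shift());
    }
  }
  const batches = [];
  for (let i = 0; i < ordered.length; i += batchSize) {
    batches.push(ordered.slice(i, i + batchSize));
  }
  return batches;
}

const batches = planBatches(
  { 'zone-a': ['a1', 'a2'], 'zone-b': ['b1', 'b2'], 'zone-c': ['c1', 'c2'] },
  3
);
console.log(batches); // [ [ 'a1', 'b1', 'c1' ], [ 'a2', 'b2', 'c2' ] ]
```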
Rolling Update Implementations Across Platforms
Kubernetes Rolling Updates
Kubernetes provides built-in support for rolling updates through Deployment resources.
Kubernetes Deployment with Rolling Update Strategy
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
  labels:
    app: my-app
spec:
  replicas: 5
  selector:
    matchLabels:
      app: my-app
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # How many pods can be created above desired number
      maxUnavailable: 1  # How many pods can be unavailable during the update
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: my-app:v2
          ports:
            - containerPort: 8080
          readinessProbe:  # Defines when a pod is ready to serve traffic
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5
          livenessProbe:   # Defines when a pod should be restarted
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 15
            periodSeconds: 20
          resources:
            limits:
              cpu: "500m"
              memory: "512Mi"
            requests:
              cpu: "100m"
              memory: "128Mi"
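The effect of maxSurge and maxUnavailable on pod counts can be worked out by hand. The sketch below assumes the documented Kubernetes rounding rules (percentages round up for maxSurge and down for maxUnavailable); `rolloutBounds` and `resolve` are hypothetical helpers, not part of any Kubernetes client.

```javascript
// Resolves an absolute number or a percentage string such as "25%".
// Kubernetes rounds maxSurge up and maxUnavailable down.
function resolve(value, replicas, roundUp) {
  if (typeof value === 'number') return value;
  const scaled = parseInt(value, 10) * replicas; // integer math avoids float drift
  return roundUp ? Math.ceil(scaled / 100) : Math.floor(scaled / 100);
}

// Returns the pod-count limits the rollout must stay within.
function rolloutBounds(replicas, maxSurge, maxUnavailable) {
  const surge = resolve(maxSurge, replicas, true);
  const unavailable = resolve(maxUnavailable, replicas, false);
  return {
    maxTotalPods: replicas + surge,
    minAvailablePods: replicas - unavailable,
  };
}

console.log(rolloutBounds(5, 1, 1));         // { maxTotalPods: 6, minAvailablePods: 4 }
console.log(rolloutBounds(5, '25%', '25%')); // { maxTotalPods: 7, minAvailablePods: 4 }
```

For the Deployment above (5 replicas, maxSurge 1, maxUnavailable 1), at most 6 pods exist at once and at least 4 remain available throughout the update.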
Kubernetes Rolling Update Command
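# Update a deployment with a new image
kubectl set image deployment/my-app my-app=my-app:v2
# Record why the change was made (the --record flag is deprecated)
kubectl annotate deployment/my-app kubernetes.io/change-cause="Update image to my-app:v2"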
# Monitor the rollout status
kubectl rollout status deployment/my-app
# Pause a rollout if issues are detected
kubectl rollout pause deployment/my-app
# Resume a rollout after issues are resolved
kubectl rollout resume deployment/my-app
# Rollback to the previous version if needed
kubectl rollout undo deployment/my-app
# View rollout history
kubectl rollout history deployment/my-app
AWS Auto Scaling Group Rolling Updates
AWS Auto Scaling Groups support rolling updates through instance refresh and launch template updates.
AWS CloudFormation Template for ASG with Rolling Update
Resources:
  MyAutoScalingGroup:
    Type: AWS::AutoScaling::AutoScalingGroup
    # Rolling replacement on stack updates is configured with the UpdatePolicy
    # attribute; AWS::AutoScaling::AutoScalingGroup has no
    # InstanceRefreshSpecification property. Instance refresh preferences such as
    # MinHealthyPercentage, InstanceWarmup, and checkpoints (25/50/75/100% with a
    # 300-second delay) are passed when starting an instance refresh instead.
    UpdatePolicy:
      AutoScalingRollingUpdate:
        MinInstancesInService: 4  # Keep at least 4 of the 5 instances in service
        MaxBatchSize: 1           # Replace one instance per batch
        PauseTime: PT5M           # Pause 5 minutes after each batch
    Properties:
      VPCZoneIdentifier:
        - subnet-12345678
        - subnet-87654321
      LaunchTemplate:
        LaunchTemplateId: !Ref MyLaunchTemplate
        Version: !GetAtt MyLaunchTemplate.LatestVersionNumber
      MinSize: 2
      MaxSize: 10
      DesiredCapacity: 5
      HealthCheckType: ELB
      HealthCheckGracePeriod: 300
      TargetGroupARNs:
        - !Ref MyTargetGroup
      Tags:
        - Key: Name
          Value: MyApp
          PropagateAtLaunch: true
  MyLaunchTemplate:
    Type: AWS::EC2::LaunchTemplate
    Properties:
      LaunchTemplateName: MyAppLaunchTemplate
      VersionDescription: Initial version
      LaunchTemplateData:
        ImageId: ami-0abcdef1234567890
        InstanceType: t3.medium
        SecurityGroupIds:
          - sg-12345678
        UserData:
          Fn::Base64: !Sub |
            #!/bin/bash
            echo "Starting application version 1.0"
            # Application startup commands
AWS CLI Commands for Rolling Update
# Start an instance refresh (rolling update)
aws autoscaling start-instance-refresh \
--auto-scaling-group-name MyAutoScalingGroup \
--strategy Rolling \
--preferences '{"MinHealthyPercentage": 90, "InstanceWarmup": 300}'
# Check the status of an instance refresh
aws autoscaling describe-instance-refreshes \
--auto-scaling-group-name MyAutoScalingGroup
# Cancel an instance refresh if needed
aws autoscaling cancel-instance-refresh \
--auto-scaling-group-name MyAutoScalingGroup
Docker Swarm Rolling Updates
Docker Swarm provides rolling update capabilities for services through the update_config settings in a stack file (or the equivalent --update-* flags on docker service update).
Docker Swarm Service with Rolling Update Configuration
version: '3.8'
services:
  web:
    image: nginx:latest
    deploy:
      replicas: 6
      update_config:
        parallelism: 2         # Update 2 containers at a time
        delay: 10s             # Wait 10s between updating a group of containers
        order: start-first     # Start new containers first, then stop old ones
        failure_action: pause  # Pause deployment if a container fails to start
        monitor: 60s           # Monitor for failures for 60s after each task update
      rollback_config:
        parallelism: 3         # Rollback 3 containers at a time
        delay: 5s              # Wait 5s between rolling back a group of containers
        failure_action: continue
        monitor: 30s
        order: stop-first
      restart_policy:
        condition: on-failure
        delay: 5s
        max_attempts: 3
    ports:
      - "80:80"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost"]
      interval: 5s
      timeout: 2s
      retries: 3
      start_period: 5s
Docker CLI Commands for Rolling Update
# Deploy or update a stack with rolling update configuration
docker stack deploy -c docker-compose.yml my-stack
# Update a service with a new image (with rolling update)
docker service update --image nginx:new-version my-stack_web
# Monitor service update progress
docker service ps my-stack_web
# Rollback a service update
docker service rollback my-stack_web
Version Compatibility Strategies
Since rolling updates involve running multiple versions simultaneously, compatibility between versions is critical for success.
API and Contract Compatibility
- Backward Compatibility: New versions must accept inputs from old clients/services
- Forward Compatibility: Old versions should handle inputs from new clients/services
- Versioned APIs: Use explicit API versioning to manage compatibility
- Request/Response Evolution: Add fields rather than changing existing ones
- Graceful Degradation: Handle missing features or parameters elegantly
API Version Compatibility Example
// Express.js API with version compatibility
const express = require('express');
const app = express();
// API v1 endpoint as it exists in the old (v1) codebase
app.get('/api/v1/users/:id', (req, res) => {
const userId = req.params.id;
const user = getUserFromDatabase(userId);
// v1 response format
res.json({
id: user.id,
name: user.name,
email: user.email
});
});
// API v2 endpoint (new version with additional fields)
app.get('/api/v2/users/:id', (req, res) => {
const userId = req.params.id;
const user = getEnhancedUserFromDatabase(userId);
// v2 response format with additional fields
res.json({
id: user.id,
name: user.name,
email: user.email,
profile: {
avatar: user.avatar_url,
bio: user.bio,
created_at: user.joined_date
}
});
});
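// Backward-compatible v1 handler as it appears in the new (v2) codebase.
// It replaces the original v1 handler shown earlier and serves v1 requests during the rolling update.
// The two v1 handlers are shown together for comparison only: in a single app, Express would use whichever matching route was registered first.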
app.get('/api/v1/users/:id', (req, res) => {
const userId = req.params.id;
const user = getEnhancedUserFromDatabase(userId);
// Maintain v1 response format for backward compatibility
res.json({
id: user.id,
name: user.name,
email: user.email
});
});
// Forward compatibility example
function processUserRequest(request) {
// Handle new fields gracefully in old code
// Ignore fields that didn't exist in previous version
const userId = request.id;
const userName = request.name;
// Check if new fields exist before using them
const userSettings = request.settings ? request.settings : getDefaultSettings();
// Process with available data
return processUser(userId, userName, userSettings);
}
Database Schema Evolution
Database changes during rolling updates must support both old and new application versions.
Database Schema Evolution Example
-- Step 1: Add new columns without modifying existing ones
-- (Deploy this before application update)
ALTER TABLE customers ADD COLUMN phone_number VARCHAR(20);
ALTER TABLE customers ADD COLUMN address_line2 VARCHAR(100);
-- Step 2: Application code handles both schemas
-- Old version ignores new columns
-- New version uses new columns when available
-- Example application code handling schema version:
function getCustomerContact(customerId) {
const customer = getCustomerFromDatabase(customerId);
// Handle both schema versions
if (customer.phone_number) {
// New schema - use new field directly
return {
name: customer.name,
phone: customer.phone_number,
address: formatAddress(customer)
};
} else {
// Old schema - use legacy contact info
return {
name: customer.name,
phone: customer.contact_number, // Old field name
address: customer.address // Single field in old schema
};
}
}
-- Step 3: Data migration (after deployment completes)
-- Update new fields with data from old fields
UPDATE customers
SET phone_number = contact_number
WHERE phone_number IS NULL AND contact_number IS NOT NULL;
-- Step 4: Eventually remove old columns
-- (Only after all instances are updated and data is migrated)
-- ALTER TABLE customers DROP COLUMN contact_number;
Database Changes Best Practices During Rolling Updates
- Apply Database Changes First: Deploy schema changes before application updates
- Never Remove: Don't remove columns or tables until all application instances are updated
- Use Default Values: Provide sensible defaults for new columns
- Split Large Changes: Break complex schema changes into multiple smaller deployments
- Maintain Triggers/Views: Use database objects to maintain compatibility
- Test Both Versions: Verify old and new application versions work with updated schema
- Database Migrations: Use migration tools that support rollbacks
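While both versions are live, one common way to keep old and new columns consistent is for the new version to write to both. The sketch below is an illustration of that dual-write idea, reusing the contact_number and phone_number columns from the schema example; `buildCustomerWrite` and `readCustomerPhone` are hypothetical helpers, and the actual database call is left out.

```javascript
// Builds the column values for a customer write during the transition.
// Writing both columns keeps old instances (which read contact_number)
// and new instances (which read phone_number) consistent.
function buildCustomerWrite(phone) {
  return {
    contact_number: phone, // legacy column, still read by the old version
    phone_number: phone,   // new column, read by the new version
  };
}

// Reads the phone number regardless of which version wrote the row.
function readCustomerPhone(row) {
  return row.phone_number ?? row.contact_number ?? null;
}

console.log(buildCustomerWrite('555-0100'));
console.log(readCustomerPhone({ contact_number: '555-0100' })); // row written by the old version
```

Once every instance runs the new version and the backfill has completed, the dual write can be dropped along with the legacy column.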
Health Checks and Readiness Probes
Health checks and readiness probes are critical components of successful rolling updates, ensuring that new instances are fully functional before receiving traffic.
Types of Health Checks
| Type | Purpose | Implementation |
|---|---|---|
| Liveness Probe | Determines if an instance should be restarted | Basic check that application process is responsive |
| Readiness Probe | Determines if an instance should receive traffic | Deep check that application is fully initialized and ready |
| Startup Probe | Handles initial application startup period | Specialized check for applications with long startup times |
Comprehensive Health Check Endpoint Implementation
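// Express.js health check endpoint implementation
const express = require('express');
const axios = require('axios'); // HTTP client used by checkDependentServices
const app = express();
// db, checkSystemResources() and checkConnectionPool() are assumed to be provided by the application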
// Simple liveness check - Is the application running?
app.get('/health/liveness', (req, res) => {
res.status(200).json({ status: 'UP' });
});
// Readiness check - Is the application ready to serve traffic?
app.get('/health/readiness', async (req, res) => {
try {
// Check database connectivity
const dbStatus = await checkDatabaseConnection();
// Check dependent services
const servicesStatus = await checkDependentServices();
// Check internal state
const appStatus = checkApplicationState();
// Determine overall health status
const isHealthy = dbStatus.healthy &&
servicesStatus.every(s => s.healthy) &&
appStatus.healthy;
// Return detailed health status
if (isHealthy) {
res.status(200).json({
status: 'UP',
checks: {
database: dbStatus,
services: servicesStatus,
application: appStatus
},
version: process.env.APP_VERSION || '1.0.0',
timestamp: new Date().toISOString()
});
} else {
res.status(503).json({
status: 'DOWN',
checks: {
database: dbStatus,
services: servicesStatus,
application: appStatus
},
version: process.env.APP_VERSION || '1.0.0',
timestamp: new Date().toISOString()
});
}
} catch (error) {
res.status(500).json({
status: 'DOWN',
error: error.message,
timestamp: new Date().toISOString()
});
}
});
// Database connection check
async function checkDatabaseConnection() {
try {
const startTime = Date.now();
await db.query('SELECT 1');
return {
healthy: true,
responseTime: Date.now() - startTime // ms, measured rather than hard-coded
};
} catch (error) {
return {
healthy: false,
error: error.message
};
}
}
// Check dependent services health
async function checkDependentServices() {
const services = [
{ name: 'auth-service', url: 'http://auth-service/health' },
{ name: 'payment-service', url: 'http://payment-service/health' }
];
return Promise.all(services.map(async (service) => {
try {
const startTime = Date.now();
const response = await axios.get(service.url, { timeout: 1000 });
const responseTime = Date.now() - startTime;
return {
name: service.name,
healthy: response.status === 200,
responseTime
};
} catch (error) {
return {
name: service.name,
healthy: false,
error: error.message
};
}
}));
}
// Application state check
function checkApplicationState() {
// Check if application has completed initialization
const initialized = global.appInitialized === true;
// Check if application has necessary resources
const resources = checkSystemResources();
// Check connection pool health
const connectionPool = checkConnectionPool();
return {
healthy: initialized && resources.healthy && connectionPool.healthy,
initialized,
resources,
connectionPool
};
}
Health Check Best Practices
- Multi-Level Checks: Implement both shallow and deep health checks
- Proper Timing: Set appropriate initialDelaySeconds to allow application startup
- Check Dependencies: Verify connectivity to databases and services
- Efficiency: Keep health checks lightweight and efficient
- Avoid Side Effects: Health checks should not modify state or trigger business logic
- Proper Status Codes: Use appropriate HTTP status codes for different scenarios
- Version Information: Include application version in health check response
- Performance Metrics: Include response time and resource utilization
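One way to keep deep checks lightweight is to cache their result briefly, so frequent probes do not hit the database on every call. This is a minimal sketch under that assumption: `createCachedCheck` is a hypothetical helper, and the injectable `now` argument exists only to make the behavior easy to test.

```javascript
// Wraps a health check so repeated probes within ttlMs reuse the last result.
// If checkFn returns a promise, the promise itself is cached for the TTL.
function createCachedCheck(checkFn, ttlMs, now = Date.now) {
  let cached = null;
  let expiresAt = 0;
  return function cachedCheck() {
    const t = now();
    if (cached === null || t >= expiresAt) {
      cached = checkFn();
      expiresAt = t + ttlMs;
    }
    return cached;
  };
}

// Usage with a fake clock: the underlying check runs twice for three probes.
let calls = 0;
let clock = 0;
const check = createCachedCheck(() => ({ healthy: true, call: ++calls }), 5000, () => clock);
check();
check();        // served from the cache
clock = 6000;
check();        // cache expired, the check runs again
console.log(calls); // 2
```

The TTL should stay shorter than the probe failure window, otherwise a stale "healthy" result could keep traffic flowing to a broken instance.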
Real-World Example: Health Check Engineering
A payment processing company implemented a sophisticated health check system for their microservices architecture:
- Three-Tiered Health Checks:
  - Level 1 (Liveness): Basic process health (sub-50ms response time)
  - Level 2 (Readiness): Database connectivity and configuration validation
  - Level 3 (Deep Health): End-to-end test transactions, cache validation, and consistency checks
- Progressive Exposure: New instances follow a staged traffic pattern:
  - Pass Level 1 health check → Service registered but marked as "warming up"
  - Pass Level 2 health check → Receive 5% of traffic for 60 seconds
  - No errors during initial traffic → Receive full traffic share
  - Level 3 health check runs continuously in background
- Results: 99.99% successful deployments with zero customer impact, even with 20+ daily deployments across their service fleet
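The staged traffic pattern above can be expressed as a small decision function. This is a sketch of the idea rather than the company's implementation: the field names are hypothetical, and removing an instance from rotation after trial errors is an assumption, since the example does not say what happens in that case.

```javascript
const TRIAL_WINDOW_MS = 60 * 1000; // 60-second trial from the example above
const TRIAL_SHARE = 0.05;          // 5% of the instance's normal traffic share

// Returns the fraction of its normal traffic share an instance should receive.
function trafficShare(instance, nowMs) {
  // Not yet through Level 1 and Level 2: registered but "warming up".
  if (!instance.livenessPassed || !instance.readinessPassed) return 0;
  // Inside the trial window: limited traffic only.
  if (nowMs - instance.readySinceMs < TRIAL_WINDOW_MS) return TRIAL_SHARE;
  // Assumption: errors during the trial take the instance out of rotation.
  if (instance.trialErrors > 0) return 0;
  return 1;
}

const instance = { livenessPassed: true, readinessPassed: true, readySinceMs: 0, trialErrors: 0 };
console.log(trafficShare(instance, 30000)); // 0.05
console.log(trafficShare(instance, 60000)); // 1
```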
Connection Draining and Session Management
Managing existing connections and user sessions during rolling updates is critical to providing seamless user experiences.
Connection Draining
Connection draining is the process of allowing existing connections to complete naturally before removing an instance from service. This helps prevent disruptions to active user sessions during updates.
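At the application level, draining comes down to refusing new work while counting the requests still in flight. The following is a minimal sketch of that bookkeeping; `createDrainTracker` is a hypothetical helper, and wiring it into an HTTP server (for example as middleware) is left out.

```javascript
// Tracks in-flight requests so an instance can stop accepting new work
// and shut down once the remaining requests have finished.
function createDrainTracker() {
  let inFlight = 0;
  let draining = false;
  return {
    // Returns false when the instance is draining and must refuse the request.
    begin() {
      if (draining) return false;
      inFlight += 1;
      return true;
    },
    // Marks one accepted request as finished.
    end() {
      inFlight = Math.max(0, inFlight - 1);
    },
    startDraining() {
      draining = true;
    },
    // True only after draining has started and all requests have completed.
    canShutDown() {
      return draining && inFlight === 0;
    },
  };
}

const tracker = createDrainTracker();
tracker.begin();                    // request A accepted
tracker.startDraining();            // update begins
console.log(tracker.begin());       // false: new requests are refused
console.log(tracker.canShutDown()); // false: request A is still running
tracker.end();                      // request A completes
console.log(tracker.canShutDown()); // true
```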
AWS Connection Draining Configuration
# CloudFormation template for ELB with connection draining
Resources:
MyLoadBalancer:
Type: AWS::ElasticLoadBalancing::LoadBalancer
Properties:
Listeners:
- LoadBalancerPort: '80'
InstancePort: '80'
Protocol: HTTP
HealthCheck:
Target: HTTP:80/health
HealthyThreshold: '3'
UnhealthyThreshold: '5'
Interval: '30'
Timeout: '5'
ConnectionDrainingPolicy:
Enabled: true
Timeout: 300 # 5 minutes to allow connections to complete
Nginx Connection Draining Example
# Nginx configuration for graceful shutdown
events {
worker_connections 1024;
}
http {
upstream backend {
server backend1.example.com;
server backend2.example.com;
server backend3.example.com;
}
server {
listen 80;
location / {
proxy_pass http://backend;
proxy_http_version 1.1;
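proxy_set_header Connection ""; # Clear the Connection header (required for upstream keepalive, which also needs a keepalive directive in the upstream block)
# Retry on another upstream server when one is draining or unavailable
proxy_next_upstream error timeout http_502 http_503 http_504;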
proxy_connect_timeout 5s;
proxy_read_timeout 60s;
proxy_send_timeout 60s;
}
}
}
#!/bin/bash
# Graceful shutdown script (a separate file from the Nginx configuration above)
# Signal Nginx to stop accepting new connections but finish processing existing ones
nginx -s quit
# Wait for connections to drain (adjust timeout as needed)
timeout=300 # 5 minutes
interval=5 # Check every 5 seconds
elapsed=0
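while [ "$elapsed" -lt "$timeout" ]; do
# Count established connections on local port 80 only
# (a plain grep for ":80" would also match ports such as 8080 and outbound connections)
active_connections=$(ss -Htn state established '( sport = :80 )' | wc -l)
if [ "$active_connections" -eq 0 ]; then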
echo "All connections drained, proceeding with update"
break
fi
echo "Waiting for $active_connections connections to complete..."
sleep $interval
elapsed=$((elapsed + interval))
done
# Continue with the update process
Session Management Strategies
Managing user sessions across instance updates requires proper design to prevent session loss.
| Strategy | Description | Pros | Cons |
|---|---|---|---|
| External Session Store | Store session data in Redis, Memcached, or database | Complete session persistence, works with any update strategy | Extra infrastructure, potential performance impact |
| Sticky Sessions | Route users to the same instance for their session duration | Simple to implement, no external dependencies | Sessions lost during instance updates, load balancing challenges |
| Client-Side Sessions | Store session data in cookies or local storage | No server-side state, simplified architecture | Limited storage capacity, security considerations |
| Session Replication | Replicate session data across instances | High availability, no external dependencies | Resource intensive, complex implementation |
External Session Store Implementation
// Node.js with Express and Redis Session Store
const express = require('express');
const session = require('express-session');
const RedisStore = require('connect-redis').default;
const { createClient } = require('redis');
const app = express();
// Create Redis client
const redisClient = createClient({
url: process.env.REDIS_URL || 'redis://localhost:6379'
});
redisClient.connect().catch(console.error);
// Configure session middleware with Redis store
app.use(session({
store: new RedisStore({ client: redisClient }),
secret: process.env.SESSION_SECRET || 'your-secret-key',
resave: false,
saveUninitialized: false,
cookie: {
secure: process.env.NODE_ENV === 'production',
maxAge: 1000 * 60 * 60 * 24 // 1 day
}
}));
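// Parse form bodies so req.body is populated in the login handler
app.use(express.urlencoded({ extended: false }));
// Session usage in application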
app.get('/profile', (req, res) => {
if (!req.session.user) {
return res.redirect('/login');
}
res.render('profile', { user: req.session.user });
});
app.post('/login', (req, res) => {
// Authenticate user
const user = authenticateUser(req.body.username, req.body.password);
if (user) {
// Store user in session
req.session.user = {
id: user.id,
username: user.username,
email: user.email,
preferences: user.preferences
};
res.redirect('/dashboard');
} else {
res.render('login', { error: 'Invalid credentials' });
}
});
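// Start the HTTP server and keep a reference for graceful shutdown
const server = app.listen(process.env.PORT || 3000);
// Graceful shutdown handling
process.on('SIGTERM', () => {
console.log('Received SIGTERM signal, shutting down gracefully');
// Stop accepting new connections and let in-flight requests finish first
server.close(async () => {
console.log('HTTP server closed');
// Close Redis only after the last request has completed
await redisClient.quit();
process.exit(0);
});
});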
Connection Draining and Session Management Best Practices
- Appropriate Timeouts: Set connection draining timeouts based on how long in-flight requests and open connections typically take to finish
- Externalize State: Keep session data outside application instances
- Graceful Shutdown: Implement proper shutdown hooks to handle in-flight requests
- Sticky Session Fallbacks: If using sticky sessions, implement fallback for session recovery
- Session Versioning: Include version information in session data for compatibility
- Monitoring: Track connection counts during draining to ensure proper completion
- Client Communication: Consider notifying clients about maintenance or updates
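Session versioning can be sketched as an upgrade step applied whenever a session is loaded. Everything specific here is hypothetical: the `version` field, the `upgradeSession` helper, and the example change (a flat `theme` field moving under `preferences`) are illustrations, not part of any session library.

```javascript
const CURRENT_SESSION_VERSION = 2;

// Upgrades a session written by an older application version.
// Sessions without a version field are treated as version 1.
function upgradeSession(session) {
  const version = session.version ?? 1;
  if (version >= CURRENT_SESSION_VERSION) return session;
  // Hypothetical v1 -> v2 change: `theme` moves under `preferences`.
  const { theme, ...rest } = session;
  return {
    ...rest,
    version: CURRENT_SESSION_VERSION,
    preferences: { ...(session.preferences ?? {}), theme: theme ?? 'default' },
  };
}

console.log(upgradeSession({ user: 'alice', theme: 'dark' }));
```

During the rollout, old instances may still read sessions written by new ones, so new fields should be additive until every instance has been updated.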
Monitoring and Rollback Strategies
Effective monitoring and the ability to quickly roll back are essential safety nets for rolling updates.
Monitoring During Rolling Updates
Prometheus Alert Rules for Deployment Monitoring
# deployment_time and deployment_status are custom metrics assumed to be exported by the deployment pipeline
groups:
  - name: deployment_alerts
    rules:
      - alert: HighErrorRateAfterDeployment
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (service, version)
            / sum(rate(http_requests_total[5m])) by (service, version) > 0.05
          and on(service) (time() - max(deployment_time) by (service)) < 3600
        for: 2m
        labels:
          severity: critical
          category: deployment
        annotations:
          summary: "High error rate after deployment: {{ $labels.service }} v{{ $labels.version }}"
          description: "Error rate of {{ $value | humanizePercentage }} exceeds 5% threshold after recent deployment"
      - alert: LatencySpikeDuringDeployment
        expr: |
          (
            avg(rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m])) by (service)
            /
            avg(rate(http_request_duration_seconds_sum[1h] offset 2h) / rate(http_request_duration_seconds_count[1h] offset 2h)) by (service)
          ) > 1.5
          and on(service) (time() - max(deployment_time) by (service)) < 3600
        for: 3m
        labels:
          severity: warning
          category: deployment
        annotations:
          summary: "Latency spike during deployment: {{ $labels.service }}"
          description: "Response times are {{ $value | humanize }}x the baseline during deployment"
      - alert: DeploymentStalled
        # on(service) is needed because the two metrics carry different label sets
        expr: |
          deployment_status{status="in_progress"} == 1
          and on(service)
          (time() - deployment_time) > 1800
        for: 5m
        labels:
          severity: warning
          category: deployment
        annotations:
          summary: "Deployment stalled: {{ $labels.service }}"
          description: "Deployment has been in progress for over 30 minutes without completion"
Rollback Strategies
Having effective rollback mechanisms is crucial for responding to issues during rolling updates.
| Rollback Approach | Description | Best For |
|---|---|---|
| Full Rollback | Revert all instances to the previous version | Critical issues affecting all new instances |
| Partial Rollback | Revert only problematic instances while continuing update for others | Issues affecting specific infrastructure or regions |
| Pause and Fix | Pause the rolling update, fix issues, and continue with corrected version | Minor problems that can be quickly resolved |
| Forward Fix | Deploy a new version that addresses the issues without reverting | Situations where rollback would cause more problems |
Automated Rollback Script for Kubernetes
#!/bin/bash
# Automated monitoring and rollback script for Kubernetes deployments
# Configuration
NAMESPACE="production"
DEPLOYMENT="my-app"
ERROR_THRESHOLD=5 # Error percentage triggering rollback
LATENCY_THRESHOLD=1.5 # Latency increase factor
MONITOR_DURATION=600 # Monitor for 10 minutes after deployment
INTERVAL=15 # Check every 15 seconds
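# Get the current deployment revision from the Deployment's revision annotation
# (the first row of "kubectl rollout history" is the oldest revision, not the current one)
current_revision=$(kubectl get deployment/$DEPLOYMENT -n $NAMESPACE -o jsonpath='{.metadata.annotations.deployment\.kubernetes\.io/revision}')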
echo "Deployment $DEPLOYMENT updated to revision $current_revision"
# Record deployment time for future reference
deployment_time=$(date +%s)
echo "Starting post-deployment monitoring at $(date)"
# Monitor for the specified duration
end_time=$(($(date +%s) + $MONITOR_DURATION))
while [ $(date +%s) -lt $end_time ]; do
# Get error rate from Prometheus
# (-G with --data-urlencode avoids curl URL globbing on {} and []; jq falls back to "0" when no data is returned)
ERROR_RATE=$(curl -s -G "http://prometheus:9090/api/v1/query" --data-urlencode "query=sum(rate(http_requests_total{namespace='$NAMESPACE',deployment='$DEPLOYMENT',status=~'5..'}[1m]))/sum(rate(http_requests_total{namespace='$NAMESPACE',deployment='$DEPLOYMENT'}[1m]))*100" | jq -r '.data.result[0].value[1] // "0"')
# Get latency data from Prometheus
CURRENT_LATENCY=$(curl -s -G "http://prometheus:9090/api/v1/query" --data-urlencode "query=histogram_quantile(0.95,sum(rate(http_request_duration_seconds_bucket{namespace='$NAMESPACE',deployment='$DEPLOYMENT'}[1m])) by (le))" | jq -r '.data.result[0].value[1] // "0"')
BASELINE_LATENCY=$(curl -s -G "http://prometheus:9090/api/v1/query" --data-urlencode "query=histogram_quantile(0.95,sum(rate(http_request_duration_seconds_bucket{namespace='$NAMESPACE',deployment='$DEPLOYMENT'}[1h] offset 2h)) by (le))" | jq -r '.data.result[0].value[1] // "0"')
# Avoid dividing by zero when no baseline data is available
if (( $(echo "$BASELINE_LATENCY > 0" | bc -l) )); then
LATENCY_RATIO=$(echo "$CURRENT_LATENCY / $BASELINE_LATENCY" | bc -l)
else
LATENCY_RATIO=0
fi
echo "Current metrics - Error rate: ${ERROR_RATE}%, P95 Latency: ${CURRENT_LATENCY}s (${LATENCY_RATIO}x baseline)"
# Check if error rate exceeds threshold
if (( $(echo "$ERROR_RATE > $ERROR_THRESHOLD" | bc -l) )); then
echo "Error rate exceeded threshold (${ERROR_RATE}% > ${ERROR_THRESHOLD}%), initiating rollback..."
kubectl rollout undo deployment/$DEPLOYMENT -n $NAMESPACE
# Send alert to Alertmanager (v2 API; the v1 API has been removed in recent releases)
curl -X POST "http://alertmanager:9093/api/v2/alerts" -H "Content-Type: application/json" -d '[{"labels":{"alertname":"DeploymentRollback","severity":"critical","deployment":"'$DEPLOYMENT'"},"annotations":{"summary":"Automatic rollback triggered","description":"Error rate of '$ERROR_RATE'% exceeded threshold of '$ERROR_THRESHOLD'%"}}]'
echo "Rollback initiated at $(date)"
exit 1
fi
# Check if latency exceeds threshold
if (( $(echo "$LATENCY_RATIO > $LATENCY_THRESHOLD" | bc -l) )); then
echo "Latency ratio exceeded threshold (${LATENCY_RATIO}x > ${LATENCY_THRESHOLD}x), initiating rollback..."
kubectl rollout undo deployment/$DEPLOYMENT -n $NAMESPACE
# Send alert to Alertmanager (v2 API; the v1 API has been removed in recent releases)
curl -X POST "http://alertmanager:9093/api/v2/alerts" -H "Content-Type: application/json" -d '[{"labels":{"alertname":"DeploymentRollback","severity":"critical","deployment":"'$DEPLOYMENT'"},"annotations":{"summary":"Automatic rollback triggered","description":"Latency increase of '$LATENCY_RATIO'x exceeded threshold of '$LATENCY_THRESHOLD'x"}}]'
echo "Rollback initiated at $(date)"
exit 1
fi
# Check deployment status to ensure it's still progressing
deployment_status=$(kubectl rollout status deployment/$DEPLOYMENT -n $NAMESPACE --watch=false)
if [[ $deployment_status == *"successfully rolled out"* ]]; then
echo "Deployment completed successfully, continuing to monitor..."
elif [[ $deployment_status == *"error"* ]]; then
echo "Deployment encountered errors, initiating rollback..."
kubectl rollout undo deployment/$DEPLOYMENT -n $NAMESPACE
echo "Rollback initiated at $(date)"
exit 1
fi
sleep $INTERVAL
done
echo "Post-deployment monitoring completed successfully at $(date)"
echo "Deployment is stable"
exit 0
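The decision logic in the script above is easier to test when isolated as a pure function. The sketch below mirrors the script's thresholds (5% errors, 1.5x latency); `evaluateRollout` and its field names are hypothetical, and treating missing metrics as "no decision" is an assumption rather than something the script does.

```javascript
// Decides whether to roll back, using the same thresholds as the script.
// Missing metrics (null, undefined, NaN) never trigger a rollback on their own.
function evaluateRollout(metrics, thresholds = { errorPercent: 5, latencyRatio: 1.5 }) {
  const { errorPercent, currentLatency, baselineLatency } = metrics;
  if (Number.isFinite(errorPercent) && errorPercent > thresholds.errorPercent) {
    return { rollBack: true, reason: 'error-rate' };
  }
  const haveLatency =
    Number.isFinite(currentLatency) && Number.isFinite(baselineLatency) && baselineLatency > 0;
  if (haveLatency && currentLatency / baselineLatency > thresholds.latencyRatio) {
    return { rollBack: true, reason: 'latency' };
  }
  return { rollBack: false, reason: null };
}

console.log(evaluateRollout({ errorPercent: 7, currentLatency: 0.2, baselineLatency: 0.2 }));
console.log(evaluateRollout({ errorPercent: 1, currentLatency: 0.4, baselineLatency: 0.2 }));
```

Whether missing data should block, pause, or be ignored is a policy choice; a stricter pipeline might pause the rollout when metrics stop arriving.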
Real-World Example: Progressive Rollback Strategy
A large SaaS company implemented a sophisticated monitoring and rollback strategy for their microservices platform:
- Deployment Phases: Rolling updates progressed through phases, with increasing traffic percentages
- Automated Canaries: First 5% of instances served as canaries with enhanced monitoring
- Multi-Dimensional Metrics: Each phase evaluated against 12 key metrics, including:
  - Error rates (by error type and endpoint)
  - Latency percentiles (p50, p90, p99)
  - CPU and memory utilization patterns
  - Business metrics (conversion rates, etc.)
  - Dependent service impacts
- Rollback Strategy: A tiered response system based on severity:
  - Level 1 (Minor Issues): Pause deployment, investigate, proceed if resolved
  - Level 2 (Moderate Issues): Partial rollback of affected instances/regions
  - Level 3 (Critical Issues): Full automatic rollback with engineer notification
  - Level 4 (Catastrophic Issues): Circuit breaker activation and incident response
- Post-Mortem Process: Any rollback triggered a thorough analysis process before the next deployment attempt
This approach reduced deployment incidents by 87% and tripled deployment frequency, allowing the company to ship new features safely multiple times per day.
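The tiered response described above can be sketched as a small decision function. The source does not state the company's actual criteria, so the error-rate thresholds and action names below are hypothetical and only illustrate how the four levels might map to actions.

```javascript
// Hypothetical severity thresholds, ordered from most to least severe.
// Tune these to your own service-level objectives.
const LEVELS = [
  { level: 4, action: 'circuit-breaker-and-incident', minErrorRatePct: 25 },
  { level: 3, action: 'full-rollback-and-notify', minErrorRatePct: 5 },
  { level: 2, action: 'partial-rollback', minErrorRatePct: 2 },
  { level: 1, action: 'pause-and-investigate', minErrorRatePct: 0.5 },
];

// Returns the most severe matching response, or level 0 ("proceed") if none match.
function chooseResponse(errorRatePct) {
  if (!Number.isFinite(errorRatePct) || errorRatePct < 0) {
    throw new RangeError('errorRatePct must be a non-negative number');
  }
  const match = LEVELS.find((l) => errorRatePct >= l.minErrorRatePct);
  return match
    ? { level: match.level, action: match.action }
    : { level: 0, action: 'proceed' };
}

console.log(chooseResponse(0.1).action); // proceed
console.log(chooseResponse(3).action);   // partial-rollback
console.log(chooseResponse(40).action);  // circuit-breaker-and-incident
```

A real system would likely combine several of the metrics listed above rather than a single error rate; one input keeps the sketch readable.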
Advanced Rolling Update Techniques
Traffic Shaping During Rolling Updates
Traffic management techniques can limit how much user traffic is exposed to new code while a rolling update is in progress:
- Ring Deployment Model: Deploy to concentric "rings" of instances, starting with the least critical and moving outward to the most critical
- Dark Launching: Deploy new code but keep the new functionality inactive until the deployment is stable
- Shadow Testing: Send duplicate traffic to new instances and discard their responses, so users only see responses from the current version
- Incremental Feature Activation: Combine rolling updates with feature flags so features can be turned on gradually after the code is deployed
- Synthetic User Testing: Automatically test new instances with synthetic traffic before they receive real requests
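As an illustration of the ring deployment model, the sketch below splits a fleet into rings of growing size and stops at the first ring that fails its health check. The cumulative ring percentages (5%, 25%, 100%) and the `updateInstance` and `isHealthy` callbacks are assumptions made for the example, not part of any particular platform.

```javascript
// Split an ordered list of instances into rings of growing size.
// ringPercents are cumulative percentages of the fleet (hypothetical defaults).
function planRings(instances, ringPercents = [5, 25, 100]) {
  const rings = [];
  let start = 0;
  for (const percent of ringPercents) {
    if (start >= instances.length) break;
    // Always advance by at least one instance so no ring is empty.
    const target = Math.ceil((instances.length * percent) / 100);
    const end = Math.min(instances.length, Math.max(start + 1, target));
    rings.push(instances.slice(start, end));
    start = end;
  }
  return rings;
}

// Update ring by ring; stop at the first ring with an unhealthy instance.
// updateInstance and isHealthy are caller-supplied async functions.
async function rollOut(rings, updateInstance, isHealthy) {
  const updated = [];
  for (const ring of rings) {
    for (const instance of ring) {
      await updateInstance(instance);
      updated.push(instance);
    }
    const results = await Promise.all(ring.map((i) => isHealthy(i)));
    if (!results.every(Boolean)) return { ok: false, updated };
  }
  return { ok: true, updated };
}

const fleet = Array.from({ length: 20 }, (_, i) => `instance-${i}`);
console.log(planRings(fleet).map((ring) => ring.length)); // [ 1, 4, 15 ]
```

On failure, `updated` lists the instances that would need to be rolled back.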
Shadow Testing Implementation
# Nginx configuration for shadow testing
http {
    upstream production_backend {
        server prod1.example.com;
        server prod2.example.com;
        server prod3.example.com;
    }
    upstream shadow_backend {
        server shadow1.example.com;
        server shadow2.example.com;
    }
    server {
        listen 80;
        location / {
            # Primary request to production backend
            proxy_pass http://production_backend;
            # Mirror every request; the /shadow location decides which ones to forward.
            # (The mirror directive is not allowed inside an "if" block.)
            mirror /shadow;
        }
        # Shadow endpoint: its response is never returned to the client
        location = /shadow {
            internal;
            # Shadow traffic to new version (only for GET requests)
            if ($request_method != GET) {
                return 204;
            }
            proxy_pass http://shadow_backend$request_uri;
            proxy_ignore_client_abort on;
            proxy_connect_timeout 1s;  # Short timeout so a slow shadow backend fails fast
            proxy_read_timeout 10s;
        }
    }
}
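Because mirrored responses are discarded, shadow testing only pays off if the two versions are compared afterwards. The sketch below summarizes such a comparison. The record shape (`primaryStatus`, `shadowStatus`, `primaryMs`, `shadowMs`) is hypothetical; in practice it might be built by joining the access logs of both backends on a shared request ID.

```javascript
// Each record pairs one production response with its mirrored shadow response.
function summarizeShadow(records) {
  if (records.length === 0) throw new Error('no records to compare');
  let statusMismatches = 0;
  let primaryMs = 0;
  let shadowMs = 0;
  for (const r of records) {
    if (r.primaryStatus !== r.shadowStatus) statusMismatches += 1;
    primaryMs += r.primaryMs;
    shadowMs += r.shadowMs;
  }
  return {
    // Fraction of requests where the two versions returned different status codes.
    mismatchRate: statusMismatches / records.length,
    // Total shadow latency over total primary latency; >1 means the new version is slower.
    latencyRatio: shadowMs / primaryMs,
  };
}

const summary = summarizeShadow([
  { primaryStatus: 200, shadowStatus: 200, primaryMs: 10, shadowMs: 12 },
  { primaryStatus: 200, shadowStatus: 500, primaryMs: 10, shadowMs: 18 },
  { primaryStatus: 404, shadowStatus: 404, primaryMs: 20, shadowMs: 20 },
  { primaryStatus: 200, shadowStatus: 200, primaryMs: 10, shadowMs: 10 },
]);
console.log(summary.mismatchRate); // 0.25
```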
Feature Flag Integration with Rolling Update
// Using feature flags with rolling updates
import express from 'express';
import LaunchDarkly from 'launchdarkly-node-server-sdk';

// Initialize LaunchDarkly client (the server-side SDK exposes init(), not initialize())
const ldClient = LaunchDarkly.init('YOUR_SDK_KEY');
// Wait for client to be ready
await ldClient.waitForInitialization();

// Application configuration with feature flags
const app = express();
app.get('/api/products', async (req, res) => {
  // req.user is assumed to be populated by authentication middleware
  const user = {
    key: req.user.id,
    custom: {
      groups: req.user.groups,
      // Add instance version as an attribute
      instanceVersion: process.env.APP_VERSION || '1.0.0'
    }
  };
  // Check if new product search algorithm should be enabled
  // Can be targeted to specific app versions, users, or instance groups
  const useNewAlgorithm = await ldClient.variation('new-search-algorithm', user, false);
  // newSearchImplementation and legacySearchImplementation are the application's own functions
  if (useNewAlgorithm) {
    // Use new implementation (only in updated instances)
    const products = await newSearchImplementation(req.query);
    res.json(products);
  } else {
    // Use old implementation (works in both old and new instances)
    const products = await legacySearchImplementation(req.query);
    res.json(products);
  }
});

// Gradual activation during a rolling update:
// 1. Create the feature flag and leave it off
// 2. Perform the rolling update so every instance runs the new code with the flag off
// 3. Gradually enable the flag for an increasing percentage of traffic
// 4. Monitor, and turn the flag off if issues occur (without rolling back the deployment)
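Gradually enabling a flag for a percentage of traffic is usually done by hashing each user into a stable bucket, so a given user keeps the same experience as the percentage grows. The sketch below shows the general idea with an FNV-1a hash. It is an illustration only and is not LaunchDarkly's actual bucketing algorithm; the function names are hypothetical.

```javascript
// FNV-1a 32-bit hash over UTF-16 code units: small, dependency-free,
// and stable across processes and restarts.
function fnv1a(text) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash >>> 0;
}

// Maps a user to a stable bucket in [0, 100) for a given flag.
// Including the flag key keeps buckets independent between flags.
function bucketFor(flagKey, userKey) {
  return fnv1a(`${flagKey}:${userKey}`) % 100;
}

// A user is enabled when their bucket falls below the rollout percentage,
// so raising the percentage only ever adds users and never removes them.
function isEnabled(flagKey, userKey, rolloutPercent) {
  return bucketFor(flagKey, userKey) < rolloutPercent;
}

console.log(isEnabled('new-search-algorithm', 'user-42', 0));   // false
console.log(isEnabled('new-search-algorithm', 'user-42', 100)); // true
```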
Automated Rollout Strategies
Automating rollout decisions (when to proceed, pause, or roll back) makes rolling updates more reliable and less dependent on manual judgment.
Real-World Example: Google's Automated Canary Analysis
Google has developed a sophisticated automated deployment platform with rolling updates that incorporates:
- Progressive Rollout: Updates proceed through multiple phases from canary to global deployment
- Automatic Analysis: Statistical models compare metrics between versions
- Multi-Dimensional Evaluation: System evaluates key SLI/SLOs across services
- Bake Periods: Mandatory observation periods between deployment phases
- Self-Service Tools: Service owners can configure rollout parameters and key metrics
- Feedback Loops: Deployment system learns from past successes and failures
This system enables Google to safely deploy thousands of changes per day across its service fleet, with minimal human intervention for routine deployments.
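To make the idea of statistically comparing versions concrete, here is a minimal sketch of a canary judgment on a single metric where lower is better, such as latency. The source does not say which statistical models Google uses; Welch's t statistic is used here as a simple stand-in, and both thresholds are hypothetical.

```javascript
function mean(xs) {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

// Unbiased sample variance (divides by n - 1).
function variance(xs) {
  const m = mean(xs);
  return xs.reduce((acc, x) => acc + (x - m) ** 2, 0) / (xs.length - 1);
}

// Welch's t statistic; positive when the canary mean exceeds the baseline mean.
function welchT(baseline, canary) {
  const standardError = Math.sqrt(
    variance(baseline) / baseline.length + variance(canary) / canary.length
  );
  return (mean(canary) - mean(baseline)) / standardError;
}

// Fails the canary only if the metric is both statistically worse (large t)
// and practically worse (relative increase above the tolerance).
function judgeCanary(baseline, canary, { tThreshold = 3, maxRelativeIncrease = 0.1 } = {}) {
  if (baseline.length < 2 || canary.length < 2) {
    throw new Error('need at least two samples per group');
  }
  const relativeIncrease = (mean(canary) - mean(baseline)) / mean(baseline);
  const t = welchT(baseline, canary);
  const pass = !(t > tThreshold && relativeIncrease > maxRelativeIncrease);
  return { pass, t, relativeIncrease };
}

const baselineLatencyMs = [100, 102, 98, 101, 99];
console.log(judgeCanary(baselineLatencyMs, [130, 128, 132, 131, 129]).pass); // false
console.log(judgeCanary(baselineLatencyMs, [101, 99, 100, 102, 98]).pass);   // true
```

Requiring both conditions avoids failing a rollout over a difference that is statistically detectable but too small to matter.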
Learning Activities
Activity 1: Rolling Update Strategy Design
Design a rolling update strategy for a microservices-based application with the following characteristics:
- User-facing web application with authentication
- A fleet of API microservices (10 different service types)
- PostgreSQL database for persistent data
- Redis for caching and session management
- Average user session duration of 30 minutes
- Business requirement: 99.9% uptime during updates
Your strategy should include:
- Update sequence across services
- Batch sizing and update parameters
- Health check requirements
- Session handling approach
- Database change management
- Monitoring and rollback plan
Activity 2: Implementing Rolling Updates in Kubernetes
Create Kubernetes deployment configurations for a three-tier application:
- Frontend web application with 5 replicas
- Backend API service with 3 replicas
- Database with primary and standby replicas
Implement the following:
- Appropriate rolling update strategy for each component
- Health check probes (liveness, readiness, startup)
- Resource limits and requests
- Service configurations with appropriate selectors
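- Pre- and post-deployment tasks (Deployments have no built-in update hooks, so consider init containers, container lifecycle hooks such as preStop, or Helm hooks)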
- Monitoring and alerting configuration
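When choosing rolling update parameters for each component, it helps to see how percentage values resolve to pod counts. Kubernetes rounds `maxSurge` up and `maxUnavailable` down, and both default to 25%. The helper below is a sketch of that arithmetic; the function names are hypothetical.

```javascript
// Resolves an integer or a percentage string such as "25%" against the replica count.
function resolve(value, replicas, roundUp) {
  if (typeof value === 'number') return value;
  const percent = Number.parseInt(value.replace('%', ''), 10);
  const raw = (percent * replicas) / 100;
  return roundUp ? Math.ceil(raw) : Math.floor(raw);
}

// Returns the pod-count bounds that hold during a rolling update.
function rollingUpdateBounds(replicas, maxSurge = '25%', maxUnavailable = '25%') {
  const surge = resolve(maxSurge, replicas, true);              // percentages round up
  const unavailable = resolve(maxUnavailable, replicas, false); // percentages round down
  return {
    maxPods: replicas + surge,            // upper bound on total pods
    minAvailable: replicas - unavailable, // lower bound on available pods
  };
}

console.log(rollingUpdateBounds(5)); // { maxPods: 7, minAvailable: 4 }
console.log(rollingUpdateBounds(3)); // { maxPods: 4, minAvailable: 3 }
```

With the defaults, the 3-replica backend never drops below 3 available pods, while the 5-replica frontend may briefly run on 4.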
Activity 3: Rolling Update Failure Scenarios
Analyze the following rolling update failure scenarios and develop response plans:
- New version passes health checks but shows increased error rates after 10 minutes
- Database connection pool exhaustion during update
- Intermittent failures affecting only some instances
- Version compatibility issue between services updated at different times
- Memory leak in new version causing gradual degradation
For each scenario, describe:
- Detection method
- Immediate mitigation
- Root cause analysis approach
- Prevention strategy for future deployments
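As one possible starting point for the detection method in the memory leak scenario, the sketch below fits a least-squares line to memory samples and flags sustained growth. The sample shape (`minute`, `memoryMiB`) and the growth threshold are hypothetical.

```javascript
// Least-squares slope of y over x (for example, MiB per minute).
function slope(points) {
  const n = points.length;
  if (n < 2) throw new Error('need at least two points');
  const meanX = points.reduce((s, p) => s + p.x, 0) / n;
  const meanY = points.reduce((s, p) => s + p.y, 0) / n;
  let numerator = 0;
  let denominator = 0;
  for (const p of points) {
    numerator += (p.x - meanX) * (p.y - meanY);
    denominator += (p.x - meanX) ** 2;
  }
  if (denominator === 0) throw new Error('x values must not all be equal');
  return numerator / denominator;
}

// Flags an instance whose memory grows faster than the allowed rate.
function looksLikeLeak(samples, maxGrowthMiBPerMinute) {
  const points = samples.map((s) => ({ x: s.minute, y: s.memoryMiB }));
  return slope(points) > maxGrowthMiBPerMinute;
}

const samples = [500, 510, 520, 530, 540].map((memoryMiB, minute) => ({ minute, memoryMiB }));
console.log(looksLikeLeak(samples, 2)); // true (memory grows by 10 MiB per minute)
```

A trend check like this catches gradual degradation that a point-in-time health check would miss.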
Key Takeaways
- Rolling updates provide zero-downtime deployments by incrementally updating instances
- This approach balances availability with resource efficiency but requires careful management of version coexistence
- Proper health checks and readiness probes are critical to successful rolling updates
- Connection draining and session management help provide seamless user experiences
- Database schema changes must be designed to work with both old and new application versions
- Effective monitoring and automated rollback mechanisms provide safety nets
- Advanced techniques like traffic shaping and progressive exposure can enhance safety
- Most modern orchestration platforms provide built-in support for rolling updates