Canary Releases: A Gradual Approach to Deployment

Introduction to Canary Releases

Modern software deployment requires balancing speed of delivery with risk management. While blue-green deployments provide a clean cutover between versions, canary releases take a more gradual approach by routing a small portion of traffic to the new version before proceeding with a full rollout.

Named after the historical practice of coal miners bringing canaries into mines to detect toxic gases, a canary release serves as an early warning system for potential issues. By exposing only a subset of users to the new version, teams can detect problems before they affect the entire user base.

sequenceDiagram
    participant Users
    participant Router as Load Balancer/Router
    participant Current as Current Version (v1)
    participant Canary as Canary Version (v2)

    Note over Router,Canary: Initial State: 100% traffic to v1
    Users->>Router: Request
    Router->>Current: 100% of traffic
    Current->>Users: Response

    Note over Canary: Deploy v2 as canary
    Users->>Router: Request
    Router->>Current: 95% of traffic
    Router->>Canary: 5% of traffic
    Current->>Users: Response (95% of users)
    Canary->>Users: Response (5% of users)

    Note over Router,Canary: Monitor canary metrics
    Note over Router,Canary: Gradually increase canary traffic
    Users->>Router: Request
    Router->>Current: 80% of traffic
    Router->>Canary: 20% of traffic
    Current->>Users: Response (80% of users)
    Canary->>Users: Response (20% of users)

    Note over Router,Canary: Continue until 100% on v2
    Users->>Router: Request
    Router->>Canary: 100% of traffic
    Canary->>Users: Response
    Note over Current: Decommission v1

The Restaurant Menu Analogy

Canary releases can be compared to how a restaurant might introduce a new menu item:

Traditional Deployment: The restaurant completely replaces its menu overnight. If customers don't like the new dishes, the restaurant faces significant backlash.
Blue-Green Deployment: The restaurant prepares two complete menus and switches all customers from one to the other on a specific day.
Canary Release: The restaurant offers the new dish as a "daily special" to a small percentage of customers. Based on feedback, they make adjustments before adding it to the main menu for everyone. If the dish receives poor reviews, only a few customers are disappointed rather than the entire clientele.

Canary Releases vs. Other Deployment Strategies

Strategy	Traffic Pattern	Risk Profile	Complexity	Best For
Traditional/Recreate	All at once	High risk	Low complexity	Dev/Test environments, non-critical applications
Rolling Update	Incremental by instance	Medium risk	Medium complexity	Stateless applications with multiple instances
Blue-Green	All at once with instant rollback	Medium risk	Medium complexity	Applications requiring atomic updates
Canary	Incremental by traffic percentage	Low risk	High complexity	High-traffic, critical applications
A/B Testing	Split by user segments	Low risk	High complexity	Feature validation, UX improvements

Key Differences Between Canary and Blue-Green

Blue-Green Deployment

Switches 100% of traffic at once
Provides atomic deployment (all users see the same version)
Requires two full production environments
Simpler traffic routing (on/off switch)
Faster complete rollout
Focused on zero-downtime deployment

Canary Deployment

Incrementally increases traffic to new version
Different users may see different versions during rollout
Can use a smaller footprint for initial canary
Requires more sophisticated traffic management
Slower complete rollout
Focused on risk reduction and early problem detection

graph TD
    subgraph "Blue-Green Deployment"
        A1[Load Balancer] --> B1[Blue Environment<br/>100% traffic]
        A1 -.-> C1[Green Environment<br/>0% traffic]
        D1[Switch] --> E1[Load Balancer] --> F1[Blue Environment<br/>0% traffic]
        E1 --> G1[Green Environment<br/>100% traffic]
    end
    subgraph "Canary Deployment"
        A2[Load Balancer] --> B2[Current Version<br/>95% traffic]
        A2 --> C2[Canary Version<br/>5% traffic]
        D2[Increase] --> E2[Load Balancer] --> F2[Current Version<br/>80% traffic]
        E2 --> G2[Canary Version<br/>20% traffic]
        H2[Complete] --> I2[Load Balancer] --> J2[Current Version<br/>0% traffic]
        I2 --> K2[Canary Version<br/>100% traffic]
    end
    style B1 fill:#1E88E5
    style C1 fill:#4CAF50
    style F1 fill:#1E88E5
    style G1 fill:#4CAF50
    style B2 fill:#1E88E5
    style C2 fill:#4CAF50
    style F2 fill:#1E88E5
    style G2 fill:#4CAF50
    style J2 fill:#1E88E5
    style K2 fill:#4CAF50

Benefits and Challenges of Canary Releases

Benefits

Reduced Risk: Limits the impact of defects to a small subset of users
Early Detection: Discovers issues in real production conditions with real users
Progressive Confidence: Builds confidence gradually as more traffic shifts to the new version
Testing in Production: Validates actual system behavior with real traffic
Performance Validation: Enables comparison of performance metrics between versions
Controlled Experiment: Allows for controlled experimentation with actual users
Targeted Rollout: Can target specific user segments or regions first

Challenges

Complexity: Requires sophisticated traffic routing and monitoring
Version Compatibility: Multiple versions must coexist without conflicts
Database Evolution: Database schemas must support both versions
API Compatibility: APIs need to maintain backward compatibility
Monitoring Overhead: More complex monitoring to compare versions
User Experience: Different users may have inconsistent experiences
Rollout Duration: Complete deployment takes longer than all-at-once approaches

Real-World Example: Netflix Canary Deployment

Netflix, a pioneer in continuous delivery, implements canary deployments as a core part of their deployment strategy:

Initial Phase: Deploy to a small subset of random instances across all regions
Bake Time: Allow the canary to "bake" for 30+ minutes to observe behavior
Automated Analysis: Compare key metrics between canary and baseline instances
Decision Point: Automatically proceed or halt based on statistical analysis
Progressive Rollout: If canary is successful, progressively deploy to more instances
Region by Region: Roll out across regions to limit global impact

This approach enables Netflix to deploy thousands of times per day across their infrastructure while minimizing risk to their user experience.
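
The phases above amount to a small decision loop: bake, analyze, then promote, halt, or finish. The following is a minimal sketch of that loop, not Netflix's actual tooling. The stage list and the names `STAGES` and `nextAction` are assumptions made for illustration.

```javascript
// Hypothetical rollout stages (percent of traffic or instances on the canary).
const STAGES = [1, 5, 25, 50, 100];

// Decide what to do once a stage has finished baking.
// currentWeight: the stage that just finished baking.
// verdict: 'pass' or 'fail', as produced by the automated analysis step.
function nextAction(currentWeight, verdict, stages = STAGES) {
  if (verdict !== 'pass') {
    // Any failed analysis sends all traffic back to the baseline.
    return { action: 'rollback', weight: 0 };
  }
  const index = stages.indexOf(currentWeight);
  if (index === -1) {
    throw new Error(`Unknown stage: ${currentWeight}`);
  }
  if (index === stages.length - 1) {
    // Last stage passed: the rollout is done.
    return { action: 'complete', weight: currentWeight };
  }
  return { action: 'promote', weight: stages[index + 1] };
}

console.log(nextAction(5, 'pass'));   // { action: 'promote', weight: 25 }
console.log(nextAction(25, 'fail'));  // { action: 'rollback', weight: 0 }
console.log(nextAction(100, 'pass')); // { action: 'complete', weight: 100 }
```

Keeping the decision a pure function of the current stage and the analysis verdict makes it easy to test separately from the deployment machinery that carries it out.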

Implementing Canary Releases

Key Components for Canary Deployments

graph TD
    A[Canary Implementation Components] --> B[Traffic Routing Mechanism]
    A --> C[Deployment Infrastructure]
    A --> D[Monitoring & Metrics]
    A --> E[Automated Analysis]
    A --> F[Rollback Mechanism]
    B --> B1[Load Balancer Configuration]
    B --> B2[Service Mesh]
    B --> B3[API Gateway]
    C --> C1[Container Orchestration]
    C --> C2[Instance Management]
    C --> C3[Pipeline Integration]
    D --> D1[Health Metrics]
    D --> D2[Business Metrics]
    D --> D3[Comparison Analytics]
    E --> E1[Automatic Promotion]
    E --> E2[Automatic Rollback]
    F --> F1[Traffic Reversion]
    F --> F2[Version Redeployment]

Traffic Routing Approaches

Several methods exist for implementing the traffic split required for canary deployments:

Approach	Implementation	Pros	Cons
HTTP Load Balancer	Configure weighted routing rules at the load balancer level	Simple to implement, works with any application	Limited granularity, no session affinity by default
Service Mesh	Use Istio, Linkerd, or similar to manage traffic splitting	Fine-grained control, rich metrics, request-level routing	Complex setup, requires service mesh infrastructure
Application-Level Routing	Implement routing logic within the application itself	Complete control, can leverage application context	Requires code changes, mixing deployment and business logic
Feature Flags	Use feature flag services to control access to features	Can target specific users, works with monoliths	Requires feature flag management, code complexity
DNS Weighted Routing	Use DNS service with weighted records	Simple to configure, works at global scale	Slow propagation, coarse control, client-side caching
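
Whichever layer performs the split, the underlying operation is a weighted choice between backends. The sketch below illustrates that choice in isolation; `pickBackend` and the backend names are hypothetical, and real load balancers and meshes implement this internally.

```javascript
// Pick a backend name according to numeric weights.
// `rand` is injectable so the function can be tested deterministically.
function pickBackend(weights, rand = Math.random) {
  const entries = Object.entries(weights);
  const total = entries.reduce((sum, [, weight]) => sum + weight, 0);
  if (total <= 0) {
    throw new Error('Weights must sum to a positive number');
  }
  // Walk the cumulative weights until the random point falls inside a bucket.
  let point = rand() * total;
  for (const [name, weight] of entries) {
    if (point < weight) return name;
    point -= weight;
  }
  return entries[entries.length - 1][0]; // guard against floating-point drift
}

// Rough check of the split over many simulated requests.
const counts = { current: 0, canary: 0 };
for (let i = 0; i < 100000; i++) {
  counts[pickBackend({ current: 90, canary: 10 })] += 1;
}
console.log(counts); // roughly 90% current, 10% canary
```

A purely random pick like this gives no session affinity: the same user can land on different versions across requests, which is the limitation noted for plain load-balancer weighting in the table.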

Example: AWS ALB Weighted Target Groups


# AWS CLI command to create a weighted target group routing
aws elbv2 create-listener --load-balancer-arn $LB_ARN \
  --protocol HTTP --port 80 \
  --default-actions '[
    {
      "Type": "forward",
      "ForwardConfig": {
        "TargetGroups": [
          {
            "TargetGroupArn": "'$CURRENT_TG_ARN'",
            "Weight": 90
          },
          {
            "TargetGroupArn": "'$CANARY_TG_ARN'",
            "Weight": 10
          }
        ]
      }
    }
  ]'

# CloudFormation example
Resources:
  ApplicationLoadBalancerListener:
    Type: AWS::ElasticLoadBalancingV2::Listener
    Properties:
      LoadBalancerArn: !Ref ApplicationLoadBalancer
      Port: 80
      Protocol: HTTP
      DefaultActions:
        - Type: forward
          ForwardConfig:
            TargetGroups:
              - TargetGroupArn: !Ref CurrentTargetGroup
                Weight: 90
              - TargetGroupArn: !Ref CanaryTargetGroup
                Weight: 10

Example: Istio Service Mesh Traffic Splitting


# Istio VirtualService for traffic splitting
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: my-service
spec:
  hosts:
    - my-service
  http:
    - route:
        - destination:
            host: my-service
            subset: current
          weight: 90
        - destination:
            host: my-service
            subset: canary
          weight: 10
---
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: my-service
spec:
  host: my-service
  subsets:
    - name: current
      labels:
        version: v1
    - name: canary
      labels:
        version: v2

Example: Application-Level Canary with Express.js


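// Simple Express.js middleware for canary routing
const express = require('express');
const axios = require('axios');

const app = express();

const canaryRouter = (req, res, next) => {
  // Configuration (could be loaded from external source)
  const canaryPercentage = 10; // 10% of traffic to canary
  
  // Determine if this request should go to canary
  // Hashing a stable identifier keeps a given user on the same version.
  // req.session only exists when session middleware is installed, and the
  // random fallback gives no affinity; it only spreads anonymous traffic.
  const userId = String(req.session?.userId || req.ip || Math.random());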
  const hash = createHash(userId);
  const normalizedHash = hash % 100;
  
  if (normalizedHash < canaryPercentage) {
    // Route to canary service
    req.serviceVersion = 'canary';
  } else {
    // Route to current service
    req.serviceVersion = 'current';
  }
  
  next();
};

// Hash function for consistent routing
function createHash(str) {
  let hash = 0;
  for (let i = 0; i < str.length; i++) {
    hash = ((hash << 5) - hash) + str.charCodeAt(i);
    hash |= 0; // Convert to 32bit integer
  }
  return Math.abs(hash);
}

// Using the middleware
app.use(canaryRouter);

// Route handlers can check req.serviceVersion
app.get('/api/products', (req, res) => {
  // Pick the upstream that matches the version chosen by the middleware
  const baseUrl = req.serviceVersion === 'canary'
    ? 'http://products-service-canary'
    : 'http://products-service-current';

  axios.get(`${baseUrl}/products`)
    .then(response => res.json(response.data))
    .catch(error => res.status(500).json({ error: error.message }));
});

Canary Analysis and Metrics

The effectiveness of canary deployments depends on gathering and analyzing the right metrics to detect issues early.

Key Metric Categories

graph TD
    A[Canary Metrics] --> B[System Metrics]
    A --> C[Business Metrics]
    A --> D[User Experience Metrics]
    A --> E[Error Metrics]
    B --> B1[CPU/Memory Usage]
    B --> B2[Response Time]
    B --> B3[Throughput]
    B --> B4[Resource Saturation]
    C --> C1[Conversion Rate]
    C --> C2[Revenue]
    C --> C3[Cart Abandonment]
    C --> C4[User Engagement]
    D --> D1[Page Load Time]
    D --> D2[Time to Interactive]
    D --> D3[User Journey Completion]
    D --> D4[Session Duration]
    E --> E1[Error Rate]
    E --> E2[Exception Count]
    E --> E3[5xx Status Codes]
    E --> E4[4xx Status Codes]

Effective Canary Analysis

When comparing canary and baseline metrics, consider these approaches:

Statistical Significance: Use statistical methods to determine if differences are meaningful
Multi-dimensional Analysis: Look at multiple metrics together, not in isolation
Time-series Comparison: Compare current canary metrics to historical patterns
Automated Decision Making: Define clear thresholds for automatic promotion or rollback
Trend Analysis: Monitor for degradation trends, not just point-in-time values
User Impact Weighting: Weight metrics by their direct impact on user experience
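
To make the statistical significance point concrete, one common choice is a two-proportion z-test on error counts. The sketch below assumes that test and a one-sided 95% cutoff; other tests (for example Mann-Whitney U on latency samples) may suit other metrics better. The function names and the sample counts are illustrative.

```javascript
// z-score for the difference between canary and baseline error rates,
// using the pooled two-proportion z-test.
function errorRateZScore(canary, baseline) {
  if (canary.requests <= 0 || baseline.requests <= 0) {
    throw new Error('Both groups need at least one request');
  }
  const canaryRate = canary.errors / canary.requests;
  const baselineRate = baseline.errors / baseline.requests;
  const pooled =
    (canary.errors + baseline.errors) / (canary.requests + baseline.requests);
  const stdErr = Math.sqrt(
    pooled * (1 - pooled) * (1 / canary.requests + 1 / baseline.requests)
  );
  if (stdErr === 0) return 0; // identical all-success or all-failure groups
  return (canaryRate - baselineRate) / stdErr;
}

// One-sided check: is the canary's error rate significantly higher?
// zCritical = 1.645 corresponds to roughly 95% confidence (an assumed cutoff).
function canaryIsWorse(canary, baseline, zCritical = 1.645) {
  return errorRateZScore(canary, baseline) > zCritical;
}

// 2.0% vs 1.0% errors: a large enough gap to flag at this sample size.
console.log(canaryIsWorse({ requests: 1000, errors: 20 }, { requests: 9000, errors: 90 })); // true
// 1.1% vs 1.0% errors: within the noise for 1,000 canary requests.
console.log(canaryIsWorse({ requests: 1000, errors: 11 }, { requests: 9000, errors: 90 })); // false
```

The second case shows why a fixed ratio threshold can mislead at low traffic: a 10% relative increase is indistinguishable from noise when the canary has only seen a thousand requests.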

Automating Canary Analysis

Several tools and platforms have emerged to automate the canary analysis process:

Kayenta: Open-source canary analysis service developed by Netflix and Google
Spinnaker: Continuous delivery platform with integrated canary analysis
Flagger: Progressive delivery operator for Kubernetes
Amazon CloudWatch Evidently: Feature launch and experimentation service that can expose a new variation to a controlled share of traffic
Custom metrics platforms: Prometheus, Datadog, New Relic with custom dashboards

Example: Prometheus Metrics for Canary Analysis


# Prometheus recording rules for canary vs. baseline comparison
groups:
- name: canary_analysis
  rules:
  - record: canary:request_latency_ratio
    expr: |
      (sum(rate(http_request_duration_seconds_sum{job="myapp",version="canary"}[5m])) /
       sum(rate(http_request_duration_seconds_count{job="myapp",version="canary"}[5m]))) /
      (sum(rate(http_request_duration_seconds_sum{job="myapp",version="baseline"}[5m])) /
       sum(rate(http_request_duration_seconds_count{job="myapp",version="baseline"}[5m])))
      
  - record: canary:error_rate_ratio
    expr: |
      (sum(rate(http_requests_total{job="myapp",version="canary",status=~"5.."}[5m])) /
       sum(rate(http_requests_total{job="myapp",version="canary"}[5m]))) /
      (sum(rate(http_requests_total{job="myapp",version="baseline",status=~"5.."}[5m])) /
       sum(rate(http_requests_total{job="myapp",version="baseline"}[5m])))
      
  - record: canary:throughput_ratio
    expr: |
      sum(rate(http_requests_total{job="myapp",version="canary"}[5m])) / 
      sum(rate(http_requests_total{job="myapp",version="baseline"}[5m]))

Example: Automated Canary Analysis Script


#!/bin/bash
# Simple automated canary analysis script

# Configuration
CANARY_NAMESPACE="prod"
CANARY_DEPLOYMENT="myapp-canary"
BASELINE_DEPLOYMENT="myapp-current"
PROMETHEUS_URL="http://prometheus:9090"
CANARY_TRAFFIC_PERCENTAGE=10
ERROR_THRESHOLD=1.1  # Allow 10% higher error rate
LATENCY_THRESHOLD=1.15  # Allow 15% higher latency
ANALYSIS_MINUTES=10

echo "Starting canary analysis for ${ANALYSIS_MINUTES} minutes..."

# Function to query Prometheus
# Prints the first sample value, or nothing when the query returns no series
query_prometheus() {
  local query="$1"
  curl -s -G "${PROMETHEUS_URL}/api/v1/query" --data-urlencode "query=${query}" | jq -r '.data.result[0].value[1] // empty'
}

# Start time for analysis
start_time=$(date +%s)
end_time=$((start_time + ANALYSIS_MINUTES * 60))

while [ $(date +%s) -lt $end_time ]; do
  # Get current metrics
  canary_error_rate=$(query_prometheus "sum(rate(http_server_requests_seconds_count{deployment=\"${CANARY_DEPLOYMENT}\",status=~\"5..\"}[1m]))/sum(rate(http_server_requests_seconds_count{deployment=\"${CANARY_DEPLOYMENT}\"}[1m]))")
  baseline_error_rate=$(query_prometheus "sum(rate(http_server_requests_seconds_count{deployment=\"${BASELINE_DEPLOYMENT}\",status=~\"5..\"}[1m]))/sum(rate(http_server_requests_seconds_count{deployment=\"${BASELINE_DEPLOYMENT}\"}[1m]))")
  
  canary_latency=$(query_prometheus "sum(rate(http_server_requests_seconds_sum{deployment=\"${CANARY_DEPLOYMENT}\"}[1m]))/sum(rate(http_server_requests_seconds_count{deployment=\"${CANARY_DEPLOYMENT}\"}[1m]))")
  baseline_latency=$(query_prometheus "sum(rate(http_server_requests_seconds_sum{deployment=\"${BASELINE_DEPLOYMENT}\"}[1m]))/sum(rate(http_server_requests_seconds_count{deployment=\"${BASELINE_DEPLOYMENT}\"}[1m]))")
  
  # Calculate ratios (avoid division by zero)
  if [ "$baseline_error_rate" = "0" ] || [ "$baseline_error_rate" = "" ]; then
    if [ "$canary_error_rate" = "0" ] || [ "$canary_error_rate" = "" ]; then
      error_ratio=1
    else
      error_ratio=999
    fi
  else
    error_ratio=$(echo "scale=2; $canary_error_rate / $baseline_error_rate" | bc)
  fi
  
  if [ "$baseline_latency" = "0" ] || [ "$baseline_latency" = "" ]; then
    latency_ratio=999
  else
    latency_ratio=$(echo "scale=2; $canary_latency / $baseline_latency" | bc)
  fi
  
  echo "Current metrics - Error ratio: ${error_ratio}, Latency ratio: ${latency_ratio}"
  
  # Check if thresholds are exceeded
  if (( $(echo "$error_ratio > $ERROR_THRESHOLD" | bc -l) )); then
    echo "Error rate threshold exceeded! Rolling back canary."
    # Insert rollback logic here
    exit 1
  fi
  
  if (( $(echo "$latency_ratio > $LATENCY_THRESHOLD" | bc -l) )); then
    echo "Latency threshold exceeded! Rolling back canary."
    # Insert rollback logic here
    exit 1
  fi
  
  # Wait before next check
  sleep 30
done

echo "Canary analysis completed successfully. Proceeding with promotion."
# Insert promotion logic here

exit 0

Real-World Example: E-commerce Canary Metrics

An e-commerce company implemented the following canary analysis strategy for their checkout service:

Primary Metrics:
- Checkout completion rate (compared to baseline)
- Average checkout time (should not increase by more than 5%)
- Payment processing error rate (zero tolerance for increases)
- Server-side exceptions (compared to baseline)
Secondary Metrics:
- Page load time for all checkout steps
- Resource utilization (CPU, memory, network)
- API response times for dependent services
- Client-side JavaScript errors
Analysis Approach:
- 30-minute initial analysis period
- Automatic rollback on any payment error rate increase
- Automatic rollback if checkout completion drops by more than 1%
- Manual review required if secondary metrics show significant deviation

This approach helped them detect a subtle payment processing issue that only occurred with a specific credit card type, affecting approximately 0.5% of transactions—an issue that would have been much more impactful in a full rollout.
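
The three rules under Analysis Approach can be expressed as a small decision function. This is a sketch under stated assumptions, not the company's implementation: the 1% completion drop is read as a relative change, "significant deviation" is assumed to mean more than 10%, and every name and metric value below is hypothetical.

```javascript
const COMPLETION_DROP_LIMIT = 0.01;     // from the text: more than 1% (assumed relative)
const SECONDARY_DEVIATION_LIMIT = 0.10; // assumption: "significant" means more than 10%

// Relative change of the canary value against a nonzero baseline value.
function relativeChange(canaryValue, baselineValue) {
  return (canaryValue - baselineValue) / baselineValue;
}

// Returns 'rollback', 'manual-review', or 'promote'.
function evaluateCheckoutCanary(canary, baseline) {
  // Zero tolerance: any increase in payment error rate triggers rollback.
  if (canary.paymentErrorRate > baseline.paymentErrorRate) {
    return 'rollback';
  }
  // Roll back if checkout completion drops by more than the limit.
  if (relativeChange(canary.completionRate, baseline.completionRate) < -COMPLETION_DROP_LIMIT) {
    return 'rollback';
  }
  // Secondary metrics never roll back automatically; they ask for a human.
  const secondaryDeviates = Object.keys(baseline.secondary).some((name) =>
    Math.abs(relativeChange(canary.secondary[name], baseline.secondary[name])) >
      SECONDARY_DEVIATION_LIMIT
  );
  return secondaryDeviates ? 'manual-review' : 'promote';
}

const baseline = { paymentErrorRate: 0.002, completionRate: 0.80, secondary: { pageLoadMs: 1200 } };

console.log(evaluateCheckoutCanary({ ...baseline, paymentErrorRate: 0.003 }, baseline)); // rollback
console.log(evaluateCheckoutCanary({ ...baseline, completionRate: 0.78 }, baseline));    // rollback
console.log(evaluateCheckoutCanary({ ...baseline, secondary: { pageLoadMs: 1500 } }, baseline)); // manual-review
console.log(evaluateCheckoutCanary(baseline, baseline)); // promote
```

Checking the payment rule first means that the strictest condition cannot be masked by the other metrics looking healthy.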

Advanced Canary Deployment Strategies

Traffic Segmentation Approaches

Beyond simple percentage-based traffic splitting, canary deployments can target specific segments:

Strategy	Description	Best For
Random Sampling	Direct a random percentage of users to the canary	General validation with statistically significant sample
Geographic Canary	Target users from specific regions or locations	Testing region-specific features, limiting impact to specific markets
User-Based Canary	Target specific user segments (e.g., beta users, internal employees)	Getting feedback from specific user groups before wider release
Attribute-Based Canary	Target based on user attributes (device, browser, etc.)	Testing compatibility with specific platforms or environments
Time-Based Canary	Deploy canary during specific time windows	Testing during lower traffic periods or with specific time-based patterns
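
The targeted strategies in the table can each be viewed as a predicate over request attributes. The sketch below shows that idea; the rule names, the request fields (`country`, `groups`, `device`, `hourUtc`), and the specific conditions are all hypothetical.

```javascript
// Each rule decides whether a request belongs to the canary segment.
const segmentRules = {
  geographic: (req) => req.country === 'CA',
  userBased: (req) => (req.groups || []).includes('beta'),
  attributeBased: (req) => req.device === 'android',
  timeBased: (req) => req.hourUtc >= 2 && req.hourUtc < 5,
};

// A request goes to the canary if any active strategy matches it.
function isCanaryRequest(req, activeStrategies) {
  return activeStrategies.some((name) => {
    const rule = segmentRules[name];
    if (!rule) {
      throw new Error(`Unknown strategy: ${name}`);
    }
    return rule(req);
  });
}

const request = { country: 'US', groups: ['beta'], device: 'ios', hourUtc: 14 };

console.log(isCanaryRequest(request, ['geographic']));              // false
console.log(isCanaryRequest(request, ['geographic', 'userBased'])); // true
```

Combining rules with `some` widens the canary audience; using `every` instead would narrow it to requests that match all active strategies.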

Example: Nginx Configuration for Geographic Canary


# Nginx configuration for geographic canary routing
http {
    # GeoIP module configuration
    geoip_country /etc/nginx/geoip/GeoIP.dat;
    
    # Upstream definitions
    upstream current_backend {
        server current-app-1:8080;
        server current-app-2:8080;
    }
    
    upstream canary_backend {
        server canary-app-1:8080;
        server canary-app-2:8080;
    }
    
    # Map the visitor's country to an upstream group.
    # Users from Canada get the canary version; all others get the current one.
    map $geoip_country_code $canary_upstream {
        default current_backend;
        CA      canary_backend;
    }

    server {
        listen 80;
        server_name example.com;

        # Route traffic based on country
        location / {
            # Common proxy settings
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Forwarded-Proto $scheme;

            # The upstream group is selected by the map above, which avoids
            # the pitfalls of using "if" inside a location block
            proxy_pass http://$canary_upstream;
        }
    }
}

Example: Feature Flag Implementation for User-Based Canary


// Using LaunchDarkly feature flag service for user-based canary
const LaunchDarkly = require('launchdarkly-node-server-sdk');

// Initialize the LaunchDarkly client
const ldClient = LaunchDarkly.init('YOUR_SDK_KEY');

async function routeRequest(req, res, next) {
  // Wait for the client to be ready
  await ldClient.waitForInitialization();
  
  // Create a user context from the request
  const user = {
    key: req.session.userId || 'anonymous',
    email: req.session.userEmail,
    custom: {
      groups: req.session.userGroups || [],
      userType: req.session.userType || 'regular',
      region: req.headers['x-user-region']
    }
  };
  
  // Check if this user should see the canary version
  const showCanary = await ldClient.variation('show-canary-version', user, false);
  
  if (showCanary) {
    // Route to canary service
    req.serviceVersion = 'canary';
  } else {
    // Route to current service
    req.serviceVersion = 'current';
  }
  
  next();
}

app.use(routeRequest);

Progressive Canary Rollout Patterns

Canary Rollout Best Practices

Start Small: Begin with a very small percentage (1-5%) to limit potential impact
Bake Time: Allow sufficient time (30+ minutes) at each stage to observe behavior
Incremental Steps: Use smaller increments for critical systems (5%, 10%, 25%, 50%, 75%, 100%)
Automatic Rollback: Define clear thresholds for automatic rollback
Session Affinity: Ensure users get a consistent experience during the rollout
Regional Progression: Consider rolling out region by region rather than globally
Working Hours: Schedule initial canary phases during business hours when support staff is available
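
Bake time and step count together set a floor on how long a rollout takes. The sketch below works through that arithmetic for the example increments and a 30-minute bake; `planRollout` is a hypothetical helper, and it assumes every stage bakes for the same duration.

```javascript
// Build a timeline: each stage starts once the previous one has baked.
function planRollout(steps, bakeMinutes) {
  steps.forEach((weight, index) => {
    if (index > 0 && weight <= steps[index - 1]) {
      throw new Error('Steps must be strictly increasing');
    }
  });
  if (steps[steps.length - 1] !== 100) {
    throw new Error('The final step must be 100');
  }
  return steps.map((weight, index) => ({
    weight,
    startsAtMinute: index * bakeMinutes,
  }));
}

const plan = planRollout([5, 10, 25, 50, 75, 100], 30);
console.log(plan[plan.length - 1]); // { weight: 100, startsAtMinute: 150 }
```

With six steps and a 30-minute bake, full traffic is reached no earlier than 150 minutes after the first canary stage begins, which is worth weighing against the preference for starting during business hours.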

Canary Implementation on Different Platforms

Kubernetes-Based Canary

Kubernetes has no built-in canary strategy, but tools in its ecosystem such as Flagger and Argo Rollouts provide one:

Example: Canary with Kubernetes and Flagger


# Flagger Canary Custom Resource
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: my-app
  namespace: prod
spec:
  # Deployment reference
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  # Service mesh provider
  provider: istio
  # Service mesh gateway (optional)
  service:
    port: 80
    targetPort: 8080
    gateways:
    - public-gateway.istio-system.svc.cluster.local
  # Canary analysis configuration
  analysis:
    # Schedule interval
    interval: 1m
    # Max number of failed checks before rollback
    threshold: 5
    # Max traffic percentage routed to canary
    maxWeight: 50
    # Canary increment step percentage
    stepWeight: 5
    # Prometheus metrics
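    metrics:
    - name: request-success-rate
      # Minimum request success rate (percent)
      thresholdRange:
        min: 99
      interval: 1m
    - name: request-duration
      # Maximum request duration (milliseconds)
      thresholdRange:
        max: 500
      interval: 1m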
    # Testing (optional)
    webhooks:
    - name: load-test
      url: http://flagger-loadtester.test/
      timeout: 5s
      metadata:
        cmd: "hey -z 1m -q 10 -c 2 http://my-app.prod/"

Example: Canary with Argo Rollouts


# Argo Rollouts Canary Configuration
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-app
spec:
  replicas: 5
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-app
        image: my-app:v2
        ports:
        - containerPort: 8080
  strategy:
    canary:
      # Rollout steps: shift traffic, then pause to observe
      steps:
      - setWeight: 5
      - pause: {duration: 10m}
      - setWeight: 20
      - pause: {duration: 10m}
      - setWeight: 50
      - pause: {duration: 10m}
      - setWeight: 80
      - pause: {duration: 10m}
      # Add analysis
      analysis:
        templates:
        - templateName: success-rate
        args:
        - name: service-name
          value: my-app-preview
        startingStep: 2 # which step to start in-depth analysis
        
      # Traffic management (optional)
      trafficRouting:
        istio:
          virtualService:
            name: my-app-vsvc # existing virtual service
            routes:
            - primary # route selector
          destinationRule:
            name: my-app-destrule # existing destination rule
            canarySubsetName: canary
            stableSubsetName: stable

Cloud Provider Implementations

Example: AWS App Mesh Canary


# AWS App Mesh Virtual Router with Canary
{
  "virtualRouter": {
    "name": "my-app-virtual-router",
    "spec": {
      "listeners": [
        {
          "portMapping": {
            "port": 80,
            "protocol": "http"
          }
        }
      ],
      "routes": [
        {
          "name": "my-app-route",
          "spec": {
            "httpRoute": {
              "match": {
                "prefix": "/"
              },
              "action": {
                "weightedTargets": [
                  {
                    "virtualNode": "my-app-current",
                    "weight": 90
                  },
                  {
                    "virtualNode": "my-app-canary",
                    "weight": 10
                  }
                ]
              }
            }
          }
        }
      ]
    }
  }
}

Example: GCP Traffic Splitting


# Google Cloud Traffic Director Configuration (Terraform)
# The instance group managers and the health check referenced below are
# assumed to be defined elsewhere in the configuration.
resource "google_compute_url_map" "url_map" {
  name        = "traffic-director-map"
  description = "Traffic Director with canary support"

  default_service = google_compute_backend_service.current.id

  host_rule {
    hosts        = ["*"]
    path_matcher = "allpaths"
  }

  path_matcher {
    name            = "allpaths"
    default_service = google_compute_backend_service.current.id

    route_rules {
      priority = 1

      match_rules {
        prefix_match = "/"
      }

      # Traffic splitting: weights are set per backend service
      route_action {
        weighted_backend_services {
          backend_service = google_compute_backend_service.current.id
          weight          = 90
        }
        weighted_backend_services {
          backend_service = google_compute_backend_service.canary.id
          weight          = 10
        }
      }
    }
  }
}

resource "google_compute_backend_service" "current" {
  name                  = "current-backend"
  port_name             = "http"
  protocol              = "HTTP"
  timeout_sec           = 10
  load_balancing_scheme = "INTERNAL_SELF_MANAGED"
  health_checks         = [google_compute_health_check.default.id]

  backend {
    group = google_compute_instance_group_manager.current.instance_group
  }
}

resource "google_compute_backend_service" "canary" {
  name                  = "canary-backend"
  port_name             = "http"
  protocol              = "HTTP"
  timeout_sec           = 10
  load_balancing_scheme = "INTERNAL_SELF_MANAGED"
  health_checks         = [google_compute_health_check.default.id]

  backend {
    group = google_compute_instance_group_manager.canary.instance_group
  }
}

Serverless Canary Implementations

Example: AWS Lambda Canary Deployment


# AWS SAM Template for Canary Deployment
AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31
Resources:
  MyFunction:
    Type: AWS::Serverless::Function
    Properties:
      CodeUri: ./code
      Handler: index.handler
      Runtime: nodejs16.x
      AutoPublishAlias: live
      DeploymentPreference:
        Type: Canary10Percent5Minutes
        Alarms:
          # Alarms that should trigger a rollback if they breach
          - !Ref AliasErrorMetricGreaterThanZeroAlarm
          - !Ref LatencyAlarm

  # CloudWatch Alarms for monitoring
  AliasErrorMetricGreaterThanZeroAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmDescription: Alias Error Alarm
      ComparisonOperator: GreaterThanThreshold
      Dimensions:
        - Name: Resource
          Value: !Sub "${MyFunction}:live"
        - Name: FunctionName
          Value: !Ref MyFunction
      EvaluationPeriods: 2
      MetricName: Errors
      Namespace: AWS/Lambda
      Period: 60
      Statistic: Sum
      Threshold: 0

  LatencyAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmDescription: Lambda P90 Latency Alarm
      ComparisonOperator: GreaterThanThreshold
      Dimensions:
        - Name: Resource
          Value: !Sub "${MyFunction}:live"
        - Name: FunctionName
          Value: !Ref MyFunction
      EvaluationPeriods: 2
      MetricName: Duration
      Namespace: AWS/Lambda
      Period: 60
      ExtendedStatistic: p90
      Threshold: 1000 # 1 second

Example: Azure Function App Slots for Canary


# Azure ARM Template for Function App Slots with Traffic Routing
{
  "$schema": "https://schema.management.azure.com/schemas/2019-04-01/deploymentTemplate.json#",
  "contentVersion": "1.0.0.0",
  "parameters": {
    "functionAppName": {
      "type": "string"
    },
    "canaryPercentage": {
      "type": "int",
      "defaultValue": 10
    }
  },
  "resources": [
    {
      "type": "Microsoft.Web/sites",
      "apiVersion": "2021-02-01",
      "name": "[parameters('functionAppName')]",
      "location": "[resourceGroup().location]",
      "kind": "functionapp",
      "properties": {
        "serverFarmId": "[resourceId('Microsoft.Web/serverfarms', 'appServicePlan')]",
        "siteConfig": {
          "appSettings": [
            {
              "name": "FUNCTIONS_WORKER_RUNTIME",
              "value": "node"
            },
            {
              "name": "FUNCTIONS_EXTENSION_VERSION",
              "value": "~4"
            }
          ]
        }
      }
    },
    {
      "type": "Microsoft.Web/sites/slots",
      "apiVersion": "2021-02-01",
      "name": "[concat(parameters('functionAppName'), '/canary')]",
      "location": "[resourceGroup().location]",
      "dependsOn": [
        "[resourceId('Microsoft.Web/sites', parameters('functionAppName'))]"
      ],
      "kind": "functionapp",
      "properties": {
        "serverFarmId": "[resourceId('Microsoft.Web/serverfarms', 'appServicePlan')]",
        "siteConfig": {
          "appSettings": [
            {
              "name": "FUNCTIONS_WORKER_RUNTIME",
              "value": "node"
            },
            {
              "name": "FUNCTIONS_EXTENSION_VERSION",
              "value": "~4"
            }
          ]
        }
      }
    },
    {
      "type": "Microsoft.Web/sites/config",
      "apiVersion": "2021-02-01",
      "name": "[concat(parameters('functionAppName'), '/slotConfigNames')]",
      "dependsOn": [
        "[resourceId('Microsoft.Web/sites', parameters('functionAppName'))]"
      ],
      "properties": {
        "appSettingNames": [
          "WEBSITE_RUN_FROM_PACKAGE"
        ]
      }
    },
    {
      "type": "Microsoft.Web/sites/config",
      "apiVersion": "2021-02-01",
      "name": "[concat(parameters('functionAppName'), '/web')]",
      "dependsOn": [
        "[resourceId('Microsoft.Web/sites', parameters('functionAppName'))]",
        "[resourceId('Microsoft.Web/sites/slots', parameters('functionAppName'), 'canary')]"
      ],
      "properties": {
        "experiments": {
          "rampUpRules": [
            {
              "name": "canary",
              "actionHostName": "[concat(parameters('functionAppName'), '-canary.azurewebsites.net')]",
              "reroutePercentage": "[parameters('canaryPercentage')]"
            }
          ]
        }
      }
    }
  ]
}

Database Considerations for Canary Deployments

As with blue-green deployments, database schema changes present challenges for canary deployments, but with additional complexity due to the prolonged coexistence of versions.

Database Migration Approaches for Canary

graph TD
    A[Database Migration Approaches] --> B[Expand-Contract Pattern]
    A --> C[Read/Write Compatibility]
    A --> D[Versioned APIs]
    A --> E[Separate Schemas]
    B --> B1[Expand: Add New Columns/Tables]
    B1 --> B2[Dual-Write Period]
    B2 --> B3[Contract: Remove Old Columns]
    C --> C1[Old Code Reads New Schema]
    C --> C2[New Code Reads Old Schema]
    D --> D1[API Versioning]
    D1 --> D2[Schema Version Mapping]
    E --> E1[Separate Database Instances]
    E1 --> E2[Data Synchronization]

Example: Expand-Contract Database Migration


-- Phase 1: Expand (Deploy before canary)
-- Add new columns but keep old ones
ALTER TABLE users ADD COLUMN phone_number VARCHAR(20);

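-- Phase 2: Dual-Write (During canary)
-- Application logic in both versions keeps the data consistent:
--   Old version reads and writes only the old field
--   New version (canary) writes to both fields and reads the new field,
--   falling back to the old field until the backfill has completed
-- Note: writes made by the old version touch only the old field, so the
-- backfill job (or a database trigger) must keep re-syncing changed rows
-- for as long as the old version receives traffic
-- Example dual-write code in the application (JavaScript):

// In the new version (canary)
// updateUserInDatabase, fetchUsersWithEmptyPhoneNumber, and updateUser are
// assumed to be provided by the application's data access layer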
function updateUserContact(userId, phoneNumber) {
  const updates = {
    // Write to both fields for backward compatibility
    old_phone_field: phoneNumber,
    phone_number: phoneNumber
  };
  
  updateUserInDatabase(userId, updates);
}

// Data migration job (runs during canary to populate new fields)
function migratePhoneNumbers() {
  const users = fetchUsersWithEmptyPhoneNumber();
  
  for (const user of users) {
    if (user.old_phone_field && !user.phone_number) {
      updateUser(user.id, {
        phone_number: user.old_phone_field
      });
    }
  }
}

-- Phase 3: Contract (After successful canary)
-- Remove old columns once all instances are on new version
-- ALTER TABLE users DROP COLUMN old_phone_field;

Example: Feature Flag for Schema Version Compatibility


// Feature flag approach for schema version handling
const SCHEMA_VERSION = process.env.SCHEMA_VERSION || 'v1';

function getUserData(userId) {
  const user = fetchUserFromDatabase(userId);
  
  // Transform data based on schema version
  if (SCHEMA_VERSION === 'v1') {
    // Original schema format
    return {
      id: user.id,
      name: user.name,
      email: user.email,
      phone: user.old_phone_field,
      // Transform new fields to old format if needed
      address: user.address ? user.address : 
        `${user.address_street || ''}, ${user.address_city || ''}`
    };
  } else {
    // New schema format
    return {
      id: user.id,
      name: user.name,
      email: user.email,
      phone_number: user.phone_number || user.old_phone_field,
      address_street: user.address_street || 
        (user.address ? parseStreetFromAddress(user.address) : ''),
      address_city: user.address_city || 
        (user.address ? parseCityFromAddress(user.address) : ''),
      address_zip: user.address_zip || 
        (user.address ? parseZipFromAddress(user.address) : '')
    };
  }
}

Database Best Practices for Canary Deployments

Backward Compatibility First: Ensure new schema versions keep everything the old application version relies on (existing tables, columns, and constraints)
Forward Compatibility Second: Ensure old application versions tolerate new schema elements, for example by ignoring unknown columns instead of failing
Avoid Breaking Changes: Never make schema changes that break existing code during the canary period
Incremental Changes: Split large schema changes into multiple small, compatible changes
Automated Testing: Test both old and new application versions against each schema version
Database Metrics: Monitor database performance metrics during canary for any degradation
Replication Lag: Watch for increased replication lag in distributed database systems
Rollback Plans: Have clear database rollback plans for failed canaries
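
The "Automated Testing" practice above can be sketched as a small compatibility matrix that runs every application version against every schema phase. This is a minimal in-memory sketch, not a full test suite: the fixture names, the row shapes, and the two read functions are assumptions based on the phone-number example earlier in this section.

```javascript
// Hypothetical row shapes for each phase of the expand-contract migration.
const schemaFixtures = {
  beforeExpand: { id: 1, old_phone_field: "555-0100" },
  afterExpand: { id: 1, old_phone_field: "555-0100", phone_number: null },
  afterBackfill: { id: 1, old_phone_field: "555-0100", phone_number: "555-0100" },
};

// Hypothetical read paths of each application version.
const appVersions = {
  v1: (row) => row.old_phone_field,
  v2: (row) => row.phone_number || row.old_phone_field,
};

// Run every application version against every schema phase and collect failures.
function checkCompatibility(fixtures, versions, expectedPhone) {
  const failures = [];
  for (const [schemaName, row] of Object.entries(fixtures)) {
    for (const [versionName, readPhone] of Object.entries(versions)) {
      let actual;
      try {
        actual = readPhone(row);
      } catch (err) {
        actual = `threw: ${err.message}`;
      }
      if (actual !== expectedPhone) {
        failures.push({ schemaName, versionName, actual });
      }
    }
  }
  return failures;
}

const failures = checkCompatibility(schemaFixtures, appVersions, "555-0100");
console.log(failures.length === 0 ? "all combinations compatible" : failures);
```

In a real pipeline the fixtures would typically be replaced by queries against databases migrated to each schema version, with the same matrix structure.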

Real-World Example: Database Migration for a SaaS Platform

A SaaS company implemented a complex database schema change during a canary deployment:

Preparation Phase:
- Added new tables and columns without modifying existing ones
- Created database views that presented new schema structure while using old tables
- Implemented database triggers to maintain data consistency
Migration Phase:
- Deployed canary version that used new schema elements
- Ran background jobs to populate new columns from existing data
- Monitored database performance during dual-write period
Verification Phase:
- Verified data consistency between old and new schema elements
- Gradually increased canary traffic while monitoring database metrics
- Had automation to quickly revert schema changes if issues appeared
Completion Phase:
- Reached 100% traffic on new version
- Maintained backward compatibility for 2 weeks to ensure stability
- Finally removed deprecated schema elements after verification period

This gradual approach allowed them to safely restructure their core customer data model with zero downtime and no data loss.
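
The verification step described above (checking consistency between old and new schema elements) might look like the following sketch. It works on plain row objects as a query could return them; the column names reuse the earlier phone-number example, and the report shape is an assumption for illustration.

```javascript
// Compare the legacy column with the new column and classify discrepancies.
function findInconsistentRows(rows) {
  const report = { missingInNew: [], mismatched: [] };
  for (const row of rows) {
    const oldValue = row.old_phone_field ?? null;
    const newValue = row.phone_number ?? null;
    if (oldValue !== null && newValue === null) {
      // The backfill has not reached this row yet.
      report.missingInNew.push(row.id);
    } else if (oldValue !== null && newValue !== oldValue) {
      // For example, the old version wrote to the row after the backfill ran.
      report.mismatched.push(row.id);
    }
  }
  return report;
}

const sampleRows = [
  { id: 1, old_phone_field: "555-0100", phone_number: "555-0100" },
  { id: 2, old_phone_field: "555-0101", phone_number: null },
  { id: 3, old_phone_field: "555-0199", phone_number: "555-0102" },
  { id: 4, old_phone_field: null, phone_number: null },
];

// Row 2 is reported as missing in the new column, row 3 as mismatched.
console.log(findInconsistentRows(sampleRows));
```

A check like this could gate each increase in canary traffic: a non-empty report is a reason to hold the rollout until the backfill or sync job catches up.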

Learning Activities

Activity 1: Design a Canary Deployment Strategy

Design a canary deployment strategy for a typical web application with the following characteristics:

E-commerce platform with user authentication, product catalog, and checkout process
Approximately 100,000 daily active users
Global user base with concentrations in North America, Europe, and Asia
Microservices architecture with 8 primary services
PostgreSQL database for transactional data
Redis for session management and caching

Your design should include:

Traffic routing approach and technology
Canary percentage progression strategy
Key metrics to monitor for each service
Rollback criteria and process
Database compatibility approach
Session handling strategy

Activity 2: Implement a Simple Canary Deployment

Using a technology stack of your choice (Kubernetes, AWS, Azure, etc.), implement a basic canary deployment setup:

Create a simple web application with a version indicator
Set up infrastructure for current and canary versions
Configure traffic splitting with a 90/10 distribution
Implement basic health and performance monitoring
Create a script to gradually increase canary traffic
Demonstrate a rollback scenario

Activity 3: Canary Analysis Dashboard

Design a monitoring dashboard for canary analysis that includes:

Side-by-side comparison of key metrics between current and canary versions
Historical trend visualization for important performance indicators
Statistical analysis of metric differences with confidence intervals
Alerting thresholds for automatic rollbacks
Traffic volume and distribution visualization
Business impact metrics (conversion rates, user engagement, etc.)

Canary Deployment Case Studies

Case Study 1: Social Media Platform

Feed Algorithm Update

Challenge: A social media platform needed to deploy a significant update to their feed ranking algorithm that could potentially affect user engagement metrics.

Solution:

User Segmentation: Selected a representative 2% of users across different demographics
Traffic Approach: Used feature flagging to direct targeted users to the new algorithm
Metrics: Monitored time spent in app, content interaction rate, ad engagement
Progressive Rollout: 2% → 5% → 10% → 25% → 50% → 100% over two weeks
Analysis: Used statistical analysis to compare metrics to both control group and historical baselines

Results:

Detected a 3% drop in video engagement at the 5% stage
Paused rollout to investigate and fix the issue
After refinements, resumed rollout with improved metrics
Final implementation resulted in 7% increase in overall engagement
Avoided potential platform-wide user experience degradation
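
One common way to decide whether a drop like the one in this case study is real or just noise is a two-proportion z-test between the control group and the canary group. The sketch below is a generic illustration, not the platform's actual analysis; the counts are hypothetical.

```javascript
// Two-proportion z-test: does the canary's rate differ from the control's?
function twoProportionZ(controlSuccesses, controlTotal, canarySuccesses, canaryTotal) {
  const pControl = controlSuccesses / controlTotal;
  const pCanary = canarySuccesses / canaryTotal;
  const pooled = (controlSuccesses + canarySuccesses) / (controlTotal + canaryTotal);
  const standardError = Math.sqrt(pooled * (1 - pooled) * (1 / controlTotal + 1 / canaryTotal));
  if (standardError === 0) return 0; // both groups are all-success or all-failure
  return (pCanary - pControl) / standardError;
}

// |z| > 1.96 corresponds to a two-sided test at the 5% significance level.
function isSignificant(z, criticalValue = 1.96) {
  return Math.abs(z) > criticalValue;
}

// Hypothetical counts: sessions with at least one video interaction.
const z = twoProportionZ(50000, 100000, 2350, 5000);
console.log({ z: z.toFixed(2), significant: isSignificant(z) });
```

A negative z means the canary's rate is lower than the control's. Checking the same metric at several rollout stages increases the chance of a false alarm, so teams often tighten the threshold or use sequential testing methods.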

Case Study 2: Payment Processing Service

Critical Infrastructure Update

Challenge: A payment processing company needed to update their core transaction processing service with enhanced security features and performance optimizations.

Solution:

Geographic Approach: Started with single, low-volume region (Australia)
Traffic Approach: Service mesh (Istio) for precise traffic control
Canary Services: Deployed canary instances at 20% capacity in target region
Monitoring: Critical focus on transaction success rate, authorization times, fraud detection accuracy
Zero-tolerance Metrics: Automatic rollback for any transaction failures
Regional Progression: Region-by-region rollout after successful initial canary

Results:

Identified subtle performance issue with specific card types during initial canary
Fixed issue before expanding to higher-volume regions
Successfully deployed to all regions over 10 days with zero downtime
Maintained 100% transaction processing availability throughout
Achieved 15% reduction in average transaction processing time
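
The zero-tolerance rule from this case study can be expressed as a small rollback evaluator that runs alongside ordinary threshold checks. This is a sketch under assumptions: the metric names, the 10% margin, and the rule format are hypothetical and not taken from the company's setup.

```javascript
// Hypothetical rules: zero-tolerance metrics trigger rollback on any occurrence;
// other metrics are compared with the current version using a relative margin.
const rules = [
  { metric: "failedTransactions", zeroTolerance: true },
  { metric: "p95AuthorizationMs", maxRelativeIncrease: 0.10 },
];

function evaluateCanary(canary, baseline, ruleSet) {
  const violations = [];
  for (const rule of ruleSet) {
    const observed = canary[rule.metric];
    if (typeof observed !== "number") {
      // Fail closed: a missing metric is treated as a violation.
      violations.push(`${rule.metric}: no data`);
    } else if (rule.zeroTolerance) {
      if (observed > 0) violations.push(`${rule.metric}: ${observed} (zero tolerance)`);
    } else {
      const limit = baseline[rule.metric] * (1 + rule.maxRelativeIncrease);
      if (observed > limit) violations.push(`${rule.metric}: ${observed} exceeds ${limit.toFixed(1)}`);
    }
  }
  return { action: violations.length > 0 ? "rollback" : "continue", violations };
}

const decision = evaluateCanary(
  { failedTransactions: 1, p95AuthorizationMs: 180 },
  { failedTransactions: 0, p95AuthorizationMs: 175 },
  rules
);
console.log(decision.action, decision.violations);
```

Treating missing data as a violation is a deliberate choice here: for a payment system, a canary that cannot be observed is arguably as risky as one that is failing.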

Case Study 3: Government Tax Filing System

Public Service Application

Challenge: A government tax agency needed to deploy significant changes to their online tax filing system during tax season without disrupting citizens in the process of filing their taxes.

Solution:

User Selection Strategy: Started with internal employees and opt-in beta users
Traffic Approach: Cookie-based routing with session stickiness
Progressive Approach: Added small batches of public users over 3 weeks
In-progress Protection: Users with partially completed forms kept on original version
Full Session Recording: Comprehensive session recording for canary users to identify any usability issues
Support Integration: Direct helpdesk routing for any canary users needing assistance

Results:

Discovered critical issue with specific tax form calculations during internal testing
Revised and retested without exposing to general public
Completed full deployment with no service disruption
Maintained 99.98% form submission success rate
Improved overall system usability score by 18%
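
The routing rules in this case study (cookie-based stickiness plus protection for in-progress forms) can be combined into a single decision function. The sketch below is illustrative only: the cookie name `release-version`, the request shape, and the hash-based bucketing are assumptions, not details from the agency's system.

```javascript
// FNV-1a style 32-bit hash over UTF-16 code units: gives each user a stable
// bucket without storing any assignment state on the server.
function fnv1a(text) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0;
}

function chooseVersion(request, canaryPercent) {
  // Session stickiness: honor an earlier assignment stored in the cookie.
  const pinned = request.cookies["release-version"];
  if (pinned === "current" || pinned === "canary") return pinned;
  // In-progress protection: users with a partially completed form stay put.
  if (request.hasInProgressForm) return "current";
  // New assignment: bucket 0..99 derived from the user ID.
  const bucket = fnv1a(request.userId) % 100;
  return bucket < canaryPercent ? "canary" : "current";
}

const request = { userId: "user-42", cookies: {}, hasInProgressForm: false };
const version = chooseVersion(request, 10);
// The caller would then set the "release-version" cookie to this value.
console.log(version);
```

Because the bucket depends only on the user ID, raising `canaryPercent` moves additional users to the canary without reshuffling users who were already assigned.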

Common Pitfalls and How to Avoid Them

Pitfall	Symptoms	Prevention Strategies
Insufficient Monitoring	Late detection of issues, unclear impact assessment	Implement comprehensive monitoring before canary deployment; set up side-by-side dashboards for canary and current versions; include business metrics alongside technical metrics
Poor User Segmentation	Unrepresentative test group, skewed metrics	Ensure canary traffic includes diverse user segments; use random selection rather than specific cohorts if possible; validate that canary users match overall demographics
Inadequate Bake Time	Missing slow-developing issues, premature progression	Allow sufficient time at each stage (at least 30 minutes); include at least one full business cycle at higher stages; consider longer bake times for critical systems
Session Inconsistency	User confusion, disrupted workflows, session losses	Implement proper session affinity/stickiness; ensure users consistently see the same version throughout a session; consider completing in-progress workflows on the original version
Manual Progression	Human error, inconsistent timing, forgotten stages	Automate the canary progression process; implement clear, metric-based promotion criteria; create audit logs of all canary progression decisions
Ignoring Minor Issues	Cumulative problems, escalating issues at higher percentages	Treat every anomaly as significant at low percentages; address issues immediately rather than continuing the rollout; use statistical significance testing rather than gut feeling
Conflicting Changes	Interference between concurrent canaries, ambiguous metrics	Avoid multiple simultaneous canaries if possible; implement clear isolation between different canary tests; use multivariate analysis when multiple changes are necessary
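
Several of the prevention strategies above (minimum bake time, metric-based promotion, audit logs) can be combined in one small progression function. This is a minimal sketch: the stage percentages and bake times are placeholders rather than recommendations, and the health signal is assumed to come from a separate metrics check.

```javascript
// Placeholder stages; real values depend on traffic volume and risk tolerance.
const stages = [
  { percent: 5, minBakeMinutes: 30 },
  { percent: 20, minBakeMinutes: 60 },
  { percent: 50, minBakeMinutes: 240 },
  { percent: 100, minBakeMinutes: 0 },
];

const auditLog = [];

// Decide the next step for the stage at stageIndex and record the reasoning inputs.
function decideProgression(stageIndex, minutesAtStage, metricsHealthy, now = new Date()) {
  const stage = stages[stageIndex];
  let decision;
  if (!metricsHealthy) {
    decision = { action: "rollback", percent: 0 };
  } else if (minutesAtStage < stage.minBakeMinutes) {
    decision = { action: "hold", percent: stage.percent };
  } else if (stageIndex === stages.length - 1) {
    decision = { action: "complete", percent: stage.percent };
  } else {
    decision = { action: "promote", percent: stages[stageIndex + 1].percent };
  }
  auditLog.push({
    at: now.toISOString(),
    fromPercent: stage.percent,
    minutesAtStage,
    metricsHealthy,
    ...decision,
  });
  return decision;
}

console.log(decideProgression(0, 10, true).action);  // hold: bake time not reached
console.log(decideProgression(0, 45, true).action);  // promote
console.log(decideProgression(1, 90, false).action); // rollback
```

Keeping the decision logic in one pure function makes it straightforward to run from a scheduler and to review past decisions through the audit log.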

Canary Deployment Anti-Patterns

Too Small Canary: Using such a small percentage that metrics aren't statistically significant
Too Large Initial Canary: Starting with too high a percentage, defeating the risk mitigation purpose
Skipping Stages: Jumping from a small percentage to 100% without intermediate validation
Metrics Mismatch: Monitoring different metrics in canary and production environments
Canary in the Wrong Mine: Testing in environments that don't match production traffic patterns
Moving Goalposts: Changing success criteria during the canary process
Ignoring Slower Users: Only analyzing fast-returning metrics, missing long-term effects
Alert Fatigue: Setting too many alerts, causing important signals to be ignored
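
For the "Too Small Canary" anti-pattern, a rough sample-size estimate helps judge whether a given percentage can produce meaningful metrics at all. The sketch below uses the standard normal-approximation formula for comparing two proportions and assumes two groups of equal size; the error rates and traffic figures are hypothetical.

```javascript
// Approximate users needed per group to detect a change from rate p1 to rate p2
// with a two-sided test. Defaults: 5% significance (z = 1.96), 80% power (z = 0.8416).
function requiredSampleSize(p1, p2, zAlpha = 1.96, zBeta = 0.8416) {
  if (p1 === p2) throw new Error("p1 and p2 must differ");
  const variance = p1 * (1 - p1) + p2 * (1 - p2);
  return Math.ceil(((zAlpha + zBeta) ** 2 * variance) / (p1 - p2) ** 2);
}

// Days until a canary slice of the given size has seen enough users.
function daysToReach(sampleSize, dailyUsers, canaryPercent) {
  return sampleSize / (dailyUsers * (canaryPercent / 100));
}

// Hypothetical goal: detect an error rate rising from 1% to 1.5%.
const needed = requiredSampleSize(0.01, 0.015);
console.log(needed, daysToReach(needed, 100000, 1).toFixed(1));
```

With these inputs, a 1% canary on 100,000 daily users would need roughly a week to accumulate enough users, which suggests either a larger slice or a longer bake time for changes of that size.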

Canary Deployments vs. A/B Testing

Canary deployments and A/B testing are often confused, as both involve directing different users to different versions of an application. However, they serve distinct purposes and have different implementation approaches.

Canary Deployment

Primary Purpose: Risk mitigation for new deployments
Success Criteria: No regression in metrics, system stability
Duration: Temporary until full rollout or rollback
User Selection: Usually random sampling or geographic
End Goal: 100% of users on new version if successful
Metric Focus: Error rates, performance metrics, system health
Ownership: DevOps or platform teams

A/B Testing

Primary Purpose: Feature optimization, user experience research
Success Criteria: Improvement in business metrics
Duration: Fixed test period for statistical significance
User Selection: Often targeted based on user segments
End Goal: Choose winning variant based on data
Metric Focus: Conversion rates, engagement, business KPIs
Ownership: Product or growth teams

Example: A/B Testing Implementation vs. Canary


        // A/B Testing with Split.io (Node.js SDK)
        const { SplitFactory } = require('@splitsoftware/splitio');
        
        const splitClient = SplitFactory({
          core: {
            authorizationKey: 'YOUR_SPLIT_API_KEY'
          }
        }).client();
        
        async function getUserExperience(userId, userAttributes) {
          // Wait for SDK to be ready
          await splitClient.ready();
          
          // Get treatment for this specific user (A or B);
          // attributes are passed as a flat object and used by targeting rules
          const treatment = splitClient.getTreatment(
            userId, 
            'new-checkout-experience',
            {
              userType: userAttributes.userType,
              country: userAttributes.country,
              deviceType: userAttributes.deviceType
            }
          );
          
          // Route to appropriate experience based on assigned treatment
          if (treatment === 'on') {
            // Show new version (B)
            return 'new-experience';
          } else {
            // Show control version (A); also reached when the SDK returns 'control'
            return 'current-experience';
          }
        }
        
        // Record conversion events for A/B test analysis
        // Arguments: key, traffic type, event type, value
        // ('user' and 'checkout_conversion' are example names)
        function recordConversion(userId, value) {
          splitClient.track(userId, 'user', 'checkout_conversion', value);
        }

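Compared to Canary Deployment (using AWS App Mesh). The JSON below is a simplified, combined view for illustration; in the App Mesh API, the virtual service, virtual router, and route are created as separate resources:

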
        {
          "virtualService": {
            "name": "checkout.example.com",
            "spec": {
              "provider": {
                "virtualRouter": {
                  "virtualRouterName": "checkout-router"
                }
              }
            }
          },
          "virtualRouter": {
            "name": "checkout-router",
            "spec": {
              "listeners": [
                {
                  "portMapping": {
                    "port": 80,
                    "protocol": "http"
                  }
                }
              ],
              "routes": [
                {
                  "name": "canary-route",
                  "spec": {
                    "httpRoute": {
                      "action": {
                        "weightedTargets": [
                          {
                            "virtualNode": "checkout-current",
                            "weight": 95
                          },
                          {
                            "virtualNode": "checkout-canary",
                            "weight": 5
                          }
                        ]
                      },
                      "match": {
                        "prefix": "/"
                      }
                    },
                    "priority": 10
                  }
                }
              ]
            }
          }
        }
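
The contrast with the A/B testing code is that a canary split does not look at who the user is; it only sets weights. A small helper that produces the `weightedTargets` array for a given canary percentage might look like this sketch. The virtual node names match the configuration above, while the helper itself and the stage list are illustrative assumptions.

```javascript
// Build the weightedTargets array for a given canary percentage.
function weightedTargets(canaryPercent) {
  if (!Number.isInteger(canaryPercent) || canaryPercent < 0 || canaryPercent > 100) {
    throw new RangeError("canaryPercent must be an integer between 0 and 100");
  }
  return [
    { virtualNode: "checkout-current", weight: 100 - canaryPercent },
    { virtualNode: "checkout-canary", weight: canaryPercent },
  ];
}

// Print the targets for a hypothetical progression.
for (const percent of [5, 20, 50, 100]) {
  console.log(percent, JSON.stringify(weightedTargets(percent)));
}
```

Each output could then be written into the route specification by whatever tooling applies the mesh configuration.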

Combining Canary and A/B Testing: Advanced Strategy

A sophisticated online retailer combined both approaches in a complementary way:

Canary Phase: Initial deployment of new checkout process to 5% of random users to verify stability and basic metrics
Gradual Expansion: Increased to 20% of traffic after confirming technical stability
A/B Testing Phase: At 20% deployment, implemented proper A/B test:
- Created two balanced user cohorts with similar demographics
- Established clear hypothesis about conversion improvement
- Ran test for two weeks to gather statistically significant data
- Measured detailed metrics including conversion rate, average order value, and cart abandonment
Data Analysis: Found that new checkout improved conversion by 4.6% but slightly decreased average order value
Iterative Improvement: Made adjustments based on A/B test findings
Canary Continuation: Deployed improved version to 50%, then 100% using canary approach

This combined approach provided both risk mitigation and data-driven optimization, with different teams focusing on their areas of expertise.

Organizational Aspects of Canary Deployments

Implementing canary deployments successfully requires organizational changes beyond just technical implementation.

Team Responsibilities in Canary Deployments

graph TD
    A[Organizational Roles] --> B[Development Team]
    A --> C[Operations/DevOps]
    A --> D[SRE Team]
    A --> E[Product Team]
    A --> F[QA/Testing]
    B --> B1[Create canary-compatible code]
    B --> B2[Ensure backward compatibility]
    B --> B3[Define key technical metrics]
    C --> C1[Configure deployment infrastructure]
    C --> C2[Implement traffic splitting]
    C --> C3[Automate canary pipelines]
    D --> D1[Define SLIs and SLOs]
    D --> D2[Monitor production metrics]
    D --> D3[Manage incident response]
    E --> E1[Define business success metrics]
    E --> E2[Determine success criteria]
    E --> E3[Balance risk vs. speed]
    F --> F1[Pre-canary verification]
    F --> F2[Synthetic testing during canary]
    F --> F3[Create test scenarios]

Cultural Shifts for Successful Canary Deployments

Risk Tolerance: Balance between perfect testing and learning in production
Failure Acceptance: View canary failures as successful risk mitigation, not deployment failures
Data-Driven Decisions: Rely on metrics rather than intuition for promotion/rollback
Shared Responsibility: Break down silos between development and operations
Patient Deployment: Value safety and stability over deployment speed
Incremental Approach: Embrace small, frequent changes over large, infrequent ones
Observability Culture: Invest in comprehensive monitoring and metrics

Canary Implementation Maturity Model

Organizations typically evolve through stages of canary deployment capability:

Level 0 - Basic Deployment: No canary capability, all-or-nothing deployments
Level 1 - Manual Canary: Basic canary with manual traffic shifting and monitoring
Level 2 - Automated Canary: Scripted canary deployments with some automated checks
Level 3 - Integrated Canary: Canary integrated into CI/CD with automated analysis
Level 4 - Advanced Canary: Sophisticated user targeting, multivariate analysis, automatic rollback
Level 5 - Progressive Delivery: Comprehensive strategy combining canary, feature flags, A/B testing, and personalization

Each level requires both technical capabilities and organizational maturity to implement effectively.

Organizational Case Study: Moving to Continuous Canary Deployment

A large financial services company transformed their deployment approach from quarterly releases to continuous canary deployments:

Initial State: Quarterly releases with extensive pre-release testing and frequent rollbacks
Technical Foundation: Built containerized architecture and service mesh for fine-grained routing
Team Reorganization: Shifted from component teams to cross-functional service teams
Observability Investment: Implemented comprehensive monitoring and alerting
Process Changes:
- Introduced feature flags to separate deployment from feature release
- Implemented automated canary analysis
- Created deployment scorecards to track success metrics
- Established "deployment champions" to guide teams
Governance Shift: From change approval boards to post-deployment reviews
Metric-Based Evaluation: Leadership focused on DORA metrics (deployment frequency, lead time for changes, change failure rate, time to restore service)

Results after 18 months:

Deployment frequency increased from quarterly to multiple times per day
Change failure rate decreased from 18% to 4%
Mean time to recovery decreased from 4 hours to 30 minutes
Developer satisfaction increased by 47% in internal surveys

Future Trends in Progressive Deployment

Emerging Approaches and Technologies

AI-Driven Canary Analysis: Machine learning for anomaly detection and automated analysis
Traffic Capture and Replay: Capturing production traffic patterns for more realistic canary testing
Chaos Engineering Integration: Combining canary deployments with fault injection to test resilience
Multi-Dimensional Canary: Simultaneous testing of multiple variations using advanced routing
Personalized Risk Profiles: Different canary strategies for different services based on criticality
Production Experimentation Platforms: Unified infrastructure for canary, A/B testing, and feature management
Developer-Controlled Progressive Delivery: Self-service tools for developers to manage their own canary releases

Preparing for Next-Generation Canary Deployments

Key capabilities to develop for future-ready canary deployment:

Comprehensive Telemetry: Invest in application and infrastructure observability
Service Mesh: Implement fine-grained traffic control with advanced service mesh technology
Unified Control Plane: Centralized management of deployments, experiments, and features
Deployment as a Platform: Self-service capabilities for deployment pipeline configuration
Metrics Standardization: Consistent metrics definitions across services
API Versioning Strategy: Future-proof approach to API evolution

graph TD
    A[Progressive Delivery Evolution] --> B[Canary Deployments]
    B --> C[Feature Flags + Canary]
    C --> D[Experimentation Platforms]
    D --> E[Autonomous Deployments]
    E --> E1[AI-based analysis]
    E --> E2[Predictive risk assessment]
    E --> E3[Self-healing deployment]
    E --> E4[Auto-optimization]

Key Takeaways

Canary deployments provide a safer approach to releasing software by exposing a small percentage of users to new versions before full rollout
This approach enables early detection of issues in a real production environment while limiting the impact of potential problems
Effective canary deployments require sophisticated traffic routing, comprehensive monitoring, and automated analysis
Database schema changes present unique challenges for canary deployments and require careful planning for compatibility
Organizational culture and processes are as important as technical implementation for successful canary adoption
Canary deployments complement other progressive delivery approaches like blue-green deployments, feature flags, and A/B testing
The future of progressive delivery involves increased automation, AI-driven analysis, and unified platforms for deployment and experimentation

Further Learning Resources

Learning Activities

Activity 1: Canary vs. Blue-Green Comparison

Create a detailed comparison of canary and blue-green deployment strategies for the following scenarios, indicating which approach would be better and why:

A critical financial transaction processing system
A content management system for a media company
A high-traffic e-commerce site during holiday season
A new feature in a social media application
A backend API with multiple client applications

For each scenario, consider factors such as risk tolerance, release urgency, infrastructure costs, and monitoring requirements.

Activity 2: Canary Metrics Workshop

For a typical web application with frontend and backend components, design a comprehensive canary analysis framework:

Identify 8-10 key metrics to monitor during canary deployments
Classify metrics into technical health, user experience, and business impact categories
Define thresholds for automatic promotion and rollback
Design a dashboard layout for comparing canary and baseline metrics
Create a decision tree for canary progression based on metric analysis

Activity 3: Canary Failure Scenarios

Analyze the following canary deployment failure scenarios and develop response plans:

Canary shows 3% higher error rate, but only for a specific user segment
Canary performance is acceptable at 5% traffic but degrades significantly at 20%
Canary works perfectly for 24 hours, then shows database connection issues
Business metrics (conversion rate) drop in canary, but all technical metrics are normal
Intermittent failures appear in canary that are difficult to reproduce

For each scenario, describe the immediate actions, investigation approach, and criteria for deciding whether to proceed, roll back, or fix and retry.