Alerting Systems Setup and Configuration

Building Effective Monitoring and Notification Systems

Introduction to Alerting Systems

Collecting logs and metrics is only half the battle in maintaining reliable systems. Without a way to be notified when things go wrong, you might be collecting valuable data but still missing critical issues until users report them. Alerting systems bridge this gap by actively monitoring your logs and metrics, then notifying the appropriate teams when predefined conditions are met.

Think of an alerting system as the smoke detector for your application - it constantly monitors for signs of trouble and sounds an alarm when it detects a problem, often before the issue becomes visible to users.

graph TD
    A[Logs] --> C[Monitoring System]
    B[Metrics] --> C
    D[Traces] --> C
    C --> E{Alert Conditions}
    E -->|Threshold Exceeded| F[Alert Management]
    F --> G[Notification Channels]
    G --> H[Email]
    G --> I[SMS/Phone]
    G --> J[Chat Apps]
    G --> K[Incident Management]
    F --> L[Alert Grouping]
    F --> M[Alert Routing]
    F --> N[Escalation]

The Anatomy of an Effective Alerting System

A complete alerting system consists of several interconnected components:

Data Sources

Alert Definitions

Notification Channels

Alert Management

Remediation

The Psychology of Alerting

Building an effective alerting system isn't just a technical challenge—it's a human factors problem:

Alert Fatigue

Alert fatigue occurs when engineers receive too many alerts, particularly false positives. Over time they respond more slowly and treat genuine alerts with less urgency.

Real-world example: A hospital study found that nurses exposed to more than 350 alarms per day responded significantly more slowly to critical alerts. The same psychology applies to software engineers dealing with system alerts.

The "Boy Who Cried Wolf" Effect

Systems that frequently generate false alarms lose credibility with the teams responsible for them. As in the fable, when a real emergency occurs, responders may not react with appropriate urgency.

Principles of Effective Alerting

Alert Design Patterns

The Four Golden Signals (Google SRE)

Google's Site Reliability Engineering team recommends focusing on four key metrics: latency, traffic, errors, and saturation.

RED Method (Weave)

A pattern focused on service monitoring: rate, errors, and duration.

USE Method (Brendan Gregg)

A pattern for infrastructure monitoring: utilization, saturation, and errors for each resource.

graph TD
    subgraph "Google SRE: Four Golden Signals"
        A1[Latency]
        A2[Traffic]
        A3[Errors]
        A4[Saturation]
    end
    subgraph "Weave: RED Method"
        B1[Rate]
        B2[Errors]
        B3[Duration]
    end
    subgraph "Brendan Gregg: USE Method"
        C1[Utilization]
        C2[Saturation]
        C3[Errors]
    end

Building Progressive Alert Hierarchies

Alert Severity Levels

Not all issues require the same urgency of response. A common approach is to define multiple severity levels:

| Severity | Description | Response Time | Notification |
|---|---|---|---|
| P1 (Critical) | Service outage, data loss risk | Immediate (24/7) | Phone call, SMS, push notification |
| P2 (High) | Degraded service, partial functionality loss | Within 30 minutes (24/7) | SMS, push notification |
| P3 (Medium) | Minor functionality issues, non-critical components | Business hours | Email, Slack |
| P4 (Low) | Cosmetic issues, low impact problems | Next business day | Ticketing system, email |
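One way to keep severity definitions actionable is to encode them as data that notification code can consult. The sketch below is illustrative only: the `SeverityPolicy` type, its field names, and the channel identifiers are hypothetical, and the values are copied from the severity levels listed above.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class SeverityPolicy:
    """Response expectations for one severity level (hypothetical structure)."""
    description: str
    response_time: str
    channels: Tuple[str, ...]
    around_the_clock: bool  # True when the level demands a 24/7 response


# Values mirror the severity levels above; channel names are illustrative labels.
SEVERITY_POLICIES = {
    "P1": SeverityPolicy("Service outage, data loss risk",
                         "Immediate (24/7)", ("phone", "sms", "push"), True),
    "P2": SeverityPolicy("Degraded service, partial functionality loss",
                         "Within 30 minutes (24/7)", ("sms", "push"), True),
    "P3": SeverityPolicy("Minor functionality issues, non-critical components",
                         "Business hours", ("email", "slack"), False),
    "P4": SeverityPolicy("Cosmetic issues, low impact problems",
                         "Next business day", ("ticket", "email"), False),
}


def channels_for(severity: str) -> Tuple[str, ...]:
    """Return the notification channels configured for a severity level."""
    return SEVERITY_POLICIES[severity].channels


def requires_immediate_page(severity: str) -> bool:
    """Return True when the severity should interrupt someone at any hour."""
    return SEVERITY_POLICIES[severity].around_the_clock
```

Keeping the policy in one table like this makes it easier to review severity definitions alongside the alerts that use them.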

Escalation Paths

When alerts aren't acknowledged, they should follow a defined escalation path:

graph TD
    A[Alert Triggered] --> B[Primary On-Call Notified]
    B -->|Acknowledged| C[Incident Response]
    B -->|No Response in 15min| D[Secondary On-Call Notified]
    D -->|Acknowledged| C
    D -->|No Response in 15min| E[Team Lead Notified]
    E -->|Acknowledged| C
    E -->|No Response in 15min| F[Management Escalation]
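The escalation path in the diagram can be modeled as a chain with a fixed timeout per step. The following is a minimal sketch, assuming each level waits exactly 15 minutes as shown in the diagram; the function name and role labels are hypothetical.

```python
from datetime import timedelta
from typing import List, Optional

# Escalation chain from the diagram; each step waits 15 minutes for an acknowledgement.
ESCALATION_CHAIN = ["primary on-call", "secondary on-call", "team lead", "management"]
STEP_TIMEOUT = timedelta(minutes=15)


def notified_so_far(elapsed: timedelta,
                    acknowledged_after: Optional[timedelta] = None) -> List[str]:
    """Return everyone notified by `elapsed`, stopping once the alert is acknowledged."""
    # Escalation stops at the moment of acknowledgement, if there was one.
    effective = elapsed if acknowledged_after is None else min(elapsed, acknowledged_after)
    # One person is notified immediately, then one more per elapsed timeout.
    steps = int(effective / STEP_TIMEOUT) + 1
    return ESCALATION_CHAIN[:min(steps, len(ESCALATION_CHAIN))]
```

For example, `notified_so_far(timedelta(minutes=31))` returns the first three levels of the chain, because two full timeouts have passed without an acknowledgement.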

Alert Types and Examples

Threshold-based Alerts

The most common alert type, triggered when a metric crosses a predefined value.

# Prometheus alerting rule example
groups:
- name: example
  rules:
  - alert: HighErrorRate
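    # Aggregate both sides by instance so the 5xx series divide by all requests, not by themselves
    expr: sum by (instance) (rate(http_requests_total{status=~"5.."}[5m])) / sum by (instance) (rate(http_requests_total[5m])) > 0.05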
    for: 10m
    labels:
      severity: critical
    annotations:
      summary: "High error rate on {{ $labels.instance }}"
      description: "Error rate is above 5% for more than 10 minutes (current value: {{ $value }})"
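The `for` clause in the rule above delays firing until the condition has held continuously. The sketch below is a simplified model of that behavior, not Prometheus's actual implementation: it assumes one sample per evaluation and expresses the `for` duration as a number of evaluation intervals. The function name is hypothetical.

```python
def alert_states(samples, threshold, for_evaluations):
    """Return 'inactive', 'pending', or 'firing' for each evaluation.

    The condition must hold for `for_evaluations` full intervals after it
    first becomes true before the alert moves from pending to firing.
    """
    consecutive = 0  # Number of consecutive evaluations above the threshold
    states = []
    for value in samples:
        consecutive = consecutive + 1 if value > threshold else 0
        if consecutive == 0:
            states.append("inactive")
        elif consecutive > for_evaluations:
            states.append("firing")
        else:
            states.append("pending")
    return states
```

A brief spike that clears before the `for` window elapses only ever reaches `pending`, which is how this setting filters out transient noise.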

Anomaly Detection Alerts

Alerts based on deviation from normal patterns, useful for detecting unusual behavior.

# Elasticsearch Watcher example (simplified: compares the 1-hour CPU average against a fixed threshold rather than a learned baseline)
{
  "trigger": {
    "schedule": {
      "interval": "10m"
    }
  },
  "input": {
    "search": {
      "request": {
        "indices": ["metrics-*"],
        "body": {
          "size": 0,
          "query": {
            "bool": {
              "must": [
                {"range": {"@timestamp": {"gte": "now-1h", "lte": "now"}}}
              ]
            }
          },
          "aggs": {
            "cpu_usage": {
              "avg": {
                "field": "system.cpu.usage"
              }
            }
          }
        }
      }
    }
  },
  "condition": {
    "script": {
      "source": "return ctx.payload.aggregations.cpu_usage.value > params.threshold",
      "params": {
        "threshold": 80
      }
    }
  },
  "actions": {
    "send_email": {
      "email": {
        "to": ["ops@example.com"],
        "subject": "High CPU Usage Alert",
        "body": {
          "text": "Average CPU usage is {{ctx.payload.aggregations.cpu_usage.value}}%"
        }
      }
    }
  }
}
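The watch above compares an average against a fixed value. A deviation-based check, closer to what this section describes, can be sketched as a z-score test against recent history. This is a minimal illustration under the assumption that the metric is roughly stable over the history window; the function name and the default threshold of 3 standard deviations are hypothetical choices.

```python
from statistics import mean, stdev


def is_anomalous(history, current, z_threshold=3.0):
    """Return True when `current` deviates from `history` by more than z_threshold sigmas."""
    if len(history) < 2:
        return False  # Not enough data to estimate a baseline
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        # A perfectly flat history: any change at all is a deviation
        return current != baseline
    return abs(current - baseline) / spread > z_threshold
```

Unlike a fixed threshold, this check adapts to each host's normal level, so a machine that usually runs at 40% CPU and one that usually runs at 70% are judged against their own baselines.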

Missing Data Alerts

Alerts triggered when expected data points are missing, indicating potential system failure.

# Grafana alert for missing data
{
  "alertRuleTags": {},
  "conditions": [
    {
      "evaluator": {
        "params": [1],
        "type": "lt"
      },
      "operator": {
        "type": "and"
      },
      "query": {
        "params": ["A", "5m", "now"]
      },
      "reducer": {
        "params": [],
        "type": "count_non_null"
      },
      "type": "query"
    }
  ],
  "executionErrorState": "alerting",
  "for": "5m",
  "frequency": "1m",
  "handler": 1,
  "name": "No Data Received",
  "noDataState": "alerting",
  "notifications": []
}
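The same idea can be expressed independently of any tool as a staleness check on the last time each source reported. The sketch below is an assumption-laden illustration: the function name, the source names, and the 5-minute default are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def stale_sources(last_seen, now, max_silence=timedelta(minutes=5)):
    """Return the names of sources whose latest data point is older than max_silence."""
    return sorted(name for name, seen in last_seen.items() if now - seen > max_silence)


# Example: one host reported a minute ago, another has been silent for 12 minutes.
now = datetime(2025, 5, 5, 14, 30, tzinfo=timezone.utc)
last_seen = {
    "api-server-01": now - timedelta(minutes=1),
    "api-server-02": now - timedelta(minutes=12),
}
silent = stale_sources(last_seen, now)
```

Here `silent` contains only `api-server-02`, the host that stopped reporting.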

Composite Alerts

Alerts that combine multiple conditions to reduce false positives.

# Datadog composite monitor example
# The query combines existing monitors by their IDs:
#   123456 = High CPU monitor
#   123457 = Slow query monitor
#   123458 = Connection count monitor
{
  "name": "Composite Database Alert",
  "type": "composite",
  "query": "123456 && 123457 && 123458",
  "message": "Database is experiencing high load and slow queries",
  "options": {
    "notify_no_data": true,
    "no_data_timeframe": 10,
    "notify_audit": false,
    "new_host_delay": 300,
    "include_tags": true,
    "escalation_message": "Database issues persisting"
  }
}

Popular Alerting Tools

Open Source Solutions

Commercial and SaaS Solutions

Cloud Provider Solutions

Implementing Alerting with Prometheus and Alertmanager

Architecture Overview

graph TD
    A[Application Metrics] --> B[Prometheus]
    B --> C[Alertmanager]
    C --> D[Email]
    C --> E[Slack]
    C --> F[PagerDuty]
    C --> G[Webhook]

Prometheus Alert Rules

Alert rules are defined in a YAML file:

# alert_rules.yml
groups:
- name: node_alerts
  rules:
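  - alert: HighCpuUsage
    # CPU utilization in percent; node_load1 is a load average, not a percentage
    expr: 100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) > 80
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High CPU usage on {{ $labels.instance }}"
      description: "CPU usage is above 80% for more than 5 minutes (current value: {{ $value | humanize }}%)"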
      
  - alert: MemoryAlmostFull
    expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.1
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Memory almost full on {{ $labels.instance }}"
      description: "Less than 10% memory available (current value: {{ $value | humanizePercentage }})"

- name: app_alerts
  rules:
  - alert: HighErrorRate
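    # Aggregate both sides by instance so the 5xx series divide by all requests, not by themselves
    expr: sum by (instance) (rate(http_requests_total{status=~"5.."}[5m])) / sum by (instance) (rate(http_requests_total[5m])) > 0.05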
    for: 2m
    labels:
      severity: critical
      team: backend
    annotations:
      summary: "High error rate on {{ $labels.instance }}"
      description: "Error rate is above 5% for more than 2 minutes (current value: {{ $value | humanizePercentage }})"
      dashboard: "https://grafana.example.com/d/abc123/http-metrics"
      runbook: "https://wiki.example.com/runbooks/high-error-rate"

Alertmanager Configuration

The Alertmanager handles alert notification and management:

# alertmanager.yml
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alerts@example.com'
  smtp_auth_username: 'alerts@example.com'
  smtp_auth_password: 'password'
  slack_api_url: 'https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXX'

route:
  group_by: ['alertname', 'job', 'severity']
  group_wait: 30s       # Wait 30s to buffer alerts of the same group
  group_interval: 5m    # Wait 5m before sending new notification for group
  repeat_interval: 4h   # Wait 4h before resending a firing alert
  receiver: 'team-backend-slack'  # Default receiver
  routes:
  - match:
      severity: critical
    receiver: 'pagerduty'
    continue: true    # Continue to other matching routes
  - match:
      team: backend
    receiver: 'team-backend-slack'
  - match:
      team: frontend
    receiver: 'team-frontend-email'

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']

receivers:
- name: 'pagerduty'
  pagerduty_configs:
  - service_key: 'your_pagerduty_service_key'
    
- name: 'team-backend-slack'
  slack_configs:
  - channel: '#alerts-backend'
    title: "{{ range .Alerts }}{{ .Annotations.summary }}\n{{ end }}"
    text: "{{ range .Alerts }}{{ .Annotations.description }}\n{{ end }}"
    
- name: 'team-frontend-email'
  email_configs:
  - to: 'frontend-team@example.com'
    send_resolved: true

Docker Compose Setup

Here's a Docker Compose configuration for a complete monitoring stack:

# docker-compose.yml
version: '3'
services:
  prometheus:
    image: prom/prometheus:v2.35.0
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./alert_rules.yml:/etc/prometheus/alert_rules.yml
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/usr/share/prometheus/console_libraries'
      - '--web.console.templates=/usr/share/prometheus/consoles'
    ports:
      - 9090:9090
    restart: always
    
  alertmanager:
    image: prom/alertmanager:v0.24.0
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      - '--storage.path=/alertmanager'
    ports:
      - 9093:9093
    restart: always

  grafana:
    image: grafana/grafana:8.5.2
    depends_on:
      - prometheus
    ports:
      - 3000:3000
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_USERS_ALLOW_SIGN_UP=false
    restart: always
    
  node-exporter:
    image: prom/node-exporter:v1.3.1
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    ports:
      - 9100:9100
    restart: always

volumes:
  grafana_data: {}

Implementing PagerDuty Integration

What is PagerDuty?

PagerDuty is a popular incident management platform that helps teams respond to disruptions in their services. It manages on-call schedules, escalation policies, and provides multiple notification methods.

Setting Up PagerDuty

  1. Create a service in PagerDuty specifically for your application
  2. Set up escalation policies to determine who gets notified and when
  3. Create schedules to manage on-call rotations
  4. Configure notification rules for team members

Integrating with Alertmanager

Update the Alertmanager configuration to send alerts to PagerDuty:

# alertmanager.yml (PagerDuty section)
receivers:
- name: 'pagerduty'
  pagerduty_configs:
  - routing_key: 'your_pagerduty_integration_key'
    description: '{{ if gt (len .Alerts.Firing) 0 }}{{ (index .Alerts.Firing 0).Annotations.summary }}{{ end }}'
    details:
      firing: '{{ template "pagerduty.default.instances" .Alerts.Firing }}'
      resolved: '{{ template "pagerduty.default.instances" .Alerts.Resolved }}'
      num_firing: '{{ .Alerts.Firing | len }}'
    client: 'Prometheus Alertmanager'
    client_url: 'https://alertmanager.example.com'
    severity: '{{ if gt (len .Alerts.Firing) 0 }}{{ (index .Alerts.Firing 0).Labels.severity }}{{ end }}'

Creating On-Call Schedules

Effective on-call schedules balance operational needs with engineer well-being:

gantt
    title On-Call Schedule Example
    dateFormat YYYY-MM-DD
    axisFormat %d
    section Team A
    Alice :a1, 2025-05-01, 7d
    Bob :a2, 2025-05-08, 7d
    Charlie :a3, 2025-05-15, 7d
    Diana :a4, 2025-05-22, 7d
    section Team B (Backup)
    Eve :b1, 2025-05-01, 7d
    Frank :b2, 2025-05-08, 7d
    Grace :b3, 2025-05-15, 7d
    Henry :b4, 2025-05-22, 7d
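A weekly rotation like the one charted above reduces to simple date arithmetic. The sketch below uses the Team A names and start date from the chart; the function name is hypothetical, and wrapping back to the first engineer after the fourth week is an assumption, since the chart only covers four weeks.

```python
from datetime import date

ROTATION = ["Alice", "Bob", "Charlie", "Diana"]  # Team A from the schedule
ROTATION_START = date(2025, 5, 1)
SHIFT_DAYS = 7


def on_call(day, rotation=ROTATION, start=ROTATION_START, shift_days=SHIFT_DAYS):
    """Return who is on call on `day`, cycling through the rotation in order."""
    if day < start:
        raise ValueError("day is before the rotation start date")
    shift_index = (day - start).days // shift_days
    # Assumption: the rotation repeats once everyone has taken a shift.
    return rotation[shift_index % len(rotation)]
```

The same function serves the backup rotation by passing a different `rotation` list.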

Building Slack Alert Integration

Slack is often the hub of team communication, making it an ideal place for non-emergency alerts.

Creating a Slack App for Alerts

  1. Go to api.slack.com/apps and create a new app
  2. Enable "Incoming Webhooks" feature
  3. Create a webhook URL for a specific channel
  4. Configure Alertmanager to use this webhook

Alertmanager Slack Configuration

# alertmanager.yml (Slack section)
receivers:
- name: 'team-slack'
  slack_configs:
  - api_url: 'https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXX'
    channel: '#alerts'
    send_resolved: true
    icon_emoji: ':warning:'
    title: '[{{ .Status | toUpper }}] {{ .CommonLabels.alertname }}'
    title_link: 'https://grafana.example.com'
    text: |-
      {{ range .Alerts -}}
      *Alert:* {{ .Annotations.summary }}
      *Description:* {{ .Annotations.description }}
      *Severity:* {{ .Labels.severity }}
      *Graph:* {{ .Annotations.dashboard }}
      *Runbook:* {{ .Annotations.runbook }}
      *Details:*
      {{ range .Labels.SortedPairs }}• *{{ .Name }}:* `{{ .Value }}`
      {{ end }}
      {{ end }}

Customizing Alert Formatting

Well-formatted alerts make it easier to understand issues at a glance:

🔴 CRITICAL: High Error Rate on api-server-01

Description: Error rate is above 5% for more than 2 minutes (current value: 8.3%)

Severity: critical

Started: 2025-05-05 14:32:15 UTC (5 minutes ago)

Links: Dashboard | Runbook | Logs

Details:

  • instance: api-server-01:8080
  • job: api-server
  • service: order-processing

Setting Up Email Alerts

Despite newer notification methods, email remains important for less urgent alerts and for maintaining a record of incidents.

SMTP Configuration in Alertmanager

# alertmanager.yml (Email section)
global:
  smtp_smarthost: 'smtp.gmail.com:587'
  smtp_from: 'alerts@example.com'
  smtp_auth_username: 'alerts@example.com'
  smtp_auth_password: 'your-app-password'
  smtp_require_tls: true

receivers:
- name: 'email-alerts'
  email_configs:
  - to: 'team@example.com'
    send_resolved: true
    headers:
      subject: '[{{ .Status | toUpper }}] {{ .CommonLabels.alertname }}'
    html: |
      <!DOCTYPE html>
      <html>
      <body>
        <h1>[{{ .Status | toUpper }}] {{ .CommonLabels.alertname }}</h1>
        
        <h2>Alerts</h2>
        {{ range .Alerts }}
        <div style="margin-bottom: 20px; padding: 15px; border: 1px solid #ddd; border-radius: 5px;
                  background-color: {{ if eq .Status "resolved" }}#e6ffe6{{ else }}#ffe6e6{{ end }}">
          <h3>{{ .Annotations.summary }}</h3>
          <p><strong>Description:</strong> {{ .Annotations.description }}</p>
          <p><strong>Started:</strong> {{ .StartsAt }}</p>
          {{ if eq .Status "resolved" }}
          <p><strong>Resolved:</strong> {{ .EndsAt }}</p>
          {{ end }}
          <h4>Labels:</h4>
          <ul>
            {{ range .Labels.SortedPairs }}
            <li><strong>{{ .Name }}:</strong> {{ .Value }}</li>
            {{ end }}
          </ul>
        </div>
        {{ end }}
        
        <p>View in <a href="https://alertmanager.example.com">Alertmanager</a></p>
      </body>
      </html>

Email Best Practices

Alert Routing and Grouping Strategies

Routing Based on Labels

Alertmanager can route alerts to different teams based on labels:

# alertmanager.yml (routing section)
route:
  group_by: ['alertname', 'job']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default-receiver'
  routes:
  - match:
      service: payment-processing
    receiver: 'payment-team'
    routes:
    - match:
        severity: critical
      receiver: 'payment-team-pagerduty'
      
  - match:
      service: authentication
    receiver: 'auth-team'
    
  - match_re:
      service: .*-api
    receiver: 'api-team'
    
  - match:
      team: infrastructure
    receiver: 'infra-team'
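To see how an alert travels through this routing tree, the matching logic can be sketched in a few lines. This is a simplified model, not Alertmanager's code: it assumes exact `match` and fully anchored `match_re` semantics, a depth-first walk where the deepest matching route wins, and that sibling routes are only tried after a match when `continue` is set. The dictionary layout and function names are hypothetical.

```python
import re

# The routing tree from the configuration above, as nested dictionaries.
ROUTING_TREE = {
    "receiver": "default-receiver",
    "routes": [
        {"match": {"service": "payment-processing"}, "receiver": "payment-team",
         "routes": [{"match": {"severity": "critical"},
                     "receiver": "payment-team-pagerduty"}]},
        {"match": {"service": "authentication"}, "receiver": "auth-team"},
        {"match_re": {"service": ".*-api"}, "receiver": "api-team"},
        {"match": {"team": "infrastructure"}, "receiver": "infra-team"},
    ],
}


def route_matches(route, labels):
    """Check a route's exact and regex matchers against an alert's labels."""
    exact = all(labels.get(k) == v for k, v in route.get("match", {}).items())
    regex = all(re.fullmatch(pattern, labels.get(k, ""))
                for k, pattern in route.get("match_re", {}).items())
    return exact and regex


def resolve_receivers(route, labels):
    """Return the receivers an alert with `labels` would be delivered to."""
    if not route_matches(route, labels):
        return []
    receivers = []
    for child in route.get("routes", []):
        found = resolve_receivers(child, labels)
        if found:
            receivers.extend(found)
            if not child.get("continue", False):
                break  # First matching child wins unless it asks to continue
    # No child matched: the alert is handled by this route's own receiver.
    return receivers or [route["receiver"]]
```

For instance, a critical alert labeled `service: payment-processing` resolves to `payment-team-pagerduty`, while a warning with the same service label falls back to `payment-team`.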

Alert Grouping

Grouping related alerts reduces notification noise:
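Conceptually, grouping buckets alerts by the values of the `group_by` labels so that each bucket produces one notification. The sketch below illustrates only that bucketing step (it ignores the `group_wait` and `group_interval` timers); the function name and sample alerts are hypothetical.

```python
from collections import defaultdict


def group_alerts(alerts, group_by=("alertname", "job")):
    """Bucket alerts (label dictionaries) by the values of the group_by labels."""
    groups = defaultdict(list)
    for labels in alerts:
        # Alerts missing a grouping label share the empty-string bucket for it.
        key = tuple(labels.get(name, "") for name in group_by)
        groups[key].append(labels)
    return dict(groups)


alerts = [
    {"alertname": "HighErrorRate", "job": "api-server", "instance": "api-server-01:8080"},
    {"alertname": "HighErrorRate", "job": "api-server", "instance": "api-server-02:8080"},
    {"alertname": "MemoryAlmostFull", "job": "node", "instance": "db-01:9100"},
]
grouped = group_alerts(alerts)
```

Three alerts collapse into two notifications here: one for the two `HighErrorRate` instances and one for `MemoryAlmostFull`.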

Inhibiting Redundant Alerts

Prevent less important alerts from firing when a related critical alert is active:

# alertmanager.yml (inhibit rules)
inhibit_rules:
  # Inhibit all warning-level alerts if there's a critical alert with the same alertname and instance
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']
    
  # Inhibit service-specific alerts if the whole cluster is down
  - source_match:
      alertname: 'ClusterDown'
    target_match_re:
      alertname: 'Service.*Down'
    equal: ['cluster']
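The effect of an inhibit rule can be sketched as a check of each candidate alert against the currently firing ones. This simplified model covers only exact `source_match` and `target_match` rules such as the first rule above; regex matchers are left out, and the function name and data layout are hypothetical.

```python
def is_inhibited(target, active_alerts, rules):
    """Return True when some active alert suppresses `target` under any rule."""
    for rule in rules:
        # The rule only applies to alerts matching its target side.
        if not all(target.get(k) == v for k, v in rule["target_match"].items()):
            continue
        for source in active_alerts:
            if source is target:
                continue  # An alert does not inhibit itself
            source_ok = all(source.get(k) == v for k, v in rule["source_match"].items())
            same_scope = all(source.get(label) == target.get(label)
                             for label in rule["equal"])
            if source_ok and same_scope:
                return True
    return False


RULES = [{
    "source_match": {"severity": "critical"},
    "target_match": {"severity": "warning"},
    "equal": ["alertname", "instance"],
}]
critical = {"alertname": "HighErrorRate", "instance": "api-1", "severity": "critical"}
warning_same = {"alertname": "HighErrorRate", "instance": "api-1", "severity": "warning"}
warning_other = {"alertname": "HighErrorRate", "instance": "api-2", "severity": "warning"}
active = [critical, warning_same, warning_other]
```

With these inputs, the warning on `api-1` is suppressed by the critical alert on the same instance, while the warning on `api-2` still notifies.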

Building Runbooks for Alert Response

Runbooks provide standardized procedures for responding to specific alerts, reducing mean time to resolution.

Anatomy of an Effective Runbook

Sample Runbook: High API Error Rate

Runbook: High API Error Rate

Alert Description: This alert triggers when the API error rate (HTTP 5xx responses) exceeds 5% for more than 2 minutes.

Impact: Users may experience failed requests and service disruption. High priority for customer-facing APIs.

Diagnostic Steps:

  1. Check the API logs for error patterns: kubectl logs -l app=api-service -n production | grep ERROR
  2. Verify database connectivity: kubectl exec api-pod -- curl database:5432
  3. Check dependent service health: curl https://metrics.example.com/health/dependencies
  4. Verify recent deployments: kubectl describe deployment api-service -n production

Common Causes:

  • Database connection issues (timeouts, connection limits)
  • Dependency service failures
  • Recent deployment introducing bugs
  • Resource exhaustion (memory leaks, CPU throttling)

Resolution Steps:

  1. If related to a recent deployment, roll back: kubectl rollout undo deployment api-service -n production
  2. If database connection issues:
    • Check connection pool settings
    • Verify database health
    • Restart the API service if necessary
  3. If dependency failure, check the dependent service's health and logs
  4. If resource exhaustion:
    • Scale up the deployment: kubectl scale deployment api-service --replicas=5 -n production
    • Restart problematic pods: kubectl delete pod [pod-name] -n production

Escalation Path:

  1. If unable to resolve within 15 minutes, escalate to the Database team (if DB-related) or Platform team (if infrastructure-related)
  2. For persistent issues, engage the Development team lead

Prevention Measures:

  • Implement circuit breakers for dependency calls
  • Add more comprehensive pre-deployment testing
  • Set up database connection pooling monitoring
  • Configure auto-scaling based on request load

Automating Runbooks

For common issues with well-defined solutions, consider automating the response:
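One possible starting point is a dry-run dispatcher that maps firing alerts from an Alertmanager webhook payload to the remediation a runbook prescribes. The sketch below only returns the commands it would run and never executes anything. The payload fields used (`alerts`, `status`, `labels`) follow the Alertmanager webhook format; the handler registry and function names are hypothetical, and the rollback command is the one from the sample runbook.

```python
def rollback_api_service(labels):
    """Return the rollback command from the High API Error Rate runbook."""
    return "kubectl rollout undo deployment api-service -n production"


# Hypothetical registry: alert name -> function that builds a remediation command.
HANDLERS = {"HighErrorRate": rollback_api_service}


def plan_remediation(payload, handlers=HANDLERS):
    """List the commands that firing alerts in a webhook payload would trigger."""
    actions = []
    for alert in payload.get("alerts", []):
        if alert.get("status") != "firing":
            continue  # Resolved alerts need no remediation
        labels = alert.get("labels", {})
        handler = handlers.get(labels.get("alertname"))
        if handler is not None:
            actions.append(handler(labels))
    return actions


payload = {
    "status": "firing",
    "alerts": [
        {"status": "firing", "labels": {"alertname": "HighErrorRate", "severity": "critical"}},
        {"status": "resolved", "labels": {"alertname": "HighErrorRate", "severity": "critical"}},
        {"status": "firing", "labels": {"alertname": "MemoryAlmostFull", "severity": "warning"}},
    ],
}
planned = plan_remediation(payload)
```

Starting with a dry run that posts the planned command to the incident channel lets the team build trust in the automation before allowing it to act on its own.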

Alert Fatigue Prevention Strategies

Measuring Alert Quality

Track these metrics to gauge the health of your alerting system:
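As an illustration of how such measurements might be computed, the sketch below derives three example metrics from a log of past alerts: the share that required action, the share never acknowledged, and the mean time to acknowledge. These particular metrics, the `AlertRecord` type, and its field names are assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AlertRecord:
    """One historical alert (hypothetical record layout)."""
    name: str
    actionable: bool                 # Did a human need to do something?
    seconds_to_ack: Optional[float]  # None if the alert was never acknowledged


def alert_quality(records: List[AlertRecord]) -> dict:
    """Summarize alert quality over a list of historical alert records."""
    total = len(records)
    if total == 0:
        return {"actionable_ratio": None, "unacknowledged_ratio": None,
                "mean_seconds_to_ack": None}
    acked = [r.seconds_to_ack for r in records if r.seconds_to_ack is not None]
    return {
        "actionable_ratio": sum(r.actionable for r in records) / total,
        "unacknowledged_ratio": (total - len(acked)) / total,
        "mean_seconds_to_ack": sum(acked) / len(acked) if acked else None,
    }


history = [
    AlertRecord("HighErrorRate", True, 60),
    AlertRecord("HighCpuLoad", False, None),
    AlertRecord("HighErrorRate", True, 120),
    AlertRecord("MemoryAlmostFull", False, 30),
]
summary = alert_quality(history)
```

A low actionable ratio or a rising unacknowledged ratio for a given alert name is a signal that the rule needs tuning or removal.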

Strategies to Reduce Alert Noise

Regular Alert Reviews

Schedule regular reviews of your alerts:

Practical Exercise: Setting Up a Complete Alerting System

Exercise Overview

In this exercise, you'll build a comprehensive alerting system for a web application:

  1. Set up Prometheus and Alertmanager using Docker Compose
  2. Configure alert rules for common scenarios (high error rate, service unavailability, resource exhaustion)
  3. Integrate with multiple notification channels (Slack, email)
  4. Create runbooks for each alert type
  5. Test the system by simulating failure conditions
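For the final step, one way to exercise the notification pipeline without breaking anything is to post a synthetic alert directly to Alertmanager. The sketch below assumes Alertmanager is reachable at `http://localhost:9093` (the port published in the Docker Compose file) and uses its `/api/v2/alerts` endpoint; the function names and the `test-instance` label value are hypothetical.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone


def build_test_alert(alertname, severity, duration_minutes=5, now=None):
    """Build the JSON-serializable body for a synthetic alert."""
    now = now or datetime.now(timezone.utc)
    return [{
        "labels": {"alertname": alertname, "severity": severity,
                   "instance": "test-instance"},
        "annotations": {"summary": f"Synthetic {alertname} alert for pipeline testing"},
        "startsAt": now.isoformat(),
        # Alertmanager treats the alert as resolved once endsAt has passed.
        "endsAt": (now + timedelta(minutes=duration_minutes)).isoformat(),
    }]


def send_test_alert(alerts, base_url="http://localhost:9093"):
    """POST the alerts to Alertmanager and return the HTTP status code."""
    request = urllib.request.Request(
        f"{base_url}/api/v2/alerts",
        data=json.dumps(alerts).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status
```

Calling `send_test_alert(build_test_alert("HighErrorRate", "critical"))` against a running stack should route the alert through the same receivers a real one would use, which makes it a quick check of routing and message formatting.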

Required Resources

For detailed exercise instructions and starter code, refer to the exercise repository: Alerting Workshop Repository (Example URL)

Conclusion and Key Takeaways

Remember: The goal of an alerting system is not just to notify you when things break, but to give you the context and tools to fix problems quickly and prevent them in the future.

Additional Resources

Documentation

Books

Online Resources

Next Lecture Preview: Production Deployment

In our next session, we'll explore how to prepare your application for production deployment, covering: