Introduction to Alerting Systems
Collecting logs and metrics is only half the battle in maintaining reliable systems. Without a way to be notified when things go wrong, you might be collecting valuable data but still missing critical issues until users report them. Alerting systems bridge this gap by actively monitoring your logs and metrics, then notifying the appropriate teams when predefined conditions are met.
Think of an alerting system as the smoke detector for your application - it constantly monitors for signs of trouble and sounds an alarm when it detects a problem, often before the issue becomes visible to users.
The Anatomy of an Effective Alerting System
A complete alerting system consists of several interconnected components:
Data Sources
- Metrics - Numerical data points collected over time (CPU usage, request rates, error rates)
- Logs - Event records from applications and systems
- Traces - Distributed request tracking across services
- Synthetic Monitors - Simulated user interactions to test availability
Alert Definitions
- Thresholds - Static values that trigger alerts when crossed
- Anomaly Detection - Alerts based on deviations from normal patterns
- Composite Conditions - Multiple conditions that must be met
- Missing Data - Alerts triggered when expected data is absent
Notification Channels
- Email - Traditional but can be easily overlooked
- SMS/Phone Calls - More urgent notifications
- Chat Applications - Team notifications (Slack, Microsoft Teams)
- Mobile Push Notifications - Direct alerts to on-call personnel
- Webhooks - Integration with other systems
Alert Management
- Grouping - Combining related alerts to reduce noise
- Routing - Directing alerts to the right teams
- Escalation - Notifying additional people if alerts aren't acknowledged
- Scheduling - Managing on-call rotations
Remediation
- Runbooks - Documented procedures for handling specific alerts
- Automation - Automatic responses to known issues
- Incident Management - Tools for coordinating responses to major incidents
The Psychology of Alerting
Building an effective alerting system is not only a technical challenge; it is also a human-factors problem:
Alert Fatigue
Alert fatigue occurs when engineers receive too many alerts, particularly false positives, leading them to:
- Begin ignoring alerts entirely
- Develop stress and burnout
- Miss critical alerts hidden among trivial ones
Real-world example: A hospital study found that nurses exposed to more than 350 alarms per day responded significantly slower to critical alerts. The same psychology applies to software engineers dealing with system alerts.
The "Boy Who Cried Wolf" Effect
Systems that frequently generate false alarms lose credibility with the teams responsible for them. Like in the fable, when a real emergency occurs, responders may not react with appropriate urgency.
Principles of Effective Alerting
- Alert on symptoms, not causes - Focus on user-impacting issues
- Every alert should be actionable - An engineer should know what to do when they receive it
- Different severity levels require different responses - Not everything is an emergency
- Context is crucial - Include relevant information to aid diagnosis
Alert Design Patterns
The Four Golden Signals (Google SRE)
Google's Site Reliability Engineering team recommends focusing on four key metrics:
- Latency - How long does it take to serve a request?
- Traffic - How much demand is placed on your system?
- Errors - Rate of requests that fail
- Saturation - How "full" is your service? (CPU, memory, disk IO, etc.)
RED Method (Weaveworks)
A pattern focused on service monitoring:
- Rate - Requests per second
- Errors - Failed requests per second
- Duration - Distribution of request latencies
USE Method (Brendan Gregg)
A pattern for infrastructure monitoring:
- Utilization - Percentage of resource used
- Saturation - Amount of work queued
- Errors - Error events
Building Progressive Alert Hierarchies
Alert Severity Levels
Not all issues require the same urgency of response. A common approach is to define multiple severity levels:
| Severity | Description | Response Time | Notification |
|---|---|---|---|
| P1 (Critical) | Service outage, data loss risk | Immediate (24/7) | Phone call, SMS, push notification |
| P2 (High) | Degraded service, partial functionality loss | Within 30 minutes (24/7) | SMS, push notification |
| P3 (Medium) | Minor functionality issues, non-critical components | Business hours | Email, Slack |
| P4 (Low) | Cosmetic issues, low impact problems | Next business day | Ticketing system, email |
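The table above can also be kept as data, so that notification logic looks up the channels for a severity instead of hard-coding them. The following is a minimal Python sketch; `SeverityPolicy`, `SEVERITY_POLICY`, `channels_for`, and the channel identifiers are illustrative names rather than any tool's API, and deferring P3/P4 alerts outside business hours is an assumption drawn from the "Response Time" column.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SeverityPolicy:
    response_time: str
    channels: tuple[str, ...]
    notifies_around_the_clock: bool


# Mirrors the severity table; channel names are placeholders.
SEVERITY_POLICY = {
    "P1": SeverityPolicy("Immediate (24/7)", ("phone", "sms", "push"), True),
    "P2": SeverityPolicy("Within 30 minutes (24/7)", ("sms", "push"), True),
    "P3": SeverityPolicy("Business hours", ("email", "slack"), False),
    "P4": SeverityPolicy("Next business day", ("ticket", "email"), False),
}


def channels_for(severity: str, business_hours: bool) -> tuple[str, ...]:
    """Return the channels to notify now; an empty tuple means 'defer'."""
    policy = SEVERITY_POLICY[severity]
    if business_hours or policy.notifies_around_the_clock:
        return policy.channels
    return ()
```

With this layout, changing how a severity is delivered means editing one table entry rather than the notification code.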
Escalation Paths
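When an alert isn't acknowledged within a set time, it should follow a defined escalation path, for example from the primary on-call engineer to a secondary, and then to a team lead.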
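One way to make an escalation path explicit is to store it as an ordered list of steps. The sketch below is illustrative only: the contact names and timings in `ESCALATION_POLICY` are placeholders, and tools such as PagerDuty implement this for you through their escalation policies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationStep:
    after_minutes: int  # minutes the alert has gone unacknowledged
    notify: str         # who is notified at this step


# Hypothetical policy; adjust contacts and timings to your team.
ESCALATION_POLICY = [
    EscalationStep(0, "primary-on-call"),
    EscalationStep(15, "secondary-on-call"),
    EscalationStep(30, "team-lead"),
    EscalationStep(60, "engineering-manager"),
]


def contacts_to_notify(minutes_unacknowledged: float, acknowledged: bool) -> list[str]:
    """Return everyone who should have been notified by now.

    Once the alert is acknowledged, escalation stops and nobody else is paged.
    """
    if acknowledged:
        return []
    return [
        step.notify
        for step in ESCALATION_POLICY
        if minutes_unacknowledged >= step.after_minutes
    ]
```

For example, an alert left unacknowledged for 20 minutes would have reached the primary and secondary on-call engineers, but not yet the team lead.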
Alert Types and Examples
Threshold-based Alerts
The most common alert type, triggered when a metric crosses a predefined value.
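# Prometheus alerting rule example
groups:
  - name: example
    rules:
      - alert: HighErrorRate
        # Aggregate per instance before dividing; otherwise the two sides only
        # match on identical label sets (the 5xx series) and the ratio is always 1.
        expr: sum by (instance) (rate(http_requests_total{status=~"5.."}[5m])) / sum by (instance) (rate(http_requests_total[5m])) > 0.05
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on {{ $labels.instance }}"
          description: "Error rate is above 5% for more than 10 minutes (current value: {{ $value }})"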
Anomaly Detection Alerts
Alerts based on deviation from normal patterns, useful for detecting unusual behavior.
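# Elasticsearch Watcher example: every 10 minutes, compare the last hour's average CPU usage against a fixed threshold
# (a simple stand-in; true anomaly detection would compare against a learned baseline instead of a static value)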
{
"trigger": {
"schedule": {
"interval": "10m"
}
},
"input": {
"search": {
"request": {
"indices": ["metrics-*"],
"body": {
"size": 0,
"query": {
"bool": {
"must": [
{"range": {"@timestamp": {"gte": "now-1h", "lte": "now"}}}
]
}
},
"aggs": {
"cpu_usage": {
"avg": {
"field": "system.cpu.usage"
}
}
}
}
}
}
},
"condition": {
"script": {
"source": "return ctx.payload.aggregations.cpu_usage.value > params.threshold",
"params": {
"threshold": 80
}
}
},
"actions": {
"send_email": {
"email": {
"to": ["ops@example.com"],
"subject": "High CPU Usage Alert",
"body": {
"text": "Average CPU usage is {{ctx.payload.aggregations.cpu_usage.value}}%"
}
}
}
}
}
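A deviation-from-normal check, as described at the start of this section, can be sketched with a z-score: the current value is compared against the mean and standard deviation of recent history rather than against a fixed number. This is a minimal Python illustration, not how any particular product implements anomaly detection; the function name and the default threshold of 3 standard deviations are assumptions.

```python
from statistics import mean, stdev


def is_anomalous(history: list[float], current: float, z_threshold: float = 3.0) -> bool:
    """Flag `current` if it lies more than `z_threshold` sample standard
    deviations away from the mean of `history`."""
    if len(history) < 2:
        return False  # not enough data to estimate the spread
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # A perfectly flat history: any change at all is a deviation.
        return current != mu
    return abs(current - mu) / sigma > z_threshold
```

Unlike a static threshold, this adapts to each series: a CPU value of 80 is unremarkable on a host that normally runs at 75 but stands out on one that normally runs at 40.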
Missing Data Alerts
Alerts triggered when expected data points are missing, indicating potential system failure.
# Grafana alert for missing data
{
"alertRuleTags": {},
"conditions": [
{
"evaluator": {
"params": [1],
"type": "lt"
},
"operator": {
"type": "and"
},
"query": {
"params": ["A", "5m", "now"]
},
"reducer": {
"params": [],
"type": "count_non_null"
},
"type": "query"
}
],
"executionErrorState": "alerting",
"for": "5m",
"frequency": "1m",
"handler": 1,
"name": "No Data Received",
"noDataState": "alerting",
"notifications": []
}
Composite Alerts
Alerts that combine multiple conditions to reduce false positives.
# Datadog composite monitor example
# The query combines existing monitors by their IDs:
#   123456 = high CPU monitor, 123457 = slow query monitor, 123458 = connection count monitor
{
  "name": "Composite Database Alert",
  "type": "composite",
  "query": "123456 && 123457 && 123458",
  "message": "Database is experiencing high load and slow queries",
  "options": {
    "notify_no_data": true,
    "no_data_timeframe": 10,
    "notify_audit": false,
    "new_host_delay": 300,
    "include_tags": true,
    "escalation_message": "Database issues persisting"
  }
}
Popular Alerting Tools
Open Source Solutions
- Prometheus Alertmanager - Alert management for Prometheus metrics
- Grafana Alerting - Visual alert configuration with multiple data source support
- ElastAlert - Alerting for Elasticsearch data
- Zabbix - Integrated monitoring and alerting system
- Nagios - Traditional monitoring and alerting platform
Commercial and SaaS Solutions
- PagerDuty - Incident response and on-call management
- OpsGenie - Alert management with advanced routing capabilities
- VictorOps - Incident management platform
- New Relic Alerts - Integrated with New Relic monitoring
- Datadog Monitors - Alerting integrated with Datadog observability platform
Cloud Provider Solutions
- AWS CloudWatch Alarms - Alerts based on AWS metrics
- Google Cloud Monitoring Alerts - Alerting for GCP resources
- Azure Monitor Alerts - Alerting for Azure services
Implementing Alerting with Prometheus and Alertmanager
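Architecture Overview
Prometheus scrapes metrics from its targets, evaluates the alert rules at a regular interval, and sends firing alerts to Alertmanager. Alertmanager then groups, deduplicates, and routes those alerts to receivers such as PagerDuty, Slack, or email.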
Prometheus Alert Rules
Alert rules are defined in a YAML file:
# alert_rules.yml
groups:
  - name: node_alerts
    rules:
      - alert: HighCpuLoad
        # node_load1 is a load average, not a percentage, so normalize it by the CPU count
        expr: node_load1 / count without (cpu, mode) (node_cpu_seconds_total{mode="idle"}) > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU load on {{ $labels.instance }}"
          description: "1-minute load average per CPU core is above 0.8 for more than 5 minutes (current value: {{ $value }})"
      - alert: MemoryAlmostFull
        # MemAvailable is the kernel's own estimate of reclaimable memory
        expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Memory almost full on {{ $labels.instance }}"
          description: "Less than 10% memory available (current value: {{ $value | humanizePercentage }})"
  - name: app_alerts
    rules:
      - alert: HighErrorRate
        # Aggregate per instance so both sides of the division share the same labels
        expr: sum by (instance) (rate(http_requests_total{status=~"5.."}[5m])) / sum by (instance) (rate(http_requests_total[5m])) > 0.05
        for: 2m
        labels:
          severity: critical
          team: backend
        annotations:
          summary: "High error rate on {{ $labels.instance }}"
          description: "Error rate is above 5% for more than 2 minutes (current value: {{ $value | humanizePercentage }})"
          dashboard: "https://grafana.example.com/d/abc123/http-metrics"
          runbook: "https://wiki.example.com/runbooks/high-error-rate"
Alertmanager Configuration
The Alertmanager handles alert notification and management:
# alertmanager.yml
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alerts@example.com'
  smtp_auth_username: 'alerts@example.com'
  smtp_auth_password: 'password'
  slack_api_url: 'https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXX'
route:
  group_by: ['alertname', 'job', 'severity']
  group_wait: 30s        # Wait 30s to buffer alerts of the same group
  group_interval: 5m     # Wait 5m before sending new notification for group
  repeat_interval: 4h    # Wait 4h before resending a firing alert
  receiver: 'team-backend-slack'  # Default receiver
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty'
      continue: true     # Continue to other matching routes
    - match:
        team: backend
      receiver: 'team-backend-slack'
    - match:
        team: frontend
      receiver: 'team-frontend-email'
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']
receivers:
  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: 'your_pagerduty_service_key'
  - name: 'team-backend-slack'
    slack_configs:
      - channel: '#alerts-backend'
        title: "{{ range .Alerts }}{{ .Annotations.summary }}\n{{ end }}"
        text: "{{ range .Alerts }}{{ .Annotations.description }}\n{{ end }}"
  - name: 'team-frontend-email'
    email_configs:
      - to: 'frontend-team@example.com'
        send_resolved: true
Docker Compose Setup
Here's a Docker Compose configuration for a complete monitoring stack:
# docker-compose.yml
version: '3'
services:
  prometheus:
    image: prom/prometheus:v2.35.0
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./alert_rules.yml:/etc/prometheus/alert_rules.yml
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/usr/share/prometheus/console_libraries'
      - '--web.console.templates=/usr/share/prometheus/consoles'
    ports:
      - 9090:9090
    restart: always
  alertmanager:
    image: prom/alertmanager:v0.24.0
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      - '--storage.path=/alertmanager'
    ports:
      - 9093:9093
    restart: always
  grafana:
    image: grafana/grafana:8.5.2
    depends_on:
      - prometheus
    ports:
      - 3000:3000
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_USERS_ALLOW_SIGN_UP=false
    restart: always
  node-exporter:
    image: prom/node-exporter:v1.3.1
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--collector.filesystem.ignored-mount-points=^/(sys|proc|dev|host|etc)($$|/)'
    ports:
      - 9100:9100
    restart: always
volumes:
  grafana_data: {}
Implementing PagerDuty Integration
What is PagerDuty?
PagerDuty is a popular incident management platform that helps teams respond to disruptions in their services. It manages on-call schedules, escalation policies, and provides multiple notification methods.
Setting Up PagerDuty
- Create a service in PagerDuty specifically for your application
- Set up escalation policies to determine who gets notified and when
- Create schedules to manage on-call rotations
- Configure notification rules for team members
Integrating with Alertmanager
Update the Alertmanager configuration to send alerts to PagerDuty:
# alertmanager.yml (PagerDuty section)
receivers:
  - name: 'pagerduty'
    pagerduty_configs:
      - routing_key: 'your_pagerduty_integration_key'
        description: '{{ if gt (len .Alerts.Firing) 0 }}{{ (index .Alerts.Firing 0).Annotations.summary }}{{ end }}'
        details:
          firing: '{{ template "pagerduty.default.instances" .Alerts.Firing }}'
          resolved: '{{ template "pagerduty.default.instances" .Alerts.Resolved }}'
          num_firing: '{{ .Alerts.Firing | len }}'
        client: 'Prometheus Alertmanager'
        client_url: 'https://alertmanager.example.com'
        severity: '{{ if gt (len .Alerts.Firing) 0 }}{{ (index .Alerts.Firing 0).Labels.severity }}{{ end }}'
Creating On-Call Schedules
Effective on-call schedules balance operational needs with engineer well-being:
- Rotate on-call duty among team members
- Ensure handoffs between shifts are smooth
- Have documented backup procedures
- Consider follow-the-sun scheduling for global teams
Building Slack Alert Integration
Slack is often the hub of team communication, making it an ideal place for non-emergency alerts.
Creating a Slack App for Alerts
- Go to api.slack.com/apps and create a new app
- Enable "Incoming Webhooks" feature
- Create a webhook URL for a specific channel
- Configure Alertmanager to use this webhook
Alertmanager Slack Configuration
# alertmanager.yml (Slack section)
receivers:
  - name: 'team-slack'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXX'
        channel: '#alerts'
        send_resolved: true
        icon_emoji: ':warning:'
        title: '{{ if gt (len .Alerts.Firing) 0 }}{{ .CommonLabels.alertname }}{{ end }}'
        title_link: 'https://grafana.example.com'
        # A literal block (|-) keeps the line breaks; a folded block (>-) would join these lines with spaces
        text: |-
          {{ range .Alerts -}}
          *Alert:* {{ .Annotations.summary }}
          *Description:* {{ .Annotations.description }}
          *Severity:* {{ .Labels.severity }}
          *Graph:* {{ .Annotations.dashboard }}
          *Runbook:* {{ .Annotations.runbook }}
          *Details:*
          {{ range .Labels.SortedPairs }}• *{{ .Name }}:* `{{ .Value }}`
          {{ end }}
          {{ end }}
Customizing Alert Formatting
Well-formatted alerts make it easier to understand issues at a glance:
- Use emoji for severity levels (🔴 Critical, 🟠 Warning, etc.)
- Include direct links to dashboards
- Add runbook links for resolution steps
- Format messages to highlight the most important information
🔴 CRITICAL: High Error Rate on api-server-01
Description: Error rate is above 5% for more than 2 minutes (current value: 8.3%)
Severity: critical
Started: 2025-05-05 14:32:15 UTC (5 minutes ago)
Links: Dashboard | Runbook | Logs
Details:
- instance: api-server-01:8080
- job: api-server
- service: order-processing
Setting Up Email Alerts
Despite newer notification methods, email remains important for less urgent alerts and for maintaining a record of incidents.
SMTP Configuration in Alertmanager
# alertmanager.yml (Email section)
global:
  smtp_smarthost: 'smtp.gmail.com:587'
  smtp_from: 'alerts@example.com'
  smtp_auth_username: 'alerts@example.com'
  smtp_auth_password: 'your-app-password'
  smtp_require_tls: true
receivers:
  - name: 'email-alerts'
    email_configs:
      - to: 'team@example.com'
        send_resolved: true
        headers:
          subject: '{{ if gt (len .Alerts.Firing) 0 }}{{ .CommonLabels.alertname }}{{ end }}'
        # Each alert in the template exposes .Status ("firing" or "resolved")
        html: |
          <!DOCTYPE html>
          <html>
          <body>
            <h1>{{ if gt (len .Alerts.Firing) 0 }}{{ .CommonLabels.alertname }}{{ end }}</h1>
            <h2>Alerts</h2>
            {{ range .Alerts }}
            <div style="margin-bottom: 20px; padding: 15px; border: 1px solid #ddd; border-radius: 5px;
                        background-color: {{ if eq .Status "resolved" }}#e6ffe6{{ else }}#ffe6e6{{ end }}">
              <h3>{{ .Annotations.summary }}</h3>
              <p><strong>Description:</strong> {{ .Annotations.description }}</p>
              <p><strong>Started:</strong> {{ .StartsAt }}</p>
              {{ if eq .Status "resolved" }}
              <p><strong>Resolved:</strong> {{ .EndsAt }}</p>
              {{ end }}
              <h4>Labels:</h4>
              <ul>
                {{ range .Labels.SortedPairs }}
                <li><strong>{{ .Name }}:</strong> {{ .Value }}</li>
                {{ end }}
              </ul>
            </div>
            {{ end }}
            <p>View in <a href="https://alertmanager.example.com">Alertmanager</a></p>
          </body>
          </html>
Email Best Practices
- Use clear, descriptive subject lines
- Format emails for readability (HTML when possible)
- Include all necessary context in the email body
- Consider email filters and tags for organization
Alert Routing and Grouping Strategies
Routing Based on Labels
Alertmanager can route alerts to different teams based on labels:
# alertmanager.yml (routing section)
route:
  group_by: ['alertname', 'job']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default-receiver'
  routes:
    - match:
        service: payment-processing
      receiver: 'payment-team'
      routes:
        - match:
            severity: critical
          receiver: 'payment-team-pagerduty'
    - match:
        service: authentication
      receiver: 'auth-team'
    - match_re:
        service: .*-api
      receiver: 'api-team'
    - match:
        team: infrastructure
      receiver: 'infra-team'
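To see how a routing tree like this resolves a given alert, it can help to model it in a few lines of Python. This is a simplified sketch rather than Alertmanager's implementation: it walks the tree depth-first, takes the first matching child, ignores the `continue` option, and treats `match_re` patterns as fully anchored. The `Route` class and `resolve_receiver` function are illustrative names.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Route:
    receiver: str
    match: dict[str, str] = field(default_factory=dict)
    match_re: dict[str, str] = field(default_factory=dict)
    routes: list["Route"] = field(default_factory=list)

    def matches(self, labels: dict[str, str]) -> bool:
        exact = all(labels.get(k) == v for k, v in self.match.items())
        regex = all(
            re.fullmatch(pattern, labels.get(k, "")) for k, pattern in self.match_re.items()
        )
        return exact and regex


def resolve_receiver(route: Route, labels: dict[str, str]) -> str:
    """The first matching child wins; if none match, the current route's receiver is used."""
    for child in route.routes:
        if child.matches(labels):
            return resolve_receiver(child, labels)
    return route.receiver


# The routing tree from the configuration above.
ROOT = Route(
    receiver="default-receiver",
    routes=[
        Route(
            "payment-team",
            match={"service": "payment-processing"},
            routes=[Route("payment-team-pagerduty", match={"severity": "critical"})],
        ),
        Route("auth-team", match={"service": "authentication"}),
        Route("api-team", match_re={"service": ".*-api"}),
        Route("infra-team", match={"team": "infrastructure"}),
    ],
)
```

A critical alert with `service: payment-processing` resolves to `payment-team-pagerduty`, while a warning for the same service stops at `payment-team`. Order matters: an alert is handled by the first sibling route it matches.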
Alert Grouping
Grouping related alerts reduces notification noise:
- Group by service, job, or alertname
- Set appropriate group_wait and group_interval values
- Consider the tradeoff between reduced noise and potential delayed notifications
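Conceptually, grouping buckets alerts by the values of the `group_by` labels, and each bucket becomes one notification. The Python sketch below illustrates only that bucketing step, under the assumption that an alert is represented as a plain dictionary of labels; it does not model the `group_wait` or `group_interval` timers.

```python
from collections import defaultdict


def group_alerts(
    alerts: list[dict[str, str]], group_by: list[str]
) -> dict[tuple[str, ...], list[dict[str, str]]]:
    """Bucket alerts by the values of the group_by labels (missing labels count as empty)."""
    groups: dict[tuple[str, ...], list[dict[str, str]]] = defaultdict(list)
    for labels in alerts:
        key = tuple(labels.get(name, "") for name in group_by)
        groups[key].append(labels)
    return dict(groups)
```

Grouping three `HighErrorRate` alerts from different instances by `['alertname', 'job']` yields a single notification listing all three instances, instead of three separate pages.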
Inhibiting Redundant Alerts
Prevent less important alerts from firing when a related critical alert is active:
# alertmanager.yml (inhibit rules)
inhibit_rules:
  # Inhibit all warning-level alerts if there's a critical alert with the same alertname and instance
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']
  # Inhibit service-specific alerts if the whole cluster is down
  - source_match:
      alertname: 'ClusterDown'
    target_match_re:
      alertname: 'Service.*Down'
    equal: ['cluster']
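The logic of an inhibit rule can be sketched in Python as follows: a target alert is muted when it matches the target side of the rule, some other firing alert matches the source side, and both carry the same values for every label listed in `equal`. This is a simplified model with hypothetical helper names, not Alertmanager's code; alerts are plain label dictionaries and rules reuse the keys from the YAML above.

```python
import re


def _matches(labels: dict[str, str], exact: dict[str, str], regex: dict[str, str]) -> bool:
    return all(labels.get(k) == v for k, v in exact.items()) and all(
        re.fullmatch(pattern, labels.get(k, "")) for k, pattern in regex.items()
    )


def is_inhibited(target: dict[str, str], firing: list[dict[str, str]], rule: dict) -> bool:
    """Return True if `target` should be muted by `rule`, given the currently firing alerts."""
    if not _matches(target, rule.get("target_match", {}), rule.get("target_match_re", {})):
        return False
    for source in firing:
        if source is target:
            continue  # an alert does not inhibit itself
        if _matches(source, rule.get("source_match", {}), rule.get("source_match_re", {})) and all(
            source.get(label) == target.get(label) for label in rule["equal"]
        ):
            return True
    return False


# The second rule from the configuration above.
CLUSTER_RULE = {
    "source_match": {"alertname": "ClusterDown"},
    "target_match_re": {"alertname": "Service.*Down"},
    "equal": ["cluster"],
}
```

The `equal` list is what keeps inhibition scoped: a `ClusterDown` alert for one cluster mutes service alerts in that cluster only.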
Building Runbooks for Alert Response
Runbooks provide standardized procedures for responding to specific alerts, reducing mean time to resolution.
Anatomy of an Effective Runbook
- Alert Description - What the alert means and why it matters
- Diagnostic Steps - How to investigate the root cause
- Common Causes - Frequently encountered reasons for this alert
- Resolution Steps - Actions to take to solve the problem
- Escalation Path - Who to contact if you can't resolve it
- Prevention Measures - How to avoid this issue in the future
Sample Runbook: High API Error Rate
Runbook: High API Error Rate
Alert Description: This alert triggers when the API error rate (HTTP 5xx responses) exceeds 5% for more than 2 minutes.
Impact: Users may experience failed requests and service disruption. High priority for customer-facing APIs.
Diagnostic Steps:
- Check the API logs for error patterns:
  kubectl logs -l app=api-service -n production | grep ERROR
- Verify database connectivity:
  kubectl exec api-pod -- curl database:5432
- Check dependent service health:
  curl https://metrics.example.com/health/dependencies
- Verify recent deployments:
  kubectl describe deployment api-service -n production
Common Causes:
- Database connection issues (timeouts, connection limits)
- Dependency service failures
- Recent deployment introducing bugs
- Resource exhaustion (memory leaks, CPU throttling)
Resolution Steps:
- If related to a recent deployment, roll back:
  kubectl rollout undo deployment api-service -n production
- If database connection issues:
  - Check connection pool settings
  - Verify database health
  - Restart the API service if necessary
- If dependency failure, check the dependent service's health and logs
- If resource exhaustion:
  - Scale up the deployment:
    kubectl scale deployment api-service --replicas=5 -n production
  - Restart problematic pods:
    kubectl delete pod [pod-name] -n production
Escalation Path:
- If unable to resolve within 15 minutes, escalate to the Database team (if DB-related) or Platform team (if infrastructure-related)
- For persistent issues, engage the Development team lead
Prevention Measures:
- Implement circuit breakers for dependency calls
- Add more comprehensive pre-deployment testing
- Set up database connection pooling monitoring
- Configure auto-scaling based on request load
Automating Runbooks
For common issues with well-defined solutions, consider automating the response:
- Use webhooks to trigger automated remediation scripts
- Implement auto-scaling based on alert conditions
- Create self-healing systems where possible
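As a rough illustration of webhook-driven remediation, the Python sketch below takes the JSON body that Alertmanager posts to a webhook receiver (a top-level `alerts` list whose entries carry `status`, `labels`, and `annotations`) and decides which commands to run. The `REMEDIATIONS` table and the `ApiResourceExhaustion` alert name are hypothetical, and the function deliberately returns the commands instead of executing them; a real handler would add safeguards such as rate limits and audit logging before running anything.

```python
# Hypothetical mapping from alert name to a remediation command (argv form).
REMEDIATIONS = {
    "ApiResourceExhaustion": [
        "kubectl", "scale", "deployment", "api-service", "--replicas=5", "-n", "production",
    ],
}


def plan_remediations(payload: dict) -> list[list[str]]:
    """Return the commands to run for firing alerts that have a known remediation."""
    commands = []
    for alert in payload.get("alerts", []):
        if alert.get("status") != "firing":
            continue  # never act on resolved alerts
        name = alert.get("labels", {}).get("alertname")
        if name in REMEDIATIONS:
            commands.append(REMEDIATIONS[name])
    return commands
```

Keeping the decision step separate from execution makes the automation easy to test and lets you run it in a dry-run mode before trusting it with production changes.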
Alert Fatigue Prevention Strategies
Measuring Alert Quality
Track these metrics to gauge the health of your alerting system:
- Alert frequency - How often alerts trigger
- Signal-to-noise ratio - Percentage of alerts that required action
- Mean time to acknowledge/resolve - How quickly teams respond
- Repeat alerts - Alerts that fire repeatedly for the same issue
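These metrics can be computed from a simple export of past alerts. The Python sketch below assumes a hypothetical `AlertRecord` shape (name, fire time, acknowledgement time, and whether the alert needed action); real incident tools expose similar fields under their own names.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertRecord:
    name: str
    fired_at: float                   # epoch seconds
    acknowledged_at: Optional[float]  # None if never acknowledged
    actionable: bool                  # did someone have to do something?


def summarize(records: list[AlertRecord]) -> dict:
    """Compute alert volume, actionable share, mean time to acknowledge, and repeat offenders."""
    ack_times = [
        r.acknowledged_at - r.fired_at for r in records if r.acknowledged_at is not None
    ]
    counts = Counter(r.name for r in records)
    return {
        "total_alerts": len(records),
        "actionable_ratio": (
            sum(r.actionable for r in records) / len(records) if records else 0.0
        ),
        "mean_seconds_to_ack": sum(ack_times) / len(ack_times) if ack_times else None,
        "repeat_alerts": {name: n for name, n in counts.items() if n > 1},
    }
```

A low actionable ratio or a long list of repeat alerts is a direct pointer to the rules worth tuning or deleting first.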
Strategies to Reduce Alert Noise
- Set appropriate thresholds - Balance between capturing issues and avoiding false positives
- Add delay periods (the for clause in Prometheus rules) - Require conditions to persist before alerting
- Implement smart grouping - Combine related alerts into a single notification
- Use maintenance windows - Suppress alerts during planned maintenance
- Implement alert hierarchies - Only alert on user-impacting symptoms
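The effect of a delay period can be shown with a small model of how a condition must hold continuously before an alert fires. This Python sketch is a simplified illustration of the idea behind the Prometheus `for` clause, not its actual implementation; the function name and the `(timestamp, condition_true)` input format are assumptions.

```python
def should_fire(
    evaluations: list[tuple[float, bool]], now: float, for_seconds: float
) -> bool:
    """Decide whether an alert fires, given one (timestamp, condition_true)
    entry per rule evaluation, sorted by time.

    The alert fires only if the condition has been true at every evaluation
    since some point at least `for_seconds` before `now`.
    """
    active_since = None
    for timestamp, condition_true in evaluations:
        if condition_true:
            if active_since is None:
                active_since = timestamp  # condition just became true: start the clock
        else:
            active_since = None  # any false evaluation resets the clock
    return active_since is not None and now - active_since >= for_seconds
```

A brief spike that clears before the delay elapses never pages anyone, which is the noise reduction the strategy aims for; the cost is that real incidents are reported `for_seconds` later.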
Regular Alert Reviews
Schedule regular reviews of your alerts:
- Review alert frequency and patterns
- Adjust thresholds based on data
- Remove or modify alerts that consistently generate noise
- Update runbooks based on real incidents
Practical Exercise: Setting Up a Complete Alerting System
Exercise Overview
In this exercise, you'll build a comprehensive alerting system for a web application:
- Set up Prometheus and Alertmanager using Docker Compose
- Configure alert rules for common scenarios (high error rate, service unavailability, resource exhaustion)
- Integrate with multiple notification channels (Slack, email)
- Create runbooks for each alert type
- Test the system by simulating failure conditions
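For the last step, one option is to push a synthetic alert straight into Alertmanager so that routing and notification channels can be checked without breaking anything. The Python sketch below builds a payload for Alertmanager's `POST /api/v2/alerts` endpoint; the helper names are illustrative, and the base URL assumes Alertmanager is reachable on `localhost:9093` as in the Docker Compose setup.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone
from typing import Optional


def build_test_alert(
    alertname: str,
    severity: str,
    instance: str,
    duration_minutes: int = 5,
    now: Optional[datetime] = None,
) -> list[dict]:
    """Build a one-element alert list that expires after `duration_minutes`."""
    now = now or datetime.now(timezone.utc)
    return [
        {
            "labels": {"alertname": alertname, "severity": severity, "instance": instance},
            "annotations": {"summary": f"Test alert: {alertname} on {instance}"},
            "startsAt": now.isoformat(),
            "endsAt": (now + timedelta(minutes=duration_minutes)).isoformat(),
        }
    ]


def send_alerts(alerts: list[dict], base_url: str = "http://localhost:9093") -> int:
    """POST the alerts to Alertmanager and return the HTTP status code."""
    request = urllib.request.Request(
        f"{base_url}/api/v2/alerts",
        data=json.dumps(alerts).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status
```

Calling `send_alerts(build_test_alert("HighErrorRate", "critical", "api-server-01:8080"))` should make a notification appear in whichever channel the routing tree selects for those labels. This exercises Alertmanager only; to test the Prometheus rules themselves, you still need to produce the failing metrics.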
Required Resources
- Docker and Docker Compose
- A sample web application (provided)
- Slack workspace with webhook permissions (for notifications)
- Text editor for configuration files
For detailed exercise instructions and starter code, refer to the exercise repository: Alerting Workshop Repository (Example URL)
Conclusion and Key Takeaways
- Effective alerting is about finding the balance between awareness and noise
- Focus on alerting for user-impacting issues
- Make every alert actionable with clear runbooks
- Use the right notification channel for the right severity level
- Regularly review and refine your alerting strategy
- Consider the human factors in alert design
Remember: The goal of an alerting system is not just to notify you when things break, but to give you the context and tools to fix problems quickly and prevent them in the future.
Additional Resources
Documentation
- Prometheus Alertmanager Documentation
- Grafana Alerting Documentation
- Elasticsearch Alerting Documentation
Books
- "Practical Monitoring" by Mike Julian
- "Site Reliability Engineering" by Google SRE Team
- "Implementing Service Level Objectives" by Alex Hidalgo
Next Lecture Preview: Production Deployment
In our next session, we'll explore how to prepare your application for production deployment, covering:
- Production readiness checklists
- SSL certificate management
- Domain configuration and DNS
- Load balancing and high availability
- Final security audits