Introduction to Application Monitoring
Application monitoring is the practice of observing, tracking, and analyzing the behavior and performance of software applications in real time. It enables teams to identify issues, optimize performance, and ensure applications meet user expectations and business requirements. In today's complex distributed systems, effective monitoring is essential for maintaining reliability and delivering a good user experience.
Modern application monitoring has evolved from simple uptime checks to observability solutions that expose application behavior, performance patterns, and user experience. As applications span multiple services, cloud providers, and infrastructure layers, monitoring systems must track interdependencies and provide actionable insights when issues arise.
The Healthcare Analogy
Application monitoring can be compared to modern healthcare:
- Infrastructure Monitoring is like checking vital signs (heart rate, blood pressure, temperature). These basic metrics tell us if the system is alive and functioning at a fundamental level.
- Application Performance Monitoring is similar to diagnostic tests (blood work, imaging). These give us deeper insights into how well internal systems are functioning.
- Error Tracking resembles monitoring symptoms and complaints. Just as a doctor wants to know what hurts and when it started, developers need to understand error patterns and their contexts.
- Distributed Tracing is like tracking the spread of a medication through the body. It helps us understand how requests flow through a complex system.
- Alerting functions like a hospital's monitoring system that notifies staff when a patient's condition requires attention.
- Proactive Monitoring is similar to preventive medicine—identifying potential issues before they cause noticeable symptoms.
Just as modern medicine has evolved from treating symptoms to understanding whole-body health and prevention, application monitoring has evolved from simple uptime checks to comprehensive observability that provides context and insights into the entire application ecosystem.
Core Concepts in Application Monitoring
The Three Pillars of Observability
Observability—the ability to understand a system's internal state based on its external outputs—rests on three key pillars:
Metrics
Metrics are numerical values measured over time that represent the state or performance of a system component. They are typically:
- Collected at regular intervals
- Numerical and aggregatable (can be summed, averaged, etc.)
- Low overhead and suitable for long-term storage
- Ideal for trend analysis and alerting
Common Metric Types
| Type | Description | Examples |
|---|---|---|
| Counter | Cumulative value that only increases | Request count, error count, completed tasks |
| Gauge | Single numerical value that can go up and down | Memory usage, CPU utilization, active connections |
| Histogram | Distribution of values across configurable buckets | Response time distribution, request size distribution |
| Summary | Like a histogram, but computes configurable quantiles on the client over a sliding time window | 90th/95th/99th percentile of response times |
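The difference between these types is easiest to see in a few lines of code. The following is a minimal, standard-library-only sketch of counter, gauge, and histogram semantics; the class names mirror the table but are illustrative and are not the API of any particular client library.

```python
import bisect
from dataclasses import dataclass, field
from typing import List


@dataclass
class Counter:
    """Cumulative value that only goes up."""
    value: float = 0.0

    def inc(self, amount: float = 1.0) -> None:
        if amount < 0:
            raise ValueError("counters can only increase")
        self.value += amount


@dataclass
class Gauge:
    """Point-in-time value that can go up or down."""
    value: float = 0.0

    def set(self, value: float) -> None:
        self.value = value


@dataclass
class Histogram:
    """Counts observations into buckets defined by sorted upper bounds."""
    bounds: List[float]
    counts: List[int] = field(init=False)
    total: float = 0.0

    def __post_init__(self) -> None:
        # One extra slot acts as the catch-all (+Inf) bucket.
        self.counts = [0] * (len(self.bounds) + 1)

    def observe(self, value: float) -> None:
        # bisect_left finds the first bucket whose upper bound is >= value.
        self.counts[bisect.bisect_left(self.bounds, value)] += 1
        self.total += value

    def cumulative(self) -> List[int]:
        """Bucket counts as running totals (each bucket includes smaller ones)."""
        running, out = 0, []
        for count in self.counts:
            running += count
            out.append(running)
        return out


requests_total = Counter()
requests_total.inc()
requests_total.inc()

active_connections = Gauge()
active_connections.set(12)
active_connections.set(9)

latency = Histogram(bounds=[0.1, 0.5, 1.0])
for seconds in (0.1, 0.3, 2.0):
    latency.observe(seconds)
```

Storing bucket counts rather than raw observations is what keeps histograms cheap to aggregate across instances: counts for the same bounds can simply be added together.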
Logs
Logs are time-stamped records of discrete events that occur within an application. They provide context and details about what happened at a specific moment. Logs typically include:
- Timestamp
- Severity level (INFO, WARN, ERROR, etc.)
- Source (service, component, function)
- Message and contextual data
Structured vs. Unstructured Logging
Unstructured Log:
[2025-05-08 14:32:17] ERROR Failed to process payment for order #12345 - Credit card declined
Structured Log (JSON format):
{
  "timestamp": "2025-05-08T14:32:17.345Z",
  "level": "ERROR",
  "service": "payment-service",
  "traceId": "abc123def456",
  "message": "Failed to process payment",
  "orderId": "12345",
  "errorCode": "CC_DECLINED",
  "paymentMethod": "credit_card",
  "customer": {
    "id": "cust_9876",
    "accountType": "premium"
  }
}
Structured logs are more machine-readable and enable more powerful filtering, searching, and analysis.
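One way to produce logs in this shape is a custom formatter on top of Python's standard `logging` module. The sketch below is illustrative: the `JsonFormatter` class, the hard-coded service name, and the convention of passing fields under a `context` key are assumptions for this example, not a standard.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "service": "payment-service",  # assumed service name
            "message": record.getMessage(),
        }
        # Fields passed via extra={"context": {...}} are merged into the output.
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("payment")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False  # avoid duplicate output through the root logger

logger.error(
    "Failed to process payment",
    extra={"context": {"orderId": "12345", "errorCode": "CC_DECLINED"}},
)
```

Because every field is a separate key, a log backend can filter on `errorCode` or `orderId` directly instead of parsing the message text.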
Traces
Traces track the flow of requests as they propagate through distributed services. They help developers understand:
- The path of a request through multiple services
- The time spent in each component or service
- Dependencies between services
- The root cause of performance bottlenecks or failures
A trace consists of multiple spans, each representing one segment of the request's journey. Each span includes:
- Start and end times
- Service/component information
- Operation name
- Parent-child relationships
- Contextual data (parameters, results, errors)
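The span fields listed above map naturally onto a small data structure. The following sketch models a trace as a flat list of spans linked by parent IDs and computes each span's "self time" (its duration minus that of its direct children) to locate a bottleneck. The field names, service names, and timings are made up for illustration, and the calculation assumes child spans do not overlap.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Span:
    trace_id: str
    span_id: str
    name: str
    service: str
    start_ms: float
    end_ms: float
    parent_id: Optional[str] = None
    attributes: Dict[str, str] = field(default_factory=dict)

    @property
    def duration_ms(self) -> float:
        return self.end_ms - self.start_ms


def self_time_ms(span: Span, spans: List[Span]) -> float:
    """Time spent in the span itself, excluding its direct children."""
    children = sum(s.duration_ms for s in spans if s.parent_id == span.span_id)
    return span.duration_ms - children


# A hypothetical three-span trace for one checkout request.
root = Span("t1", "a", "GET /checkout", "api-gateway", 0, 120)
trace = [
    root,
    Span("t1", "b", "charge-card", "payment-service", 10, 90, parent_id="a"),
    Span("t1", "c", "db-insert", "order-service", 95, 115, parent_id="a"),
]

slowest = max(trace, key=lambda s: self_time_ms(s, trace))
```

Here the root span lasts 120 ms but spends only 20 ms on its own work; most of the time is attributable to `charge-card`.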
Effective Monitoring Strategies
The RED Method
The RED method, introduced by Tom Wilkie at Weaveworks, focuses on three key metrics for monitoring service health from a user perspective:
- Rate: Requests per second
- Errors: Number of failed requests per second
- Duration: Distribution of request latencies
These metrics provide a user-centric view of service health and performance, making them ideal for customer-facing services.
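To make the three quantities concrete, the sketch below derives them from a list of raw request records over a fixed window. The `Request` type, the choice to count only 5xx responses as errors, and the nearest-rank percentile are assumptions made for this example; a metrics backend would normally compute these from counters and histograms instead.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Request:
    status: int
    duration_s: float


def percentile(values: List[float], q: float) -> Optional[float]:
    """Nearest-rank percentile for q in (0, 100]; None for an empty list."""
    if not values:
        return None
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def red_summary(requests: List[Request], window_s: float) -> Dict[str, Optional[float]]:
    durations = [r.duration_s for r in requests]
    failed = sum(1 for r in requests if r.status >= 500)
    return {
        "rate_per_s": len(requests) / window_s,
        "errors_per_s": failed / window_s,
        "error_ratio": failed / len(requests) if requests else 0.0,
        "p50_s": percentile(durations, 50),
        "p95_s": percentile(durations, 95),
    }


# Ten requests observed over a 5-second window; one of them failed.
sample = [Request(200, i / 10) for i in range(1, 10)] + [Request(500, 1.0)]
summary = red_summary(sample, window_s=5)
```

Reporting percentiles rather than an average matters for Duration: a handful of slow requests can hide behind a healthy-looking mean.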
Implementing RED Metrics with Prometheus
// Example using the Prometheus client for Node.js (prom-client)
const express = require('express');
const promClient = require('prom-client');

const app = express();

// Create a Registry to register metrics
const register = new promClient.Registry();
promClient.collectDefaultMetrics({ register });

// Create RED metrics and attach them to the registry
const httpRequestDurationSeconds = new promClient.Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [0.01, 0.05, 0.1, 0.5, 1, 2, 5, 10],
  registers: [register]
});
const httpRequestCounter = new promClient.Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'route', 'status_code'],
  registers: [register]
});

// Middleware to track metrics (must be registered before the routes)
app.use((req, res, next) => {
  const start = process.hrtime();
  // Record metrics when the response is finished
  res.on('finish', () => {
    const duration = process.hrtime(start);
    const durationInSeconds = duration[0] + duration[1] / 1e9;
    // Use the matched route pattern (e.g. /users/:id) rather than the raw
    // path so that label cardinality stays bounded
    const route = req.route ? req.baseUrl + req.route.path : 'unmatched';
    const statusCode = res.statusCode.toString();
    httpRequestDurationSeconds
      .labels(req.method, route, statusCode)
      .observe(durationInSeconds);
    httpRequestCounter
      .labels(req.method, route, statusCode)
      .inc();
  });
  next();
});

// Expose metrics endpoint for Prometheus to scrape
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});

// Application routes
app.get('/', (req, res) => {
  res.send('Hello World!');
});

app.listen(3000, () => {
  console.log('Server listening on port 3000');
});
The USE Method
The USE method, developed by Brendan Gregg, focuses on resources rather than services:
- Utilization: Percentage of time the resource is busy
- Saturation: Degree to which the resource has extra work it cannot service yet, often visible as a queue
- Errors: Count of error events
This approach is well-suited for monitoring infrastructure components like CPU, memory, network, and disk.
USE Method Application Examples
| Resource | Utilization | Saturation | Errors |
|---|---|---|---|
| CPU | CPU usage % | Run queue length | System errors, CPU-related errors |
| Memory | Memory usage % | Swap usage, OOM events | Memory allocation errors |
| Disk I/O | Disk busy % | I/O queue length | Disk errors, timeouts |
| Network | Bandwidth usage % | Packet queue length | Network errors, retransmits, drops |
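A rough version of two of these checks can be written with Python's standard library alone. The sketch below is an approximation under stated assumptions: it uses the 1-minute load average per core as a stand-in for CPU saturation (available on Unix-like systems only) and reports disk capacity utilization, which is not the same as the disk busy percentage in the table. The function names are illustrative.

```python
import os
import shutil
from typing import Optional


def cpu_saturation() -> Optional[float]:
    """1-minute load average per core; values above 1.0 suggest queuing.

    Returns None where the load average is unavailable (e.g. on Windows).
    """
    try:
        load_1m, _, _ = os.getloadavg()
    except (AttributeError, OSError):
        return None
    return load_1m / (os.cpu_count() or 1)


def disk_capacity_utilization(path: str = "/") -> float:
    """Fraction of the filesystem at `path` that is in use (0.0 to 1.0)."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


saturation = cpu_saturation()
utilization = disk_capacity_utilization("/")
print(f"cpu saturation (load per core): {saturation}")
print(f"disk capacity utilization: {utilization:.1%}")
```

In production these numbers would normally come from an agent such as a node exporter rather than ad hoc scripts; the point here is only to show what the Utilization and Saturation columns measure.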
The Four Golden Signals
Google's Site Reliability Engineering (SRE) book outlines four key metrics for monitoring service health:
- Latency: Time taken to serve a request
- Traffic: Demand on the system (requests per second)
- Errors: Rate of failed requests
- Saturation: How "full" the service is (resource utilization)
This approach combines aspects of both RED and USE methods and is suitable for most applications and services.
Real-World Example: E-commerce Monitoring Strategy
A large e-commerce platform implements the Four Golden Signals across its architecture:
- Latency Monitoring:
  - p50/p90/p99 response times for all API endpoints
  - Frontend page load times by device type
  - Database query latency for critical operations
- Traffic Tracking:
  - Requests per second by service and endpoint
  - User sessions per minute
  - Search queries and product views per minute
- Error Monitoring:
  - HTTP 5xx and 4xx error rates
  - Failed payment transactions
  - Add-to-cart failures
  - Checkout abandonment rate
- Saturation Measurement:
  - Database connection pool usage
  - Queue depths for asynchronous processes
  - Memory usage of key services
  - CPU utilization across the infrastructure
These signals are monitored in real time, with automated alerts that trigger when thresholds are exceeded. During peak shopping events like Black Friday, the team adjusts alert thresholds to account for the expected higher traffic and load.
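Adjusting thresholds for a peak event can be as simple as scaling only the limits that are expected to grow with traffic. The sketch below illustrates that idea; every signal name, limit, and multiplier in it is hypothetical and chosen only for the example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Threshold:
    signal: str
    limit: float
    scales_with_traffic: bool


# Hypothetical baseline limits, one per golden signal.
BASELINE = [
    Threshold("latency_p99_s", 2.0, scales_with_traffic=False),
    Threshold("traffic_rps", 5000, scales_with_traffic=True),
    Threshold("error_ratio", 0.05, scales_with_traffic=False),
    Threshold("db_pool_usage", 0.9, scales_with_traffic=False),
]


def effective_limit(threshold: Threshold, peak_multiplier: float = 1.0) -> float:
    """Raise only the limits that are expected to grow during a peak event."""
    if threshold.scales_with_traffic:
        return threshold.limit * peak_multiplier
    return threshold.limit


def breaches(current: Dict[str, float], peak_multiplier: float = 1.0) -> List[str]:
    """Names of the signals whose current value exceeds the effective limit."""
    return [
        t.signal
        for t in BASELINE
        if current.get(t.signal, 0.0) > effective_limit(t, peak_multiplier)
    ]


current = {"latency_p99_s": 1.2, "traffic_rps": 8000, "error_ratio": 0.01, "db_pool_usage": 0.95}
normal_day = breaches(current)
black_friday = breaches(current, peak_multiplier=2.0)
```

Note that user-facing limits such as latency and error ratio are deliberately left unscaled: higher traffic is expected during a sale, but a worse experience is not.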
Application Instrumentation
Instrumentation is the process of adding code to your application to collect monitoring data. Effective instrumentation is foundational to any monitoring strategy.
Types of Instrumentation
Manual Instrumentation
- Developers explicitly add monitoring code
- Provides precise control over what is monitored
- Can be tailored to business-specific metrics
- Requires ongoing maintenance
- Risk of inconsistent implementation
Automatic Instrumentation
- Added by monitoring libraries or agents
- Quick to implement across applications
- Consistent implementation
- Less flexible for custom metrics
- May miss application-specific contexts
Key Components to Instrument
- HTTP/API Endpoints: Track request rates, latencies, and error rates
- Database Queries: Monitor query execution times and failure rates
- External Service Calls: Track dependencies on third-party services
- Background Jobs: Monitor execution times and success/failure rates
- Critical Business Transactions: Track user-centric operations
- Resource Usage: Monitor CPU, memory, and other resource consumption
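For components such as background jobs and external service calls, a decorator is a common way to add timing and success/failure counts without touching the business logic. The sketch below keeps results in in-memory dictionaries for simplicity; the `instrumented` decorator, the `CALLS` and `DURATIONS` stores, and the example job are all illustrative names, and a real setup would forward these values to a metrics library.

```python
import functools
import time
from collections import defaultdict

CALLS = defaultdict(int)       # (operation, outcome) -> number of calls
DURATIONS = defaultdict(list)  # operation -> observed durations in seconds


def instrumented(operation: str):
    """Record duration and success/failure for every call of the wrapped function."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            outcome = "success"
            try:
                return func(*args, **kwargs)
            except Exception:
                outcome = "failure"
                raise  # instrumentation must not swallow the error
            finally:
                DURATIONS[operation].append(time.perf_counter() - start)
                CALLS[(operation, outcome)] += 1
        return wrapper
    return decorator


@instrumented("send_receipt_email")
def send_receipt_email(order_id: str) -> str:
    if not order_id:
        raise ValueError("order_id is required")
    return f"queued receipt for {order_id}"


send_receipt_email("12345")
try:
    send_receipt_email("")
except ValueError:
    pass
```

Recording in the `finally` block ensures failed calls are timed and counted too, which is usually where the interesting latency outliers are.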
Manual Instrumentation Examples
Express.js with OpenTelemetry:
const { trace, context, metrics, SpanStatusCode } = require('@opentelemetry/api');
const express = require('express');

const app = express();

// Initialize OpenTelemetry (setup code not shown).
// `database` and `logger` stand in for the application's own modules.
const tracer = trace.getTracer('product-service');
const productViews = metrics.getMeter('product-service').createCounter('product.views');

app.get('/api/products/:id', async (req, res) => {
  // Start a new span for this endpoint
  const span = tracer.startSpan('get-product-by-id');
  // Add context to the span
  span.setAttribute('product.id', req.params.id);
  try {
    // Create a child span for the database operation; passing a context
    // that carries the parent span is what links the two spans
    const parentContext = trace.setSpan(context.active(), span);
    const dbSpan = tracer.startSpan('database-query', undefined, parentContext);
    let product;
    try {
      product = await database.getProduct(req.params.id);
      dbSpan.setAttribute('db.result_size', JSON.stringify(product).length);
    } catch (dbError) {
      dbSpan.recordException(dbError);
      dbSpan.setStatus({ code: SpanStatusCode.ERROR, message: dbError.message });
      throw dbError;
    } finally {
      dbSpan.end();
    }
    // Record business metrics
    productViews.add(1, { 'product.id': req.params.id });
    res.json(product);
  } catch (error) {
    // Record the error on the span
    span.recordException(error);
    span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
    span.setAttribute('error.type', error.constructor.name);
    // Log the error
    logger.error('Failed to retrieve product', {
      error: error.message,
      productId: req.params.id,
      traceId: span.spanContext().traceId
    });
    res.status(500).json({ error: 'Failed to retrieve product' });
  } finally {
    // End the span
    span.end();
  }
});
Python with custom metrics:
import time

from flask import Flask, Response, request
from prometheus_client import CONTENT_TYPE_LATEST, Counter, Histogram, generate_latest

app = Flask(__name__)

# Define metrics
REQUEST_COUNT = Counter(
    'request_count', 'App Request Count',
    ['app_name', 'endpoint', 'method', 'http_status']
)
REQUEST_LATENCY = Histogram(
    'request_latency_seconds', 'Request latency in seconds',
    ['app_name', 'endpoint', 'method']
)

# Middleware to measure request latency and count
@app.before_request
def start_timer():
    request.start_time = time.time()

@app.after_request
def record_metrics(response):
    endpoint = request.endpoint or 'unknown'
    request_latency = time.time() - request.start_time
    REQUEST_LATENCY.labels('my_app', endpoint, request.method).observe(request_latency)
    REQUEST_COUNT.labels('my_app', endpoint, request.method, response.status_code).inc()
    return response

@app.route('/api/users/<user_id>')
def get_user(user_id):
    # Application logic here
    return {'id': user_id, 'name': 'Example User'}

# Metrics endpoint for Prometheus to scrape
@app.route('/metrics')
def metrics():
    return Response(generate_latest(), content_type=CONTENT_TYPE_LATEST)

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)
OpenTelemetry Standard
OpenTelemetry is an open-source observability framework that provides a standardized way to instrument applications across different languages and platforms.
Benefits of OpenTelemetry
- Vendor Neutrality: A single instrumentation that works with multiple backends
- Consistent Cross-Service Tracing: Standardized context propagation
- Comprehensive Coverage: Metrics, logs, and traces in one framework
- Language Agnostic: Supports most major programming languages
- Extensive Library Support: Auto-instrumentation for common frameworks
- Future-Proof: Increasingly becoming the industry standard
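Standardized context propagation means each service forwards a trace identifier with its outgoing requests. OpenTelemetry's default propagator uses the W3C Trace Context `traceparent` header, whose version-00 layout is `00-<trace-id>-<parent-span-id>-<flags>`. The sketch below builds and continues such a header by hand purely to show the mechanism; the function names are illustrative, and in practice the OpenTelemetry SDK's propagators do this for you.

```python
import re
import secrets
from typing import Optional, Tuple

# version 00: 32 hex chars of trace ID, 16 hex chars of parent span ID, 2 hex chars of flags
TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled: bool = True) -> str:
    """Start a new trace: fresh trace ID and span ID."""
    flags = "01" if sampled else "00"
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-{flags}"


def parse_traceparent(header: str) -> Optional[Tuple[str, str, str]]:
    """Return (trace_id, parent_id, flags), or None if the header is invalid."""
    match = TRACEPARENT.match(header.strip())
    if not match:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero IDs are not valid identifiers.
    if set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return trace_id, parent_id, flags


def child_traceparent(incoming: str) -> str:
    """Header for an outgoing call: same trace ID, new span ID."""
    parsed = parse_traceparent(incoming)
    if parsed is None:
        return new_traceparent()  # no usable context, so start a new trace
    trace_id, _parent_id, flags = parsed
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


incoming = new_traceparent()
outgoing = child_traceparent(incoming)
```

Because the trace ID is preserved across every hop while the span ID changes, a tracing backend can stitch spans from different services into one trace.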
OpenTelemetry Setup Example (Node.js)
// tracing.js - OpenTelemetry setup file
// Note: this example targets the 1.x OpenTelemetry JS SDK packages. Newer
// releases change parts of this API, and the Jaeger exporter package has been
// deprecated in favor of OTLP exporters, so check the versions you install.
const { NodeTracerProvider } = require('@opentelemetry/sdk-trace-node');
const { Resource } = require('@opentelemetry/resources');
const { SemanticResourceAttributes } = require('@opentelemetry/semantic-conventions');
const { BatchSpanProcessor } = require('@opentelemetry/sdk-trace-base');
const { JaegerExporter } = require('@opentelemetry/exporter-jaeger');
const { registerInstrumentations } = require('@opentelemetry/instrumentation');
const { ExpressInstrumentation } = require('@opentelemetry/instrumentation-express');
const { HttpInstrumentation } = require('@opentelemetry/instrumentation-http');
const { MongoDBInstrumentation } = require('@opentelemetry/instrumentation-mongodb');

// Create and configure the OpenTelemetry tracer provider
const provider = new NodeTracerProvider({
  resource: new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: 'my-service',
    [SemanticResourceAttributes.SERVICE_VERSION]: '1.0.0',
    environment: process.env.NODE_ENV || 'development'
  })
});

// Configure span processor and exporter
const jaegerExporter = new JaegerExporter({
  endpoint: 'http://jaeger:14268/api/traces',
});
provider.addSpanProcessor(new BatchSpanProcessor(jaegerExporter));

// Register the provider
provider.register();

// Automatically instrument Express, HTTP, and MongoDB
registerInstrumentations({
  instrumentations: [
    new ExpressInstrumentation(),
    new HttpInstrumentation(),
    new MongoDBInstrumentation()
  ]
});

console.log('OpenTelemetry tracing initialized');

// In your main application file:
// require('./tracing'); // Add this at the top of your entry point file
Application Monitoring Tools
Tool Categories
Open-Source Monitoring Stack
Many organizations implement a monitoring stack based on open-source tools:
- OpenTelemetry -> Prometheus, Loki, Tempo/Jaeger
- Prometheus -> Alertmanager, Grafana
- Loki -> Grafana
- Tempo/Jaeger -> Grafana
- Alertmanager -> Alert Channels
- Grafana -> Dashboards
Common Open-Source Monitoring Components
- Prometheus: Time-series database for metrics with a powerful query language
- Grafana: Visualization platform for metrics, logs, and traces
- Loki: Log aggregation system designed to work with Prometheus and Grafana
- Jaeger/Tempo: Distributed tracing systems for tracking request flows
- Alertmanager: Handles alerts from Prometheus and routes them to receivers
- OpenTelemetry Collector: Receives, processes, and exports telemetry data
Commercial APM Tools
Commercial Application Performance Monitoring (APM) tools provide integrated solutions that combine metrics, logs, and traces with additional features:
| Tool | Key Features | Best For |
|---|---|---|
| New Relic | Full-stack observability, NRQL query language, AI-assisted analysis | Web applications, microservices, comprehensive monitoring |
| Datadog | Infrastructure monitoring, APM, log management, RUM, security | Complex environments, multi-cloud, hybrid infrastructures |
| Dynatrace | AI-powered monitoring, auto-discovery, root cause analysis | Enterprise applications, complex dependencies, automated analysis |
| AppDynamics | Business transaction monitoring, code-level diagnostics | Enterprise applications, business metric correlation |
| Elastic APM | Open-core APM integrated with Elasticsearch, Kibana | Organizations already using the Elastic Stack |
Real-World Example: E-commerce Platform Monitoring Setup
A mid-size e-commerce company with a microservices architecture implemented a hybrid monitoring approach:
- Infrastructure Monitoring: Prometheus and Grafana for all infrastructure metrics
- Log Management: ELK Stack (Elasticsearch, Logstash, Kibana) for centralized logging
- Application Monitoring: New Relic APM for detailed service performance metrics
- Real User Monitoring: New Relic Browser for frontend performance monitoring
- Synthetic Monitoring: Checkly for critical user journey validation
- Alerting: PagerDuty integrated with all monitoring systems
They chose this combination to balance cost and capabilities:
- Prometheus/Grafana covered their extensive infrastructure at a lower cost than commercial alternatives
- ELK provided powerful log analysis capabilities they couldn't find in simpler solutions
- New Relic APM gave them code-level visibility without building their own tracing infrastructure
- PagerDuty provided alert management and on-call rotation that integrated with all systems
This approach allowed them to monitor both technical metrics and business outcomes while controlling costs.
Effective Alerting Strategies
Alerting is the process of notifying relevant personnel when monitoring systems detect abnormal conditions. Effective alerting balances prompt notification of important issues with avoiding alert fatigue.
Alert Types and Priorities
| Priority | Description | Response Time | Examples |
|---|---|---|---|
| P1 (Critical) | Service is down or severely impaired for all users | Immediate (24/7) | Website outage, payment processing failure |
| P2 (High) | Major feature unavailable or significant performance degradation | Within 30 minutes (24/7) | Checkout process slow, search not working |
| P3 (Medium) | Minor feature issues or early warnings of potential problems | Business hours | Elevated error rates, increasing response times |
| P4 (Low) | Non-urgent issues that should be addressed | Next business day | Non-critical component warnings, resource trending toward threshold |
Alerting Best Practices
- Alert on Symptoms, Not Causes: Focus on user-impacting issues
- Reduce Noise: Eliminate redundant alerts for the same issue
- Provide Context: Include relevant information to aid troubleshooting
- Set Appropriate Thresholds: Avoid false positives and negatives
- Implement Alert Routing: Direct alerts to the right teams
- Use Multiple Channels: Ensure critical alerts are received
- Have Escalation Paths: Define what happens if alerts are not acknowledged
- Review and Refine: Regularly audit alert effectiveness
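Two of these practices, noise reduction and alert routing, can be sketched in a few lines. The example below collapses duplicate alerts for the same name and service, then fans each remaining alert out to channels by priority. The `Alert` type, the channel names, and the routing table are hypothetical; tools such as Alertmanager provide grouping and routing as configuration rather than code.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Alert:
    name: str
    service: str
    priority: str  # "P1" through "P4"


# Hypothetical mapping from priority to notification channels.
ROUTES = {
    "P1": ["pager", "chat"],
    "P2": ["pager"],
    "P3": ["chat"],
    "P4": ["ticket"],
}


def deduplicate(alerts: List[Alert]) -> List[Alert]:
    """Keep the first alert for each (name, service) pair, preserving order."""
    seen, unique = set(), []
    for alert in alerts:
        key = (alert.name, alert.service)
        if key not in seen:
            seen.add(key)
            unique.append(alert)
    return unique


def route(alerts: List[Alert]) -> Dict[str, List[Alert]]:
    """Group de-duplicated alerts by the channel that should receive them."""
    deliveries: Dict[str, List[Alert]] = {}
    for alert in deduplicate(alerts):
        for channel in ROUTES.get(alert.priority, ["ticket"]):
            deliveries.setdefault(channel, []).append(alert)
    return deliveries


incoming_alerts = [
    Alert("HighErrorRate", "checkout", "P1"),
    Alert("HighErrorRate", "checkout", "P1"),  # duplicate of the first
    Alert("HighLatency", "search", "P3"),
]
deliveries = route(incoming_alerts)
```

With this input the on-call engineer is paged once for the checkout error rate, while the lower-priority latency alert goes only to chat.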
Prometheus Alerting Rules Example
# prometheus/alerts.yml
# Label names such as "service" and "status" must match your instrumentation
# (the Node.js example above exposes the status as "status_code").
groups:
  - name: service_availability
    rules:
      # Alert when more than 5% of requests fail with a 5xx status
      - alert: HighErrorRate
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) by (service) / sum(rate(http_requests_total[5m])) by (service) > 0.05
        for: 2m
        labels:
          severity: critical
          team: backend
        annotations:
          summary: "High error rate on {{ $labels.service }}"
          description: "Service {{ $labels.service }} has a high error rate: {{ $value | humanizePercentage }}"
          dashboard: "https://grafana.example.com/d/service-overview?var-service={{ $labels.service }}"
      # Alert on high latency
      - alert: HighLatency
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (service, le)) > 2
        for: 5m
        labels:
          severity: warning
          team: backend
        annotations:
          summary: "High latency on {{ $labels.service }}"
          description: "Service {{ $labels.service }} has a 95th percentile latency of {{ $value }} seconds"
          dashboard: "https://grafana.example.com/d/service-performance?var-service={{ $labels.service }}"
  - name: infrastructure
    rules:
      # Alert on high CPU usage (1 minus the idle fraction, averaged over all cores)
      - alert: HighCPUUsage
        expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) > 0.8
        for: 10m
        labels:
          severity: warning
          team: infrastructure
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "Instance {{ $labels.instance }} has had CPU usage above 80% for more than 10 minutes: {{ $value | humanizePercentage }}"
      # Alert on low disk space
      - alert: LowDiskSpace
        expr: node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.1
        for: 5m
        labels:
          severity: warning
          team: infrastructure
        annotations:
          summary: "Low disk space on {{ $labels.instance }}: {{ $labels.mountpoint }}"
          description: "{{ $labels.mountpoint }} on {{ $labels.instance }} has less than 10% free space: {{ $value | humanizePercentage }} available"
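The `for:` clause in these rules means a condition must hold continuously for the given duration before the alert fires, which filters out brief spikes. The sketch below is a simplified model of that behavior over a list of timestamped samples; it is not Prometheus's actual evaluation code, and the sample values are invented.

```python
from typing import List, Optional, Tuple


def firing_at(
    samples: List[Tuple[float, float]], threshold: float, hold_s: float
) -> Optional[float]:
    """Timestamp at which the alert starts firing, or None if it never does.

    `samples` is a time-ordered list of (timestamp_s, value) pairs. The
    condition `value > threshold` must hold without interruption for at
    least `hold_s` seconds.
    """
    pending_since = None
    for timestamp, value in samples:
        if value > threshold:
            if pending_since is None:
                pending_since = timestamp  # condition just became true
            if timestamp - pending_since >= hold_s:
                return timestamp
        else:
            pending_since = None  # any dip resets the pending timer
    return None


# Error ratio sampled every 30 seconds.
brief_spike = [(0, 0.02), (30, 0.06), (60, 0.07), (90, 0.03)]
sustained = brief_spike + [(120, 0.08), (150, 0.09), (180, 0.10), (210, 0.07), (240, 0.06)]

spike_result = firing_at(brief_spike, threshold=0.05, hold_s=120)
sustained_result = firing_at(sustained, threshold=0.05, hold_s=120)
```

The 60-second spike never fires, while the sustained breach that begins at t=120 fires at t=240, once it has lasted the full two minutes.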
Alert Design Principles
- Actionable: Every alert should require action (if not, it's a metric)
- Timely: Alert at the right time (not too early, not too late)
- Precise: Target the specific component with an issue
- Clear Ownership: Define who should respond
- Documented Response: Have a documented playbook for each alert
- Tunable: Easy to adjust thresholds as conditions change
Avoiding Common Alerting Mistakes
- Alert Storms: Multiple alerts for a single root cause
- False Positives: Alerts that don't indicate real problems
- Alert Fatigue: Too many alerts leading to ignored notifications
- Missing Context: Alerts without enough information to diagnose
- Poor Prioritization: Non-critical issues with high severity
- Inconsistent Definitions: Varying alert criteria across services
Building Effective Monitoring Dashboards
Dashboards provide visual representations of monitoring data, helping teams understand system health and performance at a glance.
Dashboard Types and Purposes
- Overview Dashboards: High-level system health
- Service Dashboards: Detailed metrics for specific services
- Business Dashboards: Impact on user experience and business metrics
- On-Call Dashboards: Focus on common issues for incident responders
- Resource Dashboards: Infrastructure utilization and capacity
- Executive Dashboards: Simplified views for leadership
Dashboard Design Principles
- Purpose-Driven: Design for specific user needs and use cases
- Progressive Disclosure: Start with high-level views, enable drilling down
- Consistent Layout: Use consistent patterns across dashboards
- Focus on Signal: Remove noise and clutter, highlight important data
- Context and Reference: Include baselines and thresholds
- Annotation: Mark events like deployments or incidents
- Cross-Linking: Connect related dashboards and documentation
Real-World Example: Service Dashboard
A product team created a service dashboard with these panels:
Row 1: Service Health Overview
- Request Rate (Requests per second over time)
- Error Rate (Percentage of failed requests over time)
- Response Time (p50/p90/p99 latency over time)
- Success Rate SLO (Current status vs. target)
Row 2: Resource Utilization
- CPU Usage (Average across instances)
- Memory Usage (Average across instances)
- Network I/O (Inbound/outbound traffic)
- Instance Count (Total and per status)
Row 3: Dependencies
- Database Query Time (Average and p95 duration)
- Cache Hit Rate (Percentage of cache hits)
- External API Response Times (By endpoint)
- Queue Depth (For asynchronous processing)
Row 4: Business Impact
- Conversion Rate (Compared to 24h and 7d averages)
- Transactions Processed (Count and value)
- User Engagement (Active sessions, page views)
- Feature Usage (Key feature utilization metrics)
This dashboard included annotations for deployments and incidents, with links to related logs, traces, and documentation. They also added vertical markers for daily traffic patterns and a time-shifted overlay to compare to previous periods.
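The time-shifted comparison mentioned here boils down to looking up the same metric at an earlier timestamp and reporting the relative change. The following sketch shows that calculation on a small in-memory series; the `shifted_change_pct` helper and the conversion-rate values are hypothetical, and dashboards such as Grafana offer this as a built-in query option.

```python
from typing import Dict, Optional

DAY_S = 86_400


def shifted_change_pct(
    series: Dict[int, float], now_ts: int, shift_s: int
) -> Optional[float]:
    """Percent change between the value at now_ts and the value shift_s earlier.

    Returns None when either point is missing or the earlier value is zero.
    """
    current = series.get(now_ts)
    previous = series.get(now_ts - shift_s)
    if current is None or previous is None or previous == 0:
        return None
    return (current - previous) / previous * 100


# Hypothetical conversion rate (%) sampled at the same time of day.
conversion_rate = {0: 2.0, 6 * DAY_S: 2.5, 7 * DAY_S: 2.2}

vs_yesterday = shifted_change_pct(conversion_rate, 7 * DAY_S, DAY_S)
vs_last_week = shifted_change_pct(conversion_rate, 7 * DAY_S, 7 * DAY_S)
```

Comparing against the same hour on a previous day or week helps separate a genuine regression from the normal daily traffic cycle.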
Grafana Dashboard JSON Example (Simplified)
{
"annotations": {
"list": [
{
"builtIn": 1,
"datasource": "-- Grafana --",
"enable": true,
"hide": true,
"iconColor": "rgba(0, 211, 255, 1)",
"name": "Annotations & Alerts",
"type": "dashboard"
},
{
"datasource": "Prometheus",
"enable": true,
"expr": "changes(app_version{service=\"user-service\"}[1m]) > 0",
"iconColor": "rgba(0, 211, 255, 1)",
"name": "Deployments",
"titleFormat": "Deployment",
"textFormat": "Version {{tag_version}}"
}
]
},
"editable": true,
"gnetId": null,
"graphTooltip": 1,
"id": 1,
"iteration": 1620926545544,
"links": [],
"panels": [
{
"collapsed": false,
"datasource": null,
"gridPos": {
"h": 1,
"w": 24,
"x": 0,
"y": 0
},
"id": 20,
"panels": [],
"title": "Service Health",
"type": "row"
},
{
"aliasColors": {},
"bars": false,
"dashLength": 10,
"dashes": false,
"datasource": "Prometheus",
"fieldConfig": {
"defaults": {},
"overrides": []
},
"fill": 1,
"fillGradient": 0,
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 1
},
"hiddenSeries": false,
"id": 2,
"legend": {
"avg": false,
"current": true,
"max": true,
"min": false,
"show": true,
"total": false,
"values": true
},
"lines": true,
"linewidth": 1,
"nullPointMode": "null",
"options": {
"alertThreshold": true
},
"percentage": false,
"pluginVersion": "7.5.3",
"pointradius": 2,
"points": false,
"renderer": "flot",
"seriesOverrides": [],
"spaceLength": 10,
"stack": false,
"steppedLine": false,
"targets": [
{
"expr": "sum(rate(http_requests_total{service=\"user-service\"}[5m]))",
"interval": "",
"legendFormat": "Requests/sec",
"refId": "A"
}
],
"thresholds": [],
"timeFrom": null,
"timeRegions": [],
"timeShift": null,
"title": "Request Rate",
"tooltip": {
"shared": true,
"sort": 0,
"value_type": "individual"
},
"type": "graph",
"xaxis": {
"buckets": null,
"mode": "time",
"name": null,
"show": true,
"values": []
},
"yaxes": [
{
"format": "short",
"label": "Requests/sec",
"logBase": 1,
"max": null,
"min": "0",
"show": true
},
{
"format": "short",
"label": null,
"logBase": 1,
"max": null,
"min": null,
"show": true
}
],
"yaxis": {
"align": false,
"alignLevel": null
}
},
{
"aliasColors": {
"Error Rate": "red"
},
"bars": false,
"dashLength": 10,
"dashes": false,
"datasource": "Prometheus",
"fieldConfig": {
"defaults": {},
"overrides": []
},
"fill": 1,
"fillGradient": 0,
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 1
},
"hiddenSeries": false,
"id": 4,
"legend": {
"avg": false,
"current": true,
"max": false,
"min": false,
"show": true,
"total": false,
"values": true
},
"lines": true,
"linewidth": 1,
"nullPointMode": "null",
"options": {
"alertThreshold": true
},
"percentage": false,
"pluginVersion": "7.5.3",
"pointradius": 2,
"points": false,
"renderer": "flot",
"seriesOverrides": [],
"spaceLength": 10,
"stack": false,
"steppedLine": false,
"targets": [
{
"expr": "sum(rate(http_requests_total{service=\"user-service\",status=~\"5..\"}[5m])) / sum(rate(http_requests_total{service=\"user-service\"}[5m])) * 100",
"interval": "",
"legendFormat": "Error Rate",
"refId": "A"
}
],
"thresholds": [
{
"colorMode": "critical",
"fill": true,
"line": true,
"op": "gt",
"value": 5,
"yaxis": "left"
}
],
"timeFrom": null,
"timeRegions": [],
"timeShift": null,
"title": "Error Rate",
"tooltip": {
"shared": true,
"sort": 0,
"value_type": "individual"
},
"type": "graph",
"xaxis": {
"buckets": null,
"mode": "time",
"name": null,
"show": true,
"values": []
},
"yaxes": [
{
"format": "percent",
"label": "",
"logBase": 1,
"max": null,
"min": "0",
"show": true
},
{
"format": "short",
"label": null,
"logBase": 1,
"max": null,
"min": null,
"show": true
}
],
"yaxis": {
"align": false,
"alignLevel": null
}
}
],
"schemaVersion": 27,
"style": "dark",
"tags": ["service", "user-service"],
"templating": {
"list": [
{
"allValue": null,
"current": {
"selected": false,
"text": "user-service",
"value": "user-service"
},
"datasource": "Prometheus",
"definition": "label_values(service)",
"description": null,
"error": null,
"hide": 0,
"includeAll": false,
"label": "Service",
"multi": false,
"name": "service",
"options": [],
"query": {
"query": "label_values(service)",
"refId": "StandardVariableQuery"
},
"refresh": 1,
"regex": "",
"skipUrlSync": false,
"sort": 0,
"type": "query"
}
]
},
"time": {
"from": "now-6h",
"to": "now"
},
"timepicker": {},
"timezone": "",
"title": "Service Dashboard",
"uid": "abc123",
"version": 1
}
Learning Activities
Activity 1: Application Instrumentation
Instrument a simple web application to collect key metrics:
- Choose a web application framework in your preferred language
- Add instrumentation to track request counts, response times, and error rates
- Implement custom metrics for a business transaction
- Create a metrics endpoint for Prometheus scraping
- Set up Prometheus to collect metrics from your application
- Create a basic Grafana dashboard to visualize the metrics
Activity 2: Define a Monitoring Strategy
For a hypothetical e-commerce application, design a comprehensive monitoring strategy:
- Identify key components and services to monitor
- Define the critical metrics for each component
- Specify tool choices for monitoring different aspects
- Design alert policies for critical scenarios
- Create mockups for the main dashboard types
- Outline a plan for implementing the strategy in phases
Activity 3: Alert Design and Evaluation
Evaluate and improve an alerting strategy:
- Review a provided set of alert definitions
- Identify potential issues like alert storms or false positives
- Redesign problematic alerts using best practices
- Create alert definitions for Prometheus/Alertmanager
- Define appropriate alert routing and escalation policies
- Document response procedures for critical alerts
Key Takeaways
- Effective monitoring is essential for maintaining reliable, high-performance applications
- The three pillars of observability—metrics, logs, and traces—provide a complete view of application behavior
- Monitoring strategies like RED, USE, and Four Golden Signals offer structured approaches to monitoring different aspects of systems
- Application instrumentation should balance comprehensive coverage with performance considerations
- A combination of open-source and commercial tools can provide a complete monitoring solution
- Alerting strategies should focus on actionable notifications while avoiding alert fatigue
- Well-designed dashboards enable teams to quickly understand system health and performance
- Modern observability practices are evolving toward standardization with frameworks like OpenTelemetry