Application Monitoring: Ensuring Reliability and Performance

Introduction to Application Monitoring

Application monitoring is the practice of observing, tracking, and analyzing the behavior and performance of software applications in real time. It enables teams to identify issues, optimize performance, and ensure applications meet user expectations and business requirements. In today's complex distributed systems, effective monitoring is essential for maintaining reliability and delivering a good user experience.

Modern application monitoring has evolved from simple uptime checks to comprehensive observability solutions that provide deep insights into application behavior, performance patterns, and user experiences. As applications grow more complex—spanning multiple services, cloud providers, and infrastructure layers—monitoring systems must evolve to track interdependencies and provide actionable insights when issues arise.

graph TD
    A[Application Monitoring] --> B[Infrastructure Monitoring]
    A --> C[Application Performance Monitoring]
    A --> D[User Experience Monitoring]
    A --> E[Business Metrics Monitoring]
    B --> B1[Server Health]
    B --> B2[Network Performance]
    B --> B3[Database Performance]
    C --> C1[Response Times]
    C --> C2[Error Rates]
    C --> C3[Throughput]
    C --> C4[Resource Utilization]
    D --> D1[Page Load Times]
    D --> D2[Transaction Success]
    D --> D3[User Journeys]
    E --> E1[Conversion Rates]
    E --> E2[Revenue Metrics]
    E --> E3[User Engagement]

The Healthcare Analogy

Application monitoring can be compared to modern healthcare:

Infrastructure Monitoring is like checking vital signs (heart rate, blood pressure, temperature). These basic metrics tell us if the system is alive and functioning at a fundamental level.
Application Performance Monitoring is similar to diagnostic tests (blood work, imaging). These give us deeper insights into how well internal systems are functioning.
Error Tracking resembles monitoring symptoms and complaints. Just as a doctor wants to know what hurts and when it started, developers need to understand error patterns and their contexts.
Distributed Tracing is like tracking the spread of a medication through the body. It helps us understand how requests flow through a complex system.
Alerting functions like a hospital's monitoring system that notifies staff when a patient's condition requires attention.
Proactive Monitoring is similar to preventive medicine—identifying potential issues before they cause noticeable symptoms.

Just as modern medicine has evolved from treating symptoms to understanding whole-body health and prevention, application monitoring has evolved from simple uptime checks to comprehensive observability that provides context and insights into the entire application ecosystem.

Core Concepts in Application Monitoring

The Three Pillars of Observability

Observability—the ability to understand a system's internal state based on its external outputs—rests on three key pillars:

graph TD
    A[Observability] --> B[Metrics]
    A --> C[Logs]
    A --> D[Traces]
    B --> B1[Quantitative measurements]
    B --> B2[Time-series data]
    B --> B3[Aggregatable]
    C --> C1[Discrete events]
    C --> C2[Contextual information]
    C --> C3[Unstructured/structured]
    D --> D1[Request flows]
    D --> D2[Cross-service calls]
    D --> D3[Performance timing]

Metrics

Metrics are numerical values measured over time that represent the state or performance of a system component. They are typically:

Collected at regular intervals
Numerical and aggregatable (can be summed, averaged, etc.)
Low overhead and suitable for long-term storage
Ideal for trend analysis and alerting

Common Metric Types

Type	Description	Examples
Counter	Cumulative value that only increases	Request count, error count, completed tasks
Gauge	Single numerical value that can go up and down	Memory usage, CPU utilization, active connections
Histogram	Distribution of values across configurable buckets	Response time distribution, request size distribution
Summary	Similar to histogram but calculates configurable quantiles	90th/95th/99th percentile of response times
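
To make the four types concrete, the sketch below implements toy versions of each in plain Python. The class and method names are illustrative assumptions and are not the API of any Prometheus client library; real clients add labels, thread safety, and an exposition format on top of the same ideas.

```python
import bisect
import math


class Counter:
    """Cumulative value that only increases."""

    def __init__(self):
        self.value = 0.0

    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("counters can only increase")
        self.value += amount


class Gauge:
    """Single value that can go up and down."""

    def __init__(self):
        self.value = 0.0

    def set(self, value):
        self.value = value


class Histogram:
    """Counts observations per bucket, where each bucket has an upper bound."""

    def __init__(self, buckets):
        self.bounds = sorted(buckets) + [math.inf]
        self.counts = [0] * len(self.bounds)
        self.total = 0.0

    def observe(self, value):
        # Find the first bucket whose upper bound is >= value.
        index = bisect.bisect_left(self.bounds, value)
        self.counts[index] += 1
        self.total += value

    def cumulative(self):
        """Return (upper_bound, observations <= upper_bound) pairs."""
        running, result = 0, []
        for bound, count in zip(self.bounds, self.counts):
            running += count
            result.append((bound, running))
        return result


class Summary:
    """Keeps raw observations and reports quantiles directly."""

    def __init__(self):
        self.values = []

    def observe(self, value):
        self.values.append(value)

    def quantile(self, q):
        # Nearest-rank quantile over everything observed so far.
        ordered = sorted(self.values)
        rank = max(1, math.ceil(q * len(ordered)))
        return ordered[rank - 1]
```

The practical difference shows up at query time: histogram buckets from many instances can be added together before estimating a percentile, whereas quantiles computed inside each instance, as in the toy Summary, cannot be meaningfully averaged.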

Logs

Logs are time-stamped records of discrete events that occur within an application. They provide context and details about what happened at a specific moment. Logs typically include:

Timestamp
Severity level (INFO, WARN, ERROR, etc.)
Source (service, component, function)
Message and contextual data

Structured vs. Unstructured Logging

Unstructured Log:


[2025-05-08 14:32:17] ERROR Failed to process payment for order #12345 - Credit card declined

Structured Log (JSON format):


{
  "timestamp": "2025-05-08T14:32:17.345Z",
  "level": "ERROR",
  "service": "payment-service",
  "traceId": "abc123def456",
  "message": "Failed to process payment",
  "orderId": "12345",
  "errorCode": "CC_DECLINED",
  "paymentMethod": "credit_card",
  "customer": {
    "id": "cust_9876",
    "accountType": "premium"
  }
}

Structured logs are more machine-readable and enable more powerful filtering, searching, and analysis.
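
One way to emit logs in this shape is a custom formatter for Python's standard `logging` module. The sketch below is a minimal example, not a recommended production setup; the `JsonFormatter` class, the `context` key, and the hard-coded service name are assumptions made for illustration.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record):
        entry = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "service": "payment-service",  # assumed service name
            "message": record.getMessage(),
        }
        # Fields passed as extra={"context": {...}} are merged into the entry.
        entry.update(getattr(record, "context", {}))
        return json.dumps(entry)


def build_logger(name="payment"):
    """Return a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    return logger


logger = build_logger()
logger.error(
    "Failed to process payment",
    extra={"context": {"orderId": "12345", "errorCode": "CC_DECLINED"}},
)
```

Because every field is a key in the JSON object, a log backend can filter on `errorCode` or `orderId` directly instead of parsing the message text.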

Traces

Traces track the flow of requests as they propagate through distributed services. They help developers understand:

The path of a request through multiple services
The time spent in each component or service
Dependencies between services
The root cause of performance bottlenecks or failures

sequenceDiagram
    participant U as User
    participant A as API Gateway
    participant S as Service A
    participant D as Database
    participant C as Cache
    U->>A: Request
    Note right of A: Trace ID: abc123
    A->>S: Forward request
    S->>D: Query data
    S->>C: Check cache
    C->>S: Cache response
    S->>A: Service response
    A->>U: Final response
    Note over U,C: Each step is a span with timing information

In the example above, a trace consists of multiple spans representing each segment of the request's journey. Each span includes:

Start and end times
Service/component information
Operation name
Parent-child relationships
Contextual data (parameters, results, errors)
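
The span fields listed above can be modeled with a small data class. This is a simplified sketch for illustration only: the `Span` class and its methods are hypothetical and far smaller than what a tracing SDK such as OpenTelemetry provides.

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    trace_id: str
    name: str
    service: str
    parent_id: Optional[str] = None
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])
    start: float = field(default_factory=time.perf_counter)
    end: Optional[float] = None
    attributes: dict = field(default_factory=dict)

    def child(self, name, service):
        """Create a span in the same trace whose parent is this span."""
        return Span(self.trace_id, name, service, parent_id=self.span_id)

    def finish(self):
        """Record the end time of the span."""
        self.end = time.perf_counter()

    @property
    def duration(self):
        """Elapsed seconds, or None while the span is still open."""
        return None if self.end is None else self.end - self.start


# A root span for the incoming request and one child for the database call.
root = Span(trace_id=uuid.uuid4().hex, name="GET /orders", service="api-gateway")
query = root.child("query-orders", "service-a")
query.attributes["db.rows"] = 3
query.finish()
root.finish()
```

The shared `trace_id` ties the spans into one trace, while `parent_id` lets a tracing backend rebuild the call tree and attribute time to each component.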

Effective Monitoring Strategies

The RED Method

The RED method, introduced by Tom Wilkie at Weaveworks, focuses on three key metrics for monitoring service health from a user perspective:

Rate: Requests per second
Errors: Number of failed requests
Duration: Distribution of request latencies

These metrics provide a user-centric view of service health and performance, making them ideal for customer-facing services.

Implementing RED Metrics with Prometheus


// Example using Prometheus client for Node.js
const express = require('express');
const promClient = require('prom-client');
const app = express();

// Create a Registry to register metrics
const register = new promClient.Registry();
promClient.collectDefaultMetrics({ register });

// Create RED metrics and attach them to the custom registry
const httpRequestDurationSeconds = new promClient.Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [0.01, 0.05, 0.1, 0.5, 1, 2, 5, 10],
  registers: [register]
});

const httpRequestCounter = new promClient.Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'route', 'status_code'],
  registers: [register]
});

// Middleware to track metrics
app.use((req, res, next) => {
  const start = process.hrtime.bigint();

  // Record metrics when the response is finished
  res.on('finish', () => {
    const durationInSeconds = Number(process.hrtime.bigint() - start) / 1e9;

    // Use the matched route pattern (e.g. /users/:id) rather than the raw
    // path so that the number of distinct label values stays bounded
    const route = req.route ? req.baseUrl + req.route.path : 'unmatched';
    const labels = {
      method: req.method,
      route,
      status_code: res.statusCode.toString()
    };

    httpRequestDurationSeconds.observe(labels, durationInSeconds);
    httpRequestCounter.inc(labels);
  });

  next();
});

// Expose metrics endpoint for Prometheus to scrape
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});

// Application routes
app.get('/', (req, res) => {
  res.send('Hello World!');
});

app.listen(3000, () => {
  console.log('Server listening on port 3000');
});

The USE Method

The USE method, developed by Brendan Gregg, focuses on resources rather than services:

Utilization: Percentage of time the resource is busy
Saturation: Degree to which the resource has extra work (queue)
Errors: Count of error events

This approach is well-suited for monitoring infrastructure components like CPU, memory, network, and disk.

USE Method Application Examples

Resource	Utilization	Saturation	Errors
CPU	CPU usage %	Run queue length	System errors, CPU-related errors
Memory	Memory usage %	Swap usage, OOM events	Memory allocation errors
Disk I/O	Disk busy %	I/O queue length	Disk errors, timeouts
Network	Bandwidth usage %	Packet queue length	Network errors, retransmits, drops
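
As a worked example of the CPU row, the sketch below derives the three USE values from two samples of cumulative CPU time. The `CpuSample` fields and the saturation formula (runnable tasks beyond the number of cores) are simplifying assumptions; how the raw numbers are collected depends on the platform.

```python
from dataclasses import dataclass


@dataclass
class CpuSample:
    busy_seconds: float   # cumulative non-idle CPU time
    total_seconds: float  # cumulative CPU time (busy + idle)
    run_queue: int        # runnable tasks at sample time
    cores: int


def use_report(previous, current, error_count=0):
    """Summarize Utilization, Saturation, and Errors between two samples."""
    elapsed = current.total_seconds - previous.total_seconds
    if elapsed <= 0:
        raise ValueError("samples must be taken at increasing times")
    busy = current.busy_seconds - previous.busy_seconds
    return {
        # Fraction of the interval the CPU spent doing work.
        "utilization": busy / elapsed,
        # Tasks waiting beyond what the cores can run at once.
        "saturation": max(0, current.run_queue - current.cores),
        "errors": error_count,
    }
```

A resource can show moderate utilization and still be saturated during short bursts, which is why the method tracks the two separately.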

The Four Golden Signals

Google's Site Reliability Engineering (SRE) book outlines four key metrics for monitoring service health:

Latency: Time taken to serve a request
Traffic: Demand on the system (requests per second)
Errors: Rate of failed requests
Saturation: How "full" the service is (resource utilization)

This approach combines aspects of both RED and USE methods and is suitable for most applications and services.
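
The four signals can be computed from a window of request records plus one capacity measurement. The following is a rough sketch under stated assumptions: the `Request` shape, the choice of the 99th percentile for latency, and in-flight requests divided by capacity as the saturation measure are illustrative, not prescribed by the SRE book.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    duration_s: float
    status: int


def percentile(values, q):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


def golden_signals(requests, window_s, in_flight, capacity):
    """Summarize latency, traffic, errors, and saturation for one window."""
    if not requests:
        raise ValueError("need at least one request in the window")
    durations = [r.duration_s for r in requests]
    failures = sum(1 for r in requests if r.status >= 500)
    return {
        "latency_p99_s": percentile(durations, 0.99),
        "traffic_rps": len(requests) / window_s,
        "error_ratio": failures / len(requests),
        "saturation": in_flight / capacity,
    }
```

In practice these values usually come from a metrics backend rather than raw request lists, but the arithmetic is the same.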

Real-World Example: E-commerce Monitoring Strategy

A large e-commerce platform implements the Four Golden Signals across its architecture:

Latency Monitoring:
- p50/p90/p99 response times for all API endpoints
- Frontend page load times by device type
- Database query latency for critical operations
Traffic Tracking:
- Requests per second by service and endpoint
- User sessions per minute
- Search queries and product views per minute
Error Monitoring:
- HTTP 5xx and 4xx error rates
- Failed payment transactions
- Add-to-cart failures
- Checkout abandonment rate
Saturation Measurement:
- Database connection pool usage
- Queue depths for asynchronous processes
- Memory usage of key services
- CPU utilization across the infrastructure

These signals are monitored in real time, with automated alerts that trigger when thresholds are exceeded. During peak shopping events like Black Friday, the team adjusts alert thresholds to account for the expected increase in traffic and load.

Application Instrumentation

Instrumentation is the process of adding code to your application to collect monitoring data. Effective instrumentation is foundational to any monitoring strategy.

Types of Instrumentation

Manual Instrumentation

Developers explicitly add monitoring code
Provides precise control over what is monitored
Can be tailored to business-specific metrics
Requires ongoing maintenance
Risk of inconsistent implementation

Automatic Instrumentation

Added by monitoring libraries or agents
Quick to implement across applications
Consistent implementation
Less flexible for custom metrics
May miss application-specific contexts

Key Components to Instrument

HTTP/API Endpoints: Track request rates, latencies, and error rates
Database Queries: Monitor query execution times and failure rates
External Service Calls: Track dependencies on third-party services
Background Jobs: Monitor execution times and success/failure rates
Critical Business Transactions: Track user-centric operations
Resource Usage: Monitor CPU, memory, and other resource consumption
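
For components such as external service calls and background jobs, manual instrumentation often takes the form of a wrapper around the function being measured. The decorator below is a minimal sketch: `STATS` is an in-memory stand-in for a real metrics backend, and `instrumented` and `lookup_stock` are hypothetical names used only for illustration.

```python
import functools
import time
from collections import defaultdict

# In-memory stand-in for a metrics backend (an assumption for this sketch;
# a real service would record these values with a metrics client library).
STATS = defaultdict(lambda: {"calls": 0, "errors": 0, "total_seconds": 0.0})


def instrumented(operation):
    """Record call count, error count, and cumulative duration."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            except Exception:
                STATS[operation]["errors"] += 1
                raise
            finally:
                # Runs on both success and failure.
                STATS[operation]["calls"] += 1
                STATS[operation]["total_seconds"] += time.perf_counter() - start

        return wrapper

    return decorator


@instrumented("inventory.lookup")
def lookup_stock(sku):
    """Hypothetical dependency call used to demonstrate the decorator."""
    if not sku:
        raise ValueError("sku is required")
    return {"sku": sku, "in_stock": 3}
```

Keeping the measurement in one decorator means every instrumented function reports the same three values, which avoids inconsistent ad hoc timing code across a codebase.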

Manual Instrumentation Examples

Express.js with OpenTelemetry:


const { context, metrics, trace, SpanStatusCode } = require('@opentelemetry/api');
const express = require('express');
const app = express();

// Initialize OpenTelemetry (setup code not shown).
// `database` and `logger` stand in for the application's own data-access
// and logging modules, which are not shown here.

const tracer = trace.getTracer('product-service');
const productViews = metrics.getMeter('product-service').createCounter('product.views');

app.get('/api/products/:id', async (req, res) => {
  // Start a new span for this endpoint
  const span = tracer.startSpan('get-product-by-id');

  // Add context to the span
  span.setAttribute('product.id', req.params.id);

  try {
    // Create a child span for the database operation by passing a context
    // that carries the parent span
    const parentContext = trace.setSpan(context.active(), span);
    const dbSpan = tracer.startSpan('database-query', undefined, parentContext);

    let product;
    try {
      product = await database.getProduct(req.params.id);
      dbSpan.setAttribute('db.result_size', JSON.stringify(product).length);
    } catch (dbError) {
      dbSpan.recordException(dbError);
      dbSpan.setStatus({ code: SpanStatusCode.ERROR, message: dbError.message });
      throw dbError;
    } finally {
      dbSpan.end();
    }

    // Record business metrics
    productViews.add(1, { 'product.id': req.params.id });

    res.json(product);
  } catch (error) {
    // Record the error on the span
    span.recordException(error);
    span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });

    // Log the error with the trace ID so logs and traces can be correlated
    logger.error('Failed to retrieve product', {
      error: error.message,
      productId: req.params.id,
      traceId: span.spanContext().traceId
    });

    res.status(500).json({ error: 'Failed to retrieve product' });
  } finally {
    // End the span
    span.end();
  }
});

Python with custom metrics:


import time
from flask import Flask, Response, g, request
from prometheus_client import CONTENT_TYPE_LATEST, Counter, Histogram, generate_latest

app = Flask(__name__)

# Define metrics
REQUEST_COUNT = Counter(
    'request_count', 'App Request Count',
    ['app_name', 'endpoint', 'method', 'http_status']
)

REQUEST_LATENCY = Histogram(
    'request_latency_seconds', 'Request latency in seconds',
    ['app_name', 'endpoint', 'method']
)

# Request hooks to measure request latency and count
@app.before_request
def start_timer():
    # Store the start time on Flask's per-request context object
    g.start_time = time.perf_counter()

@app.after_request
def record_metrics(response):
    endpoint = request.endpoint or 'unknown'
    request_latency = time.perf_counter() - g.start_time
    REQUEST_LATENCY.labels('my_app', endpoint, request.method).observe(request_latency)
    REQUEST_COUNT.labels('my_app', endpoint, request.method, response.status_code).inc()
    return response

# The URL must declare the user_id parameter that the view function receives
@app.route('/api/users/<user_id>')
def get_user(user_id):
    # Application logic here
    return {'id': user_id, 'name': 'Example User'}

# Metrics endpoint for Prometheus to scrape
@app.route('/metrics')
def metrics():
    # Serve the exposition format with the content type Prometheus expects
    return Response(generate_latest(), mimetype=CONTENT_TYPE_LATEST)

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)

OpenTelemetry Standard

OpenTelemetry is an open-source observability framework that provides a standardized way to instrument applications across different languages and platforms.

Benefits of OpenTelemetry

Vendor Neutrality: A single instrumentation that works with multiple backends
Consistent Cross-Service Tracing: Standardized context propagation
Comprehensive Coverage: Metrics, logs, and traces in one framework
Language Agnostic: Supports most major programming languages
Extensive Library Support: Auto-instrumentation for common frameworks
Future-Proof: Increasingly becoming the industry standard
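
Standardized context propagation is what lets spans from different services join the same trace. OpenTelemetry's default propagator uses the W3C Trace Context `traceparent` header, whose value has the form `version-traceid-parentid-flags`. The sketch below shows the idea with hypothetical helper functions; in a real service the OpenTelemetry SDK handles this automatically.

```python
import re
import secrets

# 00-<32 hex trace id>-<16 hex parent span id>-<2 hex flags>
TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled=True):
    """Start a new trace with random trace and span IDs."""
    trace_id = secrets.token_hex(16)
    span_id = secrets.token_hex(8)
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def child_traceparent(incoming):
    """Build the header for an outgoing call made while handling a request.

    The trace ID is kept so both services report into the same trace;
    a new span ID identifies the current service's span as the parent.
    """
    match = TRACEPARENT.match(incoming or "")
    if match is None or int(match.group(1), 16) == 0 or int(match.group(2), 16) == 0:
        # Missing or invalid header: start a fresh trace.
        return new_traceparent()
    trace_id, _parent_id, flags = match.groups()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"
```

A service would read `traceparent` from the incoming request headers, call something like `child_traceparent`, and attach the result to every downstream HTTP call it makes.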

OpenTelemetry Setup Example (Node.js)


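// tracing.js - OpenTelemetry setup file
// Note: this example targets the 1.x JavaScript SDK. The Jaeger exporter
// package has since been deprecated in favor of the OTLP exporters, which
// Jaeger accepts natively.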
const { NodeTracerProvider } = require('@opentelemetry/sdk-trace-node');
const { Resource } = require('@opentelemetry/resources');
const { SemanticResourceAttributes } = require('@opentelemetry/semantic-conventions');
const { BatchSpanProcessor } = require('@opentelemetry/sdk-trace-base');
const { JaegerExporter } = require('@opentelemetry/exporter-jaeger');
const { registerInstrumentations } = require('@opentelemetry/instrumentation');
const { ExpressInstrumentation } = require('@opentelemetry/instrumentation-express');
const { HttpInstrumentation } = require('@opentelemetry/instrumentation-http');
const { MongoDBInstrumentation } = require('@opentelemetry/instrumentation-mongodb');

// Create and configure the OpenTelemetry tracer provider
const provider = new NodeTracerProvider({
  resource: new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: 'my-service',
    [SemanticResourceAttributes.SERVICE_VERSION]: '1.0.0',
    environment: process.env.NODE_ENV || 'development'
  })
});

// Configure span processor and exporter
const jaegerExporter = new JaegerExporter({
  endpoint: 'http://jaeger:14268/api/traces',
});

provider.addSpanProcessor(new BatchSpanProcessor(jaegerExporter));

// Register the provider
provider.register();

// Automatically instrument Express, HTTP, and MongoDB
registerInstrumentations({
  instrumentations: [
    new ExpressInstrumentation(),
    new HttpInstrumentation(),
    new MongoDBInstrumentation()
  ]
});

console.log('OpenTelemetry tracing initialized');

// In your main application file:
// require('./tracing'); // Add this at the top of your entry point file

Application Monitoring Tools

Tool Categories

graph TD
    A[Monitoring Tools] --> B[APM Solutions]
    A --> C[Infrastructure Monitoring]
    A --> D[Log Management]
    A --> E[Tracing Systems]
    A --> F[Synthetic Monitoring]
    A --> G[Real User Monitoring]
    B --> B1[New Relic]
    B --> B2[Datadog APM]
    B --> B3[Dynatrace]
    B --> B4[AppDynamics]
    C --> C1[Prometheus + Grafana]
    C --> C2[Nagios]
    C --> C3[Zabbix]
    D --> D1[ELK Stack]
    D --> D2[Graylog]
    D --> D3[Loki]
    E --> E1[Jaeger]
    E --> E2[Zipkin]
    E --> E3[AWS X-Ray]
    F --> F1[Pingdom]
    F --> F2[Uptrends]
    F --> F3[Checkly]
    G --> G1[Google Analytics]
    G --> G2[Hotjar]
    G --> G3[LogRocket]

Open-Source Monitoring Stack

Many organizations implement a monitoring stack based on open-source tools:

graph LR
    A[Apps with OpenTelemetry] --> B[Prometheus]
    A --> C[Loki]
    A --> D[Tempo/Jaeger]
    B --> E[Alertmanager]
    B --> F[Grafana]
    C --> F
    D --> F
    E --> G[Alert Channels]
    F --> H[Dashboards]

Common Open-Source Monitoring Components

Prometheus: Time-series database for metrics with a powerful query language
Grafana: Visualization platform for metrics, logs, and traces
Loki: Log aggregation system designed to work with Prometheus and Grafana
Jaeger/Tempo: Distributed tracing systems for tracking request flows
Alertmanager: Handles alerts from Prometheus and routes them to receivers
OpenTelemetry Collector: Receives, processes, and exports telemetry data

Commercial APM Tools

Commercial Application Performance Monitoring (APM) tools provide integrated solutions that combine metrics, logs, and traces with additional features:

Tool	Key Features	Best For
New Relic	Full-stack observability, NRQL query language, AI-assisted analysis	Web applications, microservices, comprehensive monitoring
Datadog	Infrastructure monitoring, APM, log management, RUM, security	Complex environments, multi-cloud, hybrid infrastructures
Dynatrace	AI-powered monitoring, auto-discovery, root cause analysis	Enterprise applications, complex dependencies, automated analysis
AppDynamics	Business transaction monitoring, code-level diagnostics	Enterprise applications, business metric correlation
Elastic APM	Open-core APM integrated with Elasticsearch, Kibana	Organizations already using the Elastic Stack

Real-World Example: E-commerce Platform Monitoring Setup

A mid-size e-commerce company with a microservices architecture implemented a hybrid monitoring approach:

Infrastructure Monitoring: Prometheus and Grafana for all infrastructure metrics
Log Management: ELK Stack (Elasticsearch, Logstash, Kibana) for centralized logging
Application Monitoring: New Relic APM for detailed service performance metrics
Real User Monitoring: New Relic Browser for frontend performance monitoring
Synthetic Monitoring: Checkly for critical user journey validation
Alerting: PagerDuty integrated with all monitoring systems

They chose this combination to balance cost and capabilities:

Prometheus/Grafana covered their extensive infrastructure at a lower cost than commercial alternatives
ELK provided powerful log analysis capabilities they couldn't find in simpler solutions
New Relic APM gave them code-level visibility without building their own tracing infrastructure
PagerDuty provided alert management and on-call rotation that integrated with all systems

This approach allowed them to monitor both technical metrics and business outcomes while controlling costs.

Effective Alerting Strategies

Alerting is the process of notifying relevant personnel when monitoring systems detect abnormal conditions. Effective alerting balances prompt notification of important issues with avoiding alert fatigue.

Alert Types and Priorities

Priority	Description	Response Time	Examples
P1 (Critical)	Service is down or severely impaired for all users	Immediate (24/7)	Website outage, payment processing failure
P2 (High)	Major feature unavailable or significant performance degradation	Within 30 minutes (24/7)	Checkout process slow, search not working
P3 (Medium)	Minor feature issues or early warnings of potential problems	Business hours	Elevated error rates, increasing response times
P4 (Low)	Non-urgent issues that should be addressed	Next business day	Non-critical component warnings, resource trending toward threshold

Alerting Best Practices

Alert on Symptoms, Not Causes: Focus on user-impacting issues
Reduce Noise: Eliminate redundant alerts for the same issue
Provide Context: Include relevant information to aid troubleshooting
Set Appropriate Thresholds: Avoid false positives and negatives
Implement Alert Routing: Direct alerts to the right teams
Use Multiple Channels: Ensure critical alerts are received
Have Escalation Paths: Define what happens if alerts are not acknowledged
Review and Refine: Regularly audit alert effectiveness

Prometheus Alerting Rules Example


# prometheus/alerts.yml
groups:
  - name: service_availability
    rules:
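      # Alert when more than 5% of requests fail with a 5xx status.
      # Assumes http_requests_total carries `service` and `status` labels;
      # adjust the label names to match your instrumentation.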
      - alert: HighErrorRate
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) by (service) / sum(rate(http_requests_total[5m])) by (service) > 0.05
        for: 2m
        labels:
          severity: critical
          team: backend
        annotations:
          summary: "High error rate on {{ $labels.service }}"
          description: "Service {{ $labels.service }} has a high error rate: {{ $value | humanizePercentage }}"
          dashboard: "https://grafana.example.com/d/service-overview?var-service={{ $labels.service }}"

      # Alert on high latency
      - alert: HighLatency
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (service, le)) > 2
        for: 5m
        labels:
          severity: warning
          team: backend
        annotations:
          summary: "High latency on {{ $labels.service }}"
          description: "Service {{ $labels.service }} has a 95th percentile latency of {{ $value }} seconds"
          dashboard: "https://grafana.example.com/d/service-performance?var-service={{ $labels.service }}"

  - name: infrastructure
    rules:
      # Alert on high CPU usage
      - alert: HighCPUUsage
        expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) > 0.8
        for: 10m
        labels:
          severity: warning
          team: infrastructure
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "Instance {{ $labels.instance }} has CPU usage above 80% for more than 10 minutes: {{ $value | humanizePercentage }}"
          
      # Alert on low disk space
      - alert: LowDiskSpace
        expr: node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.1
        for: 5m
        labels:
          severity: warning
          team: infrastructure
        annotations:
          summary: "Low disk space on {{ $labels.instance }}: {{ $labels.mountpoint }}"
          description: "Filesystem {{ $labels.mountpoint }} on {{ $labels.instance }} has less than 10% free space: {{ $value | humanizePercentage }} available"
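
The `for:` clause in the rules above delays firing until the condition has held for the stated duration, which filters out short spikes. The state machine below is a simplified sketch of that behavior, not Prometheus's actual implementation; the `AlertState` class is a hypothetical name.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertState:
    for_seconds: float
    pending_since: Optional[float] = None
    firing: bool = False

    def evaluate(self, condition_true, now):
        """Feed one rule evaluation; return 'inactive', 'pending', or 'firing'."""
        if not condition_true:
            # Any evaluation where the condition is false resets the timer.
            self.pending_since = None
            self.firing = False
            return "inactive"
        if self.pending_since is None:
            self.pending_since = now
        # Fire only once the condition has held for the full duration.
        self.firing = now - self.pending_since >= self.for_seconds
        return "firing" if self.firing else "pending"
```

With `for_seconds=120`, a condition that is true at 0 s and 60 s stays pending and fires at 120 s; a single false evaluation in between would restart the wait.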

Alert Design Principles

Actionable: Every alert should require action (if not, it's a metric)
Timely: Alert at the right time (not too early, not too late)
Precise: Target the specific component with an issue
Clear Ownership: Define who should respond
Documented Response: Have a documented playbook for each alert
Tunable: Easy to adjust thresholds as conditions change

Avoiding Common Alerting Mistakes

Alert Storms: Multiple alerts for a single root cause
False Positives: Alerts that don't indicate real problems
Alert Fatigue: Too many alerts leading to ignored notifications
Missing Context: Alerts without enough information to diagnose
Poor Prioritization: Non-critical issues with high severity
Inconsistent Definitions: Varying alert criteria across services
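
Alert storms are commonly reduced by grouping related alerts into one notification, which is the idea behind Alertmanager's `group_by` setting. The sketch below illustrates the concept only; the function names and the alert dictionaries are assumptions, not Alertmanager's data model.

```python
from collections import defaultdict


def group_alerts(alerts, group_by=("alertname", "service")):
    """Bundle alerts that share the same values for the group_by labels."""
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert.get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)


def notifications(alerts, group_by=("alertname", "service")):
    """Produce one summary line per group instead of one page per alert."""
    lines = []
    for key, members in sorted(group_alerts(alerts, group_by).items()):
        labels = ", ".join(f"{k}={v}" for k, v in zip(group_by, key))
        lines.append(f"{len(members)} alert(s): {labels}")
    return lines
```

If ten instances of the same service breach the same threshold, the on-call engineer receives one message listing ten alerts rather than ten separate pages.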

Building Effective Monitoring Dashboards

Dashboards provide visual representations of monitoring data, helping teams understand system health and performance at a glance.

Dashboard Types and Purposes

Overview Dashboards: High-level system health
Service Dashboards: Detailed metrics for specific services
Business Dashboards: Impact on user experience and business metrics
On-Call Dashboards: Focus on common issues for incident responders
Resource Dashboards: Infrastructure utilization and capacity
Executive Dashboards: Simplified views for leadership

Dashboard Design Principles

Purpose-Driven: Design for specific user needs and use cases
Progressive Disclosure: Start with high-level views, enable drilling down
Consistent Layout: Use consistent patterns across dashboards
Focus on Signal: Remove noise and clutter, highlight important data
Context and Reference: Include baselines and thresholds
Annotation: Mark events like deployments or incidents
Cross-Linking: Connect related dashboards and documentation

Real-World Example: Service Dashboard

A product team created a service dashboard with these panels:

Row 1: Service Health Overview

Request Rate (Requests per second over time)
Error Rate (Percentage of failed requests over time)
Response Time (p50/p90/p99 latency over time)
Success Rate SLO (Current status vs. target)

Row 2: Resource Utilization

CPU Usage (Average across instances)
Memory Usage (Average across instances)
Network I/O (Inbound/outbound traffic)
Instance Count (Total and per status)

Row 3: Dependencies

Database Query Time (Average and p95 duration)
Cache Hit Rate (Percentage of cache hits)
External API Response Times (By endpoint)
Queue Depth (For asynchronous processing)

Row 4: Business Impact

Conversion Rate (Compared to 24h and 7d averages)
Transactions Processed (Count and value)
User Engagement (Active sessions, page views)
Feature Usage (Key feature utilization metrics)

This dashboard included annotations for deployments and incidents, with links to related logs, traces, and documentation. They also added vertical markers for daily traffic patterns and a time-shifted overlay to compare to previous periods.

Grafana Dashboard JSON Example (Simplified)


{
  "annotations": {
    "list": [
      {
        "builtIn": 1,
        "datasource": "-- Grafana --",
        "enable": true,
        "hide": true,
        "iconColor": "rgba(0, 211, 255, 1)",
        "name": "Annotations & Alerts",
        "type": "dashboard"
      },
      {
        "datasource": "Prometheus",
        "enable": true,
        "expr": "changes(app_version{service=\"user-service\"}[1m]) > 0",
        "iconColor": "rgba(0, 211, 255, 1)",
        "name": "Deployments",
        "titleFormat": "Deployment",
        "textFormat": "Version {{tag_version}}"
      }
    ]
  },
  "editable": true,
  "gnetId": null,
  "graphTooltip": 1,
  "id": 1,
  "iteration": 1620926545544,
  "links": [],
  "panels": [
    {
      "collapsed": false,
      "datasource": null,
      "gridPos": {
        "h": 1,
        "w": 24,
        "x": 0,
        "y": 0
      },
      "id": 20,
      "panels": [],
      "title": "Service Health",
      "type": "row"
    },
    {
      "aliasColors": {},
      "bars": false,
      "dashLength": 10,
      "dashes": false,
      "datasource": "Prometheus",
      "fieldConfig": {
        "defaults": {},
        "overrides": []
      },
      "fill": 1,
      "fillGradient": 0,
      "gridPos": {
        "h": 8,
        "w": 12,
        "x": 0,
        "y": 1
      },
      "hiddenSeries": false,
      "id": 2,
      "legend": {
        "avg": false,
        "current": true,
        "max": true,
        "min": false,
        "show": true,
        "total": false,
        "values": true
      },
      "lines": true,
      "linewidth": 1,
      "nullPointMode": "null",
      "options": {
        "alertThreshold": true
      },
      "percentage": false,
      "pluginVersion": "7.5.3",
      "pointradius": 2,
      "points": false,
      "renderer": "flot",
      "seriesOverrides": [],
      "spaceLength": 10,
      "stack": false,
      "steppedLine": false,
      "targets": [
        {
          "expr": "sum(rate(http_requests_total{service=\"user-service\"}[5m]))",
          "interval": "",
          "legendFormat": "Requests/sec",
          "refId": "A"
        }
      ],
      "thresholds": [],
      "timeFrom": null,
      "timeRegions": [],
      "timeShift": null,
      "title": "Request Rate",
      "tooltip": {
        "shared": true,
        "sort": 0,
        "value_type": "individual"
      },
      "type": "graph",
      "xaxis": {
        "buckets": null,
        "mode": "time",
        "name": null,
        "show": true,
        "values": []
      },
      "yaxes": [
        {
          "format": "short",
          "label": "Requests/sec",
          "logBase": 1,
          "max": null,
          "min": "0",
          "show": true
        },
        {
          "format": "short",
          "label": null,
          "logBase": 1,
          "max": null,
          "min": null,
          "show": true
        }
      ],
      "yaxis": {
        "align": false,
        "alignLevel": null
      }
    },
    {
      "aliasColors": {
        "Error Rate": "red"
      },
      "bars": false,
      "dashLength": 10,
      "dashes": false,
      "datasource": "Prometheus",
      "fieldConfig": {
        "defaults": {},
        "overrides": []
      },
      "fill": 1,
      "fillGradient": 0,
      "gridPos": {
        "h": 8,
        "w": 12,
        "x": 12,
        "y": 1
      },
      "hiddenSeries": false,
      "id": 4,
      "legend": {
        "avg": false,
        "current": true,
        "max": false,
        "min": false,
        "show": true,
        "total": false,
        "values": true
      },
      "lines": true,
      "linewidth": 1,
      "nullPointMode": "null",
      "options": {
        "alertThreshold": true
      },
      "percentage": false,
      "pluginVersion": "7.5.3",
      "pointradius": 2,
      "points": false,
      "renderer": "flot",
      "seriesOverrides": [],
      "spaceLength": 10,
      "stack": false,
      "steppedLine": false,
      "targets": [
        {
          "expr": "sum(rate(http_requests_total{service=\"user-service\",status=~\"5..\"}[5m])) / sum(rate(http_requests_total{service=\"user-service\"}[5m])) * 100",
          "interval": "",
          "legendFormat": "Error Rate",
          "refId": "A"
        }
      ],
      "thresholds": [
        {
          "colorMode": "critical",
          "fill": true,
          "line": true,
          "op": "gt",
          "value": 5,
          "yaxis": "left"
        }
      ],
      "timeFrom": null,
      "timeRegions": [],
      "timeShift": null,
      "title": "Error Rate",
      "tooltip": {
        "shared": true,
        "sort": 0,
        "value_type": "individual"
      },
      "type": "graph",
      "xaxis": {
        "buckets": null,
        "mode": "time",
        "name": null,
        "show": true,
        "values": []
      },
      "yaxes": [
        {
          "format": "percent",
          "label": "",
          "logBase": 1,
          "max": null,
          "min": "0",
          "show": true
        },
        {
          "format": "short",
          "label": null,
          "logBase": 1,
          "max": null,
          "min": null,
          "show": true
        }
      ],
      "yaxis": {
        "align": false,
        "alignLevel": null
      }
    }
  ],
  "schemaVersion": 27,
  "style": "dark",
  "tags": ["service", "user-service"],
  "templating": {
    "list": [
      {
        "allValue": null,
        "current": {
          "selected": false,
          "text": "user-service",
          "value": "user-service"
        },
        "datasource": "Prometheus",
        "definition": "label_values(service)",
        "description": null,
        "error": null,
        "hide": 0,
        "includeAll": false,
        "label": "Service",
        "multi": false,
        "name": "service",
        "options": [],
        "query": {
          "query": "label_values(service)",
          "refId": "StandardVariableQuery"
        },
        "refresh": 1,
        "regex": "",
        "skipUrlSync": false,
        "sort": 0,
        "type": "query"
      }
    ]
  },
  "time": {
    "from": "now-6h",
    "to": "now"
  },
  "timepicker": {},
  "timezone": "",
  "title": "Service Dashboard",
  "uid": "abc123",
  "version": 1
}
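
Most of the JSON above is repeated boilerplate, so teams often generate dashboards from code instead of editing them by hand. The sketch below builds the two panels from a small helper. The function names are hypothetical, and whether Grafana accepts this reduced set of fields unchanged depends on the Grafana version, so treat it as a starting point rather than a complete dashboard definition.

```python
import json


def graph_panel(panel_id, title, expr, x, unit="short"):
    """Build a minimal panel using a subset of the fields shown above."""
    return {
        "id": panel_id,
        "type": "graph",
        "title": title,
        "datasource": "Prometheus",
        "gridPos": {"h": 8, "w": 12, "x": x, "y": 1},
        "targets": [{"expr": expr, "legendFormat": title, "refId": "A"}],
        "yaxes": [
            {"format": unit, "min": "0", "show": True},
            {"format": "short", "show": True},
        ],
    }


def service_dashboard(service):
    """Assemble request-rate and error-rate panels for one service."""
    selector = f'service="{service}"'
    requests = f"sum(rate(http_requests_total{{{selector}}}[5m]))"
    errors = f'sum(rate(http_requests_total{{{selector},status=~"5.."}}[5m]))'
    return {
        "title": "Service Dashboard",
        "tags": ["service", service],
        "time": {"from": "now-6h", "to": "now"},
        "panels": [
            graph_panel(2, "Request Rate", requests, x=0),
            graph_panel(
                4, "Error Rate", f"{errors} / {requests} * 100", x=12, unit="percent"
            ),
        ],
    }


if __name__ == "__main__":
    print(json.dumps(service_dashboard("user-service"), indent=2))
```

Generating the definition this way keeps queries and layout consistent across services and makes dashboard changes reviewable like any other code change.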

Learning Activities

Activity 1: Application Instrumentation

Instrument a simple web application to collect key metrics:

Choose a web application framework in your preferred language
Add instrumentation to track request counts, response times, and error rates
Implement custom metrics for a business transaction
Create a metrics endpoint for Prometheus scraping
Set up Prometheus to collect metrics from your application
Create a basic Grafana dashboard to visualize the metrics

Activity 2: Define a Monitoring Strategy

For a hypothetical e-commerce application, design a comprehensive monitoring strategy:

Identify key components and services to monitor
Define the critical metrics for each component
Specify tool choices for monitoring different aspects
Design alert policies for critical scenarios
Create mockups for the main dashboard types
Outline a plan for implementing the strategy in phases

Activity 3: Alert Design and Evaluation

Evaluate and improve an alerting strategy:

Review a provided set of alert definitions
Identify potential issues like alert storms or false positives
Redesign problematic alerts using best practices
Create alert definitions for Prometheus/Alertmanager
Define appropriate alert routing and escalation policies
Document response procedures for critical alerts

Key Takeaways

Effective monitoring is essential for maintaining reliable, high-performance applications
The three pillars of observability—metrics, logs, and traces—provide a complete view of application behavior
Monitoring strategies like RED, USE, and Four Golden Signals offer structured approaches to monitoring different aspects of systems
Application instrumentation should balance comprehensive coverage with performance considerations
A combination of open-source and commercial tools can provide a complete monitoring solution
Alerting strategies should focus on actionable notifications while avoiding alert fatigue
Well-designed dashboards enable teams to quickly understand system health and performance
Modern observability practices are evolving toward standardization with frameworks like OpenTelemetry