Application Monitoring: Ensuring Reliability and Performance

Understanding and implementing effective monitoring strategies for modern applications

Introduction to Application Monitoring

Application monitoring is the practice of observing, tracking, and analyzing the behavior and performance of software applications in real time. It enables teams to identify issues, optimize performance, and ensure applications meet user expectations and business requirements. In today's complex distributed systems, effective monitoring is essential for maintaining reliability and delivering a good user experience.

Modern application monitoring has evolved from simple uptime checks to comprehensive observability solutions that provide deep insights into application behavior, performance patterns, and user experiences. As applications grow more complex, spanning multiple services, cloud providers, and infrastructure layers, monitoring systems must track interdependencies and provide actionable insights when issues arise.

graph TD
  A[Application Monitoring] --> B[Infrastructure Monitoring]
  A --> C[Application Performance Monitoring]
  A --> D[User Experience Monitoring]
  A --> E[Business Metrics Monitoring]
  B --> B1[Server Health]
  B --> B2[Network Performance]
  B --> B3[Database Performance]
  C --> C1[Response Times]
  C --> C2[Error Rates]
  C --> C3[Throughput]
  C --> C4[Resource Utilization]
  D --> D1[Page Load Times]
  D --> D2[Transaction Success]
  D --> D3[User Journeys]
  E --> E1[Conversion Rates]
  E --> E2[Revenue Metrics]
  E --> E3[User Engagement]

The Healthcare Analogy

Application monitoring can be compared to modern healthcare:

  • Infrastructure Monitoring is like checking vital signs (heart rate, blood pressure, temperature). These basic metrics tell us if the system is alive and functioning at a fundamental level.
  • Application Performance Monitoring is similar to diagnostic tests (blood work, imaging). These give us deeper insights into how well internal systems are functioning.
  • Error Tracking resembles monitoring symptoms and complaints. Just as a doctor wants to know what hurts and when it started, developers need to understand error patterns and their contexts.
  • Distributed Tracing is like tracking the spread of a medication through the body. It helps us understand how requests flow through a complex system.
  • Alerting functions like a hospital's monitoring system that notifies staff when a patient's condition requires attention.
  • Proactive Monitoring is similar to preventive medicine—identifying potential issues before they cause noticeable symptoms.

Just as modern medicine has evolved from treating symptoms to understanding whole-body health and prevention, application monitoring has evolved from simple uptime checks to comprehensive observability that provides context and insights into the entire application ecosystem.

Core Concepts in Application Monitoring

The Three Pillars of Observability

Observability—the ability to understand a system's internal state based on its external outputs—rests on three key pillars:

graph TD
  A[Observability] --> B[Metrics]
  A --> C[Logs]
  A --> D[Traces]
  B --> B1[Quantitative measurements]
  B --> B2[Time-series data]
  B --> B3[Aggregatable]
  C --> C1[Discrete events]
  C --> C2[Contextual information]
  C --> C3[Unstructured/structured]
  D --> D1[Request flows]
  D --> D2[Cross-service calls]
  D --> D3[Performance timing]

Metrics

Metrics are numerical values measured over time that represent the state or performance of a system component. They are typically quantitative, stored as time-series data, and aggregatable across instances and time windows.

Common Metric Types

Type | Description | Examples
Counter | Cumulative value that only increases | Request count, error count, completed tasks
Gauge | Single numerical value that can go up and down | Memory usage, CPU utilization, active connections
Histogram | Distribution of values across configurable buckets | Response time distribution, request size distribution
Summary | Similar to histogram but calculates configurable quantiles | 90th/95th/99th percentile of response times
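
To make the first three types concrete, here is a minimal sketch in plain Python. The classes `Counter`, `Gauge`, and `Histogram` are illustrative stand-ins written for this example and are not the API of any client library; real clients add labels, thread safety, and an exposition format. A summary would differ from the histogram by computing quantiles inside the application instead of counting into buckets.

```python
import bisect
from dataclasses import dataclass, field


@dataclass
class Counter:
    """Cumulative value that only ever increases."""
    value: float = 0.0

    def inc(self, amount: float = 1.0) -> None:
        if amount < 0:
            raise ValueError("counters cannot decrease")
        self.value += amount


@dataclass
class Gauge:
    """Single value that can move up or down."""
    value: float = 0.0

    def set(self, value: float) -> None:
        self.value = value


@dataclass
class Histogram:
    """Counts observations into buckets defined by upper bounds."""
    bounds: list
    counts: list = field(init=False)
    total: float = 0.0

    def __post_init__(self) -> None:
        self.bounds = sorted(self.bounds)
        # One extra bucket catches observations above the largest bound.
        self.counts = [0] * (len(self.bounds) + 1)

    def observe(self, value: float) -> None:
        # bisect_left returns the index of the first bound >= value.
        self.counts[bisect.bisect_left(self.bounds, value)] += 1
        self.total += value


requests = Counter()
requests.inc()
requests.inc()

in_flight = Gauge()
in_flight.set(3)
in_flight.set(1)

latency = Histogram(bounds=[0.1, 0.5, 1.0])
for seconds in (0.05, 0.3, 0.3, 2.0):
    latency.observe(seconds)

print(requests.value, in_flight.value, latency.counts)
```

The counter only accumulates, the gauge keeps the latest value, and the histogram stores per-bucket counts, which is what lets a backend estimate percentiles later without keeping every observation.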

Logs

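Logs are time-stamped records of discrete events that occur within an application. They provide context and details about what happened at a specific moment. Logs typically include a timestamp, a severity level, a message, and contextual fields such as the service name or request identifiers.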

Structured vs. Unstructured Logging

Unstructured Log:


[2025-05-08 14:32:17] ERROR Failed to process payment for order #12345 - Credit card declined

Structured Log (JSON format):


{
  "timestamp": "2025-05-08T14:32:17.345Z",
  "level": "ERROR",
  "service": "payment-service",
  "traceId": "abc123def456",
  "message": "Failed to process payment",
  "orderId": "12345",
  "errorCode": "CC_DECLINED",
  "paymentMethod": "credit_card",
  "customer": {
    "id": "cust_9876",
    "accountType": "premium"
  }
}

Structured logs are more machine-readable and enable more powerful filtering, searching, and analysis.
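
One way to produce logs in this shape is a custom formatter on top of Python's standard `logging` module. The sketch below is only an illustration: the `JsonFormatter` class, the hard-coded service name, and the convention of passing extra fields under a `context` key are assumptions made for this example.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "service": "payment-service",  # assumed service name
            "message": record.getMessage(),
        }
        # Values passed via `extra={"context": {...}}` become record attributes.
        entry.update(getattr(record, "context", {}))
        return json.dumps(entry)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("payments")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False

logger.error(
    "Failed to process payment",
    extra={"context": {"orderId": "12345", "errorCode": "CC_DECLINED"}},
)
```

Because every field is a key in the JSON object, a log backend can filter on `errorCode` or `orderId` directly instead of parsing free text.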

Traces

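Traces track the flow of requests as they propagate through distributed services. They help developers understand how a request flows through the system, which cross-service calls it makes, and where time is spent along the way.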

sequenceDiagram
  participant U as User
  participant A as API Gateway
  participant S as Service A
  participant D as Database
  participant C as Cache
  U->>A: Request
  Note right of A: Trace ID: abc123
  A->>S: Forward request
  S->>D: Query data
  S->>C: Check cache
  C->>S: Cache response
  S->>A: Service response
  A->>U: Final response
  Note over U,C: Each step is a span with timing information

In the example above, a trace consists of multiple spans representing each segment of the request's journey. Each span includes:

  • Start and end times
  • Service/component information
  • Operation name
  • Parent-child relationships
  • Contextual data (parameters, results, errors)
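
The span fields listed above can be modeled with a small data structure. This is a simplified sketch, not the data model of any tracing library; the `Span` class and its field names are assumptions chosen for illustration.

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    trace_id: str
    name: str                      # operation name
    service: str                   # service/component information
    parent_id: Optional[str] = None
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])
    start: float = field(default_factory=time.monotonic)
    end: Optional[float] = None
    attributes: dict = field(default_factory=dict)  # contextual data

    def child(self, name: str, service: str) -> "Span":
        # A child shares the trace ID and points back at its parent span.
        return Span(self.trace_id, name, service, parent_id=self.span_id)

    def finish(self) -> None:
        self.end = time.monotonic()

    @property
    def duration(self) -> float:
        if self.end is None:
            raise RuntimeError("span has not finished")
        return self.end - self.start


root = Span(trace_id="abc123", name="GET /orders", service="api-gateway")
query = root.child("query-data", service="database")
query.attributes["db.rows"] = 3
query.finish()
root.finish()
```

A tracing backend reassembles the request by grouping spans on `trace_id` and following `parent_id` links, which is how it can show that the database query ran inside the gateway request.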

Effective Monitoring Strategies

The RED Method

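The RED method, introduced by Tom Wilkie at Weaveworks, focuses on three key metrics for monitoring service health from a user perspective: Rate (requests per second), Errors (the number or share of those requests that fail), and Duration (how long requests take).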

These metrics provide a user-centric view of service health and performance, making them ideal for customer-facing services.
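
As a rough illustration of what the RED values mean, the sketch below computes request rate, error ratio, and duration percentiles from a batch of request records. The `Request` type, the nearest-rank percentile helper, and the choice to count only 5xx responses as errors are assumptions for this example; a metrics backend would normally derive these values from counters and histograms instead.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    status: int       # HTTP status code
    duration: float   # seconds


def percentile(values: list, pct: float) -> float:
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def red_metrics(requests: list, window_seconds: float) -> dict:
    """Summarize a non-empty batch of requests seen during one time window."""
    durations = [r.duration for r in requests]
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "rate": len(requests) / window_seconds,   # requests per second
        "error_ratio": errors / len(requests),    # share of failed requests
        "p50": percentile(durations, 50),
        "p99": percentile(durations, 99),
    }


sample = [Request(200, 0.10), Request(200, 0.20), Request(500, 1.50), Request(200, 0.30)]
summary = red_metrics(sample, window_seconds=2)
print(summary)
```

Reporting percentiles alongside the rate keeps a single slow request (here 1.5 seconds) visible, whereas an average would blur it into the faster ones.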

Implementing RED Metrics with Prometheus


// Example using Prometheus client for Node.js
const express = require('express');
const promClient = require('prom-client');
const app = express();

// Create a Registry to register metrics
const register = new promClient.Registry();
promClient.collectDefaultMetrics({ register });

// Create RED metrics and attach them to the custom registry
const httpRequestDurationSeconds = new promClient.Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [0.01, 0.05, 0.1, 0.5, 1, 2, 5, 10],
  registers: [register]
});

const httpRequestCounter = new promClient.Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'route', 'status_code'],
  registers: [register]
});

// Middleware to track metrics
app.use((req, res, next) => {
  const start = process.hrtime();
  const method = req.method;

  // Record metrics when the response is finished
  res.on('finish', () => {
    const duration = process.hrtime(start);
    const durationInSeconds = duration[0] + duration[1] / 1e9;
    // Prefer the matched route pattern (e.g. /users/:id) over the raw path
    // to keep label cardinality bounded; fall back for unmatched requests.
    const route = req.route ? req.route.path : 'unmatched';
    const statusCode = res.statusCode.toString();

    httpRequestDurationSeconds
      .labels(method, route, statusCode)
      .observe(durationInSeconds);

    httpRequestCounter
      .labels(method, route, statusCode)
      .inc();
  });

  next();
});

// Expose metrics endpoint for Prometheus to scrape
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});

// Application routes
app.get('/', (req, res) => {
  res.send('Hello World!');
});

app.listen(3000, () => {
  console.log('Server listening on port 3000');
});

The USE Method

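The USE method, developed by Brendan Gregg, focuses on resources rather than services. For every resource it checks Utilization (how busy the resource is), Saturation (how much extra work is queued), and Errors (the count of error events).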

This approach is well-suited for monitoring infrastructure components like CPU, memory, network, and disk.

USE Method Application Examples

Resource | Utilization | Saturation | Errors
CPU | CPU usage % | Run queue length | System errors, CPU-related errors
Memory | Memory usage % | Swap usage, OOM events | Memory allocation errors
Disk I/O | Disk busy % | I/O queue length | Disk errors, timeouts
Network | Bandwidth usage % | Packet queue length | Network errors, retransmits, drops
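
A USE check can be expressed as a small function that inspects the three dimensions for one resource. The sketch below is illustrative only: the `ResourceSample` type and the default limits (80% utilization, a saturation of 1.0) are assumptions for this example, and suitable limits depend on the resource and workload.

```python
from dataclasses import dataclass


@dataclass
class ResourceSample:
    name: str
    utilization: float   # fraction of capacity in use, 0.0 to 1.0
    saturation: float    # queued work, e.g. run queue length per core
    errors: int          # error events in the sampling window


def use_findings(sample: ResourceSample,
                 max_utilization: float = 0.8,
                 max_saturation: float = 1.0) -> list:
    """Return the USE dimensions that look unhealthy for one resource."""
    findings = []
    if sample.utilization > max_utilization:
        findings.append("utilization")
    if sample.saturation > max_saturation:
        findings.append("saturation")
    if sample.errors > 0:
        findings.append("errors")
    return findings


cpu = ResourceSample("cpu", utilization=0.92, saturation=2.5, errors=0)
disk = ResourceSample("disk", utilization=0.40, saturation=0.0, errors=3)

for resource in (cpu, disk):
    print(resource.name, use_findings(resource))
```

Checking all three dimensions matters because they fail independently: the disk in this example is lightly used yet still reports errors.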

The Four Golden Signals

Google's Site Reliability Engineering (SRE) book outlines four key metrics for monitoring service health: latency, traffic, errors, and saturation.

This approach combines aspects of both RED and USE methods and is suitable for most applications and services.

Real-World Example: E-commerce Monitoring Strategy

A large e-commerce platform implements the Four Golden Signals across their architecture:

  • Latency Monitoring:
    • p50/p90/p99 response times for all API endpoints
    • Frontend page load times by device type
    • Database query latency for critical operations
  • Traffic Tracking:
    • Requests per second by service and endpoint
    • User sessions per minute
    • Search queries and product views per minute
  • Error Monitoring:
    • HTTP 5xx and 4xx error rates
    • Failed payment transactions
    • Add-to-cart failures
    • Checkout abandonment rate
  • Saturation Measurement:
    • Database connection pool usage
    • Queue depths for asynchronous processes
    • Memory usage of key services
    • CPU utilization across the infrastructure

These signals are monitored in real time, with automated alerts that trigger when thresholds are exceeded. During peak shopping events like Black Friday, the team adjusts alert thresholds to account for the expected increase in traffic and load.
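
Adjusting thresholds for a peak event can be sketched as a lookup that scales only the load-dependent limits. All signal names, limits, and multipliers below are hypothetical values chosen for illustration; they are not taken from the platform described above.

```python
# Hypothetical baseline limits for one service.
BASE_THRESHOLDS = {
    "p99_latency_seconds": 1.0,
    "requests_per_second": 5000,
    "error_ratio": 0.01,
    "db_pool_usage": 0.85,
}

# Only load-dependent signals are relaxed; the error limit stays the same.
PEAK_MULTIPLIERS = {"p99_latency_seconds": 1.5, "requests_per_second": 3.0}


def thresholds_for(peak_event: bool) -> dict:
    """Return the limits to apply, scaled when a peak event is active."""
    if not peak_event:
        return dict(BASE_THRESHOLDS)
    return {k: v * PEAK_MULTIPLIERS.get(k, 1.0) for k, v in BASE_THRESHOLDS.items()}


def breached(observed: dict, peak_event: bool = False) -> list:
    """Return the names of signals whose observed value exceeds its limit."""
    limits = thresholds_for(peak_event)
    return sorted(name for name, value in observed.items() if value > limits[name])


observed = {"p99_latency_seconds": 1.2, "requests_per_second": 9000,
            "error_ratio": 0.004, "db_pool_usage": 0.70}

print(breached(observed))                   # normal day
print(breached(observed, peak_event=True))  # during a peak event
```

Keeping the error limit fixed is a deliberate choice in this sketch: higher traffic is expected during a sale, but a higher share of failed requests is not.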

Application Instrumentation

Instrumentation is the process of adding code to your application to collect monitoring data. Effective instrumentation is foundational to any monitoring strategy.

Types of Instrumentation

Manual Instrumentation

  • Developers explicitly add monitoring code
  • Provides precise control over what is monitored
  • Can be tailored to business-specific metrics
  • Requires ongoing maintenance
  • Risk of inconsistent implementation

Automatic Instrumentation

  • Added by monitoring libraries or agents
  • Quick to implement across applications
  • Consistent implementation
  • Less flexible for custom metrics
  • May miss application-specific contexts
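
Manual instrumentation often takes the form of a wrapper that developers apply to the functions they care about. The sketch below uses a decorator and in-memory dictionaries as stand-ins for a metrics backend; the names `instrumented`, `CALL_COUNTS`, `ERROR_COUNTS`, and `DURATIONS` are assumptions made for this example.

```python
import functools
import time
from collections import defaultdict

# In-memory stand-ins for a metrics backend (assumption for this sketch).
CALL_COUNTS = defaultdict(int)
ERROR_COUNTS = defaultdict(int)
DURATIONS = defaultdict(list)


def instrumented(operation: str):
    """Decorator that records call count, errors, and duration for a function."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            CALL_COUNTS[operation] += 1
            try:
                return func(*args, **kwargs)
            except Exception:
                ERROR_COUNTS[operation] += 1
                raise
            finally:
                # Runs on success and on failure, so every call is timed.
                DURATIONS[operation].append(time.perf_counter() - start)
        return wrapper
    return decorator


@instrumented("checkout")
def checkout(cart_total: float) -> str:
    if cart_total <= 0:
        raise ValueError("cart is empty")
    return "order-placed"
```

This shows both sides of the trade-off listed above: the developer decides exactly which business operation is measured, but someone has to remember to apply the decorator consistently.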

Key Components to Instrument

Manual Instrumentation Examples

Express.js with OpenTelemetry:


const { trace, context } = require('@opentelemetry/api');
const express = require('express');
const app = express();

// Initialize OpenTelemetry (setup code not shown).
// `database`, `metrics`, and `logger` are application-specific helpers assumed to exist.
const tracer = trace.getTracer('product-service');

app.get('/api/products/:id', async (req, res) => {
  // Start a new span for this endpoint
  const span = tracer.startSpan('get-product-by-id');
  
  // Add context to the span
  span.setAttribute('product.id', req.params.id);
  
  try {
    // Create a child span for the database operation by passing a context
    // that carries the parent span; without it the spans would be unrelated
    const dbSpan = tracer.startSpan('database-query', undefined, trace.setSpan(context.active(), span));
    
    let product;
    try {
      product = await database.getProduct(req.params.id);
      dbSpan.setAttribute('db.result_size', JSON.stringify(product).length);
    } catch (dbError) {
      dbSpan.setAttribute('error', true);
      dbSpan.setAttribute('error.message', dbError.message);
      throw dbError;
    } finally {
      dbSpan.end();
    }
    
    // Record business metrics
    metrics.counter('product.views').add(1, { 'product.id': req.params.id });
    
    res.json(product);
  } catch (error) {
    // Record the error on the span
    span.setAttribute('error', true);
    span.setAttribute('error.message', error.message);
    span.setAttribute('error.type', error.constructor.name);
    
    // Log the error
    logger.error('Failed to retrieve product', {
      error: error.message,
      productId: req.params.id,
      traceId: span.spanContext().traceId
    });
    
    res.status(500).json({ error: 'Failed to retrieve product' });
  } finally {
    // End the span
    span.end();
  }
});

Python with custom metrics:


import time
from flask import Flask, request
from prometheus_client import Counter, Histogram, generate_latest

app = Flask(__name__)

# Define metrics
REQUEST_COUNT = Counter(
    'request_count', 'App Request Count',
    ['app_name', 'endpoint', 'method', 'http_status']
)

REQUEST_LATENCY = Histogram(
    'request_latency_seconds', 'Request latency in seconds',
    ['app_name', 'endpoint', 'method']
)

# Middleware to measure request latency and count
@app.before_request
def start_timer():
    request.start_time = time.time()

@app.after_request
def record_metrics(response):
    endpoint = request.endpoint or 'unknown'
    request_latency = time.time() - request.start_time
    REQUEST_LATENCY.labels('my_app', endpoint, request.method).observe(request_latency)
    REQUEST_COUNT.labels('my_app', endpoint, request.method, response.status_code).inc()
    return response

@app.route('/api/users/<user_id>')
def get_user(user_id):
    # Application logic here
    return {'id': user_id, 'name': 'Example User'}

# Metrics endpoint for Prometheus to scrape
@app.route('/metrics')
def metrics():
    return generate_latest()

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)

OpenTelemetry Standard

OpenTelemetry is an open-source observability framework that provides a standardized way to instrument applications across different languages and platforms.

Benefits of OpenTelemetry

  • Vendor Neutrality: A single instrumentation that works with multiple backends
  • Consistent Cross-Service Tracing: Standardized context propagation
  • Comprehensive Coverage: Metrics, logs, and traces in one framework
  • Language Agnostic: Supports most major programming languages
  • Extensive Library Support: Auto-instrumentation for common frameworks
  • Future-Proof: Increasingly becoming the industry standard

OpenTelemetry Setup Example (Node.js)


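// tracing.js - OpenTelemetry setup file
// Note: this example targets the 1.x OpenTelemetry JS SDK packages. Later SDK
// releases may differ in some of these APIs, so check the versions you install.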
const { NodeTracerProvider } = require('@opentelemetry/sdk-trace-node');
const { Resource } = require('@opentelemetry/resources');
const { SemanticResourceAttributes } = require('@opentelemetry/semantic-conventions');
const { BatchSpanProcessor } = require('@opentelemetry/sdk-trace-base');
const { JaegerExporter } = require('@opentelemetry/exporter-jaeger');
const { registerInstrumentations } = require('@opentelemetry/instrumentation');
const { ExpressInstrumentation } = require('@opentelemetry/instrumentation-express');
const { HttpInstrumentation } = require('@opentelemetry/instrumentation-http');
const { MongoDBInstrumentation } = require('@opentelemetry/instrumentation-mongodb');

// Create and configure the OpenTelemetry tracer provider
const provider = new NodeTracerProvider({
  resource: new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: 'my-service',
    [SemanticResourceAttributes.SERVICE_VERSION]: '1.0.0',
    environment: process.env.NODE_ENV || 'development'
  })
});

// Configure span processor and exporter
const jaegerExporter = new JaegerExporter({
  endpoint: 'http://jaeger:14268/api/traces',
});

provider.addSpanProcessor(new BatchSpanProcessor(jaegerExporter));

// Register the provider
provider.register();

// Automatically instrument Express, HTTP, and MongoDB
registerInstrumentations({
  instrumentations: [
    new ExpressInstrumentation(),
    new HttpInstrumentation(),
    new MongoDBInstrumentation()
  ]
});

console.log('OpenTelemetry tracing initialized');

// In your main application file:
// require('./tracing'); // Add this at the top of your entry point file

Application Monitoring Tools

Tool Categories

graph TD
  A[Monitoring Tools] --> B[APM Solutions]
  A --> C[Infrastructure Monitoring]
  A --> D[Log Management]
  A --> E[Tracing Systems]
  A --> F[Synthetic Monitoring]
  A --> G[Real User Monitoring]
  B --> B1[New Relic]
  B --> B2[Datadog APM]
  B --> B3[Dynatrace]
  B --> B4[AppDynamics]
  C --> C1[Prometheus + Grafana]
  C --> C2[Nagios]
  C --> C3[Zabbix]
  D --> D1[ELK Stack]
  D --> D2[Graylog]
  D --> D3[Loki]
  E --> E1[Jaeger]
  E --> E2[Zipkin]
  E --> E3[AWS X-Ray]
  F --> F1[Pingdom]
  F --> F2[Uptrends]
  F --> F3[Checkly]
  G --> G1[Google Analytics]
  G --> G2[Hotjar]
  G --> G3[LogRocket]

Open-Source Monitoring Stack

Many organizations implement a monitoring stack based on open-source tools:

graph LR
  A[Apps with OpenTelemetry] --> B[Prometheus]
  A --> C[Loki]
  A --> D[Tempo/Jaeger]
  B --> E[Alertmanager]
  B --> F[Grafana]
  C --> F
  D --> F
  E --> G[Alert Channels]
  F --> H[Dashboards]

Common Open-Source Monitoring Components

  • Prometheus: Time-series database for metrics with a powerful query language
  • Grafana: Visualization platform for metrics, logs, and traces
  • Loki: Log aggregation system designed to work with Prometheus and Grafana
  • Jaeger/Tempo: Distributed tracing systems for tracking request flows
  • Alertmanager: Handles alerts from Prometheus and routes them to receivers
  • OpenTelemetry Collector: Receives, processes, and exports telemetry data

Commercial APM Tools

Commercial Application Performance Monitoring (APM) tools provide integrated solutions that combine metrics, logs, and traces with additional features:

Tool | Key Features | Best For
New Relic | Full-stack observability, NRQL query language, AI-assisted analysis | Web applications, microservices, comprehensive monitoring
Datadog | Infrastructure monitoring, APM, log management, RUM, security | Complex environments, multi-cloud, hybrid infrastructures
Dynatrace | AI-powered monitoring, auto-discovery, root cause analysis | Enterprise applications, complex dependencies, automated analysis
AppDynamics | Business transaction monitoring, code-level diagnostics | Enterprise applications, business metric correlation
Elastic APM | Open-core APM integrated with Elasticsearch, Kibana | Organizations already using the Elastic Stack

Real-World Example: E-commerce Platform Monitoring Setup

A mid-size e-commerce company with a microservices architecture implemented a hybrid monitoring approach:

  • Infrastructure Monitoring: Prometheus and Grafana for all infrastructure metrics
  • Log Management: ELK Stack (Elasticsearch, Logstash, Kibana) for centralized logging
  • Application Monitoring: New Relic APM for detailed service performance metrics
  • Real User Monitoring: New Relic Browser for frontend performance monitoring
  • Synthetic Monitoring: Checkly for critical user journey validation
  • Alerting: PagerDuty integrated with all monitoring systems

They chose this combination to balance cost and capabilities:

  • Prometheus/Grafana covered their extensive infrastructure at a lower cost than commercial alternatives
  • ELK provided powerful log analysis capabilities they couldn't find in simpler solutions
  • New Relic APM gave them code-level visibility without building their own tracing infrastructure
  • PagerDuty provided alert management and on-call rotation that integrated with all systems

This approach allowed them to monitor both technical metrics and business outcomes while controlling costs.

Effective Alerting Strategies

Alerting is the process of notifying relevant personnel when monitoring systems detect abnormal conditions. Effective alerting balances prompt notification of important issues with avoiding alert fatigue.

Alert Types and Priorities

Priority | Description | Response Time | Examples
P1 (Critical) | Service is down or severely impaired for all users | Immediate (24/7) | Website outage, payment processing failure
P2 (High) | Major feature unavailable or significant performance degradation | Within 30 minutes (24/7) | Checkout process slow, search not working
P3 (Medium) | Minor feature issues or early warnings of potential problems | Business hours | Elevated error rates, increasing response times
P4 (Low) | Non-urgent issues that should be addressed | Next business day | Non-critical component warnings, resource trending toward threshold

Alerting Best Practices

Prometheus Alerting Rules Example


# prometheus/alerts.yml
groups:
  - name: service_availability
    rules:
      # Alert when success rate drops below 95%
      - alert: HighErrorRate
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) by (service) / sum(rate(http_requests_total[5m])) by (service) > 0.05
        for: 2m
        labels:
          severity: critical
          team: backend
        annotations:
          summary: "High error rate on {{ $labels.service }}"
          description: "Service {{ $labels.service }} has a high error rate: {{ $value | humanizePercentage }}"
          dashboard: "https://grafana.example.com/d/service-overview?var-service={{ $labels.service }}"

      # Alert on high latency
      - alert: HighLatency
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (service, le)) > 2
        for: 5m
        labels:
          severity: warning
          team: backend
        annotations:
          summary: "High latency on {{ $labels.service }}"
          description: "Service {{ $labels.service }} has a 95th percentile latency of {{ $value }} seconds"
          dashboard: "https://grafana.example.com/d/service-performance?var-service={{ $labels.service }}"

  - name: infrastructure
    rules:
      # Alert on high CPU usage (1 minus the idle fraction, averaged over all cores)
      - alert: HighCPUUsage
        expr: 1 - avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) > 0.8
        for: 10m
        labels:
          severity: warning
          team: infrastructure
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "Instance {{ $labels.instance }} has CPU usage above 80% for more than 10 minutes: {{ $value | humanizePercentage }}"
          
      # Alert on low disk space
      - alert: LowDiskSpace
        expr: node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.1
        for: 5m
        labels:
          severity: warning
          team: infrastructure
        annotations:
          summary: "Low disk space on {{ $labels.instance }}: {{ $labels.mountpoint }}"
          description: "Filesystem {{ $labels.mountpoint }} on {{ $labels.instance }} has less than 10% free space: {{ $value | humanizePercentage }} available"
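
The `for:` field in the rules above means a condition must hold continuously for that long before the alert fires; until then the alert is pending. The sketch below models that behavior in a simplified way. The `AlertState` class is an assumption for illustration and does not reproduce Prometheus's exact evaluation timing.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertState:
    hold_seconds: float                  # plays the role of the rule's `for:` value
    pending_since: Optional[float] = None
    status: str = "inactive"

    def evaluate(self, condition_true: bool, now: float) -> str:
        """Update the alert status for one rule evaluation at time `now`."""
        if not condition_true:
            # Any evaluation where the condition is false resets the timer.
            self.pending_since = None
            self.status = "inactive"
        elif self.pending_since is None:
            self.pending_since = now
            self.status = "firing" if self.hold_seconds == 0 else "pending"
        elif now - self.pending_since >= self.hold_seconds:
            self.status = "firing"
        return self.status


alert = AlertState(hold_seconds=120)
# (timestamp in seconds, whether the alert expression matched)
timeline = [(0, True), (60, True), (120, True), (180, False), (240, True)]
history = [alert.evaluate(cond, now) for now, cond in timeline]
print(history)
```

The hold period filters out short spikes: the single false evaluation at 180 seconds resets the alert, and it has to stay true for another full period before firing again.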

Alert Design Principles

  • Actionable: Every alert should require action (if not, it's a metric)
  • Timely: Alert at the right time (not too early, not too late)
  • Precise: Target the specific component with an issue
  • Clear Ownership: Define who should respond
  • Documented Response: Have a documented playbook for each alert
  • Tunable: Easy to adjust thresholds as conditions change

Avoiding Common Alerting Mistakes

  • Alert Storms: Multiple alerts for a single root cause
  • False Positives: Alerts that don't indicate real problems
  • Alert Fatigue: Too many alerts leading to ignored notifications
  • Missing Context: Alerts without enough information to diagnose
  • Poor Prioritization: Non-critical issues with high severity
  • Inconsistent Definitions: Varying alert criteria across services
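
Alert storms are commonly reduced by grouping related alerts into one notification. The sketch below groups alerts on shared label values; the `group_alerts` function and the alert dictionaries are assumptions for illustration, loosely modeled on the idea of grouping by labels rather than on any tool's actual implementation.

```python
from collections import defaultdict


def group_alerts(alerts: list, group_by: tuple) -> dict:
    """Bundle alerts that share the same values for the `group_by` labels."""
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert["labels"].get(label, "") for label in group_by)
        groups[key].append(alert["name"])
    return dict(groups)


alerts = [
    {"name": "HighErrorRate", "labels": {"service": "checkout", "cluster": "eu-1"}},
    {"name": "HighLatency", "labels": {"service": "checkout", "cluster": "eu-1"}},
    {"name": "HighLatency", "labels": {"service": "search", "cluster": "eu-1"}},
]

notifications = group_alerts(alerts, group_by=("cluster", "service"))
for key, names in notifications.items():
    print(key, names)
```

Three alerts become two notifications here, and the on-call engineer sees the checkout service's error rate and latency problems together instead of as separate pages.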

Building Effective Monitoring Dashboards

Dashboards provide visual representations of monitoring data, helping teams understand system health and performance at a glance.

Dashboard Types and Purposes

Dashboard Design Principles

  • Purpose-Driven: Design for specific user needs and use cases
  • Progressive Disclosure: Start with high-level views, enable drilling down
  • Consistent Layout: Use consistent patterns across dashboards
  • Focus on Signal: Remove noise and clutter, highlight important data
  • Context and Reference: Include baselines and thresholds
  • Annotation: Mark events like deployments or incidents
  • Cross-Linking: Connect related dashboards and documentation

Real-World Example: Service Dashboard

A product team created a service dashboard with these panels:

Row 1: Service Health Overview
  • Request Rate (Requests per second over time)
  • Error Rate (Percentage of failed requests over time)
  • Response Time (p50/p90/p99 latency over time)
  • Success Rate SLO (Current status vs. target)
Row 2: Resource Utilization
  • CPU Usage (Average across instances)
  • Memory Usage (Average across instances)
  • Network I/O (Inbound/outbound traffic)
  • Instance Count (Total and per status)
Row 3: Dependencies
  • Database Query Time (Average and p95 duration)
  • Cache Hit Rate (Percentage of cache hits)
  • External API Response Times (By endpoint)
  • Queue Depth (For asynchronous processing)
Row 4: Business Impact
  • Conversion Rate (Compared to 24h and 7d averages)
  • Transactions Processed (Count and value)
  • User Engagement (Active sessions, page views)
  • Feature Usage (Key feature utilization metrics)

This dashboard included annotations for deployments and incidents, with links to related logs, traces, and documentation. They also added vertical markers for daily traffic patterns and a time-shifted overlay to compare to previous periods.

Grafana Dashboard JSON Example (Simplified)


{
  "annotations": {
    "list": [
      {
        "builtIn": 1,
        "datasource": "-- Grafana --",
        "enable": true,
        "hide": true,
        "iconColor": "rgba(0, 211, 255, 1)",
        "name": "Annotations & Alerts",
        "type": "dashboard"
      },
      {
        "datasource": "Prometheus",
        "enable": true,
        "expr": "changes(app_version{service=\"user-service\"}[1m]) > 0",
        "iconColor": "rgba(0, 211, 255, 1)",
        "name": "Deployments",
        "titleFormat": "Deployment",
        "textFormat": "Version {{tag_version}}"
      }
    ]
  },
  "editable": true,
  "gnetId": null,
  "graphTooltip": 1,
  "id": 1,
  "iteration": 1620926545544,
  "links": [],
  "panels": [
    {
      "collapsed": false,
      "datasource": null,
      "gridPos": {
        "h": 1,
        "w": 24,
        "x": 0,
        "y": 0
      },
      "id": 20,
      "panels": [],
      "title": "Service Health",
      "type": "row"
    },
    {
      "aliasColors": {},
      "bars": false,
      "dashLength": 10,
      "dashes": false,
      "datasource": "Prometheus",
      "fieldConfig": {
        "defaults": {},
        "overrides": []
      },
      "fill": 1,
      "fillGradient": 0,
      "gridPos": {
        "h": 8,
        "w": 12,
        "x": 0,
        "y": 1
      },
      "hiddenSeries": false,
      "id": 2,
      "legend": {
        "avg": false,
        "current": true,
        "max": true,
        "min": false,
        "show": true,
        "total": false,
        "values": true
      },
      "lines": true,
      "linewidth": 1,
      "nullPointMode": "null",
      "options": {
        "alertThreshold": true
      },
      "percentage": false,
      "pluginVersion": "7.5.3",
      "pointradius": 2,
      "points": false,
      "renderer": "flot",
      "seriesOverrides": [],
      "spaceLength": 10,
      "stack": false,
      "steppedLine": false,
      "targets": [
        {
          "expr": "sum(rate(http_requests_total{service=\"user-service\"}[5m]))",
          "interval": "",
          "legendFormat": "Requests/sec",
          "refId": "A"
        }
      ],
      "thresholds": [],
      "timeFrom": null,
      "timeRegions": [],
      "timeShift": null,
      "title": "Request Rate",
      "tooltip": {
        "shared": true,
        "sort": 0,
        "value_type": "individual"
      },
      "type": "graph",
      "xaxis": {
        "buckets": null,
        "mode": "time",
        "name": null,
        "show": true,
        "values": []
      },
      "yaxes": [
        {
          "format": "short",
          "label": "Requests/sec",
          "logBase": 1,
          "max": null,
          "min": "0",
          "show": true
        },
        {
          "format": "short",
          "label": null,
          "logBase": 1,
          "max": null,
          "min": null,
          "show": true
        }
      ],
      "yaxis": {
        "align": false,
        "alignLevel": null
      }
    },
    {
      "aliasColors": {
        "Error Rate": "red"
      },
      "bars": false,
      "dashLength": 10,
      "dashes": false,
      "datasource": "Prometheus",
      "fieldConfig": {
        "defaults": {},
        "overrides": []
      },
      "fill": 1,
      "fillGradient": 0,
      "gridPos": {
        "h": 8,
        "w": 12,
        "x": 12,
        "y": 1
      },
      "hiddenSeries": false,
      "id": 4,
      "legend": {
        "avg": false,
        "current": true,
        "max": false,
        "min": false,
        "show": true,
        "total": false,
        "values": true
      },
      "lines": true,
      "linewidth": 1,
      "nullPointMode": "null",
      "options": {
        "alertThreshold": true
      },
      "percentage": false,
      "pluginVersion": "7.5.3",
      "pointradius": 2,
      "points": false,
      "renderer": "flot",
      "seriesOverrides": [],
      "spaceLength": 10,
      "stack": false,
      "steppedLine": false,
      "targets": [
        {
          "expr": "sum(rate(http_requests_total{service=\"user-service\",status=~\"5..\"}[5m])) / sum(rate(http_requests_total{service=\"user-service\"}[5m])) * 100",
          "interval": "",
          "legendFormat": "Error Rate",
          "refId": "A"
        }
      ],
      "thresholds": [
        {
          "colorMode": "critical",
          "fill": true,
          "line": true,
          "op": "gt",
          "value": 5,
          "yaxis": "left"
        }
      ],
      "timeFrom": null,
      "timeRegions": [],
      "timeShift": null,
      "title": "Error Rate",
      "tooltip": {
        "shared": true,
        "sort": 0,
        "value_type": "individual"
      },
      "type": "graph",
      "xaxis": {
        "buckets": null,
        "mode": "time",
        "name": null,
        "show": true,
        "values": []
      },
      "yaxes": [
        {
          "format": "percent",
          "label": "",
          "logBase": 1,
          "max": null,
          "min": "0",
          "show": true
        },
        {
          "format": "short",
          "label": null,
          "logBase": 1,
          "max": null,
          "min": null,
          "show": true
        }
      ],
      "yaxis": {
        "align": false,
        "alignLevel": null
      }
    }
  ],
  "schemaVersion": 27,
  "style": "dark",
  "tags": ["service", "user-service"],
  "templating": {
    "list": [
      {
        "allValue": null,
        "current": {
          "selected": false,
          "text": "user-service",
          "value": "user-service"
        },
        "datasource": "Prometheus",
        "definition": "label_values(service)",
        "description": null,
        "error": null,
        "hide": 0,
        "includeAll": false,
        "label": "Service",
        "multi": false,
        "name": "service",
        "options": [],
        "query": {
          "query": "label_values(service)",
          "refId": "StandardVariableQuery"
        },
        "refresh": 1,
        "regex": "",
        "skipUrlSync": false,
        "sort": 0,
        "type": "query"
      }
    ]
  },
  "time": {
    "from": "now-6h",
    "to": "now"
  },
  "timepicker": {},
  "timezone": "",
  "title": "Service Dashboard",
  "uid": "abc123",
  "version": 1
}

Learning Activities

Activity 1: Application Instrumentation

Instrument a simple web application to collect key metrics:

  1. Choose a web application framework in your preferred language
  2. Add instrumentation to track request counts, response times, and error rates
  3. Implement custom metrics for a business transaction
  4. Create a metrics endpoint for Prometheus scraping
  5. Set up Prometheus to collect metrics from your application
  6. Create a basic Grafana dashboard to visualize the metrics

Activity 2: Define a Monitoring Strategy

For a hypothetical e-commerce application, design a comprehensive monitoring strategy:

  1. Identify key components and services to monitor
  2. Define the critical metrics for each component
  3. Specify tool choices for monitoring different aspects
  4. Design alert policies for critical scenarios
  5. Create mockups for the main dashboard types
  6. Outline a plan for implementing the strategy in phases

Activity 3: Alert Design and Evaluation

Evaluate and improve an alerting strategy:

  1. Review a provided set of alert definitions
  2. Identify potential issues like alert storms or false positives
  3. Redesign problematic alerts using best practices
  4. Create alert definitions for Prometheus/Alertmanager
  5. Define appropriate alert routing and escalation policies
  6. Document response procedures for critical alerts

Key Takeaways

Further Learning Resources