Service Communication Patterns

How Microservices Talk to Each Other

Introduction to Service Communication

In our previous lecture, we explored the fundamental principles of microservices architecture. Now, we'll dive into one of the most critical aspects of building effective microservices: how these services communicate with each other.

Communication between services is fundamentally different from communication between components in a monolithic application. In a monolith, components communicate through in-memory function calls, which are fast, reliable, and strongly typed. In contrast, microservices must communicate across network boundaries, introducing latency, reliability concerns, and serialization/deserialization overhead.

graph LR subgraph "Monolithic Application" A[Component A] --function call--> B[Component B] B --return value--> A end subgraph "Microservices Architecture" C[Service A] --network request--> D[Service B] D --network response--> C end

Think of the difference between talking to someone in the same room (monolith) versus having to call them on the phone (microservices). The phone call introduces potential issues: dropped calls, background noise, delays, or misunderstandings that wouldn't occur in face-to-face conversation.

This fundamental shift requires us to adopt specific patterns for service-to-service communication to ensure our distributed system remains resilient, performant, and maintainable.

Synchronous vs. Asynchronous Communication

Synchronous Communication

In synchronous communication, the caller service waits for a response from the called service before proceeding. This is similar to a phone call where you ask a question and wait on the line for an answer.

sequenceDiagram participant A as Service A participant B as Service B A->>B: Request Note right of A: Waiting... B->>A: Response Note right of A: Continues processing

Examples of synchronous communication include:

Advantages:

Disadvantages:

// Example: Synchronous REST API call in Node.js
const axios = require('axios');

async function getOrderDetails(orderId) {
  try {
    // Synchronous call - we wait for the response
    const response = await axios.get(`https://order-service/api/orders/${orderId}`);
    return response.data;
  } catch (error) {
    console.error('Error fetching order details:', error);
    throw error;
  }
}

// Usage
app.get('/api/dashboard/order/:id', async (req, res) => {
  try {
    const orderDetails = await getOrderDetails(req.params.id);
    res.json(orderDetails);
  } catch (error) {
    res.status(500).json({ error: 'Failed to retrieve order details' });
  }
});

Asynchronous Communication

In asynchronous communication, the caller service doesn't wait for a response but continues processing after sending the request. This is like sending a text message or email—you don't wait for an immediate reply before continuing with other tasks.

sequenceDiagram participant A as Service A participant Q as Message Queue participant B as Service B A->>Q: Publish Message Note right of A: Continues processing immediately Q->>B: Deliver Message Note right of B: Processes message B->>Q: Publish Result (optional) Q->>A: Deliver Result (optional)

Examples of asynchronous communication include:

Advantages:

Disadvantages:

// Example: Asynchronous messaging with RabbitMQ in Node.js
const amqp = require('amqplib');

// Publishing a message
async function createOrder(orderData) {
  try {
    const connection = await amqp.connect('amqp://localhost');
    const channel = await connection.createChannel();
    
    const queue = 'order_processing';
    await channel.assertQueue(queue, { durable: true });
    
    // Publish message and continue without waiting for processing
    channel.sendToQueue(queue, Buffer.from(JSON.stringify(orderData)), {
      persistent: true
    });
    
    console.log(`Order ${orderData.orderId} sent for processing`);
    
    setTimeout(() => {
      connection.close();
    }, 500);
    
    return { orderId: orderData.orderId, status: 'processing' };
  } catch (error) {
    console.error('Error publishing order message:', error);
    throw error;
  }
}

// Consuming messages
async function startOrderProcessor() {
  try {
    const connection = await amqp.connect('amqp://localhost');
    const channel = await connection.createChannel();
    
    const queue = 'order_processing';
    await channel.assertQueue(queue, { durable: true });
    
    // Only process one message at a time
    channel.prefetch(1);
    
    console.log('Order processor waiting for messages...');
    
    channel.consume(queue, async (msg) => {
      if (msg) {
        const orderData = JSON.parse(msg.content.toString());
        
        console.log(`Processing order ${orderData.orderId}...`);
        
        try {
          // Process the order (could be a long-running task)
          await processOrder(orderData);
          
          // Acknowledge successful processing
          channel.ack(msg);
          
          console.log(`Order ${orderData.orderId} processed successfully`);
        } catch (error) {
          console.error(`Error processing order ${orderData.orderId}:`, error);
          
          // Reject the message and requeue it
          channel.nack(msg, false, true);
        }
      }
    });
  } catch (error) {
    console.error('Error starting order processor:', error);
  }
}

// Start the consumer
startOrderProcessor();

Request-Response Pattern

The request-response pattern is the most straightforward communication pattern, where one service (the client) makes a request to another service (the server) and waits for a response. This pattern can be implemented synchronously or asynchronously.

Synchronous Request-Response

This is the most common implementation, using HTTP-based REST APIs or gRPC.

sequenceDiagram participant C as Client Service participant S as Server Service C->>S: HTTP GET /api/resource/:id S->>C: 200 OK with JSON response

RESTful API Design Best Practices:

// Example Express.js implementation of a RESTful API
const express = require('express');
const router = express.Router();

// Get all products with optional filtering
router.get('/products', async (req, res) => {
  try {
    const { category, minPrice, maxPrice, limit = 20, page = 1 } = req.query;
    
    // Build query filters
    const filters = {};
    if (category) filters.category = category;
    if (minPrice || maxPrice) {
      filters.price = {};
      if (minPrice) filters.price.$gte = parseFloat(minPrice);
      if (maxPrice) filters.price.$lte = parseFloat(maxPrice);
    }
    
    // Calculate pagination
    const skip = (parseInt(page) - 1) * parseInt(limit);
    
    // Fetch products from database
    const products = await Product.find(filters)
      .limit(parseInt(limit))
      .skip(skip)
      .sort({ createdAt: -1 });
      
    // Get total count for pagination
    const totalCount = await Product.countDocuments(filters);
    
    // Add hypermedia links
    const baseUrl = `${req.protocol}://${req.get('host')}${req.baseUrl}/products`;
    const totalPages = Math.ceil(totalCount / parseInt(limit));
    
    const links = {
      self: `${baseUrl}?page=${page}&limit=${limit}`,
    };
    
    if (parseInt(page) > 1) {
      links.prev = `${baseUrl}?page=${parseInt(page) - 1}&limit=${limit}`;
    }
    
    if (parseInt(page) < totalPages) {
      links.next = `${baseUrl}?page=${parseInt(page) + 1}&limit=${limit}`;
    }
    
    // Send response
    res.json({
      data: products,
      meta: {
        totalCount,
        page: parseInt(page),
        limit: parseInt(limit),
        totalPages
      },
      links
    });
  } catch (error) {
    console.error('Error fetching products:', error);
    res.status(500).json({ error: 'Failed to fetch products' });
  }
});

// Get a specific product
router.get('/products/:id', async (req, res) => {
  try {
    const product = await Product.findById(req.params.id);
    
    if (!product) {
      return res.status(404).json({ error: 'Product not found' });
    }
    
    // Add hypermedia links
    const baseUrl = `${req.protocol}://${req.get('host')}${req.baseUrl}`;
    product._links = {
      self: `${baseUrl}/products/${product._id}`,
      category: `${baseUrl}/categories/${product.category}`,
      reviews: `${baseUrl}/products/${product._id}/reviews`
    };
    
    res.json(product);
  } catch (error) {
    console.error('Error fetching product:', error);
    res.status(500).json({ error: 'Failed to fetch product' });
  }
});

gRPC for Service Communication

gRPC is a high-performance RPC (Remote Procedure Call) framework that uses Protocol Buffers for service definitions and binary serialization for efficient communication.

Advantages of gRPC:

// Example gRPC service definition (protocol buffer)
syntax = "proto3";

package product;

service ProductService {
  // Get a single product by ID
  rpc GetProduct(GetProductRequest) returns (Product) {}
  
  // Search for products
  rpc SearchProducts(SearchProductsRequest) returns (SearchProductsResponse) {}
  
  // Stream price updates for a product
  rpc WatchProductPrice(GetProductRequest) returns (stream PriceUpdate) {}
}

message GetProductRequest {
  string product_id = 1;
}

message Product {
  string id = 1;
  string name = 2;
  string description = 3;
  double price = 4;
  string category = 5;
  bool in_stock = 6;
}

message SearchProductsRequest {
  string query = 1;
  string category = 2;
  double min_price = 3;
  double max_price = 4;
  int32 page = 5;
  int32 limit = 6;
}

message SearchProductsResponse {
  repeated Product products = 1;
  int32 total_count = 2;
  int32 page = 3;
  int32 limit = 4;
  int32 total_pages = 5;
}

message PriceUpdate {
  string product_id = 1;
  double new_price = 2;
  string currency = 3;
  string timestamp = 4;
}

Asynchronous Request-Response

Asynchronous request-response uses messaging to decouple the request from the response, while still maintaining the request-response semantic.

sequenceDiagram participant C as Client Service participant RQ as Request Queue participant S as Server Service participant RQ as Response Queue C->>RQ: Publish Request with correlationId RQ->>S: Deliver Request S->>RQ: Publish Response with correlationId RQ->>C: Deliver Response

This pattern is especially useful for long-running operations where you don't want to keep an HTTP connection open, or when you need better resilience against failures.

// Example: Asynchronous request-response with RabbitMQ
const amqp = require('amqplib');
const { v4: uuidv4 } = require('uuid');

// Client side: Making an asynchronous request
async function getProductDetails(productId) {
  const connection = await amqp.connect('amqp://localhost');
  const channel = await connection.createChannel();
  
  // Create a response queue with a unique name
  const { queue: replyQueue } = await channel.assertQueue('', { exclusive: true });
  
  // Generate a correlation ID for this request
  const correlationId = uuidv4();
  
  // Create a promise that will be resolved when we receive the response
  return new Promise((resolve, reject) => {
    // Set a timeout for the request
    const timeout = setTimeout(() => {
      connection.close();
      reject(new Error('Request timed out'));
    }, 10000);
    
    // Listen for the response
    channel.consume(replyQueue, (msg) => {
      if (msg.properties.correlationId === correlationId) {
        clearTimeout(timeout);
        
        // Parse the response
        const content = JSON.parse(msg.content.toString());
        
        // Close the connection
        setTimeout(() => connection.close(), 500);
        
        // Resolve the promise with the response data
        resolve(content);
      }
    }, { noAck: true });
    
    // Send the request
    channel.sendToQueue('product_requests', Buffer.from(JSON.stringify({ productId })), {
      correlationId,
      replyTo: replyQueue
    });
  });
}

// Server side: Processing requests and sending responses
async function startProductService() {
  const connection = await amqp.connect('amqp://localhost');
  const channel = await connection.createChannel();
  
  const requestQueue = 'product_requests';
  await channel.assertQueue(requestQueue, { durable: false });
  
  console.log('Product service waiting for requests...');
  
  channel.consume(requestQueue, async (msg) => {
    if (msg) {
      const request = JSON.parse(msg.content.toString());
      const { productId } = request;
      
      console.log(`Processing request for product ${productId}`);
      
      try {
        // Fetch product details (e.g., from database)
        const product = await getProductFromDatabase(productId);
        
        // Send response back to the reply queue
        channel.sendToQueue(
          msg.properties.replyTo,
          Buffer.from(JSON.stringify(product)),
          {
            correlationId: msg.properties.correlationId
          }
        );
        
        // Acknowledge the request
        channel.ack(msg);
      } catch (error) {
        console.error(`Error processing product request:`, error);
        
        // Send error response
        channel.sendToQueue(
          msg.properties.replyTo,
          Buffer.from(JSON.stringify({ error: error.message })),
          {
            correlationId: msg.properties.correlationId
          }
        );
        
        // Acknowledge the request (we handled it, even though it failed)
        channel.ack(msg);
      }
    }
  });
}

Event-Driven Architecture

Event-driven architecture is a communication pattern where services produce and consume events. An event is a notification that something has happened, such as "OrderCreated", "PaymentProcessed", or "ShipmentDelivered". Services publish events without knowing or caring who will consume them, and other services subscribe to events they're interested in.

graph LR A[Order Service] --OrderCreated Event--> B[Event Bus] B --Event--> C[Inventory Service] B --Event--> D[Notification Service] B --Event--> E[Analytics Service]

Think of an event bus like a bulletin board in a shared office space. The Order Service posts a notice saying "New Order #1234 Created!" It doesn't know who will read it, but interested services (Inventory, Notification, Analytics) check the board regularly and take action when they see relevant notices.

Event Sourcing

Event sourcing is a pattern where instead of storing the current state of entities, you store a sequence of events that led to that state. The current state can be reconstructed by replaying the events.

graph LR A[Command: CreateOrder] --> B[Event: OrderCreated] B --> C[Event: PaymentProcessed] C --> D[Event: OrderShipped] D --> E[Event: OrderDelivered] F[Current State] --> G[Order #1234: DELIVERED]

Think of event sourcing like a checkbook register versus just looking at your current balance. The register shows every deposit and withdrawal that led to your current balance, allowing you to verify the balance is correct and understand how it changed over time.

// Example: Event-driven architecture with Kafka
const { Kafka } = require('kafkajs');

// Kafka client setup
const kafka = new Kafka({
  clientId: 'order-service',
  brokers: ['kafka:9092']
});

// Create a producer
const producer = kafka.producer();

// Function to publish an OrderCreated event
async function publishOrderCreatedEvent(order) {
  // Connect to Kafka if not already connected
  await producer.connect();
  
  // Create the event
  const event = {
    type: 'OrderCreated',
    data: {
      orderId: order.id,
      customerId: order.customerId,
      items: order.items,
      totalAmount: order.totalAmount,
      timestamp: new Date().toISOString()
    }
  };
  
  // Publish the event
  await producer.send({
    topic: 'orders',
    messages: [
      { 
        key: order.id,  // Using orderId as the key for partitioning
        value: JSON.stringify(event)
      }
    ]
  });
  
  console.log(`Published OrderCreated event for order ${order.id}`);
}

// Consumer setup for the Notification Service
async function startNotificationConsumer() {
  const consumer = kafka.consumer({ groupId: 'notification-service' });
  
  await consumer.connect();
  await consumer.subscribe({ topic: 'orders', fromBeginning: false });
  
  await consumer.run({
    eachMessage: async ({ topic, partition, message }) => {
      const event = JSON.parse(message.value.toString());
      
      if (event.type === 'OrderCreated') {
        console.log(`Processing OrderCreated event for order ${event.data.orderId}`);
        
        // Send an order confirmation email to the customer
        await sendOrderConfirmationEmail(event.data.customerId, event.data);
      }
    }
  });
}

// Start the notification consumer
startNotificationConsumer().catch(console.error);

Command Query Responsibility Segregation (CQRS)

CQRS is a pattern that separates the read and write operations of a service. This allows for optimizing each side independently and is often used with event sourcing.

graph TD A[Client] -- Command --> B[Command Handler] A -- Query --> C[Query Handler] B -- Events --> D[(Event Store)] D -- Build/Update --> E[(Read Database)] C -- Read From --> E

A real-world analogy for CQRS is a restaurant with separate processes for taking orders (commands) and serving food (queries). The waiter takes your order (command) and gives it to the kitchen. Later, when your food is ready, servers bring it to your table (query result). These processes are separate and optimized for their specific purposes.

// Example: CQRS in a Node.js application
// Command side - handling write operations
class OrderCommandHandler {
  async createOrder(command) {
    // Validate the command
    this.validateCreateOrderCommand(command);
    
    // Generate a unique ID for the order
    const orderId = generateId();
    
    // Create the OrderCreated event
    const event = {
      type: 'OrderCreated',
      data: {
        orderId,
        customerId: command.customerId,
        items: command.items,
        totalAmount: calculateTotal(command.items),
        status: 'CREATED',
        createdAt: new Date().toISOString()
      }
    };
    
    // Store the event
    await eventStore.append('order', orderId, event);
    
    // Return the new order ID
    return orderId;
  }
  
  async processPayment(command) {
    // Validate the command
    this.validateProcessPaymentCommand(command);
    
    // Get the current state of the order by replaying events
    const orderEvents = await eventStore.getEvents('order', command.orderId);
    const orderState = buildOrderState(orderEvents);
    
    // Make sure the order exists and is in the correct state
    if (!orderState || orderState.status !== 'CREATED') {
      throw new Error('Order not found or not in CREATED state');
    }
    
    // Process the payment (in a real app, call a payment service)
    // ...
    
    // Create the PaymentProcessed event
    const event = {
      type: 'PaymentProcessed',
      data: {
        orderId: command.orderId,
        paymentId: generateId(),
        amount: orderState.totalAmount,
        status: 'COMPLETED',
        processedAt: new Date().toISOString()
      }
    };
    
    // Store the event
    await eventStore.append('order', command.orderId, event);
  }
}

// Query side - handling read operations
class OrderQueryHandler {
  async getOrder(orderId) {
    // In a real app, this would read from a denormalized read database
    // that's optimized for queries
    return await orderReadDatabase.findById(orderId);
  }
  
  async getCustomerOrders(customerId) {
    return await orderReadDatabase.findByCustomerId(customerId);
  }
}

// Event handler to update the read database
class OrderEventHandler {
  async handleEvent(event) {
    const { type, data } = event;
    
    switch (type) {
      case 'OrderCreated':
        await orderReadDatabase.insert({
          id: data.orderId,
          customerId: data.customerId,
          items: data.items,
          totalAmount: data.totalAmount,
          status: data.status,
          createdAt: data.createdAt,
          updatedAt: data.createdAt
        });
        break;
        
      case 'PaymentProcessed':
        await orderReadDatabase.update(data.orderId, {
          status: 'PAID',
          paymentId: data.paymentId,
          updatedAt: data.processedAt
        });
        break;
        
      // Handle other event types
      // ...
    }
  }
}

API Gateway Pattern

An API Gateway acts as a single entry point for client applications to access a microservices-based system. It handles concerns like routing, authentication, rate limiting, and request aggregation.

graph TD A[Client] --> B[API Gateway] B --> C[User Service] B --> D[Product Service] B --> E[Order Service] B --> F[Payment Service]

Think of an API Gateway like a hotel concierge who acts as a single point of contact for guests. Instead of guests needing to know which department handles room service, which handles housekeeping, and which handles tour bookings, they just talk to the concierge who knows how to route each request to the right department.

Responsibilities of an API Gateway

API Gateway Implementation Options

// Example: Building a simple API Gateway with Express.js
const express = require('express');
const { createProxyMiddleware } = require('http-proxy-middleware');
const rateLimit = require('express-rate-limit');
const jwt = require('express-jwt');
const cors = require('cors');

const app = express();

// Enable CORS
app.use(cors());

// JWT Authentication middleware
const authenticate = jwt({
  secret: process.env.JWT_SECRET,
  algorithms: ['HS256']
});

// Rate limiting middleware
const apiLimiter = rateLimit({
  windowMs: 15 * 60 * 1000, // 15 minutes
  max: 100 // limit each IP to 100 requests per windowMs
});

// Logging middleware
app.use((req, res, next) => {
  const start = Date.now();
  
  res.on('finish', () => {
    const duration = Date.now() - start;
    console.log(`${req.method} ${req.originalUrl} ${res.statusCode} ${duration}ms`);
  });
  
  next();
});

// Proxy endpoints to respective services
// User Service
app.use(
  '/api/users',
  authenticate, // Apply authentication to user endpoints
  apiLimiter, // Apply rate limiting
  createProxyMiddleware({
    target: 'http://user-service:3001',
    pathRewrite: {
      '^/api/users': '/api/v1/users' // Rewrite path if needed
    },
    changeOrigin: true
  })
);

// Product Service - public endpoints don't need authentication
app.use(
  '/api/products',
  apiLimiter,
  createProxyMiddleware({
    target: 'http://product-service:3002',
    pathRewrite: {
      '^/api/products': '/api/v1/products'
    },
    changeOrigin: true
  })
);

// Order Service - requires authentication
app.use(
  '/api/orders',
  authenticate,
  apiLimiter,
  createProxyMiddleware({
    target: 'http://order-service:3003',
    pathRewrite: {
      '^/api/orders': '/api/v1/orders'
    },
    changeOrigin: true
  })
);

// Request aggregation - example of an API gateway enhancing capabilities
app.get('/api/products/:id/with-reviews', async (req, res) => {
  try {
    // Fetch product details
    const productResponse = await fetch(`http://product-service:3002/api/v1/products/${req.params.id}`);
    const product = await productResponse.json();
    
    // Fetch product reviews
    const reviewsResponse = await fetch(`http://review-service:3004/api/v1/reviews?productId=${req.params.id}`);
    const reviews = await reviewsResponse.json();
    
    // Combine the data
    res.json({
      ...product,
      reviews
    });
  } catch (error) {
    console.error('Error aggregating product data:', error);
    res.status(500).json({ error: 'Failed to fetch product with reviews' });
  }
});

// Error handling
app.use((err, req, res, next) => {
  console.error(err);
  
  // Handle JWT authentication errors
  if (err.name === 'UnauthorizedError') {
    return res.status(401).json({ error: 'Invalid token' });
  }
  
  res.status(500).json({ error: 'Internal Server Error' });
});

const PORT = process.env.PORT || 3000;
app.listen(PORT, () => {
  console.log(`API Gateway running on port ${PORT}`);
});

Backend for Frontend (BFF) Pattern

BFF is a variation of the API Gateway pattern where separate gateways are created for different client types (web, mobile, etc.) to provide optimized APIs for each.

graph TD A[Web Client] --> B[Web BFF] C[Mobile Client] --> D[Mobile BFF] E[Desktop Client] --> F[Desktop BFF] B --> G[User Service] B --> H[Product Service] B --> I[Order Service] D --> G D --> H D --> I F --> G F --> H F --> I

Think of BFFs like having specialized concierges for different types of hotel guests: one for business travelers who's knowledgeable about meeting rooms and business services, one for families who knows about kid-friendly activities, and one for tour groups who can coordinate large group movements.

Service Discovery

Service discovery is the process of automatically detecting services and their instances within a network. This becomes essential in a microservices environment where services can scale dynamically and change their locations.

graph LR A[Service A] -->|Lookup| B[Service Registry] B -->|Return locations| A A -->|Call| C[Service B Instance 1] A -->|Call| D[Service B Instance 2] A -->|Call| E[Service B Instance 3] F[Service B Instance 1] -->|Register| B G[Service B Instance 2] -->|Register| B H[Service B Instance 3] -->|Register| B

Service Discovery Patterns

Service Discovery Implementations

// Example: Client-side discovery with Consul in Node.js
const Consul = require('consul');
const express = require('express');
const axios = require('axios');

const app = express();
const consul = new Consul({
  host: 'consul',
  port: 8500
});

// Register this service with Consul
function registerService() {
  const serviceId = `order-service-${process.env.HOSTNAME}`;
  
  consul.agent.service.register({
    id: serviceId,
    name: 'order-service',
    address: process.env.HOST,
    port: parseInt(process.env.PORT),
    check: {
      http: `http://${process.env.HOST}:${process.env.PORT}/health`,
      interval: '10s',
      timeout: '5s'
    }
  }, (err) => {
    if (err) {
      console.error('Error registering service:', err);
      return;
    }
    
    console.log(`Service registered with ID ${serviceId}`);
    
    // Deregister service on shutdown
    process.on('SIGINT', () => {
      console.log('Deregistering service...');
      consul.agent.service.deregister(serviceId, () => {
        process.exit();
      });
    });
  });
}

// Add a health check endpoint
app.get('/health', (req, res) => {
  res.status(200).send('OK');
});

// Discover product service instances
async function discoverProductService() {
  return new Promise((resolve, reject) => {
    consul.catalog.service.nodes('product-service', (err, result) => {
      if (err) {
        return reject(err);
      }
      
      if (!result || result.length === 0) {
        return reject(new Error('No product service instances found'));
      }
      
      // Randomly select one of the instances (simple load balancing)
      const instance = result[Math.floor(Math.random() * result.length)];
      resolve(`http://${instance.ServiceAddress}:${instance.ServicePort}`);
    });
  });
}

// Endpoint that calls the product service
app.get('/api/orders/:id/with-product-details', async (req, res) => {
  try {
    // First, get the order
    const order = await getOrderById(req.params.id);
    
    // For each product in the order, get the details
    const productServiceUrl = await discoverProductService();
    
    const productPromises = order.items.map(async (item) => {
      const productResponse = await axios.get(`${productServiceUrl}/api/products/${item.productId}`);
      return {
        ...item,
        productDetails: productResponse.data
      };
    });
    
    const itemsWithDetails = await Promise.all(productPromises);
    
    res.json({
      ...order,
      items: itemsWithDetails
    });
  } catch (error) {
    console.error('Error getting order with product details:', error);
    res.status(500).json({ error: 'Failed to get order with product details' });
  }
});

// Start the server and register with Consul
const PORT = process.env.PORT || 3000;
app.listen(PORT, () => {
  console.log(`Order service running on port ${PORT}`);
  registerService();
});

Handling Failures and Resilience

In a distributed system, failures are inevitable. Service communication patterns must account for these failures and build in resilience.

Resilience Patterns

stateDiagram-v2 [*] --> Closed Closed --> Open: Too many failures Open --> HalfOpen: Wait timeout HalfOpen --> Closed: Successful requests HalfOpen --> Open: Failed request

A circuit breaker works like an electrical circuit breaker in your home. When too much current flows (too many failures), the breaker trips (opens) to prevent damage. After a cooling-off period, it allows a test current (half-open) and if that succeeds, it resets (closes).

// Example: Circuit Breaker pattern with Opossum in Node.js
const CircuitBreaker = require('opossum');
const axios = require('axios');

// Function to call the product service
function getProductDetails(productId) {
  return axios.get(`http://product-service/api/products/${productId}`)
    .then(response => response.data);
}

// Create a circuit breaker
const breaker = new CircuitBreaker(getProductDetails, {
  failureThreshold: 50,    // 50% failure rate triggers open circuit
  resetTimeout: 10000,     // 10 seconds before attempting half-open state
  timeout: 3000,           // 3 second timeout for any request
  errorThresholdPercentage: 50,
  rollingCountTimeout: 60000, // 1 minute rolling window
  rollingCountBuckets: 10,  // 10 buckets of 6 seconds each
});

// Add listeners for events
breaker.on('open', () => {
  console.log('Circuit breaker opened - product service appears to be down');
});

breaker.on('halfOpen', () => {
  console.log('Circuit breaker half-open - testing if product service is back');
});

breaker.on('close', () => {
  console.log('Circuit breaker closed - product service is operational');
});

breaker.on('timeout', () => {
  console.log('Request to product service timed out');
});

// Fallback function for when the circuit is open
breaker.fallback(() => {
  return {
    id: 'unknown',
    name: 'Product details temporarily unavailable',
    price: 0,
    inStock: false
  };
});

// Example usage in an Express route
app.get('/api/orders/:id/with-product-details', async (req, res) => {
  try {
    // Get the order
    const order = await getOrderById(req.params.id);
    
    // For each product in the order, get the details using the circuit breaker
    const itemsWithDetails = await Promise.all(
      order.items.map(async (item) => {
        const productDetails = await breaker.fire(item.productId);
        return {
          ...item,
          productDetails
        };
      })
    );
    
    res.json({
      ...order,
      items: itemsWithDetails
    });
  } catch (error) {
    console.error('Error getting order with product details:', error);
    res.status(500).json({ error: 'Failed to get order with product details' });
  }
});

Distributed Tracing

Distributed tracing helps diagnose and troubleshoot issues in microservices by tracking requests as they flow through multiple services.

sequenceDiagram participant Client participant API Gateway participant Order Service participant Payment Service participant Notification Service Client->>+API Gateway: POST /api/orders Note right of Client: Trace ID: abc123 API Gateway->>+Order Service: Create Order Note right of API Gateway: Trace ID: abc123
Span ID: def456 Order Service->>+Payment Service: Process Payment Note right of Order Service: Trace ID: abc123
Span ID: ghi789 Payment Service-->>-Order Service: Payment Processed Order Service->>+Notification Service: Send Confirmation Note right of Order Service: Trace ID: abc123
Span ID: jkl012 Notification Service-->>-Order Service: Notification Sent Order Service-->>-API Gateway: Order Created API Gateway-->>-Client: 201 Created Note over Client,Notification Service: Tracing Platform collects and visualizes spans
// Example: Distributed tracing with OpenTelemetry in Node.js
const { NodeTracerProvider } = require('@opentelemetry/node');
const { SimpleSpanProcessor } = require('@opentelemetry/tracing');
const { JaegerExporter } = require('@opentelemetry/exporter-jaeger');
const { ExpressInstrumentation } = require('@opentelemetry/instrumentation-express');
const { HttpInstrumentation } = require('@opentelemetry/instrumentation-http');
const { registerInstrumentations } = require('@opentelemetry/instrumentation');
const opentelemetry = require('@opentelemetry/api');

// Initialize the tracer
function initTracing(serviceName) {
  const provider = new NodeTracerProvider();
  
  // Configure Jaeger exporter
  const exporter = new JaegerExporter({
    serviceName,
    endpoint: 'http://jaeger:14268/api/traces',
  });
  
  provider.addSpanProcessor(new SimpleSpanProcessor(exporter));
  provider.register();
  
  // Register auto-instrumentations
  registerInstrumentations({
    instrumentations: [
      new HttpInstrumentation(),
      new ExpressInstrumentation(),
    ],
  });
  
  return opentelemetry.trace.getTracer(serviceName);
}

// Initialize tracing
const tracer = initTracing('order-service');

// Example of manual instrumentation in an Express route
app.post('/api/orders', async (req, res) => {
  // Create a new span for this route
  const span = tracer.startSpan('create-order');
  
  try {
    // Add attributes to the span
    span.setAttribute('userId', req.user.id);
    span.setAttribute('orderTotal', req.body.totalAmount);
    
    // Create a context with the current span
    const ctx = opentelemetry.trace.setSpan(opentelemetry.context.active(), span);
    
    // Create the order
    const order = {
      customerId: req.user.id,
      items: req.body.items,
      totalAmount: req.body.totalAmount,
      status: 'CREATED',
      createdAt: new Date()
    };
    
    // Save the order to the database
    const savedOrder = await opentelemetry.context.with(ctx, async () => {
      // This will be automatically associated with the parent span
      const childSpan = tracer.startSpan('save-order-to-database');
      
      try {
        const result = await Order.create(order);
        childSpan.end();
        return result;
      } catch (error) {
        childSpan.recordException(error);
        childSpan.end();
        throw error;
      }
    });
    
    // Process payment
    await opentelemetry.context.with(ctx, async () => {
      const paymentSpan = tracer.startSpan('process-payment');
      
      try {
        paymentSpan.setAttribute('paymentMethod', req.body.paymentMethod);
        paymentSpan.setAttribute('amount', order.totalAmount);
        
        // Call payment service
        await processPayment(req.body.paymentMethod, order.totalAmount);
        
        paymentSpan.end();
      } catch (error) {
        paymentSpan.recordException(error);
        paymentSpan.end();
        throw error;
      }
    });
    
    // Send confirmation
    await opentelemetry.context.with(ctx, async () => {
      const notificationSpan = tracer.startSpan('send-order-confirmation');
      
      try {
        await sendOrderConfirmation(savedOrder);
        notificationSpan.end();
      } catch (error) {
        notificationSpan.recordException(error);
        notificationSpan.setStatus({
          code: opentelemetry.SpanStatusCode.ERROR,
          message: error.message
        });
        notificationSpan.end();
        
        // We don't rethrow here because failing to send a confirmation
        // shouldn't fail the entire order creation
        console.error('Failed to send order confirmation:', error);
      }
    });
    
    // End the span and return the response
    span.end();
    res.status(201).json(savedOrder);
  } catch (error) {
    // Record the error and end the span
    span.recordException(error);
    span.setStatus({
      code: opentelemetry.SpanStatusCode.ERROR,
      message: error.message
    });
    span.end();
    
    console.error('Error creating order:', error);
    res.status(500).json({ error: 'Failed to create order' });
  }
});

Practical Exercise: Implementing Service Communication

Let's apply what we've learned by designing and implementing communication patterns for a simplified e-commerce application.

Exercise Scenario

You're building an e-commerce platform with the following microservices:

Exercise Tasks

  1. Choose appropriate communication patterns for each service interaction
  2. Design the API interfaces for synchronous communications
  3. Define the event schema for asynchronous communications
  4. Implement a resilience strategy for key service interactions
  5. Design a service discovery approach

Sample Solution

Here's one possible approach to designing the communication patterns:

graph TD A[Client] --> B[API Gateway] B -->|Synchronous REST| C[User Service] B -->|Synchronous REST| D[Product Service] B -->|Synchronous REST| E[Order Service] E -->|Synchronous REST| C E -->|Synchronous REST| D E -->|Synchronous REST| F[Payment Service] E -->|Async Event| G[Event Bus] F -->|Async Event| G G -->|OrderCreated Event| H[Notification Service] G -->|PaymentProcessed Event| H G -->|OrderCreated Event| I[Analytics Service] G -->|PaymentProcessed Event| I G -->|InventoryUpdated Event| D

Communication Pattern Choices:

API Design (Sample REST Endpoints):

// User Service API
GET    /api/users/:id               // Get user profile
POST   /api/users                   // Register new user
POST   /api/auth/login              // User login
GET    /api/users/:id/orders        // Get user's orders

// Product Service API
GET    /api/products                // List products
GET    /api/products/:id            // Get product details
GET    /api/products/:id/inventory  // Check product inventory
PUT    /api/products/:id/inventory  // Update inventory (internal)

// Order Service API
POST   /api/orders                  // Create order
GET    /api/orders/:id              // Get order details
PUT    /api/orders/:id/status       // Update order status

// Payment Service API
POST   /api/payments                // Process payment
GET    /api/payments/:id            // Get payment details

Event Schema (Sample Events):

// OrderCreated Event
{
  "eventType": "OrderCreated",
  "version": "1.0",
  "timestamp": "2025-05-10T15:30:00Z",
  "data": {
    "orderId": "ord_123456",
    "userId": "usr_789012",
    "items": [
      {
        "productId": "prod_111",
        "quantity": 2,
        "price": 29.99
      },
      {
        "productId": "prod_222",
        "quantity": 1,
        "price": 49.99
      }
    ],
    "totalAmount": 109.97,
    "status": "CREATED"
  }
}

// PaymentProcessed Event
{
  "eventType": "PaymentProcessed",
  "version": "1.0",
  "timestamp": "2025-05-10T15:31:00Z",
  "data": {
    "paymentId": "pay_123456",
    "orderId": "ord_123456",
    "userId": "usr_789012",
    "amount": 109.97,
    "status": "COMPLETED",
    "paymentMethod": "credit_card"
  }
}

// InventoryUpdated Event
{
  "eventType": "InventoryUpdated",
  "version": "1.0",
  "timestamp": "2025-05-10T15:32:00Z",
  "data": {
    "productId": "prod_111",
    "quantityChanged": -2,
    "newQuantity": 18,
    "reason": "ORDER",
    "orderId": "ord_123456"
  }
}

Resilience Strategy:

Service Discovery Approach:

Summary and Best Practices

We've explored various patterns for communication between microservices, each with its own strengths and trade-offs. Here are some key takeaways and best practices:

Communication Pattern Selection

API Design Best Practices

Event-Driven Architecture Best Practices

Resilience Best Practices

Further Reading and Resources