Network Resilience

In distributed systems, network failures are inevitable: nodes become unreachable, connections fail, and partitions occur. Happen builds its resilience on NATS and JetStream so that events keep their causal integrity through network disruptions.

Happen's approach to network resilience stays true to its philosophy of radical simplicity by leveraging NATS' battle-tested features:

  • Durable Storage: Events are preserved in JetStream before delivery attempts

  • Guaranteed Delivery: At-least-once and exactly-once delivery semantics

  • Automatic Reconnection: NATS clients intelligently reconnect after network failures

  • Message Replay: Unacknowledged messages are automatically replayed when connections recover

  • Cross-Region Replication: Messages remain available across geographic regions

  • Causal Ordering: Events maintain their causal relationships throughout disruptions

Together, these features let the system recover on its own once a disruption ends, with minimal configuration.

Comprehensive Resilience Features

Happen combines multiple NATS capabilities to solve the full spectrum of network resilience challenges:

1. Persistent Message Storage

Events are automatically stored in NATS JetStream streams before delivery is attempted:

// When a node sends an event (internal implementation using NATS)
function sendEvent(targetNode, event) {
  // FIRST: Publish to JetStream stream with the event's ID as message ID
  // This happens automatically
  
  // THEN: Deliver to recipient
  // If successful, the recipient acknowledges
  // If unsuccessful, the event remains in JetStream for later delivery
}

This ensures events survive process crashes, network failures, and other disruptions.
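
The store-then-deliver sequence can be sketched with an in-memory stand-in for a JetStream stream. This is an illustration only, not Happen's internal code: `InMemoryStream`, `redeliver`, the `deliver` callback, and the event shape are all hypothetical names chosen for the example.

```javascript
// Hypothetical in-memory stand-in for a JetStream stream (illustration only).
class InMemoryStream {
  constructor() {
    this.pending = new Map(); // event id -> event awaiting acknowledgement
  }
  publish(event) {
    this.pending.set(event.id, event); // persist before any delivery attempt
  }
  ack(eventId) {
    this.pending.delete(eventId); // acknowledged events need no redelivery
  }
  unacknowledged() {
    return [...this.pending.values()];
  }
}

// Store first, then attempt delivery; `deliver` returns true on acknowledgement.
function sendEvent(stream, event, deliver) {
  stream.publish(event);
  let acknowledged = false;
  try {
    acknowledged = deliver(event) === true;
  } catch (err) {
    acknowledged = false; // a failed attempt leaves the event stored
  }
  if (acknowledged) stream.ack(event.id);
  return acknowledged;
}

// Retry everything that is still unacknowledged, for example after a reconnect.
function redeliver(stream, deliver) {
  for (const event of stream.unacknowledged()) {
    if (deliver(event) === true) stream.ack(event.id);
  }
}
```

Because the event is stored before the first attempt, a failed delivery loses nothing: the event simply stays in the stream until a later `redeliver` pass succeeds.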

2. Delivery Guarantees

Happen supports both at-least-once and exactly-once delivery semantics:

// Configure delivery guarantees at system level
const happen = initializeHappen({
  nats: {
    capabilities: {
      delivery: {
        // Choose your delivery semantics
        mode: "exactly-once", // or "at-least-once"
        deduplication: true,
        deduplicationWindow: "5m" // 5 minute window
      }
    }
  }
});

// Or configure per node
const paymentNode = createNode("payment-service", {
  delivery: {
    mode: "exactly-once",
    acknowledge: true
  }
});

With exactly-once delivery, a message that is redelivered because of a network issue is processed only once, provided the duplicate arrives within the configured deduplication window.
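
One way to picture the deduplication window is a table of recently seen message IDs that expire after the window elapses. The sketch below is a simplified model under that assumption, not the JetStream or Happen implementation; `Deduplicator` and `shouldProcess` are hypothetical names.

```javascript
// Hypothetical deduplicator: remembers message IDs for a fixed window.
class Deduplicator {
  constructor(windowMs) {
    this.windowMs = windowMs;
    this.seen = new Map(); // message id -> time first seen (ms)
  }
  // Returns true if the message should be processed, false if it is a duplicate.
  shouldProcess(messageId, now = Date.now()) {
    // Forget IDs whose window has expired.
    for (const [id, seenAt] of this.seen) {
      if (now - seenAt >= this.windowMs) this.seen.delete(id);
    }
    if (this.seen.has(messageId)) return false;
    this.seen.set(messageId, now);
    return true;
  }
}

// A 5 minute window, matching the "5m" setting in the configuration above.
const dedup = new Deduplicator(5 * 60 * 1000);
```

In this model a duplicate that arrives after the window has expired is treated as a new message, which is why the window should be longer than the longest redelivery delay you expect.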

3. Automatic Reconnection

NATS clients automatically attempt to reconnect when network connections fail:

// Configure reconnection behavior
const happen = initializeHappen({
  nats: {
    connection: {
      // Reconnection settings
      reconnect: true,
      reconnectTimeWait: 2000, // 2 seconds between attempts
      maxReconnectAttempts: -1  // Unlimited reconnect attempts
    }
  }
});

During reconnection attempts, outbound messages are queued locally until the connection is restored.
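
The local queueing behaviour can be sketched as a thin wrapper around a send function. This is a minimal model, assuming an unbounded in-memory queue; `BufferedConnection`, `onDisconnect`, and `onReconnect` are hypothetical names and not part of the NATS client API.

```javascript
// Hypothetical connection wrapper that queues outbound messages while offline.
class BufferedConnection {
  constructor(transportSend) {
    this.transportSend = transportSend; // function that actually sends a message
    this.connected = true;
    this.queue = []; // messages waiting for the connection to return
  }
  send(message) {
    if (this.connected) {
      this.transportSend(message);
    } else {
      this.queue.push(message); // hold until reconnect
    }
  }
  onDisconnect() {
    this.connected = false;
  }
  onReconnect() {
    this.connected = true;
    // Flush in the original order so earlier messages go out first.
    while (this.queue.length > 0) {
      this.transportSend(this.queue.shift());
    }
  }
}
```

A real client would normally cap the size of such a queue; the sketch omits that to keep the ordering logic visible.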

4. Flow Control and Backpressure

NATS provides natural backpressure mechanisms that prevent overwhelming recipients during recovery:

// Configure flow control
const happen = initializeHappen({
  nats: {
    capabilities: {
      flowControl: {
        enabled: true,
        maxPending: 256 * 1024 // 256KB of pending messages
      }
    }
  }
});

This ensures that when a node recovers after network issues, it won't be flooded with a sudden burst of messages.
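
A byte-bounded delivery window is one way to model this kind of backpressure. The sketch below assumes each message carries an `id` and a `size` in bytes; `FlowController` and its methods are hypothetical and only illustrate the idea behind a `maxPending` limit.

```javascript
// Hypothetical byte-bounded delivery window (illustration only).
class FlowController {
  constructor(maxPendingBytes) {
    this.maxPendingBytes = maxPendingBytes;
    this.pendingBytes = 0; // bytes delivered but not yet acknowledged
    this.backlog = [];     // messages waiting for capacity
    this.inFlight = [];    // messages handed to the recipient
  }
  offer(message) {
    this.backlog.push(message);
    this.drain();
  }
  // Called when the recipient acknowledges a message.
  ack(messageId) {
    const index = this.inFlight.findIndex((m) => m.id === messageId);
    if (index === -1) return; // unknown or already acknowledged
    const [done] = this.inFlight.splice(index, 1);
    this.pendingBytes -= done.size;
    this.drain();
  }
  // Release backlog messages in order while they fit under the limit.
  drain() {
    while (
      this.backlog.length > 0 &&
      this.pendingBytes + this.backlog[0].size <= this.maxPendingBytes
    ) {
      const next = this.backlog.shift();
      this.pendingBytes += next.size;
      this.inFlight.push(next);
    }
  }
}
```

The recipient's acknowledgements set the pace: a recovering node receives at most `maxPendingBytes` of unacknowledged data at a time, and the rest waits in the backlog in its original order.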

5. Multi-Region Resilience

Happen leverages NATS SuperClusters to provide cross-region resilience:

// Configure multi-region operation
const happen = initializeHappen({
  nats: {
    // SuperCluster configuration
    connection: {
      servers: ['nats://local-region:4222'],
      jetstream: true
    },
    // Region configuration
    regions: {
      primary: "us-west",
      replicas: ["us-east", "eu-central"],
      // Configuration for handling cross-region communication
      coordination: {
        strategy: "primary-replica",
        conflictResolution: "last-writer-wins"
      }
    }
  }
});

This enables:

  • Geographic Redundancy: System continues operating even if an entire region fails

  • Data Locality: Process data in the optimal region for performance and compliance

  • Disaster Recovery: Automatic failover and recovery between regions
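
Failover under a primary-replica strategy can be pictured as choosing the first healthy region in preference order. The function below is a hypothetical sketch that reuses the `regions` shape from the configuration above; `selectRegion` and the `isHealthy` callback are assumptions, not Happen APIs.

```javascript
// Hypothetical region selection: prefer the primary, then replicas in listed order.
function selectRegion(regions, isHealthy) {
  const candidates = [regions.primary, ...regions.replicas];
  const available = candidates.find((region) => isHealthy(region));
  return available === undefined ? null : available; // null when every region is down
}

const regions = {
  primary: "us-west",
  replicas: ["us-east", "eu-central"]
};
```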

6. Causal Event Recovery

When network connections recover, events are replayed while preserving causal relationships:

// Receive events in proper causal order after reconnection
orderNode.on("process-order", (event) => {
  // After reconnection, events will arrive in causal order
  // with exactly-once processing guarantees
  
  const { orderId } = event.payload;
  // Process normally - framework handles recovery
});

This ensures that even after severe network disruptions, your application's causal integrity is maintained.
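
To make "causal order" concrete, the sketch below holds back an event until the event that caused it has been handled. It assumes each event carries an `id` and a `causationId` (null for root events); those field names and the `CausalReplayer` class are hypothetical and are not taken from Happen's event format.

```javascript
// Hypothetical causal replayer: an event reaches the handler only after
// the event that caused it (`causationId`) has been handled.
class CausalReplayer {
  constructor(handler) {
    this.handler = handler;
    this.handled = new Set(); // ids already processed
    this.waiting = new Map(); // cause id -> events blocked on that cause
  }
  receive(event) {
    if (this.handled.has(event.id)) return; // drop duplicates
    const cause = event.causationId;
    if (cause != null && !this.handled.has(cause)) {
      const blocked = this.waiting.get(cause) || [];
      blocked.push(event);
      this.waiting.set(cause, blocked);
      return;
    }
    this.process(event);
  }
  process(event) {
    this.handler(event);
    this.handled.add(event.id);
    // Release any events that were waiting on this one.
    const unblocked = this.waiting.get(event.id) || [];
    this.waiting.delete(event.id);
    for (const next of unblocked) this.receive(next);
  }
}
```

Even if replay delivers events in reverse, the handler sees each cause before its effects, and an event that arrives twice is handled once.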

Resilience Across Environment Boundaries

Happen's resilience capabilities work seamlessly across different runtime environments:

Server Environments

In server environments, NATS clients connect directly to the NATS server:

// Server node connecting directly to NATS
const serverNode = createNode("backend-service");

Browser Environments

In browser environments, clients connect through WebSockets:

// Browser node connecting via WebSockets
const browserNode = createNode("frontend-client");

Edge/IoT Environments

For edge or IoT devices with intermittent connectivity:

// Edge node with specialized configuration
const edgeNode = createNode("sensor-device", {
  // Configuration for intermittent connectivity
  connectivity: {
    mode: "intermittent",
    localBuffering: true,
    syncOnConnect: true
  }
});

In all these environments, the same resilience guarantees apply, providing consistent behavior regardless of where nodes are running.

Application Integration

While network resilience operates automatically, applications can optionally integrate with the system for enhanced awareness:

// Listen for connection status events
orderNode.on('system.connection-status', (event) => {
  const { status, reason } = event.payload;
  
  if (status === 'disconnected') {
    // Update UI to show connectivity issue
    updateConnectionStatus('degraded', reason);
    
    // Enable offline mode
    enableOfflineMode();
  } else if (status === 'reconnected') {
    // Update UI to show restored connectivity
    updateConnectionStatus('connected');
    
    // Return to normal operation
    disableOfflineMode();
  }
});

This allows applications to:

  • Notify users about connectivity issues

  • Adapt UI behavior during disruptions

  • Log network problems for diagnostics

  • Enable offline capabilities when appropriate

Key-Value Store for Persistent State

In addition to message persistence, Happen leverages NATS JetStream's Key-Value store for durable state:

// State is automatically persisted in JetStream KV store
orderNode.state.set(state => ({
  ...state,
  orders: {
    ...state.orders,
    "order-123": {
      status: "processing",
      updatedAt: Date.now()
    }
  }
}));

This ensures that node state persists across restarts and can be shared across instances, providing:

  • Durable State: State persists even if nodes restart

  • Consistent State: Updates maintain causal ordering

  • Shared State: State can be accessed from multiple instances

  • Versioned State: Each update creates a new version that can be tracked
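
Versioned state is what makes safe concurrent updates possible: a writer can state which version it read and have the write rejected if another instance got there first. The sketch below models that with a per-key revision counter, which is a simplification chosen for the example; `VersionedStore` is a hypothetical class, not the JetStream Key-Value API.

```javascript
// Hypothetical in-memory key-value store that tracks a revision per key.
class VersionedStore {
  constructor() {
    this.entries = new Map(); // key -> { value, revision }
  }
  get(key) {
    return this.entries.get(key) || null;
  }
  // Unconditional write; returns the new revision.
  put(key, value) {
    const current = this.entries.get(key);
    const revision = current ? current.revision + 1 : 1;
    this.entries.set(key, { value, revision });
    return revision;
  }
  // Write only if the caller saw the latest revision.
  update(key, value, expectedRevision) {
    const current = this.entries.get(key);
    const actual = current ? current.revision : 0;
    if (actual !== expectedRevision) {
      throw new Error(
        `revision mismatch for ${key}: expected ${expectedRevision}, found ${actual}`
      );
    }
    return this.put(key, value);
  }
}
```

Two instances that both read revision 1 cannot both succeed with `update(..., 1)`: the second call throws, and that instance must re-read the state before retrying.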

Comprehensive Network Failure Handling

Happen's NATS-based architecture covers the common failure scenarios, from a single crashed process to a regional outage:

Process Crashes

If a process crashes:

  • Events are preserved in JetStream

  • State is preserved in the Key-Value store

  • Upon restart, the node resumes processing where it left off

Network Partitions

During network partitions:

  • Nodes continue functioning in their connected segments

  • Events for unreachable nodes are stored in JetStream

  • When the partition heals, normal operation resumes automatically

Regional Outages

During regional outages:

  • Traffic automatically routes to available regions

  • When the region recovers, it synchronizes with the rest of the system

  • Causal relationships are preserved across regions

Temporary Disconnections

During temporary disconnections:

  • NATS clients automatically attempt to reconnect

  • Outbound messages queue locally

  • Upon reconnection, normal event flow resumes

Long-Term Outages

For long-term outages:

  • Events are preserved in JetStream for the configured retention period

  • State remains in the Key-Value store

  • When connectivity is restored, normal operation resumes
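
The retention period sets the limit on how long an outage can last without losing events. A minimal sketch of that rule, assuming each stored event records a `storedAt` timestamp in milliseconds (a hypothetical field, as is the `applyRetention` helper):

```javascript
// Hypothetical retention filter: keep only events younger than the retention period.
function applyRetention(events, retentionMs, now = Date.now()) {
  return events.filter((event) => now - event.storedAt < retentionMs);
}
```

An event stored before an outage is still available afterwards only if the outage ends before that event's age reaches `retentionMs`, so the retention period should be sized for the longest outage you need to tolerate.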

Benefits of NATS-Powered Resilience

Happen's NATS-based approach to network resilience offers several key advantages:

  1. Battle-Tested Foundation: Built on NATS' proven resilience capabilities

  2. Zero Data Loss: Events are preserved in JetStream during disruptions, for as long as the configured retention period allows

  3. Exactly-Once Processing: In exactly-once mode, redelivered events are not processed twice, even during recovery

  4. Self-Healing: System automatically recovers when issues resolve

  5. Cross-Environment Consistency: Same resilience guarantees across all environments

  6. Minimal Configuration: Works out of the box with sensible defaults

  7. Global Scale: Extends from single-process to worldwide deployments

By leveraging NATS and JetStream, Happen provides enterprise-grade resilience capabilities without the complexity typically associated with distributed systems. Your application can focus on business logic while the framework handles the challenging aspects of network resilience.
