Network Resilience
In distributed systems, network failures are inevitable. Nodes become unreachable, connections fail, and partitions occur. Happen builds its resilience on NATS and JetStream, so that events and their causal relationships are preserved through these disruptions.
Happen's approach to network resilience stays true to its philosophy of radical simplicity by leveraging NATS' battle-tested features:
Durable Storage: Events are preserved in JetStream before delivery attempts
Guaranteed Delivery: At-least-once and exactly-once delivery semantics
Automatic Reconnection: NATS clients intelligently reconnect after network failures
Message Replay: Unacknowledged messages are automatically replayed when connections recover
Cross-Region Replication: Messages remain available across geographic regions
Causal Ordering: Events maintain their causal relationships throughout disruptions
This creates a self-healing system that requires minimal configuration while providing enterprise-grade resilience.
Comprehensive Resilience Features
Happen combines multiple NATS capabilities to solve the full spectrum of network resilience challenges:
1. Persistent Message Storage
Events are automatically stored in NATS JetStream streams before delivery is attempted:
// When a node sends an event (internal implementation using NATS)
function sendEvent(targetNode, event) {
  // FIRST: Publish to JetStream stream with the event's ID as message ID
  // This happens automatically
  // THEN: Deliver to recipient
  // If successful, the recipient acknowledges
  // If unsuccessful, the event remains in JetStream for later delivery
}
This ensures events survive process crashes, network failures, and other disruptions.
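The persist-then-deliver sequence can be illustrated with a small in-memory model. This is only a sketch: EventStore stands in for a JetStream stream, and EventStore, deliver, and redeliverPending are hypothetical names, not part of the Happen or NATS APIs.

```javascript
// Simplified in-memory model of "persist first, then deliver".
// EventStore is a stand-in for a JetStream stream, not the NATS API.
class EventStore {
  constructor() {
    this.pending = new Map(); // event ID -> event awaiting acknowledgement
  }

  append(event) {
    // Keying by event ID makes a repeated append of the same event a no-op.
    if (!this.pending.has(event.id)) this.pending.set(event.id, event);
  }

  ack(eventId) {
    this.pending.delete(eventId);
  }

  unacknowledged() {
    return [...this.pending.values()];
  }
}

// deliver is a hypothetical callback that returns true when the recipient acknowledged.
function sendEvent(store, deliver, event) {
  store.append(event); // FIRST: persist
  const acknowledged = deliver(event); // THEN: attempt delivery
  if (acknowledged) store.ack(event.id);
  return acknowledged;
}

// Retry every event that was stored but never acknowledged.
function redeliverPending(store, deliver) {
  for (const event of store.unacknowledged()) {
    if (deliver(event)) store.ack(event.id);
  }
}
```

Because the event is stored before the delivery attempt, a failed delivery leaves it in the store, and a later call to redeliverPending can complete it.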
2. Delivery Guarantees
Happen supports both at-least-once and exactly-once delivery semantics:
// Configure delivery guarantees at system level
const happen = initializeHappen({
  nats: {
    capabilities: {
      delivery: {
        // Choose your delivery semantics
        mode: "exactly-once", // or "at-least-once"
        deduplication: true,
        deduplicationWindow: "5m" // 5 minute window
      }
    }
  }
});
// Or configure per node
const paymentNode = createNode("payment-service", {
  delivery: {
    mode: "exactly-once",
    acknowledge: true
  }
});
With exactly-once delivery, a message that is redelivered after a network issue is recognized by its ID and processed only once, provided the redelivery falls within the configured deduplication window.
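To make the role of the deduplication window concrete, here is a minimal sketch of ID-based deduplication over a time window. The Deduplicator class is hypothetical and only models the idea; it is not how Happen or JetStream implement it internally.

```javascript
// Simplified model of ID-based deduplication inside a time window.
class Deduplicator {
  constructor(windowMs) {
    this.windowMs = windowMs;
    this.seen = new Map(); // message ID -> timestamp when first seen
  }

  // Returns true if the message should be processed, false if it is a duplicate.
  shouldProcess(messageId, now = Date.now()) {
    // Forget IDs that have fallen out of the window.
    for (const [id, seenAt] of this.seen) {
      if (now - seenAt >= this.windowMs) this.seen.delete(id);
    }
    if (this.seen.has(messageId)) return false;
    this.seen.set(messageId, now);
    return true;
  }
}

// Usage: a 5 minute window, matching the "5m" setting above.
const dedup = new Deduplicator(5 * 60 * 1000);
dedup.shouldProcess("evt-1"); // first sighting: process it
dedup.shouldProcess("evt-1"); // redelivery inside the window: skip it
```

In this model a redelivery that arrives after the window has elapsed is treated as new, so handlers for critical operations may still benefit from being idempotent.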
3. Automatic Reconnection
NATS clients automatically attempt to reconnect when network connections fail:
// Configure reconnection behavior
const happen = initializeHappen({
  nats: {
    connection: {
      // Reconnection settings
      reconnect: true,
      reconnectTimeWait: 2000, // 2 seconds between attempts
      maxReconnectAttempts: -1 // Unlimited reconnect attempts
    }
  }
});
During reconnection attempts, outbound messages are queued locally until the connection is restored.
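The queue-while-disconnected behavior can be sketched as follows. BufferedConnection, transmit, and shouldRetry are hypothetical names used for illustration; the real buffering happens inside the NATS client, and this model leaves out details such as buffer size limits.

```javascript
// Simplified model of outbound buffering during a disconnection.
class BufferedConnection {
  constructor(transmit) {
    this.transmit = transmit; // hypothetical function that puts a message on the wire
    this.connected = true;
    this.outbound = []; // messages published while disconnected
  }

  publish(message) {
    if (this.connected) this.transmit(message);
    else this.outbound.push(message);
  }

  onDisconnect() {
    this.connected = false;
  }

  onReconnect() {
    this.connected = true;
    // Flush queued messages in the order they were published.
    for (const message of this.outbound.splice(0)) this.transmit(message);
  }
}

// A maxReconnectAttempts of -1 means "keep trying forever".
function shouldRetry(attemptsSoFar, maxReconnectAttempts) {
  return maxReconnectAttempts === -1 || attemptsSoFar < maxReconnectAttempts;
}
```

Flushing in publish order keeps the sender's own event sequence intact once the connection returns.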
4. Flow Control and Backpressure
NATS provides natural backpressure mechanisms that prevent overwhelming recipients during recovery:
// Configure flow control
const happen = initializeHappen({
  nats: {
    capabilities: {
      flowControl: {
        enabled: true,
        maxPending: 256 * 1024 // 256KB of pending messages
      }
    }
  }
});
This ensures that when a node recovers after network issues, it won't be flooded with a sudden burst of messages.
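One way to picture a byte-based pending limit is a window that admits messages only while the unprocessed total stays under the limit. The PendingWindow class below is a hypothetical model of that idea, assuming maxPending is measured in bytes as the comment above suggests.

```javascript
// Simplified model of a byte-based pending limit.
class PendingWindow {
  constructor(maxPendingBytes) {
    this.maxPendingBytes = maxPendingBytes;
    this.pendingBytes = 0; // bytes delivered but not yet processed
  }

  // Returns true if the message was admitted, false if the sender must wait.
  tryAdmit(sizeBytes) {
    if (this.pendingBytes + sizeBytes > this.maxPendingBytes) return false;
    this.pendingBytes += sizeBytes;
    return true;
  }

  // Called when the recipient finishes processing a message.
  release(sizeBytes) {
    this.pendingBytes = Math.max(0, this.pendingBytes - sizeBytes);
  }
}
```

A recovering node therefore receives its backlog in bounded slices: each processed message frees room for the next one.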
5. Multi-Region Resilience
Happen leverages NATS SuperClusters to provide cross-region resilience:
// Configure multi-region operation
const happen = initializeHappen({
  nats: {
    // SuperCluster configuration
    connection: {
      servers: ['nats://local-region:4222'],
      jetstream: true
    },
    // Region configuration
    regions: {
      primary: "us-west",
      replicas: ["us-east", "eu-central"],
      // Configuration for handling cross-region communication
      coordination: {
        strategy: "primary-replica",
        conflictResolution: "last-writer-wins"
      }
    }
  }
});
This enables:
Geographic Redundancy: System continues operating even if an entire region fails
Data Locality: Process data in the optimal region for performance and compliance
Disaster Recovery: Automatic failover and recovery between regions
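The "last-writer-wins" setting names a general conflict resolution technique. A minimal sketch of that technique is shown here; the updatedAt and region fields and the resolveLastWriterWins function are assumptions for illustration, not Happen's actual data model.

```javascript
// Generic last-writer-wins resolution between two versions of the same record.
// Each version is assumed to look like { value, updatedAt, region }.
function resolveLastWriterWins(a, b) {
  if (a.updatedAt !== b.updatedAt) {
    return a.updatedAt > b.updatedAt ? a : b;
  }
  // Tie-break deterministically so every region picks the same winner.
  return a.region < b.region ? a : b;
}

const west = { value: "processing", updatedAt: 1000, region: "us-west" };
const east = { value: "shipped", updatedAt: 2000, region: "us-east" };
const winner = resolveLastWriterWins(west, east); // the us-east version is newer
```

Last-writer-wins is simple, but it discards the losing write and depends on reasonably synchronized clocks across regions, which is worth weighing when choosing a strategy.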
6. Causal Event Recovery
When network connections recover, events are replayed while preserving causal relationships:
// Receive events in proper causal order after reconnection
orderNode.on("process-order", (event) => {
  // After reconnection, events will arrive in causal order
  // with exactly-once processing guarantees
  const { orderId } = event.payload;
  // Process normally - framework handles recovery
});
This ensures that even after severe network disruptions, your application's causal integrity is maintained.
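To illustrate what replaying in causal order means, the sketch below reorders a batch so that no event is emitted before the event that caused it. The flat id and causationId fields and the orderCausally function are assumptions made for this example; the framework performs the equivalent work internally.

```javascript
// Reorder a replayed batch so every event follows its cause.
// Events are assumed to look like { id, causationId } (causationId may be absent).
function orderCausally(events) {
  const idsInBatch = new Set(events.map((e) => e.id));
  const emitted = new Set();
  const ordered = [];
  let waiting = [...events];

  while (waiting.length > 0) {
    const stillWaiting = [];
    for (const event of waiting) {
      const cause = event.causationId;
      // Ready when there is no cause, the cause is outside this batch,
      // or the cause has already been emitted.
      if (cause == null || !idsInBatch.has(cause) || emitted.has(cause)) {
        ordered.push(event);
        emitted.add(event.id);
      } else {
        stillWaiting.push(event);
      }
    }
    if (stillWaiting.length === waiting.length) {
      throw new Error("Causal cycle detected in replayed events");
    }
    waiting = stillWaiting;
  }
  return ordered;
}
```

Events whose cause is not part of the batch are treated as ready, on the assumption that the cause was processed before the disruption.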
Resilience Across Environment Boundaries
Happen's resilience capabilities work seamlessly across different runtime environments:
Server Environments
In server environments, NATS clients connect directly to the NATS server:
// Server node connecting directly to NATS
const serverNode = createNode("backend-service");
Browser Environments
In browser environments, clients connect through WebSockets:
// Browser node connecting via WebSockets
const browserNode = createNode("frontend-client");
Edge/IoT Environments
For edge or IoT devices with intermittent connectivity:
// Edge node with specialized configuration
const edgeNode = createNode("sensor-device", {
  // Configuration for intermittent connectivity
  connectivity: {
    mode: "intermittent",
    localBuffering: true,
    syncOnConnect: true
  }
});
In all these environments, the same resilience guarantees apply, providing consistent behavior regardless of where nodes are running.
Application Integration
While network resilience operates automatically, applications can optionally integrate with the system for enhanced awareness:
// Listen for connection status events
orderNode.on('system.connection-status', (event) => {
  const { status, reason } = event.payload;
  if (status === 'disconnected') {
    // Update UI to show connectivity issue
    updateConnectionStatus('degraded', reason);
    // Enable offline mode
    enableOfflineMode();
  } else if (status === 'reconnected') {
    // Update UI to show restored connectivity
    updateConnectionStatus('connected');
    // Return to normal operation
    disableOfflineMode();
  }
});
This allows applications to:
Notify users about connectivity issues
Adapt UI behavior during disruptions
Log network problems for diagnostics
Enable offline capabilities when appropriate
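For the diagnostics use case, status events can be folded into a list of outages with their durations. This is a sketch under assumptions: the status and reason fields follow the handler above, while the at timestamp and the summarizeOutages function are hypothetical additions.

```javascript
// Turn a sequence of connection-status payloads into outage records.
// Each payload is assumed to look like { status, reason, at }, with at in milliseconds.
function summarizeOutages(statusEvents) {
  const outages = [];
  let downSince = null;
  let downReason = null;

  for (const { status, reason, at } of statusEvents) {
    if (status === "disconnected" && downSince === null) {
      downSince = at;
      downReason = reason;
    } else if (status === "reconnected" && downSince !== null) {
      outages.push({ reason: downReason, durationMs: at - downSince });
      downSince = null;
      downReason = null;
    }
  }
  return outages;
}
```

An application could collect payloads in the status handler and pass them to a function like this when writing a diagnostic log entry.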
Key-Value Store for Persistent State
In addition to message persistence, Happen leverages NATS JetStream's Key-Value store for durable state:
// State is automatically persisted in JetStream KV store
orderNode.state.set(state => ({
  ...state,
  orders: {
    ...state.orders,
    "order-123": {
      status: "processing",
      updatedAt: Date.now()
    }
  }
}));
This ensures that node state persists across restarts and can be shared across instances, providing:
Durable State: State persists even if nodes restart
Consistent State: Updates maintain causal ordering
Shared State: State can be accessed from multiple instances
Versioned State: Each update creates a new version that can be tracked
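Versioned state makes it possible to detect conflicting writes from multiple instances. The following in-memory sketch shows the idea with a revision check before each update; VersionedStore is a hypothetical model, and its per-key revision counter is a simplification rather than a description of how the JetStream Key-Value store numbers revisions.

```javascript
// Simplified model of a versioned key-value store with optimistic updates.
class VersionedStore {
  constructor() {
    this.entries = new Map(); // key -> { value, revision }
  }

  get(key) {
    return this.entries.get(key) ?? null;
  }

  // Unconditional write; returns the new revision.
  put(key, value) {
    const revision = (this.entries.get(key)?.revision ?? 0) + 1;
    this.entries.set(key, { value, revision });
    return revision;
  }

  // Write only if nobody else has written since expectedRevision.
  update(key, value, expectedRevision) {
    const currentRevision = this.entries.get(key)?.revision ?? 0;
    if (currentRevision !== expectedRevision) {
      throw new Error(
        `Revision mismatch for ${key}: expected ${expectedRevision}, found ${currentRevision}`
      );
    }
    return this.put(key, value);
  }
}
```

An instance that reads revision 1 and then calls update with that revision succeeds only if no other instance has written in between; otherwise it can re-read and retry.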
Comprehensive Network Failure Handling
Happen's NATS-based architecture handles the full spectrum of network failures:
Process Crashes
If a process crashes:
Events are preserved in JetStream
State is preserved in the Key-Value store
Upon restart, the node resumes processing where it left off
Network Partitions
During network partitions:
Nodes continue functioning in their connected segments
Events for unreachable nodes are stored in JetStream
When the partition heals, normal operation resumes automatically
Regional Outages
During regional outages:
Traffic automatically routes to available regions
When the region recovers, it synchronizes with the rest of the system
Causal relationships are preserved across regions
Temporary Disconnections
During temporary disconnections:
NATS clients automatically attempt to reconnect
Outbound messages queue locally
Upon reconnection, normal event flow resumes
Long-Term Outages
For long-term outages:
Events are preserved in JetStream for the configured retention period
State remains in the Key-Value store
When connectivity is restored, normal operation resumes
Benefits of NATS-Powered Resilience
Happen's NATS-based approach to network resilience offers several key advantages:
Battle-Tested Foundation: Built on NATS' proven resilience capabilities
Zero Data Loss: Events persisted in JetStream survive severe disruptions, within the configured retention period
Exactly-Once Processing: Deduplication prevents duplicate processing during recovery, within the configured window
Self-Healing: System automatically recovers when issues resolve
Cross-Environment Consistency: Same resilience guarantees across all environments
Minimal Configuration: Works out of the box with sensible defaults
Global Scale: Extends from single-process to worldwide deployments
By leveraging NATS and JetStream, Happen provides enterprise-grade resilience capabilities without the complexity typically associated with distributed systems. Your application can focus on business logic while the framework handles the challenging aspects of network resilience.