146  SOA Resilience Patterns

In 60 Seconds

Circuit breakers prevent cascading failures by failing fast after 5+ consecutive errors (typical threshold), with a half-open state testing recovery every 30-60 seconds. Bulkhead isolation limits each service to a fixed thread/connection pool so one failing dependency cannot starve the entire system. Retry with exponential backoff (base 2s, max 60s, plus random jitter) prevents thundering herd problems during recovery.

146.1 Learning Objectives

By the end of this chapter, you will be able to:

  • Diagnose Cascading Failures: Apply the circuit breaker pattern to prevent cascading failures by failing fast when downstream services are unhealthy
  • Implement Bulkhead Isolation: Isolate failures to prevent them from affecting the entire system
  • Configure Retry with Backoff: Handle transient failures with exponential backoff and jitter to prevent thundering herd

Resilience patterns are strategies for keeping IoT systems running even when parts fail. Think of how a city keeps functioning even when one road is closed – traffic gets rerouted. Patterns like circuit breakers, retries, and fallbacks help IoT services gracefully handle failures instead of crashing completely.

146.2 Prerequisites

Before diving into this chapter, you should be familiar with basic SOA and microservice concepts (see the SOA and Microservices Fundamentals chapter).

Resilience patterns are like safety rules that keep one broken thing from breaking everything else!

146.2.1 The Sensor Squad Adventure: The Broken Oven

One day at the pizza restaurant, Thermo the Oven Master got sick and couldn’t bake pizzas. Without safety rules, here’s what happened:

  • Orders kept piling up waiting for Thermo
  • Sunny couldn’t take new orders (too many waiting!)
  • Pressi stopped making dough (no point if no baking!)
  • The WHOLE restaurant stopped!

Then they added Safety Rules (resilience patterns):

  1. Circuit Breaker: After 5 orders waiting too long, Sunny says “Sorry, no pizza today - come back later!” instead of making everyone wait forever.

  2. Backup Plan: If Thermo is sick, send orders to the backup restaurant next door!

  3. Spread Out: When Thermo comes back, don’t send ALL waiting orders at once - send them slowly so Thermo doesn’t get overwhelmed again.

146.2.2 Key Words for Kids

| Word | What It Means |
|---|---|
| Circuit Breaker | A safety switch that says "stop sending work to broken things" |
| Bulkhead | Walls between rooms so a flood in one room doesn't flood the whole ship |
| Retry | Trying again, but waiting a little longer each time |

146.3 Resilience Patterns

Distributed systems fail in distributed ways. Resilience patterns prevent cascading failures.

146.3.1 Circuit Breaker Pattern

The circuit breaker prevents a failing service from overwhelming the system:

Circuit breaker state diagram showing transitions between Closed, Open, and Half-Open states for fault tolerance
Figure 146.1: Circuit breaker state transitions: Closed (normal), Open (failing fast), Half-Open (testing recovery)

Circuit Breaker in Action:

# pip install circuitbreaker
import requests
from circuitbreaker import circuit, CircuitBreakerError

@circuit(failure_threshold=5, recovery_timeout=30)
def call_analytics_service(device_id):
    """Call analytics with circuit breaker protection."""
    response = requests.get(
        f"http://analytics-service/devices/{device_id}/insights",
        timeout=5
    )
    response.raise_for_status()
    return response.json()

# Usage with fallback
try:
    insights = call_analytics_service(device_id)
except CircuitBreakerError:
    # Circuit is open - use cached data or default
    # (get_cached_insights is an application-provided cache lookup)
    insights = get_cached_insights(device_id)

Configuration Guidelines:

| Parameter | Typical Value | Consideration |
|---|---|---|
| Failure Threshold | 5-10 failures | Lower for critical paths |
| Recovery Timeout | 30-60 seconds | Match downstream recovery time |
| Request Timeout | 1-5 seconds | Based on SLA requirements |
| Half-Open Requests | 1-3 | Test requests before full recovery |

Try It: Circuit Breaker State Simulator

Simulate requests hitting a service protected by a circuit breaker. Adjust the failure threshold and recovery timeout, then click “Send Request” to see state transitions in real time.

146.3.2 Failure Walkthrough: What Happens Without a Circuit Breaker

Consider an IoT alert service with a 200-thread pool. It calls an email notification service (normal response: 100ms) and also handles SMS and push notifications independently.

Timeline of cascading failure without circuit breaker:

| Time | Email Service | Alert Service Threads | User Impact |
|---|---|---|---|
| T+0s | Starts degrading (2s responses) | 200 free | None yet |
| T+1s | Responding in 5s | 195 free, 5 waiting on email | Minor delay |
| T+3s | Responding in 15s | 150 free, 50 stuck on email | SMS/push still working |
| T+5s | Responding in 30s (timeout) | 50 free, 150 stuck | SMS/push slowing down |
| T+8s | Not responding | 0 free, 200 stuck | ALL alerts dead – SMS, push, email all down |
| T+8s+ | Recovering | 0 free (all threads blocked) | Complete outage continues |

Total time from first degradation to complete outage: 8 seconds.

The same timeline with a circuit breaker (threshold=5 failures, timeout=30s):

| Time | Email Service | Circuit State | Alert Service | User Impact |
|---|---|---|---|---|
| T+0s | Starts degrading | CLOSED | Processing normally | None |
| T+3s | 5 timeouts counted | CLOSED → OPEN | Switches to fallback | Email queued, SMS/push work |
| T+3s–T+33s | Still down | OPEN | Failing email fast (1ms) | SMS and push 100% operational |
| T+33s | Recovering | OPEN → HALF-OPEN | Sends 1 test email | Testing recovery |
| T+34s | Responds OK | HALF-OPEN → CLOSED | Resumes email sending | Full service restored |

Result: SMS and push notifications never went down. Email was degraded for ~30 seconds instead of the entire system failing.

Circuit breakers prevent thread exhaustion. With a 200-thread pool and requests arriving at a rate that blocks each thread (30-second email service timeout):

\[t_{\text{exhaustion}} = \frac{N_{\text{threads}}}{R_{\text{requests/sec}}}\]

At 10 req/sec incoming rate (each request blocks a thread for 30s):

\[t = \frac{200}{10} = 20 \text{ seconds}\]

Without circuit breaker: After 20 seconds, all 200 threads are blocked waiting for email timeouts. Service completely down.

With circuit breaker (threshold=5, opens after 5 timeouts at 5s each):

  • Circuit opens at \(t \approx 5\) sec (first 5 requests timeout simultaneously)
  • Subsequent requests fail fast in <1ms (not 5,000ms)
  • Thread pool preserved: \(200 - 50 = 150\) threads available for other work (50 requests queued during the 5s detection window)
  • Mean time to recovery: 30 sec (circuit reset time) vs hours (manual intervention)

The circuit breaker prevents cascading failure by containing 50 threads of damage (during the 5-second detection window) instead of losing all 200.
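The exhaustion arithmetic above can be checked with a few lines (the pool size, request rate, and block time are the chapter's example values, not measurements):

```python
def time_to_exhaustion(pool_size: int, req_per_sec: float, block_sec: float) -> float:
    """Seconds until every thread is blocked, assuming each request holds a
    thread for block_sec and block_sec exceeds the fill time."""
    t = pool_size / req_per_sec
    assert block_sec >= t, "requests unblock before the pool fills"
    return t

print(time_to_exhaustion(200, 10, 30))   # 20.0 seconds, matching the text
```

The assertion documents the hidden precondition: if threads free up faster than new requests arrive, the pool never exhausts.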

Production-ready circuit breaker with PyBreaker:

# pip install pybreaker
import pybreaker
import requests

email_breaker = pybreaker.CircuitBreaker(
    fail_max=5, reset_timeout=30, name="email-notification"
)

@email_breaker
def send_email_notification(alert):
    """Send email with circuit breaker protection."""
    response = requests.post(
        "http://email-service/send",
        json={"to": alert.recipient, "subject": alert.title,
              "body": alert.message},
        timeout=5)
    response.raise_for_status()
    return response.json()

def send_alert(alert):
    """Send alert via all channels. Email failure does not block others."""
    results = {}
    try:
        results["email"] = send_email_notification(alert)
    except pybreaker.CircuitBreakerError:
        results["email"] = "queued"       # Circuit OPEN: fail fast
        queue_for_retry(alert, channel="email")
    # SMS and push -- always attempted, never blocked by email
    results["sms"] = send_sms(alert)
    results["push"] = send_push_notification(alert)
    return results

146.3.3 Bulkhead Pattern

Isolate failures to prevent them from affecting the entire system:

Bulkhead pattern diagram showing isolated thread pools for analytics, telemetry, and notifications to contain failures
Figure 146.2: Bulkhead isolation: Analytics failure only affects its dedicated pool, telemetry and notifications continue working

Bulkhead Implementation Example:

from concurrent.futures import ThreadPoolExecutor

class BulkheadService:
    def __init__(self):
        # Separate thread pools for different concerns
        self.telemetry_pool = ThreadPoolExecutor(max_workers=50,
                                                  thread_name_prefix='telemetry')
        self.analytics_pool = ThreadPoolExecutor(max_workers=20,
                                                  thread_name_prefix='analytics')
        self.notification_pool = ThreadPoolExecutor(max_workers=10,
                                                     thread_name_prefix='notify')

    def process_telemetry(self, data):
        """Process telemetry in isolated pool."""
        return self.telemetry_pool.submit(self._handle_telemetry, data)

    def run_analytics(self, query):
        """Run analytics in isolated pool."""
        return self.analytics_pool.submit(self._handle_analytics, query)

    def send_notification(self, alert):
        """Send notification in isolated pool."""
        return self.notification_pool.submit(self._handle_notification, alert)

    # _handle_telemetry, _handle_analytics, and _handle_notification are the
    # application-specific worker functions (omitted here).

Try It: Bulkhead Thread Pool Isolation

See how bulkhead isolation protects healthy services when one pool is overwhelmed. Adjust pool sizes and simulate a failure in one service to observe the impact on others.
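A minimal runnable demonstration of the same idea (the handler functions and pool sizes here are toy stand-ins, not the production layout above): saturating one pool with hung work does not delay the other.

```python
import time
from concurrent.futures import ThreadPoolExecutor

analytics_pool = ThreadPoolExecutor(max_workers=2, thread_name_prefix="analytics")
telemetry_pool = ThreadPoolExecutor(max_workers=2, thread_name_prefix="telemetry")

def slow_analytics(query):
    time.sleep(2)            # simulate a hung downstream dependency
    return query

def fast_telemetry(reading):
    return {"ok": reading}

# Saturate the analytics pool with hung work (2 running, 2 queued)...
for q in range(4):
    analytics_pool.submit(slow_analytics, q)

# ...telemetry still completes immediately because it has its own pool.
future = telemetry_pool.submit(fast_telemetry, 21.5)
print(future.result(timeout=1))   # {'ok': 21.5}

# Cancel queued analytics work so the demo exits promptly
analytics_pool.shutdown(wait=False, cancel_futures=True)
```

With a single shared pool, the telemetry task would sit behind the four hung analytics tasks instead of returning instantly.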

146.3.4 Retry with Exponential Backoff

For transient failures, retry with increasing delays:

import time
import random

class TransientError(Exception):
    """Failures worth retrying (placeholder -- substitute your client's
    transient exception types, e.g. timeouts)."""

def retry_with_backoff(func, max_retries=5, base_delay=1):
    """Retry with exponential backoff and jitter."""
    for attempt in range(max_retries):
        try:
            return func()
        except TransientError:
            if attempt == max_retries - 1:
                raise

            # Exponential backoff: 1s, 2s, 4s, 8s, 16s
            delay = base_delay * (2 ** attempt)

            # Add jitter to prevent thundering herd
            jitter = random.uniform(0, delay * 0.1)

            time.sleep(delay + jitter)

Try It: Exponential Backoff + Jitter Calculator

Adjust the retry parameters to see how exponential backoff spreads retry attempts over time. Compare the delay schedule with and without jitter. The SVG timeline shows when each attempt fires.

Retry Anti-Pattern: Thundering Herd

If all clients retry at the same time after a failure, they can overwhelm the recovering service. Always add jitter (randomness) to retry delays to spread the load.
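An alternative that spreads load even more aggressively is "full jitter" (described in the AWS Architecture Blog post "Exponential Backoff and Jitter"): rather than adding a small random offset to a fixed schedule, sleep a uniformly random time between zero and the exponential ceiling. The function name below is ours; a sketch:

```python
import random

def full_jitter_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Full jitter: a uniform random delay in [0, min(cap, base * 2**attempt)].

    Desynchronizes clients completely -- two clients on the same attempt
    almost never retry at the same instant.
    """
    return random.uniform(0, min(cap, base * (2 ** attempt)))
```

The trade-off is higher variance per client (a retry may fire almost immediately) in exchange for the smoothest aggregate load on the recovering service.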

146.3.5 Timeout Configuration

Proper timeouts are essential for resilience:

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def create_resilient_session():
    """Create HTTP session with timeouts and retries."""
    session = requests.Session()

    # Configure retry strategy
    retry_strategy = Retry(
        total=3,
        backoff_factor=1,
        status_forcelist=[429, 500, 502, 503, 504],
    )

    adapter = HTTPAdapter(max_retries=retry_strategy)
    session.mount("http://", adapter)
    session.mount("https://", adapter)

    return session

# Usage with explicit timeouts
session = create_resilient_session()
response = session.get(
    "http://telemetry-service/data",
    timeout=(3.0, 10.0)  # (connect_timeout, read_timeout)
)

Timeout Guidelines:

| Service Type | Connect Timeout | Read Timeout | Rationale |
|---|---|---|---|
| Internal API | 1-3 seconds | 5-10 seconds | Fast network, quick detection |
| External API | 3-5 seconds | 30-60 seconds | Internet latency, larger payloads |
| Database | 2-5 seconds | 30-60 seconds | Connection pooling, complex queries |
| Message Queue | 1-2 seconds | 5-10 seconds | Fast acknowledgment expected |

146.4 Combining Resilience Patterns

Real-world systems combine multiple patterns:

Combined resilience patterns diagram showing layered circuit breaker, bulkhead, and retry patterns working together
Figure 146.3: Layered resilience: Each pattern addresses a different failure mode

Pattern Interaction:

  1. Timeout catches slow responses
  2. Circuit Breaker detects repeated failures
  3. Retry handles transient failures
  4. Bulkhead isolates resource consumption
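The four layers can be sketched in one toy wrapper. This is illustrative only: `resilient_call`, `email_pool`, and the global failure counter are invented for this sketch (a real breaker tracks state per dependency, as pybreaker does), but the layering order matches the list above.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor

email_pool = ThreadPoolExecutor(max_workers=4)   # (4) Bulkhead: dedicated pool
_failures = 0                                    # (2) Circuit state (toy counter)

def resilient_call(func, retries=3, base=0.1, timeout=1.0, fail_max=5):
    """Run func through all four layers: bulkhead, breaker, timeout, retry."""
    global _failures
    for attempt in range(retries):
        if _failures >= fail_max:                 # (2) Circuit Breaker: fail fast
            raise RuntimeError("circuit open: failing fast")
        future = email_pool.submit(func)          # (4) Bulkhead: isolated threads
        try:
            result = future.result(timeout=timeout)   # (1) Timeout per call
        except Exception:
            _failures += 1
            if attempt == retries - 1:
                raise
            delay = base * (2 ** attempt)         # (3) Retry with backoff...
            time.sleep(delay + random.uniform(0, delay * 0.1))  # ...and jitter
        else:
            _failures = 0                         # success resets the breaker
            return result

print(resilient_call(lambda: "sent"))   # sent
```

Note the ordering: the timeout feeds the breaker (a slow call counts as a failure), and the retry loop sits outside both so a fast-failed call can still be retried after backoff.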

Try It: Thundering Herd Visualizer

Simulate what happens when thousands of IoT devices retry after an outage. Compare fixed-delay retries (thundering herd) vs. exponential backoff with jitter. The chart shows server load over time.

146.5 Real-World Outage: Samsung SmartThings (March 2023)

Production outages provide the most convincing evidence for why resilience patterns matter. Samsung SmartThings experienced a 15-hour outage in March 2023 that affected millions of smart home devices worldwide. Analyzing this through the lens of resilience patterns reveals which protections were missing.

What Happened:

  • SmartThings cloud backend became unreachable for ~15 hours
  • All cloud-dependent devices (locks, thermostats, lights controlled by routines) stopped responding
  • Users could not arm/disarm security systems, adjust heating, or control lights via automations

Resilience Analysis:

| Pattern | Was It Applied? | Impact of Gap |
|---|---|---|
| Circuit breaker | Unclear – mobile app continued retrying cloud requests, draining battery | App should have opened circuit after 3 failures, shown "offline" mode |
| Bulkhead | Missing – single backend failure affected ALL device types equally | Locks, thermostats, and lights should have separate service partitions |
| Local fallback | Partially – Hub V3 had local execution for some routines | Devices with cloud-only routines had zero functionality |
| Retry with backoff | Missing – millions of devices retried simultaneously on recovery | Thundering herd on backend recovery, extending outage by ~3 hours |
| Graceful degradation | Missing – security system went fully offline instead of falling back to local arm/disarm | Life-safety devices must have autonomous fallback mode |

Lessons for IoT Architects:

  1. Bulkhead life-safety devices: Security systems and fire alarms must operate in a separate service partition from convenience features (lights, music). If the lights service crashes, the lock must still work.

  2. Design for cloud-down from day one: If your IoT device requires cloud connectivity to perform its primary function, it will fail during outages. SmartThings Hub V3 partially addressed this with local execution, but many routines still required cloud.

  3. Stagger recovery: After an outage, millions of devices reconnecting simultaneously creates a thundering herd. Implement jittered reconnection: each device waits random(0, 300) seconds before reconnecting. For 5 million devices with 300-second jitter, the reconnection rate is ~17,000/second instead of 5,000,000 simultaneously.

  4. Test outages deliberately: Netflix pioneered Chaos Monkey (randomly killing production instances). IoT platforms should run “Cloud Kill” tests monthly – disconnect the cloud and verify that local fallbacks activate correctly. Samsung’s post-incident report acknowledged they had not tested prolonged cloud outages at scale.
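The reconnect-rate figure in point 3 is simple division over the jitter window:

```python
devices = 5_000_000
jitter_window_sec = 300   # each device waits random(0, 300) seconds

rate = devices / jitter_window_sec
print(f"{rate:,.0f} reconnects/second on average")   # 16,667 reconnects/second on average
```

Widening the jitter window trades reconnection latency (up to 5 minutes here) for a proportionally lower peak load on the recovering backend.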

Common Pitfalls

A circuit breaker that trips after 2 failures in 10 seconds will open frequently due to normal transient errors (network jitter, brief database spikes) in IoT environments. This causes unnecessary fallback responses and prevents normal traffic from reaching healthy services. Set thresholds based on measured baseline error rates – typically 50% failure rate over a 30-second window in IoT systems.

Retrying a command to turn on a device actuator when the first attempt may have succeeded but the acknowledgment was lost sends the command twice, potentially causing unintended state changes. Only retry idempotent operations (reads, status checks) or operations with unique request IDs that the server uses for deduplication. For non-idempotent commands, use single-attempt with confirmation polling.

When a downstream IoT service recovers after an outage, all waiting clients retry simultaneously, creating a traffic spike that re-crashes the recovering service. Always add random jitter (0-30% of backoff delay) to retry timers and implement circuit breakers so clients fail fast rather than queuing unlimited retries.

Key Concepts
  • Circuit Breaker: A resilience pattern that monitors failure rates to a downstream service and ‘trips open’ to fail fast and return cached responses, preventing cascading failures across IoT microservices
  • Bulkhead: A resilience pattern that isolates failures by allocating separate thread pools or connection pools to different downstream services, preventing one slow dependency from exhausting all resources
  • Retry with Exponential Backoff: A resilience pattern that retries failed requests with progressively increasing delays (1s, 2s, 4s, 8s) plus random jitter, preventing synchronized retry storms in IoT systems
  • Fallback Response: A pre-configured degraded response returned when a circuit breaker is open, allowing IoT dashboards to display cached last-known state rather than error pages during downstream failures
  • Timeout: A maximum wait duration for a downstream service response – without timeouts, slow services hold threads indefinitely, eventually exhausting the thread pool and causing total service failure
  • Health Check: An endpoint (/health, /ready) that reports service operational status, used by Kubernetes, load balancers, and circuit breakers to route traffic away from degraded IoT service instances
  • Saga Pattern: A distributed transaction pattern for multi-step IoT operations (provision device → register → activate) that uses compensating transactions to roll back partial failures without distributed locks
  • Dead Letter Queue: A message queue destination for messages that fail processing after all retries, enabling asynchronous IoT telemetry pipelines to handle poison messages without blocking the main processing queue

146.6 Summary

This chapter covered resilience patterns for fault-tolerant IoT systems:

  • Circuit Breaker: Fail fast when downstream services are unhealthy, preventing cascading failures
  • Bulkhead: Isolate resources (thread pools, connections) to prevent one failure from affecting everything
  • Retry with Backoff: Handle transient failures with exponential delays and jitter to prevent thundering herd
  • Timeouts: Set appropriate connect and read timeouts for different service types

Worked Example: Configuring a Circuit Breaker

Scenario: An alert service calls an external email API. We need a circuit breaker that fails fast when the email service degrades, without tripping on transient blips (false positives).

Email Service SLA: 99.9% uptime, <200ms p95 latency

Traffic Profile:

  • Normal: 50 requests/minute (5 seconds between requests)
  • Peak: 300 requests/minute (1 request every 200ms)

Circuit Breaker Configuration:

1. Failure Threshold (how many failures before opening):

Too low (threshold=3):
- 3 transient network blips → circuit opens
- False positive rate high
- Service availability suffers unnecessarily

Too high (threshold=20):
- 20 failures = 20 × 5sec timeout = 100 seconds of blocked threads
- By failure #20, thread pool may be exhausted
- Too slow to detect real outage

Recommended: threshold=5
- 5 failures = 25 seconds to detect (acceptable)
- Low false positive rate (5 consecutive failures unlikely if service is healthy)
- Protects thread pool before exhaustion

2. Timeout Duration (max wait per request):

Email SLA: <200ms p95
Set timeout at p99.9: ~5 seconds

Reasoning:
- 200ms is healthy
- 5 seconds catches degraded service (10s responses = clearly unhealthy)
- Shorter timeout (<1s) may cause false failures on network jitter

3. Open Timeout (how long to stay open before testing):

Too short (10 seconds):
- Circuit tests recovery every 10s
- If email service needs 2 minutes to recover, 12 failed tests
- Wastes resources on doomed requests

Too long (5 minutes):
- Email service recovers in 30 seconds
- Circuit stays open unnecessarily for 4.5 more minutes
- Poor user experience

Recommended: 30-60 seconds
- Matches typical service recovery time (load shedding, auto-scaling)
- Not so frequent that we hammer recovering service

4. Half-Open Success Threshold (how many successes to close):

Too low (1 success):
- One lucky request closes circuit
- If service is flapping, circuit oscillates open/closed
- Unstable behavior

Too high (10 successes):
- Requires 10 consecutive successes before closing
- If service has 90% success rate during recovery, takes forever to close
- Slow recovery

Recommended: 3-5 successes
- 3 consecutive successes = service likely recovered
- Fails fast if service still degraded (1 failure → back to OPEN)

Final Configuration:

email_breaker = CircuitBreaker(  # illustrative API; parameter names vary by library
    fail_max=5,              # Open after 5 failures
    reset_timeout=30,        # Test recovery after 30 seconds
    timeout_duration=5,      # 5-second request timeout
    half_open_max_calls=3    # 3 successes to close
)

Testing the Configuration:

Simulate email service outage (100% failure for 2 minutes):

T+0s: Email service goes down
T+5s: Failure #1
T+10s: Failure #2
T+15s: Failure #3
T+20s: Failure #4
T+25s: Failure #5 → Circuit OPENS

T+25-55s: All requests fail fast (<1ms, no 5s timeout wait)
  Thread time saved: 30s open window × (50 req/min ÷ 60) ≈ 25 requests,
  each avoiding a 5s timeout block = 125 thread-seconds
  (≈750 thread-seconds at the 300 req/min peak)

T+55s: Half-open test (attempt #1) → FAILS → back to OPEN
T+85s: Half-open test (attempt #2) → FAILS → back to OPEN
T+115s: Email service recovers
T+115s: Half-open test (attempt #3) → SUCCESS
T+120s: Half-open test (attempt #4) → SUCCESS
T+125s: Half-open test (attempt #5) → SUCCESS → Circuit CLOSES

Recovery time: 125 seconds from outage start
Actual outage: 115 seconds
Detection lag: 25 seconds (5 failures × 5s)
Recovery lag: 10 seconds (waiting for half-open test)

Key Insight: Circuit breaker configuration is a balance – fail fast enough to protect resources, but not so sensitive that transient blips cause false opens.

Retry Strategy by Failure Type:

| Failure Type | Retry Strategy | Max Retries | Backoff | Jitter | Example |
|---|---|---|---|---|---|
| Transient Network Blip | Exponential backoff | 3-5 | Base 1s, max 10s | 10-20% | DNS timeout, TCP handshake fail |
| Rate Limit (429) | Exponential backoff + header | 3 | Honor Retry-After header | No jitter | API quota exceeded |
| Service Temporarily Unavailable (503) | Exponential backoff | 5-7 | Base 2s, max 60s | 20-30% | Service restarting, rolling deploy |
| Internal Server Error (500) | Limited retry | 2-3 | Base 2s, max 10s | 10% | May be persistent bug, fail fast |
| Timeout | Exponential backoff | 3 | Base 5s, max 30s | 30-50% | Slow dependencies, DB queries |
| Authentication Failure (401) | NO RETRY | 0 | N/A | N/A | Incorrect credentials won't fix themselves |
| Not Found (404) | NO RETRY | 0 | N/A | N/A | Resource doesn't exist, retrying pointless |
| Bad Request (400) | NO RETRY | 0 | N/A | N/A | Malformed request, won't change |

Decision Rules:

Retry if (likely transient):

  • Network failures (connection refused, timeout)
  • 429 (rate limit – will clear over time)
  • 503 (service unavailable – may be temporary)
  • 504 (gateway timeout)

DO NOT retry (permanent error):

  • 400-level errors (except 429): bad request, authentication, not found
  • 500 beyond the small retry budget in the table above – often a persistent bug, so fail fast and alert
  • Business logic errors (validation failures, insufficient funds, etc.)
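One way to encode these rules in code; this is a hedged sketch, the status-code sets are drawn from the rows above, and 500 is treated conservatively as non-retryable per the fail-fast rule:

```python
RETRYABLE_STATUSES = {429, 503, 504}        # transient: will clear over time
NON_RETRYABLE_STATUSES = {400, 401, 404, 500}  # permanent or needs human attention

def should_retry(status_code: int, is_network_error: bool = False) -> bool:
    """Decide whether a failed request is worth retrying."""
    if is_network_error:            # connection refused, DNS/TCP timeout
        return True
    if status_code in RETRYABLE_STATUSES:
        return True
    return False                    # 4xx (except 429) and 500: fail fast

print(should_retry(503))                         # True
print(should_retry(401))                         # False
print(should_retry(0, is_network_error=True))    # True
```

Centralizing the decision in one function keeps individual call sites from inventing their own (inconsistent) retry policies.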

Jitter Importance:

# WITHOUT jitter (thundering herd):
delay = base_delay * (2 ** attempt)  # All clients retry at same time
time.sleep(delay)

# WITH jitter (spread load):
delay = base_delay * (2 ** attempt)
jitter = random.uniform(0, delay * 0.3)  # Add 0-30% randomness
time.sleep(delay + jitter)

Example with 1,000 clients after 10-second outage:
- No jitter: All 1,000 retry at T+10s (spike)
- 30% jitter: Retries spread across T+10s to T+13s (smooth ramp)
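The spread can be simulated directly (a toy sketch; the client count and delays are the example's numbers):

```python
import random

clients = 1000
retry_at = 10.0   # every client saw the outage end and backs off 10 seconds

no_jitter = [retry_at] * clients
with_jitter = [retry_at + random.uniform(0, retry_at * 0.3) for _ in range(clients)]

print(len(set(no_jitter)))   # 1 -> a single synchronized spike at T+10s
print(round(min(with_jitter), 1), round(max(with_jitter), 1))  # spread over ~T+10s..T+13s
```

Without jitter there is exactly one arrival time; with 30% jitter the same 1,000 retries arrive smoothly across a 3-second window.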

Backoff Formula:

delay = min(base × 2^(attempt − 1), max_delay)
sleep_time = delay + random(0, delay × jitter_fraction)

Example:
base=2s, max=60s, jitter=30%

Attempt 1: 2s + random(0, 0.6s) = 2.0-2.6s
Attempt 2: 4s + random(0, 1.2s) = 4.0-5.2s
Attempt 3: 8s + random(0, 2.4s) = 8.0-10.4s
Attempt 4: 16s + random(0, 4.8s) = 16.0-20.8s
Attempt 5: 32s + random(0, 9.6s) = 32.0-41.6s
Attempt 6: 60s (capped) + random(0, 18s) = 60.0-78.0s
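The schedule above falls out of the formula directly (attempts numbered from 1, matching the table):

```python
def backoff_delay(attempt, base=2.0, max_delay=60.0, jitter=0.3):
    """Return (min, max) of the sleep range for a 1-indexed attempt."""
    delay = min(base * 2 ** (attempt - 1), max_delay)
    return delay, delay + delay * jitter

for attempt in range(1, 7):
    lo, hi = backoff_delay(attempt)
    print(f"Attempt {attempt}: {lo:.1f}-{hi:.1f}s")
```

Running this reproduces the six rows above, including the 60s cap kicking in on attempt 6.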

Common Mistake: Retrying Without Exponential Backoff (Thundering Herd)

The Error: Implementing retry with fixed delays instead of exponential backoff.

Real Example:

  • IoT platform with 50,000 devices polling config API every 60 seconds
  • Config service crashes and recovers after 2 minutes
  • Devices retry with fixed 5-second delay

What Happens (with fixed delay):

T+0s: Config service crashes (50,000 devices last polled 0-60s ago)
T+60s: First wave of retries (devices whose poll was due at crash time)
  - 833 devices retry (50,000 / 60 seconds = 833/sec normal rate)
T+65s: Second retry wave (same 833 devices)
T+70s: Third retry wave (same 833 devices)
...
T+120s: Service recovers, but now has BACKLOG
  - Normal polling: 833/sec
  - Retrying devices: 833 × (120/5) = 20,000 devices retrying
  - Total load: 833 + 20,000 = 20,833 requests/sec (25x normal)
T+121s: Service crashes again under 25x load

The Thundering Herd: Fixed retry creates synchronized retry waves that overwhelm recovering services.

With Exponential Backoff + Jitter:

import random
import time

def retry_config_poll():
    """Poll the config service (get_config is the device's API call)."""
    for attempt in range(5):
        try:
            return get_config()
        except Exception:
            if attempt == 4:  # Last attempt
                raise
            delay = min(5 * (2 ** attempt), 300)  # 5s, 10s, 20s, 40s, 80s (capped at 300s)
            jitter = random.uniform(0, delay * 0.5)  # 0-50% jitter
            time.sleep(delay + jitter)

What Happens (with exponential backoff + jitter):

T+0s: Config service crashes
T+60s: First retry wave (833 devices)
  - Retry delays: 5s + jitter(0-2.5s) = 5-7.5s spread
T+65-67.5s: Second retry (spread over 2.5 seconds, not synchronized)
  - Retry delays: 10s + jitter(0-5s) = 10-15s
T+75-82.5s: Third retry (spread over 7.5 seconds)
  - Retry delays: 20s + jitter(0-10s) = 20-30s

T+120s: Service recovers
  - Devices in various retry states (some on attempt 2, some on 3, some on 4)
  - Load ramps up gradually: 833/sec → 1,200/sec → 1,500/sec over 30 seconds
  - Service handles gradual ramp without crashing

Result: Service stays stable, no crash cascade

Key Numbers:

Fixed delay (5s):
- Peak load after recovery: 20,833 req/sec (25x normal)
- Service crashes: YES

Exponential backoff (5s base, 2x multiplier):
- Peak load after recovery: 1,500 req/sec (1.8x normal)
- Service crashes: NO

Difference: 14x reduction in peak load

Lesson: Fixed-delay retries create synchronized retry storms. Always use exponential backoff with random jitter to spread retry load over time. This single change prevents most post-outage crash cascades.

Key Takeaway

In one sentence: Resilience patterns (circuit breakers, bulkheads, retries with jitter) prevent cascading failures by failing fast, isolating resources, and spreading recovery load over time.

Remember this rule: Circuit breakers protect against slow/failing services; bulkheads protect against resource exhaustion; jitter prevents thundering herd.

146.7 Knowledge Check

Challenge: Configure and test a circuit breaker for an IoT alert service that calls an external email API.

Setup:

# pip install pybreaker requests
import pybreaker
import requests
import time
import random

# External email API simulator (flaky by design)
class FlakyEmailAPI:
    def __init__(self, failure_rate=0.3):
        self.failure_rate = failure_rate
        self.call_count = 0

    def send_email(self, alert):
        self.call_count += 1
        if random.random() < self.failure_rate:
            time.sleep(5)  # Simulate timeout
            raise requests.Timeout("Email service timeout")
        time.sleep(0.1)  # Normal response time
        return {"status": "sent", "id": self.call_count}

email_api = FlakyEmailAPI(failure_rate=0.3)  # 30% failure rate

Tasks:

  1. Baseline measurement - Call the API 100 times without circuit breaker
    • How many failures?
    • Total time spent?
    • Thread blocking duration?
  2. Add circuit breaker:
breaker = pybreaker.CircuitBreaker(
    fail_max=5,           # Exercise: Tune this threshold
    reset_timeout=30,     # Exercise: Tune this timeout
    name="email-service"
)

@breaker
def send_email_protected(alert):
    return email_api.send_email(alert)
  3. Experiment with thresholds:
    • Try fail_max = 3, 5, 10
    • Try reset_timeout = 10s, 30s, 60s
    • Measure: total failures, circuit opens, time saved
  4. Simulate recovery:
    • After circuit opens, reduce failure_rate to 0.1
    • Observe half-open testing behavior

What to observe:

  • How quickly does the circuit open after threshold?
  • Does rapid opening save thread resources?
  • What happens if reset_timeout is too short during sustained outage?
  • Does the circuit close successfully when service recovers?

Expected learning:

  • Thresholds too low = false positives (circuit opens on transient blips)
  • Thresholds too high = slow detection (threads exhausted before circuit opens)
  • Reset timeout should match typical service recovery time

Extension: Add fallback logic (e.g., queue emails for later, send SMS instead).

146.8 What’s Next

| If you want to… | Read this |
|---|---|
| Deploy resilient IoT services using containers and Kubernetes | SOA Container Orchestration |
| Design IoT APIs with versioning, rate limiting, and discovery | SOA API Design |
| Understand SOA and microservice decomposition fundamentals | SOA and Microservices Fundamentals |
| Model IoT device fault recovery with state machines | State Machine Patterns |
| Explore cloud-native IoT deployment architectures | Cloud Computing for IoT |