5 Retries, Timeouts, and Circuit Breakers

design-patterns

soa

resilience

5.1 Start With the Command That Might Repeat

Resilience becomes concrete when a network timeout leaves everyone unsure whether a physical command already happened. Retrying a temperature read is one thing; retrying a lock, relay, valve, alarm, or certificate change is another.

Start with the user-visible deadline and the risk of repeating the operation. Then choose the timeout, retry, breaker, bulkhead, fallback, and evidence that keep the caller safe while the dependency is unhealthy.

Chapter Roadmap

This chapter follows one service call from risk to review:

First you decide what may repeat and what must be bounded by a deadline.
Then you set a command-call policy for timeout, retry, breaker, bulkhead, and fallback behavior.
Next you study the main failure modes: slow dependencies, retry storms, duplicate commands, exhausted pools, partial cloud loss, and poison messages.
Finally you combine the controls into review checklists, quizzes, and a resilience review note.

Checkpoints summarize the design contract as you go; collapsed quizzes and code activities are practice stops, not new policy.

In 60 Seconds

Resilience patterns keep one slow or failed dependency from taking down an IoT service. A good service call has a deadline, retries only when the operation is safe to repeat, opens a circuit breaker when failure signals are strong, falls back to a known safe response, and uses bulkheads so one dependency cannot consume every thread, connection, queue, or worker. The settings are not universal numbers; tune them from SLOs, latency data, idempotency guarantees, and recovery tests.

Minimum Viable Understanding

Timeouts and deadlines define the maximum wait. Without a deadline, slow dependencies can hold resources until healthy work is starved.
Retries are for transient and safe operations. Retry reads, idempotent writes, or writes with request IDs; do not blindly retry commands that may have already changed the physical world.
Circuit breakers fail fast after repeated failure signals. They protect callers from wasting resources on a dependency that is already failing.
Bulkheads isolate resource pools. Separate workers, queues, connection pools, and rate budgets keep one dependency from consuming the whole service.
Fallbacks must be designed before the outage. During an incident is too late to decide whether to return cached state, queue work, disable a feature, or ask for manual confirmation.

5.2 Resilience Protects The Caller First

Resilience is not about making every dependency succeed. It is about keeping the caller, user workflow, and physical system safe when a dependency is slow, unavailable, overloaded, or uncertain. In a service-oriented IoT platform, a device-command API may depend on an authorization service, a device registry, a broker, a notification provider, a time-series store, and a support-ticket integration. If any one of those dependencies hangs, the caller needs a controlled answer before its worker pool, queue, or user-facing deadline is exhausted.

In IoT systems, that distinction matters because retries can duplicate real-world actions. A repeated status read is usually harmless. A repeated unlock, valve change, alarm acknowledgement, firmware enrollment, billing activation, or certificate-rotation command can change the physical world or the security state twice. A retry policy that is acceptable for GET /devices/{id}/state may be unsafe for POST /commands/unlock unless the command carries an idempotency key, sequence number, or receiver-side deduplication rule.

Think about a gateway that sends cold-room temperature readings and accepts remote defrost commands. Telemetry can often be buffered, retried with backoff and jitter, and replayed through MQTT, Kafka, RabbitMQ, or a cloud queue because duplicate readings can be deduplicated by device id, timestamp, and message id. A defrost command is different: if the acknowledgement is lost, the gateway must not repeat the command unless the device can recognize the same command id and return the prior result. The resilience policy follows the risk of the operation.

Resilience controls therefore protect the caller in layers. A deadline bounds the total wait. A per-attempt timeout prevents one dependency call from consuming the whole budget. A retry policy decides whether another attempt is safe and useful. A circuit breaker stops calling a dependency that is already failing. A bulkhead limits the resources one dependency or tenant can consume. A fallback tells the user, device, or operator what still works while degraded. Each control is small, but together they prevent one unhealthy path from becoming a platform outage.

Deadlines stop slow dependencies from holding workers, sockets, and queue slots after the user-visible budget is gone.
Retry policy separates safe reads, idempotent writes, queued work, and unsafe physical commands.
Fallback behavior keeps the product honest with cached state, queued state, read-only mode, local control, manual confirmation, or explicit unavailability.

5.3 Set The Policy For One Command Call

For a command API calling an authorization service, start with the user-visible deadline and physical risk. The caller should not wait forever, retry blindly, or let the authorization dependency consume the same worker pool that serves read-only status. Write the policy for one call path, not for the whole platform: “remote door unlock requires authorization, device online state, command deduplication, dispatch, acknowledgement, and an operator-visible result inside two seconds.”

Allocate the budget before choosing numbers. A two-second command path might reserve 200 ms for request validation and registry lookup, 500 ms for authorization, 800 ms for command submission and acknowledgement, 200 ms for fallback selection, and 300 ms for network and serialization margin. If authorization consumes 1.8 seconds, a retry is already useless because the caller cannot still deliver a reliable answer. If authorization returns a transient 503 after 120 ms and the command id is stable, one retry with jitter may still fit.

Then define the degraded behavior. If authorization is unavailable, the API may reject new unlock commands with a clear unavailable response, allow read-only status checks, keep diagnostics available, and ask an operator to confirm local state. If the device is offline, the API may queue only commands that are safe to execute later and label them pending. If the command may have already reached the device, the API should store the command id and wait for a status transition instead of creating a second physical action.

Deadline: a two-second command path might allocate 500 ms per authorization attempt and reserve time for fallback response handling.
Retry: retry 503, 429, or connection reset only when the command carries an idempotency key or command sequence number.
Breaker: open after a rolling failure-rate or slow-call threshold, then use limited half-open probes before allowing normal traffic.
Bulkhead: put authorization calls in a dependency-specific pool so status reads, diagnostics, and support views keep capacity.

Use metrics to tune the policy. Track p50, p95, and p99 authorization latency; timeout count; retry attempts; retry success rate; open-circuit duration; half-open probe results; command deduplication hits; queue depth; and user-visible degraded responses. A retry policy whose success rate is low during incidents may be load amplification. A breaker that opens constantly may have a bad timeout, an overloaded dependency, or a workload split that needs a bulkhead. A fallback nobody sees in metrics is not an operational fallback.

5.4 Policies Need Runtime Signals

Production resilience depends on metrics and enforcement points, not just diagrams. Libraries and platforms such as Resilience4j, Polly, Envoy, Linkerd, Istio, OpenTelemetry, Prometheus, Grafana, Kafka, RabbitMQ, Redis, and cloud queues can provide the pieces, but the policy still has to match the workload. The implementation needs one place where the caller budget, attempt timeout, retry eligibility, breaker state, bulkhead reservation, fallback selection, and telemetry recording are enforced consistently.

sequenceDiagram
  participant Caller as Command API
  participant Pool as Auth bulkhead pool
  participant Breaker as Circuit breaker
  participant Auth as Authorization service
  participant Fallback as Degraded response
  Caller->>Caller: Start two-second deadline
  Caller->>Pool: Reserve dependency-specific capacity
  alt Pool full
    Pool-->>Caller: Reject quickly
    Caller->>Fallback: Return unavailable/read-only mode
  else Capacity available
    Caller->>Breaker: Check call admission
    alt Breaker open
      Breaker-->>Caller: Fail fast
      Caller->>Fallback: Use queued/manual/explicit fallback
    else Breaker closed or half-open
      Caller->>Auth: Attempt with per-call timeout
      Auth-->>Caller: Success, timeout, or transient error
      Caller->>Breaker: Record latency and outcome
      Caller->>Caller: Retry only if idempotent and deadline remains
    end
  end

Runtime resilience flow for a command dependency: budget first, isolate capacity, check the breaker, attempt with timeout, retry only when safe, then record the outcome.

The order matters. If the caller retries before reserving a bulkhead slot, the retry can crowd out healthier work. If it calls the dependency before checking an open breaker, it wastes resources on a path already known to be unhealthy. If it starts another attempt after the overall deadline is gone, the result cannot help the user and may slow recovery. If it returns fallback without recording why, operators cannot tell whether users saw cached data, queued work, read-only mode, or explicit unavailability.

Timeouts: connect timeout, read timeout, and overall deadline should be separate enough to debug where time is spent.
Retries: exponential backoff with jitter prevents synchronized device fleets from creating a second outage during recovery.
Dead letters: poison telemetry or command-status messages need bounded attempts and a dead-letter topic, queue, or table for operator review.
Observability: track deadline exhaustion, retry attempts, open circuits, half-open probes, queue depth, fallback count, and user-visible degraded-mode rate.

The data model behind the policy should preserve enough context to explain an incident. Store request id, command id, idempotency key, dependency name, deadline budget, attempt number, timeout reason, breaker state, pool rejection, fallback type, device id, tenant id, trace id, and final user-visible response. OpenTelemetry spans can connect the caller, dependency, queue, and fallback path; Prometheus counters can track rates; logs can carry the command id needed for support to reconcile a physical action.

A resilience design is ready when operators can see which policy acted, which request was protected, and which product behavior the user or device received. It is not ready if the only evidence is “the client timed out.” The target state is boring: slow dependencies stop consuming shared resources, unsafe commands do not repeat silently, device fleets recover without retry storms, poison messages move aside, and degraded modes are explicit enough for users and support teams to trust.

Checkpoint: Command-Call Policy

You now know:

A two-second command path needs budget allocation before retry settings; 500 ms authorization, 800 ms command submission, and fallback time must fit the same deadline.
Retry candidates such as 503, 429, or connection reset still require idempotency when the command may affect the physical world.
Operators need p50, p95, p99 latency, retry attempts, open-circuit duration, queue depth, fallback type, and trace ids to explain what happened.

5.5 Learning Objectives

By the end of this chapter, you will be able to:

Explain how slow dependencies create cascading failures in service-oriented IoT systems.
Choose between timeout, retry, circuit breaker, bulkhead, rate limit, and fallback controls for a given failure mode.
Configure retries around idempotency, exponential backoff, jitter, and deadline budgets.
Describe the closed, open, and half-open states of a circuit breaker.
Review an IoT service boundary for resource isolation and degraded-mode behavior.
Build a concise resilience review note for a service call.

Most Valuable Understanding

Resilience is not “try harder.” It is controlled failure. The service should stop waiting when its budget is gone, stop retrying when the operation is unsafe, stop calling dependencies that are clearly failing, and keep the rest of the system useful while the failed part recovers.

5.6 Prerequisites

SOA and Microservices Fundamentals: Understand service boundaries, ownership, and distributed-system cost.
SOA API Design and Service Discovery: Understand contracts, versioning, idempotency, and API error handling.
SOA Container Orchestration: Understand deployment, health checks, autoscaling, and service operations.
MQTT Fundamentals: Know why IoT systems often combine synchronous APIs with asynchronous message flows.
State Machine Patterns: Review explicit state transitions for devices and service workflows.

5.7 Resilience Control Map

A resilient service call is a chain of small controls. Each control has a narrow job.

A client request passes through deadline, retry, circuit breaker, fallback, and bulkhead controls before reaching a downstream dependency — Figure 5.1: Layered SOA resilience controls for an IoT service call

5.7.1 Deadline

Caps the total time a caller is willing to spend. It should include connection time, processing time, retries, and fallback selection.

5.7.2 Retry

Repeats only safe transient failures. It needs a retry budget, exponential backoff, jitter, and an idempotency rule.

5.7.3 Circuit Breaker

Stops calls to a repeatedly failing dependency, returns fallback behavior quickly, and probes recovery later.

5.7.4 Bulkhead

Limits how much resource one dependency or tenant can consume. Use separate pools, queues, connections, and rate budgets.

5.7.5 Fallback

Chooses a useful degraded result: cached data, queued work, local control, read-only mode, manual confirmation, or a clear unavailable response.

5.7.6 Observability

Makes the behavior measurable. Track deadline exhaustion, retry attempts, open circuits, fallback use, queue growth, and user impact.

Knowledge Check: Matching Control to Failure Mode

5.8 Failure Modes in IoT Services

IoT platforms fail differently from ordinary web applications because they combine remote devices, lossy networks, cloud APIs, queues, databases, and sometimes physical actuators.

5.8.1 Slow Dependency

A database query, external API, model inference endpoint, or message broker takes much longer than expected. The service appears alive, but callers spend their resources waiting.

5.8.2 Retry Storm

Devices, gateways, or services retry at the same schedule after an outage. A recovering dependency receives synchronized traffic and fails again.

5.8.3 Unsafe Duplicate Command

A lost acknowledgement makes a caller retry a command that already succeeded. This is dangerous for actuators, locks, alarms, payments, or provisioning operations.

5.8.4 Shared-Pool Exhaustion

One dependency consumes the shared worker pool, connection pool, queue, or rate limit. Healthy features are blocked by an unhealthy dependency.

5.8.5 Partial Cloud Loss

Cloud services are unreachable but the edge gateway, local device network, or cached configuration still works. The system needs a local or degraded operating mode.

5.8.6 Poison Message

One malformed message fails repeatedly and blocks a queue consumer. The pipeline needs retry limits and a dead-letter route.

5.9 Circuit Breaker Pattern

A circuit breaker is a state machine around a dependency call. It is not a retry mechanism. It decides whether the caller should attempt the dependency now or fail fast and use fallback behavior.

Circuit breaker states: closed allows normal calls, open fails fast with fallback, and half-open sends limited recovery probes — Figure 5.2: Circuit breaker states for service calls

5.9.1 Closed

Calls pass through. The policy counts failures and slow responses over a rolling window or consecutive sequence.

5.9.2 Open

Calls do not reach the dependency. The caller returns fallback behavior quickly and protects its own resource pool.

5.9.3 Half-Open

After a cooldown, the policy allows a small number of probes. Successful probes close the circuit; failed probes reopen it.

Do not copy someone else’s breaker thresholds. Choose the failure window, threshold, cooldown, and half-open probe count from your own request rate, downstream SLO, timeout budget, and false-positive tolerance.

5.9.4 Circuit Breaker Review Questions

What counts as a failure? Include hard errors, timeouts, rejected calls, and responses that are technically successful but too slow for the caller’s deadline.

What is the fallback? A breaker without fallback often just changes a slow failure into a fast error. That can still be useful, but the user experience should be intentional.

How does recovery get tested? Half-open probes should be limited so a recovering dependency is not immediately flooded.

Who sees the state? Expose open circuits, failure rates, fallback counts, and probe outcomes in service dashboards and alerts.

Knowledge Check: Circuit Breaker Behavior

5.10 Retry, Backoff, and Jitter

Retries are useful only when the failure is likely transient and the operation is safe to repeat. A retry policy should answer four questions before it is enabled.

A timeline shows normal device requests, a dependency failure, synchronized retries, and a request-load spike above server capacity — Figure 5.3: Naive retry storm after a dependency failure

5.10.1 Is the operation safe to retry?

Reads are usually safe. Writes need idempotency keys, command sequence numbers, or server-side deduplication. Physical commands deserve extra caution.

5.10.2 Is there enough deadline left?

Retries must fit inside the caller’s total deadline. A retry that starts after the user-facing budget is gone only creates extra load.

5.10.3 Does the error look transient?

Connection resets, rate limits, and temporary unavailability can be retry candidates. Invalid input, authorization failure, and missing resources usually are not.

5.10.4 Are retries desynchronized?

Exponential backoff reduces pressure. Jitter spreads devices and services so they do not all retry at the same instant.

def retry_delay(base_seconds, attempt, cap_seconds, jitter_fraction, random_value):
    """Return a capped exponential backoff delay with caller-provided randomness."""
    backoff = min(base_seconds * (2 ** attempt), cap_seconds)
    jitter = backoff * jitter_fraction * random_value
    return backoff + jitter

Retry Rule for IoT Commands

Do not blindly retry commands that may change the physical world. If an actuator command might already have succeeded, retry only when the command has an idempotency key or device-side command sequence number that prevents duplicate execution.

Knowledge Check: Retry Safety

The retry questions establish the danger of repeating work. The next control is the clock: without a deadline, even a “safe” retry can arrive too late to help.

5.11 Timeouts and Deadlines

A timeout caps one operation. A deadline caps the whole user-visible or workflow-visible budget. Resilient services usually need both.

1. Caller budget Start with the user, device, or workflow SLO. A status request may have a small budget; a firmware transfer may have a larger one.

2. Per-hop budget Allocate time to each dependency call. Leave room for serialization, network delay, fallback, and response handling.

3. Retry budget Only retry while enough budget remains for another useful attempt and fallback.

4. Cancel work Propagate cancellation so downstream services stop doing work after the caller no longer needs the result.

5.11.1 Timeout Too Short

Normal network variance looks like failure. The service may open circuits or use fallback unnecessarily.

5.11.2 Timeout Too Long

Workers, sockets, queue slots, and memory are held by work that the user has already stopped waiting for.

Checkpoint: Retry and Deadline Budget

You now know:

Retries need three gates: the operation is safe, the error is transient, and the deadline still has room for another useful attempt.
Exponential backoff and jitter prevent device fleets from retrying at the same instant during recovery.
Connect timeout, read timeout, and overall deadline should be separate signals so operators can see where time was lost.

5.12 Bulkhead Pattern

Bulkheads limit the blast radius of a dependency, tenant, device fleet, or workload class. The goal is simple: one failing dependency should not consume all resources that healthy work needs.

Telemetry, analytics, and notification workloads use separate pools so an exhausted analytics pool does not consume capacity reserved for the other workloads — Figure 5.4: Bulkhead isolation with separate thread pools

5.12.1 Worker Pools

Separate workers for telemetry ingestion, command dispatch, alerts, and analytics. A blocked analytics dependency should not stop command dispatch.

5.12.2 Connection Pools

Give each downstream database, API, or broker its own connection budget. A slow dependency cannot take every socket.

5.12.3 Queues

Separate high-priority command queues from bulk telemetry or batch analytics. Queue limits should shed low-value work first.

5.12.4 Rate Budgets

Apply per-tenant, per-device-class, or per-service limits so one noisy fleet does not starve everyone else.

5.12.5 Service Partitions

Separate life-safety or control-plane capabilities from convenience or analytics features where the risk justifies it.

5.12.6 Dead Letter Routes

Move poison messages aside after bounded attempts so they do not block the main processing stream.

Knowledge Check: Bulkhead Isolation

Bulkheads keep one queue, pool, or workload from exhausting the caller. Once the blast radius is contained, the design still needs to say what the user or device receives while degraded.

5.13 Fallback and Graceful Degradation

Fallback is not the same as hiding the problem. A good fallback is honest, useful, and bounded.

5.13.1 Last Known State

Show cached telemetry with a visible freshness indicator when live reads are unavailable.

5.13.2 Queue for Later

Accept a safe idempotent request, place it in a durable queue, and report that execution is pending.

5.13.3 Local Control

Let gateways or devices execute critical local rules when cloud coordination is unavailable.

5.13.4 Read-Only Mode

Disable writes while allowing status, history, and diagnostics to remain available.

5.13.5 Manual Confirmation

For high-risk physical actions, require an operator to confirm state rather than silently retrying.

5.13.6 Explicit Unavailable

When no safe fallback exists, fail clearly and preserve resources instead of pretending the action succeeded.

Try It: Degraded-Mode Test Card

Before a resilience pattern is accepted, write one test card for the degraded path it creates:

Trigger: name the exact condition, such as cloud API down, dependency timeout, full queue, or open circuit.
Allowed behavior: state what the user, device, or service may still do while degraded.
Blocked behavior: state which write, command, or automation must stop instead of retrying silently.
Freshness signal: decide how the interface or log shows cached data, queued work, read-only mode, or explicit unavailability.
Recovery check: define the first healthy signal that lets the service leave degraded mode without a retry surge.

The fallback is ready only when this card can be exercised in a test environment and the result is visible in metrics, logs, and the user-facing workflow.

5.14 Combining the Patterns

These controls should be ordered deliberately. One practical service-call flow is:

1. Start deadline Attach a total budget and cancellation signal.

2. Check bulkhead Reserve from the dependency-specific pool or reject quickly if that pool is full.

3. Ask breaker If open, skip the dependency and use fallback.

4. Attempt call Use per-attempt timeout and collect latency, status, and error signals.

5. Retry if safe Retry transient failures only while deadline, retry budget, and idempotency rules allow it.

6. Record outcome Update breaker state, metrics, traces, logs, and user-facing degraded-mode counters.

Knowledge Check: Pattern Ordering

Checkpoint: Ordered Resilience Controls

You now know:

Reserve bulkhead capacity before sending dependency traffic, then ask the circuit breaker whether the call should be admitted.
A first attempt that fails after 1.8 seconds inside a two-second deadline usually leaves no useful retry budget.
Fallbacks must be honest: cached state needs freshness, queued work needs pending status, unsafe commands need manual confirmation or explicit unavailability.

5.15 Architecture Review Checklist

Use this checklist during design review or incident follow-up.

Dependency inventory: List every synchronous dependency the service calls during the workflow.

Deadline budget: Document the total deadline, per-attempt timeout, retry limit, and cancellation behavior.

Retry eligibility: Mark each operation as safe, idempotent with key, unsafe to retry, or asynchronous retry only.

Breaker policy: Define the failure signal, opening rule, cooldown, half-open probe limit, and fallback behavior.

Bulkhead boundary: Identify the pool, queue, connection budget, or service partition that contains each dependency failure.

Fallback contract: Decide what users, devices, and downstream systems see during degraded mode.

Recovery test: Run controlled dependency failures and verify that the system degrades, alerts, and recovers without synchronized retry storms.

5.16 Common Pitfalls

5.16.1 Retrying Everything

Blind retries duplicate unsafe commands and amplify load. Retry only transient failures with idempotency and a deadline budget.

5.16.2 Timeout Inflation

Increasing timeouts may hide symptoms while making resource exhaustion worse. Investigate the dependency and the caller budget.

5.16.3 Circuit Breaker Without Fallback

Failing fast protects resources, but the product still needs a user-visible or workflow-visible degraded result.

5.16.4 Shared Pool Everywhere

One pool for every downstream call makes one slow dependency everyone’s incident.

5.16.5 Missing Jitter

Backoff without jitter can still synchronize clients. Jitter is essential for device fleets.

5.16.6 No Degraded-Mode Tests

Untested fallback paths often fail during the first real outage. Practice cloud-down, dependency-down, and queue-backlog scenarios.

Label the Diagram

Code Challenge

5.17 Summary

Resilience patterns prevent one failed dependency from consuming the whole service.
Deadlines and timeouts bound waiting.
Retries need idempotency, backoff, jitter, and a retry budget.
Circuit breakers fail fast after repeated failure signals and use half-open probes for recovery.
Bulkheads isolate resource pools so healthy features keep working.
Fallback behavior should be explicit, tested, observable, and honest about degraded mode.

5.18 Key Takeaway

Resilience is a coordinated contract, not a single pattern. Bound each call with deadlines, retry only safe operations with backoff and jitter, isolate shared resources, and test fallback behavior before a dependency outage reaches users or devices.

5.19 Knowledge Check

Quiz: SOA Resilience Patterns

Interactive Quiz: Match Resilience Pattern Concepts

Interactive Quiz: Sequence the Steps

5.20 Try It Yourself: Resilience Review Note

Choose one synchronous dependency in an IoT service and write a short review note.

service_call: command-api -> authorization-service
user_visible_deadline: 2s
per_attempt_timeout: 500ms
retry_policy:
  eligible_errors: [connection-reset, 503, 429]
  max_attempts: 2
  backoff: exponential
  jitter: required
idempotency:
  required: true
  key: command_id
circuit_breaker:
  opens_when: failure-rate-exceeds-policy-window
  half_open: limited-probes
bulkhead:
  resource: authorization-client-pool
  limit_basis: measured-load-test
fallback:
  behavior: reject unsafe commands, queue safe idempotent commands
observability:
  metrics: [deadline-exhausted, retry-attempts, open-circuit, fallback-count]

Then run a dependency-down test and verify that healthy requests still have capacity, unsafe commands are not duplicated, and retry traffic ramps back gradually.

5.21 References

5.22 What’s Next

5.22.1 State Machine Patterns

Use explicit state models for device behavior, workflow recovery, and circuit-breaker-like transitions.

5.22.2 SOA Container Orchestration

Deploy resilient services with health checks, rollout controls, autoscaling, and service-level resource policy.

5.22.3 SOA API Design and Service Discovery

Design API contracts with idempotency, versioning, rate limits, and discovery rules that make resilience possible.

5.22.4 Cloud Computing for IoT

Review cloud and edge placement choices that affect fallback, latency, and recovery behavior.

5.1 Start With the Command That Might Repeat

5.2 Resilience Protects The Caller First

5.3 Set The Policy For One Command Call

5.4 Policies Need Runtime Signals

5.5 Learning Objectives

5.6 Prerequisites

5.7 Resilience Control Map

5.7.1 Deadline

5.7.2 Retry

5.7.3 Circuit Breaker

5.7.4 Bulkhead

5.7.5 Fallback

5.7.6 Observability

5.8 Failure Modes in IoT Services

5.8.1 Slow Dependency

5.8.2 Retry Storm

5.8.3 Unsafe Duplicate Command

5.8.4 Shared-Pool Exhaustion

5.8.5 Partial Cloud Loss

5.8.6 Poison Message

5.9 Circuit Breaker Pattern

5.9.1 Closed

5.9.2 Open

5.9.3 Half-Open

5.9.4 Circuit Breaker Review Questions

5.10 Retry, Backoff, and Jitter

5.10.1 Is the operation safe to retry?

5.10.2 Is there enough deadline left?

5.10.3 Does the error look transient?

5.10.4 Are retries desynchronized?

5.11 Timeouts and Deadlines

5.11.1 Timeout Too Short

5.11.2 Timeout Too Long

5.12 Bulkhead Pattern

5.12.1 Worker Pools

5.12.2 Connection Pools

5.12.3 Queues

5.12.4 Rate Budgets

5.12.5 Service Partitions

5.12.6 Dead Letter Routes

5.13 Fallback and Graceful Degradation

5.13.1 Last Known State

5.13.2 Queue for Later

5.13.3 Local Control

5.13.4 Read-Only Mode

5.13.5 Manual Confirmation

5.13.6 Explicit Unavailable

5.14 Combining the Patterns

5.15 Architecture Review Checklist

5.16 Common Pitfalls

5.16.1 Retrying Everything

5.16.2 Timeout Inflation

5.16.3 Circuit Breaker Without Fallback

5.16.4 Shared Pool Everywhere

5.16.5 Missing Jitter

5.16.6 No Degraded-Mode Tests

5.17 Summary

5.18 Key Takeaway

5.19 Knowledge Check

5.20 Try It Yourself: Resilience Review Note

5.21 References

5.22 What’s Next

5.22.1 State Machine Patterns

5.22.2 SOA Container Orchestration

5.22.3 SOA API Design and Service Discovery

5.22.4 Cloud Computing for IoT

5.23 Navigation

5.23.1 Previous

5.23.2 Current

5.23.3 Next