18 Reliability & Errors
Key Concepts
- Error Recovery Strategy: Systematic approach to handling detected errors: detect → classify (transient vs permanent) → remediate (retry, reconnect, reset, failover) → log → report
- Transient vs Permanent Error: Transient: temporary condition (congestion, interference) that resolves with retry; permanent: requires human intervention (server misconfiguration, hardware failure)
- Circuit Breaker Pattern: Application-level failure handling that stops retrying after N consecutive failures, preventing cascade overload; states: Closed (normal), Open (failing, block requests), Half-Open (test recovery)
- Exponential Backoff: Retry strategy doubling the wait time after each failure (1 s, 2 s, 4 s, 8 s…) with jitter (±20%) to prevent synchronized retry storms from many IoT devices
- Dead Letter Queue: Repository for messages that could not be delivered after all retries; allows manual inspection, replay, or alerting without blocking the main data pipeline
- Graceful Degradation: System behavior under partial failure: continue operating with reduced functionality rather than complete failure; e.g., cache last-known sensor value when sensor is unreachable
- Error Rate SLO (Service Level Objective): Target for acceptable error rate; examples: <0.1% data delivery failure, <1% of messages with delivery latency >5 s; used to trigger alerts and capacity planning
- MQTT Last Will and Testament (LWT): Pre-configured message published by broker when a client disconnects unexpectedly; enables other subscribers to detect and respond to device disconnection
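The Circuit Breaker concept above can be made concrete with a short sketch. This is a minimal illustration of the three states (Closed, Open, Half-Open), not a production implementation; the class name, threshold, and cooldown values are assumptions chosen for the example.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker sketch: Closed -> Open after N consecutive
    failures, Open -> Half-Open after a cooldown, Half-Open -> Closed on
    the first success (or back to Open on failure)."""

    def __init__(self, failure_threshold=3, cooldown_s=30.0):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.state = "CLOSED"
        self.opened_at = 0.0

    def allow_request(self):
        if self.state == "OPEN":
            if time.monotonic() - self.opened_at >= self.cooldown_s:
                self.state = "HALF_OPEN"  # probe whether the server recovered
                return True
            return False                   # still in cooldown: block the request
        return True

    def record_success(self):
        self.failures = 0
        self.state = "CLOSED"

    def record_failure(self):
        self.failures += 1
        if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
            self.state = "OPEN"
            self.opened_at = time.monotonic()
```

A device would call `allow_request()` before each upload attempt, so a fleet stops hammering a failing server instead of retrying blindly.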
Learning Objectives
By the end of this section, you will be able to:
- Explain reliability fundamentals: Identify why IoT networks require error detection and recovery mechanisms, and distinguish which failure mode each pillar addresses
- Implement retry strategies: Design exponential backoff algorithms for congestion-aware retransmission
- Apply error detection: Calculate and verify CRC/checksum values to detect data corruption
- Manage packet sequencing: Use sequence numbers to detect loss, duplication, and reordering
- Design connection state machines: Build robust connection management with proper state transitions
- Analyze reliability trade-offs: Balance reliability guarantees against power consumption and latency
18.1 Prerequisites
Before diving into this chapter, you should be familiar with:
- Transport Fundamentals: Understanding TCP vs UDP trade-offs and basic acknowledgment concepts provides essential context for reliability mechanisms
- Networking Basics: Knowledge of packets, headers, and basic network transmission helps you understand where reliability fits in the protocol stack
- Binary and Hexadecimal: Familiarity with bitwise operations is helpful for understanding checksum calculations
Why Reliability Matters for IoT
Wireless IoT networks are inherently unreliable. Radio signals face interference, collisions, and fading. Battery-powered devices may sleep during transmissions. Network congestion causes packet drops. Without proper reliability mechanisms, your sensor data may never reach the cloud, or commands may fail to reach actuators. This chapter teaches you the building blocks that protocols like TCP, CoAP (Confirmable), and MQTT QoS use internally.
Sensor Squad: The Five Pillars of Reliability!
“IoT networks are inherently unreliable,” said Max the Microcontroller. “Radio signals face interference, batteries die mid-transmission, and networks get congested. We need five mechanisms to make data delivery dependable.”
“Error detection catches corrupted data using CRC or checksums,” explained Sammy the Sensor. “Retry mechanisms handle lost packets with exponential backoff – wait 1 second, then 2, then 4, so we do not overwhelm the network with retries.”
“Sequence numbers are my favorite,” added Lila the LED. “They detect three problems at once: lost packets (gap in sequence), duplicate packets (same number twice), and reordered packets (numbers arrive out of order). One simple counter solves all three.”
“Connection state management and keep-alive monitoring complete the picture,” said Bella the Battery. “State machines handle the lifecycle of connections, and heartbeat messages detect silent failures – when a device dies without sending a goodbye. Together, these five pillars give us TCP-like reliability even over unreliable wireless links.”
18.2 Chapter Overview
This topic is covered in three focused chapters:
18.2.1 Error Detection: CRC and Checksums
Learn how to detect data corruption during transmission using mathematical verification techniques.
Topics covered:
- Simple checksum algorithms and their limitations
- CRC (Cyclic Redundancy Check) calculation and polynomial division
- Comparison of CRC-16 vs CRC-32 vs simple checksums
- Hardware CRC acceleration on modern MCUs
- Choosing the right error detection for your application
Key concept: CRC treats data as a polynomial and uses division to generate a remainder that changes if any bits are corrupted.
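The polynomial-division idea can be sketched in a few lines. The following is a bitwise CRC-16/CCITT-FALSE implementation (polynomial 0x1021, initial value 0xFFFF), one common variant among the CRC-16 family discussed in the chapter:

```python
def crc16_ccitt(data: bytes, poly: int = 0x1021, init: int = 0xFFFF) -> int:
    """Bitwise CRC-16/CCITT-FALSE: treat the message as a polynomial and
    divide by `poly`; the 16-bit remainder is the checksum."""
    crc = init
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:  # MSB set: "subtract" (XOR) the polynomial
                crc = ((crc << 1) ^ poly) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc
```

The standard check value for this variant is `crc16_ccitt(b"123456789") == 0x29B1`; flipping any single bit of the input changes the remainder, which is exactly the corruption-detection property the chapter relies on. On real MCUs you would typically use a table-driven or hardware-accelerated version instead.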
18.2.2 Retry Mechanisms and Sequence Numbers
Master congestion-aware retry strategies and packet ordering techniques.
Topics covered:
- Why naive retry causes collision storms
- Exponential backoff algorithm implementation
- Random jitter to prevent synchronized retries
- Sequence numbers for loss, duplicate, and reorder detection
- Handling sequence number wraparound with modular arithmetic
- Selective acknowledgment (SACK) for high throughput
Key concept: Exponential backoff doubles wait time after each failure, reducing network load during congestion.
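The backoff calculation itself fits in one function. This is a minimal sketch of the doubling-with-jitter scheme described above; the base delay, cap, and ±20% jitter factor are the example values used earlier in this chapter:

```python
import random

def backoff_delay(attempt: int, base_s: float = 1.0, cap_s: float = 60.0,
                  jitter: float = 0.2) -> float:
    """Delay before retry number `attempt` (0-based): base * 2^attempt,
    capped at `cap_s`, with +/-20% random jitter so a fleet of devices
    does not retry in lockstep."""
    delay = min(base_s * (2 ** attempt), cap_s)
    return delay * random.uniform(1.0 - jitter, 1.0 + jitter)
```

Without the `random.uniform` factor, thousands of devices that lost connectivity at the same moment would all retry at the same instants, recreating the congestion that caused the loss.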
18.2.3 Connection State Management and Lab
Build a complete reliable transport system with hands-on implementation.
Topics covered:
- Connection state machine design (DISCONNECTED, CONNECTING, CONNECTED, DISCONNECTING)
- Keep-alive mechanisms for long-lived connections
- NAT timeout considerations for cellular IoT
- Comprehensive ESP32 lab implementing all five reliability pillars
- Statistics tracking and performance measurement
Key concept: State machines prevent resource leaks and handle unexpected disconnections gracefully.
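A connection state machine of the kind listed above can be sketched as an explicit transition table. The event names (`connect`, `ack`, and so on) are illustrative assumptions; the point is that any (state, event) pair not in the table is rejected rather than silently ignored:

```python
from enum import Enum, auto

class ConnState(Enum):
    DISCONNECTED = auto()
    CONNECTING = auto()
    CONNECTED = auto()
    DISCONNECTING = auto()

# Legal transitions; anything else indicates a bug or an unexpected event.
TRANSITIONS = {
    (ConnState.DISCONNECTED, "connect"):        ConnState.CONNECTING,
    (ConnState.CONNECTING,   "ack"):            ConnState.CONNECTED,
    (ConnState.CONNECTING,   "timeout"):        ConnState.DISCONNECTED,
    (ConnState.CONNECTED,    "close"):          ConnState.DISCONNECTING,
    (ConnState.CONNECTED,    "keepalive_lost"): ConnState.DISCONNECTED,
    (ConnState.DISCONNECTING, "closed"):        ConnState.DISCONNECTED,
}

def step(state: ConnState, event: str) -> ConnState:
    """Apply an event; rejecting unknown pairs is what prevents a
    half-open connection from leaking resources unnoticed."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"illegal event {event!r} in state {state.name}")
```

The ESP32 lab builds the same idea in C with timers attached to each state; this table form is just the easiest way to see the full lifecycle at a glance.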
18.3 The Five Pillars of IoT Reliability
| Pillar | Mechanism | Failure Addressed | Overhead | Chapter |
|---|---|---|---|---|
| Detection | CRC, checksums | Bit corruption | 2-4 bytes per packet | Error Detection |
| Identification | Sequence numbers | Loss, duplication, reordering | 2-4 bytes per packet | Retry & Sequencing |
| Confirmation | ACK/NACK | Silent delivery failure | 1 packet per message | Retry & Sequencing |
| Recovery | Timeout + retry with exponential backoff | Loss and congestion | Retransmissions, added latency during backoff | Retry & Sequencing |
| State | Connection machine | Session failures | Minimal | Connection & Lab |
For Beginners: Understanding Reliability
What is network reliability? Reliability means ensuring your data arrives at its destination correctly, completely, and in the right order. It is like sending a valuable package - you want tracking, confirmation of delivery, and protection against damage.
Why is this challenging for IoT? Unlike wired Ethernet with 99.99% reliability, wireless IoT networks may lose 5-30% of packets. Sensors may send data while the gateway is rebooting. Radio interference can corrupt bits mid-transmission. Without reliability mechanisms, your smart thermostat might miss the “turn off” command, or your environmental monitor might report corrupted readings.
Key mechanisms covered in this chapter:
| Mechanism | Purpose | Analogy |
|---|---|---|
| Retry with Backoff | Resend lost data without overwhelming the network | Calling back later when the line is busy |
| Acknowledgments | Confirm successful delivery | Delivery receipt signature |
| Checksums/CRC | Detect corrupted data | Package inspection on arrival |
| Sequence Numbers | Detect missing/duplicate/reordered data | Numbered pages in a book |
| Timeouts | Know when to give up waiting | Postal tracking “delivery failed” |
| Connection State | Track communication session status | Phone call: ringing, connected, ended |
18.4 Recommended Reading Order
For a comprehensive understanding, read the chapters in this order:
- Error Detection (~15 min) - Start with how to detect corrupted data
- Retry and Sequencing (~15 min) - Learn recovery mechanisms
- Connection State and Lab (~30 min) - Apply everything in hands-on code
Alternatively, jump directly to the topic you need:
- Need to implement CRC? Start with Error Detection
- Implementing retry logic? Go to Retry and Sequencing
- Building a protocol from scratch? The Lab has complete code
18.5 Worked Example: Designing Reliability for a Pipeline Leak Detection System
Scenario: An oil company deploys 400 pressure sensors along a 200 km pipeline. Each sensor sends a 24-byte pressure reading every 30 seconds via LoRaWAN (SF10, ~1% packet loss in normal conditions, up to 15% during heavy rain). A pressure drop of more than 5 bar in 60 seconds indicates a potential leak and must trigger an alarm within 90 seconds. The design team must select reliability mechanisms that balance detection speed against battery life (target: 5 years on a D-cell lithium battery, 19,000 mAh).
Step 1: Determine which reliability pillars are needed
| Pillar | Needed? | Rationale |
|---|---|---|
| Error detection (CRC) | Yes | Corrupted pressure readings could cause false alarms. LoRaWAN already includes CRC-16 at PHY layer. |
| Sequence numbers | Yes | Must detect missing readings to distinguish “no data received” from “pressure is stable.” A 2-minute gap could mask a leak. |
| Acknowledgments | Conditional | Routine readings: no ACK (Non-Confirmable). Alarm triggers: ACK required (Confirmable). |
| Retry with backoff | Conditional | Only for alarm messages. Routine readings are replaced by the next reading in 30 seconds. |
| Keep-alive | Yes | Gateway must detect silent sensor failure within 5 minutes to dispatch maintenance. |
Step 2: Calculate packet loss impact on leak detection
A leak causes pressure to drop over 2-4 readings (60-120 seconds). The gateway needs at least 2 consecutive readings showing >5 bar drop to confirm a leak (avoiding false alarms from single corrupted readings).
Normal conditions (1% loss): Probability of missing 2 consecutive readings = 0.01 × 0.01 = 0.0001 (0.01%). Detection delay: 0-30 seconds. Acceptable.
Heavy rain (15% loss):
| Scenario | Probability | Detection Delay | Outcome |
|---|---|---|---|
| Both readings arrive | 0.85 × 0.85 = 72.3% | 30-60 sec | Alarm within 90 sec target |
| First lost, second arrives | 0.15 × 0.85 = 12.7% | 60-90 sec | Alarm at boundary of target |
| First arrives, second lost | 0.85 × 0.15 = 12.7% | 60-90 sec | Must wait for third reading |
| Both lost | 0.15 × 0.15 = 2.3% | 90-120 sec | Exceeds 90-sec target |
Risk: In heavy rain, 2.3% chance of exceeding the 90-second alarm target. Mitigation: when the first anomalous reading is detected, switch that sensor to Confirmable mode (ACK required) for the next 5 minutes. Retry timeout: 2 seconds (well within the 90-second window).
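The probabilities in the heavy-rain table follow directly from treating each packet loss as independent. A small helper reproduces them:

```python
def detection_outcomes(loss: float) -> dict:
    """Probability of each two-reading outcome, assuming each reading is
    lost independently with probability `loss` (0.15 in heavy rain)."""
    ok = 1.0 - loss
    return {
        "both_arrive": ok * ok,        # alarm within 30-60 s
        "one_lost":    2 * loss * ok,  # alarm delayed to 60-90 s
        "both_lost":   loss * loss,    # exceeds the 90-s target
    }
```

For `loss = 0.15` this yields 72.25%, 25.5% (the two 12.7% rows combined), and 2.25%, matching the table; the independence assumption is optimistic if rain fade correlates losses, which is another argument for the Confirmable-mode fallback.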
Putting Numbers to It
Energy Cost of Reliability Mechanisms
Let’s calculate how reliability overhead impacts 5-year battery life for the 400-sensor pipeline system:
\[ \text{Baseline TX energy per reading} = 90 \text{ mA} \times 0.37 \text{ s} = 33.3 \text{ mAs} = 0.00925 \text{ mAh} \]
At 30-second intervals (2,880 readings/day), daily transmission energy:
\[ E_{\text{daily}} = 2{,}880 \times 0.00925 + (0.002 \text{ mA} \times 24 \text{ h}) = 26.64 + 0.048 = 26.69 \text{ mAh} \]
\[ \text{Battery life} = \frac{19{,}000}{26.69} = 712 \text{ days} = 1.95 \text{ years} \]
Optimization: Adaptive reporting (5-minute normal interval, 30-second alert mode) reduces baseline:
\[ \text{Normal readings/day} = \frac{24 \times 60}{5} = 288 \quad \Rightarrow \quad E_{\text{daily}} = 288 \times 0.00925 + 0.048 = 2.71 \text{ mAh} \]
\[ \text{Battery life (normal)} = \frac{19{,}000}{2.71} = 7{,}011 \text{ days} = 19.2 \text{ years} \]
Adding keep-alive heartbeats (96/day at 0.89 mAh) and sleep current (0.048 mAh): 2.664 + 0.89 + 0.048 = 3.60 mAh/day → 14.5 years. Reliability mechanisms (keep-alive + ACK overhead) consume less than 30% of battery budget, leaving ample margin for environmental factors and hardware aging.
Step 3: Calculate energy cost of reliability mechanisms
Baseline (no reliability, Non-Confirmable only):
| Component | Value |
|---|---|
| TX current (LoRa SF10) | 90 mA |
| TX time per reading | 370 ms |
| Energy per reading | 90 mA × 0.37 s = 33.3 mA·s = 0.00925 mAh |
| Readings per day | 2,880 |
| Daily TX energy | 26.64 mAh |
| Sleep current | 2 µA |
| Daily sleep energy | 0.048 mAh |
| Daily total | 26.69 mAh |
| Battery life | 19,000 / 26.69 = 712 days = 1.95 years |
This falls short of the 5-year target. Must reduce transmission frequency.
Optimized design (adaptive reporting):
| Mode | Interval | Readings/day | Daily TX energy |
|---|---|---|---|
| Normal (pressure stable) | 5 minutes | 288 | 2.66 mAh |
| Alert (pressure changing) | 30 seconds | Up to 600 in 5-hour alert window | 5.55 mAh (worst case 5-hour sustained alert) |
| Keep-alive heartbeat | 15 minutes | 96 | 0.89 mAh |
Normal day (no alerts): 2.66 + 0.89 + 0.048 = 3.60 mAh/day. Battery life = 19,000 / 3.60 = 5,278 days = 14.5 years. Well above the 5-year target.
Worst case (1 sustained alert event per day lasting 5 hours, e.g., a storm or maintenance window): 2.66 + 5.55 + 0.89 + 0.048 = 9.15 mAh/day. Battery life = 19,000 / 9.15 = 2,077 days = 5.7 years. Meets target even with prolonged daily alerts.
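The battery-life arithmetic in Steps 2-3 can be packaged into one function so you can rerun it for other duty cycles. The defaults are the worked example's figures (90 mA TX at SF10, 370 ms airtime, 19,000 mAh cell, 2 µA sleep):

```python
def battery_life_days(readings_per_day: int, capacity_mah: float = 19_000,
                      tx_ma: float = 90.0, tx_s: float = 0.37,
                      sleep_mah_day: float = 0.048,
                      keepalive_per_day: int = 0) -> float:
    """Battery life in days from per-transmission energy.
    mAh per TX = mA * s / 3600; keep-alives cost the same as readings."""
    per_tx_mah = tx_ma * tx_s / 3600.0
    daily = (readings_per_day + keepalive_per_day) * per_tx_mah + sleep_mah_day
    return capacity_mah / daily
```

With the chapter's numbers, `battery_life_days(2880)` gives roughly 712 days (the 1.95-year baseline) and `battery_life_days(288, keepalive_per_day=96)` gives roughly 5,278 days (the 14.5-year adaptive design).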
Step 4: Sequence number and keep-alive design
| Mechanism | Implementation | Overhead |
|---|---|---|
| Sequence number | 16-bit counter (wraps at 65,535 = 22.7 days at 30-sec intervals) | 2 bytes per packet |
| Keep-alive | Heartbeat every 15 min, gateway alerts if 2 consecutive heartbeats missed (30 min silence) | 96 extra packets/day |
| Adaptive ACK | Gateway sends downlink ACK request when anomalous reading detected | 1 extra RX window per alert |
Key insight: Reliability in this pipeline system is not one-size-fits-all. Routine pressure readings use fire-and-forget (Non-Confirmable) because the next reading arrives in 5 minutes. But the moment an anomalous reading arrives, the system shifts to confirmed delivery with retries, spending extra energy only when it matters. This adaptive approach achieves 14.5-year battery life in normal operation while still meeting the 90-second alarm deadline in 97.7% of heavy-rain scenarios. The remaining 2.3% risk is mitigated by the Confirmable retry, which adds at most one 2-second retry cycle.
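One way to handle the 16-bit wraparound mentioned in the mechanism table is serial-number arithmetic in the spirit of RFC 1982: compare sequence numbers modulo 2^16 rather than as plain integers, so that 5 correctly counts as newer than 65,530 after a wrap. A sketch:

```python
SEQ_BITS = 16
HALF = 1 << (SEQ_BITS - 1)   # 32,768
MASK = (1 << SEQ_BITS) - 1   # 65,535

def seq_newer(a: int, b: int) -> bool:
    """True if sequence number `a` is logically newer than `b`, treating
    the 16-bit counter as circular: `a` is newer when it lies less than
    half the sequence space ahead of `b`."""
    return a != b and ((a - b) & MASK) < HALF
```

The gateway uses this comparison to flag gaps (loss), repeats (duplicates), and backwards jumps (reordering) without any special-case code at the 65,535 → 0 boundary.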
18.6 Knowledge Check
18.7 Summary
Reliable data delivery in IoT requires multiple complementary mechanisms working together. No single mechanism is sufficient on its own:
- Error Detection: CRC/checksums catch bit corruption before bad data reaches the application
- Sequence Numbers: Identify loss, duplicates, and reordering with a single counter
- Acknowledgments: Confirm successful delivery so the sender knows when to stop waiting
- Exponential Backoff: Prevent network congestion storms during retries
- Connection State: Manage session lifecycle to avoid resource leaks
Key Trade-offs:
- More reliability = more overhead (bytes, packets, latency, power)
- Choose the right level for your application (CoAP NON vs CON, MQTT QoS 0/1/2)
- Adapt mechanisms to network conditions - static configurations waste battery or miss alarms
18.8 Concept Relationships
Builds Upon:
- Information theory → Error detection codes are systematic redundancy
- Probability theory → Exponential backoff leverages randomness to avoid collisions
- State machines → Connection lifecycle management uses formal state transitions
Enables:
- CoAP Protocol: CON vs NON messages implement selective reliability
- MQTT QoS: Three quality-of-service levels map to reliability pillars
- TCP Optimization: Fine-tune TCP’s built-in reliability for IoT
Related Concepts:
- Automatic Repeat Request (ARQ) protocols formalize retry mechanisms
- Flow control (TCP sliding window) extends sequence numbering
- Congestion control uses exponential backoff during network overload
18.9 See Also
Sub-Chapters (Deep Dives):
- Error Detection: CRC vs checksum algorithms, hardware acceleration
- Retry and Sequencing: Exponential backoff, sequence wraparound, SACK
- Connection State Lab: Complete ESP32 implementation with all five pillars
Protocol Applications:
- CoAP Features: Confirmable messages, retransmission timing
- MQTT Fundamentals: QoS 0 (fire-and-forget) vs QoS 1 (acknowledged)
- DTLS Handshake: Handshake retransmission uses exponential backoff
System Design:
- QoS Service Management: Application-level quality of service
- Edge Computing: Local reliability vs cloud reliability trade-offs
Standards:
- RFC 793: TCP reliability mechanisms (retransmission, sequencing)
- RFC 7252: CoAP reliability and message types
- RFC 6298: TCP retransmission timeout computation
Common Pitfalls
1. Retrying Immediately Without Backoff After Connection Failure
A device that reconnects immediately after connection failure and retries 10 times per second generates a “retry storm” when a server goes down. A fleet of 10,000 devices each retrying at 10 req/s generates 100,000 connection attempts per second against the recovering server, preventing it from coming back up. Always implement exponential backoff with jitter: start at 1 second, double each retry, cap at 60–300 seconds, add ±25% random jitter to desynchronize fleet retry timing.
2. Not Distinguishing Between Network and Application Layer Errors
TCP connection refused (network error) and HTTP 503 Service Unavailable (application error) require different responses. Network errors may indicate a temporary outage → retry with backoff. HTTP 4xx errors (Bad Request, Unauthorized) are client errors that cannot be resolved by retrying — fix the request before retrying. HTTP 429 Too Many Requests means back off immediately and honor the Retry-After header. Mapping all errors to “retry” wastes resources and may violate rate limits.
3. Losing Data During Error Recovery Without Local Buffering
An IoT device that discards sensor readings during connectivity outages creates data gaps that distort analytics. If the outage lasts 4 hours and readings occur every minute, 240 readings are permanently lost. Implement persistent local buffering: store readings in flash with timestamps during outage, batch-upload on reconnection. Size the buffer for the maximum expected outage duration: 7 days × readings/day × bytes/reading must fit in available flash storage.
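The buffer-sizing rule above is simple arithmetic, sketched here so the flash requirement can be checked against a part's datasheet:

```python
def buffer_bytes(outage_days: float, reading_interval_s: float,
                 bytes_per_reading: int) -> int:
    """Flash space needed to buffer readings across the worst-case outage:
    readings accumulated over the outage times the size of each record."""
    readings = int(outage_days * 86_400 / reading_interval_s)
    return readings * bytes_per_reading
```

For the example in the text (7-day outage, one 24-byte reading per minute) this comes to 10,080 readings, about 242 KB, which comfortably fits in a typical external SPI flash but not in many MCUs' internal flash alongside firmware.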
4. Not Testing Error Handling Paths in Integration Tests
Error handling code that is never executed in production tests may contain bugs discovered only during actual incidents. Use chaos engineering techniques: inject network failures (iptables DROP rules), introduce artificial API errors (mock server returning 500), simulate sensor failures (disconnect I2C sensor). Verify that each error path: logs the error with sufficient context, triggers the correct recovery action, and does not leave the system in an inconsistent state.
18.10 What’s Next
After understanding the five reliability pillars, choose your next step based on your learning goal:
| Next Chapter | Focus | Why Read It |
|---|---|---|
| Error Detection: CRC and Checksums | CRC-16 vs CRC-32, polynomial division, hardware acceleration | Apply error detection — implement and compare checksum algorithms on a microcontroller |
| Retry Mechanisms and Sequence Numbers | Exponential backoff, jitter, SACK, wraparound arithmetic | Design congestion-safe retry logic and detect loss, duplication, and reordering |
| Connection State Management and Lab | State machines, keep-alive, NAT timeouts, ESP32 implementation | Build a complete reliable transport layer combining all five pillars |
| CoAP Features and Labs | CON vs NON messages, retransmission timing | Evaluate how CoAP maps reliability pillars to protocol-level message types |
| MQTT QoS and Session | QoS 0, 1, 2; persistent sessions | Select the correct MQTT QoS level for different IoT message classes |
| Transport Protocol Optimizations | TCP tuning, buffer sizing, keep-alive timers | Configure TCP reliability parameters for constrained IoT devices |