18  Reliability & Errors

In 60 Seconds

IoT reliability rests on five pillars: error detection (CRC/checksums to catch corruption), retry mechanisms (exponential backoff to handle packet loss without overwhelming the network), sequence numbering (to detect loss, duplication, and reordering), connection state management (state machines for robust session handling), and keep-alive monitoring (heartbeats to detect silent failures). This overview chapter introduces all five pillars and links to dedicated deep-dive chapters for each.

Key Concepts
  • Error Recovery Strategy: Systematic approach to handling detected errors: detect → classify (transient vs permanent) → remediate (retry, reconnect, reset, failover) → log → report
  • Transient vs Permanent Error: Transient: temporary condition (congestion, interference) that resolves with retry; permanent: requires human intervention (server misconfiguration, hardware failure)
  • Circuit Breaker Pattern: Application-level failure handling that stops retrying after N consecutive failures, preventing cascade overload; states: Closed (normal), Open (failing, block requests), Half-Open (test recovery)
  • Exponential Backoff: Retry strategy doubling the wait time after each failure (1 s, 2 s, 4 s, 8 s…) with jitter (±20%) to prevent synchronized retry storms from many IoT devices
  • Dead Letter Queue: Repository for messages that could not be delivered after all retries; allows manual inspection, replay, or alerting without blocking the main data pipeline
  • Graceful Degradation: System behavior under partial failure: continue operating with reduced functionality rather than complete failure; e.g., cache last-known sensor value when sensor is unreachable
  • Error Rate SLO (Service Level Objective): Target for acceptable error rate; examples: <0.1% data delivery failure, <1% message delivery latency >5 s; used to trigger alerts and capacity planning
  • MQTT Last Will and Testament (LWT): Pre-configured message published by broker when a client disconnects unexpectedly; enables other subscribers to detect and respond to device disconnection
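As a concrete illustration of the circuit breaker pattern described above, here is a minimal Python sketch. The class name, the failure threshold, and the recovery timeout are illustrative defaults, not from any specific library:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker sketch: Closed -> Open -> Half-Open -> Closed.
    Thresholds and timeouts are illustrative, not prescriptive."""

    def __init__(self, failure_threshold=5, recovery_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failures = 0
        self.state = "CLOSED"
        self.opened_at = None

    def call(self, request_fn):
        if self.state == "OPEN":
            if time.monotonic() - self.opened_at >= self.recovery_timeout:
                self.state = "HALF_OPEN"   # allow one probe request through
            else:
                raise RuntimeError("circuit open: request blocked")
        try:
            result = request_fn()
        except Exception:
            self.failures += 1
            # A failed probe, or too many consecutive failures, opens the circuit
            if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
                self.state = "OPEN"
                self.opened_at = time.monotonic()
            raise
        else:
            self.failures = 0
            self.state = "CLOSED"
            return result
```

While the circuit is open, requests fail fast without touching the network, which is exactly what protects a recovering server from a retry storm.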

Learning Objectives

By the end of this section, you will be able to:

  • Explain reliability fundamentals: Identify why IoT networks require error detection and recovery mechanisms, and distinguish which failure mode each pillar addresses
  • Implement retry strategies: Design exponential backoff algorithms for congestion-aware retransmission
  • Apply error detection: Calculate and verify CRC/checksum values to detect data corruption
  • Manage packet sequencing: Use sequence numbers to detect loss, duplication, and reordering
  • Design connection state machines: Build robust connection management with proper state transitions
  • Analyze reliability trade-offs: Balance reliability guarantees against power consumption and latency

18.1 Prerequisites

Before diving into this chapter, you should be familiar with:

  • Transport Fundamentals: Understanding TCP vs UDP trade-offs and basic acknowledgment concepts provides essential context for reliability mechanisms
  • Networking Basics: Knowledge of packets, headers, and basic network transmission helps you understand where reliability fits in the protocol stack
  • Binary and Hexadecimal: Familiarity with bitwise operations is helpful for understanding checksum calculations

Why Reliability Matters for IoT

Wireless IoT networks are inherently unreliable. Radio signals face interference, collisions, and fading. Battery-powered devices may sleep during transmissions. Network congestion causes packet drops. Without proper reliability mechanisms, your sensor data may never reach the cloud, or commands may fail to reach actuators. This chapter teaches you the building blocks that protocols like TCP, CoAP (Confirmable), and MQTT QoS use internally.

“IoT networks are inherently unreliable,” said Max the Microcontroller. “Radio signals face interference, batteries die mid-transmission, and networks get congested. We need five mechanisms to make data delivery dependable.”

“Error detection catches corrupted data using CRC or checksums,” explained Sammy the Sensor. “Retry mechanisms handle lost packets with exponential backoff – wait 1 second, then 2, then 4, so we do not overwhelm the network with retries.”

“Sequence numbers are my favorite,” added Lila the LED. “They detect three problems at once: lost packets (gap in sequence), duplicate packets (same number twice), and reordered packets (numbers arrive out of order). One simple counter solves all three.”

“Connection state management and keep-alive monitoring complete the picture,” said Bella the Battery. “State machines handle the lifecycle of connections, and heartbeat messages detect silent failures – when a device dies without sending a goodbye. Together, these five pillars give us TCP-like reliability even over unreliable wireless links.”
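Lila's single-counter idea can be sketched in a few lines of Python. This simplified receiver (the `classify` helper and its `state` dict are our own, hypothetical names) detects all three failure modes; a real implementation would bound the set of seen numbers to a sliding window rather than letting it grow forever:

```python
def classify(seq, state):
    """Classify an incoming packet by its sequence number.
    state = {"last": highest sequence seen, "seen": set of all sequences}."""
    if seq in state["seen"]:
        return "duplicate"
    state["seen"].add(seq)
    if seq == state["last"] + 1:
        state["last"] = seq
        return "in-order"
    if seq > state["last"] + 1:
        missing = seq - state["last"] - 1   # gap: packets lost (or arriving late)
        state["last"] = seq
        return f"loss ({missing} missing)"
    return "reordered"   # lower than expected and not seen before: arrived late
```

One counter, three detections: a gap means loss, a repeat means duplication, and a late low number means reordering.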


18.2 Chapter Overview

This topic is covered in three focused chapters:

18.2.1 Error Detection: CRC and Checksums

Learn how to detect data corruption during transmission using mathematical verification techniques.

Topics covered:

  • Simple checksum algorithms and their limitations
  • CRC (Cyclic Redundancy Check) calculation and polynomial division
  • Comparison of CRC-16 vs CRC-32 vs simple checksums
  • Hardware CRC acceleration on modern MCUs
  • Choosing the right error detection for your application

Key concept: CRC treats data as a polynomial and uses division to generate a remainder that changes if any bits are corrupted.
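To make the contrast concrete, here is a minimal Python sketch of a simple additive checksum next to a bitwise CRC-16/CCITT-FALSE (polynomial 0x1021, initial value 0xFFFF). The additive checksum misses byte swaps that the CRC catches:

```python
def additive_checksum(data: bytes) -> int:
    """Simple 8-bit additive checksum: cheap, but order-insensitive,
    so swapped bytes go undetected."""
    return sum(data) & 0xFF

def crc16_ccitt(data: bytes, poly=0x1021, init=0xFFFF) -> int:
    """Bitwise CRC-16/CCITT-FALSE: the data is treated as a polynomial and
    divided by `poly`; the 16-bit remainder changes if any bit flips."""
    crc = init
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ poly) if (crc & 0x8000) else (crc << 1)
            crc &= 0xFFFF
    return crc
```

The standard check value for this CRC variant over the bytes `"123456789"` is 0x29B1, which makes a handy self-test when porting the routine to an MCU.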

18.2.2 Retry Mechanisms and Sequence Numbers

Master congestion-aware retry strategies and packet ordering techniques.

Topics covered:

  • Why naive retry causes collision storms
  • Exponential backoff algorithm implementation
  • Random jitter to prevent synchronized retries
  • Sequence numbers for loss, duplicate, and reorder detection
  • Handling sequence number wraparound with modular arithmetic
  • Selective acknowledgment (SACK) for high throughput

Key concept: Exponential backoff doubles wait time after each failure, reducing network load during congestion.
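The key concept above translates directly into code. In this sketch, the function name, the 60-second cap, and the ±20% jitter are illustrative defaults:

```python
import random

def backoff_delays(max_retries=6, base=1.0, cap=60.0, jitter=0.2):
    """Generate exponential backoff delays with jitter: 1 s, 2 s, 4 s, ...
    capped at `cap`, each perturbed by +/-20% to desynchronize device fleets."""
    delays = []
    for attempt in range(max_retries):
        delay = min(cap, base * (2 ** attempt))       # doubling, then capped
        delay *= 1 + random.uniform(-jitter, jitter)  # random jitter
        delays.append(delay)
    return delays
```

Without the jitter term, thousands of devices that lost connectivity at the same moment would all retry at the same instants, recreating the congestion that caused the loss.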

18.2.3 Connection State Management and Lab

Build a complete reliable transport system with hands-on implementation.

Topics covered:

  • Connection state machine design (DISCONNECTED, CONNECTING, CONNECTED, DISCONNECTING)
  • Keep-alive mechanisms for long-lived connections
  • NAT timeout considerations for cellular IoT
  • Comprehensive ESP32 lab implementing all five reliability pillars
  • Statistics tracking and performance measurement

Key concept: State machines prevent resource leaks and handle unexpected disconnections gracefully.
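As a sketch of the state-machine idea (the event names here are hypothetical), a transition table makes illegal transitions impossible by simply ignoring events that are not valid in the current state:

```python
# Transition table for the four states named above; events are illustrative.
TRANSITIONS = {
    ("DISCONNECTED", "connect_request"): "CONNECTING",
    ("CONNECTING", "ack_received"): "CONNECTED",
    ("CONNECTING", "timeout"): "DISCONNECTED",
    ("CONNECTED", "close_request"): "DISCONNECTING",
    ("CONNECTED", "keepalive_missed"): "DISCONNECTED",
    ("DISCONNECTING", "close_ack"): "DISCONNECTED",
}

def step(state: str, event: str) -> str:
    """Advance the connection state machine. Unknown (state, event) pairs
    leave the state unchanged, preventing illegal transitions such as
    reconnecting while already connected."""
    return TRANSITIONS.get((state, event), state)
```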


18.3 The Five Pillars of IoT Reliability

[Figure: four-layer reliability stack. Bottom layer: error detection via CRC and checksums. Second layer: loss recovery via ACK and retransmission. Third layer: flow control via sequence numbers and sliding window. Top layer: connection management via state machines and keep-alive. Together the layers provide end-to-end reliable delivery over unreliable wireless IoT links.]
Figure 18.1: Layered reliability architecture showing error detection (bottom), loss recovery, flow control, and connection management (top) working together for end-to-end reliability.
| Pillar | Mechanism | Failure Addressed | Overhead | Chapter |
|---|---|---|---|---|
| Detection | CRC, checksums | Bit corruption | 2-4 bytes per packet | Error Detection |
| Identification | Sequence numbers | Loss, duplication, reordering | 2-4 bytes per packet | Retry & Sequencing |
| Confirmation | ACK/NACK | Silent delivery failure | 1 packet per message | Retry & Sequencing |
| Recovery | Timeout + retry | Network loss | Variable (retransmissions) | Retry & Sequencing |
| Adaptation | Exponential backoff | Congestion | Increased latency | Retry & Sequencing |
| State | Connection machine | Session failures | Minimal | Connection & Lab |


What is network reliability? Reliability means ensuring your data arrives at its destination correctly, completely, and in the right order. It is like sending a valuable package: you want tracking, confirmation of delivery, and protection against damage.

Why is this challenging for IoT? Unlike wired Ethernet with 99.99% reliability, wireless IoT networks may lose 5-30% of packets. Sensors may send data while the gateway is rebooting. Radio interference can corrupt bits mid-transmission. Without reliability mechanisms, your smart thermostat might miss the “turn off” command, or your environmental monitor might report corrupted readings.

Key mechanisms covered in this chapter:

| Mechanism | Purpose | Analogy |
|---|---|---|
| Retry with Backoff | Resend lost data without overwhelming the network | Calling back later when the line is busy |
| Acknowledgments | Confirm successful delivery | Delivery receipt signature |
| Checksums/CRC | Detect corrupted data | Package inspection on arrival |
| Sequence Numbers | Detect missing/duplicate/reordered data | Numbered pages in a book |
| Timeouts | Know when to give up waiting | Postal tracking “delivery failed” |
| Connection State | Track communication session status | Phone call: ringing, connected, ended |

18.5 Worked Example: Designing Reliability for a Pipeline Leak Detection System

Scenario: An oil company deploys 400 pressure sensors along a 200 km pipeline. Each sensor sends a 24-byte pressure reading every 30 seconds via LoRaWAN (SF10, ~1% packet loss in normal conditions, up to 15% during heavy rain). A pressure drop of more than 5 bar in 60 seconds indicates a potential leak and must trigger an alarm within 90 seconds. The design team must select reliability mechanisms that balance detection speed against battery life (target: 5 years on a D-cell lithium battery, 19,000 mAh).

Step 1: Determine which reliability pillars are needed

| Pillar | Needed? | Rationale |
|---|---|---|
| Error detection (CRC) | Yes | Corrupted pressure readings could cause false alarms. LoRaWAN already includes CRC-16 at the PHY layer. |
| Sequence numbers | Yes | Must detect missing readings to distinguish “no data received” from “pressure is stable.” A 2-minute gap could mask a leak. |
| Acknowledgments | Conditional | Routine readings: no ACK (Non-Confirmable). Alarm triggers: ACK required (Confirmable). |
| Retry with backoff | Conditional | Only for alarm messages. Routine readings are replaced by the next reading in 30 seconds. |
| Keep-alive | Yes | Gateway must detect silent sensor failure within 5 minutes to dispatch maintenance. |

Step 2: Calculate packet loss impact on leak detection

A leak causes pressure to drop over 2-4 readings (60-120 seconds). The gateway needs at least 2 consecutive readings showing >5 bar drop to confirm a leak (avoiding false alarms from single corrupted readings).

Normal conditions (1% loss): Probability of missing 2 consecutive readings = 0.01 × 0.01 = 0.0001 (0.01%). Detection delay: 0-30 seconds. Acceptable.

Heavy rain (15% loss):

| Scenario | Probability | Detection Delay | Outcome |
|---|---|---|---|
| Both readings arrive | 0.85 × 0.85 = 72.3% | 30-60 sec | Alarm within 90-sec target |
| First lost, second arrives | 0.15 × 0.85 = 12.7% | 60-90 sec | Alarm at boundary of target |
| First arrives, second lost | 0.85 × 0.15 = 12.7% | 60-90 sec | Must wait for third reading |
| Both lost | 0.15 × 0.15 = 2.3% | 90-120 sec | Exceeds 90-sec target |

Risk: In heavy rain, 2.3% chance of exceeding the 90-second alarm target. Mitigation: when the first anomalous reading is detected, switch that sensor to Confirmable mode (ACK required) for the next 5 minutes. Retry timeout: 2 seconds (well within the 90-second window).
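The Step 2 arithmetic can be reproduced with a short helper (the function name is ours, for illustration):

```python
def detection_outcomes(loss):
    """Probability of each delivery outcome for two consecutive critical
    readings, given an independent per-packet loss rate."""
    ok = 1 - loss
    return {
        "both arrive": ok * ok,
        "first lost": loss * ok,
        "second lost": ok * loss,
        "both lost": loss * loss,
    }
```

With `loss = 0.15` this yields the heavy-rain table above (72.3% / 12.7% / 12.7% / 2.3% after rounding), and with `loss = 0.01` the normal-conditions figure of 0.01%.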

Try It: Packet Loss Impact on Leak Detection

Adjust the packet loss rate to see how it affects the probability of missing two consecutive critical readings.


Step 3: Calculate energy cost of reliability mechanisms

Baseline (no reliability, Non-Confirmable only):

| Component | Value |
|---|---|
| TX current (LoRa SF10) | 90 mA |
| TX time per reading | 370 ms |
| Energy per reading | 90 mA × 0.37 s = 33.3 mA·s = 0.00925 mAh |
| Readings per day | 2,880 |
| Daily TX energy | 26.64 mAh |
| Sleep current | 2 µA |
| Daily sleep energy | 0.048 mAh |
| Daily total | 26.69 mAh |
| Battery life | 19,000 / 26.69 = 712 days = 1.95 years |

This falls short of the 5-year target. Must reduce transmission frequency.

Optimized design (adaptive reporting):

| Mode | Interval | Readings/day | Daily TX energy |
|---|---|---|---|
| Normal (pressure stable) | 5 minutes | 288 | 2.66 mAh |
| Alert (pressure changing) | 30 seconds | Up to 600 in 5-hour alert window | 5.55 mAh (worst case: 5-hour sustained alert) |
| Keep-alive heartbeat | 15 minutes | 96 | 0.89 mAh |

Normal day (no alerts): 2.66 + 0.89 + 0.048 = 3.60 mAh/day. Battery life = 19,000 / 3.60 = 5,278 days = 14.5 years. Well above the 5-year target.

Worst case (1 sustained alert event per day lasting 5 hours, e.g., a storm or maintenance window): 2.66 + 5.55 + 0.89 + 0.048 = 9.15 mAh/day. Battery life = 19,000 / 9.15 = 2,077 days = 5.7 years. Meets target even with prolonged daily alerts.
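The Step 3 arithmetic generalizes into a small calculator. This is a sketch using the chapter's LoRa SF10 figures (90 mA TX for 370 ms per packet, 2 µA sleep, 19,000 mAh battery); the function and parameter names are illustrative:

```python
def battery_life_days(readings_per_day, capacity_mah=19_000,
                      tx_ma=90.0, tx_s=0.37, sleep_ma=0.002,
                      keepalives_per_day=0):
    """Estimate battery life in days for a duty-cycled LoRa sensor.
    Each transmission (reading or keep-alive) costs tx_ma * tx_s of charge;
    sleep current is drawn continuously."""
    e_tx = tx_ma * tx_s / 3600                    # mAh per transmission
    daily = (readings_per_day + keepalives_per_day) * e_tx + sleep_ma * 24
    return capacity_mah / daily
```

Plugging in the chapter's numbers reproduces both results: 2,880 readings/day gives about 712 days (1.95 years), while 288 readings plus 96 heartbeats gives about 5,278 days (14.5 years).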

Try It: IoT Sensor Battery Life Calculator

Adjust reporting intervals and battery capacity to see how reliability overhead affects battery life for a LoRa SF10 sensor (90 mA TX, 370 ms per transmission, 2 µA sleep).

Step 4: Sequence number and keep-alive design

| Mechanism | Implementation | Overhead |
|---|---|---|
| Sequence number | 16-bit counter (wraps after 65,536 readings ≈ 22.8 days at 30-sec intervals) | 2 bytes per packet |
| Keep-alive | Heartbeat every 15 min; gateway alerts if 2 consecutive heartbeats missed (30 min silence) | 96 extra packets/day |
| Adaptive ACK | Gateway sends downlink ACK request when anomalous reading detected | 1 extra RX window per alert |
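A wraparound-safe comparison for the 16-bit counter can use serial-number (modular) arithmetic in the spirit of RFC 1982: a sequence number is considered newer if the modular difference is less than half the number space. A minimal sketch:

```python
def seq_newer(a: int, b: int, bits: int = 16) -> bool:
    """True if sequence number `a` is logically newer than `b`, treating
    the counter as modular so the 65,535 -> 0 wrap compares correctly."""
    half = 1 << (bits - 1)
    diff = (a - b) & ((1 << bits) - 1)   # modular difference
    return 0 < diff < half
```

With this test, reading 0 correctly counts as newer than reading 65,535, so the gateway does not misinterpret the wrap as a 22-day-old duplicate.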

Key insight: Reliability in this pipeline system is not one-size-fits-all. Routine pressure readings use fire-and-forget (Non-Confirmable) because the next reading arrives in 5 minutes. But the moment an anomalous reading arrives, the system shifts to confirmed delivery with retries, spending extra energy only when it matters. This adaptive approach achieves 14.5-year battery life in normal operation while still meeting the 90-second alarm deadline in 97.7% of heavy-rain scenarios. The remaining 2.3% risk is mitigated by the Confirmable retry, which adds at most one 2-second retry cycle.


18.6 Knowledge Check

18.7 Summary

Reliable data delivery in IoT requires multiple complementary mechanisms working together. No single mechanism is sufficient on its own:

  1. Error Detection: CRC/checksums catch bit corruption before bad data reaches the application
  2. Sequence Numbers: Identify loss, duplicates, and reordering with a single counter
  3. Acknowledgments: Confirm successful delivery so the sender knows when to stop waiting
  4. Exponential Backoff: Prevent network congestion storms during retries
  5. Connection State: Manage session lifecycle to avoid resource leaks

Key Trade-offs:

  • More reliability = more overhead (bytes, packets, latency, power)
  • Choose the right level for your application (CoAP NON vs CON, MQTT QoS 0/1/2)
  • Adapt mechanisms to network conditions - static configurations waste battery or miss alarms

18.8 Concept Relationships

Builds Upon:

  • Information theory → Error detection codes are systematic redundancy
  • Probability theory → Exponential backoff leverages randomness to avoid collisions
  • State machines → Connection lifecycle management uses formal state transitions

Enables:

  • CoAP Protocol: CON vs NON messages implement selective reliability
  • MQTT QoS: Three quality-of-service levels map to reliability pillars
  • TCP Optimization: Fine-tune TCP’s built-in reliability for IoT

Related Concepts:

  • Automatic Repeat Request (ARQ) protocols formalize retry mechanisms
  • Flow control (TCP sliding window) extends sequence numbering
  • Congestion control uses exponential backoff during network overload

18.9 See Also

Standards:

  • RFC 793: TCP reliability mechanisms (retransmission, sequencing)
  • RFC 7252: CoAP reliability and message types
  • RFC 6298: TCP retransmission timeout computation

Common Pitfalls

A device that reconnects immediately after connection failure and retries 10 times per second generates a “retry storm” when a server goes down. A fleet of 10,000 devices each retrying at 10 req/s generates 100,000 connection attempts per second against the recovering server, preventing it from coming back up. Always implement exponential backoff with jitter: start at 1 second, double each retry, cap at 60–300 seconds, add ±25% random jitter to desynchronize fleet retry timing.

TCP connection refused (network error) and HTTP 503 Service Unavailable (application error) require different responses. Network errors may indicate a temporary outage → retry with backoff. HTTP 4xx errors (Bad Request, Unauthorized) are client errors that cannot be resolved by retrying — fix the request before retrying. HTTP 429 Too Many Requests means back off immediately and honor the Retry-After header. Mapping all errors to “retry” wastes resources and may violate rate limits.

An IoT device that discards sensor readings during connectivity outages creates data gaps that distort analytics. If the outage lasts 4 hours and readings occur every minute, 240 readings are permanently lost. Implement persistent local buffering: store readings in flash with timestamps during outage, batch-upload on reconnection. Size the buffer for the maximum expected outage duration: 7 days × readings/day × bytes/reading must fit in available flash storage.
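The sizing rule from this pitfall is simple enough to encode directly. The defaults below are illustrative, matching the numbers in the paragraph (1-minute readings, 24-byte records, 7-day worst-case outage):

```python
def buffer_bytes_needed(outage_days=7, readings_per_day=1440, reading_bytes=24):
    """Flash buffer size for the worst-case outage: one timestamped record
    per reading, batch-uploaded on reconnection."""
    return outage_days * readings_per_day * reading_bytes
```

At the defaults this comes to 241,920 bytes (about 236 KB), which must fit in the flash region reserved for buffering, with headroom for wear-leveling.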

Error handling code that is never executed in production tests may contain bugs discovered only during actual incidents. Use chaos engineering techniques: inject network failures (iptables DROP rules), introduce artificial API errors (mock server returning 500), simulate sensor failures (disconnect I2C sensor). Verify that each error path: logs the error with sufficient context, triggers the correct recovery action, and does not leave the system in an inconsistent state.

18.10 What’s Next

After understanding the five reliability pillars, choose your next step based on your learning goal:

| Next Chapter | Focus | Why Read It |
|---|---|---|
| Error Detection: CRC and Checksums | CRC-16 vs CRC-32, polynomial division, hardware acceleration | Apply error detection: implement and compare checksum algorithms on a microcontroller |
| Retry Mechanisms and Sequence Numbers | Exponential backoff, jitter, SACK, wraparound arithmetic | Design congestion-safe retry logic and detect loss, duplication, and reordering |
| Connection State Management and Lab | State machines, keep-alive, NAT timeouts, ESP32 implementation | Build a complete reliable transport layer combining all five pillars |
| CoAP Features and Labs | CON vs NON messages, retransmission timing | Evaluate how CoAP maps reliability pillars to protocol-level message types |
| MQTT QoS and Session | QoS 0, 1, 2; persistent sessions | Select the correct MQTT QoS level for different IoT message classes |
| Transport Protocol Optimizations | TCP tuning, buffer sizing, keep-alive timers | Configure TCP reliability parameters for constrained IoT devices |