%%{init: {'theme': 'base', 'themeVariables': {'primaryColor':'#2C3E50','primaryTextColor':'#fff','primaryBorderColor':'#16A085','lineColor':'#E67E22'}}}%%
sequenceDiagram
participant D1 as Device 1
participant D2 as Device 2
participant D3 as Device 3
participant GW as Gateway
Note over D1,GW: Network congested - packets dropping
D1->>GW: Message (lost)
D2->>GW: Message (lost)
D3->>GW: Message (lost)
Note over D1,D3: All devices timeout simultaneously
D1->>GW: Immediate retry
D2->>GW: Immediate retry
D3->>GW: Immediate retry
Note over GW: COLLISION! All retries fail too
D1->>GW: Retry again
D2->>GW: Retry again
D3->>GW: Retry again
Note over D1,GW: Congestion worsens - "retry storm"
744 Retry Mechanisms and Sequence Numbers
By the end of this section, you will be able to:
- Implement exponential backoff: Design congestion-aware retry algorithms with proper timeout calculations
- Apply random jitter: Prevent collision storms by spreading retry attempts across time
- Manage sequence numbers: Use sequence numbers to detect loss, duplication, and reordering
- Handle sequence wraparound: Implement modular arithmetic for 16-bit sequence comparison
- Tune retry parameters: Select appropriate timeout and retry limits for different IoT scenarios
- Analyze retry overhead: Calculate the cost of reliability in terms of latency and power
744.1 Prerequisites
Before diving into this chapter, you should be familiar with:
- Error Detection: Understanding CRC and checksums for detecting corrupted data
- Reliability Overview: The parent chapter introducing all five reliability pillars
- Transport Fundamentals: Basic acknowledgment and timeout concepts
Packet loss is inevitable in IoT networks. Radio interference, network congestion, sleeping receivers, and gateway reboots all cause packets to disappear. Error detection tells you when data is corrupted, but retry mechanisms recover from complete loss. Without proper retry logic, your sensor readings may never reach the cloud, or critical commands may fail to reach actuators.
744.2 The Problem with Naive Retry
When a transmission fails, the obvious solution is to try again immediately. But naive retry can make the problem worse: as the retry-storm sequence diagram at the start of this chapter shows, devices that time out together retry together, collide again, and push the network deeper into congestion.
The solution: exponential backoff with random jitter.
744.3 Exponential Backoff Algorithm
Exponential backoff doubles the wait time after each failure, reducing network load during congestion:
Wait Time = min(Initial_Timeout * 2^retry_count, Max_Timeout) + random_jitter
744.3.1 Backoff Timeline
%%{init: {'theme': 'base', 'themeVariables': {'primaryColor':'#2C3E50','primaryTextColor':'#fff','primaryBorderColor':'#16A085','lineColor':'#E67E22'}}}%%
gantt
title Exponential Backoff Timeline (Initial=500ms, Max=4000ms)
dateFormat X
axisFormat %s
section Attempt 1
Send :a1, 0, 1
Wait 500ms :crit, w1, 1, 6
section Attempt 2
Retry :a2, 6, 7
Wait 1000ms :crit, w2, 7, 17
section Attempt 3
Retry :a3, 17, 18
Wait 2000ms :crit, w3, 18, 38
section Attempt 4
Retry :a4, 38, 39
Wait 4000ms (max) :crit, w4, 39, 79
section Success
ACK Received :done, success, 79, 80
744.3.2 Implementation
#define INITIAL_TIMEOUT_MS 500
#define MAX_TIMEOUT_MS 8000
#define JITTER_PERCENT 25
int calculateBackoff(int retryCount) {
  // Clamp the exponent so the shift cannot overflow a 16-bit int
  if (retryCount > 4) retryCount = 4;  // 500 * 2^4 = 8000 = MAX_TIMEOUT_MS
  // Exponential: initial * 2^retry
  int baseTimeout = INITIAL_TIMEOUT_MS * (1 << retryCount);
  // Cap at maximum
  baseTimeout = min(baseTimeout, (int)MAX_TIMEOUT_MS);
  // Add random jitter (0 to 25% of timeout)
  int jitter = random((baseTimeout * JITTER_PERCENT) / 100);
  return baseTimeout + jitter;
}
744.3.3 Backoff Calculation Examples
| Retry Count | Base Timeout | With Max Cap | With 25% Jitter |
|---|---|---|---|
| 0 | 500ms | 500ms | 500-625ms |
| 1 | 1000ms | 1000ms | 1000-1250ms |
| 2 | 2000ms | 2000ms | 2000-2500ms |
| 3 | 4000ms | 4000ms | 4000-5000ms |
| 4 | 8000ms | 8000ms | 8000-10000ms |
| 5 | 16000ms | 8000ms (capped) | 8000-10000ms |
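The table values can be reproduced with a short host-side check. The sketch below is plain standard C rather than device code (an assumption for convenience); it applies the same formula for retry counts 0-5 and prints the base timeout, the capped value, and the 25% jitter window:
// Host-side check of the backoff schedule (standard C, not Arduino)
#include <stdio.h>

#define INITIAL_TIMEOUT_MS 500
#define MAX_TIMEOUT_MS     8000
#define JITTER_PERCENT     25

int main(void) {
  for (int retry = 0; retry <= 5; retry++) {
    long base   = (long)INITIAL_TIMEOUT_MS << retry;          // 500 * 2^retry
    long capped = base > MAX_TIMEOUT_MS ? MAX_TIMEOUT_MS : base;
    long maxJit = (capped * JITTER_PERCENT) / 100;
    printf("retry %d: base %ldms, capped %ldms, with jitter %ld-%ldms\n",
           retry, base, capped, capped, capped + maxJit);
  }
  return 0;
}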
744.4 Why Random Jitter is Essential
Without jitter, devices that experience simultaneous failures will retry at exactly the same time, causing collision storms:
%%{init: {'theme': 'base', 'themeVariables': {'primaryColor':'#2C3E50','primaryTextColor':'#fff','primaryBorderColor':'#16A085','lineColor':'#16A085','secondaryColor':'#E67E22'}}}%%
flowchart LR
subgraph NoJitter["Without Jitter"]
NJ1["Device A: retry at 1000ms"] --> COLL["COLLISION"]
NJ2["Device B: retry at 1000ms"] --> COLL
NJ3["Device C: retry at 1000ms"] --> COLL
end
subgraph WithJitter["With 25% Jitter"]
WJ1["Device A: retry at 1050ms"] --> SPREAD["Spread across<br/>250ms window"]
WJ2["Device B: retry at 1180ms"] --> SPREAD
WJ3["Device C: retry at 1220ms"] --> SPREAD
end
COLL --> FAIL["All fail again"]
SPREAD --> SUCCESS["Success more likely"]
style COLL fill:#E67E22,stroke:#2C3E50,color:#fff
style SPREAD fill:#16A085,stroke:#2C3E50,color:#fff
style FAIL fill:#E67E22,stroke:#2C3E50,color:#fff
style SUCCESS fill:#16A085,stroke:#2C3E50,color:#fff
744.4.1 Jitter Implementation Patterns
// Pattern 1: Percentage jitter (0 to X% of the timeout)
int jitter1 = random((timeout * 25) / 100);
// Pattern 2: Fixed window jitter
int jitter2 = random(500); // 0-500ms random
// Pattern 3: Decorrelated jitter (inspired by the AWS backoff recommendations)
int jitter3 = random(min(MAX_TIMEOUT_MS, baseTimeout * 3));
// Pattern 4: Equal jitter (half fixed, half random)
int jitter4 = (timeout / 2) + random(timeout / 2);
For IoT applications, typical jitter parameters are (a combined sketch follows this list):
- Percentage jitter: 20-30% of the base timeout
- Minimum jitter: 50-100ms (to spread even short timeouts)
- Seed randomness: use a hardware source (ADC noise, radio noise), not just millis()
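One way to combine these guidelines (percentage jitter with a minimum floor, seeded from hardware noise) is sketched below in Arduino-style C++. Reading an unconnected analog pin for the seed is a common convention, not a requirement of this chapter; the pin A0 and the constants are assumptions:
#define JITTER_PERCENT 25
#define MIN_JITTER_MS  50   // floor so even short timeouts get spread

void setup() {
  // Seed from ADC noise on an unconnected analog pin (A0 is assumed here)
  // instead of millis(), so devices that reboot together still diverge
  randomSeed(analogRead(A0));
}

void loop() {
  // application code
}

// Percentage jitter with a minimum floor
long jitterFor(long timeoutMs) {
  long window = (timeoutMs * JITTER_PERCENT) / 100;
  if (window < MIN_JITTER_MS) window = MIN_JITTER_MS;
  return random(window + 1);   // uniform in 0..window
}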
744.5 Retry Parameter Selection
Different IoT scenarios require different retry parameters:
744.5.1 Parameter Guidelines
| Scenario | Initial Timeout | Max Timeout | Max Retries | Jitter |
|---|---|---|---|---|
| Real-time control | 100-200ms | 1s | 2-3 | 10% |
| Sensor reporting | 500ms-1s | 30s | 3-5 | 25% |
| Firmware update | 2-5s | 60s | 10+ | 30% |
| Low-power sleep | 5-10s | 5min | 3 | 50% |
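These guidelines can also be captured as data so the retry logic stays generic. The struct below is an illustrative sketch; the RetryProfile name and the specific values (which pick a point inside each recommended range) are assumptions, not fixed standards:
#include <stdint.h>

// Illustrative encoding of the parameter guidelines above
typedef struct {
  const char *name;
  uint32_t    initialTimeoutMs;
  uint32_t    maxTimeoutMs;
  uint8_t     maxRetries;
  uint8_t     jitterPercent;
} RetryProfile;

static const RetryProfile kProfiles[] = {
  { "real-time control", 150,    1000,   3,  10 },
  { "sensor reporting",  750,    30000,  5,  25 },
  { "firmware update",   3000,   60000,  10, 30 },
  { "low-power sleep",   7500,   300000, 3,  50 },
};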
744.5.2 Timeout Calculation
The initial timeout should be based on expected round-trip time (RTT) plus margin:
// Timeout = RTT + margin for processing + margin for variability
int timeout = expectedRTT * 2 + 100; // 2x RTT + 100ms buffer
%%{init: {'theme': 'base', 'themeVariables': {'primaryColor':'#2C3E50','primaryTextColor':'#fff','primaryBorderColor':'#16A085','lineColor':'#16A085','secondaryColor':'#E67E22'}}}%%
flowchart LR
subgraph Timeout["Timeout Components"]
TX["TX Time<br/>10ms"] --> PROP["Propagation<br/>50ms"] --> PROC["Processing<br/>20ms"] --> ACK["ACK TX<br/>10ms"] --> RETURN["Return Prop<br/>50ms"]
end
TOTAL["Total RTT: 140ms<br/>Timeout: 280ms + margin"]
TX --> TOTAL
RETURN --> TOTAL
style Timeout fill:#2C3E50,stroke:#16A085,color:#fff
style TOTAL fill:#E67E22,stroke:#2C3E50,color:#fff
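The static rule can be made adaptive by tracking measured round-trip times, which the power-saving tips later in this chapter also recommend. The sketch below keeps an exponentially weighted moving average of RTT, similar in spirit to TCP's smoothed RTT; the 1/8 weight and the 100ms margin are assumptions, not values mandated here:
#include <stdint.h>

// Smoothed RTT estimate, updated on each successful exchange
static int32_t smoothedRttMs = 140;   // start from the 140ms example above

void updateRtt(int32_t measuredRttMs) {
  // EWMA with weight 1/8: srtt = 7/8 * srtt + 1/8 * sample
  smoothedRttMs += (measuredRttMs - smoothedRttMs) / 8;
}

int32_t currentTimeoutMs(void) {
  // Timeout = 2 x smoothed RTT + fixed margin (same shape as the static rule)
  return smoothedRttMs * 2 + 100;
}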
744.6 Sequence Numbers and Loss Detection
Sequence numbers are unique identifiers assigned to each message. They solve three critical problems:
- Loss detection: Gaps in sequence indicate missing packets
- Duplicate detection: The same sequence number received twice identifies a redundant retransmission (usually sent because an ACK was lost), so the data is not processed twice
- Ordering: Receiver can reassemble out-of-order packets correctly
744.6.1 Sequence Number Scenarios
%%{init: {'theme': 'base', 'themeVariables': {'primaryColor':'#2C3E50','primaryTextColor':'#fff','primaryBorderColor':'#16A085','lineColor':'#16A085','secondaryColor':'#E67E22'}}}%%
graph TB
subgraph "Normal Delivery"
N1["SEQ 1"] --> N2["SEQ 2"] --> N3["SEQ 3"]
N1 -.->|"Receiver sees: 1,2,3"| OK1["In Order - Accept all"]
end
subgraph "Packet Loss"
L1["SEQ 1"] --> L2["SEQ 2 LOST"] --> L3["SEQ 3"]
L1 -.->|"Receiver sees: 1,_,3"| LOSS["Gap Detected!<br/>Request SEQ 2 retransmit"]
end
subgraph "Duplicate (ACK lost)"
D1["SEQ 1"] --> D2["SEQ 2"] --> D3["SEQ 2 (retry)"]
D1 -.->|"Receiver sees: 1,2,2"| DUP["Duplicate!<br/>ACK again, discard data"]
end
subgraph "Reordering"
R1["SEQ 1"] --> R2["SEQ 3"] --> R3["SEQ 2"]
R1 -.->|"Receiver sees: 1,3,2"| REORD["Buffer until in-order<br/>Deliver as 1,2,3"]
end
style OK1 fill:#16A085,stroke:#2C3E50,color:#fff
style LOSS fill:#E67E22,stroke:#2C3E50,color:#fff
style DUP fill:#E67E22,stroke:#2C3E50,color:#fff
style REORD fill:#16A085,stroke:#2C3E50,color:#fff
744.6.2 Implementing Sequence Numbers
// Sender side
uint16_t nextSequenceNum = 0;
void sendMessage(const char* data) {
  Message msg;
  msg.sequenceNum = nextSequenceNum++;
  msg.payload = data;
  transmit(&msg);
}
// Receiver side
uint16_t expectedSequenceNum = 0;
void handleMessage(Message* msg) {
  if (msg->sequenceNum == expectedSequenceNum) {
    // Normal in-order delivery
    processData(msg->payload);
    expectedSequenceNum++;
    sendAck(msg->sequenceNum);
  } else if (msg->sequenceNum < expectedSequenceNum) {
    // Duplicate - ACK but don't process again
    // (naive comparison; see 744.7 for the wraparound-safe version)
    sendAck(msg->sequenceNum);
  } else {
    // Gap detected - request missing packets
    for (uint16_t seq = expectedSequenceNum; seq < msg->sequenceNum; seq++) {
      requestRetransmit(seq);
    }
    bufferForLater(msg);
  }
}
744.7 Sequence Number Wraparound
With a 16-bit sequence number (0-65535), what happens after message 65535? The sequence "wraps around" to 0. Receivers must handle this using modular arithmetic:
744.7.1 The Wraparound Problem
Sequence progression: 65533, 65534, 65535, 0, 1, 2, ...
Naive comparison fails:
// WRONG! This breaks at wraparound
if (received_seq > expected_seq) {
  // Handle out of order
}
// When expected=65535 and received=0, the test 0 > 65535 is false, so the
// newer packet (0) is wrongly treated as older than 65535
744.7.2 Correct Wraparound Handling
// Check if seq_a comes before seq_b (handling wraparound)
bool isSequenceBefore(uint16_t seq_a, uint16_t seq_b) {
  // Interpret difference as signed value
  return (int16_t)(seq_a - seq_b) < 0;
}
// Example: seq_a=65535, seq_b=0
// 65535 - 0 = 65535, interpreted as signed = -1
// -1 < 0, so 65535 "comes before" 0 (correct!)
This approach works correctly as long as sequence numbers don't differ by more than 32767 (half the range). For high-throughput applications, use 32-bit sequence numbers.
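Applied to the receiver sketch in 744.6.2, this comparison replaces the naive less-than test so duplicate and gap detection keep working across the 65535 to 0 boundary. A minimal sketch of the wraparound-safe classification:
#include <stdint.h>
#include <stdbool.h>

// Same comparison as above, repeated so this sketch stands alone
bool isSequenceBefore(uint16_t seq_a, uint16_t seq_b) {
  return (int16_t)(seq_a - seq_b) < 0;
}

// Wraparound-safe version of the receiver's duplicate/gap test from 744.6.2.
// The branch bodies are left as comments; they map onto processData(),
// sendAck(), and requestRetransmit() in that earlier sketch.
void classifySequence(uint16_t received, uint16_t expected) {
  if (received == expected) {
    // in-order: process, ACK, advance expected
  } else if (isSequenceBefore(received, expected)) {
    // duplicate of already-delivered data: ACK again, don't reprocess
  } else {
    // future packet: gap from expected .. received-1, request retransmits
  }
}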
744.8 Selective Acknowledgment (SACK)
For high-throughput applications, acknowledging every packet is inefficient. Selective Acknowledgment allows the receiver to acknowledge multiple packets at once and specifically identify gaps:
%%{init: {'theme': 'base', 'themeVariables': {'primaryColor':'#2C3E50','primaryTextColor':'#fff','primaryBorderColor':'#16A085','lineColor':'#16A085','secondaryColor':'#E67E22'}}}%%
sequenceDiagram
participant S as Sender
participant R as Receiver
Note over S,R: Without SACK (Go-Back-N)
S->>R: SEQ 1
S->>R: SEQ 2 (lost)
S->>R: SEQ 3
S->>R: SEQ 4
R-->>S: ACK 1, NACK 2
Note over S: Must retransmit 2,3,4
Note over S,R: With SACK
S->>R: SEQ 1
S->>R: SEQ 2 (lost)
S->>R: SEQ 3
S->>R: SEQ 4
R-->>S: SACK: received 1,3,4; missing 2
Note over S: Only retransmit 2
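One lightweight way to carry this information is a bitmap anchored at the highest in-order sequence number. The layout and function names below are an illustrative sketch, not a standard wire format:
#include <stdint.h>
#include <stdbool.h>

// Illustrative SACK report: everything up to cumulativeAck was received
// in order; bit i of receivedMask covers sequence cumulativeAck + 1 + i.
typedef struct {
  uint16_t cumulativeAck;   // highest in-order sequence number received
  uint32_t receivedMask;    // bitmap for the next 32 sequence numbers
} SackReport;

// Receiver side: record a newly arrived sequence number
void sackMark(SackReport *r, uint16_t seq) {
  int16_t delta = (int16_t)(seq - r->cumulativeAck);   // wraparound-safe
  if (delta <= 0 || delta > 32) return;                // old, or too far ahead to track
  r->receivedMask |= (1UL << (delta - 1));
  // Slide the cumulative ACK forward over any now-contiguous packets
  while (r->receivedMask & 1UL) {
    r->cumulativeAck++;
    r->receivedMask >>= 1;
  }
}

// Sender side: seq needs retransmission if it is past the cumulative ACK
// and its bit is not set in the reported mask
bool sackNeedsRetransmit(const SackReport *r, uint16_t seq) {
  int16_t delta = (int16_t)(seq - r->cumulativeAck);
  if (delta <= 0) return false;   // already covered by the cumulative ACK
  if (delta > 32) return true;    // beyond the reported window
  return (r->receivedMask & (1UL << (delta - 1))) == 0;
}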
744.9 Retry Overhead Analysis
Reliability has costs. Understanding the overhead helps you choose appropriate parameters:
744.9.1 Time Overhead
Total delivery time = RTT + (retries * average_backoff)
Example (20% loss in each direction, so a message-plus-ACK round trip succeeds about 0.8 * 0.8 = 64% of the time; initial=500ms, 3 retries):
- ~64% succeed on the first try: 140ms RTT
- ~26% succeed on the second try: 140ms + 500ms = 640ms
- ~8% succeed on the third try: 140ms + 500ms + 1000ms = 1640ms
- ~2% succeed on the fourth try: 140ms + 500ms + 1000ms + 2000ms = 3640ms
Weighted average: 0.64*140 + 0.26*640 + 0.08*1640 + 0.02*3640 = 460ms
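The same weighted average can be computed mechanically, which makes it easy to experiment with other loss rates or backoff schedules. A small standard-C check using the figures above:
#include <stdio.h>

int main(void) {
  // Probability of succeeding on attempt 1..4 and the corresponding total
  // delivery time (RTT plus accumulated backoff), from the example above
  const double prob[]    = { 0.64, 0.26, 0.08, 0.02 };
  const double delayMs[] = { 140,  640,  1640, 3640 };

  double expected = 0.0;
  for (int i = 0; i < 4; i++) {
    expected += prob[i] * delayMs[i];
  }
  printf("Expected delivery time: %.0f ms\n", expected);   // prints 460 ms
  return 0;
}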
744.9.2 Power Overhead
| Component | Current | Duration | Energy (figures assume a ~1V supply) |
|---|---|---|---|
| TX attempt | 100mA | 50ms | 5mJ |
| Wait (sleep) | 10uA | 500ms | 0.005mJ |
| RX window | 20mA | 100ms | 2mJ |
With 20% loss and average 0.25 retries:
Energy per message = (1 + 0.25) * (5 + 0.005 + 2) ≈ 8.76mJ
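The calculation generalizes: with loss probability p per attempt (and ignoring the retry cap), the expected number of transmissions per delivered message is 1/(1 - p), so the energy cost scales by that factor. A short sketch using the table values, with the same implicit ~1V supply assumption as the table:
#include <stdio.h>

int main(void) {
  // Per-attempt energy from the table above (TX + sleep during backoff + RX window)
  const double attemptEnergyMj = 5.0 + 0.005 + 2.0;   // 7.005 mJ
  const double lossRate        = 0.20;                 // 20% loss per attempt

  // Expected transmissions per delivered message = 1 / (1 - p);
  // at p = 0.2 that is 1.25, i.e. 0.25 retries on average
  double expectedAttempts = 1.0 / (1.0 - lossRate);
  double energyPerMessage = expectedAttempts * attemptEnergyMj;

  printf("Expected energy per message: %.2f mJ\n", energyPerMessage);  // ~8.76 mJ
  return 0;
}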
To minimize retry power cost:
1. Use hardware CRC to avoid retransmits caused by corruption
2. Increase TX power slightly to reduce the loss rate
3. Sleep during backoff rather than busy-waiting
4. Use an adaptive timeout based on recent RTT measurements
744.10 Knowledge Check
Question: With initial timeout 500ms and max timeout 8000ms, what is the base timeout on the 4th retry (retry count = 3)?
- A) 1500ms
- B) 2000ms
- C) 4000ms
- D) 8000ms
Click for answer
Answer: C) 4000ms
Timeout = min(500 * 2^3, 8000) = min(500 * 8, 8000) = min(4000, 8000) = 4000ms
On the 5th retry (count=4): min(500 * 16, 8000) = min(8000, 8000) = 8000ms (hits max)
Question: What is the PRIMARY purpose of adding random jitter to retry timeouts?
- A) To make debugging easier by adding timestamps
- B) To prevent multiple devices from retrying at exactly the same time
- C) To compensate for clock drift between sender and receiver
- D) To make the protocol more secure against timing attacks
Click for answer
Answer: B) To prevent multiple devices from retrying at exactly the same time
Without jitter, devices experiencing simultaneous failures (e.g., from a network outage) would all retry at identical times, causing collision storms that worsen congestion. Jitter spreads retry attempts across a time window.
Question: A sender transmits SEQ=5, receives no ACK (lost), retransmits SEQ=5, and the receiver gets both copies. What should the receiver do?
- A) Process both messages normally
- B) Discard the first, process the second
- C) Process the first, discard the second, ACK both
- D) Send NACK for both
Click for answer
Answer: C) Process the first, discard the second, ACK both
The receiver should ACK both copies (in case the first ACK was lost) but only process the data once. The sequence number identifies the second copy as a duplicate of already-processed data.
Question: With 16-bit sequence numbers, what value follows 65535?
- A) 65536
- B) 0
- C) 1
- D) Error - sequence exhausted
Click for answer
Answer: B) 0
16-bit sequence numbers wrap around from 65535 to 0. This is normal operation, not an error. Receivers must use modular arithmetic (signed comparison of differences) to correctly order packets across the wraparound boundary.
744.11 Summary
This chapter covered retry mechanisms and sequence number management:
Exponential Backoff:
- Doubles wait time after each failure
- Reduces network load during congestion
- Requires a maximum timeout cap
Random Jitter:
- Spreads retry attempts to prevent collisions
- Typically 20-30% of the base timeout
- Essential for multi-device networks
Sequence Numbers:
- Detect loss (gaps), duplicates (repeats), and reordering
- Handle wraparound with signed arithmetic
- Enable selective acknowledgment (SACK)
Key Trade-offs:
| More Retries | Fewer Retries |
|---|---|
| Higher delivery rate | Lower latency |
| More power consumption | Less network load |
| Longer worst-case delay | May fail faster |
What is Next: The Connection State Lab chapter provides hands-on implementation of all reliability concepts.
- Reliability Overview: Parent chapter with all five reliability pillars
- Error Detection: CRC and checksums for corruption detection
- Connection State Lab: Hands-on implementation of all concepts
- Transport Optimizations: Advanced reliability tuning