%%{init: {'theme': 'base', 'themeVariables': {'primaryColor':'#2C3E50','primaryTextColor':'#fff','primaryBorderColor':'#16A085','lineColor':'#E67E22'}}}%%
sequenceDiagram
participant D1 as Device 1
participant D2 as Device 2
participant D3 as Device 3
participant GW as Gateway
Note over D1,GW: Network congested - packets dropping
D1->>GW: Message (lost)
D2->>GW: Message (lost)
D3->>GW: Message (lost)
Note over D1,D3: All devices timeout simultaneously
D1->>GW: Immediate retry
D2->>GW: Immediate retry
D3->>GW: Immediate retry
Note over GW: COLLISION! All retries fail too
D1->>GW: Retry again
D2->>GW: Retry again
D3->>GW: Retry again
Note over D1,GW: Congestion worsens - "retry storm"
744 Retry Mechanisms and Sequence Numbers
By the end of this section, you will be able to:
- Implement exponential backoff: Design congestion-aware retry algorithms with proper timeout calculations
- Apply random jitter: Prevent collision storms by spreading retry attempts across time
- Manage sequence numbers: Use sequence numbers to detect loss, duplication, and reordering
- Handle sequence wraparound: Implement modular arithmetic for 16-bit sequence comparison
- Tune retry parameters: Select appropriate timeout and retry limits for different IoT scenarios
- Analyze retry overhead: Calculate the cost of reliability in terms of latency and power
744.1 Prerequisites
Before diving into this chapter, you should be familiar with:
- Error Detection: Understanding CRC and checksums for detecting corrupted data
- Reliability Overview: The parent chapter introducing all five reliability pillars
- Transport Fundamentals: Basic acknowledgment and timeout concepts
Packet loss is inevitable in IoT networks. Radio interference, network congestion, sleeping receivers, and gateway reboots all cause packets to disappear. Error detection tells you when data is corrupted, but retry mechanisms recover from complete loss. Without proper retry logic, your sensor readings may never reach the cloud, or critical commands may fail to reach actuators.
744.2 The Problem with Naive Retry
When a transmission fails, the obvious solution is to try again immediately. But naive retry can make the problem worse: as the retry-storm sequence diagram at the start of this chapter shows, devices that time out together retry together, collide again, and push the network deeper into congestion.
The solution: exponential backoff with random jitter.
744.3 Exponential Backoff Algorithm
Exponential backoff doubles the wait time after each failure, reducing network load during congestion:
Wait Time = min(Initial_Timeout * 2^retry_count, Max_Timeout) + random_jitter
744.3.1 Backoff Timeline
%%{init: {'theme': 'base', 'themeVariables': {'primaryColor':'#2C3E50','primaryTextColor':'#fff','primaryBorderColor':'#16A085','lineColor':'#E67E22'}}}%%
gantt
title Exponential Backoff Timeline (Initial=500ms, Max=4000ms)
dateFormat X
axisFormat %s
section Attempt 1
Send :a1, 0, 1
Wait 500ms :crit, w1, 1, 6
section Attempt 2
Retry :a2, 6, 7
Wait 1000ms :crit, w2, 7, 17
section Attempt 3
Retry :a3, 17, 18
Wait 2000ms :crit, w3, 18, 38
section Attempt 4
Retry :a4, 38, 39
Wait 4000ms (max) :crit, w4, 39, 79
section Success
ACK Received :done, success, 79, 80
744.3.2 Implementation
#define INITIAL_TIMEOUT_MS 500
#define MAX_TIMEOUT_MS 8000
#define JITTER_PERCENT 25
int calculateBackoff(int retryCount) {
  // Clamp the exponent so the shift cannot overflow a 16-bit int
  if (retryCount > 4) retryCount = 4;  // 500 * 2^4 = 8000 = MAX_TIMEOUT_MS
  // Exponential: initial * 2^retry
  int baseTimeout = INITIAL_TIMEOUT_MS * (1 << retryCount);
  // Cap at maximum
  baseTimeout = min(baseTimeout, (int)MAX_TIMEOUT_MS);
  // Add random jitter (0 to 25% of timeout)
  int jitter = random((baseTimeout * JITTER_PERCENT) / 100);
  return baseTimeout + jitter;
}
744.3.3 Backoff Calculation Examples
| Retry Count | Base Timeout | With Max Cap | With 25% Jitter |
|---|---|---|---|
| 0 | 500ms | 500ms | 500-625ms |
| 1 | 1000ms | 1000ms | 1000-1250ms |
| 2 | 2000ms | 2000ms | 2000-2500ms |
| 3 | 4000ms | 4000ms | 4000-5000ms |
| 4 | 8000ms | 8000ms | 8000-10000ms |
| 5 | 16000ms | 8000ms (capped) | 8000-10000ms |
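The table values can be reproduced with a short host-side check. The sketch below is plain standard C rather than device code (an assumption for convenience); it applies the same formula for retry counts 0-5 and prints the base timeout, the capped value, and the 25% jitter window:
// Host-side check of the backoff schedule (standard C, not Arduino)
#include <stdio.h>

#define INITIAL_TIMEOUT_MS 500
#define MAX_TIMEOUT_MS     8000
#define JITTER_PERCENT     25

int main(void) {
  for (int retry = 0; retry <= 5; retry++) {
    long base   = (long)INITIAL_TIMEOUT_MS << retry;          // 500 * 2^retry
    long capped = base > MAX_TIMEOUT_MS ? MAX_TIMEOUT_MS : base;
    long maxJit = (capped * JITTER_PERCENT) / 100;
    printf("retry %d: base %ldms, capped %ldms, with jitter %ld-%ldms\n",
           retry, base, capped, capped, capped + maxJit);
  }
  return 0;
}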
744.4 Why Random Jitter is Essential
Without jitter, devices that experience simultaneous failures will retry at exactly the same time, causing collision storms:
%%{init: {'theme': 'base', 'themeVariables': {'primaryColor':'#2C3E50','primaryTextColor':'#fff','primaryBorderColor':'#16A085','lineColor':'#16A085','secondaryColor':'#E67E22'}}}%%
flowchart LR
subgraph NoJitter["Without Jitter"]
NJ1["Device A: retry at 1000ms"] --> COLL["COLLISION"]
NJ2["Device B: retry at 1000ms"] --> COLL
NJ3["Device C: retry at 1000ms"] --> COLL
end
subgraph WithJitter["With 25% Jitter"]
WJ1["Device A: retry at 1050ms"] --> SPREAD["Spread across<br/>250ms window"]
WJ2["Device B: retry at 1180ms"] --> SPREAD
WJ3["Device C: retry at 1220ms"] --> SPREAD
end
COLL --> FAIL["All fail again"]
SPREAD --> SUCCESS["Success more likely"]
style COLL fill:#E67E22,stroke:#2C3E50,color:#fff
style SPREAD fill:#16A085,stroke:#2C3E50,color:#fff
style FAIL fill:#E67E22,stroke:#2C3E50,color:#fff
style SUCCESS fill:#16A085,stroke:#2C3E50,color:#fff
744.4.1 Jitter Implementation Patterns
// Pattern 1: Percentage jitter (0 to X% of the timeout)
int jitter1 = random((timeout * 25) / 100);
// Pattern 2: Fixed window jitter
int jitter2 = random(500); // 0-500ms random
// Pattern 3: Decorrelated jitter (inspired by the AWS backoff recommendations)
int jitter3 = random(min(MAX_TIMEOUT_MS, baseTimeout * 3));
// Pattern 4: Equal jitter (half fixed, half random)
int jitter4 = (timeout / 2) + random(timeout / 2);
For IoT applications, typical jitter parameters are (a combined sketch follows this list):
- Percentage jitter: 20-30% of the base timeout
- Minimum jitter: 50-100ms (to spread even short timeouts)
- Seed randomness: use a hardware source (ADC noise, radio noise), not just millis()
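One way to combine these guidelines (percentage jitter with a minimum floor, seeded from hardware noise) is sketched below in Arduino-style C++. Reading an unconnected analog pin for the seed is a common convention, not a requirement of this chapter; the pin A0 and the constants are assumptions:
#define JITTER_PERCENT 25
#define MIN_JITTER_MS  50   // floor so even short timeouts get spread

void setup() {
  // Seed from ADC noise on an unconnected analog pin (A0 is assumed here)
  // instead of millis(), so devices that reboot together still diverge
  randomSeed(analogRead(A0));
}

void loop() {
  // application code
}

// Percentage jitter with a minimum floor
long jitterFor(long timeoutMs) {
  long window = (timeoutMs * JITTER_PERCENT) / 100;
  if (window < MIN_JITTER_MS) window = MIN_JITTER_MS;
  return random(window + 1);   // uniform in 0..window
}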
744.5 Retry Parameter Selection
Different IoT scenarios require different retry parameters:
744.5.1 Parameter Guidelines
| Scenario | Initial Timeout | Max Timeout | Max Retries | Jitter |
|---|---|---|---|---|
| Real-time control | 100-200ms | 1s | 2-3 | 10% |
| Sensor reporting | 500ms-1s | 30s | 3-5 | 25% |
| Firmware update | 2-5s | 60s | 10+ | 30% |
| Low-power sleep | 5-10s | 5min | 3 | 50% |
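These guidelines can also be captured as data so the retry logic stays generic. The struct below is an illustrative sketch; the RetryProfile name and the specific values (which pick a point inside each recommended range) are assumptions, not fixed standards:
#include <stdint.h>

// Illustrative encoding of the parameter guidelines above
typedef struct {
  const char *name;
  uint32_t    initialTimeoutMs;
  uint32_t    maxTimeoutMs;
  uint8_t     maxRetries;
  uint8_t     jitterPercent;
} RetryProfile;

static const RetryProfile kProfiles[] = {
  { "real-time control", 150,    1000,   3,  10 },
  { "sensor reporting",  750,    30000,  5,  25 },
  { "firmware update",   3000,   60000,  10, 30 },
  { "low-power sleep",   7500,   300000, 3,  50 },
};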
744.5.2 Timeout Calculation
The initial timeout should be based on expected round-trip time (RTT) plus margin:
// Timeout = RTT + margin for processing + margin for variability
int timeout = expectedRTT * 2 + 100; // 2x RTT + 100ms buffer
%%{init: {'theme': 'base', 'themeVariables': {'primaryColor':'#2C3E50','primaryTextColor':'#fff','primaryBorderColor':'#16A085','lineColor':'#16A085','secondaryColor':'#E67E22'}}}%%
flowchart LR
subgraph Timeout["Timeout Components"]
TX["TX Time<br/>10ms"] --> PROP["Propagation<br/>50ms"] --> PROC["Processing<br/>20ms"] --> ACK["ACK TX<br/>10ms"] --> RETURN["Return Prop<br/>50ms"]
end
TOTAL["Total RTT: 140ms<br/>Timeout: 280ms + margin"]
TX --> TOTAL
RETURN --> TOTAL
style Timeout fill:#2C3E50,stroke:#16A085,color:#fff
style TOTAL fill:#E67E22,stroke:#2C3E50,color:#fff
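The static rule can be made adaptive by tracking measured round-trip times, which the power-saving tips later in this chapter also recommend. The sketch below keeps an exponentially weighted moving average of RTT, similar in spirit to TCP's smoothed RTT; the 1/8 weight and the 100ms margin are assumptions, not values mandated here:
#include <stdint.h>

// Smoothed RTT estimate, updated on each successful exchange
static int32_t smoothedRttMs = 140;   // start from the 140ms example above

void updateRtt(int32_t measuredRttMs) {
  // EWMA with weight 1/8: srtt = 7/8 * srtt + 1/8 * sample
  smoothedRttMs += (measuredRttMs - smoothedRttMs) / 8;
}

int32_t currentTimeoutMs(void) {
  // Timeout = 2 x smoothed RTT + fixed margin (same shape as the static rule)
  return smoothedRttMs * 2 + 100;
}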
744.6 Sequence Numbers and Loss Detection
Sequence numbers are unique identifiers assigned to each message. They solve three critical problems:
- Loss detection: Gaps in sequence indicate missing packets
- Duplicate detection: The same sequence number received twice identifies a redundant retransmission (usually sent because an ACK was lost), so the data is not processed twice
- Ordering: Receiver can reassemble out-of-order packets correctly
744.6.1 Sequence Number Scenarios
%%{init: {'theme': 'base', 'themeVariables': {'primaryColor':'#2C3E50','primaryTextColor':'#fff','primaryBorderColor':'#16A085','lineColor':'#16A085','secondaryColor':'#E67E22'}}}%%
graph TB
subgraph "Normal Delivery"
N1["SEQ 1"] --> N2["SEQ 2"] --> N3["SEQ 3"]
N1 -.->|"Receiver sees: 1,2,3"| OK1["In Order - Accept all"]
end
subgraph "Packet Loss"
L1["SEQ 1"] --> L2["SEQ 2 LOST"] --> L3["SEQ 3"]
L1 -.->|"Receiver sees: 1,_,3"| LOSS["Gap Detected!<br/>Request SEQ 2 retransmit"]
end
subgraph "Duplicate (ACK lost)"
D1["SEQ 1"] --> D2["SEQ 2"] --> D3["SEQ 2 (retry)"]
D1 -.->|"Receiver sees: 1,2,2"| DUP["Duplicate!<br/>ACK again, discard data"]
end
subgraph "Reordering"
R1["SEQ 1"] --> R2["SEQ 3"] --> R3["SEQ 2"]
R1 -.->|"Receiver sees: 1,3,2"| REORD["Buffer until in-order<br/>Deliver as 1,2,3"]
end
style OK1 fill:#16A085,stroke:#2C3E50,color:#fff
style LOSS fill:#E67E22,stroke:#2C3E50,color:#fff
style DUP fill:#E67E22,stroke:#2C3E50,color:#fff
style REORD fill:#16A085,stroke:#2C3E50,color:#fff
744.6.2 Implementing Sequence Numbers
// Sender side
uint16_t nextSequenceNum = 0;
void sendMessage(const char* data) {
  Message msg;
  msg.sequenceNum = nextSequenceNum++;
  msg.payload = data;
  transmit(&msg);
}
// Receiver side
uint16_t expectedSequenceNum = 0;
void handleMessage(Message* msg) {
  if (msg->sequenceNum == expectedSequenceNum) {
    // Normal in-order delivery
    processData(msg->payload);
    expectedSequenceNum++;
    sendAck(msg->sequenceNum);
  } else if (msg->sequenceNum < expectedSequenceNum) {
    // Duplicate - ACK but don't process again
    // (naive comparison; see 744.7 for the wraparound-safe version)
    sendAck(msg->sequenceNum);
  } else {
    // Gap detected - request missing packets
    for (uint16_t seq = expectedSequenceNum; seq < msg->sequenceNum; seq++) {
      requestRetransmit(seq);
    }
    bufferForLater(msg);
  }
}
744.7 Sequence Number Wraparound
With a 16-bit sequence number (0-65535), what happens after message 65535? The sequence "wraps around" to 0. Receivers must handle this using modular arithmetic:
744.7.1 The Wraparound Problem
Sequence progression: 65533, 65534, 65535, 0, 1, 2, ...
Naive comparison fails:
// WRONG! This breaks at wraparound
if (received_seq > expected_seq) {
  // Handle out of order
}
// When expected=65535 and received=0, the test 0 > 65535 is false, so the
// newer packet (0) is wrongly treated as older than 65535
744.7.2 Correct Wraparound Handling
// Check if seq_a comes before seq_b (handling wraparound)
bool isSequenceBefore(uint16_t seq_a, uint16_t seq_b) {
  // Interpret difference as signed value
  return (int16_t)(seq_a - seq_b) < 0;
}
// Example: seq_a=65535, seq_b=0
// 65535 - 0 = 65535, interpreted as signed = -1
// -1 < 0, so 65535 "comes before" 0 (correct!)
This approach works correctly as long as sequence numbers don't differ by more than 32767 (half the range). For high-throughput applications, use 32-bit sequence numbers.
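Applied to the receiver sketch in 744.6.2, this comparison replaces the naive less-than test so duplicate and gap detection keep working across the 65535 to 0 boundary. A minimal sketch of the wraparound-safe classification:
#include <stdint.h>
#include <stdbool.h>

// Same comparison as above, repeated so this sketch stands alone
bool isSequenceBefore(uint16_t seq_a, uint16_t seq_b) {
  return (int16_t)(seq_a - seq_b) < 0;
}

// Wraparound-safe version of the receiver's duplicate/gap test from 744.6.2.
// The branch bodies are left as comments; they map onto processData(),
// sendAck(), and requestRetransmit() in that earlier sketch.
void classifySequence(uint16_t received, uint16_t expected) {
  if (received == expected) {
    // in-order: process, ACK, advance expected
  } else if (isSequenceBefore(received, expected)) {
    // duplicate of already-delivered data: ACK again, don't reprocess
  } else {
    // future packet: gap from expected .. received-1, request retransmits
  }
}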
744.8 Selective Acknowledgment (SACK)
For high-throughput applications, acknowledging every packet is inefficient. Selective Acknowledgment allows the receiver to acknowledge multiple packets at once and specifically identify gaps:
%%{init: {'theme': 'base', 'themeVariables': {'primaryColor':'#2C3E50','primaryTextColor':'#fff','primaryBorderColor':'#16A085','lineColor':'#16A085','secondaryColor':'#E67E22'}}}%%
sequenceDiagram
participant S as Sender
participant R as Receiver
Note over S,R: Without SACK (Go-Back-N)
S->>R: SEQ 1
S->>R: SEQ 2 (lost)
S->>R: SEQ 3
S->>R: SEQ 4
R-->>S: ACK 1, NACK 2
Note over S: Must retransmit 2,3,4
Note over S,R: With SACK
S->>R: SEQ 1
S->>R: SEQ 2 (lost)
S->>R: SEQ 3
S->>R: SEQ 4
R-->>S: SACK: received 1,3,4; missing 2
Note over S: Only retransmit 2
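One lightweight way to carry this information is a bitmap anchored at the highest in-order sequence number. The layout and function names below are an illustrative sketch, not a standard wire format:
#include <stdint.h>
#include <stdbool.h>

// Illustrative SACK report: everything up to cumulativeAck was received
// in order; bit i of receivedMask covers sequence cumulativeAck + 1 + i.
typedef struct {
  uint16_t cumulativeAck;   // highest in-order sequence number received
  uint32_t receivedMask;    // bitmap for the next 32 sequence numbers
} SackReport;

// Receiver side: record a newly arrived sequence number
void sackMark(SackReport *r, uint16_t seq) {
  int16_t delta = (int16_t)(seq - r->cumulativeAck);   // wraparound-safe
  if (delta <= 0 || delta > 32) return;                // old, or too far ahead to track
  r->receivedMask |= (1UL << (delta - 1));
  // Slide the cumulative ACK forward over any now-contiguous packets
  while (r->receivedMask & 1UL) {
    r->cumulativeAck++;
    r->receivedMask >>= 1;
  }
}

// Sender side: seq needs retransmission if it is past the cumulative ACK
// and its bit is not set in the reported mask
bool sackNeedsRetransmit(const SackReport *r, uint16_t seq) {
  int16_t delta = (int16_t)(seq - r->cumulativeAck);
  if (delta <= 0) return false;   // already covered by the cumulative ACK
  if (delta > 32) return true;    // beyond the reported window
  return (r->receivedMask & (1UL << (delta - 1))) == 0;
}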
744.9 Retry Overhead Analysis
Reliability has costs. Understanding the overhead helps you choose appropriate parameters:
744.9.1 Time Overhead
Total delivery time = RTT + (retries * average_backoff)
Example (20% loss in each direction, so a message-plus-ACK round trip succeeds about 0.8 * 0.8 = 64% of the time; initial=500ms, 3 retries):
- ~64% succeed on the first try: 140ms RTT
- ~26% succeed on the second try: 140ms + 500ms = 640ms
- ~8% succeed on the third try: 140ms + 500ms + 1000ms = 1640ms
- ~2% succeed on the fourth try: 140ms + 500ms + 1000ms + 2000ms = 3640ms
Weighted average: 0.64*140 + 0.26*640 + 0.08*1640 + 0.02*3640 = 460ms
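The same weighted average can be computed mechanically, which makes it easy to experiment with other loss rates or backoff schedules. A small standard-C check using the figures above:
#include <stdio.h>

int main(void) {
  // Probability of succeeding on attempt 1..4 and the corresponding total
  // delivery time (RTT plus accumulated backoff), from the example above
  const double prob[]    = { 0.64, 0.26, 0.08, 0.02 };
  const double delayMs[] = { 140,  640,  1640, 3640 };

  double expected = 0.0;
  for (int i = 0; i < 4; i++) {
    expected += prob[i] * delayMs[i];
  }
  printf("Expected delivery time: %.0f ms\n", expected);   // prints 460 ms
  return 0;
}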
744.9.2 Power Overhead
| Component | Current | Duration | Energy (figures assume a ~1V supply) |
|---|---|---|---|
| TX attempt | 100mA | 50ms | 5mJ |
| Wait (sleep) | 10uA | 500ms | 0.005mJ |
| RX window | 20mA | 100ms | 2mJ |
With 20% loss and average 0.25 retries:
Energy per message = (1 + 0.25) * (5 + 0.005 + 2) ≈ 8.76mJ
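The calculation generalizes: with loss probability p per attempt (and ignoring the retry cap), the expected number of transmissions per delivered message is 1/(1 - p), so the energy cost scales by that factor. A short sketch using the table values, with the same implicit ~1V supply assumption as the table:
#include <stdio.h>

int main(void) {
  // Per-attempt energy from the table above (TX + sleep during backoff + RX window)
  const double attemptEnergyMj = 5.0 + 0.005 + 2.0;   // 7.005 mJ
  const double lossRate        = 0.20;                 // 20% loss per attempt

  // Expected transmissions per delivered message = 1 / (1 - p);
  // at p = 0.2 that is 1.25, i.e. 0.25 retries on average
  double expectedAttempts = 1.0 / (1.0 - lossRate);
  double energyPerMessage = expectedAttempts * attemptEnergyMj;

  printf("Expected energy per message: %.2f mJ\n", energyPerMessage);  // ~8.76 mJ
  return 0;
}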
To minimize retry power cost:
1. Use hardware CRC to avoid retransmits caused by corruption
2. Increase TX power slightly to reduce the loss rate
3. Sleep during backoff rather than busy-waiting
4. Use an adaptive timeout based on recent RTT measurements
744.10 Knowledge Check
Question: With initial timeout 500ms and max timeout 8000ms, what is the base timeout on the 4th retry (retry count = 3)?
- A) 1500ms
- B) 2000ms
- C) 4000ms
- D) 8000ms
Click for answer
Answer: C) 4000ms
Timeout = min(500 * 2^3, 8000) = min(500 * 8, 8000) = min(4000, 8000) = 4000ms
On the 5th retry (count=4): min(500 * 16, 8000) = min(8000, 8000) = 8000ms (hits max)
Question: What is the PRIMARY purpose of adding random jitter to retry timeouts?
- A) To make debugging easier by adding timestamps
- B) To prevent multiple devices from retrying at exactly the same time
- C) To compensate for clock drift between sender and receiver
- D) To make the protocol more secure against timing attacks
Click for answer
Answer: B) To prevent multiple devices from retrying at exactly the same time
Without jitter, devices experiencing simultaneous failures (e.g., from a network outage) would all retry at identical times, causing collision storms that worsen congestion. Jitter spreads retry attempts across a time window.
Question: A sender transmits SEQ=5, receives no ACK (lost), retransmits SEQ=5, and the receiver gets both copies. What should the receiver do?
- A) Process both messages normally
- B) Discard the first, process the second
- C) Process the first, discard the second, ACK both
- D) Send NACK for both
Click for answer
Answer: C) Process the first, discard the second, ACK both
The receiver should ACK both copies (in case the first ACK was lost) but only process the data once. The sequence number identifies the second copy as a duplicate of already-processed data.
Question: With 16-bit sequence numbers, what value follows 65535?
- A) 65536
- B) 0
- C) 1
- D) Error - sequence exhausted
Click for answer
Answer: B) 0
16-bit sequence numbers wrap around from 65535 to 0. This is normal operation, not an error. Receivers must use modular arithmetic (signed comparison of differences) to correctly order packets across the wraparound boundary.
744.11 Summary
This chapter covered retry mechanisms and sequence number management:
Exponential Backoff:
- Doubles wait time after each failure
- Reduces network load during congestion
- Requires a maximum timeout cap
Random Jitter:
- Spreads retry attempts to prevent collisions
- Typically 20-30% of the base timeout
- Essential for multi-device networks
Sequence Numbers:
- Detect loss (gaps), duplicates (repeats), and reordering
- Handle wraparound with signed arithmetic
- Enable selective acknowledgment (SACK)
Key Trade-offs:
| More Retries | Fewer Retries |
|---|---|
| Higher delivery rate | Lower latency |
| More power consumption | Less network load |
| Longer worst-case delay | May fail faster |
What is Next: The Connection State Lab chapter provides hands-on implementation of all reliability concepts.
- Reliability Overview: Parent chapter with all five reliability pillars
- Error Detection: CRC and checksums for corruption detection
- Connection State Lab: Hands-on implementation of all concepts
- Transport Optimizations: Advanced reliability tuning