744  Retry Mechanisms and Sequence Numbers

Note: Learning Objectives

By the end of this section, you will be able to:

  • Implement exponential backoff: Design congestion-aware retry algorithms with proper timeout calculations
  • Apply random jitter: Prevent collision storms by spreading retry attempts across time
  • Manage sequence numbers: Use sequence numbers to detect loss, duplication, and reordering
  • Handle sequence wraparound: Implement modular arithmetic for 16-bit sequence comparison
  • Tune retry parameters: Select appropriate timeout and retry limits for different IoT scenarios
  • Analyze retry overhead: Calculate the cost of reliability in terms of latency and power

744.1 Prerequisites

Before diving into this chapter, you should be familiar with:

Important: Why Retry Mechanisms Matter

Packet loss is inevitable in IoT networks. Radio interference, network congestion, sleeping receivers, and gateway reboots all cause packets to disappear. Error detection tells you when data is corrupted, but retry mechanisms recover from complete loss. Without proper retry logic, your sensor readings may never reach the cloud, or critical commands may fail to reach actuators.


744.2 The Problem with Naive Retry

Time: ~10 min | Level: Intermediate | Unit: P07.REL.U03

When a transmission fails, the obvious solution is to try again immediately. But naive retry can worsen the problem:

%%{init: {'theme': 'base', 'themeVariables': {'primaryColor':'#2C3E50','primaryTextColor':'#fff','primaryBorderColor':'#16A085','lineColor':'#E67E22'}}}%%
sequenceDiagram
    participant D1 as Device 1
    participant D2 as Device 2
    participant D3 as Device 3
    participant GW as Gateway

    Note over D1,GW: Network congested - packets dropping

    D1->>GW: Message (lost)
    D2->>GW: Message (lost)
    D3->>GW: Message (lost)

    Note over D1,D3: All devices timeout simultaneously

    D1->>GW: Immediate retry
    D2->>GW: Immediate retry
    D3->>GW: Immediate retry

    Note over GW: COLLISION! All retries fail too

    D1->>GW: Retry again
    D2->>GW: Retry again
    D3->>GW: Retry again

    Note over D1,GW: Congestion worsens - "retry storm"

Figure 744.1: Naive retry problem: all devices retry simultaneously, causing collision storms that worsen congestion.

The solution: Exponential backoff with random jitter.


744.3 Exponential Backoff Algorithm

Exponential backoff doubles the wait time after each failure, reducing network load during congestion:

Wait Time = min(Initial_Timeout * 2^retry_count, Max_Timeout) + random_jitter

744.3.1 Backoff Timeline

%%{init: {'theme': 'base', 'themeVariables': {'primaryColor':'#2C3E50','primaryTextColor':'#fff','primaryBorderColor':'#16A085','lineColor':'#E67E22'}}}%%
gantt
    title Exponential Backoff Timeline (Initial=500ms, Max=4000ms)
    dateFormat X
    axisFormat %s

    section Attempt 1
    Send    :a1, 0, 1
    Wait 500ms :crit, w1, 1, 6

    section Attempt 2
    Retry   :a2, 6, 7
    Wait 1000ms :crit, w2, 7, 17

    section Attempt 3
    Retry   :a3, 17, 18
    Wait 2000ms :crit, w3, 18, 38

    section Attempt 4
    Retry   :a4, 38, 39
    Wait 4000ms (max) :crit, w4, 39, 79

    section Success
    ACK Received :done, success, 79, 80

Figure 744.2: Exponential backoff timeline: each retry doubles the wait time (500ms, 1000ms, 2000ms, 4000ms) until success or maximum retries.

744.3.2 Implementation

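#define INITIAL_TIMEOUT_MS  500
#define MAX_TIMEOUT_MS      8000
#define JITTER_PERCENT      25

int calculateBackoff(int retryCount) {
    // Exponential: initial * 2^retry. Doubling in a 32-bit long (and stopping
    // at the cap) avoids the overflow that 500 * (1 << retryCount) hits on
    // boards with a 16-bit int or for large retry counts.
    long baseTimeout = INITIAL_TIMEOUT_MS;
    for (int i = 0; i < retryCount && baseTimeout < MAX_TIMEOUT_MS; i++) {
        baseTimeout *= 2;
    }

    // Cap at maximum
    baseTimeout = min(baseTimeout, (long)MAX_TIMEOUT_MS);

    // Add random jitter (0 to 25% of timeout); random(n) returns 0..n-1
    long jitter = random((baseTimeout * JITTER_PERCENT) / 100);

    return (int)(baseTimeout + jitter);
}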
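A minimal sketch of how a backoff calculation like the one above might drive a send loop, written as portable C so it can be run off-device. The names `backoff_base_ms`, `backoff_with_jitter_ms`, `send_with_retry`, `try_send_fn`, `fail_n_times`, and `MAX_RETRIES` are hypothetical, `rand()` stands in for Arduino's `random()`, and the backoff is accumulated in a counter rather than slept so the behaviour can be checked.

```c
#include <stdbool.h>
#include <stdlib.h>

#define INITIAL_TIMEOUT_MS  500L
#define MAX_TIMEOUT_MS      8000L
#define JITTER_PERCENT      25L
#define MAX_RETRIES         5      /* hypothetical limit for this sketch */

/* Hypothetical transport hook: returns true when an ACK arrived in time. */
typedef bool (*try_send_fn)(void *ctx);

/* Base timeout: initial * 2^retry, capped at MAX_TIMEOUT_MS. */
long backoff_base_ms(int retry_count) {
    long t = INITIAL_TIMEOUT_MS;
    for (int i = 0; i < retry_count && t < MAX_TIMEOUT_MS; i++) {
        t *= 2;
    }
    return t < MAX_TIMEOUT_MS ? t : MAX_TIMEOUT_MS;
}

/* Base timeout plus 0..25% random jitter (rand() replaces Arduino random()). */
long backoff_with_jitter_ms(int retry_count) {
    long base = backoff_base_ms(retry_count);
    long window = (base * JITTER_PERCENT) / 100;
    return base + (window > 0 ? rand() % window : 0);
}

/*
 * Try once, then retry up to MAX_RETRIES times.
 * Returns the number of attempts used, or -1 if every attempt failed.
 * *waited_ms accumulates the backoff a real device would sleep for.
 */
int send_with_retry(try_send_fn try_send, void *ctx, long *waited_ms) {
    *waited_ms = 0;
    for (int retry = 0; retry <= MAX_RETRIES; retry++) {
        if (try_send(ctx)) {
            return retry + 1;
        }
        *waited_ms += backoff_with_jitter_ms(retry);
    }
    return -1;
}

/* Example transport stub: fails until the counter in ctx reaches zero. */
bool fail_n_times(void *ctx) {
    int *remaining = (int *)ctx;
    if (*remaining > 0) {
        (*remaining)--;
        return false;
    }
    return true;
}
```

On a real device the accumulation step would be replaced by a sleep or a timer, and `try_send` would transmit and wait for the ACK until the current timeout expires.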

744.3.3 Backoff Calculation Examples

| Retry Count | Base Timeout | With Max Cap (8000ms) | With 25% Jitter |
|---|---|---|---|
| 0 | 500ms | 500ms | 500-625ms |
| 1 | 1000ms | 1000ms | 1000-1250ms |
| 2 | 2000ms | 2000ms | 2000-2500ms |
| 3 | 4000ms | 4000ms | 4000-5000ms |
| 4 | 8000ms | 8000ms | 8000-10000ms |
| 5 | 16000ms | 8000ms (capped) | 8000-10000ms |

744.4 Why Random Jitter is Essential

Without jitter, devices that experience simultaneous failures will retry at exactly the same time, causing collision storms:

%%{init: {'theme': 'base', 'themeVariables': {'primaryColor':'#2C3E50','primaryTextColor':'#fff','primaryBorderColor':'#16A085','lineColor':'#16A085','secondaryColor':'#E67E22'}}}%%
flowchart LR
    subgraph NoJitter["Without Jitter"]
        NJ1["Device A: retry at 1000ms"] --> COLL["COLLISION"]
        NJ2["Device B: retry at 1000ms"] --> COLL
        NJ3["Device C: retry at 1000ms"] --> COLL
    end

    subgraph WithJitter["With 25% Jitter"]
        WJ1["Device A: retry at 1050ms"] --> SPREAD["Spread across<br/>250ms window"]
        WJ2["Device B: retry at 1180ms"] --> SPREAD
        WJ3["Device C: retry at 1220ms"] --> SPREAD
    end

    COLL --> FAIL["All fail again"]
    SPREAD --> SUCCESS["Success more likely"]

    style COLL fill:#E67E22,stroke:#2C3E50,color:#fff
    style SPREAD fill:#16A085,stroke:#2C3E50,color:#fff
    style FAIL fill:#E67E22,stroke:#2C3E50,color:#fff
    style SUCCESS fill:#16A085,stroke:#2C3E50,color:#fff

Figure 744.3: Jitter comparison: without jitter all retries collide; with jitter retries are spread across a time window.
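The effect in the figure can be checked with a small sketch. It counts how many device pairs overlap on the channel for a given set of retry instants. The function names and the 20ms frame airtime are assumptions made for illustration, and `rand()` stands in for a device's random source.

```c
#include <stdlib.h>

/* Count device pairs whose transmissions overlap in time.
 * start_ms[i] is when device i begins sending; each frame occupies
 * airtime_ms on the channel (an assumed, illustrative value). */
int count_colliding_pairs(const long *start_ms, int n, long airtime_ms) {
    int collisions = 0;
    for (int i = 0; i < n; i++) {
        for (int j = i + 1; j < n; j++) {
            if (labs(start_ms[i] - start_ms[j]) < airtime_ms) {
                collisions++;
            }
        }
    }
    return collisions;
}

/* Retry instant for one device: base timeout plus 0..jitter_percent% extra. */
long retry_time_ms(long base_ms, long jitter_percent) {
    long window = (base_ms * jitter_percent) / 100;
    return base_ms + (window > 0 ? rand() % window : 0);
}
```

With the figure's values, three devices retrying at exactly 1000ms give three colliding pairs, while retries at 1050ms, 1180ms, and 1220ms give none for a 20ms frame. Jitter lowers the collision probability but does not remove it: two devices can still draw nearby values.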

744.4.1 Jitter Implementation Patterns

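// Assumed context: timeout is the capped base timeout for this attempt,
// previousWait is the wait used on the previous attempt (both in ms).

// Pattern 1: Percentage jitter (adds 0 to 25% on top of the timeout)
long jitter1 = random((timeout * 25L) / 100);

// Pattern 2: Fixed window jitter (adds 0-499ms regardless of the timeout)
long jitter2 = random(500);

// Pattern 3: Decorrelated jitter (one of the variants in the AWS Architecture
// Blog post on backoff and jitter): the whole wait is random between the
// initial timeout and 3x the previous wait, capped at the maximum
long wait3 = min((long)MAX_TIMEOUT_MS, random(INITIAL_TIMEOUT_MS, previousWait * 3L));

// Pattern 4: Equal jitter (the whole wait is half fixed, half random)
long wait4 = (timeout / 2) + random(timeout / 2);

Tip: Best Practice for Jitter Parameters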

For IoT applications, typical jitter parameters are:

  • Percentage jitter: 20-30% of base timeout
  • Minimum jitter: 50-100ms (to spread even short timeouts)
  • Seed randomness: Use hardware random (ADC noise, radio noise) not just millis()
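These guidelines can be combined in a short sketch: a percentage window with a minimum width, and a seed built from noisy samples. All names here are hypothetical. On a device the samples might come from ADC reads of a floating pin or from radio noise, and `srand()`/`rand()` stand in for Arduino's `randomSeed()`/`random()`.

```c
#include <stdint.h>
#include <stdlib.h>

/* Jitter window = percent of timeout, but never narrower than min_window_ms. */
long jitter_window_ms(long timeout_ms, long percent, long min_window_ms) {
    long window = (timeout_ms * percent) / 100;
    return window > min_window_ms ? window : min_window_ms;
}

/* Random jitter in [0, window); returns 0 if the window is empty. */
long pick_jitter_ms(long timeout_ms, long percent, long min_window_ms) {
    long window = jitter_window_ms(timeout_ms, percent, min_window_ms);
    return window > 0 ? rand() % window : 0;
}

/* Fold the noisy low bit of each sample into a 32-bit seed.
 * The samples are assumed to come from a hardware noise source. */
uint32_t seed_from_samples(const uint16_t *samples, int n) {
    uint32_t seed = 0;
    for (int i = 0; i < n; i++) {
        seed = (seed << 1) | (samples[i] & 1u);
    }
    return seed;
}
```

A 100ms timeout with 25% jitter would spread retries over only 25ms, so the 50ms floor takes over. At 1000ms the percentage window (250ms) is used. Taking 32 samples fills every bit of the seed.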

744.5 Retry Parameter Selection

Different IoT scenarios require different retry parameters:

744.5.1 Parameter Guidelines

| Scenario | Initial Timeout | Max Timeout | Max Retries | Jitter |
|---|---|---|---|---|
| Real-time control | 100-200ms | 1s | 2-3 | 10% |
| Sensor reporting | 500ms-1s | 30s | 3-5 | 25% |
| Firmware update | 2-5s | 60s | 10+ | 30% |
| Low-power sleep | 5-10s | 5min | 3 | 50% |
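One way to carry these guidelines into firmware is a table of profiles. The sketch below is illustrative only: the `RetryProfile` type and function names are hypothetical, and where the guideline gives a range the upper end was picked arbitrarily. `worst_case_wait_ms` sums the capped backoff after every attempt, without jitter, to show how long a device may wait before giving up.

```c
#include <stddef.h>
#include <string.h>

typedef struct {
    const char *scenario;
    long initial_timeout_ms;
    long max_timeout_ms;
    int  max_retries;
    int  jitter_percent;
} RetryProfile;

/* Values taken from the upper end of each guideline range (an assumption). */
static const RetryProfile RETRY_PROFILES[] = {
    { "real-time control", 200,   1000,   3,  10 },
    { "sensor reporting",  1000,  30000,  5,  25 },
    { "firmware update",   5000,  60000,  10, 30 },
    { "low-power sleep",   10000, 300000, 3,  50 },
};

/* Look up a profile by name; returns NULL if the scenario is unknown. */
const RetryProfile *find_profile(const char *scenario) {
    size_t count = sizeof(RETRY_PROFILES) / sizeof(RETRY_PROFILES[0]);
    for (size_t i = 0; i < count; i++) {
        if (strcmp(RETRY_PROFILES[i].scenario, scenario) == 0) {
            return &RETRY_PROFILES[i];
        }
    }
    return NULL;
}

/* Total backoff (no jitter) if the first try and every retry time out. */
long worst_case_wait_ms(const RetryProfile *p) {
    long total = 0;
    long t = p->initial_timeout_ms;
    for (int attempt = 0; attempt <= p->max_retries; attempt++) {
        total += t;
        t = (t * 2 < p->max_timeout_ms) ? t * 2 : p->max_timeout_ms;
    }
    return total;
}
```

Under these assumed values, the real-time control profile gives up after 200 + 400 + 800 + 1000 = 2400ms of waiting, while the sensor reporting profile can wait 61 seconds in total.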

744.5.2 Timeout Calculation

The initial timeout should be based on expected round-trip time (RTT) plus margin:

// Timeout = RTT + margin for processing + margin for variability
int timeout = expectedRTT * 2 + 100;  // 2x RTT + 100ms buffer

%%{init: {'theme': 'base', 'themeVariables': {'primaryColor':'#2C3E50','primaryTextColor':'#fff','primaryBorderColor':'#16A085','lineColor':'#16A085','secondaryColor':'#E67E22'}}}%%
flowchart LR
    subgraph Timeout["Timeout Components"]
        TX["TX Time<br/>10ms"] --> PROP["Propagation<br/>50ms"] --> PROC["Processing<br/>20ms"] --> ACK["ACK TX<br/>10ms"] --> RETURN["Return Prop<br/>50ms"]
    end

    TOTAL["Total RTT: 140ms<br/>Timeout: 280ms + margin"]

    TX --> TOTAL
    RETURN --> TOTAL

    style Timeout fill:#2C3E50,stroke:#16A085,color:#fff
    style TOTAL fill:#E67E22,stroke:#2C3E50,color:#fff

Figure 744.4: Timeout components: transmission, propagation, processing, acknowledgment, and return propagation.
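The fixed rule above (2x RTT plus a 100ms buffer) can sit alongside an estimator that tracks measured RTT. The sketch below is hedged: the names are hypothetical, and the smoothing follows the style of TCP's retransmission timer in RFC 6298 (gains of 1/8 and 1/4, timeout = smoothed RTT + 4 x variation). That is an assumption about what suits an IoT link, not something this chapter prescribes.

```c
typedef struct {
    double srtt_ms;     /* smoothed round-trip time */
    double rttvar_ms;   /* smoothed RTT variation */
    int    has_sample;  /* 0 until the first measurement arrives */
} RttEstimator;

/* Fixed rule from the text: 2x expected RTT plus a 100ms buffer. */
long fixed_timeout_ms(long expected_rtt_ms) {
    return expected_rtt_ms * 2 + 100;
}

/* Feed one measured RTT into the estimator (RFC 6298 style update). */
void rtt_update(RttEstimator *e, double sample_ms) {
    if (!e->has_sample) {
        e->srtt_ms = sample_ms;
        e->rttvar_ms = sample_ms / 2.0;
        e->has_sample = 1;
        return;
    }
    double err = e->srtt_ms - sample_ms;
    if (err < 0) {
        err = -err;
    }
    /* Variation is updated first, using the old smoothed RTT. */
    e->rttvar_ms = 0.75 * e->rttvar_ms + 0.25 * err;
    e->srtt_ms = 0.875 * e->srtt_ms + 0.125 * sample_ms;
}

/* Timeout that widens when RTT is unstable and tightens when it is steady. */
double adaptive_timeout_ms(const RttEstimator *e) {
    return e->srtt_ms + 4.0 * e->rttvar_ms;
}
```

For the figure's components (10 + 50 + 20 + 10 + 50 = 140ms), the fixed rule gives 380ms. The adaptive estimator starts at 420ms after one 140ms sample and narrows to 350ms after a second identical sample.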

744.6 Sequence Numbers and Loss Detection

Time: ~10 min | Level: Intermediate | Unit: P07.REL.U04

Sequence numbers are unique identifiers assigned to each message. They solve three critical problems:

  1. Loss detection: Gaps in sequence indicate missing packets
  2. Duplicate detection: The same sequence number received twice indicates a retransmission of a message that already arrived, typically because its ACK was lost
  3. Ordering: Receiver can reassemble out-of-order packets correctly

744.6.1 Sequence Number Scenarios

%%{init: {'theme': 'base', 'themeVariables': {'primaryColor':'#2C3E50','primaryTextColor':'#fff','primaryBorderColor':'#16A085','lineColor':'#16A085','secondaryColor':'#E67E22'}}}%%
graph TB
    subgraph "Normal Delivery"
        N1["SEQ 1"] --> N2["SEQ 2"] --> N3["SEQ 3"]
        N1 -.->|"Receiver sees: 1,2,3"| OK1["In Order - Accept all"]
    end

    subgraph "Packet Loss"
        L1["SEQ 1"] --> L2["SEQ 2 LOST"] --> L3["SEQ 3"]
        L1 -.->|"Receiver sees: 1,_,3"| LOSS["Gap Detected!<br/>Request SEQ 2 retransmit"]
    end

    subgraph "Duplicate (ACK lost)"
        D1["SEQ 1"] --> D2["SEQ 2"] --> D3["SEQ 2 (retry)"]
        D1 -.->|"Receiver sees: 1,2,2"| DUP["Duplicate!<br/>ACK again, discard data"]
    end

    subgraph "Reordering"
        R1["SEQ 1"] --> R2["SEQ 3"] --> R3["SEQ 2"]
        R1 -.->|"Receiver sees: 1,3,2"| REORD["Buffer until in-order<br/>Deliver as 1,2,3"]
    end

    style OK1 fill:#16A085,stroke:#2C3E50,color:#fff
    style LOSS fill:#E67E22,stroke:#2C3E50,color:#fff
    style DUP fill:#E67E22,stroke:#2C3E50,color:#fff
    style REORD fill:#16A085,stroke:#2C3E50,color:#fff

Figure 744.5: Sequence number scenarios: normal ordered delivery, gap detection for loss, duplicate identification, and reordering correction.

744.6.2 Implementing Sequence Numbers

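// Message layout assumed by this example
struct Message {
    uint16_t sequenceNum;
    const char* payload;
};

// transmit(), processData(), sendAck(), requestRetransmit() and
// bufferForLater() are application-provided and not shown here.

// Sender side
uint16_t nextSequenceNum = 0;

void sendMessage(const char* data) {
    Message msg;
    msg.sequenceNum = nextSequenceNum++;  // uint16_t wraps from 65535 to 0
    msg.payload = data;
    transmit(&msg);
}

// Receiver side
uint16_t expectedSequenceNum = 0;

void handleMessage(Message* msg) {
    if (msg->sequenceNum == expectedSequenceNum) {
        // Normal in-order delivery
        processData(msg->payload);
        expectedSequenceNum++;
        sendAck(msg->sequenceNum);

    } else if (msg->sequenceNum < expectedSequenceNum) {
        // Duplicate - ACK but don't process again.
        // Plain < is only valid before the counter wraps; Section 744.7
        // covers the wrap-safe comparison.
        sendAck(msg->sequenceNum);

    } else {
        // Gap detected - request missing packets
        for (uint16_t seq = expectedSequenceNum; seq != msg->sequenceNum; seq++) {
            requestRetransmit(seq);
        }
        bufferForLater(msg);
    }
}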
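The "buffer until in-order" step from the reordering scenario can be sketched as a small fixed window. Everything here is hypothetical: the `ReorderBuffer` type, the 8-slot `WINDOW`, integer payloads, and a delivery log that stands in for `processData()`. The window size is a power of two so that `seq % WINDOW` stays consistent when the 16-bit counter wraps.

```c
#include <stdbool.h>
#include <stdint.h>

#define WINDOW 8   /* hypothetical: how far ahead the receiver will buffer */

typedef struct {
    uint16_t expected;          /* next sequence number to deliver */
    bool     present[WINDOW];   /* slot filled? indexed by seq % WINDOW */
    int      value[WINDOW];     /* buffered payloads (ints for brevity) */
    int      delivered[32];     /* delivery log, stands in for processData() */
    int      delivered_count;
} ReorderBuffer;

/* Accept one arrival; returns how many payloads were delivered in order. */
int rb_receive(ReorderBuffer *rb, uint16_t seq, int payload) {
    /* Unsigned subtraction gives the forward distance even across wraparound. */
    uint16_t offset = (uint16_t)(seq - rb->expected);
    if (offset >= WINDOW) {
        return 0;  /* already delivered, or too far ahead to buffer: drop */
    }
    rb->present[seq % WINDOW] = true;
    rb->value[seq % WINDOW] = payload;

    /* Deliver every consecutive packet that is now available. */
    int n = 0;
    while (rb->present[rb->expected % WINDOW]) {
        rb->present[rb->expected % WINDOW] = false;
        if (rb->delivered_count < 32) {
            rb->delivered[rb->delivered_count++] = rb->value[rb->expected % WINDOW];
        }
        rb->expected++;
        n++;
    }
    return n;
}
```

Feeding arrivals in the order 1, 3, 2 delivers 1 immediately, holds 3, then releases 2 and 3 together once 2 arrives. A late copy of 2 is dropped because it falls outside the window.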

744.7 Sequence Number Wraparound

With a 16-bit sequence number (0-65535), what happens after message 65535? The sequence "wraps around" to 0. Receivers must handle this using modular arithmetic.

744.7.1 The Wraparound Problem

Sequence progression: 65533, 65534, 65535, 0, 1, 2, ...

Naive comparison fails:

// WRONG! This breaks at wraparound
if (received_seq > expected_seq) {
    // Handle out of order
}
// When expected=65535 and received=0, packet 0 is actually newer, but
// 0 > 65535 is false, so the check wrongly treats it as an old packet

744.7.2 Correct Wraparound Handling

// Check if seq_a comes before seq_b (handling wraparound)
bool isSequenceBefore(uint16_t seq_a, uint16_t seq_b) {
    // Interpret difference as signed value
    return (int16_t)(seq_a - seq_b) < 0;
}

// Example: seq_a=65535, seq_b=0
// 65535 - 0 = 65535, interpreted as signed = -1
// -1 < 0, so 65535 "comes before" 0 (correct!)

Warning: Wraparound Window

This approach works correctly as long as sequence numbers don't differ by more than 32767 (half the range). For high-throughput applications, use 32-bit sequence numbers.
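A sketch of how the signed-difference comparison might be used to classify an incoming packet on the receiver. The `SeqClass` enum and function names are hypothetical. The cast to `int16_t` relies on two's complement conversion, which mainstream compilers provide.

```c
#include <stdbool.h>
#include <stdint.h>

typedef enum { SEQ_IN_ORDER, SEQ_DUPLICATE, SEQ_GAP } SeqClass;

/* True if a comes before b, treating the 16-bit space as circular. */
bool is_sequence_before(uint16_t a, uint16_t b) {
    return (int16_t)(uint16_t)(a - b) < 0;
}

/* Classify a received sequence number against the one the receiver expects. */
SeqClass classify_sequence(uint16_t received, uint16_t expected) {
    if (received == expected) {
        return SEQ_IN_ORDER;
    }
    if (is_sequence_before(received, expected)) {
        return SEQ_DUPLICATE;   /* older than expected: already processed */
    }
    return SEQ_GAP;             /* newer than expected: something is missing */
}

/* Number of packets missing between expected and received (wrap-safe). */
uint16_t missing_count(uint16_t received, uint16_t expected) {
    return (uint16_t)(received - expected);
}
```

With `expected = 65534` and `received = 2`, the packet is classified as a gap with four packets missing (65534, 65535, 0, 1), which a plain `<` or `>` comparison would get wrong.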


744.8 Selective Acknowledgment (SACK)

For high-throughput applications, acknowledging every packet is inefficient. Selective Acknowledgment allows the receiver to acknowledge multiple packets at once and specifically identify gaps:

%%{init: {'theme': 'base', 'themeVariables': {'primaryColor':'#2C3E50','primaryTextColor':'#fff','primaryBorderColor':'#16A085','lineColor':'#16A085','secondaryColor':'#E67E22'}}}%%
sequenceDiagram
    participant S as Sender
    participant R as Receiver

    Note over S,R: Without SACK (Go-Back-N)

    S->>R: SEQ 1
    S->>R: SEQ 2 (lost)
    S->>R: SEQ 3
    S->>R: SEQ 4
    R-->>S: ACK 1, NACK 2
    Note over S: Must retransmit 2,3,4

    Note over S,R: With SACK

    S->>R: SEQ 1
    S->>R: SEQ 2 (lost)
    S->>R: SEQ 3
    S->>R: SEQ 4
    R-->>S: SACK: received 1,3,4; missing 2
    Note over S: Only retransmit 2

Figure 744.6: SACK comparison: without SACK the sender retransmits all packets after the gap; with SACK only the missing packet is retransmitted.
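One possible encoding of a selective acknowledgment is a cumulative ACK plus a bitmap of packets received beyond it. The layout below is an assumption chosen for illustration (an 8-bit bitmap and hypothetical names) and is not the wire format of any particular protocol.

```c
#include <stdint.h>

typedef struct {
    uint16_t cumulative_ack;  /* every seq up to and including this arrived */
    uint8_t  bitmap;          /* bit i set => seq (cumulative_ack + 1 + i) arrived */
} SackInfo;

/* Receiver side: summarise which packets arrived, starting at seq `base`.
 * If base itself is missing, cumulative_ack is base - 1 (nothing new acked). */
SackInfo sack_build(uint16_t base, const uint16_t *arrived, int n) {
    uint32_t mask = 0;                        /* bit k set => seq base+k arrived */
    for (int i = 0; i < n; i++) {
        uint16_t off = (uint16_t)(arrived[i] - base);
        if (off < 16) {
            mask |= 1u << off;
        }
    }
    uint16_t in_order = 0;
    while (mask & 1u) {                       /* strip the in-order prefix */
        mask >>= 1;
        in_order++;
    }
    SackInfo s;
    s.cumulative_ack = (uint16_t)(base + in_order - 1);
    s.bitmap = (uint8_t)(mask & 0xFFu);       /* bit 0 is the first missing seq */
    return s;
}

/* Sender side: list the gaps below the highest packet the receiver reported. */
int sack_missing(const SackInfo *s, uint16_t *out, int max_out) {
    int highest = -1;
    for (int i = 7; i >= 0; i--) {
        if (s->bitmap & (1u << i)) {
            highest = i;
            break;
        }
    }
    int n = 0;
    for (int i = 0; i < highest && n < max_out; i++) {
        if (!(s->bitmap & (1u << i))) {
            out[n++] = (uint16_t)(s->cumulative_ack + 1 + i);
        }
    }
    return n;
}
```

For the figure's case (1, 3, and 4 received), the receiver reports `cumulative_ack = 1` with bits set for 3 and 4, and the sender concludes that only SEQ 2 needs retransmission.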

744.9 Retry Overhead Analysis

Reliability has costs. Understanding the overhead helps you choose appropriate parameters:

744.9.1 Time Overhead

Total delivery time = RTT + (retries * average_backoff)

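Example (20% loss in each direction, so a message and its ACK both get through with probability 0.8 * 0.8 = 64%; initial=500ms, 3 retries, jitter ignored):
- 64% succeed on first try: 140ms RTT
- 23% succeed on second try (0.36 * 0.64): 500ms + 140ms = 640ms
- 8% succeed on third try (0.36^2 * 0.64): 500ms + 1000ms + 140ms = 1640ms
- 3% succeed on fourth try (0.36^3 * 0.64): 500ms + 1000ms + 2000ms + 140ms = 3640ms
- About 2% fail all four attempts

Weighted average over delivered messages: (0.64*140 + 0.23*640 + 0.083*1640 + 0.030*3640) / 0.983 = approximately 490ms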
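The same expectation can be recomputed for other parameters with a short sketch. It assumes every attempt succeeds independently with the same probability and ignores jitter. The `RetryStats` type and `retry_stats` function are hypothetical names. Because it uses exact geometric probabilities, its result can differ somewhat from hand calculations based on rounded percentages.

```c
typedef struct {
    double mean_latency_ms;   /* average over messages that were delivered */
    double delivery_ratio;    /* fraction delivered within the retry budget */
} RetryStats;

RetryStats retry_stats(double p_success, double rtt_ms,
                       double initial_timeout_ms, double max_timeout_ms,
                       int max_retries) {
    double p_reach = 1.0;     /* probability that this attempt is needed */
    double waited = 0.0;      /* backoff accumulated before this attempt */
    double timeout = initial_timeout_ms;
    double weighted = 0.0;
    double delivered = 0.0;

    for (int attempt = 0; attempt <= max_retries; attempt++) {
        double p_here = p_reach * p_success;   /* succeeds on this attempt */
        weighted += p_here * (waited + rtt_ms);
        delivered += p_here;
        p_reach *= (1.0 - p_success);
        waited += timeout;
        timeout = timeout * 2 < max_timeout_ms ? timeout * 2 : max_timeout_ms;
    }

    RetryStats s;
    s.mean_latency_ms = delivered > 0.0 ? weighted / delivered : 0.0;
    s.delivery_ratio = delivered;
    return s;
}
```

With a 64% per-attempt success rate, 140ms RTT, 500ms initial timeout, and 3 retries, this gives a mean of roughly 490ms over delivered messages, with about 98% of messages delivered.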

744.9.2 Power Overhead

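| Component | Current | Duration | Energy |
|---|---|---|---|
| TX attempt | 100mA | 50ms | 5mJ |
| Wait (sleep) | 10uA | 500ms | 0.005mJ |
| RX window | 20mA | 100ms | 2mJ |

The energy values equal current x duration, which corresponds to a 1V supply; multiply by the actual supply voltage for a real device.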

With 20% loss per transmission, the expected number of attempts is 1 / (1 - 0.2) = 1.25, an average of 0.25 retries per message:

Energy per message = (1 + 0.25) * (5 + 0.005 + 2) = 8.76mJ

Tip: Power Optimization

To minimize retry power cost:

  1. Use hardware CRC to avoid retransmits from corruption
  2. Increase TX power slightly to reduce loss rate
  3. Sleep during backoff rather than busy-waiting
  4. Use adaptive timeout based on recent RTT measurements
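The energy arithmetic in this section can be parameterised in a few lines. This is a sketch with hypothetical names. It makes the supply voltage explicit (energy = current x time x voltage) and assumes each transmission is lost independently with the same probability, so the expected number of attempts is 1 / (1 - loss).

```c
typedef struct {
    double current_ma;    /* average current during the phase, in mA */
    double duration_ms;   /* how long the phase lasts, in ms */
} Phase;

/* Energy of one phase in millijoules: mA * s * V. */
double phase_energy_mj(Phase p, double supply_v) {
    return p.current_ma * (p.duration_ms / 1000.0) * supply_v;
}

/* Expected transmissions per delivered message for a given loss rate. */
double expected_attempts(double loss) {
    return 1.0 / (1.0 - loss);
}

/* Expected energy per message: attempts * (sum of the per-attempt phases). */
double energy_per_message_mj(const Phase *phases, int n,
                             double supply_v, double loss) {
    double per_attempt = 0.0;
    for (int i = 0; i < n; i++) {
        per_attempt += phase_energy_mj(phases[i], supply_v);
    }
    return expected_attempts(loss) * per_attempt;
}
```

With phases of 100mA for 50ms, 0.01mA for 500ms, and 20mA for 100ms at 20% loss, a 1V supply gives about 8.76mJ per message; the same phases at 3.3V give about 28.9mJ.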


744.10 Knowledge Check

Question: With initial timeout 500ms and max timeout 8000ms, what is the base timeout on the 4th attempt (retry count = 3)?

  A) 1500ms
  B) 2000ms
  C) 4000ms
  D) 8000ms

Answer: C) 4000ms

Timeout = min(500 * 2^3, 8000) = min(500 * 8, 8000) = min(4000, 8000) = 4000ms

On the 5th attempt (retry count = 4): min(500 * 16, 8000) = min(8000, 8000) = 8000ms (hits max)

Question: What is the PRIMARY purpose of adding random jitter to retry timeouts?

  A) To make debugging easier by adding timestamps
  B) To prevent multiple devices from retrying at exactly the same time
  C) To compensate for clock drift between sender and receiver
  D) To make the protocol more secure against timing attacks

Answer: B) To prevent multiple devices from retrying at exactly the same time

Without jitter, devices experiencing simultaneous failures (e.g., from a network outage) would all retry at identical times, causing collision storms that worsen congestion. Jitter spreads retry attempts across a time window.

Question: A sender transmits SEQ=5, receives no ACK (lost), retransmits SEQ=5, receiver gets both copies. What should the receiver do?

  A) Process both messages normally
  B) Discard the first, process the second
  C) Process the first, discard the second, ACK both
  D) Send NACK for both

Answer: C) Process the first, discard the second, ACK both

The receiver should ACK both copies (in case the first ACK was lost) but only process the data once. The sequence number identifies the second copy as a duplicate of already-processed data.

Question: With 16-bit sequence numbers, what value follows 65535?

  A) 65536
  B) 0
  C) 1
  D) Error - sequence exhausted

Answer: B) 0

16-bit sequence numbers wrap around from 65535 to 0. This is normal operation, not an error. Receivers must use modular arithmetic (signed comparison of differences) to correctly order packets across the wraparound boundary.

744.11 Summary

This chapter covered retry mechanisms and sequence number management:

Exponential Backoff:

  • Doubles wait time after each failure
  • Reduces network load during congestion
  • Requires maximum timeout cap

Random Jitter:

  • Spreads retry attempts to prevent collisions
  • Typically 20-30% of base timeout
  • Essential for multi-device networks

Sequence Numbers:

  • Detect loss (gaps), duplicates (repeated), reordering
  • Handle wraparound with signed arithmetic
  • Enable selective acknowledgment (SACK)

Key Trade-offs:

| More Retries | Fewer Retries |
|---|---|
| Higher delivery rate | Lower latency |
| More power consumption | Less network load |
| Longer worst-case delay | May fail faster |

What is Next: The Connection State Lab chapter provides hands-on implementation of all reliability concepts.

