21  Connection Reliability Lab

In 60 Seconds

This lab brings together all five reliability pillars – error detection, retry with backoff, sequence numbering, connection state management, and keep-alive monitoring – into a complete working ESP32 implementation. You will build a connection state machine with proper transitions (disconnected, connecting, connected, disconnecting), implement keep-alive heartbeats for long-lived sessions, and measure delivery rates, retry overhead, and effective throughput.

Key Concepts
  • TCP State Machine: 11 states (CLOSED, LISTEN, SYN_SENT, SYN_RECEIVED, ESTABLISHED, FIN_WAIT_1/2, CLOSE_WAIT, CLOSING, LAST_ACK, TIME_WAIT); understanding state transitions is essential for debugging connection issues
  • netstat / ss: Linux commands for displaying active TCP connections and their states; ss -tn shows TCP connections with numeric addresses; useful for diagnosing connection state leaks
  • Wireshark TCP Stream: TCP stream reassembly feature displaying the full conversation between two endpoints; essential for diagnosing protocol-level issues in IoT connection reliability
  • TCP Keepalive: Three socket options: TCP_KEEPIDLE (seconds before first probe), TCP_KEEPINTVL (probe interval), TCP_KEEPCNT (number of probes before dropping); defaults vary by OS
  • SO_LINGER Socket Option: Configures TCP close behavior: linger=0 causes immediate RST on close (no TIME_WAIT); linger>0 waits specified seconds for data to flush; affects how quickly ports can be reused
  • Connection Timeout vs Read Timeout: Connection timeout: maximum time to complete TCP handshake; read timeout: maximum time to receive any data after connection; both must be configured for robust IoT clients
  • Network Namespace: Linux kernel feature isolating network stack (interfaces, routing, firewall rules) per namespace; used in labs to simulate network conditions without physical hardware
  • tc netem: Linux Traffic Control Network Emulator; adds delay, jitter, packet loss, duplication, and corruption to network interfaces for lab simulation of poor network conditions
Learning Objectives

By the end of this section, you will be able to:

  • Design connection state machines: Build robust connection management with proper state transitions
  • Implement keep-alive mechanisms: Monitor connection health for long-lived IoT sessions
  • Diagnose connection failures: Identify root causes of and recover from unexpected disconnections
  • Construct a complete reliable transport: Implement all five reliability pillars in working code
  • Evaluate reliability metrics: Assess delivery rate, retry overhead, and throughput against targets
  • Configure reliability parameters: Select and adjust settings based on observed network conditions

This lab lets you experiment with network reliability by building connections that can handle real-world problems like dropped messages, out-of-order delivery, and network delays. Think of it as stress-testing a delivery service – you will see what happens when packages get lost and how systems recover automatically.

“This lab combines everything we have learned into a working system,” said Max the Microcontroller. “Error detection, retries, sequence numbers, connection states, and keep-alive – all running on a real ESP32.”

“The connection state machine is the backbone,” explained Sammy the Sensor. “It tracks whether we are disconnected, connecting, connected, or disconnecting. Each state has rules for what can happen next – you cannot send data while disconnected, and you cannot disconnect while already disconnected.”

“Keep-alive heartbeats detect silent failures,” added Lila the LED. “NAT routers time out idle connections after 30 to 120 seconds. Cell networks silently drop sessions. Without heartbeats, you think you are connected while the network path is broken – a zombie connection.”

“By the end, you will measure delivery rates, retry overhead, and effective throughput,” said Bella the Battery. “These are the metrics that tell you if your reliability implementation is actually working. Theory is great, but numbers do not lie!”

21.1 Prerequisites

Before diving into this chapter, you should be familiar with:

Why Connection State Matters

Long-lived IoT connections are fragile. NAT routers time out idle connections after 30-120 seconds. Cellular networks may silently drop sessions. Wi-Fi access points reboot for updates. Without proper connection state management, your device may believe it is connected while the network path is broken – leading to lost data and unresponsive actuators.


21.2 Connection State Machines

Time: ~10 min | Level: Intermediate | Unit: P07.REL.U05

Connection state machines track the lifecycle of a communication session. Proper state management prevents resource leaks, handles unexpected disconnections, and ensures clean shutdown.

21.2.1 Basic Connection States

State machine diagram with four states: DISCONNECTED, CONNECTING, CONNECTED, and DISCONNECTING. Arrows show transitions: DISCONNECTED to CONNECTING on connect(), CONNECTING to CONNECTED on ACK received, CONNECTING to DISCONNECTED on timeout after retries, CONNECTED to DISCONNECTING on disconnect(), DISCONNECTING to DISCONNECTED on ACK or timeout.
Figure 21.1: Connection state machine showing DISCONNECTED, CONNECTING, CONNECTED, and DISCONNECTING states with transition conditions.

21.2.2 State Transition Rules

Current State | Event                   | Next State    | Action
DISCONNECTED  | connect()               | CONNECTING    | Send CONNECT message
CONNECTING    | ACK received            | CONNECTED     | Start keep-alive timer
CONNECTING    | Timeout (after retries) | DISCONNECTED  | Report failure
CONNECTED     | Data to send            | CONNECTED     | Send with reliability
CONNECTED     | Keep-alive timeout      | DISCONNECTED  | Clean up resources
CONNECTED     | disconnect()            | DISCONNECTING | Send DISCONNECT
DISCONNECTING | ACK received            | DISCONNECTED  | Clean up complete
DISCONNECTING | Timeout                 | DISCONNECTED  | Force clean up
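
The transition rules above can also be held as data rather than scattered across `if` statements. The following is a minimal sketch of that idea in plain C++, not the lab's own implementation: the `Event` enum, the `Transition` struct, and the `nextState()` helper are hypothetical names introduced here for illustration.

```cpp
#include <cstdint>

// State names mirror the table above. Event, Transition and nextState()
// are illustrative names and do not appear in the lab code.
enum ConnectionState : uint8_t {
    STATE_DISCONNECTED,
    STATE_CONNECTING,
    STATE_CONNECTED,
    STATE_DISCONNECTING
};

enum Event : uint8_t {
    EV_CONNECT_CALLED,     // connect()
    EV_ACK_RECEIVED,       // ACK for CONNECT or DISCONNECT
    EV_TIMEOUT,            // connect retries exhausted, or disconnect timed out
    EV_DATA_TO_SEND,
    EV_KEEPALIVE_TIMEOUT,
    EV_DISCONNECT_CALLED   // disconnect()
};

struct Transition {
    ConnectionState from;
    Event event;
    ConnectionState to;
};

// One entry per row of the transition table.
static const Transition kTransitions[] = {
    {STATE_DISCONNECTED,  EV_CONNECT_CALLED,    STATE_CONNECTING},
    {STATE_CONNECTING,    EV_ACK_RECEIVED,      STATE_CONNECTED},
    {STATE_CONNECTING,    EV_TIMEOUT,           STATE_DISCONNECTED},
    {STATE_CONNECTED,     EV_DATA_TO_SEND,      STATE_CONNECTED},
    {STATE_CONNECTED,     EV_KEEPALIVE_TIMEOUT, STATE_DISCONNECTED},
    {STATE_CONNECTED,     EV_DISCONNECT_CALLED, STATE_DISCONNECTING},
    {STATE_DISCONNECTING, EV_ACK_RECEIVED,      STATE_DISCONNECTED},
    {STATE_DISCONNECTING, EV_TIMEOUT,           STATE_DISCONNECTED},
};

// Returns true and writes the next state if (current, event) is in the table.
// Returns false and leaves *next untouched for any pair that is not listed.
bool nextState(ConnectionState current, Event event, ConnectionState* next) {
    for (const Transition& t : kTransitions) {
        if (t.from == current && t.event == event) {
            *next = t.to;
            return true;
        }
    }
    return false;
}
```

Rejecting unlisted pairs (for example, data to send while DISCONNECTED) in one place makes illegal transitions easy to log and test.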

21.2.3 Implementation Pattern

enum ConnectionState {
    STATE_DISCONNECTED = 0,
    STATE_CONNECTING   = 1,
    STATE_CONNECTED    = 2,
    STATE_DISCONNECTING= 3
};

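// Human-readable names for logging, indexed by ConnectionState value
const char* stateNames[] = {
    "DISCONNECTED", "CONNECTING", "CONNECTED", "DISCONNECTING"
};

ConnectionState currentState = STATE_DISCONNECTED;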

void transitionState(ConnectionState newState) {
    // Log transition for debugging
    Serial.printf("[STATE] %s -> %s\n",
                  stateNames[currentState],
                  stateNames[newState]);

    // Perform exit actions for old state
    switch (currentState) {
        case STATE_CONNECTED:
            stopKeepaliveTimer();
            break;
        // ... other exit actions
    }

    // Update state
    currentState = newState;

    // Perform entry actions for new state
    switch (newState) {
        case STATE_CONNECTED:
            startKeepaliveTimer();
            resetSequenceNumbers();
            break;
        // ... other entry actions
    }
}

21.3 Keep-Alive Mechanism

Long-lived connections may go silent during idle periods. Keep-alive messages verify the connection is still healthy.

21.3.1 Keep-Alive Protocol

Sequence diagram showing keep-alive protocol between a device and server. The device sends periodic PING messages at the configured interval; the server responds with PONG. When the server fails to respond, the device increments a missed-ping counter, and after three consecutive misses declares the connection dead and transitions to DISCONNECTED state.
Figure 21.2: Keep-alive protocol: periodic pings detect silent connection failures.
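
The missed-ping counting in Figure 21.2 can be sketched as a small helper. This is an illustration under stated assumptions, not the lab's code: `KeepaliveMonitor`, `onPingDue()`, and `onPong()` are hypothetical names, and the threshold of three misses is taken from the figure description.

```cpp
#include <cstdint>

// Hypothetical helper: counts consecutive unanswered pings and reports
// when the connection should be declared dead.
struct KeepaliveMonitor {
    static constexpr uint8_t kMaxMissed = 3;  // three consecutive misses
    uint8_t missed = 0;
    bool awaitingPong = false;

    // Call once per keep-alive interval, just before sending the next PING.
    // Returns false once kMaxMissed pings in a row went unanswered.
    bool onPingDue() {
        if (awaitingPong) {
            missed++;            // the previous PING never got a PONG
        }
        awaitingPong = true;     // a new PING is now outstanding
        return missed < kMaxMissed;
    }

    // Call when a PONG arrives: any answer clears the miss counter.
    void onPong() {
        awaitingPong = false;
        missed = 0;
    }
};
```

When `onPingDue()` returns false, the caller would transition to DISCONNECTED instead of sending another PING.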

21.3.2 Protocol-Specific Keep-Alive

Protocol  | Mechanism         | Typical Interval  | Notes
MQTT      | PINGREQ/PINGRESP  | 30-300s           | Configurable, broker enforces
CoAP      | Empty CON message | 30-120s           | Optional, not standardized
TCP       | TCP keepalive     | 2 hours (default) | OS-level, often disabled for IoT
WebSocket | Ping/Pong frames  | 30-60s            | Built into protocol
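
The 2-hour default in the TCP row is rarely useful for IoT, but it can be overridden per socket. The sketch below assumes a Linux host, where the options are named `TCP_KEEPIDLE`, `TCP_KEEPINTVL`, and `TCP_KEEPCNT`; other systems may use different names. The function name `enableTcpKeepalive` is hypothetical.

```cpp
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

// Enables TCP keepalive on an existing socket and overrides the OS defaults.
// idleSec:     idle seconds before the first probe (TCP_KEEPIDLE)
// intervalSec: seconds between probes              (TCP_KEEPINTVL)
// probeCount:  unanswered probes before the drop   (TCP_KEEPCNT)
// Returns 0 on success, -1 if any setsockopt() call fails.
int enableTcpKeepalive(int fd, int idleSec, int intervalSec, int probeCount) {
    int on = 1;
    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) != 0) {
        return -1;
    }
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idleSec, sizeof(idleSec)) != 0) {
        return -1;
    }
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intervalSec, sizeof(intervalSec)) != 0) {
        return -1;
    }
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &probeCount, sizeof(probeCount)) != 0) {
        return -1;
    }
    return 0;
}
```

For example, `enableTcpKeepalive(fd, 25, 5, 3)` would send the first probe after 25 idle seconds and give up after three unanswered probes spaced 5 seconds apart.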

NAT Timeout Considerations

NAT routers and firewalls often drop “idle” connections after 30-120 seconds. Your keep-alive interval must be shorter than the shortest NAT timeout in the path, or connections will silently break.

Recommendation: Use 25 second keep-alive intervals for cellular IoT, which provides margin even against 30-second NAT timeouts. For Wi-Fi home routers (120-300 second NAT timeouts), 60 seconds is typically sufficient.

Keep-alive interval must be shorter than NAT timeout to prevent zombie connections. The safety margin prevents timing races:

$ T_{keepalive} \le T_{NAT} \times S_{margin} $

where \(S_{margin}\) = safety factor (typically 0.5-0.8).

Worked example: Cellular network with NAT timeout \(T_{NAT}\) = 90 seconds, using \(S_{margin}\) = 0.7.

$ T_{keepalive} = 90 \times 0.7 = 63 \text{ seconds} $

With a 63-second keep-alive, the connection sends a ping every 63 s. The NAT entry resets at each ping and never reaches the 90 s timeout. If one ping is lost, 90 - 63 = 27 seconds remain before the NAT entry expires, which is the window in which the failure can be detected and a retry sent. A 30-second keep-alive widens that window to 90 - 30 = 60 seconds, but roughly doubles control overhead (2 pings/min versus about 1 ping/min).
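
The same arithmetic can be sketched as a helper. This is an illustrative sketch, not part of the lab code: `KeepalivePlan` and `planKeepalive()` are hypothetical names, and the safety margin is passed as a whole-number percentage so that the result does not depend on floating-point rounding.

```cpp
#include <cstdint>

// Hypothetical helper implementing T_keepalive = T_NAT * S_margin.
struct KeepalivePlan {
    uint32_t intervalSec;  // recommended keep-alive interval
    uint32_t slackSec;     // T_NAT - T_keepalive: time left after a missed ping
};

// marginPercent is the safety factor as a percentage (70 means 0.7).
KeepalivePlan planKeepalive(uint32_t natTimeoutSec, uint32_t marginPercent) {
    KeepalivePlan plan;
    plan.intervalSec = natTimeoutSec * marginPercent / 100;
    plan.slackSec = natTimeoutSec - plan.intervalSec;
    return plan;
}
```

With the worked example's inputs, `planKeepalive(90, 70)` yields a 63-second interval and 27 seconds of slack.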


21.4 Reliability Lab: Build a Complete Reliable Transport System

This hands-on lab demonstrates all five reliability pillars by building a complete reliable messaging system on an ESP32. You will implement CRC error detection, sequence numbering, acknowledgments, exponential backoff retries, and connection state management - the same mechanisms used by TCP and CoAP internally.
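
As a preview of the exponential backoff retries mentioned above, here is a minimal sketch in plain C++. It is not the lab's own `calculateBackoff()`; the function names are hypothetical, and the constants copy the lab's configuration values (500 ms initial timeout, 8000 ms cap, up to 25% jitter). The random percentage is passed in by the caller so that the logic stays deterministic.

```cpp
#include <cstdint>

// Values mirror the lab configuration; the names here are illustrative.
const uint32_t kInitialTimeoutMs = 500;
const uint32_t kMaxTimeoutMs = 8000;
const uint32_t kJitterPercent = 25;

// Base timeout doubles with each retry and is capped at kMaxTimeoutMs:
// retry 0 -> 500, 1 -> 1000, 2 -> 2000, 3 -> 4000, 4 and above -> 8000.
uint32_t baseBackoffMs(uint32_t retryCount) {
    uint32_t timeout = kInitialTimeoutMs;
    for (uint32_t i = 0; i < retryCount && timeout < kMaxTimeoutMs; i++) {
        timeout *= 2;
    }
    return timeout < kMaxTimeoutMs ? timeout : kMaxTimeoutMs;
}

// Adds jitter of 0..kJitterPercent percent of the base timeout.
// randomPercent comes from the caller, e.g. random(kJitterPercent + 1) on
// Arduino; values above kJitterPercent are clamped.
uint32_t backoffWithJitterMs(uint32_t retryCount, uint32_t randomPercent) {
    uint32_t base = baseBackoffMs(retryCount);
    uint32_t pct = randomPercent > kJitterPercent ? kJitterPercent : randomPercent;
    return base + (base * pct) / 100;
}
```

Because jitter is added after the cap in this sketch, the longest possible wait is 8000 ms plus 25%, or 10000 ms. Jitter spreads out retries from devices that failed at the same moment, so they do not all retransmit in lockstep.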

21.4.1 What You Will Learn

By completing this lab, you will be able to:

  • Calculate and verify CRC-16 checksums: Identify bit-level corruption in transmitted data
  • Implement exponential backoff with jitter: Apply congestion-aware retry logic to real code
  • Construct sequence number management: Distinguish in-order delivery from duplicates and gaps
  • Design an ACK/NACK protocol: Build a bidirectional acknowledgment system from scratch
  • Build connection state machines: Construct session lifecycle management with proper state transitions
  • Evaluate real-time reliability statistics: Analyze throughput, loss rate, and retry overhead
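
The sequence-number objective above depends on telling in-order packets from duplicates and gaps, including when the 16-bit counter wraps from 65535 back to 0. The sketch below shows one common approach, a signed difference between received and expected values. `SeqResult` and `classifySequence()` are hypothetical names, and this is not necessarily how the lab code makes the decision.

```cpp
#include <cstdint>

enum SeqResult { SEQ_IN_ORDER, SEQ_DUPLICATE, SEQ_GAP };

// Compares a received sequence number against the expected one.
// The subtraction is done modulo 2^16 and then read as a signed value, so a
// number slightly "behind" expected is a duplicate and one slightly "ahead"
// is a gap, even across the 65535 -> 0 wraparound.
SeqResult classifySequence(uint16_t received, uint16_t expected) {
    uint16_t delta = static_cast<uint16_t>(received - expected);
    int16_t diff = static_cast<int16_t>(delta);
    if (diff == 0) {
        return SEQ_IN_ORDER;
    }
    return diff < 0 ? SEQ_DUPLICATE : SEQ_GAP;
}
```

For instance, receiving 0 while expecting 65535 is classified as a gap (packet 65535 was skipped), whereas receiving 65535 while expecting 0 is classified as a duplicate of the previous packet.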

21.4.2 Components Needed

Component         | Quantity | Purpose
ESP32 DevKit      | 1        | Microcontroller running reliability simulation
Green LED         | 1        | Successful transmission indicator (ACK received)
Red LED           | 1        | Error indicator (timeout, CRC failure, max retries)
Yellow LED        | 1        | Transmission in progress indicator
Blue LED          | 1        | Connection state indicator (connected/disconnected)
220 ohm Resistors | 4        | Current limiting for LEDs
Push Button       | 1        | Trigger manual message send
10K ohm Resistor  | 1        | Button pull-down
Breadboard        | 1        | Circuit assembly
Jumper Wires      | Several  | Connections

21.4.3 Wokwi Simulator

Use the embedded simulator below to build and test your reliable transport system. Click “Start Simulation” to begin.

Simulator Tips
  • Click inside the simulator frame first to give it focus
  • Press the green “Play” button to start simulation
  • Open the Serial Monitor (set to 115200 baud) to see detailed output
  • You can add components by clicking the “+” button
  • Copy the code below into the editor, replacing any default code

21.4.4 Circuit Diagram

Breadboard wiring diagram for the reliability lab. An ESP32 DevKit connects to four LEDs via 220 ohm current-limiting resistors: Green LED on GPIO 4 (success), Red LED on GPIO 2 (error), Yellow LED on GPIO 5 (transmit in progress), and Blue LED on GPIO 18 (connection state). A push button is wired from 3.3V to GPIO 15 with a 10K ohm pull-down resistor to GND.
Figure 21.3: Circuit diagram showing ESP32 connections to four LEDs for status indication and a push button for manual message triggering.

21.4.5 Complete Code

Copy this code into the Wokwi editor:

/*
 * ============================================================================
 * RELIABILITY AND ERROR HANDLING LAB - ESP32
 * ============================================================================
 *
 * This comprehensive lab demonstrates the five pillars of reliable IoT
 * communication:
 *
 * 1. ERROR DETECTION: CRC-16 checksum calculation and verification
 * 2. SEQUENCE NUMBERS: Packet ordering, loss detection, duplicate handling
 * 3. ACKNOWLEDGMENTS: ACK/NACK protocol with configurable timeouts
 * 4. RETRY WITH BACKOFF: Exponential backoff with random jitter
 * 5. CONNECTION STATE: State machine for connection lifecycle
 *
 * LED Indicators:
 * - Green (GPIO 4):  Successful transmission (ACK received)
 * - Red (GPIO 2):    Error condition (timeout, CRC fail, max retries)
 * - Yellow (GPIO 5): Transmission in progress
 * - Blue (GPIO 18):  Connection state (ON = connected)
 *
 * Button (GPIO 15): Manual message send trigger
 *
 * Serial Commands:
 * - 'h' or '?': Show help
 * - 's': Send test message
 * - 'c': Toggle connection state
 * - 'v': Toggle verbose mode
 * - 'a': Toggle auto-send mode
 * - 'r': Reset statistics
 * - 'p': Print current statistics
 * - '0'-'9': Set packet loss percentage (0=0%, 1=10%, 5=50%, etc.)
 *
 * Author: IoT Education Platform
 * License: MIT
 * ============================================================================
 */

#include <Arduino.h>

// ============================================================================
// PIN DEFINITIONS
// ============================================================================
#define LED_SUCCESS     4     // Green LED - ACK received, transmission OK
#define LED_ERROR       2     // Red LED - Timeout, CRC fail, or max retries
#define LED_TRANSMIT    5     // Yellow LED - Transmission in progress
#define LED_CONNECTED   18    // Blue LED - Connection state indicator
#define BTN_SEND        15    // Push button for manual message send

// ============================================================================
// RELIABILITY CONFIGURATION
// ============================================================================
#define INITIAL_TIMEOUT_MS    500     // Initial ACK timeout (milliseconds)
#define MAX_TIMEOUT_MS        8000    // Maximum timeout after backoff
#define MAX_RETRIES           5       // Maximum retry attempts before failure
#define JITTER_PERCENT        25      // Random jitter (0-25% of timeout)
#define DEFAULT_LOSS_PERCENT  20      // Default simulated packet loss rate
#define KEEPALIVE_INTERVAL_MS 10000   // Connection keep-alive interval
#define KEEPALIVE_TIMEOUT_MS  3000    // Time to wait for keep-alive response

// ============================================================================
// CRC-16 POLYNOMIAL (CRC-CCITT)
// ============================================================================
#define CRC16_POLYNOMIAL  0x1021
#define CRC16_INITIAL     0xFFFF

// ============================================================================
// MESSAGE TYPES
// ============================================================================
enum MessageType : uint8_t {
  MSG_DATA      = 0x01,   // Data payload message
  MSG_ACK       = 0x02,   // Positive acknowledgment
  MSG_NACK      = 0x03,   // Negative acknowledgment (CRC error, etc.)
  MSG_CONNECT   = 0x10,   // Connection request
  MSG_CONNACK   = 0x11,   // Connection acknowledgment
  MSG_DISCONNECT= 0x12,   // Disconnect notification
  MSG_KEEPALIVE = 0x20,   // Keep-alive ping
  MSG_KEEPALIVE_ACK = 0x21 // Keep-alive response
};

// ============================================================================
// CONNECTION STATES
// ============================================================================
enum ConnectionState : uint8_t {
  STATE_DISCONNECTED = 0,
  STATE_CONNECTING   = 1,
  STATE_CONNECTED    = 2,
  STATE_DISCONNECTING= 3
};

const char* stateNames[] = {
  "DISCONNECTED",
  "CONNECTING",
  "CONNECTED",
  "DISCONNECTING"
};

// ============================================================================
// MESSAGE STRUCTURE (74 bytes total: 8 header + 64 payload + 2 CRC)
// ============================================================================
struct ReliableMessage {
  // Header (8 bytes)
  uint8_t   type;           // Message type (MSG_DATA, MSG_ACK, etc.)
  uint8_t   flags;          // Bit flags: 0x01=requires_ack, 0x02=is_retransmit
  uint16_t  sequenceNum;    // Sender's sequence number (0-65535)
  uint16_t  ackNum;         // Acknowledgment number (for ACK/NACK)
  uint8_t   payloadLen;     // Length of payload (0-63)
  uint8_t   reserved;       // Reserved for future use

  // Payload (64-byte buffer: up to 63 data bytes plus a null terminator)
  char      payload[64];    // Message data

  // Trailer (2 bytes)
  uint16_t  crc16;          // CRC-16 checksum of header + payload
};

// ============================================================================
// RELIABILITY STATISTICS
// ============================================================================
struct ReliabilityStats {
  // Transmission counts
  uint32_t messagesSent;
  uint32_t messagesDelivered;
  uint32_t messagesFailed;

  // Acknowledgment counts
  uint32_t acksReceived;
  uint32_t nacksReceived;
  uint32_t ackTimeouts;

  // Retry statistics
  uint32_t retransmissions;
  uint32_t totalRetryDelayMs;

  // Error detection
  uint32_t crcErrors;
  uint32_t duplicatesDetected;
  uint32_t outOfOrderDetected;

  // Simulated network conditions
  uint32_t simulatedDrops;
  uint32_t simulatedCorruptions;

  // Connection statistics
  uint32_t connectionAttempts;
  uint32_t connectionSuccesses;
  uint32_t keepalivesSent;
  uint32_t keepalivesTimedOut;

  // Timing
  unsigned long startTime;
  unsigned long lastMessageTime;
};

// ============================================================================
// GLOBAL STATE
// ============================================================================
ConnectionState currentState = STATE_DISCONNECTED;
uint16_t nextSequenceNum = 0;
uint16_t expectedSequenceNum = 0;
uint16_t lastAckedSequence = 0xFFFF;

ReliabilityStats stats;
int packetLossPercent = DEFAULT_LOSS_PERCENT;
int corruptionPercent = 5;  // 5% chance of bit corruption
bool verboseMode = true;
bool autoSendMode = true;

unsigned long lastKeepaliveTime = 0;
unsigned long lastButtonPress = 0;

// ============================================================================
// FUNCTION PROTOTYPES
// ============================================================================

// LED Functions
void initializePins();
void setLED(int pin, bool state);
void blinkLED(int pin, int times, int delayMs);
void updateConnectionLED();

// CRC Functions
uint16_t calculateCRC16(const uint8_t* data, size_t length);
uint16_t calculateMessageCRC(ReliableMessage* msg);
bool verifyCRC(ReliableMessage* msg);
void corruptRandomBit(ReliableMessage* msg);

// Message Functions
void createDataMessage(ReliableMessage* msg, const char* payload);
void createAckMessage(ReliableMessage* ack, uint16_t ackNum);
void createNackMessage(ReliableMessage* nack, uint16_t nackNum, const char* reason);
void createConnectMessage(ReliableMessage* msg);
void createKeepaliveMessage(ReliableMessage* msg);
void printMessage(ReliableMessage* msg, const char* direction);

// Network Simulation
bool simulatePacketLoss();
bool simulateCorruption();
int calculateBackoff(int retryCount);

// Reliability Core
bool sendWithReliability(const char* data);
bool sendMessage(ReliableMessage* msg, bool requiresAck);
void processReceivedMessage(ReliableMessage* msg);

// Connection Management
bool connect();
void disconnect();
void handleKeepalive();
void transitionState(ConnectionState newState);

// Statistics
void printStatistics();
void resetStatistics();
void printHelp();

// Command handler
void handleCommand(char cmd);

// ============================================================================
// SETUP
// ============================================================================
void setup() {
  Serial.begin(115200);
  delay(1000);

  // Initialize random seed
  randomSeed(analogRead(0) + micros());

  // Initialize GPIO
  initializePins();

  // Initialize statistics
  resetStatistics();

  // Print banner
  Serial.println();
  Serial.println("========================================================================");
  Serial.println("     RELIABILITY AND ERROR HANDLING LAB - ESP32                        ");
  Serial.println("========================================================================");
  Serial.println();
  Serial.println("This lab demonstrates the five pillars of reliable IoT communication:");
  Serial.println();
  Serial.println("  1. CRC-16 Error Detection     - Detect bit corruption in transit");
  Serial.println("  2. Sequence Numbers           - Detect loss, duplicates, reordering");
  Serial.println("  3. ACK/NACK Protocol          - Confirm or reject message delivery");
  Serial.println("  4. Exponential Backoff        - Congestion-aware retry strategy");
  Serial.println("  5. Connection State Machine   - Manage session lifecycle");
  Serial.println();
  Serial.println("------------------------------------------------------------------------");
  Serial.println("CONFIGURATION:");
  Serial.printf("  Initial Timeout:    %d ms\n", INITIAL_TIMEOUT_MS);
  Serial.printf("  Maximum Timeout:    %d ms\n", MAX_TIMEOUT_MS);
  Serial.printf("  Maximum Retries:    %d\n", MAX_RETRIES);
  Serial.printf("  Jitter:             %d%%\n", JITTER_PERCENT);
  Serial.printf("  Packet Loss Rate:   %d%%\n", packetLossPercent);
  Serial.printf("  Corruption Rate:    %d%%\n", corruptionPercent);
  Serial.printf("  Keep-Alive Interval:%d ms\n", KEEPALIVE_INTERVAL_MS);
  Serial.println("------------------------------------------------------------------------");
  Serial.println();

  // LED test sequence
  Serial.println("Testing LEDs...");
  blinkLED(LED_SUCCESS, 2, 150);
  blinkLED(LED_ERROR, 2, 150);
  blinkLED(LED_TRANSMIT, 2, 150);
  blinkLED(LED_CONNECTED, 2, 150);
  Serial.println("LED test complete.");
  Serial.println();

  printHelp();

  Serial.println();
  Serial.println("Attempting initial connection...");
  Serial.println();

  // Attempt initial connection
  if (connect()) {
    Serial.println("==> Connected successfully! Ready to send messages.");
  } else {
    Serial.println("==> Connection failed. Type 'c' to retry.");
  }

  Serial.println();
}

// ============================================================================
// MAIN LOOP
// ============================================================================
void loop() {
  // Handle serial commands
  if (Serial.available()) {
    char cmd = Serial.read();
    handleCommand(cmd);
  }

  // Handle button press
  if (digitalRead(BTN_SEND) == HIGH) {
    if (millis() - lastButtonPress > 500) {  // Debounce
      lastButtonPress = millis();
      if (currentState == STATE_CONNECTED) {
        static int buttonMsgCount = 0;
        char msgBuffer[64];
        snprintf(msgBuffer, sizeof(msgBuffer), "Button press #%d at %lu ms",
                 ++buttonMsgCount, millis());
        sendWithReliability(msgBuffer);
      } else {
        Serial.println("[WARN] Not connected. Press 'c' to connect first.");
        blinkLED(LED_ERROR, 2, 100);
      }
    }
  }

  // Handle keep-alive for connected state
  if (currentState == STATE_CONNECTED) {
    handleKeepalive();
  }

  // Auto-send test messages when connected
  static unsigned long lastAutoSend = 0;
  if (autoSendMode && currentState == STATE_CONNECTED) {
    if (millis() - lastAutoSend > 5000) {  // Every 5 seconds
      lastAutoSend = millis();

      static int autoMsgCount = 0;
      char msgBuffer[64];
      float temp = 20.0 + (random(200) / 10.0);  // Random temp 20-40C
      float humidity = 40.0 + (random(400) / 10.0);  // Random humidity 40-80%

      snprintf(msgBuffer, sizeof(msgBuffer),
               "Sensor #%d: Temp=%.1fC, Humidity=%.1f%%",
               ++autoMsgCount, temp, humidity);

      Serial.println("========================================================================");
      Serial.printf(">>> AUTO-SEND MESSAGE #%d <<<\n", autoMsgCount);
      Serial.println("========================================================================");

      bool success = sendWithReliability(msgBuffer);

      if (success) {
        Serial.println("[RESULT] Message delivered successfully!");
      } else {
        Serial.println("[RESULT] Message delivery FAILED after max retries.");
      }

      // Print stats every 5 messages
      if (autoMsgCount % 5 == 0) {
        printStatistics();
      }

      Serial.println();
    }
  }

  // Update connection LED
  updateConnectionLED();

  delay(10);  // Small delay for stability
}

// ============================================================================
// COMMAND HANDLER
// ============================================================================
void handleCommand(char cmd) {
  switch (cmd) {
    case 'h':
    case '?':
      printHelp();
      break;

    case 's':
    case 'S':
      if (currentState == STATE_CONNECTED) {
        char msg[64];
        snprintf(msg, sizeof(msg), "Manual test message at %lu ms", millis());
        sendWithReliability(msg);
      } else {
        Serial.println("[WARN] Not connected. Type 'c' to connect first.");
      }
      break;

    case 'c':
    case 'C':
      if (currentState == STATE_DISCONNECTED) {
        connect();
      } else if (currentState == STATE_CONNECTED) {
        disconnect();
      } else {
        Serial.printf("[INFO] Currently in state: %s\n", stateNames[currentState]);
      }
      break;

    case 'v':
    case 'V':
      verboseMode = !verboseMode;
      Serial.printf("[CONFIG] Verbose mode: %s\n", verboseMode ? "ON" : "OFF");
      break;

    case 'a':
    case 'A':
      autoSendMode = !autoSendMode;
      Serial.printf("[CONFIG] Auto-send mode: %s\n", autoSendMode ? "ON" : "OFF");
      break;

    case 'r':
    case 'R':
      resetStatistics();
      Serial.println("[INFO] Statistics reset.");
      break;

    case 'p':
    case 'P':
      printStatistics();
      break;

    case '0':
      packetLossPercent = 0;
      Serial.println("[CONFIG] Packet loss: 0% (perfect network)");
      break;

    case '1': case '2': case '3': case '4': case '5':
    case '6': case '7': case '8': case '9':
      packetLossPercent = (cmd - '0') * 10;
      Serial.printf("[CONFIG] Packet loss: %d%%\n", packetLossPercent);
      break;

    case '\n':
    case '\r':
      // Ignore newlines
      break;

    default:
      Serial.printf("[WARN] Unknown command: '%c'. Type 'h' for help.\n", cmd);
      break;
  }
}

// ============================================================================
// PIN INITIALIZATION
// ============================================================================
void initializePins() {
  // Configure LED pins as outputs
  pinMode(LED_SUCCESS, OUTPUT);
  pinMode(LED_ERROR, OUTPUT);
  pinMode(LED_TRANSMIT, OUTPUT);
  pinMode(LED_CONNECTED, OUTPUT);

  // Configure button as input
  pinMode(BTN_SEND, INPUT);

  // Initialize all LEDs to off
  digitalWrite(LED_SUCCESS, LOW);
  digitalWrite(LED_ERROR, LOW);
  digitalWrite(LED_TRANSMIT, LOW);
  digitalWrite(LED_CONNECTED, LOW);
}

// ============================================================================
// LED FUNCTIONS
// ============================================================================
void setLED(int pin, bool state) {
  digitalWrite(pin, state ? HIGH : LOW);
}

void blinkLED(int pin, int times, int delayMs) {
  for (int i = 0; i < times; i++) {
    digitalWrite(pin, HIGH);
    delay(delayMs);
    digitalWrite(pin, LOW);
    delay(delayMs);
  }
}

void updateConnectionLED() {
  // Blue LED indicates connection state
  static unsigned long lastBlink = 0;
  static bool blinkState = false;

  switch (currentState) {
    case STATE_DISCONNECTED:
      setLED(LED_CONNECTED, false);
      break;

    case STATE_CONNECTING:
    case STATE_DISCONNECTING:
      // Blink slowly during transitions
      if (millis() - lastBlink > 250) {
        lastBlink = millis();
        blinkState = !blinkState;
        setLED(LED_CONNECTED, blinkState);
      }
      break;

    case STATE_CONNECTED:
      setLED(LED_CONNECTED, true);
      break;
  }
}

// ============================================================================
// CRC-16 CALCULATION (CRC-CCITT)
// ============================================================================
uint16_t calculateCRC16(const uint8_t* data, size_t length) {
  uint16_t crc = CRC16_INITIAL;

  for (size_t i = 0; i < length; i++) {
    crc ^= ((uint16_t)data[i] << 8);

    for (int bit = 0; bit < 8; bit++) {
      if (crc & 0x8000) {
        crc = (crc << 1) ^ CRC16_POLYNOMIAL;
      } else {
        crc = crc << 1;
      }
    }
  }

  return crc;
}

uint16_t calculateMessageCRC(ReliableMessage* msg) {
  // CRC covers header (excluding CRC field) and payload
  size_t crcLength = 8 + msg->payloadLen;  // Header (8 bytes) + payload
  return calculateCRC16((uint8_t*)msg, crcLength);
}

bool verifyCRC(ReliableMessage* msg) {
  uint16_t calculated = calculateMessageCRC(msg);
  bool valid = (calculated == msg->crc16);

  if (!valid && verboseMode) {
    Serial.printf("  [CRC] MISMATCH! Calculated: 0x%04X, Received: 0x%04X\n",
                  calculated, msg->crc16);
  }

  return valid;
}

void corruptRandomBit(ReliableMessage* msg) {
  // Corrupt a random bit in the payload (simulating transmission error)
  if (msg->payloadLen > 0) {
    int byteIndex = random(msg->payloadLen);
    int bitIndex = random(8);
    msg->payload[byteIndex] ^= (1 << bitIndex);

    if (verboseMode) {
      Serial.printf("  [CORRUPT] Flipped bit %d in payload byte %d\n",
                    bitIndex, byteIndex);
    }
  }
}

// ============================================================================
// MESSAGE CREATION
// ============================================================================
void createDataMessage(ReliableMessage* msg, const char* payload) {
  memset(msg, 0, sizeof(ReliableMessage));

  msg->type = MSG_DATA;
  msg->flags = 0x01;  // Requires ACK
  msg->sequenceNum = nextSequenceNum;
  msg->ackNum = 0;
  msg->payloadLen = min((int)strlen(payload), 63);
  msg->reserved = 0;

  strncpy(msg->payload, payload, msg->payloadLen);
  msg->payload[msg->payloadLen] = '\0';

  msg->crc16 = calculateMessageCRC(msg);
}

void createAckMessage(ReliableMessage* ack, uint16_t ackNum) {
  memset(ack, 0, sizeof(ReliableMessage));

  ack->type = MSG_ACK;
  ack->flags = 0;
  ack->sequenceNum = 0;
  ack->ackNum = ackNum;
  ack->payloadLen = 0;
  ack->reserved = 0;

  ack->crc16 = calculateMessageCRC(ack);
}

void createNackMessage(ReliableMessage* nack, uint16_t nackNum, const char* reason) {
  memset(nack, 0, sizeof(ReliableMessage));

  nack->type = MSG_NACK;
  nack->flags = 0;
  nack->sequenceNum = 0;
  nack->ackNum = nackNum;
  nack->payloadLen = min((int)strlen(reason), 63);
  nack->reserved = 0;

  strncpy(nack->payload, reason, nack->payloadLen);

  nack->crc16 = calculateMessageCRC(nack);
}

void createConnectMessage(ReliableMessage* msg) {
  memset(msg, 0, sizeof(ReliableMessage));

  msg->type = MSG_CONNECT;
  msg->flags = 0x01;  // Requires ACK
  msg->sequenceNum = 0;
  msg->payloadLen = 0;

  msg->crc16 = calculateMessageCRC(msg);
}

void createKeepaliveMessage(ReliableMessage* msg) {
  memset(msg, 0, sizeof(ReliableMessage));

  msg->type = MSG_KEEPALIVE;
  msg->flags = 0x01;  // Requires ACK
  msg->sequenceNum = nextSequenceNum;
  msg->payloadLen = 0;

  msg->crc16 = calculateMessageCRC(msg);
}

// ============================================================================
// MESSAGE PRINTING
// ============================================================================
void printMessage(ReliableMessage* msg, const char* direction) {
  if (!verboseMode) return;

  const char* typeName;
  switch (msg->type) {
    case MSG_DATA:      typeName = "DATA"; break;
    case MSG_ACK:       typeName = "ACK"; break;
    case MSG_NACK:      typeName = "NACK"; break;
    case MSG_CONNECT:   typeName = "CONNECT"; break;
    case MSG_CONNACK:   typeName = "CONNACK"; break;
    case MSG_DISCONNECT:typeName = "DISCONNECT"; break;
    case MSG_KEEPALIVE: typeName = "KEEPALIVE"; break;
    case MSG_KEEPALIVE_ACK: typeName = "KEEPALIVE_ACK"; break;
    default:            typeName = "UNKNOWN"; break;
  }

  Serial.println();
  Serial.printf("  +-------------- %s MESSAGE --------------+\n", direction);
  Serial.printf("  | Type:         %-26s |\n", typeName);
  Serial.printf("  | Flags:        0x%02X (ACK_REQ=%d, RETX=%d)     |\n",
                msg->flags, (msg->flags & 0x01), ((msg->flags & 0x02) >> 1));
  Serial.printf("  | Sequence:     %-26d |\n", msg->sequenceNum);
  Serial.printf("  | Ack Number:   %-26d |\n", msg->ackNum);
  Serial.printf("  | Payload Len:  %-26d |\n", msg->payloadLen);

  if (msg->payloadLen > 0) {
    if (msg->payloadLen <= 30) {
      Serial.printf("  | Payload:      %-26s |\n", msg->payload);
    } else {
      char truncated[28];
      strncpy(truncated, msg->payload, 24);
      truncated[24] = '\0';
      strcat(truncated, "...");
      Serial.printf("  | Payload:      %-26s |\n", truncated);
    }
  }

  Serial.printf("  | CRC-16:       0x%04X                       |\n", msg->crc16);
  Serial.println("  +-------------------------------------------+");
}

// ============================================================================
// NETWORK SIMULATION
// ============================================================================
bool simulatePacketLoss() {
  int roll = random(100);
  bool lost = (roll < packetLossPercent);

  if (lost) {
    stats.simulatedDrops++;
    if (verboseMode) {
      Serial.printf("  [NETWORK] Packet DROPPED (roll=%d < %d%%)\n",
                    roll, packetLossPercent);
    }
  }

  return lost;
}

bool simulateCorruption() {
  int roll = random(100);
  bool corrupted = (roll < corruptionPercent);

  if (corrupted) {
    stats.simulatedCorruptions++;
    if (verboseMode) {
      Serial.printf("  [NETWORK] Packet CORRUPTED (roll=%d < %d%%)\n",
                    roll, corruptionPercent);
    }
  }

  return corrupted;
}

int calculateBackoff(int retryCount) {
  // Exponential backoff: initial * 2^retry
  int baseTimeout = INITIAL_TIMEOUT_MS * (1 << retryCount);
  baseTimeout = min(baseTimeout, (int)MAX_TIMEOUT_MS);

  // Add random jitter (0 to JITTER_PERCENT% of timeout)
  int jitter = random((baseTimeout * JITTER_PERCENT) / 100);

  return baseTimeout + jitter;
}

// ============================================================================
// RELIABLE SEND WITH RETRY
// ============================================================================
bool sendWithReliability(const char* data) {
  if (currentState != STATE_CONNECTED) {
    Serial.println("[ERROR] Cannot send: not connected");
    return false;
  }

  ReliableMessage msg;
  ReliableMessage response;
  int retryCount = 0;
  bool delivered = false;

  // Create the data message
  createDataMessage(&msg, data);
  printMessage(&msg, "SEND");

  while (retryCount <= MAX_RETRIES && !delivered) {
    // Calculate timeout for this attempt
    int currentTimeout = calculateBackoff(retryCount);

    // Mark as retransmission if this is a retry
    if (retryCount > 0) {
      msg.flags |= 0x02;  // Set retransmit flag
      msg.crc16 = calculateMessageCRC(&msg);  // Recompute in case the CRC covers the flags byte
      stats.retransmissions++;
      stats.totalRetryDelayMs += currentTimeout;

      if (verboseMode) {
        Serial.println();
        Serial.printf("  [RETRY] Attempt %d/%d (timeout=%dms, backoff=2^%d)\n",
                      retryCount, MAX_RETRIES, currentTimeout, retryCount);
      }
    }

    // Turn on transmit LED
    setLED(LED_TRANSMIT, true);

    if (verboseMode) {
      Serial.printf("  [TX] Sending SEQ=%d, waiting up to %dms for ACK...\n",
                    msg.sequenceNum, currentTimeout);
    }

    stats.messagesSent++;

    // Simulate network transmission delay
    delay(50);

    // Turn off transmit LED
    setLED(LED_TRANSMIT, false);

    // ===== SIMULATE NETWORK CHANNEL =====

    // Check if packet was lost in transit
    if (simulatePacketLoss()) {
      // Packet lost - wait for timeout
      if (verboseMode) {
        Serial.printf("  [WAIT] Waiting %dms for ACK (packet was lost)...\n",
                      currentTimeout);
      }
      delay(currentTimeout);

      // Timeout occurred
      setLED(LED_ERROR, true);
      delay(150);
      setLED(LED_ERROR, false);

      stats.ackTimeouts++;
      if (verboseMode) {
        Serial.println("  [TIMEOUT] No ACK received - will retry");
      }

      retryCount++;
      continue;
    }

    // Check if packet was corrupted
    if (simulateCorruption()) {
      corruptRandomBit(&msg);
    }

    // Packet arrived at receiver - process it
    setLED(LED_TRANSMIT, true);
    delay(30);
    setLED(LED_TRANSMIT, false);

    // Receiver verifies CRC
    if (!verifyCRC(&msg)) {
      // CRC failed - receiver sends NACK
      stats.crcErrors++;
      stats.nacksReceived++;

      createNackMessage(&response, msg.sequenceNum, "CRC_ERROR");
      printMessage(&response, "RECV");

      if (verboseMode) {
        Serial.println("  [NACK] CRC error detected by receiver - will retry");
      }

      setLED(LED_ERROR, true);
      delay(150);
      setLED(LED_ERROR, false);

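      // Rebuild the message from the source data for the retry. Recomputing
      // the CRC alone would validate whatever bits were corrupted.
      createDataMessage(&msg, data);
      msg.flags |= 0x02;  // Next attempt is a retransmission
      msg.crc16 = calculateMessageCRC(&msg);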

      retryCount++;
      continue;
    }

    if (verboseMode) {
      Serial.println("  [RX] Message received and CRC verified OK");
    }

    // Check sequence number at receiver
    if (msg.sequenceNum < expectedSequenceNum &&
        !(msg.sequenceNum == 0 && expectedSequenceNum > 60000)) {
      // Duplicate detected (not wraparound)
      stats.duplicatesDetected++;
      if (verboseMode) {
        Serial.printf("  [RX] DUPLICATE (expected SEQ >= %d, got %d) - ACK anyway\n",
                      expectedSequenceNum, msg.sequenceNum);
      }
    } else if (msg.sequenceNum > expectedSequenceNum) {
      // Out of order (gap detected)
      stats.outOfOrderDetected++;
      if (verboseMode) {
        Serial.printf("  [RX] OUT OF ORDER (expected SEQ=%d, got %d)\n",
                      expectedSequenceNum, msg.sequenceNum);
      }
      expectedSequenceNum = msg.sequenceNum + 1;
    } else {
      // Normal in-order delivery
      expectedSequenceNum = msg.sequenceNum + 1;
    }

    // ===== RECEIVER SENDS ACK =====

    // Check if ACK is lost on the return path
    if (simulatePacketLoss()) {
      if (verboseMode) {
        Serial.println("  [NETWORK] ACK was LOST in transit!");
        Serial.printf("  [WAIT] Waiting %dms for ACK...\n", currentTimeout);
      }
      delay(currentTimeout);

      setLED(LED_ERROR, true);
      delay(150);
      setLED(LED_ERROR, false);

      stats.ackTimeouts++;
      retryCount++;
      continue;
    }

    // ACK received successfully
    createAckMessage(&response, msg.sequenceNum);
    printMessage(&response, "RECV");

    stats.acksReceived++;
    stats.messagesDelivered++;
    lastAckedSequence = msg.sequenceNum;

    if (verboseMode) {
      Serial.printf("  [ACK] Received ACK for SEQ=%d\n", response.ackNum);
    }

    setLED(LED_SUCCESS, true);
    delay(200);
    setLED(LED_SUCCESS, false);

    // Success - increment sequence number for next message
    nextSequenceNum++;
    delivered = true;
  }

  if (!delivered) {
    stats.messagesFailed++;
    Serial.printf("  [FAIL] Message delivery failed after %d retries\n", MAX_RETRIES);
    blinkLED(LED_ERROR, 3, 100);
  }

  stats.lastMessageTime = millis();
  return delivered;
}

// ============================================================================
// CONNECTION MANAGEMENT
// ============================================================================
void transitionState(ConnectionState newState) {
  if (verboseMode) {
    Serial.printf("[STATE] %s -> %s\n", stateNames[currentState], stateNames[newState]);
  }
  currentState = newState;
  updateConnectionLED();
}

bool connect() {
  if (currentState != STATE_DISCONNECTED) {
    Serial.printf("[WARN] Cannot connect: already in state %s\n",
                  stateNames[currentState]);
    return false;
  }

  transitionState(STATE_CONNECTING);
  stats.connectionAttempts++;

  Serial.println("[CONNECT] Initiating connection...");

  ReliableMessage connectMsg;
  createConnectMessage(&connectMsg);
  printMessage(&connectMsg, "SEND");

  // Simulate connection handshake with retries
  int retryCount = 0;
  while (retryCount < 3) {
    setLED(LED_TRANSMIT, true);
    delay(100);
    setLED(LED_TRANSMIT, false);

    // Check for simulated loss
    if (simulatePacketLoss()) {
      int timeout = calculateBackoff(retryCount);
      if (verboseMode) {
        Serial.printf("  [WAIT] Connection request lost, waiting %dms...\n", timeout);
      }
      delay(timeout);
      retryCount++;
      continue;
    }

    // Connection successful
    if (verboseMode) {
      Serial.println("  [RX] CONNACK received");
    }

    transitionState(STATE_CONNECTED);
    stats.connectionSuccesses++;
    lastKeepaliveTime = millis();
    nextSequenceNum = 0;
    expectedSequenceNum = 0;

    setLED(LED_SUCCESS, true);
    delay(300);
    setLED(LED_SUCCESS, false);

    Serial.println("[CONNECT] Connection established successfully!");
    return true;
  }

  // Connection failed
  transitionState(STATE_DISCONNECTED);
  Serial.println("[CONNECT] Connection FAILED after 3 attempts");
  blinkLED(LED_ERROR, 3, 150);
  return false;
}

void disconnect() {
  if (currentState != STATE_CONNECTED) {
    Serial.printf("[WARN] Cannot disconnect: in state %s\n",
                  stateNames[currentState]);
    return;
  }

  transitionState(STATE_DISCONNECTING);
  Serial.println("[DISCONNECT] Initiating graceful disconnect...");

  ReliableMessage disconnectMsg;
  memset(&disconnectMsg, 0, sizeof(ReliableMessage));
  disconnectMsg.type = MSG_DISCONNECT;
  disconnectMsg.crc16 = calculateMessageCRC(&disconnectMsg);

  printMessage(&disconnectMsg, "SEND");

  setLED(LED_TRANSMIT, true);
  delay(100);
  setLED(LED_TRANSMIT, false);

  // Don't wait for ACK on disconnect (best effort)
  transitionState(STATE_DISCONNECTED);
  Serial.println("[DISCONNECT] Disconnected.");
}

void handleKeepalive() {
  if (currentState != STATE_CONNECTED) return;

  if (millis() - lastKeepaliveTime >= KEEPALIVE_INTERVAL_MS) {
    stats.keepalivesSent++;

    if (verboseMode) {
      Serial.println("[KEEPALIVE] Sending keep-alive ping...");
    }

    ReliableMessage keepalive;
    createKeepaliveMessage(&keepalive);

    setLED(LED_TRANSMIT, true);
    delay(30);
    setLED(LED_TRANSMIT, false);

    // Count consecutive misses; a successful response resets the counter
    static int missedKeepalives = 0;

    // Simulate keep-alive response
    if (!simulatePacketLoss()) {
      lastKeepaliveTime = millis();
      missedKeepalives = 0;
      if (verboseMode) {
        Serial.println("[KEEPALIVE] Response received - connection healthy");
      }
    } else {
      // lastKeepaliveTime is left unchanged on a miss, so the probe is
      // retried on the next call instead of after a full interval
      stats.keepalivesTimedOut++;
      missedKeepalives++;
      if (verboseMode) {
        Serial.println("[KEEPALIVE] No response - connection may be degraded");
      }

      // After 3 consecutive failed keepalives, disconnect
      if (missedKeepalives >= 3) {
        Serial.println("[KEEPALIVE] Connection lost - 3 consecutive failures");
        transitionState(STATE_DISCONNECTED);
        missedKeepalives = 0;
        blinkLED(LED_ERROR, 3, 100);
      }
    }
  }
}

// ============================================================================
// STATISTICS
// ============================================================================
void printStatistics() {
  unsigned long elapsed = (millis() - stats.startTime) / 1000;
  if (elapsed == 0) elapsed = 1;

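  // Delivery rate is per message, not per transmission: messagesSent also
  // counts every retransmission, so it cannot serve as the denominator.
  unsigned long completed = stats.messagesDelivered + stats.messagesFailed;
  float deliveryRate = completed > 0 ?
    (float)stats.messagesDelivered * 100.0 / completed : 0;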

  float avgRetriesPerMsg = stats.messagesDelivered > 0 ?
    (float)stats.retransmissions / stats.messagesDelivered : 0;

  float throughput = (float)stats.messagesDelivered / elapsed;

  Serial.println();
  Serial.println("+======================================================================+");
  Serial.println("|                 RELIABILITY STATISTICS                              |");
  Serial.println("+======================================================================+");
  Serial.println("| TRANSMISSION                                                        |");
  Serial.printf("|   Messages Sent:        %-10lu                                  |\n",
                stats.messagesSent);
  Serial.printf("|   Messages Delivered:   %-10lu                                  |\n",
                stats.messagesDelivered);
  Serial.printf("|   Messages Failed:      %-10lu                                  |\n",
                stats.messagesFailed);
  Serial.printf("|   Delivery Rate:        %-10.1f%%                                 |\n",
                deliveryRate);
  Serial.println("+----------------------------------------------------------------------+");
  Serial.println("| ACKNOWLEDGMENTS                                                     |");
  Serial.printf("|   ACKs Received:        %-10lu                                  |\n",
                stats.acksReceived);
  Serial.printf("|   NACKs Received:       %-10lu                                  |\n",
                stats.nacksReceived);
  Serial.printf("|   ACK Timeouts:         %-10lu                                  |\n",
                stats.ackTimeouts);
  Serial.println("+----------------------------------------------------------------------+");
  Serial.println("| RETRY STATISTICS                                                    |");
  Serial.printf("|   Total Retransmissions:%-10lu                                  |\n",
                stats.retransmissions);
  Serial.printf("|   Avg Retries/Message:  %-10.2f                                  |\n",
                avgRetriesPerMsg);
  Serial.printf("|   Total Retry Delay:    %-10lu ms                               |\n",
                stats.totalRetryDelayMs);
  Serial.println("+----------------------------------------------------------------------+");
  Serial.println("| ERROR DETECTION                                                     |");
  Serial.printf("|   CRC Errors Detected:  %-10lu                                  |\n",
                stats.crcErrors);
  Serial.printf("|   Duplicates Detected:  %-10lu                                  |\n",
                stats.duplicatesDetected);
  Serial.printf("|   Out-of-Order:         %-10lu                                  |\n",
                stats.outOfOrderDetected);
  Serial.println("+----------------------------------------------------------------------+");
  Serial.println("| NETWORK SIMULATION                                                  |");
  Serial.printf("|   Packets Dropped:      %-10lu                                  |\n",
                stats.simulatedDrops);
  Serial.printf("|   Packets Corrupted:    %-10lu                                  |\n",
                stats.simulatedCorruptions);
  Serial.printf("|   Current Loss Rate:    %-10d%%                                  |\n",
                packetLossPercent);
  Serial.println("+----------------------------------------------------------------------+");
  Serial.println("| CONNECTION                                                          |");
  Serial.printf("|   Connection Attempts:  %-10lu                                  |\n",
                stats.connectionAttempts);
  Serial.printf("|   Connection Successes: %-10lu                                  |\n",
                stats.connectionSuccesses);
  Serial.printf("|   Keep-alives Sent:     %-10lu                                  |\n",
                stats.keepalivesSent);
  Serial.printf("|   Keep-alives Failed:   %-10lu                                  |\n",
                stats.keepalivesTimedOut);
  Serial.printf("|   Current State:        %-10s                                  |\n",
                stateNames[currentState]);
  Serial.println("+----------------------------------------------------------------------+");
  Serial.println("| PERFORMANCE                                                         |");
  Serial.printf("|   Elapsed Time:         %-10lu seconds                          |\n",
                elapsed);
  Serial.printf("|   Throughput:           %-10.2f messages/sec                     |\n",
                throughput);
  Serial.println("+======================================================================+");
  Serial.println();
}

void resetStatistics() {
  memset(&stats, 0, sizeof(ReliabilityStats));
  stats.startTime = millis();
  nextSequenceNum = 0;
  expectedSequenceNum = 0;
}

void printHelp() {
  Serial.println();
  Serial.println("------------------------------------------------------------------------");
  Serial.println("COMMANDS:");
  Serial.println("  h / ?   Show this help menu");
  Serial.println("  s       Send a test message manually");
  Serial.println("  c       Connect (if disconnected) or Disconnect (if connected)");
  Serial.println("  v       Toggle verbose mode (detailed logging)");
  Serial.println("  a       Toggle auto-send mode (periodic test messages)");
  Serial.println("  r       Reset statistics");
  Serial.println("  p       Print current statistics");
  Serial.println("  0-9     Set packet loss percentage (0=0%, 5=50%, 9=90%)");
  Serial.println("------------------------------------------------------------------------");
  Serial.println("LED INDICATORS:");
  Serial.println("  Green   - Successful transmission (ACK received)");
  Serial.println("  Red     - Error (timeout, CRC failure, max retries)");
  Serial.println("  Yellow  - Transmission in progress");
  Serial.println("  Blue    - Connection state (solid=connected, blink=transitioning)");
  Serial.println("------------------------------------------------------------------------");
  Serial.println("BUTTON: Press to send a message manually");
  Serial.println("------------------------------------------------------------------------");
}

21.4.6 Step-by-Step Instructions

Step 1: Set Up the Circuit

  1. Open the Wokwi simulator above (or visit wokwi.com/projects/new/esp32)
  2. Add four LEDs to the breadboard:
    • Green LED (success indicator)
    • Red LED (error indicator)
    • Yellow LED (transmission indicator)
    • Blue LED (connection state indicator)
  3. Add four 220 ohm resistors for the LEDs
  4. Add one push button and one 10K ohm pull-down resistor
  5. Wire the connections as shown in the circuit diagram:
    • GPIO 4 to Green LED anode (through 220 ohm resistor)
    • GPIO 2 to Red LED anode (through 220 ohm resistor)
    • GPIO 5 to Yellow LED anode (through 220 ohm resistor)
    • GPIO 18 to Blue LED anode (through 220 ohm resistor)
    • All LED cathodes to GND
    • 3.3V to push button, button to GPIO 15
    • GPIO 15 to GND through 10K ohm pull-down resistor

Step 2: Upload and Run the Code

  1. Copy the complete code into the Wokwi code editor
  2. Click “Start Simulation” to begin
  3. Open the Serial Monitor (set to 115200 baud)
  4. The system will automatically connect and start sending test messages

Step 3: Observe the Five Reliability Pillars

| Pillar | What to Observe |
|---|---|
| CRC-16 | Watch for “CRC verified OK” or “CRC MISMATCH” in serial output |
| Sequence Numbers | Each message shows SEQ=N incrementing |
| ACK/NACK | See ACK messages received after successful delivery |
| Exponential Backoff | On retries, the base timeout doubles (500ms, 1000ms, 2000ms, 4000ms); the logged value also includes up to 25% random jitter |
| Connection State | Blue LED shows connection status; state transitions logged |
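
The sequence check in sendWithReliability() handles wraparound with a special case for sequence number 0. A more general option, sketched here in portable C++ under the assumption of 16-bit sequence numbers, is serial-number comparison: a number counts as older when it trails by less than half the sequence space. The names seqBefore, SeqVerdict, and classifySequence are illustrative and do not appear in the lab code.

```cpp
#include <cstdint>

// True when a precedes b in 16-bit sequence space, tolerating wraparound:
// a is "before" b when b is 1..32767 steps ahead of a (mod 65536).
bool seqBefore(uint16_t a, uint16_t b) {
  uint16_t ahead = static_cast<uint16_t>(b - a);
  return ahead != 0 && ahead < 0x8000;
}

enum class SeqVerdict { InOrder, Duplicate, Gap };

// Hypothetical receiver-side classification of an incoming sequence number.
SeqVerdict classifySequence(uint16_t received, uint16_t expected) {
  if (received == expected) return SeqVerdict::InOrder;
  if (seqBefore(received, expected)) return SeqVerdict::Duplicate;  // already seen
  return SeqVerdict::Gap;                                           // something was skipped
}
```

With this comparison, a retransmitted SEQ=65535 arriving when the receiver expects SEQ=2 is still classified as a duplicate, which the single special case for 0 would not cover.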

Step 4: Experiment with Network Conditions

Use serial commands to change simulation parameters:

| Command | Effect |
|---|---|
| 0 | Perfect network (0% loss) |
| 3 | Moderate loss (30%) |
| 7 | High loss (70%) |
| v | Toggle verbose mode |
| p | Print statistics |

21.4.7 Understanding the Code

1. CRC-16 Error Detection
uint16_t calculateCRC16(const uint8_t* data, size_t length) {
  uint16_t crc = CRC16_INITIAL;  // 0xFFFF
  for (size_t i = 0; i < length; i++) {
    crc ^= ((uint16_t)data[i] << 8);
    for (int bit = 0; bit < 8; bit++) {
      if (crc & 0x8000) {
        crc = (crc << 1) ^ CRC16_POLYNOMIAL;  // 0x1021
      } else {
        crc = crc << 1;
      }
    }
  }
  return crc;
}

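This implements CRC-16/CCITT-FALSE: polynomial 0x1021, initial value 0xFFFF, processed MSB-first with no final XOR. The same polynomial is used in X.25, HDLC, XMODEM, and Bluetooth, though those protocols differ in initial value and bit order. Like any 16-bit CRC, it detects all single-bit errors and all burst errors up to 16 bits long.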
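
A quick way to confirm that a CRC routine matches the variant it claims to be is to run it over the standard check string "123456789". The sketch below is a desktop C++ port of the lab routine with the constants inlined in place of CRC16_INITIAL and CRC16_POLYNOMIAL; the helper detectsSingleBitFlips is illustrative and not part of the lab code. For the parameters above (polynomial 0x1021, initial value 0xFFFF, MSB-first), the published check value is 0x29B1.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Desktop port of the lab's calculateCRC16 (poly 0x1021, init 0xFFFF, MSB-first).
uint16_t calculateCRC16(const uint8_t* data, size_t length) {
  uint16_t crc = 0xFFFF;
  for (size_t i = 0; i < length; i++) {
    crc ^= static_cast<uint16_t>(data[i] << 8);
    for (int bit = 0; bit < 8; bit++) {
      crc = (crc & 0x8000) ? static_cast<uint16_t>((crc << 1) ^ 0x1021)
                           : static_cast<uint16_t>(crc << 1);
    }
  }
  return crc;
}

// Hypothetical helper: flips every bit of the buffer in turn and reports
// whether each single-bit error changes the CRC.
bool detectsSingleBitFlips(const uint8_t* data, size_t length) {
  assert(length <= 64);  // fixed scratch buffer, sized for the lab's small messages
  uint8_t copy[64];
  std::memcpy(copy, data, length);
  const uint16_t good = calculateCRC16(data, length);
  for (size_t i = 0; i < length * 8; i++) {
    copy[i / 8] ^= static_cast<uint8_t>(1u << (i % 8));  // inject a one-bit error
    if (calculateCRC16(copy, length) == good) return false;
    copy[i / 8] ^= static_cast<uint8_t>(1u << (i % 8));  // undo it
  }
  return true;
}
```

If calculateCRC16 returns anything other than 0x29B1 for the nine ASCII bytes "123456789", the port differs from the lab routine in polynomial, initial value, or bit order.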

2. Exponential Backoff with Jitter
int calculateBackoff(int retryCount) {
  // Exponential: initial * 2^retry
  int baseTimeout = INITIAL_TIMEOUT_MS * (1 << retryCount);
  baseTimeout = min(baseTimeout, (int)MAX_TIMEOUT_MS);

  // Random jitter: 0 to 25% of timeout
  int jitter = random((baseTimeout * JITTER_PERCENT) / 100);
  return baseTimeout + jitter;
}

The (1 << retryCount) efficiently calculates 2^n. Jitter prevents collision storms when multiple devices retry simultaneously.

Try It: Exponential Backoff Visualizer

Configure retry parameters to see how timeout grows with each attempt, including the random jitter range.
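
The same growth can be tabulated offline. The sketch below computes, for a given retry number, the smallest and largest timeout calculateBackoff() could return. It is portable C++, and the constant values are assumptions: 500 ms matches the progression quoted in Step 3 and 25% matches the jitter comment above, while the 8000 ms cap is a placeholder to be replaced with the lab's actual MAX_TIMEOUT_MS. BackoffRange and backoffRange are illustrative names.

```cpp
#include <algorithm>
#include <cassert>

// Assumed values; check them against the #defines at the top of the lab sketch.
constexpr int INITIAL_TIMEOUT_MS = 500;
constexpr int MAX_TIMEOUT_MS = 8000;   // placeholder cap
constexpr int JITTER_PERCENT = 25;

struct BackoffRange {
  int minMs;  // timeout with zero jitter
  int maxMs;  // largest timeout the jitter term can produce
};

BackoffRange backoffRange(int retryCount) {
  assert(retryCount >= 0 && retryCount < 16);  // keeps the shift within int range
  int base = std::min(INITIAL_TIMEOUT_MS * (1 << retryCount), MAX_TIMEOUT_MS);
  int jitterSpan = (base * JITTER_PERCENT) / 100;
  // Arduino's random(n) returns 0..n-1, so the largest jitter is span - 1.
  int maxJitter = jitterSpan > 0 ? jitterSpan - 1 : 0;
  return {base, base + maxJitter};
}
```

Under these assumptions the first attempt waits 500 to 624 ms, the fourth (retryCount = 3) waits 4000 to 4999 ms, and every attempt from retryCount = 4 onward is capped at a base of 8000 ms.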

3. Connection State Machine
void transitionState(ConnectionState newState) {
  Serial.printf("[STATE] %s -> %s\n",
                stateNames[currentState], stateNames[newState]);
  currentState = newState;
  updateConnectionLED();
}

The state machine ensures proper lifecycle: DISCONNECTED -> CONNECTING -> CONNECTED -> DISCONNECTING -> DISCONNECTED. In this lab implementation, guard checks in connect() and disconnect() prevent illegal transitions – for example, calling connect() while already connected returns immediately with a warning.
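
Note that transitionState() itself accepts any target state; the protection lives in the callers. One possible hardening, sketched here in portable C++, is a transition table that the state machine consults before changing state. The helper isValidTransition is hypothetical and not part of the lab code; the allowed edges are read off the lab listing, including the direct CONNECTED to DISCONNECTED drop used when keep-alives fail.

```cpp
enum ConnectionState {
  STATE_DISCONNECTED  = 0,
  STATE_CONNECTING    = 1,
  STATE_CONNECTED     = 2,
  STATE_DISCONNECTING = 3
};

// Hypothetical guard: returns true only for edges the lab code actually takes.
bool isValidTransition(ConnectionState from, ConnectionState to) {
  switch (from) {
    case STATE_DISCONNECTED:
      return to == STATE_CONNECTING;
    case STATE_CONNECTING:
      return to == STATE_CONNECTED || to == STATE_DISCONNECTED;     // success or failed handshake
    case STATE_CONNECTED:
      return to == STATE_DISCONNECTING || to == STATE_DISCONNECTED; // graceful close or keep-alive loss
    case STATE_DISCONNECTING:
      return to == STATE_DISCONNECTED;
  }
  return false;
}
```

Calling such a check at the top of transitionState() and logging rejected requests would surface illegal transitions even from code paths that bypass connect() and disconnect().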

21.4.8 Challenge Exercises

Challenge 1: Implement Selective Repeat

Modify the code to support a sliding window with selective retransmission:

#define WINDOW_SIZE 4
uint16_t windowBase = 0;
bool acked[WINDOW_SIZE] = {false};

// Only retransmit specifically NACKed packets,
// not the entire window

Goal: Increase throughput by sending multiple messages before waiting for ACKs.
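
As a possible starting point, the sketch below shows only the sender-side bookkeeping for a window of four messages: tracking what is in flight, recording out-of-order ACKs, and sliding the window base. It is portable C++ rather than a drop-in change to the lab, timers and retransmission are left out, and SenderWindow, canSend, markSent, and onAck are illustrative names.

```cpp
#include <cstdint>

constexpr uint16_t WINDOW_SIZE = 4;

struct SenderWindow {
  uint16_t windowBase = 0;            // oldest unacknowledged sequence number
  uint16_t nextSeq = 0;               // next sequence number to send
  bool acked[WINDOW_SIZE] = {false};  // per-slot ACK flags, indexed by seq % WINDOW_SIZE
};

// True while fewer than WINDOW_SIZE messages are in flight.
bool canSend(const SenderWindow& w) {
  return static_cast<uint16_t>(w.nextSeq - w.windowBase) < WINDOW_SIZE;
}

// Reserve the next sequence number; the caller checks canSend() first.
uint16_t markSent(SenderWindow& w) {
  w.acked[w.nextSeq % WINDOW_SIZE] = false;
  return w.nextSeq++;
}

// Record an ACK, then slide the window past every contiguous acknowledged slot.
void onAck(SenderWindow& w, uint16_t seq) {
  uint16_t offset = static_cast<uint16_t>(seq - w.windowBase);
  uint16_t inFlight = static_cast<uint16_t>(w.nextSeq - w.windowBase);
  if (offset >= inFlight) return;  // stale or unknown ACK, ignore
  w.acked[seq % WINDOW_SIZE] = true;
  while (w.windowBase != w.nextSeq && w.acked[w.windowBase % WINDOW_SIZE]) {
    w.acked[w.windowBase % WINDOW_SIZE] = false;
    w.windowBase++;
  }
}
```

An ACK for SEQ=2 that arrives before the ACK for SEQ=0 is remembered but does not move the window; only when SEQ=0 and SEQ=1 are acknowledged does the base jump to 3. Any slot between windowBase and nextSeq whose flag is still false is a candidate for selective retransmission.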

Challenge 2: Add CRC-32 Option

Implement CRC-32 alongside CRC-16 and compare:

uint32_t calculateCRC32(const uint8_t* data, size_t length);

// Add a configuration option to switch between CRC-16 and CRC-32
// Measure the CPU time difference

Goal: Understand the trade-off between error detection capability and computational overhead.
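
A minimal bitwise implementation to start from is sketched below. It assumes the challenge means the common IEEE 802.3 CRC-32 (reflected polynomial 0xEDB88320, initial value 0xFFFFFFFF, final XOR 0xFFFFFFFF), which the source does not specify. Unlike the lab's CRC-16, this variant processes bits LSB-first. Its published check value for "123456789" is 0xCBF43926.

```cpp
#include <cstddef>
#include <cstdint>

// Bitwise CRC-32 (IEEE 802.3 parameters); slow but table-free, which suits
// a first timing comparison against the lab's bitwise CRC-16.
uint32_t calculateCRC32(const uint8_t* data, size_t length) {
  uint32_t crc = 0xFFFFFFFFu;
  for (size_t i = 0; i < length; i++) {
    crc ^= data[i];
    for (int bit = 0; bit < 8; bit++) {
      crc = (crc & 1u) ? (crc >> 1) ^ 0xEDB88320u : (crc >> 1);
    }
  }
  return crc ^ 0xFFFFFFFFu;
}
```

Both routines run eight shift-and-XOR steps per byte, so the per-byte cost is likely to be similar on a 32-bit MCU; the measurable overhead is mostly the two extra checksum bytes on the wire.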

Challenge 3: Implement Quality of Service Levels

Add support for three QoS levels similar to MQTT:

  • QoS 0: Fire and forget (no ACK)
  • QoS 1: At least once (ACK required)
  • QoS 2: Exactly once (two-phase handshake)

enum QoSLevel { QOS_0, QOS_1, QOS_2 };
bool sendWithQoS(const char* data, QoSLevel qos);

Goal: Understand how different reliability levels affect latency and overhead.
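
To put numbers on that trade-off before writing sendWithQoS(), the sketch below counts the packets and round trips one loss-free delivery costs at each level, following MQTT's packet flows (PUBLISH alone, PUBLISH plus PUBACK, and PUBLISH, PUBREC, PUBREL, PUBCOMP). The helper names packetsPerDelivery and roundTripsPerDelivery are illustrative.

```cpp
enum QoSLevel { QOS_0, QOS_1, QOS_2 };

// Packets exchanged for one delivery when nothing is lost.
int packetsPerDelivery(QoSLevel qos) {
  switch (qos) {
    case QOS_0: return 1;  // PUBLISH
    case QOS_1: return 2;  // PUBLISH, PUBACK
    case QOS_2: return 4;  // PUBLISH, PUBREC, PUBREL, PUBCOMP
  }
  return 0;
}

// Round trips the sender waits through before the exchange is complete.
int roundTripsPerDelivery(QoSLevel qos) {
  switch (qos) {
    case QOS_0: return 0;
    case QOS_1: return 1;
    case QOS_2: return 2;
  }
  return 0;
}
```

The lab's existing sendWithReliability() corresponds to the QoS 1 row: one data message, one ACK, one round trip per successful attempt.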

21.4.9 Expected Outcomes

After completing this lab, you should observe:

| Metric | Expected Value (20% loss) | Meaning |
|---|---|---|
| Delivery Rate | 95-100% | Retries recover most losses |
| Avg Retries/Message | 0.3-0.5 | Some messages need 1-2 retries |
| CRC Errors | 2-5% of messages | Corruption detected and recovered |
| Throughput | ~0.15 msg/sec | Limited by timeouts and retries |

Key Insights:

  1. Reliability has a cost: Higher packet loss means more retries, more delay, and more power consumption
  2. CRC catches corruption: Bit errors are detected before corrupted data reaches the application
  3. Backoff prevents congestion: Exponential backoff reduces network load during problems
  4. State machines prevent leaks: Proper connection management ensures clean lifecycle
Try It: Delivery Rate Estimator

Model how packet loss, corruption, and retry limits affect delivery success and overhead.
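
One way to approximate this model in code is a closed-form estimate based on how sendWithReliability() simulates the channel: an attempt succeeds only if the data packet is not lost, is not corrupted, and its ACK is not lost. The sketch below is portable C++ and assumes the three events are independent; estimateDelivery and DeliveryEstimate are illustrative names, and the retry limit is a parameter because the lab's MAX_RETRIES value is not shown in this section.

```cpp
#include <cassert>
#include <cmath>

struct DeliveryEstimate {
  double deliveryProbability;    // chance the sender sees an ACK within the retry budget
  double expectedTransmissions;  // mean number of transmissions per message
};

DeliveryEstimate estimateDelivery(double lossRate, double corruptionRate, int maxRetries) {
  assert(lossRate >= 0.0 && lossRate <= 1.0);
  assert(corruptionRate >= 0.0 && corruptionRate <= 1.0);
  assert(maxRetries >= 0);

  // Data survives loss, survives corruption, and the ACK survives loss.
  double perAttempt = (1.0 - lossRate) * (1.0 - corruptionRate) * (1.0 - lossRate);
  int attempts = maxRetries + 1;  // first transmission plus retries
  double allFail = std::pow(1.0 - perAttempt, attempts);

  // Mean of a geometric distribution truncated at 'attempts' trials.
  double expectedTx = perAttempt > 0.0 ? (1.0 - allFail) / perAttempt
                                       : static_cast<double>(attempts);
  return {1.0 - allFail, expectedTx};
}
```

For example, with 20% loss, an assumed 5% corruption rate, and an assumed limit of 3 retries, each attempt succeeds with probability 0.608, giving a delivery probability of about 97.6% at a cost of about 1.6 transmissions per message.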


Common Mistake: Forgetting to Handle Connection State Timeouts

The Mistake: Developers implement connection state machines with STATE_CONNECTED but forget to set timeouts for detecting dead connections. The device believes it’s connected while the network path is broken.

Real-World Consequence: A smart irrigation system used MQTT over TCP with keep-alive set to 5 minutes. NAT routers on cellular networks time out idle connections after 90 seconds. Between MQTT keep-alives (at 300 seconds), the NAT mapping expired. The device remained in STATE_CONNECTED, sending data packets that were silently dropped by the NAT router. No error was detected until the next keep-alive attempt 300 seconds later—meanwhile, 4 critical irrigation commands were lost.

What Should Have Happened:

Proper Keep-Alive Configuration:

// WRONG - Keep-alive longer than NAT timeout
mqtt.setKeepAlive(300);  // 5 minutes

// RIGHT - Keep-alive shorter than minimum NAT timeout
mqtt.setKeepAlive(60);   // 1 minute (well under 90s NAT timeout)

Connection State Timeout Tracking:

enum ConnectionState {
    STATE_DISCONNECTED = 0,
    STATE_CONNECTING   = 1,
    STATE_CONNECTED    = 2,
    STATE_DISCONNECTING= 3,
    STATE_ERROR        = 4  // Add error state (stateNames needs a matching "ERROR" entry)
};

unsigned long lastRxTime = 0;
unsigned long lastTxTime = 0;
const unsigned long RX_TIMEOUT_MS = 120000;  // 2 minutes

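void checkConnectionHealth() {
    if (currentState != STATE_CONNECTED) return;

    unsigned long now = millis();

    // Check if we've received anything recently
    if (now - lastRxTime > RX_TIMEOUT_MS) {
        Serial.println("[ERROR] No data received for 2 min - connection dead");
        transitionState(STATE_ERROR);
        // The lab's disconnect() and connect() both guard on the current
        // state, so drop to DISCONNECTED directly before reconnecting.
        transitionState(STATE_DISCONNECTED);
        if (connect()) {
            lastRxTime = millis();  // Restart both timers for the new session
            lastTxTime = lastRxTime;
        }
        return;
    }

    // Check if keep-alive is overdue (1.5x the interval, integer math)
    if (now - lastTxTime > KEEPALIVE_INTERVAL_MS + KEEPALIVE_INTERVAL_MS / 2) {
        Serial.println("[WARN] Keep-alive overdue - sending ping");
        sendKeepalive();
    }
}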

NAT-Aware Keep-Alive Design:

| Network Type | NAT Timeout | Recommended Keep-Alive | Safety Margin |
|---|---|---|---|
| Wi-Fi (home router) | 120-300 s | 60 s | 2-5× |
| Cellular (LTE-M) | 30-90 s | 25 s | 1.2-3.6× |
| Satellite | 60 s | 30 s | 2× |
| Enterprise NAT | 180-600 s | 120 s | 1.5-5× |
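
The safety margin in the last column is simply the NAT timeout divided by the keep-alive interval. A small helper makes the check explicit at configuration time; this is a portable C++ sketch with illustrative names (keepaliveSafetyMargin, keepaliveIsSafe), and the minimum acceptable margin is left as a parameter because the right threshold depends on the deployment.

```cpp
#include <cassert>

// Ratio of NAT idle timeout to keep-alive interval. Below 1.0 the NAT
// mapping expires before the next keep-alive is sent.
double keepaliveSafetyMargin(unsigned natTimeoutSeconds, unsigned keepaliveSeconds) {
  assert(keepaliveSeconds > 0);
  return static_cast<double>(natTimeoutSeconds) / keepaliveSeconds;
}

// Hypothetical policy check: minMargin is chosen by the caller.
bool keepaliveIsSafe(unsigned natTimeoutSeconds, unsigned keepaliveSeconds, double minMargin) {
  return keepaliveSafetyMargin(natTimeoutSeconds, keepaliveSeconds) >= minMargin;
}
```

Applied to the irrigation example, a 300 s keep-alive against a 90 s NAT timeout gives a margin of 0.3, so the mapping is certain to expire between pings; the corrected 60 s setting gives 1.5.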

Debugging Dead Connections:

void monitorConnectionState() {
    Serial.printf("[STATE] Current: %s | RX age: %lu ms | TX age: %lu ms\n",
                  stateNames[currentState],
                  millis() - lastRxTime,
                  millis() - lastTxTime);

    if (currentState == STATE_CONNECTED) {
        // Verify we can actually reach the peer
        if (millis() - lastRxTime > 30000) {  // 30s no RX
            Serial.println("[WARN] Zombie connection suspected - testing with ping");
            if (!sendKeepalive()) {
                Serial.println("[ERROR] Keep-alive failed - connection is dead");
                transitionState(STATE_ERROR);
            }
        }
    }
}

Lesson Learned:

  • Keep-alive must be shorter than minimum NAT timeout (typically 25-60 seconds for cellular)
  • Track last RX timestamp - if no data received for 2× keep-alive interval, assume connection is dead
  • Implement connection health monitoring - periodic checks that transition to STATE_ERROR when timeouts are detected
  • Add reconnection logic - automatic recovery from dead connection states
  • Log timeout events - record when keep-alive fails so you can tune intervals for your deployment

Rule of Thumb: Keep-alive interval should be 40-50% of the minimum NAT timeout in your network path. For cellular IoT, use 25-30 seconds. For Wi-Fi, 60 seconds is usually safe. Always measure actual NAT behavior in your production environment.

21.5 See Also

Related Concepts:

  • TCP's retransmission timeout (RTO) doubles after each unacknowledged retransmission, the same exponential backoff pattern as this lab
  • Selective Repeat ARQ extends Stop-and-Wait to a sliding window (Challenge 1 above)
  • DTLS handshake retransmission uses the same timeout-and-retransmit approach, doubling its timer on each retry

Common Pitfalls

Relying only on TCP keepalive for connection health detection can leave connections “stuck” for hours. With the Linux defaults, the first probe is sent after 7,200 s (2 hours) of idle time, followed by up to 9 probes at 75 s intervals, so a dead peer is declared only after 7,200 + 9 × 75 = 7,875 s (about 2 hours 11 minutes). IoT applications should implement application-level timeouts: if no expected data arrives within N seconds (N = 2× expected message interval), close and reopen the connection. TCP keepalive with default settings is meant to reclaim long-dead connections, not to detect failures quickly; if you do use it, lower TCP_KEEPIDLE, TCP_KEEPINTVL, and TCP_KEEPCNT per socket.
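
The arithmetic and the per-socket tuning can be sketched as follows. This is Linux-specific C++ (the TCP_KEEPIDLE, TCP_KEEPINTVL, and TCP_KEEPCNT option names are the Linux ones); KeepaliveConfig, worstCaseDetectionSeconds, and enableKeepalive are illustrative names, and the function assumes fd is an already-created TCP socket.

```cpp
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

struct KeepaliveConfig {
  int idleSeconds;      // idle time before the first probe (TCP_KEEPIDLE)
  int intervalSeconds;  // gap between probes (TCP_KEEPINTVL)
  int probeCount;       // unanswered probes before the connection is dropped (TCP_KEEPCNT)
};

// Time from the last traffic until the kernel gives up on a silent peer.
constexpr int worstCaseDetectionSeconds(const KeepaliveConfig& c) {
  return c.idleSeconds + c.intervalSeconds * c.probeCount;
}

// Enables keepalive on a TCP socket and applies the three timing options.
// Returns false if any setsockopt call fails.
bool enableKeepalive(int fd, const KeepaliveConfig& c) {
  int on = 1;
  return setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) == 0 &&
         setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &c.idleSeconds, sizeof(c.idleSeconds)) == 0 &&
         setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &c.intervalSeconds, sizeof(c.intervalSeconds)) == 0 &&
         setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &c.probeCount, sizeof(c.probeCount)) == 0;
}
```

The Linux defaults {7200, 75, 9} give 7,875 s, whereas a tuned configuration such as {30, 10, 3} detects a silent peer in 60 s. An application-level timeout is still advisable, since TCP keepalive only proves the peer's TCP stack is alive, not the application behind it.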

The side that closes a TCP connection first enters TIME_WAIT for 2×MSL (Maximum Segment Lifetime); Linux uses a fixed 60 s. A gateway that opens 1,000 short-lived connections per second to the same server accumulates 60,000 sockets in TIME_WAIT within 60 seconds, more than the default Linux ephemeral port range (32768-60999, about 28,000 ports) can supply. Mitigate with connection pooling (keep connections alive for multiple transactions), SO_REUSEADDR (lets a restarted server bind to a port that still has sockets in TIME_WAIT), or SO_LINGER with a zero timeout (immediate RST on close; caution: unsent data is discarded).

Debugging intermittent IoT connection drops without a packet capture is speculation. Start a Wireshark capture before reproducing the issue and look for: RST packets (connection reset by remote), TCP retransmissions (packet loss), window size reaching 0 (receiver buffer full), and delta time between packets (latency spikes). A 30-second packet capture of a connection failure saves hours of log analysis.

“Connection refused” (errno ECONNREFUSED) means the target IP is reachable but no process is listening on that port — check service status. “Connection timed out” (errno ETIMEDOUT) means packets are not reaching the destination at all — check routing, firewall, and network connectivity. Treating both as “connection failure” and applying the same fix (restart service) wastes time. Always distinguish the error type from socket error codes before investigating.
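
In code, that distinction amounts to branching on errno after a failed connect(). The sketch below is portable POSIX-style C++; ConnectFailure, classifyConnectError, and nextStep are illustrative names, and the suggested next steps simply restate the guidance above.

```cpp
#include <cerrno>

enum class ConnectFailure { Refused, TimedOut, Other };

// Maps the errno value left by a failed connect() to a diagnosis category.
ConnectFailure classifyConnectError(int err) {
  switch (err) {
    case ECONNREFUSED: return ConnectFailure::Refused;   // host reachable, nothing listening
    case ETIMEDOUT:    return ConnectFailure::TimedOut;  // no reply came back at all
    default:           return ConnectFailure::Other;
  }
}

// Where to look first for each category.
const char* nextStep(ConnectFailure failure) {
  switch (failure) {
    case ConnectFailure::Refused:  return "check that the service is running and bound to the port";
    case ConnectFailure::TimedOut: return "check routing, firewall rules, and network connectivity";
    case ConnectFailure::Other:    return "inspect the specific errno value";
  }
  return "";
}
```

Logging the category alongside the raw errno value keeps the two failure modes separate in device logs, so a fleet-wide spike in timeouts is not mistaken for a crashed service.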

21.6 What’s Next

After completing this connection reliability lab, continue with these chapters to apply and extend what you have built:

| Next Chapter | Topic | Why It Follows |
|---|---|---|
| Transport Optimizations | Advanced reliability tuning and IoT-specific adaptations | Apply the parameters you configured in this lab to real-world scenarios |
| CoAP Overview | CoAP Confirmable messages and retransmission | See how a real IoT protocol implements the same CON/ACK pattern you built |
| MQTT QoS and Session | MQTT QoS 0/1/2 reliability levels | Compare MQTT’s broker-mediated reliability against your direct ACK design |
| MQTT Fundamentals | MQTT protocol architecture and broker model | Extend your connection state machine knowledge to a widely deployed IoT protocol |
| Reliability Overview | Five reliability pillars framework | Revisit the conceptual foundations now that you have implemented all five pillars |
| Error Detection | CRC algorithms and checksum theory | Deepen your understanding of the CRC-16 code you wrote in this lab |