21 Connection Reliability Lab
Key Concepts
- TCP State Machine: 11 states (CLOSED, LISTEN, SYN_SENT, SYN_RECEIVED, ESTABLISHED, FIN_WAIT_1/2, CLOSE_WAIT, CLOSING, LAST_ACK, TIME_WAIT); understanding state transitions is essential for debugging connection issues
- netstat / ss: Linux commands for displaying active TCP connections and their states; ss -tn shows TCP connections with numeric addresses; useful for diagnosing connection state leaks
- Wireshark TCP Stream: TCP stream reassembly feature displaying the full conversation between two endpoints; essential for diagnosing protocol-level issues in IoT connection reliability
- TCP Keepalive: Three socket options: TCP_KEEPIDLE (seconds before first probe), TCP_KEEPINTVL (probe interval), TCP_KEEPCNT (number of probes before dropping); defaults vary by OS
- SO_LINGER Socket Option: Configures TCP close behavior: linger=0 causes immediate RST on close (no TIME_WAIT); linger>0 waits specified seconds for data to flush; affects how quickly ports can be reused
- Connection Timeout vs Read Timeout: Connection timeout: maximum time to complete TCP handshake; read timeout: maximum time to receive any data after connection; both must be configured for robust IoT clients
- Network Namespace: Linux kernel feature isolating network stack (interfaces, routing, firewall rules) per namespace; used in labs to simulate network conditions without physical hardware
- tc netem: Linux Traffic Control Network Emulator; adds delay, jitter, packet loss, duplication, and corruption to network interfaces for lab simulation of poor network conditions
Learning Objectives
By the end of this section, you will be able to:
- Design connection state machines: Build robust connection management with proper state transitions
- Implement keep-alive mechanisms: Monitor connection health for long-lived IoT sessions
- Diagnose connection failures: Identify root causes of and recover from unexpected disconnections
- Construct a complete reliable transport: Implement all five reliability pillars in working code
- Evaluate reliability metrics: Assess delivery rate, retry overhead, and throughput against targets
- Configure reliability parameters: Select and adjust settings based on observed network conditions
For Beginners: Connection Reliability Lab
This lab lets you experiment with network reliability by building connections that can handle real-world problems like dropped messages, out-of-order delivery, and network delays. Think of it as stress-testing a delivery service – you will see what happens when packages get lost and how systems recover automatically.
Sensor Squad: Putting It All Together!
“This lab combines everything we have learned into a working system,” said Max the Microcontroller. “Error detection, retries, sequence numbers, connection states, and keep-alive – all running on a real ESP32.”
“The connection state machine is the backbone,” explained Sammy the Sensor. “It tracks whether we are disconnected, connecting, connected, or in an error state. Each state has rules for what can happen next – you cannot send data while disconnected, and you cannot disconnect while already disconnected.”
“Keep-alive heartbeats detect silent failures,” added Lila the LED. “NAT routers time out idle connections after 30 to 120 seconds. Cell networks silently drop sessions. Without heartbeats, you think you are connected while the network path is broken – a zombie connection.”
“By the end, you will measure delivery rates, retry overhead, and effective throughput,” said Bella the Battery. “These are the metrics that tell you if your reliability implementation is actually working. Theory is great, but numbers do not lie!”
21.1 Prerequisites
Before diving into this chapter, you should be familiar with:
- Error Detection: CRC and checksums for detecting corrupted data
- Retry and Sequencing: Exponential backoff and sequence number handling
- Reliability Overview: The parent chapter introducing all five reliability pillars
- C/C++ programming: The lab uses Arduino/ESP32 code
Why Connection State Matters
Long-lived IoT connections are fragile. NAT routers time out idle connections after 30-120 seconds. Cellular networks may silently drop sessions. Wi-Fi access points reboot for updates. Without proper connection state management, your device may believe it is connected while the network path is broken – leading to lost data and unresponsive actuators.
21.2 Connection State Machines
Connection state machines track the lifecycle of a communication session. Proper state management prevents resource leaks, handles unexpected disconnections, and ensures clean shutdown.
21.2.1 Basic Connection States
21.2.2 State Transition Rules
| Current State | Event | Next State | Action |
|---|---|---|---|
| DISCONNECTED | connect() | CONNECTING | Send CONNECT message |
| CONNECTING | ACK received | CONNECTED | Start keep-alive timer |
| CONNECTING | Timeout (after retries) | DISCONNECTED | Report failure |
| CONNECTED | Data to send | CONNECTED | Send with reliability |
| CONNECTED | Keep-alive timeout | DISCONNECTED | Clean up resources |
| CONNECTED | disconnect() | DISCONNECTING | Send DISCONNECT |
| DISCONNECTING | ACK received | DISCONNECTED | Clean up complete |
| DISCONNECTING | Timeout | DISCONNECTED | Force clean up |
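The transition table above can also be expressed directly as data. The following is a minimal sketch, not the lab's implementation: the `State`, `Event`, `Transition`, and `applyEvent` names are hypothetical and chosen only for illustration. Each array row mirrors one table row, and any event that has no row for the current state is rejected.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical names for illustration; the lab code uses its own enum.
enum class State { Disconnected, Connecting, Connected, Disconnecting };
enum class Event { ConnectRequested, AckReceived, Timeout, DataToSend,
                   KeepAliveTimeout, DisconnectRequested };

struct Transition {
    State from;
    Event event;
    State to;
};

// One row per line of the transition table.
static const Transition kTransitions[] = {
    {State::Disconnected,  Event::ConnectRequested,    State::Connecting},
    {State::Connecting,    Event::AckReceived,         State::Connected},
    {State::Connecting,    Event::Timeout,             State::Disconnected},
    {State::Connected,     Event::DataToSend,          State::Connected},
    {State::Connected,     Event::KeepAliveTimeout,    State::Disconnected},
    {State::Connected,     Event::DisconnectRequested, State::Disconnecting},
    {State::Disconnecting, Event::AckReceived,         State::Disconnected},
    {State::Disconnecting, Event::Timeout,             State::Disconnected},
};

// Returns true and updates `state` if the table allows the event;
// returns false and leaves `state` untouched otherwise.
bool applyEvent(State& state, Event event) {
    for (const Transition& t : kTransitions) {
        if (t.from == state && t.event == event) {
            state = t.to;
            return true;
        }
    }
    return false;
}
```

Keeping the rules in one table makes illegal transitions (for example, sending data while disconnected) fail in a single, visible place rather than being scattered across `if` statements.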
21.2.3 Implementation Pattern
enum ConnectionState {
    STATE_DISCONNECTED  = 0,
    STATE_CONNECTING    = 1,
    STATE_CONNECTED     = 2,
    STATE_DISCONNECTING = 3
};
// Printable names, indexed by ConnectionState
const char* stateNames[] = {
    "DISCONNECTED", "CONNECTING", "CONNECTED", "DISCONNECTING"
};
ConnectionState currentState = STATE_DISCONNECTED;
void transitionState(ConnectionState newState) {
    // Log transition for debugging
    Serial.printf("[STATE] %s -> %s\n",
                  stateNames[currentState],
                  stateNames[newState]);
    // Perform exit actions for old state
    switch (currentState) {
        case STATE_CONNECTED:
            stopKeepaliveTimer();
            break;
        // ... other exit actions
        default:
            break;
    }
    // Update state
    currentState = newState;
    // Perform entry actions for new state
    switch (newState) {
        case STATE_CONNECTED:
            startKeepaliveTimer();
            resetSequenceNumbers();
            break;
        // ... other entry actions
        default:
            break;
    }
}
21.3 Keep-Alive Mechanism
Long-lived connections may go silent during idle periods. Keep-alive messages verify the connection is still healthy.
21.3.1 Keep-Alive Protocol
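A keep-alive exchange boils down to two timers: how long the link may stay idle before a ping is sent, and how long to wait for the reply. The sketch below is one possible way to track them, assuming a millisecond clock such as `millis()`. `KeepAliveMonitor`, `makeMonitor`, `onPeerActivity`, and `poll` are hypothetical helpers, not part of the lab code.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: tracks at most one outstanding ping.
struct KeepAliveMonitor {
    uint32_t intervalMs;      // idle time before a ping is sent
    uint32_t timeoutMs;       // how long to wait for the reply
    uint32_t lastActivityMs;  // last time traffic arrived from the peer
    uint32_t pingSentMs;      // when the outstanding ping was sent
    bool     awaitingReply;   // true while a ping is unanswered
};

enum class KeepAliveAction { None, SendPing, DeclareDead };

KeepAliveMonitor makeMonitor(uint32_t intervalMs, uint32_t timeoutMs, uint32_t nowMs) {
    assert(intervalMs > 0 && timeoutMs > 0);
    return KeepAliveMonitor{intervalMs, timeoutMs, nowMs, 0, false};
}

// Call whenever any valid message (data, ACK, ping reply) arrives.
void onPeerActivity(KeepAliveMonitor& m, uint32_t nowMs) {
    m.lastActivityMs = nowMs;
    m.awaitingReply = false;
}

// Call periodically. Unsigned subtraction keeps the comparisons
// correct even when the millisecond counter rolls over.
KeepAliveAction poll(KeepAliveMonitor& m, uint32_t nowMs) {
    if (m.awaitingReply) {
        if (nowMs - m.pingSentMs >= m.timeoutMs) {
            return KeepAliveAction::DeclareDead;
        }
        return KeepAliveAction::None;
    }
    if (nowMs - m.lastActivityMs >= m.intervalMs) {
        m.awaitingReply = true;
        m.pingSentMs = nowMs;
        return KeepAliveAction::SendPing;
    }
    return KeepAliveAction::None;
}
```

With a 10000 ms interval and a 3000 ms timeout (the values the lab uses for `KEEPALIVE_INTERVAL_MS` and `KEEPALIVE_TIMEOUT_MS`), a silent peer is declared dead 13 seconds after its last message. A production design might tolerate several missed replies before giving up; this sketch gives up after one.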
21.3.2 Protocol-Specific Keep-Alive
| Protocol | Mechanism | Typical Interval | Notes |
|---|---|---|---|
| MQTT | PINGREQ/PINGRESP | 30-300s | Configurable, broker enforces |
| CoAP | Empty CON message | 30-120s | Optional, not standardized |
| TCP | TCP keepalive | 2 hours (default) | OS-level, often disabled for IoT |
| WebSocket | Ping/Pong frames | 30-60s | Built into protocol |
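Because the OS-level TCP default is far too long for most IoT links, applications that rely on TCP keepalive usually shorten it per socket. The sketch below assumes a Linux-style sockets API, where the options are named `TCP_KEEPIDLE`, `TCP_KEEPINTVL`, and `TCP_KEEPCNT`; other platforms may spell them differently. `enableTcpKeepalive` and `readTcpOption` are hypothetical helper names.

```cpp
#include <cassert>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <unistd.h>

// Enables TCP keepalive on an existing socket and overrides the
// system-wide timing defaults. Returns true if every option was accepted.
bool enableTcpKeepalive(int fd, int idleSec, int intervalSec, int probeCount) {
    assert(idleSec > 0 && intervalSec > 0 && probeCount > 0);
    int on = 1;
    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) != 0) return false;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idleSec, sizeof(idleSec)) != 0) return false;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intervalSec, sizeof(intervalSec)) != 0) return false;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &probeCount, sizeof(probeCount)) != 0) return false;
    return true;
}

// Reads back one integer TCP-level option, or returns -1 on error.
int readTcpOption(int fd, int option) {
    int value = 0;
    socklen_t len = sizeof(value);
    if (getsockopt(fd, IPPROTO_TCP, option, &value, &len) != 0) return -1;
    return value;
}
```

For example, `enableTcpKeepalive(fd, 25, 5, 3)` asks the stack to send the first probe after 25 idle seconds and then up to 3 probes 5 seconds apart, so a dead peer would be noticed roughly 25 + 3 × 5 = 40 seconds after the last traffic.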
NAT Timeout Considerations
NAT routers and firewalls often drop “idle” connections after 30-120 seconds. Your keep-alive interval must be shorter than the shortest NAT timeout in the path, or connections will silently break.
Recommendation: Use 25-second keep-alive intervals for cellular IoT, which leaves margin even against 30-second NAT timeouts. For Wi-Fi home routers (120-300 second NAT timeouts), 60 seconds is typically sufficient.
Putting Numbers to It
Keep-alive interval must be shorter than NAT timeout to prevent zombie connections. The safety margin prevents timing races:
$ T_{keepalive} \le T_{NAT} \times S_{margin} $
where \(S_{margin}\) = safety factor (typically 0.5-0.8).
Worked example: Cellular network with NAT timeout \(T_{NAT}\) = 90 seconds, using \(S_{margin}\) = 0.7.
$ T_{keepalive} = 90 \times 0.7 = 63 \text{ seconds} $
With a 63-second keep-alive, the connection sends a ping every 63 seconds. The NAT entry resets at each ping and never reaches the 90-second timeout. If a keep-alive is lost, 90 - 63 = 27 seconds remain to detect the failure and retry before the NAT entry expires. A 30-second keep-alive widens that window to 90 - 30 = 60 seconds, but roughly doubles control overhead (2 pings/min versus about 1 ping/min).
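The same arithmetic can be captured in two small helpers. This is a sketch with hypothetical function names; the safety factor is passed as a whole-number percentage so the calculation stays in integer arithmetic and avoids floating-point rounding surprises.

```cpp
#include <cassert>
#include <cstdint>

// Largest keep-alive interval (seconds) that satisfies
// T_keepalive <= T_NAT * S_margin, with the margin given in percent.
uint32_t keepaliveIntervalSec(uint32_t natTimeoutSec, uint32_t marginPercent) {
    assert(marginPercent > 0 && marginPercent <= 100);
    return natTimeoutSec * marginPercent / 100;
}

// Seconds left before the NAT entry expires if one keep-alive is lost.
uint32_t recoveryWindowSec(uint32_t natTimeoutSec, uint32_t keepaliveSec) {
    assert(keepaliveSec <= natTimeoutSec);
    return natTimeoutSec - keepaliveSec;
}
```

Calling `keepaliveIntervalSec(90, 70)` reproduces the 63-second result from the worked example, and `recoveryWindowSec(90, 63)` gives the 27-second window.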
21.4 Reliability Lab: Build a Complete Reliable Transport System
This hands-on lab demonstrates all five reliability pillars by building a complete reliable messaging system on an ESP32. You will implement CRC error detection, sequence numbering, acknowledgments, exponential backoff retries, and connection state management. Protocols such as TCP and CoAP use variations of these mechanisms internally.
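Of the five mechanisms, exponential backoff is the easiest to get subtly wrong, so it helps to look at it in isolation first. The sketch below is a standalone illustration rather than the lab's own function. It reuses the lab's configuration values (500 ms initial timeout, 8000 ms cap, 25% jitter) but draws jitter from the C++ standard library instead of Arduino's `random()`, and the function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <random>

// Values mirror the lab's configuration constants.
constexpr uint32_t kInitialTimeoutMs = 500;
constexpr uint32_t kMaxTimeoutMs     = 8000;
constexpr uint32_t kJitterPercent    = 25;

// Deterministic part: initial * 2^retry, capped at the maximum.
uint32_t baseTimeoutMs(uint32_t retryCount) {
    assert(retryCount < 16);  // keeps the shift well inside 32 bits
    return std::min(kInitialTimeoutMs << retryCount, kMaxTimeoutMs);
}

// Adds 0..25% random jitter so that devices which failed together
// do not all retry at the same instant.
uint32_t backoffWithJitterMs(uint32_t retryCount, std::mt19937& rng) {
    uint32_t base = baseTimeoutMs(retryCount);
    uint32_t maxJitter = base * kJitterPercent / 100;
    std::uniform_int_distribution<uint32_t> jitter(0, maxJitter);
    return base + jitter(rng);
}
```

With these constants the base timeout runs 500, 1000, 2000, 4000, 8000 ms and then stays at the cap, so the fifth retry waits no longer than the fourth.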
21.4.1 What You Will Learn
By completing this lab, you will be able to:
- Calculate and verify CRC-16 checksums: Identify bit-level corruption in transmitted data
- Implement exponential backoff with jitter: Apply congestion-aware retry logic to real code
- Construct sequence number management: Distinguish in-order delivery from duplicates and gaps
- Design an ACK/NACK protocol: Build a bidirectional acknowledgment system from scratch
- Build connection state machines: Construct session lifecycle management with proper state transitions
- Evaluate real-time reliability statistics: Analyze throughput, loss rate, and retry overhead
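Before wiring a CRC into a protocol, it is worth checking the routine against a known value. The sketch below is a standalone version of a bitwise CRC-16 with polynomial 0x1021 and initial value 0xFFFF, the parameters this lab uses. The function name `crc16Ccitt` is hypothetical. This parameter set is commonly catalogued as CRC-16/CCITT-FALSE, whose published check value for the ASCII string "123456789" is 0x29B1.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Bitwise CRC-16, polynomial 0x1021, initial value 0xFFFF,
// no bit reflection and no final XOR.
uint16_t crc16Ccitt(const uint8_t* data, size_t length) {
    assert(data != nullptr || length == 0);
    uint16_t crc = 0xFFFF;
    for (size_t i = 0; i < length; ++i) {
        // Fold the next byte into the top of the register.
        crc ^= static_cast<uint16_t>(data[i]) << 8;
        for (int bit = 0; bit < 8; ++bit) {
            // Shift left; apply the polynomial when the top bit falls out.
            crc = (crc & 0x8000) ? static_cast<uint16_t>((crc << 1) ^ 0x1021)
                                 : static_cast<uint16_t>(crc << 1);
        }
    }
    return crc;
}
```

Flipping any single bit of the input changes the result, which is the property the lab relies on when the simulated receiver rejects corrupted packets.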
21.4.2 Components Needed
| Component | Quantity | Purpose |
|---|---|---|
| ESP32 DevKit | 1 | Microcontroller running reliability simulation |
| Green LED | 1 | Successful transmission indicator (ACK received) |
| Red LED | 1 | Error indicator (timeout, CRC failure, max retries) |
| Yellow LED | 1 | Transmission in progress indicator |
| Blue LED | 1 | Connection state indicator (connected/disconnected) |
| 220 ohm Resistors | 4 | Current limiting for LEDs |
| Push Button | 1 | Trigger manual message send |
| 10K ohm Resistor | 1 | Button pull-down |
| Breadboard | 1 | Circuit assembly |
| Jumper Wires | Several | Connections |
21.4.3 Wokwi Simulator
Use the embedded simulator below to build and test your reliable transport system. Click “Start Simulation” to begin.
Simulator Tips
- Click inside the simulator frame first to give it focus
- Press the green “Play” button to start simulation
- Open the Serial Monitor (set to 115200 baud) to see detailed output
- You can add components by clicking the “+” button
- Copy the code below into the editor, replacing any default code
21.4.4 Circuit Diagram
21.4.5 Complete Code
Copy this code into the Wokwi editor:
/*
* ============================================================================
* RELIABILITY AND ERROR HANDLING LAB - ESP32
* ============================================================================
*
* This comprehensive lab demonstrates the five pillars of reliable IoT
* communication:
*
* 1. ERROR DETECTION: CRC-16 checksum calculation and verification
* 2. SEQUENCE NUMBERS: Packet ordering, loss detection, duplicate handling
* 3. ACKNOWLEDGMENTS: ACK/NACK protocol with configurable timeouts
* 4. RETRY WITH BACKOFF: Exponential backoff with random jitter
* 5. CONNECTION STATE: State machine for connection lifecycle
*
* LED Indicators:
* - Green (GPIO 4): Successful transmission (ACK received)
* - Red (GPIO 2): Error condition (timeout, CRC fail, max retries)
* - Yellow (GPIO 5): Transmission in progress
* - Blue (GPIO 18): Connection state (ON = connected)
*
* Button (GPIO 15): Manual message send trigger
*
* Serial Commands:
* - 'h' or '?': Show help
* - 's': Send test message
* - 'c': Toggle connection state
* - 'v': Toggle verbose mode
* - 'a': Toggle auto-send mode
* - 'r': Reset statistics
* - 'p': Print current statistics
* - '0'-'9': Set packet loss percentage (0=0%, 1=10%, 5=50%, etc.)
*
* Author: IoT Education Platform
* License: MIT
* ============================================================================
*/
#include <Arduino.h>
// ============================================================================
// PIN DEFINITIONS
// ============================================================================
#define LED_SUCCESS 4 // Green LED - ACK received, transmission OK
#define LED_ERROR 2 // Red LED - Timeout, CRC fail, or max retries
#define LED_TRANSMIT 5 // Yellow LED - Transmission in progress
#define LED_CONNECTED 18 // Blue LED - Connection state indicator
#define BTN_SEND 15 // Push button for manual message send
// ============================================================================
// RELIABILITY CONFIGURATION
// ============================================================================
#define INITIAL_TIMEOUT_MS 500 // Initial ACK timeout (milliseconds)
#define MAX_TIMEOUT_MS 8000 // Maximum timeout after backoff
#define MAX_RETRIES 5 // Maximum retry attempts before failure
#define JITTER_PERCENT 25 // Random jitter (0-25% of timeout)
#define DEFAULT_LOSS_PERCENT 20 // Default simulated packet loss rate
#define KEEPALIVE_INTERVAL_MS 10000 // Connection keep-alive interval
#define KEEPALIVE_TIMEOUT_MS 3000 // Time to wait for keep-alive response
// ============================================================================
// CRC-16 POLYNOMIAL (CRC-CCITT)
// ============================================================================
#define CRC16_POLYNOMIAL 0x1021
#define CRC16_INITIAL 0xFFFF
// ============================================================================
// MESSAGE TYPES
// ============================================================================
enum MessageType : uint8_t {
MSG_DATA = 0x01, // Data payload message
MSG_ACK = 0x02, // Positive acknowledgment
MSG_NACK = 0x03, // Negative acknowledgment (CRC error, etc.)
MSG_CONNECT = 0x10, // Connection request
MSG_CONNACK = 0x11, // Connection acknowledgment
MSG_DISCONNECT= 0x12, // Disconnect notification
MSG_KEEPALIVE = 0x20, // Keep-alive ping
MSG_KEEPALIVE_ACK = 0x21 // Keep-alive response
};
// ============================================================================
// CONNECTION STATES
// ============================================================================
enum ConnectionState : uint8_t {
STATE_DISCONNECTED = 0,
STATE_CONNECTING = 1,
STATE_CONNECTED = 2,
STATE_DISCONNECTING= 3
};
const char* stateNames[] = {
"DISCONNECTED",
"CONNECTING",
"CONNECTED",
"DISCONNECTING"
};
// ============================================================================
// MESSAGE STRUCTURE (74 bytes total: 8 header + 64 payload + 2 CRC)
// ============================================================================
struct ReliableMessage {
// Header (8 bytes)
uint8_t type; // Message type (MSG_DATA, MSG_ACK, etc.)
uint8_t flags; // Bit flags: 0x01=requires_ack, 0x02=is_retransmit
uint16_t sequenceNum; // Sender's sequence number (0-65535)
uint16_t ackNum; // Acknowledgment number (for ACK/NACK)
uint8_t payloadLen; // Length of payload (0-63)
uint8_t reserved; // Reserved for future use
// Payload (64 bytes max)
char payload[64]; // Message data
// Trailer (2 bytes)
uint16_t crc16; // CRC-16 checksum of header + payload
};
// ============================================================================
// RELIABILITY STATISTICS
// ============================================================================
struct ReliabilityStats {
// Transmission counts
uint32_t messagesSent;
uint32_t messagesDelivered;
uint32_t messagesFailed;
// Acknowledgment counts
uint32_t acksReceived;
uint32_t nacksReceived;
uint32_t ackTimeouts;
// Retry statistics
uint32_t retransmissions;
uint32_t totalRetryDelayMs;
// Error detection
uint32_t crcErrors;
uint32_t duplicatesDetected;
uint32_t outOfOrderDetected;
// Simulated network conditions
uint32_t simulatedDrops;
uint32_t simulatedCorruptions;
// Connection statistics
uint32_t connectionAttempts;
uint32_t connectionSuccesses;
uint32_t keepalivesSent;
uint32_t keepalivesTimedOut;
// Timing
unsigned long startTime;
unsigned long lastMessageTime;
};
// ============================================================================
// GLOBAL STATE
// ============================================================================
ConnectionState currentState = STATE_DISCONNECTED;
uint16_t nextSequenceNum = 0;
uint16_t expectedSequenceNum = 0;
uint16_t lastAckedSequence = 0xFFFF;
ReliabilityStats stats;
int packetLossPercent = DEFAULT_LOSS_PERCENT;
int corruptionPercent = 5; // 5% chance of bit corruption
bool verboseMode = true;
bool autoSendMode = true;
unsigned long lastKeepaliveTime = 0;
unsigned long lastButtonPress = 0;
// ============================================================================
// FUNCTION PROTOTYPES
// ============================================================================
// LED Functions
void initializePins();
void setLED(int pin, bool state);
void blinkLED(int pin, int times, int delayMs);
void updateConnectionLED();
// CRC Functions
uint16_t calculateCRC16(const uint8_t* data, size_t length);
uint16_t calculateMessageCRC(ReliableMessage* msg);
bool verifyCRC(ReliableMessage* msg);
void corruptRandomBit(ReliableMessage* msg);
// Message Functions
void createDataMessage(ReliableMessage* msg, const char* payload);
void createAckMessage(ReliableMessage* ack, uint16_t ackNum);
void createNackMessage(ReliableMessage* nack, uint16_t nackNum, const char* reason);
void createConnectMessage(ReliableMessage* msg);
void createKeepaliveMessage(ReliableMessage* msg);
void printMessage(ReliableMessage* msg, const char* direction);
// Network Simulation
bool simulatePacketLoss();
bool simulateCorruption();
int calculateBackoff(int retryCount);
// Reliability Core
bool sendWithReliability(const char* data);
bool sendMessage(ReliableMessage* msg, bool requiresAck);
void processReceivedMessage(ReliableMessage* msg);
// Connection Management
bool connect();
void disconnect();
void handleKeepalive();
void transitionState(ConnectionState newState);
// Statistics
void printStatistics();
void resetStatistics();
void printHelp();
// Command handler
void handleCommand(char cmd);
// ============================================================================
// SETUP
// ============================================================================
void setup() {
Serial.begin(115200);
delay(1000);
// Initialize random seed
randomSeed(analogRead(0) + micros());
// Initialize GPIO
initializePins();
// Initialize statistics
resetStatistics();
// Print banner
Serial.println();
Serial.println("========================================================================");
Serial.println(" RELIABILITY AND ERROR HANDLING LAB - ESP32 ");
Serial.println("========================================================================");
Serial.println();
Serial.println("This lab demonstrates the five pillars of reliable IoT communication:");
Serial.println();
Serial.println(" 1. CRC-16 Error Detection - Detect bit corruption in transit");
Serial.println(" 2. Sequence Numbers - Detect loss, duplicates, reordering");
Serial.println(" 3. ACK/NACK Protocol - Confirm or reject message delivery");
Serial.println(" 4. Exponential Backoff - Congestion-aware retry strategy");
Serial.println(" 5. Connection State Machine - Manage session lifecycle");
Serial.println();
Serial.println("------------------------------------------------------------------------");
Serial.println("CONFIGURATION:");
Serial.printf(" Initial Timeout: %d ms\n", INITIAL_TIMEOUT_MS);
Serial.printf(" Maximum Timeout: %d ms\n", MAX_TIMEOUT_MS);
Serial.printf(" Maximum Retries: %d\n", MAX_RETRIES);
Serial.printf(" Jitter: %d%%\n", JITTER_PERCENT);
Serial.printf(" Packet Loss Rate: %d%%\n", packetLossPercent);
Serial.printf(" Corruption Rate: %d%%\n", corruptionPercent);
Serial.printf(" Keep-Alive Interval:%d ms\n", KEEPALIVE_INTERVAL_MS);
Serial.println("------------------------------------------------------------------------");
Serial.println();
// LED test sequence
Serial.println("Testing LEDs...");
blinkLED(LED_SUCCESS, 2, 150);
blinkLED(LED_ERROR, 2, 150);
blinkLED(LED_TRANSMIT, 2, 150);
blinkLED(LED_CONNECTED, 2, 150);
Serial.println("LED test complete.");
Serial.println();
printHelp();
Serial.println();
Serial.println("Attempting initial connection...");
Serial.println();
// Attempt initial connection
if (connect()) {
Serial.println("==> Connected successfully! Ready to send messages.");
} else {
Serial.println("==> Connection failed. Type 'c' to retry.");
}
Serial.println();
}
// ============================================================================
// MAIN LOOP
// ============================================================================
void loop() {
// Handle serial commands
if (Serial.available()) {
char cmd = Serial.read();
handleCommand(cmd);
}
// Handle button press
if (digitalRead(BTN_SEND) == HIGH) {
if (millis() - lastButtonPress > 500) { // Debounce
lastButtonPress = millis();
if (currentState == STATE_CONNECTED) {
static int buttonMsgCount = 0;
char msgBuffer[64];
snprintf(msgBuffer, sizeof(msgBuffer), "Button press #%d at %lu ms",
++buttonMsgCount, millis());
sendWithReliability(msgBuffer);
} else {
Serial.println("[WARN] Not connected. Press 'c' to connect first.");
blinkLED(LED_ERROR, 2, 100);
}
}
}
// Handle keep-alive for connected state
if (currentState == STATE_CONNECTED) {
handleKeepalive();
}
// Auto-send test messages when connected
static unsigned long lastAutoSend = 0;
if (autoSendMode && currentState == STATE_CONNECTED) {
if (millis() - lastAutoSend > 5000) { // Every 5 seconds
lastAutoSend = millis();
static int autoMsgCount = 0;
char msgBuffer[64];
float temp = 20.0 + (random(200) / 10.0); // Random temp 20-40C
float humidity = 40.0 + (random(400) / 10.0); // Random humidity 40-80%
snprintf(msgBuffer, sizeof(msgBuffer),
"Sensor #%d: Temp=%.1fC, Humidity=%.1f%%",
++autoMsgCount, temp, humidity);
Serial.println("========================================================================");
Serial.printf(">>> AUTO-SEND MESSAGE #%d <<<\n", autoMsgCount);
Serial.println("========================================================================");
bool success = sendWithReliability(msgBuffer);
if (success) {
Serial.println("[RESULT] Message delivered successfully!");
} else {
Serial.println("[RESULT] Message delivery FAILED after max retries.");
}
// Print stats every 5 messages
if (autoMsgCount % 5 == 0) {
printStatistics();
}
Serial.println();
}
}
// Update connection LED
updateConnectionLED();
delay(10); // Small delay for stability
}
// ============================================================================
// COMMAND HANDLER
// ============================================================================
void handleCommand(char cmd) {
switch (cmd) {
case 'h':
case '?':
printHelp();
break;
case 's':
case 'S':
if (currentState == STATE_CONNECTED) {
char msg[64];
snprintf(msg, sizeof(msg), "Manual test message at %lu ms", millis());
sendWithReliability(msg);
} else {
Serial.println("[WARN] Not connected. Type 'c' to connect first.");
}
break;
case 'c':
case 'C':
if (currentState == STATE_DISCONNECTED) {
connect();
} else if (currentState == STATE_CONNECTED) {
disconnect();
} else {
Serial.printf("[INFO] Currently in state: %s\n", stateNames[currentState]);
}
break;
case 'v':
case 'V':
verboseMode = !verboseMode;
Serial.printf("[CONFIG] Verbose mode: %s\n", verboseMode ? "ON" : "OFF");
break;
case 'a':
case 'A':
autoSendMode = !autoSendMode;
Serial.printf("[CONFIG] Auto-send mode: %s\n", autoSendMode ? "ON" : "OFF");
break;
case 'r':
case 'R':
resetStatistics();
Serial.println("[INFO] Statistics reset.");
break;
case 'p':
case 'P':
printStatistics();
break;
case '0':
packetLossPercent = 0;
Serial.println("[CONFIG] Packet loss: 0% (perfect network)");
break;
case '1': case '2': case '3': case '4': case '5':
case '6': case '7': case '8': case '9':
packetLossPercent = (cmd - '0') * 10;
Serial.printf("[CONFIG] Packet loss: %d%%\n", packetLossPercent);
break;
case '\n':
case '\r':
// Ignore newlines
break;
default:
Serial.printf("[WARN] Unknown command: '%c'. Type 'h' for help.\n", cmd);
break;
}
}
// ============================================================================
// PIN INITIALIZATION
// ============================================================================
void initializePins() {
// Configure LED pins as outputs
pinMode(LED_SUCCESS, OUTPUT);
pinMode(LED_ERROR, OUTPUT);
pinMode(LED_TRANSMIT, OUTPUT);
pinMode(LED_CONNECTED, OUTPUT);
// Configure button as input
pinMode(BTN_SEND, INPUT);
// Initialize all LEDs to off
digitalWrite(LED_SUCCESS, LOW);
digitalWrite(LED_ERROR, LOW);
digitalWrite(LED_TRANSMIT, LOW);
digitalWrite(LED_CONNECTED, LOW);
}
// ============================================================================
// LED FUNCTIONS
// ============================================================================
void setLED(int pin, bool state) {
digitalWrite(pin, state ? HIGH : LOW);
}
void blinkLED(int pin, int times, int delayMs) {
for (int i = 0; i < times; i++) {
digitalWrite(pin, HIGH);
delay(delayMs);
digitalWrite(pin, LOW);
delay(delayMs);
}
}
void updateConnectionLED() {
// Blue LED indicates connection state
static unsigned long lastBlink = 0;
static bool blinkState = false;
switch (currentState) {
case STATE_DISCONNECTED:
setLED(LED_CONNECTED, false);
break;
case STATE_CONNECTING:
case STATE_DISCONNECTING:
// Blink slowly during transitions
if (millis() - lastBlink > 250) {
lastBlink = millis();
blinkState = !blinkState;
setLED(LED_CONNECTED, blinkState);
}
break;
case STATE_CONNECTED:
setLED(LED_CONNECTED, true);
break;
}
}
// ============================================================================
// CRC-16 CALCULATION (CRC-CCITT)
// ============================================================================
uint16_t calculateCRC16(const uint8_t* data, size_t length) {
uint16_t crc = CRC16_INITIAL;
for (size_t i = 0; i < length; i++) {
crc ^= ((uint16_t)data[i] << 8);
for (int bit = 0; bit < 8; bit++) {
if (crc & 0x8000) {
crc = (crc << 1) ^ CRC16_POLYNOMIAL;
} else {
crc = crc << 1;
}
}
}
return crc;
}
uint16_t calculateMessageCRC(ReliableMessage* msg) {
// CRC covers header (excluding CRC field) and payload
size_t crcLength = 8 + msg->payloadLen; // Header (8 bytes) + payload
return calculateCRC16((uint8_t*)msg, crcLength);
}
bool verifyCRC(ReliableMessage* msg) {
uint16_t calculated = calculateMessageCRC(msg);
bool valid = (calculated == msg->crc16);
if (!valid && verboseMode) {
Serial.printf(" [CRC] MISMATCH! Calculated: 0x%04X, Received: 0x%04X\n",
calculated, msg->crc16);
}
return valid;
}
void corruptRandomBit(ReliableMessage* msg) {
// Corrupt a random bit in the payload (simulating transmission error)
if (msg->payloadLen > 0) {
int byteIndex = random(msg->payloadLen);
int bitIndex = random(8);
msg->payload[byteIndex] ^= (1 << bitIndex);
if (verboseMode) {
Serial.printf(" [CORRUPT] Flipped bit %d in payload byte %d\n",
bitIndex, byteIndex);
}
}
}
// ============================================================================
// MESSAGE CREATION
// ============================================================================
void createDataMessage(ReliableMessage* msg, const char* payload) {
memset(msg, 0, sizeof(ReliableMessage));
msg->type = MSG_DATA;
msg->flags = 0x01; // Requires ACK
msg->sequenceNum = nextSequenceNum;
msg->ackNum = 0;
msg->payloadLen = min((int)strlen(payload), 63);
msg->reserved = 0;
strncpy(msg->payload, payload, msg->payloadLen);
msg->payload[msg->payloadLen] = '\0';
msg->crc16 = calculateMessageCRC(msg);
}
void createAckMessage(ReliableMessage* ack, uint16_t ackNum) {
memset(ack, 0, sizeof(ReliableMessage));
ack->type = MSG_ACK;
ack->flags = 0;
ack->sequenceNum = 0;
ack->ackNum = ackNum;
ack->payloadLen = 0;
ack->reserved = 0;
ack->crc16 = calculateMessageCRC(ack);
}
void createNackMessage(ReliableMessage* nack, uint16_t nackNum, const char* reason) {
memset(nack, 0, sizeof(ReliableMessage));
nack->type = MSG_NACK;
nack->flags = 0;
nack->sequenceNum = 0;
nack->ackNum = nackNum;
nack->payloadLen = min((int)strlen(reason), 63);
nack->reserved = 0;
strncpy(nack->payload, reason, nack->payloadLen);
nack->crc16 = calculateMessageCRC(nack);
}
void createConnectMessage(ReliableMessage* msg) {
memset(msg, 0, sizeof(ReliableMessage));
msg->type = MSG_CONNECT;
msg->flags = 0x01; // Requires ACK
msg->sequenceNum = 0;
msg->payloadLen = 0;
msg->crc16 = calculateMessageCRC(msg);
}
void createKeepaliveMessage(ReliableMessage* msg) {
memset(msg, 0, sizeof(ReliableMessage));
msg->type = MSG_KEEPALIVE;
msg->flags = 0x01; // Requires ACK
msg->sequenceNum = nextSequenceNum;
msg->payloadLen = 0;
msg->crc16 = calculateMessageCRC(msg);
}
// ============================================================================
// MESSAGE PRINTING
// ============================================================================
void printMessage(ReliableMessage* msg, const char* direction) {
if (!verboseMode) return;
const char* typeName;
switch (msg->type) {
case MSG_DATA: typeName = "DATA"; break;
case MSG_ACK: typeName = "ACK"; break;
case MSG_NACK: typeName = "NACK"; break;
case MSG_CONNECT: typeName = "CONNECT"; break;
case MSG_CONNACK: typeName = "CONNACK"; break;
case MSG_DISCONNECT:typeName = "DISCONNECT"; break;
case MSG_KEEPALIVE: typeName = "KEEPALIVE"; break;
case MSG_KEEPALIVE_ACK: typeName = "KEEPALIVE_ACK"; break;
default: typeName = "UNKNOWN"; break;
}
Serial.println();
Serial.printf(" +-------------- %s MESSAGE --------------+\n", direction);
Serial.printf(" | Type: %-26s |\n", typeName);
Serial.printf(" | Flags: 0x%02X (ACK_REQ=%d, RETX=%d) |\n",
msg->flags, (msg->flags & 0x01), ((msg->flags & 0x02) >> 1));
Serial.printf(" | Sequence: %-26d |\n", msg->sequenceNum);
Serial.printf(" | Ack Number: %-26d |\n", msg->ackNum);
Serial.printf(" | Payload Len: %-26d |\n", msg->payloadLen);
if (msg->payloadLen > 0) {
if (msg->payloadLen <= 30) {
Serial.printf(" | Payload: %-26s |\n", msg->payload);
} else {
char truncated[28];
strncpy(truncated, msg->payload, 24);
truncated[24] = '\0';
strcat(truncated, "...");
Serial.printf(" | Payload: %-26s |\n", truncated);
}
}
Serial.printf(" | CRC-16: 0x%04X |\n", msg->crc16);
Serial.println(" +-------------------------------------------+");
}
// ============================================================================
// NETWORK SIMULATION
// ============================================================================
bool simulatePacketLoss() {
int roll = random(100);
bool lost = (roll < packetLossPercent);
if (lost) {
stats.simulatedDrops++;
if (verboseMode) {
Serial.printf(" [NETWORK] Packet DROPPED (roll=%d < %d%%)\n",
roll, packetLossPercent);
}
}
return lost;
}
bool simulateCorruption() {
int roll = random(100);
bool corrupted = (roll < corruptionPercent);
if (corrupted) {
stats.simulatedCorruptions++;
if (verboseMode) {
Serial.printf(" [NETWORK] Packet CORRUPTED (roll=%d < %d%%)\n",
roll, corruptionPercent);
}
}
return corrupted;
}
int calculateBackoff(int retryCount) {
// Exponential backoff: initial * 2^retry
int baseTimeout = INITIAL_TIMEOUT_MS * (1 << retryCount);
baseTimeout = min(baseTimeout, (int)MAX_TIMEOUT_MS);
// Add random jitter (0 to JITTER_PERCENT% of timeout)
int jitter = random((baseTimeout * JITTER_PERCENT) / 100);
return baseTimeout + jitter;
}
// ============================================================================
// RELIABLE SEND WITH RETRY
// ============================================================================
bool sendWithReliability(const char* data) {
if (currentState != STATE_CONNECTED) {
Serial.println("[ERROR] Cannot send: not connected");
return false;
}
ReliableMessage msg;
ReliableMessage response;
int retryCount = 0;
bool delivered = false;
// Create the data message
createDataMessage(&msg, data);
printMessage(&msg, "SEND");
while (retryCount <= MAX_RETRIES && !delivered) {
// Calculate timeout for this attempt
int currentTimeout = calculateBackoff(retryCount);
// Mark as retransmission if this is a retry
if (retryCount > 0) {
msg.flags |= 0x02; // Set retransmit flag
msg.crc16 = calculateMessageCRC(&msg); // Flags byte is covered by the CRC, so refresh it
stats.retransmissions++;
stats.totalRetryDelayMs += currentTimeout;
if (verboseMode) {
Serial.println();
Serial.printf(" [RETRY] Attempt %d/%d (timeout=%dms, backoff=2^%d)\n",
retryCount, MAX_RETRIES, currentTimeout, retryCount);
}
}
// Turn on transmit LED
setLED(LED_TRANSMIT, true);
if (verboseMode) {
Serial.printf(" [TX] Sending SEQ=%d, waiting up to %dms for ACK...\n",
msg.sequenceNum, currentTimeout);
}
stats.messagesSent++;
// Simulate network transmission delay
delay(50);
// Turn off transmit LED
setLED(LED_TRANSMIT, false);
// ===== SIMULATE NETWORK CHANNEL =====
// Check if packet was lost in transit
if (simulatePacketLoss()) {
// Packet lost - wait for timeout
if (verboseMode) {
Serial.printf(" [WAIT] Waiting %dms for ACK (packet was lost)...\n",
currentTimeout);
}
delay(currentTimeout);
// Timeout occurred
setLED(LED_ERROR, true);
delay(150);
setLED(LED_ERROR, false);
stats.ackTimeouts++;
if (verboseMode) {
Serial.println(" [TIMEOUT] No ACK received - will retry");
}
retryCount++;
continue;
}
// Check if packet was corrupted. Keep a pristine copy first so that a
// retry resends the original bytes, not the corrupted ones.
ReliableMessage pristine = msg;
if (simulateCorruption()) {
corruptRandomBit(&msg);
}
// Packet arrived at receiver - process it
setLED(LED_TRANSMIT, true);
delay(30);
setLED(LED_TRANSMIT, false);
// Receiver verifies CRC
if (!verifyCRC(&msg)) {
// CRC failed - receiver sends NACK
stats.crcErrors++;
stats.nacksReceived++;
createNackMessage(&response, msg.sequenceNum, "CRC_ERROR");
printMessage(&response, "RECV");
if (verboseMode) {
Serial.println(" [NACK] CRC error detected by receiver - will retry");
}
setLED(LED_ERROR, true);
delay(150);
setLED(LED_ERROR, false);
// Restore the original message for retry. Recomputing the CRC over the
// corrupted bytes alone would make the corrupted payload look valid.
msg = pristine;
msg.crc16 = calculateMessageCRC(&msg);
retryCount++;
continue;
}
if (verboseMode) {
Serial.println(" [RX] Message received and CRC verified OK");
}
// Check sequence number at receiver
if (msg.sequenceNum < expectedSequenceNum &&
!(msg.sequenceNum == 0 && expectedSequenceNum > 60000)) {
// Duplicate detected (not wraparound)
stats.duplicatesDetected++;
if (verboseMode) {
Serial.printf(" [RX] DUPLICATE (expected SEQ >= %d, got %d) - ACK anyway\n",
expectedSequenceNum, msg.sequenceNum);
}
} else if (msg.sequenceNum > expectedSequenceNum) {
// Out of order (gap detected)
stats.outOfOrderDetected++;
if (verboseMode) {
Serial.printf(" [RX] OUT OF ORDER (expected SEQ=%d, got %d)\n",
expectedSequenceNum, msg.sequenceNum);
}
expectedSequenceNum = msg.sequenceNum + 1;
} else {
// Normal in-order delivery
expectedSequenceNum = msg.sequenceNum + 1;
}
// ===== RECEIVER SENDS ACK =====
// Check if ACK is lost on the return path
if (simulatePacketLoss()) {
if (verboseMode) {
Serial.println(" [NETWORK] ACK was LOST in transit!");
Serial.printf(" [WAIT] Waiting %dms for ACK...\n", currentTimeout);
}
delay(currentTimeout);
setLED(LED_ERROR, true);
delay(150);
setLED(LED_ERROR, false);
stats.ackTimeouts++;
retryCount++;
continue;
}
// ACK received successfully
createAckMessage(&response, msg.sequenceNum);
printMessage(&response, "RECV");
stats.acksReceived++;
stats.messagesDelivered++;
lastAckedSequence = msg.sequenceNum;
if (verboseMode) {
Serial.printf(" [ACK] Received ACK for SEQ=%d\n", response.ackNum);
}
setLED(LED_SUCCESS, true);
delay(200);
setLED(LED_SUCCESS, false);
// Success - increment sequence number for next message
nextSequenceNum++;
delivered = true;
}
if (!delivered) {
stats.messagesFailed++;
Serial.printf(" [FAIL] Message delivery failed after %d retries\n", MAX_RETRIES);
blinkLED(LED_ERROR, 3, 100);
}
stats.lastMessageTime = millis();
return delivered;
}
// ============================================================================
// CONNECTION MANAGEMENT
// ============================================================================
void transitionState(ConnectionState newState) {
if (verboseMode) {
Serial.printf("[STATE] %s -> %s\n", stateNames[currentState], stateNames[newState]);
}
currentState = newState;
updateConnectionLED();
}
bool connect() {
if (currentState != STATE_DISCONNECTED) {
Serial.printf("[WARN] Cannot connect: already in state %s\n",
stateNames[currentState]);
return false;
}
transitionState(STATE_CONNECTING);
stats.connectionAttempts++;
Serial.println("[CONNECT] Initiating connection...");
ReliableMessage connectMsg;
createConnectMessage(&connectMsg);
printMessage(&connectMsg, "SEND");
// Simulate connection handshake with retries
int retryCount = 0;
while (retryCount < 3) {
setLED(LED_TRANSMIT, true);
delay(100);
setLED(LED_TRANSMIT, false);
// Check for simulated loss
if (simulatePacketLoss()) {
int timeout = calculateBackoff(retryCount);
if (verboseMode) {
Serial.printf(" [WAIT] Connection request lost, waiting %dms...\n", timeout);
}
delay(timeout);
retryCount++;
continue;
}
// Connection successful
if (verboseMode) {
Serial.println(" [RX] CONNACK received");
}
transitionState(STATE_CONNECTED);
stats.connectionSuccesses++;
lastKeepaliveTime = millis();
nextSequenceNum = 0;
expectedSequenceNum = 0;
setLED(LED_SUCCESS, true);
delay(300);
setLED(LED_SUCCESS, false);
Serial.println("[CONNECT] Connection established successfully!");
return true;
}
// Connection failed
transitionState(STATE_DISCONNECTED);
Serial.println("[CONNECT] Connection FAILED after 3 attempts");
blinkLED(LED_ERROR, 3, 150);
return false;
}
void disconnect() {
if (currentState != STATE_CONNECTED) {
Serial.printf("[WARN] Cannot disconnect: in state %s\n",
stateNames[currentState]);
return;
}
transitionState(STATE_DISCONNECTING);
Serial.println("[DISCONNECT] Initiating graceful disconnect...");
ReliableMessage disconnectMsg;
memset(&disconnectMsg, 0, sizeof(ReliableMessage));
disconnectMsg.type = MSG_DISCONNECT;
disconnectMsg.crc16 = calculateMessageCRC(&disconnectMsg);
printMessage(&disconnectMsg, "SEND");
setLED(LED_TRANSMIT, true);
delay(100);
setLED(LED_TRANSMIT, false);
// Don't wait for ACK on disconnect (best effort)
transitionState(STATE_DISCONNECTED);
Serial.println("[DISCONNECT] Disconnected.");
}
void handleKeepalive() {
if (currentState != STATE_CONNECTED) return;
if (millis() - lastKeepaliveTime >= KEEPALIVE_INTERVAL_MS) {
stats.keepalivesSent++;
if (verboseMode) {
Serial.println("[KEEPALIVE] Sending keep-alive ping...");
}
ReliableMessage keepalive;
createKeepaliveMessage(&keepalive);
setLED(LED_TRANSMIT, true);
delay(30);
setLED(LED_TRANSMIT, false);
// Simulate keep-alive response
static int missedKeepalives = 0; // counts consecutive misses only
if (!simulatePacketLoss()) {
lastKeepaliveTime = millis();
missedKeepalives = 0; // any response ends the failure streak
if (verboseMode) {
Serial.println("[KEEPALIVE] Response received - connection healthy");
}
} else {
stats.keepalivesTimedOut++;
if (verboseMode) {
Serial.println("[KEEPALIVE] No response - connection may be degraded");
}
// After three consecutive failed keepalives, disconnect
missedKeepalives++;
if (missedKeepalives >= 3) {
Serial.println("[KEEPALIVE] Connection lost - 3 consecutive failures");
transitionState(STATE_DISCONNECTED);
missedKeepalives = 0;
blinkLED(LED_ERROR, 3, 100);
}
}
}
}
// ============================================================================
// STATISTICS
// ============================================================================
void printStatistics() {
unsigned long elapsed = (millis() - stats.startTime) / 1000;
if (elapsed == 0) elapsed = 1;
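// Delivery rate is per message (delivered vs. failed). messagesSent counts
// every transmission attempt, including retransmissions.
unsigned long completed = stats.messagesDelivered + stats.messagesFailed;
float deliveryRate = completed > 0 ?
(float)stats.messagesDelivered * 100.0 / completed : 0;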
float avgRetriesPerMsg = stats.messagesDelivered > 0 ?
(float)stats.retransmissions / stats.messagesDelivered : 0;
float throughput = (float)stats.messagesDelivered / elapsed;
Serial.println();
Serial.println("+======================================================================+");
Serial.println("| RELIABILITY STATISTICS |");
Serial.println("+======================================================================+");
Serial.println("| TRANSMISSION |");
Serial.printf("| Messages Sent: %-10lu |\n",
stats.messagesSent);
Serial.printf("| Messages Delivered: %-10lu |\n",
stats.messagesDelivered);
Serial.printf("| Messages Failed: %-10lu |\n",
stats.messagesFailed);
Serial.printf("| Delivery Rate: %-10.1f%% |\n",
deliveryRate);
Serial.println("+----------------------------------------------------------------------+");
Serial.println("| ACKNOWLEDGMENTS |");
Serial.printf("| ACKs Received: %-10lu |\n",
stats.acksReceived);
Serial.printf("| NACKs Received: %-10lu |\n",
stats.nacksReceived);
Serial.printf("| ACK Timeouts: %-10lu |\n",
stats.ackTimeouts);
Serial.println("+----------------------------------------------------------------------+");
Serial.println("| RETRY STATISTICS |");
Serial.printf("| Total Retransmissions:%-10lu |\n",
stats.retransmissions);
Serial.printf("| Avg Retries/Message: %-10.2f |\n",
avgRetriesPerMsg);
Serial.printf("| Total Retry Delay: %-10lu ms |\n",
stats.totalRetryDelayMs);
Serial.println("+----------------------------------------------------------------------+");
Serial.println("| ERROR DETECTION |");
Serial.printf("| CRC Errors Detected: %-10lu |\n",
stats.crcErrors);
Serial.printf("| Duplicates Detected: %-10lu |\n",
stats.duplicatesDetected);
Serial.printf("| Out-of-Order: %-10lu |\n",
stats.outOfOrderDetected);
Serial.println("+----------------------------------------------------------------------+");
Serial.println("| NETWORK SIMULATION |");
Serial.printf("| Packets Dropped: %-10lu |\n",
stats.simulatedDrops);
Serial.printf("| Packets Corrupted: %-10lu |\n",
stats.simulatedCorruptions);
Serial.printf("| Current Loss Rate: %-10d%% |\n",
packetLossPercent);
Serial.println("+----------------------------------------------------------------------+");
Serial.println("| CONNECTION |");
Serial.printf("| Connection Attempts: %-10lu |\n",
stats.connectionAttempts);
Serial.printf("| Connection Successes: %-10lu |\n",
stats.connectionSuccesses);
Serial.printf("| Keep-alives Sent: %-10lu |\n",
stats.keepalivesSent);
Serial.printf("| Keep-alives Failed: %-10lu |\n",
stats.keepalivesTimedOut);
Serial.printf("| Current State: %-10s |\n",
stateNames[currentState]);
Serial.println("+----------------------------------------------------------------------+");
Serial.println("| PERFORMANCE |");
Serial.printf("| Elapsed Time: %-10lu seconds |\n",
elapsed);
Serial.printf("| Throughput: %-10.2f messages/sec |\n",
throughput);
Serial.println("+======================================================================+");
Serial.println();
}
void resetStatistics() {
memset(&stats, 0, sizeof(ReliabilityStats));
stats.startTime = millis();
nextSequenceNum = 0;
expectedSequenceNum = 0;
}
void printHelp() {
Serial.println();
Serial.println("------------------------------------------------------------------------");
Serial.println("COMMANDS:");
Serial.println(" h / ? Show this help menu");
Serial.println(" s Send a test message manually");
Serial.println(" c Connect (if disconnected) or Disconnect (if connected)");
Serial.println(" v Toggle verbose mode (detailed logging)");
Serial.println(" a Toggle auto-send mode (periodic test messages)");
Serial.println(" r Reset statistics");
Serial.println(" p Print current statistics");
Serial.println(" 0-9 Set packet loss percentage (0=0%, 5=50%, 9=90%)");
Serial.println("------------------------------------------------------------------------");
Serial.println("LED INDICATORS:");
Serial.println(" Green - Successful transmission (ACK received)");
Serial.println(" Red - Error (timeout, CRC failure, max retries)");
Serial.println(" Yellow - Transmission in progress");
Serial.println(" Blue - Connection state (solid=connected, blink=transitioning)");
Serial.println("------------------------------------------------------------------------");
Serial.println("BUTTON: Press to send a message manually");
Serial.println("------------------------------------------------------------------------");
}
21.4.6 Step-by-Step Instructions
Step 1: Set Up the Circuit
- Open the Wokwi simulator above (or visit wokwi.com/projects/new/esp32)
- Add four LEDs to the breadboard:
- Green LED (success indicator)
- Red LED (error indicator)
- Yellow LED (transmission indicator)
- Blue LED (connection state indicator)
- Add four 220 ohm resistors for the LEDs
- Add one push button and one 10K ohm pull-down resistor
- Wire the connections as shown in the circuit diagram:
- GPIO 4 to Green LED anode (through 220 ohm resistor)
- GPIO 2 to Red LED anode (through 220 ohm resistor)
- GPIO 5 to Yellow LED anode (through 220 ohm resistor)
- GPIO 18 to Blue LED anode (through 220 ohm resistor)
- All LED cathodes to GND
- 3.3V to push button, button to GPIO 15
- GPIO 15 to GND through 10K ohm pull-down resistor
Step 2: Upload and Run the Code
- Copy the complete code into the Wokwi code editor
- Click “Start Simulation” to begin
- Open the Serial Monitor (set to 115200 baud)
- The system will automatically connect and start sending test messages
Step 3: Observe the Five Reliability Pillars
| Pillar | What to Observe |
|---|---|
| CRC-16 | Watch for “CRC verified OK” or “CRC MISMATCH” in serial output |
| Sequence Numbers | Each message shows SEQ=N incrementing |
| ACK/NACK | See ACK messages received after successful delivery |
| Exponential Backoff | On retries, the base timeout doubles: 500ms, 1000ms, 2000ms, 4000ms (each plus up to 25% random jitter) |
| Connection State | Blue LED shows connection status; state transitions logged |
Step 4: Experiment with Network Conditions
Use serial commands to change simulation parameters:
| Command | Effect |
|---|---|
| 0 | Perfect network (0% loss) |
| 3 | Moderate loss (30%) |
| 7 | High loss (70%) |
| v | Toggle verbose mode |
| p | Print statistics |
21.4.7 Understanding the Code
1. CRC-16 Error Detection
uint16_t calculateCRC16(const uint8_t* data, size_t length) {
uint16_t crc = CRC16_INITIAL; // 0xFFFF
for (size_t i = 0; i < length; i++) {
crc ^= ((uint16_t)data[i] << 8);
for (int bit = 0; bit < 8; bit++) {
if (crc & 0x8000) {
crc = (crc << 1) ^ CRC16_POLYNOMIAL; // 0x1021
} else {
crc = crc << 1;
}
}
}
return crc;
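}
This implements CRC-16/CCITT-FALSE: polynomial 0x1021, initial value 0xFFFF, processed most-significant bit first. The same polynomial appears in X.25/HDLC, XMODEM, and Bluetooth. (USB data packets use a different CRC-16 polynomial, 0x8005.) Like any 16-bit CRC, it detects every burst error of 16 bits or fewer.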
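To check an implementation like the one above, it helps to run it against a published test vector. The following is a minimal standalone sketch (plain C++ rather than Arduino code); the names `crc16CcittFalse` and `singleBitFlipIsDetected` are illustrative, and the constants are assumed to match the lab's CRC16_INITIAL and CRC16_POLYNOMIAL.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed to match the lab's CRC16_INITIAL and CRC16_POLYNOMIAL.
constexpr uint16_t kCrc16Initial = 0xFFFF;
constexpr uint16_t kCrc16Polynomial = 0x1021;

// Bitwise CRC-16/CCITT-FALSE, most-significant bit first.
uint16_t crc16CcittFalse(const uint8_t* data, size_t length) {
    uint16_t crc = kCrc16Initial;
    for (size_t i = 0; i < length; i++) {
        crc = static_cast<uint16_t>(crc ^ (static_cast<uint16_t>(data[i]) << 8));
        for (int bit = 0; bit < 8; bit++) {
            if (crc & 0x8000) {
                crc = static_cast<uint16_t>((crc << 1) ^ kCrc16Polynomial);
            } else {
                crc = static_cast<uint16_t>(crc << 1);
            }
        }
    }
    return crc;
}

// Returns true when flipping the given bit of the buffer changes the CRC.
bool singleBitFlipIsDetected(const uint8_t* data, size_t length, size_t bitIndex) {
    std::vector<uint8_t> copy(data, data + length);
    copy[bitIndex / 8] = static_cast<uint8_t>(copy[bitIndex / 8] ^ (1u << (bitIndex % 8)));
    return crc16CcittFalse(copy.data(), length) != crc16CcittFalse(data, length);
}
```

The standard check input for CRC catalogues is the nine ASCII bytes "123456789"; for this variant the expected result is 0x29B1. A single flipped bit anywhere in the buffer should always change the CRC.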
2. Exponential Backoff with Jitter
int calculateBackoff(int retryCount) {
// Exponential: initial * 2^retry
int baseTimeout = INITIAL_TIMEOUT_MS * (1 << retryCount);
baseTimeout = min(baseTimeout, (int)MAX_TIMEOUT_MS);
// Random jitter: 0 to 25% of timeout
int jitter = random((baseTimeout * JITTER_PERCENT) / 100);
return baseTimeout + jitter;
}
The (1 << retryCount) efficiently calculates 2^n. Jitter prevents collision storms when multiple devices retry simultaneously.
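The same calculation can be exercised off-device to confirm the bounds. This is a sketch in standard C++ that swaps Arduino's random() for the standard library's generator; the constant values (500 ms initial, 8000 ms cap, 25% jitter) are assumptions chosen to match the timeouts quoted in this lab, and the shift is clamped so that a very large retry count cannot overflow.

```cpp
#include <algorithm>
#include <cstdint>
#include <random>

// Assumed values; the lab's real constants may differ.
constexpr uint32_t kInitialTimeoutMs = 500;
constexpr uint32_t kMaxTimeoutMs = 8000;
constexpr uint32_t kJitterPercent = 25;

// Base timeout without jitter: initial * 2^retry, capped at the maximum.
uint32_t baseTimeoutMs(unsigned retryCount) {
    // Clamp the shift so a huge retry count cannot overflow the 64-bit value.
    const unsigned shift = std::min(retryCount, 16u);
    const uint64_t scaled = static_cast<uint64_t>(kInitialTimeoutMs) << shift;
    return static_cast<uint32_t>(std::min<uint64_t>(scaled, kMaxTimeoutMs));
}

// Base timeout plus uniform jitter in [0, base * kJitterPercent / 100).
uint32_t backoffWithJitterMs(unsigned retryCount, std::mt19937& rng) {
    const uint32_t base = baseTimeoutMs(retryCount);
    const uint32_t jitterSpan = base * kJitterPercent / 100;
    if (jitterSpan == 0) return base;
    std::uniform_int_distribution<uint32_t> jitter(0, jitterSpan - 1);
    return base + jitter(rng);
}
```

With these assumed constants the base sequence is 500, 1000, 2000, 4000, 8000 ms, and every later retry stays at the 8000 ms cap. Each returned value lies between the base and 125% of the base.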
3. Connection State Machine
void transitionState(ConnectionState newState) {
Serial.printf("[STATE] %s -> %s\n",
stateNames[currentState], stateNames[newState]);
currentState = newState;
updateConnectionLED();
}
The state machine ensures proper lifecycle: DISCONNECTED -> CONNECTING -> CONNECTED -> DISCONNECTING -> DISCONNECTED. In this lab implementation, guard checks in connect() and disconnect() prevent illegal transitions – for example, calling connect() while already connected returns immediately with a warning.
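One alternative to scattering guard checks across connect() and disconnect() is to centralize the rules in a transition table. The sketch below is not how the lab code works; it is a hypothetical variant in standalone C++, and the names `kLegal`, `isLegalTransition`, and `tryTransition` are invented for illustration. The allowed edges mirror the transitions the lab code actually performs, including the direct CONNECTED -> DISCONNECTED drop after keep-alive failures.

```cpp
#include <cstdint>

enum ConnectionState : uint8_t {
    STATE_DISCONNECTED,
    STATE_CONNECTING,
    STATE_CONNECTED,
    STATE_DISCONNECTING,
    STATE_COUNT  // number of states, not a real state
};

// Row = current state, column = requested state.
constexpr bool kLegal[STATE_COUNT][STATE_COUNT] = {
    /* from DISCONNECTED  */ {false, true,  false, false},
    /* from CONNECTING    */ {true,  false, true,  false},
    /* from CONNECTED     */ {true,  false, false, true },
    /* from DISCONNECTING */ {true,  false, false, false},
};

bool isLegalTransition(ConnectionState from, ConnectionState to) {
    if (from >= STATE_COUNT || to >= STATE_COUNT) return false;
    return kLegal[from][to];
}

// Applies the transition only when the table allows it.
bool tryTransition(ConnectionState& current, ConnectionState next) {
    if (!isLegalTransition(current, next)) return false;
    current = next;
    return true;
}
```

With a table like this, an attempt to jump straight from DISCONNECTED to CONNECTED is rejected and the current state is left untouched, while the normal lifecycle passes through unchanged.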
21.4.8 Challenge Exercises
Challenge 1: Implement Selective Repeat
Modify the code to support a sliding window with selective retransmission:
#define WINDOW_SIZE 4
uint16_t windowBase = 0;
bool acked[WINDOW_SIZE] = {false};
// Only retransmit specifically NACKed packets,
// not the entire window
Goal: Increase throughput by sending multiple messages before waiting for ACKs.
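As a possible starting point for this challenge, the sender-side bookkeeping might look like the sketch below. It is one design among several, written in standalone C++; `SenderWindow` and its members are hypothetical names, not part of the lab code. It tracks which in-flight sequence numbers have been acknowledged and slides the window base past any acknowledged prefix. Indexing the ACK flags by `seq % WINDOW_SIZE` stays consistent across 16-bit wraparound only because the window size is a power of two.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

constexpr std::size_t WINDOW_SIZE = 4;  // must be a power of two for this sketch

struct SenderWindow {
    uint16_t base = 0;  // oldest unacknowledged sequence number
    uint16_t next = 0;  // next sequence number to assign
    std::array<bool, WINDOW_SIZE> acked{};  // indexed by seq % WINDOW_SIZE

    uint16_t inFlight() const { return static_cast<uint16_t>(next - base); }

    bool canSend() const { return inFlight() < WINDOW_SIZE; }

    // Assigns the next sequence number; the caller checks canSend() first.
    uint16_t send() { return next++; }

    // Records an ACK and returns how many slots the window slid forward.
    std::size_t onAck(uint16_t seq) {
        const uint16_t offset = static_cast<uint16_t>(seq - base);
        if (offset >= inFlight()) return 0;  // stale or unknown ACK
        acked[seq % WINDOW_SIZE] = true;
        std::size_t slid = 0;
        while (base != next && acked[base % WINDOW_SIZE]) {
            acked[base % WINDOW_SIZE] = false;  // free the slot for reuse
            ++base;
            ++slid;
        }
        return slid;
    }

    // True when this in-flight sequence number still awaits an ACK.
    bool needsRetransmit(uint16_t seq) const {
        const uint16_t offset = static_cast<uint16_t>(seq - base);
        return offset < inFlight() && !acked[seq % WINDOW_SIZE];
    }
};
```

An ACK that arrives out of order (for example, SEQ=2 before SEQ=0) is recorded without sliding the window; once SEQ=0 and SEQ=1 are acknowledged, the base jumps past all three. A retransmission timer per in-flight packet would call needsRetransmit() before resending.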
Challenge 2: Add CRC-32 Option
Implement CRC-32 alongside CRC-16 and compare:
uint32_t calculateCRC32(const uint8_t* data, size_t length);
// Add a configuration option to switch between CRC-16 and CRC-32
// Measure the CPU time difference
Goal: Understand the trade-off between error detection capability and computational overhead.
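A bitwise implementation matching the declared signature could look like the sketch below. It assumes the common IEEE 802.3 / zlib CRC-32 variant (reflected polynomial 0xEDB88320, initial value and final XOR of 0xFFFFFFFF); the challenge does not specify a variant, so treat that choice as an assumption. A table-driven version would be faster and is worth comparing when measuring CPU time.

```cpp
#include <cstddef>
#include <cstdint>

// Bitwise CRC-32 (IEEE 802.3 / zlib variant), least-significant bit first.
uint32_t calculateCRC32(const uint8_t* data, size_t length) {
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < length; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++) {
            // Reflected form: shift right and XOR the bit-reversed polynomial.
            crc = (crc & 1u) ? (crc >> 1) ^ 0xEDB88320u : (crc >> 1);
        }
    }
    return crc ^ 0xFFFFFFFFu;
}
```

For the standard check input "123456789" this variant yields 0xCBF43926, which gives a quick way to validate the implementation before benchmarking it.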
Challenge 3: Implement Quality of Service Levels
Add support for three QoS levels similar to MQTT:
- QoS 0: Fire and forget (no ACK)
- QoS 1: At least once (ACK required)
- QoS 2: Exactly once (two-phase handshake)
enum QoSLevel { QOS_0, QOS_1, QOS_2 };
bool sendWithQoS(const char* data, QoSLevel qos);
Goal: Understand how different reliability levels affect latency and overhead.
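The difference between the levels shows up most clearly on the receiving side. The sketch below is a simplified model loosely based on MQTT's behavior, not an MQTT implementation; `Packet`, `Receiver`, `onPacket`, and `onRelease` are hypothetical names. QoS 1 delivers every copy it sees, so a retransmission becomes a duplicate, while QoS 2 remembers the packet ID until the sender releases it.

```cpp
#include <cstdint>
#include <string>
#include <unordered_set>
#include <vector>

enum QoSLevel { QOS_0, QOS_1, QOS_2 };

struct Packet {
    uint16_t id;
    QoSLevel qos;
    std::string payload;
};

class Receiver {
public:
    // Processes a packet; returns true when an acknowledgment should be sent.
    bool onPacket(const Packet& p) {
        if (p.qos == QOS_2) {
            // Exactly once: a repeated ID is acknowledged but not delivered again.
            if (!pendingRelease_.insert(p.id).second) return true;
        }
        delivered_.push_back(p.payload);
        return p.qos != QOS_0;  // QoS 0 is fire and forget
    }

    // Second phase of QoS 2: the sender confirms it will not resend this ID.
    void onRelease(uint16_t id) { pendingRelease_.erase(id); }

    const std::vector<std::string>& delivered() const { return delivered_; }

private:
    std::unordered_set<uint16_t> pendingRelease_;
    std::vector<std::string> delivered_;
};
```

Feeding the same QoS 1 packet twice produces two deliveries; feeding the same QoS 2 packet twice produces one. The price of QoS 2 is the extra state held per packet ID and the extra round trip needed to release it.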
21.4.9 Expected Outcomes
After completing this lab, you should observe:
| Metric | Expected Value (20% loss) | Meaning |
|---|---|---|
| Delivery Rate | 95-100% | Retries recover most losses |
| Avg Retries/Message | 0.3-0.5 | Some messages need 1-2 retries |
| CRC Errors | 2-5% of messages | Corruption detected and recovered |
| Throughput | ~0.15 msg/sec | Limited by timeouts and retries |
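The delivery-rate figure can be sanity-checked with a simple probability model. The sketch below assumes that the data packet and its ACK are each lost independently at the configured loss rate (as in the lab's simulated channel), ignores corruption, and treats `maxRetries` as a hypothetical stand-in for the lab's MAX_RETRIES.

```cpp
#include <cmath>

// Probability that a single attempt completes: the data packet and the
// returning ACK must both survive the lossy channel.
double attemptSuccess(double lossRate) {
    return (1.0 - lossRate) * (1.0 - lossRate);
}

// Probability that a message is delivered within maxRetries + 1 attempts.
double deliveryProbability(double lossRate, int maxRetries) {
    const double attemptFailure = 1.0 - attemptSuccess(lossRate);
    return 1.0 - std::pow(attemptFailure, maxRetries + 1);
}
```

At 20% loss each attempt succeeds with probability 0.8 × 0.8 = 0.64. With an assumed limit of 3 retries (4 attempts), the chance that all attempts fail is 0.36^4 ≈ 0.017, so roughly 98% of messages get through, which sits inside the 95-100% range in the table.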
Key Insights:
- Reliability has a cost: Higher packet loss means more retries, more delay, and more power consumption
- CRC catches corruption: Bit errors are detected before corrupted data reaches the application
- Backoff prevents congestion: Exponential backoff reduces network load during problems
- State machines prevent leaks: Proper connection management ensures clean lifecycle
Common Mistake: Forgetting to Handle Connection State Timeouts
The Mistake: Developers implement connection state machines with STATE_CONNECTED but forget to set timeouts for detecting dead connections. The device believes it’s connected while the network path is broken.
Real-World Consequence: A smart irrigation system used MQTT over TCP with keep-alive set to 5 minutes. NAT routers on the cellular network timed out idle connections after 90 seconds, so the NAT mapping expired between MQTT keep-alives (sent every 300 seconds). The device remained in STATE_CONNECTED, sending data packets that the NAT router silently dropped. No error was detected until the next keep-alive attempt, up to 300 seconds later. In the meantime, 4 critical irrigation commands were lost.
What Should Have Happened:
Proper Keep-Alive Configuration:
// WRONG - Keep-alive longer than NAT timeout
mqtt.setKeepAlive(300); // 5 minutes
// RIGHT - Keep-alive shorter than minimum NAT timeout
mqtt.setKeepAlive(60); // 1 minute (below the 90s NAT timeout)
Connection State Timeout Tracking:
enum ConnectionState {
STATE_DISCONNECTED = 0,
STATE_CONNECTING = 1,
STATE_CONNECTED = 2,
STATE_DISCONNECTING= 3,
STATE_ERROR = 4 // Add error state
};
unsigned long lastRxTime = 0;
unsigned long lastTxTime = 0;
const unsigned long RX_TIMEOUT_MS = 120000; // 2 minutes
void checkConnectionHealth() {
if (currentState == STATE_CONNECTED) {
unsigned long now = millis();
// Check if we've received anything recently
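if (now - lastRxTime > RX_TIMEOUT_MS) {
Serial.println("[ERROR] No data received for 2 min - connection dead");
transitionState(STATE_ERROR);
// The link is dead, so skip the graceful disconnect(): its guard only
// accepts STATE_CONNECTED. Drop to DISCONNECTED so connect() is allowed.
transitionState(STATE_DISCONNECTED);
if (connect()) { // Attempt reconnect
lastRxTime = lastTxTime = millis(); // restart the health timers
}
return;
}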
// Check if keep-alive is overdue
if (now - lastTxTime > KEEPALIVE_INTERVAL_MS * 1.5) {
Serial.println("[WARN] Keep-alive overdue - sending ping");
sendKeepalive();
}
}
}
NAT-Aware Keep-Alive Design:
| Network Type | NAT Timeout | Recommended Keep-Alive | Safety Margin |
|---|---|---|---|
| Wi-Fi (home router) | 120-300s | 60s | 2-5× safety |
| Cellular (LTE-M) | 30-90s | 25s | 1.2-3.6× safety |
| Satellite | 60s | 30s | 2× safety |
| Enterprise NAT | 180-600s | 120s | 1.5-5× safety |
Debugging Dead Connections:
void monitorConnectionState() {
Serial.printf("[STATE] Current: %s | RX age: %lu ms | TX age: %lu ms\n",
stateNames[currentState],
millis() - lastRxTime,
millis() - lastTxTime);
if (currentState == STATE_CONNECTED) {
// Verify we can actually reach the peer
if (millis() - lastRxTime > 30000) { // 30s no RX
Serial.println("[WARN] Zombie connection suspected - testing with ping");
if (!sendKeepalive()) {
Serial.println("[ERROR] Keep-alive failed - connection is dead");
transitionState(STATE_ERROR);
}
}
}
}
Lesson Learned:
- Keep-alive must be shorter than the minimum NAT timeout on the path (cellular NAT timeouts can be as short as 30 seconds)
- Track last RX timestamp - if no data received for 2× keep-alive interval, assume connection is dead
- Implement connection health monitoring - periodic checks that transition to STATE_ERROR when timeouts are detected
- Add reconnection logic - automatic recovery from dead connection states
- Log timeout events - record when keep-alive fails so you can tune intervals for your deployment
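One detail worth getting right when tracking the last RX timestamp is counter wraparound: a 32-bit millisecond counter such as millis() on the ESP32 rolls over after roughly 49.7 days. The sketch below shows a wrap-safe elapsed-time check in standalone C++; `LinkHealth` and its members are hypothetical names, and the 60-second keep-alive interval is an assumed value.

```cpp
#include <cstdint>

// Elapsed milliseconds between two readings of a wrapping 32-bit counter.
// Unsigned subtraction gives the right answer even across a rollover.
constexpr uint32_t elapsedMs(uint32_t now, uint32_t earlier) {
    return now - earlier;
}

struct LinkHealth {
    uint32_t lastRxMs = 0;
    uint32_t keepaliveIntervalMs = 60000;  // assumed interval

    void onReceive(uint32_t nowMs) { lastRxMs = nowMs; }

    // Dead when nothing has arrived for more than 2x the keep-alive interval.
    bool isDead(uint32_t nowMs) const {
        return elapsedMs(nowMs, lastRxMs) > 2u * keepaliveIntervalMs;
    }
};
```

Comparing timestamps directly (`now > lastRx + timeout`) breaks at the rollover; subtracting first and comparing the difference does not, which is why the lab code's `now - lastRxTime > RX_TIMEOUT_MS` pattern is the safe form.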
Rule of Thumb: Keep-alive interval should be 40-50% of the minimum NAT timeout in your network path. For cellular IoT, use 25-30 seconds. For Wi-Fi, 60 seconds is usually safe. Always measure actual NAT behavior in your production environment.
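The rule of thumb is simple arithmetic, sketched below as two helper functions. The names `KeepaliveRange`, `recommendedKeepalive`, and `keepaliveIsSafe` are hypothetical, and the input is whatever minimum NAT timeout you have measured on your own network path.

```cpp
#include <cstdint>

struct KeepaliveRange {
    uint32_t minSec;
    uint32_t maxSec;
};

// 40-50% of the shortest NAT timeout on the path, in whole seconds.
constexpr KeepaliveRange recommendedKeepalive(uint32_t minNatTimeoutSec) {
    return KeepaliveRange{minNatTimeoutSec * 40 / 100, minNatTimeoutSec * 50 / 100};
}

// Hard requirement: the keep-alive must fire before the NAT mapping expires.
constexpr bool keepaliveIsSafe(uint32_t keepaliveSec, uint32_t minNatTimeoutSec) {
    return keepaliveSec < minNatTimeoutSec;
}
```

For the 90-second NAT timeout in the irrigation example, the rule suggests a keep-alive between 36 and 45 seconds, and the original 300-second setting fails the basic safety check.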
21.5 See Also
Reliability Deep Dives (Prerequisites):
- Reliability Overview: Five pillars framework – start here for context
- Error Detection: CRC-16 calculation and checksum comparison
- Retry and Sequencing: Exponential backoff algorithms and sequence wraparound
Protocol Implementations (Where These Concepts Appear):
- TCP Fundamentals: TCP’s LISTEN/SYN_SENT/ESTABLISHED state machine and keep-alive option
- CoAP Features: CON vs NON messages, retransmission with ACK/RST
- MQTT QoS: QoS 0/1/2 reliability levels over TCP
- Transport Optimizations: Tuning reliability parameters for IoT workloads
Hands-On Labs:
- CoAP ESP32 Lab: Apply reliability concepts to a real CoAP implementation
- MQTT Client Lab: Implement QoS levels on ESP32
Testing Tools:
- Wireshark Packet Analysis: Capture and inspect retransmissions in a live session
- tc Command (Linux): Inject packet loss and delay for controlled testing
Related Concepts:
- TCP's retransmission timeout (RTO) doubles after each unacknowledged retransmission, the same exponential backoff pattern used in this lab
- Selective Repeat ARQ extends Stop-and-Wait to a sliding window (Challenge 1 above)
- DTLS handshake retransmission uses a similar timeout-and-retransmit scheme with a doubling timer
Common Pitfalls
1. Not Configuring Application-Level Timeouts
Relying only on TCP keepalive for connection health detection can leave connections “stuck” for hours. With the Linux defaults, the first probe goes out after 7,200 s (2 hours), followed by 9 probes at 75 s intervals, so a dead peer is declared only after about 2 hours 11 minutes. IoT applications must implement application-level timeouts: if no expected data arrives within N seconds (N = 2× expected message interval), close and reopen the connection. TCP keepalive was designed to clean up long-dead connections, not to detect failures quickly.
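The detection time follows directly from the three keepalive parameters, as the sketch below shows. The struct and function names are hypothetical; the fields correspond to the TCP_KEEPIDLE, TCP_KEEPINTVL, and TCP_KEEPCNT socket options, and the "tuned" values in the usage note are illustrative rather than a recommendation.

```cpp
#include <cstdint>

struct TcpKeepaliveConfig {
    uint32_t idleSec;      // TCP_KEEPIDLE: idle time before the first probe
    uint32_t intervalSec;  // TCP_KEEPINTVL: gap between probes
    uint32_t probeCount;   // TCP_KEEPCNT: unanswered probes before giving up
};

// Linux defaults as quoted in the text above.
constexpr TcpKeepaliveConfig kLinuxDefaults{7200, 75, 9};

// Worst-case time from the last traffic until the peer is declared dead.
constexpr uint32_t worstCaseDetectionSec(const TcpKeepaliveConfig& c) {
    return c.idleSec + c.intervalSec * c.probeCount;
}

// Application-level read timeout: N = 2x the expected message interval.
constexpr uint32_t appReadTimeoutSec(uint32_t expectedMessageIntervalSec) {
    return 2 * expectedMessageIntervalSec;
}
```

With the defaults this gives 7200 + 9 × 75 = 7875 seconds (2 hours 11 minutes 15 seconds). A configuration of 30 s idle, 10 s interval, and 3 probes would detect a dead peer in 60 seconds, and a sensor that reports every 30 seconds would use a 60-second application read timeout.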
2. Ignoring TIME_WAIT State Socket Exhaustion
The endpoint that closes a TCP connection first holds it in TIME_WAIT for 2×MSL (Maximum Segment Lifetime); Linux fixes this period at 60 s. A gateway that opens and closes 1,000 short-lived connections per second to the same destination accumulates 60,000 sockets in TIME_WAIT within 60 seconds, more than the default Linux ephemeral port range can supply (32768-60999, about 28,000 ports). Mitigate with: connection pooling (keep connections alive for multiple transactions), the SO_REUSEADDR socket option (lets a restarted server bind its listening port while old connections are still in TIME_WAIT), or SO_LINGER with a zero timeout (immediate RST on close, skipping TIME_WAIT; caution: may cause data loss).
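The arithmetic behind this pitfall is worth making explicit. The sketch below uses hypothetical helper names and takes the number of available ports as a parameter, since it depends on the system's configured ephemeral range; the usage note assumes the Linux default range of 32768-60999 (28,232 ports) and connections that all target the same destination address and port.

```cpp
#include <cstdint>

// Steady-state number of sockets sitting in TIME_WAIT.
constexpr uint64_t timeWaitSockets(uint64_t closesPerSec, uint64_t timeWaitSec) {
    return closesPerSec * timeWaitSec;
}

// Highest connection rate that never runs out of local ports.
constexpr uint64_t maxSustainableRate(uint64_t availablePorts, uint64_t timeWaitSec) {
    return availablePorts / timeWaitSec;
}
```

At 1,000 closes per second and a 60-second TIME_WAIT, 60,000 sockets are parked at any moment. With 28,232 usable ports, the sustainable rate to a single destination is about 470 new connections per second, which is why pooling connections matters long before the 1,000 per second mark.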
3. Reproducing Connection Issues Without Traffic Capture
Debugging intermittent IoT connection drops without a packet capture is speculation. Start a Wireshark capture before reproducing the issue and look for: RST packets (connection reset by remote), TCP retransmissions (packet loss), window size reaching 0 (receiver buffer full), and delta time between packets (latency spikes). A 30-second packet capture of a connection failure saves hours of log analysis.
4. Confusing TCP Connection Refused with Connection Timed Out
“Connection refused” (errno ECONNREFUSED) means the target IP is reachable but no process is listening on that port — check service status. “Connection timed out” (errno ETIMEDOUT) means packets are not reaching the destination at all — check routing, firewall, and network connectivity. Treating both as “connection failure” and applying the same fix (restart service) wastes time. Always distinguish the error type from socket error codes before investigating.
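In code, the distinction comes down to inspecting the errno value after a failed connect() instead of collapsing every failure into one case. A minimal sketch follows; `ConnectFailure`, `classifyConnectError`, and `nextStep` are hypothetical names, and the suggested actions simply restate the guidance above.

```cpp
#include <cerrno>

enum class ConnectFailure { Refused, TimedOut, Other };

// Maps an errno value from a failed connect() to a failure category.
ConnectFailure classifyConnectError(int err) {
    switch (err) {
        case ECONNREFUSED: return ConnectFailure::Refused;
        case ETIMEDOUT:    return ConnectFailure::TimedOut;
        default:           return ConnectFailure::Other;
    }
}

// Suggests where to start investigating for each category.
const char* nextStep(ConnectFailure failure) {
    switch (failure) {
        case ConnectFailure::Refused:
            return "Host reachable, nothing listening: check service status and port";
        case ConnectFailure::TimedOut:
            return "Packets not arriving: check routing, firewall, and connectivity";
        default:
            return "Unclassified error: log the errno value and investigate";
    }
}
```

A client would call `classifyConnectError(errno)` immediately after connect() returns -1, before any other call can overwrite errno, and log the result alongside the retry decision.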
21.6 What’s Next
After completing this connection reliability lab, continue with these chapters to apply and extend what you have built:
| Next Chapter | Topic | Why It Follows |
|---|---|---|
| Transport Optimizations | Advanced reliability tuning and IoT-specific adaptations | Apply the parameters you configured in this lab to real-world scenarios |
| CoAP Overview | CoAP Confirmable messages and retransmission | See how a real IoT protocol implements the same CON/ACK pattern you built |
| MQTT QoS and Session | MQTT QoS 0/1/2 reliability levels | Compare MQTT’s broker-mediated reliability against your direct ACK design |
| MQTT Fundamentals | MQTT protocol architecture and broker model | Extend your connection state machine knowledge to a widely deployed IoT protocol |
| Reliability Overview | Five reliability pillars framework | Revisit the conceptual foundations now that you have implemented all five pillars |
| Error Detection | CRC algorithms and checksum theory | Deepen your understanding of the CRC-16 code you wrote in this lab |