16  Transport Scenarios & Pitfalls

In 60 Seconds

Using TCP for real-time sensor data over lossy Wi-Fi creates cascading failures: 10% packet loss raises average handshake time to 411 ms per reading, and retransmission storms drain batteries up to 8x faster. This chapter walks through five disaster scenarios (TCP over lossy networks, UDP firmware updates, persistent connection storms, broadcast amplification, and satellite-link overhead) plus seven implementation pitfalls, with concrete mitigation strategies for your IoT deployments.

Key Concepts
  • Smart Home Scenario: BLE → hub → MQTT over TCP → cloud; TCP persistent connection for subscribe; UDP/CoAP for individual device-to-hub telemetry on local network
  • Industrial Monitoring Scenario: Modbus RTU over serial → OPC-UA over TCP → historian; TCP for reliable command/response; no tolerance for dropped commands or data; TLS required
  • Fleet Tracking Scenario: GNSS → LTE-M → UDP/CoAP → server; UDP minimizes data plan usage; CoAP Confirmable for location updates every 60 s; TCP for OTA firmware updates
  • Agriculture IoT Scenario: NB-IoT → CoAP → cloud; 50-byte soil readings every 30 min; CoAP Non-Confirmable (occasional loss acceptable); UDP minimizes NB-IoT power and data
  • Building Automation Scenario: LoRaWAN → MQTT → BACnet/IP → BMS; MQTT QoS 1 for alarm delivery; QoS 0 for routine telemetry; TCP with TLS for secure BMS integration
  • Emergency Alert Scenario: 4G LTE → HTTPS/WebSocket → multi-region cloud; TCP + TLS mandatory; WebSocket for real-time bidirectional push; requires <100 ms server processing time
  • Protocol Selection Matrix: Data frequency (high → UDP), reliability requirement (high → TCP), network type (cellular → UDP for lower overhead), device constraints (MCU < 64 KB RAM → UDP/CoAP only)
  • Multi-Protocol Gateway Scenario: Edge gateway translating between: BLE (GATT) ↔︎ MQTT (TCP), Zigbee ↔︎ MQTT, CoAP (UDP) ↔︎ HTTP (TCP); requires protocol bridging with semantic mapping

16.1 Learning Objectives

By the end of this chapter, you will be able to:

  • Identify protocol mismatches before deployment by analyzing five real-world failure scenarios
  • Diagnose cascading failures caused by TCP over lossy wireless or high-latency (satellite) connections
  • Select appropriate protocols by distinguishing the conditions under which each of the 7 common transport pitfalls occurs
  • Apply mitigation strategies from real-world protocol disasters to your own IoT deployments
  • Evaluate trade-offs between TCP reliability guarantees and UDP efficiency for constrained IoT devices
  • Justify protocol choices using quantitative overhead calculations and battery-life projections

This chapter walks through real IoT disasters caused by choosing the wrong transport protocol. Should a warehouse sensor use TCP or UDP? What happens when firmware is sent over UDP? What causes a “thundering herd” crash? By studying these failures, you will develop the judgment to make good protocol choices and avoid the seven most common transport pitfalls in your own projects.

“This chapter is about what goes WRONG when you pick the wrong protocol,” said Max the Microcontroller. “And trust me, the failures are spectacular. TCP over a lossy warehouse Wi-Fi? Cascading retransmission storms that drain batteries 8 times faster!”

“My favorite disaster is the satellite link scenario,” said Sammy the Sensor. “Someone uses TCP over a satellite with 600-millisecond round-trip time. The three-way handshake alone takes 1.8 seconds before sending a single byte of data! With readings every 10 seconds, you spend 36 percent of your time just on protocol overhead.”

“The UDP-without-retry disaster is equally bad,” warned Lila the LED. “A security company uses plain UDP for intruder alerts. A packet gets lost due to momentary interference – and nobody knows the alarm went off. No retry, no confirmation, no response. That is a life-safety failure.”

“Each scenario teaches a specific lesson,” said Bella the Battery. “Study these failures and you will develop an instinct for which protocol fits which situation. The best engineers learn from other people’s mistakes, not their own.”

16.2 What Would Happen If… (Scenarios)

16.2.1 Scenario 1: Using TCP for Real-Time Sensor Data Over Lossy Connection

The Setup: You deploy 500 temperature sensors across a warehouse with spotty Wi-Fi. Each sensor sends readings every 10 seconds using TCP.

What Goes Wrong:

Sequence diagram illustrating TCP retransmission delays on a lossy wireless link: a SYN packet is lost requiring multiple retries, each consuming round-trip time and exponential backoff, causing cumulative delays of hundreds of milliseconds before data transmission begins.
Figure 16.1: TCP Retransmission Delays Over Lossy Wireless Connection

The Cascade of Failures:

  1. Connection Setup Delays (over 400 ms per reading)
    • 10% packet loss = on average 1.37 handshake attempts (each taking 1.5 RTTs)
    • Reading due every 10 seconds, handshake consumes ~411 ms at 200 ms RTT
    • 4% of time spent on handshake overhead alone, multiplied across all retransmission events

For a TCP handshake with 10% packet loss, all three handshake packets (SYN, SYN-ACK, and ACK) must arrive, so the per-attempt success probability shrinks exponentially with the loss rate: a single lost packet forces the whole attempt to retry.

Success probability per attempt: \[P_{\text{success}} = (1 - 0.10)^3 = 0.90^3 = 0.729\]

Expected number of handshake attempts: \[E[\text{attempts}] = \frac{1}{P_{\text{success}}} = \frac{1}{0.729} = 1.37 \text{ attempts}\]

With a 200ms round-trip time (RTT) per handshake attempt: \[T_{\text{handshake}} = 1.37 \times (1.5 \times 200\text{ms}) = 411\text{ms}\]

For sensors reporting every 10 seconds, this represents \(\frac{411\text{ms}}{10,000\text{ms}} = 4.1\%\) of time spent just establishing connections. Over 8,640 daily readings, that’s 59.2 minutes of wasted radio time. At 50 mA radio TX current with a 2000 mAh battery: \(50\text{mA} \times \frac{59.2\text{ min}}{60} = 49.3\text{ mAh/day}\), reducing battery life from years to months.

  2. Retransmission Spiral
    • Each lost packet triggers exponential backoff: 200ms, 400ms, 800ms, 1600ms
    • After 3 retries (3 seconds), reading is already stale
    • Gateway receives “72 degrees F” at T=7s when current temp is now 73 degrees F
  3. Head-of-Line Blocking
    • Lost packet at T=0 blocks delivery of packets at T=10, T=20, T=30
    • TCP delivers in order, so fresh data waits behind old retransmissions
    • Dashboard shows: “72… 72… 72… [pause]… 73, 74, 75, 76” (burst)
  4. Connection State Explosion
    • 500 sensors x 9 KB TCP state (send + receive buffers + connection record) = 4.5 MB RAM on gateway
    • A 64 MB gateway handles roughly 7,000 sensors before running out of memory
    • Scaling to 10,000 sensors triggers out-of-memory crashes (see Scenario 3)
  5. Battery Catastrophe
    • TCP retransmissions keep radio on: 4s to 15s per reading
    • Battery life: 30 days to 2 days
    • IT department gets 500 “low battery” alerts simultaneously

The Fix:

Switch to UDP with CoAP:
- Send reading immediately (no handshake)
- If lost, next reading comes in 10s anyway
- No connection state on gateway
- Battery life: 2 days to 60 days
- Dashboard shows smooth updates

Lesson: TCP’s reliability mechanisms become the problem when network is lossy and data is replaceable.

Try It: TCP Over Lossy Wi-Fi Calculator

Adjust packet loss rate, round-trip time, and sensor count to see how TCP overhead cascades into connection delays and battery drain for periodic sensor readings.
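A minimal Python sketch of that calculator, using the same assumptions as the worked example above (1.5 RTTs per handshake attempt, 50 mA TX current, 2000 mAh battery); the function name and defaults are illustrative, not from a real library:

```python
# Sketch of the "TCP over lossy Wi-Fi" calculator described above.
# Assumptions match the worked example: 1.5 RTTs per handshake attempt,
# illustrative radio TX current and battery capacity.

def tcp_overhead(loss=0.10, rtt_ms=200, interval_s=10,
                 tx_ma=50, battery_mah=2000):
    """Return (handshake ms per reading, time overhead, mAh/day)."""
    p_success = (1 - loss) ** 3              # SYN, SYN-ACK, ACK must all survive
    attempts = 1 / p_success                 # expected handshake attempts
    handshake_ms = attempts * 1.5 * rtt_ms   # 1.5 RTTs per attempt
    overhead = handshake_ms / (interval_s * 1000)
    readings_per_day = 86_400 // interval_s
    radio_min_per_day = readings_per_day * handshake_ms / 60_000
    mah_per_day = tx_ma * radio_min_per_day / 60
    return handshake_ms, overhead, mah_per_day

hs, ov, mah = tcp_overhead()
print(f"{hs:.0f} ms handshake, {ov:.1%} overhead, {mah:.1f} mAh/day")
```

With the defaults this reproduces the chapter's numbers: ~411 ms per handshake, 4.1% overhead, and roughly 49 mAh/day of handshake-only radio drain.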


16.2.2 Scenario 2: Using UDP for Firmware Updates

The Setup: Product manager says “UDP is faster, let’s use it for firmware updates to save time!”

What Goes Wrong:

Update file: 2 MB = 2,000,000 bytes / 1400 bytes per packet = 1,429 packets

With 2% packet loss:

  • Expected lost packets: 1,429 x 0.02 = ~29 packets lost
  • Lost locations: random throughout the file
Diagram showing a firmware binary split into 1,429 UDP packets being transmitted to a device. Random packets are lost (shown in red), leaving gaps in the firmware image stored in flash memory. The device attempts to boot but encounters missing data at critical addresses, causing a crash and infinite reset loop.
Figure 16.2: UDP Firmware Update With Packet Loss Causes Corrupted Flash

Device attempts to boot:

  1. Bootloader: Jumps to address 0x0300 - Finds 0x0000 (missing data) - CRASH (infinite reset loop)
  2. Device is now a BRICK
  3. Requires hardware JTAG to recover
  4. $50 RMA cost x 10,000 devices = $500,000

Even “Smart” Recovery Fails:

Attempt 1: “Let’s add checksums!”
- Device checks checksum, detects corruption
- Requests retransmit… using UDP
- Retransmit request gets lost (2% chance)
- Device stuck waiting forever

For a 2 MB firmware update at 2% packet loss with 1400-byte UDP packets:

Total packets needed: \[N = \frac{2{,}000{,}000 \text{ bytes}}{1{,}400 \text{ bytes/packet}} = 1{,}429 \text{ packets}\]

Expected lost packets: \[N_{\text{lost}} = 1{,}429 \times 0.02 = 28.6 \approx 29 \text{ packets}\]

With triple redundancy (send each packet 3 times), the probability all three copies are lost is: \[P_{\text{all lost}} = (0.02)^3 = 0.000008 = 0.0008\%\]

Expected packets still lost: \[N_{\text{still lost}} = 1{,}429 \times 0.000008 = 0.011\]

This seems good, but an expectation of 0.011 lost packets per device still means each device has about a 1% chance \((1 - (1 - 0.000008)^{1429} \approx 1.1\%)\) of losing at least one packet completely. Across 10,000 deployed devices, that’s roughly 100 bricked units at $50 RMA cost each, about $5,000 in failures. TCP’s guaranteed in-order delivery eliminates this risk entirely.

Attempt 2: “Let’s send each packet 3 times!”
- 2 MB x 3 = 6 MB transmitted
- Probability all 3 copies lost: 0.02 cubed = 0.000008 (expected ~0.011 packets still lost per device)
- Update time: 50 seconds to 150 seconds
- User thinks device is frozen, unplugs it mid-update
- Brick count: +5,000 devices

The Fix:

Use TCP for firmware updates:
- Guaranteed delivery (automatic retransmission)
- In-order assembly (no packet reordering)
- Error detection (checksum + ACK)
- Connection state (resume after disconnect)

Result:
- 2 MB update succeeds 99.99% of time
- Takes 60 seconds (acceptable)
- Zero bricks
- Happy customers

Lesson: When failure is catastrophic, TCP’s overhead is insurance, not waste.

Try It: Firmware Update Bricking Risk Calculator

Adjust firmware size, packet loss rate, number of deployed devices, and redundancy factor to see how many devices could be bricked and the resulting RMA cost.
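A Python sketch of that calculator, assuming independent per-packet loss (a device bricks if any packet loses all of its redundant copies); the function name and defaults are illustrative:

```python
# Sketch of the bricking-risk calculator described above.
# Assumes independent per-packet loss; a device is bricked if any
# packet loses all of its redundant copies.

def bricking_risk(firmware_bytes=2_000_000, packet_bytes=1400,
                  loss=0.02, devices=10_000, copies=1, rma_cost=50):
    packets = -(-firmware_bytes // packet_bytes)   # ceiling division
    p_packet_dead = loss ** copies                 # all copies of one packet lost
    p_brick = 1 - (1 - p_packet_dead) ** packets   # any packet completely lost
    bricked = devices * p_brick
    return packets, p_brick, bricked, bricked * rma_cost

packets, p_brick, bricked, cost = bricking_risk(copies=3)
print(f"{packets} packets, P(brick)={p_brick:.2%}, "
      f"~{bricked:.0f} bricked, ${cost:,.0f} RMA")
```

With `copies=1` (no redundancy), virtually every device bricks; even triple redundancy leaves roughly a 1% per-device risk, matching the math above.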


16.2.3 Scenario 3: Maintaining 10,000 Persistent TCP Connections

The Setup: Smart building with 10,000 IoT sensors (lights, HVAC, occupancy) all using MQTT over persistent TCP connections.

What Goes Wrong:

Gateway Resource Exhaustion:

Per TCP connection overhead:
- Send buffer:      4 KB
- Receive buffer:   4 KB
- Connection state: 1 KB
- Keep-alive timer: 1 timer
Total per sensor:   9 KB

10,000 sensors x 9 KB = 90 MB RAM
Gateway has 64 MB RAM - CRASH

The Cascade:

  1. Gateway boots up
    • First 7,000 sensors connect successfully (63 MB used)
    • Sensors 7,001-10,000 try to connect: “Out of memory”
    • Gateway crashes, all 7,000 connections lost
  2. All 10,000 sensors retry simultaneously
    • TCP SYN flood (10,000 SYN packets in 1 second)
    • Gateway’s SYN queue: 128 slots
    • 99% of connections rejected
    • Exponential backoff: retry after 1s, 2s, 4s, 8s…
  3. The “Thundering Herd”
    • 10,000 devices retry at T=1s: gateway crashes
    • 10,000 devices retry at T=2s: gateway crashes
    • Gateway stuck in crash loop
    • Building automation: OFFLINE
  4. Keep-Alive Battery Drain
    • TCP keep-alive: every 2 hours (configurable)
    • Battery sensor must wake up: 12 times/day just for keep-alive
    • Actual data sent: 48 times/day (every 30 min)
    • Keep-alive overhead: 20% of battery life wasted
    • Battery life: 1 year to 10 months

The Fix:

Option 1: Short-lived connections
- Connect - Send data - Disconnect
- Connection overhead: acceptable for infrequent data
- RAM usage: ~10 MB (only active connections)

Option 2: UDP with CoAP
- No persistent connections
- Sensors send data as needed
- Gateway RAM: ~5 MB (minimal state)
- Battery life: +20% (no keep-alive)

Option 3: MQTT-SN over UDP
- Same pub/sub semantics
- No TCP overhead
- Optimized for sensor networks

Result: System scales to 50,000+ sensors

Lesson: Persistent TCP connections don’t scale for massive IoT deployments. Stateless protocols win.

Try It: Gateway Connection Scaling Explorer

Adjust the number of sensors, per-connection memory overhead, and gateway RAM to see when connection state exhaustion crashes your gateway.
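A Python sketch of that explorer. The 9 KB per-connection figure matches the breakdown in the text; the 1 MB reserve for the OS and broker is an illustrative assumption:

```python
# Sketch of the connection-scaling explorer described above.
# per_conn_kb matches the 9 KB breakdown in the text; reserve_mb is an
# illustrative allowance for the OS and broker process.

def gateway_capacity(ram_mb=64, per_conn_kb=9, reserve_mb=1):
    """Max persistent TCP connections before memory is exhausted."""
    usable_kb = (ram_mb - reserve_mb) * 1024
    return usable_kb // per_conn_kb

for sensors in (500, 7_000, 10_000):
    cap = gateway_capacity()
    status = "OK" if sensors <= cap else "OOM CRASH"
    print(f"{sensors:>6} sensors vs capacity {cap}: {status}")
```

With the defaults the gateway tops out at roughly 7,000 connections, which is why the first 7,000 sensors connect and the rest trigger the crash loop described above.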



16.2.4 Scenario 4: UDP Broadcast Storm (Self-Inflicted DDoS)

The Setup: Smart home system uses UDP broadcast for device discovery. 100 devices on LAN.

What Goes Wrong:

Device boots - Sends broadcast: "Who's there?"
100 devices respond simultaneously
100 x 200-byte response = 20 KB burst

100 devices boot after power outage:
100 devices broadcast "Who's there?"
Each triggers 100 responses
Total: 100 x 100 x 200 bytes = 2 MB burst

Wi-Fi router (100 Mbps):
- Can handle 2 MB in 160 ms (seems fine?)
- BUT: Wi-Fi is half-duplex, shared medium
- All 100 devices transmit at once = collisions
- Collision - exponential backoff - retransmit
- Retransmits cause more collisions
- Network saturated for 30+ seconds
- User's video call: FROZEN

The Amplification Attack (Accidental):

Attacker (or buggy sensor):
while(true) {
    send_udp_broadcast("Who's there?");
}

Result:
- 1 broadcast triggers 100 responses
- 100:1 amplification factor
- Sending 1 Mbps creates 100 Mbps of response traffic
- Network: UNUSABLE
- No authentication to stop it (UDP is connectionless)

The Fix:

Option 1: Rate limiting
- Max 1 broadcast per device per 10 seconds
- Prevents accidental storms

Option 2: Unicast discovery
- Devices register with central server
- Server provides peer list
- No broadcast spam

Option 3: Multicast with TTL
- Use multicast instead of broadcast
- Set TTL=1 (don't cross routers)
- Scope limits damage

Option 4: TCP for discovery
- Slower, but orderly
- Connection limits act as natural rate limit

Lesson: UDP broadcast is powerful but dangerous. Always add rate limiting and scoping.
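Option 1 (rate limiting) can be sketched in a few lines of Python. The class name and the 10-second interval are illustrative, not from a real library:

```python
# Minimal sketch of Option 1: allow at most one discovery broadcast per
# device per 10 seconds, dropping anything that arrives sooner.
import time

class BroadcastLimiter:
    def __init__(self, min_interval_s=10.0):
        self.min_interval = min_interval_s
        self.last_sent = {}          # device_id -> last broadcast time

    def allow(self, device_id, now=None):
        now = time.monotonic() if now is None else now
        last = self.last_sent.get(device_id)
        if last is not None and now - last < self.min_interval:
            return False             # too soon: drop this broadcast
        self.last_sent[device_id] = now
        return True

limiter = BroadcastLimiter()
print(limiter.allow("dev1", now=0.0))   # first broadcast passes
print(limiter.allow("dev1", now=5.0))   # only 5 s elapsed: dropped
print(limiter.allow("dev1", now=11.0))  # interval passed: allowed again
```

Apply the check before every `sendto()` on the broadcast address; a buggy `while(true)` loop then costs one packet per 10 seconds instead of saturating the LAN.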


16.3 Common Mistakes with Transport Protocols

16.3.1 Mistake 1: “TCP is Always More Reliable”

The Myth: “TCP guarantees delivery, so it’s always the safer choice for critical IoT data.”

The Reality: TCP guarantees delivery only while the connection is alive. If the connection drops, data in flight is lost.

Real Example:

Smart meter sends billing data via TCP
- Reading at T=0: Connection active, ACK received (success)
- Reading at T=60s: Connection drops mid-transmission
- TCP retries 3 times over 9 seconds
- Connection dead (sensor moved out of range)
- Data LOST, no notification to application
- Billing system thinks: "No data = $0 usage"
- Customer: Free electricity this month!

The Fix:

Application-level reliability:
1. Use sequence numbers (not TCP's)
2. Store-and-forward (database persistence)
3. Application-level ACK (separate from TCP ACK)
4. Retry at application layer with exponential backoff

Example (MQTT QoS 2):
- TCP ACK confirms packet delivered
- MQTT PUBREC confirms application received
- Persist to disk before PUBCOMP
- Even if TCP dies, MQTT resumes on reconnect
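The bookkeeping behind application-level ACKs can be sketched as follows. `send_over_tcp` is a hypothetical stand-in for your actual transport; only the sequence-number and retry pattern is the point:

```python
# Sketch of application-level reliability on top of TCP, as outlined above.
# The send_over_tcp callable is a hypothetical stand-in for the transport.
import time

class ReliableSender:
    def __init__(self, send_over_tcp):
        self.send = send_over_tcp
        self.seq = 0
        self.unacked = {}                     # seq -> (payload, last_sent)

    def submit(self, payload):
        self.seq += 1
        self.unacked[self.seq] = (payload, time.monotonic())
        self.send(self.seq, payload)          # TCP ACK is NOT this ACK
        return self.seq

    def on_app_ack(self, seq):
        self.unacked.pop(seq, None)           # confirmed end-to-end

    def retry_stale(self, timeout=5.0):
        now = time.monotonic()
        for seq, (payload, sent) in list(self.unacked.items()):
            if now - sent > timeout:          # TCP died mid-flight?
                self.send(seq, payload)       # resend at application layer
                self.unacked[seq] = (payload, now)

sent = []
sender = ReliableSender(lambda seq, p: sent.append((seq, p)))
s = sender.submit(b"kWh=42")
sender.on_app_ack(s)                          # billing server confirmed
print(len(sender.unacked))                    # nothing left pending
```

If the connection dies before `on_app_ack` fires, the reading stays in `unacked` and `retry_stale` resends it on reconnect, which is exactly the guarantee raw TCP could not give the smart meter.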

Lesson: TCP is transport reliability, not application reliability. Always add application-level confirmation for critical data.


16.3.2 Mistake 2: Forgetting Source Port in UDP (Breaks Request/Response)

The Myth: “UDP is stateless, so I don’t need to track source ports.”

The Reality: Without proper source port handling, you can’t match responses to requests.

Real Example (CoAP):

# WRONG: Reusing same source port
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(('0.0.0.0', 5683))  # Always port 5683

# Send 10 requests to 10 sensors
for sensor in sensors:
    sock.sendto(b"GET /temperature", (sensor.ip, 5683))

# Try to receive responses
for i in range(10):
    data, addr = sock.recvfrom(1024)
    # Which request does this answer?
    # NO WAY TO KNOW! All came from port 5683

What Goes Wrong:

  • 10 requests sent to different sensors
  • All responses come back to port 5683
  • Responses arrive out of order: sensor5, sensor1, sensor3…
  • Can’t match response to request
  • Data assigned to wrong sensor: “Living room is 140 degrees F” (actually server room!)

The Fix:

# CORRECT: Ephemeral source ports + token matching
import json
import random
import select
import socket
import time

pending = {}  # token -> request bookkeeping

for sensor in sensors:
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(('0.0.0.0', 0))  # 0 = OS assigns a unique ephemeral port
    token = random.randint(0, 65535)
    request = {"token": token, "method": "GET", "path": "/temperature"}
    sock.sendto(json.dumps(request).encode(), (sensor.ip, 5683))

    # Store mapping so each response can be matched to its request
    pending[token] = {"sensor": sensor, "sock": sock, "timestamp": time.time()}

# Receive responses: wait only on sockets that actually have data
while pending:
    socks = [info["sock"] for info in pending.values()]
    readable, _, _ = select.select(socks, [], [], 2.0)
    if not readable:
        break  # timed out: remaining sensors unreachable
    for sock in readable:
        data, addr = sock.recvfrom(1024)
        response = json.loads(data)
        token = response.get("token")
        if token in pending:          # match response to request
            info = pending.pop(token) # safe: not iterating the dict
            print(f"{info['sensor'].name}: {response['data']}")

Lesson: UDP is stateless at transport, but application must maintain state. Use tokens/IDs to match requests to responses.


16.3.3 Mistake 3: TCP Nagle’s Algorithm Delays Small Packets

The Myth: “TCP is slower than UDP, but we’ll just optimize later.”

The Reality: TCP’s Nagle algorithm intentionally delays small packets to batch them. For IoT real-time commands, this adds 40-200ms latency.

What Is Nagle’s Algorithm:

Intent: Reduce small packet overhead
Mechanism: Buffer small writes until:
  1. 1 full MSS (1460 bytes) accumulated, OR
  2. ACK received for previous data, OR
  3. 200ms timeout (implementation-dependent)

Example:
- Send "ON" command (2 bytes)
- Nagle waits for more data...
- Waits for ACK from last packet...
- Waits 200ms...
- Finally sends "ON" (2 bytes in 54-byte packet)
- Light turns on 200ms late (user perceives lag)

Real Example (Smart Home):

User: "Alexa, turn on living room lights"
Alexa to Gateway: "ON" (2 bytes via TCP)
Nagle: Buffering... (200ms delay)
Gateway to Light: "ON" (finally!)

User: "Why is there a delay? My old switch was instant!"
Product review: one star "Laggy, returning it"

The Fix:

// Disable Nagle for real-time commands
int flag = 1;
setsockopt(sock, IPPROTO_TCP, TCP_NODELAY, &flag, sizeof(flag));

// Now "ON" sends immediately (no buffering)
// Latency: 200ms to 20ms
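The same fix in Python, for consistency with the other sketches in this chapter; `TCP_NODELAY` must be set before latency-sensitive writes:

```python
# Python equivalent of the C fix above: disable Nagle so small writes
# like b"ON" go out immediately instead of waiting for an ACK or the
# ~200 ms Nagle timer.
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

# Confirm the option took effect (non-zero means Nagle is disabled)
print(sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY))
```

Set the option once, right after creating the socket and before `connect()`; it applies to every subsequent `send()` on that connection.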

Trade-offs:

  • TCP_NODELAY = ON: Low latency, more packets (slightly more overhead)
  • TCP_NODELAY = OFF: Higher latency, fewer packets (Nagle batching)

When to Disable Nagle:

  • Real-time commands (lights, locks, motors) - YES
  • Interactive applications (gaming, remote control) - YES
  • Telemetry with small frequent updates - YES
  • Bulk transfers (firmware, logs) - NO, keep Nagle on

Lesson: TCP defaults are optimized for large data transfers, not IoT real-time commands. Always set TCP_NODELAY for latency-sensitive applications.

Try It: Nagle’s Algorithm Latency Explorer

See how Nagle’s algorithm affects latency for different packet sizes and command types. Toggle TCP_NODELAY to compare immediate sending versus Nagle buffering.


16.3.4 Mistake 4: Not Setting UDP Socket Timeouts (Hangs Forever)

The Myth: “UDP is fire-and-forget, so I don’t need timeouts.”

The Reality: When you call recvfrom() expecting a response, you’ll wait forever if the packet was lost.

Real Example:

# WRONG: No timeout
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.sendto(b"GET /temperature", ("sensor1.local", 5683))
data, addr = sock.recvfrom(1024)  # HANGS FOREVER if packet lost
print(data)

What Goes Wrong:

  • Send request to sensor
  • Sensor is offline (battery dead)
  • recvfrom() waits forever
  • Application thread: FROZEN
  • Watchdog timer: Never triggers (thread still “alive”)
  • User: “App is frozen, force quit”

The Fix (Multiple Strategies):

Strategy 1: Socket timeout

sock.settimeout(2.0)  # 2 second timeout
try:
    sock.sendto(b"GET /temp", ("sensor", 5683))
    data, addr = sock.recvfrom(1024)
except socket.timeout:
    print("Sensor offline or packet lost")

Strategy 2: Non-blocking with select()

import select

sock.setblocking(False)
sock.sendto(b"GET /temp", ("sensor", 5683))

ready = select.select([sock], [], [], 2.0)  # 2s timeout
if ready[0]:
    data, addr = sock.recvfrom(1024)
else:
    print("Timeout: No response")

Strategy 3: Application-level timeout with retry

def udp_request_with_retry(sock, addr, data, retries=3, timeout=2.0):
    sock.settimeout(timeout)
    for attempt in range(retries):
        try:
            sock.sendto(data, addr)
            response, _ = sock.recvfrom(1024)
            return response
        except socket.timeout:
            print(f"Attempt {attempt+1} failed, retrying...")
    raise TimeoutError(f"Sensor unreachable after {retries} attempts")

Lesson: UDP requires application-level timeout handling. Always set socket timeouts to prevent infinite hangs.


16.3.5 Mistake 5: Using UDP Checksum Optional (IPv4) for IoT

The Myth: “Checksums add overhead, and IoT needs to be lightweight, so skip them.”

The Reality: Wireless links have high bit error rates. Without checksums, corrupted data silently accepted.

Real Example:

Scenario: Smart thermostat over Zigbee (high noise)
Bit error rate: 10^-3 (1 error per 1000 bits)
100-byte packet = 800 bits
Probability of error: 1 - (0.999)^800 = 55%

Without checksum:
- Receive: "SET TEMP 72 degrees F"
- Actual bits (corrupted): "SET TEMP 127 degrees F"
- Thermostat sets heat to 127 degrees F
- House becomes sauna
- Elderly resident: Heat stroke

With checksum:
- Receive corrupted packet
- Checksum fails
- Packet discarded
- Thermostat: No change (safe default)
- Next packet arrives (retry): "SET TEMP 72 degrees F" (correct)
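The 55% figure quoted above follows directly from the bit error rate; a one-line check:

```python
# Quick check of the corruption probability quoted above.
ber = 1e-3                       # 1 bit error per 1000 bits
bits = 100 * 8                   # 100-byte packet
p_corrupt = 1 - (1 - ber) ** bits
print(f"{p_corrupt:.0%}")        # roughly 55% of packets carry a flipped bit
```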

IPv6 Made Checksums Mandatory:

IPv4: UDP checksum optional (can be 0x0000)
IPv6: UDP checksum MANDATORY (must be calculated)

Reason: IPv6 removed IP-layer checksum
        Transport layer must ensure integrity

The Fix:

// Always enable UDP checksum (most platforms default ON)
// Just verify it's not disabled:

// DON'T DO THIS:
int no_check = 1;
setsockopt(sock, SOL_SOCKET, SO_NO_CHECK, &no_check, sizeof(no_check));
// ^^^ DANGEROUS: Disables checksum validation

// Correct: Let stack handle checksums (default behavior)
// No code needed - checksums automatic

Checksum Overhead:

  • Computation: ~10 microseconds on ESP32
  • Transmission: 16 bits = 2 bytes (2% of a 100-byte packet)
  • Value: Prevents catastrophic data corruption

Lesson: UDP checksums are tiny overhead with huge safety benefit. Never disable them, especially for wireless IoT.


16.3.6 Mistake 6: Not Handling ICMP “Port Unreachable” (UDP)

The Myth: “UDP has no error reporting, so no need to handle errors.”

The Reality: If remote port is closed, router sends ICMP “Port Unreachable”. Most apps ignore this.

What Happens:

Sensor to Gateway: UDP packet to port 5683
Gateway: Port 5683 not listening (app crashed)
Gateway to Sensor: ICMP "Port Unreachable"
Sensor: Ignores ICMP (not reading error queue)
Sensor: Continues sending every 10s forever
Result: Wasting battery sending to black hole

Real Example (Linux):

# WRONG: Ignores ICMP errors
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
while True:
    sock.sendto(b"DATA", ("gateway", 5683))
    # Gateway returns ICMP error, but we don't check
    time.sleep(10)

The Fix:

# CORRECT: Check for ICMP errors (Linux)
import errno
import socket
import time

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

# Enable ICMP error reporting (IP_RECVERR is Linux-specific; Python may
# not expose the constant, so fall back to its numeric value)
IP_RECVERR = getattr(socket, "IP_RECVERR", 11)
sock.setsockopt(socket.SOL_IP, IP_RECVERR, 1)

while True:
    try:
        sock.sendto(b"DATA", ("gateway", 5683))
        # Try to receive (with timeout)
        sock.settimeout(1.0)
        data, addr = sock.recvfrom(1024)
    except socket.timeout:
        # Must come before OSError: socket.timeout is an OSError subclass
        print("No response, but no ICMP error either")
    except OSError as e:
        # Check for ICMP errors
        if e.errno == errno.ECONNREFUSED:
            print("Gateway port unreachable - retrying in 60s")
            time.sleep(60)  # Longer backoff
        elif e.errno == errno.EHOSTUNREACH:
            print("Gateway unreachable - network down")
            time.sleep(60)

Platform Differences:

  • Linux: IP_RECVERR provides ICMP errors
  • Windows: Errors delivered on next recvfrom()
  • Embedded (lwIP): Often no ICMP error delivery (check docs)

Lesson: UDP does have error reporting via ICMP. Handle it to detect dead peers early and save battery.


16.3.7 Mistake 7: Mixing TCP and UDP on Same Port

The Myth: “Port 5683 for both TCP and UDP saves configuration.”

The Reality: Same port number, different protocols = completely separate sockets.

Real Example (CoAP Confusion):

CoAP standard: UDP port 5683
Some implementations: Also TCP port 5683 (for WebSockets)

Developer mistake:
# Client sends UDP to port 5683
client_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
client_sock.sendto(b"CON GET /temp", ("server", 5683))

# Server only listening on TCP port 5683
server_sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server_sock.bind(("0.0.0.0", 5683))
server_sock.listen()

# Result: Packet goes to UDP 5683 (nothing listening)
# Client thinks server is down
# Server thinks no clients connecting
# Both ports 5683, both can coexist!

Key Insight:

OS maintains separate port tables:
- TCP ports: 0-65535
- UDP ports: 0-65535

Port 5683 TCP != Port 5683 UDP
Both can be bound simultaneously by different apps!

The Fix:

# Be explicit about protocol
# Server: Listen on BOTH protocols if needed
tcp_sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
tcp_sock.bind(("0.0.0.0", 5683))
tcp_sock.listen()

udp_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
udp_sock.bind(("0.0.0.0", 5683))

# Client: Know which protocol the service uses
# CoAP = UDP, MQTT = TCP
if protocol == "coap":
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
elif protocol == "mqtt":
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

Documentation Best Practices:

BAD:  "Service runs on port 5683"
GOOD: "CoAP service: UDP port 5683"
      "CoAP WebSocket: TCP port 5683"

Lesson: Always specify protocol when documenting ports. TCP and UDP ports are completely independent namespaces.

Concept Relationships

Related concepts:

  • Head-of-line blocking in TCP causes cascading delays when packets are lost on wireless links
  • UDP broadcast storms demonstrate why multicast requires rate limiting
  • TIME_WAIT accumulation on gateways shows why connection pooling matters for scalability

Common Pitfalls

A smart building may have: low-power sensors (CoAP/UDP), actuators requiring reliable control (TCP/MQTT QoS 1), video cameras (RTSP/UDP or WebRTC), and access controllers (HTTPS). Using a single protocol for all devices optimizes for average cases while failing edge cases. Define a protocol matrix per device category, mapping device constraints and data characteristics to specific protocols. Accept protocol heterogeneity at the field level and normalize at the gateway.

TCP + TLS establishment overhead consumes 3–8 KB per connection. An NB-IoT device on a 1 MB/month plan that establishes a new TCP+TLS connection for each of 100 daily transmissions spends 300–800 KB per day on handshakes alone, so the entire monthly budget is exhausted within one to three days, before any payload data. For cellular IoT, prefer persistent connections (MQTT with keepalive) or CoAP over UDP (no connection overhead) to minimize per-message overhead on metered connections.
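The metered-plan arithmetic above can be sketched in Python (the function name and defaults are illustrative; handshake sizes are the 3–8 KB range quoted in the text):

```python
# Sketch of the per-day TCP+TLS handshake budget on a metered plan.
# handshake_kb is the 3-8 KB per-connection range quoted in the text.

def handshake_budget(msgs_per_day=100, handshake_kb=(3, 8), plan_mb=1.0):
    low_kb = msgs_per_day * handshake_kb[0]    # KB/day, low estimate
    high_kb = msgs_per_day * handshake_kb[1]   # KB/day, high estimate
    plan_kb = plan_mb * 1024
    # Days until the monthly plan is consumed by handshakes alone
    return low_kb, high_kb, plan_kb / high_kb, plan_kb / low_kb

lo, hi, d_min, d_max = handshake_budget()
print(f"{lo:.0f}-{hi:.0f} KB/day on handshakes; "
      f"1 MB plan exhausted in {d_min:.1f}-{d_max:.1f} days")
```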

A gateway translating CoAP (UDP, binary CBOR) to MQTT (TCP, UTF-8 JSON) must: receive CoAP, parse CBOR, re-encode as JSON, establish/reuse MQTT connection, publish. This translation adds 2–10 ms processing latency and requires the gateway to maintain MQTT connection pools. For high-density deployments (1,000 devices reporting every second), the gateway translation CPU and connection overhead can saturate a modest ARM Cortex-A53. Profile gateway throughput under peak load and implement protocol bridging with binary formats (CBOR → MessagePack) where possible.

Protocol selections validated only in lab conditions with a perfect network (0% loss, <1 ms RTT) may fail in production. A TCP deployment on LTE-M with 2% packet loss triggers congestion control that cuts throughput to 10–30% of nominal; a UDP deployment with 5% loss simply drops 5% of messages outright. Test every protocol selection scenario under 1–5% packet loss (tc netem loss), 100–500 ms RTT (tc netem delay), and burst loss (tc netem loss gemodel) to validate the choice under real network conditions.

16.4 What’s Next

Continue your transport protocol learning:

  • Hands-On Transport Lab: Build a reliable message transport system on ESP32 with Wokwi simulation; apply the TCP vs UDP trade-offs from Scenarios 1 and 2 in real firmware
  • Transport Selection and Scenarios: Systematic decision frameworks for IoT protocol selection; translate the five disaster patterns into a structured selection rubric
  • Transport Optimizations and Implementation: TCP optimization techniques, window tuning, and lightweight stacks; implements the TCP_NODELAY, window scaling, and MQTT-SN fixes from the pitfalls section
  • Transport Practical Applications: Overhead calculations and battery-life projections for constrained devices; quantifies the costs introduced in the TCP handshake and satellite-link scenarios
  • Transport Plain English Guide: Analogies and conceptual explanations for transport protocol behaviour; reinforces intuition behind head-of-line blocking and NAT keepalive failures
  • Decision Framework: End-to-end protocol selection guide for production IoT deployments; combines lessons from all five scenarios into actionable decision trees