15  Analyzing IoT Protocols

15.1 Learning Objectives

By the end of this chapter, you will be able to:

  • Analyze MQTT publish-subscribe traffic patterns and diagnose connection issues
  • Examine CoAP RESTful communication including observe relationships and block transfers
  • Interpret Zigbee mesh network traffic including routing and cluster operations
  • Interpret LoRaWAN uplink/downlink patterns and join procedures
  • Identify common protocol-specific issues through traffic analysis

In 60 Seconds

IoT traffic analysis at the protocol level examines how specific protocols (MQTT, CoAP, AMQP, HTTP) behave on the wire — verifying message formats, QoS handshake sequences, and error response codes. Protocol-specific Wireshark dissectors decode binary protocol frames into human-readable fields, enabling verification that devices correctly implement standards. Protocol analysis catches implementation mismatches between devices and cloud backends that appear as intermittent connection failures in higher-level logs.

15.2 For Beginners: Analyzing IoT Protocols

Protocol analysis is like learning to read different languages that IoT devices use to talk to each other. Just as English, Spanish, and Mandarin have different grammars and vocabularies, protocols like MQTT, CoAP, and Zigbee have different message formats and conversation patterns. Tools like Wireshark act as translators, showing you exactly what devices are saying to each other, word by word. This helps you diagnose problems like “why did my sensor stop sending data?” or “why is my device getting disconnected?” by examining the actual conversation happening on the network.

“Each IoT protocol has its own conversation style,” said Max the Microcontroller. “MQTT uses publish and subscribe – a sensor publishes a message to a topic, and anyone subscribed to that topic receives it. CoAP is more like HTTP – request and response, but lightweight enough for tiny sensors.”

Sammy the Sensor demonstrated. “In Wireshark, my MQTT publish looks like: PUBLISH Topic=‘sensors/temp1’ Payload=‘{“temp”: 23.5}’. If the broker responds with PUBACK, I know the message was delivered. If I see CONNACK with an error code, I know the connection failed – maybe a wrong password or expired certificate.”

Lila the LED explained Zigbee analysis. “Zigbee mesh traffic is trickier because messages hop between routers. In a packet capture, you can trace the route: Sammy sends to Router A, Router A forwards to Router B, Router B delivers to the coordinator. If a hop is failing, the link quality indicators show which connection is weak.” Bella the Battery mentioned LoRaWAN. “LoRaWAN captures show the join procedure – how a device authenticates with the network server using keys. You can see uplink and downlink messages, confirm-vs-unconfirm patterns, and adaptive data rate adjustments. Each protocol tells a different story when you know how to read it!”

15.3 Prerequisites

Before diving into this chapter, you should be familiar with:

15.4 How It Works: Protocol-Specific Dissection

IoT protocol analysis leverages Wireshark’s protocol dissectors—software components that parse and interpret protocol-specific packet structures. Here’s how dissection works for IoT protocols:

MQTT Dissection Process:

  1. Transport Layer: TCP dissector identifies port 1883/8883 as potential MQTT traffic
  2. Protocol Detection: First byte (0x10-0xE0) identifies MQTT message type (CONNECT=1, PUBLISH=3, etc.)
  3. Header Parsing: Dissector extracts remaining length (variable-length encoding), flags (DUP, QoS, RETAIN)
  4. Payload Extraction: Based on message type, dissector parses variable header (topic name, message ID) and payload
  5. Display: Wireshark shows human-readable format: “PUBLISH Topic: sensors/temp, Payload: {temp: 23.5}”
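The five steps above can be sketched in Python for the simplest case — a QoS 0 PUBLISH frame. This is an illustration of the field layout (type nibble, variable-length Remaining Length, length-prefixed topic, raw payload), not a replacement for Wireshark's dissector:

```python
import struct

def decode_remaining_length(buf, offset):
    """Decode MQTT's variable-length Remaining Length field (1-4 bytes):
    7 data bits per byte; a set high bit means another byte follows."""
    value, multiplier = 0, 1
    while True:
        byte = buf[offset]
        offset += 1
        value += (byte & 0x7F) * multiplier
        if not byte & 0x80:          # high bit clear: last length byte
            return value, offset
        multiplier *= 128

def dissect_publish(frame):
    """Minimal dissection of an MQTT 3.1.1 QoS 0 PUBLISH frame."""
    msg_type = frame[0] >> 4                      # high nibble: 3 = PUBLISH
    qos = (frame[0] >> 1) & 0x03                  # flags: bits 1-2 carry QoS
    _, offset = decode_remaining_length(frame, 1)
    topic_len = struct.unpack_from(">H", frame, offset)[0]
    topic = frame[offset + 2:offset + 2 + topic_len].decode()
    payload = frame[offset + 2 + topic_len:]
    return msg_type, qos, topic, payload
```

Feeding it the bytes of the example PUBLISH yields the same fields Wireshark would display: message type 3, the topic string, and the JSON payload.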

CoAP Dissection Process:

  1. Transport Layer: UDP dissector identifies port 5683/5684
  2. Header Parsing: Fixed 4-byte header (Version, Type, Token Length, Code, Message ID)
  3. Option Parsing: Variable-length options (URI-Path, Content-Format, Observe) use option-number encoding
  4. Payload Marker: 0xFF byte separates options from payload
  5. Content Decoding: Based on Content-Format option (0=text, 50=JSON, 60=CBOR), payload is interpreted
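The fixed-header parsing in step 2 can be sketched the same way. This reads only the 4-byte header defined in RFC 7252; options and payload parsing are omitted:

```python
def dissect_coap_header(frame):
    """Parse the fixed 4-byte CoAP header (RFC 7252)."""
    version = frame[0] >> 6                   # always 1
    msg_type = (frame[0] >> 4) & 0x03         # 0=CON, 1=NON, 2=ACK, 3=RST
    token_len = frame[0] & 0x0F
    code = f"{frame[1] >> 5}.{frame[1] & 0x1F:02d}"   # e.g. 0.01 = GET, 2.05 = Content
    message_id = int.from_bytes(frame[2:4], "big")
    return version, msg_type, token_len, code, message_id
```

For example, the bytes 0x41 0x01 0x00 0x7B decode to a CON GET (code 0.01) with a 1-byte token and Message ID 123.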

Why This Matters: Without dissectors, you see raw hex bytes. With dissectors, Wireshark translates \x32\x0E into “PUBLISH, QoS 1” and \x00\x0C sensors/temp into “Topic: sensors/temp”. This makes debugging 10-100x faster.

Limitations: Encrypted payloads (TLS/DTLS) show as “Application Data” unless you provide decryption keys. Custom protocols require writing your own Lua dissector script.


15.5 Analyzing IoT Protocols

~30 min | Advanced | P13.C06.U03

[Figure: flowchart of protocol analysis across the OSI stack. A captured packet is decoded progressively — Layer 1 Physical (Wi-Fi, BLE, LoRa signal), Layer 2 Data Link (MAC addresses, frames), Layer 3 Network (IP addresses, routing), Layer 4 Transport (TCP/UDP ports), and Layers 5-7 Application (MQTT, CoAP, HTTP). The application layer feeds a protocol decoder with four parallel analysis paths (MQTT topic/QoS/payload, CoAP method/URI/options, HTTP REST APIs and headers, custom binary dissectors for proprietary protocols), which converge to produce message frequency metrics, payload size statistics, error rates, and security findings.]
Figure 15.1: Protocol analysis across network stack layers, showing how traffic analysis examines Application (MQTT, CoAP), Transport (TCP, UDP), Network (IP), and Link (Wi-Fi, Zigbee) layers, with cross-layer performance metrics including latency, throughput, packet loss, and jitter.

15.5.1 MQTT (Message Queue Telemetry Transport)

Protocol Overview: Lightweight publish-subscribe messaging protocol commonly used in IoT for sensor data and commands.

Traffic Pattern:

  • Port: 1883 (unencrypted), 8883 (TLS)
  • Transport: TCP
  • Message types: CONNECT, CONNACK, PUBLISH, SUBSCRIBE, PUBACK, etc.

Wireshark Analysis:

Display filter: mqtt

Connection Establishment:

Client -> Broker: CONNECT (Client ID, Clean Session, Keep Alive)
Broker -> Client: CONNACK (Return Code 0 = success)

Subscription:

Client -> Broker: SUBSCRIBE (Topic: sensors/temp, QoS: 1)
Broker -> Client: SUBACK (Granted QoS)

Publishing:

Publisher -> Broker: PUBLISH (Topic: sensors/temp, Payload: "22.5", QoS: 1)
Broker -> Publisher: PUBACK (Message ID)
Broker -> Subscriber: PUBLISH (same message)
Subscriber -> Broker: PUBACK


Key Takeaways:

  • QoS 0 provides minimal overhead but no delivery guarantees — best for non-critical telemetry
  • QoS 1 adds acknowledgment overhead (one PUBACK per message) — suitable for most IoT applications
  • QoS 2 adds the most overhead (a 4-way PUBREC/PUBREL/PUBCOMP handshake) — use only when message duplication is unacceptable
  • Cost considerations: For cellular IoT, higher QoS levels can significantly increase data costs
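A rough way to reason about this overhead is to count the control packets each QoS level puts on the wire per delivered message. The sketch below ignores retransmissions and payload size, so treat it as a lower bound:

```python
# Control packets on the wire per delivered application message:
# QoS 0: PUBLISH only; QoS 1: PUBLISH + PUBACK;
# QoS 2: PUBLISH + PUBREC + PUBREL + PUBCOMP
MESSAGES_PER_DELIVERY = {0: 1, 1: 2, 2: 4}

def daily_message_count(publishes_per_day, qos):
    """Rough count of MQTT packets a publisher generates per day."""
    return publishes_per_day * MESSAGES_PER_DELIVERY[qos]
```

A sensor publishing once a minute (1,440 publishes/day) generates 1,440 packets at QoS 0 but 5,760 at QoS 2 — the kind of difference that matters on metered cellular links.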

Key Metrics to Monitor:

  • Keep Alive Interval: Time between PINGREQ/PINGRESP (detect disconnections)
  • QoS Levels: 0 (at most once), 1 (at least once), 2 (exactly once)
  • Retained Messages: Messages saved by broker for new subscribers
  • Message Frequency: Publications per second/minute
  • Payload Sizes: Typical and maximum message sizes

Common Issues:

  • Connection Refused (CONNACK return code != 0): Authentication failure, protocol version mismatch
  • Frequent Disconnections: Keep alive too short, network instability
  • Missing PUBACK: Broker overload, network loss
  • Duplicate Messages: Expected with QoS 1, but excessive duplicates indicate retransmission issues
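Missing PUBACKs can be found systematically by exporting (message type, message ID) pairs from a capture and matching them up. A minimal sketch, assuming QoS 1 traffic only:

```python
def unacked_publishes(events):
    """events: (msg_type, message_id) tuples in capture order, e.g.
    exported from Wireshark. Returns QoS 1 message IDs never PUBACKed."""
    pending = set()
    for msg_type, mid in events:
        if msg_type == "PUBLISH":
            pending.add(mid)
        elif msg_type == "PUBACK":
            pending.discard(mid)
    return sorted(pending)
```

A nonempty result points at broker overload or network loss for those specific messages.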

Example Filter for Problematic Traffic:

mqtt.conack.val != 0                                     # Failed connections (CONNACK return code)
mqtt.msgtype == 3 && mqtt.qos == 1 && mqtt.dupflag == 0  # First publish attempts (to compare with dups)

15.5.2 CoAP (Constrained Application Protocol)

Protocol Overview: RESTful protocol designed for constrained devices and networks, similar to HTTP but over UDP.

Traffic Pattern:

  • Port: 5683 (unencrypted), 5684 (DTLS)
  • Transport: UDP
  • Methods: GET, POST, PUT, DELETE
  • Response codes: 2.01 Created, 2.05 Content, 4.04 Not Found, etc.

Wireshark Analysis:

Display filter: coap

Request-Response:

Client -> Server: CON GET /sensors/temp (Message ID: 123, Token: 0xA1)
Server -> Client: ACK 2.05 Content (Message ID: 123, Token: 0xA1, Payload: "22.5")

Observation (Subscribe to Resource):

Client -> Server: CON GET /sensors/temp (Observe: 0)
Server -> Client: ACK 2.05 Content (Observe: 5, Payload: "22.5")
... time passes ...
Server -> Client: CON 2.05 Content (Observe: 6, Payload: "23.1")
Client -> Server: ACK

Key Metrics:

  • Message Types: Confirmable (CON), Non-confirmable (NON), Acknowledgment (ACK), Reset (RST)
  • Response Time: Time between CON request and ACK response
  • Observe Sequence Numbers: Freshness of notifications
  • Block Transfer: Large payloads split into blocks
  • Retransmissions: Duplicate Message IDs indicate packet loss
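Retransmission counting from an exported list of Message IDs can be sketched like this — any repeated ID within a capture window indicates at least one CON retransmission:

```python
from collections import Counter

def retransmission_counts(message_ids):
    """Given CoAP Message IDs in capture order, count how many
    retransmissions each ID saw (repeats beyond the first send)."""
    counts = Counter(message_ids)
    return {mid: n - 1 for mid, n in counts.items() if n > 1}
```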

Common Issues:

  • Missing ACK: Network loss, server overload
  • Reset (RST) Responses: Server rejecting requests (invalid token, unimplemented feature)
  • Excessive Retransmissions: Network congestion or unreliable links
  • Out-of-Order Observe Notifications: Sequence number not increasing
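Out-of-order observe notifications can be flagged with a simplified freshness check. Note this is a sketch: RFC 7641 actually mandates serial-number-style comparison with a 128-second window, while this version just reports non-increasing values:

```python
def out_of_order_observes(seq_numbers):
    """Return (previous, current) pairs where the Observe option value
    failed to increase, i.e. a stale or reordered notification."""
    return [(p, c) for p, c in zip(seq_numbers, seq_numbers[1:]) if c <= p]
```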

Example Filters:

coap.code == 4.04                     # Not Found errors
coap.type == 0 && !coap.opt.observe  # CON messages that aren't observe


15.5.3 Zigbee

Protocol Overview: Low-power mesh networking protocol based on IEEE 802.15.4, common in smart home devices.

Traffic Pattern:

  • Frequency: 2.4 GHz (also sub-GHz variants)
  • Requires specialized sniffer hardware
  • Network, Application, and Cluster layers

Wireshark Analysis (with Zigbee sniffer):

Display filter: zbee_nwk or zbee_zcl

Network Layer:

  • Topology: Coordinator, routers, end devices
  • Routing: AODV-based mesh routing
  • Frames: Data, command, beacon

Application Layer (ZCL - Zigbee Cluster Library):

  • Clusters: On/Off, Level Control, Temperature Measurement, etc.
  • Commands: Read attributes, write attributes, report attributes

Key Metrics:

  • Network Depth: Hop count from coordinator
  • Link Quality (LQI): Signal quality indicator
  • Retries: MAC-level retransmission count
  • Routing Overhead: Route discovery and maintenance traffic
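As a sketch of how LQI values from a Zigbee capture might be screened, the hypothetical helper below flags hops below a chosen threshold. LQI runs 0-255 (higher is better), but the cutoff of 100 is an arbitrary example — acceptable values vary by radio and vendor:

```python
def weak_links(hops, lqi_threshold=100):
    """hops: (source, destination, LQI) tuples per captured frame.
    Returns the hops whose link quality falls below the threshold."""
    return [(src, dst, lqi) for src, dst, lqi in hops if lqi < lqi_threshold]
```

Run against a traced route, this pinpoints which hop in the mesh is degrading delivery.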

Common Issues:

  • Route Failures: Messages not reaching destination due to topology issues
  • High Retry Counts: Interference or poor signal quality
  • Frequent Rejoin: Devices losing connection to network
  • Orphaned Devices: End devices unable to find parent

15.5.4 LoRaWAN

Protocol Overview: Long-range, low-power WAN protocol for IoT, using LoRa modulation.

Traffic Pattern:

  • Frequency: Regional ISM bands (868 MHz EU, 915 MHz US, etc.)
  • Star topology: Devices -> Gateways -> Network Server
  • Classes: A (lowest power), B (beacon), C (always on)

Wireshark Analysis (requires LoRa sniffer or gateway packet forwarder capture):

Display filter: lorawan

Join Procedure (OTAA):

Device -> Gateway: Join Request (DevEUI, AppEUI, DevNonce)
Network Server -> Device: Join Accept (AppNonce, NetID, DevAddr, session keys)

Uplink/Downlink:

Device -> Gateway: Unconfirmed/Confirmed Uplink (FPort, Payload)
Network Server -> Device: Downlink (optional ACK, payload)

Key Metrics:

  • Spreading Factor (SF): SF7-SF12 (higher = longer range, lower data rate)
  • Frame Counter: Detects replay attacks and packet loss
  • Message Type: Join, Unconfirmed Data, Confirmed Data
  • Gateway Count: How many gateways received message (diversity)
  • RSSI/SNR: Signal strength and quality
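Frame counter analysis is easy to automate. This hypothetical helper scans uplink frame counters exported from a capture and reports gaps (lost uplinks) and non-increasing values (possible replay or device reset):

```python
def frame_counter_gaps(fcnts):
    """Scan uplink frame counters in arrival order. A gap means lost
    uplinks; a non-increase suggests a replay or a device reset."""
    findings = []
    for prev, cur in zip(fcnts, fcnts[1:]):
        if cur <= prev:
            findings.append(("replay-or-reset", prev, cur))
        elif cur != prev + 1:
            findings.append(("lost", prev + 1, cur - 1))
    return findings
```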

Common Issues:

  • Collision: Multiple devices transmitting simultaneously (no ACK received)
  • Duty Cycle Violations: Exceeding allowed transmission time
  • ADR (Adaptive Data Rate) Issues: Suboptimal SF selection
  • Downlink Failures: Class A devices must wait for RX windows
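A quick duty-cycle sanity check can be sketched as follows, assuming the common EU868 1% limit (at most 36 s of airtime per rolling hour); exact limits vary by region and sub-band:

```python
def duty_cycle_ok(airtime_ms_in_hour, limit_percent=1.0):
    """True if the device's summed airtime over the last hour stays
    within the duty-cycle budget (1% of 3,600,000 ms = 36,000 ms)."""
    return airtime_ms_in_hour <= 3_600_000 * limit_percent / 100
```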

15.6 Knowledge Check

15.7 Case Study: Field Debugging a Smart Meter Deployment with Protocol Analysis

Scenario: A utility company deployed 2,000 smart electricity meters communicating via LoRaWAN to 12 gateways across a mid-sized city. After 3 months, 340 meters (17%) were reporting intermittent data gaps – some missing 2-3 readings per day, others going silent for entire days before resuming. The utility’s monitoring dashboard showed gaps but no error messages.

Phase 1: Gateway-level capture (Day 1)

The team captured LoRaWAN traffic at the 3 worst-performing gateways using the packet forwarder log (JSON format, no specialized sniffer needed):

Gateway                    Expected uplinks/hour   Actual uplinks/hour   Missing
GW-07 (industrial park)    180                     142                   21%
GW-03 (downtown)           210                     178                   15%
GW-11 (residential)        160                     155                   3%

The industrial park gateway had the worst performance. Filtering by spreading factor revealed the pattern:

Spreading Factor           Expected count   Actual count   Loss rate
SF7 (close meters)         80/hr            79/hr          1%
SF8                        45/hr            43/hr          4%
SF9                        30/hr            24/hr          20%
SF10-12 (distant meters)   25/hr            6/hr           76%

Meters using higher spreading factors (further from gateway) were losing most of their packets.
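The loss rates in the table above can be reproduced with a short script; the per-SF counts are taken directly from the capture summary:

```python
def loss_rate_percent(expected, actual):
    """Percentage of expected uplinks that never arrived."""
    return round(100 * (expected - actual) / expected)

# Per-spreading-factor uplink counts from the GW-07 capture above
per_sf = {"SF7": (80, 79), "SF8": (45, 43), "SF9": (30, 24), "SF10-12": (25, 6)}
rates = {sf: loss_rate_percent(e, a) for sf, (e, a) in per_sf.items()}
```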

Phase 2: Spectrum analysis (Day 2)

An SDR (RTL-SDR, $25) tuned to 868 MHz near GW-07 revealed a high noise floor of -95 dBm during working hours (8 AM - 6 PM), dropping to -115 dBm at night. LoRaWAN receiver sensitivity at SF10 is around -132 dBm, so a daytime noise floor sitting 37 dB above that threshold effectively blinded the gateway to distant meters.

Root cause: A new logistics warehouse 200 meters from the gateway had installed a fleet of autonomous warehouse robots communicating at 868 MHz (permitted ISM band, but generating continuous interference).

Phase 3: Solution evaluation

Options evaluated:

  • Relocate the GW-07 antenna 500 m from the warehouse — $800 (new mounting + cable), 2 days; moved the gateway out of the interference zone
  • Add a directional antenna facing away from the warehouse — $150, 1 day; reduced interference by 15 dB (insufficient)
  • Increase meter transmit power — $0 (firmware update), 1 week (OTA rollout); would have violated duty cycle regulations
  • Add a 2nd gateway on the opposite side of the coverage area — $1,200 (gateway + installation), 1 week; provided redundant coverage and eliminated the single point of interference

The team implemented both the antenna relocation AND a second gateway. After the fix, packet loss at GW-07 dropped from 21% to 2.1%, and the 340 affected meters all returned to full reporting within 48 hours.

Total cost of the outage: 3 months of 17% data gaps across 2,000 meters meant approximately 306,000 missing readings. At $0.02 estimated revenue impact per missing reading (billing inaccuracy + customer complaints), the data gaps cost approximately $6,120. The fix cost $2,000 in hardware plus 3 engineer-days ($1,800). The protocol analysis itself took 2 days and required only a $25 SDR and existing gateway logs.

Lesson: Protocol-specific analysis revealed that the problem was not the LoRaWAN protocol, the meters, or the network server – it was RF interference invisible to application-layer monitoring. Without capturing at the physical layer, the team would have continued replacing “faulty” meters ($85 each x 340 = $28,900) based on the incorrect assumption that hardware was failing.

Case Study 2: MQTT Authentication Failures in a Campus Sensor Fleet

Scenario: A fleet of 340 ESP32-based environmental sensors deployed across a university campus reports intermittent MQTT connection failures. Devices log “MQTT CONNACK Error Code 5” approximately 8-12 times per day per device. The MQTT broker (mosquitto 2.0.18 on an AWS EC2 t3.medium) shows no obvious errors in its logs. The university IT department suspects a network issue, but ping tests show 0% packet loss and <15 ms latency to the broker.

Investigation Goal: Identify root cause of CONNACK error code 5 using Wireshark packet captures.

Step 1: Understand MQTT CONNACK Return Codes

MQTT v3.1.1 CONNACK return codes:

  • 0: Connection accepted
  • 1: Connection refused, unacceptable protocol version
  • 2: Connection refused, identifier rejected
  • 3: Connection refused, server unavailable
  • 4: Connection refused, bad username or password
  • 5: Connection refused, not authorized

Error code 5 indicates authentication/authorization failure, NOT a network issue.

Step 2: Capture MQTT Traffic on Failing Device

Set up packet capture on one affected ESP32 device (connected via USB serial + Wi-Fi):

# On laptop connected to ESP32
# Use tcpdump to capture traffic from ESP32's IP (192.168.1.145)
sudo tcpdump -i wlan0 host 192.168.1.145 and port 1883 -w mqtt_capture.pcap

# Trigger a connection attempt on ESP32 via serial command
screen /dev/ttyUSB0 115200
> connect_mqtt

# Wait for CONNACK error code 5
# Stop capture after 60 seconds

Step 3: Analyze Capture in Wireshark

Open mqtt_capture.pcap in Wireshark and apply filter: mqtt

Observed Packet Sequence:

Time Source Destination Protocol Info
0.000 192.168.1.145 18.208.123.45 TCP [SYN] 52341 → 1883
0.018 18.208.123.45 192.168.1.145 TCP [SYN, ACK] 1883 → 52341
0.019 192.168.1.145 18.208.123.45 TCP [ACK] 52341 → 1883
0.125 192.168.1.145 18.208.123.45 MQTT Connect Command (Client ID: sensor_12345, Username: sensor_user, Password: [138 bytes])
0.312 18.208.123.45 192.168.1.145 MQTT Connect Ack (Return Code: Not authorized [5])

Step 4: Inspect CONNECT Packet Details

Right-click packet #4 (MQTT Connect) → Follow → TCP Stream

MQIsdp (MQTT Protocol Name)
Protocol Level: 3
Connect Flags: 0xC2
  Username Flag: 1
  Password Flag: 1
  Will Retain: 0
  Will QoS: 0
  Will Flag: 0
  Clean Session: 1
  Reserved: 0
Keep Alive: 60 seconds
Client ID: sensor_12345
Username: sensor_user
Password: [138 bytes]
  4d:79:53:65:63:75:72:65:50:61:73:73:77:6f:72:64:31:32:33:00:00:00:00:00...

The password field is 138 bytes! MQTT passwords should be short (typically 8-32 characters = 8-32 bytes).

Step 5: Hex Dump Analysis of Password Field

Wireshark hex view of password bytes:

Offset   Hex                                              ASCII
0000     4d 79 53 65 63 75 72 65 50 61 73 73 77 6f 72 64  MySecurePassword
0010     31 32 33 00 00 00 00 00 00 00 00 00 00 00 00 00  123.............
0020     00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
0030     ... (continues with 0x00 null bytes)

Root Cause Identified: The ESP32 firmware is sending the correct password (“MySecurePassword123” = 19 bytes) but then appending 119 null bytes (0x00), making the total password length 138 bytes. The mosquitto broker sees this 138-byte blob as the password, which doesn’t match the stored credential “MySecurePassword123” (19 bytes).
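The mismatch is easy to reproduce: the broker compares the received password bytes against the stored credential, and the trailing NULs make them unequal. A minimal Python illustration of that comparison:

```python
# Broker-side comparison, reconstructed: the stored credential is 19
# bytes, but the device sent those bytes plus 119 trailing NUL bytes
stored = b"MySecurePassword123"
received = b"MySecurePassword123" + b"\x00" * 119

assert len(received) == 138
assert received != stored   # byte-for-byte mismatch -> CONNACK code 5
```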

Step 6: Find the Firmware Bug

Inspect ESP32 firmware code:

// File: mqtt_client.cpp (BUGGY VERSION)
void MQTTClient::connect() {
  char password_buffer[138];  // Fixed-size buffer
  strcpy(password_buffer, MQTT_PASSWORD);  // Copy password string

  // MQTT library call
  mqtt_connect(
    broker_address,
    CLIENT_ID,
    USERNAME,
    password_buffer,  // BUG: Passing the entire 138-byte buffer!
    sizeof(password_buffer)  // BUG: Telling the MQTT library the password is 138 bytes
  );
}

The bug: sizeof(password_buffer) returns 138 (the allocated buffer size), not the actual string length. The MQTT library sends all 138 bytes, but only the first 19 are the actual password - the rest are uninitialized memory (in this case, zeros from the .bss section).

Step 7: Fix and Validate

// File: mqtt_client.cpp (FIXED VERSION)
void MQTTClient::connect() {
  char password_buffer[138];
  strcpy(password_buffer, MQTT_PASSWORD);

  mqtt_connect(
    broker_address,
    CLIENT_ID,
    USERNAME,
    password_buffer,
    strlen(password_buffer)  // FIX: Use actual string length, not buffer size
  );
}

After reflashing firmware with the fix, captured new MQTT CONNECT packet:

Password: [19 bytes]
  4d:79:53:65:63:75:72:65:50:61:73:73:77:6f:72:64:31:32:33
  MySecurePassword123

CONNACK response:

Connect Ack (Return Code: Connection Accepted [0])

Step 8: Rollout and Results

Deployed fixed firmware v2.1.8 to all 340 devices over 48 hours via OTA update.

Before Fix (7-day average):

  • MQTT connection attempts: 4,080/day (340 devices × 12 reconnects/day)
  • CONNACK error code 5: 3,264/day (80% failure rate!)
  • Successful connections: 816/day (only 20% success)
  • Customer complaints: 47 tickets (“sensors show offline in dashboard”)

After Fix (7-day average):

  • MQTT connection attempts: 1,360/day (340 devices × 4 reconnects/day - normal rate)
  • CONNACK error code 5: 0/day
  • Successful connections: 1,360/day (100% success rate)
  • Customer complaints: 0 tickets

Time to Resolution:

  • Initial report to root cause identified: 2 hours (including Wireshark analysis)
  • Firmware fix and testing: 4 hours
  • OTA rollout to 340 devices: 48 hours
  • Total: 56 hours from report to full resolution

Cost Impact:

  • Engineering time: 6 hours @ $120/hour = $720
  • Cloud costs (excess MQTT connection attempts before fix): $45/week
  • Customer support time saved: 47 tickets × 30 min/ticket × $45/hour = $1,057/week

The Lesson: Network-level packet capture with Wireshark revealed the bug in 2 hours, whereas application-level logging (“MQTT connection failed, error code 5”) gave no actionable information. The university IT department wasted 3 days investigating network issues before escalating to the IoT vendor, who solved it with packet analysis in 2 hours.

Key Wireshark Techniques Used:

  1. Display filter mqtt to isolate MQTT traffic from other protocols
  2. Follow TCP Stream to see human-readable MQTT message contents
  3. Hex dump inspection to find non-printable characters and buffer issues
  4. Packet detail pane to examine MQTT field lengths and detect oversized fields
  5. Time column to measure broker response latency (187ms CONNECT to CONNACK - acceptable)

Tool selection by protocol:

  • MQTT — Primary: Wireshark; Secondary: mqtt-spy; Capture: tcpdump on broker or client; Key metrics: CONNACK codes, QoS acks, topic patterns, keepalive intervals
  • CoAP — Primary: Wireshark; Secondary: Copper (Firefox plugin); Capture: tcpdump, CoAP-specific sniffer; Key metrics: confirmable vs. non-confirmable, block transfer, observe sequence numbers
  • HTTP/HTTPS — Primary: Wireshark (with TLS keys); Secondary: curl, Postman; Capture: tcpdump, browser DevTools; Key metrics: status codes, header sizes, response times, TLS handshake
  • Zigbee — Primary: Wireshark + Zigbee dissector; Secondary: Texas Instruments Packet Sniffer; Capture: USB Zigbee sniffer (CC2531); Key metrics: link quality (LQI), route failures, join requests
  • LoRaWAN — Primary: Wireshark + LoRaWAN dissector; Secondary: ChirpStack Gateway; Capture: gateway packet forwarder logs; Key metrics: spreading factor, frame counter, join accept/reject
  • BLE — Primary: Wireshark + nRF Sniffer; Secondary: nRF Connect app; Capture: nRF52840 USB sniffer; Key metrics: GATT operations, connection intervals, advertising packets
  • Modbus TCP — Primary: Wireshark; Secondary: Modbus Poll; Capture: tcpdump on PLC/SCADA network; Key metrics: function codes, exception responses, read/write coil patterns

Decision Criteria:

1. What visibility do you have into the network?

  • Full network access (your infrastructure) — Wireshark on a SPAN port or tcpdump on the server; no limitations, capture everything
  • Client-side only (Wi-Fi/cellular) — tcpdump on the device or a USB sniffer; cannot see broker/server-side traffic
  • Cloud-hosted broker (AWS IoT, Azure) — CloudWatch Logs or the cloud provider’s traffic logging; cannot see raw packets, only application logs
  • No network access (remote device in field) — enable device-side logging or remote packet capture via VPN; high latency, limited capture duration due to bandwidth

2. What layer are you debugging?

  • Connection failures — layer: Transport (TCP/TLS); tool: Wireshark TCP stream; look for 3-way handshake completion, RST packets, TLS errors
  • Authentication errors — layer: Application (MQTT CONNECT); tool: Wireshark MQTT dissector; look for CONNACK return codes, username/password fields
  • Message loss — layer: Application (MQTT PUBLISH); tool: Wireshark + MQTT message IDs; look for missing PUBACK for QoS 1, duplicate messages
  • Latency issues — layer: Network + Transport; tool: Wireshark IO graphs; look for round-trip time (RTT), TCP retransmissions
  • Protocol compliance — layer: Application; tool: protocol-specific validators; look for malformed packets, invalid field values

3. Is the protocol encrypted?

  • Unencrypted (MQTT port 1883, HTTP) — full visibility; Wireshark decodes everything
  • TLS encrypted (MQTT port 8883, HTTPS) — shows the TLS handshake only; provide the TLS pre-master secret to Wireshark for decryption
  • Custom encryption — shows the encrypted payload as hex; decrypt offline or use an unencrypted test environment

To decrypt TLS in Wireshark:

  1. Set environment variable on client: export SSLKEYLOGFILE=/tmp/ssl-keys.log
  2. Run client (Firefox, curl, etc.)
  3. Load /tmp/ssl-keys.log in Wireshark: Edit → Preferences → Protocols → TLS → (Pre)-Master-Secret log filename
  4. Wireshark now shows decrypted application data

4. How long do you need to capture?

  • <1 hour (<1 GB) — Wireshark GUI; interactive debugging session
  • 1-24 hours (1-50 GB) — tcpdump with file rotation; intermittent issue that needs a long capture
  • >24 hours (>50 GB) — tcpdump + compression, cloud logging; rare bug that needs days of data

Example: Capture 24 hours of MQTT traffic, rotating to a new file every 5 minutes:

tcpdump -i eth0 port 1883 -G 300 -W 288 -w 'mqtt_%Y%m%d_%H%M%S.pcap'
# -G 300 rotates every 5 minutes; -W 288 stops after 288 files (24 hours × 12 files/hour)

5. Are you analyzing one device or a fleet?

  • 1-5 devices — Wireshark on each device; manual packet inspection
  • 5-50 devices — centralized logging (MQTT broker logs); aggregate logs, search for patterns
  • 50-1000+ devices — cloud analytics (AWS IoT, Azure Monitor); metrics dashboards, anomaly detection

Recommended Starting Point (90% of IoT debugging):

  1. Start with application logs (MQTT client library debug logs)
  2. If logs show “connection failed” or “publish failed” but no details → use Wireshark
  3. Filter by protocol: mqtt, coap, http, etc.
  4. Look at CONNACK/response codes first (high-level success/failure)
  5. If still unclear, inspect packet hex dumps (malformed data, buffer overflows)

Rule of Thumb: Use Wireshark when application logs say “it failed” but don’t explain WHY. Wireshark shows the actual bytes on the wire, revealing issues invisible to application code (wrong password length, malformed JSON, TLS certificate errors, etc.).

Common Mistake: Filtering Too Aggressively and Missing Root Cause

The Scenario: You’re debugging why your ESP32 weather station stops publishing MQTT messages after 24-48 hours of uptime. You capture MQTT traffic with Wireshark using filter mqtt.msgtype == 3 (only PUBLISH packets) to see when publishing stops.

Wireshark capture (filtered view):

Time Source Destination Info
10:23:14 192.168.1.42 broker Publish Message (sensors/temp)
10:24:14 192.168.1.42 broker Publish Message (sensors/temp)
10:25:14 192.168.1.42 broker Publish Message (sensors/temp)
… 24 hours pass …
10:23:08 192.168.1.42 broker Publish Message (sensors/temp)
10:24:08 192.168.1.42 broker Publish Message (sensors/temp)
[24-hour gap - no more PUBLISH packets]

You conclude: “The ESP32 stops publishing after exactly 24 hours. Must be a timer overflow bug or memory leak.”

What You Missed:

You filtered out ALL non-PUBLISH MQTT traffic. If you had looked at the FULL capture (no filter), you would have seen:

Time Source Destination Protocol Info
10:24:08 192.168.1.42 broker MQTT Publish Message (sensors/temp)
10:24:22 broker 192.168.1.42 MQTT Disconnect (reason: keep alive timeout)
10:24:22 broker 192.168.1.42 TCP [FIN, ACK]
10:24:23 192.168.1.42 broker TCP [RST]
10:25:14 192.168.1.42 broker TCP [SYN] - connection attempt
10:25:14 broker 192.168.1.42 TCP [RST] - connection refused
10:26:14 192.168.1.42 broker TCP [SYN]
10:26:14 broker 192.168.1.42 TCP [RST]
… ESP32 keeps retrying connection, all refused …

The Real Problem: The MQTT broker disconnected the client due to a keep-alive timeout (the client didn’t send PINGREQ within the 60-second keep-alive window). After the disconnect, the ESP32 tries to reconnect but gets TCP RST (connection refused), likely because:

  1. The broker has banned the client IP due to too many failed reconnection attempts
  2. The broker hit its max connection limit and is refusing new connections
  3. A firewall rule was triggered by the rapid reconnection pattern

Why Aggressive Filtering is Dangerous:

  • mqtt.msgtype == 3 (PUBLISH only) — misses DISCONNECT, PINGREQ/PINGRESP, and CONNACK errors; real-world example: keep-alive failures, authentication errors, QoS issues
  • mqtt (only MQTT protocol) — misses TCP handshake failures, TLS errors, and network resets; real-world example: connection refused, TLS certificate validation failures, firewall blocks
  • ip.addr == 192.168.1.42 (only your device) — misses broker-initiated disconnects and network infrastructure issues; real-world example: broker sending DISCONNECT, router ARP failures
  • !tcp.flags.reset (hide RST packets) — misses connection rejections and abrupt disconnections; real-world example: firewall blocks, broker max connections exceeded

The Correct Approach:

Phase 1: Broad capture, no filter

  • Capture ALL traffic between device and broker
  • Duration: 30 minutes before and after failure (if reproducible)
  • Save to file: mqtt_full_capture.pcap

Phase 2: Apply filters incrementally to narrow down

  1. Start with: ip.addr == 192.168.1.42 && (mqtt || dns) - see all MQTT + DNS lookups
  2. If connections fail, check TCP layer: ip.addr == 192.168.1.42 && tcp - look for SYN/RST patterns
  3. If TLS, check handshake: ip.addr == 192.168.1.42 && tls - look for alert messages
  4. Only AFTER confirming connections succeed, filter to MQTT app layer: mqtt

Phase 3: Timeline analysis (Wireshark Statistics → Flow Graph)

  • View → Flow Graph
  • Limit to Display Filter: ip.addr == 192.168.1.42 && (tcp || mqtt)
  • See complete sequence: SYN → CONNECT → PUBLISH (x10) → PINGREQ → PINGRESP → DISCONNECT → RST

This reveals the PINGREQ/PINGRESP keep-alive mechanism stopped working, causing broker disconnect.

Phase 4: Find root cause in firmware

Knowing the keep-alive stopped, inspect ESP32 firmware:

// mqtt_client.cpp (BUGGY VERSION)
void MQTTClient::loop() {
  if (mqtt.connected()) {
    mqtt.loop();  // Process incoming messages

    // BUG: Only publish if sensor read succeeds
    if (read_sensor_success) {
      mqtt.publish("sensors/temp", sensor_data);
    }
  }
}

The Bug: The mqtt.loop() function handles keep-alive PINGREQ automatically, but it’s only called when mqtt.connected() is true. If the sensor read hangs for >60 seconds (e.g., I2C bus lockup), the loop() function is blocked and never calls mqtt.loop(), so PINGREQ is never sent. The broker times out and disconnects.

The Fix:

// mqtt_client.cpp (FIXED VERSION)
void MQTTClient::loop() {
  mqtt.loop();  // ALWAYS process MQTT keep-alive first

  if (mqtt.connected()) {
    if (read_sensor_success) {
      mqtt.publish("sensors/temp", sensor_data);
    }
  }
}
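The timing of this failure can be checked with simple arithmetic. The sketch below uses assumed numbers (the 60-second keep-alive from the scenario, and the 1.5x grace period the MQTT specification allows brokers before dropping a silent client): any single pass through loop() that blocks past the broker's deadline means the next PINGREQ arrives too late.

```python
# Model: does a blocking sensor read starve the MQTT keep-alive?
# MQTT brokers typically disconnect after 1.5x the negotiated keep-alive.

KEEP_ALIVE_S = 60
BROKER_DEADLINE_S = 1.5 * KEEP_ALIVE_S   # 90 s of silence -> DISCONNECT

def survives(blocking_read_s, last_ping_age_s=0):
    # Worst case: the read starts last_ping_age_s after the previous PINGREQ.
    silence = last_ping_age_s + blocking_read_s
    return silence < BROKER_DEADLINE_S

print(survives(30))    # short I2C stall, next PINGREQ still on time
print(survives(95))    # bus lockup beyond 90 s, broker drops the session
```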

Lessons:

  1. Start broad, filter narrow: Capture everything first, then apply filters to zoom in
  2. Never filter out errors: RST, DISCONNECT, NAK packets are often the smoking gun
  3. Use Flow Graph: Visualize packet sequence over time to see protocol state machine
  4. Cross-layer analysis: The MQTT app-layer problem (publishes stop) was caused by a TCP-layer disconnect, which was caused by an application-layer keep-alive failure in the firmware
  5. Save full captures: When asking for help (StackOverflow, vendor support), provide full pcap file, not filtered view

Real-World Statistics: In a survey of 127 IoT engineers who use Wireshark:

  • 68% admitted to missing root causes by filtering too early
  • Average time wasted: 4.2 hours debugging with the wrong filter before starting over
  • Most common missed issue: TCP RST packets (filtered out by focusing only on the application layer)

The Golden Rule: When in doubt, capture ALL, filter NEVER (until you understand the full packet sequence). Disk space is cheap, your debugging time is expensive.


15.8 Common Pitfalls

Protocol Analysis Mistakes

1. Ignoring Timing and Sequence in Analysis

  • Mistake: Filtering for only MQTT PUBLISH packets to debug message delivery, missing the DISCONNECT that happened 50ms earlier which explains why publishes stopped
  • Why it happens: IoT protocols have complex state machines. Filtering too aggressively removes surrounding context
  • Solution: Start with broad captures and narrow filters gradually. Use Wireshark’s “Follow TCP Stream” to see complete conversations

2. Misinterpreting Encrypted Traffic

  • Mistake: Seeing TLS-encrypted MQTT traffic showing only “Application Data” and assuming the connection is working correctly, when actually authentication is failing inside the encrypted tunnel
  • Why it happens: TLS encryption hides payload contents, so developers see successful TCP handshakes but cannot observe MQTT-level errors
  • Solution: For debugging, temporarily use unencrypted connections on a test network. Configure Wireshark with TLS pre-master secrets if you control the client
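If the client is built on Python's ssl module (for example, a paho-mqtt client), the session keys can be logged for Wireshark via keylog_filename (Python 3.8+ with OpenSSL 1.1.1+). The log path below is an arbitrary example:

```python
import os
import ssl
import tempfile

# Write TLS session keys where Wireshark can read them:
# Preferences -> Protocols -> TLS -> "(Pre)-Master-Secret log filename"
keylog_path = os.path.join(tempfile.gettempdir(), "tls_keys.log")

ctx = ssl.create_default_context()
ctx.keylog_filename = keylog_path  # keys are appended on every handshake
```

Pass ctx to the client that opens the connection (paho exposes tls_set_context(ctx) for this); Wireshark then decrypts the MQTT payloads inside the TLS tunnel.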

3. Confusing QoS Behavior with Protocol Errors

  • Mistake: Seeing duplicate MQTT PUBLISH packets and assuming the broker is malfunctioning
  • Why it happens: Not understanding that QoS 1 “at least once” delivery intentionally retransmits if PUBACK is delayed
  • Solution: Check the DUP flag in the duplicate packets: it is set for legitimate retransmissions. Measure the retransmission rate to assess network quality
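The DUP check can be done by hand from the raw bytes. In the MQTT 3.1.1 fixed header, bits 7-4 of the first byte are the packet type (3 = PUBLISH), bit 3 is DUP, bits 2-1 are the QoS, and bit 0 is RETAIN. A minimal decoder (a sketch, not tied to any library):

```python
# Decode the flags in the first byte of an MQTT 3.1.1 fixed header.
# Bit layout: [type:4][DUP:1][QoS:2][RETAIN:1]

def decode_fixed_header(byte0: int) -> dict:
    return {
        "type":   byte0 >> 4,           # 3 == PUBLISH
        "dup":    bool(byte0 & 0x08),   # set on retransmission
        "qos":    (byte0 >> 1) & 0x03,  # 0, 1, or 2
        "retain": bool(byte0 & 0x01),
    }

# 0x3A = PUBLISH with DUP=1, QoS=1, RETAIN=0: a legitimate QoS 1 retry
print(decode_fixed_header(0x3A))
```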

15.9 Concept Check


15.10 Concept Relationships

Prerequisites (What You Need First):

Related Concepts (Explore in Parallel):

  • MQTT Deep Dive: MQTT QoS levels, retain, last will and testament
  • CoAP Deep Dive: Observe relationships, block transfers, resource discovery
  • LoRaWAN: Join procedures, adaptive data rate, confirmed vs. unconfirmed uplinks

Build On This (Next Steps):

Real-World Application:


15.11 See Also

Protocol Documentation:

Wireshark Protocol Dissectors:

Related Chapters:

Cross-Module Connections:

Hands-On Resources:


15.12 Try It Yourself: Analyze MQTT QoS Behavior

Objective: Capture and compare MQTT QoS 0, 1, and 2 message flows to understand delivery guarantees.

What You’ll Need:

  • Wireshark
  • MQTT clients (mosquitto_pub/mosquitto_sub)
  • Local MQTT broker (mosquitto) or test.mosquitto.org

Step 1: Capture QoS 0 (At Most Once)

# Start Wireshark on loopback interface, filter: mqtt

# Terminal 1: Subscribe
mosquitto_sub -h localhost -t "test/qos" -v

# Terminal 2: Publish with QoS 0
mosquitto_pub -h localhost -t "test/qos" -m "QoS 0 message" -q 0

What to Observe in Wireshark:

  • PUBLISH packet from client to broker
  • NO PUBACK response
  • Message delivered once, no retry if lost

Step 2: Capture QoS 1 (At Least Once)

mosquitto_pub -h localhost -t "test/qos" -m "QoS 1 message" -q 1

What to Observe:

  • PUBLISH packet
  • PUBACK packet from broker (acknowledges receipt)
  • If PUBACK is lost (simulate by disconnecting), observe retransmission with DUP flag

Step 3: Capture QoS 2 (Exactly Once)

mosquitto_pub -h localhost -t "test/qos" -m "QoS 2 message" -q 2

What to Observe:

  • PUBLISH packet
  • PUBREC (publish received)
  • PUBREL (publish release)
  • PUBCOMP (publish complete)
  • 4-way handshake guarantees exactly-once delivery
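The four-way exchange can be sketched as a toy state machine (illustrative only, not a real broker): the receiver records the packet ID on the first PUBLISH and acknowledges duplicates without delivering them again, which is what makes the handshake exactly-once.

```python
# Toy model of an MQTT QoS 2 receiver: exactly-once despite duplicate PUBLISH.

class QoS2Receiver:
    def __init__(self):
        self.pending = set()   # packet IDs seen but not yet released
        self.delivered = []    # messages handed to the application

    def on_publish(self, pkt_id, payload):
        if pkt_id not in self.pending:   # duplicates are absorbed here
            self.pending.add(pkt_id)
            self.delivered.append(payload)
        return ("PUBREC", pkt_id)        # always acknowledge receipt

    def on_pubrel(self, pkt_id):
        self.pending.discard(pkt_id)     # sender released the packet ID
        return ("PUBCOMP", pkt_id)       # handshake complete

rx = QoS2Receiver()
rx.on_publish(7, "reading-1")
rx.on_publish(7, "reading-1")   # retransmitted PUBLISH: not delivered twice
rx.on_pubrel(7)
print(rx.delivered)             # ['reading-1']
```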

Analysis Questions:

  1. Measure timing: How long between PUBLISH and final acknowledgment for each QoS level?
  2. Simulate packet loss: Disconnect Wi-Fi during QoS 1 publish. Does it retry? How many times?
  3. Compare message IDs: Are they sequential or random? Why?

Expected Outcome: Visual understanding of QoS trade-offs (speed vs. reliability). QoS 0 is fastest (1 packet), QoS 1 adds reliability (2 packets), QoS 2 guarantees exactly-once (4 packets).

Challenge Extension: Write a Python script that publishes 100 messages with QoS 1 and calculates the % of messages that required retransmission (DUP flag set).
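One way to start the offline half of this challenge: export the fixed-header byte of every MQTT packet from your capture (e.g. with tshark -r capture.pcap -Y mqtt -T fields -e mqtt.hdrflags) and compute the share of PUBLISH packets carrying the DUP flag. The byte values below are a made-up stand-in for your exported data:

```python
# Fraction of PUBLISH packets that are QoS retransmissions (DUP flag set).

def dup_rate(header_bytes):
    publishes = [b for b in header_bytes if (b >> 4) == 3]  # type 3 = PUBLISH
    if not publishes:
        return 0.0
    dups = sum(1 for b in publishes if b & 0x08)            # bit 3 = DUP
    return 100.0 * dups / len(publishes)

# Three clean QoS 1 publishes, one DUP retry, one CONNACK (ignored)
sample = [0x32, 0x32, 0x3A, 0x32, 0x20]
print(f"{dup_rate(sample):.1f}% retransmitted")
```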


15.13 Summary

  • MQTT analysis focuses on CONNECT/CONNACK sequences, QoS acknowledgments, and keep-alive patterns to diagnose connection and delivery issues
  • CoAP analysis examines confirmable vs. non-confirmable messages, observe relationships, and block transfers for constrained device communication
  • Zigbee analysis requires specialized sniffers and interprets mesh routing, link quality, and ZCL cluster commands
  • LoRaWAN analysis monitors join procedures, spreading factors, frame counters, and gateway diversity for long-range LPWAN troubleshooting
  • Protocol-specific filters in Wireshark (mqtt, coap, zbee_nwk, lorawan) enable focused analysis of each IoT protocol

15.14 What’s Next

Continue to Traffic Analysis & Monitoring to learn systematic approaches to network testing, validation, and continuous monitoring, with worked examples from production IoT deployments.
