17  Traffic Analysis & Monitoring

17.1 Learning Objectives

By the end of this chapter, you will be able to:

  • Design systematic network testing strategies for IoT deployments
  • Configure hardware-in-the-loop (HIL) network test environments
  • Implement automated anomaly detection for production monitoring
  • Perform load testing to identify performance bottlenecks
  • Apply traffic analysis to diagnose real-world IoT issues
In 60 Seconds

Network traffic testing subjects IoT communication channels to controlled stress conditions — high message rates, packet loss, network delay, and congestion — to validate system behavior under adverse conditions. Tools like iperf3, tc netem, and protocol load generators simulate real-world network impairments in lab conditions. Traffic testing reveals buffering issues, retry storm behaviors, and throughput bottlenecks before production deployment.

17.2 For Beginners: Traffic Analysis & Monitoring

Testing and validation ensure your IoT device works correctly and reliably in the real world, not just on your workbench. Think of it like test-driving a car in rain, snow, and heavy traffic before buying it. Thorough testing catches problems before your devices are deployed to thousands of locations where fixing them becomes expensive and disruptive.

“Analyzing traffic is not just for debugging,” said Max the Microcontroller. “It is also for testing! Load testing floods the network with simulated traffic to find the breaking point. How many sensors can the MQTT broker handle before it starts dropping messages? At what point does the Wi-Fi access point become overloaded?”

Sammy the Sensor described anomaly detection. “After monitoring normal traffic patterns for a while, you build a baseline. Then any deviation triggers an alert. If I usually send 10 packets per minute but suddenly start sending 1,000, something is wrong – maybe a firmware bug, or maybe a hacker has compromised me.”

Lila the LED explained HIL network testing. “Hardware-in-the-Loop network tests create realistic conditions – simulated packet loss, variable latency, and bandwidth throttling. You can test how your IoT system behaves when the network degrades, without waiting for real network problems to happen.”

Bella the Battery emphasized monitoring. “In production, continuous traffic monitoring watches for security threats, performance degradation, and device malfunctions. It is like having a security guard watching the network 24/7. Anomaly detection algorithms automatically flag suspicious patterns so engineers can investigate before problems affect users.”

17.3 Prerequisites

Before diving into this chapter, you should be familiar with:

17.4 How It Works: Network Load Testing

Network load testing systematically validates IoT system behavior under increasing traffic volumes to identify capacity limits and performance degradation points:

Step 1: Baseline Measurement

  • Capture production traffic for 24-48 hours to establish normal patterns
  • Calculate key metrics: average message rate (msg/min), peak rate, P50/P95/P99 latency
  • Example: Smart meter fleet averages 1,200 msg/min with 1,800 msg/min morning peak
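The P50/P95/P99 figures in Step 1 can be computed with nothing but the standard library; a minimal sketch (the sample data here is made up for illustration):

```python
from statistics import quantiles

def latency_percentiles(samples_ms):
    """Return (P50, P95, P99) from latency samples in milliseconds."""
    # quantiles(n=100) returns the 1st..99th percentile cut points
    cuts = quantiles(samples_ms, n=100)
    return cuts[49], cuts[94], cuts[98]

# Illustrative data: most messages near 45ms with a slow tail
samples = [45] * 950 + [150] * 40 + [400] * 10
p50, p95, p99 = latency_percentiles(samples)
print(f"P50={p50:.0f}ms  P95={p95:.0f}ms  P99={p99:.0f}ms")
```

In practice the samples would come from pcap timestamps (PUBLISH to PUBACK deltas), as described in Step 3.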

Step 2: Load Generation

  • Simulate N concurrent IoT devices using MQTT client libraries (paho-mqtt, mosquitto-clients)
  • Each simulated device publishes at realistic intervals (15-second sensor readings)
  • Gradually increase device count: 100 → 500 → 1,000 → 2,000 → failure point

Step 3: Performance Monitoring

  • Capture traffic with tcpdump/Wireshark during load test
  • Monitor server metrics: CPU %, memory usage, connection count, queue depth
  • Measure response times: PUBLISH → PUBACK latency at each load level

Step 4: Identify Bottleneck

  • Plot latency vs. load: performance degrades non-linearly at capacity limit
  • Example: Latency remains <100ms up to 1,500 msg/min, then spikes to 1,200ms at 2,000 msg/min
  • Analyze packet captures for retransmissions, timeouts, connection resets

Why This Works: IoT systems often have hidden capacity limits (database connection pools, network bandwidth, broker thread limits). Graduated load testing reveals the “knee” in the performance curve before production traffic hits it. Finding this limit in testing costs $0-500; discovering it in production costs $5,000-50,000 in downtime and emergency scaling.

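Sizing the test load is simple arithmetic: take the expected peak rate and add a safety margin (this chapter recommends testing at 150-200% of peak). A small sketch, where the 1.5x default margin is an adjustable assumption:

```python
def required_test_load(avg_rate, peak_rate, safety_margin=1.5):
    """Recommended load-test target in msg/min.

    safety_margin=1.5 reflects the 150-200% of-peak rule of thumb;
    raise it toward 2.0 for spiky or safety-critical traffic.
    """
    burstiness = peak_rate / avg_rate     # how far peak exceeds average
    target = peak_rate * safety_margin    # msg/min to generate in testing
    return round(target), round(burstiness, 2)

# Chapter example: smart meter fleet, 1,200 msg/min average, 1,800 peak
target, burstiness = required_test_load(1200, 1800)
print(f"Test at {target} msg/min (peak x1.5, burstiness {burstiness}x)")
```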


17.5 Testing and Validation Guide

Systematic network testing ensures IoT deployments meet performance, reliability, and security requirements.

17.5.1 Testing Pyramid for Network Validation

Network testing follows a layered approach from protocol compliance to production monitoring:

| Level | Scope | Tools | Automation | Execution Time |
|---|---|---|---|---|
| Protocol Unit Tests | Individual protocol messages | Scapy, Python scripts | High (90%+) | Seconds |
| Integration Tests | Multi-device communication | Testbed with real devices | Medium (60-80%) | Minutes |
| System Tests | End-to-end network flow | Production-like testbed | Medium (40-60%) | Hours |
| Field Tests | Real-world networks | Pilot deployment monitoring | Low (10-20%) | Days-Weeks |

Network Testing Priorities:

  • 60% Protocol Compliance: Validate MQTT, CoAP, Zigbee message formats and timing
  • 25% Performance Testing: Measure latency, throughput, packet loss under load
  • 10% Security Testing: Verify encryption, authentication, vulnerability scanning
  • 5% Chaos Testing: Inject failures to validate resilience and recovery

17.5.2 Hardware-in-the-Loop (HIL) Network Testing

Create controlled network conditions to validate device behavior:

| Component | Purpose | Example | Setup Cost |
|---|---|---|---|
| DUT (Device Under Test) | IoT device being tested | ESP32 sensor node | $10-50 |
| Network Emulator | Simulate latency/loss | Linux tc (traffic control) | $0 (software) |
| MQTT Broker | Message broker | Mosquitto on Raspberry Pi | $35-75 |
| Packet Capture | Traffic recording | Wireshark on monitoring PC | $0 (software) |
| Load Generator | Stress testing | Python scripts, JMeter | $0 (software) |
| Network TAP | Non-intrusive capture | Ethernet TAP device | $50-200 |

Hardware-in-the-Loop Testing ROI: Building testbed for MQTT sensor network (50-device production scale):

DIY testbed cost: \[\text{Cost}_{\text{testbed}} = \$75 \text{ (Raspberry Pi broker)} + \$200 \text{ (managed switch)} + \$150 \text{ (10 ESP32 test nodes)} = \$425\]

Setup time: \(20\,\text{hr} \times \$85/\text{hr} = \$1,700\)

Total investment: \(\$2,125\)

Bugs caught in testbed vs field:

  • Pre-testbed: 8 bugs/release reached production (\(8 \times \$180 \text{ field service} = \$1,440\) per release)
  • With testbed: 1 bug/release reached production (\(\$180\) per release)

Savings per release: \(\$1,260\)

Break-even point: \(\dfrac{\$2,125}{\$1,260} = 1.7\) releases (2 months for monthly releases)

Year 1 savings: \((12 - 2) \times \$1,260 = \$12,600\). Testbed automation also reduces manual test time from 40 hr to 8 hr per release (\(32\,\text{hr} \times \$85 = \$2,720\) saved per release). The controlled testing environment prevents 87% of field failures (7 of 8 bugs caught before release).



HIL Network Test Architecture:

Internet/Cloud
    ^ (Traffic Capture Point 1: WAN side)
Gateway/Router with Port Mirroring
    v (Traffic Capture Point 2: LAN side)
    |-> MQTT Broker (Raspberry Pi)
    |-> Network Emulator (Linux tc - inject delay, loss, jitter)
    --> DUT Fleet (5-10 IoT devices under test)
            v
    Monitoring PC (Wireshark, tshark, Grafana dashboards)

Example Network Emulation Script (Linux tc):

#!/bin/bash
# Simulate poor network conditions for testing.
# Note: an interface has only one root qdisc, so latency, loss, and
# bandwidth limiting are combined into a single netem command
# (the netem "rate" option requires kernel 3.3+; separate tbf/netem
# root qdiscs would conflict with each other).

INTERFACE="eth0"

# 100ms latency (+/-20ms jitter), 2% packet loss, 1 Mbps bandwidth cap
sudo tc qdisc add dev $INTERFACE root netem delay 100ms 20ms loss 2% rate 1mbit

# Test device behavior under these conditions
echo "Network conditions applied. Run your tests now."
echo "Press Enter to restore normal network..."
read

# Remove network emulation
sudo tc qdisc del dev $INTERFACE root
echo "Network conditions restored."

17.5.3 Test Cases Checklist

Systematically validate network behavior across all IoT protocols and scenarios:

Functional Network Tests:

Performance Tests:

Reliability Tests:

Security Tests:

17.5.4 Network Test Report Template

Document network test execution with traffic captures for troubleshooting:

# Network Performance Test Report

**Test:** [Test Name - e.g., "MQTT Latency Under Load"]
**Date:** [YYYY-MM-DD]
**Tester:** [Name]
**Device:** [Model, Firmware Version, Network Interface]
**Network:** [SSID, Broker URL, Subnet]
**Result:** [PASS / FAIL / DEGRADED]

## Network Configuration
- Wi-Fi SSID: [Name, Channel, 2.4GHz/5GHz]
- MQTT Broker: [IP/Hostname, Port]
- Latency Emulation: [None / 50ms / 100ms]
- Packet Loss: [None / 1% / 5%]
- Bandwidth Limit: [None / 1 Mbps]

## Test Steps
1. [Action - e.g., "Connect 10 IoT devices to MQTT broker"]
2. [Action - e.g., "Each device publishes 1 message/second for 60 seconds"]
3. [Action - e.g., "Capture traffic with Wireshark on broker interface"]
4. [Action - e.g., "Calculate P50, P95, P99 latency from pcap timestamps"]

## Expected Result
[Description - e.g., "Latency P95 <150ms, no packet loss, all 600 messages delivered"]

## Actual Result
[Description - e.g., "Latency P50=45ms, P95=120ms, P99=230ms. 598/600 messages delivered (99.7%)"]

## Metrics
- **Latency:**
  - P50 (median): 45ms (target <100ms)
  - P95: 120ms (target <150ms)
  - P99: 230ms (target <200ms, slightly high)
- **Packet Loss:** 0.3% (target <1%)
- **Throughput:** 9.8 messages/sec (target 10/sec)
- **Retransmissions:** 12 occurrences (2% of traffic)

## Evidence
- PCAP capture: `captures/mqtt_latency_load_test.pcap`
- Wireshark I/O Graph: `graphs/mqtt_latency_timeline.png`
- tshark analysis: `analysis/mqtt_message_count.txt`
- Grafana dashboard: [Screenshot or link]

## Analysis
- P99 latency spike at T+45s correlates with 10th device connecting (connection flood)
- TCP retransmissions indicate Wi-Fi congestion (channel 6 has interference)
- 2 missing messages due to broker queue overflow during connection flood

## Recommendations
- [ ] Implement connection backoff (stagger device startup by 5-10 seconds)
- [ ] Change Wi-Fi channel to 1 or 11 (avoid overlap with neighbor networks)
- [ ] Increase broker message queue size from 100 to 500
- [ ] Add connection rate limiting on broker (max 5 connections/second)

## Follow-Up Actions
- [ ] Re-test after Wi-Fi channel change
- [ ] Validate broker configuration changes in staging
- [ ] Add P99 latency monitoring alert (threshold 200ms)

17.5.5 Automated Network Testing

For production network testing, use automated test frameworks that validate MQTT protocol compliance, QoS levels, connection handling, and error recovery. Key testing areas include:

  • Connection Testing: Validate MQTT CONNECT/CONNACK handshake timing and success rates
  • QoS Validation: Test QoS 0 (at-most-once), QoS 1 (at-least-once), and QoS 2 (exactly-once) delivery guarantees
  • Subscribe/Publish: Verify topic subscription, message routing, and payload delivery
  • Reconnection Logic: Test automatic reconnection after network failures with exponential backoff
  • Error Handling: Validate behavior under connection refused, timeout, and malformed packet scenarios
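
The reconnection-logic item above depends on a well-behaved backoff schedule, which is easy to unit test in isolation. A minimal sketch of capped exponential backoff with jitter (the base, cap, and jitter values are illustrative, not from any particular client library):

```python
import random

def backoff_schedule(base=1.0, cap=60.0, attempts=6, jitter=0.2):
    """Reconnect delays in seconds for successive failed attempts."""
    delays = []
    for attempt in range(attempts):
        delay = min(cap, base * (2 ** attempt))       # 1, 2, 4, 8, ... capped
        delay *= 1 + random.uniform(-jitter, jitter)  # spread reconnects out
        delays.append(delay)
    return delays

# A test can assert the schedule stays inside the expected envelope,
# preventing reconnection storms after a broker outage
for attempt, delay in enumerate(backoff_schedule()):
    print(f"attempt {attempt + 1}: reconnect after {delay:.1f}s")
```

The jitter matters at fleet scale: without it, every device that lost the broker at the same moment reconnects at the same moment, recreating the connection flood.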

Recommended Tools:

  • Paho MQTT Testing: Python library with built-in testing utilities
  • MQTT.fx: Java-based GUI testing tool for manual protocol validation
  • HiveMQ MQTT CLI: Command-line testing and debugging tool
  • Mosquitto Test Suite: Official test utilities for MQTT broker compliance

Production Test Strategy:

# Example automated test execution
# Use established testing frameworks with CI/CD integration
python -m pytest tests/mqtt_compliance_tests.py

17.5.6 Continuous Network Monitoring Setup

Deploy ongoing traffic analysis for production networks:

Wireshark + tshark Continuous Capture:

#!/bin/bash
# Rotate packet captures every hour for continuous monitoring

INTERFACE="eth0"
CAPTURE_DIR="/var/captures"
FILTER="port 1883 or port 8883"  # MQTT only

mkdir -p $CAPTURE_DIR

# Capture with 1-hour rotation, keep last 24 files (24 hours)
sudo tshark -i $INTERFACE -f "$FILTER" \
  -b duration:3600 -b files:24 \
  -w $CAPTURE_DIR/mqtt_continuous.pcap

Automated Anomaly Detection:

#!/usr/bin/env python3
"""
Real-time network anomaly detection from live packet capture
Alerts on unusual patterns: connection floods, message rate spikes, retransmission storms
"""

import subprocess
import time

THRESHOLD_CONNECTIONS_PER_MIN = 100
THRESHOLD_RETRANSMISSIONS_PERCENT = 5.0

def analyze_live_capture(interface="eth0", duration=60):
    """Capture and analyze traffic for specified duration"""
    print(f"Capturing traffic on {interface} for {duration}s...")

    # Capture with tshark; tcpdump does not label retransmissions, so use
    # tshark's TCP analysis fields instead (one tab-separated line per packet)
    cmd = (f"sudo timeout {duration} tshark -i {interface} -f 'tcp port 1883' "
           f"-T fields -e tcp.flags.syn -e tcp.flags.ack -e tcp.analysis.retransmission")
    result = subprocess.run(cmd, shell=True, capture_output=True, text=True)

    # Parse output
    lines = [line for line in result.stdout.split('\n') if line.strip()]
    syn_count = 0
    retrans_count = 0
    total_packets = len(lines)

    for line in lines:
        fields = line.split('\t')
        if len(fields) >= 2 and fields[0] in ('1', 'True') and fields[1] in ('0', 'False'):
            syn_count += 1  # SYN set, ACK clear: new connection attempt
        if len(fields) >= 3 and fields[2].strip():
            retrans_count += 1  # flagged by tshark's TCP analysis

    # Calculate metrics
    connections_per_min = syn_count * (60.0 / duration)
    retrans_percent = (retrans_count / total_packets * 100) if total_packets > 0 else 0

    print(f"\n--- Network Analysis Results ---")
    print(f"Total packets: {total_packets}")
    print(f"New connections: {syn_count} ({connections_per_min:.1f}/min)")
    print(f"Retransmissions: {retrans_count} ({retrans_percent:.2f}%)")

    # Anomaly detection
    alerts = []
    if connections_per_min > THRESHOLD_CONNECTIONS_PER_MIN:
        alerts.append(f"HIGH CONNECTION RATE: {connections_per_min:.1f}/min (threshold: {THRESHOLD_CONNECTIONS_PER_MIN})")

    if retrans_percent > THRESHOLD_RETRANSMISSIONS_PERCENT:
        alerts.append(f"HIGH RETRANSMISSION RATE: {retrans_percent:.2f}% (threshold: {THRESHOLD_RETRANSMISSIONS_PERCENT}%)")

    if alerts:
        print("\nANOMALIES DETECTED:")
        for alert in alerts:
            print(alert)
    else:
        print("\nNo anomalies detected")

    return alerts

if __name__ == "__main__":
    while True:
        alerts = analyze_live_capture(duration=60)
        if alerts:
            # In production: send email, webhook, Slack notification
            print("(Alert sent to monitoring system)")
        time.sleep(60)  # Analyze every minute

17.5.7 Performance Testing with Load Generation

Validate network behavior under realistic load conditions:

MQTT Load Generator (Python):

#!/usr/bin/env python3
"""
MQTT Load Generator - Simulate N concurrent IoT devices
Requires paho-mqtt 2.0+
"""

import paho.mqtt.client as mqtt
import threading
import time
import random

BROKER = "localhost"
PORT = 1883
NUM_CLIENTS = 50
PUBLISH_INTERVAL = 5  # seconds

def iot_device_simulator(client_id, topic, duration=60):
    """Simulate one IoT device publishing sensor data"""
    client = mqtt.Client(mqtt.CallbackAPIVersion.VERSION2, client_id=f"device_{client_id}")

    try:
        client.connect(BROKER, PORT)
        client.loop_start()

        end_time = time.time() + duration
        while time.time() < end_time:
            # Simulate sensor reading
            temperature = 20 + random.uniform(-5, 5)
            humidity = 60 + random.uniform(-10, 10)
            payload = f'{{"temp": {temperature:.1f}, "humidity": {humidity:.1f}}}'

            client.publish(topic, payload, qos=1)
            print(f"[Device {client_id}] Published: {payload}")

            time.sleep(PUBLISH_INTERVAL)

        client.loop_stop()
        client.disconnect()
    except Exception as e:
        print(f"[Device {client_id}] ERROR: {e}")

def main():
    print(f"Starting load test: {NUM_CLIENTS} devices, {PUBLISH_INTERVAL}s interval")

    threads = []
    for i in range(NUM_CLIENTS):
        topic = f"sensors/device_{i}/data"
        t = threading.Thread(target=iot_device_simulator, args=(i, topic, 120))
        t.start()
        threads.append(t)
        time.sleep(0.1)  # Stagger startup

    # Wait for all devices to finish
    for t in threads:
        t.join()

    print("Load test complete")

if __name__ == "__main__":
    main()

Run load test and monitor:

# Terminal 1: Start MQTT broker with verbose logging
mosquitto -v

# Terminal 2: Start Wireshark capture
wireshark -i lo -f "port 1883" &

# Terminal 3: Run load generator
python3 mqtt_load_generator.py

# Terminal 4: Monitor broker message throughput via $SYS topics
mosquitto_sub -t '$SYS/broker/load/messages/received/1min' -v

17.6 Worked Example: Performance Testing an MQTT-Based IoT Fleet

Scenario

Your company operates 25,000 smart water meters deployed across a metropolitan area. Customers report intermittent “meter offline” alerts, but devices show connected in the backend. You suspect the MQTT broker is dropping messages under load. You need to performance test the system to identify the bottleneck.

Given:

  • Fleet size: 25,000 water meters
  • MQTT broker: Mosquitto on AWS EC2 (t3.large, 2 vCPU, 8GB RAM)
  • Message frequency: Each meter publishes every 15 minutes (1,667 messages/minute peak)
  • Message size: 256 bytes average (JSON payload with reading + metadata)
  • QoS level: 1 (at-least-once delivery)
  • Current issue: 3-5% of messages “lost” during peak hours (6-9 AM)
  • SLA requirement: 99.9% message delivery, P95 latency < 500ms

Step 1: Establish baseline metrics from production traffic

# Capture 1 hour of production MQTT traffic
$ sudo tcpdump -i eth0 port 1883 -w mqtt_baseline.pcap -G 3600 -W 1

# Analyze with tshark
$ tshark -r mqtt_baseline.pcap -Y "mqtt" -T fields \
    -e frame.time_relative -e mqtt.msgtype -e mqtt.topic \
    > mqtt_messages.csv

# Calculate message statistics
$ python3 analyze_mqtt.py mqtt_messages.csv

Production Baseline (1 hour, 7-8 AM):
=====================================
Total MQTT packets:     127,456
PUBLISH messages:       98,234 (77%)
PUBACK messages:        95,891 (97.6% of PUBLISH acknowledged)
Missing PUBACK:         2,343 (2.4% potential loss)
Average inter-arrival:  28.7ms
P50 RTT (PUBLISH->PUBACK): 45ms
P95 RTT:                  312ms
P99 RTT:                  1,247ms (!)
Max RTT:                  8,934ms (timeout)

Finding: P99 latency of 1.2 seconds exceeds SLA; 2.4% message loss during peak
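The `analyze_mqtt.py` script invoked above is not listed in the chapter; a minimal sketch of what it might look like, tallying message types and rates from the tab-separated tshark export (MQTT PUBLISH is message type 3, PUBACK is 4):

```python
import csv
import sys
from collections import Counter

MQTT_PUBLISH, MQTT_PUBACK = "3", "4"

def summarize(csv_path):
    """Tally MQTT message types from a tshark -T fields export."""
    counts = Counter()
    timestamps = []
    with open(csv_path) as f:
        for row in csv.reader(f, delimiter="\t"):
            if len(row) < 2 or not row[0]:
                continue  # skip blank or malformed lines
            timestamps.append(float(row[0]))
            counts[row[1]] += 1
    publishes, pubacks = counts[MQTT_PUBLISH], counts[MQTT_PUBACK]
    span = timestamps[-1] - timestamps[0] if len(timestamps) > 1 else 0.0
    return {
        "publish": publishes,
        "puback": pubacks,
        "ack_rate": pubacks / publishes if publishes else 0.0,
        "msgs_per_min": publishes * 60 / span if span else 0.0,
    }

if __name__ == "__main__" and len(sys.argv) > 1:
    print(summarize(sys.argv[1]))
```

Computing PUBLISH-to-PUBACK RTT percentiles additionally requires exporting `mqtt.msgid` so each acknowledgment can be paired with its request.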

Step 2: Identify correlation between load and latency

| Time Window | Messages/min | P50 RTT | P95 RTT | P99 RTT | Loss Rate |
|---|---|---|---|---|---|
| 5:00-6:00 | 892 | 23ms | 67ms | 145ms | 0.1% |
| 6:00-7:00 | 1,234 | 34ms | 156ms | 423ms | 0.8% |
| 7:00-8:00 | 1,667 | 45ms | 312ms | 1,247ms | 2.4% |
| 8:00-9:00 | 1,589 | 41ms | 287ms | 987ms | 1.9% |
| 9:00-10:00 | 1,123 | 31ms | 134ms | 312ms | 0.4% |

Pattern: Latency degrades exponentially above 1,400 messages/minute

Step 3: Load test results analysis

| Load (msg/min) | Clients | P50 RTT | P95 RTT | P99 RTT | Loss % | CPU % | Memory |
|---|---|---|---|---|---|---|---|
| 500 | 50 | 18ms | 45ms | 78ms | 0.02% | 12% | 1.2GB |
| 1,000 | 50 | 24ms | 89ms | 167ms | 0.08% | 24% | 1.4GB |
| 1,500 | 50 | 38ms | 234ms | 567ms | 0.31% | 48% | 2.1GB |
| 2,000 | 50 | 67ms | 456ms | 1,890ms | 1.2% | 72% | 3.8GB |
| 2,500 | 50 | 145ms | 1,234ms | 4,567ms | 3.8% | 89% | 5.9GB |
| 3,000 | 50 | 312ms | 2,890ms | timeout | 8.2% | 97% | 7.6GB |

Bottleneck identified: Broker CPU saturates above 2,000 msg/min; memory pressure starts at 1,500 msg/min
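The knee can also be located programmatically by scanning the results for the first load level that violates the SLA targets; a sketch using the numbers from the table above:

```python
def find_knee(results, p95_limit_ms=500, loss_limit_pct=1.0):
    """Return the highest load level that still meets the SLA targets.

    results: list of (load_msg_per_min, p95_ms, loss_pct), ascending load.
    """
    safe_load = None
    for load, p95, loss in results:
        if p95 < p95_limit_ms and loss < loss_limit_pct:
            safe_load = load
        else:
            break  # degradation is monotonic past the knee
    return safe_load

# Load-test table from Step 3: (load, P95 RTT ms, loss %)
table = [(500, 45, 0.02), (1000, 89, 0.08), (1500, 234, 0.31),
         (2000, 456, 1.2), (2500, 1234, 3.8), (3000, 2890, 8.2)]
print(f"SLA-compliant up to {find_knee(table)} msg/min")
```

Running this over the Step 3 data reproduces the conclusion that the broker saturates at about 1,500 msg/min for SLA compliance.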

Step 4: Calculate required infrastructure for SLA compliance

Current Capacity Analysis:
==========================
Peak production load:    1,667 msg/min
SLA requirement:         99.9% delivery, P95 < 500ms

Current broker:
- Saturates at ~1,500 msg/min for SLA compliance
- Capacity margin: -10% (already over capacity!)

Options:

Option A: Vertical scaling (t3.xlarge)
- 4 vCPU, 16GB RAM
- Estimated capacity: 3,500 msg/min
- Cost: +$50/month
- Margin: +110% headroom

Option B: Horizontal scaling (2x t3.large + LB)
- 2 brokers behind HAProxy
- Estimated capacity: 2,800 msg/min
- Cost: +$80/month
- Margin: +68% headroom
- Benefit: High availability

Recommendation: Option A short-term, migrate to managed service long-term

Step 5: Verify fix

| Metric | Before (t3.large) | After (t3.xlarge) | Improvement |
|---|---|---|---|
| P50 RTT @ 1,667/min | 45ms | 28ms | 38% faster |
| P95 RTT @ 1,667/min | 312ms | 89ms | 71% faster |
| P99 RTT @ 1,667/min | 1,247ms | 178ms | 86% faster |
| Message loss | 2.4% | 0.04% | 98% reduction |
| CPU utilization | 72% | 34% | 53% headroom |

Result: Upgrading from t3.large to t3.xlarge reduced P99 latency from 1,247ms to 178ms (86% improvement) and message loss from 2.4% to 0.04%. The system now meets the 99.9% delivery SLA with 53% CPU headroom for growth.

Key Insight: Performance testing IoT systems requires graduated load testing that exceeds production peaks by 50-100%. The relationship between load and latency is often non-linear: our system performed acceptably at 1,000 msg/min but degraded exponentially above 1,500 msg/min. Always identify the “knee” in the load curve.

17.7 Worked Example: Diagnosing Intermittent Packet Loss in LoRaWAN Network

Scenario

Your agricultural IoT deployment has 340 soil moisture sensors across 12 farms connected via 8 LoRaWAN gateways. Farmers report that 15-20% of hourly readings are missing from the dashboard, but the network server logs show all uplinks as “successful.” You need to use traffic analysis to find where packets are being lost.

Given:

  • Sensors: 340 Dragino LSE01 soil sensors
  • Gateways: 8 Kerlink Wirnet stations
  • Network server: ChirpStack on-premise
  • Expected uplinks: 340 sensors x 24 hours = 8,160 per day
  • Actual dashboard readings: 6,800-7,000 per day (16-17% missing)
  • Network server logs: 8,100+ uplinks received (99%+ success)

Step 1: Map the data path and identify measurement points

Data Flow:
Sensor -> [RF] -> Gateway -> [UDP] -> Network Server -> [MQTT] ->
Application Server -> [PostgreSQL] -> Dashboard API -> Dashboard

Measurement Points:
A. Gateway packet forwarder logs (radio reception)
B. Network server uplink logs (UDP ingestion)
C. Application server MQTT subscription (decoded payloads)
D. Database insertion logs (persistence)
E. Dashboard API query results (display)

Step 2: Collect packet counts at each measurement point (24-hour sample)

| Point | Description | Packets | % of Expected |
|---|---|---|---|
| Expected | 340 sensors x 24 hours | 8,160 | 100% |
| A | Gateway RF reception | 8,247 | 101% (duplicates OK) |
| B | Network server ingestion | 8,134 | 99.7% |
| C | Application server MQTT | 8,089 | 99.1% |
| D | Database insertions | 6,912 | 84.7% |
| E | Dashboard display | 6,891 | 84.4% |

Gap identified: 1,177 packets lost between Application Server (C) and Database (D)
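The same layer-by-layer comparison can be automated; a sketch that flags the boundary with the largest drop, using the counts from Step 2:

```python
def largest_drop(layer_counts):
    """Find the adjacent layer boundary with the biggest packet loss.

    layer_counts: ordered list of (layer_name, packets_observed) pairs,
    upstream first.
    """
    worst = (None, None, 0)
    for (up_name, up), (down_name, down) in zip(layer_counts, layer_counts[1:]):
        lost = up - down
        if lost > worst[2]:
            worst = (up_name, down_name, lost)
    return worst

# 24-hour counts at each measurement point from Step 2
points = [("gateway RF", 8247), ("network server", 8134),
          ("app server MQTT", 8089), ("database", 6912), ("dashboard", 6891)]
src, dst, lost = largest_drop(points)
print(f"Largest drop: {lost} packets between {src} and {dst}")
```

On the Step 2 data this immediately points at the application-server-to-database boundary rather than the radio link.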

Step 3: Analyze application server logs during loss events

# Filter for database-related errors
$ grep -E "(INSERT|database|timeout|connection)" app_server.log | \
    awk '{print $1, $2, $NF}' | uniq -c | sort -rn | head -20

Loss Pattern Analysis (24 hours):
===============================
147 occurrences: "connection pool exhausted, dropping message"
89 occurrences:  "database timeout after 5000ms"
23 occurrences:  "duplicate key violation sensor_id+timestamp"
12 occurrences:  "payload decode error: invalid CRC"

Total errors: 271 logged
Unaccounted: 1,177 - 271 = 906 messages (silent drops)

Finding: 906 messages silently dropped without error logging

Step 4: Root cause analysis - database connection pool exhaustion

# Connection pool configuration
POOL_SIZE = 10
MAX_OVERFLOW = 5
POOL_TIMEOUT = 5  # seconds

# Query connection stats during peak hour
Connection Analysis (8:00 AM sample):
=====================================
Active connections:     15 (max 15 = POOL_SIZE + MAX_OVERFLOW)
Waiting connections:    8 (queued behind pool)
Avg query time:         127ms
Queries/second:         89 (from 340 sensors arriving ~same minute)

Problem: 340 messages arrive within a 60-second window
- Sustained rate: 340 / 60 = 5.7 messages/second
- Burst rate: 340 in the first 10 seconds = 34/second
- Pool capacity: 15 connections / 127ms per query = ~118 queries/second
- Each message triggers multiple queries (89 queries/sec observed at only
  5.7 msg/sec), so the burst demands several hundred queries/second
- The queue backs up, waits exceed the 5-second POOL_TIMEOUT, messages drop

Solution: Increase pool size or batch inserts

Step 5: Implement and verify fix

| Configuration | Pool Size | Batch Size | Peak Loss | Daily Total |
|---|---|---|---|---|
| Original | 10 | 1 | 15.6% | 1,177 lost |
| Pool +25 | 25 | 1 | 8.2% | 623 lost |
| Pool +25 + Batch | 25 | 10 | 1.1% | 84 lost |
| Pool +50 + Batch | 50 | 10 | 0.3% | 23 lost |

Final configuration: Pool size 50, batch insert every 100ms or 10 messages - Peak loss reduced from 15.6% to 0.3% - Daily delivery improved from 84.4% to 99.7%
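The batching half of the fix works by buffering rows and flushing on either a size or a time trigger, whichever fires first. A minimal sketch (the `flush_fn` callback stands in for a real bulk INSERT via your database driver):

```python
import time

class BatchInserter:
    """Micro-batching writer: flush every max_batch rows or max_wait
    seconds, so one pool connection serves many messages."""

    def __init__(self, flush_fn, max_batch=10, max_wait=0.1):
        self.flush_fn = flush_fn
        self.max_batch = max_batch
        self.max_wait = max_wait
        self.buffer = []
        self.last_flush = time.monotonic()

    def add(self, row):
        self.buffer.append(row)
        if (len(self.buffer) >= self.max_batch or
                time.monotonic() - self.last_flush >= self.max_wait):
            self.flush()

    def flush(self):
        if self.buffer:
            self.flush_fn(self.buffer)  # one round-trip for the whole batch
            self.buffer = []
        self.last_flush = time.monotonic()

# One bulk insert per 10 readings cuts connection-pool demand by ~10x
batches = []
writer = BatchInserter(batches.append)
for i in range(25):
    writer.add((f"sensor_{i}", 21.5))
writer.flush()  # drain the remainder
print(f"{len(batches)} bulk inserts for 25 readings")
```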

Result: The missing packets were not lost at the LoRaWAN layer (99.7% delivery to network server) but at the application layer due to database connection pool exhaustion during message bursts.

Key Insight: When troubleshooting IoT data loss, measure at every layer boundary, not just endpoints. The farmers saw “missing data” and blamed the radio network, but 99.7% of packets reached the network server successfully. The actual bottleneck was a database connection pool 6 hops downstream.

17.8 Knowledge Check

17.9 Common Pitfalls

Testing and Monitoring Mistakes

1. Capturing at the Wrong Network Location

  • Mistake: Running Wireshark on your development laptop to debug MQTT issues between an IoT device and a cloud broker, then seeing no relevant traffic
  • Why it happens: Modern switched networks only forward packets to their destination port. Without port mirroring, you see only broadcast traffic
  • Solution: Identify the correct capture point before starting analysis. For device-to-cloud issues, capture at the gateway or enable port mirroring

2. Testing Only at Average Load

  • Mistake: Load testing at expected average load (e.g., 1,000 msg/min) and declaring the system healthy, then experiencing failures at peak load (2,000 msg/min)
  • Why it happens: IoT systems often have predictable peaks (morning usage, hourly reporting) that exceed average by 2-3x
  • Solution: Test at 150-200% of expected peak load to identify the “knee” where performance degrades non-linearly

3. Missing Silent Failures

  • Mistake: Assuming all failures are logged and only checking error logs for lost messages
  • Why it happens: Many systems silently drop messages when queues overflow, timeouts occur, or backpressure isn’t implemented
  • Solution: Measure message counts at every layer boundary (sender, broker, receiver, database) and compare totals to find discrepancies

Testing scope by fleet size:

| Fleet Size | Testing Approach | Budget | Critical Tests |
|---|---|---|---|
| <100 devices | Manual testing, single test device | $1-5K | Connection reliability, basic load |
| 100-1,000 | Automated testing, 5-10 device farm | $5-20K | Protocol compliance, moderate load |
| 1,000-10,000 | HIL network testing, traffic replay | $20-50K | Graduated load, failover testing |
| >10,000 | Production-scale testing, chaos engineering | $50-200K+ | Peak load +50%, network partition simulation |

Test frequency by scale:

  • Smoke test (every commit): Connection establishment, basic publish/subscribe
  • Integration test (nightly): Full protocol compliance, 100-device load simulation
  • Load test (weekly): Peak load +50%, sustained for 4 hours
  • Chaos test (monthly): Network failures, message loss, broker crashes

Key Insight: Testing scope scales with fleet size. 100 devices tolerate manual testing; 10,000 devices require automated load testing and chaos engineering.

17.10 Concept Check


17.11 Concept Relationships

Prerequisites:

Build On This:

Real-World Application:


17.12 See Also

Testing Tools:

  • JMeter MQTT Plugin - Load testing MQTT brokers
  • Locust - Python-based load testing framework
  • K6 - Modern load testing tool with MQTT support

Related Chapters:


17.13 Try It Yourself: MQTT Broker Load Test

Objective: Perform graduated load testing on an MQTT broker to find the performance knee.

What You’ll Need:

  • Local MQTT broker (Mosquitto)
  • Python 3 with paho-mqtt library
  • System monitoring tool (htop, top, or Grafana)

Step 1: Baseline Test (50 devices)

# Save as mqtt_load_test.py
import paho.mqtt.client as mqtt
import threading
import time
import random

BROKER = "localhost"
PORT = 1883
NUM_CLIENTS = 50  # Increase this value
PUBLISH_INTERVAL = 5

def iot_device(client_id):
    client = mqtt.Client(mqtt.CallbackAPIVersion.VERSION2, client_id=f"device_{client_id}")
    client.connect(BROKER, PORT)
    client.loop_start()

    for i in range(12):  # 1 minute test (12 × 5 sec)
        payload = f'{{"temp": {20 + random.uniform(-5,5):.1f}}}'
        client.publish(f"sensors/device_{client_id}", payload, qos=1)
        time.sleep(PUBLISH_INTERVAL)

    client.loop_stop()
    client.disconnect()

threads = []
start_time = time.time()
for i in range(NUM_CLIENTS):
    t = threading.Thread(target=iot_device, args=(i,))
    t.start()
    threads.append(t)
    time.sleep(0.1)  # Stagger startup

for t in threads:
    t.join()

duration = time.time() - start_time
print(f"Load test complete: {NUM_CLIENTS} devices, {duration:.1f} seconds")

Step 2: Run Tests at Increasing Loads

# Terminal 1: Monitor broker CPU/memory
htop

# Terminal 2: Run load tests
python3 mqtt_load_test.py  # 50 devices
# Edit NUM_CLIENTS to 100, 200, 500, 1000

Step 3: Measure Latency at Each Level

  • Use Wireshark to capture MQTT traffic
  • Filter for QoS 1: mqtt.msgtype == 3 && mqtt.qos == 1
  • Measure PUBLISH → PUBACK time delta
  • Record CPU %, memory %, connection count

What to Observe:

  • At low load (50-100): Latency <50ms, CPU <20%
  • At medium load (200-500): Latency 50-150ms, CPU 30-60%
  • At high load (1000+): Latency >500ms, CPU >80%, possible connection failures

Expected Outcome: Graph showing latency vs. load reveals the performance knee (point where latency spikes exponentially).

Challenge: Add database writes to simulate real system. Observe how database connection pool exhaustion affects MQTT latency.


17.14 Summary

  • Network testing pyramid progresses from protocol unit tests (seconds) through integration and system tests (hours) to field validation (days-weeks)
  • Hardware-in-the-loop testing with Linux tc network emulation enables controlled validation under adverse conditions (latency, packet loss, bandwidth limits)
  • Continuous monitoring with tshark rotating captures and automated anomaly detection provides ongoing visibility into production traffic
  • Load testing should exceed production peaks by 50-100% to identify non-linear degradation and capacity limits
  • Multi-layer measurement at every system boundary reveals where data loss actually occurs vs. where users perceive it

17.15 What’s Next

The next section covers Software Platforms and Frameworks, which explores the integrated services and infrastructure available for building complete IoT systems. While individual devices and networks are important, platforms provide the glue that brings distributed IoT systems together.

Previous Current Next
Analyzing IoT Protocols Traffic Analysis & Monitoring CI/CD and DevOps for IoT