MQTT reliability combines three mechanisms: QoS levels (0/1/2) controlling per-message delivery guarantees, persistent sessions queuing messages for offline subscribers, and retained messages storing the last value on each topic for immediate delivery to new subscribers. The Last Will and Testament (LWT) enables automatic failure notification when a client disconnects unexpectedly.
35.1 Learning Objectives
By the end of this chapter, you will be able to:
Select Appropriate QoS Levels: Choose QoS 0, 1, or 2 based on data criticality and battery constraints
Explain Delivery Guarantees: Describe the message flow and acknowledgment packets for each QoS level and justify why each step is necessary
Configure Persistent Sessions: Apply clean_session=false with fixed client IDs to enable offline message queuing for actuators and command receivers
Implement Retained Messages: Demonstrate how to store last-known values for immediate delivery to new subscribers and distinguish state topics from event topics
Design LWT Monitoring: Construct Last Will and Testament configurations for graceful failure notification and combine with retained messages for persistent device status
Calculate Energy Trade-offs: Compare battery consumption across QoS levels and assess the impact of keepalive interval on detection latency versus power draw
Key Concepts
QoS 0 (At Most Once): Fire-and-forget delivery with no acknowledgment — lowest overhead, message loss possible
QoS 1 (At Least Once): ACK-based delivery ensuring the message arrives at least once — duplicates possible if the PUBACK is lost
QoS 2 (Exactly Once): Four-packet handshake (PUBLISH/PUBREC/PUBREL/PUBCOMP) guaranteeing exactly-once delivery — highest overhead, no loss, no duplicates
For a 2,000 mAh battery, QoS 2 consumes 29% of battery capacity just for MQTT overhead versus 7% for QoS 0 — a critical factor in 5+ year sensor deployments.
35.3.3 QoS Battery Life Calculator
Estimate how QoS level selection affects battery life for your IoT device. Adjust battery capacity, message rate, and radio characteristics to compare all three QoS levels side by side.
```js
viewof qosBatCap = Inputs.range([100,10000], {value:2000, step:100, label:"Battery capacity (mAh)"})
viewof qosMsgPerMin = Inputs.range([0.1,60], {value:1, step:0.1, label:"Messages per minute"})
viewof qosTxCurrent = Inputs.range([5,200], {value:20, step:1, label:"Transmit current (mA)"})
viewof qosTxDuration = Inputs.range([5,200], {value:50, step:5, label:"Tx time per packet (ms)"})
viewof qosSleepCurrent = Inputs.range([0.001,1.0], {value:0.01, step:0.001, label:"Sleep current (mA)"})
```
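The widget's arithmetic reduces to a few lines of Python (a sketch, not the widget's actual source; it assumes 1/2/4 packets on air per publish for QoS 0/1/2 and treats receive current as equal to transmit current):

```python
def battery_life_hours(qos, capacity_mah=2000, msgs_per_min=1.0,
                       tx_current_ma=20, tx_ms_per_packet=50,
                       sleep_current_ma=0.01):
    """Estimate battery life for one QoS level (radio-dominated model)."""
    packets = {0: 1, 1: 2, 2: 4}[qos]  # packets on air per publish
    msgs_per_sec = msgs_per_min / 60
    # Duty-cycled average current (mA): sleep floor plus radio bursts.
    radio_ma = msgs_per_sec * packets * (tx_ms_per_packet / 1000) * tx_current_ma
    avg_ma = sleep_current_ma + radio_ma
    return capacity_mah / avg_ma

for q in (0, 1, 2):
    print(f"QoS {q}: {battery_life_hours(q) / 24:.0f} days")
```

With the default values above, QoS 2 roughly triples the average current of QoS 0, which is why QoS selection dominates multi-year deployments.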
1. Client connects with client_id="sensor_123" and clean_session=false
2. Client subscribes to commands/sensor_123 (broker remembers subscription)
3. Client disconnects (sleep mode)
4. Server publishes command to commands/sensor_123 with QoS 1
5. Broker queues message for offline sensor_123
6. Client reconnects with same client_id, clean_session=false
7. Broker delivers queued message immediately
Requirements:
Fixed client_id: Must use same ID across sessions (broker links queue to ID)
QoS >= 1: Only QoS 1/2 messages queued (QoS 0 discarded if client offline)
Subscription persistence: Subscriptions survive disconnection (no need to re-SUBSCRIBE)
Trade-offs:
Broker stores queued messages (uses memory/disk - typically 100KB-10MB per client)
Potential message flood on reconnection (100s of queued messages delivered rapidly)
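The queuing rules above can be traced with a toy in-memory broker model (purely illustrative; `ToyBroker` is not a real API, and real brokers implement these rules per the MQTT specification):

```python
class ToyBroker:
    """Minimal model of MQTT persistent-session queuing (illustration only)."""
    def __init__(self):
        self.sessions = {}  # client_id -> {"subs", "queue", "online"}

    def connect(self, client_id, clean_session=False):
        if clean_session or client_id not in self.sessions:
            self.sessions[client_id] = {"subs": set(), "queue": [], "online": True}
        else:
            self.sessions[client_id]["online"] = True
        queued = self.sessions[client_id]["queue"]  # delivered on reconnect
        self.sessions[client_id]["queue"] = []
        return queued

    def subscribe(self, client_id, topic):
        self.sessions[client_id]["subs"].add(topic)

    def disconnect(self, client_id):
        self.sessions[client_id]["online"] = False

    def publish(self, topic, payload, qos=0):
        for sess in self.sessions.values():
            if topic in sess["subs"] and not sess["online"] and qos >= 1:
                sess["queue"].append((topic, payload))  # queue for offline client
            # QoS 0 to an offline client is silently dropped
            # (online delivery omitted from this model)

broker = ToyBroker()
broker.connect("sensor_123", clean_session=False)
broker.subscribe("sensor_123", "commands/sensor_123")
broker.disconnect("sensor_123")                       # sleep mode
broker.publish("commands/sensor_123", "set_interval=30", qos=1)
print(broker.connect("sensor_123"))  # -> [('commands/sensor_123', 'set_interval=30')]
```

Reconnecting with clean_session=True instead would create a fresh session and the queued command would be lost, which is exactly the failure mode the fixed-client-ID requirement prevents.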
35.5.2 Clean Sessions (Clean Session = True)
Clean Session=1: Discards session state on disconnect - suitable for high-frequency sensors that don’t need commands (temperature logger).
Best practices:
Use persistent sessions for devices receiving commands (actuators, configuration)
Use clean sessions for pure publishers (sensors)
Set message expiry (MQTT 5.0) to prevent infinite queuing
Monitor broker memory usage
35.5.3 Persistent Session Queue Calculator
Estimate the broker memory required to queue messages for offline devices with persistent sessions. This is critical for sizing your MQTT broker when devices sleep or experience network outages.
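Underneath, the sizing arithmetic is a single multiplication (a sketch; the 32-byte per-message broker overhead is an assumed figure):

```python
def queue_memory_bytes(clients, msgs_per_min, offline_minutes,
                       payload_bytes, overhead_bytes=32):
    """Broker memory needed to queue QoS >= 1 messages for offline clients."""
    per_client = msgs_per_min * offline_minutes * (payload_bytes + overhead_bytes)
    return clients * per_client

# 500 actuators offline for 1 hour, 1 msg/min, 48-byte payloads:
print(queue_memory_bytes(500, 1, 60, 48) / 1e6)  # -> 2.4 (MB)
```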
Best practice: Use retained for “state” topics (current temp, light status), don’t use for “events” (button pressed, motion detected - these are temporal, not states).
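A toy model of the broker's retained-message store (illustrative only, not a real broker API) makes the state-vs-event distinction concrete: state topics benefit from a stored last value, and publishing an empty retained payload clears it:

```python
class RetainedStore:
    """Toy model of MQTT retained-message handling (illustration only)."""
    def __init__(self):
        self.retained = {}  # topic -> last retained payload

    def publish(self, topic, payload, retain=False):
        if retain:
            if payload == b"":  # empty retained payload clears the stored value
                self.retained.pop(topic, None)
            else:
                self.retained[topic] = payload
        # non-retained publishes are forwarded but never stored (not modeled)

    def subscribe(self, topic):
        # A new subscriber immediately receives the retained value, if any.
        return self.retained.get(topic)

store = RetainedStore()
store.publish("home/livingroom/temp", b"21.5", retain=True)  # state: retain
store.publish("home/door/event", b"pressed")                 # event: don't retain
print(store.subscribe("home/livingroom/temp"))  # -> b'21.5'
print(store.subscribe("home/door/event"))       # -> None
```

A retained "button pressed" event would mislead every late subscriber into reacting to a stale press, which is why events stay unretained.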
35.7 Last Will and Testament (LWT)
Last Will and Testament (LWT) enables graceful failure notification.
Setup: During CONNECT, the client registers a will message (topic, payload, QoS, and retain flag) that the broker holds for the lifetime of the connection.
Normal operation: Client publishes data, sends "online" to its status topic, and ends with a graceful DISCONNECT -> LWT not sent.
Unexpected disconnect: Power failure or network timeout -> broker never receives DISCONNECT -> broker publishes the LWT "offline" to the status topic -> subscribers are notified of the failure.
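Both paths can be traced with a toy broker model (illustrative only; in a real client the will is registered in the CONNECT packet, e.g. via paho-mqtt's will_set()):

```python
class LWTBroker:
    """Toy model of Last Will and Testament dispatch (illustration only)."""
    def __init__(self):
        self.wills = {}       # client_id -> (will topic, will payload)
        self.published = []   # messages the broker sent to subscribers

    def connect(self, client_id, will_topic, will_payload):
        self.wills[client_id] = (will_topic, will_payload)

    def disconnect(self, client_id, graceful):
        topic, payload = self.wills.pop(client_id)
        if not graceful:  # no DISCONNECT packet seen: publish the will
            self.published.append((topic, payload))

broker = LWTBroker()
broker.connect("sensor_1", "device/sensor_1/status", "offline")
broker.disconnect("sensor_1", graceful=True)    # clean DISCONNECT: no LWT
broker.connect("sensor_2", "device/sensor_2/status", "offline")
broker.disconnect("sensor_2", graceful=False)   # power failure: LWT fires
print(broker.published)  # -> [('device/sensor_2/status', 'offline')]
```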
Use cases:
Device availability monitoring: Dashboard shows which devices offline
Automated failover: Backup sensor activates when primary goes offline
Alert generation: Email/SMS when critical device loses connection
Implementation pattern:
```python
while True:
    client.publish("device/status", "online", retain=True)  # heartbeat
    time.sleep(60)  # LWT takes effect if this loop stops
```
LWT + retained: Combine for persistent status - LWT sets “offline” as retained message, ensuring new subscribers see current state. Clear retained on reconnection:
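A sketch of that reconnect handler (paho-1.x-style callback signature; the topic name and the FakeClient used to exercise it offline are illustrative):

```python
def on_connect(client, userdata, flags, rc):
    # Overwrite the retained LWT "offline" with a retained "online" so any
    # subscriber joining later sees the device's current status.
    client.publish("device/sensor_1/status", "online", qos=1, retain=True)

class FakeClient:  # stand-in so the handler can be exercised without a broker
    def __init__(self):
        self.calls = []
    def publish(self, topic, payload, qos=0, retain=False):
        self.calls.append((topic, payload, qos, retain))

fake = FakeClient()
on_connect(fake, None, {}, 0)
print(fake.calls)  # -> [('device/sensor_1/status', 'online', 1, True)]
```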
Client sends PINGREQ if no messages sent in keepalive period
Broker responds PINGRESP
If broker doesn’t receive ANY message (publish or ping) within 1.5x keepalive: disconnects client
Example problem: Keepalive=60s, client publishing every 10s. The network drops for 2s during active publishing. The client doesn't realize it's disconnected until the keepalive timeout (up to 60s) and keeps buffering messages locally. By the time the timeout hits, 6 messages are queued (60s / 10s per message) - a sudden burst on reconnection.
Optimal keepalive: Set to 2-3x the message interval. Publishing every 10s -> keepalive=30s lets the client notice a dead connection within roughly 30s, and the broker within 1.5x that (45s).
Trade-offs:
Keepalive     Detection   Power                     Broker Load
───────────────────────────────────────────────────────────────
Short (10s)   Quick       Higher (periodic pings)   More
Long (300s)   Delayed     Lower                     Less
AWS IoT recommendation: 30-1200 seconds depending on battery constraints. Mobile networks: 60-300s (cellular keep-alive).
Production: Implement exponential backoff reconnection: First reconnect immediately, then 1s, 2s, 4s, 8s, max 60s. Prevents reconnection storms when broker restarts.
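That backoff schedule can be sketched as a generator (the 1s base and 60s cap are the values above; tune per deployment):

```python
import itertools

def backoff_delays(base=1, cap=60):
    """Yield reconnect delays: immediate, then 1, 2, 4, ... capped at `cap`."""
    yield 0                   # first attempt: reconnect immediately
    delay = base
    while True:
        yield min(delay, cap)
        delay *= 2            # exponential growth until the cap

print(list(itertools.islice(backoff_delays(), 9)))
# -> [0, 1, 2, 4, 8, 16, 32, 60, 60]
```

Adding random jitter to each delay further desynchronizes fleets, so thousands of clients don't all retry at the same instant after a broker restart.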
35.8.1 Keepalive Interval Optimizer
Find the optimal keepalive interval for your IoT deployment. Balance disconnection detection speed against power consumption and broker load. The MQTT specification triggers disconnect after 1.5x the keepalive interval with no activity.
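The optimizer's core trade-off fits in two lines (a sketch assuming an otherwise-idle client, so every keepalive period costs one PINGREQ/PINGRESP exchange):

```python
def keepalive_tradeoff(keepalive_s):
    """Worst-case detection latency vs. daily ping count for an idle client."""
    detection_s = 1.5 * keepalive_s      # broker disconnect threshold (MQTT spec)
    pings_per_day = 86400 / keepalive_s  # one PINGREQ per idle keepalive period
    return detection_s, pings_per_day

for ka in (10, 60, 300):
    det, pings = keepalive_tradeoff(ka)
    print(f"keepalive={ka:>3}s: detect <= {det:.0f}s, {pings:.0f} pings/day")
```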
QoS 2 is the certified-mail option: full tracking, signature required. Expensive but guaranteed.
When to use each:
QoS 0: Temperature every 5 seconds - missing one is fine
QoS 1: Door sensor alerts - must be delivered, duplicates OK
QoS 2: Payment confirmations - no duplicates, no losses
The most common mistake: Using QoS 2 for everything “just to be safe” - this wastes 4x the battery and bandwidth!
35.9 Worked Example: MQTT Broker Sizing for a Smart Building
Scenario: A 20-floor commercial building deploys 2,000 IoT devices (HVAC sensors, occupancy detectors, light controllers, door locks) publishing to a central MQTT broker. Calculate the broker’s message throughput, memory requirements, and determine the correct QoS mix.
Device breakdown and QoS assignment:
Device Type Count Msg/min QoS Payload Rationale
──────────────────────────────────────────────────────────────────
Temperature sensors 800 1 QoS 0 32 B Replaceable, high frequency
Occupancy sensors 400 2 QoS 0 16 B Replaceable, high frequency
Light controllers 300 0.2 QoS 1 48 B Commands must arrive
HVAC actuators 200 0.5 QoS 1 64 B Control commands critical
Door locks 100 0.1 QoS 2 24 B Security-critical, no dupes
Fire/smoke detectors 150 0.05 QoS 1 32 B Critical alerts
Energy meters 50 1 QoS 1 128 B Billing data, must deliver
Message throughput calculation:
Inbound (publish):
Temperature: 800 x 1 = 800 msg/min
Occupancy: 400 x 2 = 800 msg/min
Light: 300 x 0.2 = 60 msg/min
HVAC: 200 x 0.5 = 100 msg/min
Locks: 100 x 0.1 = 10 msg/min
Fire: 150 x 0.05 = 7.5 msg/min
Energy: 50 x 1 = 50 msg/min
─────────────────────────────────────────
Total inbound: 1,827.5 msg/min = 30.5 msg/sec
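The inbound total can be cross-checked mechanically from the device table above:

```python
fleet = {                     # device type: (count, messages per minute)
    "temperature": (800, 1),
    "occupancy":   (400, 2),
    "light":       (300, 0.2),
    "hvac":        (200, 0.5),
    "locks":       (100, 0.1),
    "fire":        (150, 0.05),
    "energy":      (50, 1),
}
per_min = sum(count * rate for count, rate in fleet.values())
print(per_min, round(per_min / 60, 1))  # -> 1827.5 30.5
```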
Outbound (deliver to subscribers):
Assume average 3 subscribers per topic (dashboard, logger, automation):
Total outbound: 30.5 x 3 = 91.5 msg/sec
QoS overhead (additional protocol messages; note the shares are weighted by message volume, not device count: QoS 0 devices are 60% of the fleet, but their high publish rates make them 87.5% of the traffic):
QoS 0: 0 extra messages (1,600 of 1,827.5 msg/min = 87.5% of traffic)
QoS 1: 1 PUBACK per msg (217.5 msg/min = 12% of traffic): 91.5 x 0.12 = 11 msg/sec
QoS 2: 3 extra msgs per msg (10 msg/min = 0.5% of traffic): 91.5 x 0.005 x 3 = 1.4 msg/sec
Total broker throughput: 91.5 + 11 + 1.4 = ~104 msg/sec
Broker memory requirements:
Per-connection state:
TCP buffer (send+receive): 8 KB
SSL/TLS context: 4 KB (if encrypted)
Session state: 2 KB
Subscription tree: 0.5 KB (avg 5 subscriptions)
─────────────────────────────────────────
Per connection: ~14.5 KB
Total connections: 2,000 devices + 50 subscribers (dashboards, loggers)
Connection memory: 2,050 x 14.5 KB = 29 MB
Message queue (persistent sessions for 500 actuators/locks):
Worst case: device offline 1 hour, 60 QoS 1 messages queued
Per device: 60 x (48 bytes + 32 bytes overhead) = 4.8 KB
Total queue: 500 x 4.8 KB = 2.4 MB
Retained messages (one per status topic):
2,000 devices x 128 bytes avg = 256 KB
Total broker RAM: 29 + 2.4 + 0.25 = ~32 MB minimum
Recommended: 128 MB (4x headroom for peak, subscriptions, routing table)
Bandwidth calculation:
Average payload: message-volume weighted average = 54,560 B/min / 1,827.5 msg/min = ~30 bytes
MQTT overhead per message: ~12 bytes (fixed header + topic)
TCP/IP overhead: 40 bytes
TLS overhead: 29 bytes (record header + MAC)
Total per message: 30 + 12 + 40 + 29 = 111 bytes
Inbound bandwidth: 30.5 msg/sec x 111 bytes = 3.4 KB/s
Outbound bandwidth: 104 msg/sec x 111 bytes = 11.5 KB/s (conservatively pricing ACK packets at full message size)
Total: ~15 KB/s = 120 kbps
A basic 1 Mbps Ethernet link handles this with 88% headroom.
Python: MQTT QoS monitoring dashboard snippet:

```python
import paho.mqtt.client as mqtt
from collections import defaultdict
import time

class MQTTQoSMonitor:
    """Monitor QoS delivery rates for a smart building MQTT broker."""

    def __init__(self, broker_host, broker_port=1883):
        self.stats = defaultdict(lambda: {"sent": 0, "acked": 0, "lost": 0})
        self.client = mqtt.Client(client_id="qos-monitor", protocol=mqtt.MQTTv5)
        self.client.on_message = self._on_message
        self.client.on_publish = self._on_publish
        self.client.connect(broker_host, broker_port)

    def _on_publish(self, client, userdata, mid):
        self.stats["publish"]["acked"] += 1

    def _on_message(self, client, userdata, msg):
        qos = msg.qos
        topic_type = msg.topic.split("/")[1]  # e.g., building/temp/floor3
        self.stats[f"qos{qos}_{topic_type}"]["sent"] += 1

    def report(self):
        """Print delivery statistics per QoS level."""
        # What to observe: QoS 0 messages may show gaps during
        # network congestion. QoS 1/2 should show 100% delivery.
        for key, val in sorted(self.stats.items()):
            rate = val["acked"] / max(val["sent"], 1) * 100
            print(f"  {key}: {val['sent']} sent, {val['acked']} acked ({rate:.1f}%)")
```

Real-World Reference: Microsoft’s Azure IoT Hub handles exactly this workload pattern. Their recommended broker tier for 2,000 devices at ~104 msg/sec is S1 Standard (400,000 messages/day included, $25/month). The Eclipse Mosquitto open-source broker handles 50,000 msg/sec on a single 4-core VM, making this workload trivial for self-hosted deployments.
35.9.1 MQTT Broker Throughput Calculator
Size your MQTT broker by entering your device fleet composition. This calculator models inbound publish rates, outbound fan-out to subscribers, and QoS protocol overhead to estimate total broker throughput and bandwidth.
This chapter covered MQTT’s reliability mechanisms:
QoS Levels: QoS 0 for replaceable data (1 message, may lose), QoS 1 for important events (2 messages, may duplicate), QoS 2 for critical commands (4 messages, exactly-once)
Energy Trade-offs: QoS 2 uses 4x the messages of QoS 0 - a significant battery impact on constrained devices
Persistent Sessions: Clean Session=0 with fixed client ID enables offline message queuing for devices receiving commands
Retained Messages: Store last-known value for immediate delivery to new subscribers - essential for device status
Last Will and Testament: Automatic “offline” notification when a client disconnects unexpectedly - critical for monitoring
Keepalive: Balance disconnection detection speed vs power consumption (typically 30-300 seconds)