MQTT reliability combines three mechanisms: QoS levels (0/1/2) controlling per-message delivery guarantees, persistent sessions queuing messages for offline subscribers, and retained messages storing the last value on each topic for immediate delivery to new subscribers. The Last Will and Testament (LWT) enables automatic failure notification when a client disconnects unexpectedly.
35.1 Learning Objectives
By the end of this chapter, you will be able to:
Select Appropriate QoS Levels: Choose QoS 0, 1, or 2 based on data criticality and battery constraints
Explain Delivery Guarantees: Describe the message flow and acknowledgment packets for each QoS level and justify why each step is necessary
Configure Persistent Sessions: Apply clean_session=false with fixed client IDs to enable offline message queuing for actuators and command receivers
Implement Retained Messages: Demonstrate how to store last-known values for immediate delivery to new subscribers and distinguish state topics from event topics
Design LWT Monitoring: Construct Last Will and Testament configurations for graceful failure notification and combine with retained messages for persistent device status
Calculate Energy Trade-offs: Compare battery consumption across QoS levels and assess the impact of keepalive interval on detection latency versus power draw
Key Concepts
QoS 0 (At Most Once): Fire-and-forget delivery with no acknowledgment — lowest overhead, message loss possible
QoS 1 (At Least Once): ACK-based delivery ensuring message arrives at least once — duplicate possible if PUBACK lost
Understanding Quality of Service levels is critical for reliability vs performance trade-offs:
QoS Selection Guide
QoS 0: Use for high-frequency, replaceable telemetry such as temperature samples and occupancy counts.
QoS 1: Use for alerts and commands that must arrive, while accepting that a retry can create a duplicate.
QoS 2: Use only for rare, high-consequence actions where both loss and duplication are unacceptable.
Rule of thumb: Start at QoS 0, move to QoS 1 when delivery matters, and reserve QoS 2 for exceptional cases because it costs the most bandwidth and battery.
35.3.1 QoS Level Summary
QoS 0 - At most once: 1 message, fire-and-forget delivery, message loss possible. Best for temperature every 5 seconds.
QoS 1 - At least once: 2 messages, confirmed delivery, duplicate possible. Best for door alerts and motion sensors.
QoS 2 - Exactly once: 4 messages, no loss and no duplicates. Best for financial or medical commands.
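The summary above can be sketched as a small lookup plus a selection helper. This is a minimal illustration, not part of any MQTT library: the names QOS_FLOWS, choose_qos, and packets_per_message are hypothetical, while the packet names are the standard MQTT control packets for each level.

```python
# Control packets exchanged between sender and receiver for one message.
QOS_FLOWS = {
    0: ["PUBLISH"],                                  # fire-and-forget
    1: ["PUBLISH", "PUBACK"],                        # acknowledged, may repeat
    2: ["PUBLISH", "PUBREC", "PUBREL", "PUBCOMP"],   # four-step handshake
}


def choose_qos(loss_ok: bool, duplicates_ok: bool) -> int:
    """Pick the cheapest QoS level that meets the delivery requirements."""
    if loss_ok:
        return 0  # QoS 0 never duplicates, so duplicates_ok is irrelevant here
    return 1 if duplicates_ok else 2


def packets_per_message(qos: int) -> int:
    """Number of control packets needed to deliver one message."""
    return len(QOS_FLOWS[qos])


for label, loss_ok, dup_ok in [
    ("temperature sample", True, True),
    ("door alert", False, True),
    ("payment command", False, False),
]:
    qos = choose_qos(loss_ok, dup_ok)
    print(f"{label}: QoS {qos}, {packets_per_message(qos)} packet(s)")
```

The helper starts from the cheapest level and only escalates when a requirement forces it, which mirrors the rule of thumb in the selection guide.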
35.3.2 Energy and Performance Trade-offs
Battery impact example: Sensor publishes 1 message/minute for 1 year.
QoS level | Messages/year | Energy (at 20 mA x 50 ms/msg)
QoS 0 | 525,600 | 146 mAh
QoS 1 | 1,051,200 (2x) | 292 mAh
QoS 2 | 2,102,400 (4x) | 584 mAh
Key Insight: Use QoS 0 for high-frequency replaceable data, QoS 1 selectively for critical messages only.
Putting Numbers to It
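For a sensor publishing every minute for a year at 20 mA for 50 ms per packet: each packet costs 20 mA x 0.05 s = 1 mA-s, or about 0.000278 mAh. QoS 0 sends 525,600 packets (146 mAh), QoS 1 sends 1,051,200 (292 mAh), and QoS 2 sends 2,102,400 (584 mAh).
For a 2,000 mAh battery, QoS 2 consumes 29% of capacity per year on MQTT transmissions alone, versus 7% for QoS 0. That gap is a critical factor in 5+ year sensor deployments.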
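The battery figures above can be reproduced with a few lines of arithmetic. The sketch below assumes the same simplification as the text: every control packet costs one 50 ms radio burst at 20 mA, and sleep current is ignored. The function name annual_energy_mah is hypothetical.

```python
def annual_energy_mah(msgs_per_min, packets_per_msg, current_ma=20.0, tx_ms=50.0):
    """Charge drawn per year by radio bursts, in mAh."""
    minutes_per_year = 365 * 24 * 60          # 525,600
    packets = msgs_per_min * minutes_per_year * packets_per_msg
    # mA x seconds gives mA-s; divide by 3600 to convert to mAh.
    return packets * current_ma * (tx_ms / 1000.0) / 3600.0


BATTERY_MAH = 2000.0
for qos, packets in ((0, 1), (1, 2), (2, 4)):
    mah = annual_energy_mah(1, packets)
    share = mah / BATTERY_MAH
    print(f"QoS {qos}: {mah:.0f} mAh/year ({share:.0%} of a 2,000 mAh battery)")
```

Changing msgs_per_min, current_ma, or tx_ms shows how quickly the gap between levels grows for chattier devices or slower radios.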
35.3.3 QoS Battery Life Calculator
Estimate how QoS level selection affects battery life for your IoT device. Adjust battery capacity, message rate, and radio characteristics to compare all three QoS levels side by side.
viewof qosBatCap = Inputs.range([100, 10000], {value: 2000, step: 100, label: "Battery capacity (mAh)"})
viewof qosMsgPerMin = Inputs.range([0.1, 60], {value: 1, step: 0.1, label: "Messages per minute"})
viewof qosTxCurrent = Inputs.range([5, 200], {value: 20, step: 1, label: "Transmit current (mA)"})
viewof qosTxDuration = Inputs.range([5, 200], {value: 50, step: 5, label: "Tx time per packet (ms)"})
viewof qosSleepCurrent = Inputs.range([0.001, 1.0], {value: 0.01, step: 0.001, label: "Sleep current (mA)"})
Client connects with client_id="sensor_123" and clean_session=false
Client subscribes to commands/sensor_123 (broker remembers subscription)
Client disconnects (sleep mode)
Server publishes command to commands/sensor_123 with QoS 1
Broker queues message for offline sensor_123
Client reconnects with same client_id, clean_session=false
Broker delivers queued message immediately
Requirements:
Fixed client_id: Must use same ID across sessions (broker links queue to ID)
QoS >= 1: Only QoS 1/2 messages queued (QoS 0 discarded if client offline)
Subscription persistence: Subscriptions survive disconnection (no need to re-SUBSCRIBE)
Trade-offs:
Broker stores queued messages (uses memory/disk - typically 100KB-10MB per client)
Potential message flood on reconnection (100s of queued messages delivered rapidly)
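The sequence and requirements above can be traced with a toy in-memory model. This is only a sketch of the session rules, not a real broker and not any library's API: MiniBroker and its methods are hypothetical, and topic matching is exact-string only.

```python
from collections import defaultdict, deque


class MiniBroker:
    """Toy model of persistent vs clean sessions (no networking)."""

    def __init__(self):
        self.subscriptions = defaultdict(set)   # client_id -> topics
        self.queues = defaultdict(deque)        # client_id -> queued payloads
        self.online = set()
        self.clean = {}                         # client_id -> clean_session flag

    def _drop_session(self, client_id):
        self.subscriptions.pop(client_id, None)
        self.queues.pop(client_id, None)

    def connect(self, client_id, clean_session):
        """Return the messages that were queued while the client was away."""
        if clean_session:
            self._drop_session(client_id)
        self.clean[client_id] = clean_session
        self.online.add(client_id)
        queued = list(self.queues[client_id])
        self.queues[client_id].clear()
        return queued

    def subscribe(self, client_id, topic):
        self.subscriptions[client_id].add(topic)

    def disconnect(self, client_id):
        self.online.discard(client_id)
        if self.clean.get(client_id):
            self._drop_session(client_id)       # clean sessions keep nothing

    def publish(self, topic, payload, qos):
        """Deliver to online subscribers; queue QoS >= 1 for offline ones."""
        delivered = []
        for client_id, topics in self.subscriptions.items():
            if topic not in topics:
                continue
            if client_id in self.online:
                delivered.append(client_id)
            elif qos >= 1:
                self.queues[client_id].append(payload)
        return delivered


broker = MiniBroker()
broker.connect("sensor_123", clean_session=False)
broker.subscribe("sensor_123", "commands/sensor_123")
broker.disconnect("sensor_123")                          # device sleeps
broker.publish("commands/sensor_123", "reboot", qos=1)   # queued
broker.publish("commands/sensor_123", "ping", qos=0)     # discarded
print(broker.connect("sensor_123", clean_session=False))
```

On reconnect only the QoS 1 command comes back, and the subscription is still in place without a second SUBSCRIBE.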
35.5.2 Clean Sessions (Clean Session = True)
Clean Session=1: Discards session state on disconnect - suitable for high-frequency sensors that don’t need commands (temperature logger).
Best practices:
Use persistent sessions for devices receiving commands (actuators, configuration)
Use clean sessions for pure publishers (sensors)
Set message expiry (MQTT 5.0) to prevent infinite queuing
Monitor broker memory usage
35.5.3 Persistent Session Queue Calculator
Estimate the broker memory required to queue messages for offline devices with persistent sessions. This is critical for sizing your MQTT broker when devices sleep or experience network outages.
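A rough version of this estimate is a single multiplication. The sketch below is an assumption-laden model: queue_memory_bytes and its parameters are hypothetical names, per-message broker bookkeeping is folded into bytes_per_msg, and the optional cap stands in for whatever per-client queue limit a given broker enforces.

```python
def queue_memory_bytes(offline_clients, msgs_per_hour, hours_offline,
                       bytes_per_msg, max_queued_per_client=None):
    """Estimate bytes held in persistent-session queues."""
    queued = msgs_per_hour * hours_offline
    if max_queued_per_client is not None:
        # Brokers commonly cap the queue; messages beyond the cap are dropped.
        queued = min(queued, max_queued_per_client)
    return int(offline_clients * queued * bytes_per_msg)


# 500 offline devices, 60 QoS 1 messages/hour, 1 hour, 80 bytes each (assumed size)
uncapped = queue_memory_bytes(500, 60, 1, 80)
capped = queue_memory_bytes(500, 60, 24, 80, max_queued_per_client=100)
print(f"1 h outage, no cap: {uncapped / 1e6:.1f} MB")
print(f"24 h outage, cap of 100 messages: {capped / 1e6:.1f} MB")
```

The capped case shows why a queue limit or message expiry matters: without one, memory grows linearly with outage length.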
When subscriber connects -> immediately receives retained “24.5” (even if sensor published hours ago)
Use cases:
Device status: device/status retained “online” - dashboard always knows current state
Configuration: device/config retained settings - new dashboards get current config without querying device
Slow-changing data: Room temperature, door state - subscribers need current value immediately, not historical
Caution:
Only ONE message retained per topic (new publish replaces old)
Clearing retained message: publish an empty payload with the retain flag set.
client.publish("topic", "", retain=True)
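Best practice: Use retained messages for “state” topics (current temperature, light status). Do not use them for “events” (button pressed, motion detected), because events describe a moment in time rather than a current state, and a late subscriber would mistake a stale event for a fresh one.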
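The retained-message rules above (one value per topic, replacement on each publish, empty payload clears) fit in a few lines. This is a toy model with hypothetical names (RetainedStore, the example topics), not broker code.

```python
class RetainedStore:
    """Toy model of a broker's retained-message table."""

    def __init__(self):
        self._retained = {}   # topic -> last retained payload

    def publish(self, topic, payload, retain=False):
        if not retain:
            return                            # events pass through, nothing stored
        if payload == "":
            self._retained.pop(topic, None)   # empty retained payload clears
        else:
            self._retained[topic] = payload   # new value replaces the old one

    def subscribe(self, topic):
        """Return what a new subscriber receives immediately, or None."""
        return self._retained.get(topic)


store = RetainedStore()
store.publish("sensors/room1/temperature", "24.5", retain=True)   # state topic
store.publish("sensors/room1/button", "pressed")                  # event topic
print(store.subscribe("sensors/room1/temperature"))
print(store.subscribe("sensors/room1/button"))
store.publish("sensors/room1/temperature", "", retain=True)       # clear
print(store.subscribe("sensors/room1/temperature"))
```

A subscriber joining late sees the last temperature but no stale button press, which is the state-versus-event distinction in practice.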
35.7 Last Will and Testament (LWT)
Last Will and Testament (LWT) enables graceful failure notification.
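Setup: During CONNECT, the client registers a will message (topic, payload, QoS, and retain flag) with the broker. In paho-mqtt this is done before connecting, for example client.will_set("device/status", "offline", qos=1, retain=True).
Normal operation: The client publishes data, sends "online" to the status topic, and disconnects gracefully with DISCONNECT -> LWT not sent.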
Unexpected disconnect: Power failure, network timeout -> broker doesn’t receive DISCONNECT -> broker publishes LWT "offline" to status topic -> subscribers notified of failure.
Use cases:
Device availability monitoring: Dashboard shows which devices are offline
Automated failover: Backup sensor activates when primary goes offline
Alert generation: Email/SMS when critical device loses connection
Implementation pattern:
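import time

# Assumes `client` is a connected paho-mqtt client whose will was registered
# with client.will_set("device/status", "offline", retain=True) before connect().
while True:
    client.publish("device/status", "online", retain=True)  # Heartbeat
    time.sleep(60)
# LWT takes effect if this loop stops without a clean DISCONNECT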
LWT + retained: Combine for persistent status. The LWT sets “offline” as a retained message, ensuring new subscribers see the current state. On reconnection, overwrite the retained “offline” by publishing “online” with the retain flag set.
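The interaction between the will, a clean DISCONNECT, and the retained status topic can be traced with a toy model. Everything here is a hypothetical sketch (WillBroker and its methods are not a real API), and it treats every publish as retained for brevity.

```python
class WillBroker:
    """Toy model of LWT handling combined with a retained status topic."""

    def __init__(self):
        self.retained = {}   # topic -> retained payload
        self.wills = {}      # client_id -> (topic, payload)

    def connect(self, client_id, will_topic, will_payload):
        self.wills[client_id] = (will_topic, will_payload)

    def publish(self, topic, payload):
        self.retained[topic] = payload        # simplified: always retained

    def disconnect(self, client_id):
        """Clean DISCONNECT: the will is discarded without being sent."""
        self.wills.pop(client_id, None)

    def connection_lost(self, client_id):
        """Keepalive timeout or socket error: the broker publishes the will."""
        will = self.wills.pop(client_id, None)
        if will is not None:
            self.publish(*will)


broker = WillBroker()
broker.connect("pump_1", "device/status", "offline")
broker.publish("device/status", "online")
broker.connection_lost("pump_1")                       # power failure
print(broker.retained["device/status"])                # offline
broker.connect("pump_1", "device/status", "offline")
broker.publish("device/status", "online")              # overwrite on reconnect
print(broker.retained["device/status"])                # online
```

One consequence visible in the model: a clean disconnect leaves "online" retained, so a device that shuts down deliberately would need to publish its own "offline" before sending DISCONNECT.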
Client sends PINGREQ if no messages sent in keepalive period
Broker responds PINGRESP
If broker doesn’t receive ANY message (publish or ping) within 1.5x keepalive: disconnects client
Example problem: Keepalive=60s, publishing every 10s. The network drops for 2s during active publishing. The client does not notice the dead connection until the keepalive timeout (60s) and keeps buffering messages locally. By the time the timeout fires, 6 messages are queued (60s / 10s = 6), and they arrive as a sudden burst on reconnection.
Optimal keepalive: Set to 2-3x the message interval. Publishing every 10s -> keepalive=30s, so the client notices a dead link within about 30s and the broker drops a silent client after 45s (1.5x keepalive).
Trade-offs:
Short keepalive (10s): Quick detection, higher power draw because of periodic pings, more broker load.
Long keepalive (300s): Delayed detection, lower power draw, less broker load.
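The keepalive rules and trade-offs above reduce to two small calculations. The sketch below uses hypothetical helper names and assumes the 2-3x guideline and the 1.5x broker timeout stated in this section; the ping count is an upper bound for a client that sends nothing else.

```python
def keepalive_plan(publish_interval_s, multiplier=3):
    """Derive a keepalive from the publish interval (2-3x guideline)."""
    keepalive_s = publish_interval_s * multiplier
    return {
        "keepalive_s": keepalive_s,
        # The broker drops the client after 1.5x keepalive with no traffic.
        "broker_timeout_s": 1.5 * keepalive_s,
    }


def idle_pings_per_day(keepalive_s):
    """PINGREQs per day for a client that is otherwise completely idle."""
    return int(86400 // keepalive_s)


print(keepalive_plan(10))
for keepalive in (10, 30, 300):
    print(f"keepalive={keepalive}s -> up to {idle_pings_per_day(keepalive)} pings/day")
```

A client that publishes more often than its keepalive interval sends no pings at all, since any packet resets the timer.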
AWS IoT Core accepts keepalive values of 30-1200 seconds; pick within that range based on battery constraints. Mobile networks: 60-300s (cellular keep-alive).
Production: Implement exponential backoff reconnection: First reconnect immediately, then 1s, 2s, 4s, 8s, max 60s. Prevents reconnection storms when broker restarts.
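The reconnection schedule described above can be written as a generator. This is a minimal sketch with a hypothetical function name; it produces only the delays and leaves the actual reconnect call to the surrounding client code.

```python
from itertools import islice


def backoff_delays(base_s=1.0, cap_s=60.0):
    """Yield reconnect delays: immediate, then doubling up to a cap."""
    yield 0.0                      # first attempt happens right away
    delay = base_s
    while True:
        yield min(delay, cap_s)
        delay = min(delay * 2, cap_s)


print(list(islice(backoff_delays(), 9)))
# [0.0, 1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 60.0, 60.0]
```

A caller would sleep for each yielded delay before the next attempt and create a fresh generator after a successful connection so the schedule starts over.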
35.8.1 Keepalive Interval Optimizer
Find the optimal keepalive interval for your IoT deployment. Balance disconnection detection speed against power consumption and broker load. The MQTT specification triggers disconnect after 1.5x the keepalive interval with no activity.
Postcard -> QoS 0: Drop it in the mailbox and hope it arrives. No tracking.
Certified Mail -> QoS 1: The carrier confirms delivery, but the letter might get delivered twice.
Registered Mail -> QoS 2: Full tracking and signature required. Expensive but guaranteed.
When to use each:
QoS 0: Temperature every 5 seconds - missing one is fine
QoS 1: Door sensor alerts - must be delivered, duplicates OK
QoS 2: Payment confirmations - no duplicates, no losses
The most common mistake: Using QoS 2 for everything “just to be safe” - this wastes 4x the battery and bandwidth!
35.9 Worked Example: MQTT Broker Sizing for a Smart Building
Scenario: A 20-floor commercial building deploys 2,000 IoT devices (HVAC sensors, occupancy detectors, light controllers, door locks) publishing to a central MQTT broker. Calculate the broker’s message throughput, memory requirements, and determine the correct QoS mix.
Device breakdown and QoS assignment:
Temperature sensors: 800 devices, 1 msg/min, QoS 0, 32 B payload. Replaceable high-frequency data.
Energy meters: 50 devices, 1 msg/min, QoS 1, 128 B payload. Billing data must be delivered.
Message throughput calculation:
Inbound publishes: Temperature 800, occupancy 800, light 60, HVAC 100, locks 10, fire 7.5, and energy 50 msg/min. Total inbound is 1,827.5 msg/min, or 30.5 msg/sec.
Outbound delivery: Assume 3 subscribers per topic for dashboards, loggers, and automation. Total outbound traffic is 30.5 x 3 = 91.5 msg/sec.
QoS overhead: QoS 0 adds no extra protocol traffic, QoS 1 adds about 32 msg/sec of PUBACK traffic, and QoS 2 adds about 13.7 msg/sec of handshake traffic.
Total broker throughput: 91.5 + 32 + 13.7 = about 137 msg/sec.
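The throughput arithmetic above can be checked with a short script. The per-category rates, fan-out factor, and QoS overhead figures are copied from the text; the variable names are illustrative, and the result differs slightly from the text only because the text rounds 30.46 up to 30.5 before multiplying.

```python
# Inbound publish rates in messages per minute, as listed in the text.
INBOUND_PER_MIN = {
    "temperature": 800, "occupancy": 800, "light": 60, "hvac": 100,
    "locks": 10, "fire": 7.5, "energy": 50,
}
SUBSCRIBERS_PER_TOPIC = 3          # dashboards, loggers, automation
QOS1_ACK_PER_SEC = 32.0            # PUBACK traffic, taken from the text
QOS2_HANDSHAKE_PER_SEC = 13.7      # handshake traffic, taken from the text

inbound_per_min = sum(INBOUND_PER_MIN.values())
inbound_per_sec = inbound_per_min / 60
outbound_per_sec = inbound_per_sec * SUBSCRIBERS_PER_TOPIC
total_per_sec = outbound_per_sec + QOS1_ACK_PER_SEC + QOS2_HANDSHAKE_PER_SEC

print(f"Inbound:  {inbound_per_min} msg/min = {inbound_per_sec:.1f} msg/sec")
print(f"Outbound: {outbound_per_sec:.1f} msg/sec")
print(f"Total:    {total_per_sec:.0f} msg/sec")
```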
Broker memory requirements:
Per connection: TCP buffers use 8 KB, TLS context 4 KB, session state 2 KB, and the subscription tree about 0.5 KB. Each connection needs about 14.5 KB.
Connection memory: 2,000 devices plus 50 subscribers equals 2,050 connections. At 14.5 KB each, that is about 29 MB.
Queued messages: If 500 actuator or lock devices go offline for one hour and each queues 60 QoS 1 messages, the queue uses about 2.4 MB (30,000 messages at roughly 80 bytes each).
Retained messages: One retained status topic for 2,000 devices at 128 bytes each uses about 256 KB.
Total broker RAM: Roughly 32 MB minimum. A 128 MB broker leaves healthy headroom for peaks, subscriptions, and routing tables.
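The memory budget above adds up as follows. The sketch uses decimal units to match the rounded figures in the text, and the 80-byte queued-message size is an assumption inferred from the 2.4 MB total rather than a value stated in the scenario.

```python
# Per-connection memory in KB: TCP buffers, TLS context, session, subscriptions.
per_connection_kb = 8 + 4 + 2 + 0.5
connections = 2000 + 50                       # devices + subscribers

connection_mb = connections * per_connection_kb / 1000
queue_mb = 500 * 60 * 80 / 1_000_000          # 80 B/message is an assumption
retained_mb = 2000 * 128 / 1_000_000          # one 128 B status per device
total_mb = connection_mb + queue_mb + retained_mb

print(f"Connections: {connection_mb:.1f} MB")
print(f"Queues:      {queue_mb:.1f} MB")
print(f"Retained:    {retained_mb:.3f} MB")
print(f"Total:       {total_mb:.1f} MB")
```

Connection state dominates the total, so the connection count matters far more for sizing than the queue or retained-message tables in this scenario.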
Bandwidth calculation:
Per message size: 38-byte average payload + 12 bytes MQTT overhead + 40 bytes TCP/IP + 29 bytes TLS = 119 bytes per message.
Inbound bandwidth: 30.5 msg/sec x 119 bytes = 3.6 KB/s.
Outbound bandwidth: 137 msg/sec x 119 bytes = 16.3 KB/s.
Total bandwidth: About 20 KB/s, or 160 kbps. A basic 1 Mbps Ethernet link still has about 84% headroom.
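The bandwidth figures above follow from the per-message size and the message rates. The sketch below reuses the rounded rates from the text (30.5 and 137 msg/sec) and treats 1 Mbps as 1,000 kbps; the variable names are illustrative.

```python
# Bytes on the wire per message: payload + MQTT + TCP/IP + TLS overhead.
bytes_per_msg = 38 + 12 + 40 + 29

inbound_bps = 30.5 * bytes_per_msg            # bytes per second
outbound_bps = 137 * bytes_per_msg
total_kbps = (inbound_bps + outbound_bps) * 8 / 1000
headroom = 1 - total_kbps / 1000              # share of a 1 Mbps link left free

print(f"Per message: {bytes_per_msg} bytes")
print(f"Inbound:  {inbound_bps / 1000:.1f} KB/s")
print(f"Outbound: {outbound_bps / 1000:.1f} KB/s")
print(f"Total:    {total_kbps:.0f} kbps, headroom {headroom:.0%}")
```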
Python: MQTT QoS monitoring dashboard snippet:
import paho.mqtt.client as mqtt
from collections import defaultdict
import time


class MQTTQoSMonitor:
    """Monitor QoS delivery rates for a smart building MQTT broker."""

    def __init__(self, broker_host, broker_port=1883):
        self.stats = defaultdict(lambda: {"sent": 0, "acked": 0, "lost": 0})
        # Constructor as written for paho-mqtt 1.x; paho-mqtt 2.x also
        # expects a callback_api_version argument.
        self.client = mqtt.Client(client_id="qos-monitor", protocol=mqtt.MQTTv5)
        self.client.on_message = self._on_message
        self.client.on_publish = self._on_publish
        self.client.connect(broker_host, broker_port)

    def _on_publish(self, client, userdata, mid):
        self.stats["publish"]["acked"] += 1

    def _on_message(self, client, userdata, msg):
        qos = msg.qos
        parts = msg.topic.split("/")  # e.g., building/temp/floor3
        # Fall back to the first level for single-level topics.
        topic_type = parts[1] if len(parts) > 1 else parts[0]
        self.stats[f"qos{qos}_{topic_type}"]["sent"] += 1

    def report(self):
        """Print delivery statistics per QoS level."""
        # What to observe: QoS 0 messages may show gaps during
        # network congestion. QoS 1/2 should show 100% delivery.
        for key, val in sorted(self.stats.items()):
            rate = val["acked"] / max(val["sent"], 1) * 100
            print(f"  {key}: {val['sent']} sent, {val['acked']} acked ({rate:.1f}%)")
Real-World Reference: Managed services such as Microsoft’s Azure IoT Hub target this kind of workload. An S1 Standard unit includes 400,000 messages/day for $25/month, so the roughly 2.6 million inbound messages/day in this scenario (1,827.5 msg/min x 1,440 min) would need several S1 units or a higher tier. The Eclipse Mosquitto open-source broker is reported to handle 50,000 msg/sec on a single 4-core VM, making this workload trivial for self-hosted deployments.
35.9.1 MQTT Broker Throughput Calculator
Size your MQTT broker by entering your device fleet composition. This calculator models inbound publish rates, outbound fan-out to subscribers, and QoS protocol overhead to estimate total broker throughput and bandwidth.
This chapter covered MQTT’s reliability mechanisms:
QoS Levels: QoS 0 for replaceable data (1 message, may lose), QoS 1 for important events (2 messages, may duplicate), QoS 2 for critical commands (4 messages, exactly-once)
Energy Trade-offs: QoS 2 sends 4x as many packets as QoS 0, a significant battery cost on constrained devices
Persistent Sessions: Clean Session=0 with fixed client ID enables offline message queuing for devices receiving commands
Retained Messages: Store last-known value for immediate delivery to new subscribers - essential for device status
Last Will and Testament: Automatic “offline” notification when a client disconnects unexpectedly - critical for monitoring
Keepalive: Balance disconnection detection speed vs power consumption (typically 30-300 seconds)
35.13 What’s Next
MQTT Production Deployment: Clustering, security, and broker scaling. Continue into production-grade MQTT deployments with TLS and load balancing.
MQTT QoS Levels: Deep dive into QoS 0/1/2 packet flows. Study the exact byte-level handshake and packet identifiers behind each guarantee.
MQTT QoS Worked Examples: Real-world QoS scenario walkthroughs for smart home, industrial, and healthcare IoT case studies.
MQTT Architecture Patterns: Pub/sub topology and topic design. See how QoS and retained messages fit into hierarchical topic structures.