MQTT offers three QoS levels trading speed for delivery guarantees: QoS 0 (fire-and-forget, fastest but may lose messages), QoS 1 (at-least-once with acknowledgment, may duplicate), and QoS 2 (exactly-once via a 4-step handshake, slowest but no loss or duplication). Session management complements QoS by controlling whether the broker remembers subscriptions and queues messages during disconnections – persistent sessions are essential for battery-powered devices that sleep between transmissions.
24.1 Learning Objectives
By the end of this chapter series, you will be able to:
Distinguish the three MQTT QoS levels and compare their message flow handshakes (1-packet, 2-packet, and 4-packet)
Select and justify the optimal QoS level based on reliability requirements, battery constraints, and bandwidth trade-offs
Configure clean vs persistent sessions for different device types (stateless sensors vs command receivers)
Calculate queue depth and message expiry to prevent broker memory exhaustion during device disconnections
Diagnose and prevent common QoS pitfalls including random client IDs, QoS mismatch, and reconnection storms
Evaluate QoS trade-offs by applying bandwidth and battery-life calculations to real IoT deployment scenarios
Message Ordering: MQTT guarantees ordered delivery within a client session but not across sessions or publishers
24.2 MVU — Minimum Viable Understanding
QoS 0 sends once with no confirmation (may lose). QoS 1 retries until acknowledged (may duplicate). QoS 2 uses a 4-step handshake for exactly-once delivery. Clean sessions forget state on disconnect; persistent sessions buffer messages for offline devices. Choose the lowest QoS that meets your reliability needs to conserve bandwidth and battery.
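The handshakes summarized above can be written down as packet sequences. The sketch below is illustrative only: the packet names are those defined by the MQTT specification, while `QOS_FLOWS` and `packets_per_message` are hypothetical names introduced here for the example.

```python
# Control packets exchanged between sender and receiver for one message
# at each QoS level (packet names as defined by the MQTT specification).
QOS_FLOWS = {
    0: ["PUBLISH"],                                 # fire-and-forget
    1: ["PUBLISH", "PUBACK"],                       # at-least-once
    2: ["PUBLISH", "PUBREC", "PUBREL", "PUBCOMP"],  # exactly-once
}


def packets_per_message(qos: int) -> int:
    """Return how many control packets one message costs on a single hop."""
    if qos not in QOS_FLOWS:
        raise ValueError(f"invalid QoS level: {qos}")
    return len(QOS_FLOWS[qos])


if __name__ == "__main__":
    for level, flow in QOS_FLOWS.items():
        print(f"QoS {level}: {' -> '.join(flow)} ({packets_per_message(level)} packets)")
```

The counts cover one hop (publisher to broker, or broker to subscriber); an end-to-end delivery repeats the flow on each hop.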
Sensor Squad: Message Delivery Guarantees
Sammy the Sensor sends temperature readings to the cloud every few seconds. “I just shout my readings into the air — if one gets lost, another is coming soon!” says Sammy. That is QoS 0 — fire and forget.
Lila the Light Sensor detects when someone enters a room. “I need to make sure the cloud got my alert, so I keep sending it until I hear back ‘Got it!’” explains Lila. That is QoS 1 — at least once, with a confirmation.
Max the Motor Controller receives commands to toggle a smart door lock. “I only accept each command exactly once. Imagine if a toggle command arrived twice and unlocked the door again!” says Max. That is QoS 2: exactly once, with a special four-step handshake.
Bella the Battery Monitor adds: “And when I go to sleep to save energy, the cloud remembers I was there and saves messages for when I wake up. That is called a persistent session!”
For Beginners: Understanding QoS and Sessions
Think of MQTT QoS like different ways of sending a letter:
QoS 0 (Regular mail): You drop the letter in the mailbox and walk away. It will probably arrive, but you have no proof. Fast and cheap.
QoS 1 (Certified mail): The post office gets a signature on delivery and tells you it arrived. If the confirmation gets lost, you send the letter again — so the recipient might get two copies.
QoS 2 (Registered mail with tracking): A multi-step process ensures the letter arrives exactly once. The most reliable, but the slowest and most expensive.
Sessions are like your mailbox at the post office:
Clean session: Every time you visit, the post office has no record of you. You start fresh.
Persistent session: The post office remembers you, holds your mail while you are away, and hands it all over when you return.
In IoT, you pick the right combination of QoS level and session type based on how critical each message is and how constrained your device is in terms of battery and bandwidth.
24.3 Overview
This topic has been split into focused chapters for better learning. Each chapter covers a specific aspect of MQTT Quality of Service and session management, progressing from fundamentals to advanced real-world examples.
24.4 QoS and Session Decision Map
The following diagram shows how to decide which QoS level and session type to use based on your application requirements:
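A minimal sketch of such a decision rule, based only on the guidance stated in this chapter (lowest QoS that meets the reliability need, persistent sessions for devices that must receive messages sent while they were offline). It is not a transcription of the diagram, and the function and parameter names are hypothetical.

```python
def choose_qos(loss_tolerable: bool, duplicates_harmful: bool) -> int:
    """Pick the lowest QoS level that still meets the reliability need."""
    if loss_tolerable:
        return 0  # e.g. frequent telemetry where the next reading replaces a lost one
    if duplicates_harmful:
        return 2  # e.g. critical commands that must take effect exactly once
    return 1      # e.g. alerts that must arrive but can tolerate a duplicate


def choose_clean_session(receives_while_offline: bool) -> bool:
    """Return the clean-session flag; False requests a persistent session."""
    return not receives_while_offline


if __name__ == "__main__":
    print(choose_qos(loss_tolerable=True, duplicates_harmful=False))   # 0
    print(choose_qos(loss_tolerable=False, duplicates_harmful=False))  # 1
    print(choose_qos(loss_tolerable=False, duplicates_harmful=True))   # 2
    print(choose_clean_session(receives_while_offline=True))           # False
```

Real deployments usually weigh more factors (battery budget, link quality, broker limits), so treat the two booleans as a starting point rather than a complete policy.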
Worked Example: Queue Depth Planning for Fleet Tracking
Scenario: You’re deploying an MQTT system for 5,000 delivery trucks. Each truck publishes GPS location every 10 seconds with persistent sessions (cleanSession=false) so offline trucks receive commands when reconnecting.
Problem: Trucks drive through tunnels/remote areas with 15-30 minute connectivity gaps. How do you size broker queues to prevent message loss while avoiding memory exhaustion?
Step 1: Calculate Messages During Outage
Outage duration: 30 minutes (worst case)
Trucks affected: 5,000
Messages per truck: (30 min × 60 s/min) / 10 s = 180 messages/truck
Messages queued: 5,000 trucks × 180 messages = 900,000 messages
Memory needed: 900,000 × (120 + 40) bytes = 144 MB (assuming a 120-byte payload plus 40 bytes of broker overhead per message)
Putting Numbers to It: Broker Queue Memory Scaling
For \(N\) clients with persistent sessions, each offline for average duration \(T\) seconds, publishing at rate \(r\) messages/sec with message size \(S\) bytes:
Total queue depth: $$ Q_{\text{total}} = N \times r \times T $$
Memory consumption: $$ M_{\text{total}} = Q_{\text{total}} \times (S + O) $$
Where \(O\) = broker overhead per message (~40 bytes for metadata)
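The definitions above translate directly into a small calculator. This is a sketch under stated assumptions: the function names are hypothetical, the rate \(r\) is expressed as one message per `publish_interval_s` seconds, and partial intervals are rounded down.

```python
def queue_depth(clients: int, offline_s: int, publish_interval_s: int) -> int:
    """Total queued messages: N * r * T, with r = 1 / publish_interval_s."""
    return clients * (offline_s // publish_interval_s)


def queue_memory_bytes(depth: int, payload_bytes: int, overhead_bytes: int = 40) -> int:
    """Memory consumption: queue depth * (S + O)."""
    return depth * (payload_bytes + overhead_bytes)


if __name__ == "__main__":
    # Fleet example from Step 1: 5,000 trucks, 30-minute outage, one message per 10 s
    depth = queue_depth(clients=5_000, offline_s=30 * 60, publish_interval_s=10)
    memory = queue_memory_bytes(depth, payload_bytes=120)
    print(depth)         # 900000
    print(memory / 1e6)  # 144.0 (MB)
```

The default overhead of 40 bytes mirrors the approximate figure used in this chapter; actual per-message overhead depends on the broker.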
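import json

from paho.mqtt.packettypes import PacketTypes
from paho.mqtt.properties import Properties

# Attach an MQTT 5.0 Message Expiry Interval to the PUBLISH packet
props = Properties(PacketTypes.PUBLISH)
props.MessageExpiryInterval = 1800  # 30 minutes

# client, truck_id, lat, and lon are defined elsewhere in the application
client.publish(
    f"trucks/{truck_id}/gps",
    payload=json.dumps({"lat": lat, "lon": lon}),
    qos=1,
    properties=props,
)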
Pro: Stale GPS data auto-expires, so only fresh data is delivered after reconnection.
Con: Commands sent during an outage might expire before delivery.
Step 5: Implement Smart Reconnection
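import time

# userdata is assumed to be a dict set via client.user_data_set({"retry_count": 0})
def on_connect(client, userdata, flags, reason_code, properties):
    if reason_code == 0:
        userdata['retry_count'] = 0  # Reset backoff after a successful connect
        # On reconnect, publish location immediately
        client.publish("trucks/123/gps", get_current_location(), qos=1)
        # Then subscribe to commands
        client.subscribe("trucks/123/commands", qos=2)
    else:
        print(f"Connection failed: {reason_code}")
        # Exponential backoff: 1s, 2s, 4s, 8s, 16s, 32s, then capped at 60s
        backoff = min(2 ** userdata['retry_count'], 60)
        # Note: sleeping inside a callback blocks the client's network loop
        time.sleep(backoff)
        userdata['retry_count'] += 1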
Step 6: Monitor Broker Health
# Subscribe to $SYS topics for monitoring
client.subscribe("$SYS/broker/messages/stored")
client.subscribe("$SYS/broker/subscriptions/count")

def on_sys_message(client, userdata, msg):
    if "messages/stored" in msg.topic:
        stored = int(msg.payload.decode())
        if stored > 800000:  # 80% of 1M capacity
            alert("Broker queue at 80% - investigate!")

# Route $SYS messages to the handler above
client.message_callback_add("$SYS/#", on_sys_message)
Decision Summary:
| Configuration | Value | Rationale |
|---|---|---|
| max_queued_messages | 200 | Covers 30-min outage + buffer |
| message_expiry_interval | 1800 sec | GPS data stale after 30 min |
| Reconnection backoff | 1-60 sec | Prevents reconnection storms |
| Queue monitoring | Alert at 80% | Early warning for capacity issues |
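The max_queued_messages value of 200 can be sanity-checked with a toy per-client queue. This sketch assumes the queue rejects new messages once it is full; brokers differ in their drop policy, so check your broker's documentation. `BoundedQueue` and `simulate_outage` are hypothetical names.

```python
class BoundedQueue:
    """Toy per-client queue that rejects new messages once it is full."""

    def __init__(self, max_queued_messages: int) -> None:
        self.max_queued_messages = max_queued_messages
        self.messages = []
        self.dropped = 0

    def enqueue(self, message: str) -> bool:
        """Store the message if there is room; otherwise count it as dropped."""
        if len(self.messages) >= self.max_queued_messages:
            self.dropped += 1
            return False
        self.messages.append(message)
        return True


def simulate_outage(outage_s: int, interval_s: int, limit: int) -> BoundedQueue:
    """Queue one message per publish interval for the length of the outage."""
    queue = BoundedQueue(limit)
    for i in range(outage_s // interval_s):
        queue.enqueue(f"msg-{i}")
    return queue


if __name__ == "__main__":
    half_hour = simulate_outage(outage_s=1800, interval_s=10, limit=200)
    print(len(half_hour.messages), half_hour.dropped)  # 180 0
    full_hour = simulate_outage(outage_s=3600, interval_s=10, limit=200)
    print(len(full_hour.messages), full_hour.dropped)  # 200 160
```

A 30-minute outage fits within the limit with 20 slots to spare, while a 60-minute outage loses 160 messages, which is why the limit should track the outage duration you actually expect.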
Key Insight: Queue depth is a balancing act. Too small = drop messages. Too large = memory exhaustion + stale data. Calculate based on realistic outage duration, not worst-case scenarios.
24.11 Common Pitfalls
Common QoS and Session Mistakes
1. Using QoS 2 everywhere “just to be safe”: QoS 2 requires a 4-packet handshake for every message. On a device sending data every second, this quadruples the packet count compared with QoS 0 and dramatically increases battery drain. Reserve QoS 2 for critical commands only.
2. Random client IDs with persistent sessions: If your code generates a random client ID on each connection (e.g., client_ + random UUID), the broker treats each connection as a new client. The old sessions are never resumed and linger on the broker until they expire (or indefinitely if no session expiry is configured), leaking broker memory. Always use deterministic, device-specific client IDs (e.g., sensor-factory3-line7-temp01).
3. QoS mismatch between publisher and subscriber: If a publisher sends at QoS 2 but the subscriber subscribes at QoS 0, the broker downgrades delivery to the subscriber’s level. The publisher pays the QoS 2 overhead, but the subscriber still gets fire-and-forget delivery. Always verify that both ends use matching (or compatible) QoS levels.
4. Forgetting to set message expiry on persistent sessions: Without a TTL (time-to-live), queued messages for offline devices persist indefinitely on the broker. A device offline for months accumulates thousands of stale messages. When it reconnects, it receives a flood of outdated data. Always configure message-expiry-interval (MQTT 5.0) or broker-side queue policies.
5. Not handling reconnection storms: When hundreds of devices lose connectivity simultaneously (e.g., power outage), they all reconnect at once, overwhelming the broker with CONNECT, SUBSCRIBE, and queued message delivery. Use exponential backoff with jitter on reconnection and stagger device boot times.
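For the last pitfall, one common way to add jitter is to pick a random delay between zero and the exponential ceiling (often called "full jitter"). The sketch below assumes that variant, with a 1-second base and a 60-second cap to match the worked example; `backoff_delay` and its parameters are hypothetical names.

```python
import random
from typing import Optional


def backoff_delay(attempt: int, base_s: float = 1.0, cap_s: float = 60.0,
                  rng: Optional[random.Random] = None) -> float:
    """Return a random delay in [0, min(cap_s, base_s * 2**attempt)] seconds."""
    source = rng if rng is not None else random
    ceiling = min(cap_s, base_s * (2 ** attempt))
    return source.uniform(0.0, ceiling)


if __name__ == "__main__":
    rng = random.Random(42)  # seeded only to make the demo repeatable
    for attempt in range(8):
        ceiling = min(60.0, 2.0 ** attempt)
        delay = backoff_delay(attempt, rng=rng)
        print(f"attempt {attempt}: wait {delay:.2f}s (ceiling {ceiling:.0f}s)")
```

Because every device draws its own random delay, a fleet that lost power at the same moment spreads its CONNECT attempts across the whole window instead of retrying in lockstep.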