24  MQTT QoS and Session

In 60 Seconds

MQTT offers three QoS levels trading speed for delivery guarantees: QoS 0 (fire-and-forget, fastest but may lose messages), QoS 1 (at-least-once with acknowledgment, may duplicate), and QoS 2 (exactly-once via a 4-step handshake, slowest but no loss or duplication). Session management complements QoS by controlling whether the broker remembers subscriptions and queues messages during disconnections – persistent sessions are essential for battery-powered devices that sleep between transmissions.

24.1 Learning Objectives

By the end of this chapter series, you will be able to:

  • Distinguish the three MQTT QoS levels and compare their message flow handshakes (1-packet, 2-packet, and 4-packet)
  • Select and justify the optimal QoS level based on reliability requirements, battery constraints, and bandwidth trade-offs
  • Configure clean vs persistent sessions for different device types (stateless sensors vs command receivers)
  • Calculate queue depth and message expiry to prevent broker memory exhaustion during device disconnections
  • Diagnose and prevent common QoS pitfalls including random client IDs, QoS mismatch, and reconnection storms
  • Evaluate QoS trade-offs by applying bandwidth and battery-life calculations to real IoT deployment scenarios

Key Concepts

  • QoS 0 (At Most Once): Fire-and-forget delivery with no acknowledgment — lowest overhead, message loss possible
  • QoS 1 (At Least Once): ACK-based delivery ensuring message arrives at least once — duplicate possible if PUBACK lost
  • QoS 2 (Exactly Once): Four-message handshake (PUBLISH/PUBREC/PUBREL/PUBCOMP) guaranteeing exactly one delivery
  • PUBACK: Broker acknowledgment for QoS 1 publish — triggers publisher to remove message from retry queue
  • QoS Downgrade: Broker silently delivers at subscriber’s QoS level if lower than publisher’s — common source of unexpected behavior
  • In-flight Messages: QoS 1/2 messages awaiting acknowledgment — window size limits concurrent unacknowledged messages
  • Message Ordering: MQTT guarantees ordered delivery within a client session but not across sessions or publishers
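
The QoS downgrade rule above can be sketched in a few lines. This is an illustrative helper, not part of any MQTT library: the function name `effective_qos` is an assumption made for this example. It encodes only the rule that delivery to a subscriber uses the lower of the publish QoS and the subscription QoS.

```python
def effective_qos(publish_qos: int, subscribe_qos: int) -> int:
    """Return the QoS level used when a broker forwards a message to a subscriber."""
    for qos in (publish_qos, subscribe_qos):
        if qos not in (0, 1, 2):
            raise ValueError(f"invalid QoS level: {qos}")
    # The broker never upgrades: delivery uses the lower of the two levels.
    return min(publish_qos, subscribe_qos)


if __name__ == "__main__":
    for pub in (0, 1, 2):
        for sub in (0, 1, 2):
            level = effective_qos(pub, sub)
            print(f"publish QoS {pub}, subscribe QoS {sub} -> delivered at QoS {level}")
```

A publisher at QoS 2 paired with a QoS 0 subscription therefore yields fire-and-forget delivery on the subscriber side.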

24.2 MVU — Minimum Viable Understanding

QoS 0 sends once with no confirmation (may lose). QoS 1 retries until acknowledged (may duplicate). QoS 2 uses a 4-step handshake for exactly-once delivery. Clean sessions forget state on disconnect; persistent sessions buffer messages for offline devices. Choose the lowest QoS that meets your reliability needs to conserve bandwidth and battery.

Sammy the Sensor sends temperature readings to the cloud every few seconds. “I just shout my readings into the air — if one gets lost, another is coming soon!” says Sammy. That is QoS 0 — fire and forget.

Lila the Light Sensor detects when someone enters a room. “I need to make sure the cloud got my alert, so I keep sending it until I hear back ‘Got it!’” explains Lila. That is QoS 1 — at least once, with a confirmation.

Max the Motor Controller receives commands to lock a smart door. “I only accept each command exactly once. Imagine if a ‘toggle lock’ command arrived twice and unlocked the door again!” says Max. That is QoS 2: exactly once, with a special secret handshake.

Bella the Battery Monitor adds: “And when I go to sleep to save energy, the cloud remembers I was there and saves messages for when I wake up. That is called a persistent session!”

Think of MQTT QoS like different ways of sending a letter:

  • QoS 0 (Regular mail): You drop the letter in the mailbox and walk away. It will probably arrive, but you have no proof. Fast and cheap.
  • QoS 1 (Certified mail): The post office gets a signature on delivery and tells you it arrived. If the confirmation gets lost, you send the letter again — so the recipient might get two copies.
  • QoS 2 (Registered mail with tracking): A multi-step process ensures the letter arrives exactly once. The most reliable, but the slowest and most expensive.

Sessions are like your mailbox at the post office:

  • Clean session: Every time you visit, the post office has no record of you. You start fresh.
  • Persistent session: The post office remembers you, holds your mail while you are away, and hands it all over when you return.

In IoT, you pick the right combination of QoS level and session type based on how critical each message is and how constrained your device is in terms of battery and bandwidth.
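
As a rough illustration of the session choice, the sketch below picks connection settings for a device. The helper name `session_options` and its parameters are hypothetical. The returned keys are chosen on the assumption that they will be passed as the `client_id` and `clean_session` keyword arguments of an MQTT 3.1.1 client such as paho-mqtt; MQTT 5.0 clients express the same idea with a clean-start flag and a session expiry interval instead.

```python
def session_options(device_id: str, needs_offline_delivery: bool) -> dict:
    """Choose client ID and session type for a device (illustrative sketch)."""
    if needs_offline_delivery and not device_id:
        # A persistent session is keyed by client ID, so the ID must be stable.
        raise ValueError("persistent sessions need a stable, non-empty client ID")
    return {
        "client_id": device_id,
        # clean_session=False asks the broker to keep subscriptions and
        # queue QoS 1/2 messages while the device is offline.
        "clean_session": not needs_offline_delivery,
    }


if __name__ == "__main__":
    # A stateless sensor that only publishes can start fresh every time.
    print(session_options("sensor-factory3-line7-temp01", needs_offline_delivery=False))
    # A sleeping command receiver needs the broker to hold its messages.
    print(session_options("lock-building2-door14", needs_offline_delivery=True))
```

The device names are examples only; what matters is that the ID is deterministic for each device rather than generated at connect time.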

24.3 Overview

This topic has been split into focused chapters for better learning. Each chapter covers a specific aspect of MQTT Quality of Service and session management, progressing from fundamentals to advanced real-world examples.

24.4 QoS and Session Decision Map

The following diagram shows how to decide which QoS level and session type to use based on your application requirements:

MQTT QoS decision flowchart

24.5 Chapter Guide

24.5.1 1. MQTT QoS Fundamentals

MQTT QoS Fundamentals - Start here if you’re new to QoS

Learn the basics of MQTT Quality of Service through simple analogies and real-world examples:

  • What QoS means (shipping analogy: regular mail vs certified vs registered)
  • QoS 0 (fire and forget), QoS 1 (at least once), QoS 2 (exactly once)
  • Clean vs persistent sessions explained simply
  • Retained messages for device status
  • Quick self-check with detailed answer analysis

Best for: Beginners, those needing a refresher on QoS concepts


24.5.2 2. MQTT QoS Levels

MQTT QoS Levels - Technical deep dive

Explore the technical details of each QoS level with interactive visualizations:

  • Message flow diagrams for QoS 0, 1, and 2 handshakes
  • QoS decision framework and flowchart
  • Interactive QoS visualizer (OJS animation)
  • Battery impact calculations and trade-off analysis
  • Common misconceptions debunked
  • QoS and session combination matrix

Best for: Developers implementing MQTT, those choosing QoS levels for specific use cases


24.5.3 3. MQTT Session Management

MQTT Session Management - Security and configuration

Master session persistence and secure MQTT deployment:

  • Interactive QoS comparison lab (Wokwi simulation)
  • Security considerations: TLS, authentication, ACLs
  • Common pitfalls: random client IDs, QoS mismatch, session flooding
  • Publisher-side buffering for offline scenarios
  • Exponential backoff for reconnection storms
  • Knowledge checks for security concepts

Best for: System architects, security-conscious developers, production deployments


24.5.4 4. MQTT QoS Worked Examples

MQTT QoS Worked Examples - Real-world applications

Apply QoS and session concepts to real IoT scenarios with detailed calculations:

  • Fleet Tracking: Message categorization, QoS selection, data cost analysis
  • Smart Door Locks: Battery calculations, audit compliance, QoS 2 justification
  • Fleet Session Sizing: Broker memory, reconnection storms, queue expiry
  • Medical Telemetry: Mixed QoS strategy, regulatory compliance, battery life
  • Sleep-Wake Sensors: Persistent sessions for agriculture IoT, command delivery

Best for: IoT architects, those designing production systems, interview preparation


24.6 Quick Check: QoS Fundamentals

Before exploring the interactive calculators, verify your understanding of the core concepts covered above.

24.7 QoS Level Comparison

The following diagram summarizes the trade-offs between the three QoS levels across key dimensions:

MQTT QoS level comparison

24.8 Quick Reference

Topic | Chapter | Time | Level
QoS basics, analogies | Fundamentals | 15 min | Beginner
Technical handshakes | QoS Levels | 20 min | Intermediate
Security, pitfalls | Sessions | 20 min | Intermediate
Real-world examples | Worked Examples | 25 min | Advanced

24.9 Interactive Calculators

Use the calculators below to explore MQTT QoS trade-offs for your own deployment scenarios.

24.9.1 QoS Bandwidth Overhead Calculator

Estimate the bandwidth cost of each QoS level for your message rate and payload size.
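
For readers without access to the interactive calculator, the following is a minimal sketch of the same kind of estimate. It assumes MQTT 3.1.1 framing on the publisher-to-broker link only: a PUBLISH packet with a 1-byte header, a variable-length "remaining length" field, a 2-byte topic length, and a 2-byte packet identifier for QoS 1 and 2; PUBACK, PUBREC, PUBREL, and PUBCOMP are counted as 4 bytes each. TCP/IP and TLS overhead, retransmissions, and MQTT 5.0 properties are deliberately ignored. The function names are illustrative.

```python
# Bytes of acknowledgment traffic per message: none for QoS 0, one PUBACK for
# QoS 1, and PUBREC + PUBREL + PUBCOMP for QoS 2 (4 bytes each).
ACK_BYTES = {0: 0, 1: 4, 2: 12}


def remaining_length_field_size(remaining: int) -> int:
    """Size of MQTT's variable-length 'remaining length' field."""
    if remaining < 128:
        return 1
    if remaining < 16_384:
        return 2
    if remaining < 2_097_152:
        return 3
    return 4


def publish_exchange_bytes(topic: str, payload_len: int, qos: int) -> int:
    """MQTT-level bytes exchanged to publish one message at the given QoS."""
    if qos not in ACK_BYTES:
        raise ValueError(f"invalid QoS level: {qos}")
    variable_header = 2 + len(topic.encode("utf-8")) + (2 if qos > 0 else 0)
    remaining = variable_header + payload_len
    publish_packet = 1 + remaining_length_field_size(remaining) + remaining
    return publish_packet + ACK_BYTES[qos]


if __name__ == "__main__":
    for qos in (0, 1, 2):
        total = publish_exchange_bytes("trucks/123/gps", 120, qos)
        print(f"QoS {qos}: {total} bytes per message")
```

Under these assumptions the byte count grows modestly with QoS for a 120-byte payload; the larger cost of QoS 2 is the extra round trips, which keep the radio awake longer.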

24.9.2 Broker Queue Depth Calculator

Calculate how many messages accumulate during device offline periods and the broker memory required.

24.9.3 QoS Battery Impact Calculator

Estimate how QoS level affects battery life for a battery-powered IoT device.

24.9.4 Message Expiry Planner

Calculate the optimal message expiry interval (TTL) to balance freshness against delivery during outages.

24.9.5 Reconnection Storm Estimator

Model the impact of simultaneous device reconnections on broker load after a network outage.
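
A very simple way to reason about a reconnection storm is to assume that all devices reconnect uniformly over some time window. The sketch below uses that assumption to estimate average CONNECT and queued-message delivery rates. The function name `storm_load` and the uniform-spread model are illustrative simplifications; real storms are burstier and depend on broker and network behavior.

```python
def storm_load(devices: int, spread_seconds: float, queued_per_device: int = 0) -> dict:
    """Average broker load if `devices` reconnect evenly over `spread_seconds`."""
    if devices < 0 or queued_per_device < 0:
        raise ValueError("devices and queued_per_device must be non-negative")
    if spread_seconds <= 0:
        raise ValueError("spread_seconds must be positive")
    return {
        # CONNECT packets the broker must handle per second, on average.
        "connects_per_sec": devices / spread_seconds,
        # Queued messages flushed to reconnecting devices per second.
        "queued_msgs_per_sec": devices * queued_per_device / spread_seconds,
    }


if __name__ == "__main__":
    # 5,000 devices with 200 queued messages each, reconnecting within 1 s
    # versus spread across 50 s by backoff and jitter.
    print(storm_load(5000, 1, 200))
    print(storm_load(5000, 50, 200))
```

Widening the reconnection window is the main lever: the same fleet produces a fiftieth of the average load when spread over 50 seconds instead of 1.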

24.10 Knowledge Check

Test your understanding of MQTT QoS and session concepts before diving into the detailed chapters.

Interactive Review: Match Concepts and Sequence the Protocol

Scenario: You’re deploying an MQTT system for 5,000 delivery trucks. Each truck publishes GPS location every 10 seconds with persistent sessions (cleanSession=false) so offline trucks receive commands when reconnecting.

Problem: Trucks drive through tunnels/remote areas with 15-30 minute connectivity gaps. How do you size broker queues to prevent message loss while avoiding memory exhaustion?

Step 1: Calculate Messages During Outage

  • Outage duration: 30 minutes (worst case)
  • Trucks affected: 5,000
  • Messages per truck: (30 min x 60 sec) / 10 sec = 180 messages/truck

Step 2: Calculate Broker Memory Requirements

  • Message size: GPS JSON = ~120 bytes ({"lat":37.7749,"lon":-122.4194,"speed":65})
  • Messages queued: 5,000 trucks x 180 messages = 900,000 messages
  • Memory needed: 900,000 x (120 + 40) bytes = 144 MB (payload + 40-byte broker overhead per message)

For \(N\) clients with persistent sessions, each offline for average duration \(T\) seconds, publishing at rate \(r\) messages/sec with message size \(S\) bytes:

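Total queue depth: $ Q_{\text{total}} = N \times r \times T $

Memory consumption: $ M_{\text{total}} = Q_{\text{total}} \times (S + O) $

Where \(O\) = broker overhead per message (~40 bytes for metadata)

Concrete example (5,000 trucks, 30-min outage):

  • \(N = 5000\) clients
  • \(r = 0.1\) msg/sec (1 msg per 10 sec)
  • \(T = 1800\) sec (30 min)
  • \(S = 120\) bytes payload
  • \(O = 40\) bytes overhead

$ Q_{\text{total}} = 5000 \times 0.1 \times 1800 = 900,000 \text{ messages} $

$ M_{\text{total}} = 900,000 \times (120 + 40) = 144,000,000 \text{ bytes} = 144 \text{ MB} $

Per-client queue limit calculation: $ Q_{\text{limit}} = r \times T_{\text{max}} + \text{buffer} $

For 30-min max outage: \(Q_{\text{limit}} = 0.1 \times 1800 + 20 = 200\text{ messages}\)

Total broker capacity check: $ M_{\text{max}} = N \times Q_{\text{limit}} \times (S + O) $

$ M_{\text{max}} = 5000 \times 200 \times (120 + 40) = 160,000,000 \text{ bytes} = 160 \text{ MB} $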

Scaling warning: Doubling fleet size or max outage duration roughly doubles memory needs relative to the 160 MB capacity figure:

  • 10,000 trucks: 320 MB
  • 60-min outage: about 320 MB
  • Both: about 640 MB

At scale, persistent sessions require careful capacity planning and message expiry policies to prevent broker memory exhaustion.
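
The sizing arithmetic for this scenario can be reproduced with a short script. This is a sketch of the formulas only: the function name `queue_sizing` is an assumption, the 40-byte overhead and 20-message buffer are the values used in this worked example rather than properties of any particular broker, and the publish rate is passed as an interval in seconds (so \(r = 1 / \text{interval}\)) to keep the arithmetic in whole numbers.

```python
import math


def queue_sizing(n_clients: int, publish_interval_sec: int, outage_sec: int,
                 payload_bytes: int, overhead_bytes: int = 40,
                 buffer_msgs: int = 20) -> dict:
    """Estimate queue depth and broker memory for an outage (illustrative)."""
    # Messages one client accumulates during the outage: r x T.
    per_client = math.ceil(outage_sec / publish_interval_sec)
    total_msgs = n_clients * per_client
    per_msg_bytes = payload_bytes + overhead_bytes
    queue_limit = per_client + buffer_msgs
    return {
        "msgs_per_client": per_client,
        "total_msgs": total_msgs,
        "total_bytes": total_msgs * per_msg_bytes,
        "queue_limit": queue_limit,
        # Memory if every client fills its queue up to the configured limit.
        "max_bytes": n_clients * queue_limit * per_msg_bytes,
    }


if __name__ == "__main__":
    result = queue_sizing(5000, 10, 1800, 120)
    print(result["total_msgs"], "messages")            # 900000 messages
    print(result["total_bytes"] / 1e6, "MB expected")  # 144.0 MB expected
    print(result["max_bytes"] / 1e6, "MB at limit")    # 160.0 MB at limit
```

Changing `n_clients` or `outage_sec` shows how quickly the capacity figure grows with fleet size and outage length.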

Step 3: Design Queue Limits

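# mosquitto.conf
# Per-client limit on queued messages.
# Comments must be on their own line: mosquitto.conf only treats a line
# as a comment when # is its first character.
max_queued_messages 200

# Why 200?
# - 180 messages in 30-min outage
# - +20 buffer for connection delays
# - Balance: enough for recovery, but prevents runaway queues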

Step 4: Handle Queue Overflow

What if a truck is offline for 2 hours (720 messages at 1 per 10 seconds)?

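Option A: Drop oldest messages (Circular buffer)

# mosquitto.conf
max_queued_messages 200
# With a drop-oldest policy, message 201 evicts message 1.
# Overflow handling is broker-specific: confirm whether your broker
# discards the oldest or the newest message once the queue is full.

Pro: Truck gets latest 200 messages (last ~33 minutes)
Con: Loses 520 messages (first ~87 minutes)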
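
The drop-oldest arithmetic can be checked with a small simulation. This sketch models the queue as a bounded `collections.deque`, which discards its oldest entry when a new one is appended to a full queue. It illustrates the policy only and makes no claim about how any specific broker implements overflow; the function name is hypothetical.

```python
from collections import deque


def simulate_drop_oldest(total_messages: int, queue_limit: int):
    """Return (messages kept, number dropped) under a drop-oldest policy."""
    queue = deque(maxlen=queue_limit)
    for seq in range(1, total_messages + 1):
        queue.append(seq)  # When full, the deque silently drops the oldest item.
    dropped = total_messages - len(queue)
    return list(queue), dropped


if __name__ == "__main__":
    # 2-hour outage at one message per 10 seconds = 720 messages, limit 200.
    kept, dropped = simulate_drop_oldest(720, 200)
    print(f"kept messages {kept[0]}..{kept[-1]}, dropped {dropped}")
    # kept messages 521..720, dropped 520
```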

Option B: Set Message Expiry (MQTT 5.0)

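import json

from paho.mqtt.packettypes import PacketTypes
from paho.mqtt.properties import Properties


def publish_gps(client, truck_id, lat, lon):
    """Publish a GPS fix that the broker may discard after 30 minutes.

    `client` must be a connected paho-mqtt client using MQTT 5.0.
    """
    props = Properties(PacketTypes.PUBLISH)
    props.MessageExpiryInterval = 1800  # 30 minutes, in seconds

    return client.publish(
        f"trucks/{truck_id}/gps",
        payload=json.dumps({"lat": lat, "lon": lon}),
        qos=1,
        properties=props,
    )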

Pro: Stale GPS data auto-expires, so the truck only gets fresh data.
Con: Commands sent during an outage might expire before delivery if they carry a similarly short expiry interval.

Step 5: Implement Smart Reconnection

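import time


def on_connect(client, userdata, flags, reason_code, properties):
    # userdata is assumed to be a dict initialised as {'retry_count': 0}
    if reason_code == 0:
        userdata['retry_count'] = 0  # Reset backoff after a successful connect

        # On reconnect, publish location immediately
        # (get_current_location() is an application-specific helper)
        client.publish("trucks/123/gps", get_current_location(), qos=1)

        # Then subscribe to commands
        client.subscribe("trucks/123/commands", qos=2)
    else:
        print(f"Connection failed: {reason_code}")
        # Exponential backoff: 1s, 2s, 4s, 8s, 16s, 32s, then capped at 60s
        # Note: sleeping in a callback blocks paho's network loop
        backoff = min(2 ** userdata['retry_count'], 60)
        time.sleep(backoff)
        userdata['retry_count'] += 1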

Step 6: Monitor Broker Health

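def alert(text):
    # Placeholder: replace with your paging or logging integration
    print(f"ALERT: {text}")


def on_sys_message(client, userdata, msg):
    if msg.topic.endswith("messages/stored"):
        stored = int(msg.payload.decode())
        if stored > 800000:  # 80% of 1M capacity
            alert("Broker queue at 80% - investigate!")


# Route $SYS messages to the handler, then subscribe to $SYS topics for monitoring
client.message_callback_add("$SYS/broker/#", on_sys_message)
client.subscribe("$SYS/broker/messages/stored")
client.subscribe("$SYS/broker/subscriptions/count")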

Decision Summary:

Configuration | Value | Rationale
max_queued_messages | 200 | Covers 30-min outage + buffer
message_expiry_interval | 1800 sec | GPS data stale after 30 min
Reconnection backoff | 1-60 sec | Prevents reconnection storms
Queue monitoring | Alert at 80% | Early warning for capacity issues

Key Insight: Queue depth is a balancing act. Too small and messages are dropped; too large and you risk memory exhaustion and stale data. Size queues for the longest outage you realistically expect (30 minutes in this example), not for an unbounded worst case.

24.11 Common Pitfalls

Common QoS and Session Mistakes

1. Using QoS 2 everywhere “just to be safe”: QoS 2 requires a 4-packet handshake for every message. On a device sending data every second, this quadruples the packet count relative to QoS 0, adds round trips that keep the radio awake longer, and noticeably increases battery drain. Reserve QoS 2 for critical commands only.

2. Random client IDs with persistent sessions: If your code generates a random client ID on each connection (e.g., client_ + random UUID), the broker treats each connection as a new client. The old sessions are never resumed, and unless the broker expires them they are never cleaned up, leaking broker memory. Always use deterministic, device-specific client IDs (e.g., sensor-factory3-line7-temp01).

3. QoS mismatch between publisher and subscriber: If a publisher sends at QoS 2 but the subscriber subscribes at QoS 0, the broker downgrades delivery to the subscriber’s level. The publisher pays the QoS 2 overhead, but the subscriber still gets fire-and-forget delivery. Always verify that both ends use matching (or compatible) QoS levels.

4. Forgetting to set message expiry on persistent sessions: Without a TTL (time-to-live), queued messages for offline devices persist indefinitely on the broker. A device offline for months accumulates thousands of stale messages. When it reconnects, it receives a flood of outdated data. Always configure message-expiry-interval (MQTT 5.0) or broker-side queue policies.

5. Not handling reconnection storms: When hundreds of devices lose connectivity simultaneously (e.g., power outage), they all reconnect at once, overwhelming the broker with CONNECT, SUBSCRIBE, and queued message delivery. Use exponential backoff with jitter on reconnection and stagger device boot times.
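
Exponential backoff with jitter, as recommended in the last pitfall, can be sketched as follows. This example uses the "full jitter" variant, where each delay is drawn uniformly between zero and an exponentially growing ceiling. That variant is one common choice and is an assumption here, as are the function name, the 1-second base, and the 60-second cap.

```python
import random


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0,
                  rng=random) -> float:
    """Delay in seconds before reconnect attempt number `attempt` (0-based)."""
    if attempt < 0:
        raise ValueError("attempt must be non-negative")
    # Ceiling doubles each attempt (1s, 2s, 4s, ...) until it reaches the cap.
    # The exponent is clamped so very large attempt counts cannot overflow.
    ceiling = min(cap, base * (2 ** min(attempt, 32)))
    # Full jitter: pick a random point below the ceiling so devices that
    # failed at the same moment do not retry at the same moment.
    return rng.uniform(0, ceiling)


if __name__ == "__main__":
    rng = random.Random(42)  # Seeded only to make the demo repeatable.
    for attempt in range(8):
        print(f"attempt {attempt}: wait {backoff_delay(attempt, rng=rng):.2f}s")
```

A device would sleep for the returned delay before its next connection attempt and reset `attempt` to zero after a successful connect.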