35 MQTT QoS and Reliability

In 60 Seconds

MQTT reliability combines three mechanisms: QoS levels (0/1/2) controlling per-message delivery guarantees, persistent sessions queuing messages for offline subscribers, and retained messages storing the last value on each topic for immediate delivery to new subscribers. The Last Will and Testament (LWT) enables automatic failure notification when a client disconnects unexpectedly.

35.1 Learning Objectives

By the end of this chapter, you will be able to:

  • Select Appropriate QoS Levels: Choose QoS 0, 1, or 2 based on data criticality and battery constraints
  • Explain Delivery Guarantees: Describe the message flow and acknowledgment packets for each QoS level and justify why each step is necessary
  • Configure Persistent Sessions: Apply clean_session=false with fixed client IDs to enable offline message queuing for actuators and command receivers
  • Implement Retained Messages: Demonstrate how to store last-known values for immediate delivery to new subscribers and distinguish state topics from event topics
  • Design LWT Monitoring: Construct Last Will and Testament configurations for graceful failure notification and combine with retained messages for persistent device status
  • Calculate Energy Trade-offs: Compare battery consumption across QoS levels and assess the impact of keepalive interval on detection latency versus power draw

Key Concepts

  • QoS 0 (At Most Once): Fire-and-forget delivery with no acknowledgment — lowest overhead, message loss possible
  • QoS 1 (At Least Once): ACK-based delivery ensuring message arrives at least once — duplicate possible if PUBACK lost
  • QoS 2 (Exactly Once): Four-message handshake (PUBLISH/PUBREC/PUBREL/PUBCOMP) guaranteeing exactly one delivery
  • PUBACK: Broker acknowledgment for QoS 1 publish — triggers publisher to remove message from retry queue
  • QoS Downgrade: Broker silently delivers at subscriber’s QoS level if lower than publisher’s — common source of unexpected behavior
  • In-flight Messages: QoS 1/2 messages awaiting acknowledgment — window size limits concurrent unacknowledged messages
  • Message Ordering: MQTT guarantees ordered delivery within a client session but not across sessions or publishers
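The QoS downgrade rule above reduces to taking the minimum of the publisher's QoS and the QoS the subscriber requested. A one-line sketch:

```python
def effective_qos(publish_qos: int, subscribe_qos: int) -> int:
    """MQTT delivers at the lower of the publisher's QoS and the
    QoS granted to the subscriber at SUBSCRIBE time."""
    return min(publish_qos, subscribe_qos)

# A QoS 2 publish to a subscriber that asked for QoS 0 arrives as QoS 0.
print(effective_qos(2, 0))  # -> 0
print(effective_qos(1, 2))  # -> 1
```

This is why a carefully chosen publisher QoS can be silently undone by a careless subscription: both ends must request the level you need.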

35.2 Prerequisites

Required Chapters:

Technical Background:

  • TCP reliability concepts
  • Battery power considerations for IoT devices

Estimated Time: 15 minutes

35.3 MQTT QoS Levels Comparison

Understanding Quality of Service levels is critical for reliability vs performance trade-offs:

Figure 35.1: MQTT QoS level selection decision tree by data criticality

35.3.1 QoS Level Summary

QoS   Name            Messages   Guarantee                   Use Case
──────────────────────────────────────────────────────────────────────────────
0     At most once    1          Fire-and-forget, may lose   Temperature every 5 seconds
1     At least once   2          Confirmed, may duplicate    Door alerts, motion sensors
2     Exactly once    4          No loss, no duplicates      Financial, medical commands

35.3.2 Energy and Performance Trade-offs

Battery impact example: Sensor publishes 1 message/minute for 1 year.

QoS     Messages/Year     Energy (at 20 mA x 50 ms/msg)
──────────────────────────────────────────────────────
QoS 0     525,600          146 mAh
QoS 1   1,051,200 (2x)     292 mAh
QoS 2   2,102,400 (4x)     584 mAh

Key Insight: Use QoS 0 for high-frequency replaceable data, QoS 1 selectively for critical messages only.

For a sensor publishing every minute for a year at 20 mA per message:

QoS 0 (1 message): \[ \text{Messages per year} = 525{,}600 \] \[ \text{Energy} = 525{,}600 \times 20\,\text{mA} \times 50\,\text{ms} = 146\,\text{mAh} \]

QoS 1 (2 messages: PUBLISH + PUBACK): \[ \text{Messages per year} = 1{,}051{,}200 \] \[ \text{Energy} = 1{,}051{,}200 \times 20\,\text{mA} \times 50\,\text{ms} = 292\,\text{mAh} \quad (2\times) \]

QoS 2 (4 messages: full handshake): \[ \text{Messages per year} = 2{,}102{,}400 \] \[ \text{Energy} = 2{,}102{,}400 \times 20\,\text{mA} \times 50\,\text{ms} = 584\,\text{mAh} \quad (4\times) \]

For a 2,000 mAh battery, QoS 2 consumes 29% of battery capacity just for MQTT overhead versus 7% for QoS 0 — a critical factor in 5+ year sensor deployments.

35.3.3 QoS Battery Life Calculator

Estimate how QoS level selection affects battery life for your IoT device. Adjust battery capacity, message rate, and radio characteristics to compare all three QoS levels side by side.
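A minimal sketch of such a calculator, using the per-message assumptions from the table above (20 mA radio draw, 50 ms per transmitted packet, and 1/2/4 packets per application message depending on QoS):

```python
QOS_PACKETS = {0: 1, 1: 2, 2: 4}  # packets on the wire per application message

def annual_energy_mah(msgs_per_min: float, qos: int,
                      radio_ma: float = 20.0, tx_ms: float = 50.0) -> float:
    """Energy spent on MQTT transmissions over one year, in mAh."""
    packets = msgs_per_min * 60 * 24 * 365 * QOS_PACKETS[qos]
    return packets * radio_ma * (tx_ms / 1000.0) / 3600.0  # mA*s -> mAh

def battery_share(capacity_mah: float, msgs_per_min: float, qos: int) -> float:
    """Fraction of the battery consumed per year by MQTT traffic alone."""
    return annual_energy_mah(msgs_per_min, qos) / capacity_mah

# 1 msg/min for a year on a 2,000 mAh battery:
for q in (0, 1, 2):
    print(f"QoS {q}: {annual_energy_mah(1, q):.0f} mAh "
          f"({battery_share(2000, 1, q):.0%} of battery)")
```

Running this reproduces the 146/292/584 mAh figures and the 7% versus 29% battery shares discussed above. The model counts only transmit time; sleep current and radio wake-up overhead would be added in a real budget.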

35.4 Common Misconception: MQTT Guarantees Message Delivery

Misconception: “MQTT Guarantees Message Delivery”

The Misconception: Many developers believe that using MQTT automatically ensures messages will reach subscribers, regardless of configuration.

The Reality: Message delivery guarantees depend on both QoS level AND session configuration:

Real-World Example - Smart Home Door Lock Failure (2023):

  • Scenario: Smart lock manufacturer used MQTT with default settings (QoS 0, clean_session=1)
  • Problem: Mobile app sent “unlock door” command while lock was temporarily offline
  • Result: 12% of unlock commands lost (measured over 50,000 operations), causing user complaints
  • Impact: Average 2.3 unlock failures per user per month, 34% increase in support tickets

Why Commands Were Lost:

  1. QoS 0: No acknowledgment or retry mechanism
  2. Clean Session=1: Broker discarded messages for offline clients
  3. No Retained Messages: Last command not stored for retrieval

Correct Configuration:

# Wrong (manufacturer's original config)
client = mqtt.Client()                 # clean_session defaults to True: no persistence
client.connect(broker_host)
client.publish("lock/unlock", qos=0)   # Fire-and-forget

# Right (fixed config)
# Note: in paho-mqtt, clean_session is a Client() constructor argument, not a connect() argument
client = mqtt.Client(client_id="lock_456", clean_session=False)  # Persistent session
client.connect(broker_host)
client.publish("lock/unlock", qos=1)   # At-least-once delivery

After Fix:

  • Command loss rate dropped from 12% to 0.03% (400x improvement)
  • User complaints decreased by 89%
  • Support ticket cost reduced by $47,000/month

35.5 Session Management

35.5.1 Persistent Sessions (Clean Session = False)

Persistent session (Clean Session=0) enables offline message queuing:

How it works:

  1. Client connects with client_id="sensor_123" and clean_session=false
  2. Client subscribes to commands/sensor_123 (broker remembers subscription)
  3. Client disconnects (sleep mode)
  4. Server publishes command to commands/sensor_123 with QoS 1
  5. Broker queues message for offline sensor_123
  6. Client reconnects with same client_id, clean_session=false
  7. Broker delivers queued message immediately
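The sequence above can be modeled with a toy in-memory broker. This is a sketch of the session store and QoS >= 1 queuing rules only, not a real broker (no networking, one implied subscription per client ID):

```python
class ToyBroker:
    """Models persistent-session queuing; subscriptions are implied."""

    def __init__(self):
        self.sessions = {}  # client_id -> {"online": bool, "queue": list}

    def connect(self, client_id, clean_session=False):
        """Returns any messages queued while the client was offline."""
        if clean_session or client_id not in self.sessions:
            self.sessions[client_id] = {"online": True, "queue": []}
            return []
        session = self.sessions[client_id]
        session["online"] = True
        queued, session["queue"] = session["queue"], []
        return queued

    def disconnect(self, client_id):
        self.sessions[client_id]["online"] = False

    def publish(self, client_id, payload, qos):
        """Queue for an offline persistent session only if QoS >= 1."""
        session = self.sessions[client_id]
        if session["online"]:
            return "delivered"
        if qos >= 1:
            session["queue"].append(payload)
            return "queued"
        return "dropped"  # QoS 0 is never queued for offline clients

broker = ToyBroker()
broker.connect("sensor_123", clean_session=False)
broker.disconnect("sensor_123")                     # device enters sleep mode
broker.publish("sensor_123", "set_interval=30", 1)  # queued
broker.publish("sensor_123", "telemetry_ping", 0)   # dropped
print(broker.connect("sensor_123"))                 # -> ['set_interval=30']
```

Note that reconnecting with clean_session=True would wipe the queue, which is exactly the smart-lock failure described in the misconception section.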

Requirements:

  • Fixed client_id: Must use same ID across sessions (broker links queue to ID)
  • QoS >= 1: Only QoS 1/2 messages queued (QoS 0 discarded if client offline)
  • Subscription persistence: Subscriptions survive disconnection (no need to re-SUBSCRIBE)

Trade-offs:

  • Broker stores queued messages (uses memory/disk - typically 100KB-10MB per client)
  • Potential message flood on reconnection (100s of queued messages delivered rapidly)

35.5.2 Clean Sessions (Clean Session = True)

Clean Session=1: Discards session state on disconnect - suitable for high-frequency sensors that don’t need commands (temperature logger).

Best practices:

  • Use persistent sessions for devices receiving commands (actuators, configuration)
  • Use clean sessions for pure publishers (sensors)
  • Set message expiry (MQTT 5.0) to prevent infinite queuing
  • Monitor broker memory usage

35.5.3 Persistent Session Queue Calculator

Estimate the broker memory required to queue messages for offline devices with persistent sessions. This is critical for sizing your MQTT broker when devices sleep or experience network outages.
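A sketch of that sizing math. The 32-byte per-message broker overhead is an assumption for illustration; real brokers vary:

```python
def queue_memory_bytes(devices: int, outage_minutes: float,
                       msgs_per_min: float, payload_bytes: int,
                       per_msg_overhead: int = 32) -> int:
    """Worst-case broker memory needed to queue QoS >= 1 traffic
    for offline persistent-session clients."""
    msgs_queued = int(outage_minutes * msgs_per_min)
    return devices * msgs_queued * (payload_bytes + per_msg_overhead)

# 500 actuators offline for 1 hour, 1 msg/min, 48-byte payloads:
total = queue_memory_bytes(500, 60, 1, 48)
print(f"{total / 1e6:.1f} MB")  # -> 2.4 MB
```

With MQTT 5.0 message expiry set, `outage_minutes` is effectively capped at the expiry interval, which is why expiry is the main defense against unbounded queues.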

35.6 Retained Messages

Retained messages provide “last known value” to new subscribers instantly.

Normal operation: Subscriber connects after sensor publishes -> must wait until next publish (30s) to receive data.

With retained message:

client.publish("home/temp", "24.5", qos=1, retain=True)
  • Broker stores message permanently
  • When subscriber connects -> immediately receives retained “24.5” (even if sensor published hours ago)

Use cases:

  1. Device status: device/status retained “online” - dashboard always knows current state
  2. Configuration: device/config retained settings - new dashboards get current config without querying device
  3. Slow-changing data: Room temperature, door state - subscribers need current value immediately, not historical

Caution:

  • Only ONE message retained per topic (new publish replaces old)
  • Clearing retained message: client.publish("topic", "", retain=True) (empty payload)

Best practice: Use retained for “state” topics (current temp, light status), don’t use for “events” (button pressed, motion detected - these are temporal, not states).
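The retain semantics above (one value per topic, empty payload clears) can be modeled as a simple dictionary. This is a sketch of broker behavior, not the paho-mqtt API:

```python
class RetainedStore:
    """One retained message per topic; an empty payload clears it."""

    def __init__(self):
        self._store = {}

    def publish(self, topic, payload, retain=False):
        if not retain:
            return  # non-retained publishes never touch the store
        if payload in ("", b""):
            self._store.pop(topic, None)   # empty retained payload = clear
        else:
            self._store[topic] = payload   # new retained publish replaces old

    def on_subscribe(self, topic):
        """What a new subscriber receives immediately, if anything."""
        return self._store.get(topic)

store = RetainedStore()
store.publish("home/temp", "24.5", retain=True)
store.publish("home/temp", "25.1", retain=True)   # replaces 24.5
print(store.on_subscribe("home/temp"))            # -> 25.1
store.publish("home/temp", "", retain=True)       # clear retained value
print(store.on_subscribe("home/temp"))            # -> None
```

The model makes the state-versus-event distinction concrete: a retained "motion detected" would replay stale events to every new subscriber, while a retained temperature is exactly the current state they want.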

35.7 Last Will and Testament (LWT)

Last Will and Testament (LWT) enables graceful failure notification.

Setup: During CONNECT, client specifies will message:

client.will_set("devices/sensor1/status", "offline", qos=1, retain=True)

Normal operation: Client publishes data, sends "online" to status topic, DISCONNECT gracefully -> LWT not sent.

Unexpected disconnect: Power failure, network timeout -> broker doesn’t receive DISCONNECT -> broker publishes LWT "offline" to status topic -> subscribers notified of failure.

Use cases:

  1. Device availability monitoring: Dashboard shows which devices offline
  2. Automated failover: Backup sensor activates when primary goes offline
  3. Alert generation: Email/SMS when critical device loses connection

Implementation pattern:

import time

while True:
    client.publish("device/status", "online", retain=True)  # Heartbeat
    time.sleep(60)
    # If this loop stops (crash, power loss), the broker publishes the LWT

LWT + retained: Combine for persistent status - LWT sets “offline” as retained message, ensuring new subscribers see current state. Clear retained on reconnection:

client.publish("device/status", "online", retain=True)

MQTT 5.0 enhancement: Will Delay Interval - delay LWT publication (e.g., 30s) to avoid false alarms during brief reconnections.

Production: All production IoT devices should implement LWT for operational monitoring.
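The LWT + retained pattern can be sketched with a toy broker model (an illustration of broker-side behavior, not a real broker):

```python
class LwtBroker:
    """Sketch: the will is published only on ungraceful disconnect."""

    def __init__(self):
        self.retained = {}  # topic -> payload (retained store)
        self.wills = {}     # client_id -> (topic, payload)

    def connect(self, client_id, will_topic, will_payload):
        self.wills[client_id] = (will_topic, will_payload)
        # Client publishes "online" retained right after CONNECT.
        self.retained[will_topic] = "online"

    def disconnect(self, client_id):
        """Graceful DISCONNECT: the will is discarded, never published."""
        self.wills.pop(client_id, None)

    def drop(self, client_id):
        """Keepalive timeout / crash: broker publishes the retained will."""
        topic, payload = self.wills.pop(client_id)
        self.retained[topic] = payload

broker = LwtBroker()
broker.connect("sensor1", "devices/sensor1/status", "offline")
broker.drop("sensor1")                            # power failure
print(broker.retained["devices/sensor1/status"])  # -> offline
```

Because the will is stored retained, a dashboard that subscribes hours after the failure still sees "offline" immediately, which is the point of combining the two mechanisms.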

35.8 Keepalive and Connection Health

Keepalive interval determines disconnection detection latency.

How keepalive works:

  • Client sends PINGREQ if no messages sent in keepalive period
  • Broker responds PINGRESP
  • If broker doesn’t receive ANY message (publish or ping) within 1.5x keepalive: disconnects client

Example problem: Keepalive=60s. Network drops for 2s during active publishing. Client doesn’t realize disconnection until keepalive timeout (60s) - continues buffering messages locally. When timeout occurs: 6 messages queued (10s x 6 = 60s), sudden burst on reconnection.

Optimal keepalive: Set to 2-3x the message interval. Publishing every 10s -> keepalive=30s, so the broker detects a dead client within at most 1.5 x 30 = 45s.

Trade-offs:

Keepalive      Detection   Power                     Broker Load
────────────────────────────────────────────────────────────────
Short (10s)    Quick       Higher (periodic pings)   More
Long (300s)    Delayed     Lower                     Less

AWS IoT recommendation: 30-1200 seconds depending on battery constraints. Mobile networks: 60-300s (cellular keep-alive).

Production: Implement exponential backoff reconnection: First reconnect immediately, then 1s, 2s, 4s, 8s, max 60s. Prevents reconnection storms when broker restarts.
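That backoff schedule can be sketched as a generator:

```python
def backoff_delays(max_delay=60):
    """Yield reconnect delays in seconds: immediate, then 1, 2, 4, ... capped."""
    yield 0            # first attempt: reconnect immediately
    delay = 1
    while True:
        yield delay
        delay = min(delay * 2, max_delay)  # double up to the cap

delays = backoff_delays()
print([next(delays) for _ in range(8)])  # -> [0, 1, 2, 4, 8, 16, 32, 60]
```

A reconnect loop would call `time.sleep(next(delays))` before each attempt and create a fresh generator once a connection succeeds. Production code usually also adds random jitter so a fleet of devices does not retry in lockstep.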

35.8.1 Keepalive Interval Optimizer

Find the optimal keepalive interval for your IoT deployment. Balance disconnection detection speed against power consumption and broker load. The MQTT specification triggers disconnect after 1.5x the keepalive interval with no activity.
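A sketch of the trade-off the optimizer explores, assuming one PINGREQ per idle keepalive window and the 1.5x rule from the specification:

```python
def keepalive_tradeoff(keepalive_s: float, msg_interval_s: float):
    """Worst-case broker-side detection latency and pings per day
    for a given keepalive choice."""
    detection_s = 1.5 * keepalive_s
    # No ping is needed in windows where a publish already went out.
    pings_per_day = 0 if msg_interval_s <= keepalive_s else 86400 / keepalive_s
    return detection_s, pings_per_day

print(keepalive_tradeoff(30, 10))   # publishing every 10s -> (45.0, 0)
print(keepalive_tradeoff(30, 300))  # mostly idle sensor  -> (45.0, 2880.0)
```

The second case shows why idle, battery-powered devices prefer long keepalives: 2,880 pings per day is pure radio overhead, bought only to shave detection latency.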

Think of MQTT QoS like mail delivery options:

Mail Type         QoS Level   What Happens
──────────────────────────────────────────────────────────────────────────────
Postcard          QoS 0       Drop in mailbox, hope it arrives. No tracking.
Certified Mail    QoS 1       Carrier confirms delivery. Might accidentally deliver twice.
Registered Mail   QoS 2       Full tracking, signature required. Expensive but guaranteed.

When to use each:

  • QoS 0: Temperature every 5 seconds - missing one is fine
  • QoS 1: Door sensor alerts - must be delivered, duplicates OK
  • QoS 2: Payment confirmations - no duplicates, no losses

The most common mistake: Using QoS 2 for everything “just to be safe” - this wastes 4x the battery and bandwidth!

35.9 Worked Example: MQTT Broker Sizing for a Smart Building

Scenario: A 20-floor commercial building deploys 2,000 IoT devices (HVAC sensors, occupancy detectors, light controllers, door locks) publishing to a central MQTT broker. Calculate the broker’s message throughput, memory requirements, and determine the correct QoS mix.

Device breakdown and QoS assignment:

Device Type         Count   Msg/min   QoS   Payload   Rationale
──────────────────────────────────────────────────────────────────
Temperature sensors   800      1      QoS 0   32 B    Replaceable, high frequency
Occupancy sensors     400      2      QoS 0   16 B    Replaceable, high frequency
Light controllers     300      0.2    QoS 1   48 B    Commands must arrive
HVAC actuators        200      0.5    QoS 1   64 B    Control commands critical
Door locks            100      0.1    QoS 2   24 B    Security-critical, no dupes
Fire/smoke detectors  150      0.05   QoS 1   32 B    Critical alerts
Energy meters          50      1      QoS 1  128 B    Billing data, must deliver

Message throughput calculation:

Inbound (publish):
  Temperature:  800 x 1    =   800 msg/min
  Occupancy:    400 x 2    =   800 msg/min
  Light:        300 x 0.2  =    60 msg/min
  HVAC:         200 x 0.5  =   100 msg/min
  Locks:        100 x 0.1  =    10 msg/min
  Fire:         150 x 0.05 =   7.5 msg/min
  Energy:        50 x 1    =    50 msg/min
  ─────────────────────────────────────────
  Total inbound: 1,827.5 msg/min = 30.5 msg/sec

Outbound (deliver to subscribers):
  Assume average 3 subscribers per topic (dashboard, logger, automation):
  Total outbound: 30.5 x 3 = 91.5 msg/sec

QoS overhead (additional protocol messages), using each QoS level's share of the
device fleet (60% / 35% / 5%) as an approximation of its traffic share:
  QoS 0: 0 extra messages (60% of devices)
  QoS 1: 1 PUBACK per msg (35% of devices): 91.5 x 0.35 = 32 msg/sec
  QoS 2: 3 extra msgs per msg (5% of devices): 91.5 x 0.05 x 3 = 13.7 msg/sec
  (QoS 0 devices publish more often than their 60% device share, so this
  over-weights the QoS 1/2 overhead - a safe over-estimate for sizing.)

Total broker throughput: 91.5 + 32 + 13.7 = ~137 msg/sec
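The throughput arithmetic above as code, using the 60/35/5 percent QoS split assumed in this worked example:

```python
# (count, msgs/min) per device class, from the device table above
FLEET = {"temperature": (800, 1), "occupancy": (400, 2),
         "light": (300, 0.2), "hvac": (200, 0.5), "locks": (100, 0.1),
         "fire": (150, 0.05), "energy": (50, 1)}

inbound_per_min = sum(count * rate for count, rate in FLEET.values())
inbound = inbound_per_min / 60          # publishes reaching the broker, msg/sec
outbound = inbound * 3                  # avg 3 subscribers per topic

# QoS split as assumed in this worked example
qos1_overhead = outbound * 0.35         # one PUBACK per QoS 1 delivery
qos2_overhead = outbound * 0.05 * 3     # PUBREC/PUBREL/PUBCOMP per QoS 2 delivery
total = outbound + qos1_overhead + qos2_overhead

print(f"inbound {inbound:.1f}, outbound {outbound:.1f}, total {total:.0f} msg/sec")
```

Swapping in your own fleet dictionary and subscriber fan-out turns this into a reusable sizing check.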

Broker memory requirements:

Per-connection state:
  TCP buffer (send+receive): 8 KB
  SSL/TLS context:           4 KB (if encrypted)
  Session state:             2 KB
  Subscription tree:         0.5 KB (avg 5 subscriptions)
  ─────────────────────────────────────────
  Per connection: ~14.5 KB

Total connections: 2,000 devices + 50 subscribers (dashboards, loggers)
Connection memory: 2,050 x 14.5 KB = 29 MB

Message queue (persistent sessions for 500 actuators/locks):
  Worst case: device offline 1 hour, 60 QoS 1 messages queued
  Per device: 60 x (48 bytes + 32 bytes overhead) = 4.8 KB
  Total queue: 500 x 4.8 KB = 2.4 MB

Retained messages (one per status topic):
  2,000 devices x 128 bytes avg = 256 KB

Total broker RAM: 29 + 2.4 + 0.25 = ~32 MB minimum
Recommended: 128 MB (4x headroom for peak, subscriptions, routing table)

Bandwidth calculation:

Average payload: ~38 bytes (approximate weighted average across device classes)
MQTT overhead per message: ~12 bytes (fixed header + topic)
TCP/IP overhead: 40 bytes
TLS overhead: 29 bytes (record header + MAC)
Total per message: 38 + 12 + 40 + 29 = 119 bytes

Inbound bandwidth: 30.5 msg/sec x 119 bytes = 3.6 KB/s
Outbound bandwidth: 137 msg/sec x 119 bytes = 16.3 KB/s
Total: ~20 KB/s = 160 kbps

A basic 1 Mbps Ethernet link handles this with 84% headroom.
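The link-budget math above as a quick check, with the per-layer overheads as stated parameters:

```python
def per_message_bytes(payload=38, mqtt_hdr=12, tcpip=40, tls=29):
    """Total on-the-wire bytes per MQTT message, including TCP/IP and TLS."""
    return payload + mqtt_hdr + tcpip + tls

msg_bytes = per_message_bytes()               # 119 bytes per message
inbound_bps = 30.5 * msg_bytes * 8            # publishes only
outbound_bps = 137 * msg_bytes * 8            # deliveries + QoS overhead
total_kbps = (inbound_bps + outbound_bps) / 1000

print(f"{total_kbps:.1f} kbps")  # close to the ~160 kbps figure above
```

The parameters make the dominant cost obvious: for small IoT payloads, TCP/IP plus TLS framing (69 bytes) outweighs the application data itself.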

Python: MQTT QoS monitoring dashboard snippet:

import paho.mqtt.client as mqtt
from collections import defaultdict

class MQTTQoSMonitor:
    """Monitor QoS delivery rates for a smart building MQTT broker."""

    def __init__(self, broker_host, broker_port=1883):
        self.stats = defaultdict(lambda: {"sent": 0, "acked": 0})
        self.client = mqtt.Client(client_id="qos-monitor", protocol=mqtt.MQTTv5)
        self.client.on_message = self._on_message
        self.client.on_publish = self._on_publish
        self.client.connect(broker_host, broker_port)
        self.client.subscribe("building/#", qos=1)  # observe all building traffic
        self.client.loop_start()                    # network loop in background thread

    def _on_publish(self, client, userdata, mid):
        self.stats["publish"]["acked"] += 1

    def _on_message(self, client, userdata, msg):
        topic_type = msg.topic.split("/")[1]  # e.g., building/temp/floor3 -> temp
        self.stats[f"qos{msg.qos}_{topic_type}"]["sent"] += 1

    def report(self):
        """Print delivery statistics per QoS level."""
        # What to observe: QoS 0 messages may show gaps during
        # network congestion. QoS 1/2 should show 100% delivery.
        for key, val in sorted(self.stats.items()):
            rate = val["acked"] / max(val["sent"], 1) * 100
            print(f"  {key}: {val['sent']} sent, {val['acked']} acked ({rate:.1f}%)")

Real-World Reference: Microsoft’s Azure IoT Hub prices this kind of workload per message: one S1 Standard unit ($25/month) includes 400,000 messages/day, while 137 msg/sec works out to roughly 11.8 million messages/day - so a managed deployment would need many S1 units or a higher tier. The Eclipse Mosquitto open-source broker, by contrast, sustains tens of thousands of messages per second on a single 4-core VM, making this workload trivial for self-hosted deployments.

35.9.1 MQTT Broker Throughput Calculator

Size your MQTT broker by entering your device fleet composition. This calculator models inbound publish rates, outbound fan-out to subscribers, and QoS protocol overhead to estimate total broker throughput and bandwidth.

This concept connects to:

  • MQTT Architecture: QoS operates within pub-sub pattern to control delivery
  • TCP Reliability: QoS adds application-level acknowledgments on top of TCP guarantees
  • Energy-Efficient Protocols: QoS level choice directly impacts battery consumption
  • Database Transactions: QoS 2 provides an exactly-once guarantee similar to transaction isolation

Builds on:

  • Request-response acknowledgment patterns
  • State machine design for message tracking
  • Power management concepts for battery-powered devices

Enables:

  • Designing message delivery strategies that balance reliability vs efficiency
  • Implementing store-and-forward systems with persistent sessions
  • Calculating precise energy budgets for IoT deployments

35.10 See Also

MQTT Series:

Implementation:

Alternative Reliability Mechanisms:

35.11 Knowledge Check

35.12 Summary

This chapter covered MQTT’s reliability mechanisms:

  • QoS Levels: QoS 0 for replaceable data (1 message, may lose), QoS 1 for important events (2 messages, may duplicate), QoS 2 for critical commands (4 messages, exactly-once)
  • Energy Trade-offs: QoS 2 uses 4x messages of QoS 0 - significant battery impact on constrained devices
  • Persistent Sessions: Clean Session=0 with fixed client ID enables offline message queuing for devices receiving commands
  • Retained Messages: Store last-known value for immediate delivery to new subscribers - essential for device status
  • Last Will and Testament: Automatic “offline” notification when client disconnects unexpectedly - critical for monitoring
  • Keepalive: Balance disconnection detection speed vs power consumption (typically 30-300 seconds)

35.13 What’s Next

  • MQTT Production Deployment (clustering, security, and broker scaling): Apply QoS and session knowledge to production-grade MQTT deployments with TLS and load balancing
  • MQTT QoS Levels (deep-dive into QoS 0/1/2 packet flows): Understand the exact byte-level handshake messages and packet identifiers behind each QoS guarantee
  • MQTT QoS Worked Examples (real-world QoS scenario walkthroughs): See QoS level selection applied to smart home, industrial, and healthcare IoT case studies
  • MQTT Architecture Patterns (pub/sub topology and topic design): Understand how QoS and retained messages integrate with hierarchical topic structures
  • CoAP Fundamentals and Architecture (RESTful IoT protocol over UDP): Compare MQTT’s broker-based reliability with CoAP’s confirmable message (CON/NON) model
  • LoRaWAN Architecture (LPWAN network-layer acknowledgments): Contrast MQTT application-layer QoS with LoRaWAN’s confirmed uplink mechanism at the network layer