1208  MQTT QoS and Reliability

1208.1 Learning Objectives

By the end of this chapter, you will be able to:

  • Select Appropriate QoS Levels: Choose QoS 0, 1, or 2 based on data criticality and battery constraints
  • Understand Delivery Guarantees: Explain the message flow for each QoS level including acknowledgments
  • Configure Session Management: Use persistent sessions for offline message queuing
  • Implement Retained Messages: Store last-known values for immediate delivery to new subscribers
  • Design for Last Will: Configure LWT for graceful failure notification

1208.2 Prerequisites

Required Chapters:

Technical Background:

  • TCP reliability concepts
  • Battery power considerations for IoT devices

Estimated Time: 15 minutes

1208.3 MQTT QoS Levels Comparison

Choosing a Quality of Service (QoS) level is a trade-off between reliability and performance:

%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#2C3E50', 'primaryTextColor': '#fff', 'primaryBorderColor': '#16A085', 'lineColor': '#16A085', 'secondaryColor': '#E67E22', 'tertiaryColor': '#7F8C8D', 'fontSize': '14px'}}}%%
flowchart TB
    START["Choose QoS Level"] --> Q0{Data Type?}

    Q0 -->|"Replaceable<br/>Sensor Data"| QOS0["QoS 0<br/>At Most Once"]
    Q0 -->|"Important<br/>Alerts"| QOS1["QoS 1<br/>At Least Once"]
    Q0 -->|"Critical<br/>Commands"| QOS2["QoS 2<br/>Exactly Once"]

    QOS0 --> F0["✓ Fire-and-forget<br/>✓ 1 message<br/>✓ Fastest<br/>✗ May lose data"]
    QOS1 --> F1["✓ PUBLISH + PUBACK<br/>✓ 2 messages<br/>✓ Confirmed delivery<br/>✗ May duplicate"]
    QOS2 --> F2["✓ 4-way handshake<br/>✓ 4 messages (PUBLISH/PUBREC/PUBREL/PUBCOMP)<br/>✓ No duplicates<br/>✗ Highest overhead"]

    F0 --> E0["Example: Temperature<br/>every 10 seconds"]
    F1 --> E1["Example: Door opened<br/>alarm triggered"]
    F2 --> E2["Example: Unlock door<br/>dispense medication"]

    style START fill:#2C3E50,stroke:#16A085,stroke-width:2px,color:#fff
    style Q0 fill:#7F8C8D,stroke:#2C3E50,stroke-width:2px,color:#fff
    style QOS0 fill:#16A085,stroke:#2C3E50,stroke-width:2px,color:#fff
    style QOS1 fill:#E67E22,stroke:#2C3E50,stroke-width:2px,color:#fff
    style QOS2 fill:#2C3E50,stroke:#E67E22,stroke-width:2px,color:#fff
    style F0 fill:#d4edda,stroke:#16A085,stroke-width:1px,color:#000
    style F1 fill:#fff3cd,stroke:#E67E22,stroke-width:1px,color:#000
    style F2 fill:#e2e3e5,stroke:#2C3E50,stroke-width:1px,color:#000
    style E0 fill:#d4edda,stroke:#16A085,stroke-width:1px,color:#000
    style E1 fill:#fff3cd,stroke:#E67E22,stroke-width:1px,color:#000
    style E2 fill:#e2e3e5,stroke:#2C3E50,stroke-width:1px,color:#000

Figure 1208.1: MQTT QoS level selection decision tree by data criticality

MQTT Quality of Service (QoS) levels and their delivery guarantees: QoS 0 provides fire-and-forget transmission with no acknowledgment (fastest, 1 message), suitable for high-frequency, replaceable sensor data. QoS 1 ensures at-least-once delivery with a PUBLISH/PUBACK handshake (2 messages), appropriate for important alerts where duplicates are acceptable. QoS 2 guarantees exactly-once delivery with a 4-way handshake (PUBLISH/PUBREC/PUBREL/PUBCOMP), which suits financial transactions and medical commands despite having the highest overhead.

Figure 1208.2

1208.3.1 QoS Level Summary

| QoS | Name | Messages | Guarantee | Use Case |
|-----|------|----------|-----------|----------|
| 0 | At most once | 1 | Fire-and-forget, may lose | Temperature every 5 seconds |
| 1 | At least once | 2 | Confirmed, may duplicate | Door alerts, motion sensors |
| 2 | Exactly once | 4 | No loss, no duplicates | Financial, medical commands |
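To make the three guarantees concrete, the following is a minimal sketch of a toy model, not a real MQTT stack. The `Receiver` class, the `send` and `run` functions, and the `publish_lost` / `ack_lost` flags are hypothetical names invented for this illustration. The model assumes at most one loss per message and a single retransmission.

```python
from dataclasses import dataclass, field


@dataclass
class Receiver:
    """Toy receiver that records what the application actually sees."""
    delivered: list = field(default_factory=list)
    pending_qos2: set = field(default_factory=set)  # ids received, not yet released

    def on_publish(self, packet_id, payload, qos):
        if qos < 2:
            # QoS 0/1: every copy that arrives is handed to the application.
            self.delivered.append(payload)
        elif packet_id not in self.pending_qos2:
            # QoS 2: remember the packet id so a retransmission is ignored.
            self.pending_qos2.add(packet_id)
            self.delivered.append(payload)

    def on_pubrel(self, packet_id):
        # PUBREL ends the exchange; the packet id may be reused afterwards.
        self.pending_qos2.discard(packet_id)


def send(receiver, payload, qos, publish_lost=False, ack_lost=False):
    """Send one message over a link that may lose the first PUBLISH or the first ack."""
    packet_id = 1
    if not publish_lost:
        receiver.on_publish(packet_id, payload, qos)
    if qos == 0:
        return  # fire-and-forget: no acknowledgment, no retry
    if publish_lost or ack_lost:
        # The sender saw no PUBACK/PUBREC, so it retransmits with the same id.
        receiver.on_publish(packet_id, payload, qos)
    if qos == 2:
        receiver.on_pubrel(packet_id)  # PUBREL/PUBCOMP complete the handshake


def run(qos, **loss):
    receiver = Receiver()
    send(receiver, "unlock", qos, **loss)
    return receiver.delivered


print("QoS 0, PUBLISH lost:", run(0, publish_lost=True))
print("QoS 1, PUBACK lost: ", run(1, ack_lost=True))
print("QoS 2, PUBREC lost: ", run(2, ack_lost=True))
```

In this model a lost QoS 0 PUBLISH is never delivered, a lost PUBACK makes QoS 1 deliver the payload twice, and QoS 2 delivers it once because the receiver tracks the packet id until PUBREL.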

1208.3.2 Energy and Performance Trade-offs

Battery impact example: Sensor publishes 1 message/minute for 1 year.

| QoS | Packets/Year | Energy (20 mA x 50 ms per packet) |
|-----|--------------|-----------------------------------|
| QoS 0 | 525,600 | 146 mAh |
| QoS 1 | 1,051,200 (2x) | 292 mAh |
| QoS 2 | 2,102,400 (4x) | 584 mAh |
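The arithmetic behind an estimate of this kind can be sketched as follows. This is a rough model under stated assumptions: the radio draws a constant 20 mA for 50 ms per packet regardless of direction, and sleep current, TCP overhead, and retransmissions are ignored. The function name `yearly_energy_mah` and its parameters are hypothetical.

```python
# Packets exchanged per application message at each QoS level.
PACKETS_PER_MESSAGE = {0: 1, 1: 2, 2: 4}


def yearly_energy_mah(qos, messages_per_minute=1, current_ma=20, packet_ms=50):
    """Estimate radio energy per year in mAh for a given QoS level."""
    messages_per_year = messages_per_minute * 60 * 24 * 365
    packets = messages_per_year * PACKETS_PER_MESSAGE[qos]
    # mA x ms gives mA-milliseconds; one hour is 3,600,000 ms.
    return packets * current_ma * packet_ms / 3_600_000


for qos in (0, 1, 2):
    print(f"QoS {qos}: {yearly_energy_mah(qos):.0f} mAh/year")
```

Changing `messages_per_minute` or `packet_ms` shows how quickly the cost of a higher QoS level grows for chatty sensors.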

Key Insight: Use QoS 0 for high-frequency replaceable data, QoS 1 selectively for critical messages only.

1208.4 Common Misconception: MQTT Guarantees Message Delivery

Warning: Misconception: “MQTT Guarantees Message Delivery”

The Misconception: Many developers believe that using MQTT automatically ensures messages will reach subscribers, regardless of configuration.

The Reality: Message delivery guarantees depend on both the QoS level and the session configuration, as the following example shows.

Real-World Example - Smart Home Door Lock Failure (2023):

  • Scenario: Smart lock manufacturer used MQTT with default settings (QoS 0, clean_session=1)
  • Problem: Mobile app sent “unlock door” command while lock was temporarily offline
  • Result: 12% of unlock commands lost (measured over 50,000 operations), causing user complaints
  • Impact: Average 2.3 unlock failures per user per month, 34% increase in support tickets

Why Commands Were Lost:

  1. QoS 0: No acknowledgment or retry mechanism
  2. Clean Session=1: Broker discarded messages for offline clients
  3. No Retained Messages: Last command not stored for retrieval

Correct Configuration:

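import paho.mqtt.client as mqtt  # paho-mqtt 2.x API

BROKER = "broker.example.com"  # hypothetical broker host

# Wrong (manufacturer's original config)
client = mqtt.Client(mqtt.CallbackAPIVersion.VERSION2, clean_session=True)  # No persistence
client.connect(BROKER)
client.publish("lock/unlock", "unlock", qos=0)  # Fire-and-forget

# Right (fixed config): in paho-mqtt, clean_session is a Client() argument, not a connect() argument
client = mqtt.Client(mqtt.CallbackAPIVersion.VERSION2, client_id="lock_456", clean_session=False)  # Persistent session
client.connect(BROKER)
client.publish("lock/unlock", "unlock", qos=1)  # At-least-once delivery

Note that the persistent session must belong to the receiving client (the lock), which must also subscribe with QoS 1 or higher. The sender's QoS alone does not make the broker queue messages for an offline subscriber.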

After Fix:

  • Command loss rate dropped from 12% to 0.03% (400x improvement)
  • User complaints decreased by 89%
  • Support ticket cost reduced by $47,000/month

1208.5 Session Management

1208.5.1 Persistent Sessions (Clean Session = False)

Persistent session (Clean Session=0) enables offline message queuing:

How it works:

  1. Client connects with client_id="sensor_123" and clean_session=false
  2. Client subscribes to commands/sensor_123 (broker remembers subscription)
  3. Client disconnects (sleep mode)
  4. Server publishes command to commands/sensor_123 with QoS 1
  5. Broker queues message for offline sensor_123
  6. Client reconnects with same client_id, clean_session=false
  7. Broker delivers queued message immediately

Requirements:

  • Fixed client_id: Must use same ID across sessions (broker links queue to ID)
  • QoS >= 1: Only messages published and subscribed at QoS 1/2 are queued (QoS 0 messages are typically discarded while the client is offline)
  • Subscription persistence: Subscriptions survive disconnection (no need to re-SUBSCRIBE)
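The queuing behaviour described above can be sketched with a small in-memory model. This is an illustration only, not how any particular broker is implemented: `ToyBroker` and its methods are hypothetical, topics are matched exactly (no wildcards), and delivery to online clients is not modelled.

```python
class ToyBroker:
    """In-memory model of broker session handling (illustration only)."""

    def __init__(self):
        # client_id -> {"subs": set, "queue": list, "online": bool, "clean": bool}
        self.sessions = {}

    def connect(self, client_id, clean_session):
        """Connect a client and return any messages queued while it was offline."""
        session = self.sessions.get(client_id)
        if clean_session or session is None:
            # Clean session (or unknown client): start from empty state.
            session = {"subs": set(), "queue": []}
            self.sessions[client_id] = session
        session["online"] = True
        session["clean"] = clean_session
        queued, session["queue"] = session["queue"], []
        return queued

    def subscribe(self, client_id, topic):
        self.sessions[client_id]["subs"].add(topic)

    def disconnect(self, client_id):
        session = self.sessions[client_id]
        if session["clean"]:
            del self.sessions[client_id]   # clean session: state is discarded
        else:
            session["online"] = False      # persistent: keep subscriptions and queue

    def publish(self, topic, payload, qos):
        for session in self.sessions.values():
            offline = not session["online"]
            if topic in session["subs"] and offline and qos >= 1:
                session["queue"].append(payload)  # QoS 0 is dropped for offline clients


broker = ToyBroker()
broker.connect("sensor_123", clean_session=False)
broker.subscribe("sensor_123", "commands/sensor_123")
broker.disconnect("sensor_123")                              # device sleeps
broker.publish("commands/sensor_123", "reboot", qos=1)       # queued
broker.publish("commands/sensor_123", "ping", qos=0)         # dropped
delivered = broker.connect("sensor_123", clean_session=False)
print(delivered)
```

Only the QoS 1 command survives the sleep period. Reconnecting with a different client ID, or with `clean_session=True`, would return an empty list in this model.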

Trade-offs:

  • Broker stores queued messages (uses memory/disk - typically 100KB-10MB per client)
  • Potential message flood on reconnection (100s of queued messages delivered rapidly)

1208.5.2 Clean Sessions (Clean Session = True)

Clean Session=1: Discards session state on disconnect - suitable for high-frequency sensors that don’t need commands (temperature logger).

Best practices:

  • Use persistent sessions for devices receiving commands (actuators, configuration)
  • Use clean sessions for pure publishers (sensors)
  • Set message expiry (MQTT 5.0) to prevent infinite queuing
  • Monitor broker memory usage

1208.6 Retained Messages

Retained messages provide “last known value” to new subscribers instantly.

Normal operation: Subscriber connects after sensor publishes -> must wait until the next publish (for example, up to 30s if the sensor publishes every 30s) to receive data.

With retained message:

client.publish("home/temp", "24.5", qos=1, retain=True)
  • Broker stores the message until it is replaced or cleared
  • When subscriber connects -> immediately receives retained “24.5” (even if sensor published hours ago)

Use cases:

  1. Device status: device/status retained “online” - dashboard always knows current state
  2. Configuration: device/config retained settings - new dashboards get current config without querying device
  3. Slow-changing data: Room temperature, door state - subscribers need current value immediately, not historical

Caution:

  • Only ONE message retained per topic (new publish replaces old)
  • Clearing retained message: client.publish("topic", "", retain=True) (empty payload)
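The retained-message rules above (one message per topic, replaced by newer publishes, cleared by an empty payload) can be sketched as a small in-memory store. `RetainedStore` is a hypothetical name for this illustration. It matches topics exactly and ignores wildcards and QoS.

```python
class RetainedStore:
    """Toy model of a broker's retained-message table."""

    def __init__(self):
        self.retained = {}  # topic -> payload (at most one per topic)

    def publish(self, topic, payload, retain=False):
        if not retain:
            return  # non-retained publishes never touch the table
        if payload == "":
            self.retained.pop(topic, None)   # empty retained payload clears the entry
        else:
            self.retained[topic] = payload   # a new retained publish replaces the old one

    def subscribe(self, topic):
        """Return what a new subscriber receives immediately, or None."""
        return self.retained.get(topic)


store = RetainedStore()
store.publish("home/temp", "24.5", retain=True)
store.publish("home/temp", "25.1", retain=True)   # replaces "24.5"
store.publish("home/temp", "26.0")                # not retained, table unchanged
latest = store.subscribe("home/temp")
store.publish("home/temp", "", retain=True)       # clear
after_clear = store.subscribe("home/temp")
print(latest, after_clear)
```

A late subscriber sees only the most recent retained value, and nothing at all once the entry has been cleared.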

Best practice: Use retained messages for “state” topics (current temp, light status). Do not use them for “events” (button pressed, motion detected), which are temporal occurrences rather than states.

1208.7 Last Will and Testament (LWT)

Last Will and Testament (LWT) enables graceful failure notification.

Setup: During CONNECT, client specifies will message:

client.will_set("devices/sensor1/status", "offline", qos=1, retain=True)

Normal operation: Client publishes data, sends "online" to the status topic, and disconnects gracefully with DISCONNECT -> LWT not sent.

Unexpected disconnect: Power failure, network timeout -> broker doesn’t receive DISCONNECT -> broker publishes LWT "offline" to status topic -> subscribers notified of failure.
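The difference between the two cases can be sketched with a toy broker model. `WillBroker` and its method names are hypothetical. The model only tracks the stored will and what the broker publishes on the client's behalf.

```python
class WillBroker:
    """Toy model of how a broker handles Last Will and Testament messages."""

    def __init__(self):
        self.wills = {}       # client_id -> (topic, payload) registered at CONNECT
        self.published = []   # will messages the broker has published

    def connect(self, client_id, will_topic, will_payload):
        self.wills[client_id] = (will_topic, will_payload)

    def disconnect(self, client_id):
        """Graceful DISCONNECT: the stored will is discarded, nothing is published."""
        self.wills.pop(client_id, None)

    def connection_lost(self, client_id):
        """Keepalive timeout or socket error: the broker publishes the stored will."""
        will = self.wills.pop(client_id, None)
        if will is not None:
            self.published.append(will)


broker = WillBroker()

broker.connect("sensor1", "devices/sensor1/status", "offline")
broker.disconnect("sensor1")            # clean shutdown -> no LWT
print(broker.published)

broker.connect("sensor1", "devices/sensor1/status", "offline")
broker.connection_lost("sensor1")       # power failure -> LWT published
print(broker.published)
```

Only the second, unexpected disconnect produces an "offline" message on the status topic.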

Use cases:

  1. Device availability monitoring: Dashboard shows which devices offline
  2. Automated failover: Backup sensor activates when primary goes offline
  3. Alert generation: Email/SMS when critical device loses connection

Implementation pattern:

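import time

# `client` is assumed to be a connected paho-mqtt client whose will_set() was called before connect()
while True:
    client.publish("device/status", "online", retain=True)  # Heartbeat
    time.sleep(60)
    # If the connection is lost without a DISCONNECT, the broker publishes the LWT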

LWT + retained: Combine for persistent status - LWT sets “offline” as retained message, ensuring new subscribers see current state. Overwrite the retained “offline” on reconnection:

client.publish("device/status", "online", retain=True)

MQTT 5.0 enhancement: Will Delay Interval - delay LWT publication (e.g., 30s) to avoid false alarms during brief reconnections.

Production: All production IoT devices should implement LWT for operational monitoring.

1208.8 Keepalive and Connection Health

Keepalive interval determines disconnection detection latency.

How keepalive works:

  • Client sends PINGREQ if no messages sent in keepalive period
  • Broker responds PINGRESP
  • If broker doesn’t receive ANY message (publish or ping) within 1.5x keepalive: disconnects client

Example problem: Keepalive=60s, publishing every 10s. The network drops for 2s during active publishing and the connection breaks. The client may not notice until the keepalive timeout (60s) and keeps buffering messages locally. By the time the timeout fires, 6 messages are queued (60s / 10s = 6), causing a sudden burst on reconnection.

Optimal keepalive: Set to 2-3x the message interval. Publishing every 10s -> keepalive=30s, so the client notices a dead connection after about 30s and the broker drops it after at most 45s (1.5 x 30s).
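The timing rules in this section reduce to a few lines of arithmetic. The helper names below are hypothetical, and the default factor of 3 is simply the upper end of the 2-3x rule of thumb given above.

```python
def client_should_ping(seconds_since_last_send, keepalive_s):
    """Client side: send PINGREQ once nothing has been sent for a full keepalive period."""
    return seconds_since_last_send >= keepalive_s


def broker_timeout_s(keepalive_s):
    """Broker side: longest silence tolerated before the client is dropped (1.5x keepalive)."""
    return 1.5 * keepalive_s


def suggested_keepalive_s(publish_interval_s, factor=3):
    """Rule of thumb from this section: keepalive of 2-3x the publish interval."""
    return publish_interval_s * factor


keepalive = suggested_keepalive_s(10)
print(keepalive, broker_timeout_s(keepalive), client_should_ping(10, keepalive))
```

For a sensor publishing every 10 seconds this gives a 30 second keepalive and a 45 second broker timeout, and the client never needs to ping while it keeps publishing on schedule.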

Trade-offs:

| Keepalive | Detection | Power | Broker Load |
|-----------|-----------|-------|-------------|
| Short (10s) | Quick | Higher (periodic pings) | More |
| Long (300s) | Delayed | Lower | Less |

AWS IoT: supports keepalive values of 30-1200 seconds; choose within that range depending on battery constraints. Mobile networks: 60-300s (cellular keep-alive).

Production: Implement exponential backoff reconnection: First reconnect immediately, then 1s, 2s, 4s, 8s, max 60s. Prevents reconnection storms when broker restarts.
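The reconnection schedule described above can be sketched as a generator. `backoff_delays` is a hypothetical helper that only produces the delays; the actual reconnect call is left to the client library.

```python
from itertools import islice


def backoff_delays(max_delay=60.0):
    """Yield reconnect delays in seconds: 0, 1, 2, 4, 8, ... capped at max_delay."""
    yield 0.0          # first attempt is immediate
    delay = 1.0
    while True:
        yield delay
        delay = min(delay * 2, max_delay)


schedule = list(islice(backoff_delays(), 9))
print(schedule)
```

A common refinement, not covered in this chapter, is to add a small random offset to each delay so that many devices do not retry at the same instant after a broker restart.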

Think of MQTT QoS like mail delivery options:

| Mail Type | QoS Level | What Happens |
|-----------|-----------|--------------|
| Postcard | QoS 0 | Drop in mailbox, hope it arrives. No tracking. |
| Certified Mail | QoS 1 | Carrier confirms delivery. Might accidentally deliver twice. |
| Registered Mail | QoS 2 | Full tracking, signature required. Expensive but guaranteed. |

When to use each:

  • QoS 0: Temperature every 5 seconds - missing one is fine
  • QoS 1: Door sensor alerts - must be delivered, duplicates OK
  • QoS 2: Payment confirmations - no duplicates, no losses

The most common mistake: Using QoS 2 for everything “just to be safe” - this uses 4x the packets of QoS 0, wasting battery and bandwidth!

1208.9 Summary

This chapter covered MQTT’s reliability mechanisms:

  • QoS Levels: QoS 0 for replaceable data (1 message, may lose), QoS 1 for important events (2 messages, may duplicate), QoS 2 for critical commands (4 messages, exactly-once)
  • Energy Trade-offs: QoS 2 uses 4x messages of QoS 0 - significant battery impact on constrained devices
  • Persistent Sessions: Clean Session=0 with fixed client ID enables offline message queuing for devices receiving commands
  • Retained Messages: Store last-known value for immediate delivery to new subscribers - essential for device status
  • Last Will and Testament: Automatic “offline” notification when client disconnects unexpectedly - critical for monitoring
  • Keepalive: Balance disconnection detection speed vs power consumption (typically 30-300 seconds)

1208.10 What’s Next

Continue exploring MQTT with these related chapters: