23  MQTT Session Management

In 60 Seconds

MQTT session management determines whether the broker remembers a client’s subscriptions and queues messages during disconnection. A persistent session (clean_session=false) preserves subscriptions and buffers QoS 1/2 messages for offline clients, while a clean session starts fresh on every connection. Misconfiguring sessions is a top production pitfall, causing either unbounded queue growth or silent message loss.

23.1 Learning Objectives

By the end of this chapter, you will be able to:

  • Configure Session Persistence: Select and configure clean versus persistent sessions for specific device types and use cases
  • Implement Secure MQTT: Apply TLS encryption, certificate-based authentication, and topic-level ACLs for production deployments
  • Distinguish Session Behaviors: Compare clean session and persistent session behaviors and justify the choice for a given scenario
  • Design Reconnection Strategies: Construct exponential backoff with jitter to prevent thundering-herd reconnection storms
  • Diagnose Session Issues: Identify and resolve message loss, orphaned sessions, queue overflow, and QoS mismatch problems
  • Calculate Queue Memory: Assess broker memory requirements using the queue growth formula for persistent sessions at scale
  • Evaluate QoS Delivery: Analyze the effective end-to-end QoS resulting from publisher and subscriber QoS combinations

Key Concepts:

  • MQTT: Message Queuing Telemetry Transport, a pub/sub protocol optimized for constrained IoT devices over unreliable networks
  • Broker: Central server routing messages from publishers to all matching subscribers by topic pattern
  • Topic: Hierarchical string (e.g., home/bedroom/temperature) used to route messages to interested subscribers
  • QoS Level: Quality of Service 0/1/2 trading delivery guarantee for message overhead
  • Retained Message: Last message on a topic stored by broker for immediate delivery to new subscribers
  • Last Will and Testament: Pre-configured message published by broker when a client disconnects ungracefully
  • Persistent Session: Broker stores subscriptions and pending messages allowing clients to resume after disconnection

23.2 For Beginners: MQTT Session Management

When a device disconnects and reconnects to an MQTT broker, what happens to the messages it missed? Session management handles this. A persistent session saves the device’s subscriptions and queues missed messages for later delivery. It is like pausing a movie and picking up right where you left off.

“I sleep for 10 minutes at a time to save energy,” said Bella the Battery. “But when I wake up, have I missed important messages?”

Max the Microcontroller explained the two options: “With Clean Session = true, the broker forgets you the moment you disconnect. When you reconnect, you start fresh – no saved subscriptions, no queued messages. Any messages sent while you slept are gone.”

“But with Clean Session = false,” continued Sammy the Sensor, “the broker remembers you! It keeps your subscriptions active and queues any QoS 1 or QoS 2 messages that arrive while you’re asleep. When you wake up and reconnect, you get all the missed messages delivered in order.”

Lila the LED added a warning: “Be careful though – if Bella sleeps for hours and thousands of messages pile up, the broker’s memory fills up. That’s why you set a session expiry interval in MQTT 5. It tells the broker: ‘Remember me for 30 minutes. After that, clean up.’ Balance memory savings with message reliability!”
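The difference between the two options can be sketched with a toy in-memory model. This is not a real broker or any library's API: `ToyBroker`, its methods, and the topic names are hypothetical, topics are matched by exact string only, and delivery to online clients is not modeled.

```python
from collections import deque


class ToyBroker:
    """Toy in-memory model of MQTT session handling (illustration only)."""

    def __init__(self):
        self.sessions = {}   # client_id -> {"subs": set, "queue": deque, "clean": bool}
        self.online = set()  # client IDs that are currently connected

    def connect(self, client_id, clean_session):
        # A clean session (or an unknown client) always starts from empty state.
        if clean_session or client_id not in self.sessions:
            self.sessions[client_id] = {"subs": set(), "queue": deque()}
        session = self.sessions[client_id]
        session["clean"] = clean_session
        self.online.add(client_id)
        # Hand over whatever was queued while the client was offline.
        missed = list(session["queue"])
        session["queue"].clear()
        return missed

    def subscribe(self, client_id, topic):
        self.sessions[client_id]["subs"].add(topic)

    def disconnect(self, client_id):
        self.online.discard(client_id)
        if self.sessions[client_id]["clean"]:
            del self.sessions[client_id]  # the broker forgets a clean session

    def publish(self, topic, payload, qos):
        for client_id, session in self.sessions.items():
            if topic not in session["subs"]:
                continue
            if client_id not in self.online and qos >= 1:
                session["queue"].append(payload)  # only QoS 1/2 is queued


def missed_after_sleep(clean_session):
    """Subscribe, go offline, and return what is waiting on reconnect."""
    broker = ToyBroker()
    broker.connect("bella", clean_session)
    broker.subscribe("bella", "home/cmd")
    broker.disconnect("bella")
    broker.publish("home/cmd", "lights-on", qos=1)
    broker.publish("home/cmd", "status-ping", qos=0)
    return broker.connect("bella", clean_session)


print(missed_after_sleep(clean_session=False))  # ['lights-on']
print(missed_after_sleep(clean_session=True))   # []
```

With the persistent session only the QoS 1 message survives the sleep; the QoS 0 message is dropped in both cases.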

For persistent sessions (clean_session=false), the broker queues QoS 1/2 messages for offline clients:

Queue memory per offline client: $ M_{\text{queue}} = N_{\text{msgs}} \times (M_{\text{payload}} + M_{\text{overhead}}) $

Where:

  • \(N_{\text{msgs}}\): Number of queued messages
  • \(M_{\text{payload}}\): Average message size (~100 bytes for typical IoT)
  • \(M_{\text{overhead}}\): Broker metadata per message (~40 bytes)

Concrete example (sensor offline for 1 hour, 1 msg/min): $ M_{\text{queue}} = 60 \times (100 + 40) = 8{,}400 \text{ bytes} $

Fleet scaling (1000 sensors offline simultaneously): $ M_{\text{fleet}} = 1000 \times 8{,}400 \text{ bytes} = 8.4 \text{ MB} $

Worst-case scenario (sensor offline 24 hours): $ M_{\text{queue}} = 1440 \times 140 = 201{,}600 \text{ bytes} \approx 197 \text{ KB} $

For 10,000 sensors: \(10{,}000 \times 197\text{ KB} \approx 1.88\text{ GB}\)

Broker queue limit (MQTT 5.0 message expiry): Set message_expiry_interval = 3600 (1 hour) to drop messages older than 1 hour: $ M_{\text{queue}} = 60 \times 140 \text{ bytes} = 8.4 \text{ KB} $

Lesson: Persistent sessions scale poorly for long offline periods or high message rates. Use session expiry and message expiry to bound queue growth.
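The arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch: the function names are made up for illustration, and the 100-byte payload and 40-byte overhead defaults are the chapter's rough estimates, not measured broker values.

```python
def queued_messages(offline_minutes, msgs_per_minute, expiry_minutes=None):
    """Messages held for one offline client, optionally capped by message expiry."""
    window = offline_minutes
    if expiry_minutes is not None:
        window = min(offline_minutes, expiry_minutes)
    return window * msgs_per_minute


def queue_bytes(n_msgs, payload_bytes=100, overhead_bytes=40):
    """Queue memory for one client: N_msgs * (M_payload + M_overhead)."""
    return n_msgs * (payload_bytes + overhead_bytes)


one_hour = queue_bytes(queued_messages(60, 1))
one_day = queue_bytes(queued_messages(1440, 1))
one_day_capped = queue_bytes(queued_messages(1440, 1, expiry_minutes=60))

print(one_hour)            # 8400 bytes per sensor
print(1000 * one_hour)     # 8400000 bytes for 1,000 sensors
print(one_day)             # 201600 bytes per sensor
print(one_day_capped)      # 8400 bytes with a 1-hour message expiry
```

The capped case shows why message expiry bounds the queue: after the first hour, the per-client cost stops growing no matter how long the device stays offline.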

23.3 Interactive Lab: MQTT QoS Comparison

Let’s build an experiment that demonstrates the real differences between QoS 0, 1, and 2!

Lab Setup

Hardware (Simulated):

  • ESP32 publisher (sends messages with different QoS levels)
  • Simulated unreliable network (random packet loss)
  • Message counter to track deliveries

What This Lab Does:

  1. Publishes 100 messages with each QoS level
  2. Simulates 20% network packet loss
  3. Counts actual deliveries and duplicates
  4. Measures battery impact (message transmission time)

23.3.1 QoS Comparison Simulation

Code Explanation:

#include <WiFi.h>
#include <PubSubClient.h>

const char* ssid = "Wokwi-GUEST";
const char* password = "";
const char* mqtt_server = "test.mosquitto.org";

WiFiClient espClient;
PubSubClient mqttClient(espClient);

// Statistics tracking
int qos0_sent = 0, qos0_acked = 0;
int qos1_sent = 0, qos1_acked = 0, qos1_duplicates = 0;
int qos2_sent = 0, qos2_acked = 0;

unsigned long qos0_time = 0, qos1_time = 0, qos2_time = 0;

// Simulate packet loss (20% chance)
bool simulatePacketLoss() {
  return (random(100) < 20);  // 20% packet loss
}

void callback(char* topic, byte* payload, unsigned int length) {
  // Track received messages (for subscriber)
  // payload is not null-terminated, so print exactly `length` bytes
  Serial.print("Received: ");
  Serial.write(payload, length);
  Serial.println();
}

void setup() {
  Serial.begin(115200);
  WiFi.begin(ssid, password);

  while (WiFi.status() != WL_CONNECTED) {
    delay(500);
    Serial.print(".");
  }

  mqttClient.setServer(mqtt_server, 1883);
  mqttClient.setCallback(callback);

  while (!mqttClient.connected()) {
    if (mqttClient.connect("ESP32_QoS_Test")) {
      Serial.println("Connected to MQTT broker!");
    } else {
      delay(5000);
    }
  }

  Serial.println("\n=== QoS Comparison Test ===\n");
  runQoSTest();
}

void runQoSTest() {
  Serial.println("Testing QoS 0 (Fire and Forget)...");
  testQoS0();

  delay(2000);

  Serial.println("\nTesting QoS 1 (At Least Once)...");
  testQoS1();

  delay(2000);

  Serial.println("\nTesting QoS 2 (Exactly Once)...");
  testQoS2();

  delay(2000);

  printResults();
}

void testQoS0() {
  unsigned long start = millis();

  for (int i = 0; i < 100; i++) {
    char msg[50];
    snprintf(msg, sizeof(msg), "QoS0_Message_%d", i);

    if (!simulatePacketLoss()) {
      mqttClient.publish("test/qos0", msg);  // QoS 0 (PubSubClient publishes at QoS 0 only)
      qos0_sent++;
      qos0_acked++;  // Assume success (no actual confirmation)
    } else {
      qos0_sent++;
      Serial.printf("QoS0 Message %d lost (no retry)\n", i);
    }

    delay(10);
  }

  qos0_time = millis() - start;
  Serial.printf("QoS 0 complete: %d sent, ~%d delivered, %d lost\n",
                qos0_sent, qos0_acked, qos0_sent - qos0_acked);
}

void loop() {
  mqttClient.loop();
  // Test runs once in setup()
}

Try It: QoS Delivery Simulator

Adjust the network conditions and message count to see how each QoS level performs under different packet loss scenarios.

23.3.2 Lab Results Analysis

Expected Results (with 20% simulated packet loss):

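| QoS | Messages Delivered | Duplicates | Time (ms) | Battery Impact |
|-----|--------------------|------------|-----------|----------------|
| 0 | ~80/100 (80%) | 0 | 1,000 | Baseline (100%) |
| 1 | 100/100 (100%) | ~4-6 | 1,500 | ~150% |
| 2 | 100/100 (100%) | 0 | 2,500 | ~250% |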

Key Observations:

  1. QoS 0 loses ~20% of messages (matches packet loss rate)
  2. QoS 1 delivers all messages, but creates duplicates when PUBACK is lost
  3. QoS 2 delivers all messages exactly once, no duplicates
  4. QoS 2 takes 2.5x longer than QoS 0 (4-way handshake overhead)
  5. Battery impact scales with transmission time: QoS 2 uses about 2.5x the energy of QoS 0
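The qualitative pattern in these observations can be reproduced without hardware. The sketch below is a simplified model, not the lab firmware: it assumes every packet (PUBLISH or acknowledgement) is lost independently with the same probability, that the sender retries until acknowledged, and that a QoS 1 receiver cannot recognize a repeated PUBLISH. The packet count is only a rough proxy for airtime, so its ratios will not match the lab table.

```python
import random


def simulate(qos, n_messages=100, loss=0.2, seed=1):
    """Count deliveries, duplicates and packets sent for one QoS level."""
    rng = random.Random(seed)

    def lost():
        return rng.random() < loss

    delivered = duplicates = packets = 0
    for _ in range(n_messages):
        if qos == 0:
            packets += 1                 # single PUBLISH, never retried
            if not lost():
                delivered += 1
            continue

        received = False
        while True:                      # PUBLISH until PUBACK/PUBREC returns
            packets += 1
            if lost():
                continue                 # PUBLISH lost: sender times out, retries
            if not received:
                received = True
                delivered += 1
            elif qos == 1:
                duplicates += 1          # repeated PUBLISH reaches the app again
            packets += 1                 # PUBACK (QoS 1) or PUBREC (QoS 2)
            if not lost():
                break

        if qos == 2:
            while True:                  # PUBREL until PUBCOMP returns
                packets += 1
                if lost():
                    continue
                packets += 1
                if not lost():
                    break
    return {"delivered": delivered, "duplicates": duplicates, "packets": packets}


for level in (0, 1, 2):
    print(level, simulate(level))
```

In this model QoS 1 and QoS 2 always deliver every message, QoS 1 duplicates appear exactly when an acknowledgement is lost, and QoS 2 pays for its guarantee with at least four packets per message.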

23.4 Security Considerations

Basic MQTT (port 1883) sends data unencrypted. For production:

  1. Use MQTT over TLS (port 8883)
  2. Enable authentication (username/password)
  3. Implement access control (topic-level permissions)
  4. Use a private broker (don’t rely on public brokers)

Unencrypted MQTT: A Critical Security Risk

Never deploy MQTT on port 1883 without TLS in production environments. Unencrypted MQTT transmits credentials and data in plain text, so network sniffers can capture usernames, passwords, and all sensor readings. An attacker on your Wi-Fi network can see every temperature reading, door lock command, and camera feed. Always use MQTTS (port 8883) with TLS 1.2+ and certificate-based authentication for production deployments. Public test brokers like test.mosquitto.org are fine for learning, but never for real applications.
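As a rough illustration of the "TLS 1.2+ with certificates" requirement, the sketch below builds a client-side TLS context with Python's standard `ssl` module. The helper name and its parameters are assumptions made for this example; the file paths would point at your own CA and device certificate.

```python
import ssl


def make_mqtt_tls_context(ca_file=None, cert_file=None, key_file=None):
    """Build a client-side TLS context suitable for MQTT over TLS (port 8883)."""
    # create_default_context() enables certificate and hostname verification.
    # With ca_file=None the system trust store is used.
    ctx = ssl.create_default_context(ssl.Purpose.SERVER_AUTH, cafile=ca_file)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2  # refuse TLS 1.0 / 1.1
    if cert_file is not None:
        # Client certificate for certificate-based (mutual TLS) authentication.
        ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    return ctx


ctx = make_mqtt_tls_context()
print(ctx.minimum_version, ctx.verify_mode, ctx.check_hostname)
```

With the paho-mqtt Python client, a context like this can typically be handed to the client through `tls_set_context()` before connecting to port 8883.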

Public MQTT Brokers: Never for Production

Using public brokers (test.mosquitto.org, broker.hivemq.com) for real IoT deployments is dangerous:

  • Anyone worldwide can subscribe to your topics and see all data
  • No authentication means anyone can publish malicious commands
  • Zero privacy for sensor readings or control commands
  • Unreliable service with no guarantees

Deploy a private broker (Mosquitto, HiveMQ, AWS IoT Core) with TLS, authentication, and topic-level ACLs. The cost is minimal compared to the security risk.
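Topic-level ACLs come down to matching a topic against filters that may contain the MQTT wildcards `+` (one level) and `#` (all remaining levels). The following sketch shows the idea. The `ACL` table, user names, and `is_allowed` helper are hypothetical and do not follow any broker's configuration syntax, and the special handling of `$`-prefixed topics is left out.

```python
def topic_matches(topic_filter, topic):
    """Return True if an MQTT topic filter matches a concrete topic."""
    filter_levels = topic_filter.split("/")
    topic_levels = topic.split("/")
    for i, level in enumerate(filter_levels):
        if level == "#":
            # '#' is only valid as the last level and matches everything below,
            # including the parent level itself ("a/#" matches "a").
            return i == len(filter_levels) - 1
        if i >= len(topic_levels):
            return False
        if level != "+" and level != topic_levels[i]:
            return False
    return len(filter_levels) == len(topic_levels)


# Hypothetical ACL: user -> list of (access, topic filter)
ACL = {
    "sensor-01": [("write", "sensors/sensor-01/#")],
    "dashboard": [("read", "sensors/+/temperature")],
}


def is_allowed(user, access, topic):
    """Check whether a user may read or write a topic under the ACL above."""
    return any(
        rule_access == access and topic_matches(topic_filter, topic)
        for rule_access, topic_filter in ACL.get(user, [])
    )


print(is_allowed("sensor-01", "write", "sensors/sensor-01/temperature"))  # True
print(is_allowed("sensor-01", "write", "sensors/sensor-02/temperature"))  # False
print(is_allowed("dashboard", "read", "sensors/sensor-02/temperature"))   # True
```

Denying by default, as `ACL.get(user, [])` does for unknown users, keeps a misconfigured device from publishing anywhere.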

23.5 Knowledge Check

Test your understanding of these networking concepts.

23.6 Common Pitfalls

Common Pitfall: Misunderstanding MQTT QoS Levels

The mistake: MQTT QoS levels (0, 1, 2) are often misunderstood. QoS 0 offers no delivery guarantee, QoS 1 guarantees at-least-once delivery (may duplicate), and QoS 2 guarantees exactly-once delivery. Using the wrong level leads to message loss or unnecessary overhead.

Symptoms:

  • Message loss when QoS 0 used for critical data
  • Duplicate processing when QoS 2 expected but QoS 1 used
  • Battery drain from unnecessary QoS 2
  • High latency from QoS 2 handshake

Wrong approach:

# Using QoS 0 for critical alerts - messages may be lost!
client.publish("alerts/fire", "Fire detected!", qos=0)

# Using QoS 2 for frequent sensor readings - wastes bandwidth
while True:
    client.publish("sensors/temp", read_temp(), qos=2)
    time.sleep(1)

Correct approach:

# Use QoS 1 or 2 for critical messages
client.publish("alerts/fire", "Fire detected!", qos=2)

# Use QoS 0 for high-frequency, non-critical data
client.publish("sensors/temp", read_temp(), qos=0)

# Use QoS 1 for important but duplicable data
client.publish("metrics/hourly", summary, qos=1)

How to avoid:

  • Match QoS level to message criticality
  • Use QoS 0 for high-frequency telemetry
  • Use QoS 1 for commands and alerts
  • Use QoS 2 only for exactly-once requirements
  • Consider battery and bandwidth impact

Try It: QoS Level Selection Advisor

Describe your IoT use case and get a recommended QoS level. Adjust the parameters to see how different requirements change the recommendation.
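The selection guidance in this pitfall can be condensed into a small decision helper. This is an illustrative heuristic only: the function name and its two yes/no inputs are assumptions, and real deployments may weigh battery and bandwidth differently.

```python
def recommend_qos(loss_tolerable, duplicates_tolerable):
    """Pick the lowest QoS level that still meets the delivery requirement."""
    if loss_tolerable:
        return 0  # high-frequency telemetry: the next reading replaces a lost one
    if duplicates_tolerable:
        return 1  # must arrive, and the receiver can handle a repeat
    return 2      # must arrive exactly once


print(recommend_qos(loss_tolerable=True, duplicates_tolerable=True))    # 0
print(recommend_qos(loss_tolerable=False, duplicates_tolerable=True))   # 1
print(recommend_qos(loss_tolerable=False, duplicates_tolerable=False))  # 2
```

Starting from the cheapest level and moving up only when a requirement forces it mirrors the advice to reserve QoS 2 for true exactly-once needs.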

Pitfall: Expecting Persistent Sessions to Queue Messages for Publishers

The Mistake: Developers configure clean_session=false on publishing devices (sensors), expecting the broker to buffer their outbound messages when the network is down.

Why It Happens: The term “persistent session” suggests messages are persisted in both directions. Developers assume that if subscribers get queued messages, publishers should too. The MQTT spec is clear but often misread: persistent sessions only queue messages to clients, not from them.

The Fix: Implement local message buffering on the publisher side. When mqttClient.connected() returns false, store messages locally (SPIFFS, SD card, or RAM buffer) and publish them on reconnection.

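// ESP32 local buffering pattern (assumes the PubSubClient mqttClient from the lab)
#include <string.h>

#define BUFFER_SIZE 100
struct BufferedMessage {
  char topic[64];
  char payload[256];
  uint8_t qos;  // Intended QoS; PubSubClient itself publishes at QoS 0 only
};
BufferedMessage buffer[BUFFER_SIZE];
int bufferIndex = 0;

void publishWithBuffer(const char* topic, const char* payload, uint8_t qos) {
  if (mqttClient.connected()) {
    // Flush buffer first, oldest message first
    for (int i = 0; i < bufferIndex; i++) {
      // PubSubClient's third publish() argument is "retained", not QoS,
      // so the stored qos value must not be passed here
      mqttClient.publish(buffer[i].topic, buffer[i].payload);
    }
    bufferIndex = 0;
    // Then publish current message
    mqttClient.publish(topic, payload);
  } else if (bufferIndex < BUFFER_SIZE) {
    // Store locally when offline; terminate explicitly because
    // strncpy does not add '\0' when the source fills the field
    BufferedMessage& slot = buffer[bufferIndex];
    strncpy(slot.topic, topic, sizeof(slot.topic) - 1);
    slot.topic[sizeof(slot.topic) - 1] = '\0';
    strncpy(slot.payload, payload, sizeof(slot.payload) - 1);
    slot.payload[sizeof(slot.payload) - 1] = '\0';
    slot.qos = qos;
    bufferIndex++;
  }
  // When the buffer is full, new messages are silently dropped
}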

MQTT 3.1.1 Spec Reference: Section 3.1.2.4 states “If CleanSession is set to 0, the Server MUST resume communications with the Client based on state from the current Session.” The “state” includes subscriptions and inflight messages to the client, not messages the client wants to send.

Pitfall: Using Random Client IDs with Persistent Sessions

The Mistake: Developers enable persistent sessions (clean_session=false) but use auto-generated or random client IDs like ESP32_ + random() or allow the library to generate one.

Why It Happens: Many MQTT libraries generate unique client IDs automatically to avoid ID collisions. Developers don’t realize this breaks persistent session restoration. Each reconnection creates a new session with different ID, so queued messages and subscriptions from the previous session are orphaned and eventually expire.

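The Fix: Use a stable, unique client ID derived from hardware identifiers. For ESP32, use the MAC address or chip ID. For MQTT 5.0, you can also let the broker assign an ID via the Assigned Client Identifier property in CONNACK; the client must then store that ID and reuse it on later connections, otherwise each connection still starts a new session.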

// MQTT 3.1.1: Derive stable ID from hardware
char clientId[32];
uint64_t chipId = ESP.getEfuseMac();  // Unique per chip
snprintf(clientId, sizeof(clientId), "ESP32_%04X%08X",
         (uint16_t)(chipId >> 32), (uint32_t)chipId);

// Connect with persistent session
if (mqttClient.connect(clientId, user, pass,
                       willTopic, willQos, willRetain, willMessage,
                       false)) {  // clean_session = false
  // Session restored - subscriptions and queued messages available
}

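# MQTT 5.0: Let broker assign ID (paho-mqtt Python)
# broker and port are assumed to be defined elsewhere
import paho.mqtt.client as mqtt
from paho.mqtt.packettypes import PacketTypes
from paho.mqtt.properties import Properties

client = mqtt.Client(mqtt.CallbackAPIVersion.VERSION2,
                     client_id="",  # Empty = broker assigns
                     protocol=mqtt.MQTTv5)
connect_props = Properties(PacketTypes.CONNECT)
# Without a Session Expiry Interval, an MQTT 5.0 session ends on disconnect
connect_props.SessionExpiryInterval = 3600  # seconds (example value)
client.connect(broker, port, clean_start=False,
               properties=connect_props)
# Check assigned ID in CONNACK properties, then store and reuse it;
# connecting with an empty ID again starts a new session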

Broker Configuration (Mosquitto): Set persistent_client_expiration to control how long orphaned sessions are kept. Default is infinite, which can consume broker memory if clients use random IDs.

# mosquitto.conf - expire orphaned sessions after 7 days
persistent_client_expiration 7d

Try It: Client ID Collision Risk Calculator

Random client IDs with persistent sessions cause orphaned sessions. Explore how the probability of collision grows with fleet size and ID length.
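Random IDs carry a second risk besides orphaned sessions: two devices can draw the same ID, in which case the broker disconnects the older connection each time the other one connects. A rough estimate of that risk follows from the birthday approximation \(p \approx 1 - e^{-n(n-1)/(2N)}\) for \(n\) devices and \(N\) possible IDs. The sketch below assumes IDs are drawn uniformly at random; the function name is made up for illustration.

```python
import math


def collision_probability(n_clients, id_bits):
    """Approximate chance that at least two clients share a random ID."""
    space = 2 ** id_bits
    if n_clients > space:
        return 1.0  # more clients than IDs: a collision is certain
    exponent = n_clients * (n_clients - 1) / (2 * space)
    return -math.expm1(-exponent)  # 1 - e^(-x), accurate for tiny x


for bits in (16, 32, 48):
    p = collision_probability(1000, bits)
    print(f"{bits}-bit random suffix, 1,000 devices: p = {p:.3g}")
```

A 16-bit random suffix is almost certain to collide in a 1,000-device fleet, while a 48-bit value such as a MAC-derived ID makes accidental collisions negligible.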

Pitfall: QoS Mismatch Between Publisher and Subscriber

The Mistake: Developers configure QoS 2 on the publisher side, expecting guaranteed exactly-once delivery to subscribers, but subscribers connect with QoS 0 or 1. They’re confused when messages are duplicated or lost at the subscriber despite using QoS 2 for publishing.

Why It Happens: MQTT QoS is not end-to-end; it applies separately to publisher-to-broker and broker-to-subscriber segments. The effective QoS for delivery is the minimum of the two. Publishing with QoS 2 to a subscriber with QoS 0 subscription results in QoS 0 delivery (fire-and-forget) to that subscriber.

The Fix: Match QoS levels across the entire message path. If exactly-once delivery is required, both publisher and subscriber must use QoS 2. Document QoS requirements in your API specification and validate them during system integration testing.

# Publisher: QoS 2 for critical command
client.publish("factory/line1/emergency_stop", "STOP", qos=2)

# Subscriber: MUST also use QoS 2 for exactly-once delivery
def on_connect(client, userdata, flags, reason_code, properties):
    # WRONG: QoS 0 subscription downgrades all deliveries
    # client.subscribe("factory/line1/emergency_stop", qos=0)

    # CORRECT: Match publisher QoS for end-to-end guarantee
    client.subscribe("factory/line1/emergency_stop", qos=2)

# Effective QoS = min(publisher_qos, subscriber_qos)
# QoS 2 publish + QoS 0 subscribe = QoS 0 delivery (NO guarantee!)
# QoS 2 publish + QoS 2 subscribe = QoS 2 delivery (exactly-once)

Real-World Impact: A manufacturing plant configured emergency stop commands with QoS 2 on HMI panels, but the PLC subscribers used default QoS 0 subscriptions. During a network glitch, the broker retransmitted the stop command (QoS 2 retry), but the PLCs, receiving at QoS 0, processed it as a new command, causing a production line to halt twice. Cost: 4 hours of downtime ($85,000).

Pitfall: Session Expiry Flooding on Broker Restart

The Mistake: Developers deploy hundreds of IoT devices with persistent sessions (clean_session=false) and long keep-alive intervals (300+ seconds). When the broker restarts or fails over, all devices attempt to reconnect simultaneously, overwhelming the broker with CONNECT packets and queued message delivery.

Why It Happens: Persistent sessions are designed to survive brief disconnections, but broker restarts trigger mass reconnection. With 1,000 devices each having 50 queued messages, the broker must deliver 50,000 messages within seconds while also handling 1,000 simultaneous CONNECT handshakes. Default broker configurations often can’t handle this “thundering herd” scenario.

The Fix: Implement staggered reconnection with exponential backoff and jitter. Configure a per-client queue limit on the broker (max_queued_messages in Mosquitto) to limit queue buildup. For MQTT 5.0, use Session Expiry Interval to automatically clean up stale sessions.

import random
import time

import paho.mqtt.client as mqtt


class MQTTClientWithBackoff:
    def __init__(self, client_id):
        self.client_id = client_id
        # Assumption: MQTT 3.1.1 persistent session, matching this pitfall
        self.client = mqtt.Client(mqtt.CallbackAPIVersion.VERSION2,
                                  client_id=client_id, clean_session=False)
        self.base_delay = 1.0  # 1 second base
        self.max_delay = 120.0  # 2 minute cap
        self.attempt = 0

    def connect_with_backoff(self, broker, port):
        while True:
            try:
                self.client.connect(broker, port)
                self.attempt = 0  # Reset on success
                return
            except OSError as e:  # Refused, timed out, DNS failure, ...
                self.attempt += 1
                # Exponential backoff: 1s, 2s, 4s, 8s... capped at 120s
                delay = min(self.base_delay * (2 ** (self.attempt - 1)), self.max_delay)
                # Add jitter: random 0-50% of delay to spread reconnections
                jitter = random.uniform(0, delay * 0.5)
                total_delay = delay + jitter
                print(f"Attempt {self.attempt} failed ({e}); retrying in {total_delay:.1f}s")
                time.sleep(total_delay)

# Broker configuration (mosquitto.conf)
# Limit queue buildup per client (Mosquitto 2.x)
# max_queued_messages 1000
# max_inflight_messages 20
#
# Expire persistent sessions after 1 hour offline (Mosquitto option;
# MQTT 5.0 clients can instead send a Session Expiry Interval on CONNECT)
# persistent_client_expiration 1h

Sizing Guide: For N devices with Q average queued messages and M bytes per message, broker restart requires handling N × Q × M bytes immediately. Example: 1,000 devices × 100 messages × 500 bytes = 50 MB burst. Ensure broker memory can handle 2-3x this peak load.

Try It: Exponential Backoff with Jitter Visualizer

See how exponential backoff with jitter spreads reconnection attempts over time, preventing thundering herd problems. Compare no-backoff (instant reconnection) with backoff strategies.

When should you use clean vs persistent sessions? Use this decision table:

| Use Case | Session Type | Rationale |
|----------|--------------|-----------|
| Sensor publishing telemetry only | Clean | No need to queue messages to publisher; saves broker memory |
| Command receiver (actuator, device) | Persistent | Must receive commands issued while offline |
| Mobile app (online only) | Clean | User expects fresh data on each app launch |
| Fleet tracking dashboard | Persistent | Must receive updates even during brief disconnections |
| Temporary debug client | Clean | No need to preserve subscriptions |
| Critical infrastructure controller | Persistent | Cannot miss any commands; session restoration essential |

Session memory cost example:

  • 1,000 devices with persistent sessions
  • Average 5 subscriptions per device
  • Average 10 queued messages per device
  • Broker memory: ~15 MB (1,000 devices × (5 KB per session + 10 KB for queued messages))

When persistent sessions create problems:

  • Random client IDs (session never restored)
  • No session expiry configured (orphaned sessions consume memory forever)
  • Thousands of devices reconnecting simultaneously after broker restart
  • Devices never cleaning up old sessions (memory leak)

Best practice: Use persistent sessions for command receivers, clean sessions for simple telemetry publishers. Always set session_expiry_interval (MQTT 5.0) or persistent_client_expiration (Mosquitto) to prevent unbounded memory growth.

23.7 Knowledge Check

Apply the decision framework above to select the correct session type.

23.8 Interactive Calculators

23.8.1 Session Queue Memory Calculator

Estimate broker memory consumed by persistent sessions for offline IoT devices. Adjust device count, offline duration, and message rate to see memory impact.

23.8.2 Reconnection Storm Simulator

Model the burst load when a broker restarts and all devices with persistent sessions reconnect simultaneously. See how exponential backoff with jitter spreads the load.
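A rough version of this model fits in a few lines. The sketch assumes every device is on the same retry attempt when the broker comes back and groups connection attempts into one-second buckets; the backoff parameters mirror the 1-second base, 120-second cap, and 0-50% jitter used earlier in the chapter, and the function names are made up for illustration.

```python
import random
from collections import Counter


def reconnect_delay(attempt, rng, base_delay=1.0, max_delay=120.0, jitter=True):
    """Delay before the given retry attempt (1-based), in seconds."""
    delay = min(base_delay * 2 ** (attempt - 1), max_delay)
    if jitter:
        delay += rng.uniform(0, delay * 0.5)  # spread devices over 0-50% extra
    return delay


def peak_connects_per_second(times):
    """Largest number of CONNECT attempts falling into any one-second bucket."""
    return max(Counter(int(t) for t in times).values())


rng = random.Random(42)
n_devices = 1000
no_backoff = [0.0] * n_devices
fixed_backoff = [reconnect_delay(5, rng, jitter=False) for _ in range(n_devices)]
with_jitter = [reconnect_delay(5, rng) for _ in range(n_devices)]

print("no backoff:        ", peak_connects_per_second(no_backoff))
print("backoff, no jitter:", peak_connects_per_second(fixed_backoff))
print("backoff + jitter:  ", peak_connects_per_second(with_jitter))
```

Backoff without jitter only postpones the storm, because every device waits the same 16 seconds. The random component is what spreads the CONNECT load across the following seconds.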

23.8.3 Effective QoS Calculator

MQTT QoS is not end-to-end. The effective delivery QoS is the minimum of the publisher and subscriber QoS levels. Explore what happens with different combinations.
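The full set of combinations is small enough to print. This sketch only encodes the minimum rule stated above; the function name is an illustrative choice.

```python
def effective_qos(publish_qos, subscribe_qos):
    """Delivery QoS from broker to subscriber: the lower of the two levels."""
    for level in (publish_qos, subscribe_qos):
        if level not in (0, 1, 2):
            raise ValueError(f"invalid QoS level: {level}")
    return min(publish_qos, subscribe_qos)


print("publish  subscribe  effective")
for pub in (0, 1, 2):
    for sub in (0, 1, 2):
        print(f"{pub:^7}  {sub:^9}  {effective_qos(pub, sub):^9}")
```

Only the bottom-right cell (publish 2, subscribe 2) yields exactly-once delivery; every other combination is limited by the weaker side.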

23.8.4 Local Buffer Sizing Tool

When the MQTT broker does not queue messages for publishers, devices need local buffering. Calculate RAM or flash storage needed for offline message storage.
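A back-of-the-envelope version of this calculation is sketched below. The default slot layout (64-byte topic, 256-byte payload, 1 byte of metadata) mirrors the BufferedMessage struct from the publisher-buffering pitfall; the function name and the fixed-size-slot assumption are illustrative.

```python
import math


def buffer_requirements(offline_seconds, publish_interval_seconds,
                        topic_bytes=64, payload_bytes=256, meta_bytes=1):
    """Return (slots, bytes) needed to hold every message produced while offline."""
    slots = math.ceil(offline_seconds / publish_interval_seconds)
    return slots, slots * (topic_bytes + payload_bytes + meta_bytes)


# One reading per minute, one hour without connectivity
slots, size = buffer_requirements(3600, 60)
print(slots, size)  # 60 slots, 19260 bytes

# One reading every 10 seconds, 24 hours without connectivity
slots, size = buffer_requirements(24 * 3600, 10)
print(slots, size)  # 8640 slots, 2773440 bytes
```

The first case fits comfortably in RAM on an ESP32-class device, while the second, at roughly 2.8 MB, points toward flash or SD storage and smaller, variable-length records.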

Knowledge Check: Matching and Sequencing

Test your understanding of key MQTT session management concepts before moving on.

23.9 Summary

This chapter covered MQTT session management and security:

  • Clean Sessions forget all state on disconnect, ideal for simple publishers that don’t need offline message queuing
  • Persistent Sessions maintain subscriptions and queue QoS 1/2 messages for offline clients, essential for devices receiving commands during sleep
  • Security requires TLS encryption (port 8883), authentication, topic-level ACLs, and private brokers for production deployments
  • Common Pitfalls include expecting publishers to have messages queued, using random client IDs with persistent sessions, QoS mismatches, and reconnection storms
  • Exponential Backoff with jitter prevents thundering herd problems when many devices reconnect simultaneously
  • Broker Configuration must account for queue limits, session expiry, and memory requirements for large-scale deployments

23.10 What’s Next

| Chapter | Focus | Why Read It |
|---------|-------|-------------|
| MQTT QoS Worked Examples | Real-world QoS and session selection | Applies the clean vs persistent decision framework to fleet tracking, door locks, medical telemetry, and sleep-wake sensors |
| MQTT Labs and Implementation | Hands-on ESP32 and Python MQTT projects | Lets you implement persistent sessions, TLS, and backoff strategies from scratch in a working codebase |
| MQTT QoS Fundamentals | QoS 0/1/2 mechanics and handshakes | Reviews the underlying delivery guarantees that determine when persistent session queuing is needed |
| MQTT QoS Levels | PUBACK, PUBREC, PUBREL, PUBCOMP packet flows | Deepens understanding of the QoS 2 four-way handshake referenced in the pitfall on QoS mismatch |
| MQTT Comprehensive Review | Advanced MQTT 5.0 patterns and production design | Covers session expiry intervals, shared subscriptions, and broker clustering for large-scale deployments |