77  State Machine Patterns

In 60 Seconds

Three core IoT state machine patterns: connection management (DISCONNECTED -> CONNECTING -> CONNECTED with exponential backoff up to 5 minutes), duty cycling (SLEEP -> WAKE -> SAMPLE -> TRANSMIT with 99%+ sleep ratio for multi-year battery life), and safety actuator control (SAFE -> ARMED -> ACTIVE with hardware watchdog timeouts of 100-500ms). Always design transitions as one-way safety valves – any anomaly should default to the safest state.

77.1 Learning Objectives

By the end of this chapter, you will be able to:

  • Construct Connection Patterns: Design state machines for network connection management with exponential backoff
  • Implement Duty Cycling: Create power-efficient sampling state machines that maximise battery life
  • Architect Safety Systems: Build actuator control with multiple protection layers and hardware watchdog enforcement
  • Select Appropriate Patterns: Justify the choice of state machine pattern for different IoT scenarios

State machine patterns are reusable templates for managing device behavior. Think of a vending machine that follows a clear sequence: waiting for coins, accepting selection, dispensing product, giving change. Each step is a state, and the rules for moving between states are clear. These patterns help you design reliable IoT device behavior that handles every situation predictably.

77.2 Prerequisites

Before diving into this chapter, you should be familiar with:

77.3 Introduction

State machine patterns are reusable solutions to common IoT design challenges. Rather than designing from scratch, experienced developers apply proven patterns that handle edge cases and failure modes. This chapter presents three essential patterns for IoT applications.

Time: ~15 min | Difficulty: Intermediate | Unit: P04.FSM.U03

77.4 Pattern 1: Connection State Machine

A common pattern for IoT devices managing network connections:

Connection state machine diagram showing states for disconnected, connecting, connected, and reconnecting with exponential backoff

77.4.1 Key Features

| Feature | Implementation | Benefit |
|---|---|---|
| Exponential Backoff | Double wait time after each failure | Prevents network congestion |
| Retry Limits | Maximum attempts before giving up | Avoids infinite loops |
| Fast Reconnect | Shorter delays for lost connections | Maintains user experience |
| Connection State Tracking | Clear CONNECTED/DISCONNECTED states | Reliable data transmission |

77.4.2 Implementation Considerations

  1. Backoff Strategy: Start with 1 second, double each retry, cap at 60 seconds
  2. Retry Limits: 5 retries for initial connection, 10 for reconnection
  3. Connection Monitoring: Heartbeat or ping to detect silent failures
  4. State Persistence: Remember last-known-good credentials
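
The backoff strategy in item 1 can be sketched as a small generator. This is a minimal illustration using the numbers given above (1 second start, doubling, 60 second cap); the function name `backoff_delays` and its parameters are hypothetical and not part of any library.

```python
def backoff_delays(base_s=1, cap_s=60, retries=10):
    """Yield the wait (in seconds) before each retry: doubles, then stays at cap_s."""
    delay = base_s
    for _ in range(retries):
        yield delay
        delay = min(delay * 2, cap_s)


# Wait times for the 10 reconnection retries suggested above
schedule = list(backoff_delays())
print(schedule)
```

With these defaults the cap is reached on the seventh retry, so the later reconnection attempts are all spaced 60 seconds apart.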

77.4.3 Example Use Cases

  • MQTT client connections
  • Wi-Fi station mode
  • Cellular modem management
  • WebSocket connections
  • BLE central device pairing

77.5 Pattern 2: Sensor Sampling State Machine

Efficient power management through state-based duty cycling:

Sensor sampling state machine showing deep sleep, waking, sampling, processing, and transmitting states for duty cycling

77.5.1 Power Consumption by State

| State | Current Draw | Duration | Charge per Cycle |
|---|---|---|---|
| DEEP_SLEEP | 10 uA | 60 seconds | 0.6 mAs |
| WAKING | 20 mA | 100 ms | 2 mAs |
| SAMPLING | 50 mA | 200 ms | 10 mAs |
| PROCESSING | 30 mA | 50 ms | 1.5 mAs |
| TRANSMITTING | 150 mA | 500 ms | 75 mAs |
| Total per cycle | - | ~61 seconds | ~89 mAs |
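
The totals in the table can be reproduced with a few lines of Python. This is a sketch that simply multiplies current by duration for each row; the variable names are illustrative, and the values are copied from the table above.

```python
# (state, current in mA, duration in s), taken from the table above
STATES = [
    ("DEEP_SLEEP", 0.010, 60.0),
    ("WAKING", 20.0, 0.100),
    ("SAMPLING", 50.0, 0.200),
    ("PROCESSING", 30.0, 0.050),
    ("TRANSMITTING", 150.0, 0.500),
]

charge_mas = sum(i_ma * t_s for _, i_ma, t_s in STATES)  # mA*s per cycle
period_s = sum(t_s for _, _, t_s in STATES)              # length of one cycle
avg_current_ma = charge_mas / period_s                   # average draw over a cycle
tx_share = (150.0 * 0.500) / charge_mas                  # fraction spent transmitting

print(f"charge per cycle: {charge_mas:.1f} mAs")
print(f"cycle length:     {period_s:.2f} s")
print(f"average current:  {avg_current_ma:.2f} mA")
print(f"transmit share:   {tx_share:.0%}")
```

The transmit state accounts for most of the charge in each cycle, which is why batching readings (fewer transmissions) tends to pay off more than shortening the sampling step.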

77.5.2 Implementation Considerations

  1. Wake Source: RTC timer vs. external interrupt vs. both
  2. Peripheral Initialization: Lazy vs. eager initialization
  3. Buffering Strategy: Store locally when network unavailable
  4. Sample Aggregation: Reduce transmissions by batching data
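
Items 3 and 4 (buffering and aggregation) can be sketched as a small state machine in plain Python. This is an illustration only: the class name `SamplingFSM`, the `read_sensor` and `send` callables, and `batch_size` are hypothetical, the wake source is not modelled, and the buffer has no size cap.

```python
DEEP_SLEEP, WAKING, SAMPLING, PROCESSING, TRANSMITTING = (
    "DEEP_SLEEP", "WAKING", "SAMPLING", "PROCESSING", "TRANSMITTING")


class SamplingFSM:
    """One update() call advances the duty cycle by one state."""

    def __init__(self, read_sensor, send, batch_size=4):
        self.read_sensor = read_sensor  # hypothetical callable returning a reading
        self.send = send                # hypothetical callable: list -> True on success
        self.batch_size = batch_size
        self.state = DEEP_SLEEP
        self.buffer = []                # readings waiting to be transmitted

    def update(self):
        if self.state == DEEP_SLEEP:
            self.state = WAKING         # wake source (RTC/interrupt) not modelled
        elif self.state == WAKING:
            self.state = SAMPLING
        elif self.state == SAMPLING:
            self.buffer.append(self.read_sensor())
            self.state = PROCESSING
        elif self.state == PROCESSING:
            # Aggregation: only power the radio once a full batch is buffered
            full = len(self.buffer) >= self.batch_size
            self.state = TRANSMITTING if full else DEEP_SLEEP
        elif self.state == TRANSMITTING:
            if self.send(list(self.buffer)):
                self.buffer.clear()     # delivered; on failure keep for next cycle
            self.state = DEEP_SLEEP
        else:
            self.state = DEEP_SLEEP     # unknown state: fall back to sleep
        return self.state
```

If `send` reports failure, the readings stay in the buffer and are retried together with the next batch, so a network outage costs extra memory rather than lost data.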

77.5.3 Example Use Cases

  • Environmental monitoring stations
  • Agricultural soil sensors
  • Asset tracking devices
  • Wildlife monitoring collars
  • Smart meter endpoints

77.6 Pattern 3: Actuator Safety State Machine

Safety-critical actuator control with multiple protection layers:

Actuator safety state machine with normal operation, limiting, emergency stop, and lockout states for safety-critical control

77.6.1 Safety Layers

| Layer | Trigger | Response | Recovery |
|---|---|---|---|
| Normal | Operator command | Controlled start/stop | Automatic |
| Limiting | Threshold exceeded | Reduce power/speed | Automatic when safe |
| Emergency Stop | Critical fault or button | Immediate halt | Requires admin |
| Lockout | Confirmed emergency | Disable actuator | Manual reset only |

77.6.2 Implementation Considerations

  1. Fail-Safe Default: Always default to SAFE_OFF on unknown state
  2. Watchdog Timer: Force EMERGENCY_STOP if state machine hangs
  3. Dual-Channel Input: Redundant sensors for critical measurements
  4. Audit Logging: Record all state transitions with timestamps
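
The layered behaviour and the fail-safe default can be sketched as a pure transition function. This is a simplified software model, not a certified safety design: the event names (`start`, `threshold_exceeded`, `admin_clear`, and so on) are hypothetical, and a real watchdog would be a hardware timer rather than an event passed to this function.

```python
SAFE_OFF, NORMAL, LIMITING, EMERGENCY_STOP, LOCKOUT = (
    "SAFE_OFF", "NORMAL", "LIMITING", "EMERGENCY_STOP", "LOCKOUT")
KNOWN_STATES = {SAFE_OFF, NORMAL, LIMITING, EMERGENCY_STOP, LOCKOUT}

# (state, event) -> next state; event names are illustrative
TRANSITIONS = {
    (SAFE_OFF, "start"): NORMAL,
    (NORMAL, "stop"): SAFE_OFF,
    (NORMAL, "threshold_exceeded"): LIMITING,
    (LIMITING, "back_in_range"): NORMAL,         # automatic recovery when safe
    (LIMITING, "stop"): SAFE_OFF,
    (EMERGENCY_STOP, "admin_clear"): SAFE_OFF,   # requires admin
    (EMERGENCY_STOP, "fault_confirmed"): LOCKOUT,
    (LOCKOUT, "manual_reset"): SAFE_OFF,         # manual reset only
}
FAULT_EVENTS = {"critical_fault", "estop_button", "watchdog_timeout"}


def next_state(state, event):
    """Return the next state; anything unexpected resolves toward safety."""
    if state not in KNOWN_STATES:
        return SAFE_OFF                          # fail-safe default
    if state == LOCKOUT:
        # Nothing but the manual reset leaves LOCKOUT
        return TRANSITIONS.get((state, event), LOCKOUT)
    if event in FAULT_EVENTS:
        return EMERGENCY_STOP                    # faults pre-empt normal handling
    return TRANSITIONS.get((state, event), state)
```

Checking faults before the ordinary transition table means a fault event wins no matter which operating state the actuator is in, which is the behaviour the Emergency Stop layer describes.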

77.6.3 Example Use Cases

  • Industrial motor control
  • HVAC damper actuators
  • Automated valve systems
  • Robotic arm controllers
  • Medical pump controllers

77.7 Choosing the Right Pattern

Diagram illustrating a multi-step decision process for adaptive frequency hopping channel management

77.7.1 Pattern Combinations

Many real-world IoT devices combine multiple patterns:

  1. Smart Thermostat: Connection + Safety patterns
  2. Remote Sensor Node: Sampling + Connection patterns
  3. Industrial Controller: All three patterns

When combining patterns, use hierarchical state machines to keep the design manageable.

State machine patterns are like recipe books for robot brains – proven recipes that engineers use again and again!

77.7.2 The Sensor Squad Adventure: Three Magic Recipes

The Sensor Squad discovered that most IoT devices need the same three “brain recipes” (patterns):

Recipe 1 – The Never-Give-Up Connector (Connection Pattern): Sammy the Sensor needed to connect to the Wi-Fi, but sometimes the signal was weak. Instead of trying again immediately (which would be like knocking on someone’s door 100 times per second – super annoying!), Sammy followed the “polite knock” recipe:

  • First try: Wait 1 second
  • Second try: Wait 2 seconds
  • Third try: Wait 4 seconds
  • Keep doubling, but never wait more than 5 minutes!

This way, Sammy keeps trying but does not overwhelm the router.

Recipe 2 – The Sleepy Sensor (Sampling Pattern): Bella the Battery had a problem. If she powered the temperature sensor ALL the time, the battery would die in 2 days. So she created a schedule:

  • Sleep for 60 seconds (uses barely any power – like a hibernating bear!)
  • Wake up for 0.1 seconds
  • Read the temperature for 0.2 seconds
  • Send the reading for 0.5 seconds
  • Go back to Sleep!

Result: The battery lasts 5 YEARS instead of 2 days!

Recipe 3 – The Safety Guard (Safety Pattern): Lila the LED controlled a factory robot arm. Safety was the #1 priority! The recipe had LAYERS:

  • Normal: Robot works fine
  • Warning: “Slow down, something looks off”
  • Emergency Stop: “FREEZE! Something is wrong!”
  • Lockout: “No one touches this until a human checks it”

Max the Microcontroller said, “The most important rule? When in doubt, ALWAYS go to the safest state. A robot that stops is better than a robot that breaks things!”

77.7.3 Key Words for Kids

| Word | What It Means |
|---|---|
| Pattern | A proven recipe that solves a common problem – reuse it instead of starting from scratch |
| Exponential Backoff | Waiting longer and longer between tries, so you do not annoy the server |
| Duty Cycling | Sleeping most of the time and only waking up briefly to do work |
| Fail-Safe | Always defaulting to the safest option when something goes wrong |

Key Takeaway

In one sentence: Three essential state machine patterns – connection management (exponential backoff), duty cycling (sleep-wake-sample-transmit), and safety control (layered fail-safe states) – solve the vast majority of IoT device behavior challenges.

Remember this rule: Always design transitions as one-way safety valves – any anomaly should default to the safest state, and dangerous fault recovery should require deliberate human intervention.

77.8 Code Example: Connection State Machine with Exponential Backoff

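This MicroPython implementation demonstrates the connection pattern from Pattern 1. The state machine manages Wi-Fi connectivity with exponential backoff and retry limits. It records a heartbeat timestamp but detects link loss by polling wlan.isconnected(), so active heartbeat checks are left as an extension: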

import time, network

class ConnectionFSM:
    """States: DISCONNECTED -> CONNECTING -> CONNECTED -> RECONNECTING
    Backoff doubles per failure: 1s, 2s, 4s, 8s... up to max_backoff_s."""
    DISCONNECTED, CONNECTING, CONNECTED, RECONNECTING = (
        "DISCONNECTED", "CONNECTING", "CONNECTED", "RECONNECTING")

    def __init__(self, ssid, password, max_retries=5, max_backoff_s=60):
        self.ssid, self.password = ssid, password
        self.max_retries, self.max_backoff_s = max_retries, max_backoff_s
        self.state, self.retries, self.backoff_s = self.DISCONNECTED, 0, 1
        self.wlan = network.WLAN(network.STA_IF)
        self.last_heartbeat, self.heartbeat_interval_s = 0, 30

    def _attempt_connect(self):
        self.wlan.active(True)
        self.wlan.connect(self.ssid, self.password)
        deadline = time.time() + 10
        while time.time() < deadline:
            if self.wlan.isconnected(): return True
            time.sleep(0.5)
        return False

    def update(self):
        """Run one cycle of the state machine. Call in main loop."""
        if self.state == self.DISCONNECTED:
            self.state = self.CONNECTING
        elif self.state == self.CONNECTING:
            if self._attempt_connect():
                self.state, self.retries, self.backoff_s = self.CONNECTED, 0, 1
                self.last_heartbeat = time.time()
            else:
                self.retries += 1
                if self.retries >= self.max_retries:
                    self.state, self.retries, self.backoff_s = self.DISCONNECTED, 0, 1
                else:
                    time.sleep(self.backoff_s)
                    self.backoff_s = min(self.backoff_s * 2, self.max_backoff_s)
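        elif self.state == self.CONNECTED:
            if not self.wlan.isconnected():
                # Link dropped: start reconnecting with a fresh retry budget
                self.state, self.retries, self.backoff_s = self.RECONNECTING, 0, 1
        elif self.state == self.RECONNECTING:
            # Mirrors CONNECTING, but with 2x patience (twice the retry limit)
            if self._attempt_connect():
                self.state, self.retries, self.backoff_s = self.CONNECTED, 0, 1
                self.last_heartbeat = time.time()
            else:
                self.retries += 1
                if self.retries >= 2 * self.max_retries:
                    self.state, self.retries, self.backoff_s = self.DISCONNECTED, 0, 1
                else:
                    time.sleep(self.backoff_s)
                    self.backoff_s = min(self.backoff_s * 2, self.max_backoff_s)
        return self.state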

# Usage
fsm = ConnectionFSM("MyNetwork", "MyPassword")
while True:
    if fsm.update() == ConnectionFSM.CONNECTED:
        pass  # Safe to transmit sensor data
    time.sleep(1)

The exponential backoff (1s, 2s, 4s, 8s...) prevents network congestion when many devices reconnect simultaneously after an outage. The separate RECONNECTING state uses more patient retry limits because a previously-working connection is likely to recover.

Common Pitfalls

Firmware that implements device behavior with nested if-else checks on global variables is an implicit state machine that becomes unmaintainable. When a new state (e.g., FIRMWARE_UPDATE) must be added, every if-else branch must be audited for correctness. An explicit state machine with defined states and transitions makes adding new states a localized change with clear boundary conditions.
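
As a rough illustration of why the explicit form localizes changes, the sketch below keeps all transitions in one table. The state names follow the examples used in this chapter; the event names and the helper functions `step` and `reachable` are hypothetical.

```python
# Explicit transition table: (current state, event) -> next state
TRANSITIONS = {
    ("IDLE", "timer"): "SENSING",
    ("SENSING", "sample_ready"): "TRANSMITTING",
    ("TRANSMITTING", "tx_done"): "IDLE",
    ("TRANSMITTING", "tx_failed"): "ERROR",
    ("ERROR", "reset"): "IDLE",
}

# Adding FIRMWARE_UPDATE is a local change: new rows, no existing row edited
TRANSITIONS.update({
    ("IDLE", "update_available"): "FIRMWARE_UPDATE",
    ("FIRMWARE_UPDATE", "update_done"): "IDLE",
    ("FIRMWARE_UPDATE", "update_failed"): "ERROR",
})


def step(state, event):
    """Return the next state; events with no matching row are ignored."""
    return TRANSITIONS.get((state, event), state)


def reachable(start="IDLE"):
    """States reachable from start, a cheap sanity check on the table."""
    seen, todo = {start}, [start]
    while todo:
        cur = todo.pop()
        for (src, _), dst in TRANSITIONS.items():
            if src == cur and dst not in seen:
                seen.add(dst)
                todo.append(dst)
    return seen
```

Because the new state only appears in its own rows, an update request that arrives while SENSING or TRANSMITTING is simply ignored until the device is back in IDLE, and that boundary is visible directly in the table.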

Key Concepts
  • Finite State Machine (FSM): A computational model representing a device’s behavior as a finite set of states, with transitions triggered by events or conditions – the standard pattern for modeling IoT device lifecycle
  • State: A distinct mode of operation for an IoT device (IDLE, SENSING, TRANSMITTING, ERROR, SLEEP) where specific behaviors and valid transitions are defined
  • Transition: A directed change from one state to another triggered by an event or condition, optionally executing actions (entry/exit/transition actions) during the state change
  • Entry/Exit Action: Code executed automatically when entering or leaving a state, used to configure hardware (enable ADC on entry), release resources (disable radio on exit), or log state changes for debugging
  • Guard Condition: A boolean expression that must be true for a transition to occur, enabling conditional state changes based on sensor values, battery levels, or network connectivity status
  • Hierarchical State Machine (HSM): An FSM extension where states can contain substates, enabling shared behavior to be defined in parent states and overriding only specific behaviors in child states – reduces duplication in complex IoT device firmware
  • Watchdog Timer: A hardware timer that resets the device if not periodically cleared by software, ensuring the state machine always reaches a safe state even when firmware enters an unexpected condition
  • State Explosion: An FSM anti-pattern where the number of states and transitions grows combinatorially as features are added, making the model incomprehensible – addressed by using hierarchical state machines or orthogonal regions

IoT devices in the field encounter unexpected conditions (sensor hangs, lost connections) that prevent the state machine from receiving the expected event to progress. Every state that waits for an external event must have a timeout transition leading to a safe state (usually ERROR or RESET). Without timeouts, devices get permanently stuck in intermediate states, appearing ‘frozen’ without restarting.
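
One way to sketch the timeout rule is to record when each state was entered and compare against a per-state limit on every loop pass. The class name `TimedState`, the timeout values, and the injectable `clock` parameter are assumptions for illustration; the example uses CPython's `time.monotonic`, and on MicroPython `time.ticks_ms()` with `time.ticks_diff()` would be the usual substitute.

```python
import time


class TimedState:
    """Tracks the current state and forces ERROR when a state overstays its limit."""

    # Maximum time (seconds) each waiting state may last; values are illustrative
    TIMEOUTS_S = {"CONNECTING": 10.0, "SAMPLING": 2.0, "TRANSMITTING": 5.0}

    def __init__(self, state="IDLE", clock=time.monotonic):
        self.clock = clock
        self.enter(state)

    def enter(self, state):
        """Switch to a state and remember when it was entered."""
        self.state = state
        self.entered_at = self.clock()

    def check_timeout(self):
        """Call on every main-loop pass; returns True if a timeout fired."""
        limit = self.TIMEOUTS_S.get(self.state)
        if limit is not None and self.clock() - self.entered_at > limit:
            self.enter("ERROR")
            return True
        return False
```

States without an entry in `TIMEOUTS_S` (such as IDLE or ERROR here) never time out, so every state that waits on an external event needs its own row.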

Logged state machine transitions are invaluable for diagnosing field failures. A device that reaches the ERROR state without a logged transition history forces engineers to reproduce complex sequences on the bench. Log every transition with timestamp, previous state, trigger event, and new state. This adds minimal overhead but can cut debugging time from days to minutes.
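
A minimal sketch of such a log, assuming an in-memory list is acceptable, is shown below. The class name `LoggedFSM`, the `max_entries` bound, and the injectable `clock` are hypothetical; a fielded device would more likely write entries to flash or send them upstream.

```python
import time


class LoggedFSM:
    """Records (timestamp, previous state, trigger, new state) for every transition."""

    def __init__(self, initial, clock=time.time, max_entries=64):
        self.state = initial
        self.clock = clock
        self.max_entries = max_entries  # bound memory use on small devices
        self.history = []

    def transition(self, new_state, trigger):
        """Log the transition, then apply it."""
        entry = (self.clock(), self.state, trigger, new_state)
        self.history.append(entry)
        if len(self.history) > self.max_entries:
            self.history.pop(0)         # drop the oldest entry
        self.state = new_state
        return entry
```

Routing every state change through one `transition` method is what makes the log complete: there is no other place where `state` is assigned.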

77.9 Summary

State machine patterns provide battle-tested solutions for common IoT challenges:

  • Connection Pattern: Manages network reliability with backoff and retry logic
  • Sampling Pattern: Optimizes power consumption through duty cycling
  • Safety Pattern: Ensures reliable actuator control with protection layers

Applying these patterns reduces development time, improves reliability, and leverages lessons learned from thousands of production IoT deployments.

77.10 Pattern Review Questions

Why does the connection pattern use exponential backoff instead of fixed retry intervals?

  A) To reduce code complexity
  B) To prevent overwhelming the network during outages
  C) To make the code run faster
  D) Because it is required by Wi-Fi standards

Answer: B) To prevent overwhelming the network during outages

Exponential backoff spreads out retry attempts, preventing thousands of devices from simultaneously hammering the network when it recovers from an outage. This is especially important for large-scale IoT deployments where simultaneous reconnection attempts could cause secondary failures.

In the sensor sampling pattern, why is there a separate BUFFERING state instead of just staying in PROCESSING?

  A) To save memory
  B) To handle offline operation gracefully
  C) To make the code simpler
  D) Because sensors require buffering

Answer: B) To handle offline operation gracefully

The BUFFERING state allows the device to store data locally when the network is unavailable, then return to sleep to conserve power. This ensures data is not lost during connectivity outages while maintaining the power efficiency of the duty cycling pattern.

What is the purpose of the LOCKOUT state in the actuator safety pattern?

  A) To save power
  B) To prevent unauthorized access
  C) To ensure dangerous faults require deliberate human intervention to clear
  D) To simplify the state machine

Answer: C) To ensure dangerous faults require deliberate human intervention to clear

The LOCKOUT state prevents automatic recovery from serious faults. This ensures that a qualified person must physically or administratively reset the system, providing an opportunity to investigate and address the root cause before resuming operation.

77.11 Knowledge Check

77.12 Worked Example: Duty-Cycling State Machine Power Budget

Scenario: A soil moisture sensor deployed in a vineyard in Marlborough, New Zealand must operate for 3 years on a single 3.6V lithium thionyl chloride battery (19,000 mAh). The sensor samples soil moisture and transmits via LoRaWAN every 15 minutes.

State Machine Definition:

| State | Duration | Current Draw | Energy per Cycle |
|---|---|---|---|
| DEEP_SLEEP | 899.2 sec (14 min 59.2 sec) | 2.5 uA | 8.1 mJ |
| WAKE (MCU boot, sensor power-up) | 120 ms | 8 mA | 3.5 mJ |
| SAMPLE (ADC read, 3 readings averaged) | 50 ms | 12 mA | 2.2 mJ |
| COMPUTE (moisture % calculation, threshold check) | 30 ms | 10 mA | 1.1 mJ |
| TRANSMIT (LoRaWAN uplink, SF7, 14 dBm) | 56 ms | 120 mA | 24.2 mJ |
| RX_WINDOW (receive window 1 + 2) | 500 ms | 12 mA | 21.6 mJ |
| Total per cycle | 900 sec | | 60.7 mJ |

Daily Energy Budget:

  • Cycles per day: 24 hrs x 4/hr = 96 cycles
  • Daily energy: 96 x 60.7 mJ = 5,827 mJ = 1.62 mWh/day
  • Average current: 1.62 mWh / (24h x 3.6V) = 18.8 uA

Battery Life Calculation:

  • Battery capacity: 19,000 mAh x 3.6V = 68,400 mWh
  • Self-discharge: 1%/year at 20C = ~684 mWh/year (for a full battery)
  • Annual circuit consumption: 1.62 x 365 = 591.3 mWh/year
  • Year 1: 68,400 - 591.3 - 684 = 67,125 mWh remaining
  • Year 2: 67,125 - 591.3 - 671 = 65,863 mWh remaining
  • Year 3: 65,863 - 591.3 - 659 = 64,613 mWh remaining (94.5% capacity)

Projected lifetime: ~54 years (combined self-discharge and circuit drain of ~1,275 mWh/year)
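
The year-by-year figures above can be reproduced with a short loop. This sketch assumes, as the list does, that self-discharge removes 1% of the capacity remaining at the start of each year; the variable names are illustrative.

```python
CAPACITY_MWH = 19000 * 3.6           # 68,400 mWh
CIRCUIT_MWH_PER_YEAR = 1.62 * 365    # 591.3 mWh drawn by the sensor each year
SELF_DISCHARGE = 0.01                # 1% of remaining capacity per year

remaining = CAPACITY_MWH
for year in range(1, 4):
    # Subtract one year of circuit drain plus self-discharge of what was left
    remaining -= CIRCUIT_MWH_PER_YEAR + SELF_DISCHARGE * remaining
    print(f"Year {year}: {remaining:,.0f} mWh remaining")

fraction_left = remaining / CAPACITY_MWH
print(f"Capacity left after 3 years: {fraction_left:.1%}")
```

The loop may differ from the hand calculation by about 1 mWh because the list rounds each intermediate value.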

Duty cycling battery life calculation uses average current over a full cycle. For a sensor with 900-second period:

\[I_{avg} = \frac{\sum (I_i \times t_i)}{T_{total}}\]

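From the table: \(I_{avg} = \frac{(2.5\,\mu\text{A} \times 899.2\,\text{s}) + (8\,\text{mA} \times 0.12\,\text{s}) + \dots + (12\,\text{mA} \times 0.5\,\text{s})}{900\,\text{s}}\)

Simplifying: \(I_{avg} = \frac{2248\,\mu\text{As} + 960\,\mu\text{As} + \dots + 6000\,\mu\text{As}}{900\,\text{s}} \approx 18.7\,\mu\text{A}\) (18.8 uA when computed from the rounded per-state energies, which is the figure used in the rest of this example)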

Battery life (circuit only, ignoring self-discharge): \(\frac{19,000 mAh}{0.0188 mA} = 1,010,638\) hours \(\approx\) 115 years. Including 1%/year self-discharge, practical lifetime is ~54 years – still vastly exceeding the 3-year target.
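
The same calculation can be checked numerically. This is a sketch using the currents and durations from the state machine definition table; the variable names are illustrative. Working from unrounded values it lands near 18.7 uA and about 116 years, slightly different from the 18.8 uA and 115 years quoted above because the text rounds the per-state energies first.

```python
# (state, current in mA, duration in s) from the state machine definition table
STATES = [
    ("DEEP_SLEEP", 0.0025, 899.2),
    ("WAKE", 8.0, 0.120),
    ("SAMPLE", 12.0, 0.050),
    ("COMPUTE", 10.0, 0.030),
    ("TRANSMIT", 120.0, 0.056),
    ("RX_WINDOW", 12.0, 0.500),
]
PERIOD_S = 900.0       # one 15-minute cycle
BATTERY_MAH = 19000.0

charge_mas = sum(i_ma * t_s for _, i_ma, t_s in STATES)   # mA*s per cycle
i_avg_ma = charge_mas / PERIOD_S                          # average current
life_years = BATTERY_MAH / i_avg_ma / (24 * 365)          # circuit drain only
always_tx_days = BATTERY_MAH / 120.0 / 24                 # radio on continuously

print(f"average current:     {i_avg_ma * 1000:.1f} uA")
print(f"circuit-only life:   {life_years:.0f} years")
print(f"always transmitting: {always_tx_days:.1f} days")
```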

The 99.9% sleep time is critical: if the sensor transmitted continuously at 120 mA, battery life would be \(\frac{19,000}{120} = 158\) hours ≈ 6.6 days. Duty cycling cuts the average current from 120 mA to about 18.8 uA, extending circuit-only battery life by roughly 6,400x.

The state machine’s 99.9% time in DEEP_SLEEP means circuit drain consumes just 2.6% of the battery capacity over 3 years (about 5.5% once self-discharge is included). The remaining capacity is available for additional features or simply provides decades of operation well beyond the 3-year requirement.

Optimisation – Adding Alert Mode:

With the massive energy surplus, the vineyard owner adds a FROST_ALERT state:

| Added State | Trigger | Behaviour | Extra Energy |
|---|---|---|---|
| FROST_ALERT | Temperature < 2C | Sample every 60 sec, TX every 5 min | 14.6 mWh/day |

During frost season (June-August in NZ, ~90 days), frost alerts activate for ~6 hours/night:

  • Additional energy: 90 days x 6 hrs/day x (14.6/24) mWh = 328 mWh/season
  • 3-year frost alert energy: 984 mWh (1.4% of battery)
  • Total 3-year usage: 1,774 + 984 = 2,758 mWh (4% of battery, excluding self-discharge)
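
The frost-season arithmetic can be written out as a short script. This sketch takes the 14.6 mWh/day figure from the table as given and scales it by the hours the alert is active; the variable names are illustrative, and results may differ from the list by a milliwatt-hour or two because of rounding.

```python
EXTRA_MWH_PER_DAY = 14.6          # FROST_ALERT cost if active for a full day
FROST_DAYS = 90                   # frost season length
ALERT_HOURS_PER_NIGHT = 6
BASELINE_MWH_PER_YEAR = 1.62 * 365
CAPACITY_MWH = 19000 * 3.6
YEARS = 3

# Scale the daily figure down to the hours the alert actually runs
season_mwh = FROST_DAYS * ALERT_HOURS_PER_NIGHT * (EXTRA_MWH_PER_DAY / 24)
frost_total_mwh = YEARS * season_mwh
baseline_total_mwh = YEARS * BASELINE_MWH_PER_YEAR
total_mwh = frost_total_mwh + baseline_total_mwh
share = total_mwh / CAPACITY_MWH  # excludes self-discharge

print(f"frost alerts per season: {season_mwh:.0f} mWh")
print(f"3-year total:            {total_mwh:.0f} mWh ({share:.1%} of battery)")
```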

Key Insight: The duty-cycling state machine is the single most impactful design pattern for battery-powered IoT. By spending 99.9% of time in DEEP_SLEEP (2.5 uA), the transmit state’s 120 mA draw (48,000x higher) becomes negligible in the average. This pattern transforms a 19,000 mAh battery from a 6.6-day power source (if always transmitting) into one lasting decades (over 50 years including self-discharge).


77.13 See Also

Related Resources

State machine fundamentals:

Power management patterns:

Safety-critical design:

Connection reliability:

Challenge: Implement Pattern 3 (Actuator Safety State Machine) for a smart home garage door opener.

Requirements:

  • States: SAFE_OFF, OPENING, OPEN, CLOSING, CLOSED, EMERGENCY_STOP, LOCKOUT
  • Sensors: Door position (encoder), obstruction detector (IR beam)
  • Safety features:
    • If obstruction detected during CLOSING, transition to EMERGENCY_STOP
    • If watchdog timeout (500ms), force EMERGENCY_STOP
    • LOCKOUT requires manual PIN reset

Starter Code (ESP32 + MicroPython):

# State definitions
SAFE_OFF = 0
OPENING = 1
OPEN = 2
CLOSING = 3
CLOSED = 4
EMERGENCY_STOP = 5
LOCKOUT = 6

class GarageDoorFSM:
    def __init__(self):
        self.state = SAFE_OFF
        self.watchdog_timer = 0
        self.obstruction_count = 0

    def update(self):
        # Exercise: Implement state machine logic
        # Hint: use if/elif for state-based behavior
        pass

    def check_watchdog(self):
        # Exercise: Force EMERGENCY_STOP if timer expires
        pass

    def handle_obstruction(self):
        # Exercise: Count obstructions, lockout after 3 in 1 minute
        pass

Tasks:

  1. Implement state transitions for button press (open/close/stop)
  2. Add obstruction handling - stop and reverse if IR beam broken during closing
  3. Implement watchdog - force EMERGENCY_STOP if update() not called for 500ms
  4. Add lockout logic - transition to LOCKOUT after 3 obstructions in 60 seconds
  5. Test edge cases:
    • Press open while OPENING (should continue, not reverse)
    • Press close while door is OPEN (should start closing)
    • Obstruction during OPENING (should EMERGENCY_STOP)
    • Multiple rapid obstructions (should LOCKOUT after 3)

What to observe:

  • Does the state machine always transition to a safe state on anomaly?
  • Can the system recover from EMERGENCY_STOP automatically?
  • Does LOCKOUT truly require manual intervention?

Expected learning:

  • Safety systems use multiple protection layers
  • Lockout prevents automatic recovery from dangerous faults
  • Watchdog timers provide hardware-level backup to software logic

Extension: Add logging to record all state transitions with timestamps for safety audit.

77.15 What’s Next

| If you want to… | Read this |
|---|---|
| Apply state machines within microservice architectures | SOA and Microservices Fundamentals |
| Implement duty cycling using power state machines | Duty Cycling and Topology |
| Model smart device behavior in digital twins | Digital Twins |
| Apply state machines to process control and PID systems | Process Control and PID |
| Understand IoT reference model layers where state machines live | IoT Reference Models and Patterns |