20  OTA Update Architecture for IoT

20.1 Learning Objectives

By the end of this chapter, you will be able to:

  • Design A/B partitioning schemes for reliable firmware updates with automatic rollback
  • Compare update mechanisms (A/B, single partition, delta updates) and choose appropriate strategies
  • Implement secure update channels using PKI, code signing, and secure boot chains
  • Evaluate update delivery mechanisms (polling, push, CDN, peer-to-peer) for different deployment scenarios
  • Apply anti-rollback protection to prevent firmware downgrade attacks
  • Design OTA systems for bandwidth-constrained cellular IoT deployments

In 60 Seconds

OTA (Over-the-Air) update architecture for IoT enables remote firmware updates to deployed devices without physical access. A robust OTA system includes: firmware binary hosting, device authentication, transport security (TLS/DTLS), update notification, download and verification (SHA-256 checksum + code signing), atomic flash write to inactive partition, boot verification, and rollback capability. Poor OTA architecture can brick entire device fleets if an update fails mid-write.

20.2 For Beginners: OTA Update Architecture for IoT

Over-the-air updates let you fix and improve IoT devices after they leave the factory, without ever touching them physically. Think of it like a phone that patches itself overnight while you sleep. Done well, OTA keeps thousands of deployed devices secure and up to date; done badly, a single failed update can turn an entire fleet into expensive paperweights.

“Updating firmware over the air is like performing surgery on a running machine,” said Max the Microcontroller seriously. “If something goes wrong mid-update – power loss, network dropout, corrupted download – the device could be bricked forever. That is why OTA architecture is so important.”

Sammy the Sensor asked how it works safely. “A/B partitioning!” Max explained. “My flash memory has two slots. Slot A has the current working firmware. The new update downloads into Slot B. Only after the download is complete and verified does the device reboot into Slot B. If Slot B fails to start up properly, the bootloader automatically switches back to Slot A. No data loss, no bricking.”

Bella the Battery raised a concern. “Downloading a full firmware image over cellular uses a lot of my energy!” Lila the LED had the solution. “Delta updates! Instead of downloading the entire firmware, you only download the differences between the old and new versions. A 500 KB firmware with a small bug fix might only need a 20 KB delta update. That saves 96% of the bandwidth and energy.”

“And every update is signed with a cryptographic key,” Max added. “The device verifies the signature before installing. This prevents attackers from pushing malicious firmware. Plus, anti-rollback protection ensures old, vulnerable firmware versions cannot be reinstalled.”

20.3 Introduction

Over-the-air (OTA) updates are the lifeblood of modern IoT deployments. They enable security patches, bug fixes, and feature additions without physical access to devices. However, OTA updates also represent one of the highest-risk operations in IoT systems: a failed update can brick devices, compromise security, or disrupt critical operations.

This chapter explores the architecture of reliable, secure OTA update systems, from the low-level partitioning schemes that enable rollback to the high-level delivery mechanisms that scale to millions of devices.

20.4 Continuous Delivery Pipeline

20.4.1 Pipeline Stages

A comprehensive IoT CD pipeline includes multiple gates:

Stage 1: Build

  • Cross-compile for all hardware targets
  • Generate debug and release builds
  • Create build reproducibility checksums

Stage 2: Unit Tests

  • Run on host machine or in simulator
  • Mock hardware interfaces (GPIO, I2C, SPI)
  • Validate business logic independent of hardware

Stage 3: Static Analysis

  • Code quality metrics (complexity, duplication)
  • Security vulnerability scanning
  • Compliance checking (MISRA, CERT)

Stage 4: Hardware-in-the-Loop Tests

  • Flash firmware to real hardware
  • Automated test rigs exercise sensors/actuators
  • Protocol conformance testing
  • Power consumption validation

Stage 5: Staging Fleet

  • Deploy to internal test devices
  • Soak testing (24-48 hours continuous operation)
  • Integration with real cloud services
  • Performance monitoring

Stage 6: Canary Deployment

  • Deploy to 1-5% of production fleet
  • Monitor key metrics (crash rate, connectivity, battery)
  • Automatic rollback triggers

Stage 7: Production Rollout

  • Gradual increase (5% -> 25% -> 100%)
  • Regional rollouts (time zones, customer tiers)
  • Feature flags for gradual enablement

Flowchart depicting staged firmware deployment strategy with risk mitigation: Developer Commit flows through CI Build Pipeline to All Tests Pass decision (No blocks deployment, Yes continues), Generate Signed Firmware, Upload to Artifact Storage, Deploy to Staging Fleet, then Staging Metrics OK 24-hour soak test decision (No blocks deployment, Yes continues to canary rollout). Canary progression shows 1% Production with 6-hour monitoring (high crash rate triggers Auto Rollback, good metrics expand to 5%), then 5% with 12-hour monitoring (issues trigger rollback, good expands to 25%), then 25% with 24-hour monitoring (issues trigger rollback, good reaches Full Rollout 100%), followed by 7-day continuous monitoring. Auto Rollback feeds to Investigate and Fix. Progressive deployment reduces blast radius while monitoring gates enable early detection and automated rollback before fleet-wide impact.

Figure 20.1: Staged rollout strategy with canary deployments showing progressive expansion from 1% to 100% of production fleet with monitoring gates and automatic rollback triggers at each stage.

20.4.2 Build Artifacts

Proper artifact management is critical for traceability and debugging:

Firmware Images:

  • Bootloader (rarely updated, highly stable)
  • Application firmware (main update target)
  • Configuration/calibration data
  • Factory reset image

Manifests:

  • Firmware version (semantic versioning)
  • Git commit hash for exact source traceability
  • Build timestamp and build machine ID
  • Dependencies (library versions, RTOS version)
  • Supported hardware variants
  • Required bootloader version

Signatures:

  • Cryptographic hash (SHA-256) of firmware image
  • Digital signature (RSA, ECDSA) for authenticity
  • Certificate chain for verification
  • Anti-rollback counter (prevents downgrade attacks)
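The manifest and signature fields above are often combined into a single JSON document shipped alongside the binary. A sketch of such a manifest follows; the field names are our assumptions, not a standard format:

```python
import hashlib
import json

firmware = b"example_firmware_image"  # placeholder for the built binary

# Illustrative manifest structure; field names are assumptions.
manifest = {
    "version": "1.2.3",                        # semantic versioning
    "git_commit": "0123abcd",                  # hypothetical commit hash
    "build_timestamp": "2024-01-15T10:00:00Z",
    "dependencies": {"rtos": "FreeRTOS 10.4"},
    "hardware_variants": ["rev_a", "rev_b"],
    "min_bootloader_version": "1.0.0",
    "sha256": hashlib.sha256(firmware).hexdigest(),
    "anti_rollback_counter": 7,                # monotonic, never decreases
}
print(json.dumps(manifest, indent=2))
```

The device would verify the `sha256` field against the downloaded binary and check `anti_rollback_counter` against its stored minimum before accepting the update.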

Firmware versioning strategy showing semantic versioning (MAJOR.MINOR.PATCH), version timeline progression from v1.0.0 through v1.2.3, git branch strategy for releases with main and release branches, and anti-rollback protection mechanism with monotonic counter that prevents downgrade attacks by storing minimum acceptable firmware version number.

Figure 20.2: Firmware versioning strategy showing semantic versioning (MAJOR.MINOR.PATCH), version timeline progression, git branch strategy for releases, and anti-rollback protection to prevent downgrade attacks.

20.4.3 OTA and CI/CD Visualizations

The following AI-generated visualizations illustrate key concepts in OTA update systems and CI/CD pipelines for IoT devices.

Geometric diagram of OTA firmware update architecture showing cloud update server, device fleet manager, secure download channel, firmware verification, A/B partition switching, and rollback mechanism

OTA Firmware Update Architecture
Figure 20.3: Over-the-air firmware updates require careful architecture to ensure reliability and security. This visualization shows the end-to-end OTA system including cloud-based fleet management, secure TLS download channels, signature verification, and the A/B partition switching mechanism that enables atomic updates with rollback capability.

Artistic system architecture showing complete OTA pipeline from developer code commit through CI/CD build system, artifact signing, CDN distribution, device polling, and staged rollout with fleet monitoring

OTA Update System Architecture
Figure 20.4: A production OTA system spans from developer workstation to deployed devices. This visualization traces the complete update pipeline: code commit triggers CI build, binary is signed with HSM-protected keys, distributed via CDN, and devices poll for updates with staged rollout percentages protecting against fleet-wide failures.

Geometric sequence diagram of OTA update process showing device polling for updates, download with resume capability, signature verification, partition swap, boot validation, and rollback on failure

OTA Update Process Flow
Figure 20.5: The OTA update process must handle interruptions gracefully. This visualization shows the fault-tolerant update sequence including resumable downloads, cryptographic verification before flashing, atomic partition swap, boot counter for automatic rollback, and status reporting to cloud.

Geometric flowchart of firmware update decision logic showing update availability check, battery level verification, download progress tracking, integrity verification, installation, and boot testing with failure paths

Firmware Update Flow
Figure 20.6: Robust firmware updates require multiple decision points. This visualization presents the complete decision flow including battery level checks (prevent mid-update power loss), network stability verification, cryptographic integrity checks, installation with progress tracking, and watchdog-protected boot testing.

Geometric diagram of flash memory programming process showing erase-before-write requirement, page-aligned access, wear leveling considerations, and programming time versus sector size trade-offs

Flash Memory Programming
Figure 20.7: Understanding flash memory characteristics is essential for reliable OTA updates. This visualization explains the erase-before-write constraint, page/sector alignment requirements, wear leveling to extend device lifespan, and why large sectors increase update time and failure risk.

Artistic circuit diagram of power management IC for IoT showing battery monitoring, voltage regulation, brownout detection, and power-good signals that ensure updates only proceed with sufficient energy

Power Management for OTA Updates
Figure 20.8: Power management is critical during OTA updates. This visualization shows how PMIC circuits provide battery voltage monitoring, brownout detection, and power-good signals that firmware uses to prevent initiating updates when battery is low or power is unstable.

20.5 OTA Update Architecture

20.5.1 Update Mechanisms

A/B Partitioning (Dual Bank):

  • Two firmware partitions: Active and Inactive
  • Update downloads to Inactive partition
  • Atomic switch on successful verification
  • Automatic rollback if new firmware fails to boot
  • Advantage: Safe rollback, fast recovery
  • Disadvantage: Requires 2x storage (expensive on embedded)

Single Partition + Recovery:

  • One main partition, small recovery partition
  • Update overwrites main partition
  • If update fails, boots to recovery for re-download
  • Advantage: Smaller storage footprint
  • Disadvantage: Requires network access for recovery

Delta Updates:

  • Only transmit differences between versions
  • Reduces bandwidth (critical for cellular/LoRaWAN)
  • Patches applied in-place or to staging area
  • Advantage: Minimal data transfer
  • Disadvantage: Complex implementation, risky patching
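Production systems generate deltas with tools like bsdiff or xdelta. A toy block-level delta in Python illustrates the core idea; the fixed 64-byte block size and equal-length assumption are simplifications real tools do not make:

```python
BLOCK = 64  # block size in bytes, arbitrary for illustration

def make_delta(old, new):
    """Return [(block_index, new_block)] for blocks that changed.
    Assumes equal-length images; real tools (bsdiff) handle insertions."""
    delta = []
    for i in range(0, len(new), BLOCK):
        if old[i:i + BLOCK] != new[i:i + BLOCK]:
            delta.append((i // BLOCK, new[i:i + BLOCK]))
    return delta

def apply_delta(old, delta):
    """Reconstruct the new image by patching changed blocks in place."""
    image = bytearray(old)
    for idx, block in delta:
        image[idx * BLOCK:idx * BLOCK + len(block)] = block
    return bytes(image)

# A 4-byte patch in a 1 KB image touches a single 64-byte block:
old_fw = bytes(1024)
new_fw = bytearray(old_fw)
new_fw[100:104] = b"\x01\x02\x03\x04"
d = make_delta(old_fw, bytes(new_fw))
assert apply_delta(old_fw, d) == bytes(new_fw)
print(f"delta: {len(d)} of {1024 // BLOCK} blocks changed")  # 1 of 16
```

The "risky patching" disadvantage is visible even here: if the device's base image differs by one byte from what the delta was generated against, the patched result is silently wrong, which is why delta systems verify the final image hash after patching.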

Delta updates save massive bandwidth when only small changes exist between firmware versions. The bandwidth savings directly translate to cost and energy.

\[\text{Data Transferred} = \text{Full Firmware} \times (1 - \text{Delta Compression Ratio})\]

For a 500 KB firmware with a bug fix changing 3% of code (delta compression ratio = 0.97):

\[\text{Delta Size} = 500 \times (1 - 0.97) = 500 \times 0.03 = 15\text{ KB}\]

Over cellular at $0.20/MB, updating 10,000 devices:

\[ \begin{align} \text{Full update cost:} & \quad 10,000 \times 0.5 \times 0.20 = \$1,000 \\ \text{Delta update cost:} & \quad 10,000 \times 0.015 \times 0.20 = \$30 \end{align} \]

The delta update saves $970 (97% reduction) for minor firmware changes. Energy savings are proportional — critical for battery-powered devices.

20.5.2 Interactive Delta Update Cost Calculator

You can estimate the cost savings of delta updates for your own deployment from fleet size, update frequency, image and delta sizes, and data cost. With typical parameters (10,000 devices, 12 updates/year, a roughly 50 KB delta against a 512 KB image at $0.50/MB), annual data-cost savings come to about $27,000, with energy consumption reduced proportionally.
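A minimal offline version of this calculation (the function and parameter names are ours, and the text's 512 KB image is rounded to 0.5 MB as in the worked example later in the chapter):

```python
def delta_savings(devices, updates_per_year, full_kb, delta_kb, cost_per_mb):
    """Annual cellular data-cost savings of delta vs full-image updates."""
    full_cost = devices * updates_per_year * (full_kb / 1000) * cost_per_mb
    delta_cost = devices * updates_per_year * (delta_kb / 1000) * cost_per_mb
    return full_cost - delta_cost

# 10,000 devices, 12 updates/year, 500 KB full image vs 50 KB delta, $0.50/MB
print(f"${delta_savings(10_000, 12, 500, 50, 0.50):,.0f}/year saved")
# $27,000/year saved
```

Plugging in your own fleet size and data tariff makes the build-vs-buy decision for delta tooling concrete.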

Flowchart illustrating A/B partition firmware update mechanism for safe over-the-air updates: Starting with Boot Partition A Active running v1.2.0 and Partition B Inactive with old v1.1.0, system downloads new firmware v1.3.0 to inactive Partition B. Verify Signature and Checksum decision validates cryptographic integrity (Invalid path discards update, Valid continues), then Switch Boot to B makes new firmware active. Boot Success decision checks if new firmware starts correctly (Yes path establishes B Active v1.3.0 with A Inactive v1.2.0 as backup, No path triggers Rollback to A returning to original active partition). This dual-partition approach enables atomic firmware updates with automatic recovery if new firmware fails, preventing device bricking while maintaining rollback capability to last known-good version.

Figure 20.9: A/B partition update scheme showing dual firmware partitions with atomic switching and automatic rollback on boot failure, ensuring safe firmware updates with minimal brick risk.

20.5.3 Update Security

Security is paramount: a compromised update mechanism lets an attacker push malicious firmware to the entire fleet at once:

Code Signing with PKI:

  • Firmware signed with manufacturer’s private key
  • Device verifies signature with embedded public key
  • Prevents installation of unauthorized firmware
  • Certificate rotation strategy for key compromise

Secure Boot Chain:

  1. Hardware root of trust (immutable ROM bootloader)
  2. ROM verifies bootloader signature
  3. Bootloader verifies application signature
  4. Each stage validates next stage before execution
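The chain of trust can be modeled as each stage checking a digest of the next image before transferring control. The hash-only sketch below is a simplification: real secure boot verifies asymmetric signatures, with only a public-key hash burned into ROM or eFuses:

```python
import hashlib

def sha256(data):
    return hashlib.sha256(data).hexdigest()

# Each stage holds the trusted digest of the next stage's image.
bootloader_img = b"bootloader v1"
application_img = b"application v1.3.0"

rom_trusted_digest = sha256(bootloader_img)          # immutable, in ROM/eFuse
bootloader_trusted_digest = sha256(application_img)  # carried by bootloader

def secure_boot(bootloader, application):
    """Verify each stage before executing it; halt on any mismatch."""
    if sha256(bootloader) != rom_trusted_digest:
        return "halt: bootloader tampered"
    if sha256(application) != bootloader_trusted_digest:
        return "halt: application tampered"
    return "boot application"

print(secure_boot(bootloader_img, application_img))   # boot application
print(secure_boot(bootloader_img, b"evil firmware"))  # halt: application tampered
```

The key property is that trust flows one way: a later stage can never modify the digest an earlier stage checks it against.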

Encrypted Transmission:

  • TLS 1.2+ for download channels
  • Firmware can be encrypted or plaintext (signature provides authenticity)
  • Encrypted firmware prevents reverse engineering during transit

Anti-Rollback Protection:

  • Monotonic counter stored in secure storage
  • Each firmware has version number
  • Device refuses to install firmware with lower version
  • Prevents attacker downgrading to vulnerable version

20.5.4 Update Delivery

Direct Polling:

  • Device periodically checks update server
  • Simple implementation
  • Disadvantage: Thundering herd problem (10M devices polling simultaneously)
  • Solution: Randomized polling intervals, exponential backoff
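Randomized intervals with exponential backoff spread check-ins so the fleet never hits the server in lockstep. A sketch, where the base interval, jitter range, and cap are example values:

```python
import random

def next_poll_delay(base_s=3600, failures=0, cap_s=86400):
    """Polling delay with exponential backoff on consecutive failures,
    plus +/-25% jitter so devices desynchronize over time."""
    delay = min(base_s * (2 ** failures), cap_s)  # 1h, 2h, 4h ... capped at 24h
    jitter = random.uniform(-0.25, 0.25)
    return delay * (1 + jitter)

for failures in range(4):
    print(f"{failures} failures -> poll again in {next_poll_delay(failures=failures):.0f} s")
```

Even a few percent of jitter is enough to flatten the load spike of millions of devices that were all provisioned, and would otherwise all poll, at the same moment.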

Push Notifications:

  • Server sends notification to device via MQTT, CoAP Observe, etc.
  • Device then pulls firmware image
  • Efficient, immediate propagation
  • Disadvantage: Requires persistent connection or reachable device

CDN-Based Distribution:

  • Firmware hosted on Content Delivery Network
  • Devices download from geographically nearby edge servers
  • Scales to millions of devices
  • Examples: AWS CloudFront, Azure CDN, Cloudflare

Peer-to-Peer Updates:

  • Devices share firmware with nearby devices
  • Efficient for mesh networks (BLE mesh, Zigbee)
  • Reduces server bandwidth
  • Challenge: Ensuring security in P2P distribution

Four OTA update delivery mechanisms with trade-offs: Direct Polling shows devices periodically checking server with simple implementation but thundering herd problem when millions poll simultaneously; Push Notification shows server sending MQTT/CoAP notifications with immediate propagation but requires persistent connection; CDN Distribution shows content delivery network with geographically distributed edge servers providing scalable downloads to millions of devices; Peer-to-Peer shows mesh network devices sharing firmware locally reducing server bandwidth but introducing security verification complexity.

Figure 20.10: Four OTA update delivery mechanisms with trade-offs: Direct Polling (simple but thundering herd), Push Notification (immediate but requires connection), CDN Distribution (scalable), and Peer-to-Peer (bandwidth efficient but security complex).

20.6 Code Example: A/B Partition OTA Simulator

This Python simulation demonstrates the A/B partitioning OTA update mechanism. It models the complete update lifecycle including download, verification, partition swap, boot validation, and automatic rollback on failure:

import hashlib
import time

class OTAPartitionManager:
    """Simulate A/B partition OTA updates with rollback.

    Models the dual-partition update flow: download to inactive slot,
    verify signature, swap boot target, validate boot, and rollback
    on failure -- the pattern used by ESP-IDF, Android, and ChromeOS.
    """
    def __init__(self, current_version="1.0.0"):
        self.partitions = {
            "A": {"version": current_version, "valid": True, "data": b"firmware_v1"},
            "B": {"version": "0.0.0", "valid": False, "data": b""},
        }
        self.active = "A"
        self.boot_count = 0
        self.max_boot_attempts = 3
        self.update_log = []

    def _log(self, msg):
        self.update_log.append(f"[{len(self.update_log):02d}] {msg}")
        print(f"  {self.update_log[-1]}")

    def _compute_hash(self, data):
        return hashlib.sha256(data).hexdigest()[:16]

    def download_update(self, new_version, firmware_data, expected_hash):
        """Download firmware to inactive partition and verify integrity."""
        inactive = "B" if self.active == "A" else "A"
        self._log(f"Downloading v{new_version} to partition {inactive}")
        self._log(f"  Size: {len(firmware_data)} bytes")

        # Write to inactive partition
        self.partitions[inactive]["data"] = firmware_data
        self.partitions[inactive]["version"] = new_version

        # Verify hash
        actual_hash = self._compute_hash(firmware_data)
        if actual_hash != expected_hash:
            self._log(f"  HASH MISMATCH: expected {expected_hash}, got {actual_hash}")
            self.partitions[inactive]["valid"] = False
            return False

        self.partitions[inactive]["valid"] = True
        self._log(f"  Hash verified: {actual_hash}")
        return True

    def switch_partition(self):
        """Mark inactive partition as next boot target."""
        inactive = "B" if self.active == "A" else "A"
        if not self.partitions[inactive]["valid"]:
            self._log("Cannot switch: inactive partition invalid")
            return False
        self._log(f"Boot target: {self.active} -> {inactive}")
        self.boot_count = 0
        self.active = inactive
        return True

    def simulate_boot(self, success=True):
        """Simulate boot attempt on active partition.

        Returns True if boot succeeds, triggers rollback if
        max_boot_attempts exceeded.
        """
        self.boot_count += 1
        v = self.partitions[self.active]["version"]

        if success:
            self._log(f"Boot OK: partition {self.active} v{v}")
            self.boot_count = 0
            return True
        else:
            self._log(f"Boot FAILED: partition {self.active} v{v} "
                       f"(attempt {self.boot_count}/{self.max_boot_attempts})")
            if self.boot_count >= self.max_boot_attempts:
                self._rollback()
            return False

    def _rollback(self):
        """Automatic rollback to previous partition."""
        fallback = "B" if self.active == "A" else "A"
        fv = self.partitions[fallback]["version"]
        self._log(f"ROLLBACK: {self.active} -> {fallback} (v{fv})")
        self.active = fallback
        self.boot_count = 0

    def status(self):
        return {
            "active": self.active,
            "version": self.partitions[self.active]["version"],
            "A": self.partitions["A"]["version"],
            "B": self.partitions["B"]["version"],
        }

# Scenario 1: Successful update
print("=== Scenario 1: Successful OTA Update ===")
ota = OTAPartitionManager(current_version="1.2.0")
firmware = b"new_firmware_v1.3.0_with_security_patch"
fw_hash = hashlib.sha256(firmware).hexdigest()[:16]

ota.download_update("1.3.0", firmware, fw_hash)
ota.switch_partition()
ota.simulate_boot(success=True)
print(f"  Result: {ota.status()}\n")

# Scenario 2: Failed update with automatic rollback
print("=== Scenario 2: Bad Firmware -> Auto Rollback ===")
ota2 = OTAPartitionManager(current_version="2.0.0")
bad_fw = b"corrupted_firmware_causes_boot_loop"
fw_hash2 = hashlib.sha256(bad_fw).hexdigest()[:16]

ota2.download_update("2.1.0", bad_fw, fw_hash2)
ota2.switch_partition()
ota2.simulate_boot(success=False)  # Attempt 1
ota2.simulate_boot(success=False)  # Attempt 2
ota2.simulate_boot(success=False)  # Attempt 3 -> rollback
print(f"  Result: {ota2.status()}")
# Output:
# === Scenario 1: Successful OTA Update ===
#   [00] Downloading v1.3.0 to partition B
#   [01]   Size: 38 bytes
#   [02]   Hash verified: a1b2c3d4e5f6a7b8
#   [03] Boot target: A -> B
#   [04] Boot OK: partition B v1.3.0
#   Result: {'active': 'B', 'version': '1.3.0', 'A': '1.2.0', 'B': '1.3.0'}
#
# === Scenario 2: Bad Firmware -> Auto Rollback ===
#   [00] Downloading v2.1.0 to partition B
#   [01]   Size: 35 bytes
#   [02]   Hash verified: ...
#   [03] Boot target: A -> B
#   [04] Boot FAILED: partition B v2.1.0 (attempt 1/3)
#   [05] Boot FAILED: partition B v2.1.0 (attempt 2/3)
#   [06] Boot FAILED: partition B v2.1.0 (attempt 3/3)
#   [07] ROLLBACK: B -> A (v2.0.0)
#   Result: {'active': 'A', 'version': '2.0.0', 'A': '2.0.0', 'B': '2.1.0'}

The automatic rollback after 3 failed boot attempts is the critical safety mechanism. Without it, a bad firmware update would brick the device permanently. This pattern is used by ESP-IDF (esp_ota_set_boot_partition), Android’s A/B system, and ChromeOS verified boot.

Case Study: Secure OTA for Patient Monitoring Devices

Scenario: A medical device company needs to implement secure OTA updates for 10,000 deployed patient monitoring devices. Updates must be cryptographically verified to prevent malicious firmware injection.

Requirements:

  • Prevent unauthorized firmware from being installed
  • Detect tampered firmware during download or storage
  • Enable emergency security patches within 48 hours
  • Maintain audit trail of all updates
  • Meet FDA cybersecurity guidelines

Implementation Steps:

1. Generate RSA Key Pair (offline, secure workstation):

# Generate 2048-bit RSA private key (kept offline!)
openssl genrsa -out firmware_signing_key_private.pem 2048

# Extract public key (embedded in device firmware)
openssl rsa -in firmware_signing_key_private.pem \
            -pubout -out firmware_signing_key_public.pem

# Hash public key for verification
openssl dgst -sha256 firmware_signing_key_public.pem
# 7f3e9a2c... (record this hash in device)

2. Firmware Build and Signing Process (CI/CD pipeline):

#!/bin/bash
# Build firmware
idf.py build

# Generate SHA-256 hash of firmware binary
sha256sum build/patient-monitor.bin > build/firmware.sha256

# Sign the hash with private key
openssl dgst -sha256 -sign firmware_signing_key_private.pem \
        -out build/firmware.sig build/patient-monitor.bin

# Create update package
tar czf patient-monitor-v2.5.0.tar.gz \
    build/patient-monitor.bin \
    build/firmware.sha256 \
    build/firmware.sig \
    manifest.json

# Upload to S3 CDN
aws s3 cp patient-monitor-v2.5.0.tar.gz \
    s3://ota-updates/production/ --acl public-read

3. Device-Side Verification (ESP32 firmware):

#include "mbedtls/sha256.h"
#include "mbedtls/pk.h"

// Public key embedded in firmware (read-only)
static const char *PUBLIC_KEY_PEM =
    "-----BEGIN PUBLIC KEY-----\n"
    "MIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIBCgKCAQEA...\n"
    "-----END PUBLIC KEY-----\n";

bool verify_firmware_signature(const uint8_t *firmware,
                                size_t firmware_len,
                                const uint8_t *signature,
                                size_t sig_len) {
    // Step 1: Calculate SHA-256 hash of firmware
    uint8_t hash[32];
    mbedtls_sha256_context sha_ctx;
    mbedtls_sha256_init(&sha_ctx);
    mbedtls_sha256_starts(&sha_ctx, 0);
    mbedtls_sha256_update(&sha_ctx, firmware, firmware_len);
    mbedtls_sha256_finish(&sha_ctx, hash);

    // Step 2: Verify RSA signature
    mbedtls_pk_context pk_ctx;
    mbedtls_pk_init(&pk_ctx);

    // Load public key
    int ret = mbedtls_pk_parse_public_key(&pk_ctx,
                                          (const uint8_t *)PUBLIC_KEY_PEM,
                                          strlen(PUBLIC_KEY_PEM) + 1);
    if (ret != 0) {
        ESP_LOGE(TAG, "Failed to parse public key: %d", ret);
        return false;
    }

    // Verify signature
    ret = mbedtls_pk_verify(&pk_ctx, MBEDTLS_MD_SHA256,
                            hash, sizeof(hash),
                            signature, sig_len);

    mbedtls_pk_free(&pk_ctx);

    if (ret == 0) {
        ESP_LOGI(TAG, "Firmware signature VALID");
        return true;
    } else {
        ESP_LOGE(TAG, "Firmware signature INVALID: %d", ret);
        return false;
    }
}

// NOTE: download_ota_binary()/download_ota_signature() are application
// helpers (not ESP-IDF APIs), and the esp_ota_* calls are simplified --
// the real APIs take an update handle or target partition argument.
void ota_update_task(void) {
    size_t firmware_len = 0, sig_len = 0;

    // Download firmware + signature from server
    uint8_t *firmware = download_ota_binary(OTA_URL, &firmware_len);
    uint8_t *signature = download_ota_signature(OTA_SIG_URL, &sig_len);

    // Verify before flashing
    if (!verify_firmware_signature(firmware, firmware_len,
                                    signature, sig_len)) {
        ESP_LOGE(TAG, "Signature verification FAILED - aborting OTA");
        free(firmware);
        free(signature);
        return;  // Reject update
    }

    // Signature valid - proceed with flash
    ESP_LOGI(TAG, "Signature verified - flashing to OTA partition");
    esp_ota_write(firmware, firmware_len);  // simplified signature
    esp_ota_set_boot_partition();           // switch to new partition (simplified)
    esp_restart();                          // reboot into new firmware
}

4. Anti-Rollback Protection (prevent downgrade attacks):

// Minimum acceptable firmware version, stored in NVS (non-volatile
// storage), e.g. 0x02050000 for v2.5.0

bool check_anti_rollback(uint32_t new_version) {
    uint32_t min_version = 0;
    nvs_get_u32(nvs_handle, "min_fw_version", &min_version);

    if (new_version < min_version) {
        ESP_LOGE(TAG, "Rollback detected: v%x < v%x (minimum)",
                 new_version, min_version);
        return false;  // Reject downgrade
    }
    return true;
}

// Raise the version floor only AFTER the new firmware has booted
// successfully -- committing earlier would block a legitimate rollback
// to the previous version if the new firmware fails to boot.
void commit_min_version(uint32_t new_version) {
    nvs_set_u32(nvs_handle, "min_fw_version", new_version);
    nvs_commit(nvs_handle);
}

5. Results:

  • Security: Unauthorized firmware cannot be installed (no private key)
  • Integrity: Tampered firmware detected during signature check
  • Audit: Every firmware build signed, logged in CI/CD system
  • Rollback protection: Attackers cannot downgrade to vulnerable versions
  • Performance: Signature verification takes 200ms (acceptable for medical device)

Cost Breakdown:

  • Development time: 2 weeks (bootloader + verification code)
  • Private key security: HSM (Hardware Security Module) = $2,000/year
  • Per-device cost: None (verification is free)
  • Regulatory: Meets FDA pre-market cybersecurity guidelines

Key Takeaway: Code signing is non-negotiable for medical/critical devices. The 200ms verification time and 2-week development cost are trivial compared to the risk of compromised firmware.

Decision Framework: Delta Updates vs Full Images

Question: Should you implement delta (differential) updates or stick with full firmware images?

| Factor | Full Image Updates | Delta Updates | Break-Even Point |
|---|---|---|---|
| Bandwidth | High (500 KB typical) | Low (20-50 KB, 90% savings) | Cellular data cost >$0.10/MB |
| Complexity | Low (simple download) | High (generate diffs, apply patches) | Team has 1+ month for implementation |
| Reliability | High (atomic) | Medium (patching can fail) | Acceptable failure rate <5% |
| Storage | 1× firmware size | 1× firmware + patch buffer | Flash available for temp buffer |
| Version Dependencies | None | Requires exact base version | Fleet version fragmentation <5% |

Delta Update ROI Calculation:

Example: 10,000 cellular devices, monthly security patches

Full Image Updates:

Firmware size: 512 KB
Cellular cost: $0.50/MB
Update frequency: 12 per year

Cost per device per update: 0.5 MB × $0.50 = $0.25
Annual cost: 10,000 devices × 12 updates × $0.25 = $30,000

Delta Updates:

Delta size: 50 KB (typical for security patch)
Cellular cost: $0.50/MB
Update frequency: 12 per year

Cost per device per update: 0.05 MB × $0.50 = $0.025
Annual cost: 10,000 devices × 12 updates × $0.025 = $3,000
Development cost: $15,000 (one-time)

Year 1 cost: $3,000 + $15,000 = $18,000
Savings: $30,000 - $18,000 = $12,000 (year 1)
Year 2+ savings: $27,000 per year

Break-Even: $15,000 development cost / $27,000 annual savings = 0.56 years (7 months)
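The break-even arithmetic above generalizes to a small helper (the function and parameter names are ours):

```python
def breakeven_years(dev_cost, full_annual, delta_annual):
    """Years until one-time delta development cost is recouped
    by the annual data-cost savings."""
    annual_savings = full_annual - delta_annual
    return dev_cost / annual_savings

# Numbers from the worked example: $15,000 development cost,
# $30,000/yr full-image data cost, $3,000/yr delta data cost.
years = breakeven_years(15_000, 30_000, 3_000)
print(f"break-even: {years:.2f} years ({years * 12:.0f} months)")
# break-even: 0.56 years (7 months)
```

Re-running this with your own fleet size and tariff is usually all the business case a delta-update project needs.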

When Delta Updates Make Sense:

YES - Use Delta Updates:

  • Cellular/LoRaWAN connectivity (metered bandwidth)
  • Large firmware (>256 KB) with small changes (<20% per update)
  • High update frequency (monthly security patches)
  • Fleet size >1,000 devices (cost justifies development)
  • Stable base firmware (not changing drastically each version)

NO - Use Full Images:

  • Wi-Fi connectivity (bandwidth is effectively free)
  • Small firmware (<128 KB total)
  • Low update frequency (<2 per year)
  • Small fleet (<500 devices)
  • Rapid iteration phase (firmware structure changing frequently)

Implementation Complexity Factors:

| Challenge | Full Image | Delta Updates | Mitigation |
|---|---|---|---|
| Version Mismatch | N/A | Delta requires exact base version | Maintain deltas for last 3 versions |
| Patch Corruption | Download again | Patching fails, need fallback | Full image as fallback after 2 failures |
| Flash Wear | Write once | May write multiple times | Use wear-leveling flash |
| Tooling | Standard | Requires bsdiff/xdelta tools | Integrate into CI/CD pipeline |
| Testing | Test once | Test each delta combination | Automated delta generation + testing |

Hybrid Approach (Recommended for Production):

// Try delta first, fall back to a full image
#include <stdbool.h>
#include "esp_log.h"

static const char *TAG = "ota";

bool ota_update(void) {
    // Check whether a delta patch exists for the currently running version
    if (delta_available(current_version)) {
        if (apply_delta_update()) {
            return true;  // Delta succeeded
        }
        ESP_LOGW(TAG, "Delta failed, trying full image");
    }

    // Fall back to downloading the complete image
    return apply_full_image_update();
}

Decision Rule:

  • If cellular + updates >2/year + fleet >1,000 → Delta updates pay for themselves within a year
  • Otherwise → Full images are simpler and “good enough”
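The decision rule above can be encoded directly as a predicate. This is a sketch: the thresholds mirror the rule of thumb in the text and are starting points, not hard limits.

```python
def should_use_delta(connectivity: str, updates_per_year: int, fleet_size: int) -> bool:
    """Rule of thumb from the text: delta updates pay off on metered
    links with frequent updates across a large fleet."""
    metered = connectivity in ("cellular", "lorawan")
    return metered and updates_per_year > 2 and fleet_size > 1_000

should_use_delta("cellular", 12, 10_000)   # True: monthly patches, big fleet
should_use_delta("wifi", 12, 10_000)       # False: bandwidth is effectively free
```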
Common Mistake: Not Testing Power Loss During OTA Updates

The Problem: Your OTA system works perfectly in testing, but devices brick in the field when power is lost mid-update.

Why Power Loss Happens:

  1. Battery-powered devices: Battery dies during long download
  2. Mains-powered devices: Power outages, user unplugging device
  3. Industrial devices: Circuit breaker trips, power fluctuations
  4. Vehicle devices: Engine off during update

Real-World Disaster Example:

Smart thermostat manufacturer pushed an OTA update to 50,000 homes:

  • Update took 4 minutes to download + flash
  • If power was lost during flashing (60-second window) → device bricked
  • Probability: 0.1% of users power-cycle during the update window
  • Result: 50 bricked thermostats on the first day, 200 by end of week
  • Cost: $200 per service call × 200 = $40,000 + reputation damage

What Went Wrong: Single-partition update overwrites active firmware. Power loss mid-flash = corrupted partition = brick.

Testing Protocol You Should Have Done:

# Power-loss test automation (hardware test rig)
import random
import time

def test_power_loss_resilience():
    for iteration in range(1000):
        # Start OTA update
        device.start_ota_update()

        # Cut power at a random point during the update
        sleep_time = random.uniform(0, UPDATE_DURATION)
        time.sleep(sleep_time)
        power_relay.off()

        # Wait, then restore power
        time.sleep(5)
        power_relay.on()

        # Verify device boots (not bricked)
        assert device.boots_successfully(), f"Bricked at {sleep_time}s"

        # Verify the device either:
        # A) successfully updated, OR
        # B) rolled back to the old firmware
        assert device.is_functional()
        assert device.version in [OLD_VERSION, NEW_VERSION]

        print(f"Iteration {iteration}: Power cut at {sleep_time:.1f}s - OK")

How A/B Partitioning Prevents This:

Before Update:
├─ Partition A (active): v2.2 firmware ✓ (booting from here)
└─ Partition B (inactive): v2.1 firmware (old backup)

During Update (power safe):
├─ Partition A (active): v2.2 firmware ✓ (still booting from here)
└─ Partition B (inactive): downloading v2.3... (doesn't affect A)

Power Lost Here:
├─ Partition A (active): v2.2 firmware ✓ (STILL INTACT!)
└─ Partition B (inactive): v2.3 PARTIAL (corrupted, but unused)

Device Boots:
└─ Bootloader checks Partition B → corrupted → ignores it
└─ Bootloader loads Partition A → v2.2 firmware → device works ✓

After Successful Update:
├─ Partition A (inactive): v2.2 firmware (becomes backup)
└─ Partition B (active): v2.3 firmware ✓ (boot target switched)

Single Partition = Disaster:

Before Update:
└─ Partition (active): v2.2 firmware ✓

During Update (DANGEROUS):
└─ Partition (active): overwriting v2.2 with v2.3...

Power Lost Here:
└─ Partition (active): CORRUPTED (half v2.2, half v2.3) ❌

Device Boots:
└─ Bootloader tries to load → INVALID FIRMWARE → BRICK ❌
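The bootloader decision that makes the A/B layout safe can be sketched in a few lines. This is illustrative, not any particular bootloader's API: `Slot` and its validity flag stand in for the image metadata a real bootloader checks (checksum, signature, boot counter).

```python
from dataclasses import dataclass

@dataclass
class Slot:
    name: str
    version: str
    valid: bool         # e.g. image checksum/signature verified
    boot_priority: int  # higher wins when both slots are valid

def select_boot_slot(a: Slot, b: Slot) -> Slot:
    """Pick the highest-priority valid slot; a corrupted inactive
    slot is simply ignored, so a failed download can never brick."""
    candidates = [s for s in (a, b) if s.valid]
    if not candidates:
        raise RuntimeError("no bootable image")  # needs recovery ROM
    return max(candidates, key=lambda s: s.boot_priority)

# Power lost mid-download: B is corrupted, bootloader falls back to A
a = Slot("A", "v2.2", valid=True,  boot_priority=1)
b = Slot("B", "v2.3", valid=False, boot_priority=2)
select_boot_slot(a, b).version   # "v2.2" -- still intact
```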

Battery-Powered Device Protection:

// Check battery before starting OTA
#include <inttypes.h>

#define MIN_BATTERY_FOR_OTA_MV 3300  // 3.3 V minimum

bool ota_start(void) {
    uint32_t battery_mv = read_battery_voltage();

    if (battery_mv < MIN_BATTERY_FOR_OTA_MV) {
        ESP_LOGW(TAG, "Battery too low for OTA: %" PRIu32 " mV", battery_mv);
        ESP_LOGW(TAG, "Deferring update until battery charged");
        return false;  // Don't start the update
    }

    // Estimate energy required and demand a 50% safety margin
    uint32_t energy_needed_mah = estimate_ota_energy();
    uint32_t energy_available_mah = battery_capacity_remaining();

    if (energy_available_mah < (energy_needed_mah * 3) / 2) {
        ESP_LOGW(TAG, "Insufficient battery margin for OTA");
        return false;
    }

    // Battery OK - proceed
    return perform_ota_update();
}

Best Practices:

  1. Use A/B partitioning - Eliminates brick risk entirely
  2. Check battery before OTA - For battery-powered devices
  3. Test power loss explicitly - Hardware rig that cuts power randomly
  4. Monitor bootloader failures - Track devices failing to boot
  5. Resume capability - Support resuming interrupted downloads
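Resume capability (practice 5) typically means persisting the download offset and requesting only the remainder with an HTTP Range header. The sketch below shows just the header construction from a saved offset; the transport and the durable offset storage are left abstract.

```python
def resume_request_headers(bytes_already_written: int) -> dict:
    """Ask the server for only the remaining bytes of the image.
    The saved offset must come from storage that survives reboot."""
    if bytes_already_written <= 0:
        return {}                                   # fresh download
    return {"Range": f"bytes={bytes_already_written}-"}

resume_request_headers(0)        # {} -- start from scratch
resume_request_headers(131072)   # {'Range': 'bytes=131072-'}
```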

Test Matrix:

  • Power loss during download (10 points in download timeline)
  • Power loss during flash (10 points in flash timeline)
  • Power loss during verification
  • Power loss during boot
  • Battery drain during update (slow death)

The Rule: If you haven’t tested power loss at 20+ random points during your OTA process, your OTA system isn’t production-ready.

20.7 Summary

OTA update architecture is one of the most critical aspects of IoT system design. Key takeaways from this chapter:

  • Continuous delivery pipelines for IoT include multiple stages: build, unit tests, static analysis, HIL tests, staging, canary, and production rollout
  • Build artifacts must include firmware images, manifests with version traceability, and cryptographic signatures
  • A/B partitioning provides the safest update mechanism with automatic rollback, at the cost of 2x storage
  • Delta updates reduce bandwidth for cellular IoT but increase complexity and failure risk
  • Secure boot chains establish trust from hardware root through bootloader to application
  • Code signing with PKI prevents installation of unauthorized firmware
  • Anti-rollback protection prevents attackers from downgrading to vulnerable firmware versions
  • Update delivery mechanisms range from simple polling to CDN distribution to peer-to-peer, each with trade-offs

20.8 Knowledge Check

20.9 Concept Relationships

Understanding OTA update architecture connects to multiple layers of IoT system design:

  • CI/CD Fundamentals generates signed artifacts - the build pipeline produces firmware images with cryptographic signatures and version metadata that OTA systems deliver
  • Rollback and Staged Rollout builds on update mechanisms - A/B partitioning enables automatic rollback, while staged rollouts limit blast radius of bad updates
  • Device Security requires secure updates - code signing, secure boot chains, and anti-rollback protection prevent firmware tampering and downgrade attacks
  • Encryption Architecture protects update channels - TLS 1.2+ encrypts downloads, RSA/ECDSA signatures verify authenticity, PKI manages trust chains
  • Flash Memory Programming underlies updates - understanding erase-before-write, sector sizes, and wear leveling explains why A/B partitioning requires 2× storage

OTA architecture is safety-critical infrastructure - a flawed update mechanism can brick entire device fleets, making robust design non-negotiable.

20.10 See Also

Common Pitfalls

Overwriting the currently executing firmware image in-place is catastrophic if power fails mid-write or the new firmware has a critical bug — the device is permanently bricked. Use dual-partition (A/B) or bootloader + update partition architecture: write new firmware to inactive partition, verify it completely, then atomically switch the boot pointer. If the new firmware fails its health checks, the bootloader switches back to the known-good partition without field service.

An OTA system that downloads and applies firmware without verifying a cryptographic signature (ECDSA or RSA) allows attackers to push malicious firmware to devices. Any device within cellular range (or with access to the update server) can inject arbitrary code if signature verification is absent. Sign all firmware binaries with a private key held in HSM (Hardware Security Module); verify signature on-device using the embedded public key before writing a single byte to flash.
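The shape of that on-device check can be sketched with the Python `cryptography` library; a real device would run the equivalent verification in C (e.g. via mbedTLS or wolfSSL) against an embedded public key, and the key pair here is generated only for illustration.

```python
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import ec

def firmware_is_authentic(public_key, image: bytes, signature: bytes) -> bool:
    """Verify the ECDSA-P256/SHA-256 signature before any flash write."""
    try:
        public_key.verify(signature, image, ec.ECDSA(hashes.SHA256()))
        return True
    except InvalidSignature:
        return False

# Illustration only: in production the private key lives in an HSM
private_key = ec.generate_private_key(ec.SECP256R1())
image = b"firmware v2.3 binary..."
signature = private_key.sign(image, ec.ECDSA(hashes.SHA256()))

firmware_is_authentic(private_key.public_key(), image, signature)         # True
firmware_is_authentic(private_key.public_key(), image + b"x", signature)  # False
```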

A 500 KB full firmware update over NB-IoT consumes 15–30 minutes of radio time and 500–750 KB of data plan. With delta (differential) updates, a typical 10 KB change to the firmware binary generates a 15–25 KB patch, reducing transmission time by 95%. Implement FOTA delta update (using bsdiff/bspatch or SUIT manifest) for production deployments where data plan costs are significant, targeting >95% reduction in update data volume.

An OTA architecture that has never been tested with power interruption, connectivity loss during download, and corrupted download will fail in production under these real-world conditions. Explicitly test: power off at 50% download completion (should resume or fail safely), disconnect network at 90% download, deliver corrupted firmware (wrong SHA-256), and deliver unsigned firmware. Verify that in each case the device: does not brick, retains previous working firmware, and resumes update on next available window.
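The corrupted-download case in that test list reduces to a digest comparison like the sketch below; on-device, the same check runs over the staged partition before the boot flag is flipped.

```python
import hashlib

def download_is_intact(image: bytes, expected_sha256_hex: str) -> bool:
    """Reject a corrupted or truncated image before marking it bootable."""
    return hashlib.sha256(image).hexdigest() == expected_sha256_hex

image = b"\x7fELF...firmware bytes..."
good_digest = hashlib.sha256(image).hexdigest()

download_is_intact(image, good_digest)        # True
download_is_intact(image[:-1], good_digest)   # False -- truncated mid-download
```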

20.11 What’s Next

In the next chapter, Rollback and Staged Rollout Strategies, we explore how to safely deploy updates to large device fleets using canary deployments, feature flags, and ring-based rollouts. You’ll learn how to design automatic rollback mechanisms and calculate optimal staged rollout timelines for your deployments.
