1622  Hardware Optimization Strategies

1622.1 Learning Objectives

By the end of this chapter, you will be able to:

  • Compare hardware acceleration options: Understand the spectrum from CPU to DSP to FPGA to ASIC
  • Evaluate ASIC specialization dimensions: Instruction set, functional units, memory, interconnect, and control
  • Apply heterogeneous multicore concepts: Understand ARM big.LITTLE and workload migration strategies
  • Select appropriate hardware platforms: Match hardware capabilities to application requirements
  • Analyze NRE vs production cost trade-offs: Make informed decisions about custom hardware

1622.2 Prerequisites

Before diving into this chapter, you should be familiar with basic computer architecture (processors, caches, and the memory hierarchy), embedded C programming, and introductory digital logic.

1622.3 Hardware Optimization

1622.3.1 Additional Processors

%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#2C3E50', 'primaryTextColor': '#fff', 'primaryBorderColor': '#16A085', 'lineColor': '#16A085', 'secondaryColor': '#E67E22', 'tertiaryColor': '#ECF0F1', 'fontFamily': 'Inter, system-ui, sans-serif'}}}%%
graph LR
    CPU["General Purpose<br/>CPU"] --> DSP["DSP<br/>Digital Signal<br/>Processor"]
    DSP --> FPGA["FPGA<br/>Field Programmable<br/>Gate Array"]
    FPGA --> ASIC["ASIC<br/>Application-Specific<br/>IC"]

    CPU -.-> C1["Flexible<br/>Medium performance<br/>Low NRE cost"]
    DSP -.-> D1["Arithmetic-heavy<br/>High performance<br/>Medium NRE cost"]
    FPGA -.-> F1["Reconfigurable<br/>Very high performance<br/>Medium-High NRE cost"]
    ASIC -.-> A1["Fixed function<br/>Maximum performance<br/>Highest NRE cost"]

    style CPU fill:#E67E22,stroke:#2C3E50,stroke-width:2px,color:#fff
    style DSP fill:#3498DB,stroke:#2C3E50,stroke-width:2px,color:#fff
    style FPGA fill:#16A085,stroke:#2C3E50,stroke-width:2px,color:#fff
    style ASIC fill:#2C3E50,stroke:#16A085,stroke-width:2px,color:#fff

Figure 1622.1: Hardware Acceleration Spectrum: CPU to DSP to FPGA to ASIC

The decision matrix below helps engineers choose an acceleration approach that fits their project's volume, time-to-market, and performance constraints.

%% fig-cap: "Hardware Acceleration Decision Matrix"
%% fig-alt: "Decision matrix comparing four hardware acceleration options across five criteria. CPU offers lowest performance but fastest development with no NRE cost, suitable for prototyping and low volume. DSP provides 10x performance with weeks development time and low-medium NRE, good for signal processing. FPGA offers 100x performance with months development and medium NRE at 10K-100K units, ideal for iterating designs. ASIC provides 1000x performance with 1-2 years development and high NRE at 1M+ units, required for mass consumer products."

%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#2C3E50', 'primaryTextColor': '#fff', 'primaryBorderColor': '#16A085', 'lineColor': '#16A085', 'secondaryColor': '#E67E22', 'tertiaryColor': '#fff', 'fontSize': '11px'}}}%%
graph LR
    subgraph Decision["CHOOSE YOUR ACCELERATION"]
        Q1{{"Volume?"}} --> Low["< 1K units"]
        Q1 --> Med["1K-100K units"]
        Q1 --> High["> 100K units"]

        Low --> CPU["CPU/MCU<br/>1x perf<br/>Days dev<br/>$0 NRE"]
        Med --> FPGA["FPGA<br/>100x perf<br/>Months dev<br/>$$$ NRE"]
        High --> ASIC["ASIC<br/>1000x perf<br/>1-2 years<br/>$$$$ NRE"]

        DSP["DSP<br/>10x perf<br/>Weeks dev<br/>$$ NRE"]
    end

    style CPU fill:#E67E22,stroke:#2C3E50,color:#fff
    style DSP fill:#2C3E50,stroke:#16A085,color:#fff
    style FPGA fill:#16A085,stroke:#2C3E50,color:#fff
    style ASIC fill:#2C3E50,stroke:#E67E22,color:#fff
    style Low fill:#7F8C8D,stroke:#2C3E50,color:#fff
    style Med fill:#7F8C8D,stroke:#2C3E50,color:#fff
    style High fill:#7F8C8D,stroke:#2C3E50,color:#fff

Figure 1622.2: Decision matrix helping select the right hardware acceleration for your project based on production volume and development timeline constraints.

{fig-alt="Hardware acceleration progression from general-purpose CPU through DSP and FPGA to ASIC showing increasing performance and NRE cost with decreasing flexibility"}

DSPs (Digital Signal Processors): implement specialized routines for specific applications

  • Designed for arithmetic-heavy workloads
  • Provide many arithmetic instructions and abundant parallelism
  • Examples: radio baseband (4G), image/audio processing, video encoding, computer vision

FPGAs (Field Programmable Gate Arrays): a popular accelerator in embedded systems

  • Ultimate reconfigurability: can be reprogrammed an unlimited number of times "in the field"
  • Useful for software acceleration with the option of later upgrades
  • Invaluable during hardware prototyping

ASICs (Application-Specific Integrated Circuits): the logical next step for highly customized applications

  • Designed for one fixed application (e.g., Bitcoin mining)
  • Highest performance per unit of silicon area and power
  • Lowest overall per-unit cost at high volume
  • No gate-level reconfigurability after fabrication

1622.3.2 ASIC Specializations

%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#2C3E50', 'primaryTextColor': '#fff', 'primaryBorderColor': '#16A085', 'lineColor': '#16A085', 'secondaryColor': '#E67E22', 'tertiaryColor': '#ECF0F1', 'fontFamily': 'Inter, system-ui, sans-serif'}}}%%
mindmap
  root((ASIC<br/>Specialization))
    Instruction Set
      Minimal instructions
      Compressed encoding
      App-specific ops
      Vector operations
    Functional Units
      Multiply-accumulate
      Crypto accelerators
      Custom datapaths
    Memory System
      Bank count
      Cache config
      Access ports
      Multiple blocks
    Interconnect
      Bus width
      Network-on-chip
      Direct connections
    Control Logic
      Custom FSMs
      Simplified decode

Figure 1622.3: ASIC Specialization Dimensions: Instruction Set, Functional Units, Memory, Interconnect

{fig-alt="Mind map showing five dimensions of ASIC specialization: instruction set, functional units, memory system, interconnect, and control logic with specific optimization techniques under each category"}

1622.3.2.1 Instruction Set Specialization

  • Implement bare minimum required instructions, omit unused ones
  • Compress instruction encodings to save space
  • Introduce application-specific instructions:
    • Multiply-accumulate operations
    • Encoding/decoding, filtering
    • Vector operations
    • String manipulation/matching
    • Pixel operations/transformations
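
The list above centres on the multiply-accumulate (MAC) pattern; a minimal C sketch shows the loop shape these instructions collapse. The function and identifiers here are illustrative, not from a specific DSP toolchain:

```c
#include <stdint.h>
#include <stddef.h>

/* FIR filter inner loop: the multiply-accumulate pattern that DSP and
 * ASIC instruction sets collapse into a single MAC operation. On a
 * plain CPU this is two instructions (mul + add) per tap; a DSP issues
 * one MAC per cycle, often inside a zero-overhead loop. */
int32_t fir_sample(const int16_t *coeff, const int16_t *sample, size_t taps)
{
    int32_t acc = 0;                          /* wide accumulator avoids overflow */
    for (size_t i = 0; i < taps; i++) {
        acc += (int32_t)coeff[i] * sample[i]; /* the MAC: one DSP instruction */
    }
    return acc;
}
```

On a Cortex-M4, which includes DSP extensions, a compiler can map this same pattern onto instructions such as SMLABB or SMLAD.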

1622.3.2.2 Memory Specialization

  • Number and size of memory banks + access ports
  • Cache configurations (separate/unified, associativity, size)
  • Multiple smaller blocks increase parallelism and reduce power
  • Application-dependent (profiling very important!)

1622.3.3 Memory and Register Visualizations

The following AI-generated diagrams illustrate memory architecture and register concepts critical for IoT firmware optimization.

Graph showing cache hit rate versus access latency with three scenarios: L1 cache hit at 1-2 cycles, L2 cache hit at 10-20 cycles, and main memory access at 100-200 cycles, demonstrating the importance of cache-friendly code for IoT performance.

Caching Performance Analysis
Figure 1622.4: Cache performance dramatically impacts IoT firmware efficiency. This visualization shows how L1 cache hits complete in 1-2 cycles while main memory access requires 100-200 cycles - a 100x penalty that accumulates rapidly in sensor processing loops.

Block diagram of microcontroller bus architecture showing high-speed AHB bus connecting CPU to memory and DMA, and lower-speed APB bus connecting to peripherals like GPIO, UART, and SPI with clock domain crossings.

Bus Architecture
Figure 1622.5: Microcontroller bus architecture affects data transfer efficiency. The AHB (Advanced High-performance Bus) provides fast CPU-memory paths, while the APB (Advanced Peripheral Bus) connects slower peripherals. Understanding bus topology helps optimize DMA configurations and peripheral access patterns.

Detailed view of microcontroller control register showing individual bit fields for peripheral configuration including enable bits, mode selection, interrupt flags, and status bits with color-coded read-write permissions.

Control Registers
Figure 1622.6: Control registers provide low-level hardware configuration. This visualization shows typical register organization with enable bits, mode selection fields, interrupt flags, and status indicators. Efficient register access requires understanding bit manipulation and atomic operations.
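
A sketch of the read-modify-write pattern such registers require, using a hypothetical bit layout (not a real part's register map):

```c
#include <stdint.h>

/* Hypothetical control-register layout:
 *   bit 0     EN    peripheral enable
 *   bits 2:1  MODE  00=off 01=input 10=output 11=alternate
 *   bit 4     IF    interrupt flag (write 1 to clear)        */
#define CR_EN        (1u << 0)
#define CR_MODE_POS  1u
#define CR_MODE_MSK  (3u << CR_MODE_POS)
#define CR_IF        (1u << 4)

/* Read-modify-write: clear the field first, then OR in the new value,
 * so neighbouring bits are preserved. On real hardware the register
 * would be accessed through a volatile pointer at a fixed address. */
static inline uint32_t cr_set_mode(uint32_t cr, uint32_t mode)
{
    cr &= ~CR_MODE_MSK;                          /* clear old MODE bits  */
    cr |= (mode << CR_MODE_POS) & CR_MODE_MSK;   /* write new MODE bits  */
    return cr;
}
```

On registers shared with interrupt handlers, this read-modify-write sequence must be made atomic (critical section or bit-banding), since an ISR firing between the read and the write can lose its update.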

ARM Cortex-M core register set showing R0-R12 general purpose registers, R13 stack pointer, R14 link register, R15 program counter, and special registers PSR, PRIMASK, CONTROL with their roles in interrupt handling and exception processing.

Core Registers
Figure 1622.7: ARM Cortex-M core registers form the foundation of embedded programming. General purpose registers R0-R12 hold working data, while R13 (stack pointer), R14 (link register), and R15 (program counter) manage execution flow. Understanding these registers enables efficient assembly optimization.

ARM Cortex-M4 memory map showing address regions from 0x00000000 for code through SRAM, peripherals, external RAM, external device, and ending at system region near 0xFFFFFFFF with typical sizes and access permissions for each region.

Cortex-M Memory Map
Figure 1622.8: The Cortex-M memory map defines fixed address regions for different memory types. Code executes from the lowest addresses, SRAM provides fast read-write storage, and peripheral registers occupy the 0x40000000 region. This predictable layout enables efficient linker scripts and DMA configuration.

Detailed Cortex-M4 peripheral memory region showing how each peripheral block occupies 1KB of address space from 0x40000000 through 0x5FFFFFFF with GPIOA, GPIOB, UART1, SPI1, I2C1 mapped to specific addresses.

Cortex-M Memory Regions
Figure 1622.9: Peripheral memory mapping enables direct hardware access through memory operations. Each peripheral occupies a fixed address range with registers at predictable offsets. This structure enables bit-banding and efficient peripheral initialization.
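
The register-block-at-a-fixed-address idea can be sketched as a C struct overlay; the layout and address below are illustrative, and a host-side stand-in buffer substitutes for real hardware:

```c
#include <stdint.h>

/* Register block as a struct overlay: member offsets mirror the
 * datasheet register offsets. This layout is hypothetical. */
typedef struct {
    volatile uint32_t MODER;   /* offset 0x00: pin mode    */
    volatile uint32_t ODR;     /* offset 0x04: output data */
    volatile uint32_t IDR;     /* offset 0x08: input data  */
} gpio_regs_t;

/* On target the block lives at a fixed bus address, e.g.
 *     #define GPIOA ((gpio_regs_t *)0x40020000u)
 * (address illustrative). For a host-side demo, overlay the same
 * struct on ordinary memory so the offset arithmetic is visible. */
uint32_t demo_write_odr(uint32_t *backing, uint32_t value)
{
    gpio_regs_t *g = (gpio_regs_t *)backing;
    g->ODR = value;       /* struct member access ...           */
    return backing[1];    /* ... lands at word offset 1 (0x04)  */
}
```

The `volatile` qualifier is essential on target: it stops the compiler caching or reordering accesses to addresses the hardware can change underneath it.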

Block diagram of DMA controller showing multiple channels, priority arbitration, source and destination address registers, transfer count, and handshaking signals connecting memory, peripherals, and CPU with data paths bypassing CPU for direct transfers.

DMA Controller Architecture
Figure 1622.10: DMA (Direct Memory Access) enables data transfer without CPU intervention. This visualization shows how DMA channels connect peripherals directly to memory, freeing the CPU for computation while sensor data streams into buffers automatically.
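
What a DMA channel does per transfer can be modelled in plain C; the descriptor fields below are illustrative, not a specific controller's register map:

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical DMA channel descriptor. */
typedef struct {
    const volatile uint8_t *src;  /* source (e.g. a UART data register)  */
    volatile uint8_t *dst;        /* destination buffer in SRAM          */
    size_t count;                 /* bytes remaining                     */
    uint8_t src_increment;        /* 0 = fixed peripheral register       */
} dma_channel_t;

/* One handshake, modelled in software: the engine moves a byte,
 * advances the pointers, and decrements the count. The CPU is free
 * the entire time; it only sees the transfer-complete interrupt. */
void dma_step(dma_channel_t *ch)
{
    if (ch->count == 0)
        return;
    *ch->dst++ = *ch->src;
    if (ch->src_increment)
        ch->src++;
    ch->count--;
}
```

Keeping `src_increment` at 0 mirrors the common peripheral-to-memory case: the source stays parked on one data register while the destination walks through a buffer.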

Table comparing computational cost in CPU cycles and energy for common IoT operations including integer add, multiply, divide, floating point operations, memory access, peripheral read, and function call with order-of-magnitude differences highlighted.

Operation Cost Examples
Figure 1622.11: Understanding operation costs guides optimization priorities. This visualization compares CPU cycles and energy for common operations, revealing that memory access and function calls often dominate over arithmetic in typical IoT firmware.

Graph showing DVFS operating points with voltage on Y-axis and frequency on X-axis, demonstrating how lower voltage enables lower frequency operation with quadratic power savings, plus example showing 50 percent frequency reduction yielding 75 percent power reduction.

Dynamic Voltage and Frequency Scaling
Figure 1622.12: DVFS enables dynamic power-performance trade-offs. This visualization shows operating points where reduced frequency allows lower voltage operation, providing quadratic power savings. IoT devices can scale performance to match workload demands.
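
The relationship behind the figure can be written out. To first order, dynamic CMOS switching power is

$$P_{\text{dyn}} = \alpha \, C \, V_{dd}^{2} \, f$$

where $\alpha$ is the activity factor, $C$ the switched capacitance, $V_{dd}$ the supply voltage, and $f$ the clock frequency. Halving $f$ alone only halves $P_{\text{dyn}}$; the larger win comes from the voltage headroom the lower frequency creates. In the figure's example, running at 50% frequency while $V_{dd}$ drops to roughly 70% of nominal gives $P \approx 0.5 \times 0.7^{2} \approx 0.25$ of the original, i.e. the quoted 75% power reduction.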

1622.3.4 Heterogeneous Multicores

%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#2C3E50', 'primaryTextColor': '#fff', 'primaryBorderColor': '#16A085', 'lineColor': '#16A085', 'secondaryColor': '#E67E22', 'tertiaryColor': '#ECF0F1', 'fontFamily': 'Inter, system-ui, sans-serif'}}}%%
graph TB
    subgraph "ARM big.LITTLE Architecture"
        subgraph "LITTLE Cluster (Low Power)"
            A7_1["Cortex-A7<br/>Core 0"]
            A7_2["Cortex-A7<br/>Core 1"]
            A7_3["Cortex-A7<br/>Core 2"]
            A7_4["Cortex-A7<br/>Core 3"]
        end

        subgraph "big Cluster (High Performance)"
            A15_1["Cortex-A15<br/>Core 4"]
            A15_2["Cortex-A15<br/>Core 5"]
            A15_3["Cortex-A15<br/>Core 6"]
            A15_4["Cortex-A15<br/>Core 7"]
        end
    end

    SCHED["Task Scheduler"] --> A7_1
    SCHED --> A7_2
    SCHED --> A15_1
    SCHED --> A15_2

    A7_1 -.->|"Migrate workload"| A15_1
    A7_2 -.->|"Migrate workload"| A15_2

    style A7_1 fill:#16A085,stroke:#2C3E50,stroke-width:2px,color:#fff
    style A7_2 fill:#16A085,stroke:#2C3E50,stroke-width:2px,color:#fff
    style A7_3 fill:#16A085,stroke:#2C3E50,stroke-width:2px,color:#fff
    style A7_4 fill:#16A085,stroke:#2C3E50,stroke-width:2px,color:#fff
    style A15_1 fill:#E74C3C,stroke:#2C3E50,stroke-width:2px,color:#fff
    style A15_2 fill:#E74C3C,stroke:#2C3E50,stroke-width:2px,color:#fff
    style A15_3 fill:#E74C3C,stroke:#2C3E50,stroke-width:2px,color:#fff
    style A15_4 fill:#E74C3C,stroke:#2C3E50,stroke-width:2px,color:#fff
    style SCHED fill:#E67E22,stroke:#2C3E50,stroke-width:2px,color:#fff

Figure 1622.13: ARM big.LITTLE Architecture: Low-Power and High-Performance Core Clusters

{fig-alt="ARM big.LITTLE heterogeneous multicore architecture showing LITTLE cluster with four Cortex-A7 cores for low power tasks and big cluster with four Cortex-A15 cores for high performance with task scheduler managing workload migration between clusters"}

Heterogeneous Multicore: a processor containing multiple cores that share the same instruction set architecture (ISA) but have different power/performance profiles

Example: ARM big.LITTLE with four Cortex-A7 and four Cortex-A15 cores

Run-state Migration Strategies:

  1. Clustered Switching: Either all fast cores OR all slow cores
  2. CPU Migration: Pairs of fast/slow cores, threads migrate between pairs
  3. Global Task Scheduling: Each core seen separately, threads scheduled on appropriate core
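
Global task scheduling (strategy 3) can be sketched as a per-thread placement decision driven by tracked load, with hysteresis so threads do not ping-pong between clusters. The thresholds below are illustrative, not values from a real scheduler:

```c
#include <stdint.h>

typedef enum { CLUSTER_LITTLE, CLUSTER_BIG } cluster_t;

/* Decide which cluster a thread should run on, given its recent load
 * (percent of time runnable) and where it runs now. The gap between
 * the two thresholds is the hysteresis band. */
cluster_t pick_cluster(uint8_t load_pct, cluster_t current)
{
    const uint8_t up_threshold = 80;    /* migrate LITTLE -> big   */
    const uint8_t down_threshold = 30;  /* migrate big -> LITTLE   */

    if (current == CLUSTER_LITTLE && load_pct > up_threshold)
        return CLUSTER_BIG;
    if (current == CLUSTER_BIG && load_pct < down_threshold)
        return CLUSTER_LITTLE;
    return current;  /* inside the hysteresis band: stay put */
}
```

A 50%-loaded thread stays wherever it already runs; only sustained bursts climb to the big cluster, and only genuinely idle threads fall back, which matches the 99%-monitoring/1%-burst IoT pattern described above.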

Decision tree for IoT hardware platform selection, starting from requirements such as connectivity, processing power, and battery life and leading to recommendations for ESP32, STM32, nRF52, or Raspberry Pi.

Hardware Selection Decision Tree
Figure 1622.14: Selecting the right hardware platform impacts all aspects of IoT development. This decision tree guides platform selection based on connectivity requirements (Wi-Fi/BLE/LoRa/cellular), processing needs (8-bit vs 32-bit, ML capability), and power constraints.

1622.4 Knowledge Check

Question 1: What is the main advantage of an ASIC over an FPGA?

ASICs provide the highest performance and lowest per-unit cost at high production volumes because they are optimized for a specific application. However, they have high NRE (non-recurring engineering) costs and cannot be reconfigured after manufacturing.

Question 2: In ARM big.LITTLE, what is the purpose of having both Cortex-A7 and Cortex-A15 cores?

The big.LITTLE architecture enables energy-proportional computing by using low-power LITTLE cores (Cortex-A7) for background tasks and high-performance big cores (Cortex-A15) for burst processing. This matches the IoT workload pattern: 99% idle monitoring, 1% compute bursts.

Question 3: The big.LITTLE architecture uses clustered switching between fast and slow cores. What is the PRIMARY advantage of this approach for battery-powered IoT devices?

The benefit for IoT is energy-proportional computing: use low-power slow cores for background tasks (sensor polling, idle loops) and switch to fast cores only when needed (burst data processing, ML inference). This matches the IoT workload pattern: 99% low-intensity monitoring, 1% burst processing.

1622.5 Common Pitfalls

Caution (Pitfall): Forgetting to Enable Peripheral Clocks Before Register Access

The Mistake: Writing to peripheral configuration registers (GPIO, UART, SPI, I2C) before enabling the peripheral’s clock gate, causing hard faults, bus errors, or silent failures where writes appear to succeed but have no effect.

Why It Happens: On ARM Cortex-M microcontrollers (STM32, nRF52, SAM, LPC), peripherals are clock-gated by default to save power. Unlike desktop CPUs where all hardware is always accessible, embedded peripherals exist in a powered-down state until explicitly enabled via RCC (Reset and Clock Control) registers. Developers familiar with Arduino’s pinMode() don’t realize it internally calls clock enable functions.

The Fix: Always enable peripheral clocks as the first step before any register access. Check your MCU’s reference manual for the specific RCC register and bit field:

// WRONG: Accessing GPIO before clock enable (STM32F4 example)
GPIOA->MODER |= GPIO_MODER_MODE5_0;  // Hard fault or silent fail!

// CORRECT: Enable clock first, then configure
RCC->AHB1ENR |= RCC_AHB1ENR_GPIOAEN;  // Enable GPIOA clock
__DSB();  // Data Synchronization Barrier - ensure clock is active
GPIOA->MODER |= GPIO_MODER_MODE5_0;  // Now safe to configure

// STM32 HAL equivalent:
__HAL_RCC_GPIOA_CLK_ENABLE();
// For UART1:
__HAL_RCC_USART1_CLK_ENABLE();
// For SPI1:
__HAL_RCC_SPI1_CLK_ENABLE();

Debugging Tip: If a peripheral “doesn’t work” but code compiles, check clock enable first. Use a debugger to read the peripheral registers - if they all read as 0x00000000, the clock isn’t enabled. On Cortex-M3/M4, accessing an unclocked peripheral triggers a BusFault or HardFault (BFSR register shows PRECISERR).

Caution (Pitfall): Incorrect NVIC Priority Configuration Breaking Nested Interrupts

The Mistake: Setting interrupt priorities without understanding that ARM Cortex-M uses inverted priority numbering (lower number = higher priority), and failing to configure priority grouping correctly, causing critical interrupts to be blocked by less important ones.

Why It Happens: Priority numbering on Cortex-M is counterintuitive: priority 0 is the HIGHEST priority, not lowest. Additionally, the NVIC supports priority grouping (preemption priority + sub-priority), but the number of implemented priority bits varies by MCU (STM32F4 uses 4 bits = 16 levels, nRF52 uses 3 bits = 8 levels, ESP32’s Xtensa uses different scheme). Developers often assume more bits or use HAL defaults without understanding implications.

The Fix: Explicitly configure priority grouping at startup and assign priorities systematically, remembering lower numbers preempt higher numbers:

// WRONG: Assuming priority 10 > priority 5 (backwards!)
NVIC_SetPriority(USART1_IRQn, 5);   // This is HIGHER priority
NVIC_SetPriority(TIM2_IRQn, 10);    // This is LOWER priority (won't preempt USART1)

// CORRECT: Design priority scheme with awareness of inversion
// Priority 0-1: Safety-critical (motor control, watchdog)
// Priority 2-3: Time-critical (encoder, high-speed ADC)
// Priority 4-7: Standard peripherals (UART, SPI)
// Priority 8-15: Low-priority (LED status, debug)

// Configure priority grouping first: PRIGROUP = 0 assigns all implemented
// priority bits to preemption priority (functionally equivalent to
// HAL_NVIC_SetPriorityGrouping(NVIC_PRIORITYGROUP_4) on STM32's 4-bit NVIC)
NVIC_SetPriorityGrouping(0);

// Set priorities (lower number = higher priority = CAN preempt others)
NVIC_SetPriority(EXTI0_IRQn, 2);     // High priority - emergency stop button
NVIC_SetPriority(TIM2_IRQn, 3);      // Motor control PWM - critical timing
NVIC_SetPriority(USART1_IRQn, 6);    // Serial - can be delayed
NVIC_SetPriority(I2C1_EV_IRQn, 8);   // Sensor polling - lowest priority

// Enable interrupts
NVIC_EnableIRQ(EXTI0_IRQn);
NVIC_EnableIRQ(TIM2_IRQn);

Debugging Tip: Use debugger to read NVIC->IP[] (Interrupt Priority) registers and verify values. If a high-priority interrupt isn’t preempting, check that (1) priorities are different, (2) priority grouping allows preemption, and (3) the pending interrupt’s priority number is numerically LOWER than the running ISR.

1622.6 Summary

Hardware optimization provides the foundation for high-performance IoT systems:

  1. Hardware Spectrum: CPU -> DSP -> FPGA -> ASIC with increasing performance and NRE cost
  2. DSPs: Ideal for arithmetic-heavy applications (signal processing, audio, video)
  3. FPGAs: Reconfigurable acceleration, valuable for prototyping and field updates
  4. ASICs: Maximum performance and efficiency at volume, but high NRE and inflexible
  5. ASIC Specialization: Instruction set, functional units, memory, interconnect, control logic
  6. Heterogeneous Multicores: big.LITTLE balances performance with energy efficiency
  7. Memory Architecture: Understanding caches, DMA, and peripheral access is critical

The key is matching hardware capabilities to application requirements while considering production volume and time-to-market constraints.

1622.7 What’s Next

The next chapter covers Software Optimization, which explores compiler optimization flags, code size strategies, SIMD vectorization, function inlining, and other software techniques for embedded IoT firmware.