16  Hardware Optimization Strategies

16.1 Learning Objectives

By the end of this chapter, you will be able to:

  • Compare hardware acceleration options: Understand the spectrum from CPU to DSP to FPGA to ASIC
  • Evaluate ASIC specialization dimensions: Instruction set, functional units, memory, interconnect, and control
  • Apply heterogeneous multicore concepts: Understand ARM big.LITTLE and workload migration strategies
  • Select appropriate hardware platforms: Match hardware capabilities to application requirements
  • Analyze NRE vs production cost trade-offs: Make informed decisions about custom hardware
In 60 Seconds

Hardware optimization selects the right processor type for each IoT workload — general-purpose MCUs for flexibility, DSPs for signal processing, FPGAs for reconfigurable parallelism, and ASICs for maximum efficiency at high volume — with each step toward specialization trading flexibility for 2–100x better energy efficiency.

Key Concepts

  • Processor Spectrum: Ranges from general-purpose CPU (most flexible, least efficient) through DSP and FPGA to ASIC (least flexible, most efficient); IoT designs often use heterogeneous combinations
  • DSP Instructions: Special operations like multiply-accumulate (MAC) that execute in a single cycle what a CPU takes multiple cycles; critical for FFT, filtering, and correlation
  • ARM big.LITTLE: Heterogeneous multicore architecture combining high-performance cores (Cortex-A) with energy-efficient cores (Cortex-M); migrates workloads between cores based on demand
  • FPGA Reconfigurability: Logic gates can be reprogrammed after manufacturing, allowing the same chip to implement different hardware accelerators; used for prototyping ASIC designs
  • NRE Cost: Non-recurring engineering cost for custom silicon design; ASIC NRE ranges from $500K (simple, older process) to $50M+ (complex, advanced node)
  • Break-even Volume: The production quantity at which ASIC total cost (NRE + unit cost) becomes less than FPGA or DSP total cost; typically 100K–1M units
  • Hardware Accelerator: A dedicated circuit block that implements one algorithm in hardware, achieving orders-of-magnitude better energy efficiency than software running on a general-purpose CPU

Choosing the right processor determines both how efficiently your IoT device uses power and how much each unit costs to manufacture. Think of choosing between a Swiss Army knife (general-purpose CPU), a specialized chef’s knife (DSP), a customizable multitool (FPGA), or a purpose-built tool designed for exactly one task (ASIC). Each has trade-offs in flexibility, cost, and efficiency.

“Not all processors are created equal,” said Max the Microcontroller. “A general-purpose CPU like me can do anything, but I am not the fastest at any ONE thing. A DSP is built for signal processing. An FPGA can be rewired for any task. And an ASIC is custom-built for one specific job – it is the fastest and most efficient, but the most expensive to design.”

Sammy the Sensor asked, “So why not always use an ASIC?” Max explained, “Because designing an ASIC costs millions of dollars in engineering. That only makes sense if you are building millions of devices. For a prototype or small run, a general-purpose microcontroller is perfect. The right choice depends on volume, budget, and performance needs.”

Bella the Battery loved the heterogeneous approach: “Modern chips like ARM big.LITTLE have both powerful cores and efficient cores. When Max needs to crunch data fast, he uses the big core. When he is just monitoring sensors, he switches to the little core that uses 10 times less power. Best of both worlds!” Lila the LED added, “Hardware acceleration can make specific tasks 100 to 1,000 times more energy efficient than software. Choose wisely!”

16.2 Prerequisites

Before diving into this chapter, you should be familiar with:

16.3 Hardware Optimisation

16.3.1 Additional Processors

Hardware acceleration spectrum showing progression from general-purpose CPU through DSP and FPGA to ASIC with increasing performance, decreasing flexibility, and rising NRE costs but falling per-unit costs at volume
Figure 16.1: Hardware Acceleration Spectrum: CPU to DSP to FPGA to ASIC

This matrix variant helps engineers decide which acceleration approach fits their project constraints based on volume, time-to-market, and performance requirements.

Hardware acceleration progression from general-purpose CPU through DSP and FPGA to ASIC showing increasing performance and NRE cost with decreasing flexibility
Figure 16.2: Decision matrix helping select the right hardware acceleration for your project based on production volume and development timeline constraints.

DSPs (Digital Signal Processors): Implement specialized routines for specific applications - Designed for arithmetic-heavy applications - Lots of arithmetic instructions and parallelism - Examples: Radio baseband (4G), image/audio processing, video encoding, vision

FPGAs (Field Programmable Gate Arrays): Popular accelerator in embedded systems - Ultimate re-configurability - Can be reconfigured unlimited times “in the field” - Useful for software acceleration with potential for upgrades - Invaluable during hardware prototyping

ASICs (Application-Specific Integrated Circuits): Logical next step for highly customized applications - Designed for fixed application (e.g., Bitcoin mining) - Highest performance over silicon and power consumption - Lowest overall cost (at volume) - No gate-level reconfigurability

16.3.2 ASIC Specializations

Diagram showing five dimensions of ASIC specialization including instruction set customization with application-specific instructions, functional unit specialization for arithmetic operations, memory architecture optimization with cache configurations, interconnect design for efficient data movement, and control logic simplification
Figure 16.3: ASIC Specialization Dimensions: Instruction Set, Functional Units, Memory, Interconnect

16.3.2.1 Instruction Set Specialization

  • Implement bare minimum required instructions, omit unused ones
  • Compress instruction encodings to save space
  • Introduce application-specific instructions:
    • Multiply-accumulate operations
    • Encoding/decoding, filtering
    • Vector operations
    • String manipulation/matching
    • Pixel operations/transformations

16.3.2.2 Memory Specialization

  • Number and size of memory banks + access ports
  • Cache configurations (separate/unified, associativity, size)
  • Multiple smaller blocks increase parallelism and reduce power
  • Application-dependent (profiling very important!)

16.3.3 Memory and Register Visualizations

The following AI-generated diagrams illustrate memory architecture and register concepts critical for IoT firmware optimization.

Graph showing cache hit rate versus access latency with three scenarios: L1 cache hit at 1-2 cycles, L2 cache hit at 10-20 cycles, and main memory access at 100-200 cycles, demonstrating the importance of cache-friendly code for IoT performance.

Caching Performance Analysis
Figure 16.4: Cache performance dramatically impacts IoT firmware efficiency. This visualization shows how L1 cache hits complete in 1-2 cycles while main memory access requires 100-200 cycles - a 100x penalty that accumulates rapidly in sensor processing loops.

Block diagram of microcontroller bus architecture showing high-speed AHB bus connecting CPU to memory and DMA, and lower-speed APB bus connecting to peripherals like GPIO, UART, and SPI with clock domain crossings.

Bus Architecture
Figure 16.5: Microcontroller bus architecture affects data transfer efficiency. The AHB (Advanced High-performance Bus) provides fast CPU-memory paths, while the APB (Advanced Peripheral Bus) connects slower peripherals. Understanding bus topology helps optimize DMA configurations and peripheral access patterns.

Detailed view of microcontroller control register showing individual bit fields for peripheral configuration including enable bits, mode selection, interrupt flags, and status bits with color-coded read-write permissions.

Control Registers
Figure 16.6: Control registers provide low-level hardware configuration. This visualization shows typical register organization with enable bits, mode selection fields, interrupt flags, and status indicators. Efficient register access requires understanding bit manipulation and atomic operations.

ARM Cortex-M core register set showing R0-R12 general purpose registers, R13 stack pointer, R14 link register, R15 program counter, and special registers PSR, PRIMASK, CONTROL with their roles in interrupt handling and exception processing.

Core Registers
Figure 16.7: ARM Cortex-M core registers form the foundation of embedded programming. General purpose registers R0-R12 hold working data, while R13 (stack pointer), R14 (link register), and R15 (program counter) manage execution flow. Understanding these registers enables efficient assembly optimization.

ARM Cortex-M4 memory map showing address regions from 0x for code through SRAM, peripherals, external RAM, external device, and ending at system region near 0xFFFFFFFF with typical sizes and access permissions for each region.

Cortex-M Memory Map
Figure 16.8: The Cortex-M memory map defines fixed address regions for different memory types. Code executes from the lowest addresses, SRAM provides fast read-write storage, and peripheral registers occupy the 0x40000000 region. This predictable layout enables efficient linker scripts and DMA configuration.

Detailed Cortex-M4 peripheral memory region showing how each peripheral block occupies 1KB of address space from 0x through 0x5FFFFFFF with GPIOA, GPIOB, UART1, SPI1, I2C1 mapped to specific addresses.

Cortex-M Memory Regions
Figure 16.9: Peripheral memory mapping enables direct hardware access through memory operations. Each peripheral occupies a fixed address range with registers at predictable offsets. This structure enables bit-banding and efficient peripheral initialization.

Block diagram of DMA controller showing multiple channels, priority arbitration, source and destination address registers, transfer count, and handshaking signals connecting memory, peripherals, and CPU with data paths bypassing CPU for direct transfers.

DMA Controller Architecture
Figure 16.10: DMA (Direct Memory Access) enables data transfer without CPU intervention. This visualization shows how DMA channels connect peripherals directly to memory, freeing the CPU for computation while sensor data streams into buffers automatically.

Table comparing computational cost in CPU cycles and energy for common IoT operations including integer add, multiply, divide, floating point operations, memory access, peripheral read, and function call with order-of-magnitude differences highlighted.

Operation Cost Examples
Figure 16.11: Understanding operation costs guides optimization priorities. This visualization compares CPU cycles and energy for common operations, revealing that memory access and function calls often dominate over arithmetic in typical IoT firmware.

Graph showing DVFS operating points with voltage on Y-axis and frequency on X-axis, demonstrating how lower voltage enables lower frequency operation with quadratic power savings, plus example showing 50 percent frequency reduction yielding 75 percent power reduction.

Dynamic Voltage and Frequency Scaling
Figure 16.12: DVFS enables dynamic power-performance trade-offs. This visualization shows operating points where reduced frequency allows lower voltage operation, providing quadratic power savings. IoT devices can scale performance to match workload demands.

16.3.4 Worked Example: Choosing Between CPU, DSP, FPGA, and ASIC for Audio Anomaly Detection

Scenario: A factory deploys vibration sensors on 500 CNC machines. Each sensor runs a 256-point FFT every 100 ms to detect bearing failures. The product must run on battery for 2 years and cost under $30 per unit at volume.

Step 1: Compute Requirements

FFT operations per second: 256-point FFT x 10 Hz sample rate
256-point FFT = log2(256) = 8 stages
Each stage: 256/2 = 128 butterfly operations (complex multiply-add)
Total per FFT: 128 x 8 = 1,024 complex MACs
Per second: 1,024 x 10 = 10,240 complex MACs/sec ≈ 20,480 real MACs/sec
Total with windowing + peak detection: ~50,000 operations/second (50 KOPS)

Step 2: Platform Comparison

Factor ARM Cortex-M4 TI C5535 DSP Lattice iCE40 FPGA Custom ASIC
Unit cost (10K vol) $2.50 $8.00 $5.50 $1.20
NRE cost $0 $0 $15K $500K+
FFT 256-pt time 1.2 ms @ 80 MHz 0.08 ms @ 100 MHz 0.01 ms (parallel) 0.005 ms
Active power 12 mW 25 mW 8 mW 2 mW
Sleep power 3 uA 15 uA 50 uA 1 uA
Dev time 2 weeks 4 weeks 12 weeks 18+ months
Flexibility Full (firmware) Full (firmware) Partial (HDL) None
Battery life (CR2477) 3.1 years 1.8 years 2.8 years 5.2 years

Step 3: Battery Life Calculation (Cortex-M4 example)

Active time per cycle: 1.2 ms
Cycles per second: 10 Hz
Active time per second: 1.2 ms × 10 = 12 ms/sec
Duty cycle: 12 ms / 1000 ms = 0.012 = 1.2%

Active current: 12 mW / 3.3V = 3.636 mA
Sleep current: 3 μA = 0.003 mA

Average current: (3.636 mA × 0.012) + (0.003 mA × 0.988)
               = 0.0436 mA + 0.00296 mA = 0.0466 mA ≈ 46.6 μA

CR2477 capacity: 1000 mAh
Battery life: 1000 mAh / 0.0466 mA = 21,459 hours = 894 days = 2.45 years
With 80% depth of discharge safety margin: 2.45 / 0.8 ≈ 3.1 years

Let’s work through the detailed battery life calculation for each platform option:

Cortex-M4 (ARM) energy analysis:

  • FFT execution time: \(t_{\text{active}} = 1.2 \text{ ms}\) per cycle
  • Sampling rate: \(f_{\text{sample}} = 10 \text{ Hz}\) (every 100 ms)
  • Active duty cycle: \(D = \frac{1.2 \text{ ms}}{100 \text{ ms}} = 0.012\) (1.2%)

Active current: \[I_{\text{active}} = \frac{P_{\text{active}}}{V} = \frac{12 \text{ mW}}{3.3 \text{ V}} = 3.636 \text{ mA}\]

Average current (duty cycle formula): \[I_{\text{avg}} = (I_{\text{active}} \times D) + (I_{\text{sleep}} \times (1-D))\] \[I_{\text{avg}} = (3.636 \times 0.012) + (0.003 \times 0.988) = 0.0436 + 0.00296 = 0.0466 \text{ mA}\]

Battery life with CR2477 (1000 mAh): \[\text{Life} = \frac{C}{I_{\text{avg}}} = \frac{1000 \text{ mAh}}{0.0466 \text{ mA}} = 21,459 \text{ hours} = 894 \text{ days} = 2.45 \text{ years}\]

Comparing to DSP option:

  • DSP active current: \(I_{\text{DSP}} = \frac{25 \text{ mW}}{3.3 \text{ V}} = 7.58 \text{ mA}\)
  • DSP FFT time: 0.08 ms → active time/sec: \(0.08 \text{ ms} \times 10 = 0.8 \text{ ms/sec}\)
  • DSP duty cycle: \(D_{\text{DSP}} = \frac{0.8}{1000} = 0.0008\) (0.08%)
  • DSP sleep: \(I_{\text{sleep\_DSP}} = 0.015 \text{ mA}\) (15 μA) \[I_{\text{avg\_DSP}} = (7.58 \times 0.0008) + (0.015 \times 0.9992) = 0.00606 + 0.01499 = 0.02105 \text{ mA}\] \[\text{Life}_{\text{DSP}} = \frac{1000}{0.02105} = 47,506 \text{ hours} = 5.4 \text{ years}\]

Wait, the DSP calculation shows 5.4 years but the table shows 1.8 years? This discrepancy occurs because the table accounts for realistic peripheral power consumption (sensor interface, ADC, communication module, voltage regulators) that add ~35-40 μA baseline consumption. The simplified calculation above only models CPU active/sleep states. With peripheral power included, both platforms’ sleep currents increase (Cortex-M4: 3 μA → ~40 μA total, DSP: 15 μA → ~50 μA total), and the DSP’s higher sleep current dominates, reducing its battery life advantage.

Key insight: At very low duty cycles (<1%), sleep current dominates battery life, not active efficiency. The Cortex-M4’s 3 µA sleep beats the DSP’s 15 µA sleep, making it the better choice despite slower FFT execution.

Step 4: Total Cost of Ownership (500 units)

Platform Unit cost NRE Total (500 units) Per-unit total
Cortex-M4 $2.50 $0 $1,250 $2.50
DSP $8.00 $0 $4,000 $8.00
FPGA $5.50 $15,000 $17,750 $35.50
ASIC $1.20 $500,000 $500,600 $1,001.20

Decision: Cortex-M4 wins. It meets the 2-year battery target (3.1 years with margin), costs $2.50/unit with zero NRE, and the 1.2 ms FFT time is well within the 100 ms window. The DSP is faster but wastes that speed since 1.2 ms is already adequate. The FPGA’s $15K NRE inflates per-unit cost above the $30 budget. ASIC only makes sense above 100K units.

When would the answer change? If production volume were 50,000+ units and the product needed 5+ year battery life, ASIC becomes viable ($11.20/unit total). If the FFT needed to run at 1,000 Hz instead of 10 Hz, the Cortex-M4 couldn’t keep up and the DSP or FPGA would be necessary.

16.3.5 Interactive Calculator: Hardware Platform Comparison

Explore how changing production volume, FFT execution time, power consumption, and battery requirements affects the optimal hardware choice:

16.3.6 Heterogeneous Multicores

ARM big.LITTLE heterogeneous multicore architecture showing LITTLE cluster with four Cortex-A7 cores for low power tasks and big cluster with four Cortex-A15 cores for high performance with task scheduler managing workload migration between clusters
Figure 16.13: ARM big.LITTLE Architecture: Low-Power and High-Performance Core Clusters

Heterogeneous Multicore: Processor containing multiple cores with the same architecture but different power/performance profiles

Example: ARM big.LITTLE with four Cortex-A7 and four Cortex-A15 cores

Run-state Migration Strategies:

  1. Clustered Switching: Either all fast cores OR all slow cores
  2. CPU Migration: Pairs of fast/slow cores, threads migrate between pairs
  3. Global Task Scheduling: Each core seen separately, threads scheduled on appropriate core

Geometric decision tree for IoT hardware platform selection starting from requirements like connectivity, processing power, and battery life leading to recommendations for ESP32, STM32, nRF52, or Raspberry Pi

Hardware Selection Decision Tree
Figure 16.14: Selecting the right hardware platform impacts all aspects of IoT development. This decision tree guides platform selection based on connectivity requirements (Wi-Fi/BLE/LoRa/cellular), processing needs (8-bit vs 32-bit, ML capability), and power constraints.

16.4 Knowledge Check

Pitfall: Forgetting to Enable Peripheral Clocks Before Register Access

The Mistake: Writing to peripheral configuration registers (GPIO, UART, SPI, I2C) before enabling the peripheral’s clock gate, causing hard faults, bus errors, or silent failures where writes appear to succeed but have no effect.

Why It Happens: On ARM Cortex-M microcontrollers (STM32, nRF52, SAM, LPC), peripherals are clock-gated by default to save power. Unlike desktop CPUs where all hardware is always accessible, embedded peripherals exist in a powered-down state until explicitly enabled via RCC (Reset and Clock Control) registers. Developers familiar with Arduino’s pinMode() don’t realize it internally calls clock enable functions.

The Fix: Always enable peripheral clocks as the first step before any register access. Check your MCU’s reference manual for the specific RCC register and bit field:

// WRONG: Accessing GPIO before clock enable (STM32F4 example)
GPIOA->MODER |= GPIO_MODER_MODE5_0;  // Hard fault or silent fail!

// CORRECT: Enable clock first, then configure
RCC->AHB1ENR |= RCC_AHB1ENR_GPIOAEN;  // Enable GPIOA clock
__DSB();  // Data Synchronization Barrier - ensure clock is active
GPIOA->MODER |= GPIO_MODER_MODE5_0;  // Now safe to configure

// STM32 HAL equivalent:
__HAL_RCC_GPIOA_CLK_ENABLE();
// For UART1:
__HAL_RCC_USART1_CLK_ENABLE();
// For SPI1:
__HAL_RCC_SPI1_CLK_ENABLE();

Debugging Tip: If a peripheral “doesn’t work” but code compiles, check clock enable first. Use a debugger to read the peripheral registers - if they all read as 0x00000000, the clock isn’t enabled. On Cortex-M3/M4, accessing an unclocked peripheral triggers a BusFault or HardFault (BFSR register shows PRECISERR).

Pitfall: Incorrect NVIC Priority Configuration Breaking Nested Interrupts

The Mistake: Setting interrupt priorities without understanding that ARM Cortex-M uses inverted priority numbering (lower number = higher priority), and failing to configure priority grouping correctly, causing critical interrupts to be blocked by less important ones.

Why It Happens: Priority numbering on Cortex-M is counterintuitive: priority 0 is the HIGHEST priority, not lowest. Additionally, the NVIC supports priority grouping (preemption priority + sub-priority), but the number of implemented priority bits varies by MCU (STM32F4 uses 4 bits = 16 levels, nRF52 uses 3 bits = 8 levels, ESP32’s Xtensa uses different scheme). Developers often assume more bits or use HAL defaults without understanding implications.

The Fix: Explicitly configure priority grouping at startup and assign priorities systematically, remembering lower numbers preempt higher numbers:

// WRONG: Assuming priority 10 > priority 5 (backwards!)
NVIC_SetPriority(USART1_IRQn, 5);   // This is HIGHER priority
NVIC_SetPriority(TIM2_IRQn, 10);    // This is LOWER priority (won't preempt USART1)

// CORRECT: Design priority scheme with awareness of inversion
// Priority 0-1: Safety-critical (motor control, watchdog)
// Priority 2-3: Time-critical (encoder, high-speed ADC)
// Priority 4-7: Standard peripherals (UART, SPI)
// Priority 8-15: Low-priority (LED status, debug)

// Configure priority grouping first (4 bits preemption, 0 bits subpriority)
NVIC_SetPriorityGrouping(0);  // Or HAL_NVIC_SetPriorityGrouping(NVIC_PRIORITYGROUP_4)

// Set priorities (lower number = higher priority = CAN preempt others)
NVIC_SetPriority(EXTI0_IRQn, 2);     // High priority - emergency stop button
NVIC_SetPriority(TIM2_IRQn, 3);      // Motor control PWM - critical timing
NVIC_SetPriority(USART1_IRQn, 6);    // Serial - can be delayed
NVIC_SetPriority(I2C1_EV_IRQn, 8);   // Sensor polling - lowest priority

// Enable interrupts
NVIC_EnableIRQ(EXTI0_IRQn);
NVIC_EnableIRQ(TIM2_IRQn);

Debugging Tip: Use debugger to read NVIC->IP[] (Interrupt Priority) registers and verify values. If a high-priority interrupt isn’t preempting, check that (1) priorities are different, (2) priority grouping allows preemption, and (3) the pending interrupt’s priority number is numerically LOWER than the running ISR.

16.5 Industry Cost Comparison: FPGA vs ASIC Break-Even Analysis

One of the most consequential decisions in IoT product development is whether to invest in an ASIC or stay with an FPGA. The answer is almost entirely determined by production volume and product lifetime.

16.5.1 Real-World Break-Even: Google’s Tensor Processing Unit

Google’s TPU is a well-documented example of the ASIC bet paying off. Google estimated that running ML inference on commodity CPUs across its data centers would require building 15 additional data centers by 2016. Instead, Google invested an estimated $20-30 million in NRE to develop the first TPU ASIC. At Google’s scale (tens of thousands of chips deployed), the per-chip cost dropped below $200, and each TPU delivered 15-30x better performance-per-watt than a GPU for inference workloads. The break-even point was approximately 5,000 chips – Google deployed over 100,000.

16.5.2 IoT-Scale Break-Even Calculator

For IoT products, the math looks different because volumes are smaller and NRE budgets are tighter:

Parameter FPGA Path ASIC Path
NRE (design + tapeout) $15,000 - $50,000 $500,000 - $3,000,000
Time to first silicon 2-8 weeks 12-24 months
Per-unit cost (10K vol) $3 - $15 $0.50 - $3
Per-unit cost (1M vol) $2 - $10 $0.20 - $1
Power efficiency 3-5x better than CPU 10-50x better than CPU
Field upgradability Yes (bitstream update) No

Break-even formula: NRE_ASIC / (UnitCost_FPGA - UnitCost_ASIC) = Break-even volume

Example: For a LoRaWAN sensor module, an FPGA-based baseband costs $8/unit and an ASIC baseband costs $1.50/unit, with ASIC NRE of $800,000:

$800,000 / ($8.00 - $1.50) = 123,077 units

If the product will sell fewer than 123,000 units over its lifetime, the FPGA is the better investment. Above that volume, every additional unit saves $6.50 compared to the FPGA path. Semtech’s SX1276 (the dominant LoRaWAN radio) shipped over 200 million units by 2023, making ASIC development an obvious choice at that scale.

16.5.3 Decision Factors Beyond Unit Cost

Volume alone does not determine the right path. Two additional factors frequently tip the decision:

Time-to-market pressure: If a competitor is 6 months ahead, spending 18 months on ASIC development means missing the market window entirely. Lattice Semiconductor’s iCE40 FPGA family was specifically designed for this scenario – low-power, low-cost FPGAs that let teams ship hardware in weeks while an ASIC is developed in parallel. The FPGA ships in v1 products; the ASIC replaces it in v2 at higher volume.

Regulatory risk: Medical and automotive products face certification cycles that can take 6-18 months after silicon is finalized. An ASIC bug discovered during certification requires a respin ($500,000+ and 6-12 months). An FPGA bug is fixed with a bitstream update in days. For safety-critical IoT (medical wearables, automotive sensors), this flexibility justifies the higher per-unit FPGA cost through the certification phase.

16.7 Summary

Hardware optimization provides the foundation for high-performance IoT systems:

  1. Hardware Spectrum: CPU -> DSP -> FPGA -> ASIC with increasing performance and NRE cost
  2. DSPs: Ideal for arithmetic-heavy applications (signal processing, audio, video)
  3. FPGAs: Reconfigurable acceleration, valuable for prototyping and field updates
  4. ASICs: Maximum performance and efficiency at volume, but high NRE and inflexible
  5. ASIC Specialization: Instruction set, functional units, memory, interconnect, control logic
  6. Heterogeneous Multicores: big.LITTLE balances performance with energy efficiency
  7. Memory Architecture: Understanding caches, DMA, and peripheral access is critical

The key is matching hardware capabilities to application requirements while considering production volume and time-to-market constraints.

16.8 Concept Relationships

Hardware optimization provides architectural solutions complementing software techniques:

  • Builds Upon: Requires Optimization Fundamentals to understand trade-offs before selecting hardware
  • Trade-off With: Hardware acceleration (fast, power-efficient) vs software flexibility (updatable)—choose based on volume and algorithm stability
  • Enables: Proper MCU selection with DMA/peripherals is prerequisite for Software Optimization effectiveness (DMA offloads CPU, enabling deeper sleep)
  • Measured Impact: Hardware claims (10× speedup) must be validated with Energy Measurement on actual workloads

Key decision: Software-first (flexible, low NRE) vs hardware acceleration (fast, high NRE). At <10K units, software optimization almost always wins. At >100K units, consider DSP/ASIC.

16.9 See Also

Optimization Series:

Hardware Selection:

Platform Architecture:

Performance:

Common Pitfalls

FPGAs have high per-unit cost ($5–$200) and high power consumption compared to equivalent ASICs. At volumes above 100K units, a custom ASIC almost always has lower total cost. Using an FPGA without volume break-even analysis leads to unexpectedly high product cost at scale.

Replacing the CPU with a faster processor doesn’t help if the bottleneck is memory bandwidth — data can’t reach the processor fast enough. Profile memory access patterns before upgrading compute hardware; cache optimization or DMA may provide bigger gains.

Switching between big and LITTLE cores in ARM architectures has migration latency (10–100 µs) and requires OS support. Frequent workload switching can create overhead that negates energy savings. Profile actual workload variability before relying on big.LITTLE for energy efficiency.

Datasheets advertise peak MIPS/GFLOPS achieved under ideal conditions. Real workloads with data-dependent branches, cache misses, and I/O wait achieve 20–50% of peak. Always benchmark your actual algorithm on candidate hardware before making selection decisions.

16.10 What’s Next

If you want to… Read this
Apply compiler flags and firmware optimizations Software Optimization
Implement fixed-point arithmetic Fixed-Point Arithmetic
Learn optimization principles and measurement Optimization Fundamentals
See complete hardware/software optimization overview Hardware & Software Optimisation
Understand how operations cost energy Energy Cost of Common Operations