1621  Optimization Fundamentals and Trade-offs

1621.1 Learning Objectives

By the end of this chapter, you will be able to:

  • Define optimization in the context of IoT systems and understand why it matters
  • Identify the four key optimization dimensions: speed, size, power, and energy
  • Analyze trade-offs between competing optimization goals
  • Select appropriate optimization priorities based on application requirements
  • Avoid common optimization mistakes including premature optimization

1621.2 Prerequisites

Before diving into this chapter, you should be familiar with:

Think of optimization like packing a suitcase for a long trip—you want to fit the most important items while keeping the bag light.

In IoT, your “suitcase” is limited: tiny memory, slow processors, and batteries that need to last years. Optimization is the art of making your code and hardware do more with less.

Two types of optimization:

| Type | What You’re Improving | Example |
|------|-----------------------|---------|
| Hardware | Energy, power, heat | Make the sensor sleep more, transmit less |
| Software | Speed, memory | Use smaller data types, avoid repeated work inside loops |

Why optimization matters in IoT:

| Resource | Desktop Computer | IoT Device |
|----------|------------------|------------|
| RAM | 16 GB | 256 KB (65,000x less!) |
| CPU | 3 GHz | 80 MHz (40x slower) |
| Power | 500 watts | 0.001 watts |
| Battery | N/A | Must last 5+ years |

Common optimization techniques:

  1. Loop unrolling: Instead of running 100 iterations that each do one step, run 25 iterations that each do four steps (fewer loop checks, so faster, but bigger code)
  2. Function inlining: Put a tiny function’s code directly where it’s called (saves call overhead)
  3. Fixed-point math: Use scaled integers instead of floating-point numbers (much faster on small processors that lack a floating-point unit)
  4. Sleep modes: Put the chip to sleep between readings (saves battery)
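
The fixed-point idea in item 3 can be sketched in a few lines. This is a minimal illustration written in Python for readability (real firmware would normally use C integer types), and it assumes a Q8.8 format, meaning 8 integer bits and 8 fractional bits. The helper names `to_q8_8`, `from_q8_8`, and `q8_8_mul` are hypothetical and not part of any library.

```python
FRAC_BITS = 8            # assumed Q8.8 format: 8 fractional bits
SCALE = 1 << FRAC_BITS   # 256


def to_q8_8(x: float) -> int:
    """Convert a real number to its Q8.8 integer representation."""
    return int(round(x * SCALE))


def from_q8_8(q: int) -> float:
    """Convert a Q8.8 integer back to a real number."""
    return q / SCALE


def q8_8_mul(a: int, b: int) -> int:
    """Multiply two Q8.8 values using only integer operations.

    The raw product carries 16 fractional bits, so it is shifted
    right by 8 to return to Q8.8 (the shift truncates).
    """
    return (a * b) >> FRAC_BITS


temperature = to_q8_8(21.5)   # stored as the integer 5504
gain = to_q8_8(1.25)          # stored as the integer 320
scaled = q8_8_mul(temperature, gain)
print(from_q8_8(scaled))      # 26.875
```

The multiplication itself touches only integers, which is why the technique suits processors without floating-point hardware. The price is limited range and a resolution of 1/256 in this particular format.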

The trade-off triangle:

         Speed
         /\
        /  \
       /    \
      /______\
  Size     Power

You can optimize for two, but often sacrifice the third!

Real-world example:

  • Original code: Reads sensor every 1 second, always awake -> Battery lasts 2 days
  • Optimized: Reads every 10 seconds, sleeps between readings -> Battery lasts 6 months!

Key insight: In IoT, “good enough” code that runs for years beats “perfect” code that drains the battery in a week.

1621.3 What is Optimisation?

Time: ~10 min | Level: Intermediate | Unit: P13.C13.U01

Optimisation: Modifying some aspect of a system so that it runs more efficiently or uses fewer resources.

  • Optimising Hardware: Making it use less energy, or dissipate less power
  • Optimising Software: Making it run faster, or use less memory

Figure 1621.1: Optimization: Avoiding Branch Delay. Code comparison showing how conditional moves can replace if-else branches to avoid pipeline branch delays.
Figure 1621.2: Optimization: Function Inlining. Assembly code comparison showing how function inlining eliminates call overhead by replacing the function invocation with direct code insertion.

1621.3.1 Advanced Optimization Visualizations

The following AI-generated visualizations illustrate key optimization techniques and concepts for IoT system development.

Artistic visualization of function inlining showing before and after assembly code comparison where small function body replaces call instruction, eliminating push pop overhead and branch prediction penalties

Function Inlining Optimization
Figure 1621.3: Function inlining eliminates call overhead by embedding small function bodies directly at the call site. This visualization compares the assembly output before and after inlining, showing how 5-10 cycle call overhead is eliminated for frequently-invoked helper functions.

Artistic diagram of data parallelism exploitation showing sequential scalar operations transformed into parallel SIMD vector operations processing four elements simultaneously in each CPU cycle

Exploiting Parallelism
Figure 1621.4: Some modern microcontrollers support SIMD (Single Instruction, Multiple Data) operations for parallel processing. This visualization shows how loop iterations can be vectorized to process 4 elements simultaneously, which can approach a 4x speedup for array operations whose data fits the vector width.

Artistic flowchart of low-overhead programming techniques for embedded systems including polling versus interrupts, DMA usage, memory-mapped IO, and zero-copy buffer strategies

Low-Overhead Programming
Figure 1621.5: Minimizing overhead is critical for responsive IoT firmware. This visualization presents low-overhead programming patterns including hardware timer-driven scheduling, DMA transfers, and zero-copy buffer management that reduce CPU cycles per operation.

Geometric diagram of hardware-software co-design process showing iterative refinement between firmware algorithm design and hardware capability analysis to achieve optimal system-level performance

Hardware-Software Co-Design
Figure 1621.6: Optimal IoT systems require co-design of hardware and software. This visualization shows the iterative process of matching firmware algorithms to hardware capabilities, identifying bottlenecks, and refining both sides to achieve system-level performance targets.

1621.4 Optimisation Choices

%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#2C3E50', 'primaryTextColor': '#fff', 'primaryBorderColor': '#16A085', 'lineColor': '#16A085', 'secondaryColor': '#E67E22', 'tertiaryColor': '#ECF0F1', 'fontFamily': 'Inter, system-ui, sans-serif'}}}%%
graph TD
    OPT["Optimization<br/>Trade-offs"] --> SPEED["Speed<br/>(Performance)"]
    OPT --> SIZE["Size<br/>(Memory/Code)"]
    OPT --> POWER["Power<br/>(Dissipation)"]
    OPT --> ENERGY["Energy<br/>(Battery Life)"]

    SPEED -.->|Conflicts| SIZE
    SPEED -.->|Conflicts| ENERGY
    SIZE -.->|Conflicts| SPEED
    POWER -.->|Conflicts| SPEED

    style OPT fill:#E67E22,stroke:#2C3E50,stroke-width:3px,color:#fff
    style SPEED fill:#E74C3C,stroke:#2C3E50,stroke-width:2px,color:#fff
    style SIZE fill:#3498DB,stroke:#2C3E50,stroke-width:2px,color:#fff
    style POWER fill:#F39C12,stroke:#2C3E50,stroke-width:2px,color:#fff
    style ENERGY fill:#16A085,stroke:#2C3E50,stroke-width:2px,color:#fff

Figure 1621.7: IoT Optimization Trade-offs: Speed, Size, Power, and Energy Constraints

This view helps select optimization priorities based on your specific IoT application:

%% fig-alt: "Optimization priority matrix by IoT application type. Industrial control prioritizes speed and reliability over size. Wearables prioritize energy and size over speed. Environmental monitoring prioritizes energy over everything else. Gateway devices prioritize speed and size. Each application type has different optimal trade-off points."
%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#2C3E50', 'primaryTextColor': '#fff', 'primaryBorderColor': '#16A085', 'lineColor': '#16A085', 'secondaryColor': '#E67E22', 'tertiaryColor': '#ECF0F1'}}}%%
flowchart TD
    App[Application Type] --> Industrial["Industrial Control<br/>Priority: SPEED<br/>Latency < 10ms<br/>Energy: mains-powered"]
    App --> Wearable["Wearables<br/>Priority: SIZE + ENERGY<br/>Must fit tiny package<br/>Days battery life"]
    App --> EnvMon["Environmental Sensor<br/>Priority: ENERGY<br/>Years battery life<br/>Slow updates OK"]
    App --> Gateway["Edge Gateway<br/>Priority: SPEED + SIZE<br/>Process many streams<br/>Limited RAM/Flash"]

    Industrial --> Opt1["Use -O3, inline,<br/>hardware interrupts"]
    Wearable --> Opt2["Use -Os, sleep modes,<br/>minimal libraries"]
    EnvMon --> Opt3["Low duty cycle,<br/>batch transmissions"]
    Gateway --> Opt4["Streaming buffers,<br/>avoid heap alloc"]

    style App fill:#2C3E50,stroke:#16A085,color:#fff
    style Industrial fill:#E74C3C,stroke:#2C3E50,color:#fff
    style Wearable fill:#3498DB,stroke:#2C3E50,color:#fff
    style EnvMon fill:#27AE60,stroke:#2C3E50,color:#fff
    style Gateway fill:#E67E22,stroke:#2C3E50,color:#fff

Your application type determines which optimization trade-offs make sense.


Key Questions:

  • Optimise for speed? Do we need to react to events quickly?
  • Optimise for size? Are we memory/space constrained?
  • Optimise for power? Is there limited scope for power dissipation?
  • Optimise for energy? Do we need to conserve as much energy as possible?

Most designs need some combination of all four, with trade-offs between them.

Note: Cross-Hub Connections

This chapter connects hardware and software optimization to several learning resources:

Interactive Explorations:

  • Simulations Hub: The Power Budget Calculator helps quantify battery life improvements from optimization strategies. Compare baseline vs optimized: e.g., changing the sensor duty cycle from 1s to 10s with sleep modes can extend a 2-day battery life to 6+ months
  • Knowledge Map: Navigate relationships between optimization techniques, compiler flags, and hardware accelerators across the IoT stack

Assessment & Practice:

  • Quizzes Hub: Test understanding of compiler optimization flags (-O0/-O2/-O3/-Os), fixed-point arithmetic (Qn.m format), and vectorization (SIMD) performance calculations
  • Knowledge Gaps Hub: Common optimization misconceptions explained: premature optimization, ignoring algorithmic complexity, and misunderstanding energy vs speed trade-offs

Video Resources:

  • Videos Hub: Visual explanations of ARM big.LITTLE architecture, SIMD vectorization benefits (4x throughput), and fixed-point vs floating-point performance comparisons with real hardware demonstrations

Why These Connections Matter: Hardware-software co-optimization requires quantitative analysis. The Simulations Hub provides calculators to measure actual energy/performance trade-offs (e.g., -Os saves 32 KB flash but adds 3ms latency—is it worth it for a duty-cycled sensor?). The Knowledge Gaps Hub addresses the #1 mistake: optimizing before profiling. Use the interactive tools to validate optimization decisions with data, not assumptions.

Warning: Common Misconception: “Faster Code Always Uses Less Power”

The Misconception: Many developers believe that optimizing for speed (-O3 compiler flag, aggressive loop unrolling, SIMD vectorization) automatically reduces power consumption because the processor finishes work faster and can sleep sooner. This leads to choosing speed-focused optimizations for battery-powered IoT devices.

Why It’s Wrong (With Numbers): Dynamic power scales as Voltage^2 x Capacitance x Frequency, so finishing sooner is not automatically cheaper. Faster code often requires:

  • Higher clock frequency (80 MHz -> 200 MHz is 2.5x the dynamic power at the same voltage, often for only about 2x the speed)
  • More parallel execution units (SIMD drives 4x the ALUs, roughly 3x the power draw during execution)
  • Larger code size (-O3 can generate 1.5-2x more instructions, which means more instruction fetches and more memory power)
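
The proportionality quoted above is the standard first-order model for dynamic (switching) power in CMOS logic. A worked version for the clock change in the first bullet is sketched below. The activity factor $\alpha$ is added here as an assumption, since the text's formula leaves it out.

$$
P_{\text{dyn}} = \alpha\, C\, V^{2} f
\qquad\Rightarrow\qquad
\frac{P_{200\,\text{MHz}}}{P_{80\,\text{MHz}}} = \frac{200}{80} = 2.5
\quad (\text{same } \alpha,\ C,\ V)
$$

If the higher clock also needs a higher supply voltage, which depends on the part, the ratio grows by a further factor of $(V_2/V_1)^2$. Static (leakage) power is not captured by this model.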

Real-World Example: Consider a temperature sensor reading every 10 seconds. Average power is the time-weighted mix of active and sleep power over the 10,000 ms cycle:

  • Baseline (-O2): 18ms processing @ 80 MHz (5 mW active), sleep 9,982ms (0.01 mW) -> (5 x 18 + 0.01 x 9,982) / 10,000 = 0.019 mW average
  • Speed Optimized (-O3 + 200 MHz): 6ms processing @ 200 MHz (12 mW active), sleep 9,994ms (0.01 mW) -> (12 x 6 + 0.01 x 9,994) / 10,000 = 0.017 mW average
  • Size/Energy Optimized (-Os + sleep): 22ms processing @ 80 MHz (5 mW active), sleep 9,978ms (0.01 mW) -> (5 x 22 + 0.01 x 9,978) / 10,000 = 0.021 mW average

The Surprise: Speed optimization saved only 10% power (0.019 -> 0.017 mW) because sleep time dominates (99.8% of cycle time). The 12ms time savings (18ms -> 6ms) is irrelevant when sleeping 9,982ms! Meanwhile, -O3’s larger code size increased flash memory by 38% (52 KB -> 72 KB), requiring more expensive hardware.
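
The averages for the three builds above can be reproduced with a short calculation. This is a minimal sketch that assumes a simple two-state model (one active phase, one sleep phase per cycle) and ignores wake-up transitions. The function name `average_power_mw` is hypothetical.

```python
def average_power_mw(active_mw: float, active_ms: float,
                     sleep_mw: float, period_ms: float) -> float:
    """Time-weighted average power over one wake/sleep cycle."""
    sleep_ms = period_ms - active_ms
    # mW x ms = microjoules, so this is the energy used in one cycle.
    energy_uj = active_mw * active_ms + sleep_mw * sleep_ms
    return energy_uj / period_ms  # microjoules / ms = mW


PERIOD_MS = 10_000   # one reading every 10 seconds
SLEEP_MW = 0.01      # sleep power from the example

# (active power in mW, active time in ms) for each build in the example
builds = {
    "-O2 @ 80 MHz": (5.0, 18.0),
    "-O3 @ 200 MHz": (12.0, 6.0),
    "-Os @ 80 MHz": (5.0, 22.0),
}

for name, (active_mw, active_ms) in builds.items():
    avg = average_power_mw(active_mw, active_ms, SLEEP_MW, PERIOD_MS)
    print(f"{name}: {avg:.4f} mW")
```

The three results come out at about 0.019, 0.017, and 0.021 mW. In the baseline, the sleep term (about 100 uJ per cycle) is slightly larger than the active term (90 uJ), which is why trimming active time alone moves the average so little.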

The Correct Approach: For duty-cycled IoT devices, optimize sleep time, not active time:

  • Sleep deeper: Change sleep mode from 10 uA to 1 uA = 10x lower sleep current (sleep accounts for roughly half of the cycle's energy in the example above, so this far exceeds any -O3 benefit)
  • Sleep longer: Reduce sensor sampling rate 1s -> 10s = up to 10x battery life (if application permits)
  • Optimize for size: Use -Os to minimize flash/RAM -> enables cheaper MCUs with smaller batteries

When Speed DOES Save Power: Speed optimization helps when the device cannot sleep during processing:

  • Real-time audio/video processing (continuous operation)
  • Safety-critical systems (must respond within deadline)
  • Compute-bound tasks preventing sleep (e.g., cryptography taking >1 second)

Key Principle from Chapter: “Optimise for energy? Do we need to conserve as much energy as possible?” For battery-powered IoT, the answer is optimize for longest sleep time, not fastest execution. The chapter shows: flash memory dominates die area (27.3 mm^2 vs 0.43 mm^2 for CPU)—choosing -Os over -O3 saves both energy AND cost by enabling smaller flash chips.

Quantitative Decision Rule: If sleep time > 100x active time, optimize for code size/sleep depth. If active time > 10% of cycle time, then consider speed optimization. For typical IoT sensors (0.1-1% duty cycle), speed optimization usually yields only a small battery life improvement (about 10% in the example above, at a 0.18% duty cycle).
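
The rule above can be written as a small helper. This is a sketch only: the function name `optimization_focus` and the returned labels are hypothetical, and the middle case (neither threshold met) is not covered by the rule, so the sketch simply recommends measuring.

```python
def optimization_focus(active_ms: float, sleep_ms: float) -> str:
    """Apply the chapter's duty-cycle thresholds to one wake/sleep cycle."""
    period_ms = active_ms + sleep_ms
    if active_ms > 0.10 * period_ms:
        # Active time is more than 10% of the cycle.
        return "consider speed optimization"
    if sleep_ms > 100 * active_ms:
        # Sleep time is more than 100x the active time.
        return "optimize code size and sleep depth"
    # Assumption: the rule is silent here, so fall back to measurement.
    return "profile before deciding"


print(optimization_focus(18, 9_982))      # the 10-second sensor example
print(optimization_focus(2_000, 8_000))   # a device awake 20% of the time
```

The two thresholds cannot both be met for the same cycle, so the order of the checks does not change the result.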

Warning: Premature Optimization Wastes Development Time

A classic mistake is optimizing code before measuring where bottlenecks actually are. Developers spend days hand-optimizing assembly or complex bit manipulations in functions that consume 0.1% of total execution time, while ignoring the inefficient algorithm consuming 80% of CPU. Example: A team spent 2 weeks optimizing sensor reading code (saving 5ms per hour) but never profiled their JSON parsing library that consumed 2 seconds per message. Rule: Always profile first - use actual measurements to identify hotspots. The 80-20 rule applies: 80% of execution time typically comes from 20% of code. Focus optimization efforts on measured bottlenecks, not assumptions. Start with algorithmic improvements (O(n^2) to O(n log n)), then compiler flags, then micro-optimizations only if needed.
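
One way to see why the hotspot matters is Amdahl's law, which is brought in here as an illustration and is not named in the text above. The sketch below compares speeding up a function that takes 0.1% of run time with speeding up one that takes 80%. The speedup factors (10x and 2x) are assumed values chosen for the comparison.

```python
def overall_speedup(fraction: float, local_speedup: float) -> float:
    """Amdahl's law: whole-program speedup when `fraction` of the run
    time is made `local_speedup` times faster."""
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)


# Heroic 10x improvement to a function that is 0.1% of execution time.
cold = overall_speedup(0.001, 10.0)

# Modest 2x improvement to the routine that is 80% of execution time.
hot = overall_speedup(0.80, 2.0)

print(f"cold path: {cold:.4f}x, hot path: {hot:.2f}x")
```

Under these assumptions the cold-path effort improves the whole program by less than 0.1%, while the modest hot-path change gives roughly 1.67x. The fractions themselves have to come from a profiler, which is the point of the rule.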

Warning: Tradeoff: Active Power Gating vs Continuous Low-Power Peripherals

Option A (Active Power Gating): A MOSFET switch cuts power to sensors/peripherals when idle. It achieves near-zero leakage (<0.1 uA), requires a GPIO pin plus a MOSFET ($0.05-0.20), adds 1-10 ms of startup delay for sensor stabilization, and risks inrush current damage if not managed.

Option B (Continuous Low-Power Mode): Peripherals remain powered but in ultra-low-power standby. Typical standby current is 1-50 uA per peripheral, readiness is instant (<1 us), no external components are needed, and stable sensor calibration is maintained.

Decision Factors:

  • Choose power gating for sensors with high standby current (>100 uA), rarely used peripherals (GPS module at 20 mA idle: gate it!), and deployments where every microamp matters for 10+ year battery life.
  • Choose continuous low-power for frequently accessed sensors (accelerometer at 6 uA standby, sampled 10x/second: gating overhead exceeds savings), temperature-sensitive sensors requiring thermal stabilization (a 10-second warmup wastes more energy than a continuous 5 uA), or when GPIO pins are scarce.

Quantified example: A humidity sensor drawing 50 uA standby, read every 60 seconds, with a 500 ms warmup at 2 mA when power-gated:

  • Continuous = 50 uA x 60s = 3 mAs/cycle
  • Power-gated = 0 uA x 59.5s + 2 mA x 500ms warmup = 1 mAs/cycle, which is 3x better

For 1-second reads the result flips: Continuous = 50 uA x 1s = 0.05 mAs, Power-gated = 1 mAs, so continuous wins by 20x.
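
The humidity-sensor comparison above generalizes to a break-even read interval. The sketch below uses the same simplification as the example (zero current while gated off, a fixed warmup current and duration). The function names are hypothetical.

```python
def continuous_charge_mas(standby_ua: float, interval_s: float) -> float:
    """Charge per read cycle (mA*s) when the sensor idles in standby."""
    return standby_ua / 1000.0 * interval_s


def gated_charge_mas(warmup_ma: float, warmup_s: float) -> float:
    """Charge per read cycle (mA*s) when the sensor is switched off
    between reads. Off-state leakage is treated as zero, as in the text."""
    return warmup_ma * warmup_s


def break_even_interval_s(standby_ua: float, warmup_ma: float,
                          warmup_s: float) -> float:
    """Read interval at which both strategies cost the same charge."""
    return gated_charge_mas(warmup_ma, warmup_s) / (standby_ua / 1000.0)


# Values from the humidity sensor example: 50 uA standby, 2 mA x 500 ms warmup.
print(continuous_charge_mas(50, 60))        # charge per 60 s cycle, continuous
print(gated_charge_mas(2, 0.5))             # charge per cycle, power-gated
print(break_even_interval_s(50, 2, 0.5))    # interval where the two are equal
```

With these numbers the break-even interval is about 20 seconds: reading less often than that favors power gating, and reading more often favors leaving the sensor in standby.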

Warning: Tradeoff: Compiler Size Optimization (-Os) vs Speed Optimization (-O3)

Option A (-Os Size Optimization): Minimizes code size (20-40% smaller than -O3), moderate compile time, fits in smaller/cheaper flash (32 KB vs 64 KB saves $0.30-1.00/unit), reduced instruction cache misses, somewhat slower execution (10-30% vs -O3), better for I/O-bound code

Option B (-O3 Speed Optimization): Maximum execution speed through aggressive inlining and loop unrolling, larger code size (1.5-2x vs -Os), may exceed flash capacity on small MCUs, faster compute-intensive algorithms, risk of code size exceeding instruction cache causing performance regression

Decision Factors:

  • Choose -Os for flash-constrained MCUs (STM32L0 with 32KB, ATtiny with 8KB), battery-powered devices where code fetches from flash cost energy (1-5 pJ/bit), networking/protocol stacks where code size dominates, and production cost optimization at scale (10K units x $0.50 flash savings = $5,000).
  • Choose -O3 for compute-intensive DSP/ML on edge (audio processing, anomaly detection), latency-critical control loops (motor control, PID), and devices with abundant flash (ESP32 with 4MB).

Quantified example: A 45 KB -O3 binary vs a 28 KB -Os binary. The -Os version fits in a 32 KB STM32L0-class part ($0.80) instead of requiring a 64 KB part ($1.30), saving $0.50/unit at a 15% performance cost. For 50,000 units, that’s $25,000 saved, which is worth the 3ms extra processing time per sensor read.
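
The cost argument in the quantified example can be checked with a few lines. This sketch assumes a catalogue of just two flash sizes at the illustrative prices quoted above; real prices vary by vendor and volume. The names `PARTS`, `cheapest_part`, and `fleet_savings` are hypothetical.

```python
from typing import Optional, Tuple

# (flash size in KB, unit price in USD), taken from the example above.
PARTS = [(32, 0.80), (64, 1.30)]


def cheapest_part(binary_kb: float) -> Optional[Tuple[int, float]]:
    """Return the cheapest listed part whose flash holds the binary."""
    fitting = [part for part in PARTS if part[0] >= binary_kb]
    return min(fitting, key=lambda part: part[1]) if fitting else None


def fleet_savings(big_kb: float, small_kb: float, units: int) -> float:
    """Total saving from shipping the smaller binary on a cheaper part."""
    big, small = cheapest_part(big_kb), cheapest_part(small_kb)
    if big is None or small is None:
        raise ValueError("binary does not fit any listed part")
    return (big[1] - small[1]) * units


print(f"${fleet_savings(45, 28, 50_000):,.0f}")
```

The saving only appears when the size reduction crosses a part boundary. Shrinking a 45 KB binary to 40 KB, for instance, would still need the 64 KB part and save nothing on hardware.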

1621.5 Knowledge Check

Scenario: Your battery-powered IoT sensor has 128 KB flash (80% full after -O2 compile) and runs on a 16 MHz ARM Cortex-M0+ processor. Current firmware processes sensor data in 45ms, transmits results via LoRaWAN (200ms), then sleeps for 10 minutes. You’re evaluating -O3 (code increases to 110 KB, processing drops to 30ms) vs. -Os (code shrinks to 65 KB, processing increases to 55ms).

Think about:

  1. Which optimization flag should you choose and why?
  2. How much does the 15ms processing time difference (45ms -> 30ms with -O3) affect battery life?
  3. What’s the real benefit of reclaiming about 37 KB of flash with -Os?

Key Insight: Choose -Os. The 15ms processing time improvement from -O3 is irrelevant for battery life because the sensor sleeps 10 minutes (600,000ms) between readings. Active time is 45ms + 200ms = 245ms per cycle. Reducing it to 30ms + 200ms = 230ms removes only 15ms of awake time from a cycle of roughly 600,000ms (0.0025% of the cycle), and the remaining active time is dominated by the 200ms radio transmission, so the energy saving is negligible. Meanwhile, -Os shrinks the firmware from about 102 KB (80% of 128 KB) to 65 KB, freeing about 37 KB and leaving roughly half of the flash available for: (1) future firmware updates, (2) data buffering during network outages, (3) additional features without hardware redesign. By contrast, the 110 KB -O3 build would leave only 18 KB free. For duty-cycled IoT devices, code size matters more than processing speed when active time is tiny compared to sleep time. Exception: If processing exceeded 10 minutes, preventing timely readings, then -O3’s speed would matter. The chapter’s guidance: “Optimise for energy? Do we need to conserve as much energy as possible?” Here, sleep time dominates, so optimize for flash (future flexibility).
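
The scenario's numbers can be tabulated to compare the three builds side by side. This sketch uses only values stated in the scenario and treats one cycle as active time plus the 10-minute sleep. The helper name `summarize` is hypothetical, and no current draw is assumed, so it reports time and flash rather than energy.

```python
FLASH_KB = 128
SLEEP_MS = 600_000   # 10 minutes of sleep per cycle
TX_MS = 200          # LoRaWAN transmission time

# (code size in KB, processing time in ms) for each build in the scenario
builds = {
    "-O2": (0.80 * FLASH_KB, 45),
    "-O3": (110, 30),
    "-Os": (65, 55),
}


def summarize(code_kb: float, proc_ms: float) -> dict:
    """Active time, duty cycle, and free flash for one build."""
    active_ms = proc_ms + TX_MS
    duty_pct = 100.0 * active_ms / (active_ms + SLEEP_MS)
    return {
        "active_ms": active_ms,
        "duty_pct": duty_pct,
        "free_kb": FLASH_KB - code_kb,
    }


for name, (code_kb, proc_ms) in builds.items():
    s = summarize(code_kb, proc_ms)
    print(f"{name}: active {s['active_ms']} ms, "
          f"duty {s['duty_pct']:.4f}%, free flash {s['free_kb']:.1f} KB")
```

All three builds stay below a 0.05% duty cycle, so the processing-time differences barely register, while free flash ranges from 18 KB with -O3 to 63 KB with -Os.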

1621.6 Summary

Optimization fundamentals establish the foundation for efficient IoT systems:

  1. Four Dimensions: Speed, size, power, and energy often conflict—you must prioritize
  2. Application-Driven: Industrial control needs speed; environmental sensors need energy efficiency
  3. Avoid Premature Optimization: Profile first, optimize the actual bottleneck
  4. Trade-off Analysis: Quantify benefits before choosing optimization strategies
  5. Sleep Dominates: For duty-cycled devices, sleep efficiency matters more than processing speed

The key is understanding your application requirements and choosing appropriate optimization strategies with full awareness of trade-offs.

1621.7 What’s Next

The next chapter covers Hardware Optimization, which explores processor selection from DSPs to ASICs, heterogeneous multicore architectures like ARM big.LITTLE, and hardware acceleration strategies for IoT applications.