1623  Software Optimization Techniques

1623.1 Learning Objectives

By the end of this chapter, you will be able to:

  • Select appropriate compiler optimization flags: Choose between -O0, -O2, -O3, and -Os based on requirements
  • Analyze code size considerations: Understand why flash memory dominates embedded system costs
  • Apply SIMD vectorization: Exploit data-level parallelism for array processing
  • Implement function inlining strategies: Balance call overhead against code size
  • Design opportunistic sleeping patterns: Maximize battery life through intelligent power management
  • Profile before optimizing: Use measurement to identify actual bottlenecks

1623.2 Prerequisites

Before diving into this chapter, you should be familiar with:

1623.3 Software Optimization

1623.3.1 Compiler Optimization Choices

%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#2C3E50', 'primaryTextColor': '#fff', 'primaryBorderColor': '#16A085', 'lineColor': '#16A085', 'secondaryColor': '#E67E22', 'tertiaryColor': '#ECF0F1', 'fontFamily': 'Inter, system-ui, sans-serif'}}}%%
graph TD
    SOURCE["Source Code"] --> COMPILER{"Compiler<br/>Optimization"}

    COMPILER -->|"-O0"| O0["No Optimization<br/>Fast compile<br/>Easy debug<br/>Large/slow code"]
    COMPILER -->|"-O1/-O2"| O2["Balanced<br/>Moderate compile<br/>Good performance<br/>Reasonable size"]
    COMPILER -->|"-O3"| O3["Speed Focus<br/>Slow compile<br/>Aggressive inlining<br/>Largest code size"]
    COMPILER -->|"-Os"| OS["Size Focus<br/>Moderate compile<br/>Compact code<br/>Reduced performance"]
    COMPILER -->|"-On"| ON["Custom Passes<br/>Fine control<br/>LLVM best<br/>Expert level"]

    style SOURCE fill:#E67E22,stroke:#2C3E50,stroke-width:2px,color:#fff
    style O0 fill:#E74C3C,stroke:#2C3E50,stroke-width:2px,color:#fff
    style O2 fill:#16A085,stroke:#2C3E50,stroke-width:2px,color:#fff
    style O3 fill:#F39C12,stroke:#2C3E50,stroke-width:2px,color:#fff
    style OS fill:#3498DB,stroke:#2C3E50,stroke-width:2px,color:#fff
    style ON fill:#9B59B6,stroke:#2C3E50,stroke-width:2px,color:#fff

Figure 1623.1: Compiler Optimization Levels: -O0 to -O3 and -Os Trade-offs

{fig-alt="Compiler optimization level comparison showing trade-offs between -O0 (no optimization), -O1/-O2 (balanced), -O3 (speed-focused), -Os (size-focused), and -On (custom passes) in terms of compile time, code size, and performance"}

-O3 (Optimize for Speed):

  • May aggressively inline functions
  • Generates significantly larger code
  • Re-orders instructions for better performance
  • Selects complex instruction encodings

-Os (Optimize for Size):

  • Weights inlining decisions against code growth
  • Generates shorter instruction encodings
  • Affects instruction scheduling
  • Eliminates fewer branches

-On (Custom Optimization):

  • Applies very specific optimization passes
  • Can insert assembly-language templates
  • LLVM’s pass infrastructure is particularly well suited to this
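Optimization level need not be uniform across a project. As a hedged illustration (assuming GCC or a GCC-compatible cross-compiler such as arm-none-eabi-gcc, whose optimize and noinline function attributes are non-standard extensions), individual functions can override the global flag:

```c
#include <stdint.h>

/* Build the whole firmware with -Os, but ask GCC to compile this hot
 * inner loop as if -O3 were in effect. */
__attribute__((optimize("O3")))
void scale_samples(int32_t *buf, int n)
{
    for (int i = 0; i < n; i++)
        buf[i] *= 2;
}

/* Rarely-executed error path: keep it out of line so it never bloats
 * its callers, regardless of the global optimization level. */
__attribute__((noinline))
void report_fault(int code)
{
    (void)code;   /* log-and-halt logic would go here */
}
```

The exact behaviour of the optimize attribute varies between GCC versions, so it is best treated as a hint rather than a guarantee.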

1623.3.2 Code Size Considerations

%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#2C3E50', 'primaryTextColor': '#fff', 'primaryBorderColor': '#16A085', 'lineColor': '#16A085', 'secondaryColor': '#E67E22', 'tertiaryColor': '#ECF0F1', 'fontFamily': 'Inter, system-ui, sans-serif'}}}%%
graph LR
    subgraph "Die Area Comparison @ 0.13um"
        M3["ARM Cortex M3<br/>0.43 mm2"]
        FLASH["128-Mbit Flash<br/>27.3 mm2"]
    end

    M3 -.->|"63x larger!"| FLASH

    subgraph "Code Size Strategies"
        DUAL["Dual Instruction Sets<br/>Thumb/Thumb-2<br/>16-bit + 32-bit"]
        CISC["CISC Encodings<br/>x86, System/360<br/>Complex instructions"]
        COMPRESS["Code Compression<br/>Decompress at runtime"]
    end

    FLASH --> DUAL
    FLASH --> CISC
    FLASH --> COMPRESS

    style M3 fill:#16A085,stroke:#2C3E50,stroke-width:2px,color:#fff
    style FLASH fill:#E74C3C,stroke:#2C3E50,stroke-width:3px,color:#fff
    style DUAL fill:#3498DB,stroke:#2C3E50,stroke-width:2px,color:#fff
    style CISC fill:#3498DB,stroke:#2C3E50,stroke-width:2px,color:#fff
    style COMPRESS fill:#3498DB,stroke:#2C3E50,stroke-width:2px,color:#fff

Figure 1623.2: Code Size Impact: Flash Memory Dominates Die Area vs ARM Cortex M3

{fig-alt="Code size impact diagram showing flash memory dominating die area (27.3mm2 vs 0.43mm2 for ARM Cortex M3) and three code size reduction strategies: dual instruction sets, CISC encodings, and code compression"}

Does code size matter?

  • 128-Mbit Flash = 27.3 mm2 @ 0.13um
  • ARM Cortex M3 = 0.43 mm2 @ 0.13um
  • Flash memory can dominate die area (roughly 63x the CPU core)!

Dual Instruction Sets (Thumb/Thumb-2, ARCompact, microMIPS):

  • 16-bit instructions come with constraints (limited register access, reduced immediate ranges)

CISC Instruction Sets (x86, System/360, PDP-11):

  • Complex encodings do more work per instruction
  • Require more complex decode hardware
  • Compiler support may be limited

1623.4 Advanced Software Optimizations

1623.4.1 Vectorization

%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#2C3E50', 'primaryTextColor': '#fff', 'primaryBorderColor': '#16A085', 'lineColor': '#16A085', 'secondaryColor': '#E67E22', 'tertiaryColor': '#ECF0F1', 'fontFamily': 'Inter, system-ui, sans-serif'}}}%%
sequenceDiagram
    participant Code as Source Code
    participant Scalar as Scalar Execution<br/>(1 element/instruction)
    participant SIMD as SIMD Vectorization<br/>(4 elements/instruction)

    Note over Code: for i = 0 to 999<br/>  array[i] = array[i] x 2

    rect rgb(231, 76, 60, 0.1)
        Note over Scalar: 1000 iterations
        loop 1000 times
            Scalar->>Scalar: Load 1 element
            Scalar->>Scalar: Multiply by 2
            Scalar->>Scalar: Store 1 element
        end
    end

    rect rgb(22, 160, 133, 0.1)
        Note over SIMD: 250 iterations (4x faster!)
        loop 250 times
            SIMD->>SIMD: Load 4 elements
            SIMD->>SIMD: Multiply 4x2 (parallel)
            SIMD->>SIMD: Store 4 elements
        end
    end

    Note over SIMD: Data-level parallelism<br/>Better memory bandwidth<br/>Fewer loop iterations

Figure 1623.3: SIMD Vectorization: Scalar vs Parallel Array Processing Performance

{fig-alt="SIMD vectorization comparison showing scalar execution processing one element per iteration requiring 1000 iterations versus SIMD processing four elements per instruction requiring only 250 iterations for 4x speedup"}

Key Idea: Operate on multiple data elements with a single instruction (SIMD)

Benefits:

  • Fewer loop iterations
  • Better memory bandwidth utilization
  • Exploits data-level parallelism

Caution: Extra epilogue code is needed if the vector width doesn’t divide the iteration count exactly (see the remainder loop in the sketch below).
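As a concrete sketch (assuming a NEON-capable ARM core such as a Cortex-A part and the arm_neon.h intrinsics header; Cortex-M microcontrollers generally lack NEON), the loop from the figure can be vectorized four elements at a time, with a scalar remainder loop for leftover elements:

```c
#include <arm_neon.h>
#include <stdint.h>

/* Multiply every element of buf by 2, four elements per iteration. */
void scale_by_two(int32_t *buf, int n)
{
    int i = 0;

    /* Vectorized body: 4 x 32-bit elements per instruction. */
    for (; i + 4 <= n; i += 4) {
        int32x4_t v = vld1q_s32(&buf[i]);    /* load 4 elements        */
        v = vmulq_n_s32(v, 2);               /* multiply all 4 at once */
        vst1q_s32(&buf[i], v);               /* store 4 elements       */
    }

    /* Remainder (epilogue) loop: needed when n is not a multiple of 4. */
    for (; i < n; i++)
        buf[i] *= 2;
}
```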

Figure 1623.4: Vectorization code transformation: a scalar loop converted to SIMD operations that process 4 elements per iteration instead of 1.

Figure 1623.5: Vectorization performance: execution time of scalar versus vectorized code, showing a 4x speedup from SIMD vectorization on ARM NEON.

Figure 1623.6: Scalar versus vector operations: scalar processing handles one data element at a time, while vector operations process multiple elements in parallel.

Artistic visualization of parallel computation strategies including pipelining sensor reads while previous data is transmitted, overlapping ADC sampling with processing, and multi-core task partitioning

Exploiting Parallel Computation
Figure 1623.7: Parallel execution maximizes hardware utilization. This visualization demonstrates how to pipeline sensor reading, processing, and transmission so they overlap in time, effectively hiding latency and increasing throughput on single-core and multi-core MCUs.

1623.4.2 Function Inlining

Advantages:

  • Low calling overhead
  • Avoids branch delay
  • Enables further optimizations across the former call boundary

Limitations:

  • Not all functions can be inlined
  • Code size explosion is possible
  • May require manual intervention (the inline qualifier or compiler hints)

Geometric visualization comparing function calls versus inlined code. Shows the overhead of call/return instructions, stack operations, and register saving versus the code size increase from inlining. Illustrates when inlining improves performance.

Function Inlining

Function inlining eliminates call overhead but increases code size, requiring careful analysis of which functions benefit from this optimization.
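A minimal sketch of the trade-off in plain C: a tiny accessor called from hot loops is a good inlining candidate, while a large routine called from many places is better left out of line.

```c
#include <stdint.h>

/* Good candidate: tiny body, called from hot loops. 'static inline'
 * lets the compiler drop the call/return and argument shuffling. */
static inline uint16_t adc_to_millivolts(uint16_t raw)
{
    return (uint16_t)((raw * 3300UL) / 4095U);
}

/* Poor candidate: large body called from many places. Forcing it inline
 * would duplicate all of its code at every call site. */
uint16_t median_filter(const uint16_t *window, int n);  /* keep out of line */
```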

1623.4.3 Opportunistic Sleeping

Strategy: Transition the processor and peripherals to the lowest usable power mode as soon as possible (a minimal code sketch follows the list below)

  • Similar to stop/start in cars!
  • Interrupts wake the processor for processing
  • Balance energy savings vs reaction time
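A minimal sketch of the pattern for an ARM Cortex-M part; normally the CMSIS __WFI() intrinsic from the device header would be used, so the inline-assembly stand-in below, the ISR name, and the processing hook are assumptions for illustration:

```c
#include <stdint.h>

/* Stand-in for the CMSIS __WFI() intrinsic: the ARM "wait for interrupt"
 * instruction via inline assembly (GCC/Clang syntax). */
static inline void wfi(void) { __asm volatile ("wfi"); }

volatile uint8_t sample_ready = 0;     /* set from the ADC/timer interrupt */

void process_sample(void);             /* hypothetical application hook */

void ADC_IRQHandler(void)              /* hypothetical ISR name */
{
    sample_ready = 1;
}

int main(void)
{
    for (;;) {
        if (sample_ready) {
            sample_ready = 0;
            process_sample();
        }
        /* Nothing pending: sleep until the next interrupt wakes the core.
         * A production version would also guard the window between the
         * flag check and the sleep instruction. */
        wfi();
    }
}
```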

Artistic graph showing power consumption versus clock frequency relationship where reducing clock rate from 80MHz to 20MHz saves 75 percent power while still meeting real-time deadlines for sensor sampling

Minimize Clock Rate Strategy
Figure 1623.8: Dynamic voltage and frequency scaling (DVFS) reduces power consumption when full performance is unnecessary. Dynamic power follows roughly P ≈ α·C·V²·f, so at a fixed supply voltage it falls about linearly with clock frequency (running at 25% speed cuts dynamic power by roughly 75%), and lowering the supply voltage along with the frequency yields even larger savings for tasks that are not time-critical.
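As a hedged sketch of the same idea in firmware (assuming the Arduino-ESP32 core, which exposes setCpuFrequencyMhz() for run-time clock changes; the processing function is hypothetical and the frequencies are illustrative):

```cpp
// Arduino-ESP32 sketch: raise the clock only for the compute burst.
void processSensorWindow() {
    // hypothetical compute-heavy step, e.g. filtering a sample window
}

void setup() {
    Serial.begin(115200);
}

void loop() {
    setCpuFrequencyMhz(240);         // full speed for signal processing
    processSensorWindow();
    setCpuFrequencyMhz(80);          // low clock while waiting for next window
    delay(100);                      // idle until the next sampling period
}
```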

Artistic comparison showing energy cost of transmitting raw sensor data to cloud versus performing local feature extraction and sending only compressed results, with order of magnitude energy savings illustrated

Local Computation vs Transmission - Part 1
Figure 1623.9: Local computation often saves more energy than transmission. This visualization compares the energy cost of sending raw accelerometer data (150 bytes/sample at 100Hz) versus extracting features locally and transmitting only activity classification results (1 byte/second).
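A minimal sketch of local feature extraction; the energy thresholds and the radio_send_byte() driver call are hypothetical, and one byte per second replaces roughly 15 kB/s of raw accelerometer samples:

```c
#include <stdint.h>

/* Hypothetical radio driver call: transmit a single byte. */
void radio_send_byte(uint8_t b);

/* Compute a simple energy feature over a one-second window and classify
 * it locally instead of streaming the raw 3-axis samples. */
uint8_t classify_window(const int16_t *ax, const int16_t *ay,
                        const int16_t *az, int n)
{
    float energy = 0.0f;
    for (int i = 0; i < n; i++) {
        energy += (float)ax[i] * ax[i]
                + (float)ay[i] * ay[i]
                + (float)az[i] * az[i];
    }
    energy /= (float)n;

    /* Hypothetical thresholds: 0 = still, 1 = walking, 2 = running. */
    if (energy < 1.0e6f) return 0;
    if (energy < 1.0e7f) return 1;
    return 2;
}

/* Once per second: one byte over the air instead of 100 raw samples. */
void report_activity(const int16_t *ax, const int16_t *ay,
                     const int16_t *az, int n)
{
    radio_send_byte(classify_window(ax, ay, az, n));
}
```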

Artistic decision tree for choosing between local processing and cloud offloading based on data rate, model complexity, latency requirements, and available MCU compute resources

Local Computation vs Transmission - Part 2
Figure 1623.10: Deciding between local and remote computation depends on multiple factors. This visualization presents a decision framework considering data rate, model complexity, latency requirements, and available compute resources to determine optimal processing placement.

Artistic timeline showing LTE cellular modem power states demonstrating how batching multiple sensor readings into single transmission amortizes 2-3 second connection setup overhead across many data points

LTE Data Batching
Figure 1623.11: Cellular transmission has high setup overhead (2-3 seconds to establish connection). This visualization shows how batching multiple sensor readings into a single transmission amortizes this overhead, improving energy efficiency by 10-100x for bursty IoT data.
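A minimal batching sketch; modem_connect(), modem_send(), and modem_disconnect() stand in for whatever cellular driver the platform provides, and the batch size is illustrative:

```c
#include <stdint.h>

#define BATCH_SIZE 120               /* e.g., 2 minutes of readings at 1 Hz */

/* Hypothetical modem driver calls. */
void modem_connect(void);            /* ~2-3 s of high-power radio activity */
void modem_send(const void *buf, uint32_t len);
void modem_disconnect(void);

static uint16_t batch[BATCH_SIZE];
static uint32_t count = 0;

/* Called once per reading: buffer locally, transmit only when full so the
 * connection-setup energy is shared across BATCH_SIZE readings. */
void log_reading(uint16_t value)
{
    batch[count++] = value;
    if (count == BATCH_SIZE) {
        modem_connect();
        modem_send(batch, sizeof batch);
        modem_disconnect();
        count = 0;
    }
}
```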

Artistic bar chart comparing energy per transmitted byte across wireless protocols showing Wi-Fi at 1000 microjoules, BLE at 50 microjoules, LoRa at 10 microjoules, and Zigbee at 30 microjoules per byte

Networking Energy Costs
Figure 1623.12: Communication dominates IoT energy budgets. This visualization compares the energy cost per transmitted byte across common wireless protocols, helping engineers choose the most efficient protocol for their data rate and range requirements.

Artistic flowchart of optimized sensor data processing pipeline using lookup tables, fixed-point arithmetic, and branch-free conditional moves to achieve 10x speedup over naive implementation

Optimized Efficient Processing
Figure 1623.13: Algorithmic optimization can yield dramatic improvements. This visualization shows a sensor processing pipeline optimized with lookup tables for trigonometric functions, fixed-point arithmetic, and branch-free conditional moves that achieves 10x speedup.
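Two of these ideas in miniature, assuming a Q15 sine table generated offline (only the first few entries are shown) and a clamp written so that compilers can typically emit conditional-move/select instructions instead of branches:

```c
#include <stdint.h>

/* Lookup table: sin() sampled at 256 points over one period, stored in
 * Q15 fixed point (generated offline; only the first entries shown). */
static const int16_t sin_lut[256] = { 0, 804, 1608, 2410, /* ... */ };

/* One table read replaces a floating-point sinf() call. */
static inline int16_t fast_sin_q15(uint8_t phase)   /* phase: 0..255 = 0..2*pi */
{
    return sin_lut[phase];
}

/* Branch-free clamp to [lo, hi]: most compilers turn these ternaries into
 * conditional moves/selects, avoiding hard-to-predict branches. */
static inline int32_t clamp_i32(int32_t x, int32_t lo, int32_t hi)
{
    x = x < lo ? lo : x;
    x = x > hi ? hi : x;
    return x;
}
```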

1623.5 Knowledge Check

Question 1: Which compiler flag typically produces the largest code size?

-O3 typically produces the largest code because of aggressive function inlining and loop unrolling. While -O0 produces unoptimized code, -O3’s speed-oriented transformations usually expand code size far more. -Os specifically optimizes for smaller code.

Question 2: What does the -Os compiler flag optimize for?

-Os (optimize for size) penalizes inlining decisions, generates shorter instruction encodings, and prioritizes compact code. This is critical for flash-constrained IoT devices where memory dominates die area and cost.

Question 3: What is the main benefit of vectorization (SIMD)?

SIMD (Single Instruction Multiple Data) processes 4 or more values per instruction, providing 4x+ speedup for array operations. For 1000 values: scalar requires 1000 iterations, SIMD requires 250 iterations (1000/4) = 4x faster.

Question 4: What is a limitation of function inlining?

While function inlining eliminates call overhead and enables further optimizations, it copies the function body to every call site, potentially causing code size explosion. This can overflow flash memory and cause instruction cache misses.

Question 5: Your IoT sensor processes 10,000 samples per second. The baseline implementation takes 50,000 cycles per sample on a 100 MHz processor. After loop unrolling and SIMD optimization, it takes 12,500 cycles per sample. What is the throughput improvement?

Throughput = Clock / Cycles per sample. Baseline: 100 MHz / 50,000 cycles = 2,000 samples/s. Optimized: 100 MHz / 12,500 cycles = 8,000 samples/s. Improvement: 8,000 / 2,000 = 4x throughput increase. The optimization reduced cycles by 4x, which directly translates to 4x throughput.

Question 6: Which code snippet demonstrates better optimization for an IoT device processing array data?

  • Version B demonstrates loop unrolling. By processing 4 elements per iteration, it reduces: (1) Loop overhead by 4x (250 iterations vs 1000), (2) Branch mispredictions, (3) Enables compiler to apply SIMD vectorization automatically. Modern compilers can often parallelize the 4 independent operations using SIMD instructions.

1623.6 Common Pitfalls

    Pitfall: Using printf/Serial.print for Timing-Critical Debug

    The Mistake: Adding Serial.print() or printf() statements inside tight loops or ISRs to debug timing-sensitive code, then wondering why the bug disappears when debug output is added, or why the system becomes unreliable during debugging.

    Why It Happens: Serial output is extremely slow compared to CPU operations. A single Serial.print("x") on Arduino at 115200 baud takes approximately 87 microseconds (10 bits per character at 115200 bps). In a 1 MHz loop (1 us per iteration), adding one print statement slows execution by 87x and changes all timing relationships. This creates “Heisenbugs” - bugs that disappear when you try to observe them.

    The Fix: Use GPIO pin toggling for timing-critical debugging - set a pin HIGH at function entry, LOW at exit, and observe with an oscilloscope or logic analyzer. For capturing values without affecting timing, write to a circular buffer in RAM and dump after the critical section completes:

    // WRONG: Serial output changes timing
    void IRAM_ATTR timerISR() {
        Serial.println(micros());  // Takes 500+ us, destroys timing!
        processData();
    }
    
    // CORRECT: GPIO toggle for timing debug (oscilloscope required)
    #define DEBUG_PIN 2
    void IRAM_ATTR timerISR() {
        digitalWrite(DEBUG_PIN, HIGH);  // ~0.1 us
        processData();
        digitalWrite(DEBUG_PIN, LOW);   // ~0.1 us
    }
    
    // CORRECT: Buffer capture for value debug
    volatile uint32_t debugBuffer[64];
    volatile uint8_t debugIndex = 0;
    
    void IRAM_ATTR timerISR() {
        if (debugIndex < 64) {
            debugBuffer[debugIndex++] = ADC_VALUE;  // ~0.2 us
        }
        processData();
    }
    // Dump buffer in loop() after critical operation completes

    Pitfall: Optimizing Without Profiling (Premature Optimization)

    The Mistake: Spending hours hand-optimizing a function that “looks slow” (like a nested loop or floating-point math) without measuring whether it actually impacts overall system performance, while ignoring the true bottleneck.

    Why It Happens: Developers have intuitions about what should be slow based on algorithm complexity or “expensive” operations. But modern compilers optimize heavily, and I/O operations (network, storage, peripherals) often dominate execution time. A function with O(n^2) complexity processing 10 items (100 operations) may take 1 microsecond, while a single I2C sensor read takes 500 microseconds.

    The Fix: Always profile before optimizing. On ESP32, use micros() to measure elapsed time. On ARM Cortex-M, use the DWT cycle counter (enabled via CoreDebug register). Identify the actual hotspot before optimizing:

    // Step 1: Instrument code to find actual bottleneck
    void loop() {
        uint32_t t0 = micros();
        readSensors();           // Suspect this is slow
        uint32_t t1 = micros();
        processData();           // Spent 2 weeks optimizing this
        uint32_t t2 = micros();
        transmitData();          // Never measured
        uint32_t t3 = micros();
    
        Serial.printf("Read: %lu us, Process: %lu us, Transmit: %lu us\n",
                      t1-t0, t2-t1, t3-t2);
    }
    // Typical result: Read: 2000 us, Process: 50 us, Transmit: 150000 us
    // The "optimized" processData() was 0.03% of total time!
    // Real fix: Reduce transmission frequency or payload size
    
    // For ARM Cortex-M cycle-accurate profiling:
    CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;  // Enable trace
    DWT->CYCCNT = 0;                                   // Reset counter
    DWT->CTRL |= DWT_CTRL_CYCCNTENA_Msk;              // Start counting
    // ... code to measure ...
    uint32_t cycles = DWT->CYCCNT;                    // Read cycles

    Rule of thumb: 80% of execution time comes from 20% of code. Find the 20% first.

Flowchart showing how seemingly simple optimizations like loop unrolling or inlining can cause unexpected complications including increased code size, instruction cache misses, register spilling, and reduced maintainability.

Optimization Complications
Figure 1623.14: Optimization often introduces unexpected complications. This visualization shows how aggressive inlining can increase code size beyond instruction cache capacity, or how loop unrolling can cause register spilling. Profiling before and after optimization prevents these surprises.

1623.7 Summary

Software optimization techniques are essential for efficient IoT firmware:

1. Compiler Flags: -O3 for speed (large code), -Os for size (compact), -O2 for balance
2. Code Size Matters: Flash memory dominates die area (27.3 mm2 vs 0.43 mm2 for CPU)
3. Dual Instruction Sets: Thumb/Thumb-2 provide roughly 30% code size reduction
4. Vectorization (SIMD): Process 4+ elements per instruction for 4x+ speedup
5. Function Inlining: Reduces call overhead but increases code size
6. Opportunistic Sleeping: Transition to low-power modes as soon as possible
7. Profile First: Measure actual bottlenecks before optimizing

The key is applying the right technique to the measured bottleneck, not optimizing based on assumptions.

1623.8 What’s Next

The next chapter covers Fixed-Point Arithmetic, which explores Qn.m format representation, conversion from floating-point, and efficient implementation on embedded processors without floating-point hardware.