1623  Software Optimization Techniques

1623.1 Learning Objectives

By the end of this chapter, you will be able to:

  • Select appropriate compiler optimization flags: Choose between -O0, -O2, -O3, and -Os based on requirements
  • Analyze code size considerations: Understand why flash memory dominates embedded system costs
  • Apply SIMD vectorization: Exploit data-level parallelism for array processing
  • Implement function inlining strategies: Balance call overhead against code size
  • Design opportunistic sleeping patterns: Maximize battery life through intelligent power management
  • Profile before optimizing: Use measurement to identify actual bottlenecks

1623.2 Prerequisites

Before diving into this chapter, you should be familiar with:

1623.3 Software Optimization

1623.3.1 Compiler Optimization Choices

%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#2C3E50', 'primaryTextColor': '#fff', 'primaryBorderColor': '#16A085', 'lineColor': '#16A085', 'secondaryColor': '#E67E22', 'tertiaryColor': '#ECF0F1', 'fontFamily': 'Inter, system-ui, sans-serif'}}}%%
graph TD
    SOURCE["Source Code"] --> COMPILER{"Compiler<br/>Optimization"}

    COMPILER -->|"-O0"| O0["No Optimization<br/>Fast compile<br/>Easy debug<br/>Large/slow code"]
    COMPILER -->|"-O1/-O2"| O2["Balanced<br/>Moderate compile<br/>Good performance<br/>Reasonable size"]
    COMPILER -->|"-O3"| O3["Speed Focus<br/>Slow compile<br/>Aggressive inlining<br/>Largest code size"]
    COMPILER -->|"-Os"| OS["Size Focus<br/>Moderate compile<br/>Compact code<br/>Reduced performance"]
    COMPILER -->|"-On"| ON["Custom Passes<br/>Fine control<br/>LLVM best<br/>Expert level"]

    style SOURCE fill:#E67E22,stroke:#2C3E50,stroke-width:2px,color:#fff
    style O0 fill:#E74C3C,stroke:#2C3E50,stroke-width:2px,color:#fff
    style O2 fill:#16A085,stroke:#2C3E50,stroke-width:2px,color:#fff
    style O3 fill:#F39C12,stroke:#2C3E50,stroke-width:2px,color:#fff
    style OS fill:#3498DB,stroke:#2C3E50,stroke-width:2px,color:#fff
    style ON fill:#9B59B6,stroke:#2C3E50,stroke-width:2px,color:#fff

Figure 1623.1: Compiler Optimization Levels: -O0 to -O3 and -Os Trade-offs

{fig-alt="Compiler optimization level comparison showing trade-offs between -O0 (no optimization), -O1/-O2 (balanced), -O3 (speed-focused), -Os (size-focused), and -On (custom passes) in terms of compile time, code size, and performance"}

-O3 (Optimize for Speed):

  • May aggressively inline functions
  • Generates significantly larger code
  • Re-orders instructions for better performance
  • Selects complex instruction encodings

-Os (Optimize for Size):

  • Weights inlining decisions against code growth
  • Generates shorter instruction encodings
  • Affects instruction scheduling
  • Eliminates fewer branches

-On (Custom Optimization):

  • Applies very specific optimization passes
  • Can insert assembly-language templates
  • LLVM’s pass infrastructure is particularly well suited to this
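Optimization level need not be uniform across a project. As a hedged illustration (assuming GCC or a GCC-compatible cross-compiler such as arm-none-eabi-gcc, whose optimize and noinline function attributes are non-standard extensions), individual functions can override the global flag:

```c
#include <stdint.h>

/* Build the whole firmware with -Os, but ask GCC to compile this hot
 * inner loop as if -O3 were in effect. */
__attribute__((optimize("O3")))
void scale_samples(int32_t *buf, int n)
{
    for (int i = 0; i < n; i++)
        buf[i] *= 2;
}

/* Rarely-executed error path: keep it out of line so it never bloats
 * its callers, regardless of the global optimization level. */
__attribute__((noinline))
void report_fault(int code)
{
    (void)code;   /* log-and-halt logic would go here */
}
```

The exact behaviour of the optimize attribute varies between GCC versions, so it is best treated as a hint rather than a guarantee.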

1623.3.2 Code Size Considerations

%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#2C3E50', 'primaryTextColor': '#fff', 'primaryBorderColor': '#16A085', 'lineColor': '#16A085', 'secondaryColor': '#E67E22', 'tertiaryColor': '#ECF0F1', 'fontFamily': 'Inter, system-ui, sans-serif'}}}%%
graph LR
    subgraph "Die Area Comparison @ 0.13um"
        M3["ARM Cortex M3<br/>0.43 mm2"]
        FLASH["128-Mbit Flash<br/>27.3 mm2"]
    end

    M3 -.->|"63x larger!"| FLASH

    subgraph "Code Size Strategies"
        DUAL["Dual Instruction Sets<br/>Thumb/Thumb-2<br/>16-bit + 32-bit"]
        CISC["CISC Encodings<br/>x86, System/360<br/>Complex instructions"]
        COMPRESS["Code Compression<br/>Decompress at runtime"]
    end

    FLASH --> DUAL
    FLASH --> CISC
    FLASH --> COMPRESS

    style M3 fill:#16A085,stroke:#2C3E50,stroke-width:2px,color:#fff
    style FLASH fill:#E74C3C,stroke:#2C3E50,stroke-width:3px,color:#fff
    style DUAL fill:#3498DB,stroke:#2C3E50,stroke-width:2px,color:#fff
    style CISC fill:#3498DB,stroke:#2C3E50,stroke-width:2px,color:#fff
    style COMPRESS fill:#3498DB,stroke:#2C3E50,stroke-width:2px,color:#fff

Figure 1623.2: Code Size Impact: Flash Memory Dominates Die Area vs ARM Cortex M3

{fig-alt="Code size impact diagram showing flash memory dominating die area (27.3mm2 vs 0.43mm2 for ARM Cortex M3) and three code size reduction strategies: dual instruction sets, CISC encodings, and code compression"}

Does code size matter?

  • 128-Mbit Flash = 27.3 mm2 @ 0.13um
  • ARM Cortex M3 = 0.43 mm2 @ 0.13um
  • Flash memory can dominate die area (roughly 63x the CPU core)!

Dual Instruction Sets (Thumb/Thumb-2, ARCompact, microMIPS):

  • 16-bit instructions come with constraints (limited register access, reduced immediate ranges)

CISC Instruction Sets (x86, System/360, PDP-11):

  • Complex encodings do more work per instruction
  • Require more complex decode hardware
  • Compiler support may be limited

1623.4 Advanced Software Optimizations

1623.4.1 Vectorization

%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#2C3E50', 'primaryTextColor': '#fff', 'primaryBorderColor': '#16A085', 'lineColor': '#16A085', 'secondaryColor': '#E67E22', 'tertiaryColor': '#ECF0F1', 'fontFamily': 'Inter, system-ui, sans-serif'}}}%%
sequenceDiagram
    participant Code as Source Code
    participant Scalar as Scalar Execution<br/>(1 element/instruction)
    participant SIMD as SIMD Vectorization<br/>(4 elements/instruction)

    Note over Code: for i = 0 to 999<br/>  array[i] = array[i] x 2

    rect rgb(231, 76, 60, 0.1)
        Note over Scalar: 1000 iterations
        loop 1000 times
            Scalar->>Scalar: Load 1 element
            Scalar->>Scalar: Multiply by 2
            Scalar->>Scalar: Store 1 element
        end
    end

    rect rgb(22, 160, 133, 0.1)
        Note over SIMD: 250 iterations (4x faster!)
        loop 250 times
            SIMD->>SIMD: Load 4 elements
            SIMD->>SIMD: Multiply 4x2 (parallel)
            SIMD->>SIMD: Store 4 elements
        end
    end

    Note over SIMD: Data-level parallelism<br/>Better memory bandwidth<br/>Fewer loop iterations

Figure 1623.3: SIMD Vectorization: Scalar vs Parallel Array Processing Performance

{fig-alt="SIMD vectorization comparison showing scalar execution processing one element per iteration requiring 1000 iterations versus SIMD processing four elements per instruction requiring only 250 iterations for 4x speedup"}

Key Idea: Operate on multiple data elements with a single instruction (SIMD)

Benefits:

  • Fewer loop iterations
  • Better memory bandwidth utilization
  • Exploits data-level parallelism

Caution: Extra epilogue code is needed if the vector width doesn’t divide the iteration count exactly (see the remainder loop in the sketch below).
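As a concrete sketch (assuming a NEON-capable ARM core such as a Cortex-A part and the arm_neon.h intrinsics header; Cortex-M microcontrollers generally lack NEON), the loop from the figure can be vectorized four elements at a time, with a scalar remainder loop for leftover elements:

```c
#include <arm_neon.h>
#include <stdint.h>

/* Multiply every element of buf by 2, four elements per iteration. */
void scale_by_two(int32_t *buf, int n)
{
    int i = 0;

    /* Vectorized body: 4 x 32-bit elements per instruction. */
    for (; i + 4 <= n; i += 4) {
        int32x4_t v = vld1q_s32(&buf[i]);    /* load 4 elements        */
        v = vmulq_n_s32(v, 2);               /* multiply all 4 at once */
        vst1q_s32(&buf[i], v);               /* store 4 elements       */
    }

    /* Remainder (epilogue) loop: needed when n is not a multiple of 4. */
    for (; i < n; i++)
        buf[i] *= 2;
}
```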

Figure 1623.4: Vectorization code transformation: a scalar loop converted to SIMD operations that process 4 elements per iteration instead of 1.

Figure 1623.5: Vectorization performance: execution time of scalar versus vectorized code, showing a 4x speedup from SIMD vectorization on ARM NEON.

Figure 1623.6: Scalar versus vector operations: scalar processing handles one data element at a time, while vector operations process multiple elements in parallel.

Artistic visualization of parallel computation strategies including pipelining sensor reads while previous data is transmitted, overlapping ADC sampling with processing, and multi-core task partitioning

Exploiting Parallel Computation
Figure 1623.7: Parallel execution maximizes hardware utilization. This visualization demonstrates how to pipeline sensor reading, processing, and transmission so they overlap in time, effectively hiding latency and increasing throughput on single-core and multi-core MCUs.

1623.4.2 Function Inlining

Advantages:

  • Low calling overhead
  • Avoids branch delay
  • Enables further optimizations across the former call boundary

Limitations:

  • Not all functions can be inlined
  • Code size explosion is possible
  • May require manual intervention (the inline qualifier or compiler hints)

Geometric visualization comparing function calls versus inlined code. Shows the overhead of call/return instructions, stack operations, and register saving versus the code size increase from inlining. Illustrates when inlining improves performance.

Function Inlining

Function inlining eliminates call overhead but increases code size, requiring careful analysis of which functions benefit from this optimization.
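A minimal sketch of the trade-off in plain C: a tiny accessor called from hot loops is a good inlining candidate, while a large routine called from many places is better left out of line.

```c
#include <stdint.h>

/* Good candidate: tiny body, called from hot loops. 'static inline'
 * lets the compiler drop the call/return and argument shuffling. */
static inline uint16_t adc_to_millivolts(uint16_t raw)
{
    return (uint16_t)((raw * 3300UL) / 4095U);
}

/* Poor candidate: large body called from many places. Forcing it inline
 * would duplicate all of its code at every call site. */
uint16_t median_filter(const uint16_t *window, int n);  /* keep out of line */
```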

1623.4.3 Opportunistic Sleeping

Strategy: Transition the processor and peripherals to the lowest usable power mode as soon as possible (a minimal code sketch follows the list below)

  • Similar to stop/start in cars!
  • Interrupts wake the processor for processing
  • Balance energy savings vs reaction time
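A minimal sketch of the pattern for an ARM Cortex-M part; normally the CMSIS __WFI() intrinsic from the device header would be used, so the inline-assembly stand-in below, the ISR name, and the processing hook are assumptions for illustration:

```c
#include <stdint.h>

/* Stand-in for the CMSIS __WFI() intrinsic: the ARM "wait for interrupt"
 * instruction via inline assembly (GCC/Clang syntax). */
static inline void wfi(void) { __asm volatile ("wfi"); }

volatile uint8_t sample_ready = 0;     /* set from the ADC/timer interrupt */

void process_sample(void);             /* hypothetical application hook */

void ADC_IRQHandler(void)              /* hypothetical ISR name */
{
    sample_ready = 1;
}

int main(void)
{
    for (;;) {
        if (sample_ready) {
            sample_ready = 0;
            process_sample();
        }
        /* Nothing pending: sleep until the next interrupt wakes the core.
         * A production version would also guard the window between the
         * flag check and the sleep instruction. */
        wfi();
    }
}
```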

Artistic graph showing power consumption versus clock frequency relationship where reducing clock rate from 80MHz to 20MHz saves 75 percent power while still meeting real-time deadlines for sensor sampling

Minimize Clock Rate Strategy
Figure 1623.8: Dynamic voltage and frequency scaling (DVFS) reduces power consumption when full performance is unnecessary. Dynamic power follows roughly P ≈ α·C·V²·f, so at a fixed supply voltage it falls about linearly with clock frequency (running at 25% speed cuts dynamic power by roughly 75%), and lowering the supply voltage along with the frequency yields even larger savings for tasks that are not time-critical.
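As a hedged sketch of the same idea in firmware (assuming the Arduino-ESP32 core, which exposes setCpuFrequencyMhz() for run-time clock changes; the processing function is hypothetical and the frequencies are illustrative):

```cpp
// Arduino-ESP32 sketch: raise the clock only for the compute burst.
void processSensorWindow() {
    // hypothetical compute-heavy step, e.g. filtering a sample window
}

void setup() {
    Serial.begin(115200);
}

void loop() {
    setCpuFrequencyMhz(240);         // full speed for signal processing
    processSensorWindow();
    setCpuFrequencyMhz(80);          // low clock while waiting for next window
    delay(100);                      // idle until the next sampling period
}
```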

Artistic comparison showing energy cost of transmitting raw sensor data to cloud versus performing local feature extraction and sending only compressed results, with order of magnitude energy savings illustrated

Local Computation vs Transmission - Part 1
Figure 1623.9: Local computation often saves more energy than transmission. This visualization compares the energy cost of sending raw accelerometer data (150 bytes/sample at 100Hz) versus extracting features locally and transmitting only activity classification results (1 byte/second).
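A minimal sketch of local feature extraction; the energy thresholds and the radio_send_byte() driver call are hypothetical, and one byte per second replaces roughly 15 kB/s of raw accelerometer samples:

```c
#include <stdint.h>

/* Hypothetical radio driver call: transmit a single byte. */
void radio_send_byte(uint8_t b);

/* Compute a simple energy feature over a one-second window and classify
 * it locally instead of streaming the raw 3-axis samples. */
uint8_t classify_window(const int16_t *ax, const int16_t *ay,
                        const int16_t *az, int n)
{
    float energy = 0.0f;
    for (int i = 0; i < n; i++) {
        energy += (float)ax[i] * ax[i]
                + (float)ay[i] * ay[i]
                + (float)az[i] * az[i];
    }
    energy /= (float)n;

    /* Hypothetical thresholds: 0 = still, 1 = walking, 2 = running. */
    if (energy < 1.0e6f) return 0;
    if (energy < 1.0e7f) return 1;
    return 2;
}

/* Once per second: one byte over the air instead of 100 raw samples. */
void report_activity(const int16_t *ax, const int16_t *ay,
                     const int16_t *az, int n)
{
    radio_send_byte(classify_window(ax, ay, az, n));
}
```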

Artistic decision tree for choosing between local processing and cloud offloading based on data rate, model complexity, latency requirements, and available MCU compute resources

Local Computation vs Transmission - Part 2
Figure 1623.10: Deciding between local and remote computation depends on multiple factors. This visualization presents a decision framework considering data rate, model complexity, latency requirements, and available compute resources to determine optimal processing placement.

Artistic timeline showing LTE cellular modem power states demonstrating how batching multiple sensor readings into single transmission amortizes 2-3 second connection setup overhead across many data points

LTE Data Batching
Figure 1623.11: Cellular transmission has high setup overhead (2-3 seconds to establish connection). This visualization shows how batching multiple sensor readings into a single transmission amortizes this overhead, improving energy efficiency by 10-100x for bursty IoT data.
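A minimal batching sketch; modem_connect(), modem_send(), and modem_disconnect() stand in for whatever cellular driver the platform provides, and the batch size is illustrative:

```c
#include <stdint.h>

#define BATCH_SIZE 120               /* e.g., 2 minutes of readings at 1 Hz */

/* Hypothetical modem driver calls. */
void modem_connect(void);            /* ~2-3 s of high-power radio activity */
void modem_send(const void *buf, uint32_t len);
void modem_disconnect(void);

static uint16_t batch[BATCH_SIZE];
static uint32_t count = 0;

/* Called once per reading: buffer locally, transmit only when full so the
 * connection-setup energy is shared across BATCH_SIZE readings. */
void log_reading(uint16_t value)
{
    batch[count++] = value;
    if (count == BATCH_SIZE) {
        modem_connect();
        modem_send(batch, sizeof batch);
        modem_disconnect();
        count = 0;
    }
}
```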

Artistic bar chart comparing energy per transmitted byte across wireless protocols showing Wi-Fi at 1000 microjoules, BLE at 50 microjoules, LoRa at 10 microjoules, and Zigbee at 30 microjoules per byte

Networking Energy Costs
Figure 1623.12: Communication dominates IoT energy budgets. This visualization compares the energy cost per transmitted byte across common wireless protocols, helping engineers choose the most efficient protocol for their data rate and range requirements.

Artistic flowchart of optimized sensor data processing pipeline using lookup tables, fixed-point arithmetic, and branch-free conditional moves to achieve 10x speedup over naive implementation

Optimized Efficient Processing
Figure 1623.13: Algorithmic optimization can yield dramatic improvements. This visualization shows a sensor processing pipeline optimized with lookup tables for trigonometric functions, fixed-point arithmetic, and branch-free conditional moves that achieves 10x speedup.
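Two of these ideas in miniature, assuming a Q15 sine table generated offline (only the first few entries are shown) and a clamp written so that compilers can typically emit conditional-move/select instructions instead of branches:

```c
#include <stdint.h>

/* Lookup table: sin() sampled at 256 points over one period, stored in
 * Q15 fixed point (generated offline; only the first entries shown). */
static const int16_t sin_lut[256] = { 0, 804, 1608, 2410, /* ... */ };

/* One table read replaces a floating-point sinf() call. */
static inline int16_t fast_sin_q15(uint8_t phase)   /* phase: 0..255 = 0..2*pi */
{
    return sin_lut[phase];
}

/* Branch-free clamp to [lo, hi]: most compilers turn these ternaries into
 * conditional moves/selects, avoiding hard-to-predict branches. */
static inline int32_t clamp_i32(int32_t x, int32_t lo, int32_t hi)
{
    x = x < lo ? lo : x;
    x = x > hi ? hi : x;
    return x;
}
```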

1623.5 Knowledge Check

Question 1: Which compiler flag typically produces the largest code size?

-O3 typically produces the largest code because of aggressive function inlining and loop unrolling. While -O0 produces unoptimized code, -O3’s speed-oriented transformations usually expand code size far more. -Os specifically optimizes for smaller code.

Question 2: What does the -Os compiler flag optimize for?

-Os (optimize for size) penalizes inlining decisions, generates shorter instruction encodings, and prioritizes compact code. This is critical for flash-constrained IoT devices where memory dominates die area and cost.

Question 3: What is the main benefit of vectorization (SIMD)?

SIMD (Single Instruction Multiple Data) processes 4 or more values per instruction, providing 4x+ speedup for array operations. For 1000 values: scalar requires 1000 iterations, SIMD requires 250 iterations (1000/4) = 4x faster.

Question 4: What is a limitation of function inlining?

While function inlining eliminates call overhead and enables further optimizations, it copies the function body to every call site, potentially causing code size explosion. This can overflow flash memory and cause instruction cache misses.

Question 5: Your IoT sensor processes 10,000 samples per second. The baseline implementation takes 50,000 cycles per sample on a 100 MHz processor. After loop unrolling and SIMD optimization, it takes 12,500 cycles per sample. What is the throughput improvement?

Throughput = Clock / Cycles per sample. Baseline: 100 MHz / 50,000 cycles = 2,000 samples/s. Optimized: 100 MHz / 12,500 cycles = 8,000 samples/s. Improvement: 8,000 / 2,000 = 4x throughput increase. The optimization reduced cycles by 4x, which directly translates to 4x throughput.

Question 6: Which code snippet demonstrates better optimization for an IoT device processing array data?

  • Version B demonstrates loop unrolling. By processing 4 elements per iteration, it reduces: (1) Loop overhead by 4x (250 iterations vs 1000), (2) Branch mispredictions, (3) Enables compiler to apply SIMD vectorization automatically. Modern compilers can often parallelize the 4 independent operations using SIMD instructions.

1623.6 Common Pitfalls

    Pitfall: Using printf/Serial.print for Timing-Critical Debug

    The Mistake: Adding Serial.print() or printf() statements inside tight loops or ISRs to debug timing-sensitive code, then wondering why the bug disappears when debug output is added, or why the system becomes unreliable during debugging.

    Why It Happens: Serial output is extremely slow compared to CPU operations. A single Serial.print("x") on Arduino at 115200 baud takes approximately 87 microseconds (10 bits per character at 115200 bps). In a 1 MHz loop (1 us per iteration), adding one print statement slows execution by 87x and changes all timing relationships. This creates “Heisenbugs” - bugs that disappear when you try to observe them.

    The Fix: Use GPIO pin toggling for timing-critical debugging - set a pin HIGH at function entry, LOW at exit, and observe with an oscilloscope or logic analyzer. For capturing values without affecting timing, write to a circular buffer in RAM and dump after the critical section completes:

    // WRONG: Serial output changes timing
    void IRAM_ATTR timerISR() {
        Serial.println(micros());  // Takes 500+ us, destroys timing!
        processData();
    }
    
    // CORRECT: GPIO toggle for timing debug (oscilloscope required)
    #define DEBUG_PIN 2
    void IRAM_ATTR timerISR() {
        digitalWrite(DEBUG_PIN, HIGH);  // ~0.1 us
        processData();
        digitalWrite(DEBUG_PIN, LOW);   // ~0.1 us
    }
    
    // CORRECT: Buffer capture for value debug
    volatile uint32_t debugBuffer[64];
    volatile uint8_t debugIndex = 0;
    
    void IRAM_ATTR timerISR() {
        if (debugIndex < 64) {
            debugBuffer[debugIndex++] = ADC_VALUE;  // ~0.2 us
        }
        processData();
    }
    // Dump buffer in loop() after critical operation completes

    Pitfall: Optimizing Without Profiling (Premature Optimization)

    The Mistake: Spending hours hand-optimizing a function that “looks slow” (like a nested loop or floating-point math) without measuring whether it actually impacts overall system performance, while ignoring the true bottleneck.

    Why It Happens: Developers have intuitions about what should be slow based on algorithm complexity or “expensive” operations. But modern compilers optimize heavily, and I/O operations (network, storage, peripherals) often dominate execution time. A function with O(n^2) complexity processing 10 items (100 operations) may take 1 microsecond, while a single I2C sensor read takes 500 microseconds.

    The Fix: Always profile before optimizing. On ESP32, use micros() to measure elapsed time. On ARM Cortex-M, use the DWT cycle counter (enabled via CoreDebug register). Identify the actual hotspot before optimizing:

    // Step 1: Instrument code to find actual bottleneck
    void loop() {
        uint32_t t0 = micros();
        readSensors();           // Suspect this is slow
        uint32_t t1 = micros();
        processData();           // Spent 2 weeks optimizing this
        uint32_t t2 = micros();
        transmitData();          // Never measured
        uint32_t t3 = micros();
    
        Serial.printf("Read: %lu us, Process: %lu us, Transmit: %lu us\n",
                      t1-t0, t2-t1, t3-t2);
    }
    // Typical result: Read: 2000 us, Process: 50 us, Transmit: 150000 us
    // The "optimized" processData() was 0.03% of total time!
    // Real fix: Reduce transmission frequency or payload size
    
    // For ARM Cortex-M cycle-accurate profiling:
    CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;  // Enable trace
    DWT->CYCCNT = 0;                                   // Reset counter
    DWT->CTRL |= DWT_CTRL_CYCCNTENA_Msk;              // Start counting
    // ... code to measure ...
    uint32_t cycles = DWT->CYCCNT;                    // Read cycles

    Rule of thumb: 80% of execution time comes from 20% of code. Find the 20% first.

Flowchart showing how seemingly simple optimizations like loop unrolling or inlining can cause unexpected complications including increased code size, instruction cache misses, register spilling, and reduced maintainability.

Optimization Complications
Figure 1623.14: Optimization often introduces unexpected complications. This visualization shows how aggressive inlining can increase code size beyond instruction cache capacity, or how loop unrolling can cause register spilling. Profiling before and after optimization prevents these surprises.

1623.7 Summary

Software optimization techniques are essential for efficient IoT firmware:

1. Compiler Flags: -O3 for speed (large code), -Os for size (compact), -O2 for balance
2. Code Size Matters: Flash memory dominates die area (27.3 mm2 vs 0.43 mm2 for CPU)
3. Dual Instruction Sets: Thumb/Thumb-2 provide roughly 30% code size reduction
4. Vectorization (SIMD): Process 4+ elements per instruction for 4x+ speedup
5. Function Inlining: Reduces call overhead but increases code size
6. Opportunistic Sleeping: Transition to low-power modes as soon as possible
7. Profile First: Measure actual bottlenecks before optimizing

The key is applying the right technique to the measured bottleneck, not optimizing based on assumptions.

1623.8 What’s Next

The next chapter covers Fixed-Point Arithmetic, which explores Qn.m format representation, conversion from floating-point, and efficient implementation on embedded processors without floating-point hardware.