17  Software Optimization Techniques

17.1 Learning Objectives

By the end of this chapter, you will be able to:

  • Select appropriate compiler optimization flags: Choose between -O0, -O2, -O3, and -Os based on requirements
  • Analyze code size considerations: Understand why flash memory dominates embedded system costs
  • Apply SIMD vectorization: Exploit data-level parallelism for array processing
  • Implement function inlining strategies: Balance call overhead against code size
  • Design opportunistic sleeping patterns: Maximize battery life through intelligent power management
  • Profile before optimizing: Use measurement to identify actual bottlenecks

In 60 Seconds

Software optimization on IoT devices starts with the right compiler flag (-Os for smallest flash footprint, -O2 for speed-energy balance), then applies firmware techniques like function inlining, loop unrolling, and SIMD vectorization — always guided by profiling to ensure effort targets the actual bottleneck, not where the developer guesses.

Key Concepts

  • Compiler Optimization Flag: -O0 disables optimization (for debugging), -O1 minimal, -O2 balanced speed/size, -O3 aggressive speed (may increase size), -Os optimizes for minimal code size
  • SIMD (Single Instruction Multiple Data): Processor instructions that operate on multiple data elements simultaneously; ARM NEON can process 4 floats with a single instruction instead of four separate scalar instructions
  • Function Inlining: Replacing a function call with the function body at the call site; eliminates call overhead but increases code size; controlled with __attribute__((always_inline)) or inline keyword
  • Loop Unrolling: Replicating a loop body multiple times to reduce loop overhead and expose instruction-level parallelism; useful when loop body is simple and iteration count is known
  • Opportunistic Sleeping: Entering a sleep state between I/O operations or while waiting for peripherals, rather than busy-waiting; critical for energy-efficient embedded firmware
  • Link-Time Optimization (LTO): Compiler optimization applied across translation units at link time; can eliminate dead code and inline across file boundaries
  • Dead Code Elimination: Compiler pass that removes unreachable or unused code paths; reduces flash footprint without programmer intervention

Energy and power management determines how long your IoT device can operate between battery changes or charges. Think of packing for a camping trip with limited battery packs – every bit of power must be used wisely. Since many IoT sensors need to run for months or years unattended, power management is often the single most important engineering decision.

“The compiler is your secret weapon,” said Max the Microcontroller. “When you compile code with the -Os flag, it automatically shrinks your program to fit in less flash memory. With -O2, it makes your code run faster. The compiler does hundreds of tricks that would take humans hours to apply manually.”

Sammy the Sensor learned about SIMD: “That stands for Single Instruction, Multiple Data. Instead of processing sensor readings one at a time, SIMD lets Max process four or eight readings simultaneously with one instruction. It is like washing four dishes at once instead of one!”

“My favorite optimization is opportunistic sleeping,” said Bella the Battery. “Whenever Max finishes his work early, instead of busy-waiting for the next task, he goes to sleep immediately. Even sleeping for a few milliseconds adds up over millions of cycles. The formula is simple: sleep whenever there is nothing to do, no matter how brief.”

Lila the LED cautioned, “But always profile before optimizing! Measure where time and energy actually go, then optimize the hot spots. Optimizing code that runs 0.1 percent of the time is a waste of YOUR time!”

17.2 Prerequisites

Before diving into this chapter, you should be familiar with:

17.3 Software Optimization

17.3.1 Compiler Optimization Choices

Flowchart showing compiler optimization flag trade-offs: -O0 produces unoptimized code with fastest compile time; -O2 provides balanced optimization; -O3 maximizes speed with aggressive inlining and instruction reordering; -Os minimizes code size with shorter encodings; custom pass selection (e.g., via LLVM) enables targeted optimizations for specific use cases
Figure 17.1: Compiler Optimization Levels: -O0 to -O3 and -Os Trade-offs

-O3 (Optimize for Speed):

  • Aggressively inlines functions to eliminate call overhead
  • May generate significantly larger code size
  • Reorders instructions for better pipeline utilization
  • Selects complex instruction encodings for maximum performance

-Os (Optimize for Size):

  • Penalizes inlining decisions that would grow code
  • Prefers shorter instruction encodings
  • Constrains instruction scheduling to avoid size increases
  • Eliminates fewer branches, trading some speed for size
Interactive Calculator: SIMD Vectorization Speedup

Real-World Impact: Tile firmware optimization using -Os + SIMD + opportunistic sleeping extended battery life from 200 days to 810 days (4× improvement). The -Os compiler flag alone saved 42 KB flash memory (code size: 198 KB → 156 KB), enabling use of a smaller, less expensive microcontroller.

Custom Optimization (beyond the standard -O levels):

  • Applies very specific optimization passes individually
  • Allows inserting assembly-language templates for critical code
  • LLVM's pass infrastructure is best suited to this

17.3.2 Code Size Considerations

Diagram comparing die area requirements: 128-Mbit flash memory occupies 27.3 square millimeters versus ARM Cortex M3 processor core at only 0.43 square millimeters, demonstrating how flash memory dominates chip die area. Shows three code size reduction strategies: dual instruction sets like Thumb/Thumb-2 with 16-bit instructions, CISC instruction sets with complex encodings, and code compression techniques
Figure 17.2: Code Size Impact: Flash Memory Dominates Die Area vs ARM Cortex M3

Does code size matter?

  • 128-Mbit flash = 27.3 mm² at a 0.13 µm process node
  • ARM Cortex-M3 core = 0.43 mm² at the same node
  • Flash memory can dominate die area!

Dual Instruction Sets (Thumb/Thumb-2, ARCompact, microMIPS): 16-bit instructions shrink code but impose constraints such as limited register access and reduced immediate ranges.

CISC Instruction Sets (x86, System/360, PDP-11): complex encodings do more work per instruction, but they require more complex decode hardware, and compiler support may be limited.

Try It: Code Size vs Die Area Explorer

Explore how code size affects chip die area and cost. Flash memory dominates embedded system die area – see how different optimization levels change the balance.

17.4 Advanced Software Optimizations

17.4.1 Vectorization

Diagram comparing scalar versus SIMD vectorization execution: scalar loop processes one array element per instruction requiring 1000 iterations to process 1000 elements; SIMD vectorization with 4-wide vector processing handles four elements simultaneously per instruction requiring only 250 iterations, achieving 4x speedup with same total work but fewer loop iterations and reduced overhead
Figure 17.3: SIMD Vectorization: Scalar vs Parallel Array Processing Performance

Key Idea: Operate on multiple data elements with a single instruction (SIMD - Single Instruction, Multiple Data)

Benefits:

  • Reduces loop iterations by vector width factor (4×, 8×, etc.)
  • Improves memory bandwidth utilization through wider loads/stores
  • Exploits data-level parallelism in array operations
  • Lower energy per operation by amortizing instruction fetch overhead

Important Consideration: Requires tail-handling code when array size is not evenly divisible by vector width. For example, processing 1003 elements with 4-wide SIMD requires 250 vector iterations plus 3 scalar operations for remaining elements.

Code transformation example showing scalar C loop processing array elements one at a time converted to vectorized code using SIMD intrinsics that process four elements per iteration, demonstrating syntax of vector load, vector multiply, and vector store operations with reduced loop iterations
Figure 17.4: Vectorization: Scalar Loop Converted to SIMD Intrinsics
Bar chart comparing execution time between scalar processing taking 1000 cycles and vectorized SIMD processing taking 250 cycles, demonstrating 4x speedup achieved through ARM NEON vector instructions processing four elements simultaneously
Figure 17.5: Vectorization Speedup: 1000 Scalar Cycles vs 250 SIMD Cycles
Side-by-side comparison showing scalar operations processing single data elements sequentially through processor arithmetic logic unit versus vector operations loading multiple data elements into wide SIMD registers and processing them in parallel with single instruction
Figure 17.6: Scalar Operations vs Vector Operations

Diagram showing three parallel computation strategies for IoT systems: instruction-level parallelism with pipelined sensor read overlapped with data transmission, data-level parallelism processing multiple sensor values simultaneously with SIMD instructions, and task-level parallelism distributing separate tasks across multiple cores or hardware accelerators

Exploiting Parallel Computation
Figure 17.7: Parallel execution maximizes hardware utilization. This visualization demonstrates how to pipeline sensor reading, processing, and transmission so they overlap in time, effectively hiding latency and increasing throughput on single-core and multi-core MCUs.

17.4.2 Function Inlining

Advantages:

  • Eliminates function call overhead (no push/pop of return address)
  • Avoids branch misprediction penalties
  • Enables further optimizations like constant propagation and dead code elimination across inlined boundaries

Limitations:

  • Not all functions are suitable for inlining (recursive functions, functions with complex control flow)
  • Can cause code size explosion if large functions are inlined at many call sites
  • May exceed instruction cache capacity, paradoxically reducing performance
  • Often requires manual hints via inline keyword or __attribute__((always_inline)) compiler directives

Diagram comparing function call overhead versus inlined code: left side shows call instruction pushing return address, jumping to function, executing function body, saving and restoring registers, then returning; right side shows inlined version where function body is copied directly at call site eliminating call/return overhead but duplicating code at each call location, increasing binary size

Function Inlining

Function inlining eliminates call overhead but increases code size, requiring careful analysis of which functions benefit from this optimization.

Try It: Function Inlining Cost-Benefit Calculator

Determine whether inlining a function saves or wastes resources. Adjust the function size, call count, and cache capacity to see when inlining helps vs hurts.

17.4.3 Opportunistic Sleeping

Strategy: Transition processor and peripherals to the lowest usable power mode as soon as possible

  • Analogous to auto start-stop systems in modern vehicles
  • Hardware interrupts (timers, GPIO, peripherals) wake the processor when events require attention
  • Critical trade-off: energy savings versus wake-up latency and responsiveness
  • Even brief sleep periods (milliseconds) accumulate significant energy savings over millions of cycles

Graph plotting power consumption versus clock frequency showing non-linear relationship: at 80 MHz baseline power is 100 milliwatts, reducing to 48 MHz drops power to 36 milliwatts, further reduction to 20 MHz brings power down to 25 milliwatts. Demonstrates how power scales super-linearly with frequency due to both dynamic switching power and voltage requirements, allowing 75 percent power savings by reducing clock rate from 80 MHz to 20 MHz while still meeting sensor sampling deadlines

Minimize Clock Rate Strategy
Figure 17.8: Dynamic voltage and frequency scaling (DVFS) reduces power consumption when full performance is unnecessary. This visualization shows the super-linear relationship between clock frequency and power (power scales approximately with \(f \times V^2\), and voltage must increase with frequency), demonstrating how reducing clock rate from 80 MHz to 20 MHz (25% speed) can reduce power by approximately 75% for tasks that are not time-critical.
Try It: DVFS Power Savings Calculator

Explore how dynamic voltage and frequency scaling saves power. Power scales as P = C * V^2 * f, and voltage must increase with frequency. See how reducing clock speed affects power, energy, and task completion time.

Energy cost comparison between two approaches for accelerometer activity tracking: left side shows transmitting raw sensor data requires 150 bytes per sample at 100 Hz sampling rate consuming 1.5 joules per hour via Wi-Fi transmission; right side shows local feature extraction processing data on MCU using 0.05 joules per hour then transmitting only activity classification result of 1 byte per second consuming 0.01 joules per hour, achieving 100x energy savings through edge processing versus cloud offloading

Local Computation vs Transmission - Part 1
Figure 17.9: Local computation often saves more energy than transmission. This visualization compares the energy cost of sending raw accelerometer data (150 bytes/sample at 100Hz) versus extracting features locally and transmitting only activity classification results (1 byte/second).

Decision flowchart for edge versus cloud processing showing four decision criteria: if data rate exceeds 100 KB per second favor local processing to avoid transmission overhead, if model complexity exceeds MCU capabilities favor cloud offloading to leverage server compute power, if latency requirement is under 100 milliseconds favor local processing for real-time response, if battery powered favor local processing to minimize radio energy consumption. Flowchart guides engineers through systematic evaluation of data volume, computational requirements, timing constraints, and power budget to select optimal processing location

Local Computation vs Transmission - Part 2
Figure 17.10: Deciding between local and remote computation depends on multiple factors. This visualization presents a decision framework considering data rate, model complexity, latency requirements, and available compute resources to determine optimal processing placement.

Timeline comparing two LTE transmission strategies: top timeline shows sending each sensor reading individually requiring modem wake-up consuming 2 joules, connection establishment consuming 3 joules, data transmission consuming 0.1 joules, then disconnect for total 5.1 joules per reading; bottom timeline shows batching 100 readings with one-time modem wake-up and connection establishment consuming 5 joules, then transmitting batched payload consuming 10 joules for total 15 joules across 100 readings or 0.15 joules per reading, achieving 34x energy savings by amortizing connection overhead across multiple data points

LTE Data Batching
Figure 17.11: Cellular transmission has high setup overhead (2-3 seconds to establish connection). This visualization shows how batching multiple sensor readings into a single transmission amortizes this overhead, improving energy efficiency by 10-100x for bursty IoT data.

Bar chart comparing energy consumption per transmitted byte across four wireless protocols: Wi-Fi consumes 1000 microjoules per byte providing high throughput at 54 Mbps with 100 meter range, BLE consumes 50 microjoules per byte at 1 Mbps throughput with 50 meter range, Zigbee consumes 30 microjoules per byte at 250 kbps with 100 meter range, LoRa consumes 10 microjoules per byte at 50 kbps with 10 kilometer range. Chart demonstrates inverse relationship between energy efficiency and data rate with long-range protocols achieving lowest energy per byte

Networking Energy Costs
Figure 17.12: Communication dominates IoT energy budgets. This visualization compares the energy cost per transmitted byte across common wireless protocols, helping engineers choose the most efficient protocol for their data rate and range requirements.
Try It: Data Batching Energy Savings

Explore how batching sensor readings into fewer transmissions saves energy. Cellular and Wi-Fi radios have high connection setup costs – batching amortizes this overhead across many readings.

Flowchart comparing naive versus optimized sensor data processing pipeline: naive version uses floating-point sin/cos function calls taking 200 cycles each, floating-point division taking 50 cycles, and conditional branches taking 10 cycles with potential misprediction penalty; optimized version replaces trigonometric functions with 256-entry lookup table requiring 4 cycles per access, uses 16-bit fixed-point integer arithmetic taking 2 cycles per operation, and employs branch-free conditional move instructions taking 1 cycle, achieving total speedup from 1000 cycles per sample to 100 cycles per sample for 10x performance improvement

Optimized Efficient Processing
Figure 17.13: Algorithmic optimization can yield dramatic improvements. This visualization shows a sensor processing pipeline optimized with lookup tables for trigonometric functions, fixed-point arithmetic, and branch-free conditional moves that achieves 10x speedup.

17.5 Knowledge Check

Pitfall: Using printf/Serial.print for Timing-Critical Debug

The Mistake: Adding Serial.print() or printf() statements inside tight loops or ISRs to debug timing-sensitive code, then wondering why the bug disappears when debug output is added, or why the system becomes unreliable during debugging.

Why It Happens: Serial output is extremely slow compared to CPU operations. A single Serial.print("x") on Arduino at 115200 baud takes approximately 87 microseconds (10 bits per character at 115200 bps). In a 1 MHz loop (1 us per iteration), adding one print statement slows execution by 87x and changes all timing relationships. This creates “Heisenbugs” - bugs that disappear when you try to observe them.

The Fix: Use GPIO pin toggling for timing-critical debugging - set a pin HIGH at function entry, LOW at exit, and observe with an oscilloscope or logic analyzer. For capturing values without affecting timing, write to a circular buffer in RAM and dump after the critical section completes:

// WRONG: Serial output changes timing
void IRAM_ATTR timerISR() {
    Serial.println(micros());  // Takes 500+ us, destroys timing!
    processData();
}

// CORRECT: GPIO toggle for timing debug (oscilloscope required)
#define DEBUG_PIN 2
void IRAM_ATTR timerISR() {
    digitalWrite(DEBUG_PIN, HIGH);  // ~0.1 us
    processData();
    digitalWrite(DEBUG_PIN, LOW);   // ~0.1 us
}

// CORRECT: Buffer capture for value debug
volatile uint32_t debugBuffer[64];
volatile uint8_t debugIndex = 0;

void IRAM_ATTR timerISR() {
    if (debugIndex < 64) {
        debugBuffer[debugIndex++] = ADC_VALUE;  // ~0.2 us
    }
    processData();
}
// Dump buffer in loop() after critical operation completes

Pitfall: Optimizing Without Profiling (Premature Optimization)

The Mistake: Spending hours hand-optimizing a function that “looks slow” (like a nested loop or floating-point math) without measuring whether it actually impacts overall system performance, while ignoring the true bottleneck.

Why It Happens: Developers have intuitions about what should be slow based on algorithm complexity or “expensive” operations. But modern compilers optimize heavily, and I/O operations (network, storage, peripherals) often dominate execution time. A function with O(n^2) complexity processing 10 items (100 operations) may take 1 microsecond, while a single I2C sensor read takes 500 microseconds.

The Fix: Always profile before optimizing. On ESP32, use micros() to measure elapsed time. On ARM Cortex-M, use the DWT cycle counter (enabled via CoreDebug register). Identify the actual hotspot before optimizing:

// Step 1: Instrument code to find actual bottleneck
void loop() {
    uint32_t t0 = micros();
    readSensors();           // Suspect this is slow
    uint32_t t1 = micros();
    processData();           // Spent 2 weeks optimizing this
    uint32_t t2 = micros();
    transmitData();          // Never measured
    uint32_t t3 = micros();

    Serial.printf("Read: %lu us, Process: %lu us, Transmit: %lu us\n",
                  t1-t0, t2-t1, t3-t2);
}
// Typical result: Read: 2000 us, Process: 50 us, Transmit: 150000 us
// The "optimized" processData() was 0.03% of total time!
// Real fix: Reduce transmission frequency or payload size

// For ARM Cortex-M cycle-accurate profiling:
CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;  // Enable trace
DWT->CYCCNT = 0;                                   // Reset counter
DWT->CTRL |= DWT_CTRL_CYCCNTENA_Msk;              // Start counting
// ... code to measure ...
uint32_t cycles = DWT->CYCCNT;                    // Read cycles

Rule of thumb: 80% of execution time comes from 20% of code. Find the 20% first.

Flowchart illustrating unintended consequences of aggressive optimization: function inlining increases code size from 4 KB to 32 KB exceeding 16 KB instruction cache capacity causing cache thrashing and 2x slowdown instead of speedup; loop unrolling from 4 iterations to 16 iterations requires 24 registers but processor only has 16 registers causing register spilling to stack memory with 50 percent performance penalty; auto-vectorization introduces alignment requirements causing unaligned memory access faults. Demonstrates importance of measuring optimization impact rather than assuming all optimizations improve performance

Optimization Complications
Figure 17.14: Optimization often introduces unexpected complications. This visualization shows how aggressive inlining can increase code size beyond instruction cache capacity, or how loop unrolling can cause register spilling. Profiling before and after optimization prevents these surprises.

Overview of parallel processing opportunities in IoT applications including pipelined instruction execution, SIMD data-level parallelism for array processing, and multi-threaded task-level parallelism for sensor fusion and concurrent protocol handling

Parallelism Overview

Vector SIMD operation diagram showing four-wide vector register loading four array elements simultaneously, applying arithmetic operation in parallel with single instruction, and storing four results back to memory, contrasted with scalar operation requiring four separate load-compute-store sequences

Vector Operations

Source: CP IoT System Design Guide, Chapter 8 - Design and Prototyping

17.6 Worked Example: Firmware Optimization at Tile (Bluetooth Tracker)

Tile, the Bluetooth tracking device company, faced a critical battery life challenge in 2019 when developing their Tile Pro tracker. The device needed to advertise its BLE beacon continuously for a full year on a single CR2032 coin cell (220 mAh at 3V).

The Problem: Initial firmware consumed an average of 45 microamps, giving only 200 days of operation – well short of the 365-day target required for competitive parity with Apple AirTag.

Step 1: Profile the Power Budget

Component                        Initial Current   % of Total
BLE advertising (1 Hz)           18 uA             40%
MCU idle (clock running)         15 uA             33%
Sensor polling (accelerometer)   8 uA              18%
Voltage regulator quiescent      4 uA              9%
Total average                    45 uA             100%

Target: 220 mAh / (365 days x 24 hours) = 25 uA average

Step 2: Apply Software Optimizations

  1. Opportunistic sleeping – Replaced MCU idle mode (15 uA) with deep sleep between BLE events (0.5 uA). The nRF52 wakes only on RTC interrupt for advertising. Savings: 14.5 uA.

  2. Compiler flag change – Switched from -O2 to -Os for non-critical code paths, reducing flash usage from 198 KB to 156 KB. This allowed disabling one flash bank during sleep, saving 1.2 uA.

  3. Batch accelerometer reads – Instead of polling the accelerometer every 100 ms (continuous SPI bus activity), configured the accelerometer’s internal FIFO to buffer 32 samples and trigger an interrupt. The MCU sleeps during accumulation and reads the entire buffer in one SPI burst. Savings: 6 uA.

  4. Adaptive advertising interval – When accelerometer detects no motion for 30 seconds, advertising interval increases from 1 second to 4 seconds. Stationary devices (95% of the time) use 75% less advertising energy. Savings: 12 uA.

Step 3: Measure Results

Component                     Optimized Current   Savings
BLE advertising (adaptive)    6 uA                -12 uA
MCU deep sleep                0.5 uA              -14.5 uA
Sensor (batched FIFO)         2 uA                -6 uA
Voltage regulator             2.8 uA              -1.2 uA
Flash (one bank off)          0 uA                (included above)
Total average                 11.3 uA             -33.7 uA

Outcome: 220 mAh / 11.3 uA = 810 days (2.2 years) – exceeding the 365-day target by 2.2x. No hardware changes were needed; all improvements came from software optimization and compiler configuration. The adaptive advertising technique alone saved more power than all other optimizations combined.

Key Lesson: Profiling revealed that the “obvious” target (BLE radio) was only 40% of the problem. The MCU idle current (33%) was a larger opportunity because deep sleep reduced it by 97%, while BLE advertising could only be reduced by 67% without degrading user experience.

Interactive Calculator: Battery Life Estimation

17.7 Summary

Software optimization techniques are essential for maximizing IoT device efficiency, battery life, and cost-effectiveness:

  1. Compiler Optimization Flags: Choose -O3 for maximum speed (larger code size), -Os for minimum size (slower execution), or -O2 for balanced trade-off. Flash memory dominates embedded system cost (27.3 mm² vs 0.43 mm² for ARM Cortex-M3), making -Os critical for cost reduction.

  2. SIMD Vectorization: Process 4-16 elements simultaneously with single instructions (ARM NEON, Intel SSE/AVX) achieving 3-10× speedup for array operations with minimal code changes. Requires careful handling of non-divisible array lengths.

  3. Function Inlining: Eliminates call/return overhead and enables cross-function optimization, but risks code size explosion and instruction cache thrashing. Best applied selectively to small, frequently-called functions.

  4. Opportunistic Sleeping: Transition MCU to lowest viable power state immediately after completing work. Deep sleep modes reduce current consumption by 95-99% (e.g., 15 µA active → 0.5 µA sleep). Even millisecond sleep periods accumulate massive energy savings over device lifetime.

  5. Profile Before Optimizing: Always measure actual bottlenecks using profiling tools (micros(), DWT cycle counter, logic analyzer). The 80/20 rule applies: 80% of execution time occurs in 20% of code. Optimizing non-critical code wastes engineering effort without performance gains.

  6. Real-World Validation: Tile Bluetooth tracker case study demonstrates 4× battery life improvement (200 → 810 days) using software optimization alone: opportunistic sleeping saved 14.5 µA, adaptive advertising saved 12 µA, batched sensor reads saved 6 µA, compiler -Os flag saved 1.2 µA.

Key Principle: Apply the right optimization technique to the measured bottleneck, not to code that “looks slow.” Assumptions about performance are often wrong; profiling reveals the truth.

17.8 How It Works: Step-by-Step Breakdown

Understanding how compiler optimizations work helps you make informed choices about optimization flags and techniques.

The Compilation Pipeline:

  1. Source Code Parsing: Compiler reads C/C++ code and builds an Abstract Syntax Tree (AST)
  2. Intermediate Representation (IR): AST converted to platform-independent intermediate code
  3. Optimization Passes: Multiple transformation passes applied based on -O flag
  4. Code Generation: Optimized IR compiled to machine code (assembly)
  5. Linking: Machine code combined with libraries to create final binary

What -Os Actually Does (size optimization):

Original C code:
for (int i = 0; i < 4; i++) {
    array[i] = compute(i);
}

-O0 (no optimization):
- 4 separate compute() function calls
- 4 loop iterations with condition checks
- Result: 80 bytes of code

-Os (size optimization):
- Inlines compute() only if < 10 instructions
- Unrolls loop partially (2x2 instead of 4x1)
- Uses shorter instruction encodings
- Result: 42 bytes of code (47% smaller)

What -O3 Actually Does (speed optimization):

-O3 (speed optimization):
- Fully inlines compute() (eliminates call overhead)
- Fully unrolls loop (no loop overhead)
- Reorders instructions for pipelining
- Result: 140 bytes of code (75% larger, 3.5x faster)

SIMD Vectorization Mechanics:

When you enable vectorization, the compiler converts:

// Scalar code (processes 1 value per instruction)
for (int i = 0; i < 1000; i++) {
    output[i] = input[i] * scale;
}

Into:

// Vectorized code (processes 4 values per instruction on ARM NEON)
for (int i = 0; i < 1000; i += 4) {
    // Single SIMD instruction multiplies 4 values simultaneously
    vst1q_f32(&output[i], vmulq_f32(vld1q_f32(&input[i]), vdupq_n_f32(scale)));
}

Result: 250 iterations instead of 1000 = 4x speedup.

17.9 Concept Relationships

Software Optimization Techniques
├── Relates to: [Optimization Fundamentals](optimization-fundamentals.html) - Overall optimization strategy
├── Builds on: [Hardware Optimization](optimization-hardware.html) - Hardware capabilities that software leverages
├── Feeds into: [Fixed-Point Arithmetic](optimization-fixed-point.html) - Software technique for efficient math
├── Applied in: [Embedded Systems Programming](../prototyping/prototyping-software.html) - Practical firmware optimization
└── Verified by: [Energy-Aware Considerations](energy-aware-considerations.html) - Measuring optimization impact

Key Connections:

  1. Compiler flags (-Os, -O3) directly trade off code size vs speed, complementing hardware selection decisions
  2. SIMD vectorization exploits hardware parallelism (ARM NEON, Intel SSE), bridging software and hardware optimization
  3. Opportunistic sleeping implements power management at the firmware level
  4. Profiling before optimizing connects to testing methodologies for validation

17.10 See Also

17.10.1 Prerequisite Knowledge

17.10.3 Advanced Applications

17.10.4 Real-World Context

  • Tile Firmware Optimization Case Study - Production example of compiler flags and sleep optimization extending battery life
  • ARM NEON Programming Guide - SIMD vectorization documentation for ARM Cortex-A processors
  • GCC Optimization Options - Complete reference for compiler flags: https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html

17.11 Try It Yourself

17.11.1 Experiment 1: Compiler Flag Comparison

Objective: Compare code size and performance impact of different optimization flags.

Setup:

# Install platformio
pip install platformio

# Create test project
pio project init --board esp32dev

Test Code (save as src/main.cpp):

#include <Arduino.h>

volatile int result = 0;

int fibonacci(int n) {
    if (n <= 1) return n;
    return fibonacci(n - 1) + fibonacci(n - 2);
}

void setup() {
    Serial.begin(115200);
}

void loop() {
    unsigned long start = micros();
    result = fibonacci(20);
    unsigned long duration = micros() - start;

    Serial.printf("Fibonacci(20) = %d, Time: %lu us\n", result, duration);
    delay(1000);
}

Build with Different Flags:

# Modify platformio.ini for each test:
# Test 1: No optimization
build_flags = -O0

# Test 2: Size optimization
build_flags = -Os

# Test 3: Speed optimization
build_flags = -O3

What to Observe:

  1. Code Size: Check .pio/build/esp32dev/firmware.elf size with size command
  2. Execution Time: Compare Time: XXX us across three builds
  3. Trade-offs: Record size vs speed for each optimization level

Expected Results:

  • -O0: Largest code (~50 KB), slowest execution (~8000 us)
  • -Os: Smallest code (~35 KB), medium execution (~3000 us)
  • -O3: Medium code (~45 KB), fastest execution (~1500 us)

17.11.2 Experiment 2: Profiling Before Optimizing

Objective: Identify actual bottlenecks using profiling instead of guessing.

Test Code:

#include <Arduino.h>
#include <WiFi.h>

volatile float result = 0;

void processData() {
    // Suspect this is slow
    float sum = 0;
    for (int i = 0; i < 10000; i++) {
        sum += sin(i * 0.01);
    }
    result = sum;  // volatile sink so the loop is not optimized away
}

void transmitData() {
    // Never measured
    WiFi.begin("SSID", "password");
    while (WiFi.status() != WL_CONNECTED) {
        delay(100);
    }
}

void setup() {
    Serial.begin(115200);
}

void loop() {
    uint32_t t0 = micros();
    processData();
    uint32_t t1 = micros();
    transmitData();
    uint32_t t2 = micros();

    Serial.printf("Process: %lu us, Transmit: %lu us\n", t1-t0, t2-t1);
    delay(10000);
}

What to Observe:

  1. Which function actually dominates execution time?
  2. Does your intuition match the profiling results?
  3. Would optimizing processData() actually matter if it’s only 1% of total time?

Key Lesson: The 80/20 rule applies to optimization – 80% of time is spent in 20% of code. Find the 20% first.

Common Pitfalls

Loop unrolling works well for short loops with fixed iteration counts. For loops with hundreds of iterations or variable length, aggressive unrolling bloats code size and may cause instruction cache misses that negate the speed gain. Profile actual performance before and after.

-O3 enables loop unrolling and function cloning that can triple code size. On devices with 64–256 KB flash, this frequently causes a linker failure ("section `.text' will not fit in region `FLASH'"). Use -Os (optimize for size) on flash-constrained targets.

Polling a flag in a tight loop (while (!flag);) keeps the CPU active and consuming full power. Using interrupts and sleeping between events reduces energy consumption by the duty cycle factor, which is often 100–10,000×.

Inlining a 200-instruction function that’s called in 10 places increases code size by ~2,000 instructions. If the function doesn’t fit in instruction cache, the cache miss penalty exceeds the call overhead savings. Reserve inlining for small, performance-critical functions (< 10 instructions).

17.12 What’s Next

If you want to…                                       Read this
Learn fixed-point arithmetic for MCUs without FPU     Fixed-Point Arithmetic
Understand hardware acceleration options              Hardware Optimization
Learn optimization principles and profiling           Optimization Fundamentals
Apply to energy-aware system design                   Energy-Aware Considerations
See practical optimization case studies               Case Studies and Best Practices