Apply SIMD vectorization: Exploit data-level parallelism for array processing
Implement function inlining strategies: Balance call overhead against code size
Design opportunistic sleeping patterns: Maximize battery life through intelligent power management
Profile before optimizing: Use measurement to identify actual bottlenecks
In 60 Seconds
Software optimization on IoT devices starts with the right compiler flag (-Os for smallest flash footprint, -O2 for speed-energy balance), then applies firmware techniques like function inlining, loop unrolling, and SIMD vectorization — always guided by profiling to ensure effort targets the actual bottleneck, not where the developer guesses.
SIMD (Single Instruction, Multiple Data): Processor instructions that operate on multiple data elements simultaneously; for example, one ARM NEON instruction can multiply 4 floats at once, replacing 4 separate scalar instructions
Function Inlining: Replacing a function call with the function body at the call site; eliminates call overhead but increases code size; controlled with __attribute__((always_inline)) or inline keyword
Loop Unrolling: Replicating a loop body multiple times to reduce loop overhead and expose instruction-level parallelism; useful when loop body is simple and iteration count is known
Opportunistic Sleeping: Entering a sleep state between I/O operations or while waiting for peripherals, rather than busy-waiting; critical for energy-efficient embedded firmware
Link-Time Optimization (LTO): Compiler optimization applied across translation units at link time; can eliminate dead code and inline across file boundaries
Dead Code Elimination: Compiler pass that removes unreachable or unused code paths; reduces flash footprint without programmer intervention
For Beginners: Software Optimization Techniques
Energy and power management determines how long your IoT device can operate between battery changes or charges. Think of packing for a camping trip with limited battery packs – every bit of power must be used wisely. Since many IoT sensors need to run for months or years unattended, power management is often the single most important engineering decision.
Sensor Squad: Code That Saves Energy!
“The compiler is your secret weapon,” said Max the Microcontroller. “When you compile code with the -Os flag, it automatically shrinks your program to fit in less flash memory. With -O2, it makes your code run faster. The compiler does hundreds of tricks that would take humans hours to apply manually.”
Sammy the Sensor learned about SIMD: “That stands for Single Instruction, Multiple Data. Instead of processing sensor readings one at a time, SIMD lets Max process four or eight readings simultaneously with one instruction. It is like washing four dishes at once instead of one!”
“My favorite optimization is opportunistic sleeping,” said Bella the Battery. “Whenever Max finishes his work early, instead of busy-waiting for the next task, he goes to sleep immediately. Even sleeping for a few milliseconds adds up over millions of cycles. The formula is simple: sleep whenever there is nothing to do, no matter how brief.” Lila the LED cautioned, “But always profile before optimizing! Measure where time and energy actually go, then optimize the hot spots. Optimizing code that runs 0.1 percent of the time is a waste of YOUR time!”
17.2 Prerequisites
Before diving into this chapter, you should be familiar with:
Real-World Impact: Tile firmware optimization using -Os, opportunistic sleeping, and adaptive advertising extended battery life from 200 days to 810 days (4× improvement). The -Os compiler flag alone saved 42 KB of flash memory (code size: 198 KB → 156 KB), enabling use of a smaller, less expensive microcontroller.
-On (Custom Optimization):
Select very specific optimization passes for hot code paths
Insert assembly-language templates where the compiler falls short
LLVM's pass infrastructure is best suited for this level of control
17.3.2 Code Size Considerations
Figure 17.2: Code Size Impact: Flash Memory Dominates Die Area Compared with the ARM Cortex-M3 Core
CISC Instruction Sets (x86, System/360, PDP-11):
Complex encodings do more work per instruction
Require more complex decode hardware
Compiler support may be limited
Try It: Code Size vs Die Area Explorer
Explore how code size affects chip die area and cost. Flash memory dominates embedded system die area – see how different optimization levels change the balance.
Figure 17.3: SIMD Vectorization: Scalar vs Parallel Array Processing Performance
Key Idea: Operate on multiple data elements with a single instruction (SIMD - Single Instruction, Multiple Data)
Benefits:
Reduces loop iterations by vector width factor (4×, 8×, etc.)
Improves memory bandwidth utilization through wider loads/stores
Exploits data-level parallelism in array operations
Lower energy per operation by amortizing instruction fetch overhead
Important Consideration: Requires tail-handling code when array size is not evenly divisible by vector width. For example, processing 1003 elements with 4-wide SIMD requires 250 vector iterations plus 3 scalar operations for remaining elements.
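The tail-handling pattern can be sketched in plain C. This is an illustrative sketch only: the 4-element block below stands in for a 4-wide SIMD body (a real implementation would use intrinsics such as ARM NEON), and `scale_array` is a made-up helper name.

```c
#include <stddef.h>

/* Sketch: process in blocks of 4 (mimicking a 4-wide SIMD body),
   then handle the leftover elements with a scalar tail loop. */
size_t scale_array(float *out, const float *in, size_t n, float scale) {
    size_t i = 0, vector_iters = 0;
    for (; i + 4 <= n; i += 4) {        /* "vector" loop: 4 elements per pass */
        out[i]     = in[i]     * scale;
        out[i + 1] = in[i + 1] * scale;
        out[i + 2] = in[i + 2] * scale;
        out[i + 3] = in[i + 3] * scale;
        vector_iters++;
    }
    for (; i < n; i++)                  /* scalar tail: handles n % 4 leftovers */
        out[i] = in[i] * scale;
    return vector_iters;
}
```

For n = 1003 this performs 250 four-element passes plus a 3-element scalar tail, matching the arithmetic above.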
Figure 17.7: Parallel execution maximizes hardware utilization. This visualization demonstrates how to pipeline sensor reading, processing, and transmission so they overlap in time, effectively hiding latency and increasing throughput on single-core and multi-core MCUs.
17.4.2 Function Inlining
Advantages:
Eliminates function call overhead (no push/pop of return address)
Avoids branch misprediction penalties
Enables further optimizations like constant propagation and dead code elimination across inlined boundaries
Limitations:
Not all functions are suitable for inlining (recursive functions, functions with complex control flow)
Can cause code size explosion if large functions are inlined at many call sites
May overflow instruction cache, paradoxically reducing performance
Often requires manual hints via inline keyword or __attribute__((always_inline)) compiler directives
Function Inlining Trade-offs
Function Inlining
Function inlining eliminates call overhead but increases code size, requiring careful analysis of which functions benefit from this optimization.
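A minimal sketch of hinting the compiler, using GCC/Clang attribute syntax; `clamp_u8` is a made-up example of the kind of tiny, hot helper that benefits from forced inlining:

```c
/* Small, frequently called helper: a good inlining candidate because the
   body is only a few instructions, so code growth per call site is tiny. */
static inline __attribute__((always_inline))
int clamp_u8(int x) {
    if (x < 0)   return 0;
    if (x > 255) return 255;
    return x;
}

/* Conversely, __attribute__((noinline)) can keep a large, cold function
   out of every call site to protect code size and the instruction cache. */
```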
Try It: Function Inlining Cost-Benefit Calculator
Determine whether inlining a function saves or wastes resources. Adjust the function size, call count, and cache capacity to see when inlining helps vs hurts.
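A back-of-the-envelope version of this cost-benefit check can be sketched as follows. All sizes are in bytes, and the thresholds (I-cache fit, at most ~2x growth) are illustrative heuristics, not a real compiler's rules:

```c
/* Returns 1 if inlining looks profitable: the inlined copies must fit in
   the instruction cache, and total code growth must stay modest. */
int should_inline(int body_bytes, int call_sites,
                  int call_overhead_bytes, int icache_bytes) {
    int without = body_bytes + call_sites * call_overhead_bytes; /* 1 copy + call code */
    int with    = call_sites * body_bytes;                       /* a copy per site */
    if (with > icache_bytes)
        return 0;                   /* risks cache thrashing: never inline */
    return with <= 2 * without;     /* tolerate at most ~2x code growth */
}
```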
17.4.3 Opportunistic Sleeping
Strategy: Transition processor and peripherals to the lowest usable power mode as soon as possible
Analogous to auto start-stop systems in modern vehicles
Hardware interrupts (timers, GPIO, peripherals) wake the processor when events require attention
Critical trade-off: energy savings versus wake-up latency and responsiveness
Even brief sleep periods (milliseconds) accumulate significant energy savings over millions of cycles
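The energy-versus-latency trade-off above can be sketched as a mode-selection rule. The mode names, wake-up latencies, and the 10x margin below are illustrative assumptions, not values from a specific MCU:

```c
/* Sketch: choose the deepest sleep mode whose wake-up latency still
   fits comfortably inside the predicted idle window. */
typedef enum { MODE_RUN, MODE_IDLE, MODE_DEEP_SLEEP } power_mode_t;

power_mode_t pick_sleep_mode(unsigned long idle_us) {
    const unsigned long deep_wakeup_us = 1500; /* assumed deep-sleep wake cost */
    const unsigned long idle_wakeup_us = 10;   /* assumed light-sleep wake cost */
    if (idle_us > 10 * deep_wakeup_us)   /* long gap: worth the wake latency */
        return MODE_DEEP_SLEEP;
    if (idle_us > 10 * idle_wakeup_us)   /* short gap: light sleep only */
        return MODE_IDLE;
    return MODE_RUN;                     /* too brief: sleeping costs more */
}
```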
Minimize Clock Rate Strategy
Figure 17.8: Dynamic voltage and frequency scaling (DVFS) reduces power consumption when full performance is unnecessary. This visualization shows the super-linear relationship between clock frequency and power (power scales approximately with \(f \times V^2\), and voltage must increase with frequency), demonstrating how reducing clock rate from 80 MHz to 20 MHz (25% speed) reduces power by 75% or more for tasks that are not time-critical.
Try It: DVFS Power Savings Calculator
Explore how dynamic voltage and frequency scaling saves power. Power scales as P = C * V^2 * f, and voltage must increase with frequency. See how reducing clock speed affects power, energy, and task completion time.
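The calculator's core arithmetic reduces to one ratio. The 1.2 V / 0.9 V operating points used below are assumed for illustration only:

```c
/* Relative dynamic power under DVFS: P is proportional to V^2 * f, so
   taking a ratio cancels the (unknown) switched capacitance C. */
double power_ratio(double f_mhz, double v, double f0_mhz, double v0) {
    return (v * v * f_mhz) / (v0 * v0 * f0_mhz);
}
```

At 20 MHz / 0.9 V versus 80 MHz / 1.2 V this gives about 0.14, i.e. roughly an 86% power reduction, more than the 75% that frequency scaling alone would deliver, because the voltage drops too.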
Local Computation vs Transmission - Part 1
Figure 17.9: Local computation often saves more energy than transmission. This visualization compares the energy cost of sending raw accelerometer data (150 bytes/sample at 100Hz) versus extracting features locally and transmitting only activity classification results (1 byte/second).
Local Computation vs Transmission - Part 2
Figure 17.10: Deciding between local and remote computation depends on multiple factors. This visualization presents a decision framework considering data rate, model complexity, latency requirements, and available compute resources to determine optimal processing placement.
LTE Data Batching
Figure 17.11: Cellular transmission has high setup overhead (2-3 seconds to establish connection). This visualization shows how batching multiple sensor readings into a single transmission amortizes this overhead, improving energy efficiency by 10-100x for bursty IoT data.
Networking Energy Costs
Figure 17.12: Communication dominates IoT energy budgets. This visualization compares the energy cost per transmitted byte across common wireless protocols, helping engineers choose the most efficient protocol for their data rate and range requirements.
Try It: Data Batching Energy Savings
Explore how batching sensor readings into fewer transmissions saves energy. Cellular and Wi-Fi radios have high connection setup costs – batching amortizes this overhead across many readings.
Figure 17.13: Algorithmic optimization can yield dramatic improvements. This visualization shows a sensor processing pipeline optimized with lookup tables for trigonometric functions, fixed-point arithmetic, and branch-free conditional moves that achieves 10x speedup.
17.5 Knowledge Check
Quiz: Software Optimization
Pitfall: Using printf/Serial.print for Timing-Critical Debug
The Mistake: Adding Serial.print() or printf() statements inside tight loops or ISRs to debug timing-sensitive code, then wondering why the bug disappears when debug output is added, or why the system becomes unreliable during debugging.
Why It Happens: Serial output is extremely slow compared to CPU operations. A single Serial.print("x") on Arduino at 115200 baud takes approximately 87 microseconds (10 bits per character at 115200 bps). In a 1 MHz loop (1 us per iteration), adding one print statement slows execution by 87x and changes all timing relationships. This creates “Heisenbugs” - bugs that disappear when you try to observe them.
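That ~87 µs figure follows directly from the UART frame format (8 data bits + start + stop = 10 bits per character); a quick sketch of the arithmetic:

```c
/* Time to shift one UART character onto the wire, in microseconds. */
double uart_char_us(double baud, int bits_per_frame) {
    return bits_per_frame * 1e6 / baud;
}
```

uart_char_us(115200, 10) evaluates to about 86.8 µs per character, so even a short debug string costs hundreds of microseconds.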
The Fix: Use GPIO pin toggling for timing-critical debugging - set a pin HIGH at function entry, LOW at exit, and observe with an oscilloscope or logic analyzer. For capturing values without affecting timing, write to a circular buffer in RAM and dump after the critical section completes:
// WRONG: Serial output changes timing
void IRAM_ATTR timerISR() {
    Serial.println(micros());  // Takes 500+ us, destroys timing!
    processData();
}

// CORRECT: GPIO toggle for timing debug (oscilloscope required)
#define DEBUG_PIN 2
void IRAM_ATTR timerISR() {
    digitalWrite(DEBUG_PIN, HIGH);  // ~0.1 us
    processData();
    digitalWrite(DEBUG_PIN, LOW);   // ~0.1 us
}

// CORRECT: Buffer capture for value debug
volatile uint32_t debugBuffer[64];
volatile uint8_t debugIndex = 0;
void IRAM_ATTR timerISR() {
    if (debugIndex < 64) {
        debugBuffer[debugIndex++] = ADC_VALUE;  // ~0.2 us
    }
    processData();
}
// Dump buffer in loop() after the critical operation completes
Pitfall: Optimizing Without Profiling (Premature Optimization)
The Mistake: Spending hours hand-optimizing a function that “looks slow” (like a nested loop or floating-point math) without measuring whether it actually impacts overall system performance, while ignoring the true bottleneck.
Why It Happens: Developers have intuitions about what should be slow based on algorithm complexity or “expensive” operations. But modern compilers optimize heavily, and I/O operations (network, storage, peripherals) often dominate execution time. A function with O(n^2) complexity processing 10 items (100 operations) may take 1 microsecond, while a single I2C sensor read takes 500 microseconds.
The Fix: Always profile before optimizing. On ESP32, use micros() to measure elapsed time. On ARM Cortex-M, use the DWT cycle counter (enabled via CoreDebug register). Identify the actual hotspot before optimizing:
// Step 1: Instrument code to find the actual bottleneck
void loop() {
    uint32_t t0 = micros();
    readSensors();      // Suspect this is slow
    uint32_t t1 = micros();
    processData();      // Spent 2 weeks optimizing this
    uint32_t t2 = micros();
    transmitData();     // Never measured
    uint32_t t3 = micros();
    Serial.printf("Read: %lu us, Process: %lu us, Transmit: %lu us\n",
                  t1 - t0, t2 - t1, t3 - t2);
}
// Typical result: Read: 2000 us, Process: 50 us, Transmit: 150000 us
// The "optimized" processData() was 0.03% of total time!
// Real fix: Reduce transmission frequency or payload size

// For ARM Cortex-M cycle-accurate profiling:
CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;  // Enable trace
DWT->CYCCNT = 0;                                 // Reset counter
DWT->CTRL |= DWT_CTRL_CYCCNTENA_Msk;             // Start counting
// ... code to measure ...
uint32_t cycles = DWT->CYCCNT;                   // Read cycle count
Rule of thumb: 80% of execution time comes from 20% of code. Find the 20% first.
Optimization Complications
Figure 17.14: Optimization often introduces unexpected complications. This visualization shows how aggressive inlining can increase code size beyond instruction cache capacity, or how loop unrolling can cause register spilling. Profiling before and after optimization prevents these surprises.
Additional Visual Resources
Parallelism Overview
Vector Operations
Source: CP IoT System Design Guide, Chapter 8 - Design and Prototyping
17.6 Worked Example: Firmware Optimization at Tile (Bluetooth Tracker)
Tile, the Bluetooth tracking device company, faced a critical battery life challenge in 2019 when developing their Tile Pro tracker. The device needed to advertise its BLE beacon continuously for a full year on a single CR2032 coin cell (220 mAh at 3V).
The Problem: Initial firmware consumed an average of 45 microamps, giving only 200 days of operation – well short of the 365-day target required for competitive parity with Apple AirTag.
Step 1: Profile the Power Budget
| Component | Initial Current | % of Total |
|---|---|---|
| BLE advertising (1 Hz) | 18 µA | 40% |
| MCU idle (clock running) | 15 µA | 33% |
| Sensor polling (accelerometer) | 8 µA | 18% |
| Voltage regulator quiescent | 4 µA | 9% |
| Total average | 45 µA | 100% |
Target: 220 mAh / (365 days x 24 hours) = 25 uA average
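The target figure is simple division; as a sketch:

```c
/* Average current budget (µA) for a given battery capacity (mAh)
   and required lifetime (days). */
double target_current_ua(double capacity_mah, double days) {
    return capacity_mah * 1000.0 / (days * 24.0);
}
```

target_current_ua(220, 365) evaluates to about 25.1 µA, so the initial 45 µA draw had to be nearly halved.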
Step 2: Apply Software Optimizations
Opportunistic sleeping – Replaced MCU idle mode (15 uA) with deep sleep between BLE events (0.5 uA). The nRF52 wakes only on RTC interrupt for advertising. Savings: 14.5 uA.
Compiler flag change – Switched from -O2 to -Os for non-critical code paths, reducing flash usage from 198 KB to 156 KB. This allowed disabling one flash bank during sleep, saving 1.2 uA.
Batch accelerometer reads – Instead of polling the accelerometer every 100 ms (continuous SPI bus activity), configured the accelerometer’s internal FIFO to buffer 32 samples and trigger an interrupt. The MCU sleeps during accumulation and reads the entire buffer in one SPI burst. Savings: 6 uA.
Adaptive advertising interval – When accelerometer detects no motion for 30 seconds, advertising interval increases from 1 second to 4 seconds. Stationary devices (95% of the time) use 75% less advertising energy. Savings: 12 uA.
Step 3: Measure Results
| Component | Optimized Current | Savings |
|---|---|---|
| BLE advertising (adaptive) | 6 µA | −12 µA |
| MCU deep sleep | 0.5 µA | −14.5 µA |
| Sensor (batched FIFO) | 2 µA | −6 µA |
| Voltage regulator | 2.8 µA | −1.2 µA |
| Flash (one bank off) | 0 µA | (included above) |
| Total average | 11.3 µA | −33.7 µA |
Outcome: 220 mAh / 11.3 uA = 810 days (2.2 years) – exceeding the 365-day target by 2.2x. No hardware changes were needed; all improvements came from software optimization and compiler configuration. The adaptive advertising technique alone saved more power than all other optimizations combined.
Key Lesson: Profiling revealed that the “obvious” target (BLE radio) was only 40% of the problem. The MCU idle current (33%) was a larger opportunity because deep sleep reduced it by 97%, while BLE advertising could only be reduced by 67% without degrading user experience.
battery_life = {
  const hours = (battery_capacity * 1000) / avg_current;  // capacity in mAh, current in µA
  const days = hours / 24;
  const years = days / 365;
  return {hours, days, years};
}
Matching Quiz: Match Compiler Flags to Effects
Ordering Quiz: Order Software Optimization from Highest to Lowest Impact
Label the Diagram
💻 Code Challenge
Order the Steps
Match the Concepts
17.7 Summary
Software optimization techniques are essential for maximizing IoT device efficiency, battery life, and cost-effectiveness:
Compiler Optimization Flags: Choose -O3 for maximum speed (larger code size), -Os for minimum size (slower execution), or -O2 for balanced trade-off. Flash memory dominates embedded system cost (27.3 mm² vs 0.43 mm² for ARM Cortex-M3), making -Os critical for cost reduction.
SIMD Vectorization: Process 4-16 elements simultaneously with single instructions (ARM NEON, Intel SSE/AVX) achieving 3-10× speedup for array operations with minimal code changes. Requires careful handling of non-divisible array lengths.
Function Inlining: Eliminates call/return overhead and enables cross-function optimization, but risks code size explosion and instruction cache thrashing. Best applied selectively to small, frequently-called functions.
Opportunistic Sleeping: Transition MCU to lowest viable power state immediately after completing work. Deep sleep modes reduce current consumption by 95-99% (e.g., 15 µA active → 0.5 µA sleep). Even millisecond sleep periods accumulate massive energy savings over device lifetime.
Profile Before Optimizing: Always measure actual bottlenecks using profiling tools (micros(), DWT cycle counter, logic analyzer). The 80/20 rule applies: 80% of execution time occurs in 20% of code. Optimizing non-critical code wastes engineering effort without performance gains.
Real-World Validation: Tile Bluetooth tracker case study demonstrates 4× battery life improvement (200 → 810 days) using software optimization alone: opportunistic sleeping saved 14.5 µA, adaptive advertising saved 12 µA, batched sensor reads saved 6 µA, compiler -Os flag saved 1.2 µA.
Key Principle: Apply the right optimization technique to the measured bottleneck, not to code that “looks slow.” Assumptions about performance are often wrong; profiling reveals the truth.
17.8 How It Works: Step-by-Step Breakdown
Understanding how compiler optimizations work helps you make informed choices about optimization flags and techniques.
The Compilation Pipeline:
Source Code Parsing: Compiler reads C/C++ code and builds an Abstract Syntax Tree (AST)
Intermediate Representation (IR): AST converted to platform-independent intermediate code
Optimization Passes: Multiple transformation passes applied based on -O flag
Code Generation: Optimized IR compiled to machine code (assembly)
Linking: Machine code combined with libraries to create final binary
What -Os Actually Does (size optimization):
Original C code:
for (int i = 0; i < 4; i++) {
    array[i] = compute(i);
}
-O0 (no optimization):
- 4 separate compute() function calls
- 4 loop iterations with condition checks
- Result: 80 bytes of code
-Os (size optimization):
- Inlines compute() only if < 10 instructions
- Unrolls loop partially (2x2 instead of 4x1)
- Uses shorter instruction encodings
- Result: 42 bytes of code (47% smaller)
When you enable vectorization, the compiler converts:
// Scalar code (processes 1 value per instruction)
for (int i = 0; i < 1000; i++) {
    output[i] = input[i] * scale;
}
Into:
// Vectorized code (processes 4 values per instruction on ARM NEON)
for (int i = 0; i < 1000; i += 4) {
    // Single SIMD instruction multiplies 4 values simultaneously
    vst1q_f32(&output[i], vmulq_f32(vld1q_f32(&input[i]), vdupq_n_f32(scale)));
}
Result: 250 iterations instead of 1000 = 4x speedup.
17.9 Concept Relationships
Software Optimization Techniques
├── Relates to: [Optimization Fundamentals](optimization-fundamentals.html) - Overall optimization strategy
├── Builds on: [Hardware Optimization](optimization-hardware.html) - Hardware capabilities that software leverages
├── Feeds into: [Fixed-Point Arithmetic](optimization-fixed-point.html) - Software technique for efficient math
├── Applied in: [Embedded Systems Programming](../prototyping/prototyping-software.html) - Practical firmware optimization
└── Verified by: [Energy-Aware Considerations](energy-aware-considerations.html) - Measuring optimization impact
Key Connections:
Compiler flags (-Os, -O3) directly trade off code size vs speed, complementing hardware selection decisions
# Modify platformio.ini for each test:
# Test 1: No optimization
build_flags = -O0
# Test 2: Size optimization
build_flags = -Os
# Test 3: Speed optimization
build_flags = -O3
What to Observe:
Code Size: Check .pio/build/esp32dev/firmware.elf size with size command
Execution Time: Compare Time: XXX us across three builds
Trade-offs: Record size vs speed for each optimization level
Expected Results:
-O0: Largest code (~50 KB), slowest execution (~8000 us)
-Os: Smallest code (~35 KB), medium execution (~3000 us)
-O3: Medium code (~45 KB), fastest execution (~1500 us)
17.11.2 Experiment 2: Profiling Before Optimizing
Objective: Identify actual bottlenecks using profiling instead of guessing.
Test Code:
void processData() {
    // Suspect this is slow
    float sum = 0;
    for (int i = 0; i < 10000; i++) {
        sum += sin(i * 0.01);
    }
}

void transmitData() {
    // Never measured
    WiFi.begin("SSID", "password");
    while (WiFi.status() != WL_CONNECTED) {
        delay(100);
    }
}

void loop() {
    uint32_t t0 = micros();
    processData();
    uint32_t t1 = micros();
    transmitData();
    uint32_t t2 = micros();
    Serial.printf("Process: %lu us, Transmit: %lu us\n", t1 - t0, t2 - t1);
    delay(10000);
}
What to Observe:
Which function actually dominates execution time?
Does your intuition match the profiling results?
Would optimizing processData() actually matter if it’s only 1% of total time?
Key Lesson: The 80/20 rule applies to optimization – 80% of time is spent in 20% of code. Find the 20% first.
Common Pitfalls
1. Applying Loop Unrolling to Long or Variable-Length Loops
Loop unrolling works well for short loops with fixed iteration counts. For loops with hundreds of iterations or variable length, aggressive unrolling bloats code size and may cause instruction cache misses that negate the speed gain. Profile actual performance before and after.
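For the good case, a short body with a fixed iteration count, a 4x unroll looks like this (`sum16` is a made-up example function):

```c
/* 4x manual unroll: 4 loop-overhead checks instead of 16, and the four
   independent loads/adds per pass expose instruction-level parallelism. */
int sum16(const int *a) {
    int s = 0;
    for (int i = 0; i < 16; i += 4)
        s += a[i] + a[i + 1] + a[i + 2] + a[i + 3];
    return s;
}
```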
2. Using -O3 on Memory-Constrained MCUs
-O3 enables loop unrolling and function cloning that can triple code size. On devices with 64–256 KB of flash, this frequently causes a linker failure ("section `.text' will not fit in region `FLASH'"). Use -Os (optimize for size) on flash-constrained targets.
3. Busy-Waiting Instead of Interrupt-Driven I/O
Polling a flag in a tight loop (while (!flag);) keeps the CPU active and consuming full power. Using interrupts and sleeping between events reduces energy consumption by the duty cycle factor, which is often 100–10,000×.
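The duty-cycle factor is easy to estimate. The currents below are illustrative (15 µA in run mode, 0.5 µA in deep sleep, echoing the Tile example):

```c
/* Energy advantage of interrupt-driven sleep over busy-waiting, for a
   task active `active_us` out of every `period_us` microseconds. */
double busy_wait_vs_sleep_ratio(double period_us, double active_us,
                                double run_ua, double sleep_ua) {
    double busy_avg  = run_ua;  /* busy-wait: CPU in run mode all period */
    double sleep_avg = (active_us * run_ua +
                        (period_us - active_us) * sleep_ua) / period_us;
    return busy_avg / sleep_avg;
}
```

For a task active 1 ms out of every second, this gives roughly a 29x reduction in average current; lower sleep currents or shorter duty cycles push the factor far higher.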
4. Inlining Large Functions Called Frequently
Inlining a 200-instruction function that’s called in 10 places increases code size by ~2,000 instructions. If the function doesn’t fit in instruction cache, the cache miss penalty exceeds the call overhead savings. Reserve inlining for small, performance-critical functions (< 10 instructions).