17  Software Optimization Techniques

17.1 Learning Objectives

By the end of this chapter, you will be able to:

  • Select appropriate compiler optimization flags: Choose between -O0, -O2, -O3, and -Os based on requirements
  • Analyze code size considerations: Understand why flash memory dominates embedded system costs
  • Apply SIMD vectorization: Exploit data-level parallelism for array processing
  • Implement function inlining strategies: Balance call overhead against code size
  • Design opportunistic sleeping patterns: Maximize battery life through intelligent power management
  • Profile before optimizing: Use measurement to identify actual bottlenecks

In 60 Seconds

Software optimization on IoT devices starts with the right compiler flag (-Os for smallest flash footprint, -O2 for speed-energy balance), then applies firmware techniques like function inlining, loop unrolling, and SIMD vectorization — always guided by profiling to ensure effort targets the actual bottleneck, not where the developer guesses.

Key Concepts

  • Compiler Optimization Flag: -O0 disables optimization (for debugging), -O1 minimal, -O2 balanced speed/size, -O3 aggressive speed (may increase size), -Os optimizes for minimal code size
  • SIMD (Single Instruction Multiple Data): Processor instructions that operate on multiple data elements simultaneously; ARM NEON can process 4 floats with a single instruction instead of four separate scalar instructions
  • Function Inlining: Replacing a function call with the function body at the call site; eliminates call overhead but increases code size; controlled with __attribute__((always_inline)) or inline keyword
  • Loop Unrolling: Replicating a loop body multiple times to reduce loop overhead and expose instruction-level parallelism; useful when loop body is simple and iteration count is known
  • Opportunistic Sleeping: Entering a sleep state between I/O operations or while waiting for peripherals, rather than busy-waiting; critical for energy-efficient embedded firmware
  • Link-Time Optimization (LTO): Compiler optimization applied across translation units at link time; can eliminate dead code and inline across file boundaries
  • Dead Code Elimination: Compiler pass that removes unreachable or unused code paths; reduces flash footprint without programmer intervention

Energy and power management determines how long your IoT device can operate between battery changes or charges. Think of packing for a camping trip with limited battery packs – every bit of power must be used wisely. Since many IoT sensors need to run for months or years unattended, power management is often the single most important engineering decision.

“The compiler is your secret weapon,” said Max the Microcontroller. “When you compile code with the -Os flag, it automatically shrinks your program to fit in less flash memory. With -O2, it makes your code run faster. The compiler does hundreds of tricks that would take humans hours to apply manually.”

Sammy the Sensor learned about SIMD: “That stands for Single Instruction, Multiple Data. Instead of processing sensor readings one at a time, SIMD lets Max process four or eight readings simultaneously with one instruction. It is like washing four dishes at once instead of one!”

“My favorite optimization is opportunistic sleeping,” said Bella the Battery. “Whenever Max finishes his work early, instead of busy-waiting for the next task, he goes to sleep immediately. Even sleeping for a few milliseconds adds up over millions of cycles. The formula is simple: sleep whenever there is nothing to do, no matter how brief.”

Lila the LED cautioned, “But always profile before optimizing! Measure where time and energy actually go, then optimize the hot spots. Optimizing code that runs 0.1 percent of the time is a waste of YOUR time!”

17.2 Prerequisites

Before diving into this chapter, you should be familiar with:

17.3 Software Optimization

17.3.1 Compiler Optimization Choices

Flowchart showing compiler optimization flag trade-offs: -O0 produces unoptimized code with fastest compile time; -O2 provides balanced optimization; -O3 maximizes speed with aggressive inlining and instruction reordering; -Os minimizes code size with shorter encodings; custom pass selection (e.g., via LLVM) enables targeted optimizations for specific use cases
Figure 17.1: Compiler Optimization Levels: -O0 to -O3 and -Os Trade-offs

-O3 (Optimize for Speed):

  • Aggressively inlines functions to eliminate call overhead
  • May generate significantly larger code size
  • Reorders instructions for better pipeline utilization
  • Selects complex instruction encodings for maximum performance

-Os (Optimize for Size):

  • Penalizes inlining decisions that would grow code
  • Prefers shorter instruction encodings
  • Constrains instruction scheduling to avoid size increases
  • Eliminates fewer branches, trading some speed for size
Interactive Calculator: SIMD Vectorization Speedup

Real-World Impact: Tile firmware optimization using -Os + SIMD + opportunistic sleeping extended battery life from 200 days to 810 days (4× improvement). The -Os compiler flag alone saved 42 KB flash memory (code size: 198 KB → 156 KB), enabling use of a smaller, less expensive microcontroller.

Custom Optimization (beyond the standard -O levels):

  • Applies very specific optimization passes individually
  • Allows inserting assembly-language templates for critical code
  • LLVM's pass infrastructure is best suited to this

17.3.2 Code Size Considerations

Diagram comparing die area requirements: 128-Mbit flash memory occupies 27.3 square millimeters versus ARM Cortex M3 processor core at only 0.43 square millimeters, demonstrating how flash memory dominates chip die area. Shows three code size reduction strategies: dual instruction sets like Thumb/Thumb-2 with 16-bit instructions, CISC instruction sets with complex encodings, and code compression techniques
Figure 17.2: Code Size Impact: Flash Memory Dominates Die Area vs ARM Cortex M3

Does code size matter?

  • 128-Mbit flash = 27.3 mm² at a 0.13 µm process node
  • ARM Cortex-M3 core = 0.43 mm² at the same node
  • Flash memory can dominate die area!

Dual Instruction Sets (Thumb/Thumb-2, ARCompact, microMIPS): 16-bit instructions shrink code but impose constraints such as limited register access and reduced immediate ranges.

CISC Instruction Sets (x86, System/360, PDP-11): complex encodings do more work per instruction, but they require more complex decode hardware, and compiler support may be limited.

Try It: Code Size vs Die Area Explorer

Explore how code size affects chip die area and cost. Flash memory dominates embedded system die area – see how different optimization levels change the balance.

17.4 Advanced Software Optimizations

17.4.1 Vectorization

Diagram comparing scalar versus SIMD vectorization execution: scalar loop processes one array element per instruction requiring 1000 iterations to process 1000 elements; SIMD vectorization with 4-wide vector processing handles four elements simultaneously per instruction requiring only 250 iterations, achieving 4x speedup with same total work but fewer loop iterations and reduced overhead
Figure 17.3: SIMD Vectorization: Scalar vs Parallel Array Processing Performance

Key Idea: Operate on multiple data elements with a single instruction (SIMD - Single Instruction, Multiple Data)

Benefits:

  • Reduces loop iterations by vector width factor (4×, 8×, etc.)
  • Improves memory bandwidth utilization through wider loads/stores
  • Exploits data-level parallelism in array operations
  • Lower energy per operation by amortizing instruction fetch overhead

Important Consideration: Requires tail-handling code when array size is not evenly divisible by vector width. For example, processing 1003 elements with 4-wide SIMD requires 250 vector iterations plus 3 scalar operations for remaining elements.

Code transformation example showing scalar C loop processing array elements one at a time converted to vectorized code using SIMD intrinsics that process four elements per iteration, demonstrating syntax of vector load, vector multiply, and vector store operations with reduced loop iterations
Figure 17.4: Vectorization: Scalar Loop Converted to SIMD Intrinsics
Bar chart comparing execution time between scalar processing taking 1000 cycles and vectorized SIMD processing taking 250 cycles, demonstrating 4x speedup achieved through ARM NEON vector instructions processing four elements simultaneously
Figure 17.5: Vectorization Speedup: 1000 Scalar Cycles vs 250 SIMD Cycles
Side-by-side comparison showing scalar operations processing single data elements sequentially through processor arithmetic logic unit versus vector operations loading multiple data elements into wide SIMD registers and processing them in parallel with single instruction
Figure 17.6: Scalar Operations vs Vector Operations

Diagram showing three parallel computation strategies for IoT systems: instruction-level parallelism with pipelined sensor read overlapped with data transmission, data-level parallelism processing multiple sensor values simultaneously with SIMD instructions, and task-level parallelism distributing separate tasks across multiple cores or hardware accelerators

Exploiting Parallel Computation
Figure 17.7: Parallel execution maximizes hardware utilization. This visualization demonstrates how to pipeline sensor reading, processing, and transmission so they overlap in time, effectively hiding latency and increasing throughput on single-core and multi-core MCUs.

17.4.2 Function Inlining

Advantages:

  • Eliminates function call overhead (no push/pop of return address)
  • Avoids branch misprediction penalties
  • Enables further optimizations like constant propagation and dead code elimination across inlined boundaries

Limitations:

  • Not all functions are suitable for inlining (recursive functions, functions with complex control flow)
  • Can cause code size explosion if large functions are inlined at many call sites
  • May exceed instruction cache capacity, paradoxically reducing performance
  • Often requires manual hints via inline keyword or __attribute__((always_inline)) compiler directives

Diagram comparing function call overhead versus inlined code: left side shows call instruction pushing return address, jumping to function, executing function body, saving and restoring registers, then returning; right side shows inlined version where function body is copied directly at call site eliminating call/return overhead but duplicating code at each call location, increasing binary size

Function Inlining

Function inlining eliminates call overhead but increases code size, requiring careful analysis of which functions benefit from this optimization.

Try It: Function Inlining Cost-Benefit Calculator

Determine whether inlining a function saves or wastes resources. Adjust the function size, call count, and cache capacity to see when inlining helps vs hurts.

17.4.3 Opportunistic Sleeping

Strategy: Transition processor and peripherals to the lowest usable power mode as soon as possible

  • Analogous to auto start-stop systems in modern vehicles
  • Hardware interrupts (timers, GPIO, peripherals) wake the processor when events require attention
  • Critical trade-off: energy savings versus wake-up latency and responsiveness
  • Even brief sleep periods (milliseconds) accumulate significant energy savings over millions of cycles

Graph plotting power consumption versus clock frequency showing non-linear relationship: at 80 MHz baseline power is 100 milliwatts, reducing to 48 MHz drops power to 36 milliwatts, further reduction to 20 MHz brings power down to 25 milliwatts. Demonstrates how power scales super-linearly with frequency due to both dynamic switching power and voltage requirements, allowing 75 percent power savings by reducing clock rate from 80 MHz to 20 MHz while still meeting sensor sampling deadlines

Minimize Clock Rate Strategy
Figure 17.8: Dynamic voltage and frequency scaling (DVFS) reduces power consumption when full performance is unnecessary. This visualization shows the super-linear relationship between clock frequency and power (power scales approximately with \(f \times V^2\), and voltage must increase with frequency), demonstrating how reducing clock rate from 80 MHz to 20 MHz (25% speed) can reduce power by approximately 75% for tasks that are not time-critical.
Try It: DVFS Power Savings Calculator

Explore how dynamic voltage and frequency scaling saves power. Power scales as P = C * V^2 * f, and voltage must increase with frequency. See how reducing clock speed affects power, energy, and task completion time.

Energy cost comparison between two approaches for accelerometer activity tracking: left side shows transmitting raw sensor data requires 150 bytes per sample at 100 Hz sampling rate consuming 1.5 joules per hour via Wi-Fi transmission; right side shows local feature extraction processing data on MCU using 0.05 joules per hour then transmitting only activity classification result of 1 byte per second consuming 0.01 joules per hour, achieving 100x energy savings through edge processing versus cloud offloading

Local Computation vs Transmission - Part 1
Figure 17.9: Local computation often saves more energy than transmission. This visualization compares the energy cost of sending raw accelerometer data (150 bytes/sample at 100Hz) versus extracting features locally and transmitting only activity classification results (1 byte/second).

Decision flowchart for edge versus cloud processing showing four decision criteria: if data rate exceeds 100 KB per second favor local processing to avoid transmission overhead, if model complexity exceeds MCU capabilities favor cloud offloading to leverage server compute power, if latency requirement is under 100 milliseconds favor local processing for real-time response, if battery powered favor local processing to minimize radio energy consumption. Flowchart guides engineers through systematic evaluation of data volume, computational requirements, timing constraints, and power budget to select optimal processing location

Local Computation vs Transmission - Part 2
Figure 17.10: Deciding between local and remote computation depends on multiple factors. This visualization presents a decision framework considering data rate, model complexity, latency requirements, and available compute resources to determine optimal processing placement.

Timeline comparing two LTE transmission strategies: top timeline shows sending each sensor reading individually requiring modem wake-up consuming 2 joules, connection establishment consuming 3 joules, data transmission consuming 0.1 joules, then disconnect for total 5.1 joules per reading; bottom timeline shows batching 100 readings with one-time modem wake-up and connection establishment consuming 5 joules, then transmitting batched payload consuming 10 joules for total 15 joules across 100 readings or 0.15 joules per reading, achieving 34x energy savings by amortizing connection overhead across multiple data points

LTE Data Batching
Figure 17.11: Cellular transmission has high setup overhead (2-3 seconds to establish connection). This visualization shows how batching multiple sensor readings into a single transmission amortizes this overhead, improving energy efficiency by 10-100x for bursty IoT data.

Bar chart comparing energy consumption per transmitted byte across four wireless protocols: Wi-Fi consumes 1000 microjoules per byte providing high throughput at 54 Mbps with 100 meter range, BLE consumes 50 microjoules per byte at 1 Mbps throughput with 50 meter range, Zigbee consumes 30 microjoules per byte at 250 kbps with 100 meter range, LoRa consumes 10 microjoules per byte at 50 kbps with 10 kilometer range. Chart demonstrates inverse relationship between energy efficiency and data rate with long-range protocols achieving lowest energy per byte

Networking Energy Costs
Figure 17.12: Communication dominates IoT energy budgets. This visualization compares the energy cost per transmitted byte across common wireless protocols, helping engineers choose the most efficient protocol for their data rate and range requirements.
Try It: Data Batching Energy Savings

Explore how batching sensor readings into fewer transmissions saves energy. Cellular and Wi-Fi radios have high connection setup costs – batching amortizes this overhead across many readings.

Flowchart comparing naive versus optimized sensor data processing pipeline: naive version uses floating-point sin/cos function calls taking 200 cycles each, floating-point division taking 50 cycles, and conditional branches taking 10 cycles with potential misprediction penalty; optimized version replaces trigonometric functions with 256-entry lookup table requiring 4 cycles per access, uses 16-bit fixed-point integer arithmetic taking 2 cycles per operation, and employs branch-free conditional move instructions taking 1 cycle, achieving total speedup from 1000 cycles per sample to 100 cycles per sample for 10x performance improvement

Optimized Efficient Processing
Figure 17.13: Algorithmic optimization can yield dramatic improvements. This visualization shows a sensor processing pipeline optimized with lookup tables for trigonometric functions, fixed-point arithmetic, and branch-free conditional moves that achieves 10x speedup.

17.5 Knowledge Check

Pitfall: Using printf/Serial.print for Timing-Critical Debug

The Mistake: Adding Serial.print() or printf() statements inside tight loops or ISRs to debug timing-sensitive code, then wondering why the bug disappears when debug output is added, or why the system becomes unreliable during debugging.

Why It Happens: Serial output is extremely slow compared to CPU operations. A single Serial.print("x") on Arduino at 115200 baud takes approximately 87 microseconds (10 bits per character at 115200 bps). In a 1 MHz loop (1 us per iteration), adding one print statement slows execution by 87x and changes all timing relationships. This creates “Heisenbugs” - bugs that disappear when you try to observe them.

The Fix: Use GPIO pin toggling for timing-critical debugging - set a pin HIGH at function entry, LOW at exit, and observe with an oscilloscope or logic analyzer. For capturing values without affecting timing, write to a circular buffer in RAM and dump after the critical section completes:

// WRONG: Serial output changes timing
void IRAM_ATTR timerISR() {
    Serial.println(micros());  // Takes 500+ us, destroys timing!
    processData();
}

// CORRECT: GPIO toggle for timing debug (oscilloscope required)
#define DEBUG_PIN 2
void IRAM_ATTR timerISR() {
    digitalWrite(DEBUG_PIN, HIGH);  // ~0.1 us
    processData();
    digitalWrite(DEBUG_PIN, LOW);   // ~0.1 us
}

// CORRECT: Buffer capture for value debug
volatile uint32_t debugBuffer[64];
volatile uint8_t debugIndex = 0;

void IRAM_ATTR timerISR() {
    if (debugIndex < 64) {
        debugBuffer[debugIndex++] = ADC_VALUE;  // ~0.2 us
    }
    processData();
}
// Dump buffer in loop() after critical operation completes

Pitfall: Optimizing Without Profiling (Premature Optimization)

The Mistake: Spending hours hand-optimizing a function that “looks slow” (like a nested loop or floating-point math) without measuring whether it actually impacts overall system performance, while ignoring the true bottleneck.

Why It Happens: Developers have intuitions about what should be slow based on algorithm complexity or “expensive” operations. But modern compilers optimize heavily, and I/O operations (network, storage, peripherals) often dominate execution time. A function with O(n^2) complexity processing 10 items (100 operations) may take 1 microsecond, while a single I2C sensor read takes 500 microseconds.

The Fix: Always profile before optimizing. On ESP32, use micros() to measure elapsed time. On ARM Cortex-M, use the DWT cycle counter (enabled via CoreDebug register). Identify the actual hotspot before optimizing:

// Step 1: Instrument code to find actual bottleneck
void loop() {
    uint32_t t0 = micros();
    readSensors();           // Suspect this is slow
    uint32_t t1 = micros();
    processData();           // Spent 2 weeks optimizing this
    uint32_t t2 = micros();
    transmitData();          // Never measured
    uint32_t t3 = micros();

    Serial.printf("Read: %lu us, Process: %lu us, Transmit: %lu us\n",
                  t1-t0, t2-t1, t3-t2);
}
// Typical result: Read: 2000 us, Process: 50 us, Transmit: 150000 us
// The "optimized" processData() was 0.03% of total time!
// Real fix: Reduce transmission frequency or payload size

// For ARM Cortex-M cycle-accurate profiling:
CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;  // Enable trace
DWT->CYCCNT = 0;                                   // Reset counter
DWT->CTRL |= DWT_CTRL_CYCCNTENA_Msk;              // Start counting
// ... code to measure ...
uint32_t cycles = DWT->CYCCNT;                    // Read cycles

Rule of thumb: 80% of execution time comes from 20% of code. Find the 20% first.

Flowchart illustrating unintended consequences of aggressive optimization: function inlining increases code size from 4 KB to 32 KB exceeding 16 KB instruction cache capacity causing cache thrashing and 2x slowdown instead of speedup; loop unrolling from 4 iterations to 16 iterations requires 24 registers but processor only has 16 registers causing register spilling to stack memory with 50 percent performance penalty; auto-vectorization introduces alignment requirements causing unaligned memory access faults. Demonstrates importance of measuring optimization impact rather than assuming all optimizations improve performance

Optimization Complications
Figure 17.14: Optimization often introduces unexpected complications. This visualization shows how aggressive inlining can increase code size beyond instruction cache capacity, or how loop unrolling can cause register spilling. Profiling before and after optimization prevents these surprises.

Overview of parallel processing opportunities in IoT applications including pipelined instruction execution, SIMD data-level parallelism for array processing, and multi-threaded task-level parallelism for sensor fusion and concurrent protocol handling

Parallelism Overview

Vector SIMD operation diagram showing four-wide vector register loading four array elements simultaneously, applying arithmetic operation in parallel with single instruction, and storing four results back to memory, contrasted with scalar operation requiring four separate load-compute-store sequences

Vector Operations

Source: CP IoT System Design Guide, Chapter 8 - Design and Prototyping

17.6 Worked Example: Firmware Optimization at Tile (Bluetooth Tracker)

Tile, the Bluetooth tracking device company, faced a critical battery life challenge in 2019 when developing their Tile Pro tracker. The device needed to advertise its BLE beacon continuously for a full year on a single CR2032 coin cell (220 mAh at 3V).

The Problem: Initial firmware consumed an average of 45 microamps, giving only 200 days of operation – well short of the 365-day target required for competitive parity with Apple AirTag.

Step 1: Profile the Power Budget

Component                        Initial Current   % of Total
BLE advertising (1 Hz)           18 uA             40%
MCU idle (clock running)         15 uA             33%
Sensor polling (accelerometer)   8 uA              18%
Voltage regulator quiescent      4 uA              9%
Total average                    45 uA             100%

Target: 220 mAh / (365 days x 24 hours) = 25 uA average

Step 2: Apply Software Optimizations

  1. Opportunistic sleeping – Replaced MCU idle mode (15 uA) with deep sleep between BLE events (0.5 uA). The nRF52 wakes only on RTC interrupt for advertising. Savings: 14.5 uA.

  2. Compiler flag change – Switched from -O2 to -Os for non-critical code paths, reducing flash usage from 198 KB to 156 KB. This allowed disabling one flash bank during sleep, saving 1.2 uA.

  3. Batch accelerometer reads – Instead of polling the accelerometer every 100 ms (continuous SPI bus activity), configured the accelerometer’s internal FIFO to buffer 32 samples and trigger an interrupt. The MCU sleeps during accumulation and reads the entire buffer in one SPI burst. Savings: 6 uA.

  4. Adaptive advertising interval – When accelerometer detects no motion for 30 seconds, advertising interval increases from 1 second to 4 seconds. Stationary devices (95% of the time) use 75% less advertising energy. Savings: 12 uA.

Step 3: Measure Results

Component                     Optimized Current   Savings
BLE advertising (adaptive)    6 uA                -12 uA
MCU deep sleep                0.5 uA              -14.5 uA
Sensor (batched FIFO)         2 uA                -6 uA
Voltage regulator             2.8 uA              -1.2 uA
Flash (one bank off)          0 uA                (included above)
Total average                 11.3 uA             -33.7 uA

Outcome: 220 mAh / 11.3 uA = 810 days (2.2 years) – exceeding the 365-day target by 2.2x. No hardware changes were needed; all improvements came from software optimization and compiler configuration. The adaptive advertising technique alone saved more power than all other optimizations combined.

Key Lesson: Profiling revealed that the “obvious” target (BLE radio) was only 40% of the problem. The MCU idle current (33%) was a larger opportunity because deep sleep reduced it by 97%, while BLE advertising could only be reduced by 67% without degrading user experience.

Interactive Calculator: Battery Life Estimation

17.7 Summary

Software optimization techniques are essential for maximizing IoT device efficiency, battery life, and cost-effectiveness:

  1. Compiler Optimization Flags: Choose -O3 for maximum speed (larger code size), -Os for minimum size (slower execution), or -O2 for balanced trade-off. Flash memory dominates embedded system cost (27.3 mm² vs 0.43 mm² for ARM Cortex-M3), making -Os critical for cost reduction.

  2. SIMD Vectorization: Process 4-16 elements simultaneously with single instructions (ARM NEON, Intel SSE/AVX) achieving 3-10× speedup for array operations with minimal code changes. Requires careful handling of non-divisible array lengths.

  3. Function Inlining: Eliminates call/return overhead and enables cross-function optimization, but risks code size explosion and instruction cache thrashing. Best applied selectively to small, frequently-called functions.

  4. Opportunistic Sleeping: Transition MCU to lowest viable power state immediately after completing work. Deep sleep modes reduce current consumption by 95-99% (e.g., 15 µA active → 0.5 µA sleep). Even millisecond sleep periods accumulate massive energy savings over device lifetime.

  5. Profile Before Optimizing: Always measure actual bottlenecks using profiling tools (micros(), DWT cycle counter, logic analyzer). The 80/20 rule applies: 80% of execution time occurs in 20% of code. Optimizing non-critical code wastes engineering effort without performance gains.

  6. Real-World Validation: Tile Bluetooth tracker case study demonstrates 4× battery life improvement (200 → 810 days) using software optimization alone: opportunistic sleeping saved 14.5 µA, adaptive advertising saved 12 µA, batched sensor reads saved 6 µA, compiler -Os flag saved 1.2 µA.

Key Principle: Apply the right optimization technique to the measured bottleneck, not to code that “looks slow.” Assumptions about performance are often wrong; profiling reveals the truth.

17.8 How It Works: Step-by-Step Breakdown

Understanding how compiler optimizations work helps you make informed choices about optimization flags and techniques.

The Compilation Pipeline:

  1. Source Code Parsing: Compiler reads C/C++ code and builds an Abstract Syntax Tree (AST)
  2. Intermediate Representation (IR): AST converted to platform-independent intermediate code
  3. Optimization Passes: Multiple transformation passes applied based on -O flag
  4. Code Generation: Optimized IR compiled to machine code (assembly)
  5. Linking: Machine code combined with libraries to create final binary

What -Os Actually Does (size optimization):

Original C code:
for (int i = 0; i < 4; i++) {
    array[i] = compute(i);
}

-O0 (no optimization):
- 4 separate compute() function calls
- 4 loop iterations with condition checks
- Result: 80 bytes of code

-Os (size optimization):
- Inlines compute() only if < 10 instructions
- Unrolls loop partially (2x2 instead of 4x1)
- Uses shorter instruction encodings
- Result: 42 bytes of code (47% smaller)

What -O3 Actually Does (speed optimization):

-O3 (speed optimization):
- Fully inlines compute() (eliminates call overhead)
- Fully unrolls loop (no loop overhead)
- Reorders instructions for pipelining
- Result: 140 bytes of code (75% larger, 3.5x faster)

SIMD Vectorization Mechanics:

When you enable vectorization, the compiler converts:

// Scalar code (processes 1 value per instruction)
for (int i = 0; i < 1000; i++) {
    output[i] = input[i] * scale;
}

Into:

// Vectorized code (processes 4 values per instruction on ARM NEON)
for (int i = 0; i < 1000; i += 4) {
    // Single SIMD instruction multiplies 4 values simultaneously
    vst1q_f32(&output[i], vmulq_f32(vld1q_f32(&input[i]), vdupq_n_f32(scale)));
}

Result: 250 iterations instead of 1000 = 4x speedup.

17.9 Concept Relationships

Software Optimization Techniques
├── Relates to: [Optimization Fundamentals](optimization-fundamentals.html) - Overall optimization strategy
├── Builds on: [Hardware Optimization](optimization-hardware.html) - Hardware capabilities that software leverages
├── Feeds into: [Fixed-Point Arithmetic](optimization-fixed-point.html) - Software technique for efficient math
├── Applied in: [Embedded Systems Programming](../prototyping/prototyping-software.html) - Practical firmware optimization
└── Verified by: [Energy-Aware Considerations](energy-aware-considerations.html) - Measuring optimization impact

Key Connections:

  1. Compiler flags (-Os, -O3) directly trade off code size vs speed, complementing hardware selection decisions
  2. SIMD vectorization exploits hardware parallelism (ARM NEON, Intel SSE), bridging software and hardware optimization
  3. Opportunistic sleeping implements power management at the firmware level
  4. Profiling before optimizing connects to testing methodologies for validation

17.10 See Also

17.10.1 Prerequisite Knowledge

17.10.3 Advanced Applications

17.10.4 Real-World Context

  • Tile Firmware Optimization Case Study - Production example of compiler flags and sleep optimization extending battery life
  • ARM NEON Programming Guide - SIMD vectorization documentation for ARM Cortex-A processors
  • GCC Optimization Options - Complete reference for compiler flags: https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html

17.11 Try It Yourself

17.11.1 Experiment 1: Compiler Flag Comparison

Objective: Compare code size and performance impact of different optimization flags.

Setup:

# Install platformio
pip install platformio

# Create test project
pio project init --board esp32dev

Test Code (save as src/main.cpp):

#include <Arduino.h>

volatile int result = 0;

int fibonacci(int n) {
    if (n <= 1) return n;
    return fibonacci(n - 1) + fibonacci(n - 2);
}

void setup() {
    Serial.begin(115200);
}

void loop() {
    unsigned long start = micros();
    result = fibonacci(20);
    unsigned long duration = micros() - start;

    Serial.printf("Fibonacci(20) = %d, Time: %lu us\n", result, duration);
    delay(1000);
}

Build with Different Flags:

# Modify platformio.ini for each test:
# Test 1: No optimization
build_flags = -O0

# Test 2: Size optimization
build_flags = -Os

# Test 3: Speed optimization
build_flags = -O3

What to Observe:

  1. Code Size: Check .pio/build/esp32dev/firmware.elf size with size command
  2. Execution Time: Compare Time: XXX us across three builds
  3. Trade-offs: Record size vs speed for each optimization level

Expected Results:

  • -O0: Largest code (~50 KB), slowest execution (~8000 us)
  • -Os: Smallest code (~35 KB), medium execution (~3000 us)
  • -O3: Medium code (~45 KB), fastest execution (~1500 us)

17.11.2 Experiment 2: Profiling Before Optimizing

Objective: Identify actual bottlenecks using profiling instead of guessing.

Test Code:

#include <Arduino.h>
#include <WiFi.h>

volatile float result = 0;

void processData() {
    // Suspect this is slow
    float sum = 0;
    for (int i = 0; i < 10000; i++) {
        sum += sin(i * 0.01);
    }
    result = sum;  // volatile sink so the loop is not optimized away
}

void transmitData() {
    // Never measured
    WiFi.begin("SSID", "password");
    while (WiFi.status() != WL_CONNECTED) {
        delay(100);
    }
}

void setup() {
    Serial.begin(115200);
}

void loop() {
    uint32_t t0 = micros();
    processData();
    uint32_t t1 = micros();
    transmitData();
    uint32_t t2 = micros();

    Serial.printf("Process: %lu us, Transmit: %lu us\n", t1-t0, t2-t1);
    delay(10000);
}

What to Observe:

  1. Which function actually dominates execution time?
  2. Does your intuition match the profiling results?
  3. Would optimizing processData() actually matter if it’s only 1% of total time?

Key Lesson: The 80/20 rule applies to optimization – 80% of time is spent in 20% of code. Find the 20% first.

Common Pitfalls

Loop unrolling works well for short loops with fixed iteration counts. For loops with hundreds of iterations or variable length, aggressive unrolling bloats code size and may cause instruction cache misses that negate the speed gain. Profile actual performance before and after.

-O3 enables loop unrolling and function cloning that can triple code size. On devices with 64–256 KB flash, this frequently causes a linker failure ("section `.text' will not fit in region `FLASH'"). Use -Os (optimize for size) on flash-constrained targets.

Polling a flag in a tight loop (while (!flag);) keeps the CPU active and consuming full power. Using interrupts and sleeping between events reduces energy consumption by the duty cycle factor, which is often 100–10,000×.

Inlining a 200-instruction function that’s called in 10 places increases code size by ~2,000 instructions. If the function doesn’t fit in instruction cache, the cache miss penalty exceeds the call overhead savings. Reserve inlining for small, performance-critical functions (< 10 instructions).

17.12 What’s Next

If you want to…                                       Read this
Learn fixed-point arithmetic for MCUs without FPU     Fixed-Point Arithmetic
Understand hardware acceleration options              Hardware Optimization
Learn optimization principles and profiling           Optimization Fundamentals
Apply to energy-aware system design                   Energy-Aware Considerations
See practical optimization case studies               Case Studies and Best Practices