12  Code Offloading & Computing

12.1 Code Offloading and Heterogeneous Computing

This section provides a stable anchor for cross-references to code offloading and heterogeneous computing across the curriculum.

12.2 Learning Objectives

By the end of this chapter, you will be able to:

  • Understand Code Offloading Decisions: Explain when to process locally versus offload to cloud based on energy profiles
  • Calculate Offloading Energy Costs: Compute transmission energy for Wi-Fi vs cellular networks
  • Apply MAUI Framework: Use the MAUI decision framework to make context-aware offloading decisions
  • Leverage Heterogeneous Cores: Match computational tasks to appropriate processors (CPU, GPU, DSP, NPU)
  • Design Energy-Preserving Sensing Plans: Find the cheapest sequence of operations to determine context

In 60 Seconds

Code offloading decides whether an IoT device should compute locally or send data to the cloud; the right choice depends on comparing radio transmission energy (0.1 mJ/KB over Wi-Fi, 1 mJ/KB over cellular) against local computation energy, using frameworks like MAUI to make this decision automatically at runtime.

Key Concepts

  • Code Offloading: Migrating a computation from the IoT device to a remote server (cloud or edge) to reduce local energy consumption
  • MAUI Framework: A system that automatically profiles computation and network state to decide whether local or remote execution is cheaper
  • Transmission Energy: The energy cost of sending data over a radio link; equals data_size × energy_per_byte, which varies by technology and signal strength
  • Heterogeneous Computing: Using multiple specialized processor types (CPU, GPU, DSP, NPU) on one chip, each optimized for different workload characteristics
  • NPU (Neural Processing Unit): A chip accelerator designed for neural network inference, achieving 10–100x better energy efficiency than GPU for AI workloads
  • Energy-Preserving Sensing Plan: A sequence of cheap sensor readings that can infer expensive context without directly measuring it
  • Local vs Cloud Breakeven: The computation complexity threshold at which local processing uses less energy than transmitting data to the cloud

Energy and power management determines how long your IoT device can operate between battery changes or charges. Think of packing for a camping trip with limited battery packs – every bit of power must be used wisely. Since many IoT sensors need to run for months or years unattended, power management is often the single most important engineering decision.

“Sometimes I have a really hard math problem to solve,” said Max the Microcontroller. “I COULD do it myself, but it would take forever and drain Bella’s battery. Or I could send the data to a powerful cloud server and let IT do the math. That is called code offloading.”

Sammy the Sensor asked, “But sending data uses energy too, right?” Max nodded, “Exactly! That is the trade-off. Sending data over Wi-Fi costs about 0.1 millijoules per kilobyte, but over cellular it costs about 10 times more. So sometimes it is cheaper to compute locally, and sometimes it is cheaper to offload. The MAUI framework helps you decide.”

Bella the Battery broke it down simply: “If the computation is small, do it locally. If the computation is huge and you have Wi-Fi, send it to the cloud. If you are on cellular with bad signal, definitely do it locally – transmitting over a weak signal wastes tons of my energy!” Lila the LED added, “Modern chips also have specialized processors – a GPU for graphics, a DSP for audio, an NPU for AI. Using the right processor for each job saves energy too!”

12.3 Prerequisites

Before diving into this chapter, you should be familiar with:

12.4 Energy-Preserving Sensing Plans

Flowchart showing energy-preserving sensing plan with cost-based option selection comparing direct sensing versus inference from cached attributes
Figure 12.1: Energy-Preserving Sensing Plan: Cost-Based Option Selection

The Sensing Planner finds the best sequence of proxy attributes to sense, considering:

  • Direct sensing cost
  • Inference possibilities from cached attributes
  • Confidence of inference rules
  • Overall energy minimization

Example: To determine “InOffice”, the options are:

  1. Sense directly (80 mW)
  2. If “Running=True” is cached, infer “InOffice=False” (0 mW)
  3. If “AtHome=True” is cached, infer “InOffice=False” (0 mW)

Choose the cheapest option!
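The cheapest-option rule can be sketched in a few lines of Python (a minimal sketch; the cache structure, rule format, and function name are assumptions, with the costs taken from the InOffice example above):

```python
def cheapest_sensing_option(cache, direct_cost_mw, inference_rules):
    """Return (method, cost_mW, inferred_value) for the cheapest way to
    obtain the target attribute.

    inference_rules: list of (cached_attr, cached_value, inferred_value),
    read as: if cache[cached_attr] == cached_value, the target can be
    inferred as inferred_value at zero sensing cost.
    """
    # Zero-cost options first: infer from attributes already in the cache.
    for attr, value, inferred in inference_rules:
        if cache.get(attr) == value:
            return (f"infer_from_{attr}", 0, inferred)
    # Fall back to direct sensing at full sensor cost.
    return ("sense_direct", direct_cost_mw, None)

# The "InOffice" example from the text:
rules = [
    ("Running", True, False),  # Running=True cached -> InOffice=False
    ("AtHome", True, False),   # AtHome=True cached  -> InOffice=False
]
print(cheapest_sensing_option({"Running": True}, 80, rules))
# ('infer_from_Running', 0, False)
print(cheapest_sensing_option({}, 80, rules))
# ('sense_direct', 80, None)
```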

12.5 Code Offloading Decisions

Decision tree diagram showing MAUI code offloading framework comparing local execution energy against network transfer plus remote compute energy based on network type and battery state
Figure 12.2: MAUI Code Offloading Decision: Local vs Remote Execution Energy Analysis

12.5.1 Interactive Offloading Energy Calculator

MAUI (Mobile Assistance Using Infrastructure): Framework that profiles code components in terms of energy to decide whether to run locally or remotely.

Considerations:

  • Costs related to transfer of code/data
  • Dynamic decisions based on network constraints
  • Latency requirements
  • Local vs remote execution energy

Example: With 3G, offloading may cost more energy due to high network transmission costs. With Wi-Fi, offloading can save significant energy.
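In place of the interactive calculator, the same comparison can be scripted (a minimal sketch; the power and timing figures are representative assumptions, with a cellular tail-energy term included):

```python
def offload_energy_mj(p_tx_mw, t_tx_s, p_rx_mw, t_rx_s,
                      p_idle_mw, t_cloud_s, tail_mj=0.0):
    """Device-side energy for one offloaded task: sum of E = P * t per phase."""
    return (p_tx_mw * t_tx_s          # radio transmit (upload)
            + p_rx_mw * t_rx_s        # radio receive (result download)
            + p_idle_mw * t_cloud_s   # idle wait while the cloud computes
            + tail_mj)                # cellular tail energy, if any

def local_energy_mj(p_cpu_mw, t_compute_s):
    """Energy to run the task on the device itself."""
    return p_cpu_mw * t_compute_s

# Representative figures (assumptions, not measurements):
e_local = local_energy_mj(50, 2.0)                        # MCU at 50 mW
e_wifi = offload_energy_mj(250, 0.5, 150, 0.02, 20, 0.1)  # Wi-Fi radio
e_cell = offload_energy_mj(800, 1.0, 400, 0.05, 20, 0.1,
                           tail_mj=1000)                  # cellular + tail
print(e_local, e_wifi, e_cell)  # cellular offloading loses badly here
```

With these inputs, local execution beats Wi-Fi offloading slightly and beats cellular offloading by an order of magnitude, mirroring the 3G-vs-Wi-Fi observation above.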

Common Misconception: “Cloud Processing Is Always More Energy Efficient”

The Misconception: “The cloud has powerful servers, so offloading computation always saves energy on my IoT device.”

The Reality: Network transmission energy often exceeds local computation energy, especially on cellular networks. The decision depends on network type, data size, and computation complexity.

Quantified Energy Comparison:

Task: Process 1MB sensor data with ML model (2 seconds computation on local device)

Option 1: Local Processing (ARM Cortex-M4 @ 80 MHz): \[E_{local} = P_{CPU} \times t_{compute} = 50 \text{ mW} \times 2 \text{ sec} = 100 \text{ mJ}\]

Option 2: Wi-Fi Offloading: \[E_{TX} = P_{TX} \times t_{TX} = 250 \text{ mW} \times 0.5 \text{ sec} = 125 \text{ mJ}\] \[E_{RX} = P_{RX} \times t_{RX} = 150 \text{ mW} \times 0.02 \text{ sec} = 3 \text{ mJ}\] \[E_{idle} = P_{idle} \times t_{cloud} = 20 \text{ mW} \times 0.1 \text{ sec} = 2 \text{ mJ}\] \[E_{WiFi} = 125 + 3 + 2 = 130 \text{ mJ (30\% worse than local)}\]

Option 3: LTE Offloading (with RRC tail energy): \[E_{ramp} = 500 \text{ mW} \times 0.5 \text{ sec} = 250 \text{ mJ}\] \[E_{TX} = 800 \text{ mW} \times 1.0 \text{ sec} = 800 \text{ mJ}\] \[E_{RX} = 400 \text{ mW} \times 0.05 \text{ sec} = 20 \text{ mJ}\] \[E_{tail} = 200 \text{ mW} \times 5 \text{ sec} = 1,000 \text{ mJ}\] \[E_{LTE} = 250 + 800 + 20 + 1,000 = 2,070 \text{ mJ (20× worse than local!)}\]

Key insight: LTE tail energy (5-10 sec radio-on after transmission) dominates. For short tasks under 30 seconds, local execution wins. For longer tasks (60+ seconds), offloading can save energy despite the tail penalty.

When Cloud Wins:

Task: Complex ML inference (60 seconds on local device)

  • Local: 50 mW × 60s = 3000 mJ
  • Wi-Fi offload: 130 mJ transmission + negligible remote = 130 mJ total
  • Energy savings: 23× improvement!
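These thresholds drop out of a one-line breakeven calculation (a sketch using the energy figures from this section):

```python
def breakeven_seconds(offload_energy_mj, p_cpu_mw):
    """Local energy grows as P_cpu * t while offload energy is roughly a
    fixed per-transfer cost, so offloading wins once t > E_offload / P_cpu.
    (mJ / mW = seconds)"""
    return offload_energy_mj / p_cpu_mw

P_CPU = 50     # mW, the Cortex-M4 figure used above
E_WIFI = 130   # mJ per transfer over Wi-Fi (TX + RX + idle wait)
E_LTE = 2070   # mJ per transfer over LTE, including ramp and tail energy

print(breakeven_seconds(E_WIFI, P_CPU))  # 2.6 s: Wi-Fi pays off quickly
print(breakeven_seconds(E_LTE, P_CPU))   # 41.4 s: LTE needs a long task
```

The ~41 s LTE breakeven is consistent with the rule of thumb above: under 30 seconds local execution wins, beyond roughly a minute offloading pays off even with the tail penalty.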

Decision Matrix:

Network Data Size Computation Time Recommendation
Wi-Fi <100 KB <5 sec Local (transmission overhead dominates)
Wi-Fi <100 KB >30 sec Offload (computation dominates)
Wi-Fi >1 MB >10 sec Offload (parallel advantage)
LTE Any <30 sec Local (tail energy kills savings)
LTE <500 KB >60 sec Offload (if battery >50%)

MAUI’s Context-Aware Approach:

  • Wi-Fi available + heavy computation → Offload (2-20× savings)
  • LTE only + light computation → Local (avoid 10-15× penalty)
  • Battery <20% → Always local (conserve energy)
  • Latency critical → Offload if Wi-Fi, local if LTE

Key Insight: The 5-10 second LTE “tail energy” (radio staying on after transmission) often consumes more energy than the entire local computation. Context-aware offloading decisions must consider network type, not just raw transmission costs.

12.6 Local Computation: Heterogeneous Cores

Block diagram showing heterogeneous mobile system-on-chip architecture with CPU GPU DSP and NPU cores and task scheduler assigning workloads to appropriate processors
Figure 12.3: Heterogeneous Mobile SoC Architecture: CPU, DSP, GPU, and NPU Task Scheduling

Modern mobile SoCs include heterogeneous cores:

  • CPU: General purpose, control flow
  • GPU: Massively parallel, graphics and compute
  • DSP: Low-power signal processing, audio/sensor data
  • NPU: Neural network acceleration, ML inference

Benefits:

  • Increase performance and power efficiency
  • Selected tasks shift to more efficient cores
  • Dynamic voltage/frequency scaling per core

Example - Keyword Spotting:

  • Optimized GPU is >6x faster than cloud
  • Optimized GPU is >21x faster than sequential CPU
  • Optimized GPU with batching outperforms cloud energy-wise
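Core selection reduces to comparing energy = power × time per processor (a sketch; the per-core power and runtime figures below are illustrative assumptions, not vendor measurements):

```python
def pick_core(profiles):
    """Pick the core with minimum energy, where energy = power_mW * time_s (mJ)."""
    return min(profiles.items(), key=lambda kv: kv[1][0] * kv[1][1])

# Hypothetical per-core profile of one keyword-spotting inference
# (power in mW, runtime in seconds -- illustrative numbers only):
kws_profiles = {
    "CPU": (800, 0.210),   # general purpose, slowest for this workload
    "GPU": (2400, 0.010),  # fast but power-hungry
    "DSP": (40, 0.120),    # low clock, very low power
    "NPU": (280, 0.008),   # purpose-built for NN inference
}
core, (p_mw, t_s) = pick_core(kws_profiles)
print(core, round(p_mw * t_s, 2), "mJ")  # the NPU wins on energy here
```

Note that the fastest core and the most energy-efficient core need not coincide: in this sketch the GPU is nearly as fast as the NPU but burns almost ten times the energy.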

12.7 Knowledge Check: Heterogeneous Computing

12.8 Code Offloading Energy Analysis Worksheet

Scenario: Image processing on wearable device - local vs cloud decision

12.8.1 Step 1: Local Processing Energy

Component Power Duration Energy
Image Capture 80 mA @ 3.7V = 296 mW 100 ms 29.6 mJ
CPU Processing 200 mA @ 3.7V = 740 mW 3000 ms 2,220 mJ
Total Local - - 2,249.6 mJ

12.8.2 Step 2: Cloud Offloading Energy (Wi-Fi)

Component Power Duration Energy
Image Capture 80 mA @ 3.7V = 296 mW 100 ms 29.6 mJ
Wi-Fi TX (upload 50KB) 250 mA @ 3.7V = 925 mW 400 ms 370 mJ
Wi-Fi RX (download 5KB) 150 mA @ 3.7V = 555 mW 50 ms 27.75 mJ
Idle Wait (remote processing) 15 mA @ 3.7V = 55.5 mW 500 ms 27.75 mJ
Total Cloud (Wi-Fi) - - 455.1 mJ

Wi-Fi Decision: Offload (saves 1,794 mJ = 80% energy reduction)

12.8.3 Step 3: Cloud Offloading Energy (LTE)

Component Power Duration Energy
Image Capture 80 mA @ 3.7V = 296 mW 100 ms 29.6 mJ
LTE TX (upload 50KB) 500 mA @ 3.7V = 1,850 mW 800 ms 1,480 mJ
LTE RX (download 5KB) 300 mA @ 3.7V = 1,110 mW 100 ms 111 mJ
RRC State Overhead 200 mA @ 3.7V = 740 mW 2000 ms 1,480 mJ
Total Cloud (LTE) - - 3,100.6 mJ

LTE Decision: Process locally (saves 851 mJ vs LTE offloading)
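The worksheet arithmetic in Steps 1-3 (mA at 3.7 V over some milliseconds, summed to mJ) can be checked with a short script (a sketch reproducing the tables above):

```python
def energy_mj(current_ma, voltage_v, duration_ms):
    """E = I * V * t; mA * V gives mW, and mW * ms / 1000 gives mJ."""
    return current_ma * voltage_v * duration_ms / 1000

# Step 1: local processing (capture + CPU)
e_local = energy_mj(80, 3.7, 100) + energy_mj(200, 3.7, 3000)
# Step 2: Wi-Fi offload (capture + TX + RX + idle wait)
e_wifi = (energy_mj(80, 3.7, 100) + energy_mj(250, 3.7, 400)
          + energy_mj(150, 3.7, 50) + energy_mj(15, 3.7, 500))
# Step 3: LTE offload (capture + TX + RX + RRC overhead)
e_lte = (energy_mj(80, 3.7, 100) + energy_mj(500, 3.7, 800)
         + energy_mj(300, 3.7, 100) + energy_mj(200, 3.7, 2000))

print(round(e_local, 1), round(e_wifi, 1), round(e_lte, 1))
# 2249.6, 455.1 and 3100.6 mJ, matching the three tables
```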

12.8.4 Step 4: MAUI Decision Framework

def maui_decision(wifi_available, energy_local, energy_cloud_wifi,
                  energy_cloud_cellular, battery_pct, latency_critical):
    if wifi_available and energy_cloud_wifi < energy_local:
        return "OFFLOAD_WIFI"
    elif energy_local < energy_cloud_cellular:
        return "PROCESS_LOCAL"
    elif battery_pct > 50 and latency_critical:
        return "OFFLOAD_CELLULAR"
    else:
        return "PROCESS_LOCAL"

12.8.5 Step 5: Context-Aware Adaptation

Context Network Battery Decision Energy Rationale
At Home Wi-Fi 80% Offload 455 mJ Wi-Fi cheap, fast
Outdoors LTE 80% Local 2,250 mJ LTE expensive
Outdoors LTE 15% Local 2,250 mJ Battery critical
At Office Wi-Fi 15% Offload 455 mJ Save battery with Wi-Fi

Your Turn: Calculate offloading decisions for your application!

12.9 Sensor Fusion Energy Optimization Worksheet

12.9.1 Interactive GPS vs Inference Energy Calculator

Scenario: Location tracking using GPS vs Wi-Fi/accelerometer inference

12.9.2 Step 1: Direct GPS Sensing

State Current Duration Energy per Hour
GPS Active 45 mA 30 sec 0.375 mAh
Processing 20 mA 2 sec 0.011 mAh
BLE TX 15 mA 1 sec 0.004 mAh
Sleep 10 µA 27 sec 0.000075 mAh

Per measurement (60 s cycle): 0.390 mAh
Per hour (60 measurements): 23.4 mAh
200 mAh battery life: 8.5 hours

12.9.3 Step 2: ACE Inference Strategy

Use cached GPS + accelerometer for motion detection

Scenario Method Current Duration Frequency
Stationary Cached GPS 10 µA 60 sec 59 min/hour
Moving (inferred) Accel check 0.5 mA 0.5 sec 59 times/hour
Verify Location GPS 45 mA 30 sec 1 time/hour

Energy per hour:

E_stationary = 59 × (10µA × 60s) / 3600 = 0.0098 mAh
E_accel_check = 59 × (0.5mA × 0.5s) / 3600 = 0.0041 mAh
E_gps_verify = 1 × (45mA × 30s + 20mA × 2s) / 3600 = 0.386 mAh
E_total = 0.40 mAh per hour

200mAh battery life: 500 hours = 20.8 days

Energy savings: 58.5× improvement over continuous GPS!
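The duty-cycle arithmetic behind that figure can be reproduced directly (a sketch; mA × s / 3600 converts to mAh, using the durations and currents from the tables above):

```python
def mah(current_ma, seconds, times_per_hour=1):
    """Charge drawn per hour: mA * s / 3600 -> mAh."""
    return current_ma * seconds * times_per_hour / 3600

# Step 1: continuous GPS, one 60 s cycle repeated 60 times per hour
gps_cycle = mah(45, 30) + mah(20, 2) + mah(15, 1) + mah(0.010, 27)
per_hour_gps = gps_cycle * 60

# Step 2: ACE inference -- cached GPS while still, cheap accelerometer
# checks, and one real GPS fix per hour to re-verify location
per_hour_ace = (mah(0.010, 60, 59)    # stationary, cached GPS
                + mah(0.5, 0.5, 59)   # accelerometer checks
                + mah(45, 30)         # hourly GPS verification
                + mah(20, 2))         # processing for that fix

print(round(per_hour_gps, 2), round(per_hour_ace, 3),
      round(per_hour_gps / per_hour_ace, 1))
```

The ratio comes out at roughly 58.5×; tiny differences from the worksheet totals are rounding in the intermediate values.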

12.9.4 Step 3: Association Rules for Inference

ACE learns these rules from history:

Rule Support Confidence Inference
Accel_Still=True → AtHome=True 25% 85% Skip GPS if still
Wi-Fi_SSID=Home → AtHome=True 30% 95% Use Wi-Fi instead of GPS
Time=Night AND Still → Sleeping=True 15% 90% 10× reduce all sampling

Optimized energy with rules:

  • 85% of requests served from cache/inference (0.01 mAh)
  • 15% require GPS sensing (0.39 mAh)
  • Average: 0.85 × 0.01 + 0.15 × 0.39 ≈ 0.067 mAh per request
  • Battery life: 200 mAh / 0.067 mAh ≈ 2,985 hours ≈ 124 days!

12.9.5 Step 4: Battery-Aware Adaptation

Battery Level Strategy GPS Frequency Avg Current
100-50% Normal Every 5 min 0.40 mA
50-20% Conservative Every 15 min 0.15 mA
20-15% Emergency Every 30 min 0.08 mA
<15% Critical Every 60 min 0.04 mA
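The adaptation table maps directly onto a threshold function (a minimal sketch; the function name is an assumption, the thresholds follow the table):

```python
def gps_interval_minutes(battery_pct):
    """Battery-aware adaptation: stretch the GPS interval as charge drops."""
    if battery_pct > 50:
        return 5       # Normal
    elif battery_pct > 20:
        return 15      # Conservative
    elif battery_pct > 15:
        return 30      # Emergency
    else:
        return 60      # Critical

for level in (80, 35, 18, 10):
    print(level, "% ->", gps_interval_minutes(level), "min")
```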

Your Turn: Design inference rules for your sensor fusion application!

12.10 Case Study: Google’s Adaptive Offloading in Pixel Phones

Google’s Pixel phones implement a real-world version of the MAUI framework for computational photography. The “Night Sight” feature requires processing 15-30 images through a multi-frame alignment and HDR+ pipeline – computationally equivalent to approximately 60 seconds of sustained CPU work.

The Offloading Decision in Practice

Condition Processing Location Why
Wi-Fi connected, charging Cloud (Google Photos) Zero energy penalty; cloud produces higher quality result
Wi-Fi connected, battery >50% Hybrid (edge denoise + cloud enhance) Balances quality with battery preservation
Cellular only, any battery Fully local (Tensor NPU) LTE upload of 30 raw images (~150 MB) costs 2,775 mJ vs 1,200 mJ local NPU processing
Airplane mode Fully local (Tensor NPU) No choice; queue cloud processing for later

Measured Energy Comparison

Night Sight processing (15 images, 12MP each):

Local CPU (Cortex-A76):    740 mW x 8.2 sec = 6,068 mJ
Local NPU (Tensor G3):    280 mW x 3.1 sec = 868 mJ   (7x more efficient)
Wi-Fi offload:             925 mW x 2.0 sec (upload) + 55 mW x 1.5 sec (wait)
                           + 555 mW x 0.3 sec (download) = 2,099 mJ
LTE offload:               1,850 mW x 3.5 sec (upload) + 740 mW x 5.0 sec (tail)
                           + 1,110 mW x 0.5 sec (download) = 10,730 mJ

Key insight: The NPU (868 mJ) beats even Wi-Fi offloading (2,099 mJ) for this workload because the data transfer overhead exceeds the computational savings. This contradicts the naive assumption that “cloud is always more energy efficient.” Specialized local hardware has fundamentally changed the offloading calculus – the MAUI framework must account for heterogeneous local processors, not just CPU vs cloud.

When Cloud Still Wins

For Google Photos’ “Magic Eraser” feature (removing objects from images), the ML model requires 3.2 GB of weights that cannot fit on device. Here, offloading is mandatory regardless of energy cost. The decision becomes: offload now (if on Wi-Fi) or defer until Wi-Fi is available (if on cellular).

12.12 Summary

Code offloading and heterogeneous computing are essential for energy-efficient IoT systems:

  1. Energy-Preserving Sensing Plans: Always choose the cheapest method to obtain context - cache, inference, then direct sensing
  2. MAUI Framework: Compare local execution energy against network transmission + idle wait + receive energy
  3. Network-Aware Decisions: Wi-Fi offloading often saves energy; LTE offloading often wastes energy due to tail power
  4. Heterogeneous Cores: Match tasks to appropriate processors - DSP for audio, GPU for parallel, NPU for ML
  5. Context-Aware Adaptation: Adjust offloading decisions based on battery level, network type, and latency requirements

The key insight is that offloading decisions are highly context-dependent. Simple rules like “always offload” or “always local” are suboptimal - intelligent systems adapt to current conditions.

A smart camera needs to classify images (dog/cat/person). Compare local NPU vs Wi-Fi cloud offloading.

Local NPU processing (Google Edge TPU):

  • Inference time: 15 ms
  • Power during inference: 2.5 W
  • Idle power: 0.1 W
  • Energy per classification: 2.5 W × 0.015 s = 37.5 mJ

Wi-Fi cloud offloading:

  • Image size: 200 KB (JPEG compressed)
  • Result size: 1 KB (JSON classification)
  • Wi-Fi upload: 200 KB at 5 Mbps = 320 ms at 250 mW = 80 mJ
  • Wi-Fi download: 1 KB at 10 Mbps = 0.8 ms at 150 mW = 0.12 mJ
  • Idle wait (cloud processing): 50 ms at 20 mW = 1 mJ
  • Total cloud energy: 80 + 0.12 + 1 = 81.2 mJ

Conclusion: Local NPU wins (37.5 mJ vs 81.2 mJ = 54% energy savings). Wi-Fi transmission overhead exceeds local inference cost.

When cloud wins: If classification requires a 500 MB model (won’t fit on device), offloading is mandatory. Or if the device uses an older CPU instead of NPU:

  • CPU inference: 800 mW × 2 seconds = 1,600 mJ
  • Cloud offload: 81.2 mJ (20× more efficient!)

This demonstrates MAUI’s context-aware principle: offloading decision depends on available local hardware AND network conditions.

Task Data Size Computation Network Battery Recommendation Energy Savings
Face detection (embedded NPU) 100 KB 20 ms Wi-Fi Any Local NPU 3× vs cloud
Face detection (CPU only) 100 KB 3 sec Wi-Fi >50% Cloud 5× vs local CPU
Voice recognition (keyword spotting) 5 KB 10 ms Any Any Local DSP 10× vs cloud
Voice recognition (full transcription) 500 KB 5 sec Wi-Fi >30% Cloud 2× vs local
Sensor data ML (simple model) 1 KB 5 ms LTE Any Local 50× vs LTE tail
Video analytics (complex model) 5 MB 10 sec Wi-Fi >70% Cloud Only option (model size)

Decision criteria (MAUI framework):

  1. Model fits on device? → If NO, must offload
  2. Network is Wi-Fi? → If YES, offload heavy computation; if LTE, process locally
  3. Battery <20%? → Always process locally (conserve energy)
  4. Specialized hardware available (NPU/DSP)? → Process locally (10-100× faster)
  5. Latency critical (<100 ms)? → Offload if Wi-Fi, local if LTE (tail latency)
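The five criteria can be folded into one decision routine (a sketch; the argument names, the rule ordering, and the DEFER_UNTIL_WIFI outcome for oversized models on cellular are assumptions drawn from the text):

```python
def offload_decision(model_fits, network, battery_pct,
                     has_npu_or_dsp, computation_heavy):
    # 1. Model too large for the device: offloading is the only option.
    if not model_fits:
        return "OFFLOAD" if network == "wifi" else "DEFER_UNTIL_WIFI"
    # 3. Battery critical: conserve energy, stay local.
    if battery_pct < 20:
        return "LOCAL"
    # 4. Specialized silicon (NPU/DSP) usually beats the radio.
    if has_npu_or_dsp:
        return "LOCAL"
    # 2. Heavy computation: ship it out on Wi-Fi, keep it local on LTE.
    if computation_heavy and network == "wifi":
        return "OFFLOAD"
    return "LOCAL"

print(offload_decision(True, "wifi", 80, True, True))   # LOCAL: NPU wins
print(offload_decision(True, "wifi", 80, False, True))  # OFFLOAD
print(offload_decision(False, "lte", 80, False, True))  # DEFER_UNTIL_WIFI
```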

Best For Your Project:

  • Real-time object detection on drone → Local (edge TPU, latency critical)
  • Batch image classification at home → Cloud (Wi-Fi available, no time pressure)
  • Wake word detection on wearable → Local (DSP ultra-low power)
  • Natural language queries → Cloud (models too large for edge)

Common Mistake: Forgetting GPU Power Consumption When Using Heterogeneous Cores

What they do wrong: Developers optimize a computer vision task to run on mobile GPU, achieving 5× speedup over CPU. They assume battery life improves proportionally: “5× faster means 5× less energy!”

Why it fails: GPUs consume 2-5× more power than CPUs even when delivering speedup. The energy equation is:

Energy = Power × Time

If GPU cuts time by 5× but uses 3× power, energy only improves 1.67× (not 5×).

Real calculation (mobile image processing):

  • CPU: 800 mW × 1,000 ms = 800 mJ
  • GPU: 2,400 mW × 200 ms (5× faster) = 480 mJ
  • Savings: 40% (not 80% as naively expected)

When GPU hurts energy: If the task is small (CPU takes 50 ms), GPU overhead dominates:

  • CPU: 800 mW × 50 ms = 40 mJ
  • GPU: 2,400 mW × 20 ms (2.5× speedup) + 1,200 mW × 15 ms (init) = 48 + 18 = 66 mJ (worse!)
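Both calculations are instances of E = P × t plus a fixed startup cost (a sketch using the same figures):

```python
def task_energy_mj(p_run_mw, t_run_ms, p_init_mw=0, t_init_ms=0):
    """E = P * t for the run phase plus any fixed startup (init) cost."""
    return (p_run_mw * t_run_ms + p_init_mw * t_init_ms) / 1000

cpu_big = task_energy_mj(800, 1000)             # 800 mJ
gpu_big = task_energy_mj(2400, 200)             # 480 mJ: 1.67x savings, not 5x
cpu_small = task_energy_mj(800, 50)             # 40 mJ
gpu_small = task_energy_mj(2400, 20, 1200, 15)  # 66 mJ: init cost dominates

print(gpu_big < cpu_big, gpu_small < cpu_small)  # True False
```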

Correct approach: Profile actual power during execution, not just time. Useful tools:

  • Android Battery Historian
  • iOS Instruments (Energy Log)
  • Embedded: INA219 power monitor on the VDD rail

Real-world example: A fitness app offloaded step counting to mobile GPU, expecting 10× battery improvement from 10× speedup. Actual battery life: 20% worse! GPU consumed 4.2 W during active processing vs 1.8 W for CPU, and the 30 ms task ran every second — GPU initialization overhead (50 ms at 2 W) consumed more energy than the computation saved. Switching back to CPU with NEON SIMD instructions delivered 3× speedup at 1.2× power = 2.5× net energy savings. Lesson: Speedup ≠ energy savings. Always measure power, not just time.

Common Pitfalls

Many engineers offload computation assuming the cloud is “free” energetically. But radio transmission — especially over cellular or at low signal strength — can consume 10–100× more energy than the computation itself. Always compute the breakeven point before deciding.

Routing a small task to the GPU may actually increase energy consumption because GPU initialization (50–100 ms at 2–3 W) exceeds the energy savings from faster execution. Only use GPU/NPU acceleration for tasks that take more than a few hundred milliseconds on the CPU.

Radio energy increases dramatically at low signal strength as the transmitter boosts power. A cellular link at -100 dBm can consume 10× more energy than at -80 dBm. Always measure transmission energy under real deployment signal conditions.

Offloading is not binary (local vs full cloud). Partial offloading — preprocessing on device to reduce data size, then sending compressed results — often provides the best energy tradeoff. Consider edge nodes as intermediate offload targets when cloud latency or transmission cost is too high.
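Partial offloading can be compared with the same energy accounting (a sketch; the compression ratio, preprocessing cost, and the per-kilobyte Wi-Fi rate below are illustrative assumptions, the rate loosely following the worksheet's 370 mJ per 50 KB upload):

```python
def best_strategy(raw_kb, compress_ratio, e_local_full_mj,
                  e_preprocess_mj, mj_per_kb):
    """Compare full-local, full-offload, and partial (preprocess-then-send)."""
    options = {
        "full_local": e_local_full_mj,
        "full_offload": raw_kb * mj_per_kb,
        "partial": e_preprocess_mj + (raw_kb / compress_ratio) * mj_per_kb,
    }
    return min(options, key=options.get), options

# Hypothetical numbers: a 500 KB frame, 10x on-device feature compression
strategy, costs = best_strategy(raw_kb=500, compress_ratio=10,
                                e_local_full_mj=900, e_preprocess_mj=120,
                                mj_per_kb=7.4)
print(strategy, {k: round(v, 1) for k, v in costs.items()})
```

Here partial offloading wins: preprocessing costs some energy, but shrinking the payload tenfold cuts the radio bill far more.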

12.13 What’s Next

If you want to… Read this
Practice with energy optimization worksheets Context Energy Optimization
Go back to duty cycling fundamentals Duty Cycling Fundamentals
Explore context-aware approaches ACE & Shared Context Sensing
Apply strategies to hardware design Hardware & Software Optimisation
Understand energy measurement tools Energy-Aware Measurement