18 Fixed-Point Arithmetic
18.1 Learning Objectives
By the end of this chapter, you will be able to:
- Understand Qn.m format: Define and work with fixed-point number representations
- Convert floating-point to fixed-point: Transform algorithms from float to efficient integer operations
- Select appropriate n and m values: Choose bit allocation based on range and precision requirements
- Implement fixed-point operations: Perform multiplication, division, and other arithmetic correctly
- Evaluate trade-offs: Compare precision loss against performance and power savings
For Beginners: What is Fixed-Point Arithmetic?
Fixed-point arithmetic is a clever way to do decimal math using only whole numbers (integers). Instead of storing 25.3 directly, we multiply it by a fixed number (like 256) and store 6477 as an integer. When we need the decimal back, we divide by 256. The trick is that all the math happens using fast integer operations, which saves lots of battery power on small devices. Think of it like measuring distances in millimeters instead of meters—you use whole numbers but still capture the precision you need.
Sensor Squad: Math Without Decimals!
“Most small microcontrollers do not have hardware for decimal point math,” said Max the Microcontroller. “Floating-point operations like 3.14 times 2.7 take many clock cycles in software, which wastes time and energy. Fixed-point arithmetic is a clever trick to do decimal math using only integers!”
Sammy the Sensor asked, “How do you represent 25.3 degrees without a decimal point?” Max explained, “You multiply by a power of two! In Q8.8 format, you store 25.3 as 25.3 times 256, which rounds to the integer 6,477. To the processor, it is just 6477. When you need the real value, divide by 256. All the math happens with fast integer operations!”
Bella the Battery appreciated the energy savings: “Fixed-point math can be 10 to 100 times faster than floating-point on a small microcontroller without an FPU. Faster math means less time awake, which means I last longer! The trade-off is that you lose a tiny bit of precision, but for most sensor readings, that is perfectly acceptable.” Lila the LED added, “Choose your Q format carefully – more fractional bits give more precision, but fewer integer bits mean a smaller range. It is all about balance!”
18.2 Prerequisites
Before diving into this chapter, you should be familiar with:
- Software Optimization: Understanding compiler optimizations and code efficiency
- Optimization Fundamentals: Understanding why optimization matters for IoT
- Binary number representation: Familiarity with how integers are stored in binary
18.3 Fixed-Point Arithmetic
18.3.1 Why Fixed-Point?
Most signal processing and machine learning algorithms are initially developed in high-level environments like MATLAB or Python using floating-point arithmetic. This approach simplifies development and testing, but presents challenges when deploying to resource-constrained IoT devices:
- Hardware cost: Floating-point processors or FPU hardware are expensive and power-hungry
- Common alternative: Fixed-point processors dominate the embedded systems market
- Deployment workflow: After design and test in floating-point, convert to fixed-point representation
- Target platforms: Deploy onto fixed-point processor or ASIC for production
18.3.2 Qn.m Format
Qn.m Format: A fixed positional number system:
- n bits to the left of the binary point (including the sign bit)
- m bits to the right of the binary point
- Total bits: n + m
Let’s examine a concrete example to understand how the Qn.m format works in practice.
Example - Q5.4 (9 bits total):
- 5 bits for the integer part (sign bit + 4 magnitude bits)
- 4 bits for the fraction
- Range: -16.0 to +15.9375
- Resolution: 1/16 = 0.0625
This format can represent temperatures from -16°C to +15.9375°C with a resolution of 0.0625°C—suitable for many indoor temperature sensing applications.
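A minimal sketch of such a conversion in C, assuming round-to-nearest; the helper names `float_to_q54` and `q54_to_float` are illustrative, not from a standard library:

```c
#include <stdint.h>

// Convert a float to Q5.4 (9 significant bits, held in an int16_t) and back.
// The scale factor is 2^4 = 16, so one count represents 0.0625.
static inline int16_t float_to_q54(float x) {
    // Add half an LSB (with the sign of x) so the cast rounds to nearest
    return (int16_t)(x * 16.0f + (x >= 0 ? 0.5f : -0.5f));
}

static inline float q54_to_float(int16_t q) {
    return (float)q / 16.0f;
}
```

For example, 12.5°C becomes the integer 200 (12.5 × 16), and converting 200 back yields exactly 12.5 because the value is a multiple of the 0.0625 resolution.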
18.3.3 Conversion to Qn.m
Converting from floating-point to fixed-point requires careful analysis of your data:
- Define total number of bits (e.g., 9 bits)
- Fix the location of the binary point based on the value range
- Determine n and m based on required range and precision
Range Determination:
- Run simulations for all input sets
- Observe ranges of values for all variables
- Note minimum + maximum value each variable sees
- Determine Qn.m format to cover range
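The range-determination steps above can be sketched as a small helper that maps the largest observed magnitude to the required number of integer bits; `integer_bits_needed` is a hypothetical name, assuming a signed two's-complement format:

```c
// Given the largest magnitude a variable reaches in simulation, return the
// number of integer bits n (sign bit included) a signed Qn.m format needs.
// The covered range for n integer bits is [-2^(n-1), +2^(n-1)).
static int integer_bits_needed(double max_abs) {
    int n = 1;  // start with just the sign bit
    while ((double)(1 << (n - 1)) < max_abs) {
        n++;    // double the covered range until it contains max_abs
    }
    return n;
}
```

With a temperature range of -40 to +85, this returns 8 integer bits, matching the Q8.8 choice used in the worked example later in the chapter; the remaining bits of the word go to the fraction.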
18.3.3.1 Interactive Qn.m Calculator
18.3.4 Fixed-Point Operations
Addition and Subtraction: Straightforward when formats match

```c
// Q15 + Q15 = Q15 (same format)
int16_t a = 16384;  // 0.5 in Q15
int16_t b = 8192;   // 0.25 in Q15
int16_t c = a + b;  // 24576 = 0.75 in Q15
```

Multiplication: Requires renormalization
```c
// Q15 x Q15 = Q30, must shift back to Q15
int16_t a = 16384;              // 0.5 in Q15
int16_t b = 16384;              // 0.5 in Q15
int32_t temp = (int32_t)a * b;  // 268435456 in Q30
int16_t result = temp >> 15;    // 8192 = 0.25 in Q15
```

18.3.4.1 Interactive Fixed-Point Multiplication
Putting Numbers to It
Why does Q15 multiplication need a right shift by 15? Because multiplying two Q15 numbers produces Q30:
Q15 format: 1 sign bit + 15 fractional bits, where 1.0 = \(2^{15}\) = 32,768
Multiplying 0.5 × 0.5 should give 0.25:
- \(a = 0.5 \times 32768 = 16384\) (Q15)
- \(b = 0.5 \times 32768 = 16384\) (Q15)
- \(a \times b = 16384 \times 16384 = 268{,}435{,}456\) (Q30, because \(2^{15} \times 2^{15} = 2^{30}\))
To convert Q30 back to Q15, divide by \(2^{15}\):
\[\text{result} = \frac{268,435,456}{32768} = 8192\]
Verify: \(\frac{8192}{32768} = 0.25\) ✓
On MCUs without hardware dividers, >> 15 (right shift) is a single-cycle operation vs 12-40 cycles for software division—critical for real-time DSP running at 100+ kHz sampling rates.
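One refinement worth knowing: a plain `>> 15` truncates, which always rounds toward negative infinity. A common alternative, sketched here under the same Q15 assumptions (function names are illustrative), adds half an LSB before the shift to round to nearest:

```c
#include <stdint.h>

// Q15 multiply with truncation (rounds toward negative infinity)
static inline int16_t q15_mul_trunc(int16_t a, int16_t b) {
    return (int16_t)(((int32_t)a * b) >> 15);
}

// Q15 multiply with round-to-nearest: add half an LSB (1 << 14) before shifting
static inline int16_t q15_mul_round(int16_t a, int16_t b) {
    return (int16_t)((((int32_t)a * b) + (1 << 14)) >> 15);
}
```

For 0.1 × 0.1 (3277 × 3277 in Q15), truncation gives 327 while rounding gives 328; the true answer 0.01 scales to 327.68, so rounding is closer. In long filter chains this one-LSB difference accumulates, which is why DSP libraries typically round.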
Division: Often avoided or replaced with multiplication by the reciprocal

```c
// Divide by 10 using multiply by 0.1 (avoid slow division)
// 0.1 in Q15 = round(0.1 * 32768) = 3277
int16_t x = 32767;                 // ~1.0 in Q15
int32_t temp = (int32_t)x * 3277;  // multiply by 0.1
int16_t result = temp >> 15;       // 3276, ~0.1 in Q15
```

18.4 Knowledge Check
Understanding Check: Fixed-Point vs Floating-Point for Edge ML
Scenario: Your edge AI camera runs object detection (MobileNet) on images. Baseline uses float32 (32-bit floating-point), consuming 150mW during inference and requiring 8 MB RAM (model + activations). Latency: 2 seconds/frame on ARM Cortex-M7 @ 200 MHz. You’re evaluating int8 quantization (8-bit fixed-point), which the datasheet claims provides 4x speedup and 4x memory reduction with <2% accuracy loss.
Think about:
- What are the new latency, RAM, and power metrics with int8 quantization?
- Does 4x speedup enable real-time processing (30 fps = 33ms/frame)?
- What additional considerations affect the decision beyond performance metrics?
Key Insight: Int8 quantization benefits:
- Latency: 2 s / 4 = 500 ms/frame (still far from the 33 ms needed for 30 fps real-time, but 4x better)
- RAM: 8 MB / 4 = 2 MB (fits devices with 4 MB SRAM instead of requiring 16 MB)
- Power: 150 mW / 4 = 37.5 mW (longer battery life, or a smaller battery)

Verdict: Use int8 quantization. While 500 ms still is not real-time, it is “good enough” for many applications (e.g., doorbell detection every 0.5 seconds is fine). The 2 MB footprint enables deployment on cheaper hardware (a $5 MCU with 4 MB RAM vs a $15 MCU with 16 MB RAM), and the <2% accuracy loss is acceptable for most object detection tasks. Non-performance consideration: some MCUs have hardware int8 accelerators (e.g., ARM Cortex-M55 with the Helium vector extension) that provide 10-20x additional speedup, potentially reaching real-time. The chapter’s lesson applies: floating-point hardware is expensive, while fixed-point processors are common in embedded systems; for edge AI, int8 fixed-point is the standard.
18.5 Worked Example: Converting a Temperature Filter to Fixed-Point
A common IoT task is applying an exponential moving average (EMA) filter to smooth noisy temperature readings. Here is the conversion from floating-point to fixed-point, step by step.
Floating-Point Version (runs on ESP32 with FPU):
alpha = 0.1
filtered = alpha * new_reading + (1 - alpha) * filtered
For a reading of 25.7C with previous filtered value of 25.3C: filtered = 0.1 * 25.7 + 0.9 * 25.3 = 2.57 + 22.77 = 25.34
Step 1: Determine Value Ranges
| Variable | Min Value | Max Value | Required Precision |
|---|---|---|---|
| temperature | -40.0C | +85.0C | 0.1C |
| alpha | 0.0 | 1.0 | 0.01 |
| filtered | -40.0C | +85.0C | 0.1C |
Step 2: Select Q Format
Temperature range is -40 to +85, needing 8 integer bits (range -128 to +127). For 0.1C precision, we need at least 4 fractional bits (resolution 0.0625C). Choose Q8.8 (16-bit total): range -128 to +127.996, resolution 0.00390625C.
Alpha is 0 to 1, needing only 1 integer bit. Choose Q1.15 for maximum precision.
Step 3: Convert Constants
alpha_q15 = round(0.1 * 32768) = 3277
one_minus_alpha_q15 = 32768 - 3277 = 29491
temperature 25.7C in Q8.8 = round(25.7 * 256) = 6579
filtered 25.3C in Q8.8 = round(25.3 * 256) = 6477
Step 4: Fixed-Point Calculation
```c
// All operations are integer-only -- no FPU needed
int16_t temp_q88 = 6579;   // 25.7C
int16_t filt_q88 = 6477;   // 25.3C
int16_t alpha_q15 = 3277;  // 0.1

// term1 = alpha * new_reading (Q15 * Q8.8 = Q23.8, shift right 15 to get Q8.8)
int32_t term1 = (int32_t)alpha_q15 * temp_q88;  // = 21,559,383
int16_t term1_q88 = term1 >> 15;                // = 657 (2.566C in Q8.8)

// term2 = (1 - alpha) * filtered
int32_t term2 = (int32_t)(32768 - alpha_q15) * filt_q88;  // = 191,013,207
int16_t term2_q88 = term2 >> 15;                          // = 5829 (22.77C in Q8.8)

// result
int16_t result_q88 = term1_q88 + term2_q88;  // = 6486
// Convert back: 6486 / 256 = 25.336C
```

Result: 25.336°C vs floating-point 25.34°C, a negligible error (about 0.004°C), but it runs 10-50x faster on MCUs without FPU hardware.
18.6 When to Use Fixed-Point: Decision Framework
Not every IoT application benefits from fixed-point conversion. Use this decision tree:
| Question | If Yes | If No |
|---|---|---|
| Does your MCU have a hardware FPU? (ESP32, STM32F4+) | Float is fine for most tasks | Fixed-point gives 10-50x speedup |
| Is the computation in a tight loop? (DSP filter, ML inference) | Fixed-point saves significant energy | Float overhead is negligible |
| Do you need >6 decimal digits of precision? | Stay with float32 or use Q1.31 | Q8.8 or Q16.16 is sufficient |
| Are you deploying ML models? (object detection, anomaly detection) | int8 quantization is industry standard | Float is acceptable for development |
| Battery life target >2 years on coin cell? | Every instruction counts – use fixed-point | Float is acceptable |
Real-World Example – Google’s TensorFlow Lite for Microcontrollers: Google’s TFLite Micro framework uses int8 quantization as the default for edge ML inference. Their person detection model on ARM Cortex-M4 (no FPU) benchmarks show: float32 inference takes 22 seconds per frame, while int8 takes 580ms – a 38x speedup. Memory drops from 300 KB to 75 KB, enabling deployment on a $2 MCU with 256 KB SRAM instead of a $15 MCU with 1 MB SRAM. The accuracy loss is under 1% for most classification tasks.
Real-World Example – Hearing Aids: Modern hearing aids like the Oticon More use 16-bit fixed-point DSP running at 10 MHz to process audio in real-time. The entire signal processing chain – noise reduction, feedback cancellation, directional beamforming – runs on a zinc-air battery (1.4V, 310 mAh) for 60+ hours. Floating-point DSP would require 5-10x more power, reducing battery life to under 12 hours – unacceptable for a device worn 16 hours per day.
18.7 Common Fixed-Point Formats
| Format | Total Bits | Integer | Fractional | Range | Resolution | Use Case |
|---|---|---|---|---|---|---|
| Q1.7 | 8 | 1 | 7 | -1 to +0.992 | 0.0078 | Audio samples |
| Q1.15 (Q15) | 16 | 1 | 15 | -1 to +0.999 | 0.000031 | DSP, audio |
| Q8.8 | 16 | 8 | 8 | -128 to +127.996 | 0.0039 | Sensor data |
| Q16.16 | 32 | 16 | 16 | -32768 to +32767.999 | 0.000015 | GPS coordinates |
| Q1.31 (Q31) | 32 | 1 | 31 | -1 to +0.9999999995 | 4.7e-10 | High-precision DSP |
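For moving between the formats in the table, a pair of generic helpers parameterized on the fractional bit count m can be sketched; the macro names are illustrative, and round-to-nearest is assumed:

```c
#include <stdint.h>

// Generic float <-> fixed conversion for any m fractional bits (sketch).
// One fixed-point count represents 2^-m; rounding is to nearest.
#define FLOAT_TO_FIX(x, m) \
    ((int32_t)((x) * (float)(1 << (m)) + ((x) >= 0 ? 0.5f : -0.5f)))
#define FIX_TO_FLOAT(v, m) ((float)(v) / (float)(1 << (m)))
```

For instance, `FLOAT_TO_FIX(25.7f, 8)` yields 6579 (the Q8.8 temperature used in the worked example) and `FLOAT_TO_FIX(0.5f, 15)` yields 16384 (Q15).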
18.8 Implementation Tips
Overflow Handling:
```c
// Saturating addition (prevents wraparound)
int16_t saturate_add(int16_t a, int16_t b) {
    int32_t sum = (int32_t)a + b;
    if (sum > 32767) return 32767;
    if (sum < -32768) return -32768;
    return (int16_t)sum;
}
```

Lookup Tables: For transcendental functions (sin, cos, log, exp)
```c
// Q15 sine lookup table (256 entries covering 0 to 2*pi)
const int16_t sin_table[256] = { 0, 804, 1608, 2410, ... };
int16_t fast_sin_q15(uint8_t angle) {
    return sin_table[angle];
}
```

Scaling for Mixed Formats:
```c
// Convert Q8.8 to Q1.15 (different scaling)
// Q8.8: 1.0 = 256, Q1.15: 1.0 = 32768
// Scale factor: 32768/256 = 128 = shift left by 7
int16_t q88_to_q15(int16_t q88_value) {
    // Saturate values outside the representable Q15 range [-1.0, +1.0)
    if (q88_value > 255) return 32767;    // >= 1.0 in Q8.8: clamp to max Q15
    if (q88_value < -256) return -32768;  // < -1.0 in Q8.8: clamp to min Q15
    return q88_value << 7;
}
```

18.9 Key Concepts Summary
Optimization Layers:
- Algorithmic: Algorithm selection and design
- Software: Code implementation, compiler flags
- Microarchitectural: CPU execution patterns
- Hardware: Component selection, specialization
- System: Integration of all layers
Profiling and Measurement:
- Performance counters: Cycles, cache misses, branch prediction
- Memory analysis: Bandwidth, latency, alignment
- Power profiling: Per-core, per-component consumption
- Bottleneck identification: Critical path analysis
- Statistical validation: Representative workloads
Fixed-Point Arithmetic:
- Lower area/power than floating-point
- Precision trade-off management
- Integer operations: Fast and efficient
- Common in DSP, vision, ML inference
18.10 Summary
Fixed-point arithmetic enables efficient computation on resource-constrained IoT devices:
- Qn.m Format: n integer bits + m fractional bits provide predictable range and precision
- Conversion Process: Profile floating-point algorithm, determine value ranges, select format
- Operations: Addition is straightforward; multiplication requires renormalization (right shift)
- Trade-offs: Precision loss vs. 4x+ performance gain and significant power savings
- ML Quantization: int8 quantization is standard for edge AI, providing 4x memory and speed benefits
- Implementation: Use saturating arithmetic, lookup tables, and careful overflow handling
The key insight: floating-point hardware is expensive and power-hungry. For IoT devices, fixed-point arithmetic offers a compelling trade-off of slight precision loss for dramatic efficiency gains.
Related Chapters & Resources
Design Deep Dives:
- Energy Considerations - Power optimization
- Hardware Prototyping - Hardware design
- Reading Spec Sheets - Component selection
Architecture:
- Edge Compute - Edge optimization
- WSN Overview - Sensor network design
Sensing:
- Sensor Circuits - Circuit optimization
Interactive Tools:
- Simulations Hub - Power calculators
Learning Hubs:
- Quiz Navigator - Design quizzes
18.11 Concept Relationships
Fixed-point arithmetic bridges software efficiency and hardware constraints:
- Prerequisite: Requires Optimization Fundamentals to understand performance-precision trade-offs
- Hardware-Driven: Most critical for MCUs without FPU (ARM Cortex-M0/M0+/M3) where floating-point is emulated—10-100× slower than fixed-point
- Enables: Edge ML (TensorFlow Lite Micro int8 quantization) requires fixed-point for real-time inference
- Measured Benefit: 4× speedup claim must be validated with Energy Measurement on target hardware—varies by algorithm
When to use: (1) MCU without FPU, (2) DSP-heavy workload (filters, FFT, audio), (3) ML inference at edge, (4) Battery-powered with tight energy budget. If MCU has FPU (Cortex-M4F/M7/ESP32), float is acceptable unless extreme optimization needed.
18.12 See Also
Optimization Context:
- Optimization Fundamentals - Trade-offs framework
- Software Optimization - Compiler techniques
- Hardware Optimization - FPU vs integer-only MCUs
Implementation Details:
- ARM CMSIS-DSP Library - Q15/Q31 functions
- TensorFlow Lite Micro Quantization - int8 ML models
Energy Impact:
- Operation Costs - Float vs int energy comparison
- Low-Power Strategies - Faster math → shorter wake time
Applications:
- Audio Processing - Q15 format standard
- Edge ML - int8 quantization
- Sensor Fusion - Kalman filters in fixed-point
Common Pitfalls
1. Integer Overflow in Fixed-Point Multiplication
When multiplying two Qn.m numbers, the result has 2n integer bits and 2m fractional bits. If you store the full result in the original word size without shifting and truncating, integer overflow silently corrupts the result. Always shift right by m bits after multiplication and check for saturation.
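To make this pitfall concrete, here is a sketch contrasting a buggy Q15 multiply that stores the full Q30 product in a 16-bit word with the correct renormalizing version (function names are illustrative):

```c
#include <stdint.h>

// Pitfall: storing the Q30 product in 16 bits keeps only the low 16 bits,
// silently wrapping the result.
static int16_t q15_mul_buggy(int16_t a, int16_t b) {
    return (int16_t)((int32_t)a * b);         // 0.5 * 0.5 wraps to 0!
}

// Correct: shift right by m = 15 to renormalize Q30 back to Q15.
static int16_t q15_mul_correct(int16_t a, int16_t b) {
    return (int16_t)(((int32_t)a * b) >> 15); // 0.5 * 0.5 = 0.25
}
```

For 0.5 × 0.5 (16384 × 16384), the product 268,435,456 is 2^28, whose low 16 bits are all zero: the buggy version returns 0 instead of 8192.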
2. Choosing m Too Large, Losing Integer Range
Allocating too many fractional bits leaves too few integer bits, causing overflow for values larger than 2^n - 1. For example, Q2.14 in a 16-bit word can only represent values from -2 to +1.9999. Profile the actual value range first, then allocate bits accordingly.
3. Mixing Q Formats Without Converting
Adding Q4.12 and Q8.8 numbers directly produces an incorrect result because their binary points are misaligned. Always convert operands to the same Qn.m format before arithmetic, or explicitly track and align binary point positions in composite expressions.
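A sketch of the fix, assuming the Q4.12 operand is rebased to Q8.8 before the add (the function name is illustrative):

```c
#include <stdint.h>

// Align a Q4.12 value to Q8.8, then add. Q4.12 has 12 fractional bits and
// Q8.8 has 8, so rebasing is a right shift by 12 - 8 = 4 (truncates 4 LSBs).
static int16_t add_q412_to_q88(int16_t a_q412, int16_t b_q88) {
    int16_t a_q88 = a_q412 >> 4;  // move the binary point to match Q8.8
    return a_q88 + b_q88;         // both operands now share one format
}
```

For example, 1.5 in Q4.12 is 6144 and 2.25 in Q8.8 is 576; the aligned sum is 960, which is 3.75 in Q8.8. Adding 6144 and 576 directly would have produced nonsense.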
4. Applying Fixed-Point Where Hardware FPU Exists
Modern ARM Cortex-M4F and M7 cores include a single-precision hardware FPU that executes most floating-point operations in one to a few cycles, comparable to integer operations. Converting to fixed-point on these platforms adds code complexity with little or no energy benefit. Reserve fixed-point for Cortex-M0/M0+/M3 cores without a hardware FPU.
18.13 What’s Next
| If you want to… | Read this |
|---|---|
| Learn to read device datasheets | Reading a Spec Sheet |
| Apply compiler-level software optimizations | Software Optimization |
| Understand hardware acceleration options | Hardware Optimization |
| Start from optimization fundamentals | Optimization Fundamentals |
| See energy costs of different operations | Energy Cost of Common Operations |