18  Fixed-Point Arithmetic

18.1 Learning Objectives

By the end of this chapter, you will be able to:

  • Understand Qn.m format: Define and work with fixed-point number representations
  • Convert floating-point to fixed-point: Transform algorithms from float to efficient integer operations
  • Select appropriate n and m values: Choose bit allocation based on range and precision requirements
  • Implement fixed-point operations: Perform multiplication, division, and other arithmetic correctly
  • Evaluate trade-offs: Compare precision loss against performance and power savings
In 60 Seconds

Fixed-point arithmetic replaces floating-point operations with scaled integer math using the Qn.m format (n integer bits, m fractional bits), achieving 3–10× energy savings on MCUs without an FPU by turning slow software float emulation into fast hardware integer operations.

Fixed-point arithmetic is a clever way to do decimal math using only whole numbers (integers). Instead of storing 25.3 directly, we multiply it by a fixed number (like 256) and store 6477 as an integer. When we need the decimal back, we divide by 256. The trick is that all the math happens using fast integer operations, which saves lots of battery power on small devices. Think of it like measuring distances in millimeters instead of meters—you use whole numbers but still capture the precision you need.

“Most small microcontrollers do not have hardware for decimal point math,” said Max the Microcontroller. “Floating-point operations like 3.14 times 2.7 take many clock cycles in software, which wastes time and energy. Fixed-point arithmetic is a clever trick to do decimal math using only integers!”

Sammy the Sensor asked, “How do you represent 25.3 degrees without a decimal point?” Max explained, “You multiply by a power of two! In Q8.8 format, you store 25.3 as 25.3 times 256 equals 6,477. To the processor, it is just the integer 6477. When you need the real value, divide by 256. All the math happens with fast integer operations!”
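Max's encode/decode trick can be sketched in a few lines of C (a minimal illustration with assumed helper names, not code from this chapter):

```c
#include <stdint.h>

// Encode a float into Q8.8: multiply by 2^8 = 256 and round to nearest.
static int16_t float_to_q88(float x) {
    return (int16_t)(x * 256.0f + (x >= 0 ? 0.5f : -0.5f));
}

// Decode Q8.8 back to float: divide by 256.
static float q88_to_float(int16_t q) {
    return q / 256.0f;
}
```

Here float_to_q88(25.3f) yields 6477, and q88_to_float(6477) gives back about 25.3008 -- the small discrepancy is the quantization error of the 1/256 resolution.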

Bella the Battery appreciated the energy savings: “Fixed-point math can be 10 to 100 times faster than floating-point on a small microcontroller without an FPU. Faster math means less time awake, which means I last longer! The trade-off is that you lose a tiny bit of precision, but for most sensor readings, that is perfectly acceptable.” Lila the LED added, “Choose your Q format carefully – more fractional bits give more precision, but fewer integer bits mean a smaller range. It is all about balance!”

18.2 Prerequisites

Before diving into this chapter, you should be familiar with:

  • Software Optimization: Understanding compiler optimizations and code efficiency
  • Optimization Fundamentals: Understanding why optimization matters for IoT
  • Binary number representation: Familiarity with how integers are stored in binary

18.3 Fixed-Point Arithmetic

18.3.1 Why Fixed-Point?

Most signal processing and machine learning algorithms are initially developed in high-level environments like MATLAB or Python using floating-point arithmetic. This approach simplifies development and testing, but presents challenges when deploying to resource-constrained IoT devices:

  • Hardware cost: Floating-point processors or FPU hardware are expensive and power-hungry
  • Common alternative: Fixed-point processors dominate the embedded systems market
  • Deployment workflow: After design and test in floating-point, convert to fixed-point representation
  • Target platforms: Deploy onto fixed-point processor or ASIC for production
Figure 18.1: Conversion to Qn.m. Workflow from floating-point algorithm development in MATLAB or Python to a fixed-point implementation on an embedded processor or ASIC: algorithm design, format selection based on range and precision requirements, then hardware deployment.

18.3.2 Qn.m Format

Qn.m Format: a fixed positional number system

  • n bits to the left of the decimal point (including the sign bit)
  • m bits to the right of the decimal point
  • Total bits: n + m

Let’s examine a concrete example to understand how the Qn.m format works in practice.

Example: Q5.4 (9 bits total)

  • 5 bits for the integer part (sign + 4 magnitude bits)
  • 4 bits for the fraction
  • Range: -16.0 to +15.9375
  • Resolution: 1/16 = 0.0625

This format can represent temperatures from -16°C to +15.9375°C with a resolution of 0.0625°C—suitable for many indoor temperature sensing applications.
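To make the Q5.4 numbers concrete, here is a hypothetical encode step in C (the function name is illustrative):

```c
#include <stdint.h>

// Q5.4: scale by 2^4 = 16. One sign + 4 integer bits, 4 fraction bits,
// stored here in an int16_t for convenience.
static int16_t float_to_q54(float x) {
    return (int16_t)(x * 16.0f + (x >= 0 ? 0.5f : -0.5f));
}
```

Encoding 21.7°C gives round(21.7 × 16) = 347; decoding gives 347 / 16 = 21.6875°C, an error of 0.0125°C, comfortably within the 0.0625°C resolution.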

Figure 18.2: The Qn.m format. n bits (including the sign bit) sit to the left of the decimal point, with m fractional bits to the right.
Figure 18.3: Common Qn.m formats. Q7.8 (16 bits, sensor data), Q15 (1 sign bit + 15 fractional bits, DSP/audio), and Q31 (32 bits, high-precision applications), showing the range/resolution trade-offs.
Figure 18.4: Q15 worked example. The value 0.5 is stored as the integer 16384 (0.5 × 2^15); 1 sign bit and 15 fractional bits give a range of -1.0 to +0.99997.

18.3.3 Conversion to Qn.m

Converting from floating-point to fixed-point requires careful analysis of your data:

  1. Define total number of bits (e.g., 9 bits)
  2. Fix location of decimal point based on value range
  3. Determine n and m based on required range and precision

Range Determination:

  • Run simulations for all input sets
  • Observe ranges of values for all variables
  • Note the minimum and maximum value each variable sees
  • Determine Qn.m format to cover range
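The range-determination steps above can be automated with a small helper. The sketch below (assumed names, not a standard API) picks the smallest integer bit count n that covers an observed signed range; m is then the word size minus n:

```c
// Given the observed min/max of a variable, return the integer bit count n
// (including the sign bit) needed so that Qn.m covers the range.
// For a 16-bit word, the fractional bit count is then m = 16 - n.
static int q_integer_bits(float min_val, float max_val) {
    float mag = max_val > -min_val ? max_val : -min_val;  // largest magnitude seen
    int n = 1;                                            // start with sign bit only
    float limit = 1.0f;                                   // n bits cover magnitudes < 2^(n-1)
    while (mag >= limit && n < 16) {
        n++;
        limit *= 2.0f;
    }
    return n;
}
```

For the temperature example later in this chapter (-40 to +85), this returns 8, matching the Q8.8 choice.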


18.3.4 Fixed-Point Operations

Addition and Subtraction: Straightforward when formats match

// Q15 + Q15 = Q15 (same format)
int16_t a = 16384;  // 0.5 in Q15
int16_t b = 8192;   // 0.25 in Q15
int16_t c = a + b;  // 24576 = 0.75 in Q15

Multiplication: Requires renormalization

// Q15 x Q15 = Q30, must shift back to Q15
int16_t a = 16384;  // 0.5 in Q15
int16_t b = 16384;  // 0.5 in Q15
int32_t temp = (int32_t)a * b;  // 268435456 in Q30
int16_t result = temp >> 15;    // 8192 = 0.25 in Q15


Why does Q15 multiplication need a right shift by 15? Because multiplying two Q15 numbers produces Q30:

Q15 format: 1 sign bit + 15 fractional bits, where 1.0 = \(2^{15}\) = 32,768

Multiplying 0.5 × 0.5 should give 0.25:

  • \(a = 0.5 \times 32768 = 16384\) (Q15)
  • \(b = 0.5 \times 32768 = 16384\) (Q15)
  • \(a \times b = 16384 \times 16384 = 268,435,456\) (Q30, because \(2^{15} \times 2^{15} = 2^{30}\))

To convert Q30 back to Q15, divide by \(2^{15}\):

\[\text{result} = \frac{268,435,456}{32768} = 8192\]

Verify: \(\frac{8192}{32768} = 0.25\)

On MCUs without hardware dividers, >> 15 (right shift) is a single-cycle operation vs 12-40 cycles for software division—critical for real-time DSP running at 100+ kHz sampling rates.
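In practice the multiply-and-shift is wrapped in a helper. The sketch below adds a rounding term (1 << 14, half of the final LSB) before the shift, so the result rounds to nearest instead of truncating toward negative infinity:

```c
#include <stdint.h>

// Q15 x Q15 -> Q15 with round-to-nearest instead of plain truncation.
static int16_t q15_mul(int16_t a, int16_t b) {
    int32_t prod = (int32_t)a * b;               // Q30 intermediate
    return (int16_t)((prod + (1 << 14)) >> 15);  // round, then renormalize to Q15
}
```

Note that the extreme case -1.0 × -1.0 still overflows Q15; the saturation techniques later in this chapter handle that corner.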

Division: Often avoided or replaced with multiplication by reciprocal

// Divide by 10 using multiply by 0.1 (avoid slow division)
// 0.1 in Q15 = 3277
int16_t x = 32767;  // ~1.0 in Q15
int32_t temp = (int32_t)x * 3277;  // Multiply by 0.1
int16_t result = temp >> 15;        // ~3277 = ~0.1
Try It: Fixed-Point Division by Reciprocal

Explore how dividing by a number is replaced with multiplying by its reciprocal in fixed-point. Choose a divisor and see how the reciprocal is encoded in Q15 format.
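A sketch of how the reciprocal constant might be derived at initialization time (helper names are illustrative, not from the chapter):

```c
#include <stdint.h>

// Encode 1/d in Q15 with rounding; valid for d >= 2 so the result fits in Q15.
static int16_t q15_reciprocal(int16_t d) {
    return (int16_t)((32768 + d / 2) / d);
}

// Divide x (Q15) by the original d using one multiply and a shift
// instead of a hardware or software division.
static int16_t q15_div_by_const(int16_t x, int16_t recip_q15) {
    return (int16_t)(((int32_t)x * recip_q15) >> 15);
}
```

For d = 10 this reproduces the constant 3277 used above; q15_div_by_const(16384, 3277) gives 1638, i.e. 0.5 / 10 ≈ 0.05 in Q15.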

18.4 Knowledge Check

Scenario: Your edge AI camera runs object detection (MobileNet) on images. Baseline uses float32 (32-bit floating-point), consuming 150mW during inference and requiring 8 MB RAM (model + activations). Latency: 2 seconds/frame on ARM Cortex-M7 @ 200 MHz. You’re evaluating int8 quantization (8-bit fixed-point), which the datasheet claims provides 4x speedup and 4x memory reduction with <2% accuracy loss.

Think about:

  1. What are the new latency, RAM, and power metrics with int8 quantization?
  2. Does 4x speedup enable real-time processing (30 fps = 33ms/frame)?
  3. What additional considerations affect the decision beyond performance metrics?

Key Insight: Int8 quantization benefits:

  • Latency: 2 sec / 4 = 500 ms/frame – still far from the 33 ms needed for 30 fps real-time, but 4x better.
  • RAM: 8 MB / 4 = 2 MB – fits in devices with 4 MB SRAM instead of requiring 16 MB.
  • Power: 150 mW / 4 = 37.5 mW – longer battery life, or a smaller battery.

Verdict: Use int8 quantization. While 500 ms still isn't real-time, it is "good enough" for many applications (e.g., doorbell detection every 0.5 sec is fine). The 2 MB RAM savings enables deployment on cheaper hardware ($5 MCU with 4 MB RAM vs $15 MCU with 16 MB RAM). Trade-off: <2% accuracy loss is acceptable for most object detection tasks. Non-performance consideration: some MCUs have hardware int8 accelerators (e.g., ARM Cortex-M55 with the Helium vector extension) providing 10-20x additional speedup, potentially reaching real-time. The chapter's lesson: floating-point processors and FPU hardware are expensive, while fixed-point processors are common in embedded systems – for edge AI, int8 fixed-point is standard.

18.5 Worked Example: Converting a Temperature Filter to Fixed-Point

A common IoT task is applying an exponential moving average (EMA) filter to smooth noisy temperature readings. Here is the conversion from floating-point to fixed-point, step by step.

Floating-Point Version (runs on ESP32 with FPU):

alpha = 0.1
filtered = alpha * new_reading + (1 - alpha) * filtered

For a reading of 25.7C with previous filtered value of 25.3C: filtered = 0.1 * 25.7 + 0.9 * 25.3 = 2.57 + 22.77 = 25.34

Step 1: Determine Value Ranges

Variable     Min Value  Max Value  Required Precision
temperature  -40.0°C    +85.0°C    0.1°C
alpha        0.0        1.0        0.01
filtered     -40.0°C    +85.0°C    0.1°C

Step 2: Select Q Format

Temperature range is -40 to +85, needing 8 integer bits (range -128 to +127). For 0.1C precision, we need at least 4 fractional bits (resolution 0.0625C). Choose Q8.8 (16-bit total): range -128 to +127.996, resolution 0.00390625C.

Alpha ranges from 0 to 1, needing only 1 integer bit. Choose Q1.15 for maximum precision (exactly 1.0 is not representable, but a smoothing factor never needs it).

Step 3: Convert Constants

alpha_q15 = round(0.1 * 32768) = 3277
one_minus_alpha_q15 = 32768 - 3277 = 29491
temperature 25.7C in Q8.8 = round(25.7 * 256) = 6579
filtered 25.3C in Q8.8 = round(25.3 * 256) = 6477

Step 4: Fixed-Point Calculation

// All operations are integer-only -- no FPU needed
int16_t temp_q88 = 6579;       // 25.7C
int16_t filt_q88 = 6477;       // 25.3C
int16_t alpha_q15 = 3277;      // 0.1

// term1 = alpha * new_reading (Q15 * Q8.8 = Q23.8, shift right 15 to get Q8.8)
int32_t term1 = (int32_t)alpha_q15 * temp_q88;  // = 21,559,383
int16_t term1_q88 = term1 >> 15;                 // = 657 (~2.566C in Q8.8)

// term2 = (1-alpha) * filtered
int32_t term2 = (int32_t)(32768 - alpha_q15) * filt_q88;  // = 191,013,207
int16_t term2_q88 = term2 >> 15;                            // = 5829 (~22.770C in Q8.8)

// result
int16_t result_q88 = term1_q88 + term2_q88;  // = 6486
// Convert back: 6486 / 256 = 25.336C

Result: 25.336°C vs floating-point 25.34°C – negligible error (about 0.004°C), but runs 10-50x faster on MCUs without FPU hardware.

Try It: EMA Temperature Filter in Fixed-Point

Experiment with the exponential moving average filter parameters. Adjust the smoothing factor (alpha), current reading, and previous filtered value to see how fixed-point Q8.8 compares with floating-point.
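The whole worked example can be packaged into a single reusable step function. This sketch (the function name is illustrative) keeps the full 32-bit intermediate and applies one final shift, which loses slightly less precision than shifting each term separately before adding:

```c
#include <stdint.h>

// One EMA step: filtered = alpha*reading + (1-alpha)*filtered.
// reading and filtered are Q8.8; alpha is Q15 (0..32768).
static int16_t ema_step_q88(int16_t reading_q88, int16_t filtered_q88,
                            int16_t alpha_q15) {
    int32_t acc = (int32_t)alpha_q15 * reading_q88
                + (int32_t)(32768 - alpha_q15) * filtered_q88;  // Q8.8 scaled by 2^15
    return (int16_t)(acc >> 15);  // renormalize back to Q8.8
}
```

For the chapter's inputs (reading 6579, filtered 6477, alpha 3277) this returns 6487, i.e. 25.34°C; truncating each term separately can differ in the last bit.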

18.6 When to Use Fixed-Point: Decision Framework

Not every IoT application benefits from fixed-point conversion. Use this decision tree:

  • Does your MCU have a hardware FPU? (ESP32, STM32F4+) Yes: float is fine for most tasks. No: fixed-point gives a 10-50x speedup.
  • Is the computation in a tight loop? (DSP filter, ML inference) Yes: fixed-point saves significant energy. No: float overhead is negligible.
  • Do you need more than 6 decimal digits of precision? Yes: stay with float32 or use Q1.31. No: Q8.8 or Q16.16 is sufficient.
  • Are you deploying ML models? (object detection, anomaly detection) Yes: int8 quantization is the industry standard. No: float is acceptable for development.
  • Is the battery life target more than 2 years on a coin cell? Yes: every instruction counts – use fixed-point. No: float is acceptable.

Real-World Example – Google’s TensorFlow Lite for Microcontrollers: Google’s TFLite Micro framework uses int8 quantization as the default for edge ML inference. Their person detection model on ARM Cortex-M4 (no FPU) benchmarks show: float32 inference takes 22 seconds per frame, while int8 takes 580ms – a 38x speedup. Memory drops from 300 KB to 75 KB, enabling deployment on a $2 MCU with 256 KB SRAM instead of a $15 MCU with 1 MB SRAM. The accuracy loss is under 1% for most classification tasks.

Real-World Example – Hearing Aids: Modern hearing aids like the Oticon More use 16-bit fixed-point DSP running at 10 MHz to process audio in real-time. The entire signal processing chain – noise reduction, feedback cancellation, directional beamforming – runs on a zinc-air battery (1.4V, 310 mAh) for 60+ hours. Floating-point DSP would require 5-10x more power, reducing battery life to under 12 hours – unacceptable for a device worn 16 hours per day.

18.7 Common Fixed-Point Formats

Format       Total Bits  Integer  Fractional  Range                 Resolution  Use Case
Q1.7         8           1        7           -1 to +0.992          0.0078      Audio samples
Q1.15 (Q15)  16          1        15          -1 to +0.999          0.000031    DSP, audio
Q8.8         16          8        8           -128 to +127.996      0.0039      Sensor data
Q16.16       32          16       16          -32768 to +32767.999  0.000015    GPS coordinates
Q1.31 (Q31)  32          1        31          -1 to +0.9999999995   4.7e-10     High-precision DSP
Try It: Compare Q Format Conversions

Enter a floating-point value and see how it is represented in each standard Q format. Observe how different formats trade off range for precision.
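The range and resolution columns follow directly from n and m. A sketch of the formulas in C (illustrative helper names, using double precision for display purposes):

```c
#include <math.h>

// For signed Qn.m (n integer bits including sign, m fractional bits):
// smallest value = -2^(n-1), largest = 2^(n-1) - 2^(-m), resolution = 2^(-m).
static double q_min(int n, int m)        { (void)m; return -ldexp(1.0, n - 1); }
static double q_max(int n, int m)        { return ldexp(1.0, n - 1) - ldexp(1.0, -m); }
static double q_resolution(int n, int m) { (void)n; return ldexp(1.0, -m); }
```

For Q8.8 these give -128, +127.99609375, and 0.00390625, matching the table above.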

18.8 Implementation Tips

Overflow Handling:

// Saturating addition (prevents wraparound)
int16_t saturate_add(int16_t a, int16_t b) {
    int32_t sum = (int32_t)a + b;
    if (sum > 32767) return 32767;
    if (sum < -32768) return -32768;
    return (int16_t)sum;
}
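Multiplication needs similar protection: Q15 has one asymmetric corner case, -1.0 × -1.0, which produces +1.0 and overflows the +0.99997 maximum. A saturating multiply to complement saturate_add (a sketch, assuming an arithmetic right shift as on ARM compilers):

```c
#include <stdint.h>

// Saturating Q15 multiply: clamps the single overflow case (-32768 * -32768).
int16_t saturate_mul_q15(int16_t a, int16_t b) {
    int32_t prod = ((int32_t)a * b) >> 15;  // Q30 -> Q15
    if (prod > 32767) return 32767;         // only -1.0 * -1.0 reaches here
    return (int16_t)prod;
}
```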

Lookup Tables: For transcendental functions (sin, cos, log, exp)

// Q15 sine lookup table (256 entries for 0 to 2*pi)
const int16_t sin_table[256] = { 0, 804, 1608, 2410, ... };

int16_t fast_sin_q15(uint8_t angle) {
    return sin_table[angle];
}

Scaling for Mixed Formats:

// Convert Q8.8 to Q1.15 (different scaling)
// Q8.8: 1.0 = 256, Q1.15: 1.0 = 32768
// Scale factor: 32768/256 = 128 = shift left by 7
int16_t q88_to_q15(int16_t q88_value) {
    // First clamp to valid Q15 range (-1 to +1)
    if (q88_value > 255) q88_value = 255;   // >1.0 in Q8.8
    if (q88_value < -256) q88_value = -256; // <-1.0 in Q8.8
    return q88_value << 7;
}
Try It: Overflow and Saturation Explorer

See what happens when fixed-point addition overflows, and how saturating arithmetic prevents dangerous wraparound errors. Enter two Q8.8 values that may exceed the format range.

18.9 Key Concepts Summary

Optimization Layers:

  • Algorithmic: Algorithm selection and design
  • Software: Code implementation, compiler flags
  • Microarchitectural: CPU execution patterns
  • Hardware: Component selection, specialization
  • System: Integration of all layers

Profiling and Measurement:

  • Performance counters: Cycles, cache misses, branch prediction
  • Memory analysis: Bandwidth, latency, alignment
  • Power profiling: Per-core, per-component consumption
  • Bottleneck identification: Critical path analysis
  • Statistical validation: Representative workloads

Fixed-Point Arithmetic:

  • Lower area/power than floating-point
  • Precision trade-off management
  • Integer operations: Fast and efficient
  • Common in DSP, vision, ML inference

18.10 Summary

Fixed-point arithmetic enables efficient computation on resource-constrained IoT devices:

  1. Qn.m Format: n integer bits + m fractional bits provide predictable range and precision
  2. Conversion Process: Profile floating-point algorithm, determine value ranges, select format
  3. Operations: Addition is straightforward; multiplication requires renormalization (right shift)
  4. Trade-offs: Precision loss vs. 4x+ performance gain and significant power savings
  5. ML Quantization: int8 quantization is standard for edge AI, providing 4x memory and speed benefits
  6. Implementation: Use saturating arithmetic, lookup tables, and careful overflow handling

The key insight: floating-point hardware is expensive and power-hungry. For IoT devices, fixed-point arithmetic offers a compelling trade-off of slight precision loss for dramatic efficiency gains.


18.11 Concept Relationships

Fixed-point arithmetic bridges software efficiency and hardware constraints:

  • Prerequisite: Requires Optimization Fundamentals to understand performance-precision trade-offs
  • Hardware-Driven: Most critical for MCUs without FPU (ARM Cortex-M0/M0+/M3) where floating-point is emulated—10-100× slower than fixed-point
  • Enables: Edge ML (TensorFlow Lite Micro int8 quantization) requires fixed-point for real-time inference
  • Measured Benefit: 4× speedup claim must be validated with Energy Measurement on target hardware—varies by algorithm

When to use: (1) MCU without FPU, (2) DSP-heavy workload (filters, FFT, audio), (3) ML inference at edge, (4) Battery-powered with tight energy budget. If MCU has FPU (Cortex-M4F/M7/ESP32), float is acceptable unless extreme optimization needed.


Common Pitfalls

When multiplying two Qn.m numbers, the result has 2n integer bits and 2m fractional bits. If you store the full result in the original word size without shifting and truncating, integer overflow silently corrupts the result. Always shift right by m bits after multiplication and check for saturation.

Allocating too many fractional bits leaves too few integer bits, causing overflow for values at or above 2^(n-1) (where n includes the sign bit). For example, Q2.14 in a 16-bit word can only represent values from -2 to +1.9999. Profile the actual value range first, then allocate bits accordingly.

Adding Q4.12 and Q8.8 numbers directly produces an incorrect result because their binary points are misaligned. Always convert operands to the same Qn.m format before arithmetic, or explicitly track and align binary point positions in composite expressions.
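Alignment is just a shift by the difference in fractional bits. A sketch for the Q4.12 + Q8.8 case mentioned above (helper name is illustrative; precision below 2^-8 is discarded by the shift):

```c
#include <stdint.h>

// Convert Q4.12 to Q8.8 by dropping 12 - 8 = 4 fractional bits,
// then add the operands in a common format.
static int16_t add_q412_q88(int16_t a_q412, int16_t b_q88) {
    int16_t a_q88 = a_q412 >> 4;  // align binary points: Q4.12 -> Q8.8
    return a_q88 + b_q88;         // result is Q8.8
}
```

For example, 1.5 in Q4.12 is 6144 and 2.25 in Q8.8 is 576; the aligned sum is 960, i.e. 3.75 in Q8.8.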

Modern ARM Cortex-M4F and M7 cores include a hardware FPU that executes most single-precision operations in a single cycle, comparable to integer operations. Converting to fixed-point on these platforms adds code complexity with little or no energy benefit. Reserve fixed-point for Cortex-M0/M3 cores without a hardware FPU.

18.13 What’s Next

If you want to… Read this
Learn to read device datasheets Reading a Spec Sheet
Apply compiler-level software optimizations Software Optimization
Understand hardware acceleration options Hardware Optimization
Start from optimization fundamentals Optimization Fundamentals
See energy costs of different operations Energy Cost of Common Operations