%%{init: {'theme': 'base', 'themeVariables': {'primaryColor':'#2C3E50','primaryTextColor':'#fff','primaryBorderColor':'#16A085','lineColor':'#16A085','secondaryColor':'#E67E22','tertiaryColor':'#7F8C8D','clusterBkg':'#ECF0F1','clusterBorder':'#16A085','edgeLabelBackground':'#ECF0F1'}}}%%
flowchart LR
subgraph input["Voice Input"]
A["Microphone<br/>Analog Signal<br/>20 Hz - 20 kHz"]
end
subgraph preprocessing["Preprocessing"]
B["Band-Pass Filter<br/>300 Hz - 3400 Hz"]
C["Anti-Aliasing<br/>Low-Pass 4 kHz"]
end
subgraph digitization["Digitization"]
D["Sample at 8 kHz"]
E["Quantize to<br/>8 bits (256 levels)"]
end
subgraph compression["Compression"]
F{"Compression<br/>Method?"}
G["Companding<br/>(μ-Law/A-Law)<br/>64 → 64 kbps"]
H["LPC Vocoder<br/>64 → 2-8 kbps"]
I["CELP/ACELP<br/>64 → 8-16 kbps"]
end
subgraph output["Compressed Output"]
J["Transmit over<br/>IoT Network"]
end
A --> B --> C --> D --> E --> F
F -->|"Simple"| G
F -->|"High Compression"| H
F -->|"Balanced"| I
G --> J
H --> J
I --> J
style input fill:#E67E22,stroke:#2C3E50,stroke-width:3px,color:#000
style preprocessing fill:#16A085,stroke:#2C3E50,stroke-width:2px,color:#fff
style digitization fill:#2C3E50,stroke:#16A085,stroke-width:2px,color:#fff
style compression fill:#7F8C8D,stroke:#2C3E50,stroke-width:2px,color:#fff
style output fill:#16A085,stroke:#2C3E50,stroke-width:3px,color:#fff
style A fill:#ECF0F1,stroke:#E67E22,stroke-width:2px,color:#000
style B fill:#ECF0F1,stroke:#16A085,stroke-width:1px,color:#000
style C fill:#ECF0F1,stroke:#16A085,stroke-width:1px,color:#000
style D fill:#ECF0F1,stroke:#2C3E50,stroke-width:1px,color:#000
style E fill:#ECF0F1,stroke:#2C3E50,stroke-width:1px,color:#000
style F fill:#ECF0F1,stroke:#7F8C8D,stroke-width:2px,color:#000
style G fill:#ECF0F1,stroke:#E67E22,stroke-width:1px,color:#000
style H fill:#ECF0F1,stroke:#16A085,stroke-width:1px,color:#000
style I fill:#ECF0F1,stroke:#16A085,stroke-width:1px,color:#000
style J fill:#ECF0F1,stroke:#16A085,stroke-width:2px,color:#000
68 Voice and Audio Compression for IoT
68.1 Learning Objectives
By the end of this chapter, you will be able to:
- Understand Toll Quality: Explain the 64 kbps baseline for voice transmission
- Apply Companding: Use μ-law and A-law compression for better audio SNR
- Understand LPC: Explain the source-filter model and how it achieves 8-27× compression
- Select Audio Codecs: Choose appropriate compression for IoT bandwidth constraints
- Calculate Compression Ratios: Determine bit rate requirements for voice IoT applications
Fundamentals:
- Signal Processing Overview - Sampling and Nyquist theorem
- Aliasing and ADC Resolution - Previous chapter
- Sensor Dynamics - Next in series
- Data Representation - Binary encoding

Networking:
- Long-Range Protocols - LoRaWAN bandwidth constraints
- Short-Range Protocols - BLE audio streaming

Practical:
- Signal Processing Labs - Hands-on experiments
68.2 Prerequisites
- Signal Processing Overview: Sampling rate and ADC concepts
- Aliasing and ADC: Understanding quantization
- Basic logarithms: Understanding log scales for companding formulas
68.3 Voice and Audio Compression for IoT
The Problem: Voice-enabled IoT devices—smart speakers, intercoms, wearable communicators, and industrial voice systems—need to transmit audio over constrained networks. Uncompressed audio is prohibitively expensive for IoT bandwidth and storage budgets.
The Scale of the Challenge:
- Telephone-quality audio: 8-bit samples at 8 kHz = 64 kbps (kilobits per second)
- Typical IoT uplink: 10-50 kbps (LoRaWAN, NB-IoT, BLE)
- Gap: Raw audio needs 1.3-6.4× the available bandwidth!
The Problem in Plain Terms: Imagine you have a walkie-talkie that can only send 50 small packages per second, but your voice generates 64 packages per second. You can’t keep up! You need to either speak slower (not practical) or pack more information into fewer packages (compression).
Real-World Analogy: Think of a newspaper headline vs. the full article. The headline captures the essential meaning in far fewer words. Audio compression does something similar—it captures the essential sounds while discarding information your ear won’t miss.
Why This Matters for IoT:
- Smart speakers: Need to stream voice commands to the cloud for processing
- Baby monitors: Continuous audio over Wi-Fi without clogging the network
- Industrial intercoms: Clear communication in noisy factories over low-bandwidth radio
- Wearable translators: Real-time voice translation requires fast, compressed transmission
Key Insight: Good compression reduces 64 kbps to 8 kbps or less—an 8× reduction—while keeping speech intelligible!
68.3.1 Toll Quality: The Baseline for Voice
Toll quality refers to telephone-standard voice quality established by telecom standards:
| Parameter | Value | Calculation |
|---|---|---|
| Sampling rate | 8,000 Hz | Voice frequencies 300-3400 Hz (Nyquist: >6800 Hz) |
| Bit depth | 8 bits | 256 quantization levels |
| Bit rate | 64 kbps | 8 bits × 8000 samples/sec |
| Bandwidth | 3.1 kHz | 300 Hz to 3400 Hz (telephone band) |
Why 8 kHz?: Human speech contains most intelligible content between 300-3400 Hz. Sampling at 8 kHz (twice 4 kHz) captures this range with margin.
Why 8 bits?: Early telephone systems used 8-bit quantization as a balance between quality and transmission cost. This became the G.711 PCM standard still used today.
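To make the gap concrete, here is a minimal Python sketch of the toll-quality bit-rate arithmetic against the uplink figures quoted earlier (representative values from this chapter, not measurements):

```python
# Toll-quality PCM bit rate vs. typical IoT uplinks (values from the text)
SAMPLE_RATE_HZ = 8_000    # covers the 300-3400 Hz telephone band with margin
BITS_PER_SAMPLE = 8       # G.711 PCM: 256 quantization levels

pcm_bps = SAMPLE_RATE_HZ * BITS_PER_SAMPLE
print(f"Toll-quality PCM: {pcm_bps // 1000} kbps")   # 64 kbps

for link, uplink_bps in [("IoT uplink, low end", 10_000),
                         ("IoT uplink, high end", 50_000)]:
    print(f"{link}: raw audio needs {pcm_bps / uplink_bps:.1f}x the bandwidth")
```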
Voice Compression Pipeline: Audio flows from microphone through band-pass filtering (telephone band 300-3400 Hz), anti-aliasing, 8 kHz sampling, 8-bit quantization, and then through one of three compression methods: companding (simple, preserves 64 kbps), LPC vocoder (high compression to 2-8 kbps), or CELP (balanced quality at 8-16 kbps). {fig-alt=“Flowchart showing complete voice compression pipeline for IoT. Input stage shows microphone capturing 20Hz-20kHz analog signal. Preprocessing applies band-pass filter (300-3400 Hz) and anti-aliasing low-pass at 4 kHz. Digitization samples at 8 kHz and quantizes to 8 bits. Compression decision point offers three paths: companding (mu-law/A-law, 64 kbps), LPC vocoder (2-8 kbps high compression), or CELP/ACELP (8-16 kbps balanced). All paths lead to compressed output for IoT network transmission.”}
68.3.2 Companding: Compression + Expansion
Companding (compression + expansion) is a technique that improves perceived audio quality without changing the bit rate. It exploits a fundamental property of human hearing:
The Key Insight: Human ears perceive loudness logarithmically, not linearly. A sound 10× louder doesn’t seem 10× louder—it seems about twice as loud. This is the Weber-Fechner law.
How Companding Works:
- Compression (transmitter): Apply logarithmic encoding to give more precision to quiet sounds
- Expansion (receiver): Apply inverse (exponential) decoding to restore original dynamics
%%{init: {'theme': 'base', 'themeVariables': {'primaryColor':'#2C3E50','primaryTextColor':'#fff','primaryBorderColor':'#16A085','lineColor':'#16A085','secondaryColor':'#E67E22','tertiaryColor':'#7F8C8D','clusterBkg':'#ECF0F1','clusterBorder':'#16A085','edgeLabelBackground':'#ECF0F1'}}}%%
flowchart TB
subgraph problem["The Problem with Linear Quantization"]
P1["Quiet sounds: Few bits<br/>→ High quantization noise"]
P2["Loud sounds: Many bits<br/>→ Wasteful precision"]
P3["SNR varies with<br/>signal amplitude"]
end
subgraph solution["Companding Solution"]
direction LR
S1["Compress<br/>(Logarithmic)<br/>Before ADC"]
S2["Uniform<br/>Quantization<br/>(8 bits)"]
S3["Expand<br/>(Exponential)<br/>After DAC"]
end
subgraph result["Result"]
R1["Quiet sounds: More bits<br/>→ Lower noise floor"]
R2["Loud sounds: Fewer bits<br/>→ Still adequate"]
R3["SNR constant across<br/>all amplitudes!"]
end
problem --> solution --> result
subgraph standards["Regional Standards"]
MU["μ-Law (mu-law)<br/>• North America & Japan<br/>• μ = 255<br/>• Slightly better for low signals"]
AL["A-Law<br/>• Europe & rest of world<br/>• A = 87.6<br/>• Slightly better dynamic range"]
end
solution --> standards
style problem fill:#E67E22,stroke:#2C3E50,stroke-width:3px,color:#000
style solution fill:#16A085,stroke:#2C3E50,stroke-width:3px,color:#fff
style result fill:#2C3E50,stroke:#16A085,stroke-width:3px,color:#fff
style standards fill:#7F8C8D,stroke:#2C3E50,stroke-width:2px,color:#fff
style P1 fill:#ECF0F1,stroke:#E67E22,stroke-width:1px,color:#000
style P2 fill:#ECF0F1,stroke:#E67E22,stroke-width:1px,color:#000
style P3 fill:#ECF0F1,stroke:#E67E22,stroke-width:1px,color:#000
style S1 fill:#ECF0F1,stroke:#16A085,stroke-width:2px,color:#000
style S2 fill:#ECF0F1,stroke:#16A085,stroke-width:2px,color:#000
style S3 fill:#ECF0F1,stroke:#16A085,stroke-width:2px,color:#000
style R1 fill:#ECF0F1,stroke:#2C3E50,stroke-width:1px,color:#000
style R2 fill:#ECF0F1,stroke:#2C3E50,stroke-width:1px,color:#000
style R3 fill:#ECF0F1,stroke:#2C3E50,stroke-width:1px,color:#000
style MU fill:#ECF0F1,stroke:#E67E22,stroke-width:2px,color:#000
style AL fill:#ECF0F1,stroke:#16A085,stroke-width:2px,color:#000
Companding Principle: Linear quantization wastes bits on loud sounds while under-representing quiet sounds. Companding applies logarithmic compression before digitization and exponential expansion after, achieving constant signal-to-noise ratio (SNR) across all amplitudes. Two standards exist: μ-law (North America/Japan, μ=255) and A-law (Europe/international, A=87.6). {fig-alt=“Three-stage diagram explaining companding. Problem section (orange): linear quantization gives quiet sounds few bits (high noise), loud sounds many bits (wasteful), and SNR varies with amplitude. Solution section (teal): compress with logarithmic function before ADC, uniform 8-bit quantization, expand with exponential function after DAC. Result section (navy): quiet sounds get more bits (lower noise), loud sounds get fewer but adequate bits, SNR is constant across all amplitudes. Standards section shows μ-law (North America/Japan, μ=255) and A-law (Europe, A=87.6).”}
The μ-Law Formula (North America, Japan):
\[F(x) = \text{sgn}(x) \cdot \frac{\ln(1 + \mu |x|)}{\ln(1 + \mu)}\]
Where:
- \(x\) = normalized input (-1 to +1)
- \(\mu\) = compression parameter (typically 255)
- \(\text{sgn}(x)\) = sign of x (+1 or -1)
The A-Law Formula (Europe, international):
\[F(x) = \text{sgn}(x) \cdot \begin{cases} \frac{A|x|}{1 + \ln(A)} & |x| < \frac{1}{A} \\ \frac{1 + \ln(A|x|)}{1 + \ln(A)} & |x| \geq \frac{1}{A} \end{cases}\]
Where A = 87.6 (standard value)
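Both formulas translate directly into code. Below is a minimal NumPy sketch of continuous μ-law and A-law companding plus a small SNR demonstration. Note that production G.711 codecs use segmented piecewise-linear approximations of these curves; the helper names (`quantize`, `snr_db`) and the quiet 440 Hz test tone are illustrative assumptions.

```python
import numpy as np

MU = 255.0   # mu-law parameter (North America/Japan)
A = 87.6     # A-law parameter (Europe/international)

def mu_law_compress(x, mu=MU):
    """Continuous mu-law compressor; x is normalized to [-1, +1]."""
    return np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)

def mu_law_expand(y, mu=MU):
    """Inverse of mu_law_compress."""
    return np.sign(y) * ((1.0 + mu) ** np.abs(y) - 1.0) / mu

def a_law_compress(x, a=A):
    """Continuous A-law compressor (piecewise, as in the formula above)."""
    ax, denom = np.abs(x), 1.0 + np.log(a)
    small = ax < 1.0 / a
    y = np.where(small, a * ax / denom,
                 (1.0 + np.log(np.where(small, 1.0, a * ax))) / denom)
    return np.sign(x) * y

def a_law_expand(y, a=A):
    """Inverse of a_law_compress."""
    ay, denom = np.abs(y), 1.0 + np.log(a)
    small = ay < 1.0 / denom
    x = np.where(small, ay * denom / a,
                 np.exp(np.where(small, 0.0, ay * denom - 1.0)) / a)
    return np.sign(y) * x

def quantize(v, bits=8):
    """Uniform quantizer on [-1, +1]."""
    levels = 2 ** (bits - 1)
    return np.round(v * levels) / levels

def snr_db(clean, noisy):
    return 10 * np.log10(np.sum(clean ** 2) / np.sum((clean - noisy) ** 2))

# Demo: 8-bit quantization of a quiet tone, with and without companding
t = np.arange(0, 0.02, 1 / 8000)
x = 0.01 * np.sin(2 * np.pi * 440 * t)   # quiet signal: 1% of full scale
linear = quantize(x)
companded = mu_law_expand(quantize(mu_law_compress(x)))
print(f"Linear 8-bit SNR: {snr_db(x, linear):.1f} dB")
print(f"mu-law 8-bit SNR: {snr_db(x, companded):.1f} dB")
```

On this quiet tone the companded path should show a markedly higher SNR (on the order of tens of dB better under these assumptions), which is exactly the point of the technique.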
Comparison: μ-Law vs A-Law:
| Aspect | μ-Law | A-Law |
|---|---|---|
| Region | North America, Japan | Europe, rest of world |
| Parameter | μ = 255 | A = 87.6 |
| Low-level signals | Slightly better SNR | Slightly worse |
| Dynamic range | Slightly less | Slightly more |
| Standard | ITU-T G.711 | ITU-T G.711 |
Important: Companding doesn’t change the bit rate (still 64 kbps), but it dramatically improves perceived quality by matching quantization precision to human hearing sensitivity.
68.3.3 Linear Predictive Coding (LPC): 8× Compression
The Breakthrough: Instead of transmitting the audio waveform directly, transmit parameters that describe how to synthesize it. This is the key to achieving 8× or greater compression.
The Source-Filter Model of Speech:
LPC is based on a revolutionary model of human speech developed by Gunnar Fant (KTH, Sweden) and Bishnu Atal (AT&T Bell Labs):
%%{init: {'theme': 'base', 'themeVariables': {'primaryColor':'#2C3E50','primaryTextColor':'#fff','primaryBorderColor':'#16A085','lineColor':'#16A085','secondaryColor':'#E67E22','tertiaryColor':'#7F8C8D','clusterBkg':'#ECF0F1','clusterBorder':'#16A085','edgeLabelBackground':'#ECF0F1'}}}%%
flowchart LR
subgraph source["Excitation Source"]
E1["Voiced Sounds<br/>(vowels, m, n, l)<br/>Periodic impulse train<br/>from vocal cords"]
E2["Unvoiced Sounds<br/>(s, f, sh, t)<br/>Random noise<br/>from turbulent airflow"]
end
subgraph filter["Vocal Tract Filter"]
F1["Lips, tongue, jaw<br/>shape the sound"]
F2["Modeled as<br/>All-Pole Filter<br/>10-16 coefficients"]
end
subgraph output["Speech Output"]
O1["Intelligible<br/>Speech Signal"]
end
E1 --> F1
E2 --> F1
F1 --> F2 --> O1
subgraph transmission["What LPC Transmits"]
T1["• Voiced/Unvoiced flag (1 bit)"]
T2["• Pitch period (6-7 bits)"]
T3["• Filter coefficients (10-16 × 4-6 bits)"]
T4["• Gain (5-6 bits)"]
T5["Total: ~50-80 bits per 20ms frame"]
end
output --> transmission
style source fill:#E67E22,stroke:#2C3E50,stroke-width:3px,color:#000
style filter fill:#16A085,stroke:#2C3E50,stroke-width:3px,color:#fff
style output fill:#2C3E50,stroke:#16A085,stroke-width:3px,color:#fff
style transmission fill:#7F8C8D,stroke:#2C3E50,stroke-width:2px,color:#fff
style E1 fill:#ECF0F1,stroke:#E67E22,stroke-width:2px,color:#000
style E2 fill:#ECF0F1,stroke:#E67E22,stroke-width:2px,color:#000
style F1 fill:#ECF0F1,stroke:#16A085,stroke-width:1px,color:#000
style F2 fill:#ECF0F1,stroke:#16A085,stroke-width:2px,color:#000
style O1 fill:#ECF0F1,stroke:#2C3E50,stroke-width:2px,color:#000
style T1 fill:#ECF0F1,stroke:#7F8C8D,stroke-width:1px,color:#000
style T2 fill:#ECF0F1,stroke:#7F8C8D,stroke-width:1px,color:#000
style T3 fill:#ECF0F1,stroke:#7F8C8D,stroke-width:1px,color:#000
style T4 fill:#ECF0F1,stroke:#7F8C8D,stroke-width:1px,color:#000
style T5 fill:#ECF0F1,stroke:#7F8C8D,stroke-width:1px,color:#000
Source-Filter Model: Human speech is modeled as an excitation source (vocal cords for voiced sounds, turbulent noise for unvoiced) filtered by the vocal tract (lips, tongue, jaw). LPC transmits only the source type, pitch, filter coefficients, and gain—about 50-80 bits per 20ms frame instead of 1280 bits for raw PCM. {fig-alt=“Diagram of source-filter speech model used in LPC. Source section (orange) shows two excitation types: voiced sounds (vowels, m, n, l) as periodic impulse train from vocal cords, and unvoiced sounds (s, f, sh, t) as random noise from turbulent airflow. Filter section (teal) shows vocal tract shaping by lips/tongue/jaw, modeled as all-pole filter with 10-16 coefficients. Output (navy) is intelligible speech. Transmission section (gray) lists what LPC sends: voiced/unvoiced flag (1 bit), pitch period (6-7 bits), filter coefficients (10-16 × 4-6 bits), gain (5-6 bits), totaling 50-80 bits per 20ms frame.”}
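To make the "transmit filter parameters" idea concrete, here is a minimal sketch of the analysis side of LPC using the autocorrelation method with Levinson-Durbin recursion. The frame length, filter order, Hamming window, and synthetic "vowel" test signal are illustrative assumptions; LPC-10 itself specifies its own windowing, quantization, and coefficient format.

```python
import numpy as np

def lpc_coefficients(frame, order=10):
    """Estimate LPC coefficients by the autocorrelation method
    with Levinson-Durbin recursion (analysis side only)."""
    frame = frame * np.hamming(len(frame))              # taper frame edges
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    a = np.zeros(order + 1)
    a[0] = 1.0                                          # A(z) = 1 + a1*z^-1 + ...
    err = r[0]                                          # prediction error energy
    for i in range(1, order + 1):
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err  # reflection coefficient
        a[1:i] = a[1:i] + k * a[i - 1:0:-1]             # update lower-order coeffs
        a[i] = k
        err *= 1.0 - k * k
    return a, err

# One 20 ms frame (160 samples at 8 kHz) of a crude two-formant "vowel"
fs = 8000
t = np.arange(160) / fs
frame = np.sin(2 * np.pi * 150 * t) + 0.5 * np.sin(2 * np.pi * 700 * t)
a, gain_sq = lpc_coefficients(frame, order=10)
print("Predictor coefficients:", np.round(a[1:], 3))
print("Residual energy (gain^2):", round(float(gain_sq), 4))
```

The decoder rebuilds speech by driving the all-pole filter 1/A(z) with either an impulse train (voiced) or noise (unvoiced) at the transmitted gain.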
The LPC Compression Calculation:
Raw PCM (Toll Quality):
- Sampling rate: 8,000 Hz
- Bits per sample: 8 bits
- Bit rate: 8 × 8,000 = 64,000 bps = 64 kbps

LPC-10 (US Government Standard):
- Frame size: 20 ms (160 samples)
- Bits per frame:
  - Voiced/unvoiced: 1 bit
  - Pitch: 6 bits
  - Gain: 5 bits
  - 10 reflection coefficients: ~36 bits
  - Total: ~48 bits per frame
- Frames per second: 50 (1000 ms / 20 ms)
- Bit rate: 48 × 50 = 2,400 bps = 2.4 kbps
Compression Ratio: 64 kbps / 2.4 kbps = 26.7:1
That’s just 0.3 bits per sample (2,400 bps ÷ 8,000 samples/s) instead of 8 bits per sample!
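The frame-budget arithmetic can be verified in a few lines (bit allocations copied from the breakdown above):

```python
# LPC-10 frame budget (bits per 20 ms frame, values from the text)
bits_per_frame = 1 + 6 + 5 + 36      # voiced/unvoiced + pitch + gain + 10 coefficients
frames_per_second = 1000 // 20       # 20 ms frames -> 50 frames/s
bit_rate_bps = bits_per_frame * frames_per_second

print(f"Bit rate: {bit_rate_bps} bps")                      # 2400
print(f"Compression ratio: {64_000 / bit_rate_bps:.1f}:1")  # 26.7:1
print(f"Bits per sample: {bit_rate_bps / 8_000:.2f}")       # 0.30
```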
Trade-off: LPC-10 sounds “robotic” because it perfectly models the vocal tract but loses the fine nuances that make voices sound natural. It’s intelligible but not pleasant for long conversations.
Modern LPC Variants for IoT:
| Codec | Bit Rate | Quality | Use Case |
|---|---|---|---|
| LPC-10 | 2.4 kbps | Robotic, intelligible | Military secure voice, extreme low bandwidth |
| CELP | 4.8-16 kbps | Good | Early digital cellular |
| AMR-NB | 4.75-12.2 kbps | Good to excellent | GSM/3G voice calls |
| Opus | 6-510 kbps | Excellent | VoIP, smart speakers, gaming |
| Codec2 | 0.7-3.2 kbps | Fair | Amateur radio, IoT ultra-low bandwidth |
68.3.4 Practical IoT Voice Applications
Application 1: Smart Doorbell
- Constraint: Wi-Fi connected, but video already consumes bandwidth
- Solution: Use Opus at 16 kbps for voice (vs 64 kbps PCM)
- Savings: 75% bandwidth reduction, more headroom for video

Application 2: LoRaWAN Voice Alert System
- Constraint: 250 bps effective data rate
- Solution: Pre-recorded vocabulary + Codec2 at 700 bps
- Approach: Don’t stream; buffer 2-3 second messages and transmit over 20-30 seconds (see the timing sketch after Application 3)

Application 3: Wearable Translator
- Constraint: BLE bandwidth ~1 Mbps, but must be real-time
- Solution: AMR-WB at 12.65 kbps for high quality
- Result: Multiple simultaneous voice streams possible
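A quick sanity check on Application 2’s store-and-forward timing; the 3-second message length and the 2× overhead factor for headers, duty-cycle gaps, and retries are assumptions for illustration:

```python
# Store-and-forward timing for a LoRaWAN voice alert (Application 2)
message_seconds = 3.0      # buffered message length (assumed)
codec_bps = 700            # Codec2 lowest-rate mode from the text
link_bps = 250             # effective LoRaWAN uplink from the text
overhead_factor = 2.0      # assumed allowance for headers/duty cycle/retries

payload_bits = message_seconds * codec_bps
transmit_seconds = payload_bits / link_bps * overhead_factor
print(f"{payload_bits:.0f} bits take ~{transmit_seconds:.0f} s to deliver")
# ~17 s here; stricter duty-cycle limits push this toward the 20-30 s quoted
```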
Scenario: You’re designing a voice-enabled industrial IoT intercom system for a noisy factory floor. Requirements:
- Communication range: 500 meters (using LoRa radio, 10 kbps effective)
- Voice quality: Must be intelligible over background noise
- Latency: <500 ms end-to-end acceptable
- Battery: Solar-powered, always listening
Questions:
- Can you transmit raw 64 kbps PCM audio? Why or why not?
- Which compression approach would you choose: companding, LPC-10, or CELP?
- What bit rate is achievable, and what’s the compression ratio?
Answers:
Q1: Can you transmit 64 kbps PCM?
- No. Your radio supports 10 kbps, but 64 kbps audio requires 6.4× the available bandwidth.
- Even with protocol overhead removed, you need >6× compression at minimum.
Q2: Which compression approach?
| Approach | Bit Rate | Pros | Cons |
|---|---|---|---|
| Companding | 64 kbps | Best quality | Too high—still 6.4× over budget |
| LPC-10 | 2.4 kbps | Fits easily (24% of channel) | Robotic, hard to understand in noise |
| CELP/AMR | 4.75-8 kbps | Good quality, fits channel | 47-80% of channel used |
| Codec2 | 1.2-3.2 kbps | Ultra-low, open source | Lower quality, but intelligible |
Best choice: CELP/AMR at ~6 kbps (or Codec2 at 3.2 kbps for safety margin)
Reasoning:
- 6 kbps uses 60% of the 10 kbps channel, leaving room for headers and retransmits
- CELP quality is acceptable even in noisy environments
- Latency budget allows for ~100-200 ms encoding buffer

Q3: Compression ratio:
- Raw: 64 kbps
- Compressed (CELP @ 6 kbps): 6 kbps
- Ratio: 64/6 = 10.7:1 compression
With Codec2 @ 3.2 kbps: 20:1 compression
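A minimal feasibility check over the candidate codecs (bit rates taken from the table above; the 6 kbps CELP/AMR operating point is the one chosen in the reasoning):

```python
# Does each codec fit the 10 kbps LoRa channel from the scenario?
LINK_BPS = 10_000
RAW_BPS = 64_000

candidates = [("Companded PCM", 64_000), ("LPC-10", 2_400),
              ("CELP/AMR", 6_000), ("Codec2", 3_200)]

for name, bps in candidates:
    verdict = "fits" if bps <= LINK_BPS else "too big"
    print(f"{name:14s} {RAW_BPS / bps:5.1f}:1  "
          f"{bps / LINK_BPS:4.0%} of channel  ({verdict})")
```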
Key Insight: Voice compression makes the difference between “impossible” (64 kbps over 10 kbps link) and “comfortable” (6 kbps with 40% margin).
68.3.5 Summary: Voice Compression for IoT
| Technique | Compression | Bit Rate | Quality | IoT Use Case |
|---|---|---|---|---|
| Raw PCM | 1:1 | 64 kbps | Perfect | Local processing only |
| Companding (μ/A-law) | ~1:1* | 64 kbps | Excellent | High-bandwidth Wi-Fi devices |
| LPC-10 | 27:1 | 2.4 kbps | Robotic | Emergency/military systems |
| CELP/AMR | 4-8:1 | 8-16 kbps | Good | Smart speakers, intercoms |
| Opus | Variable | 6-64 kbps | Excellent | VoIP, wearables |
| Codec2 | 20-90:1 | 0.7-3.2 kbps | Fair | LoRa, ultra-low bandwidth |
*Companding improves quality but doesn’t reduce bit rate
Key Takeaways:
- Toll quality baseline: 64 kbps (8 bits × 8 kHz)
- Companding: Logarithmic encoding improves SNR for quiet sounds (μ-law vs A-law)
- Source-filter model: Speech = excitation (voiced/unvoiced) + vocal tract filter
- LPC magic: Transmit filter parameters instead of waveform → 8-27× compression
- IoT applications: Match codec to available bandwidth (Codec2 for LoRa, Opus for Wi-Fi)
68.4 What’s Next
- Sensor Dynamics - Understanding sensor temporal response
- Signal Processing Labs - Hands-on experiments
- Networking Protocols - Understanding bandwidth constraints