9  Audio Feature Extraction for IoT

In 60 Seconds

MFCC (Mel-Frequency Cepstral Coefficients) is the standard technique for extracting audio features on IoT devices, reducing each 25 ms frame of 400 raw audio samples to just 13 coefficients. The resulting roughly 6x reduction in data rate enables wake word detection on microcontrollers with less than 10 KB of RAM and under 100 ms of latency, powering devices like Amazon Alexa and Google Home.

9.1 Learning Objectives

By the end of this chapter, you will be able to:

  • Implement MFCC Extraction: Apply the six-stage MFCC pipeline for speech and audio recognition
  • Design Keyword Spotting Systems: Build wake word detection for voice assistants
  • Configure Audio Processing: Deploy efficient audio ML on edge devices with optimized parameters
  • Apply Hybrid Architectures: Design edge-cloud voice recognition systems

Key Concepts

  • MFCC (Mel-Frequency Cepstral Coefficients): Audio features that represent the spectral envelope of a sound in a way that approximates how the human auditory system processes frequency, widely used for speech and audio event classification.
  • Spectrogram: A 2D visual representation of how the frequency content of a sound signal changes over time, computed using the Short-Time Fourier Transform (STFT) and used as input to CNNs for audio classification.
  • Zero-crossing rate: The rate at which the audio signal changes sign, providing a simple measure of signal noisiness; high ZCR indicates fricatives (s, f sounds) or noise, low ZCR indicates voiced sounds.
  • RMS energy: The root mean square of the signal amplitude over a short window, measuring the loudness of the audio; used as a simple feature for voice activity detection and sound level monitoring.
  • Chroma features: A 12-element feature vector representing the energy in each musical pitch class (C, C#, D, …, B), useful for music recognition and harmonic analysis.
  • Audio event detection: The classification of discrete sound events (glass breaking, machinery alarm, dog bark) in an audio stream, requiring features that distinguish between event types across varying acoustic environments.
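Two of the simplest features above, zero-crossing rate and RMS energy, can be made concrete in a few lines of NumPy. This is a minimal sketch; the function names and the synthetic test signals are illustrative, not from any particular library:

```python
import numpy as np

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs where the signal changes sign."""
    signs = np.sign(frame)
    signs[signs == 0] = 1                      # treat exact zeros as positive
    return float(np.mean(signs[1:] != signs[:-1]))

def rms_energy(frame):
    """Root mean square amplitude over the window (a loudness proxy)."""
    return float(np.sqrt(np.mean(frame ** 2)))

t = np.arange(400) / 16_000                    # one 25 ms frame at 16 kHz
voiced = np.sin(2 * np.pi * 200 * t)           # 200 Hz tone: voiced-like, low ZCR
noise = np.random.default_rng(0).standard_normal(400)  # fricative-like, high ZCR
```

As the definitions predict, the noise frame yields a far higher ZCR than the 200 Hz tone, while RMS reflects only amplitude, not spectral content.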

Audio feature engineering turns raw sound recordings into numbers that computers can analyze. Think of describing a song to a friend – you might mention the tempo, pitch, and loudness. Similarly, we extract numerical characteristics from microphone data to enable tasks like detecting machinery faults by their sound or recognizing voice commands.

9.2 Prerequisites

Chapter Series: Modeling and Inferencing

This is part 5 of the IoT Machine Learning series:

  1. ML Fundamentals - Core concepts
  2. Mobile Sensing - HAR, transportation
  3. IoT ML Pipeline - 7-step pipeline
  4. Edge ML & Deployment - TinyML
  5. Audio Feature Processing (this chapter) - MFCC, keyword recognition
  6. Feature Engineering - Feature design
  7. Production ML - Monitoring

9.3 MFCC: The Gold Standard for Voice Recognition

Edge devices like Amazon Alexa and Google Home use MFCC (Mel-Frequency Cepstral Coefficients) to recognize wake words locally before streaming to the cloud.

9.3.1 Why MFCC for IoT Voice Commands?

Raw audio is extremely high-dimensional and computationally expensive:

Metric      Raw Audio                       MFCC
Sampling    16 kHz = 16,000 samples/sec     100 frames/sec
Per frame   400 samples (25 ms)             13 coefficients
Storage     31.25 KB/second (16-bit)        5.1 KB/second (float32)
Reduction   Baseline                        ~6x smaller

Calculate the computational and storage benefits of MFCC for a voice-activated IoT device:

Raw audio (16 kHz, 16-bit mono): \[\text{Data rate} = 16{,}000\text{ samples/s} \times 2\text{ bytes} = 32{,}000\text{ bytes/s} = 31.25\text{ KB/s}\] \[\text{Per frame (25 ms)} = 400\text{ samples} \times 2\text{ bytes} = 800\text{ bytes}\]

MFCC features (13 coefficients per 25ms frame, 10ms hop): \[\text{Frame rate} = \frac{1{,}000\text{ ms}}{10\text{ ms hop}} = 100\text{ frames/s}\] \[\text{Per frame} = 13\text{ coefficients} \times 4\text{ bytes (float32)} = 52\text{ bytes}\] \[\text{Data rate} = 100\text{ frames/s} \times 52\text{ bytes} = 5{,}200\text{ bytes/s} = 5.1\text{ KB/s}\]

Reduction ratio: \[\frac{31.25\text{ KB/s}}{5.08\text{ KB/s}} \approx 6.2\times \text{ compression}\]

For a 2-second wake word buffer: raw audio = 62.5 KB, MFCC = 10.2 KB. This ~6x reduction enables local processing on microcontrollers with only 64 KB RAM. MFCC transforms raw samples into perceptually-meaningful features while drastically cutting memory and compute requirements.
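The arithmetic above can be checked directly. This sketch mirrors the worked example's constants (16 kHz, 16-bit audio; 13 float32 coefficients at a 10 ms hop):

```python
# Data-rate arithmetic from the worked example above.
SAMPLE_RATE = 16_000        # Hz
BYTES_PER_SAMPLE = 2        # 16-bit PCM
HOP_MS = 10                 # frame shift
N_MFCC = 13
BYTES_PER_COEFF = 4         # float32

raw_rate = SAMPLE_RATE * BYTES_PER_SAMPLE            # bytes per second of raw audio
frames_per_s = 1_000 // HOP_MS                       # 100 frames/s
mfcc_rate = frames_per_s * N_MFCC * BYTES_PER_COEFF  # bytes per second of MFCCs

print(raw_rate)                         # 32000 bytes/s  (31.25 KB/s)
print(mfcc_rate)                        # 5200 bytes/s   (~5.1 KB/s)
print(round(raw_rate / mfcc_rate, 1))   # 6.2  (compression ratio)
```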

9.3.2 MFCC Data Rate Explorer

Varying the audio and MFCC parameters shows how they affect data rates and compression. For example, halving the hop from 10 ms to 5 ms doubles the frame rate to 200 frames/s, doubling the MFCC data rate to ~10.2 KB/s and halving the compression ratio to ~3.1x.

Real-World Impact:

  • Alexa wake word detection: Runs on Cortex-M4 using <10KB RAM
  • Google Assistant “Hey Google”: 14KB model, <100ms latency
  • Privacy benefit: MFCCs are computed locally, so raw audio stays on the device until the wake word is confirmed

9.3.3 The MFCC Extraction Pipeline

Diagram showing the six-stage MFCC pipeline: pre-emphasis filter, windowing into 25ms frames, FFT to frequency domain, Mel filterbank application, log compression, and DCT to produce 13 cepstral coefficients
Figure 9.1: Audio to MFCC Feature Extraction Pipeline

9.3.4 Step-by-Step MFCC Calculation

1. Pre-emphasis Filter (~1ms)

Boosts high-frequency components that are attenuated in human speech:

\[y[n] = x[n] - 0.97 \cdot x[n-1]\]

  • Compensates for vocal tract dampening
  • Improves SNR for fricatives (s, sh, f sounds)
  • Without this, neural networks underweight consonants versus vowels
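The filter equation above is a one-liner in NumPy. A minimal sketch (the function name is illustrative):

```python
import numpy as np

def pre_emphasis(x, alpha=0.97):
    """y[n] = x[n] - alpha * x[n-1]; the first sample passes through unchanged."""
    return np.append(x[0], x[1:] - alpha * x[:-1])

# A constant (DC / low-frequency) signal is almost entirely suppressed,
# while rapid sample-to-sample changes pass through largely intact.
```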

2. Windowing / Frame Blocking (~0.5ms)

Splits continuous audio into short overlapping frames using a Hamming window \(w[n] = 0.54 - 0.46 \cos(2\pi n / N)\):

Parameter        Value                           Rationale
Frame size       25 ms (400 samples at 16 kHz)   Short enough for stationarity
Frame shift      10 ms (160 samples)             60% overlap for smooth transitions
Window function  Hamming                         Reduces spectral leakage by tapering frame edges

Why overlap? Speech characteristics change gradually. Overlapping ensures smooth transitions aren’t missed.
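Framing and windowing together look like this in NumPy, using the Hamming formula from the text (some libraries normalize the cosine argument by N-1 rather than N; this sketch follows the chapter's definition):

```python
import numpy as np

def frame_signal(x, frame_len=400, hop=160):
    """Split x into 25 ms frames with a 10 ms shift and apply a Hamming window."""
    n = np.arange(frame_len)
    window = 0.54 - 0.46 * np.cos(2 * np.pi * n / frame_len)   # w[n] from the text
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop : i * hop + frame_len] for i in range(n_frames)])
    return frames * window

frames = frame_signal(np.ones(16_000))   # one second of audio at 16 kHz
```

One second of 16 kHz audio yields 98 frames of 400 windowed samples each; the Hamming taper leaves a small residual (0.08) at the frame edges rather than going all the way to zero.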

3. Fast Fourier Transform (FFT) (~5ms on Cortex-M4)

Converts time-domain signal to frequency spectrum. Standard implementations (e.g., ARM CMSIS-DSP) use pre-computed twiddle factors for efficiency:

Input                                          Output
400 time-domain samples (zero-padded to 512)   257 frequency bins (0-8 kHz, including DC and Nyquist)

4. Mel Filterbank (~1ms)

Applies 26-40 triangular filters spaced according to the Mel scale, mimicking human auditory perception:

Frequency Range   Mel Scale     Filter Spacing
0-1000 Hz         Linear        Evenly spaced
1000-8000 Hz      Logarithmic   Increasingly wider

The Mel scale formula:

\[\text{Mel}(f) = 2595 \times \log_{10}\!\left(1 + \frac{f}{700}\right)\]

Why Mel scale? Humans distinguish low-frequency differences (100 Hz vs 200 Hz) better than high-frequency (5000 Hz vs 5100 Hz). Mel filters allocate more resolution to perceptually-important low frequencies (300-3000 Hz for speech).
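The Mel formula and the triangular filterbank construction can be sketched as follows. This follows common practice (26 filters over a 512-point FFT at 16 kHz); the exact bin-mapping convention varies slightly between libraries:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters=26, n_fft=512, sample_rate=16_000):
    """Triangular filters, evenly spaced on the mel scale, over FFT bin centers."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for j in range(left, center):                    # rising edge
            fb[i - 1, j] = (j - left) / max(center - left, 1)
        for j in range(center, right):                   # falling edge
            fb[i - 1, j] = (right - j) / max(right - center, 1)
    return fb
```

Evenly spacing the filter edges in mel units automatically produces narrow filters at low frequencies and wide ones at high frequencies, which is exactly the perceptual weighting described above.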

5. Log Compression (<0.1ms)

\[S[m] = \log(\text{Energy}[m])\]

  • Mimics human loudness perception (logarithmic)
  • Compresses dynamic range, making features robust to volume variations (whisper vs shout)

6. Discrete Cosine Transform (DCT) (~2ms)

Decorrelates the log filterbank energies:

Input                        Output
40 log filterbank energies   13 MFCCs (the first 13 of 40 cepstral coefficients)

Why only 12-13 coefficients?

  • First 12-13 capture phoneme information (speech content)
  • Higher coefficients (14+) capture speaker-specific characteristics
  • For keyword detection, we want speaker-independent features

Total computation: ~9 ms per frame on an 80 MHz Cortex-M4. At 100 frames/second (10 ms hop), MFCC extraction alone consumes ~90% of the CPU, leaving only a thin margin (~100 ms per second) for neural network inference and system tasks; this tight budget is what motivates the optimizations in Section 9.7.

9.4 Beyond Wake Words: Industrial Audio Anomaly Detection

MFCC-based audio analysis extends far beyond voice assistants. In industrial IoT, the same technique detects mechanical failures by listening to machines:

Predictive maintenance through sound: A healthy electric motor produces a consistent acoustic signature. When bearings begin to wear, the spectrum shifts – higher-frequency harmonics appear 2-6 weeks before visible vibration increases. MFCC features capture these spectral changes at a fraction of the cost of dedicated vibration sensors.

Application                  Normal Sound Pattern              Anomaly Pattern                   Detection Lead Time
Bearing wear                 Low-frequency hum, stable MFCCs   High-frequency harmonics emerge   2-6 weeks before failure
Pump cavitation              Smooth flow noise                 Crackling, irregular bursts       1-3 days before damage
Compressor valve leak        Periodic compression cycle        Hissing between cycles            Hours to days
Conveyor belt misalignment   Rhythmic clicks at belt speed     Additional off-frequency clicks   Days to weeks

Why audio beats vibration sensors for some applications: A single MEMS microphone ($0.50) can monitor an entire machine room, while vibration sensors ($50-200 each) must be mounted on each bearing point. For small and medium facilities with 10-50 machines, audio-based monitoring costs 90% less to deploy. The tradeoff is lower spatial resolution – audio tells you something is wrong in the room, but vibration sensors pinpoint which bearing is failing.

9.5 Edge vs Cloud Voice Recognition

Approach                Latency      Privacy                    Power    Use Case
Edge MFCC + Wake Word   <100 ms      High (audio stays local)   Low      Always-listening devices
Cloud Full ASR          200-500 ms   Low (sends audio)          Medium   Complex commands
Hybrid                  150 ms       Medium                     Low      Smart speakers (Alexa)

Why hybrid is optimal:

  • Edge wake word detection draws only ~7 mW on an MCU (vs ~300 mW for continuous cloud streaming)
  • Only activates cloud streaming after wake word confirmed
  • Preserves privacy (99% of audio never leaves device)
  • Cloud handles complex natural language understanding

9.6 Worked Example: Wake Word Detection

System Architecture:

Diagram illustrating the hybrid edge-cloud wake word detection architecture: microphone feeds continuous audio to edge MFCC extraction and a small neural network classifier, which triggers cloud ASR streaming only when the wake word is detected with high confidence
Figure 9.2: Voice Assistant Wake Word Detection with Edge MFCC

9.6.1 Step-by-step Execution

  1. Continuous MFCC computation (always running):
    • Device captures 25ms audio frames (10ms shift)
    • Computes 13 MFCCs per frame at 100 frames/second
    • Power consumption: ~5 mW (negligible)
  2. Wake word detection (small neural network):
    • Input: Last 1 second of MFCCs (100 frames x 13 coefficients = 1,300 features)
    • Model: 3-layer fully connected network (8 KB total)
    • Output: Probability “Hey Alexa” was spoken
    • Inference time: <50ms on Cortex-M4
  3. Threshold check:
    • If confidence > 90% → Wake word detected!
    • If confidence < 90% → Continue monitoring, discard audio
  4. Cloud activation (only after wake word):
    • Start streaming raw audio to cloud
    • Cloud performs full ASR + NLU
    • Generate and stream response
  5. Return to low-power mode:
    • After command completes, stop cloud streaming
    • Return to MFCC-only wake word monitoring
    • Battery life: Months to years (vs. hours if always streaming)
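The detection loop in steps 1-3 can be sketched as follows. The classifier is a stand-in for the 3-layer network described in step 2; all names, the rolling-buffer approach, and the stub are illustrative:

```python
import numpy as np
from collections import deque

CONFIDENCE_THRESHOLD = 0.90
WINDOW_FRAMES = 100            # 1 s of MFCCs at 100 frames/s

mfcc_buffer = deque(maxlen=WINDOW_FRAMES)   # rolling 1 s window of 13-coeff frames

def detect_wake_word(mfcc_window, classifier):
    """Flatten the 100 x 13 window into 1,300 features and threshold the score."""
    features = np.stack(mfcc_window).reshape(-1)
    return classifier(features) > CONFIDENCE_THRESHOLD

def on_new_frame(mfcc_frame, classifier):
    """Called once per 10 ms hop; fires only once a full second of context exists."""
    mfcc_buffer.append(mfcc_frame)
    if len(mfcc_buffer) == WINDOW_FRAMES:
        return detect_wake_word(mfcc_buffer, classifier)
    return False
```

The `deque(maxlen=...)` automatically discards the oldest frame on every append, so the buffer always holds exactly the most recent second of MFCCs without any copying.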

9.6.2 Performance Metrics

Metric                    Value
False Accept Rate (FAR)   <0.01% (1 per 10,000 phrases)
False Reject Rate (FRR)   <5% (misses 1 in 20)
Latency                   80 ms average
Power                     7 mW (vs 300 mW for streaming)

Why Not Just Use Raw Audio?

Memory explosion:

  • Raw audio (16-bit, 16 kHz): 31.25 KB/second
  • MFCCs (13 coefficients, 100 frames/s): 5.1 KB/second
  • ~6× reduction in data rate

Computational cost:

  • Raw audio neural network input: 400 samples per 25ms frame
  • MFCC neural network input: 13 coefficients per frame
  • ~31× fewer input features per frame

Noise robustness:

  • Raw audio captures all frequencies (including noise)
  • MFCCs focus on perceptually-relevant frequencies (300-8000 Hz)

Speaker independence:

  • Raw audio contains speaker pitch, timbre, accent
  • MFCCs capture phoneme content, not speaker identity

9.7 Optimizing MFCC for Edge Devices

Practical Implementation Tips

Reduce computation:

  1. Use 20 filters instead of 40 (still effective for keywords)
  2. Pre-compute filterbank weights as INT16
  3. Use ARM CMSIS-DSP library for optimized FFT
  4. Circular buffer for overlapping frames (saves RAM)
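Tip 4 deserves a sketch. On an MCU the circular buffer would be a fixed int16 array and an index; this NumPy version shows the idea — only 160 new samples are written per hop, and the most recent 400-sample frame is read back in order. The class and method names are illustrative:

```python
import numpy as np

FRAME_LEN, HOP = 400, 160   # 25 ms frame, 10 ms hop at 16 kHz

class AudioRing:
    """Circular buffer the size of one frame: RAM stays fixed at 400 samples."""
    def __init__(self):
        self.buf = np.zeros(FRAME_LEN, dtype=np.int16)
        self.pos = 0

    def push_hop(self, samples):
        """Write one hop's worth (160) of new samples over the oldest data."""
        assert len(samples) == HOP
        idx = (self.pos + np.arange(HOP)) % FRAME_LEN
        self.buf[idx] = samples
        self.pos = (self.pos + HOP) % FRAME_LEN

    def latest_frame(self):
        """Return the most recent 400 samples, oldest first."""
        idx = (self.pos + np.arange(FRAME_LEN)) % FRAME_LEN
        return self.buf[idx]
```

Because consecutive frames share 240 of their 400 samples, overwriting only the hop's worth of old data avoids copying the overlap on every frame.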

Common Pitfall: A model trained in a quiet office often fails in noisy homes.

Solution: Add noise augmentation during training (e.g., noise clips from the MUSAN dataset or AudioSet), mixing at SNRs spanning -5 dB to +20 dB.
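Mixing noise at a target SNR is a short calculation: scale the noise so the speech-to-noise power ratio matches the target, then add. A sketch (the function name is illustrative; MUSAN or AudioSet clips would supply the noise array in practice):

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale noise so speech-power / noise-power equals the target SNR, then mix."""
    noise = noise[: len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12       # guard against silent noise clips
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise
```

Sweeping `snr_db` from -5 to +20 during training exposes the model to everything from noise-dominated to nearly clean conditions.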


9.8 Complete IoT ML Lifecycle

Audio feature extraction fits within the broader IoT ML lifecycle. Figure 9.3 shows how MFCC extraction (the feature engineering stage) connects to the full pipeline from data collection through edge deployment and monitoring.

Diagram showing the complete IoT ML lifecycle in six stages: data collection from sensors, feature engineering including MFCC extraction, model training in the cloud, model compression and quantization, edge inference deployment, and continuous monitoring with feedback loops
Figure 9.3: Complete IoT ML Lifecycle with Six Stages

9.9 Common Pitfalls

Standard MFCC parameters (26 mel filters, 13 coefficients, 25 ms window) are optimised for speech. Industrial sounds, wildlife monitoring, and structural monitoring require different mel filter counts, window sizes, and number of coefficients tuned to the target frequency range.

An audio classifier trained in a quiet lab will degrade when deployed in a noisy factory. Include diverse acoustic environments in training data and test with noise levels representative of the deployment site.

Raw audio at 16 kHz, 16-bit generates 32,000 bytes (~31.25 KB) per second. Processing raw waveforms on a microcontroller is impractical; extract features (MFCC, RMS, ZCR) over small windows and run inference on the compact feature vector.

A 25 ms window is appropriate for speech phonemes but will miss slow-developing industrial sounds that unfold over seconds. Match the window size to the temporal resolution needed for the shortest event of interest.

9.10 Summary

This chapter covered audio feature extraction for IoT:

  • MFCC Pipeline: Pre-emphasis → Windowing → FFT → Mel filterbank → Log → DCT
  • Dimensionality Reduction: ~6x data reduction enables edge processing
  • Wake Word Detection: Edge MFCC + tiny NN achieves <100ms latency, 7mW power
  • Hybrid Architecture: Edge for wake word, cloud for full ASR
  • Optimization: 20 filters, INT16 weights, CMSIS-DSP for efficiency

Key Insight: MFCCs encode domain knowledge (human hearing) into features, enabling small models to achieve high accuracy on resource-constrained devices.

Key Takeaway

MFCCs encode decades of psychoacoustic research (how humans perceive sound) into a compact 13-coefficient feature vector, enabling tiny neural networks on microcontrollers to achieve 95%+ accuracy for wake word detection. The hybrid edge-cloud architecture – where the edge device handles always-on keyword spotting at 7mW while the cloud processes full speech recognition – is the standard pattern for voice-enabled IoT products.

How does your smart speaker know when you say “Hey Alexa”? The Sensor Squad investigates!

Sammy the Sensor loves listening to sounds, but there is a problem – sound is REALLY complicated! Every second, Sammy hears 16,000 tiny pieces of sound. That is way too much for his little brain to understand!

So Sammy calls his friend Max the Microcontroller for help. Max is super smart at math and says: “Let me use my MFCC magic trick!” (MFCC stands for a really long name, but think of it as a Sound Simplifier.)

Here is what Max does:

  1. He takes the sound and breaks it into tiny 25-millisecond slices (like cutting each second of sound into dozens of thin pizza slices)
  2. For each slice, he figures out which musical notes are in it (like sorting candy by color)
  3. He squishes all that information down to just 13 numbers!

“Wait,” says Lila the LED, “you turned 16,000 sounds into just 13 numbers? That is amazing!”

Max smiles: “Now I can figure out if someone said ‘Hey Alexa’ using just those 13 numbers. It takes me less than a blink of an eye!”

Bella the Battery is happy too: “And because you only need 13 numbers instead of 16,000, I can keep Max running ALL DAY without running out of energy!”

Fun fact: Your smart speaker is always listening for its wake word using this trick, but it only uses as much power as a tiny night light. The heavy lifting only happens AFTER it hears the magic word – then it sends your full question to a powerful computer in the cloud!

9.10.1 Try This at Home!

Clap your hands once slowly and once fast. Even though both are “claps,” they sound different! A smart speaker uses MFCC to tell them apart, just like you can hear the difference. Now try whispering “hello” and shouting “hello” – same word, but the sound pattern is completely different. MFCCs help computers understand both!

9.11 Concept Relationships

Audio feature extraction builds on:

  • ML Fundamentals - Feature extraction as dimensionality reduction (16,000 → 13)
  • Edge ML & Deployment - Quantization and TinyML enable MFCC on microcontrollers
  • Signal processing fundamentals - FFT, windowing, filterbanks

Audio features enable:

  • Edge Deployments - Wake word detection runs on <10KB RAM microcontrollers
  • Production ML - Industrial acoustic anomaly detection for predictive maintenance
  • Voice assistants - Alexa, Google Home, Siri all use MFCC-based wake word detection

Parallel concepts:

  • MFCC extraction ↔︎ Image feature extraction (CNN): Both reduce high-dimensional sensory input to compact features
  • Mel scale ↔︎ Logarithmic quantization: Both match human perception to improve model learning
  • Hybrid edge-cloud voice ↔︎ Hybrid edge-cloud vision: Both use edge for always-on lightweight detection, cloud for complex analysis

9.12 See Also

Audio processing:

  • Multi-Sensor Data Fusion - Combining audio with other sensors
  • TensorFlow Lite Micro tutorials - MFCC implementation examples
  • ARM CMSIS-DSP library - Optimized FFT for Cortex-M processors

Applications:

  • Voice assistants - Amazon Alexa, Google Assistant, Apple Siri
  • Industrial monitoring - Acoustic anomaly detection for motors, pumps, compressors
  • Smart home - Audio event detection (glass breaking, smoke alarm, baby crying)

9.13 What’s Next

Direction   Chapter                         Link
Next        Feature Engineering             modeling-feature-engineering.html
Previous    Edge ML and TinyML Deployment   modeling-edge-deployment.html
Related     Production ML                   modeling-production.html
Related     ML Fundamentals                 modeling-ml-fundamentals.html