27 Edge Processing for Big Data

27.1 Start With the Story

Picture an IoT team applying the ideas in this chapter during a live operations review. A device has produced messy evidence, an analytic step is about to change an alert or control decision, and someone has to explain why the result should be trusted.

Read this page as that path from sensor evidence to accountable action. Start with what the system observes, keep the model or data treatment visible, and finish with the check that would convince an operator, maintainer, or auditor to act.

27.2 Edge Chooses What Leaves Site

Edge processing is not just “doing analytics near the device.” It is a data-reduction and decision contract. The edge decides which raw samples stay local, which features or aggregates move upstream, which anomaly evidence needs immediate cloud attention, and which raw window must be retained locally for replay. This matters when raw sensor, camera, audio, or vibration streams are too large, too private, too expensive, or too latency-sensitive to ship continuously.

An edge design should name the local computation, the forwarded evidence, the retained raw window, and the cloud handoff. A gateway may publish MQTT summaries, run a TensorFlow Lite Micro or ONNX Runtime model, compute FFT or RMS vibration features, use delta or Gorilla-style time-series compression, buffer raw windows in a local ring store, and send anomalies to Kafka or a cloud IoT service. The cloud still matters, but it receives decisions and evidence rather than every raw byte.

The key question is not “edge or cloud?” It is “which bytes must cross the network for this decision, and which bytes are only needed locally for validation, replay, or short-term debugging?”

Edge processing should be designed as an evidence-reduction pipeline: filter raw streams, aggregate the decision features, compress the result, retain enough local context, and forward only the payload needed for cloud action and review.

Filter: Drop redundant samples, unchanged readings, low-value frames, or data outside the decision window.

Aggregate: Send min, max, count, mean, RMS, percentiles, or window features instead of every sample.

Infer: Run local classifiers, anomaly detectors, or rule engines when latency or privacy requires local action.

Buffer: Keep a bounded raw ring buffer so anomalies can include pre-event and post-event evidence.

Forward: Publish summaries, features, alerts, and selected raw evidence to cloud systems with provenance.
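The five stages above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the function name `process_window`, the `DEADBAND` and `ALERT_LIMIT` values, and the payload field names are all hypothetical, and a simple threshold rule stands in for a real classifier or anomaly model.

```python
from collections import deque
from statistics import fmean

DEADBAND = 0.5       # assumed: smallest change worth forwarding
ALERT_LIMIT = 30.0   # assumed: local rule threshold


def process_window(samples, ring):
    """Run one non-empty window through buffer, filter, aggregate, infer, forward."""
    # Buffer: the bounded deque keeps only the most recent raw samples.
    ring.extend(samples)

    # Filter: keep a reading only if it moved at least DEADBAND
    # away from the last reading that was kept.
    changes = []
    for value in samples:
        if not changes or abs(value - changes[-1]) >= DEADBAND:
            changes.append(value)

    # Aggregate: summarise the full window instead of sending every sample.
    summary = {
        "min": min(samples),
        "max": max(samples),
        "count": len(samples),
        "mean": fmean(samples),
    }

    # Infer: a simple local rule stands in for a model.
    alert = summary["max"] > ALERT_LIMIT

    # Forward: attach retained raw evidence only when an alert needs it.
    payload = {**summary, "changes": changes, "alert": alert}
    if alert:
        payload["raw_evidence"] = list(ring)
    return payload


ring = deque(maxlen=4)  # assumed ring size; real buffers hold whole windows
payload = process_window([20.0, 20.1, 20.2, 21.0, 31.5, 21.0], ring)
```

In this sketch the aggregates are computed over every raw sample, while the deadband filter only decides which individual readings are worth forwarding. The ring buffer is what lets the alert carry pre-event context.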

| Pressure | Edge Action | Forwarded Evidence | Retained Locally |
| --- | --- | --- | --- |
| Bandwidth | Filter unchanged readings, aggregate windows, compress time series, and forward anomalies. | Aggregate value, count, window bounds, quality flags, and anomaly id. | Raw samples for recent windows and rejected-event reasons. |
| Latency | Run local rules or TinyML inference before cloud round-trip. | Decision, confidence, model version, feature vector, and timestamp. | Input window, model log, and fallback state. |
| Privacy | Keep raw images, audio, or personal data on site when summaries are enough. | Counts, embeddings, labels, redacted records, or policy-approved extracts. | Encrypted short-term raw evidence with access controls. |
| Resilience | Continue local control and buffering during cloud or network outages. | Backfilled summaries, alert sequence, and outage markers after reconnect. | Offline queue, last-known cloud acknowledgement, and local decision trace. |
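For the privacy pressure, one way to make "policy-approved extracts" concrete is an allowlist applied at the gateway before upload. The sketch below is an assumption for illustration only: `ALLOWED_FIELDS`, `apply_allowlist`, and the record field names are hypothetical, and a real policy would normally be loaded from reviewed configuration rather than hard-coded.

```python
# Assumed policy: only these fields may leave the site.
ALLOWED_FIELDS = frozenset(
    {"count", "label", "window_start", "window_end", "policy_id"}
)


def apply_allowlist(record, allowed=ALLOWED_FIELDS):
    """Split a record into the fields to forward and the names withheld."""
    forwarded = {key: value for key, value in record.items() if key in allowed}
    withheld = sorted(key for key in record if key not in allowed)
    return forwarded, withheld


record = {
    "count": 3,
    "label": "person",
    "window_start": 0,
    "window_end": 60,
    "frame_jpeg": b"\xff\xd8",  # raw image bytes that should stay local
}
forwarded, withheld = apply_allowlist(record)
```

Returning the withheld field names alongside the forwarded payload gives the gateway something to log locally, so an auditor can see what was kept on site without seeing the data itself.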

27.3 Calculate Data Reduction First

Edge processing should be justified with a workload calculation. Start with raw sample rate, bytes per sample, number of devices, and the network budget. Then estimate the forwarded summary, feature, alert, and raw-evidence volumes. If the edge result is still too large, change the window length, feature set, anomaly policy, compression, or local retention before scaling the cloud path.
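A rough budget check along these lines can be written before any pipeline exists. The sketch below is an assumption-laden estimate, not a sizing tool: the function names, the alert and raw-clip counts, the payload sizes, and the 125,000 bytes/s uplink are all illustrative values, and the check uses daily averages, so it says nothing about bursts.

```python
SECONDS_PER_DAY = 86_400


def forwarded_bytes_per_day(devices, feature_bytes_per_s, alerts_per_day,
                            alert_bytes, clips_per_day, clip_bytes):
    """Estimate daily upstream volume for the whole fleet, in bytes."""
    features = devices * feature_bytes_per_s * SECONDS_PER_DAY
    alerts = devices * alerts_per_day * alert_bytes
    raw_evidence = devices * clips_per_day * clip_bytes
    return features + alerts + raw_evidence


def fits_budget(bytes_per_day, uplink_bytes_per_s, max_utilisation):
    """True when the daily volume stays within the allowed share of the uplink."""
    return bytes_per_day <= uplink_bytes_per_s * SECONDS_PER_DAY * max_utilisation


# Illustrative fleet: every number here is an assumption for the example.
total = forwarded_bytes_per_day(
    devices=1_000, feature_bytes_per_s=80,
    alerts_per_day=5, alert_bytes=200,
    clips_per_day=2, clip_bytes=40_000,
)
within_half = fits_budget(total, uplink_bytes_per_s=125_000, max_utilisation=0.5)
within_most = fits_budget(total, uplink_bytes_per_s=125_000, max_utilisation=0.8)
```

With these assumed numbers the feature stream dominates the total, so shortening the feature vector or lengthening the window would help more than trimming alerts.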

Different data types need different reductions. Temperature sensors may use change-threshold filtering and hourly aggregates. Vibration sensors often compute RMS, peak, crest factor, spectral bands, or anomaly scores at the gateway. Cameras may run local object detection on an NVIDIA Jetson or similar accelerator and forward counts, tracks, labels, and event clips rather than continuous video.
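The time-domain vibration features named above are cheap to compute at a gateway. The following is a minimal sketch using only the Python standard library; the function name `vibration_features` and the four-sample window are assumptions for illustration, and spectral-band features are left out because they would need an FFT library.

```python
import math


def vibration_features(window):
    """Return RMS, peak, and crest factor for one non-empty window of samples."""
    rms = math.sqrt(sum(x * x for x in window) / len(window))
    peak = max(abs(x) for x in window)
    # Crest factor is peak divided by RMS; guard against an all-zero window.
    crest_factor = peak / rms if rms > 0 else 0.0
    return {"rms": rms, "peak": peak, "crest_factor": crest_factor}


features = vibration_features([3.0, -4.0, 0.0, 0.0])
```

A rising crest factor with a steady RMS is commonly read as a sign of impulsive events, which is why the two are often forwarded together rather than either one alone.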

Worked example: vibration features instead of raw samples
sensor count: 1,000 vibration sensors
sample rate: 10,000 samples/s per sensor
sample size: 2 bytes

raw data rate:
1,000 * 10,000 * 2 = 20,000,000 bytes/s = 20 MB/s

raw daily volume:
20 MB/s * 86,400 s/day = 1,728,000 MB/day = 1.728 TB/day

edge feature stream:
per sensor per second: 20 features * 4 bytes = 80 bytes/s
fleet feature rate: 1,000 * 80 = 80,000 bytes/s = 80 KB/s
daily feature volume: 80 KB/s * 86,400 = 6,912,000 KB/day = 6.912 GB/day

reduction from features alone:
1,728 GB/day / 6.912 GB/day = 250x reduction

design implication:
Forwarding RMS, peak, spectral-band energy, and anomaly score can preserve the
maintenance signal while avoiding continuous raw upload. Keep short raw windows
locally around high-risk anomalies so engineers can verify the feature evidence.
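The arithmetic in the worked example can be reproduced with a short script, which makes it easy to rerun with a different sample rate or feature count. The function name `reduction_estimate` is an assumption for illustration, and the figures use decimal units (1 GB = 10^9 bytes) to match the example.

```python
SECONDS_PER_DAY = 86_400


def reduction_estimate(sensors, sample_rate_hz, bytes_per_sample,
                       features_per_s, bytes_per_feature):
    """Return (raw bytes/day, feature bytes/day, reduction ratio) for a fleet."""
    raw_per_day = sensors * sample_rate_hz * bytes_per_sample * SECONDS_PER_DAY
    feature_per_day = sensors * features_per_s * bytes_per_feature * SECONDS_PER_DAY
    return raw_per_day, feature_per_day, raw_per_day / feature_per_day


# Inputs taken from the worked example above.
raw_per_day, feature_per_day, ratio = reduction_estimate(
    sensors=1_000, sample_rate_hz=10_000, bytes_per_sample=2,
    features_per_s=20, bytes_per_feature=4,
)
```

The ratio does not depend on the sensor count, only on the per-sensor raw rate and feature rate, so the same 250x figure applies to one sensor or to the whole fleet.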

Do not assume edge reduction is either too lossy to trust or safe by default. It is safe only when the forwarded features are enough for the decision and the retained raw window is enough for audit, debugging, and model improvement.

| Workload | Local Processing | Cloud Payload | Check Before Shipping |
| --- | --- | --- | --- |
| Temperature telemetry | Threshold filtering, hourly aggregates, calibration check, and missing-sensor detection. | Window aggregates, quality flags, and exceptions. | Can a fast change be missed between aggregates? |
| Vibration monitoring | FFT bands, RMS, crest factor, envelope features, and anomaly score. | Feature vector, model version, alert id, and selected raw window pointer. | Do features preserve the fault signatures engineers need? |
| Camera analytics | Object detection, tracking, redaction, clip extraction, and local retention. | Counts, classes, tracks, thumbnails, or event clips allowed by policy. | Are privacy, false positives, and evidence retention handled? |
| Mobile gateway | Compression, batching, offline queue, local rules, and reconnect backfill. | Summaries, alerts, sequence numbers, and outage markers. | Can cloud consumers distinguish real silence from offline buffering? |
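The mobile gateway check is easier to answer when every message carries a sequence number and a backfill flag. The sketch below shows one possible gateway-side queue; the class `OfflineQueue`, its field names, and the drop-oldest policy are assumptions for illustration, and `send` is a hypothetical callable standing in for whatever publish function the gateway uses.

```python
from collections import deque


class OfflineQueue:
    """Gateway-side queue that numbers every message and marks backfilled ones."""

    def __init__(self, max_items):
        self._queue = deque()
        self._max_items = max_items
        self._seq = 0
        self.dropped = 0

    def publish(self, payload, event_time, online, send):
        # Every message gets a sequence number, even if it is later dropped,
        # so the cloud can detect gaps.
        self._seq += 1
        message = {"seq": self._seq, "event_time": event_time,
                   "payload": payload, "backfill": not online}
        if online:
            self.flush(send)  # drain older buffered messages first
            send(message)
            return
        if len(self._queue) == self._max_items:
            self._queue.popleft()  # bounded storage: drop the oldest message
            self.dropped += 1
        self._queue.append(message)

    def flush(self, send):
        while self._queue:
            send(self._queue.popleft())


sent = []
queue = OfflineQueue(max_items=2)
queue.publish("a", 1, True, sent.append)
queue.publish("b", 2, False, sent.append)
queue.publish("c", 3, False, sent.append)
queue.publish("d", 4, False, sent.append)   # queue full, "b" is dropped
queue.publish("e", 5, True, sent.append)    # reconnect: backfill, then live
```

A cloud consumer that sees sequence numbers 1, 3, 4, 5 knows that message 2 existed and was lost, while the backfill flag on 3 and 4 tells it those readings describe the past rather than current behavior.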

27.4 Versioned Evidence for Edge

Edge processing changes the evidence that reaches the cloud, so correctness depends on versioning and synchronization. A feature value is meaningful only with its window bounds, units, sensor calibration, firmware version, model version, and quality state. A local anomaly score is meaningful only with the threshold, model, input window, and fallback behavior that produced it.
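One way to keep that context attached is to make it part of the record type, so a feature value cannot be built without it. The sketch below is a hypothetical schema: the class `FeatureEvidence`, its field names, and the example identifiers are assumptions for illustration, not a standard format.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class FeatureEvidence:
    """A feature value together with the context that gives it meaning."""
    sensor_id: str
    feature: str
    value: float
    unit: str
    window_start: float   # event time, seconds
    window_end: float
    calibration_id: str
    firmware_version: str
    model_version: str
    threshold: float
    quality: str          # for example "good", "suspect", or "missing"

    def __post_init__(self):
        if self.window_end <= self.window_start:
            raise ValueError("window_end must be after window_start")
        for name in ("unit", "calibration_id", "firmware_version", "model_version"):
            if not getattr(self, name):
                raise ValueError(f"{name} is required context")

    @property
    def exceeds_threshold(self):
        return self.value >= self.threshold


record = FeatureEvidence(
    sensor_id="pump-7", feature="rms", value=4.2, unit="mm/s",
    window_start=0.0, window_end=1.0, calibration_id="cal-A",
    firmware_version="fw-1", model_version="anomaly-v3",
    threshold=4.0, quality="good",
)
message = asdict(record)  # plain dict, ready for JSON or another encoding
```

Making the record immutable and validating it at construction means a score with a missing model version fails at the gateway, where it can be fixed, rather than in a cloud dashboard weeks later.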

Edge systems also have time and delivery risks. A gateway may batch data during an outage, then reconnect and backfill older summaries. Cloud pipelines must separate event time from arrival time so late backfill does not look like current behavior. If the edge performs local control, the cloud needs sequence numbers, acknowledgements, and conflict rules so replayed or delayed messages do not trigger incorrect actions.
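The separation of event time from arrival time can be illustrated with a small cloud-side grouping step. This is a simplified sketch, not how any particular stream processor works: the function `assign_event_windows`, the record field names, and the 60-second window are assumptions, and real pipelines also need a policy for how late a record may be before its window is closed.

```python
from collections import defaultdict


def assign_event_windows(records, window_s):
    """Group records by event-time window, skipping replayed duplicates."""
    windows = defaultdict(list)
    seen = set()
    for record in records:
        key = (record["gateway"], record["seq"])
        if key in seen:
            continue  # same message delivered twice after a reconnect
        seen.add(key)
        # Window by when the event happened, not when it arrived.
        start = record["event_time"] // window_s * window_s
        lateness = record["arrival_time"] - record["event_time"]
        windows[start].append({**record, "lateness_s": lateness})
    return dict(windows)


records = [
    {"gateway": "gw1", "seq": 1, "event_time": 100, "arrival_time": 101},
    {"gateway": "gw1", "seq": 2, "event_time": 130, "arrival_time": 500},  # backfill
    {"gateway": "gw1", "seq": 2, "event_time": 130, "arrival_time": 502},  # duplicate
    {"gateway": "gw1", "seq": 3, "event_time": 495, "arrival_time": 501},
]
windows = assign_event_windows(records, window_s=60)
```

Had the backfilled record been windowed by arrival time, it would have landed in the same window as the live reading at 495 seconds and looked like current behavior.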

Worked example: local latency versus cloud round trip
local feature extraction: 12 ms
local model inference: 8 ms
local control action: 5 ms

edge decision latency:
12 + 8 + 5 = 25 ms

cloud path:
uplink queue and network: 80 ms
cloud ingestion and stream processing: 45 ms
decision service: 20 ms
downlink and gateway apply: 90 ms

cloud decision latency:
80 + 45 + 20 + 90 = 235 ms

design implication:
If a machine-protection action must occur inside 50 ms, the cloud path is too
slow for the control decision. The edge can act locally, then send the decision,
features, model version, and raw-window pointer upstream for audit and learning.
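The same comparison can be kept as a small check that runs whenever a stage latency changes. The names `path_latency_ms`, `meets_deadline`, and the stage keys are assumptions for illustration; the stage values and the 50 ms deadline come from the worked example above.

```python
def path_latency_ms(stages):
    """Sum the per-stage latencies of one decision path, in milliseconds."""
    return sum(stages.values())


def meets_deadline(stages, deadline_ms):
    """True when the whole path completes within the deadline."""
    return path_latency_ms(stages) <= deadline_ms


EDGE_PATH = {"feature_extraction": 12, "model_inference": 8, "control_action": 5}
CLOUD_PATH = {
    "uplink_queue_and_network": 80,
    "ingestion_and_stream_processing": 45,
    "decision_service": 20,
    "downlink_and_gateway_apply": 90,
}
DEADLINE_MS = 50

edge_ok = meets_deadline(EDGE_PATH, DEADLINE_MS)
cloud_ok = meets_deadline(CLOUD_PATH, DEADLINE_MS)
```

The example treats each stage as a single fixed number. For a protection action, it is usually safer to feed in worst-case or high-percentile measurements rather than averages.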

Model Version: Every edge score or class label should include model id, threshold, feature definition, and input window.

Time Semantics: Publish event time, arrival time, batch time, and outage markers so cloud windows remain correct.

Replay Window: Store short raw windows around anomalies for engineer review, model retraining, and incident evidence.

Fallback: Define what the gateway does when the model, network, clock, or cloud acknowledgement is unavailable.

| Risk | Failure Mode | Control | Evidence to Forward |
| --- | --- | --- | --- |
| Model drift | Edge scores become less reliable after equipment, environment, or sensor changes. | Monitor score distribution, retrain with reviewed raw windows, and version model deployments. | Model id, feature vector summary, confidence, reviewed label, and drift metric. |
| Clock skew | Backfilled data lands in the wrong stream window or dashboard period. | Synchronize clocks, publish event time and arrival time, and mark outage backfill. | Event timestamp, gateway timestamp, arrival timestamp, and batch id. |
| Privacy leak | Raw images, audio, or personal identifiers are forwarded when summaries would suffice. | Redact locally, enforce allowlists, and audit payload classes before upload. | Policy id, redaction state, payload class, and reviewer approval for exceptions. |
| Unsafe replay | Delayed commands or duplicate alerts trigger actions twice. | Use sequence numbers, idempotent commands, expiry times, and acknowledgement state. | Command id, sequence, expiry, local state, and acknowledgement result. |
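The unsafe-replay controls can be combined in a small gate that a gateway consults before applying any command. The sketch below is one possible arrangement under stated assumptions: the class `CommandGate`, the command field names, and the returned status strings are hypothetical, and the applied-id set is kept in memory here, whereas a real gateway would need to persist it across restarts.

```python
class CommandGate:
    """Decide whether a received command may be applied exactly once."""

    def __init__(self):
        self._applied_ids = set()
        self._last_seq = 0

    def accept(self, command, now):
        """Return 'applied', 'duplicate', 'expired', or 'stale'."""
        if command["id"] in self._applied_ids:
            return "duplicate"   # idempotency: same command seen before
        if now > command["expires_at"]:
            return "expired"     # delayed past its useful lifetime
        if command["seq"] <= self._last_seq:
            return "stale"       # older than a command already applied
        self._applied_ids.add(command["id"])
        self._last_seq = command["seq"]
        return "applied"


gate = CommandGate()
results = [
    gate.accept({"id": "a", "seq": 1, "expires_at": 100}, now=50),
    gate.accept({"id": "a", "seq": 1, "expires_at": 100}, now=55),  # replayed
    gate.accept({"id": "b", "seq": 2, "expires_at": 60}, now=70),   # too late
    gate.accept({"id": "c", "seq": 1, "expires_at": 200}, now=70),  # out of order
    gate.accept({"id": "d", "seq": 3, "expires_at": 200}, now=70),
]
```

The returned status is the acknowledgement result the table asks the gateway to forward, so the cloud can see not only that a command was sent but whether it was applied, ignored, or rejected.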

27.5 Summary

Edge processing makes IoT big-data systems practical by deciding what must cross the network. Local filtering, aggregation, compression, feature extraction, TinyML inference, short raw-window retention, and outage buffering reduce bandwidth while preserving evidence. A good edge-to-cloud design forwards summaries, alerts, features, model versions, timestamps, quality flags, and selected raw evidence, not an unbounded raw stream.

Key Takeaway

Use edge processing when raw upload is too large, too slow, too private, too expensive, or too fragile. The edge result must still be auditable: include event time, model or rule version, feature definitions, confidence, quality state, and a bounded raw window for replay when the decision matters.