29  Big Data Pipelines

In 60 Seconds

Big data pipelines use Lambda Architecture to combine batch processing (accurate historical analysis) and stream processing (real-time approximate results), with windowing strategies dividing infinite data streams into processable chunks and watermarks handling late-arriving data.

Learning Objectives

After completing this chapter, you will be able to:

  • Design Lambda architecture combining batch and stream processing
  • Implement stream processing with windowing strategies
  • Handle late-arriving data with watermarks
  • Choose between batch, stream, and hybrid processing approaches

Key Concepts

  • Data ingestion: The first stage of a pipeline that receives raw sensor data from devices, validates format and completeness, and routes it to the appropriate storage or processing system.
  • ETL (Extract, Transform, Load): A pipeline pattern that extracts data from sources, transforms it (cleaning, aggregation, format conversion), and loads it into a target system for analysis.
  • Message broker: A middleware component (e.g., Apache Kafka, AWS Kinesis) that buffers and routes high-velocity IoT data streams between producers (sensors) and consumers (processors).
  • Schema registry: A centralised service that stores and validates data schemas, ensuring that sensor payloads conform to expected formats as the system evolves.
  • Dead-letter queue: A storage location for messages that fail processing validation, preserving them for investigation rather than silently dropping them.
  • Pipeline orchestration: The automated scheduling, dependency management, and failure handling of multi-step data workflows using tools like Apache Airflow or AWS Step Functions.
  • Data lineage: The ability to trace a derived analytics result back through every transformation step to its original sensor reading, essential for debugging and regulatory compliance.

A big data pipeline is like a water treatment plant for information. Raw sensor data flows in from thousands of IoT devices, gets filtered and cleaned, is processed and transformed, and flows out as useful insights. Each stage handles a specific job, and the whole system runs continuously and automatically.

29.1 Lambda Architecture

Lambda architecture processes IoT data through parallel layers that work together: a batch layer for accuracy and a speed layer for low latency, with a serving layer merging their results to provide both fast and correct answers.

Step-by-Step Process:

  1. Data Arrives: Raw sensor event (e.g., “temperature: 23.5°C at 14:32:15”)
  2. Split Path: Event simultaneously flows to BOTH batch and speed layers
  3. Speed Layer Processing (Milliseconds):
    • Flink reads event from Kafka
    • Updates in-memory window aggregate (last 5 minutes average)
    • Writes approximate result to serving layer
    • User sees update within 100ms
  4. Batch Layer Processing (Hours):
    • Event written to HDFS data lake
    • Nightly Spark job recomputes ALL historical aggregates
    • Overwrites batch views with exact corrected values
    • Includes late-arriving data that speed layer missed
  5. Serving Layer Merge:
    • Query: “What was average temp yesterday 14:30-14:35?”
    • If asking about yesterday (complete): serve from batch views (exact)
    • If asking about last 10 minutes (incomplete): serve from speed views (approximate)
    • For recent completed windows: batch view replaces speed view automatically

Why This Works: Speed layer sacrifices accuracy for speed (drops late data). Batch layer sacrifices speed for accuracy (waits for all data). Serving layer gives you whichever is available - fast approximate now, exact answer later. Users get immediate feedback, and wrong answers self-correct within hours.

The Lambda Architecture solves a fundamental problem: real-time analytics vs accurate historical analysis. You need both, but they have conflicting requirements.

Lambda Architecture diagram showing IoT data flowing simultaneously to a batch layer for accurate historical processing and a speed layer for real-time approximate results, with both feeding into a unified serving layer
Figure 29.1: Lambda Architecture: Batch Layer for Accuracy, Speed Layer for Latency

Lambda Architecture: IoT data flows to both batch layer (accurate historical analysis) and speed layer (real-time approximate), merged in serving layer to provide low-latency queries with eventual accuracy.

29.1.1 Why Lambda Architecture?

Requirement | Batch Only | Stream Only | Lambda (Both)
Latency | Hours | Milliseconds | Milliseconds
Accuracy | 100% correct | May miss late data | 100% eventually
Cost | Low | High | Medium
Complexity | Simple | Medium | High

29.1.2 Lambda Implementation Example

# Assumes an active SparkSession named `spark`
from pyspark.sql.functions import sum, window

# BATCH LAYER: Complete recompute (accurate, slow)
def batch_layer():
    # Read entire historical dataset
    all_sensor_data = spark.read.parquet("s3://iot-data/raw/")

    # Full aggregation (correct answer)
    daily_totals = all_sensor_data \
        .groupBy("date", "sensor_id") \
        .agg(sum("energy_kwh").alias("total_energy"))

    # Write to batch views (replaces previous)
    daily_totals.write.mode("overwrite") \
        .parquet("s3://iot-data/batch-views/daily-energy/")

# SPEED LAYER: Incremental updates (approximate, fast)
def speed_layer():
    # Subscribe to the live sensor stream from Kafka
    # (kafka.bootstrap.servers is required; broker address is a placeholder)
    streaming_data = spark.readStream \
        .format("kafka") \
        .option("kafka.bootstrap.servers", "broker:9092") \
        .option("subscribe", "sensor-readings") \
        .load()

    # Incremental aggregation (approximate - may miss late data)
    hourly_totals = streaming_data \
        .withWatermark("timestamp", "10 minutes") \
        .groupBy(window("timestamp", "1 hour"), "sensor_id") \
        .agg(sum("energy_kwh").alias("total_energy"))

    # Write to real-time views
    hourly_totals.writeStream \
        .outputMode("update") \
        .format("delta") \
        .start("s3://iot-data/speed-views/hourly-energy/")

# SERVING LAYER: Merge batch + speed
def serving_layer(query_date):
    if query_date < today():  # today() assumed to return the current date
        # Historical: use accurate batch view
        return spark.read.parquet("s3://iot-data/batch-views/daily-energy/")
    else:
        # Current: use speed layer (approximate)
        return spark.read.parquet("s3://iot-data/speed-views/hourly-energy/")

29.2 Stream Processing with Windowing

IoT data arrives continuously - you can’t wait for “all” data before computing. Windowing divides infinite streams into finite, processable chunks.

Comparison of three stream processing window types: tumbling windows with fixed non-overlapping intervals, sliding windows with overlapping intervals, and session windows grouped by activity gaps
Figure 29.2: Three Window Types: Tumbling, Sliding, and Session

29.2.1 Window Type Comparison

Window Type | Best For | Example
Tumbling | Periodic aggregates (hourly/daily totals) | “Total energy per hour”
Sliding | Moving averages, trend detection | “5-minute average, updated every minute”
Session | User activity, event sequences | “Group clicks until 30s inactivity”

29.2.2 Implementing Windowed Aggregation

# Assumes: from pyspark.sql.functions import window, session_window, avg, count

# Tumbling window: Non-overlapping 1-minute buckets
tumbling_result = sensor_stream \
    .groupBy(window("timestamp", "1 minute"), "sensor_id") \
    .agg(avg("temperature").alias("avg_temp"))

# Sliding window: 5-minute window, slides every 1 minute
sliding_result = sensor_stream \
    .groupBy(window("timestamp", "5 minutes", "1 minute"), "sensor_id") \
    .agg(avg("temperature").alias("avg_temp"))

# Session window: Group events with < 30 second gap
session_result = sensor_stream \
    .groupBy(session_window("timestamp", "30 seconds"), "user_id") \
    .agg(count("*").alias("events_in_session"))

29.3 Handling Late-Arriving Data

IoT networks are imperfect - data arrives late due to network delays, sensor buffering, or connectivity issues. Watermarks define how long to wait for late data.

Watermark handling diagram showing how events arriving within the tolerance window are processed while late events exceeding the watermark threshold are dropped
Figure 29.3: Watermark Handling: Late Events with Different Tolerance Settings

Watermark Strategy for Late Events: Events arriving after the watermark tolerance are dropped; setting appropriate watermark balances completeness (higher tolerance) against latency (lower tolerance).

29.3.1 Watermark Configuration

# Configure watermark: Accept events up to 2 minutes late
sensor_stream_with_watermark = sensor_stream \
    .withWatermark("event_timestamp", "2 minutes") \
    .groupBy(
        window("event_timestamp", "1 minute"),
        "sensor_id"
    ) \
    .agg(avg("temperature").alias("avg_temp"))

# Watermark tradeoffs:
# - 10 seconds: Fast results, may miss delayed sensors
# - 2 minutes: More complete, but delayed output
# - 1 hour: Very complete, but results arrive 1 hour late

29.3.2 Watermark and Processing-Model Tradeoffs

Watermark tolerance trades data completeness against output latency. The scenario below shows how latency requirements drive the choice of processing model for a typical IoT sensor deployment.

Scenario: A smart city has 5,000 parking sensors that send occupancy status (occupied/free) whenever it changes. During rush hour, this generates 50,000 updates/hour. The city needs: 1. Real-time parking availability map (updated within 5 seconds) 2. Daily parking utilization reports for urban planning 3. Monthly revenue tracking for parking meters

Think about: Which processing model should you use for each requirement?

Key Insights:

  1. Real-time Parking Map - Stream Processing
    • Why: Users need immediate information (“Is parking available NOW?”)
    • Latency: 5 seconds acceptable
    • Volume: 50,000 updates/hour = 13.9 updates/second (manageable for streaming)
    • Technology: Apache Kafka + Spark Streaming
    • Real number: Stream processing delivers results in 2-5 seconds vs 1-24 hours for batch
  2. Daily Utilization Reports - Batch Processing
    • Why: Historical analysis doesn’t need real-time processing
    • Latency: 24 hours acceptable (run nightly)
    • Volume: Process full day’s data in one job
    • Technology: Apache Spark batch job scheduled at midnight
    • Real number: Batch processing is 10x cheaper than maintaining real-time streaming for historical queries
  3. Monthly Revenue - Batch Processing
    • Why: Aggregating 30 days of data, accuracy > speed
    • Latency: Can run monthly (1st of each month)
    • Volume: 50,000 events/hour x 720 hours/month = 36 million events
    • Technology: SQL query on data warehouse
    • Real number: Monthly batch job costs $5 vs $150/month for continuous stream processing
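The volume figures in these insights follow from quick arithmetic; this sketch uses only the rates stated in the scenario:

```python
# Back-of-envelope check of the parking-scenario volumes (rates from the text)
UPDATES_PER_HOUR = 50_000          # rush-hour update rate
HOURS_PER_MONTH = 720

updates_per_second = UPDATES_PER_HOUR / 3600
events_per_month = UPDATES_PER_HOUR * HOURS_PER_MONTH

print(f"{updates_per_second:.1f} updates/second")        # ~13.9, easily handled by streaming
print(f"{events_per_month / 1e6:.0f}M events/month")     # 36M
```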

Lambda Architecture Cost-Benefit: Smart parking system with 5,000 sensors reporting every minute. Daily messages: \(5000 \times 1440 = 7.2M\) messages. Monthly: \(7.2M \times 30 = 216M\) messages.

Speed layer (Kafka + Flink real-time): \(216M \times \$0.10/\text{million} = \$21.60\) messaging + \(\$150\) compute (2 workers 24/7) = \(\$171.60/\text{month}\).

Batch layer (Spark nightly): Storage \(30 \times 1.08\text{ GB/day} = 32.4\text{ GB} \times \$0.023 = \$0.75\) + nightly job \(\$3\) = \(\$3.75/\text{month}\).

Total Lambda: \(\$171.60 + \$3.75 = \$175.35/\text{month}\).

Stream-only alternative: Run Flink 24/7 for historical queries = \(\$171.60 + \$100\) (additional history storage) = \(\$271.60\) (55% more expensive). Batch-only: Wait 24 hours for insights = unacceptable for real-time parking availability.

Lambda provides real-time speed layer for \(\$171.60\) + exact historical accuracy for \(\$3.75\), optimal cost-performance balance.
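The cost-benefit arithmetic above can be reproduced directly; this sketch hard-codes the unit prices the example assumes:

```python
# Reproduces the Lambda cost-benefit arithmetic (unit prices from the text)
MESSAGES_PER_DAY = 5_000 * 1_440             # 5,000 sensors, one message/minute = 7.2M
messages_per_month = MESSAGES_PER_DAY * 30   # 216M

speed_messaging = messages_per_month / 1e6 * 0.10  # $0.10 per million messages
speed_compute = 150.0                              # 2 workers running 24/7
speed_total = speed_messaging + speed_compute      # $171.60

batch_storage = round(30 * 1.08 * 0.023, 2)        # 32.4 GB at $0.023/GB ~= $0.75
batch_compute = 3.0                                # nightly Spark job
batch_total = batch_storage + batch_compute        # $3.75

print(f"Lambda total: ${speed_total + batch_total:.2f}/month")  # $175.35
```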

Cost Comparison:

Approach | Infrastructure | Cost/Month | Use Case
Stream (All 3) | Kafka + Spark Streaming 24/7 | $500 | Overkill for batch needs
Batch (All 3) | Nightly Spark jobs | $50 | Too slow for real-time map
Hybrid (Stream + Batch) | Streaming for map, batch for reports | $200 | Optimal cost/performance

Decision Rule:

Use STREAM processing when:
- Latency requirement < 1 minute
- Results needed continuously
- Data arrives in real-time
- Immediate action required (alerts, dashboards)

Use BATCH processing when:
- Latency requirement > 1 hour
- Results needed periodically (daily/weekly/monthly)
- Processing full datasets for accuracy
- Complex analytics on historical data

Use LAMBDA (Both) when:
- Need real-time + historical views
- Real-time alerts + periodic reports
- Different latency needs for different queries
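The decision rules above can be captured in a small helper; the function name and thresholds here are illustrative, taken straight from the rules:

```python
# Hypothetical helper encoding the stream/batch/lambda decision rules above
def choose_processing_model(latency_seconds: float, needs_exact_history: bool) -> str:
    """Map a latency requirement (and need for exact history) to a processing model."""
    if latency_seconds < 60:                       # sub-minute latency demands streaming
        return "lambda" if needs_exact_history else "stream"
    if latency_seconds > 3600:                     # hour-plus latency: batch is fine
        return "batch"
    return "lambda"                                # mixed requirements in between

print(choose_processing_model(5, needs_exact_history=False))      # stream
print(choose_processing_model(86_400, needs_exact_history=False)) # batch
print(choose_processing_model(5, needs_exact_history=True))       # lambda
```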

29.4 Batch vs Stream Processing Decision

Factor | Use Batch | Use Stream | Use Lambda (Both)
Latency Need | > 1 hour | < 1 minute | Mixed (real-time + reports)
Data Volume | TB-PB scale | MB-GB/sec | Both
Query Pattern | Historical trends | Live dashboards | Real-time + historical
Cost (1 TB/day) | ~$1,300/month | ~$10,800/month | ~$1,500/month (hybrid)
Example | Monthly reports | Fraud detection | E-commerce analytics

29.4.1 Cost Breakdown Example (1 TB/day processing)

Batch Only:

Storage: 30 TB/month x $0.023/GB = $690/month
Compute: 1 daily Spark job x 100 nodes x 2 hours x $0.10/hour = $20/day = $600/month
Total: $1,290/month

Stream Only:

Kafka: 3 brokers x $200/month = $600/month
Spark Streaming: 50 workers x 24/7 x $0.20/hour = $7,200/month
Storage (30 days): 30 TB x $0.10/GB = $3,000/month
Total: $10,800/month (8.4x more expensive!)

Lambda Architecture (Hybrid):

Stream Layer: Last 24 hours (real-time alerts) = $300/month
Batch Layer: Historical (daily aggregates) = $1,000/month
Serving Layer: Query merging = $200/month
Total: $1,500/month
Best of both: Real-time insights + historical accuracy
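A few lines verify the breakdown (unit prices as stated above; the 8.4x ratio falls out of the arithmetic):

```python
# Verifies the 1 TB/day cost breakdown using the text's unit prices
GB_PER_TB = 1_000

batch = 30 * GB_PER_TB * 0.023 + 20 * 30                 # storage + daily Spark job
stream = 3 * 200 + 50 * 24 * 30 * 0.20 + 30 * GB_PER_TB * 0.10  # Kafka + workers + storage

print(f"Batch-only:  ${batch:,.0f}/month")   # $1,290
print(f"Stream-only: ${stream:,.0f}/month")  # $10,800
print(f"Ratio: {stream / batch:.1f}x")       # 8.4x
```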


29.5 Knowledge Checks

The mistake: Scheduling hourly or daily batch jobs to process data that requires real-time or near-real-time responses, causing missed SLAs and frustrated users.

Symptoms:

  • Users complain that dashboards show “stale” data (hours old)
  • Alerts for critical events arrive long after the incident
  • Fraud detection catches issues after damage is done
  • Business requests for “real-time” lead to increasingly frequent batch jobs (hourly -> every 15 min -> every 5 min)

Why it happens: Batch processing is familiar (SQL queries, scheduled cron jobs) and cheaper. Teams underestimate latency requirements or don’t distinguish between “nice to have real-time” and “must have real-time.” Stream processing seems complex and expensive.

The fix: Classify your use cases by actual latency requirements:

# Use Case Classification Framework
latency_requirements = {
    # BATCH OK (> 1 hour latency acceptable)
    "monthly_reports": "batch",
    "historical_analysis": "batch",
    "ML_model_training": "batch",

    # STREAM REQUIRED (< 1 minute latency needed)
    "fraud_detection": "stream",
    "equipment_alerts": "stream",
    "live_dashboards": "stream",
    "surge_pricing": "stream",

    # HYBRID (Lambda architecture)
    "inventory_tracking": "lambda",
    "user_analytics": "lambda"
}

Prevention: During requirements gathering, explicitly ask: “If this data is 1 hour old, what’s the business impact?” If the answer involves money loss, safety risk, or customer complaints, you need stream processing.

Imagine you have two helpers sorting your mail – one is super careful but slow, and the other is lightning-fast but sometimes makes mistakes!

29.5.1 The Sensor Squad Adventure: The Two Mail Sorters

Sammy the Sensor was getting buried in letters (data!). “I’m sending SO many temperature readings every second, nobody can keep up!” she cried.

Max the Microcontroller had a clever idea. “Let’s hire TWO helpers! Speedy Steve will quickly glance at each letter as it arrives and shout out the important stuff right away – like if the temperature is dangerously hot. Meanwhile, Careful Clara will take ALL the letters at the end of the day, sort them perfectly, and make beautiful reports.”

Lila the LED asked, “But what if Speedy Steve misses something?” Max smiled. “That’s the beauty of it! Clara checks everything again later and fixes any mistakes. So we get fast warnings AND perfect reports!”

Bella the Battery added, “And we use ‘windows’ – like sorting letters into hourly piles instead of looking at every single one. It’s way easier to count 24 piles than a million letters!”

The team also learned about “watermarks” – a rule that says, “Wait 2 extra minutes for any late letters before closing the pile.” That way, even if a letter gets delayed in the mail, it still gets counted!

29.5.2 Key Words for Kids

Word | What It Means
Pipeline | A set of steps data travels through, like a water slide with different sections
Batch | Processing a big pile of data all at once, like doing all your homework after school
Stream | Processing data as it arrives, like answering questions during class
Window | A time bucket that groups data together, like sorting mail into hourly piles
Watermark | A rule for how long to wait for late data before moving on

29.6 Worked Example: Smart Grid Pipeline for 2 Million Meters

Worked Example: Designing a Lambda Pipeline for Utility Smart Meters

Scenario: Enel, one of Europe’s largest utilities, deploys 2 million smart meters across northern Italy. Meters report consumption every 15 minutes. The utility needs both real-time demand monitoring (for grid balancing) and accurate monthly billing (for regulatory compliance).

Given:

  • 2,000,000 smart meters, 1 reading per 15 minutes = 192 million readings/day
  • Each reading: 120 bytes (meter ID, timestamp, kWh, voltage, power factor, reactive power)
  • Daily raw data: 192M x 120 bytes = 23 GB/day = 8.4 TB/year
  • Real-time requirement: Grid operators need aggregated demand per substation every 30 seconds
  • Billing requirement: 100% accurate monthly totals per meter, auditable

Step 1 – Design the speed layer (real-time demand):

  • Ingestion: Apache Kafka cluster (3 brokers), meters publish via MQTT bridge
  • Stream processing: Apache Flink job with 30-second tumbling windows
  • Aggregation: Group by substation (500 substations, ~4,000 meters each)
  • Output: Real-time dashboard showing per-substation demand in MW
  • Latency: Meter reading to dashboard update in under 5 seconds
  • Late data policy: Watermark of 60 seconds (meters occasionally report with network delay)

Processing load: 192M readings / 86,400 seconds = 2,222 readings/second average, with morning peak at 4x = ~9,000 readings/second.
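The ingest rates quoted here follow directly from the given numbers; a quick sketch:

```python
# Sanity-check of the smart-grid ingest rates (inputs from the worked example)
METERS = 2_000_000
READINGS_PER_DAY = METERS * (24 * 60 // 15)   # one reading per 15 min -> 192M/day

avg_rate = READINGS_PER_DAY / 86_400          # ~2,222 readings/second
peak_rate = avg_rate * 4                      # morning peak at 4x -> ~9,000/second
daily_gb = READINGS_PER_DAY * 120 / 1e9       # 120-byte readings -> ~23 GB/day

print(f"{avg_rate:.0f} r/s avg, {peak_rate:.0f} r/s peak, {daily_gb:.1f} GB/day")
```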

Step 2 – Design the batch layer (billing accuracy):

  • Storage: Apache Parquet files on HDFS, partitioned by date and meter region
  • Processing: Apache Spark job runs nightly at 02:00 (low-demand period)
  • Computation: Sum kWh per meter per billing period, apply time-of-use tariffs, compute reactive power charges
  • Validation: Cross-check totals against speed layer aggregates (must match within 0.01%)
  • Output: Per-meter billing records, substation loss calculations, regulatory reports

Nightly batch processes 23 GB in approximately 12 minutes on a 20-node Spark cluster.

Step 3 – Cost comparison of three architectures:

Architecture | Infrastructure | Monthly Cost | Real-Time? | Billing Accurate?
Batch only (Spark nightly) | 20-node HDFS + Spark | EUR 3,200 | No (24h delay) | Yes
Stream only (Flink 24/7) | 8-node Kafka + Flink | EUR 5,800 | Yes (<5s) | No (late data causes 0.3% error)
Lambda (Spark + Flink) | 20-node HDFS + 8-node Kafka/Flink | EUR 6,500 | Yes (<5s) | Yes (batch corrects errors)

Step 4 – Quantify the value of real-time:

Without real-time demand visibility, grid operators must maintain 15% spinning reserve (backup generators running idle). With 30-second demand visibility:

  • Spinning reserve reduced to 8% (confident in demand forecasts)
  • Northern Italy peak demand: 12 GW
  • Reserve reduction: 12 GW x (15% - 8%) = 840 MW less spinning reserve
  • Gas turbine cost for spinning reserve: EUR 30/MWh
  • Daily saving: 840 MW x 24h x EUR 30/MWh x 30% utilization = EUR 181,000/day
  • Monthly saving: EUR 5.4 million
  • Pipeline cost: EUR 6,500/month
  • ROI: 830:1
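The savings estimate can be recomputed from the stated inputs; the exact ROI lands near the rounded 830:1 figure:

```python
# Reproduces the spinning-reserve saving estimate (inputs from the text)
peak_mw = 12 * 1_000                                  # northern Italy peak: 12 GW
reserve_cut_mw = peak_mw * (0.15 - 0.08)              # 840 MW less spinning reserve

daily_saving = reserve_cut_mw * 24 * 30 * 0.30        # MWh/day x EUR 30/MWh x 30% utilisation
monthly_saving = daily_saving * 30                    # ~EUR 5.4 million

print(f"{reserve_cut_mw:.0f} MW freed")
print(f"EUR {daily_saving:,.0f}/day")                 # ~EUR 181,000/day as in the text
print(f"ROI ~{monthly_saving / 6_500:.0f}:1")         # text rounds via EUR 5.4M to 830:1
```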

Result: The Lambda pipeline costs EUR 6,500/month and enables EUR 5.4 million/month in grid optimization savings. The batch layer ensures 100% billing accuracy (a regulatory requirement); the speed layer provides 30-second grid visibility that cuts the spinning reserve requirement from 15% to 8%, a reduction of nearly half.

Key Insight: For utilities, the Lambda architecture is not optional – regulatory billing accuracy demands batch processing, while grid stability demands real-time stream processing. The incremental cost of Lambda over batch-only (EUR 3,300/month) is trivial compared to the value of real-time grid visibility.

Common Pitfalls

Missing schema validation: Without schema enforcement at the entry point, malformed sensor payloads propagate into storage and corrupt downstream analytics. Always validate and reject non-conforming messages at ingestion, routing rejects to a dead-letter queue.

Blocking pipeline stages: Sensors transmit at irregular intervals and with variable latency. Synchronous pipeline stages that block waiting for data will accumulate queue backlogs. Design all stages as asynchronous, event-driven consumers.

Silent data loss: A pipeline that silently drops 5% of sensor messages will produce systematically biased analytics without any visible error. Instrument every stage with message count, lag, and error rate metrics, and set alerts on anomalous drops.

Untested failure recovery: IoT pipelines run 24/7 with hardware failures, network partitions, and data bursts. Test failure recovery explicitly: kill a stage mid-batch and verify the pipeline resumes correctly from the last checkpoint.

29.7 Summary

  • Lambda architecture combines batch processing (accurate historical analysis) and stream processing (real-time approximate), merging results in a serving layer for low-latency queries with eventual accuracy.
  • Windowing strategies divide infinite streams into processable chunks: tumbling windows for periodic totals, sliding windows for moving averages, and session windows for user activity tracking.
  • Watermarks define how long to wait for late-arriving data - higher tolerance means more complete data but delayed results; lower tolerance means faster results but potential data loss.
  • Cost comparison: Batch-only processing costs $1,290/month for 1 TB/day; stream-only costs $10,800/month (8.4x more); Lambda hybrid provides the best value at $1,500/month with both real-time and historical capabilities.

Implement a simplified Lambda architecture using Python to see how batch and speed layers work together.

from collections import deque

# ============ SPEED LAYER (Real-time approximate) ============
class SpeedLayer:
    def __init__(self, window_size=60):
        self.window = deque(maxlen=window_size)  # Keep last 60 readings
        self.watermark_delay = 5  # Wait 5 seconds for late data

    def process_event(self, event_time, value, arrival_delay):
        """Process event in real-time, may miss late arrivals"""
        if arrival_delay <= self.watermark_delay:
            self.window.append(value)
            return self.get_average()
        else:
            print(f"  [SPEED] Late event dropped: {value} (arrived {arrival_delay}s late)")
            return None

    def get_average(self):
        if not self.window:
            return 0
        return sum(self.window) / len(self.window)

# ============ BATCH LAYER (Accurate historical) ============
class BatchLayer:
    def __init__(self):
        self.all_data = []

    def accumulate(self, timestamp, value):
        """Store all data, even late arrivals"""
        self.all_data.append((timestamp, value))

    def recompute_all(self):
        """Nightly batch job: recompute exact historical average"""
        if not self.all_data:
            return 0
        total = sum(val for _, val in self.all_data)
        return total / len(self.all_data)

# ============ SERVING LAYER (Merge results) ============
class ServingLayer:
    def __init__(self, speed_layer, batch_layer):
        self.speed = speed_layer
        self.batch = batch_layer
        self.last_batch_run = 0

    def query_average(self, use_batch=False):
        """Serve from speed layer for recent, batch for historical"""
        if use_batch:
            return self.batch.recompute_all()
        else:
            return self.speed.get_average()

# ============ SIMULATION ============
speed = SpeedLayer(window_size=10)
batch = BatchLayer()
serving = ServingLayer(speed, batch)

print("=== Simulating IoT Temperature Sensor Stream ===\n")
print("Events arrive out of order, some late...\n")

# (event_time, value, arrival_delay_seconds)
events = [
    (1000, 22.0, 0),    # On-time events
    (1001, 23.5, 0),
    (1002, 24.0, 0),
    (1003, 22.5, 0),
    (1004, 23.0, 0),
    (999, 21.5, 7),     # LATE arrival (7 seconds late)
    (1005, 24.5, 0),
    (1006, 23.8, 0),
    (998, 21.0, 10),    # VERY late (10 seconds late)
]

for event_time, value, delay in events:
    # Both layers receive event
    batch.accumulate(event_time, value)
    speed_avg = speed.process_event(event_time, value, delay)

    if speed_avg is not None:
        print(f"Event: {value}°C | Speed layer avg: {speed_avg:.2f}°C (approximate)")

print("\n--- Nightly Batch Job Runs ---")
batch_avg = batch.recompute_all()
print(f"Batch layer avg: {batch_avg:.2f}°C (exact, includes late data)\n")

print("=== Query Results ===")
print(f"Real-time query (speed layer): {serving.query_average(use_batch=False):.2f}°C")
print(f"Historical query (batch layer): {serving.query_average(use_batch=True):.2f}°C")
print(f"\nDifference: {abs(serving.query_average(True) - serving.query_average(False)):.2f}°C")
print("  (Speed layer missed 2 late events, batch layer caught them)")

What to Observe:

  1. Speed Layer Behavior: Late events (>5 seconds) are dropped for speed - approximate average computed instantly
  2. Batch Layer Accuracy: All events stored, including late arrivals - exact average after recomputation
  3. Serving Layer Trade-off: Real-time queries use approximate speed layer; historical queries use exact batch layer
  4. Self-Correction: After batch run, historical queries show the corrected average that includes late data

Extend It:

  • Add windowing (hourly aggregates instead of all-time average)
  • Implement watermark adjustment (make late tolerance configurable)
  • Simulate Kafka topics for speed layer input
  • Add a third “view” for yesterday’s data (always serve from batch)
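The first “Extend It” item (hourly windowing) could start from a sketch like this, which buckets (timestamp, value) records, like those BatchLayer stores, into tumbling one-hour windows:

```python
from collections import defaultdict

# Sketch of the hourly-windowing extension: tumbling 1-hour buckets over
# (timestamp, value) pairs (names here are illustrative)
def hourly_averages(records):
    buckets = defaultdict(list)
    for ts, value in records:
        buckets[ts // 3600].append(value)   # integer division = tumbling bucket key
    return {hour: sum(vals) / len(vals) for hour, vals in buckets.items()}

readings = [(0, 20.0), (1800, 22.0), (3600, 25.0), (5400, 27.0)]
print(hourly_averages(readings))  # {0: 21.0, 1: 26.0}
```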

Big Data Pipelines connects to:

  • Upstream Dependencies: Data ingestion (Kafka) feeds both batch and stream pipelines
  • Downstream Applications: Pipelines power dashboards (speed layer) and ML training (batch layer)
  • Alternative Patterns: Kappa architecture (stream-only, no batch layer) vs Lambda (both)
  • Foundational Concepts: Windowing and watermarks from stream processing theory

Decision Tree:

Need real-time results?
  ├─ Yes + Historical accuracy matters → Lambda (batch + stream)
  ├─ Yes + Can tolerate approximate → Kappa (stream-only)
  └─ No → Batch-only (Spark scheduled jobs)

Key Insight: Lambda architecture is the “best of both worlds” but at 1.5-2x infrastructure cost. Use it when business requires both real-time alerts AND accurate reports.

External Resources:

  • Nathan Marz and James Warren, “Big Data” (2015) - the book that introduced the Lambda architecture
  • Confluent Stream Processing Guide - Kafka Streams patterns
  • Tyler Akidau, “Streaming Systems” (2018) - Windowing and late data handling

29.8 What’s Next

If you want to… | Read this
Understand the technologies powering these pipelines | Big Data Technologies
Apply edge processing to reduce pipeline input volume | Big Data Edge Processing
Explore real-world pipeline implementations | Big Data Case Studies
Learn anomaly detection within pipelines | Anomaly Detection Overview
Return to the module overview | Big Data Overview