30  Big Data Operations

In 60 Seconds

Operating IoT big data pipelines in production requires monitoring key metrics (message latency, consumer lag, error rates), diagnosing common failures (Kafka lag, OOM errors, schema evolution), and avoiding seven deadly pitfalls including storing everything indefinitely, ignoring late-arriving data, and sending all raw data to the cloud without edge processing.

Learning Objectives

After completing this chapter, you will be able to:

  • Diagnose big data pipeline health issues using key monitoring metrics (latency, lag, error rates)
  • Debug common production issues including Kafka consumer lag, OOM errors, and late-arriving data
  • Implement production monitoring checklists covering infrastructure, data quality, and performance
  • Assess and mitigate the seven deadly pitfalls of IoT big data operations

Key Concepts

  • Batch processing: Processing large, bounded datasets in scheduled jobs (hourly, daily), suitable for analytics that tolerate latency in exchange for high throughput.
  • Stream processing: Continuously processing unbounded data streams as events arrive, required for real-time anomaly detection, dashboards, and time-sensitive control loops.
  • Micro-batch processing: A hybrid approach (used by Spark Streaming) that processes small time windows (seconds) of stream data as mini-batches, balancing latency and throughput.
  • Exactly-once semantics: A processing guarantee that each record is processed exactly once despite node failures or restarts — essential for financial or safety-critical IoT applications.
  • Checkpointing: Periodically saving the state of a stream processor to durable storage so it can resume from a known-good point after a failure rather than reprocessing all historical data.
  • Watermark: A timestamp used by stream processors to determine when all events up to a given time have been received, enabling correct handling of late-arriving sensor data.
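
The watermark mechanics can be sketched in a few lines of plain Python. This is a toy illustration, not a real stream-processing API; the `WindowAggregator` name and its methods are invented for this example:

```python
from collections import defaultdict

class WindowAggregator:
    """Toy event-time windowing with a watermark (illustration only)."""

    def __init__(self, window_sec=60, allowed_lateness_sec=120):
        self.window_sec = window_sec
        self.lateness = allowed_lateness_sec
        self.counts = defaultdict(int)   # window start -> live event count
        self.max_event_time = 0
        self.closed = {}                 # finalized windows -> final count

    def on_event(self, event_time_sec):
        window = (event_time_sec // self.window_sec) * self.window_sec
        if window in self.closed:
            return "dropped-late"        # watermark already closed this window
        self.counts[window] += 1
        self.max_event_time = max(self.max_event_time, event_time_sec)
        self._advance_watermark()
        return "accepted"

    def _advance_watermark(self):
        # Watermark = latest event time seen minus the allowed lateness;
        # any window that ends at or before it is finalized.
        watermark = self.max_event_time - self.lateness
        for w in [w for w in self.counts if w + self.window_sec <= watermark]:
            self.closed[w] = self.counts.pop(w)

agg = WindowAggregator()
agg.on_event(10); agg.on_event(30)   # both land in window [0, 60)
agg.on_event(200)                    # advances watermark to 80, closing window 0
print(agg.closed)                    # {0: 2}
print(agg.on_event(20))              # dropped-late
```

The allowed lateness is the tolerance knob: raising it makes results later but catches more delayed sensor readings, exactly the tradeoff discussed under Pitfall 2.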

Big data operations is about keeping massive IoT data systems running reliably day after day. Think of maintaining a fleet of delivery trucks – you need to monitor fuel levels, schedule maintenance, handle breakdowns, and plan for growth. Similarly, data operations means monitoring system health, managing storage, and recovering quickly from failures.

30.1 Debugging and Monitoring IoT Big Data Pipelines

Building a big data pipeline is just the beginning - keeping it healthy in production requires comprehensive monitoring and debugging strategies. This section covers practical approaches for tracking pipeline health, diagnosing failures, and optimizing performance.

30.1.1 Key Metrics to Track

| Metric Category | Metric | Description | Alert Threshold | Tool |
|---|---|---|---|---|
| Data Flow | Message Latency | Time from sensor to storage | > 5 seconds | Kafka metrics, Grafana |
| Data Flow | Throughput | Messages per second | < 80% expected | Spark UI, Prometheus |
| Data Flow | Backlog Size | Pending unprocessed messages | > 10,000 messages | Kafka lag monitoring |
| Data Flow | Processing Rate | Records processed/sec | Declining trend | Spark streaming metrics |
| Data Quality | Error Rate | Failed messages / total | > 1% | Custom validators |
| Data Quality | Schema Violations | Records with format errors | > 0.1% | Schema registry alerts |
| Data Quality | Duplicate Rate | Duplicate message IDs | > 0.5% | Deduplication counters |
| Data Quality | Missing Data | Expected vs actual sensors | > 5% devices offline | Device heartbeat tracking |
| Infrastructure | Storage Growth | GB per day | > 120% expected | HDFS metrics, S3 CloudWatch |
| Infrastructure | CPU Utilization | Spark executor CPU | > 80% sustained | Cluster manager |
| Infrastructure | Memory Pressure | JVM heap usage | > 85% | JVM metrics, GC logs |
| Infrastructure | Network I/O | Data transfer rates | Saturation | Network monitoring |

Try It: Pipeline Health Alert Simulator

Adjust the pipeline metrics below to see which monitoring alerts would fire. This helps you understand how different thresholds interact and which issues need immediate attention versus investigation.
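
As a rough sketch of how such alert logic works, the function below encodes several thresholds from the table above. The metric names and snapshot structure are illustrative, not a real monitoring API:

```python
def evaluate_alerts(metrics):
    """Return the list of alerts that would fire for one metrics snapshot."""
    alerts = []
    if metrics["message_latency_sec"] > 5:
        alerts.append("HIGH_LATENCY")
    if metrics["throughput_msgs_sec"] < 0.8 * metrics["expected_throughput"]:
        alerts.append("LOW_THROUGHPUT")
    if metrics["backlog_messages"] > 10_000:
        alerts.append("BACKLOG")
    if metrics["error_rate"] > 0.01:
        alerts.append("HIGH_ERROR_RATE")
    if metrics["cpu_utilization"] > 0.80:
        alerts.append("CPU_PRESSURE")
    if metrics["heap_usage"] > 0.85:
        alerts.append("MEMORY_PRESSURE")
    return alerts

snapshot = {
    "message_latency_sec": 7.5,   # over the 5 s threshold
    "throughput_msgs_sec": 900,
    "expected_throughput": 1000,  # 90% of expected: OK
    "backlog_messages": 45_000,   # over the 10,000 threshold
    "error_rate": 0.002,          # under 1%: OK
    "cpu_utilization": 0.65,
    "heap_usage": 0.91,           # over the 85% threshold
}
print(evaluate_alerts(snapshot))  # ['HIGH_LATENCY', 'BACKLOG', 'MEMORY_PRESSURE']
```

Notice that one snapshot can fire several alerts at once; correlating them (high backlog plus memory pressure, say) is often the first diagnostic clue.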

30.1.2 Common Issues and Diagnosis

| Symptom | Likely Cause | Investigation Steps | Resolution |
|---|---|---|---|
| High Latency | Network congestion | Check queue depths, network bandwidth | Increase partitions, add brokers |
| Missing Data | Sensor offline | Check device heartbeats, last-seen timestamps | Investigate device connectivity |
| Duplicate Records | Retry logic | Check message IDs, producer configs | Implement idempotent writes |
| Schema Errors | Format mismatch | Validate against schema registry | Update schema, fix producers |
| OOM Errors | Memory leaks | Analyze heap dumps, GC logs | Increase memory, optimize code |
| Slow Queries | Missing partitions | Check query plans, partition strategy | Add partitions, update indexes |
| Kafka Lag Growing | Consumer too slow | Check processing time per batch | Scale consumers, optimize logic |
| Data Skew | Uneven partitioning | Analyze partition distribution | Repartition by better key |

30.2 Debugging Common Big Data Issues

Issue 1: Kafka Consumer Lag Growing

Symptoms:

Consumer lag: 50,000 messages (growing)
Processing time: 500ms per batch
Records per batch: 100

Diagnosis:

# Check consumer lag
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --group iot-processor --describe

# Example output showing lag:
TOPIC    PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG
sensors  0          1000000         1050000         50000

Root Cause Analysis:

# Calculate processing capacity
records_per_sec = 100 / 0.5  # 200 records/sec
arrival_rate = 500  # records/sec from producers

# Capacity shortfall
shortfall = arrival_rate - records_per_sec  # 300 records/sec falling behind

Kafka consumer lag accumulates when consumption rate \(C\) lags behind production rate \(P\). The lag growth over time is:

\[L(t) = L_0 + (P - C) \times t\]

where \(L_0\) is initial lag.

Example: Current lag \(L_0 = 50,000\) messages, \(P = 500\) msg/s, \(C = 200\) msg/s:

\[L(t) = 50,000 + 300t\]

Time to hit 1 million lag (system failure threshold):

\[t = \frac{1,000,000 - 50,000}{300} = 3,167 \text{ seconds} \approx 53 \text{ minutes}\]

Solutions require either reducing \(P\) (sample less frequently) or increasing \(C\):

  • Add 2 consumers: \(C = 600\) msg/s → lag clears in \(50,000 / 100 = 500\) seconds (8.3 minutes)
  • Optimize processing: \(C = 400\) msg/s → lag grows slower but is still unsustainable
  • Stability requires \(C \geq P\): minimum 3 consumers to handle the 500 msg/s load
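
The lag model translates directly into code. A minimal calculator using the chapter's numbers:

```python
def lag_at(t_sec, initial_lag, produce_rate, consume_rate):
    """L(t) = L0 + (P - C) * t, floored at zero once the backlog clears."""
    return max(0, initial_lag + (produce_rate - consume_rate) * t_sec)

def seconds_until(target_lag, initial_lag, produce_rate, consume_rate):
    """Time for the lag to reach a target (a failure threshold, or zero)."""
    net = produce_rate - consume_rate
    if net == 0:
        return float("inf")
    return (target_lag - initial_lag) / net

# L0 = 50,000 msgs, P = 500 msg/s, C = 200 msg/s
print(round(seconds_until(1_000_000, 50_000, 500, 200)))  # ~3167 s to failure
# With 3 consumers (C = 600 msg/s) the backlog drains instead:
print(round(seconds_until(0, 50_000, 500, 600)))          # 500 s to clear
```

The same two functions answer both operational questions: "how long until we breach the threshold?" and "how long until a scaled-up consumer group catches up?"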

30.2.0.1 Explore: Kafka Consumer Lag Calculator

Solutions:

  1. Scale consumers (if partitions > consumers):

    # Add 2 more consumer instances
    # Rebalancing will distribute partitions
  2. Optimize processing (if consumers == partitions):

    # Before: Processing each record individually
    for record in batch:
        process(record)  # 5ms per record
    
    # After: Batch processing
    process_batch(batch)  # 50ms for 100 records (0.5ms per record)
  3. Increase partitions (if need more parallelism):

    # Add partitions to topic (irreversible!)
    kafka-topics.sh --alter --topic sensors \
      --partitions 20 --bootstrap-server localhost:9092

Issue 2: Out-of-Memory (OOM) Errors in Spark

Symptoms:

ERROR Executor: Exception in task 1.0
java.lang.OutOfMemoryError: Java heap space

Root Cause Options:

A. Data Skew (one partition much larger)

# Check partition sizes
from pyspark.sql.functions import spark_partition_id
df.groupBy(spark_partition_id()).count().show()

# Output showing skew:
# Partition 0: 1M records
# Partition 1: 100k records
# Partition 2: 5M records <- PROBLEM

Solution: Repartition by better key

# Before: Partitioned by sensor_id (some sensors very chatty)
df = df.repartition("sensor_id")

# After: Partition by hash of sensor_id and timestamp
from pyspark.sql.functions import hash, col
df = df.repartition(20, hash(col("sensor_id"), col("timestamp")))

B. Insufficient Memory Configuration

# Check current config
spark.conf.get("spark.executor.memory")  # "1g"

# Increase memory
spark-submit \
  --executor-memory 4g \
  --executor-memoryOverhead 1g \
  --conf spark.memory.fraction=0.8 \
  job.py

C. Too Much Data in Memory (cache/persist)

# Before: Caching entire dataset
df.cache()  # May exceed memory!

# After: Cache only needed columns
df.select("sensor_id", "temperature").cache()

# Or use disk persistence (spills to disk when memory fills)
from pyspark import StorageLevel
df.persist(StorageLevel.MEMORY_AND_DISK)

Try It: Data Skew Visualizer

Explore how uneven data distribution across Spark partitions causes OOM errors. Drag the skew slider to see how one “hot” partition can overwhelm executor memory while others sit idle.
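
The same effect can be reproduced numerically. The sketch below uses plain Python hashing as a stand-in for Spark's partitioner: one chatty sensor overloads a partition when partitioning by sensor_id alone, while hashing a composite key spreads the load:

```python
from collections import Counter

NUM_PARTITIONS = 4
# Three quiet sensors (100 readings each) plus one chatty sensor (5,000 readings)
readings = [(f"quiet-{i}", t) for i in range(3) for t in range(100)]
readings += [("chatty", t) for t in range(5000)]

# Partition by sensor_id only: every chatty reading hashes to the same partition
by_sensor = Counter(hash(s) % NUM_PARTITIONS for s, _ in readings)
# Partition by (sensor_id, timestamp): readings spread across all partitions
by_composite = Counter(hash((s, t)) % NUM_PARTITIONS for s, t in readings)

print("by sensor_id: ", sorted(by_sensor.values(), reverse=True))
print("by composite: ", sorted(by_composite.values(), reverse=True))
```

Exact counts vary between runs (Python salts string hashes), but the pattern is stable: the sensor_id layout always puts all 5,000 chatty readings on one partition, while the composite layout stays near-even at roughly 1,325 per partition.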

Issue 3: Latency Spikes

Symptoms:

Average latency: 2 seconds
P99 latency: 45 seconds <- SPIKE

Diagnosis:

# Check Spark Streaming batch times
# Spark UI -> Streaming tab -> Batch Details

# Look for:
# - Scheduling delay (queuing time before processing)
# - Processing time (actual computation)
# - Total delay (end-to-end)

# Example output:
Batch Time: 2024-01-15 10:05:00
Scheduling Delay: 30s  <- PROBLEM (queuing)
Processing Time: 3s
Total Delay: 33s

Root Cause: Batch processing time > batch interval

Batch interval: 10 seconds
Avg processing time: 12 seconds
-> Queuing builds up over time
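
This buildup is easy to model: every batch adds the difference between processing time and batch interval to the scheduling delay. A quick sketch:

```python
def scheduling_delay(num_batches, batch_interval_sec, processing_sec):
    """Accumulated queuing delay when processing time exceeds the batch interval."""
    per_batch_deficit = max(0, processing_sec - batch_interval_sec)
    return num_batches * per_batch_deficit

# 10 s interval but 12 s of work per batch: delay grows 2 s with every batch
print(scheduling_delay(15, 10, 12))  # 30 -- matches the 30 s scheduling delay above
print(scheduling_delay(15, 10, 8))   # 0  -- stable once processing fits the interval
```

The takeaway: any sustained processing time above the batch interval grows delay without bound, so the fix must bring processing time below the interval, not merely reduce it.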

Solutions:

  1. Optimize Processing Logic

# Before: Multiple passes over data
filtered = df.filter(...)
aggregated = filtered.groupBy(...).agg(...)
joined = aggregated.join(...)

# After: Single optimized query; wrap the small table with broadcast()
from pyspark.sql.functions import broadcast
result = df.filter(...) \
    .groupBy(...).agg(...) \
    .join(broadcast(small_df), ...)  # small_df: the smaller table in the join

  2. Scale Cluster

# Increase parallelism
spark.conf.set("spark.default.parallelism", 200)
spark.conf.set("spark.sql.shuffle.partitions", 200)

Issue 4: Schema Evolution Breaking the Pipeline

Symptoms:

2024-01-15: Processing 1M records/hour
2024-01-16: Errors spike, only 200k records/hour
Error: "Missing field: humidity"

Diagnosis:

# Check schema registry
# Example: Sensor firmware update added "humidity_v2" field

# Old schema:
{"sensor_id": "string", "temp": "float", "humidity": "float"}

# New schema:
{"sensor_id": "string", "temp": "float", "humidity_v2": "float"}

Solution: Handle Schema Evolution

  • Check for new field (humidity_v2) first
  • Fall back to old field (humidity) if new field missing
  • Use default value (None) if both missing
  • Alternative: Use schema registry (Confluent Schema Registry) for automatic backward/forward compatibility
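
That fallback chain is one line per step in code. A minimal sketch, using the field names from the example schemas above:

```python
def read_humidity(record: dict):
    """Prefer the new field, fall back to the old one, default to None."""
    if "humidity_v2" in record:
        return record["humidity_v2"]
    if "humidity" in record:
        return record["humidity"]
    return None  # both missing: leave the decision to downstream validation

print(read_humidity({"sensor_id": "a1", "temp": 21.5, "humidity_v2": 0.44}))  # 0.44
print(read_humidity({"sensor_id": "a2", "temp": 20.1, "humidity": 0.51}))     # 0.51
print(read_humidity({"sensor_id": "a3", "temp": 19.8}))                       # None
```

Reader-side tolerance like this keeps the pipeline running during a staggered firmware rollout, when old and new record formats arrive interleaved.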

30.3 Production Monitoring Checklist

Before deploying a big data pipeline to production, ensure:

Infrastructure Monitoring:

  • CPU utilization alerting in place (< 80% sustained)
  • JVM heap / memory pressure alerting (< 85%)
  • Storage growth tracked against forecast (alert at > 120% of expected)
  • Network I/O monitored for saturation

Data Quality Monitoring:

  • Error rate tracked (alert at > 1% failed messages)
  • Schema violations surfaced via schema registry alerts
  • Duplicate rate counters in place (alert at > 0.5%)
  • Device heartbeats tracked (alert at > 5% devices offline)

Performance Monitoring:

  • End-to-end message latency measured (alert at > 5 seconds)
  • Kafka consumer lag monitored (alert at > 10,000 messages)
  • Throughput compared against expected rate (alert below 80%)

Operational:

  • Checkpointing configured and recovery from failure tested
  • Runbooks written for the common issues above (lag, OOM, schema errors)
  • Retention policies automated from day one

Cost Monitoring:

  • Storage cost per tier reviewed regularly
  • Compute and bandwidth spend compared against budget

30.4 Real-World Debugging Example

Real-World Debugging Example: Smart City Traffic Analysis

Scenario: Smart city processes 10,000 traffic cameras, each sending 1 image/sec.

Issue Detected:

Alert: Kafka consumer lag = 500,000 messages (increasing)
Storage growth: 200 GB/day (expected: 80 GB/day)
Query latency: 30s (expected: 5s)

Investigation:

Step 1: Check Kafka metrics

# Consumer lag by partition
PARTITION  LAG
0          50,000
1          48,000
...
9          52,000

# Conclusion: All partitions lagging equally (not a skew issue)
# Problem is overall processing capacity

Step 2: Check Spark processing time

Batch interval: 60 seconds
Processing time: 75 seconds <- PROBLEM!
Records per batch: 600,000 images

Step 3: Profile processing logic

Figure 30.1: Traffic Camera Processing Before and After Smart Sampling Optimization (flowchart comparing per-frame processing time in the two pipelines)

Root Cause: Vehicle detection on every frame is too expensive!

Solution: Smart Sampling Strategy

  • 90% of frames: Lightweight motion detection only (2ms)
  • 10% of frames: Full vehicle detection + caching (65ms)
  • Average processing: 0.9 x 2ms + 0.1 x 65ms = 8.3ms per frame
  • Capacity: 120 images/sec per executor x 100 executors = 12,000 images/sec
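
The capacity arithmetic behind this strategy, as a small sketch:

```python
def avg_frame_cost_ms(full_fraction, light_ms, full_ms):
    """Weighted average per-frame cost under tiered (smart) sampling."""
    return (1 - full_fraction) * light_ms + full_fraction * full_ms

avg_ms = avg_frame_cost_ms(0.10, 2, 65)   # 0.9 x 2 + 0.1 x 65 = 8.3 ms/frame
per_executor = 1000 / avg_ms              # ~120 images/sec per executor
fleet_capacity = per_executor * 100       # ~12,000 images/sec with 100 executors
print(f"{avg_ms:.1f} ms/frame, {fleet_capacity:.0f} images/sec total")
```

Since the 10,000 cameras produce 10,000 images/sec, the optimized fleet capacity (~12,000 images/sec) leaves headroom to drain the accumulated lag.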

Results After Optimization:

Kafka consumer lag: 0 messages (cleared in 2 hours)
Storage growth: 75 GB/day (compressed processed data)
Query latency: 3 seconds (partitioned by camera and time)
Cost savings: 60% reduction in compute (fewer resources needed)

Key Lessons:

  1. Profile before optimizing - Identified vehicle detection as bottleneck
  2. Domain-specific optimization - Traffic changes slowly; don’t need frame-by-frame analysis
  3. Tiered processing - Full processing for keyframes, lightweight for others
  4. Compression - Store processed metadata (KB), not raw images (MB)

30.5 Worked Example: Cost of Pipeline Downtime

Worked Example: Quantifying the Business Impact of Pipeline Failures

Scenario: A smart building management company processes data from 2,000 buildings (15M sensors). Their big data pipeline has averaged 99.5% uptime – but leadership wants to know whether investing in higher reliability is justified.

Current state:

  • Pipeline uptime: 99.5% = 43.8 hours downtime/year
  • Average outage duration: 2.1 hours (21 incidents/year)
  • Data processed: 500M events/day (energy, HVAC, security, occupancy)

Cost-of-downtime calculation:

| Impact Category | Cost Per Hour of Downtime | Annual Cost (43.8 hrs) |
|---|---|---|
| Missed HVAC optimization | $1,200 (excess energy across 2,000 buildings) | $52,560 |
| Delayed security alerts | $800 (average incident escalation cost) | $35,040 |
| SLA penalties | $2,500 (contractual penalties for enterprise clients) | $109,500 |
| Engineering recovery time | $450 (2 engineers at $225/hr during incidents) | $19,710 |
| Customer churn risk | $1,800 (prorated annual contract value at risk) | $78,840 |
| Total | $6,750/hour | $295,650/year |

Investment to reach 99.95% uptime (4.4 hours downtime/year):

| Improvement | Cost |
|---|---|
| Hot standby Kafka cluster | $48,000/year |
| Automated failover (Kubernetes) | $24,000/year (additional compute) |
| 24/7 on-call with PagerDuty | $36,000/year |
| Chaos engineering testing | $12,000/year (tooling + engineer time) |
| Total investment | $120,000/year |

Result: Reducing downtime from 43.8 to 4.4 hours saves $265,950/year ($295,650 - $29,700 residual cost at 99.95%). Net benefit after $120K investment: $145,950/year = 2.2x ROI.

Key Insight: The single largest cost driver is SLA penalties ($2,500/hr), not technical impact. When justifying reliability investments, quantify contractual obligations first – they often dwarf the engineering costs and make the business case obvious.
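
The same calculation can be scripted for other scenarios. A sketch (note that 99.95% uptime is exactly 4.38 hours/year; the table above rounds to 4.4 hours, so the figures differ slightly):

```python
HOURS_PER_YEAR = 8760

def downtime_roi(uptime_now, uptime_target, cost_per_hour, investment):
    """Annual savings, net benefit, and gross ROI of an uptime improvement."""
    hours_now = (1 - uptime_now) * HOURS_PER_YEAR
    hours_target = (1 - uptime_target) * HOURS_PER_YEAR
    savings = (hours_now - hours_target) * cost_per_hour
    return savings, savings - investment, savings / investment

savings, net, roi = downtime_roi(0.995, 0.9995, 6750, 120_000)
print(f"saves ${savings:,.0f}/yr, net ${net:,.0f}/yr, {roi:.1f}x ROI")
```

Swapping in your own uptime, cost-per-hour, and investment figures reproduces the analysis for any pipeline.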

Try It: Pipeline Reliability ROI Calculator

Explore whether investing in higher pipeline reliability pays off for your scenario. Adjust the uptime targets, cost-per-hour of downtime, and investment amount to see the ROI.

30.6 Seven Deadly Pitfalls of IoT Big Data

Pitfall 1: Storing Everything “Just in Case”

The Mistake: Keep all raw IoT data indefinitely because “we might need it later”

Why It’s Bad:

  • Storage costs explode: 1 billion readings/day x 100 bytes x 365 days = 36.5 TB/year
  • At $0.023/GB/month: $840/month = $10,080/year
  • Most raw data is never queried after 30 days

The Fix: Implement tiered storage + downsampling

| Data Tier | Retention | Storage | Monthly Cost |
|---|---|---|---|
| Hot (raw) | 7 days | Fast SSD | $100 |
| Warm (1-min avg) | 90 days | Slower | $20 |
| Cold (hourly avg) | Forever | S3 Glacier | $2 |

Savings: $840/month to $122/month = 85% reduction

Pitfall 2: Ignoring Late-Arriving Data

The Mistake: Close windows immediately without waiting for late arrivals

What Goes Wrong:

  • Sensor loses network for 30 seconds
  • Data arrives 45 seconds late
  • Count is wrong because late data excluded
  • Reports show mysterious “dips” that didn’t happen

The Fix: Use watermarks with appropriate tolerance (2 minutes for most IoT scenarios)

Tradeoff: Results delayed by 2 minutes, but accurate

Pitfall 3: Not Handling Sensor Failures

The Mistake: Trust all sensor data blindly

What Goes Wrong:

  • Faulty sensor reports temperature = 500C (impossible)
  • Bad data corrupts analytics: “Average temperature = 120C”
  • Triggers false alerts (fire alarm goes off)

The Fix: Implement data quality rules

| Validation | Check | Action |
|---|---|---|
| Range Check | -50C to 100C | Quarantine out-of-range |
| Rate of Change | < 10C per second | Flag sudden spikes |
| Sensor Health | Last reading < 10 min ago | Mark offline |

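
A sketch of the range and rate-of-change rules in code (field names and return labels are illustrative; the sensor-health check is omitted because it needs a current-time reference):

```python
def validate_reading(reading, last_reading=None):
    """Apply the range and rate-of-change rules from the table above."""
    temp = reading["temp_c"]
    if not -50 <= temp <= 100:
        return "quarantine"                          # physically impossible value
    if last_reading is not None:
        dt = reading["ts"] - last_reading["ts"]
        if dt > 0 and abs(temp - last_reading["temp_c"]) / dt > 10:
            return "flag-spike"                      # faster than 10 C per second
    return "ok"

print(validate_reading({"ts": 0, "temp_c": 500}))                          # quarantine
print(validate_reading({"ts": 1, "temp_c": 45}, {"ts": 0, "temp_c": 20}))  # flag-spike
print(validate_reading({"ts": 5, "temp_c": 22}, {"ts": 0, "temp_c": 21}))  # ok
```

Quarantined readings should be stored separately rather than discarded, so a recurring sensor fault can still be diagnosed later.
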
Pitfall 4: No Data Retention Policy

The Mistake: No deletion policy, keep everything forever

What Happens:

  • Month 1: 2 TB of data, fine
  • Month 6: 12 TB, queries slowing
  • Month 12: 24 TB, disk 95% full
  • Month 13: Database crashes

The Fix: Implement retention policies from day 1

| Tier | Granularity | Retention | Purpose |
|---|---|---|---|
| Raw | Every reading | 7 days | Real-time debugging |
| 1-minute | AVG/MIN/MAX | 90 days | Recent trends |
| Hourly | AVG/MIN/MAX | Forever | Long-term analysis |

Space Savings: 1 year raw (24 TB) to hourly aggregates (200 GB) = 99% reduction
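
The space arithmetic generalizes. A sketch, assuming raw readings arrive at a fixed rate and each window keeps three aggregate values (AVG/MIN/MAX):

```python
def downsample_ratio(raw_readings_per_sec, agg_values_per_window, window_sec):
    """Fraction of rows kept when raw readings become window aggregates."""
    raw_rows_per_window = raw_readings_per_sec * window_sec
    return agg_values_per_window / raw_rows_per_window

# 1 reading/sec rolled up to hourly AVG/MIN/MAX: keep 3 of every 3,600 rows
r = downsample_ratio(1, 3, 3600)
print(f"keep {r:.2%} of rows -> {1 - r:.1%} space reduction")
```

At 3 aggregate values per 3,600 raw readings, over 99.9% of rows disappear, which is why the hourly tier can be kept forever at negligible cost.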

Pitfall 5: Over-engineering Small Data

The mistake: Using Spark, Hadoop, or Kafka for datasets that easily fit in memory on a laptop.

Symptoms:

  • Team spends more time managing infrastructure than analyzing data
  • Simple queries that should take seconds take minutes (cluster overhead)
  • Cloud bills are 10-100x higher than needed

The fix: Apply the “laptop test” before choosing big data tools:

# Quick sizing check (example inputs: 1,000 sensors, 1 reading/minute, 100 bytes)
num_sensors = 1000
readings_per_second = 1 / 60
bytes_per_reading = 100

data_per_day_mb = num_sensors * readings_per_second * 86400 * bytes_per_reading / 1e6
data_per_year_gb = data_per_day_mb * 365 / 1000

print(f"Daily: {data_per_day_mb:.1f} MB, Yearly: {data_per_year_gb:.1f} GB")

# Decision thresholds:
# < 10 GB/day -> Pandas on laptop, PostgreSQL/InfluxDB (single node)
# 10-100 GB/day -> Time-series DB cluster, consider Kafka
# > 100 GB/day -> Now you might need Spark/Hadoop

Example: 1000 sensors at 1 reading/minute at 100 bytes = 144 MB/day = 52 GB/year. Fits on a $50 SSD. No Hadoop needed!

30.6.0.1 Explore: IoT Data Sizing Calculator

Pitfall 6: Not Planning for Schema Evolution

The Mistake: Rigidly defined schema that breaks when sensors change

What Goes Wrong:

  • New sensor models have additional fields
  • Old code crashes when encountering new fields
  • Have to migrate millions of old records
  • Downtime during migration

The Fix: Use schema evolution approach - Store flexible JSON objects or use Avro/Parquet with optional fields - Schema registry for automatic backward/forward compatibility - Tag data with schema version

The Rule: “Always design for future sensor types. Your schema will evolve.”

Pitfall 7: Processing All Data in Cloud (Ignoring Edge)

The Mistake: Send all raw sensor data to cloud

The Cost (Cloud-Only):

  • Bandwidth: 5 MB/second x 86,400 = 432 GB/day
  • Cloud ingress: $0.09/GB = $38.88/day = $1,166/month

The Fix: Edge processing

| Approach | Data Sent | Monthly Cost | Savings |
|---|---|---|---|
| Cloud-only | 432 GB/day | $1,166 | Baseline |
| Edge processing | 7.2 GB/day (alerts only) | $19.50 | 98% |

The Rule: “Process data as close to the source as possible. Only send insights to cloud, not raw data.”
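
The cost comparison in the table comes straight from the bandwidth numbers. A sketch (the $0.09/GB rate is the example price used above, not a quote):

```python
def monthly_transfer_cost(gb_per_day, price_per_gb=0.09, days_per_month=30):
    """Monthly cloud data-transfer cost for a given daily volume."""
    return gb_per_day * price_per_gb * days_per_month

cloud_only = monthly_transfer_cost(432)   # every raw frame sent to the cloud
edge_first = monthly_transfer_cost(7.2)   # only alerts/summaries leave the edge
print(f"${cloud_only:,.2f} vs ${edge_first:,.2f}: "
      f"{1 - edge_first / cloud_only:.0%} saved")
```

Because the savings scale with the raw-to-insight data ratio, the more aggressively the edge filters, the larger the reduction.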

Try It: Edge vs Cloud Cost Calculator

Compare the monthly bandwidth costs of sending all raw IoT data to the cloud versus processing at the edge and sending only alerts or aggregated summaries.

30.6.1 Summary of Common Mistakes

| Mistake | Impact | Fix | Savings/Improvement |
|---|---|---|---|
| Storing everything | $10K+/year storage costs | Tiered storage + downsampling | 85% cost reduction |
| Ignoring late data | Incorrect analytics | Watermarks + allowed lateness | Accurate results |
| No sensor validation | Bad data corrupts insights | Validation rules + quarantine | Prevents false alerts |
| Over-engineering small data | Wasted infrastructure spend | "Laptop test" before choosing tools | 10-100x cost savings |
| No retention policy | Disk runs out | Automatic retention + aggregates | 99% space savings |
| Cloud-only processing | $1,166/month bandwidth | Edge processing + cloud for insights | 98% cost reduction |
| Rigid schemas | Code breaks with changes | Schema evolution (Avro/Parquet) | Zero downtime updates |

The Sensor Squad had built an amazing data pipeline for the school’s smart garden. Sensors measured soil moisture, sunlight, temperature, and wind speed. Everything worked perfectly… for a while.

One Monday morning, Sammy the Sensor noticed something strange. “Max! My readings are taking forever to show up on the dashboard. Last week it took 2 seconds, now it takes 45 seconds!”

Max the Microcontroller opened up the monitoring tools. “Uh oh. I see THREE problems!”

Problem 1: The Data Mountain “We have been saving EVERY single reading for three months and never deleting anything,” Max said. “That is like never throwing away any homework – eventually your desk is so buried you can not find anything!”

Bella the Battery suggested: “What if we keep detailed data for just one week, then save only hourly averages after that? We would go from 24 terabytes to just 200 gigabytes!”

Problem 2: The Slow Worker “Our data processor is getting 500 messages per second, but it can only handle 200,” Max discovered. “It is like having one person trying to read 500 letters per second – the unread pile keeps growing!”

“Add more helpers!” said Lila the LED. “If one worker handles 200, three workers handle 600. Problem solved!”

Problem 3: The Broken Format “And look – someone updated the wind sensor’s software, and now it sends data in a new format. Our old code does not understand it and keeps throwing errors!”

Sammy learned an important lesson: “Building a pipeline is just the beginning. KEEPING it healthy is the real job!”

Key lesson: Running big data systems is like taking care of a garden – you need to regularly prune old data, watch for bottlenecks, and be ready when things change!

Key Takeaway

Building a big data pipeline is 20% of the effort; operating it reliably in production is 80%. Monitor three critical dimensions: data flow (latency, lag, throughput), data quality (errors, schema violations, missing sensors), and infrastructure (CPU, memory, storage growth). Implement retention policies from day one, plan for schema evolution, and always profile before optimizing. The seven deadly pitfalls – especially storing everything, ignoring late data, and skipping edge processing – account for the majority of production failures.

Common Pitfalls

A batch job running every 30 minutes cannot detect a gas leak that becomes dangerous in 5 minutes. Map latency requirements to processing paradigms: safety → stream processing (<1 s), dashboards → micro-batch (seconds), reporting → batch (minutes/hours).

Network delays mean sensor readings can arrive seconds or minutes after their timestamp. Without watermarks and out-of-order event handling, stream processors will produce incorrect aggregations.

At-least-once processing causes billing systems to double-count energy readings. Use idempotent operations or transactional sinks when processing data that drives financial decisions.

Running Kafka + Flink in production requires expertise in partition management, consumer lag monitoring, and failure recovery. Start with managed services (AWS Kinesis, Google Pub/Sub) before operating your own cluster.

30.7 Summary

  • Key monitoring metrics include message latency (<5 seconds), Kafka consumer lag (<10,000 messages), error rate (<1%), and CPU utilization (<80% sustained).
  • Common debugging patterns: Kafka lag means scale consumers or optimize processing; OOM errors indicate data skew or insufficient memory; latency spikes suggest batch processing time exceeds interval.
  • Production checklist covers infrastructure monitoring, data quality tracking, performance metrics, operational procedures, and cost management.
  • Seven deadly pitfalls include storing everything, ignoring late data, no sensor validation, over-engineering small data, no retention policy, cloud-only processing, and rigid schemas.

30.8 What’s Next

| If you want to… | Read this |
|---|---|
| Understand big data pipeline architecture | Big Data Pipelines |
| Explore the technologies implementing these operations | Big Data Technologies |
| Apply stream processing concepts to IoT data | Big Data Overview |
| Study edge computing to reduce cloud processing load | Big Data Edge Processing |
| Return to the module overview | Big Data Overview |