28  Big Data Technologies

In 60 Seconds

Big data technologies for IoT span the Apache Hadoop ecosystem for distributed storage and batch processing, Apache Spark for in-memory analytics 10-100x faster than MapReduce, Apache Kafka for real-time streaming at millions of events per second, and specialized time-series databases such as InfluxDB, which can deliver roughly 100x better write performance for sensor workloads.

Learning Objectives

After completing this chapter, you will be able to:

  • Explain the Apache Hadoop ecosystem architecture including HDFS, YARN, Spark, and Hive components
  • Compare batch processing (Spark) and stream processing (Kafka, Flink) by throughput, latency, and cost
  • Select appropriate time-series databases for IoT workloads based on write speed and query patterns
  • Design technology stacks for specific IoT scenarios matching tools to data volume and latency requirements

Key Concepts

  • Apache Kafka: A distributed message streaming platform designed for high-throughput, low-latency event pipelines; the de facto standard for ingesting high-velocity IoT sensor streams.
  • Apache Spark: A distributed in-memory data processing engine supporting batch analytics, stream processing (Structured Streaming), and ML (MLlib) on large IoT datasets.
  • Apache Flink: A distributed stream processing framework optimised for stateful, low-latency, exactly-once stream processing — preferred over Spark for sub-second IoT latency requirements.
  • HDFS (Hadoop Distributed File System): A fault-tolerant distributed storage system that partitions large datasets across commodity servers with 3x replication, the foundational storage layer for Hadoop-based analytics.
  • InfluxDB: A purpose-built time-series database optimised for storing and querying timestamped sensor readings with automatic data retention and downsampling policies.
  • Apache Arrow: A cross-language in-memory columnar data format that enables zero-copy reads between processing frameworks, dramatically reducing serialisation overhead in multi-system IoT pipelines.

Big data technologies are specialized tools built for data that is too large, too fast, or too varied for ordinary databases. Think of them as industrial kitchen equipment compared to home appliances – built for scale and speed. Tools like Apache Kafka handle fast-moving streams while Apache Spark processes massive datasets in parallel.

28.1 Apache Hadoop Ecosystem

MapReduce processes massive IoT datasets by splitting work across hundreds of machines in three phases.

Example: Calculate Average Temperature by City from 1 Billion Sensor Readings

Step 1 - Map Phase (Parallel across 100 nodes):

Input (raw sensor log):
sensor_123, NYC, 2024-01-15T10:00:00, 22.5°C
sensor_456, LA,  2024-01-15T10:00:05, 28.3°C
sensor_789, NYC, 2024-01-15T10:00:10, 23.1°C
...

Map function (runs on each node):
def map(line):
    sensor_id, city, timestamp, temp = parse(line)
    emit(city, temp)  # Output: (NYC, 22.5), (LA, 28.3), (NYC, 23.1)

Result: 1 billion (city, temperature) pairs distributed across nodes

Step 2 - Shuffle Phase (Hadoop framework does this automatically):

Group all temperatures by city key:
NYC → [22.5, 23.1, 22.8, ... 230 million values]
LA  → [28.3, 27.9, 29.1, ... 180 million values]
CHI → [... 150 million values]

Step 3 - Reduce Phase (Parallel across cities):

Reduce function (runs once per city):
def reduce(city, temperature_list):
    total = sum(temperature_list)
    count = len(temperature_list)
    avg = total / count
    emit(city, avg)

Final output:
NYC, 22.9°C
LA,  28.4°C
CHI, 19.2°C

Why This Is Fast: 100 nodes each process 10 million rows in parallel. What would take 1 node 100 hours finishes in 1 hour across 100 nodes. Hadoop handles node failures automatically - if node 47 crashes, its work is reassigned to node 48.
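The three phases above can be sketched as ordinary Python on a single machine. This is a toy illustration of the model, not Hadoop's actual API: `map_fn` stands in for the map task, and the dictionary grouping plays the role of the framework's automatic shuffle.

```python
from collections import defaultdict
from statistics import mean

# Raw sensor log lines, as in the example above (city is the second field)
lines = [
    "sensor_123, NYC, 2024-01-15T10:00:00, 22.5",
    "sensor_456, LA,  2024-01-15T10:00:05, 28.3",
    "sensor_789, NYC, 2024-01-15T10:00:10, 23.1",
    "sensor_124, NYC, 2024-01-15T10:00:15, 22.8",
]

# Map phase: emit one (city, temperature) pair per input line
def map_fn(line):
    sensor_id, city, timestamp, temp = [f.strip() for f in line.split(",")]
    return (city, float(temp))

mapped = [map_fn(line) for line in lines]

# Shuffle phase: group values by key (Hadoop does this automatically)
groups = defaultdict(list)
for city, temp in mapped:
    groups[city].append(temp)

# Reduce phase: one aggregation per key
averages = {city: round(mean(temps), 1) for city, temps in groups.items()}
print(averages)  # {'NYC': 22.8, 'LA': 28.3}
```

On a real cluster, `map_fn` runs in parallel on every node and each reduce key is processed independently; the single-machine logic is otherwise identical.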

Layered architecture diagram showing the Apache Hadoop ecosystem with HDFS for storage at the base, YARN for resource management, Spark and MapReduce for processing, and Hive, Pig, and Mahout for applications at the top
Figure 28.1: Apache Hadoop Ecosystem Stack with HDFS, YARN, and Spark

Apache Hadoop Ecosystem: The four-layer architecture with HDFS for distributed storage, YARN for resource management, Spark/MapReduce for processing, and Hive/Pig/Mahout for applications.

28.1.1 Key Components

Component Purpose Use in IoT
HDFS Distributed storage Store petabytes of sensor data across commodity hardware
MapReduce Batch processing Process historical sensor data in parallel
Spark In-memory processing Real-time analytics, 10-100x faster than MapReduce
Hive SQL on Hadoop Query sensor data with SQL instead of code
Kafka Streaming data Handle real-time sensor streams
HBase NoSQL database Store time-series sensor readings
YARN Resource management Schedule processing jobs across the cluster
ZooKeeper Coordination Manage distributed system configuration

28.1.2 HDFS: Distributed File System

Why HDFS for IoT?

  • Handles petabytes of data across thousands of machines
  • Automatic replication (default: 3 copies) prevents data loss
  • Designed for large files (GB-TB) and sequential reads

HDFS Architecture:

NameNode (Master):
- Stores file metadata (which blocks on which DataNodes)
- Single point of coordination

DataNodes (Workers):
- Store actual data blocks (128 MB default)
- Report health to NameNode
- Replicate blocks to other DataNodes

Example: 1 TB file stored as:
- 8,192 blocks x 128 MB each
- 3 copies = 24,576 blocks total
- Distributed across 100+ DataNodes

28.1.3 MapReduce: The Original Big Data Engine

How It Works:

  1. Map phase: Split data, process in parallel
  2. Shuffle phase: Group results by key
  3. Reduce phase: Aggregate grouped data

IoT Example: Calculate Average Temperature by City

Flow diagram showing MapReduce data flow where input sensor data is split across mapper nodes, shuffled by city key, and reduced to calculate average temperatures per city
Figure 28.2: MapReduce Example: Temperature Aggregation by City

MapReduce Data Flow: Input data is split across mappers, shuffled by city key, and reduced to calculate average temperatures. Parallel execution across nodes enables processing terabytes in minutes.

28.1.4 HDFS Storage Calculator

Estimate how HDFS distributes and replicates your IoT data across a cluster.
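The block arithmetic from the HDFS example can be wrapped in a small helper. This is a sketch using the defaults quoted above (128 MB blocks, 3x replication); `hdfs_footprint` is an illustrative name, not an HDFS tool.

```python
import math

def hdfs_footprint(file_size_gb, block_mb=128, replication=3):
    """Estimate HDFS block count and total replicated storage for one file."""
    blocks = math.ceil(file_size_gb * 1024 / block_mb)   # blocks of raw data
    total_blocks = blocks * replication                  # including replicas
    total_gb = total_blocks * block_mb / 1024            # cluster storage consumed
    return blocks, total_blocks, total_gb

# The 1 TB example from above: 8,192 blocks, 24,576 replicated blocks
blocks, total_blocks, total_gb = hdfs_footprint(1024)
print(blocks, total_blocks, total_gb)  # 8192 24576 3072.0
```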

28.2 Apache Spark: In-Memory Processing

Spark is the modern replacement for MapReduce, offering 10-100x faster processing through in-memory computation.

Architecture diagram showing Apache Spark with the core RDD engine at the center, surrounded by Spark SQL, MLlib, Spark Streaming, and GraphX libraries, running on YARN, Mesos, or standalone cluster managers
Figure 28.3: Apache Spark Architecture with Core Libraries and Cluster Managers

Apache Spark Architecture: Spark Core with RDDs provides the foundation, supporting SQL, ML, streaming, and graph libraries that run on various cluster managers and storage backends.

28.2.1 Why Spark Over MapReduce?

Feature MapReduce Spark
Processing speed Disk-based (slow) In-memory (fast)
Ease of use Complex Java code Simple Python/SQL
Iterative algorithms Multiple disk writes Single memory pass
Real-time Not supported Spark Streaming
Machine learning Basic MLlib (100+ algorithms)

28.2.2 Spark for IoT: Real-World Example

Processing 1 Billion Sensor Readings:

# PySpark job - assumes an active SparkSession bound to `spark` and:
# from pyspark.sql.functions import col, window, avg, max, min

# Read sensor data from distributed storage
sensor_data = spark.read.parquet("s3://iot-data/sensors/")

# Filter, aggregate, and analyze in memory
hourly_averages = sensor_data \
    .filter(col("timestamp") > "2024-01-01") \
    .groupBy(window("timestamp", "1 hour"), "sensor_id") \
    .agg(avg("temperature").alias("avg_temp"),
         max("temperature").alias("max_temp"),
         min("temperature").alias("min_temp"))

# Write results back to storage
hourly_averages.write.parquet("s3://iot-data/hourly-summaries/")

# Performance:
# - MapReduce: 45 minutes (multiple disk read/write cycles)
# - Spark: 3 minutes (in-memory processing)

28.3 Apache Kafka: Real-Time Streaming

While Spark excels at processing data already in storage, Apache Kafka handles the real-time ingestion challenge – getting millions of sensor events per second from producers to consumers with millisecond latency.

Architecture diagram showing Apache Kafka with IoT sensor producers on the left publishing messages to partitioned topics in the center, consumed by real-time dashboards, anomaly detection, and storage pipelines on the right
Figure 28.4: Apache Kafka Architecture with Producers, Topics, and Consumers

Apache Kafka Architecture: Multiple IoT producers publish to topic partitions, which are consumed by real-time dashboards, anomaly detection systems, and storage pipelines simultaneously.

28.3.1 Why Kafka for IoT?

Feature Value IoT Benefit
Throughput 100K-1M messages/sec Handle sensor firehose
Latency Single-digit ms Real-time alerts
Durability Replicated, persistent No data loss
Scalability Add brokers on demand Grow with sensors
Multiple consumers Read same data multiple times Analytics + storage + alerts

28.3.2 Kafka Configuration for IoT

# Production Kafka configuration for IoT
kafka_config:
  brokers: 3                    # Minimum for production
  replication_factor: 3         # Survive 2 broker failures
  partitions_per_topic: 10      # Parallelism for consumers

  topics:
    sensor-readings:
      partitions: 20            # High throughput
      retention_hours: 168      # 7 days for replay
      compression: lz4          # 50% size reduction

    alerts:
      partitions: 5             # Lower volume
      retention_hours: 720      # 30 days for audit

  performance:
    batch_size: 16384           # 16 KB batches
    linger_ms: 5                # Wait 5ms for batch fill
    buffer_memory: 33554432     # 32 MB producer buffer
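The `batch_size` and `linger_ms` settings interact with event rate: the producer sends a batch when it fills or when the linger timer expires, whichever comes first. The back-of-envelope below is pure arithmetic, not a Kafka client; `batch_behavior` is a hypothetical helper for reasoning about the config above.

```python
def batch_behavior(events_per_sec, msg_bytes, batch_size=16384, linger_ms=5):
    """Estimate whether batches fill by size or are flushed by the linger timer."""
    msgs_per_linger = events_per_sec * linger_ms / 1000   # arrivals in one linger window
    msgs_to_fill = batch_size // msg_bytes                # messages needed to fill a batch
    if msgs_per_linger >= msgs_to_fill:
        return "size-triggered", msgs_to_fill
    return "time-triggered", int(msgs_per_linger)

# 100K events/sec of 100-byte readings: batches fill before the 5 ms timer fires
print(batch_behavior(100_000, 100))  # ('size-triggered', 163)

# 1K events/sec: the 5 ms timer flushes small batches of ~5 messages
print(batch_behavior(1_000, 100))    # ('time-triggered', 5)
```

The takeaway: at low message rates, a 5 ms linger produces tiny batches, so raising `linger_ms` trades a little latency for better compression and throughput.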

28.4 Technology Selection Decision Framework

Choosing the right big data technology depends on your latency requirement, data volume, and query pattern. Use this decision tree:

Decision Framework: Batch vs Stream vs Hybrid

Step 1 – What is your latency requirement?

If insights needed within… Use Example
Milliseconds to seconds Stream processing (Kafka + Flink) Real-time anomaly detection
Minutes to hours Micro-batch (Spark Streaming) Dashboard updates every 5 min
Hours to days Batch processing (Spark) Daily energy reports
Mixed requirements Lambda Architecture (Spark + Kafka) Real-time alerts + daily analytics

Step 2 – What is your data volume?

Daily data volume Recommended stack Estimated monthly cost (AWS)
< 10 GB/day Single-node PostgreSQL or InfluxDB $50-200
10-100 GB/day Managed Kafka + Spark (3-node cluster) $500-2,000
100 GB - 1 TB/day Kafka cluster + Spark cluster (10+ nodes) $2,000-10,000
> 1 TB/day Multi-region Kafka + auto-scaling Spark $10,000-50,000+

Step 3 – What is your query pattern?

Primary query type Best technology Why
Time-range aggregations InfluxDB / TimescaleDB Optimized for WHERE time > X
Complex joins across tables TimescaleDB (PostgreSQL) Full SQL support
Key-value lookups Redis / DynamoDB Sub-millisecond reads
Full-text search on logs Elasticsearch Inverted index for text
ML feature engineering Spark MLlib + Delta Lake In-memory iterative processing

Worked Example – Smart Factory (5,000 sensors at 1 Hz):

  • Daily raw data: 5,000 sensors x 86,400 seconds x 50 bytes = 21.6 GB/day
  • Latency requirement: Real-time alerts (< 1s) + daily reports
  • Decision: Lambda Architecture with Kafka (streaming) + Spark (batch)
  • Estimated cost: Kafka 3-broker cluster ($900/mo) + Spark 5-node cluster ($1,500/mo) + InfluxDB ($400/mo) = $2,800/month
  • Compare to: all-cloud without edge processing adds cloud ingestion and compute costs – managed Kafka ingestion ($1,200/mo), additional Spark instances for raw data ($800/mo), and network transfer ($200/mo for 21.6 GB/day x 30 days) = $5,000/month total
  • Edge processing (filtering 90% of data locally, sending only anomalies and aggregates) reduces cloud volume to 2.16 GB/day, cutting cloud costs to $1,510/month total
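The volume arithmetic and the Step 2 table can be combined into a quick estimator. This is a sketch: the function names are illustrative, and the thresholds simply mirror the table above.

```python
def daily_volume_gb(sensors, hz, bytes_per_reading):
    """Daily raw data volume in GB (decimal GB, matching the worked example)."""
    return sensors * hz * 86_400 * bytes_per_reading / 1e9

def recommend(gb_per_day):
    """Map daily volume to the stack tiers from the Step 2 table."""
    if gb_per_day < 10:
        return "Single-node PostgreSQL or InfluxDB"
    if gb_per_day < 100:
        return "Managed Kafka + Spark (3-node cluster)"
    if gb_per_day < 1000:
        return "Kafka cluster + Spark cluster (10+ nodes)"
    return "Multi-region Kafka + auto-scaling Spark"

# Smart factory: 5,000 sensors at 1 Hz, 50-byte readings
vol = daily_volume_gb(5_000, 1, 50)
print(f"{vol:.1f} GB/day -> {recommend(vol)}")
# 21.6 GB/day -> Managed Kafka + Spark (3-node cluster)
```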

28.4.1 IoT Data Volume Estimator

Use this calculator to estimate your daily data volume and find the recommended technology stack for your IoT deployment.

28.5 Time-Series Databases

IoT data is inherently time-series: continuous streams of timestamped readings. Specialized databases handle this pattern up to 100x better than general-purpose databases.

28.5.1 Time-Series Database Comparison

Database Type Best For Write Speed Query Speed Compression
InfluxDB Time-series High-frequency sensor data 100K+ writes/sec Fast time-range 10x
TimescaleDB Time-series (PostgreSQL) SQL-compatible analytics 50K writes/sec Complex joins 8x
Prometheus Metrics System monitoring 100K samples/sec PromQL queries 5x
Cassandra Wide-column Massive scale writes 1M+ writes/sec Key-based lookup 3x

28.5.2 Why Time-Series DBs Are Different

Comparison diagram showing row-based storage reading all columns for every query versus columnar storage reading only the requested timestamp and temperature columns, resulting in 80 percent less I/O for time-series queries
Figure 28.5: Row-Based versus Columnar Storage for Time-Series Queries

Columnar vs Row-Based Storage: Traditional databases read all columns for every query (100% I/O). Time-series columnar databases read only the requested columns (20% I/O for temperature queries), providing 5x performance improvement.

28.5.3 InfluxDB for IoT: Configuration Example

# InfluxDB configuration for IoT workload
influxdb_config:
  # Retention policies - automatic data lifecycle
  retention_policies:
    - name: "realtime"
      duration: "7d"           # Keep raw data 7 days
      replication: 2           # 2 copies for durability
      shard_duration: "1d"     # New shard daily

    - name: "downsampled"
      duration: "365d"         # Keep aggregates 1 year
      replication: 2

  # Continuous queries - automatic aggregation
  continuous_queries:
    - name: "downsample_temperature"
      query: |
        SELECT mean(temperature) AS avg_temp,
               max(temperature) AS max_temp,
               min(temperature) AS min_temp
        INTO downsampled.temperature_hourly
        FROM realtime.sensor_readings
        GROUP BY time(1h), sensor_id

  # Storage optimization
  storage:
    cache_snapshot_memory: "256m"
    compact_full_write_cold: "4h"
    index_version: "tsi1"      # Time-series index
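The continuous query above computes hourly mean/max/min per sensor. In plain Python the same aggregation looks like the sketch below: a toy model of what the `GROUP BY time(1h), sensor_id` clause does, not InfluxDB's API.

```python
from collections import defaultdict
from datetime import datetime

# Raw readings: (timestamp, sensor_id, temperature)
raw = [
    (datetime(2024, 1, 1, 10, 5),  "s1", 22.0),
    (datetime(2024, 1, 1, 10, 35), "s1", 24.0),
    (datetime(2024, 1, 1, 11, 10), "s1", 21.0),
]

# GROUP BY time(1h), sensor_id: bucket by hour and sensor
buckets = defaultdict(list)
for ts, sensor, temp in raw:
    hour = ts.replace(minute=0, second=0, microsecond=0)
    buckets[(hour, sensor)].append(temp)

# One aggregate row per (hour, sensor), like the downsampled measurement
hourly = {
    key: {"avg_temp": sum(v) / len(v), "max_temp": max(v), "min_temp": min(v)}
    for key, v in buckets.items()
}
for (hour, sensor), stats in sorted(hourly.items()):
    print(hour, sensor, stats)
```

InfluxDB runs this continuously and writes the results into the `downsampled` retention policy, so the raw 7-day data can expire while the hourly aggregates live for a year.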

28.6 Understanding Check: Time-Series DB vs Regular DB

Scenario: A smart building has 10,000 sensors (temperature, humidity, occupancy, energy consumption) sampling every 5 seconds. You need to:

  • Query: “Show me energy consumption trends for Floor 3 over the past week”
  • Downsample: Keep 5-second data for 30 days, then 1-minute averages forever
  • Retention: Automatically delete raw data older than 30 days

Think about: Why would you use InfluxDB (time-series DB) instead of PostgreSQL (relational DB)?

Key Insights:

  1. Write Performance:

    • Volume: 10,000 sensors x 12 samples/min = 120,000 writes/minute = 2,000 writes/second
    • PostgreSQL: Optimized for transactional updates (hundreds of writes/sec)
    • InfluxDB: Optimized for time-series writes (millions of writes/sec)
    • Real number: InfluxDB handles 100x more time-series writes than PostgreSQL
  2. Storage Efficiency:

    • Daily data: 10,000 sensors x 17,280 samples/day x 50 bytes = 8.64 GB/day raw
    • PostgreSQL: Row-based storage, no time-series compression: 8.64 GB/day
    • InfluxDB: Columnar + time-series compression (similar values compress well): 864 MB/day (90% reduction)
    • Real number: InfluxDB uses 10x less storage for time-series data
  3. Query Performance for Time Ranges:

    PostgreSQL approach:

    SELECT date_trunc('hour', timestamp), AVG(temperature)
    FROM sensor_readings
    WHERE sensor_location = 'Floor3'
      AND timestamp > NOW() - INTERVAL '7 days'
    GROUP BY date_trunc('hour', timestamp);
    • Scans full table, filters rows, then aggregates
    • Query time: 30-60 seconds on billions of rows

    InfluxDB approach:

    SELECT MEAN(temperature)
    FROM sensors
    WHERE location = 'Floor3'
      AND time > now() - 7d
    GROUP BY time(1h);
    • Time-based indexing, automatic downsampling
    • Query time: 100-500ms (100x faster!)
  4. Automatic Retention Policies:

    PostgreSQL:

    -- Manual cron job required
    DELETE FROM sensor_readings WHERE timestamp < NOW() - INTERVAL '30 days';
    -- Runs for hours, locks table

    InfluxDB:

    -- Built-in retention policy
    CREATE RETENTION POLICY "30days" ON "sensors" DURATION 30d REPLICATION 1 DEFAULT;
    -- Automatic, no performance impact

Decision Rule:

Use TIME-SERIES DB when:
- High write volume (>1,000 writes/sec)
- Time-range queries common ("last hour", "past week")
- Data has timestamps and is append-only
- Automatic downsampling/retention needed
- Examples: InfluxDB, TimescaleDB, Prometheus

Use RELATIONAL DB when:
- Transactional updates (UPDATE/DELETE common)
- Complex joins across many tables
- ACID compliance critical
- Data is not primarily time-ordered
- Examples: PostgreSQL, MySQL

28.7 Technology Comparison

With all the major technologies covered, the following table provides a side-by-side comparison to guide your technology selection.

Requirement Traditional DB (PostgreSQL) Time-Series DB (InfluxDB) Stream Processing (Kafka + Spark) Batch Processing (Hadoop)
Write Throughput 1K-10K writes/sec 100K-1M writes/sec 100K-10M events/sec N/A (batch loads)
Storage Cost $0.10/GB/month $0.10/GB (but 10x compressed) $0.023/GB (S3) $0.023/GB (HDFS/S3)
Query Latency 100ms-10s 10ms-1s (time-range) 100ms-5s (real-time) Minutes to hours
Data Retention Manual deletion Auto retention policies Kafka: 1-30 days Years (petabytes)
Use Case Transactional apps Sensor data, metrics Real-time dashboards Historical analytics
Cost (1M events/day) $500/month $50/month $200/month $20/month (batch)

28.7.1 Common Pitfall: Using the Wrong Database

Pitfall: Using the Wrong Database for Time-Series Data

The Mistake:

-- Store sensor data in MySQL (row-oriented, general-purpose)
CREATE TABLE sensors (
    timestamp DATETIME,
    sensor_id INT,
    temperature FLOAT,
    humidity FLOAT
);

-- Query: Get average temperature per hour for last 30 days
SELECT DATE_FORMAT(timestamp, '%Y-%m-%d %H:00:00'),
       AVG(temperature)
FROM sensors
WHERE timestamp > NOW() - INTERVAL 30 DAY
GROUP BY DATE_FORMAT(timestamp, '%Y-%m-%d %H:00:00');

Why It’s Slow:

  • MySQL scans entire table even if you only need temperature column
  • Row-oriented storage reads all columns (timestamp, sensor_id, temperature, humidity)
  • No automatic time-based partitioning
  • Grouping by hour requires complex DATE_FORMAT parsing
  • Query time: 45 seconds on 100M rows

The Fix: Use time-series database like InfluxDB

-- InfluxDB query (columnar, time-optimized)
SELECT MEAN(temperature)
FROM sensors
WHERE time > now() - 30d
GROUP BY time(1h)

Why It’s Fast:

  • Columnar storage: Only reads temperature and time columns
  • Automatic time-based sharding (data partitioned by day/hour)
  • Native time aggregation (GROUP BY time(1h) is optimized)
  • Query time: 2 seconds on same 100M rows

Performance: 22x faster with purpose-built database

Time-Series DB Storage Efficiency: 10,000 sensors report temperature every 5 seconds. Daily readings: \(10,000 \times 17,280 = 172.8M\) readings.

Row-based storage (PostgreSQL): Each row stores timestamp (8 bytes), sensor_id (4 bytes), temperature (4 bytes), humidity (4 bytes) = 20 bytes/row. Daily storage: \(172.8M \times 20 = 3.46 \text{ GB/day}\). No compression.

Columnar storage (InfluxDB): Store columns separately. Temperature column: similar values compress well with delta encoding. Instead of storing \([22.1, 22.2, 22.1, 22.3]\), store \(22.1 + [+0.1, -0.1, +0.2]\): 4 values in the space of ~2. Compression ratio: 10:1.

Daily InfluxDB storage: \(3.46 / 10 = 0.346 \text{ GB/day}\). Annual savings: \((3.46 - 0.346) \times 365 \times \$0.023 = \$26.14\) storage cost alone. Query speedup: temperature column scans \(172.8M \times 4\text{ bytes} = 691\text{ MB}\) instead of \(3.46\text{ GB}\) (row-based must read all columns). Result: 5x faster queries, 10x less storage.
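Delta encoding itself is easy to sketch. The code below illustrates the idea only, not InfluxDB's actual TSM encoder; for \([22.1, 22.2, 22.1, 22.3]\) the stored deltas come out as +0.1, -0.1, +0.2.

```python
def delta_encode(values):
    """Store the first value plus successive differences (small, repetitive numbers)."""
    deltas = [round(b - a, 6) for a, b in zip(values, values[1:])]
    return values[0], deltas

def delta_decode(first, deltas):
    """Rebuild the original series by cumulative summation."""
    out = [first]
    for d in deltas:
        out.append(round(out[-1] + d, 6))
    return out

readings = [22.1, 22.2, 22.1, 22.3]
first, deltas = delta_encode(readings)
print(first, deltas)  # 22.1 [0.1, -0.1, 0.2]
assert delta_decode(first, deltas) == readings  # lossless round-trip
```

Because consecutive sensor readings barely change, the deltas cluster near zero and need far fewer bits than the raw values, which is where the ~10:1 compression comes from.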

Think of big data tools like different helpers in a giant library – each one is best at a specific job!

28.7.2 The Sensor Squad Adventure: The Giant Library

Sammy the Sensor had been collecting temperature readings for a whole year. “I have MILLIONS of numbers!” she exclaimed. “How will I ever find patterns?”

Max the Microcontroller took Sammy to the Giant Data Library. “Meet our helpers!” he said.

First, they met Hadoop the Librarian. “I split big books into chapters and give each chapter to a different reader,” Hadoop explained. “Instead of one person reading the whole book, a hundred people each read one chapter at the same time. We finish 100 times faster!”

Next was Sparky the Speed Reader. “I’m like Hadoop, but I memorize everything instead of writing notes on paper. That makes me 10 to 100 times faster!” Sparky zoomed through Sammy’s data in minutes.

Then they met Kafka the Messenger. “I’m like a super-fast mail carrier,” Kafka said. “When new readings arrive, I deliver them to everyone who needs them – the dashboard, the alarm system, and the storage room – all at the same time, in just milliseconds!”

Finally, Lila the LED introduced them to Indy the Time-Keeper (InfluxDB). “I’m special because I ONLY organize things by time,” Indy said. “When Sammy asks ‘What was the temperature last Tuesday at 3pm?’ I can answer in a blink because that’s ALL I do!”

Bella the Battery smiled. “Each helper is amazing at their special job. The key is picking the right helper for the right task!”

28.7.3 Key Words for Kids

Word What It Means
Hadoop A system that splits big jobs into small pieces for many computers to work on together
Spark A super-fast data processor that keeps everything in memory instead of writing to disk
Kafka A messenger that delivers data to many places at the same time, super quickly
Time-series database A special database that organizes everything by time, perfect for sensor data

Key Takeaway

Choosing the right big data technology depends on your IoT workload: use Hadoop/Spark for historical batch analysis of terabytes, Kafka for real-time streaming at millions of events per second, and time-series databases like InfluxDB for sensor data that needs fast time-range queries and automatic retention policies. Using a general-purpose database for time-series IoT data can be 22x slower than a purpose-built solution.

Common Pitfalls

Kafka requires a cluster (typically 3+ brokers + ZooKeeper/KRaft) with significant operational overhead. For deployments with fewer than 1,000 messages per second, MQTT brokers or managed services (AWS IoT Core, Azure IoT Hub) are simpler and cheaper.

Spark’s micro-batch model introduces at least 100ms latency per batch. For real-time anomaly detection requiring <100ms response, use Apache Flink or a dedicated stream processing engine.

Storing millions of timestamped sensor readings in PostgreSQL without partitioning will produce table scans taking minutes for simple time-range queries. Use InfluxDB, TimescaleDB, or Cassandra with time-based partitioning.

A small factory with 50 sensors does not need Kafka, Spark, and HDFS. A time-series database with a batch Python job often satisfies the requirement with 1/10th the operational complexity.

28.8 Summary

  • Apache Hadoop ecosystem provides distributed processing through HDFS for storage, MapReduce and Spark for computation, and frameworks like Kafka for real-time streaming.
  • Apache Spark offers 10-100x faster processing than MapReduce through in-memory computation, supporting SQL queries, machine learning, and streaming in a unified platform.
  • Apache Kafka handles millions of events per second with millisecond latency, enabling real-time IoT data pipelines with built-in durability and scalability.
  • Time-series databases like InfluxDB provide 100x better write performance and 10x storage compression compared to traditional databases for sensor data workloads.

28.9 Knowledge Check

Run this Python simulation to see why specialized databases matter for IoT workloads.

import time
import random
from datetime import datetime, timedelta

# Simulate 1 million sensor readings
NUM_READINGS = 1_000_000
readings = [
    {
        "timestamp": datetime(2024, 1, 1) + timedelta(seconds=i),
        "sensor_id": random.randint(1, 1000),
        "temperature": 20 + random.gauss(0, 3),
        "humidity": 50 + random.gauss(0, 10)
    }
    for i in range(NUM_READINGS)
]

print(f"Generated {len(readings):,} sensor readings\n")

# ========== Approach 1: Row-Based (PostgreSQL-style) ==========
def row_based_query(data, start_time, end_time):
    """Simulate row-oriented database: Read ALL columns for each row"""
    result = []
    bytes_read = 0
    for row in data:
        # Row-based: Must read entire row even if we only need temperature
        bytes_read += 8 + 4 + 4 + 4  # timestamp + sensor_id + temp + humidity = 20 bytes
        if start_time <= row["timestamp"] < end_time:
            result.append(row["temperature"])
    return result, bytes_read

# ========== Approach 2: Columnar (InfluxDB-style) ==========
def columnar_query(data, start_time, end_time):
    """Simulate columnar database: Read ONLY timestamp + temperature columns"""
    result = []
    bytes_read = 0
    for row in data:
        # Columnar: Only read timestamp (8 bytes) + temperature (4 bytes)
        bytes_read += 8 + 4  # Skip sensor_id and humidity entirely
        if start_time <= row["timestamp"] < end_time:
            result.append(row["temperature"])
    return result, bytes_read

# ========== Benchmark: Get average temperature for 1-hour window ==========
query_start = datetime(2024, 1, 1, 12, 0, 0)  # Jan 1, 2024 12:00 PM
query_end = query_start + timedelta(hours=1)   # 1-hour window

print("=== Query: Average temperature for 1-hour window ===\n")

# Row-based (PostgreSQL simulation)
start = time.time()
temps_row, bytes_row = row_based_query(readings, query_start, query_end)
time_row = time.time() - start
avg_row = sum(temps_row) / len(temps_row) if temps_row else 0

print(f"Row-Based Database (PostgreSQL):")
print(f"  Results: {len(temps_row):,} readings")
print(f"  Average: {avg_row:.2f}°C")
print(f"  Data scanned: {bytes_row / 1_000_000:.2f} MB")
print(f"  Query time: {time_row * 1000:.1f} ms\n")

# Columnar (InfluxDB simulation)
start = time.time()
temps_col, bytes_col = columnar_query(readings, query_start, query_end)
time_col = time.time() - start
avg_col = sum(temps_col) / len(temps_col) if temps_col else 0

print(f"Columnar Database (InfluxDB):")
print(f"  Results: {len(temps_col):,} readings")
print(f"  Average: {avg_col:.2f}°C")
print(f"  Data scanned: {bytes_col / 1_000_000:.2f} MB")
print(f"  Query time: {time_col * 1000:.1f} ms\n")

# Comparison
print(f"=== Performance Comparison ===")
print(f"Data reduction: {(1 - bytes_col / bytes_row) * 100:.1f}% less I/O")
print(f"Speed improvement: {time_row / time_col:.1f}x faster")
print(f"\nWhy columnar wins for IoT:")
print(f"  • Skipped {(bytes_row - bytes_col) / 1_000_000:.2f} MB of unused data (sensor_id, humidity)")
print(f"  • Only read columns needed for this query (timestamp + temperature)")
print(f"  • Row-based must read ALL columns even when unused")

What to Observe:

  1. Data Volume: Columnar reads 40% less data (12 of 20 bytes per row: it skips the sensor_id and humidity columns)
  2. Query Speed: Columnar completes 2-3x faster due to less disk I/O
  3. Scalability: At 10x data (10M readings), both scans grow linearly, but columnar keeps skipping the same 40% of unused bytes, so its advantage holds at scale
  4. Compression Benefit: Real columnar DBs also compress temperature column 10x better than row-based (similar values compress well)

The Lesson: For IoT time-series queries like “average temperature last hour”, reading only the timestamp and temperature columns (columnar) is dramatically faster than reading entire rows (row-based). This 40% I/O reduction compounds at scale.

Big Data Technologies connects to:

  • Technology Selection: Storage tier (HDFS vs S3) → Processing tier (MapReduce vs Spark) → Query tier (Hive vs Presto) must align
  • Lambda Architecture Dependencies: Batch layer uses Spark/Hadoop, Speed layer uses Kafka/Flink (Big Data Pipelines)
  • Alternative Approaches: Time-series DBs (InfluxDB) replace general-purpose DBs (PostgreSQL) for 100x better IoT performance
  • Cost Optimization: Edge processing (Edge Compute Patterns) reduces data volume before these technologies run

Technology Decision Matrix:

Data Volume → Processing Frequency → Technology Choice
├─ <1 TB/day, Hourly  → Spark batch jobs on single node
├─ 1-10 TB/day, Real-time → Kafka + Flink cluster
└─ >10 TB/day, Both → Lambda (Kafka + Spark + Flink)
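The matrix can also be read as a small decision function. The thresholds below are the illustrative ones from the matrix, and `choose_stack` is a hypothetical name, not a standard tool.

```python
def choose_stack(tb_per_day, realtime, batch):
    """Sketch of the decision matrix above: volume + processing frequency -> stack."""
    if realtime and batch and tb_per_day > 10:
        return "Lambda (Kafka + Spark + Flink)"
    if realtime:
        return "Kafka + Flink cluster"
    return "Spark batch jobs" + (" on a single node" if tb_per_day < 1 else " on a cluster")

print(choose_stack(0.02, realtime=False, batch=True))  # Spark batch jobs on a single node
print(choose_stack(5, realtime=True, batch=False))     # Kafka + Flink cluster
print(choose_stack(20, realtime=True, batch=True))     # Lambda (Kafka + Spark + Flink)
```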

Key Insight: Match tool to workload. Hadoop excels at massive batch jobs. Spark wins for iterative ML. Kafka dominates real-time streams. Time-series DBs beat all for sensor queries. Mixing wrong tool costs 10-100x performance.


28.10 What’s Next

If you want to… Read this
Understand the pipeline architecture these technologies support Big Data Pipelines
See how these technologies handle edge processing Big Data Edge Processing
Explore operational deployments of these technologies Big Data Case Studies
Apply ML on top of big data infrastructure Modeling and Inferencing
Return to the module overview Big Data Overview