39 Cloud Data: Architecture Gallery

In 60 Seconds

This visual gallery provides reference diagrams for cloud data architecture patterns including data lakes, data warehouses, stream processing pipelines, sensor fusion algorithms (Kalman and particle filters), and machine learning model deployment – the essential building blocks for IoT cloud solutions.

Key Concepts

  • Reference architecture: A standardised, reusable architectural blueprint for a class of systems, providing a proven starting point that avoids reinventing common design decisions.
  • IoT data tier: A logical layer in a cloud architecture responsible for ingesting, storing, and serving sensor data, typically composed of message brokers, time-series databases, and object storage.
  • Data lakehouse: A modern architecture combining the low-cost storage of a data lake with the structured query performance of a data warehouse, suitable for mixed IoT analytics workloads.
  • Event-driven architecture: A design pattern where components communicate by publishing and subscribing to events (sensor readings, alarms) rather than through direct API calls, enabling loose coupling and scalability.
  • Zonal redundancy: Deploying cloud infrastructure across multiple availability zones within a region to survive the failure of any single physical facility without service interruption.
  • SLA (Service Level Agreement): A contractual commitment specifying uptime, latency, and data retention guarantees for a cloud IoT platform — a key factor when selecting managed services.

Cloud data architecture is the blueprint for organizing IoT data in cloud services. Think of designing a warehouse layout – you need receiving docks for incoming data, organized storage areas, processing stations, and shipping areas where insights are delivered. A good architecture keeps data flowing efficiently from sensors to dashboards.

39.3 Cloud Data Architecture Visuals

Artistic visualization of cloud data management showing data lakes and data warehouses receiving IoT streams, with ETL pipelines transforming raw data into structured analytics-ready datasets, and multiple query engines accessing the organized data
Figure 39.1: Cloud data management for IoT encompasses both data lakes for raw storage and data warehouses for structured analytics. ETL pipelines continuously transform incoming sensor streams into analytics-ready datasets. This dual architecture supports both exploratory analysis on raw data and efficient querying of pre-aggregated metrics.
Artistic visualization of IoT device registry showing a hierarchical database of registered devices with their identities, security certificates, configuration states, and connection status, enabling fleet-wide device management and monitoring
Figure 39.2: Device registry services form the foundation of cloud IoT platforms, maintaining authoritative records of device identities, security credentials, and configuration state. This centralized registry enables fleet management, security policy enforcement, and device lifecycle tracking from provisioning through decommissioning.
End-to-end IoT cloud workflow showing devices transmitting sensor data through gateways to cloud ingestion services, passing through data processing and storage layers, feeding analytics and machine learning models, and presenting results through user-facing applications
Figure 39.3: Using cloud for IoT data management
Comprehensive IoT cloud architecture showing device layer with sensors and gateways, connectivity layer with MQTT and HTTP protocols, ingestion layer with message queuing, processing layer with stream and batch engines, storage layer with data lakes and warehouses, and application layer with dashboards and APIs
Figure 39.4: Complete IoT cloud architecture from devices to applications
IoT rule engine architecture showing incoming event stream processed by rule matching engine with condition evaluation, action triggering for alerts/notifications/device commands, and rule management interface for defining complex event patterns
Figure 39.5: IoT rule engine for event-driven automation
Device shadow synchronization diagram showing physical device state, cloud shadow representation, desired vs reported state comparison, and bidirectional synchronization protocol handling offline device updates and command queuing
Figure 39.6: Device shadow pattern for reliable cloud-device synchronization
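The core of the shadow pattern in Figure 39.6 is a comparison between desired and reported state documents. A minimal sketch of that comparison, with hypothetical field names rather than any particular vendor's shadow schema:

```python
def shadow_delta(desired: dict, reported: dict) -> dict:
    """Return the settings the device still needs to apply.

    A cloud shadow keeps two documents: 'desired' (what the cloud wants)
    and 'reported' (what the device last confirmed). The delta is queued
    and delivered when the device reconnects.
    """
    return {key: value
            for key, value in desired.items()
            if reported.get(key) != value}

# Example: the cloud changed the sample rate while the device was offline.
desired = {"sample_rate_hz": 10, "led": "on"}
reported = {"sample_rate_hz": 1, "led": "on"}
print(shadow_delta(desired, reported))  # {'sample_rate_hz': 10}
```

Real shadow services layer versioning and conflict resolution on top of this comparison, but the desired/reported split is the essential idea.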

39.3.1 Storage and Processing Patterns

Artistic visualization of batch processing architecture for IoT data showing scheduled jobs collecting accumulated sensor data, MapReduce processing stages, and output to data warehouse for historical analytics.

Batch Processing

Batch processing handles large volumes of historical IoT data through scheduled jobs, enabling complex analytics that don’t require real-time results.

Modern diagram of multi-sensor data fusion architecture showing sensor layer, preprocessing, feature extraction, fusion algorithms, and decision output with feedback loops.

Data Fusion Architecture

Data fusion combines information from multiple IoT sensors to produce more accurate and reliable insights than any single sensor could provide.

Geometric visualization of IoT data lake pipeline showing raw data ingestion, schema-on-read processing, data catalog, and multiple consumption patterns for analytics and ML.

Data Lake Pipeline

Data lakes store raw IoT data in native format, enabling schema-on-read flexibility for diverse analytics and machine learning workloads.

Artistic visualization of IoT data lake showing massive storage reservoir receiving streams from diverse IoT sources with zones for raw, curated, and consumption-ready data.

Data Lake

Data lakes provide cost-effective storage for massive IoT data volumes, supporting both structured and unstructured data types.

Artistic diagram of data lakehouse architecture combining data lake storage economics with data warehouse query performance through delta format and optimization layers.

Data Lakehouse

Data lakehouses combine the flexibility of data lakes with the query performance of data warehouses, ideal for IoT analytics at scale.

Geometric visualization of data warehouse architecture showing ETL pipelines, dimensional modeling with fact and dimension tables, and OLAP cubes for IoT business intelligence.

Data Warehouse

Data warehouses provide structured, optimized storage for IoT analytics with pre-defined schemas enabling fast business intelligence queries.

Geometric diagram of data mesh architecture showing domain-oriented data products, self-serve data platform, and federated governance for decentralized IoT data management.

Data Mesh

Data mesh applies domain-driven design to IoT data, with decentralized ownership and federated governance enabling scalable data management.

Geometric visualization of data replication strategies showing synchronous vs asynchronous replication, consistency models, and geographic distribution for IoT data availability.

Data Replication

Data replication ensures IoT data availability and durability through redundant copies across storage systems and geographic regions.

Geometric diagram of data lineage tracking showing source sensors, transformation steps, derived datasets, and consumption points with metadata at each stage.

Data Lineage

Data lineage tracks IoT data from sensor origin through all transformations to final consumption, enabling debugging and compliance.

Artistic visualization of data pipeline orchestration showing DAG-based workflow scheduling, dependency management, monitoring, and error handling for IoT data flows.

Data Pipeline Orchestration

Pipeline orchestration manages complex IoT data workflows with scheduling, dependency tracking, monitoring, and automated recovery.

Artistic diagram of stream processing engine showing continuous ingestion, windowing operations, stateful computations, and real-time output for IoT event processing.

Stream Processor

Stream processors enable real-time IoT analytics through continuous processing of sensor data as it arrives.

Geometric visualization of stream processing flow showing event sources, Kafka topics, Flink/Spark processing, and sink outputs with exactly-once semantics.

Stream Processing Flow

End-to-end stream processing pipelines provide reliable real-time IoT analytics with guaranteed delivery semantics.
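The windowing step at the heart of such pipelines can be sketched in plain Python, assuming event timestamps in seconds; engines like Flink or Spark Structured Streaming add continuous execution, fault tolerance, and exactly-once guarantees on top of this logic:

```python
from collections import defaultdict

def tumbling_window_avg(events, window_s=60):
    """Group (timestamp, value) events into fixed-size windows and average.

    Each event belongs to exactly one window, identified by the window's
    start time (timestamp rounded down to a multiple of window_s).
    """
    sums = defaultdict(lambda: [0.0, 0])  # window_start -> [sum, count]
    for ts, value in events:
        window_start = (ts // window_s) * window_s
        acc = sums[window_start]
        acc[0] += value
        acc[1] += 1
    return {w: s / n for w, (s, n) in sorted(sums.items())}

readings = [(0, 20.0), (30, 22.0), (65, 30.0)]
print(tumbling_window_avg(readings))  # {0: 21.0, 60: 30.0}
```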

Artistic visualization of time series data patterns showing trend, seasonality, noise components, and common IoT patterns like sensor drift and periodic readings.

Time Series

Time series data is the fundamental data type in IoT, requiring specialized storage and analytics techniques.

Geometric diagram of time synchronization for distributed IoT systems showing NTP, PTP, and GPS time sources with clock drift compensation and precision requirements.

Time Synchronization

Accurate time synchronization is essential for correlating IoT events across distributed sensor networks.

Artistic visualization of converged data networks carrying IoT sensor streams alongside traditional enterprise traffic over unified infrastructure.

Converged Data Networks

Converged networks efficiently carry IoT data streams alongside other enterprise traffic through unified infrastructure.

Modern diagram showing decision factors for edge vs cloud placement of IoT processing including latency requirements, bandwidth costs, privacy needs, and computational complexity.

Edge Cloud Placement

Optimal placement of IoT processing between edge and cloud depends on latency, bandwidth, privacy, and computational requirements.

39.3.2 Machine Learning and Prediction

Geometric diagram of IoT predictive model pipeline showing feature engineering from sensor data, model training, validation, deployment, and continuous learning loop.

Predictive Model

Predictive models transform IoT sensor patterns into actionable forecasts for maintenance, demand, and anomaly detection.

Geometric visualization of ML model registry for IoT showing model versioning, metadata tracking, deployment status, and lineage for managing production models.

Model Registry

Model registries manage the lifecycle of IoT machine learning models from training through deployment and monitoring.

39.3.3 Sensor Fusion and Filtering

Artistic visualization of Kalman filter operation showing prediction step, measurement update, Kalman gain calculation, and state estimation for sensor fusion.

Kalman Filter

The Kalman filter provides optimal sensor fusion for noisy IoT measurements with known dynamics and noise characteristics.

Artistic diagram showing Kalman filter limitations including linearity assumption, Gaussian noise requirement, and initialization sensitivity with alternative approaches.

Kalman Filter Limitations

Understanding Kalman filter limitations guides selection of appropriate filtering techniques for different IoT scenarios.

Artistic visualization of core Kalman filter equations showing state prediction, covariance prediction, and measurement update with matrix operations.

Three Equations

The three core Kalman filter equations enable recursive optimal estimation from noisy IoT sensor measurements.
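The recursion can be illustrated with a minimal one-dimensional sketch; the process noise q, measurement noise r, and initial values here are illustrative tuning parameters, not values from the text:

```python
def kalman_1d(measurements, q=0.01, r=1.0, x0=0.0, p0=1.0):
    """Recursive 1-D Kalman filter for a random-walk state.

    Predict:  x_pred = x               (no control input)
              p_pred = p + q           (process noise grows uncertainty)
    Update:   k      = p_pred / (p_pred + r)
              x      = x_pred + k * (z - x_pred)
              p      = (1 - k) * p_pred
    """
    x, p = x0, p0
    estimates = []
    for z in measurements:
        p_pred = p + q                  # covariance prediction
        k = p_pred / (p_pred + r)       # Kalman gain
        x = x + k * (z - x)             # state update toward measurement
        p = (1 - k) * p_pred            # covariance update
        estimates.append(x)
    return estimates
```

Feeding it a constant noisy signal shows the estimate converging while the gain settles to a steady-state value determined by the q/r ratio.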

Artistic visualization of particle filter localization showing particle cloud, motion prediction, measurement weighting, and resampling for non-Gaussian positioning.

Particle Filters for Location

Particle filters enable robust localization for IoT devices when sensor noise is non-Gaussian or system dynamics are nonlinear.

Artistic diagram of particle filter correction step showing likelihood weighting of particles based on sensor measurements and posterior distribution estimation.

Particle Filters Correct

Particle filter correction updates position estimates by weighting particles according to sensor measurement likelihood.

Artistic visualization of advanced particle filter correction with multiple sensor modalities and adaptive particle count for varying uncertainty.

Particle Filters Correct 2

Advanced particle filtering combines multiple IoT sensor modalities for robust state estimation in challenging environments.

Artistic diagram of particle filter resampling showing importance sampling, systematic resampling, and particle degeneracy prevention techniques.

Particle Filters Resample

Resampling maintains particle diversity in IoT tracking applications, preventing degeneracy as estimates converge.

Artistic visualization of adaptive resampling strategies for particle filters showing effective sample size monitoring and dynamic particle allocation.

Particle Filters Resample 2

Adaptive resampling optimizes computational efficiency while maintaining tracking accuracy in resource-constrained IoT devices.

Artistic diagram comparing different resampling algorithms for particle filters including multinomial, stratified, and residual approaches.

Particle Filters Resample 3

Different resampling algorithms offer trade-offs between variance reduction and computational complexity for IoT applications.
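As one concrete example, systematic resampling can be sketched in a few lines. This is a generic textbook formulation, not tied to any specific library:

```python
import random

def systematic_resample(particles, weights):
    """Systematic resampling: one random offset, N evenly spaced pointers.

    Compared with multinomial resampling it has lower variance and runs
    in O(N), which matters on resource-constrained IoT hardware.
    """
    n = len(particles)
    total = sum(weights)
    u = random.random()                       # single shared offset
    positions = [(u + i) / n for i in range(n)]
    resampled, j = [], 0
    cumulative = weights[0] / total
    for pos in positions:
        # Advance through the cumulative weight distribution.
        while pos > cumulative and j < n - 1:
            j += 1
            cumulative += weights[j] / total
        resampled.append(particles[j])
    return resampled
```

With all the weight on one particle, every output slot copies that particle; with uniform weights, each particle survives exactly once.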

Artistic visualization comparing real-time filtering vs offline smoothing for IoT sensor data showing forward-backward passes and optimal estimation.

Filters and Smoothers

Smoothers provide superior estimates when post-processing IoT data offline, while filters work in real-time with causal data.

Artistic visualization of probabilistic sensor fusion showing uncertainty representation, belief propagation, and confidence estimation for IoT.

Probabilistic Approach

Probabilistic approaches maintain uncertainty estimates alongside point values, enabling informed decision-making in IoT systems.

Artistic diagram of recursive Bayesian filtering showing prior prediction, measurement likelihood, posterior update, and the continuous estimation cycle.

Recursive Bayesian Filters

Recursive Bayesian filters provide a principled framework for sequential sensor fusion in IoT applications.

39.3.4 Inertial Navigation and Motion

Artistic visualization of inertial navigation system showing accelerometer and gyroscope integration, position drift, and the need for external corrections.

Inertial Navigation

Inertial navigation enables position tracking without external references but accumulates drift requiring periodic corrections.

Artistic diagram of dead reckoning showing velocity integration, heading estimation, and cumulative position error growth over time.

Inertial Navigation 2

Dead reckoning tracks IoT device position through integration but requires fusion with absolute position sources.

Artistic visualization of IMU-based navigation showing 6-DOF motion tracking, gravity removal, and coordinate frame transformations.

Inertial Navigation 3

Six-degree-of-freedom IMU tracking enables comprehensive motion analysis for IoT wearables and robotics applications.

Artistic diagram of sensor fusion for inertial navigation combining GPS, magnetometer, and barometer with IMU for drift-free positioning.

Inertial Navigation 4

Multi-sensor fusion corrects inertial navigation drift using GPS, magnetometer, and barometer references.

Artistic visualization of pitch and roll angle estimation from accelerometer gravity vector with gimbal lock considerations and complementary filtering.

Pitch and Roll Angles

Pitch and roll estimation from accelerometers enables orientation tracking for IoT devices in many applications.

Artistic visualization of gyroscope drift showing integrated angle diverging from true value over time due to bias and noise accumulation.

Gyroscope Drift

Gyroscope integration drift is a fundamental challenge in IoT motion tracking, requiring complementary filtering or sensor fusion.

Artistic visualization of noisy accelerometer-based angle estimation showing vibration sensitivity and high-frequency noise compared to stable gyroscope integration.

Accelerometer Noise

Accelerometer angle estimates are noisy but don’t drift, complementing gyroscope measurements in sensor fusion.
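One common way to combine the two sensors is a complementary filter. This is a minimal sketch, with the blend factor alpha and the sample period chosen purely for illustration:

```python
def complementary_filter(gyro_rates, accel_angles, dt=0.01, alpha=0.98):
    """Fuse gyro rates (smooth but drifting) with accelerometer angles
    (noisy but drift-free) into one stable pitch or roll estimate.

    alpha close to 1 trusts the gyro for fast changes, while the small
    (1 - alpha) term continuously pulls the estimate back toward the
    accelerometer's absolute reference, cancelling long-term drift.
    """
    angle = accel_angles[0]  # initialize from the absolute reference
    estimates = []
    for rate, accel_angle in zip(gyro_rates, accel_angles):
        gyro_part = angle + rate * dt                    # integrate rate
        angle = alpha * gyro_part + (1 - alpha) * accel_angle
        estimates.append(angle)
    return estimates
```

Even with a biased gyro, the accelerometer term bounds the error instead of letting it grow without limit as pure integration would.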

Artistic visualization of trapezoidal integration for discrete sensor data showing area approximation, error bounds, and comparison with other integration methods.

Trapezoidal Integration

Trapezoidal integration provides accurate numerical integration of IoT sensor data for position and velocity estimation.
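A minimal sketch of the rule for uniformly sampled data, assuming a fixed sample period dt:

```python
def trapezoid_integrate(samples, dt):
    """Integrate uniformly sampled sensor data with the trapezoid rule.

    Each interval contributes the average of its endpoints times dt,
    which is exact for linearly changing signals (unlike the simpler
    rectangle rule, which assumes the signal is constant per sample).
    """
    total = 0.0
    series = [0.0]  # running integral, starting from zero
    for prev, curr in zip(samples, samples[1:]):
        total += 0.5 * (prev + curr) * dt
        series.append(total)
    return series

# Constant 2 m/s^2 acceleration sampled at 10 Hz for 1 second:
accel = [2.0] * 11
velocity = trapezoid_integrate(accel, dt=0.1)
print(velocity[-1])  # 2.0 m/s after 1 second (v = a * t)
```

Integrating the velocity series a second time yields position, which is exactly where the double integration of dead reckoning accumulates its drift.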

39.3.5 Data Quality and Calibration

Artistic visualization of sensor calibration process showing raw readings, offset correction, scale factors, cross-axis alignment, and calibrated output.

Calibrated Data

Sensor calibration corrects systematic errors in IoT measurements, improving accuracy across operating conditions.

Artistic visualization of sensor noise sources including thermal noise, quantization error, environmental interference, and signal conditioning effects.

Measurements Are Noisy

Understanding noise sources in IoT sensors guides selection of appropriate filtering and fusion techniques.

Artistic visualization of symmetry ambiguity in sensor-based orientation estimation showing multiple valid interpretations of the same measurements.

Symmetry Problem

Sensor symmetry creates ambiguity in orientation estimation that requires additional measurements or constraints to resolve.

39.3.6 State Estimation and Tracking

Artistic visualization of state vector representation for IoT tracking showing position, velocity, orientation, and sensor bias components.

State Vector

State vectors capture all relevant information about IoT device state for estimation and prediction algorithms.

Artistic visualization of simple IoT tracking example showing sensor measurements, state estimation, and predicted trajectory with uncertainty bounds.

Simple Tracking Example

Simple tracking demonstrates fundamental concepts of state estimation from noisy IoT sensor measurements.

Artistic diagram of tracking with measurement gaps showing prediction during missing data and update when measurements resume.

Simple Tracking Example 2

Handling measurement gaps is essential for robust IoT tracking when sensors intermittently lose signal.

Artistic visualization of multi-target tracking scenario with data association, track initialization, and track management for multiple IoT devices.

Complex Example

Complex tracking scenarios require sophisticated data association and track management for multiple IoT targets.

Artistic visualization of favorable tracking conditions with good sensor coverage, predictable motion, and clean measurements.

Works Well

Understanding favorable conditions helps design IoT deployments that maximize tracking performance.

39.3.7 Cloud and Infrastructure

Artistic visualization of cloud service models showing IaaS, PaaS, and SaaS layers with responsibility boundaries between provider and customer.

Service Models

Cloud service models define responsibility boundaries for IoT deployments, from infrastructure to complete applications.

Artistic visualization of Square Kilometre Array data challenge showing massive sensor array, exabyte-scale data generation, and distributed processing architecture.

SKA

The SKA represents extreme IoT data challenges, generating exabytes of sensor data requiring innovative processing solutions.

Cloud-based IoT usage involves coordinated patterns of device registration, data ingestion, rules-based processing, analytics, and application integration (see Figure 39.3 above for the end-to-end workflow).

Artistic visualization of IoT cloud networking showing VPC configuration, security groups, load balancing, and hybrid connectivity options.

Networking

Cloud networking for IoT requires careful security configuration while enabling reliable device connectivity.

Artistic visualization of resource-intensive IoT sensing showing high-frequency sampling, large data volumes, and processing requirements for complex sensors.

Sensing Resource Intensive

Resource-intensive sensing applications require careful architecture balancing edge processing with cloud capabilities.

39.3.8 Data Processing Details

Artistic visualization of IoT information flow patterns including request-response, publish-subscribe, streaming, and batch with appropriate use cases.

Information Flow Types

Different information flow patterns suit different IoT application requirements for latency, throughput, and reliability.

Artistic visualization of IoT data generation rates by device type showing sensors, cameras, industrial equipment, and connected vehicles with bytes per second.

Data Generation Table

Understanding data generation rates by IoT device type enables appropriate infrastructure sizing and cost estimation.

Artistic diagram of IoT data growth projections showing exponential increase in connected devices and data volume over time.

Data Generation Table 2

Projected IoT data growth drives architecture decisions for scalable data management infrastructure.

Artistic visualization of IoT data pipeline implementation showing component selection, integration patterns, and deployment considerations.

Implementation

Practical IoT data pipeline implementation requires careful technology selection and integration planning.

Artistic visualization of low-level IoT data handling showing byte ordering, data alignment, serialization formats, and protocol efficiency.

The Nitty Gritty

Understanding low-level data handling details is essential for efficient IoT system implementation.


Key Takeaway

Cloud data architectures combine multiple patterns to serve different IoT needs: data lakes for flexible raw storage, data warehouses for fast structured queries, stream processors for real-time analytics, and sensor fusion algorithms like Kalman and particle filters for extracting accurate state estimates from noisy measurements. Selecting the right combination of patterns depends on your latency, accuracy, and cost requirements.

Kalman Filter State Estimation: A GPS tracker reports position with \(\pm 10\text{ m}\) error (\(\sigma_{\text{GPS}} = 10\)). We predict movement based on last velocity: predicted position has \(\pm 5\text{ m}\) uncertainty (\(\sigma_{\text{predict}} = 5\)).

Kalman Gain determines optimal weight between prediction and measurement: \[K = \frac{\sigma_{\text{predict}}^2}{\sigma_{\text{predict}}^2 + \sigma_{\text{GPS}}^2} = \frac{25}{25 + 100} = \frac{25}{125} = 0.2\]

Predicted position: \((100, 200)\) meters. GPS measurement: \((110, 205)\) meters. Kalman filter fuses both:

\[\text{Best estimate} = \text{prediction} + K \times (\text{measurement} - \text{prediction})\]

\[x = 100 + 0.2 \times (110 - 100) = 100 + 2 = 102 \text{ m}\] \[y = 200 + 0.2 \times (205 - 200) = 200 + 1 = 201 \text{ m}\]

Final position: \((102, 201)\) with reduced uncertainty \(\sigma_{\text{fused}} = \sqrt{(1-K) \times \sigma_{\text{predict}}^2} = \sqrt{0.8 \times 25} = 4.47\text{ m}\). Fusion reduces error from 10m (GPS alone) to 4.47m—55% improvement.
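The worked example above can be checked with a few lines of Python; the function name is ours, but the arithmetic is exactly the 1-D Kalman update from the text:

```python
def fuse(prediction, measurement, sigma_predict, sigma_meas):
    """One 1-D Kalman update step: gain, fused estimate, fused sigma."""
    k = sigma_predict**2 / (sigma_predict**2 + sigma_meas**2)
    estimate = prediction + k * (measurement - prediction)
    sigma_fused = ((1 - k) * sigma_predict**2) ** 0.5
    return estimate, sigma_fused, k

x, sx, k = fuse(100.0, 110.0, sigma_predict=5.0, sigma_meas=10.0)
y, _, _ = fuse(200.0, 205.0, sigma_predict=5.0, sigma_meas=10.0)
print(k, x, y, round(sx, 2))  # 0.2 102.0 201.0 4.47
```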

39.3.9 Interactive: Kalman Filter Gain Explorer

Experiment with different sensor uncertainties to see how the Kalman gain changes. A higher gain means the filter trusts the measurement more; a lower gain means it trusts the prediction more.
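In lieu of an interactive widget, the same experiment can be run as a short sketch, sweeping the measurement uncertainty while holding the prediction uncertainty fixed at the 5 m from the example above:

```python
def kalman_gain(sigma_predict, sigma_meas):
    """K = prediction variance / (prediction variance + measurement variance)."""
    return sigma_predict**2 / (sigma_predict**2 + sigma_meas**2)

# Trustworthy sensor (small sigma_meas) -> high gain, filter follows the
# measurement; noisy sensor -> low gain, filter leans on the prediction.
for sigma_meas in (1, 5, 10, 50):
    print(f"sigma_meas={sigma_meas:>2} m -> K={kalman_gain(5, sigma_meas):.3f}")
```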

A smart city project collects data from 100,000 sensors (traffic cameras, air quality, parking, weather) generating 500 GB/day. The city needs both real-time dashboards (traffic flow) and historical analytics (urban planning). Should they use a data lake, data warehouse, or both?

Requirement Analysis:

| Use Case | Data Type | Query Pattern | Latency | Users |
|---|---|---|---|---|
| Traffic dashboards | Live camera object counts | Pre-aggregated metrics, time-series | <5 seconds | 50 traffic operators |
| Air quality alerts | Sensor readings (PM2.5, NO2) | Threshold checks, spatial queries | <1 minute | 200 citizens via app |
| Urban planning | 5 years historical data (500 GB/day x 365 x 5 = 912 TB) | Complex JOINs, correlations across datasets | Minutes to hours | 20 city planners |
| ML model training | Raw sensor data + weather + events | Full dataset scans, feature engineering | Hours | 5 data scientists |

Architecture Decision: Data Lakehouse (Hybrid Approach)

Layer 1 - Data Lake (Bronze):

  • Store all raw sensor data in AWS S3 (or equivalent)
  • Format: Parquet files partitioned by sensor_type/date
  • Purpose: Long-term retention, ML training, exploratory analysis
  • Cost: 912 TB x $0.023/GB/month = $20,976/month

Layer 2 - Data Warehouse (Silver):

  • AWS Redshift cluster for structured, aggregated data
  • ETL pipeline: Hourly jobs aggregate raw data to 1-minute summaries
  • Reduces data from 500 GB/day raw to 5 GB/day aggregates (100x compression)
  • Purpose: Fast BI queries, operational dashboards
  • Cost: dc2.large 4-node cluster = $4,380/month

Layer 3 - Real-Time Stream (Gold):

  • Apache Kafka + Flink for live sensor ingestion and aggregation
  • Materialize traffic counts, air quality averages to Redis for <1s dashboard queries
  • Purpose: Real-time monitoring and alerts
  • Cost: 3 Kafka brokers + 2 Flink workers = $1,800/month

Total Architecture Cost: $27,156/month

Query Performance Comparison:

| Query Type | Data Lake Only | Data Warehouse Only | Lakehouse (Hybrid) |
|---|---|---|---|
| “Show traffic counts for last hour” | 30 seconds (scan 20 GB Parquet) | 2 seconds (indexed aggregates) | 0.5 seconds (Redis cache) |
| “Find correlation between air quality and traffic over 3 years” | 10 minutes (full scan) | Not possible (raw data deleted) | 12 minutes (Spark on data lake) |
| “Train ML model on 5 years of sensor data” | 2 hours (Spark on S3) | Not possible (aggregates lose detail) | 2 hours (Spark on S3) |
| “Alert if PM2.5 >100 in any district” | 45 seconds (too slow) | 5 seconds (query warehouse) | <1 second (stream processing) |

Key Insight: The data lakehouse combines the benefits of both architectures:

  • Data lake for raw storage (cheapest: $0.023/GB/month) and ML training
  • Data warehouse for fast BI queries on aggregates (100x smaller, 10x faster)
  • Stream layer for real-time monitoring (<1s latency)

Alternative Approaches (Not Chosen):

  1. Data Warehouse Only: Cannot support ML training (aggregates lose raw detail), expensive for 912 TB storage ($125K/month in Redshift vs $21K in S3)

  2. Data Lake Only: Query latency too high for operational dashboards (30s vs <1s with warehouse caching), poor support for concurrent BI users

  3. Separate Data Lake + Warehouse with ETL: Duplicate storage costs (raw + aggregates), ETL complexity, data staleness issues

| Pattern | Best For | Cost ($/GB/month) | Query Speed | Schema Flexibility | When NOT to Use |
|---|---|---|---|---|---|
| Data Lake | Raw data archive, ML training, exploratory analytics | $0.02-0.05 | Slow (minutes) | High (schema-on-read) | Real-time dashboards, high-concurrency BI |
| Data Warehouse | Business intelligence, structured reports, OLAP | $0.25-1.50 | Fast (seconds) | Low (schema-on-write) | Unstructured data, rapid schema changes, petabyte scale |
| Data Lakehouse | Combined analytics + ML, cost-effective at scale | $0.03-0.10 | Medium (10s-min) | Medium | Simple use cases (pick lake OR warehouse) |
| Stream Processor | Real-time analytics, live dashboards, event-driven | $0.10-0.30 | Very fast (<1s) | High | Historical queries, batch analytics |
| Time-Series DB | Sensor data, metrics, monitoring | $0.30-1.00 | Fast (sub-second) | Low (fixed schema) | Non-time-series data, ad-hoc queries |

Decision Tree:

  1. Do you need real-time results (<5 seconds)?
    • Yes → Stream Processor (Kafka + Flink) or Time-Series DB (InfluxDB)
    • No → Continue
  2. Is your data mostly time-series sensor readings?
    • Yes → Time-Series DB (InfluxDB, TimescaleDB, Amazon Timestream)
    • No → Continue
  3. Do you need to run ML models on raw data?
    • Yes + Need fast BI → Data Lakehouse (Databricks, Snowflake)
    • Yes + No BI → Data Lake (S3 + Athena/Spark)
    • No → Continue
  4. Is your schema stable and queries well-defined?
    • Yes + <10 TB → Data Warehouse (Redshift, BigQuery)
    • Yes + >10 TB → Data Lakehouse (more cost-effective)
    • No → Data Lake (schema-on-read flexibility)
  5. What’s your query concurrency?
    • >50 concurrent users → Data Warehouse (optimized for OLAP)
    • <10 users → Data Lake (batch processing)
    • Mixed → Data Lakehouse
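The decision tree above can be encoded as a short function. Note that steps 4 and 5 overlap in the text; here concurrency is treated as a tie-breaker when the schema is not stable, which is an interpretation rather than something the tree states explicitly:

```python
def choose_storage_pattern(realtime, timeseries, ml_on_raw, fast_bi,
                           stable_schema, data_tb, users):
    """Walk the storage-pattern decision tree from top to bottom."""
    if realtime:                              # step 1: results in <5 s?
        return "Stream Processor / Time-Series DB"
    if timeseries:                            # step 2: mostly time-series?
        return "Time-Series DB"
    if ml_on_raw:                             # step 3: ML on raw data?
        return "Data Lakehouse" if fast_bi else "Data Lake"
    if stable_schema:                         # step 4: stable schema?
        return "Data Warehouse" if data_tb <= 10 else "Data Lakehouse"
    if users > 50:                            # step 5: concurrency tie-breaker
        return "Data Warehouse"
    if users < 10:
        return "Data Lake"
    return "Data Lakehouse"
```

For example, the smart-city scenario above (ML on raw data plus fast BI dashboards) lands on the lakehouse, matching the architecture decision in the text.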

Common Mistake: Using Data Warehouse for IoT Raw Sensor Storage

The Error: An industrial IoT company stores raw vibration sensor data (10 kHz sampling) from 500 machines in AWS Redshift data warehouse, paying $45,000/month for storage alone.

The Math:

  • Sensors: 500 machines x 10,000 samples/sec x 4 bytes = 20 MB/s raw data
  • Daily: 20 MB/s x 86,400 s = 1.73 TB/day
  • 90-day retention: 1.73 TB x 90 = 155 TB
  • Redshift cost: 155 TB x $0.29/GB/month = $44,950/month storage

Why This Is Wrong: Data warehouses are optimized for:

  • Structured, aggregated data (summaries, not raw samples)
  • Complex SQL queries with JOINs across dimensions
  • Low-latency BI dashboards (<5 seconds)

IoT raw sensor data is:

  • High-volume, low-value until aggregated
  • Rarely queried directly (only during debugging)
  • Better suited for batch processing (Spark jobs)

Correct Approach - Tiered Storage:

Tier 1 - S3 Data Lake (Raw):

  • Store 155 TB raw data in S3 Standard-IA
  • Cost: 155 TB x $0.0125/GB/month = $1,938/month (23x cheaper)
  • Query via Athena or Spark when needed (rare)

Tier 2 - Redshift (Aggregates):

  • ETL pipeline: Compress 1.73 TB/day raw to 1.7 GB/day aggregates (1000x reduction)
    • Extract FFT features (20 frequency bins per machine per second)
    • Store: machine_id, timestamp, freq_bin_1..20, anomaly_score
  • 90-day aggregates: 1.7 GB x 90 = 153 GB
  • Redshift cost: 153 GB x $0.29/GB/month = $44/month

Tier 3 - Redis (Real-Time Cache):

  • Cache last 1 hour of aggregates for live dashboards
  • Cost: ElastiCache r5.large = $150/month

Total New Cost: $1,938 + $44 + $150 = $2,132/month (vs $45,000)

Savings: $42,868/month = $514,416/year

Query Performance:

  • Live dashboard (last hour): 0.5s (Redis cache) - faster than before
  • Historical analysis (last 90 days): 3s (Redshift aggregates) - same as before
  • Raw data investigation (rare): 5 minutes (Athena scan of S3) - acceptable for debugging

Key Lesson: Data warehouses charge premium prices for premium performance on structured data. IoT raw sensor streams are unstructured, high-volume, and rarely queried. Store raw data in cheap object storage (S3), aggregate to warehouse only what you query frequently. This pattern saves 95%+ on storage costs while maintaining (or improving) query performance for actual use cases.

39.3.10 Interactive: IoT Storage Cost Estimator

Estimate the monthly cost difference between storing IoT sensor data in a data warehouse versus a tiered architecture (data lake + warehouse for aggregates).
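As a starting point for such an estimate, here is a sketch that uses the rates from the vibration-sensor example above; the default prices and the 1000x aggregation ratio are that example's numbers, not general figures, and should be replaced with your own workload's values:

```python
def monthly_cost_usd(raw_tb, warehouse_rate=0.29, lake_rate=0.0125,
                     aggregate_ratio=1000, cache_usd=150):
    """Compare warehouse-only vs tiered storage for one retention period.

    warehouse_rate / lake_rate are $/GB/month; aggregate_ratio is the
    raw-to-aggregate compression achieved by the ETL pipeline.
    """
    raw_gb = raw_tb * 1000
    warehouse_only = raw_gb * warehouse_rate
    tiered = (raw_gb * lake_rate                            # raw data in object storage
              + (raw_gb / aggregate_ratio) * warehouse_rate  # aggregates in warehouse
              + cache_usd)                                  # real-time cache
    return warehouse_only, tiered

wh, tiered = monthly_cost_usd(155)  # the 90-day vibration dataset
print(f"warehouse-only: ${wh:,.0f}/month, tiered: ${tiered:,.0f}/month")
```

Running it reproduces the comparison from the common-mistake example: roughly $45,000/month for warehouse-only against about $2,100/month for the tiered design.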

Imagine a giant picture book that shows how data travels from tiny sensors all the way to big computers in the sky!

39.3.11 The Sensor Squad Adventure: The Architecture Picture Book

One rainy afternoon, Sammy the Sensor found a huge picture book in the library called “The Amazing Journey of Data.” She called her friends over to look.

The first page showed a Data Lake – a giant pool where ALL kinds of data splash in together. “It’s like a swimming pool where numbers, pictures, and words all swim around!” said Sammy. “You can dive in and find whatever you need.”

The next page showed a Data Warehouse – neat shelves with perfectly organized boxes. “This is more like a tidy cupboard,” explained Max the Microcontroller. “Everything is sorted and labeled so you can grab what you need super fast.”

Lila the LED pointed to a page showing Stream Processing – data flowing like a river with little workers checking each drop as it passes. “These workers read each piece of data as it flows by, looking for anything unusual. If they spot something hot or cold, they shout an alert!”

Bella the Battery’s favorite page showed the Kalman Filter – a clever helper that combines guesses with measurements. “Imagine you’re trying to guess where a ball is going. The Kalman Filter takes your best guess AND what your eyes see, then combines them for a SUPER accurate answer!”

“This picture book shows all the different ways we can organize and use data,” said Max. “Architects use these patterns to build amazing IoT systems!”

39.3.12 Key Words for Kids

| Word | What It Means |
|---|---|
| Data Lake | A big storage pool where all kinds of raw data are kept together |
| Data Warehouse | An organized storage where data is neatly sorted for fast searching |
| Architecture | The plan for how all the parts of a system fit together |
| Kalman Filter | A smart math trick that combines guesses and measurements for better accuracy |

39.4 Quiz: Cloud Data Architecture

  1. Which IoT Reference Model level handles data reconciliation and normalization?
      A. Level 3: Edge/Fog Computing
      B. Level 4: Data Accumulation
      C. Level 5: Data Abstraction
      D. Level 6: Application
  2. What does IaaS stand for in cloud computing?
      A. Internet as a Service
      B. Infrastructure as a Service
      C. Integration as a Service
      D. Information as a Service
  3. Which is NOT one of the top cloud security threats identified by the Cloud Security Alliance?
      A. Data breaches
      B. Weak identity management
      C. High bandwidth costs
      D. Insecure APIs
  4. In data cleaning, what should happen to a temperature reading of 150 degrees C when the maximum allowed is 60 degrees C?
      A. Delete the entire record
      B. Mark as suspicious and clamp to the maximum (60 degrees C)
      C. Accept as valid
      D. Ignore and continue
  5. What is data provenance?
      A. Where data is physically stored
      B. How much data is generated
      C. Recording the sources and transformations of data
      D. The speed of data transmission
  6. Which cloud service model provides complete applications to users?
      A. IaaS
      B. PaaS
      C. SaaS
      D. FaaS
  7. In a 3-year TCO comparison, if cloud costs $100k and on-premises costs $180k, what is the savings percentage with cloud?
      A. 25%
      B. 44%
      C. 56%
      D. 80%
  8. Which of the 4 Vs benefits most directly from cloud parallel processing?
      A. Volume
      B. Velocity
      C. Variety
      D. Veracity
  9. What is the main advantage of tracking data freshness?
      A. Reduce storage costs
      B. Determine reliability for time-sensitive decisions
      C. Improve network speed
      D. Simplify data formats
  10. In on-premises deployment, which is typically the largest ongoing annual cost?
      A. Power
      B. Hardware maintenance
      C. IT staffing
      D. Network bandwidth

Answers: 1-C, 2-B, 3-C, 4-B, 5-C, 6-C, 7-B, 8-B, 9-B, 10-C

Common Pitfalls

A reference architecture designed for 10 million devices may be wildly over-engineered for a 500-device pilot. Always validate the selected pattern against your actual throughput, latency, and cost requirements before committing.

Every architecture gallery entry makes implicit assumptions (24/7 connectivity, uniform message rates, centralised identity). Understand which assumptions your deployment violates before adapting the pattern.

A gallery architecture with 12 cloud services looks elegant on a slide but requires expertise in each service for operations. Start with the simplest architecture that meets requirements and add services only when needed.

Moving IoT data between cloud regions or from cloud to on-premises for hybrid architectures incurs significant data transfer costs. Model these costs explicitly in architecture selection.