39 Cloud Data: Architecture Gallery

In 60 Seconds

This visual gallery provides reference diagrams for cloud data architecture patterns including data lakes, data warehouses, stream processing pipelines, sensor fusion algorithms (Kalman and particle filters), and machine learning model deployment – the essential building blocks for IoT cloud solutions.

Key Concepts

  • Reference architecture: A standardised, reusable architectural blueprint for a class of systems, providing a proven starting point that avoids reinventing common design decisions.
  • IoT data tier: A logical layer in a cloud architecture responsible for ingesting, storing, and serving sensor data, typically composed of message brokers, time-series databases, and object storage.
  • Data lakehouse: A modern architecture combining the low-cost storage of a data lake with the structured query performance of a data warehouse, suitable for mixed IoT analytics workloads.
  • Event-driven architecture: A design pattern where components communicate by publishing and subscribing to events (sensor readings, alarms) rather than through direct API calls, enabling loose coupling and scalability.
  • Zonal redundancy: Deploying cloud infrastructure across multiple availability zones within a region to survive the failure of any single physical facility without service interruption.
  • SLA (Service Level Agreement): A contractual commitment specifying uptime, latency, and data retention guarantees for a cloud IoT platform — a key factor when selecting managed services.

Cloud data architecture is the blueprint for organizing IoT data in cloud services. Think of designing a warehouse layout – you need receiving docks for incoming data, organized storage areas, processing stations, and shipping areas where insights are delivered. A good architecture keeps data flowing efficiently from sensors to dashboards.

39.3 Cloud Data Architecture Visuals

Artistic visualization of cloud data management showing data lakes and data warehouses receiving IoT streams, with ETL pipelines transforming raw data into structured analytics-ready datasets, and multiple query engines accessing the organized data
Figure 39.1: Cloud data management for IoT encompasses both data lakes for raw storage and data warehouses for structured analytics. ETL pipelines continuously transform incoming sensor streams into analytics-ready datasets. This dual architecture supports both exploratory analysis on raw data and efficient querying of pre-aggregated metrics.
Artistic visualization of IoT device registry showing a hierarchical database of registered devices with their identities, security certificates, configuration states, and connection status, enabling fleet-wide device management and monitoring
Figure 39.2: Device registry services form the foundation of cloud IoT platforms, maintaining authoritative records of device identities, security credentials, and configuration state. This centralized registry enables fleet management, security policy enforcement, and device lifecycle tracking from provisioning through decommissioning.
End-to-end IoT cloud workflow showing devices transmitting sensor data through gateways to cloud ingestion services, passing through data processing and storage layers, feeding analytics and machine learning models, and presenting results through user-facing applications
Figure 39.3: Using cloud for IoT data management
Comprehensive IoT cloud architecture showing device layer with sensors and gateways, connectivity layer with MQTT and HTTP protocols, ingestion layer with message queuing, processing layer with stream and batch engines, storage layer with data lakes and warehouses, and application layer with dashboards and APIs
Figure 39.4: Complete IoT cloud architecture from devices to applications
IoT rule engine architecture showing incoming event stream processed by rule matching engine with condition evaluation, action triggering for alerts/notifications/device commands, and rule management interface for defining complex event patterns
Figure 39.5: IoT rule engine for event-driven automation
Device shadow synchronization diagram showing physical device state, cloud shadow representation, desired vs reported state comparison, and bidirectional synchronization protocol handling offline device updates and command queuing
Figure 39.6: Device shadow pattern for reliable cloud-device synchronization
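The core of the shadow pattern in Figure 39.6 is a comparison between desired and reported state documents. A minimal sketch of that comparison, with hypothetical field names rather than any particular vendor's shadow schema:

```python
def shadow_delta(desired: dict, reported: dict) -> dict:
    """Return the settings the device still needs to apply.

    A cloud shadow keeps two documents: 'desired' (what the cloud wants)
    and 'reported' (what the device last confirmed). The delta is queued
    and delivered when the device reconnects.
    """
    return {key: value
            for key, value in desired.items()
            if reported.get(key) != value}

# Example: the cloud changed the sample rate while the device was offline.
desired = {"sample_rate_hz": 10, "led": "on"}
reported = {"sample_rate_hz": 1, "led": "on"}
print(shadow_delta(desired, reported))  # {'sample_rate_hz': 10}
```

Real shadow services layer versioning and conflict resolution on top of this comparison, but the desired/reported split is the essential idea.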

39.3.1 Storage and Processing Patterns

Artistic visualization of batch processing architecture for IoT data showing scheduled jobs collecting accumulated sensor data, MapReduce processing stages, and output to data warehouse for historical analytics.

Batch Processing

Batch processing handles large volumes of historical IoT data through scheduled jobs, enabling complex analytics that don’t require real-time results.

Modern diagram of multi-sensor data fusion architecture showing sensor layer, preprocessing, feature extraction, fusion algorithms, and decision output with feedback loops.

Data Fusion Architecture

Data fusion combines information from multiple IoT sensors to produce more accurate and reliable insights than any single sensor could provide.

Geometric visualization of IoT data lake pipeline showing raw data ingestion, schema-on-read processing, data catalog, and multiple consumption patterns for analytics and ML.

Data Lake Pipeline

Data lakes store raw IoT data in native format, enabling schema-on-read flexibility for diverse analytics and machine learning workloads.

Artistic visualization of IoT data lake showing massive storage reservoir receiving streams from diverse IoT sources with zones for raw, curated, and consumption-ready data.

Data Lake

Data lakes provide cost-effective storage for massive IoT data volumes, supporting both structured and unstructured data types.

Artistic diagram of data lakehouse architecture combining data lake storage economics with data warehouse query performance through delta format and optimization layers.

Data Lakehouse

Data lakehouses combine the flexibility of data lakes with the query performance of data warehouses, ideal for IoT analytics at scale.

Geometric visualization of data warehouse architecture showing ETL pipelines, dimensional modeling with fact and dimension tables, and OLAP cubes for IoT business intelligence.

Data Warehouse

Data warehouses provide structured, optimized storage for IoT analytics with pre-defined schemas enabling fast business intelligence queries.

Geometric diagram of data mesh architecture showing domain-oriented data products, self-serve data platform, and federated governance for decentralized IoT data management.

Data Mesh

Data mesh applies domain-driven design to IoT data, with decentralized ownership and federated governance enabling scalable data management.

Geometric visualization of data replication strategies showing synchronous vs asynchronous replication, consistency models, and geographic distribution for IoT data availability.

Data Replication

Data replication ensures IoT data availability and durability through redundant copies across storage systems and geographic regions.

Geometric diagram of data lineage tracking showing source sensors, transformation steps, derived datasets, and consumption points with metadata at each stage.

Data Lineage

Data lineage tracks IoT data from sensor origin through all transformations to final consumption, enabling debugging and compliance.

Artistic visualization of data pipeline orchestration showing DAG-based workflow scheduling, dependency management, monitoring, and error handling for IoT data flows.

Data Pipeline Orchestration

Pipeline orchestration manages complex IoT data workflows with scheduling, dependency tracking, monitoring, and automated recovery.

Artistic diagram of stream processing engine showing continuous ingestion, windowing operations, stateful computations, and real-time output for IoT event processing.

Stream Processor

Stream processors enable real-time IoT analytics through continuous processing of sensor data as it arrives.

Geometric visualization of stream processing flow showing event sources, Kafka topics, Flink/Spark processing, and sink outputs with exactly-once semantics.

Stream Processing Flow

End-to-end stream processing pipelines provide reliable real-time IoT analytics with guaranteed delivery semantics.
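The windowing step at the heart of such pipelines can be sketched in plain Python, assuming event timestamps in seconds; engines like Flink or Spark Structured Streaming add continuous execution, fault tolerance, and exactly-once guarantees on top of this logic:

```python
from collections import defaultdict

def tumbling_window_avg(events, window_s=60):
    """Group (timestamp, value) events into fixed-size windows and average.

    Each event belongs to exactly one window, identified by the window's
    start time (timestamp rounded down to a multiple of window_s).
    """
    sums = defaultdict(lambda: [0.0, 0])  # window_start -> [sum, count]
    for ts, value in events:
        window_start = (ts // window_s) * window_s
        acc = sums[window_start]
        acc[0] += value
        acc[1] += 1
    return {w: s / n for w, (s, n) in sorted(sums.items())}

readings = [(0, 20.0), (30, 22.0), (65, 30.0)]
print(tumbling_window_avg(readings))  # {0: 21.0, 60: 30.0}
```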

Artistic visualization of time series data patterns showing trend, seasonality, noise components, and common IoT patterns like sensor drift and periodic readings.

Time Series

Time series data is the fundamental data type in IoT, requiring specialized storage and analytics techniques.

Geometric diagram of time synchronization for distributed IoT systems showing NTP, PTP, and GPS time sources with clock drift compensation and precision requirements.

Time Synchronization

Accurate time synchronization is essential for correlating IoT events across distributed sensor networks.

Artistic visualization of converged data networks carrying IoT sensor streams alongside traditional enterprise traffic over unified infrastructure.

Converged Data Networks

Converged networks efficiently carry IoT data streams alongside other enterprise traffic through unified infrastructure.

Modern diagram showing decision factors for edge vs cloud placement of IoT processing including latency requirements, bandwidth costs, privacy needs, and computational complexity.

Edge Cloud Placement

Optimal placement of IoT processing between edge and cloud depends on latency, bandwidth, privacy, and computational requirements.

39.3.2 Machine Learning and Prediction

Geometric diagram of IoT predictive model pipeline showing feature engineering from sensor data, model training, validation, deployment, and continuous learning loop.

Predictive Model

Predictive models transform IoT sensor patterns into actionable forecasts for maintenance, demand, and anomaly detection.

Geometric visualization of ML model registry for IoT showing model versioning, metadata tracking, deployment status, and lineage for managing production models.

Model Registry

Model registries manage the lifecycle of IoT machine learning models from training through deployment and monitoring.

39.3.3 Sensor Fusion and Filtering

Artistic visualization of Kalman filter operation showing prediction step, measurement update, Kalman gain calculation, and state estimation for sensor fusion.

Kalman Filter

The Kalman filter provides optimal sensor fusion for noisy IoT measurements with known dynamics and noise characteristics.

Artistic diagram showing Kalman filter limitations including linearity assumption, Gaussian noise requirement, and initialization sensitivity with alternative approaches.

Kalman Filter Limitations

Understanding Kalman filter limitations guides selection of appropriate filtering techniques for different IoT scenarios.

Artistic visualization of core Kalman filter equations showing state prediction, covariance prediction, and measurement update with matrix operations.

Three Equations

The three core Kalman filter equations enable recursive optimal estimation from noisy IoT sensor measurements.
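The recursion can be illustrated with a minimal one-dimensional sketch; the process noise q, measurement noise r, and initial values here are illustrative tuning parameters, not values from the text:

```python
def kalman_1d(measurements, q=0.01, r=1.0, x0=0.0, p0=1.0):
    """Recursive 1-D Kalman filter for a random-walk state.

    Predict:  x_pred = x               (no control input)
              p_pred = p + q           (process noise grows uncertainty)
    Update:   k      = p_pred / (p_pred + r)
              x      = x_pred + k * (z - x_pred)
              p      = (1 - k) * p_pred
    """
    x, p = x0, p0
    estimates = []
    for z in measurements:
        p_pred = p + q                  # covariance prediction
        k = p_pred / (p_pred + r)       # Kalman gain
        x = x + k * (z - x)             # state update toward measurement
        p = (1 - k) * p_pred            # covariance update
        estimates.append(x)
    return estimates
```

Feeding it a constant noisy signal shows the estimate converging while the gain settles to a steady-state value determined by the q/r ratio.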

Artistic visualization of particle filter localization showing particle cloud, motion prediction, measurement weighting, and resampling for non-Gaussian positioning.

Particle Filters for Location

Particle filters enable robust localization for IoT devices when sensor noise is non-Gaussian or system dynamics are nonlinear.

Artistic diagram of particle filter correction step showing likelihood weighting of particles based on sensor measurements and posterior distribution estimation.

Particle Filters Correct

Particle filter correction updates position estimates by weighting particles according to sensor measurement likelihood.

Artistic visualization of advanced particle filter correction with multiple sensor modalities and adaptive particle count for varying uncertainty.

Particle Filters Correct 2

Advanced particle filtering combines multiple IoT sensor modalities for robust state estimation in challenging environments.

Artistic diagram of particle filter resampling showing importance sampling, systematic resampling, and particle degeneracy prevention techniques.

Particle Filters Resample

Resampling maintains particle diversity in IoT tracking applications, preventing degeneracy as estimates converge.

Artistic visualization of adaptive resampling strategies for particle filters showing effective sample size monitoring and dynamic particle allocation.

Particle Filters Resample 2

Adaptive resampling optimizes computational efficiency while maintaining tracking accuracy in resource-constrained IoT devices.

Artistic diagram comparing different resampling algorithms for particle filters including multinomial, stratified, and residual approaches.

Particle Filters Resample 3

Different resampling algorithms offer trade-offs between variance reduction and computational complexity for IoT applications.
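As one concrete example, systematic resampling can be sketched in a few lines. This is a generic textbook formulation, not tied to any specific library:

```python
import random

def systematic_resample(particles, weights):
    """Systematic resampling: one random offset, N evenly spaced pointers.

    Compared with multinomial resampling it has lower variance and runs
    in O(N), which matters on resource-constrained IoT hardware.
    """
    n = len(particles)
    total = sum(weights)
    u = random.random()                       # single shared offset
    positions = [(u + i) / n for i in range(n)]
    resampled, j = [], 0
    cumulative = weights[0] / total
    for pos in positions:
        # Advance through the cumulative weight distribution.
        while pos > cumulative and j < n - 1:
            j += 1
            cumulative += weights[j] / total
        resampled.append(particles[j])
    return resampled
```

With all the weight on one particle, every output slot copies that particle; with uniform weights, each particle survives exactly once.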

Artistic visualization comparing real-time filtering vs offline smoothing for IoT sensor data showing forward-backward passes and optimal estimation.

Filters and Smoothers

Smoothers provide superior estimates when post-processing IoT data offline, while filters work in real-time with causal data.

Artistic visualization of probabilistic sensor fusion showing uncertainty representation, belief propagation, and confidence estimation for IoT.

Probabilistic Approach

Probabilistic approaches maintain uncertainty estimates alongside point values, enabling informed decision-making in IoT systems.

Artistic diagram of recursive Bayesian filtering showing prior prediction, measurement likelihood, posterior update, and the continuous estimation cycle.

Recursive Bayesian Filters

Recursive Bayesian filters provide a principled framework for sequential sensor fusion in IoT applications.

39.3.4 Inertial Navigation and Motion

Artistic visualization of inertial navigation system showing accelerometer and gyroscope integration, position drift, and the need for external corrections.

Inertial Navigation

Inertial navigation enables position tracking without external references but accumulates drift requiring periodic corrections.

Artistic diagram of dead reckoning showing velocity integration, heading estimation, and cumulative position error growth over time.

Inertial Navigation 2

Dead reckoning tracks IoT device position through integration but requires fusion with absolute position sources.

Artistic visualization of IMU-based navigation showing 6-DOF motion tracking, gravity removal, and coordinate frame transformations.

Inertial Navigation 3

Six-degree-of-freedom IMU tracking enables comprehensive motion analysis for IoT wearables and robotics applications.

Artistic diagram of sensor fusion for inertial navigation combining GPS, magnetometer, and barometer with IMU for drift-free positioning.

Inertial Navigation 4

Multi-sensor fusion corrects inertial navigation drift using GPS, magnetometer, and barometer references.

Artistic visualization of pitch and roll angle estimation from accelerometer gravity vector with gimbal lock considerations and complementary filtering.

Pitch and Roll Angles

Pitch and roll estimation from accelerometers enables orientation tracking for IoT devices in many applications.

Artistic visualization of gyroscope drift showing integrated angle diverging from true value over time due to bias and noise accumulation.

Gyroscope Drift

Gyroscope integration drift is a fundamental challenge in IoT motion tracking, requiring complementary filtering or sensor fusion.

Artistic visualization of noisy accelerometer-based angle estimation showing vibration sensitivity and high-frequency noise compared to stable gyroscope integration.

Accelerometer Noise

Accelerometer angle estimates are noisy but don’t drift, complementing gyroscope measurements in sensor fusion.
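One common way to combine the two sensors is a complementary filter. This is a minimal sketch, with the blend factor alpha and the sample period chosen purely for illustration:

```python
def complementary_filter(gyro_rates, accel_angles, dt=0.01, alpha=0.98):
    """Fuse gyro rates (smooth but drifting) with accelerometer angles
    (noisy but drift-free) into one stable pitch or roll estimate.

    alpha close to 1 trusts the gyro for fast changes, while the small
    (1 - alpha) term continuously pulls the estimate back toward the
    accelerometer's absolute reference, cancelling long-term drift.
    """
    angle = accel_angles[0]  # initialize from the absolute reference
    estimates = []
    for rate, accel_angle in zip(gyro_rates, accel_angles):
        gyro_part = angle + rate * dt                    # integrate rate
        angle = alpha * gyro_part + (1 - alpha) * accel_angle
        estimates.append(angle)
    return estimates
```

Even with a biased gyro, the accelerometer term bounds the error instead of letting it grow without limit as pure integration would.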

Artistic visualization of trapezoidal integration for discrete sensor data showing area approximation, error bounds, and comparison with other integration methods.

Trapezoidal Integration

Trapezoidal integration provides accurate numerical integration of IoT sensor data for position and velocity estimation.
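A minimal sketch of the rule for uniformly sampled data, assuming a fixed sample period dt:

```python
def trapezoid_integrate(samples, dt):
    """Integrate uniformly sampled sensor data with the trapezoid rule.

    Each interval contributes the average of its endpoints times dt,
    which is exact for linearly changing signals (unlike the simpler
    rectangle rule, which assumes the signal is constant per sample).
    """
    total = 0.0
    series = [0.0]  # running integral, starting from zero
    for prev, curr in zip(samples, samples[1:]):
        total += 0.5 * (prev + curr) * dt
        series.append(total)
    return series

# Constant 2 m/s^2 acceleration sampled at 10 Hz for 1 second:
accel = [2.0] * 11
velocity = trapezoid_integrate(accel, dt=0.1)
print(velocity[-1])  # 2.0 m/s after 1 second (v = a * t)
```

Integrating the velocity series a second time yields position, which is exactly where the double integration of dead reckoning accumulates its drift.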

39.3.5 Data Quality and Calibration

Artistic visualization of sensor calibration process showing raw readings, offset correction, scale factors, cross-axis alignment, and calibrated output.

Calibrated Data

Sensor calibration corrects systematic errors in IoT measurements, improving accuracy across operating conditions.

Artistic visualization of sensor noise sources including thermal noise, quantization error, environmental interference, and signal conditioning effects.

Measurements Are Noisy

Understanding noise sources in IoT sensors guides selection of appropriate filtering and fusion techniques.

Artistic visualization of symmetry ambiguity in sensor-based orientation estimation showing multiple valid interpretations of the same measurements.

Symmetry Problem

Sensor symmetry creates ambiguity in orientation estimation that requires additional measurements or constraints to resolve.

39.3.6 State Estimation and Tracking

Artistic visualization of state vector representation for IoT tracking showing position, velocity, orientation, and sensor bias components.

State Vector

State vectors capture all relevant information about IoT device state for estimation and prediction algorithms.

Artistic visualization of simple IoT tracking example showing sensor measurements, state estimation, and predicted trajectory with uncertainty bounds.

Simple Tracking Example

Simple tracking demonstrates fundamental concepts of state estimation from noisy IoT sensor measurements.

Artistic diagram of tracking with measurement gaps showing prediction during missing data and update when measurements resume.

Simple Tracking Example 2

Handling measurement gaps is essential for robust IoT tracking when sensors intermittently lose signal.

Artistic visualization of multi-target tracking scenario with data association, track initialization, and track management for multiple IoT devices.

Complex Example

Complex tracking scenarios require sophisticated data association and track management for multiple IoT targets.

Artistic visualization of favorable tracking conditions with good sensor coverage, predictable motion, and clean measurements.

Works Well

Understanding favorable conditions helps design IoT deployments that maximize tracking performance.

39.3.7 Cloud and Infrastructure

Artistic visualization of cloud service models showing IaaS, PaaS, and SaaS layers with responsibility boundaries between provider and customer.

Service Models

Cloud service models define responsibility boundaries for IoT deployments, from infrastructure to complete applications.

Artistic visualization of Square Kilometre Array data challenge showing massive sensor array, exabyte-scale data generation, and distributed processing architecture.

SKA

The SKA represents extreme IoT data challenges, generating exabytes of sensor data requiring innovative processing solutions.

Cloud-based IoT usage involves coordinated patterns of device registration, data ingestion, rules-based processing, analytics, and application integration (see Figure 39.3 above for the end-to-end workflow).

Artistic visualization of IoT cloud networking showing VPC configuration, security groups, load balancing, and hybrid connectivity options.

Networking

Cloud networking for IoT requires careful security configuration while enabling reliable device connectivity.

Artistic visualization of resource-intensive IoT sensing showing high-frequency sampling, large data volumes, and processing requirements for complex sensors.

Sensing Resource Intensive

Resource-intensive sensing applications require careful architecture balancing edge processing with cloud capabilities.

39.3.8 Data Processing Details

Artistic visualization of IoT information flow patterns including request-response, publish-subscribe, streaming, and batch with appropriate use cases.

Information Flow Types

Different information flow patterns suit different IoT application requirements for latency, throughput, and reliability.

Artistic visualization of IoT data generation rates by device type showing sensors, cameras, industrial equipment, and connected vehicles with bytes per second.

Data Generation Table

Understanding data generation rates by IoT device type enables appropriate infrastructure sizing and cost estimation.

Artistic diagram of IoT data growth projections showing exponential increase in connected devices and data volume over time.

Data Generation Table 2

Projected IoT data growth drives architecture decisions for scalable data management infrastructure.

Artistic visualization of IoT data pipeline implementation showing component selection, integration patterns, and deployment considerations.

Implementation

Practical IoT data pipeline implementation requires careful technology selection and integration planning.

Artistic visualization of low-level IoT data handling showing byte ordering, data alignment, serialization formats, and protocol efficiency.

The Nitty Gritty

Understanding low-level data handling details is essential for efficient IoT system implementation.


Key Takeaway

Cloud data architectures combine multiple patterns to serve different IoT needs: data lakes for flexible raw storage, data warehouses for fast structured queries, stream processors for real-time analytics, and sensor fusion algorithms like Kalman and particle filters for extracting accurate state estimates from noisy measurements. Selecting the right combination of patterns depends on your latency, accuracy, and cost requirements.

Kalman Filter State Estimation: A GPS tracker reports position with \(\pm 10\text{ m}\) error (\(\sigma_{\text{GPS}} = 10\)). We predict movement based on last velocity: predicted position has \(\pm 5\text{ m}\) uncertainty (\(\sigma_{\text{predict}} = 5\)).

Kalman Gain determines optimal weight between prediction and measurement: \[K = \frac{\sigma_{\text{predict}}^2}{\sigma_{\text{predict}}^2 + \sigma_{\text{GPS}}^2} = \frac{25}{25 + 100} = \frac{25}{125} = 0.2\]

Predicted position: \((100, 200)\) meters. GPS measurement: \((110, 205)\) meters. Kalman filter fuses both:

\[\text{Best estimate} = \text{prediction} + K \times (\text{measurement} - \text{prediction})\]

\[x = 100 + 0.2 \times (110 - 100) = 100 + 2 = 102 \text{ m}\] \[y = 200 + 0.2 \times (205 - 200) = 200 + 1 = 201 \text{ m}\]

Final position: \((102, 201)\) with reduced uncertainty \(\sigma_{\text{fused}} = \sqrt{(1-K) \times \sigma_{\text{predict}}^2} = \sqrt{0.8 \times 25} = 4.47\text{ m}\). Fusion reduces error from 10m (GPS alone) to 4.47m—55% improvement.
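The worked example above can be checked with a few lines of Python; the function name is ours, but the arithmetic is exactly the 1-D Kalman update from the text:

```python
def fuse(prediction, measurement, sigma_predict, sigma_meas):
    """One 1-D Kalman update step: gain, fused estimate, fused sigma."""
    k = sigma_predict**2 / (sigma_predict**2 + sigma_meas**2)
    estimate = prediction + k * (measurement - prediction)
    sigma_fused = ((1 - k) * sigma_predict**2) ** 0.5
    return estimate, sigma_fused, k

x, sx, k = fuse(100.0, 110.0, sigma_predict=5.0, sigma_meas=10.0)
y, _, _ = fuse(200.0, 205.0, sigma_predict=5.0, sigma_meas=10.0)
print(k, x, y, round(sx, 2))  # 0.2 102.0 201.0 4.47
```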

39.3.9 Interactive: Kalman Filter Gain Explorer

Experiment with different sensor uncertainties to see how the Kalman gain changes. A higher gain means the filter trusts the measurement more; a lower gain means it trusts the prediction more.
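In lieu of an interactive widget, the same experiment can be run as a short sketch, sweeping the measurement uncertainty while holding the prediction uncertainty fixed at the 5 m from the example above:

```python
def kalman_gain(sigma_predict, sigma_meas):
    """K = prediction variance / (prediction variance + measurement variance)."""
    return sigma_predict**2 / (sigma_predict**2 + sigma_meas**2)

# Trustworthy sensor (small sigma_meas) -> high gain, filter follows the
# measurement; noisy sensor -> low gain, filter leans on the prediction.
for sigma_meas in (1, 5, 10, 50):
    print(f"sigma_meas={sigma_meas:>2} m -> K={kalman_gain(5, sigma_meas):.3f}")
```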

A smart city project collects data from 100,000 sensors (traffic cameras, air quality, parking, weather) generating 500 GB/day. The city needs both real-time dashboards (traffic flow) and historical analytics (urban planning). Should they use a data lake, data warehouse, or both?

Requirement Analysis:

| Use Case | Data Type | Query Pattern | Latency | Users |
|---|---|---|---|---|
| Traffic dashboards | Live camera object counts | Pre-aggregated metrics, time-series | <5 seconds | 50 traffic operators |
| Air quality alerts | Sensor readings (PM2.5, NO2) | Threshold checks, spatial queries | <1 minute | 200 citizens via app |
| Urban planning | 5 years historical data (500 GB/day x 365 x 5 = 912 TB) | Complex JOINs, correlations across datasets | Minutes to hours | 20 city planners |
| ML model training | Raw sensor data + weather + events | Full dataset scans, feature engineering | Hours | 5 data scientists |

Architecture Decision: Data Lakehouse (Hybrid Approach)

Layer 1 - Data Lake (Bronze):

  • Store all raw sensor data in AWS S3 (or equivalent)
  • Format: Parquet files partitioned by sensor_type/date
  • Purpose: Long-term retention, ML training, exploratory analysis
  • Cost: 912 TB x $0.023/GB/month = $20,976/month

Layer 2 - Data Warehouse (Silver):

  • AWS Redshift cluster for structured, aggregated data
  • ETL pipeline: Hourly jobs aggregate raw data to 1-minute summaries
  • Reduces data from 500 GB/day raw to 5 GB/day aggregates (100x compression)
  • Purpose: Fast BI queries, operational dashboards
  • Cost: dc2.large 4-node cluster = $4,380/month

Layer 3 - Real-Time Stream (Gold):

  • Apache Kafka + Flink for live sensor ingestion and aggregation
  • Materialize traffic counts, air quality averages to Redis for <1s dashboard queries
  • Purpose: Real-time monitoring and alerts
  • Cost: 3 Kafka brokers + 2 Flink workers = $1,800/month

Total Architecture Cost: $27,156/month

Query Performance Comparison:

| Query Type | Data Lake Only | Data Warehouse Only | Lakehouse (Hybrid) |
|---|---|---|---|
| “Show traffic counts for last hour” | 30 seconds (scan 20 GB Parquet) | 2 seconds (indexed aggregates) | 0.5 seconds (Redis cache) |
| “Find correlation between air quality and traffic over 3 years” | 10 minutes (full scan) | Not possible (raw data deleted) | 12 minutes (Spark on data lake) |
| “Train ML model on 5 years of sensor data” | 2 hours (Spark on S3) | Not possible (aggregates lose detail) | 2 hours (Spark on S3) |
| “Alert if PM2.5 >100 in any district” | 45 seconds (too slow) | 5 seconds (query warehouse) | <1 second (stream processing) |

Key Insight: The data lakehouse combines the benefits of both architectures:

  • Data lake for raw storage (cheapest: $0.023/GB/month) and ML training
  • Data warehouse for fast BI queries on aggregates (100x smaller, 10x faster)
  • Stream layer for real-time monitoring (<1s latency)

Alternative Approaches (Not Chosen):

  1. Data Warehouse Only: Cannot support ML training (aggregates lose raw detail), expensive for 912 TB storage ($125K/month in Redshift vs $21K in S3)

  2. Data Lake Only: Query latency too high for operational dashboards (30s vs <1s with warehouse caching), poor support for concurrent BI users

  3. Separate Data Lake + Warehouse with ETL: Duplicate storage costs (raw + aggregates), ETL complexity, data staleness issues

| Pattern | Best For | Cost ($/GB/month) | Query Speed | Schema Flexibility | When NOT to Use |
|---|---|---|---|---|---|
| Data Lake | Raw data archive, ML training, exploratory analytics | $0.02-0.05 | Slow (minutes) | High (schema-on-read) | Real-time dashboards, high-concurrency BI |
| Data Warehouse | Business intelligence, structured reports, OLAP | $0.25-1.50 | Fast (seconds) | Low (schema-on-write) | Unstructured data, rapid schema changes, petabyte scale |
| Data Lakehouse | Combined analytics + ML, cost-effective at scale | $0.03-0.10 | Medium (10s-min) | Medium | Simple use cases (pick lake OR warehouse) |
| Stream Processor | Real-time analytics, live dashboards, event-driven | $0.10-0.30 | Very fast (<1s) | High | Historical queries, batch analytics |
| Time-Series DB | Sensor data, metrics, monitoring | $0.30-1.00 | Fast (sub-second) | Low (fixed schema) | Non-time-series data, ad-hoc queries |

Decision Tree:

  1. Do you need real-time results (<5 seconds)?
    • Yes → Stream Processor (Kafka + Flink) or Time-Series DB (InfluxDB)
    • No → Continue
  2. Is your data mostly time-series sensor readings?
    • Yes → Time-Series DB (InfluxDB, TimescaleDB, Amazon Timestream)
    • No → Continue
  3. Do you need to run ML models on raw data?
    • Yes + Need fast BI → Data Lakehouse (Databricks, Snowflake)
    • Yes + No BI → Data Lake (S3 + Athena/Spark)
    • No → Continue
  4. Is your schema stable and queries well-defined?
    • Yes + <10 TB → Data Warehouse (Redshift, BigQuery)
    • Yes + >10 TB → Data Lakehouse (more cost-effective)
    • No → Data Lake (schema-on-read flexibility)
  5. What’s your query concurrency?
    • >50 concurrent users → Data Warehouse (optimized for OLAP)
    • <10 users → Data Lake (batch processing)
    • Mixed → Data Lakehouse
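The decision tree above can be encoded as a short function. Note that steps 4 and 5 overlap in the text; here concurrency is treated as a tie-breaker when the schema is not stable, which is an interpretation rather than something the tree states explicitly:

```python
def choose_storage_pattern(realtime, timeseries, ml_on_raw, fast_bi,
                           stable_schema, data_tb, users):
    """Walk the storage-pattern decision tree from top to bottom."""
    if realtime:                              # step 1: results in <5 s?
        return "Stream Processor / Time-Series DB"
    if timeseries:                            # step 2: mostly time-series?
        return "Time-Series DB"
    if ml_on_raw:                             # step 3: ML on raw data?
        return "Data Lakehouse" if fast_bi else "Data Lake"
    if stable_schema:                         # step 4: stable schema?
        return "Data Warehouse" if data_tb <= 10 else "Data Lakehouse"
    if users > 50:                            # step 5: concurrency tie-breaker
        return "Data Warehouse"
    if users < 10:
        return "Data Lake"
    return "Data Lakehouse"
```

For example, the smart-city scenario above (ML on raw data plus fast BI dashboards) lands on the lakehouse, matching the architecture decision in the text.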

Common Mistake: Using Data Warehouse for IoT Raw Sensor Storage

The Error: An industrial IoT company stores raw vibration sensor data (10 kHz sampling) from 500 machines in AWS Redshift data warehouse, paying $45,000/month for storage alone.

The Math:

  • Sensors: 500 machines x 10,000 samples/sec x 4 bytes = 20 MB/s raw data
  • Daily: 20 MB/s x 86,400 s = 1.73 TB/day
  • 90-day retention: 1.73 TB x 90 = 155 TB
  • Redshift cost: 155 TB x $0.29/GB/month = $44,950/month storage

Why This Is Wrong: Data warehouses are optimized for:

  • Structured, aggregated data (summaries, not raw samples)
  • Complex SQL queries with JOINs across dimensions
  • Low-latency BI dashboards (<5 seconds)

IoT raw sensor data is:

  • High-volume, low-value until aggregated
  • Rarely queried directly (only during debugging)
  • Better suited for batch processing (Spark jobs)

Correct Approach - Tiered Storage:

Tier 1 - S3 Data Lake (Raw):

  • Store 155 TB raw data in S3 Standard-IA
  • Cost: 155 TB x $0.0125/GB/month = $1,938/month (23x cheaper)
  • Query via Athena or Spark when needed (rare)

Tier 2 - Redshift (Aggregates):

  • ETL pipeline: Compress 1.73 TB/day raw to 1.7 GB/day aggregates (1000x reduction)
    • Extract FFT features (20 frequency bins per machine per second)
    • Store: machine_id, timestamp, freq_bin_1..20, anomaly_score
  • 90-day aggregates: 1.7 GB x 90 = 153 GB
  • Redshift cost: 153 GB x $0.29/GB/month = $44/month

Tier 3 - Redis (Real-Time Cache):

  • Cache last 1 hour of aggregates for live dashboards
  • Cost: ElastiCache r5.large = $150/month

Total New Cost: $1,938 + $44 + $150 = $2,132/month (vs $45,000)

Savings: $42,868/month = $514,416/year

Query Performance:

  • Live dashboard (last hour): 0.5s (Redis cache) - faster than before
  • Historical analysis (last 90 days): 3s (Redshift aggregates) - same as before
  • Raw data investigation (rare): 5 minutes (Athena scan of S3) - acceptable for debugging

Key Lesson: Data warehouses charge premium prices for premium performance on structured data. IoT raw sensor streams are unstructured, high-volume, and rarely queried. Store raw data in cheap object storage (S3), aggregate to warehouse only what you query frequently. This pattern saves 95%+ on storage costs while maintaining (or improving) query performance for actual use cases.

39.3.10 Interactive: IoT Storage Cost Estimator

Estimate the monthly cost difference between storing IoT sensor data in a data warehouse versus a tiered architecture (data lake + warehouse for aggregates).
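As a starting point for such an estimate, here is a sketch that uses the rates from the vibration-sensor example above; the default prices and the 1000x aggregation ratio are that example's numbers, not general figures, and should be replaced with your own workload's values:

```python
def monthly_cost_usd(raw_tb, warehouse_rate=0.29, lake_rate=0.0125,
                     aggregate_ratio=1000, cache_usd=150):
    """Compare warehouse-only vs tiered storage for one retention period.

    warehouse_rate / lake_rate are $/GB/month; aggregate_ratio is the
    raw-to-aggregate compression achieved by the ETL pipeline.
    """
    raw_gb = raw_tb * 1000
    warehouse_only = raw_gb * warehouse_rate
    tiered = (raw_gb * lake_rate                            # raw data in object storage
              + (raw_gb / aggregate_ratio) * warehouse_rate  # aggregates in warehouse
              + cache_usd)                                  # real-time cache
    return warehouse_only, tiered

wh, tiered = monthly_cost_usd(155)  # the 90-day vibration dataset
print(f"warehouse-only: ${wh:,.0f}/month, tiered: ${tiered:,.0f}/month")
```

Running it reproduces the comparison from the common-mistake example: roughly $45,000/month for warehouse-only against about $2,100/month for the tiered design.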

Imagine a giant picture book that shows how data travels from tiny sensors all the way to big computers in the sky!

39.3.11 The Sensor Squad Adventure: The Architecture Picture Book

One rainy afternoon, Sammy the Sensor found a huge picture book in the library called “The Amazing Journey of Data.” She called her friends over to look.

The first page showed a Data Lake – a giant pool where ALL kinds of data splash in together. “It’s like a swimming pool where numbers, pictures, and words all swim around!” said Sammy. “You can dive in and find whatever you need.”

The next page showed a Data Warehouse – neat shelves with perfectly organized boxes. “This is more like a tidy cupboard,” explained Max the Microcontroller. “Everything is sorted and labeled so you can grab what you need super fast.”

Lila the LED pointed to a page showing Stream Processing – data flowing like a river with little workers checking each drop as it passes. “These workers read each piece of data as it flows by, looking for anything unusual. If they spot something hot or cold, they shout an alert!”

Bella the Battery’s favorite page showed the Kalman Filter – a clever helper that combines guesses with measurements. “Imagine you’re trying to guess where a ball is going. The Kalman Filter takes your best guess AND what your eyes see, then combines them for a SUPER accurate answer!”

“This picture book shows all the different ways we can organize and use data,” said Max. “Architects use these patterns to build amazing IoT systems!”

39.3.12 Key Words for Kids

| Word | What It Means |
|---|---|
| Data Lake | A big storage pool where all kinds of raw data are kept together |
| Data Warehouse | An organized storage where data is neatly sorted for fast searching |
| Architecture | The plan for how all the parts of a system fit together |
| Kalman Filter | A smart math trick that combines guesses and measurements for better accuracy |

39.4 Quiz: Cloud Data Architecture

  1. Which IoT Reference Model level handles data reconciliation and normalization?
      A. Level 3: Edge/Fog Computing
      B. Level 4: Data Accumulation
      C. Level 5: Data Abstraction
      D. Level 6: Application
  2. What does IaaS stand for in cloud computing?
      A. Internet as a Service
      B. Infrastructure as a Service
      C. Integration as a Service
      D. Information as a Service
  3. Which is NOT one of the top cloud security threats identified by the Cloud Security Alliance?
      A. Data breaches
      B. Weak identity management
      C. High bandwidth costs
      D. Insecure APIs
  4. In data cleaning, what should happen to a temperature reading of 150 degrees C when the maximum allowed is 60 degrees C?
      A. Delete the entire record
      B. Mark as suspicious and clamp to the maximum (60 degrees C)
      C. Accept as valid
      D. Ignore and continue
  5. What is data provenance?
      A. Where data is physically stored
      B. How much data is generated
      C. Recording the sources and transformations of data
      D. The speed of data transmission
  6. Which cloud service model provides complete applications to users?
      A. IaaS
      B. PaaS
      C. SaaS
      D. FaaS
  7. In a 3-year TCO comparison, if cloud costs $100k and on-premises costs $180k, what is the savings percentage with cloud?
      A. 25%
      B. 44%
      C. 56%
      D. 80%
  8. Which of the 4 Vs benefits most directly from cloud parallel processing?
      A. Volume
      B. Velocity
      C. Variety
      D. Veracity
  9. What is the main advantage of tracking data freshness?
      A. Reduce storage costs
      B. Determine reliability for time-sensitive decisions
      C. Improve network speed
      D. Simplify data formats
  10. In on-premises deployment, which is typically the largest ongoing annual cost?
      A. Power
      B. Hardware maintenance
      C. IT staffing
      D. Network bandwidth

Answers: 1-C, 2-B, 3-C, 4-B, 5-C, 6-C, 7-B, 8-B, 9-B, 10-C

Common Pitfalls

A reference architecture designed for 10 million devices may be wildly over-engineered for a 500-device pilot. Always validate the selected pattern against your actual throughput, latency, and cost requirements before committing.

Every architecture gallery entry makes implicit assumptions (24/7 connectivity, uniform message rates, centralised identity). Understand which assumptions your deployment violates before adapting the pattern.

A gallery architecture with 12 cloud services looks elegant on a slide but requires expertise in each service for operations. Start with the simplest architecture that meets requirements and add services only when needed.

Moving IoT data between cloud regions or from cloud to on-premises for hybrid architectures incurs significant data transfer costs. Model these costs explicitly in architecture selection.