1258  Edge Processing for Big Data

Learning Objectives

After completing this chapter, you will be able to:

  • Apply the 90/10 rule for IoT data reduction
  • Calculate bandwidth and cost savings from edge processing
  • Design edge-to-cloud data pipelines
  • Determine when to process at edge versus cloud

1258.1 Edge Computing: Making Big Data Manageable

Here’s a secret that changes everything: 90% of IoT data doesn’t need to leave the device.

1258.1.1 The Traffic Camera Example

A single traffic camera generates 30 frames per second, 24/7:

Without Edge Processing:

  • Raw data: 30 fps x 2 MB/frame x 86,400 seconds/day = 5.2 TB per day per camera
  • A city with 10,000 cameras would generate 52 PB per day (52,000,000 GB!)
  • Network bandwidth required: 52 PB / 86,400 sec = about 600 GB/sec, or roughly 4.8 Tbps continuous (impossible for most cities)
  • Storage cost: 52 PB x $0.023/GB/month (AWS S3) = $1,196,000/month ($14M/year!) just to keep a single day of footage

With Edge Processing:

Instead of sending all video to the cloud:

  1. Camera processes video locally (edge computing using Nvidia Jetson or similar)
  2. Counts vehicles, detects accidents, identifies license plates
  3. Sends only summary: “Camera 4521: 847 vehicles, 2 accidents, peak at 8:15 AM”
  4. Daily data sent: ~1 KB instead of 5.2 TB (5 billion x reduction!)

Cost Comparison:

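| Metric | No Edge Processing | With Edge Processing | Reduction |
|---|---|---|---|
| Data volume/day | 52 PB | 10 GB | 5,200,000x |
| Network bandwidth | 4.8 Tbps | ~1 Mbps | ~5,000,000x |
| Storage cost/year | $14,000,000 | ~$3 (10 GB at $0.023/GB/month) | ~5,000,000x |
| Edge hardware cost | $0 | $200/camera x 10K = $2M | One-time cost |

The city-wide 10 GB/day figure budgets about 1 MB per camera per day, which is more generous than a single ~1 KB summary per camera.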

ROI: Edge hardware pays for itself in 2 months of storage savings alone!
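The arithmetic behind the camera example can be reworked in a few lines. The following is a minimal sketch, assuming decimal units (1 MB = 10^6 bytes) and using only the frame rate, frame size, storage rate, and hardware price quoted in this section; the variable and function names are illustrative.

```python
# Back-of-the-envelope sizing for the traffic camera example.
# Decimal units throughout: 1 MB = 1e6 bytes, 1 GB = 1e9 bytes.

FPS = 30                      # frames per second (from the example)
FRAME_BYTES = 2e6             # 2 MB per frame (from the example)
SECONDS_PER_DAY = 86_400
CAMERAS = 10_000
S3_USD_PER_GB_MONTH = 0.023   # storage rate quoted in the example
EDGE_USD_PER_CAMERA = 200     # one-time hardware cost from the table


def raw_bytes_per_day(fps=FPS, frame_bytes=FRAME_BYTES):
    """Raw video volume produced by one camera in a day."""
    return fps * frame_bytes * SECONDS_PER_DAY


per_camera = raw_bytes_per_day()
city_total = per_camera * CAMERAS
city_bits_per_sec = city_total * 8 / SECONDS_PER_DAY
monthly_storage_usd = city_total / 1e9 * S3_USD_PER_GB_MONTH
payback_months = EDGE_USD_PER_CAMERA * CAMERAS / monthly_storage_usd

print(f"Per camera:      {per_camera / 1e12:.2f} TB/day")
print(f"City total:      {city_total / 1e15:.1f} PB/day")
print(f"Sustained link:  {city_bits_per_sec / 1e12:.1f} Tbps")
print(f"Storage:         ${monthly_storage_usd:,.0f}/month")
print(f"Hardware payback: {payback_months:.1f} months")
```

Changing `FPS`, `FRAME_BYTES`, or `CAMERAS` shows how quickly the totals move; the payback estimate only counts storage for one day of footage, so it is a conservative lower bound on the savings.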

1258.1.2 The 90/10 Rule for IoT Data

%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#2C3E50', 'primaryTextColor': '#fff', 'primaryBorderColor': '#16A085', 'lineColor': '#16A085', 'secondaryColor': '#E67E22', 'tertiaryColor': '#ecf0f1'}}}%%
flowchart TB
    subgraph Edge["Edge Layer - Local Processing"]
        E1[Raw Sensor Data<br/>100% = 5.2 TB/day] --> E2[Edge Processor<br/>Filter, Aggregate, Analyze]
        E2 --> E3[90% Discarded<br/>4.7 TB redundant/noise]
        E2 --> E4[9% Aggregated<br/>470 GB summaries]
        E2 --> E5[1% Critical<br/>52 GB anomalies/alerts]
    end

    subgraph Cloud["Cloud Layer - Deep Analysis"]
        E4 --> C1[Batch Analytics<br/>Daily/Weekly Trends]
        E5 --> C2[Real-Time Alerts<br/>Immediate Action]
        C1 & C2 --> C3[Long-Term Storage<br/>522 GB/day total]
    end

    E3 -.->|Never transmitted| D[Deleted at Edge]

    style E1 fill:#2C3E50,stroke:#16A085,color:#fff
    style E2 fill:#16A085,stroke:#2C3E50,color:#fff
    style E3 fill:#E74C3C,stroke:#2C3E50,color:#fff
    style E4 fill:#E67E22,stroke:#2C3E50,color:#fff
    style E5 fill:#27AE60,stroke:#2C3E50,color:#fff
    style C1 fill:#2C3E50,stroke:#16A085,color:#fff
    style C2 fill:#16A085,stroke:#2C3E50,color:#fff
    style C3 fill:#7F8C8D,stroke:#2C3E50,color:#fff
    style D fill:#E74C3C,stroke:#2C3E50,color:#fff,stroke-dasharray: 5 5

Figure 1258.1: Edge Computing Reduces 5.2TB Daily Data to Manageable 522GB

%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#2C3E50', 'primaryTextColor': '#fff', 'primaryBorderColor': '#16A085', 'lineColor': '#16A085', 'secondaryColor': '#E67E22', 'tertiaryColor': '#ecf0f1'}}}%%
sequenceDiagram
    participant S as IoT Sensor
    participant E as Edge Processor
    participant C as Cloud
    participant D as Dashboard

    Note over S,D: Data Volume Transformation Timeline

    S->>E: Raw: 100 readings/sec (400 KB/sec)
    Note right of S: 5.2 TB/day raw data

    E->>E: Filter redundant (temp unchanged)
    E->>E: Aggregate to 1-min averages
    E->>E: Detect anomalies locally

    alt Normal Reading (99% of time)
        E->>E: Discard - no change
        Note right of E: 90% never leaves edge
    else Aggregated Summary (every minute)
        E->>C: Summary: avg=72F, min=71F, max=73F
        Note right of E: 9% as summaries
    else Anomaly Detected
        E->>C: ALERT: Temperature spike 95F!
        C->>D: Immediate notification
        Note right of E: 1% as real-time alerts
    end

    Note over C: Cloud receives 522 GB/day<br/>(10x reduction from 5.2 TB)

Figure 1258.2: Timeline showing how edge processing progressively reduces data volume in real-time

Edge Computing Data Reduction Pipeline: Edge processors discard the 90% of data that is redundant, aggregate 9% into summaries, and forward the 1% that represents critical anomalies to the cloud, transforming an impossible 5.2 TB/day into a manageable 522 GB/day (a 10x reduction).

The 90/10 Rule Breakdown:

| Data Type | Processed At | Example | Sent to Cloud? |
|---|---|---|---|
| Raw readings (90%) | Edge only | Temperature: 72.1 F, 72.1 F, 72.2 F, 72.1 F… | No (redundant) |
| Aggregated summaries (9%) | Edge to Cloud | Average temp: 72 F, min: 68 F, max: 76 F | Yes (batch) |
| Anomalies/alerts (1%) | Edge to Cloud | Temperature spike: 95 F! | Yes (real-time) |

This transforms an impossible big data problem into a manageable analytics problem.
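The three rows of the breakdown map naturally onto a small edge-side function. Here is a minimal sketch, assuming a fixed alert threshold of 90 F and a window of readings already collected in a list; `process_window` and its parameters are hypothetical names, not part of any particular platform.

```python
from statistics import mean


def process_window(readings, alert_threshold=90.0):
    """Reduce one window of raw readings to a summary plus any alerts.

    The raw readings stay on the device; only the returned summary
    (batch) and alerts (real-time) would be transmitted.
    """
    if not readings:
        raise ValueError("window must contain at least one reading")
    # Assumption: a reading at or above the threshold counts as an anomaly.
    alerts = [r for r in readings if r >= alert_threshold]
    summary = {
        "count": len(readings),
        "avg": round(mean(readings), 1),
        "min": min(readings),
        "max": max(readings),
    }
    return summary, alerts


window = [72.1, 72.1, 72.2, 72.1, 95.0, 72.0]
summary, alerts = process_window(window)
print("summary:", summary)
print("alerts: ", alerts)
```

A production filter would more likely compare against a rolling baseline than a fixed threshold, but the shape is the same: many readings in, one summary and the occasional alert out.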

1258.1.3 Progressive Data Reduction Example

Smart Building HVAC System (1,000 temperature sensors):

Level 1 - Sensors: 1,000 sensors x 1 reading/sec x 4 bytes = 4 KB/sec raw data

Level 2 - Edge Gateway (per floor, 10 floors):
  - Aggregate 100 sensors to 10 room averages
  - 10 floors x 10 rooms x 1 reading/sec x 4 bytes = 400 bytes/sec
  - Reduction: 4 KB to 400 bytes (10x smaller)

Level 3 - Building Controller:
  - Aggregate 100 rooms to 10 zone averages
  - 10 zones x 1 reading/sec x 4 bytes = 40 bytes/sec
  - Reduction: 400 bytes to 40 bytes (10x smaller)

Level 4 - Cloud:
  - Receive zone averages + anomaly alerts only
  - 40 bytes/sec normal + 1 alert/minute average
  - Total cloud data: ~3.5 GB/year (vs 126 GB/year without edge)

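Final reduction: the steady-state stream shrinks 100x (4 KB/sec to 40 bytes/sec). Counting alerts, the cloud still receives roughly 36x less data (3.5 GB vs 126 GB per year), while preserving the important information!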
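The level-by-level aggregation can be simulated directly. This sketch assumes synthetic temperature values and simple averaging over consecutive groups of ten; it reports the steady-state data rate at each level and leaves alert traffic out of the totals.

```python
from statistics import mean


def aggregate(values, group_size):
    """Average consecutive groups of `group_size` values."""
    return [
        mean(values[i:i + group_size])
        for i in range(0, len(values), group_size)
    ]


BYTES_PER_READING = 4
SECONDS_PER_YEAR = 365 * 86_400

# Synthetic readings for 1,000 sensors (illustrative values only).
sensors = [70.0 + (i % 10) * 0.5 for i in range(1_000)]
rooms = aggregate(sensors, 10)   # 1,000 sensors -> 100 room averages
zones = aggregate(rooms, 10)     # 100 rooms -> 10 zone averages

for name, level in [("sensors", sensors), ("rooms", rooms), ("zones", zones)]:
    rate = len(level) * BYTES_PER_READING        # bytes/sec at 1 reading/sec
    per_year_gb = rate * SECONDS_PER_YEAR / 1e9
    print(f"{name:8s} {len(level):5d} values  {rate:5d} B/s  {per_year_gb:8.2f} GB/year")
```

Because every group has the same size, the mean of the zone averages equals the mean of the raw readings, which is one way to check that aggregation has not distorted the overall picture.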

1258.1.4 Real-World Example: Smart City Traffic

The Scenario: 10,000 traffic cameras generating data

%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#2C3E50', 'primaryTextColor': '#fff', 'primaryBorderColor': '#16A085', 'lineColor': '#16A085', 'secondaryColor': '#E67E22', 'tertiaryColor': '#ecf0f1'}}}%%
flowchart TB
    subgraph Traditional["Traditional Processing"]
        T1[10,000 Cameras] --> T2[Single Server]
        T2 --> T3[Process 1 Image<br/>Sequential]
        T3 --> T4[Hours to Complete]
    end

    subgraph BigData["Big Data Processing"]
        B1[10,000 Cameras] --> B2[Distributed Cluster<br/>1000 Nodes]
        B2 --> B3[Process 1,000 Images<br/>Parallel]
        B3 --> B4[Seconds to Complete]
    end

    style T1 fill:#E67E22,stroke:#2C3E50,color:#fff
    style T4 fill:#E74C3C,stroke:#2C3E50,color:#fff
    style B1 fill:#2C3E50,stroke:#16A085,color:#fff
    style B4 fill:#27AE60,stroke:#2C3E50,color:#fff

Figure 1258.3: Sequential versus Parallel Distributed Image Processing

Traditional vs Big Data Processing: Sequential processing on a single server takes hours for 10,000 images; parallel distributed processing completes in seconds by dividing the work across 1,000 nodes.

| Approach | Method | Time |
|---|---|---|
| Traditional | Process 1 image at a time | Hours |
| Big Data | Process 1,000 images simultaneously | Seconds |
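The difference between the two approaches is mostly a difference in how the same per-image function is scheduled. The sketch below uses a local thread pool from Python's standard library as a small-scale stand-in for a distributed cluster; `analyze_image` is a hypothetical placeholder for real image analysis.

```python
from concurrent.futures import ThreadPoolExecutor


def analyze_image(image_id):
    """Hypothetical stand-in for per-image work (e.g. vehicle counting)."""
    return image_id, image_id % 7   # fake "vehicle count"


image_ids = range(10_000)

# Traditional: one image at a time.
sequential = [analyze_image(i) for i in image_ids]

# Parallel: the same function mapped across a pool of workers.
# Executor.map returns results in input order.
with ThreadPoolExecutor(max_workers=8) as pool:
    parallel = list(pool.map(analyze_image, image_ids))

print("Same results:", parallel == sequential)
```

Threads in CPython do not speed up CPU-bound work, so this only illustrates the programming model; a real deployment would spread the work across processes or machines.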

1258.2 Understanding Check: When to Process at Edge vs Cloud

Scenario: An industrial IoT deployment has 1,000 vibration sensors on factory equipment, each sampling at 10,000 Hz (10,000 readings per second). Each reading is 4 bytes. Engineers need to detect bearing failures in real-time (<100ms) while also storing long-term data for predictive maintenance models.

Think about: Why would you process at edge instead of sending all data to the cloud?

Key Insights:

  1. Bandwidth Problem:

    • Raw data: 1,000 sensors x 10,000 readings/sec x 4 bytes = 40 MB/second = 320 Mbps
    • Most factory internet: 10-100 Mbps upload
    • Real number: Edge processing reduces bandwidth by 99% (40 MB/s to 400 KB/s)

    Edge processing approach:

    Per sensor: 10,000 readings/sec (40 KB/sec) -> Edge FFT analysis -> 1 summary/sec (400 bytes/sec)
    All 1,000 sensors: 40 MB/sec -> Process locally -> 400 KB/sec to cloud
    Bandwidth: 320 Mbps -> 3.2 Mbps (99% reduction!)
  2. Latency Problem:

    • Round-trip to cloud: 50-200ms (too slow for real-time failure detection)
    • Edge processing: <10ms (fast enough to trigger emergency shutdown)
    • Real number: Edge enables <100ms response time vs 200ms+ cloud processing
  3. Cost Problem:

    • Cloud storage: 40 MB/sec x 86,400 sec/day = 3.46 TB/day = 1.26 PB/year
    • AWS S3 Standard: 1,260 TB x $23/TB/month = $29,000/month = $348,000/year
    • Edge processing + summary storage: 400 KB/sec x 86,400 = 34.56 GB/day = 12.6 TB/year x $23/TB/month = about $290/month = about $3,480/year
    • Real number: Edge processing saves about $344,500/year (99% cost reduction)
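The bandwidth and cost figures for this scenario follow from a handful of inputs. This is a minimal sketch under the scenario's stated assumptions (1,000 sensors, 10,000 Hz, 4-byte readings, one 400-byte summary per sensor per second, $23/TB/month); the helper names are illustrative.

```python
SENSORS = 1_000
SAMPLE_HZ = 10_000
BYTES_PER_SAMPLE = 4
SUMMARY_BYTES_PER_SEC = 400   # per sensor: one FFT summary per second
USD_PER_TB_MONTH = 23         # storage rate used in the text
SECONDS_PER_DAY = 86_400


def fleet_rate(bytes_per_sec_per_sensor, sensors=SENSORS):
    """Aggregate byte rate for the whole fleet."""
    return bytes_per_sec_per_sensor * sensors


def yearly_storage_usd(bytes_per_sec):
    """Monthly bill for holding one year of data, times 12 months."""
    tb_per_year = bytes_per_sec * SECONDS_PER_DAY * 365 / 1e12
    return tb_per_year * USD_PER_TB_MONTH * 12


raw = fleet_rate(SAMPLE_HZ * BYTES_PER_SAMPLE)   # bytes/sec, all sensors
edge = fleet_rate(SUMMARY_BYTES_PER_SEC)         # bytes/sec, all sensors

print(f"Raw:  {raw * 8 / 1e6:.1f} Mbps, ${yearly_storage_usd(raw):,.0f}/year")
print(f"Edge: {edge * 8 / 1e6:.1f} Mbps, ${yearly_storage_usd(edge):,.0f}/year")
print(f"Reduction: {100 * (1 - edge / raw):.0f}%")
```

The cost model is deliberately crude: it prices a full year of accumulated data at the monthly rate for twelve months, matching the method used in the list above.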

Decision Rule:

Process at EDGE when:
- High sample rate (>100 Hz) generates massive data
- Real-time response required (<100ms)
- Limited bandwidth to cloud
- Data can be summarized without losing the information you need (FFT, aggregates)

Send to CLOUD when:
- Complex ML models need GPU clusters
- Historical analysis across all devices
- Data must be preserved in raw form
- Cost of edge processing > cloud storage

Edge-Cloud Hybrid Architecture:

Edge Device:
- Raw sampling: 10,000 Hz vibration data
- FFT analysis: Extract frequency spectrum
- Anomaly detection: Compare to baseline
- Alert: Send if threshold exceeded (immediate)
- Summary: Send statistics every second (normal operation)

Cloud:
- Store: Summaries from all 1,000 sensors
- Train: ML models on historical patterns
- Predict: Equipment failures weeks in advance
- Deploy: Updated models back to edge devices
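The edge half of this architecture can be sketched in pure Python. The example below is an illustration only: a naive discrete Fourier transform stands in for an optimized FFT, the window is a synthetic sine wave, and `edge_step`, `baseline_peak`, and `alert_ratio` are hypothetical names and tuning values.

```python
import cmath
import math


def magnitude_spectrum(samples):
    """Naive DFT magnitudes for bins 0..N/2 (O(N^2), fine for short windows)."""
    n = len(samples)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(samples))) / n
        for k in range(n // 2 + 1)
    ]


def edge_step(samples, sample_hz, baseline_peak, alert_ratio=3.0):
    """Return ('alert' or 'summary', payload) for one window of samples."""
    spectrum = magnitude_spectrum(samples)
    # Skip bin 0 (the DC offset) when looking for the dominant frequency.
    peak_bin = max(range(1, len(spectrum)), key=spectrum.__getitem__)
    payload = {
        "peak_hz": peak_bin * sample_hz / len(samples),
        "peak_mag": spectrum[peak_bin],
    }
    # Assumption: a peak far above the learned baseline signals a fault.
    is_alert = payload["peak_mag"] > alert_ratio * baseline_peak
    return ("alert" if is_alert else "summary"), payload


# Synthetic 64-sample window: a 1,250 Hz tone sampled at 10,000 Hz.
SAMPLE_HZ = 10_000
window = [math.sin(2 * math.pi * 1_250 * t / SAMPLE_HZ) for t in range(64)]

print(edge_step(window, SAMPLE_HZ, baseline_peak=0.5))
print(edge_step([5 * s for s in window], SAMPLE_HZ, baseline_peak=0.5))
```

In the full loop, the cloud would periodically retrain and push a new `baseline_peak` (or a richer model) back to the device, which is the "Deploy" step listed above.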

1258.3 Edge vs Fog vs Cloud Decision Framework

%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#2C3E50', 'primaryTextColor': '#fff', 'primaryBorderColor': '#16A085', 'lineColor': '#16A085', 'secondaryColor': '#E67E22', 'tertiaryColor': '#7F8C8D'}}}%%
graph TD
    Start[IoT Data Decision] --> Q1{Data Rate?}

    Q1 -->|> 1 MB/sec<br/>per device| Edge[Process at Edge]
    Q1 -->|< 1 MB/sec| Q2{Latency Requirement?}

    Q2 -->|< 100ms<br/>Real-time critical| Edge
    Q2 -->|100ms - 1s| Fog[Process at Fog]
    Q2 -->|> 1 second| Q3{Bandwidth Available?}

    Q3 -->|< 10 Mbps<br/>Limited| Fog
    Q3 -->|> 100 Mbps<br/>Good connectivity| Cloud[Process in Cloud]

    Edge --> EdgeDetails[Edge Processing<br/>FFT, filtering<br/>Anomaly detection<br/>Send summaries only<br/>Cost: $5-50/device]
    Fog --> FogDetails[Fog Processing<br/>Aggregate 10-100 devices<br/>Local analytics<br/>Reduce bandwidth 90%<br/>Cost: $200-2000/gateway]
    Cloud --> CloudDetails[Cloud Processing<br/>ML training<br/>Historical analytics<br/>Unlimited compute<br/>Cost: $0.02/compute hour]

    style Start fill:#2C3E50,stroke:#16A085,color:#fff
    style Q1 fill:#16A085,stroke:#2C3E50,color:#fff
    style Q2 fill:#16A085,stroke:#2C3E50,color:#fff
    style Q3 fill:#16A085,stroke:#2C3E50,color:#fff
    style Edge fill:#E67E22,stroke:#2C3E50,color:#fff
    style Fog fill:#E67E22,stroke:#2C3E50,color:#fff
    style Cloud fill:#E67E22,stroke:#2C3E50,color:#fff

Figure 1258.4: Edge versus Fog versus Cloud Processing Decision Tree
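The decision tree translates into a short function. This is a sketch rather than a definitive rule: the function name and parameter names are hypothetical, and because the diagram does not say what to do with links between 10 and 100 Mbps, the code assumes anything at or below 100 Mbps goes to fog.

```python
def choose_tier(data_rate_mb_s, latency_ms, uplink_mbps):
    """Walk the decision tree: data rate, then latency, then bandwidth."""
    if data_rate_mb_s > 1:        # more than 1 MB/sec per device
        return "edge"
    if latency_ms < 100:          # real-time critical
        return "edge"
    if latency_ms <= 1_000:       # 100 ms to 1 s
        return "fog"
    # Latency budget above 1 s: available bandwidth decides.
    # Assumption: the 10-100 Mbps gap in the diagram is treated as fog.
    return "cloud" if uplink_mbps > 100 else "fog"


print(choose_tier(60, 500, 1_000))      # video camera -> edge
print(choose_tier(0.04, 50, 100))       # vibration sensor, tight latency -> edge
print(choose_tier(0.001, 500, 100))     # -> fog
print(choose_tier(0.001, 5_000, 5))     # -> fog
print(choose_tier(0.001, 5_000, 500))   # -> cloud
```

Encoding the thresholds as code makes them easy to adjust per deployment; the numbers themselves are rules of thumb, not hard limits.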

1258.3.1 Specific Numbers for Common IoT Scenarios

| Scenario | Devices | Data Rate | Recommended Stack | Monthly Cost | Key Metric |
|---|---|---|---|---|---|
| Smart Home | 10-50 | 1 KB/min per device | Cloud only (AWS IoT) | $5-20 | Simplicity |
| Building Automation | 100-1,000 | 100 bytes/min | Fog + Cloud (InfluxDB) | $50-200 | 10x storage compression |
| Smart City | 10,000-100,000 | 1 KB/min | Edge + Cloud (Kafka + S3) | $2,000-10,000 | 99% bandwidth reduction |
| Industrial Monitoring | 1,000-10,000 | 10 KB/sec (high-freq) | Edge + Time-Series DB | $500-5,000 | <100ms latency |
| Connected Vehicles | 1M+ vehicles | 1 MB/hour per vehicle | Batch processing (Hadoop) | $20,000+ | Petabyte-scale analytics |

1258.3.2 Edge Processing Decision: Specific Bandwidth Numbers

Use Edge Processing When:

High-Frequency Sensors (saves 99%+ bandwidth):

Example: Vibration sensor at 10,000 Hz
Raw: 10,000 samples/sec x 4 bytes = 40 KB/sec = 320 kbps
Edge FFT: 1 summary/sec x 400 bytes = 400 bytes/sec = 3.2 kbps
Reduction: 99% bandwidth saved

Video Processing (saves 99.9%+ bandwidth):

Example: Security camera at 1080p 30fps
Raw: 30 frames/sec x 2 MB/frame = 60 MB/sec = 480 Mbps
Edge object detection: 1 event/min x 1 KB = about 17 bytes/sec = about 133 bps
Reduction: more than 99.9999% bandwidth saved

Limited Connectivity (cellular/satellite):

Cellular data: $10/GB typical
Raw streaming: 40 MB/sec x 86,400 sec/day = 3.46 TB/day = $34,600/day
Edge summarization: 400 KB/sec x 86,400 sec/day = 34.56 GB/day = $345.60/day
Savings: $34,600/day to about $346/day (99% cost reduction)

Skip Edge Processing When:

Low-Rate Sensors (edge overhead not worth it):

Example: Temperature sensor at 1 reading/min
Data: 1 sample/min x 50 bytes = 50 bytes/min = 72 KB/day
Cost to cloud: $0.00072/day (negligible)
Edge device cost: $20 one-time + $2/year power
Not worth edge processing for such low data rates
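Whether edge hardware is worth buying comes down to how long the transfer savings take to cover it. The sketch below assumes the $10/GB cellular rate and the $20 device price quoted above; the idea that edge processing would halve the temperature sensor's traffic is purely an illustrative assumption, as are the function names.

```python
USD_PER_GB = 10.0   # cellular rate quoted above


def daily_transfer_usd(bytes_per_day, usd_per_gb=USD_PER_GB):
    """Transfer cost for one day's data at a flat per-GB rate."""
    return bytes_per_day / 1e9 * usd_per_gb


def payback_days(raw_bytes_per_day, edge_bytes_per_day, device_usd):
    """Days until transfer savings cover the edge hardware, or None if never."""
    saving = (daily_transfer_usd(raw_bytes_per_day)
              - daily_transfer_usd(edge_bytes_per_day))
    return device_usd / saving if saving > 0 else None


# Temperature sensor: 72 KB/day raw; assume edge processing halves it.
temp = payback_days(72_000, 36_000, device_usd=20)
# One vibration sensor: 40 KB/sec raw vs 400 bytes/sec of summaries.
vib = payback_days(40_000 * 86_400, 400 * 86_400, device_usd=20)

print(f"Temperature sensor: {temp / 365:.0f} years to pay back")
print(f"Vibration sensor:   {vib:.2f} days to pay back")
```

The contrast is the point: for the low-rate sensor the hardware never realistically pays for itself, while for the high-rate sensor it does so within a day.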

1258.4 Cloud-Only vs Edge Processing Economics

%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#2C3E50', 'primaryTextColor': '#fff', 'primaryBorderColor': '#16A085', 'lineColor': '#16A085', 'secondaryColor': '#E67E22', 'tertiaryColor': '#ecf0f1'}}}%%
flowchart TB
    subgraph CloudOnly["Cloud-Only (Expensive)"]
        C1[Camera:<br/>5 MB Image] --> C2[Upload to Cloud<br/>432 GB/day]
        C2 --> C3[Cloud Processing<br/>500ms latency]
        C3 --> C4[Cost: $1,166/month<br/>High latency<br/>High bandwidth]
    end

    subgraph EdgeProcessing["Edge Processing (Efficient)"]
        E1[Camera:<br/>5 MB Image] --> E2[Local AI Model<br/>100ms]
        E2 -->|Person?| E3{Detection}
        E3 -->|Yes| E4[Send Alert<br/>100 bytes]
        E3 -->|No| E5[Discard Image]
        E4 & E5 --> E6[Cost: $19.50/month<br/>100ms latency<br/>7.2 GB/day<br/>98% savings]
    end

    style C4 fill:#E74C3C,stroke:#2C3E50,color:#fff
    style E6 fill:#27AE60,stroke:#2C3E50,color:#fff

Figure 1258.5: Cloud Only versus Edge Processing Cost and Latency Comparison

Cloud vs Edge Processing Economics: Sending 5 MB raw images to cloud costs $1,166/month with 500ms latency. Edge processing with local AI model sends only 100-byte alerts, reducing costs to $19.50/month (98% savings) and latency to 100ms (5x faster).

The Cost (Cloud-Only):

  • Bandwidth: 5 MB/second x 86,400 seconds/day = 432 GB/day
  • Data transfer to cloud (at $0.09/GB): 432 GB/day x $0.09/GB = $38.88/day = $1,166/month
  • Processing latency: Upload time + cloud processing = 500ms+ (too slow for real-time)

The Fix: Edge processing - process at the device

| Approach | Data Sent to Cloud | Bandwidth/Day | Monthly Cost | Latency | Savings |
|---|---|---|---|---|---|
| Cloud-only | 5 MB raw images | 432 GB | $1,166 | 500ms | Baseline |
| Edge processing | 100-byte alerts (only when person detected) | 7.2 GB | $19.50 | 100ms | 98% cost + 5x faster |
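The edge branch of the comparison is essentially a gate in front of the uplink: run detection locally, send a small alert on a hit, and drop the frame otherwise. The following sketch assumes a hypothetical `detect_person` function and a made-up 2% detection rate, and it counts alert bytes only.

```python
import random

IMAGE_BYTES = 5_000_000   # 5 MB frame, as in the figure
ALERT_BYTES = 100         # alert size, as in the figure


def detect_person(frame):
    """Hypothetical detector; a real one would run a local AI model."""
    return frame["has_person"]


def bytes_uploaded(frames, edge=True):
    """Total bytes sent to the cloud for a batch of frames."""
    if not edge:
        return len(frames) * IMAGE_BYTES   # cloud-only: upload everything
    # Edge: send a small alert per detection, discard everything else.
    return sum(ALERT_BYTES for f in frames if detect_person(f))


random.seed(0)
# Assumption: roughly 2% of frames contain a person.
frames = [{"has_person": random.random() < 0.02} for _ in range(1_000)]

print("cloud-only:", bytes_uploaded(frames, edge=False), "bytes")
print("edge:      ", bytes_uploaded(frames, edge=True), "bytes")
```

In practice a deployment might also upload the triggering frame or a short clip alongside the alert, which raises the edge total but keeps it far below the cloud-only figure.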

The Rule: “Process data as close to the source as possible. Only send insights to cloud, not raw data.”

1258.5 Common Pitfall: Processing All Data in the Cloud

Pitfall: Processing All Data in the Cloud (Ignoring Edge)

The Mistake: Send all raw sensor data to cloud without any edge processing

What Goes Wrong:

  • Bandwidth: 5 MB/second x 86,400 seconds/day = 432 GB/day
  • Data transfer to cloud (at $0.09/GB): 432 GB/day x $0.09/GB = $38.88/day = $1,166/month
  • Processing latency: Upload time + cloud processing = 500ms+ (too slow for real-time)

The Rule: “Process data as close to the source as possible. Only send insights to cloud, not raw data.”

1258.6 Summary

  • The 90/10 rule states that about 90% of IoT data can be discarded at the edge as redundant, so only about 10% (9% summaries plus 1% alerts) is sent to the cloud.
  • Edge processing reduces costs by 98-99% by eliminating bandwidth charges, storage costs, and cloud compute expenses for redundant data.
  • Latency improves from 500ms to <100ms when processing happens locally at the edge instead of round-tripping to cloud.
  • Decision framework: Use edge for high-frequency sensors (>100 Hz), real-time requirements (<100ms), or limited bandwidth. Use cloud for ML training and historical analytics.

1258.7 What’s Next

Now that you understand edge processing strategies, continue to: