%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#2C3E50', 'primaryTextColor': '#fff', 'primaryBorderColor': '#16A085', 'lineColor': '#16A085', 'secondaryColor': '#E67E22', 'tertiaryColor': '#ecf0f1'}}}%%
flowchart TB
subgraph Edge["Edge Layer - Local Processing"]
E1[Raw Sensor Data<br/>100% = 5.2 TB/day] --> E2[Edge Processor<br/>Filter, Aggregate, Analyze]
E2 --> E3[90% Discarded<br/>4.7 TB redundant/noise]
E2 --> E4[9% Aggregated<br/>470 GB summaries]
E2 --> E5[1% Critical<br/>52 GB anomalies/alerts]
end
subgraph Cloud["Cloud Layer - Deep Analysis"]
E4 --> C1[Batch Analytics<br/>Daily/Weekly Trends]
E5 --> C2[Real-Time Alerts<br/>Immediate Action]
C1 & C2 --> C3[Long-Term Storage<br/>522 GB/day total]
end
E3 -.->|Never transmitted| D[Deleted at Edge]
style E1 fill:#2C3E50,stroke:#16A085,color:#fff
style E2 fill:#16A085,stroke:#2C3E50,color:#fff
style E3 fill:#E74C3C,stroke:#2C3E50,color:#fff
style E4 fill:#E67E22,stroke:#2C3E50,color:#fff
style E5 fill:#27AE60,stroke:#2C3E50,color:#fff
style C1 fill:#2C3E50,stroke:#16A085,color:#fff
style C2 fill:#16A085,stroke:#2C3E50,color:#fff
style C3 fill:#7F8C8D,stroke:#2C3E50,color:#fff
style D fill:#E74C3C,stroke:#2C3E50,color:#fff,stroke-dasharray: 5 5
1258 Edge Processing for Big Data
Learning Objectives
After completing this chapter, you will be able to:
- Apply the 90/10 rule for IoT data reduction
- Calculate bandwidth and cost savings from edge processing
- Design edge-to-cloud data pipelines
- Determine when to process at edge versus cloud
1258.1 Edge Computing: Making Big Data Manageable
Here’s a secret that changes everything: 90% of IoT data doesn’t need to leave the device.
1258.1.1 The Traffic Camera Example
A single traffic camera generates 30 frames per second, 24/7:
Without Edge Processing:
- Raw data: 30 fps x 2 MB/frame x 86,400 seconds/day = 5.2 TB per day per camera
- A city with 10,000 cameras would generate 52 PB per day (52,000,000 GB!)
- Network bandwidth required: 52 PB / 86,400 sec = ~600 GB/sec = ~4.8 Tbps continuous (impossible for most cities)
- Storage cost: 52 PB x $0.023/GB/month (AWS S3) = $1,196,000/month ($14M/year!)
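These figures follow directly from the per-frame numbers; a few lines of Python reproduce them if you want to plug in your own frame size or camera count:

```python
FRAME_MB, FPS, SECONDS_PER_DAY, CAMERAS = 2, 30, 86_400, 10_000

per_camera_tb_day = FRAME_MB * FPS * SECONDS_PER_DAY / 1e6     # MB -> TB: ~5.2 TB/day
city_pb_day = per_camera_tb_day * CAMERAS / 1e3                # TB -> PB: ~52 PB/day
city_gbps = per_camera_tb_day * 1e12 * 8 * CAMERAS / SECONDS_PER_DAY / 1e9
print(f"{per_camera_tb_day:.1f} TB/day per camera, "
      f"{city_pb_day:.0f} PB/day city-wide, ~{city_gbps:,.0f} Gbps sustained")
```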
With Edge Processing:
Instead of sending all video to the cloud:
- Camera processes video locally (edge computing using Nvidia Jetson or similar)
- Counts vehicles, detects accidents, identifies license plates
- Sends only summary: “Camera 4521: 847 vehicles, 2 accidents, peak at 8:15 AM”
- Daily data sent: ~1 KB instead of 5.2 TB (5 billion x reduction!)
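A minimal sketch of what the per-camera edge loop might look like, assuming a hypothetical detect_vehicles() stand-in for the on-device model; note that only the few-hundred-byte JSON summary ever crosses the network:

```python
import json
from collections import Counter

def summarize_day(frames, detect_vehicles):
    """Run detection on-device for every frame; keep only aggregate counts.

    frames: iterable of (minute, image) pairs that never leave the camera.
    detect_vehicles: stand-in for an on-device model (e.g. running on a Jetson),
    returning {"vehicles": int, "accidents": int} per frame.
    """
    totals = Counter()
    peak_minute, peak_count = None, 0
    for minute, frame in frames:
        result = detect_vehicles(frame)          # all heavy lifting happens at the edge
        totals.update(result)
        if result["vehicles"] > peak_count:
            peak_minute, peak_count = minute, result["vehicles"]
    # The only bytes that cross the network: a summary of a few hundred bytes.
    return json.dumps({"camera": 4521,
                       "vehicles": totals["vehicles"],
                       "accidents": totals["accidents"],
                       "peak_minute": peak_minute})

# Toy usage with a fake detector standing in for the real model:
fake_detector = lambda frame: {"vehicles": 1, "accidents": 0}
print(summarize_day([(i, b"raw-jpeg-bytes") for i in range(3)], fake_detector))
```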
Cost Comparison:
| Metric | No Edge Processing | With Edge Processing | Reduction |
|---|---|---|---|
| Data volume/day | 52 PB | 10 GB | 5,200,000x |
| Network bandwidth | ~4.8 Tbps | ~1 Mbps | ~5,000,000x |
| Storage cost/year | $14,000,000 | $280 | 50,000x |
| Edge hardware cost | $0 | $200/camera x 10K = $2M | One-time cost |
ROI: Edge hardware pays for itself in 2 months of storage savings alone!
1258.1.2 The 90/10 Rule for IoT Data
%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#2C3E50', 'primaryTextColor': '#fff', 'primaryBorderColor': '#16A085', 'lineColor': '#16A085', 'secondaryColor': '#E67E22', 'tertiaryColor': '#ecf0f1'}}}%%
sequenceDiagram
participant S as IoT Sensor
participant E as Edge Processor
participant C as Cloud
participant D as Dashboard
Note over S,D: Data Volume Transformation Timeline
S->>E: Raw: continuous sensor stream (~60 MB/sec)
Note right of S: 5.2 TB/day raw data
E->>E: Filter redundant (temp unchanged)
E->>E: Aggregate to 1-min averages
E->>E: Detect anomalies locally
alt Normal Reading (99% of time)
E->>E: Discard - no change
Note right of E: 90% never leaves edge
else Aggregated Summary (every minute)
E->>C: Summary: avg=72F, min=71F, max=73F
Note right of E: 9% as summaries
else Anomaly Detected
E->>C: ALERT: Temperature spike 95F!
C->>D: Immediate notification
Note right of E: 1% as real-time alerts
end
Note over C: Cloud receives 522 GB/day<br/>(10x reduction from 5.2 TB)
Edge Computing Data Reduction Pipeline: Edge processors filter out the ~90% of readings that are redundant, aggregate ~9% into summaries, and forward only the critical ~1% (anomalies and alerts) to the cloud, transforming an unmanageable 5.2 TB/day into a manageable 522 GB/day.
The 90/10 Rule Breakdown:
| Data Type | Processed At | Example | Sent to Cloud? |
|---|---|---|---|
| Raw readings (90%) | Edge only | Temperature: 72.1 F, 72.1 F, 72.2 F, 72.1 F… | No (redundant) |
| Aggregated summaries (9%) | Edge to Cloud | Average temp: 72 F, min: 68 F, max: 76 F | Yes (batch) |
| Anomalies/alerts (1%) | Edge to Cloud | Temperature spike: 95 F! | Yes (real-time) |
This transforms an impossible big data problem into a manageable analytics problem.
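One way an edge gateway might implement this split, as a sketch with an assumed 90 F alert threshold, a 0.5 F deadband, and hypothetical send_alert/send_summary callbacks:

```python
from statistics import mean

ANOMALY_THRESHOLD_F = 90.0   # assumed alert threshold for this sketch
DEADBAND_F = 0.5             # readings this close to the last kept value are "redundant"

def process_minute(readings, last_kept, send_alert, send_summary):
    """Apply the 90/10 split to one minute of raw readings at the edge.

    Redundant readings (the ~90%) are dropped, one min/avg/max summary per
    minute (the ~9%) goes to the cloud in batch, and anomalies (the ~1%)
    are forwarded immediately.
    """
    kept = []
    for value in readings:
        if value >= ANOMALY_THRESHOLD_F:
            send_alert({"alert": "temperature_spike", "value": value})  # the 1%
        if abs(value - last_kept) > DEADBAND_F:
            kept.append(value)            # changed enough to matter
            last_kept = value
        # else: unchanged reading, discarded at the edge (the 90%)
    if kept:
        send_summary({"avg": round(mean(kept), 1), "min": min(kept), "max": max(kept)})
    return last_kept

# Toy usage: 60 nearly constant readings plus one spike.
readings = [72.1] * 59 + [95.0]
process_minute(readings, last_kept=72.1, send_alert=print, send_summary=print)
```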
1258.1.3 Progressive Data Reduction Example
Smart Building HVAC System (1,000 temperature sensors):
Level 1 - Sensors: 1,000 sensors x 1 reading/sec x 4 bytes = 4 KB/sec raw data
Level 2 - Edge Gateway (per floor, 10 floors):
- Aggregate 100 sensors to 10 room averages
- 10 floors x 10 rooms x 1 reading/sec x 4 bytes = 400 bytes/sec
- Reduction: 4 KB to 400 bytes (10x smaller)
Level 3 - Building Controller:
- Aggregate 100 rooms to 10 zone averages
- 10 zones x 1 reading/sec x 4 bytes = 40 bytes/sec
- Reduction: 400 bytes to 40 bytes (10x smaller)
Level 4 - Cloud:
- Receive zone averages + anomaly alerts only
- 40 bytes/sec normal + 1 alert/minute average
- Total cloud data: ~1.3 GB/year (about 3.5 MB/day), vs ~126 GB/year without edge
Final reduction: 100x less data sent to cloud, while preserving all important information!
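The same arithmetic, written as a short script so you can verify each level's reduction factor (the rates below are the ones from this example, not measured values):

```python
SENSOR_BYTES_PER_SEC = 1_000 * 1 * 4      # Level 1: 1,000 sensors, 1 Hz, 4 bytes
FLOOR_BYTES_PER_SEC  = 10 * 10 * 1 * 4    # Level 2: 10 floors x 10 room averages
ZONE_BYTES_PER_SEC   = 10 * 1 * 4         # Level 3: 10 zone averages

SECONDS_PER_YEAR = 86_400 * 365

for label, rate in [("sensors", SENSOR_BYTES_PER_SEC),
                    ("floor gateways", FLOOR_BYTES_PER_SEC),
                    ("building controller", ZONE_BYTES_PER_SEC)]:
    yearly_gb = rate * SECONDS_PER_YEAR / 1e9
    print(f"{label:20s}: {rate:5d} bytes/sec -> {yearly_gb:6.2f} GB/year")

print("overall reduction:", SENSOR_BYTES_PER_SEC // ZONE_BYTES_PER_SEC, "x")
```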
1258.1.4 Real-World Example: Smart City Traffic
The Scenario: 10,000 traffic cameras generating data
%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#2C3E50', 'primaryTextColor': '#fff', 'primaryBorderColor': '#16A085', 'lineColor': '#16A085', 'secondaryColor': '#E67E22', 'tertiaryColor': '#ecf0f1'}}}%%
flowchart TB
subgraph Traditional["Traditional Processing"]
T1[10,000 Cameras] --> T2[Single Server]
T2 --> T3[Process 1 Image<br/>Sequential]
T3 --> T4[Hours to Complete]
end
subgraph BigData["Big Data Processing"]
B1[10,000 Cameras] --> B2[Distributed Cluster<br/>1000 Nodes]
B2 --> B3[Process 1,000 Images<br/>Parallel]
B3 --> B4[Seconds to Complete]
end
style T1 fill:#E67E22,stroke:#2C3E50,color:#fff
style T4 fill:#E74C3C,stroke:#2C3E50,color:#fff
style B1 fill:#2C3E50,stroke:#16A085,color:#fff
style B4 fill:#27AE60,stroke:#2C3E50,color:#fff
Traditional vs Big Data Processing: sequential processing on a single server takes hours for 10,000 images; parallel distributed processing completes in seconds by dividing the work across 1,000 nodes.
| Approach | Method | Time |
|---|---|---|
| Traditional | Process 1 image at a time | Hours |
| Big Data | Process 1,000 images simultaneously | Seconds |
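The contrast can be illustrated on a single machine with a process pool; a real deployment would use a distributed cluster (Spark or Hadoop) rather than one host, and analyze() below is only a placeholder for the per-image work:

```python
from concurrent.futures import ProcessPoolExecutor

def analyze(image_id: int) -> int:
    """Placeholder for per-image work (vehicle counting, plate reading, ...)."""
    return image_id % 7   # stand-in result

def sequential(image_ids):
    return [analyze(i) for i in image_ids]          # one image at a time

def parallel(image_ids, workers=8):
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(analyze, image_ids))   # many images at once

if __name__ == "__main__":
    ids = range(10_000)
    assert sequential(ids) == parallel(ids)         # same answers, different wall-clock time
```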
1258.2 Understanding Check: When to Process at Edge vs Cloud
Scenario: An industrial IoT deployment has 1,000 vibration sensors on factory equipment, each sampling at 10,000 Hz (10,000 readings per second). Each reading is 4 bytes. Engineers need to detect bearing failures in real-time (<100ms) while also storing long-term data for predictive maintenance models.
Think about: Why would you process at edge instead of sending all data to the cloud?
Key Insights:
Bandwidth Problem:
- Raw data: 1,000 sensors x 10,000 readings/sec x 4 bytes = 40 MB/second = 320 Mbps
- Most factory internet: 10-100 Mbps upload
- Real number: Edge processing reduces bandwidth by 99% (40 MB/s to 400 KB/s)
Edge processing approach:
Raw: 10,000 readings/sec per sensor -> Edge FFT analysis -> 1 summary/sec (400 bytes)
Fleet of 1,000 sensors: 40 MB/sec raw -> processed locally -> 400 KB/sec to cloud
Bandwidth: 320 Mbps -> 3.2 Mbps (99% reduction!)
Latency Problem:
- Round-trip to cloud: 50-200ms (too slow for real-time failure detection)
- Edge processing: <10ms (fast enough to trigger emergency shutdown)
- Real number: Edge enables <100ms response time vs 200ms+ cloud processing
Cost Problem:
- Cloud storage: 40 MB/sec x 86,400 sec/day = 3.46 TB/day = 1.26 PB/year
- AWS S3 Standard: 1,260 TB x $23/TB/month = $29,000/month = $348,000/year
- Edge processing + summary storage: 400 KB/sec x 86,400 = 34.56 GB/day = ~12.6 TB/year = ~$290/month (~$3,500/year)
- Real number: Edge processing saves roughly $344,500/year (a ~99% cost reduction)
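To make the cost arithmetic reusable, here is a small helper that reproduces the figures above under the same simplification (a full year of data billed at $23/TB/month for 12 months); the function name and rates are illustrative, not an AWS API:

```python
def yearly_storage_cost(bytes_per_sec: float, usd_per_tb_month: float = 23.0) -> float:
    """Rough yearly storage bill for a steady stream (one year of data, billed 12 months)."""
    tb_per_year = bytes_per_sec * 86_400 * 365 / 1e12
    return tb_per_year * usd_per_tb_month * 12

raw  = yearly_storage_cost(40e6)     # 1,000 sensors x 10,000 Hz x 4 bytes = 40 MB/sec
edge = yearly_storage_cost(400e3)    # 1,000 summaries/sec x 400 bytes = 400 KB/sec
print(f"raw: ${raw:,.0f}/yr  edge: ${edge:,.0f}/yr  saved: {1 - edge / raw:.1%}")
```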
Decision Rule:
Process at EDGE when:
- High sample rate (>100 Hz) generates massive data
- Real-time response required (<100ms)
- Limited bandwidth to cloud
- Data can be summarized without loss (FFT, aggregates)
Send to CLOUD when:
- Complex ML models need GPU clusters
- Historical analysis across all devices
- Data must be preserved in raw form
- Cost of edge processing > cloud storage
Edge-Cloud Hybrid Architecture:
Edge Device:
- Raw sampling: 10,000 Hz vibration data
- FFT analysis: Extract frequency spectrum
- Anomaly detection: Compare to baseline
- Alert: Send if threshold exceeded (immediate)
- Summary: Send statistics every second (normal operation)
Cloud:
- Store: Summaries from all 1,000 sensors
- Train: ML models on historical patterns
- Predict: Equipment failures weeks in advance
- Deploy: Updated models back to edge devices
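A minimal sketch of the per-second edge loop in this hybrid design, using NumPy's FFT; the baseline spectrum, the 3x alert factor, and the send_alert/send_summary callbacks are assumptions for illustration rather than a prescribed implementation:

```python
import numpy as np

SAMPLE_RATE_HZ = 10_000
BASELINE = None          # per-machine baseline spectrum, assumed to be loaded at startup
ALERT_FACTOR = 3.0       # assumed "3x baseline energy" anomaly rule

def process_one_second(samples: np.ndarray, send_alert, send_summary):
    """Reduce 10,000 raw samples (40 KB) to one ~400-byte summary per second."""
    spectrum = np.abs(np.fft.rfft(samples))                       # frequency content
    dominant_hz = np.argmax(spectrum) * SAMPLE_RATE_HZ / len(samples)
    summary = {
        "rms": float(np.sqrt(np.mean(samples ** 2))),
        "dominant_hz": float(dominant_hz),
        "peak_amplitude": float(spectrum.max()),
    }
    if BASELINE is not None and spectrum.max() > ALERT_FACTOR * BASELINE.max():
        send_alert(summary)    # the real-time 1%: shutdown decision stays local
    send_summary(summary)      # the 9%: one small record per second to the cloud

# Toy usage: one second of synthetic 120 Hz vibration plus noise.
t = np.arange(SAMPLE_RATE_HZ) / SAMPLE_RATE_HZ
signal = np.sin(2 * np.pi * 120 * t) + 0.1 * np.random.randn(SAMPLE_RATE_HZ)
process_one_second(signal, send_alert=print, send_summary=print)
```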
1258.3 Edge vs Fog vs Cloud Decision Framework
%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#2C3E50', 'primaryTextColor': '#fff', 'primaryBorderColor': '#16A085', 'lineColor': '#16A085', 'secondaryColor': '#E67E22', 'tertiaryColor': '#7F8C8D'}}}%%
graph TD
Start[IoT Data Decision] --> Q1{Data Rate?}
Q1 -->|> 1 MB/sec<br/>per device| Edge[Process at Edge]
Q1 -->|< 1 MB/sec| Q2{Latency Requirement?}
Q2 -->|< 100ms<br/>Real-time critical| Edge
Q2 -->|100ms - 1s| Fog[Process at Fog]
Q2 -->|> 1 second| Q3{Bandwidth Available?}
Q3 -->|< 10 Mbps<br/>Limited| Fog
Q3 -->|> 100 Mbps<br/>Good connectivity| Cloud[Process in Cloud]
Edge --> EdgeDetails[Edge Processing<br/>FFT, filtering<br/>Anomaly detection<br/>Send summaries only<br/>Cost: $5-50/device]
Fog --> FogDetails[Fog Processing<br/>Aggregate 10-100 devices<br/>Local analytics<br/>Reduce bandwidth 90%<br/>Cost: $200-2000/gateway]
Cloud --> CloudDetails[Cloud Processing<br/>ML training<br/>Historical analytics<br/>Unlimited compute<br/>Cost: $0.02/compute hour]
style Start fill:#2C3E50,stroke:#16A085,color:#fff
style Q1 fill:#16A085,stroke:#2C3E50,color:#fff
style Q2 fill:#16A085,stroke:#2C3E50,color:#fff
style Q3 fill:#16A085,stroke:#2C3E50,color:#fff
style Edge fill:#E67E22,stroke:#2C3E50,color:#fff
style Fog fill:#E67E22,stroke:#2C3E50,color:#fff
style Cloud fill:#E67E22,stroke:#2C3E50,color:#fff
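The flowchart can also be read as a simple function. The sketch below mirrors its thresholds (1 MB/sec per device, 100 ms and 1 s latency bands, 10 Mbps uplink); treat the cutoffs as illustrative defaults rather than hard rules:

```python
def placement(data_rate_mb_s: float, latency_ms: float, uplink_mbps: float) -> str:
    """Rough mirror of the decision flowchart above."""
    if data_rate_mb_s > 1.0:      # > 1 MB/sec per device: too much to ship raw
        return "edge"
    if latency_ms < 100:          # real-time critical
        return "edge"
    if latency_ms < 1_000:        # sub-second, but not safety-critical
        return "fog"
    if uplink_mbps < 10:          # slow or metered link
        return "fog"
    return "cloud"

assert placement(0.04, 50, 100) == "edge"        # vibration sensor: real-time shutdown
assert placement(60.0, 500, 100) == "edge"       # raw camera stream: data rate too high
assert placement(0.0001, 5_000, 100) == "cloud"  # smart-home telemetry
```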
1258.3.1 Specific Numbers for Common IoT Scenarios
| Scenario | Devices | Data Rate | Recommended Stack | Monthly Cost | Key Metric |
|---|---|---|---|---|---|
| Smart Home | 10-50 | 1 KB/min per device | Cloud only (AWS IoT) | $5-20 | Simplicity |
| Building Automation | 100-1,000 | 100 bytes/min | Fog + Cloud (InfluxDB) | $50-200 | 10x storage compression |
| Smart City | 10,000-100,000 | 1 KB/min | Edge + Cloud (Kafka + S3) | $2,000-10,000 | 99% bandwidth reduction |
| Industrial Monitoring | 1,000-10,000 | 10 KB/sec (high-freq) | Edge + Time-Series DB | $500-5,000 | <100ms latency |
| Connected Vehicles | 1M+ vehicles | 1 MB/hour per vehicle | Batch processing (Hadoop) | $20,000+ | Petabyte-scale analytics |
1258.3.2 Edge Processing Decision: Specific Bandwidth Numbers
Use Edge Processing When:
High-Frequency Sensors (saves 99%+ bandwidth):
Example: Vibration sensor at 10,000 Hz
Raw: 10,000 samples/sec x 4 bytes = 40 KB/sec = 320 kbps
Edge FFT: 1 summary/sec x 400 bytes = 400 bytes/sec = 3.2 kbps
Reduction: 99% bandwidth saved
Video Processing (saves 99.9%+ bandwidth):
Example: Security camera at 1080p 30fps
Raw: 30 frames/sec x 2 MB/frame = 60 MB/sec = 480 Mbps
Edge object detection: 1 event/min x 1 KB = 16 bytes/sec = 128 bps
Reduction: 99.9999% bandwidth saved
Limited Connectivity (cellular/satellite):
Cellular data: $10/GB typical
Raw streaming: 40 MB/sec x 86,400 sec/day = 3.46 TB/day = $34,600/day
Edge summarization: 400 KB/day = $0.004/day
Savings: $34,600/day to $0.004/day (99.9999% cost reduction)
Skip Edge Processing When:
Low-Rate Sensors (edge overhead not worth it):
Example: Temperature sensor at 1 reading/min
Data: 1 sample/min x 50 bytes = 50 bytes/min = 72 KB/day
Cost to cloud: $0.00072/day (negligible)
Edge device cost: $20 one-time + $2/year power
Not worth edge processing for such low data rates
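A quick back-of-the-envelope check for this decision, using the cellular transfer rate and device costs assumed in this section (the helper is illustrative, not a standard formula):

```python
def worth_adding_edge(raw_bytes_per_day: float,
                      usd_per_gb_transfer: float = 10.0,   # cellular rate assumed above
                      edge_device_usd: float = 20.0,
                      edge_power_usd_per_year: float = 2.0) -> bool:
    """True if one year of raw-data transfer costs more than a year of edge hardware."""
    cloud_cost_per_year = raw_bytes_per_day / 1e9 * usd_per_gb_transfer * 365
    edge_cost_first_year = edge_device_usd + edge_power_usd_per_year
    return cloud_cost_per_year > edge_cost_first_year

print(worth_adding_edge(72_000))    # temperature sensor, 72 KB/day -> False (skip edge)
print(worth_adding_edge(3.46e12))   # 40 MB/sec vibration stream -> True (use edge)
```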
1258.4 Cloud-Only vs Edge Processing Economics
%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#2C3E50', 'primaryTextColor': '#fff', 'primaryBorderColor': '#16A085', 'lineColor': '#16A085', 'secondaryColor': '#E67E22', 'tertiaryColor': '#ecf0f1'}}}%%
flowchart TB
subgraph CloudOnly["Cloud-Only (Expensive)"]
C1[Camera:<br/>5 MB Image] --> C2[Upload to Cloud<br/>432 GB/day]
C2 --> C3[Cloud Processing<br/>500ms latency]
C3 --> C4[Cost: $1,166/month<br/>High latency<br/>High bandwidth]
end
subgraph EdgeProcessing["Edge Processing (Efficient)"]
E1[Camera:<br/>5 MB Image] --> E2[Local AI Model<br/>100ms]
E2 -->|Person?| E3{Detection}
E3 -->|Yes| E4[Send Alert<br/>100 bytes]
E3 -->|No| E5[Discard Image]
E4 & E5 --> E6[Cost: $19.50/month<br/>100ms latency<br/>7.2 GB/day<br/>98% savings]
end
style C4 fill:#E74C3C,stroke:#2C3E50,color:#fff
style E6 fill:#27AE60,stroke:#2C3E50,color:#fff
Cloud vs Edge Processing Economics: Sending 5 MB raw images to cloud costs $1,166/month with 500ms latency. Edge processing with local AI model sends only 100-byte alerts, reducing costs to $19.50/month (98% savings) and latency to 100ms (5x faster).
The Cost (Cloud-Only):
- Bandwidth: 5 MB/second x 86,400 seconds/day = 432 GB/day
- Cloud ingress (AWS): 432 GB/day x $0.09/GB = $38.88/day = $1,166/month
- Processing latency: upload time + cloud processing = 500ms+ (too slow for real-time)
The Fix: Edge processing - process at the device
| Approach | Data Sent to Cloud | Bandwidth/Day | Monthly Cost | Latency | Savings |
|---|---|---|---|---|---|
| Cloud-only | 5 MB raw images | 432 GB | $1,166 | 500ms | Baseline |
| Edge processing | 100-byte alerts (only when person detected) | 7.2 GB | $19.50 | 100ms | 98% cost + 5x faster |
The Rule: > “Process data as close to the source as possible. Only send insights to cloud, not raw data.”
1258.5 Common Pitfall: Processing All Data in the Cloud
The Mistake: Sending all raw sensor data to the cloud without any edge processing
What Goes Wrong: the same penalties quantified in Section 1258.4: 432 GB/day of uploads, $1,166/month in data transfer charges, and 500ms+ end-to-end latency (far too slow for real-time response)
The Rule: > “Process data as close to the source as possible. Only send insights to cloud, not raw data.”
1258.6 Summary
- The 90/10 rule states that 90% of IoT data can be filtered or aggregated at the edge, sending only 10% (summaries and alerts) to the cloud.
- Edge processing reduces costs by 98-99% by eliminating bandwidth charges, storage costs, and cloud compute expenses for redundant data.
- Latency improves from 500ms to <100ms when processing happens locally at the edge instead of round-tripping to cloud.
- Decision framework: Use edge for high-frequency sensors (>100 Hz), real-time requirements (<100ms), or limited bandwidth. Use cloud for ML training and historical analytics.
1258.7 What’s Next
Now that you understand edge processing strategies, continue to:
- Big Data Technologies - Learn specific technologies like Hadoop, Spark, and Kafka
- Big Data Pipelines - Design stream and batch processing architectures
- Big Data Overview - Return to the chapter index