Scenario: Geotab, a fleet management company, processes telematics data from 50,000 commercial vehicles (delivery trucks, buses, service vans) across the UK. Each vehicle has an OBD-II dongle reporting GPS, speed, fuel level, engine RPM, and diagnostic trouble codes. The pipeline must detect speeding violations in real-time (for regulatory compliance) and compute daily fuel efficiency reports.
Given:
- 50,000 vehicles, each reporting every 2 seconds while engine is running
- Average 10 hours/day engine-on per vehicle = 18,000 readings/vehicle/day
- Peak hours (7-9am, 4-6pm): 80% of fleet active = 40,000 vehicles x 0.5 msg/sec = 20,000 messages/second
- Message size: 180 bytes (JSON: vehicle ID, timestamp, lat/lon, speed km/h, fuel %, RPM, DTC codes)
- Peak data rate: 20,000 x 180 bytes = 3.6 MB/second = 28.8 Mbps
- Speeding alert latency: <5 seconds from violation to fleet manager notification
- Daily report: Fuel efficiency (km/L) per vehicle, idle time, distance traveled
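The figures above follow from a few lines of arithmetic; a quick sketch (constants taken directly from the givens):

```python
# Back-of-envelope check of the stated throughput figures.
SECONDS_PER_READING = 2        # each vehicle reports every 2 s while running
ENGINE_HOURS_PER_DAY = 10      # average engine-on time per vehicle
FLEET_SIZE = 50_000
PEAK_ACTIVE_FRACTION = 0.8     # 80% of fleet active at peak
MESSAGE_BYTES = 180

readings_per_vehicle_per_day = ENGINE_HOURS_PER_DAY * 3600 // SECONDS_PER_READING
print(readings_per_vehicle_per_day)   # 18000

peak_vehicles = int(FLEET_SIZE * PEAK_ACTIVE_FRACTION)      # 40,000
peak_msgs_per_sec = peak_vehicles // SECONDS_PER_READING    # 20,000 msg/s
peak_bytes_per_sec = peak_msgs_per_sec * MESSAGE_BYTES      # 3.6 MB/s
peak_mbps = peak_bytes_per_sec * 8 / 1_000_000              # 28.8 Mbps
```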
Step 1 – Design the ingestion layer:
| Component | Technology | Configuration | Rationale |
|---|---|---|---|
| Protocol bridge | MQTT broker (EMQX) | 50,000 persistent connections, TLS | Vehicles use MQTT over cellular |
| Message queue | Apache Kafka (3 brokers) | 12 partitions, keyed by vehicle_id | Ordered per-vehicle processing, 7-day retention |
| Schema registry | Confluent Schema Registry | Avro schema with backward compatibility | Validate incoming messages, handle schema evolution |
Kafka partitioning strategy: partition = hash(vehicle_id) % 12. This ensures all messages from one vehicle go to the same partition, maintaining temporal ordering per vehicle (critical for speed calculations).
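A minimal sketch of the key-to-partition mapping. Kafka's default partitioner hashes the key bytes with murmur2; `zlib.crc32` is used here only as a deterministic stand-in (Python's built-in `hash` is salted per process, so it would not illustrate a stable mapping):

```python
import zlib

NUM_PARTITIONS = 12

def partition_for(vehicle_id: str) -> int:
    # Deterministic stand-in for Kafka's default key partitioner:
    # the same vehicle_id always maps to the same partition, so
    # per-vehicle message ordering is preserved within that partition.
    return zlib.crc32(vehicle_id.encode("utf-8")) % NUM_PARTITIONS

# Every message from one vehicle lands on one partition.
assert partition_for("VEH-1042") == partition_for("VEH-1042")
assert 0 <= partition_for("VEH-1042") < NUM_PARTITIONS
```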
Step 2 – Design the processing layer:
Two Flink jobs running in parallel:
Job 1: Real-Time Speeding Detection (latency-critical)
- Input: Raw Kafka topic
- Processing: Compare speed to road speed limit (enriched from HERE Maps API, cached locally)
- Window: Event-time tumbling window of 5 seconds per vehicle
- Rule: If average speed in window exceeds limit by >10%, emit alert
- Output: PagerDuty webhook for fleet managers, speeding_alerts Kafka topic
- Latency budget: 2s Kafka + 1s Flink + 1s PagerDuty = 4 seconds end-to-end
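The per-window rule for Job 1 can be sketched as a plain function (the real job would run this inside a Flink 5-second event-time window; the function name and signature here are illustrative, not Flink API):

```python
def speeding_alert(speeds_kmh: list[float], limit_kmh: float,
                   tolerance: float = 0.10) -> bool:
    """Job 1 window rule: alert if the 5-second average speed
    exceeds the road speed limit by more than `tolerance` (10%)."""
    if not speeds_kmh:
        return False
    avg = sum(speeds_kmh) / len(speeds_kmh)
    return avg > limit_kmh * (1 + tolerance)

# 70 km/h limit -> alert threshold is 77 km/h average over the window.
assert speeding_alert([78, 80, 82], 70) is True    # avg 80 > 77
assert speeding_alert([74, 76, 75], 70) is False   # avg 75 <= 77
```

Averaging over the window rather than alerting on single readings suppresses spurious alerts from GPS speed jitter.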
Job 2: Daily Fuel Efficiency Aggregation (accuracy-critical)
- Input: Same raw Kafka topic (consumer group isolation)
- Processing: Per-vehicle daily aggregation
- Windows: Event-time tumbling window of 24 hours, with watermark tolerance of 30 minutes (late cellular reports)
- Computations per vehicle per day:
- Total distance: sum of haversine distances between consecutive GPS points
- Total fuel consumed: delta between first and last fuel level readings, adjusted for refueling events
- Idle time: number of readings where RPM > 600 and speed = 0, multiplied by the 2-second reporting interval
- Fuel efficiency: distance / fuel consumed (km/L)
- Output: PostgreSQL daily_reports table, S3 Parquet for data lake
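The per-vehicle daily computations can be sketched as below. This is an illustrative reduction over one day's time-ordered readings, not the Flink job itself; the refuelling adjustment mentioned above is omitted for brevity, and field names (`lat`, `lon`, `speed`, `rpm`, `fuel_l`) are assumptions:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two GPS fixes, in km."""
    R = 6371.0  # mean Earth radius, km
    dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * R * asin(sqrt(a))

def daily_metrics(readings: list[dict], interval_s: int = 2):
    """readings: one vehicle's time-ordered dicts for one day.
    Returns (distance_km, fuel_used_l, idle_seconds, km_per_l).
    Refuelling-event adjustment omitted for brevity."""
    # Total distance: sum of haversine distances between consecutive fixes.
    distance = sum(
        haversine_km(a["lat"], a["lon"], b["lat"], b["lon"])
        for a, b in zip(readings, readings[1:])
    )
    # Fuel consumed: first minus last fuel level reading.
    fuel_used = readings[0]["fuel_l"] - readings[-1]["fuel_l"]
    # Idle time: engine running (RPM > 600) while stationary.
    idle_s = interval_s * sum(1 for r in readings if r["rpm"] > 600 and r["speed"] == 0)
    km_per_l = distance / fuel_used if fuel_used > 0 else None
    return distance, fuel_used, idle_s, km_per_l
```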
Step 3 – Calculate infrastructure costs:
| Component | Sizing | Monthly cost |
|---|---|---|
| EMQX broker (3 nodes) | c5.2xlarge x 3 | GBP 780 |
| Kafka cluster (3 brokers) | r5.xlarge x 3, 2 TB EBS each | GBP 1,140 |
| Flink cluster (Job 1) | c5.xlarge x 4 (parallelism=16) | GBP 520 |
| Flink cluster (Job 2) | r5.large x 2 (stateful, checkpointing) | GBP 260 |
| PostgreSQL RDS | db.r5.large, 500 GB | GBP 340 |
| S3 data lake | ~4.9 TB/month (180 B x 18K x 50K x 30) | GBP 113 |
| **Total monthly** | | **GBP 3,153** |
Per-vehicle cost: GBP 3,153 / 50,000 = GBP 0.063/vehicle/month (6.3 pence per vehicle per month for the entire streaming infrastructure).
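The total and the per-vehicle figure check out (all values in GBP/month, straight from the table above):

```python
# Sanity check on the cost table and the per-vehicle figure.
costs = {
    "EMQX (3 nodes)": 780,
    "Kafka (3 brokers)": 1_140,
    "Flink Job 1": 520,
    "Flink Job 2": 260,
    "PostgreSQL RDS": 340,
    "S3 data lake": 113,
}
total = sum(costs.values())                    # GBP 3,153/month
pence_per_vehicle = total / 50_000 * 100       # ~6.3p per vehicle
print(total, round(pence_per_vehicle, 1))      # 3153 6.3
```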
Result: The pipeline processes 20,000 messages/second at peak with 4-second end-to-end latency for speeding alerts. Daily fuel reports are available by 01:00 the following day; the 30-minute watermark tolerance captures 99.7% of late-arriving cellular messages. Infrastructure cost of GBP 0.063/vehicle/month is negligible compared to the GBP 45/vehicle/month subscription fee.
Key Insight: Separating the speeding detection (latency-optimized, 5-second windows, small state) from fuel reporting (accuracy-optimized, 24-hour windows, 30-minute watermark) into two independent Flink jobs is essential. A single job trying to serve both requirements would force compromises in either latency (waiting for late data) or accuracy (missing late reports). Independent jobs can be scaled, monitored, and upgraded independently.