10  Data Format Selection Guide

In 60 Seconds

Use the format decision tree to choose between JSON, CBOR, Protobuf, and custom binary. If bandwidth is not constrained (Wi-Fi/LTE), use JSON. If constrained (LoRa/Sigfox), evaluate volume and byte limits to select CBOR, Protobuf, or custom binary. Always consider total cost of ownership – not just payload size.

10.1 Learning Objectives

By the end of this chapter, you will be able to:

  • Apply the format decision tree: Systematically evaluate constraints to select the right format for a given IoT scenario
  • Calculate total cost of ownership: Quantify bandwidth, battery, development, and maintenance costs for competing format choices
  • Classify IoT applications by format suitability: Map real-world use cases to JSON, CBOR, Protobuf, or custom binary based on network and power constraints
  • Design format migration strategies: Architect a phased transition from JSON to binary formats while maintaining backward compatibility
  • Justify format decisions to stakeholders: Present trade-off analyses with concrete data on cost, battery life, and schema evolution

Key Concepts:

  • JSON: JavaScript Object Notation - human-readable, universally supported, 2-5× size overhead versus binary formats
  • CBOR: Binary self-describing format - best choice when schema distribution is impractical and size matters
  • Protocol Buffers: Schema-enforced binary format - optimal for high-throughput pipelines where schema management is acceptable
  • Schema Registry: Central repository storing format versions - prevents incompatibility between producers and consumers
  • Compression: gzip/zstd applied over any format can reduce size by a further 60-80% on large, text-heavy payloads; messages of only a few dozen bytes often gain little or even grow because of the compression header
  • Self-Description: Format carrying its own type information (JSON, CBOR) versus requiring an external schema (Protobuf, Avro)
  • Parsing Overhead: CPU cost to deserialize, from highest to lowest: JSON (slow, text parsing) > CBOR (fast, binary) > Protobuf (fastest, compiled)

10.2 For Beginners: Data Format Selection

Choosing a data format is like deciding how to pack a suitcase. JSON is like packing in clear labeled bags – easy to see what is inside, but takes more space. Binary formats like CBOR are like vacuum-sealing everything – much smaller, but harder to inspect. For IoT devices with limited battery and slow connections, picking the right format can mean the difference between a sensor lasting one year or five years.

This is part of a series on IoT Data Formats:

  1. IoT Data Formats Overview - Introduction and text formats
  2. Binary Data Formats - CBOR, Protobuf, custom binary
  3. Data Format Selection (this chapter) - Decision guides and real-world examples
  4. Data Formats Practice - Scenarios, quizzes, worked examples

Related Decision Tools:

10.3 Prerequisites

Before starting this chapter, you should be familiar with:


10.4 Format Selection Decision Tree

Flowchart decision tree starting with navy Start box asking Choose Data Format. First decision (gray): Bandwidth Constrained? (LoRa, Sigfox, NB-IoT). No path (Wi-Fi, Ethernet, LTE) leads to orange JSON box (Simple, Universal, Best for Wi-Fi/LTE). Yes path (LoRa, Sigfox) leads to second decision: Need Human Readability? (Debugging, Dev). Yes path leads to orange JSON or CBOR with tools box. No path leads to third decision: High Volume? (>1000 devices, >1 msg/min). No path (Small deployment) leads to teal CBOR box (Balance efficiency and flexibility). Yes path (>1000 devices) leads to fourth decision: Every Byte Critical? (e.g., 12 byte limit). No path leads to teal Protobuf or CBOR box (Typed, efficient pipelines). Yes path (Sigfox, ultra-LP) leads to navy Custom Binary box (Ultimate efficiency). Color coding: Navy for start/custom binary, Gray for decision points, Orange for JSON options, Teal for CBOR/Protobuf options.

Figure 10.1: Format Selection Decision Tree: Interactive flowchart for choosing the right IoT data format based on network constraints and application requirements. The tree starts by evaluating bandwidth constraints (Wi-Fi/LTE vs LoRa/Sigfox). High-bandwidth networks can use JSON for simplicity. Constrained networks require further evaluation: human readability needs (debugging vs production), message volume (small deployment vs high-scale fleet), and byte-level criticality (standard protocols vs ultra-constrained like Sigfox’s 12-byte limit). Each path leads to the optimal format choice balancing efficiency, maintainability, and ecosystem support.

10.4.1 Step-by-Step Decision Process

Question 1: Is bandwidth severely constrained?

  • No (Wi-Fi, Ethernet, LTE): Use JSON for simplicity
  • Yes (LoRa, Sigfox, NB-IoT): Continue to Q2

Question 2: Do you need human readability for debugging?

  • Yes: Use JSON or CBOR with tooling
  • No: Continue to Q3

Question 3: Do you have many devices with high message volume?

  • Yes (>1000 devices, >1 msg/min): Continue to Q4
  • No (small deployment): Use CBOR for balance

Question 4: Is every byte critical? (Sigfox: 12 bytes max)

  • Yes: Custom binary format
  • No: Protobuf or CBOR
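
The four questions can be sketched as a small Python function. This is a minimal illustration that follows the flowchart in Figure 10.1; the function name `choose_format` and its boolean parameters are assumptions made for this example, not part of any library.

```python
def choose_format(bandwidth_constrained: bool,
                  need_readability: bool = False,
                  high_volume: bool = False,
                  every_byte_critical: bool = False) -> str:
    """Walk the four questions in order and return the suggested format."""
    if not bandwidth_constrained:      # Q1: Wi-Fi, Ethernet, LTE
        return "JSON"
    if need_readability:               # Q2: debugging / development
        return "JSON or CBOR with tooling"
    if not high_volume:                # Q3: small deployment
        return "CBOR"
    if every_byte_critical:            # Q4: e.g. Sigfox 12-byte limit
        return "Custom binary"
    return "Protobuf or CBOR"


print(choose_format(bandwidth_constrained=False))
print(choose_format(True, high_volume=True, every_byte_critical=True))
```

Because each question returns early, later questions are only reached on the constrained branches, which mirrors the order of the tree.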

10.4.2 Interactive Format Cost Calculator

Try this calculator to compare data format costs for your own IoT deployment:
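
A rough offline version of the same comparison can be sketched in Python. It assumes decimal megabytes (10^6 bytes) and a flat per-MB price; the function name `compare_formats` and the sample inputs are illustrative only, so substitute your own device count, message rate, payload sizes, and tariff.

```python
def compare_formats(devices, msgs_per_day, price_per_mb, payload_bytes):
    """Return {format: (fleet MB per year, fleet cost per year)}."""
    results = {}
    for name, size in payload_bytes.items():
        fleet_mb = devices * msgs_per_day * 365 * size / 1e6
        results[name] = (fleet_mb, fleet_mb * price_per_mb)
    return results


# Hypothetical deployment: 1,000 devices, hourly messages, $0.50 per MB.
sizes = {"JSON": 78, "CBOR": 42, "Custom": 13}
for name, (mb, usd) in compare_formats(1000, 24, 0.50, sizes).items():
    print(f"{name:7s} {mb:8.2f} MB/year  ${usd:8.2f}/year")
```

Data volume is only one input to total cost of ownership; development and maintenance effort still have to be weighed separately.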


10.5 Real-World Application Map

Grid diagram showing 12 real-world IoT applications organized into 4 format categories. JSON (orange boxes): Smart Home with Wi-Fi thermostat 1 msg/min, Building HVAC with Ethernet sensors and debug-friendly, Mobile App with REST APIs and rapid prototyping. CBOR (teal boxes): Agriculture with LoRaWAN soil sensors 6 msgs/day, Fleet GPS with NB-IoT tracking 100 vehicles, Smart Grid with meters moderate volume schema flexibility. Protobuf (teal boxes): Industrial IoT with factory floor 10,000 sensors, Cloud Pipeline with gRPC microservices typed contracts, Analytics with high-volume streams ML pipelines. Custom Binary (navy boxes): Sigfox with 12-byte limit parking sensors, Wearables with BLE battery life fitness trackers, Satellite with extreme cost per byte remote monitoring. Each application shows icon, use case name, network type, and key constraint.

Figure 10.2: Real-World Application Map. Instead of abstract decision criteria, this diagram shows concrete IoT applications grouped by their ideal data format. Smart home devices with Wi-Fi use JSON for easy debugging. Agricultural sensors on LoRaWAN benefit from CBOR’s balance. Industrial IoT at scale demands Protobuf’s efficiency. Only ultra-constrained scenarios like Sigfox (12-byte limit) or satellite links justify custom binary. Students can identify their project type and immediately see the recommended format.

10.5.1 Application-Format Reference Table

| Application | Protocol | Format | Rationale |
|---|---|---|---|
| Smart home thermostat | Wi-Fi + MQTT | JSON | Bandwidth plentiful, debugging important |
| Agricultural soil sensor | LoRaWAN | CBOR | Bandwidth limited, need flexibility |
| City parking sensor | Sigfox | Custom binary | 12-byte limit, fixed message structure |
| Industrial gateway | Ethernet + gRPC | Protobuf | High volume, strong typing needed |
| Wearable fitness tracker | BLE | Custom binary | Power-sensitive, fixed data structure |


10.6 Learning Scenario: Soil Moisture Network

Your Challenge: You’re designing a precision agriculture system for a large vineyard. You need to monitor soil moisture to optimize irrigation and prevent crop damage.

System Requirements:

  • 100 battery-powered sensor nodes placed throughout vineyard
  • Cellular backhaul (NB-IoT): $0.01/KB data cost, 250 KB/month included per SIM
  • Readings every 15 minutes (96 readings/day)
  • Battery life target: 5 years on 2x AA batteries
  • Sensors measure: Soil moisture (0-100%), temperature (-20 to 60C), battery voltage (2.0-3.6V), GPS location (once/day)

Your Mission: Choose the optimal data format and calculate the real costs.


10.6.1 Step 1: Calculate Message Volume

Think: How many messages per year per node?

  • Readings per day: 96 (every 15 minutes)
  • Days per year: 365
  • Total messages/year/node: 96 x 365 = 35,040 messages
  • Fleet total: 35,040 x 100 nodes = 3,504,000 messages/year

10.6.2 Step 2: Compare Format Options

You’re considering three formats. Let’s analyze each:

10.6.2.1 Option A: JSON (Readable)

{"id":"V001","moist":45.2,"temp":18.5,"batt":3.1,"lat":38.5,"lng":-122.4}

  • Size: about 78 bytes per message with full-precision coordinates (the shortened sample above is 73 bytes)
  • Pros: Easy debugging, universal tools, cloud-friendly
  • Cons: Largest size, more battery drain for transmission

10.6.2.2 Option B: CBOR (Balanced)

Binary encoding with the same structure:

  • Size: 42 bytes per message (46% smaller than JSON)
  • Pros: ~46% bandwidth savings, standard format, easier debugging than custom binary
  • Cons: Smaller ecosystem than JSON, requires a CBOR library

10.6.2.3 Option C: Custom Binary (Optimized)

[2 bytes id] [1 byte moist x 2] [1 byte temp+20] [1 byte (batt-2.0) x 100] [4 bytes lat] [4 bytes lng]
Total: 13 bytes
  • Size: 13 bytes per message (83% smaller than JSON)
  • Pros: Smallest size, lowest power consumption
  • Cons: No tooling, rigid schema, difficult debugging
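
As a minimal sketch, the 13-byte layout can be packed with Python's standard `struct` module. Two assumptions are made here: battery voltage is stored as an offset from 2.0 V so that the full 2.0-3.6 V range fits in one byte, and latitude/longitude are stored as 32-bit floats. The names `LAYOUT`, `encode`, and `decode` are illustrative.

```python
import struct

# Little-endian, no padding: uint16 id, three uint8 fields, two float32 = 13 bytes.
LAYOUT = struct.Struct("<HBBBff")


def encode(node_id, moist_pct, temp_c, batt_v, lat, lng):
    return LAYOUT.pack(
        node_id,
        round(moist_pct * 2),          # 0-100 % in 0.5 % steps -> 0..200
        round(temp_c + 20),            # -20..60 C in 1 C steps -> 0..80
        round((batt_v - 2.0) * 100),   # 2.0-3.6 V in 10 mV steps -> 0..160 (assumed offset)
        lat,
        lng,
    )


def decode(payload):
    node_id, moist, temp, batt, lat, lng = LAYOUT.unpack(payload)
    return {"id": node_id, "moist": moist / 2, "temp": temp - 20,
            "batt": 2.0 + batt / 100, "lat": lat, "lng": lng}


message = encode(1, 45.2, 18.5, 3.1, 38.5, -122.4)
print(len(message), decode(message))
```

The decoded values are quantized (moisture to 0.5%, temperature to 1 degree), which is the price of the small size. Any change to the field list changes `LAYOUT` on both ends, which is the rigidity noted in the cons above.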

10.6.3 Step 3: Calculate Annual Data Usage

Calculate: Data usage per node per year for each format.

Per node per year:

  • JSON: 35,040 messages x 78 bytes = 2.73 MB/year
  • CBOR: 35,040 messages x 42 bytes = 1.47 MB/year
  • Custom: 35,040 messages x 13 bytes = 0.46 MB/year

Fleet total (100 nodes):

  • JSON: 2.73 MB x 100 = 273 MB/year
  • CBOR: 1.47 MB x 100 = 147 MB/year
  • Custom: 0.46 MB x 100 = 46 MB/year

10.6.4 Step 4: Cost Analysis

Data plan: $0.01/KB, 250 KB/month included per SIM

Calculate: Will you exceed the included data allowance? What are the overage costs?

Included data per node per year: 250 KB/month x 12 months = 3 MB/year (treating the monthly allowance as pooled over the year)

Overage analysis:

| Format | Usage/year | Included | Overage | Cost/node/year | Fleet cost/year |
|---|---|---|---|---|---|
| JSON | 2.73 MB | 3 MB | 0 MB | $0 | $0 |
| CBOR | 1.47 MB | 3 MB | 0 MB | $0 | $0 |
| Custom | 0.46 MB | 3 MB | 0 MB | $0 | $0 |

Surprising result: All formats fit within the included 3 MB/year allowance! No overage costs.

BUT WAIT - what about peak months? (summer = more frequent irrigation adjustments)

Summer scenario (June-August): Increase to every 5 minutes = 288 msgs/day x 90 days = 25,920 messages

Summer data usage (3 months):

  • JSON: 25,920 x 78 bytes = 2.02 MB (67% of the annual allowance in 3 months!)
  • CBOR: 25,920 x 42 bytes = 1.09 MB (36% of allowance)
  • Custom: 25,920 x 13 bytes = 0.34 MB (11% of allowance)

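Annual with summer spike (9 months of 30 days @ 15 min + 90 days @ 5 min = 25,920 + 25,920 = 51,840 messages):

  • JSON: 51,840 x 78 bytes = 4.04 MB/year - exceeds 3 MB! Overage of 1.04 MB (1,040 KB) x $0.01/KB = $10.40/node = $1,040/year fleet
  • CBOR: 51,840 x 42 bytes = 2.18 MB/year - under limit
  • Custom: 51,840 x 13 bytes = 0.67 MB/year - under limit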
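
The summer-spike arithmetic can be checked with a short Python sketch. It assumes the monthly allowance is pooled over the year, a 360-day year (nine 30-day months plus 90 summer days), and decimal units (1 KB = 1,000 bytes); the function names are illustrative. Because it does not round usage to 4.04 MB first, the JSON overage comes out a few cents above $10.40.

```python
ALLOWANCE_BYTES = 250_000 * 12   # 250 KB/month, pooled over 12 months (assumption)
PRICE_PER_KB = 0.01              # USD per KB of overage


def annual_bytes(payload_size, normal_days=270, summer_days=90):
    """Bytes per node per year: 96 msgs/day normally, 288 msgs/day in summer."""
    messages = normal_days * 96 + summer_days * 288
    return messages * payload_size


def overage_cost(payload_size):
    """USD of overage per node per year (zero when under the allowance)."""
    excess = max(0, annual_bytes(payload_size) - ALLOWANCE_BYTES)
    return excess / 1000 * PRICE_PER_KB


for name, size in {"JSON": 78, "CBOR": 42, "Custom": 13}.items():
    print(f"{name}: {annual_bytes(size) / 1e6:.2f} MB/year, "
          f"${overage_cost(size):.2f} overage per node")
```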

10.6.5 Step 5: Battery Life Impact

Radio power consumption (NB-IoT):

  • Transmit power: 23 dBm (200 mW)
  • Transmission time: ~50 ms/byte (including protocol overhead)
  • Energy per byte: 200 mW x 50 ms = 10 mJ/byte

Calculate: How much battery energy does each format consume?

Energy per message (transmission only):

  • JSON: 78 bytes x 10 mJ/byte = 780 mJ
  • CBOR: 42 bytes x 10 mJ/byte = 420 mJ
  • Custom: 13 bytes x 10 mJ/byte = 130 mJ

Annual energy (35,040 messages/year):

  • JSON: 35,040 x 780 mJ = 27.3 kJ/year = 7.6 Wh/year
  • CBOR: 35,040 x 420 mJ = 14.7 kJ/year = 4.1 Wh/year
  • Custom: 35,040 x 130 mJ = 4.6 kJ/year = 1.3 Wh/year
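
The unit conversions above (mJ to J to Wh) are easy to get wrong by hand, so here is a minimal Python sketch of the same calculation. It uses only the figures stated in this step; the constant and function names are illustrative.

```python
ENERGY_PER_BYTE_MJ = 200 * 0.050   # 200 mW x 50 ms = 10 mJ per byte
MESSAGES_PER_YEAR = 96 * 365       # 35,040 messages at one per 15 minutes


def annual_energy_wh(payload_size):
    """Transmission energy per node per year, in watt-hours."""
    millijoules = payload_size * ENERGY_PER_BYTE_MJ * MESSAGES_PER_YEAR
    return millijoules / 1000 / 3600   # mJ -> J -> Wh


for name, size in {"JSON": 78, "CBOR": 42, "Custom": 13}.items():
    print(f"{name}: {annual_energy_wh(size):.1f} Wh/year")
```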

Battery capacity: 2x AA batteries = 2 x 2500 mAh x 3V = 15 Wh total

Transmission as % of battery (assuming other circuitry uses 50% of battery):

  • JSON: 7.6 Wh / 7.5 Wh available = 101% of available budget - 4.9 year battery life
  • CBOR: 4.1 Wh / 7.5 Wh = 55% of budget - 5+ year target achieved
  • Custom: 1.3 Wh / 7.5 Wh = 17% of budget - 5+ year target easily met

Verdict: JSON fails the 5-year battery life requirement!


10.6.6 Step 6: Final Recommendation

Compare all factors:

| Factor | JSON | CBOR | Custom Binary |
|---|---|---|---|
| Summer data cost | $1,040/year overage | $0 | $0 |
| Battery life | 4.9 years (misses target) | 5+ years | 5+ years |
| Debugging ease | Excellent (text editor) | Requires CBOR tools | Custom parser needed |
| Schema evolution | Easy (add fields) | Moderate (CBOR flexible) | Rigid (breaking changes) |
| Development time | 1 day (JSON libs everywhere) | 2-3 days (CBOR setup) | 1-2 weeks (custom parser) |
| Maintenance burden | Low (standard format) | Medium (CBOR docs) | High (DIY everything) |
| Total 5-year TCO | $5,200 (overage costs) | $0 (no overage) | $0 (no overage) |

10.6.7 Your Recommendation: CBOR (Option B)

Rationale:

  1. Meets battery life target (5+ years) with 46% size reduction vs JSON
  2. Zero overage costs even with summer spike (about 2.2 MB/year < 3 MB annual allowance)
  3. Flexible schema - Can add new sensor types without breaking existing nodes
  4. Standard format - CBOR libraries exist for embedded C, Python, cloud processing
  5. Reasonable debugging - Tools like cbor2diag convert binary to human-readable
  6. Moderate setup - 2-3 days to integrate CBOR library, but well-documented

Why not custom binary?

  • Custom saves only 1.01 MB/year (1.47 - 0.46 MB) per node vs CBOR
  • Zero cost benefit (both are under data cap)
  • Battery savings: 2.8 Wh/year = 6 extra months of battery life
  • Trade-off: 6 months extra battery vs 2 weeks development + ongoing maintenance burden
  • Verdict: Not worth it unless battery life is absolutely critical

Why not JSON?

  • Fails 5-year battery life target (4.9 years)
  • $1,040/year overage costs during summer spike
  • Total 5-year cost: $5,200 vs $0 for CBOR

10.6.8 Real-World Lesson

Key Insight: Don’t just optimize for bytes - optimize for total cost of ownership (TCO):

  • Data costs
  • Battery replacement costs (labor + materials)
  • Development time costs
  • Maintenance burden costs

In this case, CBOR captures more than half of custom binary’s size reduction (46% vs 83% smaller than JSON) for a fraction of the engineering effort (2-3 days vs 1-2 weeks). The sweet spot!

Bonus: With CBOR, you can add new fields (soil pH, nutrient levels) next season without breaking already-deployed nodes: updated nodes send the extra keys, and the cloud parser handles messages with or without them. With custom binary, that’s a breaking change requiring coordinated firmware updates or protocol versioning.


10.7 Fleet Tracking Quiz

Scenario: You’re building a fleet tracking system for 500 delivery trucks. Each truck sends GPS updates:

  • Data: Latitude (float), Longitude (float), Speed (km/h, 0-120), Heading (degrees, 0-359), Timestamp (Unix epoch)
  • Frequency: Every 30 seconds while moving (8 hours/day average)
  • Network: Cellular (NB-IoT)
  • Data plan: $5/month per truck for 50 MB

You’re considering three formats:

Option A: JSON

{"lat":37.7749,"lng":-122.4194,"spd":45,"hdg":180,"ts":1702834567}

Size: 68 bytes

Option B: CBOR

Binary encoding of same structure

Size: 35 bytes

Option C: Custom Binary

[4 bytes lat] [4 bytes lng] [1 byte spd] [2 bytes hdg] [4 bytes ts]

Size: 15 bytes

Think about:

  1. How many messages per truck per month? (30s intervals, 8 hours/day, 22 workdays)
  2. What’s the monthly data usage for 500 trucks with each format?
  3. Will you exceed the 50 MB/month data plan with any format?
  4. What’s the total annual cost difference between formats?

Key Insights:

Messages per truck per month:

  • Updates: Every 30s for 8 hours/day
  • Per day: (8 hours x 3600s) / 30s = 960 messages/day
  • Per month: 960 x 22 workdays = 21,120 messages/month

Monthly data usage per truck:

  • JSON: 21,120 x 68 bytes = 1.44 MB/month
  • CBOR: 21,120 x 35 bytes = 0.74 MB/month
  • Custom: 21,120 x 15 bytes = 0.32 MB/month

Fleet total (500 trucks):

  • JSON: 1.44 x 500 = 720 MB/month
  • CBOR: 0.74 x 500 = 370 MB/month
  • Custom: 0.32 x 500 = 160 MB/month

Data plan analysis (50 MB/month per truck):

  • JSON: 1.44 MB < 50 MB - Under limit (3% utilization)
  • CBOR: 0.74 MB < 50 MB - Under limit (1.5% utilization)
  • Custom: 0.32 MB < 50 MB - All formats work!

Cost analysis: Since all formats fit within the $5/month plan, costs are identical: $5 x 500 = $2,500/month

BUT - what if you want to upgrade update frequency to every 10 seconds?

10-second updates (3x more frequent):

  • JSON: 4.31 MB/month - Still under 50 MB
  • CBOR: 2.22 MB/month - Still under 50 MB
  • Custom: 0.95 MB/month - Still under 50 MB
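
The per-truck usage at both update intervals can be reproduced with a short Python sketch. It assumes the scenario's 8 driving hours on 22 workdays per month and decimal megabytes; `monthly_mb` is an illustrative name. Computing from the message count directly gives values that can differ from tripled, already-rounded figures by 0.01 MB.

```python
PLAN_MB = 50   # data plan per truck per month


def monthly_mb(payload_size, interval_s, hours_per_day=8, workdays=22):
    """Megabytes sent per truck per month at a given update interval."""
    messages = hours_per_day * 3600 // interval_s * workdays
    return messages * payload_size / 1e6


for interval in (30, 10):
    for name, size in {"JSON": 68, "CBOR": 35, "Custom": 15}.items():
        used = monthly_mb(size, interval)
        print(f"{interval:2d}s {name:6s} {used:5.2f} MB ({used / PLAN_MB:.1%} of plan)")
```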

Best choice: CBOR (Option B)

  • About 49% smaller than JSON (35 vs 68 bytes, bandwidth efficient)
  • Self-describing format (schema flexibility)
  • Standard libraries available
  • Easy debugging with CBOR tools
  • Fits well within data plan even with future growth

Why not custom binary?

  • Saves only 0.42 MB/month per truck (1% of data plan)
  • Loses flexibility for schema changes
  • Harder to debug in production
  • Not worth the maintenance burden for minimal savings

Real-world lesson: Choose formats based on flexibility and maintainability, not just raw size. CBOR captures roughly two-thirds of custom binary’s size reduction (49% vs 78% smaller than JSON) at a fraction of the complexity. In this case, all formats fit comfortably within the data budget, so optimize for developer productivity, not bytes.

For a cellular-connected fleet tracker billed per megabyte (rather than on the flat $5/month plan above), format overhead directly impacts your bill:

\[ \text{Annual Cost} = N_{\text{devices}} \times \frac{M_{\text{msgs/day}} \times 365 \times S_{\text{bytes}}}{10^6} \times C_{\text{\$/MB}} \]

With 500 trucks, 960 messages/day, and $0.10/MB:

  • JSON (68 bytes): (500 × 960 × 365 × 68 / 10⁶) MB × $0.10/MB = 11,914 MB × $0.10 = $1,191/year
  • CBOR (35 bytes): (500 × 960 × 365 × 35 / 10⁶) MB × $0.10/MB = 6,132 MB × $0.10 = $613/year
  • Custom (15 bytes): (500 × 960 × 365 × 15 / 10⁶) MB × $0.10/MB = 2,628 MB × $0.10 = $263/year

The $350/year CBOR→Custom savings seems significant, but set it against an assumed $15K engineering cost for the custom format: breakeven takes about 43 years. For most deployments, CBOR’s sweet spot (half the bandwidth of JSON, standard tooling) wins.
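
The breakeven estimate can be reproduced with a minimal Python sketch of the annual cost formula above. The $15K engineering figure is the text's own assumption, and the function names `annual_cost` and `breakeven_years` are illustrative.

```python
def annual_cost(devices, msgs_per_day, payload_bytes, usd_per_mb):
    """Annual data cost in USD, using decimal megabytes (10^6 bytes)."""
    megabytes = devices * msgs_per_day * 365 * payload_bytes / 1e6
    return megabytes * usd_per_mb


def breakeven_years(extra_engineering_usd, cost_a, cost_b):
    """Years until the yearly saving (cost_a - cost_b) repays the extra effort."""
    return extra_engineering_usd / (cost_a - cost_b)


cbor = annual_cost(500, 960, 35, 0.10)
custom = annual_cost(500, 960, 15, 0.10)
print(f"CBOR ${cbor:.0f}/year, custom ${custom:.0f}/year, "
      f"breakeven {breakeven_years(15_000, cbor, custom):.1f} years")
```

Changing the fleet size or tariff in the call shows how quickly the breakeven shortens at larger scale or higher per-MB prices.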


Use this decision framework to choose the right data format based on your specific constraints:

10.7.1 Step 1: Identify Your Primary Constraint

| Constraint | Go to Step |
|---|---|
| Bandwidth-limited (LoRa, Sigfox, NB-IoT) | Step 2 |
| Battery-critical (10+ year lifespan) | Step 2 |
| Wi-Fi/Ethernet (bandwidth plentiful) | Use JSON (stop here) |
| Rapid prototyping (speed over efficiency) | Use JSON (stop here) |

10.7.2 Step 2: Calculate Your Byte Budget

Total available bytes per message = MTU - Protocol Headers

| Network | MTU (bytes) | Headers (bytes) | Available for Payload |
|---|---|---|---|
| LoRaWAN SF12 | 51 | 13 | 38 bytes |
| Sigfox | 12 | 0 | 12 bytes |
| NB-IoT | 1500 | 48 (IPv6 + UDP) | 1452 bytes |
| BLE (default ATT MTU) | 23 | 3 | 20 bytes |

Example: LoRaWAN → 38 bytes available for your sensor data

10.7.3 Step 3: Evaluate Format Options Against Your Payload

For your specific sensor readings, estimate size with each format:

Example: Temperature (16-bit), Humidity (8-bit), Battery (8-bit), Timestamp (32-bit) = 8 bytes raw

| Format | Size | Fits in 38 bytes? | Efficiency |
|---|---|---|---|
| JSON | ~45 bytes | NO - exceeds budget | N/A |
| CBOR | ~18 bytes | Yes | 44% (8/18) |
| Protobuf | ~12 bytes | Yes | 67% (8/12) |
| Custom binary | 8 bytes | Yes | 100% (8/8) |
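
Steps 2 and 3 can be combined in a short Python sketch: compute the byte budget, then test each estimated size against it. The MTU and header values are copied from the Step 2 table and the format sizes are the estimates from the table above; the dictionary and function names are illustrative.

```python
import struct

NETWORKS = {  # name: (MTU bytes, header bytes), as listed in Step 2
    "LoRaWAN SF12": (51, 13),
    "Sigfox": (12, 0),
    "NB-IoT": (1500, 48),
    "BLE": (23, 3),
}
ESTIMATED_SIZES = {"JSON": 45, "CBOR": 18, "Protobuf": 12, "Custom binary": 8}

# int16 temperature, uint8 humidity, uint8 battery, uint32 timestamp (no padding)
RAW_BYTES = struct.calcsize("<hBBI")


def byte_budget(network):
    mtu, headers = NETWORKS[network]
    return mtu - headers


def evaluate(network):
    """Return {format: (fits in budget, efficiency = raw bytes / encoded bytes)}."""
    budget = byte_budget(network)
    return {name: (size <= budget, RAW_BYTES / size)
            for name, size in ESTIMATED_SIZES.items()}


for name, (fits, efficiency) in evaluate("LoRaWAN SF12").items():
    print(f"{name:14s} fits={fits!s:5s} efficiency={efficiency:.0%}")
```

Running `evaluate("Sigfox")` instead shows only Protobuf and custom binary fitting the 12-byte budget under these estimates.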

10.7.4 Step 4: Apply Decision Rules

| Scenario | Recommended Format | Why |
|---|---|---|
| JSON payload fits within the byte budget | JSON | Simplest, most debuggable |
| Payload doesn’t fit, but deployment is small (<100 devices) | CBOR | Balance of efficiency and flexibility |
| Large deployment (1000+ devices), stable schema | Protobuf | Best tooling, type safety, ecosystem |
| Ultra-constrained (Sigfox, every byte critical) | Custom binary | Maximum efficiency |
| Schema changes frequently | CBOR or JSON | Schema-less flexibility |
| Strong typing required (multi-team) | Protobuf | Compile-time validation |

10.7.5 Step 5: Validate Your Choice

Checklist before committing:

Example Decision:

“We’re deploying 500 agricultural sensors on LoRaWAN SF12. Our payload is 10 bytes raw. JSON doesn’t fit (45 bytes). We’re a 2-person team without Protobuf experience. Schema may evolve (adding soil pH sensor next season). Decision: CBOR – fits in 18 bytes, we can debug with cbor2diag, schema-less flexibility, and our Python backend can decode it with the cbor2 library.”


Question 1: A smart building deploys 2,000 temperature sensors over WiFi (100 Mbps available bandwidth). Each sensor reports every 60 seconds. The development team is small (3 engineers) and needs to iterate quickly. Which format should they choose?

  a) Custom binary – maximum efficiency for 2,000 devices
  b) JSON – bandwidth is plentiful, development speed matters more
  c) Protocol Buffers – 2,000 devices justifies the schema investment
  d) CBOR – best compromise between size and flexibility

Answer:

b) JSON – bandwidth is plentiful, development speed matters more

Reasoning:

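  • WiFi bandwidth: 100 Mbps = 12.5 MB/sec. One report from each of the 2,000 sensors totals ~150 KB (assuming 75-byte JSON payloads). Spread over the 60-second reporting interval that is 2.5 KB/sec, about 0.02% of available bandwidth; even if every sensor sent in the same second it would be only ~1.2%. Bandwidth is NOT the constraint.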
  • Development speed: Small team (3 engineers) needs fast iteration. JSON requires no schema definition, no code generation, no special tooling. Changes to data structure take minutes, not hours.
  • Debugging: JSON is human-readable in logs, browser dev tools, and command-line debugging.
  • Scale: 2,000 devices is significant but not massive. The efficiency gains from binary formats don’t justify the development overhead when bandwidth is unconstrained.

When would the answer change?

  • If connectivity were LoRaWAN/NB-IoT → CBOR
  • If scale were 100,000+ devices → Protocol Buffers
  • If payloads were >1KB → Consider compression or binary formats

Question 2: Your deployed sensors currently use JSON over LoRaWAN (48-byte payload, barely fits in 51-byte limit). You need to add two new sensor readings (4 bytes each). What’s the best migration path?

  a) Compress the JSON with gzip before transmission
  b) Migrate to CBOR (self-describing, gradual rollout possible)
  c) Switch to Protocol Buffers (most efficient)
  d) Remove less important existing fields to make room

Answer:

b) Migrate to CBOR (self-describing, gradual rollout possible)

Reasoning:

  • Current state: JSON = 48 bytes, barely fits in 51-byte LoRaWAN limit
  • Adding 2 fields: JSON would be ~62 bytes (exceeds limit by 21%)
  • CBOR encoding: Same data = ~25 bytes (fits comfortably with room to grow)
  • Migration safety: CBOR is self-describing, so cloud can accept both JSON and CBOR during gradual firmware rollout
  • Library footprint: CBOR libraries are small (~10-20 KB), fit in typical MCU flash

Why not the alternatives?

  • gzip compression (a): Adds CPU overhead on a battery-powered MCU and decompression complexity in the cloud. gzip also adds about 18 bytes of header and trailer, so a payload this small is likely to come out larger than the original and still exceed 51 bytes.
  • Protocol Buffers (c): Requires schema definition, code generation, larger library footprint. The schema-based approach is harder to evolve gracefully during a live migration.
  • Remove fields (d): Defeats the purpose (you need MORE data, not less).
Key lesson: CBOR is the ideal migration path from JSON when you need size reduction but want to maintain schema flexibility and enable gradual deployment.
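
The weakness of option (a) is easy to check with Python's standard library. The reading below is a made-up payload of about 50 bytes with hypothetical field names, similar in size to the one in the question; the exact compressed length depends on the content, so the sketch prints both sizes rather than assuming a figure.

```python
import gzip
import json

# Hypothetical compact sensor reading, serialized without whitespace.
reading = {"id": "N1", "t": 21.5, "h": 40, "b": 3.1, "ts": 1702834567}
payload = json.dumps(reading, separators=(",", ":")).encode()

compressed = gzip.compress(payload)
print(f"JSON: {len(payload)} bytes, gzip: {len(compressed)} bytes")
```

With so little repetition to exploit, the fixed gzip framing outweighs any savings, which is why a binary encoding is the better route for payloads of this size.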

10.8 Try It Yourself: Format Selection Decision

Scenario: You’re designing an industrial vibration monitoring system with these requirements:

Given Data:

  • 50 sensors per factory floor
  • 3 factories (150 sensors total)
  • Each sensor: 6-axis accelerometer (3 axes × 2 readings each)
  • Sampling rate: 1 kHz (1,000 samples/second)
  • Data resolution: 16-bit integers per axis reading (±8g range, 0.001g precision)
  • Connectivity: Wired Ethernet (1 Gbps available) from sensor to edge gateway
  • Edge-to-cloud: 4G LTE cellular (10 Mbps uplink)
  • Development timeline: 6 months to production
  • Expected deployment lifespan: 15 years
  • Team size: 8 engineers (4 embedded, 2 backend, 2 data science)

Your Task (Step-by-Step):

Part A: Sensor-to-Edge Format Selection

  1. Calculate the raw data rate per sensor:
    • Samples/sec: _________
    • Axes: _________
    • Bytes per sample (2 bytes × 6 axes): _________
    • Data rate per sensor: _________ bytes/sec = _________ KB/sec
  2. Calculate total sensor-to-edge bandwidth (150 sensors):
    • Total data rate: _________ KB/sec = _________ Mbps
  3. Compare this to available Ethernet bandwidth (1 Gbps = 125 MB/sec). Is bandwidth constrained?
    • Yes / No
  4. Assume a latency requirement of <10 ms for anomaly detection. Which format should you choose for sensor-to-edge?
    • JSON (self-describing, easy to debug)
    • CBOR (30-60% smaller than JSON)
    • Protocol Buffers (schema-based, efficient)
    • FlatBuffers (zero-copy parsing, ultra-low latency)
    • Custom binary (maximum efficiency)
    Your choice: _________ because _________

Part B: Edge-to-Cloud Format Selection

  1. The edge gateway processes the raw 1 kHz data and sends to cloud:
    • Normal operation: 1 summary message per sensor per minute (mean, std dev, max)
    • Anomaly detected: Full 1-second waveform (1,000 samples × 6 axes)
    Calculate normal mode cloud bandwidth:
    • Messages/sec: 150 sensors / 60 seconds = _________ msg/sec
    • Payload per message (estimate): JSON ~80 bytes, Protobuf ~40 bytes, Custom binary ~20 bytes
    • Data rate (JSON): _________ bytes/sec = _________ Kbps
    • Data rate (Protobuf): _________ bytes/sec = _________ Kbps
  2. Calculate anomaly mode cloud bandwidth (assume 5% of sensors trigger anomaly per minute):
    • Anomaly sensors: 150 × 5% = _________ sensors/minute
    • Samples per anomaly: 1,000 samples × 6 axes × 2 bytes = _________ bytes
    • Additional bandwidth: _________ bytes/min = _________ Kbps average
  3. Given the 10 Mbps LTE uplink, is bandwidth constrained in normal or anomaly modes?
    • Normal mode: Constrained / Not Constrained
    • Anomaly mode: Constrained / Not Constrained
  4. Given the 15-year deployment lifespan, which edge-to-cloud format should you choose?
    • JSON (flexible, easy evolution)
    • CBOR (compact, self-describing)
    • Protocol Buffers (schema evolution built-in, typed)
    • Custom binary (most compact)
    Your choice: _________ because _________

What to Observe:

  • Does the format choice differ between sensor-to-edge and edge-to-cloud? Why?
  • How does the 15-year lifespan constraint affect your decision?
  • What happens if 20% of sensors trigger anomalies simultaneously (worst-case burst)?

Part A: Sensor-to-Edge

  1. Raw data rate per sensor:

    • Samples/sec: 1,000
    • Axes: 6
    • Bytes per sample: 12 bytes (6 axes × 2 bytes)
    • Data rate: 12,000 bytes/sec = 12 KB/sec
  2. Total sensor-to-edge bandwidth:

    • 12 KB/sec × 150 sensors = 1,800 KB/sec = 1.8 MB/sec = 14.4 Mbps
  3. Is bandwidth constrained?

    • No. 14.4 Mbps is only 1.4% of 1 Gbps Ethernet capacity.
  4. Recommended format: FlatBuffers or Custom Binary

    Reasoning:

    • Bandwidth is NOT constrained (only 1.4% of capacity)
    • Latency IS constrained (<10ms for anomaly detection)
    • At 1 kHz sampling, parsing overhead matters more than transmission size
    • FlatBuffers enables zero-copy access (edge gateway can read accelerometer values directly from buffer without deserializing)
    • Custom binary (just 12 raw bytes per sample) is even simpler but lacks schema evolution

    Best choice: FlatBuffers – provides zero-copy performance for real-time processing while maintaining schema evolution capability for the 15-year lifespan.

Part B: Edge-to-Cloud

  1. Normal mode cloud bandwidth:

    • Messages/sec: 150 / 60 = 2.5 msg/sec
    • JSON: 2.5 × 80 = 200 bytes/sec = 1.6 Kbps
    • Protobuf: 2.5 × 40 = 100 bytes/sec = 0.8 Kbps
  2. Anomaly mode bandwidth:

    • Anomaly sensors: 150 × 5% = 7.5 sensors/min
    • Samples per anomaly: 12,000 bytes
    • Additional: 7.5 × 12,000 = 90,000 bytes/min = 1,500 bytes/sec = 12 Kbps
  3. Is bandwidth constrained?

    • Normal mode: NOT constrained (1.6 Kbps vs 10 Mbps = 0.016%)
    • Anomaly mode: NOT constrained (12 Kbps + 1.6 Kbps = 13.6 Kbps vs 10 Mbps = 0.14%)
  4. Recommended format: Protocol Buffers

    Reasoning:

    • Bandwidth is NOT constrained (even in anomaly mode, only 0.14% of LTE capacity)
    • 15-year lifespan makes schema evolution CRITICAL
    • Protocol Buffers’ field numbering system enables:
      • Adding new sensor types without breaking old edge gateways
      • Evolving data structure as vibration analysis algorithms improve
    • Strong typing prevents data corruption at scale (150 devices × 15 years = high exposure to data quality issues)
    • 8-engineer team can afford the schema definition overhead
    • Language-agnostic (C++ on edge, Python in cloud)

    Alternative: CBOR would work if team wants schema-less flexibility, but Protobuf’s explicit schema provides better long-term maintainability.

Key Insights:

  1. Different formats for different paths: FlatBuffers (sensor→edge) vs Protobuf (edge→cloud) optimizes for different constraints
  2. Bandwidth was not the bottleneck: Both paths had >98% capacity remaining. Latency and schema evolution mattered more.
  3. Burst scenario (20% anomalies at once): 150 × 20% × 12 KB = 360 KB burst = 2.88 Mbps peak if sent within one second (still only 29% of LTE capacity – safe!)
  4. 15-year lifespan drove Protobuf choice: Schema evolution is inevitable over that timeline
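
The bandwidth figures in this exercise can be reproduced with a few lines of Python. This sketch uses only the given data (150 sensors, 1 kHz, 6 axes at 16 bits, 80-byte JSON summaries) and decimal units; the variable names are illustrative.

```python
SENSORS = 150
BYTES_PER_SAMPLE = 6 * 2                              # 6 axes x 16-bit readings
SAMPLE_RATE_HZ = 1_000
WAVEFORM_BYTES = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE    # one-second capture


def mbps(bytes_per_second):
    return bytes_per_second * 8 / 1e6


sensor_to_edge = mbps(SENSORS * SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)
normal_json = mbps(SENSORS / 60 * 80)                 # one 80-byte summary per minute
anomaly_avg = mbps(SENSORS * 0.05 * WAVEFORM_BYTES / 60)
burst_peak = mbps(SENSORS * 0.20 * WAVEFORM_BYTES)    # if sent within one second

print(f"sensor-to-edge: {sensor_to_edge:.1f} Mbps of 1000 Mbps Ethernet")
print(f"normal mode:    {normal_json * 1000:.1f} Kbps of 10 Mbps LTE")
print(f"anomaly mode:   {anomaly_avg * 1000:.1f} Kbps average")
print(f"20% burst:      {burst_peak:.2f} Mbps peak")
```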

Common Pitfalls

Sending {"temperature": "23.5"} as a string instead of {"temperature": 23.5} as a number forces consumers to parse strings, breaks numeric queries, and increases message size. Validate that all numeric fields are encoded as JSON numbers; this is one of the most common IoT data format errors.
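
A minimal validation sketch in Python, assuming the consumer knows which fields should be numeric. The function name `non_numeric_fields` is illustrative and not part of any library.

```python
import json


def non_numeric_fields(raw, numeric_fields):
    """Return the names of fields that are missing or not JSON numbers."""
    message = json.loads(raw)
    bad = []
    for field in numeric_fields:
        value = message.get(field)
        # bool is a subclass of int in Python, so exclude it explicitly
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            bad.append(field)
    return bad


print(non_numeric_fields('{"temperature": "23.5"}', ["temperature"]))
print(non_numeric_fields('{"temperature": 23.5}', ["temperature"]))
```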

Adding a new field to a JSON/CBOR message immediately after a firmware update means old consumers (not yet updated) receive unexpected fields. Use additive-only schema changes (never remove or rename fields), version your schemas, and handle unknown fields gracefully in all consumers.
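
One way to handle unknown fields gracefully is for each consumer to read only the fields it knows and ignore the rest. The sketch below assumes a hypothetical two-field schema; `KNOWN_FIELDS` and `parse_reading` are illustrative names.

```python
import json

KNOWN_FIELDS = {"temperature": float, "humidity": float}   # hypothetical v1 schema


def parse_reading(raw):
    """Keep the fields this consumer understands; ignore anything newer."""
    message = json.loads(raw)
    reading = {}
    for name, cast in KNOWN_FIELDS.items():
        if name in message:          # older producers may omit fields
            reading[name] = cast(message[name])
    return reading


# A newer producer adds "soil_ph"; this older consumer still parses the message.
print(parse_reading('{"v": 2, "temperature": 23.5, "soil_ph": 6.8}'))
```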

IEEE 754 float64 uses 8 bytes per value, so a 10-field sensor reading becomes 80 bytes just for numbers. Integer-scaled fixed point (temperature × 100 as int16) reduces each value to 2 bytes while keeping 0.01-degree resolution over -327.68 to 327.67, which is ample for typical IoT ranges.
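
The size difference can be seen with Python's `struct` module. This is a minimal sketch assuming a scale factor of 100 and little-endian byte order; `encode_temp` and `decode_temp` are illustrative names.

```python
import struct


def encode_temp(celsius):
    """Scale by 100 and store as a little-endian int16 (2 bytes)."""
    return struct.pack("<h", round(celsius * 100))


def decode_temp(data):
    return struct.unpack("<h", data)[0] / 100


as_float64 = struct.pack("<d", 23.5)
as_fixed = encode_temp(23.5)
print(len(as_float64), len(as_fixed), decode_temp(as_fixed))
```

Values outside the int16 range raise `struct.error` when packed, so out-of-range readings need to be clamped or rejected before encoding.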

10.9 Summary

Key Decision Factors:

  1. Bandwidth constraints - The primary driver (Wi-Fi vs LoRaWAN)
  2. Scale - Format efficiency matters more at >1000 devices
  3. Battery life - Payload size directly affects transmission energy
  4. Development time - Binary formats require more setup
  5. Schema evolution - How often will your data structure change?
  6. Total cost of ownership - Not just data costs, but development and maintenance

Quick Decision Guide:

| Your Situation | Recommended Format |
|---|---|
| Prototyping anything | JSON |
| Wi-Fi/Ethernet, any scale | JSON |
| LoRaWAN, NB-IoT, moderate scale | CBOR |
| 1000+ devices, stable schema | Protobuf |
| Sigfox, extreme battery constraints | Custom binary |

Sammy the Sensor is at the post office, trying to choose how to send a letter.

“I have four choices,” Sammy explains to the Squad:

Lila the LED reads the options:

  1. JSON = Writing a full letter – “Dear Cloud, my temperature is 23.5 degrees Celsius. Yours truly, Sammy.” Clear, but uses lots of paper!
  2. CBOR = A postcard – Same information, but smaller. Like writing in neat tiny print.
  3. Protobuf = A coded telegram – “TEMP=235 STOP” – short and structured, but you need the codebook.
  4. Custom binary = A number on a stamp – Just “235” – the tiniest possible, but only works if everyone knows the secret.

Max the Microcontroller pulls out a chart. “Here’s my decision trick:”

  • “Got a BIG mailbox? (Wi-Fi)” – Write a full letter (JSON)!
  • “Tiny mailbox? (LoRaWAN)” – Use a postcard (CBOR) or code (binary)!
  • “SUPER tiny mailbox? (Sigfox – 12 bytes only!)” – Stamp code only (custom binary)!

Bella the Battery adds: “And remember – every extra word I have to carry costs me energy. Shorter messages = longer battery life!”

The Squad’s Golden Rule: Match your message size to your mailbox size!


10.10 Knowledge Check

10.11 What’s Next

Now that you can apply the format decision tree and calculate total cost of ownership, explore these related topics:

| Topic | Chapter | What You Will Do |
|---|---|---|
| Practice Scenarios | Data Formats Practice | Work through detailed format selection scenarios, quizzes, and worked examples |
| Protocol Selection | Protocol Selector Wizard | Combine protocol and format selection using an interactive decision tool |
| MQTT Payloads | MQTT Fundamentals | Implement JSON and CBOR payloads in real MQTT publish/subscribe workflows |
| CoAP with CBOR | CoAP Protocol | Examine how CoAP pairs with CBOR for constrained RESTful communication |
| LoRaWAN Constraints | LoRaWAN Overview | Evaluate payload limits and spreading factor trade-offs that drive format choices |
