10  Data Format Selection Guide

In 60 Seconds

Use the format decision tree to choose between JSON, CBOR, Protobuf, and custom binary. If bandwidth is not constrained (Wi-Fi/LTE), use JSON. If constrained (LoRa/Sigfox), evaluate volume and byte limits to select CBOR, Protobuf, or custom binary. Always consider total cost of ownership – not just payload size.

10.1 Learning Objectives

By the end of this chapter, you will be able to:

  • Apply the format decision tree: Systematically evaluate constraints to select the right format for a given IoT scenario
  • Calculate total cost of ownership: Quantify bandwidth, battery, development, and maintenance costs for competing format choices
  • Classify IoT applications by format suitability: Map real-world use cases to JSON, CBOR, Protobuf, or custom binary based on network and power constraints
  • Design format migration strategies: Architect a phased transition from JSON to binary formats while maintaining backward compatibility
  • Justify format decisions to stakeholders: Present trade-off analyses with concrete data on cost, battery life, and schema evolution

Key Concepts:

  • JSON: JavaScript Object Notation; human-readable, universally supported, 2-5× size overhead versus binary formats
  • CBOR: Binary self-describing format; best choice when schema distribution is impractical and size matters
  • Protocol Buffers: Schema-enforced binary format; optimal for high-throughput pipelines where schema management is acceptable
  • Schema Registry: Central repository storing format versions; prevents incompatibility between producers and consumers
  • Compression: gzip/zstd applied over any format can reduce size a further 60-80% on large, text-heavy payloads; on payloads of a few dozen bytes the fixed header overhead can outweigh the gain
  • Self-Description: Format carrying its own type information (JSON, CBOR) versus requiring external schema (Protobuf, Avro)
  • Parsing Overhead: CPU cost to deserialize, from highest to lowest: JSON (slow, text parsing) > CBOR (fast, binary) > Protobuf (fastest, compiled)

10.2 For Beginners: Data Format Selection

Choosing a data format is like deciding how to pack a suitcase. JSON is like packing in clear labeled bags – easy to see what is inside, but takes more space. Binary formats like CBOR are like vacuum-sealing everything – much smaller, but harder to inspect. For IoT devices with limited battery and slow connections, picking the right format can mean the difference between a sensor lasting one year or five years.

This is part of a series on IoT Data Formats:

  1. IoT Data Formats Overview - Introduction and text formats
  2. Binary Data Formats - CBOR, Protobuf, custom binary
  3. Data Format Selection (this chapter) - Decision guides and real-world examples
  4. Data Formats Practice - Scenarios, quizzes, worked examples

10.3 Prerequisites

Before starting this chapter, you should be familiar with:


10.4 Format Selection Decision Tree

Flowchart decision tree starting with navy Start box asking Choose Data Format. First decision (gray): Bandwidth Constrained? (LoRa, Sigfox, NB-IoT). No path (Wi-Fi, Ethernet, LTE) leads to orange JSON box (Simple, Universal, Best for Wi-Fi/LTE). Yes path (LoRa, Sigfox) leads to second decision: Need Human Readability? (Debugging, Dev). Yes path leads to orange JSON or CBOR with tools box. No path leads to third decision: High Volume? (>1000 devices, >1 msg/min). No path (Small deployment) leads to teal CBOR box (Balance efficiency and flexibility). Yes path (>1000 devices) leads to fourth decision: Every Byte Critical? (e.g., 12 byte limit). No path leads to teal Protobuf or CBOR box (Typed, efficient pipelines). Yes path (Sigfox, ultra-LP) leads to navy Custom Binary box (Ultimate efficiency). Color coding: Navy for start/custom binary, Gray for decision points, Orange for JSON options, Teal for CBOR/Protobuf options.

Format Selection Decision Tree: Interactive flowchart for choosing the right IoT data format based on network constraints and application requirements. The tree starts by evaluating bandwidth constraints (Wi-Fi/LTE vs LoRa/Sigfox). High-bandwidth networks can use JSON for simplicity. Constrained networks require further evaluation: human readability needs (debugging vs production), message volume (small deployment vs high-scale fleet), and byte-level criticality (standard protocols vs ultra-constrained like Sigfox’s 12-byte limit). Each path leads to the optimal format choice balancing efficiency, maintainability, and ecosystem support.

Mobile Decision Tree Summary

  • Start: Identify whether your network is bandwidth constrained.
  • If bandwidth is not constrained: Choose JSON for simplicity, debugging, and broad tool support.
  • If bandwidth is constrained: Ask whether you still need human-readable payloads.
  • If readability still matters: Choose JSON or CBOR with tooling support.
  • If readability does not matter: Ask whether you have many devices or high message volume.
  • If volume is high: Choose Protobuf or CBOR for efficient typed payloads.
  • If volume is moderate: Choose CBOR for a balance between size and flexibility.
  • If every byte is critical: Choose Custom binary only when you control both ends and need the smallest possible payload.

10.4.1 Step-by-Step Decision Process

Question 1: Is bandwidth severely constrained?

  • No (Wi-Fi, Ethernet, LTE): Use JSON for simplicity
  • Yes (LoRa, Sigfox, NB-IoT): Continue to Q2

Question 2: Do you need human readability for debugging?

  • Yes: Use JSON or CBOR with tooling
  • No: Continue to Q3

Question 3: Do you have many devices with high message volume?

  • Yes (>1000 devices, >1 msg/min): Continue to Q4
  • No (small deployment): Use CBOR for balance

Question 4: Is every byte critical? (Sigfox: 12 bytes max)

  • Yes: Custom binary format
  • No: Protobuf or CBOR
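
The four questions can be sketched as a small function. This is only an illustration of the decision order, following the flowchart in which a high-volume deployment proceeds to the every-byte-critical question; the function name `choose_format` and its parameters are hypothetical, and the yes/no inputs still require your own judgment.

```python
def choose_format(bandwidth_constrained: bool,
                  need_human_readable: bool = False,
                  high_volume: bool = False,
                  every_byte_critical: bool = False) -> str:
    """Walk the four questions in order and return the suggested format."""
    if not bandwidth_constrained:      # Q1: Wi-Fi, Ethernet, LTE
        return "JSON"
    if need_human_readable:            # Q2: debugging, development
        return "JSON or CBOR with tooling"
    if not high_volume:                # Q3: small deployment
        return "CBOR"
    if every_byte_critical:            # Q4: e.g. Sigfox 12-byte cap
        return "Custom binary"
    return "Protobuf or CBOR"


# Example: a large Sigfox fleet where every byte counts
print(choose_format(True, high_volume=True, every_byte_critical=True))
```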

10.5 Real-World Application Map

Pick the project profile first, then choose the format family that matches its link budget and integration style.

JSON

  • Smart home thermostat: Wi-Fi; readable payloads for setup, dashboards, and support.
  • Building HVAC controller: Ethernet; human-readable diagnostics matter more than byte size.
  • Mobile app prototype: REST API; rapid iteration with minimal tooling overhead.

CBOR

  • Agricultural soil sensor: LoRaWAN; balances payload size with flexible field updates.
  • Fleet GPS tracker: NB-IoT; keeps messages compact while preserving readable schema structure.
  • Smart grid meter: constrained backhaul; efficient payloads without hard-coded binary parsing.

Protobuf

  • Industrial IoT gateway: factory floor; high volume and strict contracts justify schemas.
  • Cloud telemetry pipeline: gRPC services; typed payloads integrate cleanly with backend tooling.
  • Analytics stream: large-scale ingestion; efficient serialization helps keep compute costs down.

Custom Binary

  • Sigfox parking sensor: 12-byte payload cap; every byte must be planned.
  • Wearable fitness tracker: BLE power budget; fixed compact frames extend battery life.
  • Satellite remote monitor: high cost per transmitted byte; smallest practical payload wins.

10.5.1 Application-Format Reference Table

  • Smart home thermostat: Protocol Wi-Fi + MQTT; format JSON; rationale: bandwidth is plentiful and debugging matters.
  • Agricultural soil sensor: Protocol LoRaWAN; format CBOR; rationale: bandwidth is limited but flexibility is still needed.
  • City parking sensor: Protocol Sigfox; format Custom binary; rationale: the 12-byte limit demands a fixed compact structure.
  • Industrial gateway: Protocol Ethernet + gRPC; format Protobuf; rationale: high volume and strong typing justify schema overhead.
  • Wearable fitness tracker: Protocol BLE; format Custom binary; rationale: power sensitivity and fixed data structure favor maximum efficiency.


10.6 Learning Scenario: Soil Moisture Network

Your Challenge: You’re designing a precision agriculture system for a large vineyard. You need to monitor soil moisture to optimize irrigation and prevent crop damage.

System Requirements:

  • 100 battery-powered sensor nodes placed throughout vineyard
  • Cellular backhaul (NB-IoT): $0.01/KB data cost, 250 KB/month included per SIM
  • Readings every 15 minutes (96 readings/day)
  • Battery life target: 5 years on 2x AA batteries
  • Sensors measure: Soil moisture (0-100%), temperature (-20 to 60 °C), battery voltage (2.0-3.6 V), GPS location (once/day)

Your Mission: Choose the optimal data format and calculate the real costs.


10.6.1 Step 1: Calculate Message Volume

Think: How many messages per year per node?

  • Readings per day: 96 (every 15 minutes)
  • Days per year: 365
  • Total messages/year/node: 96 x 365 = 35,040 messages
  • Fleet total: 35,040 x 100 nodes = 3,504,000 messages/year

10.6.2 Step 2: Compare Format Options

You’re considering three formats. Let’s analyze each:

10.6.2.1 Option A: JSON (Readable)

{
  "id": "V001",
  "moist": 45.2,
  "temp": 18.5,
  "batt": 3.1,
  "lat": 38.5,
  "lng": -122.4
}

  • Size: 78 bytes per message
  • Pros: Easy debugging, universal tools, cloud-friendly
  • Cons: Largest size, more battery drain for transmission

10.6.2.2 Option B: CBOR (Balanced)

Binary encoding with same structure:

  • Size: 42 bytes per message (46% smaller than JSON)
  • Pros: Roughly half the bandwidth of JSON, standard format, easier debugging than custom binary
  • Cons: Smaller ecosystem than JSON, requires CBOR library

10.6.2.3 Option C: Custom Binary (Optimized)

[id: 2 bytes]
[moisture x2: 1 byte]
[temperature + 20: 1 byte]
[(battery - 2.0) x100: 1 byte, so 2.0-3.6 V maps to 0-160]
[latitude: 4 bytes]
[longitude: 4 bytes]
Total: 13 bytes
  • Size: 13 bytes per message (83% smaller than JSON)
  • Pros: Smallest size, lowest power consumption
  • Cons: No tooling, rigid schema, difficult debugging
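
A minimal sketch of how such a 13-byte frame could be packed with Python's standard `struct` module. Several details are assumptions made for illustration and are not specified by the layout above: big-endian byte order, a numeric 16-bit node id, 32-bit floats for latitude and longitude, and battery voltage stored as an offset from 2.0 V (a plain x100 scaling would not fit in one byte above 2.55 V).

```python
import struct

# Assumed layout (big-endian, no padding): uint16 id, three uint8 fields, two float32.
FRAME = struct.Struct(">HBBBff")


def encode(node_id, moisture_pct, temp_c, batt_v, lat, lng):
    """Pack one reading into a fixed 13-byte frame."""
    return FRAME.pack(
        node_id,
        round(moisture_pct * 2),       # 0-100 % in 0.5 % steps -> 0-200
        round(temp_c + 20),            # -20..60 C in 1 C steps -> 0-80
        round((batt_v - 2.0) * 100),   # 2.0-3.6 V in 10 mV steps -> 0-160
        lat,
        lng,
    )


def decode(frame):
    """Reverse the scaling applied in encode()."""
    node_id, m, t, b, lat, lng = FRAME.unpack(frame)
    return {"id": node_id, "moist": m / 2, "temp": t - 20,
            "batt": 2.0 + b / 100, "lat": lat, "lng": lng}


frame = encode(1, 45.2, 18.5, 3.1, 38.5, -122.4)
reading = decode(frame)
print(len(frame), reading)
```

Both ends must share this layout exactly, which is the rigidity listed under Cons: the decoded values are also quantized to the chosen step sizes.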

10.6.3 Step 3: Calculate Annual Data Usage

Calculate: Data usage per node per year for each format.

Per node per year:

  • JSON: 35,040 messages x 78 bytes = 2.73 MB/year
  • CBOR: 35,040 messages x 42 bytes = 1.47 MB/year
  • Custom: 35,040 messages x 13 bytes = 0.46 MB/year

Fleet total (100 nodes):

  • JSON: 2.73 MB x 100 = 273 MB/year
  • CBOR: 1.47 MB x 100 = 147 MB/year
  • Custom: 0.46 MB x 100 = 46 MB/year
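
The same arithmetic as a short script, useful for trying other message sizes or reporting intervals. It assumes decimal megabytes (10^6 bytes), as the figures above do; the helper name `annual_mb` is illustrative.

```python
MESSAGES_PER_DAY = 96                             # one reading every 15 minutes
NODES = 100
SIZES = {"JSON": 78, "CBOR": 42, "Custom": 13}    # bytes per message, from Step 2


def annual_mb(bytes_per_msg, msgs_per_day=MESSAGES_PER_DAY, days=365):
    """Payload volume per node per year in decimal megabytes."""
    return msgs_per_day * days * bytes_per_msg / 1e6


for name, size in SIZES.items():
    per_node = annual_mb(size)
    print(f"{name:6s} {per_node:5.2f} MB/node/year  {per_node * NODES:6.1f} MB/fleet/year")
```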

10.6.4 Step 4: Cost Analysis

Data plan: $0.01/KB, 250 KB/month included per SIM

Calculate: Will you exceed the included data allowance? What are the overage costs?

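Included data per node per year: 250 KB/month x 12 months = 3 MB/year (this treats the monthly allowance as pooled over the year; if unused data does not roll over, compare each month against 250 KB instead)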

Overage analysis:

  • JSON: usage/year 2.73 MB; included 3 MB; overage 0 MB; cost/node/year $0; fleet cost/year $0.
  • CBOR: usage/year 1.47 MB; included 3 MB; overage 0 MB; cost/node/year $0; fleet cost/year $0.
  • Custom: usage/year 0.46 MB; included 3 MB; overage 0 MB; cost/node/year $0; fleet cost/year $0.

Surprising result: All formats fit within the included 3 MB/year allowance! No overage costs.

BUT WAIT - what about peak months? (summer = more frequent irrigation adjustments)

Summer scenario (June-August): Increase to every 5 minutes = 288 msgs/day x 90 days = 25,920 messages

Summer data usage (3 months):

  • JSON: 25,920 x 78 bytes = 2.02 MB (67% of annual allowance in 3 months!)
  • CBOR: 25,920 x 42 bytes = 1.09 MB (36% of allowance)
  • Custom: 25,920 x 13 bytes = 0.34 MB (11% of allowance)

Annual with summer spike:

  • JSON: (270 days @ 15 min = 2.02 MB) + (90 days @ 5 min = 2.02 MB) = 4.04 MB/year - Exceeds 3 MB by 1.04 MB (1,040 KB x $0.01/KB) - $10.40/node overage - $1,040/year fleet
  • CBOR: 2.18 MB/year - Under limit
  • Custom: 0.67 MB/year - Under limit
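
A rough sketch of the summer-spike calculation. It assumes 30-day months (270 days at 15-minute intervals plus 90 days at 5-minute intervals), an allowance pooled over the year, and 1 KB = 1,000 bytes; the function names are illustrative. Results agree with the figures above to within rounding.

```python
INCLUDED_BYTES_PER_YEAR = 250_000 * 12    # 250 KB/month, pooled over the year (assumption)
COST_PER_KB = 0.01                        # dollars per KB of overage
NODES = 100


def annual_bytes(size, normal_days=270, summer_days=90):
    """Bytes sent per node per year with a 5-minute summer schedule."""
    normal = normal_days * 96 * size      # every 15 minutes
    summer = summer_days * 288 * size     # every 5 minutes
    return normal + summer


def overage_dollars(size):
    """Overage cost per node per year; zero if usage stays under the allowance."""
    extra = max(0, annual_bytes(size) - INCLUDED_BYTES_PER_YEAR)
    return extra / 1000 * COST_PER_KB


for name, size in {"JSON": 78, "CBOR": 42, "Custom": 13}.items():
    per_node = overage_dollars(size)
    print(f"{name:6s} {annual_bytes(size) / 1e6:4.2f} MB/year  "
          f"${per_node:5.2f}/node  ${per_node * NODES:8.2f}/fleet")
```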

10.6.5 Step 5: Battery Life Impact

Radio power consumption (NB-IoT):

  • Transmit power: 23 dBm (200 mW)
  • Transmission time: ~50 ms/byte (including protocol overhead)
  • Energy per byte: 200 mW x 50 ms = 10 mJ/byte

Calculate: How much battery energy does each format consume?

Energy per message (transmission only):

  • JSON: 78 bytes x 10 mJ/byte = 780 mJ
  • CBOR: 42 bytes x 10 mJ/byte = 420 mJ
  • Custom: 13 bytes x 10 mJ/byte = 130 mJ

Annual energy (35,040 messages/year):

  • JSON: 35,040 x 780 mJ = 27.3 kJ/year = 7.6 Wh/year
  • CBOR: 35,040 x 420 mJ = 14.7 kJ/year = 4.1 Wh/year
  • Custom: 35,040 x 130 mJ = 4.6 kJ/year = 1.3 Wh/year
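
The transmit-energy figures can be reproduced with a few lines. This sketch only covers radio transmission under the scenario's stated assumption of 10 mJ per byte; real radios vary widely, so treat the constant as a placeholder to replace with measured values.

```python
ENERGY_PER_BYTE_MJ = 200 * 0.050      # 200 mW x 50 ms = 10 mJ per byte (scenario assumption)
MESSAGES_PER_YEAR = 96 * 365          # one reading every 15 minutes


def annual_tx_wh(bytes_per_msg):
    """Transmit energy per node per year in watt-hours."""
    joules = bytes_per_msg * ENERGY_PER_BYTE_MJ / 1000 * MESSAGES_PER_YEAR
    return joules / 3600              # 1 Wh = 3600 J


for name, size in {"JSON": 78, "CBOR": 42, "Custom": 13}.items():
    print(f"{name:6s} {size * ENERGY_PER_BYTE_MJ:5.0f} mJ/msg  {annual_tx_wh(size):4.1f} Wh/year")
```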

Battery capacity: 2x AA batteries = 2 x 2500 mAh x 3V = 15 Wh total

Transmission as % of battery (assuming other circuitry uses 50% of battery):

  • JSON: 7.6 Wh / 7.5 Wh available = 101% of available budget - 4.9 year battery life
  • CBOR: 4.1 Wh / 7.5 Wh = 55% of budget - 5+ year target achieved
  • Custom: 1.3 Wh / 7.5 Wh = 17% of budget - 5+ year target easily met

Verdict: JSON fails the 5-year battery life requirement!


10.6.6 Step 6: Final Recommendation

Compare all factors:

  • Summer data cost: JSON $1,040/year overage; CBOR $0; Custom Binary $0.
  • Battery life: JSON 4.9 years (misses target); CBOR 5+ years; Custom Binary 5+ years.
  • Debugging ease: JSON Excellent (text editor); CBOR Requires CBOR tools; Custom Binary Custom parser needed.
  • Schema evolution: JSON Easy (add fields); CBOR Moderate (CBOR flexible); Custom Binary Rigid (breaking changes).
  • Development time: JSON 1 day (JSON libs everywhere); CBOR 2-3 days (CBOR setup); Custom Binary 1-2 weeks (custom parser).
  • Maintenance burden: JSON Low (standard format); CBOR Medium (CBOR docs); Custom Binary High (DIY everything).
  • 5-year data overage cost: JSON $5,200; CBOR $0; Custom Binary $0 (development and maintenance effort are listed separately above).

10.6.7 Your Recommendation: CBOR (Option B)

Rationale:

  1. Meets battery life target (5+ years) with 46% size reduction vs JSON
  2. Zero overage costs even with summer spike (about 2.2 MB/year < 3 MB limit)
  3. Flexible schema - Can add new sensor types without breaking existing nodes
  4. Standard format - CBOR libraries exist for embedded C, Python, cloud processing
  5. Reasonable debugging - Tools like cbor2diag convert binary to human-readable
  6. Moderate setup - 2-3 days to integrate CBOR library, but well-documented

Why not custom binary?

  • Custom saves only 1.01 MB/year (1.47 - 0.46 MB) per node vs CBOR
  • Zero cost benefit (both are under data cap)
  • Battery savings: 2.8 Wh/year = 6 extra months of battery life
  • Trade-off: 6 months extra battery vs 2 weeks development + ongoing maintenance burden
  • Verdict: Not worth it unless battery life is absolutely critical

Why not JSON?

  • Fails 5-year battery life target (4.9 years)
  • $1,040/year overage costs during summer spike
  • 5-year overage cost: $5,200 vs $0 for CBOR

10.6.8 Real-World Lesson

Key Insight: Don’t just optimize for bytes - optimize for total cost of ownership (TCO):

  • Data costs
  • Battery replacement costs (labor + materials)
  • Development time costs
  • Maintenance burden costs

In this case, CBOR captures more than half of custom binary’s size reduction (46% vs 83% smaller than JSON) for a fraction of the engineering effort (2-3 days vs 1-2 weeks). The sweet spot!

Bonus: With CBOR, you can add new fields (soil pH, nutrient levels) next season without touching nodes that are already deployed: new nodes send the extra keys, and only the cloud parser needs to learn them. With custom binary, that’s a breaking change requiring firmware updates or protocol versioning.


10.7 Fleet Tracking Quiz

Scenario: You’re building a fleet tracking system for 500 delivery trucks. Each truck sends GPS updates:

  • Data: Latitude (float), Longitude (float), Speed (km/h, 0-120), Heading (degrees, 0-359), Timestamp (Unix epoch)
  • Frequency: Every 30 seconds while moving (8 hours/day average)
  • Network: Cellular (NB-IoT)
  • Data plan: $5/month per truck for 50 MB

You’re considering three formats:

Option A: JSON

{"lat":37.7749,"lng":-122.4194,"spd":45,"hdg":180,"ts":1702834567}

Size: 68 bytes

Option B: CBOR

Binary encoding of same structure

Size: 35 bytes

Option C: Custom Binary

[4 bytes lat] [4 bytes lng] [1 byte spd] [2 bytes hdg] [4 bytes ts]

Size: 15 bytes

Think about:

  1. How many messages per truck per month? (30s intervals, 8 hours/day, 22 workdays)
  2. What’s the monthly data usage for 500 trucks with each format?
  3. Will you exceed the 50 MB/month data plan with any format?
  4. What’s the total annual cost difference between formats?

Key Insights:

Messages per truck per month:

  • Updates: Every 30s for 8 hours/day
  • Per day: (8 hours x 3600s) / 30s = 960 messages/day
  • Per month: 960 x 22 workdays = 21,120 messages/month

Monthly data usage per truck:

  • JSON: 21,120 x 68 bytes = 1.44 MB/month
  • CBOR: 21,120 x 35 bytes = 0.74 MB/month
  • Custom: 21,120 x 15 bytes = 0.32 MB/month

Fleet total (500 trucks):

  • JSON: 1.44 x 500 = 720 MB/month
  • CBOR: 0.74 x 500 = 370 MB/month
  • Custom: 0.32 x 500 = 160 MB/month

Data plan analysis (50 MB/month per truck):

  • JSON: 1.44 MB < 50 MB - Under limit (3% utilization)
  • CBOR: 0.74 MB < 50 MB - Under limit (1.5% utilization)
  • Custom: 0.32 MB < 50 MB - Under limit (0.6% utilization). All formats work!

Cost analysis: Since all formats fit within the $5/month plan, costs are identical: $5 x 500 = $2,500/month

BUT - what if you want to upgrade update frequency to every 10 seconds?

10-second updates (3x more frequent):

  • JSON: 4.32 MB/month - Still under 50 MB
  • CBOR: 2.22 MB/month - Still under 50 MB
  • Custom: 0.96 MB/month - Still under 50 MB
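
The per-truck usage can be recomputed for any update interval with a small helper. The function name and defaults are illustrative; it assumes 22 workdays per month and decimal megabytes, as in the figures above.

```python
PLAN_MB = 50    # monthly allowance per truck


def monthly_mb(bytes_per_msg, interval_s=30, hours_per_day=8, workdays=22):
    """Payload volume per truck per month in decimal megabytes."""
    msgs = hours_per_day * 3600 // interval_s * workdays
    return msgs * bytes_per_msg / 1e6


for name, size in {"JSON": 68, "CBOR": 35, "Custom": 15}.items():
    for interval in (30, 10):
        used = monthly_mb(size, interval_s=interval)
        print(f"{name:6s} every {interval:2d}s: {used:4.2f} MB "
              f"({used / PLAN_MB:.1%} of plan)")
```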

Best choice: CBOR (Option B)

  • About 49% smaller than JSON (35 vs 68 bytes, bandwidth efficient)
  • Self-describing format (schema flexibility)
  • Standard libraries available
  • Easy debugging with CBOR tools
  • Fits well within data plan even with future growth

Why not custom binary?

  • Saves only 0.42 MB/month per truck (1% of data plan)
  • Loses flexibility for schema changes
  • Harder to debug in production
  • Not worth the maintenance burden for minimal savings

Real-world lesson: Choose formats based on flexibility and maintainability, not just raw size. CBOR captures well over half of custom binary’s size reduction (35 vs 15 bytes, against 68 for JSON) with far less complexity. In this case, all formats fit comfortably within the data budget, so optimize for developer productivity, not bytes.

For a cellular-connected fleet tracker, format overhead directly impacts your bill:

\[ \text{Annual Cost} = N_{\text{devices}} \times \frac{M_{\text{msgs/day}} \times 365 \times S_{\text{bytes}}}{10^6} \times C_{\text{\$/MB}} \]

With 500 trucks, 960 messages/day, and $0.10/MB:

  • JSON (68 bytes): (500 × 960 × 365 × 68 / 10⁶) MB × $0.10/MB = 11,914 MB × $0.10 = $1,191/year
  • CBOR (35 bytes): (500 × 960 × 365 × 35 / 10⁶) MB × $0.10/MB = 6,132 MB × $0.10 = $613/year
  • Custom (15 bytes): (500 × 960 × 365 × 15 / 10⁶) MB × $0.10/MB = 2,628 MB × $0.10 = $263/year

The $350/year CBOR→Custom savings seems significant, but set against an assumed $15K engineering cost for a custom format, breakeven takes about 43 years ($15,000 / $350). For most deployments, CBOR’s sweet spot (about half the bandwidth of JSON, standard tooling) wins.
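
The annual cost formula translates directly into code. This sketch uses the same inputs as the worked figures (500 trucks, 960 messages/day, $0.10/MB); the $15,000 engineering cost is the assumed figure from the text, not a measured one, and `annual_cost` is an illustrative name.

```python
def annual_cost(devices, msgs_per_day, bytes_per_msg, cost_per_mb):
    """Annual data cost in dollars, per the formula above (decimal MB)."""
    mb_per_year = devices * msgs_per_day * 365 * bytes_per_msg / 1e6
    return mb_per_year * cost_per_mb


costs = {name: annual_cost(500, 960, size, 0.10)
         for name, size in {"JSON": 68, "CBOR": 35, "Custom": 15}.items()}

savings = costs["CBOR"] - costs["Custom"]     # yearly saving from going custom
breakeven_years = 15_000 / savings            # assumed one-off engineering cost

for name, cost in costs.items():
    print(f"{name:6s} ${cost:,.0f}/year")
print(f"Breakeven for custom binary: {breakeven_years:.0f} years")
```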


Use this decision framework to choose the right data format based on your specific constraints:

10.7.1 Step 1: Identify Your Primary Constraint

  • Bandwidth-limited (LoRa, Sigfox, NB-IoT): go to Step 2.
  • Battery-critical (10+ year lifespan): go to Step 2.
  • Wi-Fi/Ethernet (bandwidth plentiful): use JSON and stop here.
  • Rapid prototyping (speed over efficiency): use JSON and stop here.

10.7.2 Step 2: Calculate Your Byte Budget

Total available bytes per message = MTU - Protocol Headers

  • LoRaWAN SF12: MTU 51; headers 13; available payload 38 bytes.
  • Sigfox: MTU 12; headers 0; available payload 12 bytes.
  • NB-IoT: MTU 1500; headers 48 (IPv6 + UDP); available payload 1452 bytes.
  • BLE (default ATT MTU): MTU 23; headers 3; available payload 20 bytes.

Example: LoRaWAN → 38 bytes available for your sensor data

10.7.3 Step 3: Evaluate Format Options Against Your Payload

For your specific sensor readings, estimate size with each format:

Example: Temperature (16-bit), Humidity (8-bit), Battery (8-bit), Timestamp (32-bit) = 8 bytes raw

  • JSON: size ~45 bytes; fits in 38 bytes? No; efficiency N/A.
  • CBOR: size ~18 bytes; fits in 38 bytes? Yes; efficiency 44% (8/18).
  • Protobuf: size ~12 bytes; fits in 38 bytes? Yes; efficiency 67% (8/12).
  • Custom binary: size 8 bytes; fits in 38 bytes? Yes; efficiency 100% (8/8).
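
The fit and efficiency columns can be computed mechanically once you have size estimates. The sizes below are the approximate figures from the list above, not measurements; substitute sizes from your own encoder output. The helper name `evaluate` is illustrative.

```python
RAW_BYTES = 8          # 2 (temperature) + 1 (humidity) + 1 (battery) + 4 (timestamp)
BUDGET = 38            # LoRaWAN SF12 payload budget from Step 2
ESTIMATES = {"JSON": 45, "CBOR": 18, "Protobuf": 12, "Custom binary": 8}


def evaluate(size, raw=RAW_BYTES, budget=BUDGET):
    """Return whether an encoded size fits the budget and its raw/encoded ratio."""
    return {"fits": size <= budget, "efficiency": raw / size}


for name, size in ESTIMATES.items():
    result = evaluate(size)
    print(f"{name:14s} {size:3d} bytes  fits={result['fits']!s:5s} "
          f"efficiency={result['efficiency']:.0%}")
```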

10.7.4 Step 4: Apply Decision Rules

  • Payload fits in JSON within budget: recommended format JSON because it is the simplest and most debuggable option.
  • Payload doesn’t fit, but deployment is small (<100 devices): recommended format CBOR because it balances efficiency and flexibility.
  • Large deployment (1000+ devices), stable schema: recommended format Protobuf because it provides tooling, type safety, and ecosystem support.
  • Ultra-constrained (Sigfox, every byte critical): recommended format Custom binary because it maximizes efficiency.
  • Schema changes frequently: recommended format CBOR or JSON because schema-less flexibility matters most.
  • Strong typing required (multi-team): recommended format Protobuf because compile-time validation reduces integration errors.

10.7.5 Step 5: Validate Your Choice

Example Decision:

“We’re deploying 500 agricultural sensors on LoRaWAN SF12. Our payload is 10 bytes raw. JSON doesn’t fit (45 bytes). We’re a 2-person team without Protobuf experience. Schema may evolve (adding soil pH sensor next season). Decision: CBOR – fits in 18 bytes, we can debug with cbor2diag, schema-less flexibility, and our Python backend can decode CBOR with an off-the-shelf library.”


Question 1: A smart building deploys 2,000 temperature sensors over WiFi (100 Mbps available bandwidth). Each sensor reports every 60 seconds. The development team is small (3 engineers) and needs to iterate quickly. Which format should they choose?

  a) Custom binary – maximum efficiency for 2,000 devices
  b) JSON – bandwidth is plentiful, development speed matters more
  c) Protocol Buffers – 2,000 devices justifies the schema investment
  d) CBOR – best compromise between size and flexibility

Answer: b) JSON – bandwidth is plentiful, development speed matters more

Reasoning:

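  • Wi-Fi bandwidth: 100 Mbps = 12.5 MB/sec. One reporting round from all 2,000 sensors is ~150 KB (assuming 75-byte JSON payloads). Spread over the 60-second interval that is 2.5 KB/sec, about 0.02% of available bandwidth, and even if every sensor sent in the same second it would be only about 1.2% – bandwidth is NOT the constraint.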
  • Development speed: Small team (3 engineers) needs fast iteration. JSON requires no schema definition, no code generation, no special tooling. Changes to data structure take minutes, not hours.
  • Debugging: JSON is human-readable in logs, browser dev tools, and command-line debugging.
  • Scale: 2,000 devices is significant but not massive. The efficiency gains from binary formats don’t justify the development overhead when bandwidth is unconstrained.

When would the answer change?

  • If connectivity were LoRaWAN/NB-IoT → CBOR
  • If scale were 100,000+ devices → Protocol Buffers
  • If payloads were >1KB → Consider compression or binary formats

Question 2: Your deployed sensors currently use JSON over LoRaWAN (48-byte payload, barely fits in 51-byte limit). You need to add two new sensor readings (4 bytes each). What’s the best migration path?

  a) Compress the JSON with gzip before transmission
  b) Migrate to CBOR (self-describing, gradual rollout possible)
  c) Switch to Protocol Buffers (most efficient)
  d) Remove less important existing fields to make room

Answer: b) Migrate to CBOR (self-describing, gradual rollout possible)

Reasoning:

  • Current state: JSON = 48 bytes, barely fits in 51-byte LoRaWAN limit
  • Adding 2 fields: JSON would be ~62 bytes (exceeds limit by 21%)
  • CBOR encoding: Same data = ~25 bytes (fits comfortably with room to grow)
  • Migration safety: CBOR is self-describing, so cloud can accept both JSON and CBOR during gradual firmware rollout
  • Library footprint: CBOR libraries are small (~10-20 KB), fit in typical MCU flash

Why not the alternatives?

  • gzip compression (a): Adds CPU overhead on a battery-powered MCU and decompression complexity in the cloud. gzip also adds a fixed 18 bytes of header and trailer, so a ~62-byte JSON payload is unlikely to shrink below 51 bytes and may even grow.
  • Protocol Buffers (c): Requires schema definition, code generation, larger library footprint. The schema-based approach is harder to evolve gracefully during a live migration.
  • Remove fields (d): Defeats the purpose (you need MORE data, not less).
Key lesson: CBOR is the ideal migration path from JSON when you need size reduction but want to maintain schema flexibility and enable gradual deployment.
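
To see why gzip is a poor fit for the option discussed above, you can measure it on a small payload. The reading below is a hypothetical six-field message, not data from the scenario; the sketch uses only Python's standard `gzip` and `json` modules. A gzip stream carries a 10-byte header and an 8-byte trailer regardless of content, so very short inputs typically gain little or come out larger.

```python
import gzip
import json

# Hypothetical small sensor reading, serialized without whitespace
reading = {"id": "V001", "moist": 45.2, "temp": 18.5, "batt": 3.1, "ph": 6.8, "n": 12}
raw = json.dumps(reading, separators=(",", ":")).encode()

packed = gzip.compress(raw)
empty_overhead = len(gzip.compress(b""))   # size of a gzip stream with no data

print(f"raw JSON:      {len(raw)} bytes")
print(f"gzip output:   {len(packed)} bytes")
print(f"gzip of empty: {empty_overhead} bytes of pure overhead")
```

Run it against your own payloads before ruling compression in or out; the picture changes for messages of several hundred bytes or more.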

10.8 Try It Yourself: Format Selection Decision

Scenario: You’re designing an industrial vibration monitoring system with these requirements:

Given Data:

  • 50 sensors per factory floor
  • 3 factories (150 sensors total)
  • Each sensor: 6-axis accelerometer (3 axes × 2 readings each)
  • Sampling rate: 1 kHz (1,000 samples/second)
  • Data resolution: 16-bit integers per axis reading (±8g range, 0.001g precision)
  • Connectivity: Wired Ethernet (1 Gbps available) from sensor to edge gateway
  • Edge-to-cloud: 4G LTE cellular (10 Mbps uplink)
  • Development timeline: 6 months to production
  • Expected deployment lifespan: 15 years
  • Team size: 8 engineers (4 embedded, 2 backend, 2 data science)

Your Task (Step-by-Step):

Part A: Sensor-to-Edge Format Selection

  1. Calculate the raw data rate per sensor:
    • Samples/sec: _________
    • Axes: _________
    • Bytes per sample (2 bytes × 6 axes): _________
    • Data rate per sensor: _________ bytes/sec = _________ KB/sec
  2. Calculate total sensor-to-edge bandwidth (150 sensors):
    • Total data rate: _________ KB/sec = _________ Mbps
  3. Compare this to available Ethernet bandwidth (1 Gbps = 125 MB/sec). Is bandwidth constrained?
    • Yes / No
  4. Given the latency requirement (<10ms for anomaly detection), which format should you choose for sensor-to-edge?
    • JSON (self-describing, easy to debug)
    • CBOR (30-60% smaller than JSON)
    • Protocol Buffers (schema-based, efficient)
    • FlatBuffers (zero-copy parsing, ultra-low latency)
    • Custom binary (maximum efficiency)
    Your choice: _________ because _________

Part B: Edge-to-Cloud Format Selection

  1. The edge gateway processes the raw 1 kHz data and sends to cloud:
    • Normal operation: 1 summary message per sensor per minute (mean, std dev, max)
    • Anomaly detected: Full 1-second waveform (1,000 samples × 6 axes)
    Calculate normal mode cloud bandwidth:
    • Messages/sec: 150 sensors / 60 seconds = _________ msg/sec
    • Payload per message (estimate): JSON ~80 bytes, Protobuf ~40 bytes, Custom binary ~20 bytes
    • Data rate (JSON): _________ bytes/sec = _________ Kbps
    • Data rate (Protobuf): _________ bytes/sec = _________ Kbps
  2. Calculate anomaly mode cloud bandwidth (assume 5% of sensors trigger anomaly per minute):
    • Anomaly sensors: 150 × 5% = _________ sensors/minute
    • Samples per anomaly: 1,000 samples × 6 axes × 2 bytes = _________ bytes
    • Additional bandwidth: _________ bytes/min = _________ Kbps average
  3. Given the 10 Mbps LTE uplink, is bandwidth constrained in normal or anomaly modes?
    • Normal mode: Constrained / Not Constrained
    • Anomaly mode: Constrained / Not Constrained
  4. Given the 15-year deployment lifespan, which edge-to-cloud format should you choose?
    • JSON (flexible, easy evolution)
    • CBOR (compact, self-describing)
    • Protocol Buffers (schema evolution built-in, typed)
    • Custom binary (most compact)
    Your choice: _________ because _________

What to Observe:

  • Does the format choice differ between sensor-to-edge and edge-to-cloud? Why?
  • How does the 15-year lifespan constraint affect your decision?
  • What happens if 20% of sensors trigger anomalies simultaneously (burst scenario)?

Part A: Sensor-to-Edge

  1. Raw data rate per sensor:

    • Samples/sec: 1,000
    • Axes: 6
    • Bytes per sample: 12 bytes (6 axes × 2 bytes)
    • Data rate: 12,000 bytes/sec = 12 KB/sec
  2. Total sensor-to-edge bandwidth:

    • 12 KB/sec × 150 sensors = 1,800 KB/sec = 1.8 MB/sec = 14.4 Mbps
  3. Is bandwidth constrained?

    • No. 14.4 Mbps is only 1.4% of 1 Gbps Ethernet capacity.
  4. Recommended format: FlatBuffers or Custom Binary

    Reasoning:

    • Bandwidth is NOT constrained (only 1.4% of capacity)
    • Latency IS constrained (<10ms for anomaly detection)
    • At 1 kHz sampling, parsing overhead matters more than transmission size
    • FlatBuffers enables zero-copy access (edge gateway can read accelerometer values directly from buffer without deserializing)
    • Custom binary (just 12 raw bytes per sample) is even simpler but lacks schema evolution

    Best choice: FlatBuffers – provides zero-copy performance for real-time processing while maintaining schema evolution capability for the 15-year lifespan.

Part B: Edge-to-Cloud

  1. Normal mode cloud bandwidth:

    • Messages/sec: 150 / 60 = 2.5 msg/sec
    • JSON: 2.5 × 80 = 200 bytes/sec = 1.6 Kbps
    • Protobuf: 2.5 × 40 = 100 bytes/sec = 0.8 Kbps
  2. Anomaly mode bandwidth:

    • Anomaly sensors: 150 × 5% = 7.5 sensors/min
    • Samples per anomaly: 12,000 bytes
    • Additional: 7.5 × 12,000 = 90,000 bytes/min = 1,500 bytes/sec = 12 Kbps
  3. Is bandwidth constrained?

    • Normal mode: NOT constrained (1.6 Kbps vs 10 Mbps = 0.016%)
    • Anomaly mode: NOT constrained (12 Kbps + 1.6 Kbps = 13.6 Kbps vs 10 Mbps = 0.14%)
  4. Recommended format: Protocol Buffers

    Reasoning:

    • Bandwidth is NOT constrained (even in anomaly mode, only 0.14% of LTE capacity)
    • 15-year lifespan makes schema evolution CRITICAL
    • Protocol Buffers’ field numbering system enables:
      • Adding new sensor types without breaking old edge gateways
      • Evolving data structure as vibration analysis algorithms improve
      • Strong typing prevents data corruption at scale (150 devices × 15 years = high exposure to data quality issues)
    • 8-engineer team can afford the schema definition overhead
    • Language-agnostic (C++ on edge, Python in cloud)

    Alternative: CBOR would work if team wants schema-less flexibility, but Protobuf’s explicit schema provides better long-term maintainability.

Key Insights:

  1. Different formats for different paths: FlatBuffers (sensor→edge) vs Protobuf (edge→cloud) optimizes for different constraints
  2. Bandwidth was not the bottleneck: Both paths had >99% capacity remaining. Latency and schema evolution mattered more.
  3. Burst scenario (20% anomalies): 150 × 20% × 12 KB = 360 KB burst = 2.88 Mbps if sent within one second (still only 29% of LTE capacity – safe!)
  4. 15-year lifespan drove Protobuf choice: Schema evolution is inevitable over that timeline
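
The bandwidth figures in this solution can be checked with a short script. It assumes the 80-byte JSON summary estimate from Part B and decimal units (1 Mbps = 10^6 bits/s); the names are illustrative.

```python
SENSORS = 150
WAVEFORM_BYTES = 1000 * 6 * 2     # one second at 1 kHz, 6 axes, 16-bit values

# Part A: all sensors streaming raw samples to the edge gateway
sensor_to_edge_mbps = SENSORS * WAVEFORM_BYTES * 8 / 1e6


def cloud_kbps(summary_bytes, anomaly_fraction):
    """Average edge-to-cloud rate: per-minute summaries plus anomaly waveforms."""
    normal = SENSORS / 60 * summary_bytes * 8
    anomaly = SENSORS * anomaly_fraction * WAVEFORM_BYTES * 8 / 60
    return (normal + anomaly) / 1000


# Burst: 20% of sensors upload a one-second waveform within the same second
burst_mbps = SENSORS * 0.20 * WAVEFORM_BYTES * 8 / 1e6

print(f"sensor-to-edge: {sensor_to_edge_mbps:.1f} Mbps of 1,000 Mbps")
print(f"edge-to-cloud:  {cloud_kbps(80, 0.05):.1f} kbps of 10,000 kbps")
print(f"burst peak:     {burst_mbps:.2f} Mbps of 10 Mbps")
```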

Common Pitfalls

Sending {"temperature": "23.5"} as a string instead of {"temperature": 23.5} as a number forces consumers to parse strings, breaks numeric queries, and increases message size. Validate that all numeric fields are encoded as JSON numbers; this is one of the most common IoT data format errors.

Adding a new field to a JSON/CBOR message immediately after a firmware update means old consumers (not yet updated) receive unexpected fields. Use additive-only schema changes (never remove or rename fields), version your schemas, and handle unknown fields gracefully in all consumers.
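
One way a consumer can handle unknown fields gracefully is to read only the fields it knows and ignore the rest. This is a minimal sketch with a hypothetical three-field schema (`KNOWN_FIELDS`) and an illustrative `parse_reading` helper; a production consumer would also want logging and validation.

```python
import json

# Hypothetical v1 schema: field name -> expected type
KNOWN_FIELDS = {"id": str, "moist": float, "temp": float}


def parse_reading(payload: bytes) -> dict:
    """Keep known fields, silently skip unknown ones, tolerate missing ones."""
    msg = json.loads(payload)
    out = {}
    for name, cast in KNOWN_FIELDS.items():
        if name in msg:
            out[name] = cast(msg[name])
    return out


# A newer firmware adds "ph"; the old consumer keeps working
print(parse_reading(b'{"id":"V001","moist":45.2,"temp":18.5,"ph":6.8}'))
```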

IEEE 754 float64 uses 8 bytes per value, so a 10-field sensor reading becomes 80 bytes just for numbers. Integer-scaled fixed point (temperature × 100 as int16) reduces each value to 2 bytes while keeping 0.01-degree resolution over roughly ±327 degrees, which is ample for typical IoT ranges.
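
The size difference is easy to confirm with Python's `struct` module. This sketch assumes little-endian encoding and a scale factor of 100; both are illustrative choices that sender and receiver would need to agree on.

```python
import struct

temperature = 23.45

as_float64 = struct.pack("<d", temperature)               # IEEE 754 double
as_int16 = struct.pack("<h", round(temperature * 100))    # fixed point, 0.01 steps

decoded = struct.unpack("<h", as_int16)[0] / 100          # receiver divides by the scale

print(len(as_float64), "bytes as float64")
print(len(as_int16), "bytes as scaled int16, decodes to", decoded)
```

An int16 holds -32,768 to 32,767, so values outside about ±327 at this scale would need a different scale factor or a wider integer.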

10.9 Summary

Key Decision Factors:

  1. Bandwidth constraints - The primary driver (Wi-Fi vs LoRaWAN)
  2. Scale - Format efficiency matters more at >1000 devices
  3. Battery life - Payload size directly affects transmission energy
  4. Development time - Binary formats require more setup
  5. Schema evolution - How often will your data structure change?
  6. Total cost of ownership - Not just data costs, but development and maintenance

Quick Decision Guide:

  • Prototyping anything: recommended format JSON.
  • Wi-Fi/Ethernet, any scale: recommended format JSON.
  • LoRaWAN, NB-IoT, moderate scale: recommended format CBOR.
  • 1000+ devices, stable schema: recommended format Protobuf.
  • Sigfox, extreme battery constraints: recommended format Custom binary.

Sammy the Sensor is at the post office, trying to choose how to send a letter.

“I have four choices,” Sammy explains to the Squad:

Lila the LED reads the options:

  1. JSON = Writing a full letter – “Dear Cloud, my temperature is 23.5 degrees Celsius. Yours truly, Sammy.” Clear, but uses lots of paper!
  2. CBOR = A postcard – Same information, but smaller. Like writing in neat tiny print.
  3. Protobuf = A coded telegram – “TEMP=235 STOP” – short and structured, but you need the codebook.
  4. Custom binary = A number on a stamp – Just “235” – the tiniest possible, but only works if everyone knows the secret.

Max the Microcontroller pulls out a chart. “Here’s my decision trick:”

  • “Got a BIG mailbox? (Wi-Fi)” – Write a full letter (JSON)!
  • “Tiny mailbox? (LoRaWAN)” – Use a postcard (CBOR) or code (binary)!
  • “SUPER tiny mailbox? (Sigfox – 12 bytes only!)” – Stamp code only (custom binary)!

Bella the Battery adds: “And remember – every extra word I have to carry costs me energy. Shorter messages = longer battery life!”

The Squad’s Golden Rule: Match your message size to your mailbox size!


10.11 What’s Next

Now that you can apply the format decision tree and calculate total cost of ownership, explore these related topics:

  • Practice Scenarios: Chapter Data Formats Practice; work through detailed format selection scenarios, quizzes, and worked examples.
  • Protocol Selection: Chapter Protocol Selector Wizard; combine protocol and format selection using an interactive decision tool.
  • MQTT Payloads: Chapter MQTT Fundamentals; implement JSON and CBOR payloads in real MQTT publish/subscribe workflows.
  • CoAP with CBOR: Chapter CoAP Protocol; examine how CoAP pairs with CBOR for constrained RESTful communication.
  • LoRaWAN Constraints: Chapter LoRaWAN Overview; evaluate payload limits and spreading factor trade-offs that drive format choices.
