43  IoT Data Formats Overview

43.1 Learning Objectives

By the end of this chapter, you will be able to:

  • Understand why data formats matter: Explain how format choice impacts bandwidth, battery, and development
  • Compare human-readable formats: Evaluate JSON and XML for IoT applications
  • Recognize format trade-offs: Identify when text formats are appropriate vs. binary formats
  • Calculate payload overhead: Measure the cost of format metadata in typical sensor messages

This is part of a series on IoT Data Formats:

  1. IoT Data Formats Overview (this chapter) - Introduction and text formats
  2. Binary Data Formats - CBOR, Protobuf, custom binary
  3. Data Format Selection - Decision guides and real-world examples
  4. Data Formats Practice - Scenarios, quizzes, worked examples

Fundamentals:

  • Data Representation - Binary and hexadecimal encoding
  • Packet Structure and Framing - How data is wrapped
  • Sensor to Network Pipeline - End-to-end data flow

Networking:

  • MQTT - Messaging protocol with JSON/binary
  • CoAP - Constrained protocol with CBOR

43.2 Prerequisites

Before starting this chapter, you should be familiar with:

43.3 For Kids: How Do Devices Talk to Each Other?

Imagine you’re sending a letter to a friend in another country…

43.3.1 The Language Problem

If you wrote a letter in English and sent it to someone who only speaks Japanese, they wouldn’t understand it! The same thing happens with computers and IoT devices.

When a sensor wants to tell a computer “It’s 23 degrees outside,” it needs to say it in a way the computer understands. That’s what data formats are - they’re the languages that devices use to talk!

43.3.2 Different Ways to Say the Same Thing

Let’s say Temperature Terry wants to tell a computer it’s warm outside:

| Language | How Terry Says It | Good For |
| --- | --- | --- |
| English (JSON) | “The temperature is 23 degrees” | People who need to read it |
| Shorthand (CBOR) | “T:23” | When you want to save space |
| Number Code (Binary) | “10111” | When computers talk fast |

43.3.3 A Story: The Three Messengers

Once upon a time, three messengers needed to deliver the same message: “It’s sunny and 25 degrees.”

Messenger 1 (JSON) wrote a beautiful letter: “Dear Computer, Today the weather is sunny. The temperature is exactly 25 degrees Celsius. Have a nice day!” It was easy to read but took a long time to write and a lot of paper!

Messenger 2 (CBOR) wrote a quick note: “Sun. 25.” It was shorter and faster, but harder to read unless you knew the code!

Messenger 3 (Binary) sent dots and dashes like Morse code: “.- .--. .-..” The fastest of all, but only machines could understand it!

43.3.4 Which Language Is Best?

It depends on what you need!

| If You Need… | Use This | Why |
| --- | --- | --- |
| People to read it | JSON | It’s like regular writing |
| To save battery | CBOR or Binary | Smaller messages use less energy |
| Super fast | Binary | Computers love numbers |

43.3.5 Real Life Example: Your Fitness Tracker

When your fitness tracker counts your steps and sends them to your phone:

  1. The tracker measures: “I counted 5,000 steps today!”
  2. It picks a language: Usually a small format like CBOR (to save battery)
  3. It sends the message: Through Bluetooth to your phone
  4. Your phone translates: Turns it into words and pictures you can see!

43.3.6 Key Words for Kids

| Word | What It Means |
| --- | --- |
| Data | Information, like numbers and words |
| Format | The way information is organized |
| JSON | A popular way to write data that people can read |
| Binary | Data written in 0s and 1s (computer language) |
| Message | Data being sent from one place to another |

43.3.7 Try This at Home!

Play “Data Format” with a friend:

  1. Think of a simple message like “I have 3 apples”
  2. Long way (JSON): “I am holding three apples in my basket”
  3. Short way (CBOR): “3 apples”
  4. Code way: Hold up 3 fingers (no words at all!)

All three say the same thing, but in different ways!

43.4 For Beginners: Why Data Formats Matter

The Problem: How do you send sensor data so another device can understand it?

Sensor reads: Temperature = 23.5°C, Humidity = 65%

How to send it?
Option 1: "23.5,65"          <- Which is which? What units?
Option 2: "temp=23.5;hum=65" <- Better, but custom format
Option 3: {"temp":23.5,"humidity":65}  <- JSON (standard!)
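
To make the difference between the options concrete, here is a minimal Python sketch that builds Option 1 and Option 3 from the same reading. The variable names are illustrative assumptions, not part of any particular device SDK.

```python
import json

# The sensor reading from the example above.
reading = {"temp": 23.5, "humidity": 65}

# Option 1: bare comma-separated values. Compact, but the receiver must
# already know the field order and the units.
csv_payload = f"{reading['temp']},{reading['humidity']}"

# Option 3: JSON. Field names travel with the values, so any JSON parser
# can recover the structure without prior agreement on field order.
json_payload = json.dumps(reading, separators=(",", ":"))
decoded = json.loads(json_payload)

print(csv_payload, len(csv_payload))
print(json_payload, len(json_payload))
print(decoded["humidity"])
```

The JSON version is several times longer than the bare values, which is exactly the overhead the rest of this chapter measures.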

Analogy: Languages for Data

Data formats are like languages—both sides must speak the same one:

| Human Languages | Data Formats |
| --- | --- |
| English | JSON |
| Chinese | XML |
| Morse Code | Binary/CBOR |

Trade-offs:

| Format | Human-Readable | Size | Parse Speed | Ecosystem |
| --- | --- | --- | --- | --- |
| JSON | Excellent | Large | Medium | Universal |
| XML | Good | Huge | Slow | Legacy systems |
| CBOR | No (binary) | Compact | Fast | Growing |
| MessagePack | No (binary) | Compact | Fast | Moderate |
| Protobuf | No (binary) | Very compact | Very fast | Strong (gRPC) |
| Custom Binary | None | Smallest | Fastest | DIY only |

Note: Key Takeaway

In one sentence: Choose your data format based on bandwidth constraints and debugging needs - JSON for prototyping and high-bandwidth networks, CBOR for most IoT deployments, and custom binary only when every byte matters.

Remember this rule: CBOR gives you 80% of custom binary’s efficiency with 20% of the engineering effort - it’s the sweet spot for most constrained IoT applications.

Tip (MVU): IoT Data Format Selection

Core Concept: IoT data formats exist on a spectrum from human-readable (JSON at roughly 100 bytes) to machine-optimized (custom binary at roughly 10 bytes). The IETF standardized CBOR (RFC 8949) as a compact binary encoding designed for constrained devices, which makes it a natural middle ground for constrained IoT networks.

Why It Matters: Data format directly impacts three critical costs: bandwidth (cellular IoT charges per KB), battery life (larger payloads = longer radio-on time), and development time (binary formats require custom parsers). A smart thermostat sending JSON over Wi-Fi costs nothing extra, but the same thermostat on NB-IoT cellular could cost $50/year more in data fees versus CBOR.

Key Takeaway: Use the “50-byte rule” as a rule of thumb: if your typical payload exceeds 50 bytes, switch from JSON to CBOR. If it exceeds 100 bytes on LPWAN, consider Protobuf or custom binary. For raw readings under 20 bytes (most sensors), format overhead matters more than the data itself: a 10-byte reading becomes 50+ bytes in JSON, while CBOR keeps it under 20.


43.5 The Format Spectrum: Human-Readable to Binary

Horizontal flowchart showing evolution from human-readable to binary data formats. Left group (Human-Readable Formats) contains JSON (orange box: Size Large, Speed Medium, Ecosystem Universal, Flexibility Excellent) and XML (gray box: Size Huge, Speed Slow, Ecosystem Legacy, Flexibility Good). Right group (Binary Formats) contains CBOR (teal box: Size Compact, Speed Fast, Ecosystem Growing, Flexibility Good), Protobuf (teal box: Size Very Compact, Speed Very Fast, Ecosystem Strong, Flexibility Schema-based), and Custom Binary (gray box: Size Smallest, Speed Fastest, Ecosystem DIY, Flexibility Rigid). Arrows flow left to right showing progression: JSON to CBOR (Trade size for simplicity), CBOR to Protobuf (Add schema for efficiency), Protobuf to Custom (Remove all overhead). Each transition represents increased efficiency at cost of complexity.

Figure 43.1: Format Trade-off Comparison: Visual comparison of IoT data formats showing the progression from human-readable (JSON/XML) to ultra-compact binary formats. JSON offers universal ecosystem and simplicity but at the cost of size. CBOR balances compactness with flexibility. Protobuf adds schema enforcement for maximum efficiency. Custom binary provides ultimate size optimization but requires complete DIY implementation. The arrows show the trade-offs made when moving from one format to another.

Alternative View:

%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#2C3E50', 'primaryTextColor': '#fff', 'primaryBorderColor': '#16A085', 'lineColor': '#16A085', 'secondaryColor': '#E67E22', 'tertiaryColor': '#7F8C8D', 'clusterBkg': '#ECF0F1', 'edgeLabelBackground':'#ffffff'}}}%%
flowchart TD
    START["Choose Your Data Format"]

    Q1{"What is your<br/>PRIMARY constraint?"}

    BW["Bandwidth<br/>(LoRa, Sigfox, NB-IoT)"]
    DEV["Development Speed<br/>(Prototype, MVP)"]
    SCALE["Scale<br/>(10,000+ devices)"]
    DEBUG["Debuggability<br/>(Production monitoring)"]

    JSON_REC["<b>JSON</b><br/>━━━━━━━━<br/>• 95 bytes typical<br/>• Human-readable<br/>• Universal tools<br/>• Fastest development"]

    CBOR_REC["<b>CBOR</b><br/>━━━━━━━━<br/>• 50 bytes typical<br/>• 47% smaller than JSON<br/>• Self-describing<br/>• Good balance"]

    PROTO_REC["<b>Protobuf</b><br/>━━━━━━━━<br/>• 22 bytes typical<br/>• 77% smaller than JSON<br/>• Schema-enforced<br/>• Strong typing"]

    CUSTOM_REC["<b>Custom Binary</b><br/>━━━━━━━━<br/>• 16 bytes typical<br/>• 83% smaller than JSON<br/>• Zero overhead<br/>• Maximum efficiency"]

    START --> Q1
    Q1 -->|"Need to ship fast"| DEV
    Q1 -->|"Every byte counts"| BW
    Q1 -->|"Massive deployment"| SCALE
    Q1 -->|"Easy troubleshooting"| DEBUG

    DEV --> JSON_REC
    DEBUG --> JSON_REC
    BW --> CBOR_REC
    SCALE --> PROTO_REC

    BW -->|"Extreme limits<br/>(12-byte max)"| CUSTOM_REC

    style START fill:#2C3E50,stroke:#16A085,stroke-width:3px,color:#fff
    style Q1 fill:#7F8C8D,stroke:#2C3E50,stroke-width:2px,color:#fff
    style BW fill:#E67E22,stroke:#2C3E50,stroke-width:2px,color:#000
    style DEV fill:#E67E22,stroke:#2C3E50,stroke-width:2px,color:#000
    style SCALE fill:#E67E22,stroke:#2C3E50,stroke-width:2px,color:#000
    style DEBUG fill:#E67E22,stroke:#2C3E50,stroke-width:2px,color:#000
    style JSON_REC fill:#16A085,stroke:#2C3E50,stroke-width:2px,color:#fff
    style CBOR_REC fill:#16A085,stroke:#2C3E50,stroke-width:2px,color:#fff
    style PROTO_REC fill:#16A085,stroke:#2C3E50,stroke-width:2px,color:#fff
    style CUSTOM_REC fill:#2C3E50,stroke:#16A085,stroke-width:2px,color:#fff

Figure 43.2: Alternative view: Constraint-First Decision Tree - Rather than comparing format features, this decision tree asks “What is your primary constraint?” and guides you directly to the best format. Development speed and debuggability both lead to JSON. Bandwidth constraints suggest CBOR as the balanced choice. Large-scale deployments benefit from Protobuf’s schema enforcement and efficiency. Only extreme byte limits (like Sigfox’s 12-byte maximum) justify custom binary formats. This approach helps students make practical decisions based on real project needs rather than theoretical comparisons.
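
The decision tree can also be read as a simple lookup. The sketch below is one possible encoding of Figure 43.2; the function name `recommend_format`, the constraint keys, and the `max_payload_bytes` parameter are hypothetical names chosen for illustration.

```python
from typing import Optional

# Hypothetical mapping that mirrors the four branches of the decision tree.
RECOMMENDATION = {
    "development_speed": "JSON",
    "debuggability": "JSON",
    "bandwidth": "CBOR",
    "scale": "Protobuf",
}


def recommend_format(constraint: str, max_payload_bytes: Optional[int] = None) -> str:
    """Return the format the decision tree suggests for a primary constraint."""
    # Extreme byte limits (the figure's example is a 12-byte maximum)
    # override the usual bandwidth recommendation.
    if (
        constraint == "bandwidth"
        and max_payload_bytes is not None
        and max_payload_bytes <= 12
    ):
        return "Custom Binary"
    return RECOMMENDATION[constraint]


print(recommend_format("development_speed"))
print(recommend_format("bandwidth"))
print(recommend_format("bandwidth", max_payload_bytes=12))
```

A real project usually has more than one constraint, so treat the result as a starting point rather than a verdict.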

%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#2C3E50', 'primaryTextColor': '#fff', 'primaryBorderColor': '#16A085', 'lineColor': '#16A085', 'secondaryColor': '#E67E22', 'tertiaryColor': '#7F8C8D'}}}%%
timeline
    title IoT Project Format Evolution Journey
    section Prototype Phase
        10 Devices : JSON
                   : Easy debugging
                   : Rapid development
                   : 95 bytes per message
    section Pilot Deployment
        100 Devices : JSON or CBOR?
                    : First bandwidth concerns
                    : Cloud costs appearing
                    : Consider optimization
    section Production Scale
        1,000 Devices : CBOR
                      : 50 bytes per message
                      : 47% bandwidth savings
                      : Acceptable complexity
    section High-Scale Fleet
        10,000+ Devices : Protobuf
                        : 22 bytes per message
                        : Schema enforcement
                        : 77% savings critical
    section Ultra-Constrained
        LPWAN/Battery : Custom Binary
                      : 16 bytes per message
                      : Maximum efficiency
                      : Only when necessary

Figure 43.3: Alternative view: Project Lifecycle Journey - This timeline shows how data format choices typically evolve as an IoT project scales. During prototyping, JSON’s readability accelerates development. As the fleet grows to hundreds of devices, bandwidth costs prompt evaluation of CBOR, which becomes the default at around a thousand devices. At ten thousand devices or more, Protobuf’s efficiency becomes essential. Only ultra-constrained scenarios (LPWAN, extreme battery life) justify custom binary’s maintenance burden. This helps students understand that format choice is not static - it evolves with project needs.

43.6 Real-World Example: Temperature Reading Comparison

Note: Real-World Example: Same Sensor Reading in 4 Formats

Let’s compare how a simple temperature sensor reading is encoded in different formats. This is the actual data sent over the network:

Sensor data: Temperature = 23.5°C, Device ID = “sensor-001”, Timestamp = 1702732800

43.6.1 Format 1: JSON (Human-Readable)

{"deviceId":"sensor-001","temp":23.5,"unit":"C","ts":1702732800}
  • Size: 64 bytes
  • Hex dump: 7B 22 64 65 76 69 63 65 49 64 22 3A 22 73 65 6E 73 6F 72 2D 30 30 31 22 2C 22 74 65 6D 70 22 3A 32 33 2E 35 2C 22 75 6E 69 74 22 3A 22 43 22 2C 22 74 73 22 3A 31 37 30 32 37 33 32 38 30 30 7D
  • Overhead: Field names (“deviceId”, “temp”, “unit”, “ts”) plus JSON syntax (braces, colons, commas, quotes) = 39 bytes; only 25 bytes carry actual values
  • Readable: Yes, you can read it in a text editor
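
The size and overhead figures can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the way it counts "value bytes" (string contents and number digits) is an assumption about what counts as data rather than overhead.

```python
import json

reading = {"deviceId": "sensor-001", "temp": 23.5, "unit": "C", "ts": 1702732800}

# Compact separators drop the spaces json.dumps inserts by default.
payload = json.dumps(reading, separators=(",", ":")).encode("utf-8")

# Bytes that carry actual values: string contents and number digits.
value_bytes = sum(len(str(value)) for value in reading.values())
overhead = len(payload) - value_bytes

print(len(payload), value_bytes, overhead)
print(payload.hex(" ").upper())
```

Everything counted as overhead here (field names, quotes, braces, colons, commas) is what the binary formats below try to shrink or remove.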

43.6.2 Format 2: CBOR (Binary JSON)

A4                          # Map with 4 pairs
68 6465766963654964         # "deviceId" (8 chars)
6A 73656E736F722D303031     # "sensor-001" (10 chars)
64 74656D70                 # "temp" (4 chars)
F9 4DE0                     # 23.5 as float16
64 756E6974                 # "unit" (4 chars)
61 43                       # "C" (1 char)
62 7473                     # "ts" (2 chars)
1A 657DA400                 # 1702732800 as uint32
  • Size: 44 bytes (31% smaller than the 64-byte JSON message)
  • Overhead: Still includes field names, but uses efficient binary encoding
  • Readable: No, requires CBOR parser
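
To see where each CBOR byte comes from, the following sketch hand-encodes this one message with the standard `struct` module. It is not a general CBOR library: the helper names are hypothetical, and it only handles the cases this message needs (short text strings, one float16, one 32-bit unsigned integer). A real project would use an existing CBOR implementation.

```python
import struct


def cbor_text(s: str) -> bytes:
    """Encode a short text string (major type 3, length below 24)."""
    data = s.encode("utf-8")
    if len(data) >= 24:
        raise ValueError("this sketch only handles strings shorter than 24 bytes")
    return bytes([0x60 | len(data)]) + data


def cbor_float16(x: float) -> bytes:
    """Encode a half-precision float (initial byte 0xF9, big-endian)."""
    return b"\xf9" + struct.pack(">e", x)


def cbor_uint32(n: int) -> bytes:
    """Encode an unsigned integer in its 4-byte form (initial byte 0x1A)."""
    return b"\x1a" + struct.pack(">I", n)


message = (
    bytes([0xA0 | 4])  # map with 4 key/value pairs
    + cbor_text("deviceId") + cbor_text("sensor-001")
    + cbor_text("temp") + cbor_float16(23.5)
    + cbor_text("unit") + cbor_text("C")
    + cbor_text("ts") + cbor_uint32(1702732800)
)

print(len(message), message.hex().upper())
```

Note that the field names are still present byte for byte; the savings come from dropping quotes, colons, and commas and from storing numbers in binary.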

43.6.3 Format 3: Protocol Buffers (Schema-Based)

Schema file (shared by sender and receiver ahead of time, never sent with the messages):

message SensorReading {
  string deviceId = 1;
  float temp = 2;
  string unit = 3;
  uint64 ts = 4;
}

Binary message:

0A 0A 73656E736F722D303031  # Field 1: "sensor-001"
15 0000BC41                  # Field 2: 23.5 (float32, little-endian)
1A 01 43                     # Field 3: "C"
20 80C8F6AB06                # Field 4: 1702732800 (varint, 5 bytes)
  • Size: 26 bytes (59% smaller than the 64-byte JSON message)
  • Overhead: Field numbers (1, 2, 3, 4) instead of names
  • Readable: No, requires schema + protoc
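
The wire bytes can be reproduced without the Protobuf toolchain. The sketch below hand-encodes the `SensorReading` message following the Protobuf wire format (varint tags, length-delimited strings, fixed 32-bit floats); the helper function names are hypothetical, and in practice you would generate code from the schema with `protoc` instead.

```python
import struct


def varint(n: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def tag(field_number: int, wire_type: int) -> bytes:
    """Field key: field number shifted left by 3, combined with the wire type."""
    return varint((field_number << 3) | wire_type)


def length_delimited(field_number: int, data: bytes) -> bytes:
    """Wire type 2: tag, length, then the raw bytes."""
    return tag(field_number, 2) + varint(len(data)) + data


message = (
    length_delimited(1, b"sensor-001")        # string deviceId = 1
    + tag(2, 5) + struct.pack("<f", 23.5)     # float temp = 2 (32-bit, little-endian)
    + length_delimited(3, b"C")               # string unit = 3
    + tag(4, 0) + varint(1702732800)          # uint64 ts = 4 (varint)
)

print(len(message), message.hex().upper())
```

The timestamp takes 5 bytes as a varint because each byte carries only 7 bits of payload; a `fixed32` field would take 4 bytes plus the tag.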

43.6.4 Format 4: Custom Binary (DIY)

Byte layout:
[0-9]:   deviceId "sensor-001" (10 bytes, ASCII)
[10-11]: temp = 235 (uint16, big-endian, degrees C x 10, so 235 = 23.5)
[12]:    unit = 0 (enum: 0=C, 1=F)
[13-16]: timestamp (uint32, big-endian, seconds since epoch)

Hex: 73656E736F722D30303100EB00657DA400
  • Size: 17 bytes (73% smaller than the 64-byte JSON message)
  • Overhead: Zero! Every byte is data
  • Readable: No, requires custom parser
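
A fixed layout like this maps directly onto Python's `struct` module. The sketch below assumes big-endian byte order and no padding; the names `LAYOUT`, `encode`, and `decode` are illustrative.

```python
import struct

# Big-endian, no padding: 10-byte ID, uint16 temperature x 10,
# uint8 unit code, uint32 timestamp.
LAYOUT = struct.Struct(">10sHBI")


def encode(device_id: str, temp_c: float, unit: int, ts: int) -> bytes:
    """Pack one reading into the 17-byte frame."""
    return LAYOUT.pack(device_id.encode("ascii"), round(temp_c * 10), unit, ts)


def decode(frame: bytes):
    """Unpack a 17-byte frame back into its fields."""
    raw_id, temp_x10, unit, ts = LAYOUT.unpack(frame)
    return raw_id.decode("ascii"), temp_x10 / 10, unit, ts


frame = encode("sensor-001", 23.5, 0, 1702732800)
print(len(frame), frame.hex().upper())
print(decode(frame))
```

One design caveat: an unsigned 16-bit temperature cannot represent readings below zero. A layout meant for outdoor sensors would more likely use a signed field (`h` instead of `H`), and every such change has to be made on both the sender and the receiver, which is the maintenance cost of custom binary.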

43.6.5 Size Comparison Summary

| Format | Bytes | Reduction | Messages per GB | Cost per 1M messages @ $0.01/KB |
| --- | --- | --- | --- | --- |
| JSON | 64 | 0% (baseline) | 15.6M | $640 |
| CBOR | 44 | 31% | 22.7M | $440 |
| Protobuf | 26 | 59% | 38.5M | $260 |
| Custom | 17 | 73% | 58.8M | $170 |

Real-world impact: For 100 sensors sending data every 60 seconds (144,000 messages/day) over cellular at $0.01/KB:

  • JSON: $92/day data cost
  • CBOR: $63/day (31% savings = $29/day)
  • Protobuf: $37/day (59% savings = $55/day)
  • Custom: $24/day (73% savings = $68/day)

Key insight: The savings multiply with scale. For a 10,000-sensor deployment at the same rate and tariff, choosing Protobuf over JSON saves about $5,470/day, or roughly $2 million/year, in data costs alone!
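
The cost figures follow from simple arithmetic that is easy to rerun for your own tariff. In the sketch below, the message sizes are the byte counts obtained by counting each encoding of the example reading (64, 44, 26, and 17 bytes), and a kilobyte is assumed to be 1,000 bytes; both are assumptions to adjust for your own encoder and data plan.

```python
# Bytes per message for the example reading (assumed; measure your own encoder).
SIZES = {"JSON": 64, "CBOR": 44, "Protobuf": 26, "Custom": 17}
PRICE_PER_KB = 0.01   # USD, the tariff assumed in this section
BYTES_PER_KB = 1000   # assumption: decimal kilobytes


def cost_usd(num_messages: int, size_bytes: int) -> float:
    """Data cost in USD for a number of messages of a given size."""
    return num_messages * size_bytes / BYTES_PER_KB * PRICE_PER_KB


msgs_per_day = 100 * 24 * 60  # 100 sensors, one message per minute

for name, size in SIZES.items():
    reduction = 1 - size / SIZES["JSON"]
    print(
        f"{name:9s} {size:3d} B  {reduction:4.0%}  "
        f"${cost_usd(1_000_000, size):,.0f} per 1M msgs  "
        f"${cost_usd(msgs_per_day, size):,.2f} per day"
    )
```

Because cost is linear in both message size and message count, multiplying the fleet by 100 multiplies every figure by 100.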


43.7 JSON - The Universal Choice

JavaScript Object Notation (JSON) is the most widely used data format in IoT.

Example:

{
  "deviceId": "sensor-001",
  "temp": 23.5,
  "humidity": 65,
  "timestamp": 1702834567
}

Size: about 91 bytes as pretty-printed above, or 74 bytes with whitespace removed
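
Whitespace is part of the payload, so how the JSON is serialized matters. This short sketch, using only the standard `json` module, compares the pretty-printed and compact forms of the example message.

```python
import json

reading = {
    "deviceId": "sensor-001",
    "temp": 23.5,
    "humidity": 65,
    "timestamp": 1702834567,
}

pretty = json.dumps(reading, indent=2)                 # readable, for logs and docs
compact = json.dumps(reading, separators=(",", ":"))   # what a device should send

print(len(pretty), len(compact))
```

Sending the compact form is a free optimization: it needs no new library and stays fully readable by any JSON parser.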

Pros:

  • Human-readable, easy to debug
  • Universal support (every language, platform, tool)
  • Self-describing (field names included)
  • Easy schema evolution

Cons:

  • Large overhead (field names, quotes, braces)
  • Inefficient for bandwidth-constrained networks
  • Parsing requires more CPU/memory than binary

Best for: Wi-Fi, Ethernet, cellular IoT where bandwidth isn’t critical


Warning: Common Misconception Alert: “JSON is Too Heavy for IoT”

Myth: “JSON is too large and slow for IoT systems - you should always use binary formats.”

Reality: It depends on your constraints!

43.7.1 When JSON is Perfect for IoT:

  • Wi-Fi/Ethernet/LTE networks: Bandwidth is plentiful (megabits/sec), so a JSON message of around 60 bytes is negligible
  • Development/debugging: JSON is human-readable, reducing debugging time by hours
  • Small deployments: For <100 devices, the total bandwidth difference is often <10GB/year
  • Rapid prototyping: JSON libraries exist in every language, accelerating development
  • Cloud integration: Most cloud IoT platforms (AWS IoT, Azure IoT) default to JSON

43.7.2 When to Consider Binary Formats:

  • LPWAN networks (LoRaWAN, Sigfox, NB-IoT): Bandwidth measured in bytes/sec, not megabits/sec
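  • High message volume: 1,000 devices each sending one ~60-byte message per minute already move about 30 GB/year, and larger fleets or faster rates reach TB/year scale
  • Data cost constraints: Cellular data at $0.01/KB x 1 million messages = $640 (64-byte JSON) vs $170 (17-byte custom binary)
  • Power-critical devices: Transmitting 60 bytes vs 17 bytes means roughly 3.5x more payload airtime, and radio energy grows with time on air (fixed protocol headers narrow the gap somewhat)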

43.7.3 Real-World Data Point:

Smart thermostat (Wi-Fi, 1 message/5 minutes):

  • JSON: 60 bytes x 12 msgs/hour x 24 hours x 365 days = 6.3 MB/year
  • Custom binary: 17 bytes x same = 1.8 MB/year
  • Savings: 4.5 MB/year per device. On unmetered Wi-Fi that costs nothing; on a metered link at $0.01/KB it would be about $45/year per device.

Verdict: For a 1000-home Wi-Fi deployment, JSON adds about 4.5 GB/year of traffic in total, spread across home broadband links that will never notice it. Is that worth the engineering complexity of maintaining a custom format? Usually no!
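
The thermostat arithmetic is worth keeping as a small helper so you can plug in your own message size and rate. The function name `annual_bytes` is a hypothetical choice, and the metered price is the $0.01/KB tariff used elsewhere in this chapter, with 1 KB taken as 1,000 bytes.

```python
def annual_bytes(payload_bytes: int, msgs_per_hour: int) -> int:
    """Total payload bytes sent by one device in a 365-day year."""
    return payload_bytes * msgs_per_hour * 24 * 365


json_year = annual_bytes(60, 12)     # one 60-byte message every 5 minutes
binary_year = annual_bytes(17, 12)   # same rate, 17-byte custom binary

saved_bytes = json_year - binary_year
saved_mb = saved_bytes / 1_000_000
metered_cost = saved_bytes / 1000 * 0.01  # USD if the link charged $0.01/KB

print(f"JSON:   {json_year / 1_000_000:.1f} MB/year")
print(f"Binary: {binary_year / 1_000_000:.1f} MB/year")
print(f"Saved:  {saved_mb:.1f} MB/year (${metered_cost:.2f} on a metered link)")
```

The same saving is irrelevant on Wi-Fi and significant on metered cellular, which is why the network type should drive the decision more than the format's raw size.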

Key Lesson: Don’t optimize prematurely. Start with JSON, measure your actual bandwidth usage, then optimize if needed. Many production IoT systems run JSON happily for years before hitting bandwidth limits.

When binary formats actually matter:

  1. Agricultural soil sensor (LoRaWAN): 12 readings/day x 60 bytes JSON = 720 bytes/day. The daily total is small, but each 60-byte message exceeds the 51-byte maximum payload at the slowest LoRaWAN data rates (for example EU868 DR0), and long airtime eats into duty-cycle limits. Use CBOR or custom binary.
  2. City parking sensor (Sigfox): 12-byte message limit. Must use custom binary, no choice.
  3. Fitness tracker (BLE): 1 reading/sec x 60 bytes x 3600 sec/hour = 216 KB/hour. Constant radio use drains the battery, so use an efficient binary format.
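
A quick way to apply these cases is to compare each encoding's size against the network's payload limit. In this sketch the Sigfox limit of 12 bytes comes from the text; the 51-byte LoRaWAN figure is an assumption that applies to the slowest EU868 data rates, and the payload sizes are the byte counts of the example reading's four encodings. All names are illustrative.

```python
# Assumed uplink payload limits in bytes; check your region and data rate.
LIMITS = {"Sigfox": 12, "LoRaWAN (EU868, slowest data rate)": 51}

# Byte counts of the example reading in each format (assumed; measure your own).
PAYLOADS = {"JSON": 64, "CBOR": 44, "Protobuf": 26, "Custom": 17}


def fits(payload_bytes: int, limit_bytes: int) -> bool:
    """True if a payload fits in a single uplink message."""
    return payload_bytes <= limit_bytes


for network, limit in LIMITS.items():
    usable = [name for name, size in PAYLOADS.items() if fits(size, limit)]
    print(f"{network} ({limit} B): {usable or 'nothing fits, trim the payload'}")
```

Even the 17-byte custom layout does not fit in a 12-byte Sigfox frame as designed, so on that network the fields themselves have to be trimmed, not just the encoding.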

Bottom line: Use JSON by default. Switch to binary formats when you have actual evidence (measurements, not assumptions) that bandwidth or power consumption is a problem.


43.8 Summary

Key Points:

  • Data formats are the “languages” devices use to communicate
  • JSON is human-readable but verbose (~95 bytes for typical sensor data)
  • Binary formats shrink the 64-byte example message by 31% (CBOR), 59% (Protobuf), and 73% (custom binary)
  • Format choice impacts bandwidth costs, battery life, and development time
  • Start with JSON for prototyping, optimize later if needed

Format Overview Table:

| Format | Size | Readability | Best For |
| --- | --- | --- | --- |
| JSON | Large (baseline) | Excellent | Wi-Fi, prototyping, debugging |
| XML | Very large | Good | Legacy systems |
| CBOR | ~31% smaller | None (binary) | LoRaWAN, NB-IoT, CoAP |
| Protobuf | ~59% smaller | None (binary) | High-volume, gRPC |
| Custom | ~73% smaller | None (binary) | Sigfox, ultra-low-power |

43.9 What’s Next

Now that you understand why data formats matter and how JSON compares to binary alternatives:

Continue to Binary Data Formats →