42 Binary Data Formats for IoT

42.1 Learning Objectives

By the end of this chapter, you will be able to:

Implement CBOR encoding: Encode and decode sensor data using CBOR’s binary format
Design Protocol Buffer schemas: Create .proto files for efficient typed data serialization
Build custom binary formats: Design byte-level encodings for ultra-constrained devices
Choose between binary formats: Select CBOR, Protobuf, or custom binary based on requirements
Handle schema evolution: Plan for future changes without breaking compatibility

Related Chapters

This is part of a series on IoT Data Formats:

IoT Data Formats Overview - Introduction and text formats
Binary Data Formats (this chapter) - CBOR, Protobuf, custom binary
Data Format Selection - Decision guides and real-world examples
Data Formats Practice - Scenarios, quizzes, worked examples

Technical Deep Dives:

Data Representation - Binary and hexadecimal encoding
Packet Structure and Framing - Protocol headers and framing
CoAP - CBOR’s primary protocol

42.2 Prerequisites

Before starting this chapter, you should be familiar with:

IoT Data Formats Overview: Understanding why data formats matter
Data Representation: Binary encoding and byte operations

42.3 CBOR - Compact Binary Object Representation

CBOR is “binary JSON” - same data model, much smaller size.

The same sensor reading (deviceId "sensor-001", temp 23.5, humidity 65, timestamp 1702834567) in CBOR:

A4                        # Map with 4 pairs
  68 6465766963654964     # "deviceId" (8-byte text string)
  6A 73656E736F722D303031 # "sensor-001" (10-byte text string)
  64 74656D70             # "temp"
  F9 4DE0                 # 23.5 (float16)
  68 68756D6964697479     # "humidity"
  18 41                   # 65 (uint8)
  69 74696D657374616D70   # "timestamp"
  1A 657F3187             # 1702834567 (uint32)

Size: ~50 bytes (47% smaller than JSON!)
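As a rough cross-check of the byte listing, the sketch below hand-encodes the same reading with Python's standard library only. The helper names (`cbor_head`, `cbor_text`, `cbor_float16`) are illustrative and not part of any CBOR library; a real project would normally rely on an existing implementation. Encoded field by field in this layout, the message comes to 55 bytes, in the same ballpark as the rounded ~50-byte figure.

```python
import struct

def cbor_head(major: int, value: int) -> bytes:
    """Encode the initial byte (major type in the top 3 bits) plus its argument."""
    if value < 24:
        return bytes([(major << 5) | value])           # argument fits in the initial byte
    if value < 0x100:
        return bytes([(major << 5) | 24, value])       # 1-byte argument
    if value < 0x10000:
        return bytes([(major << 5) | 25]) + struct.pack(">H", value)
    if value < 0x100000000:
        return bytes([(major << 5) | 26]) + struct.pack(">I", value)
    return bytes([(major << 5) | 27]) + struct.pack(">Q", value)

def cbor_text(text: str) -> bytes:
    data = text.encode("utf-8")
    return cbor_head(3, len(data)) + data              # major type 3 = text string

def cbor_uint(value: int) -> bytes:
    return cbor_head(0, value)                         # major type 0 = unsigned integer

def cbor_float16(value: float) -> bytes:
    return b"\xf9" + struct.pack(">e", value)          # 0xF9 = half-precision float

reading = (
    cbor_head(5, 4)                                    # major type 5 = map, 4 pairs
    + cbor_text("deviceId") + cbor_text("sensor-001")
    + cbor_text("temp") + cbor_float16(23.5)
    + cbor_text("humidity") + cbor_uint(65)
    + cbor_text("timestamp") + cbor_uint(1702834567)
)

print(reading.hex(" "))
print(len(reading), "bytes")
```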

Pros:

Much smaller than JSON (30-60% reduction)
Faster parsing than JSON
Same data model as JSON (easy migration)
IETF standard (RFC 8949)

Cons:

Not human-readable (need hex viewer + parser)
Smaller ecosystem than JSON
Still includes field names (overhead)

Best for: CoAP, LoRaWAN, NB-IoT

Minimum Viable Understanding: Serialization Fundamentals

Core Concept: Serialization converts in-memory data structures (objects, arrays, numbers) into a sequence of bytes that can be transmitted over a network or stored on disk, and deserialization reverses this process - the choice of serialization format determines message size, parsing speed, and cross-platform compatibility.

Why It Matters: Serialization is the bridge between your code and the network. A temperature reading stored as a 32-bit float (4 bytes) in memory becomes anywhere from 2 bytes (custom binary) to 50+ bytes (JSON with metadata) when serialized. This overhead multiplies across every message, every device, every day. For a 10,000-device fleet sending hourly updates, the ~45 bytes saved per message by switching from JSON (~95 bytes) to CBOR (~50 bytes) add up to roughly 4 GB/year of payload, with proportional reductions in cellular data costs and battery consumption.

Key Takeaway: Match serialization format to your encoding/decoding location. If both sender and receiver are microcontrollers (embedded-to-embedded), use compact binary formats like CBOR or custom binary. If data flows to cloud services (embedded-to-cloud), CBOR or Protobuf balance efficiency with ecosystem support. If humans need to debug or inspect data (any-to-dashboard), keep JSON for at least the final hop. The encoding cost is paid once per message; choose based on who needs to read it.
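The fleet-level arithmetic can be checked with a few lines. This is a back-of-the-envelope sketch that uses the chapter's rounded payload sizes (95 and 50 bytes) and counts payload bytes only; protocol headers and retransmissions are deliberately left out.

```python
JSON_BYTES = 95          # rounded JSON payload size used in this chapter
CBOR_BYTES = 50          # rounded CBOR payload size used in this chapter
DEVICES = 10_000
MESSAGES_PER_DAY = 24    # hourly updates

def yearly_payload_bytes(payload_bytes: int) -> int:
    """Total payload bytes sent by the whole fleet in a 365-day year."""
    return payload_bytes * MESSAGES_PER_DAY * 365 * DEVICES

saved = yearly_payload_bytes(JSON_BYTES) - yearly_payload_bytes(CBOR_BYTES)
print(f"JSON: {yearly_payload_bytes(JSON_BYTES) / 1e9:.2f} GB/year")
print(f"CBOR: {yearly_payload_bytes(CBOR_BYTES) / 1e9:.2f} GB/year")
print(f"Saved: {saved / 1e9:.2f} GB/year")
```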

Deep Dive: CBOR Type System and Advanced Features

CBOR’s power comes from its rich type system that goes beyond JSON’s limited types.

Major Type Encoding (first 3 bits of initial byte):

Type	Range	Description	Example
0	0x00-0x1F	Unsigned integer	`0x17` = 23
1	0x20-0x3F	Negative integer	`0x37` = -24
2	0x40-0x5F	Byte string	`0x44` + 4 bytes
3	0x60-0x7F	Text string	`0x64` + “temp”
4	0x80-0x9F	Array	`0x82` = 2-item array
5	0xA0-0xBF	Map	`0xA4` = 4-pair map
6	0xC0-0xDF	Tagged value	`0xC1` = epoch time
7	0xE0-0xFF	Special/float	`0xF9` = float16

Compact integer encoding:

0-23: Single byte (0x00 to 0x17)
24-255: Two bytes (0x18 + value)
256-65535: Three bytes (0x19 + 2-byte value)
Larger: Five bytes (0x1A + 4-byte value) or nine bytes (0x1B + 8-byte value)
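The major-type table and the integer size rules both come down to reading one initial byte. The following sketch (the function name `decode_head` is illustrative) splits that byte into its 3-bit major type and 5-bit additional information, then reads the argument that follows. It does not handle indefinite-length items, and for major type 7 the returned argument is the raw bit pattern rather than a decoded float.

```python
MAJOR_TYPES = [
    "unsigned int", "negative int", "byte string", "text string",
    "array", "map", "tag", "simple/float",
]

def decode_head(data: bytes):
    """Return (major type name, argument, head length in bytes) for one CBOR item."""
    initial = data[0]
    major = initial >> 5          # top 3 bits
    info = initial & 0x1F         # low 5 bits
    if info < 24:
        return MAJOR_TYPES[major], info, 1              # argument stored inline
    if info <= 27:
        size = 1 << (info - 24)                         # 24..27 -> 1, 2, 4 or 8 bytes
        argument = int.from_bytes(data[1:1 + size], "big")
        return MAJOR_TYPES[major], argument, 1 + size
    raise ValueError("reserved or indefinite-length additional info is not handled")

for sample in ("17", "1841", "1a657f3187", "37", "a4", "6474656d70"):
    print(sample, "->", decode_head(bytes.fromhex(sample)))
```

For a negative integer the decoded value is `-1 - argument`, so `0x37` (argument 23) stands for -24, as in the table.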

Tags useful in IoT payloads (defined in RFC 8949):

Tag	Meaning	Use Case
0	Date/time string	ISO 8601 timestamps
1	Epoch timestamp	Unix time (compact)
2	Positive bignum	Large sensor IDs
32	URI	Resource identifiers
55799	Self-describe CBOR	Magic number for detection
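To make the difference between tag 0 and tag 1 concrete, here is a small sketch that builds both representations of the chapter's timestamp by hand. The byte values follow the header rules described above; the variable names are illustrative.

```python
import struct
from datetime import datetime, timezone

ts = 1702834567

# Tag 1 (0xC1) wrapping a uint32 (0x1A + 4 bytes).
epoch_item = bytes([0xC1, 0x1A]) + struct.pack(">I", ts)

# Tag 0 (0xC0) wrapping a date/time text string.
iso = datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
iso_bytes = iso.encode("ascii")
assert len(iso_bytes) < 24        # length still fits in the initial byte (0x60 | length)
string_item = bytes([0xC0, 0x60 | len(iso_bytes)]) + iso_bytes

print("tag 1:", epoch_item.hex(), f"({len(epoch_item)} bytes)")
print("tag 0:", iso, f"({len(string_item)} bytes)")
```

On a constrained link the epoch form is usually the better fit: 6 bytes instead of 22 for the same instant.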

Float precision selection:

0xF9 + 2 bytes: float16 (3-4 significant digits)
0xFA + 4 bytes: float32 (7 significant digits)
0xFB + 8 bytes: float64 (15 significant digits)

IoT optimization tip: Use float16 for sensor readings (temp, humidity) where 3 digits of precision is sufficient. Saves 2-6 bytes per value!
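A quick way to judge whether float16 is precise enough is to round-trip a representative value through each width. The sketch below uses Python's `struct` formats `e`, `f` and `d` (half, single, double) in big-endian order, which is the byte order CBOR uses; the sample value 23.47 is arbitrary.

```python
import struct

def roundtrip(value: float, fmt: str) -> float:
    """Pack a value with the given struct float format and unpack it again."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]

WIDTHS = (("float16", ">e", 0xF9), ("float32", ">f", 0xFA), ("float64", ">d", 0xFB))

for name, fmt, marker in WIDTHS:
    encoded = bytes([marker]) + struct.pack(fmt, 23.47)   # CBOR marker byte + payload
    print(f"{name}: {len(encoded)} bytes, decodes to {roundtrip(23.47, fmt)!r}")
```

In float16, 23.47 comes back as 23.46875, an error of about 0.001, which is well inside the accuracy of typical temperature sensors. Values that need finer steps or a wider range are better served by float32.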

Debugging CBOR: Use the cbor2diag tool to convert binary to diagnostic notation:

echo "A2 64 74656D70 F9 4DE0 68 68756D6964697479 18 41" | xxd -r -p | cbor2diag
# Output: {"temp": 23.5, "humidity": 65}

Three-panel flowchart showing JSON to CBOR encoding transformation. Top panel (orange): JSON Input showing 95-byte structure with deviceId sensor-001, temp 23.5, humidity 65, timestamp 1702834567. Middle panel (gray/navy): CBOR Encoding Process flows from Parse JSON to Map header A4 (4 pairs), then branches to four parallel field encodings: Field 1 deviceId string to value string (10 bytes), Field 2 temp string to value float16 F9 4DE0 (3 bytes), Field 3 humidity string to value uint8 18 41 (2 bytes), Field 4 timestamp string to value uint32 1A 657F3187 (5 bytes). Bottom panel (teal): CBOR Output showing 50-byte binary representation A4 68...4964 6A...303031 64...6D70 F9 4DE0 68...697479 18 41 69...616D70 1A 657F3187. Side annotation (teal box): 47% Size Reduction, 95 bytes to 50 bytes, 45 bytes saved. Demonstrates bandwidth savings from compact binary encoding.

%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#2C3E50', 'primaryTextColor': '#fff', 'primaryBorderColor': '#16A085', 'lineColor': '#16A085', 'secondaryColor': '#E67E22', 'tertiaryColor': '#7F8C8D', 'clusterBkg': '#ECF0F1'}}}%%
graph LR
    subgraph "Analogy: Sending a Package"
        LETTER["<b>Letter (JSON)</b><br/>━━━━━━━━━━<br/>Dear recipient,<br/>I am sending you<br/>the temperature<br/>which is 23.5...<br/>━━━━━━━━━━<br/>Full sentences<br/>Large envelope"]

        TELEGRAM["<b>Telegram (CBOR)</b><br/>━━━━━━━<br/>TEMP 23.5 STOP<br/>HUM 65 STOP<br/>━━━━━━━<br/>Abbreviated<br/>Medium size"]

        BARCODE["<b>Barcode (Protobuf)</b><br/>━━━━<br/>barcode pattern<br/>(encoded data)<br/>━━━━<br/>Schema lookup<br/>Small label"]
    end

    subgraph "What Each Preserves"
        PRESERVE["<b>All contain same info:</b><br/>• Device ID<br/>• Temperature<br/>• Humidity<br/>• Timestamp<br/>━━━━━━━━━━<br/>Different packaging<br/>Same content!"]
    end

    LETTER -->|"Remove<br/>verbosity"| TELEGRAM
    TELEGRAM -->|"Use codes<br/>not words"| BARCODE

    LETTER -.-> PRESERVE
    TELEGRAM -.-> PRESERVE
    BARCODE -.-> PRESERVE

    style LETTER fill:#E67E22,stroke:#2C3E50,stroke-width:2px,color:#000
    style TELEGRAM fill:#16A085,stroke:#2C3E50,stroke-width:2px,color:#fff
    style BARCODE fill:#2C3E50,stroke:#16A085,stroke-width:2px,color:#fff
    style PRESERVE fill:#ECF0F1,stroke:#7F8C8D,stroke-width:1px,color:#000

Figure 42.2: Alternative view: Postal Analogy - Data formats are like different ways to mail information. JSON is like a formal letter with full sentences - readable but verbose. CBOR is like a telegram - abbreviated but still using words. Protobuf is like a barcode - meaningless without a scanner (schema) but ultra-compact. All three deliver the same information; they just package it differently. This analogy helps beginners understand that compression does not mean data loss - it means smarter encoding.

42.4 Protocol Buffers (Protobuf)

Google’s binary format with schema definition.

Schema file (.proto):

syntax = "proto3";

message SensorReading {
  string deviceId = 1;
  float temp = 2;
  uint32 humidity = 3;
  uint64 timestamp = 4;
}

Size: ~22 bytes (77% smaller than JSON!)

Pros:

Extremely compact (no field names sent)
Very fast parsing (code generation)
Strong typing and schema evolution
Good tooling (protoc compiler)

Cons:

Requires schema file on both ends
Not self-describing (can’t parse without schema)
More complex setup

Best for: High-volume data pipelines, gRPC APIs, edge-to-cloud

Deep Dive: Protobuf Binary Encoding Details

Understanding how Protocol Buffers encodes data helps you estimate payload sizes and debug wire format issues.

Binary encoding breakdown:

0A 0A 73656E736F722D303031  # Field 1 (deviceId): tag, length 10, "sensor-001"
15 0000BC41                  # Field 2 (temp): 23.5 as little-endian float32
18 41                        # Field 3 (humidity): 65 as varint
20 87E3FCAB06                # Field 4 (timestamp): 1702834567 as 5-byte varint

Encoding rules:

Wire Type	Meaning	Used For
0	Varint	int32, int64, uint32, uint64, bool, enum
1	64-bit	fixed64, sfixed64, double
2	Length-delimited	string, bytes, embedded messages
5	32-bit	fixed32, sfixed32, float

Field tag format: (field_number << 3) | wire_type

Field 1 (string): 0A = (1 << 3) | 2 = 0x0A
Field 2 (float): 15 = (2 << 3) | 5 = 0x15
Field 3 (uint32): 18 = (3 << 3) | 0 = 0x18
Field 4 (uint64): 20 = (4 << 3) | 0 = 0x20
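The tag formula is easy to verify in both directions. In this sketch `make_key` and `split_key` are illustrative helper names, not part of the protobuf API. Note that the key is itself a varint, so it stays at one byte only for field numbers 1 to 15.

```python
WIRE_TYPES = {0: "varint", 1: "64-bit", 2: "length-delimited", 5: "32-bit"}

def make_key(field_number: int, wire_type: int) -> int:
    """Combine field number and wire type into the key that precedes each field."""
    return (field_number << 3) | wire_type

def split_key(key: int):
    """Recover (field number, wire type) from a decoded key."""
    return key >> 3, key & 0x07

for key in (0x0A, 0x15, 0x18, 0x20):
    field_number, wire_type = split_key(key)
    print(f"0x{key:02X}: field {field_number}, wire type {wire_type} ({WIRE_TYPES[wire_type]})")
```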

Varint encoding (for integers):

Uses 7 bits per byte, MSB indicates continuation
Small values are compact (1-2 bytes)
Large values expand (up to 10 bytes for uint64)
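A minimal varint encoder and decoder for unsigned values, written as a sketch of the rule above rather than as a replacement for a protobuf library. It shows why the humidity value 65 takes one byte while the timestamp 1702834567 takes five.

```python
def encode_varint(value: int) -> bytes:
    """Encode an unsigned integer, 7 bits per byte, least significant group first."""
    if value < 0:
        raise ValueError("this sketch handles unsigned values only")
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)   # MSB set: more bytes follow
        else:
            out.append(low7)          # MSB clear: last byte
            return bytes(out)

def decode_varint(data: bytes, pos: int = 0):
    """Return (value, position after the varint)."""
    result = shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7

for sample in (65, 300, 1702834567, 2**64 - 1):
    encoded = encode_varint(sample)
    print(sample, "->", encoded.hex(), f"({len(encoded)} bytes)")
```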

Schema evolution rules:

New fields: Add with new field numbers (old clients ignore)
Removed fields: Mark as reserved (never reuse numbers)
Type changes: Only compatible types (int32 <-> int64)
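The "old clients ignore new fields" rule works because every field carries its wire type, so a reader can skip a field it does not know. The sketch below is a hand-written reader for fields 1 to 4 of `SensorReading`; the extra field 5 (a varint with the value 95) is a hypothetical addition from a newer schema, used only for illustration. Generated protobuf code does the same thing internally.

```python
import struct

def read_varint(buf: bytes, pos: int):
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7

def decode_v1(buf: bytes) -> dict:
    """Decode fields 1-4 and silently skip any field number this reader does not know."""
    known = {1: "deviceId", 2: "temp", 3: "humidity", 4: "timestamp"}
    out, pos = {}, 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        field, wire = key >> 3, key & 0x07
        if wire == 0:                                   # varint
            value, pos = read_varint(buf, pos)
        elif wire == 1:                                 # 64-bit
            value, pos = buf[pos:pos + 8], pos + 8
        elif wire == 2:                                 # length-delimited
            length, pos = read_varint(buf, pos)
            value, pos = buf[pos:pos + length], pos + length
        elif wire == 5:                                 # 32-bit
            value, pos = buf[pos:pos + 4], pos + 4
        else:
            raise ValueError(f"unsupported wire type {wire}")
        if field not in known:
            continue                                    # field from a newer schema: skipped
        if field == 1:
            value = value.decode("utf-8")
        elif field == 2:
            value = struct.unpack("<f", value)[0]
        out[known[field]] = value
    return out

v1 = bytes.fromhex("0a0a73656e736f722d303031" "150000bc41" "1841" "2087e3fcab06")
v2 = v1 + bytes.fromhex("285f")    # hypothetical field 5, varint 95

print(len(v1), decode_v1(v1))
print(len(v2), decode_v1(v2))      # same result: the unknown field is ignored
```

With the schema shown earlier, the exact encoded size of this message is 25 bytes, close to the rounded ~22-byte figure quoted for Protobuf.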

42.5 Custom Binary Formats

For ultimate efficiency, define your own binary format.

Example: Same data in 16 bytes

Byte layout:
[0-9]:   deviceId "sensor-001" (10 bytes, no null terminator)
[10]:    temp = 235 (uint8, value x 10; covers 0.0-25.5C only)
[11]:    humidity = 65 (uint8)
[12-15]: timestamp (uint32, seconds since epoch)

Total: 16 bytes
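The layout maps directly onto a `struct` format string. This is a sketch under two assumptions that the layout itself does not state: multi-byte fields are little-endian, and out-of-range inputs should be rejected rather than silently wrapped. The function names are illustrative.

```python
import struct

# "<" = little-endian with no padding (assumed byte order):
# 10-byte id, uint8 temp x 10, uint8 humidity, uint32 timestamp.
LAYOUT = struct.Struct("<10sBBI")

def encode_reading(device_id: str, temp_c: float, humidity: int, timestamp: int) -> bytes:
    ident = device_id.encode("ascii")
    if len(ident) != 10:
        raise ValueError("deviceId must be exactly 10 ASCII bytes")
    temp_raw = round(temp_c * 10)
    if not 0 <= temp_raw <= 255:
        raise ValueError("temperature does not fit a uint8 scaled by 10 (0.0-25.5 C)")
    return LAYOUT.pack(ident, temp_raw, humidity, timestamp)

def decode_reading(frame: bytes):
    ident, temp_raw, humidity, timestamp = LAYOUT.unpack(frame)
    return ident.decode("ascii"), temp_raw / 10, humidity, timestamp

frame = encode_reading("sensor-001", 23.5, 65, 1702834567)
print(frame.hex(" "), f"({len(frame)} bytes)")
print(decode_reading(frame))
```

The range check is the important part: a one-byte temperature scaled by 10 tops out at 25.5, so a warmer reading needs a wider field or a different scale.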

Pros:

Smallest possible size
Fastest parsing (no overhead)
Complete control

Cons:

No tooling, DIY everything
No schema evolution (breaking changes)
Not self-describing
Maintenance burden

Best for: Extremely constrained devices (Sigfox, ultra-low-power)

Deep Dive: Custom Binary Format Design Patterns

When standard formats are too large, custom binary encoding becomes necessary. Here are proven patterns for designing efficient custom formats.

Pattern 1: Fixed-Point Encoding

Instead of floating-point (4 bytes), use scaled integers:

Value Range	Encoding	Bytes	Example
Temp: -40.0 to 85.0C	uint8 = T + 40 (1C steps)	1	23.5C -> 64 (rounded)
Humidity: 0-100%	uint8	1	65% -> 65
Voltage: 0-5.0V	uint8 x 50	1	3.3V -> 165
GPS lat/lng	int32 x 10^6	4	37.7749 -> 37774900
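The table rows all follow one rule: add an offset, multiply by a scale factor, round, and check the result fits the target integer. A minimal sketch, with `to_fixed` and `from_fixed` as illustrative names:

```python
import math

def to_fixed(value: float, scale: float, offset: float, lo: int, hi: int) -> int:
    """Fixed-point encode: round-half-up of (value + offset) * scale, range-checked."""
    raw = math.floor((value + offset) * scale + 0.5)
    if not lo <= raw <= hi:
        raise ValueError(f"{value} does not fit the encoded range {lo}..{hi}")
    return raw

def from_fixed(raw: int, scale: float, offset: float) -> float:
    return raw / scale - offset

print(to_fixed(23.5, 1, 40, 0, 255))                 # temperature, 1 C steps
print(to_fixed(65, 1, 0, 0, 100))                    # humidity in percent
print(to_fixed(3.3, 50, 0, 0, 255))                  # voltage, 0.02 V steps
print(to_fixed(37.7749, 1e6, 0, -2**31, 2**31 - 1))  # latitude in microdegrees
print(from_fixed(64, 1, 40))                         # decoded temperature
```

Decoding the temperature byte gives 24.0, not 23.5: the resolution is fixed by the scale factor, so choose it from the precision the application actually needs.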

Pattern 2: Bit Packing

Combine multiple small values into single bytes:

// Pack 3 values into 1 byte:
// - Direction (bits 7-4: 0-15 -> N, NE, E, SE, S, SW, W, NW, ...)
// - Quality   (bits 3-2: 0-3 -> Poor, Fair, Good, Excellent)
// - Alert     (bits 1-0: 0-3 -> None, Low, Medium, High)
// The masks keep an out-of-range input from spilling into a neighbouring field.
uint8_t packed = (uint8_t)(((direction & 0x0F) << 4) | ((quality & 0x03) << 2) | (alert & 0x03));
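The receiving side needs the inverse operation. Here is a Python counterpart of the same 4/2/2-bit layout, with both directions and explicit range checks; the function names are illustrative.

```python
def pack_status(direction: int, quality: int, alert: int) -> int:
    """Pack direction (4 bits), quality (2 bits) and alert (2 bits) into one byte."""
    if not (0 <= direction <= 15 and 0 <= quality <= 3 and 0 <= alert <= 3):
        raise ValueError("field out of range for its bit width")
    return (direction << 4) | (quality << 2) | alert

def unpack_status(packed: int):
    """Return (direction, quality, alert) from a packed byte."""
    return (packed >> 4) & 0x0F, (packed >> 2) & 0x03, packed & 0x03

packed = pack_status(direction=2, quality=3, alert=1)
print(f"0x{packed:02X}", unpack_status(packed))
```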

Pattern 3: Delta Encoding

For sequential readings, send differences instead of absolute values:

First message:  [timestamp][temp][humidity][...]  = 16 bytes
Delta messages: [delta_t (1 byte)][delta_temp (1 byte)][...]  = 4 bytes

Savings: 75% for time-series data!
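A sketch of the idea, reusing the 16-byte layout from this section for the first frame. The 4-byte delta frame here is one possible reading of the outline above: one byte each for the time, temperature and humidity differences, plus a sequence counter as the fourth byte. That fourth byte and the little-endian byte order are assumptions made for the example.

```python
import struct

FULL = struct.Struct("<10sBBI")   # deviceId, temp x 10, humidity, timestamp (16 bytes)
DELTA = struct.Struct("<BbbB")    # delta_t, delta_temp, delta_humidity, sequence (4 bytes)

def encode_series(device_id: bytes, readings):
    """readings: list of (timestamp_s, temp_x10, humidity) tuples in time order."""
    ts, temp, hum = readings[0]
    frames = [FULL.pack(device_id, temp, hum, ts)]
    prev = readings[0]
    for seq, cur in enumerate(readings[1:], start=1):
        d_ts, d_temp, d_hum = (c - p for c, p in zip(cur, prev))
        # struct.error is raised if a delta does not fit its field;
        # a real sender would fall back to a full frame in that case.
        frames.append(DELTA.pack(d_ts, d_temp, d_hum, seq & 0xFF))
        prev = cur
    return frames

readings = [(1702834567, 235, 65), (1702834627, 236, 65), (1702834687, 234, 64)]
frames = encode_series(b"sensor-001", readings)
sizes = [len(frame) for frame in frames]
print(sizes, [frame.hex() for frame in frames[1:]])
```

Delta frames depend on the receiver having seen the previous frame, so lossy links usually need a periodic full frame to resynchronise.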

Pattern 4: Enum Compression

Replace strings with numeric codes:

String	Code	Bytes Saved
“temperature”	0x01	10 bytes
“humidity”	0x02	7 bytes
“sensor-001”	0x0001	8 bytes
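In code, enum compression is a lookup table shared by sender and receiver. The codes below are the ones from the table; the helper `bytes_saved` is illustrative and simply reproduces the "Bytes Saved" column.

```python
FIELD_CODES = {"temperature": 0x01, "humidity": 0x02}              # 1-byte codes
FIELD_NAMES = {code: name for name, code in FIELD_CODES.items()}   # reverse lookup for decoding

def bytes_saved(name: str, code_size: int = 1) -> int:
    """Bytes saved by sending a numeric code instead of the UTF-8 string."""
    return len(name.encode("utf-8")) - code_size

print(bytes_saved("temperature"))               # 1-byte code
print(bytes_saved("humidity"))                  # 1-byte code
print(bytes_saved("sensor-001", code_size=2))   # 2-byte device code
print(FIELD_NAMES[0x02])
```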

Versioning strategy (critical for future changes):

Byte 0: Version/Type field
  Low nibble (0x0-0xF): Format version 0-15
  [0x01]: Sensor reading v1
  [0x81]: Sensor reading v1 with extended fields (high bit set)

Bytes 1-N: Payload (format depends on version)
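A receiver typically reads the header byte first and dispatches on it. The sketch below assumes one interpretation that is consistent with the 0x01 and 0x81 examples: the low nibble carries the format version and bit 7 flags extended fields. The decoder table is a hypothetical placeholder.

```python
def parse_header(frame: bytes):
    """Assumed layout of byte 0: bit 7 = extended-fields flag, low nibble = version."""
    if not frame:
        raise ValueError("empty frame")
    header = frame[0]
    version = header & 0x0F
    extended = bool(header & 0x80)
    return version, extended, frame[1:]

# Hypothetical per-version payload decoders; a real table would parse the payload.
DECODERS = {1: lambda payload: ("sensor reading v1", payload)}

def dispatch(frame: bytes):
    version, extended, payload = parse_header(frame)
    if version not in DECODERS:
        raise ValueError(f"unsupported format version {version}")
    return DECODERS[version](payload), extended

print(dispatch(b"\x01\x10\x20"))
print(dispatch(b"\x81\x10\x20"))
```

Rejecting unknown versions explicitly is what lets a fleet roll out a new format without old receivers misreading the payload.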

Common pitfalls to avoid:

Endianness: Always document byte order (prefer little-endian for ARM)
Alignment: Ensure 2-byte values start at even offsets
Overflow: Validate input ranges before encoding
Magic numbers: Include a sync byte (0xAA, 0x55) for framing
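Two of these pitfalls are easy to demonstrate. The sketch first shows how the same timestamp looks in each byte order, then wraps a payload in a simple frame. The frame layout (sync bytes 0xAA 0x55, then a one-byte length) is a hypothetical example, not a standard.

```python
import struct

ts = 1702834567
little = struct.pack("<I", ts)    # least significant byte first
big = struct.pack(">I", ts)       # most significant byte first
print(little.hex(), big.hex())
print(struct.unpack(">I", little)[0])   # wrong byte order gives a wrong value, not an error

SYNC = b"\xAA\x55"

def frame_payload(payload: bytes) -> bytes:
    """Prefix a payload with sync bytes and a one-byte length (hypothetical framing)."""
    if len(payload) > 255:
        raise ValueError("payload too long for a one-byte length field")
    return SYNC + bytes([len(payload)]) + payload

def find_frame(stream: bytes):
    """Return the first complete payload after a sync marker, or None."""
    start = stream.find(SYNC)
    if start < 0 or start + 3 > len(stream):
        return None
    length = stream[start + 2]
    payload = stream[start + 3:start + 3 + length]
    return payload if len(payload) == length else None

noisy = b"\x00\x13" + frame_payload(b"\x01\x02") + b"\xff"
print(find_frame(noisy))
```

A sync marker alone cannot rule out false matches inside payload data, which is why such frames are usually paired with a checksum.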

42.6 Size Comparison Visualization

Horizontal comparison diagram showing three groups. Left group (gray): Sensor Reading description box (4 fields: string, float, int, timestamp). Center group (Format Comparison): Four boxes showing size progression. JSON (orange, 95 bytes, ten dashes, 100% baseline), CBOR (teal, 50 bytes, five dashes, 53% of JSON, 47% reduction), Protobuf (teal, 22 bytes, two dashes, 23% of JSON, 77% reduction), Custom Binary (navy, 16 bytes, one dash, 17% of JSON, 83% reduction). Right group (Key Trade-offs): Four boxes describing trade-offs. JSON (Universal ecosystem, Human-readable, Easy debugging), CBOR (50% size savings, Same data model, Good balance), Protobuf (77% savings, Schema required, Strong typing), Custom (83% savings, No tooling, Maintenance burden). Dotted lines connect each format to its trade-off description. Visual bar graph shows diminishing physical size from left to right, representing efficiency gains versus complexity costs.

%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#2C3E50', 'primaryTextColor': '#fff', 'primaryBorderColor': '#16A085', 'lineColor': '#16A085', 'secondaryColor': '#E67E22', 'tertiaryColor': '#7F8C8D', 'clusterBkg': '#ECF0F1'}}}%%
graph TB
    subgraph "Where Do the Bytes Go?"
        direction TB
        subgraph JSON["JSON: 95 bytes total"]
            J_SYNTAX["Syntax overhead<br/>braces, quotes, commas<br/>~15 bytes"]
            J_NAMES["Field names<br/>deviceId, temp, humidity, ts<br/>~35 bytes"]
            J_VALUES["Actual data values<br/>sensor-001, 23.5, 65, 1702834567<br/>~45 bytes"]
        end

        subgraph CBOR["CBOR: 50 bytes total"]
            C_HEADER["Type markers<br/>A4, 68, F9, 18, 1A<br/>~5 bytes"]
            C_NAMES["Field names (binary)<br/>Same names, compact encoding<br/>~25 bytes"]
            C_VALUES["Compact values<br/>float16, uint8, uint32<br/>~20 bytes"]
        end

        subgraph PROTO["Protobuf: 22 bytes total"]
            P_TAGS["Field tags<br/>1, 2, 3, 4 (not names)<br/>~4 bytes"]
            P_VALUES["Typed values<br/>Efficient encoding<br/>~18 bytes"]
        end

        subgraph CUSTOM["Custom: 16 bytes total"]
            X_VALUES["Pure data only<br/>No overhead at all<br/>16 bytes"]
        end
    end

    J_SYNTAX --> J_NAMES --> J_VALUES
    C_HEADER --> C_NAMES --> C_VALUES
    P_TAGS --> P_VALUES
    X_VALUES

    style J_SYNTAX fill:#7F8C8D,stroke:#2C3E50,stroke-width:1px,color:#fff
    style J_NAMES fill:#E67E22,stroke:#2C3E50,stroke-width:1px,color:#000
    style J_VALUES fill:#16A085,stroke:#2C3E50,stroke-width:1px,color:#fff
    style C_HEADER fill:#7F8C8D,stroke:#2C3E50,stroke-width:1px,color:#fff
    style C_NAMES fill:#E67E22,stroke:#2C3E50,stroke-width:1px,color:#000
    style C_VALUES fill:#16A085,stroke:#2C3E50,stroke-width:1px,color:#fff
    style P_TAGS fill:#7F8C8D,stroke:#2C3E50,stroke-width:1px,color:#fff
    style P_VALUES fill:#16A085,stroke:#2C3E50,stroke-width:1px,color:#fff
    style X_VALUES fill:#16A085,stroke:#2C3E50,stroke-width:2px,color:#fff

Figure 42.4: Alternative view: Byte Breakdown Stack - This diagram answers “where do the bytes actually go?” JSON spends ~50 bytes on overhead (syntax + field names) and only ~45 bytes on actual data. CBOR eliminates syntax overhead and compresses values, but keeps field names. Protobuf replaces field names with numeric tags. Custom binary is pure data with zero overhead. This layered view helps students understand that format efficiency comes from removing different types of overhead, not just “compression.”

42.7 Performance Benchmarks

Deep Dive: Performance Benchmarks and Library Selection

Beyond payload size, parsing speed and memory usage vary significantly across formats and libraries.

Parsing Performance (ESP32, 240 MHz, typical IoT payload):

Format	Library	Parse Time	Memory	Notes
JSON	ArduinoJson	1.2 ms	512 B	Dynamic allocation
JSON	cJSON	0.8 ms	384 B	Simpler, lighter
CBOR	tinycbor	0.3 ms	128 B	Streaming parser
CBOR	cn-cbor	0.4 ms	256 B	Tree-based
Protobuf	nanopb	0.1 ms	64 B	Static allocation
Custom	(manual)	0.05 ms	0 B	No parsing overhead

Memory allocation strategies:

Static allocation (Protobuf/nanopb): Pre-allocate based on schema
- Pros: Predictable, no fragmentation
- Cons: Wastes memory for variable-length fields
Dynamic allocation (JSON/ArduinoJson): Allocate on parse
- Pros: Flexible for variable payloads
- Cons: Heap fragmentation risk, slower
Streaming (CBOR/tinycbor): Process as bytes arrive
- Pros: Minimal memory, real-time processing
- Cons: No random access to fields

Library recommendations by platform:

Platform	JSON	CBOR	Protobuf
ESP32/ESP8266	ArduinoJson	tinycbor	nanopb
STM32	cJSON	cn-cbor	nanopb
Raspberry Pi	nlohmann/json	libcbor	protobuf-c
Python (cloud)	json (stdlib)	cbor2	protobuf
Node.js	JSON.parse	cbor	protobufjs

Energy impact (NB-IoT transmission at 23 dBm):

Format	Bytes	TX Time	Energy	Battery Impact
JSON	95	47 ms	9.4 mJ	Baseline
CBOR	50	25 ms	5.0 mJ	47% savings
Protobuf	22	11 ms	2.2 mJ	77% savings
Custom	16	8 ms	1.6 mJ	83% savings

Real-world impact: For a 5-year battery target with 96 messages/day, format choice can mean the difference between 4-year and 6-year battery life.
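The energy table can be approximated with a simple model: airtime is payload bits divided by the uplink bit rate, and energy is airtime multiplied by transmit power. The 16 kbit/s rate is an assumption inferred from the table's TX times, and using the radiated power (23 dBm, about 200 mW) ignores amplifier inefficiency and protocol overhead, so treat the results as relative rather than absolute.

```python
TX_POWER_W = 0.2         # 23 dBm is roughly 200 mW
BITRATE_BPS = 16_000     # assumed uplink rate, inferred from the table's TX times
MESSAGES_PER_DAY = 96

SIZES = {"JSON": 95, "CBOR": 50, "Protobuf": 22, "Custom": 16}

def tx_time_ms(payload_bytes: int) -> float:
    return payload_bytes * 8 / BITRATE_BPS * 1000

def tx_energy_mj(payload_bytes: int) -> float:
    """Energy in millijoules: airtime (ms) x power (W) gives mJ."""
    return tx_time_ms(payload_bytes) * TX_POWER_W

for name, size in SIZES.items():
    per_year_j = tx_energy_mj(size) * MESSAGES_PER_DAY * 365 / 1000
    print(f"{name:8s} {tx_time_ms(size):5.1f} ms  {tx_energy_mj(size):4.1f} mJ  {per_year_j:6.1f} J/year")
```

The small difference from the table's JSON row (9.5 mJ here, 9.4 mJ there) comes from the table rounding the airtime down to 47 ms before multiplying.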

42.8 Format Trade-off Summary

Tradeoff: JSON vs Binary Formats (CBOR/Protobuf)

Decision context: When choosing between human-readable JSON and compact binary formats for IoT data serialization

Factor	JSON	CBOR	Protobuf
Battery impact	High (large payloads)	Medium (47% smaller)	Low (77% smaller)
Bandwidth	High (~95 bytes typical)	Medium (~50 bytes)	Low (~22 bytes)
Latency	Higher (parsing overhead)	Medium	Low (fast decode)
Readability	Excellent (text editor)	Requires tools	Requires schema + tools
Flexibility	Excellent (schemaless)	Good (self-describing)	Moderate (schema required)
Schema evolution	Easy (add fields anytime)	Good	Good (with planning)
Tooling	Universal	Growing	Strong (protoc, gRPC)
Development speed	Fastest	Moderate	Slower (schema first)

Choose JSON when:

Bandwidth is not constrained (Wi-Fi, Ethernet, LTE)
Development speed and debugging are priorities
Schema may change frequently during prototyping
Integrating with web services and REST APIs
Small deployments (<100 devices) where data costs are negligible

Choose CBOR when:

Bandwidth is limited but you need JSON-like flexibility (LoRaWAN, NB-IoT)
Migrating from JSON with minimal code changes
Self-describing format needed (no schema coordination required)
CoAP protocol usage (CBOR is a widely used payload format for CoAP)
Balance between efficiency and maintainability

Choose Protobuf when:

High-volume deployments (>1000 devices) where bandwidth savings matter
Strong typing and schema enforcement are requirements
Building gRPC-based microservices architecture
Long-term API contracts with multiple teams
Maximum efficiency needed with acceptable schema management overhead

Default recommendation: Start with JSON for prototyping, migrate to CBOR when bandwidth becomes a concern, use Protobuf for high-scale production systems with stable schemas

42.9 Summary

Key Points:

CBOR: Binary JSON with 47% size reduction, same data model, IETF standard
Protobuf: Schema-based with 77% reduction, strong typing, requires setup
Custom Binary: Maximum efficiency (83%) but high maintenance burden
Match serialization format to your sender/receiver types
Consider parsing speed and memory, not just payload size

Quick Reference:

Format	Size	Complexity	Schema	Best For
CBOR	50 bytes	Low	None	LoRaWAN, NB-IoT, CoAP
Protobuf	22 bytes	Medium	Required	gRPC, high-volume
Custom	16 bytes	High	DIY	Sigfox, ultra-low-power

42.10 What’s Next

Now that you understand binary data formats and their trade-offs:

Next: Data Format Selection - Decision guides with real-world examples to help you choose
Practice: Data Formats Practice - Work through scenarios and quizzes
Apply: CoAP Fundamentals - See CBOR in action with CoAP
