Text encoding converts characters to numbers: ASCII uses 1 byte for 128 English characters, while UTF-8 handles all world languages (1-4 bytes per character). For IoT data formats, JSON is human-readable but verbose, CBOR is 30-60% smaller, and raw binary is most compact. Most sensors output integers, not floats – convert to decimal only at display time.
Chapter Scope (Avoiding Duplicate Deep Dives)
This chapter focuses on text encoding and payload representation choices.
Stay here for ASCII/Unicode/UTF-8 behavior and message-size trade-offs.
Classify text encoding schemes: Differentiate ASCII, Unicode, and UTF-8 characteristics for IoT applications
Compare data format efficiency: Evaluate JSON, CBOR, and binary encodings for bandwidth-constrained IoT deployments
Calculate UTF-8 buffer requirements: Determine correct buffer sizes for multi-byte characters to prevent overflow errors
Distinguish sensor data representations: Identify when to use integer versus floating-point formats in IoT data pipelines
Select optimal payload formats: Apply cost-benefit analysis to choose encoding strategies based on deployment scale and constraints
No-One-Left-Behind Encoding Loop
Start from one plain-text payload example.
Encode it in UTF-8 and count bytes explicitly.
Compare JSON vs compact/binary alternatives for the same data.
Validate choice against deployment constraints (battery, bandwidth, debugging needs).
For Beginners: Text Encoding
Text encoding is how computers turn letters, numbers, and symbols into the ones and zeros they actually store. ASCII is the original system that covers English letters using one byte per character. UTF-8 extends this to handle every language in the world, from Chinese to Arabic to emoji. For IoT, understanding text encoding helps you know how much space your data takes up and why sending a temperature reading as text (“25.3”) uses more bytes than sending it as a compact number.
Characters and bytes are different measurements: In UTF-8, a string with 20 visible characters can require more than 20 bytes once accented letters, CJK text, or emoji appear.
ASCII is still the baseline: Protocol punctuation, many identifiers, and most English labels are ASCII, which is why UTF-8 works well as a practical superset for IoT systems.
UTF-8 is the default for modern text transport: MQTT topics, JSON payloads, REST APIs, and cloud tooling generally expect UTF-8, so byte sizing and validation must assume it.
Payload overhead often comes from structure, not sensor values: Field names, topic paths, braces, quotes, and separators can dominate the size of small telemetry messages.
Binary formats trade readability for efficiency: CBOR and raw binary save bytes on the wire, but they require schema discipline and make field-level debugging less obvious than JSON.
Most devices begin with scaled integers: Sensors often produce compact integer values such as 235 for 23.5C, then convert to float or text only after transmission.
The right encoding depends on deployment economics: A few extra bytes are negligible on Wi-Fi debug traffic, but expensive on LoRaWAN, BLE advertisements, or large cellular fleets.
Text must be converted to numbers for computers to process. This is critical for:

- Device names and identifiers
- Configuration files and JSON messages
- Serial monitor output and debugging
- MQTT topics and CoAP URIs
6.3.1 ASCII - The Original Standard
ASCII (American Standard Code for Information Interchange) uses 7 bits to encode 128 characters:
Limitation: ASCII only covers English. No accents (é, ñ, ü), no emojis, no Chinese/Arabic/Hebrew characters.
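To make the byte-for-byte nature of ASCII concrete, here is a minimal Python sketch (the "Hello" example is illustrative):

```python
# Each ASCII character maps to exactly one byte (values 0-127)
message = "Hello"
print([ord(ch) for ch in message])        # [72, 101, 108, 108, 111]
print(len(message.encode("ascii")))       # 5 characters -> 5 bytes

# Characters outside the 128-character set are rejected outright
try:
    "café".encode("ascii")
except UnicodeEncodeError as err:
    print("not ASCII:", err.reason)       # not ASCII: ordinal not in range(128)
```

This one-character-one-byte property is exactly what breaks once accented or non-Latin text appears, which motivates the Unicode section that follows.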
6.3.2 Unicode and UTF-8
Unicode assigns a unique number (called a code point) to every character across all languages (143,000+ characters):

- Latin extended: é (U+00E9), ñ (U+00F1), ü (U+00FC)
- Cyrillic: Д (U+0414), Ж (U+0416), Я (U+042F)
- CJK (Chinese/Japanese/Korean): 中 (U+4E2D), 日 (U+65E5), 한 (U+D55C)
- Technical symbols: ° (degree), μ (micro), Ω (ohm)
- Emojis: 🌡 (thermometer, U+1F321), 📡 (antenna, U+1F4E1), 🔋 (battery, U+1F50B)
UTF-8 (Unicode Transformation Format, 8-bit) is the most common Unicode encoding:
| Characters | Bytes Used | Example |
|---|---|---|
| ASCII (A-Z, 0-9) | 1 byte | 'A' = 0x41 |
| Latin extended, Greek, Cyrillic, Hebrew, Arabic (é, ñ) | 2 bytes | 'é' = 0xC3 0xA9 |
| CJK and most other scripts (Chinese, Japanese, Korean) | 3 bytes | '中' = 0xE4 0xB8 0xAD |
| Emojis | 4 bytes | 🌡 = 0xF0 0x9F 0x8C 0xA1 |
Why UTF-8 for IoT?
Backward compatible with ASCII: 1 byte = 1 character for English
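The variable widths in the table above can be checked directly, one character per row, with a short Python sketch:

```python
# Byte width of one character from each row of the UTF-8 table
for ch in ["A", "é", "中", "🌡"]:
    encoded = ch.encode("utf-8")
    print(f"{ch}: {len(encoded)} byte(s) -> {encoded.hex(' ')}")

# A: 1 byte(s) -> 41
# é: 2 byte(s) -> c3 a9
# 中: 3 byte(s) -> e4 b8 ad
# 🌡: 4 byte(s) -> f0 9f 8c a1
```

Note the first byte of each multi-byte sequence encodes its length, which is how a receiver can walk a UTF-8 string without a separate length table.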
Quick Check: Test your understanding of text encoding schemes:
6.4 Data Format Efficiency
Data Format Efficiency Comparison - The same sensor reading encoded in different formats shows dramatic size differences. JSON (orange) is human-readable but verbose at 32 bytes. CBOR (teal) provides JSON semantics in binary form at 18 bytes. Raw binary (navy) is most compact at 4 bytes but requires schema knowledge. For bandwidth-constrained IoT (LoRaWAN max 51 bytes), format choice directly impacts how many readings fit per message. The trade-off: smaller formats sacrifice readability and flexibility for efficiency.
Figure 6.1
Alternative View: Format Selection by IoT Scenario
This variant shows which format to choose based on your specific IoT deployment:
- Use JSON (Wi-Fi Dashboard): Home automation and lab prototypes where direct log inspection matters more than squeezing every byte. Best fit: readable payloads and fast debugging.
- Use CBOR (Edge Gateway): Constrained CoAP or mesh links where structure still helps, but repeating JSON keys starts to hurt. Best fit: compact structure without losing field names.
- Prefer CBOR (Wearable or BLE): Battery-powered devices with small packets where retransmissions cost energy and airtime. Best fit: keep JSON only when humans inspect payloads daily.
- Use Binary (LPWAN Fleet): LoRaWAN, NB-IoT, or satellite sensors where a few bytes saved can mean more samples per uplink. Best fit: explicit schema, smallest possible payload.
Why this variant helps: The original shows abstract byte counts. This diagram answers “which format should I use for MY project?” by mapping real IoT scenarios (Wi-Fi dashboard, edge gateway, wearable, LPWAN fleet) to appropriate format choices. Students can identify their use case and immediately know which format to consider.
6.5 Common Misconceptions
Common Misconception: “All Sensors Use Floating Point Numbers”
Myth: “Sensors output floating-point temperature readings like 23.456C”
Reality: Most IoT sensors output integers, and floating point is computed later!
Example: DHT22 Temperature Sensor
Sensor output: 16-bit integer = 235 (raw value, in tenths of a degree)
Actual temperature: 235 / 10 = 23.5C (integer arithmetic)
Why? Floating point math is slow and power-hungry on microcontrollers
BME280 Pressure Sensor:
Raw output: 24-bit integer (adc_P = 415148)
Conversion formula: Uses integer arithmetic with calibration constants
Final pressure: 101325 Pa (computed from integer compensation formula)
Practical rule:
Sensor to MCU: Use integers (fast, efficient, no rounding errors)
MCU to Cloud: Convert to float/JSON for human readability
MCU to MQTT: Use binary formats like CBOR to avoid float overhead
Energy impact: Integer-only math can save 10-30% power compared to floating point operations on an 8-bit AVR Arduino.
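The scaled-integer pipeline above can be sketched end to end in a few lines of Python (the x10 scale factor mirrors the DHT22-style convention described in this section):

```python
import struct

# Device side: keep the raw scaled integer (no float math on the MCU)
raw_temp = 235                         # tenths of a degree C, as in the text

# Radio link: 2 bytes instead of the 4-byte text string "23.5"
payload = struct.pack(">h", raw_temp)  # signed 16-bit, big-endian
print(len(payload), "bytes on the wire")   # 2 bytes on the wire

# Cloud side: divide by the scale factor only where humans read the value
temp_c = struct.unpack(">h", payload)[0] / 10
print(f"{temp_c} C")                       # 23.5 C
```

The division happens once, server-side, so the microcontroller never touches floating point at all.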
Text Encoding Pitfall in IoT
If your IoT device expects ASCII but receives UTF-8 with accented characters, you’ll see:
Garbled display: a UTF-8 label such as Gebäude can render as corrupted text (for example “GebÃ¤ude”) when the receiver assumes single-byte ASCII
Buffer overflows: a 10-character field can require 12+ bytes once non-ASCII characters appear
Protocol errors: a gateway that assumes Latin-1 or ASCII can reject valid UTF-8 MQTT topics and JSON strings
Solution: Standardize on UTF-8, size buffers by bytes rather than visible characters, and test at least one multilingual topic or payload before deployment.
Quick Check: Test your understanding of the IoT data encoding pipeline:
Quick Check: Test your understanding of data format selection for IoT:
Scenario: Choosing Data Formats
You’re designing a battery-powered environmental sensor that transmits data via LoRaWAN (51-byte payload limit). The sensor measures temperature (range: -40 to +85C, resolution 0.1C) and humidity (range: 0-100%, resolution 1%).
Question: Which data format and encoding strategy would maximize battery life while fitting within the payload limit?
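One way to sanity-check the scenario’s arithmetic is a sketch that packs both readings into a fixed 3-byte binary record; the field layout and x10 temperature scale below are illustrative choices, not the only valid answer:

```python
import struct

def pack_reading(temp_c, humidity_pct):
    """3-byte record: temperature as int16 in tenths of a degree C
    (covers -40.0..+85.0 at 0.1C resolution), humidity as uint8 percent."""
    return struct.pack(">hB", round(temp_c * 10), round(humidity_pct))

payload = pack_reading(-12.3, 47)
print(len(payload), "bytes per reading")                 # 3 bytes per reading
print(51 // len(payload), "readings per 51-byte frame")  # 17 readings
```

Because 0.1C steps over a 125-degree range need only 1,251 distinct values, an int16 is already generous; the point is that the whole reading fits in 3 bytes, versus tens of bytes as JSON.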
6.6 Code Example: IoT Payload Format Comparison
This Python tool compares the same sensor reading encoded in JSON, CBOR-like binary, and raw binary formats, showing the byte-level representation and size differences that matter for bandwidth-constrained IoT networks:
```python
import struct, json

# Same sensor reading encoded three ways
reading = {"temperature": 23.5, "humidity": 65, "pressure": 1013.2}

# 1. JSON text encoding (human-readable, largest)
json_bytes = json.dumps(reading, separators=(",", ":")).encode("utf-8")

# 2. Raw binary encoding (smallest, requires schema)
binary_bytes = struct.pack(
    ">hBH",
    round(23.5 * 10),           # temp as int16 (x10 for 0.1C precision)
    65,                         # humidity as uint8
    round(1013.2 * 10) - 9000,  # pressure as uint16 (offset from 900 hPa)
)

# Compare sizes
print(f"JSON:       {len(json_bytes):3d} bytes {json_bytes.decode()}")
print(f"Raw Binary: {len(binary_bytes):3d} bytes {binary_bytes.hex(' ')}")
print(f"\nBinary is {len(json_bytes)/len(binary_bytes):.1f}x smaller")
print(f"LoRaWAN 51-byte limit: {51//len(binary_bytes)} binary vs "
      f"{51//len(json_bytes)} JSON readings")

# Output:
# JSON:        52 bytes {"temperature":23.5,"humidity":65,"pressure":1013.2}
# Raw Binary:   5 bytes 00 eb 41 04 6c
# Binary is 10.4x smaller
# LoRaWAN 51-byte limit: 10 binary vs 0 JSON readings
```

Note the use of round() rather than int(): truncation can silently shift a value when floating-point noise makes 1013.2 * 10 land just below 10132.
Key insight: For a three-sensor environmental payload, raw binary encoding is roughly 10x smaller than JSON. On a LoRaWAN link (51-byte payload limit), that means 10 readings per message, while a single JSON reading does not fit at all – directly translating to far fewer radio transmissions and dramatically longer battery life.
| Format | Size | Readability | Schema Required | Best For |
|---|---|---|---|---|
| JSON | ~52 bytes | Human-readable | No (self-describing) | Debugging, Wi-Fi devices |
| CBOR | ~20 bytes | Machine-only | No (self-describing) | Constrained devices, CoAP |
| Raw Binary | 5 bytes | Machine-only | Yes (both ends must agree) | LoRaWAN, NB-IoT, ultra-low-power |
Worked Example: Calculating MQTT Topic String Size with UTF-8
Scenario: You’re designing an MQTT topic structure for a smart building system with international deployments (German, French, Japanese buildings). You want to use human-readable topic names that include building names.
Topics might look like:
buildings/München-Office/floor3/temp
buildings/Paris-Siège/étage2/humidity
buildings/東京本社/3階/co2
Question: How many bytes does each topic require, and how should you size your MQTT client’s topic buffer?
6.6.1 Step 1: Understand UTF-8 Character Sizes
Character Type
Bytes
Examples
ASCII (a-z, 0-9, /, -)
1 byte
buildings, floor3, /
Latin Extended (ü, é, ñ, ö)
2 bytes
München, étage, Siège
Japanese/Chinese/Korean
3 bytes
東京本社 (Tokyo HQ), 3階 (3rd floor)
Emojis (rare in topics)
4 bytes
🏢 (office building emoji)
6.6.2 Step 2: Calculate Topic Sizes
German topic: `buildings/München-Office/floor3/temp`

| Part | Characters | ASCII Bytes | Extended Bytes | Total |
|---|---|---|---|---|
| buildings/ | 10 | 10 | 0 | 10 |
| München | 7 (ü = 2 bytes) | 6 | 2 | 8 |
| -Office/floor3/temp | 19 | 19 | 0 | 19 |
| Total | 36 chars | 35 | 2 | 37 bytes |
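The hand calculation can be verified in one line per topic; a Python sketch using the three example topic strings:

```python
# len() counts characters; len(encode()) counts bytes on the wire
topics = [
    "buildings/München-Office/floor3/temp",
    "buildings/Paris-Siège/étage2/humidity",
    "buildings/東京本社/3階/co2",
]
for topic in topics:
    print(f"{topic}: {len(topic)} chars, {len(topic.encode('utf-8'))} bytes")

# buildings/München-Office/floor3/temp: 36 chars, 37 bytes
# buildings/Paris-Siège/étage2/humidity: 37 chars, 39 bytes
# buildings/東京本社/3階/co2: 21 chars, 31 bytes
```

Notice the Japanese topic is the shortest in characters but the Latin ones are close to it in bytes: the 3-byte CJK characters compress a lot of meaning per character.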
French topic: `buildings/Paris-Siège/étage2/humidity` – 37 characters, 39 bytes (è and é take 2 bytes each).
6.6.3 Step 3: Size the Topic Buffer in Bytes
Naive approach:

```c
#define TOPIC_MAX_LENGTH 50  // "50 characters should be enough"
char topic[TOPIC_MAX_LENGTH];
```
Problem: 50 characters might be 50-200 bytes with UTF-8!
Correct approach:
```c
// Rule of thumb: Allocate 3x character count for safety
#define TOPIC_MAX_CHARS 50
#define TOPIC_MAX_BYTES (TOPIC_MAX_CHARS * 3)  // 150 bytes
char topic[TOPIC_MAX_BYTES + 1];               // +1 for null terminator
```
Why 3x?
Most extended characters (Latin, Cyrillic) are 2 bytes
CJK characters are 3 bytes
Mix of ASCII + extended averages ~1.5-2 bytes per character
3x is safe without excessive waste
6.6.4 Key Takeaways
Always allocate 1.5-3x character count for UTF-8 buffers
UTF-8 is backward compatible - ASCII topics are 1 byte per char
MQTT supports UTF-8 natively - no special encoding needed
Test with international examples - don’t assume ASCII-only
Bandwidth cost is minimal - readability wins for topics
Recommendation: Use descriptive UTF-8 topic names. The slight bandwidth increase is worth the operational clarity.
6.7 Concept Relationships
Understanding how text encoding concepts relate helps you design efficient IoT message formats:
ASCII depends on 7-bit encoding, enables simple English identifiers, and becomes a trap when teams assume it is enough for international device names or user-visible text.
Unicode defines the character set, enables one code point per symbol across languages, and is often misunderstood as inherently large even though UTF-8 keeps common English text compact.
UTF-8 turns Unicode code points into 1-4 bytes, enables backward-compatible web and MQTT text, and is often misread as “always bigger” even though pure ASCII stays 1 byte per character.
UTF-16 is useful in some runtime environments, but it is usually a poor transport choice for IoT text because English-heavy payloads double in size compared with UTF-8.
Buffer sizing depends on UTF-8 byte width, prevents overflows, and breaks when developers assume "cafe" and "café" occupy the same number of bytes.
JSON text encoding depends on UTF-8 plus key-value structure, enables self-describing payloads, and becomes expensive when repeated field names dominate tiny telemetry packets.
MQTT topic encoding depends on UTF-8 topic strings, enables hierarchical routing, and is often underestimated because long descriptive paths add cost to every single publish.
Locale handling depends on both encoding and regional formatting conventions, enabling international deployments while introducing edge cases around logs, units, and user-visible metadata.
Buffer sizing must account for UTF-8’s variable width (allocate 1.5-3× the character count)
JSON mandates UTF-8, so multi-byte characters inflate payload size
MQTT topics use UTF-8, so descriptive names cost bandwidth
Critical Trade-offs:
MQTT topic names: t/001 is compact, but sensors/warehouse-3/temperature is easier to debug. The human-friendly version can cost more than 5x the bytes.
Sensor IDs: a 2-byte binary ID is efficient, while a 36-character UUID is operationally safer in distributed systems. The trade-off is collision risk versus message size.
Timestamp format: Unix epoch integers are compact; ISO 8601 strings are readable. Expect roughly a 5x size difference when you choose readability.
Error reporting: a 1-byte code is efficient, but a full text error is easier to diagnose remotely. This can change failure payload cost by 20-50x.
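The byte costs behind these trade-offs are easy to measure; a small Python sketch (the example topic names and timestamp value are illustrative):

```python
import struct

# Compact vs descriptive MQTT topic, sent with every single publish
compact, descriptive = "t/001", "sensors/warehouse-3/temperature"
print(len(compact), "vs", len(descriptive), "bytes per publish")  # 5 vs 31

# Compact vs readable timestamp
epoch = struct.pack(">I", 1736899200)         # uint32 Unix seconds: 4 bytes
iso = "2025-01-15T00:00:00Z".encode("ascii")  # ISO 8601 string: 20 bytes
print(len(epoch), "vs", len(iso), "bytes per timestamp")          # 4 vs 20
```

Running quick measurements like this before committing to a schema turns the trade-off debate into concrete per-message numbers.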
Gateway transforms to JSON: 80-byte REST API call to cloud
Cloud stores in TimescaleDB: 2.5 bytes per row (columnar compression)
What to Observe: Encoding decisions cascade through your system. Choose UTF-8 everywhere for text, but recognize when binary encoding is justified (sensor→gateway transmission).
6.8 Try It Yourself: Text Encoding Impact on IoT Payloads
Scenario: You’re designing an MQTT-based temperature monitoring system with 100 sensors. Each sensor reports every 60 seconds. Your cellular connectivity costs $0.10/MB. You need to choose between two message formats.
Hybrid approach wins: Use compact JSON ({"id":"...","t":23.7}) for deployments under 500 devices. Switch to binary only at scale.
MQTT topic costs: Using descriptive topics (sensors/temperature/SN003F42 = 28 bytes) vs minimal (t/SN003F42 = 10 bytes) adds 18 bytes per publish – roughly $95/year extra for this 100-sensor deployment (18 B × 1440 messages/day × 365 days × 100 sensors ≈ 946 MB at $0.10/MB). Worth it for operational clarity? Depends on your team.
Recommended Decision Matrix:
| Fleet Size | Format | Rationale |
|---|---|---|
| < 100 sensors | Verbose JSON | Debugging ease matters more than bandwidth cost at this scale. |
| 100-500 sensors | Compact JSON | You keep tooling friendliness while capturing the low-effort 55% savings. |
| 500-5,000 sensors | Binary or CBOR | The deployment is large enough that breakeven can happen within a practical time horizon. |
| > 5,000 sensors | Binary with schema | Bandwidth savings are large enough to justify explicit schema management. |
Putting Numbers to It
For 100 IoT temperature sensors with 60-second reporting over cellular ($0.10/MB):
\[ \text{Cost}_{\text{annual}} = N \times M_{\text{daily}} \times 365 \times \frac{S}{10^{6}} \times C \]

where \(N = 100\) sensors, \(M_{\text{daily}} = 1440\) messages per day, \(S\) is the message size in bytes, and \(C = 0.10\) dollars per MB:
Shortening JSON keys (“temperature_celsius” → “t”) saves $399/year with zero engineering cost. Binary encoding saves $631/year but requires $15K firmware investment (24-year breakeven). At 1,000 sensors, binary saves $6,307/year (2.4-year breakeven)—crossing the economic justification threshold around 300-500 devices.
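These breakeven figures can be reproduced with a short Python sketch; the per-message sizes of ~125 bytes for verbose JSON and ~5 bytes for binary are assumptions chosen to match the savings quoted above:

```python
def annual_cost_usd(n_sensors, msgs_per_day, msg_bytes, usd_per_mb=0.10):
    """Annual cellular payload cost: sensors x messages x size x rate."""
    mb_per_year = n_sensors * msgs_per_day * 365 * msg_bytes / 1e6
    return mb_per_year * usd_per_mb

# Assumed sizes: verbose JSON ~125 B/message, raw binary ~5 B/message
json_cost = annual_cost_usd(100, 1440, 125)
binary_cost = annual_cost_usd(100, 1440, 5)
print(f"Saved: ${json_cost - binary_cost:.0f}/year")   # ~$631/year at 100 sensors

# Breakeven on a $15K firmware investment
print(f"Breakeven: {15_000 / (json_cost - binary_cost):.0f} years")  # ~24 years
```

Rerunning with n_sensors=1000 gives roughly $6,307/year saved, matching the 2.4-year breakeven cited in the text.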
Common Pitfalls
1. Treating Character Count as Byte Count
UTF-8 makes visible length and byte length diverge as soon as accented characters, CJK text, or emoji appear. Size buffers from the maximum byte count, include the null terminator where required, and test with at least one multilingual example before deployment.
2. Shipping Verbose Text on Constrained Links by Default
Readable JSON is useful during bring-up, but repeated field names and topic paths can dominate tiny telemetry messages. Measure the real per-message byte cost before assuming the readability overhead is acceptable on LoRaWAN, BLE advertisements, or cellular fleets.
3. Converting to Floating Point Too Early
Many sensors already provide scaled integers, so converting to float or text on the device wastes CPU cycles, bytes, and sometimes precision. Keep compact integer forms on the device and radio link unless a downstream consumer truly needs a floating-point or human-readable representation.
6.9 Summary
Text encoding is essential for IoT communication:
ASCII: 7-bit encoding for 128 characters, stored in 1 byte (English only)
Unicode: 143,000+ characters across all languages
UTF-8: Variable-width Unicode encoding, backward compatible with ASCII
Connection: Return here for the full index of all data representation chapters.
For Kids: Meet the Sensor Squad!
Sammy the Sensor tries to send a message: “Hello, Cloud!”
Max the Microcontroller explains: “Computers don’t understand letters! We need to turn every letter into a number. That’s called encoding.”
Lila the LED shows the secret code chart (ASCII):
H = 72, e = 101, l = 108, l = 108, o = 111
“Hello” = five numbers, five bytes!
“But what about other languages?” asks Sammy. “My friend in Japan writes in Japanese!”
Max pulls out a bigger chart: “That’s Unicode! It has numbers for EVERY letter in EVERY language – even emojis! The thermometer emoji is number 127,777!”
Bella the Battery warns: “Be careful! English letters take 1 byte each, but Chinese characters take 3 bytes and emojis take 4 bytes. If you name your sensor with an emoji, your messages get BIGGER!”
Lila adds: “And here’s a cool trick – most sensors don’t even send words! Temperature 23.5 degrees? The sensor sends the number 235 (just 2 bytes) and the cloud knows to divide by 10. Way more efficient than spelling out ‘twenty-three point five degrees Celsius’!”
The Squad’s Rule: Letters are secretly numbers! English = small numbers (1 byte), other languages = bigger numbers (2-4 bytes). For sensor data, skip the words and just send numbers!
Quick Check: Test your understanding of UTF-8 buffer sizing:
Knowledge Check: UTF-8 Buffer Sizing
Scenario: You are designing a buffer for MQTT topic names. Topics can include non-English characters (e.g., “Gebäude/Temperatur” for a German building sensor).