12  Hash Functions and Data Integrity

12.1 Learning Objectives

By the end of this chapter, you will be able to:

  • Explain Hash Function Properties: Describe collision resistance, preimage resistance, and the avalanche effect
  • Apply SHA-256 for Integrity: Verify firmware authenticity, validate data, and store passwords securely using cryptographic hashes
  • Implement HMAC Authentication: Combine hash functions with secret keys for message authentication
  • Distinguish Deprecated Algorithms: Identify why MD5 and SHA-1 are insecure and select appropriate replacements
In 60 Seconds

Cryptographic hash functions produce a fixed-size, irreversible fingerprint of any data, enabling integrity verification and password storage for IoT devices without exposing the original values.

Hashing and data integrity checks ensure that IoT data has not been tampered with during transmission. Think of a wax seal on an envelope – if the seal is broken, you know someone opened it. Hash functions create a unique digital fingerprint of your data, so any change, no matter how small, is immediately detectable.

“Every piece of data has a unique fingerprint!” Sammy the Sensor said, holding up a data packet. “When I feed this temperature reading into the SHA-256 hash function, out comes a 64-character hex string. Change even ONE tiny bit of the reading, and the fingerprint looks COMPLETELY different!”

Max the Microcontroller demonstrated. “Watch this: ‘Hello’ hashes to 185f8db… But ‘hello’ with a lowercase h hashes to 2cf24db… Totally different! This is called the avalanche effect – small changes cause massive differences in the output. That is what makes hash functions so useful for detecting tampering.”

“Hash functions are one-way,” Lila the LED explained. “You can turn data INTO a hash, but you cannot turn a hash BACK into data. It is like putting a document through a paper shredder – you can verify the shreds match the original, but you cannot reconstruct the document from the shreds. That is why hashes are safe for storing passwords – even if someone steals the hash, they cannot reverse it to find the password.”

“HMAC adds a secret key to the hash,” Bella the Battery noted. “Without a key, anyone can compute the hash. But with HMAC, only someone who knows the secret key can create or verify the hash. It is like a wax seal that needs a specific signet ring – without the ring, you cannot create a valid seal. And never use MD5 or SHA-1 – they have been broken! Always use SHA-256 or better.”

In Plain English

A hash function creates a fixed-size “fingerprint” of any data. Like a fingerprint, even tiny changes produce completely different outputs. Hash functions are one-way: you can create a hash from data, but you cannot reverse it to get the original data back.

Why it matters for IoT: When your smart thermostat downloads a firmware update, it computes the SHA-256 hash and compares it to the manufacturer’s published hash. If they match, the firmware is unmodified, provided the published hash itself came from a trusted source (for example, an authenticated channel or a signed manifest). If even one bit was changed (by an attacker or network error), the hash will be completely different.

12.2 How Hash Functions Work

A cryptographic hash function takes input of any size and produces a fixed-size output called a digest or hash.

Block diagram showing a hash function accepting variable-length inputs such as a short message, a file, and a large dataset, each producing the same fixed-size 256-bit output digest
Figure 12.1: Hash functions produce fixed-size outputs regardless of input size

12.2.1 Critical Properties

12.2.1.1 1. Collision Resistance

It must be computationally infeasible to find two different inputs that produce the same hash.

Why it matters: If attackers could create a malicious firmware file with the same hash as legitimate firmware, they could bypass integrity checks.

12.2.1.2 2. Preimage Resistance (One-Way)

Given a hash output, it must be computationally infeasible to find any input that produces that hash.

Why it matters: Password hashes stored in databases cannot be reversed to reveal the original passwords.

12.2.1.3 3. Avalanche Effect

Changing even one bit of input produces a completely different hash output (approximately 50% of bits change).

Input:  "Sensor: Temp=25.0C"
Hash:   a3f2c9e847b1...

Input:  "Sensor: Temp=25.1C"  (one character changed)
Hash:   7d4e2f1a93c8...      (completely different!)

12.3 SHA-256: The IoT Standard

SHA-256 (Secure Hash Algorithm 256-bit) is the recommended hash function for IoT:

Property | Value
Output Size | 256 bits (32 bytes, 64 hex characters)
Block Size | 512 bits
Security Level | 128-bit collision resistance
Speed | ~200-500 MB/s on modern processors
Status | Current standard, no known practical attacks

12.3.1 IoT Use Cases

1. Firmware Integrity Verification

Manufacturer publishes:
  firmware_v2.1.bin
  SHA-256: 7d4e2f1a93c8b5d6e7f8a9b0c1d2e3f4...

Device downloads firmware, computes:
  SHA-256(downloaded_firmware)

If hashes match: INSTALL
If hashes differ: REJECT (corrupted or tampered)
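
The check above can be sketched in Python as follows. This is a minimal sketch rather than a production updater: the function names `sha256_file` and `verify_firmware`, the chunk size, and the stand-in firmware file are illustrative assumptions, and the expected digest is assumed to have arrived over a trusted channel.

```python
import hashlib
import hmac
import os
import tempfile

def sha256_file(path, chunk_size=4096):
    """Hash a file in chunks so large images never need to fit in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_firmware(path, expected_hex):
    """Return True only if the file's SHA-256 matches the published digest."""
    return hmac.compare_digest(sha256_file(path), expected_hex.lower())

# Demo with a stand-in "firmware" file (hypothetical contents)
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"firmware image v2.1" * 1000)
    firmware_path = tmp.name

published = sha256_file(firmware_path)  # stands in for the manufacturer's value
install_ok = verify_firmware(firmware_path, published)

# Overwrite one byte to simulate corruption or tampering
with open(firmware_path, "r+b") as f:
    f.seek(10)
    f.write(b"X")
tampered_ok = verify_firmware(firmware_path, published)

os.remove(firmware_path)
print("original:", "INSTALL" if install_ok else "REJECT")
print("modified:", "INSTALL" if tampered_ok else "REJECT")
```

Reading in chunks matters on constrained gateways, where a firmware image may be larger than available memory.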

2. Password Storage

User password:  "MySecretP@ss123"
Stored in DB:   SHA-256(password + salt)
                = "a3f2c9e847b1d5f6..."

Never store plaintext passwords!
(See Password Hashing section below for best practices)

3. Data Integrity in Transit

Sensor sends:
  {data: "temp=25.0C", hash: SHA-256(data)}

Gateway receives and verifies:
  computed_hash = SHA-256(received_data)
  if computed_hash == received_hash: VALID
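
One caveat about the scheme above: an unkeyed hash catches accidental corruption, but anyone on the network path can change the data and recompute the hash. The sketch below illustrates this under simple assumptions; `make_packet` and `gateway_accepts` are hypothetical helper names, not part of any library.

```python
import hashlib

def make_packet(data: bytes) -> dict:
    """Sensor side: attach an unkeyed SHA-256 digest to the data."""
    return {"data": data, "hash": hashlib.sha256(data).hexdigest()}

def gateway_accepts(packet: dict) -> bool:
    """Gateway side: recompute the digest and compare."""
    return hashlib.sha256(packet["data"]).hexdigest() == packet["hash"]

packet = make_packet(b"temp=25.0C")

# Accidental corruption: the data changes but the hash does not
corrupted = {"data": b"temp=25.1C", "hash": packet["hash"]}

# Deliberate tampering: the attacker changes the data AND recomputes the hash
forged = make_packet(b"temp=99.9C")

print(gateway_accepts(packet))     # True
print(gateway_accepts(corrupted))  # False
print(gateway_accepts(forged))     # True: no key was needed to forge it
```

This is the gap that HMAC closes by mixing a secret key into the computation.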

4. Blockchain and Merkle Trees

IoT audit logs use hash chains where each entry includes the hash of the previous entry, making tampering detectable.
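
A hash chain of this kind can be sketched in a few lines. The entry layout, the `GENESIS` placeholder, and the function names are assumptions chosen for illustration; a real audit log would also need the latest hash anchored somewhere the attacker cannot rewrite.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry

def entry_hash(prev_hash: str, event: str) -> str:
    """Hash the previous entry's hash together with this entry's event."""
    record = json.dumps({"prev": prev_hash, "event": event}, sort_keys=True)
    return hashlib.sha256(record.encode()).hexdigest()

def append(log: list, event: str) -> None:
    """Add an entry that commits to everything before it."""
    prev = log[-1]["hash"] if log else GENESIS
    log.append({"event": event, "prev": prev, "hash": entry_hash(prev, event)})

def verify(log: list) -> bool:
    """Walk the chain and recompute every link."""
    prev = GENESIS
    for entry in log:
        if entry["prev"] != prev or entry["hash"] != entry_hash(prev, entry["event"]):
            return False
        prev = entry["hash"]
    return True

log = []
for event in ["boot", "door opened", "firmware updated"]:
    append(log, event)

intact = verify(log)
log[1]["event"] = "door closed"  # attacker edits an old entry
after_edit = verify(log)
print("before edit:", intact, "| after edit:", after_edit)
```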

12.4 HMAC: Hash-Based Message Authentication

HMAC combines a hash function with a secret key to provide both integrity AND authentication.

Diagram showing the HMAC construction with a secret key combined with an inner and outer padding, feeding into two rounds of SHA-256 hashing to produce an authenticated message tag
Figure 12.2: HMAC provides authentication and integrity verification
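
To make the construction in the figure concrete, the sketch below writes out HMAC-SHA256 step by step and compares the result with Python's standard `hmac` module. The function name `hmac_sha256_manual` is an assumption for illustration; in real code, use the library call rather than a hand-rolled version.

```python
import hashlib
import hmac

def hmac_sha256_manual(key: bytes, message: bytes) -> bytes:
    """HMAC-SHA256 written out step by step (the RFC 2104 construction)."""
    block_size = 64  # SHA-256 processes 512-bit (64-byte) blocks
    if len(key) > block_size:
        key = hashlib.sha256(key).digest()  # long keys are hashed first
    key = key.ljust(block_size, b"\x00")    # then zero-padded to one block

    inner_pad = bytes(b ^ 0x36 for b in key)
    outer_pad = bytes(b ^ 0x5C for b in key)

    inner = hashlib.sha256(inner_pad + message).digest()  # first hashing round
    return hashlib.sha256(outer_pad + inner).digest()     # second hashing round

key = b"sensor_shared_secret_key"
msg = b"temp=25.5,humidity=60%"

manual = hmac_sha256_manual(key, msg)
library = hmac.new(key, msg, hashlib.sha256).digest()
print(manual.hex())
print("matches hmac module:", hmac.compare_digest(manual, library))
```

The two hashing rounds are what prevent an attacker from extending a message and its tag without knowing the key.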

12.4.1 HMAC vs Plain Hash

Feature | SHA-256(data) | HMAC-SHA256(key, data)
Integrity | Yes | Yes
Authentication | No | Yes
Key Required | No | Yes
Prevents Forgery | No | Yes

Always use HMAC when you need to verify that a message came from an authorized sender AND was not modified. A plain hash only shows that the data matches the hash it arrived with; HMAC also shows that the tag was created by someone holding the shared secret key.

12.5 Deprecated Algorithms

12.6 Never Use MD5 or SHA-1

These algorithms have known vulnerabilities and must not be used for security:

Algorithm | Output Size | Status | Vulnerability
MD5 | 128 bits | Broken | Collision attacks in seconds on a laptop
SHA-1 | 160 bits | Deprecated | Practical collision demonstrated by Google in 2017 (SHAttered)

Real-world impact: Attackers have created fake SSL certificates using MD5 collisions, enabling man-in-the-middle attacks.

12.6.1 Migration Path

Old Algorithm | Replacement | Notes
MD5 | SHA-256 | Direct replacement
SHA-1 | SHA-256 or SHA-3 | All new systems
SHA-256 | SHA-3 (optional) | Different internal design for defense-in-depth

12.6.2 Hash Function Comparison

Hash Function | Output Size | Security Level | Best Use Case
SHA-256 | 256 bits | 128-bit collision resistance | Default choice for IoT
SHA-3 | 256/512 bits | 128/256-bit collision resistance | Future-proofing, defense-in-depth
BLAKE2 | 256/512 bits | 128/256-bit collision resistance | High-performance hashing
SHA-512 | 512 bits | 256-bit collision resistance | When 256-bit output is insufficient

12.7 Birthday Attack: Why 128-bit Collision Resistance Matters

The birthday paradox means finding hash collisions is much easier than finding preimages. For an n-bit hash function:

  • Preimage resistance: ~2^n operations (finding input for a given hash)
  • Collision resistance: ~2^(n/2) operations (finding ANY two inputs with the same hash)

For SHA-256 (256-bit output), collision resistance is 2^128 – still astronomically large, but only the square root of the 2^256 preimage work factor (half the bits, not half the effort).

Why this matters for IoT: If an attacker could find a collision for your firmware hash, they could create malicious firmware that passes integrity checks. MD5 collisions can be found in seconds on a laptop. SHA-256 collisions require approximately 2^128 operations – far beyond the reach of any existing or foreseeable computing hardware.
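
The 2^(n/2) bound is easy to observe on a deliberately weakened hash. The sketch below truncates SHA-256 to 24 bits (an assumption made purely so the search finishes quickly) and hashes counter values until two of them collide; `truncated_hash` and `find_collision` are illustrative names.

```python
import hashlib

def truncated_hash(data: bytes, bits: int) -> int:
    """First `bits` bits of SHA-256, as an integer."""
    digest = int.from_bytes(hashlib.sha256(data).digest(), "big")
    return digest >> (256 - bits)

def find_collision(bits: int):
    """Hash counter values until two of them share a truncated hash."""
    seen = {}
    counter = 0
    while True:
        data = str(counter).encode()
        h = truncated_hash(data, bits)
        if h in seen:
            return seen[h], data, counter + 1
        seen[h] = data
        counter += 1

BITS = 24
a, b, tries = find_collision(BITS)
print(f"{BITS}-bit collision after {tries} hashes: {a!r} and {b!r}")
print(f"birthday estimate 2^({BITS}/2) = {2 ** (BITS // 2)}")
```

The number of tries should land within a small multiple of 4,096, not near the 16.7 million possible outputs. Each extra output bit makes a collision search about 1.4 times (the square root of 2) harder, which is why a 256-bit digest leaves a 2^128 margin.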

12.8 Worked Example: Rainbow Table Attack Cost Analysis

Scenario: A smart home hub stores 50,000 user passwords as unsalted SHA-256 hashes. An attacker obtains the database backup. Calculate the attack cost and compare defenses.

Step 1: Estimate brute-force speed

Modern GPUs can compute SHA-256 hashes at extraordinary speeds:

Hardware | SHA-256 Hashes/second | Cost
RTX 4090 (1 GPU) | ~22 billion/sec | ~$1,600
8x RTX 4090 rig | ~176 billion/sec | ~$15,000
Cloud (100 GPUs) | ~2.2 trillion/sec | ~$50/hour
ASIC (Bitcoin miner) | 100+ trillion/sec | ~$3,000

Step 2: Calculate crack times for common password patterns

# Hash rate taken from the hardware table above (single RTX 4090)
hash_rate = 22e9  # 22 billion SHA-256/sec

password_spaces = {
    "6-digit PIN": 10**6,
    "8-char lowercase": 26**8,
    "8-char mixed case": 52**8,
    "8-char alphanumeric": 62**8,
    "8-char all printable": 95**8,
    "12-char alphanumeric": 62**12,
}

for name, space in password_spaces.items():
    seconds = space / hash_rate
    if seconds < 1:
        print(f"{name}: {space:,.0f} combinations -> {seconds*1000:.3f} ms")
    elif seconds < 3600:
        print(f"{name}: {space:,.0f} combinations -> {seconds:.1f} seconds")
    elif seconds < 86400 * 365:
        print(f"{name}: {space:.2e} combinations -> {seconds/3600:.1f} hours")
    else:
        print(f"{name}: {space:.2e} combinations -> {seconds/(86400*365):.1f} years")

Results (single RTX 4090):

Password Type | Keyspace | Time to Exhaust | Cost
6-digit PIN | 1 million | 0.045 ms | Free
8-char lowercase | 209 billion | 9.5 seconds | Free
8-char mixed case | 53 trillion | 40 minutes | Free
8-char alphanumeric | 218 trillion | 2.8 hours | $0.14 electricity
8-char all printable | 6.6 quadrillion | 3.5 days | $4.20 electricity
12-char alphanumeric | 3.2 x 10^21 | ~4,650 years | Impractical

Step 3: Rainbow table pre-computation

For the 50,000 unsalted hashes, an attacker uses a pre-computed rainbow table:

Rainbow Table | Passwords Covered | Size | Lookup Time
8-char alphanumeric | All 218 trillion | ~2 TB | <1 second per hash
RockYou + common variants | 14 million + mutations | 50 GB | Instant

Cost to crack 50,000 unsalted hashes: Under $100 using cloud GPUs and publicly available rainbow tables. Time: minutes to hours.

Step 4: Defense comparison

Defense | Crack Time (8-char password) | Cost to Attack
Unsalted SHA-256 | 2.8 hours | $0.14
Salted SHA-256 | 2.8 hours per password (x50,000) | $7,000
bcrypt (cost=12) | ~72 years per password | $millions
Argon2id (64 MB, 3 iterations) | 500+ years per password | Impractical

Key insight: Salting defeats rainbow tables (each password must be cracked individually), but SHA-256 is still too fast. Password hashing functions like bcrypt and Argon2 add deliberate slowness that makes GPU attacks impractical.
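
The effect of salting can be seen directly. The sketch below uses plain SHA-256 only to isolate what the salt does (it is still far too fast for real password storage); the user names and helper functions are hypothetical.

```python
import hashlib
import os

def unsalted(password: str) -> str:
    """Plain SHA-256 of the password (what the breached hub stored)."""
    return hashlib.sha256(password.encode()).hexdigest()

def salted(password: str, salt: bytes) -> str:
    """SHA-256 over salt + password; still fast, shown for the salt effect only."""
    return hashlib.sha256(salt + password.encode()).hexdigest()

# Two hypothetical users who picked the same weak password
alice_pw = bob_pw = "password123"

# Unsalted: identical hashes, so one table lookup cracks both accounts
same_unsalted = unsalted(alice_pw) == unsalted(bob_pw)

# Salted: each user gets a random 16-byte salt stored next to the hash
alice_salt, bob_salt = os.urandom(16), os.urandom(16)
same_salted = salted(alice_pw, alice_salt) == salted(bob_pw, bob_salt)

print("unsalted hashes equal:", same_unsalted)
print("salted hashes equal:  ", same_salted)
```

Because every salt is different, a precomputed table built for one salt is useless against any other account.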

Real-World Breach: LinkedIn (2012)

What happened: 6.5 million LinkedIn passwords leaked, stored as unsalted SHA-1 hashes. Within hours, security researchers cracked 60% of passwords using rainbow tables and dictionary attacks.

Why unsalted hashes failed: With no salt, identical passwords (“password123”) produced identical hashes. Attackers pre-computed hashes for the top 10 million common passwords and matched them instantly against the entire database.

The fix: LinkedIn moved to salted password hashing after the breach. In 2016 the full scale of the 2012 leak became clear when about 117 million account credentials from the same incident surfaced, still in the old unsalted SHA-1 format.

IoT relevance: Many IoT platforms store device credentials the same way. A smart home cloud service storing 100,000 device passwords as unsalted hashes is one database leak away from total fleet compromise.

12.9 Password Hashing: Why Plain SHA-256 Is Not Enough

For storing user or device passwords, use password hashing functions (not plain SHA-256):

Function | Status | Use Case
Argon2id | Recommended | Modern systems, adjustable memory-hardness
bcrypt | Good | Legacy systems, widely supported
scrypt | Good | Memory-hard, prevents GPU attacks
PBKDF2 | Acceptable | FIPS compliance required

Why not plain SHA-256? Plain hashes are too fast – attackers can try billions of passwords per second. Password hashing functions are intentionally slow (100ms+) to make brute-force attacks impractical.

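import hashlib
import os

password = "MySecretP@ss123"  # example value; in practice this comes from the user

# WRONG: a single SHA-256 is so fast that attackers can brute-force it
weak_hash = hashlib.sha256(password.encode()).hexdigest()

# RIGHT: intentionally slow key derivation with a unique random salt
salt = os.urandom(16)
password_hash = hashlib.pbkdf2_hmac('sha256', password.encode(), salt, 100_000)
# Store both salt and password_hash; the salt is needed to verify a login later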
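
Storing the hash is only half the job; a login check has to re-derive it with the same salt. The following is a minimal sketch of that round trip using the standard library's PBKDF2. The function names and the iteration count are assumptions to tune for your own hardware and policy.

```python
import hashlib
import hmac
import os

ITERATIONS = 100_000  # same count as the snippet above; tune for your hardware

def hash_password(password: str):
    """Return (salt, derived_key); both must be stored for later checks."""
    salt = os.urandom(16)
    key = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, ITERATIONS)
    return salt, key

def verify_password(password: str, salt: bytes, stored_key: bytes) -> bool:
    """Re-derive with the stored salt and compare in constant time."""
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, ITERATIONS)
    return hmac.compare_digest(candidate, stored_key)

salt, stored = hash_password("MySecretP@ss123")
print(verify_password("MySecretP@ss123", salt, stored))  # correct password
print(verify_password("wrong-guess", salt, stored))      # incorrect password
```

The salt is not secret, so it can sit in the same database row as the derived key.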

12.10 Implementation Example

12.10.1 Python: HMAC-SHA256 for Sensor Data

import hmac
import hashlib

# Shared secret key (pre-provisioned on sensor and gateway)
secret_key = b"sensor_shared_secret_key"

# Sensor reading
sensor_data = b"temp=25.5,humidity=60%,time=1642345678"

# Compute HMAC
mac_tag = hmac.new(secret_key, sensor_data, hashlib.sha256).hexdigest()

# Sensor sends: {data: sensor_data, mac: mac_tag}

# Gateway verifies:
received_data = sensor_data  # from network
received_mac = mac_tag       # from network

computed_mac = hmac.new(secret_key, received_data, hashlib.sha256).hexdigest()

if hmac.compare_digest(computed_mac, received_mac):
    print("VALID: Data is authentic and unmodified")
else:
    print("INVALID: Data tampered or wrong key")

12.10.2 Key Points

  1. Use hmac.compare_digest() – Prevents timing attacks
  2. Use unique keys per device – Limits blast radius if compromised
  3. Rotate keys periodically – Limits exposure from key leakage
  4. Include a timestamp inside the MAC-protected payload and have the gateway reject stale values – Prevents replay attacks
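
The timestamp point only helps if the gateway actually checks freshness. A minimal sketch of that check follows; the 30-second window, the `|` payload separator, and the function names are assumptions for illustration. Replays inside the window would still need a counter or nonce.

```python
import hashlib
import hmac
import time

MAX_AGE_SECONDS = 30  # hypothetical freshness window

def sign(reading: str, key: bytes, now: int) -> dict:
    """Sensor: put the timestamp inside the MAC-protected payload."""
    payload = f"{reading}|{now}"
    tag = hmac.new(key, payload.encode(), hashlib.sha256).hexdigest()
    return {"payload": payload, "mac": tag}

def verify(message: dict, key: bytes, now: int) -> bool:
    """Gateway: check the MAC first, then reject stale timestamps."""
    expected = hmac.new(key, message["payload"].encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, message["mac"]):
        return False
    timestamp = int(message["payload"].rsplit("|", 1)[1])
    return abs(now - timestamp) <= MAX_AGE_SECONDS

key = b"sensor_shared_secret_key"
t0 = int(time.time())
msg = sign("temp=25.5", key, t0)

print(verify(msg, key, t0 + 5))     # fresh message
print(verify(msg, key, t0 + 3600))  # same bytes replayed an hour later
```

Verifying the MAC before parsing the timestamp means the gateway never acts on fields an attacker could have altered.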

Objective: See how even a tiny change in input produces a completely different hash output.

import hashlib

# Hash a simple IoT sensor message
message1 = "temperature:22.5"
message2 = "temperature:22.6"  # Just 0.1 degree difference

hash1 = hashlib.sha256(message1.encode()).hexdigest()
hash2 = hashlib.sha256(message2.encode()).hexdigest()

print(f"Message 1: {message1}")
print(f"Hash 1:    {hash1}")
print(f"\nMessage 2: {message2}")
print(f"Hash 2:    {hash2}")

# Count how many characters differ
differences = sum(1 for a, b in zip(hash1, hash2) if a != b)
print(f"\nCharacters that differ: {differences} out of {len(hash1)}")
print(f"Percentage changed: {100 * differences / len(hash1):.1f}%")

What to Observe:

  1. The two hashes look completely unrelated despite inputs differing by one digit
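  2. Most hex characters differ: two unrelated hex digits match only 1 time in 16, so expect roughly 94% of characters to change. The avalanche effect's 50% figure applies to bits, not hex characters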
  3. Both hashes are exactly 64 hex characters (256 bits) regardless of input length
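
To measure the avalanche effect at the bit level, where the 50% figure applies, the digests can be compared with XOR. This is a small sketch; `differing_bits` is an illustrative helper name.

```python
import hashlib

def differing_bits(a: bytes, b: bytes) -> int:
    """Count the bit positions where two equal-length digests differ."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

d1 = hashlib.sha256(b"temperature:22.5").digest()
d2 = hashlib.sha256(b"temperature:22.6").digest()

bits = differing_bits(d1, d2)
print(f"Bits that differ: {bits} out of 256 ({100 * bits / 256:.1f}%)")
```

For a well-designed hash the count should land close to 128, with some variation from one input pair to the next.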

Objective: Build a simple sensor-to-gateway authentication system using HMAC-SHA256.

import hmac
import hashlib
import time
import os

# Simulate shared secret (pre-provisioned on sensor and gateway)
shared_secret = os.urandom(32)

def sensor_send(reading, key):
    """Sensor creates authenticated message"""
    timestamp = str(int(time.time()))
    payload = f"{reading}|{timestamp}"
    mac = hmac.new(key, payload.encode(), hashlib.sha256).hexdigest()
    return {"payload": payload, "mac": mac}

def gateway_verify(message, key):
    """Gateway verifies message authenticity"""
    computed_mac = hmac.new(key, message["payload"].encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(computed_mac, message["mac"])

# Normal operation: sensor sends authenticated reading
msg = sensor_send("temp=25.5,humidity=60", shared_secret)
print(f"Sensor sends: {msg['payload']}")
print(f"MAC: {msg['mac'][:32]}...")
print(f"Gateway verifies: {gateway_verify(msg, shared_secret)}")

# Attack simulation: attacker modifies the reading
tampered = {"payload": "temp=99.9,humidity=60|" + msg["payload"].split("|")[1],
            "mac": msg["mac"]}
print(f"\nAttacker sends: {tampered['payload']}")
print(f"Gateway verifies: {gateway_verify(tampered, shared_secret)}")

# Attack simulation: attacker tries with wrong key
wrong_key = os.urandom(32)
fake_msg = sensor_send("temp=99.9,humidity=60", wrong_key)
print(f"\nWrong key msg: {fake_msg['payload']}")
print(f"Gateway verifies: {gateway_verify(fake_msg, shared_secret)}")

What to Observe:

  1. Legitimate messages verify successfully
  2. Tampered payloads fail verification even with the correct MAC format
  3. Messages signed with a different key are rejected
  4. hmac.compare_digest() prevents timing side-channel attacks

Objective: Compute SHA-256 hashes on an ESP32 using the built-in mbedTLS library.

#include <mbedtls/sha256.h>
#include <Arduino.h>

void printHash(unsigned char* hash) {
  for (int i = 0; i < 32; i++) {
    Serial.printf("%02x", hash[i]);
  }
  Serial.println();
}

void setup() {
  Serial.begin(115200);
  delay(1000);

  Serial.println("=== ESP32 SHA-256 Demo ===\n");

  // Hash a sensor reading
  const char* msg1 = "temperature:22.5";
  const char* msg2 = "temperature:22.6";
  unsigned char hash1[32], hash2[32];

  mbedtls_sha256((const unsigned char*)msg1, strlen(msg1), hash1, 0);
  mbedtls_sha256((const unsigned char*)msg2, strlen(msg2), hash2, 0);

  Serial.print("Message: "); Serial.println(msg1);
  Serial.print("SHA-256: "); printHash(hash1);

  Serial.print("\nMessage: "); Serial.println(msg2);
  Serial.print("SHA-256: "); printHash(hash2);

  // Count differing bytes
  int diff = 0;
  for (int i = 0; i < 32; i++) {
    if (hash1[i] != hash2[i]) diff++;
  }
  Serial.printf("\nBytes that differ: %d/32 (%.0f%%)\n", diff, 100.0 * diff / 32);

  // Timing test
  unsigned long start = micros();
  for (int i = 0; i < 1000; i++) {
    mbedtls_sha256((const unsigned char*)msg1, strlen(msg1), hash1, 0);
  }
  unsigned long elapsed = micros() - start;
  Serial.printf("\n1000 hashes in %lu us (%.1f us each)\n", elapsed, elapsed / 1000.0);
}

void loop() {}

What to Observe: The ESP32 computes SHA-256 in microseconds thanks to hardware acceleration. The hash output is always 32 bytes regardless of input size.

12.11 Knowledge Check

A smart home device uses SHA-256 to verify firmware updates before installation. Which property of hash functions makes this security measure effective?

Options:

    A. Hash functions can be reversed to recover the original firmware if needed
    B. Hash functions are extremely slow, preventing attackers from generating fake firmware quickly
    C. Hash functions produce a unique fixed-size fingerprint that changes completely if even one bit of the input changes
    D. Hash functions encrypt the firmware so attackers cannot read it

Correct: C

Cryptographic hash functions produce a unique, fixed-size output (256 bits for SHA-256) that serves as a “fingerprint” of the input data. The avalanche effect means changing even a single bit produces a completely different hash. When the manufacturer publishes the official hash, the device computes its own hash of downloaded firmware – if they match, the firmware is authentic.

Hash functions are NOT encryption (they do not hide data), are NOT reversible (one-way only), and are actually very fast (not slow).

Your IoT gateway receives sensor readings from remote devices. You want to verify that readings come from authorized sensors AND have not been tampered with. Which approach should you use?

Options:

    A. SHA-256 hash of the sensor data
    B. HMAC-SHA256 with a shared secret key
    C. RSA encryption of the sensor data
    D. Base64 encoding of the sensor data

Correct: B

HMAC-SHA256 combines a hash with a secret key to provide both integrity (data was not modified) AND authentication (sender knows the secret). Plain SHA-256 only provides integrity – anyone can compute a hash, so it does not prove who sent the data. RSA is for encryption/signatures (overkill here). Base64 is just encoding, not security.

Concept Relationships
Concept | Depends On | Enables | Common Mistake
Hash Functions | One-way mathematical functions | Data integrity verification | “Hashes provide confidentiality” – NO, only integrity
Collision Resistance | Large output space (256+ bits) | Secure firmware verification | Using MD5/SHA-1 (both broken)
Preimage Resistance | Computational hardness | Password storage | Storing plaintext passwords
Avalanche Effect | Good hash design | Tamper detection | Thinking minor changes will not affect hash
HMAC | Hash function + secret key | Message authentication | Using plain hash for authentication
Password Hashing | Intentional slowness (KDF) | Brute-force resistance | Using fast hashes like SHA-256
Birthday Attack | Probability theory | Determines required hash size | Ignoring collision probability

Key Distinction: Hashing is not encryption. Hashes are one-way (they cannot be decrypted); encryption is two-way. Use hashes for integrity/authentication, encryption for confidentiality.

The probability of finding at least one hash collision among \(n\) hashed items with \(H\)-bit output follows the birthday paradox.

\[P(\text{collision}) \approx 1 - e^{-\frac{n^2}{2 \times 2^H}}\]

Working through an example:

Given: Firmware verification system using SHA-256 (256-bit output), hashing 1 billion firmware images over 10 years.

Step 1: Calculate hash space size \[\text{Hash space} = 2^{256} = 1.16 \times 10^{77}\]

Step 2: Number of firmware images \[n = 10^9 = 1,000,000,000\]

Step 3: Collision probability \[P(\text{collision}) = 1 - e^{-\frac{(10^9)^2}{2 \times 2^{256}}}\]

\[= 1 - e^{-\frac{10^{18}}{2.32 \times 10^{77}}}\]

\[\approx 1 - e^{-4.3 \times 10^{-60}} \approx 4.3 \times 10^{-60}\]

Step 4: Items needed for 50% collision probability \[n_{50\%} = \sqrt{2 \times 2^{256} \times \ln(2)} \approx 4.0 \times 10^{38}\]

This is approximately \(2^{128}\), which is why SHA-256 has 128-bit collision resistance.

Result: With 1 billion firmware images, SHA-256 collision probability is \(4.3 \times 10^{-60}\) (effectively zero). You need approximately \(2^{128}\) hashes for a 50% collision chance.

In practice: SHA-256 provides 128-bit collision resistance (vs 256-bit preimage resistance). MD5 (128-bit output) has a generic birthday bound of only \(2^{64}\), and cryptanalytic attacks find MD5 collisions far faster than that. Use SHA-256 as the minimum for security-critical hashing.
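
The arithmetic above can be reproduced numerically. One practical wrinkle: evaluating \(1 - e^{-x}\) directly in floating point rounds to zero for an exponent this small, so the sketch below uses `math.expm1`. The function names are illustrative.

```python
import math

def collision_probability(n_items: int, hash_bits: int) -> float:
    """Birthday approximation: 1 - exp(-n^2 / (2 * 2^H))."""
    exponent = n_items ** 2 / (2 * 2 ** hash_bits)
    # expm1 keeps precision for tiny exponents, where 1 - exp(-x) rounds to 0.0
    return -math.expm1(-exponent)

def items_for_half(hash_bits: int) -> float:
    """Items needed for a 50% collision chance: sqrt(2 * 2^H * ln 2)."""
    return math.sqrt(2 * 2 ** hash_bits * math.log(2))

p_sha256 = collision_probability(10 ** 9, 256)
print(f"SHA-256, 1e9 items:     {p_sha256:.2e}")
print(f"SHA-256, 50% point:     {items_for_half(256):.2e}")
print(f"128-bit hash, 50% point: {items_for_half(128):.2e}")
```

The last line shows the generic bound for any 128-bit digest such as MD5, before cryptanalytic shortcuts are considered.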

12.12 See Also

Key Concepts

  • Hash Function: A deterministic function mapping arbitrary-length input to a fixed-length output (digest); the same input always produces the same output.
  • SHA-256: A 256-bit cryptographic hash function from the SHA-2 family; widely used for integrity verification and digital signatures.
  • BLAKE2: A fast cryptographic hash function designed for software performance (the BLAKE2s variant targets 8- to 32-bit platforms); often faster than MD5 with stronger security guarantees.
  • Collision Resistance: A property ensuring it is computationally infeasible to find two different inputs that produce the same hash output.
  • Pre-image Resistance: A property ensuring it is computationally infeasible to reverse a hash: given a hash output, you cannot find any input that produces it.
  • HMAC: Hash-based Message Authentication Code — a construction combining a hash function with a secret key to authenticate message origin and integrity.
  • Salt: A random value added to a password before hashing, ensuring that identical passwords produce different hashes and preventing rainbow table attacks.

Common Pitfalls

MD5 and SHA-1 have known collision vulnerabilities — attackers can craft different inputs producing the same hash. Use SHA-256 or BLAKE2 for any security-sensitive application.

Standard string comparison (==) short-circuits on the first differing byte, leaking timing information that attackers can use to guess a valid MAC one byte at a time. Always use constant-time comparison functions like hmac.compare_digest().

Unsalted password hashes can be attacked with precomputed rainbow tables. Add a cryptographically random salt unique to each user/device before hashing, and use a password-specific hash function (bcrypt, PBKDF2, Argon2).

A plain hash without a secret key can be recomputed by an attacker after modifying the message. Use HMAC (which incorporates a secret key) whenever message authentication is required.

12.13 Summary

  • Hash functions create fixed-size fingerprints of any data
  • SHA-256 is the recommended hash for IoT – 256-bit output, no known practical attacks
  • Collision resistance prevents attackers from creating fake data with matching hashes
  • Avalanche effect means tiny changes produce completely different hashes
  • HMAC combines hashing with a secret key for authentication + integrity
  • Never use MD5 or SHA-1 – both are cryptographically broken
  • Password hashing requires slow functions (Argon2id, bcrypt) not plain SHA-256

12.14 What’s Next

If you want to… | Read this
Understand how HMAC uses hash functions for authentication | Symmetric Encryption & HMAC
Learn asymmetric algorithms that use hashing internally | Asymmetric Encryption
See how hashing fits into the encryption architecture | Encryption Architecture & Levels
Explore cryptographic security properties formally | Encryption Security Properties

Continue to TLS/DTLS Transport Security to learn how transport layer security protocols protect IoT communications by combining encryption, hashing, and authentication into a complete security solution.