Differential privacy provides mathematical guarantees that individual records cannot be distinguished in aggregated datasets.
\[\Pr[M(D) \in S] \leq e^\epsilon \times \Pr[M(D') \in S]\]
where \(M\) is the randomized mechanism (query function), \(D\) and \(D'\) are neighboring datasets differing in one record, \(S\) is any set of possible outputs, and \(\epsilon\) is the privacy budget (smaller \(\epsilon\) = stronger privacy).
Working through an example. Given: a smart building with 1,000 occupancy sensors. Goal: release daily occupancy statistics while protecting individual privacy, using differential privacy with \(\epsilon = 1.0\) (moderate privacy).
Step 1: True occupancy count (sensitive data)
- True count at 2 PM: 847 people in the building
- Query: “How many people are in the building at 2 PM?”
Step 2: Add Laplace noise for differential privacy
- Sensitivity: \(\Delta f = 1\) (one person entering or leaving changes the count by 1)
- Noise scale: \(b = \Delta f / \epsilon = 1 / 1.0 = 1.0\)
- Laplace distribution: \(\mathrm{Lap}(0, b)\) with probability density \(f(x) = \frac{1}{2b}e^{-|x|/b}\)
Step 3: Sample noise from the Laplace distribution
- Generate random noise: \(\text{noise} \sim \mathrm{Lap}(0, 1.0)\)
- Example sample: \(\text{noise} = +2.3\) (95% of noise samples fall within \([-3, +3]\))
- Noisy count: \(847 + 2.3 = 849.3 \approx 849\) people
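Steps 2–3 can be sketched in a few lines of Python (a minimal sketch; the function name and the use of NumPy are my own choices, not from the original):

```python
import numpy as np

def laplace_count(true_count, sensitivity=1.0, epsilon=1.0, rng=None):
    """Release a count via the Laplace mechanism: add Lap(0, Δf/ε) noise."""
    rng = np.random.default_rng() if rng is None else rng
    b = sensitivity / epsilon              # noise scale b = Δf / ε = 1.0 here
    noise = rng.laplace(loc=0.0, scale=b)  # e.g. +2.3
    return round(true_count + noise)       # 847 + 2.3 → reported as 849

noisy = laplace_count(847, epsilon=1.0)    # varies per run, usually within ±3
```

Because the noise is drawn fresh for every release, repeated identical queries return different counts, which is exactly why the composition accounting in Step 5 is needed.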
Step 4: Calculate the privacy guarantee
- With \(\epsilon = 1.0\), for any two neighboring datasets (differing by 1 person):
- Privacy bound: \(\Pr[\text{Output} = 849 \mid 847 \text{ present}] \leq e^1 \times \Pr[\text{Output} = 849 \mid 846 \text{ present}]\)
- \(e^1 \approx 2.72\), so the two output probabilities differ by at most a factor of 2.72
- Interpretation: an observer cannot determine with high confidence whether any specific individual was present
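The Step 4 bound can be checked directly from the Laplace density (a quick sketch; `laplace_pdf` is a hypothetical helper, not a library function):

```python
import math

def laplace_pdf(x, b=1.0):
    """Density of Lap(0, b): (1 / 2b) * exp(-|x| / b)."""
    return math.exp(-abs(x) / b) / (2 * b)

# Density of observing output 849 under the two neighboring worlds
p_present = laplace_pdf(849 - 847)  # needs noise +2
p_absent  = laplace_pdf(849 - 846)  # needs noise +3
ratio = p_present / p_absent        # e^{-2} / e^{-3} = e^1 ≈ 2.72
```

For this output the ratio equals exactly \(e^\epsilon\), showing the Laplace mechanism's bound is tight, not merely an upper estimate.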
Step 5: Fleet-level aggregation (composition theorem)
- Releasing 100 daily counts at \(\epsilon = 1.0\) each compounds the privacy budget: \(\epsilon_{total} = 100 \times 1.0 = 100\)
- Solution: allocate a fixed total budget of \(\epsilon_{total} = 1.0\), giving \(\epsilon_{daily} = 1.0 / 100 = 0.01\) per release
- Higher noise per query, but bounded total privacy loss
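The budget split in Step 5 is simple arithmetic; a sketch (the function name is my own):

```python
def per_release_budget(total_epsilon, num_releases, sensitivity=1.0):
    """Split a total privacy budget evenly across releases (basic composition)."""
    eps = total_epsilon / num_releases   # 1.0 / 100 = 0.01 per release
    b = sensitivity / eps                # noise scale grows to 100.0
    return eps, b

eps_daily, b_daily = per_release_budget(1.0, 100)
```

With \(\epsilon_{daily} = 0.01\), the noise scale jumps from 1.0 to 100.0, which is the concrete cost of bounding total privacy loss over many releases.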
Pseudonymization Comparison (Fleet Tracking Example): Given: 5,000 delivery vehicles. Compare pseudonymization vs. differential privacy.
Pseudonymization approach:
- Replace driver IDs with HMAC hashes, e.g. drv_a3f5d8e2
- Reversible by anyone holding the HMAC key (no privacy guarantee if the key is compromised: known IDs can simply be re-hashed and matched)
- Re-identification risk: Medium-High (timing patterns still linkable)
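The pseudonymization scheme and its weakness can be sketched with Python's standard hmac module (the key and driver IDs below are made up for illustration):

```python
import hmac
import hashlib

KEY = b"fleet-secret"  # hypothetical HMAC key; leaking it breaks the scheme

def pseudonymize(driver_id: str) -> str:
    """Replace a driver ID with a keyed-hash pseudonym like drv_a3f5d8e2."""
    digest = hmac.new(KEY, driver_id.encode(), hashlib.sha256).hexdigest()
    return "drv_" + digest[:8]

alias = pseudonymize("driver-1042")

# Weakness: anyone holding KEY re-hashes candidate IDs and matches them
candidates = ["driver-1041", "driver-1042", "driver-1043"]
recovered = [d for d in candidates if pseudonymize(d) == alias]
# recovered contains "driver-1042" — the pseudonym is undone by a dictionary attack
```

This is why the guarantee is computational and conditional, unlike the information-theoretic guarantee of differential privacy.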
Differential privacy approach:
- Add noise to route statistics: “Driver completed 23 ± 2 deliveries”
- Irreversible (mathematical guarantee independent of computational power)
- Re-identification risk: Low (individual routes obscured by noise)
- Trade-off: roughly 8.7% relative error (\(2/23 \approx 8.7\%\)), acceptable for fleet optimization
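Where the 8.7% figure comes from, plus a sketch of one noisy per-driver release (the ±2 band is the document's illustrative value; the \(\epsilon\) and seed below are my own assumptions):

```python
import numpy as np

# The quoted trade-off: an error band of ±2 on 23 deliveries
rel_error = 2 / 23                         # ≈ 0.087, i.e. 8.7% relative error

# One noisy per-driver release at ε = 1.0 (sensitivity 1 per delivery count)
rng = np.random.default_rng(7)
noisy_deliveries = 23 + rng.laplace(0.0, 1.0 / 1.0)
```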
Result: Differential privacy with \(\epsilon=1.0\) adds Laplace noise with scale \(b = 1.0\), so 95% of reported counts fall within ±3 of the truth. For a true occupancy of 847, the reported value is 849 (0.2% error). This provides provable privacy protection while maintaining high utility for building management.
In practice: Pseudonymization (replacing IDs with keyed hashes) provides weak privacy: easily reversed if the key or mapping is leaked. Differential privacy provides provable protection: even with unlimited computational power and auxiliary data, individual presence cannot be reliably inferred. For IoT aggregate analytics (energy consumption, traffic patterns, occupancy), \(\epsilon \in [0.1, 1.0]\) balances privacy with utility. The composition theorem is critical: releasing 365 daily statistics at \(\epsilon = 1.0\) each compounds to \(\epsilon_{annual} = 365\), requiring careful privacy budget allocation across time.
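The claim that presence "cannot be reliably inferred" can be probed empirically by simulating many releases under two neighboring worlds (a sketch; the counts, seed, and sample size are my own choices):

```python
import numpy as np

rng = np.random.default_rng(42)
eps, n = 1.0, 200_000
scale = 1.0 / eps

# World A: target individual present (count 847). World B: absent (count 846).
out_a = np.round(847 + rng.laplace(0.0, scale, n))
out_b = np.round(846 + rng.laplace(0.0, scale, n))

# Empirical probability of observing the released value 849 in each world
p_a = np.mean(out_a == 849)
p_b = np.mean(out_b == 849)
# p_a / p_b concentrates near e^eps ≈ 2.72, the worst-case distinguishing power
```

Even an observer who sees the release and knows everyone else's data gains at most an \(e^\epsilon\) likelihood-ratio advantage about the target individual.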