Re-identification probability quantifies the likelihood that “anonymized” records can be linked back to real individuals.
Uniqueness Metric: \[P(\text{unique}) = \frac{\text{Records with unique quasi-identifier combination}}{\text{Total records}}\]
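The metric can be computed directly from a table of records. The following is a minimal sketch, assuming records are plain dictionaries; the function name `uniqueness`, the field names (`zip3`, `age`, `hh_size`), and the toy data are illustrative and not taken from any real dataset.

```python
from collections import Counter

def uniqueness(records, quasi_identifiers):
    """Fraction of records whose quasi-identifier combination occurs exactly once."""
    if not records:
        return 0.0
    # Build one tuple per record from the chosen quasi-identifier columns.
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    counts = Counter(keys)
    unique = sum(1 for k in keys if counts[k] == 1)
    return unique / len(records)

# Toy data, made up for illustration only.
records = [
    {"zip3": "941", "age": "35-39", "hh_size": 2},
    {"zip3": "941", "age": "35-39", "hh_size": 2},
    {"zip3": "941", "age": "40-44", "hh_size": 2},
    {"zip3": "940", "age": "35-39", "hh_size": 3},
]
print(uniqueness(records, ["zip3", "age", "hh_size"]))  # 0.5
```

Adding more quasi-identifier columns can only keep this fraction the same or raise it, which is why each extra attribute in a release increases risk.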
Re-identification finding (Sweeney, 2000): In the US, an estimated 87% of the population can be uniquely identified by the combination of:
- 5-digit ZIP code
- Birth date (month, day, year)
- Gender
Working through an example:
Given: “Anonymized” smart home dataset with 10,000 users
Quasi-identifiers included:
- 3-digit ZIP prefix (941**)
- Age bracket (35-39)
- Household size (2 people)
- Average nightly kWh usage
Step 1: Estimate population per equivalence class
US Census data for ZIP 941**:
- Population: ~340,000
- Age 35-39: 6.8% -> 23,120 people
- 2-person households: 34% -> ~7,861 households (34% of 23,120, counting each matching person as one household)
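The Step 1 arithmetic can be reproduced in a few lines. This sketch assumes, as the estimate above implicitly does, that age bracket and household size are statistically independent, so their shares can simply be multiplied; the helper name `equivalence_class_size` is hypothetical.

```python
def equivalence_class_size(population, *shares):
    """Expected number of people matching every attribute.

    Assumes the attributes are independent, so shares multiply.
    """
    size = float(population)
    for share in shares:
        size *= share
    return size

in_age_bracket = equivalence_class_size(340_000, 0.068)
class_size = equivalence_class_size(340_000, 0.068, 0.34)

print(round(in_age_bracket))  # 23120
print(round(class_size))      # 7861
# Chance of picking the right record by guessing within the class:
print(f"{1 / round(class_size):.4%}")  # 0.0127%
```

If the attributes are correlated in the real population, the true class size may be larger or smaller than this product suggests.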
Step 2: Calculate uniqueness from energy signature
Nighttime energy patterns (11 PM - 6 AM) create unique fingerprints:
- Device-specific power draw curves (EV charging, medical equipment, etc.)
- Study (Enev et al., 2011): 90% household identification from 1-week smart meter data
Step 3: Combine quasi-identifiers
\[P(\text{unique} \mid \text{ZIP}, \text{Age}, \text{HH-size}, \text{Energy}) \approx 0.9\]
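The demographic quasi-identifiers alone leave each record hidden in a large crowd, but the energy signature removes that cover:
- ZIP prefix, age bracket, and household size only: a random guess within the equivalence class succeeds with probability \(\frac{1}{7{,}861} \approx 0.013\%\)
- Combined with energy signature: \(90\%\) unique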
Step 4: Re-identification attack simulation
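Cross-reference with public voter registration records, which include exact addresses:
- Candidate pool: the 7,861 households in the equivalence class
- Match each record's energy pattern to a known resident at a candidate address
- Success rate: 9,000 of the 10,000 records re-identified, \(\frac{9{,}000}{10{,}000} = 90\%\)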
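A linkage attack of this kind can be sketched as nearest-neighbour matching between released load profiles and the attacker's own noisy observations of known households. Everything here is synthetic and hypothetical: the function `simulate_linkage`, the uniform load model, and the noise level are assumptions chosen for illustration, so the resulting rate should not be read as confirming the 90% figure.

```python
import random

def simulate_linkage(n_households=200, n_hours=7, noise=0.1, seed=0):
    """Fraction of households re-identified by nearest-neighbour matching
    of nightly load profiles (synthetic data, illustrative only)."""
    rng = random.Random(seed)
    # "Released" pseudonymized profiles: one kWh value per night-time hour.
    released = [[rng.uniform(0.1, 3.0) for _ in range(n_hours)]
                for _ in range(n_households)]
    hits = 0
    for true_idx, profile in enumerate(released):
        # Attacker's side-channel observation of a known household:
        # the same profile perturbed by Gaussian measurement noise.
        observed = [x + rng.gauss(0, noise) for x in profile]
        # Link to the released profile with the smallest squared distance.
        best = min(
            range(n_households),
            key=lambda j: sum((a - b) ** 2
                              for a, b in zip(released[j], observed)),
        )
        hits += best == true_idx
    return hits / n_households

for sigma in (0.0, 0.1, 0.5):
    print(sigma, simulate_linkage(noise=sigma))
```

Raising `noise` models a weaker attacker; raising `n_hours` models a longer observation window, which generally makes profiles easier to tell apart.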
Result: The “anonymized” smart home data carries a 90% re-identification risk because energy consumption fingerprints are nearly unique. The dataset therefore cannot be treated as anonymous under GDPR without differential privacy or extreme aggregation (\(k \ge 5{,}000\)).
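The two mitigations named above can be combined in one release step: suppress groups smaller than \(k\), then add Laplace noise to the aggregate. This is a rough sketch under stated assumptions, not a vetted privacy implementation: the function name `private_mean_kwh`, the clipping bound `cap_kwh`, and the choice `epsilon=1.0` are hypothetical, and the group size is treated as public.

```python
import random

def private_mean_kwh(values, k_min=5000, cap_kwh=50.0, epsilon=1.0, rng=None):
    """Release a noisy mean only for groups of at least k_min households.

    Returns None when the group is too small to publish.
    """
    rng = rng or random.Random()
    n = len(values)
    if n < k_min:
        return None  # aggregation threshold not met: suppress
    # Clip each household to [0, cap_kwh] so one household can change
    # the sum by at most cap_kwh, and the mean by at most cap_kwh / n.
    clipped = [min(max(v, 0.0), cap_kwh) for v in values]
    scale = cap_kwh / (n * epsilon)
    # The difference of two exponential draws is Laplace(0, scale).
    noise = rng.expovariate(1 / scale) - rng.expovariate(1 / scale)
    return sum(clipped) / n + noise

print(private_mean_kwh([12.0] * 100))                         # None
print(private_mean_kwh([12.0] * 5000, rng=random.Random(1)))  # close to 12.0
```

A production system would also need to track the privacy budget across repeated queries, which this sketch leaves out.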
In practice: IoT devices generate behavioral fingerprints (usage patterns, temporal signatures, device combinations) that can single out a household almost as reliably as a direct identifier. Traditional anonymization (removing names/IDs) fails because behavior itself is identifying. This is why GDPR still treats pseudonymized data as “personal data”: the risk of re-identification remains high.