Scenario: A city transit authority releases an “anonymized” dataset of 50,000 bus pass users over 12 months to urban planners. Each record contains: anonymous ID, timestamp (second precision), bus stop ID, and route number. Personal names and card numbers are removed. Calculate the re-identification risk and recommend privacy-preserving alternatives.
Dataset Characteristics:
| Field | Granularity | Example |
|---|---|---|
| Anonymous ID | 8-digit hash | A3F72B91 |
| Timestamp | Second | 2025-03-15 08:17:42 |
| Bus stop | Stop ID (GPS-mapped) | Stop #2847 (37.7749, -122.4194) |
| Route | Route number | Route 38 |
Average records per user: 480 trips over 12 months (a commuter riding twice daily on roughly 240 working days).
Step 1: Estimate Uniqueness from Home/Work Inference
Most commuters have consistent patterns. Extract likely home and work locations:
Home inference:
  Most frequent first morning stop (6:00-9:00 AM weekdays)
  Example: User A3F72B91 boards at Stop #2847 at 8:15 AM on 87% of weekdays
Work inference:
  Most frequent last morning stop (arrival by 9:30 AM)
  Example: User A3F72B91 exits at Stop #1423 at 8:52 AM on 84% of weekdays
Mapping stops to census blocks:
  Home stop #2847 = census block 060750123001 (population: 847)
  Work stop #1423 = census block 060750456002 (population: 2,100 daytime workers)
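The home inference rule above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not any particular tool: the function name `infer_home_stop` and the input shape (a list of `(timestamp, stop_id)` pairs for a single pseudonymous ID) are hypothetical, and the share is computed over weekdays that have at least one morning boarding.

```python
from collections import Counter
from datetime import datetime

def infer_home_stop(trips):
    """Return (stop_id, share) for the most frequent first weekday boarding
    between 06:00 and 09:00.

    trips: iterable of (timestamp, stop_id) pairs for ONE pseudonymous ID
    (hypothetical input shape, chosen for illustration).
    """
    first_by_day = {}
    for ts, stop in sorted(trips):
        # Monday-Friday only, and only boardings in the 06:00-08:59 window
        if ts.weekday() < 5 and 6 <= ts.hour < 9:
            # trips are sorted, so the first value stored is the earliest tap
            first_by_day.setdefault(ts.date(), stop)
    if not first_by_day:
        return None, 0.0
    stop, count = Counter(first_by_day.values()).most_common(1)[0]
    return stop, count / len(first_by_day)
```

The work stop could be inferred the same way by keeping the last morning record of each weekday instead of the first. The point of the sketch is that the rule needs nothing beyond the fields already in the released dataset.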
Step 2: Calculate Anonymity Set Size
People living in home census block: 847
People working in work census block: 2,100
Working-age adults in home block: ~520 (61% of population)
Of those, commuting to work block: ~520 x (2,100 / 450,000 city workers) = ~2.4
Anonymity set = ~2 people share this exact home-work pattern
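With just home + work stops, the anonymity set drops to approximately 2 people. The estimate assumes that working-age residents of the home block are spread across workplaces in proportion to each work block's share of the city's 450,000 workers. Adding commute time (8:15 AM departure) narrows identification further.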
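The Step 2 arithmetic can be written out as a small function so the inputs are easy to vary. This is a sketch only: the function and parameter names are hypothetical, and it keeps the simplifying assumption behind the calculation, namely that a home block's workers are distributed across work blocks in proportion to block employment.

```python
def expected_anonymity_set(home_pop, working_age_share, work_block_workers, city_workers):
    """Expected number of people sharing one home-block / work-block pair.

    Assumes proportional mixing: each working-age resident is equally likely
    to hold any job in the city (a simplification, not a measured quantity).
    """
    working_age = home_pop * working_age_share
    return working_age * (work_block_workers / city_workers)

# Figures from Step 2: 847 residents, 61% working age,
# 2,100 daytime workers in the work block, 450,000 workers citywide.
k = expected_anonymity_set(847, 0.61, 2_100, 450_000)  # about 2.4
```

Doubling the home block population or the work block's employment doubles the anonymity set, which shows how sparse blocks make residents easier to single out.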
Step 3: Apply the 4-Point Attack
Research on mobile phone mobility traces (de Montjoye et al., 2013) found that four spatiotemporal points are enough to uniquely identify 95% of individuals. This dataset provides about 480 points per user.
Point 1: Home stop, weekday 8:15 AM (eliminates 99.8% of city population)
Point 2: Work stop, weekday 8:52 AM (narrows to ~2 candidates)
Point 3: Saturday 2:30 PM, Stop #5891 (grocery store area) (1 candidate remaining)
Point 4: Confirmation -- any additional trip matches the identified individual's known patterns
Re-identification confidence: >99% for regular commuters
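A data custodian can measure this effect before release by estimating unicity: how often a handful of points drawn from one trace matches only that trace. The sketch below is illustrative and rests on assumptions: traces are represented as sets of `(stop_id, time_bucket)` tuples keyed by pseudonymous ID, and the names `matching_users` and `unicity` are hypothetical.

```python
import random

def matching_users(traces, known_points):
    """Return the IDs whose trace contains every known (stop_id, time_bucket) point."""
    known = set(known_points)
    return [uid for uid, points in traces.items() if known <= points]

def unicity(traces, n_points, rng):
    """Share of users singled out by n_points drawn at random from their own trace.

    traces: dict mapping pseudonymous ID -> set of (stop_id, time_bucket) tuples.
    Users with fewer than n_points records are skipped.
    """
    unique = 0
    eligible = 0
    for uid, points in traces.items():
        if len(points) < n_points:
            continue
        eligible += 1
        # sort first so sampling is reproducible for a seeded rng
        sample = rng.sample(sorted(points), n_points)
        if matching_users(traces, sample) == [uid]:
            unique += 1
    return unique / eligible if eligible else 0.0
```

Running `unicity` for `n_points` from 1 to 4, and again after coarsening the time buckets, gives a direct before-and-after comparison of how much a mitigation actually helps.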
Step 4: Cross-Reference with Public Data
An attacker combines the anonymized transit data with public records:
| Data source | Information available | Cost |
|---|---|---|
| Voter registration | Name, home address | Free |
| Property records | Home address, owner name | Free |
| LinkedIn | Employer, work address | Free |
| Social media check-ins | Specific location visits | Free |
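Matching the home census block from the transit data against voter registration yields a list of named residents; matching the inferred work location against employer information (for example, LinkedIn) then singles out the individual's name. Total attack cost: $0 and approximately 30 minutes of analysis.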
Step 5: Quantify Privacy Impact
For the 50,000-user dataset:
Regular commuters (2+ trips/week): ~35,000 (70%)
Re-identifiable from home/work: 95% = 33,250 users
Irregular users (<2 trips/week): ~15,000 (30%)
Re-identifiable from 4+ points: 60% = 9,000 users
Total re-identifiable: 42,250 out of 50,000 = 84.5%
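The same total can be reproduced as a weighted sum over user segments, which makes it easy to test other assumptions about the mix of riders. The function name and the segment representation below are hypothetical; the shares and rates are the ones used in Step 5.

```python
def reidentifiable(total_users, segments):
    """Return (count, share) of re-identifiable users.

    segments: list of (share_of_users, reidentification_rate) pairs.
    """
    count = sum(total_users * share * rate for share, rate in segments)
    return count, count / total_users

# Regular commuters: 70% of users, 95% re-identifiable.
# Irregular users:   30% of users, 60% re-identifiable.
count, share = reidentifiable(50_000, [(0.70, 0.95), (0.30, 0.60)])
# count is about 42,250 and share about 0.845, matching the total above
```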
Step 6: Recommend Privacy-Preserving Alternatives
| Approach | Privacy protection | Data utility | Implementation effort |
|---|---|---|---|
| Current release | None (84.5% re-identifiable) | Full | Already done |
| Coarsen timestamps to 1-hour bins | Low (72% still re-identifiable) | High | Easy |
| Aggregate to route-level daily counts | High (not individual-level) | Medium | Easy |
| Differential privacy (epsilon=1.0) with route-level noise | High (formally bounded) | Medium-High | Moderate |
| Synthetic data generation | Very high (no real trajectories) | Medium | Complex |
Recommended solution: Release route-level hourly aggregate counts (passengers per route per hour) instead of individual trip records. Urban planners can still analyze demand patterns, peak hours, and route utilization without exposing individual mobility patterns. For analyses requiring origin-destination matrices, apply differential privacy with epsilon=0.5 and aggregate to zone level (10+ census blocks per zone).
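As a rough illustration of the aggregate release, the sketch below counts trips per route and hour and adds Laplace noise calibrated to a per-user contribution cap. It is a teaching sketch under stated assumptions, not a production mechanism: the function names, the `(user_id, route, hour)` input shape, and the `max_trips_per_user` cap are all hypothetical, and a real release should rely on a vetted differential privacy library, because naive floating-point noise sampling has known weaknesses.

```python
import random
from collections import Counter

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def noisy_route_hour_counts(trips, cells, epsilon, max_trips_per_user, rng):
    """Return {(route, hour): noisy count} for every cell in `cells`.

    trips: iterable of (user_id, route, hour) records.
    cells: fixed list of (route, hour) cells to publish, chosen independently
           of the data so that empty cells also receive noise.
    """
    kept = Counter()
    counts = Counter()
    allowed = set(cells)
    for user_id, route, hour in trips:
        cell = (route, hour)
        # Cap each user's contribution so that adding or removing one user
        # changes the counts by at most max_trips_per_user in total.
        if cell not in allowed or kept[user_id] >= max_trips_per_user:
            continue
        kept[user_id] += 1
        counts[cell] += 1
    scale = max_trips_per_user / epsilon  # L1 sensitivity divided by epsilon
    # Rounding and clamping at zero are post-processing and keep the guarantee.
    return {cell: max(0, round(counts[cell] + laplace_noise(scale, rng)))
            for cell in cells}
```

The same pattern would apply to zone-level origin-destination cells: the cell key becomes an `(origin_zone, destination_zone)` pair, and a smaller epsilon such as 0.5 means a larger noise scale for the same contribution cap.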
Key lesson: Removing names and card numbers is not anonymization – it is pseudonymization. Location data is inherently self-identifying because human mobility patterns are nearly unique. The only effective privacy strategy is to prevent release of individual-level location traces entirely, using aggregation or synthetic data instead.