1277  Sharding Strategies

Learning Objectives

After completing this chapter, you will be able to:

  • Choose between time-based, device-based, and hybrid sharding strategies
  • Implement distributed hypertables in TimescaleDB
  • Design Cassandra partition keys for IoT workloads
  • Handle hot spots and shard rebalancing
  • Make informed decisions about sharding vs. replication

1277.1 Introduction

Sharding (horizontal partitioning) distributes data across multiple database nodes to handle scale beyond a single server’s capacity. For IoT time-series data, choosing the right sharding key and strategy is critical for both write performance and query efficiency.

1277.2 The Sharding Key Decision

The sharding key determines how data is distributed across nodes. For IoT, three primary strategies exist:

1277.2.1 1. Time-Based Sharding (Range Partitioning)

Shard 1: January 2024 data
Shard 2: February 2024 data
Shard 3: March 2024 data
...

Pros: Simple retention (drop an entire shard at once); time-range queries hit a single shard

Cons: Hot-shard problem: all current writes go to the newest shard, creating a write bottleneck

When to use: Read-heavy workloads with time-bounded queries; archival systems
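
To make the routing concrete, here is a minimal Python sketch (the monthly shard-naming scheme is illustrative, not tied to any particular database): each reading maps to the shard for its calendar month, and retention amounts to dropping whole monthly shards.

from datetime import datetime

def time_shard_for(ts: datetime) -> str:
    """Map a timestamp to a monthly shard, e.g. 2024-01-15 -> 'shard_2024_01'."""
    return f"shard_{ts.year:04d}_{ts.month:02d}"

def shards_to_keep(all_shards: list[str], cutoff: datetime) -> list[str]:
    """Retention is cheap: keep only shards at or after the cutoff month."""
    keep_from = time_shard_for(cutoff)
    return [s for s in all_shards if s >= keep_from]  # zero-padded names sort chronologically

# Every write for January 2024 lands on the same shard (the hot shard).
print(time_shard_for(datetime(2024, 1, 15)))   # shard_2024_01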

1277.2.2 2. Device-Based Sharding (Hash Partitioning)

Shard 1: device_ids hashing to 0-999
Shard 2: device_ids hashing to 1000-1999
Shard 3: device_ids hashing to 2000-2999
...

Pros: Even write distribution; device-specific queries hit a single shard

Cons: Cross-device aggregations require scatter-gather across all shards; retention requires per-shard cleanup

When to use: Write-heavy workloads; queries primarily filter by device
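
A minimal routing sketch under the same illustrative assumptions as above: a stable hash of device_id picks the shard, so the same device always lands on the same node.

import hashlib

def device_shard_for(device_id: str, num_shards: int = 3) -> int:
    """Stable hash routing: the same device_id always maps to the same shard."""
    # A fixed hash (md5 here) is used instead of Python's built-in hash(),
    # which is salted per process and would break routing stability.
    digest = hashlib.md5(device_id.encode()).hexdigest()
    return int(digest, 16) % num_shards

print(device_shard_for("sensor-42"))   # deterministic across processes and restarts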

1277.2.3 3. Hybrid Sharding (Two-Level)

%%{init: {'theme': 'base', 'themeVariables': {'primaryColor': '#2C3E50', 'primaryTextColor': '#fff', 'primaryBorderColor': '#16A085', 'lineColor': '#7F8C8D', 'secondaryColor': '#16A085', 'tertiaryColor': '#E67E22', 'fontSize': '14px'}}}%%
flowchart TB
    subgraph L1["Level 1: Hash by device_id"]
        direction LR
        HASH["Hash Function"]
    end

    subgraph S1["Shard 1<br/>devices 0-999"]
        direction TB
        S1J["Jan 2024 chunk"]
        S1F["Feb 2024 chunk"]
        S1M["Mar 2024 chunk"]
    end

    subgraph S2["Shard 2<br/>devices 1000-1999"]
        direction TB
        S2J["Jan 2024 chunk"]
        S2F["Feb 2024 chunk"]
        S2M["Mar 2024 chunk"]
    end

    subgraph S3["Shard N<br/>devices N-range"]
        direction TB
        S3J["Jan 2024 chunk"]
        S3F["Feb 2024 chunk"]
        S3M["Mar 2024 chunk"]
    end

    HASH --> S1
    HASH --> S2
    HASH --> S3

    style L1 fill:#2C3E50,stroke:#16A085,color:#fff
    style S1 fill:#16A085,stroke:#2C3E50,color:#fff
    style S2 fill:#16A085,stroke:#2C3E50,color:#fff
    style S3 fill:#16A085,stroke:#2C3E50,color:#fff
    style HASH fill:#E67E22,stroke:#2C3E50,color:#fff

Figure 1277.1: Hybrid Sharding: Hash by device_id for write distribution, then partition by time within each shard for efficient retention.

Pros: Balanced writes (hash) + efficient retention (time chunks) + device queries stay local

Cons: Implementation complexity; cross-device queries still scatter

When to use: Large-scale IoT with mixed query patterns (the common case)
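
The two levels compose naturally. The sketch below is illustrative only (real systems delegate this routing to the database, as the next two sections show): it hashes device_id to pick a shard, then derives a daily chunk label inside that shard.

import hashlib
from datetime import datetime

def hybrid_route(device_id: str, ts: datetime, num_shards: int = 3) -> tuple[int, str]:
    """Level 1: hash device_id to a shard. Level 2: pick a daily chunk within it."""
    shard = int(hashlib.md5(device_id.encode()).hexdigest(), 16) % num_shards
    chunk = ts.strftime("%Y_%m_%d")          # one chunk per day inside the shard
    return shard, chunk

# Writes from many devices spread across shards; retention drops whole day chunks.
print(hybrid_route("sensor-42", datetime(2024, 1, 15)))   # e.g. (1, '2024_01_15')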

1277.3 Implementation: TimescaleDB Distributed Hypertables

TimescaleDB 2.0+ supports distributed hypertables that implement hybrid sharding automatically:

-- Create distributed hypertable across 3 data nodes
CREATE TABLE sensor_data (
    time        TIMESTAMPTZ NOT NULL,
    device_id   VARCHAR(50) NOT NULL,
    value       DOUBLE PRECISION,
    quality     SMALLINT
);

SELECT create_distributed_hypertable(
    'sensor_data',
    'time',                          -- Time dimension (chunking)
    'device_id',                     -- Space dimension (sharding)
    number_partitions => 16,         -- Hash partitions for the space dimension
    chunk_time_interval => INTERVAL '1 day'
);

-- Query behavior:
-- Device-specific: routes to single node
SELECT * FROM sensor_data
WHERE device_id = 'sensor-42' AND time > NOW() - INTERVAL '1 hour';

-- Time-range aggregate: parallel across all nodes, merge results
SELECT date_trunc('hour', time), avg(value)
FROM sensor_data
WHERE time > NOW() - INTERVAL '7 days'
GROUP BY 1;
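
On the ingestion side, clients write through the access node, which routes each row to the data node owning its chunk. A hedged client sketch using psycopg2 (the DSN and credentials are placeholders; only the sensor_data schema above is assumed):

import psycopg2
from psycopg2.extras import execute_values
from datetime import datetime, timezone

rows = [
    (datetime.now(timezone.utc), "sensor-42", 21.7, 0),
    (datetime.now(timezone.utc), "sensor-43", 19.2, 0),
]

conn = psycopg2.connect("dbname=iot host=access-node user=ingest")  # placeholder DSN
with conn, conn.cursor() as cur:
    # Batch the insert; rows are fanned out to the owning data nodes.
    execute_values(
        cur,
        "INSERT INTO sensor_data (time, device_id, value, quality) VALUES %s",
        rows,
    )
conn.close()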

1277.4 Implementation: Cassandra Partition Design

Apache Cassandra uses partition keys for distribution and clustering columns for ordering within partitions:

-- Optimal IoT schema: partition by device + time bucket
CREATE TABLE sensor_readings (
    device_id    text,
    time_bucket  timestamp,      -- Truncated to hour/day
    reading_time timestamp,
    value        double,
    PRIMARY KEY ((device_id, time_bucket), reading_time)
) WITH CLUSTERING ORDER BY (reading_time DESC);

-- Why time_bucket in partition key?
-- Without: device_id alone creates unbounded partition growth
-- With: bounded partitions (~100MB each), predictable performance

-- Query patterns enabled:
-- 1. Single device, recent data (hits 1 partition):
SELECT * FROM sensor_readings
WHERE device_id = 'sensor-42'
AND time_bucket = '2024-01-15 00:00:00';

-- 2. Single device, date range (hits multiple partitions):
SELECT * FROM sensor_readings
WHERE device_id = 'sensor-42'
AND time_bucket IN ('2024-01-14 00:00:00', '2024-01-15 00:00:00');
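
The schema assumes the client computes time_bucket before writing. A minimal sketch of that truncation (daily buckets assumed here; choose hourly buckets for higher-rate devices):

from datetime import datetime, timezone

def day_bucket(ts: datetime) -> datetime:
    """Truncate a reading timestamp to its day to form the time_bucket column."""
    return ts.replace(hour=0, minute=0, second=0, microsecond=0)

reading_time = datetime(2024, 1, 15, 8, 23, 11, tzinfo=timezone.utc)
print(day_bucket(reading_time))   # 2024-01-15 00:00:00+00:00 -> bounded partition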

1277.5 Shard Rebalancing and Hot Spots

The Hot Device Problem: Some IoT devices generate 100x more data than others (high-frequency industrial sensors vs. daily weather stations). Hash-based sharding can still create hot spots.

Solution: Compound Sharding Key

# Add randomization to spread a hot device's writes across multiple buckets
import random

def generate_shard_key(device_id: str, num_buckets: int = 4) -> str:
    """
    Spread a single hot device across multiple shards.
    device-001 becomes one of: device-001:0, device-001:1, device-001:2, device-001:3
    """
    bucket = random.randint(0, num_buckets - 1)
    return f"{device_id}:{bucket}"

# Queries must now scatter across buckets and merge results, e.g.:
# SELECT * FROM sensors WHERE device_id IN ('device-001:0', 'device-001:1', ...)
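
A small query-side companion (same assumptions as generate_shard_key above): enumerate every sub-bucket of a hot device so a read can scatter across them and merge the results.

def query_keys(device_id: str, num_buckets: int = 4) -> list[str]:
    """All sub-keys a read must cover for a device spread across buckets."""
    return [f"{device_id}:{b}" for b in range(num_buckets)]

print(query_keys("device-001"))
# ['device-001:0', 'device-001:1', 'device-001:2', 'device-001:3']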

Rebalancing Strategies:

Strategy              | Downtime | Data Movement | Complexity
Add shard, rehash all | Hours    | 100%          | Low
Consistent hashing    | Minutes  | ~1/N          | Medium
Virtual shards        | Seconds  | Configurable  | High
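
To see why consistent hashing moves only about 1/N of the data, here is a minimal ring sketch (illustrative only; production rings place many virtual points per node to even out the arcs):

import bisect
import hashlib

class HashRing:
    def __init__(self, nodes: list[str]):
        # One point per node on the ring; real systems use many virtual points.
        self._ring = sorted((self._h(n), n) for n in nodes)

    @staticmethod
    def _h(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, key: str) -> str:
        points = [p for p, _ in self._ring]
        i = bisect.bisect(points, self._h(key)) % len(self._ring)
        return self._ring[i][1]

ring = HashRing(["node-a", "node-b", "node-c"])
before = {f"device-{i}": ring.node_for(f"device-{i}") for i in range(1000)}

ring = HashRing(["node-a", "node-b", "node-c", "node-d"])   # add one node
moved = sum(1 for k, n in before.items() if ring.node_for(k) != n)
print(f"{moved} of 1000 keys moved")   # on the order of 1/N; a full rehash moves nearly all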

1277.6 Performance Characteristics by Strategy

Metric               | Time-Based           | Device-Based          | Hybrid
Write throughput     | Limited by hot shard | Linear with nodes     | Linear with nodes
Device query latency | Scatter-gather       | Single node           | Single node
Time-range query     | Single node          | Scatter-gather        | Scatter-gather
Retention cleanup    | Drop shard (fast)    | Per-shard scan (slow) | Drop chunks (fast)
Implementation       | Simple               | Simple                | Complex

1277.7 Edge Cases and Gotchas

1. Shard Key Cardinality: Too few unique device_ids (< 100) lead to uneven distribution. Solution: use a compound key with sub-partitions.

2. Query Routing Overhead: Every cross-shard query adds coordinator overhead. Design queries to minimize scatter when possible.

3. Transaction Boundaries: Distributed transactions across shards are expensive or unsupported. Design schemas so single-device operations stay within one shard.

4. Schema Changes: Adding columns to distributed tables requires coordination. Plan schema evolution carefully.

5. Backup Complexity: Point-in-time recovery across shards requires consistent snapshots. Use database-native distributed backup tools.

1277.8 Decision Framework

%%{init: {'theme': 'base', 'themeVariables': {'primaryColor': '#2C3E50', 'primaryTextColor': '#fff', 'primaryBorderColor': '#16A085', 'lineColor': '#7F8C8D', 'secondaryColor': '#16A085', 'tertiaryColor': '#E67E22', 'fontSize': '14px'}}}%%
flowchart TD
    START["Start Here:<br/>What is your PRIMARY query pattern?"]

    Q1["80%+ queries filter<br/>by device_id first"]
    Q2["80%+ queries filter<br/>by time range first"]
    Q3["Mixed patterns<br/>(common in IoT)"]

    S1["Device-based sharding<br/>(hash by device_id)"]
    S2["Time-based sharding<br/>(range by timestamp)"]
    S3["Hybrid sharding<br/>(hash by device, partition by time)"]

    R1["Add time-based partitioning<br/>within shards for retention"]
    R2["Accept write hotspot OR<br/>use write buffering"]
    R3["Use TimescaleDB distributed hypertables<br/>or Cassandra compound partition keys"]

    START --> Q1
    START --> Q2
    START --> Q3

    Q1 --> S1
    Q2 --> S2
    Q3 --> S3

    S1 --> R1
    S2 --> R2
    S3 --> R3

    style START fill:#2C3E50,stroke:#16A085,color:#fff
    style Q1 fill:#16A085,stroke:#2C3E50,color:#fff
    style Q2 fill:#16A085,stroke:#2C3E50,color:#fff
    style Q3 fill:#16A085,stroke:#2C3E50,color:#fff
    style S1 fill:#E67E22,stroke:#2C3E50,color:#fff
    style S2 fill:#E67E22,stroke:#2C3E50,color:#fff
    style S3 fill:#E67E22,stroke:#2C3E50,color:#fff
    style R1 fill:#7F8C8D,stroke:#2C3E50,color:#fff
    style R2 fill:#7F8C8D,stroke:#2C3E50,color:#fff
    style R3 fill:#7F8C8D,stroke:#2C3E50,color:#fff

Figure 1277.2: Sharding Strategy Decision: Choose based on primary query pattern, then add secondary partitioning for retention.

Rule of Thumb: For most IoT systems with >100K devices and >10K writes/sec, hybrid sharding with device-based distribution and time-based chunking provides the best balance of write performance, query flexibility, and operational simplicity.

1277.9 Knowledge Check

Scenario: Your IoT platform has 1 million devices generating 10 billion records/month. Query patterns: 80% filter by device_id, 20% aggregate across all devices by time range. Single PostgreSQL server maxes out at 5,000 writes/sec.

Think about:
1. How should you shard the data: by device_id, by timestamp, or hybrid?
2. What happens to cross-device aggregate queries with device_id sharding?
3. How many shards do you need for 10 billion records/month at 100 bytes/record?

Key Insight: Use hybrid sharding: shard by device_id for write distribution (1M devices across 10 shards = 100K devices per shard), then partition each shard by time for efficient retention. Trade-off: device-specific queries hit one shard (fast), while cross-device aggregates must query all 10 shards and merge results (slower, but acceptable for 20% of queries). Storage: 10B records x 100 bytes = 1 TB/month; with 3-month retention, that is 3 TB distributed across 10 shards = 300 GB per shard.
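
A quick back-of-envelope check of those numbers (pure arithmetic, nothing database-specific):

records_per_month = 10_000_000_000      # 10 billion
bytes_per_record  = 100
retention_months  = 3
num_shards        = 10

monthly_tb   = records_per_month * bytes_per_record / 1e12   # 1.0 TB/month
total_tb     = monthly_tb * retention_months                 # 3.0 TB retained
per_shard_gb = total_tb * 1000 / num_shards                  # 300.0 GB per shard
print(monthly_tb, total_tb, per_shard_gb)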

Question: What is the key difference between horizontal partitioning (sharding) and vertical partitioning?

Horizontal partitioning (sharding) divides rows across servers, e.g. users 1-1M on server1 and 1M-2M on server2. It scales throughput and storage by distributing data.

Vertical partitioning splits columns, e.g. user profile data (name, email) on server1 and user activity (clicks, views) on server2. It separates hot and cold data.

IoT example: Shard sensor readings by device_id range (horizontal) to distribute 10TB across 10 servers (1TB each). Vertically partition metadata (device name, location) from metrics (temperature, pressure) since metadata is rarely updated.

1277.10 Summary

  • Time-based sharding is simple but creates hot spots for current writes
  • Device-based sharding distributes writes evenly but complicates time-range aggregations
  • Hybrid sharding (device hash + time partition) provides the best balance for most IoT workloads
  • TimescaleDB distributed hypertables and Cassandra compound partition keys implement hybrid sharding natively
  • Hot device problem requires compound sharding keys to spread high-frequency devices across shards
  • Rebalancing strategies trade off downtime, data movement, and implementation complexity

1277.11 What’s Next

Continue with Worked Examples to see complete database architecture designs for fleet management and smart city data lakes.

Related Chapters:

  • CAP Theorem - Distributed systems trade-offs
  • Time-Series Databases - TimescaleDB optimization
  • Database Selection Framework - Choosing the right database