Option A (Hot Standby - Active Secondary Pipeline):
- Failover time: 1-5 seconds (the standby is already running and only needs to be promoted to primary)
- RPO: 0-100 ms (the standby processes the same stream in parallel, so its state is nearly identical)
- Infrastructure cost: 2x compute resources (both pipelines run continuously)
- Operational complexity: High; the team must manage dual pipelines and output deduplication
- State consistency: Requires careful coordination to prevent duplicate outputs
- Network cost: 2x ingress into the pipelines if both consume from the same Kafka cluster
- Availability: 99.99%+ achievable (failover within seconds eliminates most user-visible outages)
- Use cases: Trading systems, real-time bidding, autonomous vehicle coordination

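
The output deduplication that Option A requires can be sketched as a sink that accepts writes from both pipelines and drops anything it has already emitted. This is a minimal illustration, not a production design: the `Output` record, the per-key `seq` number, and the `DedupSink` class are hypothetical names, and the sketch assumes each pipeline emits records for a given key in increasing `seq` order.

```python
from typing import Dict, List, NamedTuple


class Output(NamedTuple):
    key: str    # entity the record belongs to
    seq: int    # hypothetical per-key sequence number, e.g. derived from the input offset
    value: str


class DedupSink:
    """Accepts records from both pipelines and emits each (key, seq) at most once."""

    def __init__(self) -> None:
        self._last_seq: Dict[str, int] = {}  # highest seq emitted so far, per key
        self.emitted: List[Output] = []

    def write(self, record: Output) -> bool:
        # Assumption: per-key records arrive in increasing seq order from each pipeline,
        # so anything at or below the last emitted seq is a duplicate.
        last = self._last_seq.get(record.key, -1)
        if record.seq <= last:
            return False
        self._last_seq[record.key] = record.seq
        self.emitted.append(record)
        return True


sink = DedupSink()
primary = [Output("order-1", 0, "created"), Output("order-1", 1, "paid")]
standby = [
    Output("order-1", 0, "created"),
    Output("order-1", 1, "paid"),
    Output("order-1", 2, "shipped"),  # produced after the primary failed
]
for rec in primary + standby:
    sink.write(rec)

print([r.seq for r in sink.emitted])  # [0, 1, 2]
```

Keying the check on a sequence number derived from the input lets the standby take over mid-stream without the sink knowing which pipeline is currently primary.
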
Option B (Cold Recovery - Restart from Checkpoint):
- Failover time: 30 seconds to 5 minutes (restore the checkpoint, replay from Kafka, warm up state)
- RPO: Bounded by the checkpoint interval (10 seconds to 15 minutes of data must be replayed)
- Infrastructure cost: 1x compute resources (extra capacity is needed only during recovery)
- Operational complexity: Lower; a single pipeline using standard Flink/Spark recovery
- State consistency: Guaranteed by the checkpoint/replay mechanism
- Network cost: 1x ingress during normal operation
- Availability: 99.9% typical (minutes of downtime during each failure)
- Use cases: Analytics dashboards, alerting systems, feature engineering pipelines

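
To see how the checkpoint interval feeds into cold-recovery time, the rough model below adds up detection, restore, and catch-up time. It is a back-of-the-envelope sketch under stated assumptions: the function name, its parameters, and the constant `replay_speedup` (how many times faster than real time the job can replay from Kafka) are illustrative and not taken from Flink or Spark.

```python
def estimate_recovery_seconds(
    checkpoint_interval_s: float,
    detect_s: float,
    restore_s: float,
    replay_speedup: float,
) -> float:
    """Worst-case time from failure until the restarted job is caught up.

    Assumptions (illustrative only):
    - the failure happens just before the next checkpoint, so a full interval is replayed;
    - events keep arriving at a steady rate while the job is down;
    - replay runs at a constant multiple of real time.
    """
    if replay_speedup <= 1.0:
        raise ValueError("replay must be faster than real time to catch up")
    # Seconds of input waiting when replay starts.
    backlog_s = checkpoint_interval_s + detect_s + restore_s
    # New events arrive during replay, so the backlog drains at (speedup - 1).
    catch_up_s = backlog_s / (replay_speedup - 1.0)
    return detect_s + restore_s + catch_up_s


# 60 s checkpoints, 10 s to detect, 20 s to restore, replay at 4x real time.
print(estimate_recovery_seconds(60, 10, 20, 4.0))  # 60.0
```

Under this model, shortening the checkpoint interval or raising replay throughput are the two levers for pushing recovery toward the lower end of the 30-second to 5-minute range.
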
Decision Factors:
- Choose Hot Standby when:
  - Revenue loss exceeds $1,000 per minute of downtime
  - The system is safety-critical and a 30-second outage is unacceptable
  - A contractual SLA requires 99.99%+ availability
  - State is too large for fast checkpoint restoration (>1 TB)
- Choose Cold Recovery when:
  - Cost optimization is the priority (roughly 50% infrastructure savings relative to Hot Standby)
  - The business tolerates a 1-5 minute recovery time
  - Checkpoint-based recovery meets the SLA
  - The team lacks expertise in dual-pipeline coordination
- Warm standby hybrid: Keep a standby cluster in a suspended state (containers allocated but not processing). This reduces failover to 10-30 seconds at about 1.3x cost instead of 2x, a good middle ground for 99.95% availability requirements.
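
The decision factors above can be expressed as a small helper, together with the yearly downtime budget each availability target implies. This is a sketch only: the `Requirements` fields and function names are hypothetical, and treating the Hot Standby criteria as "any one is sufficient" is an assumption rather than something the criteria state explicitly.

```python
from dataclasses import dataclass

MINUTES_PER_YEAR = 365 * 24 * 60


@dataclass
class Requirements:
    revenue_loss_per_minute: float  # USD lost per minute of downtime
    safety_critical: bool
    required_availability: float    # e.g. 0.9999 for 99.99%
    state_size_tb: float


def annual_downtime_minutes(availability: float) -> float:
    """Downtime budget per year implied by an availability target."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def choose_strategy(req: Requirements) -> str:
    # Assumption: any single Hot Standby criterion is enough to justify the 2x cost.
    if (
        req.revenue_loss_per_minute > 1000
        or req.safety_critical
        or req.required_availability >= 0.9999
        or req.state_size_tb > 1.0
    ):
        return "hot standby"
    # 99.95% targets fall between the two options, where warm standby fits.
    if req.required_availability >= 0.9995:
        return "warm standby"
    return "cold recovery"


dashboard = Requirements(50.0, False, 0.999, 0.2)
print(choose_strategy(dashboard))                # cold recovery
print(round(annual_downtime_minutes(0.999), 1))  # 525.6
```

For comparison, a 99.99% target leaves roughly 52.6 minutes of downtime per year and 99.95% roughly 263 minutes, which is why a handful of multi-minute cold recoveries can exhaust the stricter budgets.
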