Long-duration testing catches bugs that only appear after extended operation.
1571.5.1 Why Soak Testing Matters
These bugs escape short-term testing:
| Bug Type | Time to Manifest | Example |
|---|---|---|
| Memory leaks | 24-168 hours | 10 bytes/hour = crash after 1 week |
| Battery drain | 72+ hours | Sleep mode bug draining 10 mA |
| Flash wear | 1-6 months | Writing to same sector 1000x/day |
| Network handle exhaustion | 48+ hours | Socket not closed properly |
| RTC drift | 1-4 weeks | Clock skewing 1 second/day |
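A quick way to budget for leaks like the one in the first row: time to heap exhaustion is simply the free-heap headroom divided by the leak rate. A minimal sketch (the numbers are illustrative, not measured from any device):

```python
def hours_until_heap_exhaustion(free_heap_bytes: int, leak_rate_bytes_per_hour: float) -> float:
    """Estimate how long a device can run before a steady leak exhausts its heap."""
    return free_heap_bytes / leak_rate_bytes_per_hour

# Illustrative numbers: 40 KB of headroom leaking 50 bytes/hour survives ~33 days,
# which a 7-day soak test will not catch unless the leak *rate* itself is trended.
print(hours_until_heap_exhaustion(40_000, 50) / 24, "days")
```

This is why the pass criteria below track memory stability over the whole run rather than a single end-of-test snapshot.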
1571.5.2 Soak Test Protocol
Soak Test Procedure - Smart Home Sensor
Duration: 168 hours (7 days) continuous
Environment:
- Thermal cycling: 15°C to 35°C, 12-hour cycle
- Normal Wi-Fi operation (production router)
- Simulated sensor inputs (HIL or real environment)
Monitoring (logged every minute):
- Free heap memory
- Stack high water mark
- Wi-Fi reconnection events
- MQTT message count (sent vs acknowledged)
- Current consumption
- CPU temperature
- RTC accuracy (vs NTP reference)
Pass Criteria:
- Zero crashes/reboots (except scheduled)
- Memory usage stable (±5% over duration)
- All messages delivered (acknowledgment rate >99.9%)
- Current within spec (sleep <20 µA, active <200 mA)
- RTC drift <10 seconds over 7 days
- No logged errors beyond acceptable rate
Automatic Abort Conditions:
- Device unresponsive >5 minutes
- Memory <10KB free
- Temperature >85°C
- >100 Wi-Fi reconnects/hour
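The abort conditions above are easiest to enforce with a monitoring harness rather than a person watching logs. A minimal sketch, assuming each telemetry sample arrives as a dict of the metrics listed in the protocol (the field names are assumptions, not a real API):

```python
# Minimal soak-test abort monitor (sketch; field names are assumed, not a real schema)
ABORT_RULES = {
    "seconds_since_last_heartbeat": lambda v: v > 5 * 60,   # unresponsive >5 minutes
    "free_heap_bytes":              lambda v: v < 10_000,   # memory <10 KB free
    "cpu_temp_c":                   lambda v: v > 85,        # temperature >85 °C
    "wifi_reconnects_last_hour":    lambda v: v > 100,       # >100 reconnects/hour
}

def check_abort(sample: dict) -> list[str]:
    """Return the names of any abort conditions violated by one telemetry sample."""
    return [name for name, violated in ABORT_RULES.items() if violated(sample[name])]

# Example: this sample trips the temperature rule
sample = {"seconds_since_last_heartbeat": 30, "free_heap_bytes": 48_000,
          "cpu_temp_c": 88.5, "wifi_reconnects_last_hour": 2}
violations = check_abort(sample)
if violations:
    print("ABORT soak test:", ", ".join(violations))
```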
1571.5.3 Analyzing Soak Test Results
```python
# Soak test analysis script
import pandas as pd
import matplotlib.pyplot as plt

# Load telemetry data (one row per minute of the 168-hour soak test)
df = pd.read_csv("soak_test_telemetry.csv", parse_dates=["timestamp"])

# Check for memory leaks: compare mean free heap in the first and last hour of the run
hours_elapsed = (df["timestamp"].iloc[-1] - df["timestamp"].iloc[0]).total_seconds() / 3600
leak_rate = (df["free_heap"].iloc[:60].mean() - df["free_heap"].iloc[-60:].mean()) / hours_elapsed
if leak_rate > 10:  # More than 10 bytes/hour
    print(f"WARNING: Memory leak detected! Rate: {leak_rate:.1f} bytes/hour")

# Check for connectivity issues
reconnect_count = df["wifi_reconnect_count"].iloc[-1]
if reconnect_count > 10:
    print(f"WARNING: Excessive Wi-Fi reconnects: {reconnect_count}")

# Check for message delivery issues
ack_rate = df["messages_acked"].sum() / df["messages_sent"].sum()
if ack_rate < 0.999:
    print(f"WARNING: Message delivery rate below target: {ack_rate:.3%}")

# Visualize memory over time
plt.figure(figsize=(12, 4))
plt.plot(df["timestamp"], df["free_heap"])
plt.xlabel("Time")
plt.ylabel("Free Heap (bytes)")
plt.title("Memory Usage Over 7-Day Soak Test")
plt.savefig("soak_memory_trend.png")
```
1571.6 Field Failure Analysis
When field failures occur, systematic root cause analysis is critical.
1571.6.1 Failure Investigation Workflow
Field Failure Investigation Process
1. GATHER DATA
- Device telemetry logs (last 72 hours)
- User-reported symptoms
- Environmental conditions (location, weather)
- Device firmware version
- Network configuration
2. REPRODUCE
- Attempt reproduction in lab with same:
- Firmware version
- Network configuration
- Simulated environmental conditions
- If the failure can't be reproduced, gather more field data
3. ISOLATE
- Binary search through firmware versions (see the bisection sketch after this list)
- Component swap testing
- Protocol analyzer captures
- Memory dumps (if device accessible)
4. ROOT CAUSE
- Identify specific failure mechanism
- Determine why testing didn't catch it
- Document conditions required to trigger
5. FIX & VALIDATE
- Implement fix
- Add test case that would have caught bug
- Validate fix doesn't introduce regressions
- Plan field deployment (OTA or recall)
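The "binary search through firmware versions" in step 3 is the same bisection idea behind git bisect: if the failure was introduced by some release and persists in later ones, each reproduction run halves the candidate range. A minimal sketch, where fails_soak stands in for whatever reproduction test the lab uses (both that function and the version list are hypothetical):

```python
# Firmware-version bisection (sketch; version list and test hook are hypothetical)
def first_bad_version(versions: list[str], fails_soak) -> str | None:
    """Binary-search an ordered list of firmware versions for the first one that
    reproduces the failure. fails_soak(version) runs the repro test and returns
    True if the failure occurs on that version."""
    lo, hi = 0, len(versions) - 1
    if not fails_soak(versions[hi]):
        return None          # even the latest version doesn't reproduce the failure
    while lo < hi:
        mid = (lo + hi) // 2
        if fails_soak(versions[mid]):
            hi = mid         # failure already present at mid: look earlier
        else:
            lo = mid + 1     # failure introduced after mid: look later
    return versions[lo]

# Example with a stubbed test: the bug was introduced in v1.2.0
versions = ["v1.0.0", "v1.1.0", "v1.2.0", "v1.2.1", "v1.2.2"]
print(first_bad_version(versions, lambda v: v >= "v1.2.0"))  # -> "v1.2.0"
```

With n candidate versions this needs only about log2(n) soak or repro runs, which matters when each run takes hours or days.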
1571.6.2 Real-World Failure Example
Worked Example: Debugging a Field Failure Using Systematic Root Cause Analysis
Scenario: Your team shipped 5,000 smart irrigation controllers 3 months ago. Customer support is receiving 50+ tickets per week reporting “device offline” errors. Devices work for 1-4 weeks, then permanently disconnect from Wi-Fi. RMA returns show no obvious hardware defect.
Given:
- Product: ESP32-based irrigation controller with Wi-Fi
- Failure rate: ~8% of deployed devices (400+ affected)
- Symptom: Device disconnects from Wi-Fi, never reconnects (requires power cycle)
- Field data: Devices are deployed outdoors in weatherproof enclosures
- Initial hypothesis: "Wi-Fi router compatibility issue" (support team theory)
Investigation Steps:
Gather failure data systematically:
Collect device logs from 50 affected units via cloud telemetry
Pattern Analysis (50 failed devices):
Average time to failure: 18 days (range: 7-42 days)
Last reported temperature: 47°C average (!)
Geographic distribution: 80% in Southwest US (Arizona, Nevada, Texas)
Wi-Fi router brands: 15 different brands (not correlated)
Initial finding: Geographic correlation + high temperature suggests thermal issue, not Wi-Fi compatibility
Reproduce and isolate in the lab: thermal-cycle sample units while monitoring flash operations. Finding: NVS corruption occurs during thermal cycling
Identify root cause:
Review ESP32 errata: Known issue with flash writes during brownout
Measure power supply during thermal cycling:
VCC nominal: 3.3V
VCC minimum: 2.9V (brownout threshold: 2.8V)
Brownout events: 3-5 per thermal cycle
Root cause: The power supply is only marginal under thermal stress plus nearby EMI; when a brownout coincides with an NVS write, the flash contents are corrupted.
Implement and verify fix:
Software fix: Add brownout detection before NVS writes, store Wi-Fi credentials with redundancy
Hardware fix for new production: Add 100 µF bulk capacitor near ESP32
Validate fix: 10 units with firmware v1.2.4, 21 days thermal cycling: 0 failures
Field OTA update: 95% recovery rate for affected devices
Key Insight: “Wi-Fi compatibility” is the most common misdiagnosis for IoT connectivity failures. The actual root causes are usually: (1) Power supply issues (brownout, noise), (2) Thermal effects (component drift, flash corruption), (3) Memory leaks (heap exhaustion over time).
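The pattern-analysis step above is straightforward to automate once failure telemetry can be exported. A minimal sketch, assuming a CSV export of failed-unit records with hypothetical column names (this is not the product's actual schema):

```python
# Fleet failure pattern analysis (sketch; column names are assumed, not a real schema)
import pandas as pd

failures = pd.read_csv("failed_devices.csv", parse_dates=["deployed_at", "last_seen"])

# Time to failure: deployment date vs. last successful check-in
ttf_days = (failures["last_seen"] - failures["deployed_at"]).dt.days
print(f"Time to failure: mean {ttf_days.mean():.0f} days, "
      f"range {ttf_days.min()}-{ttf_days.max()} days")

# Environmental correlation: temperature reported in the final telemetry sample
print(f"Last reported temperature: {failures['last_temp_c'].mean():.0f} °C average")

# Geographic clustering: does one region dominate the failures?
print(failures["state"].value_counts(normalize=True).head(5))

# Check the initial hypothesis: failures spread across many router brands argue against it
print(f"Distinct router brands among failures: {failures['router_brand'].nunique()}")
```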
1571.7 Production Readiness Criteria
1571.7.1 Go/No-Go Decision Framework
Before mass production, verify all criteria are met:
| Category | Metric | Target | Measurement |
|---|---|---|---|
| Reliability | Field failure rate | <1% in 90 days | Beta fleet tracking |
| Quality | Manufacturing yield | >98% | Production line stats |
| User Experience | Setup success rate | >95% | Beta onboarding tracking |
| Support | Support ticket rate | <5% of users | Support system data |
| Satisfaction | NPS score | >40 | Beta user survey |
| Scale | Server load | <50% capacity | Load testing |
| Compliance | Certifications | All passed | Cert reports |
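These criteria can also be encoded as an automated gate so the go/no-go review starts from data rather than opinion. A minimal sketch, assuming the metrics have already been collected from the sources in the table (the values below are placeholders, not real program data):

```python
# Go/no-go gate (sketch; metric values are placeholders, thresholds from the table above)
TARGETS = {
    "field_failure_rate_90d": ("<", 0.01),   # Reliability: <1% in 90 days
    "manufacturing_yield":    (">", 0.98),   # Quality: >98%
    "setup_success_rate":     (">", 0.95),   # User Experience: >95%
    "support_ticket_rate":    ("<", 0.05),   # Support: <5% of users
    "nps_score":              (">", 40),     # Satisfaction: >40
    "server_load_fraction":   ("<", 0.50),   # Scale: <50% capacity
}

def evaluate(metrics: dict) -> bool:
    """Print each criterion's status and return True only if every one passes."""
    all_pass = True
    for name, (op, target) in TARGETS.items():
        value = metrics[name]
        ok = value < target if op == "<" else value > target
        print(f"{'PASS' if ok else 'FAIL'}  {name}: {value} (target {op}{target})")
        all_pass &= ok
    return all_pass

# Placeholder beta-program numbers
metrics = {"field_failure_rate_90d": 0.007, "manufacturing_yield": 0.985,
           "setup_success_rate": 0.96, "support_ticket_rate": 0.04,
           "nps_score": 46, "server_load_fraction": 0.35}
print("GO" if evaluate(metrics) else "NO-GO")
```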
1571.7.2 Pre-Production Checklist
Production Readiness Checklist
Engineering Sign-Off:
[ ] All unit tests passing (100%)
[ ] All integration tests passing (100%)
[ ] 168-hour soak test completed with zero failures
[ ] Environmental testing passed (temp, humidity, EMC)
[ ] Security penetration test completed, no critical findings
[ ] OTA update system validated (rollback tested)
Manufacturing Sign-Off:
[ ] Production test station qualified
[ ] Manufacturing yield >98% over 100 units
[ ] Rework rate <2%
[ ] Component supply chain secured for 12 months
[ ] Factory calibration process validated
Regulatory Sign-Off:
[ ] FCC certification complete
[ ] CE certification complete
[ ] Safety certification complete (if required)
[ ] Labeling approved
Field Validation Sign-Off:
[ ] Beta program completed with 200+ devices
[ ] Field failure rate <1%
[ ] No systematic issues identified
[ ] Support documentation complete
[ ] Escalation process defined
Business Sign-Off:
[ ] Unit cost within target
[ ] Warranty terms defined
[ ] Support staffing plan in place
[ ] Inventory plan for first 6 months
1571.8 Knowledge Check
Question: Your smart thermostat passed all lab tests, including 1000-hour environmental testing. Beta deployment (500 units, 8 weeks) shows 99.2% uptime, exceeding your 99% target. You approve production. Three months after launch (50,000 units shipped), the field failure rate climbs to 3% with devices reporting "sensor error". Investigation reveals humidity-induced corrosion on the temperature sensor. What testing gap allowed this?

Options:
1. Environmental testing was insufficient - need longer duration at 85°C/85% humidity
2. Beta duration was too short - 8 weeks couldn't reveal a 3-month failure mode
3. Beta geography lacked humid climates - 8 weeks in Arizona/Minnesota won't catch Gulf Coast issues
4. All of the above - multiple testing gaps combined to miss the corrosion issue

Hint: Consider what was different between lab testing, beta deployment, and production field conditions.

Answer: Option 4.
- Option 1 is only partially correct: a 1000-hour 85°C/85% humidity test should have caught accelerated corrosion, unless it was run on unrepresentative samples (engineering prototypes rather than production units).
- Option 2 is only partially correct: an 8-week beta can't catch failures that take 3 months to manifest; a minimum 12-week beta is recommended for products targeting a 10-year life.
- Option 3 is only partially correct: if all beta units were in dry climates, real-world humidity exposure was never validated despite lab humidity testing.
- Option 4 is correct: field failures from multiple testing gaps are common. The corrosion issue could have been caught by (1) production-representative samples in the 85/85 test, (2) a longer beta duration, and (3) beta participants in humid climates (Florida, Gulf Coast, Hawaii). Always analyze field failures to identify which testing layer failed and why.
1571.9 Summary
Field testing validates real-world operation:
Beta Programs: Deploy to diverse users, geographies, and environments