5  UX Design Evaluation and Testing

Learning Objectives

After completing this chapter, you will be able to:

  • Apply Nielsen’s 10 usability heuristics to IoT systems
  • Conduct heuristic evaluations with multiple evaluators
  • Design and run task-based usability tests
  • Calculate and interpret SUS scores
  • Identify and prioritize usability issues
  • Create actionable recommendations from test results
MVU: Minimum Viable Understanding

Core concept: Heuristic evaluation with 3-5 expert evaluators finds 75% of usability issues for <10% of the cost of full user testing. Why it matters: Early detection of usability problems saves expensive redesigns - fixing issues in design costs 1x, in development 10x, post-launch 100x. Key takeaway: Use heuristics for quick expert review, user testing for validation - together they catch 90%+ of usability problems.

How much cheaper is heuristic evaluation than full user testing? The math reveals why teams should start with heuristics:

Issue Discovery Rate by Method:

\[ P_{\text{heuristic}}(n) = 1 - (1 - 0.35)^n \quad \text{where } n = \text{number of evaluators} \]

\[ \begin{aligned} P(3) &= 1 - 0.65^3 = 72.5\% \text{ of issues found with 3 evaluators} \\ P(5) &= 1 - 0.65^5 = 88.4\% \text{ of issues found with 5 evaluators} \end{aligned} \]

Cost Comparison (finding 75% of issues):

Heuristic Evaluation: 5 experts × 8 hours × $75/hr = $3,000

User Testing: 15 users × ($50 incentive + $200 facilitation + $150 analysis) = $6,000

Plus 2 weeks of scheduling/recruiting vs. 3 days for heuristic evaluation.

Combined Approach ROI: Heuristic evaluation (week 1, $3,000, finds 75%) + user testing with 5-8 users (week 2-3, $3,000, finds remaining 15%) = 90%+ issues for $6,000 total. Pure user testing would need 30+ users at $12,000+ to achieve same coverage.

Post-Launch Cost Multiplier: Issue found in heuristic evaluation costs $500 to fix (design change). Same issue found post-launch costs $50,000 (customer support, returns, reputation damage, emergency patch). 100x savings from early detection.
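The discovery-rate formula above is easy to sanity-check in a few lines of Python. The 35% per-evaluator rate is Nielsen's published average; the helper name `discovery_rate` is ours:

```python
def discovery_rate(n, per_evaluator=0.35):
    """P(n) = 1 - (1 - p)^n: the chance a given issue is found
    by at least one of n independent evaluators."""
    return 1 - (1 - per_evaluator) ** n

for n in (1, 3, 5, 10):
    print(f"{n} evaluator(s): {discovery_rate(n):.1%} of issues found")
# 3 evaluators -> ~72.5%, 5 evaluators -> ~88.4%; gains flatten beyond 5
```

Note how quickly the curve plateaus: going from 5 to 10 evaluators roughly doubles the cost but adds only about 10 percentage points of coverage, which is why 3-5 evaluators is the recommended range.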

Insight: Nielsen’s research shows each evaluator independently finds ~35% of issues, so coverage rises quickly with the first few evaluators and then plateaus. The sweet spot is 3-5 evaluators (73-88% coverage) for optimal cost-effectiveness.

Heuristic evaluation is a method for finding usability problems in IoT interfaces by checking them against proven design principles. Think of a building inspector using a checklist to find code violations – they do not test every possible scenario, but their experience and checklist catch most common problems quickly and cheaply.

“Evaluating UX is like being a building inspector for IoT devices,” explained Max the Microcontroller. “There are ten rules – called heuristics – that Jakob Nielsen figured out. Things like: always show users what is happening, use words they understand, let them undo mistakes easily, and be consistent everywhere.”

Sammy the Sensor demonstrated: “Let me show you rule number one – visibility of system status. When I take a temperature reading, Lila should show a little indicator so the user knows it is working. If Lila just sits there dark and silent, people think the device is broken, even when it is working perfectly!”

Lila the LED added, “And then there is user testing with real people. You give them a task like ‘Set the thermostat to 22 degrees’ and watch what happens. If they tap the wrong button three times before finding it, that is a usability problem! You score the device using something called a SUS score – System Usability Scale. Anything above 68 is okay, above 80 is great!” Bella the Battery summarized, “Test early, test often, and fix the big problems first!”


5.1 Comprehensive UX Evaluation Framework

⏱️ ~15 min | ⭐⭐⭐ Advanced | 📋 P12.C01.U04

Key Concepts

  • Heuristic Evaluation: Expert review comparing an interface against established usability principles (e.g., Nielsen’s 10 heuristics) to find violations quickly and cheaply.
  • Severity Rating: A 0-4 scale evaluators assign to each finding, from 0 (not a problem) to 4 (usability catastrophe), used to prioritize fixes.
  • Task-Based Usability Test: Observation of representative users attempting realistic tasks while success rate, completion time, and errors are measured.
  • Think-Aloud Protocol: Technique in which participants verbalize their thoughts during a task, revealing why failures occur, not just that they occur.
  • System Usability Scale (SUS): Standardized 10-question survey producing a 0-100 usability score; 68 is the published average, 80+ is excellent.
  • Counterbalancing: Alternating the order in which participants encounter design variants so learning effects do not bias the comparison.
  • WCAG: Web Content Accessibility Guidelines, organized around four principles (perceivable, operable, understandable, robust) with A/AA/AAA conformance levels.

A UX evaluation follows this sequence:

Phase 1: Heuristic Evaluation (Week 1 - Expert Review)

  1. Recruit 3-5 UX experts (avoid using your dev team)
  2. Each expert independently evaluates interface against Nielsen’s 10 heuristics
  3. Rate severity: 0 (no problem) to 4 (usability catastrophe)
  4. Consolidate findings: ~75% of issues discovered

Phase 2: Usability Testing (Week 2-3 - Real Users)

  1. Recruit 8-12 users matching target demographic (NOT engineers!)
  2. Define 5-8 realistic tasks (e.g., “Set thermostat to 72°F”)
  3. Observe task completion: success rate, time, errors, satisfaction
  4. Apply think-aloud protocol: users verbalize their thoughts
  5. Measure: task success, completion time, error count

Phase 3: SUS Questionnaire (Week 3 - Quantitative Metric)

  1. After tasks, administer 10-question SUS survey
  2. Calculate scores (odd questions: score-1, even: 5-score, multiply by 2.5)
  3. Interpret: 80+ excellent, 68 average, <50 failing

Phase 4: Analysis & Iteration (Week 4 - Fixes)

  1. Prioritize issues: severity × frequency
  2. Fix critical issues (severity 4) first
  3. Re-test with NEW users (original users remember workarounds)
  4. Target: SUS >80, task success >85%

The evaluation loop: Fix → Test → Measure → Repeat until targets met.
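Phase 4's "severity × frequency" prioritization can be sketched as a short script. The issue descriptions below are drawn from this chapter's thermostat example, but the frequency values (fraction of users affected) are hypothetical:

```python
# Sketch: prioritize usability findings by severity x frequency (Phase 4, step 1).
# Severity is 0-4 from the heuristic evaluation; frequency (0.0-1.0) is the
# fraction of test users affected -- the values here are illustrative.
issues = [
    {"issue": "Vacation mode buried 4 menu levels deep", "severity": 4, "frequency": 0.75},
    {"issue": "No heating/cooling status indicator",     "severity": 3, "frequency": 0.90},
    {"issue": "Schedule editor uses 24-hour time",       "severity": 2, "frequency": 0.40},
    {"issue": "Help links to a 200-page PDF",            "severity": 1, "frequency": 0.25},
]

for item in issues:
    item["priority"] = item["severity"] * item["frequency"]

# Highest priority first: these get fixed before the next test round
for item in sorted(issues, key=lambda i: i["priority"], reverse=True):
    print(f"{item['priority']:.2f}  {item['issue']}")
```

Note that a severity-3 issue hitting 90% of users can outrank a severity-4 issue hitting few users; the product of the two captures that trade-off.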

A systematic approach to evaluating IoT user experiences combines multiple assessment methods:

5.1.1 Nielsen’s 10 Usability Heuristics

MVU: Nielsen’s Usability Heuristics

Core Concept: Jakob Nielsen’s 10 heuristics provide a systematic framework for evaluating interface quality: visibility of system status, real-world match, user control, consistency, error prevention, recognition over recall, flexibility, minimalist design, error recovery, and help documentation. Why It Matters: Heuristic evaluation by 3-5 evaluators catches 75% of usability issues before user testing, reducing redesign costs by 50-70%. These principles are especially critical for IoT where users interact across multiple devices and contexts. Key Takeaway: The most violated IoT heuristic is “visibility of system status” - users must always know device state (online/offline, armed/disarmed, synced/pending) within 1 second of looking at any interface.

Mind map of Nielsen's 10 usability heuristics showing visibility of system status, real world match, user control, consistency, error prevention, recognition over recall, flexibility, minimalist design, error recovery, and help/documentation with IoT-specific examples
Figure 5.1: Nielsen’s 10 Usability Heuristics: IoT-Specific Applications and Examples
Decision tree flowchart for diagnosing UX problems: Starting from user frustration, asks sequential questions about visibility, behavior expectations, undo capability, consistency, error prevention, and error messages, with each 'No' answer leading to a specific heuristic-based solution
Figure 5.2: UX Problem Diagnosis Tree: Using Nielsen’s Heuristics to Identify Issues

5.1.2 UX Evaluation Scoring Framework

Overall UX Score Calculation:

Component Weight Score Range Description
Nielsen’s Heuristics 40% 0-100 Average of 10 heuristic scores
IoT-Specific Metrics 30% 0-100 Context, privacy, multi-device, feedback
System Usability Scale (SUS) 30% 0-100 Standardized user questionnaire

Example: Smart Thermostat Evaluation

UX evaluation scoring framework showing how Nielsen's heuristics (40% weight), IoT-specific metrics (30% weight), and SUS score (30% weight) combine into an overall score of 79.4/100, Grade B (Good)
Figure 5.3: UX Evaluation Scoring Framework: Weighted Components to Overall Grade
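The weighted combination above is simple to compute. The component scores below are hypothetical values chosen to reproduce the 79.4/100 overall shown in Figure 5.3; the weights come from the framework table:

```python
# Weighted overall UX score (weights from the framework table above).
# The three component scores (80, 78, 80) are illustrative, not measured.
WEIGHTS = {"heuristics": 0.40, "iot_metrics": 0.30, "sus": 0.30}

def overall_ux_score(heuristics, iot_metrics, sus):
    """Each component is on a 0-100 scale; returns the weighted overall score."""
    return (WEIGHTS["heuristics"] * heuristics
            + WEIGHTS["iot_metrics"] * iot_metrics
            + WEIGHTS["sus"] * sus)

score = overall_ux_score(heuristics=80, iot_metrics=78, sus=80)
print(f"Overall UX score: {score:.1f}/100")  # 79.4 -> Grade B (Good)
```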

5.1.3 IoT-Specific UX Metrics

Metric What It Measures Poor Example (Score: 30) Excellent Example (Score: 90)
Context Awareness Adapts to user situation Generic notifications all day Silent at night, alerts when away
Privacy Transparency Clear data practices Hidden 50-page policy Simple dashboard, easy opt-out
Multi-Device Consistency Same experience everywhere Different features per device Seamless sync, unified terminology
Feedback Appropriateness Right info at right time Constant buzzing and alerts Important events only, clear urgency

5.1.4 SUS Score Interpretation

System Usability Scale (SUS) Grading:

Score Range Grade Interpretation Action Required
80-100 A Excellent - Top quartile Maintain and refine
69-79 B Good - Above average Minor improvements
68 C Marginal - The published average Significant work needed
51-67 D Poor - Below average Major usability problems
0-50 F Failing - Unusable Complete redesign required

SUS Survey Questions (Scored 1-5):

  1. I think I would like to use this system frequently
  2. I found the system unnecessarily complex (reversed)
  3. I thought the system was easy to use
  4. I think I would need technical support to use this (reversed)
  5. The various functions were well integrated
  6. There was too much inconsistency (reversed)
  7. Most people would learn to use this quickly
  8. I found the system very cumbersome to use (reversed)
  9. I felt very confident using the system
  10. I needed to learn a lot before I could get going (reversed)

5.1.5 Usability Testing Metrics

Key Performance Indicators:

Metric Formula Good Target Example
Success Rate (Successful completions / Total attempts) × 100 >85% Initial setup: 17/20 users = 85%
Task Completion Time Average time to complete task <3 min for critical tasks Temperature change: 15 seconds
Error Rate Errors per task attempt <2 errors/task Setup: 2 errors average
Satisfaction Rating 1-5 Likert scale >4.0 average Overall experience: 4.2/5
Efficiency Score (Success rate / Time in seconds) × 100 Higher is better 85 / 180s × 100 ≈ 47

Example: Smart Thermostat Usability Test Results

Task Success Rate Avg Time Errors Satisfaction
Initial setup 85% 180s 2 4.2/5
Change temperature 100% 15s 0 4.8/5
Create schedule 60% 240s 5 3.1/5
Enable vacation mode 75% 90s 3 3.8/5

Findings:

  • Temperature change: Excellent (100% success, fast)
  • Scheduling: Major usability problem (60% success, slow, many errors)
  • Action: Redesign scheduling interface with wizard workflow
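The KPI formulas from the table above, applied to the thermostat results, look like this in Python (attempt counts assume the 20-user study mentioned earlier; times are the reported averages in seconds):

```python
# Sketch: compute Section 5.1.5's usability KPIs for three thermostat tasks.
# Counts and times follow the results table; 20 participants assumed throughout.
tasks = {
    "Initial setup":      {"success": 17, "attempts": 20, "avg_time_s": 180},
    "Change temperature": {"success": 20, "attempts": 20, "avg_time_s": 15},
    "Create schedule":    {"success": 12, "attempts": 20, "avg_time_s": 240},
}

results = {}
for name, t in tasks.items():
    success_rate = t["success"] / t["attempts"] * 100      # percent
    efficiency = success_rate / t["avg_time_s"] * 100      # (rate / time) x 100
    results[name] = (success_rate, efficiency)
    print(f"{name}: {success_rate:.0f}% success, efficiency {efficiency:.0f}")
```

The efficiency metric makes the scheduling problem obvious: 60% success over 240 seconds scores far below the 100%-in-15-seconds temperature change.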

5.1.6 WCAG Accessibility Compliance

Four Principles with Scoring:

Principle Key Requirements Compliance Levels Score Calculation
Perceivable Alt text, captions, contrast, adaptable content A (60), AA (75), AAA (90) Sum of requirement scores
Operable Keyboard access, timing controls, navigation A (60), AA (75), AAA (90) Average compliance percentage
Understandable Readable, predictable, input assistance A (60), AA (75), AAA (90) Automated + manual testing
Robust Compatible with assistive tech, valid markup A (60), AA (75), AAA (90) Tool scans + user testing

Example Accessibility Evaluation:

WCAG accessibility evaluation showing overall compliance of 77.5/100 (Level AA), with scores for perceivable (75), operable (80), understandable (85), and robust (70), along with specific issues and fixes
Figure 5.4: WCAG Accessibility Evaluation: Four Principles with Issues and Fixes
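The overall compliance figure can be reproduced by averaging the four principle scores. Equal weighting is an assumption on our part, but it matches the 77.5 shown in Figure 5.4, and the A/AA/AAA thresholds come from the table above:

```python
# Sketch: WCAG overall compliance as the mean of the four principle scores.
# Equal weighting of principles is assumed; thresholds 60/75/90 map to A/AA/AAA.
principles = {"perceivable": 75, "operable": 80, "understandable": 85, "robust": 70}

overall = sum(principles.values()) / len(principles)
level = ("AAA" if overall >= 90 else
         "AA" if overall >= 75 else
         "A" if overall >= 60 else
         "Fail")
print(f"Overall: {overall:.1f}/100 (Level {level})")  # 77.5 -> Level AA
```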

5.1.7 Comprehensive UX Evaluation Workflow

Comprehensive UX evaluation workflow showing the cycle: heuristic evaluation, fix critical issues, usability testing, SUS questionnaire, accessibility audit, calculate score, and if score is below 80, analyze and redesign, then repeat
Figure 5.5: Comprehensive UX Evaluation Workflow with Iteration Cycle

Best Practices:

  1. Test with real users from target demographic (not engineers)
  2. Iterate based on data - don’t guess what users need
  3. Prioritize critical tasks - ensure core functions work perfectly
  4. Track over time - monitor UX scores across versions
  5. Combine methods - quantitative (SUS) + qualitative (observations)
  6. Representative testing - elderly users if that’s your market
  7. Realistic environment - test in homes, not labs

5.2 Code Example: SUS Score Calculator

This Python tool automates the System Usability Scale (SUS) scoring from the section above. SUS scoring has a non-obvious calculation: odd-numbered questions subtract 1 from the raw score, even-numbered questions subtract the raw score from 5, then the total is multiplied by 2.5. This calculator handles the conversion and provides letter grades:

class SUSCalculator:
    """Calculate System Usability Scale scores with interpretation.

    SUS is a 10-question survey scored on 1-5 Likert scale.
    Odd questions: score contribution = raw - 1
    Even questions: score contribution = 5 - raw
    Final score = sum of contributions * 2.5 (range 0-100).
    """
    QUESTIONS = [
        "I would like to use this system frequently",
        "I found the system unnecessarily complex",          # reversed
        "I thought the system was easy to use",
        "I would need technical support to use this",        # reversed
        "The functions were well integrated",
        "There was too much inconsistency",                  # reversed
        "Most people would learn this quickly",
        "I found the system cumbersome to use",              # reversed
        "I felt very confident using the system",
        "I needed to learn a lot before getting going",      # reversed
    ]

    def score_single(self, responses):
        """Score one participant's 10 responses (each 1-5).

        Returns SUS score (0-100).
        """
        if len(responses) != 10:
            raise ValueError("SUS requires exactly 10 responses")
        total = 0
        for i, raw in enumerate(responses):
            if (i + 1) % 2 == 1:   # Odd questions (1,3,5,7,9)
                total += raw - 1
            else:                    # Even questions (2,4,6,8,10)
                total += 5 - raw
        return total * 2.5

    def grade(self, score):
        """Convert SUS score to letter grade and interpretation."""
        if score >= 80:
            return "A", "Excellent"
        elif score >= 68:
            return "B", "Good"
        elif score >= 51:
            return "D", "Poor"
        else:
            return "F", "Failing"

    def evaluate_study(self, all_responses):
        """Score multiple participants and compute summary statistics.

        Args:
            all_responses: List of 10-item response lists, one per participant.

        Returns:
            Dict with individual scores, mean, std dev, and grade.
        """
        scores = [self.score_single(r) for r in all_responses]
        n = len(scores)
        mean = sum(scores) / n
        variance = sum((s - mean) ** 2 for s in scores) / n
        std = variance ** 0.5
        letter, desc = self.grade(mean)

        return {
            "participants": n,
            "scores": [round(s, 1) for s in scores],
            "mean": round(mean, 1),
            "std_dev": round(std, 1),
            "min": round(min(scores), 1),
            "max": round(max(scores), 1),
            "grade": letter,
            "interpretation": desc,
        }

# Example: Smart thermostat usability study (8 participants)
sus = SUSCalculator()
study_data = [
    [4, 2, 5, 1, 4, 2, 5, 1, 4, 2],  # Enthusiastic user
    [3, 3, 4, 2, 3, 3, 4, 2, 3, 3],  # Average user
    [5, 1, 5, 1, 5, 1, 5, 1, 5, 1],  # Power user
    [2, 4, 3, 3, 2, 4, 3, 4, 2, 4],  # Struggling user
    [4, 2, 4, 2, 4, 2, 4, 2, 4, 2],  # Satisfied user
    [3, 3, 3, 3, 3, 3, 3, 3, 3, 3],  # Neutral user
    [4, 1, 5, 1, 4, 2, 5, 1, 5, 1],  # Very satisfied
    [2, 4, 2, 4, 3, 3, 2, 4, 2, 4],  # Frustrated user
]

result = sus.evaluate_study(study_data)
print(f"SUS Study Results ({result['participants']} participants):")
print(f"  Mean Score:  {result['mean']} / 100")
print(f"  Std Dev:     {result['std_dev']}")
print(f"  Range:       {result['min']} - {result['max']}")
print(f"  Grade:       {result['grade']} ({result['interpretation']})")
print(f"  Scores:      {result['scores']}")
# Output:
# SUS Study Results (8 participants):
#   Mean Score:  65.6 / 100
#   Std Dev:     25.0
#   Range:       30.0 - 100.0
#   Grade:       D (Poor)
#   Scores:      [85.0, 60.0, 100.0, 32.5, 75.0, 50.0, 92.5, 30.0]

The mean score of 65.6 (Grade D) indicates significant usability problems. The high standard deviation (25.0) reveals a split: power users love it (scores of 85-100) while struggling users find it nearly unusable (scores around 30). This bimodal distribution is common in IoT products where tech-savvy users succeed but mainstream users struggle with setup and configuration.

To calculate a SUS score by hand, score the responses to all 10 questions (1-5 scale) as follows:

How it works: Odd-numbered questions (1, 3, 5, 7, 9) contribute (response - 1) to the total. Even-numbered questions (2, 4, 6, 8, 10) are reversed, so they contribute (5 - response). The sum is multiplied by 2.5 to get a score from 0-100.

5.3 Worked Example: UX Evaluation Drives a $2.3M Business Decision

Scenario: ThermoSmart Inc. sells a smart thermostat (v2.1) with declining customer satisfaction (NPS dropped from 42 to 18 over 12 months). The product team proposes a UX redesign (v3.0) but the CEO asks: “How do I know the redesign is worth the $400K investment?” The UX team runs a structured evaluation comparing v2.1 against the v3.0 prototype.

5.3.1 Step 1: Heuristic Evaluation (3 Experts, 2 Days, $3,600)

Three UX evaluators independently assess both versions against Nielsen’s 10 heuristics, rating severity 0-4 (0 = not a problem, 4 = usability catastrophe):

Heuristic v2.1 Avg Severity v3.0 Avg Severity Issue Description (v2.1)
Visibility of system status 3.7 1.0 No indication of heating/cooling active state; “target vs current” temp unclear
Match real world 2.3 0.7 Schedule uses 24-hour time; “setpoint” instead of “target temperature”
User control & freedom 3.0 1.3 No way to override schedule temporarily; must edit schedule to adjust
Consistency 1.7 0.3 App and device use different icons for same function
Error prevention 3.3 0.7 No confirmation when setting extreme temps (e.g., 95°F); no “are you sure?”
Recognition over recall 2.0 0.3 Must remember schedule codes (P1-P4) without labels
Flexibility 1.3 0.7 No quick-access “eco” or “away” modes
Minimalist design 2.7 1.0 12 buttons on device face; settings buried 4 levels deep in app
Error recovery 3.0 0.7 Factory reset is only way to fix misconfigured schedule
Help & documentation 2.0 1.0 Help links to 200-page PDF manual, not contextual

Heuristic summary: v2.1 average severity = 2.5/4 (significant problems). v3.0 average severity = 0.77/4 (cosmetic issues only). The redesign addresses 8 of 10 heuristics.
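Consolidating the severity table above is a one-liner per version; the lists below are the per-heuristic averages from the v2.1 and v3.0 columns:

```python
# Consolidating the heuristic evaluation: average severity per version.
# Each list holds the ten per-heuristic average severities (0-4 scale)
# from the comparison table, in row order.
v21 = [3.7, 2.3, 3.0, 1.7, 3.3, 2.0, 1.3, 2.7, 3.0, 2.0]
v30 = [1.0, 0.7, 1.3, 0.3, 0.7, 0.3, 0.7, 1.0, 0.7, 1.0]

def mean(xs):
    return sum(xs) / len(xs)

print(f"v2.1 average severity: {mean(v21):.2f}/4")  # 2.50 -> significant problems
print(f"v3.0 average severity: {mean(v30):.2f}/4")  # 0.77 -> cosmetic issues only
```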

5.3.2 Step 2: Task-Based Usability Test (12 Participants, 5 Days, $8,400)

Twelve participants matching the target demographic (homeowners aged 35-65, moderate tech comfort) perform five core tasks on both versions. Half start with v2.1, half with v3.0 (counterbalanced to avoid learning effects).

Task v2.1 Success v2.1 Time v3.0 Success v3.0 Time Improvement
Initial Wi-Fi setup 58% (7/12) 6 min 20 sec 92% (11/12) 2 min 10 sec +34 pp, 66% faster
Set temperature 100% 12 sec 100% 8 sec Same rate, 33% faster
Create weekday schedule 42% (5/12) 4 min 45 sec 83% (10/12) 1 min 30 sec +41 pp, 68% faster
Enable vacation mode 25% (3/12) 3 min 10 sec 92% (11/12) 25 sec +67 pp, 87% faster
Check energy usage report 67% (8/12) 1 min 55 sec 100% 35 sec +33 pp, 70% faster

Critical finding: Vacation mode on v2.1 had 25% success rate – 9 of 12 participants could not find it (buried under Settings > Advanced > Schedule > Override > Vacation). On v3.0, it is a single button on the home screen.

5.3.3 Step 3: SUS Comparison (Same 12 Participants)

Metric v2.1 v3.0 Delta
Mean SUS score 48.5 82.3 +33.8 points
Grade F (Failing) A (Excellent) F to A
Standard deviation 18.2 8.1 Much less variance
Lowest individual score 22.5 67.5 Floor raised substantially
% scoring below 50 50% (6/12) 0% (0/12) Eliminated struggling users

Key insight: v2.1’s standard deviation (18.2) was more than double v3.0’s (8.1), confirming the bimodal split: tech-savvy users scored 65-75 while non-technical users scored 22-40. The redesign raised the floor so even the least technical participant scored 67.5 (Grade B).

5.3.4 Step 4: Business Impact Projection

The UX team translates evaluation metrics into business terms for the CEO:

Metric v2.1 (Current) v3.0 (Projected) Financial Impact
Setup completion rate 58% 92% 34% fewer returns ($180K/yr saved at 50,000 units)
Support calls per 100 units 38 12 68% reduction ($520K/yr saved at $40/call across 50K units)
Return rate 22% 7% $750K/yr saved (15 pp drop x 50K units x $100 cost)
90-day retention 41% 73% +32 pp active users = higher subscription revenue
NPS (projected) 18 55+ Word-of-mouth growth, premium pricing power
Total annual savings $1.45M/yr
Redesign investment $400K one-time
Payback period 3.3 months

Result: The CEO approved the redesign. At $400K investment with $1.45M/yr savings, the ROI was 263% in year one. The SUS score improvement from 48.5 to 82.3 provided the quantitative evidence that the heuristic evaluation’s qualitative findings were real and worth investing in.
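The payback and ROI arithmetic can be checked with a short script using the figures from the projection table:

```python
# Checking the business case: payback period and first-year ROI
# from the projection table's totals.
annual_savings = 1_450_000   # $/yr total from the table
investment = 400_000         # one-time redesign cost

payback_months = investment / annual_savings * 12
roi_year_one = (annual_savings - investment) / investment * 100

print(f"Payback period: {payback_months:.1f} months")  # ~3.3 months
print(f"Year-one ROI: {roi_year_one:.1f}%")            # 262.5%, ~263% as quoted
```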

Lesson for students: UX evaluation is not just about finding problems – it is about translating usability data into business language that decision-makers understand. A SUS score means nothing to a CEO; “$1.45M per year in reduced returns and support costs” gets budgets approved.

Evaluation validates design decisions: Heuristic evaluation finds problems, usability testing confirms user impact, SUS quantifies improvement. Each method reveals different issues:

  • Heuristics → identify violations of established principles
  • Task testing → reveal real-world workflow problems
  • SUS scores → track improvement over iterations

Testing must match target users: Engineers pass tests that real users fail. Representative user testing (elderly, non-technical) uncovers issues expert evaluation misses.

Related concepts:

  • Accessibility testing (UX Accessibility) → WCAG compliance requires user testing with disabled users
  • Error prevention (UX Pitfalls) → testing reveals where users make mistakes
  • Progressive disclosure (UX Fundamentals) → helps beginners while serving experts


Common Pitfalls

Testing with your development team instead of target users hides the problems that matter: engineers pass tasks that real users fail. Recruit participants matching the target demographic (e.g., homeowners aged 35-65 with moderate tech comfort) and test in realistic environments such as homes, not labs.

Re-testing a redesign with the original participants inflates success metrics, because those users remember the workarounds they discovered the first time. Always re-test with new users so results reflect a genuine first-time experience.

Relying on a single evaluation method leaves gaps: heuristic evaluation misses real-world workflow problems, while user testing alone misses principle violations that only a few participants happen to trigger. Combine expert review, task-based testing, and SUS scoring to reach 90%+ issue coverage.

5.4 Summary

This chapter introduced comprehensive UX evaluation methods:

Nielsen’s 10 Heuristics:

  1. Visibility of system status (LED indicators, progress bars)
  2. Match between system and real world (intuitive metaphors)
  3. User control and freedom (undo, manual override)
  4. Consistency and standards (platform conventions)
  5. Error prevention (confirmations, constraints)
  6. Recognition vs. recall (visible options, not memorization)
  7. Flexibility and efficiency (shortcuts for experts)
  8. Aesthetic and minimalist design (focus on essentials)
  9. Help users recognize, diagnose, and recover from errors
  10. Help and documentation (contextual, searchable)

Evaluation Methods:

  • Heuristic Evaluation: 3-5 experts, severity ratings, 75% issue discovery
  • Task-Based Testing: Real users, representative tasks, completion rates
  • SUS Scoring: 10 questions, 80+ target for excellent usability
  • Think-Aloud Protocol: Users verbalize thought process during tasks

Cost-Effectiveness:

  • Heuristic evaluation: $500-2,000, finds 75% of issues
  • User testing (5 participants): $3,000-10,000, finds 85% of issues
  • Combined approach: Finds 90%+ issues, optimal ROI
In 60 Seconds

IoT UX evaluation combines heuristic review by 3-5 experts (finds ~75% of issues for a fraction of the cost), task-based testing with representative users (reveals real-world workflow failures), and SUS scoring (quantifies usability: 68 is average, 80+ is excellent), with the combined approach catching 90%+ of usability problems before launch.

5.5 What’s Next

Complete your UX education:

Chapter Description
UX Design Pitfalls and Patterns Common mistakes and worked examples
Testing and Validation Broader testing strategies
User Experience Design Overview Return to the main UX hub