41 Validation & Outlier Detection
41.1 Learning Objectives
By the end of this chapter, you will be able to:
- Identify Data Quality Issues: Recognize common data quality problems in IoT sensor streams including outliers, missing values, noise, and drift
- Implement Validation Rules: Create real-time data validation pipelines that detect and flag invalid sensor readings
- Apply Outlier Detection: Use statistical methods (Z-score, IQR, MAD) to identify and handle anomalous data points
- Design Context-Aware Validation: Build adaptive validation thresholds that account for environmental conditions
41.2 Prerequisites
Before diving into this chapter, you should be familiar with:
- Edge Data Acquisition: Understanding how sensor data is collected at the edge provides context for where preprocessing fits in the pipeline
- Sensor Fundamentals: Knowledge of sensor characteristics helps understand why data quality issues arise
- Signal Processing Essentials: Basic concepts of filtering and signal conditioning
For Kids: Meet the Sensor Squad!
Data quality is like being a detective who checks if clues are real or fake before solving a mystery!
41.2.1 The Sensor Squad Adventure: The Case of the Confused Readings
Sammy the Sensor was the star detective at Sensor Squad Headquarters. Every day, sensors from all over the city sent readings to Sammy for the Big Weather Report.
One morning, something STRANGE happened!
“Temperature Terry says it’s 500 degrees in the park!” gasped Lila the LED. “But Humidity Hannah says the park is underwater AND it’s -100 degrees there!”
Sammy knew something was VERY wrong. “That’s impossible! Water can’t be frozen AND boiling at the same time. And nothing on Earth is 500 degrees outside of volcanoes!”
Bella the Battery suggested, “Maybe the sensors made mistakes? Or maybe a squirrel chewed Terry’s wires?”
Sammy put on his detective hat and created the “Data Quality Checklist”:
Step 1 - VALIDATE (Is it even possible?) “Can a park really be 500 degrees? NO! That’s hotter than an oven! REJECTED!”
Step 2 - CLEAN (Remove the mistakes) “Hannah’s -100 degrees? Let me check the sensors nearby. They all say 25 degrees. Hannah is the odd one out - she needs a checkup!”
After checking everything, Sammy had CLEAN, TRUSTWORTHY data for the Weather Report.
Max the Microcontroller was impressed. “Without your detective work, we would have told everyone the park was on fire AND frozen at the same time!”
The Sensor Squad learned: Bad data in = Bad decisions out! Always check your data before trusting it!
41.2.2 Key Words for Kids
| Word | What It Means |
|---|---|
| Data Quality | How good and trustworthy your information is - like checking if food is fresh before eating |
| Outlier | A reading that’s WAY different from the others - like one kid saying they’re 100 feet tall |
| Validation | Checking if a reading is even POSSIBLE - humans can’t be 100 feet tall! |
41.2.3 Try This at Home!
The Data Detective Game:
- Ask 5 family members their height (or age, or favorite number 1-10)
- Write down their answers
- Now add ONE obviously wrong answer: “Uncle Bob is 50 feet tall!”
Play detective:
- Which answer is the OUTLIER? (50 feet - impossible for a human!)
- How did you know it was wrong? (No human is that tall - that’s VALIDATION)
Now you’re a Data Quality Detective just like Sammy!
For Beginners: What is Data Quality in IoT?
Think of sensor data like ingredients in a recipe. Even the best chef cannot make a great dish from spoiled or contaminated ingredients. Similarly, even the most sophisticated analytics and machine learning models cannot produce reliable results from low-quality sensor data.
Common data quality problems in IoT:
| Problem | Example | Impact |
|---|---|---|
| Outliers | Temperature reads -40C in a room | False alerts, wrong decisions |
| Missing Data | Sensor battery dies for 2 hours | Gaps in analysis, failed predictions |
| Noise | Vibration sensor picks up building HVAC | Obscures real signals |
| Drift | Humidity sensor slowly reads 5% high | Gradual accuracy loss |
| Invalid Range | Humidity shows 105% | Physically impossible readings |
Why preprocessing matters:
- Garbage in, garbage out: ML models amplify data problems
- Edge resources: Cleaning data at the source reduces bandwidth
- Real-time decisions: Invalid data triggers false alarms
- Long-term trends: Drift and bias corrupt historical analysis
Key question this chapter answers: “How do I detect invalid or anomalous sensor readings before they cause problems?”
Minimum Viable Understanding: Data Validation
Core Concept: Always validate sensor readings against two checks before trusting them: (1) physical bounds – is this value even possible? and (2) rate-of-change limits – could the value have changed this fast?
Why It Matters: A single corrupt reading propagated through an IoT system can trigger false alarms, cause equipment shutdowns, or corrupt trained ML models. Catching errors at the sensor costs 1% of fixing them downstream.
Key Takeaway: If a reading violates physical laws (temperature below absolute zero, humidity above 100%) or changes faster than physically possible (indoor temperature jumping 20C in one second), it is always wrong – reject immediately, no statistical analysis needed.
41.3 The Data Quality Pipeline
Key Concepts
- Schema validation: Checking that incoming sensor data conforms to the expected structure (field names, data types, required fields) before storing or processing.
- Range validation: Verifying that sensor readings fall within physically plausible bounds — e.g., a temperature sensor reporting 500°C or -300°C indicates a malfunction, not a real measurement.
- Consistency validation: Checking that related sensor readings agree logically — e.g., a humidity sensor reporting 110% is impossible, or a flow sensor reporting flow when pump status is ‘off’ indicates a fault.
- Completeness check: Measuring the proportion of expected readings that actually arrived within a time window, detecting sensor dropouts, network failures, or sampling gaps.
- Timeliness validation: Verifying that sensor readings arrive within expected time windows; late-arriving data may be rejected or flagged for special handling rather than inserted as if it arrived on time.
- Data quality score: A composite metric combining completeness, range conformance, consistency, and timeliness into a single summary that can trigger alerts or data pipeline holds when quality falls below a threshold.
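A simple composite score can be a weighted average of the four quality dimensions. The sketch below is illustrative: the weights and the 0.9 hold threshold are assumptions for demonstration, not values prescribed by this chapter.

```python
def data_quality_score(completeness, range_conformance, consistency, timeliness,
                       weights=(0.3, 0.3, 0.2, 0.2)):
    """Combine four quality dimensions (each scored 0.0-1.0) into one number.
    The weights are illustrative; tune them per deployment."""
    dims = (completeness, range_conformance, consistency, timeliness)
    return sum(w * d for w, d in zip(weights, dims))

# 98% of readings arrived, 95% passed range checks, all consistent, 90% on time
score = data_quality_score(0.98, 0.95, 1.0, 0.90)  # ≈ 0.959
hold_pipeline = score < 0.9  # example gate: hold the pipeline below 0.9
```

A score like this makes quality trends visible on a dashboard and gives alerting a single threshold instead of four.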
Data quality is the foundation of trustworthy IoT analytics. Raw sensor data is inherently noisy, incomplete, and sometimes outright wrong. This section explores the validation and outlier detection stages of data quality preprocessing.
Key Takeaway
The 1-10-100 rule: Preventing bad data at the edge costs $1, correcting it in the cloud costs $10, and wrong decisions from bad data cost $100. Validate at the source.
41.3.1 Pipeline Overview
Alternative View: Data Quality Decision Tree
Use this decision tree to select the appropriate handling strategy for each type of data quality issue.
41.4 Data Validation Techniques
41.4.1 Range Validation
The simplest form of validation checks whether values fall within physically possible or operationally expected bounds:
```python
class RangeValidator:
    def __init__(self, sensor_type):
        """Define valid ranges for common sensor types"""
        self.ranges = {
            'temperature_indoor': (-10, 50),    # Celsius
            'temperature_outdoor': (-40, 60),   # Celsius
            'humidity': (0, 100),               # Percentage
            'pressure': (870, 1084),            # hPa (sea level)
            'light': (0, 100000),               # Lux
            'co2': (300, 5000),                 # ppm
            'battery_voltage': (2.5, 4.2),      # Li-ion typical
            'soil_moisture': (0, 100),          # Percentage
        }
        self.min_val, self.max_val = self.ranges.get(
            sensor_type, (float('-inf'), float('inf'))
        )

    def validate(self, value):
        """Check if value is within valid range"""
        if value < self.min_val or value > self.max_val:
            return False, f"Out of range [{self.min_val}, {self.max_val}]"
        return True, "Valid"

# Example usage
temp_validator = RangeValidator('temperature_indoor')
print(temp_validator.validate(22.5))   # (True, 'Valid')
print(temp_validator.validate(-45.0))  # (False, 'Out of range [-10, 50]')
```
41.4.2 Rate-of-Change Validation
Physical sensors cannot change instantaneously. A temperature sensor reading 22C one second and -40C the next indicates a fault, not an actual temperature change:
```python
class RateOfChangeValidator:
    def __init__(self, max_rate_per_second):
        """
        max_rate_per_second: Maximum expected change per second
        Example: Indoor temperature typically changes < 0.5C/min = 0.0083C/s
        """
        self.max_rate = max_rate_per_second
        self.last_value = None
        self.last_time = None

    def validate(self, value, timestamp):
        if self.last_value is None:
            self.last_value = value
            self.last_time = timestamp
            return True, "First reading"
        time_delta = timestamp - self.last_time
        if time_delta <= 0:
            return False, "Invalid timestamp"
        rate = abs(value - self.last_value) / time_delta
        if rate > self.max_rate:
            # Don't update last_value -- keep previous valid reading
            return False, f"Rate {rate:.4f}/s exceeds max {self.max_rate}/s"
        # Update history only for valid readings
        self.last_value = value
        self.last_time = timestamp
        return True, f"Rate {rate:.4f}/s OK"

# Indoor temperature: max 0.5C per minute = 0.0083C/s
temp_roc = RateOfChangeValidator(max_rate_per_second=0.0083)
```
41.4.3 Multi-Sensor Plausibility Checks
When multiple sensors measure related quantities, cross-validation can detect faults:
```python
import math

def cross_validate_weather(temperature, humidity, dew_point_reported):
    """
    Validate weather sensor readings using physical relationships.
    Dew point cannot exceed temperature.
    """
    # Calculate expected dew point from T and RH (Magnus formula approximation)
    a = 17.27
    b = 237.7
    alpha = ((a * temperature) / (b + temperature)) + math.log(humidity / 100.0)
    dew_point_calculated = (b * alpha) / (a - alpha)
    # Check if reported dew point is plausible
    errors = []
    if dew_point_reported > temperature:
        errors.append(f"Dew point ({dew_point_reported}C) > Temperature ({temperature}C)")
    if abs(dew_point_calculated - dew_point_reported) > 5:
        errors.append(f"Dew point mismatch: calculated {dew_point_calculated:.1f}C "
                      f"vs reported {dew_point_reported}C")
    return len(errors) == 0, errors
```
Common Pitfall: Static Validation Thresholds
The mistake: Using fixed validation thresholds that do not account for context (season, location, sensor placement). See the detailed Common Mistake section below for code examples and solutions.
Quick fix: Define thresholds as functions of context rather than constants:
```python
# Bad: Static threshold
if temperature > 35:
    reject_reading()

# Good: Context-aware threshold
outdoor_temp = get_outdoor_temperature()
if temperature > min(40, outdoor_temp + 10):
    reject_reading()  # Indoor cannot be much hotter than outside
```
Prevention: Log rejection rates and investigate if they exceed 1-2%.
41.5 Outlier Detection and Handling
41.5.1 Z-Score Method
Identifies outliers based on standard deviations from the mean:
```python
import numpy as np

class ZScoreOutlierDetector:
    def __init__(self, threshold=3.0, window_size=100):
        self.threshold = threshold
        self.window_size = window_size
        self.buffer = []

    def detect(self, value):
        """Returns (is_outlier, z_score)"""
        self.buffer.append(value)
        if len(self.buffer) > self.window_size:
            self.buffer.pop(0)
        if len(self.buffer) < 10:
            return False, 0.0
        mean = np.mean(self.buffer)
        std = np.std(self.buffer)
        if std == 0:
            return False, 0.0
        z_score = abs((value - mean) / std)
        return z_score > self.threshold, z_score
```
41.5.2 Interquartile Range (IQR) Method
More robust to extreme outliers than Z-score:
```python
import numpy as np

class IQROutlierDetector:
    def __init__(self, multiplier=1.5, window_size=100):
        self.multiplier = multiplier
        self.window_size = window_size
        self.buffer = []

    def detect(self, value):
        """Returns (is_outlier, bounds)"""
        self.buffer.append(value)
        if len(self.buffer) > self.window_size:
            self.buffer.pop(0)
        if len(self.buffer) < 10:
            return False, (None, None)
        q1 = np.percentile(self.buffer, 25)
        q3 = np.percentile(self.buffer, 75)
        iqr = q3 - q1
        lower = q1 - self.multiplier * iqr
        upper = q3 + self.multiplier * iqr
        is_outlier = value < lower or value > upper
        return is_outlier, (lower, upper)
```
41.5.3 Median Absolute Deviation (MAD)
Extremely robust to outliers - the median of absolute deviations from the median:
```python
import numpy as np

class MADOutlierDetector:
    def __init__(self, threshold=3.5, window_size=100):
        self.threshold = threshold
        self.window_size = window_size
        self.buffer = []
        self.scale = 1.4826  # Consistency constant for normal distribution

    def detect(self, value):
        """Returns (is_outlier, modified_z_score)"""
        self.buffer.append(value)
        if len(self.buffer) > self.window_size:
            self.buffer.pop(0)
        if len(self.buffer) < 10:
            return False, 0.0
        median = np.median(self.buffer)
        mad = np.median(np.abs(np.array(self.buffer) - median))
        if mad == 0:
            return False, 0.0
        modified_z = (value - median) / (self.scale * mad)
        return abs(modified_z) > self.threshold, modified_z
```
41.5.4 Outlier Handling Strategies
| Strategy | When to Use | Implementation |
|---|---|---|
| Discard | Clear sensor faults | Skip reading, log event |
| Replace with median | Single outliers in stable signal | Use window median |
| Winsorize | Preserve data count | Clip to boundary values |
| Interpolate | Time-series continuity needed | Linear/spline interpolation |
| Flag and keep | Audit trail required | Add quality flag column |
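The first three strategies in the table can be sketched in a few lines. This is an illustrative helper, not from the chapter: the function name, the 5th/95th winsorizing percentiles, and the example window are all assumptions.

```python
import numpy as np

def handle_outlier(value, window, strategy="median"):
    """Apply a handling strategy to a value already flagged as an outlier.
    window: recent *valid* readings. Flagging itself is out of scope here."""
    if strategy == "discard":
        return None                       # skip the reading, log the event
    if strategy == "median":
        return float(np.median(window))   # replace with the window median
    if strategy == "winsorize":
        lo, hi = np.percentile(window, [5, 95])  # assumed clipping percentiles
        return float(np.clip(value, lo, hi))     # clip to boundary values
    return value                          # "flag and keep": store with a flag

window = [21.8, 22.0, 22.1, 21.9, 22.2]
handle_outlier(500.0, window, "median")     # -> 22.0
handle_outlier(500.0, window, "winsorize")  # clipped near the window's top
```

Interpolation is deliberately omitted: it needs both neighbors of the gap, so it is usually applied in batch rather than per reading.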
41.5.5 Interactive: Outlier Detection Method Comparison
Compare how Z-score, IQR, and MAD methods respond to the same data. Adjust the test value and see which methods flag it as an outlier.
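If the interactive widget is unavailable, the same comparison can be run directly. The sketch below uses simplified, stateless versions of the three detectors and an illustrative window already contaminated by one spike; it shows why Z-score (whose mean and standard deviation are inflated by the spike) can miss an outlier that IQR and MAD both catch.

```python
import numpy as np

def z_flag(window, value, threshold=3.0):
    """Z-score test: sensitive to contamination in the window itself."""
    mean, std = np.mean(window), np.std(window)
    return std > 0 and abs(value - mean) / std > threshold

def iqr_flag(window, value, multiplier=1.5):
    """IQR test: percentile-based, robust to a single extreme value."""
    q1, q3 = np.percentile(window, [25, 75])
    iqr = q3 - q1
    return value < q1 - multiplier * iqr or value > q3 + multiplier * iqr

def mad_flag(window, value, threshold=3.5):
    """MAD test: median-based, the most robust of the three."""
    median = np.median(window)
    mad = np.median(np.abs(np.array(window) - median))
    return mad > 0 and abs(value - median) / (1.4826 * mad) > threshold

# A window contaminated by one earlier spike (80.0), then a new suspect value
window = [22.0, 22.1, 21.9, 22.0, 22.2, 80.0, 22.1, 21.8, 22.0, 22.1]
test_value = 30.0
for name, fn in [("z-score", z_flag), ("iqr", iqr_flag), ("mad", mad_flag)]:
    print(name, bool(fn(window, test_value)))
# z-score misses 30.0 (inflated std); iqr and mad both flag it
```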
Worked Example: Multi-Sensor Plausibility Check for Weather Station
Scenario: A weather station reports temperature 28°C, relative humidity 85%, and dew point 18°C. You need to validate whether these readings are physically consistent using the Magnus formula approximation.
Given:
- Temperature (T): 28°C
- Relative Humidity (RH): 85%
- Reported Dew Point (Td): 18°C
- Magnus formula constants: a = 17.27, b = 237.7°C
Question: Is the reported dew point physically plausible given the temperature and humidity?
Solution:
Step 1: Calculate expected dew point from temperature and RH
The Magnus formula for dew point:
γ = (a × T)/(b + T) + ln(RH/100)
Td = (b × γ)/(a - γ)
Calculate γ:
γ = (17.27 × 28)/(237.7 + 28) + ln(85/100)
γ = 483.56/265.7 + ln(0.85)
γ = 1.820 + (-0.163)
γ = 1.657
Calculate expected dew point:
Td_expected = (237.7 × 1.657)/(17.27 - 1.657)
Td_expected = 393.95/15.613
Td_expected = 25.2°C
Putting Numbers to It
How much does temperature and humidity affect dew point?
The Magnus formula shows dew point is highly sensitive to both T and RH:
\[\frac{\partial T_d}{\partial T} \approx 0.98 \text{ at } T=28°C, RH=85\%\]
Example: If the temperature measurement has ±0.5°C error:
- Dew point uncertainty: \(0.98 \times 0.5 = 0.49°C\)
\[\frac{\partial T_d}{\partial RH} \approx 0.20\,°C \text{ per 1\% RH at } T=28°C\]
Example: If the humidity sensor has ±3% RH error:
- Dew point uncertainty: \(0.20 \times 3 = 0.60°C\)
Combined uncertainty (RSS): \(\sqrt{0.49^2 + 0.60^2} \approx 0.77°C\)
A threshold of ±2°C for plausibility allows generous margin for realistic sensor errors while catching gross faults (like the 7.2°C discrepancy in the worked example).
Step 2: Compare reported vs expected
- Reported: 18°C
- Expected: 25.2°C
- Difference: |25.2 - 18| = 7.2°C
Step 3: Apply plausibility threshold
Typical threshold for dew point validation: ±2°C (accounts for sensor accuracy)
Since 7.2°C > 2°C threshold, the reading fails plausibility check.
Step 4: Verify fundamental constraint
Dew point cannot exceed temperature (Td ≤ T):
- Td = 18°C, T = 28°C
- 18 < 28 ✓ (passes basic constraint)
But the 7.2°C discrepancy suggests sensor error.
Step 5: Diagnose likely sensor fault
Possible causes:
1. Dew point sensor drift: Reading 7°C too low
2. Humidity sensor error: If actual RH was about 55%, then Td ≈ 18°C would be correct
3. Temporal mismatch: Sensors read at different times during rapid weather change
Verification calculation - What RH would give Td = 18°C at T = 28°C?
Reverse Magnus formula:
γ_td = (a × Td)/(b + Td) = (17.27 × 18)/(237.7 + 18) = 310.86/255.7 = 1.216
γ_t = (a × T)/(b + T) = (17.27 × 28)/(237.7 + 28) = 483.56/265.7 = 1.820
RH = 100 × exp(γ_td - γ_t)
RH = 100 × exp(1.216 - 1.820)
RH = 100 × exp(-0.604)
RH = 54.7%
Diagnosis: If humidity is actually 85%, then dew point should be 25°C. If dew point is actually 18°C, then humidity should be about 55%. One of the two sensors is faulty.
Key Insight: Cross-sensor validation using physical laws (like Magnus formula for psychrometrics) can detect sensor drift that range validation alone would miss. Both 85% RH and 18°C dew point are individually valid, but they’re physically inconsistent together at 28°C temperature.
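The worked example's arithmetic can be reproduced with a short script. The function names below are illustrative, not from the chapter; the last two lines also estimate the dew-point sensitivities numerically by central differences.

```python
import math

A, B = 17.27, 237.7  # Magnus constants from the worked example

def dew_point(temp_c, rh_pct):
    """Expected dew point (°C) from temperature and relative humidity."""
    gamma = (A * temp_c) / (B + temp_c) + math.log(rh_pct / 100.0)
    return (B * gamma) / (A - gamma)

def implied_rh(temp_c, dew_point_c):
    """RH (%) that would make a reported dew point consistent with T."""
    g_td = (A * dew_point_c) / (B + dew_point_c)
    g_t = (A * temp_c) / (B + temp_c)
    return 100.0 * math.exp(g_td - g_t)

print(round(dew_point(28, 85), 1))  # ≈ 25.2 °C expected, vs 18 °C reported
print(round(implied_rh(28, 18)))    # ≈ 55 % RH would explain the reported Td

# Sensitivities at (28 °C, 85 % RH), by central differences
dTd_dT = (dew_point(28.1, 85) - dew_point(27.9, 85)) / 0.2   # per °C
dTd_dRH = (dew_point(28, 86) - dew_point(28, 84)) / 2        # per % RH
```

Running the reverse check confirms the diagnosis in Step 5: either the humidity channel or the dew-point channel is wrong, and the two candidate corrections are mutually exclusive.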
41.5.6 Interactive: Dew Point Plausibility Calculator
Use this calculator to check whether a weather station’s temperature, humidity, and dew point readings are physically consistent. Adjust the sliders to explore how sensor errors affect plausibility.
Decision Framework: Validation Rule Selection
Select appropriate validation rules based on sensor type, deployment environment, and criticality:
| Validation Type | Sensor Types | Threshold Setting | False Positive Risk | False Negative Risk | Recommended Use |
|---|---|---|---|---|---|
| Range (Physical Bounds) | All sensors | Hard limits from physics/specs | Very Low | Very Low | Always implement first |
| Range (Expected Bounds) | Environmental sensors | Historical data ±3σ | Medium | Medium | Normal operation monitoring |
| Rate-of-Change | Continuous sensors (temp, pressure) | Max physical rate × 1.5 safety factor | Low | Medium | Detect hardware faults |
| Cross-Sensor Plausibility | Related measurements (temp/RH/dewpoint) | Physics-based formulas ±2% | High (if correlated failure) | Low | High-value applications |
| Stuck Value Detection | All analog sensors | Zero variance over N samples | Very Low | Low | Detect frozen sensors |
| Heartbeat Timeout | All networked devices | Expected interval × 3 | Medium | Very Low | Connectivity monitoring |
Rule Priority (Most Strict → Most Lenient):
- Physical Bounds (NEVER exceeds, e.g., humidity > 100%)
- Rate-of-Change (Cannot change faster than physics allows)
- Cross-Sensor (Must be consistent with related sensors)
- Expected Bounds (Historical normal range)
- Stuck Value (Variance too low for analog sensor)
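Stuck-value detection appears in the table and priority list but has no code elsewhere in this section. A minimal sketch follows; the class name, window size, and variance threshold are illustrative assumptions (a real threshold would come from the sensor's noise floor).

```python
import numpy as np

class StuckValueDetector:
    """Flag an analog sensor whose recent readings show near-zero variance.
    A healthy analog sensor always exhibits some noise; a perfectly flat
    signal usually means a frozen ADC, a cached value, or a dead sensor."""
    def __init__(self, variance_threshold=0.01, window_size=10):
        self.variance_threshold = variance_threshold  # assumed noise floor
        self.window_size = window_size
        self.buffer = []

    def check(self, value):
        """Returns True once the full window has suspiciously low variance."""
        self.buffer.append(value)
        if len(self.buffer) > self.window_size:
            self.buffer.pop(0)
        if len(self.buffer) < self.window_size:
            return False  # not enough history yet
        return float(np.var(self.buffer)) < self.variance_threshold

detector = StuckValueDetector()
for reading in [22.0] * 10:
    stuck = detector.check(reading)
# stuck is True after ten identical readings
```

Per the decision tree below, a stuck-value hit should flag the sensor for maintenance rather than reject the reading.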
Threshold Tuning Strategy:
```python
import numpy as np

def calculate_validation_thresholds(historical_data, sensor_type):
    """
    Calculate validation thresholds from historical data.
    Returns: dict with validation rules
    """
    mean = np.mean(historical_data)
    std = np.std(historical_data)
    p1 = np.percentile(historical_data, 1)    # 1st percentile
    p99 = np.percentile(historical_data, 99)  # 99th percentile
    if sensor_type == "temperature_indoor":
        return {
            'physical_min': -50,             # Absolute physical limit
            'physical_max': 100,
            'expected_min': mean - 3*std,    # 3-sigma statistical limit
            'expected_max': mean + 3*std,
            'max_rate_per_sec': 0.5,         # Max 0.5°C/sec (very fast HVAC)
            'stuck_threshold': 0.1,          # Variance < 0.1 over 10 samples = stuck
            'plausibility_checks': ['humidity_dewpoint_consistency']
        }
    elif sensor_type == "humidity":
        return {
            'physical_min': 0,
            'physical_max': 100,
            'expected_min': max(p1, 20),     # At least 20%
            'expected_max': min(p99, 95),    # At most 95%
            'max_rate_per_sec': 2.0,         # Max 2% per second
            'stuck_threshold': 0.5,
            'plausibility_checks': ['temperature_dewpoint_consistency']
        }
    # ... other sensor types
```
Decision Tree for Validation Failure:
- Fails physical bounds? → REJECT immediately (hardware fault)
- Fails rate-of-change? → REJECT (likely sensor disconnection or electrical noise)
- Fails cross-sensor plausibility? → FLAG for investigation, possibly ACCEPT with warning
- Fails expected bounds? → ACCEPT but FLAG (might be legitimate extreme condition)
- Fails stuck value check? → ACCEPT current reading but FLAG sensor for maintenance
Example Multi-Layer Validation:
```python
import numpy as np

def validate_reading(reading, timestamp, validation_rules, sensor_history):
    """sensor_history: list of {'value': ..., 'timestamp': ...} dicts, oldest
    first; timestamp: arrival time of the current reading in seconds."""
    flags = []
    # Layer 1: Physical bounds (strictest)
    if not (validation_rules['physical_min'] <= reading <= validation_rules['physical_max']):
        return {'valid': False, 'reason': 'physical_bounds', 'flags': flags}
    # Layer 2: Rate-of-change
    if sensor_history:
        last = sensor_history[-1]
        time_delta = timestamp - last['timestamp']
        if time_delta > 0:
            rate = abs(reading - last['value']) / time_delta
            if rate > validation_rules['max_rate_per_sec']:
                return {'valid': False, 'reason': 'rate_of_change',
                        'rate': rate, 'flags': flags}
    # Layer 3: Expected bounds (warning only)
    if not (validation_rules['expected_min'] <= reading <= validation_rules['expected_max']):
        flags.append('outside_normal_range')
    # Layer 4: Stuck value check (needs a full window of history)
    if len(sensor_history) >= 10:
        recent_variance = np.var([r['value'] for r in sensor_history[-10:]])
        if recent_variance < validation_rules['stuck_threshold']:
            flags.append('possible_stuck_sensor')
    return {'valid': True, 'value': reading, 'flags': flags}
```
Common Mistake: Fixed Thresholds Across All Deployments
The Mistake: Using the same validation thresholds (e.g., temperature: -10°C to 50°C) across all deployments, regardless of geographic location, season, or building type. This causes either excessive false positives (rejecting valid extreme readings) or false negatives (accepting impossible readings for that context).
Why It Happens: Default thresholds from sensor datasheets are convenient. Configuration management is simpler with universal thresholds. Developers test in one location and never revisit thresholds. Threshold tuning is seen as “optional” rather than mandatory.
Example of the Problem:
```python
# WRONG: Universal threshold for all deployments
TEMP_MIN = -10  # °C
TEMP_MAX = 50   # °C

# Deployment 1: Phoenix, Arizona in summer
reading = 52  # °C — Valid extreme (record highs exceed 50°C)
if reading > TEMP_MAX:  # True! 52 > 50
    reject()  # Wrongly rejects valid extreme reading!

# Deployment 2: Fairbanks, Alaska in winter
reading = -35  # °C — Valid (extreme cold snap)
if reading < TEMP_MIN:  # True! -35 < -10
    reject()  # Wrongly rejects valid data!

# Deployment 3: Data center
reading = 35  # °C — Alarm! Cooling failure
if reading > TEMP_MAX:  # False! 35 < 50
    pass  # Wrongly accepts dangerous condition
```
The Fix: Context-aware thresholds based on deployment characteristics:
```python
# CORRECT: Context-aware thresholds
def get_validation_thresholds(deployment_context):
    location = deployment_context['location']
    sensor_placement = deployment_context['placement']
    season = deployment_context['season']
    if sensor_placement == "outdoor":
        if location == "Phoenix_AZ":
            return {'min': -5, 'max': 52}    # Extreme desert range
        elif location == "Fairbanks_AK":
            return {'min': -50, 'max': 35}   # Extreme cold range
        elif location == "Miami_FL":
            return {'min': 5, 'max': 45}     # Tropical range
    elif sensor_placement == "datacenter":
        return {'min': 15, 'max': 27}        # Tight operational range
    elif sensor_placement == "server_room":
        return {'min': 18, 'max': 32}        # Cooling failure at 30+
    elif sensor_placement == "residential_indoor":
        if season == "winter":
            return {'min': 10, 'max': 30}    # Lower normal in winter
        elif season == "summer":
            return {'min': 18, 'max': 40}    # AC might fail
    # Default fallback (widest possible range)
    return {'min': -40, 'max': 60}
```
Better: Learn Thresholds from Historical Data:
```python
import numpy as np

def learn_thresholds_from_history(sensor_id, historical_readings, confidence=0.99):
    """
    Learn context-specific thresholds from historical data.
    Uses percentile bounds to allow for rare but valid extremes.
    """
    # Remove outliers first (bootstrap approach)
    clean_data = remove_outliers_iqr(historical_readings)
    # Calculate percentile-based thresholds (0.5th and 99.5th at confidence=0.99)
    lower_bound = np.percentile(clean_data, (1 - confidence) / 2 * 100)
    upper_bound = np.percentile(clean_data, (1 + confidence) / 2 * 100)
    # Add safety margin (10%)
    margin = (upper_bound - lower_bound) * 0.10
    return {
        'min': lower_bound - margin,
        'max': upper_bound + margin,
        'learned_from': len(historical_readings),
        'confidence': confidence
    }

# Usage
sensor_history = fetch_last_30_days(sensor_id="temp_01")
thresholds = learn_thresholds_from_history("temp_01", sensor_history)
# Result: Phoenix summer → min=25°C, max=52°C (learned from data!)
#         Alaska winter → min=-48°C, max=-5°C (learned from data!)
```
Adaptive Thresholds by Season:
```python
import datetime

def get_seasonal_thresholds(sensor_id, current_month):
    """
    Adjust thresholds based on season.
    """
    current_year = datetime.date.today().year
    # Fetch historical data for this month across previous years
    historical_this_month = fetch_historical_data(
        sensor_id=sensor_id,
        month=current_month,
        years=[current_year - 3, current_year - 2, current_year - 1]
    )
    # Learn thresholds specific to this month
    return learn_thresholds_from_history(sensor_id, historical_this_month)

# Usage
january_thresholds = get_seasonal_thresholds("outdoor_temp_01", current_month=1)
july_thresholds = get_seasonal_thresholds("outdoor_temp_01", current_month=7)
# January in Alaska: min=-50, max=5
# July in Alaska: min=8, max=30
# System automatically adapts to seasonal patterns!
```
Warning Signs of This Mistake:
- Rejection rate varies wildly between deployments (2% in one city, 45% in another)
- Valid extreme weather readings are flagged as errors
- Data center temperature alarms don’t trigger until catastrophic failure
- Support tickets: “Why is the system rejecting our data?”
Real-World Impact: An agricultural IoT system deployed across 15 U.S. states used fixed thresholds of -10°C to 50°C. In North Dakota winter, outdoor sensors hit -38°C (valid), causing 23% of readings to be rejected as “impossible.” In Arizona summer, greenhouse temperatures reached 48°C (valid), which should have triggered cooling alarms but were accepted as “normal.” The fix: Per-deployment threshold configuration reduced false rejections from 23% to 0.8% and caught 6 greenhouse cooling failures that the fixed thresholds missed.
41.6 Knowledge Check
Common Pitfalls
1. Validating only format, not physics
JSON syntax validation confirms the message is parseable but does not catch a temperature sensor stuck at 22.0°C for 6 hours (frozen value anomaly) or a humidity sensor reporting 108%. Add physics-based range and consistency rules to the validation layer.
2. Treating validation as a one-time deployment task
Sensor calibration drifts over time, new device firmware may change payload formats, and new sensor types may be added. Review and update validation rules whenever sensor hardware or firmware changes.
3. Hard-rejecting all invalid readings
Sometimes a partially valid reading (correct timestamp, invalid value) is more useful than no reading at all. Consider soft validation that flags invalid fields but still stores the record, rather than hard rejection that silently drops data.
4. Not logging validation failures for analysis
Validation failures are diagnostic signals — a sensor that fails range checks 10% of the time may be failing, calibration may have drifted, or the validation threshold may be wrong. Log, aggregate, and alert on validation failure rates.
41.7 Summary
Data validation and outlier detection form the first critical stages of the data quality pipeline:
- Range Validation: Check values against physical bounds - temperature cannot exceed what the environment allows, humidity cannot exceed 100%
- Rate-of-Change Validation: Physical systems have inertia - detect sensor faults by catching impossible jumps
- Multi-Sensor Plausibility: Cross-validate related measurements using physical relationships
- Z-Score Detection: Good for Gaussian distributions, uses standard deviations from mean
- IQR Detection: Robust to skewed data and extreme values, uses percentiles
- MAD Detection: Most robust method, based on median rather than mean
Critical Design Principle: Implement validation at the edge before data travels. Range and rate-of-change checks are computationally cheap and catch most sensor faults immediately.
41.8 Concept Relationships
Validation is the critical first stage that prevents garbage data from corrupting downstream analysis:
Foundation (This chapter):
- Range validation: physical bounds (temperature cannot exceed boiling point for water sensors)
- Rate-of-change: detect impossible jumps (indoor temperature cannot change 20°C in one second)
- Multi-sensor plausibility: cross-validate using physics (dew point cannot exceed temperature)
Immediate Next Steps (Apply to validated data):
- Missing Value Imputation and Noise Filtering - Fill gaps and smooth noise in data that passed validation
- Data Normalization and Preprocessing Lab - Scale validated, cleaned data for analysis
Critical Dependencies:
- Edge Data Acquisition - Validation at edge costs 1% of cloud-side fixes; catch errors at source
- Sensor Fundamentals - Sensor drift and failure modes inform validation thresholds
Downstream Impact (Require validated inputs):
- Multi-Sensor Data Fusion - One corrupt sensor reading poisons entire fused output
- Anomaly Detection - Invalid data creates false positives; validation separates hardware faults from real anomalies
Key Insight: Validation must occur BEFORE cleaning. If you interpolate between a valid reading (23°C) and invalid reading (500°C), the interpolated values are wildly wrong. Reject impossible values first, then operate on remaining valid data.
41.9 What’s Next
| If you want to… | Read this |
|---|---|
| Learn imputation techniques for readings that fail validation | Data Quality Imputation and Filtering |
| Study the full preprocessing pipeline | Data Quality and Preprocessing |
| Apply normalisation after validation | Data Quality Normalisation Lab |
| Apply validated data to anomaly detection | Anomaly Detection Overview |
| Return to the module overview | Big Data Overview |
See Also
Data Quality Pipeline:
- Data Quality and Preprocessing - Complete pipeline overview
- Missing Value Imputation and Noise Filtering - Stage 2: Cleaning validated data
- Data Normalization and Preprocessing Lab - Stage 3: Scaling and hands-on practice
Foundational Context:
- Edge Data Acquisition - Where raw data originates
- Sensor Fundamentals - Understanding sensor characteristics and failure modes
Applications:
- Multi-Sensor Data Fusion - Combining preprocessed data
- Anomaly Detection - Finding meaningful outliers in clean data