Automated Outlier Detection with Numeric Variables – The Data Story Guide

Values in the data that are surprising in some way are often referred to as outliers or unusual values. They can occur with categorical variables, but are typically more of a problem with numeric variables.

An outlier is an observation that is inconsistent with the main pattern or distribution of the data. Various techniques and automated tools have been developed for checking data sets for outliers and unusual values.

The most common approach to identifying outliers is based on how many standard deviations an observation lies from the mean of a variable. For example, values that are more than 2.5 or 3 standard deviations from the mean are classified as outliers.
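The rule above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name `flag_outliers`, the default threshold, and the `ages` data are all made up for the example, and the sample standard deviation is used as an assumption (the text does not say which version to use).

```python
from statistics import mean, stdev


def flag_outliers(values, threshold=3.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    avg = mean(values)
    sd = stdev(values)  # sample standard deviation (an assumption of this sketch)
    if sd == 0:
        return []  # every value is identical, so nothing can be flagged
    return [v for v in values if abs(v - avg) / sd > threshold]


# Hypothetical data: respondent ages with one likely data-entry error.
ages = [34, 36, 35, 33, 37, 35, 34, 36, 35, 33, 37, 35, 34, 36, 350]
print(flag_outliers(ages, threshold=3.0))  # prints [350]
```

Note that the extreme value itself inflates both the mean and the standard deviation, so in small samples an obvious error may only just exceed the threshold.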

This approach to identifying outliers implicitly assumes that the data is normally distributed. This is a problematic assumption: despite the name "normal", it is extremely rare for data to be normally distributed. Making such an assumption can, and often does, lead to incorrect conclusions regarding whether a value is genuinely an outlier. For example, most numeric data in surveys has a restricted range of some kind (e.g., cannot be less than 0, can only be integers), and thus can never be normal, and it is also commonly skewed.
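To see how the standard deviation rule behaves on data of this kind, the sketch below applies it to a small, invented set of survey counts that cannot be negative and are skewed to the right. The data, the variable names, and the 2.5 threshold are assumptions chosen purely for illustration.

```python
from statistics import mean, stdev

# Hypothetical survey counts (e.g., purchases last month):
# non-negative integers with a long right tail.
counts = [0] * 50 + [1] * 25 + [2] * 12 + [3] * 6 + [4] * 3 + [5] * 2 + [6, 8]

avg, sd = mean(counts), stdev(counts)
threshold = 2.5
lower = avg - threshold * sd  # cutoff below the mean
upper = avg + threshold * sd  # cutoff above the mean

flagged = [v for v in counts if v < lower or v > upper]
print(f"cutoffs: {lower:.2f} to {upper:.2f}")
print("flagged:", flagged)
```

Here the lower cutoff is negative, a value that can never occur, so nothing on the low side can ever be flagged, while the values 5, 5, 6, and 8 are all flagged even though they may simply be the ordinary tail of a skewed distribution. Whether they are genuine outliers cannot be decided from the rule alone.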

Consequently, automated outlier detection should only be used in situations where it is impractical to inspect a histogram; provided there is time, reviewing a histogram is always better. See: Use Histograms to Understand Numeric Variables.