Model-Based Outlier Detection – The Data Story Guide

An outlier is a case that is in some way inconsistent with the rest of the data. Outliers can be detected by proposing a "model" to describe the data, and identifying observations that are inconsistent with that model. Common ways of doing this include:

Univariate models
The Mahalanobis distance
Residuals and other influential observations from predictive models

Univariate models

The simplest model-based approach to outlier detection is to assume that data is normally distributed and to define an outlier as any observation that is more than, say, 2.5 or 3 standard deviations from the mean. As discussed in Automated Outlier Detection with Numeric Variables, this is a crude approach.
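As a rough illustration of the standard-deviation rule, the sketch below flags values more than a chosen number of standard deviations from the mean. The function name `zscore_outliers`, the default threshold of 3, and the `readings` data are all made up for this example; it uses only Python's standard library.

```python
from statistics import mean, stdev

def zscore_outliers(values, threshold=3.0):
    """Return (index, value, z) for values more than `threshold` SDs from the mean."""
    m = mean(values)
    s = stdev(values)  # sample standard deviation
    if s == 0:
        return []  # all values identical, so nothing can be flagged
    flagged = []
    for i, v in enumerate(values):
        z = (v - m) / s
        if abs(z) > threshold:
            flagged.append((i, v, z))
    return flagged

# Hypothetical data: twenty typical readings and one extreme value at the end.
readings = [10, 11, 9, 10, 12, 10, 9, 11, 10, 10,
            11, 9, 10, 12, 10, 9, 11, 10, 10, 11, 60]
flagged = zscore_outliers(readings, threshold=3.0)
for index, value, z in flagged:
    print(f"case {index}: value={value}, z={z:.2f}")
```

Note that the extreme value is itself included when the mean and standard deviation are computed, which is part of why the rule is crude.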

More sophisticated approaches involve attempting to work out a better model than the normal distribution and then identifying inconsistent observations.

Such approaches can be useful when automating ongoing quality control. However, they are rarely useful for performing a one-off check of a data file.

The Mahalanobis distance

The Mahalanobis distance measures how far a case is from the multivariate mean. It is the multivariate counterpart of counting standard deviations from the mean, with an additional adjustment for the correlation between variables.
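A minimal sketch of the calculation for two variables follows, assuming the sample mean and sample covariance matrix are used as the model. The function name `mahalanobis_2d` and the data are hypothetical, and the 2x2 matrix inverse is written out by hand so that the example needs only the standard library.

```python
from math import sqrt
from statistics import mean

def mahalanobis_2d(points):
    """Mahalanobis distance of each (x, y) point from the sample mean."""
    n = len(points)
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    mx, my = mean(xs), mean(ys)
    # Sample covariance matrix S = [[sxx, sxy], [sxy, syy]].
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    if det <= 0:
        raise ValueError("covariance matrix is singular")
    distances = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # d^2 = [dx dy] S^-1 [dx dy]^T, with the 2x2 inverse expanded.
        d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
        distances.append(sqrt(d2))
    return distances

# Hypothetical data: nine cases where y tracks x closely, plus a final case
# whose x and y are each within the observed range but break the pattern.
points = [(1, 1.0), (2, 2.1), (3, 2.9), (4, 4.2), (5, 4.8),
          (6, 6.1), (7, 7.0), (8, 7.9), (9, 9.1), (3, 8.0)]
distances = mahalanobis_2d(points)
most_distant = distances.index(max(distances))
print(f"most distant case: {most_distant}, distance={distances[most_distant]:.2f}")
```

In this made-up data the last case is not extreme on either variable taken alone, yet it has the largest distance because it is inconsistent with the correlation between the two variables.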

As with univariate models, this approach can be useful for automating data checking, but there is little evidence that it is particularly useful outside of classrooms.

Residuals and other influential observations from predictive models

A well-known problem with predictive models, such as linear regression, binary logit, ordered logit, and other GLMs, is that conclusions can be surprisingly sensitive to a small number of rogue cases.

Various techniques have been developed for identifying such rogue observations. These techniques work, to varying degrees, by:

Identifying cases that are poorly predicted by the predictive model (i.e., cases with large residuals).
Seeing how results of a predictive model change if a specific case is not used to fit the model (such cases are regarded as being influential).
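Both ideas can be sketched for a simple linear regression with one predictor. The helper names (`fit_line`, `residuals`, `slope_change_when_dropped`) and the data are hypothetical, and the change in slope is used here as just one possible measure of influence; statistical packages typically offer more refined diagnostics.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope

def residuals(xs, ys):
    """Observed minus predicted value for every case."""
    intercept, slope = fit_line(xs, ys)
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]

def slope_change_when_dropped(xs, ys):
    """Refit without each case in turn and record how far the slope moves."""
    _, full_slope = fit_line(xs, ys)
    changes = []
    for i in range(len(xs)):
        xs_i = xs[:i] + xs[i + 1:]
        ys_i = ys[:i] + ys[i + 1:]
        _, slope_i = fit_line(xs_i, ys_i)
        changes.append(slope_i - full_slope)
    return changes

# Hypothetical data: y is roughly 2 * x, except for the last case.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 20]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0, 13.9, 16.1, 15.0]

resid = residuals(xs, ys)
changes = slope_change_when_dropped(xs, ys)
worst_fit = max(range(len(xs)), key=lambda i: abs(resid[i]))
most_influential = max(range(len(xs)), key=lambda i: abs(changes[i]))
print(f"largest residual: case {worst_fit}")
print(f"largest slope change when dropped: case {most_influential}")
```

On this made-up data the two criteria point at different cases: case 7 has the largest residual, while dropping case 8 moves the slope the most. An influential case can pull the fitted line toward itself and so end up with an unremarkable residual, which is why both checks are worth running.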

The math of the various techniques for identifying rogue observations can be quite complicated, but it is not always necessary to understand it. A useful workflow is to:

Select whatever outlier detection mechanisms are available in the software you are using to fit the predictive models.
Investigate the outliers (e.g., look at their raw data).
Re-run the model with the identified outliers removed and see whether the conclusions change in a way that is concerning.