Outlier detection

Command: Statistics > Outlier detection

Description

Outlier detection is used to detect anomalous observations in sample data.

Required input

Dialog box for outlier detection

Variable: the name of the variable containing the data to be analyzed.

Filter: an optional filter to include only a selected subgroup of cases in the statistical analysis.

Methods of outlier detection:

Grubbs - left-sided: check only the smallest value (Grubbs, 1969).
Grubbs - right-sided: check only the largest value (Grubbs, 1969).
Grubbs - double-sided: check the most extreme value at either side (Grubbs, 1969).
The single-sided Grubbs' tests are more sensitive than the double-sided test.
Generalized ESD test: the Generalized Extreme Studentized Deviate (ESD) procedure can detect multiple outliers in one step (Rosner, 1983).
- test for maximum number of outliers: enter the maximum number of outliers to detect.
Tukey: check for multiple outliers at either side, categorized as 'outside' or 'far out' values (Tukey, 1977).
- An outside value is defined as a value that is smaller than the lower quartile minus 1.5 times the interquartile range, or larger than the upper quartile plus 1.5 times the interquartile range (the 'inner fences').
- A far out value is defined as a value that is smaller than the lower quartile minus 3 times the interquartile range, or larger than the upper quartile plus 3 times the interquartile range (the 'outer fences').
John Tukey, in fact, did not use the term 'outlier', but used the classifications 'outside' and 'far out'.
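
The fence definitions above can be sketched in a few lines of Python. This is an illustration only, not the program's implementation: the function name `tukey_fences` and the sample values are made up, the quartile method (`method="inclusive"`) is an assumption because the page does not state which quartile definition is used, and the choice to list far out values separately from outside values is also an assumption.

```python
import statistics

def tukey_fences(values):
    """Classify values as 'outside' or 'far out' using Tukey's fences.

    Assumption: quartiles are computed with the 'inclusive' method of
    statistics.quantiles; other quartile definitions give slightly
    different fences.
    """
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    inner_low, inner_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # inner fences
    outer_low, outer_high = q1 - 3.0 * iqr, q3 + 3.0 * iqr  # outer fences

    far_out = [v for v in values if v < outer_low or v > outer_high]
    # Assumption: values beyond the outer fences are reported only as 'far out'.
    outside = [
        v for v in values
        if (v < inner_low or v > inner_high) and not (v < outer_low or v > outer_high)
    ]
    return outside, far_out

# Made-up illustration data
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 20, 40]
outside, far_out = tukey_fences(sample)
print("Outside values:", outside)
print("Far out values:", far_out)
```

For this sample the quartiles are 3.5 and 8.5, so the inner fences lie at -4 and 16 and the outer fences at -11.5 and 23.5.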

Options

Alpha level for Grubbs' and ESD test: select the alpha level (from 0.10 to 0.001). This option applies only to Grubbs' test and the Generalized ESD test. A larger alpha level makes the test more sensitive, so outliers are flagged more readily, but it also increases the risk of false-positive results.
Logarithmic transformation: the outlier detection methods assume that the data follow an approximately Normal distribution (see next option). Some data need to be logarithmically transformed before analysis to meet this assumption. See Logarithmic transformation.
The example on this page uses the data from Rosner (1983) on their original scale. The Logarithmic transformation option is therefore selected, as in the Rosner paper.
Test for Normal distribution: see Tests for Normal distribution.

Results

Outlier detection

Variable	Vit_E_Intake Vit E Intake

Back-transformed after logarithmic transformation.

Sample size	54
Lowest value	0.7800
Highest value	407.4800
Geometric mean	10.1834
Median	8.1249
Coefficient of Skewness	1.1817 (P=0.0011)
Coefficient of Kurtosis	1.9972 (P=0.0248)
Shapiro-Francia test for Normal distribution	W'=0.9000 reject Normality (P=0.0006)

Suspected outliers

Grubbs - double-sided (alpha-level 0.05)
None

Tukey, 1977
Outside values	208.51 225.88 407.48
Far-out values	None

Generalized ESD test (alpha-level 0.05)
208.51 225.88 407.48

Summary statistics

Summary statistics for the selected data are displayed. See Summary statistics.
If the test for Normal distribution reports 'reject Normality' the outlier detection methods may be invalid since they assume that the data follow an approximately normal distribution. Perhaps data should have been logarithmically transformed before analysis.
In the example, data are logarithmically transformed.
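
As a rough illustration of what analysis after logarithmic transformation means, the sketch below transforms the data, computes the mean on the log scale, and back-transforms it to obtain the geometric mean. The function names and sample values are hypothetical, and natural logarithms are an assumption (any base gives the same back-transformed result).

```python
import math

def log_transform(values):
    """Natural-log transform; all values must be strictly positive."""
    return [math.log(v) for v in values]

def back_transform(log_values):
    """Return log-scale results to the original measurement scale."""
    return [math.exp(v) for v in log_values]

def geometric_mean(values):
    """Back-transformed mean of the log-transformed data."""
    logs = log_transform(values)
    return math.exp(sum(logs) / len(logs))

# Made-up illustration data
sample = [1.0, 10.0, 100.0]
print(geometric_mean(sample))
```

Outlier tests would then be run on the output of `log_transform`, and any flagged values reported on the original scale with `back_transform`.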

Suspected outliers

The program lists the outliers identified by the different procedures.

Grubbs' test can detect only a single outlier. If you suspect that there is more than one outlier, do not repeat the procedure; use the Generalized ESD test instead.
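
The following is a minimal sketch of the Generalized ESD procedure as it is commonly described (test statistics R_i compared with critical values lambda_i based on the t distribution). It is not the program's implementation. All function names and the sample data are made up, and the t quantile is approximated numerically with the standard library only, so results may differ slightly from tabulated values. With a maximum of one outlier the critical value is the same as that of the double-sided Grubbs' test.

```python
import math
import statistics

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def t_cdf(x, df):
    """Approximate CDF: Simpson's rule on [0, |x|] plus symmetry."""
    a = abs(x)
    steps = 2 * max(100, int(a * 50))  # even number of intervals
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if x >= 0 else 0.5 - area

def t_ppf(p, df):
    """Approximate quantile for p > 0.5, found by bisection on t_cdf."""
    lo, hi = 0.0, 1.0
    while t_cdf(hi, df) < p:
        hi *= 2
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def generalized_esd(values, max_outliers, alpha=0.05):
    """Return the values flagged as outliers, most extreme first.

    Requires len(values) - max_outliers >= 2 and non-constant data.
    """
    data = list(values)
    n = len(data)
    removed = []
    n_outliers = 0
    for i in range(1, max_outliers + 1):
        mean = statistics.fmean(data)
        sd = statistics.stdev(data)
        extreme = max(data, key=lambda v: abs(v - mean))
        r_i = abs(extreme - mean) / sd  # test statistic R_i
        p = 1 - alpha / (2 * (n - i + 1))
        t = t_ppf(p, n - i - 1)
        lam_i = (n - i) * t / math.sqrt((n - i - 1 + t * t) * (n - i + 1))
        removed.append(extreme)
        data.remove(extreme)
        if r_i > lam_i:
            n_outliers = i  # keep the largest i with R_i > lambda_i
    return removed[:n_outliers]

# Made-up illustration data
sample = [10, 11, 9, 10, 12, 10, 11, 9, 10, 11, 10, 50]
print(generalized_esd(sample, max_outliers=3))
```

Because the number of outliers is taken as the largest i for which R_i exceeds lambda_i, an earlier non-significant step does not stop the procedure. This is what distinguishes it from simply repeating Grubbs' test.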

What to do when you have identified an outlier

Do not remove outliers automatically.

Remove outliers only when a cause can be found for the spurious result, such as a pre-, post-, or analytical error.
When you conclude that a pre-, post-, or analytical error is the cause of the spurious result, be aware that the same errors may exist in the other data values.
Check the distribution of the data. Logarithmically transformed sample data may more closely follow a Normal distribution. Graph the data with and without logarithmic transformation, for example using a Box-and-Whisker plot.
You may consider replacing the outlier value with the next highest or lowest (non-outlier) value.
Keep the outlier but use robust or non-parametric statistical methods that do not assume that data are Normally distributed.
Do the statistical analysis and report conclusions both with and without the suspected outlier.

In all cases, report the outliers and how you have dealt with them.
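
The option of replacing an outlier with the next highest or lowest non-outlier value can be sketched as follows. The function name `replace_with_nearest` and the sample data are hypothetical, and the list of suspected outliers is assumed to come from one of the detection methods on this page. The sketch also prints the mean with and without the replacement, in the spirit of reporting both analyses.

```python
import statistics

def replace_with_nearest(values, outliers):
    """Replace each suspected outlier with the nearest remaining extreme.

    High outliers become the largest non-outlier value and low outliers
    become the smallest non-outlier value. Assumes at least one value is
    not an outlier.
    """
    kept = [v for v in values if v not in outliers]
    lowest, highest = min(kept), max(kept)
    result = []
    for v in values:
        if v in outliers and v > highest:
            result.append(highest)
        elif v in outliers and v < lowest:
            result.append(lowest)
        else:
            result.append(v)
    return result

# Made-up illustration data with one suspected high outlier
sample = [1, 2, 3, 4, 100]
adjusted = replace_with_nearest(sample, outliers=[100])
print("Mean with outlier:   ", statistics.fmean(sample))
print("Mean after replacing:", statistics.fmean(adjusted))
```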

Literature

Grubbs FE (1969) Procedures for detecting outlying observations in samples. Technometrics 11:1-21.
Rosner B (1983) Percentage points for a generalized ESD many-outlier procedure. Technometrics 25:165-172.
Tukey JW (1977) Exploratory data analysis. Reading, Mass: Addison-Wesley Publishing Company.
