Outliers - Detection and Treatment

Many of the methods discussed in this class don’t work well if the dataset has outliers. An outlier is any observation that is in some way unusual/strange/weird.

We have already seen that an observation that is unusual with respect to one variable appears as a separate dot in an R boxplot:
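As a reminder of how this works, here is a minimal sketch using base R only. The vector `x` is hypothetical toy data, not part of the course datasets. By default `boxplot()` draws a point separately when it lies more than 1.5 times the length of the box beyond either end of the box, and `boxplot.stats()` applies the same rule without drawing anything.

```r
# Hypothetical toy data: eight ordinary values and one unusually large one
x <- c(2, 3, 3, 4, 4, 5, 5, 6, 20)

# boxplot.stats() uses the same default rule as boxplot() (coef = 1.5)
flagged <- boxplot.stats(x)$out
print(flagged)

# The flagged value is drawn as a separate dot
boxplot(x, horizontal = TRUE)
```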

Unfortunately there are no hard rules for exactly when an observation becomes an outlier. To a large extent that depends on the method of analysis we want to use: some methods are sensitive to outliers, others are more robust.

In addition to the case discussed above, there are other ways in which an observation can be an outlier:

Case Study: Alcohol vs. Tobacco Expenditure

Data from a British government survey of household spending may be used to examine the relationship between household spending on tobacco products and alcoholic beverages. The numbers are the average expenditures in each of 11 regions of the United Kingdom.

alcohol
##              Region Alcohol Tobacco
## 1             North    6.47    4.03
## 2         Yorkshire    6.13    3.76
## 3         Northeast    6.19    3.77
## 4     East_Midlands    4.89    3.34
## 5     West_Midlands    5.63    3.47
## 6       East_Anglia    4.52    2.92
## 7         Southeast    5.89    3.20
## 8         Southwest    4.79    2.71
## 9             Wales    5.27    3.53
## 10         Scotland    6.08    4.51
## 11 Northern_Ireland    4.02    4.56
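The `alcohol` data frame is presumably supplied with the course materials. For readers who want to follow along without them, the sketch below rebuilds it from the values printed above; nothing here is new data.

```r
# Rebuild the data frame from the printed table
alcohol <- data.frame(
  Region = c("North", "Yorkshire", "Northeast", "East_Midlands",
             "West_Midlands", "East_Anglia", "Southeast", "Southwest",
             "Wales", "Scotland", "Northern_Ireland"),
  Alcohol = c(6.47, 6.13, 6.19, 4.89, 5.63, 4.52, 5.89, 4.79, 5.27, 6.08, 4.02),
  Tobacco = c(4.03, 3.76, 3.77, 3.34, 3.47, 2.92, 3.20, 2.71, 3.53, 4.51, 4.56)
)

# Quick check of the structure: 11 rows, 3 columns
str(alcohol)
```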

Here we have two quantitative variables, so the obvious thing to do is draw the scatterplot:

attach(alcohol) 
splot(Tobacco, Alcohol)

There seems to be a generally positive relationship, but one case clearly does not fit. It appears to have the smallest value of Alcohol, so we can easily find out which observation it is:

alcohol[which(Alcohol==min(Alcohol)), ]
##              Region Alcohol Tobacco
## 11 Northern_Ireland    4.02    4.56

So it is Northern Ireland, which has the highest expenditure on Tobacco of all the regions but the lowest expenditure on Alcohol, a surprising combination.

Note that neither Alcohol nor Tobacco has any outliers by itself:

bplot(Alcohol)

bplot(Tobacco)
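The same check can be done numerically. The sketch below assumes base R's default boxplot rule (points more than 1.5 box lengths beyond the box) and uses `boxplot.stats()` rather than the `bplot` routine used above, so it only confirms what the default rule flags.

```r
# Values copied from the table of the alcohol data
Alcohol <- c(6.47, 6.13, 6.19, 4.89, 5.63, 4.52, 5.89, 4.79, 5.27, 6.08, 4.02)
Tobacco <- c(4.03, 3.76, 3.77, 3.34, 3.47, 2.92, 3.20, 2.71, 3.53, 4.51, 4.56)

# Observations flagged by the default boxplot rule, one variable at a time
out_alcohol <- boxplot.stats(Alcohol)$out
out_tobacco <- boxplot.stats(Tobacco)$out

# Both counts are zero: no region is unusual in either variable alone
length(out_alcohol)
length(out_tobacco)
```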

Again, it is not always obvious when an observation becomes an outlier:

If we have two quantitative variables, an outlier can occur in one of three ways:

  • in the x variable, which we can check in the boxplot of x

  • in the y variable, which we can check in the boxplot of y

  • in the relationship between the x and the y variable, which we can check in the scatterplot of x and y

In fact we can do all three in one step:

mplot(Tobacco, Alcohol) 
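`mplot` appears to be a helper from the course materials. For readers without it, a rough base R equivalent can be sketched with `layout()`: a scatterplot with a boxplot of each variable in the margins. The panel sizes and margins below are arbitrary choices, and this is not meant to reproduce `mplot` exactly.

```r
# Values copied from the table of the alcohol data
Alcohol <- c(6.47, 6.13, 6.19, 4.89, 5.63, 4.52, 5.89, 4.79, 5.27, 6.08, 4.02)
Tobacco <- c(4.03, 3.76, 3.77, 3.34, 3.47, 2.92, 3.20, 2.71, 3.53, 4.51, 4.56)

op <- par(no.readonly = TRUE)   # remember current graphics settings

# Panel 1: scatterplot (bottom left), panel 2: boxplot of x (top),
# panel 3: boxplot of y (right); 0 leaves the top-right corner empty
layout(matrix(c(2, 0,
                1, 3), nrow = 2, byrow = TRUE),
       widths = c(4, 1), heights = c(1, 4))

par(mar = c(4, 4, 1, 1))
plot(Tobacco, Alcohol, pch = 19)

par(mar = c(0, 4, 1, 1))
boxplot(Tobacco, horizontal = TRUE, axes = FALSE, ylim = range(Tobacco))

par(mar = c(4, 0, 1, 1))
boxplot(Alcohol, axes = FALSE, ylim = range(Alcohol))

layout(1)   # back to a single panel
par(op)     # restore the saved settings
```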

Treatment of Outliers

If we have an outlier in a dataset, what do we do then? First and foremost, don’t ignore it! Many statistical methods are very sensitive to outliers, and in their presence they often simply don’t work.

Example: Is there a relationship between Alcohol and Tobacco expenditures in the United Kingdom? Because we have two quantitative variables we might use Pearson’s correlation coefficient to answer this question:

cor(Tobacco, Alcohol)
## [1] 0.2235721
cor(Tobacco[-11], Alcohol[-11])
## [1] 0.7842873

So with Northern Ireland we find a weak positive correlation, but without Northern Ireland it is a fairly strong positive correlation.
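To see that Northern Ireland is the only region with this much influence, one can drop each region in turn and recompute the correlation. This leave-one-out check is an illustrative sketch and not a method used elsewhere in these notes; the name `loo` is introduced here only for the example.

```r
# Values copied from the table of the alcohol data
Region <- c("North", "Yorkshire", "Northeast", "East_Midlands",
            "West_Midlands", "East_Anglia", "Southeast", "Southwest",
            "Wales", "Scotland", "Northern_Ireland")
Alcohol <- c(6.47, 6.13, 6.19, 4.89, 5.63, 4.52, 5.89, 4.79, 5.27, 6.08, 4.02)
Tobacco <- c(4.03, 3.76, 3.77, 3.34, 3.47, 2.92, 3.20, 2.71, 3.53, 4.51, 4.56)

# Pearson correlation with the i-th region removed, for every i
loo <- sapply(seq_along(Alcohol),
              function(i) cor(Tobacco[-i], Alcohol[-i]))
names(loo) <- Region

round(loo, 3)

# Region whose removal gives the largest correlation
names(which.max(loo))
```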

Which one is right? The first one is misleading, because a single outlier masks the relationship seen in the other ten regions.

So, if there are outliers, what do we do?

  1. Learn as much as you can about the “story” behind the data and understand why there is an outlier. Is it an error? Is it something we should expect to see in this kind of data? etc.

  2. Find a method that is not sensitive to outliers. For example, alternatives to Pearson’s correlation coefficient include Spearman’s rank correlation coefficient and Kendall’s tau, although neither of them works any better here.

  3. Try to “adjust” the outliers. If we know what “caused” the Alcohol number for Northern Ireland to be off, maybe we can adjust it.

  4. If all else fails, eliminate the outlier(s).
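The rank-based alternatives mentioned in item 2 are available through the `method` argument of base R's `cor()`. The sketch below computes all three coefficients on the full data, with Northern Ireland included.

```r
# Values copied from the table of the alcohol data
Alcohol <- c(6.47, 6.13, 6.19, 4.89, 5.63, 4.52, 5.89, 4.79, 5.27, 6.08, 4.02)
Tobacco <- c(4.03, 3.76, 3.77, 3.34, 3.47, 2.92, 3.20, 2.71, 3.53, 4.51, 4.56)

r_pearson  <- cor(Tobacco, Alcohol)                       # default method
r_spearman <- cor(Tobacco, Alcohol, method = "spearman")  # rank correlation
r_kendall  <- cor(Tobacco, Alcohol, method = "kendall")   # Kendall's tau

round(c(pearson = r_pearson, spearman = r_spearman, kendall = r_kendall), 3)
```

On these data both rank-based coefficients remain well below the 0.78 obtained without Northern Ireland, which is why item 2 does not help much here.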