Many of the methods discussed in this class don’t work well if the data set has outliers. An outlier is any observation that is in some way unusual compared with the rest of the data.
We have already seen that an observation that is unusual with respect to one variable appears as a separate dot in an R boxplot:
Unfortunately, there are no hard rules for exactly when an observation becomes an outlier. To a large extent that depends on the method of analysis we want to use: some methods are sensitive to outliers, others are more robust.
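One common convention, and the one behind the separate dots in R's boxplot, flags any value lying more than 1.5 box lengths beyond the box. The sketch below applies it with base R's boxplot.stats to a small made-up vector (the numbers are purely illustrative, not course data) and shows that a different cutoff changes the answer.

```r
# Made-up illustrative values; 30 sits far from the rest
x <- c(2, 3, 4, 5, 6, 7, 30)

# boxplot.stats() applies the rule boxplot() uses to draw separate dots:
# points more than coef (default 1.5) box lengths beyond the hinges
flagged <- boxplot.stats(x)$out
flagged

# A more lenient cutoff no longer flags anything
lenient <- boxplot.stats(x, coef = 10)$out
lenient
```

The cutoff of 1.5 is a convention rather than a law, which is one reason there is no single definition of an outlier.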
In addition to the case discussed above, there are other ways in which an observation can be an outlier:
Data from a British government survey of household spending may be used to examine the relationship between household spending on tobacco products and alcoholic beverages. The numbers are the average expenditure for each of 11 regions of the United Kingdom.
alcohol
##              Region Alcohol Tobacco
## 1             North    6.47    4.03
## 2         Yorkshire    6.13    3.76
## 3         Northeast    6.19    3.77
## 4     East_Midlands    4.89    3.34
## 5     West_Midlands    5.63    3.47
## 6       East_Anglia    4.52    2.92
## 7         Southeast    5.89    3.20
## 8         Southwest    4.79    2.71
## 9             Wales    5.27    3.53
## 10         Scotland    6.08    4.51
## 11 Northern_Ireland    4.02    4.56
Here we have two quantitative variables, so the obvious thing to do is draw the scatterplot:
attach(alcohol)
splot(Tobacco, Alcohol)
There seems to be a generally positive relationship, but also one case that does not fit. It has the smallest value for Alcohol, which the data shows is Northern Ireland. There the expenditure on Tobacco is the highest of all regions while the expenditure on Alcohol is the lowest, a surprising combination.
Note that neither Alcohol nor Tobacco has any outliers by itself:
bplot(Alcohol)
bplot(Tobacco)
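The same check can be done numerically. This sketch retypes the two columns from the printout above, so it runs without the course's alcohol data frame, and asks base R's boxplot.stats which values a boxplot would draw as separate dots. It uses the default 1.5 cutoff; bplot is the course's plotting helper and is not used here.

```r
# Values copied from the alcohol printout, in region order
Alcohol <- c(6.47, 6.13, 6.19, 4.89, 5.63, 4.52, 5.89, 4.79, 5.27, 6.08, 4.02)
Tobacco <- c(4.03, 3.76, 3.77, 3.34, 3.47, 2.92, 3.20, 2.71, 3.53, 4.51, 4.56)

# Values that boxplot() would show as separate dots
alcohol_out <- boxplot.stats(Alcohol)$out
tobacco_out <- boxplot.stats(Tobacco)$out

# Both are empty: no region is unusual in either variable on its own
length(alcohol_out)
length(tobacco_out)
```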
Again, it is not always obvious when an observation becomes an outlier:
If we have two quantitative variables, an outlier can occur in one of three ways:
in the x variable, which we can check in the boxplot of x
in the y variable, which we can check in the boxplot of y
in the relationship between the x and the y variable, which we can check in the scatterplot of x and y
In fact we can do all three in one step:
mplot(Tobacco, Alcohol)
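mplot is a helper from the course materials and its internals are not shown here. For readers without it, the following is a rough base R sketch of the same idea, assuming the goal is a scatterplot with a boxplot of each variable in the margins. The function name marginal_plot and its arguments are made up for this illustration.

```r
# Values copied from the alcohol printout, in region order
Alcohol <- c(6.47, 6.13, 6.19, 4.89, 5.63, 4.52, 5.89, 4.79, 5.27, 6.08, 4.02)
Tobacco <- c(4.03, 3.76, 3.77, 3.34, 3.47, 2.92, 3.20, 2.71, 3.53, 4.51, 4.56)

# Hypothetical helper: scatterplot of y against x with marginal boxplots
marginal_plot <- function(x, y, xlab = "x", ylab = "y") {
  old_mar <- par("mar")
  on.exit({ layout(1); par(mar = old_mar) })  # restore the device layout

  # Panel 1: scatterplot (bottom left), panel 2: boxplot of x (top),
  # panel 3: boxplot of y (right); the top right corner stays empty
  layout(matrix(c(2, 0, 1, 3), nrow = 2, byrow = TRUE),
         widths = c(4, 1), heights = c(1, 4))

  par(mar = c(4, 4, 1, 1))
  plot(x, y, xlab = xlab, ylab = ylab)

  par(mar = c(0, 4, 1, 1))
  boxplot(x, horizontal = TRUE, axes = FALSE)  # only roughly aligned with the x axis

  par(mar = c(4, 0, 1, 1))
  boxplot(y, axes = FALSE)                     # only roughly aligned with the y axis

  invisible(NULL)
}
```

A call such as marginal_plot(Tobacco, Alcohol, xlab = "Tobacco", ylab = "Alcohol") then shows all three views on one page.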
Consider the following case:
How many outliers does this data set have? Actually just one, but it appears in each of three graphs.
If we have outliers in a data set, what do we do then? First and foremost, don’t ignore them! Many statistical methods are very sensitive to outliers, and often they simply don’t work when outliers are present.
Example: Is there a relationship between Alcohol and Tobacco expenditures in the United Kingdom? Because we have two quantitative variables we might use Pearson’s correlation coefficient to answer this question:
round(cor(Tobacco, Alcohol), 3)
## [1] 0.224
round(cor(Tobacco[-11], Alcohol[-11]), 3)
## [1] 0.784
So with Northern Ireland we find a weak positive correlation, but without Northern Ireland it is a fairly strong positive correlation.
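Which one is right? The first one is misleading: a single outlier is enough to pull the correlation from 0.784 down to 0.224, so it does not describe the relationship in the other ten regions.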
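One way to see how much a single observation drives the result is to recompute the correlation with each region left out in turn. This is only a sketch in base R, with the data retyped from the printout so that it runs on its own. The name loo is made up for this illustration.

```r
Region <- c("North", "Yorkshire", "Northeast", "East_Midlands", "West_Midlands",
            "East_Anglia", "Southeast", "Southwest", "Wales", "Scotland",
            "Northern_Ireland")
Alcohol <- c(6.47, 6.13, 6.19, 4.89, 5.63, 4.52, 5.89, 4.79, 5.27, 6.08, 4.02)
Tobacco <- c(4.03, 3.76, 3.77, 3.34, 3.47, 2.92, 3.20, 2.71, 3.53, 4.51, 4.56)

# Pearson correlation with region i removed, for every i
loo <- sapply(seq_along(Alcohol), function(i) cor(Tobacco[-i], Alcohol[-i]))
names(loo) <- Region
round(loo, 3)

# The region whose removal raises the correlation the most
names(which.max(loo))
```

Dropping any region other than Northern Ireland leaves that outlier in the data, so only one of the eleven values changes the picture.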
So, if there are outliers, what do we do?
Learn as much as you can about the “story” behind the data and understand why there is an outlier. Is it an error? Is it something we should expect to see in this kind of data? etc.
Find a method that is not sensitive to outliers. For example, alternatives to Pearson’s correlation coefficient include Spearman’s rank correlation coefficient and Kendall’s rank correlation coefficient (Kendall’s tau), although neither of them works much better here.
Try to “adjust” the outliers. If we know what “caused” the Alcohol number for Northern Ireland to be off, maybe we can adjust it.
If all else fails, eliminate the outlier(s).
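The rank-based alternatives to Pearson's coefficient mentioned in the list above are available through the method argument of base R's cor. The sketch below retypes the alcohol data so that it runs on its own, then compares the three coefficients with and without observation 11 (Northern Ireland). The names with_ni and without_ni are made up for this illustration.

```r
Alcohol <- c(6.47, 6.13, 6.19, 4.89, 5.63, 4.52, 5.89, 4.79, 5.27, 6.08, 4.02)
Tobacco <- c(4.03, 3.76, 3.77, 3.34, 3.47, 2.92, 3.20, 2.71, 3.53, 4.51, 4.56)

# All three coefficients for a pair of vectors
all_cors <- function(x, y) {
  c(pearson  = cor(x, y),
    spearman = cor(x, y, method = "spearman"),
    kendall  = cor(x, y, method = "kendall"))
}

with_ni    <- all_cors(Tobacco, Alcohol)
without_ni <- all_cors(Tobacco[-11], Alcohol[-11])

round(rbind(with_ni, without_ni), 3)
```

Comparing the two rows shows how far each coefficient moves when the single outlier is removed.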
attach(brainsize)
mplot(brain.wt.g, body.wt.kg)
Here we have at least two, maybe three outliers. They are at #1 (African Elephant), #5 (Asian Elephant) and #34 (Man). Let’s take them out:
mplot(brain.wt.g[-c(1, 5, 34)], body.wt.kg[-c(1, 5, 34)])
But the remaining data still seems to have outliers!
Here, simply eliminating them does not work.