There is a general theorem that asserts that any classifier is equivalent to one based on Bayes’ theorem with some prior. Of course, the choice of prior is usually the difficult issue when doing Bayesian statistics. The idea of the naive Bayes classifier is to keep things as simple as possible and to assume that the features are independent of each other within each class. While this assumption is often unjustified in practical problems, it still leads to a surprisingly good classifier.
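To make the calculation concrete, here is a tiny hand-computed sketch with completely made-up numbers: with two word-features, the (unnormalized) posterior of each class is the prior times the product of the per-word conditional probabilities.
# toy naive Bayes calculation (all probabilities made up for illustration)
prior.spam <- 0.13                          # P(spam)
p.free.spam <- 0.2;  p.free.ham <- 0.01     # P(contains "free" | class)
p.win.spam  <- 0.15; p.win.ham  <- 0.005    # P(contains "win" | class)
# for a message containing both words:
post.spam <- prior.spam*p.free.spam*p.win.spam
post.ham  <- (1-prior.spam)*p.free.ham*p.win.ham
post.spam/(post.spam + post.ham)            # posterior probability of spam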
Another simplification comes from the fact that this classifier is designed for categorical data. It can still be used for quantitative data, but these have to be categorized (binned) first, as in the sketch below.
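For example, a quantitative variable can be binned with the cut function. A minimal sketch (the values and cut points are made up):
x <- c(12, 85, 40, 200, 60)   # some hypothetical quantitative feature
cut(x, breaks=c(0, 50, 150, Inf), labels=c("low", "medium", "high"))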
As an example consider the data set SMSSpamCollection. This data set has 5574 SMS text messages, each labeled as either spam or ham. The goal is to determine from the text of a message whether it is spam or not.
For more info on this data set see https://archive.ics.uci.edu/ml/datasets/sms+spam+collection.
You can download the file with
txt <- scan("http://academic.uprm.edu/wrolke/esma6836/SMSSpamCollection.txt",
what="char", sep="\n")
Let’s see what we have:
head(txt)
[1] "ham Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat..."
[2] "ham Ok lar... Joking wif u oni..."
[3] "spam Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's"
[4] "ham U dun say so early hor... U c already then say..."
[5] "ham Nah I don't think he goes to usf, he lives around here though"
[6] "spam FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv"
So each line starts with either ham or spam, followed by the text of the message. We start by separating these two parts:
type <- substr(txt, 1, 4)
type[type=="ham "] <- "ham"  # drop the trailing blank picked up for "ham"
text <- substr(txt, 5, nchar(txt))
head(type)
## [1] "ham" "ham" "spam" "ham" "ham" "spam"
head(text)
## [1] " Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat..."
## [2] " Ok lar... Joking wif u oni..."
## [3] " Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's"
## [4] " U dun say so early hor... U c already then say..."
## [5] " Nah I don't think he goes to usf, he lives around here though"
## [6] " FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv"
Many classification routines in R work on factors, so we should turn type into one:
type <- factor(type)
table(type)
## type
## ham spam
## 4827 747
So the data set contains 4827 regular (ham) and 747 spam messages.
Let’s visualize the two groups with word clouds:
library(wordcloud)
par(mfrow=c(1, 2))
wordcloud(text[type=="ham"], min.freq = 50, scale=c(2, 0.5))
wordcloud(text[type=="spam"],min.freq = 50, scale=c(2, 0.5))
How can we use the information in the text to decide whether a message is spam or not? The idea is that certain words are more likely to appear in a spam message than in an ordinary one (Viagra, anyone?). So to start we need to split each message into individual words, while keeping track of whether each word came from a spam message or not. We will use the tm (text mining) library for this job:
library(tm)
sms_corpus <- VCorpus(VectorSource(text))
sms_corpus
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 5574
as.character(sms_corpus[[1]])
## [1] " Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat..."
Next we should do some cleaning: turn all words into lower case, and remove numbers and punctuation (periods, commas, question marks, etc.):
sms_corpus <- tm_map(sms_corpus,
content_transformer(tolower))
sms_corpus <- tm_map(sms_corpus, removeNumbers)
sms_corpus <- tm_map(sms_corpus, removePunctuation)
as.character(sms_corpus[[1]])
## [1] " go until jurong point crazy available only in bugis n great world la e buffet cine there got amore wat"
We should also remove words that are used only for grammatical purposes, such as “a”, “is”, etc. Such words are called stopwords in linguistics. The tm library has a built-in collection of them:
head(stopwords())
## [1] "i" "me" "my" "myself" "we" "our"
and we can remove them with
sms_corpus <- tm_map(sms_corpus,
removeWords, stopwords())
as.character(sms_corpus[[1]])
## [1] " go jurong point crazy available bugis n great world la e buffet cine got amore wat"
Some words appear in many forms due to the grammar, such as “work”, “works”, “worked”, etc. Reducing them to one basic form is called stemming. The tm library does this using the SnowballC library:
library(SnowballC)
wordStem(c("work", "works", "worked"))
## [1] "work" "work" "work"
sms_corpus <- tm_map(sms_corpus, stemDocument)
as.character(sms_corpus[[1]])
## [1] " go jurong point crazi avail bugi n great world la e buffet cine got amor wat"
Finally we should remove any extra white space:
sms_corpus <- tm_map(sms_corpus, stripWhitespace)
as.character(sms_corpus[[1]])
## [1] " go jurong point crazy available bugis n great world la e buffet cine got amore wat"
OK! But so far each message is still a single character string. Next we have to count how many times each word appears in each message. This is done by creating a Document Term Matrix (DTM). In this matrix each row is a message (document) and each column is a word (term), and the entry is the number of occurrences. Of course, most words do not appear in most messages, so most entries in this matrix are 0. Such a matrix is called sparse. We can create it with
sms_dtm <- DocumentTermMatrix(sms_corpus)
sms_dtm
## <<DocumentTermMatrix (documents: 5574, terms: 8330)>>
## Non-/sparse entries: 44278/46387142
## Sparsity : 100%
## Maximal term length: 40
## Weighting : term frequency (tf)
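To get a feel for this matrix we can inspect a small corner of it (the row and column ranges below are arbitrary):
inspect(sms_dtm[1:5, 10:15])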
Next we need to split the data set into a training part and a testing part. We will randomly choose about 75% of the data for training (the prob=c(3, 1) below gives each message a 3-in-4 chance of landing in the training set), a quite common split:
I <- sample(c(TRUE, FALSE), size=5574,
replace=TRUE, prob = c(3, 1))
sms_dtm_train <- sms_dtm[I, ]
type_train <- type[I]
sms_dtm_test <- sms_dtm[!I, ]
type_test <- type[!I]
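Note that the split is random, so the exact counts below will differ from run to run; for a reproducible analysis one could call, say, set.seed(1111) before the sample command (the seed value is arbitrary).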
df <- data.frame(
train=c(round(100*prop.table(table(type[I])), 1)),
test=c(round(100*prop.table(table(type[!I])), 1)))
kable.nice(df)
|      | train | test |
|:-----|------:|-----:|
| ham  |  86.9 | 85.8 |
| spam |  13.1 | 14.2 |
Both data sets have about the same percentage of spam messages, as one would hope for a random split.
In order to be useful as a predictor, a word has to appear in at least a minimum number of messages, so it makes sense to eliminate those words that are rare:
freq.words <- findFreqTerms(sms_dtm_train, 5)
str(freq.words)
## chr [1:1259] "â\200¦" "â\200“" "abiola" "able" "abt" "accept" ...
So we are left with 1259 such words. Let’s apply this to both DTMs:
sms_dtm_train <- sms_dtm_train[, freq.words]
sms_dtm_test <- sms_dtm_test[, freq.words]
The naive Bayes routine requires categorical data, so we convert the counts in the DTM matrices into Yes/No indicators:
sms_train <- apply(sms_dtm_train, 2,
function(x) ifelse(x==0, "No", "Yes"))
sms_test <- apply(sms_dtm_test, 2,
function(x) ifelse(x==0, "No", "Yes"))
Finally we are ready to train the classifier:
library(e1071)
fit <- naiveBayes(sms_train, type_train)
sms_predict <- predict(fit, sms_test)
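As an aside, the fitted object stores, for every word, the estimated conditional probabilities of Yes/No given ham and spam. We can peek at one of them (assuming here that “call” made it into our frequent-word list; any other column name works as well):
fit$tables[["call"]]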
Let’s see how we did:
library(gmodels)
CrossTable(type_test, sms_predict,
prop.r = FALSE,
prop.chisq = FALSE,
prop.t = FALSE,
dnn=c("predicted", "actual"))
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Col Total |
## |-------------------------|
##
##
## Total Observations in Table: 1365
##
##
##              | predicted 
##       actual |       ham |      spam | Row Total | 
## -------------|-----------|-----------|-----------|
##          ham |      1167 |         4 |      1171 | 
##              |     0.983 |     0.022 |           | 
## -------------|-----------|-----------|-----------|
##         spam |        20 |       174 |       194 | 
##              |     0.017 |     0.978 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |      1187 |       178 |      1365 | 
##              |     0.870 |     0.130 |           | 
## -------------|-----------|-----------|-----------|
##
##
So only 4 of the 1171 good messages (0.3%) were falsely flagged as spam, while 20 of the 194 spam messages (10.3%) slipped through. Not bad!
Notice that these two mistakes are not equally serious: it is much worse if a good message ends up in your spam folder than if the occasional spam message gets through!
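One way to take this asymmetry into account is to declare a message spam only when we are very sure. For naiveBayes objects, predict can return the posterior probabilities with type="raw", and we can then use a stricter cutoff than the default 0.5. A sketch (the cutoff 0.99 is an arbitrary choice):
post <- predict(fit, sms_test, type="raw")    # matrix with columns ham, spam
strict <- ifelse(post[, "spam"] > 0.99, "spam", "ham")
table(actual=type_test, predicted=strict)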
As it is right now, there is a bit of a problem: say there is a word that appeared only a few times (but 5 or more) in the training set, and always in spam messages. Then the classifier will assign any message containing this word to spam, no matter what. But the word could just as well appear in a ham message; it simply did not in our training set.
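A quick back-of-the-envelope calculation shows the problem (the counts are made up for illustration):
# say a word was seen in 7 of 560 training spam messages, never in ham:
7/560     # estimated P(word | spam)
0/4100    # estimated P(word | ham) is exactly 0
# any message containing this word has P(ham) proportional to 0, so it
# is classified as spam no matter what the rest of the message says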
One way to eliminate this issue is to simply add a count of 1 to all frequencies. Technically this is called the Laplace estimator. It is done with
fit <- naiveBayes(sms_train, type_train,
laplace = 1)
sms_predict <- predict(fit, sms_test)
CrossTable(type_test, sms_predict,
prop.r = FALSE,
prop.chisq = FALSE,
prop.t = FALSE,
dnn=c("predicted", "actual"))
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Col Total |
## |-------------------------|
##
##
## Total Observations in Table: 1365
##
##
##              | predicted 
##       actual |       ham |      spam | Row Total | 
## -------------|-----------|-----------|-----------|
##          ham |      1139 |        32 |      1171 | 
##              |     0.991 |     0.148 |           | 
## -------------|-----------|-----------|-----------|
##         spam |        10 |       184 |       194 | 
##              |     0.009 |     0.852 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |      1149 |       216 |      1365 | 
##              |     0.842 |     0.158 |           | 
## -------------|-----------|-----------|-----------|
##
##
and this has cut the number of spam messages that slip through in half (10 instead of 20). On the other hand, the number of good messages falsely flagged as spam went up from 4 to 32, and as argued above that is the more serious mistake, so here the smoothing involves a real trade-off.
As another example consider the mushrooms data set. It is stored as an Excel file, so we can use the import function of the rio package:
library(rio)
mushrooms <- import("http://academic.uprm.edu/wrolke/esma6836/mushrooms.xlsx")
Let’s see what we have:
str(mushrooms)
## 'data.frame': 8124 obs. of 23 variables:
## $ class : chr "p" "e" "e" "p" ...
## $ cap_shape : chr "x" "x" "b" "x" ...
## $ cap_surface : chr "s" "s" "s" "y" ...
## $ cap_color : chr "n" "y" "w" "w" ...
## $ bruises : chr "t" "t" "t" "t" ...
## $ odor : chr "p" "a" "l" "p" ...
## $ gill_attachment : chr "f" "f" "f" "f" ...
## $ gill_spacing : chr "c" "c" "c" "c" ...
## $ gill_size : chr "n" "b" "b" "n" ...
## $ gill_color : chr "k" "k" "n" "n" ...
## $ stalk_shape : chr "e" "e" "e" "e" ...
## $ stalk_root : chr "e" "c" "c" "e" ...
## $ stalk_surface_above_ring: chr "s" "s" "s" "s" ...
## $ stalk_surface_below_ring: chr "s" "s" "s" "s" ...
## $ stalk_color_above_ring : chr "w" "w" "w" "w" ...
## $ stalk_color_below_ring : chr "w" "w" "w" "w" ...
## $ veil_type : chr "p" "p" "p" "p" ...
## $ veil_color : chr "w" "w" "w" "w" ...
## $ ring_number : chr "o" "o" "o" "o" ...
## $ ring_type : chr "p" "p" "p" "p" ...
## $ spore_print_color : chr "k" "n" "n" "k" ...
## $ population : chr "s" "n" "n" "s" ...
## $ habitat : chr "u" "g" "m" "u" ...
The first entry is either p (= poisonous) or e (= edible), which tells us whether a mushroom is good to eat or not. This is the response (or class) variable. The predictors are the remaining 22 variables, all categorical descriptions of the mushroom (cap, gills, stalk, veil, rings, spore print, population and habitat).
Notice something strange:
table(mushrooms$veil_type)
##
## p
## 8124
So all the mushrooms have the same value. It could be a mistake, or maybe in theory another value is possible but was never observed. At any rate this variable is useless as a predictor, so we take it out:
mushrooms$veil_type <- NULL
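More generally, we can screen for such constant columns programmatically. A minimal sketch of such a check:
# number of distinct values in each column; a constant column has just 1
n.distinct <- sapply(mushrooms, function(x) length(unique(x)))
names(mushrooms)[n.distinct == 1]   # empty after the removal above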
Now
table(mushrooms$class)
##
## e p
## 4208 3916
shows us that 4208/8124 = 51.8% of the mushrooms in the data set are edible.
As before we randomly choose about 75% of the cases for training and the rest for testing. Also, the routines we will use require factors, so we first convert all the columns:
for(i in 1:ncol(mushrooms))
mushrooms[, i] <- as.factor(mushrooms[, i])
I <- sample(c(TRUE, FALSE), size=nrow(mushrooms),
replace = TRUE, prob=c(3, 1))
mushrooms.train <- mushrooms[I, ]
mushrooms.test <- mushrooms[!I, ]
We begin with the so-called 1R classifier. The idea is to keep things as simple as possible: the classifier finds the single predictor, and a rule based on it, that does the best job. It is implemented in the OneR routine of the RWeka library:
library(RWeka)
fit <- OneR(class~., data=mushrooms.train)
fit
## odor:
## a -> e
## c -> p
## f -> p
## l -> e
## m -> p
## n -> e
## p -> p
## s -> p
## y -> p
## (5924/6006 instances correct)
If you check the values of odor you will find that in essence mushrooms that smell bad are poisonous!
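We can verify the idea behind 1R directly. Here is a minimal sketch of the algorithm (my own illustration, not the RWeka implementation): for each predictor, the best one-variable rule predicts the majority class within each level of that predictor, and we record that rule’s accuracy on the training set.
one.r.accuracy <- function(x, y) {
  tbl <- table(x, y)
  # the rule gets the majority class right in every level of x,
  # i.e. the row-wise maxima of the two-way table
  sum(apply(tbl, 1, max))/length(y)
}
acc <- sapply(mushrooms.train[, -1], one.r.accuracy,
              y=mushrooms.train$class)
round(sort(acc, decreasing=TRUE)[1:3], 3)
The predictor with the highest accuracy should come out to be odor, matching the OneR fit above.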
Let’s see how well this does:
predicted <- predict(fit, mushrooms.test)
CrossTable(mushrooms.test$class, predicted,
prop.r = FALSE,
prop.chisq = FALSE,
prop.t = FALSE,
dnn=c("predicted", "actual"))
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Col Total |
## |-------------------------|
##
##
## Total Observations in Table: 2118
##
##
##              | predicted 
##       actual |         e |         p | Row Total | 
## -------------|-----------|-----------|-----------|
##            e |      1080 |         0 |      1080 | 
##              |     0.966 |     0.000 |           | 
## -------------|-----------|-----------|-----------|
##            p |        38 |      1000 |      1038 | 
##              |     0.034 |     1.000 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |      1118 |      1000 |      2118 | 
##              |     0.528 |     0.472 |           | 
## -------------|-----------|-----------|-----------|
##
##
So our classifier didn’t predict a single edible mushroom to be poisonous, but it did predict 38 (or 3.7%) of the 1038 poisonous mushrooms to be edible. Actually a very good performance in most settings, but unfortunately in this case still too dangerous.
The big advantage of this classifier is that it leads to a single simple rule: don’t eat stinky mushrooms!
A somewhat more complicated classifier is implemented in the routine JRip. It implements the so-called RIPPER (Repeated Incremental Pruning to Produce Error Reduction) algorithm. It was designed to allow the use of more than a single feature while still keeping the rules relatively simple:
fit <- JRip(class~., data=mushrooms.train)
fit
## JRIP rules:
## ===========
##
## (odor = f) => class=p (1599.0/0.0)
## (gill_size = n) and (gill_color = b) => class=p (851.0/0.0)
## (gill_size = n) and (odor = p) => class=p (184.0/0.0)
## (odor = c) => class=p (135.0/0.0)
## (spore_print_color = r) => class=p (48.0/0.0)
## (stalk_surface_below_ring = y) and (stalk_surface_above_ring = k) => class=p (48.0/0.0)
## (habitat = l) and (gill_attachment = f) and (population = c) => class=p (13.0/0.0)
## => class=e (3128.0/0.0)
##
## Number of Rules : 8
predicted <- predict(fit, mushrooms.test)
CrossTable(mushrooms.test$class, predicted,
prop.r = FALSE,
prop.chisq = FALSE,
prop.t = FALSE,
dnn=c("predicted", "actual"))
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Col Total |
## |-------------------------|
##
##
## Total Observations in Table: 2118
##
##
##              | predicted 
##       actual |         e |         p | Row Total | 
## -------------|-----------|-----------|-----------|
##            e |      1080 |         0 |      1080 | 
##              |     1.000 |     0.000 |           | 
## -------------|-----------|-----------|-----------|
##            p |         0 |      1038 |      1038 | 
##              |     0.000 |     1.000 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |      1080 |      1038 |      2118 | 
##              |     0.510 |     0.490 |           | 
## -------------|-----------|-----------|-----------|
##
##
and this one makes no errors at all!