One of the most famous data sets in statistics is the iris data, first studied by Fisher. For each of three species of iris (Iris setosa, Iris virginica and Iris versicolor) we have four measurements: the lengths and widths of the petals and sepals. The goal is to determine the species of a flower from these measurements.
kable.nice(head(iris), do.row.names = FALSE)
| Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
|---|---|---|---|---|
| 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 5.0 | 3.6 | 1.4 | 0.2 | setosa |
| 5.4 | 3.9 | 1.7 | 0.4 | setosa |
table(iris$Species)
##
## setosa versicolor virginica
## 50 50 50
Here are the pairwise scatterplots of the four measurements, colored by species:
library(ggplot2)
library(grid)
pushViewport(viewport(layout = grid.layout(3, 3)))
print(ggplot(data=iris,
             aes(Sepal.Length, Sepal.Width, color=Species)) +
        geom_point() + theme(legend.position="none"),
      vp=viewport(layout.pos.row=1, layout.pos.col=1))
print(ggplot(data=iris,
             aes(Sepal.Length, Petal.Length, color=Species)) +
        geom_point() + theme(legend.position="none"),
      vp=viewport(layout.pos.row=1, layout.pos.col=2))
print(ggplot(data=iris,
             aes(Sepal.Length, Petal.Width, color=Species)) +
        geom_point() + theme(legend.position="none"),
      vp=viewport(layout.pos.row=1, layout.pos.col=3))
print(ggplot(data=iris,
             aes(Sepal.Width, Petal.Length, color=Species)) +
        geom_point() + theme(legend.position="none"),
      vp=viewport(layout.pos.row=2, layout.pos.col=2))
print(ggplot(data=iris,
             aes(Sepal.Width, Petal.Width, color=Species)) +
        geom_point() + theme(legend.position="none"),
      vp=viewport(layout.pos.row=2, layout.pos.col=3))
print(ggplot(data=iris,
             aes(Petal.Length, Petal.Width, color=Species)) +
        geom_point() + theme(legend.position="none"),
      vp=viewport(layout.pos.row=3, layout.pos.col=3))
It is clear that the different species have different values of the predictors, but with some overlap.
If we code the response variable Species (say setosa=0, versicolor=1 and virginica=2) we again have a regression problem with a discrete response variable, like in the last section. However, there are some differences, as we will see below.
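Such a coding is easy to do in R because Species is stored as a factor. Here is a quick sketch (the name Code is just for illustration):
# the factor levels are setosa, versicolor, virginica, so this gives the 0/1/2 coding above
Code <- as.numeric(iris$Species) - 1
table(iris$Species, Code)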
We will consider three artificial examples. Here is a routine that generates some data for each of them:
gen.ex <- function(which, n=50) {
  library(mvtnorm)
  # example 1: two normal clusters with different means
  ex1 <- function(mu=2, n=50) {
    x1 <- rmvnorm(n, mean=c(0, 0), sigma=diag(2))
    x2 <- rmvnorm(n, mean=c(mu, mu), sigma=diag(2))
    data.frame(x=c(x1[, 1], x2[, 1]),
               y=c(x1[, 2], x2[, 2]),
               group=rep(c("A", "B"), each=n))
  }
  # example 2: one group inside the unit circle, the other outside a circle of almost the same radius
  ex2 <- function(mu=2, n=50) {
    x <- cbind(runif(10000), runif(10000, -1, 1))
    x <- x[x[, 1]^2 + x[, 2]^2 < 1, ]
    x <- x[1:n, ]
    y <- cbind(runif(10000, 0, 2), runif(10000, -2, 2))
    y <- y[y[, 1]^2 + y[, 2]^2 > 0.9, ]
    y <- y[1:n, ]
    data.frame(x=c(x[, 1], y[, 1]),
               y=c(x[, 2], y[, 2]),
               group=rep(c("A", "B"), each=n))
  }
  # example 3: three normal clusters along a line
  ex3 <- function(mu=2, n=33) {
    x1 <- rmvnorm(n, mean=c(0, 0), sigma=diag(2))
    x2 <- rmvnorm(n, mean=c(mu, mu), sigma=diag(2))
    x3 <- rmvnorm(n, mean=2*c(mu, mu), sigma=diag(2))
    data.frame(x=c(x1[, 1], x2[, 1], x3[, 1]),
               y=c(x1[, 2], x2[, 2], x3[, 2]),
               group=rep(c("A", "B", "C"), each=n))
  }
  if(which==1)
    df <- ex1(n=n)
  if(which==2)
    df <- ex2(n=n)
  if(which==3)
    df <- ex3(n=n)
  # numerical coding of the groups: A=0, B=1 (and C=2 in example 3)
  df$Code <- ifelse(df$group=="A", 0, 1)
  if(which==3)
    df$Code[df$group=="C"] <- 2
  df
}
Here is what the three examples look like:
ggplot(data=gen.ex(1, n=150), aes(x, y, color=group)) +
  geom_point() +
  labs(x="x1", y="x2") +
  theme(legend.position="none")
ggplot(data=gen.ex(2, n=150), aes(x, y, color=group)) +
  geom_point() +
  labs(x="x1", y="x2") +
  theme(legend.position="none")
ggplot(data=gen.ex(3, n=150), aes(x, y, color=group)) +
  geom_point() +
  labs(x="x1", y="x2") +
  theme(legend.position="none")
Let’s say we code the response variable as 0 for group A and 1 for group B when there are two groups (examples 1 and 2), or as 0, 1 and 2 for groups A, B and C when there are three (example 3); this is the Code column created by gen.ex. Then we run the linear regression of this coded response on x1 and x2.
Now the model is the same as always: \(\pmb{y=X\beta+\epsilon}\), and so the estimator is the usual \(\pmb{\hat{\beta}=(X'X)^{-1}X'y}\). Say we have two groups, with the observations ordered so that all cases with y=0 come first. Then
\[\pmb{y} = \begin{pmatrix} 0 & 0 & ... & 0 & 1 & ... & 1 \end{pmatrix}'\]
Let’s say there are n observations from group 0 and m from group 1, so
\[
\pmb{X'y} = \begin{pmatrix} \pmb{X'_{n}} & \pmb{X'_{m}} \end{pmatrix} \pmb{y} = \pmb{X'_{m}}\pmb{j} = \begin{pmatrix} \sum_{i=n+1}^{n+m} x_{i1}\\ \vdots \\ \sum_{i=n+1}^{n+m} x_{ik}\end{pmatrix}
\]
where \(\pmb{X'_{n}}\) and \(\pmb{X'_{m}}\) are the columns of \(\pmb{X'}\) belonging to the two groups and \(\pmb{j}\) is a vector of m ones, so the entries of \(\pmb{X'y}\) are simply the sums of the predictor values over the observations in group 1.
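We can verify this identity numerically. In the following sketch (the names X and yvec are just for illustration) we use the data from example 1, with an intercept column included as lm would add it:
df <- gen.ex(1)
X <- cbind(1, df$x, df$y)    # design matrix with an intercept column
yvec <- df$Code              # the 0's (group A) come first, then the 1's (group B)
t(X) %*% yvec                # X'y
colSums(X[yvec==1, ])        # column sums over the group coded 1 - the same numbers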
Finally we assign a point (x1, x2) to group A if its predicted response is <0.5 and to group B otherwise (examples 1 and 2). In example 3 we assign it to A if the predicted response is <2/3, to C if it is >4/3, and to B in between.
Of course in the case of two groups we could also use logistic regression, but we won’t pursue this idea here.
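For the curious, here is roughly what that would look like for example 1, a minimal sketch (fit.log is just an illustrative name); we classify a point as B if its estimated probability exceeds 1/2:
df <- gen.ex(1)
fit.log <- glm(Code~x+y, family=binomial, data=df)
pred <- ifelse(predict(fit.log, type="response")<0.5, "A", "B")
table(df$group, pred)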
To see what this looks like we make an evenly spaced grid of points and predict the group for each of them. Then we overlay that grid onto the graph. This is done with the following routines:
make.grid <- function(df) {
  # evenly spaced 100x100 grid covering the range of the data
  x <- seq(min(df$x), max(df$x), length=100)
  y <- seq(min(df$y), max(df$y), length=100)
  expand.grid(x=x, y=y)
}
do.graph <- function(df, df1) {
  # data points on top of the lightly shaded predicted grid
  print(ggplot(data=df, aes(x, y, color=group)) +
          geom_point(size=2) +
          labs(x="x1", y="x2") +
          theme(legend.position="none") +
          geom_point(data=df1,
                     aes(x, y, color=group),
                     alpha=0.1, inherit.aes=FALSE))
}
Here are our three examples:
df <- gen.ex(1)
df$Code <- ifelse(df$group=="A", 0, 1)
fit <- lm(Code~x+y, data=df)
df1 <- make.grid(df)
df1$group <- ifelse(predict(fit, df1)<0.5, "A", "B")
do.graph(df, df1)
df <- gen.ex(2)
df$Code <- ifelse(df$group=="A", 0, 1)
fit <- lm(Code~x+y, data=df)
df1 <- make.grid(df)
df1$group <- ifelse(predict(fit, df1)<0.5, "A", "B")
do.graph(df, df1)
df <- gen.ex(3)
df$Code <- ifelse(df$group=="A", 0, 1)
df$Code[df$group=="C"] <- 2
fit <- lm(Code~x+y, data=df)
df1 <- make.grid(df)
tmp <- predict(fit, df1)
df1$group <- ifelse(tmp<2/3, "A", "B")
df1$group[tmp>4/3] <-"C"
do.graph(df, df1)
This seems to work reasonably well for examples 1 and 3, but not so well for example 2.
Let’s have a closer look at example 1:
df <- gen.ex(1)
fit <- lm(Code~x+y, data=df)
coef(fit)
## (Intercept) x y
## 0.2279637 0.1064668 0.2012716
We assign the group depending on whether the fitted value is smaller or larger than 0.5. What do we get if it is exactly equal to 0.5?
\[ \begin{aligned} & 0.5 = \beta_0 +\beta_1 x_1+ \beta_2 x_2\\ & x_2 = (0.5 - \beta_0 -\beta_1 x_1)/\beta_2 \\ & x_2 = \frac{0.5-\beta_0}{\beta_2}-\frac{\beta_1}{\beta_2}x_1 \end{aligned} \] Let’s add that line to the graph:
ggplot(data=df, aes(x, y, color=group)) +
  geom_point(size=2) +
  theme(legend.position="none") +
  geom_abline(intercept = (0.5-coef(fit)[1])/coef(fit)[3],
              slope=-coef(fit)[2]/coef(fit)[3])
This line is called the decision boundary.
It is easy to see that in example 3 it works the same way, now with two parallel decision boundaries:
df <- gen.ex(3)
fit <- lm(Code~x+y, data=df)
ggplot(data=df, aes(x, y, color=group)) +
  geom_point(size=2) +
  theme(legend.position="none") +
  geom_abline(intercept = (2/3-coef(fit)[1])/coef(fit)[3],
              slope=-coef(fit)[2]/coef(fit)[3]) +
  geom_abline(intercept = (4/3-coef(fit)[1])/coef(fit)[3],
              slope=-coef(fit)[2]/coef(fit)[3])
One thing that sets a classification problem apart from regression is that here we have an obvious way to judge how good a method is, namely the misclassification rate: what percentage of the observations are given the wrong label?
Let’s see:
msr <- function(x, y) {
  # percentage of off-diagonal (misclassified) cases in the confusion table
  z <- table(x, y)
  round((sum(z)-sum(diag(z)))/sum(z)*100, 1)
}
df <- gen.ex(1, n=1000)
fit <- lm(Code~x+y, data=df)
pred <- ifelse(predict(fit)<0.5, "A", "B")
table(df$group, pred)
## pred
## A B
## A 919 81
## B 83 917
msr(df$group, pred)
## [1] 8.2
df <- gen.ex(2, n=1000)
fit <- lm(Code~x+y, data=df)
pred <- ifelse(predict(fit)<0.5, "A", "B")
msr(df$group, pred)
## [1] 19.5
df <- gen.ex(3, n=1000)
fit <- lm(Code~x+y, data=df)
tmp <- predict(fit)
pred <- ifelse(tmp<2/3, "A", "B")
pred[tmp>4/3] <- "C"
msr(df$group, pred)
## [1] 11.8
Of course these misclassification rates are too optimistic: we calculated them on the same data set that we fit the model on. We should always train and test on different data sets, for example using cross-validation:
df <- gen.ex(1, n=1000)
print(dim(df))
## [1] 2000 4
out <- rep(0, 10)
for(i in 1:10) {
  I <- sample(1:2000, size=400)  # hold out 20% of the 2000 observations for testing
  fit <- lm(Code~x+y, data=df[-I, ])
  pred <- ifelse(predict(fit, df[I, 1:2])<0.5, "A", "B")
  out[i] <- msr(df$group[I], pred)
}
mean(out)
## [1] 7.42
Here we split the data into \(80\%\) for training and \(20\%\) for evaluation. Is this a good split? Actually, nobody knows!
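One simple thing we can do is repeat the calculation with a different split and see whether it matters much. Here is a sketch using a 50/50 split on the same data:
out <- rep(0, 10)
for(i in 1:10) {
  I <- sample(1:2000, size=1000)  # now 50% training, 50% testing
  fit <- lm(Code~x+y, data=df[-I, ])
  pred <- ifelse(predict(fit, df[I, 1:2])<0.5, "A", "B")
  out[i] <- msr(df$group[I], pred)
}
mean(out)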
Our method works quite well for examples 1 and 3, but not so well for example 2. Let’s draw the fit for example 2 again:
df <- gen.ex(2)
df$Code <- ifelse(df$group=="A", 0, 1)
fit <- lm(Code~x+y, data=df)
df1 <- make.grid(df)
df1$group <- ifelse(predict(fit, df1)<0.5, "A", "B")
do.graph(df, df1)
The graph shows us why: here a linear decision boundary clearly won’t work. So how about a quadratic one?
df$x2 <- df$x^2
df$y2 <- df$y^2
df$xy <- df$x*df$y
fit <- lm(Code~x+y+x2+y2+xy, data=df)
df1 <- make.grid(df)
df1$x2 <- df1$x^2
df1$y2 <- df1$y^2
df1$xy <- df1$x*df1$y
df1$group <- ifelse(predict(fit, df1)<0.5, "A", "B")
do.graph(df, df1)
And that looks much better!
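As an aside, instead of adding the new columns to the data frame by hand one can also write the quadratic terms directly into the formula with I(); this gives the same fit (fit.alt is just an illustrative name):
fit.alt <- lm(Code ~ x + y + I(x^2) + I(y^2) + I(x*y), data=df)
coef(fit.alt)
One advantage of this form is that predict then only needs the columns x and y in the new data; the quadratic terms are computed automatically.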
Here is the misclassification rate based on cross-validation:
# regenerate example 2 with a larger sample size and add the quadratic terms
df <- gen.ex(2, n=1000)
df$x2 <- df$x^2
df$y2 <- df$y^2
df$xy <- df$x*df$y
out <- rep(0, 10)
for(i in 1:10) {
  I <- sample(1:2000, size=400)  # again hold out 20% for testing
  fit <- lm(Code~x+y+x2+y2+xy, data=df[-I, ])
  pred <- ifelse(predict(fit, df[I, ])<0.5, "A", "B")
  out[i] <- msr(df$group[I], pred)
}
mean(out)
## [1] 7.67
The two solutions we have discussed above, linear and quadratic regression on a coded response, are (slight variations of) what Fisher came up with back when he introduced the iris data set. They are now called linear discriminant analysis (LDA) and quadratic discriminant analysis (QDA), and they are implemented in R by the functions lda and qda in the package MASS:
library(MASS)
df <- gen.ex(1)
fit <- lda(group~x+y, data=df)
df1 <- make.grid(df)
df1$group <- predict(fit, df1)$class
head(df1)
## x y group
## 1 -2.432248 -1.761245 A
## 2 -2.371442 -1.761245 A
## 3 -2.310637 -1.761245 A
## 4 -2.249831 -1.761245 A
## 5 -2.189026 -1.761245 A
## 6 -2.128220 -1.761245 A
do.graph(df, df1)
df <- gen.ex(3)
fit <- lda(group~x+y, data=df)
df1 <- make.grid(df)
df1$group <- predict(fit, df1)$class
do.graph(df, df1)
For example 2 we should instead use qda, which allows for a quadratic decision boundary:
df <- gen.ex(2)
fit <- qda(group~x+y, data=df)
df1 <- make.grid(df)
df1$group <- predict(fit, df1)$class
do.graph(df, df1)
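We can again judge these solutions by their misclassification rates, using the msr routine from above (computed here on the training data, so again a bit optimistic):
df <- gen.ex(1, n=1000)
fit <- lda(group~x+y, data=df)
msr(df$group, predict(fit)$class)
df <- gen.ex(2, n=1000)
fit <- qda(group~x+y, data=df)
msr(df$group, predict(fit)$class)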
Notice a couple of differences between the lm and the lda/qda solutions:

- With lda/qda we don’t have to do any coding; they accept a categorical variable as the response.
- There is a difference between the lm and the lda/qda solutions for examples 2 and 3. Do you see what it is, and why?
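Finally, let’s circle back to the iris data from the beginning of this section. Since Species is already a factor we can apply lda directly, for instance (again judging it on the training data):
fit <- lda(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
           data=iris)
msr(iris$Species, predict(fit)$class)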