The two solutions we discussed in the last section, linear and quadratic regression, are (slight variations of) what Fisher came up with when he introduced the Iris data set. They are now called linear discriminant analysis (LDA) and quadratic discriminant analysis (QDA), and they are implemented in R by the lda and qda functions in the MASS package:
library(MASS)
df <- gen.ex(1)
fit <- lda(group~x+y, data=df)
df1 <- make.grid(df)
# predict for an lda fit returns a list; its class component
# holds the predicted labels
df1$group <- predict(fit, df1)$class
head(df1)
##           x         y group
## 1 -2.917448 -2.142699     A
## 2 -2.850408 -2.142699     A
## 3 -2.783368 -2.142699     A
## 4 -2.716327 -2.142699     A
## 5 -2.649287 -2.142699     A
## 6 -2.582247 -2.142699     A
do.graph(df, df1)
df <- gen.ex(3)
fit <- lda(group~x+y, data=df)
df1 <- make.grid(df)
df1$group <- predict(fit, df1)$class
do.graph(df, df1)
For example 2 we should use qda instead:
df <- gen.ex(2)
fit <- qda(group~x+y, data=df)
df1 <- make.grid(df)
df1$group <- predict(fit, df1)$class
do.graph(df, df1)
Notice a couple of differences between the lm and the lda/qda solutions:
with lda/qda we don't have to code the response numerically; both accept a categorical variable (a factor) as the response, as the sketch after these two points illustrates.
there is a difference between the lm and the lda/qda solutions of examples 2 and 3. Do you see what it is, and why?
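To see the first point in action, here is a minimal sketch. It assumes, as in the loess chunk below, that the data frame returned by gen.ex also carries the 0/1 coding of the groups in a column named Code:
df <- gen.ex(1)
fit.lm <- lm(Code~x+y, data=df)    # needs the numeric 0/1 coding
fit.lda <- lda(group~x+y, data=df) # accepts the factor directly
# lm yields numbers that we have to cut at 0.5 ourselves,
# whereas predict for lda yields class labels right away:
head(ifelse(predict(fit.lm) < 0.5, "A", "B"))
head(predict(fit.lda)$class)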
Instead of using lm we could of course also have used loess:
df <- gen.ex(2)
fit <- loess(Code~x+y, data=df,
             control = loess.control(surface = "direct"))
df1 <- make.grid(df)
df1$group <- ifelse(predict(fit, df1) < 0.5, "A", "B")
do.graph(df, df1)
and in fact that looks quite a bit like the qda solution.
Here is an entirely different idea: we define a distance function, that is, a function that computes the distance between any two points. In our examples we can simply use Euclidean distance, but in other situations other distance functions can be useful. For a point x where we want to make a prediction we find its k nearest neighbors among the training points and assign the label by majority rule. So if k=3 and at least 2 of the 3 nearest neighbors are of type "A", then we assign type "A" to x. This method is called k-nearest neighbors (kNN).
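To make the rule concrete, here is a minimal from-scratch sketch (the function name knn.sketch is made up for illustration; the knn routine from the class package used below does all of this for us, and much faster):
# classify a single point x0, given training coordinates train
# (a numeric matrix) and training labels cl
knn.sketch <- function(x0, train, cl, k=3) {
  d <- sqrt(rowSums(t(t(train)-x0)^2)) # Euclidean distances to x0
  nn <- order(d)[1:k]                  # indices of the k nearest neighbors
  names(which.max(table(cl[nn])))      # majority vote among their labels
}
knn.sketch(c(0, 0), as.matrix(df[, 1:2]), df$group, k=3)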
library(class)
df1$group <- factor(
knn(df[, 1:2], df1[, 1:2], cl=df$group, k=1))
do.graph(df, df1)
df1$group <- factor(
knn(df[, 1:2], df1[, 1:2], cl=df$group, k=3))
do.graph(df, df1)
df1$group <- factor(
knn(df[, 1:2], df1[, 1:2], cl=df$group, k=11))
do.graph(df, df1)
Clearly the choice of k controls the bias-variance trade-off: a small k gives a very flexible boundary with low bias but high variance, whereas a large k averages over many neighbors and yields a smoother, more stable boundary at the price of higher bias.
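How should we choose k in practice? One option (a sketch, not part of the discussion above) is leave-one-out cross-validation, which the class package provides via knn.cv; the grid of k values below is an arbitrary choice:
err <- sapply(c(1, 3, 5, 11, 21), function(k) {
  pred <- knn.cv(df[, 1:2], cl=df$group, k=k)
  mean(pred != df$group) # leave-one-out misclassification rate
})
err
We would then pick a k with a small estimated error.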
Here is the knn solution for the other two cases:
df <- gen.ex(1)
df1 <- make.grid(df)
df1$group <- factor(
knn(df[, 1:2], df1, cl=factor(df$group), k=5))
do.graph(df, df1)
df <- gen.ex(3)
df1 <- make.grid(df)
df1$group <- factor(
knn(df[, 1:2], df1, cl=factor(df$group), k=5))
do.graph(df, df1)
Our ability to apply this method clearly depends on how quickly we can find the nearest neighbors. This problem has been studied extensively in both statistics and computer science, and highly sophisticated algorithms (for instance, those based on kd-trees) are known that can handle millions of cases and hundreds of variables.
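As a small illustration (this assumes the FNN package, which is not used elsewhere in these notes, is installed), a kd-tree based search finds the neighbors of all grid points without computing every pairwise distance:
library(FNN)
# indices of the 5 nearest training points for each row of the grid
nn <- get.knnx(data=df[, 1:2], query=df1[, 1:2], k=5)
str(nn$nn.index)
For low-dimensional data like ours a kd-tree brings the cost of a single query down from order n to roughly order log n.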