Introduction

Say we have data coming from k distinct groups. For a set of n observations we know the correct group assignment, and we want to find a "function" that assigns a new observation to one of the k groups based on its measurements.

We will consider three artificial examples:

model=1: group 1 (coded as "green") consists of 50 observations from a bivariate normal with mean (0,0) and variance-covariance matrix the identity; group 2 (coded as "red") consists of 50 observations from a bivariate normal with mean (μ,μ) and variance-covariance matrix the identity.
model=2: group 1 (coded as "green") consists of 50 observations from U[0,1]² such that x1²+x2² < 1; group 2 (coded as "red") consists of 50 observations from U[0,2] × U[-2,2] such that x1²+x2² > 0.9.
model=3: group 1 (coded as "green") consists of 33 observations from a bivariate normal with mean (0,0) and variance-covariance matrix the identity; group 2 (coded as "red") consists of 34 observations from a bivariate normal with mean (μ,μ) and variance-covariance matrix the identity; group 3 (coded as "blue") consists of 33 observations from a bivariate normal with mean (2μ,2μ) and variance-covariance matrix the identity.

Examples of these are drawn in

class.fun(model=j)

where j is 1,2 or 3.

Note the use of eqscplot instead of plot. This routine draws the scatterplot with equal scales, a good idea if the variables are measured on the same scale.
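
class.fun is a course function whose code is not listed here. As a rough idea of what it does for model 1, here is a minimal sketch (the seed and the choice μ = 2 are arbitrary assumptions) that generates such data and plots them with eqscplot; the later sketches in these notes keep using these x1, x2 and group vectors:

# minimal sketch (not the course's class.fun): simulate model 1 data
# with mu = 2 and draw the scatterplot with equal scales
library(MASS)                                 # provides eqscplot
set.seed(111)
mu <- 2
x1 <- c(rnorm(50, 0), rnorm(50, mu))          # first coordinate
x2 <- c(rnorm(50, 0), rnorm(50, mu))          # second coordinate
group <- rep(c("green", "red"), each = 50)    # true group labels
eqscplot(x1, x2, col = group, pch = 19)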

Our basic problem is to find a "rule" that assigns a class label to each point in R². In other words, if we receive a 101st observation (x1,x2) with the color missing, we want to be able to assign it to one of the groups.

One approach to this problem is to use linear regression! For this we code a response variable y as 0 for "green" and 1 for "red" if there are two groups (models 1 and 2), or as 0, 1 and 2 if there are three groups (model 3). Then we run the linear regression of y on x1 and x2.
Finally, for models 1 and 2 we assign a point (x1,x2) to "green" if its predicted response is < 0.5 and to "red" otherwise; for model 3 we assign it to "green" if the predicted response is < 2/3, to "blue" if it is > 4/3, and to "red" otherwise.
In

class.fun(j,"lm")

we implement this idea, and use it to classify all the points (x1,x2) on a fine grid, plotting each with its classified color.

We can even find a decision boundary: those are the points (x1,x2) with 0.5 = β0 + β1x1 + β2x2, or x2 = (0.5 - β0 - β1x1)/β2. In the case of 3 groups there are two decision boundaries, given by x2 = (2/3 - β0 - β1x1)/β2 and x2 = (4/3 - β0 - β1x1)/β2.
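
class.fun itself is not reproduced here either, but the regression idea for two groups can be sketched as follows, continuing the simulated data from above (the grid limits and plotting details are assumptions, not taken from class.fun):

# sketch of the linear-regression classifier for two groups
y <- ifelse(group == "red", 1, 0)             # code the response as 0/1
fit <- lm(y ~ x1 + x2)

# classify every point on a fine grid and plot it in its predicted color
grid <- expand.grid(x1 = seq(-3, 5, length = 100),
                    x2 = seq(-3, 5, length = 100))
yhat <- predict(fit, newdata = grid)
eqscplot(grid$x1, grid$x2, col = ifelse(yhat < 0.5, "green", "red"),
         pch = ".", xlab = "x1", ylab = "x2")
points(x1, x2, col = group, pch = 19)

# decision boundary: 0.5 = b0 + b1*x1 + b2*x2
b <- coef(fit)
abline(a = (0.5 - b[1])/b[3], b = -b[2]/b[3], lwd = 2)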

One measure of how well a classifier performs is the misclassification rate, that is, the proportion of observations in the dataset that are wrongly classified. This is also calculated in class.fun(j,"lm").
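
For two groups the misclassification rate is just the proportion of training points that end up with the wrong color; continuing the sketch above:

pred.class <- ifelse(fitted(fit) < 0.5, "green", "red")  # predicted colors
mean(pred.class != group)                                # misclassification rate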

How good is this method? It works quite well for models 1 and 3, which are what we call linearly separable, but not so well for model 2.

Here is a completely different but very simple idea: for a new point (x1,x2) find the point in the dataset that is closest, and classify (x1,x2) the same! This is called the 1-nearest neighbor classifier. It is implemented by the function knn, part of the "class" library. We run it in

class.fun(j,"knn")

Clearly this method has a much higher variance than the linear regression idea above. On the other hand, the misclassification rate is always 0!

We can generalize the method easily as follows: instead of just considering the nearest neighbor, let's find the k nearest neighbors and assign the group by majority rule; that is, if k=3 we assign the group that two (or all three) of the neighbors belong to. This is done in

class.fun(j,"knn",k=3)

Now the method seems a lot less "ragged", but notice that the misclassification rate is no longer 0.
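
class.fun again is not shown here, but the nearest-neighbor step itself is a single call to knn; here is a sketch on the simulated data and grid from before (k = 1 gives the 1-nearest-neighbor rule, k = 3 the majority-vote version):

# sketch of k-nearest-neighbor classification of the grid
library(class)
train <- cbind(x1, x2)                          # training coordinates
pred <- knn(train, test = grid, cl = factor(group), k = 3)
eqscplot(grid$x1, grid$x2, col = as.character(pred), pch = ".",
         xlab = "x1", ylab = "x2")
points(x1, x2, col = group, pch = 19)
mean(knn(train, train, cl = factor(group), k = 3) != group)  # training mcr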

The data in our examples are nicely continuous, so "distance" simply means Euclidean distance. The method extends to more complicated data types, though, as long as we have a measure of distance. (How far is "male" from "female"?)

An immediate question is: how should we choose k? Clearly the misclassification rate is no good for this, because it would always lead us to pick k=1.

Most of the standard methods in discrimination and classification (and there are many) are variations of these two basic ideas. 1-nearest neighbor is probably the most commonly used method for problems with just a few predictors (2 or 3 say). Here are a few ideas for enhancing these basic methods:

• Kernel methods use weights that decrease smoothly to zero with distance from the target point, rather than the 0-1 weights used in k-nearest neighbors (see the sketch after this list).
• In high dimensional problems the distance kernels are modified to emphasize some variables more than others.
• Local regression fits linear models by locally weighted least squares.
• Linear models fit to a basis expansion of the original inputs allow arbitrarily complex models. (polynomial regression, for example)
• Projection pursuit regression and neural networks consist of sums of non-linearly transformed linear models.
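
As a concrete illustration of the first of these ideas (and only that one), here is a small sketch of a kernel-weighted classifier: every training point votes for its class with a weight that decays smoothly with its distance from the target point. The Gaussian kernel and the bandwidth h = 0.5 are arbitrary choices, not part of the course functions.

# kernel-weighted classifier: weight each training point by
# exp(-d^2/(2 h^2)), where d is its distance to the new point x0,
# and pick the class with the largest total weight
kernel.classify <- function(x0, train, cl, h = 0.5) {
  d2 <- colSums((t(train) - x0)^2)         # squared distances to x0
  w  <- exp(-d2/(2*h^2))                   # Gaussian kernel weights
  names(which.max(tapply(w, cl, sum)))     # class with the largest vote
}
kernel.classify(c(1, 1), train, group)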

Let's look a little more at model 2. Here clearly a straight line can't do a good job separating the two groups. But how about adding some "quadratic" terms?

class.fun(j,"quadratic")

again uses least squares regression, but now we fit a model including the terms x1², x2² and x1×x2. This already does a better job!

Notice that here we have a quadratic function of two variables: f(x1,x2) = β0 + β1x1 + β2x2 + β3x1² + β4x2² + β5x1×x2.
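
In R such a fit is the same lm call as before with the extra terms added; a sketch, run here on the same simulated data and grid as the earlier sketches:

# sketch of the quadratic regression classifier
fit2 <- lm(y ~ x1 + x2 + I(x1^2) + I(x2^2) + I(x1*x2))
yhat2 <- predict(fit2, newdata = grid)
eqscplot(grid$x1, grid$x2, col = ifelse(yhat2 < 0.5, "green", "red"),
         pch = ".", xlab = "x1", ylab = "x2")
points(x1, x2, col = group, pch = 19)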

How about doing local regression instead of polynomial regression? Again, no problem, see

class.fun(j,"loess")

Cross-Validation and Misclassification Rate

As we saw for the 1-nearest neighbor method, the misclassification rate as defined above has some problems. More generally, it will always give a number that is biased downwards, because we evaluate the performance of a method on the same data that we used for fitting, so the method is optimized for the data at hand, not for the underlying population. We can get a better performance measure by combining the misclassification rate with cross-validation:

1) fit the method on the data without observation i
2) use the fit to predict the class of observation i
3) check whether the classification is correct
4) repeat steps 1-3 for all observations i=1,...,n
5) find the mean misclassification rate

This is implemented in

class1.fun(1,"lm")

For predictions from the loess method we need to add the argument control=loess.control(surface="direct") in order to get extrapolations, which predict applied to a loess object does not do by default.
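
Continuing the sketch above, the loess version of the loop would then be (span = 0.75 is again an assumption):

# same leave-one-out loop for the loess classifier; surface = "direct"
# is what allows predict() to extrapolate to the left-out observation
for (i in 1:n) {
  fit.i <- loess(y ~ x1 + x2, data = dat[-i, ], span = 0.75,
                 control = loess.control(surface = "direct"))
  cv.class[i] <- ifelse(predict(fit.i, newdata = dat[i, ]) < 0.5,
                        "green", "red")
}
mean(cv.class != group)                                 # cv mcr for loess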

How does the ordinary mcr differ from the cv mcr? Here is the result of one small simulation study. We run class1.fun for model 1 and each of the four methods, using m=1, k=5 and span=0.75. We repeat this 100 times and find the mean mcrs:
 Method      mcr   cv mcr
 lm          23%   25%
 knn         20%   29%
 quadratic   24%   26%
 loess       21%   26%