Let's take a moment to digress and assume that Y is a real-valued output. Then by far the most common error function is squared error loss
L(Y, f(X)) = (Y − f(X))^2
This leads to a criterion for choosing f, the expected prediction error:
EPE(f) = E[L(Y, f(X))] = E(Y − f(X))^2 = ∫ (y − f(x))^2 P(dx, dy)
Using a well-known formula for conditional expectations we have
EPE(f) = E_X E_{Y|X}([Y − f(X)]^2 | X)
and so it suffices to minimize EPE pointwise:
f(x) = argmin_c E_{Y|X}([Y − c]^2 | X = x)
The solution is f(x) = E(Y|X=x), the conditional expectation. So the best prediction for Y at any point X=x is the conditional mean (if we agree on squared error loss).
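As a quick numerical sanity check (a simulation sketch of our own, not part of the derivation), we can verify that the constant c minimizing the conditional squared error is the conditional mean. Here we simply assume Y | X = x ~ N(4, 1):

set.seed(1)
y <- 4 + rnorm(1e5)              # draws from the assumed conditional distribution of Y given X = x
cs <- seq(2, 6, by = 0.01)       # grid of candidate constants c
mse <- sapply(cs, function(c) mean((y - c)^2))
cs[which.min(mse)]               # the minimizing c, close to 4 = E(Y | X = x)
mean(y)                          # the sample conditional mean, also close to 4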
Now let's assume that f(x) is approximately linear, that is, f(x) = x^T β. Plugging this linear model for f(x) into EPE and differentiating with respect to β, we can solve for β theoretically, and we find
β = [E(X X^T)]^{-1} E(X Y)
so we have just found a decision-theoretic justification for least-squares regression.
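A small sketch in R (with a simulated design and a made-up true β, purely for illustration) shows that the sample analogue of this formula is exactly the least-squares estimate that lm computes:

set.seed(2)
n <- 1000
X <- cbind(1, rnorm(n), rnorm(n))              # design matrix with an intercept column
y <- drop(X %*% c(1, 2, -1)) + rnorm(n)        # assumed true beta = (1, 2, -1)
beta.hat <- solve(t(X) %*% X) %*% t(X) %*% y   # sample version of [E(X X^T)]^{-1} E(X Y)
cbind(beta.hat, coef(lm(y ~ X[, 2] + X[, 3]))) # the two columns agree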
Let's get back to the discrimination problem. Here the "response" Y is a discrete variable, namely Y = k if the observation belongs to the kth class, k = 1, ..., K. In such a situation the loss function is a matrix. Say we observe X = x and estimate the class to be T(x) ∈ {1, ..., K}. The loss function is then a K by K matrix L, which is 0 on the diagonal (no loss if our assignment is correct) and nonnegative elsewhere; L_{ij} is the price paid if we classify an observation from class i wrongly as class j. Quite common is to use zero-one loss, where L_{ij} = 1 if i ≠ j. Now the expected prediction error is
EPE = E[L(Y, T(X))]
As before we can condition and get
EPE = E_X ∑_{k=1}^{K} L(k, T(X)) P(k | X)
and again we can minimize pointwise to get
T(x) = argmin_{y ∈ {1,...,K}} ∑_{k=1}^{K} L(k, y) P(k | X = x)
If we use 0-1 loss this simplifies to
T(x) = argmin_{y ∈ {1,...,K}} [1 − P(y | X = x)]
or
T(x) = k  if  P(k | X = x) = max_y P(y | X = x)
This seems very reasonable: assign an observation to class k if class k has the highest posterior probability given X = x. This rule is called the Bayes classifier. The misclassification rate of the Bayes classifier is called the Bayes rate, and it is the best achievable rate.
Note that the k-nearest neighbor method directly approximates this solution, where we estimate the unknown posterior probability by the majority vote of the neighboring observations.
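To illustrate (a hedged sketch on a simulated two-class Gaussian toy problem of our own, not one of the course examples), knn from the class package comes close to the Bayes classifier, whose rule and rate we can compute exactly because we know the generating distribution:

library(class)                                 # provides knn()
set.seed(3)
n <- 2000
y <- rbinom(n, 1, 0.5)
x <- matrix(rnorm(2 * n), n, 2) + 1.5 * y      # class 1 is shifted to mean (1.5, 1.5)
train <- 1:1000; test <- 1001:2000
yhat <- knn(x[train, ], x[test, ], cl = factor(y[train]), k = 15)
mean(yhat != y[test])                          # k-NN test error
bayes <- as.integer(rowSums(x[test, ]) > 1.5)  # Bayes rule here: P(1|x) > 1/2 iff x1 + x2 > 1.5
mean(bayes != y[test])                         # estimate of the Bayes rate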
Suppose we have a two-class problem and we use the linear regression approach with a 0/1 dummy response variable Y, fit under squared-error loss. Then
f(X) = E(Y|X) = P(Y=1|X)
because expectations of indicator variables are probabilities. And so again we see that this method also tries to approximate the Bayes classifier!
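A brief simulation sketch (with a posterior that we assume to be exactly linear, so the linear model is correctly specified): regressing the 0/1 dummy on x recovers P(Y = 1 | X = x).

set.seed(4)
n <- 5000
x <- runif(n)
p <- 0.2 + 0.6 * x                 # assumed true posterior P(Y = 1 | X = x)
y <- rbinom(n, 1, p)               # the 0/1 dummy response
coef(lm(y ~ x))                    # close to the true intercept 0.2 and slope 0.6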
Here is another example: consider N data points uniformly distributed in a p-dimensional unit ball centered at the origin, and suppose we consider a nearest-neighbor estimate at the origin. The median distance from the origin to the closest data point is given by
d(N, p) = (1 − (1/2)^{1/N})^{1/p}
Now d(500, 10) ≈ 0.52, so most data points are closer to the boundary of the sample space than to any other data point!
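Both the formula and this number are easy to check by simulation, using the fact that the distance from the origin of a point uniform in the unit p-ball is distributed as U^{1/p} for U uniform on (0, 1):

d <- function(N, p) (1 - (1/2)^(1 / N))^(1 / p)        # the median-distance formula
d(500, 10)                                             # about 0.52
set.seed(5)
nearest <- replicate(2000, min(runif(500)^(1 / 10)))   # nearest of N = 500 points, p = 10
median(nearest)                                        # agrees with d(500, 10)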
One consequence of all this is that classification in high dimensions is very difficult; in particular, nearest-neighbor methods are no longer useful.
Linear discriminant analysis (LDA) models each class density as a multivariate Gaussian and assumes that all the classes have the same variance-covariance matrix, Σ_k = Σ. In comparing two classes k and l it is sufficient to look at the log-odds ratio, and we see (writing π_k and μ_k for the prior probability and mean vector of class k)
log[P(k | X = x) / P(l | X = x)] = log(π_k / π_l) − (1/2)(μ_k + μ_l)^T Σ^{-1} (μ_k − μ_l) + x^T Σ^{-1} (μ_k − μ_l),
a linear equation in x. This linear log-odds function implies that the decision boundary between classes k and l is linear in x; in p dimensions it is an affine plane (a hyperplane which need not pass through the origin).
If there are only two classes, this can be shown to be the same method as our "lm" approach. If there are more than two classes, the two methods are not the same.
If the Σ_k are not all the same, the math becomes more complicated but is still doable; the resulting method, quadratic discriminant analysis (QDA), is similar to our "quadratic" method.
R has both methods implemented (lda and qda in the MASS package), and we use them in
class2.fun(1,"lda")
on our three examples.
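class2.fun is a wrapper written for this course and is not reproduced here; a minimal direct use of the underlying routines, sketched on the standard iris data rather than our three examples, might look like:

library(MASS)                                  # lda() and qda() live here
fit.lda <- lda(Species ~ ., data = iris)
fit.qda <- qda(Species ~ ., data = iris)       # drops the common-covariance assumption
mean(predict(fit.lda)$class != iris$Species)   # training misclassification rate, LDA
mean(predict(fit.qda)$class != iris$Species)   # and QDA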
In R we again use the routine rpart, with the argument method="class". This is implemented in
class3.fun(1)
for our three examples.
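class3.fun is likewise a course wrapper; the underlying call is along these lines (again sketched on iris rather than on our examples):

library(rpart)
fit <- rpart(Species ~ ., data = iris, method = "class")
fit                                                  # prints the fitted classification tree
mean(predict(fit, type = "class") != iris$Species)   # training misclassification rate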