Principal Components

Say we have a multiple regression problem with a response variable y and predictors x1, .., xk. The idea of principal components is this: try to find a few linear combinations of x1, .., xk that explain the variation in y. That is, find z1=∑αi1xi, .., zm=∑αimxi and do a regression of y on z1, .., zm instead of x1, .., xk.

This is a useful idea if the following happens:

• m is much smaller than k
• zj=∑αijxi can be interpreted in some way, that is, we can understand the meaning of zj.

Using matrix notation we can write Z=XA, where X is the data matrix and A is the k by m matrix of α's.

How should we choose A? The idea of principal components is as follows: Choose A such that

• the variables z1, .., zm are uncorrelated (Z'Z is a diagonal matrix)
• z1 has the largest possible variance, then z2 has the second largest (subject to cor(z1,z2)=0), and so on.

So we want to find a matrix A such that Z'Z = (XA)'(XA) = A'(X'X)A = D, where D is a diagonal matrix.

Now X'X is a symmetric k by k matrix. It could be singular, but let's assume for the moment that it is not. Then using linear algebra it can be shown that the columns of A are the eigenvectors of the matrix X'X.
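
As a quick check of this claim, here is a minimal sketch (not part of the course functions; the names are made up for illustration) that builds A from the eigenvectors of X'X and verifies that Z'Z comes out diagonal:

set.seed(1)
X <- matrix(rnorm(100 * 3), ncol = 3)   # illustrative data matrix, n=100, k=3
A <- eigen(crossprod(X))$vectors        # columns are the eigenvectors of X'X
Z <- X %*% A                            # the new variables z1, z2, z3
round(crossprod(Z), 6)                  # Z'Z is (numerically) diagonal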

Here is a simple illustration: in

princomp.fun()

we have a sample of 100 observations from a bivariate normal distribution with means (0,0), variances (1,1) and correlation 0.9. First we plot X2 vs. X1, then we find the matrix X'X and its eigenvectors (using the R function eigen). Next we find the new variables Z1 and Z2 as linear combinations of X with the eigenvectors as coefficients. Plotting these we see two things (a sketch of these steps follows the list below):

• Z1 and Z2 are uncorrelated (here, they are of course independent)
• the variance of Z1 is much larger than the variance of Z2.
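
The code of princomp.fun() is not reproduced here, but a sketch along these lines would carry out the illustration (the use of MASS::mvrnorm and the variable names are my assumptions):

library(MASS)                                    # for mvrnorm (assumed here)
set.seed(111)
Sigma <- matrix(c(1, 0.9, 0.9, 1), 2, 2)         # variances (1,1), correlation 0.9
X <- mvrnorm(100, mu = c(0, 0), Sigma = Sigma)   # 100 observations
plot(X[, 1], X[, 2], xlab = "X1", ylab = "X2")   # scatterplot of X2 vs. X1
E <- eigen(crossprod(X))$vectors                 # eigenvectors of X'X
Z <- X %*% E                                     # the new variables Z1 and Z2
cor(Z[, 1], Z[, 2])                              # close to 0
var(Z[, 1]); var(Z[, 2])                         # var(Z1) much larger than var(Z2)
plot(Z[, 1], Z[, 2], xlab = "Z1", ylab = "Z2")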

There is another, geometric way to see what principal components are: again we draw the scatterplot of X2 vs. X1, but without the axes. The first principal component transformation is z1=e1,1x1+e1,2x2. Setting z1=0 describes a line through the origin with slope -e1,1/e1,2 in the X2 vs. X1 plot, which we add to the plot. We do the same with the second principal component transformation. Now we can see that the transformation is really a change of coordinate system, from (X1,X2) to (Z1,Z2).
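
A sketch of this geometric picture, using the same simulated data as above (again just an illustration, not the actual princomp.fun code):

library(MASS)
set.seed(111)
X <- mvrnorm(100, mu = c(0, 0), Sigma = matrix(c(1, 0.9, 0.9, 1), 2, 2))
E <- eigen(crossprod(X))$vectors
plot(X[, 1], X[, 2], xlab = "X1", ylab = "X2", axes = FALSE)   # plot without axes
abline(0, -E[1, 1] / E[2, 1])   # line z1 = 0, from the first PC transformation
abline(0, -E[1, 2] / E[2, 2])   # line z2 = 0, from the second PC transformation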

In practice we can use the R function princomp to carry out the calculations for us, which we also do in princomp.fun(). Looking at the summary of the princomp object we see that the first principal component explains about 95% of the variation in (X1,X2). One of the components of the pc object is called "loadings" and we see that these are just the eigenvectors.
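
A sketch of the princomp call on the same simulated data (the names here are assumptions, not the actual contents of princomp.fun):

library(MASS)
set.seed(111)
X <- mvrnorm(100, mu = c(0, 0), Sigma = matrix(c(1, 0.9, 0.9, 1), 2, 2))
pc <- princomp(X)
summary(pc)       # proportion of variance explained by each component
pc$loadings       # essentially the eigenvectors found above (up to sign)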

Example: Consider the data in testscores. This is artificial data meant to represent the test scores of 25 mathematics graduate students in their qualifying exams. The differential geometry and complex analysis exams were closed book, whereas the others were open book. In

testscores.fun()

we find the principal components of the test scores. Looking at the summary we see that the first PC accounts for 82% of the variation in the data, and the first two account for 90%.

Let's have a look at the loadings: the first one is (0.6, 0.36, 0.3, 0.39, 0.52) and amounts to a (weighted) average over all the exams. The second one is (-0.67, -0.25, 0.21, 0.34, 0.57). Notice that here the first two entries are negative and the other three are positive. But the first two exams were the closed book ones and the others were open book!
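
testscores.fun() is not listed here either; assuming testscores is a 25 by 5 matrix (or data frame) with one column per exam, the calculation is essentially:

pc <- princomp(testscores)   # testscores assumed to hold the 25 x 5 exam scores
summary(pc)                  # first PC about 82%, first two about 90%
pc$loadings[, 1:2]           # 1st loading: roughly an average over all exams;
                             # 2nd: closed book vs. open book contrast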

Example: Next we have a look at the States dataset. We are interested in finding relationships between the 8 variables. For this we carry out a PC analysis. Note, though, that the scales of the variables are very different (Illiteracy values are on the order of 1 or 2, whereas Area is on the order of 10000). In such a case it is better to base the PCs on the correlation matrix rather than the covariance matrix. We do this by adding the argument cor=T to the princomp call, see

states.fun(1)
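
states.fun is not shown here; a sketch of what such a call might do, assuming the States data is available as an object called states (for example R's built-in state.x77):

states <- state.x77                  # assumption: the States data is state.x77
pc <- princomp(states, cor = TRUE)   # base the PCs on the correlation matrix
summary(pc)                          # proportion of variation explained by each PC
pc$loadings[, 1:2]                   # loadings of the first two PCs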

It appears that the first PC contrasts "good" variables such as income and life expectancy with "bad" ones such as murder and illiteracy. This explains about 45% of the variation. The second PC contrasts 'Frost' with all the other variables. It accounts for an additional 20%, but it seems difficult to say exactly what this contrast means.

One important question is how many PCs are needed to "reasonably" explain the data? One useful graph here is the screeplot, given in

states.fun(2)

It is a simple bar chart of the variation explained by each PC. One popular method is to include enough PCs to cover at least 90% of the variation. In the states data this means 5.
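
A sketch of the corresponding commands, again under the assumption that the States data is state.x77:

pc <- princomp(state.x77, cor = TRUE)   # PCs based on the correlation matrix
screeplot(pc)                           # bar chart of the variances of the PCs
cumsum(pc$sdev^2) / sum(pc$sdev^2)      # cumulative proportion of variation;
                                        # the 90% rule keeps the first 5 PCs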