Principal Components Analysis

As we saw before, highly correlated predictors can cause difficulties in a regression analysis. We will now study a way to deal with this.

The idea of principal components is this: find a few linear combinations of \(x_1, .., x_k\) that explain the variation in y. That is, find \[ z_1=\sum \alpha_{i1} x_i \\ ... \\ z_m=\sum \alpha_{im} x_i \\ \] and do a regression of y on \(z_1, .., z_m\) instead of \(x_1, .., x_k\).

This is a useful idea if the following happens:

  • m is much smaller than k
  • \(z_i\) and \(z_j\) are uncorrelated
  • \(z_j=\sum \alpha_{ij} x_i\) can be interpreted in some way, that is we can understand the meaning of \(z_j\)

Using matrix notation we can write \(Z=XA\) where X is the n by k data matrix (one row per observation, one column per predictor) and A is the k by m matrix of \(\alpha\)'s.

How should we choose A? The idea of principal components is as follows: Choose A such that

  • the variables \(z_1, .., z_m\) are uncorrelated (Z’Z is a diagonal matrix)
  • \(z_1\) has the largest possible variance, then \(z_2\) has the second largest (subject to cor(\(z_1\), \(z_2\))=0), and so on.

So we want to find a matrix A such that Z'Z = (XA)'(XA) = A'(X'X)A = D for some diagonal matrix D.

Now X'X is a symmetric k by k matrix. It could be singular, but let's assume for the moment that it is not. Then using linear algebra it can be shown that the columns of A are the eigenvectors of the matrix X'X.
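Here is a quick numerical check of this fact, using a small made-up symmetric matrix S (an illustration only, not data from this section): with A the matrix of eigenvectors returned by eigen, A'SA is indeed diagonal, with the eigenvalues on the diagonal.

# made-up symmetric 3 by 3 matrix, for illustration only
S <- matrix(c(2, 1, 0,
              1, 3, 1,
              0, 1, 2), 3, 3)
A <- eigen(S)$vectors
round(t(A) %*% S %*% A, 10)  # diagonal, with the eigenvalues on the diagonal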

Let's see how this works on an artificial example. We have a sample of 100 observations from a bivariate normal distribution with means (0,0), variances (1, 1) and correlation 0.9. First we plot x2 vs x1, then we find the matrix X'X and its eigenvectors (using the R function eigen). Next we compute the new variables z1 and z2 as the linear combinations of x1 and x2 given by the columns of the eigenvector matrix.

library(mvtnorm)
x <- rmvnorm(100, mean = c(0, 0), 
             sigma = matrix(c(1, 0.9, 0.9, 1), 2, 2))
xyr <- range(x)
plot(x, xlab = "X1", ylab = "X2", xlim = xyr, ylim = xyr)

y <- t(x) %*% x
print(y)
##          [,1]     [,2]
## [1,] 113.9399 101.5266
## [2,] 101.5266 108.3705
E <- eigen(y)
print(E)
## eigen() decomposition
## $values
## [1] 212.720001   9.590372
## 
## $vectors
##            [,1]       [,2]
## [1,] -0.7167349  0.6973458
## [2,] -0.6973458 -0.7167349
# the eigenvectors are the columns of E$vectors
z1 <- E$vectors[1, 1] * x[, 1] + E$vectors[2, 1] * x[, 2]
z2 <- E$vectors[1, 2] * x[, 1] + E$vectors[2, 2] * x[, 2]
plot(z1, z2, xlab = "z1", ylab = "z2", ylim = range(z1))

Notice

  • z1 and z2 are uncorrelated (here, they are of course independent)
  • the variance of z1 is much larger than the variance of z2 (see the quick check below).
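Both claims are easy to verify; the exact numbers change with every simulated sample, but the pattern does not:

round(cor(z1, z2), 3)  # essentially 0
c(var(z1), var(z2))    # the first is much larger than the second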

There is another, geometric way to see what principal components are: again we draw the scatterplot of x2 vs. x1, but without the axes. The first principal component axis is the line through the origin in the direction of the first eigenvector \((e_{11}, e_{21})\), that is, the line with slope \(e_{21}/e_{11}\), which we add to the plot. We do the same with the second eigenvector.

plot(x, xlab = "x1", ylab = "x2", 
     xlim = xyr, ylim = xyr, axes = FALSE)
# the new coordinate axes point in the directions of the eigenvectors
abline(0, E$vectors[2, 1]/E$vectors[1, 1])
abline(0, E$vectors[2, 2]/E$vectors[1, 2])

Now we can see that the transformation is really a change of coordinate system, from x1, x2 to z1, z2.
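In fact the eigenvector matrix of a symmetric matrix is orthogonal, so the change of coordinates is simply a rotation (possibly combined with a reflection). A quick check:

round(t(E$vectors) %*% E$vectors, 10)  # identity matrix
det(E$vectors)                         # +1 or -1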

In practice we can use the R function princomp to carry out the calculations for us.

pc <- princomp(x)
print(summary(pc))
## Importance of components:
##                           Comp.1     Comp.2
## Standard deviation     1.4529260 0.30876903
## Proportion of Variance 0.9567888 0.04321123
## Cumulative Proportion  0.9567888 1.00000000

We see that the first principal component explains about 96% of the variation in (x1, x2).
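This proportion is essentially the first eigenvalue divided by the sum of the eigenvalues, here 212.72/(212.72 + 9.59) ≈ 0.957; the tiny discrepancy comes from the fact that princomp centers the columns of x first.

E$values[1]/sum(E$values)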

One of the components of the pc object is called “loadings”

pc$loadings
## 
## Loadings:
##      Comp.1 Comp.2
## [1,]  0.718  0.696
## [2,]  0.696 -0.718
## 
##                Comp.1 Comp.2
## SS loadings       1.0    1.0
## Proportion Var    0.5    0.5
## Cumulative Var    0.5    1.0

and we see that these are just the eigenvectors (up to sign, which is arbitrary).
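We can put them side by side; the columns agree up to sign and up to small numerical differences caused by the centering:

cbind(E$vectors, unclass(pc$loadings))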

Example: Scores on math tests

Consider the data in testscores. This is an artificial data set meant to represent the test scores of 25 mathematics graduate students on their qualifying exams. The differential geometry and complex analysis exams were closed book whereas the others were open book.

kable.nice(testscores)
diffgeom complex algebra reals statistics
36 58 43 36 37
62 54 50 46 52
31 42 41 40 29
76 78 69 66 81
46 56 52 56 40
12 42 38 38 28
39 46 51 54 41
30 51 54 52 32
22 32 43 28 22
9 40 47 30 24
32 49 54 37 52
40 62 51 40 49
64 75 70 66 63
36 38 58 62 62
24 46 44 55 49
50 50 54 52 51
42 42 52 38 50
2 35 32 22 16
56 53 42 40 32
59 72 70 66 62
28 50 50 42 63
19 46 49 40 30
36 56 56 54 52
54 57 59 62 58
14 35 38 29 20
test.pc <- princomp(testscores)
summary(test.pc, loadings = TRUE)
## Importance of components:
##                            Comp.1     Comp.2     Comp.3     Comp.4
## Standard deviation     28.4896795 9.03547104 6.60095491 6.13358179
## Proportion of Variance  0.8212222 0.08260135 0.04408584 0.03806395
## Cumulative Proportion   0.8212222 0.90382353 0.94790936 0.98597332
##                            Comp.5
## Standard deviation     3.72335754
## Proportion of Variance 0.01402668
## Cumulative Proportion  1.00000000
## 
## Loadings:
##            Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## diffgeom    0.598  0.675  0.185  0.386       
## complex     0.361  0.245 -0.249 -0.829 -0.247
## algebra     0.302 -0.214 -0.211 -0.135  0.894
## reals       0.389 -0.338 -0.700  0.375 -0.321
## statistics  0.519 -0.570  0.607        -0.179

Looking at the summary we see that the first PC accounts for about 82% of the variation in the data, and the first two together account for about 90%.

Let's have a look at the loadings: the first one is (0.60, 0.36, 0.30, 0.39, 0.52) and amounts to a (weighted) average over all the exams, so the first PC measures a student's overall performance. The second one is (0.68, 0.25, -0.21, -0.34, -0.57): the first two entries are positive and the other three negative (the signs themselves are arbitrary; only the contrast matters). But the first two were the closed book exams and the others were open book!
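One way to make this interpretation concrete is to plot the students' scores on the first two components, which princomp has already computed for us. This plot is a sketch that is not part of the original output; the axis labels reflect our reading of the loadings.

plot(test.pc$scores[, 1], test.pc$scores[, 2],
     xlab = "PC 1 (overall performance)",
     ylab = "PC 2 (closed book vs. open book)")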

Example: States

The data set state.x77 has information on the 50 states of the United States of America.

  • ‘Population’: population estimate as of July 1, 1975
  • ‘Income’: per capita income (1974)
  • ‘Illiteracy’: illiteracy (1970, percent of population)
  • ‘Life Exp’: life expectancy in years (1969-71)
  • ‘Murder’: murder and non-negligent manslaughter rate per 100,000 population (1976)
  • ‘HS Grad’: percent high-school graduates (1970)
  • ‘Frost’: mean number of days with minimum temperature below freezing (1931-1960) in capital or large city
  • ‘Area’: land area in square miles

Source: U.S. Department of Commerce, Bureau of the Census (1977) Statistical Abstract of the United States.

kable.nice(head(state.x77))
Population Income Illiteracy Life Exp Murder HS Grad Frost Area
Alabama 3615 3624 2.1 69.05 15.1 41.3 20 50708
Alaska 365 6315 1.5 69.31 11.3 66.7 152 566432
Arizona 2212 4530 1.8 70.55 7.8 58.1 15 113417
Arkansas 2110 3378 1.9 70.66 10.1 39.9 65 51945
California 21198 5114 1.1 71.71 10.3 62.6 20 156361
Colorado 2541 4884 0.7 72.06 6.8 63.9 166 103766
state.pc <- princomp(state.x77, cor = TRUE)
summary(state.pc, loadings = TRUE)
## Importance of components:
##                           Comp.1    Comp.2    Comp.3     Comp.4     Comp.5
## Standard deviation     1.8970755 1.2774659 1.0544862 0.84113269 0.62019488
## Proportion of Variance 0.4498619 0.2039899 0.1389926 0.08843803 0.04808021
## Cumulative Proportion  0.4498619 0.6538519 0.7928445 0.88128252 0.92936273
##                            Comp.6    Comp.7     Comp.8
## Standard deviation     0.55449226 0.3800642 0.33643379
## Proportion of Variance 0.03843271 0.0180561 0.01414846
## Cumulative Proportion  0.96779544 0.9858515 1.00000000
## 
## Loadings:
##            Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8
## Population  0.126  0.411  0.656  0.409  0.406                0.219
## Income     -0.299  0.519  0.100        -0.638 -0.462              
## Illiteracy  0.468               -0.353        -0.387  0.620  0.339
## Life Exp   -0.412         0.360 -0.443  0.327 -0.219  0.256 -0.527
## Murder      0.444  0.307 -0.108  0.166 -0.128  0.325  0.295 -0.678
## HS Grad    -0.425  0.299        -0.232         0.645  0.393  0.307
## Frost      -0.357 -0.154 -0.387  0.619  0.217 -0.213  0.472       
## Area               0.588 -0.510 -0.201  0.499 -0.148 -0.286

It appears that the first PC contrasts "good" variables such as income and life expectancy with "bad" ones such as murder and illiteracy; it explains about 45% of the variation. The second PC contrasts ‘Frost’ with all the other variables; it accounts for an additional 20%, but it seems difficult to say exactly what this contrast means.

One important question is: how many PCs are needed to "reasonably" explain the data? One useful graph here is the screeplot:

plot(state.pc)

It is a simple barchart of the variance explained by each PC. One popular rule is to include enough PCs to cover at least 90% of the variation. In the states data this means 5, which seems a bit much.
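The same rule is easy to apply numerically, using the standard deviations stored in the princomp object (a small sketch):

# proportion of variance explained by each PC, and the cumulative proportion
props <- state.pc$sdev^2/sum(state.pc$sdev^2)
round(cumsum(props), 3)
# smallest number of PCs covering at least 90% of the variation
which(cumsum(props) >= 0.9)[1]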