As we saw before, highly correlated predictors can cause difficulties in a regression analysis. We will now study a way to deal with this.
The idea of principal components is this: find a few linear combinations of \(x_1, \dots, x_k\) that explain the variation in y. That is, find
\[
z_1=\sum_{i=1}^k \alpha_{i1} x_i \\
\vdots \\
z_m=\sum_{i=1}^k \alpha_{im} x_i
\]
and do a regression of y on \(z_1, \dots, z_m\) instead of \(x_1, \dots, x_k\).
This is a useful idea if a small number m of such combinations (m much smaller than k) already captures most of the variation, so that a few new predictors can replace many highly correlated ones.
Using matrix notation we can write \(Z=XA\), where X is the n by k data matrix and A is the k by m matrix of \(\alpha\)'s.
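For instance, if we already had a response y, a predictor matrix X and a coefficient matrix A, the regression step might look like this (X, A and y are placeholders here):
Z <- X %*% A       # new predictors z1, ..., zm as the columns of Z
fit <- lm(y ~ Z)   # regress y on the new predictors instead of the x's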
How should we choose A? The idea of principal components is as follows: choose A so that the new variables \(z_1, \dots, z_m\) are uncorrelated, that is, so that \(Z'Z\) is a diagonal matrix. So we want to find a matrix A such that
\[
Z'Z = (XA)'(XA) = A'(X'X)A = D
\]
where D is a diagonal matrix.
Now X'X is a symmetric k by k matrix. It could be singular, but let's assume for the moment that it is not. Then it can be shown (using linear algebra) that the columns of A are the eigenvectors of the matrix X'X.
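For example, with an arbitrary symmetric matrix standing in for X'X, we can check that the eigenvectors returned by eigen diagonalize it:
S <- matrix(c(2, 1, 1, 3), 2, 2)   # any symmetric matrix, standing in for X'X
A <- eigen(S)$vectors              # columns are the eigenvectors
round(t(A) %*% S %*% A, 10)        # diagonal, with the eigenvalues on the diagonal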
Let's see how this works in an artificial example. We have a sample of 100 observations from a bivariate normal distribution with means (0, 0), variances (1, 1) and correlation 0.9. First we plot x2 vs. x1, then we find the matrix X'X and its eigenvectors (using the R function eigen). Next we find the new variables z1 and z2 as linear combinations of x1 and x2, with the eigenvectors supplying the coefficients.
library(mvtnorm)
# sample of 100 observations from a bivariate normal with correlation 0.9
x <- rmvnorm(100, mean = c(0, 0),
  sigma = matrix(c(1, 0.9, 0.9, 1), 2, 2))
xyr <- range(x)
plot(x, xlab = "X1", ylab = "X2", xlim = xyr, ylim = xyr)
# the matrix X'X
y <- t(x) %*% x
print(y)
## [,1] [,2]
## [1,] 113.9399 101.5266
## [2,] 101.5266 108.3705
E <- eigen(y)
print(E)
## eigen() decomposition
## $values
## [1] 212.720001 9.590372
##
## $vectors
## [,1] [,2]
## [1,] -0.7167349 0.6973458
## [2,] -0.6973458 -0.7167349
# the columns of E$vectors are the eigenvectors, so the new variables are
z1 <- E$vectors[1, 1] * x[, 1] + E$vectors[2, 1] * x[, 2]
z2 <- E$vectors[1, 2] * x[, 1] + E$vectors[2, 2] * x[, 2]
plot(z1, z2, xlab = "z1", ylab = "z2", ylim = range(z1))
Notice that z1 and z2 are (essentially) uncorrelated, and that almost all of the variation is now in z1, while z2 varies very little.
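We can check this numerically:
round(cor(z1, z2), 3)   # (essentially) 0
c(var(z1), var(z2))     # z1 carries almost all of the variation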
There is another, geometric way to see what principal components are: again we draw the scatterplot of x2 vs. x1, but without the axes. The first principal component transformation is \(z_1=e_{11}x_1+e_{21}x_2\), where \((e_{11}, e_{21})\) is the first eigenvector. In the x2 vs. x1 plot this direction corresponds to a line through the origin with slope \(e_{21}/e_{11}\), which we add to the plot. We do the same with the second principal component, whose direction has slope \(e_{22}/e_{12}\).
plot(x, xlab = "x1", ylab = "x2",
  xlim = xyr, ylim = xyr, axes = FALSE)
# the new coordinate axes: lines through the origin in the
# directions of the two eigenvectors
abline(0, E$vectors[2, 1]/E$vectors[1, 1])
abline(0, E$vectors[2, 2]/E$vectors[1, 2])
Now we can see that the transformation is really a change of coordinate system, from x1, x2 to z1, z2.
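In matrix form the change of coordinates is a single multiplication, which reproduces z1 and z2 from above:
z <- x %*% E$vectors     # rotate the data into the new coordinate system
all.equal(z[, 1], z1)    # should be TRUE
all.equal(z[, 2], z2)    # should be TRUE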
In practice we can use the R function princomp to carry out the calculations for us.
pc <- princomp(x)
print(summary(pc))
## Importance of components:
## Comp.1 Comp.2
## Standard deviation 1.4529260 0.30876903
## Proportion of Variance 0.9567888 0.04321123
## Cumulative Proportion 0.9567888 1.00000000
We see that the first principal component explains about 96% of the variation in (x1, x2).
One of the components of the pc object is called “loadings”
pc$loadings
##
## Loadings:
## Comp.1 Comp.2
## [1,] 0.718 0.696
## [2,] 0.696 -0.718
##
## Comp.1 Comp.2
## SS loadings 1.0 1.0
## Proportion Var 0.5 0.5
## Cumulative Var 0.5 1.0
and we see that these are just the eigenvectors found above, up to sign (the overall sign of a principal component is arbitrary).
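We can put the two side by side (small numerical differences remain because princomp centers the data first):
cbind(unclass(pc$loadings), E$vectors)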
Consider the data in testscores. This is artificial data meant to represent the test scores of 25 mathematics graduate students on their qualifying exams. The differential geometry and complex analysis exams were closed book whereas the others were open book.
kable.nice(testscores)
diffgeom | complex | algebra | reals | statistics |
---|---|---|---|---|
36 | 58 | 43 | 36 | 37 |
62 | 54 | 50 | 46 | 52 |
31 | 42 | 41 | 40 | 29 |
76 | 78 | 69 | 66 | 81 |
46 | 56 | 52 | 56 | 40 |
12 | 42 | 38 | 38 | 28 |
39 | 46 | 51 | 54 | 41 |
30 | 51 | 54 | 52 | 32 |
22 | 32 | 43 | 28 | 22 |
9 | 40 | 47 | 30 | 24 |
32 | 49 | 54 | 37 | 52 |
40 | 62 | 51 | 40 | 49 |
64 | 75 | 70 | 66 | 63 |
36 | 38 | 58 | 62 | 62 |
24 | 46 | 44 | 55 | 49 |
50 | 50 | 54 | 52 | 51 |
42 | 42 | 52 | 38 | 50 |
2 | 35 | 32 | 22 | 16 |
56 | 53 | 42 | 40 | 32 |
59 | 72 | 70 | 66 | 62 |
28 | 50 | 50 | 42 | 63 |
19 | 46 | 49 | 40 | 30 |
36 | 56 | 56 | 54 | 52 |
54 | 57 | 59 | 62 | 58 |
14 | 35 | 38 | 29 | 20 |
test.pc <- princomp(testscores)
summary(test.pc, loadings = TRUE)
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4
## Standard deviation 28.4896795 9.03547104 6.60095491 6.13358179
## Proportion of Variance 0.8212222 0.08260135 0.04408584 0.03806395
## Cumulative Proportion 0.8212222 0.90382353 0.94790936 0.98597332
## Comp.5
## Standard deviation 3.72335754
## Proportion of Variance 0.01402668
## Cumulative Proportion 1.00000000
##
## Loadings:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## diffgeom 0.598 0.675 0.185 0.386
## complex 0.361 0.245 -0.249 -0.829 -0.247
## algebra 0.302 -0.214 -0.211 -0.135 0.894
## reals 0.389 -0.338 -0.700 0.375 -0.321
## statistics 0.519 -0.570 0.607 -0.179
Looking at the summary we see that the first PC accounts for 82% of the variation in the data, and the first two together account for 90%.
Let's have a look at the loadings: the first one is (0.60, 0.36, 0.30, 0.39, 0.52) and amounts to (roughly) an average over all the exams, so the first PC measures a student's overall performance. The second one is (0.68, 0.25, -0.21, -0.34, -0.57). Notice that here the first two entries have one sign and the other three the opposite sign (again, the overall sign is arbitrary). But the first two were the closed book exams and the others were open book!
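For example, we could plot the 25 students in these new coordinates, using the scores stored in the princomp object:
plot(test.pc$scores[, 1], test.pc$scores[, 2],
  xlab = "PC1 (overall performance)",
  ylab = "PC2 (closed vs. open book)")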
The data set state.x77 has information on the 50 states of the United States of America.
‘Population’: population estimate as of July 1, 1975
‘Income’: per capita income (1974)
‘Illiteracy’: illiteracy (1970, percent of population)
‘Life Exp’: life expectancy in years (1969-71)
‘Murder’: murder and non-negligent manslaughter rate per 100,000 population (1976)
‘HS Grad’: percent high-school graduates (1970)
‘Frost’: mean number of days with minimum temperature below freezing (1931-1960) in capital or large city
‘Area’: land area in square miles
Source: U.S. Department of Commerce, Bureau of the Census (1977) Statistical Abstract of the United States.
kable.nice(head(state.x77))
Population | Income | Illiteracy | Life Exp | Murder | HS Grad | Frost | Area | |
---|---|---|---|---|---|---|---|---|
Alabama | 3615 | 3624 | 2.1 | 69.05 | 15.1 | 41.3 | 20 | 50708 |
Alaska | 365 | 6315 | 1.5 | 69.31 | 11.3 | 66.7 | 152 | 566432 |
Arizona | 2212 | 4530 | 1.8 | 70.55 | 7.8 | 58.1 | 15 | 113417 |
Arkansas | 2110 | 3378 | 1.9 | 70.66 | 10.1 | 39.9 | 65 | 51945 |
California | 21198 | 5114 | 1.1 | 71.71 | 10.3 | 62.6 | 20 | 156361 |
Colorado | 2541 | 4884 | 0.7 | 72.06 | 6.8 | 63.9 | 166 | 103766 |
state.pc <- princomp(state.x77, cor = TRUE)
summary(state.pc, loadings = TRUE)
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## Standard deviation 1.8970755 1.2774659 1.0544862 0.84113269 0.62019488
## Proportion of Variance 0.4498619 0.2039899 0.1389926 0.08843803 0.04808021
## Cumulative Proportion 0.4498619 0.6538519 0.7928445 0.88128252 0.92936273
## Comp.6 Comp.7 Comp.8
## Standard deviation 0.55449226 0.3800642 0.33643379
## Proportion of Variance 0.03843271 0.0180561 0.01414846
## Cumulative Proportion 0.96779544 0.9858515 1.00000000
##
## Loadings:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8
## Population 0.126 0.411 0.656 0.409 0.406 0.219
## Income -0.299 0.519 0.100 -0.638 -0.462
## Illiteracy 0.468 -0.353 -0.387 0.620 0.339
## Life Exp -0.412 0.360 -0.443 0.327 -0.219 0.256 -0.527
## Murder 0.444 0.307 -0.108 0.166 -0.128 0.325 0.295 -0.678
## HS Grad -0.425 0.299 -0.232 0.645 0.393 0.307
## Frost -0.357 -0.154 -0.387 0.619 0.217 -0.213 0.472
## Area 0.588 -0.510 -0.201 0.499 -0.148 -0.286
It appears that the first PC contrasts “good” variables such as income and life expectancy with “bad” ones such as murder and illiteracy. It explains about 45% of the variation. The second PC contrasts ‘Frost’ with all the other variables. It accounts for an additional 20%, but it is harder to say what it means.
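One way to explore this further is to plot the states in the coordinates of the first two components (state.abb has the state abbreviations in the same order as state.x77):
plot(state.pc$scores[, 1:2], type = "n", xlab = "PC1", ylab = "PC2")
text(state.pc$scores[, 1], state.pc$scores[, 2],
  labels = state.abb, cex = 0.75)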
One important question is: how many PCs are needed to “reasonably” explain the data? One useful graph here is the scree plot:
plot(state.pc)
It is a simple bar chart of the variation explained by each PC. One popular rule is to include enough PCs to cover at least 90% of the variation. In the states data this means 5 components, which seems a bit much.
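The cumulative proportions can also be computed directly from the standard deviations stored in the princomp object:
round(cumsum(state.pc$sdev^2)/sum(state.pc$sdev^2), 3)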