As we saw before, highly correlated predictors can cause difficulties in a regression analysis. We will now study a way to deal with this.
The idea of principal components is this: find a few linear combinations of \(x_1, \dots, x_k\) that explain the variation in y. That is, find
\[
z_1=\sum_{i=1}^k \alpha_{i1} x_i \\
\vdots \\
z_m=\sum_{i=1}^k \alpha_{im} x_i
\]
and do a regression of y on \(z_1, \dots, z_m\) instead of \(x_1, \dots, x_k\).
This is a useful idea if a small number m of such combinations (m much smaller than k) already captures most of the variation, so that a few new predictors can replace many highly correlated ones.
Using matrix notation we can write \(Z=XA\), where X is the n by k data matrix and A is the k by m matrix of \(\alpha\)'s.
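For instance, if we already had a response y, a predictor matrix X and a coefficient matrix A, the regression step might look like this (X, A and y are placeholders here):
Z <- X %*% A       # new predictors z1, ..., zm as the columns of Z
fit <- lm(y ~ Z)   # regress y on the new predictors instead of the x's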
How should we choose A? The idea of principal components is as follows: choose A so that the new variables \(z_1, \dots, z_m\) are uncorrelated, that is, so that \(Z'Z\) is a diagonal matrix. So we want to find a matrix A such that
\[
Z'Z = (XA)'(XA) = A'(X'X)A = D
\]
where D is a diagonal matrix.
Now X'X is a symmetric k by k matrix. It could be singular, but let's assume for the moment that it is not. Then it can be shown (using linear algebra) that the columns of A are the eigenvectors of the matrix X'X.
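For example, with an arbitrary symmetric matrix standing in for X'X, we can check that the eigenvectors returned by eigen diagonalize it:
S <- matrix(c(2, 1, 1, 3), 2, 2)   # any symmetric matrix, standing in for X'X
A <- eigen(S)$vectors              # columns are the eigenvectors
round(t(A) %*% S %*% A, 10)        # diagonal, with the eigenvalues on the diagonal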
Let's see how this works in an artificial example. We have a sample of 100 observations from a bivariate normal distribution with means (0, 0), variances (1, 1) and correlation 0.9. First we plot x2 vs. x1, then we find the matrix X'X and its eigenvectors (using the R function eigen). Next we find the new variables z1 and z2 as linear combinations of x1 and x2, with the eigenvectors supplying the coefficients.
library(mvtnorm)
# sample of 100 observations from a bivariate normal with correlation 0.9
x <- rmvnorm(100, mean = c(0, 0),
  sigma = matrix(c(1, 0.9, 0.9, 1), 2, 2))
xyr <- range(x)
plot(x, xlab = "X1", ylab = "X2", xlim = xyr, ylim = xyr)
# the matrix X'X
y <- t(x) %*% x
print(y)
## [,1] [,2]
## [1,] 113.9399 101.5266
## [2,] 101.5266 108.3705
E <- eigen(y)
print(E)
## eigen() decomposition
## $values
## [1] 212.720001 9.590372
##
## $vectors
## [,1] [,2]
## [1,] -0.7167349 0.6973458
## [2,] -0.6973458 -0.7167349
# the columns of E$vectors are the eigenvectors, so the new variables are
z1 <- E$vectors[1, 1] * x[, 1] + E$vectors[2, 1] * x[, 2]
z2 <- E$vectors[1, 2] * x[, 1] + E$vectors[2, 2] * x[, 2]
plot(z1, z2, xlab = "z1", ylab = "z2", ylim = range(z1))
Notice that z1 and z2 are (essentially) uncorrelated, and that almost all of the variation is now in z1, while z2 varies very little.
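We can check this numerically:
round(cor(z1, z2), 3)   # (essentially) 0
c(var(z1), var(z2))     # z1 carries almost all of the variation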
There is another, geometric way to see what principal components are: again we draw the scatterplot of x2 vs. x1, but without the axes. The first principal component transformation is \(z_1=e_{11}x_1+e_{21}x_2\), where \((e_{11}, e_{21})\) is the first eigenvector. In the x2 vs. x1 plot this direction corresponds to a line through the origin with slope \(e_{21}/e_{11}\), which we add to the plot. We do the same with the second principal component, whose direction has slope \(e_{22}/e_{12}\).
plot(x, xlab = "x1", ylab = "x2",
  xlim = xyr, ylim = xyr, axes = FALSE)
# the new coordinate axes: lines through the origin in the
# directions of the two eigenvectors
abline(0, E$vectors[2, 1]/E$vectors[1, 1])
abline(0, E$vectors[2, 2]/E$vectors[1, 2])
Now we can see that the transformation is really a change of coordinate system, from x1, x2 to z1, z2.
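In matrix form the change of coordinates is a single multiplication, which reproduces z1 and z2 from above:
z <- x %*% E$vectors     # rotate the data into the new coordinate system
all.equal(z[, 1], z1)    # should be TRUE
all.equal(z[, 2], z2)    # should be TRUE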
In practice we can use the R function princomp to carry out the calculations for us.
pc <- princomp(x)
print(summary(pc))
## Importance of components:
## Comp.1 Comp.2
## Standard deviation 1.4529260 0.30876903
## Proportion of Variance 0.9567888 0.04321123
## Cumulative Proportion 0.9567888 1.00000000
We see that the first principal component explains about 96% of the variation in (x1, x2).
One of the components of the pc object is called “loadings”
pc$loadings
##
## Loadings:
## Comp.1 Comp.2
## [1,] 0.718 0.696
## [2,] 0.696 -0.718
##
## Comp.1 Comp.2
## SS loadings 1.0 1.0
## Proportion Var 0.5 0.5
## Cumulative Var 0.5 1.0
and we see that these are just the eigenvectors found above, up to sign (the overall sign of a principal component is arbitrary).
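We can put the two side by side (small numerical differences remain because princomp centers the data first):
cbind(unclass(pc$loadings), E$vectors)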
Consider the data in testscores. This is artificial data meant to represent the test scores of 25 mathematics graduate students on their qualifying exams. The differential geometry and complex analysis exams were closed book whereas the others were open book.
kable.nice(testscores)
diffgeom | complex | algebra | reals | statistics |
---|---|---|---|---|
36 | 58 | 43 | 36 | 37 |
62 | 54 | 50 | 46 | 52 |
31 | 42 | 41 | 40 | 29 |
76 | 78 | 69 | 66 | 81 |
46 | 56 | 52 | 56 | 40 |
12 | 42 | 38 | 38 | 28 |
39 | 46 | 51 | 54 | 41 |
30 | 51 | 54 | 52 | 32 |
22 | 32 | 43 | 28 | 22 |
9 | 40 | 47 | 30 | 24 |
32 | 49 | 54 | 37 | 52 |
40 | 62 | 51 | 40 | 49 |
64 | 75 | 70 | 66 | 63 |
36 | 38 | 58 | 62 | 62 |
24 | 46 | 44 | 55 | 49 |
50 | 50 | 54 | 52 | 51 |
42 | 42 | 52 | 38 | 50 |
2 | 35 | 32 | 22 | 16 |
56 | 53 | 42 | 40 | 32 |
59 | 72 | 70 | 66 | 62 |
28 | 50 | 50 | 42 | 63 |
19 | 46 | 49 | 40 | 30 |
36 | 56 | 56 | 54 | 52 |
54 | 57 | 59 | 62 | 58 |
14 | 35 | 38 | 29 | 20 |
test.pc <- princomp(testscores)
summary(test.pc, loadings = TRUE)
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4
## Standard deviation 28.4896795 9.03547104 6.60095491 6.13358179
## Proportion of Variance 0.8212222 0.08260135 0.04408584 0.03806395
## Cumulative Proportion 0.8212222 0.90382353 0.94790936 0.98597332
## Comp.5
## Standard deviation 3.72335754
## Proportion of Variance 0.01402668
## Cumulative Proportion 1.00000000
##
## Loadings:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## diffgeom 0.598 0.675 0.185 0.386
## complex 0.361 0.245 -0.249 -0.829 -0.247
## algebra 0.302 -0.214 -0.211 -0.135 0.894
## reals 0.389 -0.338 -0.700 0.375 -0.321
## statistics 0.519 -0.570 0.607 -0.179
Looking at the summary we see that the first PC accounts for 82% of the variation in the data, and the first two together account for 90%.
Let's have a look at the loadings: the first one is (0.60, 0.36, 0.30, 0.39, 0.52) and amounts to (roughly) an average over all the exams, so the first PC measures a student's overall performance. The second one is (0.68, 0.25, -0.21, -0.34, -0.57). Notice that here the first two entries have one sign and the other three the opposite sign (again, the overall sign is arbitrary). But the first two were the closed book exams and the others were open book!
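For example, we could plot the 25 students in these new coordinates, using the scores stored in the princomp object:
plot(test.pc$scores[, 1], test.pc$scores[, 2],
  xlab = "PC1 (overall performance)",
  ylab = "PC2 (closed vs. open book)")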
The data set state.x77 has information on the 50 states of the United States of America.
‘Population’: population estimate as of July 1, 1975
‘Income’: per capita income (1974)
‘Illiteracy’: illiteracy (1970, percent of population)
‘Life Exp’: life expectancy in years (1969-71)
‘Murder’: murder and non-negligent manslaughter rate per 100,000 population (1976)
‘HS Grad’: percent high-school graduates (1970)
‘Frost’: mean number of days with minimum temperature below freezing (1931-1960) in capital or large city
‘Area’: land area in square miles
Source: U.S. Department of Commerce, Bureau of the Census (1977) Statistical Abstract of the United States.
kable.nice(head(state.x77))
Population | Income | Illiteracy | Life Exp | Murder | HS Grad | Frost | Area | |
---|---|---|---|---|---|---|---|---|
Alabama | 3615 | 3624 | 2.1 | 69.05 | 15.1 | 41.3 | 20 | 50708 |
Alaska | 365 | 6315 | 1.5 | 69.31 | 11.3 | 66.7 | 152 | 566432 |
Arizona | 2212 | 4530 | 1.8 | 70.55 | 7.8 | 58.1 | 15 | 113417 |
Arkansas | 2110 | 3378 | 1.9 | 70.66 | 10.1 | 39.9 | 65 | 51945 |
California | 21198 | 5114 | 1.1 | 71.71 | 10.3 | 62.6 | 20 | 156361 |
Colorado | 2541 | 4884 | 0.7 | 72.06 | 6.8 | 63.9 | 166 | 103766 |
state.pc <- princomp(state.x77, cor = TRUE)
summary(state.pc, loadings = TRUE)
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## Standard deviation 1.8970755 1.2774659 1.0544862 0.84113269 0.62019488
## Proportion of Variance 0.4498619 0.2039899 0.1389926 0.08843803 0.04808021
## Cumulative Proportion 0.4498619 0.6538519 0.7928445 0.88128252 0.92936273
## Comp.6 Comp.7 Comp.8
## Standard deviation 0.55449226 0.3800642 0.33643379
## Proportion of Variance 0.03843271 0.0180561 0.01414846
## Cumulative Proportion 0.96779544 0.9858515 1.00000000
##
## Loadings:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8
## Population 0.126 0.411 0.656 0.409 0.406 0.219
## Income -0.299 0.519 0.100 -0.638 -0.462
## Illiteracy 0.468 -0.353 -0.387 0.620 0.339
## Life Exp -0.412 0.360 -0.443 0.327 -0.219 0.256 -0.527
## Murder 0.444 0.307 -0.108 0.166 -0.128 0.325 0.295 -0.678
## HS Grad -0.425 0.299 -0.232 0.645 0.393 0.307
## Frost -0.357 -0.154 -0.387 0.619 0.217 -0.213 0.472
## Area 0.588 -0.510 -0.201 0.499 -0.148 -0.286
It appears that the first PC contrasts “good” variables such as income and life expectancy with “bad” ones such as murder and illiteracy. It explains about 45% of the variation. The second PC contrasts ‘Frost’ with all the other variables. It accounts for an additional 20%, but it is harder to say what it means.
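One way to explore this further is to plot the states in the coordinates of the first two components (state.abb has the state abbreviations in the same order as state.x77):
plot(state.pc$scores[, 1:2], type = "n", xlab = "PC1", ylab = "PC2")
text(state.pc$scores[, 1], state.pc$scores[, 2],
  labels = state.abb, cex = 0.75)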
One important question is: how many PCs are needed to “reasonably” explain the data? One useful graph here is the scree plot:
plot(state.pc)
It is a simple bar chart of the variation explained by each PC. One popular rule is to include enough PCs to cover at least 90% of the variation. In the states data this means 5 components, which seems a bit much.
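The cumulative proportions can also be computed directly from the standard deviations stored in the princomp object:
round(cumsum(state.pc$sdev^2)/sum(state.pc$sdev^2), 3)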