Statistics with R

Basic Statistics

In this section we will use R to analyze a number of data sets. I will assume that you are familiar with basic concepts such as confidence intervals and hypothesis testing. If you are not, you can read up on them on the web page of my course ESMA 3101: Introduction to Statistics I.

We have talked about the upr admissions data before. Here are some simple things to do when looking at this kind of data:

Basic Summaries and Graphs for Categorical Data

Categorical data is essentially what R calls a factor: a few distinct values repeated many times. To start, such data is usually organized as a

  • Table
Gender <- table(upr$Gender)
names(Gender) <- c("Female", "Male")
Percentage <- round(Gender/sum(Gender)*100, 1)
kable.nice(cbind(Gender, Percentage), do.row.names = FALSE) 
Gender  Percentage
 11487        48.5
 12179        51.5

When we want to study the relationship between two categorical variables we can find the

  • Contingency Table
kable.nice(table(upr$Year, upr$Gender))
          F     M
2003   1102  1151
2004   1040  1118
2005   1162  1138
2006   1137  1098
2007   1208  1256
2008   1219  1219
2009   1180  1237
2010    958  1073
2011    853   919
2012    769   979
2013    859   991

or with percentages based on grand total:

kable.nice(round(table(upr$Year, upr$Gender)/length(upr$Year)*100, 2))
          F     M
2003   4.66  4.86
2004   4.39  4.72
2005   4.91  4.81
2006   4.80  4.64
2007   5.10  5.31
2008   5.15  5.15
2009   4.99  5.23
2010   4.05  4.53
2011   3.60  3.88
2012   3.25  4.14
2013   3.63  4.19

Often more interesting are percentages based on row totals:

tmp <- table(upr$Year, upr$Gender)
apply(tmp, 1, sum)
## 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 
## 2253 2158 2300 2235 2464 2438 2417 2031 1772 1748 1850
tmp <- round(tmp/apply(tmp, 1, sum)*100, 1)
kable.nice(tmp)
          F     M
2003   48.9  51.1
2004   48.2  51.8
2005   50.5  49.5
2006   50.9  49.1
2007   49.0  51.0
2008   50.0  50.0
2009   48.8  51.2
2010   47.2  52.8
2011   48.1  51.9
2012   44.0  56.0
2013   46.4  53.6

Notice that here the row sums are always 100. Of course one could also use column totals.
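
For completeness, here is a sketch of how percentages based on column totals could be computed, using base R's prop.table (margin = 2 divides each count by its column total):

tmp <- table(upr$Year, upr$Gender)
kable.nice(round(prop.table(tmp, margin = 2)*100, 1))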

There is a nicer version of this:

library(gmodels)
CrossTable(upr$Year, upr$Gender, 
           prop.c=FALSE, prop.t=FALSE, prop.chisq = FALSE)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  23666 
## 
##  
##              | upr$Gender 
##     upr$Year |         F |         M | Row Total | 
## -------------|-----------|-----------|-----------|
##         2003 |      1102 |      1151 |      2253 | 
##              |     0.489 |     0.511 |     0.095 | 
## -------------|-----------|-----------|-----------|
##         2004 |      1040 |      1118 |      2158 | 
##              |     0.482 |     0.518 |     0.091 | 
## -------------|-----------|-----------|-----------|
##         2005 |      1162 |      1138 |      2300 | 
##              |     0.505 |     0.495 |     0.097 | 
## -------------|-----------|-----------|-----------|
##         2006 |      1137 |      1098 |      2235 | 
##              |     0.509 |     0.491 |     0.094 | 
## -------------|-----------|-----------|-----------|
##         2007 |      1208 |      1256 |      2464 | 
##              |     0.490 |     0.510 |     0.104 | 
## -------------|-----------|-----------|-----------|
##         2008 |      1219 |      1219 |      2438 | 
##              |     0.500 |     0.500 |     0.103 | 
## -------------|-----------|-----------|-----------|
##         2009 |      1180 |      1237 |      2417 | 
##              |     0.488 |     0.512 |     0.102 | 
## -------------|-----------|-----------|-----------|
##         2010 |       958 |      1073 |      2031 | 
##              |     0.472 |     0.528 |     0.086 | 
## -------------|-----------|-----------|-----------|
##         2011 |       853 |       919 |      1772 | 
##              |     0.481 |     0.519 |     0.075 | 
## -------------|-----------|-----------|-----------|
##         2012 |       769 |       979 |      1748 | 
##              |     0.440 |     0.560 |     0.074 | 
## -------------|-----------|-----------|-----------|
##         2013 |       859 |       991 |      1850 | 
##              |     0.464 |     0.536 |     0.078 | 
## -------------|-----------|-----------|-----------|
## Column Total |     11487 |     12179 |     23666 | 
## -------------|-----------|-----------|-----------|
## 
## 

As for graphs we have the

  • Bar Chart
ggplot(upr, aes(x=Year, fill=Gender)) + 
  geom_bar(position="dodge", alpha=0.75) +
    labs(x="", y="Counts") 

or again based on grand total percentages:

ggplot(upr, aes(x=Year, fill=Gender)) + 
  geom_bar(aes(y = (..count..)/sum(..count..)), 
           position="dodge", alpha=0.75) +
  scale_y_continuous(labels=scales::percent) +
    labs(x="", y="Percentage", 
         title="Percentages based on Grand Total") 

Numerical Summaries

When we have quantitative data we can find numerical summaries such as the mean, median, standard deviation and quantiles:

round(mean(upr$Freshmen.GPA, na.rm=TRUE), 3)
## [1] 2.733
round(median(upr$Freshmen.GPA, na.rm=TRUE), 3)
## [1] 2.83
round(sd(upr$Freshmen.GPA, na.rm=TRUE), 3) # Standard Deviation
## [1] 0.779
round(quantile(upr$Freshmen.GPA, 
               probs = c(0.1, 0.25, 0.75, 0.9),
               na.rm=TRUE), 3) # Quantiles and Quartiles
##  10%  25%  75%  90% 
## 1.71 2.32 3.28 3.65

Notice also that I have rounded all the answers. Proper rounding is an important thing to do!
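
By the way, the summary command returns several of these numbers (minimum, quartiles, median and mean) in one call, though with its own default rounding:

summary(upr$Freshmen.GPA)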

Histogram and Boxplot

bw <- 4/50 
# use about 50 bins
plt1 <- ggplot(upr, aes(Freshmen.GPA)) +
  geom_histogram(color = "black", 
                 fill = "white", 
                 binwidth = bw) + 
  labs(x = "Freshmen GPA", y = "Counts")
plt2 <- ggplot(upr, aes(Gender,Freshmen.GPA)) + 
  geom_boxplot()
pushViewport(viewport(layout = grid.layout(1, 2)))
print(plt1 ,
  vp=viewport(layout.pos.row=1, layout.pos.col=1))
print(plt2 ,
  vp=viewport(layout.pos.row=1, layout.pos.col=2))    

Two Quantitative Variables

round(cor(upr$Year, upr$Freshmen.GPA, 
    use="complete.obs"), 3)
## [1] 0.097
ggplot(upr, aes(Year, Freshmen.GPA)) +
  geom_jitter(shape=16, size=1/10, width = 0.25) +
  scale_x_continuous(breaks = 2003:2013) +
  labs(x="Year", y="GPA after Freshmen Year") 

Inference for the Mean

The basic R command for inference for a population mean is t.test:

Example: Simon Newcomb’s Measurements of the Speed of Light

Simon Newcomb made a series of measurements of the speed of light between July and September 1880. He measured the time in seconds that a light signal took to pass from his laboratory on the Potomac River to a mirror at the base of the Washington Monument and back, a total distance of 7400 m. His first measurement was 0.000024828 seconds, or 24,828 nanoseconds (\(10^9\) nanoseconds = 1 second).

We want to find a 95% confidence interval for the speed of light.

attach(newcomb)
t.test(Measurement)
## 
##  One Sample t-test
## 
## data:  Measurement
## t = 18770, df = 65, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  24823.57 24828.85
## sample estimates:
## mean of x 
##  24826.21

The assumptions for this method are:

  • data comes from a normal distribution
  • or data set is large enough

Let’s check:

y <- Measurement
dta <- data.frame(y = y, 
                  x = rep(" ", length(y)))
 ggplot(aes(x, y), data = dta) +
   geom_boxplot()

It seems there is at least one serious outlier. This should not happen if the data came from a normal distribution.
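
Another standard check of the normality assumption is the normal probability plot; here is a sketch using the same stat_qq tools that appear again later in this section:

ggplot(data.frame(x = Measurement), aes(sample = x)) + 
  stat_qq() + 
  stat_qq_line()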

We could proceed in one of two ways:

  • eliminate outlier
x <- Measurement[Measurement>24775]
t.test(x)$conf.int
## [1] 24825.74 24828.84
## attr(,"conf.level")
## [1] 0.95
  • use the median:
median(Measurement)
## [1] 24827

but now we need to find a confidence interval for the median. That can be done with the non-parametric Wilcoxon Signed Rank method:

wilcox.test(newcomb$Measurement, conf.int = TRUE)
## 
##  Wilcoxon signed rank test with continuity correction
## 
## data:  newcomb$Measurement
## V = 2211, p-value = 1.633e-12
## alternative hypothesis: true location is not equal to 0
## 95 percent confidence interval:
##  24826.0 24828.5
## sample estimates:
## (pseudo)median 
##        24827.5

Example: Resting Period of Monarch Butterflies

Some Monarch butterflies fly early in the day, others somewhat later. After the flight they have to rest for a short period. It has been theorized that the resting period (RIP) of butterflies flying early in the morning is shorter because this is a thermoregulatory mechanism, and it is cooler in the mornings. The mean RIP of all Monarch butterflies is 133 sec. Test the theory at the 10% level.

Research by Anson Lui, Resting period of early and late flying Monarch butterflies Danaeus plexippus, 1997

  1. Parameter: mean \(\mu\)
  2. Method: 1-sample t
  3. Assumptions: normal data or large sample
  4. \(\alpha = 0.1\)
  5. \(H_0: \mu =133\) (RIP is the same for early morning flying butterflies as all others)
  6. \(H_a: \mu < 133\) (RIP is shorter for early morning flying butterflies)
t.test(butterflies$RIP.sec., 
       mu=133, 
       alternative = "less")$p.value
## [1] 0.05583963
  7. \(p = 0.0558 < \alpha = 0.1\), so we reject the null hypothesis.
  8. It appears the resting time is somewhat shorter, but the conclusion is not a strong one.

Checking the assumption:

y <- butterflies$RIP.sec.
dta <- data.frame(y = y, 
                  x = rep(" ", length(y)))
 ggplot(aes(x, y), data = dta) +
   geom_boxplot()

looks good.

Inference for Proportions

The R routine for inference for a proportion (or a probability or a percentage) is binom.test. This implements a method by Clopper and Pearson (1934). This method is exact and has no assumptions.

Note: The formula discussed in many introductory statistics courses for the confidence interval is

\[ \hat p \pm z_{\alpha/2} \sqrt{\frac{\hat p (1-\hat p)}{n} } \]

where \(\hat p\) is the proportion of successes. This leads to confidence intervals that are now known to be quite wrong, and so this method should not be used anymore. The same is true for the corresponding hypothesis test. This method (actually a slight improvement of it due to Wilson (1927)) is implemented in R by prop.test.
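
To illustrate the difference, here is a sketch comparing the three intervals for a hypothetical 40 successes in 50 trials (the counts are made up purely for illustration):

x <- 40; n <- 50  # hypothetical counts, for illustration only
phat <- x/n
# textbook (Wald) interval - not recommended
round(phat + c(-1, 1)*qnorm(0.975)*sqrt(phat*(1-phat)/n), 3)
# Wilson interval (with continuity correction) via prop.test
round(as.numeric(prop.test(x, n)$conf.int), 3)
# exact Clopper-Pearson interval via binom.test
round(as.numeric(binom.test(x, n)$conf.int), 3)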

Example: Jon Kerrich's Coin

The South African Jon Kerrich spent some time in a German prisoner of war camp during World War II. He used his time to flip a coin 10000 times, resulting in 5067 heads.

Test at the 5% level of significance whether 5067 heads in 10000 flips are compatible with a fair coin.

  1. Parameter: proportion \(\pi\)
  2. Method: exact binomial
  3. Assumptions: None
  4. \(\alpha = 0.05\)
  5. \(H_0: \pi = 0.5\) (50% of flips result in “Heads”, coin is fair)
  6. \(H_a: \pi \ne 0.5\) (coin is not fair)
binom.test(x = 5067, n = 10000)$p.value 
## [1] 0.1835155
  7. \(p = 0.1835 > \alpha = 0.05\), so we fail to reject the null hypothesis.
  8. It appears Jon Kerrich's coin was indeed fair.

Example: Sample Size for Polling

Say some polling institute wants to conduct a poll for the next election for president. They will then find a 95% confidence interval and they want this interval to have an error of 3 percentage points (aka \(\pm 0.03\)). What sample size do they need?

In American politics the two parties are always very close, so in a poll with n people about n/2 will vote for one or the other party. Let’s do a little trial and error:

n <- 100
round(diff(as.numeric(binom.test(n/2, n)$conf.int)/2), 3)
## [1] 0.102

Now that is too large, so

n <- 200
round(diff(as.numeric(binom.test(n/2, n)$conf.int)/2), 3)
## [1] 0.071
n <- 400
round(diff(as.numeric(binom.test(n/2, n)$conf.int)/2), 3)
## [1] 0.05
n <- 800
round(diff(as.numeric(binom.test(n/2, n)$conf.int)/2), 3)
## [1] 0.035
n <- 1200
round(diff(as.numeric(binom.test(n/2, n)$conf.int)/2), 3)
## [1] 0.029
n <- 1100
round(diff(as.numeric(binom.test(n/2, n)$conf.int)/2), 3)
## [1] 0.03
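
Instead of trial and error we could also let R search for the smallest sample size directly; here is a minimal sketch of one way to do that:

n <- 100
repeat {
  error <- diff(as.numeric(binom.test(n/2, n)$conf.int))/2
  if(error <= 0.03) break
  n <- n + 2  # keep n even so that n/2 is a whole number
}
n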

There is something quite remarkable about this result: a sample of only about 1100 people is enough, regardless of how large the population is!

Correlation

Example: UPR Admissions data

What are the correlations between the various variables?

head(upr, 2)
##      ID.Code Year Gender Program.Code Highschool.GPA Aptitud.Verbal
## 1 00C2B4EF77 2005      M          502           3.97            647
## 2 00D66CF1BF 2003      M          502           3.80            597
##   Aptitud.Matem Aprov.Ingles Aprov.Matem Aprov.Espanol IGS Freshmen.GPA
## 1           621          626         672           551 342         3.67
## 2           726          618         718           575 343         2.75
##   Graduated Year.Grad. Grad..GPA Class.Facultad
## 1        Si       2012      3.33           INGE
## 2        No         NA        NA           INGE

Let's take out those variables that are not useful for predicting success, either because we don't have their values at the time of the admissions process (Freshmen.GPA) or for legal reasons (Gender).

x <- upr[, -c(1:4, 11:16)]
head(x, 2)
##   Highschool.GPA Aptitud.Verbal Aptitud.Matem Aprov.Ingles Aprov.Matem
## 1           3.97            647           621          626         672
## 2           3.80            597           726          618         718
##   Aprov.Espanol
## 1           551
## 2           575
round(cor(x, use = "complete.obs") ,3)
##                Highschool.GPA Aptitud.Verbal Aptitud.Matem Aprov.Ingles
## Highschool.GPA          1.000          0.176         0.156        0.049
## Aptitud.Verbal          0.176          1.000         0.461        0.513
## Aptitud.Matem           0.156          0.461         1.000        0.456
## Aprov.Ingles            0.049          0.513         0.456        1.000
## Aprov.Matem             0.216          0.474         0.819        0.481
## Aprov.Espanol           0.247          0.602         0.389        0.428
##                Aprov.Matem Aprov.Espanol
## Highschool.GPA       0.216         0.247
## Aptitud.Verbal       0.474         0.602
## Aptitud.Matem        0.819         0.389
## Aprov.Ingles         0.481         0.428
## Aprov.Matem          1.000         0.404
## Aprov.Espanol        0.404         1.000

Somewhat surprisingly, Highschool.GPA is not highly correlated with any of the tests. The highest correlation is between the two math tests (0.819).

Example: The 1970’s Military Draft

In 1970, Congress instituted a random selection process for the military draft. All 366 possible birth dates were placed in plastic capsules in a rotating drum and were selected one by one. The first date drawn from the drum received draft number one and eligible men born on that date were drafted first. In a truly random lottery there should be no relationship between the date and the draft number.

Question: was the draft really “random”?

Here we have two quantitative variables, so we start with the scatterplot:

attach(draft)
ggplot(aes(x, y), 
       data=data.frame(x=Day.of.Year,
                       y=Draft.Number)) +
  geom_point()

and this does not look like there is a problem with independence.

However:

  1. Parameter: Pearson’s correlation coefficient \(\rho\)
  2. Method: Test for Pearson’s correlation coefficient \(\rho\)
  3. Assumptions: the relationship is linear and there are no outliers.
  4. \(\alpha = 0.05\)
  5. \(H_0: \rho =0\) (no relationship between Day of Year and Draft Number)
  6. \(H_a: \rho \ne 0\) (some relationship between Day of Year and Draft Number)
cor.test(Draft.Number, Day.of.Year)$p.value
## [1] 1.263829e-05
  7. \(p = 0.0000 < \alpha = 0.05\), so we reject the null hypothesis.
  8. There is a statistically significant relationship between Day of Year and Draft Number.
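
If one is worried about the linearity and outlier assumptions, cor.test can also carry out a rank-based version of the test using Spearman's correlation; a quick sketch:

cor.test(Draft.Number, Day.of.Year, method = "spearman")$p.value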

Categorical Data Analysis - Tests for Independence

Example: Drownings in Los Angeles

Data is from O’Carroll PW, Alkon E, Weiss B. Drowning mortality in Los Angeles County, 1976 to 1984, JAMA, 1988 Jul 15;260(3):380-3.

Drowning is the fourth leading cause of unintentional injury death in Los Angeles County. They examined data collected by the Los Angeles County Coroner's Office on drownings that occurred in the county from 1976 through 1984. There were 1587 drownings (1130 males and 457 females) during this nine-year period.

kable.nice(drownings)
                        Male  Female
Private Swimming Pool    488     219
Bathtub                  115     132
Ocean                    231      40
Freshwater bodies        155      19
Hottubs                   16      15
Reservoirs                32       2
Other Pools               46      14
Pails, basins, toilets     7       4
Other                     40      12

Here we have two categorical variables (Method of Drowning and Gender), and we want to know whether they are independent. The most popular method of analysis for this type of problem is Pearson's chi-square test of independence. It is done with the command chisq.test, and it has the assumption that no expected counts are less than 5.

  1. Parameters of interest: measure of association
  2. Method of analysis: chi-square test of independence
  3. Assumptions of Method: all expected counts greater than 5
  4. Type I error probability \(\alpha\)=0.05
  5. H0: Classifications are independent = there is no difference in the method of drowning between men and women.
  6. Ha: Classifications are dependent = there is some difference in the method of drowning between men and women.
chisq.test(drownings)
## 
##  Pearson's Chi-squared test
## 
## data:  drownings
## X-squared = 144.48, df = 8, p-value < 2.2e-16
  7. \(p = 0.000 < \alpha = 0.05\), so we reject the null hypothesis; there is a statistically significant difference between men and women in where they drown.

Let’s see whether there is a problem with the assumptions:

kable.nice(round(chisq.test(drownings)$expected, 1))
                         Male  Female
Private Swimming Pool   503.4   203.6
Bathtub                 175.9    71.1
Ocean                   193.0    78.0
Freshwater bodies       123.9    50.1
Hottubs                  22.1     8.9
Reservoirs               24.2     9.8
Other Pools              42.7    17.3
Pails, basins, toilets    7.8     3.2
Other                    37.0    15.0

and we see that the expected count for Pails, basins, toilets and Female is 3.2. In real life this would be considered OK, but it would also be easy to fix:

newmale <- c(drownings[1:7, 1], 7+40)
newfemale <- c(drownings[1:7, 2], 4+12)
newdrown <- cbind(newmale, newfemale)
rownames(newdrown)[8] <- "Other"
kable.nice(newdrown)
                       newmale  newfemale
Private Swimming Pool      488        219
Bathtub                    115        132
Ocean                      231         40
Freshwater bodies          155         19
Hottubs                     16         15
Reservoirs                  32          2
Other Pools                 46         14
Other                       47         16
out <- chisq.test(newdrown) 
kable.nice(round(out$expected, 1))
                       newmale  newfemale
Private Swimming Pool    503.4      203.6
Bathtub                  175.9       71.1
Ocean                    193.0       78.0
Freshwater bodies        123.9       50.1
Hottubs                   22.1        8.9
Reservoirs                24.2        9.8
Other Pools               42.7       17.3
Other                     44.9       18.1
out$p.value
## [1] 8.518561e-28

Comparing the Means of Several Populations - ANOVA

Example: Mothers' Cocaine Use and Babies' Health

Chasnoff and others obtained several measures and responses for newborn babies whose mothers were classified by degree of cocaine use.

The study was conducted in the Perinatal Center for Chemical Dependence at Northwestern University Medical School. The measurement given here is the length of the newborn.

Source: Cocaine abuse during pregnancy: correlation between prenatal care and perinatal outcome
Authors: SN MacGregor, LG Keith, JA Bachicha, and IJ Chasnoff
Obstetrics and Gynecology 1989;74:882-885

attach(mothers)
kable.nice(mothers[sample(1:nrow(mothers), 10), ], 
           do.row.names = FALSE)
Status           Length
Drug Free          51.7
Drug Free          51.8
Drug Free          53.7
Drug Free          52.1
First Trimester    48.5
Drug Free          52.9
Throughout         48.2
Drug Free          50.4
Throughout         51.5
Drug Free          56.5
dta <- data.frame(y = Length, 
                  x = Status)
 ggplot(aes(x, y), data = dta) +
   geom_boxplot() +
   labs(x="", y="Length")

out <- matrix(0, 3, 3)
colnames(out) <- c("Size", "Mean", "SD")
rownames(out) <- sort(unique(Status)) # match the alphabetical order used by tapply
out[, 1] <- tapply(Length, Status, length)
out[, 2] <- round(tapply(Length, Status, mean), 2)
out[, 3] <- round(tapply(Length, Status, sd), 2)
kable.nice(out)
                 Size  Mean   SD
Drug Free          39  51.1  2.9
First Trimester    19  49.3  2.5
Throughout         36  48.0  3.6

The standard method for this problem is called ANOVA (Analysis of Variance) and is run with the aov command.

  1. Parameters of interest: group means
  2. Method of analysis: ANOVA
  3. Assumptions of Method: residuals have a normal distribution, groups have equal variance
  4. Type I error probability \(\alpha\)=0.05
  5. Null hypothesis H0: \(\mu_1 = \mu_2 = \mu_3\) (groups have the same means)
  6. Alternative hypothesis Ha: \(\mu_i \ne \mu_j\) (at least two groups have different means)
fit <- aov(Length~Status)
summary(fit)
##             Df Sum Sq Mean Sq F value   Pr(>F)
## Status       2  181.4   90.69   9.319 0.000208
## Residuals   91  885.6    9.73
  7. \(p = 0.0002 < \alpha = 0.05\), so we reject the null hypothesis; there is some evidence that the group means are not the same. The babies whose mothers used cocaine tend to be a little shorter (less healthy?).

In step 3 we made two assumptions:

  1. residuals have a normal distribution
ggplot(data.frame(x=fit$res), aes(sample=x)) + 
  stat_qq() + 
  stat_qq_line()

looks fine

  2. groups have equal variance

The box plot shows that the variances are about the same.
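
If one wants a more formal check of the equal variance assumption, base R has bartlett.test (just a sketch; note that this test is itself quite sensitive to non-normality):

bartlett.test(Length ~ Status)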



Often if the null of no difference is rejected, one wants to go a step further and do a pairwise comparison:

  • is Drug Free different from First Trimester?
  • is First Trimester different from Throughout?
  • is Drug Free different from Throughout?

There are a number of methods known for this problem; a popular one is due to Tukey:

tuk <- TukeyHSD(fit)
plot(tuk)

this draws confidence intervals for the difference in means of all pairs. If an interval does not contain 0, the corresponding pair is statistically significantly different.
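
The numeric intervals behind the plot can also be printed directly; TukeyHSD stores them under the name of the factor, here Status:

round(tuk$Status, 3)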

Here that is the case only for Drug Free - Throughout, so the other two pairs are not statistically significantly different. Remember, however, that failing to reject H0 is NOT the same as accepting H0. The fact that those pairs are not statistically significantly different is almost certainly due to the small sample sizes.

Regression

Example: Predicting the Usage of Electricity

In Westchester County, north of New York City, Consolidated Edison bills residential customers for electricity on a monthly basis. The company wants to predict residential usage, in order to plan purchases of fuel and budget revenue flow. The data includes information on usage (in kilowatt-hours per day) and average monthly temperature for 55 consecutive months for an all-electric home.

attach(elusage) 
kable.nice(head(elusage), do.row.names = FALSE)
Month  Year   Usage  Temperature
    8  1989  24.828           73
    9  1989  24.688           67
   10  1989  19.310           57
   11  1989  59.706           43
   12  1989  99.667           26
    1  1990  49.333           41
ggplot(aes(Temperature, Usage), data=elusage) +
  geom_point()

We want to find a model (aka a function) f with

\[ \text{Usage}=f(\text{Temperature}) + \epsilon \] where \(\epsilon\) is some random variable, often with the assumption \(\epsilon \sim N(0, \sigma)\).

  • Linear Model

\[ \text{Usage}=\beta_0 + \beta_1 \text{ Temperature} + \epsilon \]

fit <- lm(Usage~Temperature, data=elusage)
summary(fit)
## 
## Call:
## lm(formula = Usage ~ Temperature, data = elusage)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.305  -8.163   0.559   7.723  28.611 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 116.71619    5.56494   20.97   <2e-16
## Temperature  -1.36461    0.09941  -13.73   <2e-16
## 
## Residual standard error: 11.35 on 53 degrees of freedom
## Multiple R-squared:  0.7805, Adjusted R-squared:  0.7763 
## F-statistic: 188.4 on 1 and 53 DF,  p-value: < 2.2e-16

gives the model as

\[ \text{Usage} = 116.7 - 1.36 \text{ Temperature} + \epsilon \]

How do we know that this is a good model? The main diagnostic is the Residual vs Fits plot:

Fits <- 116.7 - 1.36*Temperature
Residuals <- Usage - Fits

actually, they are already part of the fit object:

 ggplot(aes(Fits, Residuals), 
        data=data.frame(Fits=fitted(fit),
                        Residuals=residuals(fit))) +
  geom_point() +
  geom_smooth(se=FALSE) +
  geom_abline(slope = 0)

the idea is that if the model is good, the residuals and the fitted values should be independent, and so this graph should not show any pattern. Adding the non-parametric fit and a horizontal line shows that this is not the case here.

  • polynomial model

Let’s add a quadratic term to the model

\[ \text{Usage}=\beta_0 + \beta_1 \text{ Temperature} + \beta_2 \text{ Temperature}^2 + \epsilon \]

T2 <- Temperature^2
quad.fit <- lm(Usage~Temperature + T2,
               data=elusage)
summary(quad.fit)
## 
## Call:
## lm(formula = Usage ~ Temperature + T2, data = elusage)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.8593  -5.6813   0.3756   5.4650  21.9701 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) 196.715321  17.487567  11.249 1.53e-15
## Temperature  -4.640462   0.694906  -6.678 1.62e-08
## T2            0.030732   0.006472   4.749 1.65e-05
## 
## Residual standard error: 9.574 on 52 degrees of freedom
## Multiple R-squared:  0.8469, Adjusted R-squared:  0.841 
## F-statistic: 143.8 on 2 and 52 DF,  p-value: < 2.2e-16
 ggplot(aes(Fits, Residuals), 
        data=data.frame(Fits=fitted(quad.fit),
                        Residuals=residuals(quad.fit))) +
  geom_point() +
  geom_smooth(se=FALSE) +
  geom_abline(slope = 0)

and that is much better. If it were not, one could continue and add even higher order terms. There will always be a polynomial model that gives a good fit.
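
As an illustration, a cubic model could be fit in the same way; a sketch using poly (raw = TRUE gives ordinary powers of Temperature):

cubic.fit <- lm(Usage ~ poly(Temperature, 3, raw = TRUE), 
                data = elusage)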

  • Transformations

here we use some transformations (functions) of the data, for example \(\log\):

log.usage <- log(Usage)
log.temp <- log(Temperature)
log.fit <- lm(log.usage~log.temp)
summary(log.fit)
 ggplot(aes(Fits, Residuals), 
        data=data.frame(Fits=fitted(log.fit),
                        Residuals=residuals(log.fit))) +
  geom_point() +
  geom_smooth(se=FALSE) +
  geom_abline(slope = 0)

and that is also not too bad. Other standard choices for transformations are the square root and the inverse, but in principle any function might work.
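
For example, a square root model could be tried in exactly the same way (a sketch):

sqrt.fit <- lm(sqrt(Usage) ~ Temperature, data = elusage)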


How do we choose among these models? A standard measure of the quality of the fit is the Coefficient of Determination. It is defined as

\[ R^2 = \text{cor}(\text{Observed Values, Predicted Values})^2 \times 100\% \]

The better a model is, the more correlated its fitted values and the observed values should be, so if we have a choice of two models, the one with the higher \(R^2\) is better.
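
To see that this matches the R output, here is a sketch computing \(R^2\) for the quadratic model directly from the definition (y.obs and y.pred are just temporary names):

y.obs <- elusage$Usage
y.pred <- fitted(quad.fit)  # fitted values of the quadratic model
round(100*cor(y.obs, y.pred)^2, 2)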

Here we find

# Linear Model
round(100*unlist(summary(fit)$r.squared), 2)
## [1] 78.05
# Quadratic Model
round(100*unlist(summary(quad.fit)$r.squared), 2)
## [1] 84.69
# Log Transform Model
round(100*unlist(summary(log.fit)$r.squared), 2)      
## [1] 81.12

but we need to be careful here: the linear model is a special case of the quadratic model, and so the quadratic model's \(R^2\) can never be smaller than the linear model's. There are other ways to choose between such nested models, for example the F test, but here this is not an issue because the linear model is bad anyway.
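
As an aside, the F test for nested models mentioned here can be run with the anova command, comparing the linear fit to the quadratic one (a sketch):

anova(fit, quad.fit)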

Now the \(R^2\) of the quadratic model is \(84.7\%\) and that of the log transform model is \(81.1\%\), so the quadratic one is better.

Finally let's have a look at what those models look like:

x <- seq(min(Temperature), max(Temperature), length=100)
y.quad <- predict(quad.fit, 
                  newdata=data.frame(Temperature=x, 
                                     T2 = x^2))
y.log <- exp(predict(log.fit, 
                 newdata=data.frame(log.temp=log(x))))
dta <- data.frame(x=c(x, x),
                  y=c(y.quad, y.log),
                  Model=rep(c("Quadratic", "Log"),
                         each=100))
ggplot(aes(x, y, color=Model), data=dta) +
  geom_line(size=1.2) +
  geom_point(data=elusage, 
             aes(Temperature, Usage),
             inherit.aes = FALSE) +
  xlab("Temperature") +
  ylab("Usage")