In this section we will use R to analyze a number of data sets. I will assume that you are familiar with basic concepts such as confidence intervals and hypothesis testing. If you are not, you can read up on them on the web page of my course ESMA 3101: Introduction to Statistics I.
We have talked about the upr admissions data before. Here are some simple things to do when looking at this kind of data:
Categorical data is essentially what R calls factors: a few values repeated many times. To start, they are usually organized in a table of counts:
Gender <- table(upr$Gender)
names(Gender) <- c("Female", "Male")
Percentage <- round(Gender/sum(Gender)*100, 1)
kable.nice(cbind(Gender, Percentage), do.row.names = FALSE)
Gender | Percentage |
---|---|
11487 | 48.5 |
12179 | 51.5 |
When we want to study the relationship of two of these we can find the contingency table:
kable.nice(table(upr$Year, upr$Gender))
F | M | |
---|---|---|
2003 | 1102 | 1151 |
2004 | 1040 | 1118 |
2005 | 1162 | 1138 |
2006 | 1137 | 1098 |
2007 | 1208 | 1256 |
2008 | 1219 | 1219 |
2009 | 1180 | 1237 |
2010 | 958 | 1073 |
2011 | 853 | 919 |
2012 | 769 | 979 |
2013 | 859 | 991 |
or with percentages based on grand total:
kable.nice(round(table(upr$Year, upr$Gender)/length(upr$Year)*100, 2))
F | M | |
---|---|---|
2003 | 4.66 | 4.86 |
2004 | 4.39 | 4.72 |
2005 | 4.91 | 4.81 |
2006 | 4.80 | 4.64 |
2007 | 5.10 | 5.31 |
2008 | 5.15 | 5.15 |
2009 | 4.99 | 5.23 |
2010 | 4.05 | 4.53 |
2011 | 3.60 | 3.88 |
2012 | 3.25 | 4.14 |
2013 | 3.63 | 4.19 |
Often more interesting are percentages based on row totals:
tmp <- table(upr$Year, upr$Gender)
apply(tmp, 1, sum)
## 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013
## 2253 2158 2300 2235 2464 2438 2417 2031 1772 1748 1850
tmp <- round(tmp/apply(tmp, 1, sum)*100, 1)
kable.nice(tmp)
F | M | |
---|---|---|
2003 | 48.9 | 51.1 |
2004 | 48.2 | 51.8 |
2005 | 50.5 | 49.5 |
2006 | 50.9 | 49.1 |
2007 | 49.0 | 51.0 |
2008 | 50.0 | 50.0 |
2009 | 48.8 | 51.2 |
2010 | 47.2 | 52.8 |
2011 | 48.1 | 51.9 |
2012 | 44.0 | 56.0 |
2013 | 46.4 | 53.6 |
Notice that here the row sums are always 100. Of course one could also use column totals.
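The division by the row (or column) totals can also be done directly with prop.table; a small sketch:
# margin = 1 divides by row totals, margin = 2 by column totals
kable.nice(round(100*prop.table(table(upr$Year, upr$Gender), margin = 1), 1))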
There is a nicer version of this:
library(gmodels)
CrossTable(upr$Year, upr$Gender,
prop.c=FALSE, prop.t=FALSE, prop.chisq = FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## |-------------------------|
##
##
## Total Observations in Table: 23666
##
##
## | upr$Gender
## upr$Year | F | M | Row Total |
## -------------|-----------|-----------|-----------|
## 2003 | 1102 | 1151 | 2253 |
## | 0.489 | 0.511 | 0.095 |
## -------------|-----------|-----------|-----------|
## 2004 | 1040 | 1118 | 2158 |
## | 0.482 | 0.518 | 0.091 |
## -------------|-----------|-----------|-----------|
## 2005 | 1162 | 1138 | 2300 |
## | 0.505 | 0.495 | 0.097 |
## -------------|-----------|-----------|-----------|
## 2006 | 1137 | 1098 | 2235 |
## | 0.509 | 0.491 | 0.094 |
## -------------|-----------|-----------|-----------|
## 2007 | 1208 | 1256 | 2464 |
## | 0.490 | 0.510 | 0.104 |
## -------------|-----------|-----------|-----------|
## 2008 | 1219 | 1219 | 2438 |
## | 0.500 | 0.500 | 0.103 |
## -------------|-----------|-----------|-----------|
## 2009 | 1180 | 1237 | 2417 |
## | 0.488 | 0.512 | 0.102 |
## -------------|-----------|-----------|-----------|
## 2010 | 958 | 1073 | 2031 |
## | 0.472 | 0.528 | 0.086 |
## -------------|-----------|-----------|-----------|
## 2011 | 853 | 919 | 1772 |
## | 0.481 | 0.519 | 0.075 |
## -------------|-----------|-----------|-----------|
## 2012 | 769 | 979 | 1748 |
## | 0.440 | 0.560 | 0.074 |
## -------------|-----------|-----------|-----------|
## 2013 | 859 | 991 | 1850 |
## | 0.464 | 0.536 | 0.078 |
## -------------|-----------|-----------|-----------|
## Column Total | 11487 | 12179 | 23666 |
## -------------|-----------|-----------|-----------|
##
##
As for graphs, we have the bar chart:
ggplot(upr, aes(x=Year, fill=Gender)) +
geom_bar(position="dodge", alpha=0.75) +
labs(x="", y="Counts")
or again based on grand total percentages:
ggplot(upr, aes(x=Year, fill=Gender)) +
geom_bar(aes(y = (..count..)/sum(..count..)),
position="dodge", alpha=0.75) +
scale_y_continuous(labels=scales::percent) +
labs(x="", y="Percentage",
title="Percentages based on Grand Total")
When we have quantitative data we can find summary statistics such as the mean, median, standard deviation and quantiles:
round(mean(upr$Freshmen.GPA, na.rm=TRUE), 3)
## [1] 2.733
round(median(upr$Freshmen.GPA, na.rm=TRUE), 3)
## [1] 2.83
round(sd(upr$Freshmen.GPA, na.rm=TRUE), 3) # Standard Deviation
## [1] 0.779
round(quantile(upr$Freshmen.GPA,
probs = c(0.1, 0.25, 0.75, 0.9),
na.rm=TRUE), 3) # Quantiles and Quartiles
## 10% 25% 75% 90%
## 1.71 2.32 3.28 3.65
Notice also that I have rounded all the answers. Proper rounding is an important thing to do!
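Much of this information is also available in one step with summary, which for a numeric vector reports the quartiles, the mean and the number of missing values:
summary(upr$Freshmen.GPA)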
bw <- 4/50
# use about 50 bins
plt1 <- ggplot(upr, aes(Freshmen.GPA)) +
geom_histogram(color = "black",
fill = "white",
binwidth = bw) +
labs(x = "Freshmen GPA", y = "Counts")
plt2 <- ggplot(upr, aes(Gender,Freshmen.GPA)) +
geom_boxplot()
library(grid)  # pushViewport, viewport and grid.layout come from the grid package
pushViewport(viewport(layout = grid.layout(1, 2)))
print(plt1 ,
vp=viewport(layout.pos.row=1, layout.pos.col=1))
print(plt2 ,
vp=viewport(layout.pos.row=1, layout.pos.col=2))
round(cor(upr$Year, upr$Freshmen.GPA,
use="complete.obs"), 3)
## [1] 0.097
ggplot(upr, aes(Year, Freshmen.GPA)) +
geom_jitter(shape=16, size=1/10, width = 0.25) +
scale_x_continuous(breaks = 2003:2013) +
labs(x="Year", y="GPA after Freshmen Year")
The basic R command for inference for a population mean is t.test:
Simon Newcomb made a series of measurements of the speed of light between July and September 1880. He measured the time in seconds that a light signal took to pass from his laboratory on the Potomac River to a mirror at the base of the Washington Monument and back, a total distance of 7400 m. His first measurement was 0.000024828 seconds, or 24,828 nanoseconds (\(10^9\) nanoseconds = 1 second).
We want to find a 95% confidence interval for the speed of light.
attach(newcomb)
t.test(Measurement)
##
## One Sample t-test
##
## data: Measurement
## t = 18770, df = 65, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 24823.57 24828.85
## sample estimates:
## mean of x
## 24826.21
The assumptions for this method are that the data come from a normal distribution, or at least that the sample size is large enough for the central limit theorem to apply.
Let’s check:
y <- Measurement
dta <- data.frame(y = y,
x = rep(" ", length(y)))
ggplot(aes(x, y), data = dta) +
geom_boxplot()
It seems there is at least one serious outlier. This should not happen if the data came from a normal distribution.
We could proceed in one of two ways: either remove the outlier(s) and redo the t test, or base the inference on the median instead of the mean. First, removing the outlier:
x <- Measurement[Measurement>24775]
t.test(x)$conf.int
## [1] 24825.74 24828.84
## attr(,"conf.level")
## [1] 0.95
The second option is to use the median instead of the mean:
median(Measurement)
## [1] 24827
but now we need to find a confidence interval for the median. That can be done with the non-parametric Wilcoxon Signed Rank method:
wilcox.test(newcomb$Measurement, conf.int = TRUE)
##
## Wilcoxon signed rank test with continuity correction
##
## data: newcomb$Measurement
## V = 2211, p-value = 1.633e-12
## alternative hypothesis: true location is not equal to 0
## 95 percent confidence interval:
## 24826.0 24828.5
## sample estimates:
## (pseudo)median
## 24827.5
Some Monarch butterflies fly early in the day, others somewhat later. After the flight they have to rest for a short period. It has been theorized that the resting period (RIP) of butterflies flying early in the morning is shorter because this is a thermoregulatory mechanism, and it is cooler in the mornings. The mean RIP of all Monarch butterflies is 133 sec. Test the theory at the 10% level.
Research by Anson Lui, Resting period of early and late flying Monarch butterflies Danaeus plexippus, 1997
t.test(butterflies$RIP.sec.,
mu=133,
alternative = "less")$p.value
## [1] 0.05583963
The p-value is 0.056, which is less than 0.1, so at the 10% level we reject the null hypothesis: there is some evidence that the mean resting period of the early flyers is less than 133 seconds.
Checking the assumption:
y <- butterflies$RIP.sec.
dta <- data.frame(y = y,
x = rep(" ", length(y)))
ggplot(aes(x, y), data = dta) +
geom_boxplot()
looks good.
The R routine for inference for a proportion (or a probability or a percentage) is binom.test. This implements a method by Clopper and Pearson (1934). This method is exact and has no assumptions.
Note The formula discussed in many introductory statistics courses for the confidence interval is
\[ \hat p \pm z_{\alpha/2} \sqrt{\frac{\hat p (1-\hat p)}{n} } \]
where \(\hat p\) is the proportion of success. This leads to confidence intervals that are now known to be quite wrong, and so this method should not be used anymore. The same is true for the corresponding hypothesis test. This method (actually a slight improvement due to Wilson (1927)) is implemented in R by prop.test.
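To see the difference between the methods, here is a quick comparison using made-up numbers (say 40 successes in 100 trials); the first interval is the textbook formula computed by hand, the other two come from the R commands mentioned above:
# textbook (Wald) interval, computed by hand
phat <- 40/100
round(phat + c(-1, 1)*qnorm(0.975)*sqrt(phat*(1 - phat)/100), 3)
# Wilson-type interval (prop.test uses a continuity correction by default)
round(as.numeric(prop.test(40, 100)$conf.int), 3)
# exact Clopper-Pearson interval
round(as.numeric(binom.test(40, 100)$conf.int), 3)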
The South African John Kerrich spent some time in a German prisoner of war camp during World War II. He used his time to flip a coin 10000 times, resulting in 5067 heads.
Test at the 5% level of significance whether 5067 heads in 10000 flips are compatible with a fair coin.
binom.test(x = 5067, n = 10000)$p.value
## [1] 0.1835155
The p-value is 0.18, which is greater than 0.05, so we fail to reject the null hypothesis: 5067 heads in 10000 flips are compatible with a fair coin.
Say some polling institute wants to conduct a poll for the next election for president. They will then find a 95% confidence interval and they want this interval to have an error of 3 percentage points (aka \(\pm 0.03\)). What sample size do they need?
In American politics the two parties are always very close, so in a poll with n people about n/2 will vote for one or the other party. Let’s do a little trial and error:
n <- 100
round(diff(as.numeric(binom.test(n/2, n)$conf.int)/2), 3)
## [1] 0.102
Now that is too large, so
n <- 200
round(diff(as.numeric(binom.test(n/2, n)$conf.int)/2), 3)
## [1] 0.071
n <- 400
round(diff(as.numeric(binom.test(n/2, n)$conf.int)/2), 3)
## [1] 0.05
n <- 800
round(diff(as.numeric(binom.test(n/2, n)$conf.int)/2), 3)
## [1] 0.035
n <- 1200
round(diff(as.numeric(binom.test(n/2, n)$conf.int)/2), 3)
## [1] 0.029
n <- 1100
round(diff(as.numeric(binom.test(n/2, n)$conf.int)/2), 3)
## [1] 0.03
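The trial and error can also be automated; here is a small sketch (the starting value and the step size of 50 are arbitrary, chosen so that n/2 stays an integer):
# increase n until the margin of error (half the width of the
# confidence interval) is at most 3 percentage points
n <- 100
repeat {
  moe <- diff(as.numeric(binom.test(n/2, n)$conf.int))/2
  if (moe <= 0.03) break
  n <- n + 50
}
n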
So about 1100 people are enough. There is something quite remarkable about this result: the required sample size does not depend on the size of the population!
What are the correlations between the various variables?
head(upr, 2)
## ID.Code Year Gender Program.Code Highschool.GPA Aptitud.Verbal
## 1 00C2B4EF77 2005 M 502 3.97 647
## 2 00D66CF1BF 2003 M 502 3.80 597
## Aptitud.Matem Aprov.Ingles Aprov.Matem Aprov.Espanol IGS Freshmen.GPA
## 1 621 626 672 551 342 3.67
## 2 726 618 718 575 343 2.75
## Graduated Year.Grad. Grad..GPA Class.Facultad
## 1 Si 2012 3.33 INGE
## 2 No NA NA INGE
Let’s take out those variables that are not useful for predicting success, either because we don’t have their values at the time of the admissions process (Freshmen.GPA) or for legal reasons (Gender):
x <- upr[, -c(1:4, 11:16)]
head(x, 2)
## Highschool.GPA Aptitud.Verbal Aptitud.Matem Aprov.Ingles Aprov.Matem
## 1 3.97 647 621 626 672
## 2 3.80 597 726 618 718
## Aprov.Espanol
## 1 551
## 2 575
round(cor(x, use = "complete.obs") ,3)
## Highschool.GPA Aptitud.Verbal Aptitud.Matem Aprov.Ingles
## Highschool.GPA 1.000 0.176 0.156 0.049
## Aptitud.Verbal 0.176 1.000 0.461 0.513
## Aptitud.Matem 0.156 0.461 1.000 0.456
## Aprov.Ingles 0.049 0.513 0.456 1.000
## Aprov.Matem 0.216 0.474 0.819 0.481
## Aprov.Espanol 0.247 0.602 0.389 0.428
## Aprov.Matem Aprov.Espanol
## Highschool.GPA 0.216 0.247
## Aptitud.Verbal 0.474 0.602
## Aptitud.Matem 0.819 0.389
## Aprov.Ingles 0.481 0.428
## Aprov.Matem 1.000 0.404
## Aprov.Espanol 0.404 1.000
Somewhat surprisingly, Highschool.GPA is not highly correlated with any of the tests. The highest correlation is between the two math tests (0.819).
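One way to look at all of these relationships at once is a scatterplot matrix; this is not part of the original analysis, just a quick sketch using base R's pairs (with roughly 24000 rows it can take a moment to draw):
# scatterplot matrix of high school GPA and the test scores;
# pch = "." keeps the plot readable with this many points
pairs(x, pch = ".", gap = 0)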
In 1970, Congress instituted a random selection process for the military draft. All 366 possible birth dates were placed in plastic capsules in a rotating drum and were selected one by one. The first date drawn from the drum received draft number one and eligible men born on that date were drafted first. In a truly random lottery there should be no relationship between the date and the draft number.
Question: was the draft really “random”?
Here we have two quantitative variables, so we start with the scatterplot:
attach(draft)
ggplot(aes(x, y),
data=data.frame(x=Day.of.Year,
y=Draft.Number)) +
geom_point()
and this does not look like there is a problem with independence.
However:
cor.test(Draft.Number, Day.of.Year)$p.value
## [1] 1.263829e-05
The p-value is very small, so we reject the null hypothesis of no correlation: the draft numbers are related to the day of the year, and the lottery was not truly random.
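To actually see this weak relationship in the scatterplot we can add a non-parametric smoother; a sketch, not part of the original analysis:
# same scatterplot as above, with a smoother added;
# the weak trend it reveals corresponds to the significant correlation
ggplot(aes(x, y),
       data = data.frame(x = Day.of.Year, y = Draft.Number)) +
  geom_point() +
  geom_smooth(se = FALSE) +
  labs(x = "Day of the Year", y = "Draft Number")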
Data is from O’Carroll PW, Alkon E, Weiss B. Drowning mortality in Los Angeles County, 1976 to 1984, JAMA, 1988 Jul 15;260(3):380-3.
Drowning is the fourth leading cause of unintentional injury death in Los Angeles County. They examined data collected by the Los Angeles County Coroner’s Office on drownings that occurred in the county from 1976 through 1984. There were 1587 drownings (1130 males and 457 females) during this nine-year period.
kable.nice(drownings)
Male | Female | |
---|---|---|
Private Swimming Pool | 488 | 219 |
Bathtub | 115 | 132 |
Ocean | 231 | 40 |
Freshwater bodies | 155 | 19 |
Hottubs | 16 | 15 |
Reservoirs | 32 | 2 |
Other Pools | 46 | 14 |
Pails, basins, toilets | 7 | 4 |
Other | 40 | 12 |
Here we have two categorical variables (Method of Drowning and Gender). We want to know whether the variables are independent. The most popular method of analysis for this type of problem is Pearson’s chi-square test of independence. It is done with the command chisq.test, and it has the assumption of no expected counts less than 5.
chisq.test(drownings)
##
## Pearson's Chi-squared test
##
## data: drownings
## X-squared = 144.48, df = 8, p-value < 2.2e-16
Let’s see whether there is a problem with the assumptions:
kable.nice(round(chisq.test(drownings)$expected, 1))
Male | Female | |
---|---|---|
Private Swimming Pool | 503.4 | 203.6 |
Bathtub | 175.9 | 71.1 |
Ocean | 193.0 | 78.0 |
Freshwater bodies | 123.9 | 50.1 |
Hottubs | 22.1 | 8.9 |
Reservoirs | 24.2 | 9.8 |
Other Pools | 42.7 | 17.3 |
Pails, basins, toilets | 7.8 | 3.2 |
Other | 37.0 | 15.0 |
and we see that the expected count for the combination of Pails, basins, toilets and Female is 3.2. In real life this would be considered ok, but it is also easy to fix:
# combine the two smallest categories (Pails, basins, toilets and Other) into one row
newmale <- c(drownings[1:7, 1], 7+40)
newfemale <- c(drownings[1:7, 2], 4+12)
newdrown <- cbind(newmale, newfemale)
rownames(newdrown)[8] <- "Other"
kable.nice(newdrown)
newmale | newfemale | |
---|---|---|
Private Swimming Pool | 488 | 219 |
Bathtub | 115 | 132 |
Ocean | 231 | 40 |
Freshwater bodies | 155 | 19 |
Hottubs | 16 | 15 |
Reservoirs | 32 | 2 |
Other Pools | 46 | 14 |
Other | 47 | 16 |
out <- chisq.test(newdrown)
kable.nice(round(out$expected, 1))
newmale | newfemale | |
---|---|---|
Private Swimming Pool | 503.4 | 203.6 |
Bathtub | 175.9 | 71.1 |
Ocean | 193.0 | 78.0 |
Freshwater bodies | 123.9 | 50.1 |
Hottubs | 22.1 | 8.9 |
Reservoirs | 24.2 | 9.8 |
Other Pools | 42.7 | 17.3 |
Other | 44.9 | 18.1 |
out$p.value
## [1] 8.518561e-28
Chasnoff and others obtained several measures and responses for newborn babies whose mothers were classified by degree of cocaine use.
The study was conducted in the Perinatal Center for Chemical Dependence at Northwestern University Medical School. The measurement given here is the length of the newborn.
Source: Cocaine abuse during pregnancy: correlation between prenatal care and perinatal outcome
Authors: SN MacGregor, LG Keith, JA Bachicha, and IJ Chasnoff
Obstetrics and Gynecology 1989;74:882-885
attach(mothers)
kable.nice(mothers[sample(1:nrow(mothers), 10), ],
do.row.names = FALSE)
Status | Length |
---|---|
Drug Free | 51.7 |
Drug Free | 51.8 |
Drug Free | 53.7 |
Drug Free | 52.1 |
First Trimester | 48.5 |
Drug Free | 52.9 |
Throughout | 48.2 |
Drug Free | 50.4 |
Throughout | 51.5 |
Drug Free | 56.5 |
dta <- data.frame(y = Length,
x = Status)
ggplot(aes(x, y), data = dta) +
geom_boxplot() +
labs(x="", y="Length")
out <- matrix(0, 3, 3)
colnames(out) <- c("Size", "Mean", "SD")
rownames(out) <- unique(Status)
out[, 1] <- tapply(Length, Status, length)
out[, 2] <- round(tapply(Length, Status, mean), 2)
out[, 3] <- round(tapply(Length, Status, sd), 2)
kable.nice(out)
Size | Mean | SD | |
---|---|---|---|
Drug Free | 39 | 51.1 | 2.9 |
First Trimester | 19 | 49.3 | 2.5 |
Throughout | 36 | 48.0 | 3.6 |
The standard method for this problem is called ANOVA (Analysis of Variance) and is run with the aov command.
fit <- aov(Length~Status)
summary(fit)
## Df Sum Sq Mean Sq F value Pr(>F)
## Status 2 181.4 90.69 9.319 0.000208
## Residuals 91 885.6 9.73
As part of the test we need to check the assumptions of the method: the residuals should come from a normal distribution and the groups should have (roughly) equal variance.
ggplot(data.frame(x=fit$res), aes(sample=x)) +
stat_qq() +
stat_qq_line()
looks fine
The box plot shows that the variances are about the same.
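If a formal check of the equal-variance assumption is preferred, one option (not part of the original analysis, just a sketch) is Bartlett's test:
# Bartlett's test of the null hypothesis of equal variances in the three groups
# (it is sensitive to non-normality, so the graphical check is often preferred)
bartlett.test(Length ~ Status)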
Often if the null of no difference is rejected, one wants to go a step further and do a pairwise comparison:
There are a number of methods for this problem; a popular one is due to Tukey:
tuk <- TukeyHSD(fit)
plot(tuk)
This draws confidence intervals for the difference in means of all pairs. If an interval does not contain 0, the corresponding pair is statistically significantly different.
Here that is the case only for Drug Free - Throughout, so the other two pairs are not statistically significantly different. Remember, however, that failing to reject \(H_0\) is NOT the same as accepting \(H_0\). The fact that those pairs are not statistically significantly different is almost certainly due to the small sample sizes.
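If the actual numbers are wanted rather than the plot, they can be pulled out of the TukeyHSD object; the component is named after the factor in the model, here Status:
# numeric version of the Tukey comparisons: estimated differences,
# confidence limits and adjusted p-values
round(tuk$Status, 3)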
In Westchester County, north of New York City, Consolidated Edison bills residential customers for electricity on a monthly basis. The company wants to predict residential usage, in order to plan purchases of fuel and budget revenue flow. The data consists of the usage (in kilowatt-hours per day) and the average monthly temperature for 55 consecutive months for an all-electric home in Westchester County, NY.
attach(elusage)
kable.nice(head(elusage), do.row.names = FALSE)
Month | Year | Usage | Temperature |
---|---|---|---|
8 | 1989 | 24.828 | 73 |
9 | 1989 | 24.688 | 67 |
10 | 1989 | 19.310 | 57 |
11 | 1989 | 59.706 | 43 |
12 | 1989 | 99.667 | 26 |
1 | 1990 | 49.333 | 41 |
ggplot(aes(Temperature, Usage), data=elusage) +
geom_point()
We want to find a model (aka a function) f with
\[ \text{Usage}=f(\text{Temperature}) + \epsilon \] where \(\epsilon\) is some random variable, often with the assumption \(\epsilon \sim N(0, \sigma)\).
The simplest choice is a linear model:
\[ \text{Usage}=\beta_0 + \beta_1 \text{ Temperature} + \epsilon \]
fit <- lm(Usage~Temperature, data=elusage)
summary(fit)
##
## Call:
## lm(formula = Usage ~ Temperature, data = elusage)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.305 -8.163 0.559 7.723 28.611
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 116.71619 5.56494 20.97 <2e-16
## Temperature -1.36461 0.09941 -13.73 <2e-16
##
## Residual standard error: 11.35 on 53 degrees of freedom
## Multiple R-squared: 0.7805, Adjusted R-squared: 0.7763
## F-statistic: 188.4 on 1 and 53 DF, p-value: < 2.2e-16
This gives the fitted model as
\[ \text{Usage} = 116.7 - 1.36 \text{ Temperature} + \epsilon \]
How do we know that this is a good model? The main diagnostic is the Residual vs Fits plot:
Fits <- 116.7 - 1.36*Temperature
Residuals <- Usage - Fits
Actually, they are already part of the fit object:
ggplot(aes(Fits, Residuals),
data=data.frame(Fits=fitted(fit),
Residuals=residuals(fit))) +
geom_point() +
geom_smooth(se=FALSE) +
geom_abline(slope = 0)
The idea is that if the model is good, the residuals and the fitted values should be independent, and so this graph should not show any pattern. Adding the non-parametric fit and a horizontal line shows that this is not the case here.
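As an aside, base R produces a similar diagnostic directly from the fitted object:
# residuals vs. fitted values, with a smoother added automatically
plot(fit, which = 1)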
Let’s add a quadratic term to the model
\[ \text{Usage}=\beta_0 + \beta_1 \text{ Temperature} + \beta_2 \text{ Temperature}^2 + \epsilon \]
T2 <- Temperature^2
quad.fit <- lm(Usage~Temperature + T2,
data=elusage)
summary(quad.fit)
##
## Call:
## lm(formula = Usage ~ Temperature + T2, data = elusage)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.8593 -5.6813 0.3756 5.4650 21.9701
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 196.715321 17.487567 11.249 1.53e-15
## Temperature -4.640462 0.694906 -6.678 1.62e-08
## T2 0.030732 0.006472 4.749 1.65e-05
##
## Residual standard error: 9.574 on 52 degrees of freedom
## Multiple R-squared: 0.8469, Adjusted R-squared: 0.841
## F-statistic: 143.8 on 2 and 52 DF, p-value: < 2.2e-16
ggplot(aes(Fits, Residuals),
data=data.frame(Fits=fitted(quad.fit),
Residuals=residuals(quad.fit))) +
geom_point() +
geom_smooth(se=FALSE) +
geom_abline(slope = 0)
and that is much better. If it were not, one could continue and add even higher-order terms. There will always be a polynomial model that gives a good fit.
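Higher-order polynomials can be fit in one step with poly; a sketch for a cubic model (cubic.fit is just an illustrative name):
# cubic polynomial in Temperature; raw = TRUE uses ordinary powers
cubic.fit <- lm(Usage ~ poly(Temperature, 3, raw = TRUE), data = elusage)
round(100*summary(cubic.fit)$r.squared, 2)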
Here we use some transformation (function) of the data, for example the \(\log\):
log.usage <- log(Usage)
log.temp <- log(Temperature)
log.fit <- lm(log.usage~log.temp)
summary(log.fit)
ggplot(aes(Fits, Residuals),
data=data.frame(Fits=fitted(log.fit),
Residuals=residuals(log.fit))) +
geom_point() +
geom_smooth(se=FALSE) +
geom_abline(slope = 0)
and that is also not too bad. Other standard choices for transformations are the square root and the inverse, but in principle any function might work.
How do we choose among these models? A standard measure of the quality of the fit is the Coefficient of Determination. It is defined as
\[ R^2 = \text{cor}(\text{Observed Values, Predicted Values})^2 \times 100\% \]
The better a model is, the more correlated its fitted values and the observed values should be, so if we have a choice of two models, the one with the higher \(R^2\) is better.
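As a quick sanity check of this definition we can compute the squared correlation of the observed and fitted values for the linear model and compare it with the value reported by summary (this assumes, as above, that elusage has no missing values):
# squared correlation of observed and fitted values, in percent
round(100*cor(elusage$Usage, fitted(fit))^2, 2)
# R^2 as reported by summary()
round(100*summary(fit)$r.squared, 2)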
Here we find
# Linear Model
round(100*unlist(summary(fit)$r.squared), 2)
## [1] 78.05
# Quadratic Model
round(100*unlist(summary(quad.fit)$r.squared), 2)
## [1] 84.69
# Log Transform Model
round(100*unlist(summary(log.fit)$r.squared), 2)
## [1] 81.12
but we need to be careful here: the linear model is a special case of the quadratic model, and so its \(R^2\) can never be smaller. There are other ways to choose between such nested models, for example the F test, but here this is not an issue because the linear model is bad anyway.
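The F test for nested models mentioned above can be run with the anova command; a sketch comparing the linear and the quadratic model:
# F test of the null hypothesis that the quadratic term is not needed
anova(fit, quad.fit)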
Now the \(R^2\) of the quadratic model is \(84.7\%\) and that of the log transform model is \(81.1\%\), so the quadratic one is better.
Finally, let’s have a look at what those models look like:
x <- seq(min(Temperature), max(Temperature), length=100)
y.quad <- predict(quad.fit,
newdata=data.frame(Temperature=x,
T2 = x^2))
y.log <- exp(predict(log.fit,
newdata=data.frame(log.temp=log(x))))
dta <- data.frame(x=c(x, x),
y=c(y.quad, y.log),
Model=rep(c("Quadratic", "Log"),
each=100))
ggplot(aes(x, y, color=Model), data=dta) +
geom_line(size=1.2) +
geom_point(data=elusage,
aes(Temperature, Usage),
inherit.aes = FALSE) +
xlab("Temperature") +
ylab("Usage")