In this section we will use R to analyze a number of data sets. I will assume that you are familiar with basic concepts such as confidence intervals and hypothesis testing. If you are not, you can read up on them on the web page of my course ESMA 3101: Introduction to Statistics I.
We have talked about the upr admissions data before. Here are some simple things to do when looking at this kind of data:
Categorical data is essentially what R calls factors: a few values repeated many times. To start, they are usually organized in a table of counts:
Gender <- table(upr$Gender)
names(Gender) <- c("Female", "Male")
Percentage <- round(Gender/sum(Gender)*100, 1)
kable.nice(cbind(Gender, Percentage), do.row.names = FALSE)
Gender | Percentage |
---|---|
11487 | 48.5 |
12179 | 51.5 |
When we want to study the relationship of two of these we can find the contingency table:
kable.nice(table(upr$Year, upr$Gender))
F | M | |
---|---|---|
2003 | 1102 | 1151 |
2004 | 1040 | 1118 |
2005 | 1162 | 1138 |
2006 | 1137 | 1098 |
2007 | 1208 | 1256 |
2008 | 1219 | 1219 |
2009 | 1180 | 1237 |
2010 | 958 | 1073 |
2011 | 853 | 919 |
2012 | 769 | 979 |
2013 | 859 | 991 |
or with percentages based on grand total:
kable.nice(round(table(upr$Year, upr$Gender)/length(upr$Year)*100, 2))
F | M | |
---|---|---|
2003 | 4.66 | 4.86 |
2004 | 4.39 | 4.72 |
2005 | 4.91 | 4.81 |
2006 | 4.80 | 4.64 |
2007 | 5.10 | 5.31 |
2008 | 5.15 | 5.15 |
2009 | 4.99 | 5.23 |
2010 | 4.05 | 4.53 |
2011 | 3.60 | 3.88 |
2012 | 3.25 | 4.14 |
2013 | 3.63 | 4.19 |
Often more interesting are percentages based on row totals:
tmp <- table(upr$Year, upr$Gender)
apply(tmp, 1, sum)
## 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013
## 2253 2158 2300 2235 2464 2438 2417 2031 1772 1748 1850
tmp <- round(tmp/apply(tmp, 1, sum)*100, 1)
kable.nice(tmp)
F | M | |
---|---|---|
2003 | 48.9 | 51.1 |
2004 | 48.2 | 51.8 |
2005 | 50.5 | 49.5 |
2006 | 50.9 | 49.1 |
2007 | 49.0 | 51.0 |
2008 | 50.0 | 50.0 |
2009 | 48.8 | 51.2 |
2010 | 47.2 | 52.8 |
2011 | 48.1 | 51.9 |
2012 | 44.0 | 56.0 |
2013 | 46.4 | 53.6 |
Notice that here the row sums are always 100. Of course one could also use column totals.
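The division by the row (or column) totals can also be done directly with prop.table; a small sketch:
# margin = 1 divides by row totals, margin = 2 by column totals
kable.nice(round(100*prop.table(table(upr$Year, upr$Gender), margin = 1), 1))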
There is a nicer version of this:
library(gmodels)
CrossTable(upr$Year, upr$Gender,
prop.c=FALSE, prop.t=FALSE, prop.chisq = FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## |-------------------------|
##
##
## Total Observations in Table: 23666
##
##
## | upr$Gender
## upr$Year | F | M | Row Total |
## -------------|-----------|-----------|-----------|
## 2003 | 1102 | 1151 | 2253 |
## | 0.489 | 0.511 | 0.095 |
## -------------|-----------|-----------|-----------|
## 2004 | 1040 | 1118 | 2158 |
## | 0.482 | 0.518 | 0.091 |
## -------------|-----------|-----------|-----------|
## 2005 | 1162 | 1138 | 2300 |
## | 0.505 | 0.495 | 0.097 |
## -------------|-----------|-----------|-----------|
## 2006 | 1137 | 1098 | 2235 |
## | 0.509 | 0.491 | 0.094 |
## -------------|-----------|-----------|-----------|
## 2007 | 1208 | 1256 | 2464 |
## | 0.490 | 0.510 | 0.104 |
## -------------|-----------|-----------|-----------|
## 2008 | 1219 | 1219 | 2438 |
## | 0.500 | 0.500 | 0.103 |
## -------------|-----------|-----------|-----------|
## 2009 | 1180 | 1237 | 2417 |
## | 0.488 | 0.512 | 0.102 |
## -------------|-----------|-----------|-----------|
## 2010 | 958 | 1073 | 2031 |
## | 0.472 | 0.528 | 0.086 |
## -------------|-----------|-----------|-----------|
## 2011 | 853 | 919 | 1772 |
## | 0.481 | 0.519 | 0.075 |
## -------------|-----------|-----------|-----------|
## 2012 | 769 | 979 | 1748 |
## | 0.440 | 0.560 | 0.074 |
## -------------|-----------|-----------|-----------|
## 2013 | 859 | 991 | 1850 |
## | 0.464 | 0.536 | 0.078 |
## -------------|-----------|-----------|-----------|
## Column Total | 11487 | 12179 | 23666 |
## -------------|-----------|-----------|-----------|
##
##
As for graphs, we have the bar chart:
ggplot(upr, aes(x=Year, fill=Gender)) +
geom_bar(position="dodge", alpha=0.75) +
labs(x="", y="Counts")
or again based on grand total percentages:
ggplot(upr, aes(x=Year, fill=Gender)) +
geom_bar(aes(y = (..count..)/sum(..count..)),
position="dodge", alpha=0.75) +
scale_y_continuous(labels=scales::percent) +
labs(x="", y="Percentage",
title="Percentages based on Grand Total")
When we have quantitative data we can find summary statistics such as the mean, median, standard deviation and quantiles:
round(mean(upr$Freshmen.GPA, na.rm=TRUE), 3)
## [1] 2.733
round(median(upr$Freshmen.GPA, na.rm=TRUE), 3)
## [1] 2.83
round(sd(upr$Freshmen.GPA, na.rm=TRUE), 3) # Standard Deviation
## [1] 0.779
round(quantile(upr$Freshmen.GPA,
probs = c(0.1, 0.25, 0.75, 0.9),
na.rm=TRUE), 3) # Quantiles and Quartiles
## 10% 25% 75% 90%
## 1.71 2.32 3.28 3.65
Notice also that I have rounded all the answers. Proper rounding is an important thing to do!
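Much of this information is also available in one step with summary, which for a numeric vector reports the quartiles, the mean and the number of missing values:
summary(upr$Freshmen.GPA)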
bw <- 4/50
# use about 50 bins
plt1 <- ggplot(upr, aes(Freshmen.GPA)) +
geom_histogram(color = "black",
fill = "white",
binwidth = bw) +
labs(x = "Freshmen GPA", y = "Counts")
plt2 <- ggplot(upr, aes(Gender,Freshmen.GPA)) +
geom_boxplot()
library(grid)  # pushViewport, viewport and grid.layout come from the grid package
pushViewport(viewport(layout = grid.layout(1, 2)))
print(plt1 ,
vp=viewport(layout.pos.row=1, layout.pos.col=1))
print(plt2 ,
vp=viewport(layout.pos.row=1, layout.pos.col=2))
round(cor(upr$Year, upr$Freshmen.GPA,
use="complete.obs"), 3)
## [1] 0.097
ggplot(upr, aes(Year, Freshmen.GPA)) +
geom_jitter(shape=16, size=1/10, width = 0.25) +
scale_x_continuous(breaks = 2003:2013) +
labs(x="Year", y="GPA after Freshmen Year")
The basic R command for inference for a population mean is t.test:
Simon Newcomb made a series of measurements of the speed of light between July and September 1880. He measured the time in seconds that a light signal took to pass from his laboratory on the Potomac River to a mirror at the base of the Washington Monument and back, a total distance of 7400 m. His first measurement was 0.000024828 seconds, or 24,828 nanoseconds (\(10^9\) nanoseconds = 1 second).
We want to find a 95% confidence interval for the speed of light.
attach(newcomb)
t.test(Measurement)
##
## One Sample t-test
##
## data: Measurement
## t = 18770, df = 65, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 24823.57 24828.85
## sample estimates:
## mean of x
## 24826.21
The assumptions for this method are that the data come from a normal distribution, or at least that the sample size is large enough for the central limit theorem to apply.
Let’s check:
y <- Measurement
dta <- data.frame(y = y,
x = rep(" ", length(y)))
ggplot(aes(x, y), data = dta) +
geom_boxplot()
It seems there is at least one serious outlier. This should not happen if the data came from a normal distribution.
We could proceed in one of two ways: either remove the outlier(s) and redo the t test, or base the inference on the median instead of the mean. First, removing the outlier:
x <- Measurement[Measurement>24775]
t.test(x)$conf.int
## [1] 24825.74 24828.84
## attr(,"conf.level")
## [1] 0.95
The second option is to use the median instead of the mean:
median(Measurement)
## [1] 24827
but now we need to find a confidence interval for the median. That can be done with the non-parametric Wilcoxon Signed Rank method:
wilcox.test(newcomb$Measurement, conf.int = TRUE)
##
## Wilcoxon signed rank test with continuity correction
##
## data: newcomb$Measurement
## V = 2211, p-value = 1.633e-12
## alternative hypothesis: true location is not equal to 0
## 95 percent confidence interval:
## 24826.0 24828.5
## sample estimates:
## (pseudo)median
## 24827.5
Some Monarch butterflies fly early in the day, others somewhat later. After the flight they have to rest for a short period. It has been theorized that the resting period (RIP) of butterflies flying early in the morning is shorter because this is a thermoregulatory mechanism, and it is cooler in the mornings. The mean RIP of all Monarch butterflies is 133 sec. Test the theory at the 10% level.
Research by Anson Lui, Resting period of early and late flying Monarch butterflies Danaeus plexippus, 1997
t.test(butterflies$RIP.sec.,
mu=133,
alternative = "less")$p.value
## [1] 0.05583963
The p-value is 0.056, which is less than 0.1, so at the 10% level we reject the null hypothesis: there is some evidence that the mean resting period of the early flyers is less than 133 seconds.
Checking the assumption:
y <- butterflies$RIP.sec.
dta <- data.frame(y = y,
x = rep(" ", length(y)))
ggplot(aes(x, y), data = dta) +
geom_boxplot()
looks good.
The R routine for inference for a proportion (or a probability or a percentage) is binom.test. This implements a method by Clopper and Pearson (1934). This method is exact and has no assumptions.
Note The formula discussed in many introductory statistics courses for the confidence interval is
\[ \hat p \pm z_{\alpha/2} \sqrt{\frac{\hat p (1-\hat p)}{n} } \]
where \(\hat p\) is the proportion of success. This leads to confidence intervals that are now known to be quite wrong, and so this method should not be used anymore. The same is true for the corresponding hypothesis test. This method (actually a slight improvement due to Wilson (1927)) is implemented in R by prop.test.
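To see the difference between the methods, here is a quick comparison using made-up numbers (say 40 successes in 100 trials); the first interval is the textbook formula computed by hand, the other two come from the R commands mentioned above:
# textbook (Wald) interval, computed by hand
phat <- 40/100
round(phat + c(-1, 1)*qnorm(0.975)*sqrt(phat*(1 - phat)/100), 3)
# Wilson-type interval (prop.test uses a continuity correction by default)
round(as.numeric(prop.test(40, 100)$conf.int), 3)
# exact Clopper-Pearson interval
round(as.numeric(binom.test(40, 100)$conf.int), 3)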
The South African John Kerrich spent some time in a German prisoner of war camp during World War II. He used his time to flip a coin 10000 times, resulting in 5067 heads.
Test at the 5% level of significance whether 5067 heads in 10000 flips are compatible with a fair coin.
binom.test(x = 5067, n = 10000)$p.value
## [1] 0.1835155
The p-value is 0.18, which is greater than 0.05, so we fail to reject the null hypothesis: 5067 heads in 10000 flips are compatible with a fair coin.
Say some polling institute wants to conduct a poll for the next election for president. They will then find a 95% confidence interval and they want this interval to have an error of 3 percentage points (aka \(\pm 0.03\)). What sample size do they need?
In American politics the two parties are always very close, so in a poll with n people about n/2 will vote for one or the other party. Let’s do a little trial and error:
n <- 100
round(diff(as.numeric(binom.test(n/2, n)$conf.int)/2), 3)
## [1] 0.102
Now that is too large, so
n <- 200
round(diff(as.numeric(binom.test(n/2, n)$conf.int)/2), 3)
## [1] 0.071
n <- 400
round(diff(as.numeric(binom.test(n/2, n)$conf.int)/2), 3)
## [1] 0.05
n <- 800
round(diff(as.numeric(binom.test(n/2, n)$conf.int)/2), 3)
## [1] 0.035
n <- 1200
round(diff(as.numeric(binom.test(n/2, n)$conf.int)/2), 3)
## [1] 0.029
n <- 1100
round(diff(as.numeric(binom.test(n/2, n)$conf.int)/2), 3)
## [1] 0.03
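The trial and error can also be automated; here is a small sketch (the starting value and the step size of 50 are arbitrary, chosen so that n/2 stays an integer):
# increase n until the margin of error (half the width of the
# confidence interval) is at most 3 percentage points
n <- 100
repeat {
  moe <- diff(as.numeric(binom.test(n/2, n)$conf.int))/2
  if (moe <= 0.03) break
  n <- n + 50
}
n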
So about 1100 people are enough. There is something quite remarkable about this result: the required sample size does not depend on the size of the population!
What are the correlations between the various variables?
head(upr, 2)
## ID.Code Year Gender Program.Code Highschool.GPA Aptitud.Verbal
## 1 00C2B4EF77 2005 M 502 3.97 647
## 2 00D66CF1BF 2003 M 502 3.80 597
## Aptitud.Matem Aprov.Ingles Aprov.Matem Aprov.Espanol IGS Freshmen.GPA
## 1 621 626 672 551 342 3.67
## 2 726 618 718 575 343 2.75
## Graduated Year.Grad. Grad..GPA Class.Facultad
## 1 Si 2012 3.33 INGE
## 2 No NA NA INGE
Let’s take out those variables that are not useful for predicting success, either because we don’t have their values at the time of the admissions process (Freshmen.GPA) or for legal reasons (Gender):
x <- upr[, -c(1:4, 11:16)]
head(x, 2)
## Highschool.GPA Aptitud.Verbal Aptitud.Matem Aprov.Ingles Aprov.Matem
## 1 3.97 647 621 626 672
## 2 3.80 597 726 618 718
## Aprov.Espanol
## 1 551
## 2 575
round(cor(x, use = "complete.obs") ,3)
## Highschool.GPA Aptitud.Verbal Aptitud.Matem Aprov.Ingles
## Highschool.GPA 1.000 0.176 0.156 0.049
## Aptitud.Verbal 0.176 1.000 0.461 0.513
## Aptitud.Matem 0.156 0.461 1.000 0.456
## Aprov.Ingles 0.049 0.513 0.456 1.000
## Aprov.Matem 0.216 0.474 0.819 0.481
## Aprov.Espanol 0.247 0.602 0.389 0.428
## Aprov.Matem Aprov.Espanol
## Highschool.GPA 0.216 0.247
## Aptitud.Verbal 0.474 0.602
## Aptitud.Matem 0.819 0.389
## Aprov.Ingles 0.481 0.428
## Aprov.Matem 1.000 0.404
## Aprov.Espanol 0.404 1.000
Somewhat surprisingly, Highschool.GPA is not highly correlated with any of the tests. The highest correlation is between the two math tests (0.819).
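One way to look at all of these relationships at once is a scatterplot matrix; this is not part of the original analysis, just a quick sketch using base R's pairs (with roughly 24000 rows it can take a moment to draw):
# scatterplot matrix of high school GPA and the test scores;
# pch = "." keeps the plot readable with this many points
pairs(x, pch = ".", gap = 0)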
In 1970, Congress instituted a random selection process for the military draft. All 366 possible birth dates were placed in plastic capsules in a rotating drum and were selected one by one. The first date drawn from the drum received draft number one and eligible men born on that date were drafted first. In a truly random lottery there should be no relationship between the date and the draft number.
Question: was the draft really “random”?
Here we have two quantitative variables, so we start with the scatterplot:
attach(draft)
ggplot(aes(x, y),
data=data.frame(x=Day.of.Year,
y=Draft.Number)) +
geom_point()
and this does not look like there is a problem with independence.
However:
cor.test(Draft.Number, Day.of.Year)$p.value
## [1] 1.263829e-05
The p-value is very small, so we reject the null hypothesis of no correlation: the draft numbers are related to the day of the year, and the lottery was not truly random.
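To actually see this weak relationship in the scatterplot we can add a non-parametric smoother; a sketch, not part of the original analysis:
# same scatterplot as above, with a smoother added;
# the weak trend it reveals corresponds to the significant correlation
ggplot(aes(x, y),
       data = data.frame(x = Day.of.Year, y = Draft.Number)) +
  geom_point() +
  geom_smooth(se = FALSE) +
  labs(x = "Day of the Year", y = "Draft Number")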
Data is from O’Carroll PW, Alkon E, Weiss B. Drowning mortality in Los Angeles County, 1976 to 1984, JAMA, 1988 Jul 15;260(3):380-3.
Drowning is the fourth leading cause of unintentional injury death in Los Angeles County. They examined data collected by the Los Angeles County Coroner’s Office on drownings that occurred in the county from 1976 through 1984. There were 1587 drownings (1130 males and 457 females) during this nine-year period.
kable.nice(drownings)
Male | Female | |
---|---|---|
Private Swimming Pool | 488 | 219 |
Bathtub | 115 | 132 |
Ocean | 231 | 40 |
Freshwater bodies | 155 | 19 |
Hottubs | 16 | 15 |
Reservoirs | 32 | 2 |
Other Pools | 46 | 14 |
Pails, basins, toilets | 7 | 4 |
Other | 40 | 12 |
Here we have two categorical variables (Method of Drowning and Gender). We want to know whether the variables are independent. The most popular method of analysis for this type of problem is Pearson’s chi-square test of independence. It is done with the command chisq.test, and it has the assumption of no expected counts less than 5.
chisq.test(drownings)
##
## Pearson's Chi-squared test
##
## data: drownings
## X-squared = 144.48, df = 8, p-value < 2.2e-16
Let’s see whether there is a problem with the assumptions:
kable.nice(round(chisq.test(drownings)$expected, 1))
Male | Female | |
---|---|---|
Private Swimming Pool | 503.4 | 203.6 |
Bathtub | 175.9 | 71.1 |
Ocean | 193.0 | 78.0 |
Freshwater bodies | 123.9 | 50.1 |
Hottubs | 22.1 | 8.9 |
Reservoirs | 24.2 | 9.8 |
Other Pools | 42.7 | 17.3 |
Pails, basins, toilets | 7.8 | 3.2 |
Other | 37.0 | 15.0 |
and we see that the expected count for the combination of Pails, basins, toilets and Female is 3.2. In real life this would be considered ok, but it is also easy to fix:
# combine the two smallest categories (Pails, basins, toilets and Other) into one row
newmale <- c(drownings[1:7, 1], 7+40)
newfemale <- c(drownings[1:7, 2], 4+12)
newdrown <- cbind(newmale, newfemale)
rownames(newdrown)[8] <- "Other"
kable.nice(newdrown)
newmale | newfemale | |
---|---|---|
Private Swimming Pool | 488 | 219 |
Bathtub | 115 | 132 |
Ocean | 231 | 40 |
Freshwater bodies | 155 | 19 |
Hottubs | 16 | 15 |
Reservoirs | 32 | 2 |
Other Pools | 46 | 14 |
Other | 47 | 16 |
out <- chisq.test(newdrown)
kable.nice(round(out$expected, 1))
newmale | newfemale | |
---|---|---|
Private Swimming Pool | 503.4 | 203.6 |
Bathtub | 175.9 | 71.1 |
Ocean | 193.0 | 78.0 |
Freshwater bodies | 123.9 | 50.1 |
Hottubs | 22.1 | 8.9 |
Reservoirs | 24.2 | 9.8 |
Other Pools | 42.7 | 17.3 |
Other | 44.9 | 18.1 |
out$p.value
## [1] 8.518561e-28
Chasnoff and others obtained several measures and responses for newborn babies whose mothers were classified by degree of cocaine use.
The study was conducted in the Perinatal Center for Chemical Dependence at Northwestern University Medical School. The measurement given here is the length of the newborn.
Source: Cocaine abuse during pregnancy: correlation between prenatal care and perinatal outcome
Authors: SN MacGregor, LG Keith, JA Bachicha, and IJ Chasnoff
Obstetrics and Gynecology 1989;74:882-885
attach(mothers)
kable.nice(mothers[sample(1:nrow(mothers), 10), ],
do.row.names = FALSE)
Status | Length |
---|---|
Drug Free | 51.7 |
Drug Free | 51.8 |
Drug Free | 53.7 |
Drug Free | 52.1 |
First Trimester | 48.5 |
Drug Free | 52.9 |
Throughout | 48.2 |
Drug Free | 50.4 |
Throughout | 51.5 |
Drug Free | 56.5 |
dta <- data.frame(y = Length,
x = Status)
ggplot(aes(x, y), data = dta) +
geom_boxplot() +
labs(x="", y="Length")
out <- matrix(0, 3, 3)
colnames(out) <- c("Size", "Mean", "SD")
rownames(out) <- unique(Status)
out[, 1] <- tapply(Length, Status, length)
out[, 2] <- round(tapply(Length, Status, mean), 2)
out[, 3] <- round(tapply(Length, Status, sd), 2)
kable.nice(out)
Size | Mean | SD | |
---|---|---|---|
Drug Free | 39 | 51.1 | 2.9 |
First Trimester | 19 | 49.3 | 2.5 |
Throughout | 36 | 48.0 | 3.6 |
The standard method for this problem is called ANOVA (Analysis of Variance) and is run with the aov command.
fit <- aov(Length~Status)
summary(fit)
## Df Sum Sq Mean Sq F value Pr(>F)
## Status 2 181.4 90.69 9.319 0.000208
## Residuals 91 885.6 9.73
As part of the test we need to check the assumptions of the method: the residuals should come from a normal distribution and the groups should have (roughly) equal variance.
ggplot(data.frame(x=fit$res), aes(sample=x)) +
stat_qq() +
stat_qq_line()
looks fine
The box plot shows that the variances are about the same.
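If a formal check of the equal-variance assumption is preferred, one option (not part of the original analysis, just a sketch) is Bartlett's test:
# Bartlett's test of the null hypothesis of equal variances in the three groups
# (it is sensitive to non-normality, so the graphical check is often preferred)
bartlett.test(Length ~ Status)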
Often if the null of no difference is rejected, one wants to go a step further and do a pairwise comparison:
There are a number of methods for this problem; a popular one is due to Tukey:
tuk <- TukeyHSD(fit)
plot(tuk)
This draws confidence intervals for the difference in means of all pairs. If an interval does not contain 0, the corresponding pair is statistically significantly different.
Here that is the case only for Drug Free - Throughout, so the other two pairs are not statistically significantly different. Remember, however, that failing to reject \(H_0\) is NOT the same as accepting \(H_0\). The fact that those pairs are not statistically significantly different is almost certainly due to the small sample sizes.
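If the actual numbers are wanted rather than the plot, they can be pulled out of the TukeyHSD object; the component is named after the factor in the model, here Status:
# numeric version of the Tukey comparisons: estimated differences,
# confidence limits and adjusted p-values
round(tuk$Status, 3)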
In Westchester County, north of New York City, Consolidated Edison bills residential customers for electricity on a monthly basis. The company wants to predict residential usage, in order to plan purchases of fuel and budget revenue flow. The data consists of the usage (in kilowatt-hours per day) and the average monthly temperature for 55 consecutive months for an all-electric home in Westchester County, NY.
attach(elusage)
kable.nice(head(elusage), do.row.names = FALSE)
Month | Year | Usage | Temperature |
---|---|---|---|
8 | 1989 | 24.828 | 73 |
9 | 1989 | 24.688 | 67 |
10 | 1989 | 19.310 | 57 |
11 | 1989 | 59.706 | 43 |
12 | 1989 | 99.667 | 26 |
1 | 1990 | 49.333 | 41 |
ggplot(aes(Temperature, Usage), data=elusage) +
geom_point()
We want to find a model (aka a function) f with
\[ \text{Usage}=f(\text{Temperature}) + \epsilon \] where \(\epsilon\) is some random variable, often with the assumption \(\epsilon \sim N(0, \sigma)\).
The simplest choice is a linear model:
\[ \text{Usage}=\beta_0 + \beta_1 \text{ Temperature} + \epsilon \]
fit <- lm(Usage~Temperature, data=elusage)
summary(fit)
##
## Call:
## lm(formula = Usage ~ Temperature, data = elusage)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.305 -8.163 0.559 7.723 28.611
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 116.71619 5.56494 20.97 <2e-16
## Temperature -1.36461 0.09941 -13.73 <2e-16
##
## Residual standard error: 11.35 on 53 degrees of freedom
## Multiple R-squared: 0.7805, Adjusted R-squared: 0.7763
## F-statistic: 188.4 on 1 and 53 DF, p-value: < 2.2e-16
This gives the fitted model as
\[ \text{Usage} = 116.7 - 1.36 \text{ Temperature} + \epsilon \]
How do we know that this is a good model? The main diagnostic is the Residual vs Fits plot:
Fits <- 116.7 - 1.36*Temperature
Residuals <- Usage - Fits
Actually, they are already part of the fit object:
ggplot(aes(Fits, Residuals),
data=data.frame(Fits=fitted(fit),
Residuals=residuals(fit))) +
geom_point() +
geom_smooth(se=FALSE) +
geom_abline(slope = 0)
The idea is that if the model is good, the residuals and the fitted values should be independent, and so this graph should not show any pattern. Adding the non-parametric fit and a horizontal line shows that this is not the case here.
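As an aside, base R produces a similar diagnostic directly from the fitted object:
# residuals vs. fitted values, with a smoother added automatically
plot(fit, which = 1)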
Let’s add a quadratic term to the model
\[ \text{Usage}=\beta_0 + \beta_1 \text{ Temperature} + \beta_2 \text{ Temperature}^2 + \epsilon \]
T2 <- Temperature^2
quad.fit <- lm(Usage~Temperature + T2,
data=elusage)
summary(quad.fit)
##
## Call:
## lm(formula = Usage ~ Temperature + T2, data = elusage)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.8593 -5.6813 0.3756 5.4650 21.9701
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 196.715321 17.487567 11.249 1.53e-15
## Temperature -4.640462 0.694906 -6.678 1.62e-08
## T2 0.030732 0.006472 4.749 1.65e-05
##
## Residual standard error: 9.574 on 52 degrees of freedom
## Multiple R-squared: 0.8469, Adjusted R-squared: 0.841
## F-statistic: 143.8 on 2 and 52 DF, p-value: < 2.2e-16
ggplot(aes(Fits, Residuals),
data=data.frame(Fits=fitted(quad.fit),
Residuals=residuals(quad.fit))) +
geom_point() +
geom_smooth(se=FALSE) +
geom_abline(slope = 0)
and that is much better. If it were not, one could continue and add even higher-order terms. There will always be a polynomial model that gives a good fit.
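Higher-order polynomials can be fit in one step with poly; a sketch for a cubic model (cubic.fit is just an illustrative name):
# cubic polynomial in Temperature; raw = TRUE uses ordinary powers
cubic.fit <- lm(Usage ~ poly(Temperature, 3, raw = TRUE), data = elusage)
round(100*summary(cubic.fit)$r.squared, 2)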
Here we use some transformation (function) of the data, for example the \(\log\):
log.usage <- log(Usage)
log.temp <- log(Temperature)
log.fit <- lm(log.usage~log.temp)
summary(log.fit)
ggplot(aes(Fits, Residuals),
data=data.frame(Fits=fitted(log.fit),
Residuals=residuals(log.fit))) +
geom_point() +
geom_smooth(se=FALSE) +
geom_abline(slope = 0)
and that is also not too bad. Other standard choices for transformations are the square root and the inverse, but in principle any function might work.
How do we choose among these models? A standard measure of the quality of the fit is the Coefficient of Determination. It is defined as
\[ R^2 = \text{cor}(\text{Observed Values, Predicted Values})^2 \times 100\% \]
The better a model is, the more correlated its fitted values and the observed values should be, so if we have a choice of two models, the one with the higher \(R^2\) is better.
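As a quick sanity check of this definition we can compute the squared correlation of the observed and fitted values for the linear model and compare it with the value reported by summary (this assumes, as above, that elusage has no missing values):
# squared correlation of observed and fitted values, in percent
round(100*cor(elusage$Usage, fitted(fit))^2, 2)
# R^2 as reported by summary()
round(100*summary(fit)$r.squared, 2)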
Here we find
# Linear Model
round(100*unlist(summary(fit)$r.squared), 2)
## [1] 78.05
# Quadratic Model
round(100*unlist(summary(quad.fit)$r.squared), 2)
## [1] 84.69
# Log Transform Model
round(100*unlist(summary(log.fit)$r.squared), 2)
## [1] 81.12
but we need to be careful here: the linear model is a special case of the quadratic model, and so its \(R^2\) can never be smaller. There are other ways to choose between such nested models, for example the F test, but here this is not an issue because the linear model is bad anyway.
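The F test for nested models mentioned above can be run with the anova command; a sketch comparing the linear and the quadratic model:
# F test of the null hypothesis that the quadratic term is not needed
anova(fit, quad.fit)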
Now the \(R^2\) of the quadratic model is \(84.7\%\) and that of the log transform model is \(81.1\%\), so the quadratic one is better.
Finally, let’s have a look at what those models look like:
x <- seq(min(Temperature), max(Temperature), length=100)
y.quad <- predict(quad.fit,
newdata=data.frame(Temperature=x,
T2 = x^2))
y.log <- exp(predict(log.fit,
newdata=data.frame(log.temp=log(x))))
dta <- data.frame(x=c(x, x),
y=c(y.quad, y.log),
Model=rep(c("Quadratic", "Log"),
each=100))
ggplot(aes(x, y, color=Model), data=dta) +
geom_line(size=1.2) +
geom_point(data=elusage,
aes(Temperature, Usage),
inherit.aes = FALSE) +
xlab("Temperature") +
ylab("Usage")