A number of R routines (for example boxplot and lm) use the model notation y ~ x, which you should read as "y modeled as a function of x". For example, to find the least squares regression model of y on x we use
lm(y ~ x)
In standard math notation that means fitting an equation of the form
\[ Y = \beta_0 + \beta_1 x + \epsilon \]
Sometimes one wants to fit a no-intercept model:
\[ Y = \beta_1 x + \epsilon \]
and this is done with
lm(y ~ x - 1)
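A minimal sketch of the difference between the two calls, using simulated data. The data and the names fit_with, fit_without and fit_zero are made up for illustration. It also shows that y ~ 0 + x is another way to drop the intercept.

```r
# Simulated data (illustrative values): intercept 2, slope 3.
set.seed(1)
x <- runif(50, 0, 10)
y <- 2 + 3 * x + rnorm(50)

fit_with <- lm(y ~ x)         # estimates an intercept and a slope
fit_without <- lm(y ~ x - 1)  # slope only; the line is forced through the origin
fit_zero <- lm(y ~ 0 + x)     # equivalent way to write the no-intercept model

names(coef(fit_with))     # "(Intercept)" "x"
names(coef(fit_without))  # "x"
all.equal(coef(fit_without), coef(fit_zero))  # TRUE
```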
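If there are two predictors you can fit the additive model
\[ Y = \beta_0 + \beta_1 x + \beta_2 z + \epsilon \]
with
lm(y ~ x + z)
or a model that also includes an interaction term,
\[ Y = \beta_0 + \beta_1 x + \beta_2 z + \beta_3 x \times z + \epsilon \]
with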
lm(y ~ x * z)
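One way to see what a formula expands to is to ask R for its term labels. This needs no data. The helper term_labels below is a hypothetical convenience wrapper around terms(), not something from the original text.

```r
# Hypothetical helper: list the terms R derives from a formula.
term_labels <- function(f) attr(terms(f), "term.labels")

additive <- term_labels(y ~ x + z)
crossed <- term_labels(y ~ x * z)
explicit <- term_labels(y ~ x + z + x:z)

additive  # "x" "z"
crossed   # "x" "z" "x:z"
identical(crossed, explicit)  # TRUE: x * z is shorthand for x + z + x:z
```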
In the case of three (or more) predictors there are all sorts of possibilities:
\[ Y = \beta_0 + \sum_{i=1}^n \beta_i x_i + \epsilon \]
lm(y ~ x1 + x2 + x3)
\[ Y = \beta_0 + \sum_{i=1}^n \beta_i x_i + \sum_{i<j} \beta_{ij} x_i x_j + \dots + \beta_{1 \dots n} x_1 \times \dots \times x_n + \epsilon \]
lm(y ~ (x1 + x2 + x3)^3 )
\[ Y = \beta_0 + \sum_{i=1}^n \beta_i x_i + \sum_{i<j} \beta_{ij} x_i x_j + \epsilon \]
lm(y ~ (x1 + x2 + x3)^2 )
These model descriptions are not unique; for example, the last one is equivalent to
lm(y ~ x1 * x2 * x3 - x1:x2:x3)
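The equivalence can be checked without fitting anything, by comparing the terms that R derives from each formula. This is only a sketch; the variable names are illustrative.

```r
# Terms of the "all two-way interactions" model, written two ways.
squared <- attr(terms(y ~ (x1 + x2 + x3)^2), "term.labels")
minus3 <- attr(terms(y ~ x1 * x2 * x3 - x1:x2:x3), "term.labels")
cubed <- attr(terms(y ~ (x1 + x2 + x3)^3), "term.labels")

squared                    # main effects plus x1:x2, x1:x3, x2:x3
setequal(squared, minus3)  # TRUE: both formulas describe the same model
setdiff(cubed, squared)    # "x1:x2:x3", the only term that ^3 adds
```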
Sometimes we want * to indicate actual multiplication and not interaction. This can be done by wrapping the expression in I():
lm(y ~ x1 + x2 + I(x1*x2))
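A small sketch of what I() does, on simulated data (all values are illustrative). For two numeric predictors the interaction term x1:x2 and the product I(x1 * x2) lead to the same fit. I() becomes essential for expressions such as a square, because inside a formula ^ is a formula operator rather than arithmetic.

```r
set.seed(2)
x1 <- runif(40)
x2 <- runif(40)
y <- 1 + x1 + 2 * x2 + 3 * x1 * x2 + rnorm(40, sd = 0.1)

fit_inter <- lm(y ~ x1 + x2 + x1:x2)      # interaction term
fit_prod <- lm(y ~ x1 + x2 + I(x1 * x2))  # literal product of the two columns

names(coef(fit_prod))  # "(Intercept)" "x1" "x2" "I(x1 * x2)"
all.equal(unname(coef(fit_inter)), unname(coef(fit_prod)))  # TRUE

# Without I(), x1^2 is read as a formula operation and collapses to x1.
plain_sq <- attr(terms(y ~ x1 + x1^2), "term.labels")
arith_sq <- attr(terms(y ~ x1 + I(x1^2)), "term.labels")
plain_sq  # "x1"
arith_sq  # "x1" "I(x1^2)"
```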
Another useful shorthand is the dot (.), which stands for all columns of the data frame other than the response, joined by +. So if (y, x1, x2, x3) are the columns of a data frame df, then
lm(y ~ x1 + x2 + x3, data=df )
is the same as
lm(y ~ ., data=df )
and
lm(y ~ .*x3, data=df )
is the same as
lm(y ~ x1 + x2 + x3 + x1:x3 + x2:x3, data=df )
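The dot can only be expanded once R knows the data frame, so the sketch below builds a small simulated df (purely illustrative) and compares the dot versions with the written-out ones.

```r
set.seed(3)
df <- data.frame(y = rnorm(20), x1 = rnorm(20), x2 = rnorm(20), x3 = rnorm(20))

# With data supplied, terms() replaces the dot by the remaining columns.
dot_terms <- attr(terms(y ~ ., data = df), "term.labels")
dot_terms  # "x1" "x2" "x3"

fit_dot <- lm(y ~ . * x3, data = df)
fit_explicit <- lm(y ~ x1 + x2 + x3 + x1:x3 + x2:x3, data = df)

# Both calls estimate the same set of coefficients and give the same fit.
setequal(names(coef(fit_dot)), names(coef(fit_explicit)))
all.equal(fitted(fit_dot), fitted(fit_explicit))
```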
If there are more than a few predictors it is usually easier to generate a matrix of predictors:
X <- cbind(x1, x2, x3, x4)
lm(y ~ X)
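A short sketch, again on simulated data, suggesting that the matrix form gives the same estimates as listing the predictors one by one. Only the coefficient names differ: R prefixes each column name with the name of the matrix.

```r
set.seed(4)
n <- 30
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n); x4 <- rnorm(n)
y <- 1 + x1 - x2 + rnorm(n)  # illustrative model

X <- cbind(x1, x2, x3, x4)
fit_matrix <- lm(y ~ X)
fit_terms <- lm(y ~ x1 + x2 + x3 + x4)

names(coef(fit_matrix))  # "(Intercept)" "Xx1" "Xx2" "Xx3" "Xx4"
all.equal(unname(coef(fit_matrix)), unname(coef(fit_terms)))  # TRUE
```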
We have a list of prices and other information on houses in Albuquerque, New Mexico:
head(albuquerquehouseprice)
##   Price Sqfeet Feature Corner  Tax
## 1  2050   2650       7      0 1639
## 2  2080   2600       4      0 1088
## 3  2150   2664       5      0 1193
## 4  2150   2921       6      0 1635
## 5  1999   2580       4      0 1732
## 6  1900   2580       4      0 1534
summary(lm(Price ~ Sqfeet + Feature + Corner + Tax,
           data=albuquerquehouseprice))
##
## Call:
## lm(formula = Price ~ Sqfeet + Feature + Corner + Tax, data = albuquerquehouseprice)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -541.92  -73.67  -12.87   66.74  617.68
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)  76.60160   60.70644   1.262   0.2099
## Sqfeet        0.26680    0.06167   4.326 3.55e-05
## Feature      13.63261   13.29013   1.026   0.3074
## Corner      -89.03338   42.20675  -2.109   0.0374
## Tax           0.66193    0.10882   6.083 2.08e-08
##
## Residual standard error: 171 on 102 degrees of freedom
## (10 observations deleted due to missingness)
## Multiple R-squared: 0.8091, Adjusted R-squared: 0.8016
## F-statistic: 108.1 on 4 and 102 DF, p-value: < 2.2e-16
summary(lm(Price ~ Sqfeet + Feature,
           data=albuquerquehouseprice))
##
## Call:
## lm(formula = Price ~ Sqfeet + Feature, data = albuquerquehouseprice)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1005.40   -99.14    -3.16    75.93   782.00
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.55882   67.29583  -0.023   0.9816
## Sqfeet       0.58422    0.03901  14.978   <2e-16
## Feature     27.78585   14.53444   1.912   0.0584
##
## Residual standard error: 202.1 on 114 degrees of freedom
## Multiple R-squared: 0.7226, Adjusted R-squared: 0.7177
## F-statistic: 148.5 on 2 and 114 DF, p-value: < 2.2e-16
summary(lm(Price ~ Sqfeet * Feature,
           data=albuquerquehouseprice))
##
## Call:
## lm(formula = Price ~ Sqfeet * Feature, data = albuquerquehouseprice)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -987.92  -98.84   -0.50   84.97  834.59
##
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)
## (Intercept)    396.37583  172.42540   2.299  0.02335
## Sqfeet           0.32030    0.11237   2.850  0.00519
## Feature        -75.57104   43.76684  -1.727  0.08696
## Sqfeet:Feature   0.06585    0.02637   2.497  0.01397
##
## Residual standard error: 197.6 on 113 degrees of freedom
## Multiple R-squared: 0.7371, Adjusted R-squared: 0.7301
## F-statistic: 105.6 on 3 and 113 DF, p-value: < 2.2e-16
summary(lm(Price ~ (Sqfeet + Feature + Corner + Tax)^2,
           data=albuquerquehouseprice))
##
## Call:
## lm(formula = Price ~ (Sqfeet + Feature + Corner + Tax)^2, data = albuquerquehouseprice)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -383.87  -82.03   -2.70   56.06  644.25
##
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)
## (Intercept)     2.642e+02  1.884e+02   1.403  0.16394
## Sqfeet          2.204e-01  2.012e-01   1.096  0.27593
## Feature        -3.650e+01  4.603e+01  -0.793  0.42974
## Corner          4.367e+02  1.559e+02   2.802  0.00615
## Tax             3.180e-01  3.205e-01   0.992  0.32351
## Sqfeet:Feature  4.618e-02  4.441e-02   1.040  0.30106
## Sqfeet:Corner  -3.963e-01  1.347e-01  -2.943  0.00408
## Sqfeet:Tax      1.109e-04  1.063e-04   1.043  0.29969
## Feature:Corner -1.875e+01  3.626e+01  -0.517  0.60618
## Feature:Tax    -3.492e-02  6.528e-02  -0.535  0.59391
## Corner:Tax      2.511e-01  3.332e-01   0.754  0.45291
##
## Residual standard error: 156.8 on 96 degrees of freedom
## (10 observations deleted due to missingness)
## Multiple R-squared: 0.849, Adjusted R-squared: 0.8332
## F-statistic: 53.97 on 10 and 96 DF, p-value: < 2.2e-16
summary(lm(Price ~ (Sqfeet + Feature + Corner + Tax)^4,
           data=albuquerquehouseprice))
##
## Call:
## lm(formula = Price ~ (Sqfeet + Feature + Corner + Tax)^4, data = albuquerquehouseprice)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -379.59  -80.05    4.74   55.19  650.84
##
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)
## (Intercept)                4.811e+02  3.684e+02   1.306    0.195
## Sqfeet                     4.321e-02  2.886e-01   0.150    0.881
## Feature                   -8.408e+01  9.756e+01  -0.862    0.391
## Corner                     2.495e+03  2.263e+03   1.103    0.273
## Tax                        2.608e-01  5.247e-01   0.497    0.620
## Sqfeet:Feature             8.536e-02  6.726e-02   1.269    0.208
## Sqfeet:Corner             -1.941e+00  1.590e+00  -1.221    0.225
## Sqfeet:Tax                 1.931e-04  2.552e-04   0.757    0.451
## Feature:Corner            -7.821e+02  6.419e+02  -1.218    0.226
## Feature:Tax               -3.020e-02  1.349e-01  -0.224    0.823
## Corner:Tax                -2.410e+00  2.484e+00  -0.970    0.335
## Sqfeet:Feature:Corner      5.432e-01  4.326e-01   1.256    0.212
## Sqfeet:Feature:Tax        -1.472e-05  5.931e-05  -0.248    0.805
## Sqfeet:Corner:Tax          1.933e-03  1.460e-03   1.324    0.189
## Feature:Corner:Tax         9.275e-01  7.042e-01   1.317    0.191
## Sqfeet:Feature:Corner:Tax -6.340e-04  3.955e-04  -1.603    0.112
##
## Residual standard error: 153.3 on 91 degrees of freedom
## (10 observations deleted due to missingness)
## Multiple R-squared: 0.8632, Adjusted R-squared: 0.8406
## F-statistic: 38.27 on 15 and 91 DF, p-value: < 2.2e-16
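The number of terms in the last two models can be checked directly from their formulas, without the data. The sketch below counts the terms generated by ^2 and ^4 on four predictors. The counts agree with the numerator degrees of freedom (10 and 15) in the F-statistic lines above. The names terms_2 and terms_4 are illustrative.

```r
# Terms generated by each formula; no data are needed for this.
terms_2 <- attr(terms(Price ~ (Sqfeet + Feature + Corner + Tax)^2), "term.labels")
terms_4 <- attr(terms(Price ~ (Sqfeet + Feature + Corner + Tax)^4), "term.labels")

length(terms_2)  # 10: 4 main effects + 6 two-way interactions
length(terms_4)  # 15: adds 4 three-way interactions and 1 four-way interaction
setdiff(terms_4, terms_2)  # the three- and four-way interaction terms
```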