Model Notation

A number of R routines (for example boxplot and lm) use the model notation y~x, which you should read as "y modeled as a function of x". So, for example, if we want to find the least squares regression model of y on x, we use

lm(y ~ x)

In standard math notation that means fitting an equation of the form

\[ Y = \beta_0 + \beta_1 x + \epsilon \]

Sometimes one wants to fit a no-intercept model:

\[ Y = \beta_1 x + \epsilon \]

and this is done with

lm(y ~ x - 1)
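As a quick check of the two formulas above, here is a minimal sketch on simulated data. The data, the seed, and the object names fit_int, fit_noint and fit_zero are made up for illustration. It also shows that y ~ 0 + x is another way to drop the intercept.

```r
# Simulated data (hypothetical): true intercept 2, true slope 3
set.seed(1)
x <- runif(50, 0, 10)
y <- 2 + 3 * x + rnorm(50)

fit_int   <- lm(y ~ x)       # estimates beta_0 and beta_1
fit_noint <- lm(y ~ x - 1)   # estimates beta_1 only
fit_zero  <- lm(y ~ 0 + x)   # alternative spelling of the no-intercept model

names(coef(fit_int))     # "(Intercept)" "x"
names(coef(fit_noint))   # "x"

# Both no-intercept spellings give the same estimate
all.equal(coef(fit_noint), coef(fit_zero))
```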

If there are two predictors you can

  • fit an additive model of the form

\[ Y = \beta_0 + \beta_1 x + \beta_2 z + \epsilon \]

with

lm(y ~ x + z)

  • fit a model with an interaction term

\[ Y = \beta_0 + \beta_1 x + \beta_2 z + \beta_3 x \times z + \epsilon \]

with

lm(y ~ x * z)
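To see which terms a formula stands for without fitting anything, one option is to inspect it with terms(). The sketch below needs no data; the names labels_star and labels_long are only for illustration. It shows that x * z is shorthand for the two main effects plus the interaction x:z.

```r
# terms() parses a formula symbolically, so the variables need not exist
labels_star <- attr(terms(y ~ x * z), "term.labels")
labels_long <- attr(terms(y ~ x + z + x:z), "term.labels")

labels_star   # "x"   "z"   "x:z"
labels_long   # "x"   "z"   "x:z"
```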



In the case of three (or more) predictors there are all sorts of possibilities:

  • model without interactions

\[ Y = \beta_0 + \sum_{i=1}^n \beta_i x_i + \epsilon \]

lm(y ~ x1 + x2 + x3)
  • model with all interactions

\[ Y = \beta_0 + \sum_{i=1}^n \beta_i x_i + \sum_{i<j} \beta_{ij} x_i x_j + \cdots + \beta_{1 \ldots n} x_1 \times \cdots \times x_n + \epsilon \]

lm(y ~ (x1 + x2 + x3)^3 )
  • model with all pairwise interactions

\[ Y = \beta_0 + \sum_{i=1}^n \beta_i x_i + \sum_{i<j} \beta_{ij} x_i x_j + \epsilon \]

lm(y ~ (x1 + x2 + x3)^2 )

These model descriptions are not unique. For example, the last one is equivalent to

lm(y ~ x1 * x2 * x3 - x1:x2:x3)
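The equivalence can be checked by comparing the terms the two formulas expand to. This is a small sketch using terms(); the names pairwise_a and pairwise_b are chosen here for illustration only.

```r
# Expand both formulas symbolically and compare their term labels
pairwise_a <- attr(terms(y ~ (x1 + x2 + x3)^2), "term.labels")
pairwise_b <- attr(terms(y ~ x1 * x2 * x3 - x1:x2:x3), "term.labels")

pairwise_a
setequal(pairwise_a, pairwise_b)   # TRUE: same main effects and pairwise interactions

# The three-way term appears only when the power is raised to 3
"x1:x2:x3" %in% attr(terms(y ~ (x1 + x2 + x3)^3), "term.labels")   # TRUE
```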

Sometimes we want * to indicate actual multiplication and not interaction. This can be done by wrapping the expression in I():

lm(y ~ x1 + x2 + I(x1*x2))
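The following sketch, on simulated data with made-up names, illustrates what I() does. For two numeric predictors the product column I(x1*x2) is the same column that the interaction x1:x2 produces, so both fits give the same estimates. The difference becomes visible with ^: inside a formula x1^2 is a formula operator, not a square, so the quadratic term is silently dropped unless it is protected by I().

```r
# Simulated numeric predictors (hypothetical data)
set.seed(2)
x1 <- rnorm(40)
x2 <- rnorm(40)
y  <- 1 + x1 - 2 * x2 + 0.5 * x1 * x2 + rnorm(40)

fit_I    <- lm(y ~ x1 + x2 + I(x1 * x2))  # arithmetic product as a predictor
fit_star <- lm(y ~ x1 * x2)               # main effects plus interaction

# Same design matrix columns, hence the same estimates
all.equal(unname(coef(fit_I)), unname(coef(fit_star)))

# Without I(), x1^2 collapses to x1 and no squared term is fitted
attr(terms(y ~ x1 + x1^2), "term.labels")             # "x1"
length(attr(terms(y ~ x1 + I(x1^2)), "term.labels"))  # 2
```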

Another useful shorthand is ., which stands for all columns of the data frame other than the response, joined by +. So if (y, x1, x2, x3) are the columns of a dataframe df, then

lm(y ~ x1 + x2 + x3, data=df )

is the same as

lm(y ~ ., data=df )

and

lm(y ~ .*x3, data=df )

is the same as

lm(y ~ x1 + x2 + x3 + x1*x3 + x2*x3, data=df )
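The expansion of . depends on the data frame, so terms() needs the data argument to resolve it. A minimal sketch, assuming a small simulated data frame df with the columns named above (the values are random and purely illustrative):

```r
# Hypothetical data frame with response y and predictors x1, x2, x3
set.seed(3)
df <- data.frame(y = rnorm(20), x1 = rnorm(20), x2 = rnorm(20), x3 = rnorm(20))

dot_terms  <- attr(terms(y ~ ., data = df), "term.labels")
dot_x3     <- attr(terms(y ~ . * x3, data = df), "term.labels")
written_x3 <- attr(terms(y ~ x1 + x2 + x3 + x1 * x3 + x2 * x3), "term.labels")

dot_terms                     # "x1" "x2" "x3"
setequal(dot_x3, written_x3)  # TRUE: three main effects plus x1:x3 and x2:x3

# The fitted coefficients agree as well
all.equal(coef(lm(y ~ ., data = df)),
          coef(lm(y ~ x1 + x2 + x3, data = df)))
```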



If there are more than a few predictors, it is usually easier to generate a matrix of predictors:

X <- cbind(x1, x2, x3, x4)
lm(y ~ X)
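One thing to be aware of with the matrix form is how the coefficients are labeled: each name is the matrix name followed by the column name. The sketch below uses simulated data (all values and the names fit_X and fit_sep are illustrative) and confirms that the estimates match the fit with the predictors written out.

```r
# Simulated predictors and response (hypothetical data)
set.seed(4)
n  <- 30
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n); x4 <- rnorm(n)
y  <- 1 + x1 + 2 * x2 + rnorm(n)

X       <- cbind(x1, x2, x3, x4)   # columns keep the names x1, ..., x4
fit_X   <- lm(y ~ X)
fit_sep <- lm(y ~ x1 + x2 + x3 + x4)

names(coef(fit_X))   # "(Intercept)" "Xx1" "Xx2" "Xx3" "Xx4"

# Same estimates, only the labels differ
all.equal(unname(coef(fit_X)), unname(coef(fit_sep)))
```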

Case Study: House Prices

We have a list of prices and other information on houses in Albuquerque, New Mexico:

head(albuquerquehouseprice)
##   Price Sqfeet Feature Corner  Tax
## 1  2050   2650       7      0 1639
## 2  2080   2600       4      0 1088
## 3  2150   2664       5      0 1193
## 4  2150   2921       6      0 1635
## 5  1999   2580       4      0 1732
## 6  1900   2580       4      0 1534
  • additive model, all four predictors:
summary(lm(Price ~ Sqfeet + 
             Feature + Corner + Tax,
           data=albuquerquehouseprice))
## 
## Call:
## lm(formula = Price ~ Sqfeet + Feature + Corner + Tax, data = albuquerquehouseprice)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -541.92  -73.67  -12.87   66.74  617.68 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)  76.60160   60.70644   1.262   0.2099
## Sqfeet        0.26680    0.06167   4.326 3.55e-05
## Feature      13.63261   13.29013   1.026   0.3074
## Corner      -89.03338   42.20675  -2.109   0.0374
## Tax           0.66193    0.10882   6.083 2.08e-08
## 
## Residual standard error: 171 on 102 degrees of freedom
##   (10 observations deleted due to missingness)
## Multiple R-squared:  0.8091, Adjusted R-squared:  0.8016 
## F-statistic: 108.1 on 4 and 102 DF,  p-value: < 2.2e-16
  • additive model, Sqfeet and Feature:
summary(lm(Price ~ Sqfeet + Feature,
           data=albuquerquehouseprice))
## 
## Call:
## lm(formula = Price ~ Sqfeet + Feature, data = albuquerquehouseprice)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1005.40   -99.14    -3.16    75.93   782.00 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.55882   67.29583  -0.023   0.9816
## Sqfeet       0.58422    0.03901  14.978   <2e-16
## Feature     27.78585   14.53444   1.912   0.0584
## 
## Residual standard error: 202.1 on 114 degrees of freedom
## Multiple R-squared:  0.7226, Adjusted R-squared:  0.7177 
## F-statistic: 148.5 on 2 and 114 DF,  p-value: < 2.2e-16
  • model with interaction, Sqfeet and Feature:
summary(lm(Price ~ Sqfeet * Feature,
           data=albuquerquehouseprice))
## 
## Call:
## lm(formula = Price ~ Sqfeet * Feature, data = albuquerquehouseprice)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -987.92  -98.84   -0.50   84.97  834.59 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)
## (Intercept)    396.37583  172.42540   2.299  0.02335
## Sqfeet           0.32030    0.11237   2.850  0.00519
## Feature        -75.57104   43.76684  -1.727  0.08696
## Sqfeet:Feature   0.06585    0.02637   2.497  0.01397
## 
## Residual standard error: 197.6 on 113 degrees of freedom
## Multiple R-squared:  0.7371, Adjusted R-squared:  0.7301 
## F-statistic: 105.6 on 3 and 113 DF,  p-value: < 2.2e-16
  • model with pairwise interactions:
summary(lm(Price ~ (Sqfeet + Feature +
                      Corner +  Tax)^2,
           data=albuquerquehouseprice))
## 
## Call:
## lm(formula = Price ~ (Sqfeet + Feature + Corner + Tax)^2, data = albuquerquehouseprice)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -383.87  -82.03   -2.70   56.06  644.25 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)
## (Intercept)     2.642e+02  1.884e+02   1.403  0.16394
## Sqfeet          2.204e-01  2.012e-01   1.096  0.27593
## Feature        -3.650e+01  4.603e+01  -0.793  0.42974
## Corner          4.367e+02  1.559e+02   2.802  0.00615
## Tax             3.180e-01  3.205e-01   0.992  0.32351
## Sqfeet:Feature  4.618e-02  4.441e-02   1.040  0.30106
## Sqfeet:Corner  -3.963e-01  1.347e-01  -2.943  0.00408
## Sqfeet:Tax      1.109e-04  1.063e-04   1.043  0.29969
## Feature:Corner -1.875e+01  3.626e+01  -0.517  0.60618
## Feature:Tax    -3.492e-02  6.528e-02  -0.535  0.59391
## Corner:Tax      2.511e-01  3.332e-01   0.754  0.45291
## 
## Residual standard error: 156.8 on 96 degrees of freedom
##   (10 observations deleted due to missingness)
## Multiple R-squared:  0.849,  Adjusted R-squared:  0.8332 
## F-statistic: 53.97 on 10 and 96 DF,  p-value: < 2.2e-16
  • model with all possible terms:
summary(lm(Price ~ (Sqfeet + Feature +
                      Corner +  Tax)^4,
           data=albuquerquehouseprice))
## 
## Call:
## lm(formula = Price ~ (Sqfeet + Feature + Corner + Tax)^4, data = albuquerquehouseprice)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -379.59  -80.05    4.74   55.19  650.84 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)
## (Intercept)                4.811e+02  3.684e+02   1.306    0.195
## Sqfeet                     4.321e-02  2.886e-01   0.150    0.881
## Feature                   -8.408e+01  9.756e+01  -0.862    0.391
## Corner                     2.495e+03  2.263e+03   1.103    0.273
## Tax                        2.608e-01  5.247e-01   0.497    0.620
## Sqfeet:Feature             8.536e-02  6.726e-02   1.269    0.208
## Sqfeet:Corner             -1.941e+00  1.590e+00  -1.221    0.225
## Sqfeet:Tax                 1.931e-04  2.552e-04   0.757    0.451
## Feature:Corner            -7.821e+02  6.419e+02  -1.218    0.226
## Feature:Tax               -3.020e-02  1.349e-01  -0.224    0.823
## Corner:Tax                -2.410e+00  2.484e+00  -0.970    0.335
## Sqfeet:Feature:Corner      5.432e-01  4.326e-01   1.256    0.212
## Sqfeet:Feature:Tax        -1.472e-05  5.931e-05  -0.248    0.805
## Sqfeet:Corner:Tax          1.933e-03  1.460e-03   1.324    0.189
## Feature:Corner:Tax         9.275e-01  7.042e-01   1.317    0.191
## Sqfeet:Feature:Corner:Tax -6.340e-04  3.955e-04  -1.603    0.112
## 
## Residual standard error: 153.3 on 91 degrees of freedom
##   (10 observations deleted due to missingness)
## Multiple R-squared:  0.8632, Adjusted R-squared:  0.8406 
## F-statistic: 38.27 on 15 and 91 DF,  p-value: < 2.2e-16
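In the output above, the additive, pairwise and full models on all four predictors have 5, 11 and 16 coefficients, with adjusted R-squared 0.8016, 0.8332 and 0.8406. Rather than reading these off three separate summaries, one can fit a list of formulas and tabulate the quantities of interest. The sketch below shows one way to do that. Because albuquerquehouseprice is not generally available, it uses a simulated stand-in called houses (a hypothetical name, with made-up values), so the adjusted R-squared values it produces will not match those above; only the coefficient counts carry over.

```r
# Simulated stand-in with the same column names (NOT the real data)
set.seed(5)
n <- 100
houses <- data.frame(
  Sqfeet  = round(runif(n, 1000, 3000)),
  Feature = sample(0:8, n, replace = TRUE),
  Corner  = rbinom(n, 1, 0.2),
  Tax     = round(runif(n, 300, 1800))
)
houses$Price <- 100 + 0.3 * houses$Sqfeet + 0.6 * houses$Tax +
  rnorm(n, sd = 150)

# The three model formulas used in the case study
formulas <- list(
  additive = Price ~ Sqfeet + Feature + Corner + Tax,
  pairwise = Price ~ (Sqfeet + Feature + Corner + Tax)^2,
  full     = Price ~ (Sqfeet + Feature + Corner + Tax)^4
)
fits <- lapply(formulas, lm, data = houses)

# Number of coefficients and adjusted R-squared per model
n_coef <- sapply(fits, function(m) length(coef(m)))
adj_r2 <- sapply(fits, function(m) summary(m)$adj.r.squared)

n_coef   # additive 5, pairwise 11, full 16
adj_r2
```

With the real data, note that the two-predictor models above were fit to more houses (114 and 113 residual degrees of freedom) than the four-predictor models, which lost 10 observations to missing values, so their summaries are not based on the same rows.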