Regression Trees

Tree-based methods partition the feature space into a set of rectangles and then fit a simple model (for example a constant) in each one. They are conceptually simple yet powerful. They have long been used in many fields, for example in Biology (animal or vegetable?). In Statistics they became popular with the work of Breiman et al. in the 1980s.

As an illustration, consider a continuous response variable y and two continuous predictors X1 and X2, each with values in [0,1]. A partition of the feature space is shown in the figure below,

and the idea is to assign the same label to all the observations whose (x1, x2) falls into the same rectangle.

This is a perfectly legitimate partition, but it has the problem that it is very difficult to describe. Instead we will restrict ourselves to recursive binary partitions like the one shown in the figure below.

Such a partition is very easy to describe using a tree diagram:

In this diagram each split is called a node; a node without any further splits is called a terminal node. The corresponding regression model predicts Y with a constant \(c_m\) in region \(R_m\), that is

\[ \hat{f}(x) = \sum_{m=1}^{5} c_m I_{R_m} (x) \]
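
To make the notation concrete, here is a small R sketch of such a piecewise-constant predictor; the five rectangles and the constants are made up for illustration and need not match the figure.

# piecewise-constant predictor on [0, 1]^2 with five rectangles;
# split points and constants are invented for illustration
f.hat <- function(x1, x2) {
  if (x1 < 0.5) {
    if (x2 < 0.5) 1 else 2                        # regions R1, R2
  } else {
    if (x2 < 0.3) 3 else if (x2 < 0.8) 4 else 5   # regions R3, R4, R5
  }
}
f.hat(0.2, 0.8)   # constant of the rectangle containing (0.2, 0.8)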

How to grow a tree

Say our data consist of p “inputs” and a response for each of n observations, that is \((x_i, y_i)\) for \(i=1,\dots,n\), with \(x_i=(x_{i1},\dots,x_{ip})\).

The algorithm needs to decide automatically on the splitting variables and split points, and also on what topology (or shape) the tree should have. Suppose first that we partition the feature space into M regions \(R_1, \dots, R_M\), and we model the response as a constant \(c_m\) in each region \(R_m\).

If we adopt as our criterion minimization of the sum of squares

\[ \sum_i (y_i-f(x_i))^2 \] it is easy to show that the best constants \(c_m\) are just the mean values of the \(y_i\)'s whose corresponding \(x_i\) falls in \(R_m\):

\[ \hat{c}_m = \text{ave}\left(y_i \mid x_i \in R_m\right) = \frac1{n_m} \sum_{x_i \in R_m} y_i \] where \(n_m\) is the number of observations in \(R_m\). (This follows by differentiating the sum of squares with respect to \(c_m\) and setting the derivative to zero.) Now finding the globally best binary partition in terms of minimum sum of squares is generally computationally infeasible, so we proceed greedily: starting with all the data, consider a splitting variable j and split point s, and define the pair of half-planes

\[ \begin{aligned} &R_1(j, s) = \left\{ x|x_j \le s\right\}\\ &R_2(j, s) = \left\{ x|x_j >s\right\} \end{aligned} \] Then we seek the splitting variable j and split point s that solve \[ \min_{j,\,s} \left[ \min_{c_1} \sum_{x_i \in R_1(j,s)} (y_i - c_1)^2 + \min_{c_2} \sum_{x_i \in R_2(j,s)} (y_i - c_2)^2 \right]. \] For any choice of j and s the inner minimizations are solved by the means of the \(y_i\) in the two half-planes, and for each j the candidate split points s can be found by scanning through the observed values of \(x_j\), so this greedy search is feasible.
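
To make this concrete, here is a minimal R sketch of the exhaustive scan over splitting variables and split points for a single node; the function and the toy data are made up for illustration, and this is not how rpart implements the search.

# for every predictor and every candidate split point, compute the
# residual sum of squares of the two half-planes and keep the best
best_split <- function(X, y) {
  best <- list(rss = Inf, j = NA, s = NA)
  for (j in seq_len(ncol(X))) {
    for (s in sort(unique(X[, j]))) {
      left  <- y[X[, j] <= s]
      right <- y[X[, j] > s]
      if (length(left) == 0 || length(right) == 0) next
      rss <- sum((left - mean(left))^2) + sum((right - mean(right))^2)
      if (rss < best$rss) best <- list(rss = rss, j = j, s = s)
    }
  }
  best
}

# toy data with one informative predictor
Xtoy <- cbind(x1 = runif(100), x2 = runif(100))
ytoy <- ifelse(Xtoy[, "x1"] < 0.5, 0, 3) + rnorm(100)
best_split(Xtoy, ytoy)   # should pick j = 1 with a split point near 0.5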

Having found the best split, we partition the data into the two resulting regions and repeat the splitting process on each of them, continuing recursively.

How large should the tree grow? Clearly, without any “stopping rule” the tree will grow until each observation has its own region, which would amount to overfitting. The size of the tree is in fact a tuning parameter, much like the span in loess, governing the bias-variance trade-off.

The most common strategy is to grow a large tree \(T_0\), stopping only when some minimum node size (say 5) is reached. This large tree is then pruned (made smaller) using cost-complexity pruning:

Say \(T_1\) is a sub-tree of \(T_0\), that is, \(T_1\) can be obtained by pruning branches off \(T_0\). Let \(|T|\) denote the number of terminal nodes of a tree \(T\), index the terminal nodes by \(m = 1,\dots,|T|\), and let terminal node \(m\) correspond to the region \(R_m\) containing \(n_m\) observations. Let

\[ \begin{aligned} &\hat{c}_m = \frac1{n_m} \sum_{x_i \in R_m} y_i \\ &Q_m(T) = \frac1{n_m} \sum_{x_i \in R_m} (y_i - \hat{c}_m)^2 \end{aligned} \] then we define the cost-complexity criterion

\[ C_\alpha(T)=\sum_{m=1}^{|T|}n_m Q_m(T)+\alpha|T| \]

The idea is to find, for each \(\alpha\), the sub-tree \(T_\alpha\) that minimizes \(C_\alpha(T)\). The tuning parameter \(\alpha\) governs the trade-off between the size of the tree and its goodness-of-fit to the data. If \(\alpha=0\) the solution is the full tree \(T_0\), and for larger values of \(\alpha\) the tree becomes smaller. There are methods to choose an “optimal” \(\alpha\) automatically, for example cross-validation.
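
In the rpart package, which we use for the examples below, this is essentially what happens: the minimum node size is set through rpart.control, the penalty is expressed via the complexity parameter cp (a rescaled version of \(\alpha\)), and the cross-validated error for each candidate cp is stored in the fitted object. Here is a sketch, with simulated data and parameter values chosen purely for illustration:

library(rpart)
# simulated data, purely for illustration
sim <- data.frame(x1 = runif(200), x2 = runif(200))
sim$z <- ifelse(sim$x1 < 0.5, 0, 2) + rnorm(200)

# grow a deliberately large tree T0: stop only when a terminal node
# would have fewer than 5 observations, and set cp = 0 so that no
# pruning is done while growing
T0 <- rpart(z ~ x1 + x2, data = sim,
            control = rpart.control(minbucket = 5, cp = 0))

# the cp table contains the cross-validated error (xerror) for each
# candidate cp; pick the value with the smallest xerror and prune
# back to the corresponding sub-tree
printcp(T0)
best.cp <- T0$cptable[which.min(T0$cptable[, "xerror"]), "CP"]
fit.pruned <- prune(T0, cp = best.cp)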

Example: Simulated data

We will generate data according to the tree shown above:

library(rpart)

# generate data whose mean structure follows the tree shown above:
# the mean of z depends on which rectangle (x, y) falls into
x <- runif(1000)
y <- runif(1000)
z <- rep(0, 1000)
for (i in 1:1000) {
  if (x[i] < 0.3) {
    if (y[i] < 0.5) {
      z[i] <- ifelse(x[i] < 0.1, rnorm(1), rnorm(1, 1))
    } else {
      z[i] <- rnorm(1, 2)
    }
  } else {
    z[i] <- ifelse(y[i] < 0.9, rnorm(1, 3), rnorm(1, 4))
  }
}

# fit a regression tree and draw it
fit <- rpart(z ~ x + y)
par(mar = c(1, 0, 0, 0))
plot(fit)
text(fit)
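
Printing the fitted object shows the estimated split points and node means, which we can compare with the values used to generate the data:

# the estimated splits should be close to x = 0.3, 0.1 and y = 0.5, 0.9,
# and the node means close to 0, 1, 2, 3 and 4
print(fit)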

library(ggplot2)
library(grid)

# scatterplots of z against each predictor, side by side
pushViewport(viewport(layout = grid.layout(1, 2)))
print(ggplot(data = data.frame(x = x, y = z), aes(x, y)) +
        geom_point(),
      vp = viewport(layout.pos.row = 1, layout.pos.col = 1))
print(ggplot(data = data.frame(x = y, y = z), aes(x, y)) +
        geom_point(),
      vp = viewport(layout.pos.row = 1, layout.pos.col = 2))

Looking at these graphs it is not clear how one would fit a standard model to this data set.

Example: Kyphosis

# the kyphosis data set comes with the rpart package
fit <- rpart(Kyphosis ~ Age + Number + Start,
             data = kyphosis)
par(mar = c(1, 0, 0, 0))
plot(fit)
text(fit)

Notice that the method easily accepts a binary categorical response!
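
For a classification tree predict() can return either class probabilities or predicted classes; here is a quick in-sample check, as a sketch:

# class probabilities and predicted classes for the fitted tree
head(predict(fit, type = "prob"))     # P(absent) and P(present) for the first cases
table(predicted = predict(fit, type = "class"),
      observed  = kyphosis$Kyphosis)  # in-sample confusion table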

Example: US Temperature

# make Longitude negative so it matches standard map coordinates (degrees west)
ustemperature$Longitude <- (-ustemperature$Longitude)
fit <- rpart(JanTemp ~ Longitude + Latitude,
             data = ustemperature)
par(mar = c(1, 0, 0, 0))
plot(fit)
text(fit)

Here is the corresponding fitted line plot:

library(maps)
usa <- map_data("usa")          # outline of the continental US
df <- us.cities[, 5:4]          # longitude and latitude of US cities
df <- df[df[, 2] < 50, ]        # not Alaska
df <- df[df[, 2] > 25, ]        # not Hawaii
colnames(df) <- c("Longitude", "Latitude")
df$Temp <- predict(fit, df)     # predicted January temperature for each city
ggplot() +
  geom_polygon(data = usa,
               aes(x = long, y = lat, group = group),
               alpha = 0.1) +
  coord_fixed(1.3) +
  geom_point(data = df, aes(Longitude, Latitude, color = Temp)) +
  scale_colour_gradient(low = "blue", high = "red") +
  labs(x = "Longitude", y = "Latitude", color = "Temperature") +
  scale_x_continuous(breaks = c(-120, -100, -80),
                     labels = c("120W", "100W", "80W"))
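
Because a tree fit is piecewise constant, predicting on a regular grid of longitude and latitude values shows the rectangular regions directly; here is a sketch, with the grid range and resolution chosen arbitrarily.

# predict on a regular grid to visualize the rectangles of the tree;
# the grid covers the continental US only roughly
pred.grid <- expand.grid(Longitude = seq(-125, -70, length.out = 100),
                         Latitude  = seq(25, 50, length.out = 100))
pred.grid$Temp <- predict(fit, pred.grid)
ggplot(pred.grid, aes(Longitude, Latitude, fill = Temp)) +
  geom_tile() +
  coord_fixed(1.3) +
  scale_fill_gradient(low = "blue", high = "red")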

Example: Air Pollution and Mortality

newair <- airpollution[, -16]   # take out NOxPot
# log-transform Pop, HCPot and NOx and rename the columns accordingly
newair[, c(10, 13, 14)] <- log(newair[, c(10, 13, 14)])
colnames(newair)[c(10, 13, 14)] <-
  c("log.Pop", "log.HCPot", "log.NOx")
fit <- rpart(Mortality ~ ., data = newair)
par(mar = c(1, 0, 0, 0))
plot(fit)
text(fit)

We see that the only predictors used in the tree are NonWhite, Rain and Education.
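
One way to check this directly is the cp table, which lists the variables actually used in the tree, together with the variable importance scores stored in the fitted object:

# printcp() lists the variables actually used in tree construction;
# fit$variable.importance also gives credit to surrogate splits
printcp(fit)
fit$variable.importance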