2017-03-22

Tree Based Methods

  • Tree-based methods partition the feature space into a set of rectangles.

  • Nonlinear relationships between the predictors and the response do not degrade tree performance.

  • Non-parametric approach without distributional assumptions.

  • Tree-based methods are simple and useful for interpretation.

Terminology for Trees

  • The root node contains all the sample data.

  • Every node in a tree is a subset of the sample.

  • Decision trees are typically drawn upside down, in the sense that the leaves (terminal nodes) are at the bottom of the tree.

  • The points along the tree where the predictor space is split are referred to as internal nodes.

Two main types of decision trees

  • Regression tree analysis is when the predicted outcome can be considered a real number.

  • Classification tree analysis is when the predicted outcome is the class to which the data belongs.

  • Classification and Regression Tree (CART) is used to refer to decision tree algorithms that can be used for classification or regression predictive modeling problems.

  • Combining a large number of trees can often result in dramatic improvements in prediction accuracy, at the expense of some loss in interpretation.

Classification and Regression Tree

The tree-building process

  • We divide the predictor space - that is, the set of possible values for \(X_1, X_2, \ldots, X_p\) - into \(J\) distinct and non-overlapping regions, \(R_1, R_2, \ldots, R_J\).

  • For every observation that falls into the region \(R_j\), we make the same prediction, which is simply the mean (or majority vote) of the response values for the training observations in \(R_j\).

Some problems

  • How do we choose which predictor and cutpoint to split on?
    • Creating a CART model involves selecting input variables and split points on those variables until a suitable tree is constructed.
  • How to grow a tree?
    • Take a top-down, greedy approach known as recursive binary splitting. It begins at the top of the tree and successively splits the predictor space; at each step, the best split is made at that particular step rather than looking ahead to splits that might lead to a better tree later.
  • When should we stop splitting?
    • Tree construction ends using a predefined stopping criterion, such as a minimum number of training instances assigned to each leaf node of the tree.
  • What if the tree is too large? Can we approximate it with a smaller tree?
    • Prune the tree (cost-complexity pruning).

Classification Tree

  • A classification tree is used to predict a qualitative/discrete response.
  • We predict that each observation belongs to the most commonly occurring class of training observations in the region to which it belongs.

Evaluate the quality of a particular split

\(p_{mk}\) represents the proportion of training observations in the mth region that are from the kth class.

  • Misclassification rate (not sensitive enough for guiding a particular split, but preferable for assessing the final tree)

    \(E = 1 - \max_k(p_{mk})\)

  • Gini index - a small value indicates that a node contains predominantly observations from a single class.

    \(G = \sum_{k=1}^{K}p_{mk}(1-p_{mk})\)

  • Cross-entropy - takes on a small value if the mth node is pure.

    \(D = - \sum_{k=1}^{K}p_{mk}\log p_{mk}\)

  • Node impurity measures for two-class classification.
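
As a quick illustration (not from the original notes), the sketch below computes the three impurity measures for a two-class node as a function of the proportion p of class-1 observations and plots them on one set of axes:

### two-class node impurity as a function of the class-1 proportion p
p    = seq(0.001, 0.999, length.out = 200)
mis  = 1 - pmax(p, 1 - p)                    # misclassification rate
gini = 2 * p * (1 - p)                       # Gini index for K = 2
ent  = -(p * log(p) + (1 - p) * log(1 - p))  # cross-entropy

plot(p, ent, type = "l", xlab = "proportion in class 1", ylab = "impurity")
lines(p, gini, lty = 2)
lines(p, mis,  lty = 3)
legend("topright", c("cross-entropy", "Gini", "misclassification"), lty = 1:3)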

Classification tree in R

tree() function in tree package

### Use tree() function in tree library

library(tree)
head(iris)


tree.iris = tree(Species~., iris)

## show information for each branch; terminal nodes are indicated using asterisks.
tree.iris

## ?tree

summary(tree.iris)


plot(tree.iris)
text(tree.iris, pretty = 0)

### validation set approach
set.seed(1)
train = sample(1:150,75)
iristest = iris[-train,]

### fit the tree using the training data
tree.iris = tree(Species~., iris, subset = train)
plot(tree.iris)
text(tree.iris, pretty = 0)

### prediction on validation set
tree.pred = predict(tree.iris, iristest, type="class") ## type = "class" for classification
table(tree.pred, iristest$Species)
mean(tree.pred != iristest$Species)

## cross validation for classification tree. More details in regression tree cross validation part.
cv.iris = cv.tree(tree.iris, FUN = prune.misclass)  ## use misclassification rate to guide the pruning
plot(cv.iris$size, cv.iris$dev,type="b")
plot(cv.iris$k, cv.iris$dev,type="b") ## corresponding cost-complexity parameter

prune.iris = prune.tree(tree.iris, best = 3,method = "misclass")
## equivalent to : prune.iris = prune.misclass(tree.iris, best = 3)

plot(prune.iris)
text(prune.iris, pretty = 0)
tree.pred = predict(prune.iris, iristest, type="class")
table(tree.pred, iristest$Species)
mean(tree.pred != iristest$Species)

### compare to the full tree above. We use a smaller tree and obtain the same prediction accuracy.

Regression Tree

  • A regression tree is very similar to a classification tree, but used to predict a quantitative/continuous response.

  • RSS is used as a criterion for making the binary splits.

  • The prediction is given by the mean response of the training observations that belong to the same terminal node.

We seek the splitting variable j and split point s that solve

\(\min_{j,s}\Big[\min_{c_1} \sum_{x_i \in R_1(j,s)} (y_i - c_1)^2 + \min_{c_2} \sum_{x_i \in R_2(j,s)} (y_i - c_2)^2\Big]\)

The inner minimization is solved by

\(\hat{c}_1 = ave(y_i|x_i \in R_1(j,s))\) and \(\hat{c}_2 = ave(y_i|x_i \in R_2(j,s))\)
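
To make the greedy search concrete, here is a minimal sketch (illustration only, not the code used by the tree package; best.split is a hypothetical helper) that scans candidate split points s for a single numeric predictor x and evaluates the criterion above:

### exhaustive search for the best split point s on one numeric predictor
best.split = function(x, y) {
  xs = sort(unique(x))
  s.cand = (head(xs, -1) + tail(xs, -1)) / 2     # midpoints between observed values
  rss = sapply(s.cand, function(s) {
    left  = y[x <  s]
    right = y[x >= s]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)  # c1, c2 are the region means
  })
  list(s = s.cand[which.min(rss)], rss = min(rss))
}

## e.g. best.split(Boston$lstat, Boston$medv)  (with library(MASS) loaded)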

Regression Tree in R

### similar to the classification tree above
library(MASS)

set.seed(2)
train = sample(1:nrow(Boston), nrow(Boston)/2)
tree.boston=tree(medv~.,Boston,subset=train)
summary(tree.boston)
plot(tree.boston)
text(tree.boston,pretty=0)

Tree Pruning

  • A very large tree might overfit the data, while a small tree might not capture the important structure.

  • Tree size is a tuning parameter governing the model's complexity.

  • The strategy is to grow a large tree \(T_0\), stopping the splitting process only when some minimum node size is reached.

  • This large tree is pruned using cost-complexity pruning.

Define a subtree \(T \subset T_0\). \(|T|\) denotes the number of terminal nodes in T.

\(N_m = \# (x_i \in R_m)\)

\(\hat{c}_m = \frac{1}{N_m}\sum_{xi \in R_m} y_i\)

\(Q_m(T) = \frac{1}{N_m}\sum_{x_i \in R_m}(y_i - \hat{c}_m)^2\)

The cost complexity criterion:

\(C_\alpha(T) = \sum_{m=1}^{|T|}N_mQ_m(T) + \alpha|T|\).

The tuning parameter \(\alpha\) governs the tradeoff between tree size and its goodness of fit to the data. Large values of \(\alpha\) result in smaller trees.

Estimation of \(\alpha\) is achieved by cross validation.

For a classification tree, the measure \(Q_m(T)\) can be defined as the misclassification error, Gini index, or cross-entropy.
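
As a small numerical illustration (cost.complexity is a hypothetical helper, not part of the tree package), the criterion can be evaluated directly from the per-leaf sizes \(N_m\) and impurities \(Q_m\) of a candidate subtree:

### C_alpha(T) = sum_m N_m Q_m(T) + alpha |T|
cost.complexity = function(Nm, Qm, alpha) {
  sum(Nm * Qm) + alpha * length(Nm)
}

## e.g. a 3-leaf subtree with leaf sizes 10, 25, 15 and within-leaf impurities Q_m:
## cost.complexity(Nm = c(10, 25, 15), Qm = c(2.1, 0.8, 1.4), alpha = 1)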

## continue to regression tree part

### use cv.tree() to do the cross validation. Default is 10-fold cv (K = 10)
### For classification Tree
### We can use argument FUN=prune.misclass: cv.tree(..., FUN = prune.misclass). Default is deviance.
### prune.misclass() is an abbreviation for prune.tree(method = "misclass")
cv.boston=cv.tree(tree.boston)
names(cv.boston)

### cross validation plot
### size: number of terminal nodes; 
### deviance: the cross-validated error (RSS/deviance, or misclassification counts with FUN = prune.misclass)
### k: cost-complexity parameter
plot(cv.boston$size,cv.boston$dev,type='b')

### use the cross validation plot to find the best size of the tree
### use the best tree size obtained by cross validation to prune the tree
### Arguments in prune.tree() function:
### k: cost-complexity parameter
### best: size (i.e. number of terminal nodes) of a specific subtree in the cost-complexity sequence to be returned.



cv.boston

prune.boston=prune.tree(tree.boston,best=5) ## return the subtree with 5 terminal nodes.
plot(prune.boston)
text(prune.boston,pretty=0)

### use the decision tree to make predictions on the validation set
yhat=predict(tree.boston,newdata=Boston[-train,])
boston.test=Boston[-train,"medv"]

### calculate the test MSE
mean((yhat-boston.test)^2)

Advantages

  • Easy to interpret.

  • Provide good visualizations.

  • Help to understand which attributes are most important.

  • Work for regression and classification problems.

Disadvantages

  • Tend to overfit; tree pruning is needed.

  • Different data sets may lead to very different trees.

  • Decision trees can only provide axis-parallel decision boundaries.

Bagging and Random Forest

  • Using a single tree for prediction does not work too well.

  • The decision trees also suffer from high variance.

library(MASS)
library(tree)
set.seed(1)
par(mfrow = c(2,2))

for (i in 1:4){
  train.i = sample(1:nrow(Boston), nrow(Boston)/2)  ## use a new name so the 'train' index above is not overwritten
  tree.i = tree(medv~., Boston, subset = train.i)
  plot(tree.i)
  text(tree.i, pretty = 0)
}

  • If we train multiple decision trees and aggregate them, the predictive performance can be substantially improved.

Bagging

To improve the prediction accuracy and reduce variance:

Ideally,

  • We have many training data sets.
  • Fit the model using each training data set
  • Average the predictions

Practically,

  • Use bootstrapping (resample with replacement) to obtain many training data sets.
  • Fit the model using each bootstrap sample
  • Average the predictions

  • Bootstrap aggregation, or bagging, is a general-purpose procedure for reducing the variance of a statistical learning method.

Bagging Algorithm:

  1. Take a random sample of size n with replacement from the training data (a bootstrap sample).

  2. Construct a decision tree on this bootstrap sample as usual, but do not prune it.

  3. Use this tree to make the prediction

  4. Repeat steps 1-3 a large number of times.

  5. Average the predictions over all the trees.

  • In this approach we generate B different bootstrapped training data sets. We then train our method on the bth bootstrapped training set in order to get \(\hat{f}^{*b}(x)\), the prediction at a point x. We then average all the predictions to obtain

\(\hat{f}_{bag}(x) = \frac{1}{B}\sum_{b=1}^{B}\hat{f}^{*b}(x)\).

This is called bagging.

  • For classification trees: for each test observation, we record the class predicted by each of the B trees, and take a majority vote; the overall prediction is the most commonly occurring class among the B predictions.
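
A minimal sketch of bagging "by hand" with the tree() function on the Boston data (the randomForest package used later does all of this for us; this is only to make the algorithm concrete):

### bagging regression trees by hand
library(tree)
library(MASS)
set.seed(3)
B = 100
n = nrow(Boston)
preds = matrix(NA, nrow = n, ncol = B)
for (b in 1:B) {
  boot = sample(1:n, n, replace = TRUE)          # step 1: bootstrap sample
  fit  = tree(medv ~ ., Boston[boot, ])          # step 2: grow a tree, no pruning
  preds[, b] = predict(fit, newdata = Boston)    # step 3: predict with this tree
}
yhat.bag.manual = rowMeans(preds)                # step 5: average over all trees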

Out-of-Bag Error Estimation

  • For each bootstrap sample drawn from the training data, some observations are not included. These are called out-of-bag (OOB) samples.

  • On average, each bagged tree makes use of around two-thirds of the observations.

  • The remaining one-third of the observations not used to fit a given bagged tree are referred to as the out-of-bag(OOB) observations.

The probability that a given observation is not included in a bootstrap sample of size n is \((1 - 1/n)^n\), and \(\lim_{n \rightarrow \infty}(1 - 1/n)^n = e^{-1} \approx \frac{1}{3}\).

  • We can predict the response for the ith observation using each of the trees in which that observation was OOB. This will yield around B/3 predictions for the ith observation, which we average.

  • This estimate is essentially the LOOCV error for bagging.
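
A quick sketch to check the roughly one-third out-of-bag fraction empirically; the note on the randomForest() output components reflects that package's documented return values:

### fraction of observations left out of one bootstrap sample
n = 10000
boot = sample(1:n, n, replace = TRUE)
mean(!(1:n %in% boot))     # close to exp(-1), i.e. roughly 1/3

## randomForest() reports OOB results automatically: rf$predicted holds the OOB
## predictions, and rf$mse (regression) or rf$err.rate (classification) the OOB errors.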

Random Forests

Random forests: resample the data (as in bagging) and also randomly sample the predictors at each split.

Random Forests Algorithm:

  1. Take a random sample of size n with replacement from the training data (bootstrap sample)

  2. Take a random subset of the predictors.

  3. Construct a split by using predictors selected in Step 2.

  4. Repeat steps 2 and 3 for each split to grow the tree. Each tree is produced from a random sample of training data, and at each split a random sample of predictors.

  5. Use this tree to make the prediction

  6. Repeat Steps 1 - 5 a large number of times.

  7. Average the predictions over all trees.

  • Random forests provide an improvement over bagged trees by way of a small tweak that decorrelates the trees (the trees are more independent). This reduces the variance when we average the trees.

  • There are often a few predictors that dominate the decision tree fitting process because on the average they consistently perform just a bit better than their competitors. With random forests, each predictor will have at least several opportunities to be the predictor defining a split.

  • When building these decision trees, each time a split in a tree is considered, a random selection of m predictors is chosen as split candidates from the full set of p predictors. The split is allowed to use only one of those m predictors.
    • Common settings:
      • For classification a good default is : \(m = \sqrt{p}\)
      • For regression a good default is : \(m = p/3\)
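
For example, with p = 13 predictors (as in the Boston data), these rules give the values below; they also match randomForest()'s own defaults when mtry is not supplied:

p = 13
m.class = floor(sqrt(p))        # classification default: 3
m.reg   = max(floor(p/3), 1)    # regression default: 4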

Handwritten Digits Recognition

library(randomForest)
set.seed(1)
digits = read.csv("train.csv")
digits2 = digits[sample(1:nrow(digits),1000),]
digits2$label = as.factor(digits2$label)  ### only if response is factor, classification is assumed

rf.digits = randomForest(as.factor(label)~., digits2, ntree = 100,importance = T)

predict(rf.digits, newdata = digits[7,-1])
cus_col<-colorRampPalette(colors=c('white','black'))
m = matrix(as.matrix(digits)[7,-1],28,28)
image(m[,28:1],col = cus_col(256))


#### variable importance
#importance(rf.digits)
varImpPlot(rf.digits)

m = matrix(as.numeric(importance(rf.digits,type=2)),28,28)
image(m[,28:1],col = cus_col(16))

Variable importance measure

  • For bagged/RF regression trees, we record the total amount that the RSS is decreased due to splits over a given predictor, averaged over all B trees. A large value indicates an important predictor.

  • Similarly, for bagged/RF classification trees, we add up the total amount that the Gini index is decreased by splits over a given predictor, averaged over all B trees.

Bagging and Random Forest in R

# Bagging and Random Forests

#### install the randomForest package first
#### Use randomForest() function in randomForest package
#### bagging is just a special case of a random forest with m = p

library(randomForest)
library(MASS)
library(ISLR)
set.seed(1)
dim(Boston)

## mtry = 13 is for bagging, because there are 13 predictors in total.
bag.boston=randomForest(medv~.,data=Boston,subset=train,mtry=13,importance=TRUE)
### useful arguments in randomForest()
### ntree: number of trees to grow. This should not be set too small. Default is 500.
### mtry: number of variables randomly sampled as candidates at each split. 
###       When mtry equals the total number of predictors, bagging is performed.
### replace: sampling with or without replacement. Default is TRUE.

bag.boston

### make the predictions with bagging trees
yhat.bag = predict(bag.boston,newdata=Boston[-train,])
plot(yhat.bag, boston.test)
abline(0,1)

### calculate mse
mean((yhat.bag-boston.test)^2)

bag.boston=randomForest(medv~.,data=Boston,subset=train,mtry=13,ntree=25)
yhat.bag = predict(bag.boston,newdata=Boston[-train,])
mean((yhat.bag-boston.test)^2)
set.seed(1)


### Build a random forest of regression trees. 
### set mtry=6
rf.boston=randomForest(medv~.,data=Boston,subset=train,mtry=6,importance=TRUE)
yhat.rf = predict(rf.boston,newdata=Boston[-train,])
mean((yhat.rf-boston.test)^2)

### Two measures of variable importance: 
### The first is the mean decrease in prediction accuracy on the out-of-bag samples when the values of a given variable are permuted.
### The second is the total decrease in node impurity that results from splits over that variable, averaged over all trees.
importance(rf.boston)

### Plot these importance measures
varImpPlot(rf.boston)


### almost the same for classification trees
### Be careful that when doing classification, the response (y) column must be a factor. 
### Use as.factor() to convert to factor. See the handwritten digits example above.

Boosting

  • Boosting is a general approach that can be applied to many statistical learning methods for regression or classification.

  • The motivation for boosting was a procedure that combines the outputs of many "weak" classifiers to produce a powerful "committee".

  • Boosting works in a similar way as bagging, except that the trees are grown sequentially - each tree is grown using information from previously grown trees.

  • Boosting is an ensemble method for reducing bias.

  • Bagging reduces variance of low-bias models
    • Low-bias models are "complex" and unstable
    • Bagging averages them together to create stability (reduce variance)
  • Boosting reduces bias of low-variance models
    • Low-variance models are simple with high bias
    • Boosting trains sequence of simple models
    • Sum of simple models is complex/accurate (low bias)

Details of this procedure

  • Unlike fitting a single large decision tree to the data, which amounts to fitting the data hard and potentially overfitting, the boosting approach instead learns slowly.

  • Given the current model, we fit a decision tree to the residuals from the model. We then add this new decision tree into the fitted function in order to update the residuals.

  • Each of these trees can be rather small, with just a few terminal nodes, determined by the parameter d in the algorithm.

  • By fitting small trees to the residuals, we slowly improve \(\hat{f}\) in areas where it does not perform well. The shrinkage parameter \(\lambda\) slows the process down even further, allowing more and different shaped trees to attack the residuals.
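
A rough sketch of this slow-learning idea, boosting small regression trees "by hand" with the tree package (assumes the Boston training index train defined earlier; the gbm code further below is the proper way to do this):

### hand-rolled boosting for regression: fit small trees to residuals, shrink, repeat
library(tree)
library(MASS)
lambda = 0.01
B = 200
fhat = rep(0, nrow(Boston))                       # current fitted values
r = Boston$medv                                   # current residuals (start at y)
X = Boston[, names(Boston) != "medv"]
for (b in 1:B) {
  dat = data.frame(X, r = r)
  small.tree = tree(r ~ ., dat[train, ],
                    control = tree.control(nobs = length(train), mindev = 0.05))
  fb = predict(small.tree, newdata = dat)
  fhat = fhat + lambda * fb                       # add a shrunken copy of the tree
  r = r - lambda * fb                             # update the residuals
}
mean((fhat[-train] - Boston$medv[-train])^2)      # test MSE of the hand-rolled booster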

Tuning parameters for boosting

  • The number of trees B. Unlike bagging and random forests, boosting can overfit if B is too large, although this overfitting tends to occur slowly if at all. We use cross-validation to select B.
  • The shrinkage parameter \(\lambda\), a small positive number. This controls the rate at which boosting learns. Typical values are 0.01 or 0.001, and the right choice can depend on the problem. Very small \(\lambda\) can require using a very large value of B in order to achieve good performance.
  • The number of splits d in each tree, which controls the complexity of the boosted ensemble. Often d = 1 works well, in which case each tree is a stump, consisting of a single split and resulting in an additive model. More generally d is the interaction depth, and controls the interaction order of the boosted model, since d splits can involve at most d variables.
# Boosting

### gbm() function in the gbm package
### install package gbm first

library(gbm)
set.seed(1)
### Useful arguments in gbm():
# distribution: "gaussian" for a regression problem; "bernoulli" for a binary classification problem
# n.trees: the total number of trees to fit
# interaction.depth: the maximum depth of each tree
# shrinkage: the shrinkage parameter (lambda) applied to each tree in the expansion. Default is 0.001
# verbose: if TRUE, gbm will print out progress and performance indicators

boost.boston=gbm(medv~.,data=Boston[train,],distribution="gaussian",n.trees=5000,interaction.depth=4)
summary(boost.boston)

## partial dependence plots for variables to illustrate the marginal effect
# par(mfrow=c(1,2))
plot(boost.boston,i="rm")
plot(boost.boston,i="lstat")

## make the predictions
yhat.boost=predict(boost.boston,newdata=Boston[-train,],n.trees=5000)
mean((yhat.boost-boston.test)^2)

## change the shrinkage parameter to 0.2
boost.boston=gbm(medv~.,data=Boston[train,],distribution="gaussian",n.trees=5000,interaction.depth=4,shrinkage=0.2,verbose=F)

#gbm(medv~.,data=Boston[train,],distribution="gaussian",n.trees=5000,interaction.depth=4,shrinkage=0.2)

yhat.boost=predict(boost.boston,newdata=Boston[-train,],n.trees=5000)
mean((yhat.boost-boston.test)^2)
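
One way to choose the number of trees B by cross-validation is gbm's built-in cv.folds argument together with gbm.perf(), which returns the estimated best n.trees; a sketch, continuing with the same Boston training split:

set.seed(1)
boost.cv = gbm(medv~., data = Boston[train,], distribution = "gaussian",
               n.trees = 5000, interaction.depth = 4, shrinkage = 0.01,
               cv.folds = 5, verbose = FALSE)
best.B = gbm.perf(boost.cv, method = "cv")    # plots training/CV error and returns the best B
yhat.cv = predict(boost.cv, newdata = Boston[-train,], n.trees = best.B)
mean((yhat.cv - boston.test)^2)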