Objectives

Main idea: Averaging.

We move to ensemble methods (pro: better prediction, con: worse interpretability)

Principle of methods and DATASET 1

Variability in tree estimates

  • There is considerable variability in the estimated tree and the associated predictions, even when cross-validation is used to choose the size of the tree.
  • The following example illustrates this variability, using 5 random splits into training and test sets (a sketch of the loop that generated the output is shown after the setup code).
heart=read.csv("http://faculty.marshall.usc.edu/gareth-james/ISL/Heart.csv")
#heart=read.csv("http://www-bcf.usc.edu/~gareth/ISL/Heart.csv")
library(tree)
nmiss=apply(is.na(heart),1,sum) #number of missing values, by row
heart=heart[nmiss==0,]  #remove cases with missing values
attach(heart)
nrec=nrow(heart);nrec
## [1] 297
ntrain=ifelse(nrec%%2==0,nrec/2,(nrec+1)/2);ntrain
## [1] 149
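
The loop that produced the printed output below is not shown in these notes; a minimal sketch of how it could look (the seed and the cross-validation/pruning choices are assumptions, and names such as pred.train simply mirror the output):

set.seed(1)  #assumed; the original seed is not shown
for (s in 1:5) {
  print(paste("random split number ", s, " into training and test sets"))
  train <- sample(1:nrec, ntrain, replace = FALSE)
  AHD.train <- AHD[train]
  AHD.test  <- AHD[-train]
  fit   <- tree(AHD ~ ., data = heart, subset = train)
  cvfit <- cv.tree(fit, FUN = prune.misclass)     #choose the tree size by CV
  best  <- cvfit$size[which.min(cvfit$dev)]
  pruned <- prune.misclass(fit, best = best)
  pred.train <- predict(pruned, heart[train, ], type = "class")
  pred.test  <- predict(pruned, heart[-train, ], type = "class")
  print("Confusion matrix for training set")
  print(table(pred.train, AHD.train))
  print("Confusion matrix for test set")
  print(table(pred.test, AHD.test))
  print(paste("Training error rate = ", round(100*mean(pred.train != AHD.train), 2), " %"))
  print(paste("Test error rate = ", round(100*mean(pred.test != AHD.test), 2), " %"))
}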
## [1] "random split number  1  into training and test sets"

## [1] "Confusion matrix for training set"
##           AHD.train
## pred.train No Yes
##        No  78  10
##        Yes  4  57
## [1] "Confusion matrix for test set"
##          AHD.test
## pred.test No Yes
##       No  66  21
##       Yes 12  49
## [1] "Training error rate =  9.4  %"
## [1] "Test error rate =  22.3  %"
## [1] "random split number  2  into training and test sets"

## [1] "Confusion matrix for training set"
##           AHD.train
## pred.train No Yes
##        No  68   6
##        Yes  9  66
## [1] "Confusion matrix for test set"
##          AHD.test
## pred.test No Yes
##       No  54  20
##       Yes 29  45
## [1] "Training error rate =  10.07  %"
## [1] "Test error rate =  33.11  %"
## [1] "random split number  3  into training and test sets"

## [1] "Confusion matrix for training set"
##           AHD.train
## pred.train No Yes
##        No  65  14
##        Yes 12  58
## [1] "Confusion matrix for test set"
##          AHD.test
## pred.test No Yes
##       No  62  23
##       Yes 21  42
## [1] "Training error rate =  17.45  %"
## [1] "Test error rate =  29.73  %"
## [1] "random split number  4  into training and test sets"

## [1] "Confusion matrix for training set"
##           AHD.train
## pred.train No Yes
##        No  77  13
##        Yes  6  53
## [1] "Confusion matrix for test set"
##          AHD.test
## pred.test No Yes
##       No  68  29
##       Yes  9  42
## [1] "Training error rate =  12.75  %"
## [1] "Test error rate =  25.68  %"
## [1] "random split number  5  into training and test sets"

## [1] "Confusion matrix for training set"
##           AHD.train
## pred.train No Yes
##        No  80  11
##        Yes  4  54
## [1] "Confusion matrix for test set"
##          AHD.test
## pred.test No Yes
##       No  65  17
##       Yes 11  55
## [1] "Training error rate =  10.07  %"
## [1] "Test error rate =  18.92  %"

Bagging

  • Bagging, or bootstrap aggregation, is a procedure developed to reduce the high degree of variability in tree-based predictions.

Bagging for regression trees

  • Suppose that \(\hat f^1(x), \hat f^2(x), \ldots, \hat f^B(x)\) are predictions generated using \(B\) independent training sets.
  • Average the predictions to form

\[\hat f_{avg}(x) = \frac{1}{B} \sum_{b=1}^B \hat f^b(x)\]

  • But how do we get independent training sets?

  • We bootstrap, by taking repeated samples from our SINGLE original training set.

  • This gives \(B\) bootstrap training sets.

  • Train the method on the \(b\)’th bootstrapped training set to obtain the prediction \(\hat f^{*b}(x)\), for \(b=1,2, \ldots , B\).

  • Average to get

\[\hat f_{bag}(x) = \frac{1}{B} \sum_{b=1}^B \hat f^{*b} (x)\]

  • This is the **bagged estimate** of \(f(x)\).

  • The procedure is called **bagging**. A small hand-rolled sketch follows; in practice the randomForest package (used below) does the work for us.
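
A hand-rolled sketch of this procedure for a regression tree (using the Boston housing data from the MASS package, which we analyse in detail later):

library(MASS)   #for the Boston data
library(tree)
set.seed(1)
B <- 100
n <- nrow(Boston)
pred_mat <- matrix(NA, nrow = n, ncol = B)      #one column of predictions per tree
for (b in 1:B) {
  boot <- sample(1:n, n, replace = TRUE)        #bootstrap training set b
  fit  <- tree(medv ~ ., data = Boston[boot, ]) #f^{*b}
  pred_mat[, b] <- predict(fit, newdata = Boston)
}
f_bag <- rowMeans(pred_mat)                     #the bagged estimate f_bag(x)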

Bagging for classification trees

  • How do we average predictions for classification trees?

  • We use a majority vote. For a given \(x\), predict the outcome using the prediction generated by the majority of the \(B\) predictions.

  • For example, for a fixed \(x\), and predicting a binary outcome \(AHD\) with \(B=100\), suppose that 73 of the trees predict \(AHD=Y\), and the other 27 trees predict \(AHD=N\). Then the majority prediction is \(Y\), as the small check below confirms.
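
A one-line check of this majority vote in R (the vector of 100 tree predictions is made up for illustration):

votes <- c(rep("Yes", 73), rep("No", 27))  #hypothetical predictions from B = 100 trees
table(votes)
names(which.max(table(votes)))             #majority prediction: "Yes"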

library(randomForest) #random forest library
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.

Out-of-bag (OOB) error rate

  • For each bootstrap sample, about 1/3 of the sample is left out. Why? (See the quick check after this list.)
  • These left-out observations are referred to as out-of-bag (or OOB) observations.
  • A tree is estimated for each bootstrap sample.
  • For each data point, say case \(i\), not used in the tree construction, the outcome is predicted. A majority vote is taken using all trees which did not use the \(i\)’th observation in their construction, and this gives the overall classification for the outcome, say \(Y_i\).
  • This is repeated for each of the \(n\) observations, leading to an overall misclassification rate, the out-of-bag error rate. It is a valid measure of test error for the bagged model.
  • The OOB misclassification rate will typically be close, but not identical, to the misclassification rate on an independent test set.
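
Why about 1/3? Each bootstrap draw picks a given case with probability \(1/n\), so the chance that case \(i\) is never drawn in \(n\) draws with replacement is \((1-1/n)^n \approx e^{-1} \approx 0.37\):

n <- nrec        #297 complete cases
(1 - 1/n)^n      #probability that a given case is out of bag, about 0.367
exp(-1)          #limiting value as n grows, about 0.368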
train=sample(1:nrec,ntrain,replace=F)
AHD.train=AHD[train]
AHD.test=AHD[-train]
heart.bag=randomForest(AHD~.,data=heart,subset=train,mtry=14,importance=T) #mtry = 14 = all predictors, i.e. bagging
heart.bag
## 
## Call:
##  randomForest(formula = AHD ~ ., data = heart, mtry = 14, importance = T,      subset = train) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 14
## 
##         OOB estimate of  error rate: 18.12%
## Confusion matrix:
##     No Yes class.error
## No  62  15   0.1948052
## Yes 12  60   0.1666667
pred.test=predict(heart.bag,newdata=heart[-train,])
print("Confusion matrix for test set")
## [1] "Confusion matrix for test set"
table(pred.test,AHD.test)
##          AHD.test
## pred.test No Yes
##       No  67  18
##       Yes 16  47

Variable importance measures

  • For each observation in the training or test sets, bagging gives a prediction, and these predictions tend to be better than those from individual trees.
  • One downside of the bagging procedure is that it doesn’t give a single tree, so it has less interpretability than the single-tree estimate. Bagging improves prediction at the expense of interpretation.
  • We can record the total amount by which the Gini index is decreased by splits over a particular predictor, averaged over all \(B\) trees. This gives an overall estimate of the importance of each predictor variable (the MeanDecreaseGini column below).
  • We can do the same using prediction accuracy: the MeanDecreaseAccuracy column reports the average drop in out-of-bag prediction accuracy when the values of a predictor are randomly permuted.
importance(heart.bag)
##                    No       Yes MeanDecreaseAccuracy MeanDecreaseGini
## X          0.07476407  1.284280            0.8553039        6.0825163
## Age        1.06589071  1.777932            2.0312460        3.2198839
## Sex        2.93776943  1.511667            3.4233705        0.5267229
## ChestPain 10.69219967 11.840729           14.8320223        6.8514542
## RestBP     2.86394062  8.638051            7.7308635        4.3420267
## Chol      -2.88475805 -1.573189           -3.1274613        4.4680851
## Fbs        0.68753032 -2.643287           -1.1376819        0.2844849
## RestECG    3.44216151  4.640968            5.3596124        1.6982777
## MaxHR      3.36723620  9.444810            9.0468521        5.1315817
## ExAng      5.63303420  4.169836            6.6796224        2.3574364
## Oldpeak    7.97004873 12.584993           14.5622456        5.9222721
## Slope      4.17468623  1.155356            4.1438552        0.8906864
## Ca         8.76575554  5.935472            9.8585296        4.2068380
## Thal      28.32897277 25.739275           35.1431699       27.9026195
varImpPlot(heart.bag)

Random forests

  • A random forest is like a bagged estimate, except that the predictors chosen at each split are a random sample of the complete set of predictors.

  • The number of variables \(m\) used at each split is a tuning parameter. The default for classification is \(\sqrt p\) (rounded down), where \(p\) is the total number of predictors (the regression default is discussed with the Boston data below).

  • Bagging is a particular case of the random forest algorithm, with \(m=p\).

  • Rationale:

    • suppose there is one very strong predictor and several moderately strong predictors. Then each tree in the bagged collection will typically have the same variable at the first split.
    • so all bagged trees will look fairly similar
    • so predictions for the different trees will be quite similar
    • so the variance of the overall prediction will not be dramatically reduced. Why? (Averaging highly correlated predictions reduces variance much less than averaging independent ones.)
  • by randomly choosing variables at each split, the correlation among trees is reduced.

  • With a large number of correlated predictors, a small value of \(m\) typically works well.

heart.rf=randomForest(AHD~.,data=heart, subset=train,importance=T) #default mtry = floor(sqrt(14)) = 3 here
heart.rf
## 
## Call:
##  randomForest(formula = AHD ~ ., data = heart, importance = T,      subset = train) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 14.77%
## Confusion matrix:
##     No Yes class.error
## No  67  10   0.1298701
## Yes 12  60   0.1666667
importance(heart.rf)
##                  No          Yes MeanDecreaseAccuracy MeanDecreaseGini
## X         -1.397063  0.002760374           -0.9859185        4.9582819
## Age        1.441290  4.936340622            4.5211690        4.8397769
## Sex        3.857060  2.938012008            4.7378004        2.1426732
## ChestPain 10.486304 11.878496717           14.6213525        8.6367107
## RestBP     1.959248  1.452121915            2.4062627        4.1226150
## Chol       1.147181  0.904183006            1.3926925        4.6384059
## Fbs        1.923851 -1.386881574            0.3558581        0.5184759
## RestECG    2.248565  1.085243506            2.5536359        1.6336395
## MaxHR      4.818789  6.125040665            7.5754747        7.9171340
## ExAng      5.908052  4.488123177            7.0309664        4.5976822
## Oldpeak    7.299092  8.520673603           11.0204382        7.8160824
## Slope      4.153070  2.424139490            4.6859871        2.5498119
## Ca         9.621453  6.406586024           10.7658444        4.9873911
## Thal      19.728484 20.737837374           25.6048731       14.1704503
varImpPlot(heart.rf)

pred.test=predict(heart.rf,newdata=heart[-train,])
print("Confusion matrix for test set")
## [1] "Confusion matrix for test set"
table(pred.test,AHD.test)
##          AHD.test
## pred.test No Yes
##       No  71  21
##       Yes 12  44

Boosting (you won’t be asked any questions on boosting)

  • Fit a tree, and obtain residuals
  • Fit a second tree on the residuals from the first tree
  • Update the residuals by combining predictions from the second tree with predictions from the first tree.
  • Iterate (a minimal sketch of the idea follows below)
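
A minimal sketch of this residual-fitting idea for a continuous outcome (synthetic data; the shrinkage parameter lambda and the number of trees are arbitrary choices, and in practice each tree is kept small):

library(tree)
set.seed(1)
n <- 200
x <- runif(n, -2, 2)
y <- sin(2*x) + rnorm(n, sd = 0.3)   #synthetic regression data
dat <- data.frame(x = x, y = y)

B <- 100        #number of trees
lambda <- 0.1   #shrinkage: each tree adds only a small correction
f <- rep(0, n)  #current ensemble prediction
r <- y          #residuals; initially the outcome itself

for (b in 1:B) {
  fit <- tree(r ~ x, data = data.frame(x = x, r = r))  #fit a tree to the residuals
  f <- f + lambda * predict(fit, newdata = dat)        #update the ensemble
  r <- y - f                                           #update the residuals
}
mean((y - f)^2)   #training MSE of the boosted ensemble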
library(gbm)          #boosting library
## Loaded gbm 2.1.5
Y=ifelse(AHD.train=="Yes",1,0)  #need a 0/1 outcome for boosting
heart.boost=gbm(Y~.-AHD,data=heart[train,],distribution="bernoulli",n.trees=5000)
summary(heart.boost)

##                 var    rel.inf
## Thal           Thal 23.8750363
## X                 X 12.7517494
## ChestPain ChestPain 12.1991604
## Oldpeak     Oldpeak 12.1965817
## Ca               Ca  7.0485840
## RestBP       RestBP  6.4162442
## Chol           Chol  5.1739430
## RestECG     RestECG  4.6993269
## MaxHR         MaxHR  4.5605497
## Age             Age  4.0542566
## ExAng         ExAng  3.0448924
## Slope         Slope  2.6475230
## Sex             Sex  0.8832662
## Fbs             Fbs  0.4488863
pred.train=predict(heart.boost,newdata=heart[train,],n.trees=5000,type="response")
pred.train=ifelse(pred.train>.5,1,0)  #threshold the fitted probabilities at 0.5
table(pred.train,AHD.train)
##           AHD.train
## pred.train No Yes
##          0 77   0
##          1  0  72
pred.test=predict(heart.boost,newdata=heart[-train,],n.trees=5000,type="response") #type="response" gives probabilities
pred.test=ifelse(pred.test>.5,1,0)
table(pred.test,AHD.test)
##          AHD.test
## pred.test No Yes
##         0 68  15
##         1 15  50

DATASET 2: Boston : regression of median value response

read the data

library(MASS)
boston <- Boston #I generally prefer lower case
#setDT(boston)
str(boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

Let’s split the data and build one regression tree

#library(magrittr)
# for sample_frac
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:randomForest':
## 
##     combine
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tree)
set.seed(1)
# infix notation (pipe)
#boston_train = Boston %>% sample_frac(.5)
#boston_test = Boston %>%
#  setdiff(boston_train)
# magrittr for this syntax
# https://stackoverflow.com/questions/24536154/what-does-mean-in-r
# iris %>% head() is equivalent to head(iris)

# I (pf) prefer this syntax:
boston_train = sample_frac(Boston,.5)
boston_test = setdiff(Boston, boston_train)

tree_boston=tree(medv~., boston_train)

summary(tree_boston)
## 
## Regression tree:
## tree(formula = medv ~ ., data = boston_train)
## Variables actually used in tree construction:
## [1] "rm"    "lstat" "crim"  "age"  
## Number of terminal nodes:  7 
## Residual mean deviance:  10.38 = 2555 / 246 
## Distribution of residuals:
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -10.1800  -1.7770  -0.1775   0.0000   1.9230  16.5800

Let’s bag it!

Use \(m=p=13\)

mtry = 13 means that all 13 predictors are considered at each split of the tree.

This implies that we are using randomForest to do BAGGING.

library(randomForest)
set.seed(1)
bag_boston = randomForest(medv~., 
                          data = boston_train, 
                          mtry = 13, 
                          importance = TRUE)
bag_boston
## 
## Call:
##  randomForest(formula = medv ~ ., data = boston_train, mtry = 13,      importance = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 13
## 
##           Mean of squared residuals: 11.39601
##                     % Var explained: 85.17

Now evaluate the performance of the BAGGED model on the test set:

library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
## 
##     margin
bagged_estimate = predict(bag_boston, 
                          newdata = boston_test)

ggplot() + 
    geom_point(aes(x = boston_test$medv, y = bagged_estimate)) +
    geom_abline()

MSE of the bagged model

Now estimate the test MSE of the BAGGED model (here refit with only 25 trees):

bag_boston_25_trees = randomForest(medv~., data =  boston_train, mtry = 13, ntree = 25)
bagged_estimate_25_trees = predict(bag_boston_25_trees, newdata = boston_test)
mse_baggedmodel <- mean((bagged_estimate_25_trees - boston_test$medv)^2)
mse_baggedmodel
## [1] 23.66716

What if we use fewer trees?

# test MSE of a bagged model (mtry = 13) with n trees
mse <- function(n, test, train){
  bagm = randomForest(medv ~ ., data = train, mtry = 13, ntree = n)
  bagpred = predict(bagm, newdata = test)
  return(mean((bagpred - test$medv)^2))
}

mse(5,boston_test,boston_train)
## [1] 25.29835

What if we take more trees?

mse(100,boston_test,boston_train)
## [1] 23.2125

Was bagging useful?

Now, was bagging useful?

To test this, we are going to compare (compare what? your answer here) the BAGGED model to a SINGLE tree (CART model) that is optimized by cross-validation.

Let us build the COMPETING MODEL (CART-CV)

It will be a cross-validated CART model

Let us run this:

tree_boston=tree(medv~., boston_train)
cv_boston = cv.tree(tree_boston)
plot(cv_boston$size, cv_boston$dev, type = 'b')

Hmm, it looks like CV selects 6 or 7 as the best size (minimum deviance), but 7 more often than 6. OK, I take 7 as the optimal tree size.

And I prune the full tree to this size (nested subtrees, like Russian dolls):

prune_boston = prune.tree(tree_boston, 
                          best = 7)
plot(prune_boston)
text(prune_boston, pretty = 0)

Now that I have built my alternative model, I can compare it to the bagged model.

But how should I compare it?

Simply by calculating the MSE of the CART model on the test set.

single_tree_estimate = predict(prune_boston, 
                               newdata = boston_test)
ggplot() + 
    geom_point(aes(x = boston_test$medv, y = single_tree_estimate)) +
    geom_abline()

mse_cartmodel <- mean((single_tree_estimate - boston_test$medv)^2)
mse_cartmodel
## [1] 153.5446

So now we compare the two MSEs. Which one is better, what is your conclusion?

c(mse_cartmodel,mse_baggedmodel)
## [1] 153.54458  23.66716

Yes BAGGING WAS USEFUL!

OK, but can we do better? (1)

How?

RF!

By default, when no value is passed to mtry, the function randomForest uses \(m = p/3\) variables (features, rounded down) for regression, and \(m = \sqrt{p}\) (rounded down) for classification.
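
For the Boston data with \(p = 13\) predictors, those defaults work out as:

p <- 13
floor(p/3)      #default mtry for regression: 4
floor(sqrt(p))  #default mtry for classification: 3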

Let us try \(m=6\)

# test MSE of a random forest with m variables tried at each split
mserf <- function(m, test, train){
  rfb = randomForest(medv ~ ., data = train, mtry = m, importance = TRUE)
  rfest = predict(rfb, newdata = test)
  return(mean((rfest - test$medv)^2))
}


m = 6    #random forest: 6 variables tried at each split
mb = 13  #bagging: all 13 variables
h = NULL
hb = NULL
for(i in 1:50){   #50 refits to see the variability of the test MSE
  h = c(h, mserf(m, boston_test, boston_train))
  hb = c(hb, mserf(mb, boston_test, boston_train))
}
hist(h)

hist(hb)

Did RF improve over bagging?

c(median(h),median(hb))
## [1] 19.55883 23.34456

what do you think?
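
One way to compare the two sampling distributions side by side (a sketch; the histograms above carry the same information):

boxplot(list(rf_m6 = h, bagging_m13 = hb), ylab = "test MSE over 50 refits")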

OK, but can we do better? (2)

How?

Boosting!

I leave this as an exercise.

DATASET 3: Carseats sales

Can you replicate the same methodology for the Carseats sales data?

Response: either keep the original Sales response (continuous: regression exercise), or transform it into a two-level factor (Low, High: classification exercise).

(src:daviddalpiaz.github.io/r4sl/trees.html)

Full tree

library(tree)
library(ISLR)

data(Carseats)
#?Carseats
str(Carseats)

Carseats$Sales = as.factor(ifelse(Carseats$Sales <= 8, "Low", "High"))
str(Carseats)

# Fit an unpruned classification tree using all of the predictors.

seat_tree = tree(Sales ~ ., data = Carseats)
# seat_tree = tree(Sales ~ ., data = Carseats, 
#                  control = tree.control(nobs = nrow(Carseats), minsize = 10))
summary(seat_tree)
plot(seat_tree)
text(seat_tree, pretty = 0)
title(main = "Sales of carseats: full classification tree: no pruning!")
#  details of the splits.
seat_tree

Variability in tree estimates

  • There is considerable variability in the estimated tree and the associated predictions, even when cross-validation is used to choose the size of the tree.
  • The following example illustrates this variability, using 5 random splits into training and test sets.
# Carseats
attach(Carseats)
nrec=nrow(Carseats);nrec
ntrain=ifelse(nrec%%2==0,nrec/2,(nrec+1)/2);ntrain