Objectives

Main idea: Averaging.

We move to ensemble methods (pro: better prediction, con: worse interpretability)

Principle of methods and DATASET 1

Variability in tree estimates

  • There is considerable variability in the estimated tree and the associated predictions, even when cross-validation is used to choose the size of the tree.
  • The following example illustrates this variability, using 5 random splits into training and test sets (a sketch of the loop that generated the output is shown after the setup code).
heart=read.csv("http://faculty.marshall.usc.edu/gareth-james/ISL/Heart.csv")
#heart=read.csv("http://www-bcf.usc.edu/~gareth/ISL/Heart.csv")
library(tree)
nmiss=apply(is.na(heart),1,sum) #number of missing values, by row
heart=heart[nmiss==0,]  #remove cases with missing values
attach(heart)
nrec=nrow(heart);nrec
## [1] 297
ntrain=ifelse(nrec%%2==0,nrec/2,(nrec+1)/2);ntrain
## [1] 149
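
The loop that produced the printed output below is not shown in these notes; a minimal sketch of how it could look (the seed and the cross-validation/pruning choices are assumptions, and names such as pred.train simply mirror the output):

set.seed(1)  #assumed; the original seed is not shown
for (s in 1:5) {
  print(paste("random split number ", s, " into training and test sets"))
  train <- sample(1:nrec, ntrain, replace = FALSE)
  AHD.train <- AHD[train]
  AHD.test  <- AHD[-train]
  fit   <- tree(AHD ~ ., data = heart, subset = train)
  cvfit <- cv.tree(fit, FUN = prune.misclass)     #choose the tree size by CV
  best  <- cvfit$size[which.min(cvfit$dev)]
  pruned <- prune.misclass(fit, best = best)
  pred.train <- predict(pruned, heart[train, ], type = "class")
  pred.test  <- predict(pruned, heart[-train, ], type = "class")
  print("Confusion matrix for training set")
  print(table(pred.train, AHD.train))
  print("Confusion matrix for test set")
  print(table(pred.test, AHD.test))
  print(paste("Training error rate = ", round(100*mean(pred.train != AHD.train), 2), " %"))
  print(paste("Test error rate = ", round(100*mean(pred.test != AHD.test), 2), " %"))
}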
## [1] "random split number  1  into training and test sets"

## [1] "Confusion matrix for training set"
##           AHD.train
## pred.train No Yes
##        No  78  10
##        Yes  4  57
## [1] "Confusion matrix for test set"
##          AHD.test
## pred.test No Yes
##       No  66  21
##       Yes 12  49
## [1] "Training error rate =  9.4  %"
## [1] "Test error rate =  22.3  %"
## [1] "random split number  2  into training and test sets"

## [1] "Confusion matrix for training set"
##           AHD.train
## pred.train No Yes
##        No  68   6
##        Yes  9  66
## [1] "Confusion matrix for test set"
##          AHD.test
## pred.test No Yes
##       No  54  20
##       Yes 29  45
## [1] "Training error rate =  10.07  %"
## [1] "Test error rate =  33.11  %"
## [1] "random split number  3  into training and test sets"

## [1] "Confusion matrix for training set"
##           AHD.train
## pred.train No Yes
##        No  65  14
##        Yes 12  58
## [1] "Confusion matrix for test set"
##          AHD.test
## pred.test No Yes
##       No  62  23
##       Yes 21  42
## [1] "Training error rate =  17.45  %"
## [1] "Test error rate =  29.73  %"
## [1] "random split number  4  into training and test sets"

## [1] "Confusion matrix for training set"
##           AHD.train
## pred.train No Yes
##        No  77  13
##        Yes  6  53
## [1] "Confusion matrix for test set"
##          AHD.test
## pred.test No Yes
##       No  68  29
##       Yes  9  42
## [1] "Training error rate =  12.75  %"
## [1] "Test error rate =  25.68  %"
## [1] "random split number  5  into training and test sets"

## [1] "Confusion matrix for training set"
##           AHD.train
## pred.train No Yes
##        No  80  11
##        Yes  4  54
## [1] "Confusion matrix for test set"
##          AHD.test
## pred.test No Yes
##       No  65  17
##       Yes 11  55
## [1] "Training error rate =  10.07  %"
## [1] "Test error rate =  18.92  %"

Bagging

  • Bagging, or bootstrap aggregation, is a procedure developed to reduce the high degree of variability in tree-based predictions.

Bagging for regression trees

  • Suppose that \(\hat f^1(x), \hat f^2(x), \ldots, \hat f^B(x)\) are predictions generated using \(B\) independent training sets.
  • Average the predictions to form

\[\hat f_{avg}(x) = \frac{1}{B} \sum_{b=1}^B \hat f^b(x)\]

  • But how do we get independent training sets?

  • We bootstrap, by taking repeated samples from our SINGLE original training set.

  • This gives \(B\) bootstrap training sets.

  • Train the method on the \(b\)’th bootstrapped training set to obtain the prediction \(\hat f^{*b}(x)\), for \(b=1,2, \ldots , B\).

  • Average to get

\[\hat f_{bag}(x) = \frac{1}{B} \sum_{b=1}^B \hat f^{*b} (x)\]

  • This is the **bagged estimate** of \(f(x)\).

  • The procedure is called **bagging**. A small hand-rolled sketch follows; in practice the randomForest package (used below) does the work for us.
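
A hand-rolled sketch of this procedure for a regression tree (using the Boston housing data from the MASS package, which we analyse in detail later):

library(MASS)   #for the Boston data
library(tree)
set.seed(1)
B <- 100
n <- nrow(Boston)
pred_mat <- matrix(NA, nrow = n, ncol = B)      #one column of predictions per tree
for (b in 1:B) {
  boot <- sample(1:n, n, replace = TRUE)        #bootstrap training set b
  fit  <- tree(medv ~ ., data = Boston[boot, ]) #f^{*b}
  pred_mat[, b] <- predict(fit, newdata = Boston)
}
f_bag <- rowMeans(pred_mat)                     #the bagged estimate f_bag(x)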

Bagging for classification trees

  • How do we average predictions for classification trees?

  • We use a majority vote. For a given \(x\), predict the outcome using the prediction generated by the majority of the \(B\) predictions.

  • For example, for a fixed \(x\), and predicting a binary outcome \(AHD\) with \(B=100\), suppose that 73 of the trees predict \(AHD=Y\), and the other 27 trees predict \(AHD=N\). Then the majority prediction is \(Y\), as the small check below confirms.
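
A one-line check of this majority vote in R (the vector of 100 tree predictions is made up for illustration):

votes <- c(rep("Yes", 73), rep("No", 27))  #hypothetical predictions from B = 100 trees
table(votes)
names(which.max(table(votes)))             #majority prediction: "Yes"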

library(randomForest) #random forest library
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.

Out-of-bag (OOB) error rate

  • For each bootstrap sample, about 1/3 of the sample is left out. Why? (See the quick check after this list.)
  • These left-out observations are referred to as out-of-bag (or OOB) observations.
  • A tree is estimated for each bootstrap sample.
  • For each data point, say case \(i\), not used in the tree construction, the outcome is predicted. A majority vote is taken using all trees which did not use the \(i\)’th observation in their construction, and this gives the overall classification for the outcome, say \(Y_i\).
  • This is repeated for each of the \(n\) observations, leading to an overall misclassification rate, the out-of-bag error rate. It is a valid measure of test error for the bagged model.
  • The OOB misclassification rate will typically be close, but not identical, to the misclassification rate on an independent test set.
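
Why about 1/3? Each bootstrap draw picks a given case with probability \(1/n\), so the chance that case \(i\) is never drawn in \(n\) draws with replacement is \((1-1/n)^n \approx e^{-1} \approx 0.37\):

n <- nrec        #297 complete cases
(1 - 1/n)^n      #probability that a given case is out of bag, about 0.367
exp(-1)          #limiting value as n grows, about 0.368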
train=sample(1:nrec,ntrain,replace=F)
AHD.train=AHD[train]
AHD.test=AHD[-train]
heart.bag=randomForest(AHD~.,data=heart,subset=train,mtry=14,importance=T) #mtry = 14 = all predictors, i.e. bagging
heart.bag
## 
## Call:
##  randomForest(formula = AHD ~ ., data = heart, mtry = 14, importance = T,      subset = train) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 14
## 
##         OOB estimate of  error rate: 18.12%
## Confusion matrix:
##     No Yes class.error
## No  62  15   0.1948052
## Yes 12  60   0.1666667
pred.test=predict(heart.bag,newdata=heart[-train,])
print("Confusion matrix for test set")
## [1] "Confusion matrix for test set"
table(pred.test,AHD.test)
##          AHD.test
## pred.test No Yes
##       No  67  18
##       Yes 16  47

Variable importance measures

  • For each observation in the training or test sets, bagging gives a prediction, and these predictions tend to be better than those from individual trees.
  • One downside of the bagging procedure is that it doesn’t give a single tree, so it has less interpretability than the single-tree estimate. Bagging improves prediction at the expense of interpretation.
  • We can record the total amount by which the Gini index is decreased by splits over a particular predictor, averaged over all \(B\) trees. This gives an overall estimate of the importance of each predictor variable (the MeanDecreaseGini column below).
  • We can do the same using prediction accuracy: the MeanDecreaseAccuracy column reports the average drop in out-of-bag prediction accuracy when the values of a predictor are randomly permuted.
importance(heart.bag)
##                    No       Yes MeanDecreaseAccuracy MeanDecreaseGini
## X          0.07476407  1.284280            0.8553039        6.0825163
## Age        1.06589071  1.777932            2.0312460        3.2198839
## Sex        2.93776943  1.511667            3.4233705        0.5267229
## ChestPain 10.69219967 11.840729           14.8320223        6.8514542
## RestBP     2.86394062  8.638051            7.7308635        4.3420267
## Chol      -2.88475805 -1.573189           -3.1274613        4.4680851
## Fbs        0.68753032 -2.643287           -1.1376819        0.2844849
## RestECG    3.44216151  4.640968            5.3596124        1.6982777
## MaxHR      3.36723620  9.444810            9.0468521        5.1315817
## ExAng      5.63303420  4.169836            6.6796224        2.3574364
## Oldpeak    7.97004873 12.584993           14.5622456        5.9222721
## Slope      4.17468623  1.155356            4.1438552        0.8906864
## Ca         8.76575554  5.935472            9.8585296        4.2068380
## Thal      28.32897277 25.739275           35.1431699       27.9026195
varImpPlot(heart.bag)

Random forests

  • A random forest is like a bagged estimate, except that the predictors chosen at each split are a random sample of the complete set of predictors.

  • The number of variables \(m\) used at each split is a tuning parameter. The default for classification is \(\sqrt p\) (rounded down), where \(p\) is the total number of predictors (the regression default is discussed with the Boston data below).

  • Bagging is a particular case of the random forest algorithm, with \(m=p\).

  • Rationale:

    • suppose there is one very strong predictor and several moderately strong predictors. Then each tree in the bagged collection will typically have the same variable at the first split.
    • so all bagged trees will look fairly similar
    • so predictions for the different trees will be quite similar
    • so the variance of the overall prediction will not be dramatically reduced. Why? (Averaging highly correlated predictions reduces variance much less than averaging independent ones.)
  • by randomly choosing variables at each split, the correlation among trees is reduced.

  • With a large number of correlated predictors, a small value of \(m\) typically works well.

heart.rf=randomForest(AHD~.,data=heart, subset=train,importance=T) #default mtry = floor(sqrt(14)) = 3 here
heart.rf
## 
## Call:
##  randomForest(formula = AHD ~ ., data = heart, importance = T,      subset = train) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 14.77%
## Confusion matrix:
##     No Yes class.error
## No  67  10   0.1298701
## Yes 12  60   0.1666667
importance(heart.rf)
##                  No          Yes MeanDecreaseAccuracy MeanDecreaseGini
## X         -1.397063  0.002760374           -0.9859185        4.9582819
## Age        1.441290  4.936340622            4.5211690        4.8397769
## Sex        3.857060  2.938012008            4.7378004        2.1426732
## ChestPain 10.486304 11.878496717           14.6213525        8.6367107
## RestBP     1.959248  1.452121915            2.4062627        4.1226150
## Chol       1.147181  0.904183006            1.3926925        4.6384059
## Fbs        1.923851 -1.386881574            0.3558581        0.5184759
## RestECG    2.248565  1.085243506            2.5536359        1.6336395
## MaxHR      4.818789  6.125040665            7.5754747        7.9171340
## ExAng      5.908052  4.488123177            7.0309664        4.5976822
## Oldpeak    7.299092  8.520673603           11.0204382        7.8160824
## Slope      4.153070  2.424139490            4.6859871        2.5498119
## Ca         9.621453  6.406586024           10.7658444        4.9873911
## Thal      19.728484 20.737837374           25.6048731       14.1704503
varImpPlot(heart.rf)

pred.test=predict(heart.rf,newdata=heart[-train,])
print("Confusion matrix for test set")
## [1] "Confusion matrix for test set"
table(pred.test,AHD.test)
##          AHD.test
## pred.test No Yes
##       No  71  21
##       Yes 12  44

Boosting (you won’t be asked any questions on boosting)

  • Fit a tree, and obtain residuals
  • Fit a second tree on the residuals from the first tree
  • Update the residuals by combining predictions from the second tree with predictions from the first tree.
  • Iterate (a minimal sketch of the idea follows below)
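
A minimal sketch of this residual-fitting idea for a continuous outcome (synthetic data; the shrinkage parameter lambda and the number of trees are arbitrary choices, and in practice each tree is kept small):

library(tree)
set.seed(1)
n <- 200
x <- runif(n, -2, 2)
y <- sin(2*x) + rnorm(n, sd = 0.3)   #synthetic regression data
dat <- data.frame(x = x, y = y)

B <- 100        #number of trees
lambda <- 0.1   #shrinkage: each tree adds only a small correction
f <- rep(0, n)  #current ensemble prediction
r <- y          #residuals; initially the outcome itself

for (b in 1:B) {
  fit <- tree(r ~ x, data = data.frame(x = x, r = r))  #fit a tree to the residuals
  f <- f + lambda * predict(fit, newdata = dat)        #update the ensemble
  r <- y - f                                           #update the residuals
}
mean((y - f)^2)   #training MSE of the boosted ensemble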
library(gbm)          #boosting library
## Loaded gbm 2.1.5
Y=ifelse(AHD.train=="Yes",1,0)  #need a 0/1 outcome for boosting
heart.boost=gbm(Y~.-AHD,data=heart[train,],distribution="bernoulli",n.trees=5000)
summary(heart.boost)

##                 var    rel.inf
## Thal           Thal 23.8750363
## X                 X 12.7517494
## ChestPain ChestPain 12.1991604
## Oldpeak     Oldpeak 12.1965817
## Ca               Ca  7.0485840
## RestBP       RestBP  6.4162442
## Chol           Chol  5.1739430
## RestECG     RestECG  4.6993269
## MaxHR         MaxHR  4.5605497
## Age             Age  4.0542566
## ExAng         ExAng  3.0448924
## Slope         Slope  2.6475230
## Sex             Sex  0.8832662
## Fbs             Fbs  0.4488863
pred.train=predict(heart.boost,newdata=heart[train,],n.trees=5000,type="response")
pred.train=ifelse(pred.train>.5,1,0)  #threshold the fitted probabilities at 0.5
table(pred.train,AHD.train)
##           AHD.train
## pred.train No Yes
##          0 77   0
##          1  0  72
pred.test=predict(heart.boost,newdata=heart[-train,],n.trees=5000,type="response") #type="response" gives probabilities
pred.test=ifelse(pred.test>.5,1,0)
table(pred.test,AHD.test)
##          AHD.test
## pred.test No Yes
##         0 68  15
##         1 15  50

DATASET 2: Boston : regression of median value response

read the data

library(MASS)
boston <- Boston #I generally prefer lower case
#setDT(boston)
str(boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

Let’s split the data and build one regression tree

#library(magrittr)
# for sample_frac
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:randomForest':
## 
##     combine
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tree)
set.seed(1)
# infix notation (pipe)
#boston_train = Boston %>% sample_frac(.5)
#boston_test = Boston %>%
#  setdiff(boston_train)
# magrittr for this syntax
# https://stackoverflow.com/questions/24536154/what-does-mean-in-r
# iris %>% head() is equivalent to head(iris)

# I (pf) prefer this syntax:
boston_train = sample_frac(Boston,.5)
boston_test = setdiff(Boston, boston_train)

tree_boston=tree(medv~., boston_train)

summary(tree_boston)
## 
## Regression tree:
## tree(formula = medv ~ ., data = boston_train)
## Variables actually used in tree construction:
## [1] "rm"    "lstat" "crim"  "age"  
## Number of terminal nodes:  7 
## Residual mean deviance:  10.38 = 2555 / 246 
## Distribution of residuals:
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -10.1800  -1.7770  -0.1775   0.0000   1.9230  16.5800

Let’s bag it!

Use \(m=p=13\)

mtry = 13 means that all 13 predictors are considered at each split of the tree.

This implies that we are using randomForest to do BAGGING.

library(randomForest)
set.seed(1)
bag_boston = randomForest(medv~., 
                          data = boston_train, 
                          mtry = 13, 
                          importance = TRUE)
bag_boston
## 
## Call:
##  randomForest(formula = medv ~ ., data = boston_train, mtry = 13,      importance = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 13
## 
##           Mean of squared residuals: 11.39601
##                     % Var explained: 85.17

Now evaluate the performance of the BAGGED model on the test set:

library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
## 
##     margin
bagged_estimate = predict(bag_boston, 
                          newdata = boston_test)

ggplot() + 
    geom_point(aes(x = boston_test$medv, y = bagged_estimate)) +
    geom_abline()

MSE of the bagged model

Now estimate the test MSE of the BAGGED model (here refit with only 25 trees):

bag_boston_25_trees = randomForest(medv~., data =  boston_train, mtry = 13, ntree = 25)
bagged_estimate_25_trees = predict(bag_boston_25_trees, newdata = boston_test)
mse_baggedmodel <- mean((bagged_estimate_25_trees - boston_test$medv)^2)
mse_baggedmodel
## [1] 23.66716

What if we use fewer trees?

# test MSE of a bagged model (mtry = 13) with n trees
mse <- function(n, test, train){
  bagm = randomForest(medv ~ ., data = train, mtry = 13, ntree = n)
  bagpred = predict(bagm, newdata = test)
  return(mean((bagpred - test$medv)^2))
}

mse(5,boston_test,boston_train)
## [1] 25.29835

What if we take more trees?

mse(100,boston_test,boston_train)
## [1] 23.2125

Was bagging useful?

Now, was bagging useful?

To test this, we are going to compare (compare what? your answer here) the BAGGED model to a SINGLE tree (CART model) that is optimized by cross-validation.

Let us build the COMPETING MODEL (CART-CV)

It will be a cross-validated CART model

Let us run this:

tree_boston=tree(medv~., boston_train)
cv_boston = cv.tree(tree_boston)
plot(cv_boston$size, cv_boston$dev, type = 'b')

Hmm, it looks like CV selects 6 or 7 as the best size (minimum deviance), but 7 more often than 6. OK, I take 7 as the optimal tree size.

And I prune the full tree to this size (nested subtrees, like Russian dolls):

prune_boston = prune.tree(tree_boston, 
                          best = 7)
plot(prune_boston)
text(prune_boston, pretty = 0)

Now that I have built my alternative model, I can compare it to the bagged model.

But how should I compare it?

Simply by calculating the MSE of the CART model on the test set.

single_tree_estimate = predict(prune_boston, 
                               newdata = boston_test)
ggplot() + 
    geom_point(aes(x = boston_test$medv, y = single_tree_estimate)) +
    geom_abline()

mse_cartmodel <- mean((single_tree_estimate - boston_test$medv)^2)
mse_cartmodel
## [1] 153.5446

So now we compare the two MSEs. Which one is better, what is your conclusion?

c(mse_cartmodel,mse_baggedmodel)
## [1] 153.54458  23.66716

Yes BAGGING WAS USEFUL!

OK, but can we do better? (1)

How?

RF!

By default, when no value is passed to mtry, the function randomForest uses \(m = p/3\) variables (features, rounded down) for regression, and \(m = \sqrt{p}\) (rounded down) for classification.
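
For the Boston data with \(p = 13\) predictors, those defaults work out as:

p <- 13
floor(p/3)      #default mtry for regression: 4
floor(sqrt(p))  #default mtry for classification: 3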

Let us try \(m=6\)

# test MSE of a random forest with m variables tried at each split
mserf <- function(m, test, train){
  rfb = randomForest(medv ~ ., data = train, mtry = m, importance = TRUE)
  rfest = predict(rfb, newdata = test)
  return(mean((rfest - test$medv)^2))
}


m = 6    #random forest: 6 variables tried at each split
mb = 13  #bagging: all 13 variables
h = NULL
hb = NULL
for(i in 1:50){   #50 refits to see the variability of the test MSE
  h = c(h, mserf(m, boston_test, boston_train))
  hb = c(hb, mserf(mb, boston_test, boston_train))
}
hist(h)

hist(hb)

Did RF improve over bagging?

c(median(h),median(hb))
## [1] 19.55883 23.34456

what do you think?
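
One way to compare the two sampling distributions side by side (a sketch; the histograms above carry the same information):

boxplot(list(rf_m6 = h, bagging_m13 = hb), ylab = "test MSE over 50 refits")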

OK, but can we do better? (2)

How?

Boosting!

I leave this as an exercise.

DATASET 3: Carseats sales

Can you replicate the same methodology for the Carseats sales data?

Response: either keep the original Sales response (continuous: regression exercise), or transform it into a two-level factor (Low, High: classification exercise).

(src:daviddalpiaz.github.io/r4sl/trees.html)

Full tree

library(tree)
library(ISLR)

data(Carseats)
#?Carseats
str(Carseats)

Carseats$Sales = as.factor(ifelse(Carseats$Sales <= 8, "Low", "High"))
str(Carseats)

# Fit an unpruned classification tree using all of the predictors.

seat_tree = tree(Sales ~ ., data = Carseats)
# seat_tree = tree(Sales ~ ., data = Carseats, 
#                  control = tree.control(nobs = nrow(Carseats), minsize = 10))
summary(seat_tree)
plot(seat_tree)
text(seat_tree, pretty = 0)
title(main = "Sales of carseats: full classification tree: no pruning!")
#  details of the splits.
seat_tree

Variability in tree estimates

  • There is considerable variability in the estimated tree and the associated predictions, even when cross-validation is used to choose the size of the tree.
  • The following example illustrates this variability, using 5 random splits into training and test sets.
# Carseats
attach(Carseats)
nrec=nrow(Carseats);nrec
ntrain=ifelse(nrec%%2==0,nrec/2,(nrec+1)/2);ntrain