CART (single tree)
Bagging
Random Forest
Boosting
Main idea: Averaging.
We now move to ensemble methods (pro: better prediction; con: less interpretability).
heart=read.csv("http://faculty.marshall.usc.edu/gareth-james/ISL/Heart.csv")
#heart=read.csv("http://www-bcf.usc.edu/~gareth/ISL/Heart.csv")
library(tree)
nmiss=apply(is.na(heart),1,sum) #number of missing values, by row
heart=heart[nmiss==0,] #remove cases with missing values
attach(heart)
nrec=nrow(heart);nrec
## [1] 297
ntrain=ifelse(nrec%%2==0,nrec/2,(nrec+1)/2);ntrain
## [1] 149
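The loop that produced the output below is not shown in this excerpt. Here is a minimal sketch of what it might look like, assuming a single CART tree fit with tree() on each of five random train/test splits (the seed and the exact code are assumptions; the variable names follow the printed output):

set.seed(1) #assumed: no seed is given in the original
for(b in 1:5){
print(paste("random split number",b,"into training and test sets"))
train=sample(1:nrec,ntrain,replace=F)
AHD.train=AHD[train];AHD.test=AHD[-train]
fit=tree(AHD~.,data=heart,subset=train) #AHD must be a factor for a classification tree
pred.train=predict(fit,newdata=heart[train,],type="class")
pred.test=predict(fit,newdata=heart[-train,],type="class")
print("Confusion matrix for training set");print(table(pred.train,AHD.train))
print("Confusion matrix for test set");print(table(pred.test,AHD.test))
print(paste("Training error rate =",round(100*mean(pred.train!=AHD.train),2),"%"))
print(paste("Test error rate =",round(100*mean(pred.test!=AHD.test),2),"%"))
}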
## [1] "random split number 1 into training and test sets"
## [1] "Confusion matrix for training set"
## AHD.train
## pred.train No Yes
## No 78 10
## Yes 4 57
## [1] "Confusion matrix for test set"
## AHD.test
## pred.test No Yes
## No 66 21
## Yes 12 49
## [1] "Training error rate = 9.4 %"
## [1] "Test error rate = 22.3 %"
## [1] "random split number 2 into training and test sets"
## [1] "Confusion matrix for training set"
## AHD.train
## pred.train No Yes
## No 68 6
## Yes 9 66
## [1] "Confusion matrix for test set"
## AHD.test
## pred.test No Yes
## No 54 20
## Yes 29 45
## [1] "Training error rate = 10.07 %"
## [1] "Test error rate = 33.11 %"
## [1] "random split number 3 into training and test sets"
## [1] "Confusion matrix for training set"
## AHD.train
## pred.train No Yes
## No 65 14
## Yes 12 58
## [1] "Confusion matrix for test set"
## AHD.test
## pred.test No Yes
## No 62 23
## Yes 21 42
## [1] "Training error rate = 17.45 %"
## [1] "Test error rate = 29.73 %"
## [1] "random split number 4 into training and test sets"
## [1] "Confusion matrix for training set"
## AHD.train
## pred.train No Yes
## No 77 13
## Yes 6 53
## [1] "Confusion matrix for test set"
## AHD.test
## pred.test No Yes
## No 68 29
## Yes 9 42
## [1] "Training error rate = 12.75 %"
## [1] "Test error rate = 25.68 %"
## [1] "random split number 5 into training and test sets"
## [1] "Confusion matrix for training set"
## AHD.train
## pred.train No Yes
## No 80 11
## Yes 4 54
## [1] "Confusion matrix for test set"
## AHD.test
## pred.test No Yes
## No 65 17
## Yes 11 55
## [1] "Training error rate = 10.07 %"
## [1] "Test error rate = 18.92 %"
Notice how much the test error varies across the five random splits above: a single tree has high variance. If we had \(B\) independent training sets, we could fit the method on each one and average the resulting predictions,
\[\hat f_{avg}(x) = \frac{1}{B} \sum_{b=1}^B \hat f^b(x),\]
which would reduce the variance.
But how do we get independent training sets?
We bootstrap: we take repeated samples, with replacement, from our SINGLE original training set.
This gives \(B\) bootstrap training sets.
Train the method on the \(b\)'th bootstrapped training set to obtain the prediction \(\hat f^{*b}(x)\), for \(b=1,2,\ldots,B\).
Average to get
\[\hat f_{bag}(x) = \frac{1}{B} \sum_{b=1}^B \hat f^{*b} (x)\]
This is **the bagged estimate** of \(f(x)\).
The procedure is called **bagging**.
How do we average predictions for classification trees?
We use a majority vote: for a given \(x\), predict the class chosen by the majority of the \(B\) trees.
For example, for a fixed \(x\), and predicting a binary outcome \(AHD\) with \(B=100\), suppose that 73 of the trees predict \(AHD=Y\), and the other 27 trees predict \(AHD=N\). Then the majority prediction is \(Y\).
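As a toy illustration in R (the 73/27 split below just reproduces the hypothetical example above):

votes=c(rep("Yes",73),rep("No",27)) #hypothetical predictions from B=100 trees for one x
table(votes)
names(which.max(table(votes))) #majority prediction: "Yes"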
library(randomForest) #random forest library
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
train=sample(1:nrec,ntrain,replace=F)
AHD.train=AHD[train]
AHD.test=AHD[-train]
heart.bag=randomForest(AHD~.,data=heart,subset=train,mtry=14,importance=T)
heart.bag
##
## Call:
## randomForest(formula = AHD ~ ., data = heart, mtry = 14, importance = T, subset = train)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 14
##
## OOB estimate of error rate: 18.12%
## Confusion matrix:
## No Yes class.error
## No 62 15 0.1948052
## Yes 12 60 0.1666667
pred.test=predict(heart.bag,newdata=heart[-train,])
print("Confusion matrix for test set")
## [1] "Confusion matrix for test set"
table(pred.test,AHD.test)
## AHD.test
## pred.test No Yes
## No 67 18
## Yes 16 47
importance(heart.bag)
## No Yes MeanDecreaseAccuracy MeanDecreaseGini
## X 0.07476407 1.284280 0.8553039 6.0825163
## Age 1.06589071 1.777932 2.0312460 3.2198839
## Sex 2.93776943 1.511667 3.4233705 0.5267229
## ChestPain 10.69219967 11.840729 14.8320223 6.8514542
## RestBP 2.86394062 8.638051 7.7308635 4.3420267
## Chol -2.88475805 -1.573189 -3.1274613 4.4680851
## Fbs 0.68753032 -2.643287 -1.1376819 0.2844849
## RestECG 3.44216151 4.640968 5.3596124 1.6982777
## MaxHR 3.36723620 9.444810 9.0468521 5.1315817
## ExAng 5.63303420 4.169836 6.6796224 2.3574364
## Oldpeak 7.97004873 12.584993 14.5622456 5.9222721
## Slope 4.17468623 1.155356 4.1438552 0.8906864
## Ca 8.76575554 5.935472 9.8585296 4.2068380
## Thal 28.32897277 25.739275 35.1431699 27.9026195
varImpPlot(heart.bag)
A random forest is like a bagged estimate, except that the predictors chosen at each split are a random sample of the complete set of predictors.
The number of variables \(m\) considered at each split is a tuning parameter. For classification the default is \(\sqrt p\) (rounded down), where \(p\) is the total number of predictors.
Bagging is a particular case of the random forest algorithm, with \(m=p\).
Rationale:
By randomly choosing the candidate variables at each split, we reduce the correlation among the trees, so averaging them reduces the variance more.
With a large number of correlated predictors, a small value of \(m\) typically works well.
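As a quick check of the default (the heart data has 14 predictors, counting the X index column, and the random forest output further down reports 3 variables tried at each split):

p=14 #number of predictors in the heart data, including the X index column
floor(sqrt(p)) #default m for classification: 3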
heart.rf=randomForest(AHD~.,data=heart, subset=train,importance=T)
heart.rf
##
## Call:
## randomForest(formula = AHD ~ ., data = heart, importance = T, subset = train)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 14.77%
## Confusion matrix:
## No Yes class.error
## No 67 10 0.1298701
## Yes 12 60 0.1666667
importance(heart.rf)
## No Yes MeanDecreaseAccuracy MeanDecreaseGini
## X -1.397063 0.002760374 -0.9859185 4.9582819
## Age 1.441290 4.936340622 4.5211690 4.8397769
## Sex 3.857060 2.938012008 4.7378004 2.1426732
## ChestPain 10.486304 11.878496717 14.6213525 8.6367107
## RestBP 1.959248 1.452121915 2.4062627 4.1226150
## Chol 1.147181 0.904183006 1.3926925 4.6384059
## Fbs 1.923851 -1.386881574 0.3558581 0.5184759
## RestECG 2.248565 1.085243506 2.5536359 1.6336395
## MaxHR 4.818789 6.125040665 7.5754747 7.9171340
## ExAng 5.908052 4.488123177 7.0309664 4.5976822
## Oldpeak 7.299092 8.520673603 11.0204382 7.8160824
## Slope 4.153070 2.424139490 4.6859871 2.5498119
## Ca 9.621453 6.406586024 10.7658444 4.9873911
## Thal 19.728484 20.737837374 25.6048731 14.1704503
varImpPlot(heart.rf)
pred.test=predict(heart.rf,newdata=heart[-train,])
print("Confusion matrix for test set")
## [1] "Confusion matrix for test set"
table(pred.test,AHD.test)
## AHD.test
## pred.test No Yes
## No 71 21
## Yes 12 44
library(gbm) #boosting library
## Loaded gbm 2.1.5
Y=ifelse(AHD.train=="Yes",1,0) #need a 0/1 outcome for boosting
heart.boost=gbm(Y~.-AHD,data=heart[train,],distribution="bernoulli",n.trees=5000)
summary(heart.boost)
## var rel.inf
## Thal Thal 23.8750363
## X X 12.7517494
## ChestPain ChestPain 12.1991604
## Oldpeak Oldpeak 12.1965817
## Ca Ca 7.0485840
## RestBP RestBP 6.4162442
## Chol Chol 5.1739430
## RestECG RestECG 4.6993269
## MaxHR MaxHR 4.5605497
## Age Age 4.0542566
## ExAng ExAng 3.0448924
## Slope Slope 2.6475230
## Sex Sex 0.8832662
## Fbs Fbs 0.4488863
pred.train=predict(heart.boost,newdata=heart[train,],n.trees=5000,type="response")
pred.train=ifelse(pred.train>.5,1,0) #classify as 1 when the predicted probability exceeds 0.5
table(pred.train,AHD.train)
## AHD.train
## pred.train No Yes
## 0 77 0
## 1 0 72
pred.test=predict(heart.boost,newdata=heart[-train,],n.trees=5000,type="response") #type="response" gives probabilities, as for the training set
pred.test=ifelse(pred.test>.5,1,0)
table(pred.test,AHD.test)
## AHD.test
## pred.test No Yes
## 0 68 15
## 1 15 50
library(MASS)
boston <- Boston #I generally prefer lower case
#setDT(boston)
str(boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
Let’s split the data and build one regression tree
#library(magrittr)
# for sample_frac
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:randomForest':
##
## combine
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tree)
set.seed(1)
# infix notation (pipe)
#boston_train = Boston %>% sample_frac(.5)
#boston_test = Boston %>%
# setdiff(boston_train)
# magrittr for this syntax
# https://stackoverflow.com/questions/24536154/what-does-mean-in-r
# iris %>% head() is equivalent to head(iris)
# I (pf) prefer this syntax:
boston_train = sample_frac(Boston,.5)
boston_test = setdiff(Boston, boston_train)
tree_boston=tree(medv~., boston_train)
summary(tree_boston)
##
## Regression tree:
## tree(formula = medv ~ ., data = boston_train)
## Variables actually used in tree construction:
## [1] "rm" "lstat" "crim" "age"
## Number of terminal nodes: 7
## Residual mean deviance: 10.38 = 2555 / 246
## Distribution of residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -10.1800 -1.7770 -0.1775 0.0000 1.9230 16.5800
Let’s bag it!
Use \(m=p=13\)
mtry = 13 means that all 13 predictors are candidates at each split of each tree.
This implies that we are using the random forest procedure to perform BAGGING (\(m=p\)).
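A quick check that the Boston training set indeed has 13 predictors (the 14th column, medv, is the response):

dim(boston_train) #253 rows, 14 columns: 13 predictors + medv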
library(randomForest)
set.seed(1)
bag_boston = randomForest(medv~.,
data = boston_train,
mtry = 13,
importance = TRUE)
bag_boston
##
## Call:
## randomForest(formula = medv ~ ., data = boston_train, mtry = 13, importance = TRUE)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 13
##
## Mean of squared residuals: 11.39601
## % Var explained: 85.17
Now evaluate the performance of the BAGGED model on the test set:
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
##
## margin
bagged_estimate = predict(bag_boston,
newdata = boston_test)
ggplot() +
geom_point(aes(x = boston_test$medv, y = bagged_estimate)) +
geom_abline()
Now estimate the test MSE of the BAGGED model (here refit with only 25 trees):
bag_boston_25_trees = randomForest(medv~., data = boston_train, mtry = 13, ntree = 25)
bagged_estimate_25_trees = predict(bag_boston_25_trees, newdata = boston_test)
mse_baggedmodel <- mean((bagged_estimate_25_trees - boston_test$medv)^2)
mse_baggedmodel
## [1] 23.66716
What if we use fewer trees?
mse <- function(n,test,train){ #fit a bagged model with n trees on train and return its test-set MSE
bagm = randomForest(medv~., data = train, mtry = 13, ntree = n)
bagpred = predict(bagm, newdata = test)
return(mean((bagpred - test$medv)^2))
}
mse(5,boston_test,boston_train)
## [1] 25.29835
What if we take more trees?
mse(100,boston_test,boston_train)
## [1] 23.2125
Now, was bagging useful?
To test this, we are going to compare (compare what? your answer here) the BAGGED model to a SINGLE tree (CART model) that is optimized by cross-validation.
Let us build the COMPETING MODEL (CART-CV).
It will be a cross-validated CART model.
Let us run this:
tree_boston=tree(medv~., boston_train)
cv_boston = cv.tree(tree_boston)
plot(cv_boston$size, cv_boston$dev, type = 'b')
Hmm, it looks like CV selects 6 or 7 as the best size (minimum deviance), with 7 chosen more often than 6, so we take 7 as the optimal tree size.
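We can also pick the size programmatically rather than by eye (a small sketch):

best_size=cv_boston$size[which.min(cv_boston$dev)]
best_size #size with the minimum CV deviance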
And I prune the full tree to this size (nested subtrees, like Russian dolls):
prune_boston = prune.tree(tree_boston,
best = 7)
plot(prune_boston)
text(prune_boston, pretty = 0)
Now that I have built my alternative model, I can compare it to the bagged model.
But how should I compare it?
Simply by calculating the MSE of the CART model on the test set.
single_tree_estimate = predict(prune_boston,
newdata = boston_test) #predictions for the test set
ggplot() +
geom_point(aes(x = boston_test$medv, y = single_tree_estimate)) +
geom_abline()
mse_cartmodel <- mean((single_tree_estimate - boston_test$medv)^2)
mse_cartmodel
## [1] 153.5446
So now we compare the two MSEs. Which one is better, what is your conclusion?
c(mse_cartmodel,mse_baggedmodel)
## [1] 153.54458 23.66716
Yes BAGGING WAS USEFUL!
How?
RF!
By default, when no value is passed to mtry, the function randomForest uses \(m = p/3\) variables (rounded down) for regression and \(m = \sqrt p\) (rounded down) for classification.
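For the Boston regression, with \(p=13\) predictors, the default would therefore be:

floor(13/3) #4 variables tried at each split by default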
Let us try \(m=6\)
mserf <- function(m,test,train){ #fit a random forest with mtry=m on train and return its test-set MSE
rfb = randomForest(medv~.,
data = train,
mtry = m,
importance = TRUE)
rfest = predict(rfb,newdata = test)
return(mean((rfest - test$medv)^2))
}
m=6 #random forest: 6 variables tried at each split
mb=13 #bagging: all 13 variables tried at each split
h=NULL #test MSEs for the random forest
hb=NULL #test MSEs for bagging
for(i in 1:50){ #repeat 50 times to see the variability of the test MSE
h=c(h,mserf(m,boston_test,boston_train))
hb=c(hb,mserf(mb,boston_test,boston_train))
}
hist(h)
hist(hb)
Did RF improve over bagging?
c(median(h),median(hb))
## [1] 19.55883 23.34456
What do you think?
How?
Boosting!
I leave this as an exercise.
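For reference, here is a minimal sketch of what a boosted regression on the same split might look like (the tuning parameters are illustrative, not tuned):

library(gbm)
boost_boston=gbm(medv~.,data=boston_train,distribution="gaussian", #squared-error loss for regression
n.trees=5000,interaction.depth=4)
boost_pred=predict(boost_boston,newdata=boston_test,n.trees=5000)
mean((boost_pred-boston_test$medv)^2) #test MSE, to compare with bagging and RF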
Can you replicate the same methodology for Carseats sales?
Response: either keep the original Sales response (continuous: regression exercise), or transform it into a two-level factor (Low, High: classification exercise).
(src:daviddalpiaz.github.io/r4sl/trees.html)
library(tree)
library(ISLR)
data(Carseats)
#?Carseats
str(Carseats)
Carseats$Sales = as.factor(ifelse(Carseats$Sales <= 8, "Low", "High"))
str(Carseats)
# Fit an unpruned classification tree using all of the predictors.
seat_tree = tree(Sales ~ ., data = Carseats)
# seat_tree = tree(Sales ~ ., data = Carseats,
# control = tree.control(nobs = nrow(Carseats), minsize = 10))
summary(seat_tree)
plot(seat_tree)
text(seat_tree, pretty = 0)
title(main = "Sales of carseats: full classification tree: no pruning!")
# details of the splits.
seat_tree
# Carseats
attach(Carseats)
nrec=nrow(Carseats);nrec
ntrain=ifelse(nrec%%2==0,nrec/2,(nrec+1)/2);ntrain
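A possible next step for the exercise, a sketch of the split and a first classification tree (not a full solution):

train=sample(1:nrec,ntrain,replace=F)
seat_tree_train=tree(Sales~.,data=Carseats,subset=train)
pred.test=predict(seat_tree_train,newdata=Carseats[-train,],type="class")
table(pred.test,Carseats$Sales[-train])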