The tree-building process
We divide the predictor space - that is, the set of possible values for \(X_1, X_2, ..., X_p\) - into \(J\) distinct and non-overlapping regions, \(R_1, R_2, ..., R_J\).
For every observation that falls into the region \(R_j\), we make the same prediction, which is simply the mean (or majority vote) of the response values for the training observations in \(R_j\).
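As a toy sketch (not part of the original notes), once the regions are fixed, every observation in a region receives the same fitted value, e.g. the per-region mean for a regression tree:
## toy example: per-region mean predictions
y = c(5, 7, 6, 20, 22)      ## training responses
region = c(1, 1, 1, 2, 2)   ## region each training observation falls into
tapply(y, region, mean)     ## fitted values: 6 for region 1, 21 for region 2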
Some problems
- How do we choose which predictor and cutpoint to split on?
- Creating a CART model involves repeatedly selecting the input variable and cutpoint that give the best split (e.g. the largest reduction in RSS or node impurity) until a suitable tree is constructed.
- How to grow a tree?
- Take a top-down, greedy approach that is known as recursive binary splitting. It begins at the top of the tree and then successively splits the predictor space, and at each step of the tree-building process, the best split is made at that particular step.
- When should we stop splitting?
- Tree construction ends using a predefined stopping criterion, such as a minimum number of training observations assigned to each leaf node of the tree (see the sketch after this list).
- What if the tree is too large? Can we approximate it with a smaller tree?
- Pruning the tree (cost-complexity pruning)
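As an illustration of stopping criteria, the tree package exposes the minimum node sizes and the deviance threshold through tree.control(); the values below are examples only, not recommendations.
## illustrative stopping criteria for tree(); argument values are examples only
library(tree)
fit = tree(Species ~ ., data = iris,
           control = tree.control(nobs = nrow(iris), minsize = 10, mindev = 0.01))
summary(fit)  ## reports the number of terminal nodes grown under these criteria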

Classification Tree
- A classification tree is used to predict a qualitative/discrete response.
- We predict that each observation belongs to the most commonly occurring class of training observations in the region to which it belongs.
Evaluate the quality of a particular split
\(p_{mk}\) represents the proportion of training observations in the mth region that are from the kth class.
Misclassification rate (not sensitive enough for guiding splits, but preferable for evaluating the final pruned tree)
\(E = 1 - \max_k(p_{mk})\)
Gini index - a small value indicates that a node contains predominantly observations from a single class.
\(G = \sum_{k=1}^{K}p_{mk}(1-p_{mk})\)
Cross-entropy - takes on a small value if the mth node is pure.
\(D = - \sum_{k=1}^{K} p_{mk} \log p_{mk}\)
Node impurity measures for two-class classification.
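A quick sketch (not part of the tree package) of computing the three impurity measures from a node's vector of class proportions:
## impurity measures for one node, given its class proportions p (summing to 1)
node_impurity = function(p) {
  c(misclassification = 1 - max(p),
    gini = sum(p * (1 - p)),
    entropy = -sum(ifelse(p > 0, p * log(p), 0)))
}
node_impurity(c(0.5, 0.5))  ## most impure two-class node
node_impurity(c(0.9, 0.1))  ## nearly pure node: all three measures are small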

Classification tree in R
tree() function in tree package
### Use tree() function in tree library
library(tree)
head(iris)
tree.iris = tree(Species~., iris)
## show information for each branch; terminal nodes are indicated using asterisks.
tree.iris
## ?tree
summary(tree.iris)
plot(tree.iris)
text(tree.iris, pretty = 0)
### validation set approach
set.seed(1)
train = sample(1:150,75)
iristest = iris[-train,]
### use training data fit
tree.iris = tree(Species~., iris, subset = train)
plot(tree.iris)
text(tree.iris, pretty = 0)
### prediction on validation set
tree.pred = predict(tree.iris, iristest, type="class") ## type = "class" for classification
table(tree.pred, iristest$Species)
mean(tree.pred != iristest$Species)
## cross validation for classification tree. More details in regression tree cross validation part.
cv.iris = cv.tree(tree.iris, FUN = prune.misclass) ## use misclassification rate to guide the pruning
plot(cv.iris$size, cv.iris$dev,type="b")
plot(cv.iris$k, cv.iris$dev,type="b") ## corresponding cost-complexity parameter
prune.iris = prune.tree(tree.iris, best = 3,method = "misclass")
## equivalent to : prune.iris = prune.misclass(tree.iris, best = 3)
plot(prune.iris)
text(prune.iris, pretty = 0)
tree.pred = predict(prune.iris, iristest, type="class")
table(tree.pred, iristest$Species)
mean(tree.pred != iristest$Species)
### compare to the full tree above. We use a smaller tree and obtain the same prediction accuracy.
Regression Tree
A regression tree is very similar to a classification tree, but used to predict a quantitative/continuous response.
RSS is used as a criterion for making the binary splits.
The prediction is given by the mean response of the training observations that belong to the same terminal node.
We seek the splitting variable j and split point s that solve
\(\min_{j,s}\left[\min_{c_1} \sum_{x_i \in R_1(j,s)} (y_i - c_1)^2 + \min_{c_2} \sum_{x_i \in R_2(j,s)} (y_i - c_2)^2\right]\)
The inner minimization is solved by
\(\hat{c}_1 = ave(y_i|x_i \in R_1(j,s))\) and \(\hat{c}_2 = ave(y_i|x_i \in R_2(j,s))\)
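A minimal sketch (assuming a single numeric predictor) of the search for the best cutpoint s; the tree package repeats this search over every predictor at every node:
## brute-force search for the best split point s on one numeric predictor x
best_split = function(x, y) {
  cuts = sort(unique(x))
  cuts = (head(cuts, -1) + tail(cuts, -1)) / 2   ## candidate cutpoints: midpoints
  rss = sapply(cuts, function(s) {
    left = y[x < s]; right = y[x >= s]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  c(s = cuts[which.min(rss)], rss = min(rss))
}
best_split(iris$Petal.Length, iris$Petal.Width)   ## toy usage on built-in data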
Regression Tree in R
### similar to the classification tree
library(MASS)
set.seed(2)
train = sample(1:nrow(Boston), nrow(Boston)/2)
tree.boston=tree(medv~.,Boston,subset=train)
summary(tree.boston)
plot(tree.boston)
text(tree.boston,pretty=0)
Tree Pruning
A very large tree might overfit the data, while a small tree might not capture the important structure.
Tree size is a tuning parameter governing the model's complexity.
The strategy is to grow a large tree \(T_0\), stopping the splitting process only when some minimum node size is reached.
This large tree is pruned using cost-complexity pruning.

Define a subtree \(T \subset T_0\). \(|T|\) denotes the number of terminal nodes in T.
\(N_m = \# (x_i \in R_m)\)
\(\hat{c}_m = \frac{1}{N_m}\sum_{x_i \in R_m} y_i\)
\(Q_m(T) = \frac{1}{N_m}\sum_{x_i \in R_m}(y_i - \hat{c}_m)^2\)
The cost complexity criterion:
\(C_\alpha(T) = \sum_{m=1}^{|T|}N_mQ_m(T) + \alpha|T|\).
The tuning parameter \(\alpha\) governs the tradeoff between tree size and its goodness of fit to the data. Large values of \(\alpha\) result in smaller trees.
Estimation of \(\alpha\) is achieved by cross validation.
For a classification tree, the measure \(Q_m(T)\) can be defined as the misclassification error, Gini index, or cross-entropy.
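A minimal sketch (independent of the tree package) of evaluating the cost-complexity criterion for a candidate subtree, given the training responses grouped by terminal node; note that prune.tree() also accepts a specific \(\alpha\) directly through its k argument:
## cost-complexity criterion: sum_m N_m * Q_m(T) + alpha * |T|
cost_complexity = function(node_y, alpha) {
  ## node_y: list with one vector of responses per terminal node; alpha: tuning parameter
  rss = sapply(node_y, function(y) sum((y - mean(y))^2))  ## N_m * Q_m(T) for each node
  sum(rss) + alpha * length(node_y)
}
cost_complexity(list(c(1, 2, 3), c(10, 12)), alpha = 1)   ## toy tree with two terminal nodes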
## continue to regression tree part
### use cv.tree() to do the cross validation. Default is 10-fold cv (K = 10)
### For classification Tree
### We can use the argument FUN = prune.misclass: cv.tree(..., FUN = prune.misclass). Default is deviance.
### prune.misclass() is an abbreviation for prune.tree(method = "misclass")
cv.boston=cv.tree(tree.boston)
names(cv.boston)
### cross validation plot
### size: number of terminal nodes;
### deviance: cross-validated error (deviance/RSS for regression trees; misclassification count when FUN = prune.misclass)
### k: cost-complexity parameter
plot(cv.boston$size,cv.boston$dev,type='b')
### use the cross validation plot to find the best size of the tree
### use the best tree size obtained by cross validation to prune the tree
### Arguments in prune.tree() function:
### k: cost-complexity parameter
### best: size (i.e. number of terminal nodes) of a specific subtree in the cost-complexity sequence to be returned.
cv.boston
prune.boston=prune.tree(tree.boston,best=5) ## return the subtree with 5 terminal nodes
plot(prune.boston)
text(prune.boston,pretty=0)
### use the decision tree to make predictions on the test set
yhat=predict(tree.boston,newdata=Boston[-train,])
boston.test=Boston[-train,"medv"]
### calculate the test MSE
mean((yhat-boston.test)^2)
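### a small addition (not in the original notes): test MSE of the pruned tree, for comparison
yhat.prune=predict(prune.boston,newdata=Boston[-train,])
mean((yhat.prune-boston.test)^2)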
Advantages
Easy to interpret.
Provide good visualizations.
Help identify the most important predictors.
Work for regression and classification problems.
Disadvantages
Tend to overfit; pruning is needed.
Small changes in the data can lead to very different trees (high variance).
Decision trees can only produce axis-parallel decision boundaries.
