Objective

This file is for those who want to understand the R mechanics behind regression trees


step-0 explain the goal

We want to predict the log-salary of baseball players from their Hits and Years.

step-1 load the data

library(ISLR)  #contains the Hitters data set
library(tree)  #contains functions for tree construction

# assign to a variable named baseball a subset of the Hitters data frame
# we only keep the rows for which the boolean condition
# !is.na(Hitters$Salary) is TRUE
# is.na() tests whether a value is NA ('not available'), so !is.na() is TRUE
# exactly for the baseball players that have a real numerical value, not an NA,
# in the column Salary
# note that we test a column but select rows: the boolean vector sits before the comma in Hitters[,], i.e. in the row position
baseball=Hitters[!is.na(Hitters$Salary),] #remove records with missing salary
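# optional sanity check (a small sketch of my own; the exact counts depend on
# the version of the ISLR package you have installed)
sum(is.na(Hitters$Salary))   # number of players with a missing Salary
dim(Hitters)                 # rows and columns before filtering
dim(baseball)                # rows and columns after filtering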

# now we subset the columns
# and we create a new dataframe named baseb which contains 
# the salary in column 1
# the Years in column 2
# the Hits in column 3
baseb=baseball[,c("Salary","Years","Hits")] #keep only 3 variables
# we decide to define the output we want to model as the log-salary
# we overwrite column 1 of baseb with the log of the values originally stored in this column
# note that, out of laziness, we still name this column Salary, even though it is now the log-salary
# this will be our output, which we often write with the letter y
# and we want to find a magical formula to predict these values (y) from the predictor values, Years and Hits
baseb$Salary=log(baseb$Salary) #model log(salary)
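# optional: peek at the result of the two steps above
head(baseb)              # the Salary column now holds log-salaries
summary(baseb$Salary)    # quick look at the distribution of the response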

step-2 build our own library of functions to calculate ONE node of the regression tree

# we are not running much code here, just defining functions that we will use later

n=dim(baseb)[1] #number of baseball players
M=dim(baseb)[2] #number of variables (columns): Salary, Years, Hits

# the next function takes three inputs:
# data: the vector of response values (here, the log-salaries)
# pred: the vector of values of one predictor (here, Years or Hits)
# cut:  a candidate cutpoint C, used to split the data into pred<=C and pred>C

SSE=function(data,pred,cut){
#split data into two sets, data1 and data2
#returns sums of squares and means for each of data1,data2
 data1=data[pred<=cut]
 data2=data[pred>cut]
#calculate the mean of each set
 M1=mean(data1)
 M2=mean(data2)
#calculate the error sum of squares of each set around its mean
 SSE1=sum((data1-M1)^2)
 SSE2=sum((data2-M2)^2)
#return the two error sums of squares and the two means
 return(c(SSE1,SSE2,M1,M2))
 }
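
# a tiny illustration of SSE on made-up numbers (not from the Hitters data):
# the cut 2.5 separates y into c(1,2) and c(10,11)
ytoy=c(1,2,10,11)
xtoy=c(1,2,3,4)
SSE(ytoy,xtoy,2.5)  # returns 0.5 0.5 1.5 10.5, i.e. SSE1, SSE2, mean1, mean2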

# the next function calculates a grid of candidate values of the cutpoint C
# this 'grid' is returned as a vector called cuts
# the grid will be used later by the function cut1 to search for the best cut

# define the function named gencuts
# it takes a single argument x
# x is a single predictor column of the data frame (e.g. Years or Hits)

gencuts=function(x){

#calculates all possible cutpoints for x
xuo=sort(unique(x))  #sorted unique values of x
# define a vector called cuts, whose length is one less than the length of xuo,
# and initialize all of its values to 0
cuts=rep(0,(length(xuo)-1)) 
nc=length(cuts) # we will have nc cut values. one per interval of two consecutive values in xuo
for (i in 1:nc){

# calculate the mean of the values of the vector xuo in the range i:(i+1)
# hence this returns 2.5 if xuo[i:(i+1)] = [1,4] for example 
# this mean is simply the midpoint of consecutive sorted values in xuo
# we store this mean at position i in vector cuts. It is a new valid value for a cut C of the predictor column x
cuts[i]=mean(xuo[i:(i+1)]) #midpoints 

}
# and we return the cuts vector, i.e. vector of candidate cutpoints here
return(cuts)
}
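
# a tiny illustration on made-up numbers: the sorted unique values of
# c(1,4,4,7) are 1, 4, 7, so the candidate cuts are the midpoints 2.5 and 5.5
gencuts(c(1,4,4,7))  # returns 2.5 5.5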


# now we define another function, named cut1
# given a response y and one predictor x, it tries every candidate cut from gencuts(x),
# computes the total error sum of squares SSE1+SSE2 for each cut with the SSE function,
# and returns the cut that makes this total minimal

cut1=function(y,x){
 cuts=gencuts(x) #get candidate cutpoints for variable x
 ncuts=length(cuts) #number of cut points
 SSEs=matrix(rep(0,5*ncuts),byrow=T,ncol=5) #one row per cut: total SSE, SSE1, SSE2, M1, M2
 for (i in 1:length(cuts)){
   SSEs[i,2:5]=SSE(y,x,cuts[i]) #SSE and mean of each branch for this cut
   SSEs[i,1]=sum(SSEs[i,2:3])}  #total error sum of squares = SSE1+SSE2
 minp=(1:length(cuts))[SSEs[,1]==min(SSEs[,1])] #position of the cut with the smallest total
#returns best cut point, total sum of squares, sums of squares for the 2 branches, means for the 2 branches
 return(c(cuts[minp],SSEs[minp,]))
}
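
# a tiny illustration on the same made-up numbers as before: the best cut of
# x=c(1,2,3,4) for y=c(1,2,10,11) is 2.5, with total error sum of squares 1
cut1(c(1,2,10,11),c(1,2,3,4))  # returns 2.5 1.0 0.5 0.5 1.5 10.5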

Finally, we will start using the code we wrote above

In this step we do not call any tree-fitting library; we only use base R and the functions defined above.

# find the best predictor and cutpoint for the first partition
# column 1 of baseb is the response y; columns 2 (Years) and 3 (Hits) are the predictors
for (j in 2:3){
  print(cut1(baseb[,1],baseb[,j]))
}
## [1]   4.500000 115.058475  42.353165  72.705310   5.106790   6.354036
## [1] 117.500000 160.971530  96.510476  64.461054   5.566327   6.413784

Each printed vector is (best cut, total SSE, SSE1, SSE2, mean1, mean2): the first line is for Years (j=2), the second for Hits (j=3). The total error sum of squares is smaller for Years (115.06 versus 160.97), so the best choice is to partition using \(Years<4.5\) and \(Years>4.5\).

# this is the dataset that we will be using down the left branch of the split
# we just apply our 'nodal rule': xbest is column 2 (Years) and Cbest is 4.5
# and we use the rule to define the subset of baseball players with Years < 4.5
  baseb1=baseb[baseb[,2]<4.5,] 

# this is the dataset that we will be using down the right branch of the split
# same thing but this is the complement of subset baseb1 
  baseb2=baseb[baseb[,2]>4.5,]
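# optional check: the two branches together account for every player, because
# Years only takes integer values, so no player sits exactly on the cut 4.5
dim(baseb1)
dim(baseb2)
nrow(baseb1)+nrow(baseb2)==nrow(baseb)  # should be TRUE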
  
# and now?
# you clone yourself into two programmers
# A will deal with baseb1 and B will deal with baseb2
# what do I mean by 'deal with'?
# simply apply the same procedure as before, i.e. try to define a split-node rule
# the only difference is that you are not using the same dataset

# so A will end up defining 2 new datasets, and B also 2 new datasets
# this is called recursive binary splitting

# however, we need to know how to stop this explosive division process

# by introducing a stopping rule, we will ensure that the program terminates

# it will also become clear to you that A or B (or any of their descendants) will not always create 2 nodes.
# They will create either 0, 1 or 2 nodes (a sketch of this recursion is given after the output below).
# apply exactly the same search to the right branch (players with Years > 4.5)
for (j in 2:3)print(cut1(baseb2[,1],baseb2[,j]))
## [1]  6.500000 69.877019 27.582434 42.294585  6.164227  6.440167
## [1] 117.500000  48.976782  28.093708  20.883074   5.998380   6.739687
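
This time the smaller total error sum of squares is obtained for Hits (48.98 versus 69.88), so the right branch is best split using \(Hits<117.5\) and \(Hits>117.5\).

Below is a minimal sketch of how the whole recursion could be automated with the functions defined above. It is my own illustration, not the lecture's code: the function name grow, the stopping rule and the default minsize of 20 players are arbitrary choices.

grow=function(d,minsize=20,depth=0){
# stopping rule: a node with fewer than minsize players becomes a leaf
 if (nrow(d)<minsize){
  cat(strrep("  ",depth),"leaf: n =",nrow(d),", mean log-salary =",round(mean(d[,1]),3),"\n")
  return(invisible(NULL))
 }
# try every predictor column and keep the one whose best cut gives the smallest total SSE
 best=NULL
 for (j in 2:ncol(d)){
  if (length(unique(d[,j]))<2) next       # cannot split on a constant column
  res=cut1(d[,1],d[,j])                   # c(cut, total SSE, SSE1, SSE2, M1, M2)
  if (is.null(best) || res[2]<best$sse) best=list(j=j,cut=res[1],sse=res[2])
 }
 if (is.null(best)){                      # no split is possible: this is a leaf
  cat(strrep("  ",depth),"leaf: n =",nrow(d),", mean log-salary =",round(mean(d[,1]),3),"\n")
  return(invisible(NULL))
 }
 cat(strrep("  ",depth),"split:",names(d)[best$j],"<=",best$cut,"\n")
 grow(d[d[,best$j]<=best$cut,],minsize,depth+1)  # recurse down the left branch
 grow(d[d[,best$j]> best$cut,],minsize,depth+1)  # recurse down the right branch
}
# grow(baseb)   # uncomment to print the sequence of splits and leaves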

Cross-validated choice of tree size

Carry out the cross validation, and plot the penalized sum of squares vs tree size.

# this unfortunately requires the tree library
library(tree)
# I am lazy: I call an R function to fit a regression tree
# the resulting model is stored in a variable named basebtree
basebtree=tree(baseb) #construct the tree
# but this model is too complex!
# an overgrown tree is not a good model: it overfits the training data
# I need to reduce its complexity
# to do so, I will use cross-validation
# again, I am lazy, so I call a ready-to-use function to do the work for me
# this single line of code does a lot (of simple things)
basebcv=cv.tree(basebtree)
# it would be better to code your own function for this!
# (a rough do-it-yourself sketch is given after the plot below)
# let us inspect what is output by this function
# what do we have in the returned object basebcv?
# to see everything in this object, I will use the function attributes
attributes(basebcv)
## $names
## [1] "size"   "dev"    "k"      "method"
## 
## $class
## [1] "prune"         "tree.sequence"
#
plot(basebcv$size,basebcv$dev,type='b') #cross-validated deviance vs number of leaves
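
# as promised, a rough do-it-yourself sketch of cross-validating the tree size.
# This is NOT what cv.tree does internally (cv.tree works with the
# cost-complexity pruning sequence); K=10 folds and the seed are my own
# arbitrary choices, just to illustrate the idea.
set.seed(1)
K=10
fold=sample(rep(1:K,length.out=nrow(baseb)))  # assign each player to one of K folds
nleaves=sum(basebtree$frame$var=="<leaf>")    # number of leaves of the full tree
sizes=2:nleaves                               # candidate tree sizes
cvdev=rep(0,length(sizes))
for (k in 1:K){
 train=baseb[fold!=k,]
 test=baseb[fold==k,]
 fullk=tree(train)                            # grow a tree on the other K-1 folds
 for (s in 1:length(sizes)){
  prk=prune.tree(fullk,best=sizes[s])         # prune it back to the candidate size
  cvdev[s]=cvdev[s]+sum((test$Salary-predict(prk,newdata=test))^2)
 }
}
plot(sizes,cvdev,type='b')                    # compare with the cv.tree plot above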

# and now we just select the sweet-spot complexity
# to do this I want to select the complexity (size is the number of leaves of each model)
# which makes dev minimal
# I can do this by first finding where dev is minimized:
here = basebcv$dev==min(basebcv$dev)
# and then reading off the size at that sweet spot
bestsize=basebcv$size[here]
# or do this in a single line of code
bestsize=basebcv$size[basebcv$dev==min(basebcv$dev)]
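# equivalently, in base R: which.min returns the index of the (first) minimum
bestsize=basebcv$size[which.min(basebcv$dev)]
bestsize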

# those who pay attention will have realized that during the lecture, I got 5, and here I got 4.
# Can anybody tell me why? :)
# having guessed the best tree complexity, I can plug it into prune.tree to prune the tree
# I call this my final model
# I store my final beloved tree in a variable named myfinalregressiontreemodel
myfinalregressiontreemodel=prune.tree(basebtree,best=bestsize)
plot(myfinalregressiontreemodel)
text(myfinalregressiontreemodel)

# so now I have a beloved model.
# My employer is happy. He is going to use my beloved tree to predict the salary of any baseball player,
# not only those that are in the Hitters dataset but others as well.
# and if I really like R mechanics, I should continue this exercise myself:
# write my own code to predict the salaries of all the baseball players playing in 2020 ...
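
# in the meantime, here is a minimal sketch using the ready-made predict method
# of the tree package; the new player below is made up purely for illustration
newplayer=data.frame(Years=6,Hits=100)
predict(myfinalregressiontreemodel,newdata=newplayer)       # predicted log-salary
exp(predict(myfinalregressiontreemodel,newdata=newplayer))  # back on the original salary scale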