This file is for those who worry about R mechanics
We want to predict the log-salary of baseball players from their Hits and Years.
library(ISLR) #contains the Hitters data set
library(tree) #contains functions for tree construction
# assign to variable named baseball a subset of the Hitters dataframe
# we only keep the records for which the boolean condition
# !is.na(Hitters$Salary) is TRUE
# the is.na function tests whether a value is NA ('not available'), and ! negates the result
# so the condition is TRUE only for the baseball players that have an actual numerical
# value, not an NA, in the column Salary
# note that although the condition involves a column, it selects rows, because the boolean vector is placed before the comma in Hitters[,]
baseball=Hitters[!is.na(Hitters$Salary),] #remove records with missing salary
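# a tiny made-up illustration of this row-subsetting mechanism (the dataframe d below is
# invented for this example, it is not part of the lecture data)
d=data.frame(x=1:3,y=c(10,NA,30))
d[!is.na(d$y),] #keeps only rows 1 and 3, where y is not NA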
# now we subset the columns
# and we create a new dataframe named baseb which contains
# the salary in column 1
# the Years in column 2
# the Hits in column 3
baseb=baseball[,c("Salary","Years","Hits")] #keep only 3 variables
# we decide to define the output we want to model as the log-salary
# we overwrite column 1 of baseb by the log of the values originally stored in this column
# note that, out of laziness, we still name this column Salary, even though it is now the log-salary
# this will be our output, that we often write with the letter y
# and we want to find a magical formula to predict these values (y) from the predictor values, years and hits
#
baseb$Salary=log(baseb$Salary) #model log(salary)
# next we store a couple of dimensions, then define helper functions that we will use later
n=dim(baseb)[1] #number of baseball players
M=dim(baseb)[2] #variables
# the next function takes three inputs:
# data: the vector of output values (here, the log-salaries y)
# pred: the vector of values of one predictor (here, Years or Hits)
# cut: a candidate cutpoint C for that predictor
SSE=function(data,pred,cut){
#split data into two sets, data1 and data2
#returns sums of squares and means for each of data1,data2
data1=data[pred<=cut]
data2=data[pred>cut]
#calculate the error sum of squares for each set
M1=mean(data1)
M2=mean(data2)
SSE1=sum((data1-M1)^2)
SSE2=sum((data2-M2)^2)
#return the two error sums of squares and the two means (the total is computed later, in cut1)
return(c(SSE1,SSE2,M1,M2))
}
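# a quick sanity check of SSE on made-up numbers (not the baseball data):
# with cut=2.5, data1 is c(1,2) (mean 1.5) and data2 is c(10,12) (mean 11),
# so the call should return SSE1=0.5, SSE2=2, M1=1.5, M2=11
SSE(data=c(1,2,10,12),pred=c(1,2,3,4),cut=2.5)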
# the next function calculates a grid of candidate values of the cutpoint C
# this 'grid' is returned as a vector called cuts
# the grid will be used later by the function cut1
# we define the function named gencuts
# it takes a single argument x
# x is a single predictor column of the dataframe (e.g. Years or Hits)
gencuts=function(x){
#calculates all possible cutpoints for x
xuo=sort(unique(x)) #sorted unique values of x
# define an empty vector called cuts of length the length of xuo - 1
# and initialize all the values of this vector to 0
cuts=rep(0,(length(xuo)-1))
nc=length(cuts) # we will have nc cut values. one per interval of two consecutive values in xuo
for (i in 1:nc){
# calculate the mean of the values of the vector xuo in the range i:(i+1)
# hence this returns 2.5 if xuo[i:(i+1)] = [1,4] for example
# this mean is simply the midpoint of consecutive sorted values in xuo
# we store this mean at position i in vector cuts. It is a new valid value for a cut C of the predictor column x
cuts[i]=mean(xuo[i:(i+1)]) #midpoints
}
# and we return the cuts vector, i.e. vector of candidate cutpoints here
return(cuts)
}
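# a quick sanity check of gencuts on a made-up vector (not the baseball data):
# the sorted unique values of c(1,4,4,7) are 1, 4, 7, so the candidate cuts should be the midpoints 2.5 and 5.5
gencuts(c(1,4,4,7))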
# now we define another function, named cut1, which finds the best cutpoint of predictor x for predicting the output y
cut1=function(y,x){
cuts=gencuts(x) #get cutpoints for variable x
ncuts=length(cuts) #number of cut points
SSEs=matrix(rep(0,5*ncuts),byrow=T,ncol=5) #one row per candidate cut: total SSE, SSE1, SSE2, M1, M2
for (i in 1:length(cuts)){
SSEs[i,2:5]=SSE(y,x,cuts[i]) #sums of squares and means for this cut
SSEs[i,1]=sum(SSEs[i,2:3])} #total SSE = SSE1 + SSE2
minp=(1:length(cuts))[SSEs[,1]==min(SSEs[,1])] #index of the cut with the smallest total SSE
#returns best cut point, sums of squares for 2 branches, means for 2 branches
return(c(cuts[minp],SSEs[minp,]))
}
Finally, we will start using the code we wrote above
For this manual construction we will not use any external library, just base R and the functions we defined above (the ISLR package only supplies the data)
# find the best variable for the first partition (j=2 is Years, j=3 is Hits)
for (j in 2:3){
print(cut1(baseb[,1],baseb[,j]))
}
## [1] 4.500000 115.058475 42.353165 72.705310 5.106790 6.354036
## [1] 117.500000 160.971530 96.510476 64.461054 5.566327 6.413784
From the output (first number = best cut, second number = total error sum of squares), the Years split (SSE 115.06) beats the Hits split (SSE 160.97), so the best choice is to partition using \(Years<4.5\) and \(Years>4.5\).
# this is the dataset that we will be using down the left branch of the split
# we just apply our 'nodal rule': xbest is column 2 (Years) and Cbest is 4.5
# and we use the rule to define the subset of baseball players with Years < 4.5
baseb1=baseb[baseb[,2]<4.5,]
# this is the dataset that we will be using down the right branch of the split
# same thing but this is the complement of subset baseb1
baseb2=baseb[baseb[,2]>4.5,]
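# a quick sanity check (not in the original notes): the two branch datasets together
# cover all players, since Years is an integer and can never equal 4.5 exactly
nrow(baseb1)+nrow(baseb2)==n #should print TRUE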
# and now?
# you clone yourself into two programmers
# A will deal with baseb1 and B will deal with baseb2
# what do I mean by 'deal with'?
# simply apply the same procedure as before, i.e. try to define a split-node rule
# the only difference is that you are not using the same dataset
# so A will end up defining 2 new datasets and B also 2 new datasets
# this is called recursive binary splitting
# however we need to know how to stop this explosive division process
# by introducing a stopping rule, we will ensure that the program terminates
# it will also become clear to you that A or B (or any of their descendants) will not always create 2 nodes.
# they will create either 0, 1 or 2 nodes
for (j in 2:3)print(cut1(baseb2[,1],baseb2[,j]))
## [1] 6.500000 69.877019 27.582434 42.294585 6.164227 6.440167
## [1] 117.500000 48.976782 28.093708 20.883074 5.998380 6.739687
From the output, the Hits split (SSE 48.98) beats the Years split (SSE 69.88), so the next split on this branch uses \(Hits<117.5\) vs \(Hits>117.5\).
We continue recursively down all branches, until some stopping criterion is satisfied; a minimal sketch of this recursion is given below.
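The sketch below uses only the functions defined above; the helper name grow, the stopping rule (do not split a node with fewer than minsize players, or a node where no predictor can be split) and the default minsize=20 are my own illustrative choices, not part of the lecture code.
grow=function(y,X,minsize=20,depth=0){
#stopping rule: do not split nodes with fewer than minsize observations
if (length(y)<minsize) return(NULL)
best=NULL
for (j in 1:ncol(X)){ #try every predictor column
if (length(unique(X[,j]))<2) next #a constant predictor cannot be split
res=cut1(y,X[,j]) #res[1] is the best cut, res[2] the total SSE of that split
if (is.null(best) || res[2]<best$sse) best=list(var=colnames(X)[j],cut=res[1],sse=res[2])
}
if (is.null(best)) return(NULL) #no predictor could be split: stop
cat(strrep("  ",depth),"split on ",best$var," at ",best$cut,"\n",sep="")
left=X[,best$var]<=best$cut
grow(y[left],X[left,,drop=FALSE],minsize,depth+1) #programmer A: left branch
grow(y[!left],X[!left,,drop=FALSE],minsize,depth+1) #programmer B: right branch
}
grow(baseb[,1],baseb[,2:3]) #prints the sequence of split rules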
# this unfortunately requires the tree library
library(tree)
# I am lazy. I call an R function to fit a regression tree
# the resulting model is stored in a variable named basebtree
basebtree=tree(baseb) #construct the tree
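# (not in the original notes) a quick look at the unpruned tree, to see how bushy it is
plot(basebtree)
text(basebtree)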
# but the model is too complex!
# it is not a good model (it most likely overfits the training data)
# I need to reduce its complexity
# to do so, I will use cross-validation
# again, I am lazy, I call a ready-to-use function to do the work for me
# this single line of code does a lot (of simple things)
basebcv=cv.tree(basebtree)
# it would be better to code your own function for this!
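# since the comment above suggests it, here is a minimal sketch of a hand-rolled K-fold
# cross-validation; this is my own illustration, not what cv.tree actually does
# (cv.tree cross-validates the cost-complexity pruning sequence), and the candidate
# sizes 2:6, K=5 and the seed are arbitrary choices
K=5
sizes=2:6
set.seed(1) #so the folds are reproducible
fold=sample(rep(1:K,length.out=n)) #assign each player to one of K folds
cvsse=rep(0,length(sizes))
for (k in 1:K){
train=baseb[fold!=k,]
test=baseb[fold==k,]
fit=tree(Salary~Years+Hits,data=train) #grow a tree on the K-1 training folds
for (s in 1:length(sizes)){
pr=prune.tree(fit,best=sizes[s]) #prune it back to (at most) sizes[s] leaves
pred=predict(pr,newdata=test) #predict log-salary on the held-out fold
cvsse[s]=cvsse[s]+sum((test$Salary-pred)^2)
}
}
cbind(size=sizes,cvsse) #the size with the smallest held-out SSE is a sensible choice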
# back to cv.tree: let us inspect what is output by this function
# what do we have in the returned object basebcv?
# to see everything in this object, I will use the function attributes
attributes(basebcv)
## $names
## [1] "size" "dev" "k" "method"
##
## $class
## [1] "prune" "tree.sequence"
# size = number of terminal nodes (leaves) of each tree in the pruning sequence
# dev = the corresponding cross-validated deviance (here, a sum of squared errors)
# k = the cost-complexity penalty parameter associated with each size
plot(basebcv$size,basebcv$dev,type='b')
# and now we just select the sweet spot complexity
# to do this I want to select the complexity (size is the number of leaves in each model)
# which makes dev minimal
# I can do this by selecting where dev is minimized:
here = basebcv$dev==min(basebcv$dev)
# and the size at the sweetspot
bestsize=basebcv$size[here]
# or do this in a single line of code
bestsize=basebcv$size[basebcv$dev==min(basebcv$dev)]
# those who pay attention will have realized that during the lecture, I got 5, and here I got 4.
# Can anybody tell me why ? :)
# having found the best tree complexity, I can plug it in to prune the tree
# I call this my final model
# I store my final beloved tree in a variable named myfinalregressiontreemodel
myfinalregressiontreemodel=prune.tree(basebtree,best=bestsize)
plot(myfinalregressiontreemodel)
text(myfinalregressiontreemodel)
# so now I have a beloved model.
# My employer is happy. He is going to use my beloved tree to predict the salary of any baseball player
# not only those that are in the hitters dataset but others as well.
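# a small made-up illustration of how the employer could use the tree: the two players
# below are invented, and we simply call the ready-made predict function from the tree package
newplayers=data.frame(Years=c(2,10),Hits=c(50,150))
predict(myfinalregressiontreemodel,newdata=newplayers) #predicted log-salaries
exp(predict(myfinalregressiontreemodel,newdata=newplayers)) #back on the original salary scale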
# and if I really like R mechanics, I should continue this exercise myself:
# write my own code to predict the salaries of all the baseball players playing in 2020 ...