Best Subset Selection
Fit a separate least squares regression for each possible combination of the p predictors, then examine all of the resulting models with the goal of identifying the one that is best. In practice, we first identify the best model of each size k (e.g. the one with the smallest RSS), and then choose among these candidates using an estimate of the test error.


Disadvantage
Best subset selection cannot be applied with very large p, since the number of possible models grows rapidly as p increases (there are \(2^p\) of them).
Stepwise Model Selection
- Forward Stepwise Selection

Forward stepwise selection only needs to fit \(1+\sum_{k=0}^{p-1}(p-k) = 1 + p(p+1)/2\) models in total. However, this approach is not guaranteed to find the best possible model out of all \(2^p\) models; see the sketch after this list for a comparison of the two counts.
- Backward Stepwise Selection

Backward stepwise selection is not guaranteed to yield the best model either.
Backward selection requires that the number of samples n is larger than the number of variables p (so that the full model can be fit). In contrast, forward stepwise can be used even when n < p, and so is the only viable subset method when p is very large.
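As a quick numerical check on these counts, here is a minimal sketch (assuming p = 19 predictors, the number used in the Hitters example below) comparing the number of models fitted by best subset selection with the number fitted by forward stepwise selection.

p=19
2^p             ## best subset selection: 524288 models
1+p*(p+1)/2     ## forward stepwise selection: 191 models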
Choosing the Optimal Model
Model Selection Criteria
The idea is to use some criterion that provides a measure of the overall quality of a model. Such a criterion must punish models that are overly simple, as well as enforce parsimony by punishing models that are overly complex.
For a fitted least squares model containing d predictors, \(C_p = \frac{1}{n}(RSS+2 d \hat{\sigma}^2)\)
where n is the number of observations and \(\hat{\sigma}^2\) is an estimate of the error variance, typically obtained from the full model containing all p predictors.
It can be viewed as the training MSE plus a penalty that grows with d, so a small value indicates a low estimated test error.
\(AIC = \frac{1}{n\hat{\sigma}^2}(RSS + 2d\hat{\sigma}^2)\)
For least squares models, \(C_p\) and AIC are proportional to each other, so they give the same ordering of models. A small value indicates a low test error.
\(BIC = \frac{1}{n}(RSS+\log(n)\,d\,\hat{\sigma}^2)\)
A small value indicates a low test error. Since \(\log(n) > 2\) for any \(n > 7\), the BIC statistic generally places a heavier penalty on models with many variables, and hence results in the selection of smaller models than \(C_p\).
\(\text{Adjusted } R^2 = 1- \frac{RSS/(n-d-1)}{TSS/(n-1)}\)
A large value indicates a small test error.
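To make these formulas concrete, here is a minimal sketch that computes \(C_p\), AIC, BIC and adjusted \(R^2\) by hand for a small least squares fit. It uses the built-in mtcars data purely for illustration; the choice of submodel is arbitrary, and these values differ from R's built-in AIC()/BIC() by constants, so only the ordering of models matters.

full=lm(mpg~.,data=mtcars)          ## full model supplies the error variance estimate
sigma2.hat=summary(full)$sigma^2
fit=lm(mpg~wt+hp+qsec,data=mtcars)  ## a submodel with d predictors
d=length(coef(fit))-1               ## number of predictors, excluding the intercept
n=nrow(mtcars)
RSS=sum(resid(fit)^2)
TSS=sum((mtcars$mpg-mean(mtcars$mpg))^2)
cp=(RSS+2*d*sigma2.hat)/n
aic=(RSS+2*d*sigma2.hat)/(n*sigma2.hat)
bic=(RSS+log(n)*d*sigma2.hat)/n
adjr2=1-(RSS/(n-d-1))/(TSS/(n-1))
c(Cp=cp,AIC=aic,BIC=bic,AdjR2=adjr2)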
Cross-Validation Approach
The cross-validation approach directly estimates the test error.
We compute the validation set error or the cross-validation error for each model \(M_k\) under consideration, and select the k for which the resulting estimated test error is smallest.
The cross-validation approach can be used in a wider range of model selection tasks, for example when it is hard to estimate the error variance \(\hat{\sigma}^2\) or to pin down the model degrees of freedom.
Subset Selection with R
# Best Subset Selection
library(ISLR)
data(Hitters)
names(Hitters) ## predict a baseball player's Salary based on his performance in the previous year.
dim(Hitters)
## is.na() identifies the missing observations
sum(is.na(Hitters$Salary))
## remove the rows that have missing values
Hitters=na.omit(Hitters)
dim(Hitters)
sum(is.na(Hitters))
### Use the regsubsets() function in the leaps package
## Best Subset Selection
library(leaps)
regfit.full=regsubsets(Salary~.,Hitters)
## to get details of the results, use summary()
summary(regfit.full)
# some useful options:
# nvmax : maximum size of subsets to examine (default is 8)
# nbest : number of subsets of each size to record (default is 1)
# method : exhaustive (default), which is the best subset approach
# force.in : specify one or more variables to be included in all models
# force.out : specify one or more variables to be excluded in all models
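## A quick illustration of the options above (a sketch; the option values and the
## object name regfit.opts are just examples): record the 2 best subsets of each
## size, up to 5 variables.
regfit.opts=regsubsets(Salary~.,data=Hitters,nvmax=5,nbest=2)
summary(regfit.opts)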
plot(regfit.full) ## each row represents a model. shaded means the variables are included in the model.
# the scale option can change the default criterion statistic to Cp, r-squared, adjusted r2, ...
plot(regfit.full, scale = "Cp")
regfit.full=regsubsets(Salary~. ,data= Hitters,nvmax=19)
reg.summary=summary(regfit.full)
# The included variables are indicated by asterisks in quotes,
# can also be changed to logical values: summary(regfit.full,matrix.logical = TRUE)
reg.summary
names(reg.summary)
# summary(regfit) will return:
# which A logical matrix indicating which elements are in each model
# rsq The r-squared for each model
# rss Residual sum of squares for each model
# adjr2 Adjusted r-squared
# cp Mallows' Cp
# bic Bayes information criterion, BIC
reg.summary$which
reg.summary$rsq
### plot the selection criterion to decide which model to select
# par(mfrow=c(2,2))
plot(reg.summary$rss,xlab="Number of Variables",ylab="RSS",type="l")
### adjusted R^2, the larger the better
plot(reg.summary$adjr2,xlab="Number of Variables",ylab="Adjusted RSq",type="l")
which.max(reg.summary$adjr2)
points(which.max(reg.summary$adjr2),max(reg.summary$adjr2),col="red",cex=2,pch=20)
### Cp, the smaller the better
plot(reg.summary$cp,xlab="Number of Variables",ylab="Cp",type='l')
which.min(reg.summary$cp)
points(which.min(reg.summary$cp),min(reg.summary$cp),col="red",cex=2,pch=20)
### BIC, the smaller the better
plot(reg.summary$bic,xlab="Number of Variables",ylab="BIC",type='l')
which.min(reg.summary$bic)
points(which.min(reg.summary$bic),min(reg.summary$bic),col="red",cex=2,pch=20)
### directly plot the regsubset object
## each row represents a model. shaded means the variables are included in the model.
## Note that the axis is not quantitative but is ordered. The darkness of the shading simply represents the ordering of the criterion values.
plot(regfit.full)
# the scale option can change the default criterion statistic to Cp, r-squared, adjusted r2, ...
plot(regfit.full, scale = "Cp")
plot(regfit.full,scale="r2")
plot(regfit.full,scale="adjr2")
plot(regfit.full,scale="Cp")
plot(regfit.full,scale="bic")
## coef() returns the coefficients. id: choose which model (ordered as in the summary output) to return coefficients
coef(regfit.full,id = 6)
coef(regfit.full,id = which.max(reg.summary$adjr2))
# Forward and Backward Stepwise Selection
# use option method = "forward" or "backward"
regfit.fwd=regsubsets(Salary~.,data=Hitters,nvmax=19,method="forward")
summary(regfit.fwd)
regfit.bwd=regsubsets(Salary~.,data=Hitters,nvmax=19,method="backward")
summary(regfit.bwd)
coef(regfit.full,7)
coef(regfit.fwd,7)
coef(regfit.bwd,7)
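## For this data set, the best 7-variable models identified by best subset,
## forward stepwise, and backward stepwise selection are all different.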
# Choosing Among Models Using the Validation Set and Cross-Validation Approaches
set.seed(10)
# train = sample(c(TRUE,FALSE),nrow(Hitters),rep=TRUE)
## split the data set into halves
train = rep(c(TRUE,FALSE), length.out=nrow(Hitters))
train = sample(train) ## shuffle the sequence randomly
test=(!train)
### implement the best subset method only on the training set
regfit.best=regsubsets(Salary~.,data=Hitters[train,],nvmax=19)
### model.matrix() constructs the design matrix (X); leaps provides no predict() method for regsubsets, so we compute test-set predictions by hand
test.mat=model.matrix(Salary~.,data=Hitters[test,])
val.errors=rep(NA,19)
for(i in 1:19){
coefi=coef(regfit.best,id=i)
pred=test.mat[,names(coefi)]%*%coefi ## compute predictions on the test set
val.errors[i]=mean((Hitters$Salary[test]-pred)^2)
}
val.errors
which.min(val.errors)
coef(regfit.best,id = which.min(val.errors))
### After we choose the number of variables, we perform the best subset selection on the full data set and select the best model.
### it is important that we use the full data set in order to obtain more accurate coefficient estimates
### the best model on the full data set may differ from the model on the training set.
regfit.best=regsubsets(Salary~.,data=Hitters,nvmax=19)
coef(regfit.best,which.min(val.errors))
### use k-fold cross validation
### create a predict() method for regsubsets objects (leaps does not provide one)
predict.regsubsets=function(object,newdata,id,...){
form=as.formula(object$call[[2]]) ## extract the formula from the original regsubsets() call
mat=model.matrix(form,newdata) ## build the design matrix for the new data
coefi=coef(object,id=id) ## coefficients of the best model with id variables
xvars=names(coefi)
mat[,xvars]%*%coefi ## return the predictions
}
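## Because this function follows the S3 naming convention predict.regsubsets,
## predict() dispatches to it automatically for regsubsets objects,
## e.g. predict(regfit.best,Hitters[test,],id=10)  (id=10 here is just an example)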
k=10
set.seed(1)
folds=sample(1:k,nrow(Hitters),replace=TRUE) ## roughly equal sized
## if you want to split into k folds more precisely, you can also use sample(rep(1:k,length.out = n))
cv.errors=matrix(NA,k,19, dimnames=list(NULL, paste(1:19)))
for(j in 1:k){
best.fit=regsubsets(Salary~.,data=Hitters[folds!=j,],nvmax=19)
for(i in 1:19){
pred=predict(best.fit,Hitters[folds==j,],id=i)
cv.errors[j,i]=mean( (Hitters$Salary[folds==j]-pred)^2)
}
}
mean.cv.errors=apply(cv.errors,2,mean) ## same as colMeans(cv.errors)
mean.cv.errors
which.min(mean.cv.errors)
par(mfrow=c(1,1))
plot(mean.cv.errors,type='b')
reg.best=regsubsets(Salary~.,data=Hitters, nvmax=19)
coef(reg.best,which.min(mean.cv.errors))