2017-03-12

Linear Model Selection and Regularization

In this chapter, we will consider some approaches for extending the linear model framework.

The goals are to improve prediction accuracy (especially when p > n) and model interpretability (through variable selection).

Three classes of methods

  • Subset Selection

This approach involves identifying a subset of the p predictors that we believe to be related to the response. We then fit a model using least squares on this reduced set of variables.

  • Shrinkage (regularization)

This approach involves fitting a model involving all p predictors, but the estimated coefficients are shrunken towards zero relative to the least squares estimates.

  • Dimension Reduction

This approach projects the p predictors onto an M-dimensional subspace (using M linear combinations of the variables). These M projections are then used as predictors to fit a linear regression model by least squares.

Subset Selection

Best Subset Selection

Fit a separate least squares regression for each possible combination of the p predictors. We then look at all of the resulting models, with the goal of identifying the one that is best.

Disadvantage

Best subset selection cannot be applied when p is very large, since the number of possible models (\(2^p\)) grows rapidly as p increases.

Stepwise model Selection

  • Forward Stepwise Selection

Forward stepwise selection only needs to fit \(1+\sum_{k=0}^{p-1}(p-k) = 1 + p(p+1)/2\) models in total (see the short count comparison after this list). However, this approach is not guaranteed to find the best possible model out of all \(2^p\) models.

  • Backward Stepwise Selection

Backward stepwise selection is not guaranteed to yield the best model either.

Backward selection requires that the number of samples n is larger than the number of variables p (so that the full model can be fit). In contrast, forward stepwise can be used even when n < p, and so is the only viable subset method when p is very large.
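As a quick check of the model counts mentioned above (plain arithmetic in R):

## number of models fit by best subset (2^p) versus forward stepwise (1 + p(p+1)/2)
p <- c(10, 20, 30)
data.frame(p = p, best.subset = 2^p, forward.stepwise = 1 + p * (p + 1) / 2)
## at p = 20: 1,048,576 models for best subset versus only 211 for forward stepwise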

Choosing the Optimal Model

  • Full model: the model that includes every predictor.
  • Subset model: a model with one or more predictors missing.
  • Goal: choose the best subset model (or the full model).

  • Two approaches
    • We can indirectly estimate test error by making an adjustment to the training error to account for the bias due to overfitting.
    • We can directly estimate the test error, using either a validation set approach or a cross-validation approach, as discussed in previous lectures.

Model Selection Criteria

The idea is to use some sort of criterion that provides a measure of the overall quality of a model. Such a criterion must punish models that are overly simple (poor fit), as well as enforce parsimony and punish models that are overly complex.

  • Mallows' \(C_p\)

For a fitted least squares model containing d predictors, \(C_p = \frac{1}{n}(RSS+2 d \hat{\sigma}^2)\)

where n is the number of observations and \(\hat{\sigma}^2\) is an estimate of the variance of the error term (obtained from the full model).

\(C_p\) can be viewed as the training MSE plus a penalty that grows with d, so a small value indicates a low estimated test error.

  • AIC

\(AIC = \frac{1}{n\hat{\sigma}^2}(RSS + 2d\hat{\sigma}^2)\)

For least squares models, \(C_p\) and AIC are proportional to each other, so they give the same ordering of models. A small value indicates a low test error.

  • BIC

\(BIC = \frac{1}{n}(RSS + \log(n)\, d\hat{\sigma}^2)\)

A small value indicates a low test error. Since \(\log(n) > 2\) for any \(n > 7\), the BIC statistic generally places a heavier penalty on models with many variables, and hence results in the selection of smaller models than \(C_p\).

  • Adjusted R^2

\(\text{Adjusted } R^2 = 1 - \frac{RSS/(n-d-1)}{TSS/(n-1)}\)

A large value indicates a small test error.
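As a rough illustration (a sketch following the formulas above, not how any package computes them internally), the four criteria can be evaluated from a model's RSS, TSS, n, d, and an estimate \(\hat{\sigma}^2\) taken from the full model:

## toy helper: compute Cp, AIC, BIC and adjusted R^2 from summary quantities
## rss, tss, sigma2 and d are assumed to come from models you have already fit
model.criteria <- function(rss, tss, n, d, sigma2){
  cp    <- (rss + 2 * d * sigma2) / n
  aic   <- (rss + 2 * d * sigma2) / (n * sigma2)
  bic   <- (rss + log(n) * d * sigma2) / n
  adjr2 <- 1 - (rss / (n - d - 1)) / (tss / (n - 1))
  c(Cp = cp, AIC = aic, BIC = bic, AdjR2 = adjr2)
}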

Cross Validation approach

  • The cross-validation approach directly estimates the test error.

  • We compute the validation set error or the cross-validation error for each model \(M_k\) under consideration, and select the k for which the resulting estimated test error is smallest.

  • The cross-validation approach can be used in a wider range of model selection tasks (e.g., when it is hard to estimate the error variance \(\hat{\sigma}^2\)).

Subset Selection with R

# Best Subset Selection

library(ISLR)
data(Hitters)
names(Hitters)  ## predict a baseball player's Salary based on his performance in the previous year.
dim(Hitters)

## is.na() identify the missing observations
sum(is.na(Hitters$Salary))

## remove the rows that have missing values
Hitters=na.omit(Hitters)
dim(Hitters)
sum(is.na(Hitters))



### Use the regsubsets() function in leaps package

## Best Subset Selection
library(leaps)
regfit.full=regsubsets(Salary~.,Hitters)

## to get details of the results, use summary()
summary(regfit.full)

# some useful options:
# nvmax : maximum size of subsets to examine (default is 8)
# nbest : number of subsets of each size to record (default is 1)
# method : exhaustive (default), which is the best subset approach
# force.in : specify one or more variables to be included in all models
# force.out : specify one or more variables to be excluded in all models


plot(regfit.full) ## each row represents a model. shaded means the variables are included in the model.

# option scale can change the default criterion statistic to Cp, r-squared adjusted r2 
plot(regfit.full, scale = "Cp")



regfit.full=regsubsets(Salary~. ,data= Hitters,nvmax=19)
reg.summary=summary(regfit.full)
# The included variables are indicated by asterisks in quotes, 
# can also be changed to logical values: summary(regfit.full,matrix.logical = TRUE)
reg.summary


names(reg.summary)
# summary(regfit) will return:
# which A logical matrix indicating which elements are in each model
# rsq       The r-squared for each model
# rss       Residual sum of squares for each model
# adjr2 Adjusted r-squared
# cp        Mallows' Cp
# bic       Bayes information criterion, BIC

reg.summary$which
reg.summary$rsq


### plot the selection criterion to decide which model to select

# par(mfrow=c(2,2))
plot(reg.summary$rss,xlab="Number of Variables",ylab="RSS",type="l")

### adjusted R^2, the larger the better
plot(reg.summary$adjr2,xlab="Number of Variables",ylab="Adjusted RSq",type="l")
which.max(reg.summary$adjr2)
points(11,reg.summary$adjr2[11], col="red",cex=2,pch=20)

### Cp, the smaller the better
plot(reg.summary$cp,xlab="Number of Variables",ylab="Cp",type='l')
which.min(reg.summary$cp)
points(10,reg.summary$cp[10],col="red",cex=2,pch=20)

### BIC, the smaller the better
plot(reg.summary$bic,xlab="Number of Variables",ylab="BIC",type='l')
which.min(reg.summary$bic)
points(6,reg.summary$bic[6],col="red",cex=2,pch=20)


### directly plot the regsubset object
## each row represents a model. shaded means the variables are included in the model. 
## Note that the axis is not quantitative but is ordered. The darkness of the shading simply represents the ordering of the criterion values.
plot(regfit.full)

# option scale can change the default criterion statistic to Cp, r-squared adjusted r2 ...
plot(regfit.full, scale = "Cp")
plot(regfit.full,scale="r2")
plot(regfit.full,scale="adjr2")
plot(regfit.full,scale="Cp")
plot(regfit.full,scale="bic")

## coef() returns the coefficients. id: choose which model (ordered as in the summary output) to return coefficients
coef(regfit.full,id = 6)
coef(regfit.full,id = which.max(reg.summary$adjr2))

# Forward and Backward Stepwise Selection
# use option method = "forward" or "backward"
regfit.fwd=regsubsets(Salary~.,data=Hitters,nvmax=19,method="forward")
summary(regfit.fwd)
regfit.bwd=regsubsets(Salary~.,data=Hitters,nvmax=19,method="backward")
summary(regfit.bwd)
coef(regfit.full,7)
coef(regfit.fwd,7)
coef(regfit.bwd,7)


# Choosing Among Models using cross validation approach

set.seed(10)

# train = sample(c(TRUE,FALSE),nrow(Hitters),rep=TRUE)

## split the data set into halves
train = rep(c(TRUE,FALSE), length.out=nrow(Hitters))
train = sample(train) ## disorder the sequence randomly
test=(!train)

### implement the best subset method only on the training set
regfit.best=regsubsets(Salary~.,data=Hitters[train,],nvmax=19)

### model.matrix construct the design matrix (X)
test.mat=model.matrix(Salary~.,data=Hitters[test,])

val.errors=rep(NA,19)

for(i in 1:19){
   coefi=coef(regfit.best,id=i)
   pred=test.mat[,names(coefi)]%*%coefi  ## predictions on the test set
   val.errors[i]=mean((Hitters$Salary[test]-pred)^2)
}

val.errors
which.min(val.errors)
coef(regfit.best,id = which.min(val.errors))


### After we choose the number of variables, we perform the best subset selection on the full data set and select the best model.
### it is important that we use the full data set in order to obtain more accurate coefficient estimates
### the best model on the full data set may differ from the model on the training set.
regfit.best=regsubsets(Salary~.,data=Hitters,nvmax=19)
coef(regfit.best,which.min(val.errors))



### use k-fold cross validation 

### create a predict function for regsubsets.
predict.regsubsets=function(object,newdata,id,...){
    form=as.formula(object$call[[2]])
    mat=model.matrix(form,newdata)
    coefi=coef(object,id=id)
    xvars=names(coefi)
    mat[,xvars]%*%coefi
}


k=10
set.seed(1)
folds=sample(1:k,nrow(Hitters),replace=TRUE) ## roughly equal sized 
## if you want to split into k folds more precisely, you can also use sample(rep(1:k,length.out = n))

cv.errors=matrix(NA,k,19, dimnames=list(NULL, paste(1:19)))

for(j in 1:k){
  best.fit=regsubsets(Salary~.,data=Hitters[folds!=j,],nvmax=19)
  for(i in 1:19){
    pred=predict(best.fit,Hitters[folds==j,],id=i)
    cv.errors[j,i]=mean( (Hitters$Salary[folds==j]-pred)^2)
    }
}

mean.cv.errors=apply(cv.errors,2,mean) ## same as colMeans(cv.errors)
mean.cv.errors
which.min(mean.cv.errors)
par(mfrow=c(1,1))
plot(mean.cv.errors,type='b')
reg.best=regsubsets(Salary~.,data=Hitters, nvmax=19)
coef(reg.best,which.min(mean.cv.errors))

Shrinkage Methods

Ridge regression and Lasso

  • The subset selection methods use least squares to fit a linear model that contains a subset of the predictors.

  • As an alternative, we can fit a model containing all p predictors using a technique that constrains or regularizes the coefficient estimates, or equivalently, that shrinks the coefficient estimates towards zero.

Ridge Regression

The ridge regression coefficient estimates \(\hat{\beta}_{\lambda}^{R}\) are the values that minimize

\(\sum_{i=1}^{n}(y_i - \beta_0 - \sum_{j=1}^{p}\beta_jx_{ij})^2 + \lambda\sum_{j=1}^{p}\beta_j^2 = RSS + \text{shrinkage penalty}\)

where \(\lambda \geq 0\) is a tuning parameter, to be determined separately.

  • Penalizing large coefficients through the \(l_2\) Norm, so it is also called \(l_2\) Regularization.
  • The shrinkage penalty (\(l_2\) penalty) has the effect of shrinking the estimates of \(\beta_j\) towards 0.
  • The tuning parameter \(\lambda\) controls the relative impact on the regression coefficient estimates.
  • We need to select a good value for \(\lambda\) (cross-validation)

  • Because of the penalty term, we should apply ridge regression after standardizing the predictors, using \(\tilde{x}_{ij} = x_{ij}\big/\sqrt{\tfrac{1}{n}\sum_{i=1}^{n}(x_{ij}-\bar{x}_j)^2}\), so that the fit does not depend on the scale of the predictors.

  • An alternative form of the ridge regression

minimize \(\sum_{i=1}^{n}(y_i - \beta_0 - \sum_{j=1}^{p}\beta_jx_{ij})^2\) subject to \(\sum_{j=1}^{p}\beta_j^2 \leq s\)

where \(||\beta||_2 = \sqrt{{\sum_{j=1}^{p}}\beta_j^2}\).

  • Unlike subset selection, which will generally select models that involve just a subset of the variables, ridge regression will include all p predictors in the final model.
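For intuition, the penalized criterion above has a closed-form solution once the predictors are standardized and the response is centered. A minimal sketch (this is not how glmnet, used in the lab below, computes its fit, and glmnet parameterizes \(\lambda\) slightly differently):

## sketch: ridge coefficients (X'X + lambda*I)^{-1} X'y on standardized predictors
ridge.coef <- function(X, y, lambda){
  Xs <- scale(X)             # standardize each predictor
  yc <- y - mean(y)          # center the response so the intercept needs no penalty
  solve(t(Xs) %*% Xs + lambda * diag(ncol(Xs)), t(Xs) %*% yc)
}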

The lasso

The lasso coefficients, \(\hat{\beta}_{\lambda}^{L}\), minimize the quantity

\(\sum_{i=1}^{n}(y_i - \beta_0 - \sum_{j=1}^{p}\beta_jx_{ij})^2 + \lambda\sum_{j=1}^{p}|\beta_j| = RSS + \text{shrinkage penalty}\)

  • Penalizing large coefficients through the \(l_1\) norm, \(||\beta||_1 = \sum_{j=1}^{p}|\beta_j|\), so it is also called \(l_1\) Regularization.
  • As with ridge regression, the lasso shrinks the coefficient estimates towards zero.

  • In the case of the lasso, the \(l_1\) penalty has the effect of forcing some of the coefficient estimates to be exactly equal to zero when the tuning parameter \(\lambda\) is sufficiently large.

  • In general, one might expect the lasso to perform better when the response is a function of only a relatively small number of predictors.

  • Neither ridge regression nor the lasso will universally dominate the other.

  • A technique such as cross-validation can be used in order to determine which approach is better on a particular data set.
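For example, one could compare the two with cv.glmnet() (a sketch; x and y are assumed to be the design matrix and response, constructed as in the R lab below):

## compare ridge (alpha = 0) and the lasso (alpha = 1) by 10-fold cross-validation
library(glmnet)
set.seed(1)
cv.ridge <- cv.glmnet(x, y, alpha = 0, nfolds = 10)
cv.lasso <- cv.glmnet(x, y, alpha = 1, nfolds = 10)
min(cv.ridge$cvm)   # smallest cross-validation error achieved by ridge
min(cv.lasso$cvm)   # smallest cross-validation error achieved by the lasso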

Tuning Parameter for Ridge Regression and Lasso

  • Cross-validation provides a simple way to tackle this problem. We choose a grid of \(\lambda\) values, and compute the cross-validation error rate for each value of \(\lambda\).

  • We then select the tuning parameter value for which the cross-validation error is smallest.

  • Finally, the model is refit using all of the available observations and the selected value of the tuning parameter.

Ridge Regression and Lasso with R

glmnet() function in glmnet package

  • x: input matrix
  • y: response variable

  • alpha: 0 for ridge regression, 1 for the lasso

  • lambda: a lambda sequence

  • standardize: whether to standardize the variables. Default is TRUE.

  • Outputs:
    • df: The number of nonzero coefficients for each value of lambda
    • dev.ratio: the fraction of (null) deviance explained; for least squares models you can consider this as \(R^2\)

## Use the glmnet() function in the glmnet package
library(glmnet)
library(ISLR)

data(Hitters)
## remove the rows that have missing values
Hitters=na.omit(Hitters)



### ridge regression
x = model.matrix(Salary~., Hitters)[,-1]  ## create the X matrix

y = Hitters$Salary

grid = 10^seq(10,-2,length=100)

ridge.mod = glmnet(x,y,alpha=0, lambda = grid)
plot(ridge.mod,xvar = "lambda",label=TRUE)

dim(coef(ridge.mod))  ## coef() returns the matrix of regression coefficients (one column per lambda)

## predict the coefficients for a new lambda (s : values of the penalty parameter lambda)
predict(ridge.mod, s= 50, type="coefficients")

## split the samples into a training set and a test set
set.seed(1)
train = sample(1:nrow(x), nrow(x)/2)
test = (-train)
y.test = y[test]

### Calculate MSE on the test set when lambda = 4
ridge.mod = glmnet(x[train,],y[train],alpha=0, lambda = grid)
## exact = T re-solves at s = 4; recent glmnet versions then also require the original x and y
ridge.pred = predict(ridge.mod, s=4, newx = x[test,], exact = T, x = x[train,], y = y[train])
mean((ridge.pred-y.test)^2)

### setting s = 0 in the predict function corresponds to ordinary least squares regression (no shrinkage)


## cross validation cv.glmnet()  nfolds: control the number of folds
set.seed(1)
cv.out = cv.glmnet(x[train,],y[train], alpha=0, nfolds=10) 

## you can add argument lambda to supply your own lambda sequence
## cv.out = cv.glmnet(x[train,],y[train], alpha=0, nfolds=10, lambda = grid) 


plot(cv.out) ## plot the cross-validation curve and upper and lower standard deviation curves

#cv.out$cvm    ## values of the mean cross-validation errors for different lambda
bestlam = cv.out$lambda.min  ## value of lambda that gives minimum mean cross-validation error
bestlam

### finally refit ridge regression model on the full data set using the best lambda

out = glmnet(x,y,alpha=0)
predict(out,type="coefficients", s = bestlam)[1:20,]



#### the Lasso 
#### very similar to ridge regression

#### alpha = 1 is the Lasso 
lasso.mod = glmnet(x[train,], y[train], alpha = 1,lambda = grid)
plot(lasso.mod, xvar = "lambda",label=TRUE)

### 10-fold cross validation
set.seed(1)
cv.out = cv.glmnet(x[train,], y[train], alpha = 1, nfolds=10)
plot(cv.out)
bestlam = cv.out$lambda.min
out=glmnet(x,y, alpha = 1, lambda = grid)
lasso.coef = predict(out,type="coefficients", s = bestlam)[1:20,]

### the lasso performs variable selection, since the coefficients of some variables are exactly 0

lasso.coef[lasso.coef!=0]

Dimension Reduction

Dimension reduction methods are a class of approaches that transform the predictors and then fit a least squares model using the transformed variables.

Let \(Z_1, Z_2, ..., Z_M\) represent \(M \leq p\) linear combinations of our original p predictors. That is,

\(Z_m = \sum_{j=1}^{p} \phi_{mj}X_j\)

for some constants \(\phi_{m1}, ..., \phi_{mp}\).

Then fit the linear regression model,

\(y_i = \theta_0 + \sum_{m=1}^{M}\theta_m z_{im} + \epsilon_i, i = 1,...,n\)

using ordinary least squares, where the regression coefficients are \(\theta_0, \theta_1,\theta_2,...,\theta_M\). This can be thought of as a special case of the original linear regression model, since

\(\sum_{m=1}^{M}\theta_m z_{im} = \sum_{m=1}^{M}\theta_m\sum_{j=1}^{p}\phi_{mj}x_{ij} = \sum_{j=1}^{p}\sum_{m=1}^{M}\theta_m \phi_{mj}x_{ij} = \sum_{j=1}^{p}\beta_jx_{ij},\)

where \(\beta_j = \sum_{m=1}^{M}\theta_{m}\phi_{mj}\).
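A small numerical check of this identity (a sketch with arbitrary constants \(\phi_{mj}\) and simulated data):

## regressing y on M linear combinations Z = X %*% Phi is a constrained fit on X
## with implied coefficients beta = Phi %*% theta
set.seed(1)
n <- 100; p <- 4; M <- 2
X   <- matrix(rnorm(n * p), n, p)
Phi <- matrix(rnorm(p * M), p, M)   # the constants phi_{mj}
y   <- rnorm(n)
Z   <- X %*% Phi
theta <- coef(lm(y ~ Z))[-1]        # theta_1, ..., theta_M
beta  <- Phi %*% theta              # beta_j = sum_m theta_m * phi_{mj}
all.equal(as.numeric(X %*% beta), as.numeric(Z %*% theta))  # fitted parts agree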

Principal Components Analysis

  • Principal Component Analysis (PCA) is a dimension-reduction tool that can be used to reduce a large set of variables to a small set that still contains most of the information in the large set (maximal variance).

  • PCA is a mathematical procedure that transforms a number of (possibly) correlated variables into a (smaller) number of uncorrelated variables called principal components.

  • PCA is an unsupervised approach, since it involves only a set of features \(X_1, X_2, ..., X_p\), and no associated response Y.

  • Apart from producing derived variables for use in supervised learning problems, PCA also serves as a tool for data visualization.

  • Look for the linear combination of the sample feature values:

\(z_{i1} = \phi_{11}x_{i1} + \phi_{21}x_{i2} + ... + \phi_{p1}x_{ip}\)

that has the largest sample variance, subject to the constraint that \(\sum_{j=1}^{p} \phi_{j1}^2 = 1\).

We refer to \(z_{11},...,z_{n1}\) as the scores of the first principal component. The vector \(\phi_1\) with elements \(\phi_{11},\phi_{21},...,\phi_{p1}\) is called the first principal component loading vector and defines the direction of the component.

  • The first principal component is that (normalized) linear combination of the variables with the largest variance.

  • The second principal component has largest variance, subject to being uncorrelated with the first. And so on.

  • The goal is dimension reduction and there is no guarantee that the dimensions are interpretable

  • We need to standardize each predictor prior to generating the principal components. This standardization ensures that all variables are on the same scale; otherwise, the high-variance variables would tend to play a larger role in the principal components.
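To connect this with prcomp() (used below): with standardized variables, the loading vectors are the eigenvectors of the correlation matrix, and the component variances are the corresponding eigenvalues. A minimal check (loading signs may be flipped):

eig <- eigen(cor(iris[, 1:4]))
pc  <- prcomp(iris[, 1:4], scale = TRUE)
eig$vectors[, 1]     # first eigenvector
pc$rotation[, 1]     # first loading vector (possibly opposite in sign)
eig$values           # equal to pc$sdev^2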

PCA in R

prcomp() function

head(iris)

### a simple example with only 2 predictors
iris2 = iris[,3:5] 
plot(iris2[,-3], col = iris2[,3])

irispca2 = prcomp(iris2[,-3],scale = T)
irispca2$rotation

# draw on 1 dimension
plot(irispca2$x[,1],rep(0, length(irispca2$x[,1])), col=iris$Species)

plot(irispca2$x[,1], irispca2$x[,2],col=iris$Species, xlab = "PC1", ylab = "PC2")



### consider the full iris data set (4 predictors)

plot(iris[,-5], col = iris$Species)

irispca = prcomp(iris[,-5],scale = T)

names(irispca)

### center and scale are the means and standard deviations of the variables
### that were used for scaling the predictors before implementing PCA
irispca$center
irispca$scale

### x : the principal component scores

plot(irispca$x[,1], irispca$x[,2],col=iris$Species, xlab = "PC1", ylab = "PC2")

### sdev: standard deviations of the principal components
irispca$sdev

### plot the first two principal components 

biplot(irispca,scale = 0)

### the variance explained by each principal component

pr.var = irispca$sdev^2
pr.var
### compute the proportion of variance explained by each principal component

pve = pr.var/sum(pr.var)
pve


## plot the proportion of variance explained
plot(pve, xlab="Principal Component", ylab="Proportion of Variance Explained", ylim=c(0,1),type='b')

plot(cumsum(pve), xlab="Principal Component", ylab="Cumulative Proportion of Variance Explained", ylim=c(0,1),type='b')

Principal Components Regression

  • Apply principal components analysis (PCA) to define the linear combinations of the predictors, for use in our regression.

  • In PCR, instead of regressing the dependent variable on the explanatory variables directly, the principal components of the explanatory variables are used as regressors.

  • The number of principal components, M, is typically chosen by cross-validation.

  • When performing PCR, we need to standardize each predictor prior to generating the principal components.

Drawbacks:

  • PCR is not a feature selection method because each of the calculated principal components is a linear combination of the original variables.
  • Using principal components instead of the actual features can make it harder to explain what is affecting what.
  • The directions that best represent each predictor are obtained in an unsupervised way.

PCR with R

  • Use pcr() function in pls library

Arguments:

  • ncomp : the number of components to be used in the modelling
  • segments: the number of segments to use (default: 10-fold cv)
  • scale : standardize the data

library(pls)
set.seed(1)

## remove the rows that have missing values
Hitters=na.omit(Hitters)

pcr.fit = pcr(Salary~., data = Hitters, subset = train, scale = T, validation = "CV")
### default is ten-fold cross-validation  use argument segments = k set k-fold cv
# pcr.fit = pcr(Salary~ ., data=Hitters, scale = TRUE, validation = "CV",segments = 5)
summary(pcr.fit)

# pcr() reports the root mean squared error; square this quantity to obtain the usual MSE
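## for example (MSEP() and RMSEP() are helper functions in the pls package):
# MSEP(pcr.fit)    # cross-validated mean squared error of prediction
# RMSEP(pcr.fit)   # root mean squared error, as reported by summary()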

## plot the cross-validation scores

## val.type: What type of validation statistic to plot: "RMSEP", "MSEP", "R2"
## legendpos: If present, a legend is drawn at the given position.
validationplot(pcr.fit,val.type="MSEP",legendpos = "topright")
axis(1,at = 1:20)


## make predictions and compute the test MSE
pcr.pred = predict(pcr.fit, newdata = x[test,], ncomp = 7)
mean((pcr.pred - y.test)^2)

### fit the final model using the full data set using the best ncomp

pcr.fit = pcr(y~x, scale = T, ncomp = 7)
summary(pcr.fit)