Resampling Methods

2017-03-06

Resampling Methods

Training data : the data set on which the model is built.
Test data : the data on which we apply our model and check the accuracy.
Resampling methods: repeatedly drawing samples from a training set and refiiting a model of interest on each sample in order to obtain additional information about the fitted model.

Two commonly used resampling methods:

Cross-Validation
- The validation set approach
- Leave-One-Out Cross-Validation (LOOCV)
- k-Fold Cross-validation
Bootstrap

Cross-Validation

Training error is the error we get applying the model to the same data from which we trained.
Test error is the error that we incur on new data.

When the test set is not available, cross-validation methods estimate the test error rate by holding out a subset of the training observations from the fitting process, and then applying the method to those held out observations.

Cross-Validation approaches
- The validation set approach
- Leave-One-Out Cross-Validation (LOOCV)
- k-Fold Cross-validation

The Validation Set Approach

Randomly divide the available set of samples into two parts: a training set and a validation or hold-out set.
The model is fit on the training set, and the fitted model is used to predict the responses for the observations in the validation set.
The resulting validation-set error provides an estimate of the test error.

## An example of the validation set approach
library(ISLR)

dim(Auto)

set.seed(1)
train = sample(1:392,196)  ## create the index to randomly split the data set into two halves
degrees = 1:10  
errate = c()

# fit the simple linear regression
lm.fit = lm(mpg~ horsepower,data = Auto, subset = train) 
mse = mean((Auto$mpg-predict(lm.fit, newdata =  Auto))[-train]^2)
errate = c(errate, mse)

# fit the regression with poly terms with degrees from 2 to 10
for (d in 2:10){
    lm.fit = lm(mpg~poly(horsepower,d),data = Auto, subset = train)
    mse = mean((Auto$mpg-predict(lm.fit, newdata =  Auto))[-train]^2)
    errate = c(errate, mse)
}

#par(new=T)   # add new lines on the same graph
plot(degrees,errate,type="b", ylim=c(15,30), ylab = "Mean squared error")

Advantages:

simple and fast

Drawbacks:

Highly variable: the validation estimate of the test error rate can be highly variable, depending on which observations are included in the training set.
Waste data: we only use half of the data set to fit the model, which will gives us the unreliable estimate.

Leave-one-out

Instead of creating two subsets like the validation set approach, a single observation is used for the validation set, and the remaining data make up the training set.
Repeat the procedure n times by iteratively leaving one observation out
Compute the average MSE of all n test estimates.

The LOOCV estimate for the test MSE is the average of these n test error estimates:

\(CV_{(n)} = \frac{1}{n}\sum_{i=1}^{n}MSE_{i}\)

### An example of LOOCV

degrees = 1:10  
errate = c()

for ( d in 1:10){
    
    mse = c()
    for (i in 1:nrow(Auto)){
    
    lm.fit = lm(mpg~poly(horsepower,d),data = Auto[-i,])
    mse0 = (Auto$mpg[i]-predict(lm.fit, newdata =  Auto[i,]))^2
    mse = c(mse,mse0)
}
    
    errate = c(errate, mean(mse))  
}

plot(degrees,errate,type="b", ylim=c(15,30), ylab = "Mean squared error")



### Use the cv.glm function in boot package

## set the family argument to default is the linear regression
library(boot)
glm.fit = glm(mpg~horsepower,data=Auto)  ## 
cv.glm(Auto, glm.fit)$delta

# tm = proc.time()  # record the running time
errate2 = c()
for (d in degrees){
    glm.fit = glm(mpg~poly(horsepower,d), data = Auto)
    err0 = cv.glm(Auto,glm.fit)$delta[1]
    errate2 = c(errate2,err0)
}

# proc.time() - tm
plot(errate2,type="b",col="red")

A Leave-one-out shortcut for Least squares linear regression

In linear regression, we can obtain all the prediction errors from a single fit.

\(CV_{(n)} = \frac{1}{n}\sum_{i=1}^{n}(\frac{y_i-\hat{y_i}}{1-h_i})^2\)

where \(h_i\) is the leverage statistics.

### create the loocv shortcut function
loocv = function(fit){
    h = hatvalues(fit)
    val = mean((residuals(fit)/(1-h))^2)
    return(val)
}



errate3 = c()

for (d in degrees){
    glm.fit = glm(mpg~poly(horsepower,d), data = Auto)
    err0 = loocv(glm.fit)
    errate3 = c(errate3,err0)
}


### This shortcut is much faster!

Advantages:

Much less bias, since the training set contains n-1 observations.
Performing LOOCV many times will always result in the same MSE

Drawbacks:

Computationally expensive

k-fold

Randomly divides a set of observations into k groups, for folds, of approximately equal size (commonly using k=5 or 10). Each fold contains a non overlapping validation set and training set.
Repeat k times, each time, one of the k subsets is used as the test set.
The test error is estimated by averaging these k estimates

The k-fold CV estimate:

\(CV_{(k)} = \frac{1}{k}\sum_{i=1}^{k}MSE_i\)

When k = n, it is actually the leave-one-out cross-validation, since we leave out one data point at a time.

glm.fit = glm(mpg~horsepower,data=Auto)  ## 
cv.glm(Auto, glm.fit, K = 10)$delta


erratek = c()
for (d in degrees){
    glm.fit = glm(mpg~poly(horsepower,d), data = Auto)
    err0 = cv.glm(Auto,glm.fit, K=10)$delta[1]
    erratek = c(erratek,err0)
}


plot(degrees,erratek, type="l", ylim=c(15,30), ylab = "Mean squared error")

The variability is much lower than the variability in the test error estimates that results from the validation set approach
much faster than LOOCV method
Bias-Variance trade-off for k-Fold cross validation

Extending to Classification

Instead of MSE, classification error is used \(Err_i = I(y_i \neq \hat{y}_i)\)
The algorithm remains the same
In cv.glm() , add the argument cost, which is a function of two vector arguments specifying the cost function for the cross-validation.

library(boot)
head(mtcars)

### create the cost function
misRate = function(x,y){
    mean(x==ifelse(y>0.5,1,0)) 
    # mean(abs(x-y)>0.5)
}


### LOOCV
fit <- glm(vs ~ mpg, data = mtcars, family = binomial)

cv.glm(mtcars,fit,cost = misRate)$delta[1]

pred = c()
for (i in 1:nrow(mtcars)){
    fit <- glm(vs ~ mpg, data = mtcars[-i,], family = binomial)
    pred = c(pred, mtcars$vs[i] == ifelse(predict(fit,mtcars[i,],type="response")>0.5,1,0))
}

mean(pred)

### k-fold
fit <- glm(vs ~ mpg, data = mtcars, family = binomial)
cv.glm(mtcars,fit,cost = misRate,K = 5)$delta[1]

The Bootstrap

Bootstrapping is commonly used to quantify the uncertainty of estimates. For example, it can provide an estimate of the standard error of a coefficient, or a confidence interval for that coefficient.
Bootstrapping is a sampling with replacement method, which allows estimation of the sampling distribution of almost any statistic using random sampling methods.
The idea of bootstrap is to treat the sample as if it were the population, for the purpose of approximating the sampling distribution of a statistic.

A bootstrap sample is a random sample taken with replacement from the original sample, of the same size as the original sample.
A bootstrap statistic is the statistic computed on a bootstrap sample.
A bootstrap distribution is the distribution of many bootstrap statistics.

Estimating the Accuracy of a statistic of Interest

-The standard error

### create one sample
set.seed(1)
x = rnorm(30)

bootMean = rep(NA,1000)
sampleMean = rep(NA,1000)


## draw samples from the population
for(i in 1:1000){
    sampleMean[i] = mean(rnorm(30))
}

## bootstrap samples
for(i in 1:1000){
    bootMean[i] = mean(sample(x,replace=TRUE))  ## replace=TRUE means resampling with replacement
}


plot(density(bootMean))  ## bootstrap estimate
lines(density(sampleMean),col="red")


sd(bootMean)
sd(sampleMean)

### we can use the same idea to calculate the standard error of the median or other statistic of interest.

set.seed(1)

bootMedian = replicate(1000, median(sample(x,replace=TRUE)))
sampleMedian = replicate(1000, median(rnorm(30)))


sd(bootMedian)
sd(sampleMedian)

The bootstrap confidence interval
- The traditional way of calculating confidence intervals is by using formulas developed on the assumptions of distributions.
- The bootstrap confidence interval (percentile method) uses the quantiles of those boostrap samples to find the confidence interval. (Many other ways to calculate the bootstrap confidence interval.)

## 95% for sample mean

c1 = quantile(bootMean,c(0.025,0.975))
c1[2]-c1[1]  ## CI width 
c2 = quantile(sampleMean,c(0.025,0.975))
c2[2] - c2[1]

boot() function from boot package

bootobject <- boot(data= ,statistic= , R=, ...)

data: a vector, matrix, or data frame
statistic : A function that produces the statistics to be bootstrapped. The first argument passed will be the original data. The second will be a vector of indices.
R: number of bootstrap replicates

-… : Other named arguments for statistic which are passed

The output bootobject includes

t0: The observed values of statistics applied to the orginal data
t: A matrix where each row is a bootstrap replicate of the statistics. …

## use the same data set x, redo the median


medianFunc = function(data, index){
    return(median(data[index]))
}

bootMed = boot(x,medianFunc,R=1000)


bootMedian2 = bootMed$t

quantile(bootMedian2,c(0.025,0.975))

quantile(bootMedian,c(0.025,0.975))





## Bootstrap can estimate other more complex statistic

## use Portfolio data set
library(ISLR)
alpha.fn = function(data,index){
    x = data$X[index]
    y = data$Y[index]
    alpha = (var(y)-cov(x,y))/(var(x)+var(y)-2*cov(x,y))
    return(alpha)
}

set.seed(1)
bootAlpha = boot(Portfolio,alpha.fn,R=1000)

bootAlpha
quantile(bootAlpha$t,c(0.025,0.975))

Estimating the Accuracy of a Linear Regression Model

An animation of bootstrapping the linear regression coefficient

Bootstrap the regression coefficients

## use mtcars data set
### use the original data set fit the linear regression model
plot(mtcars$wt, mtcars$mpg)
fit = lm(mpg~wt, data = mtcars)
abline(fit, col="red")

### resample the data with replacement
index = sample(nrow(mtcars), replace = T)
newcars = mtcars[index,]
plot(newcars$wt, newcars$mpg)
fit = lm(mpg~wt, data = newcars)
abline(fit, col="red")



# Bootstrap 95% CI for regression coefficients using boot() function


# create the function to obtain regression coefficients 
boot.fn <- function(data, indices, formula) {
  d = data[indices,] 
  fit = lm(formula, data = d)
  return(coef(fit)) 
} 

#boot.fn(mtcars,index,mpg ~ wt)

# bootstrapping with 1000 replications 
set.seed(1)
results = boot(data=mtcars, statistic=boot.fn, R=1000, formula= mpg ~ wt)


### show bootstrap estimates
results
summary(fit)
## two columns : intercept and slope
head(results$t)


# get 95% confidence intervals for the slope
hist(results$t[,2])

quantile(results$t[,2],c(0.025,0.975))

Bootstrap the R-Squared

# Bootstrap 95% CI for R-Squared

# create the function to calculate R-Squared 

rsq = function( data, indices,formula) {
  d = data[indices,] # allows boot to select sample 
  fit = lm(formula, data=d)
  return(summary(fit)$r.square)
} 
# bootstrapping with 1000 replications 
set.seed(1)
results <- boot(data=mtcars, statistic=rsq, R=1000, formula=mpg~wt)

# view results
results 
summary(fit)

# get 95% confidence interval 

quantile(results$t,c(0.025,0.975))