2017-03-06

Resampling Methods

  • Training data : the data set on which the model is built.
  • Test data : the data on which we apply our model and check the accuracy.

  • Resampling methods: repeatedly drawing samples from a training set and refiiting a model of interest on each sample in order to obtain additional information about the fitted model.

Two commonly used resampling methods:

  • Cross-Validation
    • The validation set approach
    • Leave-One-Out Cross-Validation (LOOCV)
    • k-Fold Cross-validation
  • Bootstrap

Cross-Validation

  • Training error is the error we get applying the model to the same data from which we trained.
  • Test error is the error that we incur on new data.

When the test set is not available, cross-validation methods estimate the test error rate by holding out a subset of the training observations from the fitting process, and then applying the method to those held out observations.

  • Cross-Validation approaches
    • The validation set approach
    • Leave-One-Out Cross-Validation (LOOCV)
    • k-Fold Cross-validation

The Validation Set Approach

  • Randomly divide the available set of samples into two parts: a training set and a validation or hold-out set.
  • The model is fit on the training set, and the fitted model is used to predict the responses for the observations in the validation set.
  • The resulting validation-set error provides an estimate of the test error.

## An example of the validation set approach
library(ISLR)

dim(Auto)

set.seed(1)
train = sample(1:392,196)  ## create the index to randomly split the data set into two halves
degrees = 1:10  
errate = c()

# fit the simple linear regression
lm.fit = lm(mpg~ horsepower,data = Auto, subset = train) 
mse = mean((Auto$mpg-predict(lm.fit, newdata =  Auto))[-train]^2)
errate = c(errate, mse)

# fit the regression with poly terms with degrees from 2 to 10
for (d in 2:10){
    lm.fit = lm(mpg~poly(horsepower,d),data = Auto, subset = train)
    mse = mean((Auto$mpg-predict(lm.fit, newdata =  Auto))[-train]^2)
    errate = c(errate, mse)
}

#par(new=T)   # add new lines on the same graph
plot(degrees,errate,type="b", ylim=c(15,30), ylab = "Mean squared error")

Advantages:

  • simple and fast

Drawbacks:

  • Highly variable: the validation estimate of the test error rate can be highly variable, depending on which observations are included in the training set.
  • Waste data: we only use half of the data set to fit the model, which will gives us the unreliable estimate.

Leave-one-out

  • Instead of creating two subsets like the validation set approach, a single observation is used for the validation set, and the remaining data make up the training set.

  • Repeat the procedure n times by iteratively leaving one observation out

  • Compute the average MSE of all n test estimates.

The LOOCV estimate for the test MSE is the average of these n test error estimates:

\(CV_{(n)} = \frac{1}{n}\sum_{i=1}^{n}MSE_{i}\)

### An example of LOOCV

degrees = 1:10  
errate = c()

for ( d in 1:10){
    
    mse = c()
    for (i in 1:nrow(Auto)){
    
    lm.fit = lm(mpg~poly(horsepower,d),data = Auto[-i,])
    mse0 = (Auto$mpg[i]-predict(lm.fit, newdata =  Auto[i,]))^2
    mse = c(mse,mse0)
}
    
    errate = c(errate, mean(mse))  
}

plot(degrees,errate,type="b", ylim=c(15,30), ylab = "Mean squared error")



### Use the cv.glm function in boot package

## set the family argument to default is the linear regression
library(boot)
glm.fit = glm(mpg~horsepower,data=Auto)  ## 
cv.glm(Auto, glm.fit)$delta

# tm = proc.time()  # record the running time
errate2 = c()
for (d in degrees){
    glm.fit = glm(mpg~poly(horsepower,d), data = Auto)
    err0 = cv.glm(Auto,glm.fit)$delta[1]
    errate2 = c(errate2,err0)
}

# proc.time() - tm
plot(errate2,type="b",col="red")

A Leave-one-out shortcut for Least squares linear regression

In linear regression, we can obtain all the prediction errors from a single fit.

\(CV_{(n)} = \frac{1}{n}\sum_{i=1}^{n}(\frac{y_i-\hat{y_i}}{1-h_i})^2\)

where \(h_i\) is the leverage statistics.

### create the loocv shortcut function
loocv = function(fit){
    h = hatvalues(fit)
    val = mean((residuals(fit)/(1-h))^2)
    return(val)
}



errate3 = c()

for (d in degrees){
    glm.fit = glm(mpg~poly(horsepower,d), data = Auto)
    err0 = loocv(glm.fit)
    errate3 = c(errate3,err0)
}


### This shortcut is much faster!

Advantages:

  • Much less bias, since the training set contains n-1 observations.
  • Performing LOOCV many times will always result in the same MSE

Drawbacks:

  • Computationally expensive

k-fold

  • Randomly divides a set of observations into k groups, for folds, of approximately equal size (commonly using k=5 or 10). Each fold contains a non overlapping validation set and training set.
  • Repeat k times, each time, one of the k subsets is used as the test set.
  • The test error is estimated by averaging these k estimates

The k-fold CV estimate:

\(CV_{(k)} = \frac{1}{k}\sum_{i=1}^{k}MSE_i\)

When k = n, it is actually the leave-one-out cross-validation, since we leave out one data point at a time.

glm.fit = glm(mpg~horsepower,data=Auto)  ## 
cv.glm(Auto, glm.fit, K = 10)$delta


erratek = c()
for (d in degrees){
    glm.fit = glm(mpg~poly(horsepower,d), data = Auto)
    err0 = cv.glm(Auto,glm.fit, K=10)$delta[1]
    erratek = c(erratek,err0)
}


plot(degrees,erratek, type="l", ylim=c(15,30), ylab = "Mean squared error")
  • The variability is much lower than the variability in the test error estimates that results from the validation set approach
  • much faster than LOOCV method
  • Bias-Variance trade-off for k-Fold cross validation

Extending to Classification

  • Instead of MSE, classification error is used \(Err_i = I(y_i \neq \hat{y}_i)\)
  • The algorithm remains the same
  • In cv.glm() , add the argument cost, which is a function of two vector arguments specifying the cost function for the cross-validation.
library(boot)
head(mtcars)

### create the cost function
misRate = function(x,y){
    mean(x==ifelse(y>0.5,1,0)) 
    # mean(abs(x-y)>0.5)
}


### LOOCV
fit <- glm(vs ~ mpg, data = mtcars, family = binomial)

cv.glm(mtcars,fit,cost = misRate)$delta[1]

pred = c()
for (i in 1:nrow(mtcars)){
    fit <- glm(vs ~ mpg, data = mtcars[-i,], family = binomial)
    pred = c(pred, mtcars$vs[i] == ifelse(predict(fit,mtcars[i,],type="response")>0.5,1,0))
}

mean(pred)

### k-fold
fit <- glm(vs ~ mpg, data = mtcars, family = binomial)
cv.glm(mtcars,fit,cost = misRate,K = 5)$delta[1]

The Bootstrap

  • Bootstrapping is commonly used to quantify the uncertainty of estimates. For example, it can provide an estimate of the standard error of a coefficient, or a confidence interval for that coefficient.

  • Bootstrapping is a sampling with replacement method, which allows estimation of the sampling distribution of almost any statistic using random sampling methods.

  • The idea of bootstrap is to treat the sample as if it were the population, for the purpose of approximating the sampling distribution of a statistic.

  • A bootstrap sample is a random sample taken with replacement from the original sample, of the same size as the original sample.

  • A bootstrap statistic is the statistic computed on a bootstrap sample.

  • A bootstrap distribution is the distribution of many bootstrap statistics.

Estimating the Accuracy of a statistic of Interest

-The standard error

### create one sample
set.seed(1)
x = rnorm(30)

bootMean = rep(NA,1000)
sampleMean = rep(NA,1000)


## draw samples from the population
for(i in 1:1000){
    sampleMean[i] = mean(rnorm(30))
}

## bootstrap samples
for(i in 1:1000){
    bootMean[i] = mean(sample(x,replace=TRUE))  ## replace=TRUE means resampling with replacement
}


plot(density(bootMean))  ## bootstrap estimate
lines(density(sampleMean),col="red")


sd(bootMean)
sd(sampleMean)

### we can use the same idea to calculate the standard error of the median or other statistic of interest.

set.seed(1)

bootMedian = replicate(1000, median(sample(x,replace=TRUE)))
sampleMedian = replicate(1000, median(rnorm(30)))


sd(bootMedian)
sd(sampleMedian)
  • The bootstrap confidence interval

    • The traditional way of calculating confidence intervals is by using formulas developed on the assumptions of distributions.
    • The bootstrap confidence interval (percentile method) uses the quantiles of those boostrap samples to find the confidence interval. (Many other ways to calculate the bootstrap confidence interval.)
## 95% for sample mean

c1 = quantile(bootMean,c(0.025,0.975))
c1[2]-c1[1]  ## CI width 
c2 = quantile(sampleMean,c(0.025,0.975))
c2[2] - c2[1]

boot() function from boot package

bootobject <- boot(data= ,statistic= , R=, ...)
  • data: a vector, matrix, or data frame

  • statistic : A function that produces the statistics to be bootstrapped. The first argument passed will be the original data. The second will be a vector of indices.

  • R: number of bootstrap replicates

-… : Other named arguments for statistic which are passed

The output bootobject includes

  • t0: The observed values of statistics applied to the orginal data
  • t: A matrix where each row is a bootstrap replicate of the statistics. …
## use the same data set x, redo the median


medianFunc = function(data, index){
    return(median(data[index]))
}

bootMed = boot(x,medianFunc,R=1000)


bootMedian2 = bootMed$t

quantile(bootMedian2,c(0.025,0.975))

quantile(bootMedian,c(0.025,0.975))





## Bootstrap can estimate other more complex statistic

## use Portfolio data set
library(ISLR)
alpha.fn = function(data,index){
    x = data$X[index]
    y = data$Y[index]
    alpha = (var(y)-cov(x,y))/(var(x)+var(y)-2*cov(x,y))
    return(alpha)
}

set.seed(1)
bootAlpha = boot(Portfolio,alpha.fn,R=1000)

bootAlpha
quantile(bootAlpha$t,c(0.025,0.975))

Estimating the Accuracy of a Linear Regression Model

  • An animation of bootstrapping the linear regression coefficient
  • Bootstrap the regression coefficients
## use mtcars data set
### use the original data set fit the linear regression model
plot(mtcars$wt, mtcars$mpg)
fit = lm(mpg~wt, data = mtcars)
abline(fit, col="red")

### resample the data with replacement
index = sample(nrow(mtcars), replace = T)
newcars = mtcars[index,]
plot(newcars$wt, newcars$mpg)
fit = lm(mpg~wt, data = newcars)
abline(fit, col="red")



# Bootstrap 95% CI for regression coefficients using boot() function


# create the function to obtain regression coefficients 
boot.fn <- function(data, indices, formula) {
  d = data[indices,] 
  fit = lm(formula, data = d)
  return(coef(fit)) 
} 

#boot.fn(mtcars,index,mpg ~ wt)

# bootstrapping with 1000 replications 
set.seed(1)
results = boot(data=mtcars, statistic=boot.fn, R=1000, formula= mpg ~ wt)


### show bootstrap estimates
results
summary(fit)
## two columns : intercept and slope
head(results$t)


# get 95% confidence intervals for the slope
hist(results$t[,2])

quantile(results$t[,2],c(0.025,0.975))
  • Bootstrap the R-Squared
# Bootstrap 95% CI for R-Squared

# create the function to calculate R-Squared 

rsq = function( data, indices,formula) {
  d = data[indices,] # allows boot to select sample 
  fit = lm(formula, data=d)
  return(summary(fit)$r.square)
} 
# bootstrapping with 1000 replications 
set.seed(1)
results <- boot(data=mtcars, statistic=rsq, R=1000, formula=mpg~wt)

# view results
results 
summary(fit)

# get 95% confidence interval 

quantile(results$t,c(0.025,0.975))