This is a slightly updated version of our previous lecture on the bootstrap.
The goal is to review the concepts and practice coding.
Generating a synthetic dataset is easy in R:
What distribution am I sampling from here? What are the parameters of this distribution? How many samples do I request? Why do I use the function set.seed?
set.seed(87612345)
data=rnorm(20,mean=4,sd=.75)
Same questions:
set.seed(117426)
data=rchisq(20,1,nc=3)
And if you are not familiar with the chi-square distribution, read:
We can generate a new bootstrap sample (‘fake new data’) by resampling the data with replacement. Note how I choose to generate a sample (called bdata here) of the same length as the original sample.
bdata=sample(data,20,replace=T)
Now I can calculate statistics of interest, which I denote with the generic letter \(S\), for this sample. The choice of \(S\) depends on what I want to estimate, i.e. on my problem.
Calculate statistics of interest: \(S\) for this sample
Can you define a statistic?
A statistic is a quantity that I can calculate based only on the data, i.e. a given function of the sample values.
We can note this \(S=g(x)\).
Some examples of choices for \(S\):
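For instance (a quick sketch, reusing the simulated normal sample from above), each of these is a valid choice of \(S\):

```r
set.seed(87612345)
data <- rnorm(20, mean = 4, sd = .75)
mean(data)            # S = sample mean
median(data)          # S = sample median
var(data)             # S = sample variance
quantile(data, .9)    # S = 90th sample percentile
```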
SAMPLE STATISTICS ARE RANDOM VARIABLES AND HAVE THEIR OWN DISTRIBUTION
We can say some things about sampling distributions EVEN IF we don't know the distribution the data come from.
EXAMPLES
mean and variance of S = sample mean
What is the variance of the sample mean statistic? Assuming the \(x_i\) are independent with common variance \(V\), \[Var(\bar{x}) = Var\left(\frac{1}{n} \sum_{i=1}^n x_i\right) = \frac{1}{n^2}\sum_{i=1}^n Var(x_i) = \frac{V}{n}\]
IF we know which distribution the data are drawn from, we can theoretically predict the distribution of sample statistics.
Example: if \(x_i \sim N(\mu,\sigma^2)\), then what is the distribution of \(\bar{x}\)?
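A quick simulation check (a sketch, with hypothetical values \(\mu = 4\), \(\sigma = 0.75\), \(n = 20\)): for normal data \(\bar{x} \sim N(\mu, \sigma^2/n)\), so the variance of many simulated sample means should be close to \(\sigma^2/n\).

```r
set.seed(1)
n <- 20; mu <- 4; sigma <- 0.75
# simulate 5000 sample means, each from a fresh sample of size n
xbars <- replicate(5000, mean(rnorm(n, mean = mu, sd = sigma)))
mean(xbars)     # should be close to mu = 4
var(xbars)      # should be close to sigma^2 / n = 0.028125
sigma^2 / n
```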
mean and variance of S=sample variance
Means and medians?
What is the difference between the mean and the median?
v=c(1,2,3,4)
mean(v)
## [1] 2.5
median(v)
## [1] 2.5
v=c(1,2,3,4000)
mean(v)
## [1] 1001.5
median(v)
## [1] 2.5
v = c(1,2,90,100000000)
median(v)
## [1] 46
Try to memorize the mean and variance of common distributions, for example \[E(\chi^2(k)) = k, \qquad Var(\chi^2(k)) = 2k\]
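As a quick empirical check (a sketch, with a hypothetical choice of \(k = 5\)):

```r
set.seed(1)
k <- 5
x <- rchisq(1e5, df = k)   # large sample from chi-squared(k)
mean(x)   # should be close to k = 5
var(x)    # should be close to 2k = 10
```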
Quantiles
n=21
v = 1:n
v
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
quantile(v,probs=c(.3,.5,.7))
## 30% 50% 70%
## 7 11 15
w=v^2
w
## [1] 1 4 9 16 25 36 49 64 81 100 121 144 169 196 225 256 289 324 361
## [20] 400 441
quantile(w,probs=c(.3,.5,.7))
## 30% 50% 70%
## 49 121 225
w=abs(cos(v))
w
## [1] 0.540302306 0.416146837 0.989992497 0.653643621 0.283662185 0.960170287
## [7] 0.753902254 0.145500034 0.911130262 0.839071529 0.004425698 0.843853959
## [13] 0.907446781 0.136737218 0.759687913 0.957659480 0.275163338 0.660316708
## [19] 0.988704618 0.408082062 0.547729260
x <- 1:21
quantile(w,probs=c(.3,.5,.7))
## 30% 50% 70%
## 0.4161468 0.6603167 0.8438540
plot(x,w)
lines(x,w)
Add the statistic of interest to your list
# outside the bootstrap loop I will initialize an empty list
#
myListOfStats = NULL
# this line of code will append the new statistic value myStat to the existing list
# myListOfStats
# myListOfStats=c(myListOfStats,myStat)
Analyze the bootstrap distribution
Can you now see how the following code puts all these elements together?
set.seed(117426)
data=rchisq(20,8,ncp=5)
# theoretical mean is k + lambda = df + ncp = 8 + 5 = 13
mystatchoice = function(ichoice,mydata){
return(mean(mydata))
}
myListOfStats = NULL
for (i in 1:1000){
mybsample = sample(data,20,replace=T)
#mystat = mystatchoice(1, mybsample)
mystat = mean(mybsample)
myListOfStats = c(myListOfStats,mystat)
}
myListOfStats[1:20]
## [1] 13.297047 14.712842 12.848927 13.342714 10.590500 13.137522 13.642501
## [8] 13.577464 13.526397 9.843382 14.963724 13.513403 12.708170 11.984250
## [15] 11.459082 15.262356 14.378881 10.889801 12.620023 15.334310
mean(myListOfStats)
## [1] 13.03888
Histograms are a way to see the distribution:
hist(myListOfStats,nclass=20)
Quantiles are cutpoints that cut the values of the random variable (response) into zones of well-defined areas (i.e. probabilities). Here I ask which cut-point \(c_1,c_2\) have the probabilities \(P(X \leq c_1) = 0.025\) and \(P(X \leq c_2) = 0.975\):
# this will return the two cutpoints c_1 and c_2
# they are called the 2.5% percentile and 97.5% percentile of the empirical distribution
# provided in the vector myListOfStats (this vector stores random values of your random variable / response)
quantile(myListOfStats,probs=c(.025,.975))
## 2.5% 97.5%
## 10.48906 15.68088
Here I recall how to define a matrix and how to calculate the median of each row
#apply
M=matrix(1:12,byrow=T,ncol=4)
M
## [,1] [,2] [,3] [,4]
## [1,] 1 2 3 4
## [2,] 5 6 7 8
## [3,] 9 10 11 12
apply(M,1,median)
## [1] 2.5 6.5 10.5
Or the mean of each column
#apply
M=matrix(1:12,byrow=T,ncol=4)
M
## [,1] [,2] [,3] [,4]
## [1,] 1 2 3 4
## [2,] 5 6 7 8
## [3,] 9 10 11 12
apply(M,2,mean)
## [1] 5 6 7 8
Or the variance of each row
#apply
M=matrix(1:12,byrow=T,ncol=4)
M
## [,1] [,2] [,3] [,4]
## [1,] 1 2 3 4
## [2,] 5 6 7 8
## [3,] 9 10 11 12
apply(M,2,var)
## [1] 16 16 16 16
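These apply patterns give a loop-free way to run a bootstrap (a sketch; the sample and the number of replicates are just illustrative): store the resampled values in a matrix, one bootstrap sample per row, then apply the statistic of interest to each row.

```r
set.seed(1)
data <- rchisq(20, df = 2)
Nboot <- 1000
# each row of B is one bootstrap sample of the same length as data
B <- matrix(sample(data, Nboot * length(data), replace = TRUE),
            nrow = Nboot)
bmeans <- apply(B, 1, mean)              # one bootstrap mean per row
quantile(bmeans, probs = c(.025, .975))  # percentile interval
```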
Suppose that a data set consists of the \(n\) items \(x_1, x_2, \ldots , x_n\).
A bootstrap sample is a sample of size \(n\), drawn with \(replacement\), from the original sample. Because the sampling is with replacement, the sample points can be repeated in the bootstrap sample. Here’s an example.
data=round(rnorm(6),1)
data
## [1] -2.0 1.4 -1.4 0.2 0.2 1.5
There are six items in the data set. Now let’s draw 10 bootstrap samples, each of size 6, with replacement. The R command to draw a sample of size \(n\) from \(x\), with replacement, is \(sample(x,n,replace=T)\). We don’t have to know \(n\) ahead of time, as we can just use \(sample(x,length(x),replace=T)\). Let’s use \(Nboot\) to denote the number of bootstrap samples.
Nboot=10
for (i in 1:Nboot){
bsample=sample(data,length(data),replace=T)
print(bsample)
}
## [1] -2.0 0.2 1.4 0.2 0.2 -1.4
## [1] 0.2 -2.0 0.2 0.2 0.2 -1.4
## [1] -1.4 0.2 0.2 -2.0 -2.0 1.4
## [1] 0.2 0.2 -1.4 0.2 0.2 -2.0
## [1] -2.0 1.4 -2.0 1.4 0.2 0.2
## [1] -2.0 1.4 1.4 0.2 0.2 0.2
## [1] 1.5 1.4 -1.4 -2.0 0.2 1.5
## [1] 1.5 1.4 1.4 0.2 -1.4 0.2
## [1] 0.2 1.4 -1.4 -2.0 1.4 -1.4
## [1] -1.4 1.4 1.4 -1.4 1.5 1.4
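Because the sampling is with replacement, some data points typically appear more than once and others not at all; applying table() to a bootstrap sample makes this easy to see (a quick sketch):

```r
set.seed(1)
data <- round(rnorm(6), 1)
bsample <- sample(data, length(data), replace = TRUE)
table(bsample)   # counts how often each value was drawn
```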
Bootstrap distribution
Generally we are interested in some statistic, which is nothing more than a function of the sample values, say the mean, or a specified percentile. We can calculate the statistic for each of our bootstrap samples, and thereby construct the “bootstrap distribution” of the statistic. The following examples show the bootstrap distribution of the mean and of the 90th percentile, using 500 bootstrap samples. Let’s start with a sample of size 15 from a \(\chi^2\) distribution with 2 degrees of freedom.
Nboot=500
n=15; df=2
x=rchisq(n,df) #draw sample from a chi-squared distribution
means=NULL
p90=NULL
for (i in 1:Nboot){
bsample=sample(x,n,replace=T)
means=c(means,mean(bsample)) #calculate the mean
p90=c(p90,quantile(bsample,.9)) #calculate 90th percentile
}
hist(means,nclass=20,main="bootstrap distribution of the mean")
hist(p90,nclass=20,main="bootstrap distribution of the 90'th percentile")
Bootstrap confidence intervals
A common use of the bootstrap is to calculate a confidence interval which doesn’t depend on underlying distributional assumptions. The \(percentile method\) creates a \(100(1-\alpha)\)% confidence interval using the upper and lower \(\alpha/2\)’th percentiles of the bootstrap distribution. Where x is some data, the upper and lower \(\alpha\)’th percentiles of x are calculated in R using \(quantile(x,c(\alpha/2,1-\alpha/2))\).
We draw a sample of size 10 from \(N(2,1)\), and calculate confidence intervals for the underlying population mean, using both the usual \(t\)-interval and a bootstrap interval based on 500 bootstrap samples.
nboot=500
n=10; mu=2; sd=1
alpha=.05 #for 95% interval
data=rnorm(n,mu,sd) #simulate a data set
# get the bootstrap distribution of the mean
bmean=NULL
for (i in 1:nboot){
bsamp=sample(data,n,replace=T) #take a bootstrap sample from data
bmean=c(bmean,mean(bsamp)) #accumulate the mean
}
bint=quantile(bmean,c(alpha/2,1-alpha/2)) #calculate percentile interval
cat("bootstrap interval is ",bint)
## bootstrap interval is 2.064066 2.916272
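The code above only computes the bootstrap interval; the usual \(t\)-interval on the same kind of data can be computed directly for comparison (a sketch; the data are regenerated here so the block stands on its own):

```r
set.seed(1)
n <- 10; mu <- 2; sigma <- 1
alpha <- .05                       # for a 95% interval
data <- rnorm(n, mu, sigma)
tcrit <- qt(1 - alpha/2, df = n - 1)
tint <- mean(data) + c(-1, 1) * tcrit * sd(data) / sqrt(n)
tint
# the same interval via t.test:
t.test(data, conf.level = 1 - alpha)$conf.int
```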
As was indicated in the notes on the \(t\)-interval, the actual coverage of the \(t\)-interval will be correct when the assumption that the underlying sample is from a normal distribution holds. Also, its empirical coverage is generally pretty close to its nominal coverage if the underlying distribution is approximately normal. (This is because the Central Limit Theorem says that sample means will typically be close to normally distributed.)
The bootstrap percentile interval makes no assumption about the shape of the underlying distribution.
The following function bconflevel simulates \(Nbatch\) samples of size \(n\) from a distribution specified by \(FUN\). Arguments to \(FUN\) are passed in using the \(...\) notation. For each sample, a \(100(1-\alpha)\)% bootstrap confidence interval is constructed, and the function computes the proportion of the samples whose confidence interval contains the true, but assumed unknown, parameter. This proportion is the empirical coverage.
bconflevel=function(Nbatch,Nboot=100,alpha=.05,n=20,truemean,FUN,...){
count=0
print(truemean)
for (i in 1:Nbatch){
data=FUN(n,...)
#generate the bootstrap distribution for the mean
bmeans=NULL
for (j in 1:Nboot){
datab=sample(data,size=length(data),replace=T) #bootstrap sample
bmeans=c(bmeans,mean(datab))}
quants=quantile(bmeans,probs=c(alpha/2,1-alpha/2)) # upper and lower percentage points
#if the interval contains the true mean, increment the count
if(quants[1]<truemean && quants[2]>truemean)count=count+1
}
emplevel=count/Nbatch #empirical coverage of the bootstrap interval
paste("Nominal level = ",1-alpha," Empirical level = ",emplevel)
}
\(Nbatch=500\) samples of size 10 are drawn from a normal distribution with mean 4 and standard deviation .3. The empirical coverage of the bootstrap interval is reported. For the bootstrap interval, 1000 bootstrap samples are used for each sample, so computing time for the bootstrap interval is roughly 1000 times that for calculating a t-interval.
bconflevel(Nbatch=500,Nboot=1000,n=10,truemean=4,FUN=rnorm,mean=4,sd=.3)
## [1] 4
## [1] "Nominal level = 0.95 Empirical level = 0.908"
The empirical coverage here will generally be fairly close to the nominal coverage because we are sampling from a normal population. Differences between the empirical and the nominal level will depend on a number of things, including the number of simulation batches, the sample size, the generated data, and the percentile method itself. Keep in mind, though, that the percentile interval doesn’t use any information about the underlying distribution.
Here we are sampling from a distribution which is very non-normal.
bconflevel(500,Nboot=1000,n=10,truemean=2,FUN=rchisq,df=2) #mean of chisq_2 is 2
## [1] 2
## [1] "Nominal level = 0.95 Empirical level = 0.9"
There is no analytical form for a confidence interval for a population percentile. This differs from the situation of the population mean, where the t-interval can be used. However, it’s easy to make a bootstrap interval for a percentile.
The following function carries out a simulation study assessing the actual coverage of the nominal 95% bootstrap interval for the 75’th percentile, when the underlying distribution is \(\chi^2_2\). For this distribution, the 75’th percentile is
qchisq(.75,2)
## [1] 2.772589
bconflevel=function(Nbatch=500,Nboot=100,alpha=.05,n=20,FUN,...){
pct=.75; df=2
truep=qchisq(pct,2)
count=0
for (i in 1:Nbatch){
data=FUN(n,...) #simulate data
n=length(data)
#generate the bootstrap distribution for the 75th percentile
bp75=NULL
for (j in 1:Nboot){
datab=sample(data,size=n,replace=T) #bootstrap sample
bp75=c(bp75,quantile(datab,pct))} #accumulate the bootstrapped percentiles
quants=quantile(bp75,probs=c(alpha/2,1-alpha/2)) # upper and lower percentage points
#if the interval contains the true percentile, increment the count
if(quants[1]<truep && quants[2]>truep)count=count+1
}
emplevel=count/Nbatch #empirical coverage of the bootstrap interval
paste("Nominal level = ",1-alpha," Empirical level = ",emplevel)
}
The following prints the empirical coverage of the percentile interval with 5 replicate simulations. Each simulation uses 500 simulation batches, but only 100 bootstrap samples for each generated data set, in order to not be too computationally expensive. This shows the variability in the bootstrap coverage.
#run 5 replicate simulations, each with 500 batches, but only 100 bootstrap samples
for (j in 1:5){
print(bconflevel(Nbatch=500,Nboot=100,n=30,FUN=rchisq,df=2))}
## [1] "Nominal level = 0.95 Empirical level = 0.932"
## [1] "Nominal level = 0.95 Empirical level = 0.908"
## [1] "Nominal level = 0.95 Empirical level = 0.904"
## [1] "Nominal level = 0.95 Empirical level = 0.914"
## [1] "Nominal level = 0.95 Empirical level = 0.93"
The coverage will typically differ from the nominal level, here 95%, but keep in mind that there is no analytical method to make a confidence interval for a percentile, so the bootstrap has given a fairly simple way to do something that we couldn’t do otherwise.
In practice, one would want to use considerably more than 100 bootstrap samples. 1000 might be reasonable.
mymodel <- 1
Nboot=500
n=100;
if(mymodel == 1){
mu <- 2
sd <- 0.25
x=rnorm(n,mu,sd) #draw sample from a normal distribution
}
if(mymodel == 2){
df=2
x=rchisq(n,df) #draw sample from a chi-squared distribution
}
means=NULL
p90=NULL
for (i in 1:Nboot){
bsample=sample(x,n,replace = TRUE)
means=c(means,mean(bsample)) #calculate the mean
p90=c(p90,quantile(bsample,.9)) #calculate 90th percentile
}
hist(means,nclass=20,main="bootstrap distribution of the mean")
dumm <- function(nb, df=2){
# plot the mean of i draws from chi-squared(df), for i = 1,...,nb:
# the means settle down near df as i grows (law of large numbers)
v <- array()
for (i in 1:nb){
v[i] <- mean(rchisq(i,df))
}
x <- 1:nb
y <- v
plot(x,y)
}
dumm(nb=5)
dumm(nb=50)
x = c(1,1.5,2,3,4,4.5,5,5.5,6,6.5,7,8,9,10,11,12,13,14,15)
y = c(6.3,11.1,20,24,26.1,30,33.8,34.0,38.1,39.9,42,46.1,53.1,52,52.5,48,42.8,27.8,21.9)
n=length(y)
plot(x,y,xlab="hardwood %",ylab="tensile strength")
We would like to use \(x\) to predict \(y\).
\[y_i = \beta_0 + \beta_1 x_i + \epsilon_i\]
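A straight-line fit via least squares (a sketch using lm; the scatterplot above suggests the true relationship is curved, but the linear fit is the statistic the bootstrap below resamples):

```r
x <- c(1,1.5,2,3,4,4.5,5,5.5,6,6.5,7,8,9,10,11,12,13,14,15)
y <- c(6.3,11.1,20,24,26.1,30,33.8,34.0,38.1,39.9,42,46.1,53.1,52,52.5,48,42.8,27.8,21.9)
fit <- lm(y ~ x)    # y_i = beta0 + beta1 x_i + eps_i
coef(fit)           # estimates of beta0 and beta1
```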
There is a built-in procedure “boot” (in the boot package) which can be used to estimate the standard error of any statistic. In order to use it, you need to create a function which returns the statistic you’re interested in. For the regression slope,
b1=function(data,index){
x=data$x[index]
y=data$y[index]
lmout=lm(y~x)
b1=coef(lmout)[2]
return(b1)}
“index” is a subset of 1:n which will be passed in by the boot command.
library(boot)
data=data.frame(x=x,y=y)
boot(data,b1,R=1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = data, statistic = b1, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 1.770986 0.0742129 0.936982
The original estimate (\(\hat \beta_1\)), something called the bias of \(\hat \beta_1\), and the standard error (the estimated standard deviation of \(\hat \beta_1\)) are returned.
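To turn the bootstrap output into a confidence interval for the slope, the boot package also provides boot.ci (a sketch; b1 here is the same slope statistic as above, rewritten to resample rows of the data frame, and the percentile interval is the same idea as the quantile-based intervals earlier):

```r
library(boot)
x <- c(1,1.5,2,3,4,4.5,5,5.5,6,6.5,7,8,9,10,11,12,13,14,15)
y <- c(6.3,11.1,20,24,26.1,30,33.8,34.0,38.1,39.9,42,46.1,53.1,52,52.5,48,42.8,27.8,21.9)
b1 <- function(data, index){
  coef(lm(y ~ x, data = data[index, ]))[2]   # slope on the resampled rows
}
set.seed(1)
bout <- boot(data.frame(x = x, y = y), b1, R = 1000)
boot.ci(bout, type = "perc")   # percentile interval for the slope
```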
Once upon a time, there was … Linanthus parryae.
More pictures of the Mojave desert
\[GD=p(1-p)\]
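As a quick illustration (a sketch), a quantity of the form \(GD = p(1-p)\) is largest when \(p = 1/2\):

```r
p <- seq(0, 1, by = .01)
GD <- p * (1 - p)
plot(p, GD, type = "l", xlab = "p", ylab = "p(1-p)")
p[which.max(GD)]   # maximized at p = 0.5, where GD = 0.25
```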
What does this code do?
myfun=function(x){
if(length(x)==1){return(x)
} else
return(x[1]+myfun(x[-1]))
}
v = c(5,6,7)
myfun(v)
## [1] 18
\[\cos(x)+\cos(2x)+\cos(3x)+ \ldots\]
\[\sum_{i=1}^n g(i)\]
\[\sum_{i=1}^n (1/i)^p\]
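Partial sums like these are easy to accumulate with cumsum (a sketch, taking a hypothetical \(p = 2\), where the series converges to \(\pi^2/6\)):

```r
n <- 1000
partial <- cumsum((1 / (1:n))^2)  # partial sums of sum_{i=1}^n (1/i)^p with p = 2
partial[n]                        # close to pi^2/6 = 1.6449...
pi^2 / 6
```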