Baron Münchhausen shows what the will can do.
In Rudolf Erich Raspe's 1785 book *Baron Munchausen's Narrative of his Marvellous Travels and Campaigns in Russia*, the (fictional) Baron's exploits, narrated in the first person, focus on his impossible achievements as a sportsman, soldier, and traveller, for instance riding on a cannonball, fighting a forty-foot crocodile, and travelling to the Moon…
Waoaoo! The bootstrap takes its name from this spirit: like the Baron, who in one tale pulls himself (and his horse) out of a swamp by his own hair, the method "pulls itself up by its own bootstraps," using nothing but the data at hand.
Suppose that a data set consists of the \(n\) items \(x_1, x_2, \ldots , x_n\).
A bootstrap sample is a sample of size \(n\), drawn *with replacement*, from the original sample. Because the sampling is with replacement, sample points can be repeated in the bootstrap sample. Here's an example.
data=round(rnorm(6),1) #six standard normal values, rounded to one decimal
data
## [1] 0.8 -1.6 -1.3 0.2 0.7 -0.5
There are six items in the data set. Now let's draw 10 bootstrap samples, each of size 6, with replacement. The R command to draw a sample of size \(n\) from \(x\), with replacement, is sample(x,n,replace=T). We don't have to know \(n\) ahead of time, as we can just use sample(x,length(x),replace=T). Let's use Nboot to denote the number of bootstrap samples.
Nboot=10
for (i in 1:Nboot){
  bsample=sample(data,length(data),replace=T) #one bootstrap sample
  print(bsample)
}
## [1] -1.6 -0.5 -0.5 -1.3 0.8 -1.3
## [1] 0.2 -0.5 0.8 -0.5 -1.3 -0.5
## [1] 0.8 0.7 -0.5 -0.5 -1.6 -0.5
## [1] -0.5 -1.6 -1.3 -1.3 -1.6 -1.3
## [1] -0.5 0.7 -1.3 -0.5 0.2 -0.5
## [1] -1.3 0.8 0.8 0.7 -1.6 -0.5
## [1] 0.8 0.7 0.2 0.7 -0.5 0.8
## [1] 0.7 0.8 0.7 0.7 -0.5 -1.6
## [1] -1.6 -1.3 -1.3 0.8 0.7 -0.5
## [1] 0.2 -1.3 -1.6 0.7 -1.6 -1.3
# Bootstrap distribution
Generally we are interested in some statistic, which is nothing more than a function of the sample values, say the mean, or a specified percentile. We can calculate the statistic for each of our bootstrap samples, and thereby construct the "bootstrap distribution" of the statistic. The following examples show the bootstrap distribution of the mean and of the 90th percentile, using 500 bootstrap samples. Let's start with a sample of size 15 from a \(\chi^2\) distribution with 2 degrees of freedom.
Nboot=500
n=15; df=2
x=rchisq(n,df) #draw sample from a chi-squared distribution
means=NULL
p90=NULL
for (i in 1:Nboot){
  bsample=sample(x,n,replace=T)
  means=c(means,mean(bsample)) #calculate the mean
  p90=c(p90,quantile(bsample,.9)) #calculate the 90th percentile
}
hist(means,nclass=20,main="bootstrap distribution of the mean")
hist(p90,nclass=20,main="bootstrap distribution of the 90th percentile")
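A common summary of a bootstrap distribution is the bootstrap estimate of standard error, which is simply the standard deviation of the bootstrapped statistics. As a minimal sketch, continuing with the means and p90 vectors computed above:
sd(means) #bootstrap standard error of the sample mean
sd(p90) #bootstrap standard error of the 90th percentile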
## Bootstrap confidence intervals
A common use of the bootstrap is to calculate a confidence interval which doesn't depend on underlying distributional assumptions. The *percentile method* creates a \(100(1-\alpha)\)% confidence interval from the lower \(\alpha/2\) and upper \(1-\alpha/2\) quantiles of the bootstrap distribution. Where x is some data, these quantiles are calculated in R using quantile(x,c(alpha/2,1-alpha/2)).
As an example, we draw a sample of size 10 from \(N(2,1)\), and calculate confidence intervals for the underlying population mean, using both the usual \(t\)-interval and a bootstrap interval based on 500 bootstrap samples.
nboot=500
n=10; mu=2; sd=1
alpha=.05 #for 95% interval
data=rnorm(n,mu,sd) #simulate a data set
# get the bootstrap distribution of the mean
bmean=NULL
for (i in 1:nboot){
  bsamp=sample(data,n,replace=T) #take a bootstrap sample from data
  bmean=c(bmean,mean(bsamp)) #accumulate the mean
}
bint=quantile(bmean,c(alpha/2,1-alpha/2)) #calculate percentile interval
cat("bootstrap interval is ",bint)
## bootstrap interval is 1.690938 2.464422
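For comparison, here is a sketch of the usual t-interval for the same simulated data, using R's built-in t.test (the endpoints will of course vary with the generated data):
tint=t.test(data,conf.level=1-alpha)$conf.int #usual t-interval for the mean
cat("t-interval is ",tint)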
As was indicated in the notes on the \(t\)-interval, the actual coverage of the \(t\)-interval will be correct when the assumption that the underlying sample is from a normal distribution holds. Also, its empirical coverage is generally pretty close to its nominal coverage if the underlying distribution is approximately normal. (This is because the Central Limit Theorem says that sample means will typically be close to normally distributed.) The bootstrap percentile interval, in contrast, makes no assumption about the shape of the underlying distribution.
The following function bconflevel simulates Nbatch samples of size \(n\) from a distribution specified by FUN; arguments to FUN are passed in using the ... notation. For each sample, a \(100(1-\alpha)\)% bootstrap confidence interval is constructed, and the function counts the proportion of samples whose confidence interval contains the true, but assumed unknown, parameter. This proportion is the empirical coverage.
bconflevel=function(Nbatch,Nboot=100,alpha=.05,n=20,truemean,FUN,...){
  df=n-1 #degrees of freedom for t
  ta2=qt(1-alpha/2,df) #100(1-alpha/2)'th percentile of t (not used below)
  count=0
  print(truemean) #echo the true mean being targeted
  for (i in 1:Nbatch){
    data=FUN(n,...) #simulate a data set
    #generate the bootstrap distribution of the mean
    bmeans=NULL
    for (j in 1:Nboot){
      datab=sample(data,size=length(data),replace=T) #bootstrap sample
      bmeans=c(bmeans,mean(datab))
    }
    quants=quantile(bmeans,probs=c(alpha/2,1-alpha/2)) #lower and upper percentage points
    #if the interval contains the true mean, increment the count
    if(quants[1]<truemean && quants[2]>truemean)count=count+1
  }
  emplevel=count/Nbatch #empirical coverage of the bootstrap interval
  paste("Nominal level = ",1-alpha," Empirical level = ",emplevel)
}
Nbatch=500 samples of size 10 are drawn from a normal distribution with mean 4 and standard deviation .3, and the empirical coverage of the bootstrap interval is reported. Since 1000 bootstrap samples are used for each generated sample, computing time for the bootstrap interval is roughly 1000 times that of calculating a single t-interval.
bconflevel(Nbatch=500,Nboot=1000,n=10,truemean=4,FUN=rnorm,mean=4,sd=.3)
## [1] 4
## [1] "Nominal level = 0.95 Empirical level = 0.924"
The empirical coverage here will generally be fairly close to the nominal coverage because we are sampling from a normal population. Differences between the empirical and the nominal level will depend on a number of things, including the number of simulation batches, the sample size, the generated data, and the percentile method itself. Keep in mind, though, that the percentile interval doesn’t use any information about the underlying distribution.
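One way to judge whether an empirical level like 0.924 differs meaningfully from 0.95 is to compute the Monte Carlo standard error of the coverage estimate, treating each of the Nbatch batches as a Bernoulli trial. A quick sketch:
sqrt(.95*.05/500) #Monte Carlo standard error of the coverage estimate, about 0.01
Empirical levels more than about two of these standard errors below 0.95 point to genuine undercoverage of the percentile interval rather than simulation noise.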
Here we are sampling from a distribution which is very non-normal.
bconflevel(500,Nboot=1000,n=10,truemean=2,FUN=rchisq,df=2) #mean of chisq_2 is 2
## [1] 2
## [1] "Nominal level = 0.95 Empirical level = 0.874"
Unlike the situation for the population mean, where the t-interval can be used, there is no simple analytical form for a confidence interval for a population percentile. However, it's easy to make a bootstrap interval for a percentile.
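Before the simulation study, here is a minimal sketch for a single sample: a 95% percentile bootstrap interval for the 75th percentile of a sample from \(\chi^2_2\) (the names x and bp here are illustrative):
n=30
x=rchisq(n,2) #one sample from chi-squared with 2 df
bp=NULL
for (j in 1:1000){
  bp=c(bp,quantile(sample(x,n,replace=T),.75)) #bootstrapped 75th percentiles
}
quantile(bp,c(.025,.975)) #percentile interval for the 75th percentile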
The following function carries out a simulation study assessing the actual coverage of the nominal 95% bootstrap interval for the 75th percentile, when the underlying distribution is \(\chi^2_2\). For this distribution, the 75th percentile is
qchisq(.75,2)
## [1] 2.772589
bconflevel=function(Nbatch=500,Nboot=100,alpha=.05,n=20,FUN,...){
  pct=.75; df=2
  truep=qchisq(pct,df) #true 75th percentile of chi-squared with df degrees of freedom
  count=0
  for (i in 1:Nbatch){
    data=FUN(n,...) #simulate data
    n=length(data)
    #generate the bootstrap distribution of the 75th percentile
    bpct=NULL
    for (j in 1:Nboot){
      datab=sample(data,size=n,replace=T) #bootstrap sample
      bpct=c(bpct,quantile(datab,pct)) #accumulate the bootstrapped percentiles
    }
    quants=quantile(bpct,probs=c(alpha/2,1-alpha/2)) #lower and upper percentage points
    #if the interval contains the true percentile, increment the count
    if(quants[1]<truep && quants[2]>truep)count=count+1
  }
  emplevel=count/Nbatch #empirical coverage of the bootstrap interval
  paste("Nominal level = ",1-alpha," Empirical level = ",emplevel)
}
The following prints the empirical coverage of the percentile interval over 5 replicate simulations. Each simulation uses 500 simulation batches, but only 100 bootstrap samples per generated data set, to keep the computation manageable. This shows the variability in the bootstrap coverage.
#run 5 replicate simulations, each with 500 batches, but only 100 bootstrap samples
for (j in 1:5){
  print(bconflevel(Nbatch=500,Nboot=100,n=30,FUN=rchisq,df=2))
}
## [1] "Nominal level = 0.95 Empirical level = 0.934"
## [1] "Nominal level = 0.95 Empirical level = 0.928"
## [1] "Nominal level = 0.95 Empirical level = 0.92"
## [1] "Nominal level = 0.95 Empirical level = 0.91"
## [1] "Nominal level = 0.95 Empirical level = 0.914"
The coverage will typically differ somewhat from the nominal level, here 95%, but keep in mind that there is no simple analytical method for making a confidence interval for a percentile, so the bootstrap gives a fairly easy way to do something that we couldn't do otherwise.
In practice, one would want to use considerably more than 100 bootstrap samples; 1000 might be reasonable.
The following repeats the bootstrap-distribution computation for a larger sample of size 100, with a switch mymodel that selects between a normal and a chi-squared population:
mymodel <- 1
Nboot=500
n=100
if(mymodel == 1){
  mu <- 2
  sd <- 0.25
  x=rnorm(n,mu,sd) #draw sample from a normal distribution
}
if(mymodel == 2){
  df=2
  x=rchisq(n,df) #draw sample from a chi-squared distribution
}
means=NULL
p90=NULL
for (i in 1:Nboot){
  bsample=sample(x,n,replace = TRUE)
  means=c(means,mean(bsample)) #accumulate the mean
  p90=c(p90,quantile(bsample,.9)) #accumulate the 90th percentile
}
hist(means,nclass=20,main="bootstrap distribution of the mean")
As an aside, the following little function plots the mean of a \(\chi^2\) sample against the sample size, illustrating how sample means stabilize as the sample grows:
dumm <- function(nb, df=2){ #df defaults to 2; the original version relied on a global df
  v <- array()
  for (i in 1:nb){
    v[i] <- mean(rchisq(i,df)) #mean of a chi-squared sample of size i
  }
  plot(1:nb, v, xlab="sample size", ylab="sample mean")
}
dumm(nb=5)
dumm(nb=50)
How about trying a distribution that is bimodal?
Let us mix two normals, choosing between them at random (50:50) for each observation.
set.seed(1234) # include seed for same samples
n = 500; x1 = rbeta(n, .5, .4)
y1 = rnorm(n, 5, 2); y2 = rnorm(n, 10, 1)
w = rbinom(n, 1, .5) # 50:50 random choice
x2 = w*y1 + (1-w)*y2 # normal mixture
par(mfrow=c(1,2)) # two panels per plot
hist(x1, prob=T, col="skyblue2", main="BETA(.5, .4)"); rug(x1)
curve(dbeta(x, .5, .4), lwd=2, col="blue", add=T)
hist(x2, prob=T, col="skyblue2", main="Normal Mixture"); rug(x2)
curve(.5*dnorm(x,5,2)+.5*dnorm(x,10,1), lwd=2, col="blue", add=T)
par(mfrow=c(1,1)) # return to default single panel
OK, so now we have a way to generate bimodal distributions.
Exercise: can you find the bootstrap distribution of the mean?
What else could you want to estimate?
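One possible sketch for the exercise, assuming the mixture sample x2 generated above:
bmix=NULL
for (i in 1:500){
  bmix=c(bmix,mean(sample(x2,length(x2),replace=T))) #bootstrap means of x2
}
hist(bmix,nclass=20,main="bootstrap means of the normal mixture")
Even though x2 is bimodal, the bootstrap distribution of its mean looks close to normal, which previews the Bickel-Freedman result mentioned below.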
Bootstrapping is a form of empirical estimation: in statistics, bootstrapping is any test or metric that relies on random sampling with replacement. The beauty of it? No distribution needs to be assumed.
Bootstrapping allows assigning measures of accuracy to sample estimates: for example, we can use the bootstrap to estimate the bias, the variance, confidence intervals, or the prediction error of an estimator.
Bootstrapping allows estimation of the sampling distribution of almost any statistic using random sampling methods.
But there is a price.
Convergence occurs for random variables drawn from finite-variance distributions, but it can be slow.
Bickel and Freedman (1981) showed that the bootstrap distribution of the mean is asymptotically normal, centred around the empirical mean.
The main ingredient of the proof is the central limit theorem (in Lindeberg-Feller form).
In the case of fat-tailed distributions (infinite variance), bootstrapping may not converge, and results have to be interpreted with caution.
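As a rough illustration of that caution, compare the bootstrap means from the finite-variance samples above with those from a Cauchy sample, which has no finite mean or variance; the Cauchy bootstrap distribution stays wild no matter how large n is:
n=100
xc=rcauchy(n) #fat-tailed: infinite variance
bc=NULL
for (i in 1:500){
  bc=c(bc,mean(sample(xc,n,replace=T))) #bootstrap means of a Cauchy sample
}
hist(bc,nclass=20,main="bootstrap means, Cauchy sample")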
Useful references:
Bickel, P. J. and Freedman, D. A. (1981), "Some asymptotic theory for the bootstrap," The Annals of Statistics, 9(6), 1196-1217.
Efron, B. (1979), "Bootstrap methods: another look at the jackknife," The Annals of Statistics, 7(1), 1-26.