This is a slightly updated version of our previous lecture on the bootstrap.
The goal is to review the concepts and practice coding.
Generating a synthetic dataset is easy in R:
What distribution am I sampling from here? What are the parameters of this distribution? How many samples do I request? Why do I use the function set.seed?
set.seed(87612345)
data=rnorm(20,mean=4,sd=.75)
Same questions:
set.seed(117426)
data=rchisq(20,1,nc=3)
And if you are not familiar with the chi-square distribution, read:
We can generate a new bootstrap sample (‘fake new data’) by resampling the data with replacement. Note how I choose to generate a sample (called bdata here) of the same length as the original sample.
bdata=sample(data,20,replace=T)
Now I can calculate statistics of interest, which I denote with the generic letter \(S\), for this sample. The choice of \(S\) depends on what I want to estimate, i.e. on my problem.
Calculate statistics of interest: \(S\) for this sample
Can you define a statistic?
A statistic is a quantity that I can calculate based only on the data, i.e. a given function of the sample values.
We can note this \(S=g(x)\).
Some examples of choices for \(S\):
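For instance (a quick sketch, reusing the simulated normal sample from above), each of these is a valid choice of \(S\):

```r
set.seed(87612345)
data <- rnorm(20, mean = 4, sd = .75)
mean(data)            # S = sample mean
median(data)          # S = sample median
var(data)             # S = sample variance
quantile(data, .9)    # S = 90th sample percentile
```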
SAMPLE STATISTICS ARE RANDOM VARIABLES AND HAVE THEIR OWN DISTRIBUTION
We can say some things about sampling distributions EVEN IF we don't know the distribution the data come from.
EXAMPLES
mean and variance of S = sample mean
What is the variance of the sample mean statistic? Assuming the \(x_i\) are independent with common variance \(V\), \[Var(\bar{x}) = Var\left(\frac{1}{n} \sum_{i=1}^n x_i\right) = \frac{1}{n^2}\sum_{i=1}^n Var(x_i) = \frac{V}{n}\]
IF we know which distribution the data are drawn from, we can theoretically predict the distribution of sample statistics.
Example: if \(x_i \sim N(\mu,\sigma^2)\), then what is the distribution of \(\bar{x}\)?
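A quick simulation check (a sketch, with hypothetical values \(\mu = 4\), \(\sigma = 0.75\), \(n = 20\)): for normal data \(\bar{x} \sim N(\mu, \sigma^2/n)\), so the variance of many simulated sample means should be close to \(\sigma^2/n\).

```r
set.seed(1)
n <- 20; mu <- 4; sigma <- 0.75
# simulate 5000 sample means, each from a fresh sample of size n
xbars <- replicate(5000, mean(rnorm(n, mean = mu, sd = sigma)))
mean(xbars)     # should be close to mu = 4
var(xbars)      # should be close to sigma^2 / n = 0.028125
sigma^2 / n
```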
mean and variance of S=sample variance
Means and medians?
What is the difference between the mean and the median?
v=c(1,2,3,4)
mean(v)
## [1] 2.5
median(v)
## [1] 2.5
v=c(1,2,3,4000)
mean(v)
## [1] 1001.5
median(v)
## [1] 2.5
v = c(1,2,90,100000000)
median(v)
## [1] 46
Try to memorize the mean and variance of common distributions, for example \[E(\chi^2(k)) = k, \qquad Var(\chi^2(k)) = 2k\]
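As a quick empirical check (a sketch, with a hypothetical choice of \(k = 5\)):

```r
set.seed(1)
k <- 5
x <- rchisq(1e5, df = k)   # large sample from chi-squared(k)
mean(x)   # should be close to k = 5
var(x)    # should be close to 2k = 10
```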
Quantiles
n=21
v = 1:n
v
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
quantile(v,probs=c(.3,.5,.7))
## 30% 50% 70%
## 7 11 15
w=v^2
w
## [1] 1 4 9 16 25 36 49 64 81 100 121 144 169 196 225 256 289 324 361
## [20] 400 441
quantile(w,probs=c(.3,.5,.7))
## 30% 50% 70%
## 49 121 225
w=abs(cos(v))
w
## [1] 0.540302306 0.416146837 0.989992497 0.653643621 0.283662185 0.960170287
## [7] 0.753902254 0.145500034 0.911130262 0.839071529 0.004425698 0.843853959
## [13] 0.907446781 0.136737218 0.759687913 0.957659480 0.275163338 0.660316708
## [19] 0.988704618 0.408082062 0.547729260
x <- 1:21
quantile(w,probs=c(.3,.5,.7))
## 30% 50% 70%
## 0.4161468 0.6603167 0.8438540
plot(x,w)
lines(x,w)
Add the statistic of interest to your list
# outside the bootstrap loop I will initialize an empty list
#
myListOfStats = NULL
# this line of code will append the new statistic value myStat to the existing list
# myListOfStats
# myListOfStats=c(myListOfStats,myStat)
Analyze the bootstrap distribution
Can you now see how the following code puts all these elements together?
set.seed(117426)
data=rchisq(20,8,ncp=5)
# theoretical mean is k + lambda = df + ncp = 8 + 5 = 13
mystatchoice = function(ichoice,mydata){
return(mean(mydata))
}
myListOfStats = NULL
for (i in 1:1000){
mybsample = sample(data,20,replace=T)
#mystat = mystatchoice(1, mybsample)
mystat = mean(mybsample)
myListOfStats = c(myListOfStats,mystat)
}
myListOfStats[1:20]
## [1] 13.297047 14.712842 12.848927 13.342714 10.590500 13.137522 13.642501
## [8] 13.577464 13.526397 9.843382 14.963724 13.513403 12.708170 11.984250
## [15] 11.459082 15.262356 14.378881 10.889801 12.620023 15.334310
mean(myListOfStats)
## [1] 13.03888
Histograms are a way to see the distribution:
hist(myListOfStats,nclass=20)
Quantiles are cutpoints that cut the values of the random variable (response) into zones of well-defined areas (i.e. probabilities). Here I ask which cut-point \(c_1,c_2\) have the probabilities \(P(X \leq c_1) = 0.025\) and \(P(X \leq c_2) = 0.975\):
# this will return the two cutpoints c_1 and c_2
# they are called the 2.5% percentile and 97.5% percentile of the empirical distribution
# provided in the vector myListOfStats (this vector stores random values of your random variable / response)
quantile(myListOfStats,probs=c(.025,.975))
## 2.5% 97.5%
## 10.48906 15.68088
Here I recall how to define a matrix and how to calculate the median of each row
#apply
M=matrix(1:12,byrow=T,ncol=4)
M
## [,1] [,2] [,3] [,4]
## [1,] 1 2 3 4
## [2,] 5 6 7 8
## [3,] 9 10 11 12
apply(M,1,median)
## [1] 2.5 6.5 10.5
Or the mean of each column
#apply
M=matrix(1:12,byrow=T,ncol=4)
M
## [,1] [,2] [,3] [,4]
## [1,] 1 2 3 4
## [2,] 5 6 7 8
## [3,] 9 10 11 12
apply(M,2,mean)
## [1] 5 6 7 8
Or the variance of each row
#apply
M=matrix(1:12,byrow=T,ncol=4)
M
## [,1] [,2] [,3] [,4]
## [1,] 1 2 3 4
## [2,] 5 6 7 8
## [3,] 9 10 11 12
apply(M,2,var)
## [1] 16 16 16 16
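These apply patterns give a loop-free way to run a bootstrap (a sketch; the sample and the number of replicates are just illustrative): store the resampled values in a matrix, one bootstrap sample per row, then apply the statistic of interest to each row.

```r
set.seed(1)
data <- rchisq(20, df = 2)
Nboot <- 1000
# each row of B is one bootstrap sample of the same length as data
B <- matrix(sample(data, Nboot * length(data), replace = TRUE),
            nrow = Nboot)
bmeans <- apply(B, 1, mean)              # one bootstrap mean per row
quantile(bmeans, probs = c(.025, .975))  # percentile interval
```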
Suppose that a data set consists of the \(n\) items \(x_1, x_2, \ldots , x_n\).
A bootstrap sample is a sample of size \(n\), drawn with \(replacement\), from the original sample. Because the sampling is with replacement, the sample points can be repeated in the bootstrap sample. Here’s an example.
data=round(rnorm(6),1)
data
## [1] -2.0 1.4 -1.4 0.2 0.2 1.5
There are six items in the data set. Now let’s draw 10 bootstrap samples, each of size 6, with replacement. The R command to draw a sample of size \(n\) from \(x\), with replacement, is \(sample(x,n,replace=T)\). We don’t have to know \(n\) ahead of time, as we can just use \(sample(x,length(x),replace=T)\). Let’s use \(Nboot\) to denote the number of bootstrap samples.
Nboot=10
for (i in 1:Nboot){
bsample=sample(data,length(data),replace=T)
print(bsample)
}
## [1] -2.0 0.2 1.4 0.2 0.2 -1.4
## [1] 0.2 -2.0 0.2 0.2 0.2 -1.4
## [1] -1.4 0.2 0.2 -2.0 -2.0 1.4
## [1] 0.2 0.2 -1.4 0.2 0.2 -2.0
## [1] -2.0 1.4 -2.0 1.4 0.2 0.2
## [1] -2.0 1.4 1.4 0.2 0.2 0.2
## [1] 1.5 1.4 -1.4 -2.0 0.2 1.5
## [1] 1.5 1.4 1.4 0.2 -1.4 0.2
## [1] 0.2 1.4 -1.4 -2.0 1.4 -1.4
## [1] -1.4 1.4 1.4 -1.4 1.5 1.4
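Because the sampling is with replacement, some data points typically appear more than once and others not at all; applying table() to a bootstrap sample makes this easy to see (a quick sketch):

```r
set.seed(1)
data <- round(rnorm(6), 1)
bsample <- sample(data, length(data), replace = TRUE)
table(bsample)   # counts how often each value was drawn
```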
Bootstrap distribution
Generally we are interested in some statistic, which is nothing more than a function of the sample values, say the mean, or a specified percentile. We can calculate the statistic for each of our bootstrap samples, and thereby construct the “bootstrap distribution” of the statistic. The following examples show the bootstrap distribution of the mean and of the 90th percentile, using 500 bootstrap samples. Let’s start with a sample of size 15 from a \(\chi^2\) distribution with 2 degrees of freedom.
Nboot=500
n=15; df=2
x=rchisq(n,df) #draw sample from a chi-squared distribution
means=NULL
p90=NULL
for (i in 1:Nboot){
bsample=sample(x,n,replace=T)
means=c(means,mean(bsample)) #calculate the mean
p90=c(p90,quantile(bsample,.9)) #calculate 90th percentile
}
hist(means,nclass=20,main="bootstrap distribution of the mean")
hist(p90,nclass=20,main="bootstrap distribution of the 90'th percentile")
Bootstrap confidence intervals
A common use of the bootstrap is to calculate a confidence interval which doesn’t depend on underlying distributional assumptions. The \(percentile method\) creates a \(100(1-\alpha)\)% confidence interval using the upper and lower \(\alpha/2\)’th percentiles of the bootstrap distribution. Where x is some data, the upper and lower \(\alpha\)’th percentiles of x are calculated in R using \(quantile(x,c(\alpha/2,1-\alpha/2))\).
We draw a sample of size 10 from \(N(2,1)\), and calculate confidence intervals for the underlying population mean, using both the usual \(t\)-interval and a bootstrap interval based on 500 bootstrap samples.
nboot=500
n=10; mu=2; sd=1
alpha=.05 #for 95% interval
data=rnorm(n,mu,sd) #simulate a data set
# get the bootstrap distribution of the mean
bmean=NULL
for (i in 1:nboot){
bsamp=sample(data,n,replace=T) #take a bootstrap sample from data
bmean=c(bmean,mean(bsamp)) #accumulate the mean
}
bint=quantile(bmean,c(alpha/2,1-alpha/2)) #calculate percentile interval
cat("bootstrap interval is ",bint)
## bootstrap interval is 2.064066 2.916272
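The code above only computes the bootstrap interval; the usual \(t\)-interval on the same kind of data can be computed directly for comparison (a sketch; the data are regenerated here so the block stands on its own):

```r
set.seed(1)
n <- 10; mu <- 2; sigma <- 1
alpha <- .05                       # for a 95% interval
data <- rnorm(n, mu, sigma)
tcrit <- qt(1 - alpha/2, df = n - 1)
tint <- mean(data) + c(-1, 1) * tcrit * sd(data) / sqrt(n)
tint
# the same interval via t.test:
t.test(data, conf.level = 1 - alpha)$conf.int
```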
As was indicated in the notes on the \(t\)-interval, the actual coverage of the \(t\)-interval will be correct when the assumption that the underlying sample is from a normal distribution holds. Also, its empirical coverage is generally pretty close to its nominal coverage if the underlying distribution is approximately normal. (This is because the Central Limit Theorem says that sample means will typically be close to normally distributed.)
The bootstrap percentile interval makes no assumption about the shape of the underlying distribution.
The following function bconflevel simulates \(Nbatch\) samples of size \(n\) from a distribution specified by \(FUN\). Arguments to \(FUN\) are passed in using the \(...\) notation. For each sample, a \(100(1-\alpha)\)% bootstrap confidence interval is constructed, and the function computes the proportion of the samples whose confidence interval contains the true, but assumed unknown, parameter. This proportion is the empirical coverage.
bconflevel=function(Nbatch,Nboot=100,alpha=.05,n=20,truemean,FUN,...){
count=0
print(truemean)
for (i in 1:Nbatch){
data=FUN(n,...)
#generate the bootstrap distribution for the mean
bmeans=NULL
for (j in 1:Nboot){
datab=sample(data,size=length(data),replace=T) #bootstrap sample
bmeans=c(bmeans,mean(datab))}
quants=quantile(bmeans,probs=c(alpha/2,1-alpha/2)) # upper and lower percentage points
#if the interval contains the true mean, increment the count
if(quants[1]<truemean && quants[2]>truemean)count=count+1
}
emplevel=count/Nbatch #empirical coverage of the bootstrap interval
paste("Nominal level = ",1-alpha," Empirical level = ",emplevel)
}
\(Nbatch=500\) samples of size 10 are drawn from a normal distribution with mean 4 and standard deviation .3. The empirical coverage of the bootstrap interval is reported. For the bootstrap interval, 1000 bootstrap samples are used for each sample, so computing time for the bootstrap interval is roughly 1000 times that for calculating a t-interval.
bconflevel(Nbatch=500,Nboot=1000,n=10,truemean=4,FUN=rnorm,mean=4,sd=.3)
## [1] 4
## [1] "Nominal level = 0.95 Empirical level = 0.908"
The empirical coverage here will generally be fairly close to the nominal coverage because we are sampling from a normal population. Differences between the empirical and the nominal level will depend on a number of things, including the number of simulation batches, the sample size, the generated data, and the percentile method itself. Keep in mind, though, that the percentile interval doesn’t use any information about the underlying distribution.
Here we are sampling from a distribution which is very non-normal.
bconflevel(500,Nboot=1000,n=10,truemean=2,FUN=rchisq,df=2) #mean of chisq_2 is 2
## [1] 2
## [1] "Nominal level = 0.95 Empirical level = 0.9"
There is no analytical form for a confidence interval for a population percentile. This differs from the situation of the population mean, where the t-interval can be used. However, it’s easy to make a bootstrap interval for a percentile.
The following function carries out a simulation study assessing the actual coverage of the nominal 95% bootstrap interval for the 75’th percentile, when the underlying distribution is \(\chi^2_2\). For this distribution, the 75’th percentile is
qchisq(.75,2)
## [1] 2.772589
bconflevel=function(Nbatch=500,Nboot=100,alpha=.05,n=20,FUN,...){
pct=.75; df=2
truep=qchisq(pct,2)
count=0
for (i in 1:Nbatch){
data=FUN(n,...) #simulate data
n=length(data)
#generate the bootstrap distribution for the 75th percentile
bp75=NULL
for (j in 1:Nboot){
datab=sample(data,size=n,replace=T) #bootstrap sample
bp75=c(bp75,quantile(datab,pct))} #accumulate the bootstrapped percentiles
quants=quantile(bp75,probs=c(alpha/2,1-alpha/2)) # upper and lower percentage points
#if the interval contains the true percentile, increment the count
if(quants[1]<truep && quants[2]>truep)count=count+1
}
emplevel=count/Nbatch #empirical coverage of the bootstrap interval
paste("Nominal level = ",1-alpha," Empirical level = ",emplevel)
}
The following prints the empirical coverage of the percentile interval with 5 replicate simulations. Each simulation uses 500 simulation batches, but only 100 bootstrap samples for each generated data set, in order to not be too computationally expensive. This shows the variability in the bootstrap coverage.
#run 5 replicate simulations, each with 500 batches, but only 100 bootstrap samples
for (j in 1:5){
print(bconflevel(Nbatch=500,Nboot=100,n=30,FUN=rchisq,df=2))}
## [1] "Nominal level = 0.95 Empirical level = 0.932"
## [1] "Nominal level = 0.95 Empirical level = 0.908"
## [1] "Nominal level = 0.95 Empirical level = 0.904"
## [1] "Nominal level = 0.95 Empirical level = 0.914"
## [1] "Nominal level = 0.95 Empirical level = 0.93"
The coverage will typically differ from the nominal level, here 95%, but keep in mind that there is no analytical method to make a confidence interval for a percentile, so the bootstrap has given a fairly simple way to do something that we couldn’t do otherwise.
In practice, one would want to use considerably more than 100 bootstrap samples. 1000 might be reasonable.
mymodel <- 1
Nboot=500
n=100;
if(mymodel == 1){
mu <- 2
sd <- 0.25
x=rnorm(n,mu,sd) #draw sample from a normal distribution
}
if(mymodel == 2){
df=2
x=rchisq(n,df) #draw sample from a chi-squared distribution
}
means=NULL
p90=NULL
for (i in 1:Nboot){
bsample=sample(x,n,replace = TRUE)
means=c(means,mean(bsample)) #calculate the mean
p90=c(p90,quantile(bsample,.9)) #calculate 90th percentile
}
hist(means,nclass=20,main="bootstrap distribution of the mean")
dumm <- function(nb, df=2){
# plot the mean of i draws from chi-squared(df), for i = 1,...,nb:
# the means settle down near df as i grows (law of large numbers)
v <- array()
for (i in 1:nb){
v[i] <- mean(rchisq(i,df))
}
x <- 1:nb
y <- v
plot(x,y)
}
dumm(nb=5)
dumm(nb=50)
x = c(1,1.5,2,3,4,4.5,5,5.5,6,6.5,7,8,9,10,11,12,13,14,15)
y = c(6.3,11.1,20,24,26.1,30,33.8,34.0,38.1,39.9,42,46.1,53.1,52,52.5,48,42.8,27.8,21.9)
n=length(y)
plot(x,y,xlab="hardwood %",ylab="tensile strength")
We would like to use \(x\) to predict \(y\).
\[y_i = \beta_0 + \beta_1 x_i + \epsilon_i\]
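A straight-line fit via least squares (a sketch using lm; the scatterplot above suggests the true relationship is curved, but the linear fit is the statistic the bootstrap below resamples):

```r
x <- c(1,1.5,2,3,4,4.5,5,5.5,6,6.5,7,8,9,10,11,12,13,14,15)
y <- c(6.3,11.1,20,24,26.1,30,33.8,34.0,38.1,39.9,42,46.1,53.1,52,52.5,48,42.8,27.8,21.9)
fit <- lm(y ~ x)    # y_i = beta0 + beta1 x_i + eps_i
coef(fit)           # estimates of beta0 and beta1
```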
There is a built-in procedure “boot” (in the boot package) which can be used to estimate the standard error of any statistic. In order to use it, you need to create a function which returns the statistic you’re interested in. For the regression slope,
b1=function(data,index){
x=data$x[index]
y=data$y[index]
lmout=lm(y~x)
b1=coef(lmout)[2]
return(b1)}
“index” is a subset of 1:n which will be passed in by the boot command.
library(boot)
data=data.frame(x=x,y=y)
boot(data,b1,R=1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = data, statistic = b1, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 1.770986 0.0742129 0.936982
The original estimate (\(\hat \beta_1\)), something called the bias of \(\hat \beta_1\), and the standard error (the estimated standard deviation of \(\hat \beta_1\)) are returned.
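To turn the bootstrap output into a confidence interval for the slope, the boot package also provides boot.ci (a sketch; b1 here is the same slope statistic as above, rewritten to resample rows of the data frame, and the percentile interval is the same idea as the quantile-based intervals earlier):

```r
library(boot)
x <- c(1,1.5,2,3,4,4.5,5,5.5,6,6.5,7,8,9,10,11,12,13,14,15)
y <- c(6.3,11.1,20,24,26.1,30,33.8,34.0,38.1,39.9,42,46.1,53.1,52,52.5,48,42.8,27.8,21.9)
b1 <- function(data, index){
  coef(lm(y ~ x, data = data[index, ]))[2]   # slope on the resampled rows
}
set.seed(1)
bout <- boot(data.frame(x = x, y = y), b1, R = 1000)
boot.ci(bout, type = "perc")   # percentile interval for the slope
```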
Once upon a time, there was … Linanthus parryae.
More pictures of the Mojave desert
\[GD=p(1-p)\]
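As a quick illustration (a sketch), a quantity of the form \(GD = p(1-p)\) is largest when \(p = 1/2\):

```r
p <- seq(0, 1, by = .01)
GD <- p * (1 - p)
plot(p, GD, type = "l", xlab = "p", ylab = "p(1-p)")
p[which.max(GD)]   # maximized at p = 0.5, where GD = 0.25
```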
What does this code do?
myfun=function(x){
if(length(x)==1){return(x)
} else
return(x[1]+myfun(x[-1]))
}
v = c(5,6,7)
myfun(v)
## [1] 18
\[\cos(x)+\cos(2x)+\cos(3x)+ \ldots\]
\[\sum_{i=1}^n g(i)\]
\[\sum_{i=1}^n (1/i)^p\]
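Partial sums like these are easy to accumulate with cumsum (a sketch, taking a hypothetical \(p = 2\), where the series converges to \(\pi^2/6\)):

```r
n <- 1000
partial <- cumsum((1 / (1:n))^2)  # partial sums of sum_{i=1}^n (1/i)^p with p = 2
partial[n]                        # close to pi^2/6 = 1.6449...
pi^2 / 6
```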