Preliminary explanations

Let us start today’s activity with a few words of explanation.

In statistics, we often want to represent data by a Model

Doing this allows us, to a certain extent, to say something that goes beyond the data itself.

To carry out this exercise, we select a model in our bag of models.

Most models have parameters, which characterize the shape of their probability distribution.

Hence, if I select the normal distribution as my model (as we will do here), I need 2 parameters, the mean and the variance of the normal distribution, to represent my data.

But we do not know these parameters.

Therefore we estimate these parameters, assuming that the model is correct (that the data was sampled from a normal distribution).

Having estimated the mean and the variance, we can calculate ‘the error bar’ of our estimator of the mean.

The standard method for doing this is called t-confidence interval for the mean estimator of a normal distribution.

So, in the exercise we propose today, we will use R to make simulations and check, empirically, if the analytical theory is correct.

We will also play a trick. We will generate data from another model, and see how wrong our confidence interval is.

To check the quality of our confidence interval we are going to estimate its ‘coverage’.

The idea is simple: repeat several estimations of the mean, using synthetic=simulated data and see if the confidence intervals does indeed contain the true (this time known!) value of the parameter.

We will collect the proportion of replicates in which the calculated interval indeed contains the mean. This is the empirical coverage of the t-confidence interval.

And now, on with calculations in R.

Carrying out a simulation study

The generic framework for carrying out a simulation study is

  set #Nbatch#, the number of simulation batches Nbatch
  set #n#, the sample size of each simulation batch
    for (i in 1: Nbatch){ 
       draw a sample of size n from a distribution
       do something using the sample
       }
   aggregate results across simulation batches

Example: simulation study to evaluate the coverage of the t-interval

The t-interval is given by

\[\bar x \pm t_{\alpha/2,n-1} s/\sqrt{n}\]

where \(t_{\alpha/2,n-1}\) cuts off an area \(\alpha/2\) in the upper tail of the \(t\) distribution with \(n-1\) degrees of freedom.

when samples are drawn from a normal population

When the data are a sample from a normal population with mean \(\mu\) and standard deviation \(\sigma\), then \(100(1-\alpha)\)% of t-intervals constructed will contain the true but unknown population mean \(\mu\). The nominal, or named, level of the t-interval is \(100(1-\alpha)\)%.

The following simulation study draws 100 samples of size 10 from a normal population with mean 3, standard deviation .5. For each sample, a 95% confidence interval for the mean is calculated. The 100 intervals are plotted, and a count is made as to how many of the intervals contain the true mean. The proportion of intervals containing the true mean is the “empirical coverage” of the procedure.

 plot(0,0,xlim=c(2,4),ylim=c(0,100),type="n",
       ylab="simulation batch", xlab="when sampling from normal population")
 Nbatch=100 #fix the number of simulation batches
 n=10       #fix the sample size for each batch
 truemean=3; sd=.5; alpha=.05 #fix some parameters for this specific simulation
 cover=0

for (i in 1:Nbatch){
   data=rnorm(n,truemean,sd) #draw a sample
   #now do something with the sample              
   tint=mean(data)+c(-1,1)*qt(1-alpha/2,n-1)*sd(data)/sqrt(n)
   count1=ifelse(tint[1]<truemean&&truemean<tint[2],1,0)
   lines(tint,c(i,i))
   cover=cover+count1
 }

lines(c(truemean,truemean),c(0,100))

 empirical=cover/Nbatch 
 print(paste("empirical coverage is ",100*empirical,"%"))
## [1] "empirical coverage is  96 %"

The estimated, or empirical, coverage of the t-interval is 0.96%. If we did more simulation batches the empirical coverage would typically be closer to the nominal coverage. This is a property of the t-interval when sampling from a normal population.

when samples are drawn from a non-normal population

What if we sample from a non-normal popluation? The following does the similar simulation, but where samples are drawn from \(\chi^2_2\), the \(\chi^2\) distribution with 2 degrees of freedom. This is a very non-normal population, as can be seen from the following.

x=seq(0,5,length.out=200)
plot(x,dchisq(x,2),type="l",ylab="f(x)",main="chi-squared_2 density function")

Now carry out the same simulation study, but with samples drawn from the \(\chi^2_2\) distribution.

 plot(0,0,xlim=c(-1,6),ylim=c(0,100),type="n",
     ylab="simulation batch", xlab="sampling from chi-squared distribution")
 truemean=2; n=10; Nbatch=100; alpha=.05; df=2
 cover=0
 for (i in 1:Nbatch){
   data=rchisq(n,df=2)
   tint=mean(data)+c(-1,1)*qt(1-alpha/2,n-1)*sd(data)/sqrt(n)
   lines(tint,c(i,i))
   cover=cover+ifelse(tint[1]<truemean&&truemean<tint[2],1,0)
 }
  lines(c(truemean,truemean),c(0,100))

 empirical=cover/Nbatch  
 print(paste("empirical coverage is ",100*empirical,"%"))
## [1] "empirical coverage is  84 %"

With \(n\) moderately large, the empirical coverage will be fairly close to the nominal coverage. This is because of the central limit theorem, which says that the distribution of the sample mean is quite close to a normal distribution even if the underlying population is non-normal. For smallish values of \(n\), such as was used here, the empirical and nominal coverage can be quite different.