Preliminary explanations

Let us start today’s activity with a few words of explanation.

In statistics, we often want to represent data by a model.

Doing this allows us, to a certain extent, to say something that goes beyond the data itself.

To carry out this exercise, we select a model from our bag of models.

Most models have parameters, which characterize the shape of their probability distribution.

Hence, if I select the normal distribution as my model (as we will do here), I need 2 parameters, the mean and the variance of the normal distribution, to represent my data.

But we do not know these parameters.

Therefore we estimate these parameters, assuming that the model is correct (that the data was sampled from a normal distribution).

Having estimated the mean and the variance, we can calculate ‘the error bar’ of our estimator of the mean.

The standard method for doing this is called the t-confidence interval for the mean of a normal distribution.

So, in the exercise we propose today, we will use R to make simulations and check, empirically, if the analytical theory is correct.

We will also play a trick. We will generate data from another model, and see how wrong our confidence interval is.

To check the quality of our confidence interval we are going to estimate its ‘coverage’.

The idea is simple: repeat the estimation of the mean many times, using synthetic (simulated) data, and see whether the confidence interval does indeed contain the true value of the parameter (this time known, because we simulated the data!).

We will collect the proportion of replicates in which the calculated interval indeed contains the mean. This is the empirical coverage of the t-confidence interval.

So, we will use an exact mathematical formula to build a 95% confidence interval for the mean of a normal distribution, and use numerical simulations in R to check if the probability that the confidence interval contains the mean is indeed 95%.

And now, on with calculations in R.

Is the Rat in the Box?

Is the rat in the box?

Or is the box empty?

Carrying out a simulation study

The generic framework for carrying out a simulation study is:

  set Nbatch, the number of simulation batches
  set n, the sample size of each simulation batch
  for (i in 1:Nbatch){
     draw a sample of size n from a distribution
     do something using the sample
  }
  aggregate results across simulation batches
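To make this concrete, here is a minimal sketch of this framework in R (not part of the exercise itself; the Exponential(1) population, whose true mean is 1, is an arbitrary choice):

 Nbatch=100                        # number of simulation batches
 n=10                              # sample size of each batch
 estimates=numeric(Nbatch)
 for (i in 1:Nbatch){
   onesample=rexp(n,rate=1)        # draw a sample of size n
   estimates[i]=mean(onesample)    # do something using the sample
 }
 mean(estimates)                   # aggregate: should be close to the true mean, 1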

Example: simulation study to evaluate the coverage of the t-interval

The t-interval is given by

\[\bar x \pm t_{\alpha/2,n-1} s/\sqrt{n}\]

where \(t_{\alpha/2,n-1}\) cuts off an area \(\alpha/2\) in the upper tail of the \(t\) distribution with \(n-1\) degrees of freedom, \(t_{n-1}\).

Please note that, because the t-distribution is symmetric around 0, if \(T \sim t_{n-1}\) then:

\[\mathbb{P}\left( T \in [-t_{\alpha/2,n-1},\ t_{\alpha/2,n-1}] \right) = 1- \alpha\]

Because \(T=\frac{\bar{x} - \mu}{s/\sqrt{n}}\) is such that \(T \sim t_{n-1}\), we get:

\[ \mathbb{P} (\mu \in [\bar{x} - t_{\alpha/2,n-1} s / \sqrt{n}, \bar{x} + t_{\alpha/2,n-1} s / \sqrt{n} ]) = 1- \alpha \]

This means that the probability that the rat (the true mean, \(\mu\)) is in the box (the confidence interval at nominal level \(1-\alpha\)) is \(1-\alpha\).
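You can check this probability statement numerically with qt() and pt() (a quick sanity check; alpha=.05 and n=10 are arbitrary choices):

 alpha=.05; n=10
 tq=qt(1-alpha/2,n-1)              # upper alpha/2 critical value
 pt(tq,n-1)-pt(-tq,n-1)            # should equal 1-alpha = 0.95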

when samples are drawn from a normal population

When the data are a sample from a normal population with mean \(\mu\) and standard deviation \(\sigma\), then \(100(1-\alpha)\)% of t-intervals constructed will contain the true but unknown population mean \(\mu\). The nominal, or named, level of the t-interval is \(100(1-\alpha)\)%.

The following simulation study draws 100 samples of size 10 from a normal population with mean 3, standard deviation .5. For each sample, a 95% confidence interval for the mean is calculated. The 100 intervals are plotted, and a count is made as to how many of the intervals contain the true mean. The proportion of intervals containing the true mean is the “empirical coverage” of the procedure.

 plot(0,0,xlim=c(2,4),ylim=c(0,100),type="n",
       ylab="simulation batch", xlab="when sampling from normal population")
 Nbatch=100 #fix the number of simulation batches
 n=10       #fix the sample size for each batch
 truemean=3; truesd=.5; alpha=.05 #fix parameters (truesd, not sd, to avoid masking R's sd() function)
 cover=0

for (i in 1:Nbatch){
   data=rnorm(n,truemean,truesd) #draw a sample
   #now do something with the sample              
   tint=mean(data)+c(-1,1)*qt(1-alpha/2,n-1)*sd(data)/sqrt(n)
   count1=ifelse(tint[1]<truemean&&truemean<tint[2],1,0)
   lines(tint,c(i,i))
   cover=cover+count1
 }

lines(c(truemean,truemean),c(0,100))

 empirical=cover/Nbatch 
 print(paste("empirical coverage is ",100*empirical,"%"))
## [1] "empirical coverage is  96 %"

The estimated, or empirical, coverage of the t-interval is 0.96, i.e. 96%. If we ran more simulation batches, the empirical coverage would typically be closer to the nominal coverage. This is a property of the t-interval when sampling from a normal population.
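Note that with only 100 batches the empirical coverage is itself a noisy estimate. A back-of-the-envelope computation of its Monte Carlo standard error (a sketch, using the usual formula for the standard error of a proportion):

 p=.95; Nbatch=100
 sqrt(p*(1-p)/Nbatch)   # about 0.022, so 96% is well within one standard error of 95%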

when samples are drawn from a non-normal population

What if we sample from a non-normal population? The following carries out a similar simulation, but with samples drawn from \(\chi^2_2\), the \(\chi^2\) distribution with 2 degrees of freedom. This is a very non-normal population, as can be seen from the following plot.

x=seq(0,5,length.out=200)
plot(x,dchisq(x,2),type="l",ylab="f(x)",main="chi-squared_2 density function")

Now carry out the same simulation study, but with samples drawn from the \(\chi^2_2\) distribution.

 plot(0,0,xlim=c(-1,6),ylim=c(0,100),type="n",
     ylab="simulation batch", xlab="sampling from chi-squared distribution")
 truemean=2; n=10; Nbatch=100; alpha=.05; df=2
 cover=0
 for (i in 1:Nbatch){
   data=rchisq(n,df=df)
   tint=mean(data)+c(-1,1)*qt(1-alpha/2,n-1)*sd(data)/sqrt(n)
   lines(tint,c(i,i))
   cover=cover+ifelse(tint[1]<truemean&&truemean<tint[2],1,0)
 }
  lines(c(truemean,truemean),c(0,100))

 empirical=cover/Nbatch  
 print(paste("empirical coverage is ",100*empirical,"%"))
## [1] "empirical coverage is  93 %"

With \(n\) moderately large, the empirical coverage will be fairly close to the nominal coverage. This is a consequence of the central limit theorem: the distribution of the sample mean is quite close to a normal distribution even if the underlying population is non-normal. For smallish values of \(n\), such as the one used here, the empirical and nominal coverage can differ noticeably.
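To see the effect of the sample size, you can re-run the chi-squared simulation for several values of \(n\) (a sketch; the exact numbers will vary from run to run):

 coverage.chisq=function(n,Nbatch=1000,alpha=.05){
   truemean=2    # the mean of the chi-squared distribution with 2 df
   cover=0
   for (i in 1:Nbatch){
     data=rchisq(n,df=2)
     tint=mean(data)+c(-1,1)*qt(1-alpha/2,n-1)*sd(data)/sqrt(n)
     cover=cover+ifelse(tint[1]<truemean&&truemean<tint[2],1,0)
   }
   cover/Nbatch
 }
 sapply(c(10,30,100),coverage.chisq)  # coverage should move closer to 0.95 as n grows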

Hints

Dissecting the code

You may be completely lost with the first cell of R code.

Therefore, I have dissected it for you here:

# first initialize the plot
 plot(0,0,xlim=c(2,4),ylim=c(0,100),type="n",
      ylab="simulation batch", xlab="when sampling from normal population")


# first, set the simulation parameters

 Nbatch=100 #fix the number of simulation batches
 n=10       #fix the sample size for each batch
 truemean=3; truesd=.5; alpha=.05 #fix some parameters for this specific simulation
 cover=0

for (i in 1:Nbatch){
  
# draw a sample of size n from the normal distribution
  
   data=rnorm(n,truemean,truesd) #draw a sample

# estimate the parameters
   estmean = mean(data)
   estsd   =   sd(data)
   
# calculate the t-confidence interval for the estimated mean: 
   tnx <- qt(1-alpha/2,n-1)
   tint=estmean + c(-1,1)*tnx*estsd/sqrt(n)
   
# count1 is 1 if this interval contains the true parameter (the true mean), and 0 otherwise
   
   count1=ifelse(tint[1]<truemean&&truemean<tint[2],1,0)
   
# increment the counter : cover counts hits
   cover=cover+count1
   
# add graphics: draw the i-th interval as a horizontal segment at height i
   lines(tint,c(i,i)) 
   
 }

# draw a vertical line where the true mean is
lines(c(truemean,truemean),c(0,100))

# all plots are now finished


# calculate the coverage probability
 empirical=cover/Nbatch 
# spit out the result:
 print(paste("empirical coverage is ",100*empirical,"%"))
## [1] "empirical coverage is  94 %"
# punchline?
# is the estimated coverage the same as the announced coverage?

Wrapping the code into a function

You may want to turn the simulation into a reusable function that exposes the parameters. This is done here. Hey, this is good practice: you should try to define your own functions.

# note: self-referential defaults like n=n would fail if bete() were called
# without arguments, so we give concrete default values instead
bete <- function(n=10,alpha=.05,Nbatch=100,
                 truemean=3,truesd=.5,showplot=0){

# first initialize the plot
if(showplot == 1){
 plot(0,0,xlim=c(truemean-1,truemean+1),ylim=c(0,Nbatch),type="n",
      ylab="simulation batch", xlab="when sampling from normal population")
}

 #fix some parameters for this specific simulation
 cover=0

for (i in 1:Nbatch){
  
# draw a sample of size n from the normal distribution
  
   data=rnorm(n,truemean,truesd) #draw a sample

# estimate the parameters
   estmean = mean(data)
   estsd   =   sd(data)
   
# calculate the t-confidence interval for the estimated mean: 
   
   tnx <- qt(1-alpha/2,n-1)
   tint = estmean + c(-1,1)*tnx*estsd/sqrt(n)
   
# count1 is 1 if this interval contains the true parameter (the true mean), and 0 otherwise
   
   count1=ifelse(tint[1]<truemean&&truemean<tint[2],1,0)
   
# increment the counter : cover counts hits
   cover=cover+count1
   
# add graphics: draw the i-th interval as a horizontal segment at height i
   if(showplot == 1){
   lines(tint,c(i,i)) 
   }
   
 }

# draw a vertical line where the true mean is
 if(showplot == 1){
lines(c(truemean,truemean),c(0,Nbatch))
 }
# all plots are now finished


# calculate the coverage probability
 empirical=cover/Nbatch 
# spit out the result:
 print(paste("empirical coverage is ",100*empirical,"%"))

# punchline?
# is the estimated coverage the same as the announced coverage?
 
 theoretical = 1. - alpha
 print(paste("theoretical coverage is ",100*theoretical,"%"))
 return (empirical) 
}

# we will now call bete with the following parameters

 Nbatch=100 #fix the number of simulation batches
 n=10       #fix the sample size for each batch
 truemean=3; 
 truesd=.5; 
 alpha=.05;
 showplot=0;
 
bete(n=n,alpha=alpha,Nbatch=Nbatch,truemean=truemean,truesd=truesd,showplot=showplot) 
## [1] "empirical coverage is  95 %"
## [1] "theoretical coverage is  95 %"
## [1] 0.95

OK

Now can you write a function that calls bete 1000 times?

What do you conclude?
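If you get stuck, here is one possible sketch (the name many.bete is just a suggestion; it assumes bete is defined as above):

 many.bete=function(Nrep=1000){
   coverages=numeric(Nrep)
   for (i in 1:Nrep){
     # capture.output() swallows the print() calls inside bete
     capture.output(coverages[i]<-bete(n=10,alpha=.05,Nbatch=100,
                                       truemean=3,truesd=.5,showplot=0))
   }
   hist(coverages,xlab="empirical coverage",
        main="coverage of the t-interval over many runs")
   mean(coverages)   # should be close to 1-alpha = 0.95
 }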

For the curious

What is the uncertainty on the value of the speed of light?

Check the supplementary material:

The speed of light and its uncertainties

Is the Rat in the Box? (mathematically)

We are going to provide some justification for the t-confidence interval.

Before we start, check, check, check:

Can you give a definition of ‘estimators’?

Can you give a definition of ‘statistics’?

Can you give a definition of the normal distribution?

Can you give a definition of the t-distribution?

Can you give a definition of the chi-square distribution?

Let us review the result obtained by William Gosset (Student).

If \(X_i\) are random variables that are i.i.d. normally distributed as \(N(\mu,\sigma^2)\) and if \(\bar{X},S^2\) are the empirical mean and empirical variance estimators of the true mean \(\mu\) and the true variance \(\sigma^2\), then the random variable

\[Z_n = \frac{\bar{X} - \mu}{ S / \sqrt{n} }\]

follows a Student t distribution with \(n-1\) degrees of freedom:

\[ Z_n \sim t_{n-1}\]

and the random variable

\[W_n = (n-1) S^2 / \sigma^2\]

follows a \(\chi^2\) distribution with \(n-1\) degrees of freedom

\[ W_n \sim \chi^2_{n-1}\]
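Before diving into the justification, here is a quick numerical check of these two claims (a sketch; the sample size, the parameter values, and the number of replicates are arbitrary choices):

 n=10; mu=3; sigma=.5; Nrep=5000
 Zn=numeric(Nrep); Wn=numeric(Nrep)
 for (i in 1:Nrep){
   x=rnorm(n,mu,sigma)
   Zn[i]=(mean(x)-mu)/(sd(x)/sqrt(n))   # should follow t_{n-1}
   Wn[i]=(n-1)*var(x)/sigma^2           # should follow chi-squared_{n-1}
 }
 qqplot(qt(ppoints(Nrep),df=n-1),Zn); abline(0,1)
 qqplot(qchisq(ppoints(Nrep),df=n-1),Wn); abline(0,1)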

The reasons why this is true are as follows.

First, Student’s t-distribution with \(k\) degrees of freedom can be defined from the normal and chi-square distributions by:

\[T = \frac{Z}{\sqrt{V/k}}\] where \(Z \sim N(0,1)\) and \(V \sim \chi^2_k\) are independent random variables.

But the chi-square distribution with \(k\) degrees of freedom is just the distribution of the square norm of a random vector with \(k\) independent standard normal components.

Therefore, all there is to the result of Gosset/Fisher is to demonstrate that the empirical mean \(\bar{X}\) and the empirical variance \(S^2\) of a normal sample are independent random variables.

How can we do this?

Let us consider a random vector with independent standard normally distributed coordinates \((X,Y)\).

Let us show that the square norm of the vector is completely independent of the angle that the vector makes with the x-axis. This is intuitively obvious: we can gain no information on the angle of a vector by observing its norm and vice-versa!

For those not convinced by this ‘physical argument’, let us apply some maths.

Now let us consider, in essence, the change of coordinates (Cartesian to polar):

\[(X,Y) \rightarrow (Z,W)\]

Remember that to show that two random variables \(Z\) and \(W\) are independent, we must show that the joint pdf factorizes as the product of the marginal pdfs: \[f_{(Z,W)}(z,w)=f_Z(z)f_W(w)\] Let \[Z=X^2+Y^2\] and \[W=\frac{Y}{X}\] (square norm and tangent).

One can easily obtain the pdf of a function \(h(X)\) of a random variable \(X\) by an application of the Jacobian formula (see the links at the end).

Therefore, applying this to \(h(X)=X^2\) with \(X \sim N(0,1)\), we find the pdf of \(X^2\) to be:

\[\frac{1}{\sqrt{2\pi}}x^{-1/2}e^{-x/2}1_{(0,+\infty)}(x)\]
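A one-line sanity check in R that this formula matches the built-in \(\chi^2_1\) density (dchisq with df=1):

 x=seq(.1,5,by=.1)
 max(abs(exp(-x/2)/(sqrt(2*pi)*sqrt(x))-dchisq(x,df=1)))   # essentially 0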

By independence, we can calculate the joint distribution of \((X^2,Y^2)\):

\[\frac{x^{-1/2}e^{-x/2}}{\sqrt{2}\Gamma(1/2)}1_{(0,+\infty)}(x)\times \frac{y^{-1/2}e^{-y/2}}{\sqrt{2}\Gamma(1/2)}1_{(0,+\infty)}(y)=\frac{(xy)^{-1/2}e^{-(x+y)/2}}{2(\Gamma(1/2))^2}1_{(0,+\infty)}(x)1_{(0,+\infty)}(y)\]

And because we can reconstruct \(X^2\) and \(Y^2\) from \(Z\) and \(W\), namely \(Y^2=\frac{ZW^2}{1+W^2}\) and \(X^2=\frac{Z}{1+W^2}\),

we can map the pdf onto the pdf of the new set of variables by calculating a change of volume. This is another application of determinants.

The determinant of the reciprocal transformation, also called the inverse Jacobian, is:

\[\left|\begin{array}{cc} \frac{W^2}{W^2+1}& \frac{1}{W^2+1}\\ \frac{2ZW}{(W^2+1)^2} & -\frac{2ZW}{(W^2+1)^2} \end{array}\right| = -\frac{2ZW}{(W^2+1)^2},\]

whose absolute value, for \(w>0\), is \(\frac{2ZW}{(W^2+1)^2}\).

From this we find, for \(w>0\) (note that \(X^2\) and \(Y^2\) only determine \(W\) up to sign, so what we obtain is the joint pdf of \((Z,|W|)\)):

\[\frac{\left(\frac{zw^2}{w^2+1}\cdot\frac{z}{w^2+1}\right)^{-1/2}e^{-\left(\frac{zw^2}{w^2+1}+\frac{z}{w^2+1}\right)/2}}{2(\Gamma(1/2))^2}\times \frac{2zw}{(w^2+1)^2}\,1_{(0,+\infty)}(z)=\frac{e^{-z/2}}{2}\times\frac{2}{\pi(w^2+1)}\,1_{(0,+\infty)}(z).\]

We easily recognize the product of a function of \(z\) and a function of \(w\), which indeed turn out to be the marginal pdfs.

The pdf of \(|W|\) is \(\frac{2}{\pi(1+w^2)}\) on \((0,+\infty)\). Since \(W=Y/X\) is symmetric around 0, the pdf of \(W\) itself is:

\[\frac{1}{\pi(1+w^2)}\]

Hey, bonus! We recognize here the pdf of the Cauchy distribution! It turns out that the Cauchy distribution is the same distribution as \(t_1\). Therefore we have shown that the tangent ratio of a random standard normal vector is Cauchy-distributed.
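You can check both claims in R (a sketch; the t_1 and Cauchy densities coincide, and the ratio of two independent standard normals should look Cauchy):

 x=seq(-5,5,by=.1)
 max(abs(dt(x,df=1)-dcauchy(x)))   # essentially 0: t_1 and Cauchy agree
 w=rnorm(1e5)/rnorm(1e5)           # the tangent ratio W = Y/X
 ks.test(w,"pcauchy")              # expect a large p-value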

The pdf of \(Z\) is:

\[\frac{e^{-z/2}}{2}1_{(0,+\infty)}(z)\]

which is exactly the \(\chi^2_2\) density.

Therefore, the square norm and the tangent ratio (or equivalently norm and angle) are independent.
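As a final sanity check, here is a small simulation (a sketch) of the independence claim: if \(Z\) and \(W\) are independent, the distribution of \(Z\) should be the same whether \(|W|\) is small or large:

 x=rnorm(1e5); y=rnorm(1e5)
 z=x^2+y^2     # square norm
 w=y/x         # tangent ratio
 ks.test(z[abs(w)<1],z[abs(w)>=1])   # expect a large p-value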

Useful links:

Wikipedia: Student’s t-distribution
Proofs
Transforming pdfs