Let us start today’s activity with a few words of explanation.
In statistics, we often want to represent data by a model.
Doing this allows us, to a certain extent, to say something that goes beyond the data itself.
To carry out this exercise, we select a model in our bag of models.
Most models have parameters, which characterize the shape of their probability distribution.
Hence, if I select the normal distribution as my model (as we will do here), I need 2 parameters, the mean and the variance of the normal distribution, to represent my data.
But we do not know these parameters.
Therefore we estimate these parameters, assuming that the model is correct (that the data was sampled from a normal distribution).
Having estimated the mean and the variance, we can calculate ‘the error bar’ of our estimator of the mean.
The standard method for doing this is called the t-confidence interval for the mean estimator of a normal distribution.
So, in the exercise we propose today, we will use R to make simulations and check, empirically, if the analytical theory is correct.
We will also play a trick. We will generate data from another model, and see how wrong our confidence interval is.
To check the quality of our confidence interval we are going to estimate its ‘coverage’.
The idea is simple: repeat the estimation of the mean many times, using synthetic (simulated) data, and see whether the confidence interval does indeed contain the true value of the parameter (this time known, because we simulated the data!).
We will collect the proportion of replicates in which the calculated interval indeed contains the mean. This is the empirical coverage of the t-confidence interval.
So, we will use an exact mathematical formula to build a 95% confidence interval for the mean of a normal distribution, and use numerical simulations in R to check if the probability that the confidence interval contains the mean is indeed 95%.
And now, on with calculations in R.
Is the rat in the box?
Or is the box empty?
The generic framework for carrying out a simulation study is
set Nbatch, the number of simulation batches
set n, the sample size of each simulation batch
for (i in 1:Nbatch){
draw a sample of size n from a distribution
do something using the sample
}
aggregate results across simulation batches
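As a minimal sketch, here is how that skeleton might look as actual R code, estimating the mean of a normal sample in each batch (the parameter values are purely illustrative):

```r
Nbatch = 100                 # number of simulation batches
n = 10                       # sample size of each batch
set.seed(1)                  # make this sketch reproducible
results = numeric(Nbatch)    # storage for one result per batch
for (i in 1:Nbatch){
  data = rnorm(n, mean = 3, sd = .5)  # draw a sample of size n
  results[i] = mean(data)             # do something using the sample
}
mean(results)  # aggregate results across simulation batches
```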
The t-interval is given by
\[\bar x \pm t_{\alpha/2,n-1} s/\sqrt{n}\]
where \(t_{\alpha/2,n-1}\) cuts off an area \(\alpha/2\) in the upper tail of the \(t\) distribution with \(n-1\) degrees of freedom, \(t_{n-1}\).
Please note that because the t-distribution is symmetric around 0, this implies that if \(T \sim t_{n-1}\):
\[\mathbb{P}( T \in [ -t_{\alpha/2,n-1}, t_{\alpha/2,n-1} ] ) = 1- \alpha\]
Because \(T=\frac{\bar{x} - \mu}{s/\sqrt{n}}\) is such that \(T \sim t_{n-1}\), we get:
\[ \mathbb{P} (\mu \in [\bar{x} - t_{\alpha/2,n-1} s / \sqrt{n}, \bar{x} + t_{\alpha/2,n-1} s / \sqrt{n} ]) = 1- \alpha \] This means that the probability that the rat (the true mean, \(\mu\)) is in the box (the confidence interval at nominal level \(1-\alpha\)) is \(1-\alpha\).
When the data are a sample from a normal population with mean \(\mu\) and standard deviation \(\sigma\), then \(100(1-\alpha)\)% of t-intervals constructed will contain the true but unknown population mean \(\mu\). The nominal, or named, level of the t-interval is \(100(1-\alpha)\)%.
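As a quick sanity check, the critical value \(t_{\alpha/2,n-1}\) and the \(1-\alpha\) probability statement above can be verified in R with qt and pt:

```r
alpha = .05; n = 10
tcrit = qt(1 - alpha/2, df = n - 1)  # cuts off area alpha/2 in the upper tail
tcrit                                # about 2.26 for n = 10
# probability that T lands between -tcrit and tcrit
pt(tcrit, df = n - 1) - pt(-tcrit, df = n - 1)  # equals 1 - alpha = 0.95
```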
The following simulation study draws 100 samples of size 10 from a normal population with mean 3, standard deviation .5. For each sample, a 95% confidence interval for the mean is calculated. The 100 intervals are plotted, and a count is made as to how many of the intervals contain the true mean. The proportion of intervals containing the true mean is the “empirical coverage” of the procedure.
plot(0,0,xlim=c(2,4),ylim=c(0,100),type="n",
ylab="simulation batch", xlab="when sampling from normal population")
Nbatch=100 #fix the number of simulation batches
n=10 #fix the sample size for each batch
truemean=3; truesd=.5; alpha=.05 #fix some parameters (call it truesd, not sd, to avoid masking R's sd() function)
cover=0
for (i in 1:Nbatch){
data=rnorm(n,truemean,truesd) #draw a sample
#now do something with the sample
tint=mean(data)+c(-1,1)*qt(1-alpha/2,n-1)*sd(data)/sqrt(n)
count1=ifelse(tint[1]<truemean&&truemean<tint[2],1,0)
lines(tint,c(i,i))
cover=cover+count1
}
lines(c(truemean,truemean),c(0,100))
empirical=cover/Nbatch
print(paste("empirical coverage is ",100*empirical,"%"))
## [1] "empirical coverage is 96 %"
The estimated, or empirical, coverage of the t-interval is 96%. If we ran more simulation batches, the empirical coverage would typically be closer to the nominal coverage. This is a property of the t-interval when sampling from a normal population.
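How close should we expect the empirical coverage to be? The count of hits is binomial, so the Monte Carlo standard error of the estimated coverage is roughly \(\sqrt{p(1-p)/Nbatch}\). A quick back-of-the-envelope calculation in R:

```r
# Monte Carlo standard error of an estimated proportion p over Nbatch batches
p = .95
sqrt(p * (1 - p) / 100)    # about 0.022: with 100 batches, anything from 93% to 97% is unremarkable
sqrt(p * (1 - p) / 10000)  # about 0.002: 10000 batches pins the coverage down much better
```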
What if we sample from a non-normal population? The following runs a similar simulation, but where samples are drawn from \(\chi^2_2\), the \(\chi^2\) distribution with 2 degrees of freedom. This is a very non-normal population, as can be seen from the following.
x=seq(0,5,length.out=200)
plot(x,dchisq(x,2),type="l",ylab="f(x)",main="chi-squared_2 density function")
Now carry out the same simulation study, but with samples drawn from the \(\chi^2_2\) distribution.
plot(0,0,xlim=c(-1,6),ylim=c(0,100),type="n",
ylab="simulation batch", xlab="sampling from chi-squared distribution")
truemean=2; n=10; Nbatch=100; alpha=.05; df=2
cover=0
for (i in 1:Nbatch){
data=rchisq(n,df=df) #draw a sample, using the df fixed above
tint=mean(data)+c(-1,1)*qt(1-alpha/2,n-1)*sd(data)/sqrt(n)
lines(tint,c(i,i))
cover=cover+ifelse(tint[1]<truemean&&truemean<tint[2],1,0)
}
lines(c(truemean,truemean),c(0,100))
empirical=cover/Nbatch
print(paste("empirical coverage is ",100*empirical,"%"))
## [1] "empirical coverage is 93 %"
With \(n\) moderately large, the empirical coverage will be fairly close to the nominal coverage. This is because of the central limit theorem, which says that the distribution of the sample mean is quite close to a normal distribution even if the underlying population is non-normal. For smallish values of \(n\), such as was used here, the empirical and nominal coverage can be quite different.
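To see the central limit theorem at work, one can rerun the chi-squared simulation for increasing sample sizes. Here is a sketch (the helper function coverage is ours, not part of the code above):

```r
# empirical coverage of the t-interval when sampling from chi-squared_2,
# as a function of the sample size n
coverage = function(n, Nbatch = 1000, alpha = .05, truemean = 2){
  hits = 0
  for (i in 1:Nbatch){
    data = rchisq(n, df = 2)
    tint = mean(data) + c(-1, 1) * qt(1 - alpha/2, n - 1) * sd(data) / sqrt(n)
    hits = hits + (tint[1] < truemean && truemean < tint[2])
  }
  hits / Nbatch
}
set.seed(1)
res = sapply(c(10, 30, 100), coverage)
res  # coverage typically creeps toward the nominal 0.95 as n grows
```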
You may be completely lost with the first cell of R code.
Therefore, I have dissected it for you here:
# first initialize the plot
plot(0,0,xlim=c(2,4),ylim=c(0,100),type="n",
ylab="simulation batch", xlab="when sampling from normal population")
# set the simulation parameters
Nbatch=100 #fix the number of simulation batches
n=10 #fix the sample size for each batch
truemean=3; truesd=.5; alpha=.05 #fix some parameters for this specific simulation
cover=0
for (i in 1:Nbatch){
# draw n sample from the normal distribution
data=rnorm(n,truemean,truesd) #draw a sample
# estimate the parameters
estmean = mean(data)
estsd = sd(data)
# calculate the t-confidence interval for the estimated mean:
tnx <- qt(1-alpha/2,n-1)
tint=estmean + c(-1,1)*tnx*estsd/sqrt(n)
# now count1 is 1 if this segment contains the true parameter (true mean) or else zero
count1=ifelse(tint[1]<truemean&&truemean<tint[2],1,0)
# increment the counter : cover counts hits
cover=cover+count1
# add graphics: draw the interval as a horizontal segment at height i
lines(tint,c(i,i))
}
# draw a vertical line where the true mean is
lines(c(truemean,truemean),c(0,100))
# all plots are now finished
# calculate the coverage probability
empirical=cover/Nbatch
# spit out the result:
print(paste("empirical coverage is ",100*empirical,"%"))
## [1] "empirical coverage is 94 %"
# punchline?
# is the estimated coverage the same as the announced coverage?
You may want to turn the simulation into a reusable function which exposes the parameters. This is done here. Hey: this is good practice, you should try to define your own functions.
bete <- function(n=10, alpha=.05, Nbatch=100,
truemean=3, truesd=.5, showplot=0){
# note: give concrete default values; a default like n=n is self-referential and errors if the argument is omitted
# first initialize the plot
if(showplot == 1){
plot(0,0,xlim=c(2,4),ylim=c(0,Nbatch),type="n",
ylab="simulation batch", xlab="when sampling from normal population")
}
#fix some parameters for this specific simulation
cover=0
for (i in 1:Nbatch){
# draw n sample from the normal distribution
data=rnorm(n,truemean,truesd) #draw a sample
# estimate the parameters
estmean = mean(data)
estsd = sd(data)
# calculate the t-confidence interval for the estimated mean:
tnx <- qt(1-alpha/2,n-1)
tint = estmean + c(-1,1)*tnx*estsd/sqrt(n)
# now count1 is 1 if this segment contains the true parameter (true mean) or else zero
count1=ifelse(tint[1]<truemean&&truemean<tint[2],1,0)
# increment the counter : cover counts hits
cover=cover+count1
# add graphics: draw the segment a bit below
if(showplot == 1){
lines(tint,c(i,i))
}
}
# draw a vertical line where the true mean is
if(showplot == 1){
lines(c(truemean,truemean),c(0,Nbatch))
}
# all plots are now finished
# calculate the coverage probability
empirical=cover/Nbatch
# spit out the result:
print(paste("empirical coverage is ",100*empirical,"%"))
# punchline?
# is the estimated coverage the same as the announced coverage?
theoretical = 1. - alpha
print(paste("theoretical coverage is ",100*theoretical,"%"))
return (empirical)
}
# we will call
Nbatch=100 #fix the number of simulation batches
n=10 #fix the sample size for each batch
truemean=3;
truesd=.5;
alpha=.05;
showplot=0;
bete(n=n,alpha=alpha,Nbatch=Nbatch,truemean=truemean,truesd=truesd,showplot=showplot)
## [1] "empirical coverage is 95 %"
## [1] "theoretical coverage is 95 %"
## [1] 0.95
OK
Now can you write a function that calls bete 1000 times?
What do you conclude?
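One possible sketch of an answer (we inline a silent variant of bete, here called bete_quiet — the name is ours — so that the snippet runs on its own):

```r
# silent variant of bete: returns the empirical coverage, no printing, no plot
bete_quiet = function(n = 10, alpha = .05, Nbatch = 100,
                      truemean = 3, truesd = .5){
  cover = 0
  for (i in 1:Nbatch){
    data = rnorm(n, truemean, truesd)
    tint = mean(data) + c(-1, 1) * qt(1 - alpha/2, n - 1) * sd(data) / sqrt(n)
    cover = cover + (tint[1] < truemean && truemean < tint[2])
  }
  cover / Nbatch
}
set.seed(1)
coverages = replicate(1000, bete_quiet())  # 1000 calls, each with 100 batches
mean(coverages)  # should be very close to the nominal 0.95
sd(coverages)    # should be close to sqrt(.95 * .05 / 100), about 0.022
```

The spread of the 1000 empirical coverages around 0.95 is exactly the binomial Monte Carlo error one expects from 100 batches per call.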
## For the curious
What is the uncertainty on the value of the speed of light?
Check the supplementary material:
We are going to provide some justification for the t-confidence interval.
Before we start, check, check, check:
Can you give a definition of ‘estimators’?
Can you give a definition of ‘statistics’?
Can you give a definition of the normal distribution?
Can you give a definition of the t-distribution?
Can you give a definition of the chi-square distribution?
Let us review the result obtained by William Gosset (Student).
If \(X_i\) are random variables that are i.i.d. normally distributed as \(N(\mu,\sigma^2)\) and if \(\bar{X},S^2\) are the empirical mean and empirical variance estimators of the true mean \(\mu\) and the true variance \(\sigma^2\), then the random variable
\[Z_n = \frac{\bar{X} - \mu}{ S / \sqrt{n} }\]
follows a Student t distribution with \(n-1\) degrees of freedom:
\[ Z_n \sim t_{n-1}\]
and the random variable
\[W_n = (n-1) S^2 / \sigma^2\]
follows a \(\chi^2\) distribution with \(n-1\) degrees of freedom
\[ W_n \sim \chi^2_{n-1}\]
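Before proving this, we can check Gosset's result numerically; a sketch:

```r
# simulate Z_n and W_n many times and compare empirical quantiles
# with the t_{n-1} and chi-squared_{n-1} quantiles
set.seed(1)
n = 5; mu = 0; sigma = 1; Nrep = 20000
Zn = replicate(Nrep, { x = rnorm(n, mu, sigma)
                       (mean(x) - mu) / (sd(x) / sqrt(n)) })
Wn = replicate(Nrep, { x = rnorm(n, mu, sigma)
                       (n - 1) * var(x) / sigma^2 })
quantile(Zn, .975); qt(.975, n - 1)       # should agree closely
quantile(Wn, .975); qchisq(.975, n - 1)   # should agree closely
```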
The reasons why this is true are as follows.
First, Student’s t-distribution with \(k\) degrees of freedom can be defined from the normal and chi-square distributions by:
\[T = \frac{Z}{\sqrt{V/k}}\] where \(Z \sim N(0,1)\) and \(V \sim \chi^2_k\) are independent normal and chi-square distributions.
But the chi-square distribution itself is just the square norm of a random vector with independent normal components.
Therefore, all there is to the result of Gosset/Fisher is to demonstrate that the empirical mean \(\bar{X}\) and the empirical variance \(S^2\) are independent, and that \((n-1)S^2/\sigma^2\) follows a \(\chi^2_{n-1}\) distribution.
How can we do this?
Let us consider a random vector with independent standard normally distributed coordinates \((X,Y)\).
Let us show that the square norm of the vector is completely independent of the angle that the vector makes with the x-axis. This is intuitively obvious: we can gain no information on the angle of a vector by observing its norm and vice-versa!
For those not convinced by this ‘physical argument’, let us apply some maths.
Now let us consider, in essence, the change of coordinates (cartesian to polar):
\[(X,Y) \rightarrow (Z,W)\]
Remember that to show that two random variables \(Z\) and \(W\) are independent, we must show that the pdf of the joint is the product of pdfs of the marginals: \[f_{(Z,W)}(z,w)=f_Z(z)f_W(w)\] Let \[Z=X^2+Y^2\] and \[W=\frac{Y}{X}\] (square norm and tangent).
One can easily obtain the pdf of a function \(h(X)\) of a random variable \(X\) by an application of the jacobian formula (see links).
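For completeness, the formula in question: if \(Y = h(X)\) with \(h\) monotone and differentiable, then \(f_Y(y) = f_X(h^{-1}(y))\left|\frac{d h^{-1}}{dy}(y)\right|\). For \(h(x) = x^2\), which is not monotone, we sum over the two branches \(\pm\sqrt{y}\):
\[f_{X^2}(y) = \frac{1}{2\sqrt{y}}\left( f_X(\sqrt{y}) + f_X(-\sqrt{y}) \right), \quad y>0.\]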
Therefore, applying this to \(h(X)=X^2\) with \(X \sim N(0,1)\), we find the pdf of \(X^2\) to be:
\[\frac{1}{\sqrt{2\pi}}x^{-1/2}e^{-x/2}1_{(0,+\infty)}(x)\]
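This is precisely the \(\chi^2_1\) density, which can be checked against R's dchisq:

```r
# the derived pdf of X^2 versus R's built-in chi-squared density with 1 df
x = c(.1, .5, 1, 2, 5)
formula_pdf = 1 / sqrt(2 * pi) * x^(-1/2) * exp(-x / 2)
max(abs(formula_pdf - dchisq(x, df = 1)))  # essentially zero
```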
By independence, we can calculate the joint distribution of \((X^2,Y^2)\):
\[\frac{x^{-1/2}e^{-x/2}}{\sqrt{2}\Gamma(1/2)}1_{(0,+\infty)}(x)\times \frac{y^{-1/2}e^{-y/2}}{\sqrt{2}\Gamma(1/2)}1_{(0,+\infty)}(x)=\frac{(xy)^{-1/2}e^{-(x+y)/2}}{2(\Gamma(1/2))^2}1_{(0,+\infty)}(x)1_{(0,+\infty)}(y)\]
And because we can reconstruct \(X^2\) and \(Y^2\) from \(Z\) and \(W\) (explicitly, \(X^2 = Z/(W^2+1)\) and \(Y^2 = ZW^2/(W^2+1)\)):
we can map the pdf onto the pdf of the new set of variables by calculating a change of volume. This is another application of determinants.
The absolute value of the determinant of the reciprocal transformation, also called the inverse Jacobian, is:
\[\left|\begin{array}{cc} \frac{1}{W^2+1} & -\frac{2ZW}{(W^2+1)^2}\\ \frac{W^2}{W^2+1} & \frac{2ZW}{(W^2+1)^2} \end{array}\right| =\frac{2ZW}{(W^2+1)^2} \]
From this we find that
\[\frac{\left(\frac{ZW^2}{W^2+1}\cdot\frac{Z}{W^2+1}\right)^{-1/2}e^{-\left(\frac{ZW^2}{W^2+1}+\frac{Z}{W^2+1}\right)/2}}{2(\Gamma(1/2))^2}\times \frac{2ZW}{(W^2+1)^2}1_{(0,+\infty)}(z)=\frac{1}{w^2+1}\frac{e^{-z/2}}{\pi}1_{(0,+\infty)}(z).\]
We easily recognize the product of a function of \(w\) and a function of \(z\) which indeed turn out to be the marginal pdfs.
The pdf of \(W\) is:
\[\frac{1}{\pi(1+w^2)}\]
Hey, bonus! We recognize here the pdf of the Cauchy distribution! It turns out that the Cauchy distribution is the same distribution as \(t_1\). Therefore we have shown that the ratio \(Y/X\) (the tangent of the angle) of a random standard normal vector is Cauchy-distributed.
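A quick empirical confirmation in R, comparing quantiles (the Cauchy distribution has no mean, so avoid mean-based checks):

```r
set.seed(1)
x = rnorm(20000); y = rnorm(20000)
w = y / x                       # ratio of two independent standard normals
quantile(w, .75)                # empirical 75% quantile
qcauchy(.75); qt(.75, df = 1)   # theoretically both equal 1: Cauchy and t_1 coincide
```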
The pdf of \(Z\) is:
\[\frac{1}{2}e^{-z/2}1_{(0,+\infty)}(z)\]
This is the \(\chi^2_2\) density (an exponential distribution with rate \(1/2\)), as expected since \(Z\) is the sum of two squared standard normals.
Therefore, the square norm and the tangent ratio (or equivalently norm and angle) are independent.
Useful links:
Wikipedia: Student’s t-distribution
Proofs
Transforming pdfs