Review resampling methods (ISLR, Chapter 5)
Manipulate probabilities and distributions
Learn how to compare models via cross-validation
Learn how to compare models by statistical testing
Reading Chapters 3, 5 and 8 of ISLR is recommended.
(q1: HELP FOR Chap.5, PROBLEM 1)
One says that a function \(B(x,y)\) is bilinear when it behaves ‘like a product’.
\[(2 X + 3 Y) * (5Z + 9T) = (2X)*(5Z) + (2X)*(9T) + (3Y)*(5Z) + (3Y)*(9T)\]
You can simplify this expression further by using the property:
\[(2X)(5Z) = (2*5)XZ = 10 XZ\]
Mathematics provides many examples of bilinear forms: the scalar product of vectors, Riemannian metrics and exterior products are a few of them.
Hence, for a bilinear form, one has (please complete): \[B(9X+3Y,2Z+3T) = 18 B(X,Z) + \ldots\]
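For reference, the general expansion of a bilinear form over two linear combinations (which you can use to fill in the blanks) reads: \[B(aX+bY,cZ+dT) = ac\,B(X,Z) + ad\,B(X,T) + bc\,B(Y,Z) + bd\,B(Y,T)\]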
One can easily show that the covariance \(Cov(X,Y)\) of two random variables is a bilinear form.
Recall that the covariance \(Cov(X,Y)\) of two random variables \(X\) and \(Y\) is a measure of how much ‘the expected value of the product differs from the product of the expected values’:
\[ Cov(X,Y)= \mathbb{E}(XY) - \mathbb{E}(X) \mathbb{E}(Y)\]
Also recall that two random variables are independent iff their joint distribution is the product of their marginal distributions:
\[F_{X,Y}(x,y)=F_X(x)F_Y(y)\]
The symbol \(F_X(x)\) represents the cumulative distribution function of the random variable \(X\) which, by definition, is:
\[F_X(x) = \mathbb{P}(X \leq x)\]
Please note that \(F_X(x)\) is a well-defined non-random function of the scalar real variable \(x\) when \(X\) is a scalar real-valued random variable.
If \(X\) and \(Y\) are independent random variables, draws of \(X\) reveal no information about draws of \(Y\).
Exercises (not for assignment 8 but for your own culture): As a consequence of the bilinearity of covariance, we can derive interesting properties of the variance.
Recall that the variance \(Var(X)\) of a random variable is:
\[ Var(X)= Cov(X,X)\]
Use this definition to show that
\[Var(X) = \mathbb{E}(X^2) - [\mathbb{E}(X)]^2\]
\[Var(X) = \mathbb{E}[(X -\mathbb{E}(X))^2]\]
and that the variance behaves ‘like a square’, i.e.
\[ Var(aX+bY) = a^2 Var(X) + b^2 Var(Y) + 2ab Cov(X,Y)\]
Remarks: Let us apply this formula to an interesting problem. Suppose that you have two quantities \(X\) and \(Y\) that are uncertain. Suppose also that these quantities are not independent but have some covariance. We can build a path of random variables \(Z(t)\) that interpolates between \(X\) (say, the stock market value in Paris) and \(Y\) (say, the stock market value in Montreal):
\[Z(t)= t X + (1-t)Y\]
Now the question is: what is the variance of \(Z(t)\)? This can easily be calculated by exploiting the fact that the variance behaves like a square:
\[Var(Z(t))= \ldots\]
Please review how to draw samples from known distributions in R
Let us manipulate normally distributed random variables in R
We note \(N(\mu,V)\) the normal distribution with mean \(\mu\) and variance \(V\). The variance can also be written \[V=\sigma^2\] where \(\sigma\) is the standard deviation.
In R, we can use the ‘rnorm’ function to draw samples from a normal distribution. To draw \(ns=23\) samples from a normal distribution \(N(4,V)\) with \(V=100\) (note that rnorm expects the standard deviation \(\sigma=\sqrt{V}=10\), not the variance):
set.seed(23)
ns=23
x=rnorm(ns,mean=4,sd=10)
Suppose that a continuous response \(y\) depends on a covariate \(x\) in the following way:
\[y=3*x - 2*x^2 + e\] where \(e\) is a superimposed noise that affects \(y\).
Suppose that the covariate \(x\) is normally distributed:
\[X \sim N(0,100)\]
and that the noise \(e\) is also normally distributed:
\[E \sim N(0,20)\]
Write your R code to simulate \(50\) samples of the response \(y\) (see solutions-2a)
set.seed(23)
# your code here
Can you represent what the following code does by formulas? (see solutions-2b)
#set.seed(23)
#sd(rnorm(100,sd=5))
nsample=50
meanx=3
sdx=9
meane=0
sde=5 # try 100 then 1000 then 10000
x=rnorm(nsample,mean=meanx,sd=sdx)
e=rnorm(nsample,mean=meane,sd=sde)
y=-2. + 4.1*x-3.5*x^3 + e
#plot(x,y)
(q2: HELP FOR Chap.5, PROBLEM 2)
For problem 2, we will have to remember our discussions of the combinatorics of the bootstrap.
Let us mention two of the most brilliant early probabilists:
+Sampling probability: Quiz-1
Deduce from all previous steps that the probability of not ‘seeing’ a given item \(j\) in a bootstrap sample of size \(p\) drawn from a data set of size \(n\) is: \[P = (1-1/n)^p\]
Now let us apply the formula above
+Sampling probability: Quiz-2
What is the formula that calculates the probability that item \(j\) IS in a bootstrap sample of size \(p\) taken from a dataset \(d=(1,2,3,\ldots,n)\)?
n=c(1:100,100*c(1:10),500*(1:100))
# what are we calculating here?
# do you understand this formula?
Pn=1-(1-1/n)^n
# n is not a single number but a vector of sample sizes
plot(n,Pn,xlab="n",ylab="P_n",main="P(j'th observation IS IN bootstrap sample)",log="x",ylim=c(0,1))
abline(1-1/exp(1),0) # horizontal line at the asymptote 1-1/e (approximately 0.632)
How can we explain this phenomenon?
All we have done here is to repeat exactly the steps that I used in my lectures on the combinatorics of the bootstrap, see the following links to my lectures:
The real magic is due to Leonhard Euler!
The first printed occurrence of the letter e to identify the base of the natural logarithm is found in a paper that Euler wrote on the explosive forces in cannons. Euler was interested in everything and wrote on everything, from vaccines to artillery.
E853 has achieved lasting fame in mathematical footnote history by being the first printed source in which e is used to represent the base of the natural logarithm.
He was your age (21 years old) when he wrote this paper (‘Meditation on experiments made recently on the firing of cannon’) in 1727. He decided to describe seven real experiments performed between August 21 and September 2, 1727. EULER WAS A DATA SCIENTIST!
Click on this beautiful print by Walther Hermann Ryff (1547) to discover Euler as a data scientist:
Leonhard Euler was not the first to discover the number e, but he contributed more than his predecessors to a deep understanding of this number and calculated its value to 23 decimal places. How would you do this yourselves? (see below or solutions)
The first known use of this constant, then represented by the letter b, was in a correspondence from Gottfried Leibniz to Christiaan Huygens in 1690 and 1691.
Leonhard Euler introduced ‘formally’ the letter e as the base for natural logarithms in a letter to Christian Goldbach on 25 November 1731.
Method 1: use \[\lim_{n \rightarrow \infty} (1+\frac{x}{n})^{n} = e^x \]
This method was discovered by Jacob Bernoulli in 1683, when he sought the value of \[\lim_{n \rightarrow \infty} (1+\frac{1}{n})^{n}\]
Method 2: use the series \[e = \sum_{n=0}^\infty \frac{1}{n!}\]
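If you would like to approximate \(e\) yourselves in R, here is a minimal sketch of both methods (the grid of values of \(n\) and the truncation of the series are arbitrary choices of mine):
# Method 1: Bernoulli's limit (1+1/n)^n evaluated for increasingly large n
n=10^(0:7)
(1+1/n)^n
# Method 2: partial sum of the series 1/0! + 1/1! + 1/2! + ...
sum(1/factorial(0:20))
# compare with R's built-in value of e
exp(1)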
(q3: HELP FOR Chap.5, PROBLEM 3)
Let us briefly review and compare these methods.
For standard ‘linear models’, explicitly refitting the model \(n\) times to compute LOOCV is entirely unnecessary.
Here is why. The LOOCV cross-validation error can be calculated by the following formula:
\[ CV_{n} = \frac{1}{n} \sum_{i=1}^{n} \left(\frac{y_i- \hat{y}_i}{1-h_i}\right)^2\]
where \(e_i=y_i-\hat{y}_i\) is the prediction error of the linear model for record \(i\) and \(h_i\) is the ‘leverage statistic’ of record \(i\).
What is the influence of an observation on the regression? It is best to make an analogy with levers.
Recall that the great Greek mathematician Archimedes thought about using levers to lift the world (nothing less). Here is a nice allegoric painting by Giulio Parigi (1599) of this virtuoso act:
Each observation is similarly trying to lift the least-squares solution. The statistician wishes all observations to have similar leverage: ideally, each of the \(n\) observations should contribute \(1/n\). However, even in a simple linear regression, this is not necessarily the case.
The leverage statistic of observation \(i\) (a property of its predictor value) is calculated as:
\[h_i = \frac{1}{n} + \frac{(x_i-\bar{x})^2}{\sum_{j=1}^n (x_j-\bar{x})^2}\]
Note that the leverage can be calculated in the total absence of information regarding the response. It is a property of the distribution of the sampled covariate values.
Note also that the leverage lies between \(1/n\) and \(1\). As said in ISLR, it ‘reflects the amount that an observation influences its own fit’.
The moral of this equation is that observations whose predictor lies far from the mean predictor value \(\bar{x}\) weigh more heavily on the solution. This is easy to understand from a mechanical standpoint: the same deviation will make the regression line rotate more if we are far from the center of rotation. See Chapter 3 and Figure 3.13 of ISLR for a visual example.
Note that, for a single predictor, the total leverage is:
\[\sum_{i=1}^n h_i = 1 + 1 = 2\]
If we sum the leverages over all observations in a model with \(p\) predictors, the total becomes \(1+p\), so the average leverage is \((1+p)/n\). Hence, we can use this average as an order of magnitude to detect high-leverage points: if an observation has a leverage far greater than \((1+p)/n\), it should be considered with caution because it potentially has too great an influence on the solution.
Hence, while LOOCV requires \(n\) model fits in general, for linear models, a single model fit (all data!) suffices.
I invite those of you who are not convinced by the analogy made here with mechanical levers to consult the work of Koenig and Huygens on the rotation of solid bodies.
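If you would like to check this equivalence numerically, here is a minimal sketch (the simulated data and the object names are mine) that compares the one-fit shortcut, based on the leverages returned by hatvalues(), with an explicit leave-one-out loop:
# simulate a small illustrative data set
set.seed(1)
x=rnorm(30)
y=1+2*x+rnorm(30)
fit=lm(y~x)
# shortcut: a single fit, residuals adjusted by the leverages h_i
h=hatvalues(fit)
cv_shortcut=mean(((y-fitted(fit))/(1-h))^2)
# brute force: n separate fits, each leaving out one observation
cv_loop=mean(sapply(1:30,function(i){
  fit_i=lm(y~x,subset=-i)
  (y[i]-predict(fit_i,newdata=data.frame(x=x[i])))^2
}))
c(cv_shortcut,cv_loop) # the two numbers agree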
+Summary facts: LOOCV
+Summary facts: k-fold-CV
k-fold-CV uses a higher fraction of the data for training than the Validation-Set approach (PRO)
k-fold-CV error estimation is more biased than LOOCV (CON)
+Quiz
Suggested pipeline:
I encourage you to try to fill in the following cells before jumping to the solution section.
Which R libraries can we use to fit linear models (lm)?
# load the library here
What is the name of the function that calculates cross-validation for linear models?
# name of the function
Run here an example of cross-validation provided in the R documentation of your chosen linear model library:
# paste code from R documentation and run cross-validation here
Now you can repeat this for generalized linear models:
Which R libraries can we use to fit generalized linear models (glm)?
# load the library here
# name of cv function
# run an example of cv by pasting examples from the R documentation here.
Suppose you want to plot the value of a function \(f(x)\) on a regularly spaced grid of values of \(x\).
How would we do this in R?
We can generate the first 50 integers like this:
n=c(1:50)
Say I want to plot \(f_n = \sqrt{n}\) for the first 50 integers:
n=c(1:50)
f_n=n**(1/2)
plot(n,f_n,xlab="n",ylab="f_n",main="Plot of the square root function")
It is often useful to add vertical or horizontal lines to a plot. For example, asymptotes of hyperbolas or other curves can be plotted by making use of the ‘abline’ function. Here is an example of how to call abline to complement a plot.
plot(n,f_n,xlab="n",ylab="f_n",main="Plot of the square root function")
abline(v=5)
abline(h=6)
100*c(1:10)
## [1] 100 200 300 400 500 600 700 800 900 1000
We can combine several grids with different spacing.
# this will create a grid of 10 values: 1 2 3 ... 10
# followed by a grid of 10 values 10 20 30 ... 100
# you could continue
n=c(1:10,10*c(1:10))
f_n=n**(1/2)
plot(n,f_n,xlab="n",ylab="f_n",main="Plot of the square root function")
It is often useful to plot a function in transformed coordinates. For example, we may want to use a logarithmic scale for the independent quantity plotted on the \(x\)-axis. This can be achieved by:
plot(n,f_n,xlab="n",ylab="f_n",main="Plot of the square root function with a logarithmic x-axis",log="x")
When you are not sure about the meaning of arguments of a function, just interrogate the R documentation in the console of Rstudio:
?plot
## starting httpd help server ... done
In this case, you will need to dig down the documentation and follow the plot.default link
+Statistical significance of regression models can be measured by ‘p-values’.
+The function that fits the model returns one p-value for each of the estimated parameters (e.g. the coefficients of a polynomial regression model).
+These ‘p-values’ are usually found in the summary output of the fitted model.
+We can use the framework of hypothesis testing to calculate statistical significance.
I briefly recall how statistical hypothesis testing works:
You formulate a null-hypothesis, noted \(H_0\) which is a quantitative assertion about the true parameters. A common example of null hypothesis is:
\[H_0: a_k = 0\] where \(a_k\) is the k-th parameter of your model.
This null hypothesis states that the true value of the \(k\)-th parameter of the model is zero. For example, suppose you have a quadratic model, you may want to test if the quadratic term is really necessary. You could do this by testing the hypothesis \(a_2=0\).
You formulate an alternative hypothesis, noted \(H_1\) or \(H_a\), for example:
\[ H_1: a_k \neq 0 \] To decide between these hypotheses, one calculates the ‘p-value’ \(p\) of our test of hypothesis \(H_0\).
This is a measure of the ‘trust’ that we have in the validity of the assumption \(H_0\). Typically, it is obtained by estimating the probability of observing a test statistic at least as extreme as the observed statistic \(T_{obs}\), where the calculation makes use of \(H_0\) (i.e. is done under the assumption that the null-hypothesis holds true).
To summarize, a small p-value of a hypothesis test means that the null-hypothesis is unlikely considering the observed data.
Hence, in the quadratic regression example evoked above, if the p-value of \(H_0: a_2 = 0\) is small, we would reject the hypothesis that the quadratic term is null and thus conclude that we need this term and that a linear model would not fit the data adequately.
Let us apply these concepts to the hardwood data. This data set was used in my lectures to introduce polynomial regression.
I recall that the hardwood data is given by a series of pairs \((x_i,y_i), i=1, 2, \ldots , n\), where \(x\) is the proportion of hardwood in paper pulp, and \(y\) is a measure of tensile strength. A plot of \(x\) vs \(y\) indicates that the relationship is non-linear.
A slight mistake exists in the original lecture. The data has nothing to do with lumber. We are talking about paper, not planks of wood.
It is important to note the difference between a simple-minded exercise in statistics and a more realistic study of the mechanics of paper and composite material. Those of you with an inclination for mechanics can look at these (tensile) papers:
It is apparent that hardwood content is by no means the sole determinant of strength. The strength of Kraft paper depends on fiber characteristics (such as length and strength distribution).
Having said that, we can now go on and explore the statistical significance of parameters estimated by various models.
# I lump the data preparation into a function
hardwood <- function(){
x = c(1,1.5,2,3,4,4.5,5,5.5,6,6.5,7,8,9,10,11,12,13,14,15)
y = c(6.3,11.1,20,24,26.1,30,33.8,34.0,38.1,39.9,42,46.1,53.1,52,52.5,48,42.8,27.8,21.9)
n=length(y)
#plot(x,y,xlab="hardwood %",ylab="tensile strength")
mydata = data.frame(x,y)
return(mydata)
}
# I call the function to collect the data
d = hardwood()
mydata <- d
# I make a scatter plot and write a title
# note that the original lecture contained an error. We are talking about paper not lumber. see below
#
plot(d$x,d$y,xlab="hardwood %",ylab="tensile strength",main="Tensile strengh of Paper (y) \n as a function of hardwood content in pulp (x)")
x2 <- (d$x)^2 # squared covariate, in case you want to build the quadratic predictor by hand (not used below)
yi=lm(y~x,data=d)
syi=summary(yi)
print(syi)
##
## Call:
## lm(formula = y ~ x, data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25.986 -3.749 2.938 7.675 15.840
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 21.3213 5.4302 3.926 0.00109 **
## x 1.7710 0.6478 2.734 0.01414 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.82 on 17 degrees of freedom
## Multiple R-squared: 0.3054, Adjusted R-squared: 0.2645
## F-statistic: 7.474 on 1 and 17 DF, p-value: 0.01414
# can you see what we are doing here?
# we extract the p-value from the summary
# to do this, we look at the second row (parameter or 'effect' x)
# and we take the fourth column: Pr(>|t|)
# this prints the p-value of the test H_0: a_1=0
# I get a value of 0.01414
# as this is smaller than 0.05 (conventional level of significance)
# but larger than 0.01 (next smaller level of significance)
# this test earns 1 star of significance
# indeed this is what you see in the summary: one star printed beside this value
coefficients(syi)[2,4]
## [1] 0.01414016
You can also proceed differently if you do not want to build the monomial predictors by hand. For example, you can fit a cubic polynomial like this:
lm.fit = lm(y~poly(x,degree=3),data = d)
and retrieve the p-values (separate tests for each coefficient) like this:
pvalues=coefficients(summary(lm.fit))[,]
pvalues
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.18421 0.5930501 57.641352 5.041090e-19
## poly(x, degree = 3)1 32.30213 2.5850455 12.495767 2.480980e-09
## poly(x, degree = 3)2 -45.39625 2.5850455 -17.561103 2.058028e-11
## poly(x, degree = 3)3 -14.57400 2.5850455 -5.637811 4.721725e-05
Note how fitting a polynomial of degree 2 to the hardwood data considerably reduces all p-values compared with the simple linear fit shown above. This is an indication that the second-degree term is genuinely needed to describe the curvature of the data.
lm.fit2 = lm(y~poly(x,degree=2),data = d)
pvalues2=coefficients(summary(lm.fit2))[,]
pvalues2
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.18421 1.014108 33.70864 2.730951e-16
## poly(x, degree = 2)1 32.30213 4.420395 7.30752 1.758404e-06
## poly(x, degree = 2)2 -45.39625 4.420395 -10.26973 1.894349e-08
lm.fit1 = lm(y~poly(x,degree=1),data = d)
pvalues1=coefficients(summary(lm.fit1))[,]
pvalues1
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.18421 2.71075 12.610609 4.691972e-10
## poly(x, degree = 1) 32.30213 11.81589 2.733788 1.414016e-02
To summarize, we can determine for each model which of the fitted coefficients ‘are significant’ (which is a concise and slightly incorrect way to say that they are statistically significantly different from zero, i.e. the null hypothesis of zero-ness is rejected at a certain level of significance for these coefficients).
This should help you for q5 of Assignment 8
Suppose we have \(K\) estimates (obtained for example by bootstrap sampling) of the probability that an item in a data set belongs to a given class, given the measured values of the predictors \(X\) for this item.
There are two ways to use these numbers \(p_1, p_2, \ldots, p_K\) to make a decision on the question ‘does this item belong to the class?’ (see the illustrative sketch below). This should help you for q6 of Assignment 8.
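As a hint (this is my own illustrative sketch with hypothetical numbers, not the assignment data), the two natural options are to take a majority vote over the individual classifications, or to average the estimated probabilities first and then apply the threshold:
# hypothetical bootstrap estimates of the class probability for a single item
p=c(0.1,0.15,0.2,0.2,0.55,0.6,0.6,0.65,0.7,0.75)
# option 1: classify with each estimate, then take the majority vote
mean(p>0.5)>0.5 # TRUE: 6 out of 10 classifications say 'yes'
# option 2: average the probabilities, then apply the 0.5 threshold
mean(p)>0.5 # FALSE: the average probability is 0.45
Note that the two rules can disagree, which is exactly what makes the question interesting.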
To create a validation set, we need to roughly divide data into two equal size subsets
We can do this by 1. determining the number of records in the data, 2. picking \(n/2\) random records without replacement and 3. using this set of indices to define the testing and training data sets (a runnable example follows the template below).
# determine the number of rows of my dataframe
#n=nrow(mydata)
# pick up n/2 random integers within the range 1 to n
#indices=sample(1:n,n/2,replace=F)
# create two complementary subsets of records
# set1 = d[indices,]
# set2 = d[-indices,]
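For example, with the hardwood data frame d built earlier (a minimal runnable sketch; the names set1 and set2 are arbitrary):
set.seed(1)
n=nrow(d)
# pick floor(n/2) random row indices without replacement
indices=sample(1:n,floor(n/2),replace=F)
set1=d[indices,] # e.g. the training set
set2=d[-indices,] # the complementary records, e.g. the testing set
dim(set1)
dim(set2)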
Recall how to fit a regression tree (CART) model, make a prediction and evaluate the mean-square-error (MSE) of the model
library(tree)
# assume that you want to use a dataset named d
# this could either be a testing set or a training set
# this calculates a fitted CART tree in d.tree
#d.tree=tree(MyResponseColumnName~.,data=d)
# this plots the CART
#plot(d.tree)
# and the labels of the nodes
#text(d.tree)
# this makes a prediction on an other data set (typically a testing dataset)
#mypred=predict(d.tree,newdata=otherdata)
#myMSE=mean((otherdata$MyResponseColumnName-mypred)^2)
Once you have fitted the full tree, you know it is going to overfit the data.
We need to prune it to the right level of complexity.
We can do this by 1. using cross-validation to predict the optimal number of leaves that the pruned model should have and 2. pruning the full tree to this level of complexity.
Would you know how to do this? (A complete worked sketch follows the template below.)
# use cv.tree()
# outcv =
# determine the best size
# bestsize=outcv$size[outcv$dev==min(outcv$dev)]
# build the pruned tree to this size
# ... prune.tree(....)
# now we can calculate the MSE (mean square error) of prediction errors based on the pruned tree model
#prunepred=predict(tree.pruned,newdata=cstest)
#prunedMSE=mean((cstest$Sales-prunepred)^2)
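Here is a minimal worked sketch of the whole pipeline, assuming the Carseats data from the ISLR package (which is what the cstest and Sales names in the template above suggest; adapt the names to your own data):
library(tree)
library(ISLR)
set.seed(1)
# split Carseats into a training half and a testing half
n=nrow(Carseats)
idx=sample(1:n,n/2)
cstrain=Carseats[idx,]
cstest=Carseats[-idx,]
# fit the full regression tree on the training set
train.tree=tree(Sales~.,data=cstrain)
# cross-validation to choose the tree size (number of leaves)
outcv=cv.tree(train.tree)
bestsize=outcv$size[which.min(outcv$dev)]
# prune the full tree to the selected size
tree.pruned=prune.tree(train.tree,best=bestsize)
# test-set MSE of the pruned tree
prunepred=predict(tree.pruned,newdata=cstest)
prunedMSE=mean((cstest$Sales-prunepred)^2)
prunedMSE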
Applying bilinearity, one obtains the following polynomial of degree 2 in \(t\).
\[ Var(Z(t))= \sigma_X^2 t^2 + 2t(1-t)\sigma_{XY} + (1-t)^2 \sigma_Y^2 \]
Now let us find the value of \(t\) that makes the variance minimal among all values of \(t \in [0,1]\).
To do this, simply remember your High School courses on the parabola or calculate the derivative with respect to \(t\) and find the value of \(t\) that zeroes the derivative.
Regroup the terms by the degrees of \(t\) monomials:
\[ Var(Z(t))= t^2 (a + b - 2c) + t (2c -2b ) + b = g(t)\] where I have simplified the notation (find yourselves what a, b and c represent).
Now, I can find the extremum (apex of the parabola) by computing the derivative and setting the value of \(t\) to the value that zeroes the slope:
\[ g'(t) = 2(a+b-2c) t + 2(c-b) = 0\]
\[ t_{extremum} = \frac{b-c}{a+b-2c}\]
Is this extremum a maximum or a minimum? What is the orientation of this parabola? To check this, look at the sign of the coefficient of \(t^2\). If this coefficient is positive, the parabola opens upward (tends to plus infinity); otherwise it opens downward (tends to minus infinity).
Show that this parabola always opens upward. Why is this the case regardless of the values of \(a\), \(b\) and \(c\)?
Hint: Can a variance be negative? What is the value of \(Var(X-Y)\)? Expand this by making use of bilinearity (again!).
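A quick numerical check (my own sketch, with arbitrary values of the variances and the covariance) displays the upward-opening parabola and the location of its minimum:
a=4 # a = Var(X)
b=9 # b = Var(Y)
cxy=2.5 # c = Cov(X,Y)
t=seq(0,1,by=0.01)
g=t^2*(a+b-2*cxy)+t*(2*cxy-2*b)+b
plot(t,g,type="l",xlab="t",ylab="Var(Z(t))")
abline(v=(b-cxy)/(a+b-2*cxy)) # the minimizing value of t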
Solution-2a
#set.seed(23)
nsample=50
Varx=100
meanx=0
sdx=sqrt(Varx)
Vare=20
meane=0
sde=sqrt(Vare)
x=rnorm(nsample,mean=meanx,sd=sdx)
e=rnorm(nsample,mean=meane,sd=sde)
y=3*x-2*x^2 + e
plot(x,y)
Solution-2b
\[ X \sim N(3,81)\]
\[ E \sim N(0,25) \]
\[ y = -2. + 4.1 *x -3.5 x^3 +e \]
A generalized linear model applies the linear regression not to the response itself but to a transformation (the ‘link’ function \(g\)) of its expected value. This means that in linear regression, we use:
\[y = Ax + e\]
but in generalized linear models, we relate the transformed mean response \(\eta = g(\mathbb{E}(y))\) to the predictors through
\[ \eta = Ax\]
The model is characterized by a ‘family’ that specifies the distribution of the response together with its link function (gaussian, binomial with a logit link, Poisson, and so on). The family is one of the arguments specified when calling the glm function. By default, glm fits an ordinary linear model (i.e. the gaussian family with the identity link, \(\eta=y\)).
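As a minimal illustration (the data below are simulated by me, not taken from the assignment), the family argument is what distinguishes an ordinary linear fit from, say, a logistic regression:
set.seed(1)
x=rnorm(100)
ylin=1+2*x+rnorm(100) # continuous response
ybin=rbinom(100,1,plogis(1+2*x)) # binary response
fit.lin=glm(ylin~x) # default family=gaussian, equivalent to lm
fit.log=glm(ybin~x,family=binomial) # logistic regression (logit link)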
The stats library gives access to a function called glm which allows you to fit generalized linear model (glm).
library(stats)
How do I fit a polynomial model?
# create your dataset
#mydata =data.frame(x,y)
# choose the degree of the polynomial model you want to fit:
#degree=4
# use glm to calculate the best coefficients (fitted model)
# and store all results of model fit in an object here called glm.fitdeg4
#glm.fitdeg4 = glm(y~poly(x,degree),data = mydata)
# calculate the LOOCV error of the model and store it (cv.glm comes from the 'boot' package: library(boot))
#loocv4=cv.glm(mydata,glm.fitdeg4)$delta[1]
#loocv4
Show that the LOOCV error depends on the simulated data and on the random seed.
Use LOOCV as a method for best model selection (a sketch of such a degree-selection loop follows the template below).
#cv.error[degree]=cv.glm(mydata,glm.fit)$delta[1]
#cv.error
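Here is a minimal sketch of such a degree-selection loop, assuming a data frame mydata with columns x and y as built above (the commented solution cell below follows the same pattern):
library(boot)
cv.error=rep(0,5)
for (degree in 1:5){
  glm.fit=glm(y~poly(x,degree),data=mydata)
  cv.error[degree]=cv.glm(mydata,glm.fit)$delta[1]
}
cv.error
which.min(cv.error) # the degree with the smallest LOOCV error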
solutions-3
\[ Var(Z(t))= \sigma_X^2 t^2 + 2t(1-t)\sigma_{XY} + (1-t)^2 \sigma_Y^2 \]
#mydata =data.frame(x,y)
#set.seed(999161)
#library(boot)
#glm.fit = glm(y~poly(x,degree),data = mydata)
#cv.error[degree]=cv.glm(mydata,glm.fit)$delta[1]
#cv.error
You do not need to read this section if you are not interested in theory. In this section, I will give a formal explanation of the reason why LOOCV does not need to perform \(n\) model fits, but only ONE model fit, when the model is a linear regression model.
We will use the notation \(U_i\) to indicate the i-th component of the vector \(U\) and \(U_{[-i]}\) to indicate the vector \(U\) with the i-th row deleted. Please do not confuse one for the other, and use your common sense, because the notation is slightly overloaded here: for example, \(x_i\) is not a scalar but the vector of predictors for record number \(i\). I think that the formulas are clear enough to delineate the general nature of the argument. Ask me if you have any questions regarding this section.
Recall that the vector \(a\) of parameters (regression coefficients) of the linear model \(Y=Xa+e\) can be estimated via the formula: \[\hat{a}=HZ\] where
\[Z= X^T Y\] and
\[H= [X^TX]^{-1}\] The same formulas are applicable to any choice of data, in particular to each dataset with record \(i\) removed, which is why it is logical to introduce the matrices and vectors:
\[ Z_{[-i]}= X_{[-i]}^T Y_{[-i]}\]
\[H_{[-i]}= [ X_{[-i]}^TX_{[-i]} ]^{-1}\]
\[ \hat{a}_{[-i]} = H_{[-i]} Z_{[-i]}\]
\[ e_{[-i]} = y_i - x_i^T \hat{a}_{[-i]}\] Please note that this is the error made by the prediction of the response for record \(i\), based on fitting the linear model based on the full data set minus record \(i\).
Now comes the key argument. The Sherman-Morrison-Woodbury formula allows us to explicitly calculate the inverse of a rank-one update of a matrix.
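For reference, in the rank-one case the formula (often called the Sherman-Morrison formula) reads, for an invertible matrix \(A\) and a vector \(u\): \[(A - uu^T)^{-1} = A^{-1} + \frac{A^{-1}u\,u^T A^{-1}}{1-u^T A^{-1} u}\] Here it will be applied with \(A = X^TX\) and \(u = x_i\).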
This is relevant here because the two \(X^TX\) matrices only differ by the tensor square of a vector:
\[ X_{[-i]}^TX_{[-i]} = X^TX - x_i x_i^T\] It turns out that the leverage is:
\[h_i=x_i^T (X^TX)^{-1} x_i\]
Using all the formulas above together with the Sherman-Morrison-Woodbury formula easily gives:
\[e_{[-i]} = \frac{e_i}{1-h_i} \]
This is the desired result. It shows that the error \(e_i=y_i-\hat{y}_i\) estimated from the full-data linear model is over-optimistic because it double-dips the data. Plugging the adjusted errors \(e_{[-i]}\) into the definition of the LOOCV error gives:
\[ CV_{n} = \frac{1}{n} \sum_{i=1}^{n} (\frac{y_i- \hat{y}_i}{1-h_i})^2\]
This formula is very easy to understand: the LOOCV error rate is just the empirical average (i.e. ‘sample mean’) of the adjusted squared errors of prediction of each observation, where the adjustment is carried out so as to avoid using a data point to predict its own response.