Review resampling methods (ISLR, Chapter 5)
Manipulate probabilities and distributions
Learn how to compare models via cross-validation
Learn how to compare models by statistical testing
Reading Chapters 3, 5 and 8 of ISLR is recommended.
(q1: HELP FOR Chap.5, PROBLEM 1)
One says that a function \(B(x,y)\) is bilinear when it behaves ‘like a product’.
\[(2 X + 3 Y) * (5Z + 9T) = (2X)*(5Z) + (2X)*(9T) + (3Y)*(5Z) + (3Y)*(9T)\]
You can simplify this expression further by using the property:
\[(2X)(5Z) = (2*5)XZ = 10 XZ\]
Mathematics provides many examples of bilinear forms: the scalar product of vectors, Riemannian metrics and exterior products are a few of them.
Hence, for a bilinear form, one has (please complete): \[B(9X+3Y,2Z+3T) = 18 B(X,Z) + \ldots\]
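For reference, the general expansion of a bilinear form over two linear combinations (which you can use to fill in the blanks) reads: \[B(aX+bY,cZ+dT) = ac\,B(X,Z) + ad\,B(X,T) + bc\,B(Y,Z) + bd\,B(Y,T)\]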
One can easily show that the covariance \(Cov(X,Y)\) of two random variables is a bilinear form.
Recall that the covariance \(Cov(X,Y)\) of two random variables \(X\) and \(Y\) is a measure of how much ‘the expected value of the product differs from the product of the expected values’:
\[ Cov(X,Y)= \mathbb{E}(XY) - \mathbb{E}(X) \mathbb{E}(Y)\]
Also recall that two random variables are independent iff their joint distribution is the product of their marginal distributions:
\[F_{X,Y}(x,y)=F_X(x)F_Y(y)\]
The symbol \(F_X(x)\) represents the cumulative distribution function of the random variable \(X\) which, by definition, is:
\[F_X(x) = \mathbb{P}(X \leq x)\]
Please note that \(F_X(x)\) is a well-defined non-random function of the scalar real variable \(x\) when \(X\) is a scalar real-valued random variable.
If \(X\) and \(Y\) are independent random variables, draws of \(X\) reveal no information about draws of \(Y\).
Exercises (not for assignment 8 but for your own culture): As a consequence of the bilinearity of covariance, we can derive interesting properties of the variance.
Recall that the variance \(Var(X)\) of a random variable is:
\[ Var(X)= Cov(X,X)\]
Use this definition to show that
\[Var(X) = \mathbb{E}(X^2) - [\mathbb{E}(X)]^2\]
\[Var(X) = \mathbb{E}[(X -\mathbb{E}(X))^2]\]
and that the variance behaves ‘like a square’, i.e.
\[ Var(aX+bY) = a^2 Var(X) + b^2 Var(Y) + 2ab Cov(X,Y)\]
Remarks: Let us apply this formula to an interesting problem. Suppose that you have two quantities \(X\) and \(Y\) that are uncertain. Suppose also that these quantities are not independent but have some covariance. We can build a path of random variables \(Z(t)\) that interpolates between \(X\) (say, the stock market value in Paris) and \(Y\) (say, the stock market value in Montreal):
\[Z(t)= t X + (1-t)Y\]
Now the question is: what is the variance of \(Z(t)\)? This can easily be calculated by exploiting the fact that the variance behaves like a square:
\[Var(Z(t))= \ldots\]
Please review how to draw samples from known distributions in R
Let us manipulate normally distributed random variables in R
We note \(N(\mu,V)\) the normal distribution with mean \(\mu\) and variance \(V\). The variance can also be written \[V=\sigma^2\] where \(\sigma\) is the standard deviation.
In R, we can use the ‘rnorm’ function to draw samples from a normal distribution. To draw \(ns=23\) samples from a normal distribution \(N(4,V)\) with \(V=100\) (note that rnorm expects the standard deviation \(\sigma=\sqrt{V}=10\), not the variance):
set.seed(23)
ns=23
x=rnorm(ns,mean=4,sd=10)
Suppose that a continuous response \(y\) depends on a covariate \(x\) in the following way:
\[y=3*x - 2*x^2 + e\] where \(e\) is a superimposed noise that affects \(y\).
Suppose that the covariate \(x\) is normally distributed:
\[X \sim N(0,100)\]
and that the noise \(e\) is also normally distributed:
\[E \sim N(0,20)\]
Write your R code to simulate \(50\) samples of the response \(y\) (see solutions-2a)
set.seed(23)
# your code here
Can you represent what the following code does by formulas? (see solutions-2b)
#set.seed(23)
#sd(rnorm(100,sd=5))
nsample=50
meanx=3
sdx=9
meane=0
sde=5 # try 100 then 1000 then 10000
x=rnorm(nsample,mean=meanx,sd=sdx)
e=rnorm(nsample,mean=meane,sd=sde)
y=-2. + 4.1*x-3.5*x^3 + e
#plot(x,y)
(q2: HELP FOR Chap.5, PROBLEM 2)
For problem 2, we will have to remember our discussions of the combinatorics of the bootstrap.
Let us mention two of the most brilliant early probabilists:
+Sampling probability: Quiz-1
Deduce from all previous steps that the probability of not ‘seeing’ a given item \(j\) in a bootstrap sample of size \(p\) drawn from a data set of size \(n\) is: \[P = (1-1/n)^p\]
Now let us apply the formula above
+Sampling probability: Quiz-2
What is the formula that calculates the probability that item \(j\) IS in a bootstrap sample of size \(p\) taken from a dataset \(d=(1,2,3,\ldots,n)\)?
n=c(1:100,100*c(1:10),500*(1:100))
# what are we calculating here?
# do you understand this formula?
Pn=1-(1-1/n)^n
# n is not a single number but a vector of sample sizes
plot(n,Pn,xlab="n",ylab="P_n",main="P(j'th observation IS IN bootstrap sample)",log="x",ylim=c(0,1))
abline(1-1/exp(1),0) # horizontal line at the asymptote 1-1/e (approximately 0.632)
How can we explain this phenomenon?
All we have done here is to repeat exactly the steps that I used in my lectures on the combinatorics of the bootstrap, see the following links to my lectures:
The real magic is due to Leonhard Euler!
The first printed occurrence of the letter e to identify the base of the natural logarithm is found in a paper that Euler wrote on the explosive forces in cannons. Euler was interested in everything and wrote on everything, from vaccines to artillery.
E853 has achieved lasting fame in mathematical footnote history by being the first printed source in which e is used to represent the base of the natural logarithm.
He was your age (21 years old) when he wrote this paper (‘Meditation on experiments made recently on the firing of cannon’) in 1727. He decided to describe seven real experiments performed between August 21 and September 2, 1727. EULER WAS A DATA SCIENTIST!
Click on this beautiful print by Walther Hermann Ryff (1547) to discover Euler as a data scientist:
Leonhard Euler was not the first to discover the number e, but he contributed more than his predecessors to a deep understanding of this number and calculated its value to 23 decimal places. How would you do this yourselves? (see below or solutions)
The first known use of this constant, then represented by the letter b, was in a correspondence from Gottfried Leibniz to Christiaan Huygens in 1690 and 1691.
Leonhard Euler introduced ‘formally’ the letter e as the base for natural logarithms in a letter to Christian Goldbach on 25 November 1731.
Method 1: use \[\lim_{n \rightarrow \infty} (1+\frac{x}{n})^{n} = e^x \]
This method was discovered by Jacob Bernoulli in 1683, when he sought the value of \[\lim_{n \rightarrow \infty} (1+\frac{1}{n})^{n}\]
Method 2: use the series \[e = \sum_{n=0}^\infty \frac{1}{n!}\]
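If you would like to approximate \(e\) yourselves in R, here is a minimal sketch of both methods (the grid of values of \(n\) and the truncation of the series are arbitrary choices of mine):
# Method 1: Bernoulli's limit (1+1/n)^n evaluated for increasingly large n
n=10^(0:7)
(1+1/n)^n
# Method 2: partial sum of the series 1/0! + 1/1! + 1/2! + ...
sum(1/factorial(0:20))
# compare with R's built-in value of e
exp(1)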
(q3: HELP FOR Chap.5, PROBLEM 3)
Let us briefly review and compare these methods.
For standard ‘linear models’, explicitly refitting the model \(n\) times to compute LOOCV is entirely unnecessary.
Here is why. The LOOCV cross-validation error can be calculated by the following formula:
\[ CV_{n} = \frac{1}{n} \sum_{i=1}^{n} \left(\frac{y_i- \hat{y}_i}{1-h_i}\right)^2\]
where \(e_i=y_i-\hat{y}_i\) is the prediction error of the linear model for record \(i\) and \(h_i\) is the ‘leverage statistic’ of record \(i\).
What is the influence of an observation on the regression? It is best to make an analogy with levers.
Recall that the great Greek mathematician Archimedes thought about using levers to lift the world (nothing less). Here is a nice allegoric painting by Giulio Parigi (1599) of this virtuoso act:
Each observation is similarly trying to lift the least-squares solution. The statistician wishes all observations to have similar leverage: ideally, each of the \(n\) observations should contribute \(1/n\). However, even in a simple linear regression, this is not necessarily the case.
The leverage statistic of observation \(i\) (a property of its predictor value) is calculated as:
\[h_i = \frac{1}{n} + \frac{(x_i-\bar{x})^2}{\sum_{j=1}^n (x_j-\bar{x})^2}\]
Note that the leverage can be calculated in the total absence of information regarding the response. It is a property of the distribution of the sampled covariate values.
Note also that the leverage lies between \(1/n\) and \(1\). As said in ISLR, it ‘reflects the amount that an observation influences its own fit’.
The moral of this equation is that observations whose predictor lies far from the mean predictor value \(\bar{x}\) weigh more heavily on the solution. This is easy to understand from a mechanical standpoint: the same deviation will make the regression line rotate more if we are far from the center of rotation. See Chapter 3 and Figure 3.13 of ISLR for a visual example.
Note that, for a single predictor, the total leverage is:
\[\sum_{i=1}^n h_i = 1 + 1 = 2\]
If we sum the leverages over all observations in a model with \(p\) predictors, the total becomes \(1+p\), so the average leverage is \((1+p)/n\). Hence, we can use this average as an order of magnitude to detect high-leverage points: if an observation has a leverage far greater than \((1+p)/n\), it should be considered with caution because it potentially has too great an influence on the solution.
Hence, while LOOCV requires \(n\) model fits in general, for linear models, a single model fit (all data!) suffices.
I invite those of you who are not convinced by the analogy made here with mechanical levers to consult the work of Koenig and Huygens on the rotation of solid bodies.
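If you would like to check this equivalence numerically, here is a minimal sketch (the simulated data and the object names are mine) that compares the one-fit shortcut, based on the leverages returned by hatvalues(), with an explicit leave-one-out loop:
# simulate a small illustrative data set
set.seed(1)
x=rnorm(30)
y=1+2*x+rnorm(30)
fit=lm(y~x)
# shortcut: a single fit, residuals adjusted by the leverages h_i
h=hatvalues(fit)
cv_shortcut=mean(((y-fitted(fit))/(1-h))^2)
# brute force: n separate fits, each leaving out one observation
cv_loop=mean(sapply(1:30,function(i){
  fit_i=lm(y~x,subset=-i)
  (y[i]-predict(fit_i,newdata=data.frame(x=x[i])))^2
}))
c(cv_shortcut,cv_loop) # the two numbers agree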
+Summary facts: LOOCV
+Summary facts: k-fold-CV
k-fold-CV uses a higher fraction of the data for training than the Validation-Set approach (PRO)
k-fold-CV error estimation is more biased than LOOCV (CON)
+Quiz
Suggested pipeline:
I encourage you to try to fill in the following cells before jumping to the solution section.
Which R libraries can we use to fit linear models (lm)?
# load the library here
What is the name of the function that calculates cross-validation for linear models?
# name of the function
Run here an example of cross-validation provided in the R documentation of your chosen linear model library:
# paste code from R documentation and run cross-validation here
Now you can repeat this for generalized linear models:
Which R libraries can we use to fit generalized linear models (glm)?
# load the library here
# name of cv function
# run an example of cv by pasting examples from the R documentation here.
Suppose you want to plot the value of a function \(f(x)\) on a regularly spaced grid of values of \(x\).
How would we do this in R?
We can generate the first 50 integers like this:
n=c(1:50)
Say I want to plot \(f_n = \sqrt{n}\) for the first 50 integers:
n=c(1:50)
f_n=n**(1/2)
plot(n,f_n,xlab="n",ylab="f_n",main="Plot of the square root function")
It is often useful to add vertical or horizontal lines to a plot. For example, asymptotes of hyperbolas or other curves can be plotted by making use of the ‘abline’ function. Here is an example of how to call abline to complement a plot.
plot(n,f_n,xlab="n",ylab="f_n",main="Plot of the square root function")
abline(v=5)
abline(h=6)
100*c(1:10)
## [1] 100 200 300 400 500 600 700 800 900 1000
We can combine several grids with different spacing.
# this will create a grid of 10 values: 1 2 3 ... 10
# followed by a grid of 10 values 10 20 30 ... 100
# you could continue
n=c(1:10,10*c(1:10))
f_n=n**(1/2)
plot(n,f_n,xlab="n",ylab="f_n",main="Plot of the square root function")
It is often useful to plot a function in transformed coordinates. For example, we may want to use a logarithmic scale for the independent quantity plotted on the \(x\)-axis. This can be achieved by:
plot(n,f_n,xlab="n",ylab="f_n",main="Plot of the square root function with a logarithmic x-axis",log="x")
When you are not sure about the meaning of arguments of a function, just interrogate the R documentation in the console of Rstudio:
?plot
## starting httpd help server ... done
In this case, you will need to dig down the documentation and follow the plot.default link
+Statistical significance of regression models can be measured by ‘p-values’.
+The function that fits the model returns one p-value for each of the estimated parameters (e.g. the coefficients of a polynomial regression model).
+These ‘p-values’ are usually found in the summary output of the fitted model.
+We can use the framework of hypothesis testing to calculate statistical significance.
I briefly recall how statistical hypothesis testing works:
You formulate a null-hypothesis, noted \(H_0\) which is a quantitative assertion about the true parameters. A common example of null hypothesis is:
\[H_0: a_k = 0\] where \(a_k\) is the k-th parameter of your model.
This null hypothesis states that the true value of the \(k\)-th parameter of the model is zero. For example, suppose you have a quadratic model, you may want to test if the quadratic term is really necessary. You could do this by testing the hypothesis \(a_2=0\).
You formulate an alternative hypothesis, noted \(H_1\) or \(H_a\), for example:
\[ H_1: a_k \neq 0 \] To decide between these hypotheses, one calculates the ‘p-value’ \(p\) of our test of hypothesis \(H_0\).
This is a measure of the ‘trust’ that we have in the validity of the assumption \(H_0\). Typically, it is obtained by estimating the probability of observing a test statistic at least as extreme as the observed statistic \(T_{obs}\), where the calculation makes use of \(H_0\) (i.e. is done under the assumption that the null-hypothesis holds true).
To summarize, a small p-value of a hypothesis test means that the null-hypothesis is unlikely considering the observed data.
Hence, in the quadratic regression example evoked above, if the p-value of \(H_0: a_2 = 0\) is small, we would reject the hypothesis that the quadratic term is null and thus conclude that we need this term and that a linear model would not fit the data adequately.
Let us apply these concepts to the hardwood data. This data set was used in my lectures to introduce polynomial regression.
I recall that the hardwood data is given by a series of pairs \((x_i,y_i), i=1, 2, \ldots , n\), where \(x\) is the proportion of hardwood in paper pulp, and \(y\) is a measure of tensile strength. A plot of \(x\) vs \(y\) indicates that the relationship is non-linear.
A slight mistake exists in the original lecture. The data has nothing to do with lumber. We are talking about paper, not planks of wood.
It is important to note the difference between a simple-minded exercise in statistics and a more realistic study of the mechanics of paper and composite material. Those of you with an inclination for mechanics can look at these (tensile) papers:
It is apparent that hardwood content is by no means the sole determinant of strength. The strength of Kraft paper depends on fiber characteristics (such as length and strength distribution).
Having said that, we can now go on and explore the statistical significance of parameters estimated by various models.
# I lump the data preparation into a function
hardwood <- function(){
x = c(1,1.5,2,3,4,4.5,5,5.5,6,6.5,7,8,9,10,11,12,13,14,15)
y = c(6.3,11.1,20,24,26.1,30,33.8,34.0,38.1,39.9,42,46.1,53.1,52,52.5,48,42.8,27.8,21.9)
n=length(y)
#plot(x,y,xlab="hardwood %",ylab="tensile strength")
mydata = data.frame(x,y)
return(mydata)
}
# I call the function to collect the data
d = hardwood()
mydata <- d
# I make a scatter plot and write a title
# note that the original lecture contained an error. We are talking about paper not lumber. see below
#
plot(d$x,d$y,xlab="hardwood %",ylab="tensile strength",main="Tensile strengh of Paper (y) \n as a function of hardwood content in pulp (x)")
x2 <- (d$x)^2 # squared covariate, in case you want to build the quadratic predictor by hand (not used below)
yi=lm(y~x,data=d)
syi=summary(yi)
print(syi)
##
## Call:
## lm(formula = y ~ x, data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25.986 -3.749 2.938 7.675 15.840
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 21.3213 5.4302 3.926 0.00109 **
## x 1.7710 0.6478 2.734 0.01414 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.82 on 17 degrees of freedom
## Multiple R-squared: 0.3054, Adjusted R-squared: 0.2645
## F-statistic: 7.474 on 1 and 17 DF, p-value: 0.01414
# can you see what we are doing here?
# we extract the p-value from the summary
# to do this, we look at the second row (parameter or 'effect' x)
# and we take the fourth column: Pr(>|t|)
# this prints the p-value of the test H_0: a_1=0
# I get a value of 0.01414
# as this is smaller than 0.05 (conventional level of significance)
# but larger than 0.01 (next smaller level of significance)
# this test earns 1 star of significance
# indeed this is what you see in the summary: one star printed beside this value
coefficients(syi)[2,4]
## [1] 0.01414016
You can also proceed differently if you do not want to build the monomial predictors by hand. For example, you can fit a cubic polynomial like this:
lm.fit = lm(y~poly(x,degree=3),data = d)
and retrieve the p-values (separate tests for each coefficient) like this:
pvalues=coefficients(summary(lm.fit))[,]
pvalues
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.18421 0.5930501 57.641352 5.041090e-19
## poly(x, degree = 3)1 32.30213 2.5850455 12.495767 2.480980e-09
## poly(x, degree = 3)2 -45.39625 2.5850455 -17.561103 2.058028e-11
## poly(x, degree = 3)3 -14.57400 2.5850455 -5.637811 4.721725e-05
Note how fitting a polynomial of degree 2 to the hardwood data considerably reduces all p-values compared with the simple linear fit shown above. This is an indication that the second-degree term is genuinely needed to describe the curvature of the data.
lm.fit2 = lm(y~poly(x,degree=2),data = d)
pvalues2=coefficients(summary(lm.fit2))[,]
pvalues2
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.18421 1.014108 33.70864 2.730951e-16
## poly(x, degree = 2)1 32.30213 4.420395 7.30752 1.758404e-06
## poly(x, degree = 2)2 -45.39625 4.420395 -10.26973 1.894349e-08
lm.fit1 = lm(y~poly(x,degree=1),data = d)
pvalues1=coefficients(summary(lm.fit1))[,]
pvalues1
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.18421 2.71075 12.610609 4.691972e-10
## poly(x, degree = 1) 32.30213 11.81589 2.733788 1.414016e-02
To summarize, we can determine for each model which of the fitted coefficients ‘are significant’ (which is a concise and slightly incorrect way to say that they are statistically significantly different from zero, i.e. the null hypothesis of zero-ness is rejected at a certain level of significance for these coefficients).
This should help you for q5 of Assignment 8
Suppose we have \(K\) estimates (obtained for example by bootstrap sampling) of the probability that an item in a data set belongs to a given class, given the measured values of the predictors \(X\) for this item.
There are two ways to use these numbers \(p_1, p_2, \ldots, p_K\) to make a decision on the question ‘does this item belong to the class?’ (see the illustrative sketch below). This should help you for q6 of Assignment 8.
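As a hint (this is my own illustrative sketch with hypothetical numbers, not the assignment data), the two natural options are to take a majority vote over the individual classifications, or to average the estimated probabilities first and then apply the threshold:
# hypothetical bootstrap estimates of the class probability for a single item
p=c(0.1,0.15,0.2,0.2,0.55,0.6,0.6,0.65,0.7,0.75)
# option 1: classify with each estimate, then take the majority vote
mean(p>0.5)>0.5 # TRUE: 6 out of 10 classifications say 'yes'
# option 2: average the probabilities, then apply the 0.5 threshold
mean(p)>0.5 # FALSE: the average probability is 0.45
Note that the two rules can disagree, which is exactly what makes the question interesting.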
To create a validation set, we need to roughly divide data into two equal size subsets
We can do this by 1. determining the number of records in the data, 2. picking \(n/2\) random records without replacement and 3. using this set of indices to define the testing and training data sets (a runnable example follows the template below).
# determine the number of rows of my dataframe
#n=nrow(mydata)
# pick up n/2 random integers within the range 1 to n
#indices=sample(1:n,n/2,replace=F)
# create two complementary subsets of records
# set1 = d[indices,]
# set2 = d[-indices,]
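For example, with the hardwood data frame d built earlier (a minimal runnable sketch; the names set1 and set2 are arbitrary):
set.seed(1)
n=nrow(d)
# pick floor(n/2) random row indices without replacement
indices=sample(1:n,floor(n/2),replace=F)
set1=d[indices,] # e.g. the training set
set2=d[-indices,] # the complementary records, e.g. the testing set
dim(set1)
dim(set2)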
Recall how to fit a regression tree (CART) model, make a prediction and evaluate the mean-square-error (MSE) of the model
library(tree)
# assume that you want to use a dataset named d
# this could either be a testing set or a training set
# this calculates a fitted CART tree in d.tree
#d.tree=tree(MyResponseColumnName~.,data=d)
# this plots the CART
#plot(d.tree)
# and the labels of the nodes
#text(d.tree)
# this makes a prediction on an other data set (typically a testing dataset)
#mypred=predict(d.tree,newdata=otherdata)
#myMSE=mean((otherdata$MyResponseColumnName-mypred)^2)
Once you have fitted the full tree, you know it is going to overfit the data.
We need to prune it to the right level of complexity.
We can do this by 1. using cross-validation to predict the optimal number of leaves that the pruned model should have and 2. pruning the full tree to this level of complexity.
Would you know how to do this? (A complete worked sketch follows the template below.)
# use cv.tree()
# outcv =
# determine the best size
# bestsize=outcv$size[outcv$dev==min(outcv$dev)]
# build the pruned tree to this size
# ... prune.tree(....)
# now we can calculate the MSE (mean square error) of prediction errors based on the pruned tree model
#prunepred=predict(tree.pruned,newdata=cstest)
#prunedMSE=mean((cstest$Sales-prunepred)^2)
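Here is a minimal worked sketch of the whole pipeline, assuming the Carseats data from the ISLR package (which is what the cstest and Sales names in the template above suggest; adapt the names to your own data):
library(tree)
library(ISLR)
set.seed(1)
# split Carseats into a training half and a testing half
n=nrow(Carseats)
idx=sample(1:n,n/2)
cstrain=Carseats[idx,]
cstest=Carseats[-idx,]
# fit the full regression tree on the training set
train.tree=tree(Sales~.,data=cstrain)
# cross-validation to choose the tree size (number of leaves)
outcv=cv.tree(train.tree)
bestsize=outcv$size[which.min(outcv$dev)]
# prune the full tree to the selected size
tree.pruned=prune.tree(train.tree,best=bestsize)
# test-set MSE of the pruned tree
prunepred=predict(tree.pruned,newdata=cstest)
prunedMSE=mean((cstest$Sales-prunepred)^2)
prunedMSE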
Applying bilinearity, one obtains the following polynomial of degree 2 in \(t\).
\[ Var(Z(t))= \sigma_X^2 t^2 + 2t(1-t)\sigma_{XY} + (1-t)^2 \sigma_Y^2 \]
Now let us find the value of \(t\) that makes the variance minimal among all values of \(t \in [0,1]\).
To do this, simply remember your High School courses on the parabola or calculate the derivative with respect to \(t\) and find the value of \(t\) that zeroes the derivative.
Regroup the terms by the degrees of \(t\) monomials:
\[ Var(Z(t))= t^2 (a + b - 2c) + t (2c -2b ) + b = g(t)\] where I have simplified the notation (find yourselves what a, b and c represent).
Now, I can find the extremum (apex of the parabola) by computing the derivative and setting the value of \(t\) to the value that zeroes the slope:
\[ g'(t) = 2(a+b-2c) t + 2(c-b) = 0\]
\[ t_{extremum} = \frac{b-c}{a+b-2c}\]
Is this extremum a maximum or a minimum? What is the orientation of this parabola? To check this, look at the sign of the coefficient of \(t^2\). If this coefficient is positive, the parabola opens upward (tends to plus infinity); otherwise it opens downward (tends to minus infinity).
Show that this parabola always opens upward. Why is this the case regardless of the values of \(a\), \(b\) and \(c\)?
Hint: Can a variance be negative? What is the value of \(Var(X-Y)\)? Expand this by making use of bilinearity (again!).
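A quick numerical check (my own sketch, with arbitrary values of the variances and the covariance) displays the upward-opening parabola and the location of its minimum:
a=4 # a = Var(X)
b=9 # b = Var(Y)
cxy=2.5 # c = Cov(X,Y)
t=seq(0,1,by=0.01)
g=t^2*(a+b-2*cxy)+t*(2*cxy-2*b)+b
plot(t,g,type="l",xlab="t",ylab="Var(Z(t))")
abline(v=(b-cxy)/(a+b-2*cxy)) # the minimizing value of t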
Solution-2a
#set.seed(23)
nsample=50
Varx=100
meanx=0
sdx=sqrt(Varx)
Vare=20
meane=0
sde=sqrt(Vare)
x=rnorm(nsample,mean=meanx,sd=sdx)
e=rnorm(nsample,mean=meane,sd=sde)
y=3*x-2*x^2 + e
plot(x,y)
Solution-2b
\[ X \sim N(3,81)\]
\[ E \sim N(0,25) \]
\[ y = -2. + 4.1 *x -3.5 x^3 +e \]
A generalized linear model applies the linear regression not to the response itself but to a transformation (the ‘link’ function \(g\)) of its expected value. This means that in linear regression, we use:
\[y = Ax + e\]
but in generalized linear models, we relate the transformed mean response \(\eta = g(\mathbb{E}(y))\) to the predictors through
\[ \eta = Ax\]
The model is characterized by a ‘family’ that specifies the distribution of the response together with its link function (gaussian, binomial with a logit link, Poisson, and so on). The family is one of the arguments specified when calling the glm function. By default, glm fits an ordinary linear model (i.e. the gaussian family with the identity link, \(\eta=y\)).
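As a minimal illustration (the data below are simulated by me, not taken from the assignment), the family argument is what distinguishes an ordinary linear fit from, say, a logistic regression:
set.seed(1)
x=rnorm(100)
ylin=1+2*x+rnorm(100) # continuous response
ybin=rbinom(100,1,plogis(1+2*x)) # binary response
fit.lin=glm(ylin~x) # default family=gaussian, equivalent to lm
fit.log=glm(ybin~x,family=binomial) # logistic regression (logit link)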
The stats library gives access to a function called glm which allows you to fit generalized linear model (glm).
library(stats)
How do I fit a polynomial model?
# create your dataset
#mydata =data.frame(x,y)
# choose the degree of the polynomial model you want to fit:
#degree=4
# use glm to calculate the best coefficients (fitted model)
# and store all results of model fit in an object here called glm.fitdeg4
#glm.fitdeg4 = glm(y~poly(x,degree),data = mydata)
# calculate the LOOCV error of the model and store it (cv.glm comes from the 'boot' package: library(boot))
#loocv4=cv.glm(mydata,glm.fitdeg4)$delta[1]
#loocv4
Show that the LOOCV error depends on the simulated data and on the random seed.
Use LOOCV as a method for best model selection (a sketch of such a degree-selection loop follows the template below).
#cv.error[degree]=cv.glm(mydata,glm.fit)$delta[1]
#cv.error
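Here is a minimal sketch of such a degree-selection loop, assuming a data frame mydata with columns x and y as built above (the commented solution cell below follows the same pattern):
library(boot)
cv.error=rep(0,5)
for (degree in 1:5){
  glm.fit=glm(y~poly(x,degree),data=mydata)
  cv.error[degree]=cv.glm(mydata,glm.fit)$delta[1]
}
cv.error
which.min(cv.error) # the degree with the smallest LOOCV error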
solutions-3
\[ Var(Z(t))= \sigma_X^2 t^2 + 2t(1-t)\sigma_{XY} + (1-t)^2 \sigma_Y^2 \]
#mydata =data.frame(x,y)
#set.seed(999161)
#library(boot)
#glm.fit = glm(y~poly(x,degree),data = mydata)
#cv.error[degree]=cv.glm(mydata,glm.fit)$delta[1]
#cv.error
You do not need to read this section if you are not interested in theory. In this section, I will give a formal explanation of the reason why LOOCV does not need to perform \(n\) model fits, but only ONE model fit, when the model is a linear regression model.
We will use the notation \(U_i\) to indicate the i-th component of the vector \(U\) and \(U_{[-i]}\) to indicate the vector \(U\) with the i-th row deleted. Please do not confuse one for the other, and use your common sense, because the notation is slightly overloaded here: for example, \(x_i\) is not a scalar but the vector of predictors for record number \(i\). I think that the formulas are clear enough to delineate the general nature of the argument. Ask me if you have any questions regarding this section.
Recall that the vector \(a\) of parameters (regression coefficients) of the linear model \(Y=Xa+e\) can be estimated via the formula: \[\hat{a}=HZ\] where
\[Z= X^T Y\] and
\[H= [X^TX]^{-1}\] The same formulas are applicable to any choice of data, in particular to each dataset with record \(i\) removed, which is why it is logical to introduce the matrices and vectors:
\[ Z_{[-i]}= X_{[-i]}^T Y_{[-i]}\]
\[H_{[-i]}= [ X_{[-i]}^TX_{[-i]} ]^{-1}\]
\[ \hat{a}_{[-i]} = H_{[-i]} Z_{[-i]}\]
\[ e_{[-i]} = y_i - x_i^T \hat{a}_{[-i]}\] Please note that this is the error made by the prediction of the response for record \(i\), based on fitting the linear model based on the full data set minus record \(i\).
Now comes the key argument. The Sherman-Morrison-Woodbury formula allows us to explicitly calculate the inverse of a rank-one update of a matrix.
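For reference, in the rank-one case the formula (often called the Sherman-Morrison formula) reads, for an invertible matrix \(A\) and a vector \(u\): \[(A - uu^T)^{-1} = A^{-1} + \frac{A^{-1}u\,u^T A^{-1}}{1-u^T A^{-1} u}\] Here it will be applied with \(A = X^TX\) and \(u = x_i\).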
This is relevant here because the two \(X^TX\) matrices only differ by the tensor square of a vector:
\[ X_{[-i]}^TX_{[-i]} = X^TX - x_i x_i^T\] It turns out that the leverage is:
\[h_i=x_i^T (X^TX)^{-1} x_i\]
Using all the formulas above together with the Sherman-Morrison-Woodbury formula easily gives:
\[e_{[-i]} = \frac{e_i}{1-h_i} \]
This is the desired result. It shows that the error \(e_i=y_i-\hat{y}_i\) estimated from the full-data linear model is over-optimistic because it double-dips the data. Plugging the adjusted errors \(e_{[-i]}\) into the definition of the LOOCV error gives:
\[ CV_{n} = \frac{1}{n} \sum_{i=1}^{n} (\frac{y_i- \hat{y}_i}{1-h_i})^2\]
This formula is very easy to understand: the LOOCV error rate is just the empirical average (i.e. ‘sample mean’) of the adjusted squared errors of prediction of each observation, where the adjustment is carried out so as to avoid using a data point to predict its own response.