2017-03-03

Reminder

Linear Regression

  • Linear regression is an approach for modeling the relationship between a dependent variable y and one or more independent variables, denoted X.

  • Linear relationship vs Non-linear relationship

  • Simple linear regression: Predicting a quantitative response Y based on a single predictor is called simple linear regression.
  • Multiple linear regression: With more than one predictor, the process is called multiple linear regression.

  • Supervised Learning

  • Parametric approach

  • Example

The Advertising data set consists of the sales of a product in 200 different markets, along with advertising budgets for the product in each of those markets for three different media: TV, radio, and newspaper.

http://www-bcf.usc.edu/~gareth/ISL/data.html

Simple Linear Regression

Simple linear regression: predicting a quantitative response Y on the basis of a single predictor variable X.

\(Y = \beta_0 + \beta_1X + \epsilon\)

  • \(\beta_1\) : Slope
  • \(\beta_0\) : Intercept
  • \(\epsilon\) : The error term (a catch-all for what we miss with this simple model)

Fitting the linear regression model

  • Least squares approach

minimizing the residual sum of squares (RSS)

\(RSS = \sum(y_i - \hat{y}_i)^2 = \sum(y_i - \hat{\beta_0}-\hat{\beta_1}x_i)^2\)

\(\hat{\beta_1} = \frac{\sum(x_i-\bar{x})(y_i-\bar{y})}{\sum(x_i-\bar{x})^2}\)

\(\hat{\beta_0} = \bar{y} - \hat{\beta_1}\bar{x}\)

  • Using R to fit the model:
ads = read.table("../../Advertising.csv",header=T,row.names=1,sep=",")

y = ads$Sales
x = ads$TV

beta1 = sum((x - mean(x))*(y - mean(y)))/ sum((x-mean(x))^2)
beta0 = mean(y) - beta1*mean(x)



### using lm() function

fit = lm(Sales ~ TV,data = ads)
summary(fit)
## 
## Call:
## lm(formula = Sales ~ TV, data = ads)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.3860 -1.9545 -0.1913  2.0671  7.2124 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 7.032594   0.457843   15.36   <2e-16 ***
## TV          0.047537   0.002691   17.67   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.259 on 198 degrees of freedom
## Multiple R-squared:  0.6119, Adjusted R-squared:  0.6099 
## F-statistic: 312.1 on 1 and 198 DF,  p-value: < 2.2e-16
plot(x,y)
abline(fit,col="red") ### same as abline(beta0,beta1)

Assessing the model

  • How close are our estimates to the true coefficients?

The standard errors associated with \(\hat{\beta_0}, \hat{\beta_1}\)

\(SE(\hat{\beta_0})^2 = \sigma^2[\frac{1}{n} + \frac{\bar{x}^2}{\sum(x_i-\bar{x})^2}]\)

\(SE(\hat{\beta_1})^2 = \frac{\sigma^2}{\sum(x_i-\bar{x})^2}\)

If \(\sigma\) is unknown, estimate it with \(RSE = \sqrt{RSS/(n-2)}\). The 100(1-\(\alpha\))% confidence interval can be computed as

\(\hat{\beta_0} \pm t_{\alpha/2}SE(\hat{\beta_0})\)

\(\hat{\beta_1} \pm t_{\alpha/2}SE(\hat{\beta_1})\)
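
A minimal sketch of these formulas in R, assuming the ads data frame and the Sales ~ TV fit from above; the intervals can be checked against confint():

## sketch: standard errors and confidence intervals by hand
fit = lm(Sales ~ TV, data = ads)
x = ads$TV; y = ads$Sales; n = length(y)

RSE = sqrt(sum(residuals(fit)^2)/(n - 2))                 # estimate of sigma

se.beta1 = RSE/sqrt(sum((x - mean(x))^2))
se.beta0 = RSE*sqrt(1/n + mean(x)^2/sum((x - mean(x))^2))

tcrit = qt(0.975, df = n - 2)                             # 95% interval
coef(fit)[1] + c(-1,1)*tcrit*se.beta0                     # CI for beta0
coef(fit)[2] + c(-1,1)*tcrit*se.beta1                     # CI for beta1
confint(fit, level = 0.95)                                # should agree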

  • Is There a Relationship Between the Response and the Predictor?

    • Hypothesis Test:

    \(H_0 : \beta_1 = 0\)

    \(H_a : \beta_1 \neq 0\)

    • t-statistic:

    \(t_{n-2} = \frac{\hat{\beta_1}-0}{SE(\hat{\beta_1})}\)

    • Calculate the p-value
  • How well does the model fit? (see the sketch after this list)

  • The RSE is an estimate of the standard deviation of ε. It is the average amount that the response will deviate from the true regression line.

    RSE is an absolute measure of lack of fit of the model.

    \(RSE = \sqrt{RSS/(n-2)}\)

  • \(R^2\) statistic (coefficient of determination)

    It is the percentage of the response variable variation that is explained by a linear model.

    \(R^2 = \frac{TSS - RSS}{TSS} = 1 - \frac{RSS}{TSS}\)

    In general, the higher the R-squared, the better the model fits the data.

    • 0 indicates that the model explains none of the variability of the response data around its mean.
    • 1 indicates that the model explains all the variability of the response data around its mean.
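
A short sketch of these quantities computed by hand for Sales ~ TV, assuming the ads data from above; the values can be compared with the summary(fit) output shown earlier:

## sketch: t-statistic, p-value, RSE and R^2 by hand
fit = lm(Sales ~ TV, data = ads)
n = nrow(ads)

RSS = sum(residuals(fit)^2)
TSS = sum((ads$Sales - mean(ads$Sales))^2)
RSE = sqrt(RSS/(n - 2))

se.beta1 = RSE/sqrt(sum((ads$TV - mean(ads$TV))^2))
t.stat = unname(coef(fit)[2])/se.beta1
p.value = 2*pt(abs(t.stat), df = n - 2, lower.tail = FALSE)

R2 = 1 - RSS/TSS
c(t = t.stat, p = p.value, RSE = RSE, R2 = R2)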

Make the prediction

  • Confidence interval for the mean response

\(SE(\mu_{y|x})^2 = \sigma^2[\frac{1}{n} + \frac{(x-\bar{x})^2}{\sum(x_i-\bar{x})^2}]\)

\(\mu_{y|x} \pm t_{\alpha/2}SE(\mu_{y|x})\), with \(\sigma\) replaced by its estimate, the RSE

  • Prediction interval for an individual response

\(SE(\hat{y}|x)^2 = var(\mu_{y|x}) + var(\epsilon) = \sigma^2[\frac{1}{n} + \frac{(x-\bar{x})^2}{\sum(x_i-\bar{x})^2}] + \sigma^2 = \sigma^2[1+\frac{1}{n} + \frac{(x-\bar{x})^2}{\sum(x_i-\bar{x})^2}]\)

\(\hat{y}|x \pm t_{\alpha/2}SE(\hat{y}|x)\), again with \(\sigma\) replaced by the RSE

### make the prediction and intervals
fit = lm(Sales ~ TV, data=ads)
predict(fit,data.frame(TV = c(50,100,150)),interval = "confidence",level = 0.95)
##         fit       lwr      upr
## 1  9.409426  8.722696 10.09616
## 2 11.786258 11.267820 12.30470
## 3 14.163090 13.708423 14.61776
predict(fit,data.frame(TV = c(50,100,150)),interval = "prediction",level = 0.95)
##         fit      lwr      upr
## 1  9.409426 2.946709 15.87214
## 2 11.786258 5.339251 18.23326
## 3 14.163090 7.720898 20.60528
xs = seq(min(ads$TV),max(ads$TV),length.out = 100)
CI = predict(fit,data.frame(TV = xs),interval = "confidence",level = 0.95)
PI = predict(fit,data.frame(TV = xs),interval = "prediction",level = 0.95)

plot(Sales ~ TV, data = ads)
abline(fit,col="red")
## confidence interval
lines(xs,CI[,2],col="blue")
lines(xs,CI[,3],col="blue")
## prediction interval
lines(xs,PI[,2],col="green")
lines(xs,PI[,3],col="green")

lm() Summary for simple linear regression

library(MASS)
library(ISLR)

# Simple Linear Regression

names(Boston) ## show column names

lm.fit=lm(medv~lstat,data=Boston) ##fit the linear regression model

lm.fit  ## basic information

summary(lm.fit) ## detailed information

names(lm.fit) ## get the names of this "lm" object

coefs  = coef(lm.fit)       # coefficients
resid = residuals(lm.fit)          # residuals
pred  = predict(lm.fit)            # fitted values
ci = confint(lm.fit)    # confidence interval

plot(Boston$lstat, Boston$medv)
abline(lm.fit,col="red")

Diagnostic Plots

  • Linear relationship
set.seed(1)

x = round(runif(100,1,10),2)
y = 3*x + 2
y = y+rnorm(length(x),0,4)

fit1 = lm(y ~ x)
plot(x,y)
abline(fit1,col="red")

plot(predict(fit1),residuals(fit1),ylim=c(-15,15)) 

## non linear
y = (x-3)^2 + 1 + rnorm(length(x),0,4)
fit2 = lm(y ~ x)
plot(x,y)
abline(fit2,col="red")

plot(predict(fit2),residuals(fit2)) 
  • Normal distribution of the errors
## QQ plot

set.seed(1)

x = round(runif(200,1,10),2)
y = 3*x + 2 + rnorm(length(x),0,4)

#hist(rnorm(1000,0,2))
#hist(rnorm(1000,0,2)^2)


fit = lm(y~x)
plot(x,y)
abline(fit,col="red")
hist(residuals(fit),breaks = 20)
qqnorm(residuals(fit))
qqline(residuals(fit))
plot(predict(fit),residuals(fit))
  • Constant variance of error
set.seed(1)

x = round(runif(50,1,10),2)
y = 3*x + 2 + rnorm(length(x),0,4)
fit = lm(y~x)
plot(predict(fit),residuals(fit))


newx = sort(x)
newy = 3*newx + 2 + rnorm(length(newx),0, seq(0,length.out = length(newx), by=0.2))

plot(x,y,main="constant variance")
plot(newx,newy, main="non-constant variance")

fit = lm(y~x)
plot(predict(fit),residuals(fit),main="constant variance")

fit1 = lm(newy ~ newx)
plot(predict(fit1),residuals(fit1), main="non-constant variance")

Influential Points

  • Outliers

    • Outliers: observations for which the response \(y_i\) is unusual given the predictor \(x_i\).

    • Observations whose studentized residuals are greater than 3 in absolute value are possible outliers.

  • High Leverage Points

    • High leverage points: observations with an unusual value for \(x_i\).
    • Leverage is a measure of how far an observation's predictor value deviates from the mean of the predictor.
    • High leverage points can have a large effect on the estimated regression coefficients.
set.seed(1)

x = round(runif(30,1,10),2)
newP = c(6,40)  ## 15, 60   ## add an abnormal point
y = 3*x + 2 + rnorm(length(x),0,4)

newx = c(newP[1],x)
newy = c(newP[2],y)
plot(newx,newy)
points(newP[1],newP[2],col="red")
# without the outlier
fit1 = lm(y~x)
abline(fit1,col="blue")

# with the outlier
fit2 = lm(newy ~ newx)
abline(fit2,col="red")

summary(fit1)
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.2427 -1.3139 -0.4592  2.3613  5.1237 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.4305     1.3637   1.782   0.0856 .  
## x             2.9782     0.2211  13.470 9.31e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.164 on 28 degrees of freedom
## Multiple R-squared:  0.8663, Adjusted R-squared:  0.8615 
## F-statistic: 181.4 on 1 and 28 DF,  p-value: 9.313e-14
summary(fit2)
## 
## Call:
## lm(formula = newy ~ newx)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.9190 -1.7331 -0.9879  1.9915 19.0494 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.8507     2.0465   1.393    0.174    
## newx          3.0166     0.3321   9.084 5.58e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.755 on 29 degrees of freedom
## Multiple R-squared:  0.7399, Adjusted R-squared:  0.731 
## F-statistic: 82.51 on 1 and 29 DF,  p-value: 5.575e-10
plot(predict(fit2),residuals(fit2))

plot(predict(fit2),rstudent(fit2))

# high leverage point: re-run the block above with newP = c(15,60) instead
newP = c(15,60)
# plot(hatvalues(fit2))  ## leverage plot
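
A small sketch of inspecting leverage with hatvalues(), using the fit2 object above; the rule of thumb that average leverage equals (p+1)/n, with a rough 2x cutoff, is an assumption added here:

## sketch: leverage values for the fit with the added point
lev = hatvalues(fit2)
avg = length(coef(fit2))/length(lev)     # average leverage = (p+1)/n
plot(lev, ylab = "leverage")
abline(h = avg, lty = 2)
which(lev > 2*avg)                       # a common rough cutoff for high leverage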

Multiple Linear Regression

## three separate simple linear regressions
fitTV = lm(Sales ~ TV,data = ads)
summary(fitTV)
fitRadio = lm(Sales ~ Radio,data = ads)
summary(fitRadio)
fitNews = lm(Sales ~ Newspaper,data = ads)
summary(fitNews)


## a multiple linear regression

fit = lm(Sales ~ TV + Radio + Newspaper,data = ads)
summary(fit)

## Scatter plot of data set "Advertising"

plot(ads)

cor(ads)

The multiple linear regression model with p distinct predictors: \(Y = \beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \beta_px_p + \epsilon\)

\(\beta_j\) : the average effect on Y of a one unit increase in \(X_j\), holding all other predictors fixed.

  • Least squares approach

residual standard error: \(RSE = \sqrt{\frac{RSS}{n-p-1}}\)

Matrix form:

\(\mathbf{Y} = \mathbf{X\beta} + \mathbf{\epsilon}\)

\(\hat{\beta} = (X'X)^{-1}X'Y\)

y = ads$Sales
x = cbind(rep(1,length(y)),ads[,1:3])
colnames(x)[1] = "Intercept"

x = as.matrix(x) ## convert to matrix
beta = solve(t(x)%*%x)%*%t(x) %*% y

coef(fit)
## (Intercept)          TV 
##  7.03259355  0.04753664
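
Continuing from the matrix-form estimate above, a quick sketch of the residual standard error with n - p - 1 degrees of freedom:

## sketch: RSS and RSE from the matrix-form coefficients
yhat = x %*% beta                         # fitted values
RSS = sum((y - yhat)^2)
n = nrow(x); p = ncol(x) - 1              # p = 3 predictors
sqrt(RSS/(n - p - 1))                     # RSE
summary(lm(Sales ~ TV + Radio + Newspaper, data = ads))$sigma  # should agree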
  • Multiple Coefficient of Determination

    • TSS: total variation in y \(\sum(y_i - \bar{y})^2\)

    • RSS: variation unexplained after our regression \(\sum(y_i-\hat{y_i})^2\)

    • ESS: variation in y explained by our regression \(\sum(\hat{y_i} - \bar{y})^2\)

\(R^2 = \frac{ESS}{TSS} = 1- \frac{RSS}{TSS}\)

\(R^2\) always increases as you add new predictors, but that doesn't mean the model explains the data any better, as the code below illustrates with a pure-noise predictor.

summary(fit)

ads2 = ads
set.seed(1)
ads2$Noise = round(runif(nrow(ads),5,15),2)

fit2 = lm(Sales ~ TV + Radio + Newspaper + Noise,data=ads2)
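
A quick check of the claim above, assuming fit is the three-predictor model fitted earlier and fit2 adds the Noise column; R² should rise slightly while adjusted R² (defined next) typically does not improve:

## sketch: R^2 always increases with an extra predictor; adjusted R^2 need not
c(summary(fit)$r.squared,     summary(fit2)$r.squared)
c(summary(fit)$adj.r.squared, summary(fit2)$adj.r.squared)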
  • Adjusted \(R^2\)

\(R^2_a = 1 - (1-R^2)(\frac{n-1}{n-p-1}) = 1 - \frac{n-1}{n-p-1}\frac{RSS}{TSS}\)

  • F-test of overall or joint significance

\(H_0 : \beta_1 = \beta_2 = ... = \beta_p =0\)

\(H_a : \text{one or more slope parameters is non-zero}\)

\(\frac{(TSS-RSS)/p}{RSS/(n-p-1)} \sim F_{p,n-p-1}\)
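
A minimal sketch of the overall F-statistic for the three-predictor model, assuming the ads data from above; the result can be compared with the F value reported by summary():

## sketch: the overall F-statistic by hand
fit = lm(Sales ~ TV + Radio + Newspaper, data = ads)
n = nrow(ads); p = 3
RSS = sum(residuals(fit)^2)
TSS = sum((ads$Sales - mean(ads$Sales))^2)
Fstat = ((TSS - RSS)/p)/(RSS/(n - p - 1))
Fstat
summary(fit)$fstatistic                     # value, numdf, dendf
pf(Fstat, p, n - p - 1, lower.tail = FALSE) # overall p-value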

  • Variable Selection

    • Forward selection
      • Start with no variables in the model.
      • Add to the model the variable that results in the lowest RSS
      • Continue until we reach some threshold (p-value, R square)
    • Backward selection
      • Start with all the predictors in the model.
      • Remove the predictor with highest p-value
      • Continue
    • Mixed selection
      • a combination of backward and forward selection
      • Add variables as in forward selection, refit the model, then remove a variable if at any point its p-value rises above a certain threshold.
      • Continue until all variables in the model have a sufficiently low p-value
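
As a rough illustration (not exactly the procedures above), R's built-in step() function automates selection, but note it ranks models by AIC rather than by the RSS or p-value thresholds described:

## sketch: automated selection with step() (AIC-based)
null = lm(Sales ~ 1, data = ads)
full = lm(Sales ~ TV + Radio + Newspaper, data = ads)

step(null, scope = formula(full), direction = "forward")   # forward
step(full, direction = "backward")                         # backward
step(null, scope = formula(full), direction = "both")      # mixed / stepwise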

Extensions of the Linear Regression

  • model formula in lm()
    • "." : all predictors
    • "+" : add predictors
    • "-" : remove a predictor form the formula
    • ":" : an interaction between predictors
    • "*" : all possible interactions
    • "^" : interactions up to a degree
    • "I()" : AsIs function, interpret arithmetically
library(ISLR)

#ads = read.csv("http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv",row.names=1)

head(ads)
head(model.matrix(Sales ~ .,data=ads))
head(model.matrix(Sales ~ TV + Radio,data=ads))
head(model.matrix(Sales ~ TV + Radio + TV:Radio,data=ads))
head(model.matrix(Sales ~ TV * Radio,data=ads))
head(model.matrix(Sales ~ TV + I(TV^2),data=ads))
head(model.matrix(Sales ~ (TV+Radio+Newspaper)^3,data=ads))


## the update() function is convenient for dropping variables during selection
## backward selection

fit = lm(medv~.,data = Boston)
summary(fit)
fit = update(fit, ~. -age) ## same as lm(medv~. - age,data = Boston)
summary(fit)
fit = update(fit, ~. -indus) ## same as lm(medv~. - age - indus,data = Boston)
summary(fit)
  • Non-linear Transformations of data
library(MASS)


plot(mammals)

#identify(mammals,labels = row.names(mammals))

fit = lm(brain ~ body,data = mammals)
abline(fit,col="Red")
summary(fit)

### use log transformation to lower the effect of the influential points
logmammals = log(mammals)
logfit = lm(brain ~ body,data = logmammals)
plot(logmammals)
abline(logfit,col="Red")
summary(logfit)
  • Qualitative predictors
plot(Carseats)

plot(Carseats$Price,Carseats$Sales,col = Carseats$ShelveLoc,type="n")
points(Sales~Price,data=Carseats[Carseats$ShelveLoc == "Bad",],col="blue")
points(Sales~Price,data=Carseats[Carseats$ShelveLoc == "Medium",],col="green")
points(Sales~Price,data=Carseats[Carseats$ShelveLoc == "Good",],col="red")
legend("topright",col=c("blue","red","green"),legend = c("Bad","Good","Medium"),pch=1)


fit = lm(Sales ~ Price + ShelveLoc, data = Carseats)
summary(fit)
contrasts(Carseats$ShelveLoc)  # coding for the dummy variables

head(model.matrix(~Price+ShelveLoc,data = Carseats))
head(Carseats)

coefs = coef(fit)
abline(coefs[1],coefs[2],col="blue",lty=2)  ## bad
abline(coefs[1]+coefs[4],coefs[2],col="green",lty=2) ## medium
abline(coefs[1]+coefs[3],coefs[2],col="red",lty=2) ## good

#Carseats$ShelveLoc = relevel(Carseats$ShelveLoc,ref="Good") ## change the level order
#rm(Carseats)
  • Polynomial and Interaction term

    • Interaction term: the effect of one predictor variable on the response variable is different at different values of the other predictor variable
### interaction: quantitative and qualitative

### same data set "Carseats"
fit = lm(Sales~ Price * ShelveLoc,data=Carseats)
summary(fit)


coefs = coef(fit)
plot(Carseats$Price,Carseats$Sales,col = Carseats$ShelveLoc,type="n")
abline(coefs[1],coefs[2],col="blue") #bad
abline(coefs[1]+coefs[4],coefs[2]+coefs[6],col="green") #medium
abline(coefs[1]+coefs[3],coefs[2]+coefs[5],col="red") #good



### interaction: quantitative and quantitative

#ads = read.csv("http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv",row.names=1)
fit = lm(Sales ~ .,data = ads)
summary(fit)

# fit1 = lm(Sales ~ . - Newspaper, data = ads) # equivalent to the update() call below
fit1 = update(fit, ~.-Newspaper)
summary(fit1)

## predict the sales when Radio = 10, 25, 50
R10 = predict(fit1,data.frame(Radio=10,TV=seq(10,300,20)))
R25 = predict(fit1,data.frame(Radio=25,TV=seq(10,300,20)))
R50 = predict(fit1,data.frame(Radio=50,TV=seq(10,300,20)))
plot(ads$TV,ads$Sales,xlab="TV",ylab="Sales")
lines(seq(10,300,20),R10,col="blue")
lines(seq(10,300,20),R25,col="green")
lines(seq(10,300,20),R50,col="red")

### the slopes of these 3 lines are the same at the different values of Radio




### compare with the model with the interaction term TV:Radio
fit2 = lm(Sales ~ TV*Radio, data = ads)
summary(fit2)


## predict the sales when Radio = 10, 25, 50
R10 = predict(fit2,data.frame(Radio=10,TV=seq(10,300,20)))
R25 = predict(fit2,data.frame(Radio=25,TV=seq(10,300,20)))
R50 = predict(fit2,data.frame(Radio=50,TV=seq(10,300,20)))
plot(ads$TV,ads$Sales,xlab="TV",ylab="Sales")
lines(seq(10,300,20),R10,col="blue")
lines(seq(10,300,20),R25,col="green")
lines(seq(10,300,20),R50,col="red")

### the slopes of these 3 lines are different because of the different values of Radio
  • Polynomial regression: a form of linear regression in which the relationship between the independent variable x and the dependent variable y is modeled as an nth-degree polynomial.
## polynomial
set.seed(1)

x = round(runif(100,1,10),2)
y = (x-3)^2 + 1 + rnorm(length(x),0,4)
fit = lm(y ~ x)
plot(x,y)
abline(fit,col="red")
plot(predict(fit),residuals(fit)) 


## add a quadratic term x^2, and fit the model again

fit2 = lm(y ~ x + I(x^2))
summary(fit2)
plot(x,y)
lines(x[order(x)],fitted(fit2)[order(x)],col="red")
plot(predict(fit2),residuals(fit2)) 

#par(mfrow=c(2,2))
#par(mfrow=c(1,1))