- Assignment 1 due on Friday
- Submit to stat2450winter@gmail.com
- PDF format
2017-03-03
Linear regression is an approach for modeling the relationship between a dependent variable y and one or more independent variables, denoted X.
Linear relationship vs Non-linear relationship
Multiple linear regression: when there is more than one predictor, the process is called multiple linear regression.
Supervised Learning
Parametric approach
Example
The Advertising data set consists of the sales of a product in 200 different markets, along with advertising budgets for the product in each of those markets for three different media: TV, radio, and newspaper.
Simple linear regression: predicting a quantitative response Y on the basis of a single predictor variable X.
\(Y = \beta_0 + \beta_1X + \epsilon\)
The coefficients are estimated by minimizing the residual sum of squares (RSS):
\(RSS = \sum(y_i - \hat{y}_i)^2 = \sum(y_i - \hat{\beta_0}-\hat{\beta_1}x_i)^2\)
\(\hat{\beta_1} = \frac{\sum(x_i-\bar{x})(y_i-\bar{y})}{\sum(x_i-\bar{x})^2}\)
\(\hat{\beta_0} = \bar{y} - \hat{\beta_1}\bar{x}\)
ads = read.table("../../Advertising.csv", header = T, row.names = 1, sep = ",")
y = ads$Sales
x = ads$TV
beta1 = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
beta0 = mean(y) - beta1 * mean(x)
### using lm() function
fit = lm(Sales ~ TV, data = ads)
summary(fit)
##
## Call:
## lm(formula = Sales ~ TV, data = ads)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -8.3860 -1.9545 -0.1913  2.0671  7.2124
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.032594   0.457843   15.36   <2e-16 ***
## TV          0.047537   0.002691   17.67   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.259 on 198 degrees of freedom
## Multiple R-squared:  0.6119, Adjusted R-squared:  0.6099
## F-statistic: 312.1 on 1 and 198 DF,  p-value: < 2.2e-16
plot(x, y)
abline(fit, col = "red")  ### same as abline(beta0, beta1)
The standard errors associated with \(\hat{\beta_0}\) and \(\hat{\beta_1}\) are:
\(SE(\hat{\beta_0})^2 = \sigma^2[\frac{1}{n} + \frac{\bar{x}^2}{\sum(x_i-\bar{x})^2}]\)
\(SE(\hat{\beta_1})^2 = \frac{\sigma^2}{\sum(x_i-\bar{x})^2}\)
If \(\sigma\) is unknown, use \(RSE = \sqrt{RSS/(n-2)}\) to estimate it. The \(100(1-\alpha)\%\) confidence interval can be computed as,
\(\hat{\beta_0} \pm t_{\alpha/2}SE(\hat{\beta_0})\)
\(\hat{\beta_1} \pm t_{\alpha/2}SE(\hat{\beta_1})\)
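As a sanity check, these standard errors and intervals can be computed by hand and compared with confint(); a minimal sketch, reusing the x, y, beta1, and fit objects created in the code above:

n = length(y)
rse = sqrt(sum(residuals(fit)^2) / (n - 2))            ## estimate of sigma
se_beta1 = rse / sqrt(sum((x - mean(x))^2))
se_beta0 = rse * sqrt(1/n + mean(x)^2 / sum((x - mean(x))^2))
tcrit = qt(0.975, df = n - 2)                          ## t quantile for a 95% interval
c(beta1 - tcrit * se_beta1, beta1 + tcrit * se_beta1)  ## CI for beta1 by hand
confint(fit, level = 0.95)                             ## compare with confint()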
Is There a Relationship Between the Response and the Predictor?
\(H_0 : \beta_1 = 0\)
\(H_a : \beta_1 \neq 0\)
Under \(H_0\), the test statistic \(t = \frac{\hat{\beta_1}-0}{SE(\hat{\beta_1})}\) follows a \(t_{n-2}\) distribution.
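A sketch of computing this t statistic and its two-sided p-value directly, reusing n and se_beta1 from the sketch above, and comparing with the coefficient table printed by summary(fit):

t_stat = beta1 / se_beta1                   ## t statistic for H0: beta1 = 0
p_value = 2 * pt(-abs(t_stat), df = n - 2)  ## two-sided p-value
c(t_stat, p_value)                          ## compare with summary(fit)$coefficients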
How well does the model fit?
The RSE is an estimate of the standard deviation of ε. It is the average amount that the response will deviate from the true regression line.
RSE is an absolute measure of lack of fit of the model.
\(RSE = \sqrt{RSS/(n-2)}\)
\(R^2\) statistic (coefficient of determination)
It is the percentage of the response variable variation that is explained by a linear model.
\(R^2 = \frac{TSS - RSS}{TSS} = 1 - \frac{RSS}{TSS}\)
In general, the higher the R-squared, the better the model fits the data.
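Both measures can be computed directly from the simple regression fit of Sales on TV (a sketch, reusing fit, x, and y from above):

n = length(y)
rss = sum(residuals(fit)^2)
tss = sum((y - mean(y))^2)
rse = sqrt(rss / (n - 2))   ## residual standard error
r2  = 1 - rss / tss         ## coefficient of determination
c(rse, r2)                  ## compare with the summary(fit) output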
Confidence interval for the mean response \(\mu_{y|x}\) at a given value \(x\):
\(SE(\hat{\mu}_{y|x})^2 = \sigma^2[\frac{1}{n} + \frac{(x-\bar{x})^2}{\sum(x_i-\bar{x})^2}]\)
\(\hat{\mu}_{y|x} \pm t_{\alpha/2}SE(\hat{\mu}_{y|x})\)
Prediction interval for a new observation at \(x\), which also accounts for the irreducible error \(\epsilon\):
\(SE(\hat{y}|x)^2 = var(\hat{\mu}_{y|x}) + var(\epsilon) = \sigma^2[\frac{1}{n} + \frac{(x-\bar{x})^2}{\sum(x_i-\bar{x})^2}] + \sigma^2 = \sigma^2[1+\frac{1}{n} + \frac{(x-\bar{x})^2}{\sum(x_i-\bar{x})^2}]\)
\(\hat{y}|x \pm t_{\alpha/2}SE(\hat{y}|x)\)
### make the prediction and intervals
fit = lm(Sales ~ TV, data = ads)
predict(fit, data.frame(TV = c(50, 100, 150)), interval = "confidence", level = 0.95)
##         fit       lwr      upr
## 1  9.409426  8.722696 10.09616
## 2 11.786258 11.267820 12.30470
## 3 14.163090 13.708423 14.61776
predict(fit, data.frame(TV = c(50, 100, 150)), interval = "prediction", level = 0.95)
##         fit      lwr      upr
## 1  9.409426 2.946709 15.87214
## 2 11.786258 5.339251 18.23326
## 3 14.163090 7.720898 20.60528
xs = seq(min(ads$TV), max(ads$TV), length.out = 100)
CI = predict(fit, data.frame(TV = xs), interval = "confidence", level = 0.95)
PI = predict(fit, data.frame(TV = xs), interval = "prediction", level = 0.95)
plot(Sales ~ TV, data = ads)
abline(fit, col = "red")
## confidence interval
lines(xs, CI[, 2], col = "blue")
lines(xs, CI[, 3], col = "blue")
## prediction interval
lines(xs, PI[, 2], col = "green")
lines(xs, PI[, 3], col = "green")
library(MASS)
library(ISLR)
# Simple Linear Regression
names(Boston)                             ## show column names
lm.fit = lm(medv ~ lstat, data = Boston)  ## fit the linear regression model
lm.fit                                    ## basic information
summary(lm.fit)                           ## detailed information
names(lm.fit)                             ## get the names of this "lm" object
coefs = coef(lm.fit)                      # coefficients
resid = residuals(lm.fit)                 # residuals
pred = predict(lm.fit)                    # fitted values
ci = confint(lm.fit)                      # confidence interval
plot(Boston$lstat, Boston$medv)
abline(lm.fit, col = "red")
set.seed(1)
x = round(runif(100, 1, 10), 2)
y = 3*x + 2
y = y + rnorm(length(x), 0, 4)
fit1 = lm(y ~ x)
plot(x, y)
abline(fit1, col = "red")
plot(predict(fit1), residuals(fit1), ylim = c(-15, 15))
## non-linear
y = (x - 3)^2 + 1 + rnorm(length(x), 0, 4)
fit2 = lm(y ~ x)
plot(x, y)
abline(fit2, col = "red")
plot(predict(fit2), residuals(fit2))
## QQ plot
set.seed(1)
x = round(runif(200, 1, 10), 2)
y = 3*x + 2 + rnorm(length(x), 0, 4)
# hist(rnorm(1000, 0, 2))
# hist(rnorm(1000, 0, 2)^2)
fit = lm(y ~ x)
plot(x, y)
abline(fit, col = "red")
hist(residuals(fit), breaks = 20)
qqnorm(residuals(fit))
qqline(residuals(fit))
plot(predict(fit), residuals(fit))
set.seed(1)
x = round(runif(50, 1, 10), 2)
y = 3*x + 2 + rnorm(length(x), 0, 4)
fit = lm(y ~ x)
plot(predict(fit), residuals(fit))
newx = sort(x)
newy = 3*newx + 2 + rnorm(length(newx), 0, seq(0, length.out = length(newx), by = 0.2))
plot(x, y, main = "constant variance")
plot(newx, newy, main = "non-constant variance")
fit = lm(y ~ x)
plot(predict(fit), residuals(fit), main = "constant variance")
fit1 = lm(newy ~ newx)
plot(predict(fit1), residuals(fit1), main = "non-constant variance")
Outliers
Outliers: observations for which the response \(y_i\) is unusual given the predictor \(x_i\).
Observations whose studentized residuals are greater than 3 in absolute value are possible outliers.
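A quick way to apply this rule of thumb in R (a sketch, assuming fit is any fitted lm object):

rs = rstudent(fit)            ## studentized residuals
which(abs(rs) > 3)            ## indices of possible outliers
plot(predict(fit), rs)        ## studentized residuals vs fitted values
abline(h = c(-3, 3), lty = 2)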
High Leverage Points
set.seed(1)
x = round(runif(30, 1, 10), 2)
newP = c(6, 40)  ## 15, 60  ## add an abnormal point
y = 3*x + 2 + rnorm(length(x), 0, 4)
newx = c(newP[1], x)
newy = c(newP[2], y)
plot(newx, newy)
points(newP[1], newP[2], col = "red")
# without the outlier
fit1 = lm(y ~ x)
abline(fit1, col = "blue")
# with the outlier
fit2 = lm(newy ~ newx)
abline(fit2, col = "red")
summary(fit1)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -8.2427 -1.3139 -0.4592  2.3613  5.1237
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   2.4305     1.3637   1.782   0.0856 .
## x             2.9782     0.2211  13.470 9.31e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.164 on 28 degrees of freedom
## Multiple R-squared:  0.8663, Adjusted R-squared:  0.8615
## F-statistic: 181.4 on 1 and 28 DF,  p-value: 9.313e-14
summary(fit2)
##
## Call:
## lm(formula = newy ~ newx)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -8.9190 -1.7331 -0.9879  1.9915 19.0494
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   2.8507     2.0465   1.393    0.174
## newx          3.0166     0.3321   9.084 5.58e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.755 on 29 degrees of freedom
## Multiple R-squared:  0.7399, Adjusted R-squared:  0.731
## F-statistic: 82.51 on 1 and 29 DF,  p-value: 5.575e-10
plot(predict(fit2), residuals(fit2))  ## ordinary residuals vs fitted values
plot(predict(fit2), rstudent(fit2))   ## studentized residuals vs fitted values
# high leverage point
newP = c(15, 60)
# plot(hatvalues(fit2))  ## leverage plot
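A rule-of-thumb check for high leverage compares each leverage statistic with the average leverage \((p+1)/n\); a sketch using fit2, newx, and newy from the code above:

lev = hatvalues(fit2)             ## leverage statistic for each observation
p = 1                             ## one predictor
avg_lev = (p + 1) / length(newy)  ## average leverage (p+1)/n
which(lev > 2 * avg_lev)          ## a common cutoff: more than twice the average
plot(lev, rstudent(fit2))         ## leverage vs studentized residuals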
## three separate simple linear regressions
fitTV = lm(Sales ~ TV, data = ads)
summary(fitTV)
fitRadio = lm(Sales ~ Radio, data = ads)
summary(fitRadio)
fitNews = lm(Sales ~ Newspaper, data = ads)
summary(fitNews)
## a multiple linear regression
fit = lm(Sales ~ TV + Radio + Newspaper, data = ads)
summary(fit)
## Scatter plot of data set "Advertising"
plot(ads)
cor(ads)
The multiple linear regression model with p distinct predictors: \(Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + ... + \beta_pX_p + \epsilon\)
\(\beta_j\) : the average effect on Y of a one unit increase in \(X_j\), holding all other predictors fixed.
residual standard error: \(RSE = \sqrt{\frac{RSS}{n-p-1}}\)
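The RSE of the multiple regression fit above can be computed by hand and compared with the "Residual standard error" line of summary(fit) (a sketch, assuming fit is the three-predictor model from the previous code):

n = nrow(ads)
p = 3                                            ## TV, Radio, Newspaper
rse = sqrt(sum(residuals(fit)^2) / (n - p - 1))
rse                                              ## compare with summary(fit)$sigma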
Matrix form:
\(\mathbf{Y} = \mathbf{X\beta} + \mathbf{\epsilon}\)
\(\hat{\beta} = (X'X)^{-1}X'Y\)
y = ads$Sales
x = cbind(rep(1, length(y)), ads[, 1:3])
colnames(x)[1] = "Intercept"
x = as.matrix(x)  ## convert to matrix
beta = solve(t(x) %*% x) %*% t(x) %*% y
beta       ## least squares estimates by hand
coef(fit)  ## compare with lm()
## (Intercept)          TV
##  7.03259355  0.04753664
Multiple Coefficient of Determination
TSS: total variation in y \(\sum(y_i - \bar{y})^2\)
RSS: variation unexplained after our regression \(\sum(y_i-\hat{y_i})^2\)
ESS: variation in y explained by our regression \(\sum(\hat{y_i} - \bar{y})^2\)
\(R^2 = \frac{ESS}{TSS} = 1- \frac{RSS}{TSS}\)
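These three quantities can be verified numerically for the multiple regression fit (a sketch, reusing fit and y = ads$Sales from above):

yhat = predict(fit)
TSS = sum((y - mean(y))^2)
RSS = sum((y - yhat)^2)
ESS = sum((yhat - mean(y))^2)
c(TSS, ESS + RSS)  ## the decomposition TSS = ESS + RSS holds
ESS / TSS          ## equals the R-squared reported by summary(fit)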
\(R^2\) always increases as new predictors are added, but this does not by itself mean that the model explains the response any better.
summary(fit)
ads2 = ads
set.seed(1)
ads2$Noise = round(runif(nrow(ads), 5, 15), 2)  ## add a pure-noise predictor
fit2 = lm(Sales ~ TV + Radio + Newspaper + Noise, data = ads2)
summary(fit2)  ## compare R-squared with summary(fit)
\(R^2_a = 1 - (1-R^2)(\frac{n-1}{n-p-1}) = 1 - \frac{n-1}{n-p-1}\frac{RSS}{TSS}\)
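A sketch of computing the adjusted R-squared by hand for the noise-augmented model fit2 above and comparing it with the value reported by summary():

n = nrow(ads2)
p = 4                                   ## TV, Radio, Newspaper, Noise
r2 = summary(fit2)$r.squared
r2_adj = 1 - (1 - r2) * (n - 1) / (n - p - 1)
c(r2_adj, summary(fit2)$adj.r.squared)  ## the two values agree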
\(H_0 : \beta_1 = \beta_2 = ... = \beta_p =0\)
\(H_a : \text{one or more slope parameters is non-zero}\)
\(\frac{(TSS-RSS)/p}{RSS/(n-p-1)} \sim F_{p,n-p-1}\)
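The overall F statistic can be computed directly and compared with the last line of the summary() output (a sketch, refitting the three-predictor Advertising model):

fit = lm(Sales ~ TV + Radio + Newspaper, data = ads)
n = nrow(ads); p = 3
rss = sum(residuals(fit)^2)
tss = sum((ads$Sales - mean(ads$Sales))^2)
Fstat = ((tss - rss) / p) / (rss / (n - p - 1))
Fstat                                          ## compare with summary(fit)
pf(Fstat, p, n - p - 1, lower.tail = FALSE)    ## p-value of the overall F-test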
Variable Selection
library(ISLR)
# ads = read.csv("http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv", row.names = 1)
head(ads)
head(model.matrix(Sales ~ ., data = ads))
head(model.matrix(Sales ~ TV + Radio, data = ads))
head(model.matrix(Sales ~ TV + Radio + TV:Radio, data = ads))
head(model.matrix(Sales ~ TV * Radio, data = ads))
head(model.matrix(Sales ~ TV + I(TV^2), data = ads))
head(model.matrix(Sales ~ (TV + Radio + Newspaper)^3, data = ads))
## the update() function is convenient for selecting important variables
## backward selection
fit = lm(medv ~ ., data = Boston)
summary(fit)
fit = update(fit, ~ . - age)    ## same as lm(medv ~ . - age, data = Boston)
summary(fit)
fit = update(fit, ~ . - indus)  ## same as lm(medv ~ . - age - indus, data = Boston)
summary(fit)
library(MASS)
plot(mammals)
# identify(mammals, labels = row.names(mammals))
fit = lm(brain ~ body, data = mammals)
abline(fit, col = "red")
summary(fit)
### use log transformation to lower the effect of the influential points
logmammals = log(mammals)
logfit = lm(brain ~ body, data = logmammals)
plot(logmammals)
abline(logfit, col = "red")
summary(logfit)
plot(Carseats)
plot(Carseats$Price, Carseats$Sales, col = Carseats$ShelveLoc, type = "n")
points(Sales ~ Price, data = Carseats[Carseats$ShelveLoc == "Bad", ], col = "blue")
points(Sales ~ Price, data = Carseats[Carseats$ShelveLoc == "Medium", ], col = "green")
points(Sales ~ Price, data = Carseats[Carseats$ShelveLoc == "Good", ], col = "red")
legend("topright", col = c("blue", "red", "green"), legend = c("Bad", "Good", "Medium"), pch = 1)
fit = lm(Sales ~ Price + ShelveLoc, data = Carseats)
summary(fit)
contrasts(Carseats$ShelveLoc)  # coding for the dummy variables
head(model.matrix(~ Price + ShelveLoc, data = Carseats))
head(Carseats)
coefs = coef(fit)
abline(coefs[1], coefs[2], col = "blue", lty = 2)              ## bad
abline(coefs[1] + coefs[4], coefs[2], col = "green", lty = 2)  ## medium
abline(coefs[1] + coefs[3], coefs[2], col = "red", lty = 2)    ## good
# Carseats$ShelveLoc = relevel(Carseats$ShelveLoc, ref = "Good")  ## change the level order
# rm(Carseats)
Polynomial and Interaction Terms
### interaction: quantitative and qualitative
### same data set "Carseats"
fit = lm(Sales ~ Price * ShelveLoc, data = Carseats)
summary(fit)
coefs = coef(fit)
plot(Carseats$Price, Carseats$Sales, col = Carseats$ShelveLoc, type = "n")
abline(coefs[1], coefs[2], col = "blue")                         # bad
abline(coefs[1] + coefs[4], coefs[2] + coefs[6], col = "green")  # medium
abline(coefs[1] + coefs[3], coefs[2] + coefs[5], col = "red")    # good
### interaction: quantitative and quantitative
# ads = read.csv("http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv", row.names = 1)
fit = lm(Sales ~ ., data = ads)
summary(fit)
# fit1 = lm(Sales ~ . - Newspaper, data = ads)  # equivalent to the update() call below
fit1 = update(fit, ~ . - Newspaper)
summary(fit1)
## predict the sales when Radio = 10, 25, 50
R10 = predict(fit1, data.frame(Radio = 10, TV = seq(10, 300, 20)))
R25 = predict(fit1, data.frame(Radio = 25, TV = seq(10, 300, 20)))
R50 = predict(fit1, data.frame(Radio = 50, TV = seq(10, 300, 20)))
plot(ads$TV, ads$Sales, xlab = "TV", ylab = "Sales")
lines(seq(10, 300, 20), R10, col = "blue")
lines(seq(10, 300, 20), R25, col = "green")
lines(seq(10, 300, 20), R50, col = "red")
### the slopes of these 3 lines are the same at the different values of Radio
### compare with the model with the interaction term TV:Radio
fit2 = lm(Sales ~ TV * Radio, data = ads)
summary(fit2)
## predict the sales when Radio = 10, 25, 50
R10 = predict(fit2, data.frame(Radio = 10, TV = seq(10, 300, 20)))
R25 = predict(fit2, data.frame(Radio = 25, TV = seq(10, 300, 20)))
R50 = predict(fit2, data.frame(Radio = 50, TV = seq(10, 300, 20)))
plot(ads$TV, ads$Sales, xlab = "TV", ylab = "Sales")
lines(seq(10, 300, 20), R10, col = "blue")
lines(seq(10, 300, 20), R25, col = "green")
lines(seq(10, 300, 20), R50, col = "red")
### the slopes of these 3 lines differ because the interaction term makes the TV slope depend on Radio
- Polynomial regression: a form of linear regression in which the relationship between the independent variable x and the dependent variable y is modeled as an nth-degree polynomial in x.
## polynomial
set.seed(1)
x = round(runif(100, 1, 10), 2)
y = (x - 3)^2 + 1 + rnorm(length(x), 0, 4)
fit = lm(y ~ x)
plot(x, y)
abline(fit, col = "red")
plot(predict(fit), residuals(fit))
## add a quadratic term x^2, and fit the model again
fit2 = lm(y ~ x + I(x^2))
summary(fit2)
plot(x, y)
lines(x[order(x)], fitted(fit2)[order(x)], col = "red")
plot(predict(fit2), residuals(fit2))
# par(mfrow = c(2, 2))
# par(mfrow = c(1, 1))