2017-03-07

Classification

  • The response Y is categorical:
    • Yes or No
    • Up or Down
    • Blood Type: A, B, AB or O
  • Methods
    • Logistic Regression
    • Linear Discriminant Analysis (LDA)
  • Why not a linear regression model?
    • The error terms have non-constant variance
    • \(\epsilon\) is not normally distributed because Y takes on only two values
    • The predicted probabilities can be greater than 1 or less than 0
    • Example:
### response Y is continuous

set.seed(1)

x = runif(20,0,10)
y = 2*x + rnorm(20,0,3)  # noise vector has the same length as x
fit1 = lm(y~x)

par(mfrow=c(1,2))
plot(x,y, main = "Continuous response")
abline(fit1,col="red")
## residual plot
plot(predict(fit1), residuals(fit1), main = "Residual plot", xlab="fitted value", ylab = "residuals")
abline(h = 0, col="red",lty = 3)

### response Y is categorical 

## convert y to 0, 1
y[y<= 10] = 0 # if y<=10, then y = 0
y[y>10] = 1  # otherwise, y = 1

## redo the regression
fit2 = lm(y~x)

par(mfrow=c(1,2))
plot(x,y, main = "Categorical response")
abline(fit2,col="red")
## residual plot
plot(predict(fit2), residuals(fit2), main = "Residual plot", xlab="fitted value", ylab = "residuals")
abline(h = 0, col="red",lty = 3)

Logistic Regression

  • (Binomial) Logistic regression models the relationship between \(p(X) = Pr(Y = 1| X)\) and X.

  • Probability; Odds; log-odds (logit)

    • Probability: p(x)
      • range: [0,1]
    • Odds: \(\frac{p(success)}{p(failure)} = \frac{p(x)}{1 - p(x)}\)
      • range: \([0, +\infty)\)
    • log-odds (logit) : \(log(\frac{p(x)}{1 - p(x)})\)
      • range: \((-\infty, +\infty)\)
  • Logistic Regression model

\(log(\frac{p}{1-p}) = \beta_0 + \beta_1X_1 + \beta_2X_2 + ...\)

  • Convert to probability

\(p(X) = \frac{e^{\beta_0 + \beta_1X}}{1 + e^{\beta_0 + \beta_1X}}\)

We use the logistic function to model p(X) so that the outputs always lie between 0 and 1. There is no longer a straight-line relationship between p(X) and X: the rate of change in p(X) per unit change in X depends on the current value of X.

  • The logistic function \(S(z) = \frac{e^z}{1+e^z}\) maps any real value to a value between 0 and 1 (an S-shaped curve).
curve(exp(x)/(1+exp(x)),ylab = "y", main = "logistic function",xlim=c(-6,6))
abline(h=0,lty=2,col="blue")
abline(h=1,lty=2,col="blue")
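
As a quick numeric check of the three scales above (the probability 0.8 is an arbitrary example):

p = 0.8                        # probability
odds = p / (1 - p)             # odds = 4
logit = log(odds)              # log-odds, about 1.39
exp(logit) / (1 + exp(logit))  # back to the probability: 0.8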

  • Estimating the coefficients

    The estimates \(\hat{\beta}_0, \hat{\beta}_1\) are chosen to maximize the likelihood function (a small direct-optimization check appears after the single-predictor R example below):

    \(l(\beta_0,\beta_1) = \prod_{i:y_i=1}p(x_i) \prod_{i':y_{i'}= 0}(1-p(x_{i'}))\)

  • Logistic Regression with R

    • glm() function with the argument family = binomial

    • predict() function with the argument type
      • default: on the scale of the linear predictors (log-odds)
      • type = "response": on the scale of the response variable (probabilities)
### use one predictor

head(mtcars)
plot(mtcars$mpg,mtcars$vs) # vs : 0 means a V-engine, and 1 straight engine.

#plot(mtcars$mpg,mtcars$am)

fit = glm(vs ~ mpg, data = mtcars, family = binomial)
summary(fit)

logit = predict(fit)
prob = predict(fit, type = "response")

all.equal(exp(logit) / (1 + exp(logit)), prob)  ## convert logit to probability (allow for floating-point rounding)


### draw the regression line
curve(predict(fit,data.frame(mpg = x), type="response"),add=TRUE)
points(mtcars$mpg, prob,col="red")
abline(h = 0.5,lty = 2, col = "blue")


### correctly or incorrectly classified

pred = rep(0,nrow(mtcars))
pred[prob > 0.5] = 1
table(pred, mtcars$vs)
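
As a check on the likelihood function above, the coefficients can also be recovered by minimizing the negative log-likelihood numerically with optim(). This is only an illustrative sketch of the maximum likelihood idea; glm() itself fits the model by iteratively reweighted least squares.

### maximize the likelihood directly (illustration only)
negloglik = function(beta, x, y) {
  eta = beta[1] + beta[2] * x
  sum(log1p(exp(eta)) - y * eta)   # negative log-likelihood of the logistic model
}

opt = optim(c(0, 0), negloglik, x = mtcars$mpg, y = mtcars$vs)
opt$par      # should be close to coef(fit)
coef(fit)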
  • Interpret the coefficients

    • \(e^{\beta_i}\) is the odds ratio associated with a one-unit increase in \(x_i\)
beta1 = coef(fit)[2]
exp(beta1) # The odds multiply by 1.5 for every 1-unit increase in mpg.
## Multiple Logistic Regression

library(ISLR)

set.seed(1)
train = sample(1:10000, 8000)
trainSet = Default[train,]


testX = Default[-train,c("student","balance","income")]
testY = Default[-train,c("default")]

fit = glm(default ~ . , data = trainSet, family = binomial)
summary(fit)
contrasts(Default$default)

# remove the non-significant predictor (income)
fit = update(fit, ~ . - income)
summary(fit)


prob = predict(fit,newdata = testX, type = "response")
pred = rep("No",2000)
pred[prob>0.5] = "Yes"
table(pred,testY)

## for this case, a lower probability threshold may work better (it catches more of the actual defaulters)
pred = rep("No",2000)
pred[prob>0.2] = "Yes"
table(pred,testY)
  • For multi-class classification problems, LDA is more popular.

Linear Discriminant Analysis (LDA)

Bayes' Classifier

The Bayes classifier assigns an observation \(X = x\) to the class for which the posterior probability \(Pr(Y = k|X = x)\) is largest (most likely class).

\(Pr(Y = k|X = x) = \frac{P(X = x|Y = k)P(Y=k)}{P(X=x)} = \frac{\pi_kf_k(x)}{\sum_{l=1}^{K}\pi_lf_l(x)}\)

  • \(\pi_k = P(Y=k)\): prior probability of class k, which can be simply estimated as the fraction of the training observations that belong to the kth class.

  • \(f_k(x) = P(X = x|Y = k)\): likelihood function of X for an observation that comes from the kth class

  • \(Pr(Y = k|X = x)\) : posterior probability that an observation \(X = x\) belongs to the kth class

An example of the Naive Bayes Classifier

The Naive Bayes classifier assumes that the features are independent of each other given the class.

Based on the "patients" data set, we want to know whether a new patient (fever = "Yes", headache = "Mild") will have the flu.

### symptoms and diagnosis data
patients = data.frame(fever = c("Yes","No","Yes","Yes","No","Yes","No","Yes"),
                      headache = c("Mild","No","Strong","Mild","No","Strong","Strong","Mild"), 
                      flu =c("No","Yes","Yes","Yes","No","Yes","No","Yes"))

### look up tables
table(patients$flu)
## 
##  No Yes 
##   3   5
table(patients$flu,patients$headache)
##      
##       Mild No Strong
##   No     1  1      1
##   Yes    2  1      2
table(patients$flu,patients$fever)
##      
##       No Yes
##   No   2   1
##   Yes  1   4
  • \(Pr(flu = Yes| headache = Mild, fever = Yes) \propto Pr(headache = Mild| flu = Yes)Pr(fever = Yes| flu = Yes)Pr(flu = Yes) = 2/5 * 4/5 * 5/8 = 0.2\)
  • \(Pr(flu = No | headache = Mild, fever = Yes) \propto Pr(headache = Mild| flu = No)Pr(fever = Yes| flu = No)Pr(flu = No) = 1/3 * 1/3 * 3/8 \approx 0.04\)

After normalizing, the posterior probabilities are \(\frac{0.2}{0.2+0.04} \approx 0.83\) and \(\frac{0.04}{0.2+0.04} \approx 0.17\), so the new patient is assigned to class flu = "Yes".
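
The same hand calculation can be reproduced from the look-up tables in R (a minimal sketch using conditional proportions; prop.table(, 1) normalizes each row):

### P(flu), P(headache | flu), P(fever | flu)
prior = prop.table(table(patients$flu))
p_headache = prop.table(table(patients$flu, patients$headache), 1)
p_fever = prop.table(table(patients$flu, patients$fever), 1)

### unnormalized posterior for the new patient (headache = "Mild", fever = "Yes")
unnorm = prior * p_headache[, "Mild"] * p_fever[, "Yes"]
unnorm / sum(unnorm)   # about 0.17 for "No" and 0.83 for "Yes"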

Linear Discriminant Analysis (LDA)

The Bayes classifier requires knowing the class-conditional densities \(f_k(X)\). LDA obtains them by assuming that the data within each class are normally distributed with a common variance across all K classes (\(\sigma_1^2 = \sigma_2^2 = ... = \sigma_K^2\)). Simplifying the Bayes classifier equation under these assumptions, the rule is equivalent to assigning the observation to the class for which

\(\delta_k(x) = x \cdot \frac{\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2} + log(\pi_k)\)

is largest.

### Assume the samples are drawn from two normal distributions with different means, N(4,1) and N(8,1)

### Assume the priors are equal
mu1 = 4
mu2 = 8

curve(dnorm(x,mu1,1), col = "red", xlim = c(1,11), ylab = "Density")
curve(dnorm(x,mu2,1), col = "blue", add = T, ylab = "Density")

dis1 = (mu1+mu2)/2  ### calculate the decision boundary

abline(v = dis1, lty = 3)

### The priors of the two classes are 0.3 and 0.7
curve(0.3 * dnorm(x,mu1,1), col = "red", xlim = c(1,11), ylim = c(0,0.3), ylab = "Density")
curve(0.7 * dnorm(x,mu2,1), col = "blue", add = T)

dis2 = ((mu2^2 - mu1^2)/2 + log(0.3) - log(0.7))/(mu2 - mu1) ### decision boundary: solve delta1(x) = delta2(x) for x

abline(v = dis2, lty = 3)
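
To make the rule concrete, the two discriminant scores can be evaluated at a single point; this is a small sketch that plugs x = 5 into \(\delta_1\) and \(\delta_2\) using the unequal priors 0.3 and 0.7 from the code above.

### evaluate the discriminant scores at one point (sigma^2 = 1)
x0 = 5
delta1 = x0 * mu1 - mu1^2 / 2 + log(0.3)
delta2 = x0 * mu2 - mu2^2 / 2 + log(0.7)
c(delta1, delta2)   # delta1 > delta2, consistent with x0 = 5 lying left of the boundary dis2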

In practice, we don't know the population mean and variance, so we have to use sample data to estimate them.

### Draw samples from the two distributions above
set.seed(1)
c1 = rnorm(50,mu1,1)
c2 = rnorm(50,mu2,1)

## since we draw 50 observations from each distribution, the estimated priors are equal

hist(c1,col="red",xlim = c(1,11))

hist(c2,col="blue",add=T)

mu1hat = mean(c1)
mu2hat = mean(c2)
sigmahat = (sum((c1-mu1hat)^2) + sum((c2-mu2hat)^2)) / (100-2)  # pooled variance estimate

dis = (mu1hat + mu2hat)/2   # estimated decision boundary (equal priors)

abline(v = dis1)            # boundary from the population means
abline(v = dis,lty = 3)     # boundary from the sample estimates

When there are multiple predictors, \(X = (X_1,X_2,\dots,X_p)\) is assumed to be drawn from a multivariate normal distribution, and the procedure is the same but in matrix form.

The LDA classifier plugs in the estimates:

  • \(\hat{\pi}_k = n_k/n\)

  • \(\hat{\mu}_k = \frac{1}{n_k} \sum_{i:y_i=k}x_i\)

  • \(\hat{\Sigma} = \frac{1}{n-K}\sum_{k=1}^{K}\sum_{i:y_i=k}(x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^{T}\)

The discriminant functions: \(\hat{\delta_k}(x) = x^{T} \hat{\Sigma}^{-1}\hat{\mu}_k - \frac{1}{2}\hat{\mu_k}^{T}\hat{\Sigma}^{-1}\hat{\mu}_k + log(\hat{\pi}_k)\)

library(LearnBayes)
set.seed(1)

### draw samples from multivariate normal distributions
c1 = rmnorm(50, mean = c(5,5), varcov = matrix(c(1,0.1,0.1,1),2,2))

c2 = rmnorm(50, mean = c(7,3), varcov = matrix(c(1,0.1,0.1,1),2,2))

c3 = rmnorm(50, mean = c(9,7), varcov = matrix(c(1,0.1,0.1,1),2,2))

## construct the data set
c = rbind(c1,c2,c3)
colnames(c) = c("x1","x2")
c = as.data.frame(c)
c$class = rep(c("red","green","blue"), each = 50)

plot(c[,1:2], col = c$class)

## Draw decision boundary based on the discriminant functions
# sigma = (cov(c1) + cov(c2) + cov(c3))*49/147
# m1 = colMeans(c1)
# m2 = colMeans(c2)
# m3 = colMeans(c3)
# 
# b1 = solve(sigma) %*% (m1 - m2)
# b0 = - 1/2*(t(m1) %*% solve(sigma) %*% m1 - t(m2) %*% solve(sigma) %*% m2)
# 
# 
# abline(-b0/b1[2],-b1[1]/b1[2],col = "yellow")

### Linear Discriminant Analysis

library(MASS)

lda.fit = lda(class ~ x1+x2, data = c)
lda.fit
## Call:
## lda(class ~ x1 + x2, data = c)
## 
## Prior probabilities of groups:
##      blue     green       red 
## 0.3333333 0.3333333 0.3333333 
## 
## Group means:
##             x1       x2
## blue  8.968688 7.087073
## green 6.847515 3.061235
## red   5.100448 5.126783
## 
## Coefficients of linear discriminants:
##           LD1        LD2
## x1 -0.6926194 -0.8207234
## x2 -0.6429222  0.7983884
## 
## Proportion of trace:
##    LD1    LD2 
## 0.6916 0.3084
lda.pred = predict(lda.fit)
names(lda.pred)
## [1] "class"     "posterior" "x"
table(lda.pred$class, c$class)
##        
##         blue green red
##   blue    48     2   0
##   green    1    45   1
##   red      1     3  49
mean(lda.pred$class == c$class)
## [1] 0.9466667
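
To connect the lda() output back to the discriminant functions \(\hat{\delta}_k(x)\) above, the scores can also be computed by hand from the class means, the pooled covariance matrix, and the priors. This sketch recomputes the plug-in estimates directly from the data (rather than reading them out of the lda object) and should reproduce the same class assignments.

### reproduce the class assignments from the plug-in discriminant functions
X = as.matrix(c[, c("x1", "x2")])
classes = sort(unique(c$class))                                    # "blue" "green" "red"

## plug-in estimates: class means, priors, pooled covariance (divisor n - K)
mu = t(sapply(classes, function(k) colMeans(X[c$class == k, ])))
pi_k = as.numeric(table(c$class) / nrow(c))
S = matrix(0, 2, 2)
for (k in classes) S = S + crossprod(scale(X[c$class == k, ], center = TRUE, scale = FALSE))
S = S / (nrow(c) - length(classes))

## delta_k(x) = x' S^-1 mu_k - 1/2 mu_k' S^-1 mu_k + log(pi_k), for every observation and class
Sinv = solve(S)
delta = X %*% Sinv %*% t(mu)
delta = sweep(delta, 2, 0.5 * diag(mu %*% Sinv %*% t(mu)) - log(pi_k), "-")

manual.class = classes[max.col(delta)]
mean(manual.class == as.character(lda.pred$class))   # expected to be 1: same rule as lda()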

Quadratic Discriminant Analysis (QDA)

QDA assumes that each class has its own covariance matrix, which means that an observation from the kth class is of the form \(X \sim N(\mu_k, \Sigma_k)\). In LDA, all \(\Sigma_k = \Sigma\).

set.seed(1)

### draw samples from multivariate normal distributions
c1 = rmnorm(50, mean = c(5,5), varcov = matrix(c(1,0.1,0.1,1),2,2))

c2 = rmnorm(50, mean = c(7,3), varcov = matrix(c(2,0.1,0.1,2),2,2))

c3 = rmnorm(50, mean = c(9,7), varcov = matrix(c(3,0.1,0.1,3),2,2))

## construct the data set
c = rbind(c1,c2,c3)
colnames(c) = c("x1","x2")
c = as.data.frame(c)
c$class = rep(c("red","green","blue"), each = 50)


plot(c[,1:2], col = c$class)

### redo lda
lda.fit = lda(class ~ x1+x2, data = c)

lda.pred = predict(lda.fit)

mean(lda.pred$class == c$class)
## [1] 0.8533333
### use qda

qda.fit = qda(class~ ., data = c)
qda.pred = predict(qda.fit)

mean(qda.pred$class == c$class)
## [1] 0.8666667

Comparison between LDA and Logistic Regression (2 classes)

  • When it is a two-class problem, the decision boundaries of logistic regression and LDA are both linear functions of x.
  • The difference is that logistic regression uses the maximum likelihood approach to estimate the coefficients, while LDA plugs in the means and variances estimated from a normal model.

  • LDA assumes that the observations are drawn from a multivariate normal distribution.
  • Logistic regression does not make this distributional assumption.

## When the data are normally distributed

library(LearnBayes)

set.seed(1)
errMatrix = matrix(NA,20,3)
colnames(errMatrix) = c("logistic","lda","qda")

for (i in 1:20){
    ### construct data
    
    # multivariate normal distribution with a common covariance matrix
    # (swap in one of the commented alternatives below to change the scenario)
    c1 = rmnorm(50, mean = c(5, 5), matrix(c(1, 0.1, 0.1, 1), 2, 2))
    c2 = rmnorm(50, mean = c(7, 4), matrix(c(1, 0.1, 0.1, 1), 2, 2))
    
    # # multivariate normal distribution with different variance
    # c1 = rmnorm(50, mean = c(5, 5), matrix(c(1, 0.1, 0.1, 1), 2, 2))
    # c2 = rmnorm(50, mean = c(7, 4), matrix(c(3, 0.1, 0.1, 3), 2, 2))
    
    # # multivariate t distribution
    # c1 = rmt(50, mean = c(5, 5), matrix(c(2, 0.1, 0.1, 2), 2, 2), 3)
    # c2 = rmt(50, mean = c(7, 4), matrix(c(2, 0.1, 0.1, 2), 2, 2), 3)

    
    c = rbind(c1, c2)
    colnames(c) = c("x1", "x2")
    c = as.data.frame(c)
    c$class = as.factor(rep(c("red", "blue"), each = 50))
    

    
    ### perform logistic regression
    logit.fit = glm(class ~ x1 + x2, data = c, family = "binomial")
    logit.pred = predict(logit.fit, type = "response")
    logit.pred = ifelse(logit.pred > 0.5, "red", "blue")
    e1 = 1 - mean(logit.pred == c$class) 
    ### perform lda 
    lda.fit = lda(class ~ x1 + x2, data = c)
    lda.pred = predict(lda.fit)
    e2 = 1 - mean(lda.pred$class == c$class)
    
    ### use qda

    qda.fit = qda(class ~ ., data = c)
    qda.pred = predict(qda.fit)
    e3 = 1 - mean(qda.pred$class == c$class)
    
    errMatrix[i,] = c(e1,e2,e3)
    
}

boxplot(errMatrix)