Bayes' Classifier
The Bayes classifier assigns an observation \(X = x\) to the most likely class, i.e., the class for which the posterior probability \(Pr(Y = k|X = x)\) is largest.
\(Pr(Y = k|X = x) = \frac{P(X = x|Y = k)P(Y=k)}{P(X=x)} = \frac{\pi_kf_k(x)}{\sum_{l=1}^{K}\pi_lf_l(x)}\)
\(\pi_k = P(Y=k)\): prior probability of class k, which can be simply estimated as the fraction of the training observations that belong to the kth class.
\(f_k(x) = P(X = x|Y = k)\): likelihood function of X for an observation that comes from the kth class.
\(Pr(Y = k|X = x)\): posterior probability that an observation \(X = x\) belongs to the kth class.
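To make the formula concrete, here is a minimal sketch of the posterior calculation for two hypothetical classes (the priors, the normal likelihoods, and the query point x0 below are made up for illustration):
### Minimal sketch: posterior probabilities for two made-up classes
pi_k = c(0.4, 0.6)                                  # prior probabilities P(Y = k)
f_k = function(x) c(dnorm(x, 2, 1), dnorm(x, 5, 1)) # likelihoods f_1(x), f_2(x)
x0 = 3                                              # the observation to classify
posterior = pi_k * f_k(x0) / sum(pi_k * f_k(x0))
posterior   # the Bayes classifier assigns x0 to class which.max(posterior)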
An example of the Naive Bayes Classifier
Naive Bayes classifier assumes that the features are independent of each other given the class.
Based on the "patients" data set, we want to know if a new patient (fever = "Yes""; headache = "Mild") will have a flu?
### symptoms and diagnosis data
patients = data.frame(fever = c("Yes","No","Yes","Yes","No","Yes","No","Yes"),
headache = c("Mild","No","Strong","Mild","No","Strong","Strong","Mild"),
flu =c("No","Yes","Yes","Yes","No","Yes","No","Yes"))
### look up tables
table(patients$flu)
##
## No Yes
## 3 5
table(patients$flu,patients$headache)
##
## Mild No Strong
## No 1 1 1
## Yes 2 1 2
table(patients$flu,patients$fever)
##
## No Yes
## No 2 1
## Yes 1 4
- \(Pr(flu = Yes | headache = Mild, fever = Yes) \propto Pr(headache = Mild | flu = Yes)Pr(fever = Yes | flu = Yes)Pr(flu = Yes) = 2/5 * 4/5 * 5/8 = 0.2\)
- \(Pr(flu = No | headache = Mild, fever = Yes) \propto Pr(headache = Mild | flu = No)Pr(fever = Yes | flu = No)Pr(flu = No) = 1/3 * 1/3 * 3/8 \approx 0.04\)
Normalizing, the posterior probabilities are \(\frac{0.2}{0.2+0.04} = 0.83\) and \(\frac{0.04}{0.2+0.04} = 0.17\), so we assign the new patient to class flu = "Yes".
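This hand calculation can also be reproduced from the look-up tables above; a minimal sketch (not a general naive Bayes implementation):
### Reproduce the hand calculation from the look-up tables
prior = prop.table(table(patients$flu))                        # P(flu)
likH = prop.table(table(patients$flu, patients$headache), 1)   # P(headache | flu)
likF = prop.table(table(patients$flu, patients$fever), 1)      # P(fever | flu)
post = prior * likH[, "Mild"] * likF[, "Yes"]
post / sum(post)   # posteriors for flu = "No" and "Yes", about 0.17 and 0.83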
Linear Discriminant Analysis (LDA)
The Bayes classifier requires knowing the class conditional densities \(f_k(x)\). LDA makes this tractable by assuming that the observations within each class are normally distributed, with a class-specific mean and the same variance across all K classes, \(\sigma_1 = \sigma_2 = ... = \sigma_K\). Simplifying the Bayes classifier under these assumptions, it is equivalent to assigning the observation to the class for which
\(\delta_k(x) = x \cdot \frac{\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2} + log(\pi_k)\)
is largest.
### Assume the samples are drawn from two normal distributions with different means: N(4,1) and N(8,1)
### Assume the prior is the same
mu1 = 4
mu2 = 8
curve(dnorm(x,mu1,1), col = "red", xlim = c(1,11), ylab = "Density")
curve(dnorm(x,mu2,1), col = "blue", add = T, ylab = "Density")
dis1 = (mu1+mu2)/2 ### calculate the decision boundary
abline(v = dis1, lty = 3)

### The priors of two classes are 0.3, 0.7
curve(0.3 * dnorm(x,mu1,1), col = "red", xlim = c(1,11), ylim = c(0,0.3), ylab = "Density")
curve(0.7 * dnorm(x,mu2,1), col = "blue", add = T)
dis2 = ((mu2^2 - mu1^2)/2 + log(0.3) - log(0.7))/(mu2 - mu1) ### decision boundary
abline(v = dis2, lty = 3)
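As a sanity check, the two discriminant functions should take the same value at this boundary; a small sketch using the same means and priors:
### delta_1 and delta_2 agree at the decision boundary
delta = function(x, mu, prior, sigma2 = 1) x*mu/sigma2 - mu^2/(2*sigma2) + log(prior)
delta(dis2, mu1, 0.3)   # discriminant score of class 1 at the boundary
delta(dis2, mu2, 0.7)   # discriminant score of class 2 -- the same value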

In practice, we don't know the population mean and variance, so we have to use sample data to estimate them.
### Draw samples from the two distributions above
set.seed(1)
c1 = rnorm(50,mu1,1)
c2 = rnorm(50,mu2,1)
## since we draw 50 observations from each distribution, the priors are equal
hist(c1,col="red",xlim = c(1,11))
hist(c2,col="blue",add=T)
mu1hat = mean(c1)
mu2hat = mean(c2)
sigmahat = (sum((c1-mu1hat)^2) + sum((c2-mu2hat)^2)) / (100-2) ### pooled variance estimate
dis = (mu1hat + mu2hat)/2 ### estimated decision boundary (equal priors)
abline(v = dis1) ### true boundary (solid line)
abline(v = dis,lty = 3) ### estimated boundary (dashed line)

When there are multiple predictors, \(X = (X_1, X_2, \dots, X_p)\) is assumed to be drawn from a multivariate normal distribution, and the procedure is the same but in matrix form.
The LDA classifier plugs the sample estimates into the discriminant functions:
\(\hat{\delta}_k(x) = x^{T} \hat{\Sigma}^{-1}\hat{\mu}_k - \frac{1}{2}\hat{\mu}_k^{T}\hat{\Sigma}^{-1}\hat{\mu}_k + log(\hat{\pi}_k)\)
library(LearnBayes)
set.seed(1)
### draw samples from multivariate normal distributions
c1 = rmnorm(50, mean = c(5,5), varcov = matrix(c(1,0.1,0.1,1),2,2))
c2 = rmnorm(50, mean = c(7,3), varcov = matrix(c(1,0.1,0.1,1),2,2))
c3 = rmnorm(50, mean = c(9,7), varcov = matrix(c(1,0.1,0.1,1),2,2))
## construct the data set
c = rbind(c1,c2,c3)
colnames(c) = c("x1","x2")
c = as.data.frame(c)
c$class = rep(c("red","green","blue"), each = 50)
plot(c[,1:2], col = c$class)

## Draw decision boundary based on the discriminant functions
# sigma = (cov(c1) + cov(c2) + cov(c3))*49/147
# m1 = colMeans(c1)
# m2 = colMeans(c2)
# m3 = colMeans(c3)
#
# b1 = solve(sigma) %*% (m1 - m2)
# b0 = - 1/2*(t(m1) %*% solve(sigma) %*% m1 - t(m2) %*% solve(sigma) %*% m2)
#
#
# abline(-b0/b1[2],-b1[1]/b1[2],col = "yellow")
### Linear Discriminant Analysis
library(MASS)
lda.fit = lda(class ~ x1+x2, data = c)
lda.fit
## Call:
## lda(class ~ x1 + x2, data = c)
##
## Prior probabilities of groups:
## blue green red
## 0.3333333 0.3333333 0.3333333
##
## Group means:
## x1 x2
## blue 8.968688 7.087073
## green 6.847515 3.061235
## red 5.100448 5.126783
##
## Coefficients of linear discriminants:
## LD1 LD2
## x1 -0.6926194 -0.8207234
## x2 -0.6429222 0.7983884
##
## Proportion of trace:
## LD1 LD2
## 0.6916 0.3084
lda.pred = predict(lda.fit)
names(lda.pred)
## [1] "class" "posterior" "x"
table(lda.pred$class, c$class)
##
## blue green red
## blue 48 2 0
## green 1 45 1
## red 1 3 49
mean(lda.pred$class == c$class)
## [1] 0.9466667
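The plug-in discriminant functions \(\hat{\delta}_k(x)\) can also be evaluated by hand and compared with predict(); a sketch that recomputes the pooled covariance and class means from the simulated data (these plug-in estimates are computed here for illustration, not taken from the lda() object):
### Evaluate the plug-in discriminant functions manually
Sigmahat = (cov(c1) + cov(c2) + cov(c3)) * 49/147      # pooled covariance estimate
mus = rbind(red = colMeans(c1), green = colMeans(c2), blue = colMeans(c3))
deltahat = function(x, mu, prior = 1/3)
  as.numeric(x %*% solve(Sigmahat) %*% mu - 0.5 * t(mu) %*% solve(Sigmahat) %*% mu + log(prior))
x0 = unlist(c[1, c("x1","x2")])                        # the first observation
scores = apply(mus, 1, function(mu) deltahat(x0, mu))
names(which.max(scores))                               # should match lda.pred$class[1]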
Quadratic Discriminant Analysis (QDA)
QDA assumes that each class has its own covariance matrix, which means that an observation from the kth class is of the form \(X \sim N(\mu_k, \Sigma_k)\). In LDA, all \(\Sigma_k = \Sigma\).
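For reference, the corresponding quadratic discriminant function is \(\delta_k(x) = -\frac{1}{2}(x-\mu_k)^{T}\Sigma_k^{-1}(x-\mu_k) - \frac{1}{2}log|\Sigma_k| + log(\pi_k)\); because \(\Sigma_k\) differs across classes, the decision boundaries are quadratic in \(x\).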
set.seed(1)
### draw samples from multivariate normal distributions
c1 = rmnorm(50, mean = c(5,5), varcov = matrix(c(1,0.1,0.1,1),2,2))
c2 = rmnorm(50, mean = c(7,3), varcov = matrix(c(2,0.1,0.1,2),2,2))
c3 = rmnorm(50, mean = c(9,7), varcov = matrix(c(3,0.1,0.1,3),2,2))
## construct the data set
c = rbind(c1,c2,c3)
colnames(c) = c("x1","x2")
c = as.data.frame(c)
c$class = rep(c("red","green","blue"), each = 50)
plot(c[,1:2], col = c$class)

### redo lda
lda.fit = lda(class ~ x1+x2, data = c)
lda.pred = predict(lda.fit)
mean(lda.pred$class == c$class)
## [1] 0.8533333
### use qda
qda.fit = qda(class~ ., data = c)
qda.pred = predict(qda.fit)
mean(qda.pred$class == c$class)
## [1] 0.8666667
Comparison between LDA and Logistic Regression (2 classes)
- For a two-class problem, the decision boundaries of logistic regression and LDA are both linear functions of x.
The difference is that logistic regression estimates the coefficients by maximum likelihood, while LDA estimates them from the estimated means and variance of a normal distribution.
- LDA assumes that the observations are drawn from a multivariate normal distribution.
Logistic regression makes no such distributional assumption.
## When the data are normally distributed
library(LearnBayes)
set.seed(1)
errMatrix = matrix(NA,20,3)
colnames(errMatrix) = c("logistic","lda","qda")
for (i in 1:20){
### construct data: multivariate normal distributions with the same covariance
c1 = rmnorm(50, mean = c(5, 5), matrix(c(1, 0.1, 0.1, 1), 2, 2))
c2 = rmnorm(50, mean = c(7, 4), matrix(c(1, 0.1, 0.1, 1), 2, 2))
# # alternative: multivariate normal distributions with different covariance
# c1 = rmnorm(50, mean = c(5, 5), matrix(c(1, 0.1, 0.1, 1), 2, 2))
# c2 = rmnorm(50, mean = c(7, 4), matrix(c(3, 0.1, 0.1, 3), 2, 2))
# # alternative: multivariate t distribution
# c1 = rmt(50, mean = c(5, 5), matrix(c(2, 0.1, 0.1, 2), 2, 2), 3)
# c2 = rmt(50, mean = c(7, 4), matrix(c(2, 0.1, 0.1, 2), 2, 2), 3)
c = rbind(c1, c2)
colnames(c) = c("x1", "x2")
c = as.data.frame(c)
c$class = as.factor(rep(c("red", "blue"), each = 50))
### perform logistic regression
logit.fit = glm(class ~ x1 + x2, data = c, family = "binomial")
logit.pred = predict(logit.fit, type = "response")
logit.pred = ifelse(logit.pred > 0.5, "red", "blue")
e1 = 1 - mean(logit.pred == c$class)
### perform lda
lda.fit = lda(class ~ x1 + x2, data = c)
lda.pred = predict(lda.fit)
e2 = 1 - mean(lda.pred$class == c$class)
### use qda
qda.fit = qda(class ~ ., data = c)
qda.pred = predict(qda.fit)
e3 = 1 - mean(qda.pred$class == c$class)
errMatrix[i,] = c(e1,e2,e3)
}
boxplot(errMatrix)