+(Pro): Works directly from table data
+(Pro): Easiest to understand
+(Pro): Relevant features are extracted automatically by Random Forests
+(Pro): Results are human-readable and easy to interpret
+(Con): CART trees are prone to overfitting and not robust to outliers
+(Con): Random forests lose the ease of interpretation of CART trees
+(Con): Models take a lot of memory for a large number of classes and features
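As a point of reference, here is a minimal sketch of the two tree-based baselines on the built-in iris data (an illustrative choice; the rpart and randomForest packages are assumed to be installed):
library(rpart)
library(randomForest)
# A single CART tree: the printed splits are human-readable rules
cart <- rpart(Species ~ ., data = iris)
print(cart)
# A random forest: harder to read, but feature relevance comes out automatically
set.seed(1)
rf <- randomForest(Species ~ ., data = iris)
importance(rf)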
SVM is a discriminative classifier.
It works by finding hyperplanes that separate two classes by the largest possible margin.
The data do not need to be perfectly linearly separable: the soft-margin formulation also penalizes points that end up on the wrong side of the hyperplane.
SVM was originally formulated for binary classification in Euclidean space (linear SVM), but it admits extensions to non-linear data (via the so-called kernel trick) and to multi-class classification.
+(Pro): Relatively straightforward to understand
+(Pro): Results are human-readable and easy to interpret
+(Con): No unique method for non-binary classification
+(Con): Not very flexible
+(Con): Requires the construction of an appropriate kernel
+(Con): Computationally intensive
K-Nearest Neighbours (K-NN) classifiers work by inspecting the local neighbourhood of the point to be predicted and voting for the most frequent class among its neighbours.
The basic difference between the K-NN classifier and the Naive Bayes classifier is that the former is a discriminative classifier while the latter is a generative classifier. … Naive Bayes is an eager learning classifier and it is much faster than K-NN. Thus, it could be used for prediction in real time.
K-NN is a discriminative classifier.
Naive Bayes is a ‘linear classifier’ while K-NN is not.
K-NN is NOT a probabilistic method, although you can treat it as one and estimate class probabilities with a naive estimator in the neighbourhood of the point to be predicted. The empirical probability of a class \(y\) is calculated as:
\[ P(y \mid x) = \frac{\#\{\text{neighbours of } x \text{ with label } y\}}{|NN(x)|} \]
where \(NN(x)\) is the set of nearest neighbours of the point \(x\), for some choice of metric.
In this sense, K-NN is equivalent to a non-parametric estimation method.
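A minimal sketch of this empirical estimate, using the class package on the built-in iris data (an illustrative choice): the prob attribute returned by knn() is exactly the share of neighbours voting for the winning class.
library(class)
set.seed(1)
idx <- sample(nrow(iris), 100)                  # random train/test split
pred <- knn(train = iris[idx, 1:4], test = iris[-idx, 1:4],
            cl = iris$Species[idx], k = 7, prob = TRUE)
# Proportion of the 7 neighbours voting for the predicted class,
# i.e. the empirical P(y|x) above for the winning y
head(attr(pred, "prob"))
mean(pred != iris$Species[-idx])                # test error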
+(Pro): Works directly from table data
+(Pro): Only moderately hard to understand
+(Pro): Results are human-readable and easy to interpret
+(Con): Requires the choice of a metric
+(Con): Requires some embedding in a Euclidean or metric space
+(Con): Requires tuning of the optimal neighbourhood size
(source: https://stackoverflow.com/questions/36969740/difference-between-probabilistic-knn-and-naive-bayes)
Naive Bayes is a generative classifier.
+An example with a naive probability distribution:
+An example with an explicit generative distribution (Gaussian):
+Naive Bayes assumes that each class is distributed according to a simple distribution, with the features treated as independent given the class.
+In the case of continuous features (predictor variables), a commonly used model is the Gaussian (radial basis function, RBF) density, which implies that each feature is normally distributed within each class. The predicted class is the one that maximizes the Gaussian density (see the sketch after the list below).
+(Pro): Flexible: can choose from many parametric distributions
+(Pro): Connection with maximum likelihood methods
+(Pro): Model takes less memory than other alternatives (e.g. Random Forests)
+(Con): Not always the most accurate method
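A minimal sketch of a Gaussian Naive Bayes fit with e1071::naiveBayes, again on iris purely for illustration (not the dataset used below):
library(e1071)
nb <- naiveBayes(Species ~ ., data = iris)
# Per-class mean and standard deviation of one feature:
# the 'simple distribution per class' assumption made explicit
nb$tables$Petal.Length
# MAP prediction: the class maximizing the Gaussian density
predNb <- predict(nb, iris)
mean(predNb != iris$Species)    # training error, for illustration only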
+Comparing Classifiers: Decision Trees, K-NN & Naive Bayes
+Download the following dataset: http://www.marcovanetti.com/pages/wmsnsec/wmsnsec1.xml
(src:http://www.marcovanetti.com/)
+Apply all methods described here to engineer the best classification model
+Discuss the potential limitations of this dataset
+How is classification affected by automatic translations?
+Find an analogous but much larger dataset
+Write an essay on the dangers of social media
+Load the data:
set.seed(1)
train <- read.csv("../Data/mnist_train.psv", sep="|", as.is=TRUE, header=FALSE)
test <- read.csv("../Data/mnist_test.psv", sep="|", as.is=TRUE, header=FALSE)
+Separate the predictors from the response:
Xtrain <- as.matrix(train[,-1])
Xtest <- as.matrix(test[,-1])
ytrain <- train[,1]
ytest <- test[,1]
We are ready to build prediction models.
For ridge regression, I’ll directly use the multinomial loss function and let the R function do the cross-validation for me. The glmnet package is my preferred package for ridge regression; just remember to set alpha to 0.
+Principle (a): generalized linear model. Use of a ‘link function’.
\[ y = ax + b \]
is replaced by linear regression through a link:
\[ g(y) = ax + b \]
For binary classification (2 classes), a standard choice of link is the logit, which gives logistic regression.
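As a tiny illustration of principle (a), here is logistic regression fitted as a GLM with a logit link on simulated binary data (the data are made up for illustration):
set.seed(1)
x <- rnorm(200)
y <- rbinom(200, 1, plogis(2 * x))    # true model: g(E[y]) = 2x with g = logit
fit <- glm(y ~ x, family = binomial)  # fits g(E[y]) = a*x + b
coef(fit)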
+Principle (b): In ridge regression models, a regularization term \(\| \Gamma x \|^2\) is added to robustify the usual least-squares objective:
\[ \|Ax-b \|^2 + \| \Gamma x \|^2 \]
In R, you can use the glmnet package for ridge regression.
glmnet fits a generalized linear model via penalized maximum likelihood. The regularization path is computed for the lasso or elasticnet penalty at a grid of values for the regularization parameter lambda. The algorithm is extremely fast and can exploit sparsity in the input matrix x.
library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 3.0-2
outLm <- cv.glmnet(Xtrain, ytrain, alpha=0, nfolds=3, family="multinomial")
predLm <- apply(predict(outLm, Xtest, s=outLm$lambda.min,
type="response"), 1, which.max) - 1L
In MNIST the class labels are 0, 1, 2, and so on.
Since glmnet indexes the predicted classes starting at 1, we need to subtract one from its output to recover the correct class labels.
Gradient boosted trees belong to the family of boosting algorithms. They can be applied directly to data with more than 2 classes.
We will see a very short introduction to boosting on Wednesday.
One can tune several hyperparameters: interaction depth, number of trees or learning rate.
You will have to experiment with these; I am not going to discuss this in detail here, but a quick sketch of how to explore the number of trees follows the fit below.
library(gbm)
## Loaded gbm 2.1.5
outGbm <- gbm.fit(Xtrain, factor(ytrain), distribution="multinomial",
n.trees=500, interaction.depth=2)
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 2.3026 nan 0.0010 0.0085
## 2 2.2981 nan 0.0010 0.0086
## 3 2.2936 nan 0.0010 0.0083
## 4 2.2892 nan 0.0010 0.0085
## 5 2.2848 nan 0.0010 0.0081
## 6 2.2805 nan 0.0010 0.0083
## 7 2.2761 nan 0.0010 0.0081
## 8 2.2719 nan 0.0010 0.0081
## 9 2.2676 nan 0.0010 0.0079
## 10 2.2634 nan 0.0010 0.0078
## 20 2.2223 nan 0.0010 0.0074
## 40 2.1469 nan 0.0010 0.0068
## 60 2.0781 nan 0.0010 0.0061
## 80 2.0154 nan 0.0010 0.0058
## 100 1.9573 nan 0.0010 0.0053
## 120 1.9039 nan 0.0010 0.0049
## 140 1.8529 nan 0.0010 0.0047
## 160 1.8057 nan 0.0010 0.0044
## 180 1.7612 nan 0.0010 0.0043
## 200 1.7182 nan 0.0010 0.0038
## 220 1.6785 nan 0.0010 0.0035
## 240 1.6408 nan 0.0010 0.0036
## 260 1.6046 nan 0.0010 0.0033
## 280 1.5697 nan 0.0010 0.0031
## 300 1.5367 nan 0.0010 0.0030
## 320 1.5048 nan 0.0010 0.0029
## 340 1.4741 nan 0.0010 0.0030
## 360 1.4449 nan 0.0010 0.0027
## 380 1.4170 nan 0.0010 0.0026
## 400 1.3899 nan 0.0010 0.0025
## 420 1.3640 nan 0.0010 0.0022
## 440 1.3394 nan 0.0010 0.0021
## 460 1.3155 nan 0.0010 0.0022
## 480 1.2922 nan 0.0010 0.0020
## 500 1.2697 nan 0.0010 0.0020
predGbm <- apply(predict(outGbm, Xtest, n.trees=outGbm$n.trees),1,which.max) - 1L
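To get a feel for the tuning mentioned above, one can evaluate the test error at several numbers of trees without refitting (a quick sketch; a proper tuning run would also vary the shrinkage and interaction.depth, ideally by cross-validation):
for (nt in c(100, 250, 500)) {
  p <- apply(predict(outGbm, Xtest, n.trees = nt), 1, which.max) - 1L
  cat("n.trees =", nt, " test error =", round(mean(p != ytest), 4), "\n")
}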
Support Vector Machines work by searching for hyperplanes that separate two classes by the maximum margin.
Hence these methods are based on maximizing the discrimination of distinct classes.
For more than two classes, the problem must be reduced to a collection of binary problems.
Usually we either compare each class to all others (one-vs-all) or each class to each other class (one-vs-one).
We need to experiment with the kernel choice in the case of non-linearly separable data. A commonly used kernel is the radial basis function (RBF) kernel.
To fit SVMs, we can load the e1071 library.
library(e1071)
outSvm <- svm(Xtrain, factor(ytrain), kernel="radial", cost=1)
predSvm <- predict(outSvm, Xtest)
We see that the methods differ substantially in how predictive they are on the test dataset:
mean(predLm != ytest)
## [1] 0.08918784
mean(predGbm != ytest)
## [1] 0.2321873
mean(predSvm != ytest)
## [1] 0.06178376
In terms of accuracy on this test set: SVM > ridge regression > gradient boosted trees.
Where do ridge regression and support vector machines make errors?
tapply(predLm != ytest, ytest, mean)
## 0 1 2 3 4 5 6
## 0.03064067 0.04545455 0.16161616 0.13855422 0.10500000 0.13125000 0.07058824
## 7 8 9
## 0.08843537 0.15060241 0.05084746
tapply(predSvm != ytest, ytest, mean)
## 0 1 2 3 4 5 6
## 0.02228412 0.04166667 0.07575758 0.12048193 0.07000000 0.08125000 0.07058824
## 7 8 9
## 0.06122449 0.09036145 0.03954802
What are the digits that are the most difficult to predict?
Which digit is the easiest to predict?
Although the digit 3 looks like half of the digit 8 in shape, the difficulty of predicting both ‘3’ and ‘8’ cannot be explained by a confusion between 3 and 8:
table(predLm,ytest)
## ytest
## predLm 0 1 2 3 4 5 6 7 8 9
## 0 348 0 5 3 2 7 2 0 6 0
## 1 0 252 0 0 2 0 0 0 0 2
## 2 2 0 166 3 6 0 3 1 4 1
## 3 2 2 5 143 0 7 0 1 5 0
## 4 3 4 10 1 179 2 3 5 2 3
## 5 0 0 1 11 0 139 3 0 4 1
## 6 2 4 3 0 4 0 158 0 1 0
## 7 0 0 1 2 1 0 0 134 1 1
## 8 1 1 7 2 1 1 1 1 141 1
## 9 1 1 0 1 5 4 0 5 2 168
table(predSvm,ytest)
## ytest
## predSvm 0 1 2 3 4 5 6 7 8 9
## 0 351 0 2 0 0 3 4 0 4 0
## 1 0 253 0 0 1 0 0 0 0 0
## 2 6 1 183 5 3 2 4 2 2 0
## 3 0 0 4 146 0 3 0 0 3 0
## 4 1 5 3 0 186 1 2 5 0 4
## 5 0 1 0 11 1 147 1 0 2 1
## 6 0 3 1 0 2 0 158 0 1 0
## 7 0 1 1 1 3 0 0 138 0 0
## 8 1 0 4 3 1 1 1 0 151 2
## 9 0 0 0 0 3 3 0 2 3 170
We also see that the points where these two models make mistakes do not overlap very much.
table(predSvm != ytest, predLm != ytest)
##
## FALSE TRUE
## FALSE 1806 77
## TRUE 22 102
This gives evidence that stacking could be beneficial.
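Proper stacking would train a meta-model on held-out class probabilities; as a quick illustration of why combining could help, here is a simple majority vote over the three sets of predictions, with ties broken in favour of the SVM (the strongest single model here):
votes <- cbind(as.integer(as.character(predSvm)), predLm, predGbm)
predVote <- apply(votes, 1, function(v) {
  tab <- table(v)
  winners <- as.integer(names(tab)[tab == max(tab)])
  if (v[1] %in% winners) v[1] else winners[1]   # prefer the SVM prediction on ties
})
mean(predVote != ytest)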
We can pick out a large sample of images that are 3’s and see if these are particularly difficult to detect.
iset <- sample(which(train[,1] == 3), 5*7)   # pick 35 random training images of a 3
par(mar=c(0,0,0,0))
par(mfrow=c(5,7))
for (j in iset) {
  y <- matrix(as.matrix(train[j,-1]), 16, 16, byrow=TRUE)  # 16x16 pixel image
  y <- 1 - (y + 1)*0.5                                     # map intensities to [0,1] and invert for plotting
  plot(0,0, xlab="", ylab="", axes=FALSE)
  rasterImage(y, -1, -1, 1, 1)
  box()
  text(-0.8, -0.7, train[j,1], cex=3, col="red")           # true label
}
For the most part though, these do not seem difficult for a human to classify.
What if we look at the actual misclassified points? Here are the ones from the support vector machine:
iset <- sample(which(predSvm != ytest), 7*7)   # 49 random SVM misclassifications
par(mar=c(0,0,0,0))
par(mfrow=c(7,7))
for (j in iset) {
  y <- matrix(as.matrix(test[j,-1]), 16, 16, byrow=TRUE)
  y <- 1 - (y + 1)*0.5
  plot(0,0, xlab="", ylab="", axes=FALSE)
  rasterImage(y, -1, -1, 1, 1)
  box()
  text(-0.8, -0.7, test[j,1], cex=3, col="red")    # true label
  text(0.8, -0.7, predSvm[j], cex=3, col="blue")   # SVM prediction
}
And the ridge regression:
iset <- sample(which(predLm != ytest), 7*7)   # 49 random ridge-regression misclassifications
par(mar=c(0,0,0,0))
par(mfrow=c(7,7))
for (j in iset) {
  y <- matrix(as.matrix(test[j,-1]), 16, 16, byrow=TRUE)
  y <- 1 - (y + 1)*0.5
  plot(0,0, xlab="", ylab="", axes=FALSE)
  rasterImage(y, -1, -1, 1, 1)
  box()
  text(-0.8, -0.7, test[j,1], cex=3, col="red")    # true label
  text(0.8, -0.7, predLm[j], cex=3, col="blue")    # ridge prediction
}
We have learned how to use alternatives to tree-based models to classify images in R.
We have learned how to analyze the reasons for misclassification.
Next steps for improvement are deep learning methods (not covered in this course).