Theme of activity

Alternative classification models

Decision trees

+(Pro): Works directly from table data
+(Pro): Easiest to understand
+(Pro): Relevant features are extracted automatically by Random Forests
+(Pro): Results are human-readable and easy to interpret
+(Con): CART trees are prone to overfitting and not robust to outliers
+(Con): Random forests lose the ease of interpretation of CART trees
+(Con): Models take a lot of memory for a large number of classes and features
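
The pros and cons above refer to CART trees and random forests. As a point of reference, here is a minimal sketch in R; the rpart and randomForest packages and the iris data are my own illustrative choices, not part of the MNIST analysis below.

library(rpart)          # CART trees
library(randomForest)   # ensembles of trees
fitTree <- rpart(Species ~ ., data = iris)         # a single, interpretable tree
fitRF   <- randomForest(Species ~ ., data = iris)  # a forest of (by default) 500 trees
print(fitTree)       # the splits are human-readable
importance(fitRF)    # relevant features are ranked automatically by the forest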

Support vector machines (SVM)

SVM is a discriminative classifier.

It works by finding hyperplanes that separate two classes by the largest possible margin.

The data does not need to be perfectly linearly separable: the soft-margin formulation also penalizes points that fall on the wrong side of the margin.

SVM was originally formulated for binary classification in Euclidean space (linear SVM), but admits extensions to non-linear decision boundaries (via the so-called kernel trick) and to multi-class classification.
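
To make the kernel trick concrete, here is a small sketch comparing a linear and a radial kernel on a simulated two-class problem that is not linearly separable; the simulated data and the e1071 package are my own illustrative choices.

library(e1071)
set.seed(1)
x <- matrix(rnorm(400), ncol=2)                            # 200 points in the plane
y <- factor(ifelse(rowSums(x^2) > 1.5, "outer", "inner"))  # ring-shaped classes, not linearly separable
linFit <- svm(x, y, kernel="linear")
rbfFit <- svm(x, y, kernel="radial")
mean(predict(linFit, x) != y)   # the linear kernel struggles on this geometry
mean(predict(rbfFit, x) != y)   # the radial kernel separates the two rings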

+(Pro): Relatively straightforward to understand
+(Pro): Results are human-readable and easy to interpret
+(Con): No unique method for non-binary classification
+(Con): Not very flexible
+(Con): Requires the construction of an appropriate kernel
+(Con): Computationally intensive

K-Nearest Neighbours

K-Nearest Neighbours classifiers work by inspecting the local neighbourhood of the point to be predicted and voting for the most frequent class found among its neighbours.

The basic difference between a K-NN classifier and a Naive Bayes classifier is that the former is a discriminative classifier while the latter is a generative classifier. … Naive Bayes is an eager learning classifier and it is much faster than K-NN. Thus, it could be used for prediction in real time.

K-NN is a discriminative classifier.

Naive Bayes is a ‘linear classifier’ while K-NN is not.

K-NN is NOT a probabilistic method, although you can treat it as one and estimate class probabilities through a naive estimator in the neighbourhood of the point that needs a prediction. The empirical probability of a class \(y\) is calculated as:

\[ P(y \mid x) = \frac{\#\{\text{labels equal to } y \text{ in } NN(x)\}}{|NN(x)|} \]

where \(NN(x)\) is the set of nearest neighbours of the point \(x\) for some choice of metric.

In this sense, K-NN is equivalent to a non-parametric estimation method.
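
The estimator above can be written in a few lines of base R. The following sketch is purely illustrative: the iris data, the Euclidean metric and k = 5 are my own assumptions.

# empirical class probabilities among the k nearest neighbours (Euclidean metric)
knn_probs <- function(Xtr, ytr, xnew, k = 5) {
  d  <- sqrt(colSums((t(Xtr) - xnew)^2))  # distances from xnew to every training point
  nn <- order(d)[seq_len(k)]              # indices of the k nearest neighbours, NN(x)
  table(ytr[nn]) / k                      # P(y|x) = (# labels y in NN(x)) / |NN(x)|
}
X <- as.matrix(iris[, 1:4]); y <- iris$Species
knn_probs(X[-1, ], y[-1], X[1, ], k = 5)  # leave the first point out and predict it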

+(Pro): Works directly from table data
+(Pro): Only moderately hard to understand
+(Pro): Results are human-readable and easy to interpret
+(Con): Requires the choice of a metric
+(Con): Requires an embedding in a Euclidean or metric space
+(Con): Requires tuning of the optimal neighbourhood size

Naive Bayes

https://stackoverflow.com/questions/36969740/difference-between-probabilistic-knn-and-naive-bayes

Naive Bayes is a generative classifier.

+An example with a naive probability distribution:

+An example with an explicit generative distribution (Gaussian):

+Naive Bayes assumes that, within each class, the features are independent and each follows a simple distribution.

+In the case of continuous features (predictor variables), a commonly used model assumes that, within each class, each feature is normally distributed (Gaussian Naive Bayes, sometimes described via a radial basis function, RBF). The predicted class is the one that maximizes the Gaussian density (times the class prior); a short sketch follows the list of pros and cons below.

+(Pro): Flexible: can choose from many parametric distributions

+(Pro): Connection with maximum likelihood methods

+(Pro): Model takes less memory than other alternatives (e.g. Random Forests)

+(Con): Not always the most accurate method
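
As promised above, here is a short Gaussian Naive Bayes sketch using naiveBayes() from the e1071 package; the iris data and the chosen rows are my own illustrative assumptions.

library(e1071)
fitNB <- naiveBayes(Species ~ ., data = iris)  # fits a per-class mean and sd for each feature
fitNB$tables$Petal.Length                      # the fitted normal parameters for one feature
predict(fitNB, iris[c(1, 51, 101), ])          # picks the class maximizing the posterior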

Suggested project: TEXT classification

+Download the following dataset: http://www.marcovanetti.com/pages/wmsnsec/wmsnsec1.xml

(source: http://www.marcovanetti.com/)

+Apply all methods described here to engineer the best classification model.
+Discuss the potential limitations of this dataset.
+How is classification affected by automatic translations?
+Find an analogous but much larger dataset.
+Write an essay on the dangers of social media.

Application to MNIST data classification

Run classification models in R

+Load the data:

set.seed(1)
train <- read.csv("../Data/mnist_train.psv", sep="|", as.is=TRUE, header=FALSE)
test <- read.csv("../Data/mnist_test.psv", sep="|", as.is=TRUE, header=FALSE)

+Separate the predictors from the response:

Xtrain <- as.matrix(train[,-1])
Xtest <- as.matrix(test[,-1])
ytrain <- train[,1]
ytest <- test[,1]

We are ready to build prediction models.

Regression: Ridge regression

For ridge regression, I’ll directly use the multinomial loss function and let the R function do the cross validation for me. The glmnet package is my preferred package for ridge regression; just remember to set alpha to 0.

+Principle (a): generalized linear model. Use of a ‘link function’.

\[ y = ax + b \]

is replaced by linear regression through a link:

\[ g(y) = ax + b \]

For binary classification (2 classes), a standard choice is the logit link, which corresponds to logistic regression.

+Principle (b): In ridge regression models, a regularization term \(\| \Gamma x \|^2\) is added to stabilize the usual method of least squares:

\[ \|Ax-b \|^2 + \| \Gamma x \|^2 \]
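
For intuition, this penalized least-squares problem has a closed-form solution that can be written directly in base R. The simulated data and the choice \(\Gamma = \sqrt{\lambda}\, I\) below are my own assumptions for illustration.

set.seed(1)
A <- matrix(rnorm(100 * 5), 100, 5)           # design matrix
b <- A %*% c(1, 0, -1, 0, 2) + rnorm(100)     # noisy response
lambda <- 0.5
# minimizer of ||Ax - b||^2 + lambda * ||x||^2
xRidge <- solve(t(A) %*% A + lambda * diag(5), t(A) %*% b)
xRidge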

In R, you can use the glmnet package for ridge regression.

Glmnet is a package that fits a generalized linear model via penalized maximum likelihood. The regularization path is computed for the lasso or elasticnet penalty at a grid of values for the regularization parameter lambda. The algorithm is extremely fast, and can exploit sparsity in the input matrix x.

library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 3.0-2
outLm <- cv.glmnet(Xtrain, ytrain, alpha=0, nfolds=3, family="multinomial")
predLm <- apply(predict(outLm, Xtest, s=outLm$lambda.min,
                  type="response"), 1, which.max) - 1L

In MNIST the class labels are 0, 1, 2, etc.

As which.max returns the position of the predicted class (starting at 1), we need to subtract one from the results to recover the correct class labels.

Boosting: Gradient boosted trees

Gradient boosted trees belong to the family of boosting algorithms. They can be applied directly to data with more than 2 classes.

We will see a very short introduction to boosting on Wednesday.

One can tune several hyperparameters: the interaction depth, the number of trees and the learning rate.

You will have to experiment with these. I am not going to discuss this in detail here.
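
As a rough starting point, a small grid search such as the one sketched below can be used. The grid values and the reduced number of trees are arbitrary choices to keep the run time short, and the error should really be estimated on a validation split or by cross validation rather than on the test set.

library(gbm)
grid <- expand.grid(depth = c(1, 2, 4), shrink = c(0.01, 0.1))
for (i in seq_len(nrow(grid))) {
  fit  <- gbm.fit(Xtrain, factor(ytrain), distribution="multinomial",
                  n.trees=100, interaction.depth=grid$depth[i],
                  shrinkage=grid$shrink[i], verbose=FALSE)
  pred <- apply(predict(fit, Xtest, n.trees=fit$n.trees), 1, which.max) - 1L
  cat("depth =", grid$depth[i], "shrinkage =", grid$shrink[i],
      "test error =", round(mean(pred != ytest), 4), "\n")
}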

library(gbm)
## Loaded gbm 2.1.5
outGbm <- gbm.fit(Xtrain,  factor(ytrain), distribution="multinomial",
                  n.trees=500, interaction.depth=2)
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        2.3026             nan     0.0010    0.0085
##      2        2.2981             nan     0.0010    0.0086
##      3        2.2936             nan     0.0010    0.0083
##      4        2.2892             nan     0.0010    0.0085
##      5        2.2848             nan     0.0010    0.0081
##      6        2.2805             nan     0.0010    0.0083
##      7        2.2761             nan     0.0010    0.0081
##      8        2.2719             nan     0.0010    0.0081
##      9        2.2676             nan     0.0010    0.0079
##     10        2.2634             nan     0.0010    0.0078
##     20        2.2223             nan     0.0010    0.0074
##     40        2.1469             nan     0.0010    0.0068
##     60        2.0781             nan     0.0010    0.0061
##     80        2.0154             nan     0.0010    0.0058
##    100        1.9573             nan     0.0010    0.0053
##    120        1.9039             nan     0.0010    0.0049
##    140        1.8529             nan     0.0010    0.0047
##    160        1.8057             nan     0.0010    0.0044
##    180        1.7612             nan     0.0010    0.0043
##    200        1.7182             nan     0.0010    0.0038
##    220        1.6785             nan     0.0010    0.0035
##    240        1.6408             nan     0.0010    0.0036
##    260        1.6046             nan     0.0010    0.0033
##    280        1.5697             nan     0.0010    0.0031
##    300        1.5367             nan     0.0010    0.0030
##    320        1.5048             nan     0.0010    0.0029
##    340        1.4741             nan     0.0010    0.0030
##    360        1.4449             nan     0.0010    0.0027
##    380        1.4170             nan     0.0010    0.0026
##    400        1.3899             nan     0.0010    0.0025
##    420        1.3640             nan     0.0010    0.0022
##    440        1.3394             nan     0.0010    0.0021
##    460        1.3155             nan     0.0010    0.0022
##    480        1.2922             nan     0.0010    0.0020
##    500        1.2697             nan     0.0010    0.0020
predGbm <- apply(predict(outGbm, Xtest, n.trees=outGbm$n.trees),1,which.max) - 1L

Hyperplane separators: Support vector machines

Support Vector Machines work by searching for hyperplanes that separate two classes by the maximum margin.

Hence these methods are based on maximizing the discrimination of distinct classes.

For more than two classes, the problem must be reduced to a collection of binary problems.

Usually we either compare each class against all the others (one-versus-rest) or compare each pair of classes (one-versus-one); the svm() function in e1071, which wraps libsvm, uses the one-versus-one scheme.

We need to experiment with a kernel choice in the case of non-linearly separable data. A commonly used kernel is the radial basis function kernel.

To fit SVMs, we can load the e1071 library.

library(e1071)
outSvm <- svm(Xtrain,  factor(ytrain), kernel="radial", cost=1)
predSvm <- predict(outSvm, Xtest)

Prediction Performance and Comparison

Misclassification rates

We see that the methods differ substantially in how predictive they are on the test dataset:

mean(predLm != ytest)
## [1] 0.08918784
mean(predGbm != ytest)
## [1] 0.2321873
mean(predSvm != ytest)
## [1] 0.06178376

In terms of accuracy on this test set: SVM > ridge regression > gradient boosted trees.

Mis-classification rates by class

Where do ridge regression and support vector machines make errors?

tapply(predLm != ytest, ytest, mean)
##          0          1          2          3          4          5          6 
## 0.03064067 0.04545455 0.16161616 0.13855422 0.10500000 0.13125000 0.07058824 
##          7          8          9 
## 0.08843537 0.15060241 0.05084746
tapply(predSvm != ytest, ytest, mean)
##          0          1          2          3          4          5          6 
## 0.02228412 0.04166667 0.07575758 0.12048193 0.07000000 0.08125000 0.07058824 
##          7          8          9 
## 0.06122449 0.09036145 0.03954802
  • What are the digits that are the most difficult to predict?

  • Which digit is the easiest to predict?

Confusion matrix

Although the digit 3 looks like half of the digit 8 in shape, the difficulty of predicting both ‘3’ and ‘8’ cannot be explained by a confusion between 3 and 8:

table(predLm,ytest)
##       ytest
## predLm   0   1   2   3   4   5   6   7   8   9
##      0 348   0   5   3   2   7   2   0   6   0
##      1   0 252   0   0   2   0   0   0   0   2
##      2   2   0 166   3   6   0   3   1   4   1
##      3   2   2   5 143   0   7   0   1   5   0
##      4   3   4  10   1 179   2   3   5   2   3
##      5   0   0   1  11   0 139   3   0   4   1
##      6   2   4   3   0   4   0 158   0   1   0
##      7   0   0   1   2   1   0   0 134   1   1
##      8   1   1   7   2   1   1   1   1 141   1
##      9   1   1   0   1   5   4   0   5   2 168
table(predSvm,ytest)
##        ytest
## predSvm   0   1   2   3   4   5   6   7   8   9
##       0 351   0   2   0   0   3   4   0   4   0
##       1   0 253   0   0   1   0   0   0   0   0
##       2   6   1 183   5   3   2   4   2   2   0
##       3   0   0   4 146   0   3   0   0   3   0
##       4   1   5   3   0 186   1   2   5   0   4
##       5   0   1   0  11   1 147   1   0   2   1
##       6   0   3   1   0   2   0 158   0   1   0
##       7   0   1   1   1   3   0   0 138   0   0
##       8   1   0   4   3   1   1   1   0 151   2
##       9   0   0   0   0   3   3   0   2   3 170

We also see that the points where these two models make mistakes do not overlap very much.

table(predSvm != ytest, predLm != ytest)
##        
##         FALSE TRUE
##   FALSE  1806   77
##   TRUE     22  102

This gives evidence that stacking could be beneficial.
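
To make the stacking idea concrete, here is a minimal sketch. The meta-learner (multinom() from the nnet package) and the 50/50 split of the test set are my own assumptions; a proper stacked model would use out-of-fold predictions on the training data instead.

library(nnet)
set.seed(1)
meta <- data.frame(lm = factor(predLm), svm = factor(predSvm), y = factor(ytest))
idx  <- sample(nrow(meta), nrow(meta) %/% 2)        # half to fit the combiner, half to evaluate
stackFit  <- multinom(y ~ lm + svm, data = meta[idx, ], trace = FALSE)
predStack <- predict(stackFit, meta[-idx, ])
mean(predStack != meta$y[-idx])                     # compare with the individual error rates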

Mis-classified digits

We can pick out a large sample of images that are 3’s and see if these are particularly difficult to detect.

iset <- sample(which(train[,1] == 3),5*7)   # 35 random training images of the digit 3
par(mar=c(0,0,0,0))
par(mfrow=c(5,7))
for (j in iset) {
  y <- matrix(as.matrix(train[j,-1]),16,16,byrow=TRUE)  # reshape the pixel vector to 16 x 16
  y <- 1 - (y + 1)*0.5                                  # rescale from [-1, 1] to [0, 1] and invert for display

  plot(0,0,xlab="",ylab="",axes=FALSE)
  rasterImage(y,-1,-1,1,1)
  box()
  text(-0.8,-0.7, train[j,1], cex=3, col="red")         # true label in red
}

For the most part though, these do not seem difficult for a human to classify.

What if we look at the actual mis-classified points? Here are the ones from the support vector machine:

iset <- sample(which(predSvm != ytest),7*7)
par(mar=c(0,0,0,0))
par(mfrow=c(7,7))
for (j in iset) {
  y <- matrix(as.matrix(test[j,-1]),16,16,byrow=TRUE)
  y <- 1 - (y + 1)*0.5

  plot(0,0,xlab="",ylab="",axes=FALSE)
  rasterImage(y,-1,-1,1,1)
  box()
  text(-0.8,-0.7, test[j,1], cex=3, col="red")
  text(0.8,-0.7, predSvm[j], cex=3, col="blue")
}

And the ridge regression:

iset <- sample(which(predLm != ytest),7*7)
par(mar=c(0,0,0,0))
par(mfrow=c(7,7))
for (j in iset) {
  y <- matrix(as.matrix(test[j,-1]),16,16,byrow=TRUE)
  y <- 1 - (y + 1)*0.5

  plot(0,0,xlab="",ylab="",axes=FALSE)
  rasterImage(y,-1,-1,1,1)
  box()
  text(-0.8,-0.7, test[j,1], cex=3, col="red")
  text(0.8,-0.7, predLm[j], cex=3, col="blue")
}

Summary