2017-01-24

Outline

  • k-Nearest Neighbors (kNN)
  • R exercise: create a simple KNN classifier
  • knn() function from class library

k-Nearest Neighbors (kNN)

  • classify an unknown object by the most common class among its k closest neighbours.

  • Algorithm

    • Compute the distance \(d(x,x_i)\) to every training point \(x_i\), e.g. the Euclidean distance
      • if the measurements are on different scales, normalize the data first
      • e.g. min-max scaling: \(x'= \frac{x-x_{min}}{x_{max}-x_{min}}\)
    • Select the k closest neighbours and their labels
    • Predict by majority vote
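The min-max rescaling above can be sketched in R (the `normalize` helper and the use of the built-in `iris` data are illustrative, not part of the lecture code):

```r
## min-max normalization: rescale a numeric vector to [0, 1]
normalize = function(x) (x - min(x)) / (max(x) - min(x))

## apply it column-wise, e.g. to the numeric columns of iris
scaled = as.data.frame(lapply(iris[, 1:4], normalize))
range(scaled$Sepal.Length)  # each column now spans 0 to 1
```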

Decision Boundary

  • k = 1

    • zero training error
    • sensitive to "noise"
    • overfitting problem
  • k = 15

    • larger k gives smoother decision boundaries
    • underfitting problem
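The k = 1 claim can be checked directly: when the training set is predicted with itself, each point is its own nearest neighbour, so the training error is zero. A small sketch with made-up data (assuming the `class` package, introduced later in these notes):

```r
library(class)
set.seed(1)
X = matrix(rnorm(40), ncol = 2)            # 20 made-up training points
y = factor(rep(c("a", "b"), each = 10))    # arbitrary labels

## with k = 1, each training point is its own nearest neighbour,
## so the resubstitution (training) error is 0
pred = knn(X, X, y, k = 1)
mean(pred != y)  # 0
```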

Summary

  • Pros
    • Simple and intuitive
    • Can be applied to data from any distribution
  • Cons
    • Choosing k may be tricky
    • Computationally expensive to find the k nearest neighbours
  • Extension
    • kNN regression: predict the average of the k nearest neighbours' responses instead of taking a majority vote.
    • weighted KNN
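The kNN regression extension replaces the majority vote with an average. A minimal sketch (the `knnReg` helper is an assumption for illustration, not from the lecture):

```r
## kNN regression: average the responses of the k nearest training points
knnReg = function(p, X, y, k) {
  d = sqrt(rowSums(sweep(as.matrix(X), 2, p)^2))  # Euclidean distances to p
  mean(y[order(d)][1:k])                          # average instead of vote
}

X = data.frame(x1 = c(0, 1, 2, 3), x2 = c(0, 1, 2, 3))
y = c(10, 20, 30, 40)
knnReg(c(0.1, 0.1), X, y, k = 2)  # averages the two closest responses: 15
```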

Create a simple kNN classifier

  • Example
set.seed(99)

## prepare the data set
train = data.frame(x1 = runif(20,-1,1), x2 = runif(20,-1,1), label = rep(NA,20))

## assign labels based on x, y coordinates
train[train$x1>0 & train$x2 >0,"label"] = "red"   # Quadrant 1
train[train$x1<0 & train$x2 >0,"label"] = "green" # Quadrant 2
train[train$x1<0 & train$x2 <0,"label"] = "orange" # Quadrant 3
train[train$x1>0 & train$x2 <0,"label"] = "blue"   # Quadrant 4

## plot the points
plot(train[,1:2],col=train$label,pch=19,asp=1,xlim=c(-1,1),ylim=c(-1,1))
abline(h=0,lty=3)
abline(v=0,lty=3)

## add a point we want to predict
points(0.5,0.5,pch=10)


## knn classifier
p = c(0.5,0.5)  # test data

## dist function
distFunc = function(p,data){
  val = numeric(nrow(data))               # preallocate instead of growing
  for (i in 1:nrow(data)){
    val[i] = sqrt(sum((p - data[i,])^2))  # Euclidean distance to point i
  }
  return(val)
}

#as.matrix(dist(rbind(p,train[,1:2])))[,1] # use the built in dist function

myKnn = function(p,trainData,labels,k){
  distV = distFunc(p,trainData)  ## calculate distance
  klabels = labels[order(distV)][1:k]  ## k nearest neighbours
  ## majority vote and distance
  return(list(vote = sort(table(klabels),decreasing = T),dist=distV[order(distV)][1:k]))
  
}

p = c(0.1,-0.5)  # test data
myKnn(p,train[,1:2],train[,3],k=3)

knn() function from class package

  • library(class)
  • Syntax
knn(train, test, cl, k, ...)
  • Arguments:
    • train : matrix or data frame of training set cases
    • test : matrix or data frame of test set cases; a vector is treated as a single case
    • cl : factor of true classifications of the training set
    • k : number of neighbours considered
    • prob : if TRUE, the proportion of votes for the winning class is returned as the "prob" attribute
library(class)
knn(train[,1:2],p,train[,3],k=5,prob=T)
  • Example (textbook, p. 163)
library(class)
library(ISLR)
## ?Smarket
attach(Smarket)
train = (Year<2005)
train.X = cbind(Lag1,Lag2)[train,]
test.X = cbind(Lag1,Lag2)[!train,]
train.Direction = Direction[train]
Direction.2005 = Direction[!train]
set.seed(1)
knn.pred = knn(train.X,test.X,train.Direction,k=1)
table(knn.pred,Direction.2005)
mean(knn.pred == Direction.2005)
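Since choosing k can be tricky (see the cons above), one common approach is to score several values of k on held-out data and pick the best. A sketch on synthetic data (the data and split are made up for illustration, standing in for a real train/test split like the Smarket one):

```r
library(class)
set.seed(1)
## synthetic two-class data standing in for a real train/test split
X = matrix(rnorm(200), ncol = 2)
y = factor(ifelse(X[, 1] + rnorm(100, sd = 0.5) > 0, "Up", "Down"))
trainIdx = 1:70

## held-out accuracy for k = 1, ..., 10
accs = sapply(1:10, function(k) {
  pred = knn(X[trainIdx, ], X[-trainIdx, ], y[trainIdx], k = k)
  mean(pred == y[-trainIdx])
})
which.max(accs)  # the k with the best held-out accuracy
```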