due at beginning of the course Monday 20th 2020
Download the Rmd file, not the pdf
Link to data on the ISL site
NB: the link to the dataset was wrong in the pdf but correct in the .Rmd file.
Please get the Auto data (csv file) at the following address:
Understand how to do calculations in R
Understand how to read data
Understand what a scatter plot is
Understand what a boxplot is
Understand how to plot THIS by THAT
Do you know how to calculate \(a^b\) in R?
Do you know that exponential of \(x\) is the same as \(e^x\)?
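For reference, here is a minimal sketch of both calculations in base R. The values of `a`, `b` and `x` are arbitrary choices made for illustration.

```r
# Powers: a^b is written with the ^ operator
a <- 2
b <- 10
a^b                      # 2 to the power 10, i.e. 1024

# exp(x) computes e^x; Euler's number e itself is exp(1)
x <- 3
e <- exp(1)
all.equal(exp(x), e^x)   # TRUE (up to floating-point tolerance)
```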
Use scan for data that is not ‘structured’ (e.g. not in tabular form or mixture of different data types).
(scan will read data sequentially)
Try creating your own example of unstructured data:
#?scan
d <- scan(file="../Data/bete.txt")
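If you do not have a file at hand, the following sketch builds one. The file contents are made up, and `tempfile()` is used so that nothing is assumed about your folders.

```r
# Write a few numbers, spread unevenly over several lines, to a temporary file
f <- tempfile(fileext = ".txt")
writeLines(c("3 1 4", "1 5", "9 2 6 5 3"), f)

# scan() reads the values one after another, ignoring the line structure
v <- scan(file = f)
v
length(v)   # 10 values in total

# For non-numeric data, tell scan() which type to expect via 'what'
w <- scan(text = "apple pear fig", what = character())
w
```

Note that `scan` returns a plain vector: the fact that the numbers sat on three lines is lost.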
Do you know how to define and fill a matrix with numbers in R?
Do you know how to calculate the product of 2 matrices in R?
Do you know the name of the function that calculates the determinant of a matrix in R?
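If you are unsure about any of these three questions, the sketch below shows one way to do each in base R. The matrices `A` and `B` are arbitrary examples.

```r
# Fill a 2 x 2 matrix; by default the values go in column by column
A <- matrix(c(1, 2, 3, 4), nrow = 2, ncol = 2)
A

# byrow = TRUE fills the matrix row by row instead
B <- matrix(c(0, 1, 1, 0), nrow = 2, byrow = TRUE)

# %*% is the matrix product (A * B would multiply element by element)
P <- A %*% B
P

# det() returns the determinant: here 1*4 - 3*2 = -2
det(A)
```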
Scatter plots are plots of raw data points. For example, if we measure/observe two features (say Age and Salary) in a population (say for \(N=6000\) employees of Dalhousie University), we can produce a cloud of points in 2d-space, where the \(i\)’th sample/person is represented by a point with coordinates \((x_i,y_i)\), and \(x_i\) (resp. \(y_i\)) is the age (resp. the salary) of sample/person \(i\).
Let us try this example
# Read the csv file: use the full or relative file path.
# NOTE: the data is here:
# http://mathstat.dal.ca/~fullsack/stat2450/Data/GreekDrama.csv
# so you will have to modify the path below to get this to work!
d <- read.csv(file="../Data/GreekDrama.csv",header=TRUE)
# Alternatively, use the file.choose() function to navigate to your file and select it:
#d <- read.csv(file.choose(),header=TRUE)
# produce a scatter plot. Experiment with the value of pch
# try adding a title, a legend, axis labels, etc.
plot(Word.Count~Year,data=d,pch=18)
#dev.off() # file will be saved in working directory (no screen display)
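If you cannot get the file, you can still practise on simulated data. The sketch below follows the Age and Salary example described earlier. The data frame `emp` and all its numbers are invented for illustration and do not describe any real population.

```r
# Hypothetical data, made up for illustration only
set.seed(1)
emp <- data.frame(Age = sample(25:65, 50, replace = TRUE))
emp$Salary <- 30000 + 800 * emp$Age + rnorm(50, sd = 5000)

# Salary (y) against Age (x), with a title and axis labels
plot(Salary ~ Age, data = emp, pch = 18, col = "blue",
     main = "Salary versus age (simulated)",
     xlab = "Age (years)", ylab = "Salary ($)")

# A legend explaining what one point stands for
legend("topleft", legend = "one employee", pch = 18, col = "blue")
```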
Useful links (TRY THEM!)
Box plots are caricatures of distributions.
(src:https://www.thecanadianencyclopedia.ca/fr/article/prime-suspects-canadas-prime-ministers-caricatured)
Just turn the following figure vertically:
Often, we represent distributions of different populations side-by-side
To understand boxplots, you need to understand quantiles
Here is a distribution of numbers:
1:17
The median (which, for this distribution, equals the average) is 9.
There are as many (8) samples below 9 as there are samples above 9:
The outcome value 9 cuts the distribution in 2 (equal number of samples below and above the value):
1 2 3 4 5 6 7 8 | 9 | 10 11 12 13 14 15 16 17
Check this with the quantile function in R
x = 1:17
quantile(x,c(0.5))
## 50%
## 9
In the same way, the quartiles cut the distribution in 4:
1 2 3 4 | 5 | 6 7 8 | 9 | 10 11 12 | 13 | 14 15 16 17
Check this with the quantile function in R
x = 1:17
quantile(x,c(0.25,0.5,0.75))
## 25% 50% 75%
## 5 9 13
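These quartiles are what a boxplot draws. As a small sketch, `boxplot` returns (invisibly) the numbers it plots, so you can compare them with the output of `quantile`. In general the box edges ("hinges") can differ slightly from the default quartiles, but for this distribution they coincide.

```r
x <- 1:17

# Draw a horizontal boxplot and keep the values it is built from
b <- boxplot(x, horizontal = TRUE)

# Rows: lower whisker, lower hinge, median, upper hinge, upper whisker
b$stats

# Compare with the quartiles
quantile(x, c(0.25, 0.5, 0.75))
```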
The quantiles do not depend on the order of sample values (so you may as well sort them)
y = c(17,1,3,19,6,2,15,7,4,14,11,12,5,8,16,13,9,10)
quantile(y,c(0.25,0.5,0.75))
## 25% 50% 75%
## 5.25 9.50 13.75
sort(y)
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 19
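Why 5.25 and not a value from the sample? With its default settings (`type = 7`), `quantile` interpolates linearly between neighbouring sorted values. The sketch below redoes the first-quartile calculation by hand. The variable names `s`, `h`, `lo` and `q1` are ours.

```r
y <- c(17,1,3,19,6,2,15,7,4,14,11,12,5,8,16,13,9,10)
s <- sort(y)
n <- length(s)             # 18 values

# Position of the 25% quantile among the sorted values
h <- (n - 1) * 0.25 + 1    # 5.25: a quarter of the way from the 5th to the 6th value
lo <- floor(h)

# Linear interpolation between the two neighbouring sorted values
q1 <- s[lo] + (h - lo) * (s[lo + 1] - s[lo])
q1                         # 5.25, same as quantile(y, 0.25)
```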
Now consider the following distribution of values:
1 1 1 1 1 1 1 1 2 3 4 5 6 7 11 13 17
The first quartile is 1!
x <-c(1,1, 1, 1, 1, 1, 1, 1, 2, 3, 4, 5, 6, 7, 11, 13, 17)
quantile(x,c(0.25))
## 25%
## 1
Can you try to create a distribution which is concentrated in the upper range (e.g. near 17)? Which value do you expect for the first quartile?
x <-c(1,1, 1, 1, 1, 1, 1, 1, 2, 3, 4, 5, 6, 7, 11, 13, 17)
quantile(x,c(0.25))
## 25%
## 1
To read ‘disorganized data’, use the scan function.
To read ‘table-structured data’ (e.g. data frames), use the read.table function.
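As a small sketch of the table-structured case, the example below reads a table written inline, so that it runs without any file. The names and numbers are made up.

```r
# A small made-up table: one header line, then one row per person
txt <- "name age height
Ann 23 1.70
Bob 31 1.82
Cleo 27 1.65"

# header = TRUE: the first line holds the column names
df <- read.table(text = txt, header = TRUE)

# The result is a data frame with one column per variable
str(df)
mean(df$age)   # 27
```

To read from a file instead, replace `text = txt` with `file = "your/path/here.txt"`.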
testVect = c(1,3,5,2,9,10,7,8,6)
It is sometimes convenient to use formulas in plots.
E.g. in axis labels, or legends or in the title of the plot.
Many options exist for doing this.
The following example is an answer to a question asked by a student in this class yesterday:
plot(1:10, 1:10,
main="text(...) examples\n~~~~~~~~~~~",xlab=expression(paste("NO"[3]^-{}, " (mgN/L)")))
text(4, 9, expression(hat(beta) == (X^t * X)^{-1} * X^t * y))
text(7, 4, expression(bar(x) == sum(frac(x[i], n), i==1, n)))
To use superscripts in axis labels:
labelsX=parse(text=paste(abs(seq(-100, -50, 10)), "^o ", "*W", sep=""))
labelsY=parse(text=paste(seq(50,100,10), "^o ", "*N", sep=""))
plot(-100:-50, 50:100, type="n", xlab="", ylab="", axes=FALSE)
axis(1, seq(-100, -50, 10), labels=labelsX)
axis(2, seq(50, 100, 10), labels=labelsY)
box()
https://rpubs.com/brouwern/superscript
If we ask you to get the distribution of This vs That, or This By That, often we mean that This will be plotted on the y (‘vertical’) coordinate and That on the x (‘horizontal’) coordinate.
For example, if we have three types of flowers (say with 3, 4 or 5 petals) and we ask you to plot the height of flowers by flower type, then there should be three values for x (corresponding to 3, 4 or 5 petals).
Can you produce a boxplot of your liking with this concept?
You can find suitable data all over the web with your browser.
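If you want something to start from, here is a minimal sketch of the flower example using simulated data. The data frame `flowers` and its numbers are invented for illustration.

```r
# Simulated flowers (made-up numbers): 20 flowers of each type
set.seed(42)
flowers <- data.frame(
  petals = rep(c(3, 4, 5), each = 20),
  height = c(rnorm(20, mean = 10, sd = 2),
             rnorm(20, mean = 14, sd = 2),
             rnorm(20, mean = 12, sd = 2))
)

# THIS (height, on y) by THAT (petals, on x): the formula reads y ~ x
b <- boxplot(height ~ petals, data = flowers,
             xlab = "Number of petals", ylab = "Height (cm)",
             main = "Flower height by flower type (simulated)")
```

One box is drawn for each distinct value of `petals`, so the plot has three boxes side by side.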
HINTS:
x <- c(1,12,15,16,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17)
quantile(x,c(0.25))
## 25%
## 17