Outline
Flow Control
- Conditional execution
  - ifelse
- Loops
Statistical functions
Writing functions
Vectorized arithmetics

Outline

Vectors, matrices, tables
Flow Control
Built-in statistical functions
Write functions

#vectors matrices tables

what is v?

a=3
b=9
v  = a+c(-1,2,4)*b

what does this do?

x=c(T,T,F,F)
y=c(T,F,T,F)
M=cbind(x,y,x&y,x|y,!x)
dimnames(M)[[2]]=c("x","y","x&y","x|y","!x")
M

##          x     y   x&y   x|y    !x
## [1,]  TRUE  TRUE  TRUE  TRUE FALSE
## [2,]  TRUE FALSE FALSE  TRUE FALSE
## [3,] FALSE  TRUE FALSE  TRUE  TRUE
## [4,] FALSE FALSE FALSE FALSE  TRUE

Flow Control

Conditional execution

fulltime=NA
nclasses=1
if (nclasses >= 3) {fulltime=TRUE
} else {fulltime=FALSE}
fulltime

## [1] FALSE

ifelse

ifelse is a compact version of if … else …

Syntax

#ifelse(condition, statement1, statement2)

There is a subtle difference in that ifelse works on vectors.

Examples

#find the component by component minimum of vectors a and b
a=rnorm(10); 
b=rnorm(10); 
vecminab=ifelse(a<b,a,b); 
print(cbind(a,b,vecminab))

##                a           b   vecminab
##  [1,]  0.5435615 -0.82403858 -0.8240386
##  [2,] -0.6385618  1.73761614 -0.6385618
##  [3,] -0.3986438  0.77237730 -0.3986438
##  [4,]  0.7630046 -1.19751288 -1.1975129
##  [5,]  0.2071592 -0.73389851 -0.7338985
##  [6,] -2.4247580 -1.01896389 -2.4247580
##  [7,]  0.1068852  0.69329282  0.1068852
##  [8,] -1.0462237  0.43096752 -1.0462237
##  [9,] -0.2292156  0.02872836 -0.2292156
## [10,] -0.3637466  0.69651526 -0.3637466

if number if negative write ‘negative’

if number is positive write ‘>0’

Loops

W=matrix(1:16,byrow=T,ncol=4)
print(W)

##      [,1] [,2] [,3] [,4]
## [1,]    1    2    3    4
## [2,]    5    6    7    8
## [3,]    9   10   11   12
## [4,]   13   14   15   16

fmat=function(W){
n=nrow(W)


# these loops will generate 16=4x4 states for the matrix
#
    
      for (i in 1:n){
      for (j in 1:n){
        
        W[j,i]=W[i,j]+W[j,i]
        
      #  pass i=1
        #11  = 11+11 # 11 will never be seen again    
        #21  = 12+21 # 21 will be seen again
        #31  = 13+31
        #41  = 14+41
        
      #  pass i=2
        #12  = 21+12 # will never be seen again 
        #22  = 22+22
        #32  = 23+32
        #42  = 24+42 
        
      #  pass i=3
        #13  = 31+13 # 
        #23  = 32+23
        #33  = 33+33
        #43  = 34+43
        
      #  pass i=4
        #14  = 41+14 # 
        #24  = 42+24
        #34  = 43+34
        #44  = 44+44 
        
      }
      }
    
      return(W)
}

print(fmat(W))

##      [,1] [,2] [,3] [,4]
## [1,]    2    9   15   21
## [2,]    7   12   24   30
## [3,]   12   17   22   39
## [4,]   17   22   27   32

Statistical functions

vectorized arithmetics
pdf,auc,percentiles,random sampling
functions of matrices
binom,pois,exp,unif…
d,p,q,r

empirical coverage
simulation
t-interval
plot

what does this code do?

print(sample(c(2,5,3), size=3, replace=FALSE))

## [1] 5 3 2

print(sample(c(2,5,3), size=3, replace=TRUE))

## [1] 3 5 5

what does this code do?

pbinom(2, size=10, prob=.2)

## [1] 0.6777995

what does this code do?

data=rbinom(2, size=10, prob=.2)
data

## [1] 4 4

t-table

$t_{\alpha,k}$

quantiles

qt(.975,23)

## [1] 2.068658

what does this code do?

set.seed(87612345)
mydata=rnorm(20,mean=4,sd=.75)
t.test(mydata,conf.level=.95)

## 
##  One Sample t-test
## 
## data:  mydata
## t = 24.626, df = 19, p-value = 7.037e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  3.740388 4.435262
## sample estimates:
## mean of x 
##  4.087825

what is the assumption made for data?
are these assumptions justified?
what is a confidence interval?
I used the expression: ‘is the rat in the box’. What did I mean?
what is the coverage of a confidence interval?
what does empirical coverage of a CI mean?
what is alpha? what is dof?
what is the formula that R is using in the function t.test?
where is a quantile used? quantile of what? what does a quantile mean?
how could I have a smaller confidence interval?
what would happen if the data does not come from a normal distribution?
Can you give me examples?
but how do I know the distribution of the data?

what does this code do?

set.seed(117426)
data=rchisq(42,7)

what is the assumption made for data?

data is i.i.d normally distributed.

Each data point is a random number

\[x_i \sim N(\mu=4,\sigma^2 = (.75)^2)\]

I will use the notation

\[x_i \sim f\] to say the data is i.i.d. fro, distribution $f$

Please note that in general, distribution have ‘knobs’ (controls or parameters), so a better notation could be:

\[x_i \sim f(.,\theta)\]

where $theta$ is a vector of parameters for the distrubution.

For example, a multivariate normal distribution in 2 dimensions requires 5 scalar parameters.

what is a confidence interval?

A confidence interval for a parameter of a distribution (for example the mean of a normal) is an interval in which we believe that the parameter will be with some given probability.

For example, a $90\%$ confidence interval for the mean $\mu$ of a random sample (data) is an interval $[a,b}$ such that

\[ \mathbb{P}( \mu \in [a,b]) = 0.9\]

I used the expression: ‘is the rat in the box’. What did I mean?

The rat is the unknown parameter. The box is the confidence interval.

what is the coverage of a confidence interval?

It is the real probability that the interval contains the parameter. This may differ from the nominal or announced probability.

what does empirical coverage of a CI mean?

It is an estimation of the real coverage probability of the CI. We get it by simulating data and counting how many times the rat is in the box.

what is alpha? what is dof?

for a $90\%$-CI, alpha is $10\%$. The number of degrees of freedoms depends on the rat. It will not be the same if you try to catch a ‘mean’ or ‘a variance’ say. For a mean, the dof is the number of data points minus 1.

what is the formula that R is using in the function t.test?

The formula of Student

When $x_1, x_2, \ldots , x_n$ is a sample from a normal distribution with unknown mean $\mu$ and unknown variance $\sigma^2$, the level $100(1-\alpha)$% confidence interval for $\mu$ is given by

\[\bar x \pm t_{1-\alpha/2, n-1} \frac{s}{\sqrt n}\]

where $\bar x$ and $s$ are the sample mean and sample standard deviation of the data, and $t_{1-\alpha/2, n-1}$ cuts off an area $1-\alpha/2$ to its left under the $t$ curve with $n-1$ degrees of freedom.

where is a quantile used? quantile of what? what does a quantile mean?

see below

how could I have a smaller confidence interval?

Get more data points

what would happen if the data does not come from a normal distribution?

The coverage of your t-CI will not be the nominal coverage

Can you give me examples?

Draw data from any other distribution : poisson, chisquare, binomial, uniform etc.

but how do I know the distribution of the data?

You try fitting several distributions and keep the distribution with the best fit. BUT…

Writing functions

STEPS for writing functions:

step 1: formulate the problem

step 2: solve the problem

step 3: implement your algorithm in R

NOTA-BENE: You can build your function on top of powerful functions available in base R or in libraries!

compute a series

write a function that calculates

\[ f(a,n) = \sum_{i=1}^n \frac{1}{i^a}\]

for $a=2$, this is:

\[(1/1)^2 + (1/2)^2 + (1/3)^2 + (1/4)^2 + \ldots\]

compute $ $

does this remind you of any well-known number?

My solution

sa=function(a,n){
  return(sum(seq(1,n)^(-2)))
}

n=20
print(sqrt( 6.*sa(2,n)))

## [1] 3.09467

n=100000
print(sqrt( 6.*sa(2,n)))

## [1] 3.141583

pi

## [1] 3.141593

some amazing speed effects

#for (n in (20:100000)){
#s = sqrt( 6.*sa(2,n))
#if(n%%10000 == 0){print(c(s,s-pi))}
#}

for (n in (20:100000)){
if(n%%10000 == 0){print(sqrt( 6.*sa(2,n)))}
}

## [1] 3.141497
## [1] 3.141545
## [1] 3.141561
## [1] 3.141569
## [1] 3.141574
## [1] 3.141577
## [1] 3.141579
## [1] 3.141581
## [1] 3.141582
## [1] 3.141583

compute the agm

You all know that given two numbers, $x,y$, we can define several types of averages:

\[A(x,y)=1/2(x+y)\] \[G(x,y)=\sqrt{x y}= (xy)^{1/2}\] Note that …

\[H(x,y)=\frac{1}{\frac{1}{x} + \frac{1}{y}}\]

Suppose I start with two real numbers

\[a=a_0,b=b_0\]

Now I define a sequence of points

\[a_{n+1}=A(a_n,b_n), b_{n+1}=G(a_n,b_n)\]

Can you write a function that calculates $a_N,b_N$ for any given rank $N$?

#agmf=function(arguments here){
   #
   #return()
#}

The arithmetic-geometric mean is defined as the common limit of these two sequences

FUN: Here is a code found online that plays with the AGM function.

Can you guess what it does?

library(numbers)

##  Gauss constant
1 / agm(1, sqrt(2))  # 0.834626841674073

## [1] 0.8346268

##  Calculate the (elliptic) integral 2/pi \int_0^1 dt / sqrt(1 - t^4)
f <- function(t) 1 / sqrt(1-t^4)
2 / pi * integrate(f, 0, 1)$value

## [1] 0.8346268

1 / agm(1, sqrt(2))

## [1] 0.8346268

##  Calculate pi with quadratic convergence (modified AGM)
#   See algorithm 2.1 in Borwein and Borwein
y <- sqrt(sqrt(2))
x <- (y+1/y)/2
p <- 2+sqrt(2)
for (i in 1:6){
  cat(format(p, digits=16), "\n")
  p <- p * (1+x) / (1+y)
  s <- sqrt(x)
  y <- (y*s + 1/s) / (1+y)
  x <- (s+1/s)/2
  }

## 3.414213562373095 
## 3.142606753941623 
## 3.141592660966044 
## 3.141592653589793 
## 3.141592653589793 
## 3.141592653589793

## Not run: 
##  Calculate pi with arbitrary precision using the Rmpfr package
require("Rmpfr")

## Loading required package: Rmpfr

## Loading required package: gmp

## 
## Attaching package: 'gmp'

## The following objects are masked from 'package:base':
## 
##     %*%, apply, crossprod, matrix, tcrossprod

## C code of R package 'Rmpfr': GMP using 64 bits per limb

## 
## Attaching package: 'Rmpfr'

## The following object is masked from 'package:gmp':
## 
##     outer

## The following objects are masked from 'package:stats':
## 
##     dbinom, dgamma, dnorm, dpois, pnorm

## The following objects are masked from 'package:base':
## 
##     cbind, pmax, pmin, rbind

vpa <- function(., d = 32) mpfr(., precBits = 4*d)
# Function to compute \pi to d decimal digits accuracy, based on the 
# algebraic-geometric mean, correct digits are doubled in each step.
agm_pi <- function(d) {
    a <- vpa(1, d)
    b <- 1/sqrt(vpa(2, d))
    s <- 1/vpa(4, d)
    p <- 1
    n <- ceiling(log2(d));
    for (k in 1:n) {
        c <- (a+b)/2
        b <- sqrt(a*b)
        s <- s - p * (c-a)^2
        p <- 2 * p
        a <- c
    }
    return(a^2/s)
}
d <- 64
pia <- agm_pi(d)
print(pia, digits = d)

## 1 'mpfr' number of precision  256   bits 
## [1] 3.141592653589793238462643383279502884197169399375105820974944592

# 3.141592653589793238462643383279502884197169399375105820974944592
# 3.1415926535897932384626433832795028841971693993751058209749445923 exact

## End(Not run)

solve the cubic quation

Write a function that prints one root f a cubic polynomial

link

\[x^{3}+px+q=0\]

\[t=x+{\frac {b}{3a}}\] \[p={\frac {3ac-b^{2}}{3a^{2}}}\] \[q= {\frac {2b^{3}-9abc+27a^{2}d}{27a^{3}}}\] \[4p^{3}+27q^{2}>0\]

\[{\sqrt[{3}]{-{\frac {q}{2}}+{\sqrt {{\frac {q^{2}}{4}}+{\frac {p^{3}}{27}}}}}}+{\sqrt[{3}]{-{\frac{q}{2}}-{\sqrt {{\frac {q^{2}}{4}}+{\frac {p^{3}}{27}}}}}}\]

\[ax^3+bx^2+cx+d=0\]

The discriminant of the general cubic

\[18\,abcd-4\,b^{3}d+b^{2}c^{2}-4\,ac^{3}-27\,a^{2}d^{2}\]

It is the product of \[a^{4}\] and the discriminant of the corresponding depressed cubic.

\[\Delta _{0} = b^{2}-3ac \] \[\Delta _{1} =2b^{3}-9abc+27a^{2}d\]

\[C={\sqrt[{3}]{\frac {\Delta _{1}\pm {\sqrt {\Delta _{1}^{2}-4\Delta _{0}^{3}}}}{2}}}\]

\[x=-{\frac {1}{3a}} (b+C+{\frac {\Delta _{0}}{C}} )\]

Can you write a function that accepts a depressed cubic as input and returns three things:

a message
one real root of a cubic polynomial?
the discriminant

#cubicroot=function(arguments here){
   #
   #return()
#}

#discriminant=
#return(NULL)}

approximate an integral

The integral of a function $f(x)$ can be approximated by a Riemann sum

\[\int_{a}^b f(x)dx \approx \sum_{i=1}^n f(x_i) dx_i\]

How would you write a function that calculates the riemann sm at any gvn anl f a fnctn?

\[\int_a^b g'(x)dx = g(b)-g(a)\]

if $f(x)=g'(x)$, $g$ is called a primitive of $f$

What if $g=sin$, what is $f$?

Hence

\[\int_a^b sin(x)dx = \cos(b)-\cos(a)\]

dumm=function(x,f){
 output=f(x)+2
 return(output)
  }
dumm(pi/2,sin)

## [1] 3

Implement a function that calculates the riemann sum at rank n of a function of your choice

What output might you want to return?

rsum=function(f,n){
# output=f(x)+2
# return(output)
  }
# how would you test your function?

bread=function(x){
# implement any function of your choice
# return()
  }
# how would you test your function?

butter=function(x){
if(x == 0 ){return(1)} else{
return(sin(x)/x)
}
}
# how would you test your function?
butter(0.01)

## [1] 0.9999833

what is the integral of butter?

rsum=function(f,n){
# output=f(x)+2
# return(output)
  }
# how would you test your function?

How about computing the area under an empirical (not analytical) curve?

x <- seq(-1., 1., by = 0.05)

#set.seed(100)
y = exp(-x^2/2) / sqrt(2*pi)

library(pracma)

## 
## Attaching package: 'pracma'

## The following objects are masked from 'package:Rmpfr':
## 
##     erf, erfc, hypot, zeta

## The following objects are masked from 'package:gmp':
## 
##     gcd, isprime

## The following objects are masked from 'package:numbers':
## 
##     mod, rem

area <- trapz(x, y)
area

## [1] 0.6825887

z <- seq(-10., 10., by = 0.05)

xs <- sample(z,20,replace=F)

x <- sort(xs)

#set.seed(100)
y = exp(-x^2/2) / sqrt(2*pi)

library(pracma)
area <- trapz(x, y)
area

## [1] 1.064928

Practice session:

Outline

Flow Control

Conditional execution

ifelse

ifelse is a compact version of if … else …

Syntax

Examples

Loops

Statistical functions

Writing functions

compute a series

compute the agm

solve the cubic quation

approximate an integral

Vectorized arithmetics