2017-01-25

Statistical Learning

\(Y = f(X) + \epsilon\)

  • Inputs X: predictors / independent variables / features

  • Outputs Y: response / dependent variable

  • \(\epsilon\): a random error term

  • f: a function that represents the systematic information that X provides about Y.

  • Statistical Learning refers to a set of approaches for estimating f.

Data set (X,Y)

  • Training data: the data set on which the model is built.

  • Test data: the data on which we apply our model and check its accuracy.
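A minimal sketch of separating a data set into training and test data, assuming NumPy arrays of observations (the 80/20 ratio and toy data are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data set (X, Y): 100 observations, 2 predictors.
X = rng.normal(size=(100, 2))
Y = rng.normal(size=100)

# Shuffle the indices, then hold out the last 20% as test data.
idx = rng.permutation(len(X))
split = int(0.8 * len(X))
train_idx, test_idx = idx[:split], idx[split:]

X_train, Y_train = X[train_idx], Y[train_idx]  # build the model here
X_test, Y_test = X[test_idx], Y[test_idx]      # check accuracy here
```

In practice a library helper such as scikit-learn's `train_test_split` does the same thing in one call.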

Supervised learning vs Unsupervised learning

  • Supervised learning is the machine learning task of inferring a function from labeled training data.

  • Unsupervised machine learning is the machine learning task of inferring a function to describe hidden structure from unlabeled data.

Regression vs Classification

  • Regression is used to predict continuous values.
  • Classification is used to predict which class a data point is part of (discrete value).

Parametric methods vs Non-parametric methods

  • Parametric methods summarize data with a set of parameters of fixed size.
    • First select a form for the function, then train the model to estimate its parameters.

  • Non-parametric methods do not make explicit assumptions about the form of the function.
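A minimal sketch contrasting the two, assuming one predictor and a linear true f: the parametric model reduces the data to two numbers, while the non-parametric KNN estimator keeps the whole training set (the data, true coefficients, and k = 5 are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=50)
y = 2.0 + 3.0 * x + rng.normal(scale=1.0, size=50)  # true f is linear

# Parametric: assume f(X) = beta0 + beta1 * X, then estimate the
# fixed-size parameter set (beta0, beta1) by least squares.
A = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)

# Non-parametric: a k-nearest-neighbour average makes no explicit
# assumption about f's form; the "model" is the training data itself.
def knn_predict(x0, k=5):
    nearest = np.argsort(np.abs(x - x0))[:k]
    return y[nearest].mean()
```

The estimated `beta` should land near the true (2, 3), and `knn_predict` tracks the same line without ever assuming it is one.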

Assessing model accuracy

Training error vs Test error

  • Training error is the error we get by applying the model to the same data on which it was trained.

  • Test error is the error that we incur on new data.

  • KNN classification: the training error rate keeps falling as flexibility grows (it is zero at k = 1), while the test error rate is typically minimized at some intermediate k.

  • Regression: similarly, the training MSE decreases monotonically with flexibility, while the test MSE is U-shaped.
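A minimal sketch of why training error is an optimistic estimate, using KNN classification on two toy Gaussian classes (the data and distance metric are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)

# Two Gaussian classes in 2-D, 50 points each.
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

def knn_classify(X_train, y_train, x0, k):
    # Majority vote among the k nearest training points.
    d = np.linalg.norm(X_train - x0, axis=1)
    votes = y_train[np.argsort(d)[:k]]
    return np.bincount(votes).argmax()

def error_rate(X_train, y_train, X_eval, y_eval, k):
    preds = [knn_classify(X_train, y_train, x0, k) for x0 in X_eval]
    return np.mean(np.array(preds) != y_eval)

# Evaluating on the training data itself is optimistic: with k = 1,
# each point's nearest neighbour is itself, so the error is exactly 0.
print(error_rate(X, y, X, y, k=1))  # prints 0.0
```

The test error, measured on fresh draws from the same two Gaussians, would stay well above zero because the classes overlap.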

The Bias-Variance Trade-Off

\(E[(y_0 - \hat{f}(x_0))^2] = Var(\hat{f}(x_0)) + [Bias(\hat{f}(x_0))]^2 + Var(\epsilon)\)

  • Variance refers to the amount by which \(\hat{f}\) would change if we estimated it using a different training data set.
    • In general, more flexible statistical methods have higher variance.
  • Bias refers to the error that is introduced by approximating a real-life problem, which may be extremely complicated, by a much simpler model.
    • Generally, more flexible methods result in less bias.
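The decomposition above can be checked by simulation: fit an estimator on many fresh training sets and look at the spread and the average of its predictions at a fixed point \(x_0\). A minimal sketch, where the true f, noise level, and the two estimators (a constant mean vs. 1-nearest-neighbour) are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(3)
f = lambda x: np.sin(x)   # true f
x0, sigma = 1.0, 0.3      # fixed test point and noise sd

def fit_mean(x, y):
    # Inflexible estimator: always predict the overall mean
    # (high bias, low variance).
    return lambda x_new: y.mean()

def fit_1nn(x, y):
    # Flexible estimator: 1-nearest neighbour
    # (low bias, high variance).
    return lambda x_new: y[np.abs(x - x_new).argmin()]

def simulate(fit, n_sets=2000, n=30):
    # Estimate Bias^2 and Var of f_hat(x0) over many training sets.
    preds = []
    for _ in range(n_sets):
        x = rng.uniform(-3, 3, n)
        y = f(x) + rng.normal(0, sigma, n)
        preds.append(fit(x, y)(x0))
    preds = np.array(preds)
    return (preds.mean() - f(x0)) ** 2, preds.var()
```

Running `simulate(fit_mean)` and `simulate(fit_1nn)` shows the trade-off numerically: the flexible 1-NN estimator has much lower squared bias but higher variance than the constant-mean estimator.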