2017-01-25

Statistical Learning

\(Y = f(X) + \epsilon\)

  • Inputs X: predictors / independent variables / features

  • Outputs Y: response / dependent variable

  • \(\epsilon\): a random error term

  • f: a function that represents the systematic information that X provides about Y.

  • Statistical Learning refers to a set of approaches for estimating f.

Data set (X,Y)

  • Training data: the data set on which the model is built.

  • Test data: the data on which we apply our model and check its accuracy.
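A minimal sketch of separating a data set into training and test data, assuming NumPy arrays of observations (the 80/20 ratio and toy data are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data set (X, Y): 100 observations, 2 predictors.
X = rng.normal(size=(100, 2))
Y = rng.normal(size=100)

# Shuffle the indices, then hold out the last 20% as test data.
idx = rng.permutation(len(X))
split = int(0.8 * len(X))
train_idx, test_idx = idx[:split], idx[split:]

X_train, Y_train = X[train_idx], Y[train_idx]  # build the model here
X_test, Y_test = X[test_idx], Y[test_idx]      # check accuracy here
```

In practice a library helper such as scikit-learn's `train_test_split` does the same thing in one call.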

Supervised learning vs Unsupervised learning

  • Supervised learning is the machine learning task of inferring a function from labeled training data.

  • Unsupervised machine learning is the machine learning task of inferring a function to describe hidden structure from unlabeled data.

Regression vs Classification

  • Regression is used to predict continuous values.
  • Classification is used to predict which class a data point is part of (discrete value).

Parametric methods vs Non-parametric methods

  • Parametric methods summarize data with a set of parameters of fixed size.
    • First select a form for the function, then train the model to estimate its parameters.

  • Non-parametric methods do not make explicit assumptions about the form of the function.
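A minimal sketch contrasting the two, assuming one predictor and a linear true f: the parametric model reduces the data to two numbers, while the non-parametric KNN estimator keeps the whole training set (the data, true coefficients, and k = 5 are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=50)
y = 2.0 + 3.0 * x + rng.normal(scale=1.0, size=50)  # true f is linear

# Parametric: assume f(X) = beta0 + beta1 * X, then estimate the
# fixed-size parameter set (beta0, beta1) by least squares.
A = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)

# Non-parametric: a k-nearest-neighbour average makes no explicit
# assumption about f's form; the "model" is the training data itself.
def knn_predict(x0, k=5):
    nearest = np.argsort(np.abs(x - x0))[:k]
    return y[nearest].mean()
```

The estimated `beta` should land near the true (2, 3), and `knn_predict` tracks the same line without ever assuming it is one.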

Assessing model accuracy

Training error vs Test error

  • Training error is the error we get by applying the model to the same data on which it was trained.

  • Test error is the error that we incur on new data.

  • KNN classification: the training error rate keeps falling as flexibility grows (it is zero at k = 1), while the test error rate is typically minimized at some intermediate k.

  • Regression: similarly, the training MSE decreases monotonically with flexibility, while the test MSE is U-shaped.
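A minimal sketch of why training error is an optimistic estimate, using KNN classification on two toy Gaussian classes (the data and distance metric are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)

# Two Gaussian classes in 2-D, 50 points each.
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

def knn_classify(X_train, y_train, x0, k):
    # Majority vote among the k nearest training points.
    d = np.linalg.norm(X_train - x0, axis=1)
    votes = y_train[np.argsort(d)[:k]]
    return np.bincount(votes).argmax()

def error_rate(X_train, y_train, X_eval, y_eval, k):
    preds = [knn_classify(X_train, y_train, x0, k) for x0 in X_eval]
    return np.mean(np.array(preds) != y_eval)

# Evaluating on the training data itself is optimistic: with k = 1,
# each point's nearest neighbour is itself, so the error is exactly 0.
print(error_rate(X, y, X, y, k=1))  # prints 0.0
```

The test error, measured on fresh draws from the same two Gaussians, would stay well above zero because the classes overlap.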

The Bias-Variance Trade-Off

\(E[(y_0 - \hat{f}(x_0))^2] = Var(\hat{f}(x_0)) + [Bias(\hat{f}(x_0))]^2 + Var(\epsilon)\)

  • Variance refers to the amount by which \(\hat{f}\) would change if we estimated it using a different training data set.
    • In general, more flexible statistical methods have higher variance.
  • Bias refers to the error that is introduced by approximating a real-life problem, which may be extremely complicated, by a much simpler model.
    • Generally, more flexible methods result in less bias.
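The decomposition above can be checked by simulation: fit an estimator on many fresh training sets and look at the spread and the average of its predictions at a fixed point \(x_0\). A minimal sketch, where the true f, noise level, and the two estimators (a constant mean vs. 1-nearest-neighbour) are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(3)
f = lambda x: np.sin(x)   # true f
x0, sigma = 1.0, 0.3      # fixed test point and noise sd

def fit_mean(x, y):
    # Inflexible estimator: always predict the overall mean
    # (high bias, low variance).
    return lambda x_new: y.mean()

def fit_1nn(x, y):
    # Flexible estimator: 1-nearest neighbour
    # (low bias, high variance).
    return lambda x_new: y[np.abs(x - x_new).argmin()]

def simulate(fit, n_sets=2000, n=30):
    # Estimate Bias^2 and Var of f_hat(x0) over many training sets.
    preds = []
    for _ in range(n_sets):
        x = rng.uniform(-3, 3, n)
        y = f(x) + rng.normal(0, sigma, n)
        preds.append(fit(x, y)(x0))
    preds = np.array(preds)
    return (preds.mean() - f(x0)) ** 2, preds.var()
```

Running `simulate(fit_mean)` and `simulate(fit_1nn)` shows the trade-off numerically: the flexible 1-NN estimator has much lower squared bias but higher variance than the constant-mean estimator.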