Machine learning algorithms

Supervised methods

Linear methods

Linear methods are parametric methods where the learned function has the form f(x) = \phi\left( w^T x \right), where \phi(\cdot) is some activation function.

Generally optimized by learning the weights with (stochastic) gradient descent to minimize a loss function, e.g. \sum_i \lvert \hat{y}_i - y_i \rvert^2
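A rough sketch of this optimization, assuming the simplest case of an identity activation \phi and full-batch gradient descent on the squared loss; the synthetic data, learning rate, and iteration count below are arbitrary choices for illustration:

import numpy as np

# Toy data: y is roughly 2*x + 1 plus Gaussian noise
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 2 * x + 1 + rng.normal(scale=0.1, size=100)

# Design matrix with a bias column, so w = [w0, w1]
X = np.column_stack([np.ones_like(x), x])
w = np.zeros(2)

lr = 0.1  # learning rate (arbitrary)
for _ in range(500):
    y_hat = X @ w                          # f(x) = w^T x with identity activation
    grad = 2 * X.T @ (y_hat - y) / len(y)  # gradient of the mean squared error
    w -= lr * grad                         # gradient descent step

print(w)  # should end up close to [1, 2]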

Simple; a good place to start for a new problem, at least as a baseline

Methods: linear regression for a numeric target outcome, logistic regression for a categorical target outcome.

Linear regression (univariate)

Image:univariate.png

Models the relation between a single feature (explanatory variable x) and a real-valued response (target variable y)

Given data (x, y) and a line defined by w_0 (intercept) and w_1 (slope), the vertical offset of each data point from the line is the error between the true label y and the prediction \hat{y} = w_0 + w_1x

The best line minimizes the sum of squared errors (SSE)

We usually assume the error is Gaussian distributed with mean zero and fixed variance
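A minimal sketch of fitting such a line by minimizing the SSE, using the standard closed-form least-squares estimates; the synthetic data and noise level are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=50)
y = 3.0 * x + 2.0 + rng.normal(scale=1.0, size=50)  # zero-mean Gaussian error

# Closed-form least-squares estimates for slope and intercept
w1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
w0 = y.mean() - w1 * x.mean()

sse = np.sum((y - (w0 + w1 * x)) ** 2)  # sum of squared errors at the optimum
print(w0, w1, sse)                      # w0 near 2.0, w1 near 3.0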

Linear regression (multivariate)

Multiple linear regression includes N explanatory variables (N \geq 2), with x_0 = 1 so that w_0 is the intercept:

y = w_0x_0 + w_1x_1 + \cdots + w_Nx_N = \sum_{i=0}^{N} w_ix_i

Sensitive to correlation (collinearity) between features, which results in high variance of the estimated coefficients.
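A rough illustration of that sensitivity, assuming two nearly identical synthetic features; re-drawing the noise shows the individual coefficients swinging widely even though their sum stays near the true value:

import numpy as np

rng = np.random.default_rng(2)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)      # x2 is almost identical to x1
X = np.column_stack([np.ones(n), x1, x2])

for trial in range(3):
    y = 1.0 + x1 + x2 + rng.normal(scale=1.0, size=n)  # true weights: 1, 1, 1
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    # w[1] and w[2] vary wildly between trials, but their sum stays close to 2
    print(trial, np.round(w, 2), round(w[1] + w[2], 2))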

scikit-learn implementation:

sklearn.linear_model.LinearRegression
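A minimal usage sketch with made-up data and default settings:

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
X = rng.uniform(-1, 1, size=(100, 2))     # two explanatory variables
y = 4.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

model = LinearRegression()    # fits the intercept by default
model.fit(X, y)

print(model.intercept_, model.coef_)   # roughly 4.0 and [2.0, -3.0]
print(model.predict([[0.5, 0.5]]))     # prediction for a new observation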

Logistic regression

Example: predict whether a credit card transaction is fraudulent

Image:fraud.png

Estimates the probability of the input belonging to one of the two classes: positive and negative.

Vulnerable to outliers in training data.

Relation to linear model:

\sigma(z) = \frac{1}{1+\mathrm{e}^{-z}}

z is a trained multivariate linear function of the input x

\phi = \sigma is a fixed univariate function (not trained)

Objective function to maximize: the probability (likelihood) of the true training labels.
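A small numpy sketch of these pieces: the fixed sigmoid, the linear score z, and the log-likelihood of the true labels that training maximizes over w; the weights and examples below are made up for illustration:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Made-up weights and two training examples with labels
w = np.array([-1.0, 2.0, 0.5])       # [w0, w1, w2]
X = np.array([[1.0, 0.3, 1.2],       # first column is the constant x_0 = 1
              [1.0, 2.0, -0.4]])
y = np.array([0, 1])

z = X @ w                            # trained multivariate linear function
p = sigmoid(z)                       # probability of label 1

# Log-likelihood of the true labels; training chooses w to maximize this
log_likelihood = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
print(p, log_likelihood)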

Sigmoid curve. Image:sigmoid.png

Models the relation between the features (explanatory variables x) and a binary response (y=1 or y=0)

For all features, define the linear combination:

z = w^T x = w_0 + w_1x_1 + \cdots + w_Nx_N

Define the probability of y=1 given x as p, and define the logit of p as:

logit(p) = \log \frac{p}{1-p}

Logistic regression finds the best weight vector w by fitting the training data so that

logit(p(y=1 \vert x)) = z
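Inverting this relation recovers the sigmoid, so the predicted probability is just \sigma applied to the linear score:

logit(p) = z \implies \frac{p}{1-p} = \mathrm{e}^{z} \implies p = \frac{\mathrm{e}^{z}}{1+\mathrm{e}^{z}} = \frac{1}{1+\mathrm{e}^{-z}} = \sigma(z)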

Then, for a new observation, you can use the logistic function \phi(z) to calculate the probability of label 1. If it is larger than a threshold (for example 0.5), you predict the positive label for the new observation.

sklearn.linear_model.LogisticRegression
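A minimal usage sketch with synthetic binary labels and default settings; predict_proba gives the probability of label 1, and predict() picks the most probable class, which for two classes corresponds to the 0.5 threshold mentioned above:

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 2 * X[:, 1] > 0).astype(int)   # synthetic binary labels

clf = LogisticRegression()    # learns the weights w and the intercept w0
clf.fit(X, y)

p = clf.predict_proba([[0.2, 0.4]])[:, 1]     # probability of label 1
print(p, (p > 0.5).astype(int))               # threshold at 0.5 -> predicted label
print(clf.predict([[0.2, 0.4]]))              # same result via predict()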

Linearly separable versus non-linearly separable

Image:separable.png