Machine learning algorithms

Supervised methods

Linear methods

Linear methods are parametric methods where the learned function has the form f(x) = \phi\left( w^T x \right), where \phi(\cdot) is some activation function.

Generally optimized by learning the weights with (stochastic) gradient descent to minimize a loss function, e.g. \sum_i \lvert \hat{y}_i - y_i \rvert^2
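A rough sketch of this optimization, assuming the simplest case of an identity activation \phi and full-batch gradient descent on the squared loss; the synthetic data, learning rate, and iteration count below are arbitrary choices for illustration:

import numpy as np

# Toy data: y is roughly 2*x + 1 plus Gaussian noise
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 2 * x + 1 + rng.normal(scale=0.1, size=100)

# Design matrix with a bias column, so w = [w0, w1]
X = np.column_stack([np.ones_like(x), x])
w = np.zeros(2)

lr = 0.1  # learning rate (arbitrary)
for _ in range(500):
    y_hat = X @ w                          # f(x) = w^T x with identity activation
    grad = 2 * X.T @ (y_hat - y) / len(y)  # gradient of the mean squared error
    w -= lr * grad                         # gradient descent step

print(w)  # should end up close to [1, 2]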

Simple; a good place to start for a new problem, at least as a baseline

Methods: linear regression for a numeric target outcome, logistic regression for a categorical target outcome.

Linear regression (univariate)

Image:univariate.png

Models the relation between a single feature (explanatory variable x) and a real-valued response (target variable y)

Given data (x, y) and a line defined by w_0 (intercept) and w_1 (slope), the vertical offset of each data point from the line is the error between the true label y and the prediction \hat{y} = w_0 + w_1x

The best line minimizes the sum of squared errors (SSE)

We usually assume the error is Gaussian distributed with mean zero and fixed variance
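A minimal sketch of fitting such a line by minimizing the SSE, using the standard closed-form least-squares estimates; the synthetic data and noise level are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=50)
y = 3.0 * x + 2.0 + rng.normal(scale=1.0, size=50)  # zero-mean Gaussian error

# Closed-form least-squares estimates for slope and intercept
w1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
w0 = y.mean() - w1 * x.mean()

sse = np.sum((y - (w0 + w1 * x)) ** 2)  # sum of squared errors at the optimum
print(w0, w1, sse)                      # w0 near 2.0, w1 near 3.0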

Linear regression (multivariate)

Multiple linear regression includes N explanatory variables (N \geq 2), with x_0 = 1 so that w_0 is the intercept:

y = w_0x_0 + w_1x_1 + \cdots + w_Nx_N = \sum_{i=0}^{N} w_ix_i

Sensitive to correlation (collinearity) between features, which results in high variance of the estimated coefficients.
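A rough illustration of that sensitivity, assuming two nearly identical synthetic features; re-drawing the noise shows the individual coefficients swinging widely even though their sum stays near the true value:

import numpy as np

rng = np.random.default_rng(2)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)      # x2 is almost identical to x1
X = np.column_stack([np.ones(n), x1, x2])

for trial in range(3):
    y = 1.0 + x1 + x2 + rng.normal(scale=1.0, size=n)  # true weights: 1, 1, 1
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    # w[1] and w[2] vary wildly between trials, but their sum stays close to 2
    print(trial, np.round(w, 2), round(w[1] + w[2], 2))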

scikit-learn implementation:

sklearn.linear_model.LinearRegression
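A minimal usage sketch with made-up data and default settings:

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
X = rng.uniform(-1, 1, size=(100, 2))     # two explanatory variables
y = 4.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

model = LinearRegression()    # fits the intercept by default
model.fit(X, y)

print(model.intercept_, model.coef_)   # roughly 4.0 and [2.0, -3.0]
print(model.predict([[0.5, 0.5]]))     # prediction for a new observation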

Logistic regression

Example: predict whether a credit card transaction is fraudulent

Image:fraud.png

Estimates the probability of the input belonging to one of the two classes: positive and negative.

Vulnerable to outliers in training data.

Relation to linear model:

\sigma(z) = \frac{1}{1+\mathrm{e}^{-z}}

z is a trained multivariate linear function of the input x

\phi = \sigma is a fixed univariate function (not trained)

Objective function to maximize: the probability (likelihood) of the true training labels.
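A small numpy sketch of these pieces: the fixed sigmoid, the linear score z, and the log-likelihood of the true labels that training maximizes over w; the weights and examples below are made up for illustration:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Made-up weights and two training examples with labels
w = np.array([-1.0, 2.0, 0.5])       # [w0, w1, w2]
X = np.array([[1.0, 0.3, 1.2],       # first column is the constant x_0 = 1
              [1.0, 2.0, -0.4]])
y = np.array([0, 1])

z = X @ w                            # trained multivariate linear function
p = sigmoid(z)                       # probability of label 1

# Log-likelihood of the true labels; training chooses w to maximize this
log_likelihood = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
print(p, log_likelihood)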

Sigmoid curve. Image:sigmoid.png

Models the relation between the features (explanatory variables x) and a binary response (y=1 or y=0)

For all features, define the linear combination:

z = w^T x = w_0 + w_1x_1 + \cdots + w_Nx_N

Define the probability of y=1 given x as p, and define the logit of p as:

logit(p) = \log \frac{p}{1-p}

Logistic regression finds the best weight vector w by fitting the training data so that

logit(p(y=1 \vert x)) = z
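Inverting this relation recovers the sigmoid, so the predicted probability is just \sigma applied to the linear score:

logit(p) = z \implies \frac{p}{1-p} = \mathrm{e}^{z} \implies p = \frac{\mathrm{e}^{z}}{1+\mathrm{e}^{z}} = \frac{1}{1+\mathrm{e}^{-z}} = \sigma(z)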

Then, for a new observation, you can use the logistic function \phi(z) to calculate the probability of label 1. If it is larger than a threshold (for example 0.5), you predict the positive label for the new observation.

sklearn.linear_model.LogisticRegression
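A minimal usage sketch with synthetic binary labels and default settings; predict_proba gives the probability of label 1, and predict() picks the most probable class, which for two classes corresponds to the 0.5 threshold mentioned above:

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 2 * X[:, 1] > 0).astype(int)   # synthetic binary labels

clf = LogisticRegression()    # learns the weights w and the intercept w0
clf.fit(X, y)

p = clf.predict_proba([[0.2, 0.4]])[:, 1]     # probability of label 1
print(p, (p > 0.5).astype(int))               # threshold at 0.5 -> predicted label
print(clf.predict([[0.2, 0.4]]))              # same result via predict()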

Linearly separable versus non-linearly separable

Image:separable.png