Data Science¶

What is data science?¶

Data science is a set of processes and systems for extracting knowledge or insights from data, whether structured or unstructured (Wikipedia). For the purposes of this document, it consists of managing, analyzing, and visualizing data in support of a machine learning workflow.

What is machine learning?¶

Machine learning: A branch of artificial intelligence in which machines improve their predictions by learning from large amounts of input data.

Main idea: Learning = estimating the underlying function $f$ that maps data attributes to some target value.

Training set: A set of labeled examples $(x,f(x))$ where $x$ is the vector of input variables and the label $f(x)$ is the observed true target value.

Goal: Given a training set, find an approximation $\hat{f}$ of $f$ that best generalizes, i.e. best predicts labels for new examples. "Best" is defined by some quality measure, for instance the error rate or the sum of squared errors.
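As a minimal sketch of this idea, the Python snippet below estimates $\hat{f}$ from a noisy training set and scores it on new examples using the sum of squared errors. The linear "true" function, the noise level, and the sample sizes are assumptions chosen only for illustration; in practice $f$ is unknown.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # Hypothetical underlying function; in a real problem f is unknown.
    return 2.0 * x + 1.0

# Training set: labeled examples (x, f(x)), observed with a little noise.
x_train = rng.uniform(0.0, 10.0, size=50)
y_train = f(x_train) + rng.normal(0.0, 0.5, size=50)

# Estimate f_hat as a straight line fitted by least squares.
# np.polyfit returns coefficients from the highest degree down.
slope, intercept = np.polyfit(x_train, y_train, deg=1)

def f_hat(x):
    return slope * x + intercept

# New examples that were not used for fitting.
x_new = rng.uniform(0.0, 10.0, size=20)
y_new = f(x_new) + rng.normal(0.0, 0.5, size=20)

# Quality measure: sum of squared errors on the new examples.
sse = float(np.sum((y_new - f_hat(x_new)) ** 2))
print(f"slope={slope:.2f}, intercept={intercept:.2f}, SSE={sse:.2f}")
```

The fitted slope and intercept should land close to the assumed values of 2 and 1, and the error on the new examples indicates how well $\hat{f}$ generalizes.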

Image:ML.png

Why machine learning¶

Difficulty in writing some programs

Too complex (facial recognition)
Too much data (stock market predictions)
Information only available dynamically (recommendation system)

Use of data for improvement

Humans are used to improving based on experience (data)

A lot of data available

Product recommendations
Fraud detection
Facial recognition
Language understanding

Types of machine learning¶

Supervised learning¶

A “teacher” provides training examples, each with its correct label. Covers regression and classification.
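A small classification sketch in Python may help make this concrete. It uses a nearest-centroid rule, which is just one simple supervised method picked for illustration; the two synthetic classes and their centers are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Labeled training examples: two classes scattered around different centers.
x_class0 = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(30, 2))
x_class1 = rng.normal(loc=[3.0, 3.0], scale=0.5, size=(30, 2))
x_train = np.vstack([x_class0, x_class1])
y_train = np.array([0] * 30 + [1] * 30)  # the labels supplied by the "teacher"

# Training: compute one centroid (mean point) per class.
centroids = np.array([x_train[y_train == c].mean(axis=0) for c in (0, 1)])

def predict(x):
    # Distance from every input row to every centroid, shape (n_rows, 2).
    distances = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=2)
    # Predicted label = index of the nearest centroid.
    return distances.argmin(axis=1)

x_new = np.array([[0.2, -0.1], [2.8, 3.3]])
print(predict(x_new))
```

A regression task follows the same pattern, except that the label is a continuous value rather than a class.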

Unsupervised learning¶

Correct labels are not available for the training examples, so the learner must find patterns in the data (e.g. using clustering). Example: grouping customers according to what books and movies they like.
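The customer-grouping example can be sketched with a bare-bones k-means clustering loop in Python. The scores below are hypothetical (how much each customer likes books and movies on a 0 to 10 scale), and the function `kmeans` is an illustrative helper rather than a library API.

```python
import numpy as np

def kmeans(points, k, n_iter=20, seed=0):
    """Plain k-means: alternate between assigning points and moving centroids."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points (fancy indexing returns a copy).
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign every point to its nearest centroid.
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Move each centroid to the mean of its members (skip empty clusters).
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return labels, centroids

# Hypothetical customers: columns are (likes books, likes movies).
customers = np.array(
    [[9, 1], [8, 2], [9, 2], [1, 9], [2, 8], [1, 8]], dtype=float
)
labels, centroids = kmeans(customers, k=2)
print(labels)
```

No labels are supplied anywhere; the two groups emerge from the data alone. Which group receives index 0 or 1 depends on the random initialization.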

Reinforcement learning¶

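The learner is not told which action is correct, but receives some reward or penalty after each action in a sequence. Example: learning how to play soccer.

Other variants exist, such as semi-supervised learning, which combines a small amount of labeled data with a larger amount of unlabeled data.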
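As a heavily simplified sketch of learning from rewards and penalties, the Python snippet below runs an epsilon-greedy strategy on a three-action problem. This is a bandit setting, a simplification of full reinforcement learning with no notion of state; the win probabilities, the +1/-1 rewards, and the function name `run_bandit` are assumptions made for illustration.

```python
import numpy as np

def run_bandit(win_probs, n_steps=2000, epsilon=0.1, seed=0):
    """Epsilon-greedy learner: mostly exploit the best-looking action, sometimes explore."""
    rng = np.random.default_rng(seed)
    n_actions = len(win_probs)
    value = np.zeros(n_actions)             # running average reward per action
    count = np.zeros(n_actions, dtype=int)  # how often each action was tried
    for _ in range(n_steps):
        if rng.random() < epsilon:
            action = int(rng.integers(n_actions))  # explore: random action
        else:
            action = int(np.argmax(value))         # exploit: best estimate so far
        # The environment answers with a reward (+1) or a penalty (-1) only.
        reward = 1.0 if rng.random() < win_probs[action] else -1.0
        count[action] += 1
        # Incremental update of the average reward for the chosen action.
        value[action] += (reward - value[action]) / count[action]
    return value, count

value, count = run_bandit([0.2, 0.5, 0.8])
print("estimated values:", np.round(value, 2))
print("preferred action:", int(np.argmax(value)))
```

The learner is never told which action is correct; it only sees rewards and penalties, yet its value estimates come to favor the action with the highest win probability.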

Data matters¶

image:data.png

Unlock the business value in the data collected
Prepare you to do data science projects and to implement production systems
Predict future events based on past data, enabling proactive rather than reactive change

The Data Science and ML workflow¶

Image:workflow.png

Concepts:

Dataset

Training set versus test set

Feature = attribute = independent variable = predictor

Label = target = outcome = class = dependent variable = response

Dimensionality = number of features

Model selection
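Several of these concepts can be shown together in a short Python sketch: a dataset with features and labels, its dimensionality, and a random split into a training set and a test set. The helper `split_dataset`, the 80/20 ratio, and the toy data are assumptions for illustration.

```python
import numpy as np

def split_dataset(features, labels, test_fraction=0.2, seed=0):
    """Shuffle the examples, then hold out a fraction of them as the test set."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(features))
    n_test = int(round(len(features) * test_fraction))
    test_idx, train_idx = order[:n_test], order[n_test:]
    return features[train_idx], features[test_idx], labels[train_idx], labels[test_idx]

# Hypothetical dataset: 10 examples, each with 3 features and 1 label.
features = np.arange(30, dtype=float).reshape(10, 3)
labels = np.arange(10)

x_train, x_test, y_train, y_test = split_dataset(features, labels)
print("dimensionality:", features.shape[1])
print("training examples:", len(x_train), "test examples:", len(x_test))
```

Model selection then typically means fitting candidate models on the training portion and comparing them on data they have not seen.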

Key Issues in ML¶

Data quality¶

Issues to watch for: consistency of the data, accuracy of the data, noisy data, missing data, outliers in the data, bias, variance.
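Two of these issues, missing data and outliers, can be checked with a few lines of Python. The readings below are made up, and the outlier rule (more than 5 median absolute deviations from the median) is one common heuristic chosen here as an assumption, not a universal threshold.

```python
import numpy as np

# Hypothetical readings; np.nan marks a missing value, 250.0 looks suspicious.
readings = np.array([21.0, 22.5, np.nan, 20.8, 250.0, 21.7, np.nan, 22.1])

# Missing data: count the gaps and work with the observed values only.
missing = np.isnan(readings)
print("missing values:", int(missing.sum()))
observed = readings[~missing]

# Outliers: flag values far from the median, measured in units of the
# median absolute deviation (MAD), which is itself robust to outliers.
median = np.median(observed)
mad = np.median(np.abs(observed - median))
is_outlier = np.abs(observed - median) > 5 * mad  # threshold is an assumption
clean = observed[~is_outlier]
print("outliers:", observed[is_outlier])

# One simple repair for the gaps: fill them with the median of the clean values.
filled = np.where(missing, np.median(clean), readings)
print("filled:", filled)
```

Whether to drop, fill, or keep such values depends on the application; the snippet only shows how they can be detected.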

Model quality¶

Image:modelquality.png

Overfitting¶

Failure to generalize: The model performs well on the training set but poorly on the test set.

Typically indicates that the model is too flexible for the amount of training data.

Flexibility allows it to “memorize” the data, including noise.

Corresponds to high variance: small changes in the training data lead to big changes in the results.

Underfitting¶

Failure to capture important patterns in the training data set.

Typically indicates that the model is too simple or that there are too few explanatory variables.

Not flexible enough to model real patterns.

Corresponds to high bias: the results show a systematic lack of fit in certain regions.
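Underfitting and overfitting can be compared side by side by fitting polynomials of increasing flexibility to the same noisy data. In the Python sketch below, the sine-shaped target, the noise level, and the degrees 1, 3, and 12 are assumptions chosen for illustration.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def target(x):
    # Hypothetical true pattern behind the data.
    return np.sin(2 * np.pi * x)

# A small noisy training set and a larger noisy test set.
x_train = np.linspace(0.0, 1.0, 15)
y_train = target(x_train) + rng.normal(0.0, 0.2, size=x_train.size)
x_test = rng.uniform(0.0, 1.0, size=200)
y_test = target(x_test) + rng.normal(0.0, 0.2, size=x_test.size)

def mse(model, x, y):
    # Mean squared error of a fitted model on the given examples.
    return float(np.mean((model(x) - y) ** 2))

errors = {}
for degree in (1, 3, 12):
    # Least-squares polynomial fit; a higher degree means a more flexible model.
    model = Polynomial.fit(x_train, y_train, degree)
    errors[degree] = (mse(model, x_train, y_train), mse(model, x_test, y_test))
    print(f"degree {degree:2d}: train MSE={errors[degree][0]:.4f}, "
          f"test MSE={errors[degree][1]:.4f}")
```

The straight line (degree 1) is too simple for the sine shape, so its error stays high on both sets, which is the underfitting pattern. The degree-12 polynomial has almost as many coefficients as there are training points, so its training error is very small while its test error is typically much larger, which is the overfitting pattern.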

Computation speed and scalability¶

Use scalable or distributed computing resources, such as Amazon SageMaker or Amazon EC2 instances, for training in order to increase speed and to cope with the time complexity of prediction and the space complexity of large datasets and models.