INTRO TO DATA SCIENCE
LECTURE 4: INTRO TO ML & KNN CLASSIFICATION
Francesco Mosconi
DAT10 SF // October 15, 2014
DATA SCIENCE IN THE NEWS
Source: http://f1metrics.wordpress.com/2014/10/03/building-a-race-simulator/
DATA SCIENCE IN THE NEWS
Source: http://www.pyimagesearch.com/2014/10/13/deep-learning-amazon-ec2-gpu-python-nolearn/
RECAP
LAST TIME
‣ Cleaning data
‣ Dealing with missing data
‣ Setting up GitHub for homework
QUESTIONS?
AGENDA
I. WHAT IS MACHINE LEARNING?
II. CLASSIFICATION PROBLEMS
III. BUILDING EFFECTIVE CLASSIFIERS
IV. THE KNN CLASSIFICATION MODEL
EXERCISES:
V. LAB: KNN CLASSIFICATION IN PYTHON
VI. BONUS LAB: VISUALIZATION WITH MATPLOTLIB (IF TIME ALLOWS)
I. WHAT IS MACHINE LEARNING?
WHAT IS MACHINE LEARNING?
from Wikipedia:
“Machine learning, a branch of artificial intelligence, is about the construction and study of systems that can learn from data.”

“The core of machine learning deals with representation and generalization…”
‣ representation – extracting structure from data
‣ generalization – making predictions from data

source: http://en.wikipedia.org/wiki/Machine_learning
REMEMBER THIS?
source: http://www.dataists.com/2010/09/the-data-science-venn-diagram/
WE ARE NOW HERE
source: http://www.dataists.com/2010/09/the-data-science-venn-diagram/
WE WANT TO GO HERE
source: http://www.dataists.com/2010/09/the-data-science-venn-diagram/
QUESTION
What does it take to make this jump?
ANSWER: PROBLEM SOLVING!
NOTE
Implementing solutions to ML problems is the focus of this course!
THE STRUCTURE OF MACHINE LEARNING PROBLEMS
REMEMBER WHAT WE SAID BEFORE?
Supervised → making predictions (generalization)
Unsupervised → extracting structure (representation)
TYPES OF LEARNING PROBLEMS - SUPERVISED EXAMPLE
TYPES OF LEARNING PROBLEMS - UNSUPERVISED EXAMPLE
TYPES OF DATA
Continuous (quantitative) vs. Categorical (qualitative)
NOTE
The space where data live is called the feature space. Each point in this space is called a record.
TYPES OF ML SOLUTIONS

              Continuous           Categorical
Supervised    regression           classification
Unsupervised  dimension reduction  clustering
NOTE
We will implement solutions using models and algorithms. Each will fall into one of these four buckets.
QUESTION
What is the goal of machine learning?
REMEMBER WHAT WE SAID BEFORE?
Supervised → making predictions; Unsupervised → extracting structure

ANSWER
The goal is determined by the type of problem.
QUESTION
How do you determine the right approach?
TYPES OF ML SOLUTIONS

              Continuous           Categorical
Supervised    regression           classification
Unsupervised  dimension reduction  clustering

ANSWER
The right approach is determined by the desired solution.
NOTE
All of this depends on your data!
QUESTION
What do you do with your results?
THE DATA SCIENCE WORKFLOW
source: http://benfry.com/phd/dissertation-110323c.pdf

ANSWER
Interpret them and react accordingly.
NOTE
This also relies on your problem-solving skills!
II. CLASSIFICATION PROBLEMS
CLASSIFICATION PROBLEMS

              Continuous           Categorical
Supervised    regression           classification
Unsupervised  dimension reduction  clustering
CLASSIFICATION PROBLEMS
Here’s (part of) an example dataset:
[table: several columns of independent variables, plus a column of class labels (qualitative)]
CLASSIFICATION PROBLEMS
Q: What does “supervised” mean?
A: We know the class labels (qualitative).
CLASSIFICATION PROBLEMS
Q: How does a classification problem work?
A: Data in, predicted labels out.
source: http://www-users.cs.umn.edu/~kumar/dmbook/ch4.pdf
CLASSIFICATION PROBLEMS
Q: What steps does a classification problem require?
1) Split the dataset into a training set and a test set.
2) Train the model on the training set.
3) Test the model on the test set.
4) Make predictions on new data.

NOTE
This new data is called out-of-sample data. We don’t know the labels for these OOS records!
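The four steps above can be sketched in plain Python. This is a minimal sketch with an invented toy dataset and a deliberately trivial stand-in "model" (majority-class prediction); the lab will use a real classifier.

```python
import random

# Hypothetical toy dataset: each record is (features, label)
dataset = [((x, x % 3), "odd" if x % 2 else "even") for x in range(100)]

# 1) Split the dataset into training and test sets (70/30)
random.seed(0)
random.shuffle(dataset)
split = int(0.7 * len(dataset))
training_set, test_set = dataset[:split], dataset[split:]

# 2) "Train" a trivial model on the training set:
#    always predict the most common training label
labels = [label for _, label in training_set]
model = max(set(labels), key=labels.count)

# 3) Test the model: fraction of test records it labels correctly
correct = sum(1 for _, label in test_set if label == model)
print("generalization accuracy:", correct / len(test_set))

# 4) Make a prediction for a new, out-of-sample record
print("prediction for new record:", model)
```

A real model would, of course, use the features of each record rather than just the label frequencies, but the split/train/test/predict skeleton is the same.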
III. BUILDING EFFECTIVE CLASSIFIERS
BUILDING EFFECTIVE CLASSIFIERS
Q: What types of prediction error will we run into?
1) Training error – the error the model makes on the training set.
2) Generalization error – the error the model makes on the test set.
3) OOS error – the error the model makes on new, out-of-sample data.

NOTE
We want to estimate OOS prediction error so we know what to expect from our model.
TRAINING ERROR
Q: Why should we use training & test sets?

Thought experiment: Suppose instead, we train our model using the entire dataset.
Q: How low can we push the training error?
- We can make the model arbitrarily complex (effectively “memorizing” the entire training set).
A: Down to zero!

NOTE
This phenomenon is called overfitting.
OVERFITTING
source: Data Analysis with Open Source Tools, by Philipp K. Janert. O’Reilly Media, 2011.

OVERFITTING - EXAMPLE
source: http://www.dtreg.com
TRAINING ERROR
A: Training error is not a good estimate of OOS accuracy.
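To see why training error can be pushed to zero, consider a 1-nearest-neighbor classifier: it effectively memorizes the training set, so every training record is its own nearest neighbor. A sketch with synthetic points (not from the slides):

```python
# A 1-NN classifier memorizes the training data: each training point
# is its own nearest neighbor, so its training error is exactly zero.
train = [((1.0, 2.0), "a"), ((2.0, 1.0), "b"), ((3.0, 3.0), "a"),
         ((0.5, 0.5), "b"), ((4.0, 1.0), "a")]

def predict_1nn(point, data):
    # Squared Euclidean distance is enough for ranking neighbors
    dist = lambda p, q: sum((pi - qi) ** 2 for pi, qi in zip(p, q))
    return min(data, key=lambda rec: dist(rec[0], point))[1]

train_error = sum(1 for x, y in train if predict_1nn(x, train) != y) / len(train)
print("training error:", train_error)  # 0.0 -- says nothing about OOS error!
```

Zero training error here is exactly the overfitting phenomenon: the model has memorized the data, and the number tells us nothing about how it will do out of sample.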
GENERALIZATION ERROR
Suppose we do the train/test split.

Q: How well does generalization error predict OOS accuracy?
Thought experiment: Suppose we had done a different train/test split.
Q: Would the generalization error remain the same?
A: Of course not!

A: On its own, not very well.

NOTE
The generalization error gives a high-variance estimate of OOS accuracy.
GENERALIZATION ERROR
Something is still missing!

Q: How can we do better?
Thought experiment: Different train/test splits will give us different generalization errors.
Q: What if we did a bunch of these and took the average?
A: Now you’re talking!

A: Cross-validation.
CROSS-VALIDATION
Steps for n-fold cross-validation:

1) Randomly split the dataset into n equal partitions.
2) Use partition 1 as the test set & the union of the other partitions as the training set.
3) Find the generalization error.
4) Repeat steps 2-3 using a different partition as the test set at each iteration.
5) Take the average generalization error as the estimate of OOS accuracy.
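Steps 1-5 above translate to a short loop. A pure-Python sketch: `error_fn`, the toy dataset, and the majority-vote "model" are all placeholders invented for illustration.

```python
import random

def cross_validate(dataset, n, error_fn):
    """Estimate OOS error by n-fold cross-validation.

    error_fn(training_set, test_set) should train a model on the
    training set and return its generalization error on the test set.
    """
    records = list(dataset)
    random.seed(0)
    random.shuffle(records)                      # 1) randomly split ...
    folds = [records[i::n] for i in range(n)]    # ... into n partitions
    errors = []
    for i in range(n):
        test_set = folds[i]                      # 2) one partition is the test set
        training_set = [r for j, f in enumerate(folds) if j != i for r in f]
        errors.append(error_fn(training_set, test_set))  # 3) find the error
    return sum(errors) / n                       # 5) average over the 4) repeats

# Toy usage: a "model" that always predicts the majority training label
data = [(x, x % 2) for x in range(50)]
def majority_error(train, test):
    labels = [y for _, y in train]
    guess = max(set(labels), key=labels.count)
    return sum(1 for _, y in test if y != guess) / len(test)

print("estimated OOS error:", cross_validate(data, 10, majority_error))
```

Note the striding trick `records[i::n]` gives n nearly equal partitions; when the dataset size is divisible by n, they are exactly equal.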
CROSS-VALIDATION
Features of n-fold cross-validation:

1) More accurate estimate of OOS prediction error.
2) More efficient use of data than a single train/test split.
- Each record in our dataset is used for both training and testing.
3) Presents a tradeoff between efficiency and computational expense.
- 10-fold CV is 10x more expensive than a single train/test split.
4) Can be used for model selection.
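Feature 4 in practice: run the same cross-validation loop once per candidate model and keep the one with the lowest average error. A self-contained sketch: both candidate "models" and the dataset are invented for illustration.

```python
import random

def cv_error(data, n, error_fn):
    """Average test error over n cross-validation folds."""
    records = list(data)
    random.seed(1)                 # same folds for every candidate model
    random.shuffle(records)
    folds = [records[i::n] for i in range(n)]
    errs = []
    for i in range(n):
        test = folds[i]
        train = [r for j, f in enumerate(folds) if j != i for r in f]
        errs.append(error_fn(train, test))
    return sum(errs) / n

# Two candidate "models": majority vote vs. nearest-record lookup
def majority_model(train, test):
    labels = [y for _, y in train]
    guess = max(set(labels), key=labels.count)
    return sum(y != guess for _, y in test) / len(test)

def nearest_model(train, test):
    def predict(x):
        return min(train, key=lambda rec: abs(rec[0] - x))[1]
    return sum(predict(x) != y for x, y in test) / len(test)

data = [(x, int(x >= 25)) for x in range(50)]   # label depends on x
scores = {"majority": cv_error(data, 5, majority_model),
          "nearest": cv_error(data, 5, nearest_model)}
best = min(scores, key=scores.get)
print("CV errors:", scores, "-> select:", best)
```

Because the label actually depends on x, the nearest-record model scores a much lower CV error than blind majority voting, so cross-validation selects it.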
IV. KNN CLASSIFICATION
KNN CLASSIFICATION - BASICS
Suppose we want to predict the color of the grey dot.

1) Pick a value for k (here, k = 3).
2) Find the colors of the k nearest neighbors.
3) Assign the most common color to the grey dot.

OPTIONAL NOTE
Our definition of “nearest” implicitly uses the Euclidean distance function.
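The three steps above, with the Euclidean distance from the note, fit in a few lines of Python. A sketch with made-up dots standing in for the slide's diagram; the lab covers a fuller version.

```python
from collections import Counter
from math import sqrt

def knn_predict(point, data, k=3):
    """Classify `point` by majority vote among its k nearest neighbors."""
    # "Nearest" = Euclidean distance, as in the note above
    dist = lambda p, q: sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))
    neighbors = sorted(data, key=lambda rec: dist(rec[0], point))[:k]  # 2) k nearest
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]   # 3) most common color wins

# Toy feature space: two clusters of colored dots
dots = [((1, 1), "blue"), ((1, 2), "blue"), ((2, 1), "blue"),
        ((5, 5), "orange"), ((6, 5), "orange"), ((5, 6), "orange")]
print(knn_predict((1.5, 1.5), dots, k=3))   # the "grey dot" lands among the blues
```

Note that k is chosen up front in step 1; picking a good k is exactly the kind of model-selection problem cross-validation handles.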
KNN CLASSIFICATION
Another example with k = 3: will our new example be blue or orange?
LABS