Beginner’s guide to Machine Learning competitions Christine Doig EuroPython 2015

Feb 19, 2020

Page 1

Beginner’s guide to Machine Learning competitions

Christine Doig

EuroPython 2015

Page 2

Slides: bit.ly/ep2015-ml-tutorial

Notebooks: bit.ly/ep2015-ml-tutorial-repo

Page 3

Christine Doig

Data Scientist, Continuum Analytics

ch_doig

chdoig

chdoig.github.io

Page 4

Concepts: Data Science, Machine Learning, Supervised learning, Classification, NLP, Sentiment analysis

Setup: Kaggle, Competitions, Dataset, Anaconda

Process: Feature preparation, Modeling, Optimization, Validation

Page 5

Concepts: Data Science, Machine Learning, Supervised learning, Classification, NLP, Sentiment analysis

Setup: Kaggle, Competitions, Dataset, Anaconda

Process: Feature preparation, Modeling, Optimization, Validation

Timings shown on the slide: 45 min, 10 min, 1 h, 5 min, 1 h

Page 6

Concepts: Data Science, Machine Learning, Supervised learning, Classification, NLP, Sentiment analysis

Page 7

Data Science

Page 8

Page 9

data science

From the lab to the factory - Data Day Texas
Slides: http://www.slideshare.net/joshwills/production-machine-learninginfrastructure
Video: https://www.youtube.com/watch?v=v-91JycaKjc

http://www.oreilly.com/data/free/files/analyzing-the-analyzers.pdf

http://www.experfy.com/blog/become-data-scientist/

http://www.fico.com/landing/infographic/anatomy-of-a-data-scientist_en.html

Page 10

data science

Scientific Computing

Distributed Systems

Analytics

Machine Learning/Stats

Web

Page 11

data science

Scientific Computing

Distributed Systems

Analytics

Machine Learning/Stats

Web

Data Scientists/ Modeler

Data/Business Analyst

Research/Computational Scientist

Data Engineers/ Architects

Developer

Page 12

data science

Scientific Computing

Distributed Systems

Analytics

Machine Learning/Stats

Web

Data Scientists/ Modeler

Data/Business Analyst

Research/Computational Scientist

Data Engineers/ Architects

Developer

Model Algorithm

Report

Application

Pipeline/ Architecture

Page 13

data science

Scientific Computing

Distributed Systems

Analytics

Machine Learning/Stats

Web

Data Scientists/ Modeler

Data/Business Analyst

Research/Computational Scientist

Data Engineers/ Architects

Developer

Keywords shown around the diagram: Models, Deep Learning, Supervised, Unsupervised, Clustering, SVM, Regression, Classification, Crossvalidation, Dimensionality reduction, KNN, NN, Filter, Join, Select, TopK, Sort, Groupby, summary statistics (min, avg, max), databases, SQL, NOSQL, Reporting, GPUs, arrays, algorithms, performance, compute, optimization, HPC, graphs, FFT, DFT, algebra, clusters, hadoop, hdfs, spark, stream processing, batch jobs, parallelism, consistency, availability, tolerance, pipeline, cloud, deployment, servers, frontend, HTML/CSS/JS, semantic web, A/B testing, crawling, scraping, frameworks, apps, interactive data viz

Page 14

data science

Scientific Computing

Distributed Systems

Analytics

Machine Learning/Stats

Web

Data Scientists/ Modeler

Data/Business Analyst

Research/Computational Scientist

Data Engineers/ Architects

Developer

PyMC

Numba

xlwings

Bokeh

Kafka

RDFLib

mrjob

Page 15

Machine Learning

Page 16

Page 17

Machine Learning

Unsupervised learning (no labels)

Clustering: K-means, Hierarchical clustering, *Topic modeling

Latent variables/structure: Dimensionality reduction, *Topic modeling

Supervised learning (labels)

Classification (categorical)

Regression (quantitative)

Algorithms: Linear regression, Logistic regression, SVM, Decision trees, k-NN

Page 18

Machine Learning

Unsupervised learning (no labels): Exploratory

Clustering: K-means, Hierarchical clustering, *Topic modeling

Latent variables/structure: Dimensionality reduction, *Topic modeling

Supervised learning (labels): Predictive

Classification (categorical)

Regression (quantitative)

Algorithms: Linear regression, Logistic regression, SVM, Decision trees, k-NN

Page 19

Machine Learning

Unsupervised learning (no labels): Exploratory

id  gender  age  job_id
1   F       67   1
2   M       32   2
3   M       45   1
4   F       18   2

Group similar individuals together.

Supervised learning (labels): Predictive

id  gender  age  job_id  buy/click_ad  money_spent
1   F       67   1       Yes           $1,000
2   M       32   2       No            -
3   M       45   1       No            -
4   F       18   2       Yes           $300

Classification (categorical): predict whether an individual is going to buy/click or not.

Regression (quantitative): predict how much the individual is going to spend.

Page 20

Natural Language Processing

Page 21

Machine Learning

Natural language processing: field concerned with the interactions between computers and human (natural) languages

Tasks

Sentiment analysis: extract subjective information on the polarity (positive or negative) of a document (text, tweet, voice message…), e.g. online reviews, to determine how people feel about a particular object or topic.

Page 22

Machine Learning

Unsupervised learning (no labels)

Clustering: K-means, Hierarchical clustering, *Topic modeling

Latent variables/structure: Dimensionality reduction, *Topic modeling

Supervised learning (labels)

Classification (categorical)

Regression (quantitative)

Algorithms: Linear regression, Logistic regression, SVM, Decision trees, k-NN

Sentiment analysis, e.g. Movie review: Positive or Negative

Page 23

I love you! Positive

Page 24

Setup

Page 25

Setup: Kaggle, Competitions, Dataset, Anaconda

Page 26

Setup options

1. You already have Python installed and your own workflow to install Python packages. If you are happy with it: install the dependencies listed in the README.

2. Alternative: Anaconda, a free Python distribution with a bunch of packages for data science (Python + conda (package manager) + packages).
http://continuum.io/downloads

3. Too many packages? Miniconda + conda env (Python + conda (package manager)).
http://conda.pydata.org/miniconda.html

    git clone [email protected]:chdoig/ep2015-ml-tutorial.git
    cd ep2015-ml-tutorial
    conda env create
    source activate ep-ml

Page 27

Kaggle
https://www.kaggle.com/

hosts online machine learning competitions

Page 28

Kaggle Competition
https://www.kaggle.com/c/word2vec-nlp-tutorial

Page 29

Kaggle Competition: Bag of Words Meets Bags of Popcorn
https://www.kaggle.com/c/word2vec-nlp-tutorial

Data: 50,000 IMDB movie reviews

labeledTrainData.tsv: 25,000 rows containing an id, sentiment, and text for each review

testData.tsv: 25,000 rows containing an id and text for each review

Task: predict the sentiment for each review in the test data set

Page 30

Process: Feature preparation, Modeling, Optimization, Validation

Page 31

Machine Learning

Feature preparation: Feature extraction, Feature selection, Feature imputation, Feature scaling, Feature discretization

Modeling: Neural Networks, Decision trees, Random forest, SVM, Naive Bayes classifier, Logistic Regression

Optimization: Boosting, Bagging, Ensemble, Regularization, Hyperparameters

Validation: Hold out method, Crossvalidation, Confusion matrix, ROC curve / AUC

Page 32

Feature preparation

Feature extraction

the process of making features from available data to be used by the classification algorithms

Reviews (M) x Words (N) -> Model (NaiveBayes, DecisionTrees) -> Evaluation (Metrics, Visualizations)

id  sentiment  review                  count_words  terrible_word
1   0          the movie was terrible  4            1
2   1          I love it               3            0
3   1          Awesome! Love it!       3            0
4   0          I hated every minute    4            0
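
The two hand-made features in the table above can be computed in a few lines of plain Python. This is a minimal sketch rather than the code from the tutorial notebooks; the helper name `extract_features` and the regex tokenizer are assumptions made for illustration.

```python
import re

# The four toy reviews from the slide: (id, sentiment, review text).
reviews = [
    (1, 0, "the movie was terrible"),
    (2, 1, "I love it"),
    (3, 1, "Awesome! Love it!"),
    (4, 0, "I hated every minute"),
]

def extract_features(text):
    """Return (count_words, terrible_word) for one review (hypothetical helper)."""
    words = re.findall(r"[a-z']+", text.lower())  # lowercase, drop punctuation
    return len(words), int("terrible" in words)

# One row per review: (id, sentiment, count_words, terrible_word).
features = [(rid, sentiment) + extract_features(text)
            for rid, sentiment, text in reviews]

for row in features:
    print(row)
```

Each row is now a fixed-length numeric record, which is the form a classifier expects.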

Page 33

Feature extraction: Text

Tokenization

Simple: transition, metal, oxides, considered, generation, materials, field, electronics, advanced, catalysts, tantalum, v, oxide, reports, synthesis, material, nanometer, size, unusual, properties…

Collocations: transition_metal_oxides, considered, generation, materials, field, electronics, advanced, catalysts, tantalum, oxide, reports, synthesis, material, nanometer_size, unusual, properties, sol_gel_method, biomedical_applications…

Entities: transition, metal_oxides, tantalum, oxide, nanometer_size, unusual_properties, dna, easy_method, biomedical_applications

Combination: transition, metal_oxides, generation, tantalum, oxide, nanometer_size, unusual_properties, sol, dna, easy_method, biomedical_applications

Lemmatization: transition, metal, oxide, consider, generation, material, field, electronic, advance, catalyst, property…

Stopwords

Language generic: a, above, across, after, afterwards, again, against, all…

Domain specific: material, temperature, advance, size…
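
As a rough illustration of the steps on this slide, the sketch below tokenizes a sentence, removes stopwords, and merges a known collocation into a single token. The example sentence, the stopword list, and the function names are all hypothetical; a real project would more likely rely on an NLP library for these steps.

```python
import re

# Hypothetical input sentence and stopword list, chosen for illustration only.
text = "Transition metal oxides are considered to be the next generation of materials"
STOPWORDS = {"a", "above", "across", "after", "again", "all",
             "are", "be", "next", "of", "the", "to"}

def tokenize(doc):
    """Simple tokenization: lowercase the text and keep alphabetic runs."""
    return re.findall(r"[a-z]+", doc.lower())

def remove_stopwords(tokens, stopwords=STOPWORDS):
    return [t for t in tokens if t not in stopwords]

def merge_collocations(tokens, phrases):
    """Join known multi-word phrases (tuples of tokens) with underscores."""
    merged, i = [], 0
    while i < len(tokens):
        for phrase in phrases:
            if tuple(tokens[i:i + len(phrase)]) == phrase:
                merged.append("_".join(phrase))
                i += len(phrase)
                break
        else:  # no phrase starts at position i: keep the single token
            merged.append(tokens[i])
            i += 1
    return merged

tokens = remove_stopwords(tokenize(text))
collocated = merge_collocations(tokens, [("transition", "metal", "oxides")])
print(tokens)
print(collocated)
```

The list of phrases is supplied by hand here; detecting collocations automatically from a corpus is a separate problem that this sketch does not attempt.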

Page 34

Vector Space

Dictionary: 1 - transition, 2 - metal, 3 - oxides, 4 - considered, …

Corpus - Bag of words

[(0, 1), (1, 1), (2, 1)]
[(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)]
[(2, 1), (5, 1), (7, 1), (8, 1)]
[(1, 1), (5, 2), (8, 1)]
[(3, 1), (6, 1), (7, 1)]
[(9, 1)]
[(9, 1), (10, 1)]
[(9, 1), (10, 1), (11, 1)]
[(4, 1), (10, 1), (11, 1)]
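
The (token id, count) pairs above are a sparse bag-of-words representation. A minimal pure-Python sketch of how such vectors can be built follows; the two tokenized documents and the function names `build_dictionary` and `doc2bow` are assumptions for illustration, and the ids are 0-based, as in the vectors on the slide.

```python
from collections import Counter

# Hypothetical tokenized documents (not the ones behind the slide's vectors).
docs = [
    ["transition", "metal", "oxides"],
    ["metal", "oxides", "considered", "oxides"],
]

def build_dictionary(documents):
    """Give each distinct token an integer id, in order of first appearance."""
    token2id = {}
    for doc in documents:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id

def doc2bow(doc, token2id):
    """Turn one document into a sorted list of (token_id, count) pairs."""
    counts = Counter(token2id[token] for token in doc)
    return sorted(counts.items())

dictionary = build_dictionary(docs)
corpus = [doc2bow(doc, dictionary) for doc in docs]
print(dictionary)
print(corpus)
```

Only tokens that occur in a document appear in its vector, which keeps the representation small when the dictionary is large.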

Page 35

Feature_extraction.ipynb

Page 36

Modeling

Naive Bayes Classifier

P(A|B) = P(B|A) * P(A) / P(B)

id  sentiment  review                  count_words  terrible_word
1   0          the movie was terrible  4            1
2   1          I love it               3            0
3   1          Awesome! Love it!       3            0
4   0          I hated every minute    4            0

What’s the probability of the review being positive if the word love appears in the review?

P(1 | love) = P(love | 1) * P(1) / P(love) = (2/2 * 2/4) / (2/4) = 100%
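
The same calculation can be checked in code. The sketch below applies Bayes' rule to a single word, with every probability estimated as a simple frequency over the four toy reviews. It is not a full Naive Bayes classifier (which combines many words and usually adds smoothing), and the function names are hypothetical.

```python
import re

# (sentiment, review text) for the four toy reviews in the table.
reviews = [
    (0, "the movie was terrible"),
    (1, "I love it"),
    (1, "Awesome! Love it!"),
    (0, "I hated every minute"),
]

def contains(word, text):
    """True if `word` occurs as a token of `text` (case-insensitive)."""
    return word in re.findall(r"[a-z']+", text.lower())

def posterior(word, label, data):
    """P(label | word) = P(word | label) * P(label) / P(word).

    Assumes the word occurs in at least one review; otherwise P(word) is 0
    and the division fails.
    """
    n = len(data)
    texts_with_label = [text for lab, text in data if lab == label]
    p_label = len(texts_with_label) / n
    p_word_given_label = (sum(contains(word, t) for t in texts_with_label)
                          / len(texts_with_label))
    p_word = sum(contains(word, t) for _, t in data) / n
    return p_word_given_label * p_label / p_word

print(posterior("love", 1, reviews))  # (2/2 * 2/4) / (2/4)
```

With only four reviews the estimate is extreme: every review containing "love" is positive, so the posterior is 100%.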

Page 37

Modeling.ipynb

Page 38

Validation

Overfitting

Overfitting occurs whenever a model learns from patterns that are present in the training data but do not reflect the data-generating process. Seeing more than is actually there. A kind of data hallucination.

Source: http://talyarkoni.org/downloads/ML_Meetup_Yarkoni_Overfitting.pdf

Page 39

Training data

Validation

Model

Evaluate

New data Evaluate

Page 40

Validation.ipynb

Page 41

Validation

Hold out method

Training data

Test data accuracy
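
A hold-out split can be sketched with the standard library alone. The function name `hold_out_split`, the 25% test fraction, and the fixed seed are assumptions for illustration, not values taken from the tutorial.

```python
import random

def hold_out_split(rows, test_fraction=0.25, seed=0):
    """Shuffle the rows and set aside a fraction of them as test data."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # seeded so the split is reproducible
    n_test = int(len(rows) * test_fraction)
    return rows[n_test:], rows[:n_test]  # (training data, test data)

train, test = hold_out_split(range(100))
print(len(train), len(test))
```

The model is fit on `train` only, and accuracy is then measured on `test`, which the model has never seen.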

Page 42

Crossvalidation

The data is split into Training + Validation and Test.

Training + Validation: in each round a different part is held out as the validation set, and accuracy is measured in each round with the validation set.

Accuracy = average(Round 1, Round 2, …)

Test: final accuracy. One shot at this!
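
The rounds described above can be sketched as k-fold cross-validation in plain Python. The fold layout (contiguous, unshuffled), the toy labels, and the majority-class "model" are all assumptions chosen to keep the example small; real data would normally be shuffled first, and the final test set is kept out of this loop entirely.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, validation_idx) for k contiguous folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        validation = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, validation
        start += size

def cross_validate(fit, predict, X, y, k=5):
    """Return (mean accuracy, accuracy per round) over k rounds."""
    scores = []
    for train, validation in k_fold_indices(len(X), k):
        model = fit([X[i] for i in train], [y[i] for i in train])
        correct = sum(predict(model, X[i]) == y[i] for i in validation)
        scores.append(correct / len(validation))
    return sum(scores) / len(scores), scores

# Toy "model": always predict the most frequent label seen during training.
def fit_majority(X, y):
    return max(set(y), key=y.count)

def predict_majority(model, x):
    return model

X = list(range(10))                      # placeholder features
y = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]       # hypothetical labels
mean_accuracy, per_round = cross_validate(fit_majority, predict_majority, X, y, k=5)
print(per_round, mean_accuracy)
```

Averaging over rounds gives a steadier estimate than a single validation split, because every sample is used for validation exactly once.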

Page 43

Confusion matrix

Validation

Positive reviews

Negative reviews

95%

5%

Accuracy 95%

Real Model prediction

Page 44

Validation

Confusion matrix

model/real  positive  negative
positive    95        5
negative    0         0
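
The confusion matrix above is what a model produces if it labels every review as positive. The sketch below rebuilds it with the standard library; the data is synthetic (95 positive and 5 negative reviews, matching the slide's counts) and the helper name is hypothetical.

```python
from collections import Counter

# Synthetic data matching the slide: 95 positive and 5 negative reviews,
# and a model that predicts "positive" for every review.
real = ["positive"] * 95 + ["negative"] * 5
predicted = ["positive"] * 100

def confusion_matrix(real, predicted):
    """Count (model prediction, real label) pairs."""
    return Counter(zip(predicted, real))

cm = confusion_matrix(real, predicted)
accuracy = sum(r == p for r, p in zip(real, predicted)) / len(real)
# Share of truly negative reviews that the model labels as negative.
negative_recall = cm[("negative", "negative")] / real.count("negative")

print(cm)
print(accuracy, negative_recall)
```

Accuracy is 95%, yet the model never identifies a negative review. This is why the confusion matrix is worth inspecting when the classes are imbalanced.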

Page 45

ROC curve/ AUC

Validation

true positive

false positive

100% true positive, 0% false positive

Page 46

ROC curve/ AUC

Validation

true positive

false positive

100% true positive, 0% false positive

AUC
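
To make the ROC curve and the area under it (AUC) concrete, the sketch below computes both from scratch. The labels and scores are invented for illustration, and the code assumes all scores are distinct (tied scores would need to be grouped into a single step).

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) points.

    The threshold is lowered one sample at a time, from the highest score
    to the lowest. Labels are 1 (positive) or 0 (negative).
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    tp = fp = 0
    for score, label in sorted(zip(scores, labels), reverse=True):
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points

def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Hypothetical true labels and model scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(auc(roc_points(labels, scores)))
```

A model that ranks every positive above every negative passes through the top-left corner (100% true positive, 0% false positive) and has an AUC of 1, while random scoring gives about 0.5.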

Page 47

Kaggle leaderboard

Page 48

Optimization

Ensemble methods

Classifier 1, Classifier 2, Classifier 3

id  cls_1  cls_2  cls_3  ensemble
1   0      0      0      0
2   0      1      1      1
3   1      1      1      1
4   0      0      1      0

e.g. majority voting

e.g. weighted voting, with weights w1, w2, w3
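
Both voting schemes fit in a few lines. The sketch below reproduces the ensemble column of the table with majority voting and then shows weighted voting; the weights are invented for illustration and the function names are hypothetical.

```python
# Predictions of the three classifiers for each review id, as in the table.
predictions = {
    1: (0, 0, 0),
    2: (0, 1, 1),
    3: (1, 1, 1),
    4: (0, 0, 1),
}

def majority_vote(votes):
    """Predict 1 if more than half of the classifiers predict 1."""
    return int(sum(votes) > len(votes) / 2)

def weighted_vote(votes, weights):
    """Predict 1 if the weighted share of 1-votes exceeds one half."""
    score = sum(w * v for w, v in zip(weights, votes))
    return int(score > sum(weights) / 2)

ensemble = {rid: majority_vote(votes) for rid, votes in predictions.items()}
print(ensemble)

# Hypothetical weights (w1, w2, w3) that trust the third classifier most.
weights = (0.2, 0.2, 0.6)
weighted = {rid: weighted_vote(votes, weights) for rid, votes in predictions.items()}
print(weighted)
```

With these weights, review 4 flips from 0 to 1 because the third classifier alone carries more than half of the total weight.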

Page 49

Ensemble.ipynb

Page 50

Kaggle forums

Page 51

Concepts: Data Science, Machine Learning, Supervised learning, Classification, NLP, Sentiment analysis

Setup: Kaggle, Competitions, Dataset, Anaconda

Process: Feature preparation, Modeling, Optimization, Validation

Page 52

Machine Learning

Feature preparation: Feature extraction, Feature selection, Feature imputation, Feature scaling, Feature discretization

Modeling: Neural Networks, Decision trees, Random forest, SVM, Naive Bayes classifier, Logistic Regression

Optimization: Boosting, Bagging, Ensemble, Regularization, Hyperparameters

Validation: Hold out method, Crossvalidation, Confusion matrix, ROC curve / AUC

Page 53

Q&A