Scikit Learn: Machine Learning in Python, Gianluca Corrado (gianluca.corrado@unitn.it), Machine Learning


Jan 10, 2019

Page 1: Scikit Learn: Machine Learning in Pythondisi.unitn.it/~passerini/teaching/2014-2015/MachineLearning/slides/... · Scikit Learn: Machine Learning in Python Gianluca Corrado gianluca.corrado@unitn.it

Scikit Learn: Machine Learning in Python

Gianluca Corrado

gianluca.corrado@unitn.it

Machine Learning

G. Corrado (disi) sklearn Machine Learning 1 / 22


Python Scientific Lecture Notes

Scikit Learn is built on Python, in particular on NumPy, SciPy, and matplotlib, which are packages for scientific computing in Python

Basics on Python and on scientific computing

http://scipy-lectures.github.io/


Downloading and Installing

Requires:

Python (≥2.6 or ≥3.3)

NumPy (≥ 1.6.1)

SciPy (≥ 0.9)

http://scikit-learn.org/stable/install.html


Documentation and Reference

Documentation: http://scikit-learn.org/stable/documentation.html

Reference Manual with class descriptions: http://scikit-learn.org/stable/modules/classes.html


Outline

Today we are going to learn how to:

Load and generate datasets

Split a dataset for cross-validation

Use some learning algorithms: Naive Bayes, SVM, Random forest

Evaluate the performance of the algorithms: Accuracy, F1-score, AUC ROC


Datasets

The sklearn.datasets module includes utilities to load datasets

Load and fetch popular reference datasets (e.g. Iris)

http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_iris.html

Artificial data generators (e.g. binary classification)

http://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_classification.html

Now inspect the data structures
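A minimal sketch of loading a reference dataset and generating an artificial one, using the two functions linked above. The sample count, feature count, and random seed are illustrative assumptions, not values prescribed by the slides.

```python
from sklearn.datasets import load_iris, make_classification

# Load the Iris reference dataset (150 samples, 4 features, 3 classes).
iris = load_iris()
print(iris.data.shape)     # feature matrix, a NumPy array
print(iris.target.shape)   # one integer class label per sample
print(iris.target_names)   # human-readable class names

# Generate an artificial binary classification problem.
# n_samples, n_features and random_state are illustrative choices.
X, y = make_classification(n_samples=1000, n_features=10,
                           n_classes=2, random_state=0)
print(X.shape, y.shape)
print(set(y))              # the distinct class labels
```

Both calls return plain NumPy arrays (features as a 2-D matrix, labels as a 1-D vector), which is the data structure every estimator in the rest of the slides expects.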


Cross-validation

k-fold cross-validation

Split the dataset D into k equal-sized disjoint subsets D_i

For each i ∈ [1, k]:

    train the predictor on T_i = D \ D_i

    compute the score of the predictor on the test set D_i

Return the average score across the folds
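The procedure above can be sketched by hand with NumPy. This is only an illustration under assumptions: the dataset size, k = 5, the seed, and the choice of GaussianNB as the predictor are all picked for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB

# Illustrative dataset D (sizes and seed are assumptions).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

k = 5
rng = np.random.RandomState(0)
indices = rng.permutation(len(X))      # shuffle sample indices once
folds = np.array_split(indices, k)     # disjoint subsets D_1, ..., D_k

scores = []
for i in range(k):
    test_idx = folds[i]                                   # D_i
    train_idx = np.concatenate(
        [folds[j] for j in range(k) if j != i])           # T_i = D \ D_i
    model = GaussianNB().fit(X[train_idx], y[train_idx])  # train on T_i
    scores.append(model.score(X[test_idx], y[test_idx]))  # accuracy on D_i

mean_score = np.mean(scores)           # average score across the folds
print(scores)
print(mean_score)
```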


Cross-validation

The sklearn.cross_validation module includes utilities for cross-validation and performance evaluation (from scikit-learn 0.18 onwards these utilities live in sklearn.model_selection)

e.g. k-fold cross validation

http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.KFold.html

Now inspect the data structures
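A small sketch for inspecting what k-fold splitting produces. It assumes a recent scikit-learn, where KFold is imported from sklearn.model_selection rather than from the sklearn.cross_validation module named on this slide; the toy array and the seed are illustrative.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy data: 10 samples with 2 features each (illustrative only).
X = np.arange(20).reshape(10, 2)

# 5 folds; shuffling with a fixed seed keeps the split reproducible.
kf = KFold(n_splits=5, shuffle=True, random_state=0)

# split() yields one (train indices, test indices) pair per fold.
splits = list(kf.split(X))
for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)
```

Each fold is a pair of integer index arrays, so the actual training and test sets are obtained by indexing, e.g. X[train_idx] and X[test_idx].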


Naive Bayes

Hint

Attribute values are assumed conditionally independent of each other given the class

$$P(a_1, \dots, a_m \mid y_i) = \prod_{j=1}^{m} P(a_j \mid y_i)$$

Definition

$$y^* = \arg\max_{y_i} \prod_{j=1}^{m} P(a_j \mid y_i)\, P(y_i)$$


Naive Bayes

The sklearn.naive_bayes module implements naive Bayes algorithms

e.g. Gaussian naive Bayes

http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html

Now inspect the data structures
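A minimal sketch of training and inspecting a Gaussian naive Bayes classifier. The Iris data, the 70/30 split, and the seed are assumptions chosen for illustration; train_test_split is imported from sklearn.model_selection, as in recent scikit-learn releases.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

iris = load_iris()
# Hold out 30% of the samples for testing (illustrative split).
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0)

clf = GaussianNB()
clf.fit(X_train, y_train)           # estimate P(y_i) and per-class Gaussians

y_pred = clf.predict(X_test)        # most probable class for each sample
proba = clf.predict_proba(X_test)   # posterior probability of each class

print(clf.class_prior_)             # estimated class priors P(y_i)
print(clf.theta_.shape)             # per-class feature means
print(y_pred[:10])
print(clf.score(X_test, y_test))    # accuracy on the held-out samples
```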


SVM

Hint


SVM

The sklearn.svm module includes Support Vector Machine algorithms

e.g. C-Support Vector Classification

http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html

Now inspect the data structures
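A minimal sketch of fitting an SVC and looking at what it stores. The artificial dataset, the split, the seed, and C=1.0 are illustrative assumptions; the RBF kernel is stated explicitly for readability.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Illustrative binary problem (sizes and seed are assumptions).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

clf = SVC(C=1.0, kernel="rbf")      # C trades margin width against errors
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)                 # predicted class labels
margins = clf.decision_function(X_test)      # signed distance-like scores

print(clf.support_vectors_.shape)   # the support vectors kept by the model
print(clf.n_support_)               # number of support vectors per class
print(clf.score(X_test, y_test))    # accuracy on the held-out samples
```

The values returned by decision_function can serve as the ranking scores needed for AUC ROC, since a default SVC does not produce probabilities.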


Random Forest

Hint


Random Forest

The sklearn.ensemble module includes ensemble-based methods for classification and regression

e.g. Random Forest Classifier

http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

Now inspect the data structures
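A minimal sketch of fitting a random forest and inspecting it. The dataset, the split, the seed, and n_estimators=100 are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Illustrative binary problem (sizes and seed are assumptions).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# 100 trees, each split chosen with the Gini criterion.
clf = RandomForestClassifier(n_estimators=100, criterion="gini",
                             random_state=0)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
proba = clf.predict_proba(X_test)    # class probabilities averaged over trees

print(len(clf.estimators_))          # the individual fitted trees
print(clf.feature_importances_)      # one importance value per feature
print(clf.score(X_test, y_test))     # accuracy on the held-out samples
```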


Performance evaluation

Recap

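$$\mathrm{Acc} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\mathrm{Pre} = \frac{TP}{TP + FP} \qquad \mathrm{Rec} = \frac{TP}{TP + FN}$$

$$F_1 = \frac{2\,(\mathrm{Pre} \cdot \mathrm{Rec})}{\mathrm{Pre} + \mathrm{Rec}}$$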

AUC ROC
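As a worked example of the accuracy, precision, recall, and F1 formulas, the sketch below evaluates them on a made-up confusion matrix. The four counts are hypothetical numbers chosen only to make the arithmetic easy to follow.

```python
# Hypothetical confusion-matrix counts (not taken from any real experiment).
TP, TN, FP, FN = 40, 45, 5, 10

acc = (TP + TN) / (TP + TN + FP + FN)   # fraction of correct predictions
pre = TP / (TP + FP)                    # how many predicted positives are right
rec = TP / (TP + FN)                    # how many true positives are found
f1 = 2 * (pre * rec) / (pre + rec)      # harmonic mean of precision and recall

print(acc, pre, rec, f1)
```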


Performance evaluation

The sklearn.metrics module includes score functions, performance metrics, and pairwise metrics and distance computations.

e.g. accuracy, F1-score, AUC ROC

http://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html

http://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html

http://scikit-learn.org/stable/modules/generated/sklearn.metrics.auc.html
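A short sketch of the three metrics on hand-written toy labels. The label and score vectors are made up for illustration. Note that auc integrates a curve given as x and y coordinates, so it is paired here with roc_curve; roc_auc_score is a shortcut that computes the same area directly from the scores.

```python
from sklearn.metrics import (accuracy_score, auc, f1_score,
                             roc_auc_score, roc_curve)

# Toy ground truth, hard predictions, and real-valued scores (made up).
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]
y_score = [0.1, 0.6, 0.8, 0.9, 0.4, 0.2, 0.7, 0.3]

acc = accuracy_score(y_true, y_pred)   # needs hard class predictions
f1 = f1_score(y_true, y_pred)          # needs hard class predictions

# AUC ROC needs scores, not hard predictions.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
area_from_curve = auc(fpr, tpr)
area_direct = roc_auc_score(y_true, y_score)

print(acc, f1, area_from_curve, area_direct)
```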


Choosing parameters

Some algorithms have parameters

e.g. parameter C for SVM, number of trees for Random Forest

Performance can significantly vary according to the chosen parameters

It is important to choose wisely

train, VALIDATION, test
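A minimal sketch of the train / validation / test idea for choosing the SVM parameter C. The candidate C values are the ones listed in the assignment; the dataset, the split proportions, and the seed are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Illustrative binary problem (sizes and seed are assumptions).
X, y = make_classification(n_samples=600, n_features=10, random_state=0)

# First set aside the test set, then carve a validation set out of the rest.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)

C_values = [1e-2, 1e-1, 1e0, 1e1, 1e2]

# Train one model per candidate C and score it on the validation set only.
val_scores = [
    SVC(C=C, kernel="rbf").fit(X_train, y_train).score(X_val, y_val)
    for C in C_values
]
best_C = C_values[np.argmax(val_scores)]   # first C with the highest score

# Retrain with the chosen C on train + validation, then touch the test set once.
final_model = SVC(C=best_C, kernel="rbf").fit(X_rest, y_rest)
test_score = final_model.score(X_test, y_test)

print(val_scores)
print(best_C, test_score)
```

The test set plays no role in picking C, so the final score is not biased by the parameter search.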


Choosing parameters, e.g. SVM

Note: using np.argmax requires adding import numpy as np


Summary

sklearn allows you to:

load and generate datasets

split them to perform cross-validation

easily apply learning algorithms

evaluate the performance of such algorithms


Assignment

The second ML assignment is to compare the performance of three different classification algorithms, namely Naive Bayes, SVM, and Random Forest. For this assignment you need to generate a random binary classification problem and train the three algorithms using 10-fold cross-validation. Some algorithms need an inner 5-fold cross-validation to choose their parameters. Then show the classification performance (per fold and averaged) in the report, briefly discussing the results.

Note

The report must also contain a short description of the methodology used to obtain the results.


Assignment

Steps

1. Create a classification dataset (n_samples ≥ 1000, n_features ≥ 10)

2. Split the dataset using 10-fold cross-validation

3. Train the algorithms:

    GaussianNB

    SVC (possible C values [1e-02, 1e-01, 1e00, 1e01, 1e02], and RBF kernel)

    RandomForestClassifier (possible n_estimators values [10, 100, 1000], and Gini impurity)

4. Evaluate the cross-validated performance: accuracy, F1-score, AUC ROC

5. Write a short report summarizing the methodology and the results
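The nesting of the two cross-validation loops can be sketched as follows, for the SVC case only. This is one possible arrangement, not the reference solution: it uses GridSearchCV for the inner 5-fold search in place of a hand-written loop with np.argmax, imports from sklearn.model_selection as in recent scikit-learn releases, and fixes the seed arbitrarily.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.svm import SVC

# Step 1: artificial binary problem (seed is an arbitrary choice).
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Step 2: outer 10-fold split used for performance estimation.
outer = KFold(n_splits=10, shuffle=True, random_state=0)
param_grid = {"C": [1e-2, 1e-1, 1e0, 1e1, 1e2]}

results = []
for train_idx, test_idx in outer.split(X):
    # Step 3: inner 5-fold search for C, run on the outer training part only.
    # With an integer cv and a classifier, the inner folds are stratified.
    search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
    search.fit(X[train_idx], y[train_idx])   # refits the best C on all of it

    # Step 4: evaluate the selected model on the untouched outer test fold.
    y_pred = search.predict(X[test_idx])
    y_score = search.decision_function(X[test_idx])
    results.append((accuracy_score(y[test_idx], y_pred),
                    f1_score(y[test_idx], y_pred),
                    roc_auc_score(y[test_idx], y_score)))

results = np.array(results)        # one row per fold: accuracy, F1, AUC ROC
print(results)
print(results.mean(axis=0))        # averages over the 10 folds
```

GaussianNB has no parameter to tune here, so it would only need the outer loop; RandomForestClassifier would follow the same pattern with n_estimators in the grid and predict_proba providing the scores for AUC ROC.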


Assignment

After completing the assignment submit it via email

Send an email to [email protected] (cc:[email protected])

Subject: sklearnSubmit

Attachment: id_name_surname.zip containing: the Python code, the report (PDF format)

NOTE

No group work

This assignment is mandatory in order to enroll in the oral exam
