Page 1: machine learning in Python

Scikit-learn for easy machine learning: the vision, the tool, and the project

Gael Varoquaux

scikit

machine learning in Python

Page 4: machine learning in Python

1 Scikit-learn: the vision

An enabler

Machine learning for everybody and for everything

Machine learning without learning the machinery

G Varoquaux 2

Page 5: machine learning in Python

Machine learning in a nutshell

Machine learning is about making predictions from data

G Varoquaux 3

Page 6: machine learning in Python

1 Machine learning: a historical perspective

Artificial Intelligence (the 80s)
Building decision rules: Eatable? Mobile? Tall?

Machine learning (the 90s)
Learn these from observations

Statistical learning (2000s)
Model the noise in the observations

Big data (today)
Many observations, simple rules

“Big data isn’t actually interesting without machine learning” Steve Jurvetson, VC, Silicon Valley

G Varoquaux 4

Page 12: machine learning in Python

1 Machine learning in a nutshell: an example

Face recognition

Andrew Bill Charles Dave

?

G Varoquaux 5

Page 14: machine learning in Python

1 Machine learning in a nutshell

A simple method:
1 Store all the known (noisy) images and the names that go with them.
2 From a new (noisy) image, find the stored image that is most similar.

“Nearest neighbor” method

How many errors on already-known images? ... 0: no errors

Test data ≠ Train data
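
The two steps above can be sketched with scikit-learn's `KNeighborsClassifier`. This is only an illustration: the "images" are hypothetical 50-pixel templates plus noise, not real faces, and the noise level is an arbitrary choice. It shows why scoring on already-known images is misleading: each stored image is its own nearest neighbor.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.RandomState(0)

# Hypothetical stand-in for face images: 4 "people", each a fixed
# 50-pixel template; every observation is the template plus noise.
names = np.array(["Andrew", "Bill", "Charles", "Dave"])
templates = rng.normal(size=(4, 50))


def noisy_images(n_per_person):
    labels = np.repeat(np.arange(4), n_per_person)
    noise = rng.normal(scale=2.0, size=(labels.size, 50))
    return templates[labels] + noise, labels


X_train, y_train = noisy_images(20)
X_test, y_test = noisy_images(20)

# 1-nearest neighbor: store everything, answer with the closest stored image
clf = KNeighborsClassifier(n_neighbors=1)
clf.fit(X_train, y_train)

train_accuracy = clf.score(X_train, y_train)  # each image is its own neighbor
test_accuracy = clf.score(X_test, y_test)     # new noisy images
print("accuracy on already-known images:", train_accuracy)
print("accuracy on new images:", test_accuracy)
print("first new image recognized as:", names[clf.predict(X_test[:1])[0]])
```

Only the second number says anything about how the method behaves on faces it has not seen.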

G Varoquaux 6

Page 16: machine learning in Python

1 Machine learning in a nutshell: regression

A single descriptor: one dimension

[Figure: two plots of y against x]

Which model to prefer?

G Varoquaux 7

Page 17: machine learning in Python

1 Machine learning in a nutshell: regression

A single descriptor: one dimension

[Figure: two plots of y against x]

Problem of “over-fitting”

Minimizing error is not always the best strategy (learning noise)

Test data ≠ train data

G Varoquaux 7

Page 19: machine learning in Python

1 Machine learning in a nutshell: regression

A single descriptor: one dimension

[Figure: two plots of y against x]

Prefer simple models = concept of “regularization”

Balance the number of parameters to learn with the amount of data

Bias variance tradeoff
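
A minimal sketch of over-fitting with a single descriptor, assuming a hypothetical linear ground truth with additive noise. It fits a simple model (degree 1) and a flexible one (degree 9) with `numpy.polyfit` and compares the error on the training points with the error on fresh points. The data-generating function, noise level and degrees are illustrative choices, not taken from the slides.

```python
import numpy as np

rng = np.random.RandomState(0)


def make_data(n):
    # Hypothetical ground truth: a straight line plus noise
    x = rng.uniform(-1, 1, size=n)
    y = 2 * x + rng.normal(scale=0.3, size=n)
    return x, y


x_train, y_train = make_data(15)
x_test, y_test = make_data(200)


def mean_squared_error(coefs, x, y):
    return np.mean((np.polyval(coefs, x) - y) ** 2)


errors = {}
for degree in (1, 9):
    coefs = np.polyfit(x_train, y_train, degree)  # least-squares polynomial fit
    errors[degree] = (
        mean_squared_error(coefs, x_train, y_train),
        mean_squared_error(coefs, x_test, y_test),
    )
    print("degree %d: train error %.3f, test error %.3f"
          % ((degree,) + errors[degree]))
```

The degree-9 polynomial can never do worse than the line on the training points, since it has more parameters to spend on the same 15 observations. Whether that helps on new data is what the test error measures.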

G Varoquaux 7

Page 21: machine learning in Python

1 Machine learning in a nutshell: regression

A single descriptor: one dimension

[Figure: y against x]

Two descriptors: 2 dimensions

[Figure: y against X_1 and X_2]

More parameters ⇒ need more data

“curse of dimensionality”

G Varoquaux 7

Page 23: machine learning in Python

1 Machine learning in a nutshell: classification

Example: recognizing hand-written digits

Represent with 2 numerical features

[Figure: digits plotted along features X1 and X2]

G Varoquaux 8

Page 25: machine learning in Python

1 Machine learning in a nutshell: classification

[Figure: digits plotted along features X1 and X2]

It’s about finding separating lines
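
As an illustration of "finding separating lines", the sketch below fits a linear classifier on two hypothetical clouds of points described by 2 features and reads the line off the fitted coefficients. `LogisticRegression` is one possible linear model, chosen here for convenience; the slides do not name one.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)

# Two hypothetical classes, each sample described by 2 features (X1, X2)
X = np.vstack([
    rng.normal(loc=[-2, -2], size=(50, 2)),
    rng.normal(loc=[2, 2], size=(50, 2)),
])
y = np.array([0] * 50 + [1] * 50)

clf = LogisticRegression()
clf.fit(X, y)

# The decision boundary is the line w1 * X1 + w2 * X2 + b = 0
(w1, w2), b = clf.coef_[0], clf.intercept_[0]
print("separating line: %.2f * X1 + %.2f * X2 + %.2f = 0" % (w1, w2, b))

# A sample is assigned to class 1 when it falls on the positive side
side = (X @ clf.coef_[0] + b) > 0
agreement = np.mean(side == clf.predict(X).astype(bool))
print("side of the line matches predict():", agreement == 1.0)
```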

G Varoquaux 8

Page 30: machine learning in Python

1 Machine learning in a nutshell: models

[Figure panels: 1 staircase; 2 staircases combined; 3 staircases combined; 300 staircases combined]

Fit with a staircase of 10 constant values
Fit a new staircase on errors
Keep going

Boosted regression trees

Complexity trade-offs: computational + statistical
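
The staircase procedure can be written out by hand in a few lines. This is a bare-bones sketch of the idea rather than scikit-learn's own boosting code (which lives in `sklearn.ensemble.GradientBoostingRegressor` and adds a learning rate, among other things). The sine-shaped data are hypothetical; each staircase is a `DecisionTreeRegressor` limited to 10 leaves, that is, 10 constant values.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)

# Hypothetical 1D regression problem: a noisy sine wave
x = np.sort(rng.uniform(0, 10, size=200))
X = x[:, np.newaxis]                 # shape (n_samples, 1)
y = np.sin(x) + rng.normal(scale=0.2, size=x.size)

n_staircases = 300
prediction = np.zeros_like(y)
train_errors = []
for _ in range(n_staircases):
    residual = y - prediction        # errors left by the staircases so far
    # One "staircase": a tree with 10 leaves, i.e. 10 constant values
    staircase = DecisionTreeRegressor(max_leaf_nodes=10, random_state=0)
    staircase.fit(X, residual)
    prediction += staircase.predict(X)
    train_errors.append(np.mean((y - prediction) ** 2))

for n in (1, 2, 3, 300):
    print("%3d staircase(s): training error %.4f" % (n, train_errors[n - 1]))
```

The training error can only go down as staircases are added, which is exactly why it is not the quantity to trust: the statistical side of the trade-off is how the combined model behaves on data it was not fitted on, and the computational side is the cost of fitting and evaluating hundreds of trees.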

G Varoquaux 9

Page 31: machine learning in Python

1 Machine learning in a nutshell: unsupervised

Stock market structure

Unlabeled data: more common than labeled data

G Varoquaux 10

Page 33: machine learning in Python

Machine learning

Mathematics and algorithms for fitting predictive models

Regression

[Figure: y against x]

Classification

Unsupervised...

Notions of overfit, test error, regularization, model complexity

G Varoquaux 11

Page 34: machine learning in Python

Machine learning is everywhere

Image recognition

Marketing (click-through rate)

Movie / music recommendation

Medical data

Logistics chains (e.g. supermarkets)

Language translation

Detecting industrial failures

G Varoquaux 12

Page 35: machine learning in Python

Why another machine learning package?

G Varoquaux 13

Page 36: machine learning in Python

Real statisticians use R

And real astronomers use IRAF

Real economists use Gauss

Real coders use C assembler

Real experiments are controlled in Labview

Real Bayesians use BUGS stan

Real text processing is done in Perl

Real deep learning is best done with torch (Lua)

And medical doctors only trust SPSS

G Varoquaux 14

Page 37: machine learning in Python

1 My stack

Python, what else?
General purpose
Interactive language
Easy to read / write

G Varoquaux 15

Page 38: machine learning in Python

1 My stack

The scientific Python stack

numpy arrays
Mostly a float**
No annotation / structure
Universal across applications
Easily shared with C / fortran

[Figure: a 2D array of numbers]

G Varoquaux 15

Page 39: machine learning in Python

1 My stack

The scientific Python stack

numpy arrays

Connecting to scipy, scikit-image, pandas...

It’s about plugging things together

G Varoquaux 15

Page 40: machine learning in Python

1 My stack

The scientific Python stack

numpy arrays

Connecting to scipy, scikit-image, pandas...

Being Pythonic and SciPythonic

G Varoquaux 15

Page 41: machine learning in Python

1 scikit-learn vision

Machine learning for all
No specific application domain
No requirements in machine learning

High-quality Pythonic software library
Interfaces designed for users

Community-driven development
BSD licensed, very diverse contributors

http://scikit-learn.org

G Varoquaux 16

Page 42: machine learning in Python

1 Between research and applications

Machine learning research
Conceptual complexity is not an issue
New and bleeding edge is better
Simple problems are old science

In the field
Tried and tested (aka boring) is good
Little sophistication from the user
API is more important than maths

Solving simple problems matters
Solving them really well matters a lot

G Varoquaux 17

Page 43: machine learning in Python

2 Scikit-learn: the tool

A Python library for machine learning

© Theodore W. Gray

G Varoquaux 18

Page 44: machine learning in Python

2 A Python library

A library, not a program
More expressive and flexible
Easy to include in an ecosystem

As easy as py

    from sklearn import svm
    classifier = svm.SVC()
    classifier.fit(X_train, Y_train)
    Y_test = classifier.predict(X_test)

G Varoquaux 19

Page 45: machine learning in Python

2 API: specifying a model

A central concept: the estimator
Instantiated without data
But specifying the parameters

    from sklearn.neighbors import KNeighborsClassifier
    estimator = KNeighborsClassifier(n_neighbors=2)

G Varoquaux 20

Page 46: machine learning in Python

2 API: training a model

Training from data

    estimator.fit(X_train, Y_train)

with:
X a numpy array with shape n_samples × n_features
y a numpy 1D array, of ints or float, with shape n_samples
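
To make the expected shapes concrete, here is a small sketch with randomly generated data. The sizes (100 samples, 5 features, 3 classes) are arbitrary assumptions, and `KNeighborsClassifier` stands in for any estimator.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.RandomState(0)
n_samples, n_features = 100, 5

X_train = rng.normal(size=(n_samples, n_features))  # 2D: one row per sample
Y_train = rng.randint(0, 3, size=n_samples)         # 1D: one int label per sample

estimator = KNeighborsClassifier(n_neighbors=2)     # parameters only, no data yet
fitted = estimator.fit(X_train, Y_train)            # learning happens here

print(X_train.shape)        # (100, 5)
print(Y_train.shape)        # (100,)
print(fitted is estimator)  # True: fit returns the estimator itself
```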

G Varoquaux 21

Page 47: machine learning in Python

2 API: using a model

Prediction: classification, regression

    Y_test = estimator.predict(X_test)

Transforming: dimension reduction, filter

    X_new = estimator.transform(X_test)

Test score, density estimation

    test_score = estimator.score(X_test)
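
The three methods can be tried side by side on made-up data. The estimators below (`KNeighborsClassifier`, `PCA`, `KernelDensity`) are illustrative picks, one per use case, and the data are random. Note that a supervised estimator needs the true labels to compute its score, while a density estimator scores `X` alone.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier, KernelDensity

rng = np.random.RandomState(0)
X_train = rng.normal(size=(80, 4))
Y_train = (X_train[:, 0] > 0).astype(int)   # hypothetical binary labels
X_test = rng.normal(size=(20, 4))
Y_test = (X_test[:, 0] > 0).astype(int)

# Prediction: one label per test sample
classifier = KNeighborsClassifier(n_neighbors=3).fit(X_train, Y_train)
Y_pred = classifier.predict(X_test)                 # shape (20,)

# Transforming: PCA reduces 4 features to 2
pca = PCA(n_components=2).fit(X_train)
X_new = pca.transform(X_test)                       # shape (20, 2)

# Test score of a classifier: mean accuracy against the true labels
accuracy = classifier.score(X_test, Y_test)

# Density estimation: total log-likelihood of the test samples
density = KernelDensity(bandwidth=1.0).fit(X_train)
log_likelihood = density.score(X_test)

print(Y_pred.shape, X_new.shape, accuracy, log_likelihood)
```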

G Varoquaux 22

Page 49: machine learning in Python

2 Vectorizing

From raw data to a sample matrix X

For text data: counting word occurrences
- Input data: list of documents (string)
- Output data: numerical matrix

    from sklearn.feature_extraction.text import HashingVectorizer
    hasher = HashingVectorizer()
    X = hasher.fit_transform(documents)
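
A runnable version on a toy corpus, to show what comes out. The three documents are made up for the example. Because hashing needs no vocabulary, `transform` can be called directly without fitting first.

```python
from sklearn.feature_extraction.text import HashingVectorizer

# Hypothetical toy corpus: a list of documents, each a plain string
documents = [
    "machine learning in Python",
    "scikit-learn makes machine learning easy",
    "numpy arrays are the data containers",
]

hasher = HashingVectorizer()
X = hasher.transform(documents)   # stateless: nothing is learned from the data

print(X.shape)   # (n_documents, n_features)
print(X.nnz)     # only a handful of non-zero entries: the matrix is sparse
```

The output has one row per document and one column per hash bucket. It is stored as a scipy sparse matrix, which estimators that support sparse input can consume directly.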

G Varoquaux 23

Page 50: machine learning in Python

2 Scikit-learn: very rich feature set

Supervised learning
Decision trees (Random-Forest, Boosted Tree)
Linear models
SVM

Unsupervised Learning
Clustering
Dictionary learning
Outlier detection

Model selection
Built in cross-validation
Parameter optimization
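
A short sketch of the two model-selection tools, assuming a recent scikit-learn release where they are importable from `sklearn.model_selection` (older releases exposed them from other modules). The dataset subset, the `gamma` value and the grid of `C` values are arbitrary choices made to keep the example fast.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
X, y = X[:500], y[:500]   # subset of the hand-written digits, to keep it fast

# Built-in cross-validation: 5 train/test splits, one score per split
scores = cross_val_score(SVC(gamma=0.001), X, y, cv=5)
print("cross-validated accuracy: %.2f" % scores.mean())

# Parameter optimization: try each candidate C, keep the best one
search = GridSearchCV(SVC(gamma=0.001), param_grid={"C": [0.1, 1, 10]}, cv=5)
search.fit(X, y)
print("best parameters:", search.best_params_)
```

Both tools take an unfitted estimator and handle the splitting, fitting and scoring themselves, so every score is computed on data the model was not trained on.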

G Varoquaux 24

Page 51: machine learning in Python

2 Computational performance

              scikit-learn   mlpy    pybrain   pymvpa   mdp     shogun
SVM           5.2            9.47    17.5      11.52    40.48   5.63
LARS          1.17           105.3   -         37.35    -       -
Elastic Net   0.52           73.7    -         1.44     -       -
kNN           0.57           1.41    -         0.56     0.58    1.36
PCA           0.18           -       -         8.93     0.47    0.33
k-Means       1.34           0.79    ∞         -        35.75   0.68

Algorithmic optimizations

Minimizing data copies

Random Forest fit time

[Figure: bar chart of fit time (s) per implementation]

Implementation     Library (language)                Fit time (s)
Scikit-Learn-RF    Scikit-Learn (Python, Cython)     203.01
Scikit-Learn-ETs   Scikit-Learn (Python, Cython)     211.53
OpenCV-RF          OpenCV (C++)                      4464.65
OpenCV-ETs         OpenCV (C++)                      3342.83
OK3-RF             OK3 (C)                           1518.14
OK3-ETs            OK3 (C)                           1711.94
Weka-RF            Weka (Java)                       1027.91
R-RF               randomForest (R, Fortran)         13427.06
Orange-RF          Orange (Python)                   10941.72

Figure: Gilles Louppe

G Varoquaux 25

Page 54: machine learning in Python

What if the data does not fit in memory?

“Big data”:
Petabytes...
Distributed storage
Computing cluster

Mere mortals:
Gigabytes...
Python programming
Off-the-shelf computers

See also: http://www.slideshare.net/GaelVaroquaux/processing-biggish-data-on-commodity-hardware-simple-python-patterns

G Varoquaux 26

Page 56: machine learning in Python

2 On-line algorithms

    estimator.partial_fit(X_train, Y_train)

Linear models
sklearn.linear_model.SGDRegressor
sklearn.linear_model.SGDClassifier

Clustering
sklearn.cluster.MiniBatchKMeans
sklearn.cluster.Birch (new in 0.16)

PCA (new in 0.16)
sklearn.decomposition.IncrementalPCA
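
A sketch of the on-line pattern with `SGDClassifier`: the data arrive one chunk at a time and the model is updated after each chunk, so the full dataset never has to sit in memory. The `data_chunks` generator is a hypothetical stand-in for reading successive blocks from disk or a database.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.RandomState(0)


def data_chunks(n_chunks, chunk_size=100, n_features=10):
    """Hypothetical data source yielding one (X, y) chunk at a time."""
    for _ in range(n_chunks):
        X = rng.normal(size=(chunk_size, n_features))
        y = (X[:, 0] + X[:, 1] > 0).astype(int)
        yield X, y


estimator = SGDClassifier(random_state=0)
all_classes = np.array([0, 1])   # given up front: a chunk may miss a class

for X_chunk, y_chunk in data_chunks(n_chunks=50):
    # Update the model with this chunk only; earlier chunks can be discarded
    estimator.partial_fit(X_chunk, y_chunk, classes=all_classes)

X_test, y_test = next(data_chunks(n_chunks=1))
test_accuracy = estimator.score(X_test, y_test)
print("accuracy on a fresh chunk: %.2f" % test_accuracy)
```

For classifiers, the list of classes has to be passed on the first call to `partial_fit`, because the first chunk is not guaranteed to contain every label.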

G Varoquaux 27

Page 57: machine learning in Python

2 On-the-fly data reduction

Many features

⇒ Reduce the data as it is loaded

    X_small = estimator.transform(X_big, y)

G Varoquaux 28

Page 58: machine learning in Python

2 On-the-fly data reduction

Random projections (will average features)
sklearn.random_projection
random linear combinations of the features

Fast clustering of features
sklearn.cluster.FeatureAgglomeration
on images: super-pixel strategy

Hashing when observations have varying size (e.g. words)
sklearn.feature_extraction.text.HashingVectorizer
stateless: can be used in parallel
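
The first two strategies can be sketched on a random matrix with many features. The sizes (50 samples, 1000 features reduced to 100) are arbitrary, and `SparseRandomProjection` is one of the transformers available in `sklearn.random_projection`, picked here as an example.

```python
import numpy as np
from sklearn.cluster import FeatureAgglomeration
from sklearn.random_projection import SparseRandomProjection

rng = np.random.RandomState(0)
X_big = rng.normal(size=(50, 1000))   # 50 samples, 1000 features

# Random projections: random linear combinations of the features
projection = SparseRandomProjection(n_components=100, random_state=0)
X_small = projection.fit_transform(X_big)

# Feature clustering: group similar features, then pool each group
agglomeration = FeatureAgglomeration(n_clusters=100)
X_grouped = agglomeration.fit_transform(X_big)

print(X_small.shape)    # (50, 100)
print(X_grouped.shape)  # (50, 100)
```

Both reductions keep one row per sample and shrink only the feature axis, so the reduced matrix can be handed to any estimator in place of the original.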

G Varoquaux 28

Page 59: machine learning in Python

3 Scikit-learn: the project

G Varoquaux 29

Page 63: machine learning in Python

3 Having an impact

1% of Debian installs
1200 job offers on stack overflow

G Varoquaux 30

Page 65: machine learning in Python

3 Community-based development in scikit-learn

Huge feature set: benefits of a large team

Project growth:
More than 200 contributors
∼ 12 core contributors
1 full-time INRIA programmer from the start

Estimated cost of development: $6 million
COCOMO model, http://www.ohloh.net/p/scikit-learn

G Varoquaux 31

Page 66: machine learning in Python

3 Many eyes make code fast

L. Buitinck, O. Grisel, A. Joly, G. Louppe, J. Nothman, P. Prettenhofer

G Varoquaux 32

Page 67: machine learning in Python

3 6 steps to a community-driven project

1 Focus on quality

2 Build great docs and examples

3 Use github

4 Limit the technicality of your codebase

5 Releasing and packaging matter

6 Focus on your contributors, give them credit, decision power

http://www.slideshare.net/GaelVaroquaux/scikit-learn-dveloppement-communautaire

G Varoquaux 33

Page 68: machine learning in Python

3 Quality assurance

Code review: pull requests

Can include newcomers

We read each other’s code

Everything is discussed:
- Should the algorithm go in?
- Are there good defaults?
- Are names meaningful?
- Are the numerics stable?
- Could it be faster?

G Varoquaux 34

Page 69: machine learning in Python

3 Quality assurance

Unit testing

Everything is tested

Great for numerics

Overall tests enforce on all estimators
- consistency with the API
- basic invariances
- good handling of various inputs

If it ain’t tested it’s broken

G Varoquaux 35

Page 70: machine learning in Python

Make it work, make it right, make it boring

G Varoquaux 36

Page 71: machine learning in Python

3 The tragedy of the commons

Individuals, acting independently and rationally according to each one’s self-interest, behave contrary to the whole group’s long-term best interests by depleting some common resource.
Wikipedia

Make it work, make it right, make it boring
Core projects (boring) taken for granted
⇒ Hard to fund, less excitement

They need citation, in papers & on corporate web pages

+ It’s so hard to scale
User support
Growing codebase

G Varoquaux 37

Page 73: machine learning in Python

@GaelVaroquaux

Scikit-learn

The vision
Machine learning as a means not an end
Versatile library: the “right” level of abstraction
Close to research, but seeking different tradeoffs

The tool
Simple API uniform across learners
Numpy matrices as data containers
Reasonably fast

The project
Many people working together
Tests and discussions for quality

We’re hiring!