Scikit-learn for easy machine learning: the vision, the tool, and the project. Gaël Varoquaux. scikit-learn: machine learning in Python
Page 1: PyData Paris 2015 - Opening keynote Gael Varoquaux

Scikit-learn for easy machine learning: the vision, the tool, and the project

Gael Varoquaux

scikit-learn: machine learning in Python

Page 2:

1 Scikit-learn: the vision

Page 4:

1 Scikit-learn: the vision
An enabler

Machine learning for everybody and for everything

Machine learning without learning the machinery

Page 5:

Machine learning in a nutshell

Machine learning is about making predictions from data

Page 6:

1 Machine learning: a historical perspective

Artificial Intelligence (the 80s)
Building decision rules: Eatable? Mobile? Tall?

Machine learning (the 90s)
Learn these from observations

Statistical learning (the 2000s)
Model the noise in the observations

Big data (today)
Many observations, simple rules

“Big data isn’t actually interesting without machine learning” Steve Jurvetson, VC, Silicon Valley

Page 12:

1 Machine learning in a nutshell: an example

Face recognition

Andrew   Bill   Charles   Dave   ?

Page 13:

1 Machine learning in a nutshell

A simple method:
1 Store all the known (noisy) images and the names that go with them.
2 From a new (noisy) image, find the known image that is most similar.

“Nearest neighbor” method

How many errors on already-known images? ... 0: no errors

Test data ≠ train data
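
A minimal sketch of the nearest-neighbor method above, assuming scikit-learn's KNeighborsClassifier with n_neighbors=1. The "images" are synthetic vectors, and the noise level and the helper noisy_images are made up for illustration; they are not from the talk.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.RandomState(0)

# Hypothetical stand-in for face images: one prototype vector per person.
prototypes = rng.normal(size=(4, 50))
names = np.array(["Andrew", "Bill", "Charles", "Dave"])


def noisy_images(n_per_person, noise=1.0):
    """Return noisy copies of each prototype and the matching names."""
    X = np.repeat(prototypes, n_per_person, axis=0)
    X = X + noise * rng.normal(size=X.shape)
    y = np.repeat(names, n_per_person)
    return X, y


X_train, y_train = noisy_images(20)   # the "known" images
X_test, y_test = noisy_images(20)     # new images, never stored

# Step 1: store the known images. Step 2: look up the most similar one.
clf = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)

# On already-known images each sample is its own nearest neighbor.
train_errors = int(np.sum(clf.predict(X_train) != y_train))
# The honest measure is the score on images that were not stored.
test_accuracy = clf.score(X_test, y_test)
print(train_errors, test_accuracy)
```

The count of errors on the stored images says nothing about new images, which is why the score is computed on a separate test set.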

Page 16:

1 Machine learning in a nutshell: regression

A single descriptor: one dimension

[Figure: two plots of y against x]

Which model to prefer?

Page 17:

1 Machine learning in a nutshell: regression

A single descriptor: one dimension

[Figure: two plots of y against x]

Problem of “over-fitting”

Minimizing error is not always the best strategy (learning noise)

Test data ≠ train data

Page 18:

1 Machine learning in a nutshell: regression

A single descriptor: one dimension

[Figure: two plots of y against x]

Prefer simple models = concept of “regularization”

Balance the number of parameters to learn with the amount of data
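
The over-fitting point can be sketched with a small experiment. This is an illustration under assumptions, not the example from the slides: the data are synthetic (a linear signal plus noise), and polynomial degrees 1 and 9 stand in for a "simple" and a "complex" model.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.RandomState(0)


def make_data(n):
    """Synthetic 1D regression data: linear signal plus noise."""
    x = rng.uniform(-1, 1, size=n)
    y = 2 * x + 0.3 * rng.normal(size=n)
    return x, y


x_train, y_train = make_data(15)
x_test, y_test = make_data(200)


def mse(model, x, y):
    """Mean squared error of a fitted polynomial on (x, y)."""
    return float(np.mean((model(x) - y) ** 2))


results = {}
for degree in (1, 9):
    # Least-squares fit: more parameters can only lower the train error.
    model = Polynomial.fit(x_train, y_train, degree)
    results[degree] = (mse(model, x_train, y_train),
                       mse(model, x_test, y_test))
    print(degree, results[degree])
```

The degree-9 fit always reaches a train error at least as low as the degree-1 fit, because it has more parameters. Its test error is typically worse, since part of what it learned is noise.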

Page 20:

1 Machine learning in a nutshell: regression

A single descriptor: one dimension
[Figure: plot of y against x]

Two descriptors: 2 dimensions
[Figure: plot of y against X_1 and X_2]

More parameters ⇒ need more data

“curse of dimensionality”

Page 22:

1 Machine learning in a nutshell: classification

Example: recognizing hand-written digits

Represent with 2 numerical features

[Figure: plot with axes X1 and X2]

Page 24:

1 Machine learning in a nutshell: unsupervised

Stock market structure

[Figure: graph of companies: ConocoPhillips, American Express, Raytheon, Boeing, Apple, Pepsi, Navistar, GlaxoSmithKline, Microsoft, Kimberly-Clark, Ryder, SAP, Goldman Sachs, Colgate-Palmolive, Wal-Mart, General Electric, Sony, Pfizer, Amazon, Marriott, Novartis, Coca Cola, 3M, Comcast, Sanofi-Aventis, IBM, Chevron, Wells Fargo, DuPont de Nemours, CVS, Total, Caterpillar, Canon, Bank of America, Walgreen, AIG, Time Warner, Home Depot, Texas Instruments, Valero Energy, Ford, Cablevision, Toyota, Procter & Gamble, Lockheed Martin, Kellogg, Honda, General Dynamics, HP, Dell, Mitsubishi, Xerox, Yahoo, Exxon, JPMorgan Chase, McDonald's, Cisco, Northrop Grumman, Kraft Foods, Unilever]

Unlabeled data more common than labeled data

Page 26:

Machine learning

Mathematics and algorithms for fitting predictive models

Regression
[Figure: plot of y against x]

Classification

Notions of overfit and test error

Page 27:

Machine learning is everywhere

Image recognition

Marketing (click-through rate)

Movie / music recommendation

Medical data

Logistics chains (e.g. supermarkets)

Language translation

Detecting industrial failures

Page 28:

Why another machine learning package?

Page 29:

Real statisticians use R

And real astronomers use IRAF

Real economists use Gauss

Real coders use C assembler

Real experiments are controlled in LabVIEW

Real Bayesians use BUGS stan

Real text processing is done in Perl

Real deep learning is best done with torch (Lua)

And medical doctors only trust SPSS

Page 30:

1 My stack

Python, what else?
General purpose
Interactive language
Easy to read / write

Page 31:

1 My stack

The scientific Python stack

numpy arrays
Mostly a float**
No annotation / structure
Universal across applications
Easily shared with C / Fortran

[Figure: an array of numbers]

Page 32:

1 My stack

The scientific Python stack

numpy arrays

Connecting to scipy, scikit-image, pandas...

It’s about plugging things together

Page 33:

1 My stack

The scientific Python stack

numpy arrays

Connecting to scipy, scikit-image, pandas...

Being Pythonic and SciPythonic

Page 34:

1 scikit-learn vision

Machine learning for all
No specific application domain
No requirements in machine learning

High-quality Pythonic software library
Interfaces designed for users

Community-driven development
BSD licensed, very diverse contributors

http://scikit-learn.org

Page 35:

1 Between research and applications

Machine learning research
Conceptual complexity is not an issue
New and bleeding edge is better
Simple problems are old science

In the field
Tried and tested (aka boring) is good
Little sophistication from the user
API is more important than maths

Solving simple problems matters
Solving them really well matters a lot

Page 36:

2 Scikit-learn: the tool

A Python library for machine learning

© Theodore W. Gray

Page 37:

2 A Python library

A library, not a program
More expressive and flexible
Easy to include in an ecosystem

As easy as py

    from sklearn import svm
    classifier = svm.SVC()
    classifier.fit(X_train, Y_train)
    Y_test = classifier.predict(X_test)

Page 38:

2 API: specifying a model

A central concept: the estimator
Instantiated without data
But specifying the parameters

    from sklearn.neighbors import KNeighborsClassifier
    estimator = KNeighborsClassifier(n_neighbors=2)

Page 39:

2 API: training a model

Training from data

    estimator.fit(X_train, Y_train)

with:
X a numpy array with shape n_samples × n_features
y a numpy 1D array, of ints or floats, with shape n_samples

Page 40:

2 API: using a model

Prediction: classification, regression

    Y_test = estimator.predict(X_test)

Transforming: dimension reduction, filter

    X_new = estimator.transform(X_test)

Test score, density estimation

    test_score = estimator.score(X_test)
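
The three API steps (specify, train, use) can be put together in one short script. This is a sketch under assumptions: the iris dataset, the train/test split, and the choice of KNeighborsClassifier and PCA are illustrative, and the import of train_test_split from sklearn.model_selection assumes a scikit-learn release more recent than the one current at the time of the talk.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 1. Specify the model: parameters only, no data yet.
classifier = KNeighborsClassifier(n_neighbors=2)

# 2. Train it from data.
classifier.fit(X_train, y_train)

# 3. Use it: predict labels, and score on held-out data.
y_pred = classifier.predict(X_test)
test_score = classifier.score(X_test, y_test)

# A transformer follows the same pattern, with transform instead of predict.
reducer = PCA(n_components=2).fit(X_train)
X_new = reducer.transform(X_test)

print(y_pred.shape, X_new.shape, test_score)
```

Supervised estimators take the labels in score, as here; unsupervised ones such as density estimators are scored on X alone.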

Page 42:

2 Vectorizing

From raw data to a sample matrix X

For text data: counting word occurrences
- Input data: list of documents (string)
- Output data: numerical matrix

    from sklearn.feature_extraction.text import HashingVectorizer
    hasher = HashingVectorizer()
    X = hasher.fit_transform(documents)
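
To make "counting word occurrences" concrete, here is a small sketch on two made-up documents. It uses CountVectorizer, which keeps an explicit vocabulary, next to the HashingVectorizer from the slide; the documents and the n_features value are assumptions chosen for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer

# Hypothetical input: a list of documents, each a plain string.
documents = ["the cat sat", "the cat sat on the mat"]

# CountVectorizer learns a vocabulary, then counts each word per document.
counter = CountVectorizer()
counts = counter.fit_transform(documents)       # sparse matrix
the_column = counter.vocabulary_["the"]
the_count_doc2 = int(counts[1, the_column])     # "the" appears twice

# HashingVectorizer maps words to columns with a hash function instead,
# so it stores no vocabulary and needs no fitting.
hasher = HashingVectorizer(n_features=2 ** 10)
X = hasher.transform(documents)

print(counts.shape, the_count_doc2, X.shape)
```

The hashed matrix has a fixed number of columns regardless of the vocabulary, at the cost of not being able to map a column back to a word.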

Page 43:

2 Scikit-learn: very rich feature set

Supervised learning
Decision trees (Random-Forest, Boosted Tree)
Linear models
SVM

Unsupervised learning
Clustering
Dictionary learning
Outlier detection

Model selection
Built-in cross-validation
Parameter optimization
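
The model selection tools can be sketched as follows. The dataset, the SVC estimator, and the grid of C values are assumptions picked for illustration, and the sklearn.model_selection module assumes a scikit-learn release more recent than the one current at the time of the talk.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

# Built-in cross-validation: one test score per fold.
scores = cross_val_score(SVC(), X, y, cv=3)

# Parameter optimization: try each candidate C with cross-validation
# and keep the best one.
candidate_C = [0.1, 1, 10]
search = GridSearchCV(SVC(), {"C": candidate_C}, cv=3)
search.fit(X, y)

print(scores.mean(), search.best_params_)
```

After fitting, the search object behaves like any other estimator, so it can be used with predict and score directly.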

Page 44:

2 Computational performance

             scikit-learn  mlpy    pybrain  pymvpa  mdp     shogun
SVM          5.2           9.47    17.5     11.52   40.48   5.63
LARS         1.17          105.3   -        37.35   -       -
Elastic Net  0.52          73.7    -        1.44    -       -
kNN          0.57          1.41    -        0.56    0.58    1.36
PCA          0.18          -       -        8.93    0.47    0.33
k-Means      1.34          0.79    ∞        -       35.75   0.68

Algorithmic optimizations

Minimizing data copies

Random Forest fit time (s)

Scikit-Learn-RF   203.01
Scikit-Learn-ETs  211.53
OpenCV-RF         4464.65
OpenCV-ETs        3342.83
OK3-RF            1518.14
OK3-ETs           1711.94
Weka-RF           1027.91
R-RF              13427.06
Orange-RF         10941.72

Implementations: Scikit-Learn (Python, Cython), OpenCV (C++), OK3 (C), Weka (Java), randomForest (R, Fortran), Orange (Python)

Figure: Gilles Louppe

Page 47:

What if the data does not fit in memory?

“Big data”:
Petabytes...
Distributed storage
Computing cluster

Mere mortals:
Gigabytes...
Python programming
Off-the-shelf computers

See also: http://www.slideshare.net/GaelVaroquaux/processing-biggish-data-on-commodity-hardware-simple-python-patterns

Page 49:

2 On-line algorithms

    estimator.partial_fit(X_train, Y_train)

Linear models
sklearn.linear_model.SGDRegressor
sklearn.linear_model.SGDClassifier

Clustering
sklearn.cluster.MiniBatchKMeans
sklearn.cluster.Birch (new in 0.16)

PCA (new in 0.16)
sklearn.decomposition.IncrementalPCA
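
A sketch of how partial_fit can be used when the data arrive in chunks, assuming SGDClassifier as the on-line estimator. The generator data_chunks is a hypothetical data source that produces synthetic batches; in practice it would read blocks from disk or a database.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.RandomState(0)
classes = np.array([0, 1])  # all labels must be declared up front


def data_chunks(n_chunks=20, chunk_size=100):
    """Hypothetical data source: yields one small batch at a time."""
    for _ in range(n_chunks):
        X = rng.normal(size=(chunk_size, 5))
        y = (X[:, 0] + X[:, 1] > 0).astype(int)
        yield X, y


estimator = SGDClassifier(random_state=0)
for X_chunk, y_chunk in data_chunks():
    # Each call updates the model; only one chunk is in memory at a time.
    estimator.partial_fit(X_chunk, y_chunk, classes=classes)

# Evaluate on a fresh batch that was never used for training.
X_test, y_test = next(data_chunks(n_chunks=1, chunk_size=500))
accuracy = estimator.score(X_test, y_test)
print(accuracy)
```

The classes argument is needed because a single chunk may not contain every label.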

Page 50:

2 On-the-fly data reduction

Many features

⇒ Reduce the data as it is loaded

    X_small = estimator.transform(X_big, y)

Page 51:

2 On-the-fly data reduction

Random projections (will average features)
sklearn.random_projection
random linear combinations of the features

Fast clustering of features
sklearn.cluster.FeatureAgglomeration
on images: super-pixel strategy

Hashing when observations have varying size (e.g. words)
sklearn.feature_extraction.text.HashingVectorizer
stateless: can be used in parallel
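
The first two reduction strategies can be sketched on a random matrix. The array sizes, the number of components, and the number of clusters are assumptions chosen to keep the example fast; SparseRandomProjection is one of the transformers in sklearn.random_projection.

```python
import numpy as np
from sklearn.cluster import FeatureAgglomeration
from sklearn.random_projection import SparseRandomProjection

rng = np.random.RandomState(0)
X_big = rng.normal(size=(50, 2000))  # few samples, many features

# Random projection: each new feature is a random linear combination
# of the original ones.
projector = SparseRandomProjection(n_components=100, random_state=0)
X_small = projector.fit_transform(X_big)

# Feature agglomeration: cluster similar features and average each cluster.
# Only the first 200 features are used here to keep the clustering cheap.
agglo = FeatureAgglomeration(n_clusters=20)
X_agglo = agglo.fit_transform(X_big[:, :200])

print(X_small.shape, X_agglo.shape)
```

Both transformers reduce the number of columns while keeping one row per sample, so the reduced matrix can be passed to any estimator.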

Page 52:

3 Scikit-learn: the project

Page 53:

3 Community-based development in scikit-learn

Huge feature set: benefits of a large team

Project growth:
More than 200 contributors
∼12 core contributors
1 full-time INRIA programmer from the start

Estimated cost of development: $6 million
COCOMO model, http://www.ohloh.net/p/scikit-learn

Page 54:

3 Many eyes make code fast

L. Buitinck, O. Grisel, A. Joly, G. Louppe, J. Nothman, P. Prettenhofer

Page 55:

3 6 steps to a community-driven project

1 Focus on quality

2 Build great docs and examples

3 Use github

4 Limit the technicality of your codebase

5 Releasing and packaging matter

6 Focus on your contributors, give them credit, decision power

http://www.slideshare.net/GaelVaroquaux/scikit-learn-dveloppement-communautaire

Page 56:

3 Quality assurance

Code review: pull requests

Can include newcomers

We read each other's code

Everything is discussed:
- Should the algorithm go in?
- Are there good defaults?
- Are names meaningful?
- Are the numerics stable?
- Could it be faster?

Page 57:

3 Quality assurance

Unit testing

Everything is tested

Great for numerics

Overall tests enforce, on all estimators:
- consistency with the API
- basic invariances
- good handling of various inputs

Page 58:

Make it work, make it right, make it boring

Page 59:

3 The tragedy of the commons

Individuals, acting independently and rationally according to each one’s self-interest, behave contrary to the whole group’s long-term best interests by depleting some common resource.

Wikipedia

Make it work, make it right, make it boring

Core projects (boring) taken for granted
⇒ Hard to fund, less excitement

They need citation, in papers & on corporate web pages

+ It’s so hard to scale
User support
Growing codebase

Page 61:

@GaelVaroquaux

Scikit-learn

The vision
Machine learning as a means not an end

Versatile library: the “right” level of abstraction
Close to research, but seeking different tradeoffs

Page 62:

@GaelVaroquaux

Scikit-learn

The vision
Machine learning as a means not an end

The tool
Simple API uniform across learners
Numpy matrices as data containers
Reasonably fast

Page 63:

@GaelVaroquaux

Scikit-learn

The vision
Machine learning as a means not an end

The tool
Simple API uniform across learners

The project
Many people working together
Tests and discussions for quality

We’re hiring!