Scikit-learn: The state of the union
Gaël Varoquaux
Open Source Innovation Spring 2016
Personal point of view, as an opening to scikit-learn days 2016 in Paris
1 Some history
Scikit-learn "canal historique" (French: the historical branch)
G Varoquaux 2
1 scikit-learn growth: users
Website users (weekly): Google analytics
Debian popcon: ∼ 1% of the Debian users
Web searches: Google trends
1 scikit-learn growth: lines of code
Lines of code:
Huge feature set
https://www.openhub.net/p/scikit-learn
1 scikit-learn growth: contributors
Contributors:
759 contributors
https://www.openhub.net/p/scikit-learn
1 Started as David Cournapeau’s failed PhD project
David then preferred improving numpy/scipy
That’s David sprinting in 2011
1 2009: We (Inria Parietal) need machine learning
My team takes over the development
Hire a young guy (Fabian Pedregosa)
Put post-docs and PhDs (Alexandre Gramfort, Vincent Michel...)
Work in the open
Pythonic, fast, documented
1 2010: ICML MLOSS workshop
Machine Learning Open Source Software
“The examples in the tutorial are pretty, but not particularly useful for the serious user.”
“For the sustainability of the project it might be better to narrow the focus...”
1 2011: NIPS sprint
People that I didn’t know were solving my problems
The project took off because of the community...
2 Upcoming cool stuff
Upcoming 0.18 release
2 Less code: Cython no longer embedded
Lines of code:
Generated C no longer embedded in git
⇒ opens the door to fused-types (polymorphism)
⇒ multiple dtypes support in algorithms
= memory saver
Arthur Mensch
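The memory argument can be illustrated with plain NumPy. This is only a sketch: it uses neither Cython nor scikit-learn, and the array shape is an arbitrary assumption. A routine compiled for float64 only has to up-cast float32 input, which allocates a copy twice the size of the original; supporting several dtypes avoids that copy.

```python
import numpy as np

# Hypothetical float32 dataset (shape chosen arbitrarily for illustration).
X32 = np.ones((1000, 100), dtype=np.float32)

# A float64-only routine must first convert its input, which allocates a copy.
X64 = X32.astype(np.float64)

print(X32.nbytes)  # 400000 bytes
print(X64.nbytes)  # 800000 bytes
```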
2 Faster code: better algorithmics
RandomizedPCA → PCA
Automatic choice: randomized linear algebra, power iteration (arpack), full (lapack)
For large data: up to 20× speed up
https://github.com/scikit-learn/scikit-learn/issues/5243
Giorgio Patrini
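A minimal sketch of what the merged estimator looks like from the user side, assuming the `svd_solver` parameter as released in scikit-learn 0.18. The synthetic data and its sizes are assumptions made for illustration; no timing is measured here.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic low-rank data plus a little noise (sizes are arbitrary assumptions).
rng = np.random.RandomState(0)
X = rng.randn(2000, 5) @ rng.randn(5, 200) + 0.01 * rng.randn(2000, 200)

# One estimator, several solvers selected through svd_solver.
pca_full = PCA(n_components=5, svd_solver="full").fit(X)
pca_rand = PCA(n_components=5, svd_solver="randomized", random_state=0).fit(X)
# "auto" lets the estimator pick a solver from the data shape and n_components.
pca_auto = PCA(n_components=5, svd_solver="auto", random_state=0).fit(X)

# Compare the explained variance found by the exact and randomized solvers.
same = np.allclose(pca_full.explained_variance_ratio_,
                   pca_rand.explained_variance_ratio_, rtol=1e-2)
print(same)
```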
Elkan’s K-means
For large data: ∼ 2× speed up.
https://github.com/scikit-learn/scikit-learn/pull/5414
Andreas Muller
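As a usage sketch, Elkan's variant can be requested through the `algorithm` parameter of `KMeans`. The toy data below is an assumption and is far too small to show any speed difference.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy 2-D data with 4 blobs (sizes are arbitrary assumptions).
X, _ = make_blobs(n_samples=1000, centers=4, random_state=0)

# Elkan's algorithm uses the triangle inequality to skip distance computations.
km = KMeans(n_clusters=4, algorithm="elkan", n_init=10, random_state=0).fit(X)

print(km.cluster_centers_.shape)  # (4, 2)
```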
2 New cross-validation objects
Old API in sklearn.cross_validation (data passed to the constructor):
    from sklearn.cross_validation import StratifiedKFold
    cv = StratifiedKFold(y, n_folds=2)
    for train, test in cv:
        X_train = X[train]
        y_train = y[train]
New API in sklearn.model_selection (data passed to split):
    from sklearn.model_selection import StratifiedKFold
    cv = StratifiedKFold(n_splits=2)  # n_folds in the pull request, released as n_splits
    for train, test in cv.split(X, y):
        X_train = X[train]
        y_train = y[train]
Data-independent ⇒ nested-CV possible
https://github.com/scikit-learn/scikit-learn/pull/4294
Raghav R V
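Because the splitter no longer holds the data, the same kind of object can be reused at two levels. A minimal nested cross-validation sketch, assuming the released `sklearn.model_selection` API; the dataset, estimator and parameter grid are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Splitters are built without data, so they can be applied to any subset.
inner_cv = StratifiedKFold(n_splits=3)
outer_cv = StratifiedKFold(n_splits=5)

# Inner loop: hyper-parameter selection on each outer training set.
search = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=inner_cv)

# Outer loop: unbiased estimate of the whole "select then fit" procedure.
scores = cross_val_score(search, X, y, cv=outer_cv)
print(scores.mean())
```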
2 Sequential / Bayesian search CV
See hyper-parameter selection as a Bayesian optimization / noisy fit problem.
⇒ choose hyper-parameters cleverly, not on a grid
Pull request stalled
https://github.com/scikit-learn/scikit-learn/pull/5491
Fabian Pedregosa, Sebastien Dubois, & Manoj Kumar
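The idea can be sketched in a few lines. This is not the code of the stalled pull request: it is a hand-written loop under stated assumptions (a Gaussian process surrogate, an upper-confidence-bound rule, a single hyper-parameter `C` searched on a log scale, 10 evaluations in total).

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, Matern, WhiteKernel
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

def cv_score(log_c):
    """Noisy objective: mean cross-validated accuracy for C = 10**log_c."""
    return cross_val_score(SVC(C=10.0 ** log_c), X, y, cv=3).mean()

# Candidate values of log10(C); only a few of them will be evaluated.
candidates = np.linspace(-3, 3, 61).reshape(-1, 1)
rng = np.random.RandomState(0)
tried = [int(i) for i in rng.choice(len(candidates), size=3, replace=False)]
scores = [cv_score(candidates[i, 0]) for i in tried]

for _ in range(7):
    # Fit a surrogate model of score as a function of log10(C).
    kernel = ConstantKernel(1.0) * Matern(nu=2.5) + WhiteKernel(1e-3)
    gp = GaussianProcessRegressor(kernel=kernel, random_state=0)
    gp.fit(candidates[tried], np.asarray(scores) - np.mean(scores))
    mean, std = gp.predict(candidates, return_std=True)
    # Upper confidence bound: favor points that look good or are uncertain.
    ucb = mean + 1.96 * std
    ucb[tried] = -np.inf  # never re-evaluate a point
    nxt = int(np.argmax(ucb))
    tried.append(nxt)
    scores.append(cv_score(candidates[nxt, 0]))

best_score = max(scores)
best_log_c = candidates[tried[int(np.argmax(scores))], 0]
print(best_log_c, best_score)
```

Each iteration spends one cross-validation run where the surrogate expects the most gain, instead of walking through a fixed grid.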
3 Vision(s): the future
Mission statement
Enable progress via data science
Lower the costs, less technicalities
Machine learning for everybody and for everything
Small hardware, medium data
3 Deep learning
sklearn.neural_network.MLPClassifier
architecture-specification language
GPUs
unbound technicality
keras, caffe...
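For scale, a minimal `MLPClassifier` sketch: the architecture is a single tuple of layer sizes, with no specification language and no GPU. The dataset, layer size and iteration budget are assumptions chosen for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
X = X / 16.0  # pixel values range from 0 to 16
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# One hidden layer of 64 units; the whole architecture is this tuple.
mlp = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0)
mlp.fit(X_train, y_train)

accuracy = mlp.score(X_test, y_test)
print(accuracy)
```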
3 AutoML
Automatic model selection
Better hyper-parameter selection
Better description and uniformization of estimators
Integrate feedback from auto-sklearn
3 Better, faster, stronger
Faster models
From lightning, back to sklearn
Inspiration from XGBoost (the paper is out!)
Larger data
More partial_fit
Online forests?
Fewer copies
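What `partial_fit` buys for larger data, as a sketch: the model is updated one mini-batch at a time, so the full dataset never has to sit in memory. Here the batches are sliced from an in-memory synthetic array purely for illustration; the batch size and estimator are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
classes = np.unique(y)  # all classes must be declared up front

clf = SGDClassifier(random_state=0)
# Stream the first 1500 samples in mini-batches of 100.
for start in range(0, 1500, 100):
    batch = slice(start, start + 100)
    clf.partial_fit(X[batch], y[batch], classes=classes)

# Evaluate on the 500 samples never seen during training.
accuracy = clf.score(X[1500:], y[1500:])
print(accuracy)
```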
3 Scaling up (out?)
I don’t want java/scala
Less fluid prototyping
Cross-VM debugging hard
Numerics in java slower than Lapack
Need C somewhere
They have:
Coupling distributed store to computation
Distributed job management
Create new stack? Ride on this one?
Blaze, Ibis, dask: require rewrite of algorithms
dask promising for ETL
New backends for joblib parallel and storage
distributed, ssh
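The appeal of the joblib route is that calling code stays unchanged while the backend is swapped. A minimal sketch using joblib's `parallel_backend` context manager; the built-in "threading" backend stands in here for the distributed and ssh backends mentioned above, which are not shown, and `square` is a hypothetical workload.

```python
from joblib import Parallel, delayed, parallel_backend

def square(x):
    # Hypothetical unit of work; any picklable function would do.
    return x * x

# The backend is chosen outside the code that expresses the parallel loop.
with parallel_backend("threading", n_jobs=2):
    results = Parallel()(delayed(square)(i) for i in range(5))

print(results)  # [0, 1, 4, 9, 16]
```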
Sustainable growth
Reviewing is the bottleneck
User support drowns core devs
Users need stability (Airbus)
Coding is not the only thing
sprint, GSoC management, tutorials...
Structure & stability
How to organize funding and governance?
process/meetings/reports/funding proposal...
≠ work on project
Passionate coders get a lot done
unless they get drowned by meetings
@GaelVaroquaux
Funding: Inria, Nexedi, Paris-Saclay CDS, NYU CDS, GSoC