OpenML Tutorial: Networked Science in Machine Learning

NETWORKED MACHINE LEARNING

JOAQUIN VANSCHOREN (TU/E), 2014

#OpenML

Research different.

1610

GALILEO GALILEI DISCOVERS SATURN’S RINGS

‘SMAISMRMILMEPOETALEUMIBUNENUGTTAUIRAS’

Royal Society: Take nobody’s word for it

Scientific journal: Reputation-based culture

300 YEARS LATER

JOURNALS SHOW LIMITS

• Complex code not included

• Large data sets not included

• Experiment details scant

• Results hard to reproduce

• Papers not updatable

• Slow, incomplete tracking of paper impact

• Publication bias

• No online public discussion

• Open access?

JOURNALS: LONG-TERM MEMORY

INTERNET: SHORT-TERM WORKING MEMORY

NETWORKED SCIENCE

ONLINE DATABASES

OPEN SOURCE CODE

WEB SERVICES, APIS

COLLABORATIVE TOOLS


OPEN, SCALABLE COLLABORATION

REAL-TIME DISCUSSION

COMBINE, REUSE SCIENTIFIC RESULTS

CITIZEN SCIENCE

Polymaths: Solve math problems through massive collaboration (not competition)

Broadcast question, combine many minds to solve it

Solved hard problems in weeks

Many (joint) publications

SDSS: Robotic telescope, data publicly online (SkyServer)

More than 1 million distinct users vs. 10,000 astronomers

Broadcast data, allow many minds to ask the right questions

Thousands of papers

Galaxy Zoo: citizen scientists classify a million galaxies

Offer the right tools so that anybody can be a scientist

Many novel discoveries by scientists and citizens

Sharing data sparks discovery

Designed serendipity:
- What’s hard for one scientist is easy for another
- Surprising ideas, observations can spark new discoveries

Share, organise data for easy, large-scale collaboration

Data exploding in all sciences: collaborative data analysis needed

Building reputation

Authorship: easy to contribute + contributions stored, visible online

Collaboration: build trust, work with new people

Citation: more people see, build upon, and cite your work. Tell people how to cite data and code.

Altmetrics: track reuse/interest online (arXiv)

NETWORKED MACHINE LEARNING

Machine learning

Complex code, large-scale data, experiments (impossible to print)

Experiments not shared online: impossible to build on prior work, which inhibits deeper analysis (e.g. meta-learning)

Low reproducibility, generalisability (studies contradict)

What if we could all connect with each other, and with other scientists, to explore and apply machine learning?

Few collaborative tools to speed up research

OpenML

Place to share data, code, experiments in full detail

All results organised, linked together for further (meta)analysis, reuse, discussion, study, education

Links to (open-source) code, open data anywhere online.

Anyone can post data to analyse, anyone can share code and results (models, predictions, evaluations)

Integrated in ML platforms (R, WEKA, RapidMiner, …) to automatically load data and upload results

Scientists can work in teams, but results only publicly visible if data, code shared

OpenML: benefits for scientists

More time: automates routinizable work
- find data and/or code
- set up and run large-scale experiments
- compare results to the state of the art
- log experiment details for future reference

More control:
- state how others should cite your work
- track reuse
- share results more easily

More knowledge:
- more time for actual research
- build directly on prior work
- easier, large-scale collaboration and interaction

Plugins: WEKA

Plugins: MOA

Plugins: RapidMiner

1. OPERATOR TO DOWNLOAD TASK (TASK TYPE SPECIFIC)

2. SUBWORKFLOW THAT SOLVES THE TASK, GENERATES RESULTS

3. OPERATOR FOR UPLOADING RESULTS
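The three steps above can be sketched in plain Python. This is only an illustration under stated assumptions, not the RapidMiner plugin or the OpenML API: `FakeServer`, `Task`, `download_task`, `upload_run`, and the toy data are all hypothetical stand-ins, and the "subworkflow" is a trivial majority-class predictor.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Task:
    """Hypothetical task: training data plus the test inputs to predict."""
    task_id: int
    train: list        # list of (features, label) pairs
    test_inputs: list  # list of feature tuples


@dataclass
class FakeServer:
    """In-memory stand-in for a sharing platform (an assumption, not a real API)."""
    tasks: dict = field(default_factory=dict)
    runs: list = field(default_factory=list)

    def download_task(self, task_id):
        return self.tasks[task_id]

    def upload_run(self, task_id, flow_name, predictions):
        run = {"task_id": task_id, "flow": flow_name, "predictions": predictions}
        self.runs.append(run)
        return len(self.runs)  # run id: position in the run log, starting at 1


def majority_class_flow(task):
    """A deliberately simple 'subworkflow': always predict the most common label."""
    labels = [label for _, label in task.train]
    majority = Counter(labels).most_common(1)[0][0]
    return [majority for _ in task.test_inputs]


# Toy task registered on the fake server (invented data).
server = FakeServer()
server.tasks[1] = Task(
    task_id=1,
    train=[((0.1,), "a"), ((0.4,), "a"), ((0.9,), "b")],
    test_inputs=[(0.2,), (0.8,)],
)

task = server.download_task(1)                                      # step 1: download task
predictions = majority_class_flow(task)                             # step 2: solve the task
run_id = server.upload_run(task.task_id, "majority", predictions)   # step 3: upload results
```

Because the task fixes the data and the test inputs, any flow plugged into step 2 produces results that can be compared directly with the runs other people uploaded for the same task.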

OpenML: under development

OpenML studies
- collection of datasets, flows, runs, results in a study
- online counterpart of paper (with URL)
- construct by simply tagging resources
- easily include (build on) data of others
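The idea of constructing a study by tagging resources can be illustrated with a small tag index. This is a minimal sketch under assumptions: `TagIndex`, the resource kinds, the tag names, and the numeric ids are invented for illustration and do not describe how OpenML stores studies.

```python
from collections import defaultdict


class TagIndex:
    """Hypothetical tag index: maps a tag to the resources that carry it."""

    def __init__(self):
        self._by_tag = defaultdict(set)

    def tag(self, kind, resource_id, tag):
        # kind is an illustrative label such as "dataset", "flow", or "run"
        self._by_tag[tag].add((kind, resource_id))

    def study(self, tag):
        """Collect everything carrying the tag, grouped by resource kind."""
        grouped = defaultdict(list)
        for kind, resource_id in sorted(self._by_tag[tag]):
            grouped[kind].append(resource_id)
        return dict(grouped)


index = TagIndex()
index.tag("dataset", 61, "study_demo")  # e.g. someone else's dataset, reused by tagging
index.tag("flow", 7, "study_demo")
index.tag("run", 1001, "study_demo")
index.tag("run", 1002, "study_demo")
index.tag("run", 2000, "other_study")   # belongs to a different study

demo = index.study("study_demo")
```

Here a study is nothing more than the set of resources sharing a tag, so including another person's data takes a single `tag` call rather than a copy.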

Reputation building
- Profile page: statistics of activity and impact on OpenML
- Collaborative leaderboards: best contributors to solving a task

Teams
- Add scientists in teams (circles)
- Share resources, results within team only
- Make public at any time (e.g. after publication)

Meta-learning support
- Data/Flow qualities: easy adding, better overviews
- Algorithm selection techniques running on website (vs humans?)
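One simple way to picture algorithm selection from data qualities is a nearest-neighbour lookup over earlier results. The sketch below is an assumption-laden illustration, not the technique planned for the website: the two data qualities, the algorithm names, and every number in `PRIOR_RESULTS` are invented.

```python
import math

# Hypothetical meta-data: data qualities (log10 of the number of instances,
# number of features) and the algorithm that performed best on each earlier
# dataset. All values are invented for illustration.
PRIOR_RESULTS = [
    {"qualities": (2.0, 4.0), "best": "naive_bayes"},
    {"qualities": (2.3, 6.0), "best": "naive_bayes"},
    {"qualities": (4.0, 50.0), "best": "random_forest"},
    {"qualities": (4.5, 80.0), "best": "random_forest"},
]


def recommend(qualities, prior=PRIOR_RESULTS):
    """Return the best algorithm of the most similar earlier dataset (1-NN)."""
    nearest = min(prior, key=lambda r: math.dist(r["qualities"], qualities))
    return nearest["best"]


small = recommend((2.1, 5.0))   # a small, low-dimensional dataset
large = recommend((4.2, 60.0))  # a larger, higher-dimensional dataset
```

The distances here are computed on unscaled qualities to keep the example short; a realistic version would normalise each quality first and draw the prior results from shared runs.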

JOIN THE CLUB
