1 Université Paris-Saclay / CNRS BALÁZS KÉGL Center for Data Science Paris-Saclay RAMP DATA CHALLENGES WITH MODULARIZATION AND CODE SUBMISSION

RAMP Data Challenge

Jan 22, 2018

Page 1: RAMP Data Challenge

1

Université Paris-Saclay / CNRS

BALÁZS KÉGL

Center for Data Science Paris-Saclay

RAMP DATA CHALLENGES WITH MODULARIZATION AND CODE SUBMISSION

Page 2: RAMP Data Challenge

WHO AM I? Balázs Kégl

• Senior researcher, CNRS

• machine learning (20 years), interfacing with particle physics (10 years)

• Head of the Paris-Saclay Center for Data Science

• interfacing with biology, economy, climatology, chemistry, etc. (4 years)

Page 3: RAMP Data Challenge

Center for Data Science Paris-Saclay, B. Kégl (CNRS)

The Paris-Saclay Center for Data Science: Data Science for Scientific Data

250 researchers in 35 laboratories

Biology & bioinformatics: IBISC/UEvry, LRI/UPSud, Hepatinov, CESP/UPSud-UVSQ-Inserm, IGM-I2BC/UPSud, MIA/Agro, MIAj-MIG/INRA, LMAS/Centrale

Chemistry: EA4041/UPSud

Earth sciences: LATMOS/UVSQ, GEOPS/UPSud, IPSL/UVSQ, LSCE/UVSQ, LMD/Polytechnique

Economy: LM/ENSAE, RITM/UPSud, LFA/ENSAE

Neuroscience: UNICOG/Inserm, U1000/Inserm, NeuroSpin/CEA

Particle physics, astrophysics & cosmology: LPP/Polytechnique, DMPH/ONERA, CosmoStat/CEA, IAS/UPSud, AIM/CEA, LAL/UPSud

Machine learning: LRI/UPSud, LTCI/Telecom, CMLA/Cachan, LS/ENSAE, LIX/Polytechnique, MIA/Agro, CMA/Polytechnique, LSS/Supélec, CVN/Centrale, LMAS/Centrale, DTIM/ONERA, IBISC/UEvry

Visualization: INRIA, LIMSI

Signal processing: LTCI/Telecom, CMA/Polytechnique, CVN/Centrale, LSS/Supélec, CMLA/Cachan, LIMSI, DTIM/ONERA

Statistics: LMO/UPSud, LS/ENSAE, LSS/Supélec, CMA/Polytechnique, LMAS/Centrale, MIA/AgroParisTech

Data science: statistics, machine learning, information retrieval, signal processing, data visualization, databases

Domain science: human society, life, brain, earth, universe

Tool building: software engineering, clouds/grids, high-performance computing, optimization

Roles: data scientist, applied scientist, domain scientist, data engineer, software engineer

datascience-paris-saclay.fr

@SaclayCDS

LIST/CEA

A multi-disciplinary initiative, building interfaces, matching people, helping them launch projects

345 affiliated researchers, 50 laboratories

Page 4: RAMP Data Challenge

THE DATA SCIENCE ECOSYSTEM

https://medium.com/@balazskegl

Roles: data scientist, data value architect, domain expert, software engineer, data engineer

Data science: statistics, machine learning, information retrieval, signal processing, data visualization, databases

Tool building: software engineering, clouds/grids, high-performance computing, optimization

Data domains: energy and physical sciences, health and life sciences, Earth and environment, economy and society, brain

Page 5: RAMP Data Challenge

WHAT DOES MACHINE LEARNING DO

• Classification problem y = f(x)

x -> f -> y = ‘Stomorhina’

x -> f -> y = ‘Scaeva’
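As a rough illustration of the classification problem y = f(x), the sketch below learns f from labelled examples with a nearest-centroid rule. The two-number feature vectors are hypothetical stand-ins for insect images, and the nearest-centroid rule is chosen only for brevity; the slides do not say which classifier is used.

```python
from math import dist


def fit_centroids(X, y):
    """Average the feature vectors of each class (one centroid per label)."""
    sums, counts = {}, {}
    for features, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def make_classifier(centroids):
    """Return f, which maps a feature vector x to the label of the closest centroid."""
    def f(x):
        return min(centroids, key=lambda label: dist(x, centroids[label]))
    return f


# Hypothetical descriptors; real inputs would be images of the insects.
X_train = [[1.0, 0.2], [0.9, 0.1], [0.1, 0.9], [0.2, 1.0]]
y_train = ['Stomorhina', 'Stomorhina', 'Scaeva', 'Scaeva']

f = make_classifier(fit_centroids(X_train, y_train))
print(f([0.8, 0.3]))
```

Any function with the same shape (features in, label out) could replace `f`; that interchangeability is what lets a challenge compare many submitted models on the same data.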

Page 6: RAMP Data Challenge

6

WHAT DOES MACHINE LEARNING DO

X y

Page 7: RAMP Data Challenge

INTERMEDIARIES: THE GROWING INTEREST IN "CROWDS" -> EXPLOSION OF TOOLS

• Crowdsourcing

• is a model leveraging novel technologies (web 2.0, mobile apps, social networks)

• to build content and a structured set of information by gathering contributions from large groups of individuals

Source: Olga Kokshagina, 2015

Page 8: RAMP Data Challenge

8

Most classical data challenges are HR and publicity events

Page 9: RAMP Data Challenge

9

We decided to turn them into a tool for

1. Collaborative prototyping

2. Teaching aid

3. Data science process management

Page 10: RAMP Data Challenge

10

Funded by Université Paris-Saclay and CNRS

Page 11: RAMP Data Challenge

11

RAMP.STUDIO DATA CHALLENGE WITH CODE SUBMISSION

Page 12: RAMP Data Challenge

12

Code submission

1. lets us deliver a working prototype

2. lets the participants collaborate

3. makes the backend challenging to run (cloud management)

Page 13: RAMP Data Challenge

RAPID ANALYTICS AND MODEL PROTOTYPING

functional data, time series, data augmentation, deep learning, learning on simulations, nonstandard and multi-objective losses

Page 14: RAMP Data Challenge


CURRENT RAMPS

14

Page 15: RAMP Data Challenge


DATA SCIENCE THEMES

15

Page 16: RAMP Data Challenge


DATA DOMAINS

16

Page 17: RAMP Data Challenge

17

RAMP.STUDIO DATA CHALLENGE WITH CODE SUBMISSION

21 challenges

39 events

1205 users

5952 predictive models

Page 18: RAMP Data Challenge

18

RAMP.STUDIO DATA CHALLENGE WITH CODE SUBMISSION

12 hackathons

6 remote data challenges

11 course data camps

Page 19: RAMP Data Challenge

CLASSIFYING AND REGRESSING ON MOLECULAR SPECTRA

chemotherapy drug in elastic pocket -> laser spectrometer -> molecular spectra

molecular spectra -> feature extractor 1 -> classifier -> drug type

molecular spectra -> feature extractor 2 -> regressor -> concentration
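The four workflow elements on this slide can be sketched as plain Python functions and classes. This is only an illustration under assumptions: feature extractor 1 is taken to feed the classifier and feature extractor 2 the regressor, the spectra are three-channel toy vectors, the drug labels 'A' and 'B' are placeholders, and both models are deliberately trivial.

```python
from math import dist


def extract_for_classifier(spectrum):
    """Feature extractor 1 (assumed): normalize so only the spectrum's shape remains."""
    total = sum(spectrum)
    return [v / total for v in spectrum]


def extract_for_regressor(spectrum):
    """Feature extractor 2 (assumed): keep the overall intensity as a single feature."""
    return [sum(spectrum)]


class NearestShapeClassifier:
    """Predict the drug type of the closest training shape."""

    def fit(self, features, labels):
        self.examples_ = list(zip(features, labels))
        return self

    def predict(self, features):
        return [min(self.examples_, key=lambda ex: dist(ex[0], f))[1] for f in features]


class ProportionalRegressor:
    """Least-squares fit of concentration = slope * intensity."""

    def fit(self, features, targets):
        num = sum(f[0] * t for f, t in zip(features, targets))
        den = sum(f[0] ** 2 for f in features)
        self.slope_ = num / den
        return self

    def predict(self, features):
        return [self.slope_ * f[0] for f in features]


def train_workflow(spectra, drug_types, concentrations):
    clf = NearestShapeClassifier().fit(
        [extract_for_classifier(s) for s in spectra], drug_types)
    reg = ProportionalRegressor().fit(
        [extract_for_regressor(s) for s in spectra], concentrations)
    return clf, reg


def predict_workflow(clf, reg, spectra):
    types = clf.predict([extract_for_classifier(s) for s in spectra])
    concs = reg.predict([extract_for_regressor(s) for s in spectra])
    return types, concs


# Toy data: each spectrum is concentration * shape, with one shape per drug.
spectra = [[1, 2, 1], [2, 4, 2], [2, 1, 1], [6, 3, 3]]
drug_types = ['A', 'A', 'B', 'B']
concentrations = [1.0, 2.0, 1.0, 3.0]

clf, reg = train_workflow(spectra, drug_types, concentrations)
pred_types, pred_concs = predict_workflow(clf, reg, [[3, 6, 3]])
print(pred_types, pred_concs)
```

Keeping the four elements separate means a participant can swap one of them, say a better feature extractor, without touching the other three.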

Page 20: RAMP Data Challenge


Page 21: RAMP Data Challenge

FORECASTING EL NIÑO SIX MONTHS AHEAD

… 300.14 299.83 298.76 299.87 299.82 300.15 300.10 299.50 … -> time series feature extractor -> x (a fixed-length feature vector) -> regressor
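A minimal sketch of this two-step workflow follows: a feature extractor turns the temperature history into a fixed-length vector, and a regressor maps that vector to the value six months later. The look-back window, the three features, the one-variable least-squares regressor and the synthetic series are all assumptions made for illustration, not the challenge's actual setup.

```python
from statistics import mean

LOOKAHEAD = 6  # months ahead, as on the slide
WINDOW = 3     # hypothetical look-back length


def extract_features(series, t, window=WINDOW):
    """Summarize the last `window` months up to month t as a fixed-length vector."""
    recent = series[t - window + 1: t + 1]
    return [recent[-1], mean(recent), recent[-1] - recent[0]]


def build_dataset(series, window=WINDOW, lookahead=LOOKAHEAD):
    """Pair each feature vector with the temperature `lookahead` months later."""
    X, y = [], []
    for t in range(window - 1, len(series) - lookahead):
        X.append(extract_features(series, t, window))
        y.append(series[t + lookahead])
    return X, y


class LastValueRegressor:
    """Least-squares line from the most recent temperature to the future one."""

    def fit(self, X, y):
        xs = [row[0] for row in X]
        x_bar, y_bar = mean(xs), mean(y)
        var = sum((x - x_bar) ** 2 for x in xs)
        cov = sum((x - x_bar) * (t - y_bar) for x, t in zip(xs, y))
        self.slope_ = cov / var if var else 0.0
        self.intercept_ = y_bar - self.slope_ * x_bar
        return self

    def predict(self, X):
        return [self.intercept_ + self.slope_ * row[0] for row in X]


# Synthetic monthly temperatures in kelvin with a 12-month cycle (not real data).
series = [300.0 + 0.1 * (i % 12) for i in range(36)]

X, y = build_dataset(series)
model = LastValueRegressor().fit(X, y)
preds = model.predict(X)
print(len(X), len(X[0]), round(preds[0], 2))
```

Because the regressor only ever sees fixed-length vectors, the forecasting horizon and the look-back window stay in the feature extractor, which is the part participants would mainly iterate on.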

Page 22: RAMP Data Challenge

THE POWER OF THE (COLLABORATING) CROWD

[Figure annotations: competitive phase; collaborative phase; what you achieved with a well-tuned deep net; the diversity gap; the human blender gap]

Page 23: RAMP Data Challenge

23

THE DYNAMICS OF COLLABORATION

starting kit

the crowd

early influencers

inventors

Page 24: RAMP Data Challenge

OPEN PHASE LETS PARTICIPANTS CATCH UP: THE GOAL OF TEACHING

Page 25: RAMP Data Challenge


COMMUNICATION AND REUSE

Page 26: RAMP Data Challenge

26

ONGOING CHALLENGES: CLASSIFY POLLINATING INSECTS

• 4.5K€ for the competitive phase

• 3K€ for the collaborative phase

• 50 GPU hours per participant

https://www.ramp.studio/events/pollenating_insects_3_JNI_2017

Page 27: RAMP Data Challenge

27

ONGOING CHALLENGES: DETECTING MARS CRATERS

Page 28: RAMP Data Challenge

28

ONGOING CHALLENGES: FAKE NEWS

PREDICT THE TRUTHFULNESS OF NEWS

Page 29: RAMP Data Challenge

29

ONGOING CHALLENGES: KAGGLE SEGURO

PREDICT INSURANCE CLAIMS

Page 30: RAMP Data Challenge

30

UPCOMING CHALLENGE

PREDICT AUTISM FROM BRAIN SCANS

Page 31: RAMP Data Challenge

31

Setting up the RAMP was long and hard.

Page 32: RAMP Data Challenge

32

Separate workflow building and workflow optimization

Page 33: RAMP Data Challenge

33

Before solving the problem, set it up

(even put it into production)

Page 34: RAMP Data Challenge

34

Université Paris-Saclay / CNRS

BALÁZS KÉGL

Center for Data Science Paris-Saclay

RAMP DATA CHALLENGES WITH MODULARIZATION AND CODE SUBMISSION

FRUGAL DATA SCIENCE PROCESS MANAGEMENT

Page 35: RAMP Data Challenge

35

THE DATA FLOW

data flow: data -> connectors -> X -> predictive workflow (CAL, FE, CLF) -> ypred

ypred -> full automation (production) or dashboard (decision support)

Page 36: RAMP Data Challenge

36

THE IDEAL SEQUENCE

[Diagram labels]

data flow: data -> connectors -> X -> workflow (CAL, FE, CLF) -> ypred -> full automation (production) or dashboard (decision support)

ground truth: expert labeler, Amazon Turk, simulator -> y

RAMP: y, ypred, score type, score, cross-validation scheme -> POC report

"works on": data scientists, business unit, domain scientists, data value architects, data engineers

Page 37: RAMP Data Challenge

RAMP-WORKFLOW & RAMP-KITS

• toolkit: https://github.com/paris-saclay-cds/ramp-workflow

  • for designing workflows

  • a set of ready-made metrics, workflows, CV schemes, data readers

  • a single command-line test script

• examples: https://github.com/ramp-kits

  • a zoo of problems, experiments, workflows

  • (at least) one initial solution

Page 38: RAMP Data Challenge

38

A SINGLE SCRIPT TO DEFINE THE BUNDLE

data -> connectors -> X -> workflow (FE, CLF) -> ypred -> score type -> score

cross-validation scheme
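To make the idea of a bundle concrete, here is a hedged, standard-library-only sketch of what one script might gather: a data connector, a workflow (FE followed by CLF), a score type and a cross-validation scheme. Every name and the toy data are hypothetical; this does not reproduce the ramp-workflow API, only the pieces named on the slide.

```python
def get_train_data():
    """Data connector: return inputs X and ground truth y (toy values)."""
    X = [[0.0], [0.1], [0.2], [0.9], [1.0], [1.1]]
    y = [0, 0, 0, 1, 1, 1]
    return X, y


class ThresholdWorkflow:
    """Workflow: a feature extractor (FE) followed by a classifier (CLF)."""

    def extract(self, X):
        # FE: keep the first column as the single feature.
        return [row[0] for row in X]

    def train(self, X, y):
        # CLF: put the threshold halfway between the two classes.
        feats = self.extract(X)
        zeros = [f for f, label in zip(feats, y) if label == 0]
        ones = [f for f, label in zip(feats, y) if label == 1]
        self.threshold_ = (max(zeros) + min(ones)) / 2
        return self

    def predict(self, X):
        return [int(f > self.threshold_) for f in self.extract(X)]


def accuracy(y_true, y_pred):
    """Score type: fraction of correct predictions."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def get_cv(n_samples, n_folds=3):
    """Cross-validation scheme: interleaved folds of (train, valid) indices."""
    for k in range(n_folds):
        valid = [i for i in range(n_samples) if i % n_folds == k]
        train = [i for i in range(n_samples) if i % n_folds != k]
        yield train, valid


def run_bundle():
    """Train and score the workflow on every fold; return one score per fold."""
    X, y = get_train_data()
    scores = []
    for train, valid in get_cv(len(X)):
        workflow = ThresholdWorkflow().train([X[i] for i in train], [y[i] for i in train])
        y_pred = workflow.predict([X[i] for i in valid])
        scores.append(accuracy([y[i] for i in valid], y_pred))
    return scores


print(run_bundle())
```

Fixing the data connector, score and cross-validation scheme in one place is what makes submissions comparable: only the workflow changes between participants.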

Page 39: RAMP Data Challenge

39

A SINGLE EXECUTABLE TO TEST THE SUBMISSIONS

• Keep your different submissions in a simple file structure

• Communicate them on git

• Execute them also from the notebook
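One way to picture a simple file structure that can also be run from a notebook is sketched below. It assumes a hypothetical layout of one folder per submission, `submissions/<name>/classifier.py`, each defining a `Classifier` class with `fit` and `predict`; the folder names, the toy submissions and the `score_submissions` helper are illustrative and are not the toolkit's own test script.

```python
import importlib.util
import tempfile
from pathlib import Path


def load_submission(kit_dir, name, module='classifier'):
    """Import <kit_dir>/submissions/<name>/<module>.py as a Python module."""
    path = Path(kit_dir) / 'submissions' / name / f'{module}.py'
    spec = importlib.util.spec_from_file_location(f'{name}_{module}', path)
    mod = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(mod)
    return mod


def score_submissions(kit_dir, X, y):
    """Fit every submission and report its accuracy.

    For brevity the score is computed on the training data itself.
    """
    results = {}
    for sub_dir in sorted((Path(kit_dir) / 'submissions').iterdir()):
        if not sub_dir.is_dir():
            continue
        clf = load_submission(kit_dir, sub_dir.name).Classifier()
        clf.fit(X, y)
        y_pred = clf.predict(X)
        results[sub_dir.name] = sum(t == p for t, p in zip(y, y_pred)) / len(y)
    return results


# Two toy submissions, written to disk the way a participant would save them.
ALWAYS_ZERO = '''
class Classifier:
    def fit(self, X, y):
        return self

    def predict(self, X):
        return [0 for _ in X]
'''

THRESHOLD = '''
class Classifier:
    def fit(self, X, y):
        return self

    def predict(self, X):
        return [int(row[0] > 0.5) for row in X]
'''


def demo():
    """Create a temporary kit with two submissions and score both."""
    with tempfile.TemporaryDirectory() as kit_dir:
        for name, source in [('starting_kit', ALWAYS_ZERO), ('threshold', THRESHOLD)]:
            sub_dir = Path(kit_dir) / 'submissions' / name
            sub_dir.mkdir(parents=True)
            (sub_dir / 'classifier.py').write_text(source)
        X, y = [[0.1], [0.4], [0.6], [0.9]], [0, 0, 1, 1]
        return score_submissions(kit_dir, X, y)


print(demo())
```

Since each submission is just a folder of source files, sharing one over git amounts to committing that folder, and the same `score_submissions` call works from a script or a notebook cell.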

Page 40: RAMP Data Challenge

40

You can

1. Participate in upcoming RAMPs

2. Use RAMP in teaching or training

3. Use the toolkit for your own workflows

4. Submit it to us if you want to run a data challenge

Page 41: RAMP Data Challenge

41

toolkit: github.com/paris-saclay-cds/ramp-workflow

examples: github.com/ramp-kits

blogs: medium.com/@balazskegl

slack: ramp-studio.slack.com

frontend: www.ramp.studio

mail: [email protected]