Agile data science: Distributed, Interactive, Integrated, Semantic, Micro Service

Agile Data Science by @DataFellas

Xavier Tordoir
@xtordoir

Andy Petrella
@noootsab

© Data Fellas SPRL 2016

Lineup: So if you’re not sure you want to stay...

● Distributed Data Science…
○ A genomics use case
○ Spark Notebook
○ Interactive Distributed Data Science

● Distributed Data Science… Pipeline
○ Pipeline: productizing Data Science
○ Demo of Distributed Pipeline (ADAM, Akka, Cassandra, Parquet, Spark)
○ Why Micro Services?
○ Painful points:
■ Data science is Discontiguous
■ Context Lost in Translation
○ Solution: Data Fellas’ Agile Data Science Toolkit

Data Fellas

Andy Petrella
● Maths, Geospatial, Distributed Computing
● Spark Notebook, Trainer Spark/Scala, Machine Learning

Xavier Tordoir
● Physics, Bioinformatics, Distributed Computing
● Scala (& Perl), Trainer Spark, Machine Learning


A use case: say...

● Say we have genomics data, i.e. we collected genomic variation data in different populations
● We want to know if the global population is heterogeneous (stratified)
● We want to check if this stratification corresponds to the pre-selected populations in the data collection
● Then we want to compute some descriptive statistics on these populations
● Make them available to our users
● Let them compute such statistics for populations they define themselves
● Let them discover what is available as data and services
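The "descriptive statistics" step of the use case could be as simple as per-population allele frequencies. The following is a minimal sketch in plain Python, not the Spark code from the talk. The record layout `(sample, population, variant, alt_allele_count)`, the toy data, and the names `allele_frequencies` and `population_of` are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical toy genotypes: (sample_id, population, variant_id, alt_allele_count).
# alt_allele_count is 0, 1 or 2 for a diploid sample.
GENOTYPES = [
    ("s1", "GBR", "rs1", 0),
    ("s2", "GBR", "rs1", 1),
    ("s3", "YRI", "rs1", 2),
    ("s4", "YRI", "rs1", 1),
    ("s1", "GBR", "rs2", 2),
    ("s2", "GBR", "rs2", 2),
    ("s3", "YRI", "rs2", 0),
    ("s4", "YRI", "rs2", 0),
]


def allele_frequencies(genotypes, population_of=None):
    """Return {(population, variant): alternate allele frequency}.

    population_of optionally maps sample_id -> user-defined population;
    samples missing from that mapping are ignored.
    """
    alt = defaultdict(int)    # alternate alleles observed per key
    total = defaultdict(int)  # chromosomes observed per key (2 per sample)
    for sample, population, variant, count in genotypes:
        if population_of is not None:
            if sample not in population_of:
                continue
            population = population_of[sample]
        key = (population, variant)
        alt[key] += count
        total[key] += 2
    return {key: alt[key] / total[key] for key in total}


# Statistics for the pre-selected populations.
FREQS = allele_frequencies(GENOTYPES)

# Statistics for a population the user defines (here: samples s1 and s3).
CUSTOM = allele_frequencies(GENOTYPES, {"s1": "A", "s3": "A"})
```

Passing a sample-to-population mapping covers the "populations they define themselves" bullet without changing the aggregation logic.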


Genomics data: 1000 genomes

● www.1000genomes.org

● Data available as huge zip files on an FTP server or S3

● 152 GB (gzipped) in 23 files



Genomics data: 1000 genomes, 43,372,735,220 genotypes

Figure labels: tens of millions × 1000’s … (1,000,000’s) …


Spark? Distributed Computing Framework

● SQL & Dataframes

● Streaming

● Graph Processing

● Machine Learning

● Scalability

● Resilience

● Fault tolerance

● Optimize memory usage

● Optimize computation execution

● Easy programming model


Spark Notebook: Distributed Data Science tool

● Interactive

● Scala (types, production quality)

● Reactive & pluggable charts API (scala = no.js)

● Easy install, no dependencies

● Multiple SparkContexts

http://spark-notebook.io/



Development environment: Define and chain functions


Automagic plotting: hey... types...


Population stratification: MLlib
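The MLlib code shown on this slide is not part of the transcript. As a rough illustration of the idea, the sketch below clusters samples by their genotype vectors and compares the clusters with the pre-selected populations. It assumes k-means is a suitable stand-in (an assumption, not necessarily what the demo ran), uses plain Python instead of Spark MLlib, and works on hypothetical toy data.

```python
from collections import Counter


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=10):
    """Very small k-means; returns (cluster index per point, centroids)."""
    # Deterministic initialisation: k points spread evenly over the input.
    centroids = [list(points[i * len(points) // k]) for i in range(k)]
    assignment = []
    for _ in range(iterations):
        # Assign each point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids


# Hypothetical genotype vectors: alternate allele counts at four variants.
SAMPLES = [
    [0, 0, 2, 2], [0, 1, 2, 2], [0, 0, 2, 1],
    [2, 2, 0, 0], [2, 1, 0, 0], [2, 2, 1, 0],
]
# Hypothetical pre-selected population labels for the same samples.
LABELS = ["P1", "P1", "P1", "P2", "P2", "P2"]

ASSIGNMENT, CENTROIDS = kmeans(SAMPLES, k=2)

# Contingency table: does the stratification match the pre-selected populations?
CONTINGENCY = Counter(zip(LABELS, ASSIGNMENT))
```

A contingency table with one dominant cluster per population label suggests that the stratification found by clustering corresponds to the populations chosen at collection time.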


Population stratification II: H2O Deep Learning

Spark Notebook demo: H2O.world video

http://library.fora.tv/2015/11/11/sparkling_water_on_the_spark_notebook_interactive_genomes_clustering_xavier_tordoir


Save to Cassandra: Persist output for service exposure...


Service exposure: Global Alliance for Genomics and Health (GA4GH)

http://genomicsandhealth.org/
http://ga4gh.org/

Framework for responsible data sharing
● Define schemas
● Define services

Along with ethical, legal, security and clinical aspects

The output of the Data Science work is a service...


What else? This becomes harder...

● Say we have genomics data, i.e. we collected genomic variation data in different populations
● We want to know if the global population is heterogeneous (stratified)
● We want to check if this stratification corresponds to the pre-selected populations in the data collection
● Then we want to compute some descriptive statistics on these populations
● Make them available to our users
● Let them compute such statistics for populations they define themselves
● Let them discover what is available as data and services

Distributed Data Science…

Pipeline


What it’s all about: Topics

● Pipeline: productizing Data Science
● Demo of Distributed Pipeline (ADAM, Akka, Cassandra, Parquet, Spark)
● Extended Pipeline: Micro Services
● Painful points:
○ Data science is Discontiguous
○ Context Lost in Translation
● Solution: Data Fellas’ Agile Data Science Toolkit


Pipeline: Productizing Data Science

● Modelling
○ Finding Data
○ Parsing structures
○ Cleaning
○ (Reducing)
○ Learning
○ Predicting

● Coding
○ Connect PROD data
○ Tuning training parameters
○ Create Prediction Service
○ Generate Deployable

● Deploying
○ Connect to PROD infrastructure
○ Integration with existing env
○ Allocate (schedule) resources
○ Ensure availability


Distributed Data Science: Demo

All-In Spark Notebooks

● Prepare Data: ADAM → Parquet
● Prepare View: Parquet → Cassandra
● Create Server: Cassandra → Akka Http
● Create Client: Json → Html Form, Chart, table, ...

(Docker can be provided… just need to clean and pack it a bit :-D)
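The last two demo steps (a server reading a prepared view and a client consuming JSON) can be pictured with a small sketch. It is written in plain Python rather than Akka Http, and an in-memory dictionary stands in for the Cassandra table. The route `/frequency`, its query parameters, the toy values, and the function name `handle_request` are hypothetical and not taken from the demo.

```python
import json
from urllib.parse import urlparse, parse_qs

# Stand-in for the Cassandra table written by the "Prepare View" step.
# Keys and values are hypothetical toy data.
PREPARED_VIEW = {
    ("GBR", "rs1"): 0.25,
    ("YRI", "rs1"): 0.75,
}


def handle_request(url):
    """Answer GET /frequency?population=...&variant=... as (status, JSON body)."""
    parsed = urlparse(url)
    if parsed.path != "/frequency":
        return 404, json.dumps({"error": "unknown route"})
    query = parse_qs(parsed.query)
    try:
        key = (query["population"][0], query["variant"][0])
    except KeyError:
        return 400, json.dumps({"error": "population and variant are required"})
    if key not in PREPARED_VIEW:
        return 404, json.dumps({"error": "no data for this key"})
    population, variant = key
    body = {
        "population": population,
        "variant": variant,
        "frequency": PREPARED_VIEW[key],
    }
    return 200, json.dumps(body)


# What a client (HTML form, chart, table, ...) would receive and render.
STATUS, BODY = handle_request("/frequency?population=GBR&variant=rs1")
```

Keeping the handler a pure function of the request makes the service layer easy to test independently of the storage and HTTP technology behind it.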


Extended Pipeline: Micro Services

Modelling → Coding → Deploying → Creating Services → Integrating Application

Creating Services
● Abstracts access to prepared views
● Exposes prediction capabilities
● Highly horizontally scalable
● Scaling a micro services cluster → cheaper than scaling a computing cluster

Customer integration
● Can be any technology
● Can even be another pipeline!


Painful points: Data science is Discontiguous

Stages: Modelling → Coding → Deploying → Creating Services → Integrating Application
Roles: Scientist, Data Eng., Ops. Eng., Web Eng., Customers

➔ Highly heterogeneous environment
➔ Too many friction areas
➔ Time to market too long

Frictions
➔ No integration
➔ Error prone
➔ Schedule delays

Result: Lack of Agility


Painful points: Context Lost in Translation

Diagram labels: Data Lake, Input Data, Processing, Machine Learning, Model, Output Data, Application

● No contextual discovery
● No quality info
● No lineage (origin of the data)
● Link to process and input discarded
● Huge gap in architecture: binary and schema aware serving layer
● Accuracy depends on concealed quality of inputs
● No schema: hard and long integration, poor satisfaction

Moreover:
● No backward links → no agility and no context awareness

Result: Lack of Reproducibility
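One way to picture the missing context is a dataset descriptor that keeps its schema, the process that produced it, quality information, and backward links to its inputs. The sketch below is purely illustrative: the class `DatasetInfo`, its fields, the function `lineage`, and the example datasets are assumptions, not part of the Data Fellas toolkit.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DatasetInfo:
    """Hypothetical metadata record that travels with a dataset."""
    name: str
    schema: Dict[str, str]                                     # column -> type
    process: str = "ingested"                                  # what produced it
    inputs: List["DatasetInfo"] = field(default_factory=list)  # backward links
    quality: Dict[str, float] = field(default_factory=dict)    # quality info


def lineage(dataset):
    """Return the dataset name followed by all upstream names, depth-first."""
    seen = []

    def visit(node):
        if node.name not in seen:
            seen.append(node.name)
            for parent in node.inputs:
                visit(parent)

    visit(dataset)
    return seen


raw = DatasetInfo(
    "genotypes_raw",
    {"sample": "string", "variant": "string", "alt_count": "int"},
    quality={"missing_rate": 0.02},
)
labels = DatasetInfo(
    "population_labels",
    {"sample": "string", "population": "string"},
)
freqs = DatasetInfo(
    "allele_frequencies",
    {"population": "string", "variant": "string", "frequency": "double"},
    process="group by population and variant",
    inputs=[raw, labels],
)

# Origin of the output data, recovered from the backward links.
LINEAGE = lineage(freqs)
```

With backward links kept, a consumer of `allele_frequencies` can reach the quality information of `genotypes_raw` instead of relying on inputs whose quality is concealed.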

Data Fellas…

Agile Data Science Toolkit


Our Approach: Agile Data Science Toolkit

Automatic Semantics Engine + Autogenerated Microservices + Integrated End-to-End Environment = Huge gain in Time and Reliability

Diagram labels: Notebook, Computing Cluster, Access Layer, Knowledge Base, Consumers, Customers

● Exposes database, learning models, stream sources, notebooks, ...
● Data type, process, lineage, usage
● Easy to Release
● Easy to (Re)Use


Agile Data Science Toolkit: In a nutshell











Data Fellas…

Announcements!!!


O’Reilly: Online seminar


Growing: We’re Hiring!

http://www.data-fellas.guru/#skillsjobs

Q/A: References

http://www.data-fellas.guru/


https://github.com/andypetrella/spark-notebook/

https://gitter.im/andypetrella/spark-notebook

Come to Strata (London at least): we have two talks :-)

Acknowledgments

@jaguarul

@stratman1958 @joeljacobson

@mhausenblas








