
Ambarish Joshi - Data Scientist Aditya Padhye - Data Engineer...Agile Data Science during discovery phase Greenplum along with Airflow and Circle CI is very effective to do Agile

May 22, 2020


© Copyright 2019 Pivotal Software, Inc. All rights Reserved.

Ambarish Joshi - Data Scientist Aditya Padhye - Data Engineer

Agile Data Science on Greenplum Using Airflow


“Agile software development refers to a group of software development methodologies based on iterative development, where requirements and solutions evolve through collaboration between self-organizing cross-functional teams.”


Data Science Phases

Discovery Phase Operationalization (O16n) Phase


Data Science Phases

Discovery Phase

✓ Data exploration & cleaning

✓ Feature engineering

✓ Model Building

✓ Model Evaluation


Data Science Phases - Agility

Discovery Phase

Rapid Iteration and Experimentation

✓ Data exploration & cleaning

✓ Feature engineering

✓ Model Building

✓ Model Evaluation



Greenplum Database

[Figure: Greenplum architecture. SQL is sent to the Master Host (with a Standby Master); an Interconnect links it to Segment Hosts (Node1, Node2, Node3, ... NodeN), each running multiple Segments.]

● MPP database based on Postgres

● In-database analytics

● Parallel architecture


Jupyter Notebooks


Data Science Phases

Discovery Phase Operationalization (O16n) Phase


Data Science Phases

Operationalization (O16n) Phase

✓ Data Pipelines

✓ Testing

✓ Monitoring

● APIs to consume model output


Data Science Phases - Agility

Operationalization (O16n) Phase

✓ Automated manageable pipelines

✓ Testing with CI

✓ Monitoring to react to Failures

MADlib Flow Talk by Frank and Sridhar

✓ Data Pipelines

✓ Testing

✓ Monitoring

APIs to consume model output



Airflow

● Apache Project spun out of Airbnb

● “Airflow is a platform to programmatically author, schedule and monitor workflows.”


Data Science Use-Case

● The Data

○ Time-series trajectories with the latitude and longitude of each location.

○ A subset of the trajectories is labeled as walk / not walk.

● Our Model

○ Build a classification model using the labeled data to identify whether new unlabeled trajectories are walk or not walk.

Example trajectories


Example data

Example trajectory data

Example label data

We have mode labels (walk / not walk) for only a subset of the incoming daily trajectories
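The data layout described above can be sketched in a few lines of Python. This is only an illustration: the sample points, the `mean_speed_by_trajectory` helper, and the choice of mean speed as a feature are assumptions made for the example, not the tables or features used in the talk.

```python
import math
from collections import defaultdict

# Hypothetical trajectory points: (trajectory_id, unix_time_s, latitude, longitude).
points = [
    ("t1", 0, 40.0000, -74.0000),
    ("t1", 60, 40.0005, -74.0000),
    ("t1", 120, 40.0010, -74.0000),
    ("t2", 0, 40.0000, -74.0000),
    ("t2", 60, 40.0100, -74.0000),
]
# Labels exist only for a subset of trajectories.
labels = {"t1": "walk"}

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points."""
    r = 6_371_000.0  # mean Earth radius in metres
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def mean_speed_by_trajectory(rows):
    """Return {trajectory_id: mean speed in m/s} computed from time-ordered points."""
    by_id = defaultdict(list)
    for tid, t, lat, lon in rows:
        by_id[tid].append((t, lat, lon))
    speeds = {}
    for tid, pts in by_id.items():
        pts.sort()
        dist = sum(haversine_m(a[1], a[2], b[1], b[2]) for a, b in zip(pts, pts[1:]))
        elapsed = pts[-1][0] - pts[0][0]
        speeds[tid] = dist / elapsed if elapsed > 0 else 0.0
    return speeds

features = mean_speed_by_trajectory(points)
# Labeled trajectories feed model fitting; the rest wait for inference.
labeled = {tid: (v, labels[tid]) for tid, v in features.items() if tid in labels}
unlabeled = {tid: v for tid, v in features.items() if tid not in labels}
```

Here `labeled` holds the trajectories that can be used for training, and `unlabeled` holds the ones that would later be scored.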


Discovery phase → Operationalization phase

After every model iteration we check if the model is viable

● Check the quantitative metrics of the model, such as AUC, the ROC curve, and accuracy

● Check the qualitative results of the model and whether they make sense to a subject matter expert

Once we are convinced that the model is both quantitatively and qualitatively viable, we can move to the Operationalization phase
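As a rough illustration of the quantitative check, the sketch below computes accuracy and AUC in plain Python. The labels, scores, and the 0.5 cut-off are made-up values for the example; the talk does not say which library or thresholds were used.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up example: 1 = walk, 0 = not walk.
y_true = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed cut-off of 0.5

model_accuracy = accuracy(y_true, y_pred)
model_auc = roc_auc(y_true, scores)
```

The pairwise AUC above is quadratic in the number of examples, which is fine for a sanity check but not for large evaluation sets.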


Discovery phase → Operationalization phase

Example of code from the discovery phase which is converted into a task script


Architecture overview

Discovery Phase

● Iterative discovery: Data Exploration, Feature Engineering, Model Building, Model Evaluation

● Output: ML Model + Data pre-processing

Operationalization Phase

● Modular idempotent tasks

● Connect tasks to create automated workflows

● Model refitting workflow → output

● New data inference workflow → output

● TBD: Expose model results using an API
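One common way to make a task idempotent is to have it rebuild its output from scratch, so that a retry or re-run leaves the same result. The sketch below shows the idea with Python's built-in `sqlite3` standing in for Greenplum; the `points` and `features` tables and the `build_features` function are hypothetical names chosen for the example.

```python
import sqlite3

def build_features(conn):
    """Idempotent task: rebuild the features table from scratch on every run."""
    with conn:  # commit on success, roll back on error
        conn.execute("DROP TABLE IF EXISTS features")
        conn.execute(
            "CREATE TABLE features AS "
            "SELECT trajectory_id, COUNT(*) AS n_points "
            "FROM points GROUP BY trajectory_id"
        )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE points (trajectory_id TEXT, lat REAL, lon REAL)")
conn.executemany(
    "INSERT INTO points VALUES (?, ?, ?)",
    [("t1", 40.0, -74.0), ("t1", 40.1, -74.0), ("t2", 41.0, -73.0)],
)

build_features(conn)
first = conn.execute("SELECT * FROM features ORDER BY trajectory_id").fetchall()
build_features(conn)  # a retry or re-run leaves the same result
second = conn.execute("SELECT * FROM features ORDER BY trajectory_id").fetchall()
```

Because the task never appends to existing output, a scheduler can safely retry it after a failure without duplicating rows.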


Data Science Phases - Agility

Operationalization (O16n) Phase

✓ Automated manageable Pipelines

✓ Testing with CI/CD

✓ Monitoring to React to Failures

MADlib Flow Talk by Frank and Sridhar

✓ Pipelines

✓ Testing

✓ Monitoring

APIs to consume model output


Data Prep and Feature Engineering

Fetch → Clean → Transform → Feature Engineering, followed by two branches:

● Extract labelled data for model creation/refitting

● Inference for unlabelled data


Data Prep and Feature Engineering - Demo


Model Training

Fetch → Clean → Transform → Feature Engineering, followed by two branches:

● Extract labelled data for model creation/refitting

● Inference for unlabelled data


Model Training

● This DAG has a single task for model training

● In this task we split the data into train and test samples, train the model, evaluate it, and capture the accuracy, AUC, and model tables

● We want all of the above to run at the same time
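A single training task of this shape might look like the following sketch, where one function performs the split, the fit, and the evaluation and returns all artifacts together. The speed-threshold "model", the toy rows, and the `train_task` name are assumptions for illustration only; the talk does not describe the actual model, and AUC is left out here for brevity.

```python
import random

def train_task(rows, seed=0, test_fraction=0.2):
    """Split, fit, and evaluate in one call; return every artifact together."""
    rows = sorted(rows)  # fixed order so the seeded shuffle is reproducible
    random.Random(seed).shuffle(rows)
    n_test = max(1, int(len(rows) * test_fraction))
    test, train = rows[:n_test], rows[n_test:]

    # Toy model (an assumption, not the authors' model): a speed threshold halfway
    # between the mean speeds of the two classes in the training sample.
    walk = [s for _, s, y in train if y == 1]
    other = [s for _, s, y in train if y == 0]
    threshold = (sum(walk) / len(walk) + sum(other) / len(other)) / 2

    predictions = [1 if s < threshold else 0 for _, s, _ in test]
    hits = sum(p == y for p, (_, _, y) in zip(predictions, test))
    return {
        "model": {"threshold": threshold},
        "accuracy": hits / len(test),
        "n_train": len(train),
        "n_test": len(test),
    }

# Hypothetical rows: (trajectory_id, mean speed in m/s, label) with 1 = walk.
rows = [(f"w{i}", 1.0 + 0.2 * i, 1) for i in range(5)]
rows += [(f"n{i}", 10.0 + i, 0) for i in range(5)]

result = train_task(rows)
```

Keeping the steps in one task means the metrics and the model always come from the same split, at the cost of having to re-run everything when any step fails.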


Model Training - Demo


Model Scoring

Fetch → Clean → Transform → Feature Engineering, followed by two branches:

● Extract labelled data for model creation/refitting

● Inference for unlabelled data


Model Scoring

● The unlabeled data extracted from the features table is scored in this DAG

● We first check whether any model has been built

● If there is a model, we score the data (inference)
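The "check for a model, then score" logic can be sketched as below. The in-memory `registry` list, the threshold model, and the function names are hypothetical; in the talk the model lives in database tables. Airflow offers operators for conditional execution (for example `ShortCircuitOperator`), but the talk does not state which mechanism the authors used.

```python
def latest_model(registry):
    """Return the most recently trained model, or None when nothing exists yet."""
    return max(registry, key=lambda m: m["trained_at"]) if registry else None

def score_task(registry, unlabeled):
    """Score unlabeled trajectories only when a model is available."""
    model = latest_model(registry)
    if model is None:
        return None  # nothing to score with; skip instead of failing
    return {
        tid: ("walk" if speed < model["threshold"] else "not walk")
        for tid, speed in unlabeled.items()
    }

# Hypothetical unlabeled features: trajectory_id -> mean speed in m/s.
unlabeled = {"t7": 1.3, "t8": 12.0}

skipped = score_task([], unlabeled)
scored = score_task(
    [{"trained_at": 1, "threshold": 5.0}, {"trained_at": 2, "threshold": 6.0}],
    unlabeled,
)
```

Returning `None` when no model exists lets the scoring workflow run on its daily schedule from day one without raising errors before the first model is trained.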


Model Scoring - Demo


Model Re-Training

● We receive more labeled data every day; once we have accumulated enough labeled data, we can retrain the model for better accuracy

● We have scheduled model re-training monthly
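The two conditions above (a monthly cadence and enough new labels) can be combined in a small guard like the one below. The `should_retrain` function and the threshold of 200 labels are hypothetical; the talk gives no concrete numbers. In Airflow, the monthly cadence itself would typically come from the DAG schedule (for instance the `@monthly` preset) rather than from code like this.

```python
from datetime import date

def should_retrain(n_new_labeled, min_new_labeled, last_trained, today):
    """Retrain when a new calendar month has begun and enough labels have arrived."""
    new_month = (today.year, today.month) > (last_trained.year, last_trained.month)
    return new_month and n_new_labeled >= min_new_labeled

# Hypothetical threshold: at least 200 newly labeled trajectories.
enough = should_retrain(500, 200, date(2019, 3, 1), date(2019, 4, 1))
too_few = should_retrain(50, 200, date(2019, 3, 1), date(2019, 4, 1))
too_soon = should_retrain(500, 200, date(2019, 3, 1), date(2019, 3, 20))
```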


Model Re-Training - Demo


Data Science Phases - Agility

Operationalization (O16n) Phase

✓ Automated manageable Pipelines

✓ Testing with CI/CD

✓ Monitoring to React to Failures

MADlib Flow Talk by Frank and Jarrod

✓ Pipelines

✓ Testing

✓ Monitoring

APIs to consume model output


Testing with CI/CD

● Testing Data Pipelines is hard

● Test Coverage (Test Tasks vs Test DAGs)

● Testing as part of the CI/CD
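The difference between testing a task and testing a DAG can be illustrated with two small tests: one checks a single task's logic, the other checks that the dependency graph is acyclic and that every task runs after its upstream tasks. The `dag` mapping and the `clean` task below are hypothetical stand-ins, and the standard-library `graphlib` module (Python 3.9+) is used in place of Airflow so the sketch runs anywhere.

```python
from graphlib import TopologicalSorter

# Hypothetical DAG: each task maps to the set of tasks it depends on.
dag = {
    "fetch": set(),
    "clean": {"fetch"},
    "transform": {"clean"},
    "feature_engineering": {"transform"},
    "extract_labeled": {"feature_engineering"},
    "score_unlabeled": {"feature_engineering"},
}

def clean(rows):
    """Task under test: drop points with missing coordinates."""
    return [r for r in rows if r["lat"] is not None and r["lon"] is not None]

def test_clean_task():
    """Task-level test: checks the logic of one task in isolation."""
    rows = [{"lat": 40.0, "lon": -74.0}, {"lat": None, "lon": -74.0}]
    assert clean(rows) == [{"lat": 40.0, "lon": -74.0}]

def test_dag_structure():
    """DAG-level test: the graph has no cycles and respects every dependency."""
    order = list(TopologicalSorter(dag).static_order())  # raises CycleError on a cycle
    assert set(order) == set(dag)
    for task, upstream in dag.items():
        assert all(order.index(u) < order.index(task) for u in upstream)

test_clean_task()
test_dag_structure()
```

Tests of this kind are cheap enough to run on every commit in a CI service, which is where the slide's "Testing as part of the CI/CD" point comes in.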


Testing with CI/CD - Demo



Monitoring and Error Fixing

● Monitoring and error fixing are a big part of responsive data pipelines

● The ability to quickly identify what is failing and why, and to fix it with minimum lead time, is crucial

● In this demo we showcase an error-fixing case
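To make "what is failing and why" concrete, the sketch below runs tasks in order, stops at the first failure, and records the task name, the error type, and the message. The `run_pipeline` runner and the three toy tasks are hypothetical; in Airflow this information comes from the web UI and task logs, and a task-level `on_failure_callback` can be used to send alerts. The talk does not say which of these the authors relied on.

```python
import logging
import traceback

log = logging.getLogger("pipeline")

def run_pipeline(tasks):
    """Run (name, callable) tasks in order; stop at the first failure and report it."""
    for name, fn in tasks:
        try:
            fn()
        except Exception as exc:
            report = {
                "task": name,                     # what is failing
                "error": type(exc).__name__,      # why it is failing
                "message": str(exc),
                "trace": traceback.format_exc(),  # where to look when fixing it
            }
            log.error("task %s failed: %s", name, exc)
            return report
    return None

def fetch():
    pass

def clean():
    raise ValueError("latitude out of range: 123.4")

def transform():
    raise AssertionError("should not run after an upstream failure")

report = run_pipeline([("fetch", fetch), ("clean", clean), ("transform", transform)])
```

Stopping at the first failure keeps downstream tasks from running on bad data, and the structured report is easy to forward to whatever alerting channel a team uses.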


Monitoring and Error Fixing - Demo



Conclusion

✓ Greenplum and Jupyter notebooks provide a set of tools for doing Agile Data Science during the discovery phase

✓ Greenplum, together with Airflow and CircleCI, is very effective for doing Agile Data Science during the operationalization phase


Questions


“We partner to help you compete, grow, and transform.”