Dataiku Data Science Studio Use Cases for Industrialized Data Products HUG IRL July 2016

HUGIreland_VincentDeStocklin_DataScienceWorkflows

Jan 07, 2017


Data & Analytics

John Mulhall
Transcript
Page 1: HUGIreland_VincentDeStocklin_DataScienceWorkflows

Dataiku Data Science Studio

Use Cases for Industrialized Data Products

HUG IRL July 2016

Page 2: HUGIreland_VincentDeStocklin_DataScienceWorkflows

DATAIKU DSS

The most complete Data Analytics platform

Page 3: HUGIreland_VincentDeStocklin_DataScienceWorkflows

OUR CLIENTS (70+)

Web
• Clickstream analysis and customer segmentation
• Churn prediction
• Sales forecasting
• Dynamic pricing

Industry & Infrastructure
• Predictive maintenance
• Logistics optimization
• Smart Cities

Bank & Insurance
• Fraud detection
• Risk analysis and prediction
• Life events detection

A GLOBAL LEADER

PHARMACEUTICALS

Page 4: HUGIreland_VincentDeStocklin_DataScienceWorkflows

Data Science Studio

One product for a complete, design-to-production workflow

Collaborative Design, Automated Applications, Real-Time Predictions

Workflow steps: Acquire, Explore, Prepare, Model, Assess, Deploy

Design Studio
• Rapid prototyping on raw data

Production Studio
• Advanced Scheduler
• Monitoring & Dashboarding
• Model Lifecycle Management

API Builder and Runtime
• REST API
• Self-contained environment
• Scalable for high availability
• Dedicated scoring engine

Page 5: HUGIreland_VincentDeStocklin_DataScienceWorkflows

Real time predictions for used car prices

Key Challenges:

• Increase accuracy of predicted prices to boost conversion

• Reduce reliance on and cost of external data providers

• Improve flexibility of the pricing strategy

• Improve customer intelligence and implement behavioral targeting

Initial OM:

• Relied on data providers’ API to query the market prices for each request

Company Description:

• The client is a two-sided B2C marketplace for used cars

• Sell side: real-time quote for your vehicle

• Buy side: standard e-commerce marketplace

(250k€)

Page 6: HUGIreland_VincentDeStocklin_DataScienceWorkflows

Building the pricing model

Real time predictions for used car prices

Historical sales prices – Train Set (Argus, La Centrale, internal data)

Average 5M lines / make

Make, Model, Year….., Miles, “Condition”, ...., “Free text”

Transform to vector

Make, Model, Year….., Miles, Condition, ...., 0, 1, 0, 1...
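
The “Transform to vector” step can be sketched as a simple bag-of-words encoding of the free-text field, appended to the structured fields. This is a minimal illustration only: the field names, the tokenizer and the `build_vocabulary` / `vectorize` helpers are assumptions, not the encoding actually used in the project.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def build_vocabulary(texts):
    """Collect every distinct token seen in the free-text fields, sorted for a stable column order."""
    vocabulary = set()
    for text in texts:
        vocabulary |= tokenize(text)
    return sorted(vocabulary)


def vectorize(listing, vocabulary):
    """Keep the structured fields and replace the free text with 0/1 indicator columns."""
    tokens = tokenize(listing["free_text"])
    structured = [listing["make"], listing["model"], listing["year"], listing["miles"]]
    return structured + [1 if word in tokens else 0 for word in vocabulary]


# Hypothetical listings, for illustration only.
listings = [
    {"make": "Renault", "model": "Laguna", "year": 2015, "miles": 30000,
     "free_text": "Leather seats, GPS"},
    {"make": "Renault", "model": "Clio", "year": 2012, "miles": 60000,
     "free_text": "GPS, one owner"},
]

vocabulary = build_vocabulary(listing["free_text"] for listing in listings)
row = vectorize(listings[0], vocabulary)
print(vocabulary)
print(row)
```

Sorting the vocabulary keeps the column order identical between training and scoring, which matters once the same transformation has to run inside a real-time API.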

XGBoost algorithms were trained iteratively on historical data in order to maximize accuracy of the predictions

• 38 makes – 38 models

• For popular makes, several models (e.g. Renault)
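
The one-model-per-make layout can be sketched as a dictionary of regressors keyed by make. To keep the example dependency-free, `MeanPriceModel` is a hypothetical stand-in for the XGBoost regressors mentioned on the slide; the data and field names are also invented for illustration.

```python
from collections import defaultdict
from statistics import mean


class MeanPriceModel:
    """Hypothetical stand-in for a trained regressor: always predicts the mean training price."""

    def fit(self, rows, prices):
        self.mean_price = mean(prices)
        return self

    def predict(self, row):
        return self.mean_price


def train_per_make(sales):
    """Group historical sales by make and fit one model per group."""
    by_make = defaultdict(list)
    for sale in sales:
        by_make[sale["make"]].append(sale)
    return {
        make: MeanPriceModel().fit(
            [s["features"] for s in group], [s["price"] for s in group]
        )
        for make, group in by_make.items()
    }


# Invented historical sales, for illustration only.
sales = [
    {"make": "Renault", "features": [2015, 30000], "price": 12000.0},
    {"make": "Renault", "features": [2012, 60000], "price": 8000.0},
    {"make": "Peugeot", "features": [2014, 40000], "price": 9500.0},
]

models = train_per_make(sales)
quote = models["Renault"].predict([2013, 50000])
print(sorted(models), quote)
```

At scoring time the make of the incoming vehicle selects which model to call, so each model only ever sees data from its own make.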

Collaborative Design: Acquire, Explore, Prepare, Model, Assess, Deploy (rapid prototyping on raw data)

Page 7: HUGIreland_VincentDeStocklin_DataScienceWorkflows

Deploying the workflow

Real time predictions for used car prices

Train Set

Design Node

API Node(s) (8000 calls / day)

CPU X, RAM X

XGBoost Models (3GB)

Renault, Laguna, 2015….., “Free text”

Feature Engineering

Renault, Laguna, 2015….., 0, 1, 0, 0, 1

Scoring

price 1, price 2, price 3

Average Price

Readjustment based on pricing gaps and ad hoc arbitrages

Price Prediction

Visitor logs, Interactions, Transactions

Conversion rates increased from 10% to 12% thanks to more reliable predictions
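
The scoring path on this slide (several model prices, an average, then a readjustment) can be sketched as below. The slide does not say how the readjustment is computed, so the multiplicative `adjustment_factor` is purely an assumption, as are the function name and the numbers.

```python
from statistics import mean


def quote_price(model_prices, adjustment_factor=1.0):
    """Average the prices returned by the individual models, then apply a business adjustment.

    adjustment_factor is a hypothetical knob standing in for the
    "readjustment based on pricing gaps and ad hoc arbitrages" step.
    """
    if not model_prices:
        raise ValueError("at least one model price is required")
    average_price = mean(model_prices)
    return round(average_price * adjustment_factor, 2)


# Invented model outputs (price 1, price 2, price 3), for illustration only.
prices = [11800.0, 12100.0, 12400.0]
unadjusted = quote_price(prices)
adjusted = quote_price(prices, adjustment_factor=0.97)
print(unadjusted, adjusted)
```

Keeping the adjustment outside the models lets the pricing strategy change without retraining anything.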

Page 8: HUGIreland_VincentDeStocklin_DataScienceWorkflows

Predictive for Fraud Detection in Health Care

Insurance company (France)

Prescriber Meta Data, Prescription Data, Patient Meta Data

• Cleaning, combining and enrichment of data

• Data Mining on the Interaction Graph

• Detect key variables in fraud detection

• Publication of the predictions

Fraud detection engine

Real time API

Accepted file

INVESTIGATE!

The models are continuously improved based on the most recent data. The predictions are updated on a regular basis.

The “most suspicious 10%” are 3 times more likely to be actual fraud cases

Challenges:

• Fraud is a very dynamic process

• Mistakes can be costly

• Requires advanced scheduling to optimize resource allocation (human + system)
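
A figure such as “the most suspicious 10% are 3 times more likely to be actual fraud cases” is usually read as a lift: the fraud rate among the highest-scored claims divided by the fraud rate over all claims. The sketch below computes that ratio under this interpretation; the function name and the synthetic data are assumptions chosen so that the result equals 3.

```python
def top_fraction_lift(scores, is_fraud, top_percent=10):
    """Fraud rate among the top-scored claims divided by the overall fraud rate."""
    ranked = sorted(zip(scores, is_fraud), key=lambda pair: pair[0], reverse=True)
    top_n = max(1, len(ranked) * top_percent // 100)
    top_rate = sum(label for _, label in ranked[:top_n]) / top_n
    base_rate = sum(is_fraud) / len(is_fraud)
    return top_rate / base_rate


# Synthetic example: 40 claims with strictly decreasing scores.
scores = [(40 - i) / 40 for i in range(40)]
# 3 of the 4 highest-scored claims are fraud; 10 of the 40 claims are fraud overall.
is_fraud = [1, 1, 1, 0] + [0] * 29 + [1] * 7

lift = top_fraction_lift(scores, is_fraud)
print(lift)
```

Investigators who only have capacity for a fraction of the claims get the most value from that capacity when it is spent on the top of this ranking.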

Page 9: HUGIreland_VincentDeStocklin_DataScienceWorkflows

Deploying the workflow

Automated Fraud Detection Engine

Train Set

Design Node

Referentials

HTTP

Export as a DSS package to the Automation Node

Remap connections to the Production Environment

Create Execution Scenarios

Monitor Jobs and Inspect Failures

• Basic daily schedule for scoring new claims

• Model retrain if referentials evolved
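
The two scenario rules above (score every day, retrain only when the referentials have changed) can be sketched in plain Python. This is not the DSS scenario API: `fingerprint`, `plan_daily_run` and the step names are hypothetical, and change detection by hashing is only one possible way to decide that the referentials "evolved".

```python
import hashlib
import json


def fingerprint(referentials):
    """Stable hash of the referential data, used to detect changes between runs."""
    payload = json.dumps(referentials, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def plan_daily_run(referentials, last_fingerprint):
    """Return the ordered steps for today's run and the fingerprint to store for tomorrow."""
    current = fingerprint(referentials)
    steps = []
    if current != last_fingerprint:
        steps.append("retrain_model")  # referentials evolved since the last run
    steps.append("score_new_claims")   # always score the new claims
    return steps, current


# Invented referential data, for illustration only.
referentials = {"drug_codes": ["A01", "B02"]}
first_steps, stored = plan_daily_run(referentials, last_fingerprint=None)
second_steps, stored = plan_daily_run(referentials, last_fingerprint=stored)
referentials["drug_codes"].append("C03")
third_steps, stored = plan_daily_run(referentials, last_fingerprint=stored)
print(first_steps, second_steps, third_steps)
```

Retraining only on change keeps the daily job short on most days while still picking up new referential entries automatically.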

Page 10: HUGIreland_VincentDeStocklin_DataScienceWorkflows

Improving the workflow

Automated Fraud Detection Engine

Design Node

Bundle Export creates dependencies between the nodes

Flow 1

Flow 2

Flow 3

Automation Node

• Model performance simulation

• A/B Test in real conditions

• Monthly model lifecycle management

• Runtime management

• Dashboarding
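
An A/B test in real conditions needs a stable way to route each claim to either the current model or a candidate. One common approach, sketched here as an assumption rather than as what the Automation Node does, is to hash an identifier into a bucket; the names `assign_variant`, `champion` and `challenger` are illustrative.

```python
import hashlib


def assign_variant(claim_id, challenger_percent=20):
    """Deterministically route a claim to the 'challenger' or 'champion' model.

    The same claim_id always lands in the same bucket, so repeated scoring
    of a claim never switches models mid-test.
    """
    digest = hashlib.sha256(str(claim_id).encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # bucket in 0..99
    return "challenger" if bucket < challenger_percent else "champion"


# Invented claim identifiers, for illustration only.
assignments = [assign_variant(f"claim-{i}") for i in range(10000)]
challenger_share = assignments.count("challenger") / len(assignments)
print(challenger_share)
```

Because assignment depends only on the identifier, the split can be reproduced later when comparing the two models' results.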

Page 11: HUGIreland_VincentDeStocklin_DataScienceWorkflows

Thanks! Questions?

Vincent de Stocklin