Dataiku Data Science Studio
Use Cases for Industrialized Data Products
HUG IRL, July 2016
OUR CLIENTS (70+)
Web
• Clickstream analysis and customer segmentation
• Churn prediction
• Sales forecasting
• Dynamic pricing
Industry & Infrastructure
• Predictive maintenance
• Logistics optimization
• Smart Cities
Bank & Insurance
• Fraud detection
• Risk analysis and prediction
• Life events detection
A GLOBAL LEADER: PHARMACEUTICALS
Data Science Studio
One product for a complete, design-to-production workflow
Collaborative Design: Acquire, Explore, Prepare, Model, Assess, Deploy (rapid prototyping on raw data)
Automated Applications: Advanced Scheduler, Monitoring & Dashboarding, Model Lifecycle Management
Real-Time Predictions: API Builder, Runtime, REST API, self-contained environment, scalable for high availability, dedicated scoring engine
Design Studio | Production Studio
Real time predictions for used car prices
Key Challenges:
• Increase accuracy of predicted prices to boost conversion
• Reduce reliance on and cost of external data providers
• Improve flexibility of the pricing strategy
• Improve customer intelligence and implement behavioral targeting
Initial OM:
• Relied on data providers’ API to query the market prices for each request
Company Description:
• The client is a two-sided B2C marketplace for used cars
• Sell side: real-time quote for your vehicle
• Buy side: normal e-commerce marketplace
(250k€)
Building the pricing model
Real time predictions for used car prices
Historical sales prices – Train Set
(Argus, La Centrale, internal data)
Average 5M lines / make
Raw record: Make, Model, Year, …, Miles, “Condition”, …, “Free text”
Transform to vector
Vectorized record: Make, Model, Year, …, Miles, Condition, …, 0, 1, 0, 1, …
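The "transform to vector" step can be sketched as follows. This is only an illustration, not the pipeline used in the project: the field names, the condition levels, the keyword vocabulary and the example values are all hypothetical, and the free text is reduced to simple 0/1 keyword flags.

```python
# Hypothetical vocabulary: keywords looked up in the free-text field.
KEYWORDS = ["leather", "gps", "accident", "sunroof"]
# Hypothetical condition levels, one-hot encoded below.
CONDITIONS = ["poor", "fair", "good", "excellent"]


def vectorize(listing):
    """Turn one raw listing (a dict) into a flat list of numbers."""
    vector = [float(listing["year"]), float(listing["miles"])]
    # One-hot encode the categorical "condition" field.
    vector += [1.0 if listing["condition"] == c else 0.0 for c in CONDITIONS]
    # One 0/1 flag per keyword found in the free text.
    words = set(listing["free_text"].lower().split())
    vector += [1.0 if k in words else 0.0 for k in KEYWORDS]
    return vector


# Made-up listing, for illustration only.
example = {
    "make": "Renault", "model": "Laguna", "year": 2015, "miles": 42000,
    "condition": "good", "free_text": "GPS and leather seats",
}
print(vectorize(example))
```

The make is left out of the vector here because the deck trains separate models per make; the car model would need its own categorical encoding, omitted for brevity.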
XGBoost algorithms were trained iteratively on historical data in order to maximize accuracy of the predictions
• 38 makes – 38 models
• For popular makes, several models (e.g. Renault)
Collaborative Design: Acquire, Explore, Prepare, Model, Assess, Deploy (rapid prototyping on raw data)
Deploying the workflow
Real time predictions for used car prices
Train Set
Design Node
API Node(s) (8000 calls / day)
CPU X, RAM X
XGBoost Models (3GB)
Renault, Laguna, 2015….., “Free text”
Renault, Laguna, 2015….., 0, 1, 0, 0, 1
price 1
price 2
price 3
Average Price
Scoring
Feature Engineering
Readjustment based on pricing gaps and ad hoc arbitrages
Visitor logs, Interactions, Transactions
Price Prediction
Conversion rates increased from 10% to 12% thanks to more reliable predictions
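The scoring path shown on this slide (several models each return a price, and the prices are averaged) can be sketched as below. This is a minimal sketch under stated assumptions: the three models are stand-in functions with made-up coefficients rather than trained XGBoost models, and `predict_price` and `models_by_make` are hypothetical names, not part of the DSS API.

```python
from statistics import mean


def predict_price(features, models_by_make):
    """Average the prices returned by every model registered for the make."""
    models = models_by_make[features["make"]]
    prices = [model(features) for model in models]
    return mean(prices)


# Stand-in models: plain functions instead of trained XGBoost models.
def model_a(f):
    return 15000.0 - 0.05 * f["miles"]


def model_b(f):
    return 14000.0 - 0.04 * f["miles"]


def model_c(f):
    return 16000.0 - 0.06 * f["miles"]


# A popular make can have several models; their outputs are averaged.
models_by_make = {"Renault": [model_a, model_b, model_c]}

request = {"make": "Renault", "model": "Laguna", "year": 2015, "miles": 40000}
print(predict_price(request, models_by_make))
```

Routing on the make keeps each model small and lets a single make be retrained or replaced without touching the others.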
Cleaning, combining and enrichment of data
Data mining on the interaction graph
Publication of the predictions
Detect key variables in fraud detection
INVESTIGATE!
Prescriber Meta Data
Prescription Data
The models are continuously improved based on the most recent data. The predictions are updated on a regular basis.
Fraud detection engine
Real time API
Predictive Analytics for Fraud Detection in Health Care
Insurance company (France)
Accepted file
Patient Meta Data
The “most suspicious 10%” are 3 times more likely to be actual fraud cases
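A figure like "3 times more likely" in the most suspicious 10% is usually read as a lift: the fraud rate among the top-scored cases divided by the overall fraud rate. The sketch below shows that calculation on made-up data; the scores and labels are illustrative only and do not reproduce the result quoted on the slide.

```python
def top_fraction_lift(scores, labels, fraction=0.10):
    """Fraud rate among the highest-scored cases divided by the overall rate."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    top_n = max(1, round(len(ranked) * fraction))
    top_rate = sum(label for _, label in ranked[:top_n]) / top_n
    base_rate = sum(labels) / len(labels)
    if base_rate == 0:
        raise ValueError("lift is undefined when there are no fraud cases")
    return top_rate / base_rate


# Made-up example: 20 claims, 5 of them fraudulent (label 1).
scores = list(range(20, 0, -1))  # claim 0 is the most suspicious
labels = [1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
print(top_fraction_lift(scores, labels))
```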
Challenges:
• Fraud is a very dynamic process
• Mistakes can be costly
• Requires advanced scheduling to optimize resource allocation (human + system)
Deploying the workflow
Automated Fraud Detection Engine
Train Set
Design Node
Referentials
HTTP
Export as a DSS package to the Automation Node
Remap connections to the Production Environment
Create Execution Scenarios
Monitor Jobs and Inspect Failures
• Basic daily schedule for scoring new claims
• Retrain the model if the referentials have changed
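The scenario logic in the two bullets above (score every day, retrain only when the referentials change) can be sketched in plain Python. This is not the DSS scenario API: `fingerprint`, `daily_run` and the two callbacks are hypothetical names, and change detection by hashing the referential is an assumption made for the example.

```python
import hashlib
import json


def fingerprint(referential):
    """Stable hash of a referential table (here: a list of dicts)."""
    payload = json.dumps(referential, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def daily_run(referential, last_fingerprint, retrain, score_new_claims):
    """Retrain only if the referential changed, then score the day's claims."""
    current = fingerprint(referential)
    if current != last_fingerprint:
        retrain()
    score_new_claims()
    return current  # to be stored and passed to the next run


# Two consecutive days with an unchanged (made-up) referential.
log = []
referential = [{"code": "A1", "label": "consultation"}]
fp = daily_run(referential, None,
               lambda: log.append("retrain"), lambda: log.append("score"))
fp = daily_run(referential, fp,
               lambda: log.append("retrain"), lambda: log.append("score"))
print(log)
```

Returning the fingerprint rather than storing it inside the function keeps the run stateless, so the scheduler decides where the previous value is persisted.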
Improving the workflow
Automated Fraud Detection Engine
Design Node
Bundle Export creates dependencies between the nodes
Flow 1
Flow 2
Flow 3
Automation Node
• Model performance simulation
• A/B Test in real conditions
• Monthly model lifecycle management
• Runtime management
• Dashboarding
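An A/B test in real conditions needs a stable way to decide which model version handles each claim. One common approach, shown here as an assumption rather than as the mechanism used on the Automation Node, is to hash an identifier into a bucket; `assign_variant` and the claim identifiers are hypothetical.

```python
import hashlib


def assign_variant(claim_id, share_b=0.5):
    """Deterministically route a claim to model "A" or challenger "B"."""
    digest = hashlib.sha256(claim_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "B" if bucket < share_b else "A"


# The same claim always lands in the same variant.
for claim_id in ["claim-001", "claim-002", "claim-003"]:
    print(claim_id, assign_variant(claim_id, share_b=0.2))
```

Because the assignment depends only on the identifier, a claim that is rescored later is compared against the same model version.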