Predictive Models at Scale using Dumbo, by Nikhil Ketkar

Apr 10, 2017

Transcript
Page 1: Predictive Models at Scale

Predictive Models at Scale using Dumbo

Nikhil Ketkar

Page 2: Predictive Models at Scale

Motivation: Problem Space @ Indix

● 40k+ Brands
● 600k+ Sellers
● 700+ Million Products
● 7k+ Categories
● 10k+ Attributes

Page 3: Predictive Models at Scale

Developing Predictive Models

Unlabelled Data → Sample → Hand Label → Model → Predict → Data with Predicted Labels
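The workflow on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline used at Indix: `featurize`, `fit`, `predict` and `run_pipeline` are hypothetical names, the single "contains a digit" feature is an assumption, and the majority-vote "model" only stands in for a real learner such as one from scikit-learn.

```python
import random
from collections import Counter, defaultdict


def featurize(title):
    # Hypothetical single feature: does the title contain any digit?
    return any(ch.isdigit() for ch in title)


def fit(labelled):
    """Toy stand-in for a real learner: majority label per feature value."""
    votes = defaultdict(Counter)
    for title, label in labelled:
        votes[featurize(title)][label] += 1
    return {feature: counts.most_common(1)[0][0]
            for feature, counts in votes.items()}


def predict(model, title, default="unknown"):
    # Feature values never seen during fitting fall back to `default`.
    return model.get(featurize(title), default)


def run_pipeline(unlabelled, hand_label, sample_size, seed=0):
    rng = random.Random(seed)
    sampled = rng.sample(unlabelled, sample_size)              # Sample
    labelled = [(t, hand_label(t)) for t in sampled]           # Hand Label
    model = fit(labelled)                                      # Model
    return [(t, predict(model, t)) for t in unlabelled]        # Predict
```

Here `hand_label` represents the human annotator: any callable that maps a title to a label. Only the sampled titles are ever passed to it, while the fitted model labels the full data set.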

Page 4: Predictive Models at Scale

[Figure: HDFS alongside six copies of a Statistical Model]

Predictive Models at Scale

Page 5: Predictive Models at Scale

The Two Giants

PyData Ecosystem (Model)

● Native: C/C++, Fortran
● Numpy
● Scipy, Pandas, Matplotlib
● scikit-learn, scikit-image, statsmodels

Hadoop Ecosystem (Predict)

● JVM
● Java/Scala
● HDFS, Hadoop MapReduce
● Cascading/Scalding

Page 6: Predictive Models at Scale

The Standard Options

● Port to Java/Scala, use as Library in Mapper
  ○ Time Consuming
  ○ Need to port parts of the PyData Stack
  ○ Reduced Velocity
  ○ Error prone
● Write a REST API/Service for the model and call from Mapper
  ○ Slow due to Network Latency
  ○ Deployment is a nightmare
● Use Disco

Page 7: Predictive Models at Scale

Can we do better?

● Hadoop Streaming with Typedbytes Support
● Python Wrappers over Hadoop Streaming
  ○ Dumbo
  ○ MRJob
  ○ Hadoopy
  ○ Pydoop

Reference: http://blog.cloudera.com/blog/2013/01/a-guide-to-python-frameworks-for-hadoop/
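To make the Hadoop Streaming option concrete: in its default text mode, a streaming mapper is just a program that reads input lines on stdin and writes tab-separated key/value lines on stdout. The sketch below shows that contract in plain Python. The function names are hypothetical, and treating the first token of a title as its brand is an assumption made only for illustration.

```python
def map_title(title):
    # Assumption for illustration: the first token of a title is the brand.
    tokens = title.split()
    if tokens:
        yield tokens[0], 1


def stream_records(lines):
    """Turn raw input lines into the 'key<TAB>value' text records
    that Hadoop Streaming expects on a mapper's stdout."""
    for line in lines:
        for key, value in map_title(line.rstrip("\n")):
            yield "%s\t%d" % (key, value)
```

Used as an actual streaming mapper, the script would print each record produced by `stream_records(sys.stdin)`. Wrappers such as Dumbo hide this line-oriented plumbing behind ordinary Python functions.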

Page 8: Predictive Models at Scale

Two Minute MapReduce Refresher

Reference: https://tarnbarford.net/journal/mapreduce-on-mongo
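As a refresher, the three phases (map, shuffle, reduce) can be simulated in memory with a word count. This is only a single-process sketch: `run_local` is a hypothetical helper, not part of Hadoop or Dumbo. The generator-style `mapper(key, value)` and `reducer(key, values)` functions resemble the shape that Python wrappers such as Dumbo accept.

```python
from collections import defaultdict


def mapper(key, value):
    # Map: emit (word, 1) for every word in the input line.
    for word in value.split():
        yield word, 1


def reducer(key, values):
    # Reduce: sum all counts that were grouped under one word.
    yield key, sum(values)


def run_local(records, mapper, reducer):
    # Shuffle: group every mapper output value by its key.
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    # Reduce each group, visiting keys in sorted order.
    results = []
    for out_key in sorted(groups):
        results.extend(reducer(out_key, groups[out_key]))
    return results
```

On a cluster the shuffle is performed by the framework across machines; only the mapper and reducer are user code.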

Page 9: Predictive Models at Scale

Sample Problem: Extract MPN (Manufacturer Part Number) from Product Titles

● 0.5 Billion Product Titles
● Many contain MPNs
● Humans can detect MPNs
● Can a model do the same?
● Use CRF (conditional random field) on Full Title
● Use RF (random forest) on Tokens

Moen CSIMC000BN Brushed Nickel Decorative Mirror Frame Corner Rosette from Mirrorscapes 000 Series Set of 4

Rohl A3608/6.5LPAPC 2 Polished Chrome Country Kitchen Low Lead Bar Faucet with Porcelain Lever Handle

Newport Brass 3 447/ORB Oil Rubbed Bronze Hand Relieved Diverter / Volume Control Handle from the Metropole Collection

Bosch HCFC2044B 1/4" SDS Plus X5L with Optimized Flute Surface Pack of 25

Sterling 7214120 Ensemble 0" x 30" Shower Receptor with Right hand Drain Pack 6

U12 23252 KUB QUATRON INDX DRILL

MPNs in Product Titles
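For the token-level approach, each token of a title has to be turned into features before a classifier such as a random forest can score it. The slides do not list the features that were used, so the set below is purely an assumption, chosen because MPNs in the examples above tend to mix digits and uppercase letters.

```python
def token_features(tokens, index):
    """Describe one token of a title with simple surface features.
    The feature set is hypothetical, not taken from the slides."""
    token = tokens[index]
    return {
        "position": index,
        "length": len(token),
        "has_digit": any(ch.isdigit() for ch in token),
        "has_alpha": any(ch.isalpha() for ch in token),
        # True only when the token has letters and all of them are uppercase.
        "is_upper": token.isupper(),
        # True when any character is neither a letter nor a digit.
        "has_punct": any(not ch.isalnum() for ch in token),
    }


def title_features(title):
    # Split on whitespace and featurize every token in order.
    tokens = title.split()
    return [(token, token_features(tokens, i)) for i, token in enumerate(tokens)]
```

Each feature dictionary would then be paired with a hand-assigned "is MPN" label for training, and the same function would be applied to unlabelled titles at prediction time.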

Page 10: Predictive Models at Scale

Code Walkthrough

Page 11: Predictive Models at Scale

Code Walkthrough
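The code shown on these slides is not part of the transcript. What follows is therefore not the author's code, only a minimal sketch of how a prediction job might look as a Dumbo-style class-based mapper. It assumes the model was pickled in the PyData environment, exposes a scikit-learn-like `predict` method that takes a list of inputs, and is shipped to each task under the hypothetical file name `model.pkl`.

```python
import pickle

MODEL_PATH = "model.pkl"  # hypothetical file name, shipped alongside the job


def load_model(path=MODEL_PATH):
    # Assumes the model object was pickled beforehand on the PyData side.
    with open(path, "rb") as handle:
        return pickle.load(handle)


class PredictMapper(object):
    def __init__(self, model=None):
        # Load the model once per map task, not once per record.
        self.model = model if model is not None else load_model()

    def __call__(self, key, value):
        # value: one product title. The model is assumed to offer a
        # predict() method that accepts a list and returns a list.
        label = self.model.predict([value])[0]
        yield value, label


# Under Dumbo, a map-only job would be started roughly like this
# (an assumption about usage, not executed here):
# if __name__ == "__main__":
#     import dumbo
#     dumbo.run(PredictMapper)
```

Accepting an optional `model` argument keeps the mapper testable without a cluster: a stub object with a `predict` method can be passed in directly.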

Page 12: Predictive Models at Scale

Important Learnings

● Dumbo: Fairly Stable, Mature and Ready for Production
● Gets the 2 giants working together!
● Found just one issue over 6 months of usage (patch submitted)
● Support for Typedbytes is critical if making predictions over binary data (Images etc.)