Machine Learning to moderate ads in real world classified's business by Vaibhav Singh & Jaroslaw Szymczak

Jan 22, 2018

Data & Analytics

Transcript
Page 1: Machine Learning to moderate ads in real world classified's business

Machine Learning to moderate ads in real world classified's business
by Vaibhav Singh & Jaroslaw Szymczak

Page 2: Machine Learning to moderate ads in real world classified's business

Agenda

● Moderation problem
● Offline model creation
  ○ feature generation
  ○ feature selection
  ○ data leakage
  ○ the algorithm
● Model evaluation
● Going live with the product
  ○ is your data really big?
  ○ automatic model creation pipeline
  ○ consistent development and production environments
  ○ platform architecture
  ○ performance monitoring

Page 3: Machine Learning to moderate ads in real world classified's business

50+ countries

60+ million new monthly listings

18+ million unique monthly sellers

Page 4: Machine Learning to moderate ads in real world classified's business

What do moderators look for?

Avoidance of payment

Sell another item in paid listing by changing its content

Flood site with duplicate posts to increase visibility

Create multiple accounts to bypass free ad per user limit

Violation of ToS

Add phone numbers or company information on the image rather than in the description or dedicated fields

Try to sell forbidden items, very often with title and description that try to evade keyword filters

Miscategorized listings

Item is placed in wrong category

Item is coming from legitimate business, but is marked as coming from individual

‘Seek’ problem in job offers

Page 5: Machine Learning to moderate ads in real world classified's business

Offline model creation

Page 6: Machine Learning to moderate ads in real world classified's business

Feature engineering...

… and selection

Feature selection:

● necessary for some algorithms, for others - not so much

● most important features

● avoiding leakage

Page 7: Machine Learning to moderate ads in real world classified's business

Feature generation - one-hot-encoding
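
As an illustration of one-hot encoding, here is a minimal sketch in plain Python. The category values and the helper names `fit_one_hot` and `one_hot` are assumptions made for this example; they do not come from the slides, and in practice a library encoder would normally be used.

```python
def fit_one_hot(values):
    """Map each distinct category to a column index, in sorted order."""
    return {category: index for index, category in enumerate(sorted(set(values)))}


def one_hot(value, vocabulary):
    """Return a 0/1 vector with a single 1 at the category's column.

    Categories that were not seen during fitting are encoded as all zeros.
    """
    vector = [0] * len(vocabulary)
    if value in vocabulary:
        vector[vocabulary[value]] = 1
    return vector


# Hypothetical ad categories, used only for illustration.
categories = ["cars", "jobs", "electronics", "cars"]
vocabulary = fit_one_hot(categories)
encoded = [one_hot(category, vocabulary) for category in categories]
```

Each distinct category becomes its own column, so the vector length grows with the number of categories.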

Page 8: Machine Learning to moderate ads in real world classified's business

Feature generation - feature hashing

Page 9: Machine Learning to moderate ads in real world classified's business

Feature hashing

➔ Good when dealing with high-dimensional, sparse features (dimensionality reduction)

➔ Memory efficient

➔ Cons - Getting back to feature names is difficult

➔ Cons - Hash collisions can have negative effects
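
The trade-offs above can be seen in a small sketch of the hashing trick, written in plain Python. The tokens, the bucket count, and the function name `hash_features` are assumptions for illustration; the slides do not say which hashing implementation was used.

```python
import hashlib


def hash_features(tokens, n_buckets=16):
    """Hash string features into a fixed-length count vector.

    The vector length is fixed up front, so no vocabulary has to be kept in
    memory. The price is that different tokens may land in the same bucket
    (a collision) and that bucket indices cannot be mapped back to names.
    """
    vector = [0] * n_buckets
    for token in tokens:
        # A stable hash keeps the mapping identical across runs and machines.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:8], "big") % n_buckets
        vector[bucket] += 1
    return vector


# Hypothetical title tokens, used only for illustration.
title_tokens = ["iphone", "6s", "brand", "new", "iphone"]
hashed = hash_features(title_tokens, n_buckets=8)
```

With only 8 buckets, collisions are likely; a larger bucket count reduces them at the cost of a wider vector.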

Page 10: Machine Learning to moderate ads in real world classified's business

Data Leakage

➔ Remove obvious fields, e.g.: id, account numbers

➔ Check the importance of the features for any unusual observations

➔ Have hold-out set that you do not process wrt. target variable

➔ Closely monitor live performance
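
A minimal sketch of the first and third points, assuming records are plain dictionaries. The field names (`id`, `account_number`, `title_length`, `rejected`) and the 20% hold-out fraction are hypothetical. The idea is to set the hold-out set aside before any step that looks at the target, so that such steps are fitted on the training part only.

```python
import random

# Hypothetical identifier-like fields that could leak the target.
LEAKY_FIELDS = {"id", "account_number"}


def drop_leaky_fields(record):
    """Return a copy of the record without identifier-like fields."""
    return {key: value for key, value in record.items() if key not in LEAKY_FIELDS}


def split_holdout(records, holdout_fraction=0.2, seed=42):
    """Shuffle and set aside a hold-out set before any target-aware processing."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_holdout = int(len(shuffled) * holdout_fraction)
    return shuffled[n_holdout:], shuffled[:n_holdout]


# Synthetic records, used only for illustration.
records = [
    {"id": i, "account_number": 1000 + i, "title_length": 10 + i, "rejected": i % 2}
    for i in range(10)
]
cleaned = [drop_leaky_fields(record) for record in records]
train, holdout = split_holdout(cleaned)
```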

Page 11: Machine Learning to moderate ads in real world classified's business

The algorithm

Desired features:

● state-of-the-art for structured binary classification problems

● allows reducing variance errors (overfitting)

● allows reducing bias errors (underfitting)

● has an efficient implementation

Page 12: Machine Learning to moderate ads in real world classified's business

eXtreme Gradient Boosting (XGBoost)

Source: https://www.slideshare.net/JaroslawSzymczak1/xgboost-the-algorithm-that-wins-every-competition

Page 13: Machine Learning to moderate ads in real world classified's business

Model evaluation

Page 14: Machine Learning to moderate ads in real world classified's business
Page 15: Machine Learning to moderate ads in real world classified's business

Beyond accuracy

● ROC AUC (area under the Receiver Operating Characteristic curve):
  ○ can be interpreted as concordance probability (i.e. a random positive example scores higher than a random negative example with probability equal to AUC)
  ○ it is too abstract to use as a standalone quality metric
  ○ does not depend on the class ratio

● PRC AUC (area under the Precision-Recall Curve):
  ○ depends on data balance
  ○ is not intuitively interpretable

● Precision @ fixed Recall, Recall @ fixed Precision:
  ○ can be found using thresholding
  ○ they heavily depend on data balance
  ○ they best reflect the business requirements
  ○ and take processing capabilities into account (in that case Precision @k is actually more accurate)

● choose one, and only one, as your KPI and treat the others as constraints
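
The concordance interpretation of ROC AUC can be checked directly by counting pairs. The sketch below is a brute-force illustration in plain Python with made-up labels and scores; the function name `concordance_auc` is an assumption, and the quadratic loop is only suitable for small samples.

```python
def concordance_auc(labels, scores):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly.

    A pair counts as 1 when the positive example has the higher score and as
    0.5 when the two scores are tied.
    """
    positives = [score for label, score in zip(labels, scores) if label == 1]
    negatives = [score for label, score in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    concordant = 0.0
    for positive_score in positives:
        for negative_score in negatives:
            if positive_score > negative_score:
                concordant += 1.0
            elif positive_score == negative_score:
                concordant += 0.5
    return concordant / (len(positives) * len(negatives))


# Made-up labels (1 = ad should be rejected) and model scores.
labels = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.6]
auc = concordance_auc(labels, scores)  # 7 of the 9 pairs are concordant
```

Because only the ordering within pairs matters, duplicating every negative example leaves the value unchanged, which is why the metric does not depend on the class ratio.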

Page 16: Machine Learning to moderate ads in real world classified's business

Example ROC for moderation problem

Page 17: Machine Learning to moderate ads in real world classified's business

Precision-recall curve example

Page 18: Machine Learning to moderate ads in real world classified's business

Precision @recall
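
Precision at a fixed recall can be found by sweeping the threshold over the ranked scores. The following is a sketch in plain Python under assumed data; the function name `precision_at_recall` and the example labels and scores are hypothetical.

```python
def precision_at_recall(labels, scores, min_recall):
    """Best precision among thresholds whose recall is at least min_recall.

    Returns (precision, threshold); an example is predicted positive when its
    score is greater than or equal to the threshold.
    """
    total_positives = sum(labels)
    if total_positives == 0:
        raise ValueError("need at least one positive example")
    ranked = sorted(zip(scores, labels), reverse=True)
    best = None
    true_positives = 0
    for rank, (score, label) in enumerate(ranked, start=1):
        true_positives += label
        # Tied scores share one threshold, so evaluate only after the last of them.
        if rank < len(ranked) and ranked[rank][0] == score:
            continue
        recall = true_positives / total_positives
        precision = true_positives / rank
        if recall >= min_recall and (best is None or precision > best[0]):
            best = (precision, score)
    return best


# Made-up labels (1 = ad should be rejected) and model scores.
labels = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.6]
precision, threshold = precision_at_recall(labels, scores, min_recall=1.0)
```

Recall at a fixed precision follows the same sweep with the roles of the two quantities swapped.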

Page 19: Machine Learning to moderate ads in real world classified's business

Recall @precision

Page 20: Machine Learning to moderate ads in real world classified's business

Going live with the product

Page 21: Machine Learning to moderate ads in real world classified's business

Is your data really big?

Page 22: Machine Learning to moderate ads in real world classified's business

SVM Light Data Format

➔ Memory Efficient. Features can be created on one machine and do not require huge clusters

➔ Cons - Number of features is unknown, store it separately

1 191:-0.44 87214:-0.44 200004:0.20 200012:1 206976:1 206983:-1 207015:1 207017:1 226201:1
1 1738:0.57 130440:-0.57 206999:0.32 207000:28 207001:6 207013:1 207015:1 207017:1 226300:1
0 2812:-0.63 34755:-0.31 206995:2.28 206997:1 206998:2 206999:0.00 207000:1 207001:28 226192:1
1 4019:0.35 206999:0.43 207000:40 207001:18 207013:1 207014:1 207016:1 226261:1
0 8903:0.37 207000:4 207001:14 207013:1 207014:1 207016:1 226262:1
1 5878:-0.27 206995:2.28 206998:1 206999:5.80 207000:1 207001:24 226187:1
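
Each row in this format is a label followed by sparse `index:value` pairs. The sketch below parses such a row in plain Python; the helper names are assumptions for illustration, and a library loader would normally be used instead. It also shows why the number of features has to be stored separately: the file reveals only the largest index that happens to occur.

```python
def parse_svmlight_line(line):
    """Parse '<label> <index>:<value> ...' into (label, {index: value})."""
    parts = line.split()
    label = float(parts[0])
    features = {}
    for item in parts[1:]:
        index, value = item.split(":", 1)
        features[int(index)] = float(value)
    return label, features


def max_feature_index(lines):
    """Largest feature index seen in the given rows.

    The real feature count may be larger, because features that are zero in
    every row never appear in the file.
    """
    return max(max(parse_svmlight_line(line)[1]) for line in lines)


# The start of the first example row, shortened for readability.
sample = "1 191:-0.44 87214:-0.44 200004:0.20 200012:1"
label, features = parse_svmlight_line(sample)
```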

Page 23: Machine Learning to moderate ads in real world classified's business

Lessons Learnt

➔ Do not go for distributed learning if you don’t need to

➔ Choose your tech based on data size. Do not go for hype-driven development

➔ Your machine is not the limit; there is the cloud

➔ Ask yourself: what is the most difficult thing to scale? → People

Page 24: Machine Learning to moderate ads in real world classified's business

Model Generation Pipeline

Page 25: Machine Learning to moderate ads in real world classified's business

Automatic model creation pipeline

● Automation makes things deterministic

● Airflow, Luigi and many others are good choices for job dependency management

Page 26: Machine Learning to moderate ads in real world classified's business

Luigi Dashboard

Page 27: Machine Learning to moderate ads in real world classified's business

Luigi Task Visualizer

Page 28: Machine Learning to moderate ads in real world classified's business

Lessons Learnt

➔ when you write to the output path yourself, create the output at the very end of the task

➔ you can create dependencies dynamically by yielding tasks

➔ adding the workers parameter to your command parallelizes tasks that are ready to run (e.g. python run.py Task … --workers 15)
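
The first lesson matters because Luigi treats a task as complete once its output exists, so a half-written file at the output path can make a failed task look finished. One common way to follow the advice is to write to a temporary file and move it into place as the last step. The sketch below uses only the Python standard library and is not Luigi API; the function name and the file names are hypothetical.

```python
import os
import tempfile


def write_output_atomically(output_path, lines):
    """Write to a temporary file, then move it into place as the final step."""
    directory = os.path.dirname(os.path.abspath(output_path))
    fd, temp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            for line in lines:
                handle.write(line + "\n")
        # The output path only comes into existence once the file is complete.
        os.replace(temp_path, output_path)
    except BaseException:
        # Clean up the partial file so that nothing is left behind on failure.
        if os.path.exists(temp_path):
            os.remove(temp_path)
        raise


# Illustrative usage in a throwaway directory.
work_dir = tempfile.mkdtemp()
target = os.path.join(work_dir, "features.txt")
write_output_atomically(target, ["1 191:-0.44", "0 2812:-0.63"])
```

The temporary file is created in the same directory as the target so that the final move stays on one filesystem.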

Page 29: Machine Learning to moderate ads in real world classified's business

Consistent development and production environments

Page 30: Machine Learning to moderate ads in real world classified's business

Model Serving Architecture

[Architecture diagram. Components: Flask API, Queue, Prediction Module, Mongo, Monitoring & Stats (Graphite, Grafana), Learning Module, Scikit, XGBoost, Luigi. Labeled flows: Ask Prediction, Return Prediction, Learning Ads.]

Page 31: Machine Learning to moderate ads in real world classified's business

Image Model Serving Architecture

[Architecture diagram. Components: AWS Kinesis Stream, Incoming Pictures, Hash Generation, Country Specific Image Moderation, General Moderation NSFW, Tag and Category Prediction, Mongo, OLX Site, S3 Models, GPU Clusters, Learning Cluster (TF, Keras, MXNet).]

Page 32: Machine Learning to moderate ads in real world classified's business

Performance monitoring

Page 33: Machine Learning to moderate ads in real world classified's business

Model monitoring and management

Page 34: Machine Learning to moderate ads in real world classified's business

Lessons Learnt

➔ Always Batch
Batching will reduce CPU utilization, and the same machines will be able to handle many more requests

➔ Modularize, Dockerize and Orchestrate
Containerize your code so that it is independent of machine configurations

➔ Monitoring
Use a monitoring service

➔ Choose simple and easy tech
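
To illustrate the batching lesson, here is a minimal sketch in plain Python that groups incoming rows and makes one model call per group instead of one call per row. The function `predict_batch` is a stand-in for a real model, and the batch size and feature rows are made up for this example.

```python
def batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def predict_batch(feature_rows):
    """Stand-in for a model call that scores all rows in one invocation."""
    return [sum(row) > 1.0 for row in feature_rows]


def predict_all(feature_rows, batch_size=3):
    """Score all rows and report how many model calls were needed."""
    predictions = []
    calls = 0
    for batch in batches(feature_rows, batch_size):
        predictions.extend(predict_batch(batch))
        calls += 1
    return predictions, calls


# Made-up feature rows, used only for illustration.
rows = [[0.2, 0.1], [0.9, 0.4], [0.5, 0.5], [1.5, 0.0], [0.0, 0.0], [0.7, 0.6], [0.3, 0.3]]
predictions, calls = predict_all(rows, batch_size=3)
```

Seven rows are served with three model calls here, which spreads the fixed per-call overhead over several rows.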

Page 35: Machine Learning to moderate ads in real world classified's business

Acknowledgements

● Andrzej Prałat
● Wojciech Rybicki

Vaibhav Singh

Jaroslaw Szymczak

PYDATA BERLIN 2017
July 2nd, 2017