YM-RMWisdom15 final

Classify This

Analyzing and classifying e-commerce merchant websites for compliance with payment system brand protection programs using RapidMiner

Vladimir Mikhnovich, fraud analyst

© 2015

The problem

What is brand damage, and how do payment systems control it?

Illegal merchant activities (totally prohibited, a.k.a. ‘Deadly Sins’)

• Drugs

• Weapons

• Porno

• …

High risk merchants (require limitations / additional checks)

• Supplements

• Adult shops

• Brand replicas

• …

Payment systems and aggregators must check merchants to avoid high risk / prohibited categories / fraud

What to comply with

Business Risk Assessment and Mitigation (BRAM)

Global Brand Protection Program (GBPP)

The task

As initially issued:

Regular (monthly/quarterly) scanning of big batches (tens of thousands) of merchant websites to identify non-compliant and high-risk ones for further manual screening.

Total number of merchants using the Yandex.Money integrated payment solution: over 70,000.

In the first round, we need to check about 15,000 websites.

Key concerns (before we start)

Automate downloading of big batches of websites

• Ideally we want to download all 15,000 sites at once

• Speed doesn’t matter, but errors must be handled automatically

Automate classification of thousands of documents

• Ideally we want to classify everything at once

• Accuracy matters more than speed

Manual picking and labeling of sites for the training dataset

• And you thought it was easy to find & buy drugs online? Actually, it’s not…

Uncertainty of some categories

• Categories like ‘weapons’ or ‘adult’ are pretty straightforward…

• …while ‘replica’ or ‘magic’ are not

Define thresholds for classification results

• We do not want to manually check 50% of the websites after automatic classification
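The first concern above, unattended downloading that survives errors, might be sketched as follows. This is only an illustration, not the RapidMiner process from the talk: the names `download_all`, `fetch_with_urllib` and the `fetch` parameter are hypothetical, and the standard-library `urllib` stands in for wget.

```python
import urllib.request
from typing import Callable, Dict, Iterable, Tuple


def fetch_with_urllib(url: str, timeout: float = 30.0) -> str:
    """Download one page and return its body as text (hypothetical helper)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def download_all(
    urls: Iterable[str],
    fetch: Callable[[str], str] = fetch_with_urllib,
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Fetch every URL; record failures instead of stopping the whole run."""
    pages: Dict[str, str] = {}
    errors: Dict[str, str] = {}
    for url in urls:
        try:
            pages[url] = fetch(url)
        except Exception as exc:  # DNS failure, timeout, HTTP error, bad URL...
            errors[url] = f"{type(exc).__name__}: {exc}"
    return pages, errors
```

Failed sites end up in `errors` with a short reason, so they can be retried or reviewed later without blocking the rest of the batch.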

The general approach

Obtain test dataset

Text mining extension

Classification model

Download websites

Apply model

Training process

Training dataset

• 11 categories

• 269 labeled sites

• 28,000 words

Text processing

• Extract text, tokenize, stem

• Build TF-IDF matrix

Model evaluation

• k-NN with cross-validation

TF-IDF metric

TF-IDF (term frequency–inverse document frequency): a numerical statistic that is intended to reflect how important a word is to a single document in a collection of documents.
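To make the definition concrete, here is a small sketch of one common TF-IDF variant. The exact weighting used by the RapidMiner text mining extension is not stated in the slides, so the formula below (relative term frequency times the log of inverse document frequency) is an assumption, and the function names are illustrative. Stemming is omitted.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into runs of letters (any alphabet)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def tf_idf(docs: List[str]) -> List[Dict[str, float]]:
    """Return one {term: weight} mapping per document.

    tf(t, d) = count of t in d / number of tokens in d
    idf(t)   = log(N / number of documents containing t)
    """
    tokenized = [tokenize(doc) for doc in docs]
    # Document frequency: in how many documents each term occurs at least once.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    n_docs = len(docs)
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights
```

With this variant, a word that occurs in every document gets weight 0, while a word concentrated in a few documents gets a high weight there. That is what lets category-specific vocabulary stand out.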

RapidMiner insights

Labeled list of domains

Loop URLs, wget and store to repository

Build TF-IDF matrix

k-NN cross-validation (leave one out)
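Leave-one-out cross-validation of a k-NN classifier can be sketched in a few lines. The slides do not say which similarity measure was used, so cosine similarity over sparse term-weight vectors is an assumption here, and all names are illustrative.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

Vector = Dict[str, float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_predict(query: Vector, train: List[Tuple[Vector, str]], k: int) -> str:
    """Majority label among the k training vectors most similar to query."""
    ranked = sorted(train, key=lambda item: cosine(query, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def leave_one_out_accuracy(data: List[Tuple[Vector, str]], k: int) -> float:
    """Hold out each example in turn, classify it with the rest, and score."""
    hits = 0
    for i, (vector, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        hits += knn_predict(vector, rest, k) == label
    return hits / len(data)
```

Every labeled site is used as a test case exactly once, which suits a small training set such as the 269 labeled sites mentioned earlier.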

Example of confusion matrix

accuracy: 88.46% +/- 31.95%

predicted / true          adult    drugs  replica  weapons  normal guys  betting exchange  hourhotel    magic      spy  supplements  torrent  class precision
pred. adult                  14        1        0        0            4                 0          0        0        0            0        0           73.68%
pred. drugs                   1       10        0        0            0                 0          0        0        0            0        0           90.91%
pred. replica                 1        0       12        0            1                 0          0        0        0            0        0           85.71%
pred. weapons                 0        0        0       10            0                 0          0        0        0            0        0          100.00%
pred. normal guys             1        2        1        0           88                 4          0        1        0            0        1           89.80%
pred. betting exchange        0        0        0        0            0                 9          0        0        0            0        0          100.00%
pred. hourhotel               0        0        0        0            2                 0          8        0        0            0        0           80.00%
pred. magic                   0        0        0        0            1                 0          0        8        0            0        0           88.89%
pred. spy                     0        0        0        0            1                 0          0        0        9            0        0           90.00%
pred. supplements             0        0        0        0            0                 0          0        0        0           12        0          100.00%
pred. torrent                 0        0        0        0            2                 0          0        0        0            0        4           66.67%
class recall             82.35%   76.92%   92.31%  100.00%       88.89%            69.23%    100.00%   88.89%  100.00%      100.00%   80.00%
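The summary figures can be recomputed from the counts in the confusion matrix above. The sketch below does so in plain Python. Reading the "+/-" value as the standard deviation of per-site 0/1 outcomes under leave-one-out is an interpretation, though it reproduces the reported number.

```python
import math

labels = ["adult", "drugs", "replica", "weapons", "normal guys", "betting exchange",
          "hourhotel", "magic", "spy", "supplements", "torrent"]

# matrix[i][j] = number of sites predicted as labels[i] whose true class is labels[j]
matrix = [
    [14, 1, 0, 0, 4, 0, 0, 0, 0, 0, 0],
    [1, 10, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [1, 0, 12, 0, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 10, 0, 0, 0, 0, 0, 0, 0],
    [1, 2, 1, 0, 88, 4, 0, 1, 0, 0, 1],
    [0, 0, 0, 0, 0, 9, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 2, 0, 8, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 0, 0, 8, 0, 0, 0],
    [0, 0, 0, 0, 1, 0, 0, 0, 9, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 12, 0],
    [0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 4],
]

n = len(labels)
total = sum(sum(row) for row in matrix)
correct = sum(matrix[i][i] for i in range(n))
accuracy = correct / total

# Precision: diagonal / row sum. Recall: diagonal / column sum.
precision = {labels[i]: matrix[i][i] / sum(matrix[i]) for i in range(n)}
recall = {labels[j]: matrix[j][j] / sum(row[j] for row in matrix) for j in range(n)}

# With leave-one-out, each fold scores 0 or 1, so the spread is sqrt(p * (1 - p)).
spread = math.sqrt(accuracy * (1 - accuracy))

print(f"accuracy: {accuracy:.2%} +/- {spread:.2%}")  # accuracy: 88.46% +/- 31.95%
```

The large "+/-" value therefore reflects the all-or-nothing scoring of single-site folds rather than an unstable model.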

‘All-in-one’ test process

It took just ~1 hour to download and classify ~900 sites at once.

For test purposes an ‘all-in-one’ process has been implemented

Data structures and sizes

Text data size

• 1 site = a text file of 0.3 to 1+ megabytes

• The corpus totals 150–300 megabytes of text files

TF-IDF matrices examples

• Training data: 269 sites x 28,017 words (70 megabytes)

• Test data: 832 sites x 60,893 words (414 megabytes)

Batching approach

Many attempts to classify thousands of sites at once were unsuccessful. The reason: memory problems.

To overcome physical memory limitations, we chose another approach: batching. First we download the websites and divide them into batches of reasonable size (empirically, 200-300 sites is small enough to fit all matrices in memory). Every batch is downloaded into a separate directory and then analyzed in a loop.

Thousands of websites

Download and save

Loop batches Classify every batch

RapidMiner insights

Batch numbers

Batch size

15,000 sites = 50 batches x 300 files = 50 directories to loop the classifier through
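The batch arithmetic can be checked with a short sketch. The helper name `make_batches` and the placeholder domain names are hypothetical; in the talk, the batches are directories that a RapidMiner loop iterates over.

```python
from typing import Iterator, List, Sequence


def make_batches(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


# Placeholder domain names; the real input would be the merchant list.
sites = [f"site{i}.example" for i in range(15000)]
batches = list(make_batches(sites, batch_size=300))
print(len(batches))  # 50
```

Each batch can then be written to its own directory and classified independently, so only one batch's matrices need to be in memory at a time.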

k-NN and thresholds

[Figure: an unknown site surrounded by labeled neighbors: adult, normal, torrent, drugs, weapons]

k-NN is simple and, applied to text analysis, provides a measure of how similar a text document is to known categories.

k-NN and thresholds

Category columns: adult, drugs, replica, weapons, normal, betting, magic, spy, supplements, torrent (only the non-empty cells are listed for each site)

site                  confidences         prediction
eroshop.ru            100%                adult
putana78.com          100%                adult
IntimCity.nl          81% 19%             adult
kupialco.ru           100%                drugs
mari-juana.net        100%                drugs
kyritelnie-smesi.nl   81% 19%             drugs
03market.ru           19% 0% 81%          normal guys
1-ocenka.ru           20% 40% 40%         normal guys
1c-interes.ru         60% 40%             normal guys
1gb.ru                100%                normal guys
100-z.ru              20% 20% 60%         spy
100mile.ru            20% 80%             spy
1belka.ru             20% 20% 40% 20%     spy
100captains.ru        20% 80%             torrent
1chef.ru              100%                weapons

k=5 allows assigning significant confidence values to categories. Only high confidences are taken into account (threshold >= 80%).
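A minimal sketch of how per-category confidences and the 80% threshold could work with k=5. Values such as 81% / 19% in the table are not multiples of 20%, which suggests the votes were weighted in some way, so the neighbor weights are left as an input here. The function names are illustrative, and treating `normal guys` as the only non-risky label is an assumption.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple


def class_confidences(neighbors: List[Tuple[str, float]]) -> Dict[str, float]:
    """Turn (label, weight) pairs for the k nearest neighbors into vote shares."""
    totals: Dict[str, float] = defaultdict(float)
    for label, weight in neighbors:
        totals[label] += weight
    weight_sum = sum(totals.values())
    if weight_sum == 0:
        return {}
    return {label: value / weight_sum for label, value in totals.items()}


def flag(
    neighbors: List[Tuple[str, float]],
    threshold: float = 0.8,
    normal_label: str = "normal guys",
) -> Optional[str]:
    """Return the winning risky category if its share reaches the threshold."""
    shares = class_confidences(neighbors)
    if not shares:
        return None
    label, share = max(shares.items(), key=lambda item: item[1])
    if label != normal_label and share >= threshold:
        return label
    return None


# Four of five equally weighted neighbors are 'adult': 80% confidence, flagged.
print(flag([("adult", 1.0)] * 4 + [("normal guys", 1.0)]))  # adult
```

With equal weights, the threshold means that at least four of the five nearest neighbors must agree on a risky category before a site is sent to manual screening.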

Finally, what’s in and out

On average, 8-10% of sites are assigned a high confidence (>= 80%) during classification and are then screened manually.

Big list of domains

site                  prediction    confidence
zdoroviak.ru          supplements   100%
zscom.ru              spy           100%
zwuk.ru               spy           100%
zarekoy.ru            hourhotel     100%
zastava-izhevsk.ru    weapons       100%
zutera.ru             replica       81%
zoombao.com           replica       80%
zita-gita.ru          adult         80%
zishop.ru             replica       80%
zdorovoetelo100.ru    supplements   80%
zen-shop.ru           drugs         80%

Performance and accuracy

For 100 randomly picked real-life websites:

Downloading time: 7 minutes

Processing & classification time: 30 seconds (0.3 sec per site)

Cross-validation accuracy (without applying threshold): 89%

High risk sites classified as normal (false negatives): 1

Correctly classified high risk sites: 12 out of 13 (92%)

Normal sites classified as high risk (false positives): 0

What’s next

Server deployment

• Process scheduling

• Automated reports

Improved accuracy

• Incremental model updates with new data

• Using n-grams

Automated scanning

• No need to input data manually

• Automatic parsing of referral domains

Thank you!

Vladimir Mikhnovich, fraud analyst

@kypexin

