Classify This
Analyzing and classifying e-commerce
merchants’ websites for compliance with
payment systems’ brand protection programs
using RapidMiner
Vladimir Mikhnovich, fraud analyst
© 2015
The problem
What is brand damage and how do payment systems control it?
• Drugs
• Weapons
• Pornography
• …
Illegal merchant activities (totally prohibited a.k.a. ‘Deadly Sins’)
• Supplements
• Adult shops
• Brand replicas
• …
High risk merchants (require limitations / additional checks)
Payment systems and aggregators must check merchants to avoid high
risk / prohibited categories / fraud
What to comply with
Business Risk
Assessment and
Mitigation (BRAM)
Global Brand
Protection
Program (GBPP)
The task
As initially issued:
Regular (monthly/quarterly) scanning of big batches (tens of thousands) of merchant websites to identify non-compliant and high risk ones for further manual screening.
Total number of merchants using the Yandex.Money integrated payment solution: over 70.000.
In the first round, we need to check about 15.000 websites.
Key concerns (before we start)
Automate downloading of big batches of websites
• Ideally we want to download all 15.000 sites at once
• Speed doesn’t matter but we must automatically handle errors
Automate classification of thousands of documents
• Ideally we want to classify everything at once
• Accuracy matters more than speed
Manual picking and labeling sites for training dataset
• And you thought it was easy to find and buy drugs online? Actually, it’s not…
Uncertainty of some categories
• Categories like ‘weapons’ or ‘adult’ are pretty straightforward…
• …while ‘replica’ or ‘magic’ are not
Define thresholds for classification results
• We do not want to further manually check 50% of websites after automatic classification
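The first concern, downloading a big list of sites while handling errors automatically, can be sketched as follows. This is a minimal Python illustration, not the RapidMiner process used in the talk; the function names and the injectable `fetch` parameter are assumptions made for the example.

```python
import urllib.request
from typing import Callable, Dict, Iterable, Tuple


def fetch_html(url: str, timeout: float = 20.0) -> str:
    """Download one page and return its body as text (needs network access)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def download_all(
    domains: Iterable[str],
    fetch: Callable[[str], str] = fetch_html,
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Return (pages, errors); a failing site is recorded and skipped."""
    pages: Dict[str, str] = {}
    errors: Dict[str, str] = {}
    for domain in domains:
        try:
            pages[domain] = fetch("http://" + domain)
        except Exception as exc:  # DNS failure, timeout, HTTP error, ...
            # One broken site must not stop the whole batch.
            errors[domain] = repr(exc)
    return pages, errors
```

Catching every exception is deliberately broad here: speed does not matter, but the loop has to survive any single bad site. The `errors` dictionary can be reviewed or retried later.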
The general approach
Obtain test dataset
Text mining extension
Classification model
Download websites
Apply model
Training process
Training dataset
• 11 categories
• 269 labeled sites
• 28000 words
Text processing
• Extract text, tokenize, stem
• Build TF-IDF matrix
Model evaluation
• k-NN with cross-validation
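The "extract text, tokenize" steps might look like the sketch below. It uses only the Python standard library as a stand-in for RapidMiner's text mining extension; the class and function names are hypothetical, and stemming is left out because the standard library has no stemmer.

```python
import re
from html.parser import HTMLParser
from typing import List


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks: List[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def tokenize_html(html: str) -> List[str]:
    """Strip markup and return lowercase word tokens (letters only)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks)
    # [^\W\d_]+ matches runs of Unicode letters, so Cyrillic text works too.
    return re.findall(r"[^\W\d_]+", text.lower())
```

The resulting token lists are the input for building the TF-IDF matrix.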
TF-IDF metric
TF-IDF (term frequency–inverse document frequency): a numerical statistic that
is intended to reflect how important a word is to a single document in a
collection of documents.
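To make the definition concrete, here is a minimal sketch of one common TF-IDF variant: term frequency (count divided by document length) multiplied by log(N / document frequency). The exact weighting and normalization RapidMiner applies may differ, so treat the formula choice as an assumption.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Return one sparse {term: weight} row per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    rows = []
    for doc in docs:
        counts = Counter(doc)
        rows.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return rows
```

A word that occurs in every document gets weight 0, while a word that is frequent in one document and rare elsewhere gets a high weight. That is the "importance to a single document" the definition describes.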
RapidMiner insights
Labeled list of
domains
Loop URLs, wget and store to repository
Build TF-IDF matrix
k-NN cross-validation (leave one out)
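Leave-one-out cross-validation of a k-NN classifier can be sketched as below. The sketch assumes sparse TF-IDF rows stored as dictionaries and cosine similarity as the nearness measure; both are illustrative choices, and RapidMiner's k-NN operator may be configured differently.

```python
import math
from collections import Counter
from typing import Dict, List

Vector = Dict[str, float]  # sparse TF-IDF row: {term: weight}


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two sparse vectors (0.0 if either is empty)."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_predict(query: Vector, vectors: List[Vector],
                labels: List[str], k: int) -> str:
    """Majority label among the k training vectors most similar to query."""
    ranked = sorted(range(len(vectors)),
                    key=lambda i: cosine(query, vectors[i]), reverse=True)
    votes = Counter(labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]


def leave_one_out_accuracy(vectors: List[Vector], labels: List[str],
                           k: int = 5) -> float:
    """Hold out each document once and classify it from all the others."""
    hits = 0
    for i in range(len(vectors)):
        rest_vectors = vectors[:i] + vectors[i + 1:]
        rest_labels = labels[:i] + labels[i + 1:]
        if knn_predict(vectors[i], rest_vectors, rest_labels, k) == labels[i]:
            hits += 1
    return hits / len(vectors)
```

With a few hundred labeled sites, the quadratic cost of comparing every held-out document with all the others is not a concern.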
Example of confusion matrix
accuracy: 88.46% +/- 31.95%
Rows: predicted class; columns: true class.
                                                             normal   betting                                                         class
                          adult    drugs  replica  weapons     guys  exchange  hourhotel    magic      spy  supplements  torrent  precision
pred. adult                  14        1        0        0        4         0          0        0        0            0        0     73.68%
pred. drugs                   1       10        0        0        0         0          0        0        0            0        0     90.91%
pred. replica                 1        0       12        0        1         0          0        0        0            0        0     85.71%
pred. weapons                 0        0        0       10        0         0          0        0        0            0        0    100.00%
pred. normal guys             1        2        1        0       88         4          0        1        0            0        1     89.80%
pred. betting exchange        0        0        0        0        0         9          0        0        0            0        0    100.00%
pred. hourhotel               0        0        0        0        2         0          8        0        0            0        0     80.00%
pred. magic                   0        0        0        0        1         0          0        8        0            0        0     88.89%
pred. spy                     0        0        0        0        1         0          0        0        9            0        0     90.00%
pred. supplements             0        0        0        0        0         0          0        0        0           12        0    100.00%
pred. torrent                 0        0        0        0        2         0          0        0        0            0        4     66.67%
class recall             82.35%   76.92%   92.31%  100.00%   88.89%    69.23%    100.00%   88.89%  100.00%      100.00%   80.00%
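The figures in the matrix can be re-derived from its counts. The sketch below recomputes accuracy, class precision and class recall. It also shows that the "+/- 31.95%" value matches the standard deviation of per-fold accuracy under leave-one-out, where every fold scores either 0 or 1; that reading is inferred from the numbers and is not stated in the slides.

```python
import math

LABELS = ["adult", "drugs", "replica", "weapons", "normal guys",
          "betting exchange", "hourhotel", "magic", "spy",
          "supplements", "torrent"]

# Rows = predicted class, columns = true class, same order as LABELS.
MATRIX = [
    [14, 1, 0, 0, 4, 0, 0, 0, 0, 0, 0],
    [1, 10, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [1, 0, 12, 0, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 10, 0, 0, 0, 0, 0, 0, 0],
    [1, 2, 1, 0, 88, 4, 0, 1, 0, 0, 1],
    [0, 0, 0, 0, 0, 9, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 2, 0, 8, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 0, 0, 8, 0, 0, 0],
    [0, 0, 0, 0, 1, 0, 0, 0, 9, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 12, 0],
    [0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 4],
]

total = sum(sum(row) for row in MATRIX)
correct = sum(MATRIX[i][i] for i in range(len(LABELS)))
accuracy = correct / total

# Precision: correct predictions divided by the row (predicted) total.
precision = {label: MATRIX[i][i] / sum(MATRIX[i])
             for i, label in enumerate(LABELS)}
# Recall: correct predictions divided by the column (true) total.
recall = {label: MATRIX[i][i] / sum(row[i] for row in MATRIX)
          for i, label in enumerate(LABELS)}

# Each leave-one-out fold is a single 0/1 outcome, so the standard
# deviation across folds is sqrt(p * (1 - p)).
fold_std = math.sqrt(accuracy * (1 - accuracy))

print(f"accuracy: {accuracy:.2%} +/- {fold_std:.2%}")
```

The large spread therefore says little about model stability; it is a direct consequence of scoring one site per fold.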
‘All-in-one’ test process
For test purposes, an ‘all-in-one’ process was implemented.
It took just ~1 hour to download and classify ~900 sites in a single run.
Data structures and sizes
• 1 site = a text file of 0.3 to 1+ megabytes
• The corpus totals 150-300 megabytes of text files
Text data size
• Training data: 269 sites x 28.017 words (70 Megabytes)
• Test data: 832 sites x 60.893 words (414 Megabytes)
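A back-of-the-envelope check of these sizes is sketched below. It assumes the TF-IDF matrix is stored densely with 8 bytes per cell (64-bit floats); the slides do not state the storage format, so this is only an assumption.

```python
BYTES_PER_CELL = 8  # assumption: dense matrix of 64-bit floats


def dense_size_mb(rows: int, cols: int,
                  bytes_per_cell: int = BYTES_PER_CELL) -> float:
    """Size of a dense rows x cols matrix in megabytes (10**6 bytes)."""
    return rows * cols * bytes_per_cell / 1e6


training_mb = dense_size_mb(269, 28017)   # sites x words, training data
test_mb = dense_size_mb(832, 60893)       # sites x words, test data

# Hypothetical extrapolation: 15.000 sites with the vocabulary kept at
# 60.893 words. A real vocabulary would probably keep growing.
full_run_mb = dense_size_mb(15000, 60893)

print(f"training: {training_mb:.1f} MB, test: {test_mb:.1f} MB, "
      f"full run: {full_run_mb:.0f} MB")
```

Under this assumption the estimates come out at roughly 60 MB and 405 MB, in the same range as the 70 and 414 Megabytes quoted above. The hypothetical full run would need more than 7 GB for a single matrix.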
TF-IDF matrices examples
Batching approach
Many attempts to classify thousands of sites at once were unsuccessful.
The reason: memory problems.
To overcome physical memory limitations, another approach was chosen:
batching. First we download the websites and divide them into batches of
reasonable size (empirically, 200-300 sites is small enough to fit all matrices
in memory). Every batch is downloaded into a separate directory and then
analyzed in a loop.
Thousands of websites
Download and save
Loop batches Classify every batch
RapidMiner insights
Batch
numbers
Batch size
15.000 sites = 50 batches x 300 files = 50 directories to loop classifier through
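The split into batches and per-batch directories can be sketched as below. The directory naming scheme (`batch_001`, `batch_002`, ...) and the function names are hypothetical; only the batch size of 300 and the total of 15.000 sites come from the slides.

```python
from pathlib import Path
from typing import Iterator, List, Sequence, Tuple


def make_batches(domains: Sequence[str],
                 batch_size: int = 300) -> Iterator[Tuple[int, List[str]]]:
    """Yield (batch number starting at 1, list of domains in that batch)."""
    for start in range(0, len(domains), batch_size):
        yield start // batch_size + 1, list(domains[start:start + batch_size])


def batch_dir(root: Path, number: int) -> Path:
    """Directory that holds the downloaded files of one batch."""
    return root / f"batch_{number:03d}"


if __name__ == "__main__":
    sites = [f"site{i}.example" for i in range(15000)]
    for number, batch in make_batches(sites):
        # Download `batch` into this directory, then classify it.
        print(batch_dir(Path("corpus"), number), len(batch))
```

Each directory is small enough for its TF-IDF matrix to fit in memory, so the classifier loops over 50 independent jobs instead of one oversized one.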
k-NN and thresholds
[Figure: an unknown site among labeled neighbors (adult, normal, normal, torrent, drugs, adult, normal, weapons)]
k-NN is simple and (applied to text analysis) provides a measure
of similarity of the text document to known categories.
k-NN and thresholds
site adult drugs replica weapons normal betting magic spy supplements torrent prediction
eroshop.ru 100% adult
putana78.com 100% adult
IntimCity.nl 81% 19% adult
kupialco.ru 100% drugs
mari-juana.net 100% drugs
kyritelnie-smesi.nl 81% 19% drugs
03market.ru 19% 0% 81% normal guys
1-ocenka.ru 20% 40% 40% normal guys
1c-interes.ru 60% 40% normal guys
1gb.ru 100% normal guys
100-z.ru 20% 20% 60% spy
100mile.ru 20% 80% spy
1belka.ru 20% 20% 40% 20% spy
100captains.ru 20% 80% torrent
1chef.ru 100% weapons
k=5 allows assigning significant confidence values to categories. Only
high confidences are taken into account (threshold >= 80%).
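The thresholding step can be sketched as follows, assuming that confidence is simply the share of the k=5 nearest neighbors carrying each label. With plain vote shares, confidences are multiples of 20%. Values such as 81% / 19% in the table suggest the votes may instead be weighted by similarity, which this sketch does not attempt to reproduce.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple

THRESHOLD = 0.80  # from the slide: only confidences >= 80% count


def vote_confidences(neighbor_labels: List[str]) -> Dict[str, float]:
    """Share of the k nearest neighbors that carry each label."""
    k = len(neighbor_labels)
    return {label: n / k for label, n in Counter(neighbor_labels).items()}


def confident_prediction(
    neighbor_labels: List[str],
    threshold: float = THRESHOLD,
) -> Optional[Tuple[str, float]]:
    """Return (label, confidence) if the top label reaches the threshold."""
    confidences = vote_confidences(neighbor_labels)
    label, value = max(confidences.items(), key=lambda item: item[1])
    return (label, value) if value >= threshold else None
```

For example, five neighbors labeled spy, spy, spy, spy, adult give 80% confidence for ‘spy’ and pass the threshold, while a 60% / 40% split is discarded.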
Finally, what’s in and out
On average, 8-10% of sites are assigned a high confidence (>= 80%) during
classification and are screened manually thereafter.
Big list of
domains
site prediction confidence
zdoroviak.ru supplements 100%
zscom.ru spy 100%
zwuk.ru spy 100%
zarekoy.ru hourhotel 100%
zastava-izhevsk.ru weapons 100%
zutera.ru replica 81%
zoombao.com replica 80%
zita-gita.ru adult 80%
zishop.ru replica 80%
zdorovoetelo100.ru supplements 80%
zen-shop.ru drugs 80%
Performance and accuracy
For 100 randomly picked real-life websites:
Downloading time: 7 minutes
Processing & classification time: 30 seconds (0.3 sec per site)
Cross-validation accuracy (without applying threshold): 89%
High risk sites classified as normal (False Negatives): 1
Correctly classified high risk sites: 12 out of 13 (92%)
Normal sites classified as high risk (False Positives): 0
What’s next
Server deployment
• Process scheduling
• Automated reports
Improved accuracy
• Incremental model updates with new data
• Using n-grams
Automated scanning
• No need to input data manually
• Automatic parsing of referral domains