How Many Folders Do You Really Need?
Classifying Email into a Handful of Categories
Date: 2015/07/08
Author: Mihajlo Grbovic, Guy Halawi, Zohar Karnin, Yoelle Maarek
Source: CIKM '14
Advisor: Jia-ling Koh
Speaker: LIN, CI-JIE
Introduction
Recently, automatic classification offering the same categories to all users has started to appear in some Web mail clients
Introduction
Today's commercial Web mail traffic is dominated by machine-generated messages (social networks, e-commerce sites, etc.)
Introduction
Goal
Automatically distinguishing between personal and machine-generated email
Classifying messages into latent categories, without requiring users to have defined any folder
Overflow
Email raw data
LDA cluster
Latent categories
Feature extraction
Aggregation
Training data generation
Test data
DISCOVERING LATENT CATEGORIES
Retrieved the most “popular” folders created by users, ignoring system folders (e.g., “trash”, “spam”)
Applied LDA to these folder documents in order to discover a set of latent topics
The latent topics would map into “latent categories”
LDA -> Latent categories
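As a rough illustration of the LDA step, the sketch below runs a small collapsed Gibbs sampler over toy "folder documents". It is not the setup used in the paper: the function name `lda_gibbs`, the hyperparameters `alpha` and `beta`, the toy folders and their tokens, and the choice of 2 topics are all assumptions made for this example.

```python
import random

def lda_gibbs(docs, num_topics, iterations=100, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA; docs is a list of token lists."""
    rng = random.Random(seed)
    vocab = sorted({word for doc in docs for word in doc})
    word_id = {word: i for i, word in enumerate(vocab)}
    vocab_size = len(vocab)
    doc_topic = [[0] * num_topics for _ in docs]                 # tokens per (doc, topic)
    topic_word = [[0] * vocab_size for _ in range(num_topics)]   # tokens per (topic, word)
    topic_total = [0] * num_topics                               # tokens per topic
    assignments = []
    # Random initial topic for every token.
    for d, doc in enumerate(docs):
        topics = [rng.randrange(num_topics) for _ in doc]
        assignments.append(topics)
        for word, k in zip(doc, topics):
            doc_topic[d][k] += 1
            topic_word[k][word_id[word]] += 1
            topic_total[k] += 1
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                w = word_id[word]
                old = assignments[d][i]
                # Remove the token, then resample its topic from the conditional.
                doc_topic[d][old] -= 1
                topic_word[old][w] -= 1
                topic_total[old] -= 1
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                new = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = new
                doc_topic[d][new] += 1
                topic_word[new][w] += 1
                topic_total[new] += 1
    # Smoothed per-document topic distribution.
    theta = [
        [(count + alpha) / (len(doc) + num_topics * alpha) for count in counts]
        for counts, doc in zip(doc_topic, docs)
    ]
    return theta, topic_word, vocab

# Hypothetical folders, each treated as one document of tokens.
folders = {
    "purchase": ["order", "receipt", "shipping", "order", "discount"],
    "ebay": ["bid", "order", "shipping", "auction"],
    "hotel": ["booking", "flight", "reservation", "booking"],
    "trips": ["flight", "itinerary", "reservation", "hotel"],
}
docs = list(folders.values())
theta, topic_word, vocab = lda_gibbs(docs, num_topics=2)
for name, distribution in zip(folders, theta):
    print(name, [round(p, 2) for p in distribution])
```

Each row of `theta` is a folder's distribution over topics; in the approach described on the slide, the discovered topics are what get mapped to latent categories.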
DISCOVERING LATENT CATEGORIES
The topics were obtained for K = 6, as this value exposed a good balance between total and individual coverage
The email traffic coverage at K = 6 was 70%
machine generated / human generated
Extracting Features
Content features
extract words from the subject line and message body
the subject character length, body character length
the number of URLs occurring in the body
Address features
features extracted from the sender email address
the subdomains (e.g. .edu, .gov) and subnames (e.g. billing, noreply)
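The content and address features listed above could be extracted roughly as follows. This is a minimal sketch: the function name `extract_features`, the tokenization regexes, the feature key names, and the sample message are all assumptions for illustration, not the paper's implementation.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+")
WORD_PATTERN = re.compile(r"[a-z0-9']+")

def extract_features(sender, subject, body):
    """Return content and address features for one message."""
    local_part, _, domain = sender.lower().partition("@")
    return {
        # Content features
        "subject_words": WORD_PATTERN.findall(subject.lower()),
        "body_words": WORD_PATTERN.findall(body.lower()),
        "subject_length": len(subject),
        "body_length": len(body),
        "url_count": len(URL_PATTERN.findall(body)),
        # Address features: pieces of the domain and of the local part
        "subdomains": domain.split("."),
        "subnames": [part for part in re.split(r"[._\-+]", local_part) if part],
    }

features = extract_features(
    "billing.noreply@shop.example.com",
    "Your receipt",
    "Thanks for your order. Track it at https://shop.example.com/orders/1",
)
print(features["subdomains"], features["subnames"], features["url_count"])
```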
Extracting Features
Behavioral features
weekly and monthly volumes of sent messages
volumes of messages sent as a reply
volumes of messages sent as a forward (with FW: in the subject line)
volume of the messages received by the sender
volume of the messages received as a reply
volume of the messages received as a forward
Extracting Features
Temporal behavior features
Record whether a sender sends more than X messages in an hour
X takes the values 10, 60, 80, 100, 120
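One way to compute these temporal flags for a single sender is sketched below. It assumes messages are bucketed by calendar hour (the slide does not say whether a fixed or sliding window is used); the function name `temporal_features` and the feature key names are hypothetical.

```python
from collections import Counter
from datetime import datetime

THRESHOLDS = (10, 60, 80, 100, 120)  # values of X from the slide

def temporal_features(timestamps):
    """Flag whether the sender ever exceeded X messages within one calendar hour."""
    per_hour = Counter(
        ts.replace(minute=0, second=0, microsecond=0) for ts in timestamps
    )
    peak = max(per_hour.values(), default=0)
    return {f"more_than_{x}_per_hour": peak > x for x in THRESHOLDS}

# Toy sender: 70 messages, all between 09:00 and 09:59 on the same day.
timestamps = [datetime(2015, 7, 8, 9, i % 60) for i in range(70)]
burst_flags = temporal_features(timestamps)
print(burst_flags)
```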
TRAINING DATA
consider 3 types of labeling techniques: manual, heuristic-based, automatic
6 latent categories: human, career, shopping, travel, finance, social
Heuristic labeling
Used this type of labeling mostly for differentiating between human and machine senders
Identify corporate machine senders such as “mailer-daemon” or “no-reply”
repeating occurrences of words such as “unsubscribe” in message headers
SMTP domain information
Identify human senders
<first name>.<lastname>@
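The heuristics above might be expressed as a few simple rules. The sketch below is only illustrative: the rule order, the `min_occurrences` parameter, the exact patterns, and the function name `heuristic_label` are assumptions, and the SMTP domain check is left out because the slide gives no detail on it.

```python
import re

MACHINE_LOCAL_PARTS = ("mailer-daemon", "no-reply", "noreply")
# Matches addresses shaped like <first name>.<lastname>@...
HUMAN_ADDRESS = re.compile(r"^[a-z]+\.[a-z]+@")

def heuristic_label(sender, headers, min_occurrences=2):
    """Return "machine", "human", or None when no rule fires."""
    sender = sender.lower()
    local_part = sender.split("@", 1)[0]
    if any(name in local_part for name in MACHINE_LOCAL_PARTS):
        return "machine"
    # "Repeating occurrences" is read here as at least min_occurrences hits.
    if headers.lower().count("unsubscribe") >= min_occurrences:
        return "machine"
    if HUMAN_ADDRESS.match(sender):
        return "human"
    return None

print(heuristic_label("mailer-daemon@example.com", ""))
print(heuristic_label("jane.doe@example.com", ""))
```

Senders for which no rule fires stay unlabeled and would need one of the other labeling techniques.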
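Automatic labeling
Folder-based majority voting
Messages per folder: purchase: 55, ebay: 4, credit cards: 1, hotel: 6
Folder categories: purchase and ebay -> Shopping, credit cards -> Finance, hotel -> Travel
Votes: Shopping 55 + 4 = 59, Finance 1, Travel 6
Category: Shopping, since 59 > 50 (threshold) and num of folders > 1 (threshold)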
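The majority-voting example can be reproduced with a short sketch. The function name `majority_vote` and the parameter names are hypothetical, and the folder threshold is interpreted here as the number of folders voting for the winning category, which the slide does not spell out.

```python
from collections import Counter

def majority_vote(folder_counts, folder_category, min_votes=50, min_folders=1):
    """Label a sender from the folders its messages were filed into."""
    votes = Counter()
    folders_per_category = Counter()
    for folder, count in folder_counts.items():
        category = folder_category.get(folder)
        if category is None:
            continue  # folder has no known category
        votes[category] += count
        folders_per_category[category] += 1
    if not votes:
        return None
    category, top_votes = votes.most_common(1)[0]
    # Both thresholds must be exceeded, as in the slide's example.
    if top_votes > min_votes and folders_per_category[category] > min_folders:
        return category
    return None

folder_counts = {"purchase": 55, "ebay": 4, "credit cards": 1, "hotel": 6}
folder_category = {
    "purchase": "Shopping",
    "ebay": "Shopping",
    "credit cards": "Finance",
    "hotel": "Travel",
}
label = majority_vote(folder_counts, folder_category)
print(label)
```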
Automatic labeling
Folder-based LDA voting
purchase: Shopping 70%, Finance 20%
ebay: Shopping 60%, Social 10%
credit cards: Finance 90%, Shopping 10%
hotel: Travel 74%, Finance 15%
Scores: Shopping 0.7 + 0.6 + 0.1 = 1.4, Finance 0.2 + 0.9 + 0.15 = 1.25, Travel 0.74, Social 0.1
Category: Shopping
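The LDA-voting arithmetic in this example amounts to summing each folder's topic probabilities per category and taking the largest total. A minimal sketch follows; the function name `lda_vote` is hypothetical, and the sums are unweighted as on the slide (whether folders are weighted by message count is not stated).

```python
from collections import defaultdict

def lda_vote(folder_topic_probs):
    """Sum per-folder category probabilities and return the winner and the scores."""
    scores = defaultdict(float)
    for probs in folder_topic_probs.values():
        for category, probability in probs.items():
            scores[category] += probability
    winner = max(scores, key=scores.get)
    return winner, dict(scores)

folder_topic_probs = {
    "purchase": {"Shopping": 0.70, "Finance": 0.20},
    "ebay": {"Shopping": 0.60, "Social": 0.10},
    "credit cards": {"Finance": 0.90, "Shopping": 0.10},
    "hotel": {"Travel": 0.74, "Finance": 0.15},
}
winner, scores = lda_vote(folder_topic_probs)
print(winner, {category: round(total, 2) for category, total in scores.items()})
```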
CLASSIFICATION MECHANISM
Online lightweight classification
consists of hard-coded rules designed to classify quickly
finding the top 100 senders that cover a significant percentage of the total traffic and are category consistent
categorizing all reply/forward messages as human
CLASSIFICATION MECHANISM
Online sender-based classification
looking for the sender in a lookup table containing senders with known categories
sender | category
(sender not shown) | shopping
(sender not shown) | travel
[email protected] | shopping
[email protected] | finance
lookup -> shopping
CLASSIFICATION MECHANISM
Offline creation of classified senders table
use the training set to train a logistic regression model
train a separate model for each category in a one-vs-all manner
the classification process is run periodically to account for new senders
[Diagram: a new email is scored by the logistic regression models for human, shopping, finance, travel, social and career; the resulting category (finance in the example) is recorded in the sender/category table.]
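To make the one-vs-all idea concrete, the sketch below trains one small logistic regression model per category with plain stochastic gradient descent and assigns the category whose model gives the highest probability. The three toy features, the training rows, the learning rate, and the epoch count are all invented for illustration; the paper's aggregated sender features and training setup are not reproduced here.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_binary(rows, labels, epochs=300, learning_rate=0.5):
    """Fit one logistic regression model with stochastic gradient descent."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in zip(rows, labels):
            prediction = sigmoid(sum(w * v for w, v in zip(weights, features)) + bias)
            error = prediction - label
            weights = [w - learning_rate * error * v for w, v in zip(weights, features)]
            bias -= learning_rate * error
    return weights, bias

def train_one_vs_all(rows, categories):
    """Train a separate binary model for each category against all others."""
    return {
        category: train_binary(rows, [1 if c == category else 0 for c in categories])
        for category in sorted(set(categories))
    }

def classify(models, features):
    def probability(category):
        weights, bias = models[category]
        return sigmoid(sum(w * v for w, v in zip(weights, features)) + bias)
    return max(models, key=probability)

# Hypothetical sender features, e.g. share of messages with shopping,
# reply-like, and finance vocabulary.
rows = [
    [1.0, 0.0, 0.0], [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0], [0.1, 0.9, 0.0],
    [0.0, 0.0, 1.0], [0.0, 0.1, 0.9],
]
categories = ["shopping", "shopping", "human", "human", "finance", "finance"]
models = train_one_vs_all(rows, categories)
print(classify(models, [0.8, 0.1, 0.1]))
```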
CLASSIFICATION MECHANISM
Online heavy-weight classification
email messages whose sender does not appear in the classified senders table are sent to a heavy-weight message-based classifier
uses all relevant features, pertaining to the message body, subject line and sender name
employs a logistic regression classifier
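The three online stages can be read as a cascade: cheap rules first, then the sender lookup table, then the message-level classifier. The sketch below shows that control flow only. The table contents, the addresses, the subject prefixes, and the stand-in `heavy_weight_classifier` are hypothetical placeholders, not the production rules or the trained model.

```python
# Hypothetical tables; real ones would be built offline from traffic data.
TOP_SENDERS = {"deals@shop.example.com": "shopping"}
SENDER_TABLE = {"alerts@bank.example.com": "finance"}

def heavy_weight_classifier(message):
    # Stand-in for the message-level logistic regression models.
    return "social" if "friend request" in message["body"].lower() else "human"

def classify_message(message):
    """Return (category, stage) using the cheapest stage that can decide."""
    subject = message["subject"].lower()
    sender = message["sender"].lower()
    # Stage 1: lightweight hard-coded rules.
    if subject.startswith(("re:", "fw:", "fwd:")):
        return "human", "lightweight"
    if sender in TOP_SENDERS:
        return TOP_SENDERS[sender], "lightweight"
    # Stage 2: lookup in the classified senders table.
    if sender in SENDER_TABLE:
        return SENDER_TABLE[sender], "sender-table"
    # Stage 3: heavy-weight message-based classification.
    return heavy_weight_classifier(message), "heavy-weight"

message = {
    "sender": "alerts@bank.example.com",
    "subject": "Your statement is ready",
    "body": "Log in to view your monthly statement.",
}
print(classify_message(message))
```

Ordering the stages from cheapest to most expensive means the heavy-weight classifier only runs for messages the earlier stages cannot decide.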
CLASSIFICATION MECHANISM
Offline training the message-level classifier
a logistic regression model is trained for each category in a one-vs-all manner
the training process is quite similar to that of the sender classifier, except that the training set contains messages rather than senders
Experiment
Experimental evaluation was performed on more than 500 billion messages received during a period of six months by users of the Yahoo Mail service
Experiment
AUC (one-vs-rest classification)
Performance on different feature subsets
content features (email body, subject, etc.)
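For reference, one-vs-rest AUC for a single category can be computed as the fraction of (positive, negative) pairs that the classifier ranks correctly, with ties counted as half. The sketch below uses made-up labels and scores; it does not reproduce any result from the paper, and the function name `auc` is just a local helper.

```python
def auc(labels, scores):
    """Pairwise AUC: labels are 1 for the target category, 0 for the rest."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Toy example: is each message "shopping" (1) or any other category (0)?
labels = [1, 0, 1, 0, 1]
scores = [0.9, 0.8, 0.7, 0.3, 0.6]
print(round(auc(labels, scores), 3))
```

Repeating this per category, and per feature subset, gives the kind of comparison the slide refers to.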
CONCLUSION
Presented a Web-scale categorization approach
offline learning
online classification
Discovered latent categories
Categories cover more than 70% of both email traffic and email search queries