How Many Folders Do You Really Need? Classifying Email into a Handful of Categories
Date: 2015/07/08
Authors: Mihajlo Grbovic, Guy Halawi, Zohar Karnin, Yoelle Maarek
Source: CIKM '14
Advisor: Jia-ling Koh
Speaker: LIN, CI-JIE
Transcript

Page 2: Outline

Introduction
Method
Experiment
Conclusion

Page 3: Outline

Introduction
Method
Experiment
Conclusion

Page 4: Introduction

Email classification is still a mostly manual task.

Page 5: Introduction

Automatic classification that offers the same categories to all users has recently started to appear in some Web mail clients.

Page 6: Introduction

Today's commercial Web mail traffic is dominated by machine-generated messages from social networks, e-commerce sites, etc.

Page 7: Introduction

Goal
  Automatically distinguish between personal and machine-generated email
  Classify messages into latent categories, without requiring users to have defined any folders

Page 8: Outline

Introduction
Method
Experiment
Conclusion

Page 9: Overall flow

[Pipeline diagram: email raw data, LDA cluster, latent categories, feature extraction, aggregation, training data generation, test data]

Page 10: Overall flow

[Pipeline diagram: email raw data, LDA cluster, latent categories, feature extraction, aggregation, training data generation, test data]

Page 11: DISCOVERING LATENT CATEGORIES

Retrieved the most “popular” folders created by users
  Ignored system folders (e.g., “trash”, “spam”)

Applied LDA to these document folders in order to discover a set of latent topics
  The latent topics map to “latent categories”

[Figure: LDA, latent categories]

Page 12: DISCOVERING LATENT CATEGORIES

The topics shown were obtained for K = 6, as this value gave a good balance between total and individual coverage.
The email traffic coverage at K = 6 was 70%.

[Figure labels: machine generated, human generated]

Page 13: Overall flow

[Pipeline diagram: email raw data, LDA cluster, latent categories, feature extraction, aggregation, training data generation, test data]

Page 14: Extracting Features

Content features
  Words extracted from the subject line and message body
  Subject character length and body character length
  Number of URLs occurring in the body

Address features
  Features extracted from the sender email address
  Subdomains (e.g., .edu, .gov) and subnames (e.g., billing, noreply)
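
The content and address features above could be extracted roughly as follows. This is a minimal sketch, not the authors' implementation: the function names, the tokenization rules, and the example addresses are assumptions made for illustration.

```python
import re

URL_RE = re.compile(r"https?://\S+")   # assumed URL pattern
WORD_RE = re.compile(r"[a-z0-9']+")    # assumed word tokenizer


def content_features(subject, body):
    """Words, character lengths, and URL count for one message."""
    words = WORD_RE.findall((subject + " " + body).lower())
    return {
        "words": sorted(set(words)),
        "subject_len": len(subject),
        "body_len": len(body),
        "num_urls": len(URL_RE.findall(body)),
    }


def address_features(sender):
    """Subnames (local part) and subdomains (domain part) of a sender address."""
    local, _, domain = sender.lower().partition("@")
    # Split the local part on '.', '_' and '+'; hyphens are kept so that
    # a token such as "no-reply" stays whole.
    subnames = [t for t in re.split(r"[._+]", local) if t]
    subdomains = [t for t in domain.split(".") if t]
    return {"subnames": subnames, "subdomains": subdomains}


# Hypothetical example message and sender
example_content = content_features(
    "Your order",
    "Track it at http://example.com/a and https://example.com/b",
)
example_address = address_features("billing.noreply@mail.example.edu")
```

In this sketch the word list is a plain set of tokens; a real system would more likely map words to a sparse feature vector.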

Page 15: Extracting Features

Behavioral features
  Weekly and monthly volumes of sent messages
  Volume of messages sent as a reply
  Volume of messages sent as a forward (with “FW:” in the subject line)
  Volume of messages received by the sender
  Volume of messages received as a reply
  Volume of messages received as a forward
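
As an illustration of the sent-side volumes, the sketch below counts messages per sender by ISO week and by month, and counts replies and forwards from the subject prefix. The record layout, the function name, and the "RE:" and "FWD:" prefixes are assumptions; the slides only mention "FW:".

```python
from collections import Counter
from datetime import datetime


def sent_volume_features(messages):
    """messages: iterable of (sender, sent_at: datetime, subject) records."""
    feats = {}
    for sender, sent_at, subject in messages:
        f = feats.setdefault(
            sender,
            {"weekly": Counter(), "monthly": Counter(), "replies": 0, "forwards": 0},
        )
        iso_year, iso_week, _ = sent_at.isocalendar()
        f["weekly"][(iso_year, iso_week)] += 1
        f["monthly"][(sent_at.year, sent_at.month)] += 1
        prefix = subject.strip().lower()
        if prefix.startswith("re:"):                 # assumed reply marker
            f["replies"] += 1
        elif prefix.startswith(("fw:", "fwd:")):     # "FW:" is from the slide
            f["forwards"] += 1
    return feats


# Hypothetical log of sent messages
example_log = [
    ("jane.doe@example.com", datetime(2014, 11, 3, 9, 0), "Lunch?"),
    ("jane.doe@example.com", datetime(2014, 11, 4, 10, 0), "RE: Lunch?"),
    ("jane.doe@example.com", datetime(2014, 11, 12, 8, 30), "FW: Agenda"),
]
example_volumes = sent_volume_features(example_log)
```

The received-side volumes could be computed the same way from the messages addressed to each sender.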

Page 16: Extracting Features

Temporal behavior features
  Record whether a sender sends more than X messages in an hour
  X takes the values 10, 60, 80, 100, 120
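
A minimal sketch of these binary temporal features, assuming that "an hour" means a fixed clock-hour bucket (a sliding window would be an equally plausible reading). The function name and the example burst are hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta

THRESHOLDS = (10, 60, 80, 100, 120)  # values of X from the slide


def temporal_features(timestamps, thresholds=THRESHOLDS):
    """Map each X to True if the sender exceeded X messages in any clock hour."""
    per_hour = Counter(
        ts.replace(minute=0, second=0, microsecond=0) for ts in timestamps
    )
    peak = max(per_hour.values(), default=0)
    return {x: peak > x for x in thresholds}


# Hypothetical burst: 61 messages sent within the same hour
start = datetime(2014, 11, 3, 9, 0, 0)
burst = [start + timedelta(seconds=i) for i in range(61)]
burst_features = temporal_features(burst)
```

For the burst above, only the features for X = 10 and X = 60 are set.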

Page 17: Overall flow

[Pipeline diagram: email raw data, LDA cluster, latent categories, feature extraction, aggregation, training data generation, test data]

Page 18: Aggregation

[Figure label: Financial]

Page 19: Overall flow

[Pipeline diagram: email raw data, LDA cluster, latent categories, feature extraction, aggregation, training data generation, test data]

Page 20: TRAINING DATA

Consider 3 types of labeling techniques
  manual
  heuristic-based
  automatic

6 latent categories
  human
  career
  shopping
  travel
  finance
  social

Page 21: Manual labeling

Human editors assign labels to specific examples

Page 22: Heuristic labeling

Used this type of labeling mostly for differentiating between human and machine senders

Identify corporate machine senders
  Senders such as “mailer-daemon” or “no-reply”
  Repeated occurrences of words such as “unsubscribe” in message headers
  SMTP domain information

Identify human senders
  Addresses of the form <first name>.<lastname>@
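
The heuristics above can be sketched as a small rule function. The rule order, the `min_unsubscribe` parameter, the extra "noreply" spelling, and the example addresses are assumptions; the SMTP domain check is left out because the slides give no detail on it.

```python
import re

MACHINE_TOKENS = ("mailer-daemon", "no-reply", "noreply")  # "noreply" is an added variant
HUMAN_RE = re.compile(r"^[a-z]+\.[a-z]+@")                 # <first name>.<lastname>@


def heuristic_label(sender, headers="", min_unsubscribe=2):
    """Return "machine", "human", or None when no heuristic fires."""
    address = sender.lower()
    local = address.partition("@")[0]
    if any(token in local for token in MACHINE_TOKENS):
        return "machine"
    # "Repeated occurrences" is read here as at least min_unsubscribe hits.
    if headers.lower().count("unsubscribe") >= min_unsubscribe:
        return "machine"
    if HUMAN_RE.match(address):
        return "human"
    return None
```

Senders for which no rule fires stay unlabeled and would have to be covered by manual or automatic labeling.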

Page 23: Automatic labeling

Folder-based majority voting

Example for the sender [email protected]:

Folder        Messages  Category
purchase      55        shopping
ebay          4         shopping
credit cards  1         finance
hotel         6         travel

Votes per category: shopping 55 + 4 = 59, finance 1, travel 6

Category: shopping, since 59 > 50 (vote threshold) and the number of folders > 1 (folder threshold)
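
The majority vote in this example can be reproduced with a short sketch. The function name and the reading of the folder threshold (counting only folders that vote for the winning category) are assumptions; the thresholds 50 and 1 and the folder counts come from the slide.

```python
from collections import Counter

# Folder-to-category mapping taken from the slide example
FOLDER_TO_CATEGORY = {
    "purchase": "shopping",
    "ebay": "shopping",
    "credit cards": "finance",
    "hotel": "travel",
}


def majority_vote(folder_counts, folder_to_category, min_votes=50, min_folders=1):
    """Label a sender by the category that collects the most folder votes."""
    votes = Counter()
    folders = Counter()
    for folder, n_messages in folder_counts.items():
        category = folder_to_category.get(folder)
        if category is None:
            continue  # folder not mapped to any latent category
        votes[category] += n_messages
        folders[category] += 1
    if not votes:
        return None
    category, n_votes = votes.most_common(1)[0]
    if n_votes > min_votes and folders[category] > min_folders:
        return category
    return None  # not enough evidence to label this sender


sender_folders = {"purchase": 55, "ebay": 4, "credit cards": 1, "hotel": 6}
label = majority_vote(sender_folders, FOLDER_TO_CATEGORY)
```

Here `label` is "shopping": 59 votes exceed the threshold of 50, and two folders exceed the folder threshold of 1.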

Page 24: Automatic labeling

Folder-based LDA voting

Example for the sender [email protected]:

Folder        Topic distribution
purchase      Shopping 70%, Finance 20%
ebay          Shopping 60%, Social 10%
credit cards  Finance 90%, Shopping 10%
hotel         Travel 74%, Finance 15%

Votes per category:
  Shopping: 0.7 + 0.6 + 0.1 = 1.4
  Finance: 0.2 + 0.9 + 0.15 = 1.25
  Travel: 0.74
  Social: 0.1

Category: Shopping
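
A minimal sketch of the LDA vote, assuming that each folder contributes its topic probabilities with equal weight, as the sums on the slide suggest. The function name is hypothetical; the probabilities are the ones from the example.

```python
from collections import defaultdict


def lda_vote(folder_topics):
    """Sum per-folder topic probabilities and return (winner, totals)."""
    totals = defaultdict(float)
    for distribution in folder_topics.values():
        for category, probability in distribution.items():
            totals[category] += probability
    winner = max(totals, key=totals.get)
    return winner, dict(totals)


sender_folder_topics = {
    "purchase": {"shopping": 0.70, "finance": 0.20},
    "ebay": {"shopping": 0.60, "social": 0.10},
    "credit cards": {"finance": 0.90, "shopping": 0.10},
    "hotel": {"travel": 0.74, "finance": 0.15},
}
winner, totals = lda_vote(sender_folder_topics)
```

Unlike majority voting, a folder here can split its vote across several categories, which is why finance comes close to shopping in this example.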

Page 25: Overall flow

[Pipeline diagram: email raw data, LDA cluster, latent categories, feature extraction, aggregation, training data generation, test data]

Page 26: CLASSIFICATION MECHANISM

Page 27: CLASSIFICATION MECHANISM

Online lightweight classification
  Consists of hard-coded rules designed to classify quickly
  Finds the top 100 senders that cover a significant percentage of the total traffic and are category consistent
  Categorizes all reply/forward messages as human

Page 28: CLASSIFICATION MECHANISM

Online sender-based classification
  Look for the sender in a lookup table containing senders with known categories

Classified senders table:

sender            category
[email protected]  shopping
[email protected]  travel
[email protected]  shopping
[email protected]  finance

Example: the sender [email protected] is looked up in the table and the message is categorized as shopping

Page 29: CLASSIFICATION MECHANISM

Offline creation of classified senders table
  Use the training set to train a logistic regression model
  Train a separate model for each category in a one-vs-all manner
  The classification process is run periodically to account for new senders

[Figure: a new email sender is scored by the logistic regression models for human, shopping, finance, travel, social, and career; the resulting sender/category pair (e.g., new email: finance) is added to the table]
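
One-vs-all training as described on this page can be sketched in plain Python. This is an illustrative toy, not the authors' setup: the batch gradient descent, the hyperparameters, and the three-feature example data are all assumptions.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_binary(X, y, epochs=500, lr=0.5):
    """Fit one logistic regression model with batch gradient descent."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


def train_one_vs_all(X, labels):
    """Train one binary model per category: that category vs. all others."""
    return {
        c: train_binary(X, [1.0 if label == c else 0.0 for label in labels])
        for c in sorted(set(labels))
    }


def predict(models, x):
    """Return the category whose model gives the highest probability."""
    def score(category):
        w, b = models[category]
        return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
    return max(models, key=score)


# Hypothetical sender feature vectors and labels
X_train = [[1, 0, 0], [0.9, 0.1, 0], [0, 1, 0], [0.1, 0.9, 0], [0, 0, 1], [0, 0.1, 0.9]]
y_train = ["shopping", "shopping", "finance", "finance", "travel", "travel"]
models = train_one_vs_all(X_train, y_train)
```

A sender scored this way would be written into the classified senders table with the category returned by `predict`.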

Page 30: CLASSIFICATION MECHANISM

Online heavy-weight classification
  Email messages whose sender does not appear in the classified senders table are sent to a heavy-weight, message-based classifier
  Uses all relevant features pertaining to the message body, subject line, and sender name
  Employs a logistic regression classifier
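
Taken together, the three online stages form a cascade, which might look like the sketch below. The message layout, the order of the lightweight rules, the "RE:" and "FWD:" prefixes, and the example tables are assumptions; the heavy-weight classifier is passed in as a function and is stubbed in the example.

```python
def classify(message, top_senders, sender_table, heavy_classifier):
    """Return (category, stage) for one message, trying cheap stages first."""
    sender = message["sender"]
    subject = message.get("subject", "").strip().lower()

    # Stage 1: lightweight hard-coded rules
    if subject.startswith(("re:", "fw:", "fwd:")):
        return "human", "lightweight"
    if sender in top_senders:
        return top_senders[sender], "lightweight"

    # Stage 2: lookup in the offline-built classified senders table
    if sender in sender_table:
        return sender_table[sender], "sender"

    # Stage 3: heavy-weight message-based classifier for unknown senders
    return heavy_classifier(message), "heavyweight"


# Hypothetical tables and a stub standing in for the message-level model
top_senders = {"deals@shop.example": "shopping"}
sender_table = {"alerts@bank.example": "finance"}


def stub_heavy(message):
    return "social"
```

For example, `classify({"sender": "alerts@bank.example", "subject": "Statement"}, top_senders, sender_table, stub_heavy)` is resolved by the sender table without running the heavy-weight model.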

Page 31: CLASSIFICATION MECHANISM

Offline training of the message-level classifier
  A logistic regression model is trained for each category in a one-vs-all manner
  The training process is quite similar to that of the sender classifier, but the training set is different, as it contains messages rather than senders

Page 32: Outline

Introduction
Method
Experiment
Conclusion

Page 33: Experiment

Experimental evaluation was performed on more than 500 billion messages received during a period of six months by users of the Yahoo mail service.

Page 34: Experiment

Page 35: Experiment

AUC (one-vs-rest classification)
Performance on different feature subsets
  content features (email body, subject, etc.)
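
For reference, one-vs-rest AUC can be computed per category as the fraction of (positive, negative) pairs that the classifier ranks correctly, with ties counted as half. The sketch below uses the quadratic pairwise form for clarity; the function names and the toy scores are assumptions and do not come from the experiments.

```python
def auc(scores, is_positive):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for s, p in zip(scores, is_positive) if p]
    neg = [s for s, p in zip(scores, is_positive) if not p]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def one_vs_rest_auc(scores_by_category, true_labels):
    """AUC per category, treating that category as positive and the rest as negative."""
    return {
        category: auc(scores, [label == category for label in true_labels])
        for category, scores in scores_by_category.items()
    }


# Hypothetical scores from a "shopping" model on four messages
toy_labels = ["shopping", "shopping", "travel", "finance"]
toy_auc = one_vs_rest_auc({"shopping": [0.9, 0.2, 0.8, 0.1]}, toy_labels)
```

In the toy case three of the four positive/negative pairs are ordered correctly, so the shopping AUC is 0.75.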

Page 36: Experiment

Page 37: Outline

Introduction
Method
Experiment
Conclusion

Page 38: CONCLUSION

Presented a Web-scale categorization approach
  offline learning
  online classification

Discovered latent categories
  The categories cover more than 70% of both email traffic and email search queries

Page 39: Thanks for listening