Understanding email traffic David Graus, University of Amsterdam [email protected] @dvdgrs
Page 1: Understanding Email Traffic

Understanding email traffic
David Graus, University of Amsterdam [email protected] @dvdgrs

Page 2: Understanding Email Traffic

Dec. 12, 2014 - Frontiers of Forensic Science 2

Some background…

• PhD candidate at ILPS
  • Information Extraction & Retrieval

• Project in NWO’s Forensic Science program
  • Semantic Search in E-Discovery

Page 5: Understanding Email Traffic

Information Retrieval?

• Finding material of an unstructured nature in large collections

Page 6: Understanding Email Traffic

Information Extraction?

• Text mining
• Discovering patterns in text data

Page 7: Understanding Email Traffic

Semantic Search in E-Discovery?

Page 8: Understanding Email Traffic

Semantic Search?

Page 9: Understanding Email Traffic

E-Discovery?

• Retrieving and securing digital forensic evidence

Page 10: Understanding Email Traffic

E-Discovery

• Semantic Search in E-Discovery

Page 11: Understanding Email Traffic

Semantic Search in E-Discovery

• Supporting search for digital forensic evidence
  • from emails, hard drives, mobile phones, etc.
  • not the open web
  • (Google won’t help us here)

Page 12: Understanding Email Traffic

Search in E-Discovery

• Finding out who knew what, from whom, and when
• We don’t know what we’re looking for
• What we’re looking for might be deliberately hidden
• Communication might be very domain-specific, contextualized or incomplete

Page 13: Understanding Email Traffic

Approach

• Generic search is not the answer
  • Google: high-precision search
  • E-Discovery: high-recall & exploratory search

Page 14: Understanding Email Traffic

Tasks

• Support iterative search
• Support (re)formulating questions and hypotheses
• Retrieve all relevant traces

Page 17: Understanding Email Traffic

Recipient recommendation

• Given a sender, an email, and all possible recipients (in an enterprise)
• Predict which recipient(s) are most likely to receive the email

Page 18: Understanding Email Traffic

Why?

• Understanding communication in, and the structure of, an enterprise
• Finding “unexpected” communication
• Applications in:
  • enterprise search
  • expert finding
  • community detection
  • spam classification
  • anomaly detection

Page 19: Understanding Email Traffic

How?

• Gmail
  • Who do you frequently “co-address”?
  • Ego network

• Related work
  • Social Network Analysis (SNA)
  • Email content

• Us
  • SNA + email content

Page 20: Understanding Email Traffic

Part 1: Social Network Analysis?

[Figure: three email addresses, redacted in this transcript]

Page 21: Understanding Email Traffic

image by Calvinius - Creative Commons Attribution-Share Alike 3.0

Page 22: Understanding Email Traffic

SNA for predicting recipients?

1. Importance of a node in the network
   Prior probability: more important people are more likely to be recipients of an(y) email

2. Connection strength between two nodes
   Conditional probability: given the sender, recipients who are strongly associated with that sender are more likely to receive the email

Page 23: Understanding Email Traffic

Part 2: Email content

• Statistical Language Models (LMs)
  • Assign a probability to [a sequence of] words
  • By counting words

• Used in lots of places:
  • Web Search
  • Machine Translation
  • Speech Recognition
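
As a minimal sketch of "assigning a probability to words by counting words", the following builds a maximum-likelihood unigram model. The function names and the toy emails are assumptions made for illustration; the talk does not say how its models were estimated or smoothed.

```python
from collections import Counter


def unigram_lm(texts):
    """Maximum-likelihood unigram model: P(word) = count(word) / total words."""
    counts = Counter(word for text in texts for word in text.lower().split())
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def sequence_probability(lm, text):
    """Probability of a word sequence as the product of its word probabilities."""
    prob = 1.0
    for word in text.lower().split():
        prob *= lm.get(word, 0.0)  # unseen words get probability 0 (no smoothing)
    return prob


# Toy emails, made up for this example.
emails = ["meeting at noon", "budget meeting moved", "lunch at noon"]
lm = unigram_lm(emails)
```

Without smoothing, a single unseen word drives the sequence probability to zero, which is one reason to mix in a background model estimated on the whole corpus.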

Page 29: Understanding Email Traffic

Language Models

• Language models as communication “profiles”
  1. Incoming LM (how people talk to user)
  2. Outgoing LM (how user talks to people)
  3. Interpersonal LM (how node1 talks with node2)
  4. Corpus LM (how everyone talks)

Page 30: Understanding Email Traffic

Why language models?

• Comparisons between communication profiles:
  • Find nodes with most similar communication
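
One way to compare two communication profiles is a divergence between their word distributions. The sketch below uses Jensen-Shannon divergence, which is an assumption: the slides do not name the similarity measure, and the profile names and probabilities are invented.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two word distributions (dicts)."""
    vocab = set(p) | set(q)
    m = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in vocab}

    def kl(a):
        # KL(a || m); words with a(w) == 0 contribute nothing
        return sum(a[w] * math.log2(a[w] / m[w]) for w in a if a[w] > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)


def most_similar(target, profiles):
    """Profile names sorted from most to least similar to `target`."""
    return sorted(profiles, key=lambda name: js_divergence(target, profiles[name]))


# Hypothetical profiles: word -> probability.
profiles = {
    "alice": {"budget": 0.5, "meeting": 0.5},
    "bob": {"lunch": 0.5, "football": 0.5},
}
target = {"budget": 0.4, "meeting": 0.4, "lunch": 0.2}
ranking = most_similar(target, profiles)
```

The divergence is 0 for identical profiles and 1 for profiles that share no words, so lower values mean more similar communication.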

Page 31: Understanding Email Traffic

Model

• Given sender and email, predict recipients
• Ranking function:

Page 32: Understanding Email Traffic

• Email likelihood: estimated using language modeling
• Sender likelihood: using SNA to estimate closeness of R and S
• Recipient likelihood: using SNA to estimate importance of R
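
The formula itself did not survive in this transcript. A reconstruction consistent with the three components named here, assuming they are combined by Bayes' rule (with E the email, S the sender, R a candidate recipient), would be:

```latex
P(R \mid S, E) \;\propto\; P(E \mid R, S)\, P(S \mid R)\, P(R)
```

The normalizer P(S, E) is the same for every candidate R, so it can be dropped when ranking.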

Page 34: Understanding Email Traffic

Email likelihood

• P(word|R,S): interpersonal LM
• P(word|R): recipient LM
• P(word): corpus LM
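
A sketch of how the three word probabilities might be combined into an email likelihood, assuming a linear mixture per word. The mixture form, the function name, and the toy models are assumptions; the default weights borrow the 60%-20%-20% figure reported later in the talk, read here as interpolation weights.

```python
import math


def email_log_likelihood(words, interpersonal, recipient, corpus,
                         weights=(0.6, 0.2, 0.2)):
    """Log-probability of an email under a linear mixture of three unigram LMs."""
    w_rs, w_r, w_c = weights
    total = 0.0
    for word in words:
        p = (w_rs * interpersonal.get(word, 0.0)   # P(word|R,S)
             + w_r * recipient.get(word, 0.0)      # P(word|R)
             + w_c * corpus.get(word, 0.0))        # P(word)
        if p == 0.0:
            return float("-inf")  # word unseen even in the corpus model
        total += math.log(p)
    return total


# Toy models, made up for this example.
interpersonal = {"budget": 0.5, "report": 0.5}
recipient = {"budget": 0.25, "report": 0.25, "lunch": 0.5}
corpus = {"budget": 0.1, "report": 0.1, "lunch": 0.2, "hello": 0.6}

score = email_log_likelihood(["budget", "lunch"], interpersonal, recipient, corpus)
```

Here "lunch" never occurs between this sender and recipient, yet it still gets non-zero probability from the recipient and corpus models, which is the point of mixing them in.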

Page 35: Understanding Email Traffic

Sender Likelihood P(S|R)

Strength of connection between two nodes
1. Number of emails sent between nodes
2. Number of times two nodes are addressed together

Recipient Likelihood P(R)

Importance of node
1. Number of emails received
2. PageRank score
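
The four signals above can be computed from a plain list of emails. The sketch below is illustrative only: the data are invented, and the PageRank is a simple power iteration with a conventional damping factor of 0.85, not necessarily the variant used in the talk.

```python
from collections import Counter
from itertools import combinations

# Toy data: (sender, [recipients]); all names are made up.
emails = [
    ("ann", ["bob", "cat"]),
    ("ann", ["bob"]),
    ("bob", ["cat"]),
    ("cat", ["ann", "bob"]),
]

received = Counter()      # importance: number of emails received
sent_between = Counter()  # strength: emails sent between two nodes
co_addressed = Counter()  # strength: times two nodes are addressed together

for sender, recipients in emails:
    for r in recipients:
        received[r] += 1
        sent_between[frozenset((sender, r))] += 1
    for a, b in combinations(sorted(set(recipients)), 2):
        co_addressed[frozenset((a, b))] += 1


def pagerank(emails, damping=0.85, iterations=50):
    """Power-iteration PageRank on the directed sender -> recipient graph."""
    nodes = sorted({s for s, _ in emails} | {r for _, rs in emails for r in rs})
    out_links = {n: [] for n in nodes}
    for sender, recipients in emails:
        out_links[sender].extend(recipients)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for n in nodes:
            targets = out_links[n] or nodes  # dangling nodes spread rank evenly
            share = damping * rank[n] / len(targets)
            for t in targets:
                new_rank[t] += share
        rank = new_rank
    return rank


ranks = pagerank(emails)
```

Turning these raw counts and scores into P(R) and P(S|R) still requires normalization, which the slides do not spell out.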

Page 36: Understanding Email Traffic

SNA

1. Importance of a node in the network

2. Strength of connection between nodes

Email Content

1. Interpersonal LM
2. Recipient LM
3. Corpus LM

Page 37: Understanding Email Traffic

Approach: time-based

Timeline:
1. Training period: build models (SNA + LM)
2. Testing period: predict recipients
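
A time-based split can be sketched as follows. The record layout, the dates, and the cutoff are hypothetical; the talk does not state where the boundary between the two periods was placed.

```python
from datetime import datetime

# Hypothetical records: (timestamp, sender, recipients).
emails = [
    (datetime(2001, 1, 5), "ann", ["bob"]),
    (datetime(2001, 3, 9), "bob", ["cat"]),
    (datetime(2001, 2, 1), "cat", ["ann", "bob"]),
    (datetime(2001, 4, 2), "ann", ["cat"]),
]


def split_by_time(emails, cutoff):
    """Emails before `cutoff` build the models; emails from `cutoff` on are held out."""
    ordered = sorted(emails, key=lambda e: e[0])
    train = [e for e in ordered if e[0] < cutoff]
    test = [e for e in ordered if e[0] >= cutoff]
    return train, test


train, test = split_by_time(emails, cutoff=datetime(2001, 3, 1))
```

Splitting on time rather than at random keeps the models from seeing emails sent after the ones they are asked to predict.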

Page 38: Understanding Email Traffic

Testing

• Remove recipients from email
• Rank all nodes in the network, by computing:
  1. P(E|R,S): Similarity between sender and candidate LMs
  2. P(S|R): Strength of connection between sender and candidate
  3. P(R): Importance of candidate

Testing period: predict recipients
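
The ranking step might look like the sketch below, which assumes the three quantities are multiplied (summed in log space) to score each candidate. The function name and all probabilities are invented for illustration.

```python
import math


def rank_recipients(candidates, email_likelihood, sender_likelihood, recipient_prior):
    """Rank candidates by log P(E|R,S) + log P(S|R) + log P(R), highest first."""
    scores = {}
    for r in candidates:
        parts = (email_likelihood[r], sender_likelihood[r], recipient_prior[r])
        if min(parts) <= 0.0:
            scores[r] = float("-inf")  # any zero component rules the candidate out
        else:
            scores[r] = sum(math.log(p) for p in parts)
    return sorted(candidates, key=lambda r: scores[r], reverse=True)


# Made-up component probabilities for one (sender, email) pair.
candidates = ["bob", "cat", "dan"]
p_email = {"bob": 0.02, "cat": 0.05, "dan": 0.01}    # P(E|R,S)
p_sender = {"bob": 0.60, "cat": 0.10, "dan": 0.30}   # P(S|R)
p_prior = {"bob": 0.30, "cat": 0.50, "dan": 0.20}    # P(R)

ranking = rank_recipients(candidates, p_email, p_sender, p_prior)
```

In this toy case "cat" has the best content match and the highest prior, but "bob" ranks first because of the much stronger connection to the sender.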

Page 40: Understanding Email Traffic

Findings: What works?

• Importance of node: number of emails received by node, PageRank

• Strength of connection: number of emails between nodes, number of times co-addressed

• LM similarity: interpersonal LM is most important (60%-20%-20%)

Page 41: Understanding Email Traffic

Analysis: SNA vs email content

• SNA:
  • SNA signals deteriorate over time
  • SNA signals are most informative on highly active users

• Email content:
  • LM signal improves over time
  • LM signal does worse with highly active users

Page 42: Understanding Email Traffic

Finally

• Combining Social Network Analysis with Language Modeling works better than using either alone.

Page 43: Understanding Email Traffic

Future work

• Consider structure of network in more detail
  • Departments?
  • Friends/family?

• Include ‘time decay’

• Dynamically weight LM/SNA?

Page 44: Understanding Email Traffic

Applications in E-Discovery/Digital Forensics

• Anomaly detection
  • Given a working prediction model, identify “unexpected” communication

• Language models for communication
  • For a node, find the most different interpersonal communication
    • Friends/family vs colleagues?
  • Find communication that differs from the corpus-based communication
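
One simple reading of "unexpected" communication is an email whose actual recipient the prediction model ranked low. The sketch below flags such recipients; the function name, the `top_k` threshold, and the data are assumptions, not something specified in the talk.

```python
def unexpected_recipients(ranking, actual_recipients, top_k=2):
    """Return actual recipients that are not among the model's top-k predictions."""
    expected = set(ranking[:top_k])
    return [r for r in actual_recipients if r not in expected]


# Hypothetical model output and observed recipients for one email.
ranking = ["bob", "cat", "dan", "eve"]
flagged = unexpected_recipients(ranking, actual_recipients=["bob", "eve"])
```

A flagged recipient is only a lead for further inspection; how low a rank must be before it counts as anomalous would need tuning on real data.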

Page 45: Understanding Email Traffic

Fin

• Questions?