Identity Resolution in Email Collections
Tamer Elsayed and Douglas W. Oard
CLIP Colloquium, UMD, Feb 2009
Department of Computer Science, UMIACS, and iSchool


Real Problem

National Archives: search request on Clinton White House tobacco policy
32 million emails
Hired 25 persons for 6 months …
200,000
80,000

Identity Resolution in Email

Date: Wed Dec 20 08:57:00 EST 2000
From: Kay Mann <kay.mann@enron.com>
To: Suzanne Adams <suzanne.adams@enron.com>
Subject: Re: GE Conference Call has be rescheduled

Did Sheila want Scott to participate? Looks like the call will be too late for him.

Sheila: WHO?

Enron Collection

Message-ID: <1494.1584620.JavaMail.evans@thyme>
Date: Mon, 30 Jul 2001 12:40:48 -0700 (PDT)
From: elizabeth.sager@enron.com
To: sstack@reliant.com
Subject: RE: Shhhh.... it's a SURPRISE !
X-From: Sager, Elizabeth </O=ENRON/OU=NA/CN=RECIPIENTS/CN=ESAGER>
X-To: 'SStack@reliant.com@ENRON'

Hi Shari

Hope all is well.
Count me in for the group present.
See ya next week if not earlier

Liza

Elizabeth Sager
713-853-6349

-----Original Message-----
From: SStack@reliant.com@ENRON
Sent: Monday, July 30, 2001 2:24 PM
To: Sager, Elizabeth; Murphy, Harlan; jcrespo@hess.com; wfhenze@jonesday.com
Cc: ntillett@reliant.com
Subject: Shhhh.... it's a SURPRISE !

Please call me (713) 207-5233

Thanks!

Shari

55 Sheila's !!

weisman, pardo, glover, rich, jones, breeden, huckaby, tweed, mcintyre, chadwick, birmingham, kahanek, foraker, tasman, fisher, petitt, Dombo, Robbins, chang, jarnot, kirby, knudsen, boehringer, lutz, glover, wollam, jortner, neylon, whanger, nagel, graves, mclaughlin, venville, rappazzo, miller, swatek, hollis, maynes, nacey, ferrarini, dey, macleod, howard, darling, watson, perlick, advani, hester, kenner, lewis, walton, whitman, berggren, osowski, kelly

Rank Candidates

Generative Model

1. Choose "person" c to mention: p(c)
2. Choose appropriate "context" X to mention c: p(X | c)
3. Choose a "mention" l: p(l | X, c)

Example: the mention "sheila" in the context "GE conference call"
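Read as a factorization, the three generative steps give a joint distribution over person, context, and mention, and resolution amounts to inverting it with Bayes' rule. The lines below are a sketch of that reading only; the normalizing sum over a candidate set C is an assumption borrowed from the notation used elsewhere in the talk, not something stated on this slide.

```latex
p(c, X, l) = p(c)\, p(X \mid c)\, p(l \mid X, c)

p(c \mid l, X) = \frac{p(c)\, p(X \mid c)\, p(l \mid X, c)}
                      {\sum_{c' \in C} p(c')\, p(X \mid c')\, p(l \mid X, c')}
```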

3-Step Solution

(1) Identity Modeling
(2) Context Reconstruction
(3) Mention Resolution
Posterior Distribution

Outline

Introduction and Approach Overview
Computational Model of Identity
Context Reconstruction
Mention Resolution
Evaluation on Existing Collections
Scalable MapReduce Implementation
New Test Collection
Conclusion and Future Work

"Easy References" of Identity

Message-ID: <1494.1584620.JavaMail.evans@thyme>
Date: Mon, 30 Jul 2001 12:40:48 -0700 (PDT)
From: elizabeth.sager@enron.com
To: sstack@reliant.com
Subject: RE: Shhhh.... it's a SURPRISE !
X-From: Sager, Elizabeth </O=ENRON/OU=NA/CN=RECIPIENTS/CN=ESAGER>
X-To: 'SStack@reliant.com@ENRON'

Hi Shari

Hope all is well.
Count me in for the group present.
See ya next week if not earlier

Liza

Elizabeth Sager
713-853-6349

-----Original Message-----
From: SStack@reliant.com@ENRON
Sent: Monday, July 30, 2001 2:24 PM
To: Sager, Elizabeth; Murphy, Harlan; jcrespo@hess.com; wfhenze@jonesday.com
Cc: ntillett@reliant.com
Subject: Shhhh.... it's a SURPRISE !

Please call me (713) 207-5233

Thanks!

Shari

Sources of easy references:
Email Standards
Email-Client Behavior
User Regularities

Representational Model of Identity

77,240 "non-trivial" models

Figure (Representational Model): the address sheila.glover@enron.com with its observed name references and their counts by source. Labels in slide order: 14 (Quoted Headers); sheila glover; 932 (Main Headers); sheila; 19 (Salutation); 216 (Signature); sg; 19 (Signature); sheila glover; 1170 (User Name)

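Computational Model of Identity

c: identity
m: observed mention
t: name type

$p(t \mid c) = \frac{\mathrm{freq}(t, c)}{\sum_{t' \in T} \mathrm{freq}(t', c)}$

$p(m \mid t, c) = \frac{\mathrm{freq}(m, t, c)}{\sum_{m' \in \mathrm{assoc}(c)} \mathrm{freq}(m', t, c)}$

$p(m \mid c) = \sum_{t \in T} p(m \mid t, c)\, p(t \mid c)$

$p(c) = \frac{|\mathrm{assoc}(c)|}{\sum_{c' \in C} |\mathrm{assoc}(c')|}$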
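The estimates on this slide are relative frequencies, so they can be computed directly from counts. The sketch below assumes each identity is given as a list of (mention, name type, count) observations and that freq(t, c) is the sum of freq(m, t, c) over mentions; the function names and the input layout are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def build_identity_model(observations):
    """Estimate p(t|c) and p(m|t,c) for one identity c.

    observations: iterable of (mention, name_type, count) tuples.
    Assumption: freq(t, c) is the sum of freq(m, t, c) over mentions m.
    """
    freq_t = defaultdict(float)
    freq_mt = defaultdict(float)
    for mention, name_type, count in observations:
        freq_t[name_type] += count
        freq_mt[(mention, name_type)] += count
    total = sum(freq_t.values())
    p_t = {t: f / total for t, f in freq_t.items()}
    p_m_given_t = {(m, t): f / freq_t[t] for (m, t), f in freq_mt.items()}
    return p_t, p_m_given_t


def mention_likelihood(mention, p_t, p_m_given_t):
    """p(m|c) = sum over name types t of p(m|t,c) * p(t|c)."""
    return sum(p_m_given_t.get((mention, t), 0.0) * pt for t, pt in p_t.items())


def identity_priors(assoc_sizes):
    """p(c) proportional to the number of references associated with c."""
    total = sum(assoc_sizes.values())
    return {c: n / total for c, n in assoc_sizes.items()}
```

An identity that is rarely called by first name alone gets a low likelihood for a bare "sheila", which is what allows the candidates to be ranked before any context is considered.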

Identity Models

Candidates
Likelihood: p("sheila" | c)

Outline (current section: Context Reconstruction)

Contextual Space

Local Context
Conversational Context
Topical Context

Topical Context

Date: Fri Dec 15 05:33:00 EST 2000
From: david.oxley@enron.com
To: vince j kaminski <vince.kaminski@enron.com>
Cc: sheila walton <sheila.walton@enron.com>
Subject: Re: Grant Masson

Great news. Lets get this moving along. Sheila, can you work out GE letter?

Vince, I am in London Monday/Tuesday, back Weds late. I'll ask Sheila to fix this for you and if you need me call me on my cell phone.

Date: Wed Dec 20 08:57:00 EST 2000
From: Kay Mann <kay.mann@enron.com>
To: Suzanne Adams <suzanne.adams@enron.com>
Subject: Re: GE Conference Call has be rescheduled

Did Sheila want Scott to participate? Looks like the call will be too late for him.

Highlighted on the slide: sheila.walton@enron.com in the first email, and the terms "Sheila", "call", and "GE" shared by both emails

Contextual Space

Social Context
Local Context
Conversational Context
Topical Context

Social Context

Date: Wed Dec 20 08:57:00 EST 2000
From: Kay Mann <kay.mann@enron.com>
To: Suzanne Adams <suzanne.adams@enron.com>
Subject: Re: GE Conference Call has be rescheduled

Did Sheila want Scott to participate? Looks like the call will be too late for him.

Date: Tue, 19 Dec 2000 07:07:00 -0800 (PST)
From: rebecca.walker@enron.com
To: kay.mann@enron.com
Subject: ESA Option Execution

Kay
Can you initial the ESA assignment and assumption agreement or should I ask Sheila Tweed to do it? I believe she is currently en route from Portland.
Thanks, Rebecca

Highlighted on the slide: "Sheila Tweed" in the second email, and the participant kay.mann@enron.com shared by both emails

Formally

A context of an email is a probability distribution over emails: $p(e_j \mid x_k(e_i))$, where $x_k$ is the k-th type of context
Probability estimated based on type of context
Contextual Space is a linear combination of 4 contexts
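The linear combination can be illustrated with a few lines of code. This is a minimal sketch that assumes each context is a dictionary from email id to probability and that the mixing weights sum to one; the slide does not say how the weights are chosen, so `weights` here is a free, hypothetical parameter.

```python
def combine_contexts(contexts, weights):
    """Mix per-context distributions over emails into one contextual space.

    contexts: dict mapping a context name (e.g. "local", "conversational",
              "social", "topical") to a dict of email_id -> probability.
    weights:  dict mapping the same names to mixing weights that sum to 1
              (assumed; the talk does not give their values).
    """
    combined = {}
    for name, distribution in contexts.items():
        for email_id, prob in distribution.items():
            combined[email_id] = combined.get(email_id, 0.0) + weights[name] * prob
    return combined
```

Because every input is a distribution and the weights sum to one, the result is again a distribution over emails.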

Context Expansion

Figure: the four contexts (local, conversational, social, topical) shown against three kinds of evidence (time, people, content)

Temporal similarity affects social and topical similarity

Temporal Similarity

Decay over time
Gaussian and Linear functions
Time difference / Rank
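Both decay shapes named on this slide are easy to write down. The sketch below is an illustration under assumptions: `delta` stands for either a time difference or a rank difference, and `sigma` and `window` are hypothetical tuning parameters whose values the talk does not give.

```python
import math


def gaussian_decay(delta, sigma):
    """Similarity that falls off smoothly as |delta| grows (1.0 at delta = 0)."""
    return math.exp(-(delta ** 2) / (2 * sigma ** 2))


def linear_decay(delta, window):
    """Similarity that drops linearly to 0 at |delta| = window and stays 0 beyond."""
    return max(0.0, 1.0 - abs(delta) / window)
```

The linear form gives a hard cut-off outside the window, while the Gaussian form never reaches zero, so it would need a separate cut-off if the candidate set has to stay small.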

Social Similarity

Two sets of participants (email addresses)
Binary, Overlap, Jaccard, Both
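Three of the listed measures have common set-based definitions, sketched below for two sets of participant addresses. The definitions are assumptions: "Overlap" is taken to be the overlap coefficient (intersection over the smaller set), which may differ from what the talk used, and the "Both" variant is not defined on the slide, so it is left out.

```python
def binary_sim(a, b):
    """1.0 if the two participant sets share anyone, else 0.0."""
    return 1.0 if a & b else 0.0


def overlap_sim(a, b):
    """Overlap coefficient: shared participants over the smaller set (assumed definition)."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def jaccard_sim(a, b):
    """Jaccard coefficient: shared participants over all participants."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```

For the two example emails in this talk, the participant sets {kay.mann, suzanne.adams} and {rebecca.walker, kay.mann} share one address out of three distinct ones.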

Temporal Effect

Figure: Temporal Sim and Pure Social Sim are combined into Social Sim, which is normalized to give the Social Context
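One way to read this diagram is as a product followed by normalization. The sketch below assumes exactly that: the pure social score of each candidate email is multiplied by its temporal similarity, and the results are rescaled to sum to one. The multiplication is an assumption, since the slide only shows that the two scores are combined.

```python
def social_context(pure_social, temporal):
    """Combine pure social and temporal similarity, then normalize.

    pure_social: dict of email_id -> pure social similarity to the query email.
    temporal:    dict of email_id -> temporal similarity to the query email.
    Returns a distribution over emails (empty if every combined score is zero).
    """
    scores = {e: s * temporal.get(e, 0.0) for e, s in pure_social.items()}
    total = sum(scores.values())
    if total == 0:
        return {}
    return {e: s / total for e, s in scores.items() if s > 0}
```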

Topical Similarity

Standard IR Similarity: BM25
Email as a DOCUMENT?
  Subject
  Body (+Subject)
  Root of thread
  Concatenated path to root
Combined similarly with temporal similarity

Figure labels: email; reply / forward
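For readers unfamiliar with BM25, a compact version is sketched below. It assumes each email has already been reduced to a list of tokens under one of the document representations listed on this slide; the parameter defaults (k1 = 1.2, b = 0.75) and the non-negative IDF variant are common choices, not values taken from the talk.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs, k1=1.2, b=0.75):
    """Score every document in docs (lists of tokens) against query_tokens."""
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in set(query_tokens):
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = 1 - b + b * len(d) / avg_len
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores
```

Treating one email as the query and the rest as documents yields the ranked list of topically similar emails that the expansion step needs.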

Contextual Space (emails)

Social Context
Local Context
Conversational Context
Topical Context

Contextual Space (mentions)

Figure: the mention "Sheila" linked by social, conversational, and topical edges to the related mentions "Sheila Tweed", "sheila", "jsheila@enron.com", "sg", "Sheila Walton", and "Sheila"

Outline (current section: Mention Resolution)

Mention Resolution

Goal: estimate p(c|m, X(m)) and rank accordingly

Date: Wed Dec 20 08:57:00 EST 2000
From: Kay Mann <kay.mann@enron.com>
To: Suzanne Adams <suzanne.adams@enron.com>
Subject: Re: GE Conference Call has be rescheduled

Did Sheila want Scott to participate? Looks like the call will be too late for him.

Candidates for the mention "Sheila": ranks 1, 2, 3, ?
Likelihood: p("sheila" | c)

[1] Context-Free Resolution

Figure: the mention graph for "Sheila" (social, conversational, and topical edges to "Sheila Tweed", "sheila", "jsheila@enron.com", "sg", "Sheila Walton", and "Sheila"), with the labels X and Context-Free Resolution

[2] Contextual Resolution

Figure: the mention "Sheila" with social and topical edges to "Sheila Tweed", "sheila", "jsheila@enron.com", "sg", "Sheila Walton", and "Sheila", with the label Context-Free Resolution

Outline (current section: Evaluation on Existing Collections)

Test Collections

Collection      Emails    Identities   Mention Queries   Candidates (Min. / Avg. / Max.)
Sager            1,628        627            51           1 / 4 / 11
Shapiro            974        855            49           1 / 8 / 21
Enron-subset    54,018     27,340            78           1 / 152 / 489
Enron-all      248,451    123,783            78           3 / 518 / 1785

Evaluation Measures

Commonly used in "known-item" retrieval

Success @1 (i.e., Precision @1)
  One-best
MRR (Mean Reciprocal Rank)
  Inverse of the harmonic mean of the ranks $r_i$ of the true answers

$MRR = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{r_i}$
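Both measures depend only on the rank at which the true identity appears for each mention-query. The sketch below assumes ranks are 1-based and that every query has its true answer somewhere in the ranked list; the function names are hypothetical.

```python
def mean_reciprocal_rank(ranks):
    """MRR = (1/n) * sum of 1/r_i, where r_i is the rank of the true answer."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def success_at_1(ranks):
    """Fraction of queries whose true answer is ranked first."""
    return sum(1 for r in ranks if r == 1) / len(ranks)
```

For three queries whose true answers land at ranks 1, 2, and 4, the MRR is (1 + 1/2 + 1/4) / 3 = 7/12 and Success @1 is 1/3.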

Comparison w/Literature

                 MRR                             Success @ 1
Collection       Context Expansion   Lit. Best   Context Expansion   Lit. Best
Sager            0.911               0.889       0.863               0.804
Shapiro          0.913               0.879       0.878               0.779
Enron-subset     0.91                -           0.846               (0.82)
Enron-all        0.89                -           0.821               -

Slide annotations: "Earlier expansion approach, reported in ACL 2008"; "Improved expansion"; 0.87, 0.92

Limitations

Resolving single mentions
All mention-queries are sampled from Enron to Enron emails
All mention-queries refer to Enron employees
Small for train/test split

Scalable Implementation for Resolving All Mentions
New Test Collection

Outline (current section: Scalable MapReduce Implementation)

Scalable Implementation

Two Bottlenecks:
1. Context expansion of ALL emails
   For each email: ranked list of "Similar" emails
2. Resolution of ALL mentions
   Resolution of one mention depends on resolution of all other mentions in context

Context Expansion of ALL Emails

Goal:
  For each email: ranked list of "Similar" emails
  Need for BOTH social and topical contexts
  Efficient implementation

Abstract Problem: Computing Pairwise Similarity

Trivial Solution

load each vector $O(N)$ times
load each term $O(df_t^2)$ times

Goal: scalable and efficient solution for large collections

Better Solution

Load weights for each term once
Each term contributes $O(df_t^2)$ partial scores
Each term contributes only if appears in
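The idea of loading each term's weights once can be sketched with an inverted index: every pair of documents in a term's postings list receives one partial score, and partial scores are summed per pair. This is a single-machine illustration that assumes similarity is an inner product of term weights; the data layout and function name are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def pairwise_similarity(docs):
    """Inner-product similarity for all document pairs that share a term.

    docs: dict of doc_id -> dict of term -> weight.
    Returns dict of (doc_a, doc_b) -> score, with doc_a < doc_b.
    """
    # Build the inverted index: term -> list of (doc_id, weight).
    postings = defaultdict(list)
    for doc_id, weights in docs.items():
        for term, weight in weights.items():
            postings[term].append((doc_id, weight))

    scores = defaultdict(float)
    for plist in postings.values():
        # Each term touches only the pairs in its own postings list,
        # which is on the order of df_t^2 partial scores.
        for (a, wa), (b, wb) in combinations(sorted(plist), 2):
            scores[(a, b)] += wa * wb
    return dict(scores)
```

Pairs with no shared term never appear in the output, which is where the saving over comparing all N^2 pairs comes from.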

MapReduce Framework

(a) Map: (k1, v1) → [(k2, v2)]
(b) Shuffle: group values by keys
(c) Reduce: (k2, [v2]) → [(k3, v3)]

The framework handles low-level details transparently

Decomposition

Load weights for each term once
Each term contributes $O(df_t^2)$ partial scores
Each term contributes only if appears in

Figure labels: map, reduce
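The decomposition can be made concrete by simulating the three phases in memory. In this sketch the map function takes one term's postings list and emits a partial score per document pair, the shuffle groups partial scores by pair, and the reduce function sums them. It is a toy, single-process stand-in for a real MapReduce job; all function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def map_postings(term, postings):
    """Map: one term's postings -> partial scores keyed by document pair."""
    for (a, wa), (b, wb) in combinations(sorted(postings), 2):
        yield (a, b), wa * wb


def shuffle(pairs):
    """Shuffle: group emitted values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_scores(pair, partials):
    """Reduce: sum the partial scores of one document pair."""
    return pair, sum(partials)


def run_job(index):
    """Run map, shuffle, and reduce over an inverted index (term -> postings)."""
    emitted = [kv for term, plist in index.items() for kv in map_postings(term, plist)]
    return dict(reduce_scores(k, vs) for k, vs in shuffle(emitted).items())
```

Each map call needs only one postings list and each reduce call only one pair's partial scores, so neither phase requires the whole collection in memory at once.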

Expansion Using MapReduce

Using generic pairwise-similarity for both topical and social expansion

Figure: expansion pipeline from emails to a context graph. Labeled components: doc rep. (topical: body/root/path; social: participants), df-cut, time window, rank cut-off, temporal sim model, context sim model, context graph

Context Mention-Graph

Figure: the mention graph for "Sheila" (social, conversational, and topical edges to "Sheila Tweed", "sheila", "jsheila@enron.com", "sg", "Sheila Walton", and "Sheila"), annotated with map and reduce steps and the label Context-Free Resolution

Resolution System Using MapReduce

Figure: system pipeline. Stage labels: Preprocessing, Expansion, Packing, Resolution. Labeled components:
Emails, Threads, Identity Models
Social Expansion, Conv. Expansion, Topical Expansion, Local Expansion
Local Graph, Social Graph, Topical Graph, Conv. Graph
Mention Recognition and Prior Computation
Prior, Dict.
Prior Resolution
Social Resolution, Conv. Resolution, Topical Resolution, Local Resolution
Merging Context Resolutions
Posterior Resolution

Outline (current section: New Test Collection)

New Test Collection

Random Sample from CMU-Enron collection
"Annotation + Search" interface available
Total annotators: 3
Annotation time: ~50 hours
Not only resolutions
  Time, difficulty, confidence, evidence, and comments
Total mention-queries: 584
  80% resolvable, 82% of them to Enron domain
Overall inter-annotator agreement: ~81%


Mention-Query Selection

Distribution of Names Based on Resolution

Enron-Resolvable: 390 (66%)
Non-Enron-Resolvable: 80 (14%)
Unresolvable: 114 (20%)
  Probably-Enron: 24 (4%)
  Probably-Non-Enron: 62 (11%)
  Unknown: 28 (5%)

Distribution Based on Difficulty

                        Hard   Moderately Hard   Easy
Enron-Resolvable          43               108    239
Non-Enron-Resolvable       8                39     33

Distribution Based on Confidence

                        Not Confident   Somewhat Confident   Very Confident
Enron-Resolvable                   11                   43              336
Non-Enron-Resolvable                2                   15               63
Unresolvable                        4                   31               79

Distribution of Time Spent

Figure: distribution (0% to 22%) of time spent in minutes (0 to 12) for Enron-Resolvable, Non-Enron-Resolvable, and Unresolvable mention-queries, with R² values of 0.98, 0.95, and 0.88

Evaluation again …

MRR by context type:

                Old Collection   New Collection
                Enron            Non-Enron   Enron   All
topical         0.90             0.44        0.69    0.65
social          0.84             0.47        0.76    0.71
combination     0.92             0.59        0.80    0.76

Pairwise Agreement

Figure: diagram of annotators a1, a2, and a3, with agreement broken down into Enron-resolvable, non-Enron-resolvable, and unresolvable mentions. Values as they appear on the slide:
195; 50; 190; 199
16/16 (100%); 2/4 (50%); 4/7 (57%)
50; 27; 23
12/16 (75%); 1/5 (20%); 2/2 (100%)
35/38 (92%); 2/2 (100%); 6/10 (60%)
24/27 (89%); 5/12 (42%); 11/11 (100%)

Individual Annotator Agreement

Figure: bar chart of agreement (0% to 100%) for annotators a1, a2, and a3 across Enron-Resolvable, Non-Enron-Resolvable, and Unresolvable mentions. Bar labels as they appear on the slide: 6/9, 3/9, 28/32, 6/10, 37/40, 2/2, 24/27, 5/12, 11/11

Overall Agreement

Figure: bar chart of agreement (0% to 100%). Category labels: Enron-Resolvable, Non-Enron-Resolvable, Resolvable Overall, Unresolvable. Bar values as they appear on the slide: 0.90, 0.43, 0.81, 0.77

Agreement Based on Difficulty

Figure: bar chart of agreement (0% to 100%) for Easy and Non-Easy mention-queries, split into Enron-Resolvable and Non-Enron-Resolvable. Bar labels as they appear on the slide: 57/59, 5/7, 30/38, 5/16

Agreement Based on Confidence

Figure: bar chart of agreement (0% to 100%) for Very-Confident and Not-Very-Confident annotations, split into Enron-Resolvable, Non-Enron-Resolvable, and Unresolvable. Bar labels as they appear on the slide: 74/82, 8/18, 17/22, 13/15, 2/5, 6/8

Conclusion and Future Work

Identity Resolution by non-participants is feasible
  And automatic systems for that can be built
  ~75-90% accurate
Proposed generative probabilistic model
  Context Expansion using temporal similarity
  Scalable Implementation using "Pairwise Sim with MapReduce"
Developed largest test collection for the task
  80% resolvable, 82% of them to Enron employees
Effectiveness scales well to large collections

Future work:
  Efficiency Results
  Evaluation using double-assessments
  Iterative approach for "joint resolution"


Thank You!

Related Work

Diehl et al. (SIAM, 2006)
  Developed Enron-subset collection
  Temporal traffic models
  Candidates must have communicated with sender
Minkov et al. (SIGIR, 2006)
  Developed Sager and Shapiro collections
  Graphical framework
  Large collections?
