
iSchool, Cloud Computing Class Talk, Oct 6th 2008

Computing Pairwise Document Similarity in Large Collections: A MapReduce Perspective

Tamer Elsayed, Jimmy Lin, and Douglas W. Oard

Overview

Abstract Problem
Trivial Solution
MapReduce Solution
Efficiency Tricks
Identity Resolution in Email

Abstract Problem

[Figure: a collection of documents mapped to a matrix of pairwise similarity scores; the example entries shown are 0.20, 0.30, 0.54, 0.21, 0.00, 0.34, 0.34, 0.13, 0.74]

Applications: clustering, coreference resolution, “more-like-that” queries

Similarity of Documents

Simple inner product
Cosine similarity
Term weights
  Standard problem in IR
  tf-idf, BM25, etc.

[Figure: the two documents being compared, d_i and d_j]
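As a hedged illustration, the simple inner product can be written as follows. The symbols V (the vocabulary) and w_{t,d} (the weight of term t in document d, for example tf-idf or BM25) are notation assumed here, not taken from the slide.

```latex
\mathrm{sim}(d_i, d_j) = \sum_{t \in V} w_{t,d_i} \, w_{t,d_j}
```

Cosine similarity is the same sum once each document's weight vector has been scaled to unit length.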

Trivial Solution

Load each vector O(N) times
Load each term O(df_t^2) times

Goal: a scalable and efficient solution for large collections
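A minimal sketch of the trivial solution, assuming documents are held as in-memory term-to-weight dictionaries. The toy vectors, the raw-count weights, and the function names are illustrative assumptions, not part of the talk.

```python
from itertools import combinations

# Toy vectors (term -> weight); raw term counts stand in for real weights.
vectors = {
    "d1": {"clinton": 2, "obama": 1},
    "d2": {"clinton": 1, "cheney": 1},
    "d3": {"clinton": 1, "barack": 1, "obama": 1},
}

def inner_product(u, v):
    """Sum of weight products over the terms the two vectors share."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    return sum(w * v[t] for t, w in u.items() if t in v)

def all_pairs_naive(vectors):
    """Compare every pair directly, so each vector is read O(N) times."""
    return {
        (a, b): inner_product(vectors[a], vectors[b])
        for a, b in combinations(sorted(vectors), 2)
    }

scores = all_pairs_naive(vectors)
```

Every document vector is touched once per other document, which is the repeated loading that the goal on this slide aims to avoid.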

Better Solution

Load weights for each term once
Each term contributes O(df_t^2) partial scores
Allows efficiency tricks

Each term contributes only if it appears in both d_i and d_j
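A short worked form of this decomposition, under assumed notation: w_{t,d} is the weight of term t in document d (taken to be zero when t does not occur in d), V is the vocabulary, and df_t is the document frequency of t.

```latex
\mathrm{sim}(d_i, d_j)
  = \sum_{t \in V} w_{t,d_i}\, w_{t,d_j}
  = \sum_{t \in d_i \cap d_j} w_{t,d_i}\, w_{t,d_j}

% A term t that occurs in df_t documents touches
\binom{df_t}{2} = \frac{df_t\,(df_t - 1)}{2} = O(df_t^{2})
% document pairs, contributing one partial score to each.
```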

Decomposition MapReduce

Load weights for each term once
Each term contributes O(df_t^2) partial scores

Each term contributes only if it appears in both d_i and d_j

[Figure labels: map, index, reduce]

MapReduce Framework

[Figure: inputs feed parallel map tasks; shuffling groups values by key; reduce tasks write the outputs. Stages: (a) Map, (b) Shuffle, (c) Reduce]

The framework handles low-level details transparently

map: (k1, v1) -> [(k2, v2)]
reduce: (k2, [v2]) -> [(k3, v3)]
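The map, shuffle, and reduce stages can be imitated in a few lines of single-process Python. This is only a sketch of the programming model: `run_mapreduce` and the word-count functions are hypothetical names, and nothing here reflects how Hadoop distributes the work.

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Run map, shuffle, and reduce over (k1, v1) records in one process."""
    groups = defaultdict(list)
    # (a) Map: each (k1, v1) yields zero or more (k2, v2) pairs.
    for k1, v1 in records:
        for k2, v2 in mapper(k1, v1):
            # (b) Shuffle: group the emitted values by key k2.
            groups[k2].append(v2)
    # (c) Reduce: each (k2, [v2]) yields zero or more (k3, v3) pairs.
    output = []
    for k2 in sorted(groups):
        output.extend(reducer(k2, groups[k2]))
    return output

def count_map(doc_id, text):
    for word in text.split():
        yield word, 1

def count_reduce(word, ones):
    yield word, sum(ones)

counts = run_mapreduce([("a", "x y x"), ("b", "y z")], count_map, count_reduce)
```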

Standard Indexing

[Figure: each doc is tokenized in the map stage; shuffling groups values by terms; a combine step in the reduce stage writes the posting lists. Stages: (a) Map, (b) Shuffle, (c) Reduce]

Indexing (3-doc toy collection)

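Documents:
  Clinton Obama Clinton
  Clinton Cheney
  Clinton Barack Obama

[Figure: indexing turns the three documents into posting lists for the terms Clinton, Barack, Cheney, and Obama; each posting carries a term frequency, 2 for Clinton in the first document and 1 everywhere else]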
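The toy indexing step can be sketched as a map that tokenizes each document and a reduce that combines postings per term. The document ids d1 to d3 and the function names are assumptions made for illustration.

```python
from collections import Counter, defaultdict

# The 3-doc toy collection; the ids d1..d3 are assumed labels.
docs = {
    "d1": "Clinton Obama Clinton",
    "d2": "Clinton Cheney",
    "d3": "Clinton Barack Obama",
}

def index_map(doc_id, text):
    """Tokenize one document and emit (term, (doc_id, tf))."""
    for term, tf in Counter(text.split()).items():
        yield term, (doc_id, tf)

def index_reduce(term, postings):
    """Combine all postings of a term into one sorted posting list."""
    return term, sorted(postings)

def build_index(docs):
    grouped = defaultdict(list)  # shuffle: group postings by term
    for doc_id, text in docs.items():
        for term, posting in index_map(doc_id, text):
            grouped[term].append(posting)
    return dict(index_reduce(t, p) for t, p in grouped.items())

index = build_index(docs)
```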

Pairwise Similarity

(a) Generate pairs (b) Group pairs (c) Sum pairs

[Figure: the toy posting lists (Clinton, Barack, Cheney, Obama) are expanded into per-term partial products for each document pair, which are then grouped by pair and summed; the resulting pair scores are 2, 3, and 1]

Pairwise Similarity (abstract)

(a) Generate pairs (b) Group pairs (c) Sum pairs

[Figure: each term's postings go through a multiply step in the map stage; shuffling groups values by pairs; a sum step in the reduce stage outputs the similarity of each pair]
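The generate, group, and sum steps can be sketched over a small in-memory index. The index below uses raw term frequencies as weights and assumed document ids d1 to d3; a real run would use weights such as BM25.

```python
from collections import defaultdict
from itertools import combinations

# Toy index: term -> [(doc_id, weight)]; ids and weights are illustrative.
index = {
    "Clinton": [("d1", 2), ("d2", 1), ("d3", 1)],
    "Obama": [("d1", 1), ("d3", 1)],
    "Cheney": [("d2", 1)],
    "Barack": [("d3", 1)],
}

def pairs_map(term, postings):
    """(a) Generate pairs: multiply the weights of every two postings."""
    for (di, wi), (dj, wj) in combinations(sorted(postings), 2):
        yield (di, dj), wi * wj

def pairwise_similarity(index):
    grouped = defaultdict(list)
    for term, postings in index.items():
        for pair, partial in pairs_map(term, postings):
            grouped[pair].append(partial)  # (b) Group pairs
    # (c) Sum pairs: add up the partial scores of each document pair.
    return {pair: sum(partials) for pair, partials in grouped.items()}

similarity = pairwise_similarity(index)
```

Terms with a single posting, such as Cheney and Barack here, generate no pairs at all.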

Experimental Setup

Hadoop 0.16.0
  Open source MapReduce implementation
Cluster of 19 machines
  Each with two single-core processors
Aquaint-2 collection
  906K documents
Okapi BM25
Subsets of collection

Elsayed, Lin, and Oard, ACL 2008

Efficiency (disk space)

[Figure: intermediate pairs (billions, 0 to 9,000) against corpus size (%, 0 to 100); annotation: 8 trillion intermediate pairs]

Hadoop, 19 PCs, each with 2 single-core processors, 4GB memory, 100GB disk
Aquaint-2 Collection, ~906k docs

Terms: Zipfian Distribution

[Figure: doc freq (df) against term rank]

Each term t contributes O(df_t^2) partial results
Very few terms dominate the computations
  most frequent term (“said”): 3%
  most frequent 10 terms: 15%
  most frequent 100 terms: 57%
  most frequent 1000 terms: 95% (~0.1% of total terms; 99.9% df-cut)

Efficiency (disk space)

[Figure: intermediate pairs (billions, 0 to 9,000) against corpus size (%, 0 to 100), with one curve each for no df-cut and for df-cut at 99.999%, 99.99%, 99.9%, and 99%; annotations: 8 trillion intermediate pairs and 0.5 trillion intermediate pairs]

Hadoop, 19 PCs, each with 2 single-core processors, 4GB memory, 100GB disk
Aquaint-2 Collection, ~906k docs

Effectiveness (recent work)

Effect of df-cut on effectiveness
Medline04: 909k abstracts, ad-hoc retrieval

[Figure: relative P5 (%, 50 to 100) against df-cut (%, 99.00 to 100.00)]

Drop 0.1% of terms
  “Near-Linear” growth
  Fit on disk
  Cost 2% in effectiveness

Hadoop, 19 PCs, each with 2 single-core processors, 4GB memory, 100GB disk

Implementation Issues

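BM25s Similarity Model
  TF, IDF
  Document length

Per-term contribution for a document pair (d_i, d_j):

$$
\frac{tf_{t,d_i}}{0.5 + 1.5 \cdot \frac{dl_{d_i}}{avgdl} + tf_{t,d_i}}
\cdot
\frac{tf_{t,d_j}}{0.5 + 1.5 \cdot \frac{dl_{d_j}}{avgdl} + tf_{t,d_j}}
\cdot
\log \frac{N - df_t + 0.5}{df_t + 0.5}
$$

DF-Cut
  Build a histogram
  Pick the absolute df for the % df-cut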
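The two DF-Cut steps (build a histogram, then pick the absolute df that matches the percentage) could look like the sketch below. It assumes that an x% df-cut keeps the x% least frequent terms and drops the rest; the function name and the tie handling at the boundary are assumptions.

```python
from collections import Counter

def df_cut_threshold(dfs, cut_percent):
    """Return the largest df that is kept under a cut_percent% df-cut.

    dfs holds one document frequency per term. Terms whose df exceeds the
    returned value are dropped. If several terms share the boundary df and
    keeping them all would exceed the budget, all of them are dropped
    (an assumed tie rule).
    """
    histogram = Counter(dfs)  # df value -> number of terms with that df
    keep = round(len(dfs) * cut_percent / 100)  # number of terms to retain
    kept, threshold = 0, 0
    for df in sorted(histogram):
        if kept + histogram[df] > keep:
            break
        kept += histogram[df]
        threshold = df
    return threshold

example_dfs = [1, 1, 1, 2, 2, 3, 5, 8, 40, 100]
limit = df_cut_threshold(example_dfs, 80)
```

With the ten example terms, an 80% cut keeps the eight rarest terms and discards the two with df 40 and 100.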

Other Approximation Techniques?

Other Approximation Techniques

(2) Absolute df

Consider only terms that appear in at least n (or %) documents
An absolute lower bound on df, instead of just removing the % most-frequent terms

Other Approximation Techniques

(3) tf-Cut

Consider only documents (in posting list) with tf > T, where T = 1 or 2

OR: Consider only the top N documents based on tf for each term
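Both tf-cut variants amount to pruning a posting list before pairs are generated. A small sketch, with hypothetical function names and postings given as (doc_id, tf) tuples:

```python
def tf_cut_threshold(postings, t):
    """Keep only postings whose term frequency is strictly above t."""
    return [(doc_id, tf) for doc_id, tf in postings if tf > t]

def tf_cut_top_n(postings, n):
    """Keep the n postings with the highest tf (ties broken by doc id)."""
    return sorted(postings, key=lambda p: (-p[1], p[0]))[:n]

postings = [("d1", 2), ("d2", 1), ("d3", 1)]
above_one = tf_cut_threshold(postings, 1)
top_two = tf_cut_top_n(postings, 2)
```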

Other Approximation Techniques

(4) Similarity Threshold

Consider only partial scores > SimT

Other Approximation Techniques

(5) Ranked List

Keep only the N most similar documents
  In the reduce phase

Good for ad-hoc retrieval and “more-like this” queries
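One way to keep a ranked list is to collect each document's scored neighbours and retain the N best. The sketch below does this over already-summed pair scores; where exactly the truncation sits inside the reduce phase is not specified on the slide, so the structure and names here are assumptions.

```python
import heapq
from collections import defaultdict

def top_n_neighbours(scores, n):
    """For every document, keep its n most similar documents."""
    per_doc = defaultdict(list)
    for (di, dj), score in scores.items():
        per_doc[di].append((score, dj))
        per_doc[dj].append((score, di))
    return {
        doc: [(other, score) for score, other in heapq.nlargest(n, cands)]
        for doc, cands in per_doc.items()
    }

# Illustrative pair scores for three documents.
scores = {("d1", "d2"): 2, ("d1", "d3"): 3, ("d2", "d3"): 1}
ranked = top_n_neighbours(scores, 1)
```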

Space-Saving Tricks

(1) Stripes

Stripes instead of pairs
Group by doc-id, not pairs

[Figure: example stripe]
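A sketch of the stripes idea: instead of one intermediate record per document pair, the map step emits one dictionary ("stripe") per document, and the reduce step adds stripes element-wise. The toy index, ids, and function names are illustrative assumptions.

```python
from collections import Counter, defaultdict

index = {
    "Clinton": [("d1", 2), ("d2", 1), ("d3", 1)],
    "Obama": [("d1", 1), ("d3", 1)],
}

def stripes_map(term, postings):
    """Emit (doc_id, {other_doc: partial score}) instead of single pairs."""
    postings = sorted(postings)
    for i, (di, wi) in enumerate(postings):
        stripe = {dj: wi * wj for dj, wj in postings[i + 1:]}
        if stripe:
            yield di, stripe

def stripes_reduce(doc_id, stripes):
    """Add up all stripes that were grouped under the same doc id."""
    total = Counter()
    for stripe in stripes:
        total.update(stripe)  # Counter.update adds values key by key
    return doc_id, dict(total)

def stripes_similarity(index):
    grouped = defaultdict(list)  # shuffle: group stripes by doc id
    for term, postings in index.items():
        for doc_id, stripe in stripes_map(term, postings):
            grouped[doc_id].append(stripe)
    return dict(stripes_reduce(d, s) for d, s in grouped.items())

striped = stripes_similarity(index)
```

Grouping by document id yields fewer, larger intermediate records than grouping by pair.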

Space-Saving Tricks

(2) Blocking

No need to generate the whole matrix at once
Generate different blocks of the matrix at different steps
  Limits the max space required for intermediate results

[Figure: Similarity Matrix]
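Blocking can be sketched as computing only the rows of the similarity matrix that belong to the current block of documents. The row-wise partitioning, the names, and the choice to rescan the whole index for each block are simplifying assumptions, not the talk's implementation.

```python
from collections import defaultdict
from itertools import combinations

index = {
    "Clinton": [("d1", 2), ("d2", 1), ("d3", 1)],
    "Obama": [("d1", 1), ("d3", 1)],
}

def blocked_similarity(index, doc_ids, block_size):
    """Yield one block of the similarity matrix at a time."""
    for start in range(0, len(doc_ids), block_size):
        block = set(doc_ids[start:start + block_size])
        partial = defaultdict(int)
        for postings in index.values():
            for (di, wi), (dj, wj) in combinations(sorted(postings), 2):
                if di in block:  # only rows owned by this block
                    partial[(di, dj)] += wi * wj
        # Intermediate space is bounded by the size of a single block.
        yield dict(partial)

blocks = list(blocked_similarity(index, ["d1", "d2", "d3"], 1))
```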

Identity Resolution in Email

Topical Similarity
Social Similarity
Joint Resolution of Mentions

Basic Problem

Date: Wed Dec 20 08:57:00 EST 2000
From: Kay Mann <[email protected]>
To: Suzanne Adams <[email protected]>
Subject: Re: GE Conference Call has be rescheduled

Did Sheila want Scott to participate? Looks like the call will be too late for him.

Sheila: WHO?

Enron Collection

-----Original Message-----
From: [email protected]@ENRON
Sent: Monday, July 30, 2001 2:24 PM
To: Sager, Elizabeth; Murphy, Harlan; [email protected]; [email protected]
Cc: [email protected]
Subject: Shhhh.... it's a SURPRISE !

Message-ID: <1494.1584620.JavaMail.evans@thyme>
Date: Mon, 30 Jul 2001 12:40:48 -0700 (PDT)
From: [email protected]
To: [email protected]
Subject: RE: Shhhh.... it's a SURPRISE !
X-From: Sager, Elizabeth </O=ENRON/OU=NA/CN=RECIPIENTS/CN=ESAGER>
X-To: '[email protected]@ENRON'

Hope all is well.
Count me in for the group present.
See ya next week if not earlier

Please call me (713) 207-5233

Liza

Elizabeth Sager
713-853-6349

Hi Shari

Thanks!

Shari

55 Sheila’s !!

weisman, pardo, glover, rich, jones, breeden, huckaby, tweed, mcintyre, chadwick, birmingham, kahanek, foraker, tasman, fisher, petitt, Dombo, Robbins, chang, jarnot, kirby, knudsen, boehringer, lutz, glover, wollam, jortner, neylon, whanger, nagel, graves, mclaughlin, venville, rappazzo, miller, swatek, hollis, maynes, nacey, ferrarini, dey, macleod, howard, darling, watson, perlick, advani, hester, kenner, lewis, walton, whitman, berggren, osowski, kelly

Rank Candidates

Generative Model

1. Choose “person” c to mention: p(c)
2. Choose appropriate “context” X to mention c: p(X | c)
3. Choose a “mention” l: p(l | X, c)

[Figure: example with the mention “sheila” and the context words GE, conference, call]
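Reading the three steps as a chain of conditional probabilities (an interpretation, not stated on the slide), the joint probability and the posterior over candidate persons follow from the chain rule and Bayes' rule:

```latex
p(c, X, l) = p(c)\, p(X \mid c)\, p(l \mid X, c)

p(c \mid l, X) =
  \frac{p(c)\, p(X \mid c)\, p(l \mid X, c)}
       {\sum_{c'} p(c')\, p(X \mid c')\, p(l \mid X, c')}
```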

3-Step Solution

(1) Identity Modeling
(2) Context Reconstruction
(3) Mention Resolution

Posterior Distribution

Contextual Space

Local Context
Conversational Context
Topical Context

Topical Context

Date: Fri Dec 15 05:33:00 EST 2000
From: [email protected]
To: vince j kaminski <[email protected]>
Cc: sheila walton <[email protected]>
Subject: Re: Grant Masson

Great news. Lets get this moving along. Sheila, can you work out GE letter?

Vince, I am in London Monday/Tuesday, back Weds late. I'll ask Sheila to fix this for you and if you need me call me on my cell phone.

[email protected]

Date: Wed Dec 20 08:57:00 EST 2000
From: Kay Mann <[email protected]>
To: Suzanne Adams <[email protected]>
Subject: Re: GE Conference Call has be rescheduled

Did Sheila want Scott to participate? Looks like the call will be too late for him.

[Figure: the terms Sheila, call, and GE are highlighted in the emails]

Contextual Space

Social Context
Local Context
Conversational Context
Topical Context

Social Context

Date: Wed Dec 20 08:57:00 EST 2000
From: Kay Mann <[email protected]>
To: Suzanne Adams <[email protected]>
Subject: Re: GE Conference Call has be rescheduled

Did Sheila want Scott to participate? Looks like the call will be too late for him.

Date: Tue, 19 Dec 2000 07:07:00 -0800 (PST)
From: [email protected]
To: [email protected]
Subject: ESA Option Execution

Kay
Can you initial the ESA assignment and assumption agreement or should I ask Sheila Tweed to do it? I believe she is currently en route from Portland.
Thanks, Rebecca

[Figure labels: Sheila Tweed, [email protected], [email protected]]

Contextual Space (mentions)

[Figure: a graph of mentions (“Sheila”, “Sheila Tweed”, “sheila”, [email protected], “sg”, “Sheila Walton”, “Sheila”) linked by edges labelled social, conversational, and topical]

Joint Resolution of Mentions

Topical Expansion

Each email is a document
Index all (bodies of) emails
  Remove all signature and salutation lines
Use temporal constraints
  Need an email-to-date/time mapping
  Check for each pair of documents

Social Expansion

Can we use the same technique?

For each email: the list of participating email addresses comprises the document

MessageID: 3563
Date: Wed Dec 20 08:57:00 EST 2000
From: Kay Mann <[email protected]>
To: Suzanne Adams <[email protected]>
Subject: Re: GE Conference Call has be rescheduled

Did Sheila want Scott to participate? Looks like the call will be too late for him.

2563 [email protected] [email protected]

Index the new “social documents” and apply the same topical expansion process

Social Similarity Models

Intersection size
Jaccard Coefficient
Boolean

All given temporal constraints
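The three social similarity models can be sketched over sets of participant addresses. The addresses below are placeholders, the function names are hypothetical, and the temporal constraints are left out because the slide does not spell them out.

```python
def intersection_size(a, b):
    """Number of participants the two emails share."""
    return len(a & b)

def jaccard(a, b):
    """Shared participants divided by all participants of either email."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def boolean_overlap(a, b):
    """1 if the emails share at least one participant, else 0."""
    return 1 if a & b else 0

# Placeholder "social documents": the participant sets of two emails.
email_a = {"x@example.com", "y@example.com", "z@example.com"}
email_b = {"y@example.com", "z@example.com", "w@example.com"}

shared = intersection_size(email_a, email_b)
jac = jaccard(email_a, email_b)
any_shared = boolean_overlap(email_a, email_b)
```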

Joint Resolution

[Figure: a graph of mentions (“Sheila”, “Sheila Tweed”, “sheila”, [email protected], “sg”, “Sheila Walton”, “Sheila”) linked by edges labelled social, conversational, and topical]

Joint Resolution

Spread Current Resolution
Combine Context Info
Update Resolution

Mention Graph

Joint Resolution

map, shuffle, reduce

Mention Graph

MapReduce!

Work in Progress!

System Design

[Figure: system design diagram with the components Emails, Threads, Identity Models; Mention Recognition, Mentions, Context-Free Resolution, Prior Resolution; Local, Conv., Topical, and Social Expansion with the matching Local, Conv., Topical, and Social Contexts; Merging Contexts; Joint Resolution; Posterior Resolution]

Iterative Joint Resolution

Input: Context Graph + Prior Resolution

Mapper
  Consider one mention
  Takes:
    1. out-edges and context info
    2. prior resolution
  Spread context info and prior resolution to all mentions in context

Reducer
  Consider one mention
  Takes:
    1. in-edges and context info
    2. prior resolution
  Compute posterior resolution

Multiple Iterations
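One iteration of this mapper and reducer might be sketched as below. The talk marks this as work in progress and does not give the update rule, so the damped mixing in `jr_reduce`, the edge weights, and all names are assumptions made purely to show the data flow.

```python
from collections import defaultdict

def jr_map(mention, out_edges, prior):
    """Spread this mention's prior resolution to the mentions in its context."""
    yield mention, ("self", prior)  # pass the mention's own state through
    for neighbour, weight in out_edges.items():
        yield neighbour, ("msg", weight, prior)

def jr_reduce(mention, values, damping=0.5):
    """Combine incoming evidence with the prior (ASSUMED update rule)."""
    prior, incoming = {}, defaultdict(float)
    for value in values:
        if value[0] == "self":
            prior = value[1]
        else:
            _, weight, dist = value
            for candidate, p in dist.items():
                incoming[candidate] += weight * p
    # Mix prior and evidence, restricted to this mention's own candidates.
    scores = {c: (1 - damping) * p + damping * incoming.get(c, 0.0)
              for c, p in prior.items()}
    total = sum(scores.values())
    posterior = {c: s / total for c, s in scores.items()} if total else prior
    return mention, posterior

def iterate(graph, resolution):
    grouped = defaultdict(list)  # shuffle: group messages by target mention
    for mention, edges in graph.items():
        for key, value in jr_map(mention, edges, resolution[mention]):
            grouped[key].append(value)
    return dict(jr_reduce(m, vals) for m, vals in grouped.items())

# m1 is an ambiguous "Sheila"; m2 is "Sheila Tweed", already resolved.
graph = {"m1": {"m2": 1.0}, "m2": {"m1": 1.0}}
prior = {"m1": {"tweed": 0.5, "walton": 0.5}, "m2": {"tweed": 1.0}}
posterior = iterate(graph, prior)
```

Calling `iterate` again with `posterior` in place of `prior` gives the next round of the multiple iterations.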

Conclusion

Simple and efficient MapReduce solution
  Applied to both topical and social expansion in “Identity Resolution in Email”
  Different tricks for approximation
Shuffling is critical
  df-cut controls efficiency vs. effectiveness tradeoff
  99.9% df-cut achieves 98% relative accuracy

Thank You!

Algorithm

Matrix must fit in memory
  Works for small collections
  Otherwise: disk access optimization