Ivory: Pairwise Document Similarity in Large Collection with MapReduce
Tamer Elsayed, Jimmy Lin, and Doug Oard
Laboratory for Computational Linguistics and Information Processing (CLIP Lab)
UM Institute for Advanced Computer Studies (UMIACS)

Dec 21, 2015

Page 2: Problem

[Figure: each document in the collection is paired with a vector of similarity scores against the other documents, e.g., 0.20, 0.30, 0.54, 0.21, 0.00, 0.34, 0.34, 0.13, 0.74]

Applications:
  "more-like-that" queries
  Clustering, e.g., co-reference resolution

Page 3: Solutions

Trivial
  For each pair of vectors, compute the inner product
  Loads each vector O(N) times

\[ sim(d_i, d_j) = \sum_{t \in V} w_{t,d_i} \, w_{t,d_j} \]

Better
  Each term contributes only if it appears in both documents, i.e., in \(d_i \cap d_j\)

\[ sim(d_i, d_j) = \sum_{t \in d_i \cap d_j} w_{t,d_i} \, w_{t,d_j} \]

\[ sim(d_i, d_j) = \sum_{t \in d_i \cap d_j} term\_contrib(t, d_i, d_j) \]
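The two forms of the similarity sum above can be checked against each other on a toy pair of documents. This is a minimal sketch, assuming raw term frequencies as the weights w; the document contents, the vocabulary list, and the function names are illustrative and not taken from the slides' implementation.

```python
# Toy term weights (raw term frequencies; an assumption for illustration).
d1 = {"A": 2, "B": 1, "C": 1}
d3 = {"A": 1, "B": 2, "E": 1}
vocabulary = ["A", "B", "C", "D", "E"]


def sim_over_vocabulary(di, dj, vocab):
    """Sum w[t,di] * w[t,dj] over every term in the vocabulary."""
    return sum(di.get(t, 0) * dj.get(t, 0) for t in vocab)


def sim_over_shared_terms(di, dj):
    """Sum only over terms that occur in both documents."""
    shared = di.keys() & dj.keys()
    return sum(di[t] * dj[t] for t in shared)


print(sim_over_vocabulary(d1, d3, vocabulary))  # 4
print(sim_over_shared_terms(d1, d3))            # 4
```

Terms absent from either document contribute zero, so restricting the sum to shared terms changes the amount of work but not the result.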

Page 4: Algorithm

Loads each posting once
Matrix must fit in memory
  Works for small collections
  Otherwise: disk access optimization
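The algorithm itself did not survive in this transcript, so the following is only a sketch of one way to meet the two properties listed above: every posting list is read once, and all scores accumulate in a single in-memory structure. The function name, the dictionary standing in for the similarity matrix, and the toy postings are assumptions.

```python
from collections import defaultdict
from itertools import combinations


def pairwise_similarity(postings):
    """postings: term -> list of (doc_id, weight). Returns {(di, dj): score}."""
    sim = defaultdict(float)  # stands in for the matrix; lives entirely in memory
    for term, plist in postings.items():  # each posting list is read once
        # Every pair of documents sharing this term gets one contribution.
        for (di, wi), (dj, wj) in combinations(sorted(plist), 2):
            sim[(di, dj)] += wi * wj
    return dict(sim)


# Hypothetical toy postings with term frequencies as weights.
postings = {
    "A": [("d1", 2), ("d3", 1)],
    "B": [("d1", 1), ("d2", 1), ("d3", 2)],
    "C": [("d1", 1)],
}
scores = pairwise_similarity(postings)
print(scores)
```

The accumulator grows with the number of document pairs that share at least one term, which is why this form suits small collections only.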

Page 5: Hadoopify: 2-Step Solution

1) Indexing
  One MapReduce step
  term posting file

2) Pairwise Similarity
  Another MapReduce step
  Term contribution for all possible pairs
  Generates ½ df*(df-1) intermediate contributions per term
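The ½ df*(df-1) count is the number of unordered document pairs in a posting list of length df. A short sketch of the arithmetic follows; the document frequencies used here are hypothetical.

```python
def intermediate_pairs(df):
    """Number of unordered document pairs in a posting list of length df."""
    return df * (df - 1) // 2


# Hypothetical document frequencies for a five-term toy collection.
dfs = {"A": 2, "B": 3, "C": 1, "D": 1, "E": 1}
per_term = {term: intermediate_pairs(df) for term, df in dfs.items()}
total = sum(per_term.values())

print(per_term)  # {'A': 1, 'B': 3, 'C': 0, 'D': 0, 'E': 0}
print(total)     # 4
print(intermediate_pairs(100_000))  # 4999950000
```

Because the count is quadratic in df, a single term that occurs in 100,000 documents produces roughly five billion intermediate contributions on its own.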

Page 6: Indexing

[Figure: indexing as one MapReduce job over three documents]

Input documents:
  d1: A A B C
  d2: B D D
  d3: A B B E

map output:
  d1: (A,(d1,2)) (B,(d1,1)) (C,(d1,1))
  d2: (B,(d2,1)) (D,(d2,2))
  d3: (A,(d3,1)) (B,(d3,2)) (E,(d3,1))

shuffle, then reduce output (one posting list per term):
  (A,[(d1,2),(d3,1)])
  (B,[(d1,1),(d2,1),(d3,2)])
  (C,[(d1,1)])
  (D,[(d2,2)])
  (E,[(d3,1)])
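The indexing example above can be simulated on a single machine. This is a minimal sketch that imitates the map, shuffle, and reduce phases with plain Python functions; it does not use Hadoop, and the function names are assumptions.

```python
from collections import Counter, defaultdict

docs = {"d1": "A A B C", "d2": "B D D", "d3": "A B B E"}


def index_map(doc_id, text):
    """Emit (term, (doc_id, tf)) for each distinct term in the document."""
    for term, tf in Counter(text.split()).items():
        yield term, (doc_id, tf)


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def index_reduce(term, postings):
    """Collect a term's postings into one list, sorted by document id."""
    return term, sorted(postings)


mapped = [kv for doc_id, text in docs.items() for kv in index_map(doc_id, text)]
index = dict(index_reduce(t, p) for t, p in shuffle(mapped).items())
for term in sorted(index):
    print(term, index[term])
```

The printed posting lists match the reduce output shown for terms A through E.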

Page 7: Pairwise Similarity

[Figure: pairwise similarity as a second MapReduce job over the posting lists]

Input (one posting list per map call):
  (A,[(d1,2),(d3,1)])
  (B,[(d1,1),(d2,1),(d3,2)])
  (C,[(d1,1)])
  (D,[(d2,2)])
  (E,[(d3,1)])

map output:
  A: ((d1,d3),2)
  B: ((d1,d2),1) ((d1,d3),2) ((d2,d3),2)
  C, D, E: no pairs

shuffle output:
  ((d1,d2),[1])
  ((d1,d3),[2,2])
  ((d2,d3),[2])

reduce output:
  ((d1,d2),1)
  ((d1,d3),4)
  ((d2,d3),2)
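The second job can be simulated the same way. In this sketch the mapper emits one term contribution per document pair in a posting list and the reducer sums the contributions for each pair; the weights are the term frequencies from the example, and the function names are assumptions rather than the authors' code.

```python
from collections import defaultdict
from itertools import combinations

index = {
    "A": [("d1", 2), ("d3", 1)],
    "B": [("d1", 1), ("d2", 1), ("d3", 2)],
    "C": [("d1", 1)],
    "D": [("d2", 2)],
    "E": [("d3", 1)],
}


def pair_map(term, postings):
    """Emit ((di, dj), wi * wj) for every pair of documents sharing the term."""
    for (di, wi), (dj, wj) in combinations(sorted(postings), 2):
        yield (di, dj), wi * wj


def pair_reduce(pair, contributions):
    """Sum all term contributions for one document pair."""
    return pair, sum(contributions)


# Shuffle: group the emitted contributions by document pair.
groups = defaultdict(list)
for term, postings in index.items():
    for pair, contrib in pair_map(term, postings):
        groups[pair].append(contrib)

similarity = dict(pair_reduce(p, c) for p, c in groups.items())
print(similarity)
```

Posting lists with a single entry (C, D, and E here) emit nothing, so only A and B contribute to the result.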

Page 8: Implementation Issues

df-cut
  Drop common terms
  Intermediate tuples dominated by very high df terms
  Efficiency vs. effectiveness

Space saving tricks
  Common doc + stripes
  Blocking
  Compression
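A df-cut can be sketched as a filter on the index before the pairwise step. The slides do not define the percentage precisely, so this sketch assumes that an x% df-cut keeps the x% of terms with the lowest document frequency and discards the rest; the function name, the rounding rule, and the toy index are also assumptions.

```python
def apply_df_cut(index, cut=0.99):
    """Keep the `cut` fraction of terms with the lowest df; drop the most common.

    Assumption: a 99% df-cut discards the 1% of terms with the highest df.
    """
    by_df = sorted(index, key=lambda term: len(index[term]))  # ascending df
    n_drop = round(len(by_df) * (1 - cut))
    kept = by_df[: len(by_df) - n_drop]
    return {term: index[term] for term in kept}


# Hypothetical toy index: term -> posting list of (doc_id, tf).
index = {
    "A": [("d1", 2), ("d3", 1)],
    "B": [("d1", 1), ("d2", 1), ("d3", 2)],
    "C": [("d1", 1)],
    "D": [("d2", 2)],
    "E": [("d3", 1)],
}
pruned = apply_df_cut(index, cut=0.8)  # drops the single most common term
print(sorted(pruned))  # ['A', 'C', 'D', 'E']
```

Dropping B removes three of the four intermediate pairs in this toy index, but it also removes B's contribution to every score, which is the efficiency vs. effectiveness tradeoff in miniature.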

Page 9: Experimental Setup

Hadoop 0.16.0
Cluster of 19 nodes (w/ double processors)
Aquaint-2 collection
  906K documents
Okapi BM25
Subsets of collection

Page 10: Efficiency (running time)

[Figure: computation time (minutes, 0 to 140) vs. corpus size (%, 0 to 100) with a 99% df-cut; linear fit with R² = 0.997]

Page 11: Efficiency (disk usage)

[Figure: intermediate pairs (billions, 0 to 9,000) vs. corpus size (%, 0 to 100); series: df-cut at 99%, 99.9%, 99.99%, 99.999%, and no df-cut]

Page 12: Effectiveness (recent)

Effect of df-cut on effectiveness
Medline04 - 909k abstracts

[Figure: relative P5 (%, 50 to 100) vs. df-cut (%, 99.00 to 100.00)]

Page 13: Conclusion

Simple and efficient MapReduce solution
  2 hours (using 38 nodes, 99% df-cut) for ~million-doc collection
  Play tricks for I/O bound jobs

Effective linear-time-scaling approximation
  99.9% df-cut achieves 98% relative accuracy
  df-cut controls efficiency vs. effectiveness tradeoff

Page 14: Future work

Bigger collections!
More investigation of df-cut and other techniques
Analytical model
Compression techniques (e.g., bitwise)
More effectiveness experiments
  Joint resolution of personal names in email
  Co-reference resolution of names and organizations
MapReduce IR research platform
  Batch query processing

Page 15: Thank You!

Page 16: MapReduce Framework

[Figure: input splits feed parallel map tasks; shuffling groups values by key; parallel reduce tasks write the outputs]
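The framework picture above reduces to three phases: map over the inputs, group values by key, and reduce each group. The following is a single-process sketch of that flow, not the Hadoop API; `run_mapreduce`, the mapper and reducer names, and the toy documents are assumptions. The example job computes document frequencies.

```python
from collections import defaultdict


def run_mapreduce(inputs, mapper, reducer):
    """Run mapper over inputs, group values by key, then run reducer per key."""
    groups = defaultdict(list)
    for record in inputs:
        for key, value in mapper(record):
            groups[key].append(value)  # shuffling: group values by keys
    return [reducer(key, values) for key, values in sorted(groups.items())]


def df_mapper(record):
    """Emit (term, 1) once per distinct term in a document."""
    doc_id, text = record
    for term in set(text.split()):
        yield term, 1


def df_reducer(term, ones):
    """Sum the ones to get the term's document frequency."""
    return term, sum(ones)


docs = [("d1", "A A B C"), ("d2", "B D D"), ("d3", "A B B E")]
result = run_mapreduce(docs, df_mapper, df_reducer)
print(result)  # [('A', 2), ('B', 3), ('C', 1), ('D', 1), ('E', 1)]
```

In a real cluster the map and reduce calls run in parallel on different nodes and the grouping happens over the network; the sketch keeps only the data flow.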