21st CIKM Conference, Lahaina, Maui, Hawaii, 30/10/12. Efficient Jaccard-based Diversity Analysis of Large Document Collections. Fan Deng, Stefan Siersdorfer, Sergej Zerr. 21st ACM Conference on Information and Knowledge Management, CIKM 2012, Lahaina, Maui, Hawaii

CUbRIK Research at CIKM 2012: Efficient Jaccard-based Diversity Analysis of Large Document Collections

Jun 19, 2015

Technology

CUbRIK Project

Presentation at CIKM 2012 of the CUbRIK research paper "Efficient Jaccard-based Diversity Analysis of Large Document Collections", authored by Fan Deng, Stefan Siersdorfer and Sergej Zerr of L3S Research Center, a partner of the CUbRIK Consortium.
Transcript
Page 1: CUbRIK Research at CIKM 2012: Efficient Jaccard-based Diversity Analysis of Large Document Collections

21st CIKM Conference, Lahaina, Maui, Hawaii

30/10/12

Efficient Jaccard-based Diversity Analysis of Large Document Collections

Fan Deng, Stefan Siersdorfer, Sergej Zerr

21st ACM Conference on Information and Knowledge Management, CIKM 2012, Lahaina, Maui, Hawaii

Page 2: CUbRIK Research at CIKM 2012: Efficient Jaccard-based Diversity Analysis of Large Document Collections

Diversity

Diversity: a health measure of an ecosystem

Biodiversity is the degree of variation of life forms within a given species, ecosystem, biome, or an entire planet. Biodiversity is a measure of the health of ecosystems (Wikipedia).

Page 3: CUbRIK Research at CIKM 2012: Efficient Jaccard-based Diversity Analysis of Large Document Collections

Diversity

Biodiversity is the degree of variation of life forms within a given species, ecosystem, biome, or an entire planet. Biodiversity is a measure of the health of ecosystems (Wikipedia).

Page 4: CUbRIK Research at CIKM 2012: Efficient Jaccard-based Diversity Analysis of Large Document Collections

Diversity in Computer Science

Increasing amounts of data are published on the Internet on a daily basis, not least due to popular social web environments: YouTube, Flickr, the blogosphere, etc.

Social Web Environment – Ecosystem

• Group dynamics

• “Hot topics”, controversial topics

• Diversity of opinions

• Topic ambiguity

• Temporal topic analysis

Our focus: topic diversity of large text corpora

Page 5: CUbRIK Research at CIKM 2012: Efficient Jaccard-based Diversity Analysis of Large Document Collections

Outline

• Motivation: Document Topic Diversity

• Diversity Metrics

• Proposed Efficient Algorithms: SampleDJ, TrackDJ

• Experiments

• Applications

• Future Work: Ideas&Directions


Page 6: CUbRIK Research at CIKM 2012: Efficient Jaccard-based Diversity Analysis of Large Document Collections

Diversity Metrics

• Simpson’s diversity index [1]: each object belongs to one of a discrete set of categories

$$D = \sum_{i=1}^{Z} p_i^2$$

• Stirling’s index [2]: depends on distances between objects and their relative occurrences

$$D = \sum_{ij\,(i \neq j)} (d_{ij})^{\alpha} (p_i \cdot p_j)^{\beta}$$

[Figure: two example groups of objects, labeled A to E and B to F]

[1] E. H. Simpson. Measurement of diversity. Nature, 163, 1949.

[2] A. Stirling. A general framework for analysing diversity in science, technology and society. Journal of The Royal Society Interface, 4(15):707–719, 2007.
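As a small illustration of the first metric, the sketch below computes a Simpson-style index as the sum of squared category proportions. That form is an assumption about how the slide defines the index, and the function name `simpson_index` and the example labels are made up for illustration.

```python
from collections import Counter


def simpson_index(labels):
    """Sum of squared category proportions (assumed form of the index).

    A value of 1.0 means every object falls into the same category;
    smaller values mean the objects are spread over more categories.
    """
    if not labels:
        raise ValueError("need at least one object")
    counts = Counter(labels)
    n = len(labels)
    return sum((c / n) ** 2 for c in counts.values())


# Hypothetical category labels for four objects.
labels = ["A", "B", "B", "C"]
print(simpson_index(labels))
```

Under this form, the index only looks at category counts, which is why the slide contrasts it with Stirling’s index, where distances between objects also enter.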

Page 7: CUbRIK Research at CIKM 2012: Efficient Jaccard-based Diversity Analysis of Large Document Collections

Refined Jaccard Index

• Refined Jaccard Index: average Jaccard similarity between all possible object pairs

$$RDJ = \frac{2}{n(n-1)} \sum_{1 \le i < j \le n} JS(O_i, O_j)$$

• Note: a lower RDJ value corresponds to higher diversity

• Problem: “All-Pair Problem”

• Solution: estimation algorithms with probabilistic error-bound guarantees

[Figure: Jaccard similarity illustrated as set intersection over set union]
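The definition above, the average Jaccard similarity over all object pairs, can be written down directly as a brute-force computation. The sketch below is only the quadratic baseline, not one of the estimation algorithms from the talk. The function names are illustrative, the five example sets are the documents D1 to D5 used on the TrackDJ slides, and treating two empty sets as having similarity 0 is an assumption.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity |a ∩ b| / |a ∪ b| of two collections."""
    a, b = set(a), set(b)
    union = a | b
    # Assumption: two empty sets are given similarity 0.0.
    return len(a & b) / len(union) if union else 0.0


def rdj_exact(docs):
    """Average Jaccard similarity over all unordered pairs (O(n^2) pairs)."""
    n = len(docs)
    if n < 2:
        raise ValueError("need at least two documents")
    total = sum(jaccard(a, b) for a, b in combinations(docs, 2))
    return 2.0 * total / (n * (n - 1))


docs = [
    {"A", "B", "C"},  # D1
    {"B", "C", "D"},  # D2
    {"C", "E"},       # D3
    {"E", "B", "D"},  # D4
    {"A", "C", "D"},  # D5
]
print(rdj_exact(docs))
```

For these five sets the ten pairwise similarities sum to 3.4, so the average is about 0.34.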

Page 8: CUbRIK Research at CIKM 2012: Efficient Jaccard-based Diversity Analysis of Large Document Collections

Estimation Algorithms

• Input: relative error ε, accuracy confidence δ

• Output: estimated RDJ value $\widehat{RDJ}$ such that

$$\Pr\left[\,|\widehat{RDJ} - RDJ| \le \varepsilon \cdot RDJ\,\right] \ge \delta$$

• Algorithms: SampleDJ, TrackDJ (claims and proofs in the paper)

Page 9: CUbRIK Research at CIKM 2012: Efficient Jaccard-based Diversity Analysis of Large Document Collections

Estimation Algorithm SampleDJ

[Figure: SampleDJ illustration. Panel labels: Document Set; Document subsets (Step 1); Median]

Page 10: CUbRIK Research at CIKM 2012: Efficient Jaccard-based Diversity Analysis of Large Document Collections

SampleDJ Overview

• Execution time: [formula garbled in the transcript; only the fragments “1”, “2” and “RDJ” survive]

• Properties:

Execution time (number of trials) does not depend on the data set size, only on the RDJ value.

For a data set with very high diversity (RDJ close to 0), the run time can grow without bound.
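To make the sampling idea concrete, here is a simplified Monte Carlo sketch that estimates RDJ from a fixed number of randomly drawn pairs. It is not the authors' SampleDJ: it has no adaptive stopping rule, no median step, and no (ε, δ) guarantee. The function name `rdj_sampled` and the parameters `num_pairs` and `seed` are assumptions made for illustration.

```python
import random


def jaccard(a, b):
    """Jaccard similarity |a ∩ b| / |a ∪ b| of two collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rdj_sampled(docs, num_pairs, seed=0):
    """Estimate RDJ as the mean similarity of randomly sampled pairs."""
    rng = random.Random(seed)
    n = len(docs)
    if n < 2:
        raise ValueError("need at least two documents")
    total = 0.0
    for _ in range(num_pairs):
        # Two distinct indices, so a document is never paired with itself.
        i, j = rng.sample(range(n), 2)
        total += jaccard(docs[i], docs[j])
    return total / num_pairs


docs = [
    {"A", "B", "C"},
    {"B", "C", "D"},
    {"C", "E"},
    {"E", "B", "D"},
    {"A", "C", "D"},
]
print(rdj_sampled(docs, num_pairs=20_000))
```

The cost of this loop depends on `num_pairs`, not on the number of documents, which mirrors the property stated above. The smaller the true RDJ, the more samples are needed for the same relative error.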

Page 11: CUbRIK Research at CIKM 2012: Efficient Jaccard-based Diversity Analysis of Large Document Collections

Estimation Algorithm TrackDJ

• Broder et al. 2000 proposed Min-wise independent hashing (Min-hash)

D1(A,B,C), D2(B,C,D)

π1 = (E,B,A,C,D)

h1(D1) = B, h1(D2) = B

$$\Pr[h(D_x) = h(D_y)] = JS(D_x, D_y)$$
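The min-hash of a document under a permutation is the document's element that comes first in that permutation. The sketch below reproduces the slide's values for π1 and then checks the collision property by enumerating all 120 permutations of the five-element universe. The function name `min_hash` and the exhaustive enumeration are illustrative choices, not part of the talk.

```python
from itertools import permutations


def min_hash(doc, permutation):
    """Return the element of doc that appears first in the permutation."""
    for item in permutation:
        if item in doc:
            return item
    raise ValueError("document shares no element with the permutation")


pi1 = ("E", "B", "A", "C", "D")
d1 = {"A", "B", "C"}
d2 = {"B", "C", "D"}

print(min_hash(d1, pi1), min_hash(d2, pi1))  # both are "B", as on the slide

# Fraction of all permutations under which D1 and D2 get the same min-hash.
all_perms = list(permutations("ABCDE"))
collisions = sum(min_hash(d1, p) == min_hash(d2, p) for p in all_perms)
collision_rate = collisions / len(all_perms)
print(collision_rate)
```

D1 and D2 share 2 of the 4 elements in their union, so their Jaccard similarity is 0.5, and the enumeration gives the same collision rate.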

Page 12: CUbRIK Research at CIKM 2012: Efficient Jaccard-based Diversity Analysis of Large Document Collections

Estimation Algorithm TrackDJ

• Broder et al. proposed Min-wise independent hashing (Min-hash)

D1(A,B,C), D2(B,C,D), D3(C,E), D4(E,B,D), D5(A,C,D)

π1 = (E,B,A,C,D)

h1(D1) = B, h1(D2) = B, h1(D3) = E, h1(D4) = E, h1(D5) = A

[Figure: diagram of documents D1 to D5]
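One way to see how min-hashing can avoid comparing all pairs: group the documents by min-hash value in a single pass, then count colliding pairs from the group sizes. Because each pair collides with probability equal to its Jaccard similarity, the fraction of colliding pairs, averaged over permutations, estimates RDJ. The sketch below illustrates that idea only and is not necessarily how the authors' TrackDJ is implemented. The function names are assumptions, and π2 = (A,C,B,D,E) is taken from the backup slides.

```python
from collections import Counter
from itertools import permutations


def min_hash(doc, permutation):
    """Return the element of doc that appears first in the permutation."""
    for item in permutation:
        if item in doc:
            return item
    raise ValueError("document shares no element with the permutation")


def colliding_pair_fraction(docs, permutation):
    """Fraction of document pairs sharing a min-hash value (one pass)."""
    n = len(docs)
    buckets = Counter(min_hash(d, permutation) for d in docs)
    # A bucket holding c documents contributes c*(c-1)/2 colliding pairs.
    colliding = sum(c * (c - 1) // 2 for c in buckets.values())
    return colliding / (n * (n - 1) / 2)


def rdj_minhash(docs, perms):
    """Average the colliding-pair fraction over several permutations."""
    return sum(colliding_pair_fraction(docs, p) for p in perms) / len(perms)


docs = [
    {"A", "B", "C"},  # D1
    {"B", "C", "D"},  # D2
    {"C", "E"},       # D3
    {"E", "B", "D"},  # D4
    {"A", "C", "D"},  # D5
]
pi1 = ("E", "B", "A", "C", "D")
pi2 = ("A", "C", "B", "D", "E")

print(colliding_pair_fraction(docs, pi1))
print(rdj_minhash(docs, [pi1, pi2]))
print(rdj_minhash(docs, list(permutations("ABCDE"))))
```

With π1 the buckets are {D1, D2}, {D3, D4} and {D5}, giving 2 colliding pairs out of 10. Averaging over every permutation of the universe recovers the exact average similarity of about 0.34 for this toy example; in practice only a limited number of hash functions would be used.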

Page 13: CUbRIK Research at CIKM 2012: Efficient Jaccard-based Diversity Analysis of Large Document Collections

TrackDJ Overview

• Time complexity: $O(n)$

• Properties: execution in linear time (depends on the data set size)

Page 14: CUbRIK Research at CIKM 2012: Efficient Jaccard-based Diversity Analysis of Large Document Collections

Experimental Evaluation of the Theoretical Claims (Flickr Dataset)

ε=5%, δ=95%

| Data set size n | All Pairs: RDJ | All Pairs: time (seconds) | SampleDJ: error (%) | SampleDJ: time (seconds) | TrackDJ: error (%) | TrackDJ: time (seconds) |
|---|---|---|---|---|---|---|
| 1,000 | 0.00206 | 0.08 | 0.017 | 34 (0.57 min) | 0 | 40 (0.66 min) |
| 10,000 | 0.001992 | 8.82 | 0.028 | 40 (0.67 min) | 0.013 | 410 (6.84 min) |
| 100,000 | 0.001992 | 912 (15.21 min) | 0.019 | 90 (1.50 min) | 0.043 | 5,253 (1.46 h) |
| 1,000,000 | 0.001993 | 97,215 (27 h) | 0.08 | 223 (3.72 min) | 0.041 | 51,730 (14.37 h) |

| Data set size n | All Pairs: time | SampleDJ: RDJ | SampleDJ: time (seconds) | TrackDJ: RDJ | TrackDJ: time (seconds) |
|---|---|---|---|---|---|
| 10,000,000 | 113 days (estimated) | 0.001998 | 350 (5.84 min) | 0.001997 | 790,016 (9.14 days) |
| 20,000,000 | 450 days (estimated) | 0.002203 | 246 (4.10 min) | 0.002206 | 1,613,566 (16.68 days) |

[Figure: three plots of run time t versus data set size]

Page 15: CUbRIK Research at CIKM 2012: Efficient Jaccard-based Diversity Analysis of Large Document Collections

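Experimental Evaluation of the Theoretical Claims (Synthetic Dataset)

| n | RDJ | All-Pair: time (hours) | SampleDJ: error (%) | SampleDJ: time (seconds) | TrackDJ: error (%) | TrackDJ: time (hours) |
|---|---|---|---|---|---|---|
| 524,288 | 0.017 | 5.3 | 0.34 | 2 | 2.05 | 2.5 |
| 524,288 | 0.0087 | | 0.26 | 10 | 1.96 | |
| 524,288 | 0.00427 | | 0.38 | 39 | 2.00 | |
| 524,288 | 0.00217 | | 0.02 | 156 | 1.95 | |
| 524,288 | 0.00105 | | 0.06 | 624 (10 min) | 1.90 | |
| 524,288 | 0.00052 | | 0.13 | 2,502 (42 min) | 1.91 | |
| 524,288 | 0.00026 | | 0.04 | 10,089 (3 h) | 1.91 | |
| 524,288 | 0.00013 | | 0.39 | 40,635 (11 h) | 2.31 | |

The All-Pair and TrackDJ times appear once on the slide, in cells spanning all rows.

[Figure: plot of log(t) versus RDJ]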

Page 16: CUbRIK Research at CIKM 2012: Efficient Jaccard-based Diversity Analysis of Large Document Collections

Applications

Flickr photo tag similarity over the time period 2005-2010

[Figure: tag similarity over time, with an axis running between “Similar” and “Diverse”. Annotated tag groups: winter, snow, vacation, or house; graduation, wedding, beach; halloween, thanksgiving; christmas]

Page 17: CUbRIK Research at CIKM 2012: Efficient Jaccard-based Diversity Analysis of Large Document Collections

Applications: Diversity vs. #Clusters

Reuters RCV1 Categories

| Size | News Category | RDJ |
|---|---|---|
| 299,612 | Corporate/Industrial | 4.31 |
| 204,820 | Markets | 4.79 |
| 66,339 | Economics | 4.69 |
| 35,769 | Government/Social | 5.73 |
| 35,279 | Sports | 3.45 |
| 33,969 | Domestic Politics | 5.21 |
| 31,328 | War, Civil War | 5.81 |

Flickr Groups

| Size | Group Title | RDJ |
|---|---|---|
| 139,344 | Pictures of England | 1.63 |
| 121,391 | Dark Art | 0.57 |
| 98,901 | Aircraft Photos | 1.99 |
| 89,606 | Absolutely beautiful | 0.51 |
| 76,265 | Visual Arts!! | 0.61 |
| 73,632 | Lonely Planet: ‘Leaving’ | 0.48 |
| 71,158 | Lighthouse Lovers | 4.56 |

Page 18: CUbRIK Research at CIKM 2012: Efficient Jaccard-based Diversity Analysis of Large Document Collections

Outline

• Motivation: Document Topic Diversity

• Diversity Metrics

• Proposed Efficient Algorithms: SampleDJ, TrackDJ

• Experiments

• Applications

• Future Work: Ideas&Directions


Page 19: CUbRIK Research at CIKM 2012: Efficient Jaccard-based Diversity Analysis of Large Document Collections

Conclusion & Future Work

• Average similarity of all object pairs can be computed in linear time

• Two novel algorithms with probabilistic guarantees and different properties

SampleDJ: fast for most data sets; run time does not depend on the data set size

TrackDJ: guaranteed to solve the problem in linear time

Future Work:

• Applying other similarity measures

• Studying visual features in multimedia collections

• Experiments with parallelization

Page 20: CUbRIK Research at CIKM 2012: Efficient Jaccard-based Diversity Analysis of Large Document Collections

Data sets and source code: http://www.l3s.de/~deng/

Fan Deng, Stefan Siersdorfer, Sergej Zerr [email protected]

Thank you!

[Figure: summary thumbnails labeled Jaccard similarity, SampleDJ, TrackDJ, and Temporal diversity development in Flickr]

http://en.wikipedia.org/wiki/File:Phanerozoic_Biodiversity.png

Page 21: CUbRIK Research at CIKM 2012: Efficient Jaccard-based Diversity Analysis of Large Document Collections

REFERENCES

[1] A. K. Jain. Data clustering: 50 years beyond k-means. Pattern Recognition Letters, 31(8):651–666, 2010.
[2] B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom. Models and issues in data stream systems. PODS ’02, Madison, Wisconsin.
[3] R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. ACM Press, New York, 1999.
[4] A. Z. Broder. Min-wise independent permutations: Theory and practice. ICALP ’00, London, UK.
[5] A. Z. Broder. On the resemblance and containment of documents. SEQUENCES ’97, Washington, USA.
[6] A. Z. Broder. Identifying and filtering near-duplicate documents. COM ’00, London, UK, 2000.
[7] A. Z. Broder, M. Charikar, A. Frieze, and M. Mitzenmacher. Min-wise independent permutations. J. Comput. Syst. Sci., 60:630–659, June 2000.
[8] M. Charikar. Similarity estimation techniques from rounding algorithms. STOC ’02.
[9] P. Dagum, R. Karp, M. Luby, and S. Ross. An optimal algorithm for Monte Carlo estimation. SIAM J. Comput., 29:1484–1496, March 2000.
[10] M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. SCG ’04, New York, USA.
[11] J. D. Fearon. Ethnic and cultural diversity by country. Journal of Economic Growth, 8:195–222, 2003.
[12] S. Gollapudi and A. Sharma. An axiomatic approach for result diversification. WWW ’09, Madrid, Spain.
[13] P. Indyk and R. Motwani. Approximate nearest neighbors: towards removing the curse of dimensionality. STOC ’98, Dallas, Texas, USA.
[14] C. C. Krebs. Ecological Methodology. HarperCollins, 1989.
[15] C. Lévêque and J.-C. Mounolou. Biodiversity. John Wiley & Sons, 2003.
[16] D. D. Lewis, Y. Yang, T. G. Rose, and F. Li. RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5:361–397, 2004.
[17] M. Ley. The DBLP computer science bibliography. http://www.informatik.uni-trier.de/~ley/db/.
[18] S. Lieberson. Measuring population diversity. American Sociological Review, 34(6):850–862, 1969.
[19] C. D. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.
[20] C. Meek, B. Thiesson, and D. Heckerman. The learning-curve sampling method applied to model-based clustering. J. Mach. Learn. Res., 2:397–418, March 2002.
[21] E. Minack, W. Siberski, and W. Nejdl. Incremental diversification for very large sets: a streaming-based approach. In SIGIR ’11, Beijing, China.
[22] Olken. Random sampling from databases. Ph.D. Diss. (University of California at Berkeley), 1993.
[23] O. Papapetrou, W. Siberski, and N. Fuhr. Text clustering for peer-to-peer networks with probabilistic guarantees. LNCS, pages V.5993, 293–305. Springer Berlin / Heidelberg, 2010.
[24] D. Rafiei, K. Bharat, and A. Shukla. Diversifying web search results. In WWW ’10, Raleigh, USA.
[25] I. Rafols and M. Meyer. Diversity and network coherence as indicators of interdisciplinarity: case studies in bionanoscience. Scientometrics, 82(2):263–287, 2010.
[26] N. Sahoo, J. Callan, R. Krishnan, G. Duncan, and R. Padman. Incremental hierarchical clustering of text documents. In CIKM ’06.
[27] E. H. Simpson. Measurement of diversity. Nature, 163, 1949.
[28] A. Stirling. A general framework for analysing diversity in science, technology and society. Journal of The Royal Society Interface, 4(15):707–719, 2007.
[29] E. Vee, U. Srivastava, J. Shanmugasundaram, P. Bhat, and S. A. Yahia. Efficient computation of diverse query results. In ICDE ’08, Washington, DC, USA.
[30] C.-N. Ziegler, S. M. McNee, J. A. Konstan, and G. Lausen. Improving recommendation lists through topic diversification. In WWW ’05, New York, USA.

Page 22: CUbRIK Research at CIKM 2012: Efficient Jaccard-based Diversity Analysis of Large Document Collections

Similarity Measures

• There exists a large number of possible measures: cosine similarity, Okapi, inverted distances, etc.

• Jaccard similarity (computationally efficient): each object belongs to one of a discrete set of categories

$$JS(O_i, O_j) = \frac{|O_i \cap O_j|}{|O_i \cup O_j|}$$

[Figure: Jaccard similarity illustrated as set intersection over set union]

Text 1: island maui second largest hawaiian

Text 2: tenerife largest island seven canary

Jaccard similarity: the two texts share 2 terms (island, largest) out of 8 distinct terms, so JS = 2/8 = 0.25
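The worked example can be checked with a few lines of code. This is a minimal sketch that treats each text as a set of whitespace-separated terms; the function name `jaccard` and the empty-set convention are assumptions. Counting the ten listed tokens gives two shared terms and eight distinct terms.

```python
def jaccard(a, b):
    """Jaccard similarity |a ∩ b| / |a ∪ b| of two collections of terms."""
    a, b = set(a), set(b)
    union = a | b
    # Assumption: two empty sets are given similarity 0.0.
    return len(a & b) / len(union) if union else 0.0


text1 = "island maui second largest hawaiian".split()
text2 = "tenerife largest island seven canary".split()

shared = sorted(set(text1) & set(text2))
distinct = set(text1) | set(text2)
print(shared, len(distinct), jaccard(text1, text2))
```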

Page 23: CUbRIK Research at CIKM 2012: Efficient Jaccard-based Diversity Analysis of Large Document Collections

Estimation Algorithm TrackDJ

• Broder et al. proposed Min-wise independent hashing (Min-hash)

D1(A,B,C), D2(B,C,D), D3(C,E), D4(E,B,D), D5(A,C,D)

π1 = (E,B,A,C,D)

h1(D1) = B, h1(D2) = B, h1(D3) = E, h1(D4) = E, h1(D5) = A

π2 = (A,C,B,D,E)

h2(D1) = A, h2(D2) = C, h2(D3) = C, h2(D4) = B, h2(D5) = A

$$\Pr[h(D_x) = h(D_y)] = JS(D_x, D_y)$$

Page 24: CUbRIK Research at CIKM 2012: Efficient Jaccard-based Diversity Analysis of Large Document Collections

Estimation Algorithm TrackDJ

• Broder et al. proposed Min-wise independent hashing (Min-hash)

D1(A,B,C), D2(B,C,D), D3(C,E), D4(E,B,D), D5(A,C,D)

π1 = (E,B,A,C,D)

h1(D1) = B, h1(D2) = B, h1(D3) = E, h1(D4) = E, h1(D5) = A

π2 = (A,C,B,D,E)

h2(D1) = A, h2(D2) = C, h2(D3) = C, h2(D4) = B, h2(D5) = A

$$\Pr[h(D_x) = h(D_y)] = JS(D_x, D_y)$$

D1(A,B,C): h1(D1) = B, …

Page 25: CUbRIK Research at CIKM 2012: Efficient Jaccard-based Diversity Analysis of Large Document Collections

Outline

• Motivation: Document Topic Diversity

• Diversity Metrics

• Proposed Efficient Algorithms: SampleDJ, TrackDJ

• Experiments

• Applications

• Future Work: Ideas&Directions



Page 28: CUbRIK Research at CIKM 2012: Efficient Jaccard-based Diversity Analysis of Large Document Collections

Problem Statement: “All-Pair” Problem

• To measure the diversity of a data set, similarity computation between all possible pairs is required

• $O(n^2)$ complexity is not feasible for large data sets

[Figure: two example groups of objects, labeled A to E and B to F]
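A quick calculation shows how fast the number of pairs grows. The sketch below counts unordered pairs, n(n-1)/2, for three of the data set sizes that appear in the Flickr experiments; the function name `num_pairs` is illustrative.

```python
def num_pairs(n):
    """Number of unordered pairs among n objects: n * (n - 1) / 2."""
    return n * (n - 1) // 2


# Data set sizes taken from the Flickr experiment table.
for n in (1_000, 1_000_000, 20_000_000):
    print(f"{n:>12,} documents -> {num_pairs(n):,} pairs")
```

Going from one thousand to one million documents multiplies the number of pairs by about a million, which is why the all-pairs baseline stops being practical.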

Page 29: CUbRIK Research at CIKM 2012: Efficient Jaccard-based Diversity Analysis of Large Document Collections

Applications: Diversity vs. #Clusters

Reuters RCV1 Categories

| Size | News Category | RDJ |
|---|---|---|
| 299,612 | Corporate/Industrial | 4.31 |
| 204,820 | Markets | 4.79 |
| 66,339 | Economics | 4.69 |
| 35,769 | Government/Social | 5.73 |
| 35,279 | Sports | 3.45 |
| 33,969 | Domestic Politics | 5.21 |
| 31,328 | War, Civil War | 5.81 |

Flickr Groups

| Size | Group Title | RDJ |
|---|---|---|
| 139,344 | Pictures of England | 1.63 |
| 121,391 | Dark Art | 0.57 |
| 98,901 | Aircraft Photos | 1.99 |
| 89,606 | Absolutely beautiful | 0.51 |
| 76,265 | Visual Arts!! | 0.61 |
| 73,632 | Lonely Planet: ‘Leaving’ | 0.48 |
| 71,158 | Lighthouse Lovers | 4.56 |

UCI US-Census Educ. Based Clusters

| Size | Educational Background | RDJ |
|---|---|---|
| 562,837 | High School, Diploma, GED | 49.41 |
| 366,116 | Some College w/o Degree | 49.22 |
| 273,281 | 5th, 6th, 7th, or 8th Grade | 51.63 |
| 213,941 | Bachelors Degree | 51.36 |
| 174,653 | 1st, 2nd, 3rd or 4th Grade | 70.97 |
| 108,834 | N/A, Less Than 3 Years Old | 84.55 |
| 107,142 | 10th Grade | 47.56 |

Page 30: CUbRIK Research at CIKM 2012: Efficient Jaccard-based Diversity Analysis of Large Document Collections

Similarity Measures

• There exists a large number of possible measures: cosine similarity, Okapi, inverted distances, etc.

• Jaccard similarity has special properties that we make use of in our algorithms

$$JS(O_i, O_j) = \frac{|O_i \cap O_j|}{|O_i \cup O_j|}$$

[Figure: Jaccard similarity illustrated as set intersection over set union]