Efficient Jaccard-based Diversity Analysis of Large Document Collections
Fan Deng, Stefan Siersdorfer, Sergej Zerr
21st ACM Conference on Information and Knowledge Management (CIKM 2012), Lahaina, Maui, Hawaii, 30/10/12
CUbRIK Research at CIKM 2012: Efficient Jaccard-based Diversity Analysis of Large Document Collections
Presentation at CIKM 2012 of the CUbRIK research paper "Efficient Jaccard-based Diversity Analysis of Large Document Collections", authored by Fan Deng, Stefan Siersdorfer and Sergej Zerr of L3S Research Center, a partner of the CUbRIK Consortium.
Transcript
Diversity
Diversity - Health measure of an ecosystem
Biodiversity is the degree of variation of life forms within a given species, ecosystem, biome, or an entire planet. Biodiversity is a measure of the health of ecosystems (Wikipedia).
Diversity in Computer Science
Our focus: topic diversity of large text corpora
Social Web Environment – Ecosystem
• Group dynamics
• “Hot topics”, controversial topics
• Diversity of opinions
• Topic ambiguity
• Temporal topic analysis
Increasing amounts of data are published on the Internet every day, not least due to popular social web environments such as YouTube, Flickr, and the blogosphere.
Diversity Metrics
• Simpson’s diversity index [1]: each object belongs to one of a discrete set of categories
• Stirling’s index [2]: depends on the distances between objects and their relative occurrences
[1] E. H. Simpson. Measurement of diversity. Nature, 163, 1949.
[2] A. Stirling. A general framework for analysing diversity in science, technology and society. Journal of The Royal Society Interface, 4(15):707–719, 2007.
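Simpson’s index: $D = \sum_{i=1}^{Z} p_i^2$
Stirling’s index: $D = \sum_{ij\,(i \neq j)} (d_{ij})^{\alpha} \cdot (p_i \cdot p_j)^{\beta}$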
[Figure: example object sets with elements A, B, C, D, E, F]
• Refined Jaccard Index (RDJ): average Jaccard similarity between all possible object pairs
• Note: a lower RDJ value corresponds to higher diversity
• Algorithms: SampleDJ, TrackDJ (claims and proofs in the paper)
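As a point of reference for the estimation algorithms, the average Jaccard similarity can be computed exactly by visiting every pair, which costs a number of comparisons quadratic in the collection size. The sketch below is only a brute-force baseline, not SampleDJ or TrackDJ; it assumes that "all possible object pairs" means unordered pairs of distinct documents, and the names `jaccard`, `average_pairwise_jaccard` and the toy `docs` are illustrative.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity |a ∩ b| / |a ∪ b| of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def average_pairwise_jaccard(docs):
    """Exact average Jaccard similarity over all unordered pairs of distinct documents."""
    pairs = list(combinations(docs, 2))
    if not pairs:
        raise ValueError("need at least two documents")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Toy collection (assumed data): each document is a set of terms.
docs = [
    {"island", "maui", "hawaiian"},
    {"island", "tenerife", "canary"},
    {"database", "index", "query"},
]
print(average_pairwise_jaccard(docs))  # a low value, i.e. a diverse collection
```

Only the first two documents share a term, so the average is small, which matches the note that a lower value means higher diversity.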
Estimation Algorithms
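Approximation guarantee for the estimate $\widehat{RDJ}$: $\Pr\big[\,|\widehat{RDJ} - RDJ| > \varepsilon \cdot RDJ\,\big] \le \delta$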
Estimation Algorithm SampleDJ
[Figure: Document Set; Document subsets: Step 1; Median]
SampleDJ Overview
• Execution time: $O\!\left(\frac{1}{\varepsilon^2 \cdot RDJ}\right)$
• Properties: the execution time (number of trials) does not depend on the data set size, only on the RDJ value. For a data set with very high diversity (RDJ close to 0), the algorithm can run indefinitely.
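The idea of estimating the average from randomly sampled pairs can be illustrated with a small Monte Carlo sketch. This is not the paper's SampleDJ: it uses a fixed number of trials instead of a stopping rule that adapts to the RDJ value, and the median over several independent groups is only loosely modelled on the "Median" label in the slide figure. All function names and parameters (`groups`, `trials`, `seed`) are assumptions made for this example.

```python
import random
from statistics import median


def jaccard(a, b):
    """Jaccard similarity |a ∩ b| / |a ∪ b| of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def sample_mean(docs, trials, rng):
    """Mean Jaccard similarity over `trials` uniformly sampled pairs of distinct documents."""
    total = 0.0
    for _ in range(trials):
        a, b = rng.sample(docs, 2)  # two different positions, chosen uniformly
        total += jaccard(a, b)
    return total / trials


def estimate_average_jaccard(docs, groups=5, trials=2000, seed=0):
    """Median of `groups` independent sample means (fixed budget, no stopping rule)."""
    rng = random.Random(seed)
    return median(sample_mean(docs, trials, rng) for _ in range(groups))


# Toy collection (assumed data); the exact average over its 3 pairs is 1/3.
docs = [{"a", "b"}, {"a", "b"}, {"c"}]
print(estimate_average_jaccard(docs))
```

The work done is `groups * trials` similarity computations regardless of how many documents are in `docs`, which mirrors the property that the number of trials does not depend on the data set size.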
Estimation Algorithm TrackDJ
π1 = (E,B,A,C,D)
D1 = {A,B,C}, D2 = {B,C,D}
h1(D1) = B, h1(D2) = B
$\Pr[h(D_x) = h(D_y)] = JS(D_x, D_y)$
• Broder et al. (2000) proposed min-wise independent hashing (Min-hash)
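The min-hash property can be checked on the slide's own example with a few lines of code. The sketch below is a plain illustration of min-wise hashing with explicit random permutations, not the TrackDJ algorithm; `min_hash`, `estimate_jaccard` and their parameters are hypothetical names chosen for this example.

```python
import random


def min_hash(doc, permutation):
    """Return the element of `doc` that appears first in `permutation`."""
    rank = {item: position for position, item in enumerate(permutation)}
    return min(doc, key=rank.__getitem__)


def estimate_jaccard(d_x, d_y, universe, num_permutations=2000, seed=0):
    """Fraction of random permutations under which both sets get the same min-hash."""
    rng = random.Random(seed)
    items = sorted(universe)
    matches = 0
    for _ in range(num_permutations):
        rng.shuffle(items)  # a fresh random permutation of the universe
        if min_hash(d_x, items) == min_hash(d_y, items):
            matches += 1
    return matches / num_permutations


pi_1 = ["E", "B", "A", "C", "D"]
d1, d2 = {"A", "B", "C"}, {"B", "C", "D"}
print(min_hash(d1, pi_1), min_hash(d2, pi_1))  # B B
print(estimate_jaccard(d1, d2, universe="ABCDE"))  # close to the exact JS = 2/4 = 0.5
```

Under π1 both documents hash to B, and over many random permutations the collision rate approaches the Jaccard similarity of the two sets.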
REFERENCES
[1] Data clustering: 50 years beyond k-means. Pattern Recognition Letters, 31(8):651–666, 2010.
[2] B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom. Models and issues in data stream systems. PODS ’02, Madison, Wisconsin.
[3] R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. ACM Press, New York, 1999.
[4] A. Z. Broder. Min-wise independent permutations: Theory and practice. ICALP ’00, London, UK.
[5] A. Z. Broder. On the resemblance and containment of documents. SEQUENCES ’97, Washington, USA.
[6] A. Z. Broder. Identifying and filtering near-duplicate documents. COM ’00, London, UK, 2000.
[7] A. Z. Broder, M. Charikar, A. Frieze, and M. Mitzenmacher. Min-wise independent permutations. J. Comput. Syst. Sci., 60:630–659, June 2000.
[8] M. Charikar. Similarity estimation techniques from rounding algorithms. STOC ’02.
[9] P. Dagum, R. Karp, M. Luby, and S. Ross. An optimal algorithm for Monte Carlo estimation. SIAM J. Comput., 29:1484–1496, March 2000.
[10] M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. SCG ’04, New York, USA.
[11] J. D. Fearon. Ethnic and cultural diversity by country. Journal of Economic Growth, 8:195–222, 2003.
[12] S. Gollapudi and A. Sharma. An axiomatic approach for result diversification. WWW ’09, Madrid, Spain.
[13] P. Indyk and R. Motwani. Approximate nearest neighbors: towards removing the curse of dimensionality. STOC ’98, Dallas, Texas, USA.
[14] C. C. Krebs. Ecological Methodology. HarperCollins, 1989.
[15] C. Lévêque and J.-C. Mounolou. Biodiversity. John Wiley & Sons, 2003.
[16] D. D. Lewis, Y. Yang, T. G. Rose, and F. Li. RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5:361–397, 2004.
[17] M. Ley. The DBLP computer science bibliography. http://www.informatik.uni-trier.de/~ley/db/.
[18] S. Lieberson. Measuring population diversity. American Sociological Review, 34(6):850–862, 1969.
[19] C. D. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.
[20] C. Meek, B. Thiesson, and D. Heckerman. The learning-curve sampling method applied to model-based clustering. J. Mach. Learn. Res., 2:397–418, March 2002.
[21] E. Minack, W. Siberski, and W. Nejdl. Incremental diversification for very large sets: a streaming-based approach. In SIGIR ’11, Beijing, China.
[22] Olken. Random sampling from databases. Ph.D. dissertation, University of California at Berkeley, 1993.
[23] O. Papapetrou, W. Siberski, and N. Fuhr. Text clustering for peer-to-peer networks with probabilistic guarantees. LNCS, vol. 5993, pages 293–305. Springer Berlin / Heidelberg, 2010.
[24] D. Rafiei, K. Bharat, and A. Shukla. Diversifying web search results. In WWW ’10, Raleigh, USA.
[25] I. Rafols and M. Meyer. Diversity and network coherence as indicators of interdisciplinarity: case studies in bionanoscience. Scientometrics, 82(2):263–287, 2010.
[26] N. Sahoo, J. Callan, R. Krishnan, G. Duncan, and R. Padman. Incremental hierarchical clustering of text documents. In CIKM ’06.
[27] E. H. Simpson. Measurement of diversity. Nature, 163, 1949.
[28] A. Stirling. A general framework for analysing diversity in science, technology and society. Journal of The Royal Society Interface, 4(15):707–719, 2007.
[29] E. Vee, U. Srivastava, J. Shanmugasundaram, P. Bhat, and S. A. Yahia. Efficient computation of diverse query results. In ICDE ’08, Washington, DC, USA.
[30] C.-N. Ziegler, S. M. McNee, J. A. Konstan, and G. Lausen. Improving recommendation lists through topic diversification. In WWW ’05, New York, USA.
Similarity Measures
• There exists a large number of possible measures: cosine similarity, Okapi, inverted distances, etc.
• Jaccard similarity (computationally efficient): each object belongs to one of a discrete set of categories
$JS(O_i, O_j) = \frac{|O_i \cap O_j|}{|O_i \cup O_j|}$
Text 1: island maui second largest hawaiian
Text 2: tenerife largest island seven canary
Jaccard similarity: JS = 2/8 = 0.25 (2 shared words, 8 distinct words)
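The worked example can be reproduced by treating each text as a set of words. Counting the words listed for Text 1 and Text 2 gives 2 shared words out of 8 distinct ones. The sketch below assumes simple whitespace tokenization; `jaccard_similarity` is an illustrative helper, not code from the paper.

```python
def jaccard_similarity(text_a, text_b):
    """Jaccard similarity of the word sets of two whitespace-tokenized texts."""
    words_a, words_b = set(text_a.split()), set(text_b.split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


text_1 = "island maui second largest hawaiian"
text_2 = "tenerife largest island seven canary"

shared = set(text_1.split()) & set(text_2.split())
print(sorted(shared))                      # ['island', 'largest']
print(jaccard_similarity(text_1, text_2))  # 0.25
```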