Quantnet Basics: Visualization, Similarity, Text Mining Lukas Borke Wolfgang Karl Härdle Ladislaus von Bortkiewicz Chair of Statistics C.A.S.E. – Center for Applied Statistics and Economics Humboldt–Universität zu Berlin http://lvb.wiwi.hu-berlin.de http://www.case.hu-berlin.de
Figure 5: Weighting vectors of the 3 rhymes in a radar chart
Vector Space Model (VSM)
Example 1: German children’s rhymes
With the weighting vectors above we get the similarity matrix:
M_S =
  1      0      0.014
  0      1      0.014
  0.014  0.014  1

And the distance matrix:

M_D =
  0      √2     1.405
  √2     0      1.405
  1.405  1.405  0
Basic VSM
- vertical vector d, indexed by terms – document representation
- matrix D = [d1, ..., dn] – document corpus representation, also called "term by document" matrix
- considering linear transformations P we get a general similarity:
  S(d1, d2) = (P d1)^T (P d2) = d1^T P^T P d2
- every mapping P defines another VSM
- M_S = D^T (P^T P) D – similarity matrix
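A minimal NumPy sketch of this construction, using a made-up 3-term, 3-document corpus (illustrative values only, not QuantNet data):

```python
import numpy as np

# Toy "term by document" matrix D: rows = terms, columns = documents.
D = np.array([[2., 0., 1.],
              [1., 1., 0.],
              [0., 3., 1.]])

# A linear mapping P on term space; P = identity recovers the basic VSM.
P = np.eye(3)

# General similarity matrix M_S = D^T (P^T P) D
M_S = D.T @ P.T @ P @ D

# With P = I this is just the matrix of dot products between documents.
assert np.allclose(M_S, D.T @ D)
```

Choosing a different P (e.g. a diagonal weighting or a projection) changes the similarity without changing the document representation itself.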
Example 2: tf and tf-idf similarities in BVSM
- with P = I_m and d = {tf(d, t1), ..., tf(d, tm)}^T we get the classical tf-similarity:
  M_S^tf = D^T D
- with diagonal P_idf(i, i) = idf(t_i) and d = {tf(d, t1), ..., tf(d, tm)}^T we get the classical tf-idf-similarity:
  M_S^tf-idf = D^T (P_idf)^T P_idf D
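Both similarities can be sketched with NumPy. The slides leave the exact idf formula open, so the common log(n/df) variant below is an assumption, as is the toy data:

```python
import numpy as np

# Columns are documents, rows are terms; entries are raw term frequencies.
D = np.array([[2., 0., 1.],
              [1., 1., 0.],
              [0., 3., 1.]])
n_docs = D.shape[1]

# tf-similarity: P = identity, so M_S^tf = D^T D
M_tf = D.T @ D

# idf(t) = log(n / df(t)) is one common variant; df(t) counts the
# documents containing term t.
df = (D > 0).sum(axis=1)
P_idf = np.diag(np.log(n_docs / df))

# tf-idf-similarity: M_S^tf-idf = D^T (P_idf)^T P_idf D
M_tfidf = D.T @ P_idf.T @ P_idf @ D
```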
Drawbacks of BVSM
- Uncorrelated/orthogonal terms in the feature space
- Documents must have common terms to be similar
- Sparseness of document vectors and similarity matrices

Question
- How to incorporate information about semantics?

Solution
- Using statistical information about term-term correlations
- Semantic smoothing
- M_S = D^T (D D^T) D – similarity matrix
- D D^T – "term by term" matrix, having a nonzero ij entry if and only if there is a document containing both the i-th and the j-th terms
- terms become semantically related if co-occurring often in the same documents
- also known as a dual space method (Sheridan and Ballerini, 1996)
- when there are fewer documents than terms – dimensionality reduction
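The effect can be seen on a tiny made-up corpus: two documents without a single common term still obtain a nonzero GVSM similarity when a third document links their terms.

```python
import numpy as np

# rows = terms, columns = documents (toy illustration)
# doc1 uses terms {1,2}, doc2 uses terms {2,3}, doc3 uses only term 3
D = np.array([[1., 0., 0.],
              [1., 1., 0.],
              [0., 1., 1.]])

# doc1 and doc3 share no term, so their BVSM similarity is zero
assert (D.T @ D)[0, 2] == 0

# GVSM: M_S = D^T (D D^T) D; the term-by-term matrix D D^T links
# term 2 and term 3 through doc2, so doc1 and doc3 become similar
M_S = D.T @ (D @ D.T) @ D
assert M_S[0, 2] > 0
```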
Generalized VSM – Semantic smoothing
- A more natural method of incorporating semantics is to use a semantic network directly
- Miller et al. (1993) used the semantic network WordNet
- Term distance in the hierarchical tree provided by WordNet gives an estimate of their semantic proximity
- Siolas and d'Alché-Buc (2000) included the semantics in the similarity matrix by handcrafting the VSM matrix P
- M_S = D^T (P^T P) D = D^T P^2 D – similarity matrix
LSI – Latent Semantic Indexing
- LSI measures semantic information through co-occurrence analysis (Deerwester et al., 1990)
- Technique – singular value decomposition (SVD) of the matrix D = U Σ V^T
- P = U_k^T = I_k U^T – projection operator onto the first k dimensions
- M_S = D^T (U I_k U^T) D – similarity matrix
- It can be shown that M_S = V Λ_k V^T, with
  D^T D = V Σ^T U^T U Σ V^T = V Λ V^T and Λ_ii = λ_i = σ_i²
  the eigenvalues of D^T D; Λ_k consists of the first k eigenvalues and zeros elsewhere.
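The construction and the stated identity can be checked numerically on a toy matrix (k = 2 chosen arbitrarily):

```python
import numpy as np

# toy "term by document" matrix, illustrative values only
D = np.array([[2., 0., 1.],
              [1., 1., 0.],
              [0., 3., 1.]])
k = 2  # number of latent dimensions kept

U, s, Vt = np.linalg.svd(D, full_matrices=False)

# M_S = D^T (U_k U_k^T) D: project documents onto the first k
# singular directions, then take dot products
M_S = D.T @ U[:, :k] @ U[:, :k].T @ D

# Verify M_S = V Lambda_k V^T with Lambda_ii = sigma_i^2 for i <= k
Lam_k = np.diag(np.concatenate([s[:k] ** 2, np.zeros(len(s) - k)]))
assert np.allclose(M_S, Vt.T @ Lam_k @ Vt)
```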
Empirical results
3 Models for the QuantNet
- Models – BVSM, GVSM and LSI
- Dataset – the whole QuantNet
- Documents – 1580 Quantlets
Figure 6: Model characteristics
Figure 7: Quantiles of similarity values of the 3 models
- Blue dots – BVSM; Green dots – GVSM; Red line – LSI
Sparseness results
                  BVSM       GVSM       LSI
Absolute number   2452444    2214064    2083526
Relative number   0.982      0.887      0.835

Matrix dimension: 2496400 (= 1580 × 1580)

Table 1: Model performance regarding the number of zero-values in the similarity matrix.
Keyword Extraction
Index Term Selection I
Goal: decrease the number of words for indexing, so that only the selected keywords describe the documents (Deerwester et al., 1990; Witten et al., 1999).

A simple method for keyword extraction is based on entropy. For all t ∈ T the entropy is defined as:

W(t) = 1 + (1 / log₂|D|) · Σ_{d∈D} P(d, t) log₂ P(d, t),

with P(d, t) = tf(d, t) / Σ_{l=1}^{n} tf(d_l, t).
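A direct implementation of this entropy weight (taking 0 · log₂ 0 = 0 by convention; the counts are made up for illustration):

```python
import math

def entropy_weight(tf_counts):
    """W(t) = 1 + (1/log2 |D|) * sum_d P(d,t) log2 P(d,t),
    with P(d,t) = tf(d,t) / sum_l tf(d_l,t)."""
    n = len(tf_counts)
    total = sum(tf_counts)
    acc = 0.0
    for c in tf_counts:
        if c > 0:                      # 0 * log2(0) is taken as 0
            p = c / total
            acc += p * math.log2(p)
    return 1.0 + acc / math.log2(n)

# A term spread uniformly over all documents gets weight 0,
# a term concentrated in a single document gets weight 1.
assert entropy_weight([1, 1, 1, 1]) == 0.0
assert entropy_weight([4, 0, 0, 0]) == 1.0
```

So W(t) ranges from 0 (uninformative, uniformly spread term) to 1 (term concentrated in one document), matching the selection rule on the next slide.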
Index Term Selection II
The entropy serves as a measure of the importance of a word in the given domain context:

W(t) is high ⇒ prefer this t as an index term.

An index term selection method (with a fixed number of index terms) is discussed in "Experiments in Term Weighting and Keyword Extraction in Document Clustering" (Borgelt et al., 2004).
Conclusion
Conclusion I
- Similarity and distance are available for extended visualization
- Different weighting scheme approaches and Vector Space Models allow adapted similarity-based text searching
- Incorporating term-term correlations and semantics significantly improves the comparison performance
- More automation and quality through index term selection
Conclusion II
Text Mining offers more models and methods, such as:

- Classification
- Clustering
- Latent Dirichlet Allocation (LDA) topic model
- TopicTiling

They are worth researching and applying to the QuantNet.
References
Borgelt, C. and Nürnberger, A.
Experiments in Term Weighting and Keyword Extraction in Document Clustering
LWA, pp. 123-130, Humboldt-Universität zu Berlin, 2004

Bostock, M., Heer, J., Ogievetsky, V. and community
D3: Data-Driven Documents
available on d3js.org, 2014

Chen, C., Härdle, W. and Unwin, A.
Handbook of Data Visualization
Springer, 2008

Elsayed, T., Lin, J. and Oard, D. W.
Pairwise Document Similarity in Large Collections with MapReduce
Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 265-268, 2008

Feldman, R. and Dagan, I.
Mining Text Using Keyword Distributions
Journal of Intelligent Information Systems, 10(3), pp. 281-300, DOI: 10.1023/A:1008623632443, 1998

Gentle, J. E., Härdle, W. and Mori, Y.
Handbook of Computational Statistics
Springer, 2nd ed., 2012
Hastie, T., Tibshirani, R. and Friedman, J.
The Elements of Statistical Learning: Data Mining, Inference, and Prediction
Springer, 2nd ed., 2009

Härdle, W. and Simar, L.
Applied Multivariate Statistical Analysis
Springer, 3rd ed., 2012

Hotho, A., Nürnberger, A. and Paass, G.
A Brief Survey of Text Mining
LDV Forum, 20(1), pp. 19-62, available on www.jlcl.org, 2005

Salton, G., Allan, J., Buckley, C. and Singhal, A.
Automatic Analysis, Theme Generation, and Summarization of Machine-Readable Texts
Science, 264(5164), pp. 1421-1426, DOI: 10.1126/science.264.5164.1421, 1994

Witten, I., Paynter, G., Frank, E., Gutwin, C. and Nevill-Manning, C.
KEA: Practical Automatic Keyphrase Extraction
DL '99: Proceedings of the Fourth ACM Conference on Digital Libraries, pp. 254-255, DOI: 10.1145/313238.313437, 1999
Appendix
Data Mining: DM
DM is the computational process of discovering/representing patterns in large data sets, involving methods at the intersection of artificial intelligence, machine learning, statistics, and database systems.

1. Numerical DM
2. Visual DM
3. Text Mining (applied to considerably weaker structured text data)
Text Mining
Text Mining or Knowledge Discovery from Text (KDT) deals with the machine-supported analysis of text (Feldman et al., 1995).

It uses techniques from:
- Information Retrieval (IR)
- Information Extraction
- Natural Language Processing (NLP)

and connects them with the methods of DM.
Similarity, Distance, Data Mining – Overview
1. Find a formal representation of the Quantlets
2. Find a similarity measure on the space of Quantlets
3. Afterwards the construction of a distance measure is simple:

distance(x, y) = √( sim(x, x) + sim(y, y) − 2 · sim(x, y) )

Having similarity and distance ⇒ a vast amount of Data Mining, Text Mining and Visualization techniques.
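Step 3 can be sketched directly in code; the dot-product similarity below is a hypothetical stand-in, not the measure actually used for the Quantlets:

```python
import math

def distance(sim, x, y):
    """distance(x, y) = sqrt(sim(x,x) + sim(y,y) - 2*sim(x,y))."""
    return math.sqrt(sim(x, x) + sim(y, y) - 2 * sim(x, y))

# With the plain dot product as similarity this recovers the
# ordinary Euclidean distance between the vectors.
dot = lambda a, b: sum(ai * bi for ai, bi in zip(a, b))
d = distance(dot, [1.0, 0.0], [0.0, 1.0])
assert abs(d - math.sqrt(2)) < 1e-12
```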
Distance measure
A frequently used distance measure is the Euclidean distance:

dist_d(d1, d2) := dist{w(d1), w(d2)} = √( Σ_{k=1}^{m} {w(d1, t_k) − w(d2, t_k)}² )

It holds for tf-idf:

cos φ = x^T y / (|x| · |y|) = 1 − (1/2) · dist²(x/|x|, y/|y|),

where x/|x| means w(d1), y/|y| means w(d2), and φ is the angle between x and y.
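The relation between the cosine and the Euclidean distance of the normalized vectors can be checked numerically (toy vectors, chosen arbitrarily):

```python
import numpy as np

x = np.array([2.0, 1.0, 0.0])
y = np.array([1.0, 3.0, 1.0])

cos_phi = (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))

# squared Euclidean distance between the normalized vectors
d2 = np.sum((x / np.linalg.norm(x) - y / np.linalg.norm(y)) ** 2)

# cos(phi) = 1 - (1/2) * dist^2(x/|x|, y/|y|)
assert np.isclose(cos_phi, 1 - d2 / 2)
```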
Figure 8: Algorithm for Computing the Pairwise Similarity Matrix

Remark: postings(t) denotes the list of documents that contain term t.

The idea: a term contributes to the similarity between two documents only if it has non-zero weights in both.
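Since the figure itself is not reproduced here, the stated idea can be sketched with a toy corpus and plain products of term weights (both made up for illustration):

```python
from collections import defaultdict
from itertools import combinations

# toy documents: term -> weight (illustrative values)
docs = {
    "d1": {"kernel": 2.0, "density": 1.0},
    "d2": {"density": 1.0, "plot": 3.0},
    "d3": {"plot": 1.0},
}

# build postings(t): the list of documents containing term t
postings = defaultdict(list)
for doc, weights in docs.items():
    for term in weights:
        postings[term].append(doc)

# accumulate pairwise similarities term by term: a term contributes
# only to pairs of documents in which it has non-zero weight
sim = defaultdict(float)
for term, plist in postings.items():
    for a, b in combinations(sorted(plist), 2):
        sim[(a, b)] += docs[a][term] * docs[b][term]

# d1 and d2 share "density"; d1 and d3 share no term at all
assert sim[("d1", "d2")] == 1.0
assert ("d1", "d3") not in sim
```

Iterating over postings lists instead of over all document pairs is what makes this efficient on sparse data: pairs without common terms are never touched.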