Quantnet Basics: Visualization, Similarity, Text Mining Lukas Borke Wolfgang Karl Härdle Ladislaus von Bortkiewicz Chair of Statistics C.A.S.E. – Center for Applied Statistics and Economics Humboldt–Universität zu Berlin http://lvb.wiwi.hu-berlin.de http://www.case.hu-berlin.de
Figure 5: Weighting vectors of the 3 rhymes in a radar chart
Vector Space Model (VSM)
Example 1: German children’s rhymes
With the weighting vectors above we get the similarity matrix:
M_S =
  1      0      0.014
  0      1      0.014
  0.014  0.014  1

And the distance matrix:

M_D =
  0      √2     1.405
  √2     0      1.405
  1.405  1.405  0
Basic VSM
- vertical vector d, indexed by terms – document representation
- matrix D = [d1, ..., dn] – document corpus representation, also called "term by document" matrix
- considering linear transformations P we get a general similarity:
  S(d1, d2) = (P d1)^T (P d2) = d1^T P^T P d2
- every mapping P defines another VSM
- M_S = D^T (P^T P) D – similarity matrix
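A minimal NumPy sketch of this construction, using a made-up 3-term, 3-document corpus (illustrative values only, not QuantNet data):

```python
import numpy as np

# Toy "term by document" matrix D: rows = terms, columns = documents.
D = np.array([[2., 0., 1.],
              [1., 1., 0.],
              [0., 3., 1.]])

# A linear mapping P on term space; P = identity recovers the basic VSM.
P = np.eye(3)

# General similarity matrix M_S = D^T (P^T P) D
M_S = D.T @ P.T @ P @ D

# With P = I this is just the matrix of dot products between documents.
assert np.allclose(M_S, D.T @ D)
```

Choosing a different P (e.g. a diagonal weighting or a projection) changes the similarity without changing the document representation itself.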
Example 2: tf and tf-idf similarities in BVSM
- with P = I_m and d = {tf(d, t1), ..., tf(d, tm)}^T we get the classical tf-similarity:
  M_S^tf = D^T D
- with diagonal P_idf(i, i) = idf(t_i) and d = {tf(d, t1), ..., tf(d, tm)}^T we get the classical tf-idf-similarity:
  M_S^tf-idf = D^T (P_idf)^T P_idf D
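Both similarities can be sketched with NumPy. The slides leave the exact idf formula open, so the common log(n/df) variant below is an assumption, as is the toy data:

```python
import numpy as np

# Columns are documents, rows are terms; entries are raw term frequencies.
D = np.array([[2., 0., 1.],
              [1., 1., 0.],
              [0., 3., 1.]])
n_docs = D.shape[1]

# tf-similarity: P = identity, so M_S^tf = D^T D
M_tf = D.T @ D

# idf(t) = log(n / df(t)) is one common variant; df(t) counts the
# documents containing term t.
df = (D > 0).sum(axis=1)
P_idf = np.diag(np.log(n_docs / df))

# tf-idf-similarity: M_S^tf-idf = D^T (P_idf)^T P_idf D
M_tfidf = D.T @ P_idf.T @ P_idf @ D
```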
Drawbacks of BVSM
- Uncorrelated/orthogonal terms in the feature space
- Documents must have common terms to be similar
- Sparseness of document vectors and similarity matrices

Question
- How to incorporate information about semantics?

Solution
- Using statistical information about term-term correlations
- Semantic smoothing
- M_S = D^T (D D^T) D – similarity matrix
- D D^T – "term by term" matrix, having a nonzero ij entry if and only if there is a document containing both the i-th and the j-th terms
- terms become semantically related if co-occurring often in the same documents
- also known as a dual space method (Sheridan and Ballerini, 1996)
- when there are fewer documents than terms – dimensionality reduction
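The effect can be seen on a tiny made-up corpus: two documents without a single common term still obtain a nonzero GVSM similarity when a third document links their terms.

```python
import numpy as np

# rows = terms, columns = documents (toy illustration)
# doc1 uses terms {1,2}, doc2 uses terms {2,3}, doc3 uses only term 3
D = np.array([[1., 0., 0.],
              [1., 1., 0.],
              [0., 1., 1.]])

# doc1 and doc3 share no term, so their BVSM similarity is zero
assert (D.T @ D)[0, 2] == 0

# GVSM: M_S = D^T (D D^T) D; the term-by-term matrix D D^T links
# term 2 and term 3 through doc2, so doc1 and doc3 become similar
M_S = D.T @ (D @ D.T) @ D
assert M_S[0, 2] > 0
```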
Generalized VSM – Semantic smoothing
- A more natural method of incorporating semantics is to use a semantic network directly
- Miller et al. (1993) used the semantic network WordNet
- Term distance in the hierarchical tree provided by WordNet gives an estimate of their semantic proximity
- Siolas and d'Alché-Buc (2000) included the semantics in the similarity matrix by handcrafting the VSM matrix P
- M_S = D^T (P^T P) D = D^T P^2 D – similarity matrix
LSI – Latent Semantic Indexing
- LSI measures semantic information through co-occurrence analysis (Deerwester et al., 1990)
- Technique – singular value decomposition (SVD) of the matrix D = U Σ V^T
- P = U_k^T = I_k U^T – projection operator onto the first k dimensions
- M_S = D^T (U I_k U^T) D – similarity matrix
- It can be shown that M_S = V Λ_k V^T, with
  D^T D = V Σ^T U^T U Σ V^T = V Λ V^T and Λ_ii = λ_i = σ_i²
  the eigenvalues of D^T D; Λ_k consists of the first k eigenvalues and zeros elsewhere.
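The construction and the stated identity can be checked numerically on a toy matrix (k = 2 chosen arbitrarily):

```python
import numpy as np

# toy "term by document" matrix, illustrative values only
D = np.array([[2., 0., 1.],
              [1., 1., 0.],
              [0., 3., 1.]])
k = 2  # number of latent dimensions kept

U, s, Vt = np.linalg.svd(D, full_matrices=False)

# M_S = D^T (U_k U_k^T) D: project documents onto the first k
# singular directions, then take dot products
M_S = D.T @ U[:, :k] @ U[:, :k].T @ D

# Verify M_S = V Lambda_k V^T with Lambda_ii = sigma_i^2 for i <= k
Lam_k = np.diag(np.concatenate([s[:k] ** 2, np.zeros(len(s) - k)]))
assert np.allclose(M_S, Vt.T @ Lam_k @ Vt)
```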
Empirical results
3 Models for the QuantNet
- Models – BVSM, GVSM and LSI
- Dataset – the whole QuantNet
- Documents – 1580 Quantlets
Figure 6: Model characteristics
Figure 7: Quantiles of similarity values of the 3 models
- Blue dots – BVSM; Green dots – GVSM; Red line – LSI
Sparseness results
                  BVSM       GVSM       LSI
Absolute number   2452444    2214064    2083526
Relative number   0.982      0.887      0.835

Matrix dimension: 2496400 (= 1580 × 1580)

Table 1: Model performance regarding the number of zero-values in the similarity matrix.
Keyword Extraction
Index Term Selection I
Goal: decrease the number of words for indexing, so that only the selected keywords describe the documents (Deerwester et al., 1990; Witten et al., 1999).

A simple method for keyword extraction is based on entropy. For all t ∈ T the entropy is defined as:

W(t) = 1 + (1 / log₂|D|) · Σ_{d∈D} P(d, t) log₂ P(d, t),

with P(d, t) = tf(d, t) / Σ_{l=1}^{n} tf(d_l, t).
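A direct implementation of this entropy weight (taking 0 · log₂ 0 = 0 by convention; the counts are made up for illustration):

```python
import math

def entropy_weight(tf_counts):
    """W(t) = 1 + (1/log2 |D|) * sum_d P(d,t) log2 P(d,t),
    with P(d,t) = tf(d,t) / sum_l tf(d_l,t)."""
    n = len(tf_counts)
    total = sum(tf_counts)
    acc = 0.0
    for c in tf_counts:
        if c > 0:                      # 0 * log2(0) is taken as 0
            p = c / total
            acc += p * math.log2(p)
    return 1.0 + acc / math.log2(n)

# A term spread uniformly over all documents gets weight 0,
# a term concentrated in a single document gets weight 1.
assert entropy_weight([1, 1, 1, 1]) == 0.0
assert entropy_weight([4, 0, 0, 0]) == 1.0
```

So W(t) ranges from 0 (uninformative, uniformly spread term) to 1 (term concentrated in one document), matching the selection rule on the next slide.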
Index Term Selection II
The entropy serves as a measure of the importance of a word in the given domain context:

W(t) is high ⇒ prefer this t as an index term.

An index term selection method (with a fixed number of index terms) is discussed in "Experiments in Term Weighting and Keyword Extraction in Document Clustering" (Borgelt et al., 2004).
Conclusion
Conclusion I
- Similarity and distance are available for extended visualization
- Different weighting scheme approaches and Vector Space Models allow adapted similarity-based text searching
- Incorporating term-term correlations and semantics significantly improves the comparison performance
- More automation and quality through index term selection
Conclusion II
Text Mining offers more models and methods, such as:

- Classification
- Clustering
- Latent Dirichlet Allocation (LDA) topic model
- TopicTiling

They are worth researching and applying to the QuantNet.
References
Borgelt, C. and Nürnberger, A.
Experiments in Term Weighting and Keyword Extraction in Document Clustering
LWA, pp. 123-130, Humboldt-Universität zu Berlin, 2004

Bostock, M., Heer, J., Ogievetsky, V. and community
D3: Data-Driven Documents
available on d3js.org, 2014

Chen, C., Härdle, W. and Unwin, A.
Handbook of Data Visualization
Springer, 2008

Elsayed, T., Lin, J. and Oard, D. W.
Pairwise Document Similarity in Large Collections with MapReduce
Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 265-268, 2008

Feldman, R. and Dagan, I.
Mining Text Using Keyword Distributions
Journal of Intelligent Information Systems, 10(3), pp. 281-300, DOI: 10.1023/A:1008623632443, 1998

Gentle, J. E., Härdle, W. and Mori, Y.
Handbook of Computational Statistics
Springer, 2nd ed., 2012
Hastie, T., Tibshirani, R. and Friedman, J.
The Elements of Statistical Learning: Data Mining, Inference, and Prediction
Springer, 2nd ed., 2009

Härdle, W. and Simar, L.
Applied Multivariate Statistical Analysis
Springer, 3rd ed., 2012

Hotho, A., Nürnberger, A. and Paass, G.
A Brief Survey of Text Mining
LDV Forum, 20(1), pp. 19-62, available on www.jlcl.org, 2005

Salton, G., Allan, J., Buckley, C. and Singhal, A.
Automatic Analysis, Theme Generation, and Summarization of Machine-Readable Texts
Science, 264(5164), pp. 1421-1426, DOI: 10.1126/science.264.5164.1421, 1994

Witten, I., Paynter, G., Frank, E., Gutwin, C. and Nevill-Manning, C.
KEA: Practical Automatic Keyphrase Extraction
DL '99: Proceedings of the Fourth ACM Conference on Digital Libraries, pp. 254-255, DOI: 10.1145/313238.313437, 1999
Appendix
Data Mining: DM
DM is the computational process of discovering/representing patterns in large data sets, involving methods at the intersection of artificial intelligence, machine learning, statistics, and database systems.

1. Numerical DM
2. Visual DM
3. Text Mining (applied to considerably weaker structured text data)
Text Mining
Text Mining or Knowledge Discovery from Text (KDT) deals with the machine-supported analysis of text (Feldman et al., 1995).

It uses techniques from:
- Information Retrieval (IR)
- Information Extraction
- Natural Language Processing (NLP)

and connects them with the methods of DM.
Similarity, Distance, Data Mining – Overview
1. Find a formal representation of the Quantlets
2. Find a similarity measure on the space of Quantlets
3. Afterwards the construction of a distance measure is simple:

distance(x, y) = √( sim(x, x) + sim(y, y) − 2 · sim(x, y) )

Having similarity and distance ⇒ a vast amount of Data Mining, Text Mining and Visualization techniques.
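Step 3 can be sketched directly in code; the dot-product similarity below is a hypothetical stand-in, not the measure actually used for the Quantlets:

```python
import math

def distance(sim, x, y):
    """distance(x, y) = sqrt(sim(x,x) + sim(y,y) - 2*sim(x,y))."""
    return math.sqrt(sim(x, x) + sim(y, y) - 2 * sim(x, y))

# With the plain dot product as similarity this recovers the
# ordinary Euclidean distance between the vectors.
dot = lambda a, b: sum(ai * bi for ai, bi in zip(a, b))
d = distance(dot, [1.0, 0.0], [0.0, 1.0])
assert abs(d - math.sqrt(2)) < 1e-12
```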
Distance measure
A frequently used distance measure is the Euclidean distance:

dist_d(d1, d2) := dist{w(d1), w(d2)} = √( Σ_{k=1}^{m} {w(d1, t_k) − w(d2, t_k)}² )

It holds for tf-idf:

cos φ = x^T y / (|x| · |y|) = 1 − (1/2) · dist²(x/|x|, y/|y|),

where x/|x| means w(d1), y/|y| means w(d2), and φ is the angle between x and y.
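The relation between the cosine and the Euclidean distance of the normalized vectors can be checked numerically (toy vectors, chosen arbitrarily):

```python
import numpy as np

x = np.array([2.0, 1.0, 0.0])
y = np.array([1.0, 3.0, 1.0])

cos_phi = (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))

# squared Euclidean distance between the normalized vectors
d2 = np.sum((x / np.linalg.norm(x) - y / np.linalg.norm(y)) ** 2)

# cos(phi) = 1 - (1/2) * dist^2(x/|x|, y/|y|)
assert np.isclose(cos_phi, 1 - d2 / 2)
```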
Figure 8: Algorithm for Computing the Pairwise Similarity Matrix

Remark: postings(t) denotes the list of documents that contain term t.

The idea: a term contributes to the similarity between two documents only if it has non-zero weights in both.
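Since the figure itself is not reproduced here, the stated idea can be sketched with a toy corpus and plain products of term weights (both made up for illustration):

```python
from collections import defaultdict
from itertools import combinations

# toy documents: term -> weight (illustrative values)
docs = {
    "d1": {"kernel": 2.0, "density": 1.0},
    "d2": {"density": 1.0, "plot": 3.0},
    "d3": {"plot": 1.0},
}

# build postings(t): the list of documents containing term t
postings = defaultdict(list)
for doc, weights in docs.items():
    for term in weights:
        postings[term].append(doc)

# accumulate pairwise similarities term by term: a term contributes
# only to pairs of documents in which it has non-zero weight
sim = defaultdict(float)
for term, plist in postings.items():
    for a, b in combinations(sorted(plist), 2):
        sim[(a, b)] += docs[a][term] * docs[b][term]

# d1 and d2 share "density"; d1 and d3 share no term at all
assert sim[("d1", "d2")] == 1.0
assert ("d1", "d3") not in sim
```

Iterating over postings lists instead of over all document pairs is what makes this efficient on sparse data: pairs without common terms are never touched.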