Content-based Clustering for Tag Cloud Visualization
ASONAM 2009
Arkaitz Zubiaga, Alberto P. García-Plaza, Víctor Fresno, Raquel Martínez
NLP & IR Group @ UNED
July 21st, 2009
May 25, 2015
Introduction
Index
1 Introduction
2 Dataset Generation
3 Our Method
4 Results
5 Conclusions
6 Future Work
NLP Group (UNED) Content-based Tag Clustering July 21st, 2009 2 / 25
Simple Tagging
Collaborative Tagging
Tag Cloud
No organization.
No relations between tags.
Our Work
Find relations between tags to organize them:
To ease visualization and search.
To ease subscribing to a group of related tags.
Previous works rely on tag co-occurrence to find relations.
What about considering web documents’ content?
Dataset Generation
Starting point: 140 most popular tags on Delicious (T140, tag cloud).
Tag monitoring: ∼6,000 documents/tag (∼840,000 documents, HTML and PDF).
Data retrieval:
Tag data for each document.
Document content.
Filtering: English-written documents with tag data available.
Result: 144,574 documents (unbalanced).
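As a quick check on the figures above, the monitored total and the share of documents that survive filtering work out as follows. The per-tag figure is approximate on the slide, so both results are approximate too.

```latex
\[
140 \times 6{,}000 = 840{,}000,
\qquad
\frac{144{,}574}{840{,}000} \approx 0.172
\]
```

Roughly 17% of the monitored documents remain after filtering.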
Our Method
Representation
Most relevant tags for each document: those reaching at least 40.7% of the top tag's weight.
Merge documents pertaining to each T140 tag.
Stopwords removal.
Stemming.
TF-IDF representation (reducing by DF).
1 vector/tag.
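The representation steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code. The function names (`relevant_tags`, `tfidf_per_tag`), the `min_df`/`max_df` cut-offs, and the toy stopword list are hypothetical. Stemming is omitted. Reading the 40.7% threshold as a fraction of the top tag's count is an assumption, and so is computing DF over the merged per-tag documents; the slides do not state the TF-IDF variant or the DF thresholds used.

```python
import math
from collections import Counter, defaultdict

# Toy stopword list (assumption); a real run would use a full English list.
STOPWORDS = {"the", "a", "of", "and", "to", "in"}

def relevant_tags(tag_counts, ratio=0.407):
    """Keep tags whose count is at least `ratio` times the top tag's count."""
    top = max(tag_counts.values())
    return {tag for tag, count in tag_counts.items() if count >= ratio * top}

def tfidf_per_tag(docs, min_df=1, max_df=None):
    """Build one TF-IDF vector per tag.

    docs: list of (text, tag_counts) pairs.
    Returns {tag: {term: weight}} as sparse vectors.
    """
    # Merge the text of every document into the bag of each relevant tag.
    merged = defaultdict(Counter)
    for text, tag_counts in docs:
        terms = [w for w in text.lower().split() if w not in STOPWORDS]
        for tag in relevant_tags(tag_counts):
            merged[tag].update(terms)

    # Document frequency, counting each merged tag bag as one document.
    n = len(merged)
    df = Counter()
    for bag in merged.values():
        df.update(bag.keys())

    # Reduce the vocabulary by DF (the cut-offs are hypothetical).
    max_df = n if max_df is None else max_df
    vocab = {term for term, d in df.items() if min_df <= d <= max_df}

    return {
        tag: {t: tf * math.log(n / df[t]) for t, tf in bag.items() if t in vocab}
        for tag, bag in merged.items()
    }
```

Each returned dictionary is a sparse vector; it would need to be expanded over a fixed vocabulary order before being fed to a clustering step.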
Clustering (SOM)
Clustering Settings
Map size: 12x12 (144 neurons).
Vectors with 17,518 dimensions.
Learning rate: 0.1.
Neighborhood: 12.
Iterations: 50,000.
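To show how these settings fit together, here is a minimal self-organizing map sketch in plain Python. Only the map size, learning rate, neighborhood, and iteration count come from the slide. The random initialization, the linear decay of learning rate and radius, the Gaussian neighborhood function, and the names `train_som` and `best_matching_unit` are assumptions, since the slides do not say which SOM implementation or schedule was used.

```python
import math
import random

def best_matching_unit(weights, x):
    """Return the grid position whose weight vector is closest to x."""
    return min(
        weights,
        key=lambda pos: sum((w - v) ** 2 for w, v in zip(weights[pos], x)),
    )

def train_som(vectors, rows=12, cols=12, lr=0.1, radius=12.0,
              iterations=50000, seed=0):
    """Train a rectangular SOM; returns {(row, col): weight vector}."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    # Random initialization in [0, 1) (assumption).
    weights = {
        (r, c): [rng.random() for _ in range(dim)]
        for r in range(rows) for c in range(cols)
    }
    for step in range(iterations):
        # Linear decay of learning rate and radius (assumption).
        frac = 1.0 - step / iterations
        cur_lr = lr * frac
        cur_radius = max(radius * frac, 0.5)
        x = rng.choice(vectors)
        bmu = best_matching_unit(weights, x)
        for (r, c), w in weights.items():
            dist2 = (r - bmu[0]) ** 2 + (c - bmu[1]) ** 2
            # Gaussian neighborhood around the best matching unit (assumption).
            h = math.exp(-dist2 / (2 * cur_radius ** 2))
            for i in range(dim):
                w[i] += cur_lr * h * (x[i] - w[i])
    return weights
```

After training, calling `best_matching_unit(weights, vector)` for each tag vector places that tag on the map, and tags that land on the same or neighboring neurons are read as related. With 17,518 dimensions and 50,000 iterations this pure-Python loop would be slow, so a vectorized implementation would be the practical choice.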
Terminology Extraction
Merge all the documents in each neuron.
Terminology extraction for each neuron.
Representative for the neuron, but not for the rest.
Language models (KLD, Kullback-Leibler Divergence).
Result: Representative terms for each neuron.
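The idea of terms that are representative for one neuron but not for the rest can be sketched with the pointwise Kullback-Leibler contribution of each term, comparing a unigram language model of the neuron against one built from all other neurons. The slides only name language models and KLD, so the exact scoring formula, the `eps` floor used as smoothing, and the function name `kld_terms` are assumptions.

```python
import math
from collections import Counter

def kld_terms(neuron_docs, target, top_k=5, eps=1e-9):
    """Rank the terms of one neuron by their KLD contribution.

    neuron_docs: {neuron_id: list of tokens merged from its documents}.
    Score of term t: P(t) * log(P(t) / Q(t)), where P is the target
    neuron's unigram model and Q is the model of all other neurons.
    """
    fg = Counter(neuron_docs[target])
    bg = Counter()
    for neuron_id, tokens in neuron_docs.items():
        if neuron_id != target:
            bg.update(tokens)

    fg_total = sum(fg.values())
    bg_total = sum(bg.values())

    scores = {}
    for term, count in fg.items():
        p = count / fg_total
        q = bg[term] / bg_total if bg_total else 0.0
        q = max(q, eps)  # floor for unseen terms (assumed smoothing)
        scores[term] = p * math.log(p / q)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```

Terms that are frequent in the target neuron and rare elsewhere score highest, while terms common to every neuron score near or below zero.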
Results
Full map available at: http://nlp.uned.es/social-tagging/
Results: Computer Science
Results: Design
Results: Cooking
Results: Coherence
Results: Terminology
Conclusions
We analyzed tag clustering and terminology extraction relying on documents’ content.
We collected the DeliciousT140 dataset.
Unlike previous works, we considered documents’ content.
The resulting map shows encouraging results, exhibiting the potential of collaborative tagging systems.
It could allow community discovery.
It eases tag cloud visualization and improves navigation and subscription.
Future Work
To compare our content-based approach to those based on tag co-occurrence.
To make a quantitative evaluation.
To semantically analyze tags (polysemy, synonymy, ...).
To extend the work to multilingual tag sets.
Thank You for Your Attention
Achiu, Arigato, Danke, Dhannvaad, Dua Netjer en ek, Efcharisto, Gracias, Gracies, Gratia, Grazie, Guishepeli, Hvala, Kiitos, Koszonom, Merce, Merci, Mila esker, Obrigado, Shukran, Shukriya, Tack, Tak, Takk, Tanan, Tapadh leat, Tesekkur ederim, Thank you, Toda