Classifying Tags Using Open Content Resources
Simon Overell, Borkur Sigurbjornsson & Roelof van Zwol
WSDM '09
Dec 27, 2015
Motivation
• Classify Flickr tags into broad categories such as what, where, when and who
• Easier indexing and navigation
• WordNet is usually used for classification but has limited coverage
Classifying Wikipedia Articles
• Using only metadata (i.e. Categories and Templates) for high scalability
• Supervised Classifier
  – Articles as objects
  – WordNet noun semantic categories as classification classes
  – Categories and Templates as features
  – Support Vector Machine (SVM) as classifier
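As a minimal sketch of the feature side of this setup, the snippet below turns an article's categories and templates into a sparse feature vector that an SVM could be trained on. The article titles, category names, template names and the `cat:`/`tpl:` prefixes are illustrative assumptions, not values taken from the slides, and the SVM training step itself is not shown.

```python
from collections import Counter

# Hypothetical toy metadata: article title -> (categories, templates).
ARTICLES = {
    "London": (["Capitals in Europe", "Cities in England"], ["Infobox settlement"]),
    "Albert Einstein": (["German physicists", "1879 births"], ["Infobox scientist"]),
}


def extract_features(categories, templates):
    """Return a sparse bag-of-features vector for one article."""
    features = Counter()
    for name in categories:
        features["cat:" + name] += 1  # prefix keeps category and template names apart
    for name in templates:
        features["tpl:" + name] += 1
    return dict(features)


def build_index(vectors):
    """Assign a stable integer column to every feature name seen."""
    names = sorted({name for vec in vectors for name in vec})
    return {name: column for column, name in enumerate(names)}


vectors = {title: extract_features(c, t) for title, (c, t) in ARTICLES.items()}
index = build_index(vectors.values())
```

Each article becomes one row, each category or template one column, and the WordNet noun category of the matching noun would serve as the training label.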
Supervised Classification
• Ground Truth
  – All Wikipedia articles that match WordNet nouns
• Data Sparsity
  – WordNet categories under-represented (10 out of 25)
  – Articles have very few features
System Optimization
• Number of arcs traversed in
  – Category network
  – Template network
• Choice of weighting function
  – Term Frequency (tf)
  – Term Frequency-Inverse Document Frequency (tf-idf)
  – Term Frequency-Inverse Layer (tf-il)
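The two tuning knobs above can be sketched as follows: walk a bounded number of arcs up the category network to collect extra features, then weight them. The slides do not define tf-il, so the sketch assumes it means term frequency divided by the layer (number of arcs from the article) at which a feature is first reached. The network, the function names and the tf definition (number of incoming arcs seen) are likewise assumptions for illustration.

```python
import math
from collections import Counter

# Hypothetical category network: node -> parent categories.
PARENTS = {
    "London": ["Capitals in Europe", "Cities in England"],
    "Capitals in Europe": ["Capitals", "Cities in Europe"],
    "Cities in England": ["Cities in Europe"],
    "Cities in Europe": ["Cities"],
}


def collect_features(article, parents, max_arcs):
    """Walk up to max_arcs arcs; return (arc counts, first layer) per feature."""
    counts, first_layer = Counter(), {}
    frontier, seen = [article], {article}
    for layer in range(1, max_arcs + 1):
        next_frontier = []
        for node in frontier:
            for parent in parents.get(node, []):
                counts[parent] += 1                    # every incoming arc counts
                first_layer.setdefault(parent, layer)  # shallowest layer wins
                if parent not in seen:                 # expand each node only once
                    seen.add(parent)
                    next_frontier.append(parent)
        frontier = next_frontier
    return counts, first_layer


def tf_idf(counts, doc_freq, n_docs):
    """Standard tf-idf: tf * log(N / df)."""
    return {f: tf * math.log(n_docs / doc_freq[f]) for f, tf in counts.items()}


def tf_il(counts, first_layer):
    """Assumed tf-il: tf scaled by the inverse of the layer of first occurrence."""
    return {f: tf / first_layer[f] for f, tf in counts.items()}


counts, layers = collect_features("London", PARENTS, max_arcs=3)
weights = tf_il(counts, layers)
```

Under this assumption, features found far from the article contribute less, which limits the noise introduced by traversing more arcs.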
Fine Tuning
• Partitioned the ground truth into training and test sets
• Criteria
  – At least 80% precision
  – Maximum possible recall
• Resulting optimal values
  – Category arcs: 3, Template arcs: 3, TF-IL
  – Precision: 87%
  – F1-Measure: 0.696
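The selection rule above (keep settings with at least 80% precision, then take the one with the highest recall) can be sketched as below. The candidate settings and their scores are hypothetical placeholders. The last lines back out the recall implied by the reported precision and F1, assuming F1 is the usual harmonic mean of precision and recall.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def recall_from_f1(precision, f1):
    """Solve F1 = 2PR / (P + R) for R."""
    return f1 * precision / (2 * precision - f1)


def pick_setting(results, min_precision=0.80):
    """results: list of (setting, precision, recall). Best recall among eligible."""
    eligible = [r for r in results if r[1] >= min_precision]
    return max(eligible, key=lambda r: r[2]) if eligible else None


# Hypothetical candidates, for illustration only.
candidates = [("A", 0.92, 0.41), ("B", 0.85, 0.55), ("C", 0.74, 0.70)]
best = pick_setting(candidates)

# Recall implied by the reported precision (0.87) and F1 (0.696).
implied_recall = recall_from_f1(0.87, 0.696)
```

With the reported figures the implied recall works out to about 0.58.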
SVM Threshold
• SVM outputs the confidence with which an article is correctly classified as a member of a category
• Training experiment with 250 Wikipedia articles (1 assessor)
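A minimal sketch of thresholding on that confidence: an article receives its best-scoring category only if the score clears the threshold, and otherwise stays unclassified. The category names, scores and threshold values are hypothetical.

```python
def classify(scores, threshold):
    """scores: {category: confidence}. Return the best category or None."""
    if not scores:
        return None
    category, confidence = max(scores.items(), key=lambda kv: kv[1])
    return category if confidence >= threshold else None


# Hypothetical per-category confidences for one article.
scores = {"person": 0.35, "location": 0.10}

lenient = classify(scores, threshold=0.2)  # favours recall
strict = classify(scores, threshold=0.5)   # favours precision
```

Raising the threshold trades coverage for precision: fewer articles are classified, but those that are carry higher confidence.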
Summary
• Optimised for Recall (ClassTag)
  – 39% of Articles classified
  – 664,770 Wikipedia articles
• Optimised for Precision (ClassTag+)
  – 21% of Articles classified
  – 338,061 Wikipedia articles
Comparison with DBpedia
• Experimental Setup
  – 300 pooled articles
  – 3 Assessors
  – Blind Assessments
  – 50 articles overlap
• Partial Agreement: 86%
• Total Agreement: 78%
Classification of Flickr Tags
• Tag -> Anchor Text
  – String matching
• Anchor Text -> Wikipedia Article
  – Number of times an anchor refers to a Wikipedia article
• Wikipedia Article -> Category
  – Output of SVM decision
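The three mapping steps can be sketched as a chain of lookups. The anchor counts, the article-to-category table and the normalisation rule (lower-case, alphanumeric characters only) are assumptions for illustration. In the system described here, the last table would come from the SVM classification of Wikipedia articles.

```python
from collections import Counter

# Hypothetical anchor statistics: anchor text -> how often it links to each article.
ANCHOR_COUNTS = {
    "George Bush": Counter({"George W. Bush": 120, "George Bush Senior": 45}),
    "London": Counter({"London": 900, "London, Ontario": 30}),
}

# Hypothetical classifier output: article -> category.
ARTICLE_CATEGORY = {
    "George W. Bush": "person",
    "George Bush Senior": "person",
    "London": "location",
    "London, Ontario": "location",
}


def normalise(text):
    """Lower-case and drop everything except letters and digits."""
    return "".join(ch for ch in text.lower() if ch.isalnum())


# Index anchors by their normalised form so that 'georgebush' matches 'George Bush'.
ANCHOR_BY_KEY = {normalise(anchor): anchor for anchor in ANCHOR_COUNTS}


def classify_tag(tag):
    """Tag -> anchor text -> most frequently linked article -> category."""
    anchor = ANCHOR_BY_KEY.get(normalise(tag))
    if anchor is None:
        return None
    article, _count = ANCHOR_COUNTS[anchor].most_common(1)[0]
    return ARTICLE_CATEGORY.get(article)
```

For example, `classify_tag("georgebush")` resolves to the anchor "George Bush", picks its most frequent target article, and returns that article's category.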
Ambiguity
• Tag -> Anchor Text
  – Some ambiguity because tags are often lower case with no white space
• Anchor Text -> Wikipedia Article
  – 13.4% of Anchor Text -> Wikipedia Article mappings are ambiguous
  – 4% of Anchor Text -> Category mappings are ambiguous
  – Example
    George Bush -> George W. Bush, George Bush Senior
    George Bush -> Person
• Wikipedia Article -> Category
  – 5.7% of classified articles result in multiple classifications
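The gap between article-level and category-level ambiguity follows from the George Bush example: an anchor can point to several articles that all share one category. A minimal sketch of measuring both rates, using hypothetical mappings and category labels:

```python
def ambiguity_rates(anchor_to_articles, article_to_category):
    """Return (share of anchors with >1 article, share with >1 category)."""
    total = len(anchor_to_articles)
    if total == 0:
        return 0.0, 0.0
    article_ambiguous = category_ambiguous = 0
    for articles in anchor_to_articles.values():
        if len(set(articles)) > 1:
            article_ambiguous += 1
        categories = {article_to_category[a] for a in articles if a in article_to_category}
        if len(categories) > 1:
            category_ambiguous += 1
    return article_ambiguous / total, category_ambiguous / total


# Hypothetical mappings, for illustration only.
anchors = {
    "George Bush": ["George W. Bush", "George Bush Senior"],
    "Java": ["Java (island)", "Java (programming language)"],
    "Paris": ["Paris"],
}
categories = {
    "George W. Bush": "person",
    "George Bush Senior": "person",
    "Java (island)": "location",
    "Java (programming language)": "communication",
    "Paris": "location",
}

article_rate, category_rate = ambiguity_rates(anchors, categories)
```

In this toy data "George Bush" is ambiguous at the article level but not at the category level, which is why the second rate is lower.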
Evaluation
• WordNet classification extended vocabulary coverage by 115%
• Taking tag frequency into account
  – ClassTag classified 69.2% of Flickr tags
  – 22% more than WordNet baseline
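The two ways of counting coverage (per distinct tag versus per tag occurrence) can be sketched as follows. The tag counts and the set of classified tags are hypothetical.

```python
def coverage(tag_counts, classified):
    """Return (vocabulary coverage, frequency-weighted coverage)."""
    vocabulary = sum(1 for tag in tag_counts if tag in classified) / len(tag_counts)
    total_uses = sum(tag_counts.values())
    weighted = sum(n for tag, n in tag_counts.items() if tag in classified) / total_uses
    return vocabulary, weighted


# Hypothetical tag usage counts and classifier output.
tag_counts = {"london": 50, "wedding": 30, "dsc0042": 15, "xyzzy123": 5}
classified = {"london", "wedding"}

vocab_cov, weighted_cov = coverage(tag_counts, classified)
```

Because popular tags are more likely to be classified, the frequency-weighted figure is typically higher than the share of distinct tags covered.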
Multilanguage Classification
• 80% of tags are in English, 7% in German and 6% in Dutch
• A portion of the unclassified tags may be non-English
• Possible approaches to classifying other languages
  – Run ClassTag on another language edition of Wikipedia with a corresponding lexicon
  – Translate the English classification using Wikipedia's interlanguage links
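The second option can be sketched as a dictionary lookup: carry each English article's category over to the article it links to in the target language. The link table and categories below are hypothetical stand-ins. Real interlanguage links would have to be read from Wikipedia itself.

```python
# Hypothetical English classification and interlanguage link table.
ENGLISH_CLASSES = {"London": "location", "Dog": "animal", "Snow": "phenomenon"}
INTERLANGUAGE = {
    "London": {"de": "London", "nl": "Londen"},
    "Dog": {"de": "Haushund", "nl": "Hond"},
    # "Snow" has no links in this toy table, so it is skipped.
}


def translate_classes(english_classes, links, lang):
    """Map target-language article titles to the English article's category."""
    translated = {}
    for title, category in english_classes.items():
        target = links.get(title, {}).get(lang)
        if target is not None:
            translated[target] = category
    return translated


dutch_classes = translate_classes(ENGLISH_CLASSES, INTERLANGUAGE, "nl")
```

Articles without a link in the target language are simply left out, so coverage in the other language depends on how complete the link table is.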
Contributions
• Classifying open content resources using their structural patterns
• Presenting ClassTag, a system for classifying tags
• ClassTag extends the WordNet lexicon using the structural patterns of Wikipedia