Classifying Tags Using Open Content Resources Simon Overell, Borkur Sigurbjornsson & Roelof van Zwol WSDM ‘09

Dec 27, 2015

Transcript
Page 1: Classifying Tags Using Open Content Resources
Simon Overell, Borkur Sigurbjornsson & Roelof van Zwol
WSDM ‘09

Page 2: Motivation
• Classify tags in Flickr as broad categories such as what, where, when and who
• Easier indexing and navigation
• WordNet is usually used for classification but has limited coverage

Page 3: Example

Page 4: The ClassTag System

Page 5: Classifying Wikipedia Articles
• Using only metadata (i.e. Categories and Templates) – high scalability
• Supervised Classifier
    – Articles as objects
    – WordNet noun semantic categories as classification classes
    – Categories and Templates as features
    – Support Vector Machine (SVM) as classifier
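
The setup on this slide can be illustrated with a minimal sketch of the data representation: each article is an object, its categories and templates become binary features, and its label is a WordNet noun semantic category. The article names, category and template names, labels and the `cat:`/`tpl:` prefixes below are hypothetical; the slides do not show the actual feature encoding, and the SVM training step itself is left out.

```python
from typing import Dict, List, Tuple

# Hypothetical toy metadata: article -> (categories, templates).
ARTICLES: Dict[str, Tuple[List[str], List[str]]] = {
    "London": (["Capitals in Europe"], ["Infobox settlement"]),
    "Albert Einstein": (["German physicists"], ["Infobox scientist"]),
}

# Hypothetical ground-truth labels (WordNet noun semantic categories).
LABELS: Dict[str, str] = {"London": "location", "Albert Einstein": "person"}


def features(article: str) -> Dict[str, int]:
    """Binary bag of metadata features; the article text is never read."""
    categories, templates = ARTICLES[article]
    feats = {f"cat:{c}": 1 for c in categories}
    feats.update({f"tpl:{t}": 1 for t in templates})
    return feats


def training_set() -> List[Tuple[Dict[str, int], str]]:
    """(feature vector, class label) pairs, the input a supervised classifier expects."""
    return [(features(a), LABELS[a]) for a in sorted(LABELS)]
```

Because only metadata is read, building a feature vector costs a few dictionary lookups per article, which is one plausible reading of the "high scalability" claim.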

Page 6: Categories and Templates

Page 7: Categories and Templates

Page 8: Supervised Classification
• Ground Truth
    – All Wikipedia articles that match WordNet nouns
• Data Sparsity
    – WordNet categories underrepresented (10 out of 25)
    – Articles have very few features

Page 9: Reducing Data Sparsity
• Using category and template network transclusion
• … but noise is added
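
One way to picture the expansion is a bounded breadth-first walk up the category network, recording the layer (number of arcs) at which each category is first reached. This is a sketch under assumptions: the `PARENTS` graph is a hypothetical fragment, and the slides do not say how the real system traverses the network.

```python
from collections import deque
from typing import Dict, List

# Hypothetical fragment of a category network: node -> parent categories.
PARENTS: Dict[str, List[str]] = {
    "London": ["Capitals in Europe"],
    "Capitals in Europe": ["Capitals", "Cities in Europe"],
    "Capitals": ["Cities"],
    "Cities in Europe": ["Cities", "Europe"],
}


def expand(article: str, max_arcs: int) -> Dict[str, int]:
    """Map each category reachable within max_arcs to the layer where it is first seen."""
    layer_of: Dict[str, int] = {}
    queue = deque([(article, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_arcs:
            continue  # do not traverse beyond the arc limit
        for parent in PARENTS.get(node, []):
            if parent not in layer_of:
                layer_of[parent] = depth + 1
                queue.append((parent, depth + 1))
    return layer_of
```

With one arc, "London" has a single feature; with three arcs it has five, including broad categories such as "Europe" that say little about the article. That is the trade-off the slide describes: more features, but also more noise.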

Page 10: System Optimization
• Number of arcs traversed in
    – Category network
    – Template network
• Choice of weighting function
    – Term Frequency (tf)
    – Term Frequency – Inverse Document Frequency (tf-idf)
    – Term Frequency – Inverse Layer (tf-il)
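
The three weighting functions can be compared on a small example. The tf and tf-idf definitions below are the standard ones. The slides name tf-il but do not define it, so the form used here (each occurrence contributes 1 / layer, so features found further from the article count less) is purely an assumption for illustration.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

Occurrence = Tuple[str, int]  # (feature name, layer at which it was reached)


def tf(occurrences: List[Occurrence]) -> Dict[str, float]:
    """Term frequency: how often each feature was reached."""
    counts = Counter(feature for feature, _ in occurrences)
    return {feature: float(n) for feature, n in counts.items()}


def tf_idf(occurrences: List[Occurrence], doc_freq: Dict[str, int], n_docs: int) -> Dict[str, float]:
    """Standard tf-idf: damp features shared by many articles."""
    return {f: w * math.log(n_docs / doc_freq[f]) for f, w in tf(occurrences).items()}


def tf_il(occurrences: List[Occurrence]) -> Dict[str, float]:
    """ASSUMED form of tf-il: each occurrence contributes 1 / layer."""
    weights: Dict[str, float] = {}
    for feature, layer in occurrences:
        weights[feature] = weights.get(feature, 0.0) + 1.0 / layer
    return weights
```

Under this assumed form, a category reached twice at layer 3 weighs less than a category reached once at layer 1, which would counteract some of the noise introduced by deep traversal.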

Page 11: Example

Page 12: Fine Tuning
• Partitioned the ground truth into training and test sets
• Criteria
    – At least 80% precision
    – Maximum possible recall
• Resulting optimal values
    – Category arcs: 3, Template arcs: 3, TF-IL
    – Precision: 87%
    – F1-Measure: 0.696
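
The selection criterion and the reported figures can be checked with a short sketch. The `pick` helper and its configuration dictionaries are hypothetical; only the 80% precision floor, the 87% precision and the 0.696 F1 come from the slide. The recall value is derived here and is not stated in the slides.

```python
from typing import Dict, List, Optional


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def recall_from(precision: float, f1_score: float) -> float:
    """Invert the F1 formula to recover recall."""
    return f1_score * precision / (2 * precision - f1_score)


def pick(configs: List[Dict], min_precision: float = 0.80) -> Optional[Dict]:
    """Among configurations meeting the precision floor, return the one with the highest recall."""
    ok = [c for c in configs if c["precision"] >= min_precision]
    return max(ok, key=lambda c: c["recall"]) if ok else None


# Recall implied by the reported precision (87%) and F1 (0.696); derived, not from the slide.
implied_recall = recall_from(0.87, 0.696)
```

Solving F1 = 2PR / (P + R) for R with P = 0.87 and F1 = 0.696 gives a recall of roughly 0.58 for the chosen configuration.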

Page 13: SVM Threshold
• SVM outputs confidence with which an article is correctly classified as a member of a category
• Training experiment with 250 Wikipedia articles (1 assessor)
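
Thresholding the classifier's confidence can be sketched as follows. The score values and category names are hypothetical, and the slides do not give the thresholds actually used; the function simply keeps every category whose confidence reaches the threshold.

```python
from typing import Dict, List


def classify(scores: Dict[str, float], threshold: float) -> List[str]:
    """Return the categories whose confidence is at or above the threshold."""
    return sorted(c for c, s in scores.items() if s >= threshold)


# Hypothetical per-category confidences for one article.
example_scores = {"location": 0.40, "group": 0.45, "person": -0.20}
```

Raising the threshold leaves more articles unclassified but makes the remaining decisions more reliable, so a single trained model can be tuned towards recall or towards precision.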

Page 14: SVM Threshold

Page 15: SVM Threshold

Page 16: Summary
• Optimised for Recall (ClassTag)
    – 39% of Articles classified
    – 664,770 Wikipedia articles
• Optimised for Precision (ClassTag+)
    – 21% of Articles classified
    – 338,061 Wikipedia articles

Page 17: Comparison with DBpedia
• Experimental Setup
    – 300 pooled articles
    – 3 Assessors
    – Blind Assessments
    – 50 articles overlap
• Partial Agreement:
    – 86%
• Total Agreement:
    – 78%

Page 18: Results

Page 19: Classification of Flickr Tags
• Tag -> Anchor Text
    – String matching
• Anchor Text -> Wikipedia Article
    – Number of times an anchor refers to a Wikipedia article
• Wikipedia Article -> Category
    – Output of SVM decision
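
The three mapping steps on this slide can be chained as in the sketch below. All data is hypothetical, and two choices are assumptions rather than details from the slides: tags are matched to anchors after lower-casing and removing white space, and the anchor is resolved to the article it links to most often.

```python
from typing import Dict, Optional

# Hypothetical anchor statistics: anchor text -> {article: number of links using that anchor}.
ANCHORS: Dict[str, Dict[str, int]] = {
    "George Bush": {"George W. Bush": 120, "George Bush Senior": 45},
    "London": {"London": 900, "London, Ontario": 30},
}

# Hypothetical output of the article classifier.
ARTICLE_CATEGORY: Dict[str, str] = {
    "George W. Bush": "person",
    "George Bush Senior": "person",
    "London": "location",
}


def normalise(text: str) -> str:
    """Lower-case and drop white space, since tags often look like 'georgebush'."""
    return "".join(text.lower().split())


ANCHOR_INDEX = {normalise(a): a for a in ANCHORS}


def classify_tag(tag: str) -> Optional[str]:
    """Tag -> anchor text -> Wikipedia article -> category, or None if any step fails."""
    anchor = ANCHOR_INDEX.get(normalise(tag))  # string matching
    if anchor is None:
        return None
    counts = ANCHORS[anchor]
    article = max(counts, key=counts.get)  # most frequently linked article
    return ARTICLE_CATEGORY.get(article)  # classifier decision for that article
```

For example, `classify_tag("georgebush")` resolves to the anchor "George Bush", then to the most linked article, and finally to its category.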

Page 20: Ambiguity
• Tag -> Anchor Text
    – Some ambiguity because often tags are lower case with no white spaces
• Anchor Text -> Wikipedia Article
    – 13.4% of Anchor text -> Wikipedia Article mappings ambiguous
    – 4% of Anchor text -> Category mappings ambiguous
    – Example
        George Bush -> George W. Bush, George Bush Senior
        George Bush -> Person
• Wikipedia Article -> Category
    – 5.7% of classified articles result in multiple classifications
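
The gap between the two ambiguity figures follows from the George Bush example: an anchor can point to several articles that all share one category. A small sketch with hypothetical data (the "Java" and "Flickr" entries and all category labels are made up for illustration) shows how the two rates would be measured.

```python
from typing import Dict, Set

# Hypothetical anchor -> articles mapping; the first entry mirrors the slide's example.
ANCHOR_ARTICLES: Dict[str, Set[str]] = {
    "George Bush": {"George W. Bush", "George Bush Senior"},
    "Java": {"Java (island)", "Java (programming language)"},
    "Flickr": {"Flickr"},
}

# Hypothetical article -> category decisions.
CATEGORY: Dict[str, str] = {
    "George W. Bush": "person",
    "George Bush Senior": "person",
    "Java (island)": "location",
    "Java (programming language)": "communication",
    "Flickr": "artifact",
}


def ambiguous_fraction(mapping: Dict[str, Set[str]]) -> float:
    """Share of anchors that map to more than one target."""
    return sum(1 for targets in mapping.values() if len(targets) > 1) / len(mapping)


# Collapse each anchor's articles to the set of categories they belong to.
ANCHOR_CATEGORIES: Dict[str, Set[str]] = {
    anchor: {CATEGORY[a] for a in articles}
    for anchor, articles in ANCHOR_ARTICLES.items()
}
```

In this toy data two of three anchors are ambiguous at the article level, but only one remains ambiguous once articles are collapsed to categories, which is the same direction as the drop from 13.4% to 4% reported on the slide.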

Page 21: Example

Page 22: Evaluation
• WordNet classification extended vocabulary coverage by 115%
• Taking tag frequency into account
    – ClassTag classified 69.2% of Flickr tags
    – 22% more than WordNet baseline
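
The slide reports coverage in two ways: over the vocabulary of distinct tags, and over tag occurrences once frequency is taken into account. A minimal sketch of the two measures, using hypothetical tag counts, shows why they can differ widely.

```python
from typing import Dict, Set


def vocabulary_coverage(tag_counts: Dict[str, int], classified: Set[str]) -> float:
    """Fraction of distinct tags that receive a category."""
    return sum(1 for t in tag_counts if t in classified) / len(tag_counts)


def frequency_coverage(tag_counts: Dict[str, int], classified: Set[str]) -> float:
    """Fraction of tag occurrences that receive a category."""
    total = sum(tag_counts.values())
    return sum(n for t, n in tag_counts.items() if t in classified) / total


# Hypothetical sample: a few popular tags and a tail of rare ones.
sample_counts = {"london": 70, "beach": 20, "img1234": 5, "xyzzy": 5}
sample_classified = {"london", "beach"}
```

In this toy sample half of the distinct tags are classified, yet they account for 90% of tag occurrences, because popular tags are more likely to have a matching Wikipedia anchor.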

Page 23: Tag distribution

Page 24: Multilanguage Classification
• 80% of tags in English, 7% in German and 6% in Dutch
• Maybe a portion of the unclassified tags fall into this category
• Possible alternate language classification
    – Run ClassTag using alternate Wikipedia language and a corresponding lexicon
    – Translate the English classification using Wikipedia’s interlanguage links

Page 25: Contributions
• Classifying open content resources using their structural patterns
• Presenting ClassTag - a system for classifying tags
• ClassTag extends the WordNet lexicon using the structural patterns of Wikipedia

Page 26: Conclusion
• Tuneable system for classifying Wikipedia pages
• ClassTag: Nearly 40% of articles classified with a precision of 72%
• ClassTag+: 21% of articles classified with a precision of 86% (equal to assessor agreement)
• Nearly 70% of Flickr tags matched to WordNet categories