Mapping Between Taxonomies
Elena Eneva
11 Dec 2001
Advanced IR Seminar
Mapping Between Taxonomies
Taxonomies: formal systems of orderly classification of knowledge, which are designed for a specific purpose
Companies, organizing information in various ways (e.g. one for marketing, another for product development)
Approach
Figure (animated build): the same documents (shown as "abc" placeholders) organized under two taxonomies, "By country" (German, French) and "By industry" (Textile, Automobile).
Datasets
Two classification schemes:
Reuters 2001 (807900 docs): Topics (127), Industry categories (871), Regions (376)
Hoovers-255 and Hoovers-28 (4286 docs): industry categories (255), industry categories (28)
Learning
2 separate methods of learning for the documents:
  Old doc category -> new doc category
  Doc contents -> new category
Combined method:
  Weighted average based on confidence
  Final result determined by a decision tree
  One combined learner, using both old category and contents as features
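A minimal sketch of the "old doc category -> new doc category" idea, assuming a plain co-occurrence count in place of the learners named in the talk. The function names and the toy training pairs are hypothetical; the confidence returned here is just a relative frequency.

```python
from collections import Counter, defaultdict

def fit_label_mapping(pairs):
    """Count how often each old category co-occurs with each new category."""
    counts = defaultdict(Counter)
    for old_cat, new_cat in pairs:
        counts[old_cat][new_cat] += 1
    return counts

def predict_new_category(counts, old_cat):
    """Return (most frequent new category, its relative frequency).

    Returns (None, 0.0) for an old category never seen in training.
    """
    seen = counts.get(old_cat)
    if not seen:
        return None, 0.0
    new_cat, n = seen.most_common(1)[0]
    return new_cat, n / sum(seen.values())

# Hypothetical training pairs (old category, new category), not the talk's data.
train = [("German", "Automobile"), ("German", "Automobile"),
         ("German", "Textile"), ("French", "Textile")]
model = fit_label_mapping(train)
```

The relative frequency can serve as the confidence that a combined method would weight by.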
Simple Learners
Simple Decision Tree (C4.5): learns probabilities of new categories based on 1 kind of feature
  Old categories (doesn’t know about documents/words)
  Word-based classification (doesn’t know about old categories)
Naïve Bayes (rainbow)
  Old categories (doesn’t know about documents/words)
  Word-based classification (doesn’t know about old categories)
Support Vector Machine (SVM-Light)
  Word-based classification (doesn’t know about old categories), linear kernel [results will be reported in the final paper]
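A minimal sketch of word-based classification with multinomial Naive Bayes and add-one smoothing. This is not rainbow's implementation; the function names and the toy documents are hypothetical, and skipping words unseen in training is an assumption made here for simplicity.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (list_of_words, category). Returns the counts NB needs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, cat in docs:
        class_counts[cat] += 1
        word_counts[cat].update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab

def classify_nb(model, words):
    """Return the category with the highest log posterior (add-one smoothing)."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_cat, best_score = None, float("-inf")
    for cat in sorted(class_counts):
        total_words = sum(word_counts[cat].values())
        score = math.log(class_counts[cat] / total_docs)
        for w in words:
            if w in vocab:  # words never seen in training are skipped
                score += math.log((word_counts[cat][w] + 1)
                                  / (total_words + len(vocab)))
        if score > best_score:
            best_cat, best_score = cat, score
    return best_cat

# Hypothetical toy documents, not from the Hoovers or Reuters data.
toy_docs = [
    ("engine car factory".split(), "Automobile"),
    ("car dealer engine".split(), "Automobile"),
    ("cotton fabric mill".split(), "Textile"),
    ("fabric dye cotton".split(), "Textile"),
]
nb_model = train_nb(toy_docs)
```

The same classifier can be run on old categories instead of words by passing the old label as a one-word document.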
Learning
Figure: documents (shown as "abc" placeholders) feed two separate learners, one using the document content (DT, NB, SVM) and one using the document labels (DT, NB, SVM).
Combined Learners
Weighted Average
  Voting scheme
Combination Decision Tree
  Takes the outputs and confidences of two of the simple learners, predicts new category
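A minimal sketch of a confidence-weighted vote. The slides do not spell out the exact weighting, so summing each learner's confidence per category is an assumption here, and the function name is hypothetical.

```python
from collections import defaultdict

def weighted_vote(predictions):
    """predictions: list of (category, confidence) pairs, one per learner.

    Sums the confidence behind each category and returns the category with
    the largest total, together with its share of the total confidence.
    """
    totals = defaultdict(float)
    for category, confidence in predictions:
        totals[category] += confidence
    total_conf = sum(totals.values())
    if total_conf == 0:
        return None, 0.0
    # sorted() makes tie-breaking deterministic (alphabetical order).
    winner = max(sorted(totals), key=lambda c: totals[c])
    return winner, totals[winner] / total_conf
```

For example, a label-based learner voting ("Textile", 0.9) against a word-based learner voting ("Automobile", 0.6) yields "Textile". A combination decision tree would instead take the same outputs and confidences as input features.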
Learning
Figure: two ways of combining. Using both the content and the label: a single DT takes both as input. Combining the two outputs: the content learner and the label learner (each DT, NB, SVM) are merged by voting or by a 3rd classifier.
Results Words Only
5-fold cross validation
Figure: bar chart "Words Only", % accuracy (axis 0 to 60) for 28p255 and 255p28; series: words only NB, words only DT.
Results Categories Only
5-fold cross validation
Figure: bar chart "Categories Only", % accuracy (axis 0 to 120) for 28p255 and 255p28; series: categs only NB, categs only DT.
Results Combination
5-fold cross validation
Figure: bar chart "Combination", % accuracy (axis 0 to 120) for 28p255 and 255p28; series: Combination Vote, Combination Comb.
Results (% accuracy)
words only    NB      DT
28p255        21.14   7.9
255p28        53.2    17.5
categs only   NB      DT
28p255        26.19   26.19
255p28        100     100
Combination   Vote    Comb
28p255        28.05   30.26
255p28        100     100
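The accuracies are reported under 5-fold cross validation. A minimal sketch of that protocol follows, assuming shuffled, near-equal folds and plain mean accuracy; the function names, the seed, and the train_fn/predict_fn interface are hypothetical, and the slides do not say how the folds were drawn.

```python
import random

def k_fold_indices(n_items, k=5, seed=0):
    """Shuffle indices 0..n_items-1 and split them into k near-equal folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validate(items, train_fn, predict_fn, k=5, seed=0):
    """items: list of (features, label), with len(items) >= k.

    Trains on k-1 folds, tests on the held-out fold, and returns the
    mean accuracy in percent over the k folds.
    """
    folds = k_fold_indices(len(items), k, seed)
    accuracies = []
    for held_out in folds:
        held = set(held_out)
        train = [items[i] for i in range(len(items)) if i not in held]
        test = [items[i] for i in held_out]
        model = train_fn(train)
        correct = sum(1 for x, y in test if predict_fn(model, x) == y)
        accuracies.append(100.0 * correct / len(test))
    return sum(accuracies) / len(accuracies)
```

Any of the simple or combined learners can be plugged in as train_fn and predict_fn.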
Remarks
Hierarchy (old classes) usually ignored
Shown that it helps
Learners are not the issue
Better way of understanding
Old label (or hierarchy path) is meta data
Remaining Work
SVM results (running even as we speak)
Repeat experiments on Reuters-2001
  Internal hierarchies
  Missing labels
  Less correlated types of classes
Results in standard evaluation format
Future Work
Try with a web dataset (Google and Yahoo! hierarchies)
Hierarchies of more levels
Meta data (for non-text sources)
Related Literature
A Study of Approaches to Hypertext Categorization, Y. Yang, S. Slattery, and R. Ghani. Journal of Intelligent Information Systems, Volume 18, Number 2, March 2002 (to appear).
Learning Mappings between Data Schemas, A. Doan, P. Domingos, and A. Levy. Proceedings of the AAAI-2000 Workshop on Learning Statistical Models from Relational Data, 2000, Austin, TX.
Questions and Suggestions
The end.
DT accuracy vs Vocabulary size
Figure: line chart of % accuracy (axis 0 to 70) against vocabulary size (10, 100, 500, 1000, 2000); series: train accuracy, test accuracy.
Taxonomies
Formal systems of orderly classification of knowledge, which are designed for a specific purpose
Change of purpose, change of taxonomies
Businesses often need and keep the information in several structures
Important to be able to automatically map between taxonomies
Useful Mappings
Companies, organizing information in various ways (e.g. one for marketing, another for product development)
Personal online bookmark classification
Search engines (e.g. Google <-> Yahoo)
EU Committee for Standardization “detailed overview of the existing taxonomies officially used in the EU, in order to derive general concepts such as: information organisation, properties, multilinguality, keywords, etc. and, last but not least, the mapping between.”