Arash Joorabchi & Abdulhussain E. Mahdi, Department of Electronic and Computer Engineering, University of Limerick, Ireland. A New Unsupervised Approach to Automatic Topical Indexing of Scientific Documents According to Library Controlled Vocabularies. ALISE 2013. Work Supported by:
Arash Joorabchi & Abdulhussain E. Mahdi
Department of Electronic and Computer Engineering
University of Limerick, Ireland
A New Unsupervised Approach to Automatic Topical
Indexing of Scientific Documents According to
Library Controlled Vocabularies
ALISE 2013
Work Supported by:
Subject (Topical) Metadata in Libraries
• Un-controlled
Unrestricted author and/or reader-assigned keywords and keyphrases,
such as:
– Index Term-Uncontrolled (MARC-653)
• Controlled
Restricted cataloguer-assigned classes and subject headings, such as:
– DDC (MARC-082)
– LCC (MARC-050)
– LCSH/FAST (MARC-650)
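As a minimal sketch, the field tags listed above can be pictured as keys of a simple record. The dictionary layout, the names `SUBJECT_FIELDS` and `split_subject_metadata`, and the sample record are illustrative assumptions, not an actual MARC serialization.

```python
# Hypothetical, simplified view of the subject-related MARC fields named above.
SUBJECT_FIELDS = {
    "653": ("uncontrolled", "Index Term-Uncontrolled"),
    "082": ("controlled", "DDC"),
    "050": ("controlled", "LCC"),
    "650": ("controlled", "LCSH/FAST"),
}

def split_subject_metadata(record):
    """Group a record's subject fields into controlled and uncontrolled."""
    groups = {"controlled": {}, "uncontrolled": {}}
    for tag, values in record.items():
        if tag in SUBJECT_FIELDS:
            kind, _label = SUBJECT_FIELDS[tag]
            groups[kind][tag] = values
    return groups

# Toy record: tag 245 (title) is not a subject field and is ignored.
record = {
    "245": ["Some title"],
    "653": ["HP 9000"],
    "650": ["HP 9000 (Computer)"],
}
groups = split_subject_metadata(record)
```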
The Case of Scientific Digital Libraries & Repositories
Archived materials include: journal articles, conference papers, technical
reports, theses & dissertations, book chapters, etc.
• Un-controlled Subject Metadata:
– Commonly available when enforced by editors, e.g., in case of published
journal articles & conf. proceedings, but rare in unedited publications.
– Inconsistent
• Controlled Subject Metadata:
– Rare due to the sheer volume of new materials published and high cost of
cataloguing.
– High level of incompleteness and inaccuracy due to oversimplified classification
rules, e.g., IF published by the Dept. of Computer Science THEN DDC: 004,
LCSH: Computer science
Automatic Subject Metadata Generation in Scientific Digital Libraries
& Repositories
Aims to provide a fully/semi automated alternative to manual
classification.
1. Supervised (ML-based) Approach:
– utilizing generic machine learning algorithms for text classification (e.g., NB, SVM, DT).
– challenged by the large scale & complexity of library classification schemes, e.g., deep
hierarchy, skewed data distribution, data sparseness, and concept drift [Jun Wang ’09].
2. Unsupervised (String Matching-based) Approach:
– String-to-string matching between words in a term list extracted from library thesauri &
classification schemes, and words in the text to be classified.
– Inferior performance compared to supervised methods [Golub et al. ’06].
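The string-matching approach (item 2) can be sketched roughly as follows. This is only an illustration under assumptions: the tokenizer, the scoring rule (a term counts when all of its words occur in the text), and the tiny `term_lists` sample are hypothetical and not taken from any particular system.

```python
import re
from collections import Counter

def tokenize(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def string_match_classify(text, term_lists):
    """Score each class by the terms whose words all occur in the text."""
    words = Counter(tokenize(text))
    scores = {}
    for class_id, terms in term_lists.items():
        score = 0
        for term in terms:
            term_words = tokenize(term)
            # A term matches only if every one of its words appears in the text.
            if term_words and all(w in words for w in term_words):
                score += min(words[w] for w in term_words)
        if score:
            scores[class_id] = score
    # Highest score first; ties broken by class identifier.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

# Toy term list keyed by DDC class number (captions as used later in the slides).
term_lists = {
    "006.35": ["natural language processing"],
    "005.117": ["object-oriented programming"],
}
text = "We study natural language processing; language models help processing."
ranking = string_match_classify(text, term_lists)
```

Because the match is made on surface strings, synonyms and ambiguous words are not resolved, which is one plausible reason for the weaker performance noted above.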
A New Unsupervised Concept-to-Concept Matching Approach - An
Overview
[Figure: overview diagram of the approach. Recoverable labels: Paper/Article (Full Text); Wikipedia Concepts; Ranking; Key Concepts; WorldCat Database; MARC records sharing a key concept(s) with the paper/article; Inference; DDC; FAST; Paper/Article (MARC Rec.) with fields 653: {…}, 082: {…}, 650: {…}]
Paper/Article (MARC Rec.), example:
653: {Wikipedia: HP 9000}
650: {FAST: HP 9000 (Computer)}
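The flow in the overview (key concepts, then catalogue records sharing a key concept, then inference of DDC and FAST) might be pictured with the sketch below. The slides do not specify the inference step, so the weighted vote used here is an assumption; the function name `infer_subjects`, the concept weights, and the in-memory `catalogue` are hypothetical stand-ins, and no real WorldCat interface is used.

```python
from collections import Counter

def infer_subjects(key_concepts, catalogue, top_n=1):
    """Vote for DDC classes (082) and FAST headings (650).

    key_concepts: dict mapping a concept to a rank weight (hypothetical scores).
    catalogue: list of dicts with "653", "082" and "650" lists (toy MARC records).
    """
    ddc_votes, fast_votes = Counter(), Counter()
    for rec in catalogue:
        # Only records sharing at least one key concept with the paper take part.
        shared = set(rec.get("653", [])) & set(key_concepts)
        if not shared:
            continue
        weight = sum(key_concepts[c] for c in shared)
        for ddc in rec.get("082", []):
            ddc_votes[ddc] += weight
        for heading in rec.get("650", []):
            fast_votes[heading] += weight
    return {
        "082": [d for d, _ in ddc_votes.most_common(top_n)],
        "650": [h for h, _ in fast_votes.most_common(top_n)],
    }

# Toy data: weights and records are made up for illustration.
key_concepts = {"HP 9000": 0.9, "Unix": 0.4}
catalogue = [
    {"653": ["HP 9000"], "082": ["004"], "650": ["HP 9000 (Computer)"]},
    {"653": ["HP 9000", "Unix"], "082": ["005.43"], "650": ["HP 9000 (Computer)"]},
    {"653": ["Fourier analysis"], "082": ["515.2433"], "650": ["Fourier analysis"]},
]
result = infer_subjects(key_concepts, catalogue)
```

In this toy run the second record shares two key concepts and so carries the largest weight, which decides the DDC vote.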
Wikipedia as a Crowd-Sourced Controlled Vocabulary
Extensive topic/concept coverage (more than 4 million English articles)
Up-to-date (lags Twitter by ~3h on major events [Osborne et al.’12])
Rich knowledge source for NLP (semantic relatedness, word sense
Method | Learning Approach | Number of Keyphrases Assigned per Document, nk | Avg. Inter-consistency with Human Annotators (%)
Maui | Bagging decision trees (13 best features) | 5 | 23.6  31.6  37.9
Current work (LOOCV) | GA, threshold=800, unique bests method | 5 | 12.3  32.8  58.1
Current work (LOOCV) | GA, threshold=200, unique bests method | 5 | 13.9  32.9  56.7
Current work (LOOCV) | GA, threshold=400, unique bests method | 5 | 14.0  33.5  58.1
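The slides do not say how inter-consistency is computed. A common choice in keyphrase indexing evaluations is Rolling's measure, 2C / (A + B), where C is the number of keyphrases two indexers share and A and B are the sizes of their sets. The sketch below assumes that measure; the function names and the toy keyphrase sets are hypothetical.

```python
def rolling_consistency(set_a, set_b):
    """Rolling's inter-indexer consistency: 2C / (A + B)."""
    a, b = set(set_a), set(set_b)
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))

def avg_consistency_with_annotators(predicted, annotator_sets):
    """Average consistency of one predicted set against each human annotator."""
    scores = [rolling_consistency(predicted, s) for s in annotator_sets]
    return sum(scores) / len(scores)

# Toy example with nk = 5 predicted keyphrases and two annotators.
predicted = {"a", "b", "c", "d", "e"}
annotators = [{"a", "b", "x"}, {"a", "b", "c", "d", "e"}]
pairwise = rolling_consistency(predicted, annotators[0])
average = avg_consistency_with_annotators(predicted, annotators)
```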
– Joorabchi, A. and Mahdi, A. Automatic Subject Metadata Generation for Scientific Documents Using Wikipedia and Genetic Algorithms. In Proceedings of the 18th International Conference on Knowledge Engineering and Knowledge Management (EKAW 2012)
– Joorabchi, A. and Mahdi, A. Automatic Keyphrase Annotation of Scientific Documents Using Wikipedia and Genetic Algorithms. To appear in the Journal of Information Science
*Automatic Classification Toolbox for Digital Libraries (ACT-DL) by Bielefeld University Library and deployed at Bielefeld Academic Search Engine (BASE)
Doc ID | Predicted DDC (by current method) | True DDC | Predicted DDC (by ACT-DL*)
 | 519.542 Decision theory | ✓ |
 | 006.35 Natural language processing | ✓ |
7183 | 006.333 Deduction, problem solving, reasoning | ✓ | 004
7502 | 005.131 Symbolic logic | 006.333 Deduction, problem solving, reasoning | 004
9307 | 005.757--0218 Object-oriented databases--Standards | 005.757 Object-oriented databases | 004
10894 | 621.3815--0287 Components and circuits--Testing and measurement | 005.14 Verification, testing, measurement, debugging | 004
12049 | 005.43 Systems programs | 005.453 Compilers | 004
13259 | 001.6443 (invalid in DDC22 & DDC23) | 001.4226 Presentation of statistical data | 000
16393 | 004.53 Internal storage (Main memory) | 005.435 Memory management programs | 004
18209 | 005.115 Logic programming | ✓ | 004
 | 511.322 Set theory | ✓ |
 | 005.275 Programming for multiprocessor computers | ✓ |
 | 004.35 Multiprocessing | ✓ |
 | 004.33 Real-time processing | ✓ |
23267 | 005.117 Object-oriented programming | ✓ | 004
23507 | 495.6--5 Japanese--Grammar | 006.35 Natural language processing | 400
23596 | 658.4036--028546 Group decision making--Computer communications | ✓ | 150
 | 515.2433 Fourier and harmonic analysis | ✓ |
 | below threshold | 006.37 Computer vision |
37632 | 005.14 Verification, testing, measurement, debugging | ✓ | 004
39172 | 006.4--015116 Computer pattern recognition--Combinatorics | ✓ | 510
39955 | 005.117 Object-oriented programming | ✓ | 150
40879 | 004 Computer science | 006.31 Machine learning | 004
43032 | 005.262 Programming in specific programming languages | 005.26 Programming for personal computers | 004