Automated Information Retrieval and Text Categorization: The RIKS Demonstrator
Acknowledge final event
November 25, 2008
Marie-Francine Moens, Erik Boiy, Javier Arias (HMDB-LIIR)
Saskia Debergh (i.Know)
Philippe De Lombaerde, Birger Fühne (UNU-CRIS)
Overview
• UNU-CRIS: The RIKS Demonstrator
• K.U.Leuven:
  – Content extraction from multilingual Web pages
  – Text categorization: machine learning approach
  – Search engine and indexing infrastructure
  – Interfacing the Acknowledge platform
• i.Know:
  – Information forensics
Acknowledge 25-11-2008
The RIKS Demonstrator
• United Nations University – Comparative Regional Integration Studies (UNU-CRIS)
• Issues addressed in research and capacity building:
  – (i) emergence of a regional (= supra-national) governance level
  – (ii) linkages with other governance levels (national, global/UN)
  – (iii) building of regional institutions
  – (iv) growing regional interdependence, etc.
• RIKS = Regional Integration Knowledge System (UNU-CRIS and GARNET NoE)
The RIKS Demonstrator
Issues addressed in the demonstrator:
How to automate retrieval and processing (cleaning, search, categorization, presentation) of particular types of relevant information in an e-learning environment?
  – ‘News’: short texts, various formats, dynamic collection, short life cycle, role of news in e-learning application
  – Treaty texts: long and complex texts, static collection, issue of accessibility
RIKS example output
Demo
K.U.Leuven: Content extraction from multilingual Web pages
• = Extracting the main content from a Web page and removing extraneous data (navigation menus, advertisements, etc.)
• Requirements of the tool:
  – Accurate
  – Generic
  – Multilingual
  – Fast
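The idea of separating main content from navigation and advertising can be illustrated with a small sketch. It is not the method of [Arias et al. submitted], which the slides do not describe; it is a simple heuristic chosen for illustration. It splits a page into text blocks and keeps the blocks that are long enough and contain little link text. Because it uses no vocabulary, it is language-independent. All names (BlockCollector, extract_main_content), the tag lists, and the thresholds are assumptions made for this example.

```python
from html.parser import HTMLParser

# Assumption: text inside these tags is never main content.
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside"}
# Assumption: these tags mark the boundary between two text blocks.
BLOCK_TAGS = {"p", "div", "li", "td", "tr", "table", "ul", "ol",
              "h1", "h2", "h3", "section", "article"}


class BlockCollector(HTMLParser):
    """Splits a page into text blocks and counts the link text in each."""

    def __init__(self):
        super().__init__()
        self.blocks = []       # list of (text, link_chars)
        self._parts = []       # text pieces of the current block
        self._link_chars = 0   # characters of the current block inside <a>
        self._skip_depth = 0   # > 0 while inside a SKIP_TAGS element
        self._in_link = 0      # > 0 while inside an <a> element

    def _flush(self):
        # Close the current block and normalize its whitespace.
        text = " ".join(" ".join(self._parts).split())
        if text:
            self.blocks.append((text, self._link_chars))
        self._parts, self._link_chars = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self._flush()
        elif tag == "a":
            self._in_link += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in BLOCK_TAGS:
            self._flush()
        elif tag == "a":
            self._in_link = max(0, self._in_link - 1)

    def handle_data(self, data):
        if self._skip_depth:
            return
        self._parts.append(data)
        if self._in_link:
            self._link_chars += len(data.strip())

    def close(self):
        super().close()
        self._flush()


def extract_main_content(html, min_chars=40, max_link_ratio=0.3):
    """Keeps blocks that are long enough and are not mostly link text."""
    collector = BlockCollector()
    collector.feed(html)
    collector.close()
    kept = [text for text, link_chars in collector.blocks
            if len(text) >= min_chars
            and link_chars / len(text) <= max_link_ratio]
    return "\n".join(kept)
```

Calling extract_main_content(page_html) on a news page would return the article paragraphs and drop short menu items and link-only teaser blocks. The thresholds trade accuracy against genericity and would need tuning per collection.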
[Arias et al. submitted]
[Arias et al. submitted]; [5] = [Gottron 2008]
K.U.Leuven: Text categorization
• Heterogeneous documentation and Google News classified into 27 categories (e.g., trade, poverty, ...)
• Supervised classifier: Multinomial Naïve Bayes, Support Vector Machine, ...
• Features:
  – different features: unigrams, bigrams, feature item sets, ...
• Additional feature selection:
  – Chi Square, Information Gain, Linear Classifier Weights, Orthogonal Centroid Feature Selection
• Different test set-ups
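Three of the ingredients listed above (unigram and bigram features, Chi Square feature selection, and a Multinomial Naïve Bayes classifier) can be combined in a minimal sketch. This is not the RIKS implementation. The two-category toy corpus, the function and class names, the whitespace tokenizer, and the choice of Laplace smoothing are all assumptions made for illustration; the real system uses 27 categories and further feature types.

```python
import math
from collections import Counter, defaultdict


def ngram_features(text):
    """Lowercased unigrams plus adjacent-word bigrams (naive whitespace split)."""
    tokens = text.lower().split()
    bigrams = [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]
    return tokens + bigrams


def chi_square_scores(docs, labels):
    """Chi-square score per feature, maximized over categories.

    Uses document-level presence/absence of the feature.
    """
    n = len(docs)
    label_counts = Counter(labels)
    feat_counts, joint = Counter(), Counter()
    for feats, label in zip(docs, labels):
        for f in sorted(set(feats)):
            feat_counts[f] += 1
            joint[f, label] += 1
    scores = {}
    for f, n_f in feat_counts.items():
        best = 0.0
        for cat, n_cat in label_counts.items():
            a = joint[f, cat]       # docs in cat containing f
            b = n_f - a             # docs outside cat containing f
            c = n_cat - a           # docs in cat without f
            d = n - a - b - c       # docs outside cat without f
            denom = (a + c) * (b + d) * (a + b) * (c + d)
            if denom:
                best = max(best, n * (a * d - c * b) ** 2 / denom)
        scores[f] = best
    return scores


class MultinomialNB:
    """Multinomial Naive Bayes with Laplace (add-alpha) smoothing."""

    def fit(self, docs, labels, vocabulary, alpha=1.0):
        self.vocab = set(vocabulary)
        counts = defaultdict(Counter)
        for feats, label in zip(docs, labels):
            counts[label].update(f for f in feats if f in self.vocab)
        self.log_prior, self.log_likelihood = {}, {}
        for cat, n_cat in Counter(labels).items():
            self.log_prior[cat] = math.log(n_cat / len(labels))
            total = sum(counts[cat].values()) + alpha * len(self.vocab)
            self.log_likelihood[cat] = {
                f: math.log((counts[cat][f] + alpha) / total)
                for f in self.vocab
            }
        return self

    def predict(self, feats):
        def score(cat):
            return self.log_prior[cat] + sum(
                self.log_likelihood[cat][f] for f in feats if f in self.vocab)
        return max(self.log_prior, key=score)


# Hypothetical toy corpus with two of the categories named on the slide.
train_texts = [
    "tariff cuts boost regional trade",
    "trade agreement lowers export tariff",
    "rural poverty rises despite aid",
    "aid programme targets urban poverty",
]
train_labels = ["trade", "trade", "poverty", "poverty"]

docs = [ngram_features(t) for t in train_texts]
scores = chi_square_scores(docs, train_labels)
# Keep the 10 highest-scoring features; ties are broken alphabetically.
vocabulary = sorted(scores, key=lambda f: (-scores[f], f))[:10]
model = MultinomialNB().fit(docs, train_labels, vocabulary)
print(model.predict(ngram_features("new trade tariff announced")))
```

Feature selection runs before training, so the classifier only estimates probabilities for the retained features. Swapping in Information Gain or another criterion from the list would only change how the vocabulary is ranked.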
K.U.Leuven: Text categorization
RIKS
K.U.Leuven: search engine
Demo
Knowing that you do not know what you should know
1. Information Forensics - Smart Indexing
  – more than just an index
  – distinguishes between concepts and relations
  – starts from unstructured text (bottom-up instead of top-down)