
Acknowledge 07 Automated Retrieval And Categorization Of Texts In An E Learning Environment The Riks Demonstrator

Dec 18, 2014


Olivier Rits

 
Page 1: Acknowledge 07 Automated Retrieval And Categorization Of Texts In An E Learning Environment   The Riks Demonstrator


Automated Information Retrieval and Text Categorization: The RIKS Demonstrator

Acknowledge final event, November 25, 2008

Marie-Francine Moens, Erik Boiy, Javier Arias (HMDB-LIIR)
Saskia Debergh (i.Know)
Philippe De Lombaerde, Birger Fühne (UNU-CRIS)

Overview

• UNU-CRIS: The RIKS Demonstrator
• K.U.Leuven:
– Content extraction from multilingual Web pages
– Text categorization: machine learning approach
– Search engine and indexing infrastructure
– Interfacing the Acknowledge platform
• i.Know:
– Information forensics

Acknowledge 25-11-2008

Page 2: Acknowledge 07 Automated Retrieval And Categorization Of Texts In An E Learning Environment   The Riks Demonstrator

The RIKS Demonstrator

• United Nations University – Comparative Regional Integration Studies (UNU-CRIS)
• Issues addressed in research and capacity building:
– (i) emergence of regional (= supra-national) governance level
– (ii) linkages with other governance levels (national, global/UN)
– (iii) building of regional institutions
– (iv) growing regional interdependence, etc.
• RIKS = Regional Integration Knowledge System (UNU-CRIS and GARNET NoE)

Page 3: Acknowledge 07 Automated Retrieval And Categorization Of Texts In An E Learning Environment   The Riks Demonstrator

The RIKS Demonstrator

Issues addressed in the demonstrator:

How to automate retrieval and processing (cleaning, search, categorization, presentation) of particular types of relevant information in an e-learning environment?
– ‘News’: short texts, various formats, dynamic collection, short life cycle, role of news in e-learning application
– ‘Documentation’: heterogeneous texts (scientific articles, theses, essays, ...), rather static collection
– Treaty texts: long and complex texts, static collection, issue of accessibility

RIKS example output

Page 4: Acknowledge 07 Automated Retrieval And Categorization Of Texts In An E Learning Environment   The Riks Demonstrator

Demo

K.U.Leuven: Content extraction from multilingual Web pages

• = Extracting the main content from a Web page and removing extraneous data (navigation menus, advertisements, etc.)
• Requirements of the tool:
– Accurate
– Generic
– Multilingual
– Fast
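
To make the content extraction task concrete, here is a minimal sketch in Python. It is not the method of [Arias et al. submitted], which these slides do not describe; it swaps in a simple link-density heuristic. The names `BlockCollector`, `extract_main_content`, `min_chars` and `max_link_density` are hypothetical, and the thresholds are arbitrary assumptions. The heuristic uses only markup and character counts, which is one way to stay generic and language-independent.

```python
from html.parser import HTMLParser

# Tags that start or end a text block, and tags whose content is never shown.
BLOCK_TAGS = {"p", "div", "td", "tr", "table", "li", "ul", "ol",
              "h1", "h2", "h3", "article", "section", "br"}
SKIP_TAGS = {"script", "style"}


class BlockCollector(HTMLParser):
    """Splits a page into text blocks and counts the link characters in each."""

    def __init__(self):
        super().__init__()
        self.blocks = []          # list of (text, link_chars)
        self._text = []
        self._link_chars = 0
        self._in_link = 0
        self._skip = 0

    def flush(self):
        text = " ".join("".join(self._text).split())
        if text:
            self.blocks.append((text, self._link_chars))
        self._text, self._link_chars = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag == "a":
            self._in_link += 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip = max(0, self._skip - 1)
        elif tag == "a":
            self._in_link = max(0, self._in_link - 1)
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        if self._skip:
            return
        self._text.append(data)
        if self._in_link:
            self._link_chars += len(data.strip())


def extract_main_content(html, min_chars=40, max_link_density=0.3):
    """Keeps blocks that are long enough and not dominated by link text."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    parser.flush()
    kept = []
    for text, link_chars in parser.blocks:
        if len(text) >= min_chars and link_chars / len(text) <= max_link_density:
            kept.append(text)
    return "\n".join(kept)


# Assumed sample page: a navigation menu, two content paragraphs, one advertisement.
SAMPLE_PAGE = """
<html><head><title>t</title><style>p {color: red}</style></head><body>
<ul><li><a href="/">Home</a></li><li><a href="/news">News</a></li></ul>
<div><p>The regional trade agreement entered into force after ratification by all member states.</p>
<p>Read <a href="/more">more</a> about the follow-up negotiations held in Brussels last week.</p></div>
<div><a href="/ads">Buy cheap flights now, click here for the best offers today!</a></div>
</body></html>
"""

print(extract_main_content(SAMPLE_PAGE))
```

Counting characters rather than words avoids assuming whitespace-separated tokens, which matters for multilingual pages.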

Page 5: Acknowledge 07 Automated Retrieval And Categorization Of Texts In An E Learning Environment   The Riks Demonstrator


[Arias et al. submitted]

Page 6: Acknowledge 07 Automated Retrieval And Categorization Of Texts In An E Learning Environment   The Riks Demonstrator

[Arias et al. submitted]
[5] = [Gottron 2008]

K.U.Leuven: Text categorization

• Heterogeneous documentation and Google News classified into 27 categories (e.g., trade, poverty, ...)
• Supervised classifier: Multinomial Naïve Bayes, Support Vector Machine, ...
• Features:
– different features: unigrams, bigrams, feature item sets, ...
• Additional feature selection:
– Chi Square, Information Gain, Linear Classifier Weights, Orthogonal Centroid Feature Selection
• Different test set-ups
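
The combination of unigram and bigram features, chi-square feature selection and a Multinomial Naïve Bayes classifier can be sketched as follows. This is an illustrative toy in plain Python, not the K.U.Leuven implementation: the two categories, the four training texts and the helper names (`features`, `chi_square_scores`, `MultinomialNB`) are assumptions made for the example, whereas the demonstrator used 27 categories.

```python
import math
from collections import Counter, defaultdict


def features(text):
    """Unigram and bigram features from a lowercased, whitespace-tokenized text."""
    tokens = text.lower().split()
    bigrams = [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]
    return tokens + bigrams


def chi_square_scores(docs, labels):
    """Chi-square score per feature (document presence vs. class), max over classes."""
    n = len(docs)
    class_count = Counter(labels)
    feat_class = defaultdict(Counter)   # feature -> class -> docs containing it
    for doc, label in zip(docs, labels):
        for f in set(features(doc)):
            feat_class[f][label] += 1
    scores = {}
    for f, per_class in feat_class.items():
        total_f = sum(per_class.values())
        best = 0.0
        for c, n_c in class_count.items():
            a = per_class[c]             # in class, with feature
            b = total_f - a              # other classes, with feature
            c_ = n_c - a                 # in class, without feature
            d = n - total_f - c_         # other classes, without feature
            denom = (a + c_) * (b + d) * (a + b) * (c_ + d)
            if denom:
                best = max(best, n * (a * d - b * c_) ** 2 / denom)
        scores[f] = best
    return scores


class MultinomialNB:
    """Multinomial Naive Bayes with Laplace smoothing over a fixed vocabulary."""

    def fit(self, docs, labels, vocabulary):
        self.vocab = set(vocabulary)
        self.classes = sorted(set(labels))
        self.log_prior, self.log_like = {}, {}
        for c in self.classes:
            class_docs = [d for d, l in zip(docs, labels) if l == c]
            self.log_prior[c] = math.log(len(class_docs) / len(docs))
            counts = Counter(f for d in class_docs
                             for f in features(d) if f in self.vocab)
            total = sum(counts.values()) + len(self.vocab)
            self.log_like[c] = {f: math.log((counts[f] + 1) / total)
                                for f in self.vocab}
        return self

    def predict(self, doc):
        feats = [f for f in features(doc) if f in self.vocab]
        return max(self.classes,
                   key=lambda c: self.log_prior[c]
                   + sum(self.log_like[c][f] for f in feats))


# Assumed toy training data with two of the example categories.
train_docs = [
    "tariff cuts boost regional trade",
    "new trade agreement lowers tariff barriers",
    "poverty reduction programme targets rural poverty",
    "aid programme fights extreme poverty",
]
train_labels = ["trade", "trade", "poverty", "poverty"]

scores = chi_square_scores(train_docs, train_labels)
top = sorted(scores, key=lambda f: (-scores[f], f))[:4]   # keep the 4 best features
model = MultinomialNB().fit(train_docs, train_labels, top)
print(top)
print(model.predict("ministers sign trade agreement on tariff reduction"))
```

Feature selection runs before training, so the classifier only sees the features that discriminate best between the categories.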

Page 7: Acknowledge 07 Automated Retrieval And Categorization Of Texts In An E Learning Environment   The Riks Demonstrator

K.U.Leuven: Text categorization

RIKS K.U.Leuven: search engine

Page 8: Acknowledge 07 Automated Retrieval And Categorization Of Texts In An E Learning Environment   The Riks Demonstrator


Demo


Page 9: Acknowledge 07 Automated Retrieval And Categorization Of Texts In An E Learning Environment   The Riks Demonstrator

Knowing that you don't know what you should know

1. Information Forensics - Smart Indexing

• more than just an index
• distinguishes between concepts and relations
• recognises word groups as meaningful units
• starts from unstructured text (bottom-up instead of top-down)

Top-down: knowledge → keywords → text
Bottom-up: text → concepts and relations → knowledge

© i.Know NV - All rights reserved.

1. Information Forensics – Smart Indexing

Traditional indexing (keywords):

De Fortis Bank werd overgenomen door BNP Paribas. (Fortis Bank was taken over by BNP Paribas.)

Processing steps: stopwords, stemming, calculation, correlation

Keyword Index
Bank          0.38
werd          0.08
Fortis        0.23
overgenomen   0.21
door          0.12
BNP           0.34
Paribas       0.27
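
As a rough illustration of traditional keyword indexing, the sketch below removes stopwords, applies a very crude stemmer and computes a weight per term. The slide does not say how its weights were calculated, so the smoothed TF-IDF formula, the tiny stopword list, the two extra Dutch sentences and the function names are all assumptions, and the resulting numbers are not those shown on the slide. The "correlation" step is not modelled.

```python
import math
from collections import Counter

# Tiny illustrative Dutch stopword list (assumption, not a real resource).
STOPWORDS = {"de", "het", "een", "en", "van"}


def stem(token):
    """Very crude suffix stripper; a real system would use a proper Dutch stemmer."""
    for suffix in ("en", "e"):
        if len(token) > len(suffix) + 3 and token.endswith(suffix):
            return token[: -len(suffix)]
    return token


def index_terms(text):
    """Lowercases, strips punctuation, drops stopwords and stems the rest."""
    tokens = [t.strip(".,;:!?").lower() for t in text.split()]
    return [stem(t) for t in tokens if t and t not in STOPWORDS]


def keyword_index(documents):
    """Returns {doc_id: {term: weight}} using term frequency times smoothed IDF."""
    term_lists = [index_terms(d) for d in documents]
    doc_freq = Counter(term for terms in term_lists for term in set(terms))
    n_docs = len(documents)
    index = {}
    for doc_id, terms in enumerate(term_lists):
        tf = Counter(terms)
        index[doc_id] = {
            term: (count / len(terms))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in tf.items()
        }
    return index


# The slide's sentence plus two invented sentences to give the IDF something to compare.
DOCS = [
    "De Fortis Bank werd overgenomen door BNP Paribas.",
    "Het akkoord werd ondertekend door de lidstaten.",
    "De bank werd opgericht door de overheid.",
]

for term, weight in sorted(keyword_index(DOCS)[0].items(), key=lambda kv: -kv[1]):
    print(f"{term:12s} {weight:.2f}")
```

Each word is weighted on its own, so the index cannot tell that "Fortis Bank" is one unit or who took over whom.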

Page 10: Acknowledge 07 Automated Retrieval And Categorization Of Texts In An E Learning Environment   The Riks Demonstrator

1. Information Forensics – Smart Indexing

Smart Indexing (concepts and relations):

De Fortis Bank werd overgenomen door BNP Paribas. (Fortis Bank was taken over by BNP Paribas.)

Processing steps: concept detection, relation detection

Smart Index
Concept:   Fortis Bank
Relation:  werd overgenomen door
Concept:   BNP Paribas
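
The split into concepts and relations can be imitated with a toy example. The slides do not explain how i.Know detects relations, so this sketch simply assumes a small hand-written lexicon of relation phrases and treats everything between two relations as one concept. `RELATION_PHRASES`, `DETERMINERS` and `smart_index` are hypothetical names introduced for illustration only.

```python
# Hypothetical relation lexicon and determiner list (assumptions for this sketch).
RELATION_PHRASES = ["werd overgenomen door", "will be applied with", "and with"]
DETERMINERS = {"de", "het", "een", "the", "a", "an"}


def smart_index(sentence, relation_phrases=RELATION_PHRASES):
    """Splits a sentence into alternating ("concept", ...) and ("relation", ...) segments."""
    words = sentence.rstrip(".").split()
    lowered = [w.lower() for w in words]
    # Try longer relation phrases first so that they win over shorter ones.
    phrases = sorted((p.lower().split() for p in relation_phrases),
                     key=len, reverse=True)
    segments, concept = [], []

    def close_concept():
        # Drop leading determiners, then store the word group as one concept.
        while concept and concept[0].lower() in DETERMINERS:
            concept.pop(0)
        if concept:
            segments.append(("concept", " ".join(concept)))
        concept.clear()

    i = 0
    while i < len(words):
        match = next((p for p in phrases if lowered[i:i + len(p)] == p), None)
        if match:
            close_concept()
            segments.append(("relation", " ".join(words[i:i + len(match)])))
            i += len(match)
        else:
            concept.append(words[i])
            i += 1
    close_concept()
    return segments


for kind, text in smart_index("De Fortis Bank werd overgenomen door BNP Paribas."):
    print(f"{kind:9s} {text}")
```

Unlike a keyword index, the output keeps "Fortis Bank" and "BNP Paribas" as whole word groups and records the relation that links them.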

2. Categorisation based on Smart Indexing

Preconditions:
• Pre-defined taxonomy/ontology
• Top-down processing

Advantages of Smart Indexing:
• Smart Indexing results can be used to fill and enrich the taxonomy, thus ensuring the entries are:
– relevant
– precise
– complete

Page 11: Acknowledge 07 Automated Retrieval And Categorization Of Texts In An E Learning Environment   The Riks Demonstrator

2. Categorisation

Input:
The Agreement will be applied with the European Union and with the EFTA states.

Smart Indexing (concepts and relations):
The Agreement will be applied with the European Union and with the EFTA states.

Categorisation:
EU, EFTA
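
A minimal sketch of the categorisation step, assuming the concepts have already been extracted and that a pre-defined taxonomy maps each category to a few surface forms. The `TAXONOMY` contents and the `categorise` function are invented for this example and are not the taxonomy used in RIKS.

```python
import re

# Hypothetical taxonomy: category -> surface forms that signal it.
TAXONOMY = {
    "EU": ["european union", "eu"],
    "EFTA": ["efta", "european free trade association"],
}


def categorise(concepts, taxonomy=TAXONOMY):
    """Returns the sorted categories whose entries occur as whole words in a concept."""
    found = set()
    for concept in concepts:
        text = concept.lower()
        for category, entries in taxonomy.items():
            if any(re.search(rf"\b{re.escape(entry)}\b", text) for entry in entries):
                found.add(category)
    return sorted(found)


# Concepts assumed to come out of Smart Indexing for the example sentence.
concepts = ["the Agreement", "the European Union", "the EFTA states"]
print(categorise(concepts))
```

Matching on whole words keeps a short entry such as "eu" from firing inside unrelated words like "neutral".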

RIKS i.Know: news categorization

Page 12: Acknowledge 07 Automated Retrieval And Categorization Of Texts In An E Learning Environment   The Riks Demonstrator

RIKS i.Know: news categorization

Page 13: Acknowledge 07 Automated Retrieval And Categorization Of Texts In An E Learning Environment   The Riks Demonstrator


Demo


Page 14: Acknowledge 07 Automated Retrieval And Categorization Of Texts In An E Learning Environment   The Riks Demonstrator


Thank you
