Computer Aided Document Indexing System (CADIS) with Eurovoc
Bojana Dalbelo Bašić, Faculty of Electrical Engineering and Computing, University of Zagreb, [email protected]
Marko Tadić, Faculty of Humanities and Social Sciences, University of Zagreb, [email protected]
Bruxelles, 2006-03-10
Project AIDE

Dec 30, 2015

Transcript
Page 1: Project  AIDE

Bruxelles, 2006-03-10

Computer Aided Document Indexing System (CADIS) with Eurovoc

Bojana Dalbelo Bašić, Faculty of Electrical Engineering and Computing, University of Zagreb, [email protected]

Marko Tadić, Faculty of Humanities and Social Sciences, University of Zagreb, [email protected]

Page 2: Project  AIDE

Project AIDE

idea for a project

September 2004, conference at JRC, Ispra

interdisciplinary collaboration of 3 institutions

Croatian Information Documentation Referral Agency (HIDRA)

Department of Electronics, Microelectronics, Computer and Intelligent Systems (ZEMRIS), Faculty of Electrical Engineering and Computing, University of Zagreb

Institute of Linguistics (ZZL), Faculty of Humanities and Social Sciences, University of Zagreb

Page 3: Project  AIDE

AIDE – collaborating institutions

HIDRA

collecting, processing, providing public access to, and promoting the official documentation of the Republic of Croatia

coordinator Maja Cvitaš, M.A.

ZEMRIS

research in the field of artificial intelligence, neural networks, machine learning, data and text mining

coordinators prof. Bojana Dalbelo Bašić and Jan Šnajder

ZZL

computational linguistic research and building language technologies for Croatian

coordinator prof. Marko Tadić

Page 4: Project  AIDE

AIDE – project objective

Development of an intelligent system for automatic indexing of the official documentation of the Republic of Croatia with descriptors from the Eurovoc thesaurus

Page 5: Project  AIDE

AIDE – how?

automatic indexing, how?

program which “learns to index”

Joint Research Center of EC (JRC), Ispra, Italy

at least 10,000 manually indexed documents

3-5 descriptors per document

10-15 documents per descriptor

indexed documents stored in XML format

Steinberger (2003)
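The corpus figures above (descriptors per document, documents per descriptor) can be checked mechanically once the manual index is available. The following is a minimal sketch, not the project's tooling: the function name `corpus_stats`, the dictionary layout, and the toy data are all assumptions made for illustration.

```python
from collections import Counter
from statistics import mean


def corpus_stats(index):
    """Summarise a manually indexed corpus.

    `index` maps a document id to the list of descriptors assigned to it
    (a hypothetical in-memory layout, not the XML format from the slide).
    """
    # Number of distinct descriptors assigned to each document.
    per_doc = [len(set(descs)) for descs in index.values()]
    # Number of documents in which each descriptor is used.
    docs_per_desc = Counter(d for descs in index.values() for d in set(descs))
    return {
        "documents": len(index),
        "mean_descriptors_per_document": mean(per_doc) if per_doc else 0.0,
        "mean_documents_per_descriptor": (
            mean(list(docs_per_desc.values())) if docs_per_desc else 0.0
        ),
    }


# Hypothetical toy corpus; real descriptors would come from Eurovoc.
toy_index = {
    "doc1": ["fisheries policy", "EU accession"],
    "doc2": ["EU accession", "customs tariff", "import"],
    "doc3": ["fisheries policy", "import"],
}
stats = corpus_stats(toy_index)
print(stats)
```

Comparing these averages with the 3-5 and 10-15 ranges quoted on the slide gives a quick indication of whether a corpus is large and dense enough for training.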

compiling a corpus of Croatian indexed documents for machine learning of automatic indexing with Eurovoc descriptors

situation with Croatian documentation in 2004

only a few hundred documents had been indexed

manual indexing: painfully slow

Page 6: Project  AIDE

AIDE – how?

how could we speed up the manual indexing?

plan:

to develop a workstation for computer aided document indexing

to conduct research and development of algorithms in the field of computational linguistics/language technologies

to build that knowledge into the workstation and turn it into a Computer Aided Document Indexing System (CADIS)

Page 7: Project  AIDE

CADIS: two windows

Document window

Eurovoc browser window

Page 8: Project  AIDE

Document Window

Page 9: Project  AIDE

Page 10: Project  AIDE

CADIS features

Enhanced user interface

list of descriptors appearing in document

Page 11: Project  AIDE

CADIS features

Descriptors and non-descriptors marked in document
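Marking thesaurus terms in a document can be illustrated with a simple lookup. In Eurovoc, a non-descriptor is a non-preferred term that points to a preferred descriptor. The sketch below assumes exact, case-insensitive string matching; the function `mark_terms` and the sample terms are hypothetical, and this is not a description of how CADIS itself works. Exact matching would miss inflected word forms, which matters for a language such as Croatian.

```python
import re


def mark_terms(text, descriptors, non_descriptors):
    """Find thesaurus terms in `text`.

    `descriptors` is an iterable of preferred terms; `non_descriptors`
    maps each non-preferred term to its preferred descriptor.
    Returns a list of (start, end, matched_text, descriptor, kind).
    """
    lookup = {d.lower(): (d, "descriptor") for d in descriptors}
    for term, preferred in non_descriptors.items():
        lookup[term.lower()] = (preferred, "non-descriptor")
    if not lookup:
        return []
    # Longest terms first, so multiword terms win over their sub-terms.
    terms = sorted(lookup, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    hits = []
    for m in pattern.finditer(text):
        preferred, kind = lookup[m.group(0).lower()]
        hits.append((m.start(), m.end(), m.group(0), preferred, kind))
    return hits


# Hypothetical sample terms, chosen only for illustration.
hits = mark_terms(
    "Fisheries policy affects the importation of fish.",
    descriptors=["fisheries policy", "import"],
    non_descriptors={"importation": "import"},
)
for hit in hits:
    print(hit)
```

The character offsets in each hit are what a user interface would need in order to highlight the term in the document window.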

Page 12: Project  AIDE

CADIS features

Lists of n-grams
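An n-gram list of this kind can be produced by sliding a window of n tokens over the text and counting each sequence. The sketch below is an assumption about the general technique, with a naive regex tokenizer and a hypothetical function name `ngram_counts`; the slides do not describe the tokenization CADIS uses.

```python
import re
from collections import Counter


def ngram_counts(text, n):
    """Count word n-grams in `text` using a naive lowercase tokenizer."""
    tokens = re.findall(r"\w+", text.lower())
    # zip over n shifted copies of the token list yields each n-token window.
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(g) for g in grams)


bigrams = ngram_counts("the official documentation of the official gazette", 2)
print(bigrams.most_common(3))
```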

Page 13: Project  AIDE

CADIS features

Integration of corpus analysis

greyed n-grams are statistically relevant in the corpus
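The slide does not say which statistic decides that an n-gram is relevant. As a stand-in, the sketch below uses a smoothed tf-idf score: an n-gram scores highly when it is frequent in the current document but occurs in few corpus documents. The function name `tfidf_scores`, the smoothing, and all numbers are assumptions for illustration only.

```python
import math


def tfidf_scores(doc_counts, corpus_doc_freq, n_docs):
    """Score each n-gram of one document against corpus statistics.

    `doc_counts` maps n-gram -> count in the current document,
    `corpus_doc_freq` maps n-gram -> number of corpus documents containing it,
    `n_docs` is the corpus size. Add-one smoothing avoids division by zero
    for n-grams that never occur in the corpus.
    """
    return {
        gram: tf * math.log((n_docs + 1) / (corpus_doc_freq.get(gram, 0) + 1))
        for gram, tf in doc_counts.items()
    }


# Hypothetical counts: both n-grams occur 4 times in the document,
# but one is common across the corpus and the other is rare.
scores = tfidf_scores(
    doc_counts={"official gazette": 4, "fisheries policy": 4},
    corpus_doc_freq={"official gazette": 90, "fisheries policy": 3},
    n_docs=100,
)
for gram, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{gram}: {score:.2f}")
```

N-grams above some chosen threshold could then be rendered differently in the list, in the way the greyed entries are.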

Page 14: Project  AIDE

CADIS features

Manual marking of significant n-grams: an important step towards automatic indexing

Page 15: Project  AIDE

Eurovoc browser window

Page 16: Project  AIDE

Further development

CADIS for other languages?

already available for Croatian and English

usable for other languages without the linguistic module

cooperation needed with the respective language technology experts to develop linguistic modules for other languages

partners for EU project proposals for the next step

AIDE

research on machine learning and text-mining

use that knowledge to turn the workstation into an intelligent system for Automatic Indexing of Documents with Eurovoc

establishing a publicly accessible service for automatic indexing of the official documentation of the Republic of Croatia

Page 17: Project  AIDE

http://textmining.zemris.fer.hr

Page 18: Project  AIDE

Conclusion

CADIS is unique in Europe

Web info at:

HIDRA: www.hidra.hr/hidra/aide/aide.htm

ZEMRIS: textmining.zemris.fer.hr

for download contact: [email protected]

Page 19: Project  AIDE

Computer Aided Document Indexing System (CADIS) with Eurovoc

Bojana Dalbelo Bašić, Faculty of Electrical Engineering and Computing, University of Zagreb, [email protected]

Marko Tadić, Faculty of Humanities and Social Sciences, University of Zagreb, [email protected]