Bruxelles, 2006-03-10
Computer Aided Document Indexing System (CADIS) with Eurovoc
Bojana Dalbelo Bašić, Faculty of Electrical Engineering and Computing, University of Zagreb, [email protected]
Marko Tadić, Faculty of Humanities and Social Sciences, University of Zagreb, [email protected]
Project AIDE
idea for a project
September 2004, conference at JRC, Ispra
interdisciplinary collaboration of 3 institutions
Croatian Information Documentation Referral Agency (HIDRA)
Department of Electronics, Microelectronics, Computer and Intelligent Systems (ZEMRIS), Faculty of Electrical Engineering and Computing, University of Zagreb
Institute of Linguistics (ZZL), Faculty of Humanities and Social Sciences, University of Zagreb
AIDE – collaborating institutions
HIDRA
collecting and processing the official documentation of the Republic of Croatia, providing public access to it, and promoting it
coordinator Maja Cvitaš, M.A.
ZEMRIS
research in the field of artificial intelligence, neural networks, machine learning, data and text mining
coordinators prof. Bojana Dalbelo Bašić and Jan Šnajder
ZZL
computational linguistic research and building language technologies for Croatian
coordinator prof. Marko Tadić
AIDE – project objective
Development of an intelligent system for automatic indexing of the official documentation of the Republic of Croatia with descriptors from the Eurovoc thesaurus
AIDE – how?
automatic indexing, how?
program which “learns to index”
Joint Research Center of EC (JRC), Ispra, Italy
at least 10,000 manually indexed documents
3-5 descriptors per document
10-15 documents per descriptor
indexed documents stored in XML format
Steinberger (2003)
compiling a corpus of Croatian indexed documents for machine learning of automatic indexing with Eurovoc descriptors
situation with Croatian documentation in 2004:
only a few hundred documents had been indexed
manual indexing: painfully slow
AIDE – how?
how could we speed up the manual indexing?
plan:
develop a workstation for computer aided document indexing
conduct research and development of algorithms in the field of computational linguistics/language technologies
build that knowledge into the workstation and turn it into the Computer Aided Document Indexing System (CADIS)
CADIS: two windows
Document window
Eurovoc browser window
Document Window
CADIS features
Enhanced user interface
list of descriptors appearing in document
CADIS features
Descriptors and non-descriptors marked in document
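A minimal sketch of how descriptors and non-descriptors could be located in a document so that they can be marked. It uses plain case-insensitive string matching and does not reflect how CADIS itself is implemented. The tiny thesaurus, the sample sentence, and the function name `mark_terms` are all hypothetical, and inflected word forms (which a linguistic module would handle) are not covered.

```python
import re

# Hypothetical mini-thesaurus for illustration only; not taken from Eurovoc data.
DESCRIPTORS = {"environmental protection", "public health"}
# Each non-descriptor points to the descriptor it stands for.
NON_DESCRIPTORS = {"protection of the environment": "environmental protection"}


def mark_terms(text):
    """Return (start, end, matched_text, kind, descriptor) tuples sorted by position."""
    terms = {d: ("descriptor", d) for d in DESCRIPTORS}
    terms.update({n: ("non-descriptor", d) for n, d in NON_DESCRIPTORS.items()})
    hits = []
    for term, (kind, descriptor) in terms.items():
        # Whole-word, case-insensitive match of the exact term string.
        pattern = r"\b" + re.escape(term) + r"\b"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            hits.append((m.start(), m.end(), m.group(0), kind, descriptor))
    return sorted(hits)


sample = "Protection of the environment and public health are linked."
marks = mark_terms(sample)
```

Each hit carries character offsets, so a user interface could highlight the span and show which descriptor a non-descriptor leads to.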
CADIS features
Lists of n-grams
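As an illustration of what a list of n-grams for one document might contain, the sketch below splits text into lowercase word tokens and counts every sequence of n consecutive tokens. The tokenisation rule and the names `ngrams` and `ngram_counts` are assumptions made for this example; the slides do not describe how CADIS builds its lists.

```python
import re
from collections import Counter


def ngrams(text, n):
    """Return all sequences of n consecutive word tokens, lowercased."""
    # \w is Unicode-aware in Python 3, so letters such as š and ć stay inside tokens.
    tokens = re.findall(r"\w+", text.lower())
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(text, max_n=3):
    """Map each n from 1 to max_n to a Counter of the n-grams in the text."""
    return {n: Counter(ngrams(text, n)) for n in range(1, max_n + 1)}


counts = ngram_counts("official documentation of the official documentation")
top_bigrams = counts[2].most_common(3)
```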
CADIS features
Integration of corpus analysis
greyed n-grams are statistically relevant in the corpus
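The slides do not say which statistic decides that an n-gram is relevant in the corpus. Purely as an assumption, the sketch below uses a TF-IDF style score: a bigram scores high when it is frequent in the current document but rare across the other documents. The function names, the smoothing, and the threshold are hypothetical choices for illustration.

```python
import math
import re
from collections import Counter


def bigrams(text):
    """Return all pairs of consecutive word tokens, lowercased."""
    tokens = re.findall(r"\w+", text.lower())
    return [" ".join(tokens[i:i + 2]) for i in range(len(tokens) - 1)]


def relevant_bigrams(document, corpus, threshold=1.0):
    """Keep bigrams of `document` whose tf * idf score reaches `threshold`.

    `corpus` is a list of other document texts used for document frequencies.
    """
    doc_freq = Counter()
    for text in corpus:
        doc_freq.update(set(bigrams(text)))  # count each bigram once per document
    tf = Counter(bigrams(document))
    n_docs = len(corpus)
    scores = {}
    for gram, count in tf.items():
        # Smoothed inverse document frequency (an assumed variant).
        idf = math.log((1 + n_docs) / (1 + doc_freq[gram]))
        scores[gram] = count * idf
    return {g: s for g, s in scores.items() if s >= threshold}


corpus = ["the law on trade", "the law on taxes", "the law on fisheries policy"]
relevant = relevant_bigrams("fisheries policy and the law on fisheries policy", corpus)
```

With this toy corpus, a bigram that occurs in every document, such as "the law", scores zero and is filtered out, while the repeated "fisheries policy" is kept.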
CADIS features
Manual marking of significant n-grams: an important step towards automatic indexing
Eurovoc browser window
Further development
CADIS for other languages?
already available for Croatian and English
usable for other languages without a linguistic module
cooperation needed with the respective language technology experts to develop a linguistic module for other languages
partners for EU project proposals for the next step
AIDE
research on machine learning and text-mining
use that knowledge to turn the workstation into an intelligent system for Automatic Indexing of Documents with Eurovoc
establishing a publicly accessible service for automatic indexing of the official documentation of the Republic of Croatia