Funding Organization: European Commission
Funding Programme: Sixth Framework Programme (FP6: IST, 3rd Call)
Project Type: Specific Support Action (SSA)
Duration: 32 months (April 2005 – November 2007)
Project Co-ordination: DFKI GmbH
Technical Co-ordination: Jozef Stefan Institute (IJS)
Technology Partners: DFKI, IJS, Ontotext, STFC
Project Consortium: 15 partners from EU member states and new member states

Deutsches Forschungszentrum für Künstliche Intelligenz (DFKI), Germany
Jozef Stefan Institute, Slovenia
Ontotext Lab, Sirma AI EAD, Bulgaria
RTD Talos, Cyprus
Institute of Information Theory and Automation, Czech Republic
Archimedes Foundation, Estonia
Computer and Automation Research Institute, Hungarian Academy of Sciences, Hungary
Institute of Mathematics and Computer Science, University of Latvia, Latvia
Lithuanian Innovation Centre, Lithuania
Projects in Motion, Malta
Technical University of Silesia, Poland
National Institute for R&D in Informatics, Romania
Slovak University of Technology, Slovakia
TUBITAK, Turkey
The Science and Technology Facilities Council, UK
Data Integration from Heterogeneous Sources:
CERIF-based databases (MS SQL Server; MS Access; EPSRC database)
MS Word documents; MS Excel documents
Raw text files; HTML files; XML files
Data crawled from the Web, from CERIF-based CRISs, and from public CRISs
Data Integration into ONE single dataset to enable Analysis at European Level
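As an illustration of what "integration into one single dataset" involves, records from the heterogeneous sources above first have to be mapped into one shared schema before they can be merged. A minimal Python sketch, not the project's actual code — the field names (`cfName`, `Organisation`, etc.) are chosen for illustration and only loosely echo real CERIF/source attributes:

```python
# Minimal sketch: map raw records from different source types into one
# shared target schema before merging. Field names are illustrative.

def to_common_record(source: str, raw: dict) -> dict:
    """Normalise one raw record into the shared target schema."""
    if source == "cerif_db":        # e.g. an MS SQL Server / MS Access export
        return {"name": raw["cfName"], "country": raw["cfCountryCode"]}
    if source == "spreadsheet":     # e.g. an MS Excel sheet row
        return {"name": raw["Organisation"], "country": raw["Country"]}
    if source == "web_crawl":       # e.g. a crawled HTML page
        return {"name": raw.get("title", ""), "country": raw.get("country", "")}
    raise ValueError(f"unknown source: {source}")

records = [
    to_common_record("cerif_db",
                     {"cfName": "DFKI GmbH", "cfCountryCode": "DE"}),
    to_common_record("spreadsheet",
                     {"Organisation": "Jozef Stefan Institute", "Country": "SI"}),
]
print(records[0]["name"])  # -> DFKI GmbH
```

Once everything is in the common shape, European-level analysis can run over the single merged dataset instead of per-source formats.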
Overall Data Cleaning with Supervised Machine Learning Methods
From national CRISs / collections:
complete and comprehensive, often bilingual
quickly and easily (exported and) transformed into CERIF XML
mostly technical contact/expertise available

Crawled from public CRISs / CERIF-based CRISs:
complete as far as publicly available
needs data transformation / restructuring effort into CERIF XML
technical expertise not related to domain knowledge
depends on static website structures

Crawled from the Web (Google Scholar publication data):
not usable for quality analysis

Community contributions:
a lot of interest
entries incomplete: only basic personal data, not many relations
Heuristic Analysis of Random Samples in National Datasets / CORDIS Datasets:
most obvious duplicates found inside and across the CORDIS FP5 and FP6 datasets (the largest sets!)
not so many duplicates found in the national datasets
a lot of duplicate person records across all datasets
no duplicate records found within project datasets, only some across project datasets
publications have not been examined
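A heuristic duplicate scan of this kind can be sketched as pairwise name comparison with a similarity threshold — illustrative only, using Python's standard-library `difflib`; the real project used more elaborate methods, and the 0.85 threshold is an arbitrary choice for the example:

```python
# Flag candidate duplicate organisation records by pairwise name
# similarity. Illustrative sketch; threshold 0.85 is arbitrary.
from difflib import SequenceMatcher
from itertools import combinations

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

names = [
    "DFKI GmbH",
    "DFKI  GmbH ",      # extra spaces
    "Ontotext Lab",
    "Onto-text Lab",    # hyphen variant
]
candidates = [(a, b) for a, b in combinations(names, 2)
              if similarity(a, b) > 0.85]
# candidates pairs the spacing and hyphen variants with their originals
```

Candidate pairs found this way would still be confirmed manually or by a learned model — exactly why the samples were analysed heuristically first.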
Decision with Respect to the IST World Scope:
do not touch project records
ignore publication records
let the community resolve person records (IST World Community)
concentrate on cleaning organisation records
Most entries had slightly different names, caused by additional special characters or character modifications:
capitalization / lowercase letters
blanks, extra spaces
hyphens
quotes
comma in different places
article in name
full stop in name
incomplete names
English translation
word order
language-specific characters ("Jorg" instead of "Jörg")
special characters / wrong encoding (&, ?)
mixture of organisation names and department names
differences in addresses
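Several of these variation causes can be neutralised by a simple normaliser before comparison — a sketch assuming stdlib-only Python, covering case, extra blanks, hyphens, quotes, commas, full stops, leading articles, and diacritics; the article list and sample name are invented, and the project's actual cleaning rules were certainly richer:

```python
# Sketch of an organisation-name normaliser for duplicate matching.
# Handles: case, extra spaces, hyphens, quotes, commas, full stops,
# leading articles, and diacritics ("Jörg" -> "Jorg"). Illustrative only.
import re
import unicodedata

ARTICLES = {"the", "der", "die", "das", "le", "la"}  # illustrative subset

def normalise(name: str) -> str:
    # Decompose accented characters and drop the combining marks.
    name = unicodedata.normalize("NFKD", name)
    name = "".join(c for c in name if not unicodedata.combining(c))
    name = name.lower()
    # Replace quotes, full stops, commas and hyphens with spaces.
    name = re.sub(r'["\'.,-]', " ", name)
    # Collapse whitespace and drop articles.
    tokens = [t for t in name.split() if t not in ARTICLES]
    return " ".join(tokens)

# Two surface variants of the same (invented) name collapse to one form:
assert normalise("The  Jörg-Institute, e.V.") == normalise("Jorg Institute e V")
```

Normalised forms make exact comparison catch most of the listed variants; translation, word order, and department-vs-organisation mixtures still need similarity measures or human review.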
Data Collection:
data should be updated at their origin, independent of the collection method, to avoid repeating the data cleaning with every update
updates have to happen in the providers' processes
needs backwards communication with data providers
CRISs support systematic data collection and updates
a lingua franca for communication and interchange between systems is needed for large-scale integration and for large-scale analyses across single sets
CERIF was crucial for IST World
crawlings/CRISs do not easily distinguish between topics (IST only)
Web crawlings (Google Scholar) considerably lacked quality
Automated Data Integration:
semi-automatically learned models can be re-used with new data
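Re-using a learned model across datasets can be pictured like this — a stdlib-only sketch in which the "model" is just a weight vector over simple record-pair features, as if produced by an earlier supervised training run; the features, weights, and bias are invented for the example:

```python
# Illustrative sketch: a duplicate-detection model learned once
# (weights over record-pair features) is applied unchanged to new data.
from difflib import SequenceMatcher

def features(a: dict, b: dict) -> list:
    """Simple features for one record pair: name similarity, same country."""
    name_sim = SequenceMatcher(None, a["name"].lower(),
                               b["name"].lower()).ratio()
    same_country = 1.0 if a["country"] == b["country"] else 0.0
    return [name_sim, same_country]

# Weights and bias as if produced by an earlier supervised training run.
MODEL = {"weights": [2.0, 0.5], "bias": -1.8}

def is_duplicate(a: dict, b: dict) -> bool:
    score = sum(w * f for w, f in zip(MODEL["weights"], features(a, b)))
    return score + MODEL["bias"] > 0.0

# The stored model is applied, unchanged, to a newly collected pair:
new_pair = ({"name": "DFKI GmbH", "country": "DE"},
            {"name": "DFKI  GmbH", "country": "DE"})
print(is_duplicate(*new_pair))  # -> True
```

The point of the lesson is precisely this separation: the (expensive, semi-automatic) training happens once, while the stored model can score every newly integrated dataset.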
Evaluation of large Datasets:
very difficult
needs expert knowledge
Analytics and Tools:
depend heavily on quality data
are very powerful for investigating large datasets
are much appreciated by the community (many registered users)
Common Interest: very high! Even from outside the project.