Funding Organization: European Commission
Funding Programme: Sixth Framework Programme (FP6: IST, 3rd Call)
Project Type: Specific Support Action (SSA)
Duration: 32 months (April 2005 – November 2007)
Project Co-ordination: DFKI GmbH
Technical Co-ordination: Jozef Stefan Institute (IJS)
Technology Partners: DFKI, IJS, Ontotext, STFC
Project Consortium: 15 partners from EU member states and new member states

Deutsches Forschungszentrum für Künstliche Intelligenz (DFKI), Germany
Jozef Stefan Institute, Slovenia
Ontotext Lab, Sirma AI EAD, Bulgaria
RTD Talos, Cyprus
Institute of Information Theory and Automation, Czech Republic
Archimedes Foundation, Estonia
Computer and Automation Research Institute, Hungarian Academy of Sciences, Hungary
Institute of Mathematics and Computer Science, University of Latvia, Latvia
Lithuanian Innovation Centre, Lithuania
Projects in Motion, Malta
Technical University of Silesia, Poland
National Institute for R&D in Informatics, Romania
Slovak University of Technology, Slovakia
TUBITAK, Turkey
The Science and Technology Facilities Council, UK
Data Integration from Heterogeneous Sources:
CERIF-based databases (MS SQL Server; MS Access; EPSRC database)
MS Word documents; MS Excel documents
Raw text files; HTML files; XML files
Data crawled from the Web, from CERIF-based CRISs, and from public CRISs
Data Integration into ONE single dataset to enable Analysis at European Level
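As an illustration of what "integration into one single dataset" involves, records from the heterogeneous sources above first have to be mapped into one shared schema before they can be merged. A minimal Python sketch, not the project's actual code — the field names (`cfName`, `Organisation`, etc.) are chosen for illustration and only loosely echo real CERIF/source attributes:

```python
# Minimal sketch: map raw records from different source types into one
# shared target schema before merging. Field names are illustrative.

def to_common_record(source: str, raw: dict) -> dict:
    """Normalise one raw record into the shared target schema."""
    if source == "cerif_db":        # e.g. an MS SQL Server / MS Access export
        return {"name": raw["cfName"], "country": raw["cfCountryCode"]}
    if source == "spreadsheet":     # e.g. an MS Excel sheet row
        return {"name": raw["Organisation"], "country": raw["Country"]}
    if source == "web_crawl":       # e.g. a crawled HTML page
        return {"name": raw.get("title", ""), "country": raw.get("country", "")}
    raise ValueError(f"unknown source: {source}")

records = [
    to_common_record("cerif_db",
                     {"cfName": "DFKI GmbH", "cfCountryCode": "DE"}),
    to_common_record("spreadsheet",
                     {"Organisation": "Jozef Stefan Institute", "Country": "SI"}),
]
print(records[0]["name"])  # -> DFKI GmbH
```

Once everything is in the common shape, European-level analysis can run over the single merged dataset instead of per-source formats.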
Overall Data Cleaning with Supervised Machine Learning Methods
From national CRISs / collections:
complete and comprehensive, often bilingual
quickly and easily (exported and) transformed into CERIF XML
mostly technical contact/expertise available

Crawled from public CRISs / CERIF-based CRISs:
complete as far as publicly available
needs data transformation / restructuring effort into CERIF XML
technical expertise not related to domain knowledge
depends on static website structures

Crawled from the Web (Google Scholar publication data):
not usable for quality analysis

Community contributions:
a lot of interest
entries incomplete: only basic personal data, not many relations
Heuristic Analysis of Random Samples in National Datasets / CORDIS Datasets:
most obvious duplicates found inside and across the CORDIS FP5 and FP6 datasets (the largest sets!)
not so many duplicates found in the national datasets
a lot of duplicate person records across all datasets
no duplicate records found within project datasets, only some across project datasets
publications have not been examined
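A heuristic duplicate scan of this kind can be sketched as pairwise name comparison with a similarity threshold — illustrative only, using Python's standard-library `difflib`; the real project used more elaborate methods, and the 0.85 threshold is an arbitrary choice for the example:

```python
# Flag candidate duplicate organisation records by pairwise name
# similarity. Illustrative sketch; threshold 0.85 is arbitrary.
from difflib import SequenceMatcher
from itertools import combinations

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

names = [
    "DFKI GmbH",
    "DFKI  GmbH ",      # extra spaces
    "Ontotext Lab",
    "Onto-text Lab",    # hyphen variant
]
candidates = [(a, b) for a, b in combinations(names, 2)
              if similarity(a, b) > 0.85]
# candidates pairs the spacing and hyphen variants with their originals
```

Candidate pairs found this way would still be confirmed manually or by a learned model — exactly why the samples were analysed heuristically first.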
Decision with Respect to the IST World Scope:
do not touch project records
ignore publication records
let the community resolve person records (IST World Community)
concentrate on cleaning organisation records
Most entries had slightly different names, caused by additional special characters or character modifications:
capitalization / lowercase letters
blanks, extra spaces
hyphens
quotes
comma in different places
article in name
full stop in name
incomplete names
English translation
word order
language-specific characters ("Jorg" instead of "Jörg")
special characters / wrong encoding (&, ?)
mixture of organisation names and department names
differences in addresses
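Several of these variation causes can be neutralised by a simple normaliser before comparison — a sketch assuming stdlib-only Python, covering case, extra blanks, hyphens, quotes, commas, full stops, leading articles, and diacritics; the article list and sample name are invented, and the project's actual cleaning rules were certainly richer:

```python
# Sketch of an organisation-name normaliser for duplicate matching.
# Handles: case, extra spaces, hyphens, quotes, commas, full stops,
# leading articles, and diacritics ("Jörg" -> "Jorg"). Illustrative only.
import re
import unicodedata

ARTICLES = {"the", "der", "die", "das", "le", "la"}  # illustrative subset

def normalise(name: str) -> str:
    # Decompose accented characters and drop the combining marks.
    name = unicodedata.normalize("NFKD", name)
    name = "".join(c for c in name if not unicodedata.combining(c))
    name = name.lower()
    # Replace quotes, full stops, commas and hyphens with spaces.
    name = re.sub(r'["\'.,-]', " ", name)
    # Collapse whitespace and drop articles.
    tokens = [t for t in name.split() if t not in ARTICLES]
    return " ".join(tokens)

# Two surface variants of the same (invented) name collapse to one form:
assert normalise("The  Jörg-Institute, e.V.") == normalise("Jorg Institute e V")
```

Normalised forms make exact comparison catch most of the listed variants; translation, word order, and department-vs-organisation mixtures still need similarity measures or human review.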
Data Collection:
data should be updated at their origin, independent of the collection method, to avoid repeating the data cleaning with every update
updates have to happen in the providers' processes
needs backwards communication with data providers
CRISs support systematic data collection and updates
a lingua franca for communication and interchange between systems is needed for large-scale integration and for large-scale analyses across single sets
CERIF was crucial for IST World
crawlings/CRISs do not easily distinguish between topics (IST only)
Web crawlings (Google Scholar) considerably lacked quality
Automated Data Integration:
semi-automatically learned models can be re-used with new data
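Re-using a learned model across datasets can be pictured like this — a stdlib-only sketch in which the "model" is just a weight vector over simple record-pair features, as if produced by an earlier supervised training run; the features, weights, and bias are invented for the example:

```python
# Illustrative sketch: a duplicate-detection model learned once
# (weights over record-pair features) is applied unchanged to new data.
from difflib import SequenceMatcher

def features(a: dict, b: dict) -> list:
    """Simple features for one record pair: name similarity, same country."""
    name_sim = SequenceMatcher(None, a["name"].lower(),
                               b["name"].lower()).ratio()
    same_country = 1.0 if a["country"] == b["country"] else 0.0
    return [name_sim, same_country]

# Weights and bias as if produced by an earlier supervised training run.
MODEL = {"weights": [2.0, 0.5], "bias": -1.8}

def is_duplicate(a: dict, b: dict) -> bool:
    score = sum(w * f for w, f in zip(MODEL["weights"], features(a, b)))
    return score + MODEL["bias"] > 0.0

# The stored model is applied, unchanged, to a newly collected pair:
new_pair = ({"name": "DFKI GmbH", "country": "DE"},
            {"name": "DFKI  GmbH", "country": "DE"})
print(is_duplicate(*new_pair))  # -> True
```

The point of the lesson is precisely this separation: the (expensive, semi-automatic) training happens once, while the stored model can score every newly integrated dataset.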
Evaluation of large Datasets:
very difficult
needs expert knowledge
Analytics and Tools:
depend heavily on quality data
are very powerful for investigating large datasets
are much appreciated by the community (many registered users)
Common Interest: very high! Even from outside the project.