CLARIN-NL Results and Evaluation Jan Odijk CLARIN-NL Final Event Hilversum, 2015-03-13 1
Page 1: Clarin nl odijk-final_event_2015-03-13

CLARIN-NL

Results and Evaluation

Jan Odijk

CLARIN-NL Final Event

Hilversum, 2015-03-13

1

Page 2: Clarin nl odijk-final_event_2015-03-13

Overview

• Infrastructure Core
  – CLARIN Centres
  – Metadata and Searching for data
  – Federated Content Search
• Resource Curation
  – Data Curation
  – Software Curation & Web Applications
• Interoperability
• What you can do
• Education and Training
• Conclusions

2

Page 4: Clarin nl odijk-final_event_2015-03-13

Infrastructure Core

• 5 CLARIN Centres (‘Type B Centres’)
  1. MPI
  2. Meertens Institute
  3. INL
  4. Huygens ING
  5. DANS
• 3 CLARIN Data Providers (‘Type D Centres’)
  1. National Library (KB)
  2. Utrecht University Library
  3. Netherlands Institute for Sound and Vision

4

Page 5: Clarin nl odijk-final_event_2015-03-13

Infrastructure Core

• CLARIN Centres
  – Have set up a proper repository system
    • So resources can be stored there
  – Have their CMDI metadata harvestable
    • So resources are visible to others
  – Support for persistent identifiers (PIDs)
    • So links to resources are ‘never’ broken
  – Long-term archiving solution in place
    • So resources will not get lost
  – Provisions for federated identity management
    • So you can log in with your own institute account (single sign-on)
  – Have acquired the Data Seal of Approval
    • So the data repositories can be trusted and are sustainable

5
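To make "harvestable" concrete: the sketch below builds the request a harvester might send, assuming the centre exposes its metadata through OAI-PMH, a protocol widely used for metadata harvesting. The endpoint URL is hypothetical, and the `cmdi` metadata prefix is an assumption that should be checked against the centre's own `ListMetadataFormats` response.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="cmdi", resumption_token=None):
    """Build an OAI-PMH ListRecords request URL.

    base_url and the default metadata_prefix are assumptions for
    illustration, not values taken from the slides.
    """
    if resumption_token is not None:
        # In OAI-PMH a resumptionToken is an exclusive argument:
        # follow-up requests carry only the verb and the token.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return f"{base_url}?{urlencode(params)}"


# Hypothetical repository endpoint, for illustration only.
first_page = list_records_url("https://repository.example.org/oai")
```

A harvester would fetch `first_page`, read the records, and repeat the call with the returned resumption token until the repository sends no further token.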

Page 6: Clarin nl odijk-final_event_2015-03-13

Infrastructure Core

• CLARIN Type A Centres in NL
  – Offer services for the whole CLARIN infrastructure
  – Mainly MPI, some Meertens (and UU)
• Enables you to search for resources:
  – Harvesting of metadata, Virtual Language Observatory, Meertens Metadata Search (Meertens), CLARIN-NL Portal (UU)
• Enables you to create metadata
  – CMDI registry, CMDI Profile editor, Metadata editor
• Enables you to ensure semantic interoperability
  – ISOCAT, RELCAT, SchemaCat
  – CLAVAS, CLARIN Concept Registry (Meertens)
  – Transfer from MPI to other centres (in EU) on-going

6

Page 8: Clarin nl odijk-final_event_2015-03-13

Infrastructure Core

• Metadata and Metadata Search
  – CMDI metadata created for all data dealt with in CLARIN-NL
  – Using flexible CMDI
    • If needed, with user-defined profiles and components
  – Searching for data possible via the
    • VLO
    • Meertens Metadata Search
• Some work done on metadata for software
  – Partially reflected in CLARIN-NL Portal
  – But not (yet) in CMDI records / VLO

8

Page 9: Clarin nl odijk-final_event_2015-03-13

Infrastructure Core

• Metadata and Metadata Search
  – CMDI ‘too flexible’
  – Big variation in granularity
  – Hardly any requirements on which metadata elements are obligatory
    • Some crucial metadata elements are lacking
• VLO
  – Gives access to over 800k metadata records
  – KB metadata are not included (> 1 million)
  – Many records of external origin, with no control over the metadata
  – Limited search options via VLO
• Search via VLO is not as easy as it should be
• CLARIN-NL Portal improves this for NL resources
• Will be taken up in CLARIAH

9
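The point about missing obligatory elements can be illustrated with a small completeness check. This is a minimal sketch under stated assumptions: records are modelled as flat dictionaries rather than real CMDI XML, and the element names in `REQUIRED_ELEMENTS` are hypothetical, not a list defined by CLARIN.

```python
# Hypothetical set of elements a search portal might want in every record.
REQUIRED_ELEMENTS = ("title", "language", "licence")


def missing_elements(record, required=REQUIRED_ELEMENTS):
    """Return the required element names that are absent or empty in a record."""
    missing = []
    for name in required:
        value = record.get(name)
        if value is None or not str(value).strip():
            missing.append(name)
    return missing


# Two made-up records: the second lacks a language value and a licence.
records = [
    {"id": "rec-1", "title": "Dialect dictionary", "language": "nld", "licence": "CC-BY"},
    {"id": "rec-2", "title": "Interview collection", "language": ""},
]

report = {record["id"]: missing_elements(record) for record in records}
```

A report like this shows which records would be invisible to a search that filters on, say, language.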

Page 11: Clarin nl odijk-final_event_2015-03-13

Infrastructure Core

• Federated Content Search (FCS)
  – Search via a single interface in multiple, distributed, data
• NL centres created ‘end points’ for selected resources
  – So they can participate in FCS
• Development of search interface and aggregator
  – Different approaches NL v. DE
  – NL development stopped, adopted DE approach
  – See CLARIN-D FCS Aggregator
• So far, only string (keyword) search is possible
• Will be taken up again in CLARIAH

11
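The aggregator idea can be sketched in a few lines: one keyword query is sent to every end point and the hits are merged into a single result list. This is an illustration only, not the CLARIN-D FCS Aggregator. `InMemoryEndpoint` and `federated_search` are hypothetical names, and the end points here hold their texts in memory instead of answering over the network.

```python
from concurrent.futures import ThreadPoolExecutor


class InMemoryEndpoint:
    """Stand-in for a centre's search end point (hypothetical, no network)."""

    def __init__(self, name, documents):
        self.name = name
        self.documents = documents

    def search(self, keyword):
        # Plain case-insensitive string search, mirroring the
        # "only string (keyword) search" limitation on the slide.
        needle = keyword.lower()
        return [doc for doc in self.documents if needle in doc.lower()]


def federated_search(endpoints, keyword):
    """Send the same keyword to every end point and merge the hits."""
    with ThreadPoolExecutor(max_workers=max(1, len(endpoints))) as pool:
        # pool.map keeps the results in the same order as the end points.
        results = list(pool.map(lambda ep: ep.search(keyword), endpoints))
    return [(ep.name, hit) for ep, hits in zip(endpoints, results) for hit in hits]


endpoints = [
    InMemoryEndpoint("centre-a", ["De kat zit op de mat", "Een hond blaft"]),
    InMemoryEndpoint("centre-b", ["Katten en honden"]),
]
hits = federated_search(endpoints, "kat")
```

Each hit keeps the name of the end point it came from, so the user can see which centre holds the resource.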

Page 13: Clarin nl odijk-final_event_2015-03-13

Data Curation

• By the CLARIN Data Curation Service (DCS)
  – E.g. LESLLA, dialect dictionaries, IPNV Interviews with veterans
• Via open calls and closed calls
  – In many (small) projects
• Recent examples: VALID, DSS, eBNM+
• Broad coverage of the humanities
• Contributed significantly to user involvement

13

Page 14: Clarin nl odijk-final_event_2015-03-13

Data Curation

Discipline                       Count
Linguistics                         16
History                              9
Literary Studies                     5
Culture Sciences                     4
Communication & Media Studies        2
Religion Studies                     2
Computational Linguistics            1
Philosophy                           1
Political Sciences                   1

14

Page 15: Clarin nl odijk-final_event_2015-03-13

Data Curation

Linguistics                      Count
Acquisition                          5
Historical Linguistics               4
Syntax                               4
Morpho-syntax                        3
Discourse                            2
Language Documentation               2
Lexicology                           2
6 others, each                       1

15

Page 17: Clarin nl odijk-final_event_2015-03-13

Software Curation / Web Applications

• Via open calls and closed calls
  – In many (small) projects
  – Curation / upgrades of existing software
    • AVResearcherXL (QuaMerdes), SHEBANQ, ColTime and EXILSEA upgrades of ELAN, PaQu, Cornetto Interface, …
  – Creation of new software
    • DSS, eBNM+, RemBench, OpenSONAR, PICCL, AutoSearch, …
  – Broad coverage of the humanities
  – Contributed significantly to user involvement

17

Page 18: Clarin nl odijk-final_event_2015-03-13

Software Curation / Web Applications

Discipline                       Count
Linguistics                         27
History                             14
Literary Studies                     5
Communication & Media Studies        4
Cultural Sciences                    4
Political Sciences                   4
Computational Linguistics            3
3 others, each                     1-2

18

Page 19: Clarin nl odijk-final_event_2015-03-13

Software Curation / Web Applications

Linguistics                      Count
Syntax                              13
Morpho-syntax                        7
Historical linguistics               5
Lexicology                           5
Dialectology                         4
Sign Language                        4
7 others, each                       2

19

Page 21: Clarin nl odijk-final_event_2015-03-13

Interoperability

• Interoperability
  – Do tools apply to data seamlessly?
  – Can data be combined seamlessly?
  – Can tools be combined seamlessly?
  – Does CLARIN support data in real-world formats?

21

Page 22: Clarin nl odijk-final_event_2015-03-13

Interoperability

• Syntactic Interoperability
  – FoLIA becoming a de facto standard format for linguistically annotated text corpora in the Netherlands
    • TTNWW, PICCL, VU-DNC, Nederlab, Basilex, …
  – CLAM de facto standard in NL for turning software into RESTful web services
  – But
    • There are also other important formats that must be supported (TEI, LASSY XML, …)
    • And still too many ad-hoc formats, often without explicit syntax and semantics

22

Page 23: Clarin nl odijk-final_event_2015-03-13

Interoperability

• Semantic Interoperability
  – Data Categories for metadata elements actually used (e.g. in the VLO)
  – Data Categories for many data (content) elements defined but hardly used yet
  – ISOCAT was a useful data category registry
    • But had many problems
  – Now replaced by the CLARIN Concept Registry
    • Solves some of ISOCAT’s problems but not all
    • Will be addressed in CLARIAH

23
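What data categories buy you in practice: when differently named metadata elements are linked to the same concept, one query can cover records from different profiles. The sketch below assumes a simple lookup table. The profile names, element names and `concept:` identifiers are all hypothetical and do not come from ISOCAT or the CLARIN Concept Registry.

```python
# Hypothetical links from (profile, element name) to a shared concept identifier.
ELEMENT_TO_CONCEPT = {
    ("profile-a", "lang"): "concept:language",
    ("profile-b", "languageName"): "concept:language",
    ("profile-a", "name"): "concept:title",
    ("profile-b", "resourceTitle"): "concept:title",
}


def normalise(profile, record):
    """Re-key a record by concept identifier; unlinked elements are dropped."""
    normalised = {}
    for element, value in record.items():
        concept = ELEMENT_TO_CONCEPT.get((profile, element))
        if concept is not None:
            normalised[concept] = value
    return normalised


def search_by_concept(records, concept, value):
    """Return ids of records whose value for the given concept matches."""
    return [
        record_id
        for record_id, profile, record in records
        if normalise(profile, record).get(concept) == value
    ]


records = [
    ("r1", "profile-a", {"lang": "nld", "name": "Corpus A"}),
    ("r2", "profile-b", {"languageName": "nld", "resourceTitle": "Corpus B"}),
    ("r3", "profile-b", {"languageName": "fry"}),
]
dutch = search_by_concept(records, "concept:language", "nld")
```

Elements without a link to a concept simply drop out of such a search, which is why data categories that are defined but hardly used limit what can be found.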

Page 24: Clarin nl odijk-final_event_2015-03-13

Interoperability

• Support for real world formats
  – New research data do not come in standardized formats
  – But as mixtures of .doc, .docx, HTML, PDF, plain text, ePub, …
  – And multiple standard formats must be supported in CLARIN (e.g. both FoLIA and TEI)
  – Support for data conversions via the OpenConvert project
  – But more is needed
    • Will be addressed in CLARIAH

24

Page 26: Clarin nl odijk-final_event_2015-03-13

What you can do

• Find and select existing data
  – Virtual Language Observatory, Meertens Metadata Search, CLARIN-NL Portal
• Create new data through OCR and orthographic normalisation
  – PICCL
• Create metadata for new or existing data
  – CMDI Registry, CMDI profile editor, metadata editors (e.g. ARBIL), …

26

Page 27: Clarin nl odijk-final_event_2015-03-13

What you can do

• Make semantics of metadata and data explicit
  – ISOCAT, RELCAT, SchemaCAT
    • Now replaced by CLARIN Concept Registry (CCR)
  – CLAVAS
• Enrich data with various kinds of annotations
  – TTNWW
    • Orthographic normalisation, POS tagging, lemmatisation, parsing, named entity recognition, …
  – Adelheid, INPOLDER, PaQu, ColTime and EXILSEA extensions to ELAN
• Upload enriched data to search applications
  – PaQu, AutoSearch

27

Page 28: Clarin nl odijk-final_event_2015-03-13

What you can do

• Search, browse in data and analyze (meta)data and query results
  – OpenSONAR, GrETEL, PaQu, MIMORE, FESLI, SHEBANQ, AutoSearch, …
  – Arthurian Fiction, NameScape, COBWWWEB, eBNM+, C-DSD, DSS, RemBench, Nederlab, …
  – Interviews, WIP, VK, Polimedia, CKCC, DSS, AVResearcherXL, …
  – DUELME, WFT-GTB, CORNETTO, …

28

Page 29: Clarin nl odijk-final_event_2015-03-13

What you can do

• Visualize data analyses
  – COAVA, FESLI, MIMORE, Gabmap, SHEBANQ, Nederlab, OpenSONAR, …
  – CKCC, MIGMAP, AVResearcherXL
• Store new data safely at a CLARIN Centre
  – All 5 centres have the Data Seal of Approval
  – 4 centres are certified CLARIN Centres

29

Page 30: Clarin nl odijk-final_event_2015-03-13

Invitation

• Use (elements from) the CLARIN infrastructure

• Join user groups of specific services

• Provide feedback so that we can further improve CLARIN

• So that you can improve your research

30

Page 32: Clarin nl odijk-final_event_2015-03-13

Education & Training

• How do you learn to use these tools?

– Courses / tutorials regularly organized

– LOT summer / winter school courses

– Demonstration scenarios and/or screen casts

• E.g. for Gabmap, GrETEL, OpenSONAR

– Educational modules via the portal:

• https://dev.clarin.nl/node/CLARIN%20Educational%20Packages

– Helpdesk: [email protected]

32

Page 33: Clarin nl odijk-final_event_2015-03-13

Education & Training

• Do you want to know more?

– Visit the CLARIN-NL portal

• http://portal.clarin.nl

– View the CLARIN-NL movies

• http://www.clarin.nl/node/403

– Visit the demonstrations today

– Ask me (or others) today

33

Page 35: Clarin nl odijk-final_event_2015-03-13

Conclusions (1)

• CLARIN is starting to provide the data, facilities and services to carry out humanities research supported by large amounts of data and tools

• With easy interfaces and easy search options (no technical background needed)

• Some training in using the tools is needed

– To use the possibilities optimally

– To understand the limitations of the data and the tools

– Educational modules for selected functionality are available

– Tutorials / trainings will continue to be regularly organized

35

Page 36: Clarin nl odijk-final_event_2015-03-13

Conclusions (2)

• But there is still a lot to do

– Extensions of and improvements in metadata

– Improvements of VLO

– Improved functionality for most tools

• Need / desire found by actual use of the tools

– Extend and improve search options for individual resources

– Create options of searching across different resources of the same type

– Improved interoperability

36

Page 37: Clarin nl odijk-final_event_2015-03-13

Conclusions (3)

• A successor project is needed!

• CLARIAH www.clariah.nl

• Proposal approved June 1, 2014

• Started Jan 1st, 2015

• Kick-off this afternoon

37

Page 38: Clarin nl odijk-final_event_2015-03-13

THANKS FOR YOUR ATTENTION!

38