www.isocat.org
Collaboratively Defining Widely Accepted Linguistic Data Categories in the ISOcat Data Category Registry
eHg - New Trends in e-Humanities, 28 March 2013
Menzo Windhouwer, The Language Archive – DANS, tla.mpi.nl, menzo.windhouwer@dans.knaw.nl
The Language Archive: founded in September 2011; supported by MPG, BBAW and KNAW (DANS).
TOP NOTION tds:Noun
GROUPS {
  NOTION tdn:GrammaticalDistinctions
  LABEL "Grammatical distinctions for nouns."
  GROUPS {
    NOTION tdn:AgentNouns
    LABEL "Agent nouns."
    DESCRIPTION "Nouns can function as the agent of a clause."
    LINK TO CONCEPT agentRole
    GROUPS {
      NOTION tdn:v098_plusAffix
      LABEL "Agent nouns formed by verb stem plus affix."
      LINK TO CONCEPTS (agentRole, verbalMorphology, boundAffix)
      DESCRIPTION <p>Agent nouns are formed by a verb stem plus an affix, e.g. English <qv>walk-er</qv>.</p>
      NOTE AUTHOR IS "TDS" TYPE IS "original TDN label" "AGENT NOUNS ARE VERB STEM PLUS AFFIX"
      IS FIELD v098;
      ...
Notes: TDN is not archived in TLA but curated in TDS, a previous project I worked on, and is now archived at DANS; also, this is not a TDN punchcard.
ISO 12620:2009
• Terminology and other content and language resources - Specification of data categories and management of a Data Category Registry for language resources
  – A data model for data category specifications inspired by ISO 11179
  – A procedure to standardize data category specifications compliant with Annex ST
  – Each data category gets a unique Persistent Identifier (PID)
  – The Max Planck Institute for Psycholinguistics is appointed as the Registration Authority of the ISO/TC 37 DCR
• In use by a growing number of ISO TC 37 standards
  – Lexical Markup Framework (LMF)
  – Linguistic Annotation Framework (LAF)
  – Morpho-syntactic Annotation Framework (MAF)
  – …
  – could be more, e.g., Feature System Declarations (FSD)
• Also embedding in other formats is possible, e.g., via comments
• Preferably annotate schemas, so a whole range of resources is annotated in one go
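The schema-annotation idea above can be sketched as follows. This is a minimal illustration only: the namespace URI `http://www.isocat.org/ns/dcr`, the attribute name `datcat`, the element name `pos`, and the helper `annotate` are assumptions made for the example and are not taken from the slides. Only the PID for part of speech (DC-396) appears in the presentation.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"
DCR = "http://www.isocat.org/ns/dcr"  # assumed namespace URI for data category references

# Register prefixes so the serialized schema uses readable xs: and dcr: prefixes.
ET.register_namespace("xs", XS)
ET.register_namespace("dcr", DCR)


def annotate(schema_xml: str, element_name: str, pid: str) -> str:
    """Attach a data category PID to every xs:element declaration with the given name."""
    root = ET.fromstring(schema_xml)
    for el in root.iter(f"{{{XS}}}element"):
        if el.get("name") == element_name:
            el.set(f"{{{DCR}}}datcat", pid)
    return ET.tostring(root, encoding="unicode")


# Hypothetical schema fragment with a single element declaration.
SCHEMA = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="pos" type="xs:string"/>
</xs:schema>"""

annotated = annotate(SCHEMA, "pos", "http://www.isocat.org/datcat/DC-396")
print(annotated)
```

Because the reference sits on the element declaration, every instance document validated against this schema inherits the link without being edited itself.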
• CMDI is developed by CLARIN and is on its way to standardization by ISO TC 37
  – Limitations of existing metadata schemas: DC/OLAC, IMDI, TEI header
    • Inflexible: too many (IMDI) or too few (OLAC) metadata elements
    • Limited interoperability (both semantic and syntactic)
    • Problematic (unfamiliar) terminology for some sub-communities
    • Limited support for LT tool & service descriptions
  – The idea is to address this by:
    • Explicitly defined schema & semantics
    • User/project/community defined components
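To illustrate how explicit semantics can bridge community-defined components, the sketch below indexes elements from two component fragments by the data category they link to. The fragments, the component and element names, and the use of `CMD_Element` with a `ConceptLink` attribute are assumptions modelled loosely on CMDI component specifications and are not taken from the slides. DC-396 is reused from the presentation purely as a sample PID.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Two hypothetical component fragments defined by different communities.
COMPONENT_A = """<CMD_Component name="Annotation">
  <CMD_Element name="pos" ConceptLink="http://www.isocat.org/datcat/DC-396"/>
</CMD_Component>"""

COMPONENT_B = """<CMD_Component name="Tagging">
  <CMD_Element name="wordClass" ConceptLink="http://www.isocat.org/datcat/DC-396"/>
  <CMD_Element name="comment"/>
</CMD_Component>"""


def elements_by_concept(*component_xml: str) -> dict:
    """Map each linked data category PID to the component/element paths that use it."""
    index = defaultdict(list)
    for xml_text in component_xml:
        root = ET.fromstring(xml_text)
        for el in root.iter("CMD_Element"):
            link = el.get("ConceptLink")
            if link:  # elements without a link carry no explicit semantics
                index[link].append(f"{root.get('name')}/{el.get('name')}")
    return dict(index)


shared = elements_by_concept(COMPONENT_A, COMPONENT_B)
print(shared)
```

Elements with different names but the same PID end up in the same list, which is what lets a search tool treat them as the same notion.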
Metadata TDG
• Standardization efforts of the Metadata TDG stalled
  – Large overlap with the work/people at the Athens-Core meetings
    • Community level agreement is maybe enough
  – Activity motivation should not depend on one person, the TDG chair, only
    • The need for explicit and shared semantics is not clear enough yet … more evangelization needed
  – Unfamiliarity with the work
    • Terminologists are more used to this kind of review work
    • Online review vs. old ISO ‘paper’ process
  – Members have little time, it is difficult to sync schedules
    • TDG experts tend to be senior scientists
    • Continuous process vs. sporadic bursts of activity
  – Unpaid work
    • Project funding vs. wide acceptance in the community
    • However, a project might bootstrap a thematic domain
• The same problems hold for other TDGs
  – Current tendency to tie data category (selection) standardization to a new/revised standard, e.g., MAF and TBX
  – Redesign of the standardization process is coming up
    • ISO is not actively supporting Annex ST Standards as Databases anymore
Conclusions and future work
• Communities can already create a coherent view on ISOcat
  – the CMDI use case shows potential
  – maybe funder support is needed to bootstrap specific domains
• The standardized core will take (a long) time
  – like all standardization work
• Next to metadata also content
  – explicit semantics would be profitable even when not shared and/or used for resource discovery
  – tools that support ISOcat will make it easier to create such resources
• Companion registries:
  – relations between data categories (RELcat)
  – annotated schemas for language resources (SCHEMAcat)
  – interaction with the CLARIN vocabulary service (CLAVAS)
• Archives and infrastructures look at the resources as they are, i.e., in general no conversions to triples
• However, ISOcat data categories can easily be used in RDF resources:
    :partOfSpeech dcr:datcat <http://www.isocat.org/datcat/DC-396> ;
        rdfs:label "part of speech"@en ;
        rdfs:comment "A category assigned to a word based on its grammatical and semantic properties."@en .
• The Relation Registry, which is a triple store, will in general support lightweight, semi-formal ontologies
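A statement like the part-of-speech example above could be generated from a data category PID with a few lines of code. The sketch below uses only string formatting from the standard library. The helper names `turtle_literal` and `datcat_turtle`, the `example.org` base prefix, and the `dcr` prefix URI are assumptions for illustration, since the slide does not show any prefix declarations.

```python
# Prefix declarations are assumptions; the slide omits them.
PREFIXES = "\n".join([
    "@prefix : <http://example.org/vocab#> .",             # hypothetical base vocabulary
    "@prefix dcr: <http://www.isocat.org/ns/dcr.rdf#> .",  # assumed dcr prefix URI
    "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .",
])


def turtle_literal(text: str, lang: str = "en") -> str:
    """Return a language-tagged Turtle string literal, escaping backslashes and quotes."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"@{lang}'


def datcat_turtle(local_name: str, pid: str, label: str, comment: str) -> str:
    """Render one resource that points at an ISOcat data category via dcr:datcat."""
    lines = [
        f":{local_name} dcr:datcat <{pid}> ;",
        f"    rdfs:label {turtle_literal(label)} ;",
        f"    rdfs:comment {turtle_literal(comment)} .",
    ]
    return "\n".join(lines)


statement = datcat_turtle(
    "partOfSpeech",
    "http://www.isocat.org/datcat/DC-396",
    "part of speech",
    "A category assigned to a word based on its grammatical and semantic properties.",
)
print(PREFIXES + "\n\n" + statement)
```

In practice an RDF library would handle serialization and escaping more robustly; plain strings are used here only to keep the sketch self-contained.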
M. Windhouwer, S.E. Wright. Linking to linguistic data categories in ISOcat. LDL 2012.