Standards, Use and Prospects for Language Resource Management Key-Sun Choi 16 Aug. 2008 TII, Moscow
Jan 11, 2016
MOTIVATION
Wikipedia
• Web-based, collaboratively authored, multilingual encyclopedia
– 8.29 M pages / 253 languages (2007/9)
– 2.0 M pages / English (2007/9) ~ now 5.0 M pages
[Figure: Wikipedia category classification. The category page "Computer science" has the subcategories Algorithms, Databases and Computer scientists; example pages include Divide & Conquer, Parallel database, SQL, Martin Kay and Robert Watson.]
Problem: IS-A Relation Extraction from Wikipedia
• Relation classification from the category system
– By term formation rule and Wikipedia structure (Ponzetto & Strube, 2007)
[Figure: relation classification over the "Computer science" category. Each upper-lower category relation is classified as IS-A or Not IS-A; of the links to Algorithms, Databases and Computer scientists, two are labelled IS-A and one Not IS-A.]
Relation Extraction by Pattern
• (Ryu & Choi, 2007)
– http://cseight.kaist.ac.kr:8080/RelExt
[Figure: extracted relation "Text mode" IS-A "Computer display mode"]
Problem: IS-A Relation Extraction from Wiktionary
• Web-based collaborative multilingual dictionary
– 617,639 entries / 401 languages
• IS-A relation extraction from definition patterns
– http://cseight.kaist.ac.kr:8080/Wiktionary
[Figure: two IS-A relations extracted from dictionary definitions]
Problem: IS-A Relation Extraction from WordNet
• Semantic word net (English)
– 117,798 nouns, 82,115 synsets (Ver. 3.0)
– IS-A relation extraction through IS-A links between synsets
[Figure: synsets #12, #22, #23 and #33, covering "chemical engineering", "computer science, computing", "electrical engineering" and "engineering, applied science", connected by IS-A links]
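A minimal sketch of how word-level IS-A pairs can be read off IS-A links between synsets. The toy graph reuses the synsets named on the slide; which synset number belongs to which synset, and the choice of "engineering, applied science" as the common parent, are assumptions made for illustration only.

```python
# Toy synset inventory. The id-to-synset assignment is an assumption.
SYNSETS = {
    12: ["engineering", "applied science"],
    22: ["chemical engineering"],
    23: ["computer science", "computing"],
    33: ["electrical engineering"],
}

# IS-A (hypernym) links between synsets: child id -> parent id (assumed).
HYPERNYM = {22: 12, 23: 12, 33: 12}


def word_isa_pairs(synsets, hypernym):
    """Expand each synset-level IS-A link into (hyponym word, hypernym word) pairs."""
    pairs = []
    for child, parent in hypernym.items():
        for lower in synsets[child]:
            for upper in synsets[parent]:
                pairs.append((lower, upper))
    return pairs


pairs = word_isa_pairs(SYNSETS, HYPERNYM)
```

Every member of a child synset is paired with every member of its parent synset, so one synset-level link can yield several word-level IS-A pairs.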
LMF: Lexical Markup Framework
Wikipedia: IS-A Annotation
• IS-A (Entry, Term in Page)
• IS-A (Term in Page, Term in Page)
• Synonymy (Entry, Term in Page)
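What is the common representation?
• Graph structure
[Figure: pivot graph with a PIVOT node connected to nodes A, B, C, D, E and F]
[Figure: annotation graph over the string "New York-based", with nodes labelled cat: NNP, cat: NP, cat: PUNC / type: hyphen, cat: VBG, cat: JJ and cat: ADJP, and edges labelled role: alt]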
Linguistic Annotation Framework
• ISO-GrAF: Graph structure-based annotation
– GrAF XML schema type hierarchy
  • graphElementType; Attributes: ID, type
  • edgeType extends graphElementType
  • nodeType extends graphElementType
  • spanType extends nodeType; Attributes: start, end
  • graphElementSetType
  • edgeSetType extends graphElementSetType
  • nodeSetType extends graphElementSetType
  • featureStructureType
  • featureType
  • annotationSetType
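The type hierarchy above can be mirrored with a few Python dataclasses. This is only a sketch: the `source` and `target` fields on edges, the plain `features` dictionary standing in for featureStructureType, and the exact token and phrase structure chosen for "New York-based" are assumptions, not part of the GrAF schema as listed on the slide.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class GraphElement:            # graphElementType: ID, type
    id: str
    type: str = ""


@dataclass
class Node(GraphElement):      # nodeType extends graphElementType
    features: Dict[str, str] = field(default_factory=dict)


@dataclass
class Span(Node):              # spanType extends nodeType: start, end
    start: int = 0
    end: int = 0


@dataclass
class Edge(GraphElement):      # edgeType; source/target are assumed fields
    source: str = ""
    target: str = ""


TEXT = "New York-based"
spans = [
    Span("s1", "token", {"cat": "NNP"}, 0, 3),
    Span("s2", "token", {"cat": "NNP"}, 4, 8),
    Span("s3", "token", {"cat": "PUNC", "type": "hyphen"}, 8, 9),
    Span("s4", "token", {"cat": "VBG"}, 9, 14),
]
nodes = [Node("n1", "phrase", {"cat": "NP"}), Node("n2", "phrase", {"cat": "ADJP"})]
edges = [
    Edge("e1", "child", "n1", "s1"), Edge("e2", "child", "n1", "s2"),
    Edge("e3", "child", "n2", "n1"), Edge("e4", "child", "n2", "s3"),
    Edge("e5", "child", "n2", "s4"),
]


def covered_text(node_id):
    """Text covered by a node: follow edges down to spans, then slice TEXT."""
    by_id = {s.id: s for s in spans}
    stack, found = [node_id], []
    while stack:
        current = stack.pop()
        if current in by_id:
            found.append(by_id[current])
        stack.extend(e.target for e in edges if e.source == current)
    return TEXT[min(s.start for s in found):max(s.end for s in found)]
```

Because spans only point at character offsets, the primary text stays untouched and several annotation layers can share the same spans.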
Problem: Causality between Terms
• Causal relation between terms
• Term clustering based on inter-term causality
– Terms with similar causality tend to be similar concepts.
– Realization & Evaluation
[ Skin cancer ] usually appears in adulthood, but it is caused by [ sun exposure ] and [ sunburns ] that began in childhood.
[Figure: causal links among the biomedical terms IL-2, TG, Egr-1, IFN-gamma, Stat5 and Interleukin-2]
Is it true? Terms with similar causality tend to be similar concepts.
The oral bacteria that cause gum disease appear to be the culprit. Cigarette smoking and use of smokeless tobacco products may also cause gum disease. Gum disease is the second most common cause of toothache.
[Figure: causal graph linking cigarette, oral bacteria and smokeless tobacco product to gum disease, and gum disease to toothache]
Periodontal disease can lead to toothache. Cigarette smoking is the number one environmental risk for periodontal disease.
[Figure: causal graph linking cigarette to periodontal disease, and periodontal disease to toothache]
What to do
• Is it true?
– Terms with similar causality tend to be similar concepts
• We test term clustering based on causal information
– Show that causality is an effective feature for term clustering.
• Focus on
– Causal NP pair extraction (Chang and Choi, 2004)
– Causal term pair extraction
– Term clustering based on causal similarity
– Term clustering evaluation
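One way to make "similar causality" concrete is to compare the causes and effects that two terms share. The sketch below uses Jaccard overlap, which is an assumption chosen for illustration and not necessarily the similarity measure used in the talk; the cause-effect pairs are read off the gum disease and periodontal disease examples.

```python
# (cause, effect) pairs taken from the two example passages.
PAIRS = [
    ("oral bacteria", "gum disease"),
    ("cigarette smoking", "gum disease"),
    ("smokeless tobacco product", "gum disease"),
    ("gum disease", "toothache"),
    ("cigarette smoking", "periodontal disease"),
    ("periodontal disease", "toothache"),
]


def causal_profile(term, pairs):
    """Set of (role, other term) items: what causes the term and what it causes."""
    causes = {("cause", c) for c, e in pairs if e == term}
    effects = {("effect", e) for c, e in pairs if c == term}
    return causes | effects


def causal_similarity(a, b, pairs):
    """Jaccard overlap of two causal profiles (an assumed measure)."""
    pa, pb = causal_profile(a, pairs), causal_profile(b, pairs)
    union = pa | pb
    return len(pa & pb) / len(union) if union else 0.0


sim_related = causal_similarity("gum disease", "periodontal disease", PAIRS)
sim_unrelated = causal_similarity("gum disease", "toothache", PAIRS)
```

Under this measure "gum disease" and "periodontal disease" score 0.5 because they share one cause and one effect, while "gum disease" and "toothache" share nothing and score 0.0.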
Features on term clustering (1/3)
• Useful features for term clustering
– Internal feature
  • Word lexicon/structure in terms
  • (Bourigault and Jacquemin, 1999): POS sequences including insertion
    – NPDNInsAj = Noun1 ((Adv? Adj)^{0-3} Prep Det? (Adv? Adj)^{0-3}) Noun3
    – 93~98% precision
– Outer-term feature
  • Structural modifier/modifiee of term
  • Some words nearby term
  • (Maynard et al., 2000)
    – Hand-made semantic frame information
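The POS-sequence pattern cited from Bourigault and Jacquemin can be tried out as a regular expression over tag sequences. This is a sketch under assumptions: tags are given as the plain strings `Noun`, `Adv`, `Adj`, `Prep` and `Det`, and the whole sequence must match the pattern.

```python
import re

# (Adv? Adj) repeated 0 to 3 times, over space-separated tags.
INSERTION = r"(?:(?:Adv )?Adj ){0,3}"

# Noun ((Adv? Adj)0-3 Prep Det? (Adv? Adj)0-3) Noun
PATTERN = re.compile(rf"^Noun {INSERTION}Prep (?:Det )?{INSERTION}Noun$")


def matches_npdn_insaj(tags):
    """True if the tag sequence fits the noun-preposition-noun pattern with insertions."""
    return PATTERN.match(" ".join(tags)) is not None


plain = matches_npdn_insaj(["Noun", "Prep", "Noun"])
inserted = matches_npdn_insaj(["Noun", "Adj", "Prep", "Det", "Adv", "Adj", "Noun"])
too_many = matches_npdn_insaj(["Noun", "Adj", "Adj", "Adj", "Adj", "Prep", "Noun"])
```

The first two sequences are accepted; the third is rejected because it inserts four adjectives where the pattern allows at most three.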
Feature Structure Representation
(1) Employee
• {<SEX, female>, <NAME, Sandy Jones>, <AGE, 30>}
(2) Sound segment /p/
• {<CONSONANTAL, +>, <ANTERIOR, +>, <VOICED, ->, <CONTINUANT, ->}
(3) Grammatical features of the verb 'love'
• {<POS, verb>, <VALENCE, transitive>, <SEMANTIC_RELATION, loving>}
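A flat feature structure such as example (3) maps naturally onto a dictionary, from which both a graph view and a matrix view can be produced. This is a minimal sketch; the root node name `fs0` and the text layout of the matrix are assumptions for illustration.

```python
# Example (3): grammatical features of the verb 'love'.
love = {"POS": "verb", "VALENCE": "transitive", "SEMANTIC_RELATION": "loving"}


def to_graph(fs, root="fs0"):
    """Graph notation: one labelled edge per feature, from the root node to its value."""
    return [(root, feature, value) for feature, value in fs.items()]


def to_matrix(fs):
    """Matrix (attribute-value) notation as plain text, one feature per row."""
    width = max(len(feature) for feature in fs)
    return "\n".join(f"[ {f.ljust(width)}  {v} ]" for f, v in fs.items())


graph = to_graph(love)
matrix = to_matrix(love)
```

Both views carry the same feature-value pairs; only the presentation differs.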
FSR: Graph vs. Matrix Notation
Related Works on term clustering (3/3)
• Discussion
– Causal information is one kind of "long-distance contextual information"
Cigarette smoking and use of smokeless tobacco products may also cause gum disease.
[Figure: dependency path from "smokeless tobacco product" through "use" and "cause" to "gum disease"]
Event & ternary extraction
Dependency structure
Skin cancer usually appears in adulthood, but it is caused by sun exposure and sunburns that began in childhood.
[Figure: dependency structure of the sentence, linking the verbs "appears", "caused" and "began" to the noun phrases "skin cancer", "adulthood", "sun exposure", "sunburns" and "childhood"]
Causal event pair candidate: <cause event, cue phrase, effect event>
– Skin cancer – RNP caused by CNP – sun exposure
– Skin cancer – RNP caused by CNP – sunburns
Processing steps: NP chunking, reference finding, cue phrase filtering, verb selection
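A minimal sketch of candidate extraction for the single cue phrase "caused by". It assumes NP chunks are already marked with square brackets, as in the skin cancer example, and it replaces real reference finding with a naive rule: the effect is the last NP before the cue. Dependency parsing and verb selection are not modelled.

```python
import re

CUE = "caused by"
NP = re.compile(r"\[\s*(.+?)\s*\]")


def causal_candidates(chunked):
    """Return (cause, cue, effect) triples for the pattern 'RNP caused by CNP'."""
    cue_pos = chunked.find(CUE)
    if cue_pos < 0:
        return []
    before = NP.findall(chunked[:cue_pos])
    if not before:
        return []
    effect = before[-1]  # naive stand-in for reference finding ("it")
    # Cause NPs: bracketed chunks directly after the cue, optionally joined by "and".
    tail = chunked[cue_pos + len(CUE):]
    m = re.match(r"\s*((?:\[[^\]]+\]\s*(?:and\s*)?)+)", tail)
    causes = NP.findall(m.group(1)) if m else []
    return [(cause, CUE, effect) for cause in causes]


sentence = ("[ Skin cancer ] usually appears in adulthood, but it is caused by "
            "[ sun exposure ] and [ sunburns ] that began in childhood .")
candidates = causal_candidates(sentence)
```

The triples are emitted in <cause event, cue phrase, effect event> order, so the result NP "Skin cancer" appears last.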
Representation Scheme
• Morpho-syntactic Annotation Framework
• Syntactic Annotation Framework
Morpho-Syntactic Annotation Framework: MAF
<token id="t1">to</token>
<token id="t2">eventually</token>
<token id="t3">decide</token>
<wordForm lemma="to_decide" tokens="t1 t3"/>
<wordForm lemma="eventually" tokens="t2"/>
MAF: token
<token id="t1">The</token>
<token id="t2">victim</token>
<token id="t3">'s</token>
<token id="t4">friends</token>
<token id="t5">told</token>
<token id="t6">police</token>
<token id="t7">that</token>
<token id="t8">Krueger</token>
<token id="t9">drove</token>
<token id="t10">into</token>
<token id="t11">the</token>
<token id="t12">quarry</token>
<token id="t13">and</token>
<token id="t14">never</token>
<token id="t15">surfaced</token>
<token id="t16">.</token>
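The separation of tokens from word forms can be illustrated by resolving each wordForm back to the tokens it references. This is a sketch using Python's standard XML parser; the enclosing `<maf>` root element is an assumption added only so the fragment parses as one document.

```python
import xml.etree.ElementTree as ET

# The <maf> wrapper is assumed; tokens and word forms follow the slide example.
MAF = """<maf>
  <token id="t1">to</token>
  <token id="t2">eventually</token>
  <token id="t3">decide</token>
  <wordForm lemma="to_decide" tokens="t1 t3"/>
  <wordForm lemma="eventually" tokens="t2"/>
</maf>"""

root = ET.fromstring(MAF)

# Token layer: id -> surface string.
tokens = {t.get("id"): t.text for t in root.iter("token")}

# Word form layer: lemma plus the surface tokens it spans.
word_forms = [
    (wf.get("lemma"), [tokens[ref] for ref in wf.get("tokens").split()])
    for wf in root.iter("wordForm")
]
```

The word form `to_decide` spans the non-adjacent tokens t1 and t3, which is why word forms reference tokens by id instead of wrapping them.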
Syntactic Annotation Framework
Semantic Annotation Framework: TimeML
• no more than 60 days
– <TIMEX3 tid="t1" type="DURATION" value="P60D" mod="EQUAL_OR_LESS">no more than 60 days</TIMEX3>
• the dawn of 2000
– <TIMEX3 tid="t2" type="DATE" value="2000" mod="START">the dawn of 2000</TIMEX3>
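A small sketch of how such TIMEX3 annotations might be consumed: the element is parsed into a dictionary, and a day-based duration value such as `P60D` is turned into a `timedelta`. The helper names are hypothetical, and only the `P<n>D` form of the duration value is handled.

```python
import re
import xml.etree.ElementTree as ET
from datetime import timedelta


def parse_timex3(xml):
    """Read the attributes and text of a single TIMEX3 element."""
    el = ET.fromstring(xml)
    return {
        "tid": el.get("tid"),
        "type": el.get("type"),
        "value": el.get("value"),
        "mod": el.get("mod"),
        "text": (el.text or "").strip(),
    }


def duration_days(value):
    """Convert a day-only duration like 'P60D' to a timedelta; None otherwise."""
    m = re.fullmatch(r"P(\d+)D", value)
    return timedelta(days=int(m.group(1))) if m else None


t1 = parse_timex3(
    '<TIMEX3 tid="t1" type="DURATION" value="P60D" '
    'mod="EQUAL_OR_LESS"> no more than 60 days </TIMEX3>'
)
upper_bound = duration_days(t1["value"])
```

Here `mod="EQUAL_OR_LESS"` says the 60 days are an upper bound, so a consumer would treat `upper_bound` as a maximum and not as an exact length.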
ONTOLOGY EXTRACTION/LEARNING AND QUESTION-ANSWERING
Word Segmentation
MULTILINGUAL INFORMATION FRAMEWORK
IT Ontology
[Figure: IT Core Ontology. Concepts: software system, embedded software, embedded system, appliance, OS, middleware, embedded OS, platform, app. program, dev. env., comm. middleware, browser, media player, DVD player, set-top box, MP3 player, digital camera, vendor, real-time embedded OS (RTOS), non-real-time embedded OS. Instances: VxWorks, pSOS, VRTX, WinCE, Microsoft, Wind River. Relations: consists_of, reside_on, vendor.]
A Scenario
[Figure: dialogue scenario involving the User, Control Server, Ontology Reasoner and Rule Reasoner]
– What is the best RTOS vendor? Do you know?
– No
– What is RTOS?
– Real-time Operating System
– What are instances?
– VxWorks
– Vendor?
– Wind River, Microsoft
– Which is better?
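The ontology lookups in this scenario can be sketched as queries over a small triple store. The predicate names `label` and `instance_of`, the placement of WinCE under "Embedded OS", and the triples themselves are assumptions based on the IT ontology figure, not the system's actual data model.

```python
# Assumed triples; instance and vendor assignments are illustrative.
TRIPLES = [
    ("RTOS", "label", "Real-time Operating System"),
    ("VxWorks", "instance_of", "RTOS"),
    ("pSOS", "instance_of", "RTOS"),
    ("VRTX", "instance_of", "RTOS"),
    ("WinCE", "instance_of", "Embedded OS"),
    ("VxWorks", "vendor", "Wind River"),
    ("WinCE", "vendor", "Microsoft"),
]


def query(subject=None, predicate=None, obj=None):
    """Return all triples matching the given fields; None acts as a wildcard."""
    return [
        (s, p, o) for s, p, o in TRIPLES
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]


def instances_of(concept):
    return [s for s, _, _ in query(predicate="instance_of", obj=concept)]


def vendor_of(product):
    return [o for _, _, o in query(subject=product, predicate="vendor")]


rtos_instances = instances_of("RTOS")
vxworks_vendor = vendor_of("VxWorks")
```

Questions such as "What is RTOS?", "What are instances?" and "Vendor?" reduce to lookups like these; "Which is better?" needs rules beyond the ontology, which is presumably the role of the rule reasoner.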
Dialogue acts
Well-known examples of communicative functions ("core dialogue acts"):
• question
  – WH-question
  – YN-question
  – check/verification
• statement/inform
• answer (WH-answer, YN-answer)
• confirmation, disconfirmation
• request
• instruct
• promise
• acknowledgement
• greeting
General-purpose functions
Applicable in any dimension are:
• Information-seeking functions: WH-question, YN-question, Alternatives-question, Check, ...
• Information-providing functions: Inform, WH-Answer, YN-Answer, Confirmation, Disconfirmation, Agreement, Correction, ...
• Commissive functions: Offer, Promise, AcceptRequest, ...
• Directive functions: Instruct, Request, Suggest, ...
DiaML concrete syntax
From sentence to ontologies
Sentence
A camera is a device that takes video.
Term recognition
A [camera] is a [device] that [take]s [video].
Dependency analysis
[Figure: dependency structure over the words camera, is, device, that, takes, video]
Triplets extraction
(camera, ISA, device)
(camera, hasPropertyOf, that AND (take video))
Ontology
[Figure: ontology fragment with the nodes camera, device, artifact, video, contents, ···]
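For the single definition pattern "A X is a Y that Vs Z", the triplet extraction step can be sketched with a regular expression. This is an assumption-laden stand-in for the term recognition and dependency analysis named above: it handles only one-word terms and verbs that form the third person by adding "s".

```python
import re

# "A/An TERM is a/an GENUS that VERBs OBJECT."
PATTERN = re.compile(
    r"^An? (?P<term>\w+) is an? (?P<genus>\w+) "
    r"that (?P<verb>\w+?)s (?P<obj>\w+)\.$"
)


def extract_triplets(sentence):
    """Return an ISA triplet and a hasPropertyOf triplet, or [] if no match."""
    m = PATTERN.match(sentence)
    if not m:
        return []
    term, genus = m.group("term"), m.group("genus")
    prop = f"that AND ({m.group('verb')} {m.group('obj')})"
    return [(term, "ISA", genus), (term, "hasPropertyOf", prop)]


triplets = extract_triplets("A camera is a device that takes video.")
```

The genus term gives the IS-A link into the ontology, while the relative clause becomes a property restriction on the defined term.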
Standards for language processing
• Primary resources (text, dialogues)
• Structural mark-up, basic annotations [TEI, MPEG7, TMX (XHTML…), etc.]
• NLP structures (annotations) [Eagles/ISLE, CES, MATE, …]
– POS tagging
– Chunks (cf. Named Entities)
– Deep syntactic structures
– Co-references etc.
• Knowledge structures [Topic Maps, OIL, RDF]
– Hierarchies of types
– Relations between concepts (subjects/topics etc.)
– Links to primary resources
• Lexical structures (language models) [TBX, OLIF, Eagles/ISLE (Genelex)]
– Terminologies
– Transfer lexica
– LTAG/HPSG/LFG lexica
• Links
• Meta-data [Dublin Core, OLAC, ISLE, MPEG7, RDF]
• Access protocols [Corba, SOAP]
Context
• ISO TC37 - Terminology and other language resources
– SC3 - Computer applications in terminology
  • ISO 12200 - Martif
    – Latest version of TEI Terminology chapter
  • ISO 12620 - Data categories
  • ISO CD (DIS: under ballot) 16642 - TMF (Terminological Markup Framework)
– SC4 - Language resources
TC37/SC4 details
• Scope: Platform for designing and implementing linguistic resource formats and processes
– Multi-layer annotation of linguistic resources
– Exchange of information between NLP modules
• General strategy
– Involve a wide community from academia and industry
  • Identification of experts in the various work items
  • Involvement through national standardizing bodies
• Agenda
– Current: identification of possible work items and working groups
– Constituency meeting and technical workshop at LREC (May 2002)
Organization
• Chair:
– Laurent Romary, France
• Secretary:
– Key-Sun Choi, Korea
• International Advisory Committee
– Chair: Prof. Antonio Zampolli, Italy
SC4 and other standardizing bodies
• W3C - basic protocols and formats: XML (Schemas), XPath, XPointer + RDF, SVG, SMIL, SOAP
• MPEG - multimedia, XML based, e.g. MPEG7-4, word and phone lattices
• ISO TC37/SC4 - language resources, NLP perspective, e.g. linguistic annotations, lexical formats
• TEI - text representation, reference for primary sources, e.g. text archives
[Figure labels: Text, Audio/Speech, Technical background, Oscar]
What about gestures?
• Kinesic in the TEI
• SMIL?
TC37/SC4 Work Items
• WG1/WI-0: Terminology of Language Resources
• WG1/WI-1: Linguistic annotation framework
• WG1/WI-2: Meta-data for multimodal and multilingual information
• WG2/WI-3: Structural content representation scheme
• WG2/WI-4: Multimodal content representation scheme
• WG2/WI-5: Discourse level representation scheme
TC37/SC4 Work Items - cont.
• WG3/WI-6a: Multilingual text representation
• WG4/WI-7: NLP Lexica
• WG5/WI-8: Net-based distributed cooperative work for the creation of LRs
WI-0
• Terminology of Language Resources
– Basic terminology of the various sub-fields of language resources and general methodology
– Project leader: Klaus-Dirk Schmitz
– Sources:
  • ISO 1087
  • LREC proceedings + KAIST
  • English dictionaries in Linguistics?
– Support from GTW
WI-1
• Linguistic annotation framework
– Basic mechanisms and data structures for linguistic annotation and representation [data architecture]
  • Methods and principles for the design of an annotation scheme
  • Structural nodes and information units, Data category specification
  • Linking and pointing mechanisms, Feature Structures, Meta-Markup
  • "Stand-off" and "in-line" views - equivalences, combining levels
  • Administrative data categories
WI-1 - cont.
– Project leader: Nancy Ide (TBC)
– Contributors: Alan Melby, Koiti Hasida, Lee Gillam, Yves Savourel, Laurent Romary…
– Possible sources:
  • TMF, ISO 12620 (revised), Mate (general methodology)
  • TEI (Linking mechanisms, feature structures)
  • Link with Linguistic DS
WI-2
• Meta-data for multimodal and multilingual information
– Description of a meta-data representation scheme to document linguistic information structures and processes
  • General content description
  • Local content description
– Project leader: Peter Wittenburg, MPI (Nijmegen, NL)
– Participants: Steven Bird, TEI aware person
– Possible sources:
  • OLAC, Mile, TEI Header
– Liaison: TC46 (SC9), MPEG7/MDS, SCORM
WI-3
• Structural content representation scheme
– Definition of annotation/representation scheme(s) for morpho-syntax and syntax, to be used for annotation and interchange purposes
  • Meta-model for morpho-syntactic annotation
  • Meta-model(s) for syntactic annotation (lexicalized grammar, elementary trees, dependency structures)
  • + corresponding Data category registries
– Project leader: John Carroll ??
– Participants: Nuria Bell
– Possible sources:
  • Eagles, TAGML, Linguistic DS
  • SIGPARSE
  • Working group with representatives from existing TreeBanks initiatives
WI-4
• Multimodal meaning representation scheme
– Representation scheme for the semantic content of multimodal information (textual, spoken, graphical and gestural)
  • Meta-model for content representation (Events, participants, etc.)
  • Data category registry for multimodal content
– Project leader: Harry Bunt (id="1")
– Possible sources:
  • SIGSEM working group on semantic content
    – Chair: #1
– "Liaison"
  • Semantic web activities
WI-5
• Discourse level representation scheme
  • Meta-model for discourse and dialogue representation
  • Meta-model for discourse level annotation (e.g. reference annotation)
  • + corresponding DatCat registry
– Possible sources:
  • SIGDIAL
  • DRI - Discourse Resource Initiative
  • Mate
WI-6
• Multilingual text representation scheme
– Framework for representing language specific and multi-lingual textual information
  • Translation Memory
  • Alignment – Parallel Corpora
  • Word count algorithms (characters, words, segments)
– Possible sources:
  • TMX for translation memories
  • TEI based linking mechanism (or see WI-1) for Parallel texts
• WI 6A
  • Translation Memory, Alignment of parallel corpora
– Sources:
  • OSCAR/TMX for translation memories
  • TEI based linking mechanism (or see WI-1) for Parallel texts
• WI 6b
  • Segmentation and counting algorithms (characters, words, sentences etc.)
– Sources:
  • OSCAR
• WI 6C
  • Meta-markup for GIL (Globalization, Internationalization and Localization)
– Possible sources:
  • OSCAR/OpenTag
WI-7
• NLP lexica
– Lexicon representation formats for the various types of NLP applications (Machine Readable Lexica)
  • Define a set of meta-models (classes of applications)
  • Specific data categories (derivation, phonology, etc.)
  • Based on the work done in other work items
– Sources
  • Eagles/multext
  • ISLE Computational Working group/Genelex
  • OLIF
WI-8
• Net-based distributed cooperative work for the creation of LRs
– Principles and methods for designing collaborative and cooperative compilation of LRs
– Define what is specific to LRs with regard to:
  • Traceability of resources, version control, validation, quality management
  • Protocols (Corba, SOAP), Workflow standards, Data management
– Contacts: Christian Galinski, Remi Zajac
– Sources
Liaison - OSCAR
• Brief history of LR exchange standards
• Parallel events since 1997
– Open Tag - meta-markup (XML vs. others)
• Major current OSCAR activities
– TMX - Translation Memory eXchange
– Counting and segmentation algorithms
– TBX (Terminologies) and OLIF (MT lexica)
– XLIFF and CGS - Annotation of source code and localisation of web sites
– xml:lang etc.: J. DeCamp and S.-E. Wright
Liaison - TEI
– General architecture and data modeling
  • WI-1
– Annotations (paragraph level, external annotations)
  • WI-1
– TEI Header
  • WI-2
– NLP lexica
  • WI-7
– Feature structures
  • WI-1