Standards, Use and Prospects for Language Resource Management


Key-Sun Choi, 16 Aug. 2008, TII, Moscow

MOTIVATION

Wikipedia

• Web-based, collaboratively authored multilingual encyclopedia
  – 8.29 M pages / 253 languages (2007/9)
  – 2.0 M pages / English (2007/9) ~ now 5.0 M pages

[Figure: Wikipedia category classification. The category "Computer science" contains the subcategories "Algorithms", "Databases" and "Computer scientists"; the category pages shown are "Martin Kay", "Robert Watson", "Parallel database", "SQL" and "Divide & Conquer".]

Problem: IS-A Relation Extraction from Wikipedia

• Relation Classification from Category System
  – By Term Formation Rule, Wikipedia Structure (Ponzetto & Strube, 2007)

[Figure: relation classification of the upper-lower level category relations under "Computer science". The links to "Algorithms", "Databases" and "Computer scientists" are labelled, in order, IS-A, IS-A and Not IS-A.]

Relation Extraction by Pattern

• (Ryu & Choi, 2007)
  – http://cseight.kaist.ac.kr:8080/RelExt

[Figure: "Text mode" IS-A "Computer display mode".]

Problem: IS-A Relation Extraction from Wiktionary

• Web-based Collaborative Multilingual Dictionary
  – 617,639 entries / 401 languages
• ISA relation extraction from Definition Pattern
  – http://cseight.kaist.ac.kr:8080/Wiktionary


Problem: IS-A Relation Extraction from WordNet

• Semantic Word Net (English)
  – 117,798 nouns, 82,115 synsets (Ver. 3.0)
  – ISA relation extraction through ISA between synsets

[Figure: synsets #12, #22, #23 and #33. "chemical engineering", "computer science, computing" and "electrical engineering" are linked by IS-A to "engineering, applied science".]
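The extraction of term-level IS-A relations through IS-A links between synsets can be sketched as follows. This is only an illustration under stated assumptions: the synset identifiers and the two tables are hypothetical stand-ins for the WordNet database, filled with the four synsets of the example, and the code is not the implementation used for the results above.

```python
from itertools import product

# Hypothetical miniature of WordNet: synset ID -> member terms.
SYNSETS = {
    "s_eng": ["engineering", "applied science"],
    "s_chem": ["chemical engineering"],
    "s_cs": ["computer science", "computing"],
    "s_elec": ["electrical engineering"],
}

# Hypothetical IS-A (hypernym) links between synsets: child -> parent.
HYPERNYM_OF = {"s_chem": "s_eng", "s_cs": "s_eng", "s_elec": "s_eng"}


def term_isa_pairs(synsets, hypernym_of):
    """Expand each synset-level IS-A link into term-level (hyponym, hypernym) pairs."""
    pairs = []
    for child, parent in hypernym_of.items():
        pairs.extend(product(synsets[child], synsets[parent]))
    return pairs


pairs = term_isa_pairs(SYNSETS, HYPERNYM_OF)
for hyponym, hypernym in pairs:
    print(f"{hyponym} IS-A {hypernym}")
```

Every member of a child synset is paired with every member of its parent synset, so one synset link can yield several term pairs.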

LMF: Lexical Markup Framework

Wikipedia: IS-A Annotation

• IS-A (Entry, Term in Page)
• IS-A (Term in Page, Term in Page)
• Synonymy (Entry, Term in Page)

What is a common representation?

• Graph Structure

[Figure: a PIVOT node linked to the nodes A to F, and an annotation graph over the string "New York-based" with node labels cat: NP, cat: NNP, cat: JJ, cat: PUNC (type: hyphen), cat: VBG and cat: ADJP, and edges labelled role: alt.]

Linguistic Annotation Framework

• ISO-GrAF: Graph Structure-based Annotation
  – GrAF XML schema type hierarchy
    • graphElementType; Attributes: ID, type
    • edgeType extends graphElementType
    • nodeType extends graphElementType
    • spanType extends nodeType; Attributes: start, end
    • graphElementSetType
    • edgeSetType extends graphElementSetType
    • nodeSetType extends graphElementSetType
    • featureStructureType
    • featureType
    • annotationSetType
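The type hierarchy above can be mirrored in a few classes to see how spans, nodes, edges and features fit together. The sketch below is a rough analogy, not the GrAF schema itself: the `source` and `target` fields on edges, the `features` mapping, and the category labels attached to "New York-based" are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class GraphElement:            # graphElementType: attributes ID, type
    id: str
    type: str


@dataclass
class Node(GraphElement):      # nodeType extends graphElementType
    pass


@dataclass
class Span(Node):              # spanType extends nodeType: start, end
    start: int
    end: int


@dataclass
class Edge(GraphElement):      # edgeType extends graphElementType
    source: str                # assumed field: ID of the parent node
    target: str                # assumed field: ID of the child node


@dataclass
class AnnotationSet:           # annotationSetType
    nodes: List[Node] = field(default_factory=list)   # plays the role of nodeSetType
    edges: List[Edge] = field(default_factory=list)   # plays the role of edgeSetType
    # Assumed layout: element ID -> flat feature structure.
    features: Dict[str, Dict[str, str]] = field(default_factory=dict)


def covered_text(aset, node_id, text):
    """Text covered by the spans that a node points to through its edges."""
    spans = {n.id: n for n in aset.nodes if isinstance(n, Span)}
    parts = [spans[e.target] for e in aset.edges
             if e.source == node_id and e.target in spans]
    parts.sort(key=lambda s: s.start)
    return " ".join(text[s.start:s.end] for s in parts)


text = "New York-based"
aset = AnnotationSet()
for i, (start, end) in enumerate([(0, 3), (4, 8), (8, 9), (9, 14)], start=1):
    aset.nodes.append(Span(id=f"s{i}", type="span", start=start, end=end))
aset.nodes.append(Node(id="n1", type="node"))
aset.edges += [Edge("e1", "edge", "n1", "s1"), Edge("e2", "edge", "n1", "s2")]
aset.features.update({
    "n1": {"cat": "NP"},
    "s3": {"cat": "PUNC", "type": "hyphen"},
    "s4": {"cat": "VBG"},
})

print(covered_text(aset, "n1", text))
```

Because annotations refer to spans by character offsets, the primary text stays untouched and several annotation layers can point at the same region.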

Problem: Causality between Terms

• Causal relation between terms
• Term clustering based on inter-term causality
  – Terms with similar causality tend to be similar concepts.
  – Realization & Evaluation

[ Skin cancer ] usually appears in adulthood, but it is caused by [ sun exposure ] and [ sunburns ] that began in childhood.

[Figure: causal network over the terms IL-2, TG, Egr-1, IFN-gamma, Stat5 and Interleukin-2.]

Is it true? Terms with similar causality tend to be similar concepts.

The oral bacteria that cause gum disease appear to be the culprit. Cigarette smoking and use of smokeless tobacco products may also cause gum disease. Gum disease is the second most common cause of toothache.

[Figure: causal graph with the terms cigarette, oral bacteria, smokeless tobacco product, gum disease and toothache.]

Periodontal disease can lead to toothache. Cigarette smoking is the number one environmental risk for periodontal disease.

[Figure: causal graph with the terms cigarette, periodontal disease and toothache.]

What to do

• Is it true?
  – Terms with similar causality tend to be similar concepts
• We test term clustering based on causal information
  – Show that causality is an effective feature for term clustering.
• Focus on
  – Causal NP pair extraction (Chang and Choi, 2004)
  – Causal term pair extraction
  – Term clustering based on causal similarity
  – Term clustering evaluation
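Term clustering based on causal similarity can be illustrated with the gum disease example. The sketch below assumes a Jaccard overlap of each term's causes and effects as the similarity measure and a simple threshold-based merge as the clustering step. Both are choices made here for illustration, since the slides do not specify the measure or the algorithm.

```python
from itertools import combinations

# (cause, effect) term pairs read off the two example passages.
CAUSAL_PAIRS = [
    ("oral bacteria", "gum disease"),
    ("cigarette", "gum disease"),
    ("smokeless tobacco product", "gum disease"),
    ("gum disease", "toothache"),
    ("cigarette", "periodontal disease"),
    ("periodontal disease", "toothache"),
]


def causal_profile(term, pairs):
    """Tagged neighbours of a term: its causes and its effects."""
    causes = {("cause", c) for c, e in pairs if e == term}
    effects = {("effect", e) for c, e in pairs if c == term}
    return causes | effects


def causal_similarity(a, b, pairs):
    """Jaccard overlap of two causal profiles (an assumed measure)."""
    pa, pb = causal_profile(a, pairs), causal_profile(b, pairs)
    union = pa | pb
    return len(pa & pb) / len(union) if union else 0.0


def cluster(terms, pairs, threshold=0.5):
    """Merge the clusters of any two terms whose similarity reaches the threshold."""
    clusters = [{t} for t in terms]
    for a, b in combinations(terms, 2):
        if causal_similarity(a, b, pairs) >= threshold:
            ca = next(c for c in clusters if a in c)
            cb = next(c for c in clusters if b in c)
            if ca is not cb:
                clusters.remove(cb)
                ca |= cb
    return clusters


terms = sorted({t for pair in CAUSAL_PAIRS for t in pair})
clusters = cluster(terms, CAUSAL_PAIRS)
print(causal_similarity("gum disease", "periodontal disease", CAUSAL_PAIRS))
print(clusters)
```

"Gum disease" and "periodontal disease" share a cause (cigarette) and an effect (toothache), so they fall into one cluster even though the two terms never co-occur in the same passage.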

Features on term clustering (1/3)

• Useful features for term clustering
  – Internal feature
    • Word lexicon/structure in terms
    • (Bourigault and Jacquemin, 1999): POS sequences including insertion
      – NPDNInsAj = Noun1 ((Adv? Adj)0-3 Prep Det? (Adv? Adj)0-3) Noun3
      – 93~98% precision
  – Outer-term feature
    • Structural modifier/modifiee of term
    • Some words nearby term
    • (Maynard et al., 2000)
      – Hand-made semantic frame information
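The POS-sequence pattern NPDNInsAj can be tried out as a regular expression over tag sequences. This is a minimal sketch: the coarse tag names (NOUN, ADV, ADJ, PREP, DET) are assumed here, and a real tagger's tagset would have to be mapped onto them first.

```python
import re

# Optional insertion: up to three adjectives, each optionally preceded by an adverb.
INSERTION = r"(?:(?:ADV )?ADJ ){0,3}"

# Noun1 (insertion) Prep Det? (insertion) Noun3, over space-separated tags.
NPDN_INS_AJ = re.compile(rf"NOUN {INSERTION}PREP (?:DET )?{INSERTION}NOUN")


def matches_npdn_ins_aj(tags):
    """True if the whole tag sequence is an instance of the pattern."""
    return NPDN_INS_AJ.fullmatch(" ".join(tags)) is not None


print(matches_npdn_ins_aj(["NOUN", "PREP", "DET", "NOUN"]))
print(matches_npdn_ins_aj(["NOUN", "ADJ", "PREP", "ADV", "ADJ", "NOUN"]))
print(matches_npdn_ins_aj(["NOUN", "NOUN"]))
```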

Feature Structure Representation

(1) Employee
• {<SEX, female>, <NAME, Sandy Jones>, <AGE, 30>}
(2) Sound segment /p/
• {<CONSONANTAL, +>, <ANTERIOR, +>, <VOICED, ->, <CONTINUANT, ->}
(3) Grammatical features of the verb 'love'
• {<POS, verb>, <VALENCE, transitive>, <SEMANTIC_RELATION, loving>}
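The three feature structures above are sets of feature-value pairs, so they map directly onto dictionaries. The sketch below renders such a dictionary as labelled edges (graph notation) and as bracketed rows (matrix notation). The helper names and the root label `fs0` are hypothetical, and the output format is not the ISO FSR XML serialization.

```python
employee = {"SEX": "female", "NAME": "Sandy Jones", "AGE": 30}
segment_p = {"CONSONANTAL": "+", "ANTERIOR": "+", "VOICED": "-", "CONTINUANT": "-"}
verb_love = {"POS": "verb", "VALENCE": "transitive", "SEMANTIC_RELATION": "loving"}


def to_graph_edges(fs, root="fs0"):
    """Graph notation: one labelled edge from the root node to each value."""
    return [(root, feature, value) for feature, value in fs.items()]


def to_matrix(fs):
    """Matrix notation: one bracketed 'FEATURE value' row per pair."""
    width = max(len(feature) for feature in fs)
    return "\n".join(f"[ {feature.ljust(width)}  {value} ]"
                     for feature, value in fs.items())


print(to_graph_edges(employee))
print(to_matrix(verb_love))
```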

FSR: Graph vs. Matrix Notation


Related Works on term clustering (3/3)

• Discussion
  – Causal information is one kind of "long-distance contextual information"

Cigarette smoking and use of smokeless tobacco products may also cause gum disease.

[Figure: dependency path linking "Smokeless tobacco product" and "Gum disease" through the words "use" and "cause".]


Event & ternary extraction

Dependency Structure

Skin cancer usually appears in adulthood, but it is caused by sun exposure and sunburns that began in childhood.

[Figure: dependency structure of the sentence, with the verbs "appears", "caused" and "began" and the terms "Skin cancer", "adulthood", "Sun exposure", "Sunburns" and "child".]

Causal event pair candidate: <cause event, cue phrase, effect event>
• Skin cancer – RNP caused by CNP – sun exposure
• Skin cancer – RNP caused by CNP – sunburns

Steps: NP chunking, reference finding, cue phrase filtering, verb selection
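The candidate extraction for the skin cancer sentence can be sketched on pre-chunked input. Everything below is a toy under stated assumptions: the chunk labels (NP, PRON, CUE, CONJ, O) are hypothetical, the chunking is written by hand instead of produced by a chunker, and a pronoun is resolved to the sentence-initial NP, which is a deliberately naive stand-in for reference finding.

```python
# Hand-written chunking of the example sentence (assumed chunker output).
CHUNKS = [
    ("Skin cancer", "NP"), ("usually appears in", "O"), ("adulthood", "NP"),
    (", but", "O"), ("it", "PRON"), ("is", "O"), ("caused by", "CUE"),
    ("sun exposure", "NP"), ("and", "CONJ"), ("sunburns", "NP"),
    ("that began in", "O"), ("childhood", "NP"), (".", "O"),
]


def extract_causal_candidates(chunks):
    """Return (result NP, cue phrase, cause NP) candidates around each cue phrase."""
    first_np = next((text for text, kind in chunks if kind == "NP"), None)
    candidates = []
    for i, (cue, kind) in enumerate(chunks):
        if kind != "CUE":
            continue
        # Result NP (RNP): nearest NP or pronoun to the left of the cue.
        left = next(((t, k) for t, k in reversed(chunks[:i])
                     if k in ("NP", "PRON")), None)
        if left is None:
            continue
        rnp = first_np if left[1] == "PRON" else left[0]
        # Cause NPs (CNP): the coordinated NPs directly to the right of the cue.
        j = i + 1
        while j < len(chunks) and chunks[j][1] in ("NP", "CONJ"):
            if chunks[j][1] == "NP":
                candidates.append((rnp, cue, chunks[j][0]))
            j += 1
    return candidates


candidates = extract_causal_candidates(CHUNKS)
for rnp, cue, cnp in candidates:
    print(f"{rnp} - RNP {cue} CNP - {cnp}")
```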

Representation Scheme

• Morpho-syntactic Annotation Framework
• Syntactic Annotation Framework

Morpho-Syntactic Annotation Framework: MAF

<token id="t1">to</token>
<token id="t2">eventually</token>
<token id="t3">decide</token>
<wordForm lemma="to_decide" tokens="t1 t3"/>
<wordForm lemma="eventually" tokens="t2"/>
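The MAF fragment shows that a word form can refer to non-adjacent tokens ("to ... decide"). A short sketch of resolving those references with the Python standard library follows. The enclosing `<maf>` root element is an assumption added only to make the fragment a well-formed document.

```python
import xml.etree.ElementTree as ET

# The <maf> wrapper is hypothetical; the children are the slide's fragment.
MAF_FRAGMENT = """
<maf>
  <token id="t1">to</token>
  <token id="t2">eventually</token>
  <token id="t3">decide</token>
  <wordForm lemma="to_decide" tokens="t1 t3"/>
  <wordForm lemma="eventually" tokens="t2"/>
</maf>
"""


def word_forms(xml_text):
    """Map each wordForm to its lemma and the surface tokens it points to."""
    root = ET.fromstring(xml_text)
    tokens = {t.get("id"): t.text for t in root.iter("token")}
    return [
        (wf.get("lemma"), [tokens[ref] for ref in wf.get("tokens").split()])
        for wf in root.iter("wordForm")
    ]


forms = word_forms(MAF_FRAGMENT)
print(forms)
```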

MAF: token

<token id="t1">The</token>
<token id="t2">victim</token>
<token id="t3">'s</token>
<token id="t4">friends</token>
<token id="t5">told</token>
<token id="t6">police</token>
<token id="t7">that</token>
<token id="t8">Krueger</token>
<token id="t9">drove</token>
<token id="t10">into</token>
<token id="t11">the</token>
<token id="t12">quarry</token>
<token id="t13">and</token>
<token id="t14">never</token>
<token id="t15">surfaced</token>
<token id="t16">.</token>

Syntactic Annotation Framework

Semantic Annotation Framework: TimeML

• no more than 60 days
  – <TIMEX3 tid="t1" type="DURATION" value="P60D" mod="EQUAL_OR_LESS">no more than 60 days</TIMEX3>
• the dawn of 2000
  – <TIMEX3 tid="t2" type="DATE" value="2000" mod="START">the dawn of 2000</TIMEX3>
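The two TIMEX3 examples can be read back into structured values with a few lines of standard-library Python. The sketch below handles only the day-count duration form `P<n>D` that appears above. The `parse_timex3` helper and its `days` key are hypothetical names, and other value formats are passed through untouched.

```python
import re
import xml.etree.ElementTree as ET


def parse_timex3(xml_text):
    """Return the TIMEX3 attributes plus the covered text; decode P<n>D durations."""
    element = ET.fromstring(xml_text)
    info = {"text": element.text.strip(), **element.attrib}
    match = re.fullmatch(r"P(\d+)D", element.get("value", ""))
    if element.get("type") == "DURATION" and match:
        info["days"] = int(match.group(1))
    return info


duration = parse_timex3(
    '<TIMEX3 tid="t1" type="DURATION" value="P60D" '
    'mod="EQUAL_OR_LESS">no more than 60 days</TIMEX3>'
)
date = parse_timex3(
    '<TIMEX3 tid="t2" type="DATE" value="2000" mod="START">the dawn of 2000</TIMEX3>'
)
print(duration)
print(date)
```

The `mod` attribute carries the part of the meaning that the bare value cannot: "no more than" becomes EQUAL_OR_LESS and "the dawn of" becomes START.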

ONTOLOGY EXTRACTION/LEARNING AND QUESTION-ANSWERING


Word Segmentation

MULTILINGUAL INFORMATION FRAMEWORK

IT Ontology

[Figure: IT Core Ontology. Concepts: software system, Embedded software, Embedded system, appliance, OS, Middleware, Embedded OS, platform, App. Program, Dev. Env., Comm. middleware, browser, Media player, DVD player, Set-top box, MP3 player, Digital Camera, vendor, Real-time Embed. OS (RTOS), Non-real-time embed. OS. Instances: VxWorks, pSOS, VRTX, WinCE, Microsoft, Wind River. Relations: consists_of, reside_on, vendor.]

A Scenario

[Figure: components User, Control Server, Ontology Reasoner and Rule Reasoner.]

Dialogue turns:
• What is the best RTOS Vendor? Do you know?
• No
• What is RTOS?
• Real-time Operating System
• What are instances?
• VxWorks
• Vendor?
• Wind River .. Microsoft
• Which is better?
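The ontology look-ups behind this scenario can be sketched with a small triple store. The triples below are assumptions read off the IT ontology figure and the dialogue: the predicate names `label` and `instance_of` are hypothetical, and which products the figure attaches to RTOS is a guess from the flattened layout.

```python
# Assumed fragment of the IT ontology as (subject, predicate, object) triples.
TRIPLES = [
    ("RTOS", "label", "Real-time Operating System"),
    ("VxWorks", "instance_of", "RTOS"),
    ("pSOS", "instance_of", "RTOS"),
    ("VRTX", "instance_of", "RTOS"),
    ("VxWorks", "vendor", "Wind River"),
    ("WinCE", "vendor", "Microsoft"),
]


def objects(subject, predicate):
    """All objects o such that (subject, predicate, o) is in the store."""
    return [o for s, p, o in TRIPLES if s == subject and p == predicate]


def subjects(predicate, obj):
    """All subjects s such that (s, predicate, obj) is in the store."""
    return [s for s, p, o in TRIPLES if p == predicate and o == obj]


print(objects("RTOS", "label"))          # "What is RTOS?"
print(subjects("instance_of", "RTOS"))   # "What are instances?"
print(objects("VxWorks", "vendor"))      # "Vendor?"
```

The question "Which is better?" cannot be answered by look-up alone, which is where a separate rule reasoner would come in.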

Dialogue acts

Well-known examples of communicative functions ("core dialogue acts"):
• question
  • WH-question
  • YN-question
  • check/verification
• statement/inform
• answer (WH-answer, YN-answer)
• confirmation, disconfirmation
• request
• instruct
• promise
• acknowledgement
• greeting

General-purpose functions

Applicable in any dimension are:
• Information-seeking functions: WH-question, YN-question, Alternatives-question, Check, ..
• Information-providing functions: Inform, WH-Answer, YN-Answer, Confirmation, Disconfirmation, Agreement, Correction, ..
• Commissive functions: Offer, Promise, AcceptRequest, ..
• Directive functions: Instruct, Request, Suggest, ..

DiaML concrete syntax

From sentence to ontologies

• Sentence: A camera is a device that takes video.
• Term recognition: A [camera] is a [device] that [take]s [video].
• Dependency analysis: camera is device takes that video
• Triplets extraction
  – (camera, ISA, device)
  – (camera, hasPropertyOf, that AND (take video))

[Figure: ontology fragment with the nodes camera, device, artifact, video, contents, ···]
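The step from a definition sentence to triplets can be approximated with a single pattern for sentences of the form "A X is a Y that Z". This is a rough sketch, not the dependency-based pipeline described above: the regular expression and the `extract_triplets` helper are assumptions, and the property is kept as surface text ("takes video") with no lemmatization to "take video".

```python
import re

# Assumed definition pattern: "<article> TERM is <article> GENUS [that PROPERTY]."
DEFINITION = re.compile(
    r"^(?:an?|the)\s+(?P<term>[\w\s-]+?)\s+is\s+(?:an?|the)\s+(?P<genus>[\w\s-]+?)"
    r"(?:\s+that\s+(?P<property>.+?))?\.?$",
    re.IGNORECASE,
)


def extract_triplets(sentence):
    """Return ISA and hasPropertyOf triplets for a definition sentence, if any."""
    match = DEFINITION.match(sentence.strip())
    if not match:
        return []
    triplets = [(match["term"], "ISA", match["genus"])]
    if match["property"]:
        triplets.append((match["term"], "hasPropertyOf", match["property"]))
    return triplets


triplets = extract_triplets("A camera is a device that takes video.")
print(triplets)
```

A surface pattern like this only covers the simplest definitions. The dependency analysis step is what allows the same triplets to be recovered from less regular sentences.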

Standards for language processing

• Primary resources (text, dialogues)
  – Structural mark-up
  – Basic annotations
  [TEI, MPEG7, TMX (XHTML…), etc.]
• NLP structures (annotations)
  – POS tagging
  – Chunks (cf. Named Entities)
  – Deep syntactic structures
  – Co-references etc.
  [Eagles/ISLE, CES, MATE, …]
• Knowledge structures
  – Hierarchies of types
  – Relations between concepts (subjects/topics etc.)
  – Links to primary resources
  [Topic Maps, OIL, RDF]
• Lexical structures (language models)
  – Terminologies
  – Transfer lexica
  – LTAG/HPSG/LFG lexica
  [TBX, OLIF, Eagles/ISLE (Genelex)]
• Links
• Meta-data [Dublin core, OLAC, ISLE, MPEG7, RDF]
• Access protocols [Corba, SOAP]

Context

• ISO TC37 - Terminology and other language resources
  – SC3 - Computer applications in terminology
    • ISO 12200 - Martif
      – Latest version of TEI Terminology chapter
    • ISO 12620 - Data categories
    • ISO CD (DIS: under ballot) 16642 - TMF (Terminological Markup Framework)
  – SC4 - Language resources

TC37/SC4 details

• Scope: Platform for designing and implementing linguistic resource formats and processes
  – Multi-layer annotation of linguistic resources
  – Exchange of information between NLP modules
• General strategy
  – Involve a wide community from academia and industry
    • Identification of experts in the various work items
    • Involvement through national standardizing bodies
• Agenda
  – Current: identification of possible work items and working groups
  – Constituency meeting and technical workshop at LREC (May 2002)

Organization

• Chair:
  – Laurent Romary, France
• Secretary:
  – Key-Sun Choi, Korea
• International Advisory Committee
  – Chair: Prof. Antonio Zampolli, Italy

SC4 and other standardizing bodies

• W3C - basic protocols and formats
  – XML (Schemas), XPath, XPointer
  – + RDF, SVG, SMIL, SOAP
• MPEG - Multimedia, XML based
  – e.g. MPEG7-4, word and phone lattices
• ISO TC37/SC4 - language resources, NLP perspective
  – e.g. linguistic annotations, lexical formats
• TEI - text representation, reference for primary sources
  – e.g. text archives
• Oscar

[Figure labels: Text, Audio/Speech, Technical background]

What about gestures?
• Kinetic in the TEI
• SMIL?

TC37/SC4 Work Items

• WG1/WI-0: Terminology of Language Resources
• WG1/WI-1: Linguistic annotation framework
• WG1/WI-2: Meta-data for multimodal and multilingual information
• WG2/WI-3: Structural content representation scheme
• WG2/WI-4: Multimodal content representation scheme
• WG2/WI-5: Discourse level representation scheme

TC37/SC4 Work Items - cont.

• WG3/WI-6a: Multilingual text representation
• WG4/WI-7: NLP Lexica
• WG5/WI-8: Net-based distributed cooperative work for the creation of LRs

WI-0

• Terminology of Language Resources
  – Basic terminology of the various sub-fields of language resources and general methodology
  – Project leader: Klaus-Dirk Schmitz
  – Sources:
    • ISO 1087
    • LREC proceedings + KAIST
    • English dictionaries in Linguistics?
  – Support from GTW

WI-1

• Linguistic annotation framework
  – Basic mechanisms and data structures for linguistic annotation and representation [data architecture]
    • Methods and principles for the design of an annotation scheme
    • Structural nodes and information units, Data category specification
    • Linking and pointing mechanisms, Feature Structures, Meta-Markup
    • « Stand-off » and « in-line » views - equivalences, combining levels
    • Administrative data categories

WI-1 - cont.

– Project leader: Nancy Ide (TBC)
– Contributors: Alan Melby, Koiti Hasida, Lee Gillam, Yves Savourel, Laurent Romary…
– Possible sources:
  • TMF, iso12620-revised, Mate (general methodology)
  • TEI (Linking mechanisms, feature structures)
  • Link with Linguistic DS

WI-2

• Meta-data for multimodal and multilingual information
  – Description of a meta-data representation scheme to document linguistic information structures and processes
    • General content description
    • Local content description
  – Project leader: Peter Wittenburg, MPI (Nijmegen, NL)
  – Participants: Steven Bird, TEI aware person
  – Possible sources:
    • OLAC, Mile, TEI Header
  – Liaison: TC46 (SC9), MPEG7/MDS, SCORM

WI-3

• Structural content representation scheme
  – Definition of annotation/representation scheme(s) for morpho-syntax and syntax, to be used for annotation and interchange purposes
    • Meta-model for morpho-syntactic annotation
    • Meta-model(s) for syntactic annotation (lexicalized grammar, elementary trees, dependency structures)
    • + corresponding Data category registries
  – Project leader: John Carroll ??
  – Participants: Nuria Bell
  – Possible sources:
    • Eagles, TAGML, Linguistic DS
    • SIGPARSE
    • Working group with representatives from existing TreeBanks initiatives

WI-4

• Multimodal meaning representation scheme
  – Representation scheme for the semantic content of multimodal information (textual, spoken, graphical and gestural)
    • Meta-model for content representation (Events, participants, etc.)
    • Data category registry for multimodal content
  – Project leader: Harry Bunt (id="1")
  – Possible sources:
    • SIGSEM working group on semantic content
      – Chair: #1
  – « Liaison »
    • Semantic web activities

WI-5

• Discourse level representation scheme
  • Meta-model for discourse and dialogue representation
  • Meta-model for discourse level annotation (e.g. reference annotation)
  • + corresponding DatCat registry
  – Possible sources:
    • SIGDIAL
    • DRI - Discourse Resource Initiative
    • Mate

WI-6

• Multilingual text representation scheme
  – Framework for representing language specific and multi-lingual textual information
    • Translation Memory
    • Alignment – Parallel Corpora
    • Word count algorithms (characters, words, segments)
  – Possible sources:
    • TMX for translation memories
    • TEI based linking mechanism (or see WI-1) for Parallel texts

• WI 6A
  • Translation Memory, Alignment of parallel corpora
    – Sources:
      • OSCAR/TMX for translation memories
      • TEI based linking mechanism (or see WI-1) for Parallel texts
• WI 6b
  • Segmentation and counting algorithms (characters, words, sentences etc.)
    – Sources:
      • OSCAR
• WI 6C
  • Meta-markup for GIL (Globalization, Internationalization and Localization)
    – Possible sources:
      • OSCAR/OpenTag

WI-7

• NLP lexica
  – Lexicon representation formats for the various types of NLP applications (Machine Readable Lexica)
    • Define a set of meta-models (classes of applications)
    • Specific data categories (derivation, phonology, etc.)
    • Based on the work done in other work items
  – Sources
    • Eagles/multext
    • ISLE Computational Working group/Genelex
    • OLIF

WI-8

• Net-based distributed cooperative work for the creation of LRs
  – Principles and methods for designing collaborative and cooperative compilation of LRs
  – Define what is specific to LRs with regard to:
    • Traceability of resources, version control, validation, quality management
    • Protocols (Corba, SOAP), Workflow standards, Data management
  – Contacts: Christian Galinski, Remi Zajac
  – Sources

Liaison - OSCAR

• Brief history of LR exchange standards
• Parallel events since 1997
  – Open Tag - meta-markup (XML vs. Others)
• Major current OSCAR activities
  – TMX - Translation Memory eXchange
  – Counting and segmentation algorithms
  – TBX (Terminologies) and OLIF (MT lexica)
  – XLIFF and CGS - Annotation of source code and localisation of web sites
  – xml:lang etc.: J. DeCamp and S.-E. Wright

Liaison - TEI

– General architecture and data modeling
  • WI-1
– Annotations (paragraph level, external annotations)
  • WI-1
– TEI Header
  • WI-2
– NLP lexica
  • WI-7
– Feature structures
  • WI-1
