Interlingual Annotation of Multilingual Text Corpora (IAMTC): Project Overview for ITIC, November 13, 2003. Carnegie Mellon University. Lori Levin, Teruko Mitamura, Simon Fung

Dec 29, 2015


Gillian Wright

Interlingual Annotation of Multilingual Text Corpora (IAMTC)

Project Overview for ITIC, November 13, 2003

Carnegie Mellon University

Lori Levin, Teruko Mitamura, Simon Fung


Principal investigators and senior personnel

• Bonnie Dorr, University of Maryland
• Nizar Habash, University of Maryland and Columbia
• Stephen Helmreich, NMSU
• Eduard Hovy, USC
• David Farwell, NMSU
• Lori Levin, CMU
• Keith Miller, MITRE
• Teruko Mitamura, CMU
• Owen Rambow, Columbia University
• Florence Reeder, MITRE


Cooperative Website: Wiki

• http://sparky.umiacs.umd.edu:8000/IAMTC/IAMTC.wiki
• Corpora
• Documents and manuals
• Discussion


Goals of IAMTC

• A practical interlingua for unrestricted text
  – Based on mismatch resolution between languages and between multiple English translations
  – Goal: feasible human coding
    • Speed
    • Inter-coder agreement


Benefits of IAMTC

• Usable by many research communities, and by researchers using different approaches, working at different levels:
  – MT, information extraction, summarization, question-answering, etc.
  – Corpus-based, rule-based, machine learning-based, statistical approaches, etc. (note: heterogeneous list, not mutually exclusive)
• Multiple levels of representation:
  – Syntactic dependency structure
  – Language-specific predicate argument structure
  – Interlingua (with resolution of some mismatches)


Products of IAMTC

• A coding manual for the interlingua
• A multilingual tagged corpus
  – 25 original texts in: French, Spanish, Japanese, Korean, Arabic, Hindi
  – Three English translations of each text
• An evaluation metric for the interlingua


Representations

• IL0: Language-specific dependency syntax
• IL1: Language-specific semantic structure with:
  – Labeling of nodes using ontology
  – Labeling of arcs with semantic role names
• IL2: Interlingua
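
The step from IL0 to IL1 can be pictured as relabeling a dependency tree: nodes get ontology symbols and arcs get semantic role names. The sketch below is only an illustration under stated assumptions. The `Node` class, the `to_il1` helper, the concept symbols (`GIVE`, `ORGANIZATION`, `LOAN`) and the role names (`agent`, `theme`) are all hypothetical and are not taken from the IAMTC coding manual or the Omega ontology.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Node:
    """A dependency-tree node; `concept` stays None at the IL0 level."""
    word: str
    concept: Optional[str] = None  # IL1: ontology symbol (hypothetical names)
    arcs: List[Tuple[str, "Node"]] = field(default_factory=list)  # (label, child)


def to_il1(node: Node,
           concepts: Dict[str, str],
           roles: Dict[Tuple[str, str], str]) -> Node:
    """Return an IL1 copy of an IL0 tree: nodes labeled with concepts,
    arcs relabeled with semantic roles where a mapping is known."""
    new = Node(node.word, concepts.get(node.word))
    for label, child in node.arcs:
        # Keep the syntactic label if no role mapping is available.
        role = roles.get((node.word, label), label)
        new.arcs.append((role, to_il1(child, concepts, roles)))
    return new


# IL0: arcs carry syntactic labels only.
il0 = Node("granted", arcs=[("subj", Node("program")), ("obj", Node("loans"))])

# Hypothetical lookup tables standing in for an ontology and a verb lexicon.
concepts = {"granted": "GIVE", "program": "ORGANIZATION", "loans": "LOAN"}
roles = {("granted", "subj"): "agent", ("granted", "obj"): "theme"}

il1 = to_il1(il0, concepts, roles)
```

The original IL0 tree is left unchanged, so both levels of representation remain available side by side.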


IL2: Interlingua

• Neutralize:
  – support verbs; some multi-word expressions and non-literal language; some lexical converses (buy-sell)
  – some sentence planning differences: "John who is blond likes apples" <-> "John is blond and likes apples"
  – conflational mismatches: "tape" (verb) <-> Japanese "teepu de tomeru" (tape with attach)
  – head-switching mismatches, etc.: "I tend to go to school." vs. "I usually go to school."
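
As an illustration of neutralizing lexical converses such as buy-sell, the sketch below maps both verbs onto one neutral predicate with renamed roles, so that two descriptions of the same event compare equal. The `CONVERSES` table, the `EXCHANGE` predicate and all role names are hypothetical and chosen only for this example; the slides do not specify how IL2 encodes converses.

```python
# Hypothetical table: verb -> (neutral predicate, role renaming).
CONVERSES = {
    "sell": ("EXCHANGE", {"agent": "giver", "recipient": "receiver", "theme": "goods"}),
    "buy": ("EXCHANGE", {"agent": "receiver", "source": "giver", "theme": "goods"}),
}


def neutralize(verb, args):
    """Map a verb frame onto a converse-neutral frame.

    Verbs missing from the table are returned unchanged.
    """
    predicate, renaming = CONVERSES.get(verb, (verb, {}))
    return predicate, {renaming.get(role, role): filler for role, filler in args.items()}


sold = neutralize("sell", {"agent": "Mary", "recipient": "John", "theme": "car"})
bought = neutralize("buy", {"agent": "John", "source": "Mary", "theme": "car"})
print(sold == bought)
```

"Mary sold John a car" and "John bought a car from Mary" end up with the same frame, which is the sense in which the mismatch is neutralized.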


Examples (from Nizar Habash)

• http://www.umiacs.umd.edu/~habash/artb_004.idg.5.IL.1
  – The minister, who has his own website, also said: "I want Dubai to be the best place in the world for state-of-the-art technology companies."
• http://www.umiacs.umd.edu/~habash/artb_004.idg.5.IL.2
  – The minister who has a personal website on the internet, further said that he wanted Dubai to become the best place in the world for the advanced (hitech) technological companies.


Example

• Original English:
  – In its first five years of operation, PRODEM financed loans to over 13,300 microentrepreneurs, 77 per cent of whom were women, disbursing over $27 million in loans averaging $273.
• Original French:
  – Au bout de cinq ans, le programme avait consenti plus de 27 millions de dollars de prêts d'un montant moyen de 273 dollars, à plus de 13 300 entrepreneurs, dont 77% de femmes ....
• English Translation from French:
  – At the end of five years, the program had granted more than 27 million dollars in loans with an average amount of 273 dollars, to more than 13 300 entrepreneurs, of which 77% were women,....


Example 1

• Original English:
  – financed
    • loans
    • to over 13,300 microentrepreneurs,
  – disbursing
    • over $27 million
      – in loans
• Original French:
  – consenti
    • plus de 27 millions de dollars
      – de prêts
    • à plus de 13 300 entrepreneurs,
• English Translation from French:
  – granted
    • more than 27 million dollars
      – in loans
    • to more than 13 300 entrepreneurs


Example 2

• Original English:
  – Its network of eighteen independent organizations in Latin America has lent …..
• Original French:
  – le réseau regroupe dix-huit organisations indépendantes qui ont déboursé …..
• English Translation from French:
  – the network comprises eighteen independent organizations which have disbursed …..


Example 2

• Original English:
  – has lent
    • Its network
      – of eighteen independent organizations
    • …..
• Original French:
  – regroupe
    • le réseau
      – dix-huit organisations indépendantes
        » ont déboursé ……
• English Translation from French:
  – comprises
    • the network
    • eighteen independent organizations
      – have disbursed ……


Interlingua Merging

• Language-faithful interlinguas
• Original English:
  – financed
    • loans
    • to over 13,300 microentrepreneurs
  – disbursing
    • over $27 million
      – in loans
• Original French:
  – consenti
    • plus de 27 millions de dollars
      – de prêts
    • à plus de 13 300 entrepreneurs
• English Translation from French:
  – granted
    • more than 27 million dollars
      – in loans
    • to more than 13 300 entrepreneurs
• Merged Interlingua
  – TRANSFER-MONEY
    • over $27 million
    • to over 13,300 microentrepreneurs
  – SOME-RELATION
    • over $27 million
    • loans


Interlingua Merging

• Original English:
  – has lent
    • Its network
      – of eighteen independent organizations
• Original French:
  – regroupe
    • le réseau
      – dix-huit organisations indépendantes
        » ont déboursé
• English Translation from French:
  – comprises
    • the network
    • eighteen independent organizations
      – have disbursed
• Merged Interlingua
  – HAS-AS-PART
    • the network
    • eighteen independent organizations
  – TRANSFER-MONEY
    • the network
    • …..


Example 3

• Original English:
  – Three of the most advanced institutions in the ACCION network started their programmes as non-profit organizations and have, in the last five years, converted into
• Original French:
  – Trois des institutions les plus performantes rattachées à ACCION International qui étaient au départ des organisations à but non lucratif sont devenues ces cinq dernières années
• English Translation from French:
  – Three of the most successful institutions connected to ACCION International, which were non-profit organizations in the beginning, have become, in these last five years,


Example 3

• Original English:
  – Started
    • their programmes
    • Institutions
      – as non-profit organizations
  – Converted
    • Institutions
    • …..
• Original French:
  – sont devenues
    • Institutions
      – relative-clause: étaient au départ
        » institutions
    • ……
• English Translation from French:
  – Have become
    • Institutions
      – Relative-clause: Were … in the beginning
        » institutions
    • ……


Meetings and Workshops

• Meetings:
  – September 2003: New Orleans during MT Summit
  – November 8 and 9, 2003: CMU
  – January 18 and 19, 2004: ISI
• Workshops:
  – September 2003: MT Summit
  – May 2004: Plan for a panel in the workshop organized by Adam Meyer at NAACL/HLT 2004
  – July 2004: Plan to propose ACL workshop


Timeline

• November 10 to December 1:
  – Assembly of ENGLISH tools and knowledge sources
    • Tools committee: Hovy, Rambow, Miller
    • Omega ontology, ISI
    • LCS verb lexicon (connect to Omega via Propbank)
    • LDA (Lightweight Dependency Analyzer, Srinivas Bangalore)
    • Graph tool from Prague
    • New annotation tool (Dependency tree, Omega, Lexicon)
  – Draft of coding manual for IL1:
    • Annotation Committee: Rambow, Mitamura, Levin, Dorr, Habash, Helmreich
    • Ontology symbols: Hovy
    • IL0 (dependency structure): Rambow
    • IL1 markup format: Rambow and Habash
    • Semantic roles: Dorr, Habash, Mitamura, Levin
    • Nouns and compounds: Mitamura
    • Adverbs and adjectives: Helmreich
    • Prepositions: Miller
    • Named entities: Reeder
    • Modification vs Predication: Habash
  – Annotator training Phase 1:
    • All annotators will tag the same English text
  – Assembly of corpora:
    • Data committee: Mitamura, Hovy, Miller, Farwell
    • Five foreign language original texts in each language
    • Three English translations of each text


Annotation Procedure (English)

• Run LDA parser
• Use tree editing tool to convert syntactic dependency parse into IL1
  – Correct parsing errors
  – Choose symbols from the ontology as node labels
  – For verbs:
    • Look the verb up in the lexicon to get a list of semantic role names
    • Match phrases to roles
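
The verb step of the procedure (look up the verb's role names, then match phrases to roles) lends itself to a simple consistency check. The sketch below assumes a toy `LEXICON` dictionary and a hypothetical `check_roles` helper. It does not reflect the actual format of the LCS verb lexicon or the annotation tool.

```python
# Hypothetical lexicon entry: verb -> semantic role names it licenses.
LEXICON = {
    "grant": ["agent", "theme", "recipient"],
}


def check_roles(verb, assignment):
    """Return the phrases whose assigned role is not listed for the verb.

    `assignment` maps each phrase to the role an annotator chose for it.
    An empty result means every role is licensed by the lexicon entry.
    """
    allowed = set(LEXICON.get(verb, []))
    return [phrase for phrase, role in assignment.items() if role not in allowed]


annotation = {
    "the program": "agent",
    "more than 27 million dollars": "theme",
    "at the end of five years": "time",  # not a role listed for "grant"
}
unlicensed = check_roles("grant", annotation)
print(unlicensed)
```

A phrase flagged this way is not necessarily an error; in this toy setup it simply marks material, such as adjuncts, that the verb's entry does not cover.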


Timeline

• December 1 to January 19:
• Annotation development cycle:
  – Procedure committee: Hovy, Farwell, Mitamura
  – For each week, for each language:
    • Pick a text and two English translations of the text and one English translation from another site.
  – Each week:
    • Conference call on Friday at 1:00 pm Eastern Time
    • Revise annotation manuals as necessary
• Development of inter-coder agreement metric
  – Evaluation committee: Reeder and Habash, leaders
• Proposal for IL2 based on comparison of IL1s for different translations of the same text
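
The slides do not say which inter-coder agreement metric the project will adopt. As one common point of reference, Cohen's kappa corrects the raw agreement between two annotators for agreement expected by chance. The sketch below is a generic implementation, not the IAMTC metric, and the role labels in the example are made up.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Proportion of items on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's label distribution.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Made-up role labels assigned by two annotators to the same four arcs.
coder_1 = ["agent", "theme", "theme", "goal"]
coder_2 = ["agent", "theme", "goal", "goal"]
kappa = cohen_kappa(coder_1, coder_2)
```

Kappa treats every disagreement as equally serious. A metric for tree-structured annotations with ontology labels would likely need partial credit for near-miss concepts, which this sketch does not attempt.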


Timeline

• January 19 to February 23
  – Development of foreign language analysis tools
  – Large inter-coder agreement evaluation (IL1)
  – Continue working on the IL2 design
• March 1: Mid-year report
• March 1, 2004 to September 2004
  – Annotation of full corpus:
    • 25 original texts in each of the six languages (French, Spanish, Hindi, Korean, Arabic, Japanese)
    • 3 translations of each text into English


Plans for year 2

• Argument-taking predicates other than verbs
• Additional tools for automatic construction of IL1 and IL2
• More comprehensive set of divergences resolved in IL2
• Additional annotation topics:
  – Coreference
  – Scope
  – Tense and aspect
  – Etc.
• Larger annotated corpus
  – Suitable for corpus-based methods and machine learning