Collocation Extraction Based on Syntactic Criteria Violeta Seretan Department of Translation Technology Faculty of Translation and Interpreting University of Geneva September 2013 Recent Advances in Natural Language Processing, 7-13 September 2013, Hissar, Bulgaria

Collocation Extraction Based on Syntactic Criteria

Violeta Seretan

Department of Translation Technology, Faculty of Translation and Interpreting

University of Geneva

September 2013

Recent Advances in Natural Language Processing, 7-13 September 2013, Hissar, Bulgaria


Acknowledgment

Language Technology Laboratory, Department of Linguistics, University of Geneva

Eric Wehrli, Luka Nerima, Paola Merlo


Outline On collocations Importance Features Need for analysis Syntax-based extractors Evaluation

1 On collocations

2 Importance of collocations

3 Distinguishing features

4 The need for morphosyntactic analysis

5 Syntax-based extractors

6 Method, results, evaluation

7 Conclusion

Violeta Seretan Collocation Extraction RANLP 2013, Hissar 3 / 100


I chose to run for the presidency at this moment in history because I believe deeply that we cannot solve the challenges of our time unless we solve them together

Source: https://my.barackobama.com/page/content/hisownwords/


Language is made up of collocations

I tender my heartfelt gratitude to all of them, while taking full responsibility for all errors . . .

[Mel’čuk 1998, 23]


Collocations

“In all kinds of texts, collocations are indispensable elements with which our utterances are very largely made”

[Kjellmer 1987, 10]


Collocations

“Collocation is the way words combine in a language to produce natural-sounding speech and writing”

[Lea and Runcie 2002, vii]


‘Appendix to the Grammar’

“Knowledge that will account for speakers’ ability to construct and understand phrases and expressions in their language which are not covered by the grammar, the lexicon, and the principles of compositional semantics”

[Fillmore et al. 1988, 504]


Only available to native speakers

“Advanced learners of second language have great difficulty with nativelike collocation and idiomaticity. Many grammatical sentences generated by language learners sound unnatural and foreign.”

[Ellis 2008]


More examples

EN open air

FR plein air ‘full’

RO aer liber ‘free’


More examples

EN ask a question

IT fare una domanda, ES hacer una pregunta ‘make’

RO a pune o întrebare, FR poser une question ‘put’


More examples

EN error occurred

FR erreur s’est produite ‘produced itself’


More examples

EN cheat death

FR frôler la mort ‘brush’


More examples

EN reach an agreement

FR parvenir à un accord ‘arrive, get to’

IT trovare un accordo ‘find’


More examples

EN money laundering

FR blanchiment d’argent ‘whitening’

IT lavaggio di denaro ‘washing’


Even more examples

EN narrow majority

FR courte majorité ‘short’

EN bring to justice

FR traduire en justice ‘translate’

RO urmări în justiție ‘follow track’

EN story breaks, strike a deal, in sharp contrast, draw criticism, entertain hope, experience difficulty, foot the bill, meet requirement, fine weather, deep impression, serious injury


Even more examples

Automatically extracted collocation equivalents [Seretan and Wehrli 2007]


Importance

“Collocations make up the lion’s share of the phraseme inventory, and thus they deserve our special attention.”

[Mel’čuk 1998, 24]


Importance for Machine Translation

“collocations are the key to producing more acceptable output”

[Orliac and Dillinger 2003, 292]


Importance for Natural Language Generation

“collocations are not only considered useful, but also a problem”

[Heylen et al. 1994, 1240]


Importance for Language Analysis

1 Collocations act as lexical disambiguators

to break a record

break (V): about 50 senses

record (N): about 10 senses

break + record, senses combined freely: about 500 potential interpretations (50 × 10)

break – record (V-O collocation): 1 sense

“a polysemous word exhibits essentially only one sense per collocation” [Yarowsky 1993, 266]


Importance for Language Analysis

2 Collocations act as structural disambiguators

human resource development

[NP [AP human] [NP resource [NP development]]]

vs.

[NP [NP [AP human] resource] development]

vs. ...

The number of potential parses is exponential in sentence length.

Collocational knowledge guides the syntactic parsing (e.g., [Wehrli et al. 2010]).
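
To give a feel for this growth, here is a minimal sketch that counts bracketings under two simplifying assumptions that are not taken from the slides: trees are strictly binary and node labels are ignored. Under those assumptions the count for an n-word sequence is the Catalan number C(n-1). The function name `binary_bracketings` is illustrative only.

```python
from math import comb


def binary_bracketings(n_words: int) -> int:
    """Count unlabeled binary bracketings of n_words leaves (Catalan(n_words - 1))."""
    if n_words < 1:
        raise ValueError("need at least one word")
    k = n_words - 1
    # Closed form of the k-th Catalan number; the division is always exact.
    return comb(2 * k, k) // (k + 1)


# Three words (as in "human resource development") give two bracketings;
# the count then grows quickly with length.
for n in (3, 5, 10, 20):
    print(n, binary_bracketings(n))
```

A real grammar licenses only some of these bracketings and adds category labels, so the numbers are indicative rather than exact parse counts.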


Importance for Corpus Linguistics

Source: http://www.linguistik-online.de/31_07/danielsson.html


Importance for Lexicography

Source: Collins COBUILD online


NLP applications

machine translation

syntactic parsing

natural language generation

word sense disambiguation

topic segmentation

text summarization

text classification

information retrieval

OCR, speech recognition


Distinguishing features

1 Collocations are partly compositional


Distinguishing features

3 Collocations are morphosyntactically flexible


Syntactic flexibility

make proposal

A proposal for the financing of the variable costs will be made to the Committee . . .

submit proposal

A joint proposal which addressed such elements as notification, consultations, conciliation and mediation, arbitration, panel procedures, technical assistance, adoption of panel reports and GATT’s surveillance of their implementation was submitted on behalf of fourteen participants.


Syntactic flexibility

serious problem

We all need to renounce the use of arms in order to be able to address the country’s serious political, social and economic problems . . .

important issue

The issue of new technologies and their application in education naturally generates considerable interest and is extremely important.


Syntactic flexibility

paragraphe – dispose ‘paragraph – stipulates’

Notant en outre que le paragraphe 5 de l’Acte final reprenant les résultats des Négociations commerciales multilatérales du Cycle d’Uruguay (ci-après dénommés respectivement l’“Acte final” et le “Cycle d’Uruguay”) dispose que . . .


Identification approaches

1 Approaches based on linear proximity

Definition: “Collocation is the cooccurrence of two or more words within a short space of each other in a text. The usual measure of proximity is a maximum of four words intervening.” [Sinclair 1991, 170]

2 Approaches based on structural proximity

Definition: “lexically and/or pragmatically constrained recurrent co-occurrences of at least two lexical items which are in a direct syntactic relation with each other” [Bartsch 2004, 76]
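
The difference between the two definitions can be sketched on the make – proposal sentence shown earlier. The dependency link below is a hand-written annotation assumed for illustration, not the output of any parser, and the helper names `within_span` and `directly_related` are hypothetical.

```python
# Toy comparison of linear vs. structural proximity on one sentence.
tokens = ("A proposal for the financing of the variable costs "
          "will be made to the Committee").split()

# Assumed hand annotation: 'proposal' (index 1) is an argument of 'made' (index 11).
relations = {(1, 11)}


def within_span(i: int, j: int, max_intervening: int = 4) -> bool:
    """Linear proximity: at most max_intervening words between positions i and j."""
    return abs(i - j) - 1 <= max_intervening


def directly_related(i: int, j: int, rels: set) -> bool:
    """Structural proximity: the two positions are linked by a direct relation."""
    return (i, j) in rels or (j, i) in rels


i, j = tokens.index("proposal"), tokens.index("made")
print("linear:", within_span(i, j))                      # nine words intervene
print("structural:", directly_related(i, j, relations))
```

With a maximum of four intervening words the pair is missed, whereas the syntactic relation still connects the two items.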


Inflection

jouer ‘play’: ∼ 25 forms

rôle ‘role’: 2 forms

jouer – rôle: ∼ 50 forms (25 × 2)

1 type: jouer – rôle

50 forms: jouent rôle, rôle joué, joue rôles, ...
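
A minimal sketch of why lemmatization helps: the surface pairs listed above, written here with French diacritics, collapse to a single type once each form is mapped to its lemma. The tiny `LEXICON` is a made-up stand-in for a real morphological analyzer.

```python
# Hypothetical mini-lexicon: surface form -> (lemma, part of speech).
LEXICON = {
    "jouent": ("jouer", "V"),
    "joue": ("jouer", "V"),
    "joué": ("jouer", "V"),
    "rôle": ("rôle", "N"),
    "rôles": ("rôle", "N"),
}


def pair_type(form_a: str, form_b: str) -> tuple:
    """Map a surface pair to a (verb lemma, noun lemma) type, ignoring word order.

    Toy assumption: exactly one of the two forms is a verb and the other a noun.
    """
    a, b = LEXICON[form_a], LEXICON[form_b]
    verb, noun = (a, b) if a[1] == "V" else (b, a)
    return (verb[0], noun[0])


surface_pairs = [("jouent", "rôle"), ("rôle", "joué"), ("joue", "rôles")]
types = {pair_type(a, b) for a, b in surface_pairs}
print(len(surface_pairs), "surface pairs ->", len(types), "type:", types)
```

Counting by type rather than by surface form pools the frequency evidence that would otherwise be split across the inflected variants.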


Lexical ambiguity

human

Noun, Adjective

rights

Noun, Adjective, Verb, Adverb

human – rights

Noun – Noun, Noun – Verb, Adjective – Noun, ...
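
The combinations above can be enumerated mechanically. This sketch simply takes the category lists from the slide at face value and forms their Cartesian product; the dictionary name `POS_OPTIONS` is illustrative.

```python
from itertools import product

# Candidate categories per word, as listed on the slide.
POS_OPTIONS = {
    "human": ["Noun", "Adjective"],
    "rights": ["Noun", "Adjective", "Verb", "Adverb"],
}

# Without disambiguation, every category pairing is a possible reading of the pair.
readings = list(product(POS_OPTIONS["human"], POS_OPTIONS["rights"]))
for first, second in readings:
    print(f"{first} – {second}")
print(len(readings), "candidate category pairs")
```

Only one of these pairings (Adjective – Noun) is the intended analysis of human rights, which is why category disambiguation matters before pairs are counted.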


Inversion

set – record

Michael Phelps sets all-time Olympic record.

A new world record has been set.
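
As a toy illustration of the problem, the two sentences yield different ordered word pairs on the surface, but the same pair once the words are arranged by syntactic role. The lemma table and the role ordering are hand-written assumptions standing in for a real analysis.

```python
# Hypothetical lemma table and role ordering (verb first, then its object).
LEMMA = {"sets": "set", "set": "set", "record": "record"}
ROLE_ORDER = {"set": 0, "record": 1}

sentences = [
    "Michael Phelps sets all-time Olympic record",
    "A new world record has been set",
]


def surface_pair(sentence: str) -> tuple:
    """Lemmas of the two items in the order they appear in the text."""
    return tuple(LEMMA[t] for t in sentence.split() if t in LEMMA)


def relation_pair(sentence: str) -> tuple:
    """Same lemmas, ordered by role instead of by text position."""
    return tuple(sorted(surface_pair(sentence), key=ROLE_ORDER.get))


for s in sentences:
    print(surface_pair(s), "->", relation_pair(s))
```

An extractor that records ordered surface pairs treats the active and the passive sentence as two different candidates; a role-based representation merges them.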


Long-distance dependencies

donner – exemple ‘give – example’

Le visionnaire a donné, lors de sa conférence magistrale, à l’occasion de la remise du Prix Latsis 2011 aux différents lauréats, le mercredi 30 novembre 2011, dans la salle Piaget de l’Université de Genève à Uni-Dufour, devenu trop petite pour accueillir le monde scientifique et le public venus de tous les coins et recoins de la Suisse, l’exemple du professeur et ancien président sénégalais le poète Léopold Sédar Senghor qui maîtrisait autant la culture du pays de Marianne que la langue de Molière avec perfection pour devenir le premier Noir membre de l’Académie française.


Long-distance dependencies

put – off

It is still wise to find a tactful way of putting those who have a virus or the flu, or any other problem off for a little while longer.
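
A quick check of how far apart the two items are in this sentence, using a naive word tokenizer (an assumption; punctuation is simply dropped):

```python
import re

sentence = ("It is still wise to find a tactful way of putting those who have "
            "a virus or the flu, or any other problem off for a little while longer.")

# Naive tokenization: runs of word characters only.
tokens = re.findall(r"\w+", sentence)

verb_idx = tokens.index("putting")
particle_idx = tokens.index("off")
intervening = particle_idx - verb_idx - 1

print("intervening words:", intervening)
print("fits a four-word gap:", intervening <= 4)
```

The gap is well beyond the four intervening words allowed by a typical linear-proximity definition, so a window-based extractor would not pair the verb with its particle here.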


Structural ambiguity

question - ask

Any question asked during the selection and interview process must be related to the job and the performance of that job.

The question asked if the grant funding could be used as start-up capital to develop this project.


Typical extraction method

Sliding window (Baseline)

1 Take into account all possible combinations within a 5-word collocational span

2 Apply an association measure to filter out noise and retain best candidates on top of the output list

WARNING

MANUALLY VALIDATE RESULTS BEFORE USE
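
As an illustration of step 1 of the baseline, here is a minimal Python sketch of window-based candidate generation. It assumes that a "5-word span" means a word is paired with each of the four tokens that follow it; the function names and the tokenised input format are hypothetical and are not taken from the slides.

```python
from collections import Counter


def window_candidates(tokens, span=5):
    """Yield every ordered pair of words lying inside one `span`-token window.

    Assumption: a word is paired with each of the (span - 1) tokens after it.
    No syntactic information is used, so unrelated pairs are produced as well.
    """
    for i, left in enumerate(tokens):
        for right in tokens[i + 1 : i + span]:
            yield (left, right)


def count_candidates(sentences, span=5):
    """Count candidate pairs over an iterable of tokenised sentences."""
    counts = Counter()
    for tokens in sentences:
        counts.update(window_candidates(tokens, span))
    return counts


# A shortened version of the "put ... off" example: the two words are
# six positions apart, so a 5-word window never pairs them.
tokens = "putting those who have a virus off".split()
pairs = set(window_candidates(tokens))
```

The sketch makes both weaknesses of the baseline visible: it pairs words that have no syntactic relation, and it misses pairs whose members are further apart than the window.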

The multilingual challenge

In English, many syntactic relations are realised in a 5-word window.

But what about freer word order languages?

German

“Some properties of the German language make the task of extracting V-N collocations from German text corpora more difficult than for English corpora.” [Breidt 1993, 77]

“the assumption that a “semantic agent [...] is principally used before the verb” and a “semantic object [...] is used after it” as described in Smadja (1991a:180) does not hold for German. Therefore, complicated parsing is necessary to distinguish subject-verb from object-verb combinations.” [Breidt 1993, 77]

Korean

“The free order of Korean makes it hard to identify collocations.” [Kim et al. 1999, 71]

“Unfortunately, the approach for English has several limitations to work on Korean structure” [Kim et al. 1999, 71]

Solution

“Ideally, in order to identify lexical relations in a corpus one would need to first parse it to verify that the words are used in a single phrase structure. However, in practice, free-style texts contain a great deal of nonstandard features over which automatic parsers would fail. [...] This fact is being seriously challenged by current research [...] and might not be true in the near future.” [Smadja 1993, 151]

“with recent significant increases in parsing efficiency and accuracy, there is no reason why explicit parse information should not be used” [Pearce 2002, 1530]

“Ideally, a full syntactic analysis of the source corpus would allow us to extract the cooccurrence directly from parse trees” [Evert 2004, 31]

1 On collocations

2 Importance of collocations

3 Distinguishing features

4 The need for morphosyntactic analysis

5 Syntax-based extractors

6 Method, results, evaluation

7 Conclusion

Lin (1998, 1999) – English
dependency parser (unspecified)
sentences shorter than 25 words
9.7% errors in top results
6 types: N-D, N-A, N-N, V-N, N-V, V-Adv

Wu and Zhou (2003); Lu and Zhou (2004) – English, Chinese
syntactic parser (NLPWin, Microsoft Research)
7.85% errors in top results
3 types: V-O, N-A, V-Adv

Villada Moirón (2005) – Dutch
dependency parser (Alpino)
sentences shorter than 20 words
many PP-attachment errors; parser only used for chunking
2 types: P-N-P and PP-V

Orliac and Dillinger (2003) – English
deep parser (Logos)
limited grammatical coverage (does not handle relatives)
3 types: S-V, V-O, V-P-N

Others
Zinsmeister and Heid (2003), Schulte im Walde (2003) – German, statistical parser (LoPar)
Charest et al. (2007) – French, dependency parser (Antidote)

The LATL syntax-based collocation extractor: FipsCo

Goldman et al. (2001) – English, French

deep parser (Fips)

broad grammatical coverage

many types of collocation configurations

FipsCo precedes many syntax-based extractors, and overcomes limitations pertaining to parsing robustness, precision, coverage, as well as limitations regarding the list of supported syntactic types.

It has been developed mainly as a CAT tool for WTO translators in the project “Linguistic Analysis and Collocation Extraction”.

Initially available for English and French, it was further extended to Spanish, Italian, Greek, German and Romanian.

Experiments

Languages

English, French, Italian, Spanish [Seretan and Wehrli 2009], German, Greek [Michou and Seretan 2009], Romanian [Seretan and Wehrli 2010]

Corpora

WTO translation archives, The Economist, Canadian Hansard, Le Monde,Europarl, Mesagerul, WWW (Web as a corpus) ...

[Screenshot slide (image not recovered): FipsCo Romanian]

Exploitation

Lexical acquisition

for the Fips parser
for the Its-2 MT system [Wehrli et al. 2009]

Text Summarization [Seretan 2011]

Syntactic Parsing [Seretan and Wehrli 2011]

Terminology assistance [Wehrli 2003, Wehrli 2006]

Teaching

Research

[Screenshot slides (images not recovered): Lexical acquisition; Teaching; Terminology assistance]

Method

Fips parser [Wehrli 2007] – Sample output

Fips - Key facts

Constituent

simplified X-bar structure [XP L X R] (no intermediate level)
X – lexical head (A, N, V, D, P, Conj, ...)
L/R – lists of left/right subconstituents

Manually-built lexica

detailed morphosyntactic and semantic information: selectional properties, subcategorization information, syntactico-semantic features likely to influence the syntactic analysis

Algorithm - main operations

Project: assignment of constituent structures to lexical entries

Merge: combination of adjacent constituents

Move: creation of chains by linking surface positions of “moved” constituents to their corresponding canonical positions.

Method – Stage 1: Candidate selection

1 Lexical filter:
rule out auxiliary and modal verbs, proper nouns, common nouns representing titles (Mr.)

2 Structural filter:
predicate-argument relation in the arguments table of predicates
combinations <X, head of item in L/R> in a given syntactic relation, e.g., head-modifier, noun-adjective in FP (functional phrase)
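
The two filters can be sketched in Python on a toy version of the [XP L X R] constituent structure. This is only an illustration under simplifying assumptions: the `Constituent` class, the part-of-speech tags, the `PATTERNS` table and the way relations are read off the left or right position are all hypothetical, and the real Fips data structures and relation inventory are richer than this.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Constituent:
    """Toy [XP L X R] node: a lexical head X plus left/right subconstituents."""
    category: str                      # e.g. "NP", "VP", "AP"
    head: str                          # lemma of the lexical head X
    head_pos: str                      # hypothetical tag: "N", "V", "A", "AUX", ...
    left: List["Constituent"] = field(default_factory=list)
    right: List["Constituent"] = field(default_factory=list)


# Hypothetical mapping from (head tag, side, dependent tag) to a relation name.
PATTERNS = {
    ("V", "L", "N"): "subject-verb",
    ("V", "R", "N"): "verb-object",
    ("N", "L", "A"): "adjective-noun",
}
EXCLUDED_POS = {"AUX", "MODAL", "PROPN"}   # assumed tag names
TITLE_NOUNS = {"Mr.", "Mrs.", "Dr."}       # assumed title list


def passes_lexical_filter(node: Constituent) -> bool:
    """Lexical filter: drop auxiliaries, modals, proper nouns and titles."""
    return node.head_pos not in EXCLUDED_POS and node.head not in TITLE_NOUNS


def select_candidates(node: Constituent) -> Iterator[Tuple[str, str, str]]:
    """Structural filter: pair X with the head of each item in L/R."""
    sides = [("L", c) for c in node.left] + [("R", c) for c in node.right]
    for side, child in sides:
        relation = PATTERNS.get((node.head_pos, side, child.head_pos))
        if relation and passes_lexical_filter(node) and passes_lexical_filter(child):
            yield (relation, node.head, child.head)
        yield from select_candidates(child)   # recurse into the subconstituent


# "The provincial government made a ... difficult ... decision" (simplified).
tree = Constituent(
    "VP", "make", "V",
    left=[Constituent("NP", "government", "N",
                      left=[Constituent("AP", "provincial", "A")])],
    right=[Constituent("NP", "decision", "N",
                       left=[Constituent("AP", "difficult", "A")])],
)
candidates = list(select_candidates(tree))
```

Because pairs are read off the tree rather than off the word sequence, the distance between "make" and "decision" in the surface sentence plays no role in whether the pair is selected.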

Syntactic patterns

adjective-noun                 heavy smoker
noun-[predicate]-adjective     effort [be] devoted
noun-noun                      suicide attack
noun-preposition-noun          round of negotiations
noun-preposition               inquiry into
adjective-preposition          crazy about
subject-verb                   war breaks
verb-object                    meet requirement
verb-preposition-argument      bring to boil
verb-preposition               point out
adverb-verb                    fully support
adverb-adjective               highly important

Method – Stage 2: Candidate ranking

Method – Summing up:

Baseline

1 Candidate identification:
Take into account all possible combinations within a 5-word collocational span

2 Candidate ranking:
Apply an association measure to filter out noise and retain best candidates on top of the output list

Syntax-based extraction

1 Candidate identification:
Take into account syntactically bound combinations, according to the parse tree built by Fips for the input sentence

2 Candidate ranking:
Apply an association measure to retain best candidates on top of the output list
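
The ranking step is the same in both pipelines; only the candidate list differs. The Python sketch below ranks candidate pairs with pointwise mutual information (PMI). PMI is used here purely as a familiar stand-in, since this slide does not name the association measure; the function name, the `min_freq` parameter and the toy counts are hypothetical.

```python
import math
from collections import Counter


def rank_by_pmi(pair_counts, min_freq=1):
    """Rank candidate pairs by PMI, highest first.

    PMI(x, y) = log2( P(x, y) / (P(x) * P(y)) ), estimated from pair counts.
    PMI tends to overrate rare pairs, hence the optional frequency threshold.
    """
    total = sum(pair_counts.values())
    left, right = Counter(), Counter()
    for (w1, w2), n in pair_counts.items():
        left[w1] += n      # marginal count of w1 in first position
        right[w2] += n     # marginal count of w2 in second position
    scored = []
    for (w1, w2), n in pair_counts.items():
        if n < min_freq:
            continue
        pmi = math.log2(n * total / (left[w1] * right[w2]))
        scored.append(((w1, w2), pmi))
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy counts (invented for illustration, not results from the slides).
counts = Counter({
    ("meet", "requirement"): 4,
    ("meet", "people"): 1,
    ("see", "people"): 5,
})
ranked = rank_by_pmi(counts)
```

The same function can be fed either window-based or syntax-based candidate counts, which is what makes the two approaches directly comparable at the ranking stage.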

Results – Tokens identified

a very simple question which everyone in this country would like to ask

No doubt that will be partly due to the great contribution you and your colleagues in the Chair will make.

The provincial government made a very difficult but well balanced decision that enhances environmental, economic and social values for the area.

It is the new government’s responsibility to tackle with conviction and fairness the complex problems facing Canada ...

at a cost of $5 billion that is chiefly being met e by South Korea and Japan


Results – Syntactic environments

passivization:

I see that amendments to the report by Mr Mendez de Vigo and Mr Leinen have been tabled on this subject.

relativization:

The communication devotes no attention to the impact the newly announced policy measures will have on the candidate countries.

interrogation:

What impact do you expect this to have on reducing our deficit and our level of imports?

cleft constructions:

It is a very pressing issue that Mr Sacredeus is addressing.

coordinated clauses:

This motion implies that somehow the current income tax laws on alimony and maintenance payments are unfair, contribute to the problem and therefore should be amended.


Evaluation experiment 1: n-best evaluation

Corpus: Hansard (Canadian Parliament debates)

Language: French

Size: ∼1.2 M words

Methods: Baseline, Syntax-based

Results: Significance list

Evaluation: 500 types, top level; total: 1000 types; 3 evaluators/method

Annotation:

(-gram) erroneous

(+gram)

(-lex) regular

(+lex) interesting

Fleiss κ = 0.50 (Baseline); Fleiss κ = 0.39 (Syntax-Based) (moderate/fair agreement)
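
For readers unfamiliar with the agreement statistic, Fleiss' κ for a fixed number of raters per item can be computed as sketched below. The ratings are made-up toy data, not the annotations from the experiment, and `fleiss_kappa` is a hypothetical helper name.

```python
def fleiss_kappa(ratings):
    """ratings[i][j] = number of raters who put item i in category j.

    Every item must be judged by the same number of raters.
    """
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    n_cats = len(ratings[0])
    # Share of all judgments that fall in each category.
    p_cat = [
        sum(row[j] for row in ratings) / (n_items * n_raters)
        for j in range(n_cats)
    ]
    # Proportion of agreeing rater pairs for each item.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ]
    p_observed = sum(p_item) / n_items
    p_expected = sum(p * p for p in p_cat)
    return (p_observed - p_expected) / (1 - p_expected)


# Toy data: 6 items, 3 raters, categories (erroneous, regular, interesting).
toy_ratings = [
    [3, 0, 0],
    [0, 3, 0],
    [0, 0, 3],
    [0, 2, 1],
    [1, 1, 1],
    [0, 1, 2],
]
kappa = fleiss_kappa(toy_ratings)
```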


Annotation examples

(-gram) erroneous: petite entreprise (*V-O)

(+gram) (-lex) regular: aide à rénovation

(+gram) (+lex) interesting: aborder sujet


Confusion examples

attention particulière, bref délai, grave problème, lésion corporelle, offrir service, avoir droit, avoir droit, avoir honneur, créer emploi (14.0% of the cases)

an prochain, dernier année, écouter discours, fin de semaine, monde entier, offre finale, relation de travail (19.3% of the cases)


Evaluation results

Outer ring – baseline; inner ring – syntax-based method

Statistical significance: +gram t(982) = 10.78, p < 0.001; +lex t(982) = 2.90, p < 0.01
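
A t statistic of this kind can be obtained as sketched below. The slide does not say which variant of the test was used, so this assumes an unpaired, pooled-variance two-sample t-test over 0/1 judgments (1 = the type was judged +gram); the data are invented and `two_sample_t` is a hypothetical helper name.

```python
import math
from statistics import mean, variance


def two_sample_t(a, b):
    """Pooled-variance two-sample t statistic and its degrees of freedom."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    t = (mean(a) - mean(b)) / math.sqrt(pooled * (1 / na + 1 / nb))
    return t, df


# Toy judgments: 9 of 10 types accepted for one method, 5 of 10 for the other.
syntax_based = [1] * 9 + [0] * 1
baseline = [1] * 5 + [0] * 5
t_value, dof = two_sample_t(syntax_based, baseline)
```

Under this assumption the degrees of freedom equal the total number of judged types minus two.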


Evaluation results

Grammatical precision by sets of 50 pairs
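
Precision by sets of 50 can be computed from the ranked judgments as in the following sketch. It assumes one boolean per evaluated pair, in output order, with True meaning the pair was judged grammatical; the labels are toy data and `precision_by_blocks` is a hypothetical name.

```python
def precision_by_blocks(labels, block_size=50):
    """Return the share of True labels in each consecutive block."""
    precisions = []
    for start in range(0, len(labels), block_size):
        block = labels[start : start + block_size]
        precisions.append(sum(block) / len(block))
    return precisions


# Toy ranked judgments for 100 pairs.
toy_labels = [True] * 40 + [False] * 10 + [True] * 25 + [False] * 25
blocks = precision_by_blocks(toy_labels)
```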


Evaluation experiment 2: stratified evaluation

Corpus: Europarl (Koehn, 2005)

Languages: French, English, Italian, Spanish

Size: on average, ∼3.7 M words/language

Methods: Baseline, Syntax-based

Results: Significance list

Evaluation: 50 types; 5 levels (0-10%); total: 2000 types; 2 evaluators/method

Annotation:

(-gram) erroneous

(+gram)

(-lex) regular

(+lex): named entity, compound, idiom, collocation

Cohen’s κ = 0.61 (significant agreement)
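
With two evaluators, agreement is measured with Cohen's κ, which can be computed as sketched below. The labels are invented toy data rather than the experiment's annotations, and `cohen_kappa` is a hypothetical helper name.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    # Chance agreement from each rater's own label distribution.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)


# Toy annotations for 10 types.
rater_a = ["collocation"] * 4 + ["regular"] * 4 + ["idiom"] * 2
rater_b = (
    ["collocation"] * 3 + ["regular"]
    + ["regular"] * 3 + ["collocation"]
    + ["idiom"] * 2
)
kappa_ab = cohen_kappa(rater_a, rater_b)
```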


Annotation examples

(-gram) erroneous: human development

(+gram)

(-lex) regular: next item

(+lex)

named entity: European Union

compound: point of order

idiom: same umbrella

collocation: table amendment


Confusion matrix

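                regular   named entity   collocation   compound   idiom
regular             422              6           222         51       7
named entity                        26             2         11       0
collocation                                      315         63      11
compound                                                     64       5
idiom                                                                11

Examples of disagreement, by cell:

regular / named entity: European research

regular / collocation: economic stability

regular / compound: sea fleet

regular / idiom: short route

named entity / collocation: emisfero sud

named entity / compound: cour de compte

collocation / compound: wheel vehicle

collocation / idiom: open door

compound / idiom: titre (de) exemple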


Evaluation results

Outer ring – baseline; inner ring – syntax-based method

Statistical significance:

+gram t(1436) = 26.7, p < .001

+lex t(1436) = 11, p < .001

collocation t(1436) = 9.2, p < .001


Conclusion

Parsing technologies, traditionally seen as inappropriate for large-scale processing of corpora, are today the main ingredient for accurate collocation extraction.

The strong syntactic filter applied on the source text reduces the amount of data to process in the subsequent step to almost one quarter.

Parsing is the solution to the combinatorial explosion problem in the task of identifying longer collocations in text (e.g., be a major turning point, to stand in stark contrast).


Thank you for your attention!


Sabine Bartsch. 2004. Structural and Functional Properties of Collocations in English. A Corpus Study of Lexical and Pragmatic Constraints on Lexical Cooccurrence. Gunter Narr Verlag, Tübingen.

Elisabeth Breidt. 1993. Extraction of V-N-collocations from text corpora: A feasibility study for German. In Proceedings of the Workshop on Very Large Corpora: Academic and Industrial Perspectives, pages 74–83, Columbus, USA.

Simon Charest, Eric Brunelle, Jean Fontaine, and Bertrand Pelletier. 2007. Élaboration automatique d’un dictionnaire de cooccurrences grand public. In Actes de la 14e conférence sur le Traitement Automatique des Langues Naturelles (TALN 2007), pages 283–292, Toulouse, France, June.

Nick Ellis. 2008. Phraseology: The periphery and the heart of language. In Fanny Meunier and Sylviane Granger, editors, Phraseology in Foreign Language Learning and Teaching, pages 1–13. John Benjamins, Amsterdam/Philadelphia.

Stefan Evert. 2004. The Statistics of Word Cooccurrences: Word Pairs and Collocations. Ph.D. thesis, University of Stuttgart.


Charles Fillmore, Paul Kay, and Catherine O’Connor. 1988. Regularity and idiomaticity in grammatical constructions: The case of let alone. Language, 64(3):501–538.

Jean-Philippe Goldman, Luka Nerima, and Eric Wehrli. 2001. Collocation extraction using a syntactic parser. In Proceedings of the ACL Workshop on Collocation: Computational Extraction, Analysis and Exploitation, pages 61–66, Toulouse, France.

Dirk Heylen, Kerry G. Maxwell, and Marc Verhagen. 1994. Lexical functions and machine translation. In Proceedings of the 15th International Conference on Computational Linguistics (COLING 1994), pages 1240–1244, Kyoto, Japan.

Seonho Kim, Zooil Yang, Mansuk Song, and Jung-Ho Ahn. 1999. Retrieving collocations from Korean text. In Proceedings of the 1999 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, pages 71–81, Maryland, USA.

Göran Kjellmer. 1987. Aspects of English collocations. In Willem Meijs, editor, Corpus Linguistics and Beyond, pages 133–140. Rodopi, Amsterdam.

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In Proceedings of The Tenth Machine Translation Summit (MT Summit X), pages 79–86, Phuket, Thailand, September.


Diana Lea and Moira Runcie, editors. 2002. Oxford Collocations Dictionary for Students of English. Oxford University Press, Oxford.

Dekang Lin. 1998. Extracting collocations from text corpora. In First Workshop on Computational Terminology, pages 57–63, Montreal, Canada.

Dekang Lin. 1999. Automatic identification of non-compositional phrases. In Proceedings of the 37th annual meeting of the Association for Computational Linguistics on Computational Linguistics, pages 317–324, Morristown, NJ, USA.

Yajuan Lü and Ming Zhou. 2004. Collocation translation acquisition using monolingual corpora. In Proceedings of the 42nd Meeting of the Association for Computational Linguistics (ACL’04), pages 167–174, Barcelona, Spain.

Igor Mel’čuk. 1998. Collocations and lexical functions. In Anthony P. Cowie, editor, Phraseology. Theory, Analysis, and Applications, pages 23–53. Clarendon Press, Oxford.

Athina Michou and Violeta Seretan. 2009. A tool for multi-word expression extraction in Modern Greek using syntactic parsing. In Proceedings of the Demonstrations Session at EACL 2009, pages 45–48, Athens, Greece, April. Association for Computational Linguistics.


Brigitte Orliac and Mike Dillinger. 2003. Collocation extraction for machine translation. In Proceedings of Machine Translation Summit IX, pages 292–298, New Orleans, Louisiana, USA.

Darren Pearce. 2002. A comparative evaluation of collocation extraction techniques. In Third International Conference on Language Resources and Evaluation, pages 1530–1536, Las Palmas, Spain.

Violeta Seretan and Eric Wehrli. 2007. Collocation translation based on sentence alignment and parsing. In Proceedings of TALN 2007, Toulouse, France.

Violeta Seretan and Eric Wehrli. 2009. Multilingual collocation extraction with a syntactic parser. Language Resources and Evaluation, 43(1):71–85.

Violeta Seretan and Eric Wehrli. 2010. Extending a multilingual symbolic parser to Romanian. In Dan Tufiş and Corina Forăscu, editors, Multilinguality and Interoperability in Language Processing with Emphasis on Romanian. Romanian Academy Publishing House.

Violeta Seretan and Eric Wehrli. 2011. FipsCoView: On-line visualisation of collocations extracted from multilingual parallel corpora. In Proceedings of the Workshop on Multiword Expressions: from Parsing and Generation to the Real World, pages 125–127, Portland, Oregon, USA, June. Association for Computational Linguistics.


Violeta Seretan. 2011. A collocation-driven approach to text summarization. In Actes de la 18e conférence sur le Traitement Automatique des Langues Naturelles (TALN 2011), pages 9–14, Montpellier, France.

John Sinclair. 1991. Corpus, Concordance, Collocation. Oxford University Press, Oxford.

Frank Smadja. 1993. Retrieving collocations from text: Xtract. Computational Linguistics, 19(1):143–177.

María Begoña Villada Moirón. 2005. Data-driven identification of fixed expressions and their modifiability. Ph.D. thesis, University of Groningen.

Eric Wehrli, Luka Nerima, and Yves Scherrer. 2009. Deep linguistic multilingual translation and bilingual dictionaries. In Proceedings of the Fourth Workshop on Statistical Machine Translation, pages 90–94, Athens, Greece.

Eric Wehrli, Violeta Seretan, and Luka Nerima. 2010. Sentence analysis and collocation identification. In Proceedings of the Workshop on Multiword Expressions: from Theory to Applications (MWE 2010), pages 27–35, Beijing, China.


Eric Wehrli. 2003. Translation of words in context. In Proceedings of Machine Translation Summit IX, pages 502–504, New Orleans, Louisiana, USA.

Eric Wehrli. 2006. TwicPen: Hand-held scanner and translation software for non-native readers. In Proceedings of the COLING/ACL 2006 Interactive Presentation Sessions, pages 61–64, Sydney, Australia.

Eric Wehrli. 2007. Fips, a “deep” linguistic multilingual parser. In ACL 2007 Workshop on Deep Linguistic Processing, pages 120–127, Prague, Czech Republic.

Hua Wu and Ming Zhou. 2003. Synonymous collocation extraction using translation information. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL 2003), pages 120–127, Sapporo, Japan.

David Yarowsky. 1993. One sense per collocation. In Proceedings of ARPA Human Language Technology Workshop, pages 266–271, Princeton.
