INFuture2009: Digital Resources and Knowledge Sharing, 4-7 November 2009. FF & FER. Comparative Analysis of Automatic Term and Collocation Extraction. Sanja Seljan, Bojana Dalbelo Bašić, Jan Šnajder, Davor Delač, Matija Šamec-Gjurin, Dina Crnec. Faculty of Humanities and Social Sciences, Department of Information Sciences; Faculty of Electrical Engineering and Computing.

Comparative Analysis of Automatic Term and Collocation Extraction

Jan 14, 2016


Page 1: Comparative Analysis of Automatic Term and Collocation Extraction

INFuture2009: Digital Resources and Knowledge Sharing, 4-7 November 2009

FF & FER

Comparative Analysis of Automatic Term and Collocation Extraction

Sanja Seljan, Bojana Dalbelo Bašić, Jan Šnajder, Davor Delač, Matija Šamec-Gjurin, Dina Crnec

Faculty of Humanities and Social Sciences, Department of Information Sciences; Faculty of Electrical Engineering and Computing

Page 2: Comparative Analysis of Automatic Term and Collocation Extraction

Overview

I. Introduction
– Reasons for extraction

II. Research
– Resources & tools
– Extracted lists

III. Evaluation
– Precision, recall, F-measure

IV. Conclusion

Page 3: Comparative Analysis of Automatic Term and Collocation Extraction

I. Introduction

• Monolingual and multilingual resources
– Helpful
– Integrated
– Require human intervention

• EU pre-accession activities
– Speed up + consistency

• Used in further research and practice

Page 4: Comparative Analysis of Automatic Term and Collocation Extraction

• List:
– Terms (Member State, European Union)
– Collocations (adopt a/the resolution, decided as follows)
– Multi-word units (depend on, well-being)

• Term extraction process:
– Term extraction (term acquisition): identification
– Term recognition: verification

Page 5: Comparative Analysis of Automatic Term and Collocation Extraction

II. Research

• Resources
– 10 documents – legislation, Cro-Eng

• Tools
– TermeX tool (FER) – list A
– SDL Multi Term Extract + NooJ (FF) – list B

• Reference list
– Evaluation – reference list

Page 6: Comparative Analysis of Automatic Term and Collocation Extraction

Reference list

• 470 terms and collocations
• Exclude unigrams
• Balance between lexical coverage, adequacy, practicality
– terms (NPs: 346/470)
– collocations (VPs)

Page 7: Comparative Analysis of Automatic Term and Collocation Extraction

Reference list

• Contains:
– Terms (acquiring company, applicant country)
– Collocations (adopt a/the resolution, decided as follows, entry into force, having regard to)
– Names and abbreviations (Economic and Monetary Union EMU, European Union EU)
– Relevant embedded terms (crime prevention, crime prevention bodies, national crime prevention measures)

Page 8: Comparative Analysis of Automatic Term and Collocation Extraction

List B

• Language-independent, statistically based SDL Multi Term Extract tool
– Frequency threshold set to 4
– Filtered by the list of stop-words -> 369 candidates

• Language-dependent NooJ tool
– 36 local grammars -> 512 candidates
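The statistical step for List B (frequency threshold plus stop-word filtering) can be illustrated with a minimal Python sketch. This is not how SDL Multi Term Extract works internally, which the slides do not describe; the function name `candidate_ngrams`, the tiny stop-word set, and the rule "drop n-grams that start or end with a stop word" are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"the", "of", "a", "an", "and", "to", "in", "on", "for"}


def candidate_ngrams(tokens, n_values=(2, 3), min_freq=4, stop_words=STOP_WORDS):
    """Return n-grams that occur at least `min_freq` times and that
    neither start nor end with a stop word (an assumed filtering rule)."""
    counts = Counter()
    for n in n_values:
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            if gram[0] in stop_words or gram[-1] in stop_words:
                continue  # filtered out by the stop-word list
            counts[gram] += 1
    # Keep only candidates that reach the frequency threshold.
    return {gram: c for gram, c in counts.items() if c >= min_freq}


# Toy input: one sentence repeated four times.
tokens = ("the member state shall inform the european union " * 4).split()
candidates = candidate_ngrams(tokens)
```

With the threshold at 4, n-grams that only arise across the repetition boundary (seen three times) are dropped, while "member state" and "european union" survive.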

Page 9: Comparative Analysis of Automatic Term and Collocation Extraction

List A

• TermeX
– Lexical association measures (AMs)
– 14 AMs (PMI, Dice, Chi-square,…)
– Lemmatization
– POS filtering
– Frequency threshold set to ?
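Three of the association measures named above (PMI, Dice, chi-square) can be computed from simple bigram counts. The sketch below uses the standard textbook definitions; it is not TermeX code, and the function name `association_measures` and its parameters are hypothetical.

```python
import math


def association_measures(f_xy, f_x, f_y, n):
    """Association measures for a bigram (x, y).

    f_xy: count of the bigram "x y"
    f_x:  count of bigrams with x as first word
    f_y:  count of bigrams with y as second word
    n:    total number of bigrams in the corpus
    """
    # Pointwise mutual information: log2( P(x, y) / (P(x) * P(y)) )
    pmi = math.log2((f_xy * n) / (f_x * f_y))
    # Dice coefficient
    dice = 2 * f_xy / (f_x + f_y)
    # Observed 2x2 contingency table
    o11 = f_xy
    o12 = f_x - f_xy
    o21 = f_y - f_xy
    o22 = n - f_x - f_y + f_xy
    # Pearson chi-square for a 2x2 table
    chi2 = n * (o11 * o22 - o12 * o21) ** 2 / (
        (o11 + o12) * (o21 + o22) * (o11 + o21) * (o12 + o22)
    )
    return {"pmi": pmi, "dice": dice, "chi2": chi2}


scores = association_measures(f_xy=10, f_x=20, f_y=20, n=1000)
```

When the two words co-occur exactly as often as chance predicts, PMI and chi-square are both zero; higher values indicate a stronger association and therefore a better-ranked candidate.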

Page 10: Comparative Analysis of Automatic Term and Collocation Extraction

List A

• Extracted terms ranked by AM value
– 1816 candidates

• AMs used:
– 2-grams – PMI
– 3-grams, 4-grams – heuristic extensions

• Noun phrases only

Page 11: Comparative Analysis of Automatic Term and Collocation Extraction

Results

• Evaluation
– F1-measure (precision, recall)
– True positives calculated by taking into account inflection (suffix stripping)
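Counting true positives with suffix stripping might look roughly like the sketch below. The slides do not give the actual stripping rules, so the suffix list, the minimum stem length, and the helper names `strip_suffix`, `normalize`, and `true_positives` are all hypothetical.

```python
def strip_suffix(word, suffixes=("ing", "ed", "s")):
    """Crude suffix stripping with an assumed rule set;
    a suffix is removed only if at least 3 characters remain."""
    for suf in suffixes:
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[: -len(suf)]
    return word


def normalize(term):
    """Lowercase a multi-word term and strip a suffix from each word."""
    return " ".join(strip_suffix(w) for w in term.lower().split())


def true_positives(extracted, reference):
    """Extracted terms whose normalized form matches a reference term."""
    ref = {normalize(t) for t in reference}
    return {t for t in extracted if normalize(t) in ref}


# Toy data, for illustration only.
extracted = ["Member States", "crime prevention measure", "legal act"]
reference = ["member state", "crime prevention measures"]
matched = true_positives(extracted, reference)
```

Both sides are normalized the same way, so an inflected candidate such as "Member States" still counts as a match for the reference entry "member state".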

                 List A   List B
No. of terms       1816      508
Valid terms         202      234
Precision (%)     11.56    47.37
Recall (%)        42.98    49.79
F1 (%)            18.22    48.55
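The F1 and recall figures in the table can be reproduced with a few lines of Python. The sketch assumes F1 is the usual harmonic mean of precision and recall and that recall is the number of valid terms divided by the 470 entries of the reference list. Precision is taken as reported rather than recomputed, since the slides do not state which candidate count it was computed over.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (both in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


REFERENCE_SIZE = 470  # size of the reference list

# Values copied from the results table.
reported = {
    "List A": {"valid": 202, "precision": 11.56, "recall": 42.98},
    "List B": {"valid": 234, "precision": 47.37, "recall": 49.79},
}

for name, row in reported.items():
    recall = 100 * row["valid"] / REFERENCE_SIZE
    score = f1(row["precision"], row["recall"])
    print(f"{name}: recall={recall:.2f}%  F1={score:.2f}%")
```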

Page 12: Comparative Analysis of Automatic Term and Collocation Extraction

Results

• List A unsatisfactory
– Low recall: verb phrases and terms consisting of more than 4 words are missed
– Low precision: the list is ranked, so precision can be improved with a cut-off (true positives are better ranked)

• List B modest
– can be improved with lemmatization, definition of upper/lower cases, more detailed local grammar
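The cut-off idea for the ranked List A can be illustrated by computing precision over only the top k candidates. The sketch below uses toy data invented for illustration; the function name `precision_at_k` and the example ranking are not from the study.

```python
def precision_at_k(ranked, valid, k):
    """Share of valid terms among the top-k ranked candidates."""
    top = ranked[:k]
    if not top:
        return 0.0
    return sum(1 for term in top if term in valid) / len(top)


# Hypothetical ranked candidate list (best association score first).
ranked = [
    "member state",
    "european union",
    "of the",
    "crime prevention",
    "in the",
    "shall be",
]
valid = {"member state", "european union", "crime prevention"}

for k in (2, 4, 6):
    print(k, precision_at_k(ranked, valid, k))
```

If true positives cluster near the top of the ranking, as the slide suggests, a smaller k raises precision at the cost of recall.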

Page 13: Comparative Analysis of Automatic Term and Collocation Extraction

Conclusion

• Comparison of two hybrid approaches to term extraction

• Human-created lists differ from extracted lists
– human knowledge, experience and intuition

• Space for improvement
– automatic extraction combined with human intervention

Page 14: Comparative Analysis of Automatic Term and Collocation Extraction

Thank you!
