Unsupervised Morpheme Analysis – Overview of Morpho Challenge 2007 in CLEF. Mikko Kurimo, Mathias Creutz, Matti Varjokallio, Ville Turunen. Helsinki University of Technology, Finland

Dec 18, 2015


Unsupervised Morpheme Analysis – Overview of Morpho Challenge 2007 in CLEF

Mikko Kurimo, Mathias Creutz, Matti Varjokallio, Ville Turunen

Helsinki University of Technology, Finland

My job at Helsinki:

Multimodal Interfaces @ Adaptive Informatics (Research Centre of Academy of Finland)

Research topics of MMI group

• Continuous Speech Recognition
• Adaptive Natural Language Modelling
• Content-Based Image and Video Retrieval
• Multimodal Interfaces: proactive audio-visual information navigation, effective multilingual interaction, intermodal cross-over of semantics

Motivation of Morpho Challenge

• To design statistical machine learning algorithms that discover which morphemes words consist of
• Follow-up to Morpho Challenge 2005 (segmentation of words into morphs)
• Morphemes are useful as vocabulary units for statistical language modeling in speech recognition, machine translation and information retrieval

The vocabulary problem

• Many applications require a large vocabulary, e.g. speech recognition, information retrieval, machine translation
• Agglutinative and highly inflected languages suffer from a severe vocabulary explosion
• We need more efficient representation units

[Figure: Unique words per corpus size. x-axis: corpus size (million words); y-axis: unique words (millions).]

Scientific objectives

• To learn of the phenomena underlying word construction in natural languages
• To discover approaches suitable for a wide range of languages and tasks
• To advance machine learning methodology

Morpho Challenge 2007

• Part of the EU Network of Excellence PASCAL’s Challenge Program
• Organized in collaboration with CLEF
• Participation is open to all and free of charge
• Word sets are provided for Finnish, English, German and Turkish
• Implement an unsupervised algorithm that discovers the morpheme analysis of words in each language!

Thanks

Thanks to all who made Morpho Challenge 2007 possible:

• PASCAL network, CLEF, Leipzig corpora collection
• Morpho Challenge organizing committee
• Morpho Challenge program committee
• Morpho Challenge participants
• Morpho Challenge evaluation team
• CLEF 2007 organizers!

Rules

• Morpheme analyses are submitted to the organizers, and two different evaluations are made
• Competition 1: Comparison to a linguistic morpheme "gold standard"
• Competition 2: Information retrieval experiments, where the indexing is based on morphemes instead of entire words

Training data

• Word lists downloadable at our home page
• Each word in the list is preceded by its frequency
• Finnish: 3M sentences, 2.2M word types
• Turkish: 1M sentences, 620K word types
• German: 3M sentences, 1.3M word types
• English: 3M sentences, 380K word types
• Small gold standard sample available in each language
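
A word list in which each word is preceded by its frequency can be read with a few lines of Python. This is only a sketch: it assumes one entry per line with whitespace between the count and the word, which the slide does not spell out, and the three sample entries are invented for illustration.

```python
from collections import Counter
from typing import Iterable


def read_word_list(lines: Iterable[str]) -> Counter:
    """Parse lines of the assumed form '<frequency> <word>' into word counts."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        freq, word = line.split(maxsplit=1)
        counts[word] += int(freq)
    return counts


# Invented sample entries, not taken from the challenge data.
sample = ["1280 talo", "37 talossa", "5 taloissakin"]
counts = read_word_list(sample)
word_types = len(counts)            # number of distinct words
word_tokens = sum(counts.values())  # total number of occurrences
print(word_types, word_tokens)
```

The type and token totals correspond to the "word types" figures quoted per language on the slide.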

Examples of gold standard analyses

• English: baby-sitters        baby_N sit_V er_s +PL
• Finnish: linuxiin            linux_N +ILL
• German:  zurueckzubehalten   zurueck_B zu be halt_V +INF
• Turkish: kontrole            kontrol +DAT
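
The examples above pair a word with a sequence of morpheme labels. A minimal parser might look like the following sketch. The file layout is an assumption made for illustration: a tab between the word and its analysis, spaces between labels, and commas between alternative analyses (the slides mention alternatives but do not show how they are written).

```python
from typing import List, Tuple


def parse_gold_line(line: str) -> Tuple[str, List[List[str]]]:
    """Split an assumed 'word<TAB>analysis[, analysis...]' line.

    Returns the word and a list of alternative analyses,
    each a list of morpheme labels.
    """
    word, _, rest = line.strip().partition("\t")
    alternatives = [alt.split() for alt in rest.split(",") if alt.strip()]
    return word, alternatives


word, analyses = parse_gold_line("baby-sitters\tbaby_N sit_V er_s +PL")
print(word, analyses)
```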

1. A new linguistic evaluation method

• Problem: The unsupervised morphemes may have arbitrary names, not the same as the "real" linguistic morphemes, nor just subword strings
• Solution: Compare to the linguistic gold standard analysis by matching the morpheme-sharing word pairs
• Compute matches from a large random sample of word pairs where both words in the pair have a common morpheme

Evaluation measures

• F-measure = 2/(1/Precision + 1/Recall), i.e. the harmonic mean of precision and recall
• Precision is the proportion of word pairs sharing a morpheme in the suggested analysis that also have a morpheme in common according to the gold standard
• Recall is the proportion of word pairs sampled from the gold standard that also have a morpheme in common according to the suggested algorithm
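
The pair-based measures can be sketched as below. This is a simplified illustration, not the official evaluation script: it enumerates all word pairs instead of drawing a random sample, and it assumes one analysis per word. The function names and the toy analyses are made up for the example.

```python
from itertools import combinations
from typing import Dict, Set, Tuple

Analysis = Dict[str, Set[str]]  # word -> set of morpheme labels


def sharing_pairs(analysis: Analysis) -> Set[Tuple[str, str]]:
    """All word pairs whose analyses have at least one morpheme in common."""
    words = sorted(analysis)
    return {(a, b) for a, b in combinations(words, 2) if analysis[a] & analysis[b]}


def pair_scores(proposed: Analysis, gold: Analysis) -> Tuple[float, float, float]:
    """Precision, recall and F-measure over morpheme-sharing word pairs."""
    common = set(proposed) & set(gold)
    p_pairs = sharing_pairs({w: proposed[w] for w in common})
    g_pairs = sharing_pairs({w: gold[w] for w in common})
    hits = len(p_pairs & g_pairs)
    precision = hits / len(p_pairs) if p_pairs else 0.0
    recall = hits / len(g_pairs) if g_pairs else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Toy data: the proposed labels (m1, m2, ...) are arbitrary names.
gold = {"cat": {"cat_N"}, "cats": {"cat_N", "+PL"},
        "dog": {"dog_N"}, "dogs": {"dog_N", "+PL"}}
proposed = {"cat": {"m1"}, "cats": {"m1", "m2"},
            "dog": {"m4"}, "dogs": {"m3", "m2"}}
precision, recall, f_measure = pair_scores(proposed, gold)
print(precision, recall, f_measure)
```

Because only the sharing of labels between words is compared, the arbitrary names chosen by an unsupervised algorithm do not affect the score. In the toy data the proposed analysis misses the link between "dog" and "dogs", which lowers recall but not precision.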

Participants

• Delphine Bernhard, TIMC-IMAG, F (now moved to Darmstadt, D)
• Stefan Bordag, Univ. Leipzig, D
• Paul McNamee and James Mayfield, JHU, USA
• Daniel Zeman, Karlova Univ., CZ
• Christian Monson et al., CMU, USA
• Emily Pitler and Samarth Keshava, Univ. Yale, USA
• Morfessor MAP, Helsinki Univ. Tech, FI
• (Michael Tepper, Univ. Washington, USA)

Results: Finnish, 2.2M word types

[Figure: F-measure (axis 0% to 50%) for each submission. Submissions in the order listed on the slide: Bernhard 2, Bernhard 1, Bordag 5a, Bordag 5, Zeman, McNamee 3, McNamee 4, McNamee 5, Morfessor MAP.]

Results: Turkish, 620K word types

[Figure: F-measure (axis 0% to 50%) for each submission. Submissions in the order listed on the slide: Zeman, Bordag 5a, Bordag 5, Bernhard 2, Bernhard 1, McNamee 3, McNamee 4, McNamee 5, Morfessor MAP, Tepper.]

Results: German, 1.3M word types

[Figure: F-measure (axis 0% to 50%) for each submission. Submissions in the order listed on the slide: Monson ParaMor-M., Bernhard 2, Bordag 5a, Bordag 5, Monson Morfessor, Bernhard 1, Monson ParaMor, Zeman, McNamee 3, McNamee 4, McNamee 5, Morfessor MAP.]

Results: English, 380K word types

[Figure: F-measure (axis 0% to 60%) for each submission. Submissions in the order listed on the slide: Bernhard 2, Bernhard 1, Pitler, Monson ParaMor-M., Monson ParaMor, Monson Morfessor, Zeman, Bordag 5a, Bordag 5, McNamee 3, McNamee 4, McNamee 5, Morfessor MAP, Tepper.]

2. Practical evaluation

• Real world application for morpheme analysis: Information Retrieval
• Analysis is needed to handle morphology (inflection, compounding)
• CLEF collections for Finnish, German and English

Data sets

• Finnish (CLEF 2004): 55K documents from articles in Aamulehti 94-95; 50 test queries and 23K binary relevance assessments
• English (CLEF 2005): 107K documents from articles in Los Angeles Times 94 and Glasgow Herald 95; 50 test queries and 20K binary relevance assessments
• German (CLEF 2003): 300K documents from short articles in Frankfurter Rundschau 94, Der Spiegel 94-95 and SDA 94-95; 60 test queries and 23K binary relevance assessments

Reference methods

• Morfessor Baseline: our public code since 2002
• Morfessor Categories-MAP: improved, public since 2006
• dummy: no segmentation
• grammatical: gold standard segmentations
  – all: all alternatives included
  – first: only first alternative
• Porter: LEMUR's default stemmer
• Tepper: hybrid method based on Morfessor MAP

Evaluation 1/2

• Words in the documents and queries were replaced by the submitted segmentations
• New words:
  – the CLEF collections contained words that were not in the original word list
  – additional segmentations were requested
  – if segmentation was not provided, words were indexed as such
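
The replacement step, including the fallback for words without a submitted segmentation, can be sketched as follows. The function name and the sample segmentation are hypothetical and only illustrate the idea.

```python
from typing import Dict, List


def segment_tokens(tokens: List[str],
                   segmentations: Dict[str, List[str]]) -> List[str]:
    """Replace each word by its morphemes; unknown words are kept as such."""
    indexed: List[str] = []
    for token in tokens:
        # Fall back to the whole word when no segmentation was provided.
        indexed.extend(segmentations.get(token, [token]))
    return indexed


# Hypothetical segmentation for one word; "linux" has none.
segmentations = {"taloissa": ["talo", "i", "ssa"]}
index_terms = segment_tokens(["taloissa", "linux"], segmentations)
print(index_terms)
```

The same function would be applied to both documents and queries so that the index terms match.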

Evaluation 2/2

• LEMUR-toolkit (http://www.lemurproject.org/)
• Okapi BM25 retrieval, default parameter settings
• Okapi seems to handle common morphemes poorly => stoplist for the most common ones (above a fixed frequency threshold)
• Also an alternative set of non-stoplisted results with TFIDF
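
A frequency-based stoplist of the kind described above might be built like this. It is a sketch independent of the LEMUR toolkit; the threshold value and the tiny collection are invented, since the slide does not state the threshold that was actually used.

```python
from collections import Counter
from typing import Iterable, List, Set


def build_stoplist(docs: Iterable[List[str]], threshold: int) -> Set[str]:
    """Collect morphemes whose collection frequency exceeds the threshold."""
    freq = Counter(m for doc in docs for m in doc)
    return {m for m, count in freq.items() if count > threshold}


def apply_stoplist(doc: List[str], stoplist: Set[str]) -> List[str]:
    """Drop stoplisted morphemes from one segmented document or query."""
    return [m for m in doc if m not in stoplist]


# Invented collection of segmented documents and a hypothetical threshold.
docs = [["talo", "ssa"], ["auto", "ssa"], ["talo", "n"], ["katu", "ssa"]]
stoplist = build_stoplist(docs, threshold=2)
filtered = apply_stoplist(docs[0], stoplist)
print(stoplist, filtered)
```

Here the suffix "ssa" occurs three times and is removed, while the stem "talo" occurs twice and is kept.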

Results: Finnish

[Figure: mean average precision (axis 0.2 to 0.55) for each method. Methods in the order listed on the slide: Bernhard 2, Bernhard 1, Morfessor baseline, Morfessor MAP, Bordag 5a, Bordag 5, grammatical all, grammatical first, McNamee 5, McNamee 4, porter, McNamee 3, dummy, Zeman.]

Results: German

[Figure: mean average precision (axis 0.2 to 0.5) for each method. Methods in the order listed on the slide: Bernhard 1, Bernhard 2, Monson Morfessor, Morfessor MAP, Morfessor baseline, Bordag 5, Bordag 5a, Monson ParaMor-M., porter, McNamee 5, grammatical first, McNamee 4, Monson ParaMor, grammatical all, McNamee 3, Zeman.]

Results: English

[Figure: mean average precision (axis 0.2 to 0.45) for each method. Methods in the order listed on the slide: porter, Bernhard 2, Bernhard 1, Morfessor baseline, grammatical first, Tepper, Monson Morfessor, Morfessor MAP, Pitler, grammatical all, McNamee 4, McNamee 5, Monson ParaMor-M., Bordag 5, Bordag 5a, dummy, McNamee 3, Monson ParaMor, Zeman.]

Conclusions

• Analysis of new words important for Finnish, less so for German and English
• Porter stemming unbeaten for English (so far)
• Unsupervised morpheme analysis works very well for IR!

Future directions?

• Finnish, Turkish, English, German, ...?
• Language modeling, Speech recognition, Information Retrieval, ...?
• Venice, Budapest, ...?
• PASCAL, CLEF, ...?

Summary 2007

• 14 different unsupervised algorithms
• 8 participating research groups
• Evaluations for 4 languages (3 for IR)
• Good results in all languages and IR
• Full report and papers in the CLEF proceedings
• Details, presentations, links, info at website: http://www.cis.hut.fi/morphochallenge2007/

Acknowledgments

• Data from Leipzig and CLEF
• Gold standard providers in all languages!
• Workshop organization by CLEF
• Funding from PASCAL and Academy of Finland
• Competition participants!