
The MWE 2008 Shared Task: Ranking MWE Candidates

Dec 05, 2021

Page 1: The MWE 2008 Shared Task: Ranking MWE Candidates

CogSci

The MWE 2008 Shared Task: Ranking MWE Candidates

Stefan Evert
University of Osnabrück

[email protected] | purl.org/stefan.evert

Page 2: The MWE 2008 Shared Task: Ranking MWE Candidates

MWE 2008 Shared Task

A first exploratory shared task, using some of the data sets contributed to MWE 2008

Focus on candidate ranking

well-established task & evaluation

essential step for MWE extraction

More sophisticated challenges to be tackled in future years


Page 3: The MWE 2008 Shared Task: Ranking MWE Candidates

The 4 subtasks

1. English verb-particle combinations (Baldwin, EN-VPC)

3078 candidates, 440 TPs, baseline precision = 14.29%

TPs subdivided into transitive and intransitive VPC

2. German PP-verb combinations (Krenn, DE-PNV)

5102 candidates (FR-30 subset), 566 TPs, baseline = 11.09%

TPs are figurative expressions and support-verb constructions

3. German adjective-noun collocations (Evert, DE-AN)

1252 candidates, 520 TPs (cat. 1+2), baseline = 41.53%

4. Czech dependency bigrams (Pecina, CZ-MWE)

12232 candidates, 2572 unanimous TPs, baseline = 21.03%
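
The baseline figures above are simply the proportion of TPs among the candidates of each subtask. A minimal sketch that recomputes them from the counts on this slide (the names `SUBTASKS` and `baseline_precision` are illustrative, not part of the shared task materials):

```python
# Candidate and true-positive (TP) counts as listed on the slide.
SUBTASKS = {
    "EN-VPC": (3078, 440),
    "DE-PNV": (5102, 566),
    "DE-AN": (1252, 520),
    "CZ-MWE": (12232, 2572),
}


def baseline_precision(candidates: int, true_positives: int) -> float:
    """Proportion of TPs in the candidate set, as a percentage."""
    return 100.0 * true_positives / candidates


if __name__ == "__main__":
    for name, (n_candidates, n_tps) in SUBTASKS.items():
        print(f"{name}: {baseline_precision(n_candidates, n_tps):.2f}%")
```

This is the precision a random ranking would be expected to achieve, which is why it serves as the reference line for the results.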


Page 4: The MWE 2008 Shared Task: Ranking MWE Candidates

Corpus frequency data

1. English verb-particle combinations (Baldwin, EN-VPC)

no frequency data provided

participants used BNC (fragment) and Web frequencies

2. German PP-verb combinations (Krenn, DE-PNV)

frequency data from Frankfurter Rundschau (FR) corpus provided

verb + nearest PP (based on TreeTagger & YAC chunking)

3. German adjective-noun collocations (Evert, DE-AN)

frequency data from FR corpus (TreeTagger & YAC chunking)

4. Czech dependency bigrams (Pecina, CZ-MWE)

frequency data from Prague Dependency Treebank provided


Page 5: The MWE 2008 Shared Task: Ranking MWE Candidates

Shared task participants

Baseline results (Evert)

all 4 subtasks, 6 standard association measures

Ramisch, Schreiner, Idiart & Villavicencio

EN-VPC, DE-PNV, DE-AN

two standard association measures + permutation entropy

Pecina

DE-PNV, DE-AN, CZ-MWE

57 association measures + machine-learning techniques

experiments with different TP definitions and subsets


Page 6: The MWE 2008 Shared Task: Ranking MWE Candidates

Evaluation methodology

[Figure: precision-recall graph for X2 and MI. x-axis: Recall (%), y-axis: Precision (%)]

Algorithm produces ranking of given set of candidates (in the form of decreasing scores)

Compute precision & recall (of TPs) for different n-best lists

Can be visualised as a precision-recall graph

Baseline precision: proportion of TPs in data set

Overall quality measure: average precision
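
The evaluation steps above can be sketched as follows. This is an illustration, not the official evaluation script: the function names are made up, and average precision is assumed to follow the usual definition (mean of the precision values at the rank of each TP), which the slides do not spell out.

```python
from typing import List, Sequence, Tuple


def rank_candidates(scores: Sequence[float], is_tp: Sequence[bool]) -> List[bool]:
    """Return the TP labels sorted by decreasing score (ties keep input order)."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    return [is_tp[i] for i in order]


def precision_recall_points(ranked: Sequence[bool]) -> List[Tuple[float, float]]:
    """(recall %, precision %) for every n-best list, n = 1 .. len(ranked)."""
    total_tp = sum(ranked)
    if total_tp == 0:
        return []
    points = []
    hits = 0
    for n, label in enumerate(ranked, start=1):
        hits += label
        points.append((100.0 * hits / total_tp, 100.0 * hits / n))
    return points


def average_precision(ranked: Sequence[bool]) -> float:
    """Mean of precision at the rank of each TP, as a percentage (assumed definition)."""
    hits = 0
    total = 0.0
    for n, label in enumerate(ranked, start=1):
        if label:
            hits += 1
            total += hits / n
    return 100.0 * total / hits if hits else 0.0


if __name__ == "__main__":
    ranked = rank_candidates([0.9, 0.1, 0.5, 0.7], [True, False, True, False])
    print(precision_recall_points(ranked))
    print(f"AP = {average_precision(ranked):.2f}%")
```

Plotting the points from `precision_recall_points` gives a precision-recall graph of the kind shown on the slide.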


Page 8: The MWE 2008 Shared Task: Ranking MWE Candidates

Evaluation methodology

[Figure: precision-recall graph for X2 and MI. x-axis: Recall (%), y-axis: Precision (%). Annotations: baseline precision = 11.09%, average precision = 39.79%]

Page 9: The MWE 2008 Shared Task: Ranking MWE Candidates

Baseline results

Baseline results for 6 widely-used association measures

log-likelihood (G2)

chi-squared (X2) with Yates' correction

t-score (t)

Mutual Information (MI)

Dice coefficient (Dice)

frequency ranking (f)

NB: Ramisch et al.'s MI = G2
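
For reference, the six measures can be computed from a 2×2 contingency table. The sketch below uses common textbook formulations; the exact variants used for the baseline runs are an assumption, as is the function name `association_scores` and its parameters (co-occurrence frequency `f`, marginal frequencies `f1` and `f2`, sample size `n`).

```python
import math
from typing import Dict


def association_scores(f: float, f1: float, f2: float, n: float) -> Dict[str, float]:
    """Scores for one candidate pair; assumes 0 < f <= min(f1, f2) and f1, f2 < n."""
    # Observed contingency table and its row/column totals.
    o11, o12, o21, o22 = f, f1 - f, f2 - f, n - f1 - f2 + f
    r1, r2 = f1, n - f1
    c1, c2 = f2, n - f2
    observed = [o11, o12, o21, o22]
    # Expected frequencies under independence.
    expected = [r1 * c1 / n, r1 * c2 / n, r2 * c1 / n, r2 * c2 / n]
    e11 = expected[0]

    # Log-likelihood: cells with zero observed frequency contribute nothing.
    g2 = 2.0 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)
    # Chi-squared with Yates' correction (correction clipped at zero: an assumption).
    corrected = max(0.0, abs(o11 * o22 - o12 * o21) - n / 2.0)
    x2 = n * corrected ** 2 / (r1 * r2 * c1 * c2)

    return {
        "G2": g2,
        "X2": x2,
        "t": (o11 - e11) / math.sqrt(o11),
        "MI": math.log2(o11 / e11),
        "Dice": 2.0 * o11 / (r1 + c1),
        "f": o11,
    }


if __name__ == "__main__":
    for name, value in association_scores(f=10, f1=20, f2=30, n=1000).items():
        print(f"{name}: {value:.4f}")
```

Note that G2 and X2 as written are two-sided: they do not distinguish attraction from repulsion between the two words, which a ranking implementation may want to handle separately.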


Page 10: The MWE 2008 Shared Task: Ranking MWE Candidates

Baseline results: EN-VPC

[Figure: precision-recall graph for English VPC. x-axis: Recall (%), y-axis: Precision (%). Curves: G2, X2, t, MI, Dice, f]

frequency data from full BNC (adjacent verb + particle)

missing data are ranked at bottom

baseline: 14.29%

best AM: t-score (AP = 29.94%)

frequency ranking: AP = 29.01%

Page 11: The MWE 2008 Shared Task: Ranking MWE Candidates

Baseline results: DE-PNV

[Figure: precision-recall graph for German PNV. x-axis: Recall (%), y-axis: Precision (%). Curves: G2, X2, t, MI, Dice, f]

FR-30 subset with 5102 candidates

TPs = figurative expressions + SVC

baseline: 11.09%

best AM: t-score (AP = 39.79%)

frequency ranking: AP = 33.88%

Page 12: The MWE 2008 Shared Task: Ranking MWE Candidates

Baseline results: DE-AN

[Figure: precision-recall graph for German AN. x-axis: Recall (%), y-axis: Precision (%). Curves: G2, X2, t, MI, Dice, f]

TPs = cat. 1+2

baseline: 41.53%

best AM: Dice (AP = 58.84%)

frequency ranking: AP = 46.90%

Page 13: The MWE 2008 Shared Task: Ranking MWE Candidates

Baseline results: CZ-MWE

[Figure: precision-recall graph for Czech MWE bigrams. x-axis: Recall (%), y-axis: Precision (%). Curves: G2, X2, t, MI, Dice, f]

TPs = unanimous among 3 judges

baseline: 21.03%

best AM: chi-sq. (AP = 64.86%)

frequency: AP = 21.70%

log-likelihood much worse than on other data sets

Page 14: The MWE 2008 Shared Task: Ranking MWE Candidates

Result overview: EN-VPC

[Figure: bar chart of results by method. Categories: Baseline, G2, X2, t-score, MI, Dice, freq, PE, EPI, Best AM, ML. Bar labels in extraction order: 19.3, 18.0, 29.0, 27.3, 22.0, 29.9, 28.6, 29.3, 14.3]

Page 15: The MWE 2008 Shared Task: Ranking MWE Candidates

Result overview: DE-PNV

[Figure: bar chart of results by method. Categories: Baseline, G2, X2, t-score, MI, Dice, freq, PE, EPI, Best AM, ML. Bar labels in extraction order: 60.6, 43.9, 22.7, 14.6, 33.9, 20.1, 24.4, 39.8, 30.0, 39.1, 11.1]

Page 16: The MWE 2008 Shared Task: Ranking MWE Candidates

Result overview: DE-AN

[Figure: bar chart of results by method. Categories: Baseline, G2, X2, t-score, MI, Dice, freq, PE, EPI, Best AM, ML. Bar labels in extraction order: 61.3, 62.9, 40.4, 46.9, 58.8, 54.7, 52.4, 57.1, 56.1, 41.5]

Page 17: The MWE 2008 Shared Task: Ranking MWE Candidates

Result overview: CZ-MWE

[Figure: bar chart of results by method. Categories: Baseline, G2, X2, t-score, MI, Dice, freq, PE, EPI, Best AM, ML. Bar labels in extraction order: 79.5, 65.6, 21.7, 62.6, 64.9, 24.6, 65.0, 42.5, 21.0]

Page 18: The MWE 2008 Shared Task: Ranking MWE Candidates

Result overview: combined

[Figure: combined bar chart of results by method. Categories: Baseline, G2, X2, t-score, MI, Dice, freq, PE, EPI, Best AM, ML]

Page 19: The MWE 2008 Shared Task: Ranking MWE Candidates


And now on to the participants …