Transcript
Page 1: presentation

1

Combining Lexical and Syntactic Features for Supervised Word Sense Disambiguation

Saif Mohammad, University of Toronto, http://www.cs.toronto.edu/~smm
Ted Pedersen, University of Minnesota, http://www.d.umn.edu/~tpederse

Page 2: presentation

2

Word Sense Disambiguation

Harry cast a bewitching spell

Humans immediately understand spell to mean a charm or incantation, not a reading out letter by letter or a period of time. Words with multiple senses – polysemy, ambiguity!

Humans utilize background knowledge and context to do so.

Machines lack background knowledge. Automatically identifying the intended sense of a word in written text, based on its context, remains a hard problem. Best accuracies in recent international evaluations (Senseval) are around 65%.

Page 3: presentation

3

Why do we need WSD?

Information Retrieval
Query: cricket bat. Documents pertaining to the insect (cricket) and the mammal (bat) are irrelevant.

Machine Translation
Consider English to Hindi translation: should head be translated to sar (upper part of the body) or adhyaksh (leader)?

Machine-human interaction
Instructions to machines. Interactive home system: turn on the lights. Domestic android: get the door.

Applications are widespread and will affect our way of life.

Page 4: presentation

4

Terminology

Harry cast a bewitching spell

Target word – the word whose intended sense is to be identified.

spell

Context – the sentence housing the target word and possibly one or two sentences around it.

Harry cast a bewitching spell

Instance – target word along with its context.

WSD is a classification problem wherein the occurrence of the target word is assigned to one of its many possible senses.
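To make the classification framing concrete, here is a minimal sketch in Python of how a sense-tagged instance might be represented. The class and field names are illustrative, not taken from the authors' code.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Instance:
    """One WSD instance: a target word, its context, and (for training
    data) the manually assigned sense. Names here are hypothetical."""
    target: str            # the word to be disambiguated, e.g. "spell"
    context: str           # the sentence housing the target word
    sense: Optional[str]   # gold sense label; None for test instances

train = Instance(target="spell",
                 context="Harry cast a bewitching spell",
                 sense="incantation")
```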

Page 5: presentation

5

Corpus-Based Supervised Machine Learning

A computer program is said to learn from experience … if its performance at tasks … improves with experience.

- Mitchell

Task: Word Sense Disambiguation of given test instances.

Performance: ratio of instances correctly disambiguated to the total test instances – accuracy.

Experience: manually created instances in which target words are marked with the intended sense – training instances.

Harry cast a bewitching spell / incantation

Page 6: presentation

6

Decision Trees

A kind of classifier. Assigns a class by asking a series of questions. Questions correspond to features of the instance; the question asked depends on the answer to the previous question.

Inverted tree structure of interconnected nodes. The topmost node is called the root. Each node corresponds to a question / feature, and each possible value of the feature has a corresponding branch.

Leaves terminate every path from the root. Each leaf is associated with a class.
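The talk's experiments use Weka's decision tree learner (see the resources slide). Purely as an illustrative stand-in, here is a minimal sketch with scikit-learn's DecisionTreeClassifier over binary unigram features, on invented toy data:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier

# Toy sense-tagged contexts for the target word "spell" (invented).
contexts = [
    "harry cast a bewitching spell",
    "the witch muttered an ancient spell",
    "please spell the word aloud",
    "can you spell your name for me",
]
senses = ["incantation", "incantation", "spell_out", "spell_out"]

# Binary unigram features: each internal node of the learned tree asks
# "does word w occur in the context?", with branches for yes / no.
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(contexts)

tree = DecisionTreeClassifier().fit(X, senses)

test = vectorizer.transform(["she had to spell it out for him"])
print(tree.predict(test))
```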

Page 7: presentation

7

WSD Tree

[Figure: an example decision tree for WSD. Internal nodes test binary features (Feature 1 through Feature 4), with a branch for each feature value (0 or 1); each leaf assigns one of the senses (Sense 1 through Sense 4).]

Page 8: presentation

8

Choice of Learning Algorithm

Why use decision trees for WSD? They have drawbacks – training data fragmentation. What about other learning algorithms, such as neural networks?

Context is a rich source of discrete features.

The learned model is likely meaningful, and may provide insight into the interaction of features.

Pedersen [2001]*: choosing the right features is of greater significance than the learning algorithm itself.

* T. Pedersen, 2001. A Decision Tree of Bigrams is an Accurate Predictor of Word Sense. In Proceedings of the Second Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL-01), June 2–7, 2001, Pittsburgh, PA.

Page 9: presentation

9

Lexical Features

Surface form – a word as we observe it in text. For case (n): 1. object of investigation; 2. frame or covering; 3. a weird person. Surface forms: case, cases, casing. An occurrence of casing suggests sense 2.

Unigrams and bigrams – one-word and two-word sequences in text.

The interest rate is low
Unigrams: the, interest, rate, is, low
Bigrams: the interest, interest rate, rate is, is low
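A minimal sketch (plain Python, no particular toolkit assumed) of enumerating these features:

```python
def ngrams(tokens, n):
    """All contiguous n-word sequences in a list of tokens."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the interest rate is low".split()
print(ngrams(tokens, 1))  # ['the', 'interest', 'rate', 'is', 'low']
print(ngrams(tokens, 2))  # ['the interest', 'interest rate', 'rate is', 'is low']
```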

Page 10: presentation

10

Part of Speech Tagging

Brill Tagger – the most widely used tool. Accuracy around 95%. Source code available. Easily understood rules.

Pre-tagging is the act of manually assigning tags to selected words in a text prior to tagging.

The Brill tagger does not guarantee pre-tagging: manually assigned tags may be overridden. A patch to the tagger is provided – BrillPatch*.

* S. Mohammad and T. Pedersen, 2003. Guaranteed Pre-Tagging for the Brill Tagger. In Proceedings of the Fourth International Conference on Intelligent Text Processing and Computational Linguistics (CICLing-2003), February 2003, Mexico City, Mexico.
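The patch itself modifies the Brill tagger; the sketch below only illustrates the idea of pre-tagging, using NLTK's stock tagger as a stand-in. Note that naively overwriting a tag after tagging, as done here, does not let the pre-tag inform the tags chosen for neighboring words, which is precisely what Guaranteed Pre-Tagging provides.

```python
import nltk  # assumes nltk plus its default POS tagger model are installed

def tag_with_pretags(tokens, pretags):
    """Tag a sentence, then force manually chosen tags (naive version).

    pretags maps token positions to required tags. Unlike BrillPatch,
    the override happens after tagging, so the forced tag cannot
    influence how the surrounding words are tagged.
    """
    tagged = nltk.pos_tag(tokens)
    return [(word, pretags.get(i, tag)) for i, (word, tag) in enumerate(tagged)]

tokens = "Why did Jack turn left at the crossing".split()
print(tag_with_pretags(tokens, {3: "VB"}))  # force "turn" to be tagged VB
```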

Page 11: presentation

11

Part of Speech Features

A word used in different senses is likely to have different sets of POS tags around it.

Why did Jack turn/VB against/IN his/PRP$ team/NN
Why did Jack turn/VB left/NN at/IN the/DT crossing

Features used – individual word POS: P-2, P-1, P0, P1, P2. For example, P1 = JJ implies that the word to the right of the target word is an adjective. Combinations of the above are also used.
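A minimal sketch of extracting the individual-word POS features, assuming the sentence has already been tagged:

```python
def pos_window(tags, target_index):
    """POS features P-2 .. P2 around the target word; positions that
    fall outside the sentence get a padding value."""
    features = {}
    for k in (-2, -1, 0, 1, 2):
        i = target_index + k
        features[f"P{k}"] = tags[i] if 0 <= i < len(tags) else "<pad>"
    return features

# "Why did Jack turn left at the crossing" -- target word "turn" at index 3
tags = ["WRB", "VBD", "NNP", "VB", "NN", "IN", "DT", "NN"]
print(pos_window(tags, 3))
# {'P-2': 'VBD', 'P-1': 'NNP', 'P0': 'VB', 'P1': 'NN', 'P2': 'IN'}
```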

Page 12: presentation

12

Parse Features

Collins Parser* used to parse the data. Source code available. Takes part-of-speech-tagged data as input.

Head word of a phrase: the hard work, the hard surface. The phrase itself: noun phrase, verb phrase, and so on.

Parent – head word of the parent phrase: fasten the line, cross the line. The parent phrase as well.

* http://www.ai.mit.edu/people/mcollins
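A small sketch of reading the phrase and parent-phrase features off a parse, using nltk's Tree over the bracketed parse shown on the next slide. Real head-word extraction relies on the parser's head rules, which are not reproduced here.

```python
from nltk import Tree

# Bracketed parse of the running example (labels abbreviated to S/NP/VP).
parse = Tree.fromstring(
    "(S (NP (NNP Harry)) (VP (VBD cast) (NP (DT a) (JJ bewitching) (NN spell))))"
)

def phrase_features(tree, target):
    """Labels of the phrase housing the target word and of its parent."""
    leaf_pos = [p for p in tree.treepositions("leaves") if tree[p] == target][0]
    phrase = tree[leaf_pos[:-2]]                        # phrase above the POS node
    parent = tree[leaf_pos[:-3]] if len(leaf_pos) >= 3 else None
    return phrase.label(), parent.label() if parent is not None else None

print(phrase_features(parse, "spell"))  # ('NP', 'VP')
```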

Page 13: presentation

13

Sample Parse Tree

[Figure: parse tree of "Harry cast a bewitching spell", reproduced here in bracketed form.]

(SENTENCE (NOUN PHRASE Harry/NNP)
          (VERB PHRASE cast/VBD
                       (NOUN PHRASE a/DT bewitching/JJ spell/NN)))

Page 14: presentation

14

Sense-Tagged Data

Senseval-2 data: 4,328 test instances and 8,611 training instances, ranging over 73 different nouns, verbs and adjectives.

Senseval-1 data: 8,512 test instances and 13,276 training instances, ranging over 35 nouns, verbs and adjectives.

line, hard, serve and interest data: 4,149, 4,337, 4,378 and 2,476 sense-tagged instances with line, hard, serve and interest as the head words, respectively.

Around 50,000 sense-tagged instances in all!

Page 15: presentation

15

Experiments

Page 16: presentation

16

Lexical: Senseval-1 & Senseval-2

Feature        Sval-2   Sval-1   line    hard    serve   interest
Majority       47.7%    56.3%    54.3%   81.5%   42.2%   54.9%
Surface Form   49.3%    62.9%    54.3%   81.5%   44.2%   64.0%
Unigram        55.3%    66.9%    74.5%   83.4%   73.3%   75.7%
Bigram         55.1%    66.9%    72.9%   89.5%   72.1%   79.9%

Page 17: presentation

17

Individual Word POS (Senseval-1)

Feature    All      Nouns    Verbs    Adj.
Majority   56.3%    57.2%    56.9%    64.3%
P-2        57.5%    58.2%    58.6%    64.0%
P-1        59.2%    62.2%    58.2%    64.3%
P0         60.3%    62.5%    58.2%    64.3%
P1         63.9%    65.4%    64.4%    66.2%
P2         59.9%    60.0%    60.8%    65.2%

Page 18: presentation

18

Individual Word POS (Senseval-2)

Feature    All      Nouns    Verbs    Adj.
Majority   47.7%    51.0%    39.7%    59.0%
P-2        47.1%    51.9%    38.0%    57.9%
P-1        49.6%    55.2%    40.2%    59.0%
P0         49.9%    55.7%    40.6%    58.2%
P1         53.1%    53.8%    49.1%    61.0%
P2         48.9%    50.2%    43.2%    59.4%

Page 19: presentation

19

Combining POS Features

Feature set            Sval-2   Sval-1   line    hard    serve   interest
Majority               47.7%    56.3%    54.3%   81.5%   42.2%   54.9%
P0, P1                 54.3%    66.7%    54.1%   81.9%   60.2%   70.5%
P-1, P0, P1            54.6%    68.0%    60.4%   84.8%   73.0%   78.8%
P-2, P-1, P0, P1, P2   54.6%    67.8%    62.3%   86.2%   75.7%   80.6%

Page 20: presentation

20

Parse Features (Senseval-1)

Feature     All      Nouns    Verbs    Adj.
Majority    56.3%    57.2%    56.9%    64.3%
Head        64.3%    70.9%    59.8%    66.9%
Parent      60.6%    62.6%    60.3%    65.8%
Phrase      58.5%    57.5%    57.2%    66.2%
Par. Phr.   57.9%    58.1%    58.3%    66.2%

Page 21: presentation

21

Parse Features (Senseval-2)

Feature     All      Nouns    Verbs    Adj.
Majority    47.7%    51.0%    39.7%    59.0%
Head        51.7%    58.5%    39.8%    64.0%
Parent      50.0%    56.1%    40.1%    59.3%
Phrase      48.3%    51.7%    40.3%    59.5%
Par. Phr.   48.5%    53.0%    39.1%    60.3%

Page 22: presentation

22

Thoughts…

Both lexical and syntactic features perform comparably. But do they get the same instances right?

How redundant are the individual feature sets? Are there instances correctly disambiguated by one feature set and not by the other? That is, how complementary are the feature sets?

Is the effort to combine lexical and syntactic features justified?

Page 23: presentation

23

Measures

Baseline Ensemble: accuracy of a hypothetical ensemble which predicts the sense correctly only if both individual feature sets do so. Quantifies redundancy amongst feature sets.

Optimal Ensemble: accuracy of a hypothetical ensemble which predicts the sense correctly if either of the individual feature sets does so. Its difference from the individual accuracies quantifies complementarity.

We used a simple ensemble which sums the probabilities assigned to each sense by the individual feature sets to decide the intended sense.
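A minimal sketch of these three measures, given per-instance correctness flags and per-sense probability distributions from the two classifiers (variable names are illustrative):

```python
def baseline_and_optimal(correct_a, correct_b):
    """Accuracy of the hypothetical ensembles, from boolean lists saying
    whether feature set A (resp. B) got each test instance right."""
    n = len(correct_a)
    baseline = sum(a and b for a, b in zip(correct_a, correct_b)) / n  # both right
    optimal = sum(a or b for a, b in zip(correct_a, correct_b)) / n    # either right
    return baseline, optimal

def simple_ensemble(probs_a, probs_b):
    """The ensemble actually used: sum the probabilities assigned to each
    sense by the two feature sets and pick the highest-scoring sense."""
    senses = set(probs_a) | set(probs_b)
    return max(senses, key=lambda s: probs_a.get(s, 0.0) + probs_b.get(s, 0.0))

print(baseline_and_optimal([True, True, False], [True, False, True]))  # (0.333..., 1.0)
print(simple_ensemble({"incantation": 0.6, "spell_out": 0.4},
                      {"incantation": 0.3, "spell_out": 0.7}))          # 'spell_out'
```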

Page 24: presentation

24

Best Combinations

Data (Majority)    Set 1             Set 2              Base    Ens.    Opt.    Best
Sval-2 (47.7%)     Unigrams 55.3%    P-1,P0,P1 55.3%    43.6%   57.0%   67.9%   66.7%
Sval-1 (56.3%)     Unigrams 66.9%    P-1,P0,P1 68.0%    57.6%   71.1%   78.0%   81.1%
line (54.3%)       Unigrams 74.5%    P-1,P0,P1 60.4%    55.1%   74.2%   82.0%   88.0%
hard (81.5%)       Bigrams 89.5%     Head, Par 87.7%    86.1%   88.9%   91.3%   83.0%
serve (42.2%)      Unigrams 73.3%    P-1,P0,P1 73.0%    58.4%   81.6%   89.9%   83.0%
interest (54.9%)   Bigrams 79.9%     P-1,P0,P1 78.8%    67.6%   83.2%   90.1%   89.0%

(Base = baseline ensemble, Ens. = simple ensemble, Opt. = optimal ensemble; each data set's majority baseline is given in parentheses.)

Page 25: presentation

25

Conclusions

There is a significant amount of complementarity across lexical and syntactic features. Combination of the two is therefore justified.

We show that simple lexical and part-of-speech features can achieve state-of-the-art results.

How best to capitalize on the complementarity is still an open issue.

Page 26: presentation

26

Conclusions (continued)

The part of speech of the word immediately to the right of the target word was found most useful.

POS tags of the words to the right of the target word are best for verbs and adjectives; nouns are helped by tags on either side.

(P0, P1) was found to be most potent when there is little training data per target word (Senseval data). A larger POS context (P-2, P-1, P0, P1, P2) was shown to be beneficial when the training data per target word is large (line, hard, serve and interest data).

The head word of the phrase is particularly useful for adjectives; nouns are helped by both head and parent.

Page 27: presentation

27

Code, Data & Resources

SyntaLex: a system for WSD using lexical and syntactic features. Weka's decision tree learning algorithm is utilized.

posSenseval: part-of-speech tags any data in Senseval-2 data format. Brill Tagger used.

parseSenseval: parses data in the format output by the Brill Tagger. Output is in Senseval-2 data format with part of speech and parse information as XML tags. Uses the Collins Parser.

Packages to convert the line, hard, serve and interest data to Senseval-1 and Senseval-2 data formats.

BrillPatch: patch to the Brill Tagger to employ Guaranteed Pre-Tagging.

http://www.d.umn.edu/~tpederse/code.html
http://www.d.umn.edu/~tpederse/data.html

Page 28: presentation

28

Senseval-3 (March 1 to April 15, 2004): around 8,000 training and 4,000 test instances. Results expected shortly.

Thank You