Word Sense Detection and Word Sense Disambiguation through Data-Mining, by Andi Wu & Randall Tan, Asia Bible Society
BibleTech2011

Apr 14, 2017

Page 1: BibleTech2011

Word Sense Detection

and

Word Sense Disambiguation

through Data-Mining

Andi Wu & Randall Tan

Asia Bible Society

Page 2: BibleTech2011

Outline

• Motivations for word sense identification

• Problems of existing word sense data

• The data-mining approach

• Demo and discussion

Page 3: BibleTech2011

Motivations

Addressing the issue of polysemy

• Bible translation

– Better understanding of every word

– Unification on the basis of senses rather than words

• Bible search

– More refined search results on the basis of senses

Page 4: BibleTech2011

Goals

• Word sense detection

For each content word in the Bible, find out how many senses it has.

• Word sense disambiguation

For each instance of the word, find out which of the senses it has.

Page 5: BibleTech2011

Word sense detection

Identify the senses of each word:

ראשית

Sense 1

Sense 1-1: beginning

Sense 1-2: first

Sense 2

Sense 2-1: firstfruits

Sense 2-2: firstborn

Sense 3

Sense 3-1: best

Sense 3-2: choicest

Sense 4

Sense 4-1: foremost

Page 6: BibleTech2011

Word sense disambiguation

Identify the sense of each instance:

בראשית ברא אלהים את השמים ואת הארץ׃ (Gen 1:1)

Sense 1: beginning

קרבן ראשית תקריבו אתם ליהוה ואל־המזבח לא־יעלו לריח ניחח׃ (Lev 2:12)

Sense 2: firstfruits

ויקח העם מהשלל צאן ובקר ראשית החרם לזבח ליהוה אלהיך בגלגל׃ (1Sm 15:21)

Sense 3: best

הוא ראשית דרכי־אל העשו יגש חרבו׃ (Job 40:19)

Sense 4: foremost

Page 7: BibleTech2011

Problems of Existing Data

• No consensus on the number of senses each word has

• No complete data of instance-based sense identification

• Manual identification can be subjective, inconsistent, and time-consuming

Page 8: BibleTech2011

The Data-Mining Approach

• Theoretical assumption

• Data for mining

• Machine learning procedures

• Advantages and limitations of the approach

• Tool for sense exploration

Page 9: BibleTech2011

Theoretical Assumption

• Translators presumably use different target-language words to translate different senses of a word. In effect, translators have already done the job of disambiguation subconsciously and have defined each sense with target-language words.

Page 10: BibleTech2011

Data for Mining

Translations linked word-to-word to the original Hebrew/Greek texts

Page 11: BibleTech2011

Basic Task

Take all instances of a word and group the instances into different senses

Page 12: BibleTech2011

A Simple and Naive Approach

Look at the words used in a given translation and treat instances with the same translation words as having the same sense.
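
As a minimal sketch of this naive grouping (not the authors' implementation): the instance list and its glosses below are invented for illustration and are not taken from any actual translation, and `group_by_gloss` is a hypothetical helper name.

```python
from collections import defaultdict

# Hypothetical toy data: (instance reference, gloss used by ONE translation)
# for instances of a single source word. The glosses are made up.
instances = [
    ("Gen 1:1", "beginning"),
    ("Lev 2:12", "firstfruits"),
    ("1Sm 15:21", "best"),
    ("Job 40:19", "first"),
    ("hypothetical-5", "beginning"),
]

def group_by_gloss(instances):
    """Treat instances rendered with the same target word as one sense."""
    groups = defaultdict(list)
    for ref, gloss in instances:
        groups[gloss].append(ref)
    return dict(groups)

senses = group_by_gloss(instances)
for gloss, refs in senses.items():
    print(gloss, refs)
```

Every distinct gloss becomes its own sense group, so the result depends entirely on the wording choices of the one translation consulted.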

Page 13: BibleTech2011

A Simple and Naive Approach

Problems:

• The translations may not be consistent:

Translators may use different words to translate the same sense or the same word to translate different senses

• It can be subjective

It only reflects the opinions of a particular translation

• The senses are too fragmentary

Page 14: BibleTech2011

The Voting Approach

• Use multiple translations

• Two instances of a word are considered to have the same sense if most of the translations render both with the same word.

• Check and balance

• How to define “most”?

Page 15: BibleTech2011

The Voting Approach

How many votes to get?

• Maximal agreement:

– Internal consistency within groups

– Too many senses

– Too many unassigned instances

• Minimal agreement:

– Better grouping of senses

– Instances of different senses may be mixed together
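
The trade-off between maximal and minimal agreement can be illustrated with a small sketch. It assumes each instance is stored as a tuple of glosses, one per translation; the data, the four-translation setup, and the helper names `agreements` and `linked_pairs` are all hypothetical.

```python
from itertools import combinations

# Hypothetical data: for each instance, the gloss chosen by each of
# four translations (same translation order in every tuple).
renderings = {
    "a": ("beginning", "beginning", "start", "beginning"),
    "b": ("beginning", "beginning", "start", "beginning"),
    "c": ("beginning", "first", "start", "outset"),
    "d": ("best", "choicest", "best", "finest"),
}

def agreements(x, y):
    """Number of translations that render both instances with the same word."""
    return sum(gx == gy for gx, gy in zip(x, y))

def linked_pairs(renderings, n):
    """Pairs of instances that share a rendering in at least n translations."""
    return [
        (i, j)
        for i, j in combinations(sorted(renderings), 2)
        if agreements(renderings[i], renderings[j]) >= n
    ]

strict = linked_pairs(renderings, 4)  # maximal agreement: all four agree
loose = linked_pairs(renderings, 2)   # lower threshold: two votes suffice
print(strict)
print(loose)
```

With the strict threshold only the identically rendered pair is linked and the other instances stay unassigned; lowering the threshold links more instances, at the risk of mixing senses.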

Page 16: BibleTech2011

Progressive Merging

Trying to get the benefits of both maximal and minimal agreement and avoid their disadvantages

• Start with maximal agreement to get initial sense groups that are internally consistent

• Gradually merge the initial groups using a decreasing number of agreements N (0 < N < maximal) and a variable association rate R (0 < R < 1)

• Group B is merged into Group A if A contains B

• A contains B if each instance in B is linked to at least R of the instances in A by N agreements.

• Pair-wise merge until no further merge can be done
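
The procedure above can be sketched roughly as follows. This is an illustrative reading of the slide, not the authors' code: the data layout (one gloss tuple per instance), the order in which group pairs are tried, the reading of "by N agreements" as "at least N", and all function names are assumptions.

```python
def agreements(x, y):
    """Number of translations that render both instances with the same word."""
    return sum(gx == gy for gx, gy in zip(x, y))

def contains(a, b, renderings, n, r):
    """A contains B if every instance in B agrees, in at least n translations,
    with at least a fraction r of the instances in A."""
    for ib in b:
        linked = sum(
            agreements(renderings[ib], renderings[ia]) >= n for ia in a
        )
        if linked < r * len(a):
            return False
    return True

def merge_pass(groups, renderings, n, r):
    """Pair-wise merge at a fixed n until no further merge can be done."""
    groups = [list(g) for g in groups]
    merged = True
    while merged:
        merged = False
        for i, a in enumerate(groups):
            for j, b in enumerate(groups):
                if i != j and contains(a, b, renderings, n, r):
                    a.extend(b)      # merge B into A
                    del groups[j]
                    merged = True
                    break
            if merged:
                break                # restart the scan after each merge
    return groups

def progressive_merge(renderings, r, merges):
    """Start from maximal agreement, then merge with decreasing n."""
    n_max = len(next(iter(renderings.values())))
    initial = {}
    for inst, glosses in renderings.items():
        # Maximal agreement: identical rendering in every translation.
        initial.setdefault(glosses, []).append(inst)
    groups = list(initial.values())
    for step in range(1, merges + 1):
        n = n_max - step
        if n < 1:
            break
        groups = merge_pass(groups, renderings, n, r)
    return groups

# Hypothetical data: gloss per translation for each instance.
renderings = {
    "a": ("beginning", "beginning", "start", "beginning"),
    "b": ("beginning", "beginning", "start", "beginning"),
    "c": ("beginning", "first", "start", "outset"),
    "d": ("best", "choicest", "best", "finest"),
}
print(progressive_merge(renderings, r=0.7, merges=1))
print(progressive_merge(renderings, r=0.7, merges=3))
```

In this toy data, instance "c" joins the "a"/"b" group only once the required number of agreements drops to 2, while "d" never merges because it shares no rendering with the others.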

Page 17: BibleTech2011

Progressive Merging

Merging two groups: association rate = 0.5

Page 18: BibleTech2011

Progressive Merging

Example: Maximal N = 4, R = 70%

• Merge 1: N = 4 - 1 = 3

B merges into A if each instance in B is linked to at least 70% of the instances in A by sharing the same translation in at least 3 versions

• Merge 2: N = 4 - 2 = 2

B merges into A if each instance in B is linked to at least 70% of the instances in A by sharing the same translation in at least 2 versions

• Merge 3: N = 4 - 3 = 1

B merges into A if each instance in B is linked to at least 70% of the instances in A by sharing the same translation in at least 1 version
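
The merge criterion used in each of these steps can be written compactly. The notation is an assumption introduced here for illustration: agree(a, b) is the number of versions that translate instances a and b with the same word, N_max is the maximal number of agreements, and k is the merge step.

```latex
\[
A \text{ contains } B
\iff
\forall b \in B:\;
\bigl|\{\, a \in A : \mathrm{agree}(a, b) \ge N_{\max} - k \,\}\bigr|
\;\ge\; R \cdot |A|
\]
```

With N_max = 4 and R = 0.7, step k = 1 requires agreement in at least 3 versions, k = 2 in at least 2, and k = 3 in at least 1. If A has 10 instances, each instance in B must be linked to at least 7 of them.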

Page 19: BibleTech2011

Tuning the Variables

We can get different results by tuning the following variables:

• The translations to use

• The number of translations to use

• The number of merges to perform

• The association rate

Page 20: BibleTech2011

The “Accents” of Senses

• Senses based on English translations

• Senses based on Chinese translations

• Senses based on both English and Chinese

• The triangulating effect of using different translations

Page 21: BibleTech2011

Factors Affecting the Results

• The versions of translations that are used

• The quality of each translation

• The degree of consensus between different translations

• Quality of lemmatization in English

• Surface forms vs. lemmatized forms
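
To see why the choice between surface forms and lemmatized forms matters, consider the following small sketch. The lemma table is a hypothetical stand-in for a real lemmatizer, and the function names are illustrative only.

```python
# Hypothetical lemma table; a real system would use a proper lemmatizer.
LEMMAS = {
    "beginnings": "beginning",
    "firstfruits": "firstfruit",
}

def lemma(word):
    """Look up the lemma, falling back to the lowercased word itself."""
    return LEMMAS.get(word.lower(), word.lower())

def same_rendering(w1, w2, lemmatize):
    """Decide whether two translation words count as the same rendering."""
    if lemmatize:
        return lemma(w1) == lemma(w2)
    return w1 == w2

print(same_rendering("beginning", "beginnings", lemmatize=False))
print(same_rendering("beginning", "beginnings", lemmatize=True))
```

Comparing surface forms treats inflected variants as different renderings, which lowers agreement counts; comparing lemmas raises them, but only as far as the lemmatization is reliable.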

Page 22: BibleTech2011

Other Features Considered

• Syntactic contexts

– Instances that occur in similar syntactic contexts tend to have the same sense

– Not used because of the sparse data problem

• Morphological information

– Verbs with different stems in Hebrew tend to have different senses

– Not used because the stem distinctions do not always correspond well with sense distinctions

Page 23: BibleTech2011

Editing Options

The data shown here have not been manually edited, but they can be edited using the tool:

• Merge sense groups

• Split a sense group

• Move an instance from one sense group to another

• Use of manual information in automatic learning

Page 24: BibleTech2011

Demo

Page 25: BibleTech2011

Applications

• Sense-based translation memory

• Sense-based concordance

• Sense-based consistency check

Page 26: BibleTech2011

Advantages of the Current Approach

• Efficiency: a sense dictionary that lists not only the senses but also the specific instances of each sense can be built in a matter of days.

• Objectivity: the results are based on actual data, and no preconceived subjective categorization is required.

• Flexibility: the granularity of sense divisions can be adjusted through the values of the similarity metrics in the clustering process.

Page 27: BibleTech2011

Conclusion

A great tool for exploring and studying word senses in biblical texts
