Word Sense Detection and Word Sense Disambiguation through Data-Mining
Andi Wu & Randall Tan
Asia Bible Society
Outline
• Motivations for word sense identification
• Problems of existing word sense data
• The data-mining approach
• Demo and discussion
Motivations
Addressing the issue of polysemy
• Bible translation
– Better understanding of every word
– Unification on the basis of senses rather than words
• Bible search
– More refined search results on the basis of senses
Goals
• Word sense detection
– For each content word in the Bible, find out how many senses it has.
• Word sense disambiguation
– For each instance of the word, find out which of the senses it has.
Word sense detection
Identify the senses of each word:
ראשית
Sense 1
Sense 1-1: beginning
Sense 1-2: first
Sense 2
Sense 2-1: firstfruits
Sense 2-2: firstborn
Sense 3
Sense 3-1: best
Sense 3-2: choicest
Sense 4
Sense 4-1: foremost
Word sense disambiguation
Identify the sense of each instance:
בראשית ברא אלהים את השמים ואת הארץ׃ (Gen 1:1)
קרבן ראשית תקריבו אתם ליהוה ואל־המזבח לא־יעלו לריח ניחח׃ (Lev 2:12)
ויקח העם מהשלל צאן ובקר ראשית החרם לזבח ליהוה אלהיך בגלגל׃ (1Sm 15:21)
הוא ראשית דרכי־אל העשו יגש חרבו׃ (Job 40:19)
Sense 1: beginning
Sense 2: firstfruits
Sense 3: best
Sense 4: foremost
Problems of Existing Data
• No consensus on the number of senses each word has
• No complete data of instance-based sense identification
• Manual identification can be subjective, inconsistent, and time-consuming
The Data-Mining Approach
• Theoretical assumption
• Data for mining
• Machine learning procedures
• Advantages and limitations of the approach
• Tool for sense exploration
Theoretical Assumption
• Translators presumably use different target-language words to translate different senses of a word. In effect, translators have already done the job of disambiguation subconsciously and have defined each sense with target-language words.
Data for Mining
Translations linked word-to-word to the original Hebrew/Greek texts
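One way to picture such linked data is a record per source-word instance that carries the target word each translation uses for it. The sketch below is only an assumption about a convenient in-memory shape; the class name `Instance`, the version labels `V1`/`V2`, and the glosses are hypothetical and not taken from the slides.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Instance:
    """One occurrence of a source-language word with its linked translations."""
    lemma: str  # source-language lemma (Hebrew or Greek)
    ref: str    # verse reference
    glosses: dict = field(default_factory=dict)  # version label -> linked target word


def instances_by_lemma(instances):
    """Collect all instances of each lemma, the input to sense grouping."""
    index = defaultdict(list)
    for inst in instances:
        index[inst.lemma].append(inst)
    return dict(index)


# Illustrative data only: version labels and glosses are made up.
data = [
    Instance("ראשית", "Gen 1:1", {"V1": "beginning", "V2": "beginning"}),
    Instance("ראשית", "Lev 2:12", {"V1": "firstfruits", "V2": "first"}),
]
index = instances_by_lemma(data)
```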
Basic Task
Take all instances of a word and group the instances into different senses
A Simple and Naive Approach
Look at the words used in a given translation and treat instances with the same translation words as having the same sense.
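A minimal sketch of this naive grouping, assuming each instance is available as a (reference, translation word) pair from a single translation. The glosses below are illustrative and not drawn from any particular version.

```python
from collections import defaultdict

# Hypothetical aligned data from ONE translation:
# (verse reference, target word linked to the source-word instance).
instances = [
    ("Gen 1:1", "beginning"),
    ("Lev 2:12", "firstfruits"),
    ("1Sm 15:21", "best"),
    ("Job 40:19", "first"),
    ("Jer 26:1", "beginning"),
]


def group_by_translation(instances):
    """Treat instances with the same translation word as one sense group."""
    groups = defaultdict(list)
    for ref, gloss in instances:
        groups[gloss.lower()].append(ref)
    return dict(groups)


sense_groups = group_by_translation(instances)
for gloss, refs in sense_groups.items():
    print(gloss, refs)
```

Every distinct translation word becomes its own sense group here, which is why this approach tends to produce fragmentary senses.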
A Simple and Naive Approach
Problems:
• The translations may not be consistent
– Translators may use different words to translate the same sense, or the same word to translate different senses
• It can be subjective
– It reflects only the opinions of a particular translation
• The senses are too fragmentary
The Voting Approach
• Use multiple translations
• Two instances of a word are considered to have the same sense if most of the translations use the same word to translate both of them
• Check and balance
• How to define “most”?
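The vote between two instances can be sketched as a count of the translations that render both with the same word. The slide leaves "most" open, so the default threshold below (a strict majority of the versions) is an assumption, as are the version labels `V1` to `V4` and the glosses.

```python
def agreements(a, b):
    """Number of versions that translate instances a and b with the same word."""
    return sum(1 for version, word in a.items() if b.get(version) == word)


def same_sense(a, b, min_votes=None):
    """Vote on whether two instances share a sense.

    min_votes defaults to a strict majority of the versions in `a`;
    this default is an assumption, since "most" is left undefined.
    """
    if min_votes is None:
        min_votes = len(a) // 2 + 1
    return agreements(a, b) >= min_votes


# Illustrative glosses for three instances in four hypothetical versions.
gen = {"V1": "beginning", "V2": "beginning", "V3": "beginning", "V4": "start"}
jer = {"V1": "beginning", "V2": "beginning", "V3": "beginning", "V4": "beginning"}
lev = {"V1": "firstfruits", "V2": "first", "V3": "firstfruits", "V4": "first"}

print(agreements(gen, jer), same_sense(gen, jer))
print(agreements(gen, lev), same_sense(gen, lev))
```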
The Voting Approach
How many votes to get?
• Maximal agreement:
– Internal consistency within groups
– Too many senses
– Too many unassigned instances
• Minimal agreement:
– Better grouping of senses
– Instances of different senses may be mixed together
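The trade-off can be seen by grouping the same instances at different vote thresholds. The sketch below uses single-link grouping (two instances end up together whenever a chain of sufficiently agreeing pairs connects them); that grouping rule is an assumption chosen for illustration, and the instance labels and glosses are made up.

```python
from itertools import combinations


def agreements(a, b):
    """Number of versions (tuple positions) where two instances share a word."""
    return sum(x == y for x, y in zip(a, b))


def group(instances, min_votes):
    """Single-link grouping: join instances that agree in >= min_votes versions."""
    parent = {ref: ref for ref in instances}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for p, q in combinations(instances, 2):
        if agreements(instances[p], instances[q]) >= min_votes:
            parent[find(q)] = find(p)

    groups = {}
    for ref in instances:
        groups.setdefault(find(ref), []).append(ref)
    return sorted(groups.values())


# Illustrative glosses from three hypothetical versions per instance.
instances = {
    "a": ("beginning", "beginning", "beginning"),
    "b": ("beginning", "beginning", "beginning"),
    "c": ("beginning", "beginning", "start"),
    "d": ("first", "best", "choicest"),
    "e": ("best", "best", "choicest"),
    "f": ("first", "beginning", "first"),
}

maximal = group(instances, min_votes=3)   # consistent, but many small groups
moderate = group(instances, min_votes=2)
minimal = group(instances, min_votes=1)   # few groups, senses may be mixed
print(len(maximal), len(moderate), len(minimal))
```

With this toy data the maximal threshold leaves most instances in groups of their own, while the minimal threshold chains everything into a single group.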
Progressive Merging
Trying to get the benefits of both maximal and minimal agreement while avoiding their disadvantages
• Start with maximal agreement to get initial sense groups that are internally consistent
• Gradually merge the initial groups with a decreasing number of agreements N (0 < N < Maximal) and with a variable association rate R (0 < R < 1)
• Group B is merged into Group A if A contains B
• A contains B if each instance in B is linked to at least a fraction R of the instances in A by at least N agreements
• Pair-wise merge until no further merge can be done
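The containment test at the heart of the merge can be sketched as follows. The function names, the instance labels, and the glosses are hypothetical; the rule itself follows the definition above, reading "linked by N agreements" as "sharing the same translation in at least N versions".

```python
def agreements(a, b):
    """Number of versions (tuple positions) where two instances share a word."""
    return sum(x == y for x, y in zip(a, b))


def contains(group_a, group_b, glosses, n, r):
    """True if every instance in B is linked to at least a fraction r of the
    instances in A, where a link means agreement in at least n versions."""
    for b in group_b:
        linked = sum(1 for a in group_a
                     if agreements(glosses[a], glosses[b]) >= n)
        if linked < r * len(group_a):
            return False
    return True


# Illustrative glosses from four hypothetical versions per instance.
glosses = {
    "a": ("beginning", "beginning", "beginning", "beginning"),
    "b": ("beginning", "beginning", "beginning", "beginning"),
    "c": ("beginning", "beginning", "beginning", "start"),
    "d": ("best", "best", "choicest", "finest"),
}

group_a, group_c, group_d = ["a", "b"], ["c"], ["d"]

# c agrees with both a and b in 3 versions, so A contains {c} at N = 3.
if contains(group_a, group_c, glosses, n=3, r=0.7):
    group_a = group_a + group_c

print(group_a)
print(contains(group_a, group_d, glosses, n=1, r=0.7))
```

Note that containment is not symmetric: a small group can be absorbed by a large one without the reverse holding.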
Progressive Merging
Example: Maximal N = 4, R = 70%
• Merge 1: N = Maximal - 1 = 3
– B merges into A if each instance in B is linked to at least 70% of the instances in A by sharing the same translation in at least 3 versions
• Merge 2: N = Maximal - 2 = 2
– B merges into A if each instance in B is linked to at least 70% of the instances in A by sharing the same translation in at least 2 versions
• Merge 3: N = Maximal - 3 = 1
– B merges into A if each instance in B is linked to at least 70% of the instances in A by sharing the same translation in at least 1 version
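The whole schedule for this example (four versions, R = 70%, merges at N = 3, 2, 1) can be sketched end to end. This is an illustrative reading of the procedure, not the authors' implementation: the data are made up, the initial groups are formed from instances whose translations agree in all four versions, and when two groups could absorb each other the earlier one in the list wins, which is an arbitrary choice.

```python
from collections import defaultdict


def agreements(a, b):
    """Number of versions (tuple positions) where two instances share a word."""
    return sum(x == y for x, y in zip(a, b))


def contains(group_a, group_b, glosses, n, r):
    """Every instance in B links to >= r of A's instances by >= n agreements."""
    return all(
        sum(agreements(glosses[a], glosses[b]) >= n for a in group_a)
        >= r * len(group_a)
        for b in group_b
    )


def merge_until_stable(groups, glosses, n, r):
    """Pair-wise merge B into A whenever A contains B, until nothing changes."""
    changed = True
    while changed:
        changed = False
        pairs = [(i, j) for i in range(len(groups))
                 for j in range(len(groups)) if i != j]
        for i, j in pairs:
            if contains(groups[i], groups[j], glosses, n, r):
                groups[i] = groups[i] + groups[j]
                del groups[j]
                changed = True
                break  # indices shifted; rebuild the pair list
    return groups


def progressive_merge(glosses, r):
    """Return [(N, groups)] for the initial grouping and each merge round."""
    max_n = len(next(iter(glosses.values())))
    initial = defaultdict(list)
    for ref, words in glosses.items():
        initial[words].append(ref)  # maximal agreement: all versions agree
    groups = list(initial.values())
    history = [(max_n, [list(g) for g in groups])]
    for n in range(max_n - 1, 0, -1):
        groups = merge_until_stable(groups, glosses, n, r)
        history.append((n, [list(g) for g in groups]))
    return history


# Illustrative glosses from four hypothetical versions per instance.
glosses = {
    "a": ("beginning", "beginning", "beginning", "beginning"),
    "b": ("beginning", "beginning", "beginning", "beginning"),
    "c": ("beginning", "beginning", "beginning", "start"),
    "d": ("beginning", "first", "beginning", "start"),
    "e": ("best", "best", "choicest", "finest"),
    "f": ("best", "best", "choicest", "choice"),
    "g": ("firstfruits", "firstfruits", "first", "first"),
}

history = progressive_merge(glosses, r=0.7)
for n, groups in history:
    print(n, groups)
```

In this toy run, instance d joins the "beginning" group only at N = 2, because at N = 3 it is linked to just one of the three instances already in that group, below the 70% rate.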
Tuning the Variables
We can get different results by tuning the following variables:
• The translations to use
• The number of translations to use
• The number of merges to perform
• The association rate
The “Accents” of Senses
• Senses based on English translations
• Senses based on Chinese translations
• Senses based on both English and Chinese
• The triangulating effect of using different translations
Factors Affecting the Results
• The versions of translations that are used
• The quality of each translation
• The degree of consensus between different translations
• Quality of lemmatization in English
• Surface forms vs. lemmatized forms
Other Features Considered
• Syntactic contexts
– Instances that occur in similar syntactic contexts tend to have the same sense
– Not used because of the sparse-data problem
• Morphological information
– Verbs with different stems in Hebrew tend to have different senses
– Not used because the stem distinctions do not always correspond well with sense distinctions
Editing Options
The data shown here has not been manually edited, but it can be edited using the tool:
• Merge sense groups
• Split a sense group
• Move an instance from one sense group to another
• Use of manual information in automatic learning
Applications
• Sense-based translation memory
• Sense-based concordance
• Sense-based consistency check
Advantages of the Current Approach
• Efficiency: a sense dictionary that lists not only the senses but also the specific instances of each sense can be built in a matter of days.
• Objectivity: the results are based on actual data, and no preconceived subjective categorization is required.
• Flexibility: the granularity of sense divisions can be adjusted through the values of the similarity metrics used in the clustering process.