Pronominal Anaphora resolution

07/31/12 Pronominal Anaphora Resolution

Final Year Project on

Pronominal Anaphora Resolution in Nepali Language

by

Dev Bahadur Poudel (03314)

Bivod Aale Magar (03307)

Nepal Engineering College

Changunarayan, Bhaktapur


Contents

Brief Introduction and Background

Approach to Algorithm

Implementation in Nepali Discourse

Overview of our system

Scope of our system

Conclusion


What is Anaphora?

Reference to an entity that has been previously introduced in the discourse.


What is Anaphora Resolution?

Process of determining the antecedent of an anaphor.


Anaphor resolution in Nepali

राम स्कूल जान्छ । ऊ घर फर्कन्छ । (Ram goes to school. He returns home.)

Antecedent: राम (Ram)

Anaphor: ऊ (he)

ऊ = राम


Can a machine resolve anaphora?

A human reader can easily tell which referent an anaphor points to.

Can we build a system that resolves anaphors to their antecedents?

Corpus

A collection of linguistic data, either written texts or transcriptions of recorded speech, which can be used as a starting point for linguistic description or as a means of verifying hypotheses about a language.


Unicode

An industry standard that allows computers to represent and manipulate text consistently.

Consists of more than 100,000 characters, a set of code charts for visual reference, an encoding methodology, and a set of standard character encodings.

Unlike ASCII, which uses 7 bits per character, Unicode is not a fixed 16-bit code. It defines a code space of 1,114,112 code points (U+0000 to U+10FFFF), stored using encodings such as UTF-8, UTF-16, and UTF-32. Devanagari, the script used for Nepali, occupies the block U+0900 to U+097F.
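As an illustration of working with Unicode text, the following minimal Java sketch checks whether a word is written entirely in Devanagari and counts its code points. It is not taken from the project's code: the class name `DevanagariCheck` and its method are hypothetical, while `Character.UnicodeScript` is part of the standard library from Java 7 onward. String literals are written as Unicode escapes so the file compiles under any source encoding.

```java
// Hypothetical helper for illustration; not part of the original system.
class DevanagariCheck {

    // True when every code point of the text belongs to the Devanagari script.
    static boolean isDevanagari(String text) {
        if (text.isEmpty()) {
            return false;
        }
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (Character.UnicodeScript.of(cp) != Character.UnicodeScript.DEVANAGARI) {
                return false;
            }
            i += Character.charCount(cp);
        }
        return true;
    }

    public static void main(String[] args) {
        String ram = "\u0930\u093E\u092E"; // the name "Ram" in Devanagari
        System.out.println(isDevanagari(ram));
        System.out.println(ram.codePointCount(0, ram.length()));
    }
}
```

Note that the danda sentence terminator (U+0964) is classified as a common-script character, so a check like this is meant for individual words, not whole sentences.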



Approach to the Algorithm

Non-Probabilistic
- Lappin and Leass Algorithm (1994)
- A Tree Search Algorithm, Hobbs (1978)

Probabilistic
- Centering Algorithm
- Mitkov's weak knowledge algorithm


Approach to the Algorithm

Lappin and Leass Algorithm (1994)

An algorithm based on the salience factors assigned to nouns and pronouns.


Salience factors in Lappin and Leass's Algorithm

Factor                                            Weight
Sentence recency                                  100
Subject emphasis                                  80
Existential emphasis                              70
Accusative (direct object) emphasis               50
Indirect object and oblique complement emphasis   40
Non-adverbial emphasis                            50
Head noun emphasis                                80
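The weights can be held in a small lookup structure and summed per noun phrase. The Java sketch below is one possible encoding, assuming an enum named `Factor` and a helper `score` (both hypothetical names, not taken from the project's code); only the weights come from the slide.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical encoding of the salience weights; names are illustrative.
class SalienceDemo {

    enum Factor {
        SENTENCE_RECENCY(100),
        SUBJECT(80),
        EXISTENTIAL(70),
        DIRECT_OBJECT(50),
        INDIRECT_OBJECT(40),
        NON_ADVERBIAL(50),
        HEAD_NOUN(80);

        final int weight;

        Factor(int weight) {
            this.weight = weight;
        }
    }

    // Sum of the weights of all factors that apply to one noun phrase.
    static int score(Set<Factor> factors) {
        int total = 0;
        for (Factor f : factors) {
            total += f.weight;
        }
        return total;
    }

    public static void main(String[] args) {
        // A subject noun in the current sentence: 100 + 80 + 50 + 80.
        Set<Factor> subjectNoun = EnumSet.of(
                Factor.SENTENCE_RECENCY, Factor.SUBJECT,
                Factor.NON_ADVERBIAL, Factor.HEAD_NOUN);
        System.out.println(score(subjectNoun)); // 310
    }
}
```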


Implementation

Can be implemented in different languages, such as Java or PHP.

Our system uses Java.


Block Diagram of the system

Input -> Tokenizer and Tagger -> Salience Factor Assigner -> Output


Flowchart

START
Input paragraph
Take a sentence
Tokenize
Take a token
Check in corpus (yes: continue / no: log error)
Classify as noun or pronoun
Classify as subject/object
Give salience value
Calculate total weights
Determine correct referents
Next sentence? (yes: halve the salience values and take the next sentence / no: display results)
Display results
END
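The tokenizing and corpus-lookup steps might look like the following Java sketch. It is an assumption-laden illustration, not the project's code: the class `TokenTagger`, the two lexicon entries, and the category labels are all hypothetical, since the slides do not show the corpus format. Nepali strings are written as Unicode escapes.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical tokenizer and dictionary-based tagger; names are illustrative.
class TokenTagger {

    // Stand-in for the hand-annotated corpus: word form -> category.
    static final Map<String, String> LEXICON = new HashMap<>();
    static {
        LEXICON.put("\u0930\u093E\u092E", "NOUN"); // "Ram"
        LEXICON.put("\u090A", "PRONOUN");          // "u" (he/she)
    }

    // Replace the danda sentence terminator with a space, then split on whitespace.
    static List<String> tokenize(String sentence) {
        List<String> tokens = new ArrayList<>();
        String cleaned = sentence.replace('\u0964', ' ').trim();
        if (cleaned.isEmpty()) {
            return tokens;
        }
        for (String t : cleaned.split("\\s+")) {
            tokens.add(t);
        }
        return tokens;
    }

    // Category from the lexicon, or "UNKNOWN" for a token the corpus lacks.
    static String tag(String token) {
        return LEXICON.getOrDefault(token, "UNKNOWN");
    }

    public static void main(String[] args) {
        for (String token : tokenize("\u0930\u093E\u092E \u0918\u0921\u0940 \u0964")) {
            System.out.println(token + " -> " + tag(token));
        }
    }
}
```

A token tagged `UNKNOWN` here corresponds to the point where the flowchart logs an error.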

User Interface



An Example in Nepali

1. राम घडी किन्न चाहन्छ । (Ram wants to buy a watch.)

2. हरिले त्यो पसलमा देख्यो । (Hari saw it in the shop.)

3. उसले उसलाई देखायो । (He showed it to him.)


1. राम घडी किन्न चाहन्छ । (Ram wants to buy a watch.)

Decrease the salience values by a factor of 2 when reading the next sentence.


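2. हरिले त्यो पसलमा देख्यो । (Hari saw it in the shop.)

हरि (Hari) gets 310 (recency: 100 + subject: 80 + non-adverbial: 50 + head noun: 80).

त्यो (that, it) gets 280 (recency: 100 + direct object: 50 + non-adverbial: 50 + head noun: 80).

त्यो is resolved to घडी (watch) because of the high salience value of घडी.

पसल (shop) gets 230 (recency: 100 + non-adverbial: 50 + head noun: 80).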


Updated Discourse Model

Divide the previous salience factors by two
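The halving step can be sketched in Java as follows. The class `DiscourseModel` and its methods are hypothetical names chosen for illustration; the values 310 and 230 are the ones the slides give for Hari and the shop in sentence 2.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical discourse model; not the project's actual data structure.
class DiscourseModel {

    // Entity name -> accumulated salience value.
    final Map<String, Double> salience = new LinkedHashMap<>();

    // Add the salience earned by an entity in the current sentence.
    void add(String entity, double value) {
        salience.merge(entity, value, Double::sum);
    }

    // Called before reading the next sentence: every stored value is halved.
    void decay() {
        salience.replaceAll((entity, value) -> value / 2.0);
    }

    public static void main(String[] args) {
        DiscourseModel model = new DiscourseModel();
        model.add("Hari", 310);
        model.add("pasal", 230);
        model.decay();
        System.out.println(model.salience); // {Hari=155.0, pasal=115.0}
    }
}
```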


3. उसले उसलाई देखायो । (He showed it to him.)

उसले (he, ergative) is resolved to हरि (Hari) because of Hari's high salience value. Salience added: 310 (recency: 100 + subject: 80 + non-adverbial: 50 + head noun: 80).

उसलाई (him, dative) cannot be हरि because of syntactic constraints, so it is resolved to राम (Ram). Salience added: 270 (recency: 100 + indirect object: 40 + non-adverbial: 50 + head noun: 80).
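Picking the antecedent for the two pronouns in sentence 3 can be sketched in Java as below. Several things here are assumptions rather than facts from the slides: the class `ReferentPicker` is hypothetical, the syntactic constraint is reduced to a simple exclusion set, non-human nouns are assumed to have been filtered out already, and Ram's value assumes he earned 310 as the subject of sentence 1 (the slides do not state it) and was then halved twice. Hari's value is the 310 from sentence 2 halved once.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical resolver: highest-salience candidate that is not excluded.
class ReferentPicker {

    // Returns the best remaining candidate, or null if all are excluded.
    static String resolve(Map<String, Double> candidates, Set<String> excluded) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> e : candidates.entrySet()) {
            if (excluded.contains(e.getKey())) {
                continue;
            }
            if (e.getValue() > bestScore) {
                best = e.getKey();
                bestScore = e.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, Double> candidates = new LinkedHashMap<>();
        candidates.put("Ram", 77.5);   // assumed 310 in sentence 1, halved twice
        candidates.put("Hari", 155.0); // 310 in sentence 2, halved once

        String usle = resolve(candidates, Collections.<String>emptySet());
        // Simplified constraint: the second pronoun may not reuse the first one's antecedent.
        String uslai = resolve(candidates, Collections.singleton(usle));
        System.out.println(usle + " " + uslai); // Hari Ram
    }
}
```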

Updated Discourse Model

Result


Paragraph     Total     Total        Total     Correctly  Incorrectly  Zero      Efficiency
using         samples   antecedents  anaphors  resolved   resolved     anaphors
2-sentence    15        37           22        15         7            0         68%
3-sentence    15        50           37        28         9            0         75%
4-sentence    10        35           35        22         11           2         62.8%
5-sentence    10        43           41        25         14           2         60.9%
> 5-sentence  5         28           31        17         11           3         54%
Total         55        193          166       107        52           7         64%
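The Efficiency column appears to be the share of anaphors that were resolved correctly, since the reported figures match correctly resolved divided by total anaphors for every row. A minimal Java sketch of that calculation, with a hypothetical class name `Efficiency`:

```java
// Hypothetical helper; the formula is inferred from the figures in the table.
class Efficiency {

    // Percentage of anaphors that were resolved correctly.
    static double percent(int correct, int totalAnaphors) {
        if (totalAnaphors <= 0) {
            throw new IllegalArgumentException("totalAnaphors must be positive");
        }
        return 100.0 * correct / totalAnaphors;
    }

    public static void main(String[] args) {
        System.out.println(percent(15, 22));   // 2-sentence paragraphs
        System.out.println(percent(107, 166)); // all samples
    }
}
```

For the Total row this gives 107 / 166, roughly 64.46, in line with the 64% reported.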


Scope of the Project
- Natural language processing
- Question answering
- Text summarization
- Information extraction
- Interaction with query interfaces and dialogue interpretation
- Natural language generation

Limitations

The lack of a tagger and parser for Nepali keeps the system from scaling to a large corpus, so we had to rely on a hand-annotated corpus.

Sentences are limited to the words defined in our corpus.

The system handles only non-reflexive third-person pronouns.


Further Works

Morphological analysis can be added.

The system can be extended to work on a larger number of sentences.

This project can be combined with other NLP projects in the Nepali language for further research.

Statistical methods can be applied to achieve higher efficiency.



Conclusion

This was research to see how a basic approach like Lappin and Leass performs for the Nepali language.

It applies to non-reflexive third-person pronouns.

Anaphora resolution is an emerging concept for the Nepali language.

Understanding discourse is challenging for computer intelligence.

Without a tagger and parser, our system is heavily dictionary dependent.

Our work can aid future research in the Nepali language.


Thank You.
