
Abstract - SourceForge: multiword.sourceforge.net/download/Presentations_MWE2010/...

Feb 17, 2021


dariahiddleston
Transcript
  • Abstract
    - Identification of reduplication is a subtask of multi-word expression (MWE) identification.
    - Reduplication is a very productive process at both the grammatical and the semantic level in Bengali.
    - Here, reduplications are identified in a Bengali corpus of writings by the noted Indian Nobel laureate Rabindranath Tagore.
    - A rule-based approach is used, consisting of two phases: identification of reduplication and semantic analysis.

  • What is Reduplication?
    - Repetition of any linguistic unit, such as a phoneme, morpheme, word, phrase, clause, or the utterance as a whole.
      Example in English: ha-ha, blah-blah.
      Example in Bengali: abal-tabal (incoherent).
    - Bengali is the richest Indian language in the onomatopoeic and ideophonic category of reduplication, with 2400 words (Chaudhuri et al., 2005).
    - Reduplication carries various semantic meanings and helps to identify the mental state of the speaker.
    - Two coarse-grained categories:
      (a) repetition at the expression level;
      (b) repetition at the content or semantic (sense) level.

  • General Classification
    - Onomatopoeic expression: khat khat (knock knock)
    - Complete reduplication: bara-bara (big big)
    - Partial reduplication: thakur-thukur (God)
    - Semantic reduplication:
      Synonym: matha-mundu (head)
      Antonym: din-rat (day and night)
      Class representative: cha-paani (snacks)
    - Correlative reduplication: maramari (fighting)

  • Expression Level Classification
    - Non-sound symbolic words
      Nouns and pronouns: bari bari (one house to other)
      Adjectives: lal lal phul (red flowers)
      Verbs: bolte bolte (speaking) [mandatory]; bhebe chinte (thinking) [optional]
      Adverbs: dhere dhere (slowly)
    - Sound words: chal chal (sound of water falling)

  • Sense Level Reduplication
    - Sense of repetition: bachar bachar (every year)
    - Sense of plurality: bara bara bari (many big houses)
    - Sense of emphatic meaning: lal-lal phul (deep red rose)
    - Sense of completion: kheye deye jabo (after eating)
    - Sense of hesitation or softness: hasi-hasi mukh (laughing face)
    - Sense of incompleteness of the verbs: kotha bolte bolte (talking about)
    - Sense of corresponding correlative words: maramari (fighting)
    - Sense of onomatopoeia: khat khat (knock knock)

  • System Design
    Phase 1: Identifying reduplications
      Identify the five main cases of reduplication: onomatopoeic, complete, partial, semantic, and correlative.
    Phase 2: Semantic analysis
      Extract the associated meaning (semantics), that is, the sense-level reduplication categories.

  • Phase 1: System Architecture
    [Diagram: Bengali Corpus -> Tokenizer -> Rule-based Identifier -> Classifier, with the Set of Inflections and the Dictionary as supporting resources]

  • Components of the Architecture
    - Corpus: articles (novels, stories, dramas) of Rabindranath Tagore [http://www.rabindra-rachanabali.nltr.org]
    - Tokenizer: separates words on blank space or special symbols (such as the hyphen or exclamation mark) to identify two consecutive words.
    - Rule-based Identifier: consecutive tokens are passed to it to verify, using different algorithms, whether they are reduplicated words.
    - Classifier: classifies reduplications at the expression level.
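
The tokenizer step described above can be sketched as follows. This is a minimal illustration on romanized text, not the authors' implementation: the separator set `SEPARATORS` and the helper names `tokenize` and `consecutive_pairs` are assumptions, since the slides only mention "blank space or special symbols".

```python
import re

# Assumed separator set: whitespace, hyphen, common punctuation, and the
# danda (U+0964) used as a full stop in Bengali text.
SEPARATORS = re.compile(r"[\s\-!?,;:.\u0964]+")


def tokenize(text):
    """Split text into word tokens on whitespace and separator symbols."""
    return [tok for tok in SEPARATORS.split(text) if tok]


def consecutive_pairs(tokens):
    """Pair each token with its immediate successor."""
    return list(zip(tokens, tokens[1:]))


if __name__ == "__main__":
    tokens = tokenize("lal lal phul, din-rat!")
    for first, second in consecutive_pairs(tokens):
        print(first, second)
```

Each pair produced this way is what a rule-based identifier would then inspect. Splitting on the hyphen discards the information about whether a candidate was hyphenated, so a fuller version would probably record the separator as well.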

  • Components of the Architecture
    - Dictionary: includes the lexicon and the associated semantics. The system uses both Bengali-to-Bengali (monolingual) and Bengali-to-English (bilingual) dictionaries.
    - Set of inflections:

    0(����), �(-�, -), -�(-��), -��, -��(-���), -�, -��(��), ���, -���, -��, -�, -����, -�, -�

  • Brief Classification Algorithms
    - Complete: the two words are compared for complete equality.
    - Partial: three cases: (i) change of the first vowel attached to the first consonant, (ii) change of the consonant itself in the first position, or (iii) change of both the matra (vowel sign) and the consonant.
      Exception: abal-tabal (incoherent). [Solution: only four consonants can be produced by the change (S. K. Chattopadhyay, 1992).]
    - Onomatopoeic: after removing the inflection, the word is divided into two equal halves and the halves are compared.
    - Correlative: two formative affixes are added to the root to form the first and second words respectively, and the results are agglutinated.
    - Semantic: a dictionary-based approach using the set of inflections mentioned above.
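
The rule checks listed above can be approximated on romanized strings as in the sketch below. Everything here is a simplification under stated assumptions: the real system works on Bengali script, the suffix tuple `INFLECTIONS` is a hypothetical stand-in for the set of inflections, the affixes `-a` and `-i` are read off the example maramari, and the consonant restriction for the abal-tabal exception is not implemented.

```python
# Hypothetical romanized stand-ins, for illustration only.
INFLECTIONS = ("er", "e")
VOWELS = set("aeiou")


def is_complete(w1, w2):
    """Complete reduplication: both words are identical."""
    return w1 == w2


def strip_inflection(word):
    """Remove the longest matching assumed suffix, if any."""
    for suffix in sorted(INFLECTIONS, key=len, reverse=True):
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)]
    return word


def is_onomatopoeic_closed(word):
    """Strip the inflection, split the stem in half, compare the halves."""
    stem = strip_inflection(word)
    half, odd = divmod(len(stem), 2)
    return odd == 0 and half > 0 and stem[:half] == stem[half:]


def split_onset(word):
    """Return (initial consonants, first vowel, remainder) of a word."""
    i = 0
    while i < len(word) and word[i] not in VOWELS:
        i += 1
    return word[:i], word[i:i + 1], word[i + 1:]


def is_partial(w1, w2):
    """Words differ only in the first consonant and/or the first vowel."""
    if w1 == w2:
        return False
    onset1, vowel1, rest1 = split_onset(w1)
    onset2, vowel2, rest2 = split_onset(w2)
    return bool(rest1) and rest1 == rest2 and (onset1, vowel1) != (onset2, vowel2)


def is_correlative(word, first_affix="a", second_affix="i"):
    """Closed form root + first_affix + root + second_affix (assumed affixes)."""
    if not word.endswith(second_affix):
        return False
    body = word[: -len(second_affix)]
    root_len, odd = divmod(len(body) - len(first_affix), 2)
    if odd or root_len <= 0:
        return False
    root = body[:root_len]
    return body == root + first_affix + root


if __name__ == "__main__":
    print(is_complete("bara", "bara"))
    print(is_partial("thakur", "thukur"))
    print(is_onomatopoeic_closed("khatkhat"))
    print(is_correlative("maramari"))
```

The semantic case is left out because it depends on dictionary lookups rather than string comparison.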

  • Phase 2: Semantic (Sense) Analysis
    - Correspondence between general and sense-level reduplications:

      Reduplication        Semantics (sense)
      Onomatopoeic         Onomatopoeia
      Semantic or partial  Completion
      Correlative word     Corresponding correlative words
      Complete             Repetition / hesitation, softness

    - Problem for sense disambiguation of complete reduplication: multiple senses are possible depending on the context.
    - The system identifies some related words, such as kara (to do), bhaba (to think), mato (like), and laga (to feel), for disambiguation.
    - These are not enough to disambiguate the sense of the phrase.
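
The correspondence between reduplication classes and senses can be held in a lookup table, as in this minimal sketch. The names `SENSE_BY_CLASS`, `CUE_WORDS`, `candidate_senses`, `needs_context`, and `has_cue_word` are hypothetical. The slides do not say how a cue word selects a sense, so the sketch only reports whether a cue word is present.

```python
# Class-to-sense correspondence as listed on the slide.
SENSE_BY_CLASS = {
    "onomatopoeic": ["onomatopoeia"],
    "semantic": ["completion"],
    "partial": ["completion"],
    "correlative": ["corresponding correlative words"],
    "complete": ["repetition", "hesitation or softness"],
}

# Romanized related words mentioned for disambiguation.
CUE_WORDS = {"kara", "bhaba", "mato", "laga"}


def candidate_senses(reduplication_class):
    """Return the senses associated with a reduplication class."""
    return SENSE_BY_CLASS[reduplication_class]


def needs_context(reduplication_class):
    """A class with more than one candidate sense needs disambiguation."""
    return len(SENSE_BY_CLASS[reduplication_class]) > 1


def has_cue_word(context_tokens):
    """Report whether any listed related word occurs in the context."""
    return any(token in CUE_WORDS for token in context_tokens)


if __name__ == "__main__":
    print(candidate_senses("complete"))
    print(needs_context("complete"), needs_context("correlative"))
    print(has_cue_word(["hasi", "hasi", "mukh", "kara"]))
```

Only the complete class maps to more than one sense here, which matches the slide's remark that complete reduplication is the hard case.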

  • Experimental Results
    - The collected corpus includes 14,810 tokens for 3,675 distinct word forms at the root level.
    - Metrics:
      IR metrics: precision, recall, F-score.
      Frequency measurements of each class.
      Hyphen and closed-form count.
    - Evaluation:

      Reduplication  Precision (%)  Recall (%)  F-score (%)
      Onomatopoeic   99.85          99.77       99.79
      Complete       99.98          99.92       99.95
      Partial        79.15          75.80       77.44
      Semantic       85.20          82.26       83.71
      Correlative    99.91          99.73       99.82
      System         92.82          91.50       92.15
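
For reference, the F-score can be derived from precision and recall as sketched below. The sketch assumes the balanced F-score (harmonic mean of precision and recall), which the slides do not state explicitly, and the function name `f_score` is hypothetical. The Partial and Complete rows serve as input.

```python
def f_score(precision, recall):
    """Balanced F-score: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    rows = {"Partial": (79.15, 75.80), "Complete": (99.98, 99.92)}
    for name, (precision, recall) in rows.items():
        print(name, round(f_score(precision, recall), 2))
```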

  • Error Analysis
    [Chart: precision, recall, and F-score, vertical axis 0 to 100]
    - Partial and semantic evaluation scores are not satisfactory because of some wrong tagging by the shallow parser.
    - In some synonymous reduplications (dhire-susthe, slowly and steadily), the second word carries a sense similar to that of the first word but is not its exact synonym. These words are not identified properly due to the lack of Bengali lexical resources such as WordNet.

  • Frequency Hypothesis
    - Frequency is an important indication of whether a compound is an MWE.
    - 8.52% of reduplications are hyphenated.
    - 33.09% of reduplications are closed; most of these are onomatopoeic, correlative, and semantic reduplications.
    - 100% of correlative reduplications and most onomatopoeic reduplications are closed.
    [Chart: Frequency Analysis by class (Onomatopoeic, Complete, Partial, Semantic, Correlative); values shown: 8.51, 51.06, 26.6, 12.7, 18.08]
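
The hyphen and closed-form counts can be computed along these lines. This is a sketch under assumptions: the label "open" for space-separated forms and the helper names `surface_form` and `form_percentages` are not from the slides, and the sample list is made up for illustration.

```python
from collections import Counter


def surface_form(expression):
    """Label a reduplication as hyphenated, open (spaced), or closed."""
    if "-" in expression:
        return "hyphenated"
    if any(ch.isspace() for ch in expression):
        return "open"
    return "closed"


def form_percentages(expressions):
    """Return the percentage of expressions in each surface form."""
    counts = Counter(surface_form(e) for e in expressions)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {form: 100.0 * n / total for form, n in counts.items()}


if __name__ == "__main__":
    sample = ["din-rat", "maramari", "khat khat", "bara-bara"]
    print(form_percentages(sample))
```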

  • Conclusion
    - Reduplication is mainly used for emphasis, generality, intensity, or to show continuation of an act.
    - The semantics of reduplicated words require some form of sense disambiguation that cannot be captured by rule-based analysis alone.
    - Further research: stylometric analysis of authors and plagiarism detection.