
Segmentation in Sanskrit texts

Apr 12, 2017



Amrith Krishna
Page 1: Segmentation in Sanskrit texts
Page 2: Segmentation in Sanskrit texts

देहिनोऽस्मिन्यथा देहे कौमारं यौवनं जरा । तथा देहान्तरप्राप्तिर्धीरस्तत्र न मुह्यति

देहिनः अस्मिन् यथा देहे कौमारं यौवनं जरा तथा देहान्तर प्राप्तिः धीरः तत्र न मुह्यति

Page 3: Segmentation in Sanskrit texts
Page 4: Segmentation in Sanskrit texts

तथा देहान्तरप्राप्तिर्धीरस्तत्र न मुह्यति

तथा देहान्तर प्राप्तिः धीरः तत्र न मुह्यति

Page 5: Segmentation in Sanskrit texts

तथा देहान्तरप्राप्तिर्धीरस्तत्र न मुह्यति

Page 6: Segmentation in Sanskrit texts


Page 7: Segmentation in Sanskrit texts

तथा देहान्तरप्राप्तिर्धीरस्तत्र न मुह्यति

रामरामेभ्यः राममय

witi

PMI Matrix of the un-segmentable token lemmas
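As a rough illustration of how such a PMI (pointwise mutual information) matrix could be built, the sketch below counts lemma co-occurrence within a sentence. The slides do not say how co-occurrence is counted, so the sentence-level window, the function name `pmi_matrix`, and the toy lemmas are all assumptions.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_matrix(sentences):
    """Return {(a, b): PMI} for lemma pairs that co-occur in a sentence.

    Assumption: two lemmas co-occur when they appear in the same sentence.
    """
    n = len(sentences)
    single = Counter()  # number of sentences containing each lemma
    pair = Counter()    # number of sentences containing both lemmas
    for sent in sentences:
        lemmas = sorted(set(sent))
        single.update(lemmas)
        pair.update(combinations(lemmas, 2))

    pmi = {}
    for (a, b), c_ab in pair.items():
        p_ab = c_ab / n
        p_a, p_b = single[a] / n, single[b] / n
        value = math.log2(p_ab / (p_a * p_b))
        pmi[(a, b)] = pmi[(b, a)] = value  # PMI is symmetric
    return pmi


# Toy corpus of lemmatised sentences (placeholder lemmas, not real data).
toy_corpus = [["a", "b"], ["a", "b"], ["c", "d"], ["a", "c"]]
toy_pmi = pmi_matrix(toy_corpus)
```

Pairs that never co-occur are simply absent from the result; a real system would need to decide how to treat them (for example, with smoothing).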

P(w1,w2,w3,w4) = P(w1 | <s>)P(w2|w1)P(w3|w2)P(w4|w3)P(</s>|w4)
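The factorisation above is a bigram language model with sentence boundary markers. A minimal sketch of estimating and applying it from counts follows; the add-alpha smoothing and the names `train_bigram` and `sentence_prob` are assumptions for illustration and are not taken from the slides.

```python
from collections import Counter


def train_bigram(sentences):
    """Count bigram contexts and bigrams, with <s> and </s> markers."""
    contexts, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + list(sent) + ["</s>"]
        contexts.update(tokens[:-1])          # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))
    return contexts, bigrams


def sentence_prob(sent, contexts, bigrams, vocab_size, alpha=1.0):
    """P(w1..wn) = P(w1|<s>) * P(w2|w1) * ... * P(</s>|wn).

    Assumption: add-alpha smoothing; alpha=0 gives the raw relative frequency
    (and fails on unseen contexts).
    """
    tokens = ["<s>"] + list(sent) + ["</s>"]
    prob = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        prob *= (bigrams[(prev, cur)] + alpha) / (contexts[prev] + alpha * vocab_size)
    return prob


# Toy corpus with placeholder words.
contexts, bigrams = train_bigram([["a", "b"], ["a", "c"]])
p_unsmoothed = sentence_prob(["a", "b"], contexts, bigrams, vocab_size=4, alpha=0.0)
p_smoothed = sentence_prob(["a", "b"], contexts, bigrams, vocab_size=4, alpha=1.0)
```

For long sentences, summing log-probabilities instead of multiplying avoids floating-point underflow.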

Page 8: Segmentation in Sanskrit texts

Set (size in sentences)    Micro accuracy    Macro accuracy
Training set (1700)        87.76 %           92.56 %
Testing set (150)          87.82 %           93.56 %
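The slides do not define the two measures. Under one common reading, which is an assumption here, micro accuracy pools all words across sentences while macro accuracy averages the per-sentence accuracies. The sketch below computes both on placeholder data; `micro_macro_accuracy` is a hypothetical helper name.

```python
from collections import Counter


def micro_macro_accuracy(predicted, gold):
    """Assumed definitions:
    micro = correctly predicted words / all gold words (pooled),
    macro = mean of per-sentence (correct words / gold words).
    """
    total_correct = total_words = 0
    per_sentence = []
    for pred, ref in zip(predicted, gold):
        # Multiset overlap between predicted and gold segments.
        correct = sum((Counter(pred) & Counter(ref)).values())
        total_correct += correct
        total_words += len(ref)
        per_sentence.append(correct / len(ref))
    micro = total_correct / total_words
    macro = sum(per_sentence) / len(per_sentence)
    return micro, macro


# Placeholder segmentations, not results from the slides.
toy_pred = [["a", "b", "c", "d"], ["x"]]
toy_gold = [["a", "b", "c", "e"], ["x"]]
toy_micro, toy_macro = micro_macro_accuracy(toy_pred, toy_gold)
```

Under these definitions, short sentences that are segmented perfectly pull the macro score above the micro score.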

Page 9: Segmentation in Sanskrit texts

• Treat the problem as a query expansion problem.
• Start with the unsegmented tokens.
• At each step, a new candidate word is selected and added to the query.
• The query expansion iterates until a complete sentence is output.

[Figure: chunks c1, c2, c3, c4, with a column of candidate words w1, w2, ..., wk listed for a chunk.]

S = c1 + c2 + c3 + c4

C2 = Set of wi, which are candidates for semantically correct segmentation.

Similarly for c2 and c3
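The query expansion loop described on this slide can be sketched as a greedy procedure. This is a simplified illustration, not the authors' implementation: it assumes exactly one word is chosen per chunk, and the `score` callable, which rates a candidate against the words already in the query, is hypothetical.

```python
def expand_query(chunks, candidates, score):
    """Greedy query expansion over chunks.

    chunks:     chunk ids in sentence order, e.g. ["c1", "c2"]
    candidates: dict mapping each chunk id to its candidate words
    score:      hypothetical callable (query_words, candidate_word) -> float
    Simplifying assumption: exactly one word is selected per chunk.
    """
    query = []        # words selected so far
    chosen = {}       # chunk id -> selected word
    remaining = set(chunks)
    while remaining:
        # Pick the best-scoring candidate among all unresolved chunks.
        _, chunk, word = max(
            ((score(query, w), c, w) for c in sorted(remaining) for w in candidates[c]),
            key=lambda item: item[0],
        )
        chosen[chunk] = word
        query.append(word)
        remaining.remove(chunk)
    # Output the sentence in the original chunk order.
    return [chosen[c] for c in chunks]


# Toy scoring function with made-up values.
toy_base = {"a": 0.1, "b": 0.5, "x": 0.2, "y": 0.3}
toy_bonus = {("b", "x"): 1.0}


def toy_score(query_words, word):
    return toy_base[word] + sum(toy_bonus.get((q, word), 0.0) for q in query_words)


toy_sentence = expand_query(
    ["c1", "c2"], {"c1": ["a", "b"], "c2": ["x", "y"]}, toy_score
)
```

Because each selection changes the query, later choices depend on earlier ones: here "x" wins for c2 only because "b" was already selected for c1.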

Page 10: Segmentation in Sanskrit texts


Page 13: Segmentation in Sanskrit texts

• From the query nodes, reach the most promising candidate word nodes.
• Perform multiple personalised random walks.
• Edge weights accommodate heterogeneous information.
• Learn a weight for each random walk approach (path) by supervised methods.
• The weighted sum over all the random walk methods gives the most suitable candidate.
• PS: We use 4 lakh (400,000) tagged sentences from the Digital Corpus of Sanskrit.

Language Model (LM) with word lemmas

LM with morphological types

Verb specific Expectancy

Compound word formation patterns
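To make the idea of a personalised random walk concrete, here is a generic random walk with restart computed by power iteration on a small weighted graph. It is a sketch under stated assumptions: the restart probability, the iteration count, the graph layout, and the function name are illustrative choices, and the edge weights stand in for any one of the four information sources listed above.

```python
def personalised_random_walk(graph, query_nodes, restart=0.15, iterations=50):
    """Random walk with restart to the query nodes.

    graph: dict node -> {neighbour: edge weight}; every neighbour must also
           be a key of graph (use an empty dict for nodes with no out-edges).
    Returns a dict node -> visiting probability.
    """
    nodes = list(graph)
    restart_dist = {
        n: (1.0 / len(query_nodes) if n in query_nodes else 0.0) for n in nodes
    }
    rank = dict(restart_dist)
    for _ in range(iterations):
        nxt = {n: restart * restart_dist[n] for n in nodes}
        for src in nodes:
            out = graph[src]
            total = sum(out.values())
            if total == 0:
                # Dangling node: send its walking mass back to the query nodes.
                for n in nodes:
                    nxt[n] += (1 - restart) * rank[src] * restart_dist[n]
                continue
            for dst, weight in out.items():
                nxt[dst] += (1 - restart) * rank[src] * weight / total
        rank = nxt
    return rank


# Toy graph: one query node "q" and two candidate word nodes.
toy_graph = {"q": {"w1": 3.0, "w2": 1.0}, "w1": {"q": 1.0}, "w2": {"q": 1.0}}
toy_rank = personalised_random_walk(toy_graph, {"q"})
```

The candidate with the heavier edge from the query node ("w1") ends up with the higher visiting probability, which is the sense in which the walk surfaces promising candidates.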

Page 14: Segmentation in Sanskrit texts

Language Model with words - LMw

LM with morphological types - LMt

Verb specific Expectancy – ViE

Compound word formation patterns

PCRW (Path Constrained Random Walks): Unifying Framework

• Handle free word order.
• Incorporate heterogeneous types of information.
• Bonus: form different relational paths (up to length l) by combining individual edge weights.
• For l = 3, some sample paths that can be formed by combination:
• LMw -> LMt -> LMw
• LMt -> V1E -> LMt
• LMt -> VkE -> LMt
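The path construction and the weighted combination can be sketched as follows. This is an illustrative reading of path-constrained random walks, not the authors' code: the graph layout, the helper names, and the hand-set path weights (which would be learned by supervised training in the actual system) are all assumptions.

```python
from itertools import product


def enumerate_paths(edge_types, max_len):
    """All sequences of edge types of length 1..max_len."""
    return [p for n in range(1, max_len + 1) for p in product(edge_types, repeat=n)]


def path_probability(graph, start, path):
    """Distribution over nodes reached from start by following path.

    graph: dict edge_type -> {node: {neighbour: weight}} (assumed layout).
    At each step the walker may only use edges of the type the path dictates.
    """
    dist = {start: 1.0}
    for edge_type in path:
        nxt = {}
        for node, mass in dist.items():
            out = graph.get(edge_type, {}).get(node, {})
            total = sum(out.values())
            if total == 0:
                continue  # no edge of this type: the walk dies here
            for dst, weight in out.items():
                nxt[dst] = nxt.get(dst, 0.0) + mass * weight / total
        dist = nxt
    return dist


def score_candidates(graph, start, path_weights):
    """Weighted sum of path probabilities; path_weights maps path -> weight."""
    scores = {}
    for path, theta in path_weights.items():
        for node, prob in path_probability(graph, start, path).items():
            scores[node] = scores.get(node, 0.0) + theta * prob
    return scores


all_paths = enumerate_paths(["LMw", "LMt", "ViE"], 3)

# Toy graph and hand-set weights (placeholders, not learned values).
toy_edges = {"LMw": {"q": {"a": 1.0, "b": 3.0}}, "LMt": {"q": {"a": 1.0}}}
toy_scores = score_candidates(toy_edges, "q", {("LMw",): 0.5, ("LMt",): 0.5})
```

With three edge types and l = 3 there are 3 + 9 + 27 = 39 candidate paths, so the supervised step has one weight to learn per path.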

Page 15: Segmentation in Sanskrit texts