Segmentation in Sanskrit texts

देहिनोऽस्मिन्यथा देहे कौमारं यौवनं जरा। तथा देहान्तरप्राप्तिर्धीरस्तत्र न मुह्यति

देहिनः अस्मिन् यथा देहे कौमारं यौवनं जरा तथा देहान्तर प्राप्तिः धीरः तत्र न मुह्यति

तथा देहान्तरप्राप्तिर्धीरस्तत्र न मुह्यति

तथा देहान्तर प्राप्तिः धीरः तत्र न मुह्यति


रािरािेभ्यः रािमय

witi

PMI Matrix of the un-segmentable token lemmas
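As a rough illustration of how such a PMI matrix could be computed, the sketch below counts sentence-level co-occurrence of lemmas. The function name `pmi_matrix`, the choice of the sentence as the co-occurrence window, and the toy lemmas are all assumptions made for illustration; the slides do not state how the matrix was actually built.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_matrix(sentences):
    """Return {(a, b): PMI} for lemma pairs that co-occur in a sentence.

    sentences: list of lists of lemmas (hypothetical input format).
    Probabilities are estimated as the fraction of sentences that
    contain the lemma (or the pair); pairs are keyed in sorted order.
    """
    n = len(sentences)
    single = Counter()
    pair = Counter()
    for sent in sentences:
        lemmas = sorted(set(sent))          # count each lemma once per sentence
        single.update(lemmas)
        pair.update(combinations(lemmas, 2))
    pmi = {}
    for (a, b), count in pair.items():
        p_ab = count / n
        p_a = single[a] / n
        p_b = single[b] / n
        pmi[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return pmi


toy = [["a", "b"], ["a", "b"], ["c", "d"], ["a", "c"]]
scores = pmi_matrix(toy)
```

Pairs that never co-occur are simply absent from the result; a real system would likely need smoothing or a default value for them.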

P(w1,w2,w3,w4) = P(w1 | <s>)P(w2|w1)P(w3|w2)P(w4|w3)P(</s>|w4)
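The bigram factorisation above can be sketched as follows. This is a minimal, unsmoothed maximum-likelihood version; the function names `train_bigram` and `sentence_prob` and the toy corpus are illustrative assumptions, not the authors' implementation.

```python
from collections import Counter


def train_bigram(sentences):
    """Count histories and bigrams over sentences padded with <s> and </s>."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        padded = ["<s>"] + sent + ["</s>"]
        unigrams.update(padded[:-1])             # every token that acts as a history
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams


def sentence_prob(sent, unigrams, bigrams):
    """P(w1..wn) = P(w1|<s>) * P(w2|w1) * ... * P(</s>|wn), with no smoothing."""
    padded = ["<s>"] + sent + ["</s>"]
    prob = 1.0
    for prev, cur in zip(padded, padded[1:]):
        if unigrams[prev] == 0:
            return 0.0                           # unseen history
        prob *= bigrams[(prev, cur)] / unigrams[prev]
    return prob


corpus = [["w1", "w2"], ["w1", "w3"]]
uni, bi = train_bigram(corpus)
p_seen = sentence_prob(["w1", "w2"], uni, bi)
p_unseen = sentence_prob(["w2", "w1"], uni, bi)
```

Without smoothing, any unseen bigram drives the whole sentence probability to zero, which is why practical language models add some form of smoothing or back-off.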

Set (size in sentences)    Micro accuracy    Macro accuracy

Training set (1700)        87.76 %           92.56 %

Testing set (150)          87.82 %           93.56 %
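The slides do not define the two measures. A common reading, assumed here, is that micro accuracy pools all words across sentences while macro accuracy averages the per-sentence accuracies. The sketch below shows the difference under that assumption; `micro_macro_accuracy` and its input format are hypothetical.

```python
def micro_macro_accuracy(per_sentence):
    """per_sentence: list of (correct_words, total_words) pairs, one per sentence."""
    total_correct = sum(c for c, _ in per_sentence)
    total_words = sum(t for _, t in per_sentence)
    micro = total_correct / total_words                               # pooled over all words
    macro = sum(c / t for c, t in per_sentence) / len(per_sentence)   # mean of per-sentence accuracy
    return micro, macro


micro, macro = micro_macro_accuracy([(1, 2), (9, 10)])
```

Under this reading, a macro score above the micro score, as in the table, would suggest that longer sentences are segmented less accurately than shorter ones.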

• Treat the problem as a query expansion problem.
• Start with unsegmented tokens.
• At each step, a new candidate word is selected and added to the query.
• The query expansion iterates until a complete sentence is output.

Chunk 1 – c1 c2 c3 c4

Candidate words: w1, w2, ..., wk

S = c1 + c2 + c3 + c4

C2 = Set of wi, which are candidates for semantically correct segmentation.

Similarly for c2 and c3
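The query expansion loop described above can be sketched as a greedy procedure. This is a simplified sketch under several assumptions: each chunk contributes exactly one candidate word (a real chunk may split into several words), every chunk has at least one candidate, and the scoring function is supplied by the caller. The names `expand_query`, `score` and `affinity` are hypothetical.

```python
def expand_query(seed_words, chunk_candidates, score):
    """Greedily pick one candidate per chunk, growing the query each step.

    seed_words: words already known (e.g. tokens needing no segmentation).
    chunk_candidates: {chunk_id: [candidate words]}, each list non-empty.
    score(query, word): how well `word` fits the current query.
    """
    query = list(seed_words)
    chosen = {}
    remaining = dict(chunk_candidates)
    while remaining:
        # Select the best-scoring candidate across all unresolved chunks.
        best_chunk, best_word = max(
            ((c, w) for c, words in remaining.items() for w in words),
            key=lambda cw: score(query, cw[1]),
        )
        chosen[best_chunk] = best_word
        query.append(best_word)        # the query expands with the new word
        del remaining[best_chunk]      # that chunk is now resolved
    return chosen


# Toy affinity table standing in for the learned scores.
affinity = {("x", "a1"): 0.9, ("x", "a2"): 0.1, ("x", "b1"): 0.2, ("a1", "b2"): 0.8}


def score(query, word):
    return sum(affinity.get((q, word), 0.0) for q in query)


result = expand_query(["x"], {"c1": ["a1", "a2"], "c2": ["b1", "b2"]}, score)
```

In the toy run, choosing `a1` first changes which candidate is best for the second chunk, which is the point of re-scoring against the expanded query at every step.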






• From query nodes, reach the most promising candidate word nodes.
• Perform multiple personalised random walks.
• Edge weights: accommodate heterogeneous information.
• Learn a weight for each random walk approach (path) by supervised methods.
• The weighted sum of all the random walk methods gives the most suitable candidate.
• PS: We use 4 lakh (400,000) tagged sentences from the Digital Corpus of Sanskrit.
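A minimal sketch of the idea in the bullets above: run a personalised random walk (here, a random walk with restart) on each edge-weighted graph, then rank candidates by a weighted sum of the resulting scores. The restart probability, the iteration count, the toy graphs and the hand-set weights are all assumptions; in the approach described, the weights are learned by supervised methods.

```python
def personalised_walk(edges, seeds, restart=0.15, iters=50):
    """Random walk with restart on a weighted directed graph.

    edges: {node: {neighbour: weight}}; seeds: list of query nodes.
    Nodes without outgoing edges simply lose their mass in this sketch.
    """
    nodes = set(edges) | {v for nbrs in edges.values() for v in nbrs}
    start = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    prob = dict(start)
    for _ in range(iters):
        nxt = {n: restart * start[n] for n in nodes}      # jump back to the query nodes
        for u, nbrs in edges.items():
            total = sum(nbrs.values())
            if total == 0:
                continue
            for v, w in nbrs.items():
                nxt[v] += (1 - restart) * prob[u] * w / total
        prob = nxt
    return prob


def combine(walk_scores, weights, candidates):
    """Weighted sum of several walks; returns the best candidate and all totals."""
    totals = {
        c: sum(w * s.get(c, 0.0) for w, s in zip(weights, walk_scores))
        for c in candidates
    }
    return max(totals, key=totals.get), totals


# Two toy graphs standing in for two kinds of edge information.
g1 = {"q": {"a": 3.0, "b": 1.0}}
g2 = {"q": {"a": 1.0, "b": 3.0}}
walks = [personalised_walk(g1, ["q"]), personalised_walk(g2, ["q"])]
best_first, _ = combine(walks, [0.9, 0.1], ["a", "b"])
best_second, _ = combine(walks, [0.1, 0.9], ["a", "b"])
```

The two calls to `combine` show how the weights decide which source of information dominates the final choice.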

Language model (LM) with word lemmas: LMw

LM with morphological types: LMt

Verb-specific expectancy: ViE

Compound word formation patterns

PCRW: Unifying Framework

• Handle free word order.
• Incorporate heterogeneous types of information.
• Bonus: form different relational paths (up to length l) by combining individual edge weights.
• For l = 3, some sample paths that can be formed by combination:
• LMw -> LMt -> LMw
• LMt -> V1E -> LMt
• LMt -> VkE -> LMt
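To make the path combination concrete, the sketch below enumerates every sequence of edge types up to length l. It assumes that all combinations are admissible, which the slides do not confirm, and it uses three edge-type labels only; `relational_paths` is a hypothetical helper.

```python
from itertools import product


def relational_paths(edge_types, max_len):
    """All sequences of edge types with length 1..max_len."""
    paths = []
    for length in range(1, max_len + 1):
        paths.extend(product(edge_types, repeat=length))
    return paths


paths = relational_paths(["LMw", "LMt", "ViE"], 3)
```

With three edge types and l = 3 this yields 3 + 9 + 27 = 39 candidate paths, so the number of weights to learn grows quickly with l.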
