Maximum-Length Comparison Method Of Automatic Word Segmentation for Myanmar Text, by Htet Myet Lynn, Pankoo Kim, Junho Choi, Jenogin Kim. CAIPT2015, June 23, 2015
Maximum-Length Comparison Method Of Automatic Word Segmentation for Myanmar Text

Aug 14, 2015

Htet Lynn
Transcript
Page 1: Maximum-Length Comparison Method  Of Automatic Word Segmentation  for Myanmar Text

CAIPT2015

Maximum-Length Comparison Method Of Automatic Word Segmentation

for Myanmar Text

Htet Myet Lynn, Pankoo Kim, Junho Choi, Jenogin Kim

CAIPT2015 – June 23, 2015

Page 2: Maximum-Length Comparison Method  Of Automatic Word Segmentation  for Myanmar Text

What is NLP?

Natural Language

Page 3: Maximum-Length Comparison Method  Of Automatic Word Segmentation  for Myanmar Text

Why Word Segmentation?

Page 4: Maximum-Length Comparison Method  Of Automatic Word Segmentation  for Myanmar Text

Contents

1 Nature of Myanmar Script

2 Maximum-Length Comparison Model

3 Experimental Result & Future Study

Page 5: Maximum-Length Comparison Method  Of Automatic Word Segmentation  for Myanmar Text

Nature Of Myanmar Script

Page 6: Maximum-Length Comparison Method  Of Automatic Word Segmentation  for Myanmar Text

Nature of Myanmar Script

Consonants

Digits

Page 7: Maximum-Length Comparison Method  Of Automatic Word Segmentation  for Myanmar Text

Nature of Myanmar Script

Consonants

Basic Vowels

Page 8: Maximum-Length Comparison Method  Of Automatic Word Segmentation  for Myanmar Text

Nature of Myanmar Script

Consonants

Consonant Combination Symbols

Page 9: Maximum-Length Comparison Method  Of Automatic Word Segmentation  for Myanmar Text

Nature of Myanmar Script

Consonants

Devowelization Consonants

Page 10: Maximum-Length Comparison Method  Of Automatic Word Segmentation  for Myanmar Text

Nature of Myanmar Script

The lack of standard rules for a distinct word delimiter (white space) between words is a challenge

He is having a meal with three nephews.

He is nephew three persons with meal eating

Page 11: Maximum-Length Comparison Method  Of Automatic Word Segmentation  for Myanmar Text

Maximum-Length Comparison Model

1 Preprocessing Sentences

2 Detect First Character (Consonant)

3 Candidates Detection & Extraction

4 Maximum-Length Comparison

Page 12: Maximum-Length Comparison Method  Of Automatic Word Segmentation  for Myanmar Text

Preprocessing Sentences

Input Text -> Preprocessing -> Detect Consonant -> Candidate Extraction -> Maximum Length Comparison -> Output Result

Candidate Extraction uses the Data Dictionary and Candidates.txt

Page 13: Maximum-Length Comparison Method  Of Automatic Word Segmentation  for Myanmar Text

Preprocessing Sentences

Input: He joins the army.

Preprocessing:

Each news outlet uses its own writing style and places white space differently within a sentence

Remove punctuation marks and white space
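
The preprocessing step above can be sketched in Python. This is a minimal illustration, not the authors' implementation: the function name `preprocess` and the exact set of characters treated as punctuation are assumptions, since the slides do not list them.

```python
import string

# Assumption: ASCII punctuation plus the Myanmar section marks U+104A and U+104B
# count as punctuation; the slides do not list the exact characters removed.
PUNCTUATION = set(string.punctuation) | {"\u104a", "\u104b"}


def preprocess(sentence: str) -> str:
    """Drop punctuation marks and white space, keeping other characters in order."""
    return "".join(
        ch for ch in sentence
        if not ch.isspace() and ch not in PUNCTUATION
    )


print(preprocess("He joins the army."))  # Hejoinsthearmy
```

The English sentence stands in for the Myanmar input shown on the slide, which did not survive in this transcript.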

Page 14: Maximum-Length Comparison Method  Of Automatic Word Segmentation  for Myanmar Text

Detect First Character (Consonant)

Page 15: Maximum-Length Comparison Method  Of Automatic Word Segmentation  for Myanmar Text

Detect First Character (Consonant)

Preprocessing: He joins the army.

1st Character Detection: He

Get Consonant:
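
A rough Python sketch of first-character detection follows. It assumes the consonants are the Unicode code points U+1000 through U+1021 of the Myanmar block; the slides do not say how the consonant set is defined, and the names `is_consonant` and `first_consonant` are hypothetical.

```python
# Assumption: Myanmar consonants are the code points U+1000..U+1021
# (MYANMAR LETTER KA through MYANMAR LETTER A).
CONSONANT_CODE_POINTS = range(0x1000, 0x1022)


def is_consonant(ch: str) -> bool:
    """True if ch is a single character inside the assumed consonant range."""
    return len(ch) == 1 and ord(ch) in CONSONANT_CODE_POINTS


def first_consonant(text: str):
    """Return the leading character when it is a consonant, otherwise None."""
    if text and is_consonant(text[0]):
        return text[0]
    return None
```

The returned consonant could then serve as a key for selecting the dictionary entries that begin with it.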

Page 16: Maximum-Length Comparison Method  Of Automatic Word Segmentation  for Myanmar Text

Candidates Detection & Extraction

Page 17: Maximum-Length Comparison Method  Of Automatic Word Segmentation  for Myanmar Text

Candidates Detection & Extraction

Consonant:

Data Dictionary

Candidates.txt (entries 1 to 10)

Let:
    Length of word_#1 = 3;
    Length of word_#2 = 5;
    ...
    Length of word_#10 = 20;

For each word_#n:
    truncate_word = the first (Length of word_#n) characters of input_sentence;
    if (word_#n == truncate_word) {
        mark_as_candidate;
    } else {
        ignore();
    }
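
The candidate test above might look like this in Python. It is a sketch under assumptions: `extract_candidates` is a hypothetical name, the dictionary is taken to be a plain list of words, and Latin strings stand in for the Myanmar words that are missing from this transcript.

```python
def extract_candidates(input_sentence: str, dictionary_words: list) -> list:
    """Keep every dictionary word that equals the input truncated to that word's length."""
    candidates = []
    for word in dictionary_words:
        if not word:
            continue  # an empty entry would match any input
        truncate_word = input_sentence[:len(word)]
        if word == truncate_word:
            candidates.append(word)  # mark_as_candidate
        # otherwise the word is ignored
    return candidates


# Toy data: pretend these are the dictionary entries starting with "a".
print(extract_candidates("armyjoins", ["a", "ant", "arm", "army", "armyman"]))
# ['a', 'arm', 'army']
```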

Page 18: Maximum-Length Comparison Method  Of Automatic Word Segmentation  for Myanmar Text

Maximum-Length Comparison

Page 19: Maximum-Length Comparison Method  Of Automatic Word Segmentation  for Myanmar Text

Maximum-Length Comparison

Candidates.txt (entries 1 to 10)

If:
    Length of candidate_#1 = 3;
    Length of candidate_#10 = 20;

// Get the longest word among the candidates
best_candidate = candidate_#10;
final_word = best_candidate;

Remove best_candidate from the front of the input;

Input: He joins the army.

New input:
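
The selection step can be sketched as follows. The function name `take_longest` is hypothetical, and the behaviour on ties (the first candidate of maximal length wins) is an assumption the slides do not address.

```python
def take_longest(input_sentence: str, candidates: list) -> tuple:
    """Pick the longest candidate and return it with the input left after removing it.

    Expects a non-empty candidate list whose entries are prefixes of input_sentence.
    """
    best_candidate = max(candidates, key=len)  # first one wins on a tie
    new_input = input_sentence[len(best_candidate):]
    return best_candidate, new_input


final_word, new_input = take_longest("armyjoins", ["a", "arm", "army"])
print(final_word, new_input)  # army joins
```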

Page 20: Maximum-Length Comparison Method  Of Automatic Word Segmentation  for Myanmar Text

Maximum-Length Comparison Model

Input Text -> Preprocessing -> Detect Consonant -> Candidate Extraction (Data Dictionary, Candidates.txt) -> Maximum Length Comparison -> Output Result

Repeat until the input is used up: while (length_input_sent > 0)
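
Putting the loop together, one possible end-to-end sketch is shown here. It is not the authors' code: `segment` is a hypothetical name, the consonant-detection step is folded into a plain prefix check, and the fallback for text with no dictionary match (emit one character and move on) is an assumption added so the loop always terminates.

```python
def segment(sentence: str, dictionary: set) -> list:
    """Greedy longest-match segmentation of a preprocessed sentence."""
    words = []
    remaining = sentence
    while len(remaining) > 0:
        candidates = [w for w in dictionary if w and remaining.startswith(w)]
        if candidates:
            word = max(candidates, key=len)
        else:
            word = remaining[0]  # assumed fallback for out-of-dictionary text
        words.append(word)
        remaining = remaining[len(word):]
    return words


toy_dictionary = {"he", "join", "joins", "the", "arm", "army"}
print(segment("hejoinsthearmy", toy_dictionary))
# ['he', 'joins', 'the', 'army']
```

Because the match is greedy, a long dictionary word can swallow the start of the following word; the slides do not describe any backtracking, so none is sketched here.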

Page 21: Maximum-Length Comparison Method  Of Automatic Word Segmentation  for Myanmar Text

Experimental Result & Future Study

Page 22: Maximum-Length Comparison Method  Of Automatic Word Segmentation  for Myanmar Text

Experimental Result

Page 23: Maximum-Length Comparison Method  Of Automatic Word Segmentation  for Myanmar Text

Future Study

30147 sentences, containing a total of 23,454 words, were tested

21,577 of the 23,454 words were segmented correctly (92%)

Errors can occur because of gaps in the data dictionary, technical terms, and newly derived words

Increase the size of the data dictionary

Understand the meaning of segmented words semantically for further NLP tasks
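
The reported 92% can be checked with a short calculation. The function name `word_accuracy` is hypothetical; the two counts are the ones given on the slide.

```python
def word_accuracy(correct_words: int, total_words: int) -> float:
    """Fraction of words segmented correctly."""
    return correct_words / total_words


accuracy = word_accuracy(21577, 23454)
print(f"{accuracy:.2%}")  # 92.00%
```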

Page 24: Maximum-Length Comparison Method  Of Automatic Word Segmentation  for Myanmar Text

Do You Have any Questions?
