Virach Sornlertlamvanich Information R&D Division (iTech) National Electronics and Computer Technology Center (NECTEC) THAILAND 19 January 2001 Symposium on Language Resources in Asia Thai Linguistic Resources

Page 1

Virach Sornlertlamvanich

Information R&D Division (iTech)

National Electronics and Computer Technology Center (NECTEC)

THAILAND

19 January 2001

Symposium on Language Resources in Asia

Thai Linguistic Resources

Page 2

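How Important!

[Diagram: Language Processing, approached in two ways. Top-Down: Linguistic Knowledge, Defining Rules. Bottom-Up: Training Resources, Linguistic Knowledge, Statistical Modeling. Both sides lead to Models, with Evaluation (using Evaluation Resources) and an Adjust step on each side.]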

• Linguistic resources are necessary in both top-down and bottom-up design

• They can be exploited in both modeling and evaluation

Page 3

What do we need?

Linguistic Resources

• Lexicon / Dictionary (30k)

• Tagged Text (2MB) / Speech Corpora

• Language Model

Fundamental Linguistic Tools

• Word Extraction (ML; precision=85%; recall=56%)

• Word Segmentation / POS tagger (ML; 96-97%)

• Sentence Segmentation (ML; 85-89%)

• Grapheme-to-Phoneme Conversion (PGLR; 73-90%)

• Word Sense Disambiguation

• Corpus / UNL / UW (concept) Editor

Applications

• MT (ParSit; http://come.to/parsit) / UNL

• Text Summarization

• Speech Recognition / Synthesis
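
The Word Extraction entry above reports precision (p) and recall (r) separately. As a rough single-number summary, the balanced F-measure (the harmonic mean of the two) can be derived from them. This is a derived illustration, not a figure reported in the slides, and the function name `f_measure` is hypothetical.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Return the F-beta score; beta=1 gives the harmonic mean of precision and recall."""
    if precision < 0 or recall < 0:
        raise ValueError("precision and recall must be non-negative")
    denom = beta ** 2 * precision + recall
    if denom == 0:
        # Both values are zero, so the score is defined as zero.
        return 0.0
    return (1 + beta ** 2) * precision * recall / denom


# Word Extraction figures taken from the slide: p = 85%, r = 56%.
word_extraction_f1 = f_measure(0.85, 0.56)
print(f"F1 = {word_extraction_f1:.3f}")  # prints "F1 = 0.675"
```

The low recall pulls the combined score well below the precision figure, which is why the two numbers are worth reading together.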

Page 4

Our Workbench …

[Diagram: the workbench, in three groups. Linguistic Resources: Raw Text, XML Tagged Corpus, Lexicon, Corpus-based Dictionary, Interlingual Concept, Language Model. Linguistic Tools: Word Extraction, Corpus Editor, Word Segmentation, POS Tagging, Sentence Extraction, Grapheme to Phoneme, Word Disambiguation. Applications: UNL, Machine Translation, Text Summarization, Speech Recognition, Speech Synthesis. Corpus criteria shown: Prosody-coverage, Phonetically-balance, Vocabulary-coverage.]

Page 5

Open Linguistic Resources

• LEXiTRON v 1.1 (a corpus-based T-E dictionary, 1994)
  - About 11,000 Thai entries; 9,000 English entries
  - http://www.links.nectec.or.th/lexit

• ORCHID POS-Tagged Corpus (supported by CRL, 1997)
  - 160 documents; 2MB text; 400K words
  - XML tagged for Paragraph, Sentence, Word, Part-of-Speech (47 tags)
  - http://www.links.nectec.or.th/orchid

• Thai Royal Institute Dictionary (T-T dictionary)
  - Basic term 32,000 entries
  - Technical term 15,339 entries
  - http://www.royin.go.th/

• ParSit (http://come.to/parsit, 2000)
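
To illustrate what a corpus tagged for Paragraph, Sentence, Word and Part-of-Speech makes possible, here is a minimal sketch that reads such markup with Python's standard library. The element names (`corpus`, `paragraph`, `sentence`, `word`), the `pos` attribute, the tag labels and the English tokens are all assumptions made for illustration; they are not the actual ORCHID schema or data.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical markup: element names, the "pos" attribute and the tokens
# are placeholders, not the real ORCHID format or content.
SAMPLE = """
<corpus>
  <paragraph>
    <sentence>
      <word pos="NCMN">computer</word>
      <word pos="VACT">process</word>
      <word pos="NCMN">language</word>
    </sentence>
  </paragraph>
</corpus>
"""


def tagged_sentences(xml_text: str):
    """Return each sentence as a list of (word, part-of-speech) pairs."""
    root = ET.fromstring(xml_text.strip())
    return [
        [(w.text, w.get("pos")) for w in s.iter("word")]
        for s in root.iter("sentence")
    ]


def pos_counts(xml_text: str) -> Counter:
    """Count how often each part-of-speech label occurs in the corpus."""
    root = ET.fromstring(xml_text.strip())
    return Counter(w.get("pos") for w in root.iter("word"))


print(tagged_sentences(SAMPLE))
print(pos_counts(SAMPLE))
```

Counts of this kind are the raw material for the statistical tools listed earlier, such as a POS tagger trained on the tagged text.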

Page 6

Ongoing : Thai Speech Corpus #1

Scope (2001)

• Large Vocabulary Continuous Speech Recognition (LVCSR) Corpus
  - Phonetically-balanced sentences
  - 5K vocabulary coverage sentences

• Corpus for Text-to-Speech Synthesis
  - 400 phonetically and prosodically balanced sentences
  - For probabilistic prosody generation

• Dialog speech corpus (collaboration with ATR)
  - 50 conversations, 2,099 sentences
  - 5,000 words, 866 phonetically-balanced sentences
  - 40 speakers (males and females)

Page 7

Ongoing : Thai Speech Corpus #2

Procedure

[Flow diagram: Raw Text passes through Word Segmentation, Sentence Extraction, POS Tagging and Grapheme-to-Phoneme, with the Corpus Editor, to give an XML Tagged Corpus. A Sentence Selection Process (Phonetically-balanced, Vocabulary coverage, Prosody-balanced) follows, then Speech Recording and Tagging, giving the Tagged Speech Corpus.]
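
The slides do not say how the Sentence Selection Process chooses sentences. One common way to approach coverage-driven selection is a greedy pass that repeatedly takes the sentence adding the most units not yet covered; the sketch below assumes that approach. The function name `select_sentences`, the sentence ids and the single-letter "units" (stand-ins for phonemes or vocabulary items) are all hypothetical.

```python
from typing import Dict, List, Set


def select_sentences(units_by_sentence: Dict[str, Set[str]],
                     max_sentences: int) -> List[str]:
    """Greedily pick sentences that add the most not-yet-covered units."""
    covered: Set[str] = set()
    chosen: List[str] = []
    remaining = dict(units_by_sentence)
    while remaining and len(chosen) < max_sentences:
        # Iterate over sorted ids so that ties are broken deterministically.
        best = max(sorted(remaining), key=lambda s: len(remaining[s] - covered))
        gain = remaining[best] - covered
        if not gain:
            break  # No remaining sentence adds new coverage.
        chosen.append(best)
        covered |= gain
        del remaining[best]
    return chosen


# Placeholder candidates: each sentence id maps to the units it contains.
candidates = {
    "s1": {"a", "b", "c"},
    "s2": {"c", "d"},
    "s3": {"a", "b"},
    "s4": {"d", "e"},
}
selected = select_sentences(candidates, max_sentences=3)
print(selected)  # prints ['s1', 's4']
```

A greedy pass is only an approximation of the smallest covering set, but it is simple and keeps the recording script short, which matters when every selected sentence has to be read aloud by each speaker.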

Page 8

Ongoing : Thai Speech Corpus #3

Tools

[Figure: Plain Text, Corpus Editor, XML Corpus]

Page 9

Ongoing : Thai Speech Corpus #4

Text Sources

• Technology Promotion Association (Thailand-Japan)

• Amarin Printing Co., Ltd.

• Matichon Public Co., Ltd.

Project Collaboration

• Kasetsart University

• Thammasat University

• King Mongkut's University of Technology Thonburi

• Prince of Songkla University

Page 10

Ongoing : Thai Speech Corpus #5

                  JNAS                    TIMIT     WSJCAM0          NECTEC (2001-2006)
Vocab size        5K, 20K                 -         20K, 64K         20K
# sent (PB)       503                     450       < 1,500          < 866
# sent (Vocab)    < 15,000                1,890     < 14,000         < 10,000
# speaker         306                     630       140              200
# sent/speaker    150 (100 Vocab+50 PB)   10        100 (Vocab+PB)   100 (80 Vocab+20 PB)
Record time       60 hrs. (16 CDROM)      1 CDROM   -                1GB

Page 11

Ongoing : LEXiTRON v 2.0 #1

Scope (2001)

• Entries
  - 25,000 Thai-English
  - 25,000 English-Thai

• Fields
  - Translation
  - Phonetics
  - Root of vocabulary
  - Part-of-speech
  - Synonym
  - Antonym
  - Sentence sample
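
As a sketch of how the fields listed above could be held in a record, the following dataclass mirrors them one for one. The class name `DictionaryEntry`, the attribute names and the `missing_fields` helper are hypothetical and say nothing about how LEXiTRON actually stores its entries; the sample values are placeholders.

```python
from dataclasses import dataclass, field, fields
from typing import List


@dataclass
class DictionaryEntry:
    """One bilingual entry with the fields named on the slide (layout assumed)."""
    headword: str
    translations: List[str]
    phonetics: str = ""
    root: str = ""
    part_of_speech: str = ""
    synonyms: List[str] = field(default_factory=list)
    antonyms: List[str] = field(default_factory=list)
    sample_sentences: List[str] = field(default_factory=list)

    def missing_fields(self) -> List[str]:
        """Return the names of fields that are still empty, in declaration order."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


# Placeholder entry; the translation string is a stand-in, not real data.
entry = DictionaryEntry(
    headword="big",
    translations=["<Thai translation>"],
    part_of_speech="adjective",
    synonyms=["large"],
    antonyms=["small"],
)
print(entry.missing_fields())  # prints ['phonetics', 'root', 'sample_sentences']
```

A check like `missing_fields` is the kind of thing an editing tool might use to show which entries still need phonetics or a corpus-based sample sentence.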

Procedure

[Flow diagram: Word Extraction from Raw Text, together with an Existing Dictionary, feeds Vocabulary Selection. Dictionary Editing then draws on an Existing Dictionary and Corpus-based Sentence Samples to produce LEXiTRON v 2.0.]

Page 12

Ongoing : LEXiTRON v 2.0 #2

Tools

[Figure: Dictionary DB, Phonetic Symbols, WordNet, Corpus-based Sample Sentences]

Page 13

Discussion

• Language difficulties; 13 Tai-family languages

• Text sources

• Common tagset

• Resource center

• Institutional collaboration