MQP Final Presentation - Worcester Polytechnic Institute · MQP Final Presentation Advisors: Gabor Sarkozy, WPI Andras Kornai, MTA-Sztaki April 28, 2015 ... improved clzh-en dic ...


Automated Building of Classic Chinese-English Dictionary and Chinese-Hungarian Dictionary

MQP Final Presentation

Advisors: Gabor Sarkozy, WPI

Andras Kornai, MTA-Sztaki

April 28, 2015

Xiaosong Wen xwen2@wpi.edu

Hongbo Fang hfang@wpi.edu

Outline

● Introduction & Background

● Methodology

● Analysis

● Conclusion & Future Work

Introduction: Bilingual Dictionary:

● Definition: a specialized dictionary used to translate words or phrases from one language to another

● Other uses: cross-language information retrieval and cross-language plagiarism detection

Classic Chinese and Modern Chinese:

● Ancient Chinese: articles and poems from the pre-Qin and Han periods (121 AC)

● Modern Chinese: after the founding of the Republic of China (after 1912)

Differences between Classic Chinese and Modern Chinese:

1. Ancient Chinese did not have punctuation marks

Differences between Classic Chinese and Modern Chinese:

2. Different meanings of words:

modern Chinese: “咸” → salty

classic Chinese: “咸” → all

modern Chinese: ‘’

Differences between Classic Chinese and Modern Chinese:

3. Different ways of forming words:

In classic Chinese, every content character is a word

Example:

modern Chinese: ‘妻子’ → wife

classic Chinese: ‘妻’ → wife; ‘子’ → son (two distinct words)

Parallel Corpus:

Hunglish;

UM-Corpus;

Chinese Text Project

Sparse Matrix:

• A matrix in which most of the elements are zero

• Widely used in numerical linear algebra computations
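To make the definition concrete, here is a minimal Python sketch of one common way to store a sparse matrix: keep only the nonzero cells in a dictionary keyed by (row, column). The helper names `to_sparse` and `get` and the sample values are illustrative assumptions, not part of the project's code.

```python
def to_sparse(dense):
    """Keep only the nonzero cells of a dense matrix as {(row, col): value}."""
    return {
        (r, c): value
        for r, row in enumerate(dense)
        for c, value in enumerate(row)
        if value != 0
    }


def get(sparse, r, c):
    """Read a cell; anything not stored is implicitly zero."""
    return sparse.get((r, c), 0)


# Illustrative values only: 12 cells, of which 4 are nonzero.
dense = [
    [0, 2, 0, 3],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
]
sparse = to_sparse(dense)
print(len(sparse), get(sparse, 0, 1), get(sparse, 2, 0))
```

The saving grows with the matrix: for a word-by-sentence matrix over millions of sentences, almost every cell is zero and never needs to be stored.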

Pointwise mutual information (PMI):

Fano 1961: mutual information between particular events X and Y

\[
\mathrm{PMI}(X, Y) = \log\frac{p(X, Y)}{p(X)\,p(Y)} = \log\frac{p(X \mid Y)}{p(X)} = \log\frac{p(Y \mid X)}{p(Y)}
\]
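As a quick check of the definition, the sketch below computes PMI from probabilities and confirms numerically that the joint form and the two conditional forms agree. The function name `pmi` and the probabilities are made-up assumptions chosen only for illustration, and the natural logarithm is assumed because the slide does not state a base.

```python
import math


def pmi(p_xy, p_x, p_y):
    """Pointwise mutual information, natural log (the base is an assumption)."""
    return math.log(p_xy / (p_x * p_y))


# Made-up probabilities for illustration.
p_x, p_y, p_xy = 0.1, 0.2, 0.05

joint_form = pmi(p_xy, p_x, p_y)
cond_x_form = math.log((p_xy / p_y) / p_x)  # log p(X|Y) / p(X)
cond_y_form = math.log((p_xy / p_x) / p_y)  # log p(Y|X) / p(Y)

print(joint_form, cond_x_form, cond_y_form)
```

PMI is positive when X and Y co-occur more often than independence would predict, and close to zero when they are independent.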

Methodology: Classic Chinese

Download and extract parallel corpus from ctext.org

• wget, Java, python: openCV

• HTML parsing

• 34,140 zh-en sentence pairs

• hundict

Evaluate: clzh-en dic

● Join with the 100-word basic set:

o overall precision is 82%

o 92% for the words with confidence above 0.2
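The evaluation above can be sketched as follows: keep the dictionary entries whose source word is in the basic set, optionally keep only those above a confidence threshold, and count how many translations match. The function name `precision_above`, the data layout, and the confidence scores are hypothetical; the word pairs come from examples elsewhere in this presentation.

```python
def precision_above(entries, gold, threshold=0.0):
    """Precision of entries with confidence above `threshold`.

    entries: iterable of (source, translation, confidence)
    gold:    {source: set of accepted translations}
    Entries whose source word is not in `gold` are skipped (an assumption
    about how the join with the basic set works).
    """
    kept = [
        (src, tgt)
        for src, tgt, conf in entries
        if src in gold and conf > threshold
    ]
    if not kept:
        return None
    correct = sum(1 for src, tgt in kept if tgt in gold[src])
    return correct / len(kept)


# Confidence scores are invented for illustration.
entries = [
    ("咸", "all", 0.50),
    ("妻", "wife", 0.40),
    ("寐", "dawn", 0.10),
    ("之", "the", 0.15),
]
gold = {"咸": {"all"}, "妻": {"wife"}, "寐": {"sleep"}, "之": {"of"}}

print(precision_above(entries, gold))       # all entries
print(precision_above(entries, gold, 0.2))  # only confidence above 0.2
```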

Evaluate: clzh-en dic

Improve the quality:

• Remove punctuation marks

• Stem the English corpus (stemming algorithm: porter2)
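A minimal sketch of the punctuation-removal step, assuming it can be done by dropping every character in a Unicode punctuation category, which covers both Chinese and English marks. The function name `strip_punctuation` is hypothetical. Stemming is left out to keep the block free of third-party dependencies; Porter2 is available, for example, as NLTK's `SnowballStemmer("english")`.

```python
import unicodedata


def strip_punctuation(text):
    """Drop every character whose Unicode category starts with 'P'."""
    return "".join(
        ch for ch in text
        if not unicodedata.category(ch).startswith("P")
    )


zh = strip_punctuation("學而時習之，不亦說乎？")
en = strip_punctuation("Is it not pleasant, to learn?").lower().split()
print(zh)
print(en)
```

Note that this also removes apostrophes and hyphens inside English words, which may or may not be desirable before stemming.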

Results: improved clzh-en dic

[Figure: error distribution]

● Precision is 94% for confidence above 0.25 and 96% for confidence above 0.3

● 598 entries with confidence 0.25 and above

● 508 entries with confidence 0.3 and above

Results: improved clzh-en dic

诸侯 @ feudal → feudal lord

寐 @ dawn → sleep

之 @ the → of

Results: improved clzh-en dic

● Recall against a selected 100-word dictionary is 42%

● The remaining translations in the 100-word basic dictionary were composed manually, except for the words: Wednesday, bread, game

Methodology: Modern Chinese (simplified)

● Purchased a Chinese-to-English dictionary

● Professor Kornai provided an English-to-Hungarian dictionary

● Used the Linux “join” command to get a “raw” dictionary

joined dictionary
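The join step can be sketched in Python as a pivot through English: every Chinese word and every Hungarian word that share an English translation become a candidate pair. The function name `pivot_join` and the in-memory data layout are assumptions; the sample rows are taken from the error examples later in this presentation, and the project itself used the Linux `join` command, which expects both inputs sorted on the join field.

```python
from collections import defaultdict


def pivot_join(zh_en, en_hu):
    """Join (zh, en) and (en, hu) pairs on the shared English word.

    Returns (en, zh, hu) triples, mirroring the 'en @ zh @ hu' layout.
    """
    en_to_hu = defaultdict(set)
    for en, hu in en_hu:
        en_to_hu[en].add(hu)

    triples = []
    for zh, en in zh_en:
        # .get avoids creating empty entries for unmatched English words
        for hu in sorted(en_to_hu.get(en, ())):
            triples.append((en, zh, hu))
    return triples


zh_en = [("天分", "gift"), ("密", "silent"), ("春", "spring")]
en_hu = [("gift", "ad"), ("silent", "csendes")]

for en, zh, hu in pivot_join(zh_en, en_hu):
    print(f"{en} @ {zh} @ {hu}")
```

Because an English word can carry several senses, the raw join over-generates, which is why the candidates are filtered with PMI afterwards.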

Sparse Matrix:

                  sentence1  sentence2  sentence3  sentence4  ...
both(hu1, zh1)        0          2          0          3
only hu1              0          0          1          0
only zh1              0          0          0          1
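The rows of this matrix can be reduced to three totals per candidate pair. The sketch below is a simplified version that counts each aligned sentence pair at most once; the function name `cooccurrence_counts`, the substring test for the Chinese word, and the toy sentences (with placeholder Hungarian tokens `w1`, `w2`, `w3`) are all assumptions for illustration.

```python
def cooccurrence_counts(sentence_pairs, zh_word, hu_word):
    """Count aligned sentence pairs containing both words or only one of them.

    sentence_pairs: iterable of (zh_sentence: str, hu_tokens: list of str)
    Chinese is not space-delimited, so a substring test is used here.
    """
    n_b = n_z = n_h = 0
    for zh_sentence, hu_tokens in sentence_pairs:
        has_zh = zh_word in zh_sentence
        has_hu = hu_word in hu_tokens
        if has_zh and has_hu:
            n_b += 1
        elif has_zh:
            n_z += 1
        elif has_hu:
            n_h += 1
    return {"n_b": n_b, "n_z": n_z, "n_h": n_h}


# Toy data; w1, w2, w3 stand in for other Hungarian tokens.
pairs = [
    ("非洲人来了", ["afrikai", "w1"]),  # both
    ("今天下雨", ["afrikai"]),          # only hu
    ("他是非洲人", ["w2"]),             # only zh
    ("今天下雨", ["w3"]),               # neither
]
counts = cooccurrence_counts(pairs, "非洲人", "afrikai")
print(counts)
```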

PMI:

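\[
\mathrm{PMI}(zh, hu) = \log\frac{p(zh, hu)}{p(zh)\,p(hu)} = \log\frac{\frac{n_b}{N}}{\frac{n_z + n_b}{N} \cdot \frac{n_h + n_b}{N}} = \log\left(\frac{n_b}{(n_z + n_b)(n_h + n_b)} \cdot N\right)
\]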

           hu1    not hu1
zh1        n_b    n_z
not zh1    n_h

Example:

PMI(非洲人, afrikai) = log(3 / (6 × 45) × 10,365,171) = 14.0012952777

               非洲人   not 非洲人
afrikai          3        42
not afrikai      3
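The count-based formula is easy to evaluate directly. The sketch below plugs in the counts from this example (n_b = 3, n_z = 3, n_h = 42, N = 10,365,171); the function name `pmi_from_counts` and the choice of logarithm are assumptions, since the slide does not state the base. With these inputs the natural log gives roughly 11.65 and log base 2 roughly 16.81, so the 14.0013 reported above may rest on a different base or different counts.

```python
import math


def pmi_from_counts(n_b, n_z, n_h, n_total, log=math.log):
    """PMI = log( n_b * N / ((n_z + n_b) * (n_h + n_b)) )."""
    return log(n_b * n_total / ((n_z + n_b) * (n_h + n_b)))


natural = pmi_from_counts(3, 3, 42, 10_365_171)
base2 = pmi_from_counts(3, 3, 42, 10_365_171, log=math.log2)
print(natural, base2)
```

Whatever the base, ranking candidate pairs by PMI is unaffected; only a fixed cutoff depends on it.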

Evaluate: spcl-hu dic

Pairs with PMI > 6 are considered to have high precision

[Figure: number of pairs; axis from 0 to 4500]

Errors:

Error I:

gift @ 天分 @ ad

天分: talent, gift

ad: give

silent @ 密 @ csendes

密: secret or dense

csendes: quiet

Error II: Inconsistent part of speech

发怒, 愤, 怒, 触怒, 激怒 and 生气 are verbs;

忿, 怒气, 怒火 are nouns;

气愤, 愤怒 are adjectives.

Error III: Slang expression

stupid @ 二 @ buta

bird @ 女人 @ madár

Error IV: Indirect translation

life @ 春 @ energia

bread @ 食物 @ élelmiszer

Future work:

Acknowledgement

● András Kornai, MTA-SZTAKI

● Gábor Sárközy, Worcester Polytechnic Institute

● Huba Bartos, MTA-SZTAKI

Reference:

• Fano, Robert M. 1961. Transmission of Information: A Statistical Theory of Communications. New York: MIT Press.

• Liang Tian, Derek F. Wong, Lidia S. Chao, Paulo Quaresma, Francisco Oliveira, Yi Lu, Shuo Li, Yiming Wang, Longyue Wang. 2014. “A Large English-Chinese Parallel Corpus for Statistical Machine Translation.”

• Grigg, Hugh. Past events in Mandarin Chinese grammar. 14 April 2013. Accessed 22 April 2015.

• Attila Balogh, Zsolt Both, András Farkas, Péter Halácsy. http://www.hunglish.hu/

• Chinese Text Project. http://ctext.org/

• Parallel Text. http://en.wikipedia.org/wiki/Parallel_text

Questions?

Thank you!
