Intelligent Database Systems Lab, 國立雲林科技大學 National Yunlin University of Science and Technology. Advisor: Dr. Hsu. Student: Sheng-Hsuan Wang. Department of Information Management. Fuzzy Translation of Cross-Lingual Spelling Variants, SIGIR’03.

Fuzzy Translation of Cross-Lingual Spelling Variants

Jan 29, 2016

Page 1: Fuzzy Translation of Cross-Lingual Spelling Variants

Intelligent Database Systems Lab

國立雲林科技大學 National Yunlin University of Science and Technology

Advisor : Dr. Hsu Student : Sheng-Hsuan Wang

Department of Information Management

Fuzzy Translation of Cross-Lingual Spelling Variants

SIGIR’03

Page 2: Fuzzy Translation of Cross-Lingual Spelling Variants

N.Y.U.S.T.

I. M.

Outline

Motivation
Objective
Introduction
Method & Data
Findings
Discussion & Conclusions

Page 3: Fuzzy Translation of Cross-Lingual Spelling Variants

Motivation

CLIR performance is limited by what can be translated.

Some terms are not found in translation dictionaries.

Fuzzy matching, e.g. the n-gram method, can be used to find their target language equivalents.

Page 4: Fuzzy Translation of Cross-Lingual Spelling Variants

Objective

A two-step fuzzy translation technique for cross-lingual spelling variants, intended to improve CLIR performance:
  1. Transformation rule based translation (TRT) turns source words into intermediate forms.
  2. The intermediate forms are translated into the target language using fuzzy matching.

Page 5: Fuzzy Translation of Cross-Lingual Spelling Variants

Introduction

Technical terms and proper names are important text elements, but not generally found in electronic translation dictionaries utilized by MT and CLIR.

Non-identical translatable spelling variant forms, e.g., Chernobyl – Tshernobyl.

Similarity measure
N-gram
Fuzzy matching
Transliteration

Page 6: Fuzzy Translation of Cross-Lingual Spelling Variants

Introduction

This paper presents the technique of transformation rule based translation (TRT).
  It is close to transliteration, but involves no phonetic elements.
  It is suitable for cross-lingual spelling variants.

Example: Spanish embriologia => English embryology

Problem: how can such rules be found automatically?
  Equivalent term pairs are extracted from a translation dictionary and aligned pairwise.
  Edit distance is used to identify similar term pairs.

Page 7: Fuzzy Translation of Cross-Lingual Spelling Variants

Introduction

Two-step fuzzy translation
  1. Source words are translated into intermediate forms based on TRT, in order to render a source word more similar to its target equivalent.
  2. The intermediate forms are translated into target language equivalents through approximate string matching, i.e. fuzzy matching (n-gram based matching).

Page 8: Fuzzy Translation of Cross-Lingual Spelling Variants

Method & Data - Overview

Translation dictionary: (emcriologia, embryology), (emariolagia, embryology), (embrialagia, embryology)

Flow: Translation dictionary -> TRT -> Intermediate form -> N-gram Matching

Translation Strategies
  High confidence factor, HCF
  Low confidence factor, LCF

Example: konvektio => convection
  o – on (end), ko – co (beginning), ekt – ect (middle) => convection

Page 9: Fuzzy Translation of Cross-Lingual Spelling Variants

Method & Data - TRT

Translation dictionary: (emcriologia, embryology), (emariolagia, embryology), (embrialagia, embryology)

Edit distance error values
  0: the same character at the same position
  1: consonant-consonant or vowel-vowel substitution
  1: insertion or deletion of a character
  2: consonant-vowel or vowel-consonant substitution

Selection of proper terms and error value
  Threshold: (embriologia, embryology), (embriolagia, embryology), (embrialagia, embryology)
  Minimum ED: the one transformation with the smallest sum of error values is selected.

Rule: on -> oughn at middle position
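The error values on this slide can be read as a weighted edit distance. The following is a minimal Python sketch of that reading, not the authors' implementation: the function names are hypothetical, and the vowel set "aeiouy" (treating y as a vowel) is an assumption, since the slides do not list the vowels.

```python
VOWELS = set("aeiouy")  # assumption: the slides do not say which letters count as vowels


def substitution_cost(a: str, b: str) -> int:
    """0 for identical characters, 1 within a class, 2 across classes."""
    if a == b:
        return 0
    same_class = (a in VOWELS) == (b in VOWELS)
    return 1 if same_class else 2


def error_value(source: str, target: str) -> int:
    """Smallest sum of error values over all alignments of the two terms."""
    rows, cols = len(source) + 1, len(target) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        d[i][0] = i  # i deletions, error value 1 each
    for j in range(1, cols):
        d[0][j] = j  # j insertions, error value 1 each
    for i in range(1, rows):
        for j in range(1, cols):
            d[i][j] = min(
                d[i - 1][j] + 1,  # deletion
                d[i][j - 1] + 1,  # insertion
                d[i - 1][j - 1] + substitution_cost(source[i - 1], target[j - 1]),
            )
    return d[-1][-1]


# Pick the candidate source term with the minimum error value.
candidates = ["embriologia", "embriolagia", "embrialagia"]
best = min(candidates, key=lambda s: error_value(s, "embryology"))
```

Under these assumptions the three candidates get error values 3, 4 and 5 against embryology, so embriologia is the one kept.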

Page 10: Fuzzy Translation of Cross-Lingual Spelling Variants

Transformation Rule based Translation

Edit Distance

Automatic Generation of Rules
  Extracting similar terms from a dictionary with an edit distance threshold.
  Selection of proper terms with the smallest sum of error values.
  Generation of transformation rules.

Context Information, Frequency, and Confidence Factor

Sample Rules

Page 11: Fuzzy Translation of Cross-Lingual Spelling Variants

Edit Distance

$$\mathrm{ED}(A, B) = \min\{N_{sub} + N_{ins} + N_{del}\}$$

computed with the recurrence

$$d[i, j] = \min\{\, d[i-1, j] + 1,\ d[i, j-1] + 1,\ d[i-1, j-1] + \mathrm{cost} \,\}$$

where cost = 0 if $A[i] = B[j]$, and cost = 1 if $A[i] \neq B[j]$.
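The recurrence can be filled in as a table, one cell per pair of prefixes. Below is a short Python sketch of the standard unit-cost edit distance; the function name is hypothetical, and strings are 0-indexed here, so A[i] in the formula corresponds to a[i - 1] in the code.

```python
def edit_distance(a: str, b: str) -> int:
    """Unit-cost edit distance: minimum number of substitutions, insertions and deletions."""
    # d[i][j] holds the distance between the first i characters of a
    # and the first j characters of b.
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i  # delete all i characters
    for j in range(len(b) + 1):
        d[0][j] = j  # insert all j characters
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
    return d[len(a)][len(b)]
```

For example, edit_distance("konvektio", "convection") is 3: two substitutions of k by c and one insertion of the final n.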

Page 12: Fuzzy Translation of Cross-Lingual Spelling Variants

A sample of Spanish-to-English rules

Page 13: Fuzzy Translation of Cross-Lingual Spelling Variants

Translation Resources

Multilingual medical dictionary by Andre Fairchild.

A Finnish list of medical terms (n=5970)

A Swedish list of medical terms (n=657)

Language pairs
  Finnish-English
  French-English
  German-English
  Spanish-English
  Swedish-English

Page 14: Fuzzy Translation of Cross-Lingual Spelling Variants

Target Word List and Source Words

Target word list
  The index of CLEF’s LA Times collection, which contains 189,000 words.

Source words
  First source word list: 217 word tuples (72 training word tuples, 145 test word tuples).
  Second source word list: 126 test word tuples.

Experiment dataset
  5 (languages) * (145 + 126) words = 1355 words

Page 15: Fuzzy Translation of Cross-Lingual Spelling Variants

N-gram Matching

A similarity measure, SIM, is computed between the source and target words w1 and w2 from the sets N1 and N2, where Ni refers to the set of n-grams derived from the word wi.

Digrams vs. trigrams
  Trigrams generally performed worse than digrams, but sometimes gave better results.
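The similarity formula itself did not survive in this transcript. The Python sketch below therefore assumes a common set-overlap form, SIM(w1, w2) = |N1 ∩ N2| / |N1 ∪ N2|, with n-grams taken as plain adjacent character sequences without padding. Both the formula and the function names are assumptions rather than details confirmed by the slides.

```python
def ngrams(word: str, n: int = 2) -> set:
    """Set of adjacent character n-grams of a word (digrams by default)."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def sim(w1: str, w2: str, n: int = 2) -> float:
    """Assumed similarity: shared n-grams divided by all distinct n-grams."""
    n1, n2 = ngrams(w1, n), ngrams(w2, n)
    union = n1 | n2
    return len(n1 & n2) / len(union) if union else 0.0


def rank(word: str, target_words: list, n: int = 2) -> list:
    """Target words sorted by decreasing similarity to the given word."""
    return sorted(target_words, key=lambda t: sim(word, t, n), reverse=True)


# Hypothetical target word list, for illustration only.
targets = ["connection", "convection", "collection"]
ranked = rank("konvektio", targets)
```

With digrams, konvektio and convection share 5 of 11 distinct digrams, which already places convection first in this small list; TRT is meant to raise that overlap before matching.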

Page 16: Fuzzy Translation of Cross-Lingual Spelling Variants

Translation Strategies - High confidence factor (HCF) strategy

A relatively high confidence factor threshold, 50%, is used to minimize the number of incorrect transformations.

Reading order of the rules
  The location of the rules in source words: end, beginning, and middle.
  The source string length: the longest first.
  Confidence factor: the highest first.

Example: konvektio => convection
  o – on (end), ko – co (beginning), ekt – ect (middle) => convection
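The reading order and the konvektio example can be traced in a few lines of Python. This is a simplified sketch: the Rule type, the function names and the confidence factor values are hypothetical, rules are assumed to have passed the threshold already, and each rule is simply applied to the output of the previous one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    source: str    # non-empty source substring
    target: str    # replacement substring
    position: str  # "end", "beginning" or "middle"
    cf: float      # confidence factor (values below are made up)


POSITION_ORDER = {"end": 0, "beginning": 1, "middle": 2}


def order_rules(rules):
    """Sort by location (end, beginning, middle), longest source first, highest CF first."""
    return sorted(rules, key=lambda r: (POSITION_ORDER[r.position], -len(r.source), -r.cf))


def apply_rule(word: str, rule: Rule) -> str:
    """Apply one rule at its position, or return the word unchanged if it does not match."""
    s, t = rule.source, rule.target
    if rule.position == "end":
        return word[:-len(s)] + t if word.endswith(s) else word
    if rule.position == "beginning":
        return t + word[len(s):] if word.startswith(s) else word
    # middle: first occurrence that touches neither end of the word
    i = word.find(s, 1)
    if i != -1 and i + len(s) < len(word):
        return word[:i] + t + word[i + len(s):]
    return word


def to_intermediate(word: str, rules) -> str:
    """Apply all rules in reading order to build the intermediate form."""
    for rule in order_rules(rules):
        word = apply_rule(word, rule)
    return word


RULES = [
    Rule("ekt", "ect", "middle", 0.7),
    Rule("ko", "co", "beginning", 0.9),
    Rule("o", "on", "end", 0.8),
]
```

Calling to_intermediate("konvektio", RULES) goes through konvektion and convektion and ends at convection, matching the slide's example.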

Page 17: Fuzzy Translation of Cross-Lingual Spelling Variants

Translation Strategies - Low confidence factor (LCF) strategy

A threshold confidence factor of 10% was used to filter out unreliable rules.

Even more intermediate forms were obtained, but some of them may result from incorrect transformations.

In both HCF and LCF, the rules whose frequency was < 50 were removed.
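The two strategies differ only in the confidence factor threshold applied to the same rule set. A small Python sketch follows, with hypothetical rules, frequencies and confidence factors, and assuming that a rule exactly at a threshold is kept.

```python
from typing import NamedTuple


class Rule(NamedTuple):
    source: str
    target: str
    position: str
    frequency: int  # how often the rule was observed (made-up values below)
    cf: float       # confidence factor as a fraction, e.g. 0.5 for 50%


def select_rules(rules, cf_threshold: float, min_frequency: int = 50):
    """Keep rules that are frequent enough and meet the confidence factor threshold."""
    return [r for r in rules if r.frequency >= min_frequency and r.cf >= cf_threshold]


RULES = [
    Rule("ko", "co", "beginning", 120, 0.90),
    Rule("k", "c", "middle", 300, 0.25),
    Rule("tio", "tion", "end", 40, 0.95),   # dropped by both: frequency < 50
    Rule("f", "ph", "beginning", 60, 0.05),  # dropped by both: CF below 10%
]

hcf_rules = select_rules(RULES, cf_threshold=0.50)  # HCF strategy
lcf_rules = select_rules(RULES, cf_threshold=0.10)  # LCF strategy
```

Here the HCF set keeps only ko -> co, while the LCF set also keeps k -> c, which illustrates why LCF yields more intermediate forms at a higher risk of incorrect ones.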

Page 18: Fuzzy Translation of Cross-Lingual Spelling Variants

Evaluation

For each word, precision was calculated by considering the position of the correct equivalent (pce) in the ranked result list of n-gram matching.

When several words share the same SIM value
  Worst position precision: the correct equivalent is placed as the last word of the set.
  Average position precision: the correct equivalent is placed in the middle of the set of words.
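The handling of ties can be made concrete with a short Python sketch. It computes only the position of the correct equivalent, not the precision derived from it, and it assumes that "the middle of the set" means the midpoint between the first and last tied positions. The function name and the sample scores are hypothetical.

```python
def tied_positions(ranked, correct: str):
    """Return (worst, average) 1-based positions of the correct word.

    ranked is a list of (word, sim_value) pairs; words with the same
    sim_value are treated as one tied set.
    """
    score = dict(ranked)[correct]
    better = sum(1 for _, s in ranked if s > score)  # words ranked strictly higher
    tied = sum(1 for _, s in ranked if s == score)   # size of the tied set, incl. the correct word
    first = better + 1
    worst = better + tied
    average = (first + worst) / 2                    # assumed meaning of "the middle of the set"
    return worst, average


# Hypothetical ranked list with made-up SIM values.
ranked = [
    ("convection", 0.9),
    ("connection", 0.6),
    ("conviction", 0.6),
    ("convention", 0.6),
    ("collection", 0.2),
]
```

In this list conviction shares its SIM value with two other words occupying positions 2 to 4, so its worst position is 4 and its average position is 3.0.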

Page 19: Fuzzy Translation of Cross-Lingual Spelling Variants

Findings

Five test word types
  Medical, biological, and chemical terms (Bio terms), n=90
  Place names, n=55
  Economics, n=31
  Technology, n=36
  Miscellaneous, n=59

Five language pairs
  Finnish-English
  French-English
  German-English
  Spanish-English
  Swedish-English

Page 20: Fuzzy Translation of Cross-Lingual Spelling Variants

Findings – 1/3

Page 21: Fuzzy Translation of Cross-Lingual Spelling Variants

Findings – 2/3

Page 22: Fuzzy Translation of Cross-Lingual Spelling Variants

Findings – 3/3

Page 23: Fuzzy Translation of Cross-Lingual Spelling Variants

Discussion & Conclusion

Technical terms and proper names are often untranslatable due to the limited coverage of translation dictionaries.

In this study, a two-step fuzzy translation technique was tested
  Automatically generated transformation rules, TRT
  Fuzzy matching

Two translation strategies were tested, HCF and LCF.

Digram and trigram matching were tested in combination with TRT.

Page 24: Fuzzy Translation of Cross-Lingual Spelling Variants

Discussion & Conclusion

Effectiveness of fuzzy translation depends on
  The frequency of identical terms shared by a source and a target language.
  The extent of variation in the spelling variants between a source and a target language.

Fuzzy translation is well suited for language pairs with a high percentage of similar but non-identical terms.

Page 25: Fuzzy Translation of Cross-Lingual Spelling Variants

Personal opinion

How can we apply these ideas in our lab? TRT?