Lecture 10: Term Translation Extraction & Cross-Language Information Retrieval
Wen-Hsiang Lu (盧文祥)
Department of Computer Science and Information Engineering, National Cheng Kung University
2004/11/24

References:
• Wen-Hsiang Lu (2003) Term Translation Extraction Using Web Mining Techniques, PhD thesis (Advisors: Lee-Feng Chien and Hsi-Jian Lee), Department of Computer Science and Information Engineering, National Chiao Tung University.
Outline

I. Background & Research Problems
II. Anchor Text Mining for Term Translation Extraction
III. Transitive Translation for Multilingual Translation
IV. Web Mining for Cross-Language Information Retrieval and Web Search Applications
Part I: Background & Research Problems
Motivation
• Demands on multilingual translation lexicons
  – Machine translation (MT)
  – Cross-language information retrieval (CLIR)
  – Information exchange in electronic commerce (EC)
• Web mining
  – Explore multilingual and wide-scoped hypertext resources on the Web
Research Problems
• Difficulties in automatic construction of multilingual translation lexicons
  – Techniques: parallel/comparable corpora
  – Bottlenecks: lacking diverse/multilingual resources
• Difficulties in query translation for cross-language information retrieval (CLIR) [Fig1]
  – Techniques: bilingual dictionary/machine translation/
Research Results (cont.)

• "It gives us insight into the value of the Web as a dynamic information source. Although the experiments are restricted to Chinese-English documents, also developers for other languages may find this work stimulating."
• "The idea is interesting, and is relatively new. It may give inspiration to other researchers working in the same area."
• Summary of contributions
  – Present an innovative approach
    • Significantly reduces the difficulty of unknown-term translation
    • CLIR can be improved, especially for short queries
  – Develop a practical cross-language Web search engine
    • Without relying on a translation dictionary
    • A live dictionary with a significant number of multilingual term translations obtained
  – Present a new problem for further investigation in Web mining
Related Research

• Automatic extraction of multilingual translations
  – Statistical translation model (Brown 1993)
  – Parallel corpus (Melamed 2000; Wu & Chang 2003)
  – Non-parallel/comparable corpus (Fung 1998; Rapp 1999)
  – Web mining
    • Parallel corpus collection (Nie 1999; Resnik 1999)
    • Comparable corpus collection: anchor texts and search-result pages (Lu et al. 2002, 2003)
    • Strength: huge amounts of Web data with link structure
Related Research (cont.)

• Query translation for cross-language information retrieval
  – Dictionary-/MT-based approach (Ballesteros & Croft 1997; Hull ...)
  – Query expansion and phrase translation (Ballesteros & Croft 1997)
  – Translation disambiguation (Ballesteros & Croft 1998; Chen & Bian 1999)
  – Proper name transliteration (Chen et al. 1998; Lin & Chang 2003)
  – Probabilistic retrieval/language models (Hiemstra & de Jong 1999; Lavrenko 2002)
  – Unknown query translation (Lu et al. 2002, 2003)
Related Research (cont.)
• Cross-language Web search (CLWS)– Practical CLWS services have not lived up to expectations
• Keizai (Ogden et al. 1999): English query/Japanese, Korean Web news• MTIR (Bian & Chen 1999): Chinese query/English pages/translation• MuST: Multilingual Summarization and Translation (Hovy & Lin 1998)
– English/Indonesian/Spanish/Arabic/Japanese, Web news summarization or translation• TITAN (Hayashi et al.1997): English-Japanese retrieval/translated pages titles
• Challenges– Web queries are often
• Short: 2-3 words (Silverstein et al. 1998)• Diverse: wide-scoped topic • Unknown (out of vocabulary): 74% is unavailable in CEDICT Chinese-English electronic
• Goal: enhance translation coverage for diverse queries
• Idea
  – Comparable corpus: language-mixed texts in search-result pages
  – Utilize co-occurrence relations and context information
    • Chi-square test
    • Context-vector analysis
• Procedure of query translation based on search-result mining
  1. Corpus collection: collect m search results from search engines.
  2. Translation candidate extraction: segment the collected corpus and extract the k most frequent target terms as candidates.
  3. Translation selection: compute similarity based on the chi-square test or context-vector analysis.
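Step 2 of the procedure can be sketched in a few lines. This is a minimal sketch assuming the corpus has already been segmented into target-language tokens; the function name and the toy pages are ours, not from the thesis:

```python
from collections import Counter

def extract_candidates(search_results, k):
    """From the (already segmented) target-language tokens of the m
    collected search-result pages, keep the k most frequent terms as
    translation candidates."""
    counts = Counter()
    for tokens in search_results:  # one token list per search-result page
        counts.update(tokens)
    return [term for term, _ in counts.most_common(k)]

# Toy example: Chinese tokens from three mixed-language result pages
# returned for the English query "Disneyland".
pages = [["迪士尼樂園", "門票", "迪士尼樂園"],
         ["迪士尼樂園", "樂園"],
         ["門票", "迪士尼樂園"]]
print(extract_candidates(pages, 2))  # → ['迪士尼樂園', '門票']
```

The candidates produced here are then scored in step 3 by the similarity measures described next.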
Chi-Square Test

• Idea
  – Make good use of all co-occurrence relations between the source and target terms.
• Similarity measure (Gale & Church 1991)

    S_χ2(s, t) = N(ad - bc)^2 / [(a+b)(a+c)(b+d)(c+d)]

• 2-way contingency table

              t     ~t
     s        a      b
    ~s        c      d

    a: # of pages containing both terms s and t
    b: # of pages containing term s but not t
    c: # of pages containing term t but not s
    d: # of pages containing neither term s nor t
    N: the total number of pages, i.e., N = a + b + c + d

• This is the standard chi-square statistic

    χ^2 = Σ_{i,j} (O_ij - E_ij)^2 / E_ij

  with observed count O_{s,t} = n_{s,t} and expected count E_{s,t} = N · P(s) · P(t).
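Given the four cells of the contingency table, the similarity reduces to one arithmetic expression. A minimal sketch (the function name is ours):

```python
def chi_square_similarity(a, b, c, d):
    """S_chi2(s, t) = N (ad - bc)^2 / ((a+b)(a+c)(b+d)(c+d)),
    where a..d are the cells of the 2x2 co-occurrence table and
    N = a + b + c + d."""
    n = a + b + c + d
    denom = (a + b) * (a + c) * (b + d) * (c + d)
    if denom == 0:
        return 0.0  # degenerate table: a term never (or always) occurs
    return n * (a * d - b * c) ** 2 / denom

# A perfectly associated pair scores N; an independent pair scores 0:
print(chi_square_similarity(50, 0, 0, 50))    # → 100.0
print(chi_square_similarity(25, 25, 25, 25))  # → 0.0
```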
Context-Vector Analysis

• Idea
  – Take the co-occurring context terms as feature vectors of the source/target terms:

    s: w_s1, w_s2, ..., w_sm
    t: w_t1, w_t2, ..., w_tm

• Weighting scheme: TF*IDF

    w_ti = ( f(ti, dj) / max_t f(t, dj) ) × log(N / n_ti)

    f(ti, dj): the frequency of term ti in search-result page dj
    N: the total number of Web pages
    n_ti: the number of pages containing ti

• Similarity measure: cosine similarity

    S_CV(s, t) = Σ_{i=1..m} w_si w_ti / sqrt( (Σ_{i=1..m} w_si^2)(Σ_{i=1..m} w_ti^2) )
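The weighting and similarity above can be sketched directly; the context-term counts below are hypothetical toy values, and the functions assume every term has a nonzero document frequency:

```python
import math

def tfidf_vector(term_freqs, doc_freqs, total_pages):
    """TF*IDF weight from the slide:
    w_ti = (f(ti, d) / max_t f(t, d)) * log(N / n_ti),
    where f(ti, d) is the frequency of context term ti in the
    search-result pages d, N the total number of Web pages, and
    n_ti the number of pages containing ti (assumed > 0)."""
    max_f = max(term_freqs.values())
    return {t: (f / max_f) * math.log(total_pages / doc_freqs[t])
            for t, f in term_freqs.items()}

def cosine_similarity(ws, wt):
    """S_CV(s, t): cosine between the two context vectors."""
    dot = sum(w * wt[t] for t, w in ws.items() if t in wt)
    norm_s = math.sqrt(sum(w * w for w in ws.values()))
    norm_t = math.sqrt(sum(w * w for w in wt.values()))
    if norm_s == 0.0 or norm_t == 0.0:
        return 0.0
    return dot / (norm_s * norm_t)

# Hypothetical context-term counts for a source term and a candidate:
ws = tfidf_vector({"park": 5, "theme": 3, "ticket": 2},
                  {"park": 12000, "theme": 8000, "ticket": 5000}, 1_000_000)
wt = tfidf_vector({"park": 4, "theme": 2, "ride": 1},
                  {"park": 12000, "theme": 8000, "ride": 3000}, 1_000_000)
print(round(cosine_similarity(ws, wt), 3))  # high: the contexts overlap
```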
Translation Selection based on Chi-Square Test and Context-Vector Analysis

• For each candidate t
  – Chi-square test
    1. Retrieve page frequencies by submitting the Boolean queries 's∩t', '~s∩t', and 's∩~t' to search engines
    2. Compute the similarity S_χ2(s, t)
  – Context-vector analysis
    1. Retrieve the top m search results by submitting t to search engines, and generate its feature vector
    2. Compute the similarity S_CV(s, t)
Integrated Web Mining Approach

• Idea: take both complementary advantages
  – Anchor-text mining: good precision rate
  – Search-result mining: good coverage rate
• Combined similarity measure

    S_Combined(s, t) = Σ_m α_m / R_m(s, t)

    α_m: an assigned weight for each similarity measure S_m
    R_m(s, t): the similarity ranking between s and t using S_m
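One way to read the combined measure is as a weighted sum of reciprocal ranks: each individual measure contributes its weight divided by the rank it assigns the candidate. A minimal sketch under that reading (the weights and rank tables are hypothetical):

```python
def combined_score(candidate, rankings, alphas):
    """S_Combined(s, t) = sum_m alpha_m / R_m(s, t): measure m
    contributes its weight alpha_m divided by the rank R_m it assigns
    to the candidate (rank 1 = best)."""
    return sum(alpha / ranks[candidate]
               for alpha, ranks in zip(alphas, rankings))

# Hypothetical ranks from anchor-text (AT) and search-result (CV) mining:
at_ranks = {"迪士尼樂園": 1, "樂園": 2}
cv_ranks = {"迪士尼樂園": 2, "樂園": 1}
score = combined_score("迪士尼樂園", [at_ranks, cv_ranks], [0.7, 0.3])
print(round(score, 2))  # → 0.85
```

A candidate ranked highly by both measures dominates one that only a single measure favors, which is how the precision of anchor-text mining and the coverage of search-result mining complement each other.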
Test Bed

• Test query set
  – 430 popular Chinese/English query terms
    • Filter out terms without translations (from 9,709 core terms)
    • OOV: 64% (274/430) are out of vocabulary
  – 200 random Chinese query terms
    • Randomly selected from the top 19,124 terms in the Dreamer log
    • OOV: 82.5% (165/200)
  – 50 scientist names (proper names)
    • Randomly selected from 256 scientists (Science/People in the Yahoo! Directory)
    • OOV: 76% (38/50)
  – 50 disease names (technical terms)
    • Randomly selected from 664 diseases (Health/Diseases and Conditions in the Yahoo! Directory)
    • OOV: 72% (36/50)
Examples of Proper Names and Technical Terms

Query type: scientist name (English queries with extracted Chinese translations)

  Aldrin, Buzz (Astronaut)
  Hadfield, Chris (Astronaut)
  Galilei, Galileo (Astronomer)
  Ptolemy, Claudius (Astronomer)
  Tibbets, Paul (Aviators)
  Crick, Francis (Biologists)
  Drake, Edwin Laurentine (Earth Scientist)
  Aryabhata (Mathematician)
  Kepler, Johannes (Mathematician)
  Dalton, John (Physicist)
  Feynman, Richard (Physicist)
Performance of Web Mining for Popular Queries

Approach   Query type   Top-1   Top-3   Top-5   Coverage
CV         Dic          56.4%   70.5%   74.4%   80.1%
           OOV          56.2%   66.1%   69.3%   85.0%
           All          56.3%   67.7%   71.2%   83.3%
χ2         Dic          40.4%   61.5%   67.9%   80.1%
           OOV          54.7%   65.0%   68.2%   85.0%
           All          49.5%   63.7%   68.1%   83.3%
AT         Dic          67.3%   78.2%   80.8%   89.1%
           OOV          66.1%   74.5%   76.6%   83.9%
           All          66.5%   75.8%   78.1%   85.8%
Combined   Dic          68.6%   82.1%   84.6%   92.3%
           OOV          66.8%   85.8%   88.0%   94.2%
           All          67.4%   84.4%   86.7%   93.5%
Performance of Web Mining for Random Queries / Proper Names / Technical Terms

Table 5.5 Coverage and inclusion rates for random queries.

Approach   Top-1   Top-3   Top-5   Coverage
CV         25.5%   45.5%   50.5%   60.5%
χ2         26.0%   44.5%   50.5%   60.5%
AT         19.0%   28.0%   28.5%   29.0%
Combined   33.5%   53.5%   60.5%   67.5%

Table 5.6 Inclusion rates for proper names and technical terms using the combined approach.

Query type       Top-1   Top-3   Top-5
Scientist name   40.0%   52.0%   60.0%
Disease name     44.0%   60.0%   70.0%
CLIR on NTCIR-2 Evaluation Task

• The test collection (Chen & Chen 2001)
  – 132,173 Chinese news documents (200 MB)
  – 50 English query topics
• Title query (title section only)
  – Short: 3.8 English words on average
  – Low performance: 55% of monolingual performance (Kwok 2001)
  – Difficulty: CLIR may fail if any key word in a short query cannot be translated correctly.
• Can Web mining solve short query translation?
Table 5.1 Examples of title queries in NTCIR-2.

      English Title Query                                    Chinese Title Query
Q06   Kosovar refugees                                       科索沃難民潮
Q12   Michael Jordan's retirement                            麥可喬登退休
Q23   Disneyland                                             迪士尼樂園
Q28   Cutting down the timber of Chinese cypress in Chilan   棲蘭檜木砍伐
Q30   El Nino and infectious diseases                        聖嬰現象與傳染病
Q34   Side effects of Viagra                                 威而鋼之副作用
Q45   Cloud Gate Dance Theatre of Taiwan                     雲門舞集
Q46   Ma Yo-yo cello recital                                 馬友友演奏會
Q47   Jin Yong kung-fu novels                                金庸武俠小說
Integration of Web Mining and Probabilistic Retrieval Model

• Probabilistic retrieval model (Xu 2001; Hiemstra & de Jong 1999)

    P(Q|D) = Π_{e∈Q} P(e|D) = Π_{e∈Q} [ α P(e) + (1 - α) Σ_c P(e|c) P(c|D) ]

    Q: English query
    D: Chinese document
    e: English query term
    c: Chinese translation
    P(e): background probability
    P(e|c): translation probability
    P(c|D): generation probability

• Estimating the translation probability P(e|c)
  – The Web mining approach: P(e|c) = P_web(e|c) ≈ S_Combined(e, c)
  – The dictionary-based approach: P(e|c) = P_dic(e|c) ≈ 1/n_e, where n_e is the number of translations of c
  – The hybrid approach: P(e|c) = [P_web(e|c) + P_dic(e|c)] / 2
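A minimal sketch of how the retrieval model could score a query under the hybrid translation probability. The probability tables here are hypothetical toy values, not the thesis's estimates:

```python
def hybrid_translation_prob(e, c, p_web, p_dic):
    """P(e|c) = [P_web(e|c) + P_dic(e|c)] / 2 -- the hybrid approach."""
    return 0.5 * (p_web.get((e, c), 0.0) + p_dic.get((e, c), 0.0))

def query_likelihood(query, doc_model, p_background, p_web, p_dic, alpha=0.5):
    """P(Q|D) = prod_{e in Q} [alpha * P(e) +
    (1 - alpha) * sum_c P(e|c) * P(c|D)]: score an English query
    against a Chinese document model P(c|D), smoothing each term's
    translated probability with its background probability P(e)."""
    score = 1.0
    for e in query:
        translated = sum(hybrid_translation_prob(e, c, p_web, p_dic) * p_cd
                         for c, p_cd in doc_model.items())
        score *= alpha * p_background.get(e, 0.0) + (1 - alpha) * translated
    return score

# Toy values: one query term, one candidate Chinese translation.
p_web = {("disneyland", "迪士尼樂園"): 0.8}
p_dic = {("disneyland", "迪士尼樂園"): 0.4}
doc_model = {"迪士尼樂園": 0.1}
background = {"disneyland": 0.001}
print(query_likelihood(["disneyland"], doc_model, background, p_web, p_dic))
```

A document whose model assigns high probability to a likely translation of every query term gets a high likelihood, which is why a single untranslatable key word can sink a short query.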
Performance of Query Translation and CLIR for NTCIR-2 English-Chinese Retrieval Task

Table 5.9 Top-n inclusion rates with the Web mining approach for traditional Chinese translations of 178 English title query terms.

Type                        Number   Top-1   Top-2   Top-3   Top-4   Top-5
Terms existing in LDC       156      60.3%   73.7%   77.6%   82.1%   83.3%
Terms not included in LDC   22       68.1%   77.2%   81.8%   86.3%   86.3%
Total                       178      61.2%   74.2%   78.1%   82.6%   83.7%
Table 5.10 The MAP values with three different approaches of query translation to the NTCIR-2 English-Chinese retrieval task.
Performance Analysis for Query Translation & CLIR

• Q23 "Disneyland": MAP (mean average precision) from 0 to 0.721
• Q46 "Ma Yo-yo cello recital": MAP from 0.205 to 0.446
Conclusion

• Practical CLWS services have not lived up to expectations due to the lack of multilingual translations for diverse unknown queries.
• The anchor-text-mining and search-result-mining approaches are complementary in precision and coverage for query translation; the combined Web mining approach takes advantage of both.
• Anchor texts and search-result pages are useful comparable corpora for query translation, contributed continuously by a huge number of volunteers (page authors) around the world.
• LiveTrans can generate translation suggestions and provide a practical CLWS service for the retrieval of both Web pages and images.
• Currently, the LiveTrans system cannot fully perform in real time; a more efficient way to reduce the computation cost is needed.
• Employ more language processing techniques to improve accuracy in phrase translation, word segmentation, unknown word extraction, and proper name transliteration.
• Develop an automatic way to collect and exploit other Web resources such as bilingual/multilingual Web pages.
• Enhance the LiveTrans system to handle more Asian and European languages, such as Japanese, Korean, and French.
• Apply our Web-mining translation techniques to enhance current machine translation techniques and to design a computer-aided English writing system.