IRF 1
What’s different with Chinese in cross-language IR?
Jian-Yun Nie
University of Montreal, Canada
IRF 2
Outline
• General characteristics of Chinese
• Monolingual IR in Chinese
• CLIR with Chinese
• OOV: an important problem in Chinese IR
• Solutions?
IRF 3
1. General characteristics of Chinese
Sentence = ideograms with no separation:
  它是一种适于在拖拉机使用的转向球接头,… (It is a steering ball joint suitable for use on tractors, …)
Words?
  它 / 是 / 一种 / 适于 / 在 / 拖拉机 / 使用 / 的 / 转向 / 球 / 接头 / ,…
IRF 4
Word formation
• Each character can be a word (人: person)
• Most words are composed of two or more characters (人群: mass)
• However:
  • No clear definition of the notion of word
    办公楼 (office building): / 办公楼 / or / 办公 / 楼 /?
  • Inconsistency in manual segmentation
  • Many new words are created (abbreviations)
    E.g. 网络 (network) + 管理员 (administrator) → 网管 (webmaster)
IRF 5
2. IR using word segmentation
• Segmentation uses rules, dictionaries and/or statistics
• Problems for information retrieval:
  • Segmentation ambiguity: more than one possible segmentation, e.g. “发展中国家”
    发展中 (developing) / 国家 (country)
    发展 (development) / 中 (middle) / 国家 (country)
    发展 (development) / 中国 (China) / 家 (family)
  • Different words have similar meanings
    接头 (connector, plug) ↔ 插头 (plug) ↔ 插座 (socket)
  • New words can be formed quite freely
    接 (reception) 桶 (bucket): not a common word, but can be used
    网 (network) 店 (store): more and more used…
    的 (of, taxi) 车 (car): taxi car (?), car of (someone)…
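The segmentation ambiguity for “发展中国家” can be reproduced with a small sketch that enumerates every split of a string into dictionary words. The lexicon below is a hypothetical toy list chosen to cover the example; a real segmenter would add rules or statistics to pick one reading.

```python
def enumerate_segmentations(text, lexicon):
    """Return every way to split `text` into words found in `lexicon`."""
    if not text:
        return [[]]  # one way to segment the empty string: no words
    results = []
    for end in range(1, len(text) + 1):
        prefix = text[:end]
        if prefix in lexicon:
            # Extend each segmentation of the remainder with this prefix.
            for rest in enumerate_segmentations(text[end:], lexicon):
                results.append([prefix] + rest)
    return results


# Hypothetical toy lexicon (an assumption, not a resource from the talk).
lexicon = {"发展", "发展中", "中", "中国", "国家", "家"}
segmentations = enumerate_segmentations("发展中国家", lexicon)
for words in segmentations:
    print(" / ".join(words))
```

With this lexicon the function returns the three readings listed on the slide; nothing in the dictionary alone says which one is intended.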
IRF 6
Alternative: n-grams
• Usually unigrams and bigrams
  • As effective as using a word segmentation
  • Account for some flexibility
• However:
  • Noise: non-meaningful combinations
  • Wrong combinations
• Example: 非酿造型啤酒 (non-brewed beer)
  Words: 非 / 酿造 / 型 / 啤酒
  Bigrams: 非酿 / 酿造 / 造型 / 型啤 / 啤酒
  造型 (style, appearance, …) is a wrong combination; bigrams such as 型啤 are non-meaningful
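A minimal sketch of character n-gram indexing for the example 非酿造型啤酒. The `known_words` set is a hypothetical mini-lexicon used only to flag which bigrams happen to be real words; it is not part of any indexing pipeline described in the talk.

```python
def char_ngrams(text, n):
    """Return the overlapping character n-grams of `text`, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


text = "非酿造型啤酒"
unigrams = char_ngrams(text, 1)
bigrams = char_ngrams(text, 2)

# Hypothetical lexicon: 造型 is a real word, but a wrong reading here.
known_words = {"酿造", "造型", "啤酒"}
noise = [gram for gram in bigrams if gram not in known_words]

print(bigrams)  # all overlapping bigrams
print(noise)    # bigrams that are not words at all
```

A lexicon lookup catches non-words such as 型啤, but not 造型, which is a legitimate word formed across the boundary between 酿造 and 型.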
IRF 7
Chinese monolingual IR
• Possible approach: combining words and n-grams
• Example: 前年收入有所下降
  Word: 前年 / 收入 / 有所 / 下降 or: 前 / 年收入 / 有所 / 下降
  Unigram: 前 / 年 / 收 / 入 / 有 / 所 / 下 / 降
  Bigram: 前年 / 年收 / 收入 / 入有 / 有所 / 所下 / 下降
• Score function in language modeling similar to other languages
• Previous results: Word ~ bigram > unigram
IRF 8
Our recent tests
Chinese monolingual IR (Query: Title); W = word, B = bigram, U = unigram

Collection  W      B      U      WU     BU     0.3W+0.7U  0.3B+0.7U  W+B+U
TREC5       .2585  .2698  .3012  .3298  .3074  .3123      .3262      .3273
TREC6       .3861  .3628  .3580  .4220  .3897  .4090      .3880      .4068
NTCIR3      .2609  .2492  .2496  .2606  .2820  .2754      .2840      .2862
NTCIR4      .1996  .2164  .2371  .2254  .2350  .2431      .2429      .2387
NTCIR5      .2974  .3151  .3390  .3118  .3246  .3452      .3508      .3470
Average     .2805  .2827  .2970  .3099  .3077  .3170      .3184      .3212
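The slides do not spell out how a run such as 0.3B+0.7U is scored. One plausible reading, sketched here as an assumption, is a linear interpolation of two query-likelihood scores, one computed on the bigram index and one on the unigram index. The Dirichlet smoothing, the value of `mu`, the add-one collection estimate and the two toy documents are all illustrative choices, not settings from the experiments.

```python
import math
from collections import Counter


def char_ngrams(text, n):
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def query_log_likelihood(query_terms, doc_terms, collection_terms, mu=10.0):
    """Dirichlet-smoothed log P(query | document) over one term type."""
    doc_tf = Counter(doc_terms)
    coll_tf = Counter(collection_terms)
    score = 0.0
    for term in query_terms:
        # Add-one estimate so terms unseen in the collection stay non-zero.
        p_coll = (coll_tf[term] + 1) / (len(collection_terms) + len(coll_tf) + 1)
        score += math.log((doc_tf[term] + mu * p_coll) / (len(doc_terms) + mu))
    return score


def combined_score(query, doc, collection, w_bigram=0.3, w_unigram=0.7):
    """Assumed reading of '0.3B+0.7U': interpolate the two index scores."""
    scores = {}
    for n in (1, 2):
        coll_terms = [g for d in collection for g in char_ngrams(d, n)]
        scores[n] = query_log_likelihood(
            char_ngrams(query, n), char_ngrams(doc, n), coll_terms)
    return w_bigram * scores[2] + w_unigram * scores[1]


# Toy collection built from sentences that appear in the slides.
collection = ["前年收入有所下降", "它是一种适于在拖拉机使用的转向球接头"]
ranking = sorted(collection,
                 key=lambda d: combined_score("收入下降", d, collection),
                 reverse=True)
print(ranking[0])
```

Interpolating scores rather than merging the indexes keeps each representation's statistics separate, which makes the weights easy to tune per collection.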
IRF 9
Why is this useful?
• NTCIR 5 Topic 18: 烟草商诉讼赔偿 (tobacco company, lawsuit, compensation)
  Word: 烟草商 (tobacco company) / 诉讼 (lawsuit) / 赔偿 (compensation)
  Unigram (0.7659) > Word (0.1625)
  The relevant documents include the words 烟草, 公司, 业者, 香烟, 烟商, but these cannot match “烟草商”.
• NTCIR 5 Topic 24: 经济舱综合症候群航班 (economy class, syndrome, flight)
  Word: 经济 (economy) / 综合症 (syndrome) / 候 (wait) / 航班 (flight)
  Ubigram (0.7607) > Word (0.0002)
  “..综合症候..” is segmented into “../综合症/候/..”, so it cannot match “症候” (syndrome).
• The combination of words with unigrams or bigrams helps
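The Topic 18 mismatch can be illustrated with a small term-overlap check. The document below is a hypothetical word list assembled from the kinds of words the slide says relevant documents contain; it is not an actual NTCIR document.

```python
def matched_terms(query_terms, document_terms):
    """Return the query terms that also occur in the document."""
    document_set = set(document_terms)
    return [term for term in query_terms if term in document_set]


query_words = ["烟草商", "诉讼", "赔偿"]        # word segmentation of the query
doc_words = ["烟草", "公司", "诉讼", "赔偿"]     # hypothetical relevant document

word_matches = matched_terms(query_words, doc_words)
unigram_matches = matched_terms(list("".join(query_words)),
                                list("".join(doc_words)))

print(word_matches)     # the key term 烟草商 is missing
print(unigram_matches)  # 烟 and 草 are recovered at the character level
```

At the word level the topic's central concept finds no match, while the unigram index still matches 烟 and 草, which is consistent with the large gap between the unigram and word runs on this topic.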
IRF 10
Also works for Korean and Japanese?
Mean Average Precision (MAP), rigid and relaxed relevance judgments

Run       U             B             W             BU            WU            0.3B+0.7U
          Rigid  Relax  Rigid  Relax  Rigid  Relax  Rigid  Relax  Rigid  Relax  Rigid  Relax
C-C-T-N4  .1929  .2370  .1670  .2065  .1679  .2131  .1928  .2363  .1817  .2269  .1979  .2455
C-C-T-N5  .3302  .3589  .2713  .3300  .2676  .3315  .2974  .3554  .3017  .3537  .3300  .3766
J-J-T-N4  .2377  .2899  .2768  .3670  −      −      .2807  .3722  −      −      .2873  .3664
J-J-T-N5  .2376  .2730  .2471  .3273  −      −      .2705  .3458  −      −      .2900  .3495
K-K-T-N4  .2004  .2147  .3873  .4195  −      −      .4084  .4396  −      −      .3608  .3889
K-K-T-N5  .2603  .2777  .3699  .3996  −      −      .3865  .4178  −      −      .3800  .4001
IRF 11
3. CLIR: query translation
• Machine translation: rules + dictionaries
• Statistical translation model:
  • Parallel texts
  • Automatically extract possible translations
• Comparison:
  • A statistical TM does not produce human-readable translations
  • But it can include related words
• Usually, word-based translation
IRF 12
Our recent tests: also translate into n-grams
English word → • Chinese word • Chinese unigram • Chinese bigram • Bigram & unigram
Parallel sentence pair: “history and civilization” || “历史文明” …
• Word-to-bigram:
  history / and / civilization || 历史 / 史文 / 文明 …
  → GIZA++ training
  → TM (word-to-bigram): p(历史|history), p(史文|history), p(文明|history), …
• Word-to-unigram:
  history / and / civilization || 历 / 史 / 文 / 明 …
  → GIZA++ training
  → TM (word-to-unigram): p(历|history), p(史|history), p(文|history), …
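GIZA++ trains IBM-style word alignment models. As a rough illustration of how a word-to-unigram table such as p(历|history) can emerge from parallel text, here is a toy IBM Model 1 EM loop in plain Python. It is a sketch, not GIZA++ itself: there is no NULL word, no higher-order model, and the three training pairs are hypothetical.

```python
from collections import defaultdict


def train_ibm_model1(pairs, iterations=5):
    """Estimate t[(zh, en)] = p(zh | en) with IBM Model 1 EM (no NULL word)."""
    zh_vocab = {zh for _, zh_tokens in pairs for zh in zh_tokens}
    uniform = 1.0 / len(zh_vocab)
    t = defaultdict(lambda: uniform)  # uniform initialisation
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        for en_tokens, zh_tokens in pairs:
            for zh in zh_tokens:
                # E-step: share one unit of count among the English words.
                norm = sum(t[(zh, en)] for en in en_tokens)
                for en in en_tokens:
                    fraction = t[(zh, en)] / norm
                    count[(zh, en)] += fraction
                    total[en] += fraction
        # M-step: renormalise the expected counts per English word.
        t = defaultdict(float, {(zh, en): c / total[en]
                                for (zh, en), c in count.items()})
    return t


# Hypothetical toy corpus; the Chinese side is split into unigrams.
pairs = [
    (["history"], list("历史")),
    (["civilization"], list("文明")),
    (["history", "civilization"], list("历史文明")),
]
table = train_ibm_model1(pairs)
print(round(table[("历", "history")], 3), round(table[("文", "history")], 3))
```

Replacing `list(...)` on the Chinese side with overlapping bigrams gives the word-to-bigram variant with no other change to the training loop.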
IRF 13
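Combining different translations
English query $Q$ → Chinese documents $D$

$Q_U:\ \sum_j t(u_i \mid e_j)\, P(e_j \mid Q)$, matched against $D_U$
$Q_B:\ \sum_j t(b_i \mid e_j)\, P(e_j \mid Q)$, matched against $D_B$
$Q_W:\ \sum_j t(w_i \mid e_j)\, P(e_j \mid Q)$, matched against $D_W$

Here $e_j$ are the English query words; $u_i$, $b_i$, $w_i$ are Chinese unigrams, bigrams and words; $D_U$, $D_B$, $D_W$ are the corresponding document indexes.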
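A minimal sketch of the translated query model for one index type: each Chinese term receives the sum over English query words of t(term | e_j) · P(e_j | Q). Estimating P(e_j | Q) by relative frequency in the query is an assumption, and the translation probabilities in `toy_table` are made-up illustrative values, not output of the trained models.

```python
from collections import Counter, defaultdict


def translate_query(english_query, translation_table):
    """Return {chinese_term: sum_j t(term | e_j) * P(e_j | Q)}."""
    counts = Counter(english_query)
    query_length = sum(counts.values())
    chinese_model = defaultdict(float)
    for english_word, count in counts.items():
        p_word = count / query_length  # assumed ML estimate of P(e_j | Q)
        # English words with no table entry contribute nothing.
        for term, prob in translation_table.get(english_word, {}).items():
            chinese_model[term] += prob * p_word
    return dict(chinese_model)


# Hypothetical word-to-unigram table with illustrative probabilities.
toy_table = {
    "history": {"历": 0.5, "史": 0.5},
    "civilization": {"文": 0.5, "明": 0.5},
}
unigram_query = translate_query(["history", "civilization"], toy_table)
print(unigram_query)
```

Running the same function with a word-to-bigram or word-to-word table yields the other two query models, each of which is then matched against its own document index.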
IRF 14
Bilingual linguistic resources for CLIR
• An English-Chinese parallel corpus mined from the Web
  • About 281,000 parallel sentence pairs
• LDC English-Chinese bilingual dictionaries
  • 42,000 entries
  • Translation model
• Combination of the 2 translation models
IRF 15
CLIR results
English-to-Chinese CLIR

Collection  W      B      U      WU     BU     0.3W+0.7U  0.3B+0.7U
TREC5       .1904  .2003  .1922  .2448  .2277  .2158      .2251
TREC6       .2047  .2293  .2602  .2670  .2772  .2672      .2822
NTCIR3      .1288  .1017  .1536  .1628  .1504  .1619      .1495
NTCIR4      .0956  .0953  .1382  .1410  .1308  .1337      .1286
NTCIR5      .1158  .1323  .1762  .1532  .1462  .1682      .1602
Average     .1470  .1518  .1841  .1938  .1865  .1894      .1891
IRF 16
General observations for Chinese IR
Using both words and n-grams for Chinese IR and Chinese query translation
N-grams can account for flexibility in Chinese words
CLIR with Chinese can also benefit from translations into Chinese n-grams
IRF 17
4. OOV problem in Chinese
• OOV (Out-Of-Vocabulary) problem
  • TREC queries: 63% of named entities are OOV
  • Even more on the Web
    • Specialized terms (abbreviations)
    • New words
  • Impossible to collect all terms manually
• Solutions
  • Parallel texts (translations by n-grams)
  • Monolingual corpus
IRF 18
Translation of named entities
• Statistical transliteration
  Frances Taylor → 弗朗西斯泰勒, 茀琅希思泰勒, 弗郎西丝泰勒, …
IRF 19
IRF 20
Candidate extraction
Templates: four templates to extract candidates
• c1c2..cn (En)
• c1c2..cn , En, c’1c’2..c’m
• c1c2..cn: En
• c1c2..cn 是 / 即 En
Comparing the four templates (template 1 is used in the following experiments):

Template  Percentage  Precision
1         17.65%      54%
2         68.35%      6.5%
3         9.05%       2.5%
4         4.94%       1%
Table 2: Comparing Precision of the Four Templates
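A rough sketch of extraction with the parenthesis template, c1c2..cn (En), assuming that this is the template numbered 1. The regular expression, the `max_len` window and the sample sentence are all hypothetical; the template only marks where a translation may end, so every suffix of the Chinese run is kept as a candidate for a later selection step.

```python
import re

# Chinese run, then an English string in ASCII or full-width parentheses.
PAREN_TEMPLATE = re.compile(
    r"([\u4e00-\u9fff]+)\s*[(（]\s*([A-Za-z][A-Za-z .'-]*[A-Za-z])\s*[)）]"
)


def extract_candidates(text, max_len=8):
    """Return (english_term, chinese_candidates) for each template match."""
    pairs = []
    for match in PAREN_TEMPLATE.finditer(text):
        chinese_run, english = match.group(1), match.group(2)
        window = chinese_run[-max_len:]  # keep the characters nearest to '('
        # The true translation is assumed to end right before the bracket,
        # so each suffix of the window is one candidate.
        candidates = [window[start:] for start in range(len(window))]
        pairs.append((english, candidates))
    return pairs


# Hypothetical sentence: "Actor Frances Taylor attended the event."
sample = "演员弗朗西斯泰勒(Frances Taylor)出席了活动"
extracted = extract_candidates(sample)
english_term, candidate_list = extracted[0]
print(english_term, candidate_list)
```

The candidate list still contains wrong spans such as 演员弗朗西斯泰勒, which is why a scoring step over the candidates is needed after extraction.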
IRF 21
Translation model
Train a translation model
Candidate List
IRF 22
Dictionary mining results
• Processed more than 300 GB of Chinese web pages
• 161,117 translation pairs were mined

Translation %  Transliteration %  Accuracy %
53.55          46.45              90.15
Table 4: Accuracy of Mined Dictionary
IRF 23
Coverage of the Dictionary on Query Log Data
9,065 popular English terms from the MSN Chinese search engine
IRF 24
CLIR experiment
IRF 25
Conclusions
• In addition to the general approaches, Chinese IR should also consider the characteristics of the language (this also applies to other Asian languages: Japanese and Korean)
• Difficulty in translating new (technical) words and proper names
  • Exploit parallel/comparable or monolingual texts
• Additional problem: making the retrieved documents readable
  • Full-text translation
    • Running sentences in patents: relatively easy
    • Technical terms: may be difficult with Chinese
  • Gisting: translation assistance tool, useful for a user with some knowledge of the document language