IRF1 What’s different with Chinese in cross-language IR? Jian-Yun Nie University of Montreal, Canada.

IRF 1

What’s different with Chinese in cross-language IR?

Jian-Yun Nie

University of Montreal, Canada

IRF 2

Outline

General characteristics of Chinese

Monolingual IR in Chinese

CLIR with Chinese

OOV: an important problem in Chinese IR

Solutions?

IRF 3

1. General characteristics of Chinese

Sentence = ideograms with no separation

它是一种适于在拖拉机使用的转向球接头，… (It is a steering ball joint suitable for use on tractors, …)

Words?

它 / 是 / 一种 / 适于 / 在 / 拖拉机 / 使用 / 的 / 转向 / 球 / 接头 / ，…

IRF 4

Word formation

Each character can be a word (人, person)

Most words are composed of two or more characters (人群, crowd)

However, there is no clear definition of the notion of word

  办公楼 (office building): / 办公楼 / or / 办公 / 楼 /?
  Inconsistency in manual segmentation
  Many new words are created (abbreviations)

E.g. 网络 (network) + 管理员 (administrator) → 网管 (webmaster)

IRF 5

2. IR using word segmentation

Segmentation uses rules, dictionaries and/or statistics

Problems for information retrieval

Segmentation ambiguity: more than one possible segmentation, e.g. “发展中国家”

  发展中 (developing) / 国家 (country)
  发展 (development) / 中 (middle) / 国家 (country)
  发展 (development) / 中国 (China) / 家 (family)

Different words have similar meanings

  接头 (connector, plug) ↔ 插头 (plug) ↔ 插座 (socket)

New words can be formed quite freely

  接 (reception) 桶 (bucket): not a common word, but can be used
  网 (network) 店 (store): more and more used…
  的 (of, taxi) 车 (car): taxi car (?), car of (someone)…
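The segmentation ambiguity described on this slide can be reproduced with a few lines of code. The sketch below enumerates every way to split a string into entries of a dictionary. The toy lexicon is an assumption chosen for illustration; it is not the dictionary used in the experiments.

```python
def all_segmentations(text, lexicon):
    """Return every way to split `text` into words found in `lexicon`."""
    if not text:
        return [[]]  # one way to segment the empty string: no words
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in lexicon:
            # Keep this word and segment the remainder recursively.
            for rest in all_segmentations(text[end:], lexicon):
                results.append([head] + rest)
    return results


# Hypothetical toy lexicon; a real segmenter would use a large dictionary.
LEXICON = {"发展", "发展中", "中", "中国", "国家", "家"}

segmentations = all_segmentations("发展中国家", LEXICON)
for words in segmentations:
    print(" / ".join(words))
```

With this lexicon the function returns the three segmentations listed on the slide. A dictionary alone cannot choose between them, which is why rules or statistics are added in practice.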

IRF 6

Alternative: n-grams

Usually unigrams and bigrams

As effective as using word segmentation

Account for some flexibility

However

  Noise: non-meaningful combinations
  Wrong combinations

Example: 非酿造型啤酒 (non-brewed beer)

  Words: 非 / 酿造 / 型 / 啤酒
  Bigrams: 非酿 / 酿造 / 造型 / 型啤 / 啤酒

  造型: a wrong combination (it means style, appearance, …)
  非酿, 型啤: non-meaningful
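Overlapping character n-grams are simple to generate, which is part of their appeal. A minimal sketch, using the phrase from this slide (the function name `char_ngrams` is illustrative):

```python
def char_ngrams(text, n):
    """Return the overlapping character n-grams of `text`, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


phrase = "非酿造型啤酒"
unigrams = char_ngrams(phrase, 1)
bigrams = char_ngrams(phrase, 2)

print(" / ".join(unigrams))
print(" / ".join(bigrams))
```

The bigram list contains the useful units 酿造 and 啤酒, but also 造型, which is a real word with an unrelated meaning, and the non-words 非酿 and 型啤. No dictionary is needed, at the price of this noise.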

IRF 7

Chinese monolingual IR

Possible approach: combining words and n-grams

Example: 前年收入有所下降 (income the year before last declined somewhat)

  Word: 前年 / 收入 / 有所 / 下降, or: 前 / 年收入 / 有所 / 下降
  Unigram: 前 / 年 / 收 / 入 / 有 / 所 / 下 / 降
  Bigram: 前年 / 年收 / 收入 / 入有 / 有所 / 所下 / 下降

Score function in language modeling: similar to other languages

Previous results: Word ≈ bigram > unigram
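The slides do not give the scoring formula, so the following is only one plausible reading of "combining" index units: score the document separately on bigrams and on unigrams with a query-likelihood language model, then interpolate the two scores linearly. The Dirichlet smoothing, the add-one collection estimate, the value of `mu`, the default weights and the toy documents are all assumptions made for this sketch.

```python
import math
from collections import Counter


def char_ngrams(text, n):
    """Return the overlapping character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def query_log_likelihood(query_units, doc_units, coll_counts, mu=10.0):
    """Log P(query | document) with Dirichlet smoothing (toy-sized mu)."""
    doc_counts = Counter(doc_units)
    coll_len = sum(coll_counts.values())
    vocab = len(coll_counts)
    score = 0.0
    for unit in query_units:
        # Add-one collection estimate so unseen units never give log(0).
        p_coll = (coll_counts.get(unit, 0) + 1) / (coll_len + vocab + 1)
        p_doc = (doc_counts.get(unit, 0) + mu * p_coll) / (len(doc_units) + mu)
        score += math.log(p_doc)
    return score


def combined_score(query, doc, docs, weight_bigram=0.3, weight_unigram=0.7):
    """Weighted sum of the bigram-based and unigram-based scores."""
    total = 0.0
    for n, weight in ((2, weight_bigram), (1, weight_unigram)):
        coll_counts = Counter(u for d in docs for u in char_ngrams(d, n))
        total += weight * query_log_likelihood(
            char_ngrams(query, n), char_ngrams(doc, n), coll_counts
        )
    return total


# Hypothetical two-document collection and query.
DOCS = ["前年收入有所下降", "今年天气很好"]
scores = {doc: combined_score("收入下降", doc, DOCS) for doc in DOCS}
best = max(scores, key=scores.get)
print(best)
```

Interpolating raw log-likelihoods is a simplification, since the two unit types produce different numbers of query terms. A word-based score could be added as a third component in the same way, given a segmenter.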

IRF 8

Our recent tests

Chinese monolingual IR (query: title). W = words, B = bigrams, U = unigrams.

Collection  W      B      U      WU     BU     0.3W+0.7U  0.3B+0.7U  W+B+U
TREC5       .2585  .2698  .3012  .3298  .3074  .3123      .3262      .3273
TREC6       .3861  .3628  .3580  .4220  .3897  .4090      .3880      .4068
NTCIR3      .2609  .2492  .2496  .2606  .2820  .2754      .2840      .2862
NTCIR4      .1996  .2164  .2371  .2254  .2350  .2431      .2429      .2387
NTCIR5      .2974  .3151  .3390  .3118  .3246  .3452      .3508      .3470
Average     .2805  .2827  .2970  .3099  .3077  .3170      .3184      .3212

IRF 9

Why is this useful?

NTCIR 5 Topic 18: 烟草商诉讼赔偿 (tobacco company, suit, compensation)

  Word: 烟草商 (tobacco company) / 诉讼 (suit) / 赔偿 (compensation)
  Unigram (0.7659) > Word (0.1625)
  The relevant documents include the words 烟草, 公司, 业者, 香烟, 烟商, but these cannot match “烟草商”.

NTCIR 5 Topic 24: 经济舱综合症候群航班 (economy class, syndrome, flight)

  Word: 经济 (economy) / 综合症 (syndrome) / 候 (wait) / 航班 (flight)
  Ubigram (.7607) > Word (0.0002)
  “..综合症候..” is segmented into “../综合症/候/..”, so it cannot match “症候” (syndrome).

The combination of words with unigrams or bigrams helps
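The Topic 24 failure can be illustrated with plain set overlap. In the sketch below the query segmentation is the one reported on the slide, while the document text and its segmentation are hypothetical, made up to show the mismatch.

```python
def char_ngrams(text, n):
    """Return the overlapping character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


query_text = "经济舱综合症候群航班"
query_words = ["经济", "综合症", "候", "航班"]  # segmentation from the slide

# Hypothetical document fragment and segmentation (assumption for illustration).
doc_text = "经济舱症候群"
doc_words = ["经济舱", "症候", "群"]

word_overlap = set(query_words) & set(doc_words)
bigram_overlap = set(char_ngrams(query_text, 2)) & set(char_ngrams(doc_text, 2))

print(sorted(word_overlap))
print(sorted(bigram_overlap))
```

Under these assumptions the word-based representations share no term, whereas the bigram representations share 症候 among others, so an n-gram index can still retrieve the document.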

IRF 10

Also works for Korean and Japanese?

Mean Average Precision (MAP) by run, under rigid and relaxed relevance judgments

Run        U              B              W              BU             WU             0.3B+0.7U
           Rigid  Relax   Rigid  Relax   Rigid  Relax   Rigid  Relax   Rigid  Relax   Rigid  Relax
C-C-T-N4   .1929  .2370   .1670  .2065   .1679  .2131   .1928  .2363   .1817  .2269   .1979  .2455
C-C-T-N5   .3302  .3589   .2713  .3300   .2676  .3315   .2974  .3554   .3017  .3537   .3300  .3766
J-J-T-N4   .2377  .2899   .2768  .3670   −      −       .2807  .3722   −      −       .2873  .3664
J-J-T-N5   .2376  .2730   .2471  .3273   −      −       .2705  .3458   −      −       .2900  .3495
K-K-T-N4   .2004  .2147   .3873  .4195   −      −       .4084  .4396   −      −       .3608  .3889
K-K-T-N5   .2603  .2777   .3699  .3996   −      −       .3865  .4178   −      −       .3800  .4001

IRF 11

3. CLIR: query translation

Machine translation: rules + dictionaries

Statistical translation model:

  Parallel texts
  Automatically extract possible translations

Comparison

  A statistical TM does not produce human-readable translations
  But it can include related words

Usually, word-based translation

IRF 12

Our recent tests: also translate into n-grams

English word → Chinese word, Chinese unigram, Chinese bigram, or bigram & unigram

Parallel sentence pair: “history and civilization” || “历史文明” …

Word-to-bigram:

  history / and / civilization || 历史 / 史文 / 文明 …
  GIZA++ training
  TM (word-to-bigram): p(历史|history), p(史文|history), p(文明|history), …

Word-to-unigram:

  history / and / civilization || 历 / 史 / 文 / 明 …
  GIZA++ training
  TM (word-to-unigram): p(历|history), p(史|history), p(文|history), …
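To make the word-to-bigram idea concrete, here is a minimal sketch of IBM Model 1 trained with EM on a toy corpus whose Chinese side has been cut into overlapping bigrams. GIZA++ implements the IBM models along with further refinements; this sketch covers Model 1 only, omits the NULL word, and uses a three-pair corpus invented for illustration, so it is not the training setup of the experiments.

```python
from collections import defaultdict


def char_ngrams(text, n):
    """Return the overlapping character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def train_ibm_model1(pairs, iterations=10):
    """Estimate t[(f, e)] = p(f | e) from (english_tokens, chinese_units) pairs."""
    target_vocab = {f for _, fs in pairs for f in fs}
    uniform = 1.0 / len(target_vocab)
    t = defaultdict(lambda: uniform)  # uniform initialisation
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        # E-step: distribute each Chinese unit over the English words.
        for es, fs in pairs:
            for f in fs:
                norm = sum(t[(f, e)] for e in es)
                for e in es:
                    fraction = t[(f, e)] / norm
                    count[(f, e)] += fraction
                    total[e] += fraction
        # M-step: renormalise the expected counts per English word.
        t = defaultdict(float, {fe: c / total[fe[1]] for fe, c in count.items()})
    return t


# Hypothetical toy parallel corpus (English words || Chinese bigrams).
PAIRS = [
    ("history".split(), char_ngrams("历史", 2)),
    ("civilization".split(), char_ngrams("文明", 2)),
    ("history and civilization".split(), char_ngrams("历史文明", 2)),
]

table = train_ibm_model1(PAIRS)
for unit in ("历史", "史文", "文明"):
    print(unit, round(table[(unit, "history")], 3))
```

On this corpus most of the probability mass for "history" goes to 历史. Replacing `char_ngrams(..., 2)` with `char_ngrams(..., 1)` gives the word-to-unigram variant.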

IRF 13

Combining different translations

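English query → Chinese documents

The English query $Q$ is translated into three Chinese queries, each matched against the documents $D$ indexed with the same type of unit ($D_U$, $D_B$, $D_W$):

$Q_U:\ \sum_j t(u_i \mid e_j)\, P(e_j \mid Q)$ (unigrams, matched against $D_U$)

$Q_B:\ \sum_j t(b_i \mid e_j)\, P(e_j \mid Q)$ (bigrams, matched against $D_B$)

$Q_W:\ \sum_j t(w_i \mid e_j)\, P(e_j \mid Q)$ (words, matched against $D_W$)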
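The translation step on this slide weights each Chinese unit by summing, over the English query words, the translation probability times the weight of the word in the query. A minimal sketch follows; the function name, the data layout and all probability values are invented for illustration.

```python
from collections import defaultdict


def translate_query(query_model, trans_table, top_k=None):
    """Weight each Chinese unit c by sum_j t(c | e_j) * P(e_j | Q).

    query_model: {english_word: P(e_j | Q)}
    trans_table: {english_word: {chinese_unit: t(c | e_j)}}
    """
    weights = defaultdict(float)
    for e, p_e in query_model.items():
        for c, t_ce in trans_table.get(e, {}).items():
            weights[c] += t_ce * p_e
    ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)
    return dict(ranked[:top_k])  # top_k=None keeps every unit


# Made-up numbers, not taken from any trained model.
QUERY_MODEL = {"history": 0.5, "civilization": 0.5}
BIGRAM_TABLE = {
    "history": {"历史": 0.8, "史文": 0.2},
    "civilization": {"文明": 0.7, "史文": 0.3},
}

bigram_query = translate_query(QUERY_MODEL, BIGRAM_TABLE)
print(bigram_query)
```

The same function serves the unigram, bigram and word translation models; only the table passed in changes. The resulting weighted queries can then be scored against the matching document index.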

IRF 14

Bilingual linguistic resources for CLIR

An English-Chinese parallel corpus mined from the Web

  About 281,000 parallel sentence pairs

LDC English-Chinese bilingual dictionaries

  42,000 entries

Translation model

  Combination of the 2 translation models

IRF 15

CLIR results

English→Chinese CLIR. W = words, B = bigrams, U = unigrams.

Collection  W      B      U      WU     BU     0.3W+0.7U  0.3B+0.7U
TREC5       .1904  .2003  .1922  .2448  .2277  .2158      .2251
TREC6       .2047  .2293  .2602  .2670  .2772  .2672      .2822
NTCIR3      .1288  .1017  .1536  .1628  .1504  .1619      .1495
NTCIR4      .0956  .0953  .1382  .1410  .1308  .1337      .1286
NTCIR5      .1158  .1323  .1762  .1532  .1462  .1682      .1602
Average     .1470  .1518  .1841  .1938  .1865  .1894      .1891

IRF 16

General observations for Chinese IR

Use both words and n-grams for Chinese IR and for query translation into Chinese

N-grams can account for flexibility in Chinese words

CLIR with Chinese can also benefit from translations into Chinese n-grams

IRF 17

4. OOV problem in Chinese

OOV (Out-Of-Vocabulary) problem

  TREC queries: 63% of named entities are OOV
  Even more on the Web
  Specialized terms (abbreviations)
  New words
  Impossible to collect all terms manually

Solutions

  Parallel texts (translations by n-grams)
  Monolingual corpus

IRF 18

Translation of named entities

Statistical transliteration

Frances Taylor → 弗朗西斯泰勒, 茀琅希思泰勒, 弗郎西丝泰勒, …

IRF 19

IRF 20

Candidate extraction

Four templates to extract candidates

  1. c1c2..cn (En)
  2. c1c2..cn , En, c’1c’2..c’m
  3. c1c2..cn: En
  4. c1c2..cn 是 / 即 En (是 = "is", 即 = "namely")

Comparing the four templates: template 1 is used in the following experiments

Template  Percentage  Precision
1         17.65%      54%
2         68.35%      6.5%
3         9.05%       2.5%
4         4.94%       1%

Table 2: Comparing Precision of the Four Templates
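Template 1, a run of Chinese characters followed by an English term in parentheses, maps naturally onto a regular expression. The sketch below is an assumption about how such extraction could look, not the implementation behind Table 2. Because the left boundary of the translation is unknown, it proposes every suffix of the Chinese run as a candidate; the pattern, the `max_len` limit and the sample sentence are illustrative.

```python
import re

# Chinese run, then an English term in half-width or full-width parentheses.
TEMPLATE_1 = re.compile(
    r"([\u4e00-\u9fff]+)\s*[(（]\s*([A-Za-z][A-Za-z .'-]*?)\s*[)）]"
)


def extract_candidates(text, max_len=8):
    """Return (english_term, candidate_list) pairs found by template 1."""
    results = []
    for match in TEMPLATE_1.finditer(text):
        chinese, english = match.group(1), match.group(2)
        tail = chinese[-max_len:]  # only the characters closest to the bracket
        # Every suffix of the run is a possible translation of the English term.
        candidates = [tail[i:] for i in range(len(tail))]
        results.append((english, candidates))
    return results


# Hypothetical sentence: "The actor Frances Taylor attended the event."
sample = "演员弗朗西斯泰勒(Frances Taylor)出席了活动"
extracted = extract_candidates(sample)
for english, candidates in extracted:
    print(english, candidates)
```

The correct candidate 弗朗西斯泰勒 appears in the list next to wrong ones such as 演员弗朗西斯泰勒, so a later step, for example a translation or transliteration model, is still needed to pick among them.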

IRF 21

Translation model

Train a translation model

Candidate List

IRF 22

Dictionary Mining Results

Processed more than 300 GB of Chinese web pages

161,117 translation pairs are mined

Translation %  Transliteration %  Accuracy %
53.55          46.45              90.15

Table 4: Accuracy of Mined Dictionary

IRF 23

Coverage of the Dictionary on Query Log Data

9,065 popular English terms from the MSN Chinese search engine

IRF 24

CLIR experiment

IRF 25

Conclusions

In addition to the general approaches, Chinese IR should also consider the characteristics of the language (this also holds for other Asian languages: Japanese and Korean)

Difficulty in translating new (technical) words and proper names

  Exploit parallel/comparable or monolingual texts

Additional problem: make the retrieved document readable

  Full text translation
    Running sentences in patents: relatively easy
    Technical terms: may be difficult with Chinese
  Gisting: translation assistance tool, useful for a user with some knowledge of the document language