Advisor: Hsin-Hsi Chen Speaker: Yong-Sheng Lo Date: 2008/03/14 SIGHAN - 2006


Page 1:

Advisor: Hsin-Hsi Chen
Speaker: Yong-Sheng Lo

Date: 2008/03/14

SIGHAN - 2006

Page 2:

Introduction

The Third SIGHAN Chinese Language Processing Bakeoff (Bakeoff-2006)

Chinese Word Segmentation on CKIP (the Academia Sinica corpus)

Closed Test – the highest F-measure: this paper (0.958)

Open Test – the highest F-measure: this paper (0.959)

Page 3:

Conditional Random Fields (CRF)

CRF is a statistical sequence modeling framework.

Peng et al. (2004) first applied this framework to Chinese word segmentation by treating it as a binary decision task: each Chinese character is labeled either as the beginning of a word (B) or not (I).

Λ = {λ1, λ2, …} : model parameters (feature weights)
y = {y1, …, yT} : tag sequence
x = {x1, …, xT} : Chinese character sequence
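The formula that these symbols belong to did not survive in the transcript. For reference, the standard linear-chain CRF form, written with the same Λ, y and x, is sketched below. The feature functions f_k and the normalizer Z(x) are names assumed here; they do not appear in the slide text.

```latex
P_{\Lambda}(y \mid x)
  = \frac{1}{Z(x)}
    \exp\!\left( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, x, t) \right),
\qquad
Z(x) = \sum_{y'} \exp\!\left( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, x, t) \right)
```

Each λ_k in Λ weights one feature function, and Z(x) sums over all candidate tag sequences y' so that the probabilities of all taggings of x add up to 1.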

Page 4:

Conditional Random Fields

Example

Training format:
王建民 / 是 / 真 / 男人 ("Wang Chien-Ming is a real man")
王 B 建 I 民 I 是 B 真 B 男 B 人 I

Feature template:
Unigram: Cn, n = -1, 0, 1
Bigram: Cn Cn+1, n = -1, 0

Testing input: 王 建 民 是 真 男 人
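The training format and feature template on this slide can be sketched in Python as follows. This is a minimal illustration, not the authors' tooling: the function names, the feature-string format ("U0=…", "B0=…") and the "<PAD>" placeholder for out-of-sentence positions are all assumptions made for the example.

```python
from typing import List, Tuple


def words_to_bi_tags(words: List[str]) -> List[Tuple[str, str]]:
    """Label each character B (begins a word) or I (any other position)."""
    tagged = []
    for word in words:
        for i, ch in enumerate(word):
            tagged.append((ch, "B" if i == 0 else "I"))
    return tagged


def template_features(chars: List[str], t: int) -> List[str]:
    """Unigram Cn (n = -1, 0, 1) and bigram Cn Cn+1 (n = -1, 0) at position t."""

    def c(n: int) -> str:
        i = t + n
        # "<PAD>" is an assumed placeholder for positions outside the sentence.
        return chars[i] if 0 <= i < len(chars) else "<PAD>"

    feats = [f"U{n}={c(n)}" for n in (-1, 0, 1)]
    feats += [f"B{n}={c(n)}{c(n + 1)}" for n in (-1, 0)]
    return feats


tagged = words_to_bi_tags(["王建民", "是", "真", "男人"])
chars = [ch for ch, _ in tagged]
print(tagged)
print(template_features(chars, 0))
```

Each character therefore contributes five feature strings: three unigrams and two bigrams drawn from a window of one character on either side.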

Page 5:

Conditional Random Fields

Example

Training format:
王建民 / 是 / 真 / 男人
王 B 建 I 民 I 是 B 真 B 男 B 人 I

Feature template:
Unigram: Cn, n = -1, 0, 1
Bigram: Cn Cn+1, n = -1, 0

Testing result: 王 B 建 I 民 I 是 B 真 B 男 B 人 I
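Once every character carries a B or I tag, the segmentation is recovered by starting a new word at each B. A minimal sketch (the function name is hypothetical):

```python
from typing import List


def bi_tags_to_words(chars: List[str], tags: List[str]) -> List[str]:
    """Rebuild words from per-character B/I tags."""
    words: List[str] = []
    for ch, tag in zip(chars, tags):
        if tag == "B" or not words:
            # B opens a new word; a leading I is treated as a word start too.
            words.append(ch)
        else:
            words[-1] += ch
    return words


chars = list("王建民是真男人")
tags = ["B", "I", "I", "B", "B", "B", "I"]
segmented = bi_tags_to_words(chars, tags)
print(" / ".join(segmented))
```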

Page 6:

Tag Set Selection

Previous work uses two kinds of schemes to distinguish the position of a character within a word:

Xue and Ng: maximum entropy model (4-tag set)

Peng and Tseng: CRF model (2-tag set)

Page 7:

Tag Set Selection

In this paper: to tag long words effectively, the 4-tag set of Ng / Xue is extended to a 6-tag set:
Begin (B), Middle (M), End (E), Single (S), B2, B3

B2 and B3 mark the second and third characters of a word.

Long-word example: 故宮博物院 (National Palace Museum) == { B, B2, B3, M, E }
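The 6-tag scheme can be sketched as a function from a word to its tag sequence. The slides only show the 2-, 4- and 5-character cases; the handling of 1-character words (S), 3-character words (B B2 E) and words longer than five characters (extra M tags) is an assumption that follows the same pattern.

```python
from typing import List


def six_tags(word: str) -> List[str]:
    """Tag a word with the 6-tag set {B, B2, B3, M, E, S}."""
    n = len(word)
    if n == 0:
        return []
    if n == 1:
        return ["S"]
    # Up to three leading tags, always leaving the last character for E.
    head = ["B", "B2", "B3"][: min(n - 1, 3)]
    middle = ["M"] * (n - 1 - len(head))
    return head + middle + ["E"]


for w in ["中國", "中國大陸", "故宮博物院"]:
    print(w, six_tags(w))
```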

Page 8:

Feature Template For Closed Test

Page 9:

Feature Template For Closed Test

Code e: date, digit and letter classes
Class 1 = Number
Class 2 = Date
Class 3 = English letter
Class 4 = Other

Code f: tone
For example: 中, 國, 很, 大, 嗎 == { 1, 2, 3, 4, 0 }
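The two character-level features above can be sketched as simple lookups. The slides do not say which characters fall into each class, so the character sets below are assumptions chosen for illustration, and the tone table contains only the five characters from the slide's example (a real system would need a full pronunciation lexicon).

```python
# Assumed membership lists; the slides do not enumerate them.
ASSUMED_NUMBER_CHARS = set("0123456789零一二三四五六七八九十百千萬")
ASSUMED_DATE_CHARS = set("年月日")

# Toy tone table copied from the slide's example only (0 = neutral tone).
TONE_EXAMPLE = {"中": 1, "國": 2, "很": 3, "大": 4, "嗎": 0}


def char_class(ch: str) -> int:
    """Code e: 1 = number, 2 = date, 3 = English letter, 4 = other."""
    if ch in ASSUMED_NUMBER_CHARS:
        return 1
    if ch in ASSUMED_DATE_CHARS:
        return 2
    if ch.isascii() and ch.isalpha():
        return 3
    return 4


print([char_class(ch) for ch in "7年k中"])
print([TONE_EXAMPLE[ch] for ch in "中國很大嗎"])
```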

Page 10:

Feature Template For Open Test

(1) External Dictionary

Uses the online dictionary from Peking University, which consists of about 108,000 words of 1 to 4 characters in length.

If some sequence of neighboring characters around C0 in the sentence matches a word in this dictionary, greedily choose the longest such matching word W.

For example:
中國大陸 ("mainland China") => 中國 大陸 => { B E B E }
If the dictionary contains 『中國大陸』, then 中國大陸 => { B B2 B3 E }
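The greedy longest-match lookup around C0 can be sketched as below. The function name, the toy dictionaries and the tie-breaking rule (leftmost match among words of equal length) are assumptions; the slides only state that the longest matching word W is chosen.

```python
from typing import Optional, Set, Tuple


def longest_match_around(sentence: str, pos: int, dictionary: Set[str],
                         max_len: int = 4) -> Optional[Tuple[int, str]]:
    """Return (start, word) for the longest dictionary word covering sentence[pos]."""
    for length in range(max_len, 0, -1):  # try the longest candidates first
        first = max(0, pos - length + 1)
        last = min(pos, len(sentence) - length)
        # Every window [start, start + length) in this range contains pos.
        for start in range(first, last + 1):
            candidate = sentence[start:start + length]
            if candidate in dictionary:
                return start, candidate
    return None


small_dict = {"中國", "大陸"}
large_dict = small_dict | {"中國大陸"}
print(longest_match_around("中國大陸", 1, small_dict))
print(longest_match_around("中國大陸", 1, large_dict))
```

With the smaller dictionary the character 國 is matched inside 中國; once 中國大陸 is listed, the four-character word wins, which mirrors the two taggings shown in the example.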

Page 11:

Feature Template For Open Test

(1) External Dictionary (cont.)

Page 12:

Feature Template For Open Test

(2) Assistant Segmenter

Idea (observation): most words are still segmented in the same way under different segmentation standards.

Thus, although segmenters trained on different corpora follow somewhat different segmentation rules, they can be combined as a main segmenter plus assistant segmenters.

A feature template is added for each assistant segmenter: t(C0), the output tag of the assistant segmenter for C0 (e.g., B).
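The t(C0) template can be sketched as one extra feature per character and per assistant segmenter. Everything named here is hypothetical: the `Tagger` type, the feature-string format, and `toy_assistant`, which is a stand-in that pretends every word has two characters rather than a real trained segmenter.

```python
from typing import Callable, List

# An assistant segmenter is modeled as a function from characters to tags.
Tagger = Callable[[List[str]], List[str]]


def add_assistant_features(chars: List[str],
                           base_features: List[List[str]],
                           assistants: List[Tagger]) -> List[List[str]]:
    """Append t(C0), each assistant's output tag, to every character's features."""
    enriched = [list(feats) for feats in base_features]
    for k, assistant in enumerate(assistants):
        tags = assistant(chars)
        for t, tag in enumerate(tags):
            enriched[t].append(f"A{k}:t0={tag}")
    return enriched


def toy_assistant(chars: List[str]) -> List[str]:
    # Stand-in only: tags characters B, I, B, I, ... regardless of content.
    return ["B" if i % 2 == 0 else "I" for i in range(len(chars))]


chars = list("中國大陸")
base = [[f"U0={ch}"] for ch in chars]
enriched = add_assistant_features(chars, base, [toy_assistant])
print(enriched)
```

The main segmenter is then trained on the enriched features, so it can learn how far to trust each assistant's tag.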

Page 13:

Feature Template For Open Test

(2) Assistant Segmenter (cont.)

All other segmenters, trained on all corpora from Bakeoff-2003, 2005 and 2006 with the feature set used in the closed test, are integrated as assistant segmenters.

The segmenter MSRSeg, described in (Gao, 2003), is also integrated.

Page 14:

Feature Template For Open Test

Assistant segmenter method vs. the additional training corpus method

The performance of the additional corpus method depends on the performance of the trained segmenter that carries out the corpus extraction task.

If that segmenter is not well trained, it cannot effectively extract the most wanted additional corpus.

The additional corpus method can only integrate a useful corpus; it cannot integrate a well-trained segmenter whose training corpus is not accessible.

An additional corpus is very difficult to use in a CRF model: enlarging the corpus can lead to a dramatic increase in memory and time consumption.

Page 15:

Evaluation Results

Page 16:

Evaluation Results