ACL-IJCNLP 2015

The 53rd Annual Meeting of the Association for Computational Linguistics
and the 7th International Joint Conference on Natural Language Processing

Proceedings of the Eighth SIGHAN Workshop on Chinese Language Processing

July 30-31, 2015
Beijing, China
©2015 The Association for Computational Linguistics and The Asian Federation of Natural Language Processing

Order copies of this and other ACL proceedings from:

Association for Computational Linguistics (ACL)
209 N. Eighth Street
Stroudsburg, PA 18360
USA
Tel: +1-570-476-8006
[email protected]

ISBN 978-1-941643-57-0
Preface

Welcome to the Eighth SIGHAN Workshop on Chinese Language Processing! Sponsored by the Association for Computational Linguistics (ACL) Special Interest Group on Chinese Language Processing (SIGHAN), this year's SIGHAN-8 workshop is being held in Beijing, China, on July 30-31, 2015, and is co-located with ACL-IJCNLP 2015. The workshop program includes three keynote speeches, research paper presentations, and two Bake-offs. We hope that these events will bring together researchers and practitioners to share ideas and developments in various aspects of Chinese language processing.

We received 17 valid submissions, each of which was assigned to three reviewers. After a rigorous review process, we accepted 5 papers for oral presentation (a 30% acceptance rate) and 6 papers for poster presentation, for an overall acceptance rate of 65%.
We are honored to welcome our distinguished speakers. Dr. Min Zhang (Distinguished Professor, Soochow University, China) and Rou Song (Professor, Beijing Language and Culture University, China) will give the first keynote speech, "Discourse and Machine Translation." Yanxiong Lu and Lianqiang Zhou (WeChat Pattern Recognition Center, Tencent) will speak on "Intelligent Q&A System and NLP Open Platform." Finally, Dr. Lun-Wei Ku (Assistant Research Fellow, Academia Sinica, Taiwan) will speak on "From Lexical to Compositional Chinese Sentiment Analysis."
We would also like to thank the Bake-off organizers. The first task, Chinese Spelling Check, was organized by Dr. Yuen-Hsien Tseng (National Taiwan Normal University), Dr. Lung-Hao Lee (National Taiwan Normal University), Dr. Li-Ping Chang (National Taiwan Normal University), and Dr. Hsin-Hsi Chen (National Taiwan University). The second task, Topic-Based Chinese Message Polarity Classification, was organized by Dr. Xiangwen Liao (Fuzhou University, China), Dr. Ruifeng Xu (Harbin Institute of Technology, China), Dr. Binyang Li (University of International Relations, China), and Dr. Liheng Xu (Institute of Automation, Chinese Academy of Sciences, China). A total of sixteen teams participated in these two tasks and achieved good results.
Finally, we would like to thank all authors for their submissions. We appreciate your active participation and support in ensuring a smooth and successful conference. The publication of these papers represents the joint effort of many researchers; we are grateful to the review committee for their work and to the SIGHAN committee for their continuing support. We wish all a rewarding and eye-opening time at the workshop.

SIGHAN-8 Workshop Co-organizers
Liang-Chih Yu, Yuan Ze University
Zhifang Sui, Peking University
Yue Zhang, Singapore University of Technology and Design
Vincent Ng, University of Texas at Dallas
Organizing Committee

Organizers:

Liang-Chih Yu, Yuan Ze University
Zhifang Sui, Peking University
Yue Zhang, Singapore University of Technology and Design
Vincent Ng, University of Texas at Dallas

SIGHAN Committee:

Chengqing Zong, Chinese Academy of Sciences
Min Zhang, Soochow University
Gina-Anne Levow, University of Washington
Nianwen Xue, Brandeis University
Program Committee:

Chia-Hui Chang, National Central University
Li-Ping Chang, National Taiwan Normal University
Wanxiang Che, Harbin Institute of Technology
Hsin-Hsi Chen, National Taiwan University
Kuang-hua Chen, National Taiwan University
Xiangyu Duan, Soochow University
Xianpei Han, Chinese Academy of Sciences
Xuanjing Huang, Fudan University
Jing Jiang, Singapore Management University
Chunyu Kit, City University of Hong Kong
Wai Lam, Chinese University of Hong Kong
Chao-Hong Liu, Dublin City University
Lung-Hao Lee, National Taiwan University
Haizhou Li, Institute for Infocomm Research
Jyun-Jie Lin, Yuan Ze University
Yang Liu, Tsinghua University
Xiangwen Liao, Fuzhou University
Jianyun Nie, University of Montreal
Likun Qiu, Ludong University
Fuji Ren, The University of Tokushima
Weiwei Sun, City University of Hong Kong
Yuen-Hsien Tseng, National Taiwan Normal University
Hsin-Min Wang, Academia Sinica
Kun Wang, Chinese Academy of Sciences
Derek F. Wong, University of Macau
Chung-Hsien Wu, National Cheng Kung University
Ruifeng Xu, Harbin Institute of Technology
Chin-Sheng Yang, Yuan Ze University
Jui-Feng Yeh, National Chiayi University
Guodong Zhou, Soochow University
Qiang Zhou, Tsinghua University
Jingbo Zhu, Northeastern University
Invited Talk: Discourse and Machine Translation

ZHANG Min, Soochow University, China
SONG Rou, Beijing Language and Culture University, China

Abstract

Discourse in linguistics refers to a unit of language longer than a single sentence. It has not been well studied in the computational linguistics research community, but it has attracted more and more attention in recent years. This talk consists of two parts: discourse, and machine translation. We will first give an overview of discourse and review the state of the art in discourse research from both linguistic and computational viewpoints, and then discuss how machine translation can benefit from discourse-level information. Finally, we conclude the talk with a discussion of future directions.
Biography

ZHANG Min, a distinguished professor and vice dean of the School of Computer Science and Technology and director of the Research Institute for Human Language Technology at Soochow University (China), received his Ph.D. degree in computer science from Harbin Institute of Technology (China) in 1997. He studied and worked overseas in industry and academia in South Korea and Singapore from 1997 to 2013. His current research interests include machine translation and natural language processing. He has co-authored 2 Springer books and more than 130 papers in leading journals and conferences, and co-edited 13 books published by Springer and IEEE. He is an associate editor of IEEE T-ASLP (2015-2017).
SONG Rou, a professor and Ph.D. supervisor in Applied Linguistics and Computer Application at Beijing Language and Culture University, received his Bachelor's degree in mathematics and mechanics from Beijing University in 1968 and his Master's degree in computer science from Beijing University in 1981. He has been working on Chinese information processing for decades as the PI of more than 10 national-level projects, with research focuses on discourse analysis, Chinese word segmentation, computer-aided proofreading, Chinese word attributes, Chinese orthographic computing, Chinese POS, and so on. He has published more than 100 papers in leading journals and conferences in computer science and linguistics. He has developed and commercialized several software systems and holds two patents. He has received several awards from Beijing City and the MOE, China. He has been appointed as a guest professor at several domestic and overseas universities and research institutes.
Invited Talk: Intelligent Q&A System and NLP Open Platform

LU Yanxiong and ZHOU Lianqiang
WeChat Pattern Recognition Center, Tencent

Abstract

Building a general Q&A system that can handle any subject is a very challenging AI task. Internet social platforms accumulate large numbers of active users and large amounts of UGC (user-generated content) data, which become valuable crowdsourcing resources. In this talk, we will discuss the opportunity of using WeChat crowdsourcing resources to build an intelligent Q&A system, as well as some open questions and challenges on this topic.

The Tencent open platform "Wen Zhi" provides comprehensive natural language processing APIs, including lexical, syntactic, semantic, and paragraph-level modules. It also provides web crawling, data extraction, and transcoding services. In this talk we will give an overview of the Tencent NLP open platform as well as the techniques behind it.
Biography

LU Yanxiong is a senior researcher at the WeChat Pattern Recognition Center, Tencent. He has been working on search query analysis, Q&A systems, and NLP-related projects at Tencent. His current work focuses on WeChat semantic analysis. His research interests include search engines, machine learning, NLP, and big data analysis. Before joining Tencent, Yanxiong worked at Baidu; he graduated from Xidian University with a Master's degree.

ZHOU Lianqiang has been working in the field of NLP and machine learning at Tencent, on problems such as search query rewriting, user interest mining, and word segmentation. He is now a senior researcher and team leader of the NLP research group in the Tencent Intelligent Computing and Search Lab. Before joining Tencent, Lianqiang worked at several Internet companies; he received his Master's degree from Harbin Institute of Technology.
Invited Talk: From Lexical to Compositional Chinese Sentiment Analysis

KU Lun-Wei
Academia Sinica, Taiwan

Abstract

Sentiment analysis determines the polarity and strength of sentiment-bearing expressions, and it has been an important and attractive research area due to its close affinity to applications. In past research, sentiment analysis depended heavily on lexical semantics. However, sentiment analysis demands an understanding of context, and shallow features such as bag-of-words cannot fulfill this need. As a result, compositional semantics, which concerns the construction of meaning based on syntax, has been applied to sentiment analysis through different approaches. In the Chinese language, as morphological structures may represent the compositional semantics inside Chinese words, compositional sentiment analysis can even start from determining the sentiment of morphemes, which will be touched on in this talk.

This talk will begin with some background knowledge of sentiment analysis, such as how sentiments are categorized, where to find available corpora, and which models are commonly applied, especially for the Chinese language. I will describe our work on compositional Chinese sentiment analysis from words to sentences. All of our recently developed related resources, including the Chinese Morphological Dataset, the Augmented NTU Sentiment Dictionary (aug-NTUSD), E-HowNet with sentiment information, and the Chinese Opinion Treebank, will also be introduced in this talk. I'll end by describing how we have begun to test our compositional model with word embeddings.
Biography

KU Lun-Wei received her Ph.D. degree in Computer Science and Information Engineering from National Taiwan University. She then joined the Department of Computer Science and Information Engineering, National Yunlin University of Science and Technology (Yuntech), Taiwan, as an assistant professor. In Aug. 2012, she joined the Institute of Information Science, Academia Sinica, as an assistant research fellow. Previously, she was a postdoctoral researcher at the Department of Computer Science and Information Engineering, National Taiwan University, working on the project "Machine learning methods for ranking problems in multilingual information retrieval". She was a project researcher in the Acer Product Value Lab, Taiwan, between Apr. 2003 and May 2004, where she joined a project on speech recognition services for a home media center. She was a software engineer/project manager at NaturalTel, a platform service provider for carriers, where she joined the development of a speech entertainment service platform for Far EasTone (Fetnet), Taiwan. Her international recognition includes a CyberLink Technical Elite Fellowship in 2007, an IBM Ph.D. Fellowship in 2008, the ROCLING Doctoral Dissertation Distinction Award in 2009, and a Good Design Award in 2012. Her research interests include natural language processing, information retrieval, sentiment analysis, and computational linguistics. She has been working on Chinese sentiment analysis since 2005 and was co-organizer of the NTCIR MOAT Task (Multilingual Opinion Analysis Task, traditional Chinese side) from 2006 to 2010. She is also one of the organizers of the SocialNLP workshop, which has been held jointly with IJCNLP 2013, Coling 2014, WWW 2015, and NAACL 2015. This year, she serves as area chair of the sentiment analysis and opinion mining track at The 53rd Annual Meeting of the Association for Computational Linguistics and The 7th International Joint Conference on Natural Language Processing (ACL-IJCNLP 2015), as well as at The 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP 2015). Other professional international activities she has been involved in include Publication Co-Chair of The 6th International Joint Conference on Natural Language Processing (IJCNLP 2013), Publicity Chair of The Twenty-fourth Conference on Computational Linguistics and Speech Processing (Rocling 2012), and Finance Chair of The Sixth Asia Information Retrieval Societies Conference (AIRS 2010).
Table of Contents

Sequential Annotation and Chunking of Chinese Discourse Structure
    Frances Yung, Kevin Duh and Yuji Matsumoto .......... 1

Create a Manual Chinese Word Segmentation Dataset Using Crowdsourcing Method
    Shichang Wang, Chu-Ren Huang, Yao Yao and Angel Chan .......... 7

Chinese Named Entity Recognition with Graph-based Semi-supervised Learning Model
    Aaron Li-Feng Han, Xiaodong Zeng, Derek F. Wong and Lidia S. Chao .......... 15

Sentence selection for automatic scoring of Mandarin proficiency
    Jiahong Yuan, Xiaoying Xu, Wei Lai, Weiping Ye, Xinru Zhao and Mark Liberman .......... 21

ACBiMA: Advanced Chinese Bi-Character Word Morphological Analyzer
    Ting-Hao Huang, Yun-Nung Chen and Lingpeng Kong .......... 26

Introduction to SIGHAN 2015 Bake-off for Chinese Spelling Check
    Yuen-Hsien Tseng, Lung-Hao Lee, Li-Ping Chang and Hsin-Hsi Chen .......... 32

HANSpeller++: A Unified Framework for Chinese Spelling Correction
    Shuiyuan Zhang, Jinhua Xiong, Jianpeng Hou, Qiao Zhang and Xueqi Cheng .......... 38

Word Vector/Conditional Random Field-based Chinese Spelling Error Detection for SIGHAN-2015 Evaluation
    Yih-Ru Wang and Yuan-Fu Liao .......... 46

Introduction to a Proofreading Tool for Chinese Spelling Check Task of SIGHAN-8
    Tao-Hsing Chang, Hsueh-Chih Chen and Cheng-Han Yang .......... 50

Overview of Topic-based Chinese Message Polarity Classification in SIGHAN 2015
    Xiangwen Liao, Binyang Li and Liheng Xu .......... 56

A Joint Model for Chinese Microblog Sentiment Analysis
    Yuhui Cao, Zhao Chen, Ruifeng Xu, Tao Chen and Lin Gui .......... 61

Learning Salient Samples and Distributed Representations for Topic-Based Chinese Message Polarity Classification
    Xin Kang, Yunong Wu and Zhifei Zhang .......... 68

An combined sentiment classification system for SIGHAN-8
    Qiuchi Li, Qiyu Zhi and Miao Li .......... 74

Linguistic Knowledge-driven Approach to Chinese Comparative Elements Extraction
    MinJun Park and Yulin Yuan .......... 79

A CRF Method of Identifying Prepositional Phrases in Chinese Patent Texts
    Hongzheng Li and Yaohong Jin .......... 86

Emotion in Code-switching Texts: Corpus Construction and Analysis
    Sophia Lee and Zhongqing Wang .......... 91

Chinese in the Grammatical Framework: Grammar, Translation, and Other Applications
    Aarne Ranta, Tian Yan and Haiyan Qiao .......... 100

KWB: An Automated Quick News System for Chinese Readers
    Yiqi Bai, Wenjing Yang, Hao Zhang, Jingwen Wang, Ming Jia, Roland Tong and Jie Wang .......... 110

Chinese Semantic Role Labeling using High-quality Syntactic Knowledge
    Gongye Jin, Daisuke Kawahara and Sadao Kurohashi .......... 120

Chinese Spelling Check System Based on N-gram Model
    Weijian Xie, Peijie Huang, Xinrui Zhang, Kaiduo Hong, Qiang Huang, Bingzhou Chen and Lei Huang .......... 128

NTOU Chinese Spelling Check System in Sighan-8 Bake-off
    Wei-Cheng Chu and Chuan-Jie Lin .......... 137

Topic-Based Chinese Message Sentiment Analysis: A Multilayered Analysis System
    Hongjie Li, Zhongqian Sun and Wei Yang .......... 144

Rule-Based Weibo Messages Sentiment Polarity Classification towards Given Topics
    Hongzhao Zhou, Yonglin Teng, Min Hou, Wei He, Hongtao Zhu, Xiaolin Zhu and Yanfei Mu .......... 149

Topic-Based Chinese Message Polarity Classification System at SIGHAN8-Task2
    Chun Liao, Chong Feng, Sen Yang and Heyan Huang .......... 158

CT-SPA: Text sentiment polarity prediction model using semi-automatically expanded sentiment lexicon
    Tao-Hsing Chang, Ming-Jhih Lin, Chun-Hsien Chen and Shao-Yu Wang .......... 164

Chinese Microblogs Sentiment Classification using Maximum Entropy
    Dashu Ye, Peijie Huang, Kaiduo Hong, Zhuoying Tang, Weijian Xie and Guilong Zhou .......... 171

NDMSCS: A Topic-Based Chinese Microblog Polarity Classification System
    Yang Wang, Yaqi Wang, Shi Feng, Daling Wang and Yifei Zhang .......... 180

NEUDM: A System for Topic-Based Message Polarity Classification
    Yaqi Wang, Shi Feng, Daling Wang and Yifei Zhang .......... 185
Workshop Program

Thursday, July 30, 2015

09:00–09:10 Opening Session

09:10–10:30 Invited Talk

            Discourse and Machine Translation
            Min Zhang and Rou Song

10:30–10:50 Coffee Break

10:50–12:30 Workshop Session

10:50–11:10 Sequential Annotation and Chunking of Chinese Discourse Structure
            Frances Yung, Kevin Duh and Yuji Matsumoto

11:10–11:30 Create a Manual Chinese Word Segmentation Dataset Using Crowdsourcing Method
            Shichang Wang, Chu-Ren Huang, Yao Yao and Angel Chan

11:30–11:50 Chinese Named Entity Recognition with Graph-based Semi-supervised Learning Model
            Aaron Li-Feng Han, Xiaodong Zeng, Derek F. Wong and Lidia S. Chao

11:50–12:10 Sentence selection for automatic scoring of Mandarin proficiency
            Jiahong Yuan, Xiaoying Xu, Wei Lai, Weiping Ye, Xinru Zhao and Mark Liberman

12:10–12:30 ACBiMA: Advanced Chinese Bi-Character Word Morphological Analyzer
            Ting-Hao Huang, Yun-Nung Chen and Lingpeng Kong
12:30–14:30 Lunch

14:30–15:30 Invited Talk

            From Lexical to Compositional Chinese Sentiment Analysis
            Lun-Wei Ku

15:30–16:00 Coffee Break

16:00–17:20 Bake-off Task 1: Chinese Spelling Check

16:00–16:20 Introduction to SIGHAN 2015 Bake-off for Chinese Spelling Check
            Yuen-Hsien Tseng, Lung-Hao Lee, Li-Ping Chang and Hsin-Hsi Chen

16:20–16:40 HANSpeller++: A Unified Framework for Chinese Spelling Correction
            Shuiyuan Zhang, Jinhua Xiong, Jianpeng Hou, Qiao Zhang and Xueqi Cheng

16:40–17:00 Word Vector/Conditional Random Field-based Chinese Spelling Error Detection for SIGHAN-2015 Evaluation
            Yih-Ru Wang and Yuan-Fu Liao

17:00–17:20 Introduction to a Proofreading Tool for Chinese Spelling Check Task of SIGHAN-8
            Tao-Hsing Chang, Hsueh-Chih Chen and Cheng-Han Yang
Friday, July 31, 2015

09:00–10:30 Invited Talk

            Intelligent Q&A System and NLP Open Platform
            Yanxiong Lu and Lianqiang Zhou

10:30–11:00 Coffee Break

11:00–12:20 Bake-off Task 2: Topic-Based Chinese Message Polarity Classification

11:00–11:20 Overview of Topic-based Chinese Message Polarity Classification in SIGHAN 2015
            Xiangwen Liao, Binyang Li and Liheng Xu

11:20–11:40 A Joint Model for Chinese Microblog Sentiment Analysis
            Yuhui Cao, Zhao Chen, Ruifeng Xu, Tao Chen and Lin Gui

11:40–12:00 Learning Salient Samples and Distributed Representations for Topic-Based Chinese Message Polarity Classification
            Xin Kang, Yunong Wu and Zhifei Zhang

12:00–12:20 An combined sentiment classification system for SIGHAN-8
            Qiuchi Li, Qiyu Zhi and Miao Li
12:20–14:00 Lunch

14:00–15:20 Poster Session

            Linguistic Knowledge-driven Approach to Chinese Comparative Elements Extraction
            MinJun Park and Yulin Yuan

            A CRF Method of Identifying Prepositional Phrases in Chinese Patent Texts
            Hongzheng Li and Yaohong Jin

            Emotion in Code-switching Texts: Corpus Construction and Analysis
            Sophia Lee and Zhongqing Wang

            Chinese in the Grammatical Framework: Grammar, Translation, and Other Applications
            Aarne Ranta, Tian Yan and Haiyan Qiao

            KWB: An Automated Quick News System for Chinese Readers
            Yiqi Bai, Wenjing Yang, Hao Zhang, Jingwen Wang, Ming Jia, Roland Tong and Jie Wang

            Chinese Semantic Role Labeling using High-quality Syntactic Knowledge
            Gongye Jin, Daisuke Kawahara and Sadao Kurohashi

            Chinese Spelling Check System Based on N-gram Model
            Weijian Xie, Peijie Huang, Xinrui Zhang, Kaiduo Hong, Qiang Huang, Bingzhou Chen and Lei Huang

            NTOU Chinese Spelling Check System in Sighan-8 Bake-off
            Wei-Cheng Chu and Chuan-Jie Lin

            Topic-Based Chinese Message Sentiment Analysis: A Multilayered Analysis System
            Hongjie Li, Zhongqian Sun and Wei Yang

            Rule-Based Weibo Messages Sentiment Polarity Classification towards Given Topics
            Hongzhao Zhou, Yonglin Teng, Min Hou, Wei He, Hongtao Zhu, Xiaolin Zhu and Yanfei Mu

            Topic-Based Chinese Message Polarity Classification System at SIGHAN8-Task2
            Chun Liao, Chong Feng, Sen Yang and Heyan Huang
            CT-SPA: Text sentiment polarity prediction model using semi-automatically expanded sentiment lexicon
            Tao-Hsing Chang, Ming-Jhih Lin, Chun-Hsien Chen and Shao-Yu Wang

            Chinese Microblogs Sentiment Classification using Maximum Entropy
            Dashu Ye, Peijie Huang, Kaiduo Hong, Zhuoying Tang, Weijian Xie and Guilong Zhou

            NDMSCS: A Topic-Based Chinese Microblog Polarity Classification System
            Yang Wang, Yaqi Wang, Shi Feng, Daling Wang and Yifei Zhang

            NEUDM: A System for Topic-Based Message Polarity Classification
            Yaqi Wang, Shi Feng, Daling Wang and Yifei Zhang

15:20–15:30 Closing Session
Proceedings of the Eighth SIGHAN Workshop on Chinese Language Processing (SIGHAN-8), pages 1–6, Beijing, China, July 30-31, 2015. ©2015 Association for Computational Linguistics and Asian Federation of Natural Language Processing

Sequential Annotation and Chunking of Chinese Discourse Structure

Frances Yung, Kevin Duh, Yuji Matsumoto
Nara Institute of Science and Technology
8916-5 Takayama, Ikoma, Nara, 630-0192 Japan
{pikyufrances-y, kevinduh, matsu}@is.naist.jp
Abstract

We propose a linguistically driven approach to represent discourse relations in Chinese text as sequences. We observe that certain surface characteristics of Chinese texts, such as the order of clauses, are overt markers of discourse structures, yet existing annotation proposals adapted from formalisms constructed for English do not fully incorporate these characteristics. We present an annotated resource consisting of 325 articles in the Chinese Treebank. In addition, using this annotation, we introduce a discourse chunker based on a cascade of classifiers and report 70% top-level discourse sense accuracy.
1 Introduction

Discourse relations refer to the relations between units of text at the document level. As a key to language processing, they are used in tasks such as automatic summarization, sentiment analysis, and text coherence assessment (Lin et al., 2011; Trivedi and Eisenstein, 2013; Yoshida et al., 2014). While discourse-annotated English resources are available, resources in other languages are limited. In this work, we present the linguistic motivation behind the Chinese discourse-annotated corpus we constructed, and preliminary experiments on discourse chunking of Chinese.
1.1 Related Work

Major discourse-annotated resources in English include the RST Treebank (Carlson et al., 2001) and the Penn Discourse Treebank (PDTB) (Prasad et al., 2008). The RST Treebank represents discourse relations in a tree structure, where a satellite text span is related to a nucleus text span.

On the other hand, the Penn Discourse Treebank represents discourse structure in a predicate-argument-like structure, where discourse connectives (DCs) relate two text spans (Arg1 and Arg2). Under this framework, covert discourse relations are represented by implicit DCs.

PDTB's annotation scheme is adopted by the recently released Chinese Discourse Treebank (CDTB) (Zhou and Xue, 2015). Other efforts to exploit Chinese discourse relations include cross-lingual annotation projection based on machine translation or word-aligned parallel corpora (Zhou et al., 2012; Li et al., 2014). Combinations of the RST and PDTB formalisms have also been proposed: Zhou et al. (2014) add the distinction of satellite and nucleus to PDTB-style annotation, and Li et al. (2014b) label the connectives in an RST tree.
1.2 Motivation

The interpretation of discourse relations, as of other linguistic structures, is subject to the surface form of the text. We notice that Chinese discourse structures are expressed by certain surface features that do not exist in English.

First of all, Chinese sentences are sequences of clauses, typically separated by punctuation. Each clause can be considered a discourse argument. Above the clause level, Chinese sentences (marked by '。') are also units of discourse (Chu, 1998). When presented with texts where periods and commas are removed, native Chinese speakers disagree on where to restore them (Bittner, 2013). The actual sentence segmentation of the text thus represents the spans of discourse arguments intended by the writer and should be taken into account.
Secondly, it is well known that syntactic structure is expressed by word order in Chinese, and so is discourse structure. While Arg1 can occur before or after Arg2 in English, arguments predominantly occur in a fixed order in Chinese, depending on the logical relation. For example, the same concession relation can be expressed by both constructions (1) and (2) in English, but only construction (1) is acceptable in Chinese.
1. 虽然 (suiran, although) Arg2, Arg1.

2. Arg1, 虽然 (suiran, although) Arg2.
According to Chinese linguistics, adjunct clauses and discourse adverbials always precede the main clauses (Gasde and Paul, 1996; Chu and Ji, 1999). The clauses are semantically arranged in a topic-comment sequence following the writer's conceptual mind (Tai, 1985; Bittner, 2013). When the arguments are not arranged in the standard order, the sense of the DC is altered. For example, when '虽然' (suiran, although) is used in construction (2), it represents an 'expansion' relation (Huang et al., 2014). Therefore, discourse relations should be defined given the order of the arguments.
Lastly, parallel DCs are frequent in Chinese discourse, yet usually only one DC of the pair occurs to signify the relation (Zhou et al., 2014). For example, (3) and (4) are grammatical alternatives to (1).
3. 虽然 (suiran, although) Arg1, 但是 (danshi, but) Arg2.

4. Arg1, 但是 (danshi, but) Arg2.
Instead of viewing '虽然 (suiran, although) ... 但是 (danshi, but)' as a pair of parallel DCs, they can be regarded individually as a forward-linking (fw-linking) DC and a backward-linking (bw-linking) DC. A fw-linking DC relates its attached discourse unit to a later unit, while a bw-linking DC relates its attached discourse unit to a previous unit. Findings in linguistic studies also show that fw-linking DCs only link discourse units within the sentence boundary. On the other hand, bw-linking DCs can link a discourse unit to a preceding unit within or outside the sentence boundary, except when they are paired with a fw-linking DC (Eifring, 1995).
To summarize, in contrast with the ambiguous arguments in English, punctuation and limitations on DC usage explicitly mark certain discourse structures in Chinese. Section 2 illustrates the design of our annotation scheme driven by these constraints.
2 Sequential discourse annotation

We propose to follow the natural discourse chains in Chinese and annotate discourse structure as a sequence of alternating arguments and DCs. This section highlights the main differences of our scheme compared with other frameworks.
2.1 Arguments

Each clause separated by punctuation (except quotation marks) is treated as a candidate argument. Clauses that do not function as discourse units are classified into 3 types: attribution, optional punctuation, and non-discourse adverbial.

The main difference of our annotation scheme is that the order of the arguments for each DC is defined by default. Since the arguments of a particular discourse relation occur in fixed order and are always adjacent, each argument is related to the immediately preceding argument by a bw-linking DC. In turn, the DC in the first clause of a sentence links the sentence to the previous one, preserving the 2-layer structure denoted by punctuation. An implicit bw-linking DC is inserted if the clause does not contain an explicit DC.
Another characteristic of our annotation is that 'parallel DCs' are annotated separately as one fw-linking DC and one bw-linking DC. Implicit bw-linking DCs are inserted, if possible, even if the relation is already marked by a fw-linking DC in the previous argument.[1] In other words, duplicated annotation of one relation is allowed. This helps create more valid samples to capture the various combinations of Chinese DCs. When an argument spans more than one discourse unit, a fw-linking DC is used to mark the start of the span. Similarly, an implicit DC is inserted if necessary.
2.2 Connectives

There is a large variety of DCs in Chinese, and their syntactic categories are controversial. Huang et al. (2014) report a lexicon of 808 DCs, 359 of which are found in the data. Since many DCs signal the same relation, we adopt a functionalist approach to label DC senses.

In this approach, a DC is not limited to any syntactic category. Annotators are asked to perform a linguistic test by replacing a candidate expression with an unambiguous and preferably frequent DC of similar sense, which we call a 'main DC'. If the replacement is acceptable, then the expression is identified as a DC and its sense is categorized under that 'main DC'.

[1] Temporal relations are often marked by one fw-linking DC alone, and it is not acceptable to insert an implicit bw-linking DC. In this case, the 'redundant' tag is used.
For example, ‘尤为’ and ‘特别是’ (youwei, tebieshi, in particular / especially) are categorized under ‘尤其’ (youqi, in particular), if the annotator agrees that they are interchangeable in the context. The list of main DCs is not pre-defined but is constructed in the course of annotation. Based on the assigned ‘main DC’, each DC instance is categorized into the 4 main senses defined in PDTB: contingency, comparison, temporal, and expansion.
The discourse and syntactical limitations of the DCs are considered in the replaceability test. For example, the following pairs are not labeled with the same ‘main DC’ even if the signaled discourse relation is the same:
• Fw vs. bw-linking DCs: 虽然 (suiran, although) and 但是 (danshi, but)

• Cause-result vs. result-cause order: 因为...所以... (yinwei...suoyi..., because...therefore...) and 之所以...是因为... (zhisuoyi...shiyinwei..., the reason why...is because...)2

• Placed before vs. after the subject: 却 (que, but) and 但是 (danshi, but)
The list of ‘main DCs’ is not pre-defined but is constructed in the course of annotation; an expression is registered as another ‘main DC’ if it cannot be replaced. Note that expressions that are considered ‘alternative lexicalizations’ in PDTB or CDTB are also categorized as explicit connectives if they pass the replaceability test. Otherwise, an implicit DC, chosen from the list of ‘main DCs’, is inserted.
2.3 Annotation results
Materials of the corpus are raw texts of 325 articles (2,353 sentences) from the Chinese Treebank (Bies et al., 2007). Errors that affect the annotation process, namely punctuation errors that lead to wrong segmentation, have been corrected.
201 DCs are identified in our data, of which 66 are fw-linking DCs. The DCs are categorized into 73 ‘main DCs’, and 22 have ambiguous senses (labelled with more than one ‘main DC’). The distribution of the tags is shown in Table 1. Note that some of the ‘implicit’ relations we define belong to ‘explicit’ in other annotation schemes, since ‘double annotation’ occurs in our annotation.

2 The two pairs are treated as four different DCs.
               CON    COM   TEM   EXP    Total
Explicit       380    248   521   683    1832
Implicit      1551    446   164  3022    5183

               ADV    ATT   OPT   Total
Non-discourse  630    783   336   1749

Table 1: Distribution of various tags in the annotated corpus (4 senses: CONtingency, COMparison, TEMporal, EXPansion; 3 types of non-discourse-unit segments: ATTribution, OPTional punctuation, and non-discourse ADVerbial).
3 End-to-end discourse chunker
Our linguistically driven annotation of discourse structure takes the surface discourse features as ground truth. In particular, we define discourse relations based on default argument order and span. We demonstrate its learnability by building a discourse chunker in the form of a classifier cascade, as used in English discourse parsing (Lin et al., 2010). Features are extracted from the default arguments of each relation. We evaluate the accuracy of each component and the overall accuracy of the final output, classifying up to the 4 main senses. The pipeline consists of 5 classifiers, as shown in Figure 1, each of which is trained with the relevant samples, e.g., only arguments annotated with explicit DCs are used to train the explicit DC classifier. 289 and 36 articles are used as training and testing data, respectively.
Features include lexical and syntactical features (bag of words, bag of POS, word pairs, and production rules) that have been used in classifying implicit English DCs (Pitler et al., 2009; Lin et al., 2010), and the probability distribution of senses for explicit DC classification. The extraction of features is based on automatic parsing by the Stanford Parser (Levy and Manning, 2003). We also use the surrounding discourse relations as features, hypothesizing that certain relation sequences are more likely than others. The classifiers are trained by SVM with a linear kernel using the LIBSVM package (Chang and Lin, 2011).
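As a sketch of how such a cascade routes a segment through its component classifiers, the toy code below uses invented keyword rules in place of the trained linear-kernel SVMs; the segment strings, cue words, and stub decision functions are illustrative assumptions, not items or models from the corpus.

```python
# A minimal sketch of the five-classifier cascade, with stub classifiers
# standing in for the linear-kernel SVMs trained via LIBSVM in the paper.
def classify_segment(segment, clf):
    """Route one segment through the cascade; clf maps step name -> function."""
    if clf["non_dis_or_not"](segment):              # Step 1: non-discourse unit?
        return clf["non_dis_3_types"](segment)      # Step 4: ADV/ATT/OPT
    if clf["exp_identifier"](segment):              # Step 2: explicit DC present?
        return clf["exp_4_senses"](segment)         # Step 3: explicit 4 senses
    return clf["imp_4_senses"](segment)             # Step 5: implicit 4 senses

# Stub classifiers: toy keyword rules in place of trained SVMs.
stubs = {
    "non_dis_or_not": lambda s: s.endswith("说"),          # attribution cue
    "non_dis_3_types": lambda s: "attribution",
    "exp_identifier": lambda s: "但是" in s or "因为" in s,  # explicit DC cue
    "exp_4_senses": lambda s: "comparison" if "但是" in s else "contingency",
    "imp_4_senses": lambda s: "expansion",                 # most frequent class
}

label = classify_segment("但是他没有来", stubs)  # -> "comparison"
```

In the real pipeline each step is a trained classifier and features are re-extracted between steps; the routing order above mirrors the step numbering in Table 2.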
Figure 1: Cascade of discourse relation classifiers.
3.1 Results per component
Table 2 shows the accuracies of individual classifiers tested on relevant samples. Results based on predictions by the most frequent class are listed as baseline (BL). As expected, implicit relations (IMP) are much harder to classify than explicit relations (EXP). The classification result of non-discourse-unit segments (Non-dis or not) is similar to the preliminary report of Li et al. (2014b) (averaged F1 88.8%, accuracy 89.0%).
Step  Classifier       Test F1/Acc  BL F1/Acc
1     Non-dis or not   .91/.94      .44/.80
2     EXP identifier   .92/.93      .39/.65
3     EXP 4 senses     .90/.92      .15/.58
4     Non-dis 3 types  .86/.88      .17/.35
5     IMP 4 senses     .41/.61      .18/.58

Table 2: Accuracies of individual classifiers on ‘gold’ test samples. F1 is the average of the F1 for each class.
3.2 End-to-end evaluation
We run the classifiers from Steps 1-5. After Step 1, identified non-discourse-unit segments are joined as one argument and features are updated. The discourse context features are also updated after each step based on the last classifier’s output. The tag of a fw-linking DC is switched to the next segment, as a relation connecting the next segment to the current one. The current segment is thus passed to the implicit classifier, given that there are no bw-linking DCs.
For applications that need discourse, it may not be necessary to distinguish between explicit and implicit relations. Thus, we combine the outputs of the explicit and implicit classifiers when evaluating the end-to-end outputs. Specifically, the pipeline outputs one of the 4 discourse senses or ‘non-discourse-unit’ across a segment boundary, while the reference can be more than one, since duplicated annotation is allowed. The system prediction is considered correct if it is included in the gold tag set. The combined outputs are evaluated in terms of accuracy.
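The evaluation principle just described, counting a prediction as correct when it appears in a possibly multi-valued gold tag set, can be sketched as follows; the tags and gold sets are invented for illustration.

```python
# Accuracy against set-valued references: duplicated annotation means a
# boundary may carry more than one acceptable gold tag.
def end_to_end_accuracy(predictions, gold_sets):
    """A prediction is correct if it appears in the corresponding gold tag set."""
    correct = sum(1 for p, g in zip(predictions, gold_sets) if p in g)
    return correct / len(predictions)

preds = ["expansion", "contingency", "non-discourse-unit"]
gold = [{"expansion", "contingency"}, {"comparison"}, {"non-discourse-unit"}]
acc = end_to_end_accuracy(preds, gold)  # 2 of 3 boundaries correct
```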
Table 3 shows the classification accuracies evaluated by the above principle under different error propagation settings. For example, given gold identification of non-discourse segments (Step 1) and a gold explicit DC classifier (Step 2), classification of the 4 main explicit senses reaches an accuracy of 0.854, but drops to 0.800 if Steps 1 and 2 are automatic.3 It is observed that errors are generally propagated along the pipeline. Similar to the finding in English (Pitler et al., 2009), the discourse context as predicted by earlier classifiers does not affect the later steps: the results are the same based on gold or automatic outputs. The end-to-end accuracy of the proposed pipeline is 65.7%, and the baseline (classify all as ‘expansion’) is 50.0%.
      non-dis  exp/imp   explicit  non-dis  implicit  over-
Step  or not   /non-dis  senses    types    senses    all
      2-way    3-way     4-way     3-way    4-way     5-way
4     Gold     Gold      Gold      Gold     .670      .706
3     Gold     Gold      Gold      .879     .670      .706
2     Gold     Gold      .854      .879     .670      .703
1     Gold     .888      .800      .865     .665      .697
-     .862     .847      .800      .836     .657      .657

Table 3: Accuracies at each stage under different error propagation settings (rows labelled with a step number use gold outputs up to that step; ‘-’ is fully automatic).
Finally, we experimented with different variations of the pipeline, as shown in Table 4. The best result (70.1% accuracy) is obtained by classifying implicit DCs and non-discourse units in one step. For comparison, Huang and Chen (2011) report an accuracy of 88.28% on 4-way classification of inter-sentential discourse senses, and Huang and Chen (2012) report an accuracy of 81.63% on 2-way classification of intra-sentential contingency vs. comparison senses.
3 Note that the results under the complete gold settings do not necessarily echo the results of the individual components, where duplicated outputs are counted individually.
Note that the result is much degraded if we train one 5-way classifier to classify all relations. This shows that explicit and implicit DCs ought to be treated separately, even though we are not concerned about distinguishing them in the final output.
Pipeline variation                        Overall 5-way acc.
steps 1-5                                 .657
combine steps 1-5                         .549
switch steps 1 & 2                        .697
switch steps 1 & 2 + combine steps 4 & 5  .701

Table 4: 5-way accuracies of modified pipelines.
4 Conclusion
This work presents the annotation principles of our Chinese discourse corpus based on linguistic analysis. We propose to embrace the overt sequential features as ground truth discourse structures, and categorize DCs by their discourse functions. Based on the manually annotated corpus, we built and evaluated a classifier cascade that classifies explicit and implicit relations, and the results support that our annotation is tractably learnable. The annotation is available at http://cl.naist.jp/nldata/zhendisco/.
References

Ann Bies, Martha Palmer, Justin Mott, and Colin Warner. 2007. English Chinese Translation Treebank v1.0.

Maria Bittner. 2013. Topic states in Mandarin discourse. Proceedings of the North American Conference on Chinese Linguistics.

Lynn Carlson, Daniel Marcu, and Mary Ellen Okurowski. 2001. Building a discourse-tagged corpus in the framework of Rhetorical Structure Theory. Proceedings of the SIGdial Workshop on Discourse and Dialogue.

Chih-Chung Chang and Chih-Jen Lin. 2011. LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology.

Chauncey Chenghsi Chu and Zongren Ji. 1999. A Cognitive-Functional Grammar of Mandarin Chinese. Crane.

Chauncey Chenghsi Chu. 1998. A Discourse Grammar of Mandarin Chinese. P. Lang.

Halvor Eifring. 1995. Clause Combination in Chinese. BRILL.

Horst-Dieter Gasde and Waltraud Paul. 1996. Functional categories, topic prominence, and complex sentences in Mandarin Chinese. Linguistics, 34.

Hen-Hsen Huang and Hsin-Hsi Chen. 2011. Chinese discourse relation recognition. Proceedings of the International Joint Conference on Natural Language Processing.

Hen-Hsen Huang and Hsin-Hsi Chen. 2012. Contingency and comparison relation labeling and structure prediction in Chinese sentences. Proceedings of the Annual Meeting of SIGDIAL.

Hen-Hsen Huang, Tai-Wei Chang, Huan-Yuan Chen, and Hsin-Hsi Chen. 2014. Interpretation of Chinese discourse connectives for explicit discourse relation recognition. Proceedings of the International Conference on Computational Linguistics.

Roger Levy and Christopher Manning. 2003. Is it harder to parse Chinese, or the Chinese Treebank? Proceedings of the Annual Meeting of the Association for Computational Linguistics.

Junyi Jessy Li, Marine Carpuat, and Ani Nenkova. 2014. Cross-lingual discourse relation analysis: A corpus study and a semi-supervised classification system. Proceedings of the International Conference on Computational Linguistics.

Yancui Li, Wenhe Feng, Jing Sun, Fang Kong, and Guodong Zhou. 2014b. Building Chinese discourse corpus with connective-driven dependency tree structure. Proceedings of the Conference on Empirical Methods on Natural Language Processing.

Ziheng Lin, Hwee Tou Ng, and Min-Yen Kan. 2010. A PDTB-styled end-to-end discourse parser. Technical report, National University of Singapore.

Ziheng Lin, Hwee Tou Ng, and Min-Yen Kan. 2011. Automatically evaluating text coherence using discourse relations. Proceedings of the Annual Meeting of the Association for Computational Linguistics.

Emily Pitler, Annie Louis, and Ani Nenkova. 2009. Automatic sense prediction for implicit discourse relations in text. Proceedings of the Annual Meeting of the Association for Computational Linguistics and the International Joint Conference on Natural Language Processing.

Rashmi Prasad, Nikhil Dinesh, Alan Lee, Eleni Miltsakaki, Livio Robaldo, Aravind Joshi, and Bonnie Webber. 2008. The Penn Discourse TreeBank 2.0. Proceedings of the Language Resources and Evaluation Conference.

James H-Y. Tai. 1985. Temporal sequence and Chinese word order. Iconicity in Syntax.

Rakshit Trivedi and Jacob Eisenstein. 2013. Discourse connectors for latent subjectivity in sentiment analysis. Proceedings of the North American Chapter of the Association for Computational Linguistics.

Yasuhisa Yoshida, Jun Suzuki, Tsutomu Hirao, and Masaaki Nagata. 2014. Dependency-based discourse parser for single-document summarization. Proceedings of the Conference on Empirical Methods on Natural Language Processing.

Yuping Zhou and Nianwen Xue. 2015. The Chinese Discourse TreeBank: a Chinese corpus annotated with discourse relations. Language Resources and Evaluation, 49(2).

Lanjun Zhou, Wei Gao, Binyang Li, Zhongyu Wei, and Kam-Fai Wong. 2012. Cross-lingual identification of ambiguous discourse connectives for resource-poor language. Proceedings of the International Conference on Computational Linguistics.

Lanjun Zhou, Binyang Li, Zhongyu Wei, and Kam-Fai Wong. 2014. The CUHK Discourse TreeBank for Chinese: Annotating explicit discourse connectives for the Chinese TreeBank. Proceedings of the Language Resources and Evaluation Conference.
Proceedings of the Eighth SIGHAN Workshop on Chinese Language Processing (SIGHAN-8), pages 7–14, Beijing, China, July 30-31, 2015.
©2015 Association for Computational Linguistics and Asian Federation of Natural Language Processing
Create a Manual Chinese Word Segmentation Dataset Using Crowdsourcing Method

Shichang Wang, Chu-Ren Huang, Yao Yao, Angel Chan
Department of Chinese and Bilingual Studies
The Hong Kong Polytechnic University
Hung Hom, Kowloon, Hong Kong
[email protected]
{churen.huang, y.yao, angel.ws.chan}@polyu.edu.hk
Abstract

The manual Chinese word segmentation dataset WordSegCHC 1.0, which was built by eight crowdsourcing tasks conducted on the Crowdflower platform, contains the manual word segmentation data of 152 Chinese sentences whose length ranges from 20 to 46 characters, punctuation excluded. All the sentences received 200 segmentation responses in their corresponding crowdsourcing tasks, and the numbers of valid responses range from 123 to 143 (each sentence was segmented by more than 120 subjects). We also propose an evaluation method called manual segmentation error rate (MSER) to evaluate the dataset; the MSER of the dataset proves to be very low, which indicates reliable data quality. In this work, we applied the crowdsourcing method to the Chinese word segmentation task, and the results confirmed again that the crowdsourcing method is a promising tool for linguistic data collection; the framework of crowdsourced linguistic data collection used in this work can be reused in similar tasks; the resultant dataset fills, to the best of our knowledge, a gap in Chinese language resources, and it has potential applications in the research of word intuition of Chinese speakers and Chinese language processing.
1 Introduction
Chinese word segmentation, which can be conducted by human or computer, in written or oral form, is a hot topic receiving great interest from several branches of linguistics, especially theoretical, computational, and psychological linguistics, simply because it relates to, or perhaps is the key to, several critical theoretical and applicational issues, for example word definition, word intuition, and Chinese language processing.

However, in the traditional laboratory setting, limited by budget and/or the difficulty of large-scale subject recruitment, etc., it is very difficult or even impossible to build a large manual Chinese word segmentation dataset (the defining feature of this kind of dataset is that each sentence must be segmented by a large group of people in order to measure the word intuition of Chinese speakers), and this hinders the availability of such language resources. Fortunately, the crowdsourcing method can perhaps help us to solve this problem. Aware of this background, we built the crowdsourced manual Chinese word segmentation dataset WordSegCHC 1.0 with multiple purposes in mind.

The first purpose is to further explore the application of the crowdsourcing method in language resource building and linguistic studies in the context of the Chinese language. The crowdsourcing method is a promising tool to solve the linguistic data bottleneck problem which widely occurs in various linguistic studies; it is efficient and economical and can help us realize much higher randomness and much larger scale in sampling; in annotation tasks we can also get much higher redundancy to help us make decisions on ambiguous cases with more confidence. Although its signal-to-noise ratio (SNR) is usually lower than that of the traditional laboratory method, it can yield high-quality data as good as or even better than the traditional method when combined with several data quality control measures, including parameter optimization, screening questions, performance monitoring, data validation, data cleansing, majority voting, peer review, spammer monitoring, etc. (Crump et al., 2013; Allahbakhsh et al., 2013; Mason and Suri, 2012; Behrend et al., 2011; Buhrmester et al., 2011; Callison-Burch and Dredze, 2010; Paolacci et al., 2010; Ipeirotis et al., 2010; Munro et al., 2010; Snow et al., 2008).

We have already successfully applied the crowdsourcing method to the semantic transparency rating of compounds and built a semantic transparency dataset which contains the semantic transparency rating data of about 1,200 disyllabic Chinese nominal compounds (Wang et al., 2014a); we want to further extend the application of the crowdsourcing method to the Chinese word segmentation task, both to further evaluate the crowdsourcing method and to build a new language resource.

The second purpose is to support the studies of word intuition of Chinese speakers in general, and to examine the effect of semantic transparency on word intuition in particular. Word intuition is speakers' intuitive knowledge of wordhood, i.e., what a word is. Laymen's word segmentation behavior is not instructed by linguistic theories on the word, but by their word intuition, and hence reflects their word intuition; because of this, the word segmentation task has been used to measure and study word intuition (王立, 2003; Hoosain, 1992). The basic idea is this: if a Chinese sentence is segmented by, for example, 100 subjects, we can then observe which slices of the sentence are consistently treated as words by these subjects, which slices are consistently treated as non-words, and which slices are not so consistent, being treated as words by some and non-words by others. This kind of segmentation consistency can be a convenient measurement of Chinese speakers' word intuition. Word intuition per se is an important issue awaiting more research, which can contribute to the investigation of the cognitive mechanism of humans' language competence and shed new light on the theoretical problem of word definition, for the theoretical definition of the word should generally accord with speakers' word intuition (王洪君, 2006; 王立, 2003; 胡明扬, 1999; 陆志韦, 1964).

Semantic transparency/compositionality of a multi-morphemic form is, simply speaking, the extent to which the lexical meaning of the whole form can be derived from the lexical meanings of its constituents. More accurately speaking, this is merely the definition of the overall semantic transparency (OST) of a multi-morphemic form; besides that, there is also constituent semantic transparency (CST), which means the extent to which the lexical meaning of each constituent as an independent lexical form retains itself in the lexical meaning of the whole form.

In the context of theoretical linguistics, semantic transparency is used as an empirical criterion of wordhood (Duanmu, 1998; 吕叔湘, 1979; Chao, 1968), but for Chinese disyllabic forms this criterion seems to be ignored to some extent by some linguists based on word intuition (王洪君, 2006; 冯胜利, 2004; 王立, 2003; 冯胜利, 2001; 胡明扬, 1999; 冯胜利, 1996; 吕叔湘, 1979); it is also treated as an indicator of lexicalization (Packard, 2000; 董秀芳, 2002; 李晋霞 and 李宇明, 2008). In the context of psycholinguistics, it is an "extremely important factor" (Libben, 1998) affecting the mechanism of the mental lexicon, for example the representation, processing/recognition, and memorizing of multi-morphemic words (Han et al., 2014; Mok, 2009; 王春茂 and 彭聃龄, 2000; 王春茂 et al., 2000; 王春茂 and 彭聃龄, 1999; Libben, 1998; Tsai, 1994). Following this line of investigation, it is significant to examine the role semantic transparency plays in Chinese speakers' word intuition towards Chinese disyllabic forms. When we built the dataset, we carefully selected sentence stimuli which contain word stimuli that cover all possible kinds of semantic transparency types, to enable us to examine the role semantic transparency plays in the word intuition of Chinese speakers.

The widely used Chinese segmented corpora, for example the Sinica Corpus (Chen et al., 1996), are usually segmented first by segmentation programs and then revised by experts according to certain word segmentation standards. From the inconsistent segmentation cases we can find plenty of useful information to explore word intuition. But from the perspective of the measurement of Chinese speakers' word intuition, the data are biased by segmentation programs and word segmentation standards, so they are not so suitable and reliable for this purpose.

In order to better serve the studies of word intuition of Chinese speakers, we need manual word segmentation datasets. In such a dataset, each and every sentence is segmented manually by a large group of laymen, say 100, without the influence of any linguistic theory or any Chinese word segmentation standard. This kind of dataset, which is both large and publicly accessible, is, to the best of our knowledge, still a gap in Chinese language resources.

And the third purpose is that the resultant manual Chinese word segmentation dataset may have potential applications in the studies of Chinese language processing, especially in the studies of automatic Chinese word segmentation and cognitive models of Chinese language processing.
2 Construction
2.1 Materials
The stimuli of word segmentation tasks are at least phrases, but we prefer naturally occurring sentences. In order to cover more linguistic phenomena to better support the studies of word intuition, we decided to use more than 150 long sentences (the crowdsourcing method makes this possible). Meanwhile, the resultant dataset must be able to support the examination of the effect of semantic transparency on word intuition, so these sentence stimuli should also contain words which cover all the word stimuli to be used in the examination of the semantic transparency effect. The stimuli selection procedure therefore consists of two steps: (1) word selection, i.e., to select an initial set of words which covers all the word stimuli that would be used in the examination of the semantic transparency effect, and (2) sentence selection, i.e., to select a set of sentences which contains the words selected in step 1 (each sentence carries one word) and at the same time satisfies other requirements.
Word Selection
We have already created a crowdsourced semantic transparency dataset SimTransCNC 1.0, which contains the overall and constituent semantic transparency rating data of about 1,200 Chinese bimorphemic nominal compounds with mid-range word frequencies (Wang et al., 2014a). Based on this dataset, 152 words are selected; for the distribution of these words, see Table 1. These words are bimorphemic nominal compounds of the structure modifier-head, and cover three substructures: NN, AN, and VN. Following (Libben et al., 2003), we differentiate four transparency types: TT, TO, OT, and OO; "T" means "transparent", and "O" means "opaque". TT words show the highest OST scores and the most balanced CST scores, e.g., "江水"; OO words have the lowest OST scores and the most balanced CST scores, e.g., "脾气"; TO and OT words bear mid-range OST scores and the most imbalanced CST scores, e.g., "音色" (TO) and "贵人" (OT).

Transparency   Word Structure
Type           NN   AN   VN
TT             20   10   10
TO             20    6   10
OT             20   10   10
OO             20   10    6

Table 1: Distribution of types of selected words.
Sentence Selection
The words selected in step 1 are used as indexes, and all the sentences carrying them in Sinica Corpus 4.0 are extracted. One sentence is selected for each word roughly according to the following criteria: (1) the length of the sentence should be between 20 and 50 characters (punctuation excluded); (2) the sentence should not contain too many punctuation marks; (3) concrete and narrative sentences are preferred over abstract ones which are difficult to understand; (4) if we cannot find proper sentences in the Sinica Corpus for some words, we use other corpora (only 5 sentences). In this way, a total of 152 sentences are selected; for the length (in characters) distribution, see Table 2.
Length of Sentence
Min     20
Max     46
Sum     4,946
Mean    32.54
SD      5.46

Table 2: Length distribution of selected sentences.
2.2 Crowdsourcing Task Design
These 152 sentence stimuli are evenly and randomly divided into eight sentence groups; each sentence group has 19 sentences. We created one crowdsourcing task for each sentence group on Crowdflower; according to our previous studies, compared to Amazon Mechanical Turk (MTurk), Crowdflower is a more feasible platform for Chinese linguistic data collection (Wang et al., 2014b; Wang et al., 2014a).
Questionnaires
The core of each crowdsourcing task is a questionnaire. Each questionnaire consists of five sections: (1) title, (2) instructions, (3) demographic questions, (4) screening questions, and (5) segmentation task; both simplified and traditional Chinese character versions are provided. Section 3, demographic questions, asks the on-line subjects to provide their identity information on gender, age, level of education, and email address (optional). Section 4, screening questions, consists of four simple questions on the Chinese language which can be used to test whether a subject is a Chinese speaker or not. The first two questions are open-ended Chinese character identification questions: each question shows a picture containing a simple Chinese character and asks the subject to identify that character and type it in the text-box below it. The third question is a close-ended homophonic character identification question: it shows the subject a character and asks him/her to identify its homophonic character among 10 different characters. The fourth is a close-ended antonymous character identification question: it asks the subject to identify the antonymous character of the given one from 10 different characters. The section 4s of the eight crowdsourcing tasks share the same question types but have different question instances. Section 5, the segmentation task, shows the subjects 19 sentence stimuli and asks them to insert a word boundary symbol ("/") at each word boundary they perceive; the subjects are required to insert a "/" behind each punctuation mark and the last character of a sentence; the subjects are also informed that they need not care about right or wrong, but should just follow their intuition.
Parameters of Tasks
These eight crowdsourcing tasks are created with the following parameters: (1) each worker account can only submit one response to one task; (2) each IP address can only submit one response to one task; (3) we only accept responses from mainland China, Hong Kong, Macao, Taiwan, Singapore, Indonesia, Malaysia, Thailand, Australia, Canada, Germany, the United States, and New Zealand; (4) we pay 0.25 USD for one response.
Quality Control Measures
The following quality control measures are used: (1) section 4, screening questions, is used to discriminate Chinese speakers from non-Chinese speakers and to block bots; (2) section 5, the segmentation task, is kept invisible unless the first two screening questions are correctly answered; (3) the answers to the segmentation questions in section 5 must comply with the prescribed format to prevent random strings: a) the segmentation answer to each sentence must be composed only of the original sentence with one or zero "/" behind each Chinese character and each punctuation mark, b) in the answers, behind each punctuation mark there must be a "/", c) the end of an answer must be a "/"; (4) submission attempts are blocked unless all the required questions are answered and the answers satisfy the above conditions; (5) data cleansing is conducted after data collection to rule out invalid responses.
2.3 Procedure
We first ran a small pretest task to test whether the tasks were correctly designed, and it turned out that the pretest task ran smoothly. Then we launched the first task and let it run alone for about two days to further test the task design. After we finally confirmed that the tasks could really run smoothly, we launched the other seven tasks and let them run concurrently. Our aim was to collect 200 responses for each task. The speed was amazingly fast in the beginning, and all eight tasks received their first 100 responses in the first three to six days; then the speed became slower and slower, and it eventually took us about 1.3 months to reach our aim. After all, Crowdflower is not a native Chinese crowdsourcing platform, so this kind of speed is understandable.
2.4 Data Cleansing
All tasks successfully obtained 200 responses; however, not all responses are valid. Compared to the laboratory setting, the crowdsourcing environment is quite noisy by nature, so before the newly collected data can be used in any serious analysis to draw reliable conclusions, data cleansing must be conducted.

The raw responses underwent rule-based data cleansing. A response is considered invalid if it has at least one of the following five features: (1) at least one of the four screening questions is incorrectly answered; (2) the lengths of the resultant segments of at least one of its 19 sentences are all one character; (3) at least one segment longer than seven characters is observed in the resultant segments of its 19 sentences; (4) the completion time of the response is shorter than five minutes; (5) the completion time of the response is longer than one hour. Invalid responses were ruled out; the numbers of valid responses of the eight tasks are listed in Table 3.
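The five cleansing rules can be sketched as a validity predicate. The response record shape here (screening results, answers, completion time in seconds) is a hypothetical simplification of the actual Crowdflower export.

```python
# Rule-based validity check mirroring the five cleansing rules above.
def is_valid_response(screening_correct, answers, completion_seconds):
    if not all(screening_correct):                      # rule (1)
        return False
    for ans in answers:                                 # one answer per sentence
        segments = [s for s in ans.split("/") if s]
        if all(len(s) == 1 for s in segments):          # rule (2): all 1-char
            return False
        if any(len(s) > 7 for s in segments):           # rule (3): >7-char segment
            return False
    if completion_seconds < 5 * 60:                     # rule (4): too fast
        return False
    if completion_seconds > 60 * 60:                    # rule (5): too slow
        return False
    return True

ok = is_valid_response([True] * 4, ["这是/好东西/"], 900)       # passes all rules
bad = is_valid_response([True] * 4, ["这/是/好/东/西/"], 900)   # all 1-char segments
```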
2.5 Results
The resultant dataset contains the manual Chinese word segmentation data of 152 sentences whose length ranges from 20 to 46 characters (M = 32.54, SD = 5.46), and each sentence is segmented by at least 123 and at most 143 subjects (M = 133.5, SD = 7.37).
Task   Valid Responses   %
1      142               71
2      143               71.5
3      138               69
4      135               67.5
5      133               66.5
6      127               63.5
7      123               61.5
8      127               63.5

Min    123               61.5
Max    143               71.5
Mean   133.5             66.75
SD     7.37              3.68

Table 3: Numbers of valid responses of the tasks.
3 Evaluation
Although Fleiss’ kappa can be used to measure the agreement between raters, high agreement does not necessarily mean high data quality, especially in the situation of intuition measurement, where variation among subjects is expected. Nor can it directly show how many errors the resultant dataset actually contains. Knowing how many errors the dataset contains is very important for assessing the reliability of the conclusions drawn from it. We first define two kinds of manual segmentation errors, and based on that, an evaluation method called manual segmentation error rate (MSER) is proposed to evaluate the resultant dataset.
3.1 Types of Manual Segmentation Errors
In Chinese phrases/sentences, there are three types of non-monosyllabic segments from the point of view of manual word segmentation: ridiculous segments, indivisible segments, and modest segments. A ridiculous segment usually cannot be treated as one valid unit/word, because it makes no sense in the context of the phrase/sentence; for example, in the phrase "这是好东西", the segment "好东" cannot be treated as one unit/word, because it is incomprehensible. An indivisible segment usually cannot be divided, because it is a fixed unit and its lexical meaning cannot be derived easily from the lexical meanings of its constituents (i.e., it is semantically opaque); it becomes incomprehensible if it is divided; for example, in the phrase above, the segment "东西" is of this type. A modest segment can be either treated as one unit/word or divided into two or more units/words, because it is equally comprehensible whether divided or not; the segment "这是" in the phrase above is of this type.

Two circumstances can be treated as errors of manual word segmentation: firstly, if a ridiculous segment appears in segmentation results, it can be treated as an error (type I error); and secondly, if an indivisible segment is divided in segmentation results, it can also be treated as an error (type II error). These two circumstances are not compatible with our general word intuition even to the least extent, because they are simply incomprehensible, and they cannot be explained by variation of word intuition among speakers; normally, when subjects do word segmentation tasks carefully according to their word intuition, these errors would not occur, so we can treat them as errors. Such errors occur when subjects try to cheat by segmenting randomly or make accidental mistakes.
3.2 Manual Segmentation Error Rate

A subject divides the phrase/sentence S into n (n ∈ N+) segments by n segmentation operations (not n − 1: the subject leaves the remaining segment at the tail as one word, which means the subject has "confirmed" it, so this counts as a segmentation operation too). A segmentation operation can yield only one of the following four possible results: one type I error, one type II error, one type I error plus one type II error (two errors; e.g., "好东/西"), or no error. Suppose e′ (e′ ∈ N) is the number of times a type I error occurred during the segmentation process, and e′′ (e′′ ∈ N) the number of times a type II error occurred; then we can define the manual segmentation error rate (MSER):

MSER = (e′ + e′′) / n
In extreme cases, MSER could be greater than one; for example, in the segmentation result "去哈/尔滨/", e′ = 2, e′′ = 1, n = 2, so MSER = 3/2. If this happens, we simply set MSER = 1. MSER can be used to evaluate manual word segmentation results; a lower MSER means better data quality. Let us consider its collective form: if S is segmented by m (m ∈ N+) subjects, and the ith (1 ⩽ i ⩽ m) subject's type I error count, type II error count, and segmentation operation count are e′_i, e′′_i, and n_i respectively, then the collective form of MSER is:

MSER = ( Σ_{i=1}^{m} (e′_i + e′′_i) ) / ( Σ_{i=1}^{m} n_i )
Conveniently, we can find type I errors and their counts in the unigram frequency list of the segmentation results, and find type II errors and their counts in the bigram frequency list of the segmentation results.
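Putting the preceding definitions together, the collective MSER is straightforward to compute; the sketch below (function name hypothetical) takes per-subject (e′, e′′, n) triples and applies the cap at 1 introduced above.

```python
def mser(subjects):
    """Collective manual segmentation error rate (MSER).

    `subjects` is a list of (e1, e2, n) triples: a subject's type I
    error count, type II error count, and number of segmentation
    operations.  The rate is capped at 1 for the rare case where the
    raw ratio exceeds one.
    """
    total_errors = sum(e1 + e2 for e1, e2, _ in subjects)
    total_ops = sum(n for _, _, n in subjects)
    return min(1.0, total_errors / total_ops)

# Single-subject example from the text: segmenting "去哈/尔滨/" gives
# e' = 2, e'' = 1, n = 2, so the raw rate 3/2 is capped at 1.
print(mser([(2, 1, 2)]))                   # 1.0
# Task 1, sentence S1 from Table 4: (13 + 20) / 2864 rounds to .012
print(round(mser([(13, 20, 2864)]), 3))    # 0.012
```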
3.3 Evaluation Procedure and Results

Among the 19 sentences of each task, three sentences were sampled for evaluation: the first sentence, the middle (10th) sentence, and the last (19th) sentence. We calculated the MSER for each of them; see Table 4 for details. The MSERs of the segmentation results of these sentences are all very low (< .05), and the mean is only .013 (SD = .004); this means the resultant dataset contains few errors and indicates that the data quality is good.
4 Conclusion

We created the manual Chinese word segmentation dataset WordSegCHC 1.0 using the crowdsourcing method; to the best of our knowledge, there are no publicly available resources of this kind. It can support studies of word intuition, especially of the effect of semantic transparency on word intuition, and has potential applications in Chinese language processing.

We also proposed an evaluation method called manual segmentation error rate (MSER) to evaluate manual word segmentation datasets. The error rate of the dataset was shown to be very low, which indicates that its data quality is reliable.

Task  Sentence   Σn       Σe′    Σe′′   MSER
1     S1         2864     13     20     .012
      S10        3904     18     16     .009
      S19        4046     12     7      .005
2     S1         2993     29     19     .016
      S10        2000     9      6      .008
      S19        2529     19     26     .018
3     S1         6634     32     27     .009
      S10        2834     21     14     .012
      S19        2894     43     22     .022
4     S1         2612     24     22     .018
      S10        1836     14     8      .012
      S19        2640     26     20     .017
5     S1         2361     15     14     .012
      S10        2829     14     7      .007
      S19        2489     14     15     .012
6     S1         2906     35     22     .020
      S10        2758     21     8      .011
      S19        1711     20     13     .019
7     S1         1857     19     11     .016
      S10        3125     35     14     .016
      S19        2808     28     10     .014
8     S1         2465     23     14     .015
      S10        3238     23     11     .011
      S19        2042     15     7      .011
Min              1711     9      6      .005
Max              6634     43     27     .022
Sum              68375    522    353
Mean             2848.96  21.75  14.71  .013
SD               989.76   8.51   6.3    .004

Table 4: Segmentation error rates (MSER) of the segmentation results of the eight tasks.

This work also confirmed once more that crowdsourcing is a feasible, convenient, and reliable tool for collecting linguistic data. Through this work, a reusable general framework for crowdsourced collection of linguistic data is also presented; following this framework, larger similar Chinese language resources can be constructed.

We will use this dataset to examine the role of semantic transparency in the word intuition of Chinese speakers and to induce the factors affecting word intuition. The resulting discoveries will deepen our understanding of the word definition problem in the Chinese language, which has both theoretical and practical significance.

In the future, once the factors modulating Chinese speakers' word intuition are clear, a computational cognitive model of Chinese word segmentation (Wu, 2011) can perhaps be proposed, and we believe that this could be an interesting new direction for Chinese word segmentation research.
Acknowledgments
The work described in this paper was supported by a grant from the Research Grants Council of the Hong Kong SAR, China (Project No. 544011).
References

M. Allahbakhsh, B. Benatallah, A. Ignjatovic, H. R. Motahari-Nezhad, E. Bertino, and S. Dustdar. 2013. Quality control in crowdsourcing systems: Issues and directions. IEEE Internet Computing, 17(2):76–81.

Tara S. Behrend, David J. Sharek, Adam W. Meade, and Eric N. Wiebe. 2011. The viability of crowdsourcing for survey research. Behavior Research Methods, 43(3):800–813.

Michael Buhrmester, Tracy Kwang, and Samuel D. Gosling. 2011. Amazon's Mechanical Turk: A new source of inexpensive, yet high-quality, data? Perspectives on Psychological Science, 6(1):3–5.

Chris Callison-Burch and Mark Dredze. 2010. Creating speech and language data with Amazon's Mechanical Turk. In Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk, pages 1–12. Association for Computational Linguistics.

Yuen Ren Chao. 1968. A Grammar of Spoken Chinese. University of California Press.

Keh-Jiann Chen, Chu-Ren Huang, Li-Ping Chang, and Hui-Li Hsu. 1996. Sinica Corpus: Design methodology for balanced corpora. In B.-S. Park and J. B. Kim, editors, Proceedings of the 11th Pacific Asia Conference on Language, Information and Computation, pages 167–176. Seoul: Kyung Hee University.

Matthew J. C. Crump, John V. McDonnell, and Todd M. Gureckis. 2013. Evaluating Amazon's Mechanical Turk as a tool for experimental behavioral research. PLoS ONE, 8(3):e57410.

San Duanmu. 1998. Wordhood in Chinese. New Approaches to Chinese Word Formation: Morphology, Phonology and the Lexicon in Modern and Ancient Chinese, pages 135–196.

Yi-Jhong Han, Shuo-chieh Huang, Chia-Ying Lee, Wen-Jui Kuo, and Shih-kuen Cheng. 2014. The modulation of semantic transparency on the recognition memory for two-character Chinese words. Memory & Cognition, pages 1–10.

Rumjahn Hoosain. 1992. Psychological reality of the word in Chinese. Advances in Psychology, 90:111–130.

Panagiotis G. Ipeirotis, Foster Provost, and Jing Wang. 2010. Quality management on Amazon Mechanical Turk. In Proceedings of the ACM SIGKDD Workshop on Human Computation, pages 64–67. ACM.

Gary Libben, Martha Gibson, Yeo Bom Yoon, and Dominiek Sandra. 2003. Compound fracture: The role of semantic transparency and morphological headedness. Brain and Language, 84(1):50–64.

Gary Libben. 1998. Semantic transparency in the processing of compounds: Consequences for representation, processing, and impairment. Brain and Language, 61(1):30–44.

Winter Mason and Siddharth Suri. 2012. Conducting behavioral research on Amazon's Mechanical Turk. Behavior Research Methods, 44(1):1–23.

Leh Woon Mok. 2009. Word-superiority effect as a function of semantic transparency of Chinese bimorphemic compound words. Language and Cognitive Processes, 24(7-8):1039–1081.

Robert Munro, Steven Bethard, Victor Kuperman, Vicky Tzuyin Lai, Robin Melnick, Christopher Potts, Tyler Schnoebelen, and Harry Tily. 2010. Crowdsourcing and language studies: The new generation of linguistic data. In Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk, pages 122–130. Association for Computational Linguistics.

Jerome L. Packard. 2000. The Morphology of Chinese: A Linguistic and Cognitive Approach. Cambridge University Press.

Gabriele Paolacci, Jesse Chandler, and Panagiotis G. Ipeirotis. 2010. Running experiments on Amazon Mechanical Turk. Judgment and Decision Making, 5(5):411–419.

Rion Snow, Brendan O'Connor, Daniel Jurafsky, and Andrew Y. Ng. 2008. Cheap and fast – but is it good? Evaluating non-expert annotations for natural language tasks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 254–263. Association for Computational Linguistics.

Chih-Hao Tsai. 1994. Effects of semantic transparency on the recognition of Chinese two-character words: Evidence for a dual-process model. Master's thesis, Graduate Institute of Psychology, National Chung Cheng University, Chia-Yi, Taiwan.

Shichang Wang, Chu-Ren Huang, Yao Yao, and Angel Chan. 2014a. Building a semantic transparency dataset of Chinese nominal compounds: A practice of crowdsourcing methodology. In Proceedings of the Workshop on Lexical and Grammatical Resources for Language Processing, pages 147–156, Dublin, Ireland, August. Association for Computational Linguistics and Dublin City University.

Shichang Wang, Chu-Ren Huang, Yao Yao, and Angel Chan. 2014b. Exploring mental lexicon in an efficient and economic way: Crowdsourcing method for linguistic experiments. In Proceedings of the 4th Workshop on Cognitive Aspects of the Lexicon (CogALex), pages 105–113, Dublin, Ireland, August. Association for Computational Linguistics and Dublin City University.
Zhijie Wu. 2011. A cognitive model of Chinese word segmentation for machine translation. Meta: Journal des traducteurs / Meta: Translators' Journal, 56(3):631–644.

冯胜利. 1996. 论汉语的"韵律词". 中国社会科学, (1):161–176.

冯胜利. 2001. 从韵律看汉语"词""语"分流之大界. 中国语文, (1):27–37.

冯胜利. 2004. 论汉语"词"的多维性. 语言科学, 3(3):161–174.

吕叔湘. 1979. 汉语语法分析问题. 商务印书馆.

李晋霞 and 李宇明. 2008. 论词义的透明度. 语, (3):60–65.

王春茂 and 彭聃龄. 1999. 合成词加工中的词频、词素频率及语义透明度. 心理学报, 31(3):266–273.

王春茂 and 彭聃龄. 2000. 多词素词的通达表征: 分解还是整体. 心理科学, 23(4):395–398.

王春茂, 彭聃龄, et al. 2000. 重复启动作业中词的语义透明度的作用. 心理学报, 32(2):127–132.

王洪君. 2006. 从本族人语感看汉语的"词". 语言科学.

王立. 2003. 汉语词的社会语言学研究. 商务印书馆.

胡明扬. 1999. 说"词语". 语言文字应用, 3.

董秀芳. 2002. 词汇化: 汉语双音词的衍生和发展. 四川民族出版社.

陆志韦. 1964. 汉语的构词法. 科学出版社.
Proceedings of the Eighth SIGHAN Workshop on Chinese Language Processing (SIGHAN-8), pages 15–20, Beijing, China, July 30-31, 2015. ©2015 Association for Computational Linguistics and Asian Federation of Natural Language Processing
Chinese Named Entity Recognition with Graph-based Semi-supervised Learning Model

Aaron Li-Feng Han*  Xiaodong Zeng+  Derek F. Wong+  Lidia S. Chao+

* Institute for Logic, Language and Computation, University of Amsterdam, Science Park 107, 1098 XG Amsterdam, The Netherlands
+ NLP2CT Laboratory / Department of Computer and Information Science, University of Macau, Macau S.A.R., China

[email protected] [email protected] [email protected] [email protected]
Abstract

Named entity recognition (NER) plays an important role in the NLP literature. Traditional methods tend to employ large annotated corpora to achieve high performance. Different from many semi-supervised learning models for the NER task, in this paper we employ the graph-based semi-supervised learning (GBSSL) method to utilize freely available unlabeled data. The experiments show that the unlabeled corpus can enhance the state-of-the-art conditional random field (CRF) learning model and has the potential to improve tagging accuracy, even though the margin is still small and not yet satisfying in the current experiments.
1. Introduction

Named entity recognition (NER) can be regarded as a sub-task of information extraction, and it plays an important role in the natural language processing literature. The NER challenge has attracted many researchers from NLP, and several successful NER tasks have been held in past years. The annotations in the MUC-7 Named Entity task1 (Marsh and Perzanowski, 1998) consist of entities (organization, person, and location), times, and quantities such as monetary values and percentages, across English, Chinese, and Japanese.

1 http://www-nlpir.nist.gov/related_projects/muc/proceedings/ne_task.html

The entity categories in the CONLL-02 (Tjong Kim Sang, 2002) and CONLL-03 (Tjong Kim Sang and De Meulder, 2003) NER shared tasks consist of persons, locations, organizations, and names of miscellaneous entities, and the languages span Spanish, Dutch, English, and German.
The SIGHAN bakeoff-3 (Levow, 2006) and bakeoff-4 (Jin and Chen, 2008) tasks offer standard Chinese NER (CNER) corpora for training and testing, which contain the three commonly used entity types, i.e., personal names, location names, and organization names. The CNER task is generally more difficult than NER for Western languages due to the lack of word boundary information in Chinese text.
Traditional methods for entity recognition tend to employ external annotated corpora to enhance the machine learning stage and to improve testing scores with the enhanced models (Zhang et al., 2006; Mao et al., 2008; Yu et al., 2008). Conditional random field (CRF) models have shown advantages and good performance in CNER tasks compared with other machine learning algorithms such as ME, HMM, etc. (Zhou et al., 2006; Zhao and Kit, 2008). However, annotated corpora are generally very expensive and time-consuming to build.
On the other hand, there is a lot of freely available unlabeled data on the internet that can be used for research. For this reason, some researchers have begun to explore the use of unlabeled data, and semi-supervised learning methods based on labeled training data and unlabeled external data have shown their advantages (Blum and Chawla, 2001; Shin et al., 2006; Zha et al., 2008; Zhang et al., 2013).
2. Semi-supervised Learning

In the semi-supervised learning model, a sample {Z_i = (X_i, Y_i)}_{i=1}^{n_l} is usually observed with labels Y_i ∈ {−1, 1}, in addition to independent unlabeled samples {X_j}_{j=n_l+1}^{n}, where n = n_l + n_u. Each X_k = (X_{k1}, X_{k2}, ..., X_{kp}), k ∈ (1, n), is a p-dimensional input (Wang and Shen, 2007). The labeled samples are independently and identically distributed according to an unknown joint distribution P(x, y), and the unlabeled samples are independently and identically distributed according to the marginal distribution P(x). Many semi-supervised learning models are designed through assumptions relating P(x) to the conditional distribution; these cover the EM method, Bayesian networks, etc. (Zhu, 2008).
Graph-based semi-supervised learning (GBSSL) methods have been successfully employed by many researchers. For instance, Goldberg and Zhu (2006) design a GBSSL model for sentiment categorization; Celikyilmaz et al. (2009) propose a GBSSL model for question answering; Talukdar and Pereira (2010) use GBSSL methods for class-instance acquisition; Subramanya et al. (2010) utilize a GBSSL model for structured tagging models; and Zeng et al. (2013) use the GBSSL method for joint Chinese word segmentation and part-of-speech (POS) tagging, achieving higher performance than previous works. However, as far as we know, the GBSSL method has not been applied to the CNER task. To test the effectiveness of the GBSSL model on the traditional CNER task, this paper utilizes unlabeled data to enhance CRF learning through the GBSSL method.
3. Designed Models

To briefly introduce the GBSSL method, let D_l = {(x_j, r_j)}_{j=1}^{l} denote the l annotated items, where r_j is the empirical label distribution of x_j, and let the unlabeled items be denoted as D_u = {x_i}_{i=l+1}^{m}. The entire dataset can then be represented as D = D_u ∪ D_l. Let G = (V, E) be an undirected graph with V as the vertices and E as the edges, and let V_l and V_u represent the labeled and unlabeled vertices, respectively. One important step is to select a proper similarity measure to calculate the similarity between a pair of vertices (Das and Smith, 2012). According to the smoothness assumption, if two instances are similar according to the graph, then their output labels should also be similar (Zhu, 2005).
There are mainly three stages in the designed models: graph construction, label propagation, and CRF learning. Graph construction is performed on both labeled and unlabeled data, and the unlabeled data is automatically tagged in the label propagation stage. Then, the tagged external data is added to the manually annotated training corpus to enhance the CRF learning model.
3.1 Graph Construction & Label Propagation

We follow the research of Subramanya et al. (2010) and represent the vertices using character trigrams in labeled and unlabeled sentences for graph construction.
A symmetric k-NN graph is used, with the edge weights calculated by a symmetric similarity function designed by Zeng et al. (2013).
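The similarity function of Zeng et al. (2013) aggregates co-occurrence statistics over the Table 1 features; purely as an illustration of the symmetric k-NN construction (not the paper's actual measure, and all names hypothetical), the sketch below links each trigram vertex to its k most similar vertices under cosine similarity and keeps an edge if either endpoint selects it.

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse feature-count dicts."""
    dot = sum(u[f] * v[f] for f in u if f in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * \
           math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def knn_graph(vertices, k):
    """Symmetric k-NN graph: keep edge (a, b) if b is among a's k
    nearest neighbours OR a is among b's.  `vertices` maps a trigram
    to its feature-count dict; returns {(u, v): weight}."""
    names = list(vertices)
    edges = {}
    for a in names:
        sims = sorted(((cosine(vertices[a], vertices[b]), b)
                       for b in names if b != a), reverse=True)
        for w, b in sims[:k]:
            edges[tuple(sorted((a, b)))] = w   # symmetrize
    return edges
```

Because an edge survives if either endpoint ranks the other in its top k, the resulting graph is symmetric even when similarity rankings are not mutual.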
The feature set we employ to measure the similarity of two vertices based on co-occurrence statistics is the one optimized by Han et al. (2013) for CNER tasks, as shown in Table 1.
Feature                    Meaning
U_n, n ∈ (−4, 2)           Unigram, from the previous 4th to the following 2nd character
B_{n,n+1}, n ∈ (−2, 1)     Bigram, 4 pairs of features, from the previous 2nd to the following 2nd character

Table 1: Feature set for measuring vertex similarity in graph construction and for training the CRF model.
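Read literally, Table 1 yields seven unigram features (offsets −4 … +2) and four bigram features (start offsets −2 … +1) per character position; a minimal extractor under that reading (the padding symbol and feature naming are our own assumptions):

```python
def extract_features(chars, i, pad="#"):
    """Unigram/bigram feature template of Table 1 at position i.

    Unigrams U_n for n in -4..2 and bigrams B_{n,n+1} for n in -2..1,
    with out-of-range positions replaced by `pad`.
    """
    def at(j):
        return chars[j] if 0 <= j < len(chars) else pad

    feats = [f"U{n}={at(i + n)}" for n in range(-4, 3)]
    feats += [f"B{n}={at(i + n)}{at(i + n + 1)}" for n in range(-2, 2)]
    return feats
```

For example, `extract_features(list("我爱北京"), 2)` produces 11 features, including `U0=北` and `B0=北京`.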
After graph construction on both labeled and unlabeled data, we use the sparsity-inducing penalty label propagation algorithm (Das and Smith, 2012) to induce trigram-level label distributions from the constructed graph, based on the Junto toolkit (Talukdar and Pereira, 2010).
3.2 CRF Training

In the CRF model, assume a graph G = (V, E) comprising a set V of vertices (nodes) together with a set E of edges (lines), and let Y = {Y_v | v ∈ V}, so that Y is indexed by the vertices of G. The joint distribution over the label sequence Y given X takes the form:
P_θ(y | x) ∝ exp( Σ_{e∈E, k} λ_k f_k(e, y|_e, x) + Σ_{v∈V, k} μ_k g_k(v, y|_v, x) )
Here f_k and g_k are the feature functions, and μ_k and λ_k are the parameters trained from a specific dataset (Lafferty et al., 2001). The feature set employed in CRF learning is also the optimized one shown in Table 1. The training method used for the CRF model is a quasi-Newton algorithm2. The corpus automatically annotated by graph-based label propagation affects the trained parameters μ_k and λ_k.
4. Experiments

4.1 Data

We employ the SIGHAN bakeoff-3 (Levow, 2006) MSRA (Microsoft Research Asia) training and testing data as the standard setting. To test the effectiveness of the GBSSL method for the CRF model in CNER tasks, we use some plain (unannotated) text from SIGHAN bakeoff-2 (Emerson, 2005) and bakeoff-4 (Jin and Chen, 2008) as external unlabeled data. The dataset is summarized in Table 2 in terms of sentence counts.

             Bakeoff-3 corpus       External
             Training   Testing     Unlabeled
Sentences    50,425     4,365       31,640

Table 2: Corpus information.
4.2 Result Analysis

We set two baseline scores for the evaluation. One baseline is the simple left-to-right maximum matching model (MaxMatch) based on the training data; the other baseline is the closed CRF model (Closed-CRF), which does not use unlabeled data. The GBSSL model applied to semi-supervised CRF learning is denoted GBSSL-CRF.
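For intuition, the MaxMatch baseline can be approximated as greedy left-to-right longest match against an entity lexicon harvested from the training annotations; the sketch below (a simplification of the actual baseline; names hypothetical) tags matched spans with their entity type and everything else as O.

```python
def max_match_tag(sentence, lexicon):
    """Greedy left-to-right maximum matching NER baseline.

    lexicon: {entity_string: type}, collected from training data.
    Returns (segment, tag) pairs; unmatched characters get tag 'O'.
    """
    longest = max(map(len, lexicon), default=0)
    out, i = [], 0
    while i < len(sentence):
        # try the longest candidate starting at i first
        for j in range(min(len(sentence), i + longest), i, -1):
            cand = sentence[i:j]
            if cand in lexicon:
                out.append((cand, lexicon[cand]))
                i = j
                break
        else:                       # no entity starts at position i
            out.append((sentence[i], "O"))
            i += 1
    return out
```

With a lexicon containing both "北京" (LOC) and "北京大学" (ORG), the longer entry wins on "我在北京大学", which is exactly the "maximum matching" behaviour.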
The training costs of the CRF learning stage are detailed in Table 3. The comparison shows that the number of extracted features grows from 8,729,098 to 11,336,486 (29.87%) due to the external dataset, and the corresponding iterations and training hours also grow by 12.86% and 77.04%, respectively.

2 http://www.nag.com/numeric/fl/nagdoc_fl23/html/E04/e04conts.html
Model        Features      Iterations   Time (h)
Closed-CRF   8,729,098     350          4.53
GBSSL-CRF    11,336,486    395          8.02

Table 3: Training costs for CRF learning.
The evaluation results are shown in Table 4 in terms of recall, precision, and the harmonic mean of recall and precision (F1-score). The evaluation shows that both the Closed-CRF and GBSSL-CRF models largely outperform the first baseline (MaxMatch). Compared with the Closed-CRF model, the GBSSL-CRF model yields a higher precision score and a lower recall score, resulting in a slight improvement in F1 score. Both GBSSL-CRF and Closed-CRF show higher precision than recall.
Model        Total-R   Total-P   Total-F
MaxMatch     48.8      59.0      53.4
Closed-CRF   77.95     90.27     83.66
GBSSL-CRF    77.84     90.62     83.74

Table 4: Evaluation results.
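The Total-F column is just the harmonic mean of the Total-R and Total-P columns; e.g., for Closed-CRF, 2 × 77.95 × 90.27 / (77.95 + 90.27) ≈ 83.66:

```python
def f1(recall, precision):
    """F1-score: harmonic mean of recall and precision."""
    return 2 * recall * precision / (recall + precision)

print(round(f1(77.95, 90.27), 2))  # 83.66  (Closed-CRF row of Table 4)
```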
To look inside the GBSSL performance on each kind of entity, we present the detailed evaluation results from the aspect o